Daily arXiv Papers - 2025-09-26

AI-enhanced summaries of 24 research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Interpreting Public Sentiment in Diplomacy Events: A Counterfactual Analysis Framework Using Large Language Models

Leyi Ouyang

Main category: cs.CL

TL;DR: A framework that modifies diplomatic event narratives to shift public sentiment from negative to neutral/positive using counterfactual generation with LLMs, achieving 70% success rate.

Motivation: Public sentiment is crucial in diplomacy but traditional sentiment analysis methods are slow and lack predictive capabilities. There's a need for data-driven tools to help diplomats frame events favorably.

Method: 1) Train language model on diplomatic event descriptions and public discussions dataset 2) Identify textual features for modification based on communication theories and expert input 3) Develop counterfactual generation algorithm using LLM to systematically modify event narratives while preserving core facts
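
The counterfactual search can be pictured as a small loop: propose a feature-targeted rewrite, re-score it, and keep it if sentiment improves. Below is a minimal sketch under assumed interfaces; `llm_rewrite`, `predict_sentiment`, and the feature list are illustrative stand-ins, not the paper's actual components.

```python
FEATURES = ["tone", "attribution", "framing_of_outcome"]  # illustrative only

def generate_counterfactuals(event_text, llm_rewrite, predict_sentiment):
    """Try one modification per textual feature; keep rewrites that shift sentiment up."""
    baseline = predict_sentiment(event_text)  # e.g., -1 (negative) to +1 (positive)
    accepted = []
    for feature in FEATURES:
        prompt = (
            f"Rewrite the following event description, changing only its {feature} "
            f"while preserving every factual claim:\n{event_text}"
        )
        variant = llm_rewrite(prompt)
        if predict_sentiment(variant) > baseline:  # narrative reframing helped
            accepted.append((feature, variant))
    return accepted
```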

Result: The framework successfully shifted public sentiment to more favorable states with 70% success rate.

Conclusion: This framework serves as a practical tool for diplomats and policymakers, offering data-driven insights for framing diplomatic initiatives to foster desirable public sentiment.

Abstract: Diplomatic events consistently prompt widespread public discussion and debate. Public sentiment plays a critical role in diplomacy, as a good sentiment provides vital support for policy implementation, helps resolve international issues, and shapes a nation’s international image. Traditional methods for gauging public sentiment, such as large-scale surveys or manual content analysis of media, are typically time-consuming, labor-intensive, and lack the capacity for forward-looking analysis. We propose a novel framework that identifies specific modifications for diplomatic event narratives to shift public sentiment from negative to neutral or positive. First, we train a language model to predict public reaction towards diplomatic events. To this end, we construct a dataset comprising descriptions of diplomatic events and their associated public discussions. Second, guided by communication theories and in collaboration with domain experts, we predetermined several textual features for modification, ensuring that any alterations changed the event’s narrative framing while preserving its core facts. We develop a counterfactual generation algorithm that employs a large language model to systematically produce modified versions of an original text. The results show that this framework successfully shifted public sentiment to a more favorable state with a 70% success rate. This framework can therefore serve as a practical tool for diplomats, policymakers, and communication specialists, offering data-driven insights on how to frame diplomatic initiatives or report on events to foster a more desirable public sentiment.

[2] Speaker Style-Aware Phoneme Anchoring for Improved Cross-Lingual Speech Emotion Recognition

Shreya G. Upadhyay, Carlos Busso, Chi-Chun Lee

Main category: cs.CL

TL;DR: A speaker-style aware phoneme anchoring framework for cross-lingual speech emotion recognition that aligns emotional expression at phonetic and speaker levels using graph-based clustering and dual-space anchoring.

Motivation: Cross-lingual SER is challenging due to differences in phonetic variability and speaker-specific expressive styles across languages. Current methods struggle to effectively capture emotions under diverse conditions that require alignment of emotional externalization across speakers and languages.

Method: Proposes a framework that builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits. Uses dual-space anchoring in speaker and phonetic spaces to enable better emotion transfer across languages.
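
A hedged sketch of the first step, assuming per-speaker style embeddings have already been extracted for a given emotion; the use of networkx and greedy modularity is our illustrative choice, not necessarily the paper's clustering algorithm.

```python
import networkx as nx
import numpy as np

def speaker_communities(embeddings: dict, threshold: float = 0.7):
    """Link speakers whose style embeddings are cosine-similar, then return
    communities that serve as emotion-specific speaker groups."""
    g = nx.Graph()
    g.add_nodes_from(embeddings)
    names = list(embeddings)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            va, vb = embeddings[a], embeddings[b]
            sim = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
            if sim > threshold:
                g.add_edge(a, b, weight=sim)
    return list(nx.community.greedy_modularity_communities(g, weight="weight"))
```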

Result: Evaluations on MSP-Podcast (English) and BIIC-Podcast (Taiwanese Mandarin) corpora demonstrate improved generalization over competitive baselines.

Conclusion: The framework provides valuable insights into commonalities in cross-lingual emotion representation and effectively addresses cross-lingual SER challenges through speaker-style aware phoneme anchoring.

Abstract: Cross-lingual speech emotion recognition (SER) remains a challenging task due to differences in phonetic variability and speaker-specific expressive styles across languages. Effectively capturing emotion under such diverse conditions requires a framework that can align the externalization of emotions across different speakers and languages. To address this problem, we propose a speaker-style aware phoneme anchoring framework that aligns emotional expression at the phonetic and speaker levels. Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits. Using these groups, we apply dual-space anchoring in speaker and phonetic spaces to enable better emotion transfer across languages. Evaluations on the MSP-Podcast (English) and BIIC-Podcast (Taiwanese Mandarin) corpora demonstrate improved generalization over competitive baselines and provide valuable insights into the commonalities in cross-lingual emotion representation.

[3] CFD-LLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics

Nithin Somasekharan, Ling Yue, Yadi Cao, Weichao Li, Patrick Emami, Pochinapeddi Sai Bhargav, Anurag Acharya, Xingyu Xie, Shaowu Pan

Main category: cs.CL

TL;DR: CFDLLMBench is a benchmark suite for evaluating LLMs’ capabilities in automating numerical experiments for Computational Fluid Dynamics (CFD), covering knowledge, reasoning, and implementation aspects.

Motivation: LLMs show strong performance in general NLP tasks but their utility in automating complex physical system experiments like CFD remains underexplored despite being a critical labor-intensive component of computational science.

Method: The authors introduce CFDLLMBench with three components (CFDQuery, CFDCodeBench, FoamBench) to evaluate LLM performance across graduate-level CFD knowledge, numerical/physical reasoning, and context-dependent workflow implementation using a rigorous evaluation framework.

Result: The benchmark provides reproducible results quantifying LLM performance across code executability, solution accuracy, and numerical convergence behavior, grounded in real-world CFD practices.

Conclusion: CFDLLMBench establishes a foundation for developing and evaluating LLM-driven automation of numerical experiments for complex physical systems.

Abstract: Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments of complex physical systems – a critical and labor-intensive component – remains underexplored. As the major workhorse of computational science over the past decades, Computational Fluid Dynamics (CFD) offers a uniquely challenging testbed for evaluating the scientific capabilities of LLMs. We introduce CFDLLMBench, a benchmark suite comprising three complementary components – CFDQuery, CFDCodeBench, and FoamBench – designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning of CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practices, our benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results and quantify LLM performance across code executability, solution accuracy, and numerical convergence behavior. CFDLLMBench establishes a solid foundation for the development and evaluation of LLM-driven automation of numerical experiments for complex physical systems. Code and data are available at https://github.com/NREL-Theseus/cfdllmbench/.

[4] Assessing Classical Machine Learning and Transformer-based Approaches for Detecting AI-Generated Research Text

Sharanya Parimanoharan, Ruwan D. Nawarathna

Main category: cs.CL

TL;DR: This study evaluates various machine learning approaches for detecting ChatGPT-3.5-generated research abstracts compared to human-written texts, finding that DistilBERT performs best and ensemble methods don’t outperform single transformer models.

Motivation: The rapid adoption of LLMs like ChatGPT has blurred lines between human and AI-generated content, raising concerns about academic integrity, intellectual property, and misinformation, necessitating reliable AI-text detection methods.

Method: Tested classical (Logistic Regression with Bag-of-Words, POS, TF-IDF) and transformer-based (BERT with N-grams, DistilBERT, BERT with custom classifier, LSTM-N-gram) ML techniques on 250 pairs of research abstracts, including ensemble approaches.
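
For reference, the strongest classical baseline in the study (TF-IDF features with Logistic Regression) fits in a few lines of scikit-learn; the hyperparameters below are illustrative defaults, not the paper's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# abstracts: list of strings; labels: 1 = AI-generated, 0 = human-written
detector = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=20000),
    LogisticRegression(max_iter=1000),
)
# detector.fit(train_abstracts, train_labels)
# predictions = detector.predict(test_abstracts)
```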

Result: DistilBERT achieved the best overall performance, while Logistic Regression and BERT-Custom offered solid alternatives. LSTM- and BERT-N-gram approaches performed poorly. Ensemble methods failed to surpass DistilBERT’s performance.

Conclusion: Single transformer-based representations (like DistilBERT) outperform model diversity through ensembles, providing a foundation for more robust detection frameworks to keep pace with improving generative AI models.

Abstract: The rapid adoption of large language models (LLMs) such as ChatGPT has blurred the line between human and AI-generated texts, raising urgent questions about academic integrity, intellectual property, and the spread of misinformation. Thus, reliable AI-text detection is needed for fair assessment to safeguard human authenticity and cultivate trust in digital communication. In this study, we investigate how well current machine learning (ML) approaches can distinguish ChatGPT-3.5-generated texts from human-written texts employing a labeled data set of 250 pairs of abstracts from a wide range of research topics. We test and compare both classical (Logistic Regression armed with classical Bag-of-Words, POS, and TF-IDF features) and transformer-based (BERT augmented with N-grams, DistilBERT, BERT with a lightweight custom classifier, and LSTM-based N-gram models) ML detection techniques. As we aim to assess each model’s performance in detecting AI-generated research texts, we also aim to test whether an ensemble of these models can outperform any single detector. Results show DistilBERT achieves the overall best performance, while Logistic Regression and BERT-Custom offer solid, balanced alternatives; LSTM- and BERT-N-gram approaches lag. The max voting ensemble of the three best models fails to surpass DistilBERT itself, highlighting the primacy of a single transformer-based representation over mere model diversity. By comprehensively assessing the strengths and weaknesses of these AI-text detection approaches, this work lays a foundation for more robust transformer frameworks with larger, richer datasets to keep pace with ever-improving generative AI models.

[5] ConceptViz: A Visual Analytics Approach for Exploring Concepts in Large Language Models

Haoxuan Li, Zhen Wen, Qiqi Jiang, Chenxiao Li, Yuwei Wu, Yuchen Yang, Yiyao Wang, Xiuqi Huang, Minfeng Zhu, Wei Chen

Main category: cs.CL

TL;DR: ConceptViz is a visual analytics system that bridges the gap between Sparse Autoencoder features and human-understandable concepts in LLMs through an identification-interpretation-validation pipeline.

Motivation: SAE features don't inherently align with human concepts, making interpretation labor-intensive. There's a need to make LLM knowledge representations more interpretable.

Method: ConceptViz implements a pipeline for querying SAEs using concepts, exploring concept-to-feature alignments interactively, and validating correspondences through model behavior verification.
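
The identification step can be approximated as a nearest-neighbor search over SAE feature directions; this sketch assumes a concept embedding and the SAE decoder matrix as inputs, and the function name is ours, not the system's API.

```python
import numpy as np

def top_features(concept_embed: np.ndarray, decoder: np.ndarray, k: int = 10):
    """decoder: (n_features, d) rows of SAE feature directions.
    Returns the k features most cosine-similar to the concept."""
    dirs = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    query = concept_embed / np.linalg.norm(concept_embed)
    scores = dirs @ query               # cosine similarity per SAE feature
    best = np.argsort(-scores)[:k]
    return best, scores[best]
```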

Result: The system enhances interpretability research by streamlining concept discovery and validation, helping researchers build better mental models of LLM features.

Conclusion: ConceptViz effectively bridges SAE features with human concepts, making LLM interpretability research more accessible and efficient.

Abstract: Large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks. Understanding how LLMs internally represent knowledge remains a significant challenge. Although Sparse Autoencoders (SAEs) have emerged as a promising technique for extracting interpretable features from LLMs, SAE features do not inherently align with human-understandable concepts, making their interpretation cumbersome and labor-intensive. To bridge the gap between SAE features and human concepts, we present ConceptViz, a visual analytics system designed for exploring concepts in LLMs. ConceptViz implements a novel Identification => Interpretation => Validation pipeline, enabling users to query SAEs using concepts of interest, interactively explore concept-to-feature alignments, and validate the correspondences through model behavior verification. We demonstrate the effectiveness of ConceptViz through two usage scenarios and a user study. Our results show that ConceptViz enhances interpretability research by streamlining the discovery and validation of meaningful concept representations in LLMs, ultimately aiding researchers in building more accurate mental models of LLM features. Our code and user guide are publicly available at https://github.com/Happy-Hippo209/ConceptViz.

[6] SKILL-RAG: Self-Knowledge Induced Learning and Filtering for Retrieval-Augmented Generation

Tomoaki Isoda

Main category: cs.CL

TL;DR: SKILL-RAG is a novel method that uses reinforcement learning to leverage LLMs’ self-knowledge for filtering irrelevant retrieved documents in RAG systems, improving answer quality while reducing input documents.

Motivation: Retrieval systems in RAG may return irrelevant content leading to hallucinations, so identifying and filtering unhelpful retrieved content is crucial for improving RAG performance by better integrating internal model knowledge with external knowledge.

Method: Proposes SKILL-RAG that uses reinforcement learning-based training to elicit self-knowledge from the model and employs sentence-level granularity filtering to remove irrelevant content while preserving useful knowledge.

Result: Experimental results on question answering benchmarks using Llama2-7B and Qwen3-8B show SKILL-RAG improves generation quality and significantly reduces the number of input documents.

Conclusion: The method validates the importance of self-knowledge in guiding the selection of high-quality retrievals for RAG systems.

Abstract: Retrieval-Augmented Generation (RAG) has significantly improved the performance of large language models (LLMs) on knowledge-intensive tasks in recent years. However, since retrieval systems may return irrelevant content, incorporating such information into the model often leads to hallucinations. Thus, identifying and filtering out unhelpful retrieved content is a key challenge for improving RAG performance. To better integrate the internal knowledge of the model with external knowledge from retrieval, it is essential to understand what the model “knows” and “does not know” (which is also called “self-knowledge”). Based on this insight, we propose SKILL-RAG (Self-Knowledge Induced Learning and Filtering for RAG), a novel method that leverages the model’s self-knowledge to determine which retrieved documents are beneficial for answering a given query. We design a reinforcement learning-based training framework to explicitly elicit self-knowledge from the model and employ sentence-level granularity to filter out irrelevant content while preserving useful knowledge. We evaluate SKILL-RAG using Llama2-7B and Qwen3-8B on several question answering benchmarks. Experimental results demonstrate that SKILL-RAG not only improves generation quality but also significantly reduces the number of input documents, validating the importance of self-knowledge in guiding the selection of high-quality retrievals.

[7] Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation

Sirui Wang, Andong Chen, Tiejun Zhao

Main category: cs.CL

TL;DR: Emo-FiLM is a fine-grained emotion modeling framework for text-to-speech that enables word-level emotion control using FiLM layers, addressing limitations of existing sentence-level approaches.

Motivation: Existing E-TTS systems use sentence-level control through predefined labels or reference audio, which fail to capture dynamic emotional shifts within sentences. This limitation hinders natural and trustworthy human-computer interaction.

Method: The framework aligns frame-level features from emotion2vec to words for word-level emotion annotations, then maps them through a Feature-wise Linear Modulation (FiLM) layer to modulate text embeddings directly for word-level emotion control. A Fine-grained Emotion Dynamics Dataset (FEDD) was constructed for evaluation.
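
FiLM itself is a simple feature-wise affine transform; here is a minimal PyTorch version of the modulation step with illustrative dimensions (the word-aligned emotion2vec features would supply `emo_emb`).

```python
import torch
import torch.nn as nn

class EmotionFiLM(nn.Module):
    """Word-level emotion vectors produce a scale (gamma) and shift (beta)
    that modulate the corresponding text embeddings."""
    def __init__(self, emo_dim: int = 768, text_dim: int = 1024):
        super().__init__()
        self.to_gamma = nn.Linear(emo_dim, text_dim)
        self.to_beta = nn.Linear(emo_dim, text_dim)

    def forward(self, text_emb: torch.Tensor, emo_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, words, text_dim); emo_emb: (batch, words, emo_dim)
        gamma = self.to_gamma(emo_emb)
        beta = self.to_beta(emo_emb)
        return (1 + gamma) * text_emb + beta  # feature-wise affine modulation
```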

Result: Experiments show that Emo-FiLM outperforms existing approaches on both global and fine-grained emotion tasks, demonstrating effectiveness and generality for expressive speech synthesis.

Conclusion: Emo-FiLM successfully addresses the limitation of dynamic emotional shifts in E-TTS systems by enabling fine-grained word-level emotion control, providing more natural and expressive speech synthesis.

Abstract: Emotional text-to-speech (E-TTS) is central to creating natural and trustworthy human-computer interaction. Existing systems typically rely on sentence-level control through predefined labels, reference audio, or natural language prompts. While effective for global emotion expression, these approaches fail to capture dynamic shifts within a sentence. To address this limitation, we introduce Emo-FiLM, a fine-grained emotion modeling framework for LLM-based TTS. Emo-FiLM aligns frame-level features from emotion2vec to words to obtain word-level emotion annotations, and maps them through a Feature-wise Linear Modulation (FiLM) layer, enabling word-level emotion control by directly modulating text embeddings. To support evaluation, we construct the Fine-grained Emotion Dynamics Dataset (FEDD) with detailed annotations of emotional transitions. Experiments show that Emo-FiLM outperforms existing approaches on both global and fine-grained tasks, demonstrating its effectiveness and generality for expressive speech synthesis.

[8] USB-Rec: An Effective Framework for Improving Conversational Recommendation Capability of Large Language Model

Jianyu Wen, Jingyun Wang, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Ying Zhang

Main category: cs.CL

TL;DR: USB-Rec is an integrated training-inference framework that improves LLMs for conversational recommendation through preference optimization dataset construction and self-enhancement strategies.

Motivation: Existing LLM-based conversational recommender systems focus on leveraging LLMs' summarization and analysis capabilities but ignore training aspects, limiting their performance potential.

Method: Proposes USB-Rec framework with: 1) LLM-based Preference Optimization dataset construction for RL training, and 2) Self-Enhancement Strategy during inference to enhance conversational recommendation capabilities.

Result: Extensive experiments on various datasets show the method consistently outperforms previous state-of-the-art methods.

Conclusion: The integrated training-inference approach effectively improves LLM performance in conversational recommendation by addressing both training and inference optimization.

Abstract: Recently, Large Language Models (LLMs) have been widely employed in Conversational Recommender Systems (CRSs). Unlike traditional language model approaches that focus on training, existing LLM-based approaches are mainly centered around how to leverage the summarization and analysis capabilities of LLMs while ignoring the issue of training. Therefore, in this work, we propose an integrated training-inference framework, User-Simulator-Based framework (USB-Rec), for improving the performance of LLMs in conversational recommendation at the model level. Firstly, we design an LLM-based Preference Optimization (PO) dataset construction strategy for RL training, which helps the LLMs understand the strategies and methods in conversational recommendation. Secondly, we propose a Self-Enhancement Strategy (SES) at the inference stage to further exploit the conversational recommendation potential obtained from RL training. Extensive experiments on various datasets demonstrate that our method consistently outperforms previous state-of-the-art methods.

[9] Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents

Haochen Sun, Shuwen Zhang, Lujie Niu, Lei Ren, Hao Xu, Hao Fu, Fangkun Zhao, Caixia Yuan, Xiaojie Wang

Main category: cs.CL

TL;DR: This paper introduces Collab-Overcooked, a new LLM-based Multi-Agent System benchmark built on Overcooked-AI with enhanced collaborative tasks and process-oriented evaluation metrics.

Motivation: To address the lack of comprehensive benchmarks for assessing fine-grained collaboration capabilities in LLM-based multi-agent systems, particularly in interactive environments requiring natural language communication.

Method: Extends Overcooked-AI game into a multi-agent framework supporting diverse collaborative tasks, introduces process-oriented evaluation metrics, and tests 13 popular LLMs on 30 open-ended tasks.

Result: LLMs show strong goal interpretation abilities but significant shortcomings in active collaboration and continuous adaptation needed for complex task fulfillment.

Conclusion: The benchmark highlights LLM-MAS strengths/weaknesses and provides insights for improvement, with all environments, tasks, and evaluation tools made publicly available.

Abstract: Large Language Models (LLMs) based agent systems have made great strides in real-world applications beyond traditional NLP tasks. This paper proposes a new LLM-based Multi-Agent System (LLM-MAS) benchmark, Collab-Overcooked, built on the popular Overcooked-AI game with more applicable and challenging tasks in interactive environments. Collab-Overcooked extends existing benchmarks in two novel ways. First, it provides a multi-agent framework supporting diverse tasks and objectives and encourages collaboration through natural language communication. Second, it introduces a spectrum of process-oriented evaluation metrics to assess the fine-grained collaboration capabilities of different LLM agents, a dimension often overlooked in prior work. We conduct extensive experiments with 13 popular LLMs and show that, while the LLMs exhibit a strong ability in goal interpretation, there are significant shortcomings in active collaboration and continuous adaptation, which are critical for efficiently fulfilling complex tasks. Notably, we highlight the strengths and weaknesses of LLM-MAS and provide insights for improving and evaluating LLM-MAS on a unified and open-source benchmark. The environments, 30 open-ended tasks, and the evaluation package are publicly available at https://github.com/YusaeMeow/Collab-Overcooked.

[10] Document Summarization with Conformal Importance Guarantees

Bruce Kuwahara, Chen-Yuan Lin, Xiao Shi Huang, Kin Kwan Leung, Jullian Arta Yapeter, Ilya Stanevich, Felipe Perez, Jesse C. Cresswell

Main category: cs.CL

TL;DR: Conformal Importance Summarization is a framework that provides rigorous coverage guarantees for automatic summarization systems using conformal prediction, ensuring critical content is preserved in high-stakes domains.

Motivation: Current LLM-based summarization systems lack reliable guarantees for including critical content in high-stakes domains like healthcare, law, and finance, creating deployment risks.

Method: Uses conformal prediction to calibrate thresholds on sentence-level importance scores, enabling extractive document summarization with user-specified coverage and recall rates over critical content. The method is model-agnostic and requires only a small calibration set.
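
The calibration step reduces to computing a finite-sample quantile of per-document nonconformity scores. A minimal sketch, assuming each calibration document comes with sentence importance scores and a mask of truly critical sentences (the names and exact quantile form are ours):

```python
import numpy as np

def calibrate_threshold(cal_docs, alpha: float = 0.1):
    """cal_docs: (scores, critical_mask) numpy-array pairs, one per document.
    Returns tau such that, under exchangeability, a new document's critical
    sentences all score >= tau with probability at least 1 - alpha."""
    # Nonconformity per document: the lowest score among its critical sentences.
    nonconformity = np.sort([s[m].min() for s, m in cal_docs])
    k = max(int(np.floor(alpha * (len(nonconformity) + 1))), 1)
    return nonconformity[k - 1]  # finite-sample-corrected lower quantile

# At test time, the extractive summary keeps every sentence scoring >= tau.
```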

Result: Experiments on established summarization benchmarks demonstrate that Conformal Importance Summarization achieves the theoretically assured information coverage rate.

Conclusion: The framework can be combined with existing techniques to achieve reliable, controllable automatic summarization, enabling safer deployment of AI summarization tools in critical applications.

Abstract: Automatic summarization systems have advanced rapidly with large language models (LLMs), yet they still lack reliable guarantees on inclusion of critical content in high-stakes domains like healthcare, law, and finance. In this work, we introduce Conformal Importance Summarization, the first framework for importance-preserving summary generation which uses conformal prediction to provide rigorous, distribution-free coverage guarantees. By calibrating thresholds on sentence-level importance scores, we enable extractive document summarization with user-specified coverage and recall rates over critical content. Our method is model-agnostic, requires only a small calibration set, and seamlessly integrates with existing black-box LLMs. Experiments on established summarization benchmarks demonstrate that Conformal Importance Summarization achieves the theoretically assured information coverage rate. Our work suggests that Conformal Importance Summarization can be combined with existing techniques to achieve reliable, controllable automatic summarization, paving the way for safer deployment of AI summarization tools in critical applications. Code is available at https://github.com/layer6ai-labs/conformal-importance-summarization.

[11] ShortCheck: Checkworthiness Detection of Multilingual Short-Form Videos

Henrik Vatndal, Vinay Setty

Main category: cs.CL

TL;DR: ShortCheck is a modular pipeline for automatically identifying checkworthy short-form videos on platforms like TikTok to assist human fact-checkers.

Motivation: Short-form video platforms present unique challenges for misinformation detection due to multimodal, dynamic, and noisy content that requires specialized tools.

Method: A modular, inference-only pipeline integrating speech transcription, OCR, object and deepfake detection, video-to-text summarization, and claim verification with a user-friendly interface.

Result: The system achieves promising results with F1-weighted score over 70% when validated on two manually annotated datasets with TikTok videos in multilingual settings.

Conclusion: ShortCheck provides an effective automated solution for identifying checkworthy content in short-form videos, helping human fact-checkers combat misinformation on platforms like TikTok.

Abstract: Short-form video platforms like TikTok present unique challenges for misinformation detection due to their multimodal, dynamic, and noisy content. We present ShortCheck, a modular, inference-only pipeline with a user-friendly interface that automatically identifies checkworthy short-form videos to help human fact-checkers. The system integrates speech transcription, OCR, object and deepfake detection, video-to-text summarization, and claim verification. ShortCheck is validated by evaluating it on two manually annotated datasets with TikTok videos in a multilingual setting. The pipeline achieves promising results with an F1-weighted score above 70%.

[12] Building Tailored Speech Recognizers for Japanese Speaking Assessment

Yotaro Kubo, Richard Sproat, Chihiro Taguchi, Llion Jones

Main category: cs.CL

TL;DR: This paper presents methods for building Japanese speech recognizers that output phonemic labels with accent markers, addressing data sparsity through multitask learning and estimator fusion.

Motivation: Japanese has limited data for training models to produce accurate phonemic transcriptions with accent marks, despite being resource-rich overall.

Method: Two approaches: 1) Multitask training with auxiliary loss functions for orthographic text and pitch patterns; 2) Fusion of phonetic alphabet and text token sequence estimators using finite-state transducer framework.
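
A rough shape of the first approach's objective, assuming a model that emits phoneme, text, and pitch heads; the field names are hypothetical, and cross-entropy stands in for whatever sequence loss the recognizer actually uses.

```python
import torch.nn.functional as F

def multitask_loss(out, batch, w_ortho: float = 0.3, w_pitch: float = 0.3):
    """Always train the phonemic head; add auxiliary losses only for
    utterances that carry orthographic or pitch annotations."""
    loss = F.cross_entropy(out["phoneme_logits"].transpose(1, 2), batch["phonemes"])
    if "orthography" in batch:  # text-only annotated utterances become usable
        loss = loss + w_ortho * F.cross_entropy(
            out["text_logits"].transpose(1, 2), batch["orthography"])
    if "pitch" in batch:
        loss = loss + w_pitch * F.mse_loss(out["pitch"], batch["pitch"])
    return loss
```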

Result: Proposed methods reduced average mora-label error rates from 12.3% to 7.1% on CSJ core evaluation sets, outperforming generic multilingual recognizers.

Conclusion: Multitask learning and fusion are effective for building accurate Japanese phonemic recognizers, with significant error rate reductions compared to existing approaches.

Abstract: This paper presents methods for building speech recognizers tailored for Japanese speaking assessment tasks. Specifically, we build a speech recognizer that outputs phonemic labels with accent markers. Although Japanese is resource-rich, there is only a small amount of data for training models to produce accurate phonemic transcriptions that include accent marks. We propose two methods to mitigate data sparsity. First, a multitask training scheme introduces auxiliary loss functions to estimate orthographic text labels and pitch patterns of the input signal, so that utterances with only orthographic annotations can be leveraged in training. The second fuses two estimators, one over phonetic alphabet strings, and the other over text token sequences. To combine these estimates we develop an algorithm based on the finite-state transducer framework. Our results indicate that the use of multitask learning and fusion is effective for building an accurate phonemic recognizer. We show that this approach is advantageous compared to the use of generic multilingual recognizers. The relative advantages of the proposed methods were also compared. Our proposed methods reduced the average mora-label error rate from 12.3% to 7.1% over the CSJ core evaluation sets.

[13] MARS: toward more efficient multi-agent collaboration for LLM reasoning

Xiao Wang, Jia Wang, Yijie Wang, Pengtao Dang, Sha Cao, Chi Zhang

Main category: cs.CL

TL;DR: MARS is a role-based multi-agent collaboration framework that improves reasoning efficiency by reducing computational overhead compared to Multi-Agent Debate (MAD), achieving similar accuracy with 50% less token usage and inference time.

Motivation: Large language models have limited reasoning capabilities as single agents, and while Multi-Agent Debate improves reasoning through collaboration, it introduces substantial computational overhead due to frequent agent interactions.

Method: MARS uses a review process structure with three roles: author agent generates initial solution, reviewer agents provide independent feedback, and meta-reviewer integrates feedback for final decision and revision guidance, avoiding costly reviewer-to-reviewer interactions.
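
The control flow is easy to sketch with a generic `ask(role, prompt)` LLM helper (our stand-in; the prompts are paraphrased, not the paper's):

```python
def mars_solve(problem, ask, n_reviewers: int = 3, max_rounds: int = 2):
    solution = ask("author", f"Solve step by step:\n{problem}")
    for _ in range(max_rounds):
        reviews = [  # reviewers work independently, never seeing each other
            ask(f"reviewer_{i}", f"Review this solution to '{problem}':\n{solution}")
            for i in range(n_reviewers)
        ]
        verdict = ask("meta_reviewer",
                      "Given these independent reviews, reply ACCEPT or give "
                      "revision guidance:\n" + "\n---\n".join(reviews))
        if verdict.startswith("ACCEPT"):
            break
        solution = ask("author", f"Revise per this guidance:\n{verdict}\n\n{solution}")
    return solution
```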

Result: Extensive experiments show MARS matches MAD’s accuracy while reducing token usage and inference time by approximately 50% across multiple benchmarks with different LLMs.

Conclusion: MARS provides an efficient alternative to MAD for enhancing LLM reasoning capabilities, offering comparable performance with significantly reduced computational costs through its role-based collaboration framework.

Abstract: Large language models (LLMs) have achieved impressive results in natural language understanding, yet their reasoning capabilities remain limited when operating as single agents. Multi-Agent Debate (MAD) has been proposed to address this limitation by enabling collaborative reasoning among multiple models in a round-table debate manner. While effective, MAD introduces substantial computational overhead due to the number of agents involved and the frequent communication required. In this paper, we propose MARS (Multi-Agent Review System), a role-based collaboration framework inspired by the review process. In MARS, an author agent generates an initial solution, reviewer agents provide decisions and comments independently, and a meta-reviewer integrates the feedback to make the final decision and guide further revision. This design enhances reasoning quality while avoiding costly reviewer-to-reviewer interactions, thereby controlling token consumption and inference time. We compared MARS with both MAD and other state-of-the-art reasoning strategies across multiple benchmarks. Extensive experiments with different LLMs show that MARS matches the accuracy of MAD while reducing both token usage and inference time by approximately 50%. Code is available at https://github.com/xwang97/MARS.

[14] MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large-Audio Language Model

Hsiao-Ying Huang, Yi-Cheng Lin, Hung-yi Lee

Main category: cs.CL

TL;DR: MI-Fuse is a denoised label fusion framework that adapts student models to outperform large audio-language models (LALMs) in cross-domain speech emotion recognition (SER) using only unlabeled target-domain data and API-only LALM access.

Motivation: Real-world SER deployments often fail due to domain mismatch, where source data is unavailable and powerful LALMs are only accessible through APIs. The goal is to adapt student models to outperform LALMs in target domains without sharing source data.

Method: MI-Fuse supplements the LALM with a source-domain trained SER classifier as an auxiliary teacher. It draws multiple stochastic predictions from both teachers, weights their mean distributions by mutual-information-based uncertainty, and stabilizes training with an exponential moving average teacher.
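
The uncertainty weighting can be sketched in numpy using the standard mutual-information decomposition (entropy of the mean prediction minus mean of the per-draw entropies); the inverse-MI weighting below is our illustrative choice and may differ from the paper's exact rule.

```python
import numpy as np

def entropy(p, axis=-1):
    return -(p * np.log(p + 1e-12)).sum(axis=axis)

def mutual_information(samples):
    """samples: (n_draws, n_classes) stochastic predictions from one teacher."""
    return entropy(samples.mean(axis=0)) - entropy(samples).mean()

def fuse_labels(lalm_samples, ser_samples):
    w_lalm = 1.0 / (mutual_information(lalm_samples) + 1e-6)
    w_ser = 1.0 / (mutual_information(ser_samples) + 1e-6)
    fused = w_lalm * lalm_samples.mean(0) + w_ser * ser_samples.mean(0)
    return fused / fused.sum()  # denoised pseudo-label for the student
```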

Result: Experiments across three public emotion datasets and six cross-domain transfers show consistent gains, with the student surpassing the LALM and outperforming the strongest baseline by 3.9%.

Conclusion: This approach enables realistic adaptation of emotion-aware speech systems without sharing source data, strengthening SER performance under domain mismatch conditions.

Abstract: Large audio-language models (LALMs) show strong zero-shot ability on speech tasks, suggesting promise for speech emotion recognition (SER). However, SER in real-world deployments often fails under domain mismatch, where source data are unavailable and powerful LALMs are accessible only through an API. We ask: given only unlabeled target-domain audio and an API-only LALM, can a student model be adapted to outperform the LALM in the target domain? To this end, we propose MI-Fuse, a denoised label fusion framework that supplements the LALM with a source-domain trained SER classifier as an auxiliary teacher. The framework draws multiple stochastic predictions from both teachers, weights their mean distributions by mutual-information-based uncertainty, and stabilizes training with an exponential moving average teacher. Experiments across three public emotion datasets and six cross-domain transfers show consistent gains, with the student surpassing the LALM and outperforming the strongest baseline by 3.9%. This approach strengthens emotion-aware speech systems without sharing source data, enabling realistic adaptation.

[15] SiniticMTError: A Machine Translation Dataset with Error Annotations for Sinitic Languages

Hannah Liu, Junghyun Min, Ethan Yue Heng Cheung, Shou-Yi Hung, Syed Mekael Wasti, Runtong Liang, Shiyao Qian, Shizhao Zheng, Elsie Chan, Ka Ieng Charlotte Lo, Wing Yu Yip, Richard Tzong-Han Tsai, En-Shiun Annie Lee

Main category: cs.CL

TL;DR: SiniticMTError is a novel dataset for machine translation error analysis, providing error span, type, and severity annotations for English to Mandarin, Cantonese, and Wu Chinese translations.

Motivation: Despite advances in machine translation, progress remains limited for low-resource languages like Cantonese and Wu Chinese, which lack large-scale training data and linguistic resources even though each has over 80 million speakers.

Method: The authors built on existing parallel corpora and implemented a rigorous annotation process by native speakers, including inter-annotator agreement analysis, iterative feedback, and detailed error pattern analysis.

Result: The dataset provides comprehensive error annotations (span, type, severity) for machine-translated examples across three Sinitic languages, serving as a resource for MT community research.

Conclusion: SiniticMTError enables fine-tuning models with error detection capabilities and supports research on translation quality estimation, error-aware generation, and low-resource language evaluation.

Abstract: Despite major advances in machine translation (MT) in recent years, progress remains limited for many low-resource languages that lack large-scale training data and linguistic resources. Cantonese and Wu Chinese are two Sinitic examples, although each enjoys more than 80 million speakers around the world. In this paper, we introduce SiniticMTError, a novel dataset that builds on existing parallel corpora to provide error span, error type, and error severity annotations in machine-translated examples from English to Mandarin, Cantonese, and Wu Chinese. Our dataset serves as a resource for the MT community to utilize in fine-tuning models with error detection capabilities, supporting research on translation quality estimation, error-aware generation, and low-resource language evaluation. We report our rigorous annotation process by native speakers, with analyses on inter-annotator agreement, iterative feedback, and patterns in error type and severity.

[16] SwasthLLM: a Unified Cross-Lingual, Multi-Task, and Meta-Learning Zero-Shot Framework for Medical Diagnosis Using Contrastive Representations

Ayan Sar, Pranav Singh Puri, Sumit Aich, Tanupriya Choudhury, Abhijit Kumar

Main category: cs.CL

TL;DR: SwasthLLM is a unified multilingual framework for medical diagnosis that achieves high accuracy in zero-shot scenarios across English, Hindi, and Bengali without language-specific fine-tuning.

Motivation: Address the challenge of automatic disease diagnosis in multilingual healthcare environments where annotated medical data is scarce in low-resource languages and linguistic variability exists across populations.

Method: Leverages XLM-RoBERTa encoder with language-aware attention and disease classification head, uses Siamese contrastive learning for cross-lingual semantic alignment, translation consistency module, contrastive projection head, multi-task learning strategy, and Model-Agnostic Meta-Learning (MAML) for rapid adaptation.

Result: Achieves 97.22% accuracy and 97.17% F1-score in supervised settings, and in zero-shot scenarios attains 92.78% accuracy on Hindi and 73.33% accuracy on Bengali medical text.

Conclusion: SwasthLLM demonstrates strong generalization capabilities in low-resource multilingual contexts, providing an effective solution for cross-lingual medical diagnosis without requiring language-specific fine-tuning.

Abstract: In multilingual healthcare environments, automatic disease diagnosis from clinical text remains a challenging task due to the scarcity of annotated medical data in low-resource languages and the linguistic variability across populations. This paper proposes SwasthLLM, a unified, zero-shot, cross-lingual, and multi-task learning framework for medical diagnosis that operates effectively across English, Hindi, and Bengali without requiring language-specific fine-tuning. At its core, SwasthLLM leverages the multilingual XLM-RoBERTa encoder augmented with a language-aware attention mechanism and a disease classification head, enabling the model to extract medically relevant information regardless of the language structure. To align semantic representations across languages, a Siamese contrastive learning module is introduced, ensuring that equivalent medical texts in different languages produce similar embeddings. Further, a translation consistency module and a contrastive projection head reinforce language-invariant representation learning. SwasthLLM is trained using a multi-task learning strategy, jointly optimizing disease classification, translation alignment, and contrastive learning objectives. Additionally, we employ Model-Agnostic Meta-Learning (MAML) to equip the model with rapid adaptation capabilities for unseen languages or tasks with minimal data. Our phased training pipeline emphasizes robust representation alignment before task-specific fine-tuning. Extensive evaluation shows that SwasthLLM achieves high diagnostic performance, with a test accuracy of 97.22% and an F1-score of 97.17% in supervised settings. Crucially, in zero-shot scenarios, it attains 92.78% accuracy on Hindi and 73.33% accuracy on Bengali medical text, demonstrating strong generalization in low-resource contexts.

[17] Dynamic Reasoning Chains through Depth-Specialized Mixture-of-Experts in Transformer Architectures

Sampurna Roy, Ayan Sar, Anurag Kaushish, Kanav Gupta, Tanupriya Choudhury, Abhijit Kumar

Main category: cs.CL

TL;DR: DS-MoE is a dynamic reasoning framework that uses depth-specialized experts and learned routing to create custom reasoning chains, achieving computational savings, faster inference, and improved accuracy compared to uniform-depth transformers.

Motivation: Current transformers apply identical processing depth to all inputs, creating inefficiencies where simple queries undergo the same computation as complex problems, wasting resources and limiting reasoning quality.

Method: Extends Mixture of Experts from width-based to depth-specialized computation with expert modules for different reasoning depths (shallow pattern recognition, compositional reasoning, logical inference, etc.). A learned routing network dynamically assembles custom reasoning chains based on input complexity.
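
One way to picture the routing is a gate per depth-specialized expert, applied in shallow-to-deep order; this schematic PyTorch module is our simplification of the idea, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DepthRouter(nn.Module):
    def __init__(self, d_model: int, experts: nn.ModuleList, gate: float = 0.5):
        super().__init__()
        self.score = nn.Linear(d_model, len(experts))
        self.experts, self.gate = experts, gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); route on a pooled summary of the input
        probs = torch.sigmoid(self.score(x.mean(dim=1)))   # (batch, n_experts)
        for i, expert in enumerate(self.experts):          # shallow -> deep
            keep = (probs[:, i] > self.gate).float().view(-1, 1, 1)
            x = keep * expert(x) + (1 - keep) * x          # skip unneeded depth
        return x
```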

Result: Achieved 16% computational savings, 35% faster inference, and 2.8% higher accuracy on complex multi-step reasoning benchmarks compared to uniform-depth transformers. Routing decisions provide interpretable reasoning chains.

Conclusion: DS-MoE represents a significant advancement in adaptive neural architectures, demonstrating that depth-specialized modular processing can simultaneously improve efficiency, reasoning quality, and interpretability in large-scale language models.

Abstract: Contemporary transformer architectures apply identical processing depth to all inputs, creating inefficiencies and limiting reasoning quality. Simple factual queries are subjected to the same multilayered computation as complex logical problems, wasting resources while constraining deep inference. To overcome this, we propose Dynamic Reasoning Chains through Depth-Specialised Mixture of Experts (DS-MoE), a modular framework that extends the Mixture of Experts paradigm from width-based to depth-specialised computation. DS-MoE introduces expert modules optimised for distinct reasoning depths: shallow pattern recognition, compositional reasoning, logical inference, memory integration, and meta-cognitive supervision. A learned routing network dynamically assembles custom reasoning chains, activating only the necessary experts to match input complexity. We trained and evaluated DS-MoE on The Pile, an 800GB corpus covering diverse domains such as scientific papers, legal texts, programming code, and web content, enabling systematic assessment across reasoning depths. Experimental results demonstrate that DS-MoE achieves up to 16 per cent computational savings and 35 per cent faster inference compared to uniform-depth transformers, while delivering 2.8 per cent higher accuracy on complex multi-step reasoning benchmarks. Furthermore, routing decisions yield interpretable reasoning chains, enhancing transparency and scalability. These findings establish DS-MoE as a significant advancement in adaptive neural architectures, demonstrating that depth-specialised modular processing can simultaneously improve efficiency, reasoning quality, and interpretability in large-scale language models.

[18] Hierarchical Resolution Transformers: A Wavelet-Inspired Architecture for Multi-Scale Language Understanding

Ayan Sar, Sampurna Roy, Kanav Gupta, Anurag Kaushish, Tanupriya Choudhury, Abhijit Kumar

Main category: cs.CL

TL;DR: HRT is a hierarchical transformer that processes language at multiple resolutions using wavelet-inspired architecture, achieving better performance and efficiency than standard transformers.

Motivation: Standard transformers misrepresent language hierarchy by treating text as flat token sequences, leading to quadratic complexity, weak compositional generalization, and poor discourse modeling.

Method: Hierarchical Resolution Transformer (HRT) uses multi-resolution attention with exponential sequence reduction across scales, enabling bottom-up composition and top-down contextualization with O(nlogn) complexity.
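
A toy version of the resolution pyramid: halve the sequence between levels so coarser scales see longer-range structure. For brevity this sketch uses full attention per level; the paper's scale-restricted attention is what yields the O(n log n) bound.

```python
import torch
import torch.nn as nn

class ResolutionPyramid(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, levels: int = 4):
        super().__init__()
        self.levels = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(levels)
        )
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor):
        outputs = []                     # one representation per resolution
        for attn in self.levels:
            x, _ = attn(x, x, x)         # attention at the current scale
            outputs.append(x)
            x = self.pool(x.transpose(1, 2)).transpose(1, 2)  # halve length
        return outputs                   # coarse levels approximate discourse units
```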

Result: HRT outperforms standard transformers by +3.8% on GLUE, +4.5% on SuperGLUE, +6.1% on Long Range Arena, while reducing memory by 42% and latency by 37% compared to BERT/GPT models.

Conclusion: HRT successfully aligns computational structure with language hierarchy, demonstrating that multi-scale processing yields both theoretical efficiency gains and practical improvements in language understanding.

Abstract: Transformer architectures have achieved state-of-the-art performance across natural language tasks, yet they fundamentally misrepresent the hierarchical nature of human language by processing text as flat token sequences. This results in quadratic computational cost, weak compositional generalization, and inadequate discourse-level modeling. We propose Hierarchical Resolution Transformer (HRT), a novel wavelet-inspired neural architecture that processes language simultaneously across multiple resolutions, from characters to discourse-level units. HRT constructs a multi-resolution attention, enabling bottom-up composition and top-down contextualization. By employing exponential sequence reduction across scales, HRT achieves O(n log n) complexity, offering significant efficiency improvements over standard transformers. We evaluated HRT on a diverse suite of benchmarks, including GLUE, SuperGLUE, Long Range Arena, and WikiText-103, and results demonstrated that HRT outperforms standard transformer baselines by an average of +3.8% on GLUE, +4.5% on SuperGLUE, and +6.1% on Long Range Arena, while reducing memory usage by 42% and inference latency by 37% compared to BERT and GPT style models of similar parameter count. Ablation studies confirm the effectiveness of cross-resolution attention and scale-specialized modules, showing that each contributes independently to both efficiency and accuracy. Our findings establish HRT as the first architecture to align computational structure with the hierarchical organization of human language, demonstrating that multi-scale, wavelet-inspired processing yields both theoretical efficiency gains and practical improvements in language understanding.

[19] FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models

Amin Karimi Monsefi, Nikhil Bhendawade, Manuel Rafael Ciosici, Dominic Culver, Yizhe Zhang, Irina Belousova

Main category: cs.CL

TL;DR: FS-DFM is a few-step discrete flow-matching model that achieves high-quality language generation with only 8 sampling steps, providing 128x faster sampling than traditional 1,024-step discrete-flow models while maintaining perplexity parity.

Motivation: Autoregressive language models generate tokens serially, which limits throughput and increases latency for long sequences. Diffusion language models parallelize across positions but typically require hundreds to thousands of model evaluations for high quality, trading serial depth for iterative breadth.

Method: FS-DFM makes the number of sampling steps an explicit parameter and trains the model to be consistent across step budgets, using a reliable update rule that moves probability without overshooting, combined with strong teacher guidance distilled from long-run trajectories.

Result: On language modeling benchmarks, FS-DFM with 8 sampling steps achieves perplexity parity with a 1,024-step discrete-flow baseline for generating 1,024 tokens using a similar-size model, delivering up to 128 times faster sampling and corresponding latency/throughput gains.

Conclusion: FS-DFM enables few-step sampling that is stable, accurate, and easy to control, providing significant speed improvements without sacrificing generation quality compared to traditional discrete diffusion approaches.

Abstract: Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. Diffusion Language Models (DLMs) parallelize across positions and thus appear promising for language generation, yet standard discrete diffusion typically needs hundreds to thousands of model evaluations to reach high quality, trading serial depth for iterative breadth. We introduce FS-DFM, Few-Step Discrete Flow-Matching, a discrete flow-matching model designed for speed without sacrificing quality. The core idea is simple: make the number of sampling steps an explicit parameter and train the model to be consistent across step budgets, so one big move lands where many small moves would. We pair this with a reliable update rule that moves probability in the right direction without overshooting, and with strong teacher guidance distilled from long-run trajectories. Together, these choices make few-step sampling stable, accurate, and easy to control. On language modeling benchmarks, FS-DFM with 8 sampling steps achieves perplexity parity with a 1,024-step discrete-flow baseline for generating 1,024 tokens using a similar-size model, delivering up to 128 times faster sampling and corresponding latency/throughput gains.

[20] Look Before you Leap: Estimating LLM Benchmark Scores from Descriptions

Jungsoo Park, Ethan Mendes, Gabriel Stanovsky, Alan Ritter

Main category: cs.CL

TL;DR: The paper introduces PRECOG, a method for forecasting language model performance from redacted task descriptions without running experiments, showing moderate prediction accuracy with well-calibrated uncertainty.

Motivation: To overcome the evaluation bottleneck in large language model development by enabling performance forecasting before conducting experiments, supporting smarter experiment prioritization.

Method: Curated PRECOG corpus of redacted description-performance pairs, using models with retrieval modules that exclude source papers to predict scores from task descriptions and configurations.

Result: Achieved mean absolute error as low as 8.7 on Accuracy subset at high-confidence thresholds; stronger reasoning models show diverse iterative querying while open-source models lag; GPT-5 with web search maintains prediction accuracy in zero-leakage settings.

Conclusion: The approach offers initial progress toward anticipatory evaluation, supporting difficulty estimation and more efficient experiment planning in language model development.

Abstract: Progress in large language models is constrained by an evaluation bottleneck: build a benchmark, evaluate models and settings, then iterate. We therefore ask a simple question: can we forecast outcomes before running any experiments? We study text-only performance forecasting: estimating a model’s score from a redacted task description and intended configuration, with no access to dataset instances. To support systematic study, we curate PRECOG, a corpus of redacted description-performance pairs spanning diverse tasks, domains, and metrics. Experiments show the task is challenging but feasible: models equipped with a retrieval module that excludes source papers achieve moderate prediction performance with well-calibrated uncertainty, reaching mean absolute error as low as 8.7 on the Accuracy subset at high-confidence thresholds. Our analysis indicates that stronger reasoning models engage in diverse, iterative querying, whereas current open-source models lag and often skip retrieval or gather evidence with limited diversity. We further test a zero-leakage setting, forecasting on newly released datasets or experiments before their papers are indexed, where GPT-5 with built-in web search still attains nontrivial prediction accuracy. Overall, our corpus and analyses offer an initial step toward open-ended anticipatory evaluation, supporting difficulty estimation and smarter experiment prioritization.

[21] Enhancing Molecular Property Prediction with Knowledge from Large Language Models

Peng Zhou, Lai Hou Tim, Zhixiang Cheng, Kun Xie, Chaoyi Li, Wei Liu, Xiangxiang Zeng

Main category: cs.CL

TL;DR: A novel framework that integrates knowledge from LLMs with structural features from pre-trained molecular models to enhance molecular property prediction, outperforming existing methods.

Motivation: While GNNs and self-supervised learning have advanced molecular property prediction, human prior knowledge remains crucial. LLMs can extract this knowledge but suffer from gaps and hallucinations, especially for less-studied properties.

Method: Proposes a framework that prompts LLMs to generate domain knowledge and executable code for molecular vectorization, then fuses these knowledge-based features with structural representations from pre-trained models. Uses GPT-4o, GPT-4.1, and DeepSeek-R1 for knowledge extraction.

Result: Extensive experiments show the integrated method outperforms existing approaches, demonstrating that combining LLM-derived knowledge with structural information provides a robust solution.

Conclusion: The combination of LLM-extracted knowledge and structural features creates an effective framework for molecular property prediction, addressing limitations of pure structural or knowledge-based approaches.

Abstract: Predicting molecular properties is a critical component of drug discovery. Recent advances in deep learning, particularly Graph Neural Networks (GNNs), have enabled end-to-end learning from molecular structures, reducing reliance on manual feature engineering. However, while GNNs and self-supervised learning approaches have advanced molecular property prediction (MPP), the integration of human prior knowledge remains indispensable, as evidenced by recent methods that leverage large language models (LLMs) for knowledge extraction. Despite their strengths, LLMs are constrained by knowledge gaps and hallucinations, particularly for less-studied molecular properties. In this work, we propose a novel framework that, for the first time, integrates knowledge extracted from LLMs with structural features derived from pre-trained molecular models to enhance MPP. Our approach prompts LLMs to generate both domain-relevant knowledge and executable code for molecular vectorization, producing knowledge-based features that are subsequently fused with structural representations. We employ three state-of-the-art LLMs, GPT-4o, GPT-4.1, and DeepSeek-R1, for knowledge extraction. Extensive experiments demonstrate that our integrated method outperforms existing approaches, confirming that the combination of LLM-derived knowledge and structural information provides a robust and effective solution for MPP.

[22] LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration

Sangmin Lee, Woo-Jin Chung, Hong-Goo Kang

Main category: cs.CL

TL;DR: LAMA-UT is a language-agnostic multilingual ASR pipeline that uses orthography unification and transliteration to achieve equitable performance across languages without language-specific modules.

Motivation: To address the challenge of building universal multilingual ASR models that perform equitably across languages, overcoming inherent difficulties in traditional approaches.

Method: A two-step pipeline: 1) universal transcription generator to unify orthographic features into Romanized form capturing common phonetic characteristics, 2) universal converter to transform universal transcriptions into language-specific ones.

Result: Achieves 45% relative error reduction compared to Whisper, performs comparably to MMS despite using only 0.1% of Whisper’s training data, and matches zero-shot ASR approaches without language-specific modules.

Conclusion: The framework serves as a cornerstone for flexible multilingual ASR systems generalizable to unseen languages, demonstrating effectiveness of universal transcriptions for massively multilingual ASR.

Abstract: Building a universal multilingual automatic speech recognition (ASR) model that performs equitably across languages has long been a challenge due to its inherent difficulties. To address this task we introduce a Language-Agnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT). LAMA-UT operates without any language-specific modules while matching the performance of state-of-the-art models trained on a minimal amount of data. Our pipeline consists of two key steps. First, we utilize a universal transcription generator to unify orthographic features into Romanized form and capture common phonetic characteristics across diverse languages. Second, we utilize a universal converter to transform these universal transcriptions into language-specific ones. In experiments, we demonstrate the effectiveness of our proposed method leveraging universal transcriptions for massively multilingual ASR. Our pipeline achieves a relative error reduction rate of 45% when compared to Whisper and performs comparably to MMS, despite being trained on only 0.1% of Whisper’s training data. Furthermore, our pipeline does not rely on any language-specific modules. However, it performs on par with zero-shot ASR approaches which utilize additional language-specific lexicons and language models. We expect this framework to serve as a cornerstone for flexible multilingual ASR systems that are generalizable even to unseen languages.

[23] RedHerring Attack: Testing the Reliability of Attack Detection

Jonathan Rusert

Main category: cs.CL

TL;DR: RedHerring is a novel attack that makes attack detection models unreliable by modifying text to trigger false attack predictions while keeping the classifier correct, creating tension between classifier and detector.

DetailsMotivation: To explore the reliability of attack detection models in NLP systems, which are used to identify adversarial text modifications but whose robustness hasn't been thoroughly tested.

Method: Proposes RedHerring attack that modifies text to cause detection models to predict attacks while maintaining classifier accuracy, creating a scenario where detectors appear unreliable to humans. Tests on 4 datasets against 3 detectors defending 4 classifiers.

Result: RedHerring successfully drops detection accuracy by 20-71 points while maintaining or improving classifier accuracy. A simple confidence check defense is proposed that requires no retraining and greatly improves detection accuracy.

Conclusion: This novel threat model reveals new vulnerabilities in attack detection systems and provides insights into how adversaries can target detection models, highlighting the need for more robust defense mechanisms.

Abstract: In response to adversarial text attacks, attack detection models have been proposed and shown to successfully identify text modified by adversaries. Attack detection models can be leveraged to provide an additional check for NLP models and give signals for human input. However, the reliability of these models has not yet been thoroughly explored. Thus, we propose and test a novel attack setting and attack, RedHerring. RedHerring aims to make attack detection models unreliable by modifying a text to cause the detection model to predict an attack, while keeping the classifier correct. This creates a tension between the classifier and detector. If a human sees that the detector is giving an “incorrect” prediction, but the classifier a correct one, then the human will see the detector as unreliable. We test this novel threat model on 4 datasets against 3 detectors defending 4 classifiers. We find that RedHerring is able to drop detection accuracy by 20 to 71 points, while maintaining (or improving) classifier accuracy. As an initial defense, we propose a simple confidence check which requires no retraining of the classifier or detector and increases detection accuracy greatly. This novel threat model offers new insights into how adversaries may target detection models.
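
The proposed defense is described only as a simple confidence check; one plausible reading, with stub models and an assumed threshold, looks like this:

```python
# Hedged sketch of a confidence-check defense in the spirit of the paper:
# if the classifier is highly confident, a detector alarm is treated as a
# likely RedHerring false positive. The threshold and stubs are assumptions.

def confidence_checked_detect(text, classifier, detector, tau=0.9):
    probs = classifier(text)   # e.g., {"pos": 0.97, "neg": 0.03}
    flagged = detector(text)   # True if the detector predicts "attack"
    if flagged and max(probs.values()) >= tau:
        return False  # override: a confident classifier suggests no real attack
    return flagged

# Toy stubs for illustration only.
classifier = lambda t: {"pos": 0.97, "neg": 0.03}
detector = lambda t: True
print(confidence_checked_detect("a suspicious-looking review", classifier, detector))  # False
```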

[24] Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST

Monica Sekoyan, Nithin Rao Koluguri, Nune Tadevosyan, Piotr Zelasko, Travis Bartley, Nikolay Karpov, Jagadeesh Balam, Boris Ginsburg

Main category: cs.CL

TL;DR: Canary-1B-v2 is a fast multilingual ASR and speech-to-text translation model supporting 25 European languages, trained on 1.7M hours of data with improved hallucination reduction and timestamp capabilities.

DetailsMotivation: To create a fast, robust multilingual speech recognition and translation model that outperforms existing large models while being significantly faster and more efficient.

Method: Two-stage pre-training and fine-tuning using FastConformer encoder and Transformer decoder, trained on 1.7M hours of data including Granary and NeMo ASR Set 3.0, with non-speech audio for hallucination reduction. Uses NeMo Forced Aligner with CTC for timestamps.

Result: Outperforms Whisper-large-v3 on English ASR while being 10x faster, delivers competitive multilingual ASR and AST performance against larger models like Seamless-M4T-v2-large. Also releases smaller Parakeet-TDT-0.6B-v3 model.

Conclusion: Canary-1B-v2 demonstrates efficient scaling with massive data, with FastConformer excelling after fine-tuning. The model provides state-of-the-art performance with significant speed advantages over existing large models.

Abstract: This report introduces Canary-1B-v2, a fast, robust multilingual model for Automatic Speech Recognition (ASR) and Speech-to-Text Translation (AST). Built with a FastConformer encoder and Transformer decoder, it supports 25 languages, primarily European. The model was trained on a total of 1.7M hours of data, including Granary and NeMo ASR Set 3.0, with non-speech audio added to reduce hallucinations for ASR and AST. We describe its two-stage pre-training and fine-tuning process with dynamic data balancing, as well as experiments with an nGPT encoder. Results show nGPT scales well with massive data, while FastConformer excels after fine-tuning. For timestamps, Canary-1B-v2 uses the NeMo Forced Aligner (NFA) with an auxiliary CTC model, providing reliable segment-level timestamps for ASR and AST. Evaluations show Canary-1B-v2 outperforms Whisper-large-v3 on English ASR while being 10x faster, and delivers competitive multilingual ASR and AST performance against larger models like Seamless-M4T-v2-large and LLM-based systems. We also release Parakeet-TDT-0.6B-v3, a successor to v2, offering multilingual ASR across the same 25 languages with just 600M parameters.

[25] Overcoming Black-box Attack Inefficiency with Hybrid and Dynamic Select Algorithms

Abhinay Shankar Belde, Rohit Ramkumar, Jonathan Rusert

Main category: cs.CL

TL;DR: Proposes two new attack selection strategies (Hybrid and Dynamic Select) that combine BinarySelect and GreedySelect to reduce query costs in adversarial text attacks while maintaining effectiveness.

DetailsMotivation: Transformer-based NLP models require high computational costs for attack testing, and existing black-box attack methods need too many queries, making them inefficient for researchers with limited resources.

Method: Hybrid Select merges BinarySelect and GreedySelect using a size threshold. Dynamic Select learns which selection method to apply based on text length. Both aim to reduce query numbers while maintaining attack success.

Result: Across 4 datasets and 6 target models, sentence-level Hybrid Select reduces required queries by up to 25.82% on average against encoder models and LLMs without losing attack effectiveness.

Conclusion: The proposed selection strategies significantly reduce computational costs for adversarial text attacks while preserving attack performance, making attack testing more accessible to resource-constrained researchers.

Abstract: Adversarial text attack research plays a crucial role in evaluating the robustness of NLP models. However, the increasing complexity of transformer-based architectures has dramatically raised the computational cost of attack testing, especially for researchers with limited resources (e.g., GPUs). Existing popular black-box attack methods often require a large number of queries, which can make them inefficient and impractical for researchers. To address these challenges, we propose two new attack selection strategies called Hybrid and Dynamic Select, which better combine the strengths of previous selection algorithms. Hybrid Select merges generalized BinarySelect techniques with GreedySelect by introducing a size threshold to decide which selection algorithm to use. Dynamic Select provides an alternative approach of combining the generalized Binary and GreedySelect by learning which lengths of texts each selection method should be applied to. This greatly reduces the number of queries needed while maintaining attack effectiveness (a limitation of BinarySelect). Across 4 datasets and 6 target models, our best method (sentence-level Hybrid Select) is able to reduce the number of required queries per attack by up to 25.82% on average against both encoder models and LLMs, without losing the effectiveness of the attack.
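
A sketch of the size-threshold dispatch the abstract attributes to Hybrid Select; the two selection routines are stubs and the threshold value is an assumption:

```python
# Sketch of Hybrid Select's dispatch rule; stubs stand in for the real
# BinarySelect / GreedySelect implementations, and the threshold is assumed.

def greedy_select(tokens):
    """Stub: query the target model once per token (O(n) queries)."""
    return list(range(len(tokens)))

def binary_select(tokens):
    """Stub: recursively halve the text to locate important tokens
    (roughly O(log n) queries per token found)."""
    return [0, len(tokens) // 2]

def hybrid_select(tokens, size_threshold=32):
    # Short texts: greedy search is cheap and exact.
    # Long texts: binary search keeps the query count manageable.
    if len(tokens) <= size_threshold:
        return greedy_select(tokens)
    return binary_select(tokens)

print(hybrid_select("a short example sentence".split()))  # greedy path
```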

[26] Speech Language Models for Under-Represented Languages: Insights from Wolof

Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina

Main category: cs.CL

TL;DR: Training the first Speech LLM for Wolof using unsupervised speech data and integrating it with a Wolof LLM for improved ASR and speech translation capabilities.

DetailsMotivation: To develop speech language models for underrepresented languages like Wolof, addressing the lack of resources and demonstrating the importance of high-quality unsupervised speech data.

Method: Continued pretraining of HuBERT on large-scale spontaneous Wolof speech data, then integrating the speech encoder into a Wolof LLM with Chain-of-Thought reasoning for multi-step tasks.

Result: The Speech LLM outperforms base models and African-centric models on ASR, improves speech recognition, and performs well in speech translation tasks.

Conclusion: Successfully created the first Speech LLM for Wolof, showing that proper data collection and model integration can effectively extend capabilities to underrepresented languages, with models and code being openly shared.

Abstract: We present our journey in training a speech language model for Wolof, an underrepresented language spoken in West Africa, and share key insights. We first emphasize the importance of collecting large-scale, spontaneous, high-quality unsupervised speech data, and show that continued pretraining of HuBERT on this dataset outperforms both the base model and African-centric models on ASR. We then integrate this speech encoder into a Wolof LLM to train the first Speech LLM for this language, extending its capabilities to tasks such as speech translation. Furthermore, we explore training the Speech LLM to perform multi-step Chain-of-Thought reasoning before transcribing or translating. Our results show that the Speech LLM not only improves speech recognition but also performs well in speech translation. The models and the code will be openly shared.

[27] Probability Distribution Collapse: A Critical Bottleneck to Compact Unsupervised Neural Grammar Induction

Jinwook Park, Kangil Kim

Main category: cs.CL

TL;DR: The paper addresses probability distribution collapse in unsupervised neural grammar induction, proposing a collapse-relaxing neural parameterization that improves parsing performance with more compact grammars.

DetailsMotivation: Existing unsupervised neural grammar induction models face an expressiveness bottleneck, leading to unnecessarily large yet underperforming grammars due to probability distribution collapse.

Method: The authors identify and analyze when and how probability distribution collapse emerges in neural parameterization components, then introduce a targeted collapse-relaxing neural parameterization to mitigate this issue.

Result: The proposed approach substantially improves parsing performance while enabling the use of significantly more compact grammars across a wide range of languages, as demonstrated through extensive empirical analysis.

Conclusion: The collapse-relaxing neural parameterization effectively addresses the expressiveness bottleneck in unsupervised grammar induction, achieving better performance with more compact grammars.

Abstract: Unsupervised neural grammar induction aims to learn interpretable hierarchical structures from language data. However, existing models face an expressiveness bottleneck, often resulting in unnecessarily large yet underperforming grammars. We identify a core issue, probability distribution collapse, as the underlying cause of this limitation. We analyze when and how the collapse emerges across key components of neural parameterization and introduce a targeted solution, collapse-relaxing neural parameterization, to mitigate it. Our approach substantially improves parsing performance while enabling the use of significantly more compact grammars across a wide range of languages, as demonstrated through extensive empirical analysis.

[28] Confidence-guided Refinement Reasoning for Zero-shot Question Answering

Youwon Jang, Woo Suk Choi, Minjoon Jung, Minsu Lee, Byoung-Tak Zhang

Main category: cs.CL

TL;DR: C2R is a training-free framework for QA tasks that uses confidence-guided refinement reasoning by constructing and refining sub-questions and answers to improve answer reliability.

DetailsMotivation: To enhance question-answering performance across text, image, and video domains without requiring additional training, by leveraging the model's own confidence scores to select the most reliable answers through sub-question reasoning.

Method: C2R constructs and refines sub-QAs, curates a subset to explore diverse reasoning paths, compares confidence scores of answer candidates, and selects the final answer based on the highest confidence score derived solely from the model itself.

Result: C2R demonstrates consistent performance improvements across diverse models and benchmarks, and can be seamlessly integrated with various existing QA models without requiring training.

Conclusion: The framework provides insights into how sub-QAs affect model behavior, showing that both quantity and quality of sub-QAs are crucial for achieving robust and reliable reasoning in QA tasks.

Abstract: We propose Confidence-guided Refinement Reasoning (C2R), a novel training-free framework applicable to question-answering (QA) tasks across text, image, and video domains. C2R strategically constructs and refines sub-questions and their answers (sub-QAs), deriving a better confidence score for the target answer. C2R first curates a subset of sub-QAs to explore diverse reasoning paths, then compares the confidence scores of the resulting answer candidates to select the most reliable final answer. Since C2R relies solely on confidence scores derived from the model itself, it can be seamlessly integrated with various existing QA models, demonstrating consistent performance improvements across diverse models and benchmarks. Furthermore, we provide essential yet underexplored insights into how leveraging sub-QAs affects model behavior, specifically analyzing the impact of both the quantity and quality of sub-QAs on achieving robust and reliable reasoning.
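
One plausible rendering of the confidence-guided selection loop, with a stub answering function standing in for the underlying QA model; the confidence formula shown is an assumption:

```python
# Sketch of confidence-guided selection as described for C2R: answer
# candidates produced via different sub-QA subsets are compared by
# model-derived confidence. All functions here are stubs.

def answer_with_subqas(question, subqa_subset):
    """Stub: answer conditioned on a set of sub-QAs, returning
    (answer, confidence), with confidence taken from the model itself,
    e.g., the answer tokens' mean log-probability."""
    return f"answer|{len(subqa_subset)}", 0.5 + 0.1 * len(subqa_subset)

def c2r_select(question, subqa_subsets):
    candidates = [answer_with_subqas(question, s) for s in subqa_subsets]
    return max(candidates, key=lambda ac: ac[1])[0]  # highest confidence wins

subsets = [["sub-QA 1"], ["sub-QA 1", "sub-QA 2"], []]
print(c2r_select("Who wrote Hamlet?", subsets))
```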

[29] SFT Doesn’t Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs

Jiacheng Lin, Zhongruo Wang, Kun Qian, Tian Wang, Arvind Srinivasan, Hansi Zeng, Ruochen Jiao, Xie Zhou, Jiri Gesi, Dakuo Wang, Yufan Guo, Kai Zhong, Weiqi Zhang, Sujay Sanghavi, Changyou Chen, Hyokun Yun, Lihong Li

Main category: cs.CL

TL;DR: SFT with smaller learning rates can mitigate general capability degradation while maintaining domain performance. TALR method outperforms other strategies in balancing domain-specific gains and general capabilities.

DetailsMotivation: To address the common belief that Supervised Fine-Tuning (SFT) on domain-specific datasets degrades LLMs' general capabilities, and to find effective strategies to balance domain adaptation with preservation of general performance.

Method: Empirical evaluation of SFT with different learning rates, theoretical analysis, and comparison of various strategies including L2 regularization, LoRA, model averaging, FLOW, and proposed Token-Adaptive Loss Reweighting (TALR).

Result: Smaller learning rates substantially mitigate general performance degradation while preserving target-domain performance. TALR consistently outperforms other methods in balancing domain-specific gains and general capabilities.

Conclusion: Practical guidelines: (i) use a small learning rate for a favorable trade-off; (ii) adopt TALR when a stronger balance is needed. No method completely eliminates the trade-off, but TALR provides the best balance.

Abstract: Supervised Fine-Tuning (SFT) on domain-specific datasets is a common approach to adapt Large Language Models (LLMs) to specialized tasks but is often believed to degrade their general capabilities. In this work, we revisit this trade-off and present both empirical and theoretical insights. First, we show that SFT does not always hurt: using a smaller learning rate can substantially mitigate general performance degradation while preserving comparable target-domain performance. We then provide a theoretical analysis that explains these phenomena and further motivates a new method, Token-Adaptive Loss Reweighting (TALR). Building on this, and recognizing that smaller learning rates alone do not fully eliminate general-performance degradation in all cases, we evaluate a range of strategies for reducing general capability loss, including L2 regularization, LoRA, model averaging, FLOW, and our proposed TALR. Experimental results demonstrate that while no method completely eliminates the trade-off, TALR consistently outperforms these baselines in balancing domain-specific gains and general capabilities. Finally, we distill our findings into practical guidelines for adapting LLMs to new domains: (i) using a small learning rate to achieve a favorable trade-off, and (ii) when a stronger balance is further desired, adopt TALR as an effective strategy.

[30] Towards Atoms of Large Language Models

Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

Main category: cs.CL

TL;DR: The paper proposes Atoms Theory to define fundamental representation units in LLMs, introducing atomic inner product to address issues with neurons and features, and validates through SAE experiments showing atoms better capture intrinsic representations.

DetailsMotivation: Current undefined fundamental units of internal representations in LLMs limit understanding of their mechanisms, with neurons suffering from polysemy and features facing unreliable reconstruction and instability issues.

Method: Propose Atoms Theory with atomic inner product (AIP) to correct representation shifting, prove RIP conditions for atoms, establish uniqueness and exact ℓ₁ recoverability, and train threshold-activated sparse autoencoders (SAEs) on Gemma2-2B, Gemma2-9B, and Llama3.1-8B models.

Result: Achieved 99.9% sparse reconstruction across layers on average, with over 99.8% of atoms satisfying uniqueness condition (vs 0.5% for neurons and 68.2% for features), showing atoms more faithfully capture intrinsic LLM representations.

Conclusion: Atoms Theory provides a systematic theoretical framework for understanding LLM internal representations and serves as foundation for mechanistic interpretability, with scaling experiments revealing SAE size-recovery capacity relationship.

Abstract: The fundamental units of internal representations in large language models (LLMs) remain undefined, limiting further understanding of their mechanisms. Neurons or features are often regarded as such units, yet neurons suffer from polysemy, while features face concerns of unreliable reconstruction and instability. To address this issue, we propose the Atoms Theory, which defines such units as atoms. We introduce the atomic inner product (AIP) to correct representation shifting, formally define atoms, and prove the conditions under which atoms satisfy the Restricted Isometry Property (RIP), ensuring stable sparse representations over the atom set and linking to compressed sensing. Under stronger conditions, we further establish the uniqueness and exact ℓ₁ recoverability of the sparse representations, and provide guarantees that single-layer sparse autoencoders (SAEs) with threshold activations can reliably identify the atoms. To validate the Atoms Theory, we train threshold-activated SAEs on Gemma2-2B, Gemma2-9B, and Llama3.1-8B, achieving 99.9% sparse reconstruction across layers on average, and more than 99.8% of atoms satisfy the uniqueness condition, compared to 0.5% for neurons and 68.2% for features, showing that atoms more faithfully capture intrinsic representations of LLMs. Scaling experiments further reveal the link between SAE size and recovery capacity. Overall, this work systematically introduces and validates the Atoms Theory of LLMs, providing a theoretical framework for understanding internal representations and a foundation for mechanistic interpretability. Code available at https://github.com/ChenhuiHu/towards_atoms.
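
For intuition, a minimal single-layer SAE with threshold activations, of the kind the paper trains, might look like the following; the hard-threshold form, sizes, and hyperparameters are assumptions:

```python
import torch
import torch.nn as nn

# Minimal sketch of a single-layer sparse autoencoder with a threshold
# activation, the probe used to identify atoms; dimensions and the exact
# threshold form are assumptions.

class ThresholdSAE(nn.Module):
    def __init__(self, d_model=2304, n_atoms=16384, theta=0.1):
        super().__init__()
        self.enc = nn.Linear(d_model, n_atoms)
        self.dec = nn.Linear(n_atoms, d_model, bias=False)  # columns ~ atoms
        self.theta = theta

    def forward(self, x):
        pre = self.enc(x)
        z = pre * (pre > self.theta)  # threshold activation -> sparse codes
        return self.dec(z), z

sae = ThresholdSAE()
x_hat, z = sae(torch.randn(4, 2304))
print(x_hat.shape, (z != 0).float().mean().item())  # reconstruction, sparsity
```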

[31] Few-Shot and Training-Free Review Generation via Conversational Prompting

Genki Kusano

Main category: cs.CL

TL;DR: Proposes Conversational Prompting (CP) for few-shot personalized review generation, with two variants: SCP (using user’s own reviews) and CCP (adding contrastive examples from other users). CP outperforms conventional prompts in generating user-style reviews with minimal data.

DetailsMotivation: Real-world applications need personalized review generation but face few-shot and training-free constraints where users have limited review history and fine-tuning is infeasible. LLMs can help but require effective prompting strategies.

Method: Reformulates user reviews as multi-turn conversations. SCP uses only the target user’s reviews, while CCP inserts incorrect replies from other users/LLMs and asks the model to correct them to encourage user-style generation.

Result: Experiments across 8 domains and 5 LLMs showed CP (especially CCP) produced reviews much closer to target user’s style than conventional prompts, even with only 2 reviews per user. CCP excels with quality negative examples, SCP remains competitive otherwise.

Conclusion: Conversational prompting provides a practical solution for few-shot, training-free personalized review generation, effectively leveraging LLMs’ capabilities without requiring extensive data or model fine-tuning.

Abstract: Personalized review generation helps businesses understand user preferences, yet most existing approaches assume extensive review histories of the target user or require additional model training. Real-world applications often face few-shot and training-free situations, where only a few user reviews are available and fine-tuning is infeasible. It is well known that large language models (LLMs) can address such low-resource settings, but their effectiveness depends on prompt engineering. In this paper, we propose Conversational Prompting, a lightweight method that reformulates user reviews as multi-turn conversations. Its simple variant, Simple Conversational Prompting (SCP), relies solely on the user’s own reviews, while the contrastive variant, Contrastive Conversational Prompting (CCP), inserts reviews from other users or LLMs as incorrect replies and then asks the model to correct them, encouraging the model to produce text in the user’s style. Experiments on eight product domains and five LLMs showed that the conventional non-conversational prompt often produced reviews similar to those written by random users, based on text-based metrics such as ROUGE-L and BERTScore, and application-oriented tasks like user identity matching and sentiment analysis. In contrast, both SCP and CCP produced reviews much closer to those of the target user, even when each user had only two reviews. CCP brings further improvements when high-quality negative examples are available, whereas SCP remains competitive when such data cannot be collected. These results suggest that conversational prompting offers a practical solution for review generation under few-shot and training-free constraints.
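
A sketch of how SCP and CCP might serialize a review history into chat turns, using the common role/content message format; the exact prompt wording is an assumption:

```python
# Hedged sketch of SCP/CCP message construction; the message schema follows
# the widely used OpenAI-style chat format, and all wording is assumed.

def scp_messages(user_reviews, new_product):
    msgs = []
    for product, review in user_reviews:
        msgs.append({"role": "user", "content": f"Write your review of {product}."})
        msgs.append({"role": "assistant", "content": review})  # user's own review
    msgs.append({"role": "user", "content": f"Write your review of {new_product}."})
    return msgs

def ccp_messages(user_reviews, other_review, new_product):
    msgs = scp_messages(user_reviews, new_product)[:-1]
    # Contrastive turn: present another user's review as an "incorrect" reply
    # and ask the model to correct it toward the target user's style.
    msgs.append({"role": "user", "content": f"Write your review of {new_product}."})
    msgs.append({"role": "assistant", "content": other_review})
    msgs.append({"role": "user", "content": "That does not sound like you. Rewrite it in your own style."})
    return msgs

history = [("a coffee maker", "Brews fast, a bit loud. Four stars from me!")]
print(len(ccp_messages(history, "Terrible product!!!", "a toaster")))  # 5
```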

[32] Enrich-on-Graph: Query-Graph Alignment for Complex Reasoning with LLM Enriching

Songze Li, Zhiqiang Liu, Zhengke Gui, Huajun Chen, Wen Zhang

Main category: cs.CL

TL;DR: Enrich-on-Graph (EoG) framework uses LLMs to enrich knowledge graphs, bridging the semantic gap between structured KGs and unstructured queries for improved KGQA performance.

DetailsMotivation: LLMs struggle with hallucinations in knowledge-intensive tasks like KGQA due to semantic gaps between structured knowledge graphs and unstructured queries, which existing methods overlook.

Method: Propose EoG framework that leverages LLMs’ prior knowledge to enrich KGs, enabling efficient evidence extraction for precise reasoning while maintaining low computational costs and scalability.

Result: Extensive experiments on two KGQA benchmarks show EoG effectively generates high-quality KGs and achieves state-of-the-art performance.

Conclusion: EoG provides a flexible, scalable solution to bridge the semantic gap in KGQA, with proposed graph quality metrics to evaluate query-graph alignment.

Abstract: Large Language Models (LLMs) exhibit strong reasoning capabilities in complex tasks. However, they still struggle with hallucinations and factual errors in knowledge-intensive scenarios like knowledge graph question answering (KGQA). We attribute this to the semantic gap between structured knowledge graphs (KGs) and unstructured queries, caused by inherent differences in their focuses and structures. Existing methods usually employ resource-intensive, non-scalable workflows that reason over vanilla KGs, but overlook this gap. To address this challenge, we propose a flexible framework, Enrich-on-Graph (EoG), which leverages LLMs’ prior knowledge to enrich KGs and bridge the semantic gap between graphs and queries. EoG enables efficient evidence extraction from KGs for precise and robust reasoning, while ensuring low computational costs, scalability, and adaptability across different methods. Furthermore, we propose three graph quality evaluation metrics to analyze query-graph alignment in the KGQA task, supported by theoretical validation of our optimization objectives. Extensive experiments on two KGQA benchmark datasets indicate that EoG can effectively generate high-quality KGs and achieve state-of-the-art performance. Our code and data are available at https://github.com/zjukg/Enrich-on-Graph.

[33] Leveraging What’s Overfixed: Post-Correction via LLM Grammatical Error Overcorrection

Taehee Park, Heejin Do, Gary Geunbae Lee

Main category: cs.CL

TL;DR: PoCO is a novel approach that combines LLMs’ overcorrection tendency with sLMs’ precision to improve grammatical error correction by first maximizing recall via LLM overcorrection, then refining outputs with targeted post-correction.

DetailsMotivation: Small Language Models (sLMs) achieve high precision but low recall in grammatical error correction, while Large Language Models (LLMs) show the opposite tendency with excessive overcorrection leading to low precision. The goal is to balance both aspects.

Method: PoCO first intentionally triggers overcorrection via LLM to maximize recall through comprehensive revisions, then applies targeted post-correction using fine-tuned smaller models to identify and refine erroneous outputs.

Result: Extensive experiments show that PoCO effectively balances GEC performance by increasing recall while maintaining competitive precision, improving overall grammatical error correction quality.

Conclusion: PoCO successfully harmonizes the generative power of LLMs with the reliability of smaller supervised models, achieving better balance between recall and precision in grammatical error correction.

Abstract: Robust supervised fine-tuned small Language Models (sLMs) often show high reliability but tend to undercorrect. They achieve high precision at the cost of low recall. Conversely, Large Language Models (LLMs) often show the opposite tendency, making excessive overcorrection, leading to low precision. To effectively harness the strengths of LLMs to address the recall challenges in sLMs, we propose Post-Correction via Overcorrection (PoCO), a novel approach that strategically balances recall and precision. PoCO first intentionally triggers overcorrection via LLM to maximize recall by allowing comprehensive revisions, then applies a targeted post-correction step via fine-tuning smaller models to identify and refine erroneous outputs. We aim to harmonize both aspects by leveraging the generative power of LLMs while preserving the reliability of smaller supervised models. Our extensive experiments demonstrate that PoCO effectively balances GEC performance by increasing recall with competitive precision, ultimately improving the overall quality of grammatical error correction.
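
A minimal sketch of the two-stage flow, with stubs for both models; how the small model decides which edits to keep is the paper's contribution and is not reproduced here:

```python
# Two-stage sketch of PoCO's recall-then-precision flow; both models are
# stubs, and the refinement logic is an assumption.

def llm_overcorrect(sentence: str) -> str:
    """Stage 1 (stub): an LLM revises aggressively, maximizing recall."""
    return "He goes to school every day."

def slm_post_correct(source: str, overcorrected: str) -> str:
    """Stage 2 (stub): a fine-tuned small model keeps only the edits it can
    verify against the source, restoring precision."""
    return overcorrected  # a real model would revert unsupported edits

src = "He go to school every day."
print(slm_post_correct(src, llm_overcorrect(src)))
```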

[34] Distilling Many-Shot In-Context Learning into a Cheat Sheet

Ukyo Honda, Soichiro Murakami, Peinan Zhang

Main category: cs.CL

TL;DR: Cheat-sheet ICL distills many-shot in-context learning into concise summaries, achieving comparable performance with fewer tokens and no test-time retrieval.

DetailsMotivation: To address the high computational cost of many-shot in-context learning in large language models while maintaining performance.

Method: Proposes cheat-sheet ICL that creates concise textual summaries (cheat sheets) from many-shot examples, using these summaries as context instead of full examples.

Result: Achieves comparable or better performance than many-shot ICL with far fewer tokens, and matches retrieval-based ICL without requiring test-time retrieval.

Conclusion: Cheat-sheet ICL is a practical alternative for leveraging LLMs in downstream tasks, offering computational efficiency while maintaining performance.

Abstract: Recent advances in large language models (LLMs) enable effective in-context learning (ICL) with many-shot examples, but at the cost of high computational demand due to longer input tokens. To address this, we propose cheat-sheet ICL, which distills the information from many-shot ICL into a concise textual summary (cheat sheet) used as the context at inference time. Experiments on challenging reasoning tasks show that cheat-sheet ICL achieves comparable or better performance than many-shot ICL with far fewer tokens, and matches retrieval-based ICL without requiring test-time retrieval. These findings demonstrate that cheat-sheet ICL is a practical alternative for leveraging LLMs in downstream tasks.
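
The two phases reduce to one offline distillation call and a short inference prompt. A hedged sketch, where `llm` is a hypothetical text-in/text-out callable and the prompt wording is assumed:

```python
# Sketch of the two phases of cheat-sheet ICL with an assumed prompt format.

def build_cheat_sheet(llm, examples):
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    prompt = (f"{shots}\n\nSummarize the strategies needed to solve problems "
              f"like the ones above as a concise cheat sheet.")
    return llm(prompt)  # one offline call over the full many-shot context

def answer_with_cheat_sheet(llm, cheat_sheet, question):
    # Inference uses only the short summary, not the full example set.
    return llm(f"Cheat sheet:\n{cheat_sheet}\n\nQ: {question}\nA:")

llm = lambda prompt: "stub response"  # placeholder model
sheet = build_cheat_sheet(llm, [("2+2?", "4"), ("3*3?", "9")])
print(answer_with_cheat_sheet(llm, sheet, "5+7?"))
```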

[35] Zero-Shot Privacy-Aware Text Rewriting via Iterative Tree Search

Shuo Huang, Xingliang Yuan, Gholamreza Haffari, Lizhen Qu

Main category: cs.CL

TL;DR: A zero-shot tree-search-based iterative sentence rewriting algorithm for privacy-preserving text anonymization that outperforms existing methods in balancing privacy protection and utility preservation.

DetailsMotivation: Address privacy concerns in LLM-based cloud services where user inputs may expose sensitive information, overcoming limitations of current anonymization techniques that struggle with text naturalness and utility.

Method: Proposes a zero-shot, tree-search-based iterative sentence rewriting algorithm that systematically obfuscates or deletes private information through structured search guided by a reward model, enabling dynamic exploration of rewriting space.

Result: Experiments on privacy-sensitive datasets show the approach significantly outperforms existing baselines, achieving superior balance between privacy protection and utility preservation.

Conclusion: The proposed method effectively addresses privacy concerns in LLM applications by providing a robust solution that maintains text coherence, relevance, and naturalness while protecting sensitive information.

Abstract: The increasing adoption of large language models (LLMs) in cloud-based services has raised significant privacy concerns, as user inputs may inadvertently expose sensitive information. Existing text anonymization and de-identification techniques, such as rule-based redaction and scrubbing, often struggle to balance privacy preservation with text naturalness and utility. In this work, we propose a zero-shot, tree-search-based iterative sentence rewriting algorithm that systematically obfuscates or deletes private information while preserving coherence, relevance, and naturalness. Our method incrementally rewrites privacy-sensitive segments through a structured search guided by a reward model, enabling dynamic exploration of the rewriting space. Experiments on privacy-sensitive datasets show that our approach significantly outperforms existing baselines, achieving a superior balance between privacy protection and utility preservation.
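
A sketch of reward-guided tree search over rewrites; the proposer and reward model below are toy stubs, and the beam width and depth are assumptions:

```python
# Hedged sketch of iterative tree search for privacy-aware rewriting.

def propose_rewrites(sentence):
    """Stub: generate candidate rewrites that obfuscate/delete private spans."""
    return [sentence.replace("John", "a colleague"),
            sentence.replace("John", "[REDACTED]")]

def reward(sentence):
    """Stub: higher is better; a real reward model would jointly score
    privacy leakage and utility/naturalness."""
    return -sentence.count("John") + 0.1 * ("[REDACTED]" not in sentence)

def tree_search_rewrite(sentence, depth=3, beam=2):
    frontier = [sentence]
    for _ in range(depth):
        candidates = [c for s in frontier for c in propose_rewrites(s)]
        frontier = sorted(set(candidates), key=reward, reverse=True)[:beam]
    return frontier[0]

print(tree_search_rewrite("John said the launch slips to May."))
```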

[36] Concise and Sufficient Sub-Sentence Citations for Retrieval-Augmented Generation

Guo Chen, Qiuyuan Li, Qiuxian Li, Hongliang Dai, Xiang Chen, Piji Li

Main category: cs.CL

TL;DR: This paper proposes sub-sentence citations for RAG systems to improve citation quality by making them more concise and sufficient, addressing issues with sentence-level citations that contain irrelevant content or omit essential verification information.

DetailsMotivation: Existing attribution methods in RAG systems produce citations at sentence or paragraph level, which may include irrelevant content or omit essential verification information, forcing users to read surrounding context to verify outputs.

Method: The authors develop annotation guidelines for sub-sentence citations, construct a dataset, and propose an attribution framework that uses LLMs to generate fine-tuning data with a credit model to filter low-quality examples.

Result: Experiments on the constructed dataset demonstrate that the proposed approach can generate high-quality and more readable citations compared to existing methods.

Conclusion: Sub-sentence citations provide a better solution for RAG systems by offering citations that are both concise and sufficient, reducing user effort in verifying generated outputs while maintaining verifiability.

Abstract: In retrieval-augmented generation (RAG) question answering systems, generating citations for large language model (LLM) outputs enhances verifiability and helps users identify potential hallucinations. However, we observe two problems in the citations produced by existing attribution methods. First, the citations are typically provided at the sentence or even paragraph level. Long sentences or paragraphs may include a substantial amount of irrelevant content. Second, sentence-level citations may omit information that is essential for verifying the output, forcing users to read the surrounding context. In this paper, we propose generating sub-sentence citations that are both concise and sufficient, thereby reducing the effort required by users to confirm the correctness of the generated output. To this end, we first develop annotation guidelines for such citations and construct a corresponding dataset. Then, we propose an attribution framework for generating citations that adhere to our standards. This framework leverages LLMs to automatically generate fine-tuning data for our task and employs a credit model to filter out low-quality examples. Our experiments on the constructed dataset demonstrate that the proposed approach can generate high-quality and more readable citations.

[37] WeFT: Weighted Entropy-driven Fine-Tuning for dLLMs

Guowei Xu, Wenxin Xu, Jiawang Zhao, Kaisheng Ma

Main category: cs.CL

TL;DR: WeFT is a weighted supervised fine-tuning method for diffusion language models that assigns different weights to tokens based on their entropy, achieving significant improvements over standard SFT on reasoning benchmarks.

DetailsMotivation: Diffusion models show promise for language modeling but lack precise probability estimates at each denoising step, making generation unpredictable and inconsistent. Controlling key tokens that guide generation direction is crucial.

Method: WeFT (Weighted Fine-Tuning) assigns different weights to tokens based on their entropy derived from diffusion theory, allowing for more effective supervised fine-tuning of diffusion language models.

Result: Training on s1K, s1K-1.1, and 3k samples from open-r1, WeFT achieves relative improvements of 39%, 64%, and 83% over standard SFT on Sudoku, Countdown, GSM8K, and MATH-500 benchmarks.

Conclusion: WeFT effectively addresses the challenges of SFT for diffusion language models by incorporating token weighting based on entropy, leading to substantial performance gains on reasoning tasks.

Abstract: Diffusion models have recently shown strong potential in language modeling, offering faster generation compared to traditional autoregressive approaches. However, applying supervised fine-tuning (SFT) to diffusion models remains challenging, as they lack precise probability estimates at each denoising step. While the diffusion mechanism enables the model to reason over entire sequences, it also makes the generation process less predictable and often inconsistent. This highlights the importance of controlling key tokens that guide the direction of generation. To address this issue, we propose WeFT, a weighted SFT method for diffusion language models, where tokens are assigned different weights based on their entropy. Derived from diffusion theory, WeFT delivers substantial gains: training on s1K, s1K-1.1, and 3k samples from open-r1, it achieves relative improvements of 39%, 64%, and 83% over standard SFT on four widely used reasoning benchmarks (Sudoku, Countdown, GSM8K, and MATH-500). The code and models will be made publicly available.
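
A hedged sketch of entropy-weighted token loss; the specific mapping from entropy to weight is WeFT's contribution and is not reproduced here, so the normalization below is purely an assumption:

```python
import torch
import torch.nn.functional as F

# Sketch of entropy-based token weighting in the spirit of WeFT; the
# entropy-to-weight scheme shown (per-sequence normalization) is assumed.

def weighted_sft_loss(logits, targets):
    # logits: (batch, seq, vocab); targets: (batch, seq)
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)  # (batch, seq)
    weights = entropy / entropy.sum(dim=-1, keepdim=True)         # assumed scheme
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    return (weights * ce).sum(dim=-1).mean()

loss = weighted_sft_loss(torch.randn(2, 5, 100), torch.randint(0, 100, (2, 5)))
print(loss)
```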

[38] Single Answer is Not Enough: On Generating Ranked Lists with Medical Reasoning Models

Pittawat Taveekitworachai, Natpatchara Pongjirapat, Krittaphas Chaisutyakorn, Piyalitt Ittichaiwong, Tossaporn Saengja, Kunat Pipatanakul

Main category: cs.CL

TL;DR: This paper presents the first systematic study on enabling medical reasoning models to generate ranked lists of answers for open-ended questions, proposing both prompting and fine-tuning approaches to move beyond single-answer formats.

DetailsMotivation: Clinical decision-making relies on considering multiple options rather than single answers to reduce risks of narrow perspectives, but current medical reasoning models are typically trained to produce only one answer even in open-ended settings.

Method: The study investigates two approaches: prompting and fine-tuning (supervised fine-tuning and reinforcement fine-tuning). It proposes new reward functions for ranked-list formats and conducts ablation studies for reinforcement fine-tuning.

Result: Results show that while some supervised fine-tuning models generalize to certain answer formats, models trained with reinforcement fine-tuning are more robust across multiple formats. Models can recognize valid answers even when they fail to select the benchmark’s preferred ground truth.

Conclusion: This work provides a first step toward developing alternative answer formats beneficial beyond single answers in medical domains, with reinforcement fine-tuning showing superior robustness for ranked-list generation.

Abstract: This paper presents a systematic study on enabling medical reasoning models (MRMs) to generate ranked lists of answers for open-ended questions. Clinical decision-making rarely relies on a single answer but instead considers multiple options, reducing the risks of narrow perspectives. Yet current MRMs are typically trained to produce only one answer, even in open-ended settings. We propose an alternative format, ranked lists, and investigate two approaches: prompting and fine-tuning. While prompting is a cost-effective way to steer an MRM’s response, not all MRMs generalize well across different answer formats: choice, short text, and list answers. Based on our prompting findings, we train and evaluate MRMs using supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT teaches a model to imitate annotated responses, and RFT incentivizes exploration through the responses that maximize a reward. We propose new reward functions targeted at ranked-list answer formats, and conduct ablation studies for RFT. Our results show that while some SFT models generalize to certain answer formats, models trained with RFT are more robust across multiple formats. We also present a case study on a modified MedQA with multiple valid answers, finding that although MRMs might fail to select the benchmark’s preferred ground truth, they can recognize valid answers. To the best of our knowledge, this is the first systematic investigation of approaches for enabling MRMs to generate answers as ranked lists. We hope this work provides a first step toward developing alternative answer formats that are beneficial beyond single answers in medical domains.
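
As one illustrative (assumed, not the paper's) reward for ranked-list outputs, a reciprocal-rank score credits the model by how high the gold answer appears:

```python
# Illustrative rank-aware reward for RFT; the paper proposes its own reward
# functions, so this reciprocal-rank variant is purely an assumed example.

def ranked_list_reward(predicted: list[str], gold: str) -> float:
    """1/rank of the gold answer in the model's ranked list, else 0."""
    for rank, answer in enumerate(predicted, start=1):
        if answer.strip().lower() == gold.strip().lower():
            return 1.0 / rank
    return 0.0

print(ranked_list_reward(["metformin", "insulin", "sulfonylurea"], "insulin"))  # 0.5
```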

[39] Learning to Summarize by Learning to Quiz: Adversarial Agentic Collaboration for Long Document Summarization

Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch

Main category: cs.CL

TL;DR: SummQ is a novel adversarial multi-agent framework for long document summarization that uses collaborative intelligence between summarization and quizzing agents to address information loss, factual inconsistencies, and coherence issues.

DetailsMotivation: Current LLMs struggle with long document summarization due to information loss, factual inconsistencies, and coherence problems when processing excessively long documents.

Method: SummQ employs a multi-agent framework with summary generators and reviewers working collaboratively, along with quiz generators and reviewers that create comprehension questions as quality checks. An examinee agent validates whether summaries contain information needed to answer quiz questions, enabling iterative refinement through adversarial feedback.

Result: SummQ significantly outperforms state-of-the-art methods on three long document summarization benchmarks across ROUGE, BERTScore, LLM-as-a-Judge, and human evaluations.

Conclusion: The work establishes a new approach for long document summarization using adversarial agentic collaboration to improve summarization quality, with analyses revealing the effectiveness of multi-agent collaboration dynamics and quizzing mechanisms.

Abstract: Long document summarization remains a significant challenge for current large language models (LLMs), as existing approaches commonly struggle with information loss, factual inconsistencies, and coherence issues when processing excessively long documents. We propose SummQ, a novel adversarial multi-agent framework that addresses these limitations through collaborative intelligence between specialized agents operating in two complementary domains: summarization and quizzing. Our approach employs summary generators and reviewers that work collaboratively to create and evaluate comprehensive summaries, while quiz generators and reviewers create comprehension questions that serve as continuous quality checks for the summarization process. This adversarial dynamic, enhanced by an examinee agent that validates whether the generated summary contains the information needed to answer the quiz questions, enables iterative refinement through multifaceted feedback mechanisms. We evaluate SummQ on three widely used long document summarization benchmarks. Experimental results demonstrate that our framework significantly outperforms existing state-of-the-art methods across ROUGE and BERTScore metrics, as well as in LLM-as-a-Judge and human evaluations. Our comprehensive analyses reveal the effectiveness of the multi-agent collaboration dynamics, the influence of different agent configurations, and the impact of the quizzing mechanism. This work establishes a new approach for long document summarization that uses adversarial agentic collaboration to improve summarization quality.

[40] MemLens: Uncovering Memorization in LLMs with Activation Trajectories

Zirui He, Haiyan Zhao, Ali Payani, Mengnan du

Main category: cs.CL

TL;DR: MemLens detects LLM memorization by analyzing probability trajectories of numeric tokens, showing contaminated samples exhibit early confidence locking while clean samples show gradual evidence accumulation.

DetailsMotivation: Existing detection methods relying on lexical overlap and perplexity have low generalization and degrade with implicitly contaminated data, requiring a more robust memorization detection approach.

Method: Proposes MemLens which analyzes probability trajectories of numeric tokens during generation, revealing that contaminated samples show ‘shortcut’ behaviors with early confidence locking while clean samples exhibit gradual evidence accumulation across model layers.

Result: Contaminated and clean samples exhibit distinct, well-separated reasoning trajectories. Validation through LoRA fine-tuning with designed samples shows the same trajectory patterns as naturally contaminated data.

Conclusion: MemLens captures genuine signals of memorization rather than spurious correlations, providing a robust method for detecting LLM contamination in benchmarks.

Abstract: Large language models (LLMs) are commonly evaluated on challenging benchmarks such as AIME and Math500, which are susceptible to contamination and risk of being memorized. Existing detection methods, which primarily rely on surface-level lexical overlap and perplexity, demonstrate low generalization and degrade significantly when encountering implicitly contaminated data. In this paper, we propose MemLens (An Activation Lens for Memorization Detection) to detect memorization by analyzing the probability trajectories of numeric tokens during generation. Our method reveals that contaminated samples exhibit “shortcut” behaviors, locking onto an answer with high confidence in the model’s early layers, whereas clean samples show more gradual evidence accumulation across the model’s full depth. We observe that contaminated and clean samples exhibit distinct and well-separated reasoning trajectories. To further validate this, we inject carefully designed samples into the model through LoRA fine-tuning and observe the same trajectory patterns as in naturally contaminated data. These results provide strong evidence that MemLens captures genuine signals of memorization rather than spurious correlations.
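
The trajectory idea resembles a logit-lens read-out per layer. A sketch on GPT-2 as a small stand-in (the model choice is illustrative, and applying the final norm before the LM head is a common lens convention, not necessarily MemLens' exact procedure):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Logit-lens style sketch: track the probability of a numeric answer token
# across layers. MemLens' actual trajectory analysis is richer than this.

name = "gpt2"  # small stand-in model for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

prompt = "Q: What is 17 * 3? A:"
ids = tok(prompt, return_tensors="pt").input_ids
answer_id = tok(" 51", add_special_tokens=False).input_ids[0]

with torch.no_grad():
    hidden = model(ids, output_hidden_states=True).hidden_states  # per layer

for layer, h in enumerate(hidden):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))  # lens at last pos
    p = logits.softmax(-1)[0, answer_id].item()
    print(f"layer {layer:2d}: P(' 51') = {p:.4f}")
```

Per the paper's framing, a memorized sample would lock onto the answer in early layers, while a clean sample's probability climbs gradually with depth.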

[41] Cross-Linguistic Analysis of Memory Load in Sentence Comprehension: Linear Distance and Structural Density

Krishna Aggarwal

Main category: cs.CL

TL;DR: This study examines whether memory load in sentence comprehension is better explained by linear distance between words or by structural complexity of intervening material, introducing Intervener Complexity as a refined measure.

DetailsMotivation: To reconcile linear and hierarchical perspectives on locality in sentence processing by developing a structurally grounded measure that improves upon linear distance metrics, building on dependency length minimization theories.

Method: Used harmonized dependency treebanks and mixed-effects modeling across multiple languages to evaluate sentence length, dependency length, and Intervener Complexity (number of intervening heads) as predictors of memory load, operationalized as the sum of feature misbinding and interference.

Result: All three factors positively associate with memory load, with sentence length having the broadest influence and Intervener Complexity providing explanatory power beyond linear distance alone.

Conclusion: The findings reconcile linear and hierarchical perspectives by showing dependency length as a surface signature while identifying intervening heads as a more proximate indicator of integration demands, demonstrating how UD-based graph measures can disentangle structural contributions to processing efficiency.

Abstract: This study examines whether sentence-level memory load in comprehension is better explained by linear proximity between syntactically related words or by the structural density of the intervening material. Building on locality-based accounts and cross-linguistic evidence for dependency length minimization, the work advances Intervener Complexity, the number of intervening heads between a head and its dependent, as a structurally grounded lens that refines linear distance measures. Using harmonized dependency treebanks and a mixed-effects framework across multiple languages, the analysis jointly evaluates sentence length, dependency length, and Intervener Complexity as predictors of the memory-load measure. Studies in psycholinguistics have reported the contributions of feature interference and misbinding to memory load during processing. For this study, I operationalized sentence-level memory load as the linear sum of feature misbinding and feature interference for tractability; current evidence does not establish that their cognitive contributions combine additively. All three factors are positively associated with memory load, with sentence length exerting the broadest influence and Intervener Complexity offering explanatory power beyond linear distance. Conceptually, the findings reconcile linear and hierarchical perspectives on locality by treating dependency length as an important surface signature while identifying intervening heads as a more proximate indicator of integration and maintenance demands. Methodologically, the study illustrates how UD-based graph measures and cross-linguistic mixed-effects modelling can disentangle linear and structural contributions to processing efficiency, providing a principled path for evaluating competing theories of memory load in sentence comprehension.
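
Under one reading of the definition, Intervener Complexity for an arc counts the tokens between the head and its dependent that themselves govern at least one dependent; a sketch (the paper's exact operationalization may differ):

```python
# Hedged sketch of Intervener Complexity for a single dependency arc.

def intervener_complexity(heads, dep_idx):
    """heads: 0-based head index per token (-1 for root); dep_idx: dependent."""
    h = heads[dep_idx]
    lo, hi = sorted((h, dep_idx))
    governors = set(x for x in heads if x >= 0)  # tokens that head something
    return sum(1 for i in range(lo + 1, hi) if i in governors)

# "dogs that chase cats sleep": 'sleep' (idx 4) governs 'dogs' (idx 0);
# 'chase' (idx 2) governs 'that' and 'cats' and intervenes on that arc.
heads = [4, 2, 0, 2, -1]
print(intervener_complexity(heads, 0))  # 1 intervening head ('chase')
```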

[42] Tool Calling for Arabic LLMs: Data Strategies and Instruction Tuning

Asim Ersoy, Enes Altinisik, Husrev Taha Sencar, Kareem Darwish

Main category: cs.CL

TL;DR: This paper investigates tool calling capabilities for Arabic language in LLMs, examining the need for Arabic-specific data vs cross-lingual transfer, the impact of instruction tuning, and the value of fine-tuning on specific tools.

DetailsMotivation: Tool calling research is predominantly English-centric, creating a gap for other languages like Arabic. The paper aims to understand optimal strategies for developing Arabic tool-augmented agents.

Method: Extensive experiments using base and post-trained variants of an open-weight Arabic LLM, with translated and adapted tool-calling datasets from English to Arabic.

Result: The study provides crucial insights into optimal strategies for Arabic tool calling, though specific performance metrics are not detailed in the abstract.

Conclusion: The research bridges the resource gap for Arabic tool calling and offers guidance on data requirements and training approaches for developing robust Arabic tool-augmented agents.

Abstract: Tool calling is a critical capability that allows Large Language Models (LLMs) to interact with external systems, significantly expanding their utility. However, research and resources for tool calling are predominantly English-centric, leaving a gap in our understanding of how to enable this functionality for other languages, such as Arabic. This paper investigates three key research questions: (1) the necessity of in-language (Arabic) tool-calling data versus relying on cross-lingual transfer, (2) the effect of general-purpose instruction tuning on tool-calling performance, and (3) the value of fine-tuning on specific, high-priority tools. To address these questions, we conduct extensive experiments using base and post-trained variants of an open-weight Arabic LLM. To enable this study, we bridge the resource gap by translating and adapting two open-source tool-calling datasets into Arabic. Our findings provide crucial insights into the optimal strategies for developing robust tool-augmented agents for Arabic.

[43] Analysis of instruction-based LLMs’ capabilities to score and judge text-input problems in an academic setting

Valeria Ramirez-Garcia, David de-Fitero-Dominguez, Antonio Garcia-Cabot, Eva Garcia-Lopez

Main category: cs.CL

TL;DR: LLM-driven automatic evaluation systems for academic Text-Input Problems using rubrics, with Reference Aided Evaluation showing best performance compared to human evaluation.

DetailsMotivation: To investigate LLM-driven automatic evaluation systems for academic Text-Input Problems using rubrics, building on previous research of LLMs as evaluators and educational tools.

Method: Proposed five evaluation systems tested on 110 computer science answers from higher education students using three models (JudgeLM, Llama-3.1-8B, DeepSeek-R1-Distill-Llama-8B). Methods include: JudgeLM evaluation, Reference Aided Evaluation, No Reference Evaluation, Additive Evaluation, and Adaptive Evaluation.

Result: Reference Aided Evaluation performed best with lowest median absolute deviation (0.945) and lowest root mean square deviation (1.214) compared to human evaluation. Other methods had limitations: Additive/Adaptive failed on concise answers, No Reference lacked information, JudgeLM had model limitations.

Conclusion: AI-driven automatic evaluation systems with proper methodologies show potential as complementary tools to academic resources.

Abstract: Large language models (LLMs) can act as evaluators, a role studied by methods like LLM-as-a-Judge and fine-tuned judging LLMs. In the field of education, LLMs have been studied as assistant tools for students and teachers. Our research investigates LLM-driven automatic evaluation systems for academic Text-Input Problems using rubrics. We propose five evaluation systems that have been tested on a custom dataset of 110 answers about computer science from higher education students with three models: JudgeLM, Llama-3.1-8B and DeepSeek-R1-Distill-Llama-8B. The evaluation systems include: the JudgeLM evaluation, which uses the model’s single answer prompt to obtain a score; Reference Aided Evaluation, which uses a correct answer as a guide aside from the original context of the question; No Reference Evaluation, which omits the reference answer; Additive Evaluation, which uses atomic criteria; and Adaptive Evaluation, which is an evaluation done with generated criteria fitted to each question. All evaluation methods have been compared with the results of a human evaluator. Results show that the best method to automatically evaluate and score Text-Input Problems using LLMs is Reference Aided Evaluation. With the lowest median absolute deviation (0.945) and the lowest root mean square deviation (1.214) when compared to human evaluation, Reference Aided Evaluation offers fair scoring as well as insightful and complete evaluations. Other methods such as Additive and Adaptive Evaluation fail to provide good results in concise answers, No Reference Evaluation lacks information needed to correctly assess questions, and JudgeLM evaluations have not provided good results due to the model’s limitations. As a result, we conclude that Artificial Intelligence-driven automatic evaluation systems, aided with proper methodologies, show potential to work as complementary tools to other academic resources.

[44] Generative AI for FFRDCs

Arun S. Maiya

Main category: cs.CL

TL;DR: This paper demonstrates how large language models can accelerate text analysis for federally funded research centers using a secure open-source framework, with case studies on defense policy and scientific documents.

DetailsMotivation: FFRDCs face text-heavy workloads that are slow to analyze manually, requiring efficient methods for processing policy documents, scientific papers, and engineering reports in sensitive government contexts.

Method: Applied OnPrem.LLM, an open-source framework for secure generative AI, using few-shot learning with minimal input-output examples for summarization, classification, extraction, and sense-making tasks.

Result: Case studies on National Defense Authorization Act (NDAA) and National Science Foundation (NSF) Awards show enhanced oversight and strategic analysis while maintaining auditability and data sovereignty.

Conclusion: The approach successfully enables efficient text analysis in sensitive government environments, demonstrating practical applications for defense policy and scientific corpora with secure, auditable AI deployment.

Abstract: Federally funded research and development centers (FFRDCs) face text-heavy workloads, from policy documents to scientific and engineering papers, that are slow to analyze manually. We show how large language models can accelerate summarization, classification, extraction, and sense-making with only a few input-output examples. To enable use in sensitive government contexts, we apply OnPrem.LLM, an open-source framework for secure and flexible application of generative AI. Case studies on defense policy documents and scientific corpora, including the National Defense Authorization Act (NDAA) and National Science Foundation (NSF) Awards, demonstrate how this approach enhances oversight and strategic analysis while maintaining auditability and data sovereignty.

[45] Behind RoPE: How Does Causal Mask Encode Positional Information?

Junu Kim, Xiao Liu, Zhenghao Lin, Lei Ji, Yeyun Gong, Edward Choi

Main category: cs.CL

TL;DR: The causal mask in Transformer decoders provides positional information by inducing position-dependent attention patterns that favor nearby query-key pairs, and interacts with explicit positional encodings like RoPE to distort relative attention patterns.

DetailsMotivation: To understand the role of causal mask as a source of positional information in Transformer decoders, alongside explicit positional encodings like RoPE, and how their interaction affects attention patterns.

Method: Theoretical analysis of causal mask’s effect on attention scores, empirical analysis of trained models, and examination of interaction between causal mask and RoPE in modern large language models.

Result: Causal mask induces position-dependent attention patterns favoring nearby pairs, and its interaction with RoPE distorts RoPE’s relative attention patterns into non-relative ones, observed consistently in modern LLMs.

Conclusion: The causal mask should be considered as an important source of positional information alongside explicit positional encodings, as it significantly impacts attention patterns in Transformer decoders.

Abstract: While explicit positional encodings such as RoPE are a primary source of positional information in Transformer decoders, the causal mask also provides positional information. In this work, we prove that the causal mask can induce position-dependent patterns in attention scores, even without parameters or causal dependency in the input. Our theoretical analysis indicates that the induced attention pattern tends to favor nearby query-key pairs, mirroring the behavior of common positional encodings. Empirical analysis confirms that trained models exhibit the same behavior, with learned parameters further amplifying these patterns. Notably, we found that the interaction of causal mask and RoPE distorts RoPE’s relative attention score patterns into non-relative ones. We consistently observed this effect in modern large language models, suggesting the importance of considering the causal mask as a source of positional information alongside explicit positional encodings.
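
The zero-parameter case admits a tiny numerical demonstration. In this toy sketch (not the paper's code), all query-key scores are equal, as if the input carried no positional signal at all, yet the causal mask alone yields attention weights determined purely by position.

```python
# With identical tokens (all pairwise scores equal), row i of the masked
# softmax spreads weight uniformly over keys 0..i, so the weight given to any
# fixed key shrinks as the query moves later in the sequence.
import numpy as np

T = 5
scores = np.zeros((T, T))                      # identical inputs -> equal scores
mask = np.triu(np.full((T, T), -np.inf), k=1)  # causal mask: block future keys
attn = np.exp(scores + mask)
attn /= attn.sum(axis=1, keepdims=True)

print(np.round(attn, 3))
# Row i is [1/(i+1), ..., 1/(i+1), 0, ..., 0]: a pattern determined purely by
# position, with no parameters or causal dependency in the input.
```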

[46] When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following

Keno Harada, Yudai Yamazaki, Masachika Taniguchi, Edison Marrese-Taylor, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

Main category: cs.CL

TL;DR: This paper introduces two benchmarks (ManyIFEval and StyleMBPP) to evaluate LLMs’ ability to follow multiple instructions, showing performance degrades with more instructions, and develops regression models that can predict performance on unseen instruction combinations with ~10% error.

DetailsMotivation: As LLMs are increasingly used in real-world scenarios, it's crucial to understand their ability to follow multiple instructions simultaneously, but evaluating all possible instruction combinations is computationally impractical.

Method: Created two specialized benchmarks: ManyIFEval for text generation (up to 10 instructions) and StyleMBPP for code generation (up to 6 instructions). Developed three regression models to estimate performance on unseen instruction combinations, with logistic regression using instruction count as key variable.

Result: Experiments across ten LLMs show performance consistently degrades as instruction count increases. The logistic regression model can predict performance with approximately 10% error using modest sample sizes (500 for ManyIFEval, 300 for StyleMBPP).

Conclusion: The proposed benchmarks and regression models enable efficient evaluation of LLMs’ multi-instruction following capabilities, providing practical tools for assessing real-world performance without exhaustive testing of all possible instruction combinations.

Abstract: As large language models (LLMs) are increasingly applied to real-world scenarios, it becomes crucial to understand their ability to follow multiple instructions simultaneously. To systematically evaluate these capabilities, we introduce two specialized benchmarks for fundamental domains where following multiple instructions is important: Many Instruction-Following Eval (ManyIFEval) for text generation with up to ten instructions, and Style-aware Mostly Basic Programming Problems (StyleMBPP) for code generation with up to six instructions. Our experiments with the created benchmarks across ten LLMs reveal that performance consistently degrades as the number of instructions increases. Furthermore, because evaluating all possible combinations of multiple instructions is computationally impractical in actual use cases, we developed three types of regression models that can estimate performance on both unseen instruction combinations and numbers of instructions not used during training. We demonstrate that a logistic regression model using instruction count as an explanatory variable can predict the performance of following multiple instructions with approximately 10% error, even for unseen instruction combinations. We show that relatively modest sample sizes (500 for ManyIFEval and 300 for StyleMBPP) are sufficient for performance estimation, enabling efficient evaluation of LLMs under various instruction combinations.
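
The simplest estimator described above lends itself to a short sketch. The data below is synthetic and the degradation curve is invented for illustration; the point is the shape of the method: fit a logistic regression on per-sample pass/fail outcomes with instruction count as the explanatory variable, then extrapolate to unseen counts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_instructions = rng.integers(1, 8, size=500)             # counts seen in training
true_p = 1 / (1 + np.exp(-(2.0 - 0.5 * n_instructions)))  # synthetic degradation
followed_all = rng.random(500) < true_p                   # pass/fail per sample

model = LogisticRegression().fit(n_instructions.reshape(-1, 1), followed_all)
for k in (9, 10):                                         # unseen instruction counts
    print(k, round(model.predict_proba([[k]])[0, 1], 3))
```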

[47] SoM-1K: A Thousand-Problem Benchmark Dataset for Strength of Materials

Qixin Wan, Zilong Wang, Jingwen Zhou, Wanting Wang, Ziheng Geng, Jiachen Liu, Ran Cao, Minghui Cheng, Lu Cheng

Main category: cs.CL

TL;DR: SoM-1K is the first large-scale multimodal benchmark for evaluating foundation models on strength of materials problems, showing current models struggle with engineering tasks (best accuracy 56.6%) and that text descriptions (DoI) often outperform visual inputs.

DetailsMotivation: Foundation models excel in many domains but their performance on complex multimodal engineering problems remains unexplored, particularly in scientific and engineering contexts requiring robust multimodal reasoning.

Method: Created SoM-1K dataset with 1,065 annotated strength of materials problems containing text and diagrams. Proposed Descriptions of Images (DoI) prompting strategy using expert-generated text descriptions. Evaluated 8 foundation models including LLMs and VLMs.

Result: Current foundation models struggle significantly with engineering problems, with best model achieving only 56.6% accuracy. LLMs with DoI often outperform VLMs with visual diagrams. DoI effectively mitigates visual misinterpretation errors.

Conclusion: The work establishes a rigorous engineering AI benchmark and highlights the critical need for developing more robust multimodal reasoning capabilities in foundation models for scientific and engineering applications.

Abstract: Foundation models have shown remarkable capabilities in various domains, but their performance on complex, multimodal engineering problems remains largely unexplored. We introduce SoM-1K, the first large-scale multimodal benchmark dataset dedicated to evaluating foundation models on problems in the strength of materials (SoM). The dataset, which contains 1,065 annotated SoM problems, mirrors real-world engineering tasks by including both textual problem statements and schematic diagrams. Due to the limited capabilities of current foundation models in understanding complicated visual information, we propose a novel prompting strategy called Descriptions of Images (DoI), which provides rigorous expert-generated text descriptions of the visual diagrams as the context. We evaluate eight representative foundation models, including both large language models (LLMs) and vision language models (VLMs). Our results show that current foundation models struggle significantly with these engineering problems, with the best-performing model achieving only 56.6% accuracy. Interestingly, we found that LLMs, when provided with DoI, often outperform VLMs provided with visual diagrams. A detailed error analysis reveals that DoI plays a crucial role in mitigating visual misinterpretation errors, suggesting that accurate text-based descriptions can be more effective than direct image input for current foundation models. This work establishes a rigorous benchmark for engineering AI and highlights a critical need for developing more robust multimodal reasoning capabilities in foundation models, particularly in scientific and engineering contexts.
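
The DoI strategy is, at bottom, prompt assembly: substitute an expert-written description for the schematic diagram. A minimal sketch, with the helper name and prompt wording as assumptions rather than the authors' template:

```python
def build_doi_prompt(problem_statement: str, diagram_description: str) -> str:
    """Place the expert-written diagram description before the problem text."""
    return (
        "You are solving a strength-of-materials problem.\n"
        f"Diagram description (expert-written): {diagram_description}\n"
        f"Problem: {problem_statement}\n"
        "Reason through the mechanics step by step, then state the final answer."
    )

prompt = build_doi_prompt(
    "Find the maximum bending stress in the beam.",
    "A simply supported beam of length 4 m carries a 10 kN point load at "
    "midspan; the cross-section is rectangular, 100 mm wide by 200 mm deep.",
)
print(prompt)  # feed to any text-only LLM in place of the schematic diagram
```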

[48] Which Cultural Lens Do Models Adopt? On Cultural Positioning Bias and Agentic Mitigation in LLMs

Yixin Wan, Xingrun Chen, Kai-Wei Chang

Main category: cs.CL

TL;DR: LLMs exhibit culture positioning bias, favoring mainstream US perspectives over non-mainstream cultures. The paper proposes CultureLens benchmark and agent-based mitigation methods to address this bias.

DetailsMotivation: To identify and systematically investigate culture positioning bias in LLMs, where models default to mainstream US cultural perspectives while treating other cultures as outsiders, potentially perpetuating fairness issues.

Method: Created CultureLens benchmark with 4000 prompts across 10 diverse cultures. Proposed two mitigation approaches: Fairness Intervention Pillars (FIP) prompt-based method, and Mitigation via Fairness Agents (MFA) framework with single-agent (self-reflection) and multi-agent (planner-critic-refiner hierarchy) pipelines.

Result: Evaluation of 5 state-of-the-art LLMs revealed models adopt insider tones in 88% of US-contexted scripts but mainly outsider stances for less dominant cultures. Agent-based methods proved effective in mitigating these biases.

Conclusion: Agent-based mitigation methods show promise for addressing culture positioning bias in generative LLMs, with multi-agent frameworks providing structured approaches to produce more culturally unbiased content.

Abstract: Large language models (LLMs) have unlocked a wide range of downstream generative applications. However, we found that they also risk perpetuating subtle fairness issues tied to culture, positioning their generations from the perspectives of the mainstream US culture while demonstrating salient externality towards non-mainstream ones. In this work, we identify and systematically investigate this novel culture positioning bias, in which an LLM’s default generative stance aligns with a mainstream view and treats other cultures as outsiders. We propose the CultureLens benchmark with 4000 generation prompts and 3 evaluation metrics for quantifying this bias through the lens of a culturally situated interview script generation task, in which an LLM is positioned as an onsite reporter interviewing local people across 10 diverse cultures. Empirical evaluation on 5 state-of-the-art LLMs reveals a stark pattern: while models adopt insider tones in over 88 percent of US-contexted scripts on average, they disproportionately adopt mainly outsider stances for less dominant cultures. To resolve these biases, we propose 2 inference-time mitigation methods: a baseline prompt-based Fairness Intervention Pillars (FIP) method, and a structured Mitigation via Fairness Agents (MFA) framework consisting of 2 pipelines: (1) MFA-SA (Single-Agent) introduces a self-reflection and rewriting loop based on fairness guidelines. (2) MFA-MA (Multi-Agent) structures the process into a hierarchy of specialized agents: a Planner Agent (initial script generation), a Critique Agent (evaluates initial script against fairness pillars), and a Refinement Agent (incorporates feedback to produce a polished, unbiased script). Empirical results showcase the effectiveness of agent-based methods as a promising direction for mitigating biases in generative LLMs.
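
The MFA-MA hierarchy can be sketched as three prompt templates over one LLM callable. The prompts, the `llm` parameter, and the single-pass control flow below are illustrative assumptions, not the authors' implementation:

```python
from typing import Callable

def mfa_ma(llm: Callable[[str], str], task: str, pillars: list[str]) -> str:
    """One planner -> critique -> refinement pass over a single LLM callable."""
    script = llm(f"Write an onsite-reporter interview script for: {task}")
    critique = llm(
        f"Evaluate this script against the fairness pillars {pillars}; "
        f"list concrete issues.\n\n{script}"
    )
    return llm(
        f"Revise the script to address this feedback.\n"
        f"Feedback:\n{critique}\n\nScript:\n{script}"
    )
```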

[49] PerHalluEval: Persian Hallucination Evaluation Benchmark for Large Language Models

Mohammad Hosseini, Kimia Hosseini, Shayan Bali, Zahra Zanjani, Saeedeh Momtazi

Main category: cs.CL

TL;DR: PerHalluEval is the first Persian-language hallucination evaluation benchmark that uses an LLM-driven pipeline with human validation to detect extrinsic and intrinsic hallucinations in QA and summarization tasks.

DetailsMotivation: Hallucination is a persistent issue in LLMs, particularly affecting low-resource languages like Persian, which lack dedicated evaluation benchmarks.

Method: Three-stage LLM-driven pipeline augmented with human validation, using log probabilities to select believable hallucinated instances, and human annotation for Persian-specific cultural contexts.

Result: Evaluation of 12 LLMs showed models struggle to detect hallucinated Persian text. External knowledge (original documents) partially mitigates hallucination, and Persian-specialized LLMs show no significant advantage.

Conclusion: Current LLMs have limited capability in detecting Persian hallucinations, highlighting the need for language-specific evaluation benchmarks and improved hallucination mitigation strategies.

Abstract: Hallucination is a persistent issue affecting all large language models (LLMs), particularly within low-resource languages such as Persian. PerHalluEval (Persian Hallucination Evaluation) is the first dynamic hallucination evaluation benchmark tailored for the Persian language. Our benchmark leverages a three-stage LLM-driven pipeline, augmented with human validation, to generate plausible answers and summaries for QA and summarization tasks, focusing on detecting extrinsic and intrinsic hallucinations. Moreover, we used the log probabilities of generated tokens to select the most believable hallucinated instances. In addition, we engaged human annotators to highlight Persian-specific contexts in the QA dataset in order to evaluate LLMs’ performance on content specifically related to Persian culture. Our evaluation of 12 LLMs, including open- and closed-source models, using PerHalluEval revealed that the models generally struggle to detect hallucinated Persian text. We showed that providing external knowledge, i.e., the original document for the summarization task, could partially mitigate hallucination. Furthermore, there was no significant difference in terms of hallucination when comparing LLMs specifically trained for Persian with others.
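
The log-probability filter can be approximated in a few lines of Hugging Face code. The sketch below scores each candidate by its mean token log probability under a small causal LM (gpt2 stands in for the actual generator, and the candidate strings are placeholders) and keeps the highest-scoring, i.e. most believable, one:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def mean_logprob(text: str) -> float:
    """Mean log probability the model assigns to each token of `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = logits[:, :-1].log_softmax(-1)              # predicts token t+1
    token_lp = logprobs.gather(2, ids[:, 1:, None]).squeeze(-1)
    return token_lp.mean().item()

candidates = ["Placeholder hallucinated answer A.",        # invented examples
              "Placeholder hallucinated answer B."]
print(max(candidates, key=mean_logprob))                   # most believable one
```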

[50] BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback

Hyunseo Kim, Sangam Lee, Kwangwook Seo, Dongha Lee

Main category: cs.CL

TL;DR: BESPOKE is a realistic benchmark for evaluating personalization in search-augmented LLMs, addressing the gap in systematic evaluation of personalized information-seeking systems.

DetailsMotivation: Current search-augmented LLMs are insufficient for addressing diverse user needs that require recognizing different intents behind the same query and delivering information in preferred forms. While systems like ChatGPT and Gemini attempt personalization using user histories, systematic evaluation of such personalization remains under-explored.

Method: BESPOKE is constructed through long-term, deeply engaged human annotation where human annotators contribute their own chat/search histories, author queries with detailed information needs, and evaluate responses with fine-grained preference scores and diagnostic feedback. The benchmark is designed to be both realistic (using authentic human data) and diagnostic (with detailed preference scoring).

Result: The benchmark enables systematic analyses that reveal key requirements for effective personalization in information-seeking tasks, providing a foundation for fine-grained evaluation of personalized search-augmented LLMs.

Conclusion: BESPOKE addresses the critical gap in evaluating personalization capabilities of search-augmented LLMs and provides a comprehensive framework for assessing how well these systems can adapt to individual user preferences and information needs.

Abstract: Search-augmented large language models (LLMs) have advanced information-seeking tasks by integrating retrieval into generation, reducing users’ cognitive burden compared to traditional search systems. Yet they remain insufficient for fully addressing diverse user needs, which requires recognizing how the same query can reflect different intents across users and delivering information in preferred forms. While recent systems such as ChatGPT and Gemini attempt personalization by leveraging user histories, systematic evaluation of such personalization is under-explored. To address this gap, we propose BESPOKE, a realistic benchmark for evaluating personalization in search-augmented LLMs. BESPOKE is designed to be both realistic, by collecting authentic chat and search histories directly from humans, and diagnostic, by pairing responses with fine-grained preference scores and feedback. The benchmark is constructed through long-term, deeply engaged human annotation, where human annotators contributed their own histories, authored queries with detailed information needs, and evaluated responses with scores and diagnostic feedback. Leveraging BESPOKE, we conduct systematic analyses that reveal key requirements for effective personalization in information-seeking tasks, providing a foundation for fine-grained evaluation of personalized search-augmented LLMs. Our code and data are available at https://augustinlib.github.io/BESPOKE/.

[51] VoiceBBQ: Investigating Effect of Content and Acoustics in Social Bias of Spoken Language Model

Junhyuk Choi, Ro-hoon Oh, Jihwan Seol, Bugeun Kim

Main category: cs.CL

TL;DR: VoiceBBQ is a spoken extension of the BBQ dataset that measures social bias in Spoken Language Models (SLMs) through both content and acoustic aspects by converting text contexts into controlled voice conditions.

DetailsMotivation: To address the dual sources of social bias in Spoken Language Models - content and acoustic aspects - which can emerge differently in speech compared to text, requiring specialized evaluation methods.

Method: Converted every BBQ context into controlled voice conditions to enable per-axis accuracy, bias, and consistency scores comparable to the original text benchmark, then evaluated two SLMs (LLaMA-Omni and Qwen2-Audio) using this dataset.

Result: LLaMA-Omni resists acoustic bias but amplifies gender and accent bias, while Qwen2-Audio substantially dampens these cues while preserving content fidelity, revealing architectural contrasts between models.

Conclusion: VoiceBBQ provides a compact, drop-in testbed for jointly diagnosing content and acoustic bias across spoken language models, enabling comprehensive bias evaluation in speech-based AI systems.

Abstract: We introduce VoiceBBQ, a spoken extension of the BBQ (Bias Benchmark for Question Answering) - a dataset that measures social bias by presenting ambiguous or disambiguated contexts followed by questions that may elicit stereotypical responses. Due to the nature of speech, social bias in Spoken Language Models (SLMs) can emerge from two distinct sources: 1) content aspect and 2) acoustic aspect. The dataset converts every BBQ context into controlled voice conditions, enabling per-axis accuracy, bias, and consistency scores that remain comparable to the original text benchmark. Using VoiceBBQ, we evaluate two SLMs - LLaMA-Omni and Qwen2-Audio - and observe architectural contrasts: LLaMA-Omni resists acoustic bias while amplifying gender and accent bias, whereas Qwen2-Audio substantially dampens these cues while preserving content fidelity. VoiceBBQ thus provides a compact, drop-in testbed for jointly diagnosing content and acoustic bias across spoken language models.

[52] Acoustic-based Gender Differentiation in Speech-aware Language Models

Junhyuk Choi, Jihwan Seol, Nayeon Kim, Chanhee Cho, EunBin Cho, Bugeun Kim

Main category: cs.CL

TL;DR: SpeechLMs exhibit paradoxical gender bias patterns - showing male-oriented responses in gender-stereotypical questions but gender-independent responses in contextually appropriate gender-dependent questions, primarily due to Whisper speech encoders generating male-oriented acoustic tokens.

DetailsMotivation: To systematically analyze acoustic-based gender differentiation in SpeechLMs where identical questions lead to different responses based on speaker's gender, despite these models prioritizing general fairness principles.

Method: Created a new dataset of 9,208 speech samples across three categories (Gender-Independent, Gender-Stereotypical, Gender-Dependent) and evaluated LLaMA-Omni series models, testing with neutral response options and gender neutralization methods.

Result: Models consistently exhibited male-oriented responses in Gender-Stereotypical questions but gender-independent responses in Gender-Dependent questions where differentiation would be appropriate. This paradoxical pattern persists even with neutral options and gender neutralization, primarily stemming from Whisper speech encoders generating male-oriented acoustic tokens.

Conclusion: Current SpeechLMs fail to properly utilize gender information, prioritizing general fairness over contextual appropriateness, revealing the need for more sophisticated techniques to handle gender biases in speech technology.

Abstract: Speech-aware Language Models (SpeechLMs) have fundamentally transformed human-AI interaction by enabling voice-based communication, yet they may exhibit acoustic-based gender differentiation, where identical questions lead to different responses based on the speaker’s gender. This paper proposes a new dataset that enables systematic analysis of this phenomenon, containing 9,208 speech samples across three categories: Gender-Independent, Gender-Stereotypical, and Gender-Dependent. We further evaluated the LLaMA-Omni series and discovered a paradoxical pattern: while overall responses seem identical regardless of gender, the pattern is far from unbiased. Specifically, in Gender-Stereotypical questions, all models consistently exhibited male-oriented responses; meanwhile, in Gender-Dependent questions, where gender differentiation would be contextually appropriate, models instead exhibited responses independent of gender. We also confirm that this pattern results neither from the availability of neutral options nor from the perceived gender of a voice: when we allow neutral responses, models tend to respond neutrally in Gender-Dependent questions as well, and the paradoxical pattern persists when gender neutralization methods are applied to the speech. Through comparison between SpeechLMs and their corresponding backbone LLMs, we confirmed that these paradoxical patterns primarily stem from Whisper speech encoders, which generate male-oriented acoustic tokens. These findings reveal that current SpeechLMs may not utilize gender information successfully, prioritizing general fairness principles over contextual appropriateness, and highlight the need for more sophisticated techniques to handle gender information properly in speech technology.

[53] AutoIntent: AutoML for Text Classification

Ilya Alekseev, Roman Solomatin, Darina Rustamova, Denis Kuznetsov

Main category: cs.CL

TL;DR: AutoIntent is an automated machine learning tool for text classification that provides end-to-end automation with embedding model selection, classifier optimization, and decision threshold tuning.

DetailsMotivation: To create a comprehensive AutoML solution for text classification that outperforms existing tools and supports multi-label classification and out-of-scope detection.

Method: Uses a modular, sklearn-like interface with automated embedding model selection, classifier optimization, and decision threshold tuning in an end-to-end framework.

Result: Demonstrates superior performance compared to existing AutoML tools on standard intent classification datasets while allowing users to balance effectiveness and resource consumption.

Conclusion: AutoIntent provides an effective automated solution for text classification tasks with better performance than existing tools and flexible resource management.

Abstract: AutoIntent is an automated machine learning tool for text classification tasks. Unlike existing solutions, AutoIntent offers end-to-end automation with embedding model selection, classifier optimization, and decision threshold tuning, all within a modular, sklearn-like interface. The framework is designed to support multi-label classification and out-of-scope detection. AutoIntent demonstrates superior performance compared to existing AutoML tools on standard intent classification datasets and enables users to balance effectiveness and resource consumption.
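
For orientation, here is what the three automated stages look like when written by hand against plain scikit-learn. This does not reproduce AutoIntent's own interface; the fixed threshold stands in for the tuning the tool performs automatically:

```python
# Embed texts, fit a classifier, then apply a decision threshold for
# out-of-scope detection. The tiny dataset and threshold are placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["book a flight", "play some jazz", "order a pizza", "cancel my flight"]
labels = ["travel", "music", "food", "travel"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

probs = clf.predict_proba(["reserve a table for two"])[0]
threshold = 0.5          # AutoIntent tunes this value instead of fixing it
label = clf.classes_[np.argmax(probs)] if probs.max() >= threshold else "out_of_scope"
print(label, round(probs.max(), 3))
```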

[54] Retrieval over Classification: Integrating Relation Semantics for Multimodal Relation Extraction

Lei Hei, Tingjing Liao, Yingxin Pei, Yiyang Qi, Jiaqi Wang, Ruiting Li, Feiliang Ren

Main category: cs.CL

TL;DR: ROC reformulates multimodal relation extraction as a retrieval task using semantic similarity instead of classification, addressing limitations of traditional approaches by incorporating structural constraints and natural language relation descriptions.

DetailsMotivation: Traditional multimodal relation extraction methods use classification-based paradigms with fused features, which overlook structural constraints like entity types and positional cues, and lack semantic expressiveness for fine-grained relation understanding.

Method: ROC integrates entity type and positional information through a multimodal encoder, expands relation labels into natural language descriptions using an LLM, and aligns entity-relation pairs via semantic similarity-based contrastive learning.

Result: The method achieves state-of-the-art performance on benchmark datasets MNRE and MORE, and exhibits stronger robustness and interpretability compared to existing approaches.

Conclusion: Reformulating multimodal relation extraction as a retrieval task driven by relation semantics effectively addresses limitations of classification-based paradigms, leading to improved performance and interpretability.

Abstract: Relation extraction (RE) aims to identify semantic relations between entities in unstructured text. Although recent work extends traditional RE to multimodal scenarios, most approaches still adopt classification-based paradigms with fused multimodal features, representing relations as discrete labels. This paradigm has two significant limitations: (1) it overlooks structural constraints like entity types and positional cues, and (2) it lacks semantic expressiveness for fine-grained relation understanding. We propose Retrieval Over Classification (ROC), a novel framework that reformulates multimodal RE as a retrieval task driven by relation semantics. ROC integrates entity type and positional information through a multimodal encoder, expands relation labels into natural language descriptions using a large language model, and aligns entity-relation pairs via semantic similarity-based contrastive learning. Experiments show that our method achieves state-of-the-art performance on the benchmark datasets MNRE and MORE and exhibits stronger robustness and interpretability.
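
At inference time the retrieval reformulation reduces to nearest-neighbor search over relation descriptions. A minimal sketch using sentence-transformers as a stand-in for the paper's multimodal encoder, with the relation inventory and entity-marker format invented for illustration:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

relation_descriptions = {
    "per:employee_of": "The first entity is employed by the second entity.",
    "per:place_of_birth": "The first entity was born in the second location.",
}

def retrieve_relation(context: str) -> str:
    """Return the relation whose description is most similar to the context."""
    q = encoder.encode([context], normalize_embeddings=True)
    keys = encoder.encode(list(relation_descriptions.values()),
                          normalize_embeddings=True)
    scores = (q @ keys.T)[0]                 # cosine similarity
    return list(relation_descriptions)[int(np.argmax(scores))]

print(retrieve_relation("[E1] Kim [/E1] joined [E2] Samsung [/E2] in 2019."))
```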

[55] Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models

Chantal Shaib, Vinith M. Suriyakumar, Levent Sagun, Byron C. Wallace, Marzyeh Ghassemi

Main category: cs.CL

TL;DR: LLMs can develop spurious correlations between syntactic templates and domains, which can override semantic understanding and affect model performance and safety.

DetailsMotivation: To understand how syntactic templates in training data create domain associations that can override semantic instructions in LLMs, potentially leading to performance issues and safety vulnerabilities.

Method: Used synthetic training datasets to test syntactic-domain correlations, developed an evaluation framework to detect this phenomenon in trained models, and conducted case studies on safety finetuning implications.

Result: Found that syntactic-domain correlations lower performance on entity knowledge tasks (mean 0.51 +/- 0.06) and can be used to bypass refusals in safety-finetuned models like OLMo-2-7B Instruct and GPT-4o.

Conclusion: There is a need to explicitly test for syntactic-domain correlations and ensure syntactic diversity within domains in training data to prevent spurious correlations that affect model reliability and safety.

Abstract: For an LLM to correctly respond to an instruction it must understand both the semantics and the domain (i.e., subject area) of a given task-instruction pair. However, syntax can also convey implicit information. Recent work shows that syntactic templates, frequent sequences of Part-of-Speech (PoS) tags, are prevalent in training data and often appear in model outputs. In this work we characterize syntactic templates, domain, and semantics in task-instruction pairs. We identify cases of spurious correlations between syntax and domain, where models learn to associate a domain with syntax during training; this can sometimes override prompt semantics. Using a synthetic training dataset, we find that the syntactic-domain correlation can lower performance (mean 0.51 +/- 0.06) on entity knowledge tasks in OLMo-2 models (1B-13B). We introduce an evaluation framework to detect this phenomenon in trained models, and show that it occurs on a subset of the FlanV2 dataset in open (OLMo-2-7B; Llama-4-Maverick) and closed (GPT-4o) models. Finally, we present a case study on the implications for safety finetuning, showing that unintended syntactic-domain correlations can be used to bypass refusals in OLMo-2-7B Instruct and GPT-4o. Our findings highlight two needs: (1) to explicitly test for syntactic-domain correlations, and (2) to ensure syntactic diversity in training data, specifically within domains, to prevent such spurious correlations.

[56] Who’s Laughing Now? An Overview of Computational Humour Generation and Explanation

Tyler Loakman, William Thorne, Chenghua Lin

Main category: cs.CL

TL;DR: This paper surveys computational humor in NLP, focusing on humor generation and explanation tasks, and discusses the challenges and future directions for LLMs in understanding and creating humor.

DetailsMotivation: Humor is a fundamental human trait that requires extensive reasoning and common-sense knowledge, making it a challenging task to assess the capabilities of modern LLMs. The paper aims to motivate computational humor as a key subdiscipline of NLP.

Method: The authors conduct a literature survey of computational humor research, particularly focusing on generative tasks like humor creation and explanation. They analyze the current state-of-the-art models and their limitations.

Result: The survey reveals that despite humor understanding being a foundational NLP task, research on generating and explaining humor beyond puns remains limited. Current state-of-the-art models still fall short of human capabilities in humor processing.

Conclusion: The paper emphasizes the importance of computational humor processing as a subdiscipline of NLP and provides extensive discussion of future research directions, considering the subjective and ethically ambiguous nature of humor.

Abstract: The creation and perception of humour is a fundamental human trait, positioning its computational understanding as one of the most challenging tasks in natural language processing (NLP). As an abstract, creative, and frequently context-dependent construct, humour requires extensive reasoning to understand and create, making it a pertinent task for assessing the common-sense knowledge and reasoning abilities of modern large language models (LLMs). In this work, we survey the landscape of computational humour as it pertains to the generative tasks of creation and explanation. We observe that, despite the task of understanding humour bearing all the hallmarks of a foundational NLP task, work on generating and explaining humour beyond puns remains sparse, while state-of-the-art models continue to fall short of human capabilities. We bookend our literature survey by motivating the importance of computational humour processing as a subdiscipline of NLP and presenting an extensive discussion of future directions for research in the area that takes into account the subjective and ethically ambiguous nature of humour.

[57] GEP: A GCG-Based method for extracting personally identifiable information from chatbots built on small language models

Jieli Zhu, Vi Ngoc-Nha Tran

Main category: cs.CL

TL;DR: This paper investigates PII (personally identifiable information) leakage in small language models (SLMs), proposing a new method called GEP that significantly outperforms previous template-based approaches for extracting sensitive data from chatbots.

DetailsMotivation: While SLMs offer comparable performance to LLMs with less computational cost, their vulnerability to PII leakage in downstream tasks remains unexplored. The authors aim to address this security gap.

Method: The authors first fine-tune ChatBioGPT (based on BioGPT) using medical datasets. They then propose GEP, a greedy coordinate gradient-based method specifically designed for PII extraction from SLMs, and test it against template-based methods.

Result: GEP achieves up to 60× more PII leakage detection compared to template-based methods, and maintains a 4.53% leakage rate even in complex free-style insertion scenarios with varied syntactic expressions.

Conclusion: SLMs are vulnerable to PII leakage, and the proposed GEP method effectively demonstrates this vulnerability, highlighting the need for better security measures in small language model deployments.

Abstract: Small language models (SLMs) have become unprecedentedly appealing due to their approximately equivalent performance compared to large language models (LLMs) in certain fields, with less energy and time consumption during training and inference. However, the personally identifiable information (PII) leakage of SLMs in downstream tasks has yet to be explored. In this study, we investigate the PII leakage of a chatbot based on an SLM. We first fine-tune a new chatbot, ChatBioGPT, on the backbone of BioGPT using the medical datasets Alpaca and HealthCareMagic. It shows performance in BERTScore comparable with the previous studies of ChatDoctor and ChatGPT. Based on this model, we show that previous template-based PII attacking methods cannot effectively extract the PII in the dataset for leakage detection under the SLM condition. We then propose GEP, a greedy coordinate gradient-based (GCG) method specifically designed for PII extraction. We conduct experimental studies of GEP, and the results show an increase of up to 60× more leakage compared with the previous template-based methods. We further expand the capability of GEP to a more complicated and realistic situation by conducting free-style insertion, where the inserted PII in the dataset takes the form of various syntactic expressions instead of fixed templates; GEP is still able to reveal a PII leakage rate of up to 4.53%.

[58] Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning

Xiangru Tang, Wanghan Xu, Yujie Wang, Zijie Guo, Daniel Shao, Jiapeng Chen, Cixuan Zhang, Ziyi Wang, Lixin Zhang, Guancheng Wan, Wenlong Zhang, Lei Bai, Zhenfei Yin, Philip Torr, Hanrui Wang, Di Jin

Main category: cs.CL

TL;DR: A unified framework combining implicit retrieval and structured collaboration that addresses bottlenecks in LLM scientific reasoning by eliminating explicit retrieval overhead and improving solution quality through hierarchical refinement.

DetailsMotivation: Current LLM scientific reasoning faces two major bottlenecks: explicit retrieval fragments reasoning (imposing 'tool tax' of extra tokens/steps) and multi-agent pipelines dilute strong solutions through averaging across candidates.

Method: Monitor-based retrieval module operating at token level for implicit knowledge integration, combined with Hierarchical Solution Refinement (HSR) where candidates repair each other, and Quality-Aware Iterative Reasoning (QAIR) that adapts refinement to solution quality.

Result: Achieves 48.3% accuracy on Humanity’s Last Exam Bio/Chem Gold (highest reported), surpassing strongest agent baseline by 13.4 points and frontier LLMs by up to 18.1 points, while reducing token usage by 53.5% and agent steps by 43.7%. Robust across SuperGPQA and TRQA domains.

Conclusion: Implicit augmentation and structured refinement overcome inefficiencies of explicit tool use and uniform aggregation. Error analysis shows reasoning failures and knowledge gaps co-occur in 85%+ cases, with retrieval tasks benefiting from solution variety while reasoning tasks favor consensus.

Abstract: Large language models (LLMs) have recently shown strong progress on scientific reasoning, yet two major bottlenecks remain. First, explicit retrieval fragments reasoning, imposing a hidden “tool tax” of extra tokens and steps. Second, multi-agent pipelines often dilute strong solutions by averaging across all candidates. We address these challenges with a unified framework that combines implicit retrieval and structured collaboration. At its foundation, a Monitor-based retrieval module operates at the token level, integrating external knowledge with minimal disruption to reasoning. On top of this substrate, Hierarchical Solution Refinement (HSR) iteratively designates each candidate as an anchor to be repaired by its peers, while Quality-Aware Iterative Reasoning (QAIR) adapts refinement to solution quality. On Humanity’s Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy – the highest reported to date, surpassing the strongest agent baseline by 13.4 points and leading frontier LLMs by up to 18.1 points, while simultaneously reducing token usage by 53.5% and agent steps by 43.7%. Results on SuperGPQA and TRQA confirm robustness across domains. Error analysis shows that reasoning failures and knowledge gaps co-occur in over 85% of cases, while diversity analysis reveals a clear dichotomy: retrieval tasks benefit from solution variety, whereas reasoning tasks favor consensus. Together, these findings demonstrate how implicit augmentation and structured refinement overcome the inefficiencies of explicit tool use and uniform aggregation. Code is available at: https://github.com/tangxiangru/Eigen-1.
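
The HSR/QAIR loop can be written compactly. In this sketch the `llm` and `quality` callables, the prompts, and the stopping rule are assumptions; what it shows is the structure: each candidate takes a turn as the anchor repaired by its peers, and refinement repeats only while quality is insufficient.

```python
from typing import Callable

def hsr_qair(llm: Callable[[str], str], quality: Callable[[str], float],
             candidates: list[str], threshold: float = 0.9,
             max_rounds: int = 3) -> str:
    """Hierarchical repair of candidates with quality-aware early stopping."""
    for _ in range(max_rounds):
        repaired = []
        for i, anchor in enumerate(candidates):        # each candidate anchors once
            peers = "\n---\n".join(c for j, c in enumerate(candidates) if j != i)
            repaired.append(llm(
                f"Repair this solution using ideas from the peers.\n"
                f"Anchor:\n{anchor}\nPeers:\n{peers}"
            ))
        candidates = repaired
        best = max(candidates, key=quality)
        if quality(best) >= threshold:                 # QAIR: stop when good enough
            return best
    return max(candidates, key=quality)
```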

[59] CLaw: Evaluating LLMs on Chinese Legal Knowledge and Its Application in Reasoning

Xinzhe Xu, Liang Zhao, Hongshen Xu, Chen Chen

Main category: cs.CL

TL;DR: CLaw is a benchmark for evaluating LLMs on Chinese legal knowledge, featuring a comprehensive statute corpus and case-based reasoning tasks. It reveals LLMs’ struggles with legal provision recall and emphasizes the need for accurate knowledge retrieval combined with reasoning capabilities.

DetailsMotivation: LLMs are increasingly used for legal analysis but their reliability is compromised by general pre-training that doesn't specialize in legal texts, obscuring their true legal knowledge depth.

Method: Created CLaw benchmark with two components: (1) comprehensive corpus of 306 Chinese national statutes segmented to subparagraph level with historical revisions (64,849 entries), (2) 254 case-based reasoning instances from China Supreme Court materials.

Result: Most contemporary LLMs significantly struggle to faithfully reproduce legal provisions, which critically undermines their legal reasoning reliability.

Conclusion: Trustworthy legal reasoning in LLMs requires robust synergy of accurate knowledge retrieval (via SFT or RAG) and strong general reasoning capabilities. CLaw provides essential benchmark for advancing domain-specific LLM reasoning.

Abstract: Large Language Models (LLMs) are increasingly tasked with analyzing legal texts and citing relevant statutes, yet their reliability is often compromised by general pre-training that ingests legal texts without specialized focus, obscuring the true depth of their legal knowledge. This paper introduces CLaw, a novel benchmark specifically engineered to meticulously evaluate LLMs on Chinese legal knowledge and its application in reasoning. CLaw comprises two key components: (1) a comprehensive, fine-grained corpus of all 306 Chinese national statutes, segmented to the subparagraph level and incorporating precise historical revision timesteps for rigorous recall evaluation (64,849 entries), and (2) a challenging set of 254 case-based reasoning instances derived from materials curated by the China Supreme Court to assess the practical application of legal knowledge. Our empirical evaluation reveals that most contemporary LLMs significantly struggle to faithfully reproduce legal provisions. As accurate retrieval and citation of legal provisions form the basis of legal reasoning, this deficiency critically undermines the reliability of their responses. We contend that achieving trustworthy legal reasoning in LLMs requires a robust synergy of accurate knowledge retrieval, potentially enhanced through supervised fine-tuning (SFT) or retrieval-augmented generation (RAG), and strong general reasoning capabilities. This work provides an essential benchmark and critical insights for advancing domain-specific LLM reasoning, particularly within the complex legal sphere.

[60] SGMem: Sentence Graph Memory for Long-Term Conversational Agents

Yaxiong Wu, Yongyue Zhang, Sheng Liang, Yong Liu

Main category: cs.CL

TL;DR: SGMem (Sentence Graph Memory) is a novel memory management system for long-term conversational agents that uses sentence-level graphs to organize dialogue across multiple granularities, improving context retrieval and response generation.

DetailsMotivation: Existing memory methods based on fact extraction or summarization struggle to organize and retrieve relevant information across different granularities of dialogue, limiting their effectiveness for long-term conversations that exceed LLM context windows.

Method: SGMem represents dialogue as sentence-level graphs within chunked units, capturing associations across turn-, round-, and session-level contexts. It combines retrieved raw dialogue with generated memory (summaries, facts, insights) to provide coherent context.

Result: Experiments on LongMemEval and LoCoMo benchmarks show that SGMem consistently improves accuracy and outperforms strong baselines in long-term conversational question answering.

Conclusion: SGMem provides an effective solution for memory management in long-term conversational agents by organizing dialogue information across multiple granularities through sentence-level graph representations.

Abstract: Long-term conversational agents require effective memory management to handle dialogue histories that exceed the context window of large language models (LLMs). Existing methods based on fact extraction or summarization reduce redundancy but struggle to organize and retrieve relevant information across different granularities of dialogue and generated memory. We introduce SGMem (Sentence Graph Memory), which represents dialogue as sentence-level graphs within chunked units, capturing associations across turn-, round-, and session-level contexts. By combining retrieved raw dialogue with generated memory such as summaries, facts and insights, SGMem supplies LLMs with coherent and relevant context for response generation. Experiments on LongMemEval and LoCoMo show that SGMem consistently improves accuracy and outperforms strong baselines in long-term conversational question answering.
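
A toy version of the sentence-graph construction using networkx; the association rules here (intra-turn adjacency plus cross-turn content-word overlap) are simplified assumptions standing in for SGMem's turn-, round-, and session-level links:

```python
import networkx as nx

turns = [
    ["I adopted a beagle last month.", "Her name is Luna."],
    ["Luna loves the park.", "We go every Sunday."],
]

def words(s: str) -> set[str]:
    """Lowercased content words, with a tiny stopword list for the example."""
    return set(s.lower().replace(".", "").split()) - {"the", "a", "is", "we"}

G = nx.Graph()
sentences = [(t, s) for t, turn in enumerate(turns) for s in turn]
G.add_nodes_from(s for _, s in sentences)

for turn in turns:                                    # intra-turn adjacency
    for a, b in zip(turn, turn[1:]):
        G.add_edge(a, b, kind="adjacent")

for i, (ti, si) in enumerate(sentences):              # cross-turn associations
    for tj, sj in sentences[i + 1:]:
        if ti != tj and words(si) & words(sj):
            G.add_edge(si, sj, kind="associative")

print(G.number_of_nodes(), G.number_of_edges())       # 4 nodes, 3 edges
```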

[61] Query-Centric Graph Retrieval Augmented Generation

Yaxiong Wu, Jianyuan Bo, Yongyue Zhang, Sheng Liang, Yong Liu

Main category: cs.CL

TL;DR: QCG-RAG introduces a query-centric graph framework for RAG that solves granularity issues by enabling controllable granularity indexing and multi-hop chunk retrieval, outperforming existing methods.

DetailsMotivation: Existing graph-based RAG methods face a granularity dilemma where fine-grained entity-level graphs have high token costs and lose context, while coarse document-level graphs fail to capture nuanced relations needed for multi-hop reasoning.

Method: Uses Doc2Query and Doc2Query– to construct query-centric graphs with controllable granularity, then employs a tailored multi-hop retrieval mechanism to select relevant chunks via generated queries.

Result: Experiments on LiHuaWorld and MultiHop-RAG datasets show QCG-RAG consistently outperforms prior chunk-based and graph-based RAG methods in question answering accuracy.

Conclusion: QCG-RAG establishes a new paradigm for multi-hop reasoning by providing improved graph quality and interpretability through query-centric granularity control.

Abstract: Graph-based retrieval-augmented generation (RAG) enriches large language models (LLMs) with external knowledge for long-context understanding and multi-hop reasoning, but existing methods face a granularity dilemma: fine-grained entity-level graphs incur high token costs and lose context, while coarse document-level graphs fail to capture nuanced relations. We introduce QCG-RAG, a query-centric graph RAG framework that enables query-granular indexing and multi-hop chunk retrieval. Our query-centric approach leverages Doc2Query and Doc2Query-- to construct query-centric graphs with controllable granularity, improving graph quality and interpretability. A tailored multi-hop retrieval mechanism then selects relevant chunks via the generated queries. Experiments on LiHuaWorld and MultiHop-RAG show that QCG-RAG consistently outperforms prior chunk-based and graph-based RAG methods in question answering accuracy, establishing a new paradigm for multi-hop reasoning.
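
A compact sketch of query-centric indexing and multi-hop chunk retrieval. The `doc2query` callable emulates the Doc2Query expansion step, and linking chunks purely through shared query nodes is a simplification of the paper's retrieval mechanism:

```python
import networkx as nx
from typing import Callable

def build_query_centric_graph(chunks: list[str],
                              doc2query: Callable[[str], list[str]]) -> nx.Graph:
    """Chunks and generated queries become nodes; edges link each to its source."""
    G = nx.Graph()
    for i, chunk in enumerate(chunks):
        G.add_node(("chunk", i), text=chunk)
        for q in doc2query(chunk):           # vary queries per chunk to control
            G.add_node(("query", q))         # the indexing granularity
            G.add_edge(("chunk", i), ("query", q))
    return G

def multi_hop_chunks(G: nx.Graph, start_query: str, hops: int = 2) -> list[int]:
    """Hop from a query node through the graph; assumes the query node exists."""
    frontier, seen = {("query", start_query)}, set()
    for _ in range(hops):
        frontier = {n for f in frontier for n in G.neighbors(f)} - seen
        seen |= frontier
    return [i for kind, i in seen if kind == "chunk"]
```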

[62] Un-Doubling Diffusion: LLM-guided Disambiguation of Homonym Duplication

Evgeny Kaskov, Elizaveta Petrova, Petr Surovtsev, Anna Kostikova, Ilya Mistiurin, Alexander Kapitanov, Alexander Nagaev

Main category: cs.CL

TL;DR: This paper addresses homonym duplication in diffusion models, where words with identical spelling but distinct meanings cause models to generate multiple senses simultaneously. The issue is exacerbated by Anglocentric bias in translation pipelines.

DetailsMotivation: Homonyms pose challenges for generative models, causing them to generate multiple meanings simultaneously. The problem is worsened when non-English words become homonyms after translation into English due to Anglocentric bias in text-to-image pipelines.

Method: The authors introduce a method for measuring duplication rates and evaluate different diffusion models using both automatic evaluation with Vision-Language Models (VLM) and human evaluation. They also investigate prompt expansion as a mitigation technique.

Result: The paper demonstrates that prompt expansion effectively reduces homonym duplication, including cases related to Anglocentric bias. The automatic evaluation pipeline code is made publicly available.

Conclusion: Homonym duplication is a significant issue in diffusion models that can be measured and mitigated through prompt expansion techniques, which also help address problems arising from Anglocentric translation bias.

Abstract: Homonyms are words with identical spelling but distinct meanings, which pose challenges for many generative models. When a homonym appears in a prompt, diffusion models may generate multiple senses of the word simultaneously, which is known as homonym duplication. This issue is further complicated by an Anglocentric bias, which includes an additional translation step before the text-to-image model pipeline. As a result, even words that are not homonymous in the original language may become homonyms and lose their meaning after translation into English. In this paper, we introduce a method for measuring duplication rates and conduct evaluations of different diffusion models using both automatic evaluation utilizing Vision-Language Models (VLM) and human evaluation. Additionally, we investigate methods to mitigate the homonym duplication problem through prompt expansion, demonstrating that this approach also effectively reduces duplication related to Anglocentric bias. The code for the automatic evaluation pipeline is publicly available.
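
Measuring the duplication rate reduces to a yes/no judgment per generated image. In this hedged sketch, `vlm_judge` is a placeholder for the paper's automatic VLM evaluator and the question wording is invented:

```python
from typing import Callable

def duplication_rate(images: list, word: str, senses: list[str],
                     vlm_judge: Callable[[object, str], bool]) -> float:
    """Fraction of images in which the judge sees more than one sense of `word`."""
    question = (f"Does this image depict more than one meaning of '{word}' "
                f"(e.g., {', '.join(senses)})? Answer yes or no.")
    verdicts = [vlm_judge(img, question) for img in images]
    return sum(verdicts) / len(verdicts)

# Stub usage: a judge that flags only the first image.
rate = duplication_rate(["img0", "img1"], "bat",
                        ["flying mammal", "baseball bat"],
                        lambda img, q: img == "img0")
print(rate)  # 0.5
```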

[63] LLM Output Homogenization is Task Dependent

Shomik Jain, Jack Lanchantin, Maximilian Nickel, Karen Ullrich, Ashia Wilson, Jamelle Watson-Daniels

Main category: cs.CL

TL;DR: This paper addresses output homogenization in LLMs by proposing a task-dependent approach to evaluate and mitigate homogenization, introducing task-anchored functional diversity and sampling techniques that maintain quality while increasing diversity where needed.

DetailsMotivation: Current approaches to output homogenization fail to account for task-dependent diversity requirements - what constitutes problematic homogenization varies significantly across different task categories (e.g., math vs creative writing).

Method: Developed a task taxonomy with 8 categories, introduced task-anchored functional diversity evaluation, and proposed task-anchored sampling techniques that selectively increase diversity only where homogenization is undesirable.

Result: The approach successfully increases functional diversity for task categories where homogenization is problematic while preserving homogenization where it’s desired, challenging the conventional diversity-quality trade-off.

Conclusion: Task dependence is crucial for properly evaluating and mitigating output homogenization in LLMs, and the proposed framework provides effective tools for achieving this while maintaining response quality.

Abstract: A large language model can be less helpful if it exhibits output response homogenization. But whether two responses are considered homogeneous, and whether such homogenization is problematic, both depend on the task category. For instance, in objective math tasks, we often expect no variation in the final answer but anticipate variation in the problem-solving strategy. For creative writing tasks, by contrast, we may expect variation in key narrative components (e.g., plot, genre, setting), beyond the vocabulary or embedding diversity produced by temperature sampling. Previous work addressing output homogenization often fails to conceptualize diversity in a task-dependent way. We address this gap in the literature directly by making the following contributions. (1) We present a task taxonomy comprised of eight task categories that each have distinct conceptualizations of output homogenization. (2) We introduce task-anchored functional diversity to better evaluate output homogenization. (3) We propose a task-anchored sampling technique that increases functional diversity for task categories where homogenization is undesired, while preserving homogenization where it is desired. (4) We challenge the perceived existence of a diversity-quality trade-off by increasing functional diversity while maintaining response quality. Overall, we demonstrate how task dependence improves the evaluation and mitigation of output homogenization.
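
The task-anchored sampling idea can be sketched as a rejection loop keyed on a task-specific functional extractor. All callables below are placeholders and the acceptance rule is a simplification of the proposed technique:

```python
from typing import Callable

def task_anchored_sample(generate: Callable[[], str],
                         extract_function: Callable[[str], str],
                         diversify: bool, n: int = 4,
                         max_tries: int = 20) -> list[str]:
    """Keep samples whose task-anchored functional component is new."""
    if not diversify:                      # e.g., objective math: one answer
        return [generate()]
    outputs, seen = [], set()
    for _ in range(max_tries):
        out = generate()
        key = extract_function(out)        # e.g., plot or genre for creative writing
        if key not in seen:
            seen.add(key)
            outputs.append(out)
        if len(outputs) == n:
            break
    return outputs
```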

[64] LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text

Irina Tolstykh, Aleksandra Tsybina, Sergey Yakubson, Maksim Kuprashevich

Main category: cs.CL

TL;DR: LLMTrace is a new bilingual corpus for AI-generated text detection that addresses limitations in existing datasets by providing character-level annotations for mixed human-AI authorship scenarios.

DetailsMotivation: Current AI text detection datasets are outdated, predominantly English-only, and lack character-level annotations needed for precise localization of AI-generated segments in mixed authorship texts.

Method: Constructed a large-scale bilingual (English and Russian) corpus using diverse modern proprietary and open-source LLMs, with character-level annotations to support both traditional binary classification and novel AI-generated interval detection.

Result: LLMTrace provides a comprehensive dataset that enables training and evaluation of more nuanced AI detection models capable of handling mixed authorship scenarios.

Conclusion: LLMTrace serves as a vital resource for developing next-generation AI detection systems that can accurately identify and localize AI-generated content in both English and Russian texts.

Abstract: The widespread use of human-like text from Large Language Models (LLMs) necessitates the development of robust detection systems. However, progress is limited by a critical lack of suitable training data; existing datasets are often generated with outdated models, are predominantly in English, and fail to address the increasingly common scenario of mixed human-AI authorship. Crucially, while some datasets address mixed authorship, none provide the character-level annotations required for the precise localization of AI-generated segments within a text. To address these gaps, we introduce LLMTrace, a new large-scale, bilingual (English and Russian) corpus for AI-generated text detection. Constructed using a diverse range of modern proprietary and open-source LLMs, our dataset is designed to support two key tasks: traditional full-text binary classification (human vs. AI) and the novel task of AI-generated interval detection, facilitated by character-level annotations. We believe LLMTrace will serve as a vital resource for training and evaluating the next generation of more nuanced and practical AI detection models. The project page is available at https://sweetdream779.github.io/LLMTrace-info/.
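
The two supported tasks suggest a record layout along these lines; the field names and example spans are assumptions for illustration, not the corpus's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class MixedAuthorshipExample:
    text: str
    label: str                                   # "human", "ai", or "mixed"
    ai_spans: list[tuple[int, int]] = field(default_factory=list)  # char offsets

ex = MixedAuthorshipExample(
    text="I drafted the intro myself. The conclusion was polished by a model.",
    label="mixed",
    ai_spans=[(28, 67)],
)
print(ex.text[slice(*ex.ai_spans[0])])           # the AI-written interval
```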

[65] Bounds of Chain-of-Thought Robustness: Reasoning Steps, Embed Norms, and Beyond

Dingzirui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng

Main category: cs.CL

TL;DR: This paper provides a theoretical analysis of how input perturbations affect Chain-of-Thought (CoT) outputs, deriving an upper bound for acceptable perturbations and proving key relationships with reasoning steps and model parameters.

DetailsMotivation: Existing research shows CoT outputs are sensitive to input perturbations, but there's no theoretical explanation of how these perturbations propagate during reasoning, limiting understanding and improvement of prompt optimization methods.

Method: The authors theoretically analyze the effect of input perturbations on CoT output fluctuation, deriving an upper bound for acceptable perturbations. They apply these conclusions to the Linear Self-Attention (LSA) model and validate with experiments on three datasets and four models.

Result: The analysis proves that: (i) the upper bound for input perturbations is positively correlated with the number of reasoning steps; (ii) even infinite reasoning cannot eliminate perturbation impact; (iii) for LSA models, the bound is negatively correlated with input embedding and hidden state vector norms. Experimental results validate these findings.

Conclusion: The theoretical framework provides insights into how input perturbations propagate through CoT reasoning processes, offering guidance for developing more robust prompt optimization methods and understanding the limitations of reasoning-based approaches.

Abstract: Existing research indicates that the output of Chain-of-Thought (CoT) is significantly affected by input perturbations. Although many methods aim to mitigate such impact by optimizing prompts, a theoretical explanation of how these perturbations influence CoT outputs remains an open area of research. This gap limits our in-depth understanding of how input perturbations propagate during the reasoning process and hinders further improvements in prompt optimization methods. Therefore, in this paper, we theoretically analyze the effect of input perturbations on the fluctuation of CoT outputs. We first derive an upper bound for input perturbations under the condition that the output fluctuation is within an acceptable range, based on which we prove that: (i) This upper bound is positively correlated with the number of reasoning steps in the CoT; (ii) Even an infinitely long reasoning process cannot eliminate the impact of input perturbations. We then apply these conclusions to the Linear Self-Attention (LSA) model, which can be viewed as a simplified version of the Transformer. For the LSA model, we prove that the upper bound for input perturbation is negatively correlated with the norms of the input embedding and hidden state vectors. To validate this theoretical analysis, we conduct experiments on three mainstream datasets and four mainstream models. The experimental results align with our theoretical analysis, empirically demonstrating the correctness of our findings.

[66] DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding

Kin Ian Lo, Hala Hawashin, Mina Abbaszadeh, Tilen Limback-Stokin, Hadi Wazni, Mehrnoosh Sadrzadeh

Main category: cs.CL

TL;DR: DisCoCLIP improves vision-language models by explicitly encoding syntactic structure using tensor networks, achieving better compositional reasoning with fewer parameters.

DetailsMotivation: Current vision-language models fail to capture compositional language structure like word order and predicate-argument relationships, limiting their performance on tasks requiring syntactic understanding.

Method: Combines frozen CLIP vision transformer with a novel tensor network text encoder that parses sentences using Combinatory Categorial Grammar, factorizing high-order tensors for efficiency.

Result: Significantly improves verb semantics and word order sensitivity: raises CLIP’s SVO-Probes verb accuracy from 77.6% to 82.4%, boosts ARO scores by over 9% and 4%, achieves 93.7% on SVO-Swap benchmark.

Conclusion: Explicitly embedding linguistic structure via tensor networks creates interpretable, parameter-efficient representations that substantially enhance compositional reasoning in vision-language tasks.

Abstract: Recent vision-language models excel at large-scale image-text alignment but often neglect the compositional structure of language, leading to failures on tasks that hinge on word order and predicate-argument structure. We introduce DisCoCLIP, a multimodal encoder that combines a frozen CLIP vision transformer with a novel tensor network text encoder that explicitly encodes syntactic structure. Sentences are parsed with a Combinatory Categorial Grammar parser to yield distributional word tensors whose contractions mirror the sentence’s grammatical derivation. To keep the model efficient, high-order tensors are factorized with tensor decompositions, reducing parameter count from tens of millions to under one million. Trained end-to-end with a self-supervised contrastive loss, DisCoCLIP markedly improves sensitivity to verb semantics and word order: it raises CLIP’s SVO-Probes verb accuracy from 77.6% to 82.4%, boosts ARO attribution and relation scores by over 9% and 4%, and achieves 93.7% on a newly introduced SVO-Swap benchmark. These results demonstrate that embedding explicit linguistic structure via tensor networks yields interpretable, parameter-efficient representations that substantially improve compositional reasoning in vision-language tasks.

[67] The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages

Pranjal A. Chitale, Varun Gumma, Sanchit Ahuja, Prashant Kodali, Manan Uppadhyay, Deepthi Sudharsan, Sunayana Sitaram

Main category: cs.CL

TL;DR: Updesh is a large-scale synthetic instruction-following dataset for 13 Indian languages created using a bottom-up generation strategy that grounds data in language-specific Wikipedia content, showing significant improvements in multilingual AI performance particularly for low-resource languages.

DetailsMotivation: To address the challenge of developing AI systems that operate effectively across languages while remaining culturally grounded, especially in low-resource settings, by exploring synthetic data generation in multilingual and multicultural contexts.

Method: Bottom-up generation strategy using large open-source LLMs (>=235B parameters) to create culturally contextualized datasets grounded in language-specific Wikipedia content, producing 9.5M data points across 13 Indian languages with diverse reasoning and generative tasks.
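
A minimal sketch of the bottom-up grounding step (the prompt wording and fields are assumptions, not the released pipeline):

```python
def grounding_prompt(wiki_passage: str, language: str, task_type: str) -> str:
    # Ground generation in language-specific Wikipedia text rather than
    # translating English synthetic data (the top-down alternative).
    return (f"Using only the following {language} Wikipedia passage, write a "
            f"{task_type} instruction and its answer in {language}, keeping the "
            f"passage's cultural references intact.\n\nPassage:\n{wiki_passage}")

print(grounding_prompt("<passage text>", "Hindi", "multi-turn reasoning"))
```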

Result: Comprehensive evaluation shows high-quality generated data with models trained on Updesh achieving significant gains on generative tasks and remaining competitive on multiple-choice NLU tasks, with most pronounced improvements in low and medium-resource languages.

Conclusion: Effective multilingual AI requires multi-faceted data curation and generation strategies that incorporate context-aware, culturally grounded methodologies, as demonstrated by the success of the bottom-up approach in narrowing performance gaps between high and low-resource languages.

Abstract: Developing AI systems that operate effectively across languages while remaining culturally grounded is a long-standing challenge, particularly in low-resource settings. Synthetic data provides a promising avenue, yet its effectiveness in multilingual and multicultural contexts remains underexplored. We investigate the creation and impact of synthetic, culturally contextualized datasets for Indian languages through a bottom-up generation strategy that prompts large open-source LLMs (>= 235B parameters) to ground data generation in language-specific Wikipedia content. This approach complements the dominant top-down paradigm of translating synthetic datasets from high-resource languages such as English. We introduce Updesh, a high-quality large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages, encompassing diverse reasoning and generative tasks with an emphasis on long-context, multi-turn capabilities, and alignment with Indian cultural contexts. A comprehensive evaluation incorporating both automated metrics and human annotation across 10k assessments indicates that the generated data is of high quality, though human evaluation highlights areas for further improvement. Additionally, we perform downstream evaluations by fine-tuning models on our dataset and assessing the performance across 15 diverse multilingual datasets. Models trained on Updesh consistently achieve significant gains on generative tasks and remain competitive on multiple-choice style NLU tasks. Notably, relative improvements are most pronounced in low and medium-resource languages, narrowing their gap with high-resource languages. These findings provide empirical evidence that effective multilingual AI requires multi-faceted data curation and generation strategies that incorporate context-aware, culturally grounded methodologies.

[68] Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs

Daniel Vennemeyer, Phan Anh Duong, Tiffany Zhan, Tianyu Jiang

Main category: cs.CL

TL;DR: The paper decomposes sycophancy in LLMs into distinct behaviors (agreement and praise) that can be independently manipulated through linear directions in latent space.

DetailsMotivation: To understand whether sycophantic behaviors in LLMs arise from a single mechanism or multiple distinct processes, and to decompose sycophancy into its constituent components.

Method: Used difference-in-means directions, activation additions, and subspace geometry analysis across multiple models and datasets to identify and manipulate distinct behavioral representations.
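
A minimal sketch of the two core operations, with stand-in activations (layer choice, steering scale, and the data are illustrative assumptions):

```python
import torch

torch.manual_seed(0)
d_model = 512
acts_syco = torch.randn(100, d_model) + 0.5  # stand-in: sycophantic-praise activations
acts_base = torch.randn(100, d_model)        # stand-in: neutral activations

def diff_in_means_direction(pos, neg):
    v = pos.mean(dim=0) - neg.mean(dim=0)
    return v / v.norm()

v_praise = diff_in_means_direction(acts_syco, acts_base)

def steer(hidden, direction, alpha=4.0):
    # Activation addition: shift the residual stream along the behavior
    # direction (a negative alpha suppresses the behavior instead).
    return hidden + alpha * direction

# Independence of behaviors can be probed by checking that two directions,
# e.g. praise vs. agreement, are near-orthogonal (low cosine similarity).
```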

Result: Found that sycophantic agreement, sycophantic praise, and genuine agreement are encoded along distinct linear directions that can be independently amplified or suppressed without affecting each other, with consistent structure across models.

Conclusion: Sycophantic behaviors correspond to distinct, independently steerable representations rather than a single unified mechanism.

Abstract: Large language models (LLMs) often exhibit sycophantic behaviors – such as excessive agreement with or flattery of the user – but it is unclear whether these behaviors arise from a single mechanism or multiple distinct processes. We decompose sycophancy into sycophantic agreement and sycophantic praise, contrasting both with genuine agreement. Using difference-in-means directions, activation additions, and subspace geometry across multiple models and datasets, we show that: (1) the three behaviors are encoded along distinct linear directions in latent space; (2) each behavior can be independently amplified or suppressed without affecting the others; and (3) their representational structure is consistent across model families and scales. These results suggest that sycophantic behaviors correspond to distinct, independently steerable representations.

[69] RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Ellie Evans, Daniel Egert, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev

Main category: cs.CL

TL;DR: RLBFF combines human preferences with rule-based verification to create more interpretable and versatile reward models for LLM alignment, achieving state-of-the-art performance on benchmarks while enabling customizable principle-based inference.

DetailsMotivation: Address limitations of existing RL paradigms: RLHF lacks interpretability and suffers from reward hacking due to subjective human judgments, while RLVR is too narrow with its focus on correctness-based verification. Need a method that captures nuanced response quality beyond mere correctness.

Method: Extracts binary principles (e.g., accuracy, readability) from natural language feedback and uses them to train reward models as entailment tasks. Combines human-driven preferences with rule-based verification for more precise and interpretable reward modeling.
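
A minimal sketch of casting principle checks as entailment-style reward-model examples (the field names and principle list are assumptions):

```python
def to_rm_examples(prompt, response, principle_labels):
    """principle_labels: {principle: True/False}, extracted from natural-language feedback."""
    examples = []
    for principle, satisfied in principle_labels.items():
        examples.append({
            "input": (f"Principle: {principle}\nPrompt: {prompt}\n"
                      f"Response: {response}\nDoes the response satisfy the principle?"),
            "label": 1 if satisfied else 0,  # binary flexible feedback
        })
    return examples

exs = to_rm_examples("Explain mutexes.", "A mutex is a lock that ...",
                     {"accuracy of information": True, "code readability": False})
```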

Result: Achieved top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on the leaderboard as of September 2025). Successfully aligned Qwen3-32B to match/exceed o3-mini and DeepSeek R1 on MT-Bench, WildBench, and Arena Hard v2 at <5% inference cost.

Conclusion: RLBFF provides a superior alternative to traditional RLHF and RLVR by enabling interpretable, customizable reward models that capture nuanced quality aspects while maintaining high performance and cost efficiency.

Abstract: Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost).

[70] SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines

Yizhou Wang, Chen Tang, Han Deng, Jiabei Xiao, Jiaqi Liu, Jianyu Wu, Jun Yao, Pengze Li, Encheng Su, Lintao Wang, Guohang Zhuang, Yuchen Ren, Ben Fei, Ming Hu, Xin Chen, Dongzhan Zhou, Junjun He, Xiangyu Yue, Zhenfei Yin, Jiamin Wu, Qihao Zheng, Yuhao Zhou, Huihui Xu, Chenglong Ma, Yan Lu, Wenlong Zhang, Chunfeng Song, Philip Torr, Shixiang Tang, Xinzhu Ma, Wanli Ouyang, Lei Bai

Main category: cs.CL

TL;DR: A scientific reasoning foundation model that aligns natural language with scientific representations, pretrained on 206B tokens and fine-tuned with SFT, bootstrapping, and RL to support 103 tasks across 5 capability families.

DetailsMotivation: To create a unified model that can handle diverse scientific reasoning tasks by bridging natural language with various scientific formats and representations.

Method: Pretrained on 206B-token corpus (text, sequences, sequence-text pairs), then aligned via supervised fine-tuning on 40M instructions, annealed cold-start bootstrapping for chain-of-thought reasoning, and reinforcement learning with task-specific reward shaping.

Result: The model supports 103 tasks across 5 capability families, broadens instruction coverage, improves cross-domain generalization, and enhances fidelity compared to specialist systems.

Conclusion: Cross-discipline learning strengthens transfer and downstream reliability, and the model, instruction-tuning datasets, and evaluation code are open-sourced for community use.

Abstract: We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports five capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruction-tuning datasets, and evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.

[71] Higher-Order DisCoCat (Peirce-Lambek-Montague semantics)

Alexis Toumi, Giovanni de Felice

Main category: cs.CL

TL;DR: A new higher-order DisCoCat model where word meanings are diagram-valued higher-order functions, enabling diagrammatic treatment of natural language semantics.

DetailsMotivation: To extend categorical compositional distributional models to handle higher-order and non-linear processes in natural language semantics like adverbs, prepositions, negation and quantifiers.

Method: Propose a variant of Montague semantics based on lambda calculus where primitives act on string diagrams rather than logical formulae, with translation from Lambek calculus to Peirce’s system beta.

Result: Developed a purely diagrammatic approach to higher-order semantics, with a proof-of-concept implementation in the DisCoPy Python library.

Conclusion: The new definition provides a diagrammatic framework for natural language semantics that can handle complex linguistic phenomena through higher-order functions on string diagrams.

Abstract: We propose a new definition of higher-order DisCoCat (categorical compositional distributional) models where the meaning of a word is not a diagram, but a diagram-valued higher-order function. Our models can be seen as a variant of Montague semantics based on a lambda calculus where the primitives act on string diagrams rather than logical formulae. As a special case, we show how to translate from the Lambek calculus into Peirce’s system beta for first-order logic. This allows us to give a purely diagrammatic treatment of higher-order and non-linear processes in natural language semantics: adverbs, prepositions, negation and quantifiers. The definition presented in this article comes with a proof-of-concept implementation in DisCoPy, the Python library for string diagrams.

[72] ASCIIEval: Benchmarking Models’ Visual Perception in Text Strings via ASCII Art

Qi Jia, Xiang Yue, Shanshan Huang, Ziheng Qin, Yizhu Liu, Bill Yuchen Lin, Yang You, Guangtao Zhai

Main category: cs.CL

TL;DR: ASCIIEval benchmark evaluates LLMs and MLLMs on ASCII art recognition, revealing proprietary models outperform open-source ones, with performance varying by input modality and ASCII art length.

DetailsMotivation: To assess visual perception capabilities in language models through ASCII art, which represents concepts via character arrangements and exists in both text and image modalities.

Method: Constructed ASCIIEval benchmark with 3K+ samples and training set, tested tens of models across different input modalities (text, image, both) on ASCII art recognition tasks.
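
A hedged sketch of a text-modality query in the ASCIIEval style (the prompt wording, option format, and sample fields are assumptions; the released benchmark defines the real schema):

```python
sample = {
    "ascii": " /\\_/\\\n( o.o )\n > ^ <",
    "options": ["cat", "tree", "boat", "car"],
    "answer": "cat",
}

def build_prompt(s):
    # Recognition framing: the model must name the depicted concept.
    return (f"The following ASCII art depicts one concept.\n\n{s['ascii']}\n\n"
            f"Choose one of: {', '.join(s['options'])}. Answer with a single word.")

print(build_prompt(sample))
```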

Result: Proprietary models achieved over 70% accuracy on some categories (GPT-5 best), while open-source MLLMs trailed their proprietary counterparts by over 20.01% in accuracy. Performance was sensitive to ASCII art length, and no model benefited from receiving both modalities simultaneously.

Conclusion: Current models struggle with ASCII art perception, especially open-source MLLMs, highlighting need for better modality fusion and generalization to special visual artifacts.

Abstract: Perceiving visual semantics embedded within consecutive characters is a crucial yet under-explored capability for both Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs). In this work, we select ASCII art as a representative artifact. It depicts concepts through careful arrangement of characters, which can be formulated in both text and image modalities. We frame the problem as a recognition task and construct a novel benchmark, ASCIIEval. It covers over 3K samples with an elaborate categorization tree, along with a training set for further enhancement. Through a comprehensive analysis of tens of models across different input modalities, our benchmark demonstrates its multi-faceted diagnostic power. Given textual input, language models show their visual perception ability on ASCII art concepts. Proprietary models achieve over 70% accuracy on certain categories, with GPT-5 topping the ranking. For image inputs, we reveal that open-source MLLMs suffer from a trade-off between fine-grained text recognition and collective visual perception. They exhibit limited generalization to this special kind of art, leading to a dramatic gap of over 20.01% in accuracy compared with their proprietary counterparts. Another critical finding is that model performance is sensitive to the length of the ASCII art, with this sensitivity varying across input modalities. Unfortunately, none of the models could successfully benefit from the simultaneous provision of both modalities, highlighting the need for more flexible modality-fusion approaches. Besides, we also introduce approaches for further enhancement and discuss future directions. Resources are available at https://github.com/JiaQiSJTU/VisionInText.

[73] UniHR: Hierarchical Representation Learning for Unified Knowledge Graph Link Prediction

Zhiqiang Liu, Yin Hua, Mingyang Chen, Zhuo Chen, Lei Liang, Huajun Chen, Wen Zhang

Main category: cs.CL

TL;DR: UniHR is a unified hierarchical representation learning framework that handles multiple complex fact types in knowledge graphs (hyper-relational, temporal, and nested facts) through hierarchical data representation and structure learning modules.

DetailsMotivation: Existing KG studies focus on specific fact types and struggle with hierarchical modeling, making them difficult to generalize to real-world scenarios with multiple complex fact types.

Method: Proposes UniHR with two modules: Hierarchical Data Representation (HiDR) that unifies different KG types into triple-based representations, and Hierarchical Structure Learning (HiSL) that incorporates intra-fact and inter-fact message passing.
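
The unification idea behind HiDR can be sketched as reifying each complex fact into a fact node plus plain triples (the relation names and schema below are illustrative assumptions):

```python
def hidr_triples(fact_id, head, relation, tail, qualifiers=None, timestamp=None):
    # Reify one fact as a node, then attach everything to it as ordinary triples.
    # A nested fact would simply relate two such fact nodes.
    triples = [(fact_id, "has_head", head),
               (fact_id, "has_relation", relation),
               (fact_id, "has_tail", tail)]
    for key, value in (qualifiers or {}).items():
        triples.append((fact_id, key, value))              # hyper-relational key-value pairs
    if timestamp is not None:
        triples.append((fact_id, "occurs_at", timestamp))  # temporal facts
    return triples

print(hidr_triples("f1", "Einstein", "educated_at", "ETH Zurich",
                   qualifiers={"degree": "BSc"}, timestamp="1896-1900"))
```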

Result: Extensive experiments on 9 datasets across 5 KG types demonstrate UniHR’s effectiveness and highlight the strong potential of unified representations.

Conclusion: UniHR successfully overcomes limitations of existing approaches by providing a unified framework for hierarchical representation learning across multiple complex fact types in knowledge graphs.

Abstract: Real-world knowledge graphs (KGs) contain not only standard triple-based facts, but also more complex, heterogeneous types of facts, such as hyper-relational facts with auxiliary key-value pairs, temporal facts with additional timestamps, and nested facts that imply relationships between facts. These richer forms of representation have attracted significant attention due to their enhanced expressiveness and capacity to model complex semantics in real-world scenarios. However, most existing studies suffer from two main limitations: (1) they typically focus on modeling only specific types of facts, thus making it difficult to generalize to real-world scenarios with multiple fact types; and (2) they struggle to achieve generalizable hierarchical (inter-fact and intra-fact) modeling due to the complexity of these representations. To overcome these limitations, we propose UniHR, a Unified Hierarchical Representation learning framework, which consists of a learning-optimized Hierarchical Data Representation (HiDR) module and a unified Hierarchical Structure Learning (HiSL) module. The HiDR module unifies hyper-relational KGs, temporal KGs, and nested factual KGs into triple-based representations. Then HiSL incorporates intra-fact and inter-fact message passing, focusing on enhancing both semantic information within individual facts and enriching the structural information between facts. To go beyond the unified method itself, we further explore the potential of unified representation in complex real-world scenarios, including joint modeling of multi-task, compositional and hybrid facts. Extensive experiments on 9 datasets across 5 types of KGs demonstrate the effectiveness of UniHR and highlight the strong potential of unified representations.

[74] Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown

Lifu Tu, Rui Meng, Shafiq Joty, Yingbo Zhou, Semih Yavuz

Main category: cs.CL

TL;DR: This paper investigates factuality issues in long-form text generation by LLMs, finding that factuality declines in later sentences with more unsupported claims. It introduces Self-Known and Self-Unknown metrics for self-evaluation and provides a mathematical framework linking these scores to factuality.

DetailsMotivation: Large language models often produce a mixture of true and false information in long-form generation, lacking factuality especially in extended text outputs.

Method: Analyzed factuality of long-form generation across multiple LLMs (GPT-4, Gemini-1.5-Pro, Claude-3-Opus, Llama-3-70B, Mistral) using Self-Known (supported claims judged correct) and Self-Unknown (unsupported claims judged incorrect) metrics, with additional RAG experiments.

Result: Factuality declines in later sentences with increased unsupported claims. Higher Self-Known correlates with better factuality, while higher Self-Unknown correlates with worse factuality. Mathematical framework: Factuality = (1-Self-Unknown)/(2-Self-Unknown-Self-Known).
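
The reported formula translates directly into code; the toy values below simply illustrate the two correlations (a higher Self-Known score raises the factuality estimate, a higher Self-Unknown score lowers it):

```python
def factuality(self_known: float, self_unknown: float) -> float:
    # Factuality = (1 - Self-Unknown) / (2 - Self-Unknown - Self-Known)
    return (1.0 - self_unknown) / (2.0 - self_unknown - self_known)

print(factuality(0.8, 0.3))  # 0.778
print(factuality(0.9, 0.3))  # 0.875 -> higher Self-Known, higher factuality
print(factuality(0.8, 0.5))  # 0.714 -> higher Self-Unknown, lower factuality
```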

Conclusion: Current LLMs have limitations in long-form generation factuality, and continued research is needed to improve factuality in extended text outputs, as demonstrated by RAG experiments.

Abstract: Large language models (LLMs) have demonstrated strong capabilities in text understanding and generation. However, they often lack factuality, producing a mixture of true and false information, especially in long-form generation. In this work, we investigate the factuality of long-form text generation across various large language models (LLMs), including GPT-4, Gemini-1.5-Pro, Claude-3-Opus, Llama-3-70B, and Mistral. Our analysis reveals that factuality tends to decline in later sentences of the generated text, accompanied by a rise in the number of unsupported claims. Furthermore, we explore the effectiveness of different evaluation settings to assess whether LLMs can accurately judge the correctness of their own outputs: Self-Known (the percentage of supported atomic claims, decomposed from LLM outputs, that the corresponding LLMs judge as correct) and Self-Unknown (the percentage of unsupported atomic claims that the corresponding LLMs judge as incorrect). Empirically, we observe a positive correlation between higher Self-Known scores and improved factuality, whereas higher Self-Unknown scores are associated with reduced factuality. Interestingly, the number of unsupported claims can increase even without significant changes in a model’s self-judgment scores (Self-Known and Self-Unknown), likely as a byproduct of long-form text generation. We also derive a mathematical framework linking Self-Known and Self-Unknown scores to factuality: $\textrm{Factuality}=\frac{1-\textrm{Self-Unknown}}{2-\textrm{Self-Unknown}-\textrm{Self-Known}}$, which aligns with our empirical observations. Additional Retrieval-Augmented Generation (RAG) experiments further highlight the limitations of current LLMs in long-form generation and underscore the need for continued research to improve factuality in long-form text.

[75] Labeling Free-text Data using Language Model Ensembles

Jiaxing Qiu, Dongliang Guo, Natalie Papini, Noelle Peace, Hannah F. Fitterman-Harris, Cheri A. Levinson, Tom Hartvigsen, Teague R. Henry

Main category: cs.CL

TL;DR: A framework using ensemble of locally-deployable LLMs for labeling predetermined topics in free-text data under privacy constraints, achieving better accuracy and precision-sensitivity trade-off than individual LLMs.

DetailsMotivation: Free-text responses in psychological studies provide rich qualitative insights but manual labeling by human coders is labor-intensive. Closed-source LLMs cannot be used due to privacy concerns with sensitive data.

Method: Ensemble approach leveraging heterogeneity of diverse open-source LLMs, using relevancy scoring methodology based on embedding distances between topic descriptions and LLMs’ reasoning. Evaluated on Reddit data and patient free-text responses with human annotations.
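
A minimal sketch of the relevancy-scoring idea: score each model's output by the embedding similarity between the topic description and that model's reasoning, then average across the ensemble (the toy embedder and the aggregation rule are assumptions):

```python
import numpy as np

def toy_embed(text, dim=64):
    # Stand-in for a real sentence embedder.
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.normal(size=dim)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ensemble_relevancy(topic_description, llm_outputs, embed=toy_embed):
    """llm_outputs: list of (label, reasoning_text) pairs from different local LLMs."""
    t = embed(topic_description)
    scores = [cosine(t, embed(reasoning)) for _, reasoning in llm_outputs]
    return float(np.mean(scores))  # continuous score, softer than a dichotomous vote
```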

Result: (1) Heterogeneity in performance among same-sized LLMs; (2) Ensemble achieved highest accuracy and optimal precision-sensitivity trade-off; (3) Relevancy scores showed greater agreement than dichotomous labels, effectively mitigating LLM heterogeneity.

Conclusion: The ensemble framework with locally-deployable LLMs provides an effective solution for labeling free-text data under privacy constraints, balancing agreement and disagreement across models through relevancy scoring.

Abstract: Free-text responses are commonly collected in psychological studies, providing rich qualitative insights that quantitative measures may not capture. Labeling curated topics of research interest in free-text data by multiple trained human coders is typically labor-intensive and time-consuming. Though large language models (LLMs) excel in language processing, LLM-assisted labeling techniques relying on closed-source LLMs cannot be directly applied to free-text data without explicit consent for external use. In this study, we propose a framework for assembling locally-deployable LLMs to enhance the labeling of predetermined topics in free-text data under privacy constraints. Analogous to annotation by multiple human raters, this framework leverages the heterogeneity of diverse open-source LLMs. The ensemble approach seeks a balance between the agreement and disagreement across LLMs, guided by a relevancy scoring methodology that utilizes embedding distances between topic descriptions and LLMs’ reasoning. We evaluated the ensemble approach using both publicly accessible Reddit data from eating disorder related forums, and free-text responses from eating disorder patients, both complemented by human annotations. We found that: (1) there is heterogeneity in the performance of labeling among same-sized LLMs, with some showing low sensitivity but high precision, while others exhibit high sensitivity but low precision. (2) Compared to individual LLMs, the ensemble of LLMs achieved the highest accuracy and optimal precision-sensitivity trade-off in predicting human annotations. (3) The relevancy scores across LLMs showed greater agreement than dichotomous labels, indicating that the relevancy scoring method effectively mitigates the heterogeneity in LLMs’ labeling.

[76] Improving LLM Unlearning Robustness via Random Perturbations

Dang Huu-Tien, Hoang Thanh-Tung, Anh Bui, Minh-Phuong Nguyen, Le-Minh Nguyen, Naoya Inoue

Main category: cs.CL

TL;DR: LLM unlearning methods reduce model robustness by treating forget-tokens as backdoor triggers, and the paper proposes Random Noise Augmentation (RNA) as a defense to improve robustness while maintaining performance.

DetailsMotivation: Current LLM unlearning methods make models vulnerable to disruptions when forget-tokens appear in retain-queries, revealing that unlearning actually poisons models rather than erasing knowledge.

Method: The paper reframes unlearning as backdoor attacks (forget-tokens as triggers) and proposes RNA as a backdoor defense - a lightweight, model-agnostic approach with theoretical guarantees.
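
A minimal sketch of the RNA idea; where the noise is injected (here, input embeddings during the retaining phase) and its scale are assumptions, not the paper's exact recipe:

```python
import torch

def rna_perturb(embeddings: torch.Tensor, sigma: float = 0.01) -> torch.Tensor:
    # Random Noise Augmentation: small Gaussian noise during the retain phase,
    # analogous to a backdoor defense that desensitizes trigger directions.
    return embeddings + sigma * torch.randn_like(embeddings)
```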

Result: Extensive experiments show RNA significantly improves robustness of unlearned models while preserving forget and retain performances.

Conclusion: The backdoor attack-defense framework provides new insights into unlearning mechanisms and directions for improving robustness in future research.

Abstract: Here, we show that current state-of-the-art LLM unlearning methods inherently reduce models’ robustness, causing them to misbehave even when a single non-adversarial forget-token is present in the retain-query. Toward understanding the underlying causes, we propose a novel theoretical framework that reframes the unlearning process as backdoor attacks and defenses: forget-tokens act as backdoor triggers that, when activated in retain-queries, cause disruptions in unlearned models’ behaviors, similar to successful backdoor attacks. In this sense, LLM unlearning methods themselves poison the model, make it more vulnerable to forget-tokens, and hide rather than erase the target knowledge. To mitigate the vulnerability caused by the forgetting process, we reinterpret the retaining process as a backdoor defense and propose Random Noise Augmentation (RNA), a lightweight, model- and method-agnostic approach with theoretical guarantees for improving the robustness of models. Extensive experiments demonstrate that RNA significantly improves the robustness of unlearned models while preserving forget and retain performances. This backdoor attack-defense framework offers insights into the mechanism of unlearning that can shed light on future research directions for improving unlearning robustness.

[77] Quantifying depressive mental states with large language models

Jakub Onysk, Quentin J. M. Huys

Main category: cs.CL

TL;DR: This paper evaluates LLM performance on three critical tests for quantifying depressive symptoms, showing both limitations and conceptual alignment with clinical mental state assessment.

DetailsMotivation: To assess the fundamental limits and potential of LLMs in quantifying mental health states, particularly depressive symptoms, by testing their performance against clinically validated data and interventions.

Method: Three tests: 1) Evaluation on novel ground-truth dataset (n=770) with clinical symptom quantifications and verbal descriptions; 2) Training supervised sparse auto-encoders to capture clinical symptom patterns; 3) Testing LLM responses to emotion induction interventions with 190 participants.
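
A minimal supervised sparse autoencoder (sSAE) sketch: reconstruct hidden states through an L1-sparse code while predicting symptom scores from that code. The sizes, loss weights, and linear prediction head are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSAE(nn.Module):
    def __init__(self, d_model=768, d_code=4096, n_symptoms=9):
        super().__init__()
        self.enc = nn.Linear(d_model, d_code)
        self.dec = nn.Linear(d_code, d_model)
        self.head = nn.Linear(d_code, n_symptoms)  # supervised symptom predictor

    def forward(self, h):
        z = torch.relu(self.enc(h))                # sparse code
        return self.dec(z), self.head(z), z

def ssae_loss(model, h, y, l1=1e-3, sup=1.0):
    h_hat, y_hat, z = model(h)
    return (F.mse_loss(h_hat, h)          # reconstruct the LLM hidden state
            + sup * F.mse_loss(y_hat, y)  # predict clinical symptom scores
            + l1 * z.abs().mean())        # sparsity penalty
```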

Result: LLMs show an upper bound on performance in symptom quantification, sSAE weights can effectively modify clinical patterns, and LLM states respond appropriately to emotion induction interventions.

Conclusion: LLMs demonstrate substantial conceptual alignment with clinical mental state assessment but have hard limits on data requirements for effective quantification of pathological mental states.

Abstract: Large Language Models (LLMs) may have an important role to play in mental health by facilitating the quantification of verbal expressions used to communicate emotions, feelings and thoughts. While there has been substantial and very promising work in this area, the fundamental limits are uncertain. Here, focusing on depressive symptoms, we outline and evaluate LLM performance on three critical tests. The first test evaluates LLM performance on a novel ground-truth dataset from a large human sample (n=770). This dataset is novel as it contains both standard clinically validated quantifications of depression symptoms and specific verbal descriptions of the thoughts related to each symptom by the same individual. The performance of LLMs on this richly informative data shows an upper bound on the performance in this domain, and allows us to examine the extent to which inference about symptoms generalises. Second, we test to what extent the latent structure in LLMs can capture the clinically observed patterns. We train supervised sparse auto-encoders (sSAE) to predict specific symptoms and symptom patterns within a syndrome. We find that sSAE weights can effectively modify the clinical pattern produced by the model, and thereby capture the latent structure of relevant clinical variation. Third, if LLMs correctly capture and quantify relevant mental states, then these states should respond to changes in emotional states induced by validated emotion induction interventions. We show that this holds in a third experiment with 190 participants. Overall, this work provides foundational insights into the quantification of pathological mental states with LLMs, highlighting hard limits on the requirements of the data underlying LLM-based quantification; but also suggesting LLMs show substantial conceptual alignment.

[78] MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task

Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Xin Xu, Mengdi Zhang, Jian Shao, Yueting Zhuang

Main category: cs.CL

TL;DR: MathFimer is a framework for expanding mathematical reasoning steps in LLMs using Fill-in-the-middle approach, improving performance without requiring powerful external models or high computational costs.

DetailsMotivation: Existing step expansion methods for mathematical reasoning in LLMs either need powerful external models or incur substantial computational costs, limiting scalability and practicality.

Method: Decompose solution chains into prefix-suffix pairs, train models to reconstruct missing intermediate steps using curated NuminaMath-FIM dataset, then apply to enhance existing mathematical reasoning datasets by inserting detailed steps.
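
A minimal sketch of building fill-in-the-middle pairs from a step-by-step solution (the tag format and split policy are assumptions):

```python
def fim_pairs(steps):
    """steps: ordered solution steps; the model learns to reconstruct the held-out middle."""
    for i in range(1, len(steps) - 1):
        prompt = ("<prefix>" + " ".join(steps[:i]) + "</prefix>"
                  "<suffix>" + " ".join(steps[i + 1:]) + "</suffix>")
        yield prompt, steps[i]

steps = ["Let x be the number.", "Then 2x + 3 = 11.", "So 2x = 8.", "Thus x = 4."]
for prompt, target in fim_pairs(steps):
    print(target)  # "Then 2x + 3 = 11." and "So 2x = 8."
```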

Result: Models trained on MathFimer-expanded data consistently outperform counterparts trained on original data across multiple benchmarks including GSM8K and MATH.

Conclusion: MathFimer offers a practical, scalable solution for enhancing mathematical reasoning capabilities in LLMs without relying on powerful external models or expensive inference procedures.

Abstract: Mathematical reasoning represents a critical frontier in advancing large language models (LLMs). While step-by-step approaches have emerged as the dominant paradigm for mathematical problem-solving in LLMs, the quality of reasoning steps in training data fundamentally constrains the performance of the models. Recent studies have demonstrated that more detailed intermediate steps can enhance model performance, yet existing methods for step expansion either require more powerful external models or incur substantial computational costs. In this paper, we introduce MathFimer, a novel framework for mathematical reasoning step expansion inspired by the “Fill-in-the-middle” task from code completion. By decomposing solution chains into prefix-suffix pairs and training models to reconstruct missing intermediate steps, we develop a specialized model, MathFimer-7B, on our carefully curated NuminaMath-FIM dataset. We then apply these models to enhance existing mathematical reasoning datasets by inserting detailed intermediate steps into their solution chains, creating MathFimer-expanded versions. Through comprehensive experiments on multiple mathematical reasoning datasets, including MathInstruct and MetaMathQA, we demonstrate that models trained on MathFimer-expanded data consistently outperform their counterparts trained on original data across various benchmarks such as GSM8K and MATH. Our approach offers a practical, scalable solution for enhancing mathematical reasoning capabilities in LLMs without relying on powerful external models or expensive inference procedures.

[79] The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It

Leonardo Bertolazzi, Philipp Mondorf, Barbara Plank, Raffaella Bernardi

Main category: cs.CL

TL;DR: Mechanistic analysis of error detection in LLMs reveals they rely on ‘consistency heads’ for surface-level numerical alignment checks, with computation and validation occurring in different layers, explaining poor error detection capabilities.

DetailsMotivation: To understand the internal mechanisms underlying error detection in LLMs, particularly why they struggle with self-correction despite extensive research on enhancing this capability.

Method: Circuit analysis of four smaller-sized LLMs focusing on simple arithmetic problems, identifying computational subgraphs responsible for detecting arithmetic errors.

Result: All models heavily rely on ‘consistency heads’ that check surface-level numerical alignment; arithmetic computation occurs in higher layers while validation happens in middle layers before final results are encoded.

Conclusion: The structural dissociation between arithmetic computation and validation in different layers explains why smaller LLMs struggle to detect even simple arithmetic errors.

Abstract: The ability of large language models (LLMs) to validate their output and identify potential errors is crucial for ensuring robustness and reliability. However, current research indicates that LLMs struggle with self-correction, encountering significant challenges in detecting errors. While studies have explored methods to enhance self-correction in LLMs, relatively little attention has been given to understanding the models’ internal mechanisms underlying error detection. In this paper, we present a mechanistic analysis of error detection in LLMs, focusing on simple arithmetic problems. Through circuit analysis, we identify the computational subgraphs responsible for detecting arithmetic errors across four smaller-sized LLMs. Our findings reveal that all models heavily rely on “consistency heads”, attention heads that assess surface-level alignment of numerical values in arithmetic solutions. Moreover, we observe that the models’ internal arithmetic computation primarily occurs in higher layers, whereas validation takes place in middle layers, before the final arithmetic results are fully encoded. This structural dissociation between arithmetic computation and validation seems to explain why smaller-sized LLMs struggle to detect even simple arithmetic errors.

[80] Thinking Outside the (Gray) Box: A Context-Based Score for Assessing Value and Originality in Neural Text Generation

Giorgio Franceschelli, Mirco Musolesi

Main category: cs.CL

TL;DR: Proposes a context-based scoring method using information theory to evaluate value and originality in LLM outputs, enabling RL fine-tuning to enhance creativity without compromising quality.

DetailsMotivation: Current LLM outputs lack diversity, and common solutions like higher temperature sampling compromise quality. Need better methods to balance quality and originality in creative AI tasks.

Method: Developed a context-based score based on information theory to quantitatively evaluate value and originality. Used this score as a reward function in reinforcement learning framework to fine-tune LLMs.
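
A hedged sketch of a reward in the spirit described: favor likelihood given the request (value) while penalizing likelihood under the unconditional learned distribution (originality). The exact combination below is an assumption, not the paper's formula:

```python
def creativity_reward(logp_given_context: float, logp_unconditional: float,
                      beta: float = 0.5) -> float:
    value = logp_given_context         # adherence to the request/context
    originality = -logp_unconditional  # surprisal under the learned distribution
    return value + beta * originality
```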

Result: Experiments on creative tasks (poetry generation, math problem solving) showed enhanced value and originality of generated solutions compared to standard approaches.

Conclusion: The proposed scoring and RL framework effectively improves LLM creativity by balancing accuracy with divergence from learned distributions, addressing the quality-originality trade-off.

Abstract: Despite the increasing use of large language models for creative tasks, their outputs often lack diversity. Common solutions, such as sampling at higher temperatures, can compromise the quality of the results. Dealing with this trade-off is still an open challenge in designing AI systems for creativity. Drawing on information theory, we propose a context-based score to quantitatively evaluate value and originality. This score incentivizes accuracy and adherence to the request while fostering divergence from the learned distribution. We show that our score can be used as a reward in a reinforcement learning framework to fine-tune large language models for maximum performance. We validate our strategy through experiments considering a variety of creative tasks, such as poetry generation and math problem solving, demonstrating that it enhances the value and originality of the generated solutions.

[81] JUREX-4E: Juridical Expert-Annotated Four-Element Knowledge Base for Legal Reasoning

Huanghai Liu, Quzhe Huang, Qingjing Chen, Yiran Hu, Jiayu Ma, Yun Liu, Weixing Shen, Yansong Feng

Main category: cs.CL

TL;DR: JUREX-4E is an expert-annotated four-element knowledge base for legal AI that addresses limitations in LLM-generated legal reasoning by providing precise, authoritative annotations for 155 criminal charges based on the Four-Element Theory.

DetailsMotivation: Current LLMs struggle with incomplete and less representative four-element analysis when applied to legal reasoning tasks, limiting their effectiveness in understanding legal texts despite the potential of incorporating legal theories like the Four-Element Theory.

Method: Created JUREX-4E, an expert-annotated knowledge base covering 155 criminal charges using a progressive hierarchical framework grounded in legal source validity and diverse interpretive methods to ensure precision and authority.
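
A hedged sketch of what one four-element record might look like (the field names and example content are illustrative assumptions; see the released dataset for the actual schema):

```python
entry = {
    "charge": "theft",
    "subject": "a natural person who has reached the age of criminal responsibility",
    "object": "another person's ownership of property",
    "subjective_aspect": "direct intent to illegally possess the property",
    "objective_aspect": "secretly taking property of a relatively large amount",
}
```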

Result: Experimental validation shows JUREX-4E significantly improves performance on Similar Charge Disambiguation and Legal Case Retrieval tasks, demonstrating high quality and substantial impact on downstream legal applications.

Conclusion: JUREX-4E provides a high-quality, authoritative resource that advances legal AI applications by addressing the limitations of LLM-generated legal reasoning and enhancing the understanding of legal texts through proper four-element analysis.

Abstract: In recent years, Large Language Models (LLMs) have been widely applied to legal tasks. To enhance their understanding of legal texts and improve reasoning accuracy, a promising approach is to incorporate legal theories. One of the most widely adopted theories is the Four-Element Theory (FET), which defines the crime constitution through four elements: Subject, Object, Subjective Aspect, and Objective Aspect. While recent work has explored prompting LLMs to follow FET, our evaluation demonstrates that LLM-generated four-elements are often incomplete and less representative, limiting their effectiveness in legal reasoning. To address these issues, we present JUREX-4E, an expert-annotated four-element knowledge base covering 155 criminal charges. The annotations follow a progressive hierarchical framework grounded in legal source validity and incorporate diverse interpretive methods to ensure precision and authority. We evaluate JUREX-4E on the Similar Charge Disambiguation task and apply it to Legal Case Retrieval. Experimental results validate the high quality of JUREX-4E and its substantial impact on downstream legal tasks, underscoring its potential for advancing legal AI applications. The dataset and code are available at: https://github.com/THUlawtech/JUREX

[82] Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs

Gaye Colakoglu, Gürkan Solmaz, Jonathan Fürst

Main category: cs.CL

TL;DR: This paper explores layout-aware information extraction from documents using LLMs, addressing challenges in data structuring, model engagement, and output refinement. It introduces LayIE-LLM test suite and OFAT method to optimize LLM configurations for competitive IE performance.

DetailsMotivation: To enable effective information extraction from layout-rich documents using large language models without fine-tuning, overcoming challenges in adapting LLMs to document layouts and achieving performance comparable to specialized models.

Method: Developed LayIE-LLM test suite to benchmark layout-aware IE, investigated design choices including input representation, chunking, prompting, and LLM selection. Used one-factor-at-a-time (OFAT) method to optimize configurations efficiently.
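
A minimal sketch of the OFAT search: start from a baseline configuration and sweep one factor at a time, freezing each winner. The option lists and scorer are stand-ins (the real suite tunes input representation, chunking, prompting, and LLM choice):

```python
from itertools import product

def ofat(space, evaluate):
    config = {k: opts[0] for k, opts in space.items()}  # baseline configuration
    for factor, options in space.items():
        config[factor] = max(options, key=lambda v: evaluate({**config, factor: v}))
    return config

space = {"representation": ["raw", "markdown"],
         "chunking": ["page", "section"],
         "prompt": ["zero-shot", "few-shot"]}

def evaluate(cfg):  # stand-in scorer; the real suite runs the full IE pipeline
    return len(cfg["representation"]) + (cfg["prompt"] == "few-shot")

print(ofat(space, evaluate))
print(sum(len(v) for v in space.values()), "OFAT evals vs",
      len(list(product(*space.values()))), "full-factorial evals")
```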

Result: Optimized LLM configurations achieved 13.3-37.5 F1 points improvement over baseline. OFAT method achieved near-optimal results with only 2.8% of computational cost compared to full factorial exploration.

Conclusion: Well-configured general-purpose LLMs can match specialized model performance for layout-aware IE, providing a cost-effective, fine-tuning-free alternative. The LayIE-LLM test suite enables systematic optimization of LLM-based IE pipelines.

Abstract: This paper defines and explores the design space for information extraction (IE) from layout-rich documents using large language models (LLMs). The three core challenges of layout-aware IE with LLMs are 1) data structuring, 2) model engagement, and 3) output refinement. Our study investigates the sub-problems and methods within these core challenges, such as input representation, chunking, prompting, selection of LLMs, and multimodal models. It examines the effect of different design choices through LayIE-LLM, a new, open-source, layout-aware IE test suite, benchmarking against traditional, fine-tuned IE models. The results on two IE datasets show that LLMs require adjustment of the IE pipeline to achieve competitive performance: the optimized configuration found with LayIE-LLM achieves 13.3–37.5 F1 points more than a general-practice baseline configuration using the same LLM. To find a well-working configuration, we develop a one-factor-at-a-time (OFAT) method that achieves near-optimal results. Our method is only 0.8–1.8 points lower than the best full factorial exploration with a fraction (2.8%) of the required computation. Overall, we demonstrate that, if well-configured, general-purpose LLMs match the performance of specialized models, providing a cost-effective, finetuning-free alternative. Our test-suite is available at https://github.com/gayecolakoglu/LayIE-LLM.

[83] Constructions are Revealed in Word Distributions

Joshua Rozner, Leonie Weissweiler, Kyle Mahowald, Cory Shain

Main category: cs.CL

TL;DR: The paper investigates whether statistical patterns in language models can reveal linguistic constructions, finding that while many constructions are distinguishable through statistical affinity, this signal alone may be insufficient for complete construction identification.

DetailsMotivation: To determine how much information about linguistic constructions (form-meaning pairings) is contained in language distribution, and whether pretrained language models can serve as proxies to reveal these constructions through statistical patterns.

Method: Using a RoBERTa model as a proxy for language distribution, the researchers tested whether constructions appear as patterns of statistical affinity by examining both hard cases (semantically distinct but superficially similar constructions) and schematic constructions (with abstract word class slots).

Result: Many constructions were robustly distinguished in the language model, including challenging cases and schematic constructions. However, qualitative evidence suggests statistical affinity alone may not identify all constructions.

Conclusion: Statistical affinity is likely an important but partial signal available to language learners for acquiring constructions, indicating that distributional learning provides significant but incomplete information about linguistic constructions.

Abstract: Construction grammar posits that constructions, or form-meaning pairings, are acquired through experience with language (the distributional learning hypothesis). But how much information about constructions does this distribution actually contain? Corpus-based analyses provide some answers, but text alone cannot answer counterfactual questions about what caused a particular word to occur. This requires computable models of the distribution over strings – namely, pretrained language models (PLMs). Here, we treat a RoBERTa model as a proxy for this distribution and hypothesize that constructions will be revealed within it as patterns of statistical affinity. We support this hypothesis experimentally: many constructions are robustly distinguished, including (i) hard cases where semantically distinct constructions are superficially similar, as well as (ii) schematic constructions, whose “slots” can be filled by abstract word classes. Despite this success, we also provide qualitative evidence that statistical affinity alone may be insufficient to identify all constructions from text. Thus, statistical affinity is likely an important, but partial, signal available to learners.

[84] Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning

Donghao Huang, Zhaoxia Wang

Main category: cs.CL

TL;DR: DeepSeek-R1 achieves state-of-the-art few-shot sentiment analysis performance with superior efficiency and explainability compared to GPT-4o models.

DetailsMotivation: Balancing accuracy, efficiency, and explainability in LLM-based sentiment analysis remains a critical challenge that needs addressing.

Method: Comprehensive evaluation of DeepSeek-R1 (671B and distilled variants) against GPT-4o and GPT-4o-mini, testing few-shot learning curves on sentiment analysis tasks.

Result: DeepSeek-R1 achieves 91.39% F1 score on 5-class sentiment and 99.31% accuracy on binary tasks with just 5 shots, showing 8x improvement in few-shot efficiency over GPT-4o. Architecture-specific distillation effects observed with 32B Qwen2.5-based model outperforming 70B Llama-based variant by 6.69 percentage points.

Conclusion: DeepSeek-R1 establishes itself as a powerful, interpretable open-source alternative with superior explainability via transparent reasoning traces, despite reduced throughput.

Abstract: Large language models (LLMs) have transformed sentiment analysis, yet balancing accuracy, efficiency, and explainability remains a critical challenge. This study presents the first comprehensive evaluation of DeepSeek-R1–an open-source reasoning model–against OpenAI’s GPT-4o and GPT-4o-mini. We test the full 671B model and its distilled variants, systematically documenting few-shot learning curves. Our experiments show DeepSeek-R1 achieves a 91.39% F1 score on 5-class sentiment and 99.31% accuracy on binary tasks with just 5 shots, an eightfold improvement in few-shot efficiency over GPT-4o. Architecture-specific distillation effects emerge, where a 32B Qwen2.5-based model outperforms the 70B Llama-based variant by 6.69 percentage points. While its reasoning process reduces throughput, DeepSeek-R1 offers superior explainability via transparent, step-by-step traces, establishing it as a powerful, interpretable open-source alternative.

[85] Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment

Ruoxi Cheng, Haoxuan Ma, Weixin Wang, Ranjie Duan, Jiexi Liu, Xiaoshuang Jia, Simeng Qin, Xiaochun Cao, Yang Liu, Xiaojun Jia

Main category: cs.CL

TL;DR: DR-IRL is a novel alignment method that uses dynamic reward scaling based on task difficulty to improve safety alignment in LLMs while maintaining usefulness.

DetailsMotivation: Address two key challenges in LLM alignment: (1) imbalanced safety datasets that neglect long-tail threats, and (2) static reward models that ignore task difficulty, limiting optimization efficiency.

Method: Train category-specific reward models using balanced safety dataset via IRL, then enhance GRPO with dynamic reward scaling that adjusts rewards by task difficulty (text encoder cosine similarity for data-level hardness, reward gaps for model-level responsiveness).
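
A hedged sketch of difficulty-aware reward scaling: up-weight samples that are far from the training distribution (data-level hardness via cosine similarity) and samples the reward model barely separates (model-level responsiveness via reward gaps). The combination function is an assumption:

```python
import math

def scaled_reward(base_reward, sim_to_seen_data, reward_gap, eps=1e-6):
    data_hardness = 1.0 - sim_to_seen_data     # low text-encoder similarity -> harder
    model_hardness = 1.0 / (reward_gap + eps)  # small reward gap -> harder to separate
    return base_reward * (1.0 + data_hardness) * math.tanh(model_hardness)
```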

Result: Extensive experiments show DR-IRL outperforms all baseline methods in safety alignment across various benchmarks and LLMs while maintaining usefulness.

Conclusion: DR-IRL provides an effective solution for dynamic reward adjustment that improves safety alignment performance without compromising model utility.

Abstract: Alignment is vital for safely deploying large language models (LLMs). Existing techniques are either reward-based (train a reward model on preference pairs and optimize with reinforcement learning) or reward-free (directly fine-tune on ranked outputs). Recent research shows that well-tuned reward-based pipelines remain robust, and single-response demonstrations can outperform pairwise preference data. However, two challenges persist: (1) imbalanced safety datasets that overrepresent common hazards while neglecting long-tail threats; and (2) static reward models that ignore task difficulty, limiting optimization efficiency and attainable gains. We propose DR-IRL (Dynamically adjusting Rewards through Inverse Reinforcement Learning). We first train category-specific reward models using a balanced safety dataset covering seven harmful categories via IRL. Then we enhance Group Relative Policy Optimization (GRPO) by introducing dynamic reward scaling, which adjusts rewards by task difficulty: data-level hardness is measured via text-encoder cosine similarity, and model-level responsiveness via reward gaps. Extensive experiments across various benchmarks and LLMs demonstrate that DR-IRL outperforms all baseline methods in safety alignment while maintaining usefulness.

[86] Inference-Time Scaling for Generalist Reward Modeling

Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, Yu Wu

Main category: cs.CL

TL;DR: This paper introduces DeepSeek-GRM, a generalist reward modeling approach that enables inference-time scalability through Self-Principled Critique Tuning (SPCT) and parallel sampling with meta RM guidance.

DetailsMotivation: To address the challenge of obtaining accurate reward signals for LLMs across various domains beyond verifiable questions, and to enable effective inference-time scalability for general queries through improved reward modeling.

Method: Proposes Self-Principled Critique Tuning (SPCT) using online RL to foster scalable reward generation behaviors in generative reward models (GRMs), combined with parallel sampling and meta RM for inference-time scaling.
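
A minimal sketch of meta-RM-guided voting at inference time: sample several GRM judgments in parallel, keep those the meta RM trusts most, and average. The keep-ratio and filtering rule are assumptions:

```python
def grm_vote(rewards, meta_scores, keep_ratio=0.5):
    """rewards: scalar rewards from parallel GRM samples;
    meta_scores: meta-RM quality estimate for each sample."""
    k = max(1, int(len(rewards) * keep_ratio))
    kept = sorted(zip(meta_scores, rewards), reverse=True)[:k]
    return sum(r for _, r in kept) / k  # guided vote = filtered average

print(grm_vote([0.2, 0.9, 0.8, 0.1], [0.3, 0.95, 0.90, 0.2]))  # 0.85
```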

Result: SPCT significantly improves GRM quality and scalability, outperforming existing methods in various RM benchmarks without severe biases, achieving better performance compared to training-time scaling.

Conclusion: DeepSeek-GRM demonstrates effective inference-time scalability but still faces challenges in some tasks, which future work on generalist reward systems could address.

Abstract: Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that proper learning methods could enable effective inference-time scalability. A key challenge of RL is to obtain accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e. the inference-time scalability of generalist RM, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. For the RM approach, we adopt pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling. For the learning method, we propose Self-Principled Critique Tuning (SPCT) to foster scalable reward generation behaviors in GRMs through online RL, to generate principles adaptively and critiques accurately, resulting in DeepSeek-GRM models. Furthermore, for effective inference-time scaling, we use parallel sampling to expand compute usage, and introduce a meta RM to guide the voting process for better scaling performance. Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models in various RM benchmarks without severe biases, and could achieve better performance compared to training-time scaling. DeepSeek-GRM still meets challenges in some tasks, which we believe can be addressed by future efforts in generalist reward systems. The models are released at Hugging Face and ModelScope.

[87] Decoding Open-Ended Information Seeking Goals from Eye Movements in Reading

Cfir Avraham Hadar, Omer Shubi, Yoav Meiri, Amit Heshes, Yevgeni Berzak

Main category: cs.CL

TL;DR: This paper investigates whether open-ended reading goals can be automatically decoded from eye movements using multimodal LLMs, showing considerable success in goal identification and reconstruction.

DetailsMotivation: People approach texts with specific information-seeking goals that guide reading behavior, but current methods cannot automatically detect these goals from eye movements alone.

Method: Developed discriminative and generative multimodal LLMs combining text and eye movement data, evaluated on large-scale English eye tracking data with hundreds of text-specific information seeking tasks.

Result: Considerable success in selecting correct goals among options and progress towards free-form textual reconstruction of goal formulations.

Conclusion: The results enable scientific investigation of goal-driven reading and development of educational/assistive technologies using real-time goal decoding from eye movements.

Abstract: When reading, we often have specific information that interests us in a text. For example, you might be reading this paper because you are curious about LLMs for eye movements in reading, the experimental design, or perhaps you wonder “This sounds like science fiction. Does it actually work?”. More broadly, in daily life, people approach texts with any number of text-specific goals that guide their reading behavior. In this work, we ask, for the first time, whether open-ended reading goals can be automatically decoded solely from eye movements in reading. To address this question, we introduce goal decoding tasks and evaluation frameworks using large-scale eye tracking for reading data in English with hundreds of text-specific information seeking tasks. We develop and compare several discriminative and generative multimodal text and eye movements LLMs for these tasks. Our experiments show considerable success on the task of selecting the correct goal among several options, and even progress towards free-form textual reconstruction of the precise goal formulation. These results open the door for further scientific investigation of goal-driven reading, as well as the development of educational and assistive technologies that will rely on real-time decoding of reader goals from their eye movements.

[88] Ambiguity Resolution in Text-to-Structured Data Mapping

Zhibo Hu, Chen Wang, Yanfeng Shu, Hye-Young Paik, Liming Zhu

Main category: cs.CL

TL;DR: The paper introduces a novel approach to handle ambiguity in natural language for LLM tasks by analyzing representation differences in latent space and using a path kernel-based distance measure to detect ambiguity before mapping to structured data.

DetailsMotivation: Ambiguity in natural language hinders accurate text-to-structured data mapping in LLMs for tasks like tool calling and SQL queries. Existing methods rely on trial-and-error or supervised fine-tuning, which have limitations.

Method: Characterizes representation differences of ambiguous text in latent space, introduces a path kernel-based distance measure over concepts to detect sentence-level ambiguity, and proposes missing concept prediction to improve LLM performance on ambiguous tool calling.

Result: The proposed methods achieve state-of-the-art results in detecting ambiguity and improving LLM performance on ambiguous agentic tool calling tasks.

Conclusion: The approach effectively addresses ambiguity issues in LLM tasks by leveraging latent space analysis and concept-based distance measurements, outperforming existing methods.

Abstract: Ambiguity in natural language is a significant obstacle for achieving accurate text to structured data mapping through large language models (LLMs), which affects the performance of tasks such as mapping text to agentic tool calling and text-to-SQL queries. Existing methods for ambiguity handling either rely on the ReACT framework to obtain correct mappings through trial and error, or on supervised fine-tuning to bias models toward specific tasks. In this paper, we adopt a different approach that characterizes representation differences of ambiguous text in the latent space and leverages these differences to identify ambiguity before mapping it to structured data. To detect sentence-level ambiguity, we focus on the relationship between ambiguous questions and their interpretations. Unlike distances calculated by dense embeddings, we introduce a new distance measure based on a path kernel over concepts. With this measurement, we identify patterns to distinguish ambiguous from unambiguous questions. Furthermore, we propose a method for improving LLM performance on ambiguous agentic tool calling through missing concept prediction. Both achieve state-of-the-art results.

[89] VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models

Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai, Yang Liu, Xin Xu, Mengdi Zhang, Jian Shao, Yongliang Shen, Jun Xiao, Yueting Zhuang

Main category: cs.CL

TL;DR: This paper introduces VerifyBench and VerifyBench-Hard, two benchmarks for evaluating reference-based reward systems in reasoning models, addressing a gap in current evaluation methods.

DetailsMotivation: Existing reward benchmarks don't evaluate reference-based reward systems used in RL training of reasoning models like OpenAI o1 and DeepSeek-R1, leaving researchers with limited understanding of verifier accuracy.

Method: The benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. The paper conducts comprehensive analysis of evaluation results on these benchmarks.

Result: Current models show considerable room for improvement on both VerifyBench and VerifyBench-Hard, especially smaller-scale models.

Conclusion: The proposed benchmarks serve as effective tools for guiding the development of verifier accuracy and reasoning capabilities of models trained via RL in reasoning tasks.

Abstract: Large reasoning models such as OpenAI o1 and DeepSeek-R1 have achieved remarkable performance in the domain of reasoning. A key component of their training is the incorporation of verifiable rewards within reinforcement learning (RL). However, existing reward benchmarks do not evaluate reference-based reward systems, leaving researchers with limited understanding of the accuracy of verifiers used in RL. In this paper, we introduce two benchmarks, VerifyBench and VerifyBench-Hard, designed to assess the performance of reference-based reward systems. These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Current models still show considerable room for improvement on both VerifyBench and VerifyBench-Hard, especially smaller-scale models. Furthermore, we conduct a thorough and comprehensive analysis of evaluation results, offering insights for understanding and developing reference-based reward systems. Our proposed benchmarks serve as effective tools for guiding the development of verifier accuracy and the reasoning capabilities of models trained via RL in reasoning tasks.

[90] UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models

Roman Vashurin, Maiya Goloburda, Preslav Nakov, Maxim Panov

Main category: cs.CL

TL;DR: UNCERTAINTY-LINE is a post-hoc debiasing method that removes length bias from uncertainty quantification in LLMs by regressing uncertainty scores on output length and using residuals as corrected estimates.

DetailsMotivation: Existing uncertainty quantification methods for LLMs rely on token probabilities which introduce length bias, and even length-normalized approaches still exhibit persistent biases that affect reliability assessment of LLM outputs.

Method: Propose UNCERTAINTY-LINE, a simple post-hoc procedure that regresses uncertainty scores on output length and uses the residuals as length-invariant uncertainty estimates. The method is model-agnostic and applicable to various UQ measures.
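
The regress-and-residualize step is simple enough to sketch directly; a minimal NumPy version, assuming raw uncertainty scores and output lengths are available as arrays:

```python
import numpy as np

def uncertainty_line(uncertainty: np.ndarray, length: np.ndarray) -> np.ndarray:
    """Regress uncertainty scores on output length (with intercept) and
    return the residuals as length-invariant uncertainty estimates."""
    X = np.column_stack([np.ones_like(length, dtype=float), length.astype(float)])
    coef, *_ = np.linalg.lstsq(X, uncertainty, rcond=None)
    return uncertainty - X @ coef

# Example: longer outputs tend to get inflated raw uncertainty.
lengths = np.array([10, 50, 200, 400], dtype=float)
raw_u = 0.01 * lengths + np.array([0.2, -0.1, 0.3, 0.0])
print(uncertainty_line(raw_u, lengths))  # length trend removed
```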

Result: Extensive evaluation on machine translation, summarization, and question-answering tasks shows UNCERTAINTY-LINE consistently improves over even nominally length-normalized UQ methods across multiple metrics and models.

Conclusion: UNCERTAINTY-LINE effectively addresses length bias in LLM uncertainty quantification, providing more reliable and length-invariant uncertainty estimates that improve trustworthiness assessment of LLM outputs.

Abstract: Large Language Models (LLMs) have become indispensable tools across various applications, making it more important than ever to ensure the quality and the trustworthiness of their outputs. This has led to growing interest in uncertainty quantification (UQ) methods for assessing the reliability of LLM outputs. Many existing UQ techniques rely on token probabilities, which inadvertently introduces a bias with respect to the length of the output. While some methods attempt to account for this, we demonstrate that such biases persist even in length-normalized approaches. To address the problem, here we propose UNCERTAINTY-LINE (Length-INvariant Estimation), a simple debiasing procedure that regresses uncertainty scores on output length and uses the residuals as corrected, length-invariant estimates. Our method is post-hoc, model-agnostic, and applicable to a range of UQ measures. Through extensive evaluation on machine translation, summarization, and question-answering tasks, we demonstrate that UNCERTAINTY-LINE consistently improves uncertainty estimates over even nominally length-normalized UQ methods, across multiple metrics and models.

[91] InComeS: Integrating Compression and Selection Mechanisms into LLMs for Efficient Model Editing

Shuaiyi Li, Zhisong Zhang, Yang Deng, Chenlong Deng, Tianqing Fang, Hongming Zhang, Haitao Mi, Dong Yu, Wai Lam

Main category: cs.CL

TL;DR: InComeS is a framework that enhances LLMs’ ability to handle multiple edits through KV cache compression and dynamic selection mechanisms, overcoming context window limitations of traditional in-context learning methods.

DetailsMotivation: Existing model editing methods struggle with complex scenarios requiring semantic understanding rather than knowledge recall. In-context learning is promising but limited by LLMs' context window constraints, leading to degraded performance with multiple edits.

Method: InComeS compresses each editing context into KV cache of special gist tokens and adds cross-attention modules to dynamically select relevant information from gist pools, enabling efficient handling of multiple edits without context window restrictions.
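
A toy sketch of the selection step, assuming each edit has already been compressed into a single gist key/value vector; the single-head attention and dimensions are simplifications of InComeS's cross-attention modules:

```python
import torch
import torch.nn.functional as F

def select_from_gist_pool(query: torch.Tensor,
                          gist_keys: torch.Tensor,
                          gist_values: torch.Tensor) -> torch.Tensor:
    """Cross-attend a hidden state (d,) over a pool of per-edit gist
    KV pairs (n_edits, d); returns a weighted mix of edit information."""
    scores = gist_keys @ query / gist_keys.shape[-1] ** 0.5  # (n_edits,)
    weights = F.softmax(scores, dim=-1)
    return weights @ gist_values  # (d,)

d, n_edits = 64, 100  # 100 edits, each compressed to one gist token's KV
h = torch.randn(d)
K, V = torch.randn(n_edits, d), torch.randn(n_edits, d)
edit_info = select_from_gist_pool(h, K, V)
```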

Result: Experiments on diverse model editing benchmarks with various editing formats demonstrate the effectiveness and efficiency of the proposed method.

Conclusion: InComeS provides a flexible framework that enhances LLMs’ editing capabilities through explicit compression and selection mechanisms, effectively overcoming context window limitations while maintaining performance.

Abstract: Although existing model editing methods perform well in recalling exact edit facts, they often struggle in complex scenarios that require deeper semantic understanding rather than mere knowledge regurgitation. Leveraging the strong contextual reasoning abilities of large language models (LLMs), in-context learning (ICL) becomes a promising editing method by comprehending edit information through context encoding. However, this method is constrained by the limited context window of LLMs, leading to degraded performance and efficiency as the number of edits increases. To overcome this limitation, we propose InComeS, a flexible framework that enhances LLMs’ ability to process editing contexts through explicit compression and selection mechanisms. Specifically, InComeS compresses each editing context into the key-value (KV) cache of a special gist token, enabling efficient handling of multiple edits without being restricted by the model’s context window. Furthermore, specialized cross-attention modules are added to dynamically select the most relevant information from the gist pools, enabling adaptive and effective utilization of edit information. We conduct experiments on diverse model editing benchmarks with various editing formats, and the results demonstrate the effectiveness and efficiency of our method.

[92] Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü

Main category: cs.CL

TL;DR: The paper proposes BAM, a Bayesian framework for positional encoding that unifies existing methods and introduces a Generalized Gaussian prior, achieving 500x context length extrapolation with superior retrieval accuracy.

DetailsMotivation: Existing positional encoding methods lack theoretical clarity and rely on limited evaluation metrics for context length extrapolation claims.

Method: Bayesian Attention Mechanism (BAM) formulates positional encoding as a prior in a probabilistic model, unifying methods like NoPE and ALiBi, and introduces a Generalized Gaussian positional prior.
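
Reading the additive attention bias as the log of a distance prior makes the unification concrete: a generalized Gaussian prior yields a bias of the form -(|i-j|/σ)^β, with an ALiBi-style linear penalty recovered at β=1. A minimal sketch, with σ, β, and the single head chosen for illustration:

```python
import torch

def generalized_gaussian_bias(seq_len: int, sigma: float = 32.0,
                              beta: float = 2.0) -> torch.Tensor:
    """Additive attention bias = log of a generalized Gaussian prior over
    relative distance |i-j|: -(|i-j|/sigma)**beta. beta=1 gives an
    ALiBi-style linear penalty; beta=2 is Gaussian."""
    pos = torch.arange(seq_len)
    dist = (pos[:, None] - pos[None, :]).abs().float()
    return -(dist / sigma) ** beta

scores = torch.randn(128, 128)  # raw q·k attention logits
attn = torch.softmax(scores + generalized_gaussian_bias(128), dim=-1)
```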

Result: BAM enables accurate information retrieval at 500x the training context length, outperforming previous state-of-the-art in long context retrieval accuracy while maintaining comparable perplexity with minimal additional parameters.

Conclusion: BAM provides a theoretical foundation for positional encoding that significantly improves long-context generalization capabilities of transformer models.

Abstract: Transformer-based language models rely on positional encoding (PE) to handle token order and support context length extrapolation. However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims. We propose the Bayesian Attention Mechanism (BAM), a theoretical framework that formulates positional encoding as a prior within a probabilistic model. BAM unifies existing methods (e.g., NoPE and ALiBi) and motivates a new Generalized Gaussian positional prior that substantially improves long-context generalization. Empirically, BAM enables accurate information retrieval at $500\times$ the training context length, outperforming previous state-of-the-art context length generalization in long context retrieval accuracy while maintaining comparable perplexity and introducing minimal additional parameters.

[93] BabyLM’s First Constructions: Causal probing provides a signal of learning

Joshua Rozner, Leonie Weissweiler, Cory Shain

Main category: cs.CL

TL;DR: This paper investigates whether language models trained on developmentally plausible data (BabyLM Challenge models) can learn constructions (form-meaning pairings), extending previous work that showed construction learning in models trained on massive datasets.

DetailsMotivation: Previous research showed that pretrained language models learn constructions, but these models were trained on developmentally implausible amounts of data, raising questions about their relevance to human language acquisition. The authors aim to test whether models trained on realistic, human-like data quantities can also learn constructions.

Method: The authors apply Rozner et al.’s (2025) methods to evaluate construction learning in masked language models from the 2024 BabyLM Challenge, which uses developmentally plausible training data quantities.

Result: Even when trained on developmentally plausible data quantities, models successfully learn diverse constructions, including challenging cases that are superficially indistinguishable. Additionally, models with better construction representation perform better on BabyLM benchmarks.

Conclusion: Construction learning occurs even with developmentally plausible training data, and constructional performance appears functionally relevant to overall model performance, supporting the construction grammar hypothesis in more realistic learning scenarios.

Abstract: Construction grammar posits that language learners acquire constructions (form-meaning pairings) from the statistics of their environment. Recent work supports this hypothesis by showing sensitivity to constructions in pretrained language models (PLMs), including one recent study (Rozner et al., 2025) demonstrating that constructions shape RoBERTa’s output distribution. However, models under study have generally been trained on developmentally implausible amounts of data, casting doubt on their relevance to human language learning. Here we use Rozner et al.’s methods to evaluate construction learning in masked language models from the 2024 BabyLM Challenge. Our results show that even when trained on developmentally plausible quantities of data, models learn diverse constructions, even hard cases that are superficially indistinguishable. We further find correlational evidence that constructional performance may be functionally relevant: models that better represent constructions perform better on the BabyLM benchmarks.

[94] Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning

Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, Mengdi Wang

Main category: cs.CL

TL;DR: CURE is a reinforcement learning framework that co-evolves coding and unit test generation capabilities through interaction-based learning without ground-truth supervision, achieving significant improvements in code generation accuracy and efficiency.

DetailsMotivation: To enable flexible and scalable training of code generation models by allowing unit testers to learn directly from coders' mistakes, eliminating the need for ground-truth code supervision.

Method: Uses a dedicated reward design that co-evolves coding and unit test generation capabilities based on interaction outcomes. The framework trains models through reinforcement learning where the unit tester learns from the coder’s errors.
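
One plausible instantiation of interaction-based rewards (the paper's exact reward design differs in detail), assuming a sandboxed executor has already produced a pass/fail matrix of code samples against generated tests:

```python
def interaction_rewards(pass_matrix):
    """pass_matrix[i][j] = True if code sample i passes generated test j.
    Coder samples are rewarded for passing tests; tester samples are
    rewarded for discriminating between code samples (tests that every
    code passes, or every code fails, carry no learning signal)."""
    n_codes = len(pass_matrix)
    n_tests = len(pass_matrix[0])
    coder_r = [sum(row) / n_tests for row in pass_matrix]
    tester_r = []
    for j in range(n_tests):
        passes = sum(pass_matrix[i][j] for i in range(n_codes))
        tester_r.append(0.0 if passes in (0, n_codes) else 1.0)
    return coder_r, tester_r

# 3 code samples x 2 unit tests
matrix = [[True, False], [True, True], [False, False]]
print(interaction_rewards(matrix))
```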

Result: ReasonFlux-Coder-7B and 14B models improved code generation accuracy by 5.3% and Best-of-N accuracy by 9.0%, outperforming similarly sized competitors. The 4B model achieved 64.8% inference efficiency in unit test generation while outperforming Qwen3-4B. The model also serves as an effective reward model for RL on base models.

Conclusion: CURE demonstrates that co-evolution of coding and testing capabilities through interaction-based learning is highly effective, achieving state-of-the-art performance in code generation while being naturally extensible to downstream tasks like test-time scaling and agentic coding.

Abstract: We propose CURE, a novel reinforcement learning framework with a dedicated reward design that co-evolves coding and unit test generation capabilities based on their interaction outcomes, without any ground-truth code as supervision. This approach enables flexible and scalable training and allows the unit tester to learn directly from the coder’s mistakes. Our derived ReasonFlux-Coder-7B and 14B models improve code generation accuracy by 5.3% and Best-of-N accuracy by 9.0% after optimization on Qwen2.5-Instruct models, outperforming similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder. They naturally extend to downstream tasks such as test-time scaling and agentic coding, achieving an 8.1% improvement over the base model. For the long-CoT model, our ReasonFlux-Coder-4B consistently outperforms Qwen3-4B while achieving 64.8% inference efficiency in unit test generation. Notably, we also find that our model can serve as an effective reward model for reinforcement learning on base models. Project: https://github.com/Gen-Verse/CURE

[95] ConsistentChat: Building Skeleton-Guided Consistent Multi-Turn Dialogues for Large Language Models from Scratch

Jiawei Chen, Xinyan Guan, Qianhao Yuan, Guozhao Mo, Weixiang Zhou, Yaojie Lu, Hongyu Lin, Ben He, Le Sun, Xianpei Han

Main category: cs.CL

TL;DR: Skeleton-Guided Multi-Turn Dialogue Generation framework improves multi-turn instruction synthesis by modeling conversational intent and generating structured skeletons, leading to better coherence and task completion in extended conversations.

DetailsMotivation: Current instruction data synthesis methods focus on single-turn instructions and neglect cross-turn coherence, causing context drift and reduced task completion rates in extended conversations.

Method: Two-stage framework: (1) Intent Modeling - assigns conversations to one of nine intent trajectories for coherent information flow; (2) Skeleton Generation - constructs structured user query sequences aligned with modeled intent to guide instruction synthesis. Applied to create ConsistentChat dataset with 15,000 multi-turn conversations.
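
A hedged sketch of the two-stage pipeline, with all prompts and the trajectory label purely illustrative (the paper's nine intent trajectories are not reproduced here):

```python
def synthesize_dialogue(llm, topic, trajectory):
    """Stage 1: draft a skeleton of user queries following the chosen
    intent trajectory. Stage 2: fill in assistant turns query by query.
    `llm` is any prompt-to-text callable; prompts are illustrative."""
    skeleton = llm(
        f"Write 5 successive user questions about {topic} that follow a "
        f"'{trajectory}' intent trajectory, one per line."
    )
    turns = []
    for q in skeleton.splitlines():
        history = "".join(f"User: {u}\nAssistant: {a}\n" for u, a in turns)
        answer = llm(f"{history}User: {q}\nAssistant:")
        turns.append((q, answer))
    return turns
```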

Result: Models fine-tuned on ConsistentChat achieve 20-30% improvement in chat consistency and up to 15% increase in task success rate on Light, Topdial, and MT-Eval benchmarks, significantly outperforming existing datasets.

Conclusion: The proposed skeleton-guided approach effectively addresses cross-turn coherence issues in multi-turn dialogue generation, demonstrating substantial improvements in conversation consistency and task completion.

Abstract: Current instruction data synthesis methods primarily focus on single-turn instructions and often neglect cross-turn coherence, resulting in context drift and reduced task completion rates in extended conversations. To address this limitation, we propose Skeleton-Guided Multi-Turn Dialogue Generation, a framework that constrains multi-turn instruction synthesis by explicitly modeling human conversational intent. It operates in two stages: (1) Intent Modeling, which captures the global structure of human dialogues by assigning each conversation to one of nine well-defined intent trajectories, ensuring a coherent and goal-oriented information flow; and (2) Skeleton Generation, which constructs a structurally grounded sequence of user queries aligned with the modeled intent, thereby serving as a scaffold that constrains and guides the downstream instruction synthesis process. Based on this process, we construct ConsistentChat, a multi-turn instruction dataset with approximately 15,000 multi-turn conversations and 224,392 utterances. Experiments on the Light, Topdial, and MT-Eval benchmarks show that models fine-tuned on ConsistentChat achieve a 20-30% improvement in chat consistency and up to a 15% increase in task success rate, significantly outperforming models trained on existing single-turn and multi-turn instruction datasets.

[96] From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review

Yaohui Zhang, Haijing Zhang, Wenlong Ji, Tianyu Hua, Nick Haber, Hancheng Cao, Weixin Liang

Main category: cs.CL

TL;DR: The paper introduces a novel LLM-based peer review approach using pairwise manuscript comparisons instead of individual scoring, showing improved accuracy in identifying high-impact papers but revealing emergent biases in novelty and institutional representation.

DetailsMotivation: Traditional LLM applications in peer review have focused on replicating existing workflows rather than exploring fundamentally new paradigms. The authors aim to rethink how LLMs can participate in academic review by moving beyond simple substitution of human reviewers.

Method: The proposed mechanism employs LLM agents to perform pairwise comparisons among manuscripts instead of individual scoring. By aggregating outcomes from substantial pairwise evaluations, the approach enables more accurate measurement of relative manuscript quality.
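
Aggregating many pairwise verdicts into a ranking is a standard problem; a minimal Bradley-Terry fit (one reasonable aggregator, not necessarily the paper's) over (winner, loser) outcomes:

```python
import numpy as np

def bradley_terry(n_items, outcomes, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs with the
    standard MM update; a higher strength means a stronger manuscript."""
    wins = np.zeros((n_items, n_items))
    for w, l in outcomes:
        wins[w, l] += 1
    p = np.ones(n_items)
    for _ in range(iters):
        for i in range(n_items):
            total_wins = wins[i].sum()
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                      for j in range(n_items) if j != i)
            if den > 0:
                p[i] = total_wins / den
        p /= p.sum()
    return p

strengths = bradley_terry(3, [(0, 1), (0, 2), (1, 2), (0, 1)])
print(np.argsort(-strengths))  # manuscripts from strongest to weakest
```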

Result: Experiments demonstrate that the comparative approach significantly outperforms traditional rating-based methods in identifying high-impact papers. However, the analysis reveals emergent biases including reduced novelty in research topics and increased institutional imbalance.

Conclusion: The findings highlight both the transformative potential of rethinking peer review with LLMs and critical challenges that future systems must address to ensure equity and diversity in the selection process.

Abstract: The advent of large language models (LLMs) offers unprecedented opportunities to reimagine peer review beyond the constraints of traditional workflows. Despite these opportunities, prior efforts have largely focused on replicating traditional review workflows with LLMs serving as direct substitutes for human reviewers, while limited attention has been given to exploring new paradigms that fundamentally rethink how LLMs can participate in the academic review process. In this paper, we introduce and explore a novel mechanism that employs LLM agents to perform pairwise comparisons among manuscripts instead of individual scoring. By aggregating outcomes from substantial pairwise evaluations, this approach enables a more accurate and robust measure of relative manuscript quality. Our experiments demonstrate that this comparative approach significantly outperforms traditional rating-based methods in identifying high-impact papers. However, our analysis also reveals emergent biases in the selection process, notably a reduced novelty in research topics and an increased institutional imbalance. These findings highlight both the transformative potential of rethinking peer review with LLMs and critical challenges that future systems must address to ensure equity and diversity.

[97] ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge

Zeinab Sadat Taghavi, Ali Modarressi, Yunpu Ma, Hinrich Schütze

Main category: cs.CL

TL;DR: ImpliRet is a benchmark that evaluates retrieval systems on document-side reasoning, where queries are simple but relevance depends on implicit facts in documents involving temporal, arithmetic, and world knowledge relationships.

DetailsMotivation: Current retrieval systems rely on surface-level cues like keyword overlap, and recent benchmarks focus on query-side processing. ImpliRet aims to shift the reasoning challenge to document-side processing to address this gap.

Method: The benchmark uses simple queries where relevance is determined by implicit facts in documents. It evaluates sparse and dense retrievers, as well as long-context models like GPT-o4-mini, on their ability to handle temporal, arithmetic, and world knowledge reasoning.
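
For reference, the headline metric is standard nDCG@10; a minimal implementation for binary relevance:

```python
import math

def ndcg_at_k(ranked_rels, k=10):
    """ranked_rels: relevance (0/1) of retrieved docs in rank order.
    nDCG@k = DCG@k / IDCG@k with DCG = sum(rel_i / log2(i + 2))."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_rels[:k]))
    ideal = sorted(ranked_rels, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# One positive document ranked 5th out of the top 10:
print(round(ndcg_at_k([0, 0, 0, 0, 1, 0, 0, 0, 0, 0]), 4))  # ≈ 0.3869
```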

Result: All evaluated retrievers struggle, with the best nDCG@10 at only 14.91%. Even GPT-o4-mini, with access to 30 documents including the positive one, scores only 55.54%, indicating document-side reasoning remains a significant challenge.

Conclusion: Document-side reasoning is a critical and underaddressed challenge in retrieval systems. The ImpliRet benchmark highlights the limitations of current methods and underscores the need for improved techniques to handle implicit factual relationships in documents.

Abstract: Retrieval systems are central to many NLP pipelines, but often rely on surface-level cues such as keyword overlap and lexical semantic similarity. To evaluate retrieval beyond these shallow signals, recent benchmarks introduce reasoning-heavy queries; however, they primarily shift the burden to query-side processing techniques – like prompting or multi-hop retrieval – that can help resolve complexity. In contrast, we present ImpliRet, a benchmark that shifts the reasoning challenge to document-side processing: The queries are simple, but relevance depends on facts stated implicitly in documents through temporal (e.g., resolving “two days ago”), arithmetic, and world knowledge relationships. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting: the best nDCG@10 is only 14.91%. We also test whether long-context models can overcome this limitation. But even with a short context of only thirty documents, including the positive document, GPT-o4-mini scores only 55.54%, showing that document-side reasoning remains a challenge. Our code is available at github.com/ZeinabTaghavi/IMPLIRET.

[98] When Does Meaning Backfire? Investigating the Role of AMRs in NLI

Junghyun Min, Xiulin Yang, Shira Wein

Main category: cs.CL

TL;DR: Adding AMR to NLI doesn’t help semantic reasoning but amplifies surface differences, misleading models to predict non-entailment incorrectly.

DetailsMotivation: To investigate whether adding semantic information via Abstract Meaning Representation (AMR) helps pretrained language models generalize better in Natural Language Inference tasks.

Method: Integrated AMR into NLI through fine-tuning and prompting settings, conducted ablation studies to analyze the effects.

Result: AMR in fine-tuning hinders model generalization, while prompting with AMR gives slight gains in GPT-4o but amplifies surface-level differences rather than aiding semantic reasoning.

Conclusion: AMR integration doesn’t improve semantic reasoning in NLI and can mislead models by amplifying superficial differences, causing incorrect non-entailment predictions even when core meaning is preserved.

Abstract: Natural Language Inference (NLI) relies heavily on adequately parsing the semantic content of the premise and hypothesis. In this work, we investigate whether adding semantic information in the form of an Abstract Meaning Representation (AMR) helps pretrained language models better generalize in NLI. Our experiments integrating AMR into NLI in both fine-tuning and prompting settings show that the presence of AMR in fine-tuning hinders model generalization while prompting with AMR leads to slight gains in GPT-4o. However, an ablation study reveals that the improvement comes from amplifying surface-level differences rather than aiding semantic reasoning. This amplification can mislead models to predict non-entailment even when the core meaning is preserved.

[99] THCM-CAL: Temporal-Hierarchical Causal Modelling with Conformal Calibration for Clinical Risk Prediction

Xin Zhang, Qiyu Wei, Yingjie Zhu, Fanyi Wu, Sophia Ananiadou

Main category: cs.CL

TL;DR: THCM-CAL is a temporal-hierarchical causal model that integrates structured diagnostic codes and unstructured clinical notes from EHRs using multimodal causal graphs and conformal calibration for reliable clinical risk prediction.

DetailsMotivation: Current approaches either handle structured diagnostic codes and unstructured notes separately or use simplistic fusion strategies that ignore the directional, hierarchical causal interactions between clinical observations and diagnoses across patient admissions.

Method: The framework constructs a multimodal causal graph with nodes representing clinical entities from both modalities (textual propositions from notes and ICD codes). It uses hierarchical causal discovery to infer three types of interactions: intra-slice same-modality sequencing, intra-slice cross-modality triggers, and inter-slice risk propagation. Conformal prediction is extended to multi-label ICD coding for calibrated confidence intervals.
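
A minimal sketch of split-conformal calibration extended to multi-label ICD coding, assuming per-code probability scores and a held-out calibration set; THCM-CAL's actual procedure additionally models code co-occurrence:

```python
import numpy as np

def conformal_thresholds(cal_scores, cal_labels, alpha=0.1):
    """Per-code split conformal calibration. The nonconformity of a true
    code is 1 - score; the adjusted (1-alpha) quantile of these values
    gives a per-code threshold with marginal coverage >= 1 - alpha."""
    n_labels = cal_labels.shape[1]
    thresholds = np.ones(n_labels)
    for j in range(n_labels):
        pos = cal_scores[cal_labels[:, j] == 1, j]
        n = len(pos)
        if n == 0:
            continue  # no calibration positives: keep conservative 1.0
        level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
        thresholds[j] = 1 - np.quantile(1 - pos, level)
    return thresholds

def predict_set(test_scores, thresholds):
    """Include code j whenever its score clears the calibrated threshold."""
    return test_scores >= thresholds
```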

Result: Experimental results on MIMIC-III and MIMIC-IV datasets demonstrate the superiority of THCM-CAL over existing approaches.

Conclusion: THCM-CAL effectively models the complex causal interactions between clinical modalities and provides reliable risk predictions through conformal calibration, showing significant improvements in clinical risk prediction performance.

Abstract: Automated clinical risk prediction from electronic health records (EHRs) demands modeling both structured diagnostic codes and unstructured narrative notes. However, most prior approaches either handle these modalities separately or rely on simplistic fusion strategies that ignore the directional, hierarchical causal interactions by which narrative observations precipitate diagnoses and propagate risk across admissions. In this paper, we propose THCM-CAL, a Temporal-Hierarchical Causal Model with Conformal Calibration. Our framework constructs a multimodal causal graph where nodes represent clinical entities from two modalities: Textual propositions extracted from notes and ICD codes mapped to textual descriptions. Through hierarchical causal discovery, THCM-CAL infers three clinically grounded interactions: intra-slice same-modality sequencing, intra-slice cross-modality triggers, and inter-slice risk propagation. To enhance prediction reliability, we extend conformal prediction to multi-label ICD coding, calibrating per-code confidence intervals under complex co-occurrences. Experimental results on MIMIC-III and MIMIC-IV demonstrate the superiority of THCM-CAL.

[100] A Simple “Motivation” Can Enhance Reinforcement Finetuning of Large Reasoning Models

Junjie Zhang, Guozheng Ma, Shunyu Liu, Haoyu Wang, Jiaxing Huang, Ting-En Lin, Fei Huang, Yongbin Li, Dacheng Tao

Main category: cs.CL

TL;DR: MeRF enhances reinforcement finetuning by incorporating reward specifications into prompts, allowing LLMs to understand optimization objectives and achieve better performance than traditional RLVR methods.

DetailsMotivation: Current RLVR methods are inefficient due to trial-and-error learning without understanding reward patterns. Verifiable rewards enable natural language descriptions of reward functions, and LLMs' strong in-context learning ability suggests they could benefit from understanding the 'rules of the game' during finetuning.

Method: MeRF directly injects reward specifications into prompts as in-context motivation, leveraging LLMs’ in-context learning ability to align generation with optimization objectives. This combines inner motivation from understanding the rules with external reward signals.
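
The intervention itself is just prompt construction; a minimal sketch in which the reward description wording is purely illustrative:

```python
def merf_prompt(task: str, reward_spec: str) -> str:
    """Prepend a natural-language description of the reward function
    (the 'rules of the game') as in-context motivation for RL finetuning
    rollouts. The reward_spec wording below is illustrative."""
    return (
        "You will be scored as follows:\n"
        f"{reward_spec}\n\n"
        f"Task: {task}\n"
        "Maximize your score while solving the task."
    )

prompt = merf_prompt(
    task="Solve the puzzle and output the final answer after '####'.",
    reward_spec="+1 if the final answer is correct, +0.2 for valid format, "
                "0 otherwise.",
)
```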

Result: Empirical evaluations show MeRF achieves substantial performance gains over RLVR baseline. Ablation studies indicate better performance with greater consistency between motivation and reward function, while models can adapt to misleading motivations through reinforcement finetuning.

Conclusion: MeRF is an effective method that enhances reinforcement finetuning by making LLMs aware of reward functions, demonstrating that understanding the ‘rules of the game’ improves learning efficiency and performance in complex reasoning tasks.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful learn-to-reason paradigm for Large Reasoning Models to tackle complex tasks. However, the current RLVR paradigm is still not efficient enough, as it works in a trial-and-error manner. To perform better, the model needs to explore the reward space by generating numerous responses and learn from fragmented reward signals, blind to the overall reward patterns. Fortunately, verifiable rewards make the natural language description of the reward function possible, and meanwhile, LLMs have demonstrated strong in-context learning ability. This motivates us to explore if Large Reasoning Models can benefit from a motivation of the task, i.e., awareness of the reward function, during the reinforcement finetuning process, as we humans sometimes do when learning. In this paper, we introduce Motivation-enhanced Reinforcement Finetuning (MeRF), an intuitive yet effective method enhancing reinforcement finetuning of LLMs by “telling LLMs rules of the game”. Specifically, MeRF directly injects the reward specification into the prompt, which serves as an in-context motivation for the model to be aware of the optimization objective. This simple modification leverages the in-context learning ability of LLMs, aligning generation with optimization, thereby incentivizing the model to generate desired outputs from both inner motivation and external reward. Empirical evaluations demonstrate that MeRF achieves substantial performance gains over the RLVR baseline. Moreover, ablation studies show that MeRF performs better with greater consistency between the in-context motivation and the external reward function, while the model also demonstrates an ability to adapt to misleading motivations through reinforcement finetuning.

[101] ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

Jiaru Zou, Ling Yang, Jingwen Gu, Jiahao Qiu, Ke Shen, Jingrui He, Mengdi Wang

Main category: cs.CL

TL;DR: ReasonFlux-PRM is a trajectory-aware Process Reward Model that improves evaluation of intermediate reasoning steps in LLMs, outperforming larger PRMs and human baselines across multiple benchmarks.

DetailsMotivation: Existing PRMs struggle to robustly evaluate intermediate thinking trajectories, especially for trajectory-response outputs from frontier reasoning models like Deepseek-R1.

Method: Incorporates both step-level and trajectory-level supervision for fine-grained reward assignment, adapted for offline/online settings including data selection, RL optimization, and test-time scaling.
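
A minimal sketch of blending step-level and trajectory-level supervision into one reward and using it for Best-of-N selection; the equal weighting is an illustrative assumption, not the paper's:

```python
def trajectory_aware_reward(step_scores, trajectory_score, w_step=0.5):
    """Blend fine-grained step-level scores with a holistic trajectory-level
    score (both assumed in [0, 1]); the 50/50 weight is illustrative."""
    step_part = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return w_step * step_part + (1 - w_step) * trajectory_score

# Reward-guided Best-of-N: keep the highest-scoring of N candidate traces.
candidates = [([0.9, 0.8, 0.7], 0.85), ([0.6, 0.9, 0.9], 0.70)]
best_trace = max(candidates, key=lambda c: trajectory_aware_reward(*c))
```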

Result: ReasonFlux-PRM-7B outperforms Qwen2.5-Math-PRM-72B and human baselines, achieving 12.1% gain in SFT, 4.5% in RL, and 6.3% in test-time scaling on AIME, MATH500, and GPQA-Diamond benchmarks.

Conclusion: The model demonstrates superior performance in selecting high-quality data and improving downstream tasks, with efficient 1.5B version released for resource-constrained applications.

Abstract: Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models like Deepseek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release our efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment. Project: https://github.com/Gen-Verse/ReasonFlux

[102] ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization

YuXuan Zhang

Main category: cs.CL

TL;DR: ARF introduces continuous preference modeling using natural language feedback instead of binary labels, outperforming PPO and DPO by up to 7.6% in alignment tasks.

DetailsMotivation: Current RLHF methods use binary preference labels which are costly and too coarse to capture individual variation, while natural feedback contains richer linguistic patterns.

Method: ARF converts free-form feedback into continuous preference trajectories and optimizes them using the novel TraceBias algorithm.

Result: ARF consistently outperforms PPO and DPO across diverse LLMs and preference domains, improving alignment by up to 7.6%.

Conclusion: Continuous reward modeling provides a scalable path toward personalized and theoretically grounded RLHF.

Abstract: Current RLHF methods such as PPO and DPO typically reduce human preferences to binary labels, which are costly to obtain and too coarse to reflect individual variation. We observe that expressions of satisfaction and dissatisfaction follow stable linguistic patterns across users, indicating that more informative supervisory signals can be extracted from free-form feedback. Building on this insight, we introduce Adaptive Reward-Following (ARF), which converts natural feedback into continuous preference trajectories and optimizes them using the novel TraceBias algorithm. Across diverse LLMs and preference domains, ARF consistently outperforms PPO and DPO, improving alignment by up to 7.6%. Our results demonstrate that continuous reward modeling provides a scalable path toward personalized and theoretically grounded RLHF.

[103] ixi-GEN: Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining

Seonwu Kim, Yohan Na, Kihun Kim, Hanhee Cho, Geun Lim, Mintae Kim, Seongik Park, Ki Hyun Kim, Youngsub Han, Byoung-Ki Jeon

Main category: cs.CL

TL;DR: DACP-based recipe applied to small LLMs achieves substantial domain performance gains while preserving general capabilities, offering cost-efficient enterprise deployment solution.

DetailsMotivation: Many organizations lack infrastructure for large LLMs, making small LLMs a practical alternative despite performance limitations. DACP's utility in commercial applications is under-examined.

Method: Domain Adaptive Continual Pretraining (DACP) applied across diverse foundation models and service domains through extensive experiments and real-world evaluations.

Result: DACP-applied small LLMs achieve substantial gains in target domain performance while preserving general capabilities.

Conclusion: DACP offers a cost-efficient and scalable solution for enterprise-level deployment of small LLMs.

Abstract: The emergence of open-source large language models (LLMs) has expanded opportunities for enterprise applications; however, many organizations still lack the infrastructure to deploy and maintain large-scale models. As a result, small LLMs (sLLMs) have become a practical alternative, despite their inherent performance limitations. While Domain Adaptive Continual Pretraining (DACP) has been previously explored as a method for domain adaptation, its utility in commercial applications remains under-examined. In this study, we validate the effectiveness of applying a DACP-based recipe across diverse foundation models and service domains. Through extensive experiments and real-world evaluations, we demonstrate that DACP-applied sLLMs achieve substantial gains in target domain performance while preserving general capabilities, offering a cost-efficient and scalable solution for enterprise-level deployment.

[104] Turning Internal Gap into Self-Improvement: Promoting the Generation-Understanding Unification in MLLMs

Yujin Han, Hao Chen, Andi Han, Zhiheng Wang, Xinyu Liu, Yingya Zhang, Shiwei Zhang, Difan Zou

Main category: cs.CL

TL;DR: MLLMs exhibit an internal gap where understanding outperforms generation. The paper proposes a self-improvement framework that leverages stronger understanding to guide weaker generation, achieving co-improvement and unification through curriculum learning.

DetailsMotivation: To address the widespread non-unification in MLLMs, where understanding capabilities significantly outperform generation capabilities, and to develop methods that can bridge this gap without relying on external signals.

Method: A self-improvement framework that uses understanding to score generations and construct image data for post-training (SFT and DPO). The approach includes curriculum learning where progressively enhanced understanding and generation revisit underutilized samples to dynamically expand training data.
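
A minimal sketch of the understanding-scores-generation loop for building DPO pairs; `score_fn` is an assumed interface to the model's understanding side, and the margin is illustrative:

```python
def build_dpo_pairs(prompt, generations, score_fn, margin=0.2):
    """score_fn(prompt, image) -> alignment score from the MLLM's
    understanding side (an assumed interface). Pair the best against the
    worst generation as (chosen, rejected) for DPO post-training, keeping
    only pairs separated by a clear margin."""
    scored = sorted(generations, key=lambda g: score_fn(prompt, g), reverse=True)
    best, worst = scored[0], scored[-1]
    if score_fn(prompt, best) - score_fn(prompt, worst) >= margin:
        return [(prompt, best, worst)]
    return []
```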

Result: The framework significantly improves generation while promoting unification. It also reveals a co-improvement effect where enhanced generation helps understanding detect false positives, leading to aligned learning dynamics between the two capabilities.

Conclusion: The proposed internal gap-based self-improvement effectively mitigates MLLM non-unification. The co-improvement effect and curriculum learning approach demonstrate that leveraging understanding to guide generation can lead to sustained performance gains and better model unification.

Abstract: Although unified MLLMs aim to unify generation and understanding, they are considered to exhibit an internal gap, with understanding outperforming generation. Through large-scale evaluation across multiple MLLMs and tasks, we confirm the widespread non-unification of MLLMs, and demonstrate that it indeed stems from weak generation rather than misunderstanding. This finding motivates us to propose a simple yet effective internal gap-based self-improvement framework, which mitigates internal gaps by leveraging stronger understanding to guide weaker generation without relying on any external signals. We validate this strategy through comprehensive experiments: scoring generations with understanding to construct image data for post-training (e.g., SFT and DPO) significantly improves generation while promoting unification. Furthermore, we empirically discover a co-improvement effect of such self-improvement, a phenomenon well known in pre-training but underexplored in post-training. Specifically, as generation improves, understanding becomes more effective at detecting false positives that were previously misclassified as prompt-aligned. To explain this effect, we extend learning dynamics theory to the MLLM setting, showing that the shared empirical neural tangent kernel between generation and understanding encourages aligned learning dynamics, thereby driving co-improvement. This interplay between generation and understanding further motivates a curriculum learning approach for stronger self-improvement: progressively enhanced understanding and generation revisit samples underutilized by pre-trained MLLMs, dynamically expanding post-training data and leading to improved performance and unification.

[105] A Comprehensive Taxonomy of Negation for NLP and Neural Retrievers

Roxana Petcu, Samarth Bhargav, Maarten de Rijke, Evangelos Kanoulas

Main category: cs.CL

TL;DR: The paper introduces a taxonomy of negation for information retrieval models, creates benchmark datasets for evaluating negation handling, and proposes a logic-based classification system to analyze model performance on negation.

DetailsMotivation: Current dense neural models underperform on queries containing negation, which is vital for understanding complex reasoning tasks and addressing user information needs.

Method: Developed a taxonomy of negation derived from philosophical, linguistic, and logical definitions; generated two benchmark datasets for evaluation and fine-tuning; proposed a logic-based classification mechanism for analyzing retrieval model performance.

Result: The taxonomy produces balanced data distribution over negation types, leading to faster convergence on the NevIR dataset. The classification schema reveals coverage of negation types in existing datasets.

Conclusion: The proposed framework provides better training setup and insights into factors affecting generalization of fine-tuned models on negation, improving robustness in information retrieval systems.

Abstract: Understanding and solving complex reasoning tasks is vital for addressing the information needs of a user. Although dense neural models learn contextualised embeddings, they still underperform on queries containing negation. To understand this phenomenon, we study negation in both traditional neural information retrieval and LLM-based models. We (1) introduce a taxonomy of negation that derives from philosophical, linguistic, and logical definitions; (2) generate two benchmark datasets that can be used to evaluate the performance of neural information retrieval models and to fine-tune models for a more robust performance on negation; and (3) propose a logic-based classification mechanism that can be used to analyze the performance of retrieval models on existing datasets. Our taxonomy produces a balanced data distribution over negation types, providing a better training setup that leads to faster convergence on the NevIR dataset. Moreover, we propose a classification schema that reveals the coverage of negation types in existing datasets, offering insights into the factors that might affect the generalization of fine-tuned models on negation.

[106] C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

Chengqian Ma, Wei Tao, Yiwen Guo

Main category: cs.CL

TL;DR: The paper introduces a benchmark dataset for evaluating Spoken Dialogue Models (SDMs) to address the gap in understanding their practical effectiveness compared to text-based LLMs, focusing on challenges unique to spoken dialogue like ambiguity and context-dependency.

DetailsMotivation: There is a research gap in comprehensively understanding SDMs' practical effectiveness in comprehending and emulating human conversations, especially compared to text-based LLMs which have extensive benchmarking. Human voice interactions are more complex due to unique characteristics like ambiguity and context-dependency.

Method: The authors present a benchmark dataset comprising 1,079 instances in English and Chinese, accompanied by an LLM-based evaluation method that closely aligns with human judgment.

Result: The dataset facilitates a comprehensive exploration of SDM performance in tackling practical challenges of spoken dialogue, such as semantic ambiguity (polysemy), phonological aspects (heterograph, heteronyms, stress patterns), and context-dependency (omission, coreference, multi-turn interaction).

Conclusion: This benchmark addresses the need for standardized evaluation of SDMs and provides a foundation for better understanding their capabilities in handling the complexities of human conversational dynamics.

Abstract: Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users’ spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterograph, heteronyms, and stress patterns. Additionally, context-dependency, like omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.

[107] ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models

Manlai Liang, Mandi Liu, Jiangzhou Ji, Huaijun Li, Haobo Yang, Yaohan He, Jinlong Li

Main category: cs.CL

TL;DR: ILRe is a novel context compression pipeline that reduces LLM prefill complexity from O(L²) to O(L) and cuts memory usage significantly while maintaining performance in long-context scenarios.

DetailsMotivation: LLMs struggle with long-context scenarios due to short effective context length, quadratic computational complexity, and high memory overhead when processing lengthy inputs.

Method: Intermediate Layer Retrieval (ILRe) determines an intermediate decoder layer offline, encodes context by streaming chunked prefill only up to that layer, and recalls tokens using attention scores between input query and full key cache in that layer, with multi-pooling kernels for semantic completeness.
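
A toy sketch of the recall step, assuming the key cache of the chosen intermediate layer is available; a single max-pool stands in for ILRe's multi-pooling kernels:

```python
import torch

def recall_tokens(query_states, key_cache, keep=1024, window=7):
    """Score each cached context token by its attention weight against the
    query tokens in the chosen intermediate layer, smooth scores with a
    pooling window (a stand-in for the multi-pooling kernels), and keep
    the top-scoring positions for full-model decoding."""
    # query_states: (q_len, d); key_cache: (ctx_len, d)
    scores = (query_states @ key_cache.T) / key_cache.shape[-1] ** 0.5
    token_scores = scores.softmax(dim=-1).sum(dim=0)  # (ctx_len,)
    pooled = torch.nn.functional.max_pool1d(
        token_scores[None, None, :], window, stride=1,
        padding=window // 2).squeeze()  # odd window preserves length
    keep_idx = pooled.topk(min(keep, pooled.numel())).indices.sort().values
    return keep_idx  # positions of context tokens to retain
```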

Result: ILRe processes 1M tokens in under 30 seconds (180× speedup) and achieves ≈79.8 on RULER-1M benchmark with Llama-3.1-UltraLong-8B-1M-Instruct on Huawei Ascend 910B NPU, performing comparably or better than full-context setup.

Conclusion: The approach effectively addresses LLM limitations in long-context scenarios without requiring additional post-training or operator development, offering significant efficiency improvements while maintaining performance.

Abstract: Large Language Models (LLMs) have demonstrated success across many benchmarks. However, they still exhibit limitations in long-context scenarios, primarily due to their short effective context length, quadratic computational complexity, and high memory overhead when processing lengthy inputs. To mitigate these issues, we introduce a novel context compression pipeline, called Intermediate Layer Retrieval (ILRe), which determines one intermediate decoder layer offline, encodes context by streaming chunked prefill only up to that layer, and recalls tokens by the attention scores between the input query and full key cache in that specified layer. In particular, we propose a multi-pooling kernel allocation strategy in the token recalling process to maintain the completeness of semantics. Our approach not only reduces the prefilling complexity from $O(L^2)$ to $O(L)$ and trims the memory footprint to a few tenths of that required for the full context, but also delivers performance comparable to or superior to the full-context setup in long-context scenarios. Without additional post-training or operator development, ILRe can process a single $1M$ tokens request in less than half a minute (speedup $\approx 180\times$) and scores $\approx 79.8$ on the RULER-$1M$ benchmark with model Llama-3.1-UltraLong-8B-1M-Instruct on a Huawei Ascend 910B NPU.

[108] MathBuddy: A Multimodal System for Affective Math Tutoring

Debanjana Kar, Leopold Böss, Dacia Braca, Sebastian Maximilian Dennerlein, Nina Christine Hubig, Philipp Wintersberger, Yufang Hou

Main category: cs.CL

TL;DR: MathBuddy is an emotionally aware LLM-powered Math Tutor that models students’ emotions from text and facial expressions to provide empathetic, pedagogically appropriate responses.

DetailsMotivation: Current LLM-based learning models don't consider students' affective states, despite educational psychology research showing emotional states significantly impact learning capabilities.

Method: MathBuddy captures student emotions from conversational text and facial expressions, aggregates both modalities, and uses this emotional information to prompt the LLM tutor for emotionally-aware responses.
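
A minimal sketch of the two-modality aggregation, assuming each detector returns a probability distribution over a shared emotion label set; the fusion weight is illustrative:

```python
def fuse_emotions(text_probs, face_probs, w_text=0.6):
    """Weighted late fusion of per-modality emotion distributions over a
    shared label set; the 0.6 text weight is an illustrative choice."""
    labels = set(text_probs) | set(face_probs)
    fused = {e: w_text * text_probs.get(e, 0.0)
                + (1 - w_text) * face_probs.get(e, 0.0) for e in labels}
    return fused, max(fused, key=fused.get)

fused, top = fuse_emotions({"frustrated": 0.7, "neutral": 0.3},
                           {"frustrated": 0.4, "neutral": 0.6})
print(top)  # "frustrated": cue the tutor to respond empathetically
```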

Result: Evaluation shows 23 point performance gain using win rate and 3 point gain using DAMR scores, demonstrating significant improvement in pedagogical abilities by modeling student emotions.

Conclusion: Modeling students’ emotions significantly enhances LLM-based tutors’ pedagogical effectiveness, supporting the importance of emotional awareness in educational technology.

Abstract: The rapid adoption of LLM-based conversational systems is already transforming the landscape of educational technology. However, the current state-of-the-art learning models do not take into account the student’s affective states. Multiple studies in educational psychology support the claim that positive or negative emotional states can impact a student’s learning capabilities. To bridge this gap, we present MathBuddy, an emotionally aware LLM-powered Math Tutor, which dynamically models the student’s emotions and maps them to relevant pedagogical strategies, making the tutor-student conversation a more empathetic one. The student’s emotions are captured from the conversational text as well as from their facial expressions. The student’s emotions are aggregated from both modalities to confidently prompt our LLM Tutor for an emotionally-aware response. We have evaluated our model using automatic evaluation metrics across eight pedagogical dimensions and user studies. We report a massive 23 point performance gain using the win rate and a 3 point gain at an overall level using DAMR scores, which strongly supports our hypothesis of improving LLM-based tutor’s pedagogical abilities by modeling students’ emotions. Our dataset and code are available at: https://github.com/ITU-NLP/MathBuddy.

[109] JudgeAgent: Knowledge-wise and Dynamic LLM Evaluation with Agent-as-Interviewer

Zhichao Shi, Xuhui Jiang, Chengjin Xu, Cangli Yao, Zhenxin Huang, Shengjie Ma, Yinghan Shen, Jian Guo, Yuanzhuo Wang

Main category: cs.CL

TL;DR: Agent-as-Interviewer is a dynamic evaluation paradigm using LLM agents for multi-turn interactions to address limitations in current LLM evaluation methods, providing more accurate assessment of knowledge boundaries and capability levels.

DetailsMotivation: Current LLM evaluation methods suffer from overestimated/biased assessments, mismatched question difficulty, and incomplete knowledge boundary evaluation, hindering effective LLM application and optimization.

Method: Uses LLM agents to conduct multi-turn interactions, call knowledge tools for deeper question generation, and plan query strategies for difficulty adjustment. JudgeAgent framework implements this with knowledge-driven synthesis and difficulty scoring.
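
A hedged sketch of the interviewer loop; `agent.ask` and `agent.grade` are assumed interfaces standing in for JudgeAgent's knowledge-driven synthesis and difficulty scoring:

```python
def interview(target_llm, agent, max_turns=5):
    """Multi-turn dynamic evaluation: the interviewer agent poses a
    question at the current difficulty, grades the target's answer, and
    moves difficulty up or down to trace the capability boundary.
    agent.ask / agent.grade are assumed interfaces, not a real API."""
    difficulty, transcript = 1, []
    for _ in range(max_turns):
        question = agent.ask(difficulty, transcript)  # may call knowledge tools
        answer = target_llm(question)
        correct = agent.grade(question, answer)
        transcript.append((question, answer, correct, difficulty))
        difficulty = max(1, difficulty + (1 if correct else -1))
    return transcript  # estimate of the target's knowledge boundary
```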

Result: Extensive experiments validate JudgeAgent’s effectiveness, demonstrating accurate identification of target models’ knowledge and capability boundaries, with suggestions that help optimize the models.

Conclusion: Agent-as-Interviewer paradigm provides more complete and accurate LLM evaluation by dynamically adjusting question difficulty and exploring knowledge boundaries through multi-turn agent interactions.

Abstract: Current evaluation paradigms for large language models (LLMs) suffer from overestimated or biased evaluation and mismatched question difficulty, leading to incomplete evaluations of LLM’s knowledge and capability boundaries, which hinder LLM’s effective application and optimization. To address these challenges, we propose Agent-as-Interviewer, a dynamic evaluation paradigm that employs LLM agents to conduct multi-turn interactions for evaluation. Unlike current benchmarking or dynamic interaction paradigms, Agent-as-Interviewer utilizes agents to call knowledge tools for wider and deeper knowledge in the dynamic multi-turn question generation, achieving more complete evaluations of the LLM’s knowledge boundaries. It also leverages agents to plan query strategies for adjustment of the question difficulty levels, enhancing the difficulty control to match the actual capabilities of target LLMs. Based on this paradigm, we develop JudgeAgent, a knowledge-wise dynamic evaluation framework that employs knowledge-driven synthesis as the agent’s tool, and uses difficulty scoring as strategy guidance, thereby finally providing valuable suggestions to help targets optimize themselves. Extensive experiments validate the effectiveness of JudgeAgent’s suggestions, demonstrating that Agent-as-Interviewer can accurately identify the knowledge and capability boundaries of target models. The source code is available on https://anonymous.4open.science/r/JudgeAgent.

[110] Just-in-time and distributed task representations in language models

Yuxuan Li, Declan Campbell, Stephanie C. Y. Chan, Andrew Kyle Lampinen

Main category: cs.CL

TL;DR: Language models form transferrable task representations that evolve non-monotonically during context processing, showing strong temporal and semantic locality in evidence accrual.

DetailsMotivation: To understand when and how language models form representations for new tasks during in-context learning, particularly focusing on transferrable representations that can restore task contexts without full prompts.

Method: Investigated the evolution of transferrable task representations in language models by analyzing how these representations change over context, examining evidence accrual patterns and locality effects along sequence dimensions.

Result: Transferrable task representations evolve sporadically with strong locality, condensing evidence as more examples are provided. They capture minimal task scopes and show just-in-time computational processes for task performance.

Conclusion: Language models use temporally and semantically distributed representations with just-in-time computational processes for in-context learning, revealing a sophisticated mechanism for task adaptation without weight updates.

Abstract: Many of language models’ impressive capabilities originate from their in-context learning: based on instructions or examples, they can infer and perform new tasks without weight updates. In this work, we investigate when representations for new tasks are formed in language models, and how these representations change over the course of context. We focus on ‘’transferrable’’ task representations – vector representations that can restore task contexts in another instance of the model, even without the full prompt. We show that these representations evolve in non-monotonic and sporadic ways, and are distinct from a more inert representation of high-level task categories that persists throughout the context. Specifically, when more examples are provided in the context, transferrable task representations successfully condense evidence. This allows better transfer of task contexts and aligns well with the performance improvement. However, this evidence accrual process exhibits strong locality along the sequence dimension, coming online only at certain tokens – despite task identity being reliably decodable throughout the context. Moreover, these local but transferrable task representations tend to capture minimal ‘’task scopes’’, such as a semantically-independent subtask. For longer and composite tasks, models rely on more temporally-distributed representations. This two-fold locality (temporal and semantic) underscores a kind of just-in-time computational process that language models use to perform new tasks on the fly.

[111] PLaMo 2 Technical Report

Preferred Networks, :, Kaizaburo Chubachi, Yasuhiro Fujita, Shinichi Hemmi, Yuta Hirokawa, Kentaro Imajo, Toshiki Kataoka, Goro Kobayashi, Kenichi Maehashi, Calvin Metzger, Hiroaki Mikami, Shogo Murai, Daisuke Nishino, Kento Nozawa, Toru Ogawa, Shintarou Okada, Daisuke Okanohara, Shunta Saito, Shotaro Sano, Shuji Suzuki, Kuniyuki Takahashi, Daisuke Tanaka, Avinash Ummadisingu, Hanqin Wang, Sixue Wang, Tianqi Xu

Main category: cs.CL

TL;DR: PLaMo 2 introduces Japanese-focused LLMs with hybrid Samba-based architecture, achieving 32K token context through continual pre-training and efficient pruning that produces an 8B model comparable to previous 100B model.

DetailsMotivation: To overcome data scarcity for Japanese language models and create computationally efficient models that achieve state-of-the-art performance on Japanese benchmarks.

Method: Hybrid Samba-based architecture transitioning to full attention via continual pre-training, extensive synthetic corpora training, weight reuse, structured pruning, supervised fine-tuning (SFT), direct preference optimization (DPO), and model merging techniques.

Result: PLaMo 2 models achieve state-of-the-art results on Japanese benchmarks, outperforming similarly-sized open models in instruction-following, language fluency, and Japanese-specific knowledge.

Conclusion: The efficient pruning methodology and comprehensive training pipeline enable high-performance Japanese language models that are optimized for inference with minimal accuracy loss.

Abstract: In this report, we introduce PLaMo 2, a series of Japanese-focused large language models featuring a hybrid Samba-based architecture that transitions to full attention via continual pre-training to support 32K token contexts. Training leverages extensive synthetic corpora to overcome data scarcity, while computational efficiency is achieved through weight reuse and structured pruning. This efficient pruning methodology produces an 8B model that achieves performance comparable to our previous 100B model. Post-training further refines the models using a pipeline of supervised fine-tuning (SFT) and direct preference optimization (DPO), enhanced by synthetic Japanese instruction data and model merging techniques. Optimized for inference using vLLM and quantization with minimal accuracy loss, the PLaMo 2 models achieve state-of-the-art results on Japanese benchmarks, outperforming similarly-sized open models in instruction-following, language fluency, and Japanese-specific knowledge.

[112] LM-Searcher: Cross-domain Neural Architecture Search with LLMs via Unified Numerical Encoding

Yuxuan Hu, Jihao Liu, Ke Wang, Jinliang Zhen, Weikang Shi, Manyuan Zhang, Qi Dou, Rui Liu, Aojun Zhou, Hongsheng Li

Main category: cs.CL

TL;DR: LM-Searcher is a novel LLM-driven framework for cross-domain neural architecture search that uses universal numerical encoding (NCode) and reformulates NAS as a ranking task, achieving competitive performance without extensive domain-specific adaptation.

DetailsMotivation: Existing LLM-driven NAS approaches rely heavily on prompt engineering and domain-specific tuning, limiting their practicality and scalability across diverse tasks.

Method: Proposes NCode (universal numerical string representation for neural architectures), reformulates NAS as ranking task, uses instruction-tuning with pruning-based subspace sampling, and creates a curated dataset of architecture-performance pairs.

Result: LM-Searcher achieves competitive performance in both in-domain (CNNs for image classification) and out-of-domain (LoRA configurations for segmentation and generation) tasks.

Conclusion: Establishes a new paradigm for flexible and generalizable LLM-based architecture search, with datasets and models to be released publicly.

Abstract: Recent progress in Large Language Models (LLMs) has opened new avenues for solving complex optimization problems, including Neural Architecture Search (NAS). However, existing LLM-driven NAS approaches rely heavily on prompt engineering and domain-specific tuning, limiting their practicality and scalability across diverse tasks. In this work, we propose LM-Searcher, a novel framework that leverages LLMs for cross-domain neural architecture optimization without the need for extensive domain-specific adaptation. Central to our approach is NCode, a universal numerical string representation for neural architectures, which enables cross-domain architecture encoding and search. We also reformulate the NAS problem as a ranking task, training LLMs to select high-performing architectures from candidate pools using instruction-tuning samples derived from a novel pruning-based subspace sampling strategy. Our curated dataset, encompassing a wide range of architecture-performance pairs, encourages robust and transferable learning. Comprehensive experiments demonstrate that LM-Searcher achieves competitive performance in both in-domain (e.g., CNNs for image classification) and out-of-domain (e.g., LoRA configurations for segmentation and generation) tasks, establishing a new paradigm for flexible and generalizable LLM-based architecture search. The datasets and models will be released at https://github.com/Ashone3/LM-Searcher.

[113] WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents

Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, Jiayuan Song, Zhengmao Zhu, Wenhu Chen, Pengyu Zhao, Junxian He

Main category: cs.CL

TL;DR: WebExplorer introduces a systematic data generation approach using model-based exploration and query evolution to create challenging web navigation tasks, enabling the development of an 8B parameter web agent that achieves state-of-the-art performance on information-seeking benchmarks.

DetailsMotivation: Existing open-source web agents have limited information-seeking abilities on complex tasks or lack transparent implementations, with the key challenge being scarcity of challenging data for information seeking.

Method: WebExplorer uses model-based exploration and iterative, long-to-short query evolution to generate challenging query-answer pairs requiring multi-step reasoning and complex web navigation. The model is developed through supervised fine-tuning followed by reinforcement learning, supporting 128K context length and up to 100 tool calling turns.

Result: WebExplorer-8B achieves state-of-the-art performance at its scale across diverse information-seeking benchmarks, outperforming WebSailor-72B on BrowseComp-en/zh and achieving best performance among models up to 100B parameters on WebWalkerQA and FRAMES. It also shows strong generalization on HLE benchmark despite being trained only on knowledge-intensive QA data.

Conclusion: The approach provides a practical path toward long-horizon web agents, demonstrating that systematic data generation can enable smaller models to achieve competitive performance in complex web navigation tasks.

Abstract: The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify that the key challenge lies in the scarcity of challenging data for information seeking. To address this limitation, we introduce WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. By leveraging our curated high-quality dataset, we successfully develop advanced web agent WebExplorer-8B through supervised fine-tuning followed by reinforcement learning. Our model supports 128K context length and up to 100 tool calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves the state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B is able to effectively search over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also achieves strong generalization on the HLE benchmark even though it is only trained on knowledge-intensive QA data. These results highlight our approach as a practical path toward long-horizon web agents.

[114] Ko-PIQA: A Korean Physical Commonsense Reasoning Dataset with Cultural Context

Dasol Choi, Jungwhan Kim, Guijin Son

Main category: cs.CL

TL;DR: Ko-PIQA is a Korean physical commonsense reasoning dataset that addresses the lack of cultural diversity in existing benchmarks by incorporating culturally specific Korean elements.

DetailsMotivation: Existing physical commonsense reasoning datasets like PIQA are predominantly English-centric and lack cultural diversity, limiting their applicability to non-English contexts and culturally diverse reasoning scenarios.

Method: The authors created Ko-PIQA through a multi-stage filtering approach: starting from 3.01 million web-crawled questions, they used three language models to identify 11,553 PIQA-style questions, then refined them with GPT-4o and human validation to obtain 441 high-quality question-answer pairs.

Result: Evaluation of seven language models showed significant performance variation (59.86% to 83.22% accuracy), with models particularly struggling with culturally specific scenarios. 19.7% of questions contain culturally specific Korean elements requiring culturally-aware reasoning.

Conclusion: Ko-PIQA serves as both a benchmark for Korean language models and a foundation for more inclusive commonsense reasoning research, highlighting the importance of culturally diverse datasets for advancing AI reasoning capabilities across different cultural contexts.

Abstract: Physical commonsense reasoning datasets like PIQA are predominantly English-centric and lack cultural diversity. We introduce Ko-PIQA, a Korean physical commonsense reasoning dataset that incorporates cultural context. Starting from 3.01 million web-crawled questions, we employed a multi-stage filtering approach using three language models to identify 11,553 PIQA-style questions. Through GPT-4o refinement and human validation, we obtained 441 high-quality question-answer pairs. A key feature of Ko-PIQA is its cultural grounding: 19.7% of questions contain culturally specific elements like traditional Korean foods (kimchi), clothing (hanbok), and specialized appliances (kimchi refrigerators) that require culturally-aware reasoning beyond direct translation. We evaluate seven language models on Ko-PIQA, with the best model achieving 83.22% accuracy while the weakest reaches only 59.86%, demonstrating significant room for improvement. Models particularly struggle with culturally specific scenarios, highlighting the importance of culturally diverse datasets. Ko-PIQA serves as both a benchmark for Korean language models and a foundation for more inclusive commonsense reasoning research. The dataset and code will be publicly available.

[115] Causal-Counterfactual RAG: The Integration of Causal-Counterfactual Reasoning into RAG

Harshad Khadilkar, Abhay Gupta

Main category: cs.CL

TL;DR: Causal-Counterfactual RAG integrates causal graphs and counterfactual reasoning into retrieval-augmented generation to improve contextual understanding, reduce hallucination, and enhance reasoning fidelity in knowledge-intensive domains.

DetailsMotivation: Traditional RAG systems suffer from disrupted contextual integrity due to text chunking and over-reliance on semantic similarity, leading to shallow and inaccurate responses. LLMs' static knowledge limits dynamic reasoning over external information.

Method: Proposes a novel framework that integrates explicit causal graphs representing cause-effect relationships into retrieval and incorporates counterfactual reasoning grounded on causal structure. Evaluates both direct causal evidence and counterfactuality of associated causes.

Result: The framework preserves contextual coherence, reduces hallucination, and enhances reasoning fidelity by leveraging causal pathways and associated hypothetical scenarios.

Conclusion: Causal-Counterfactual RAG generates more robust, accurate, and interpretable answers compared to conventional RAG methods by combining causal and counterfactual reasoning approaches.

Abstract: Large language models (LLMs) have transformed natural language processing (NLP), enabling diverse applications by integrating large-scale pre-trained knowledge. However, their static knowledge limits dynamic reasoning over external information, especially in knowledge-intensive domains. Retrieval-Augmented Generation (RAG) addresses this challenge by combining retrieval mechanisms with generative modeling to improve contextual understanding. Traditional RAG systems suffer from disrupted contextual integrity due to text chunking and over-reliance on semantic similarity for retrieval, often resulting in shallow and less accurate responses. We propose Causal-Counterfactual RAG, a novel framework that integrates explicit causal graphs representing cause-effect relationships into the retrieval process and incorporates counterfactual reasoning grounded on the causal structure. Unlike conventional methods, our framework evaluates not only direct causal evidence but also the counterfactuality of associated causes, combining results from both to generate more robust, accurate, and interpretable answers. By leveraging causal pathways and associated hypothetical scenarios, Causal-Counterfactual RAG preserves contextual coherence, reduces hallucination, and enhances reasoning fidelity.

[116] FURINA: Free from Unmergeable Router via LINear Aggregation of mixed experts

Jiayi Han, Liang Du, Yinda Chen, Xiao Kang, Weiyang Ding, Donghong Han

Main category: cs.CL

TL;DR: FURINA is a novel router-free MoE-LoRA framework that enables full model merging by replacing discrete routers with linear aggregation of experts based on angular similarity and shared magnitude scaling.

DetailsMotivation: Existing MoE-LoRA methods rely on discrete routers that prevent integration of MoE components into the backbone model, creating inference-time overhead and complexity.

Method: Uses Self-Routing mechanism with three innovations: decoupled direction/magnitude learning for LoRA adapters, shared learnable magnitude vector, and expert selection loss. Leverages angular similarity between input and adapter directions for expert activation.

Result: FURINA significantly outperforms standard LoRA and matches/surpasses existing MoE-LoRA methods while eliminating extra inference-time overhead.

Conclusion: FURINA is the first router-free MoE-enhanced LoRA method that can be fully merged into backbone models with zero additional inference cost, achieving superior performance without MoE overhead.

Abstract: The Mixture of Experts (MoE) paradigm has been successfully integrated into Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning (PEFT), delivering performance gains with minimal parameter overhead. However, a key limitation of existing MoE-LoRA methods is their reliance on a discrete router, which prevents the integration of the MoE components into the backbone model. To overcome this, we propose FURINA, a novel Free from Unmergeable Router framework based on the LINear Aggregation of experts. FURINA eliminates the router by introducing a Self-Routing mechanism. This is achieved through three core innovations: (1) decoupled learning of the direction and magnitude for LoRA adapters, (2) a shared learnable magnitude vector for consistent activation scaling, and (3) expert selection loss that encourages divergent expert activation. The proposed mechanism leverages the angular similarity between the input and each adapter’s directional component to activate experts, which are then scaled by the shared magnitude vector. This design allows the output norm to naturally reflect the importance of each expert, thereby enabling dynamic, router-free routing. The expert selection loss further sharpens this behavior by encouraging sparsity and aligning it with standard MoE activation patterns. We also introduce a shared expert within the MoE-LoRA block that provides stable, foundational knowledge. To the best of our knowledge, FURINA is the first router-free, MoE-enhanced LoRA method that can be fully merged into the backbone model, introducing zero additional inference-time cost or complexity. Extensive experiments demonstrate that FURINA not only significantly outperforms standard LoRA but also matches or surpasses the performance of existing MoE-LoRA methods, while eliminating the extra inference-time overhead of MoE.

[117] TactfulToM: Do LLMs Have the Theory of Mind Ability to Understand White Lies?

Yiwei Liu, Emma Jane Pretty, Jiahao Huang, Saku Sugawara

Main category: cs.CL

TL;DR: TactfulToM is a new benchmark for evaluating LLMs’ ability to understand white lies in social contexts, showing current models perform significantly below humans on Theory of Mind reasoning tasks involving prosocial motivations.

DetailsMotivation: To address the limited research on LLMs' Theory of Mind abilities in nuanced social contexts like white lies, which require understanding prosocial motivations to spare others' feelings and maintain social harmony.

Method: Created through a multi-stage human-in-the-loop pipeline where LLMs expand manually designed seed stories into conversations while maintaining information asymmetry necessary for authentic white lies.

Result: State-of-the-art models perform substantially below humans on TactfulToM, revealing shortcomings in their ability to comprehend the Theory of Mind reasoning required for true understanding of white lies.

Conclusion: The benchmark demonstrates that current LLMs struggle with nuanced social reasoning involving white lies, highlighting limitations in their Theory of Mind capabilities despite advances in other areas.

Abstract: While recent studies explore Large Language Models’ (LLMs) performance on Theory of Mind (ToM) reasoning tasks, research on ToM abilities that require more nuanced social context is limited, such as white lies. We introduce TactfulToM, a novel English benchmark designed to evaluate LLMs’ ability to understand white lies within real-life conversations and reason about prosocial motivations behind them, particularly when they are used to spare others’ feelings and maintain social harmony. Our benchmark is generated through a multi-stage human-in-the-loop pipeline where LLMs expand manually designed seed stories into conversations to maintain the information asymmetry between participants necessary for authentic white lies. We show that TactfulToM is challenging for state-of-the-art models, which perform substantially below humans, revealing shortcomings in their ability to fully comprehend the ToM reasoning that enables true understanding of white lies.

[118] EpiCache: Episodic KV Cache Management for Long Conversational Question Answering

Minsoo Kim, Arnav Kundu, Han-Byul Kim, Richa Dixit, Minsik Cho

Main category: cs.CL

TL;DR: EpiCache is a training-free KV cache management framework that addresses memory bottlenecks in long conversational QA by using block-wise prefill and episodic KV compression to bound cache growth while maintaining accuracy.

DetailsMotivation: Large language models with long contexts require KV caching that grows linearly with dialogue length, creating memory bottlenecks in resource-constrained environments. Existing methods have limitations including unbounded peak memory and query-dependent eviction failures in multi-turn conversations.

Method: EpiCache uses block-wise prefill to bound cache growth, episodic KV compression that clusters conversation history into coherent episodes with episode-specific eviction, and adaptive layer-wise budget allocation that distributes memory based on each layer’s sensitivity to eviction.

Result: Across three LongConvQA benchmarks, EpiCache improves accuracy by up to 40% over baselines, sustains near-full KV accuracy under 4-6x compression, and reduces latency and memory by up to 2.4x and 3.5x.

Conclusion: EpiCache enables efficient multi-turn interaction under strict resource constraints by effectively managing KV cache memory while preserving conversational context and accuracy.

Abstract: Modern large language models (LLMs) extend context lengths to up to millions of tokens, enabling AI assistants to generate coherent and personalized responses grounded in long conversational histories. This ability, however, hinges on Key-Value (KV) caching, whose memory grows linearly with dialogue length and quickly becomes the bottleneck in resource-constrained environments. An active line of research for reducing memory bottleneck is KV cache compression, which seeks to limit cache size while preserving accuracy. Yet existing methods face two major limitations: (i) evicting the KV cache after full-context prefill causes unbounded peak memory, and (ii) query-dependent eviction narrows the cache to a single query, leading to failure cases in multi-turn conversations. We introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and applies episode-specific KV cache eviction. We further design an adaptive layer-wise budget allocation strategy that measures each layer’s sensitivity to eviction and distributes the memory budget across layers accordingly. Across three LongConvQA benchmarks, EpiCache improves accuracy by up to 40% over recent baselines, sustains near-full KV accuracy under 4-6x compression, and reduces latency and memory by up to 2.4x and 3.5x, thereby enabling efficient multi-turn interaction under strict resource constraints.

[119] CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density

Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud

Main category: cs.CL

TL;DR: CogniLoad is a synthetic benchmark for evaluating long-context reasoning in LLMs, based on Cognitive Load Theory with tunable parameters for intrinsic difficulty, distractor interference, and task length.

DetailsMotivation: Current benchmarks for long-context reasoning in LLMs often blur critical factors like task complexity, distractor interference, and task length, making precise failure analysis difficult.

Method: CogniLoad generates natural-language logic puzzles with independently tunable parameters reflecting CLT’s core dimensions: intrinsic difficulty (d), distractor-to-signal ratio (ρ), and task length (N).

Result: Evaluation of 22 state-of-the-art reasoning LLMs revealed distinct performance sensitivities, with task length as a dominant constraint, varied tolerances to intrinsic complexity, and U-shaped responses to distractor ratios.

Conclusion: CogniLoad provides a reproducible, scalable, and diagnostically rich tool for dissecting LLM reasoning limitations and guiding future model development by offering systematic control over cognitive load dimensions.

Abstract: Current benchmarks for long-context reasoning in Large Language Models (LLMs) often blur critical factors like intrinsic task complexity, distractor interference, and task length. To enable more precise failure analysis, we introduce CogniLoad, a novel synthetic benchmark grounded in Cognitive Load Theory (CLT). CogniLoad generates natural-language logic puzzles with independently tunable parameters that reflect CLT’s core dimensions: intrinsic difficulty ($d$) controls intrinsic load; distractor-to-signal ratio ($\rho$) regulates extraneous load; and task length ($N$) serves as an operational proxy for conditions demanding germane load. Evaluating 22 SotA reasoning LLMs, CogniLoad reveals distinct performance sensitivities, identifying task length as a dominant constraint and uncovering varied tolerances to intrinsic complexity and U-shaped responses to distractor ratios. By offering systematic, factorial control over these cognitive load dimensions, CogniLoad provides a reproducible, scalable, and diagnostically rich tool for dissecting LLM reasoning limitations and guiding future model development.

[120] False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models

Julie Kallini, Dan Jurafsky, Christopher Potts, Martijn Bartelds

Main category: cs.CL

TL;DR: Token overlap in multilingual models facilitates cross-lingual transfer rather than causing interference, with performance improving as vocabulary overlap increases.

DetailsMotivation: To determine whether overlapping tokens across languages in multilingual tokenizers help cross-lingual transfer or introduce interference, addressing mixed evidence from prior work.

Method: Controlled experiments training bilingual autoregressive models on multiple language pairs with systematically varied vocabulary overlap, analyzing hidden representations and semantic similarity of shared tokens.

Result: Models with token overlap outperform those with disjoint vocabularies on XNLI and XQuAD tasks, with transfer performance improving as overlap increases. Overlap creates embedding spaces that capture cross-lingual semantic relationships.

Conclusion: Substantial shared vocabulary remains beneficial for multilingual tokenizers, as token overlap facilitates cross-lingual transfer and improves model performance.

Abstract: Subword tokenizers trained on multilingual corpora naturally produce overlapping tokens across languages. Does token overlap facilitate cross-lingual transfer or instead introduce interference between languages? Prior work offers mixed evidence, partly due to varied setups and confounders, such as token frequency or subword segmentation granularity. To address this question, we devise a controlled experiment where we train bilingual autoregressive models on multiple language pairs under systematically varied vocabulary overlap settings. Crucially, we explore a new dimension to understanding how overlap affects transfer: the semantic similarity of tokens shared across languages. We first analyze our models’ hidden representations and find that overlap of any kind creates embedding spaces that capture cross-lingual semantic relationships, while this effect is much weaker in models with disjoint vocabularies. On XNLI and XQuAD, we find that models with overlap outperform models with disjoint vocabularies, and that transfer performance generally improves as overlap increases. Overall, our findings highlight the advantages of token overlap in multilingual models and show that substantial shared vocabulary remains a beneficial design choice for multilingual tokenizers.

[121] Reinforcement Learning on Pre-Training Data

Siheng Li, Kejiao Li, Zenan Xu, Guanhua Huang, Evander Yang, Kun Li, Haoyuan Wu, Jiajia Wu, Zihao Zheng, Chenchen Zhang, Kun Shi, Kyrierl Deng, Qi Yi, Ruibin Xiong, Tingqiang Xu, Yuhao Jiang, Jianfeng Yan, Yuyuan Zeng, Guanghui Xu, Jinbao Xue, Zhijiang Xu, Zheng Fang, Shuai Li, Qibin Liu, Xiaoxue Li, Zhuoyu Li, Yangyu Tao, Fei Gao, Cheng Jiang, Bo Chao Wang, Kai Liu, Jianchen Zhu, Wai Lam, Wayyt Wang, Bo Zhou, Di Wang

Main category: cs.CL

TL;DR: RLPT is a new training paradigm that uses reinforcement learning on pre-training data to optimize LLMs by rewarding accurate next-segment predictions, eliminating the need for human annotation while improving reasoning capabilities across multiple benchmarks.

DetailsMotivation: The growing gap between computational resources scaling exponentially and finite high-quality text data availability constrains conventional LLM scaling approaches, requiring new methods that don't rely on human annotation.

Method: RLPT uses reinforcement learning with a next-segment reasoning objective, where the model is rewarded for accurately predicting subsequent text segments based on preceding context, enabling autonomous exploration of meaningful trajectories from pre-training data.

Result: RLPT shows significant improvements across multiple benchmarks: Qwen3-4B-Base achieved absolute gains of 3.0-8.1 points on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, demonstrating favorable scaling behavior and enhanced reasoning capabilities.

Conclusion: RLPT effectively extends LLM reasoning boundaries, provides a solid foundation for continued scaling gains, and enhances RLVR performance while eliminating dependency on human annotation for reward construction.

Abstract: The growing disparity between the exponential scaling of computational resources and the finite growth of high-quality text data now constrains conventional scaling approaches for large language models (LLMs). To address this challenge, we introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast to prior approaches that scale training primarily through supervised learning, RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). While existing RL strategies such as reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) rely on human annotation for reward construction, RLPT eliminates this dependency by deriving reward signals directly from pre-training data. Specifically, it adopts a next-segment reasoning objective, rewarding the policy for accurately predicting subsequent text segments conditioned on the preceding context. This formulation allows RL to be scaled on pre-training data, encouraging the exploration of richer trajectories across broader contexts and thereby fostering more generalizable reasoning skills. Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT. For example, when applied to Qwen3-4B-Base, RLPT yields absolute improvements of $3.0$, $5.1$, $8.1$, $6.0$, $6.6$, and $5.3$ on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively. The results further demonstrate favorable scaling behavior, suggesting strong potential for continued gains with more compute. In addition, RLPT provides a solid foundation, extending the reasoning boundaries of LLMs and enhancing RLVR performance.

[122] Part-of-speech tagging for Nagamese Language using CRF

Alovi N Shohe, Chonglio Khiamungam, Teisovi Angami

Main category: cs.CL

TL;DR: First POS tagging study for Nagamese language using CRF machine learning, achieving 85.70% accuracy on 16,112 token annotated corpus.

DetailsMotivation: Nagamese is an under-resourced Assamese-lexified Creole language with no existing POS tagging work, unlike resource-rich languages like English and Hindi.

Method: Created annotated corpus of 16,112 tokens and applied Conditional Random Fields (CRF) machine learning technique for POS tagging.

Result: Achieved overall tagging accuracy of 85.70% with precision of 86%, recall of 86%, and f1-score of 85%.

Conclusion: Successfully demonstrated the first POS tagging system for Nagamese language with promising results using CRF approach.

Abstract: This paper investigates part-of-speech tagging, an important task in Natural Language Processing (NLP) for the Nagamese language. The Nagamese language, a.k.a. Naga Pidgin, is an Assamese-lexified Creole language developed primarily as a means of communication in trade between the Nagas and people from Assam in northeast India. A substantial amount of work in part-of-speech-tagging has been done for resource-rich languages like English, Hindi, etc. However, no work has been done in the Nagamese language. To the best of our knowledge, this is the first attempt at part-of-speech tagging for the Nagamese Language. The aim of this work is to identify the part-of-speech for a given sentence in the Nagamese language. An annotated corpus of 16,112 tokens is created and applied machine learning technique known as Conditional Random Fields (CRF). Using CRF, an overall tagging accuracy of 85.70%; precision, recall of 86%, and f1-score of 85% is achieved. Keywords. Nagamese, NLP, part-of-speech, machine learning, CRF.

[123] LLMs4All: A Review on Large Language Models for Research and Applications in Academic Disciplines

Yanfang Fanny Ye, Zheyuan Zhang, Tianyi Ma, Zehong Wang, Yiyang Li, Shifu Hou, Weixiang Sun, Kaiwen Shi, Yijun Ma, Wei Song, Ahmed Abbasi, Ying Cheng, Jane Cleland-Huang, Steven Corcelli, Patricia Culligan, Robert Goulding, Ming Hu, Ting Hua, John Lalor, Fang Liu, Tengfei Luo, Ed Maginn, Nuno Moniz, Jason Rohr, Brett Savoie, Daniel Slate, Tom Stapleford, Matthew Webber, Olaf Wiest, Johnny Zhang, Nitesh Chawla

Main category: cs.CL

TL;DR: This paper provides an overview of state-of-the-art Large Language Models (LLMs) and their integration across diverse academic disciplines, exploring how they are shaping research and practice while discussing limitations and future directions.

DetailsMotivation: The impressive performance of LLMs like ChatGPT on language-related tasks has demonstrated their potential for broader real-world applications, inspiring the need to understand their impacts across various academic fields.

Method: The paper offers a comprehensive review and analysis of LLM integration in three main disciplinary categories: (1) arts, letters, and law, (2) economics and business, and (3) science and engineering.

Result: The review provides insights into how LLMs are being used across disciplines, highlighting key observations about their applications and impacts on research and practice.

Conclusion: The analysis helps researchers and practitioners understand how to leverage LLMs to advance their work in diverse real-world applications, while also identifying key limitations and future directions in the generative AI era.

Abstract: Cutting-edge Artificial Intelligence (AI) techniques keep reshaping our view of the world. For example, Large Language Models (LLMs) based applications such as ChatGPT have shown the capability of generating human-like conversation on extensive topics. Due to the impressive performance on a variety of language-related tasks (e.g., open-domain question answering, translation, and document summarization), one can envision the far-reaching impacts that can be brought by the LLMs with broader real-world applications (e.g., customer service, education and accessibility, and scientific discovery). Inspired by their success, this paper will offer an overview of state-of-the-art LLMs and their integration into a wide range of academic disciplines, including: (1) arts, letters, and law (e.g., history, philosophy, political science, arts and architecture, law), (2) economics and business (e.g., finance, economics, accounting, marketing), and (3) science and engineering (e.g., mathematics, physics and mechanical engineering, chemistry and chemical engineering, life sciences and bioengineering, earth sciences and civil engineering, computer science and electrical engineering). Integrating humanity and technology, in this paper, we will explore how LLMs are shaping research and practice in these fields, while also discussing key limitations, open challenges, and future directions in the era of generative AI. The review of how LLMs are engaged across disciplines-along with key observations and insights-can help researchers and practitioners interested in exploiting LLMs to advance their works in diverse real-world applications.

[124] Polarity Detection of Sustainable Detection Goals in News Text

Andrea Cadeddu, Alessandro Chessa, Vincenzo De Leo, Gianni Fenu, Francesco Osborne, Diego Reforgiato Recupero, Angelo Salatino, Luca Secchi

Main category: cs.CL

TL;DR: This paper introduces SDG polarity detection - a novel task to classify text as positive, neutral, or negative impact on UN Sustainable Development Goals. It presents SDG-POD benchmark dataset and evaluates LLMs, finding the task challenging but showing improvements with fine-tuning and synthetic data augmentation.

DetailsMotivation: While NLP can classify text relevance to SDGs, determining the directionality (positive/neutral/negative impact) is equally important for sustainability monitoring but remains an underexplored challenge.

Method: Proposed SDG polarity detection task and created SDG-POD benchmark dataset combining original and synthetic data. Evaluated 6 state-of-the-art LLMs in zero-shot and fine-tuned configurations, with data augmentation techniques.

Result: Task remains challenging for current LLMs, but fine-tuned models (especially QWQ-32B) achieve good performance, particularly on SDG-9, SDG-12, and SDG-15. Synthetic data augmentation improves model performance.

Conclusion: The work advances sustainability monitoring methodology and provides insights for developing efficient polarity detection systems, demonstrating effectiveness of data enrichment in resource-constrained domains.

Abstract: The United Nations’ Sustainable Development Goals (SDGs) provide a globally recognised framework for addressing critical societal, environmental, and economic challenges. Recent developments in natural language processing (NLP) and large language models (LLMs) have facilitated the automatic classification of textual data according to their relevance to specific SDGs. Nevertheless, in many applications, it is equally important to determine the directionality of this relevance; that is, to assess whether the described impact is positive, neutral, or negative. To tackle this challenge, we propose the novel task of SDG polarity detection, which assesses whether a text segment indicates progress toward a specific SDG or conveys an intention to achieve such progress. To support research in this area, we introduce SDG-POD, a benchmark dataset designed specifically for this task, combining original and synthetically generated data. We perform a comprehensive evaluation using six state-of-the-art large LLMs, considering both zero-shot and fine-tuned configurations. Our results suggest that the task remains challenging for the current generation of LLMs. Nevertheless, some fine-tuned models, particularly QWQ-32B, achieve good performance, especially on specific Sustainable Development Goals such as SDG-9 (Industry, Innovation and Infrastructure), SDG-12 (Responsible Consumption and Production), and SDG-15 (Life on Land). Furthermore, we demonstrate that augmenting the fine-tuning dataset with synthetically generated examples yields improved model performance on this task. This result highlights the effectiveness of data enrichment techniques in addressing the challenges of this resource-constrained domain. This work advances the methodological toolkit for sustainability monitoring and provides actionable insights into the development of efficient, high-performing polarity detection systems.

[125] From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

Tianqiao Liu, Xueyi Li, Hao Wang, Haoxuan Li, Zhichao Chen, Weiqi Luo, Zitao Liu

Main category: cs.CL

TL;DR: Text-to-Talk (TtT) is a unified audio-text framework that combines autoregressive text generation with non-autoregressive audio diffusion in a single Transformer, using modality-aware attention and block-wise diffusion for efficient multimodal speech processing.

DetailsMotivation: Existing multimodal models use autoregressive methods for both text and audio, but overlook that text depends on target-target relations while audio depends mainly on source-target relations, requiring different modeling approaches.

Method: Proposes TtT framework integrating AR text generation with NAR audio diffusion using absorbing discrete diffusion, modality-aware attention mechanism, and three training strategies to reduce train-test discrepancies. Uses block-wise diffusion for parallel audio synthesis during inference.

Result: Extensive experiments on Audio-QA and ASR tasks demonstrate effectiveness, with ablation studies validating each proposed component. The approach shows improved performance in handling interleaved audio and text.

Conclusion: TtT provides an effective unified framework for multimodal speech-to-speech systems, with plans to open-source models, data, and code to facilitate future research in this direction.

Abstract: Recent advances in large language models (LLMs) have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-to-speech conversational systems. However, existing multimodal models handling interleaved audio and text rely on autoregressive methods, overlooking that text depends on target-target relations whereas audio depends mainly on source-target relations. In this work, we propose Text-to-Talk (TtT), a unified audio-text framework that integrates autoregressive (AR) text generation with non-autoregressive (NAR) audio diffusion in a single Transformer. By leveraging the any-order autoregressive property of absorbing discrete diffusion, our approach provides a unified training objective for text and audio. To support this hybrid generation paradigm, we design a modality-aware attention mechanism that enforces causal decoding for text while allowing bidirectional modeling within audio spans, and further introduce three training strategies that reduce train-test discrepancies. During inference, TtT employs block-wise diffusion to synthesize audio in parallel while flexibly handling variable-length outputs. Extensive experiments across Audio-QA and ASR tasks demonstrate the effectiveness of our approach, with detailed ablation studies validating each proposed component. We will open-source our models, data and code to facilitate future research in this direction.

[126] Thinking Augmented Pre-training

Liang Wang, Nan Yang, Shaohan Huang, Li Dong, Furu Wei

Main category: cs.CL

TL;DR: TPT (Thinking augmented Pre-Training) improves LLM data efficiency by 3x through augmenting text data with automatically generated thinking trajectories, enhancing learnability of complex tokens.

DetailsMotivation: The compute for pre-training LLMs is growing rapidly while high-quality data remains limited. Complex tokens are difficult to learn due to their deep underlying rationale, creating a need to maximize data utility.

Method: TPT augments existing text data with automatically generated thinking trajectories, providing step-by-step reasoning and decomposition to make high-quality tokens more learnable.

Result: TPT improves data efficiency by 3x and boosts performance of LLMs across various model sizes. For a 3B parameter model, it achieves over 10% improvement on challenging reasoning benchmarks.

Conclusion: TPT is a scalable and effective methodology that substantially enhances LLM training efficiency and performance through thinking trajectory augmentation.

Abstract: This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to $100$B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of $3$. For a $3$B parameter model, it improves the post-training performance by over $10%$ on several challenging reasoning benchmarks.

[127] SIM-CoT: Supervised Implicit Chain-of-Thought

Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, Dahua Lin

Main category: cs.CL

TL;DR: SIM-CoT addresses instability in implicit Chain-of-Thought methods by adding step-level supervision during training to stabilize latent representations, improving performance while maintaining token efficiency.

DetailsMotivation: Implicit CoT methods suffer from performance gaps due to latent instability when scaling computational budget, where training collapses as reasoning tokens increase due to homogeneous latent representations.

Method: Proposes SIM-CoT, a plug-and-play training module with auxiliary decoder that aligns implicit tokens with explicit reasoning steps, providing step-level supervision to enrich latent reasoning space. The auxiliary decoder is removed at inference.

Result: Significantly improves implicit CoT performance: +8.2% on GPT-2 with Coconut, +3.0% on LLaMA-3.1 8B with CODI. Surpasses explicit CoT baseline on GPT-2 by 2.1% with 2.3× greater token efficiency.

Conclusion: SIM-CoT effectively stabilizes implicit CoT training, closes performance gaps with explicit methods, and provides interpretability while maintaining inference efficiency.

Abstract: Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited their adoption. We identify a core latent instability issue when scaling the computational budget of implicit CoT: as the number of reasoning tokens increases, training often becomes unstable and collapses. Our analysis shows that this instability arises from latent representations becoming homogeneous and losing semantic diversity, caused by insufficient step-level supervision in current implicit CoT methods. To address this, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space. SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring latent states capture distinct and meaningful information. The auxiliary decoder is removed at inference, preserving the efficiency of implicit CoT with no added overhead. It also provides interpretability by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization and diagnosis. SIM-CoT significantly improves both in-domain accuracy and out-of-domain stability of implicit CoT methods, boosting Coconut by +8.2% on GPT-2 and CODI by +3.0% on LLaMA-3.1 8B. It further surpasses the explicit CoT baseline on GPT-2 by 2.1% with 2.3$\times$ greater token efficiency, while closing the performance gap on larger models like LLaMA-3.1 8B. Code: https://github.com/InternLM/SIM-CoT

cs.CV

[128] WAVECLIP: Wavelet Tokenization for Adaptive-Resolution CLIP

Moshe Kimhi, Erez Koifman, Ehud Rivlin, Eli Schwartz, Chaim Baskin

Main category: cs.CV

TL;DR: WAVECLIP is a unified CLIP model that uses wavelet-based tokenization for adaptive resolution inference, enabling dynamic compute-accuracy trade-offs through early exits.

DetailsMotivation: To create a single CLIP model that can efficiently handle multiple resolutions and enable adaptive computation, reducing inference costs while maintaining competitive accuracy.

Method: Replaces standard patch embeddings with multi-level wavelet decomposition, uses key-value caching and causal cross-level attention for computation reuse, and implements confidence-based gating for adaptive early exits.

Result: Achieves competitive accuracy with significant computational savings through lightweight distillation from a frozen CLIP teacher.

Conclusion: WAVECLIP provides an effective approach for adaptive resolution inference in CLIP, enabling users to dynamically balance compute and accuracy using a single deployed model.

Abstract: We introduce WAVECLIP, a single unified model for adaptive resolution inference in CLIP, enabled by wavelet-based tokenization. WAVECLIP replaces standard patch embeddings with a multi-level wavelet decomposition, enabling the model to process images coarse to fine while naturally supporting multiple resolutions within the same model. At inference time, the model begins with low resolution tokens and refines only when needed, using key-value caching and causal cross-level attention to reuse computation, effectively introducing to the model only new information when needed. We evaluate WAVECLIP in zero-shot classification, demonstrating that a simple confidence-based gating mechanism enables adaptive early exits. This allows users to dynamically choose a compute-accuracy trade-off using a single deployed model. Our approach requires only lightweight distillation from a frozen CLIP teacher and achieves competitive accuracy with significant computational savings.

[129] REArtGS: Reconstructing and Generating Articulated Objects via 3D Gaussian Splatting with Geometric and Motion Constraints

Di Wu, Liu Liu, Zhou Linli, Anran Huang, Liangtu Song, Qiaojun Yu, Qi Wu, Cewu Lu

Main category: cs.CV

TL;DR: REArtGS is a novel framework that uses geometric and motion constraints on 3D Gaussian primitives to achieve high-fidelity textured surface reconstruction and dynamic generation for articulated objects from multi-view RGB images.

DetailsMotivation: Existing methods struggle to achieve both high-fidelity textured surface reconstruction and dynamic generation for articulated objects, which are prevalent in human life and important for various applications.

Method: The framework introduces unbiased Signed Distance Field (SDF) guidance to regularize Gaussian opacity fields for better geometry constraints, and establishes deformable fields for 3D Gaussians constrained by kinematic structures to enable unsupervised generation of surface meshes in unseen states.

Result: Extensive experiments on synthetic and real datasets show the approach achieves high-quality textured surface reconstruction for given states and enables high-fidelity surface generation for unseen states.

Conclusion: REArtGS successfully addresses the challenge of achieving both realistic surface reconstruction and dynamic generation for articulated objects through geometric and motion constraints on 3D Gaussian primitives.

Abstract: Articulated objects, as prevalent entities in human life, their 3D representations play crucial roles across various applications. However, achieving both high-fidelity textured surface reconstruction and dynamic generation for articulated objects remains challenging for existing methods. In this paper, we present REArtGS, a novel framework that introduces additional geometric and motion constraints to 3D Gaussian primitives, enabling realistic surface reconstruction and generation for articulated objects. Specifically, given multi-view RGB images of arbitrary two states of articulated objects, we first introduce an unbiased Signed Distance Field (SDF) guidance to regularize Gaussian opacity fields, enhancing geometry constraints and improving surface reconstruction quality. Then we establish deformable fields for 3D Gaussians constrained by the kinematic structures of articulated objects, achieving unsupervised generation of surface meshes in unseen states. Extensive experiments on both synthetic and real datasets demonstrate our approach achieves high-quality textured surface reconstruction for given states, and enables high-fidelity surface generation for unseen states. Project site: https://sites.google.com/view/reartgs/home.

[130] Leveraging NTPs for Efficient Hallucination Detection in VLMs

Ofir Azachi, Kfir Eliyahu, Eyal El Ani, Rom Himelstein, Roi Reichart, Yuval Pinter, Nitay Calderon

Main category: cs.CV

TL;DR: The paper proposes a lightweight method for detecting hallucinations in vision-language models using next-token probabilities as uncertainty signals, achieving comparable performance to stronger VLMs while being computationally efficient.

DetailsMotivation: Current hallucination detection methods using VLMs are computationally intensive and increase model latency. The authors aim to develop an efficient on-the-fly detection method using traditional ML models.

Method: Train traditional ML models using features based on the VLM’s next-token probabilities (NTPs), which quantify model uncertainty. Augment with linguistic NTPs and integrate hallucination prediction scores from VLMs.

Result: NTP-based features are valuable predictors of hallucinations, enabling fast ML models to achieve performance comparable to strong VLMs. Augmentation with linguistic NTPs and VLM scores further improves detection performance.

Conclusion: The study demonstrates that simple, lightweight NTP-based solutions can effectively detect hallucinations in VLMs, paving the way for more reliable and efficient vision-language models.

Abstract: Hallucinations of vision-language models (VLMs), which are misalignments between visual content and generated text, undermine the reliability of VLMs. One common approach for detecting them employs the same VLM, or a different one, to assess generated outputs. This process is computationally intensive and increases model latency. In this paper, we explore an efficient on-the-fly method for hallucination detection by training traditional ML models over signals based on the VLM’s next-token probabilities (NTPs). NTPs provide a direct quantification of model uncertainty. We hypothesize that high uncertainty (i.e., a low NTP value) is strongly associated with hallucinations. To test this, we introduce a dataset of 1,400 human-annotated statements derived from VLM-generated content, each labeled as hallucinated or not, and use it to test our NTP-based lightweight method. Our results demonstrate that NTP-based features are valuable predictors of hallucinations, enabling fast and simple ML models to achieve performance comparable to that of strong VLMs. Furthermore, augmenting these NTPs with linguistic NTPs, computed by feeding only the generated text back into the VLM, enhances hallucination detection performance. Finally, integrating hallucination prediction scores from VLMs into the NTP-based models led to better performance than using either VLMs or NTPs alone. We hope this study paves the way for simple, lightweight solutions that enhance the reliability of VLMs.
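
To make the NTP idea concrete, here is a hedged sketch of the detection recipe: summarize each statement's next-token probabilities into a few uncertainty features and fit a lightweight classifier. The feature set and the placeholder data are illustrative, not the paper's exact design.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ntp_features(token_probs):
    """Fixed-length uncertainty features from one statement's NTPs."""
    p = np.asarray(token_probs, dtype=float)
    logp = np.log(p + 1e-12)
    return np.array([p.mean(), p.min(), p.max(),
                     logp.mean(),            # average log-likelihood
                     (p < 0.5).mean()])      # fraction of uncertain tokens

# Placeholder data: one NTP sequence per statement, with binary labels
# (1 = hallucinated). Real features would come from the VLM's decoder.
rng = np.random.default_rng(0)
ntp_sequences = [rng.uniform(0.2, 1.0, rng.integers(5, 40)) for _ in range(200)]
labels = rng.integers(0, 2, 200)

X = np.stack([ntp_features(s) for s in ntp_sequences])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
scores = clf.predict_proba(X)[:, 1]          # hallucination probability
```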

[131] Quasi-Synthetic Riemannian Data Generation for Writer-Independent Offline Signature Verification

Elias N. Zois, Moises Diaz, Salem Said, Miguel A. Ferrer

Main category: cs.CV

TL;DR: A quasi-synthetic data generation framework using Riemannian geometry of SPD matrices for writer-independent offline signature verification, achieving low error rates without real-world training data.

DetailsMotivation: Offline handwritten signature verification is challenging in writer-independent settings, and existing methods depend on real-world datasets for training. The paper aims to overcome this limitation by generating synthetic data in Riemannian spaces.

Method: Uses Riemannian Gaussian Mixture on SPD matrices to generate synthetic writers and their properties. Riemannian Gaussian sampling creates positive/negative synthetic SPD populations, followed by metric learning with similar/dissimilar pairs.

Result: Experiments on Western and Asian signature datasets show low error rates under intra- and cross-dataset evaluation protocols, demonstrating efficacy of the approach.

Conclusion: The quasi-synthetic approach highlights the potential of generating synthetic data in Riemannian spaces for writer-independent signature verification systems.

Abstract: Offline handwritten signature verification remains a challenging task, particularly in writer-independent settings where models must generalize across unseen individuals. Recent developments have highlighted the advantage of geometrically inspired representations, such as covariance descriptors on Riemannian manifolds. However, past and present methods, handcrafted or data-driven, usually depend on real-world signature datasets for classifier training. We introduce a quasi-synthetic data generation framework leveraging the Riemannian geometry of Symmetric Positive Definite (SPD) matrices. A small set of genuine samples in the SPD space seeds a Riemannian Gaussian Mixture, which identifies Riemannian centers as synthetic writers and variances as their properties. Riemannian Gaussian sampling on each center generates positive as well as negative synthetic SPD populations. A metric learning framework is trained on pairs of similar and dissimilar SPD points and subsequently tested on real-world datasets. Experiments conducted on two popular signature datasets, encompassing Western and Asian writing styles, demonstrate the efficacy of the proposed approach under both intra- and cross-dataset evaluation protocols. The results indicate that our quasi-synthetic approach achieves low error rates, highlighting the potential of generating synthetic data in Riemannian spaces for writer-independent signature verification systems.
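
For intuition, the sampling step can be sketched as follows: draw symmetric Gaussian noise in the tangent space at a synthetic writer's center and push it through the Riemannian exponential map of the affine-invariant metric, a common choice for SPD manifolds (the paper's exact metric and parameterization may differ).

```python
import numpy as np

def _sym_funm(A, f):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, Q = np.linalg.eigh(A)
    return (Q * f(w)) @ Q.T

def sample_spd(center, sigma, rng):
    """One sample around an SPD `center`: symmetric Gaussian tangent
    noise pushed through Exp_P(V) = P^1/2 expm(P^-1/2 V P^-1/2) P^1/2."""
    d = center.shape[0]
    G = rng.standard_normal((d, d))
    V = sigma * (G + G.T) / 2.0                      # tangent vector
    P_half = _sym_funm(center, np.sqrt)
    P_ihalf = _sym_funm(center, lambda w: 1.0 / np.sqrt(w))
    return P_half @ _sym_funm(P_ihalf @ V @ P_ihalf, np.exp) @ P_half

rng = np.random.default_rng(0)
writer_center = np.eye(4)            # hypothetical 4x4 SPD "writer" seed
positives = [sample_spd(writer_center, 0.1, rng) for _ in range(10)]
```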

[132] CompressAI-Vision: Open-source software to evaluate compression methods for computer vision tasks

Hyomin Choi, Heeji Han, Chris Rosewarne, Fabien Racapé

Main category: cs.CV

TL;DR: CompressAI-Vision is introduced as an open-source evaluation platform for video compression methods optimized for computer vision tasks, adopted by MPEG for FCM standard development.

DetailsMotivation: With increasing use of NN-based computer vision applications, there's a need for a consolidated platform to implement and evaluate compression methods optimized for downstream vision tasks due to the variety of vision tasks, models, and datasets.

Method: The platform provides a comprehensive evaluation framework where new coding tools compete to efficiently compress input data while retaining task accuracy in “remote” and “split” inferencing scenarios, incorporating standard codecs under development.

Result: The study showcases various use cases examining compression gain on several datasets in terms of bit-rate versus task accuracy, demonstrating the platform’s effectiveness.

Conclusion: CompressAI-Vision serves as a common ground for evaluating vision-optimized compression methods and has been adopted by MPEG for developing the Feature Coding for Machines standard, with the software publicly available as open-source.

Abstract: With the increasing use of neural network (NN)-based computer vision applications that process image and video data as input, interest has emerged in video compression technology optimized for computer vision tasks. In fact, given the variety of vision tasks, associated NN models, and datasets, a consolidated platform is needed as a common ground to implement and evaluate compression methods optimized for downstream vision tasks. CompressAI-Vision is introduced as a comprehensive evaluation platform where new coding tools compete to efficiently compress the input of a vision network while retaining task accuracy in the context of two different inference scenarios: “remote” and “split” inferencing. Our study showcases various use cases of the evaluation platform, incorporating standard codecs under development, by examining the compression gain on several datasets in terms of bit-rate versus task accuracy. This evaluation platform has been developed as open-source software and is adopted by the Moving Picture Experts Group (MPEG) for the development of the Feature Coding for Machines (FCM) standard. The software is available publicly at https://github.com/InterDigitalInc/CompressAI-Vision.

[133] Seedream 4.0: Toward Next-generation Multimodal Image Generation

Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wenxu Wu, Yonghui Wu, Xin Xia, Xuefeng Xiao, Shuang Xu, Xin Yan, Ceyuan Yang, Jianchao Yang, Zhonghua Zhai, Chenlin Zhang, Heng Zhang, Qi Zhang, Xinyu Zhang, Yuwei Zhang, Shijia Zhao, Wenliang Zhao, Wenjia Zhu

Main category: cs.CV

TL;DR: Seedream 4.0 is an efficient multimodal image generation system that unifies text-to-image synthesis, image editing, and multi-image composition in a single framework, achieving state-of-the-art performance with fast inference times.

DetailsMotivation: To create a unified framework that extends traditional text-to-image systems into more interactive and multidimensional creative tools, pushing the boundaries of generative AI for both creativity and professional applications.

Method: Developed a highly efficient diffusion transformer with a powerful VAE that reduces image tokens, enabling efficient training and fast generation of high-resolution images. Used comprehensive data collection across hundreds of scenarios, multi-modal post-training with fine-tuned VLM, and inference acceleration techniques including adversarial distillation, distribution matching, quantization, and speculative decoding.

Result: Achieves inference time of up to 1.8 seconds for generating 2K images (without LLM/VLM as PE model). Demonstrates state-of-the-art results on both T2I and multimodal image editing, with exceptional capabilities in complex tasks including precise image editing, in-context reasoning, multi-image reference, and multiple output image generation.

Conclusion: Seedream 4.0 successfully extends traditional T2I systems into a more interactive and multidimensional creative tool, representing a significant advancement in generative AI for both creative and professional applications.

Abstract: We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE that considerably reduces the number of image tokens. This allows for efficient training of our model and enables it to quickly generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training with strong generalization. By incorporating a carefully fine-tuned VLM model, we perform multi-modal post-training for both T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without an LLM/VLM as the PE model). Comprehensive evaluations reveal that Seedream 4.0 achieves state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning, allows for multi-image reference, and can generate multiple output images. This extends traditional T2I systems into a more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible at https://www.volcengine.com/experience/ark?launch=seedream.

[134] Nuclear Diffusion Models for Low-Rank Background Suppression in Videos

Tristan S. W. Stevens, Oisín Nolan, Jean-Luc Robert, Ruud J. G. van Sloun

Main category: cs.CV

TL;DR: Nuclear Diffusion combines low-rank temporal modeling with diffusion posterior sampling for video dehazing, outperforming traditional RPCA methods in medical imaging applications.

DetailsMotivation: Traditional robust principal component analysis (RPCA) methods fail to capture the rich variability in real video data due to their sparsity assumption, particularly when dealing with structured noise and background artifacts that obscure dynamic content in video sequences.

Method: A hybrid framework that integrates low-rank temporal modeling with diffusion posterior sampling, called Nuclear Diffusion, which combines model-based temporal models with deep generative priors.

Result: The method demonstrates improved dehazing performance on cardiac ultrasound data compared to traditional RPCA, with better contrast enhancement (gCNR) and signal preservation (KS statistic).

Conclusion: The results highlight the potential of combining model-based temporal models with deep generative priors for high-fidelity video restoration, particularly in medical imaging applications.

Abstract: Video sequences often contain structured noise and background artifacts that obscure dynamic content, posing challenges for accurate analysis and restoration. Robust principal component analysis (RPCA) methods address this by decomposing data into low-rank and sparse components. Still, the sparsity assumption often fails to capture the rich variability present in real video data. To overcome this limitation, we propose a hybrid framework that integrates low-rank temporal modeling with diffusion posterior sampling. The proposed method, Nuclear Diffusion, is evaluated on a real-world medical imaging problem, namely cardiac ultrasound dehazing, and demonstrates improved dehazing performance compared to traditional RPCA in terms of contrast enhancement (gCNR) and signal preservation (KS statistic). These results highlight the potential of combining model-based temporal models with deep generative priors for high-fidelity video restoration.
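
The low-rank half of such a hybrid is classical. The sketch below shows singular-value thresholding, the proximal operator of the nuclear norm, extracting a low-rank temporal background from stacked frames; the diffusion-posterior half of Nuclear Diffusion is omitted, and all shapes are illustrative.

```python
import numpy as np

def svt(Y, tau):
    """Singular-value thresholding: prox of the nuclear norm."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

# Illustrative "video": T frames of H*W pixels stacked as a (T, H*W) matrix.
frames = np.random.rand(32, 64 * 64)
background = svt(frames, tau=5.0)     # low-rank temporal component
residual = frames - background        # content left for the diffusion prior
```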

[135] A Contrastive Learning Framework for Breast Cancer Detection

Samia Saeed, Khuram Naveed

Main category: cs.CV

TL;DR: A contrastive learning framework using ResNet-50 achieves 96.7% accuracy in breast cancer detection on INbreast and MIAS datasets, addressing limited labeled data through semi-supervised training with unlabeled mammograms.

DetailsMotivation: Breast cancer is a major global health concern where early detection improves outcomes. Deep learning methods show promise but struggle with accuracy due to limited labeled datasets. The study aims to overcome this limitation using contrastive learning.

Method: Uses a semi-supervised contrastive learning approach with ResNet-50, trained on large unlabeled mammogram data using similarity index. Employs various augmentations and transformations to enhance performance, then fine-tunes on a small labeled dataset.

Result: Achieves 96.7% accuracy in breast cancer detection on benchmark datasets INbreast and MIAS, outperforming existing state-of-the-art methods.

Conclusion: The contrastive learning framework effectively addresses the challenge of limited labeled data in medical imaging, demonstrating superior performance for breast cancer detection with high accuracy using semi-supervised training.

Abstract: Breast cancer, the second leading cause of cancer-related deaths globally, accounts for a quarter of all cancer cases [1]. To lower this death rate, it is crucial to detect tumors early, as early-stage detection significantly improves treatment outcomes. Advances in non-invasive imaging techniques have made early detection possible through computer-aided detection (CAD) systems, which rely on traditional image analysis to identify malignancies. However, there is a growing shift towards deep learning methods due to their superior effectiveness. Despite their potential, deep learning methods often struggle with accuracy due to the limited availability of large labeled datasets for training. To address this issue, our study introduces a Contrastive Learning (CL) framework, which excels with smaller labeled datasets. Specifically, we train a ResNet-50 in a semi-supervised CL approach using a similarity index on a large amount of unlabeled mammogram data, employing various augmentations and transformations that help improve the performance of our approach. Finally, we fine-tune our model on a small set of labeled data, outperforming the existing state of the art. In particular, we observed 96.7% accuracy in detecting breast cancer on the benchmark datasets INbreast and MIAS.
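
As an illustration of the pretraining objective, here is a standard NT-Xent contrastive loss over two augmented views of the same unlabeled mammograms; the paper's similarity-index formulation may differ in detail.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent loss: z1, z2 are (N, D) embeddings of two augmented views
    of the same N images; positives sit across the two halves."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)        # 2N x D
    sim = z @ z.T / temperature                         # cosine similarities
    n = z1.shape[0]
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float('-inf'))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))  # toy usage
```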

[136] Are Foundation Models Ready for Industrial Defect Recognition? A Reality Check on Real-World Data

Simon Baeuerle, Pratik Khanna, Nils Friederich, Angelo Jovin Yamachui Sitcheu, Damir Shakirov, Andreas Steimer, Ralf Mikut

Main category: cs.CV

TL;DR: Foundation Models (FMs) show promise for automated quality inspection in manufacturing but fail on real-world industrial data despite performing well on public benchmarks.

DetailsMotivation: To leverage FMs' zero-shot generalization capabilities for automated quality inspection in series manufacturing, replacing tedious labeling tasks with simple text prompts and enabling model reuse across multiple products.

Method: Tested multiple recent Foundation Models on both custom real-world industrial image data and public image datasets to evaluate their performance in zero-shot settings.

Result: All tested Foundation Models failed on real-world industrial image data, while the same models performed well on public benchmark datasets.

Conclusion: Current Foundation Models are not yet suitable for real-world industrial quality inspection applications despite their promising zero-shot capabilities on benchmark datasets.

Abstract: Foundation Models (FMs) have shown impressive performance on various text and image processing tasks. They can generalize across domains and datasets in a zero-shot setting. This could make them suitable for automated quality inspection during series manufacturing, where various types of images are being evaluated for many different products. Replacing tedious labeling tasks with a simple text prompt to describe anomalies and utilizing the same models across many products would save significant efforts during model setup and implementation. This is a strong advantage over supervised Artificial Intelligence (AI) models, which are trained for individual applications and require labeled training data. We test multiple recent FMs on both custom real-world industrial image data and public image data. We show that all of those models fail on our real-world data, while the very same models perform well on public benchmark datasets.

[137] Shared Neural Space: Unified Precomputed Feature Encoding for Multi-Task and Cross Domain Vision

Jing Li, Oskar Bartosz, Chengyu Wang, Michal Wnuczynski, Dilshan Godaliyadda, Michael Polley

Main category: cs.CV

TL;DR: A universal Neural Space (NS) framework using encoder-decoder architecture to share features across multiple vision tasks, reducing redundancy and improving generalization with a lightweight CNN backbone.

DetailsMotivation: Current AI models are inefficient for modular vision tasks as each requires separate latent domain mapping, leading to redundancy and poor generalization across domain shifts.

Method: Proposed encoder-decoder framework that pre-computes transformation-aware, generalizable representations in a shared Neural Space, enabling multiple downstream AI modules to use the same feature space with a lightweight CNN backbone.

Result: Demonstrated that various imaging and vision tasks (demosaicing, denoising, depth estimation, semantic segmentation) can be performed efficiently in the Neural Space.

Conclusion: The Neural Space architecture establishes an efficient foundation for multi-task vision pipelines, reducing redundancy and improving generalization while being hardware-friendly due to its lightweight CNN design.

Abstract: The majority of AI models in imaging and vision are customized to perform a specific high-precision task. However, this strategy is inefficient for applications with a series of modular tasks, since each requires a mapping into a disparate latent domain. To address this inefficiency, we propose a universal Neural Space (NS), in which an encoder-decoder framework pre-computes features across vision and imaging tasks. Our encoder learns transformation-aware, generalizable representations, which enable multiple downstream AI modules to share the same feature space. This architecture reduces redundancy, improves generalization across domain shift, and establishes a foundation for efficient multi-task vision pipelines. Furthermore, as opposed to larger transformer backbones, our backbone is lightweight and CNN-based, allowing for wider deployment across hardware. We further demonstrate that imaging and vision modules, such as demosaicing, denoising, depth estimation, and semantic segmentation, can be performed efficiently in the NS.
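
A toy version of the shared-feature idea: one lightweight CNN encoder computes the Neural Space once, and per-task heads decode from it. Layer sizes and task heads below are hypothetical stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SharedNS(nn.Module):
    """Minimal sketch of a shared Neural Space with per-task heads."""
    def __init__(self, ns_channels=64):
        super().__init__()
        self.encoder = nn.Sequential(                 # computed once, reused
            nn.Conv2d(3, ns_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ns_channels, ns_channels, 3, padding=1), nn.ReLU())
        self.heads = nn.ModuleDict({                  # illustrative heads
            'denoise': nn.Conv2d(ns_channels, 3, 3, padding=1),
            'segment': nn.Conv2d(ns_channels, 21, 1),
            'depth':   nn.Conv2d(ns_channels, 1, 3, padding=1)})

    def forward(self, x, task):
        return self.heads[task](self.encoder(x))     # shared feature space

model = SharedNS()
outs = {t: model(torch.rand(1, 3, 64, 64), t)
        for t in ('denoise', 'segment', 'depth')}
```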

[138] Data-Efficient Stream-Based Active Distillation for Scalable Edge Model Deployment

Dani Manjah, Tim Bary, Benoît Gérin, Benoît Macq, Christophe de Vleeschouwer

Main category: cs.CV

TL;DR: The paper proposes a strategy for selecting the most useful images for training edge models by combining high-confidence stream-based selection with diversity-based approaches to maximize model quality while minimizing transmission costs.

DetailsMotivation: Edge camera systems need regular model updates but face computational constraints. Current practice uses central servers with complex teacher models to annotate data for training smaller edge models, but this requires efficient image selection to balance model quality and transmission costs.

Method: A hybrid approach combining high-confidence stream-based image selection with diversity-based methods to identify the most valuable training samples while keeping dataset queries minimal.

Result: The proposed strategy produces high-quality models with similar training iterations while significantly reducing the number of dataset queries compared to traditional approaches.

Conclusion: Combining high-confidence stream-based selection with diversity-based approaches effectively maximizes edge model quality while minimizing transmission overhead, making it practical for resource-constrained edge camera systems.

Abstract: Edge camera-based systems are continuously expanding, facing ever-evolving environments that require regular model updates. In practice, complex teacher models are run on a central server to annotate data, which is then used to train smaller models tailored to the edge devices with limited computational power. This work explores how to select the most useful images for training to maximize model quality while keeping transmission costs low. Our work shows that, for a similar training load (i.e., iterations), a high-confidence stream-based strategy coupled with a diversity-based approach produces a high-quality model with minimal dataset queries.
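
One way to realize the described selection policy is to filter the stream by confidence and then pick a diverse subset by farthest-point sampling in an embedding space. The sketch below is a plausible reading of that strategy, with illustrative names and thresholds.

```python
import numpy as np

def select_for_labeling(confidences, embeddings, budget, conf_thresh=0.8):
    """Keep high-confidence frames, then greedily pick a diverse subset
    (farthest-point sampling). Names and threshold are illustrative."""
    keep = np.where(confidences >= conf_thresh)[0]
    chosen = [keep[0]]
    dists = np.linalg.norm(embeddings[keep] - embeddings[keep[0]], axis=1)
    while len(chosen) < min(budget, len(keep)):
        nxt = keep[np.argmax(dists)]                  # farthest remaining
        chosen.append(nxt)
        dists = np.minimum(
            dists, np.linalg.norm(embeddings[keep] - embeddings[nxt], axis=1))
    return chosen    # indices of frames to send to the teacher for labels
```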

[139] InstructVTON: Optimal Auto-Masking and Natural-Language-Guided Interactive Style Control for Inpainting-Based Virtual Try-On

Julien Han, Shuwen Qiu, Qi Li, Xingzi Xu, Mehmet Saygin Seyfioglu, Kavosh Asadi, Karim Bouyarmane

Main category: cs.CV

TL;DR: InstructVTON is an instruction-following virtual try-on system that uses natural language instructions to control garment styling, eliminating the need for manual mask creation and enabling complex styling operations that traditional masking-based approaches cannot handle.

DetailsMotivation: Traditional virtual try-on models rely on binary masks for layout control, which are difficult to create, require background knowledge, are model-dependent, and cannot handle complex styling scenarios like changing sleeve lengths or rolling up sleeves.

Method: InstructVTON leverages Vision Language Models (VLMs) and image segmentation models to automatically generate binary masks based on user-provided images and free-text style instructions, enabling automated execution of multiple generation rounds for complex try-on scenarios.

Result: The system achieves state-of-the-art results with styling control and is interoperable with existing virtual try-on models, demonstrating the ability to handle fine-grained and complex styling operations that were previously impossible with masking-based approaches.

Conclusion: InstructVTON simplifies the virtual try-on experience by removing the need for precise mask drawing and enabling natural language-guided styling control, making complex garment transformations accessible through simple text instructions.

Abstract: We present InstructVTON, an instruction-following interactive virtual try-on system that allows fine-grained and complex styling control of the resulting generation, guided by natural language, on single or multiple garments. A computationally efficient and scalable formulation of virtual try-on casts the problem as an image-guided or image-conditioned inpainting task. These inpainting-based virtual try-on models commonly use a binary mask to control the generation layout. Producing a mask that yields a desirable result is difficult: it requires background knowledge, may be model-dependent, and is in some cases impossible with a masking-based approach (e.g., trying on a long-sleeve shirt with “sleeves rolled up” styling on a person wearing a long-sleeve shirt with sleeves down, where the mask will necessarily cover the entire sleeve). InstructVTON leverages Vision Language Models (VLMs) and image segmentation models for automated binary mask generation. These masks are generated based on user-provided images and free-text style instructions. InstructVTON simplifies the end-user experience by removing the necessity of a precisely drawn mask, and by automating the execution of multiple rounds of image generation for try-on scenarios that cannot be achieved with masking-based virtual try-on models alone. We show that InstructVTON is interoperable with existing virtual try-on models to achieve state-of-the-art results with styling control.

[140] Innovative Deep Learning Architecture for Enhanced Altered Fingerprint Recognition

Dana A Abdullah, Dana Rasul Hamad, Bishar Rasheed Ibrahim, Sirwan Abdulwahid Aula, Aso Khaleel Ameen, Sabat Salih Hamadamin

Main category: cs.CV

TL;DR: DeepAFRNet is a deep learning model for recognizing altered fingerprints using VGG16 backbone and cosine similarity, achieving high accuracy (96.7-99.54%) on real altered fingerprint datasets with strict thresholds.

DetailsMotivation: Altered fingerprint recognition is crucial for biometric verification in border control, forensics, and fiscal admission to prevent adversaries from evading detection through deliberate ridge pattern modifications.

Method: Uses VGG16 backbone to extract high-dimensional features and cosine similarity to compare embeddings, evaluated on SOCOFing Real-Altered subset with three difficulty levels (Easy, Medium, Hard).

Result: With strict thresholds (0.92), achieves accuracies of 96.7%, 98.76%, and 99.54% for Easy, Medium, and Hard levels respectively. Threshold sensitivity study shows dramatic accuracy degradation (7.86-29.51%) when threshold is relaxed to 0.72.

Conclusion: DeepAFRNet addresses limitations of prior work by using real altered samples and comprehensive evaluation, demonstrating readiness for real-world deployments requiring both security and recognition resilience.

Abstract: Altered fingerprint recognition (AFR) is challenging for biometric verification in applications such as border control, forensics, and fiscal admission. Adversaries can deliberately modify ridge patterns to evade detection, so robust recognition of altered prints is essential. We present DeepAFRNet, a deep learning recognition model that matches and recognizes distorted fingerprint samples. The approach uses a VGG16 backbone to extract high-dimensional features and cosine similarity to compare embeddings. We evaluate on the SOCOFing Real-Altered subset with three difficulty levels (Easy, Medium, Hard). With strict thresholds, DeepAFRNet achieves accuracies of 96.7 percent, 98.76 percent, and 99.54 percent for the three levels. A threshold-sensitivity study shows that relaxing the threshold from 0.92 to 0.72 sharply degrades accuracy to 7.86 percent, 27.05 percent, and 29.51 percent, underscoring the importance of threshold selection in biometric systems. By using real altered samples and reporting per-level metrics, DeepAFRNet addresses limitations of prior work based on synthetic alterations or limited verification protocols, and indicates readiness for real-world deployments where both security and recognition resilience are critical.
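
The matching rule itself reduces to a thresholded cosine similarity between backbone embeddings. A minimal sketch, with the 0.92 threshold taken from the paper and embedding shapes assumed:

```python
import numpy as np

def verify(probe_emb, ref_emb, threshold=0.92):
    """Cosine-similarity verification over VGG16-style embeddings
    (1-D feature vectors assumed; strict threshold per the paper)."""
    a = probe_emb / np.linalg.norm(probe_emb)
    b = ref_emb / np.linalg.norm(ref_emb)
    score = float(a @ b)
    return score >= threshold, score

# Relaxing the threshold (e.g., to 0.72) admits far more impostor pairs,
# which is the accuracy collapse the sensitivity study reports.
```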

[141] Large Pre-Trained Models for Bimanual Manipulation in 3D

Hanna Yurchyk, Wei-Di Chang, Gregory Dudek, David Meger

Main category: cs.CV

TL;DR: Integrating Vision Transformer attention maps into voxel representations improves bimanual robotic manipulation by 8.2% absolute and 21.9% relative gains on RLBench benchmark.

DetailsMotivation: To enhance bimanual robotic manipulation by incorporating semantic cues from pre-trained vision models into 3D voxel representations for better policy learning.

Method: Extract attention maps from DINOv2 (self-supervised ViT), interpret them as pixel-level saliency scores, lift them into 3D voxel grid, and incorporate into behavior cloning policy.

Result: Average absolute improvement of 8.2% and relative gain of 21.9% across all tasks in RLBench bimanual benchmark when integrated into state-of-the-art voxel-based policy.

Conclusion: Attention-guided featurization from pre-trained vision models significantly enhances robotic manipulation performance by providing semantic cues in voxel representations.

Abstract: We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are lifted into a 3D voxel grid, resulting in voxel-level semantic cues that are incorporated into a behavior cloning policy. When integrated into a state-of-the-art voxel-based policy, our attention-guided featurization yields an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks in the RLBench bimanual benchmark.
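
The lifting step can be pictured as back-projecting per-pixel saliency into a voxel grid. The sketch below assumes a precomputed saliency map (e.g., head-averaged DINOv2 attention), a depth map, and a 3x3 intrinsics tensor K; coordinate conventions and the accumulation scheme are illustrative.

```python
import torch

def lift_saliency_to_voxels(saliency, depth, K, grid_size, voxel_res):
    """Back-project (H, W) saliency into a cubic voxel grid using depth
    and pinhole intrinsics K; out-of-grid points are clamped (sketch)."""
    H, W = saliency.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    z = depth.flatten()
    x = (u.flatten() - K[0, 2]) * z / K[0, 0]
    y = (v.flatten() - K[1, 2]) * z / K[1, 1]
    pts = torch.stack([x, y, z], dim=1)              # camera-frame points
    idx = (pts / voxel_res).long().clamp(0, grid_size - 1)
    grid = torch.zeros(grid_size, grid_size, grid_size)
    flat = (idx[:, 0] * grid_size + idx[:, 1]) * grid_size + idx[:, 2]
    grid.view(-1).scatter_add_(0, flat, saliency.flatten())
    return grid                                       # voxel-level cues
```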

[142] A Comparative Benchmark of Real-time Detectors for Blueberry Detection towards Precision Orchard Management

Xinyang Mu, Yuzhen Lu, Boyang Deng

Main category: cs.CV

TL;DR: Comparative benchmark analysis of 36 real-time object detectors (YOLO v8-v12 and RT-DETR v1-v2) for blueberry detection, evaluated on a new dataset with 85,879 labeled instances, achieving up to 94.8% mAP@50 with semi-supervised learning.

DetailsMotivation: Blueberry detection in natural environments is challenging due to variable lighting, occlusions, and motion blur. Deep learning detectors need large-scale datasets and optimal accuracy/speed/memory trade-offs for practical deployment.

Method: Evaluated 36 model variants on a curated dataset of 661 canopy images with 85,879 labeled blueberries. Applied Unbiased Mean Teacher-based semi-supervised learning on 1,035 unlabeled images for performance enhancement.

Result: YOLOv12m achieved 93.3% mAP@50, RT-DETRv2-X achieved 93.6% mAP@50. SSL fine-tuning improved accuracy by up to 2.9%, with RT-DETR-v2-X reaching 94.8% mAP@50. Mid-sized models offered best accuracy-speed balance.

Conclusion: The study provides a comprehensive benchmark for real-time blueberry detection. SSL shows potential but needs further research for cross-domain data utilization. Dataset and software are publicly available for future research.

Abstract: Blueberry detection in natural environments remains challenging due to variable lighting, occlusions, and motion blur caused by environmental factors and imaging devices. Deep learning-based object detectors promise to address these challenges, but they demand a large-scale, diverse dataset that captures real-world complexities. Moreover, deploying these models in practical scenarios often requires the right accuracy/speed/memory trade-off in model selection. This study presents a novel comparative benchmark analysis of advanced real-time object detectors, including the YOLO (You Only Look Once) (v8-v12) and RT-DETR (Real-Time Detection Transformers) (v1-v2) families, consisting of 36 model variants, evaluated on a newly curated dataset for blueberry detection. This dataset comprises 661 canopy images collected with smartphones during the 2022-2023 seasons, consisting of 85,879 labelled instances (including 36,256 ripe and 49,623 unripe blueberries) across a wide range of lighting conditions, occlusions, and fruit maturity stages. Among the YOLO models, YOLOv12m achieved the best accuracy with a mAP@50 of 93.3%, while RT-DETRv2-X obtained a mAP@50 of 93.6%, the highest among all the RT-DETR variants. The inference time varied with model scale and complexity, and the mid-sized models appeared to offer a good accuracy-speed balance. To further enhance detection performance, all the models were fine-tuned using Unbiased Mean Teacher-based semi-supervised learning (SSL) on a separate set of 1,035 unlabeled images acquired by a ground-based machine vision platform in 2024. This resulted in accuracy gains ranging from -1.4% to 2.9%, with RT-DETR-v2-X achieving the best mAP@50 of 94.8%. More in-depth research into SSL is needed to better leverage cross-domain unlabeled data. Both the dataset and the software programs of this study are made publicly available to support further research.

[143] Region-of-Interest Augmentation for Mammography Classification under Patient-Level Cross-Validation

Farbod Bigdeli, Mohsen Mohammadagha, Ali Bigdeli

Main category: cs.CV

TL;DR: A lightweight ROI augmentation strategy for mammography classification using Mini-DDSM dataset shows modest ROC-AUC improvements without inference-time costs.

DetailsMotivation: Deep learning for mammogram interpretation is limited by small datasets and resolution constraints. The paper aims to enhance performance through simple data-centric augmentation without architectural changes or additional labels.

Method: Introduces ROI augmentation where full mammogram images are probabilistically replaced with random ROI crops from a precomputed bounding-box bank during training, with optional jitter. Evaluated using patient-level cross-validation on Mini-DDSM dataset (9,684 images; 2,414 patients).

Result: ROI augmentation (best parameters: p_roi = 0.10, alpha = 0.10) yields modest average ROC-AUC gains with performance varying across folds. PR-AUC remains flat to slightly lower. The method maintains training efficiency with unchanged inference-time costs.

Conclusion: Simple ROI augmentation strategies can enhance mammography classification in constrained settings, demonstrating the value of data-centric approaches without requiring additional labels or architectural modifications.

Abstract: Breast cancer screening with mammography remains central to early detection and mortality reduction. Deep learning has shown strong potential for automating mammogram interpretation, yet limited-resolution datasets and small sample sizes continue to restrict performance. We revisit the Mini-DDSM dataset (9,684 images; 2,414 patients) and introduce a lightweight region-of-interest (ROI) augmentation strategy. During training, full images are probabilistically replaced with random ROI crops sampled from a precomputed, label-free bounding-box bank, with optional jitter to increase variability. We evaluate under strict patient-level cross-validation and report ROC-AUC, PR-AUC, and training-time efficiency metrics (throughput and GPU memory). Because ROI augmentation is training-only, inference-time cost remains unchanged. On Mini-DDSM, ROI augmentation (best: p_roi = 0.10, alpha = 0.10) yields modest average ROC-AUC gains, with performance varying across folds; PR-AUC is flat to slightly lower. These results demonstrate that simple, data-centric ROI strategies can enhance mammography classification in constrained settings without requiring additional labels or architectural modifications.
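
Because the augmentation is a training-only input swap, it amounts to a few lines in a data loader. A minimal sketch with the paper's best parameters (p_roi = 0.10, alpha = 0.10); the bank format and boundary clamping are assumptions.

```python
import random

def roi_augment(image, roi_bank, p_roi=0.10, alpha=0.10):
    """Training-only swap: with probability p_roi, replace the full image
    with a jittered crop from a precomputed (x, y, w, h) box bank."""
    if random.random() >= p_roi:
        return image                               # keep full mammogram
    x, y, w, h = random.choice(roi_bank)
    dx = int(w * random.uniform(-alpha, alpha))    # jitter by up to alpha
    dy = int(h * random.uniform(-alpha, alpha))
    H, W = image.shape[:2]
    x0 = min(max(x + dx, 0), max(W - w, 0))        # clamp to image bounds
    y0 = min(max(y + dy, 0), max(H - h, 0))
    return image[y0:y0 + h, x0:x0 + w]             # numpy-style H x W crop
```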

[144] Reflect3r: Single-View 3D Stereo Reconstruction Aided by Mirror Reflections

Jing Wu, Zirui Wang, Iro Laina, Victor Adrian Prisacariu

Main category: cs.CV

TL;DR: Using mirror reflections as auxiliary views to create multi-view stereo from single images for 3D reconstruction

DetailsMotivation: Mirror reflections provide stereo information within a single capture, allowing simplified imaging process and compatibility with feed-forward reconstruction models

Method: Treat mirror reflection as auxiliary view, design transformation to construct physically valid virtual camera, propose symmetric-aware loss for pose refinement, extend to dynamic scenes

Result: Enables multi-view stereo setup from single image, robust 3D reconstruction, efficient per-frame geometry recovery in dynamic scenes

Conclusion: Framework effectively exploits mirror reflections for generalizable and robust 3D reconstruction from single images

Abstract: Mirror reflections are common in everyday environments and can provide stereo information within a single capture, as the real and reflected virtual views are visible simultaneously. We exploit this property by treating the reflection as an auxiliary view and designing a transformation that constructs a physically valid virtual camera, allowing direct pixel-domain generation of the virtual view while adhering to the real-world imaging process. This enables a multi-view stereo setup from a single image, simplifying the imaging process, making it compatible with powerful feed-forward reconstruction models for generalizable and robust 3D reconstruction. To further exploit the geometric symmetry introduced by mirrors, we propose a symmetric-aware loss to refine pose estimation. Our framework also naturally extends to dynamic scenes, where each frame contains a mirror reflection, enabling efficient per-frame geometry recovery. For quantitative evaluation, we provide a fully customizable synthetic dataset of 16 Blender scenes, each with ground-truth point clouds and camera poses. Extensive experiments on real-world data and synthetic data are conducted to illustrate the effectiveness of our method.
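
The geometric core is a Householder reflection: the virtual camera's center is the real center mirrored across the plane. A small numpy sketch under an assumed plane parameterization {x : n·x = d}:

```python
import numpy as np

def reflect_points(X, n, d):
    """Reflect 3D points across the mirror plane {x : n.x = d}, n unit."""
    return X - 2.0 * ((X @ n) - d)[:, None] * n

# The virtual camera sees the reflected scene, so its center is the
# reflection of the real one. Note the map flips handedness, which a
# full implementation must account for in the virtual camera's rotation.
n = np.array([0.0, 0.0, 1.0])                  # hypothetical mirror normal
d = 2.0                                        # plane offset along n
cam_center = np.array([[0.5, 0.0, 0.0]])
virtual_center = reflect_points(cam_center, n, d)   # -> [[0.5, 0.0, 4.0]]
```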

[145] Recov-Vision: Linking Street View Imagery and Vision-Language Models for Post-Disaster Recovery

Yiming Xiao, Archit Gupta, Miguel Esparza, Yu-Hsuan Ho, Antonia Sebastian, Hannah Weas, Rose Houck, Ali Mostafavi

Main category: cs.CV

TL;DR: FacadeTrack is a street-level framework that uses language-guided panoramic video analysis to assess building occupancy after disasters, achieving high precision and recall through two decision strategies.

DetailsMotivation: Current methods for post-disaster building occupancy assessment have limitations - overhead imagery misses facade details while street-view imagery is sparse and hard to align with parcels. There's a need for more accurate, interpretable occupancy assessment.

Method: FacadeTrack links panoramic video to parcels, rectifies views to facades, and extracts interpretable attributes (entry blockage, temporary coverings, debris). It uses two decision strategies: a transparent one-stage rule and a two-stage design separating perception from conservative reasoning.

Result: Evaluated on two post-Hurricane Helene surveys, the two-stage approach achieved a precision of 0.927, recall of 0.781, and F-1 score of 0.848, trading a small amount of precision for notably higher recall and F-1 relative to the one-stage baseline (precision 0.943, recall 0.728, F-1 0.822).

Conclusion: FacadeTrack provides auditable, scalable occupancy assessments with interpretable intermediate attributes and spatial diagnostics, suitable for integration into emergency management workflows.

Abstract: Building-level occupancy after disasters is vital for triage, inspections, utility re-energization, and equitable resource allocation. Overhead imagery provides rapid coverage but often misses facade and access cues that determine habitability, while street-view imagery captures those details but is sparse and difficult to align with parcels. We present FacadeTrack, a street-level, language-guided framework that links panoramic video to parcels, rectifies views to facades, and elicits interpretable attributes (for example, entry blockage, temporary coverings, localized debris) that drive two decision strategies: a transparent one-stage rule and a two-stage design that separates perception from conservative reasoning. Evaluated across two post-Hurricane Helene surveys, the two-stage approach achieves a precision of 0.927, a recall of 0.781, and an F-1 score of 0.848, compared with the one-stage baseline at a precision of 0.943, a recall of 0.728, and an F-1 score of 0.822. Beyond accuracy, intermediate attributes and spatial diagnostics reveal where and why residual errors occur, enabling targeted quality control. The pipeline provides auditable, scalable occupancy assessments suitable for integration into geospatial and emergency-management workflows.

[146] Human Semantic Representations of Social Interactions from Moving Shapes

Yiling Yun, Hongjing Lu

Main category: cs.CV

TL;DR: Humans use semantic representations to complement visual features when recognizing social interactions from moving shapes. Semantic models, particularly verb-based embeddings, better explain human similarity judgments than visual features alone.

DetailsMotivation: To understand what semantic representations humans employ alongside visual features when recognizing social interactions from simple animations of moving shapes.

Method: Study 1: Direct labeling of animations by human participants. Study 2: Measuring representational geometry of 27 social interactions through human similarity judgments, comparing visual features, labels, and semantic embeddings from descriptions.

Result: Human responses were distributed in Study 1. In Study 2, semantic models provided complementary information to visual features, with verb-based embeddings from descriptions accounting for human similarity judgments the best.

Conclusion: Social perception in simple displays reflects the semantic structure of social interactions, bridging visual and abstract representations.

Abstract: Humans are social creatures who readily recognize various social interactions from simple displays of moving shapes. While previous research has often focused on visual features, we examine which semantic representations humans employ to complement visual features. In Study 1, we directly asked human participants to label the animations based on their impression of the moving shapes and found that human responses were distributed. In Study 2, we measured the representational geometry of 27 social interactions through human similarity judgments and compared it with model predictions based on visual features, labels, and semantic embeddings from animation descriptions. We found that semantic models provided information complementary to visual features in explaining human judgments. Among the semantic models, verb-based embeddings extracted from descriptions accounted for human similarity judgments best. These results suggest that social perception in simple displays reflects the semantic structure of social interactions, bridging visual and abstract representations.
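
The Study 2 comparison follows standard representational similarity analysis: correlate human similarity judgments with model-predicted similarities over the 27 interactions. A hedged sketch with placeholder inputs:

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import pdist

def rsa_fit(human_sim, model_feats):
    """Spearman correlation between human dissimilarities and model
    cosine distances over the same items (condensed upper triangles)."""
    model_dist = pdist(model_feats, metric='cosine')
    human_dist = 1.0 - human_sim[np.triu_indices_from(human_sim, k=1)]
    return spearmanr(human_dist, model_dist).correlation

# Placeholder inputs: human_sim is a 27x27 symmetric judgment matrix;
# model_feats holds 27 embeddings (e.g., verb-based) of dimension D.
```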

[147] Enhancing Cross-View Geo-Localization Generalization via Global-Local Consistency and Geometric Equivariance

Xiaowei Wang, Di Wang, Ke Li, Yifeng Wang, Chengjian Wang, Libin Sun, Zhihong Wu, Yiming Zhang, Quan Wang

Main category: cs.CV

TL;DR: EGS is a novel cross-view geo-localization framework that uses E(2)-Steerable CNN encoder and graph-based global-local consistency to improve robustness under severe viewpoint variations and cross-domain generalization.

DetailsMotivation: Existing CVGL methods struggle with robustness under severe appearance variations from different UAV orientations/fields of view, and lack reliable correspondences that capture both global semantics and local details.

Method: Proposes EGS framework with: (1) E(2)-Steerable CNN encoder for stable feature extraction under rotation/viewpoint shifts, (2) Graph with virtual super-node connecting all local nodes to aggregate global semantics and enforce global-local consistency.

Result: Extensive experiments on University-1652 and SUES-200 benchmarks show EGS achieves substantial performance gains and establishes new state-of-the-art in cross-domain CVGL.

Conclusion: EGS effectively addresses cross-domain generalization challenges in CVGL through rotation-invariant feature extraction and global-local consistency modeling, demonstrating superior performance on standard benchmarks.

Abstract: Cross-view geo-localization (CVGL) aims to match images of the same location captured from drastically different viewpoints. Despite recent progress, existing methods still face two key challenges: (1) achieving robustness under severe appearance variations induced by diverse UAV orientations and fields of view, which hinders cross-domain generalization, and (2) establishing reliable correspondences that capture both global scene-level semantics and fine-grained local details. In this paper, we propose EGS, a novel CVGL framework designed to enhance cross-domain generalization. Specifically, we introduce an E(2)-Steerable CNN encoder to extract stable and reliable features under rotation and viewpoint shifts. Furthermore, we construct a graph with a virtual super-node that connects to all local nodes, enabling global semantics to be aggregated and redistributed to local regions, thereby enforcing global-local consistency. Extensive experiments on the University-1652 and SUES-200 benchmarks demonstrate that EGS consistently achieves substantial performance gains and establishes a new state of the art in cross-domain CVGL.

[148] DENet: Dual-Path Edge Network with Global-Local Attention for Infrared Small Target Detection

Jiayi Zuo, Songwei Pei, Qian Li

Main category: cs.CV

TL;DR: A Dual-Path Edge Network for infrared small target detection that separates edge enhancement and semantic modeling to address the conflict between capturing spatial details and extracting semantic context.

DetailsMotivation: Infrared small targets lack distinctive features and blend into noisy backgrounds, creating challenges for deep models that need to balance high-resolution spatial details for small targets with robust semantic context for larger targets.

Method: Proposes a Dual-Path Edge Network with two complementary paths: 1) Bidirectional Interaction Module using Local and Global Self-Attention for multi-scale feature dependencies, and 2) Multi-Edge Refiner using cascaded Taylor finite difference operators with attention gating for precise edge localization.

Result: The method provides precise infrared small target detection and localization by effectively combining structural semantics and edge refinement.

Conclusion: The proposed framework offers a promising solution that addresses feature misalignment issues in infrared small target detection through explicit decoupling of edge enhancement and semantic modeling.

Abstract: Infrared small target detection is crucial for remote sensing applications like disaster warning and maritime surveillance. However, due to the lack of distinctive texture and morphological features, infrared small targets are highly susceptible to blending into cluttered and noisy backgrounds. A fundamental challenge in designing deep models for this task lies in the inherent conflict between capturing high-resolution spatial details for minute targets and extracting robust semantic context for larger targets, often leading to feature misalignment and suboptimal performance. Existing methods often rely on fixed gradient operators or simplistic attention mechanisms, which are inadequate for accurately extracting target edges under low contrast and high noise. In this paper, we propose a novel Dual-Path Edge Network that explicitly addresses this challenge by decoupling edge enhancement and semantic modeling into two complementary processing paths. The first path employs a Bidirectional Interaction Module, which uses both Local Self-Attention and Global Self-Attention to capture multi-scale local and global feature dependencies. The global attention mechanism, based on a Transformer architecture, integrates long-range semantic relationships and contextual information, ensuring robust scene understanding. The second path introduces the Multi-Edge Refiner, which enhances fine-grained edge details using cascaded Taylor finite difference operators at multiple scales. This mathematical approach, along with an attention-driven gating mechanism, enables precise edge localization and feature enhancement for targets of varying sizes, while effectively suppressing noise. Our method provides a promising solution for precise infrared small target detection and localization, combining structural semantics and edge refinement in a unified framework.
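
Taylor finite-difference edge operators are fixed convolution kernels derived from Taylor expansions. The sketch below uses fourth-order central-difference coefficients at a single scale as a stand-in for the paper's cascaded multi-scale, attention-gated design:

```python
import torch
import torch.nn.functional as F

# Fourth-order central-difference coefficients for the first derivative,
# obtained from a Taylor expansion (illustrative single-scale version).
coeffs = torch.tensor([1/12, -2/3, 0.0, 2/3, -1/12])

def taylor_edges(img):
    """img: (B, 1, H, W) grayscale. Gradient magnitude from high-order
    horizontal/vertical finite differences applied as fixed convolutions."""
    kx = coeffs.view(1, 1, 1, 5)
    ky = coeffs.view(1, 1, 5, 1)
    gx = F.conv2d(img, kx, padding=(0, 2))
    gy = F.conv2d(img, ky, padding=(2, 0))
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)

edges = taylor_edges(torch.rand(1, 1, 64, 64))   # toy usage
```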

[149] Beyond the Individual: Introducing Group Intention Forecasting with SHOT Dataset

Ruixu Zhang, Yuran Wang, Xinyi Hu, Chaoyu Mai, Wenxuan Liu, Danni Xu, Xian Zhong, Zheng Wang

Main category: cs.CV

TL;DR: This paper introduces group intention forecasting (GIF) as a novel task to predict when collective intentions will emerge from individual actions, presents the SHOT dataset for basketball scenarios, and proposes the GIFT framework for modeling group dynamics.

DetailsMotivation: Traditional intention recognition focuses on individual intentions but overlooks the complexities of collective intentions that emerge from group interactions and shared goals.

Method: The authors propose SHOT (a large-scale basketball video dataset with multi-view, multi-individual annotations) and GIFT (a framework that extracts individual features and models evolving group dynamics to forecast intention emergence).

Result: Experimental results confirm the effectiveness of both SHOT and GIFT, establishing a strong foundation for group intention forecasting research.

Conclusion: The work successfully addresses the gap in collective intention recognition by introducing GIF as a new task and providing both dataset and framework for future research in this emerging field.

Abstract: Intention recognition has traditionally focused on individual intentions, overlooking the complexities of collective intentions in group settings. To address this limitation, we introduce the concept of group intention, which represents shared goals emerging through the actions of multiple individuals, and Group Intention Forecasting (GIF), a novel task that forecasts when group intentions will occur by analyzing individual actions and interactions before the collective goal becomes apparent. To investigate GIF in a specific scenario, we propose SHOT, the first large-scale dataset for GIF, consisting of 1,979 basketball video clips captured from 5 camera views and annotated with 6 types of individual attributes. SHOT is designed with 3 key characteristics: multi-individual information, multi-view adaptability, and multi-level intention, making it well-suited for studying emerging group intentions. Furthermore, we introduce GIFT (Group Intention ForecasTer), a framework that extracts fine-grained individual features and models evolving group dynamics to forecast intention emergence. Experimental results confirm the effectiveness of SHOT and GIFT, establishing a strong foundation for future research in group intention forecasting. The dataset is available at https://xinyi-hu.github.io/SHOT_DATASET.

[150] Neptune-X: Active X-to-Maritime Generation for Universal Maritime Object Detection

Yu Guo, Shengfeng He, Yuxu Lu, Haonan An, Yihang Tao, Huilin Zhu, Jingxian Liu, Yuguang Fang

Main category: cs.CV

TL;DR: Neptune-X is a data-centric generative-selection framework that addresses maritime object detection challenges by generating diverse synthetic maritime data and selecting task-relevant samples to improve detection accuracy in underrepresented scenarios.

DetailsMotivation: Maritime object detection faces two main challenges: scarcity of annotated maritime data and poor generalization across various maritime attributes. Models trained on existing datasets often underperform in underrepresented scenarios like open-sea environments.

Method: Proposes Neptune-X with two key components: 1) X-to-Maritime generative model with Bidirectional Object-Water Attention module for realistic scene synthesis, and 2) Attribute-correlated Active Sampling for dynamic selection of task-relevant synthetic samples. Also constructs Maritime Generation Dataset for benchmarking.

Result: Extensive experiments show the approach sets new benchmarks in maritime scene synthesis and significantly improves detection accuracy, particularly in challenging and previously underrepresented settings.

Conclusion: The proposed Neptune-X framework effectively addresses maritime data scarcity and generalization issues through synthetic data generation and intelligent sample selection, demonstrating superior performance in maritime object detection tasks.

Abstract: Maritime object detection is essential for navigation safety, surveillance, and autonomous operations, yet it is constrained by two key challenges: the scarcity of annotated maritime data and poor generalization across various maritime attributes (e.g., object category, viewpoint, location, and imaging environment). In particular, models trained on existing datasets often underperform in underrepresented scenarios such as open-sea environments. To address these challenges, we propose Neptune-X, a data-centric generative-selection framework that enhances training effectiveness by leveraging synthetic data generation with task-aware sample selection. From the generation perspective, we develop X-to-Maritime, a multi-modality-conditioned generative model that synthesizes diverse and realistic maritime scenes. A key component is the Bidirectional Object-Water Attention module, which captures boundary interactions between objects and their aquatic surroundings to improve visual fidelity. To further improve downstream task performance, we propose Attribute-correlated Active Sampling, which dynamically selects synthetic samples based on their task relevance. To support robust benchmarking, we construct the Maritime Generation Dataset, the first dataset tailored for generative maritime learning, encompassing a wide range of semantic conditions. Extensive experiments demonstrate that our approach sets a new benchmark in maritime scene synthesis, significantly improving detection accuracy, particularly in challenging and previously underrepresented settings. The code is available at https://github.com/gy65896/Neptune-X.

[151] AI-Enabled Crater-Based Navigation for Lunar Mapping

Sofia McLeod, Chee-Kheng Chng, Matthew Rodda, Tat-Jun Chin

Main category: cs.CV

TL;DR: STELLA is the first end-to-end Crater-Based Navigation pipeline for long-duration lunar mapping missions, addressing challenges of sparse, oblique imagery under varying illumination conditions.

DetailsMotivation: Previous crater-based navigation focused on short-duration landing missions with nadir-view imagery, but lunar mapping missions require navigation solutions for year-long campaigns with sparse, oblique imagery under varying illumination.

Method: STELLA combines Mask R-CNN crater detection, descriptor-less crater identification, robust perspective-n-crater pose solver, and batch orbit determination back-end. Tested on CRESENT-365 dataset simulating year-long lunar mapping.

Result: STELLA achieves metre-level position accuracy and sub-degree attitude accuracy across wide ranges of viewing angles, illumination conditions, and lunar latitudes.

Conclusion: This is the first comprehensive assessment of crater-based navigation in true lunar mapping settings, providing insights for future mission operational conditions.

Abstract: Crater-Based Navigation (CBN) uses the ubiquitous impact craters of the Moon observed on images as natural landmarks to determine the six degrees of freedom pose of a spacecraft. To date, CBN has primarily been studied in the context of powered descent and landing. These missions are typically short in duration, with high-frequency imagery captured from a nadir viewpoint over well-lit terrain. In contrast, lunar mapping missions involve sparse, oblique imagery acquired under varying illumination conditions over potentially year-long campaigns, posing significantly greater challenges for pose estimation. We bridge this gap with STELLA - the first end-to-end CBN pipeline for long-duration lunar mapping. STELLA combines a Mask R-CNN-based crater detector, a descriptor-less crater identification module, a robust perspective-n-crater pose solver, and a batch orbit determination back-end. To rigorously test STELLA, we introduce CRESENT-365 - the first public dataset that emulates a year-long lunar mapping mission. Each of its 15,283 images is rendered from high-resolution digital elevation models with SPICE-derived Sun angles and Moon motion, delivering realistic global coverage, illumination cycles, and viewing geometries. Experiments on CRESENT+ and CRESENT-365 show that STELLA maintains metre-level position accuracy and sub-degree attitude accuracy on average across wide ranges of viewing angles, illumination conditions, and lunar latitudes. These results constitute the first comprehensive assessment of CBN in a true lunar mapping setting and inform operational conditions that should be considered for future missions.
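
Once detected craters are identified against a catalogue, the pose step is a perspective-n-point problem on crater centers. A sketch using OpenCV's generic PnP solver with made-up correspondences (STELLA's robust perspective-n-crater solver is more involved):

```python
import numpy as np
import cv2

# Made-up matched crater centers: 3D catalogue positions and 2D detections.
world_pts = np.array([[10.0, 4.0, 0.0], [-3.0, 7.0, 1.0],
                      [5.0, -2.0, 0.5], [0.0, 0.0, 0.0]])   # illustrative
image_pts = np.array([[320.0, 240.0], [100.0, 410.0],
                      [500.0, 120.0], [280.0, 300.0]])       # pixels
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])                # illustrative intrinsics

ok, rvec, tvec = cv2.solvePnP(world_pts, image_pts, K, None,
                              flags=cv2.SOLVEPNP_ITERATIVE)
R, _ = cv2.Rodrigues(rvec)     # spacecraft attitude; tvec encodes position
```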

[152] Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and Language Models

Zoe Wanying He, Sean Trott, Meenakshi Khosla

Main category: cs.CV

TL;DR: Deep vision-only and language-only models trained on disjoint modalities develop partially aligned representational spaces in mid-to-late layers, capturing semantic concepts that mirror human preferences in image-text matching tasks.

DetailsMotivation: To understand where alignment emerges in unimodal networks, what cues support it, whether it captures human preferences in many-to-many scenarios, and how exemplar aggregation affects alignment.

Method: Systematic investigation of alignment across network layers, robustness testing to semantic vs. appearance changes, forced-choice “Pick-a-Pic” task for human preference comparison, and analysis of embedding averaging effects.
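
The layer-wise alignment analysis reduces to comparing two row-aligned embedding matrices. One standard metric for this is linear centered kernel alignment (CKA); the paper's exact choice of metric is not stated in this summary, so the sketch below is illustrative, with random matrices standing in for model activations.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between embedding matrices.
    X: (n, d1), Y: (n, d2); row i of each comes from the same stimulus
    (e.g. an image and its caption) for the two models under comparison."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
vision = rng.normal(size=(200, 64))                    # stand-in activations
language = vision @ rng.normal(size=(64, 48))          # linear image of vision
print(linear_cka(vision, language))                    # high: aligned spaces
print(linear_cka(vision, rng.normal(size=(200, 48))))  # much lower: unrelated
```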

Result: Alignment peaks in mid-to-late layers, is robust to appearance changes but collapses with semantic alterations, mirrors human preferences bidirectionally, and is amplified by exemplar averaging.

Conclusion: Unimodal networks converge on a shared semantic code that aligns with human judgments and strengthens with exemplar aggregation.

Abstract: Recent studies show that deep vision-only and language-only models–trained on disjoint modalities–nonetheless project their inputs into a partially aligned representational space. Yet we still lack a clear picture of where in each network this convergence emerges, what visual or linguistic cues support it, whether it captures human preferences in many-to-many image-text scenarios, and how aggregating exemplars of the same concept affects alignment. Here, we systematically investigate these questions. We find that alignment peaks in mid-to-late layers of both model types, reflecting a shift from modality-specific to conceptually shared representations. This alignment is robust to appearance-only changes but collapses when semantics are altered (e.g., object removal or word-order scrambling), highlighting that the shared code is truly semantic. Moving beyond the one-to-one image-caption paradigm, a forced-choice “Pick-a-Pic” task shows that human preferences for image-caption matches are mirrored in the embedding spaces across all vision-language model pairs. This pattern holds bidirectionally when multiple captions correspond to a single image, demonstrating that models capture fine-grained semantic distinctions akin to human judgments. Surprisingly, averaging embeddings across exemplars amplifies alignment rather than blurring detail. Together, our results demonstrate that unimodal networks converge on a shared semantic code that aligns with human judgments and strengthens with exemplar aggregation.

[153] FreeInsert: Personalized Object Insertion with Geometric and Style Control

Yuhong Zhang, Han Wang, Yiwen Wang, Rong Xie, Li Song

Main category: cs.CV

TL;DR: FreeInsert is a training-free framework that enables geometrically controlled and style-consistent object insertion into images by leveraging 3D generation models and diffusion adapters.

DetailsMotivation: Existing image editing methods lack geometric control over inserted objects (confined to 2D space with only textual instructions) and struggle with style consistency between objects and backgrounds, making results unrealistic. There's also a need for methods that don't require extensive training.

Method: Convert 2D objects into 3D using existing 3D generation models, perform interactive editing at the 3D level, then re-render into 2D images from specified views. Combine geometric control from 3D rendering with style and content control through diffusion adapters.

Result: The framework produces geometrically controlled and style-consistent edited images without requiring training.

Conclusion: FreeInsert successfully addresses limitations in personalized image composition by introducing 3D geometric control and style consistency through a training-free approach using 3D generation models and diffusion adapters.

Abstract: Text-to-image diffusion models have made significant progress in image generation, allowing for effortless customized generation. However, existing image editing methods still face certain limitations when dealing with personalized image composition tasks. First, there is the issue of lack of geometric control over the inserted objects. Current methods are confined to 2D space and typically rely on textual instructions, making it challenging to maintain precise geometric control over the objects. Second, there is the challenge of style consistency. Existing methods often overlook the style consistency between the inserted object and the background, resulting in a lack of realism. In addition, the challenge of inserting objects into images without extensive training remains significant. To address these issues, we propose FreeInsert, a novel training-free framework that customizes object insertion into arbitrary scenes by leveraging 3D geometric information. Benefiting from the advances in existing 3D generation models, we first convert the 2D object into 3D, perform interactive editing at the 3D level, and then re-render it into a 2D image from a specified view. This process introduces geometric controls such as shape or view. The rendered image, serving as geometric control, is combined with style and content control achieved through diffusion adapters, ultimately producing geometrically controlled, style-consistent edited images via the diffusion model.

[154] CusEnhancer: A Zero-Shot Scene and Controllability Enhancement Method for Photo Customization via ResInversion

Maoye Ren, Praneetha Vaddamanu, Jianjin Xu, Fernando De la Torre Frade

Main category: cs.CV

TL;DR: CustomEnhancer is a zero-shot enhancement framework that improves existing identity customization models by combining face swapping, diffusion models, and a novel triple-flow fusion approach for better scene diversity and identity fidelity.

DetailsMotivation: Current text-to-image diffusion models for human photo synthesis face issues with degraded scenes, insufficient control, and suboptimal perceptual identity, requiring enhancement of existing customization models.

Method: Proposes CustomEnhancer framework with triple-flow fused PerGeneration approach that combines counter-directional latent spaces, face swapping techniques, and ResInversion method for efficient noise rectification (129x faster than NTI).

Result: Achieves state-of-the-art results in scene diversity, identity fidelity, and training-free controls while demonstrating significant efficiency improvements over existing inversion methods.

Conclusion: CustomEnhancer provides a comprehensive solution for personalized model enhancement with precise control and efficiency, eliminating the need for per-model controller retraining.

Abstract: Recently, remarkable progress has been made in synthesizing realistic human photos using text-to-image diffusion models. However, current approaches face degraded scenes, insufficient control, and suboptimal perceptual identity. We introduce CustomEnhancer, a novel framework to augment existing identity customization models. CustomEnhancer is a zero-shot enhancement pipeline that leverages face swapping techniques and a pretrained diffusion model to obtain additional representations in a zero-shot manner for encoding into personalized models. Through our proposed triple-flow fused PerGeneration approach, which identifies and combines two compatible counter-directional latent spaces to manipulate a pivotal space of a personalized model, we unify the generation and reconstruction processes, realizing generation from three flows. Our pipeline also enables comprehensive training-free control over the generation process of personalized models, offering precisely controlled personalization and eliminating the need for per-model controller retraining. Besides, to address the high time complexity of null-text inversion (NTI), we introduce ResInversion, a novel inversion method that performs noise rectification via a pre-diffusion mechanism, reducing inversion time by a factor of 129. Experiments demonstrate that CustomEnhancer reaches SOTA results in scene diversity, identity fidelity, and training-free control, while also showing the efficiency of our ResInversion over NTI. The code will be made publicly available upon paper acceptance.

[155] Dual-supervised Asymmetric Co-training for Semi-supervised Medical Domain Generalization

Jincai Song, Haipeng Chen, Jun Qin, Na Zhao

Main category: cs.CV

TL;DR: This paper proposes DAC, a dual-supervised asymmetric co-training framework for cross-domain semi-supervised domain generalization (CD-SSDG) in medical image segmentation, addressing domain shifts between labeled/unlabeled training data and training/testing sets.

DetailsMotivation: Traditional SSDG methods assume labeled and unlabeled data are available for each source domain, which is impractical. The paper addresses the more realistic scenario where domain shifts occur between labeled and unlabeled training data, in addition to shifts between training and testing sets.

Method: DAC framework builds on co-training with two sub-models providing cross pseudo supervision, integrating feature-level supervision and asymmetric auxiliary tasks. Feature-level supervision addresses inaccurate pseudo labels from domain shifts, while asymmetric self-supervised tasks enhance domain-invariant feature learning and prevent model collapse.
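
The co-training backbone here is cross pseudo supervision: each sub-model is trained on the hard pseudo-labels produced by the other. Below is a minimal PyTorch sketch of that loss on an unlabeled batch; DAC's added feature-level supervision and asymmetric auxiliary tasks are not shown.

```python
import torch
import torch.nn.functional as F

def cross_pseudo_supervision_loss(logits_a, logits_b):
    """logits_a, logits_b: (B, C, H, W) segmentation logits from the two
    co-trained sub-models on the same unlabeled images."""
    pseudo_a = logits_a.argmax(dim=1).detach()    # no gradient through labels
    pseudo_b = logits_b.argmax(dim=1).detach()
    loss_a = F.cross_entropy(logits_a, pseudo_b)  # sub-model A learns from B
    loss_b = F.cross_entropy(logits_b, pseudo_a)  # sub-model B learns from A
    return loss_a + loss_b

# Toy shapes: batch of 2, 4 classes, 32x32 predictions.
la = torch.randn(2, 4, 32, 32, requires_grad=True)
lb = torch.randn(2, 4, 32, 32, requires_grad=True)
print(cross_pseudo_supervision_loss(la, lb))
```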

Result: Extensive experiments on Fundus, Polyp, and SCGM medical image datasets demonstrate the robust generalizability of the proposed DAC framework in handling CD-SSDG scenarios.

Conclusion: The DAC framework effectively addresses the challenging CD-SSDG problem in medical image segmentation by leveraging dual supervision and asymmetric co-training to handle domain shifts between labeled/unlabeled data and improve generalization to unseen domains.

Abstract: Semi-supervised domain generalization (SSDG) in medical image segmentation offers a promising solution for generalizing to unseen domains during testing, addressing domain shift challenges and minimizing annotation costs. However, conventional SSDG methods assume labeled and unlabeled data are available for each source domain in the training set, a condition that is not always met in practice. The coexistence of limited annotation and domain shift in the training set is a prevalent issue. Thus, this paper explores a more practical and challenging scenario, cross-domain semi-supervised domain generalization (CD-SSDG), where domain shifts occur between labeled and unlabeled training data, in addition to shifts between training and testing sets. Existing SSDG methods exhibit sub-optimal performance under such domain shifts because of inaccurate pseudolabels. To address this issue, we propose a novel dual-supervised asymmetric co-training (DAC) framework tailored for CD-SSDG. Building upon the co-training paradigm with two sub-models offering cross pseudo supervision, our DAC framework integrates extra feature-level supervision and asymmetric auxiliary tasks for each sub-model. This feature-level supervision serves to address inaccurate pseudo supervision caused by domain shifts between labeled and unlabeled data, utilizing complementary supervision from the rich feature space. Additionally, two distinct auxiliary self-supervised tasks are integrated into each sub-model to enhance domain-invariant discriminative feature learning and prevent model collapse. Extensive experiments on real-world medical image segmentation datasets, i.e., Fundus, Polyp, and SCGM, demonstrate the robust generalizability of the proposed DAC framework.

[156] Real-Time Object Detection Meets DINOv3

Shihua Huang, Yongjie Hou, Longfei Liu, Xuanlong Yu, Xi Shen

Main category: cs.CV

TL;DR: DEIMv2 extends the DEIM framework with DINOv3 features, creating a scalable real-time object detection system that outperforms YOLO series across eight model sizes from X to Atto, achieving superior performance-cost trade-offs.

DetailsMotivation: To improve upon the successful DEIM framework by incorporating DINOv3 features and creating a unified design that spans from high-performance GPU models to ultra-lightweight mobile/edge deployments, establishing new state-of-the-art results.

Method: For larger models (X, L, M, S): use DINOv3-pretrained backbones with Spatial Tuning Adapter (STA) to convert single-scale to multi-scale features. For ultra-lightweight models (Nano, Pico, Femto, Atto): employ HGNetv2 with pruning, simplified decoder, and upgraded Dense O2O.
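
For the larger variants, the mechanical step is turning a single-scale ViT feature map into the multi-scale pyramid a detection neck expects. The sketch below shows one plausible shape of such an adapter with up-, same-, and down-sampling branches; the actual STA design, channel widths, and any extra fusion are assumptions.

```python
import torch
import torch.nn as nn

class SpatialTuningAdapterSketch(nn.Module):
    """Illustrative single-scale -> multi-scale adapter (not the real STA)."""

    def __init__(self, dim=768, out_dim=256):
        super().__init__()
        self.up = nn.ConvTranspose2d(dim, out_dim, kernel_size=2, stride=2)
        self.same = nn.Conv2d(dim, out_dim, kernel_size=1)
        self.down = nn.Conv2d(dim, out_dim, kernel_size=3, stride=2, padding=1)

    def forward(self, x):  # x: (B, dim, H, W) backbone features
        return [self.up(x), self.same(x), self.down(x)]

feats = torch.randn(1, 768, 40, 40)  # e.g. reshaped DINO-style tokens
for level in SpatialTuningAdapterSketch()(feats):
    print(level.shape)  # (1,256,80,80), (1,256,40,40), (1,256,20,20)
```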

Result: DEIMv2-X achieves 57.8 AP with 50.3M parameters (vs 56.5 AP with 60M+ parameters in prior models). DEIMv2-S is first sub-10M model (9.71M) to exceed 50 AP (50.9 AP). DEIMv2-Pico (1.5M parameters) delivers 38.5 AP, matching YOLOv10-Nano (2.3M) with 50% fewer parameters.

Conclusion: DEIMv2 establishes new state-of-the-art performance across diverse deployment scenarios, demonstrating superior efficiency and effectiveness in real-time object detection compared to existing methods like YOLO series.

Abstract: Benefiting from the simplicity and effectiveness of Dense O2O and MAL, DEIM has become the mainstream training framework for real-time DETRs, significantly outperforming the YOLO series. In this work, we extend it with DINOv3 features, resulting in DEIMv2. DEIMv2 spans eight model sizes from X to Atto, covering GPU, edge, and mobile deployment. For the X, L, M, and S variants, we adopt DINOv3-pretrained or distilled backbones and introduce a Spatial Tuning Adapter (STA), which efficiently converts DINOv3’s single-scale output into multi-scale features and complements strong semantics with fine-grained details to enhance detection. For ultra-lightweight models (Nano, Pico, Femto, and Atto), we employ HGNetv2 with depth and width pruning to meet strict resource budgets. Together with a simplified decoder and an upgraded Dense O2O, this unified design enables DEIMv2 to achieve a superior performance-cost trade-off across diverse scenarios, establishing new state-of-the-art results. Notably, our largest model, DEIMv2-X, achieves 57.8 AP with only 50.3 million parameters, surpassing prior X-scale models that require over 60 million parameters for just 56.5 AP. On the compact side, DEIMv2-S is the first sub-10 million model (9.71 million) to exceed the 50 AP milestone on COCO, reaching 50.9 AP. Even the ultra-lightweight DEIMv2-Pico, with just 1.5 million parameters, delivers 38.5 AP, matching YOLOv10-Nano (2.3 million) with around 50 percent fewer parameters.

[157] DAC-LoRA: Dynamic Adversarial Curriculum for Efficient and Robust Few-Shot Adaptation

Ved Umrajkar

Main category: cs.CV

TL;DR: DAC-LoRA is a novel framework that integrates adversarial training into Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA, using a dynamic curriculum of progressively challenging attacks to enhance robustness of Vision-Language Models (VLMs) without significantly compromising clean accuracy.

DetailsMotivation: Vision-Language Models (VLMs) are used in safety-critical applications but remain vulnerable to adversarial attacks. CLIP, as a backbone for many VLMs, is a high-value target whose vulnerabilities can cascade across the multimodal AI ecosystem, necessitating robust defense methods.

Method: Dynamic Adversarial Curriculum (DAC-LoRA) framework that integrates adversarial training into PEFT. It uses an intelligent curriculum of progressively challenging attacks guided by First-Order Stationary Condition (FOSC) and a TRADES-inspired loss function.
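
The TRADES-inspired part pairs clean cross-entropy with a KL term computed on PGD-perturbed inputs. Here is a compact sketch of such a loss; the FOSC-guided curriculum (which would schedule eps and the step count over training) and the LoRA-adapted VLM are omitted, and the toy classifier is a placeholder.

```python
import torch
import torch.nn.functional as F

def trades_style_loss(model, x, y, eps=8/255, step=2/255, n_steps=3, beta=6.0):
    """Clean cross-entropy plus a KL term pulling adversarial predictions
    toward clean ones (TRADES). A curriculum would grow eps/n_steps over
    training; that scheduling is left out here."""
    clean_logits = model(x)
    x_adv = x + 0.001 * torch.randn_like(x)          # small random start
    for _ in range(n_steps):                          # PGD on the KL objective
        x_adv = x_adv.detach().requires_grad_(True)
        kl = F.kl_div(F.log_softmax(model(x_adv), dim=1),
                      F.softmax(clean_logits.detach(), dim=1),
                      reduction="batchmean")
        grad = torch.autograd.grad(kl, x_adv)[0]
        x_adv = x_adv.detach() + step * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)      # project into the eps-ball
        x_adv = x_adv.clamp(0.0, 1.0)
    robust_kl = F.kl_div(F.log_softmax(model(x_adv), dim=1),
                         F.softmax(clean_logits, dim=1), reduction="batchmean")
    return F.cross_entropy(clean_logits, y) + beta * robust_kl

# Placeholder classifier standing in for a LoRA-adapted vision tower.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
print(trades_style_loss(model, x, y))
```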

Result: DAC-LoRA achieves substantial improvements in adversarial robustness without significantly compromising clean accuracy. The framework is lightweight, effective, and broadly applicable.

Conclusion: The DAC-LoRA framework can be easily integrated into standard PEFT pipelines to significantly enhance robustness of Vision-Language Models, providing an effective defense against adversarial attacks in safety-critical applications.

Abstract: Vision-Language Models (VLMs) are foundational to critical applications like autonomous driving, medical diagnosis, and content moderation. While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA enable their efficient adaptation to specialized tasks, these models remain vulnerable to adversarial attacks that can compromise safety-critical decisions. CLIP, the backbone for numerous downstream VLMs, is a high-value target whose vulnerabilities can cascade across the multimodal AI ecosystem. We propose Dynamic Adversarial Curriculum LoRA (DAC-LoRA), a novel framework that integrates adversarial training into PEFT. The core principle of our method, an intelligent curriculum of progressively challenging attacks, is general and can potentially be applied to any iterative attack method. Guided by the First-Order Stationary Condition (FOSC) and a TRADES-inspired loss, DAC-LoRA achieves substantial improvements in adversarial robustness without significantly compromising clean accuracy. Our work demonstrates that the DAC-LoRA framework is an effective, lightweight, and broadly applicable method that can be easily integrated into a standard PEFT pipeline to significantly enhance robustness.

[158] Federated Domain Generalization with Domain-specific Soft Prompts Generation

Jianhan Wu, Xiaoyang Qu, Zhangcheng Huang, Jianzong Wang

Main category: cs.CV

TL;DR: FedDSPG is a federated domain generalization method that generates domain-specific soft prompts from a generative perspective to handle domain shift in unseen target domains.

DetailsMotivation: Existing federated domain generalization methods based on prompt learning have limited diversity and ignore information from unknown domains, posing challenges for downstream-task adaptation in federated learning settings.

Method: Introduces domain-specific soft prompts (DSPs) for each domain, integrates content and domain knowledge into a generative model across clients, and uses the generator to obtain DSPs for unseen target domains during inference.

Result: Comprehensive evaluations across several public datasets confirm that FedDSPG outperforms existing strong baselines in federated domain generalization, achieving state-of-the-art results.

Conclusion: The proposed FedDSPG method effectively addresses federated domain generalization by generating domain-specific soft prompts, demonstrating superior performance compared to existing approaches.

Abstract: Prompt learning has become an efficient paradigm for adapting CLIP to downstream tasks. Compared with traditional fine-tuning, prompt learning optimizes a few parameters yet yields highly competitive results, making it especially appealing in federated learning for its computational efficiency. However, client data in federated settings are typically heterogeneous, engendering domain shift among clients and posing a formidable challenge for downstream-task adaptation. Existing federated domain generalization (FDG) methods based on prompt learning typically learn soft prompts from training samples, replacing manually designed prompts to enhance the generalization ability of federated models. However, these learned prompts exhibit limited diversity and tend to ignore information from unknown domains. We propose a novel and effective method from a generative perspective for handling FDG tasks, namely federated domain generalization with domain-specific soft prompts generation (FedDSPG). Specifically, during training, we introduce domain-specific soft prompts (DSPs) for each domain and integrate content and domain knowledge into the generative model among clients. In the inference phase, the generator is utilized to obtain DSPs for unseen target domains, thus guiding downstream tasks in unknown domains. Comprehensive evaluations across several public datasets confirm that our method outperforms existing strong baselines in FDG, achieving state-of-the-art results.

[159] Revolutionizing Precise Low Back Pain Diagnosis via Contrastive Learning

Thanh Binh Le, Hoang Nhat Khang Vo, Tan-Ha Mai, Trong Nhan Phan

Main category: cs.CV

TL;DR: LumbarCLIP is a multimodal framework that uses contrastive language-image pretraining to align lumbar spine MRI scans with radiological reports, achieving state-of-the-art performance in classification tasks.

DetailsMotivation: Low back pain affects millions worldwide, creating a need for robust diagnostic models that can jointly analyze complex medical images and accompanying text reports.

Method: The framework integrates vision encoders (ResNet-50, Vision Transformer, Swin Transformer) with a BERT-based text encoder to extract representations, projects them into a shared embedding space via learnable projection heads, and uses soft CLIP loss for contrastive training.
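
Training uses a soft variant of the CLIP contrastive loss. The sketch below softens the usual one-hot image-text targets with intra-modal similarities, one common way to construct a soft CLIP loss; the paper's exact target construction may differ.

```python
import torch
import torch.nn.functional as F

def soft_clip_loss(img_emb, txt_emb, temperature=0.07):
    """Contrastive loss with soft targets: instead of strict one-hot pairing,
    targets are softened by intra-modal similarity, so near-duplicate scans
    or reports are not pushed apart as hard negatives."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature
    with torch.no_grad():  # soft targets from averaged self-similarities
        targets = F.softmax(((img @ img.T) + (txt @ txt.T)) / (2 * temperature),
                            dim=-1)
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets.T)  # text -> image direction
    return (loss_i + loss_t) / 2

img_emb = torch.randn(8, 256)  # projected MRI embeddings (batch of 8)
txt_emb = torch.randn(8, 256)  # projected report embeddings
print(soft_clip_loss(img_emb, txt_emb))
```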

Result: Achieves up to 95.00% accuracy and 94.75% F1-score on test set despite class imbalance. Linear projection heads outperform non-linear variants in cross-modal alignment.

Conclusion: LumbarCLIP offers a promising foundation for automated musculoskeletal diagnosis and clinical decision support.

Abstract: Low back pain affects millions worldwide, driving the need for robust diagnostic models that can jointly analyze complex medical images and accompanying text reports. We present LumbarCLIP, a novel multimodal framework that leverages contrastive language-image pretraining to align lumbar spine MRI scans with corresponding radiological descriptions. Built upon a curated dataset containing axial MRI views paired with expert-written reports, LumbarCLIP integrates vision encoders (ResNet-50, Vision Transformer, Swin Transformer) with a BERT-based text encoder to extract dense representations. These are projected into a shared embedding space via learnable projection heads, configurable as linear or non-linear, and normalized to facilitate stable contrastive training using a soft CLIP loss. Our model achieves state-of-the-art performance on downstream classification, reaching up to 95.00% accuracy and 94.75% F1-score on the test set, despite inherent class imbalance. Extensive ablation studies demonstrate that linear projection heads yield more effective cross-modal alignment than non-linear variants. LumbarCLIP offers a promising foundation for automated musculoskeletal diagnosis and clinical decision support.

[160] Poisoning Prompt-Guided Sampling in Video Large Language Models

Yuxin Cao, Wei Song, Jingling Xue, Jin Song Dong

Main category: cs.CV

TL;DR: PoisonVID is the first black-box poisoning attack that targets prompt-guided sampling in VideoLLMs, achieving 82-99% success rate by compromising frame relevance scores through closed-loop optimization.

DetailsMotivation: While vulnerabilities have been identified in earlier frame sampling strategies (uniform-based and semantic-similarity-based), the safety of the most recent prompt-guided sampling strategies remains unexplored.

Method: PoisonVID uses a closed-loop optimization strategy that iteratively optimizes a universal perturbation to suppress harmful frame relevance scores. It leverages a depiction set constructed from paraphrased harmful descriptions using a shadow VideoLLM and GPT-4o-mini.

Result: Comprehensive evaluation on three prompt-guided sampling strategies across three advanced VideoLLMs shows PoisonVID achieves 82% - 99% attack success rate.

Conclusion: The high success rate of PoisonVID highlights the importance of developing future advanced sampling strategies for VideoLLMs that are more robust to such attacks.

Abstract: Video Large Language Models (VideoLLMs) have emerged as powerful tools for understanding videos, supporting tasks such as summarization, captioning, and question answering. Their performance has been driven by advances in frame sampling, progressing from uniform-based to semantic-similarity-based and, most recently, prompt-guided strategies. While vulnerabilities have been identified in earlier sampling strategies, the safety of prompt-guided sampling remains unexplored. We close this gap by presenting PoisonVID, the first black-box poisoning attack that undermines prompt-guided sampling in VideoLLMs. PoisonVID compromises the underlying prompt-guided sampling mechanism through a closed-loop optimization strategy that iteratively optimizes a universal perturbation to suppress harmful frame relevance scores, guided by a depiction set constructed from paraphrased harmful descriptions leveraging a shadow VideoLLM and a lightweight language model, i.e., GPT-4o-mini. Comprehensively evaluated on three prompt-guided sampling strategies and across three advanced VideoLLMs, PoisonVID achieves 82% - 99% attack success rate, highlighting the importance of developing future advanced sampling strategies for VideoLLMs.

[161] Punching Above Precision: Small Quantized Model Distillation with Learnable Regularizer

Abdur Rehman, S M A Sharif, Md Abdur Rahaman, Mohamed Jismy Aashik Rasool, Seongwan Kim, Jaeho Lee

Main category: cs.CV

TL;DR: GoR is a novel learnable regularization method that adaptively balances task-specific and knowledge distillation losses in quantization-aware training, improving performance of small quantized models across various AI tasks.

DetailsMotivation: Existing QAT-KD methods struggle to balance task-specific and distillation losses due to heterogeneous gradient magnitudes, especially under low-bit quantization, leading to performance degradation.

Method: Proposes Game of Regularizer (GoR) with only two trainable parameters for dynamic loss weighting, and introduces QAT-EKD-GoR ensemble distillation framework using multiple heterogeneous teacher models.
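
With only two trainable parameters weighting the task and distillation losses, GoR is reminiscent of uncertainty-based multi-task weighting. Since the exact GoR formulation is not given in this summary, the sketch below follows the familiar recipe of Kendall et al. as a stand-in.

```python
import torch
import torch.nn as nn

class LearnableLossBalancer(nn.Module):
    """Two trainable log-variance style weights for the task-specific and
    distillation losses (an uncertainty-weighting stand-in, not GoR itself)."""

    def __init__(self):
        super().__init__()
        self.log_w_task = nn.Parameter(torch.zeros(()))
        self.log_w_kd = nn.Parameter(torch.zeros(()))

    def forward(self, task_loss, kd_loss):
        # exp(-log_w) scales each loss; the +log_w term penalises the trivial
        # solution of driving both weights to zero.
        return (torch.exp(-self.log_w_task) * task_loss + self.log_w_task
                + torch.exp(-self.log_w_kd) * kd_loss + self.log_w_kd)

balancer = LearnableLossBalancer()
total = balancer(torch.tensor(2.3), torch.tensor(0.4))
total.backward()  # gradients flow into both weighting parameters
print(total, balancer.log_w_task.grad)
```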

Result: GoR consistently outperforms state-of-the-art QAT-KD methods on image classification, object detection, and LLM compression, delivering faster inference on edge devices while maintaining full-precision accuracy.

Conclusion: GoR provides a robust solution for real-world deployment, with EKD-GoR potentially outperforming full-precision models under optimal conditions.

Abstract: Quantization-aware training (QAT) combined with knowledge distillation (KD) is a promising strategy for compressing Artificial Intelligence (AI) models for deployment on resource-constrained hardware. However, existing QAT-KD methods often struggle to balance task-specific (TS) and distillation losses due to heterogeneous gradient magnitudes, especially under low-bit quantization. We propose Game of Regularizer (GoR), a novel learnable regularization method that adaptively balances TS and KD objectives using only two trainable parameters for dynamic loss weighting. GoR reduces conflict between supervision signals, improves convergence, and boosts the performance of small quantized models (SQMs). Experiments on image classification, object detection (OD), and large language model (LLM) compression show that GoR consistently outperforms state-of-the-art QAT-KD methods. On low-power edge devices, it delivers faster inference while maintaining full-precision accuracy. We also introduce QAT-EKD-GoR, an ensemble distillation framework that uses multiple heterogeneous teacher models. Under optimal conditions, the proposed EKD-GoR can outperform full-precision models, providing a robust solution for real-world deployment.

[162] Plant identification based on noisy web data: the amazing performance of deep learning (LifeCLEF 2017)

Herve Goeau, Pierre Bonnet, Alexis Joly

Main category: cs.CV

TL;DR: The LifeCLEF 2017 plant identification challenge evaluated whether large noisy web-collected training datasets can compete with smaller expert-verified datasets for automated plant identification across 10,000 species.

DetailsMotivation: To assess if web-sourced plant image datasets with labeling errors can be as effective as smaller, expert-curated datasets for building automated plant identification systems at continental scale.

Method: Compared two training strategies: large noisy dataset from web sources vs. smaller trusted dataset from experts, using test data from Pl@ntNet mobile app as independent evaluation.

Result: The challenge provided evaluation of competing approaches for plant identification at unprecedented scale (10,000 species, 1.1M images) across Europe and North America.

Conclusion: The study demonstrates the potential of leveraging web-sourced data for large-scale plant identification systems, despite inherent noise and labeling errors.

Abstract: The 2017 edition of the LifeCLEF plant identification challenge is an important milestone towards automated plant identification systems working at the scale of continental floras, with 10,000 plant species living mainly in Europe and North America illustrated by a total of 1.1M images. Nowadays, such ambitious systems are enabled thanks to the conjunction of the dazzling recent progress in image classification with deep learning and several outstanding international initiatives, such as the Encyclopedia of Life (EOL), aggregating the visual knowledge on plant species coming from the main national botany institutes. However, despite all these efforts, the majority of plant species still remain without pictures or are poorly illustrated. Outside the institutional channels, a much larger number of plant pictures are available and spread on the web through botanist blogs, plant lovers’ web pages, image hosting websites and online plant retailers. The LifeCLEF 2017 plant challenge presented in this paper aimed at evaluating to what extent a large noisy training dataset collected through the web and containing a lot of labelling errors can compete with a smaller but trusted training dataset checked by experts. To fairly compare both training strategies, the test dataset was created from a third data source, i.e. the Pl@ntNet mobile application that collects millions of plant image queries all over the world. This paper presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.

[163] TasselNetV4: A vision foundation model for cross-scene, cross-scale, and cross-species plant counting

Xiaonan Hu, Xuebing Li, Jinyu Xu, Abdulkadir Duran Adan, Letian Zhou, Xuhui Zhu, Yanan Li, Wei Guo, Shouyang Liu, Wenzhong Liu, Hao Lu

Main category: cs.CV

TL;DR: TasselNetV4 is a new plant counting model that shifts from species-specific to cross-species counting by combining local counting with class-agnostic counting paradigms, achieving superior performance on challenging datasets.

DetailsMotivation: Plants have biodiversity and new cultivars are constantly bred, making it impossible to build species-dependent counting models for all plants. Current class-agnostic counting models perform poorly on plants due to their dynamic, non-rigid structure.

Method: TasselNetV4 builds on TasselNet’s local counting approach and marries it with the extract-and-match paradigm of class-agnostic counting. It uses a plain vision transformer with novel multi-branch box-aware local counters to enhance cross-scale robustness.
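
TasselNet-style local counting regresses a non-negative count for each receptive field and sums the resulting map into an image-level count. The sketch below shows only that core idea; V4's multi-branch box-aware counters and the exemplar extract-and-match stage are not reproduced.

```python
import torch
import torch.nn as nn

class LocalCounterSketch(nn.Module):
    """Minimal local counter: a small backbone maps each receptive field to
    a non-negative local count, and the image count is the map's sum."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.counter = nn.Conv2d(32, 1, 1)  # one count per spatial cell

    def forward(self, x):
        count_map = torch.relu(self.counter(self.features(x)))
        return count_map.sum(dim=(1, 2, 3))  # per-image total count

model = LocalCounterSketch()
print(model(torch.rand(2, 3, 128, 128)))  # predicted counts for 2 images
```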

Result: Extensive experiments on PAC-105 and PAC-Somalia datasets show TasselNetV4 achieves superior counting performance and high efficiency compared to state-of-the-art class-agnostic counting models.

Conclusion: TasselNetV4 emerges as a vision foundation model for cross-scene, cross-scale, and cross-species plant counting, providing a more scalable solution than species-specific approaches.

Abstract: Accurate plant counting provides valuable information for agriculture such as crop yield prediction, plant density assessment, and phenotype quantification. Vision-based approaches are currently the mainstream solution. Prior art typically uses a detection or a regression model to count a specific plant. However, plants have biodiversity, and new cultivars are increasingly bred each year. It is almost impossible to exhaust and build all species-dependent counting models. Inspired by class-agnostic counting (CAC) in computer vision, we argue that it is time to rethink the problem formulation of plant counting, from what plants to count to how to count plants. In contrast to most daily objects with spatial and temporal invariance, plants are dynamic, changing with time and space. Their non-rigid structure often leads to worse performance than counting rigid instances like heads and cars, such that current CAC and open-world detection models are suboptimal for counting plants. In this work, we inherit the vein of the TasselNet plant counting model and introduce a new extension, TasselNetV4, shifting from species-specific counting to cross-species counting. TasselNetV4 marries the local counting idea of TasselNet with the extract-and-match paradigm in CAC. It builds upon a plain vision transformer and incorporates novel multi-branch box-aware local counters used to enhance cross-scale robustness. Two challenging datasets, PAC-105 and PAC-Somalia, are harvested. Extensive experiments against state-of-the-art CAC models show that TasselNetV4 achieves not only superior counting performance but also high efficiency. Our results indicate that TasselNetV4 emerges as a vision foundation model for cross-scene, cross-scale, and cross-species plant counting.

[164] SD-RetinaNet: Topologically Constrained Semi-Supervised Retinal Lesion and Layer Segmentation in OCT

Botond Fazekas, Guilherme Aresta, Philipp Seeböck, Julia Mai, Ursula Schmidt-Erfurth, Hrvoje Bogunović

Main category: cs.CV

TL;DR: A novel semi-supervised model for retinal OCT segmentation that introduces a differentiable biomarker topology engine to enforce anatomically correct segmentation of layers and lesions, improving performance while ensuring topological correctness.

DetailsMotivation: Existing semi-supervised methods for retinal OCT segmentation often produce anatomically implausible results, fail to model layer-lesion interactions effectively, and lack guarantees on topological correctness, limiting their clinical utility.

Method: Proposes a fully differentiable biomarker topology engine that enables joint learning with bidirectional influence between layers and lesions, using disentangled representations (spatial and style factors) and leveraging both unlabeled and partially labeled datasets.

Result: Outperforms current state-of-the-art methods on public and internal OCT datasets for both lesion and layer segmentation, demonstrating improved generalization to pathological cases using partially annotated training data.

Conclusion: The approach shows the potential of using anatomical constraints in semi-supervised learning for accurate, robust, and trustworthy retinal biomarker segmentation in clinical applications.

Abstract: Optical coherence tomography (OCT) is widely used for diagnosing and monitoring retinal diseases, such as age-related macular degeneration (AMD). The segmentation of biomarkers such as layers and lesions is essential for patient diagnosis and follow-up. Recently, semi-supervised learning has shown promise in improving retinal segmentation performance. However, existing methods often produce anatomically implausible segmentations, fail to effectively model layer-lesion interactions, and lack guarantees on topological correctness. To address these limitations, we propose a novel semi-supervised model that introduces a fully differentiable biomarker topology engine to enforce anatomically correct segmentation of lesions and layers. This enables joint learning with bidirectional influence between layers and lesions, leveraging unlabeled and diverse partially labeled datasets. Our model learns a disentangled representation, separating spatial and style factors. This approach enables more realistic layer segmentations and improves lesion segmentation, while strictly enforcing lesion location in their anatomically plausible positions relative to the segmented layers. We evaluate the proposed model on public and internal datasets of OCT scans and show that it outperforms the current state-of-the-art in both lesion and layer segmentation, while demonstrating the ability to generalize layer segmentation to pathological cases using partially annotated training data. Our results demonstrate the potential of using anatomical constraints in semi-supervised learning for accurate, robust, and trustworthy retinal biomarker segmentation.

[165] Plant identification in an open-world (LifeCLEF 2016)

Herve Goeau, Pierre Bonnet, Alexis Joly

Main category: cs.CV

TL;DR: The LifeCLEF 2016 plant identification challenge evaluated large-scale plant recognition systems on 110K+ images of 1000 species, with the key innovation being open-set recognition to handle unknown categories not seen during training.

DetailsMotivation: To create realistic biodiversity monitoring scenarios by evaluating plant identification methods at very large scale, moving beyond traditional classification to address the real-world challenge of rejecting unknown species that weren't in the training data.

Method: Used a dataset of 110,000+ images from a participatory sensing platform involving tens of thousands of contributors, focusing on 1000 plant species from West Europe. The challenge was framed as an open-set recognition problem requiring systems to classify known species while rejecting unknown categories.

Result: The challenge provided a large-scale evaluation platform for plant identification systems, with participants developing approaches to handle open-set recognition. The overview analyzed the performance of various systems in both classification accuracy and robustness to unknown species.

Conclusion: The LifeCLEF 2016 challenge successfully advanced plant identification research by introducing open-set recognition as a key requirement, making the evaluation more realistic for real-world biodiversity monitoring applications where systems must handle previously unseen species.

Abstract: The LifeCLEF plant identification challenge aims at evaluating plant identification methods and systems at a very large scale, close to the conditions of a real-world biodiversity monitoring scenario. The 2016 edition was conducted on a set of more than 110K images illustrating 1000 plant species living in West Europe, built through a large-scale participatory sensing platform initiated in 2011 which now involves tens of thousands of contributors. The main novelty over the previous years is that the identification task was evaluated as an open-set recognition problem, i.e. a problem in which the recognition system has to be robust to unknown and never-seen categories. Beyond the brute-force classification across the known classes of the training set, the big challenge was thus to automatically reject the false positive classification hits that are caused by the unknown classes. This overview presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.

[166] SCRA-VQA: Summarized Caption-Rerank for Augmented Large Language Models in Visual Question Answering

Yan Zhang, Jiaqing Lin, Miao Zhang, Kui Xiao, Xiaoju Hou, Yue Zhao, Zhifei Li

Main category: cs.CV

TL;DR: SCRA-VQA is a method that improves KB-VQA by using summarized and reranked captions to enhance LLM reasoning without expensive training, achieving 38.8% on OK-VQA and 34.6% on A-OKVQA.

DetailsMotivation: Current KB-VQA methods using LLMs with image captions suffer from excessive noise in captions and LLMs' limited understanding of VQA tasks, which restricts reasoning capabilities.

Method: Proposes SCRA-VQA which converts images to captions using pre-trained visual language models, then summarizes and reranks captions to exclude unrelated information, and generates contextual examples to help LLMs better understand images and questions.
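
The rerank step orders candidate captions by relevance to the question and drops the noisy tail before prompting the LLM. The toy sketch below uses TF-IDF cosine similarity as the relevance score; SCRA-VQA's actual model-based scoring and summarization will differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rerank_captions(question, captions, keep=2):
    """Order captions by lexical relevance to the question and keep the
    top-k, discarding question-irrelevant noise before LLM prompting."""
    vec = TfidfVectorizer().fit([question] + captions)
    scores = cosine_similarity(vec.transform([question]),
                               vec.transform(captions))[0]
    ranked = sorted(zip(scores, captions), reverse=True)
    return [caption for _, caption in ranked[:keep]]

captions = ["a man holds a tennis racket on a court",
            "a crowd watches from green bleachers",
            "a yellow ball is in mid-air near the net"]
print(rerank_captions("what is the man holding on the court?", captions))
```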

Result: Achieves 38.8% accuracy on OK-VQA and 34.6% on A-OKVQA using a 6.7B parameter LLM, demonstrating excellent performance on challenging knowledge-based VQA datasets.

Conclusion: The caption-rerank process effectively enhances LLMs’ reasoning ability and task adaptability in KB-VQA without requiring expensive end-to-end training.

Abstract: Acquiring high-quality knowledge is a central focus in Knowledge-Based Visual Question Answering (KB-VQA). Recent methods use large language models (LLMs) as knowledge engines for answering. These methods generally employ image captions as visual text descriptions to assist LLMs in interpreting images. However, the captions frequently include excessive noise irrelevant to the question, and LLMs generally do not comprehend VQA tasks, limiting their reasoning capabilities. To address this issue, we propose the Summarized Caption-Rerank Augmented VQA (SCRA-VQA), which employs a pre-trained visual language model to convert images into captions. Moreover, SCRA-VQA generates contextual examples for the captions while simultaneously summarizing and reordering them to exclude unrelated information. The caption-rerank process enables LLMs to understand the image information and questions better, thus enhancing the model’s reasoning ability and task adaptability without expensive end-to-end training. Based on an LLM with 6.7B parameters, SCRA-VQA performs excellently on two challenging knowledge-based VQA datasets: OK-VQA and A-OKVQA, achieving accuracies of 38.8% and 34.6%. Our code is available at https://github.com/HubuKG/SCRA-VQA.

[167] The Unanticipated Asymmetry Between Perceptual Optimization and Assessment

Jiabei Zhang, Qi Wang, Siyu Wu, Du Chen, Tianhe Wu

Main category: cs.CV

TL;DR: The paper reveals an asymmetry between perceptual optimization and quality assessment - fidelity metrics good for IQA aren’t necessarily effective for optimization, especially under adversarial training.

DetailsMotivation: To systematically analyze the correlation between optimization objectives' effectiveness and their capability as image quality assessment metrics, which remains underexplored.

Method: Conducted systematic analysis comparing fidelity metrics and adversarial objectives for both perceptual optimization and IQA, examining discriminator design impacts.

Result: Found that discriminator design is crucial for optimization, with patch-level and convolutional architectures providing better detail reconstruction than vanilla or Transformer-based alternatives.

Conclusion: The insights advance understanding of loss function design and IQA transferability, enabling more principled approaches to perceptual optimization.

Abstract: Perceptual optimization is primarily driven by the fidelity objective, which enforces both semantic consistency and overall visual realism, while the adversarial objective provides complementary refinement by enhancing perceptual sharpness and fine-grained detail. Despite their central role, the correlation between their effectiveness as optimization objectives and their capability as image quality assessment (IQA) metrics remains underexplored. In this work, we conduct a systematic analysis and reveal an unanticipated asymmetry between perceptual optimization and assessment: fidelity metrics that excel in IQA are not necessarily effective for perceptual optimization, with this misalignment emerging more distinctly under adversarial training. In addition, while discriminators effectively suppress artifacts during optimization, their learned representations offer only limited benefits when reused as backbone initializations for IQA models. Beyond this asymmetry, our findings further demonstrate that discriminator design plays a decisive role in shaping optimization, with patch-level and convolutional architectures providing more faithful detail reconstruction than vanilla or Transformer-based alternatives. These insights advance the understanding of loss function design and its connection to IQA transferability, paving the way for more principled approaches to perceptual optimization.

[168] Integrating Object Interaction Self-Attention and GAN-Based Debiasing for Visual Question Answering

Zhifei Li, Feng Qiu, Yiran Wang, Yujing Xia, Kui Xiao, Miao Zhang, Yan Zhang

Main category: cs.CV

TL;DR: IOG-VQA is a novel VQA model that integrates Object Interaction Self-Attention and GAN-Based Debiasing to address dataset biases and improve generalization by capturing complex object interactions and generating unbiased data distributions.

DetailsMotivation: Existing VQA models struggle with biases from training data, leading to over-reliance on superficial patterns and poor generalization to diverse questions and images.

Method: The model uses Object Interaction Self-Attention to capture complex object interactions within images and a GAN-Based Debiasing framework to generate unbiased data distributions for more robust feature learning.

Result: Extensive experiments on VQA-CP v1 and VQA-CP v2 datasets show excellent performance compared to existing methods, particularly in handling biased and imbalanced data distributions.

Conclusion: The study highlights the importance of addressing both object interactions and dataset biases for advancing VQA tasks, demonstrating that IOG-VQA effectively combines visual and textual information to mitigate inherent biases.

Abstract: Visual Question Answering (VQA) presents a unique challenge by requiring models to understand and reason about visual content to answer questions accurately. Existing VQA models often struggle with biases introduced by the training data, leading to over-reliance on superficial patterns and inadequate generalization to diverse questions and images. This paper presents a novel model, IOG-VQA, which integrates Object Interaction Self-Attention and GAN-Based Debiasing to enhance VQA model performance. The self-attention mechanism allows our model to capture complex interactions between objects within an image, providing a more comprehensive understanding of the visual context. Meanwhile, the GAN-based debiasing framework generates unbiased data distributions, helping the model to learn more robust and generalizable features. By leveraging these two components, IOG-VQA effectively combines visual and textual information to address the inherent biases in VQA datasets. Extensive experiments on the VQA-CP v1 and VQA-CP v2 datasets demonstrate that our model shows excellent performance compared with existing methods, particularly in handling biased and imbalanced data distributions, highlighting the importance of addressing both object interactions and dataset biases in advancing VQA tasks. Our code is available at https://github.com/HubuKG/IOG-VQA.

[169] FerretNet: Efficient Synthetic Image Detection via Local Pixel Dependencies

Shuqiao Liang, Jian Liu, Renzhang Chen, Quanlong Guan

Main category: cs.CV

TL;DR: FerretNet is a lightweight neural network that detects synthetic images by analyzing latent distribution deviations and decoding-induced smoothing effects using local pixel dependencies, achieving 97.1% accuracy across 22 generative models.

DetailsMotivation: The increasing realism of synthetic images from advanced models like VAEs, GANs, and LDMs creates challenges for detection, requiring methods to identify subtle generation artifacts.

Method: Leverages local pixel dependencies based on Markov Random Fields to reconstruct images and expose texture/edge inconsistencies. Proposes FerretNet, a 1.1M parameter neural network trained on ProGAN dataset.
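
The LPD intuition can be previewed with a simple filter: reconstruct every pixel from its eight neighbours and inspect the residual, which grows where texture continuity breaks. FerretNet learns from such neighbour-based reconstructions rather than using this exact mean filter.

```python
import numpy as np
from scipy.ndimage import convolve

def lpd_residual(img):
    """Reconstruct each pixel from its 8 neighbours and return the residual.
    Decoder-smoothed synthetic images tend to leave weaker, more uniform
    residuals than camera images."""
    kernel = np.array([[1, 1, 1],
                       [1, 0, 1],
                       [1, 1, 1]]) / 8.0  # mean of the 8 neighbours
    recon = convolve(img.astype(np.float64), kernel, mode="reflect")
    return img - recon  # large where local continuity breaks

img = np.random.rand(64, 64)  # stand-in for a grayscale image
print(lpd_residual(img).std())  # a simple texture-discontinuity statistic
```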

Result: Achieves 97.1% average accuracy on open-world benchmark with 22 generative models, surpassing state-of-the-art methods by 10.6%.

Conclusion: FerretNet provides efficient and robust synthetic image detection by focusing on generation artifacts, demonstrating strong generalization across diverse generative models with minimal training data.

Abstract: The increasing realism of synthetic images generated by advanced models such as VAEs, GANs, and LDMs poses significant challenges for synthetic image detection. To address this issue, we explore two artifact types introduced during the generation process: (1) latent distribution deviations and (2) decoding-induced smoothing effects, which manifest as inconsistencies in local textures, edges, and color transitions. Leveraging local pixel dependencies (LPD) properties rooted in Markov Random Fields, we reconstruct synthetic images using neighboring pixel information to expose disruptions in texture continuity and edge coherence. Building upon LPD, we propose FerretNet, a lightweight neural network with only 1.1M parameters that delivers efficient and robust synthetic image detection. Extensive experiments demonstrate that FerretNet, trained exclusively on the 4-class ProGAN dataset, achieves an average accuracy of 97.1% on an open-world benchmark comprising 22 generative models, surpassing state-of-the-art methods by 10.6%.

[170] Concepts in Motion: Temporal Bottlenecks for Interpretable Video Classification

Patrick Knab, Sascha Marton, Philipp J. Schubert, Drago Guggiana, Christian Bartelt

Main category: cs.CV

TL;DR: MoTIF extends Concept Bottleneck Models to video classification by handling temporal dependencies and arbitrary-length sequences using a transformer-inspired architecture.

DetailsMotivation: Extending interpretable concept-based models from static images to video data, which has temporal dependencies essential for capturing actions and events.

Method: Transformer-inspired architecture that adapts concept bottleneck framework for video classification, enabling global concept importance, local concept relevance, and temporal dependencies analysis.

Result: Demonstrates that concept-based modeling can be effectively transferred to video data while maintaining competitive performance.

Conclusion: The framework enables better understanding of concept contributions in temporal contexts for video classification tasks.

Abstract: Conceptual models such as Concept Bottleneck Models (CBMs) have driven substantial progress in improving interpretability for image classification by leveraging human-interpretable concepts. However, extending these models from static images to sequences of images, such as video data, introduces a significant challenge due to the temporal dependencies inherent in videos, which are essential for capturing actions and events. In this work, we introduce MoTIF (Moving Temporal Interpretable Framework), an architectural design inspired by a transformer that adapts the concept bottleneck framework for video classification and handles sequences of arbitrary length. Within the video domain, concepts refer to semantic entities such as objects, attributes, or higher-level components (e.g., ‘bow’, ‘mount’, ‘shoot’) that reoccur across time - forming motifs collectively describing and explaining actions. Our design explicitly enables three complementary perspectives: global concept importance across the entire video, local concept relevance within specific windows, and temporal dependencies of a concept over time. Our results demonstrate that the concept-based modeling paradigm can be effectively transferred to video data, enabling a better understanding of concept contributions in temporal contexts while maintaining competitive performance. Code available at github.com/patrick-knab/MoTIF.

[171] FSMODNet: A Closer Look at Few-Shot Detection in Multispectral Data

Manuel Nkegoum, Minh-Tan Pham, Élisa Fromont, Bruno Avignon, Sébastien Lefèvre

Main category: cs.CV

TL;DR: FSMODNet is a framework for few-shot multispectral object detection that combines visible and thermal imagery using deformable attention to achieve robust performance with limited annotated data.

DetailsMotivation: To address the challenge of detecting objects across visible and thermal modalities with minimal annotated data, particularly in complex illumination and environmental conditions.

Method: Leverages cross-modality feature integration using deformable attention to effectively combine the unique strengths of visible and thermal imagery.

Result: Experimental results on two public datasets show effective object detection performance in challenging low-data regimes, outperforming several baselines established from state-of-the-art models.

Conclusion: The proposed FSMODNet framework demonstrates robust adaptability and improved detection performance for few-shot multispectral object detection tasks.

Abstract: Few-shot multispectral object detection (FSMOD) addresses the challenge of detecting objects across visible and thermal modalities with minimal annotated data. In this paper, we explore this complex task and introduce a framework named “FSMODNet” that leverages cross-modality feature integration to improve detection performance even with limited labels. By effectively combining the unique strengths of visible and thermal imagery using deformable attention, the proposed method demonstrates robust adaptability in complex illumination and environmental conditions. Experimental results on two public datasets show effective object detection performance in challenging low-data regimes, outperforming several baselines we established from state-of-the-art models. All code, models, and experimental data splits can be found at https://anonymous.4open.science/r/Test-B48D.

[172] Finding 3D Positions of Distant Objects from Noisy Camera Movement and Semantic Segmentation Sequences

Julius Pesonen, Arno Solin, Eija Honkavaara

Main category: cs.CV

TL;DR: Particle filters enable 3D object localization from camera sequences for distant objects where traditional depth estimation fails, validated through drone-based wildfire monitoring simulations.

DetailsMotivation: Traditional 3D localization methods (dense depth estimation, 3D reconstruction) are infeasible for distant objects or computationally constrained tasks like drone-based wildfire monitoring.

Method: Uses particle filters for single and multiple target scenarios, tested with 3D simulation and drone-based image segmentation sequences with GNSS camera pose estimates.
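
A toy bearing-only particle filter captures the mechanics: hypothesize object positions, weight each hypothesis by how well it explains the camera-to-object direction implied by a segmentation plus the camera pose, then resample. The trajectory, noise levels, and likelihood below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
target = np.array([120.0, -40.0, 5.0])  # unknown object position (m)

# Particle hypotheses of the object position, spread over the search area.
particles = rng.uniform([-200, -200, 0], [200, 200, 50], size=(5000, 3))

for t in range(30):
    cam = np.array([10.0 * t, 0.0, 100.0])  # GNSS camera position estimate
    # Measured unit bearing toward the segmented object (noisy).
    bearing = (target - cam) + rng.normal(0.0, 1.0, 3)
    bearing /= np.linalg.norm(bearing)
    # Weight particles by agreement between predicted and measured bearings.
    pred = particles - cam
    pred /= np.linalg.norm(pred, axis=1, keepdims=True)
    weights = np.exp((pred @ bearing - 1.0) / 0.01)  # von Mises-like likelihood
    weights /= weights.sum()
    # Resample proportionally to weight, then jitter to keep diversity.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx] + rng.normal(0.0, 0.5, particles.shape)

print("estimate:", particles.mean(axis=0), "truth:", target)
```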

Result: Particle filters successfully solved practical localization tasks where other methods failed, demonstrating independence from detection methods and applicability to wildfire monitoring.

Conclusion: Particle filters provide a flexible, effective solution for camera-based 3D object localization in resource-constrained scenarios, enabling drone-based wildfire monitoring with existing segmentation models.

Abstract: 3D object localisation based on a sequence of camera measurements is essential for safety-critical surveillance tasks, such as drone-based wildfire monitoring. Localisation of objects detected with a camera can typically be solved with dense depth estimation or 3D scene reconstruction. However, in the context of distant objects or tasks limited by the amount of available computational resources, neither solution is feasible. In this paper, we show that the task can be solved using particle filters for both single and multiple target scenarios. The method was studied using a 3D simulation and a drone-based image segmentation sequence with global navigation satellite system (GNSS)-based camera pose estimates. The results showed that a particle filter can be used to solve practical localisation tasks based on camera poses and image segments in situations where other solutions fail. The particle filter is independent of the detection method, making it flexible for new tasks. The study also demonstrates that drone-based wildfire monitoring can be conducted using the proposed method paired with a pre-existing image segmentation model.

[173] SwinMamba: A hybrid local-global mamba framework for enhancing semantic segmentation of remotely sensed images

Qinfeng Zhu, Han Li, Liang He, Lei Fan

Main category: cs.CV

TL;DR: SwinMamba is a novel framework that combines local Mamba-style scanning with global receptive fields to improve semantic segmentation of remote sensing imagery by capturing both fine-grained details and broader contextual information.

DetailsMotivation: Vision Mamba's reliance on global scanning tends to overlook critical local features like textures and edges, which are essential for accurate segmentation in remote sensing contexts with complex scene structures and diverse object scales.

Method: SwinMamba integrates localized Mamba-style scanning within shifted windows (first two stages for local details) with global scanning (last two stages for contextual fusion). Overlapping shifted windows enhance inter-region information exchange across the image.
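
The local-scan stage amounts to reordering the feature sequence so a state-space block visits pixels window by window; shifting the window grid between blocks then exchanges information across window borders. The sketch below shows that reordering on a 4x4 toy grid; the Mamba blocks themselves and the exact shift schedule are not shown.

```python
import torch

def windowed_scan(x, window=4):
    """Flatten features window-by-window so a sequence model (e.g. a Mamba
    block) scans pixels locally before any global pass."""
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    # -> one sequence per image, ordered window by window,
    # row-major inside each window.
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, -1, C)
    return x  # (B, H*W, C) in local-scan order

x = torch.arange(16.0).view(1, 4, 4, 1)  # 4x4 "image", 1 channel
print(windowed_scan(x, window=2).squeeze())
# tensor([ 0., 1., 4., 5., 2., 3., 6., 7., 8., 9., 12., 13., 10., 11., 14., 15.])
```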

Result: Extensive experiments on LoveDA and ISPRS Potsdam datasets show that SwinMamba outperforms state-of-the-art methods in semantic segmentation of remote sensing imagery.

Conclusion: SwinMamba demonstrates effectiveness and potential as a superior solution for semantic segmentation of remotely sensed imagery by effectively balancing local feature capture with global contextual understanding.

Abstract: Semantic segmentation of remote sensing imagery is a fundamental task in computer vision, supporting a wide range of applications such as land use classification, urban planning, and environmental monitoring. However, this task is often challenged by the high spatial resolution, complex scene structures, and diverse object scales present in remote sensing data. To address these challenges, various deep learning architectures have been proposed, including convolutional neural networks, Vision Transformers, and the recently introduced Vision Mamba. Vision Mamba features a global receptive field and low computational complexity, demonstrating both efficiency and effectiveness in image segmentation. However, its reliance on global scanning tends to overlook critical local features, such as textures and edges, which are essential for achieving accurate segmentation in remote sensing contexts. To tackle this limitation, we propose SwinMamba, a novel framework inspired by the Swin Transformer. SwinMamba integrates localized Mamba-style scanning within shifted windows with a global receptive field, to enhance the model’s perception of both local and global features. Specifically, the first two stages of SwinMamba perform local scanning to capture fine-grained details, while its subsequent two stages leverage global scanning to fuse broader contextual information. In our model, the use of overlapping shifted windows enhances inter-region information exchange, facilitating more robust feature integration across the entire image. Extensive experiments on the LoveDA and ISPRS Potsdam datasets demonstrate that SwinMamba outperforms state-of-the-art methods, underscoring its effectiveness and potential as a superior solution for semantic segmentation of remotely sensed imagery.
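
To make the local-scanning idea concrete, the sketch below serializes image tokens window-by-window with an optional cyclic shift between blocks, in contrast to a single global raster scan. This is indexing only; the state-space model that consumes the ordered tokens is out of scope, and the function is a hypothetical illustration rather than SwinMamba's code.

```python
# Token ordering for local, window-wise scanning with an optional cyclic shift
# between blocks (assumes H and W are divisible by `win`).
import torch

def windowed_scan_order(H, W, win=8, shift=0):
    ys = (torch.arange(H) + shift) % H            # cyclic shift, as in Swin
    xs = (torch.arange(W) + shift) % W
    idx = ys[:, None] * W + xs[None, :]           # (H, W) grid of token ids
    order = []
    for y0 in range(0, H, win):
        for x0 in range(0, W, win):
            order.append(idx[y0:y0 + win, x0:x0 + win].reshape(-1))
    return torch.cat(order)                       # permutation of all H*W tokens
```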

[174] Revisiting Data Challenges of Computational Pathology: A Pack-based Multiple Instance Learning Framework

Wenhao Tang, Heng Fang, Ge Wu, Xiang Li, Ming-Ming Cheng

Main category: cs.CV

TL;DR: PackMIL is a pack-based multiple instance learning framework that addresses computational pathology challenges by packing variable-length feature sequences into fixed-length ones for efficient training, using residual branches and attention-driven downsampling to improve accuracy while reducing training time.

DetailsMotivation: Whole slide images (WSIs) in computational pathology have extreme sequence length variations (200-200K), high data heterogeneity, redundancy, and limited supervision, which conventional methods struggle to handle efficiently while preserving data integrity.

Method: Proposes a pack-based MIL framework that: 1) packs multiple sampled variable-length feature sequences into fixed-length ones for batched training; 2) uses a residual branch to compose discarded features into hyperslides with tailored labels; 3) employs attention-driven downsampling to compress features and reduce redundancy.

Result: Achieves up to 8% accuracy improvement while using only 12% of the training time on the PANDA (UNI) benchmark, demonstrating significant efficiency and performance gains.

Conclusion: Addressing data challenges in computational pathology through the proposed framework shows significant potential for foundation models, with the approach effectively handling extreme sequence variations while improving training efficiency and accuracy.

Abstract: Computational pathology (CPath) digitizes pathology slides into whole slide images (WSIs), enabling analysis for critical healthcare tasks such as cancer diagnosis and prognosis. However, WSIs possess extremely long sequence lengths (up to 200K), significant length variations (from 200 to 200K), and limited supervision. These extreme variations in sequence length lead to high data heterogeneity and redundancy. Conventional methods often compromise on training efficiency and optimization to preserve such heterogeneity under limited supervision. To comprehensively address these challenges, we propose a pack-based MIL framework. It packs multiple sampled, variable-length feature sequences into fixed-length ones, enabling batched training while preserving data heterogeneity. Moreover, we introduce a residual branch that composes discarded features from multiple slides into a hyperslide, which is trained with tailored labels. It offers multi-slide supervision while mitigating feature loss from sampling. Meanwhile, an attention-driven downsampler is introduced to compress features in both branches to reduce redundancy. By alleviating these challenges, our approach achieves an accuracy improvement of up to 8% while using only 12% of the training time on the PANDA (UNI) benchmark. Extensive experiments demonstrate that focusing on data challenges in CPath holds significant potential in the era of foundation models. The code is available at https://github.com/FangHeng/PackMIL
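
A minimal sketch of the packing idea is given below: variable-length per-slide feature sequences are greedily packed into fixed-length tensors, with the discarded remainder collected for a residual branch. The function names, the greedy strategy, and the zero-padding of the final pack are assumptions, not the authors' code.

```python
# Greedy packing of variable-length WSI feature sequences into fixed-length
# packs for batched training; leftovers feed a hyperslide-style branch.
import torch

def pack_sequences(feats, pack_len=4096):
    """feats: list of (n_i, d) tensors. Returns (B, pack_len, d) packs, a
    slide-id map usable for attention masking, and the discarded features."""
    d = feats[0].shape[1]
    packs, ids, leftovers = [], [], []
    buf, buf_id = [], []
    for sid, f in enumerate(feats):
        n_free = pack_len - sum(x.shape[0] for x in buf)
        take = min(f.shape[0], n_free)
        buf.append(f[:take]); buf_id.append(torch.full((take,), sid))
        if f.shape[0] > take:                      # residual -> hyperslide branch
            leftovers.append(f[take:])
        if sum(x.shape[0] for x in buf) == pack_len:
            packs.append(torch.cat(buf)); ids.append(torch.cat(buf_id))
            buf, buf_id = [], []
    if buf:                                        # zero-pad the final partial pack
        n_pad = pack_len - sum(x.shape[0] for x in buf)
        buf.append(torch.zeros(n_pad, d)); buf_id.append(torch.full((n_pad,), -1))
        packs.append(torch.cat(buf)); ids.append(torch.cat(buf_id))
    return torch.stack(packs), torch.stack(ids), leftovers
```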

[175] SimDiff: Simulator-constrained Diffusion Model for Physically Plausible Motion Generation

Akihisa Watanabe, Jiawei Ren, Li Siyao, Yichen Peng, Erwin Wu, Edgar Simo-Serra

Main category: cs.CV

TL;DR: SimDiff is a Simulator-constrained Diffusion Model that generates physically plausible human motion by integrating environment parameters directly into the denoising process, eliminating the need for computationally expensive simulator calls during inference.

DetailsMotivation: Existing methods for generating physically plausible human motion rely on simulator-based motion projection layers, which are computationally expensive due to their sequential nature and prevent parallelization. This limits their practical application in real-time scenarios.

Method: The authors reinterpret simulator-based motion projection as a form of guidance within the diffusion process and propose SimDiff, which conditions the denoising process directly on environment parameters (e.g., gravity, wind) rather than using explicit simulator calls.

Result: SimDiff generates physically plausible motions efficiently without repeated simulator calls at inference, provides fine-grained control over different physical coefficients, and demonstrates compositional generalization to unseen combinations of environmental parameters.

Conclusion: The proposed SimDiff framework offers an efficient alternative to simulator-based approaches by integrating physical constraints directly into the diffusion model, enabling real-time generation of physically plausible human motion with environmental control.

Abstract: Generating physically plausible human motion is crucial for applications such as character animation and virtual reality. Existing approaches often incorporate a simulator-based motion projection layer to the diffusion process to enforce physical plausibility. However, such methods are computationally expensive due to the sequential nature of the simulator, which prevents parallelization. We show that simulator-based motion projection can be interpreted as a form of guidance, either classifier-based or classifier-free, within the diffusion process. Building on this insight, we propose SimDiff, a Simulator-constrained Diffusion Model that integrates environment parameters (e.g., gravity, wind) directly into the denoising process. By conditioning on these parameters, SimDiff generates physically plausible motions efficiently, without repeated simulator calls at inference, and also provides fine-grained control over different physical coefficients. Moreover, SimDiff successfully generalizes to unseen combinations of environmental parameters, demonstrating compositional generalization.

[176] Unlocking Noise-Resistant Vision: Key Architectural Secrets for Robust Models

Bum Jun Kim, Makoto Kawano, Yusuke Iwasawa, Yutaka Matsuo

Main category: cs.CV

TL;DR: This paper identifies key architectural design choices that make vision models more robust to Gaussian noise, providing both empirical evidence and theoretical explanations for why larger stem kernels, smaller input resolutions, average pooling, and supervised ViTs outperform alternatives.

DetailsMotivation: To understand why certain vision architectures are inherently more robust to Gaussian noise and convert these empirical insights into actionable design rules for building more robust vision models.

Method: Extensive evaluation of 1,174 pretrained vision models, empirical identification of four design patterns, and theoretical analysis including proofs about low-pass stem kernels, downsampling effects, pooling mechanisms, and Lipschitz bounds for CLIP ViTs.

Result: Identified four consistent design patterns that yield up to 506 rank improvements and 21.6% accuracy gains: larger stem kernels, smaller input resolutions, average pooling, and supervised ViTs rather than CLIP ViTs.

Conclusion: The research disentangles robustness into interpretable modules, provides theoretical explanations for observed trends, and offers practical plug-and-play guidelines for designing vision models more robust against Gaussian noise.

Abstract: While the robustness of vision models is often measured, their dependence on specific architectural design choices is rarely dissected. We investigate why certain vision architectures are inherently more robust to additive Gaussian noise and convert these empirical insights into simple, actionable design rules. Specifically, we performed extensive evaluations on 1,174 pretrained vision models, empirically identifying four consistent design patterns for improved robustness against Gaussian noise: larger stem kernels, smaller input resolutions, average pooling, and supervised vision transformers (ViTs) rather than CLIP ViTs, which yield up to 506 rank improvements and 21.6%p accuracy gains. We then develop a theoretical analysis that explains these findings, converting observed correlations into causal mechanisms. First, we prove that low-pass stem kernels attenuate noise with a gain that decreases quadratically with kernel size and that anti-aliased downsampling reduces noise energy roughly in proportion to the square of the downsampling factor. Second, we demonstrate that average pooling is unbiased and suppresses noise in proportion to the pooling window area, whereas max pooling incurs a positive bias that grows slowly with window size and yields a relatively higher mean-squared error and greater worst-case sensitivity. Third, we reveal and explain the vulnerability of CLIP ViTs via a pixel-space Lipschitz bound: The smaller normalization standard deviations used in CLIP preprocessing amplify worst-case sensitivity by up to 1.91 times relative to the Inception-style preprocessing common in supervised ViTs. Our results collectively disentangle robustness into interpretable modules, provide a theory that explains the observed trends, and build practical, plug-and-play guidelines for designing vision models more robust against Gaussian noise.
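
The pooling claims are easy to sanity-check numerically. The snippet below (plain NumPy, not taken from the paper) confirms that averaging a k x k window of zero-mean Gaussian noise is unbiased with variance of roughly 1/k^2, while max pooling introduces a positive bias.

```python
# Empirical check: average pooling suppresses noise in proportion to the
# window area and stays unbiased; max pooling acquires a positive bias.
import numpy as np

rng = np.random.default_rng(0)
k = 4
noise = rng.normal(0.0, 1.0, size=(100_000, k * k))   # zero-mean Gaussian windows

avg = noise.mean(axis=1)
mx = noise.max(axis=1)

print(f"avg-pool  mean {avg.mean():+.3f}  var {avg.var():.4f} (theory {1/(k*k):.4f})")
print(f"max-pool  mean {mx.mean():+.3f}  (positive bias that grows slowly with k*k)")
```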

[177] Decoding the Surgical Scene: A Scoping Review of Scene Graphs in Surgery

Angelo Henriques, Korab Hoxha, Daniel Zapp, Peter C. Issa, Nassir Navab, M. Ali Nasseri

Main category: cs.CV

TL;DR: This scoping review maps scene graph research in surgery, revealing rapid growth but a critical ‘data divide’ between internal-view 2D video research and external-view 4D simulation research, while highlighting methodological advancements from basic graph neural networks to specialized foundation models that outperform general vision-language models.

DetailsMotivation: To systematically analyze the evolving landscape of scene graph research in surgery, charting its applications, methodological advancements, and future directions to understand how structured relational representations can decode complex surgical environments.

Method: PRISMA-ScR-guided scoping review that systematically maps scene graph research in surgery, analyzing applications, methodological advancements, and identifying research gaps.

Result: The review reveals rapid growth in surgical scene graph research but uncovers a critical ‘data divide’ - internal-view research uses real-world 2D video while external-view 4D modeling relies on simulated data. Methodologically, the field has advanced from foundational graph neural networks to specialized foundation models that outperform generalist large vision-language models.

Conclusion: Surgical scene graphs are maturing into an essential semantic bridge that enables intelligent systems to improve surgical safety, efficiency, and training, despite persistent challenges in data annotation and real-time implementation that are being actively addressed.

Abstract: Scene graphs (SGs) provide structured relational representations crucial for decoding complex, dynamic surgical environments. This PRISMA-ScR-guided scoping review systematically maps the evolving landscape of SG research in surgery, charting its applications, methodological advancements, and future directions. Our analysis reveals rapid growth, yet uncovers a critical ‘data divide’: internal-view research (e.g., triplet recognition) almost exclusively uses real-world 2D video, while external-view 4D modeling relies heavily on simulated data, exposing a key translational research gap. Methodologically, the field has advanced from foundational graph neural networks to specialized foundation models that now significantly outperform generalist large vision-language models in surgical contexts. This progress has established SGs as a cornerstone technology for both analysis, such as workflow recognition and automated safety monitoring, and generative tasks like controllable surgical simulation. Although challenges in data annotation and real-time implementation persist, they are actively being addressed through emerging techniques. Surgical SGs are maturing into an essential semantic bridge, enabling a new generation of intelligent systems to improve surgical safety, efficiency, and training.

[178] A Real-Time On-Device Defect Detection Framework for Laser Power-Meter Sensors via Unsupervised Learning

Dongqi Zheng, Wenjin Fu, Guangzong Chen

Main category: cs.CV

TL;DR: Automated vision-based system for defect detection in laser power meter sensor coatings using unsupervised anomaly detection with 93.8% accuracy on defective samples and 89.3% on good samples.

DetailsMotivation: Address the critical challenge of identifying coating defects (thermal damage, scratches) that compromise laser energy measurement accuracy in medical and industrial applications, enabling automated quality control.

Method: Unsupervised anomaly detection framework with three components: (1) preprocessing pipeline using Laplacian edge detection and K-means clustering for segmentation, (2) synthetic data augmentation via StyleGAN2, (3) UFlow-based neural network for multi-scale feature extraction and anomaly map generation.

Result: 93.8% accuracy on defective samples, 89.3% accuracy on good samples, image-level AUROC of 0.957, pixel-level AUROC of 0.961 on 366 real sensor images, with processing time of 0.5 seconds per image.

Conclusion: The system provides effective automated defect detection for laser power meter coatings with high accuracy and fast processing, offering significant cost savings through automated quality control.

Abstract: We present an automated vision-based system for defect detection and classification of laser power meter sensor coatings. Our approach addresses the critical challenge of identifying coating defects such as thermal damage and scratches that can compromise laser energy measurement accuracy in medical and industrial applications. The system employs an unsupervised anomaly detection framework that trains exclusively on “good” sensor images to learn normal coating distribution patterns, enabling detection of both known and novel defect types without requiring extensive labeled defect datasets. Our methodology consists of three key components: (1) a robust preprocessing pipeline using Laplacian edge detection and K-means clustering to segment the area of interest, (2) synthetic data augmentation via StyleGAN2, and (3) a UFlow-based neural network architecture for multi-scale feature extraction and anomaly map generation. Experimental evaluation on 366 real sensor images demonstrates 93.8% accuracy on defective samples and 89.3% accuracy on good samples, with an image-level AUROC of 0.957 and a pixel-level AUROC of 0.961. The system offers potential annual cost savings through automated quality control, with processing times of 0.5 seconds per image in the on-device implementation.
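
A plausible sketch of the described preprocessing step follows, using standard OpenCV and scikit-learn calls; the feature choice (intensity plus edge magnitude) and the cluster count are illustrative assumptions, not the authors' exact pipeline.

```python
# Laplacian edges plus K-means to segment the coating area of interest.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def segment_area_of_interest(gray, n_clusters=3):
    """gray: 2D uint8 image. Returns a per-pixel cluster map."""
    edges = cv2.Laplacian(gray, cv2.CV_64F)           # highlight coating boundaries
    edge_mag = np.abs(edges).astype(np.float32)
    feats = np.stack([gray.astype(np.float32).ravel(),
                      edge_mag.ravel()], axis=1)      # intensity + edge features
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
    return labels.reshape(gray.shape)                 # pick the AOI cluster downstream
```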

[179] Unlocking Financial Insights: An advanced Multimodal Summarization with Multimodal Output Framework for Financial Advisory Videos

Sarmistha Das, R E Zera Marveen Lyngkhoi, Sriparna Saha, Alka Maurya

Main category: cs.CV

TL;DR: FASTER is a modular framework for summarizing lengthy financial advisory podcast videos by extracting multimodal features, producing concise summaries, and aligning visual keyframes with textual content using advanced AI techniques.

DetailsMotivation: Extracting insights from lengthy (30-40 minute) multimodal financial advisory content is challenging, and existing methods struggle with modality alignment and factual consistency in this domain.

Method: Uses BLIP for visual descriptions, OCR for text patterns, Whisper transcription with speaker diarization, modified DPO-based loss with fact-checking, and ranker-based retrieval for keyframe-text alignment. Introduces Fin-APT dataset of 470 financial videos.

Result: Comprehensive experiments show FASTER outperforms LLMs and VLMs in performance, robustness, and generalizability for multimodal summarization tasks.

Conclusion: FASTER establishes a new standard for multimodal summarization, making financial advisory content more accessible and actionable while opening new research avenues.

Abstract: The dynamic propagation of social media has broadened the reach of financial advisory content through podcast videos, yet extracting insights from lengthy, multimodal segments (30-40 minutes) remains challenging. We introduce FASTER (Financial Advisory Summariser with Textual Embedded Relevant images), a modular framework that tackles three key challenges: (1) extracting modality-specific features, (2) producing optimized, concise summaries, and (3) aligning visual keyframes with associated textual points. FASTER employs BLIP for semantic visual descriptions, OCR for textual patterns, and Whisper-based transcription with speaker diarization as BOS features. A modified Direct Preference Optimization (DPO)-based loss function, equipped with BOS-specific fact-checking, ensures precision, relevance, and factual consistency against the human-aligned summary. A ranker-based retrieval mechanism further aligns keyframes with summarized content, enhancing interpretability and cross-modal coherence. To address data resource scarcity, we introduce Fin-APT, a dataset comprising 470 publicly accessible financial advisory pep-talk videos for robust multimodal research. Comprehensive cross-domain experiments confirm FASTER’s strong performance, robustness, and generalizability when compared to Large Language Models (LLMs) and Vision-Language Models (VLMs). By establishing a new standard for multimodal summarization, FASTER makes financial advisory content more accessible and actionable, thereby opening new avenues for research. The dataset and code are available at: https://github.com/sarmistha-D/FASTER

[180] An Adaptor for Triggering Semi-Supervised Learning to Out-of-Box Serve Deep Image Clustering

Yue Duan, Lei Qi, Yinghuan Shi, Yang Gao

Main category: cs.CV

TL;DR: ASD is an adaptor that enables cold-start SSL learners for deep image clustering without prerequisites like pretraining or clustering models, achieving near-ground-truth performance.

DetailsMotivation: Existing SSL-integrated deep clustering methods require pretraining, clustering learning, or trained models as prerequisites, limiting flexible out-of-box application of SSL learners for image clustering.

Method: ASD randomly samples pseudo-labeled data, uses instance-level classification to learn semantically aligned labels, tracks class transitions to extract high-level similarities, assigns cluster-level labels, and triggers SSL learners for clustering.

Result: ASD shows superior performance across benchmarks against latest deep image clustering methods, with only 1.33% accuracy gap on CIFAR-10 compared to SSL methods using ground-truth labels.

Conclusion: ASD enables cold-start SSL for image clustering without prerequisites and can further boost existing SSL-embedded deep clustering methods.

Abstract: Recently, some works integrate SSL techniques into deep clustering frameworks to enhance image clustering performance. However, they all need pretraining, clustering learning, or a trained clustering model as prerequisites, limiting the flexible and out-of-box application of SSL learners in the image clustering task. This work introduces ASD, an adaptor that enables the cold-start of SSL learners for deep image clustering without any prerequisites. Specifically, we first randomly sample pseudo-labeled data from all unlabeled data, and set an instance-level classifier to learn them with semantically aligned instance-level labels. With the ability of instance-level classification, we track the class transitions of predictions on unlabeled data to extract high-level similarities of instance-level classes, which can be utilized to assign cluster-level labels to pseudo-labeled data. Finally, we use the pseudo-labeled data with assigned cluster-level labels to trigger a general SSL learner trained on the unlabeled data for image clustering. We show the superior performance of ASD across various benchmarks against the latest deep image clustering approaches and very slight accuracy gaps compared to SSL methods using ground-truth labels, e.g., only 1.33% on CIFAR-10. Moreover, ASD can also further boost the performance of existing SSL-embedded deep image clustering methods.
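
One way to picture the class-transition step: count how often predictions on unlabeled data swap between instance-level classes across epochs, then merge frequently co-transitioning classes into cluster-level labels. The sketch below is a hedged reconstruction of that idea; the similarity threshold and the connected-components grouping are assumptions, not the paper's exact procedure.

```python
# Build a class-transition similarity matrix from prediction histories and
# group instance-level classes into cluster-level labels.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def cluster_level_labels(pred_history, n_inst_classes, sim_thresh=0.05):
    """pred_history: (T, N) array, predicted instance class per epoch per sample."""
    trans = np.zeros((n_inst_classes, n_inst_classes))
    for t in range(1, pred_history.shape[0]):
        a, b = pred_history[t - 1], pred_history[t]
        np.add.at(trans, (a, b), 1)                    # count class transitions
    sim = (trans + trans.T) / trans.sum()              # symmetric, normalized
    adj = csr_matrix(sim > sim_thresh)
    _, comp = connected_components(adj, directed=False)
    return comp                                        # instance class -> cluster id
```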

[181] SiNGER: A Clearer Voice Distills Vision Transformers Further

Geunhyeok Yu, Sunjae Jeong, Yoonyoung Choi, Jaeseung Kim, Hyoseok Hwang

Main category: cs.CV

TL;DR: SiNGER is a novel distillation framework that suppresses high-norm artifacts in Vision Transformers while preserving informative signals during knowledge distillation, improving student model performance.

DetailsMotivation: Vision Transformers produce high-norm artifacts that degrade representation quality. When these features are distilled to students, artifacts dominate the objective, causing students to overfit to artifacts and underweight informative signals, diminishing gains from larger models.

Method: Singular Nullspace-Guided Energy Reallocation (SiNGER) uses principled teacher feature refinement with nullspace-guided perturbation to preserve information while suppressing artifacts. This is implemented efficiently with a LoRA-based adapter requiring minimal structural modification.

Result: Extensive experiments show SiNGER consistently improves student models, achieving state-of-the-art performance in multiple downstream tasks and producing clearer, more interpretable representations.

Conclusion: SiNGER effectively addresses the trade-off between artifact suppression and signal preservation in knowledge distillation, enabling better transfer of informative features from teacher to student models.

Abstract: Vision Transformers are widely adopted as the backbone of vision foundation models, but they are known to produce high-norm artifacts that degrade representation quality. When knowledge distillation transfers these features to students, high-norm artifacts dominate the objective, so students overfit to artifacts and underweight informative signals, diminishing the gains from larger models. Prior work attempted to remove artifacts but encountered an inherent trade-off between artifact suppression and preserving informative signals from teachers. To address this, we introduce Singular Nullspace-Guided Energy Reallocation (SiNGER), a novel distillation framework that suppresses artifacts while preserving informative signals. The key idea is principled teacher feature refinement: during refinement, we leverage the nullspace-guided perturbation to preserve information while suppressing artifacts. Then, the refined teacher’s features are distilled to a student. We implement this perturbation efficiently with a LoRA-based adapter that requires minimal structural modification. Extensive experiments show that SiNGER consistently improves student models, achieving state-of-the-art performance in multiple downstream tasks and producing clearer and more interpretable representations.
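
A loose sketch of what nullspace-guided refinement could look like: flag abnormally high-norm tokens and remove only their component outside the top-k singular subspace, leaving the informative span untouched. The subspace size, the z-score test, and the projector are all assumptions for illustration, not SiNGER's actual perturbation.

```python
# Suppress high-norm artifact tokens using only directions in the (approximate)
# nullspace of the informative top-k singular subspace.
import torch

def suppress_artifacts(feats, k=32, norm_z=3.0):
    """feats: (N, d) teacher tokens. Returns refined tokens."""
    U, S, Vh = torch.linalg.svd(feats, full_matrices=False)
    V_top = Vh[:k].T                                       # informative subspace (d, k)
    P_null = torch.eye(feats.shape[1]) - V_top @ V_top.T   # projector onto its nullspace
    z = (feats.norm(dim=1) - feats.norm(dim=1).mean()) / feats.norm(dim=1).std()
    mask = z > norm_z                                      # abnormally high-norm tokens
    feats = feats.clone()
    feats[mask] = feats[mask] - feats[mask] @ P_null       # drop nullspace energy only
    return feats
```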

[182] Fast-SEnSeI: Lightweight Sensor-Independent Cloud Masking for On-board Multispectral Sensors

Jan Kněžík, Jonáš Herec, Rado Pitoňák

Main category: cs.CV

TL;DR: Fast-SEnSeI is a lightweight, sensor-independent encoder module for on-board cloud segmentation that works across different multispectral sensors with varying band configurations.

DetailsMotivation: Most cloud segmentation models are tightly coupled to specific sensor configurations and rely on ground-based processing, limiting flexibility and real-time applications.

Method: Builds on SEnSeI-v2 with improved spectral descriptor, lightweight architecture, and robust padding-band handling. Accepts arbitrary spectral band combinations and wavelengths, producing fixed-size feature maps for a compact, quantized U-Net segmentation model. Uses CPU-FPGA hybrid pipeline with Apache TVM on embedded CPUs and FPGA deployment.

Result: Evaluations on Sentinel-2 and Landsat 8 datasets demonstrate accurate segmentation across diverse input configurations.

Conclusion: Fast-SEnSeI enables flexible, efficient on-board cloud segmentation suitable for space-qualified hardware, overcoming sensor dependency limitations of existing approaches.

Abstract: Cloud segmentation is a critical preprocessing step for many Earth observation tasks, yet most models are tightly coupled to specific sensor configurations and rely on ground-based processing. In this work, we propose Fast-SEnSeI, a lightweight, sensor-independent encoder module that enables flexible, on-board cloud segmentation across multispectral sensors with varying band configurations. Building upon SEnSeI-v2, Fast-SEnSeI integrates an improved spectral descriptor, lightweight architecture, and robust padding-band handling. It accepts arbitrary combinations of spectral bands and their wavelengths, producing fixed-size feature maps that feed into a compact, quantized segmentation model based on a modified U-Net. The module runs efficiently on embedded CPUs using Apache TVM, while the segmentation model is deployed on FPGA, forming a CPU-FPGA hybrid pipeline suitable for space-qualified hardware. Evaluations on Sentinel-2 and Landsat 8 datasets demonstrate accurate segmentation across diverse input configurations.
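
A rough sketch of a sensor-independent band encoder in this spirit: each band is embedded together with a wavelength descriptor and the results are aggregated into a fixed-size feature map, so arbitrary band counts and orderings are accepted. Layer shapes and the mean aggregation are assumptions, not SEnSeI-v2 details.

```python
# Wavelength-conditioned per-band embedding with fixed-size aggregation.
import torch
import torch.nn as nn

class BandEncoder(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.wave_mlp = nn.Sequential(nn.Linear(1, d), nn.ReLU(), nn.Linear(d, d))
        self.pix_proj = nn.Conv2d(1, d, kernel_size=1)

    def forward(self, bands, wavelengths_nm):
        """bands: (B, C, H, W) with arbitrary C; wavelengths_nm: (C,) float tensor."""
        maps = []
        for c in range(bands.shape[1]):
            w = self.wave_mlp(wavelengths_nm[c:c + 1].view(1, 1))  # (1, d) descriptor
            px = self.pix_proj(bands[:, c:c + 1])                  # (B, d, H, W)
            maps.append(px + w.view(1, -1, 1, 1))                  # inject wavelength
        return torch.stack(maps, dim=1).mean(dim=1)                # fixed (B, d, H, W)
```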

[183] A Single Neuron Works: Precise Concept Erasure in Text-to-Image Diffusion Models

Qinqin He, Jiaqi Weng, Jialing Tao, Hui Xue

Main category: cs.CV

TL;DR: SNCE is a novel concept erasure method that precisely removes harmful concepts from text-to-image models by manipulating only a single neuron, achieving surgical precision while preserving image quality.

DetailsMotivation: Text-to-image models pose safety risks by generating harmful content, and existing concept erasure methods struggle with precise removal while maintaining image quality.

Method: Train a Sparse Autoencoder to map text embeddings into a sparse latent space, identify harmful concept-specific neurons using modulated frequency scoring, and suppress activations of those neurons.

Result: SNCE achieves state-of-the-art concept erasure performance, preserves generation capabilities for non-target concepts, and exhibits strong robustness against adversarial attacks.

Conclusion: The proposed single neuron-based approach enables precise and effective concept erasure with minimal disruption to overall model performance.

Abstract: Text-to-image models exhibit remarkable capabilities in image generation. However, they also pose safety risks of generating harmful content. A key challenge of existing concept erasure methods is the precise removal of target concepts while minimizing degradation of image quality. In this paper, we propose Single Neuron-based Concept Erasure (SNCE), a novel approach that can precisely prevent harmful content generation by manipulating only a single neuron. Specifically, we train a Sparse Autoencoder (SAE) to map text embeddings into a sparse, disentangled latent space, where individual neurons align tightly with atomic semantic concepts. To accurately locate neurons responsible for harmful concepts, we design a novel neuron identification method based on the modulated frequency scoring of activation patterns. By suppressing activations of the harmful concept-specific neuron, SNCE achieves surgical precision in concept erasure with minimal disruption to image quality. Experiments on various benchmarks demonstrate that SNCE achieves state-of-the-art results in target concept erasure, while preserving the model’s generation capabilities for non-target concepts. Additionally, our method exhibits strong robustness against adversarial attacks, significantly outperforming existing methods.
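
A minimal sketch of the single-neuron mechanism, under assumed SAE shapes: encode the text embedding into a sparse latent space, zero out the pre-identified harmful-concept neuron, and decode. The neuron index and dimensions are placeholders, and the identification step (frequency scoring) is omitted.

```python
# Sparse autoencoder with one latent neuron suppressed at inference time.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_latent=16384):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, x, suppress_idx=None):
        z = torch.relu(self.enc(x))              # sparse, disentangled activations
        if suppress_idx is not None:
            z[..., suppress_idx] = 0.0           # erase the harmful-concept neuron
        return self.dec(z)

sae = SparseAutoencoder()
text_emb = torch.randn(1, 768)                   # stand-in for a T2I text embedding
safe_emb = sae(text_emb, suppress_idx=1234)      # hypothetical neuron id
```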

[184] OmniPlantSeg: Species Agnostic 3D Point Cloud Organ Segmentation for High-Resolution Plant Phenotyping Across Modalities

Andreas Gilson, Lukas Meyer, Oliver Scholz, Ute Schmid

Main category: cs.CV

TL;DR: KDSS is a lightweight sub-sampling algorithm for biological point clouds that enables full-resolution segmentation without down-sampling, working across different plant species and sensor modalities.

DetailsMotivation: Existing plant organ segmentation methods are species/sensor-specific and require extensive pre-processing and down-sampling, which limits their applicability and accuracy.

Method: Proposed KDSS algorithm for sub-sampling biological point clouds that is agnostic to sensor data and plant species, allowing segmentation of full-resolution point clouds without down-sampling.

Result: KDSS combined with state-of-the-art segmentation models achieves satisfying results across different modalities (photogrammetry, laser triangulation, LiDAR) for various plant species.

Conclusion: KDSS serves as a lightweight, resolution-retaining alternative to intensive pre-processing and down-sampling methods for plant organ segmentation, regardless of species and sensor modality.

Abstract: Accurate point cloud segmentation for plant organs is crucial for 3D plant phenotyping. Existing solutions are problem-specific, designed with a focus on certain plant species or specific sensor modalities for data acquisition. Furthermore, it is common to use extensive pre-processing and down-sample the plant point clouds to meet hardware or neural network input size requirements. We propose a simple yet effective algorithm, KDSS, for sub-sampling of biological point clouds that is agnostic to sensor data and plant species. The main benefit of this approach is that we do not need to down-sample our input data and thus enable segmentation of the full-resolution point cloud. Combining KDSS with current state-of-the-art segmentation models shows satisfying results evaluated on different modalities such as photogrammetry, laser triangulation and LiDAR for various plant species. We propose KDSS as a lightweight, resolution-retaining alternative to intensive pre-processing and down-sampling methods for plant organ segmentation, regardless of the species and sensor modality used.

[185] Background Prompt for Few-Shot Out-of-Distribution Detection

Songyue Cai, Zongqian Wu, Yujie Mo, Liang Peng, Ping Hu, Xiaoshuang Shi, Xiaofeng Zhu

Main category: cs.CV

TL;DR: Mambo is a new FG-BG decomposition framework for few-shot out-of-distribution detection that addresses limitations of previous methods by combining local background and class similarities with flexible patch selection.

DetailsMotivation: Existing FG-BG decomposition methods suffer from low robustness due to over-reliance on local class similarity and fixed background patch extraction strategies.

Method: Proposes learning a background prompt to obtain local background similarity, refining it with local class similarity, and using patch self-calibrated tuning for flexible background patch selection based on sample diversity.

Result: Extensive experiments on real-world datasets show Mambo achieves state-of-the-art performance in OOD detection and near OOD detection settings.

Conclusion: Mambo effectively addresses the limitations of previous methods by reducing dependence on local class similarity and enabling flexible background extraction, demonstrating superior performance in FS-OOD detection.

Abstract: Existing foreground-background (FG-BG) decomposition methods for few-shot out-of-distribution (FS-OOD) detection often suffer from low robustness due to over-reliance on local class similarity and a fixed background patch extraction strategy. To address these challenges, we propose a new FG-BG decomposition framework, namely Mambo, for FS-OOD detection. Specifically, we propose to first learn a background prompt to obtain a local background similarity containing both background and image semantic information, and then refine this local background similarity using the local class similarity. As a result, we use both the refined local background similarity and the local class similarity to conduct background extraction, reducing the dependence on local class similarity found in previous methods. Furthermore, we propose patch self-calibrated tuning, which takes sample diversity into account to flexibly select the number of background patches for each sample, thus addressing the fixed background extraction strategies of previous methods. Extensive experiments on real-world datasets demonstrate that our proposed Mambo achieves the best performance compared to SOTA methods in both the OOD detection and near-OOD detection settings. The source code will be released at https://github.com/YuzunoKawori/Mambo.

[186] Stratify or Die: Rethinking Data Splits in Image Segmentation

Naga Venkata Sai Jitin Jami, Thomas Altstidl, Jonas Mueller, Jindong Li, Dario Zanca, Bjoern Eskofier, Heike Leutheuser

Main category: cs.CV

TL;DR: The paper introduces two novel stratification methods (IPS and WDES) for creating representative test sets in image segmentation tasks, addressing the limitations of random splitting that leads to biased evaluations.

DetailsMotivation: Random dataset splitting in image segmentation often creates unrepresentative test sets, causing biased model evaluations and poor generalization. Existing stratification methods from classification don't effectively handle segmentation's multi-label structure and class imbalance.

Method: Two methods are proposed: Iterative Pixel Stratification (IPS) - a simple label-aware sampling method, and Wasserstein-Driven Evolutionary Stratification (WDES) - a genetic algorithm that minimizes Wasserstein distance to optimize label distribution similarity across dataset splits.

Result: WDES consistently produces more representative splits than random sampling, leading to lower performance variance and improved model evaluation across diverse segmentation tasks (street scenes, medical imaging, satellite imagery). WDES is particularly valuable for small, imbalanced, and low-diversity datasets.

Conclusion: WDES is proven to be globally optimal given enough generations and effectively addresses the bias issues in conventional dataset splitting strategies for segmentation tasks.

Abstract: Random splitting of datasets in image segmentation often leads to unrepresentative test sets, resulting in biased evaluations and poor model generalization. While stratified sampling has proven effective for addressing label distribution imbalance in classification tasks, extending these ideas to segmentation remains challenging due to the multi-label structure and class imbalance typically present in such data. Building on existing stratification concepts, we introduce Iterative Pixel Stratification (IPS), a straightforward, label-aware sampling method tailored for segmentation tasks. Additionally, we present Wasserstein-Driven Evolutionary Stratification (WDES), a novel genetic algorithm designed to minimize the Wasserstein distance, thereby optimizing the similarity of label distributions across dataset splits. We prove that WDES is globally optimal given enough generations. Using newly proposed statistical heterogeneity metrics, we evaluate both methods against random sampling and find that WDES consistently produces more representative splits. Applying WDES across diverse segmentation tasks, including street scenes, medical imaging, and satellite imagery, leads to lower performance variance and improved model evaluation. Our results also highlight the particular value of WDES in handling small, imbalanced, and low-diversity datasets, where conventional splitting strategies are most prone to bias.
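
The fitness at the heart of WDES can be sketched as follows: score a candidate split by the Wasserstein distance between the per-class pixel-label distributions of its two sides. The genetic search itself is omitted; the index lists, class count, and distance formulation are assumed inputs for illustration.

```python
# Score how well two splits match in label distribution (lower is better).
import numpy as np
from scipy.stats import wasserstein_distance

def split_distance(masks, split_a, split_b, n_classes):
    """masks: list of 2D label maps; split_a/b: index lists for the two splits."""
    def class_fractions(idx):
        counts = np.zeros(n_classes)
        for i in idx:
            counts += np.bincount(masks[i].ravel(), minlength=n_classes)
        return counts / counts.sum()
    return wasserstein_distance(
        np.arange(n_classes), np.arange(n_classes),
        class_fractions(split_a), class_fractions(split_b))
```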

[187] EnGraf-Net: Multiple Granularity Branch Network with Fine-Coarse Graft Grained for Classification Task

Riccardo La Grassa, Ignazio Gallo, Nicola Landro

Main category: cs.CV

TL;DR: EnGraf-Net is a fine-grained classification model that uses semantic associations structured as a taxonomy as supervised signals in an end-to-end deep neural network, achieving competitive performance without requiring cropping techniques or manual annotations.

DetailsMotivation: Existing fine-grained classification models rely on part annotations or automatic cropping methods, which suffer from incomplete representation of local features. Humans recognize objects by forming semantic associations, so leveraging semantic hierarchies can improve classification.

Method: The proposed EnGraf-Net uses semantic associations structured as a hierarchy (taxonomy) as supervised signals within an end-to-end deep neural network model, eliminating the need for cropping techniques or manual annotations.

Result: Extensive experiments on CIFAR-100, CUB-200-2011, and FGVC-Aircraft datasets show EnGraf-Net achieves competitive performance with state-of-the-art approaches.

Conclusion: Leveraging semantic hierarchies as supervised signals provides an effective approach for fine-grained classification without requiring complex part annotations or cropping techniques.

Abstract: Fine-grained classification models are designed to focus on the relevant details necessary to distinguish highly similar classes, particularly when intra-class variance is high and inter-class variance is low. Most existing models rely on part annotations such as bounding boxes, part locations, or textual attributes to enhance classification performance, while others employ sophisticated techniques to automatically extract attention maps. We posit that part-based approaches, including automatic cropping methods, suffer from an incomplete representation of local features, which are fundamental for distinguishing similar objects. While fine-grained classification aims to recognize the leaves of a hierarchical structure, humans recognize objects by also forming semantic associations. In this paper, we leverage semantic associations structured as a hierarchy (taxonomy) as supervised signals within an end-to-end deep neural network model, termed EnGraf-Net. Extensive experiments on three well-known datasets, CIFAR-100, CUB-200-2011, and FGVC-Aircraft, demonstrate the superiority of EnGraf-Net over many existing fine-grained models, showing competitive performance with the most recent state-of-the-art approaches, without requiring cropping techniques or manual annotations.
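
The taxonomy-as-supervision idea follows a common pattern that can be sketched generically: a shared backbone with one classification head per hierarchy level, trained with a weighted sum of cross-entropies. This is not EnGraf-Net's actual branch/graft architecture, just the underlying supervision scheme.

```python
# Generic two-level hierarchical supervision: coarse and fine heads on a
# shared backbone, combined cross-entropy loss.
import torch
import torch.nn as nn

class HierarchicalClassifier(nn.Module):
    def __init__(self, backbone, feat_dim, n_coarse, n_fine):
        super().__init__()
        self.backbone = backbone
        self.coarse_head = nn.Linear(feat_dim, n_coarse)   # e.g. family-level classes
        self.fine_head = nn.Linear(feat_dim, n_fine)       # leaf classes

    def forward(self, x):
        h = self.backbone(x)
        return self.coarse_head(h), self.fine_head(h)

def hierarchical_loss(coarse_logits, fine_logits, y_coarse, y_fine, alpha=0.5):
    ce = nn.functional.cross_entropy
    return alpha * ce(coarse_logits, y_coarse) + (1 - alpha) * ce(fine_logits, y_fine)
```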

[188] Vision Transformers: the threat of realistic adversarial patches

Kasper Cools, Clara Maathuis, Alexander M. van Oers, Claudia S. Hübner, Nikos Deligiannis, Marijke Vandewal, Geert De Cubber

Main category: cs.CV

TL;DR: Vision Transformers (ViTs) are vulnerable to adversarial patch attacks despite their increased robustness compared to CNNs, with attack success rates varying significantly across different ViT models (40.04% to 99.97%) depending on pre-training methodology.

DetailsMotivation: To investigate the transferability of adversarial attack techniques from CNNs to ViTs and assess ViT vulnerability to evasion attacks, particularly adversarial patches using Creases Transformation technique that mimics natural clothing distortions.

Method: Designed realistic adversarial patches using Creases Transformation (CT) technique to cause misclassification in person vs. non-person classification tasks. Evaluated four fine-tuned ViT models on binary person classification with adversarial patches transferred from CNN attack techniques.

Result: Significant vulnerability variations observed: google/vit-base-patch16-224-in21k (40.04%), facebook/dino-vitb16 (99.97%), google/vit-base-patch16-224 (66.40%), facebook/dinov3-vitb16 (65.17%). Pre-training dataset scale and methodology strongly influenced model resilience.

Conclusion: Adversarial patches successfully transfer from CNNs to ViTs, confirming cross-architectural transferability. ViTs remain vulnerable to evasion attacks, with model resilience heavily dependent on pre-training strategies.

Abstract: The increasing reliance on machine learning systems has made their security a critical concern. Evasion attacks enable adversaries to manipulate the decision-making processes of AI systems, potentially causing security breaches or misclassification of targets. Vision Transformers (ViTs) have gained significant traction in modern machine learning due to 1) increased performance compared to Convolutional Neural Networks (CNNs) and 2) increased robustness against adversarial perturbations. However, ViTs remain vulnerable to evasion attacks, particularly to adversarial patches, unique patterns designed to manipulate AI classification systems. These vulnerabilities are investigated by designing realistic adversarial patches to cause misclassification in person vs. non-person classification tasks using the Creases Transformation (CT) technique, which adds subtle geometric distortions similar to those occurring naturally when wearing clothing. This study investigates the transferability of adversarial attack techniques used on CNNs when applied to ViT classification models. Experimental evaluation across four fine-tuned ViT models on a binary person classification task reveals significant vulnerability variations: attack success rates ranged from 40.04% (google/vit-base-patch16-224-in21k) to 99.97% (facebook/dino-vitb16), with google/vit-base-patch16-224 achieving 66.40% and facebook/dinov3-vitb16 reaching 65.17%. These results confirm the cross-architectural transferability of adversarial patches from CNNs to ViTs, with pre-training dataset scale and methodology strongly influencing model resilience to adversarial attacks.

[189] UniTransfer: Video Concept Transfer via Progressive Spatial and Timestep Decomposition

Guojun Lei, Rong Zhang, Chi Wang, Tianhang Liu, Hong Li, Zhiyuan Ma, Weiwei Xu

Main category: cs.CV

TL;DR: UniTransfer introduces spatial and diffusion timestep decomposition for precise video concept transfer, using a dual-to-single-stream DiT architecture and Chain-of-Prompt mechanism with LLM guidance.

DetailsMotivation: To achieve more precise and controllable video concept transfer by decomposing videos into key components and leveraging progressive denoising with stage-specific instructions.

Method: Spatial decomposition into foreground, background, and motion flow; dual-to-single-stream DiT architecture; self-supervised pretraining with random masking; Chain-of-Prompt mechanism for timestep decomposition using LLMs; curated OpenAnimal dataset.

Result: Achieves high-quality and controllable video concept transfer across diverse reference images and scenes, surpassing existing baselines in visual fidelity and editability.

Conclusion: UniTransfer demonstrates effective decomposition-based approach for precise video concept transfer, with superior performance compared to existing methods.

Abstract: We propose a novel architecture UniTransfer, which introduces both spatial and diffusion timestep decomposition in a progressive paradigm, achieving precise and controllable video concept transfer. Specifically, in terms of spatial decomposition, we decouple videos into three key components: the foreground subject, the background, and the motion flow. Building upon this decomposed formulation, we further introduce a dual-to-single-stream DiT-based architecture for supporting fine-grained control over different components in the videos. We also introduce a self-supervised pretraining strategy based on random masking to enhance the decomposed representation learning from large-scale unlabeled video data. Inspired by the Chain-of-Thought reasoning paradigm, we further revisit the denoising diffusion process and propose a Chain-of-Prompt (CoP) mechanism to achieve the timestep decomposition. We decompose the denoising process into three stages of different granularity and leverage large language models (LLMs) for stage-specific instructions to guide the generation progressively. We also curate an animal-centric video dataset called OpenAnimal to facilitate the advancement and benchmarking of research in video concept transfer. Extensive experiments demonstrate that our method achieves high-quality and controllable video concept transfer across diverse reference images and scenes, surpassing existing baselines in both visual fidelity and editability. Web Page: https://yu-shaonian.github.io/UniTransfer-Web/

[190] VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception

Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, Yi Wang

Main category: cs.CV

TL;DR: VTTS enhances MLLMs’ reasoning through iterative perception during inference, mimicking human hierarchical attention by progressively refining focus on high-confidence regions.

DetailsMotivation: Existing methods for inducing reasoning in MLLMs are limited by static perception stages, failing to achieve human-level dynamic perception and understanding.

Method: VTTS employs Iterative Perception (ITP) mechanism with reinforcement learning and spatio-temporal supervision, supported by VTTS-80K dataset for iterative perception training.

Result: VideoChat-R1.5 model achieves over 5% average improvement across 15+ benchmarks including video conversation, reasoning, and spatio-temporal perception, outperforming Qwen2.5VL models.

Conclusion: VTTS effectively enhances MLLM reasoning through dynamic iterative perception, demonstrating strong generalization across diverse multimodal tasks.

Abstract: Inducing reasoning in multimodal large language models (MLLMs) is critical for achieving human-level perception and understanding. Existing methods mainly leverage LLM reasoning to analyze parsed visuals, often limited by static perception stages. This paper introduces Visual Test-Time Scaling (VTTS), a novel approach to enhance MLLMs’ reasoning via iterative perception during inference. VTTS mimics humans’ hierarchical attention by progressively refining focus on high-confidence spatio-temporal regions, guided by updated textual predictions. Specifically, VTTS employs an Iterative Perception (ITP) mechanism, incorporating reinforcement learning with spatio-temporal supervision to optimize reasoning. To support this paradigm, we also present VTTS-80K, a dataset tailored for iterative perception. These designs allow an MLLM to enhance its performance by increasing its perceptual compute. Extensive experiments validate VTTS’s effectiveness and generalization across diverse tasks and benchmarks. Our newly introduced VideoChat-R1.5 model has achieved remarkable improvements, with an average increase of over 5%, compared to robust baselines such as Qwen2.5VL-3B and -7B, across more than 15 benchmarks that encompass video conversation, video reasoning, and spatio-temporal perception.
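
The iterative-perception loop can be sketched at a pseudocode level: answer, localize the highest-confidence spatio-temporal region implied by the answer, zoom in, and re-answer until confidence stops improving. `mllm.answer` and `mllm.localize` are hypothetical interfaces, not a released API.

```python
# Hedged sketch of an iterative perception loop at inference time.
def visual_test_time_scaling(mllm, video, question, n_rounds=3):
    context = video
    answer, conf = mllm.answer(context, question)
    for _ in range(n_rounds):
        region = mllm.localize(context, question, answer)   # spatio-temporal crop
        context = region.crop(video)                        # refine the evidence
        new_answer, new_conf = mllm.answer(context, question)
        if new_conf <= conf:                                # stop when no gain
            break
        answer, conf = new_answer, new_conf
    return answer
```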

[191] Mammo-CLIP Dissect: A Framework for Analysing Mammography Concepts in Vision-Language Models

Suaiba Amina Salahuddin, Teresa Dorszewski, Marit Almenning Martiniussen, Tone Hovda, Antonio Portaluri, Solveig Thrun, Michael Kampffmeyer, Elisabeth Wetzer, Kristoffer Wickstrøm, Robert Jenssen

Main category: cs.CV

TL;DR: Mammo-CLIP Dissect is the first concept-based explainability framework for mammography DL models that uses a mammography-specific vision-language model to label neurons with interpretable concepts and quantify their alignment with clinical knowledge.

DetailsMotivation: Understanding what DL models learn is crucial for safe AI deployment in clinical settings. Previous work focused on pixel-based methods, but textual concepts may better reflect clinical reasoning.

Method: Leverages Mammo-CLIP as a “dissector” to label neurons with human-interpretable textual concepts and quantify alignment with domain knowledge. Investigates concept learning differences between general vs mammography-trained models, fine-tuning effects, and underrepresented concepts.

Result: Mammography-trained models capture more clinically relevant concepts and align better with radiologists’ workflows. Fine-tuning enhances capture of certain concepts (benign calcifications) but reduces coverage of others (density features), showing a specialization-generalization trade-off.

Conclusion: Mammo-CLIP Dissect provides insights into how CNNs capture mammography knowledge, revealing how domain-specific training and task adaptation shape concept learning. The framework enables systematic analysis of model interpretability in clinical AI.

Abstract: Understanding what deep learning (DL) models learn is essential for the safe deployment of artificial intelligence (AI) in clinical settings. While previous work has focused on pixel-based explainability methods, less attention has been paid to the textual concepts learned by these models, which may better reflect the reasoning used by clinicians. We introduce Mammo-CLIP Dissect, the first concept-based explainability framework for systematically dissecting DL vision models trained for mammography. Leveraging a mammography-specific vision-language model (Mammo-CLIP) as a “dissector,” our approach labels neurons at specified layers with human-interpretable textual concepts and quantifies their alignment to domain knowledge. Using Mammo-CLIP Dissect, we investigate three key questions: (1) how concept learning differs between DL vision models trained on general image datasets versus mammography-specific datasets; (2) how fine-tuning for downstream mammography tasks affects concept specialisation; and (3) which mammography-relevant concepts remain underrepresented. We show that models trained on mammography data capture more clinically relevant concepts and align more closely with radiologists’ workflows than models not trained on mammography data. Fine-tuning for task-specific classification enhances the capture of certain concept categories (e.g., benign calcifications) but can reduce coverage of others (e.g., density-related features), indicating a trade-off between specialisation and generalisation. Our findings show that Mammo-CLIP Dissect provides insights into how convolutional neural networks (CNNs) capture mammography-specific knowledge. By comparing models across training data and fine-tuning regimes, we reveal how domain-specific training and task-specific adaptation shape concept learning. Code and concept set are available: https://github.com/Suaiba/Mammo-CLIP-Dissect.

[192] MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning

Sicheng Tao, Jungang Li, Yibo Yan, Junyan Zhang, Yubo Gao, Hanqian Li, ShuHang Xun, Yuxuan Fan, Hong Chen, Jianxiang He, Xuming Hu

Main category: cs.CV

TL;DR: MOSS-ChatV is a reinforcement learning framework with DTW-based process reward that addresses process inconsistency in video reasoning for MLLMs by aligning reasoning traces with temporal references.

DetailsMotivation: Existing MLLMs exhibit process inconsistency where intermediate reasoning drifts from video dynamics even when final answers are correct, undermining interpretability and robustness in video reasoning.

Method: Introduces MOSS-ChatV with Dynamic Time Warping (DTW)-based process reward for reinforcement learning, identifies dynamic state prediction as key measure, and constructs MOSS-Video benchmark with annotated reasoning traces.

Result: MOSS-ChatV achieves 87.2% on MOSS-Video test, improves performance on MVBench and MMVU, and shows consistent gains across different architectures (Qwen2.5-VL, Phi-2) with more consistent reasoning traces.

Conclusion: The framework provides efficient process supervision without auxiliary reward models and demonstrates broad applicability for improving video reasoning consistency and stability in MLLMs.

Abstract: Video reasoning has emerged as a critical capability for multimodal large language models (MLLMs), requiring models to move beyond static perception toward coherent understanding of temporal dynamics in complex scenes. Yet existing MLLMs often exhibit process inconsistency, where intermediate reasoning drifts from video dynamics even when the final answer is correct, undermining interpretability and robustness. To address this issue, we introduce MOSS-ChatV, a reinforcement learning framework with a Dynamic Time Warping (DTW)-based process reward. This rule-based reward aligns reasoning traces with temporally grounded references, enabling efficient process supervision without auxiliary reward models. We further identify dynamic state prediction as a key measure of video reasoning and construct MOSS-Video, a benchmark with annotated reasoning traces, where the training split is used to fine-tune MOSS-ChatV and the held-out split is reserved for evaluation. MOSS-ChatV achieves 87.2% on MOSS-Video (test) and improves performance on general video benchmarks such as MVBench and MMVU. The framework consistently yields gains across different architectures, including Qwen2.5-VL and Phi-2, confirming its broad applicability. Evaluations with GPT-4o-as-judge further show that MOSS-ChatV produces more consistent and stable reasoning traces.
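
A textbook DTW distance mapped to a bounded reward illustrates the kind of rule-based process reward described here, assuming reasoning states are represented as vectors; the exact state representation and reward shaping in the paper may differ.

```python
# Generic DTW alignment cost between a predicted state sequence and a
# temporally grounded reference, converted to a reward in (0, 1].
import numpy as np

def dtw_reward(pred, ref, dist=lambda a, b: np.linalg.norm(a - b)):
    n, m = len(pred), len(ref)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(pred[i - 1], ref[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return 1.0 / (1.0 + D[n, m] / max(n, m))   # low alignment cost -> high reward
```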

[193] MotionFlow:Learning Implicit Motion Flow for Complex Camera Trajectory Control in Video Generation

Guojun Lei, Chi Wang, Yikai Wang, Hong Li, Ying Song, Weiwei Xu

Main category: cs.CV

TL;DR: A novel approach for generating videos guided by camera trajectories that integrates both camera and object motions by converting them into pixel motion, using stable diffusion networks and semantic object priors.

DetailsMotivation: Existing methods struggle with consistency and generalizability when handling both camera and object motions, often learning them separately which causes confusion about relative motion between camera and objects.

Method: Proposes integrating camera and object motions by converting them into pixel motion, using stable diffusion networks to learn reference motion maps relative to camera trajectories, combined with semantic object priors fed into an image-to-video network.

Result: Extensive experiments show the model significantly outperforms state-of-the-art methods by a large margin.

Conclusion: The proposed approach effectively generates videos that accurately follow designated camera trajectories while maintaining consistent object motions, solving the challenge of handling both types of motion simultaneously.

Abstract: Generating videos guided by camera trajectories poses significant challenges in achieving consistency and generalizability, particularly when both camera and object motions are present. Existing approaches often attempt to learn these motions separately, which may lead to confusion regarding the relative motion between the camera and the objects. To address this challenge, we propose a novel approach that integrates both camera and object motions by converting them into the motion of corresponding pixels. Utilizing a stable diffusion network, we effectively learn reference motion maps in relation to the specified camera trajectory. These maps, along with an extracted semantic object prior, are then fed into an image-to-video network to generate the desired video that can accurately follow the designated camera trajectory while maintaining consistent object motions. Extensive experiments verify that our model outperforms SOTA methods by a large margin.

[194] The Unwinnable Arms Race of AI Image Detection

Till Aczel, Lorenzo Vettor, Andreas Plesner, Roger Wattenhofer

Main category: cs.CV

TL;DR: This paper analyzes when discriminators are most disadvantaged in detecting AI-generated images, finding that intermediate-complexity datasets create the best conditions for detection while very simple or highly complex datasets reduce detectability.

DetailsMotivation: The rapid progress of image generative AI has blurred the boundary between synthetic and real images, creating an arms race between generators and discriminators. The research aims to understand the conditions under which discriminators are most disadvantaged.

Method: The study analyzes two key factors: data dimensionality and data complexity. Using Kolmogorov complexity as a measure of intrinsic dataset structure, the researchers examine how different levels of complexity affect the detectability of synthetic images.

Result: Both very simple and highly complex datasets reduce the detectability of synthetic images. Generators can learn simple datasets almost perfectly, while extreme diversity masks imperfections. Intermediate-complexity datasets create the most favorable conditions for detection.

Conclusion: The detectability of synthetic images depends on dataset complexity, with intermediate complexity providing the best conditions for discriminators to identify generator errors that remain visible when generators fail to fully capture the distribution.

Abstract: The rapid progress of image generative AI has blurred the boundary between synthetic and real images, fueling an arms race between generators and discriminators. This paper investigates the conditions under which discriminators are most disadvantaged in this competition. We analyze two key factors: data dimensionality and data complexity. While increased dimensionality often strengthens the discriminator's ability to detect subtle inconsistencies, complexity introduces a more nuanced effect. Using Kolmogorov complexity as a measure of intrinsic dataset structure, we show that both very simple and highly complex datasets reduce the detectability of synthetic images; generators can learn simple datasets almost perfectly, whereas extreme diversity masks imperfections. In contrast, intermediate-complexity datasets create the most favorable conditions for detection, as generators fail to fully capture the distribution and their errors remain visible.
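
Kolmogorov complexity itself is uncomputable, so work in this vein typically falls back on a compression-based proxy. The sketch below scores a dataset by its mean zlib compression ratio; whether the paper uses this particular estimator is an assumption.

```python
import zlib
import numpy as np

def complexity_proxy(images):
    """Mean zlib compression ratio as a practical stand-in for Kolmogorov
    complexity: lower ratio = simpler, more structured data."""
    ratios = []
    for img in images:
        raw = np.asarray(img, dtype=np.uint8).tobytes()
        ratios.append(len(zlib.compress(raw, level=9)) / len(raw))
    return float(np.mean(ratios))

rng = np.random.default_rng(0)
flat = np.zeros((8, 32, 32), dtype=np.uint8)                # trivially simple
noise = rng.integers(0, 256, (8, 32, 32), dtype=np.uint8)   # maximally diverse
print(complexity_proxy(flat), complexity_proxy(noise))      # near 0 vs. near 1
```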

[195] Can Less Precise Be More Reliable? A Systematic Evaluation of Quantization’s Impact on CLIP Beyond Accuracy

Aymen Bouguerra, Daniel Montoya, Alexandra Gomez-Villa, Fabio Arnez, Chokri Mraidha

Main category: cs.CV

TL;DR: Quantization of CLIP models can improve calibration for underconfident models and enhance OOD detection even when calibration degrades, with specific QAT methods enabling simultaneous gains in accuracy, calibration, and robustness.

DetailsMotivation: To understand the impact of quantization on CLIP's performance beyond accuracy, particularly on reliability metrics like calibration and OOD detection, which are crucial for efficient and reliable deployment.

Method: Large-scale evaluation of quantization on CLIP models, assessing in-distribution accuracy, calibration, and OOD detection metrics, including analysis of quantization-aware training (QAT) methods.

Result: Quantization improves calibration for underconfident pre-trained models but may degrade it for overconfident ones. OOD detection can improve even when calibration degrades. Specific QAT methods yield simultaneous gains in zero-shot accuracy, calibration, and OOD robustness.

Conclusion: Quantization offers benefits beyond efficiency, enabling improved reliability and robustness in CLIP deployment, challenging the strict efficiency-performance trade-off paradigm.

Abstract: The powerful zero-shot generalization capabilities of vision-language models (VLMs) like CLIP have enabled new paradigms for safety-related tasks such as out-of-distribution (OOD) detection. However, additional aspects crucial for the computationally efficient and reliable deployment of CLIP are still overlooked. In particular, the impact of quantization on CLIP’s performance beyond accuracy remains underexplored. This work presents a large-scale evaluation of quantization on CLIP models, assessing not only in-distribution accuracy but a comprehensive suite of reliability metrics and revealing counterintuitive results driven by pre-training source. We demonstrate that quantization consistently improves calibration for typically underconfident pre-trained models, while often degrading it for overconfident variants. Intriguingly, this degradation in calibration does not preclude gains in other reliability metrics; we find that OOD detection can still improve for these same poorly calibrated models. Furthermore, we identify specific quantization-aware training (QAT) methods that yield simultaneous gains in zero-shot accuracy, calibration, and OOD robustness, challenging the view of a strict efficiency-performance trade-off. These findings offer critical insights for navigating the multi-objective problem of deploying efficient, reliable, and robust VLMs by utilizing quantization beyond its conventional role.
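
For readers unfamiliar with the calibration metric at stake, the sketch below computes expected calibration error (ECE), the standard way to quantify the over- and underconfidence effects described above; comparing it before and after quantization is illustrative, not the paper's exact protocol.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Standard ECE: bin predictions by confidence, then average the
    |accuracy - confidence| gap weighted by bin occupancy."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Toy comparison of the same predictions at full precision vs. quantized.
hits = [1, 0, 1, 1]
print(expected_calibration_error([0.90, 0.80, 0.95, 0.60], hits))
print(expected_calibration_error([0.80, 0.70, 0.85, 0.55], hits))
```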

[196] TABLET: A Large-Scale Dataset for Robust Visual Table Understanding

Iñigo Alonso, Imanol Miranda, Eneko Agirre, Mirella Lapata

Main category: cs.CV

TL;DR: TABLET is a large-scale visual table understanding dataset with 4 million examples across 20 tasks, featuring real-world table visualizations and paired image-HTML representations to address limitations of synthetic benchmarks.

DetailsMotivation: Current VTU benchmarks use synthetic renderings lacking real-world complexity and visual diversity, and existing datasets offer fixed examples without access to underlying serialized data for reformulation.

Method: Created TABLET dataset with 4M examples from 2M unique tables (88% with original visualizations), including paired image-HTML representations, metadata, and provenance information linking to source datasets.

Result: Fine-tuning vision-language models like Qwen2.5-VL-7B on TABLET improves performance on both seen and unseen VTU tasks while increasing robustness on real-world table visualizations.

Conclusion: TABLET establishes a foundation for robust training and extensible evaluation of future VTU models by preserving original visualizations and maintaining example traceability in a unified large-scale collection.

Abstract: While table understanding increasingly relies on pixel-only settings where tables are processed as visual representations, current benchmarks predominantly use synthetic renderings that lack the complexity and visual diversity of real-world tables. Additionally, existing visual table understanding (VTU) datasets offer fixed examples with single visualizations and pre-defined instructions, providing no access to underlying serialized data for reformulation. We introduce TABLET, a large-scale VTU dataset with 4 million examples across 20 tasks, grounded in 2 million unique tables where 88% preserve original visualizations. Each example includes paired image-HTML representations, comprehensive metadata, and provenance information linking back to the source datasets. Fine-tuning vision-language models like Qwen2.5-VL-7B on TABLET improves performance on seen and unseen VTU tasks while increasing robustness on real-world table visualizations. By preserving original visualizations and maintaining example traceability in a unified large-scale collection, TABLET establishes a foundation for robust training and extensible evaluation of future VTU models.

[197] CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration

Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, Shayan Baghayi Nejad, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

Main category: cs.CV

TL;DR: CARINOX is a unified framework that combines noise optimization and exploration with principled reward selection to improve compositional alignment in text-to-image diffusion models, achieving significant performance gains over existing methods.

DetailsMotivation: Text-to-image diffusion models often fail to achieve compositional alignment for complex prompts describing object relationships, attributes, or spatial arrangements. Existing inference-time approaches have limitations: optimization can stall due to poor initialization, while exploration requires too many samples. Neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality.

Method: CARINOX combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. It addresses the limitations of using optimization or exploration alone by creating a unified framework that leverages both approaches effectively.

Result: CARINOX raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark, consistently outperforming state-of-the-art optimization and exploration-based methods across all major categories while preserving image quality and diversity.

Conclusion: The proposed CARINOX framework effectively addresses compositional alignment challenges in text-to-image generation by combining optimization and exploration with principled reward selection, demonstrating significant improvements over existing methods while maintaining image quality and diversity.

Abstract: Text-to-image diffusion models, such as Stable Diffusion, can produce high-quality and diverse images but often fail to achieve compositional alignment, particularly when prompts describe complex object relationships, attributes, or spatial arrangements. Recent inference-time approaches address this by optimizing or exploring the initial noise under the guidance of reward functions that score text-image alignment without requiring model fine-tuning. While promising, each strategy has intrinsic limitations when used alone: optimization can stall due to poor initialization or unfavorable search trajectories, whereas exploration may require a prohibitively large number of samples to locate a satisfactory output. Our analysis further shows that neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality, leading to weak or inconsistent guidance. To overcome these challenges, we present Category-Aware Reward-based Initial Noise Optimization and Exploration (CARINOX), a unified framework that combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. Evaluations on two complementary benchmarks covering diverse compositional challenges show that CARINOX raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark, consistently outperforming state-of-the-art optimization and exploration-based methods across all major categories, while preserving image quality and diversity. The project page is available at https://amirkasaei.com/carinox/.
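
A minimal sketch of the two strategies the framework unifies, under the assumption that a scalar reward can score any candidate initial noise (in practice by denoising it and scoring the resulting image): exploration samples several seeds, and optimization locally refines the best one. The gradient-free hill climbing below is a stand-in for the paper's optimizer.

```python
import torch

def search_initial_noise(reward_fn, shape, n_explore=8, n_opt=20, step=0.05):
    """Exploration (best of several random seeds) followed by local
    refinement of the winner in noise space."""
    candidates = [torch.randn(shape) for _ in range(n_explore)]
    best = max(candidates, key=reward_fn)
    best_score = reward_fn(best)
    for _ in range(n_opt):                      # gradient-free hill climbing
        proposal = best + step * torch.randn(shape)
        score = reward_fn(proposal)
        if score > best_score:
            best, best_score = proposal, score
    return best, best_score

# Stand-in reward; a real one would decode the noise and score text-image alignment.
toy_reward = lambda z: -abs(z.mean().item() - 0.5)
noise, score = search_initial_noise(toy_reward, (4, 64, 64))
```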

[198] Learning Conformal Explainers for Image Classifiers

Amr Alkhatib, Stephanie Lowry

Main category: cs.CV

TL;DR: A conformal prediction-based approach for controlling explanation fidelity in feature attribution methods, identifying salient features that preserve model predictions without requiring ground-truth explanations.

DetailsMotivation: Feature attribution methods vary in robustness and may not faithfully reflect the underlying model's reasoning, requiring better fidelity control.

Method: Proposes four conformity functions to quantify explanation fidelity, identifies salient feature subsets that preserve predictions regardless of excluded features, using conformal prediction without ground-truth calibration.

Result: Among the five explainers evaluated on six image datasets, FastSHAP consistently outperforms competing methods in fidelity and informational efficiency; conformity measures based on super-pixels are more effective than pixel-wise ones.

Conclusion: The conformal prediction approach successfully enables controlled explanation fidelity, with FastSHAP and super-pixel measures proving most effective for robust feature attribution.

Abstract: Feature attribution methods are widely used for explaining image-based predictions, as they provide feature-level insights that can be intuitively visualized. However, such explanations often vary in their robustness and may fail to faithfully reflect the reasoning of the underlying black-box model. To address these limitations, we propose a novel conformal prediction-based approach that enables users to directly control the fidelity of the generated explanations. The method identifies a subset of salient features that is sufficient to preserve the model’s prediction, regardless of the information carried by the excluded features, and without demanding access to ground-truth explanations for calibration. Four conformity functions are proposed to quantify the extent to which explanations conform to the model’s predictions. The approach is empirically evaluated using five explainers across six image datasets. The empirical results demonstrate that FastSHAP consistently outperforms the competing methods in terms of both fidelity and informational efficiency, the latter measured by the size of the explanation regions. Furthermore, the results reveal that conformity measures based on super-pixels are more effective than their pixel-wise counterparts.
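
The mechanics of the conformal step can be sketched as follows: conformity scores from a calibration set yield a threshold, and at test time the salient set grows (most-attributed features first) until keeping only those features is conformal. The conformity function used here, the class probability retained under masking, is one plausible instance of the four the paper proposes, not necessarily any of them.

```python
import numpy as np

def calibrate_threshold(cal_scores, alpha=0.1):
    """Split-conformal lower quantile: a new conformity score exceeds this
    threshold with probability about 1 - alpha (finite-sample corrected)."""
    s = np.sort(np.asarray(cal_scores))
    k = max(int(np.floor(alpha * (len(s) + 1))) - 1, 0)
    return s[k]

def salient_subset(attributions, conformity_fn, threshold):
    """Add features in decreasing attribution order until the retained
    subset alone is sufficient to be conformal with the prediction."""
    order = np.argsort(attributions)[::-1]
    for k in range(1, len(order) + 1):
        if conformity_fn(order[:k]) >= threshold:
            return order[:k]
    return order

rng = np.random.default_rng(0)
tau = calibrate_threshold(rng.uniform(0.6, 1.0, 200), alpha=0.1)
attr = rng.uniform(size=16)                           # toy map over 16 super-pixels
toy_conformity = lambda keep: 0.5 + 0.04 * len(keep)  # stand-in for model output
print(salient_subset(attr, toy_conformity, tau))
```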

[199] Sigma: Semantically Informative Pre-training for Skeleton-based Sign Language Understanding

Muxin Pu, Mei Kuan Lim, Chun Yong Chong, Chen Change Loy

Main category: cs.CV

TL;DR: Sigma is a unified skeleton-based sign language understanding framework that addresses semantic grounding, local-global balance, and cross-modal learning challenges through sign-aware early fusion, hierarchical alignment learning, and unified pre-training.

DetailsMotivation: Current skeleton-based SLU methods face three key limitations: weak semantic grounding (struggling to relate motion patterns to linguistic meaning), imbalance between local details and global context, and inefficient cross-modal learning for semantically aligned representations.

Method: Proposes Sigma framework with: 1) sign-aware early fusion for deep visual-textual interaction, 2) hierarchical alignment learning to capture both fine-grained details and high-level semantics, 3) unified pre-training combining contrastive learning, text matching, and language modeling.

Result: Achieves new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation across multiple benchmarks spanning different sign and spoken languages.

Conclusion: Demonstrates the effectiveness of semantically informative pre-training and proves skeletal data as a viable stand-alone solution for SLU, showing significant impact through the proposed unified framework.

Abstract: Pre-training has proven effective for learning transferable features in sign language understanding (SLU) tasks. Recently, skeleton-based methods have gained increasing attention because they can robustly handle variations in subjects and backgrounds without being affected by appearance or environmental factors. Current SLU methods continue to face three key limitations: 1) weak semantic grounding, as models often capture low-level motion patterns from skeletal data but struggle to relate them to linguistic meaning; 2) imbalance between local details and global context, with models either focusing too narrowly on fine-grained cues or overlooking them for broader context; and 3) inefficient cross-modal learning, as constructing semantically aligned representations across modalities remains difficult. To address these, we propose Sigma, a unified skeleton-based SLU framework featuring: 1) a sign-aware early fusion mechanism that facilitates deep interaction between visual and textual modalities, enriching visual features with linguistic context; 2) a hierarchical alignment learning strategy that jointly maximises agreements across different levels of paired features from different modalities, effectively capturing both fine-grained details and high-level semantic relationships; and 3) a unified pre-training framework that combines contrastive learning, text matching and language modelling to promote semantic consistency and generalisation. Sigma achieves new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation on multiple benchmarks spanning different sign and spoken languages, demonstrating the impact of semantically informative pre-training and the effectiveness of skeletal data as a stand-alone solution for SLU.

[200] Evaluating the Evaluators: Metrics for Compositional Text-to-Image Generation

Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

Main category: cs.CV

TL;DR: This paper presents a comprehensive study evaluating how well automated metrics for text-image generation align with human judgments across various compositional challenges.

DetailsMotivation: Current evaluation in text-image generation relies heavily on automated metrics that are often adopted by convention rather than validated against human judgment, making it critical to understand how well these metrics reflect human preferences.

Method: The authors conducted a broad study examining widely used metrics for compositional text-image evaluation, going beyond simple correlation to analyze their behavior across diverse compositional challenges and comparing different metric families’ alignment with human judgments.

Result: No single metric performs consistently across tasks; performance varies with the type of compositional problem. VQA-based metrics are not uniformly superior, while certain embedding-based metrics prove stronger in specific cases. Image-only metrics contribute little to compositional evaluation.

Conclusion: The findings emphasize the importance of careful and transparent metric selection for trustworthy evaluation and their use as reward models in generation, as metric performance depends heavily on the specific compositional challenge being evaluated.

Abstract: Text-image generation has advanced rapidly, but assessing whether outputs truly capture the objects, attributes, and relations described in prompts remains a central challenge. Evaluation in this space relies heavily on automated metrics, yet these are often adopted by convention or popularity rather than validated against human judgment. Because evaluation and reported progress in the field depend directly on these metrics, it is critical to understand how well they reflect human preferences. To address this, we present a broad study of widely used metrics for compositional text-image evaluation. Our analysis goes beyond simple correlation, examining their behavior across diverse compositional challenges and comparing how different metric families align with human judgments. The results show that no single metric performs consistently across tasks: performance varies with the type of compositional problem. Notably, VQA-based metrics, though popular, are not uniformly superior, while certain embedding-based metrics prove stronger in specific cases. Image-only metrics, as expected, contribute little to compositional evaluation, as they are designed for perceptual quality rather than alignment. These findings underscore the importance of careful and transparent metric selection, both for trustworthy evaluation and for their use as reward models in generation. Project page is available at https://amirkasaei.com/eval-the-evals/.

[201] SlideMamba: Entropy-Based Adaptive Fusion of GNN and Mamba for Enhanced Representation Learning in Digital Pathology

Shakib Khan, Fariba Dambandkhameneh, Nazim Shaikh, Yao Nie, Raghavan Venugopal, Xiao Li

Main category: cs.CV

TL;DR: SlideMamba integrates Mamba architecture with Graph Neural Networks using entropy-based adaptive fusion for Whole Slide Image analysis, achieving superior performance in predicting gene fusion/mutation status compared to existing methods.

DetailsMotivation: To develop a generalizable deep learning framework that captures both local spatial relationships and long-range contextual dependencies in Whole Slide Images for enhanced computational pathology analysis.

Method: Combines Mamba modules (for long-range global dependencies) with GNNs (for fine-grained short-range spatial interactions) using an entropy-based confidence weighting mechanism that dynamically balances contributions based on prediction confidence.

Result: Achieved PRAUC of 0.751 ± 0.05, outperforming MIL (0.491), Trans-MIL (0.39), Mamba-only (0.664), GNN-only (0.748), and GAT-Mamba (0.703). Also showed competitive ROC AUC (0.738), sensitivity (0.662), and specificity (0.725).

Conclusion: The integrated Mamba-GNN architecture with adaptive fusion strategy demonstrates strong performance for spatially-resolved predictive modeling in computational pathology, with promising potential for various clinical and biological tasks.

Abstract: Advances in computational pathology increasingly rely on extracting meaningful representations from Whole Slide Images (WSIs) to support various clinical and biological tasks. In this study, we propose a generalizable deep learning framework that integrates the Mamba architecture with Graph Neural Networks (GNNs) for enhanced WSI analysis. Our method is designed to capture both local spatial relationships and long-range contextual dependencies, offering a flexible architecture for digital pathology analysis. Mamba modules excel at capturing long-range global dependencies, while GNNs emphasize fine-grained short-range spatial interactions. To effectively combine these complementary signals, we introduce an adaptive fusion strategy that uses an entropy-based confidence weighting mechanism. This approach dynamically balances contributions from both branches by assigning higher weight to the branch with more confident (lower-entropy) predictions, depending on the contextual importance of local versus global information for different downstream tasks. We demonstrate the utility of our approach on a representative task: predicting gene fusion and mutation status from WSIs. Our framework, SlideMamba, achieves an area under the precision recall curve (PRAUC) of 0.751 ± 0.05, outperforming MIL (0.491 ± 0.042), Trans-MIL (0.39 ± 0.017), Mamba-only (0.664 ± 0.063), GNN-only (0.748 ± 0.091), and a similar prior work, GAT-Mamba (0.703 ± 0.075). SlideMamba also achieves competitive results across ROC AUC (0.738 ± 0.055), sensitivity (0.662 ± 0.083), and specificity (0.725 ± 0.094). These results highlight the strength of the integrated architecture, enhanced by the proposed entropy-based adaptive fusion strategy, and suggest promising potential for spatially resolved predictive modeling tasks in computational pathology.
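
The entropy-based fusion rule is simple enough to state in a few lines: each branch's predictive entropy is converted to an inverse-entropy confidence, normalized, and used to weight the branch outputs. This is a sketch of the mechanism as described, not the released implementation.

```python
import torch
import torch.nn.functional as F

def entropy_weighted_fusion(logits_mamba, logits_gnn, eps=1e-8):
    """Adaptive fusion: the lower-entropy (more confident) branch gets
    the larger weight, per sample."""
    def entropy(logits):
        p = F.softmax(logits, dim=-1)
        return -(p * (p + eps).log()).sum(dim=-1, keepdim=True)
    w_m = 1.0 / (entropy(logits_mamba) + eps)
    w_g = 1.0 / (entropy(logits_gnn) + eps)
    z = w_m + w_g
    return (w_m / z) * logits_mamba + (w_g / z) * logits_gnn

# Toy check: the confident branch dominates the fused prediction.
confident = torch.tensor([[4.0, -4.0]])   # low entropy
uncertain = torch.tensor([[0.1, -0.1]])   # near-uniform
print(entropy_weighted_fusion(confident, uncertain))
```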

[202] Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets

Team Hunyuan3D, :, Bowen Zhang, Chunchao Guo, Haolin Liu, Hongyu Yan, Huiwen Shi, Jingwei Huang, Junlin Yu, Kunhong Li, Linus, Penghao Wang, Qingxiang Lin, Sicong Liu, Xianghui Yang, Yixuan Tang, Yunfei Zhao, Zeqiang Lai, Zhihao Liang, Zibo Zhao

Main category: cs.CV

TL;DR: Hunyuan3D-Omni is a unified framework for fine-grained, controllable 3D asset generation that accepts multiple input modalities beyond just images and text.

DetailsMotivation: Most 3D generative models rely primarily on image or text conditioning and lack fine-grained cross-modal controls, limiting controllability and practical adoption.

Method: Built on Hunyuan3D 2.1, the framework accepts point clouds, voxels, bounding boxes, and skeletal pose priors as conditioning signals. It uses a single cross-modal architecture instead of separate heads per modality, with progressive difficulty-aware sampling that biases toward harder signals.

Result: The additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness for production workflows.

Conclusion: Hunyuan3D-Omni enables precise control over geometry, topology, and pose through unified multi-modal conditioning, addressing limitations of current 3D generative models.

Abstract: Recent advances in 3D-native generative models have accelerated asset creation for games, film, and design. However, most methods still rely primarily on image or text conditioning and lack fine-grained, cross-modal controls, which limits controllability and practical adoption. To address this gap, we present Hunyuan3D-Omni, a unified framework for fine-grained, controllable 3D asset generation built on Hunyuan3D 2.1. In addition to images, Hunyuan3D-Omni accepts point clouds, voxels, bounding boxes, and skeletal pose priors as conditioning signals, enabling precise control over geometry, topology, and pose. Instead of separate heads for each modality, our model unifies all signals in a single cross-modal architecture. We train with a progressive, difficulty-aware sampling strategy that selects one control modality per example and biases sampling toward harder signals (e.g., skeletal pose) while downweighting easier ones (e.g., point clouds), encouraging robust multi-modal fusion and graceful handling of missing inputs. Experiments show that these additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness for production workflows.
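
The progressive, difficulty-aware sampling can be pictured as weighted selection of one control signal per example. The weights below are invented for illustration; only the ordering (harder signals sampled more often) follows the paper.

```python
import random

# Hypothetical weights biased toward harder control signals.
MODALITY_WEIGHTS = {
    "skeletal_pose": 0.40,   # hardest, sampled most often
    "bounding_box": 0.25,
    "voxel": 0.20,
    "point_cloud": 0.15,     # easiest, downweighted
}

def sample_control_modality(example_controls, weights=MODALITY_WEIGHTS):
    """Pick one available conditioning modality for a training example;
    returns None when no control is present (graceful missing-input case)."""
    available = [m for m in weights if m in example_controls]
    if not available:
        return None
    return random.choices(available,
                          weights=[weights[m] for m in available], k=1)[0]

example = {"point_cloud": "pc_tensor", "skeletal_pose": "pose_tensor"}
print(sample_control_modality(example))
```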

[203] Learning to Look: Cognitive Attention Alignment with Vision-Language Models

Ryan L. Yang, Dipkamal Bhusal, Nidhi Rastogi

Main category: cs.CV

TL;DR: A framework using vision-language models to automatically generate semantic attention maps via natural language prompts, which aligns CNN attention through an auxiliary loss to improve model reliability and reduce shortcut learning without manual annotations.

DetailsMotivation: CNNs often exploit superficial correlations (cheating) rather than learning meaningful features, raising concerns about their decision-making reliability. Existing methods require expert annotations which are labor-intensive and not scalable.

Method: Leverage vision-language models to automatically generate semantic attention maps using natural language prompts. Introduce an auxiliary loss that aligns CNN attention with these language-guided maps to promote more reliable decision-making.

Result: Achieves state-of-the-art performance on ColoredMNIST and remains competitive with annotation-heavy baselines on DecoyMNIST. Demonstrates improved generalization, reduced shortcut reliance, and model attention that better reflects human intuition.

Conclusion: The proposed framework provides a scalable solution for improving CNN reliability by automatically generating semantic supervision through vision-language models, eliminating the need for manual annotations while achieving competitive performance.

Abstract: Convolutional Neural Networks (CNNs) frequently “cheat” by exploiting superficial correlations, raising concerns about whether they make predictions for the right reasons. Inspired by cognitive science, which highlights the role of attention in robust human perception, recent methods have sought to guide model attention using concept-based supervision and explanation regularization. However, these techniques depend on labor-intensive, expert-provided annotations, limiting their scalability. We propose a scalable framework that leverages vision-language models to automatically generate semantic attention maps using natural language prompts. By introducing an auxiliary loss that aligns CNN attention with these language-guided maps, our approach promotes more reliable and cognitively plausible decision-making without manual annotation. Experiments on challenging datasets, ColoredMNIST and DecoyMNIST, show that our method achieves state-of-the-art performance on ColoredMNIST and remains competitive with annotation-heavy baselines on DecoyMNIST, demonstrating improved generalization, reduced shortcut reliance, and model attention that better reflects human intuition.
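
A plausible form of the auxiliary alignment objective, assuming both the CNN attention map and the language-guided map are available as spatial tensors: normalize each to a distribution over locations and penalize their divergence alongside the task loss. The KL formulation and the weighting constant are assumptions, not the paper's stated choice.

```python
import torch
import torch.nn.functional as F

def attention_alignment_loss(cnn_maps, vlm_maps):
    """KL divergence between the CNN's spatial attention and the
    VLM-generated semantic map, both treated as distributions over pixels."""
    log_p_cnn = F.log_softmax(cnn_maps.flatten(1), dim=1)
    p_vlm = F.softmax(vlm_maps.flatten(1), dim=1)
    return F.kl_div(log_p_cnn, p_vlm, reduction="batchmean")

def total_loss(logits, labels, cnn_maps, vlm_maps, lam=0.5):
    """Task loss plus the weighted auxiliary alignment term (lam assumed)."""
    return F.cross_entropy(logits, labels) + lam * attention_alignment_loss(
        cnn_maps, vlm_maps)
```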

[204] Decipher-MR: A Vision-Language Foundation Model for 3D MRI Representations

Zhijian Yang, Noel DSouza, Istvan Megyeri, Xiaojian Xu, Amin Honarmandi Shandiz, Farzin Haddadpour, Krisztian Koos, Laszlo Rusko, Emanuele Valeriano, Bharadwaj Swaninathan, Lei Wu, Parminder Bhatia, Taha Kass-Hout, Erhan Bas

Main category: cs.CV

TL;DR: Decipher-MR is a 3D MRI-specific vision-language foundation model trained on 200,000 MRI series from 22,000+ studies, enabling robust medical image analysis across diverse clinical tasks with minimal computational overhead.

DetailsMotivation: MRI complexity and heterogeneity pose challenges for automated analysis, and existing foundation models have limited application to MRI due to data scarcity and narrow anatomical focus.

Method: Integrates self-supervised vision learning with report-guided text supervision, using a modular design with frozen pretrained encoder and lightweight task-specific decoders for efficient adaptation.

Result: Demonstrates consistent performance gains over existing foundation models and task-specific approaches across disease classification, demographic prediction, anatomical localization, and cross-modal retrieval benchmarks.

Conclusion: Establishes Decipher-MR as a scalable and versatile foundation for MRI-based AI, facilitating efficient development across clinical and research domains.

Abstract: Magnetic Resonance Imaging (MRI) is a critical medical imaging modality in clinical diagnosis and research, yet its complexity and heterogeneity pose challenges for automated analysis, particularly in scalable and generalizable machine learning applications. While foundation models have revolutionized natural language and vision tasks, their application to MRI remains limited due to data scarcity and narrow anatomical focus. In this work, we present Decipher-MR, a 3D MRI-specific vision-language foundation model trained on a large-scale dataset comprising 200,000 MRI series from over 22,000 studies spanning diverse anatomical regions, sequences, and pathologies. Decipher-MR integrates self-supervised vision learning with report-guided text supervision to build robust, generalizable representations, enabling effective adaptation across broad applications. To enable robust and diverse clinical tasks with minimal computational overhead, Decipher-MR supports a modular design that enables tuning of lightweight, task-specific decoders attached to a frozen pretrained encoder. Following this setting, we evaluate Decipher-MR across diverse benchmarks including disease classification, demographic prediction, anatomical localization, and cross-modal retrieval, demonstrating consistent performance gains over existing foundation models and task-specific approaches. Our results establish Decipher-MR as a scalable and versatile foundation for MRI-based AI, facilitating efficient development across clinical and research domains.
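
The modular adaptation pattern described, a frozen pretrained encoder with a lightweight task-specific decoder, looks roughly like the sketch below; the layer sizes and the stand-in encoder are placeholders, not Decipher-MR's architecture.

```python
import torch
import torch.nn as nn

class TaskHead(nn.Module):
    """Lightweight decoder trained per downstream task."""
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.proj = nn.Linear(feat_dim, n_classes)

    def forward(self, feats):
        return self.proj(feats)

def trainable_params(encoder, head):
    """Freeze the pretrained encoder; only the head's parameters train."""
    for p in encoder.parameters():
        p.requires_grad_(False)
    return head.parameters()

encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 256))  # stand-in
head = TaskHead(256, n_classes=5)
optimizer = torch.optim.AdamW(trainable_params(encoder, head), lr=1e-4)
```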

[205] Instruction-tuned Self-Questioning Framework for Multimodal Reasoning

You-Won Jang, Yu-Jung Heo, Jaeseok Kim, Minsu Lee, Du-Seong Chang, Byoung-Tak Zhang

Main category: cs.CV

TL;DR: SQ-InstructBLIP improves vision-language reasoning by generating image-aware sub-questions and answers through a three-component architecture (Questioner, Answerer, Reasoner) to overcome limitations of black-box LLMs in multi-step reasoning tasks.

DetailsMotivation: Current LLM-based approaches struggle with multi-step reasoning in vision-language tasks due to inability to access fine-grained visual content and the black-box nature of LLMs, limiting reproducibility and visual understanding.

Method: Proposes SQ-InstructBLIP with three components sharing the same architecture: Questioner generates sub-questions, Answerer provides sub-answers, and Reasoner performs final reasoning using the generated sub-question information for VQA tasks.

Result: Experiments show SQ-InstructBLIP performs more accurate reasoning than previous works by using generated sub-questions as additional information in VQA tasks.

Conclusion: The proposed iterative sub-question generation approach effectively enhances multi-step reasoning capabilities in vision-language understanding tasks.

Abstract: The field of vision-language understanding has been actively researched in recent years, thanks to the development of Large Language Models (LLMs). However, it still struggles with problems requiring multi-step reasoning, even for very simple questions. Recent studies adopt LLMs to tackle this problem by iteratively generating sub-questions and answers. However, this approach has drawbacks: 1) the fine-grained visual content of images is unavailable to LLMs that cannot read visual information, and 2) the internal mechanisms of black-box LLMs are inaccessible and difficult to reproduce. To solve these problems, we propose SQ (Self-Questioning)-InstructBLIP, which improves inference performance by iteratively generating image-aware, informative sub-questions and sub-answers. SQ-InstructBLIP consists of a Questioner, an Answerer, and a Reasoner that share the same architecture. The Questioner and Answerer generate sub-questions and sub-answers to help infer the main question, and the Reasoner performs reasoning on the main question considering the generated sub-question information. Our experiments show that SQ-InstructBLIP, which uses the generated sub-questions as additional information when solving the VQA task, performs more accurate reasoning than previous works.

[206] Hallucination as an Upper Bound: A New Perspective on Text-to-Image Evaluation

Seyed Amir Kasaei, Mohammad Hossein Rohban

Main category: cs.CV

TL;DR: The paper proposes a new framework for defining and categorizing hallucination in text-to-image (T2I) generative models, arguing that current evaluations focus too narrowly on alignment while ignoring bias-driven deviations beyond the prompt.

DetailsMotivation: Existing evaluations of T2I models mainly focus on alignment (checking if prompt-specified elements appear) but overlook what models generate beyond the prompt. Hallucination in T2I models has not been clearly framed compared to language and vision-language models.

Method: The authors propose defining hallucination in T2I as bias-driven deviations and introduce a taxonomy with three categories: attribute, relation, and object hallucinations.

Result: This framing establishes an upper bound for evaluation and reveals hidden biases in T2I models.

Conclusion: The proposed taxonomy provides a foundation for richer assessment of T2I models by systematically addressing bias-driven hallucinations beyond simple prompt alignment.

Abstract: In language and vision-language models, hallucination is broadly understood as content generated from a model’s prior knowledge or biases rather than from the given input. While this phenomenon has been studied in those domains, it has not been clearly framed for text-to-image (T2I) generative models. Existing evaluations mainly focus on alignment, checking whether prompt-specified elements appear, but overlook what the model generates beyond the prompt. We argue for defining hallucination in T2I as bias-driven deviations and propose a taxonomy with three categories: attribute, relation, and object hallucinations. This framing introduces an upper bound for evaluation and surfaces hidden biases, providing a foundation for richer assessment of T2I models.

[207] Every Subtlety Counts: Fine-grained Person Independence Micro-Action Recognition via Distributionally Robust Optimization

Feng-Qi Cui, Jinyang Huang, Anyang Tong, Ziyu Jia, Jie Zhang, Zhi Liu, Dan Guo, Jianwei Lu, Meng Wang

Main category: cs.CV

TL;DR: A Person Independence Universal Micro-action Recognition Framework using Distributionally Robust Optimization to handle inter-person variability in micro-action recognition, with temporal-frequency alignment and group-invariant regularization components.

DetailsMotivation: Existing micro-action recognition methods fail in real-world scenarios due to inter-person variability causing the same action to manifest differently, hindering robust generalization.

Method: Proposes a framework with two plug-and-play components: 1) Temporal-Frequency Alignment Module with dual-branch design (temporal branch with Wasserstein-regularized alignment and frequency branch with variance-guided perturbations), 2) Group-Invariant Regularized Loss that partitions samples into pseudo-groups and up-weights boundary cases.

Result: Outperforms existing methods on MA-52 dataset in both accuracy and robustness, achieving stable generalization under fine-grained conditions.

Conclusion: The framework effectively addresses person-specific variations in micro-action recognition through distributionally robust optimization principles, enabling robust generalization across different individuals.

Abstract: Micro-action Recognition is vital for psychological assessment and human-computer interaction. However, existing methods often fail in real-world scenarios because inter-person variability causes the same action to manifest differently, hindering robust generalization. To address this, we propose the Person Independence Universal Micro-action Recognition Framework, which integrates Distributionally Robust Optimization principles to learn person-agnostic representations. Our framework contains two plug-and-play components operating at the feature and loss levels. At the feature level, the Temporal-Frequency Alignment Module normalizes person-specific motion characteristics with a dual-branch design: the temporal branch applies Wasserstein-regularized alignment to stabilize dynamic trajectories, while the frequency branch introduces variance-guided perturbations to enhance robustness against person-specific spectral differences. A consistency-driven fusion mechanism integrates both branches. At the loss level, the Group-Invariant Regularized Loss partitions samples into pseudo-groups to simulate unseen person-specific distributions. By up-weighting boundary cases and regularizing subgroup variance, it forces the model to generalize beyond easy or frequent samples, thus enhancing robustness to difficult variations. Experiments on the large-scale MA-52 dataset demonstrate that our framework outperforms existing methods in both accuracy and robustness, achieving stable generalization under fine-grained conditions.
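
In DRO terms, the loss-level component can be sketched as follows: per-sample losses are aggregated within pseudo-groups, the worst group is up-weighted, and variance across group means is penalized. The exact weighting scheme here is an assumption, not the paper's formula.

```python
import torch

def group_invariant_loss(per_sample_loss, group_ids, var_weight=0.1):
    """Emphasize the hardest pseudo-group and regularize the spread of
    group-mean losses so no person-like subgroup is left behind."""
    groups = torch.unique(group_ids)
    means = torch.stack([per_sample_loss[group_ids == g].mean() for g in groups])
    return means.max() + var_weight * means.var(unbiased=False)

losses = torch.tensor([0.2, 0.9, 0.4, 1.1, 0.3])
gids = torch.tensor([0, 1, 0, 1, 2])   # pseudo-group assignments
print(group_invariant_loss(losses, gids))
```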

[208] Dense Semantic Matching with VGGT Prior

Songlin Yang, Tianyi Wei, Yushi Lan, Zeqi Xiao, Anyi Rao, Xingang Pan

Main category: cs.CV

TL;DR: This paper proposes a novel approach for semantic matching that addresses geometric ambiguity and nearest-neighbor limitations by adapting 3D geometric foundation model VGGT for cross-instance semantic matching under data scarcity.

DetailsMotivation: Existing semantic matching methods suffer from geometric ambiguity (failing to disambiguate symmetric structures using 2D features) and nearest-neighbor rule limitations (ignoring cross-image invisibility and manifold preservation).

Method: The approach adapts VGGT by: (i) reusing early feature stages, fine-tuning later ones, and adding a semantic head for bidirectional correspondences; (ii) implementing cycle-consistent training, synthetic data augmentation, and progressive training with aliasing artifact mitigation to handle data scarcity.

Result: Extensive experiments show the approach achieves superior geometry awareness, matching reliability, and manifold preservation, outperforming previous baselines.

Conclusion: The proposed method successfully addresses key limitations in semantic matching by leveraging 3D geometric foundation models and innovative adaptation strategies, demonstrating significant improvements in matching performance.

Abstract: Semantic matching aims to establish pixel-level correspondences between instances of the same category and represents a fundamental task in computer vision. Existing approaches suffer from two limitations: (i) Geometric Ambiguity: Their reliance on 2D foundation model features (e.g., Stable Diffusion, DINO) often fails to disambiguate symmetric structures, requiring extra fine-tuning yet lacking generalization; (ii) Nearest-Neighbor Rule: Their pixel-wise matching ignores cross-image invisibility and neglects manifold preservation. These challenges call for geometry-aware pixel descriptors and holistic dense correspondence mechanisms. Inspired by recent advances in 3D geometric foundation models, we turn to VGGT, which provides geometry-grounded features and holistic dense matching capabilities well aligned with these needs. However, directly transferring VGGT is challenging, as it was originally designed for geometry matching within cross views of a single instance, misaligned with cross-instance semantic matching, and further hindered by the scarcity of dense semantic annotations. To address this, we propose an approach that (i) retains VGGT’s intrinsic strengths by reusing early feature stages, fine-tuning later ones, and adding a semantic head for bidirectional correspondences; and (ii) adapts VGGT to the semantic matching scenario under data scarcity through cycle-consistent training strategy, synthetic data augmentation, and progressive training recipe with aliasing artifact mitigation. Extensive experiments demonstrate that our approach achieves superior geometry awareness, matching reliability, and manifold preservation, outperforming previous baselines.
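
The cycle-consistent training signal can be written compactly: a pixel mapped from image A to B by the forward correspondences and back by the reverse ones should land where it started. The nearest-neighbor lookup below is a simplification of the (typically differentiable) warping used in practice, and the tensor layout is an assumption.

```python
import torch

def cycle_consistency_loss(flow_ab, flow_ba):
    """flow_* hold absolute target coordinates, shape (H, W, 2) as (x, y);
    penalize the round-trip displacement A -> B -> A."""
    H, W, _ = flow_ab.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float()        # identity coordinates
    tgt = flow_ab.round().long()                        # where each pixel lands in B
    tgt[..., 0].clamp_(0, W - 1)
    tgt[..., 1].clamp_(0, H - 1)
    back = flow_ba[tgt[..., 1], tgt[..., 0]]            # mapped back toward A
    return (back - grid).norm(dim=-1).mean()
```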

[209] MedVSR: Medical Video Super-Resolution with Cross State-Space Propagation

Xinyu Liu, Guolei Sun, Cheng Wang, Yixuan Yuan, Ender Konukoglu

Main category: cs.CV

TL;DR: MedVSR is a novel video super-resolution framework specifically designed for medical videos that addresses challenges like camera shake, noise, and abrupt frame transitions through cross state-space propagation and inner state-space reconstruction modules.

DetailsMotivation: High-resolution medical videos are crucial for accurate diagnosis but hard to acquire due to hardware limitations. Current VSR models struggle with medical video challenges including camera shake, noise, abrupt frame transitions, and tend to introduce artifacts that can mislead doctors.

Method: Proposes MedVSR with two key components: Cross State-Space Propagation (CSSP) for precise alignment by projecting distant frames as control matrices, and Inner State-Space Reconstruction (ISSR) module that enhances tissue structures through joint long-range spatial feature learning and large-kernel short-range information aggregation.

Result: Experiments across four medical datasets (including endoscopy and cataract surgeries) show MedVSR significantly outperforms existing VSR models in both reconstruction performance and efficiency.

Conclusion: MedVSR provides an effective solution for medical video super-resolution, addressing the unique challenges of medical imaging while maintaining structural accuracy and reducing artifacts that could mislead clinical diagnosis.

Abstract: High-resolution (HR) medical videos are vital for accurate diagnosis, yet are hard to acquire due to hardware limitations and physiological constraints. Clinically, the collected low-resolution (LR) medical videos present unique challenges for video super-resolution (VSR) models, including camera shake, noise, and abrupt frame transitions, which result in significant optical flow errors and alignment difficulties. Additionally, tissues and organs exhibit continuous and nuanced structures, but current VSR models are prone to introducing artifacts and distorted features that can mislead doctors. To this end, we propose MedVSR, a tailored framework for medical VSR. It first employs Cross State-Space Propagation (CSSP) to address the imprecise alignment by projecting distant frames as control matrices within state-space models, enabling the selective propagation of consistent and informative features to neighboring frames for effective alignment. Moreover, we design an Inner State-Space Reconstruction (ISSR) module that enhances tissue structures and reduces artifacts with joint long-range spatial feature learning and large-kernel short-range information aggregation. Experiments across four datasets in diverse medical scenarios, including endoscopy and cataract surgeries, show that MedVSR significantly outperforms existing VSR models in reconstruction performance and efficiency. Code released at https://github.com/CUHK-AIM-Group/MedVSR.

[210] MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, Deli Zhao, Wei Lu, Yu Rong, Aixin Sun, Shijian Lu

Main category: cs.CV

TL;DR: This paper introduces Variance-Aware Sampling (VAS) to stabilize RL fine-tuning for multimodal reasoning models, releases large-scale CoT data resources, and open-sources multimodal reasoning models with theoretical guarantees on reward variance.

DetailsMotivation: Address two major limitations in large multimodal reasoning models: absence of high-quality long chain-of-thought data and instability of RL algorithms in post-training due to gradient vanishing when reward variance is low.

Method: Proposes Variance-Aware Sampling (VAS) guided by Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity. Releases ~1.6M long CoT cold-start data and ~15k RL QA pairs with reproducible training codebase.

Result: Experiments across mathematical reasoning benchmarks demonstrate effectiveness of curated data and VAS. Comprehensive ablation studies provide insights into component contributions. Theoretically establishes that reward variance lower-bounds expected policy gradient magnitude.

Conclusion: VAS serves as practical mechanism to stabilize policy optimization by promoting reward variance. The released resources establish standardized baselines for the multimodal reasoning community.

Abstract: Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start data and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models in multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.
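
The sampling rule can be made concrete with a toy scorer: per prompt, combine the variance of rollout rewards with a diversity term over sampled answers, then keep the highest-scoring prompts. The fraction-of-distinct-answers diversity measure and the mixing weight are stand-ins for the paper's VPS formulation.

```python
import numpy as np

def variance_promotion_score(rewards, answers, w_div=0.5):
    """VPS-style score: outcome variance plus a trajectory-diversity bonus."""
    return float(np.var(rewards)) + w_div * len(set(answers)) / len(answers)

def select_batch(pool, k):
    """Keep prompts whose rollouts promote reward variance, avoiding the
    zero-variance groups that make GRPO gradients vanish."""
    return sorted(pool, key=lambda ex: variance_promotion_score(
        ex["rewards"], ex["answers"]), reverse=True)[:k]

pool = [
    {"prompt": "p1", "rewards": [1, 1, 1, 1], "answers": ["a", "a", "a", "a"]},
    {"prompt": "p2", "rewards": [0, 1, 0, 1], "answers": ["a", "b", "c", "b"]},
]
print([ex["prompt"] for ex in select_batch(pool, 1)])  # ['p2']: all-correct p1 gives no signal
```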

[211] A Sentinel-3 foundation model for ocean colour

Geoffrey Dawson, Remy Vandaele, Andrew Taylor, David Moffat, Helen Tamura-Wicks, Sarah Jackson, Rosie Lickorish, Paolo Fraccaro, Hywel Williams, Chunbo Luo, Anne Jones

Main category: cs.CV

TL;DR: A new foundation model using Prithvi-EO Vision Transformer architecture pre-trained on Sentinel-3 OLCI data shows promise for ocean science applications, particularly in chlorophyll concentration quantification and ocean primary production estimation.

DetailsMotivation: Foundation models can address the challenge of sparse and expensive labeled data in ocean science by leveraging massive unlabeled datasets, potentially revolutionizing AI applications in marine monitoring.

Method: Used Prithvi-EO Vision Transformer architecture pre-trained on Sentinel-3 Ocean and Land Colour Instrument data, then fine-tuned on two downstream tasks: chlorophyll concentration quantification and ocean primary production estimation.

Result: The model demonstrated utility for marine monitoring, effectively using small amounts of high-quality labeled data and capturing detailed spatial patterns of ocean color while matching point observations.

Conclusion: This new generation of geospatial AI models has potential to provide more robust, data-driven insights into ocean ecosystems and their role in global climate processes.

Abstract: Artificial Intelligence (AI) Foundation models (FMs), pre-trained on massive unlabelled datasets, have the potential to drastically change AI applications in ocean science, where labelled data are often sparse and expensive to collect. In this work, we describe a new foundation model using the Prithvi-EO Vision Transformer architecture which has been pre-trained to reconstruct data from the Sentinel-3 Ocean and Land Colour Instrument (OLCI). We evaluate the model by fine-tuning on two downstream marine earth observation tasks. We first assess model performance compared to current baseline models used to quantify chlorophyll concentration. We then evaluate the FM's ability to refine remote sensing-based estimates of ocean primary production. Our results demonstrate the utility of self-trained FMs for marine monitoring, in particular for making use of small amounts of high-quality labelled data and in capturing detailed spatial patterns of ocean colour whilst matching point observations. We conclude that this new generation of geospatial AI models has the potential to provide more robust, data-driven insights into ocean ecosystems and their role in global climate processes.

[212] Does FLUX Already Know How to Perform Physically Plausible Image Composition?

Shilin Lu, Zhuming Lian, Zihan Zhou, Shaocong Zhang, Chen Zhao, Adams Wai-Kin Kong

Main category: cs.CV

TL;DR: SHINE is a training-free framework for high-fidelity image composition that addresses complex lighting challenges and diverse inputs by leveraging pretrained diffusion models without latent inversion or attention surgery.

DetailsMotivation: Existing image composition models struggle with complex lighting conditions (shadows, reflections) and diverse high-resolution inputs. Current diffusion models have physical priors but lack proper frameworks to utilize them effectively without causing object pose issues or quality degradation.

Method: SHINE uses manifold-steered anchor loss with pretrained customization adapters (like IP-Adapter) to guide latents for accurate subject representation while maintaining background integrity. It also employs degradation-suppression guidance and adaptive background blending to eliminate low-quality outputs and visible seams.

Result: SHINE achieves state-of-the-art performance on ComplexCompo benchmark and DreamEditBench, demonstrating superior results on standard metrics (DINOv2) and human-aligned scores (DreamSim, ImageReward, VisionReward).

Conclusion: SHINE provides an effective training-free solution for high-quality image composition that handles complex lighting and diverse resolutions, outperforming existing methods while introducing a new challenging benchmark for future research.

Abstract: Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.

[213] Quantized Visual Geometry Grounded Transformer

Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu

Main category: cs.CV

TL;DR: QuantVGGT is a novel quantization framework for Visual Geometry Grounded Transformers (VGGTs) that addresses challenges in compressing billion-scale 3D reconstruction models through dual-smoothed fine-grained quantization and noise-filtered diverse sampling.

DetailsMotivation: Large-scale transformers like VGGTs face prohibitive computational and memory costs that hinder real-world deployment. Post-Training Quantization (PTQ) struggles with VGGTs due to heavy-tailed activation distributions from data-independent special tokens and unstable calibration from multi-view 3D data.

Method: Proposes QuantVGGT with two key techniques: 1) Dual-Smoothed Fine-Grained Quantization using pre-global Hadamard rotation and post-local channel smoothing to handle heavy-tailed distributions, and 2) Noise-Filtered Diverse Sampling that filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters.

Result: QuantVGGT achieves state-of-the-art results across benchmarks and bit-widths, with 4-bit quantization delivering 3.7× memory reduction and 2.5× acceleration while maintaining reconstruction accuracy above 98% of full-precision models.

Conclusion: QuantVGGT demonstrates significant advantages for resource-constrained scenarios, making billion-scale VGGTs practically deployable with minimal accuracy loss.

Abstract: Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress with the use of large-scale transformers. However, their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first quantization framework for VGGTs, namely QuantVGGT. This mainly relies on two technical contributions: First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to mitigate heavy-tailed distributions and inter-channel variance robustly. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin. We highlight that our 4-bit QuantVGGT can deliver a 3.7× memory reduction and 2.5× acceleration in real-hardware inference, while maintaining reconstruction accuracy above 98% of its full-precision counterpart. This demonstrates the vast advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released at https://github.com/wlfeng0509/QuantVGGT.
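
The intuition behind the pre-quantization rotation is easy to demonstrate: an orthonormal Hadamard transform spreads a heavy-tailed outlier across all channels, shrinking the quantization range. The toy below compares 4-bit quantization error with and without the rotation; it illustrates the principle only, not QuantVGGT's full dual-smoothed scheme.

```python
import torch

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix
    (n must be a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    return H / H.shape[0] ** 0.5

def fake_quant(x, bits=4):
    """Symmetric per-tensor quantize-dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max() / qmax
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

torch.manual_seed(0)
x = torch.randn(64)
x[0] = 20.0                                   # heavy-tailed outlier channel
H = hadamard(64)
plain_err = (fake_quant(x) - x).abs().mean()
rot_err = (H.T @ fake_quant(H @ x) - x).abs().mean()  # rotate, quantize, undo
print(plain_err.item(), rot_err.item())       # rotation typically shrinks the error
```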

[214] NewtonGen: Physics-Consistent and Controllable Text-to-Video Generation via Neural Newtonian Dynamics

Yu Yuan, Xijun Wang, Tharindu Wickremasinghe, Zeeshan Nadir, Bole Ma, Stanley H. Chan

Main category: cs.CV

TL;DR: NewtonGen integrates data-driven synthesis with learnable physical principles to address physical inconsistency in text-to-video generation.

Motivation: Current text-to-video models produce unrealistic motions and lack precise parameter control due to learning motion distributions solely from appearance without understanding underlying dynamics.

Method: Proposes NewtonGen framework with trainable Neural Newtonian Dynamics (NND) that models Newtonian motions and injects latent dynamical constraints into video generation.
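
A toy stand-in for a learnable Newtonian dynamics module: an MLP predicts acceleration from the current state and trajectories come from explicit integration (class name, dimensions, and integrator are our assumptions, not the paper's NND parameterization):

```python
import torch
import torch.nn as nn

class NeuralNewtonianDynamics(nn.Module):
    """Predict acceleration from (position, velocity), then integrate F = ma."""
    def __init__(self, dim: int = 2, hidden: int = 64):
        super().__init__()
        self.accel = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))

    def forward(self, pos, vel, steps: int = 16, dt: float = 0.1):
        traj = [pos]
        for _ in range(steps):                      # semi-implicit Euler
            a = self.accel(torch.cat([pos, vel], -1))
            vel = vel + dt * a
            pos = pos + dt * vel
            traj.append(pos)
        return torch.stack(traj, 1)                 # (B, steps + 1, dim)

nnd = NeuralNewtonianDynamics()
traj = nnd(torch.zeros(4, 2), torch.randn(4, 2))    # trajectories that could condition generation
print(traj.shape)                                   # torch.Size([4, 17, 2])
```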

Result: Enables physically consistent video synthesis with precise parameter control by jointly leveraging data priors and dynamical guidance.

Conclusion: The integration of learnable physical principles with data-driven synthesis addresses fundamental limitations in text-to-video generation for improved physical consistency and controllability.

Abstract: A primary bottleneck in large-scale text-to-video generation today is physical consistency and controllability. Despite recent advances, state-of-the-art models often produce unrealistic motions, such as objects falling upward, or abrupt changes in velocity and direction. Moreover, these models lack precise parameter control, struggling to generate physically consistent dynamics under different initial conditions. We argue that this fundamental limitation stems from current models learning motion distributions solely from appearance, while lacking an understanding of the underlying dynamics. In this work, we propose NewtonGen, a framework that integrates data-driven synthesis with learnable physical principles. At its core lies trainable Neural Newtonian Dynamics (NND), which can model and predict a variety of Newtonian motions, thereby injecting latent dynamical constraints into the video generation process. By jointly leveraging data priors and dynamical guidance, NewtonGen enables physically consistent video synthesis with precise parameter control.

[215] SD3.5-Flash: Distribution-Guided Distillation of Generative Flows

Hmrishav Bandyopadhyay, Rahim Entezari, Jim Scott, Reshinth Adithyan, Yi-Zhe Song, Varun Jampani

Main category: cs.CV

TL;DR: SD3.5-Flash is an efficient few-step distillation framework that enables high-quality image generation on consumer devices through optimized rectified flow models and pipeline improvements.

Motivation: To democratize access to advanced generative AI by making high-quality image generation computationally feasible on accessible consumer devices like mobile phones and desktop computers.

Method: Uses a reformulated distribution matching objective for few-step generation, introduces timestep sharing to reduce gradient noise and split-timestep fine-tuning for better prompt alignment, combined with pipeline optimizations like text encoder restructuring and specialized quantization.
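
One plausible reading of "timestep sharing" in a distribution-matching distillation step: sample a single diffusion timestep and reuse it across the batch so the score-difference gradient is estimated at one common noise level. The schedule, callables, and names below are illustrative assumptions:

```python
import torch

def dmd_step_with_timestep_sharing(student_x0, real_score, fake_score, T=1000):
    """student_x0: (B, D) flattened student samples; real_score/fake_score:
    hypothetical callables (x_t, t) -> predicted noise."""
    b = student_x0.shape[0]
    t = torch.randint(1, T, (1,)).expand(b)             # one shared timestep
    alpha = (1 - t.float() / T).view(-1, 1)             # toy linear schedule
    x_t = alpha * student_x0 + (1 - alpha) * torch.randn_like(student_x0)
    with torch.no_grad():
        grad = fake_score(x_t, t) - real_score(x_t, t)  # DMD-style direction
    return (student_x0 * grad).mean()                   # surrogate: d/dx0 recovers grad
```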

Result: The system enables rapid generation and memory-efficient deployment across different hardware configurations, consistently outperforming existing few-step methods according to large-scale user studies.

Conclusion: SD3.5-Flash successfully makes advanced generative AI truly accessible for practical deployment across the full spectrum of consumer devices.

Abstract: We present SD3.5-Flash, an efficient few-step distillation framework that brings high-quality image generation to accessible consumer devices. Our approach distills computationally prohibitive rectified flow models through a reformulated distribution matching objective tailored specifically for few-step generation. We introduce two key innovations: “timestep sharing” to reduce gradient noise and “split-timestep fine-tuning” to improve prompt alignment. Combined with comprehensive pipeline optimizations like text encoder restructuring and specialized quantization, our system enables both rapid generation and memory-efficient deployment across different hardware configurations. This democratizes access across the full spectrum of devices, from mobile phones to desktop computers. Through extensive evaluation including large-scale user studies, we demonstrate that SD3.5-Flash consistently outperforms existing few-step methods, making advanced generative AI truly accessible for practical deployment.

[216] SuperPatchMatch: an Algorithm for Robust Correspondences using Superpixel Patches

Rémi Giraud, Vinh-Thong Ta, Aurélie Bugeau, Pierrick Coupé, Nicolas Papadakis

Main category: cs.CV

TL;DR: The paper introduces SuperPatch, a novel superpixel-based patch structure that incorporates spatial information for robust image segmentation and labeling, and presents SuperPatchMatch for efficient matching.

Motivation: Superpixels are popular but underutilized due to irregular and unstable segmentation results caused by dependency on image content. The authors aim to create a more robust descriptor that includes spatial information.

Method: Proposes SuperPatch structure based on superpixel neighborhood, generalizes PatchMatch to SuperPatchMatch, and develops a framework for fast segmentation and labeling from image databases.

Result: The approach outperforms state-of-the-art methods in both computational cost and accuracy on face labeling and medical image segmentation tasks.

Conclusion: SuperPatch and SuperPatchMatch demonstrate significant potential for efficient and accurate image segmentation and labeling applications.

Abstract: Superpixels have become very popular in many computer vision applications. Nevertheless, they remain underexploited since the superpixel decomposition may produce irregular and unstable segmentation results due to its dependency on the image content. In this paper, we first introduce a novel structure, a superpixel-based patch, called SuperPatch. The proposed structure, based on superpixel neighborhood, leads to a robust descriptor since spatial information is naturally included. The generalization of the PatchMatch method to SuperPatches, named SuperPatchMatch, is introduced. Finally, we propose a framework to perform fast segmentation and labeling from an image database, and demonstrate the potential of our approach, which outperforms state-of-the-art methods in both computational cost and accuracy on face labeling and medical image segmentation.

[217] Retina Vision Transformer (RetinaViT): Introducing Scaled Patches into Vision Transformers

Yuyang Shu, Michael E. Bain

Main category: cs.CV

TL;DR: RetinaViT is a Vision Transformer inspired by human visual processing that naturally attends to low spatial frequencies first (like humans) and shifts to high frequencies in deeper layers, showing improved robustness to model size reduction compared to standard ViT.

Motivation: The paper is motivated by how humans process visual information - seeing low spatial frequency components before high frequency components. The authors aim to incorporate this neuroscientific principle into Vision Transformers.

Method: The researchers introduce patches from different spatial frequencies into Vision Transformers, creating RetinaViT. The model processes images by attending to different frequency components without additional inductive bias.
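
A sketch of how scaled patches could enter the token sequence: patchify an image pyramid so coarser levels contribute low-spatial-frequency tokens (the pooling choice and token layout are our assumptions):

```python
import torch
import torch.nn.functional as F

def multi_frequency_patches(img, patch=16, levels=3):
    """img: (B, C, H, W). Returns one token sequence mixing all frequency bands."""
    tokens, x = [], img
    for _ in range(levels):
        p = F.unfold(x, patch, stride=patch)        # (B, C*patch*patch, N)
        tokens.append(p.transpose(1, 2))            # (B, N, C*patch*patch)
        x = F.avg_pool2d(x, 2)                      # next, lower-frequency level
    return torch.cat(tokens, dim=1)

seq = multi_frequency_patches(torch.randn(2, 3, 64, 64))
print(seq.shape)   # 16 + 4 + 1 patches per image -> torch.Size([2, 21, 768])
```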

Result: RetinaViT naturally attends to low spatial frequencies in early layers and shifts to high frequencies in deeper layers, mirroring human visual processing. It shows better robustness to model size reduction compared to original ViT.

Conclusion: RetinaViT captures structural features early and fine details later, reversing the processing order of CNNs. The emergent frequency-based attention pattern without explicit bias suggests the model learns biologically plausible visual processing strategies.

Abstract: Humans see low spatial frequency components before high spatial frequency components. Drawing on this neuroscientific inspiration, we investigate the effect of introducing patches from different spatial frequencies into Vision Transformers (ViTs). We name this model Retina Vision Transformer (RetinaViT) due to its inspiration from the human visual system. Our experiments on benchmark data show that RetinaViT exhibits a strong tendency to attend to low spatial frequency components in the early layers, and shifts its attention to high spatial frequency components as the network goes deeper. This tendency emerged by itself without any additional inductive bias, and aligns with the visual processing order of the human visual system. We hypothesise that RetinaViT captures structural features, or the gist of the scene, in earlier layers, before attending to fine details in subsequent layers, which is the reverse of the processing order of mainstream backbone vision models, such as CNNs. We also observe that RetinaViT is more robust to significant reductions in model size compared to the original ViT, which we hypothesise to have come from its ability to capture the gist of the scene early.

[218] Copycats: the many lives of a publicly available medical imaging dataset

Amelia Jiménez-Sánchez, Natalia-Rozalia Avlona, Dovile Juodelyte, Théo Sourget, Caroline Vang-Larsen, Anna Rogers, Hubert Dariusz Zając, Veronika Cheplygina

Main category: cs.CV

TL;DR: Analysis of medical imaging datasets on community-contributed platforms reveals governance failures in data quality, documentation, and maintenance practices, highlighting harmful downstream effects for healthcare AI.

Motivation: Medical imaging datasets are crucial for healthcare AI but current community-contributed platforms fail to uphold quality standards and recommended practices for dataset sharing and documentation.

Method: Conducted analysis of publicly available machine learning datasets on community-contributed platforms, comparing datasets across dimensions including data sharing, documentation, and maintenance practices.

Result: Found vague licenses, lack of persistent identifiers, duplicates, missing metadata, and differences between platforms, with medical imaging datasets showing more harmful potential effects than computer vision datasets.

Conclusion: Current CCP governance models inadequately support responsible data curation needed for healthcare AI, requiring improved practices for dataset management and documentation.

Abstract: Medical Imaging (MI) datasets are fundamental to artificial intelligence in healthcare. The accuracy, robustness, and fairness of diagnostic algorithms depend on the data (and its quality) used to train and evaluate the models. MI datasets used to be proprietary, but have become increasingly available to the public, including on community-contributed platforms (CCPs) like Kaggle or HuggingFace. While open data is important to enhance the redistribution of data’s public value, we find that the current CCP governance model fails to uphold the quality needed and recommended practices for sharing, documenting, and evaluating datasets. In this paper, we conduct an analysis of publicly available machine learning datasets on CCPs, discussing datasets’ context, and identifying limitations and gaps in the current CCP landscape. We highlight differences between MI and computer vision datasets, particularly in the potentially harmful downstream effects from poor adoption of recommended dataset management practices. We compare the analyzed datasets across several dimensions, including data sharing, data documentation, and maintenance. We find vague licenses, lack of persistent identifiers and storage, duplicates, and missing metadata, with differences between the platforms. Our research contributes to efforts in responsible data curation and AI algorithms for healthcare.

[219] Vim-F: Visual State Space Model Benefiting from Learning in the Frequency Domain

Juntao Zhang, Shaogeng Liu, Jun Zhou, Kun Bian, You Zhou, Jianning Liu, Pei Zhang, Bingyan Liu

Main category: cs.CV

TL;DR: Vim-F is a novel Vision Mamba model that incorporates frequency domain information via FFT to enhance global receptive field and spatial understanding, while removing position embeddings and improving patch embedding for better performance.

Motivation: Current Vision Mamba (ViM) methods underperform compared to CNNs and ViTs because they flatten 2D images into 1D sequences, ignoring 2D local dependencies and weakening spatial relationship interpretation. The authors aim to improve ViM's global perspective capabilities.

Method: The proposed Vim-F model uses Fast Fourier Transform (FFT) to obtain feature map spectra and combines them with original features, enabling unified visual representation in both frequency and spatial domains. It employs pure Mamba encoders with dual-domain scanning, removes position embeddings, and redesigns patch embedding using convolutional stem for better local correlation capture.
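
The frequency-domain fusion is easy to illustrate: take the 2D FFT of the feature map and add its amplitude spectrum (log-scaled here, our choice) back onto the spatial features, giving every position a view of global structure:

```python
import torch

def frequency_augment(feat: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) feature map."""
    spec = torch.fft.fft2(feat, dim=(-2, -1))
    amp = torch.log1p(spec.abs())       # real-valued, same shape as feat
    return feat + amp

y = frequency_augment(torch.randn(2, 96, 14, 14))
print(y.shape)                          # torch.Size([2, 96, 14, 14])
```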

Result: Vim-F achieves improved performance by enabling global receptive field during scanning through frequency domain integration, while maintaining the efficient long-sequence modeling capability of Mamba models.

Conclusion: Incorporating frequency domain information via FFT significantly enhances Vision Mamba models’ ability to interpret spatial relationships from a global perspective, making Vim-F a more competitive visual backbone compared to traditional CNNs and ViTs.

Abstract: In recent years, State Space Models (SSMs) with efficient hardware-aware designs, known as the Mamba deep learning models, have made significant progress in modeling long sequences such as language understanding. Therefore, building efficient and general-purpose visual backbones based on SSMs is a promising direction. Compared to traditional convolutional neural networks (CNNs) and Vision Transformers (ViTs), the performance of Vision Mamba (ViM) methods is not yet fully competitive. To enable SSMs to process image data, ViMs typically flatten 2D images into 1D sequences, inevitably ignoring some 2D local dependencies, thereby weakening the model’s ability to interpret spatial relationships from a global perspective. We use Fast Fourier Transform (FFT) to obtain the spectrum of the feature map and add it to the original feature map, enabling ViM to model a unified visual representation in both frequency and spatial domains. The introduction of frequency domain information enables ViM to have a global receptive field during scanning. We propose a novel model called Vim-F, which employs pure Mamba encoders and scans in both the frequency and spatial domains. Moreover, we question the necessity of position embedding in ViM and remove it accordingly in Vim-F, which helps to fully utilize the efficient long-sequence modeling capability of ViM. Finally, we redesign a patch embedding for Vim-F, leveraging a convolutional stem to capture more local correlations, further improving the performance of Vim-F. Code is available at: https://github.com/yws-wxs/Vim-F.

[220] Model Agnostic Defense against Adversarial Patch Attacks on Object Detection in Unmanned Aerial Vehicles

Saurabh Pathak, Samridha Shrestha, Abdelrahman AlMahmoud

Main category: cs.CV

TL;DR: A novel model-agnostic defense mechanism against adversarial patch attacks for UAV object detection, formulated as occlusion removal without requiring adversarial training.

Motivation: Adversarial patch attacks can severely impair UAV object detection performance, compromising high-level tasks that depend on ground object awareness from aerial perspectives.

Method: Formulates adversarial patch defense as an occlusion removal task using a lightweight single-stage approach that doesn’t require exposure to adversarial patches during training.
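
For intuition only, the occlusion-removal formulation can be mimicked with a classical inpainting stand-in; the paper instead trains a lightweight single-stage network, and the patch mask is assumed given here:

```python
import cv2
import numpy as np

def remove_patch(img_bgr: np.ndarray, patch_mask: np.ndarray) -> np.ndarray:
    """patch_mask: uint8, 255 where the adversarial patch was localized."""
    return cv2.inpaint(img_bgr, patch_mask, 3, cv2.INPAINT_TELEA)

img = np.random.randint(0, 255, (128, 128, 3), np.uint8)
mask = np.zeros((128, 128), np.uint8)
mask[40:70, 50:80] = 255                # pretend the patch was localized here
clean = remove_patch(img, mask)         # feed `clean` to the object detector
```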

Result: Significantly decreases Attack Success Ratio in both digital and physical domains without significant processing costs, maintaining model-agnostic nature.

Conclusion: The proposed defense solution improves reliability of object detection for UAVs and can be deployed without requiring updates when the object detection pipeline changes.

Abstract: Object detection forms a key component in Unmanned Aerial Vehicles (UAVs) for completing high-level tasks that depend on the awareness of objects on the ground from an aerial perspective. In that scenario, adversarial patch attacks on an onboard object detector can severely impair the performance of upstream tasks. This paper proposes a novel model-agnostic defense mechanism against the threat of adversarial patch attacks in the context of UAV-based object detection. We formulate adversarial patch defense as an occlusion removal task. The proposed defense method can neutralize adversarial patches located on objects of interest, without exposure to adversarial patches during training. Our lightweight single-stage defense approach allows us to maintain a model-agnostic nature that, once deployed, does not need to be updated in response to changes in the object detection pipeline. The evaluations in digital and physical domains show the feasibility of our method for deployment in UAV object detection pipelines, by significantly decreasing the Attack Success Ratio without incurring significant processing costs. As a result, the proposed defense solution can improve the reliability of object detection for UAVs.

[221] SOOD++: Leveraging Unlabeled Data to Boost Oriented Object Detection

Dingkang Liang, Wei Hua, Chunsheng Shi, Zhikang Zou, Xiaoqing Ye, Xiang Bai

Main category: cs.CV

TL;DR: SOOD++ is a semi-supervised oriented object detection method for aerial images that addresses challenges like arbitrary orientations, small scales, and dense distribution through three core components: Simple Instance-aware Dense Sampling, Geometry-aware Adaptive Weighting loss, and Noise-driven Global Consistency.

Motivation: Existing semi-supervised object detection methods focus mainly on horizontal objects, leaving oriented objects in aerial images unexplored despite their higher annotation costs. Aerial images present unique challenges including arbitrary orientations, small object scales, and dense distributions.

Method: The method uses three key components: 1) Simple Instance-aware Dense Sampling (SIDS) for comprehensive pseudo-label generation, 2) Geometry-aware Adaptive Weighting (GAW) loss that dynamically adjusts importance based on geometric information, and 3) Noise-driven Global Consistency (NGC) that treats aerial images as global layouts and builds many-to-many relationships between pseudo-labels and predictions.
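
A hedged sketch of geometry-aware weighting: re-weight each pseudo-label/prediction pair by simple geometric statistics of the oriented box. The weighting formulas below are invented for illustration; GAW's exact modulation is in the paper:

```python
import torch

def gaw_loss(pred, pseudo, base_loss):
    """pred/pseudo: (N, 5) oriented boxes (cx, cy, w, h, angle);
    base_loss: per-pair loss with no reduction."""
    scale = (pseudo[:, 2] * pseudo[:, 3]).sqrt()       # object size
    w_scale = 1.0 + torch.exp(-scale / 32.0)           # up-weight small objects
    w_angle = 1.0 + torch.sin(pseudo[:, 4]).abs()      # up-weight tilted boxes
    return (w_scale * w_angle * base_loss(pred, pseudo)).mean()

pred = torch.randn(8, 5)
pseudo = torch.cat([torch.rand(8, 4) * 100, torch.rand(8, 1) * 3.14], dim=1)
l1 = lambda p, t: (p - t).abs().sum(-1)                # per-pair L1
print(gaw_loss(pred, pseudo, l1))
```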

Result: The method achieves significant improvements over previous state-of-the-art methods on DOTA benchmarks, with mAP gains of +2.90/2.14, +2.16/2.18, and +2.66/2.32 under 10%, 20%, and 30% labeled data settings respectively. It also improves upon a strong supervised baseline by +1.82 mAP, reaching 72.48 mAP on DOTA-V1.5.

Conclusion: SOOD++ effectively addresses the challenges of semi-supervised oriented object detection in aerial images and sets new state-of-the-art performance, demonstrating the effectiveness of its core components in handling the unique characteristics of aerial object detection.

Abstract: Semi-supervised object detection (SSOD), leveraging unlabeled data to boost object detectors, has become a hot topic recently. However, existing SSOD approaches mainly focus on horizontal objects, leaving oriented objects common in aerial images unexplored. At the same time, the annotation cost of oriented objects is significantly higher than that of their horizontal counterparts. Therefore, in this paper, we propose a simple yet effective Semi-supervised Oriented Object Detection method termed SOOD++. Specifically, we observe that objects from aerial images usually have arbitrary orientations, small scales, and dense distribution, which inspires the following core designs: a Simple Instance-aware Dense Sampling (SIDS) strategy is used to generate comprehensive dense pseudo-labels; the Geometry-aware Adaptive Weighting (GAW) loss dynamically modulates the importance of each pair between pseudo-label and corresponding prediction by leveraging the intricate geometric information of aerial objects; we treat aerial images as global layouts and explicitly build the many-to-many relationship between the sets of pseudo-labels and predictions via the proposed Noise-driven Global Consistency (NGC). Extensive experiments conducted on various oriented object datasets under various labeled settings demonstrate the effectiveness of our method. For example, on the DOTA-V2.0/DOTA-V1.5 benchmark, the proposed method outperforms previous state-of-the-art (SOTA) by a large margin (+2.90/2.14, +2.16/2.18, and +2.66/2.32) mAP under 10%, 20%, and 30% labeled data settings, respectively, with single-scale training and testing. More importantly, it still improves upon a strong supervised baseline with 70.66 mAP, trained using the full DOTA-V1.5 train-val set, by +1.82 mAP, resulting in a 72.48 mAP, pushing the new state-of-the-art. The project page is at https://dk-liang.github.io/SOODv2/

[222] Lightweight Modular Parameter-Efficient Tuning for Open-Vocabulary Object Detection

Bilal Faye, Hanane Azzag, Mustapha Lebbah

Main category: cs.CV

TL;DR: UniProj-Det is a lightweight framework for efficient open-vocabulary object detection that freezes pretrained backbones and uses a Universal Projection module to train only 2-5% of parameters while maintaining competitive performance.

Motivation: Existing open-vocabulary detection models require updating all parameters of large vision-language backbones, leading to high training costs. Parameter-efficient methods face challenges in layer selection and balancing efficiency with accuracy.

Method: UniProj-Det freezes pretrained backbones and introduces a Universal Projection module with a learnable modality token for unified vision-language adaptation. Applied to MDETR, it trains only about 2-5% of parameters.
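
A guess at what the Universal Projection module could look like: one shared residual bottleneck applied to both modalities, switched by a learnable modality token (dimensions and wiring are assumptions):

```python
import torch
import torch.nn as nn

class UniversalProjection(nn.Module):
    def __init__(self, dim: int = 256, bottleneck: int = 64):
        super().__init__()
        self.modality_token = nn.Parameter(torch.zeros(2, dim))  # 0: vision, 1: text
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)

    def forward(self, feats, modality: int):
        h = feats + self.modality_token[modality]         # tell the adapter what it adapts
        return feats + self.up(torch.relu(self.down(h)))  # residual update

proj = UniversalProjection()                        # the only trainable part (~few % of params)
vis = proj(torch.randn(2, 100, 256), modality=0)    # frozen-backbone vision features
txt = proj(torch.randn(2, 20, 256), modality=1)     # frozen-backbone text features
```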

Result: The framework achieves competitive or superior performance on phrase grounding, referring expression comprehension, and segmentation tasks while significantly reducing computational requirements.

Conclusion: UniProj-Det represents a principled step toward scalable and efficient open-vocabulary detection, with comprehensive analysis demonstrating its effectiveness in FLOPs, memory, latency, and ablation studies.

Abstract: Open-vocabulary object detection (OVD) extends recognition beyond fixed taxonomies by aligning visual and textual features, as in MDETR, GLIP, or RegionCLIP. While effective, these models require updating all parameters of large vision–language backbones, leading to prohibitive training cost. Recent efficient OVD approaches, inspired by parameter-efficient fine-tuning methods such as LoRA or adapters, reduce trainable parameters but often face challenges in selecting which layers to adapt and in balancing efficiency with accuracy. We propose UniProj-Det, a lightweight modular framework for parameter-efficient OVD. UniProj-Det freezes pretrained backbones and introduces a Universal Projection module with a learnable modality token, enabling unified vision–language adaptation at minimal cost. Applied to MDETR, our framework trains only about 2-5% of parameters while achieving competitive or superior performance on phrase grounding, referring expression comprehension, and segmentation. Comprehensive analysis of FLOPs, memory, latency, and ablations demonstrates UniProj-Det as a principled step toward scalable and efficient open-vocabulary detection.

[223] Asynchronous Perception Machine For Efficient Test-Time-Training

Rajat Modi, Yogesh Singh Rawat

Main category: cs.CV

TL;DR: APM is a computationally-efficient architecture for test-time-training that processes image patches asynchronously and asymmetrically while maintaining semantic awareness, enabling out-of-distribution recognition without dataset-specific pre-training.

Motivation: To create an efficient test-time-training architecture that can handle image patches in any order while encoding semantic information, and to provide empirical evidence for GLOM's insight that input percept is a field.

Method: APM processes patches of an image one at a time asymmetrically, distills test sample’s representation once for TTT, and can learn using just a single representation to predict semantically-aware features.

Result: APM offers competitive performance over existing TTT approaches, recognizes out-of-distribution images without dataset-specific pre-training, and can scale to large 2D image datasets for semantic clustering in a single forward pass.

Conclusion: APM demonstrates potential beyond TTT applications and provides empirical validation for GLOM’s insights, converging towards an implementation capable of both interpolation and perception on shared-connectionist hardware.

Abstract: In this work, we propose Asynchronous Perception Machine (APM), a computationally-efficient architecture for test-time-training (TTT). APM can process patches of an image one at a time in any order asymmetrically and still encode semantic-awareness in the net. We demonstrate APM's ability to recognize out-of-distribution images without dataset-specific pre-training, augmentation or any pretext task. APM offers competitive performance over existing TTT approaches. To perform TTT, APM just distills the test sample's representation once. APM possesses a unique property: it can learn using just this single representation and starts predicting semantically-aware features. APM demonstrates potential applications beyond test-time-training: APM can scale up to a dataset of 2D images and yield semantic clusterings in a single forward pass. APM also provides first empirical evidence towards validating GLOM's insight, i.e. that the input percept is a field. Therefore, APM helps us converge towards an implementation which can do both interpolation and perception on shared-connectionist hardware. Our code is publicly available at this link: https://rajatmodi62.github.io/apm_project_page/.

[224] Training-Free Layout-to-Image Generation with Marginal Attention Constraints

Huancheng Chen, Jingtao Li, Weiming Zhuang, Haris Vikalo, Lingjuan Lyu

Main category: cs.CV

TL;DR: MAC is a training-free layout-to-image approach that uses attention feature maps to optimize latent features during diffusion, achieving better spatial control without fine-tuning.

Motivation: Existing layout-to-image methods require fine-tuning or additional modules, which is resource-intensive. MAC aims to provide precise spatial control without these requirements.

Method: Uses text-visual cross-attention maps to quantify layout inconsistencies, computes loss functions to optimize latent features during diffusion, and leverages self-attention correlations to enhance spatial controllability.
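
A simplified version of the layout loss: penalize cross-attention mass that falls outside each object's box, then backpropagate to the latents at every reverse step (mask shapes and the exact loss form are our assumptions):

```python
import torch

def layout_loss(cross_attn, box_masks):
    """cross_attn, box_masks: (tokens, H, W); masks are 1 inside each object's box."""
    inside = (cross_attn * box_masks).flatten(1).sum(-1)
    total = cross_attn.flatten(1).sum(-1) + 1e-8
    return (1.0 - inside / total).mean()             # fraction of attention outside the box

attn = torch.rand(3, 16, 16, requires_grad=True)
masks = torch.zeros(3, 16, 16)
masks[0, :8, :8] = 1; masks[1, 8:, :] = 1; masks[2, :, 8:] = 1
layout_loss(attn, masks).backward()                  # in MAC, this gradient updates the latents
```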

Result: Outperforms existing training-free L2I techniques on DrawBench and HRS benchmarks both quantitatively and qualitatively.

Conclusion: MAC provides an effective training-free solution for precise layout control in text-to-image generation, eliminating the need for fine-tuning or additional modules.

Abstract: Recently, many text-to-image diffusion models excel at generating high-resolution images from text but struggle with precise control over spatial composition and object counting. To address these challenges, prior works developed layout-to-image (L2I) approaches that incorporate layout instructions into text-to-image models. However, existing L2I methods typically require fine-tuning of pre-trained parameters or training additional control modules for the diffusion models. In this work, we propose a training-free L2I approach, MAC (Marginal Attention Constrained Generation), which eliminates the need for additional modules or fine-tuning. Specifically, we use text-visual cross-attention feature maps to quantify inconsistencies between the layout of the generated images and the provided instructions, and then compute loss functions to optimize latent features during the diffusion reverse process. To enhance spatial controllability and mitigate semantic failures in complex layout instructions, we leverage pixel-to-pixel correlations in the self-attention feature maps to align cross-attention maps and combine three loss functions constrained by boundary attention to update latent features. Comprehensive experimental results on both L2I and non-L2I pretrained diffusion models demonstrate that our method outperforms existing training-free L2I techniques both quantitatively and qualitatively in terms of image composition on the DrawBench and HRS benchmarks.

[225] Supercharged One-step Text-to-Image Diffusion Models with Negative Prompts

Viet Nguyen, Anh Nguyen, Trung Dao, Khoi Nguyen, Cuong Pham, Toan Tran, Anh Tran

Main category: cs.CV

TL;DR: NASA is an efficient method that integrates negative prompts into one-step diffusion models to improve controllability without the blending artifacts of traditional classifier-free guidance.

Motivation: One-step diffusion models offer fast generation but lack the controllability of multi-step methods, and directly applying classifier-free guidance causes blending artifacts due to the lack of iterative refinement.

Method: NASA operates in intermediate representation space using cross-attention mechanisms to suppress undesired visual attributes, avoiding output-space guidance issues and adding minimal computational overhead (1.89% FLOPs increase).
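
One way to picture "negative-away" steering in representation space: compute the usual cross-attention readout for the positive prompt, then subtract a scaled readout for the negative prompt (a single-head toy; the subtraction form and alpha are our simplifications):

```python
import torch
import torch.nn.functional as F

def negative_away_attention(q, k_pos, v_pos, k_neg, v_neg, alpha=0.5):
    pos = F.scaled_dot_product_attention(q, k_pos, v_pos)   # attend to positive prompt
    neg = F.scaled_dot_product_attention(q, k_neg, v_neg)   # attend to negative prompt
    return pos - alpha * neg                                # steer away from negatives

q = torch.randn(1, 64, 32)                                  # image-token queries
out = negative_away_attention(q, torch.randn(1, 8, 32), torch.randn(1, 8, 32),
                              torch.randn(1, 4, 32), torch.randn(1, 4, 32))
print(out.shape)                                            # torch.Size([1, 64, 32])
```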

Result: NASA substantially improves controllability and output quality, achieving an HPSv2 score of 31.21, setting a new state-of-the-art benchmark for one-step diffusion models.

Conclusion: NASA enables effective negative prompting in one-step diffusion models, providing high-quality controllable generation with minimal computational cost, and can be integrated into existing timestep distillation frameworks.

Abstract: The escalating demand for real-time image synthesis has driven significant advancements in one-step diffusion models, which inherently offer expedited generation speeds compared to traditional multi-step methods. However, this enhanced efficiency is frequently accompanied by a compromise in the controllability of image attributes. While negative prompting, typically implemented via classifier-free guidance (CFG), has proven effective for fine-grained control in multi-step models, its application to one-step generators remains largely unaddressed. Due to the lack of iterative refinement, as in multi-step diffusion, directly applying CFG to one-step generation leads to blending artifacts and diminished output quality. To fill this gap, we introduce Negative-Away Steer Attention (NASA), an efficient method that integrates negative prompts into one-step diffusion models. NASA operates within the intermediate representation space by leveraging cross-attention mechanisms to suppress undesired visual attributes. This strategy avoids the blending artifacts inherent in output-space guidance and achieves high efficiency, incurring only a minimal 1.89% increase in FLOPs compared to the computational doubling of CFG. Furthermore, NASA can be seamlessly integrated into existing timestep distillation frameworks, enhancing the student's output quality. Experimental results demonstrate that NASA substantially improves controllability and output quality, achieving an HPSv2 score of 31.21, setting a new state-of-the-art benchmark for one-step diffusion models.

[226] GVDepth: Zero-Shot Monocular Depth Estimation for Ground Vehicles based on Probabilistic Cue Fusion

Karlo Koledić, Luka Petrović, Ivan Marković, Ivan Petrović

Main category: cs.CV

TL;DR: The paper proposes a novel canonical representation and architecture for metric monocular depth estimation that disentangles depth from camera parameters, enabling better generalization across datasets with varying camera setups.

Motivation: Metric monocular depth estimation faces challenges due to its ill-posed nature and entanglement with camera parameters, which hinders multi-dataset training and zero-shot accuracy, particularly in autonomous vehicles where fixed camera setups limit geometric diversity.

Method: The authors propose a canonical representation that maintains consistency across varied camera setups and a novel architecture that adaptively and probabilistically fuses depths estimated via object size and vertical image position cues.
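
The probabilistic fusion of the two cues can be illustrated with inverse-variance (precision) weighting; the paper learns the per-pixel uncertainties, which are fixed constants here:

```python
import torch

def fuse_depth_cues(d_size, var_size, d_vpos, var_vpos):
    """Fuse depth-from-object-size with depth-from-vertical-position."""
    w1, w2 = 1.0 / var_size, 1.0 / var_vpos
    return (w1 * d_size + w2 * d_vpos) / (w1 + w2)

d = fuse_depth_cues(torch.full((4, 4), 10.0), torch.full((4, 4), 0.5),
                    torch.full((4, 4), 12.0), torch.full((4, 4), 2.0))
print(d[0, 0])   # 10.4: pulled toward the lower-variance size cue
```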

Result: Comprehensive evaluation on five autonomous driving datasets shows accurate metric depth estimation for varying resolutions, aspect ratios and camera setups, achieving comparable accuracy to existing zero-shot methods despite training on a single dataset with single-camera setup.

Conclusion: The proposed approach effectively addresses the generalization challenge in metric depth estimation by leveraging fixed camera-ground relationships and novel fusion techniques, enabling robust performance across diverse autonomous driving scenarios.

Abstract: Generalizing metric monocular depth estimation presents a significant challenge due to its ill-posed nature, while the entanglement between camera parameters and depth amplifies issues further, hindering multi-dataset training and zero-shot accuracy. This challenge is particularly evident in autonomous vehicles and mobile robotics, where data is collected with fixed camera setups, limiting the geometric diversity. Yet, this context also presents an opportunity: the fixed relationship between the camera and the ground plane imposes additional perspective geometry constraints, enabling depth regression via vertical image positions of objects. However, this cue is highly susceptible to overfitting, thus we propose a novel canonical representation that maintains consistency across varied camera setups, effectively disentangling depth from specific parameters and enhancing generalization across datasets. We also propose a novel architecture that adaptively and probabilistically fuses depths estimated via object size and vertical image position cues. A comprehensive evaluation demonstrates the effectiveness of the proposed approach on five autonomous driving datasets, achieving accurate metric depth estimation for varying resolutions, aspect ratios and camera setups. Notably, we achieve comparable accuracy to existing zero-shot methods, despite training on a single dataset with a single-camera setup. Project website: https://unizgfer-lamor.github.io/gvdepth/

[227] MonSter++: Unified Stereo Matching, Multi-view Stereo, and Real-time Stereo with Monodepth Priors

Junda Cheng, Wenjing Liao, Zhipeng Cai, Longliang Liu, Gangwei Xu, Xianqi Wang, Yuzhou Wang, Zikang Yuan, Yong Deng, Jinliang Zang, Yangyang Shi, Jinhui Tang, Xin Yang

Main category: cs.CV

TL;DR: MonSter++ is a geometric foundation model that unifies stereo matching and multi-view stereo by integrating monocular depth priors through a dual-branched architecture, achieving state-of-the-art performance across multiple benchmarks.

Motivation: Both rectified stereo matching and unrectified multi-view stereo face challenges in ill-posed regions with limited matching cues. The paper aims to address this by combining the complementary strengths of single-view and multi-view depth estimation.

Method: MonSter++ uses a dual-branched architecture that fuses monocular depth and multi-view depth. It employs confidence-based guidance to adaptively select reliable multi-view cues to correct scale ambiguity in monocular depth, and refined monocular predictions guide multi-view estimation in ill-posed regions through iterative mutual enhancement.

Result: MonSter++ achieves new state-of-the-art on both stereo matching and multi-view stereo, with significant improvements across eight benchmarks from three tasks. The real-time variant RT-MonSter++ also outperforms previous real-time methods by a large margin.

Conclusion: The framework demonstrates strong generality and superior zero-shot generalization capability. By effectively incorporating monocular priors through cascaded search and multi-scale depth fusion, MonSter++ evolves coarse object-level monocular priors into fine-grained pixel-level geometry.

Abstract: We introduce MonSter++, a geometric foundation model for multi-view depth estimation, unifying rectified stereo matching and unrectified multi-view stereo. Both tasks fundamentally recover metric depth from correspondence search and consequently face the same dilemma: struggling to handle ill-posed regions with limited matching cues. To address this, we propose MonSter++, a novel method that integrates monocular depth priors into multi-view depth estimation, effectively combining the complementary strengths of single-view and multi-view cues. MonSter++ fuses monocular depth and multi-view depth into a dual-branched architecture. Confidence-based guidance adaptively selects reliable multi-view cues to correct scale ambiguity in monocular depth. The refined monocular predictions, in turn, effectively guide multi-view estimation in ill-posed regions. This iterative mutual enhancement enables MonSter++ to evolve coarse object-level monocular priors into fine-grained, pixel-level geometry, fully unlocking the potential of multi-view depth estimation. MonSter++ achieves new state-of-the-art on both stereo matching and multi-view stereo. By effectively incorporating monocular priors through our cascaded search and multi-scale depth fusion strategy, our real-time variant RT-MonSter++ also outperforms previous real-time methods by a large margin. As shown in Fig.1, MonSter++ achieves significant improvements over previous methods across eight benchmarks from three tasks – stereo matching, real-time stereo matching, and multi-view stereo, demonstrating the strong generality of our framework. Besides high accuracy, MonSter++ also demonstrates superior zero-shot generalization capability. We will release both the large and the real-time models to facilitate their use by the open-source community.

[228] Technical report on label-informed logit redistribution for better domain generalization in low-shot classification with foundation models

Behraj Khan, Tahir Syed

Main category: cs.CV

TL;DR: The paper proposes a Confidence Misalignment Penalty (CMP) method to improve confidence calibration in CLIP-based vision classification systems by penalizing incorrect classifications during fine-tuning.

Motivation: CLIP models exhibit poor confidence calibration where logit scores remain large regardless of whether image-language pairs match, making it difficult to address this issue in data space under few-shot learning scenarios.

Method: A penalty term is incorporated into the loss objective that moves log-likelihood from incorrect classifications to the true class proportionally to the relative amplitudes of the two likelihoods during fine-tuning.
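
A hedged sketch of such a penalty: on misclassified samples, add a term that grows with the confidence ratio between the predicted and the true class (the functional form and the 0.1 weight are our guesses, not the paper's exact CMP):

```python
import torch
import torch.nn.functional as F

def cmp_loss(logits, target):
    ce = F.cross_entropy(logits, target)
    prob = logits.softmax(-1)
    pred = prob.argmax(-1)
    wrong = pred != target
    if not wrong.any():
        return ce
    # ratio >= 1 on wrong predictions; log(ratio) moves mass toward the true class
    ratio = prob[wrong, pred[wrong]] / prob[wrong, target[wrong]].clamp(min=1e-8)
    return ce + 0.1 * torch.log(ratio).mean()

print(cmp_loss(torch.randn(8, 5), torch.randint(0, 5, (8,))))
```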

Result: Extensive experiments on 12 vision datasets and 5 domain generalization datasets show CMP outperforms state-of-the-art methods, reducing Expected Calibration Error by 6.01% on average, with improvements ranging from 4.01% to 9.72%.

Conclusion: The proposed CMP method effectively addresses confidence calibration challenges in CLIP-based vision classification systems, demonstrating significant improvements over existing prompt learning approaches.

Abstract: Confidence calibration is an emerging challenge in real-world decision systems based on foundation models when used for downstream vision classification tasks. For various reasons, logit scores on the CLIP head remain large irrespective of whether the image-language pairs reconcile. This is difficult to address in data space, given the few-shot regime. We propose a penalty incorporated into the loss objective that penalizes incorrect classifications whenever one is made during finetuning, by moving an amount of log-likelihood to the true class commensurate with the relative amplitudes of the two likelihoods. We refer to it as the confidence misalignment penalty (CMP). Extensive experiments on $12$ vision datasets and $5$ domain generalization datasets support the calibration performance of our method against the state of the art. CMP outperforms the benchmarked prompt learning methods, improving Expected Calibration Error (ECE) by $6.01$% on average, $4.01$% at minimum and $9.72$% at maximum.

[229] AdaSVD: Adaptive Singular Value Decomposition for Large Language Models

Zhiteng Li, Mingyuan Xia, Jingyuan Zhang, Zheng Hui, Haotong Qin, Linghe Kong, Yulun Zhang, Xiaokang Yang

Main category: cs.CV

TL;DR: AdaSVD is an adaptive SVD-based LLM compression method that addresses truncation errors and layer importance variations through adaptive compensation and compression ratio assignment.

Motivation: LLMs have high memory requirements that limit deployment on resource-constrained devices. Existing SVD compression methods suffer from truncation errors and uniform compression ratios that don't account for varying layer importance.

Method: AdaSVD introduces two components: adaComp (adaptive compensation for SVD truncation errors by alternately updating singular matrices U and V⊤) and adaCR (adaptive assignment of layer-specific compression ratios based on layer importance).
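
The adaComp idea of compensating truncation error by alternately refitting the two factors can be sketched with plain alternating least squares against calibration activations (these update rules are a stand-in, not the paper's exact procedure):

```python
import torch

def adasvd_compress(W, X, rank, iters=5):
    """W: (out, in) weight; X: (in, samples) calibration activations."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]                           # (out, r)
    B = Vh[:rank]                                        # (r, in)
    for _ in range(iters):
        Z = B @ X                                        # fix B, refit A on calibration data:
        A = (W @ X) @ Z.T @ torch.linalg.pinv(Z @ Z.T)   # argmin_A ||WX - AZ||_F
        B = torch.linalg.pinv(A) @ W                     # fix A, refit B: argmin_B ||W - AB||_F
    return A, B

W, X = torch.randn(64, 32), torch.randn(32, 256)
A, B = adasvd_compress(W, X, rank=8)
print(((W - A @ B) @ X).norm() / (W @ X).norm())         # relative error on calibration data
```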

Result: Extensive experiments show AdaSVD consistently outperforms state-of-the-art SVD-based methods across multiple LLM/VLM families and evaluation metrics, achieving superior performance with significantly reduced memory requirements.

Conclusion: AdaSVD provides an effective adaptive compression approach that addresses key limitations of existing SVD methods, enabling more efficient LLM deployment on resource-constrained devices.

Abstract: Large language models (LLMs) have achieved remarkable success in natural language processing (NLP) tasks, yet their substantial memory requirements present significant challenges for deployment on resource-constrained devices. Singular Value Decomposition (SVD) has emerged as a promising compression technique for LLMs, offering considerable reductions in memory overhead. However, existing SVD-based methods often struggle to effectively mitigate the errors introduced by SVD truncation, leading to a noticeable performance gap when compared to the original models. Furthermore, applying a uniform compression ratio across all transformer layers fails to account for the varying importance of different layers. To address these challenges, we propose AdaSVD, an adaptive SVD-based LLM compression approach. Specifically, AdaSVD introduces adaComp, which adaptively compensates for SVD truncation errors by alternately updating the singular matrices $\mathcal{U}$ and $\mathcal{V}^\top$. Additionally, AdaSVD introduces adaCR, which adaptively assigns layer-specific compression ratios based on the relative importance of each layer. Extensive experiments across multiple LLM/VLM families and evaluation metrics demonstrate that AdaSVD consistently outperforms state-of-the-art (SOTA) SVD-based methods, achieving superior performance with significantly reduced memory requirements. Code and models of AdaSVD will be available at https://github.com/ZHITENGLI/AdaSVD.

[230] LadderMIL: Multiple Instance Learning with Coarse-to-Fine Self-Distillation

Shuyang Wu, Yifu Qiu, Ines P. Nearchou, Sandrine Prost, Jonathan A. Fallowfield, Hideki Ueno, Hitoshi Tsuda, David J. Harrison, Hakan Bilen, Timothy J. Kendall

Main category: cs.CV

TL;DR: LadderMIL is a Multiple Instance Learning framework that improves WSI analysis by incorporating instance-level supervision through self-distillation and capturing inter-instance contextual information, achieving significant performance gains across multiple clinical benchmarks.

Motivation: Traditional MIL approaches for whole slide image analysis only use bag-level supervision, which neglects instance-level learning and fails to integrate instance and bag-level information effectively.

Method: Proposes two key components: 1) Coarse-to-Fine Self-Distillation (CFSD) that adaptively obtains instance-level labels from bag-level supervision, and 2) Contextual Encoding Generator (CEG) that captures inter-instance contextual relationships within bags.
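
A simplified picture of the CFSD loop for a positive bag: let the network's own bag-level attention nominate its most and least attended patches, and turn those into instance-level pseudo-labels for the same network (the top-k/bottom-k selection is our simplification of the paper's probing-and-distillation scheme):

```python
import torch

def cfsd_pseudo_labels(attn_scores, bag_label: int, k: int = 8):
    """attn_scores: (N,) attention per instance in a positive bag (bag_label >= 1)."""
    top = attn_scores.topk(k).indices                # most attended -> bag's class
    bottom = (-attn_scores).topk(k).indices          # least attended -> negative
    inst_idx = torch.cat([top, bottom])
    inst_labels = torch.cat([torch.full((k,), bag_label),
                             torch.zeros(k, dtype=torch.long)])
    return inst_idx, inst_labels                     # instance-level CE targets

scores = torch.rand(500)                             # one score per WSI patch
idx, labels = cfsd_pseudo_labels(scores, bag_label=1)
```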

Result: Achieved average improvements of 8.1% in AUC, 11% in F1-score, and 2.4% in C-index across five clinical benchmarks including breast cancer receptor status classification, subtype classification, tumour classification, and prognosis prediction.

Conclusion: LadderMIL successfully bridges the gap between instance-level and bag-level learning in MIL for computational pathology, demonstrating superior performance through integrated instance supervision and contextual modeling.

Abstract: Multiple Instance Learning (MIL) for whole slide image (WSI) analysis in computational pathology often neglects instance-level learning as supervision is typically provided only at the bag level, hindering the integrated consideration of instance and bag-level information during the analysis. In this work, we present LadderMIL, a framework designed to improve MIL through two perspectives: (1) employing instance-level supervision and (2) learning inter-instance contextual information at bag level. Firstly, we propose a novel Coarse-to-Fine Self-Distillation (CFSD) paradigm that probes and distils a network trained with bag-level information to adaptively obtain instance-level labels which could effectively provide the instance-level supervision for the same network in a self-improving way. Secondly, to capture inter-instance contextual information in WSI, we propose a Contextual Encoding Generator (CEG), which encodes the contextual appearance of instances within a bag. We also theoretically and empirically prove the instance-level learnability of CFSD. Our LadderMIL is evaluated on multiple clinically relevant benchmarking tasks including breast cancer receptor status classification, multi-class subtype classification, tumour classification, and prognosis prediction. Average improvements of 8.1%, 11% and 2.4% in AUC, F1-score, and C-index, respectively, are demonstrated across the five benchmarks, compared to the best baseline.

[231] Efficiently Disentangling CLIP for Multi-Object Perception

Samyak Rawlekar, Yujun Cai, Yiwei Wang, Ming-Hsuan Yang, Narendra Ahuja

Main category: cs.CV

TL;DR: DCLIP addresses excessive mutual feature information (MFI) in VLMs by learning optimal mutual information levels with minimal parameters, improving multi-object recognition.

Motivation: VLMs like CLIP struggle with complex scenes containing multiple objects due to excessive MFI, where features of one class contain information about unrelated classes.

Method: DCLIP uses two complementary losses: MFI Loss to regulate class feature similarity and prevent excessive overlap, and Asymmetric Loss to align image features with disentangled text features.
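
A minimal MFI-style regularizer over class (text) embeddings: penalize pairwise cosine similarity beyond a margin, so classes keep necessary shared information without excessive overlap (the margin value and exact form are assumptions):

```python
import torch
import torch.nn.functional as F

def mfi_loss(class_feats, margin: float = 0.3):
    f = F.normalize(class_feats, dim=-1)
    sim = f @ f.T                                   # (C, C) cosine similarities
    off_diag = sim - torch.eye(len(f))              # zero the self-similarities
    return F.relu(off_diag.abs() - margin).mean()   # only excess overlap is penalized

text_feats = torch.randn(20, 512)                   # one embedding per class
print(mfi_loss(text_feats))
```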

Result: DCLIP reduces excessive inter-class similarity by 30%, outperforms SOTA on VOC2007 and COCO-14 with 75% fewer parameters, and improves zero-shot semantic segmentation across six datasets.

Conclusion: Feature disentanglement is crucial for multi-object perception in VLMs, and DCLIP effectively addresses MFI limitations with minimal parameter overhead.

Abstract: Vision-language models like CLIP excel at recognizing the single, prominent object in a scene. However, they struggle in complex scenes containing multiple objects. We identify a fundamental reason for this limitation: VLM feature space exhibits excessive mutual feature information (MFI), where the features of one class contain substantial information about other, unrelated classes. This high MFI becomes evident during class-specific queries, as unrelated objects are activated alongside the queried class. To address this limitation, we propose DCLIP, an efficient framework that learns an optimal level of mutual information while adding only minimal learnable parameters to a frozen VLM. DCLIP uses two complementary losses: a novel MFI Loss that regulates class feature similarity to prevent excessive overlap while preserving necessary shared information, and the Asymmetric Loss (ASL) that aligns image features with the disentangled text features. Through this disentanglement, DCLIP reduces excessive inter-class similarity by 30%. On multi-label recognition, DCLIP performs favorably over SOTA approaches on VOC2007 and COCO-14 while using 75% fewer training parameters. For zero-shot semantic segmentation, it shows improved performance across six benchmark datasets. These results highlight the importance of feature disentanglement for multi-object perception in VLMs.

[232] HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

Ling Yang, Xinchen Zhang, Ye Tian, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui

Main category: cs.CV

TL;DR: HermesFlow is a framework that bridges the gap between understanding and generation capabilities in Multimodal Large Language Models (MLLMs) using homologous preference data and iterative optimization.

Motivation: The authors observed that MLLMs typically have stronger understanding capabilities than generative capabilities, creating a significant performance gap between the two.

Method: HermesFlow uses homologous data as input to create preference data for both understanding and generation, then employs Pair-DPO and self-play iterative optimization to align multimodal understanding and generation.
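
Pair-DPO presumably builds on the standard DPO objective over preference pairs; the core of that objective looks like the following, with the construction of homologous understanding/generation pairs (the paper's contribution) not shown:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_win, logp_lose, ref_win, ref_lose, beta: float = 0.1):
    """logp_*: policy log-probs of preferred/rejected outputs; ref_*: frozen reference."""
    margin = (logp_win - ref_win) - (logp_lose - ref_lose)
    return -F.logsigmoid(beta * margin).mean()

print(dpo_loss(torch.tensor([-3.1]), torch.tensor([-4.0]),
               torch.tensor([-3.5]), torch.tensor([-3.6])))
```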

Result: Extensive experiments show significant superiority over prior methods, particularly in narrowing the gap between multimodal understanding and generation capabilities.

Conclusion: HermesFlow demonstrates potential as a general alignment framework for next-generation multimodal foundation models, effectively bridging the understanding-generation gap.

Abstract: The remarkable success of the autoregressive paradigm has made significant advancement in Multimodal Large Language Models (MLLMs), with powerful models like Show-o, Transfusion and Emu3 achieving notable progress in unified image understanding and generation. For the first time, we uncover a common phenomenon: the understanding capabilities of MLLMs are typically stronger than their generative capabilities, with a significant gap between the two. Building on this insight, we propose HermesFlow, a simple yet general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs. Specifically, we take the homologous data as input to curate homologous preference data of both understanding and generation. Through Pair-DPO and self-play iterative optimization, HermesFlow effectively aligns multimodal understanding and generation using homologous preference data. Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models. Code: https://github.com/Gen-Verse/HermesFlow

[233] Diff-Reg v2: Diffusion-Based Matching Matrix Estimation for Image Matching and 3D Registration

Qianliang Wu, Haobo Jiang, Yaqing Ding, Lei Luo, Jun Li, Jin Xie, Xiaojun Wu, Jian Yang

Main category: cs.CV

TL;DR: A diffusion-based paradigm for robust correspondence estimation in registration tasks (2D image, 3D point cloud, 2D-3D registration) that treats matching as a denoising process in matrix space.

Motivation: Previous methods struggle with challenges like scale inconsistencies, symmetry, and large deformations, often relying on single-step prediction heads that get stuck in local minima. Existing approaches use specific geometric priors that cannot cover all scenarios.

Method: Uses a diffusion model in matching matrix space, treating correspondence estimation as a denoising diffusion process. Applies diffusion in doubly stochastic matrix space for 3D-3D/2D-3D tasks, and in a matrix subspace with dual-softmax projection for 2D registration. Features adaptive matrix embeddings and lightweight denoising module.
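
The doubly stochastic matrix space mentioned here is typically reached by Sinkhorn normalization; a minimal projection of a score matrix looks like this:

```python
import torch

def sinkhorn_projection(logits, iters: int = 20):
    """Alternate row/column normalization toward a doubly stochastic matrix."""
    M = logits.exp()
    for _ in range(iters):
        M = M / M.sum(dim=1, keepdim=True)   # rows sum to 1
        M = M / M.sum(dim=0, keepdim=True)   # columns sum to 1
    return M

M = sinkhorn_projection(torch.randn(5, 5))
print(M.sum(1), M.sum(0))                    # both approach all-ones
```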

Result: The method gradually refines intermediate matching matrices to optimal solutions through multi-step denoising predictions, avoiding local minima in complex matching scenarios.

Conclusion: The diffusion-based paradigm provides a robust framework for correspondence estimation across different registration tasks, overcoming limitations of previous single-step approaches and handling challenging scenarios like scale inconsistencies and large deformations.

Abstract: Establishing reliable correspondences is crucial for all registration tasks, including 2D image registration, 3D point cloud registration, and 2D-3D image-to-point cloud registration. However, these tasks are often complicated by challenges such as scale inconsistencies, symmetry, and large deformations, which can lead to ambiguous matches. Previous feature-based and correspondence-based methods typically rely on geometric or semantic features to generate or polish initial potential correspondences. Some methods leverage specific geometric priors, such as topological preservation, to devise strategies tailored to a given enhancement goal, but such priors cannot be exhaustively enumerated. Additionally, many previous approaches rely on a single-step prediction head, which can struggle with local minima in complex matching scenarios. To address these challenges, we introduce an innovative paradigm that leverages a diffusion model in matrix space for robust matching matrix estimation. Our model treats correspondence estimation as a denoising diffusion process in the matching matrix space, gradually refining the intermediate matching matrix to the optimal one. Specifically, we apply the diffusion model in the doubly stochastic matrix space for 3D-3D and 2D-3D registration tasks. In the 2D image registration task, we deploy the diffusion model in a matrix subspace where dual-softmax projection regularization is applied. For all three registration tasks, we provide adaptive matching matrix embedding implementations tailored to the specific characteristics of each task while maintaining a consistent “match-to-warp” encoding pattern. Furthermore, we adopt a lightweight design for the denoising module. At inference, once points or image features are extracted and fixed, this module performs multi-step denoising predictions through reverse sampling.

[234] LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?

Bangyan Li, Wenxuan Huang, Zhenkun Gao, Yeqiang Wang, Yunhang Shen, Jingzhong Lin, Ling You, Yuxiang Shen, Shaohui Lin, Wanli Ouyang, Yuling Sun

Main category: cs.CV

TL;DR: LLaVA-RadZ is a framework that enhances MLLMs for zero-shot medical disease recognition by addressing their limitations in processing fine-grained medical images through feature alignment and domain knowledge integration.

Motivation: MLLMs underperform in zero-shot medical disease recognition due to inadequate exploitation of fine-grained medical image features and available medical knowledge, despite their potential in feature representation.

Method: Proposes LLaVA-RadZ with Decoding-Side Feature Alignment Training (DFAT) for end-to-end training using modality-specific tokens, and a Domain Knowledge Anchoring Module (DKAM) to leverage intrinsic medical knowledge and reduce semantic gaps.

Result: LLaVA-RadZ significantly outperforms traditional MLLMs in zero-shot disease recognition and achieves performance comparable to established CLIP-based approaches.

Conclusion: The framework effectively bridges the gap in MLLMs’ capability for fine-grained medical tasks, demonstrating that with proper feature alignment and knowledge integration, MLLMs can compete with specialized models like CLIP in medical imaging.

Abstract: Recently, Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in visual understanding and reasoning across various vision-language tasks. However, we found that MLLMs cannot effectively process fine-grained medical image data in the traditional Visual Question Answering (VQA) pipeline, as they do not fully exploit the captured features and available medical knowledge, which results in MLLMs usually performing poorly in zero-shot medical disease recognition. Fortunately, this limitation does not indicate that MLLMs are fundamentally incapable of addressing fine-grained recognition tasks. From a feature representation perspective, MLLMs demonstrate considerable potential for tackling such challenging problems. Thus, to address this challenge, we propose LLaVA-RadZ, a simple yet effective framework for zero-shot medical disease recognition that utilizes existing MLLM features. Specifically, we design an end-to-end training strategy, termed Decoding-Side Feature Alignment Training (DFAT), to take advantage of the characteristics of the MLLM decoder architecture and incorporate modality-specific tokens tailored for different modalities. Additionally, we introduce a Domain Knowledge Anchoring Module (DKAM) to exploit the intrinsic medical knowledge of large models, which mitigates the category semantic gap in image-text alignment. Extensive experiments demonstrate that our LLaVA-RadZ significantly outperforms traditional MLLMs in zero-shot disease recognition, achieving performance comparable to well-established and highly-optimized CLIP-based approaches.

[235] Parameter-Efficient Adaptation of Geospatial Foundation Models through Embedding Deflection

Romain Thoreau, Valerio Marsocci, Dawa Derksen

Main category: cs.CV

TL;DR: DEFLECT is a parameter-efficient adaptation method for Geospatial Foundation Models that enhances spectral information in multispectral satellite imagery using minimal additional parameters.

DetailsMotivation: GFMs pretrained on RGB imagery carry strong spatial priors that can be leveraged to adapt to multispectral satellite data with very few parameters, addressing the need for low-cost adaptation of foundation models to heterogeneous geospatial data.

Method: DEFLECT (Deflecting Embeddings for Finetuning Latent representations for Earth and Climate Tasks) incorporates inductive biases to adapt GFMs to multispectral imagery by enhancing spectral information while maintaining spatial structure.

Result: DEFLECT achieves on-par or higher accuracy than competing methods with 5-10× fewer parameters across three GFMs and five datasets for classification and segmentation tasks.

Conclusion: The method effectively enhances spectral representation capabilities for geoscience tasks while maintaining parameter efficiency, demonstrating the value of incorporating stronger inductive biases in foundation model adaptation.

Abstract: As large-scale heterogeneous data sets become increasingly available, adapting foundation models at low cost has become a key issue. Seminal works in natural language processing, e.g. Low-Rank Adaptation (LoRA), leverage the low “intrinsic rank” of parameter updates during adaptation. In this paper, we argue that incorporating stronger inductive biases in both data and models can enhance the adaptation of Geospatial Foundation Models (GFMs), pretrained on RGB satellite images, to other types of optical satellite data. Specifically, the pretrained parameters of GFMs serve as a strong prior for the spatial structure of multispectral images. For this reason, we introduce DEFLECT (Deflecting Embeddings for Finetuning Latent representations for Earth and Climate Tasks), a novel strategy for adapting GFMs to multispectral satellite imagery with very few additional parameters. DEFLECT improves the representation capabilities of the extracted features, particularly enhancing spectral information, which is essential for geoscience and environment-related tasks. We demonstrate the effectiveness of our method across three different GFMs and five diverse datasets, ranging from forest monitoring to marine environment segmentation. Compared to competing methods, DEFLECT achieves on-par or higher accuracy with 5-10$\times$ fewer parameters for classification and segmentation tasks. The code will be made publicly available.

[236] Autoregressive Image Generation with Randomized Parallel Decoding

Haopeng Li, Jinyue Yang, Guoqi Li, Huan Wang

Main category: cs.CV

TL;DR: ARPG is a novel visual autoregressive model that enables randomized parallel generation by decoupling positional guidance from content representation, achieving significant speedup and memory reduction while supporting zero-shot inference tasks.

DetailsMotivation: Conventional raster-order autoregressive models have limitations in inference efficiency and zero-shot generalization due to their sequential, predefined token generation order.

Method: Proposes a decoupled decoding framework that separates positional guidance (queries) from content representation (key-value pairs), incorporating guidance directly into causal attention to enable random-order training and generation without bidirectional attention.
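
A minimal sketch of the decoupled-decoding idea, where queries carry only target positions and the KV cache carries already-generated content, so several randomly chosen positions can be predicted in parallel. The dimensions and positional-embedding scheme below are illustrative assumptions, not the paper’s exact design.

```python
import torch
import torch.nn.functional as F

d = 64
W_q = torch.nn.Linear(d, d)  # maps target-position embeddings to queries
W_k = torch.nn.Linear(d, d)  # maps already-generated tokens to keys
W_v = torch.nn.Linear(d, d)

def predict_positions(pos_emb_targets, generated_tokens):
    """Queries encode *where* to predict; the KV cache encodes *what* exists."""
    q = W_q(pos_emb_targets)        # (n_targets, d)
    k = W_k(generated_tokens)       # (n_ctx, d)
    v = W_v(generated_tokens)
    attn = F.softmax(q @ k.T / d ** 0.5, dim=-1)
    return attn @ v                 # features used to decode the target tokens

ctx = torch.randn(4, d)      # 4 tokens generated so far (shared KV cache)
targets = torch.randn(2, d)  # 2 randomly chosen positions, queried at once
print(predict_positions(targets, ctx).shape)  # torch.Size([2, 64])
```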

Result: Achieves an FID of 1.83 on the ImageNet-1K 256 benchmark with only 32 sampling steps, delivering over a 30x speedup in inference and a 75% memory reduction compared to similar-scale autoregressive models.

Conclusion: ARPG enables efficient parallel generation and zero-shot generalization to tasks like inpainting, outpainting, and resolution expansion through its novel decoupled decoding approach.

Abstract: We introduce ARPG, a novel visual autoregressive model that enables randomized parallel generation, addressing the inherent limitations of conventional raster-order approaches, which hinder inference efficiency and zero-shot generalization due to their sequential, predefined token generation order. Our key insight is that effective random-order modeling necessitates explicit guidance for determining the position of the next predicted token. To this end, we propose a novel decoupled decoding framework that decouples positional guidance from content representation, encoding them separately as queries and key-value pairs. By directly incorporating this guidance into the causal attention mechanism, our approach enables fully random-order training and generation, eliminating the need for bidirectional attention. Consequently, ARPG readily generalizes to zero-shot inference tasks such as image inpainting, outpainting, and resolution expansion. Furthermore, it supports parallel inference by concurrently processing multiple queries using a shared KV cache. On the ImageNet-1K 256 benchmark, our approach attains an FID of 1.83 with only 32 sampling steps, achieving over a 30 times speedup in inference and a 75 percent reduction in memory consumption compared to representative recent autoregressive models at a similar scale.

[237] Radar-Guided Polynomial Fitting for Metric Depth Estimation

Patrick Rim, Hyoungseob Park, Vadim Ezhov, Jeffrey Moon, Alex Wong

Main category: cs.CV

TL;DR: POLAR introduces polynomial fitting with radar guidance to transform scaleless depth predictions from pretrained monocular depth estimation models into accurate metric depth maps, outperforming existing methods by significant margins.

DetailsMotivation: Existing monocular depth estimation models produce reasonable local depth structure but misalign regions relative to each other, making simple linear transformations insufficient for accurate metric depth estimation.

Method: Uses polynomial coefficients predicted from radar data to non-uniformly adjust depth predictions across ranges, with a training objective that enforces local monotonicity via first-derivative regularization to preserve structural consistency.
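
The polynomial correction and the first-derivative monotonicity penalty can be sketched directly; the polynomial degree, toy coefficients, and loss weighting below are assumptions, not the paper’s trained values.

```python
import torch

def apply_polynomial(rel_depth, coeffs):
    """Map scaleless depth to metric depth: sum_k c_k * d^k.
    In POLAR the coefficients would be predicted from radar."""
    powers = torch.stack([rel_depth ** k for k in range(len(coeffs))], dim=-1)
    return powers @ coeffs

def monotonicity_penalty(coeffs, n_samples=256):
    """Penalize a negative first derivative of the polynomial on [0, 1]."""
    d = torch.linspace(0, 1, n_samples)
    deriv = sum(k * coeffs[k] * d ** (k - 1) for k in range(1, len(coeffs)))
    return torch.relu(-deriv).mean()  # zero when the fit is locally monotone

rel = torch.rand(2, 1, 8, 8)              # scaleless MDE output in [0, 1]
c = torch.tensor([0.5, 40.0, -5.0, 2.0])  # toy cubic coefficients
metric_depth = apply_polynomial(rel, c)   # (2, 1, 8, 8) metric map
loss_mono = monotonicity_penalty(c)
```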

Result: Achieves state-of-the-art performance across three datasets, outperforming existing methods by 24.9% in MAE and 33.2% in RMSE, while maintaining superior efficiency in latency and computational cost.

Conclusion: POLAR’s polynomial fitting framework effectively corrects regional misalignments in monocular depth estimation, providing a more accurate and efficient solution than affine transformations for metric depth estimation.

Abstract: We propose POLAR, a novel radar-guided depth estimation method that introduces polynomial fitting to efficiently transform scaleless depth predictions from pretrained monocular depth estimation (MDE) models into metric depth maps. Unlike existing approaches that rely on complex architectures or expensive sensors, our method is grounded in a fundamental insight: although MDE models often infer reasonable local depth structure within each object or local region, they may misalign these regions relative to one another, making a linear scale and shift (affine) transformation insufficient given three or more of these regions. To address this limitation, we use polynomial coefficients predicted from cheap, ubiquitous radar data to adaptively adjust depth predictions non-uniformly across depth ranges. In this way, POLAR generalizes beyond affine transformations and is able to correct such misalignments by introducing inflection points. Importantly, our polynomial fitting framework preserves structural consistency through a novel training objective that enforces local monotonicity via first-derivative regularization. POLAR achieves state-of-the-art performance across three datasets, outperforming existing methods by an average of 24.9% in MAE and 33.2% in RMSE, while also achieving state-of-the-art efficiency in terms of latency and computational cost.

[238] On the Perception Bottleneck of VLMs for Chart Understanding

Junteng Liu, Weihao Zeng, Xiwen Zhang, Yijun Wang, Zifei Shan, Junxian He

Main category: cs.CV

TL;DR: This paper identifies and addresses the perception bottleneck in large vision-language models (LVLMs) for chart understanding, decomposing it into vision encoder and extraction bottlenecks, and proposes a contrastive learning framework to enhance the visual encoder.

DetailsMotivation: Existing LVLMs face critical perception bottlenecks in chart understanding, where visual representations fail to encapsulate correct information and language models struggle to extract necessary information from visual representations.

Method: The study decomposes the perception bottleneck into two components, analyzes information richness in visual representations, and enhances the visual encoder using a contrastive learning framework to mitigate the vision encoder bottleneck.
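
The contrastive enhancement of the vision encoder presumably follows a standard two-tower alignment objective; the sketch below shows a generic symmetric InfoNCE loss under that assumption. The pairing scheme and temperature are illustrative, not the paper’s exact setup.

```python
import torch
import torch.nn.functional as F

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss; matched pairs sit on the diagonal."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature
    labels = torch.arange(len(img))
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))

# toy batch: 8 chart embeddings paired with 8 structured-text embeddings
loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
```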

Result: Empirical results show that visual representations contain substantially richer information than captured by linear extractors, and the proposed approach significantly mitigates the perception bottleneck, improving LVLMs’ chart comprehension capabilities.

Conclusion: The vision encoder remains a critical bottleneck in chart understanding, and enhancing it through contrastive learning effectively improves LVLMs’ perception capabilities, with code publicly available for further research.

Abstract: Chart understanding requires models to effectively analyze and reason about numerical data, textual elements, and complex visual components. Our observations reveal that the perception capabilities of existing large vision-language models (LVLMs) constitute a critical bottleneck in this process. In this study, we delve into this perception bottleneck by decomposing it into two components: the vision encoder bottleneck, where the visual representation may fail to encapsulate the correct information, and the extraction bottleneck, where the language model struggles to extract the necessary information from the provided visual representations. Through comprehensive experiments, we find that (1) the information embedded within visual representations is substantially richer than what is typically captured by linear extractors, such as the widely used retrieval accuracy metric; (2) while instruction tuning effectively enhances the extraction capability of LVLMs, the vision encoder remains a critical bottleneck, demanding focused attention and improvement. Therefore, we further enhance the visual encoder to mitigate the vision encoder bottleneck under a contrastive learning framework. Empirical results demonstrate that our approach significantly mitigates the perception bottleneck and improves the ability of LVLMs to comprehend charts. Code is publicly available at https://github.com/hkust-nlp/Vision4Chart.

[239] Improving Brain Disorder Diagnosis with Advanced Brain Function Representation and Kolmogorov-Arnold Networks

Tyler Ward, Abdullah-Al-Zubaer Imran

Main category: cs.CV

TL;DR: A novel transformer-based classification network (ABFR-KAN) using Kolmogorov-Arnold Network blocks instead of traditional MLPs to improve autism spectrum disorder diagnosis by addressing atlas selection bias in functional connectivity quantification.

DetailsMotivation: Traditional functional connectivity quantification relies on pre-defined brain atlases, which can lead to selection bias and lack of specificity, particularly problematic for diagnosing brain disorders like autism spectrum disorder.

Method: Proposed ABFR-KAN network that replaces traditional multi-layer perceptron components with Kolmogorov-Arnold Network blocks in a transformer-based architecture to create effective brain function representations.
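
To show what replacing an MLP with a KAN block means in practice, here is a minimal KAN-style layer in which each input-output edge applies a learnable univariate function (a small radial-basis expansion here; real KANs typically use splines). All sizes are illustrative, not the paper’s configuration.

```python
import torch
import torch.nn as nn

class TinyKANLayer(nn.Module):
    def __init__(self, d_in, d_out, n_basis=8):
        super().__init__()
        self.register_buffer('centers', torch.linspace(-2, 2, n_basis))
        # one learnable weight per (output, input, basis function) edge
        self.coef = nn.Parameter(torch.randn(d_out, d_in, n_basis) * 0.1)

    def forward(self, x):                 # x: (B, d_in)
        # RBF features per input dimension: (B, d_in, n_basis)
        phi = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)
        # sum the learnable univariate functions over incoming edges
        return torch.einsum('bif,oif->bo', phi, self.coef)

layer = TinyKANLayer(16, 4)
print(layer(torch.randn(2, 16)).shape)    # torch.Size([2, 4])
```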

Result: Thorough experimentation shows ABFR-KAN effectively improves ASD diagnosis under various model architecture configurations.

Conclusion: The ABFR-KAN framework provides a promising approach for ASD diagnosis by addressing limitations of traditional atlas-based functional connectivity methods through innovative network architecture design.

Abstract: Quantifying functional connectivity (FC), a vital metric for the diagnosis of various brain disorders, traditionally relies on the use of a pre-defined brain atlas. However, using such atlases can lead to issues regarding selection bias and lack of regard for specificity. Addressing this, we propose a novel transformer-based classification network (ABFR-KAN) with effective brain function representation to aid in diagnosing autism spectrum disorder (ASD). ABFR-KAN leverages Kolmogorov-Arnold Network (KAN) blocks in place of traditional multi-layer perceptron (MLP) components. Thorough experimentation reveals the effectiveness of ABFR-KAN in improving the diagnosis of ASD under various configurations of the model architecture. Our code is available at https://github.com/tbwa233/ABFR-KAN

[240] VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment

Yogesh Kulkarni, Pooyan Fazli

Main category: cs.CV

TL;DR: VideoPASTA is a framework that enhances Video-LLMs through preference optimization using adversarial examples targeting spatial, temporal, and cross-frame relationships, achieving significant performance gains with minimal data.

DetailsMotivation: Video-language models struggle with spatial relationships, temporal ordering, and cross-frame continuity, limiting their ability to understand video content effectively.

Method: VideoPASTA trains models using Direct Preference Optimization with 7,020 preference pairs, where models learn to distinguish accurate video representations from adversarial examples that violate spatial, temporal, or cross-frame relationships. The approach uses 32-frame sampling and requires no human annotation.
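
Since training uses Direct Preference Optimization, the objective is well defined; below is a minimal sketch of the standard DPO loss over (preferred, adversarial) response pairs, with the video-conditioned log-prob computation abstracted away.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pref, logp_adv, ref_logp_pref, ref_logp_adv, beta=0.1):
    """Inputs: (B,) summed log-probs of the preferred and adversarial
    responses under the policy and under the frozen reference model."""
    margin = beta * ((logp_pref - ref_logp_pref) - (logp_adv - ref_logp_adv))
    return -F.logsigmoid(margin).mean()

# toy batch of 4 preference pairs
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```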

Result: VideoPASTA achieves performance gains of up to +3.8 percentage points on LongVideoBench, +4.1 on VideoMME, and +4.0 on MVBench when applied to various state-of-the-art Video-LLMs, demonstrating model-agnostic effectiveness.

Conclusion: Targeted preference alignment is an efficient alternative to massive pretraining or architectural modifications for addressing core video-language challenges, providing a scalable plug-and-play solution that preserves original model capabilities.

Abstract: Video-language models (Video-LLMs) excel at understanding video content but struggle with spatial relationships, temporal ordering, and cross-frame continuity. To address these limitations, we introduce VideoPASTA (Preference Alignment with Spatio-Temporal-Cross Frame Adversaries), a framework that enhances Video-LLMs through targeted preference optimization. VideoPASTA trains models to distinguish accurate video representations from carefully crafted adversarial examples that deliberately violate spatial, temporal, or cross-frame relationships. With only 7,020 preference pairs and Direct Preference Optimization, VideoPASTA enables models to learn robust representations that capture fine-grained spatial details and long-range temporal dynamics. Experiments demonstrate that VideoPASTA is model agnostic and significantly improves performance, for example, achieving gains of up to +3.8 percentage points on LongVideoBench, +4.1 on VideoMME, and +4.0 on MVBench, when applied to various state-of-the-art Video-LLMs. These results demonstrate that targeted alignment, rather than massive pretraining or architectural modifications, effectively addresses core video-language challenges. Notably, VideoPASTA achieves these improvements without any human annotation or captioning, relying solely on 32-frame sampling. This efficiency makes our approach a scalable plug-and-play solution that seamlessly integrates with existing models while preserving their original capabilities.

[241] Learning Flow-Guided Registration for RGB-Event Semantic Segmentation

Zhen Yao, Xiaowen Ying, Zhiyu Zhu, Mooi Choo Chuah

Main category: cs.CV

TL;DR: BRENet proposes a flow-guided bidirectional registration framework for RGB-Event segmentation, addressing spatiotemporal and modal misalignment issues by converting fusion problems into registration tasks using optical flow guidance.

DetailsMotivation: Traditional RGB-Event fusion approaches ignore intrinsic spatiotemporal and modal misalignment problems, making the fusion paradigm ill-posed for these asymmetric modalities.

Method: BRENet uses optical flow as coarse-grained guidance and event temporal features for fine-grained matching to generate bidirectional pixel pairings. It also introduces a Motion-Enhanced Event Tensor (MET) that transforms sparse event streams into dense, temporally coherent representations.

Result: Extensive experiments on four large-scale datasets validate the approach, establishing flow-guided registration as a promising direction for RGB-Event segmentation.

Conclusion: Recasting RGB-Event segmentation from fusion to registration with flow guidance effectively bridges modality gaps and handles motion lag by converting it into flow estimation error terms.

Abstract: Event cameras capture microsecond-level motion cues that complement RGB sensors. However, the prevailing paradigm of treating RGB-Event perception as a fusion problem is ill-posed, as it ignores the intrinsic (i) Spatiotemporal and (ii) Modal Misalignment, unlike other RGB-X sensing domains. To tackle these limitations, we recast RGB-Event segmentation from fusion to registration. We propose BRENet, a novel flow-guided bidirectional framework that adaptively matches correspondence between the asymmetric modalities. Specifically, it leverages temporally aligned optical flows as a coarse-grained guide, along with fine-grained event temporal features, to generate precise forward and backward pixel pairings for registration. This pairing mechanism converts the inherent motion lag into terms governed by flow estimation error, bridging modality gaps. Moreover, we introduce Motion-Enhanced Event Tensor (MET), a new representation that transforms sparse event streams into a dense, temporally coherent form. Extensive experiments on four large-scale datasets validate our approach, establishing flow-guided registration as a promising direction for RGB-Event segmentation. Our code is available at: https://github.com/zyaocoder/BRENet.

[242] Automated Visual Attention Detection using Mobile Eye Tracking in Behavioral Classroom Studies

Efe Bozkir, Christian Kosel, Tina Seidel, Enkelejda Kasneci

Main category: cs.CV

TL;DR: Automated pipeline using face detection and recognition with mobile eye tracking to identify which students teachers focus on in classrooms, requiring minimal manual annotations.

DetailsMotivation: Teachers' visual attention distribution affects student engagement and achievement, but current methods using mobile eye tracking require extensive manual annotations.

Method: Combines state-of-the-art face detection models and face recognition feature embeddings with transfer learning, integrated with teachers’ gaze data from mobile eye trackers.
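
The final gaze-to-student assignment reduces to a point-in-box test between the eye tracker’s gaze sample and the recognized face boxes. A minimal sketch, with box format and identity source (the face-recognition model) assumed:

```python
def focused_student(gaze_xy, face_boxes):
    """gaze_xy: (x, y) in frame coordinates.
    face_boxes: {student_id: (x1, y1, x2, y2)} from detection + recognition.
    Returns the id whose box contains the gaze point, else None."""
    gx, gy = gaze_xy
    for student_id, (x1, y1, x2, y2) in face_boxes.items():
        if x1 <= gx <= x2 and y1 <= gy <= y2:
            return student_id
    return None

print(focused_student((320, 180), {"s01": (300, 150, 360, 220)}))  # -> "s01"
```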

Result: Achieved reasonable performance across four classroom setups, with U-shaped classrooms reaching ~0.7 accuracy and small classrooms ~0.9 accuracy.

Conclusion: The methodology offers a non-intrusive, low-annotation solution that could improve instructional strategies, classroom management, and teacher professional development.

Abstract: Teachers’ visual attention and its distribution across the students in classrooms can carry important implications for student engagement, achievement, and professional teacher training. Despite that, inferring where and on which students teachers focus is not trivial. Mobile eye tracking can provide vital help to solve this issue; however, the use of mobile eye tracking alone requires a significant amount of manual annotations. To address this limitation, we present an automated processing pipeline concept that requires minimal manually annotated data to recognize which student the teachers focus on. To this end, we utilize state-of-the-art face detection models and face recognition feature embeddings to train face recognition models with transfer learning in the classroom context and combine these models with the teachers’ gaze from mobile eye trackers. We evaluated our approach with data collected from four different classrooms, and our results show that it is possible to estimate the visually focused students with reasonable performance in all of our classroom setups, with U-shaped and small classrooms leading to the best results at accuracies of approximately 0.7 and 0.9, respectively. While we did not evaluate our method for teacher-student interactions and focused on the validity of the technical approach, our methodology does not require a vast amount of manually annotated data and offers a non-intrusive way of handling teachers’ visual attention; it could therefore help improve instructional strategies, enhance classroom management, and provide feedback for professional teacher development.

[243] Instance-aware Image Colorization with Controllable Textual Descriptions and Segmentation Masks

Yanru An, Ling Gui, Chunlei Cai, Tianxiao Ye, Jiangchao Yao, Guangtao Zhai, Qiang Hu, Xiaoyun Zhang

Main category: cs.CV

TL;DR: MT-Color is a diffusion-based method for instance-aware image colorization that addresses color bleeding and binding errors using pixel-level mask attention and instance mask guidance.

DetailsMotivation: Current image colorization models suffer from color bleeding, color binding errors, and lack instance-level colorization capabilities, limiting their precision and practical applications.

Method: Proposes MT-Color with: 1) Pixel-level mask attention mechanism using cross-attention to integrate latent and gray image features, 2) Instance mask and text guidance module with self-attention masks, 3) Multi-instance sampling strategy, and 4) GPT-color dataset created using large visual language models.
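
The pixel-level mask attention can be illustrated by constructing an additive attention mask that forbids attention across instance boundaries; the token layout and mask encoding below are assumptions for illustration.

```python
import torch

def instance_attention_mask(seg):
    """seg: (N,) instance id per spatial token.
    Returns an (N, N) additive mask: 0 within an instance, -inf across
    instances, to be added to attention logits before the softmax."""
    same = seg.unsqueeze(0) == seg.unsqueeze(1)
    mask = torch.zeros(len(seg), len(seg))
    mask[~same] = float('-inf')   # block cross-instance information flow
    return mask

seg_ids = torch.tensor([0, 0, 1, 1, 2])   # five tokens, three instances
print(instance_attention_mask(seg_ids))
```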

Result: Qualitative and quantitative experiments demonstrate superior performance compared to previous methods and datasets in instance-level colorization tasks.

Conclusion: MT-Color effectively solves color bleeding and binding issues while enabling precise instance-aware colorization, representing a significant advancement in image colorization technology.

Abstract: Recently, the application of deep learning in image colorization has received widespread attention. The maturation of diffusion models has further advanced the development of image colorization models. However, current mainstream image colorization models still face issues such as color bleeding and color binding errors, and cannot colorize images at the instance level. In this paper, we propose a diffusion-based colorization method, MT-Color, to achieve precise instance-aware colorization with user-provided guidance. To tackle the color bleeding issue, we design a pixel-level mask attention mechanism that integrates latent features and conditional gray image features through cross-attention. We use segmentation masks to construct cross-attention masks, preventing pixel information from exchanging between different instances. We also introduce an instance mask and text guidance module that extracts instance masks and text representations of each instance, which are then fused with latent features through self-attention, utilizing instance masks to form self-attention masks that prevent instance texts from guiding the colorization of other areas, thus mitigating color binding errors. Furthermore, we apply a multi-instance sampling strategy, which involves sampling each instance region separately and then fusing the results. Additionally, we have created a specialized dataset for instance-level colorization tasks, GPT-color, by leveraging large visual language models on existing image datasets. Qualitative and quantitative experiments show that our model and dataset outperform previous methods and datasets.

[244] CONSIGN: Conformal Segmentation Informed by Spatial Groupings via Decomposition

Bruno Viti, Elias Karabelas, Martin Holler

Main category: cs.CV

TL;DR: CONSIGN is a conformal prediction method that incorporates spatial correlations to improve uncertainty quantification in image segmentation, producing statistically valid prediction sets with error guarantees.

DetailsMotivation: Standard machine learning segmentation models produce heuristic confidence scores that lack rigorous uncertainty quantification. Direct application of conformal prediction ignores spatial correlations in images, leading to conservative and less interpretable uncertainty estimates.

Method: CONSIGN (Conformal Segmentation Informed by Spatial Groupings via Decomposition) uses spatial groupings and decomposition to incorporate spatial correlations into conformal prediction. It works with any pre-trained segmentation model that can generate multiple sample outputs.
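
CONSIGN builds on conformal prediction, so the calibration core is standard; the sketch below shows generic split-conformal thresholding with the usual finite-sample correction. The paper’s spatial grouping and decomposition are not reproduced here.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile with the (n + 1) finite-sample correction,
    giving a (1 - alpha) coverage guarantee under exchangeability."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, q, method='higher')

scores = np.random.rand(500)    # toy nonconformity scores (calibration split)
tau = conformal_threshold(scores)
# Prediction set: every label whose nonconformity score is <= tau
# at each pixel (or, in CONSIGN, each spatial group).
```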

Result: Evaluation across three medical imaging datasets and two COCO subsets using three different pre-trained models shows CONSIGN significantly outperforms two CP baselines across multiple metrics, improving uncertainty estimation quality by accounting for spatial structure.

Conclusion: Incorporating spatial correlations through CONSIGN significantly enhances uncertainty quantification in image segmentation, providing meaningful prediction sets with statistical guarantees while being compatible with existing segmentation models.

Abstract: Most machine learning-based image segmentation models produce pixel-wise confidence scores that represent the model’s predicted probability for each class label at every pixel. While this information can be particularly valuable in high-stakes domains such as medical imaging, these scores are heuristic in nature and do not constitute rigorous quantitative uncertainty estimates. Conformal prediction (CP) provides a principled framework for transforming heuristic confidence scores into statistically valid uncertainty estimates. However, applying CP directly to image segmentation ignores the spatial correlations between pixels, a fundamental characteristic of image data. This can result in overly conservative and less interpretable uncertainty estimates. To address this, we propose CONSIGN (Conformal Segmentation Informed by Spatial Groupings via Decomposition), a CP-based method that incorporates spatial correlations to improve uncertainty quantification in image segmentation. Our method generates meaningful prediction sets that come with user-specified, high-probability error guarantees. It is compatible with any pre-trained segmentation model capable of generating multiple sample outputs. We evaluate CONSIGN against two CP baselines across three medical imaging datasets and two COCO dataset subsets, using three different pre-trained segmentation models. Results demonstrate that accounting for spatial structure significantly improves performance across multiple metrics and enhances the quality of uncertainty estimates.

[245] MMaDA: Multimodal Large Diffusion Language Models

Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang

Main category: cs.CV

TL;DR: MMaDA is a unified multimodal diffusion foundation model that achieves state-of-the-art performance across textual reasoning, multimodal understanding, and text-to-image generation through three key innovations: unified architecture, mixed chain-of-thought fine-tuning, and UniGRPO RL algorithm.

DetailsMotivation: To bridge the gap between pretraining and post-training in unified diffusion architectures and create a comprehensive multimodal foundation model that can handle diverse tasks across different modalities without modality-specific components.

Method: Three key innovations: (1) Unified diffusion architecture with shared probabilistic formulation and modality-agnostic design, (2) Mixed long chain-of-thought fine-tuning strategy with unified CoT format across modalities, (3) UniGRPO - unified policy-gradient-based RL algorithm with diversified reward modeling for both reasoning and generation tasks.

Result: MMaDA-8B outperforms LLaMA-3-7B and Qwen2-7B in textual reasoning, surpasses Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation, demonstrating strong generalization capabilities.

Conclusion: MMaDA provides an effective framework for bridging pretraining and post-training in unified diffusion architectures, offering a comprehensive solution for multimodal foundation models with superior performance across diverse domains.

Abstract: We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model’s ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA’s effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA

[246] MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

Yang Shi, Huanqian Wang, Wulin Xie, Huanyao Zhang, Lijie Zhao, Yi-Fan Zhang, Xinfeng Li, Chaoyou Fu, Zhuoer Wen, Wenting Liu, Zhuoran Zhang, Xinlong Chen, Bohan Zeng, Sihan Yang, Yushuo Guan, Zhang Zhang, Liang Wang, Haoxuan Li, Zhouchen Lin, Yuanxing Zhang, Pengfei Wan, Haotian Wang, Wenjing Yang

Main category: cs.CV

TL;DR: The paper introduces MME-VideoOCR, a comprehensive benchmark for evaluating Multimodal Large Language Models (MLLMs) on video OCR tasks, revealing significant limitations in current models’ ability to handle video-specific challenges like motion blur and temporal reasoning.

DetailsMotivation: Current MLLMs perform well on static image OCR but struggle with video OCR due to challenges like motion blur, temporal variations, and visual effects. There's a need for better benchmarks to guide the development of practical MLLMs for video applications.

Method: Created MME-VideoOCR benchmark with 10 task categories, 25 individual tasks across 44 scenarios, using 1,464 videos with varying characteristics and 2,000 manually annotated QA pairs. Evaluated 18 state-of-the-art MLLMs on this benchmark.

Result: Even the best-performing model (Gemini-2.5 Pro) achieved only 73.7% accuracy. Models perform well on single-frame tasks but struggle with holistic video comprehension requiring spatio-temporal reasoning, cross-frame integration, and resistance to language prior bias.

Conclusion: Current MLLMs have significant limitations in video OCR, particularly for tasks requiring temporal understanding. High-resolution visual input and sufficient temporal coverage are crucial for reliable video OCR performance.

Abstract: Multimodal Large Language Models (MLLMs) have achieved considerable accuracy in Optical Character Recognition (OCR) from static images. However, their efficacy in video OCR is significantly diminished due to factors such as motion blur, temporal variations, and visual effects inherent in video content. To provide clearer guidance for training practical MLLMs, we introduce the MME-VideoOCR benchmark, which encompasses a comprehensive range of video OCR application scenarios. MME-VideoOCR features 10 task categories comprising 25 individual tasks and spans 44 diverse scenarios. These tasks extend beyond text recognition to incorporate deeper comprehension and reasoning of textual content within videos. The benchmark consists of 1,464 videos with varying resolutions, aspect ratios, and durations, along with 2,000 meticulously curated, manually annotated question-answer pairs. We evaluate 18 state-of-the-art MLLMs on MME-VideoOCR, revealing that even the best-performing model (Gemini-2.5 Pro) achieves an accuracy of only 73.7%. Fine-grained analysis indicates that while existing MLLMs demonstrate strong performance on tasks where relevant texts are contained within a single or few frames, they exhibit limited capability in effectively handling tasks that demand holistic video comprehension. These limitations are especially evident in scenarios that require spatio-temporal reasoning, cross-frame information integration, or resistance to language prior bias. Our findings also highlight the importance of high-resolution visual input and sufficient temporal coverage for reliable OCR in dynamic video scenarios.

[247] MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, Jiangmiao Pang

Main category: cs.CV

TL;DR: MMSI-Bench is a new benchmark for evaluating multi-image spatial intelligence in MLLMs, revealing significant performance gaps between current models (30-40% accuracy) and human performance (97%).

DetailsMotivation: Existing benchmarks only test single-image spatial relations, failing to assess the multi-image spatial reasoning needed for real-world MLLM deployments in complex physical environments.

Method: Created 1,000 challenging multiple-choice questions from 120,000+ images with carefully designed distractors and step-by-step reasoning processes. Evaluated 34 open-source and proprietary MLLMs and developed an automated error analysis pipeline to diagnose four dominant failure modes.

Result: Significant performance gap: strongest open-source model achieved ~30% accuracy, OpenAI’s o3 reasoning model reached 40%, while humans scored 97%. The benchmark revealed four main failure types: grounding errors, overlap-matching/scene-reconstruction errors, situation-transformation reasoning errors, and spatial-logic errors.

Conclusion: MMSI-Bench demonstrates the challenging nature of multi-image spatial intelligence and substantial room for improvement in current MLLMs. The benchmark and error analysis pipeline provide valuable insights for advancing spatial reasoning capabilities in multimodal AI systems.

Abstract: Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a step-by-step reasoning process. We conduct extensive experiments and thoroughly evaluate 34 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI’s o3 reasoning model reaches 40%, while humans score 97%. These results underscore the challenging nature of MMSI-Bench and the substantial headroom for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes, including (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors, offering valuable insights for advancing multi-image spatial intelligence. Project page: https://runsenxu.com/projects/MMSI_Bench .

[248] Beyond Quantity: Distribution-Aware Labeling for Visual Grounding

Yichi Zhang, Gongwei Chen, Jun Zhu, Jia Wan, Liqiang Nie

Main category: cs.CV

TL;DR: DAL is a distribution-aware labeling framework for visual grounding that improves data quality and coverage through dual-driven annotation and filtering, achieving state-of-the-art results.

DetailsMotivation: Manual annotation for visual grounding is costly and existing pseudo-labeling methods often overfit to biased distributions, generating noisy or redundant samples. Performance gains come more from effective distribution expansion than raw data volume.

Method: Proposes DAL framework with dual-driven annotation (closed-set for reliability, open-set for vocabulary enrichment) and OOD expression expansion, followed by consistency- and distribution-aware filtering to remove noise and rebalance underrepresented content.

Result: Extensive experiments on three benchmarks show DAL consistently outperforms strong baselines and achieves state-of-the-art results.

Conclusion: Distribution-aware labeling is critical for building scalable and robust visual grounding datasets, with DAL demonstrating superior performance through effective data quality and coverage management.

Abstract: Visual grounding requires large and diverse region-text pairs. However, manual annotation is costly and fixed vocabularies restrict scalability and generalization. Existing pseudo-labeling pipelines often overfit to biased distributions and generate noisy or redundant samples. Through our systematic analysis of data quality and distributional coverage, we find that performance gains come less from raw data volume and more from effective distribution expansion. Motivated by this insight, we propose DAL, a distribution-aware labeling framework for visual grounding. The proposed method first employs a dual-driven annotation module, where a closed-set path provides reliable pseudo labels and an open-set path enriches vocabulary and introduces novel concepts; meanwhile, it further performs explicit out-of-distribution (OOD) expression expansion to broaden semantic coverage. We then propose a consistency- and distribution-aware filtering module to discard noisy or redundant region-text pairs and rebalance underrepresented linguistic and visual content, thereby improving both data quality and training efficiency. Extensive experiments on three benchmarks demonstrate that our method consistently outperforms strong baselines and achieves state-of-the-art results, underscoring the critical role of distribution-aware labeling in building scalable and robust visual grounding datasets.

[249] ProstaTD: Bridging Surgical Triplet from Classification to Fully Supervised Detection

Yiliang Chen, Zhixi Li, Cheng Xu, Alex Qinyang Liu, Ruize Cui, Xuemiao Xu, Jeremy Yuen-Chun Teoh, Shengfeng He, Jing Qin

Main category: cs.CV

TL;DR: ProstaTD is a large-scale surgical triplet detection dataset addressing limitations of existing datasets by providing precise spatial bounding box annotations and temporal boundaries for robot-assisted prostatectomy procedures.

DetailsMotivation: Existing surgical triplet datasets like CholecT50 lack precise spatial bounding box annotations, making triplet classification insufficient for practical applications. Bounding boxes are essential for spatial context and improved model generalizability.

Method: Created ProstaTD dataset with 71,775 video frames and 196,490 annotated triplet instances from 21 surgeries across multiple institutions. Developed specialized labeling tools and evaluation toolkit. Annotation involved 60+ contributors including surgeons through iterative labeling and verification phases.

Result: ProstaTD is the largest and most diverse surgical triplet dataset with precise spatial and temporal boundaries, enabling full detection rather than simple classification.

Conclusion: ProstaTD provides a robust foundation for fair benchmarking in surgical triplet detection, moving the field from classification to full detection with precise spatial and temporal boundaries.

Abstract: Surgical triplet detection is a critical task in surgical video analysis. However, existing datasets like CholecT50 lack precise spatial bounding box annotations, rendering triplet classification at the image level insufficient for practical applications. The inclusion of bounding box annotations is essential to make this task meaningful, as they provide the spatial context necessary for accurate analysis and improved model generalizability. To address these shortcomings, we introduce ProstaTD, a large-scale, multi-institutional dataset for surgical triplet detection, developed from the technically demanding domain of robot-assisted prostatectomy. ProstaTD offers clinically defined temporal boundaries and high-precision bounding box annotations for each structured triplet activity. The dataset comprises 71,775 video frames and 196,490 annotated triplet instances, collected from 21 surgeries performed across multiple institutions, reflecting a broad range of surgical practices and intraoperative conditions. The annotation process was conducted under rigorous medical supervision and involved more than 60 contributors, including practicing surgeons and medically trained annotators, through multiple iterative phases of labeling and verification. To further facilitate future general-purpose surgical annotation, we developed two tailored labeling tools to improve efficiency and scalability in our annotation workflows. In addition, we created a surgical triplet detection evaluation toolkit that enables standardized and reproducible performance assessment across studies. ProstaTD is the largest and most diverse surgical triplet dataset to date, moving the field from simple classification to full detection with precise spatial and temporal boundaries and thereby providing a robust foundation for fair benchmarking.

[250] O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views

Lorenzo Mur-Labadia, Maria Santos-Villafranca, Jesus Bermudez-Cameo, Alejandro Perez-Yus, Ruben Martinez-Cantin, Jose J. Guerrero

Main category: cs.CV

TL;DR: O-MaMa introduces a novel cross-image segmentation approach treating it as mask matching, achieving state-of-the-art results on Ego-Exo4D benchmark with significant efficiency gains.

DetailsMotivation: Understanding the world from multiple perspectives is essential for intelligent systems, but segmenting common objects across different views remains challenging.

Method: Uses Mask-Context Encoder with DINOv2 features, Ego↔Exo Cross-Attention for multi-perspective fusion, Mask Matching contrastive loss, and Hard Negative Adjacent Mining strategy.
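
A plausible reading of the Mask Matching loss with Hard Negative Adjacent Mining is an InfoNCE-style objective in which an adjacent mask is appended as an extra negative per anchor; the sketch below is illustrative, with the temperature and adjacency selection assumed.

```python
import torch
import torch.nn.functional as F

def mask_match_loss(ego_feats, exo_feats, hard_neg_idx, temperature=0.1):
    """ego_feats, exo_feats: (B, D) embeddings of matched mask pairs.
    hard_neg_idx: (B,) index of a spatially adjacent exo mask per ego mask."""
    ego = F.normalize(ego_feats, dim=-1)
    exo = F.normalize(exo_feats, dim=-1)
    logits = ego @ exo.T / temperature             # in-batch negatives
    hard = (ego * exo[hard_neg_idx]).sum(-1, keepdim=True) / temperature
    logits = torch.cat([logits, hard], dim=1)      # append mined hard negative
    labels = torch.arange(len(ego))                # positives on the diagonal
    return F.cross_entropy(logits, labels)

loss = mask_match_loss(torch.randn(6, 256), torch.randn(6, 256),
                       torch.randint(0, 6, (6,)))
```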

Result: Achieves +22% and +76% relative gains in Ego2Exo and Exo2Ego IoU against baselines, and +13% and +6% improvement over SOTA with only 1% of training parameters.

Conclusion: The method effectively addresses cross-view object segmentation through mask matching and achieves superior performance with high efficiency.

Abstract: Understanding the world from multiple perspectives is essential for intelligent systems operating together, where segmenting common objects across different views remains an open problem. We introduce a new approach that re-defines cross-image segmentation by treating it as a mask matching task. Our method consists of: (1) A Mask-Context Encoder that pools dense DINOv2 semantic features to obtain discriminative object-level representations from FastSAM mask candidates, (2) an Ego$\leftrightarrow$Exo Cross-Attention that fuses multi-perspective observations, (3) a Mask Matching contrastive loss that aligns cross-view features in a shared latent space, and (4) a Hard Negative Adjacent Mining strategy to encourage the model to better differentiate between nearby objects. O-MaMa achieves the state of the art in the Ego-Exo4D Correspondences benchmark, obtaining relative gains of +22% and +76% in the Ego2Exo and Exo2Ego IoU against the official challenge baselines, and gains of +13% and +6% compared with the SOTA using only 1% of the training parameters.

[251] Why Settle for One? Text-to-ImageSet Generation and Evaluation

Chengyou Jia, Xin Shen, Zhuohang Dang, Changliang Xia, Weijia Wu, Xinyu Zhang, Hangwei Qian, Ivor W. Tsang, Minnan Luo

Main category: cs.CV

TL;DR: The paper proposes Text-to-ImageSet (T2IS) generation, a challenging problem of creating coherent image sets with diverse consistency requirements based on user instructions. It introduces T2IS-Bench benchmark, T2IS-Eval evaluation framework, and AutoT2IS training-free method that outperforms existing approaches.

DetailsMotivation: Existing consistent image generation methods are too domain-specific and lack generalizability for broader applications requiring diverse consistency in image sets.

Method: Proposes AutoT2IS framework that leverages pretrained Diffusion Transformers’ in-context capabilities to harmonize visual elements for both image-level prompt alignment and set-level visual consistency, without requiring training.

Result: Extensive experiments on T2IS-Bench show AutoT2IS significantly outperforms current generalized and specialized methods across diverse consistency challenges.

Conclusion: The method enables numerous underexplored real-world applications and demonstrates substantial practical value in generating coherent image sets with diverse consistency requirements.

Abstract: Despite remarkable progress in Text-to-Image models, many real-world applications require generating coherent image sets with diverse consistency requirements. Existing consistent methods often focus on a specific domain with specific aspects of consistency, which significantly constrains their generalizability to broader applications. In this paper, we propose a more challenging problem, Text-to-ImageSet (T2IS) generation, which aims to generate sets of images that meet various consistency requirements based on user instructions. To systematically study this problem, we first introduce $\textbf{T2IS-Bench}$ with 596 diverse instructions across 26 subcategories, providing comprehensive coverage for T2IS generation. Building on this, we propose $\textbf{T2IS-Eval}$, an evaluation framework that transforms user instructions into multifaceted assessment criteria and employs effective evaluators to adaptively assess consistency fulfillment between criteria and generated sets. Subsequently, we propose $\textbf{AutoT2IS}$, a training-free framework that maximally leverages pretrained Diffusion Transformers’ in-context capabilities to harmonize visual elements to satisfy both image-level prompt alignment and set-level visual consistency. Extensive experiments on T2IS-Bench reveal that diverse consistency challenges all existing methods, while our AutoT2IS significantly outperforms current generalized and even specialized approaches. Our method also demonstrates the ability to enable numerous underexplored real-world applications, confirming its substantial practical value. Visit our project in https://chengyou-jia.github.io/T2IS-Home.

[252] 3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering

Rongtao Xu, Han Gao, Mingming Yu, Dong An, Shunpeng Chen, Changwei Wang, Li Guo, Xiaodan Liang, Shibiao Xu

Main category: cs.CV

TL;DR: 3D-MoRe is a novel framework that generates large-scale 3D-language datasets using foundational models, achieving state-of-the-art performance on ScanQA and ScanRefer tasks.

DetailsMotivation: There is a growing need for diverse and scalable data in indoor scene tasks like question answering and dense captioning, which current datasets struggle to provide.

Method: The framework integrates multi-modal embedding, cross-modal interaction, and language model decoder to process natural language instructions and 3D scene data. It uses ScanNet dataset with text annotations from ScanQA and ScanRefer, employing data augmentation and semantic filtering.

Result: Generated 62,000 QA pairs and 73,000 object descriptions across 1,513 scenes. Outperformed SOTA baselines with CIDEr score improvements of 2.15% on ScanQA and 1.84% on ScanRefer.

Conclusion: 3D-MoRe effectively addresses the data scarcity problem in 3D scene understanding tasks and demonstrates superior performance, with code and datasets being publicly released.

Abstract: With the growing need for diverse and scalable data in indoor scene tasks, such as question answering and dense captioning, we propose 3D-MoRe, a novel paradigm designed to generate large-scale 3D-language datasets by leveraging the strengths of foundational models. The framework integrates key components, including multi-modal embedding, cross-modal interaction, and a language model decoder, to process natural language instructions and 3D scene data. This approach facilitates enhanced reasoning and response generation in complex 3D environments. Using the ScanNet 3D scene dataset, along with text annotations from ScanQA and ScanRefer, 3D-MoRe generates 62,000 question-answer (QA) pairs and 73,000 object descriptions across 1,513 scenes. We also employ various data augmentation techniques and implement semantic filtering to ensure high-quality data. Experiments on ScanQA demonstrate that 3D-MoRe significantly outperforms state-of-the-art baselines, with the CIDEr score improving by 2.15%. Similarly, on ScanRefer, our approach achieves a notable increase in CIDEr@0.5 by 1.84%, highlighting its effectiveness in both tasks. Our code and generated datasets will be publicly released to benefit the community, and both can be accessed at https://3D-MoRe.github.io.

[253] NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining

Maksim Kuprashevich, Grigorii Alekseenko, Irina Tolstykh, Georgii Fedorov, Bulat Suleimanov, Vladimir Dokholyan, Aleksandr Gordeev

Main category: cs.CV

TL;DR: An automated pipeline for mining high-quality image editing triplets (original image, instruction, edited image) using generative models and validation, enabling large-scale training data creation without human labeling.

DetailsMotivation: Training image editing assistants requires millions of high-quality triplets, but manual creation is difficult due to the need for pixel accuracy, style coherence, physical plausibility, and visual appeal. Automated methods lack robust quality metrics.

Method: Modular pipeline using public generative models with a task-tuned Gemini validator to score instruction adherence and aesthetics directly, without segmentation or grounding models. Uses inversion and compositional bootstrapping to expand the dataset by 2.6x.

Result: Created NHR-Edit dataset of 720k high-quality triplets through millions of guided generations and validator passes. The approach surpasses all public alternatives in cross-dataset evaluation and produced Bagel-NHR-Edit model with state-of-the-art metrics.

Conclusion: The automated pipeline enables large-scale high-fidelity training data creation without human labeling, democratizing research in resource-intensive image editing by providing an open dataset and framework for computational effort estimation.

Abstract: Recent advances in generative modeling enable image editing assistants that follow natural language instructions without additional user input. Their supervised training requires millions of triplets (original image, instruction, edited image), yet mining pixel-accurate examples is hard. Each edit must affect only prompt-specified regions, preserve stylistic coherence, respect physical plausibility, and retain visual appeal. The lack of robust automated edit-quality metrics hinders reliable automation at scale. We present an automated, modular pipeline that mines high-fidelity triplets across domains, resolutions, instruction complexities, and styles. Built on public generative models and running without human intervention, our system uses a task-tuned Gemini validator to score instruction adherence and aesthetics directly, removing any need for segmentation or grounding models. Inversion and compositional bootstrapping enlarge the mined set by approx. 2.6x, enabling large-scale high-fidelity training data. By automating the most repetitive annotation steps, the approach allows a new scale of training without human labeling effort. To democratize research in this resource-intensive area, we release NHR-Edit, an open dataset of 720k high-quality triplets, curated at industrial scale via millions of guided generations and validator passes, and we analyze the pipeline’s stage-wise survival rates, providing a framework for estimating computational effort across different model stacks. In the largest cross-dataset evaluation, it surpasses all public alternatives. We also release Bagel-NHR-Edit, a fine-tuned Bagel model with state-of-the-art metrics.

[254] Conditional Video Generation for High-Efficiency Video Compression

Fangqiu Yi, Jingyu Xu, Jiawei Shao, Chi Zhang, Xuelong Li

Main category: cs.CV

TL;DR: A video compression framework using conditional diffusion models for perceptually optimized reconstruction, outperforming traditional and neural codecs on perceptual quality metrics.

DetailsMotivation: Conditional diffusion models excel at reconstructing video content aligned with human visual perception, making them suitable for perceptually optimized video compression.

Method: Reframes video compression as conditional generation with three key modules: multi-granular conditioning for static/dynamic cues, compact representations for efficient transmission, and multi-condition training with modality dropout and role-aware embeddings.
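
Of the three modules, modality dropout is the easiest to make concrete: during training each conditioning signal is independently replaced by a null condition, while guaranteeing at least one survives. The condition names and drop probability below are assumptions for illustration.

```python
import random
import torch

def dropout_conditions(conds, p_drop=0.3, null_token=None):
    """conds: dict of conditioning signals (e.g. keyframes, flow, layout)."""
    kept = {k: (v if random.random() > p_drop else null_token)
            for k, v in conds.items()}
    if all(v is null_token for v in kept.values()):  # keep at least one signal
        name = random.choice(list(conds))
        kept[name] = conds[name]
    return kept

conds = {"keyframe": torch.randn(1, 3, 64, 64),
         "flow": torch.randn(1, 2, 64, 64)}
print({k: type(v).__name__ for k, v in dropout_conditions(conds).items()})
```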

Result: Significantly outperforms both traditional and neural codecs on perceptual quality metrics (FVD, LPIPS), especially under high compression ratios.

Conclusion: Conditional diffusion models provide an effective framework for perceptually optimized video compression that maintains semantic richness while achieving high compression efficiency.

Abstract: Perceptual studies demonstrate that conditional diffusion models excel at reconstructing video content aligned with human visual perception. Building on this insight, we propose a video compression framework that leverages conditional diffusion models for perceptually optimized reconstruction. Specifically, we reframe video compression as a conditional generation task, where a generative model synthesizes video from sparse, yet informative signals. Our approach introduces three key modules: (1) Multi-granular conditioning that captures both static scene structure and dynamic spatio-temporal cues; (2) Compact representations designed for efficient transmission without sacrificing semantic richness; (3) Multi-condition training with modality dropout and role-aware embeddings, which prevent over-reliance on any single modality and enhance robustness. Extensive experiments show that our method significantly outperforms both traditional and neural codecs on perceptual quality metrics such as Fréchet Video Distance (FVD) and LPIPS, especially under high compression ratios.

[255] GeMix: Conditional GAN-Based Mixup for Improved Medical Image Augmentation

Hugo Carlesso, Maria Eliza Patulea, Moncef Garouani, Radu Tudor Ionescu, Josiane Mothe

Main category: cs.CV

TL;DR: GeMix is a two-stage framework that replaces naive pixel-wise mixup with learned, label-aware interpolation using class-conditional GANs to generate more realistic medical images for better classification performance.

DetailsMotivation: Traditional mixup produces unrealistic images that hinder learning in high-stakes medical applications like COVID-19 detection, where visual coherence and semantic fidelity are crucial.

Method: Uses StyleGAN2-ADA generator trained on target dataset, samples label vectors from Dirichlet priors biased toward different classes, blends them via Beta-distributed coefficient, and conditions generator on soft labels to synthesize coherent images along continuous class manifold.
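
The label-blending step is fully specified by the description and can be sketched directly: two Dirichlet draws biased toward different classes, mixed with a Beta-distributed coefficient to form the soft label that conditions the generator. The concentration and Beta parameters below are illustrative choices.

```python
import numpy as np

def gemix_soft_label(n_classes, class_a, class_b, bias=5.0, beta_ab=0.4):
    alpha = np.ones(n_classes); alpha[class_a] = bias
    y_a = np.random.dirichlet(alpha)           # label vector biased to class_a
    alpha = np.ones(n_classes); alpha[class_b] = bias
    y_b = np.random.dirichlet(alpha)           # label vector biased to class_b
    lam = np.random.beta(beta_ab, beta_ab)     # Beta-distributed blend weight
    return lam * y_a + (1 - lam) * y_b         # soft label conditioning the GAN

print(gemix_soft_label(n_classes=3, class_a=0, class_b=2))
```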

Result: On COVIDx-CT-3 dataset with three backbones (ResNet-50, ResNet-101, EfficientNet-B0), GeMix combined with real data increases macro-F1 over traditional mixup for all backbones and reduces false negative rate for COVID-19 detection.

Conclusion: GeMix is a drop-in replacement for pixel-space mixup that delivers stronger regularization and greater semantic fidelity without disrupting existing training pipelines, making it suitable for medical imaging applications.

Abstract: Mixup has become a popular augmentation strategy for image classification, yet its naive pixel-wise interpolation often produces unrealistic images that can hinder learning, particularly in high-stakes medical applications. We propose GeMix, a two-stage framework that replaces heuristic blending with a learned, label-aware interpolation powered by class-conditional GANs. First, a StyleGAN2-ADA generator is trained on the target dataset. During augmentation, we sample two label vectors from Dirichlet priors biased toward different classes and blend them via a Beta-distributed coefficient. Then, we condition the generator on this soft label to synthesize visually coherent images that lie along a continuous class manifold. We benchmark GeMix on the large-scale COVIDx-CT-3 dataset using three backbones (ResNet-50, ResNet-101, EfficientNet-B0). When combined with real data, our method increases macro-F1 over traditional mixup for all backbones, reducing the false negative rate for COVID-19 detection. GeMix is thus a drop-in replacement for pixel-space mixup, delivering stronger regularization and greater semantic fidelity, without disrupting existing training pipelines. We publicly release our code at https://github.com/hugocarlesso/GeMix to foster reproducibility and further research.
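
The label-mixing step admits a compact illustration. The sketch below samples two Dirichlet label vectors biased toward different classes and blends them with a Beta coefficient; the generator handle `G` and all hyperparameter values are hypothetical stand-ins, not the paper's released code.

```python
# A minimal sketch of GeMix-style label-aware mixing (assumptions: bias and
# Beta parameters are illustrative; the paper conditions a trained
# StyleGAN2-ADA generator on the resulting soft label).
import numpy as np

def gemix_soft_label(num_classes: int, class_a: int, class_b: int,
                     bias: float = 5.0, beta: float = 2.0,
                     rng=np.random.default_rng()) -> np.ndarray:
    """Blend two Dirichlet-sampled label vectors with a Beta coefficient."""
    alpha_a = np.ones(num_classes); alpha_a[class_a] = bias  # bias prior to class A
    alpha_b = np.ones(num_classes); alpha_b[class_b] = bias  # bias prior to class B
    y_a = rng.dirichlet(alpha_a)
    y_b = rng.dirichlet(alpha_b)
    lam = rng.beta(beta, beta)               # Beta-distributed mixing coefficient
    return lam * y_a + (1.0 - lam) * y_b     # soft label on the class simplex

soft_label = gemix_soft_label(num_classes=3, class_a=0, class_b=2)
# x = G(z, soft_label)  # hypothetical: condition the trained generator on it
print(soft_label, soft_label.sum())          # sums to 1 by construction
```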

[256] CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts

Olaf Dünkel, Artur Jesslen, Jiahao Xie, Christian Theobalt, Christian Rupprecht, Adam Kortylewski

Main category: cs.CV

TL;DR: CNS-Bench is a benchmark for evaluating image classifier robustness to continuous, realistic nuisance shifts using diffusion models with LoRA adapters, enabling more nuanced OOD analysis than binary shifts.

Motivation: Current OOD robustness evaluation methods use simple synthetic corruptions or binary nuisance shifts that fail to capture real-world continuous nuisance variations, limiting comprehensive model assessment.

Method: Apply LoRA adapters to diffusion models to generate continuous nuisance shifts at various severity levels, with a filtering mechanism to address failure cases and ensure reliable benchmarking.

Result: Evaluation of 40+ classifiers shows model rankings change with different shifts and scales, and continuous assessment reveals failure points not detectable with binary shifts.

Conclusion: CNS-Bench provides a more comprehensive framework for OOD robustness evaluation by capturing continuous, realistic nuisance shifts, offering deeper insights into model failure modes and performance variations.

Abstract: An important challenge when using computer vision models in the real world is to evaluate their performance in potential out-of-distribution (OOD) scenarios. While simple synthetic corruptions are commonly applied to test OOD robustness, they often fail to capture nuisance shifts that occur in the real world. Recently, diffusion models have been applied to generate realistic images for benchmarking, but they are restricted to binary nuisance shifts. In this work, we introduce CNS-Bench, a Continuous Nuisance Shift Benchmark to quantify OOD robustness of image classifiers for continuous and realistic generative nuisance shifts. CNS-Bench allows generating a wide range of individual nuisance shifts in continuous severities by applying LoRA adapters to diffusion models. To address failure cases, we propose a filtering mechanism that outperforms previous methods, thereby enabling reliable benchmarking with generative models. With the proposed benchmark, we perform a large-scale study to evaluate the robustness of more than 40 classifiers under various nuisance shifts. Through carefully designed comparisons and analyses, we find that model rankings can change for varying shifts and shift scales, which cannot be captured when applying common binary shifts. Additionally, we show that evaluating the model performance on a continuous scale allows the identification of model failure points, providing a more nuanced understanding of model robustness. Project page including code and data: https://genintel.github.io/CNS.
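
One way to realize a continuous severity dial with a LoRA adapter is to scale the adapter's contribution, as sketched below; this wrapper and its rank are illustrative assumptions, not the CNS-Bench implementation.

```python
# A minimal sketch of continuous nuisance severity via scaled LoRA
# (assumptions: wrapping a single linear layer and rank 4 are illustrative;
# CNS-Bench applies adapters inside diffusion models).
import torch
import torch.nn as nn

class ScaledLoRALinear(nn.Module):
    """Wrap a frozen linear layer with a LoRA whose strength is a dial."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x, severity: float = 0.0):
        # severity in [0, 1] interpolates from the clean model to the full shift
        return self.base(x) + severity * (x @ self.A.T @ self.B.T)

layer = ScaledLoRALinear(nn.Linear(64, 64))
y_mild = layer(torch.randn(2, 64), severity=0.25)
y_strong = layer(torch.randn(2, 64), severity=1.0)
```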

[257] RIS-LAD: A Benchmark and Model for Referring Low-Altitude Drone Image Segmentation

Kai Ye, YingShi Luan, Zhudi Chen, Guangyue Meng, Pingyang Dai, Liujuan Cao

Main category: cs.CV

TL;DR: RIS-LAD is the first fine-grained Referring Image Segmentation benchmark for Low-Altitude Drone scenarios, addressing challenges like diverse viewpoints and high object density that existing methods struggle with.

Motivation: Existing RIS datasets and methods are designed for high-altitude static views and fail to handle unique characteristics of low-altitude drone imagery such as diverse viewpoints, high object density, and small cluttered scenes.

Method: Proposed Semantic-Aware Adaptive Reasoning Network (SAARN) with Category-Dominated Linguistic Enhancement (CDLE) for early visual-category alignment and Adaptive Reasoning Fusion Module (ARFM) for dynamic semantic cue selection across scales.

Result: RIS-LAD presents substantial challenges to state-of-the-art RIS algorithms, and SAARN demonstrates effectiveness in addressing the unique challenges of low-altitude drone scenarios.

Conclusion: The RIS-LAD benchmark fills a critical gap in drone vision-language understanding and the proposed SAARN model provides an effective solution for handling the specific challenges of low-altitude drone referring image segmentation.

Abstract: Referring Image Segmentation (RIS), which aims to segment specific objects based on natural language descriptions, plays an essential role in vision-language understanding. Despite its progress in remote sensing applications, RIS in Low-Altitude Drone (LAD) scenarios remains underexplored. Existing datasets and methods are typically designed for high-altitude and static-view imagery. They struggle to handle the unique characteristics of LAD views, such as diverse viewpoints and high object density. To fill this gap, we present RIS-LAD, the first fine-grained RIS benchmark tailored for LAD scenarios. This dataset comprises 13,871 carefully annotated image-text-mask triplets collected from realistic drone footage, with a focus on small, cluttered, and multi-viewpoint scenes. It highlights new challenges absent in previous benchmarks, such as category drift caused by tiny objects and object drift under crowded same-class objects. To tackle these issues, we propose the Semantic-Aware Adaptive Reasoning Network (SAARN). Rather than uniformly injecting all linguistic features, SAARN decomposes and routes semantic information to different stages of the network. Specifically, the Category-Dominated Linguistic Enhancement (CDLE) aligns visual features with object categories during early encoding, while the Adaptive Reasoning Fusion Module (ARFM) dynamically selects semantic cues across scales to improve reasoning in complex scenes. The experimental evaluation reveals that RIS-LAD presents substantial challenges to state-of-the-art RIS algorithms, and also demonstrates the effectiveness of our proposed model in addressing these challenges. The dataset and code will be publicly released soon at: https://github.com/AHideoKuzeA/RIS-LAD/.

[258] CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment

Shengzhu Yang, Jiawei Du, Shuai Lu, Weihang Zhang, Ningli Wang, Huiqi Li

Main category: cs.CV

TL;DR: CLIPin is a unified non-contrastive plug-in that enhances CLIP-style models by improving multimodal semantic alignment through stronger supervision and parameter-compromise shared projectors.

Motivation: Natural image-text datasets have weak semantic alignment while medical datasets have high correlation but low diversity, both hindering CLIP's ability to learn robust and generalizable representations.

Method: Proposes CLIPin - a non-contrastive plug-in with shared pre-projectors for image and text modalities that integrates contrastive and non-contrastive learning in a parameter-compromise manner.

Result: Extensive experiments on diverse downstream tasks demonstrate CLIPin’s effectiveness and generality as a plug-and-play component compatible with various contrastive frameworks.

Conclusion: CLIPin provides a unified solution to enhance multimodal alignment in CLIP-style architectures, improving robustness and generalizability across different dataset types.

Abstract: Large-scale natural image-text datasets, especially those automatically collected from the web, often suffer from loose semantic alignment due to weak supervision, while medical datasets tend to have high cross-modal correlation but low content diversity. These properties pose a common challenge for contrastive language-image pretraining (CLIP): they hinder the model’s ability to learn robust and generalizable representations. In this work, we propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures to improve multimodal semantic alignment, providing stronger supervision and enhancing alignment robustness. Furthermore, two shared pre-projectors are designed for image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning in a parameter-compromise manner. Extensive experiments on diverse downstream tasks demonstrate the effectiveness and generality of CLIPin as a plug-and-play component compatible with various contrastive frameworks. Code is available at https://github.com/T6Yang/CLIPin.

[259] Semantic-Aware Reconstruction Error for Detecting AI-Generated Images

Ju Yeon Kang, Jaehong Park, Semin Kim, Ji Won Yoon, Nam Soo Kim

Main category: cs.CV

TL;DR: Proposes SARE (Semantic-Aware Reconstruction Error) for AI-generated image detection by measuring semantic differences between images and their caption-guided reconstructions, achieving better generalization across unseen generative models.

Motivation: Existing detection methods overfit to training data and perform poorly on out-of-distribution fake images from unseen generative models, as they rely on model-specific artifacts rather than fundamental semantic differences.

Method: Introduces SARE to quantify semantic shifts during caption-guided reconstruction. Real images show noticeable semantic changes since captions can’t fully capture complex content, while fake images align closely with captions. Also adds fusion module using cross-attention to integrate SARE with backbone detector.

Result: Outperforms existing baselines on GenImage and ForenSynths benchmarks, demonstrating strong generalization across diverse generative models. Detailed analysis confirms caption guidance effectively enhances detection robustness.

Conclusion: SARE provides a robust and discriminative feature for fake image detection by focusing on semantic differences rather than model-specific artifacts, enabling better generalization to unseen generative models.

Abstract: Recently, AI-generated image detection has gained increasing attention, as the rapid advancement of image generation technologies has raised serious concerns about their potential misuse. While existing detection methods have achieved promising results, their performance often degrades significantly when facing fake images from unseen, out-of-distribution (OOD) generative models, since they primarily rely on model-specific artifacts and thus overfit to the models used for training. To address this limitation, we propose a novel representation, namely Semantic-Aware Reconstruction Error (SARE), that measures the semantic difference between an image and its caption-guided reconstruction. The key hypothesis behind SARE is that real images, whose captions often fail to fully capture their complex visual content, may undergo noticeable semantic shifts during the caption-guided reconstruction process. In contrast, fake images, which closely align with their captions, show minimal semantic changes. By quantifying these semantic shifts, SARE provides a robust and discriminative feature for detecting fake images across diverse generative models. Additionally, we introduce a fusion module that integrates SARE into the backbone detector via a cross-attention mechanism. Image features attend to semantic representations extracted from SARE, enabling the model to adaptively leverage semantic information. Experimental results demonstrate that the proposed method achieves strong generalization, outperforming existing baselines on benchmarks including GenImage and ForenSynths. We further validate the effectiveness of caption guidance through a detailed analysis of semantic shifts, confirming its ability to enhance detection robustness.
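
The SARE computation can be summarized as: caption the image, reconstruct under the caption, and measure the semantic gap. The sketch below assumes hypothetical `caption`, `reconstruct`, and `embed` handles for the captioner, the caption-guided reconstruction model, and a semantic encoder.

```python
# A minimal sketch of SARE-style scoring (assumptions: all three model
# handles are hypothetical interfaces; the paper's exact pipeline differs).
import torch
import torch.nn.functional as F

def sare_score(image, caption, reconstruct, embed) -> float:
    text = caption(image)                      # describe the image
    recon = reconstruct(image, text)           # caption-guided reconstruction
    e_orig, e_recon = embed(image), embed(recon)
    # larger semantic shift -> more likely a real image under the hypothesis
    return 1.0 - F.cosine_similarity(e_orig, e_recon, dim=-1).item()
```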

[260] D2-Mamba: Dual-Scale Fusion and Dual-Path Scanning with SSMs for Shadow Removal

Linhao Li, Boya Jin, Zizhe Li, Lanqing Guo, Hao Cheng, Bo Li, Yongfeng Dong

Main category: cs.CV

TL;DR: A novel Mamba-based network with dual-scale fusion and dual-path scanning for shadow removal that selectively propagates contextual information based on transformation similarity across regions.

Motivation: Shadow removal requires leveraging abundant information from non-shadow regions for guidance, but the transformation needed for shadowed areas differs significantly from well-lit regions, necessitating effective integration of non-local contextual cues and adaptive modeling of region-specific transformations.

Method: Proposes Dual-Scale Fusion Mamba Block (DFMB) to enhance multi-scale feature representation by fusing original features with low-resolution features, and Dual-Path Mamba Group (DPMG) that captures global features via horizontal scanning with mask-aware adaptive scanning strategy.

Result: Experimental results demonstrate that the method significantly outperforms existing state-of-the-art approaches on shadow removal benchmarks.

Conclusion: The proposed Mamba-based network with dual-scale fusion and dual-path scanning effectively addresses the challenges of shadow removal by selectively propagating contextual information and improving structural continuity.

Abstract: Shadow removal aims to restore images that are partially degraded by shadows, where the degradation is spatially localized and non-uniform. Unlike general restoration tasks that assume global degradation, shadow removal can leverage abundant information from non-shadow regions for guidance. However, the transformation required to correct shadowed areas often differs significantly from that of well-lit regions, making it challenging to apply uniform correction strategies. This necessitates the effective integration of non-local contextual cues and adaptive modeling of region-specific transformations. To this end, we propose a novel Mamba-based network featuring dual-scale fusion and dual-path scanning to selectively propagate contextual information based on transformation similarity across regions. Specifically, the proposed Dual-Scale Fusion Mamba Block (DFMB) enhances multi-scale feature representation by fusing original features with low-resolution features, effectively reducing boundary artifacts. The Dual-Path Mamba Group (DPMG) captures global features via horizontal scanning and incorporates a mask-aware adaptive scanning strategy, which improves structural continuity and fine-grained region modeling. Experimental results demonstrate that our method significantly outperforms existing state-of-the-art approaches on shadow removal benchmarks.

[261] Odo: Depth-Guided Diffusion for Identity-Preserving Body Reshaping

Siddharth Khandelwal, Sridhar Kamath, Arjun Jain

Main category: cs.CV

TL;DR: Odo is a diffusion-based method for realistic human body shape editing using a new large-scale dataset of 18,573 images across 1,523 subjects with diverse body shape variations.

Motivation: Human shape editing remains under-explored compared to pose editing, with current methods suffering from unrealistic body proportions, texture distortions, and background inconsistencies due to lack of proper training datasets.

Method: An end-to-end diffusion-based approach combining a frozen UNet to preserve appearance/background details with a ControlNet that guides shape transformation using target SMPL depth maps, guided by simple semantic attributes.

Result: Achieves per-vertex reconstruction error of 7.5mm (vs 13.6mm in baselines), producing realistic results that accurately match target shapes while preserving identity, clothing, and background.

Conclusion: The proposed method and dataset significantly advance human shape editing capabilities, outperforming prior approaches with more realistic and accurate body reshaping.

Abstract: Human shape editing enables controllable transformation of a person’s body shape, such as thin, muscular, or overweight, while preserving pose, identity, clothing, and background. Unlike human pose editing, which has advanced rapidly, shape editing remains relatively under-explored. Current approaches typically rely on 3D morphable models or image warping, often introducing unrealistic body proportions, texture distortions, and background inconsistencies due to alignment errors and deformations. A key limitation is the lack of large-scale, publicly available datasets for training and evaluating body shape manipulation methods. In this work, we introduce the first large-scale dataset of 18,573 images across 1523 subjects, specifically designed for controlled human shape editing. It features diverse variations in body shape, including fat, muscular and thin, captured under consistent identity, clothing, and background conditions. Using this dataset, we propose Odo, an end-to-end diffusion-based method that enables realistic and intuitive body reshaping guided by simple semantic attributes. Our approach combines a frozen UNet that preserves fine-grained appearance and background details from the input image with a ControlNet that guides shape transformation using target SMPL depth maps. Extensive experiments demonstrate that our method outperforms prior approaches, achieving per-vertex reconstruction errors as low as 7.5mm, significantly lower than the 13.6mm observed in baseline methods, while producing realistic results that accurately match the desired target shapes.

[262] FastTracker: Real-Time and Accurate Visual Tracking

Hamidreza Hashempoor, Yu Dong Hwang

Main category: cs.CV

TL;DR: A generalized multi-object tracking framework that handles multiple object types with emphasis on vehicle tracking, featuring occlusion-aware re-identification and road-structure-aware tracklet refinement.

Motivation: Conventional MOT systems are limited to pedestrian tracking and lack generalization to other object categories like vehicles in complex traffic scenes.

Method: Two key components: (1) occlusion-aware re-identification mechanism for identity preservation of occluded objects, (2) road-structure-aware tracklet refinement using semantic scene priors (lane directions, crosswalks, road boundaries).

Result: Achieves robust performance on new vehicle-focused benchmark and public benchmarks, with HOTA scores of 66.4 on MOT17 and 65.7 on MOT20 test sets.

Conclusion: The framework demonstrates effectiveness in general-purpose object tracking while maintaining strong performance on conventional benchmarks.

Abstract: Conventional multi-object tracking (MOT) systems are predominantly designed for pedestrian tracking and often exhibit limited generalization to other object categories. This paper presents a generalized tracking framework capable of handling multiple object types, with a particular emphasis on vehicle tracking in complex traffic scenes. The proposed method incorporates two key components: (1) an occlusion-aware re-identification mechanism that enhances identity preservation for heavily occluded objects, and (2) a road-structure-aware tracklet refinement strategy that utilizes semantic scene priors such as lane directions, crosswalks, and road boundaries to improve trajectory continuity and accuracy. In addition, we introduce a new benchmark dataset comprising diverse vehicle classes with frame-level tracking annotations, specifically curated to support evaluation of vehicle-focused tracking methods. Extensive experimental results demonstrate that the proposed approach achieves robust performance on both the newly introduced dataset and several public benchmarks, highlighting its effectiveness in general-purpose object tracking. While our framework is designed for generalized multi-class tracking, it also achieves strong performance on conventional benchmarks, with HOTA scores of 66.4 on MOT17 and 65.7 on MOT20 test sets. Code and Benchmark are available: github.com/Hamidreza-Hashempoor/FastTracker, huggingface.co/datasets/Hamidreza-Hashemp/FastTracker-Benchmark.

[263] Multimodal Deep Learning for Phyllodes Tumor Classification from Ultrasound and Clinical Data

Farhan Fuad Abir, Abigail Elliott Daly, Kyle Anderman, Tolga Ozmen, Laura J. Brattain

Main category: cs.CV

TL;DR: A multimodal deep learning framework combining breast ultrasound images and clinical data improves classification of phyllodes tumors, outperforming unimodal methods and potentially reducing unnecessary surgeries.

Motivation: Phyllodes tumors are difficult to distinguish from benign fibroadenomas preoperatively, leading to unnecessary surgical excisions. Current radiological methods lack sufficient accuracy for reliable classification.

Method: Developed a dual-branch neural network that fuses features from ultrasound images and patient metadata from 81 subjects. Used class-aware sampling and subject-stratified 5-fold cross-validation to handle class imbalance and prevent data leakage.

Result: The multimodal approach outperformed unimodal baselines. ConvNeXt and ResNet18 achieved the best performance with AUC-ROC scores of 0.9427 and 0.9349, and F1-scores of 0.6720 and 0.7294 respectively for classifying benign vs borderline/malignant PTs.

Conclusion: Multimodal AI shows strong potential as a non-invasive diagnostic tool that can reduce unnecessary biopsies and improve clinical decision-making in breast tumor management.

Abstract: Phyllodes tumors (PTs) are rare fibroepithelial breast lesions that are difficult to classify preoperatively due to their radiological similarity to benign fibroadenomas. This often leads to unnecessary surgical excisions. To address this, we propose a multimodal deep learning framework that integrates breast ultrasound (BUS) images with structured clinical data to improve diagnostic accuracy. We developed a dual-branch neural network that extracts and fuses features from ultrasound images and patient metadata from 81 subjects with confirmed PTs. Class-aware sampling and subject-stratified 5-fold cross-validation were applied to prevent class imbalance and data leakage. The results show that our proposed multimodal method outperforms unimodal baselines in classifying benign versus borderline/malignant PTs. Among six image encoders, ConvNeXt and ResNet18 achieved the best performance in the multimodal setting, with AUC-ROC scores of 0.9427 and 0.9349, and F1-scores of 0.6720 and 0.7294, respectively. This study demonstrates the potential of multimodal AI to serve as a non-invasive diagnostic tool, reducing unnecessary biopsies and improving clinical decision-making in breast tumor management.
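
A dual-branch fusion of image features and structured metadata can look like the following sketch; the ResNet18 backbone, feature widths, and number of clinical fields are illustrative assumptions (the paper evaluates six image encoders).

```python
# A minimal sketch of a dual-branch image+metadata classifier (assumptions:
# layer widths and num_clinical are placeholders, not the paper's settings).
import torch
import torch.nn as nn
from torchvision.models import resnet18

class DualBranchClassifier(nn.Module):
    def __init__(self, num_clinical: int, num_classes: int = 2):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()              # 512-d image features
        self.image_branch = backbone
        self.clinical_branch = nn.Sequential(    # encode structured metadata
            nn.Linear(num_clinical, 64), nn.ReLU(), nn.Linear(64, 64))
        self.head = nn.Sequential(               # fuse and classify
            nn.Linear(512 + 64, 128), nn.ReLU(), nn.Linear(128, num_classes))

    def forward(self, image, clinical):
        f_img = self.image_branch(image)
        f_cli = self.clinical_branch(clinical)
        return self.head(torch.cat([f_img, f_cli], dim=1))

model = DualBranchClassifier(num_clinical=10)
logits = model(torch.randn(4, 3, 224, 224), torch.randn(4, 10))
```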

[264] Multimodal Iterative RAG for Knowledge-Intensive Visual Question Answering

Changin Choi, Wonseok Lee, Jungmin Ko, Wonjong Rhee

Main category: cs.CV

TL;DR: MI-RAG is a Multimodal Iterative RAG framework that enhances knowledge-intensive visual question answering by using reasoning-guided multi-query retrieval and knowledge synthesis across iterations.

Motivation: Current MLLMs struggle with knowledge-intensive visual questions requiring external knowledge beyond image content. Conventional single-pass RAG frameworks often fail to gather sufficient knowledge for these complex tasks.

Method: Proposes MI-RAG framework with iterative reasoning: 1) Formulates reasoning-guided multi-queries to explore knowledge facets, 2) Performs joint search across heterogeneous knowledge bases, 3) Synthesizes retrieved knowledge to progressively deepen understanding.

Result: Experiments on Encyclopedic VQA, InfoSeek, and OK-VQA benchmarks show significant improvements in both retrieval recall and answer accuracy compared to conventional approaches.

Conclusion: MI-RAG establishes a scalable approach for compositional reasoning in knowledge-intensive VQA by iteratively enhancing retrieval through reasoning and knowledge synthesis.

Abstract: Recent advances in Multimodal Large Language Models~(MLLMs) have significantly enhanced the ability of these models in multimodal understanding and reasoning. However, the performance of MLLMs for knowledge-intensive visual questions, which require external knowledge beyond the visual content of an image, still remains limited. While Retrieval-Augmented Generation (RAG) has become a promising solution to provide models with external knowledge, its conventional single-pass framework often fails to gather sufficient knowledge. To overcome this limitation, we propose MI-RAG, a Multimodal Iterative RAG framework that leverages reasoning to enhance retrieval and incorporates knowledge synthesis to refine its understanding. At each iteration, the model formulates a reasoning-guided multi-query to explore multiple facets of knowledge. Subsequently, these queries drive a joint search across heterogeneous knowledge bases, retrieving diverse knowledge. This retrieved knowledge is then synthesized to enrich the reasoning record, progressively deepening the model’s understanding. Experiments on challenging benchmarks, including Encyclopedic VQA, InfoSeek, and OK-VQA, show that MI-RAG significantly improves both retrieval recall and answer accuracy, establishing a scalable approach for compositional reasoning in knowledge-intensive VQA.
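
The iterative loop reads naturally as pseudocode. In the sketch below, `mllm`, `retrieve`, and the knowledge-base handles are hypothetical interfaces standing in for the paper's components.

```python
# A minimal sketch of an MI-RAG-style iteration (assumptions: all interfaces
# are hypothetical; the paper's prompts and retrievers are not public here).
def mi_rag(question, image, mllm, retrieve, knowledge_bases, num_iters=3):
    reasoning_record = []                        # accumulated synthesized knowledge
    for _ in range(num_iters):
        # 1) reasoning-guided multi-query: which facets of knowledge to explore
        queries = mllm.propose_queries(question, image, reasoning_record)
        # 2) joint search across heterogeneous knowledge bases
        passages = [p for kb in knowledge_bases for q in queries
                    for p in retrieve(kb, q)]
        # 3) synthesize retrieved knowledge into the reasoning record
        reasoning_record.append(mllm.synthesize(question, image, passages))
    return mllm.answer(question, image, reasoning_record)
```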

[265] P3-SAM: Native 3D Part Segmentation

Changfeng Ma, Yang Li, Xinhao Yan, Jiachen Xu, Yunhan Yang, Chunshi Wang, Zibo Zhao, Yanwen Guo, Zhuo Chen, Chunchao Guo

Main category: cs.CV

TL;DR: P³-SAM is a native 3D point-promptable part segmentation model that fully automates 3D object segmentation into components, achieving state-of-the-art performance with strong robustness on complex objects.

Motivation: Current 3D segmentation methods have poor robustness with complex objects and cannot fully automate the process, limiting applications like 3D understanding, model reuse, and part generation.

Method: Inspired by SAM, P³-SAM consists of a feature extractor, multiple segmentation heads, and an IoU predictor for interactive segmentation. It includes an algorithm to automatically select and merge masks for part instance segmentation, trained on a dataset of 3.7 million models with segmentation labels.

Result: The method achieves precise segmentation results and strong robustness on any complex objects, attaining state-of-the-art performance in 3D part segmentation.

Conclusion: P³-SAM successfully addresses the limitations of current methods by providing a fully automated, robust solution for 3D object part segmentation that works effectively on complex objects.

Abstract: Segmenting 3D assets into their constituent parts is crucial for enhancing 3D understanding, facilitating model reuse, and supporting various applications such as part generation. However, current methods face limitations such as poor robustness when dealing with complex objects and cannot fully automate the process. In this paper, we propose a native 3D point-promptable part segmentation model termed P³-SAM, designed to fully automate the segmentation of any 3D objects into components. Inspired by SAM, P³-SAM consists of a feature extractor, multiple segmentation heads, and an IoU predictor, enabling interactive segmentation for users. We also propose an algorithm to automatically select and merge masks predicted by our model for part instance segmentation. Our model is trained on a newly built dataset containing nearly 3.7 million models with reasonable segmentation labels. Comparisons show that our method achieves precise segmentation results and strong robustness on any complex objects, attaining state-of-the-art performance. Our project page is available at https://murcherful.github.io/P3-SAM/.
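
The automatic mask selection-and-merging step might be approximated by a greedy, IoU-driven routine like the sketch below; this is one plausible reading under stated assumptions, not P³-SAM's actual algorithm.

```python
# A minimal sketch of greedy IoU-based mask merging (assumptions: the
# threshold and score-ordered greedy strategy are illustrative only).
import numpy as np

def merge_masks(masks: np.ndarray, scores: np.ndarray, iou_thresh=0.8):
    """masks: (M, N) boolean per-point masks; scores: predicted IoU per mask."""
    order, kept = np.argsort(-scores), []
    for i in order:                              # best-scored masks first
        merged = False
        for j, group in enumerate(kept):
            inter = (masks[i] & group).sum()
            union = (masks[i] | group).sum()
            if union and inter / union > iou_thresh:
                kept[j] = group | masks[i]       # merge near-duplicate parts
                merged = True
                break
        if not merged:
            kept.append(masks[i].copy())
    return kept                                  # one mask per part instance

parts = merge_masks(np.random.rand(6, 1000) > 0.5, np.random.rand(6))
```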

[266] Implicit Neural Representations of Intramyocardial Motion and Strain

Andrew Bell, Yan Kit Choi, Steffen E Petersen, Andrew King, Muhummad Sohaib Nazir, Alistair A Young

Main category: cs.CV

TL;DR: A method using implicit neural representations (INRs) to automatically quantify intramyocardial motion and strain from tagging MRI, achieving superior accuracy and speed compared to deep learning baselines.

Motivation: Automatic quantification of intramyocardial motion and strain from tagging MRI is important but challenging, requiring accurate and scalable analysis methods for large cardiac MRI datasets.

Method: Uses implicit neural representations (INRs) conditioned on learned latent codes to predict continuous left ventricular displacement without requiring inference-time optimization.

Result: Achieved best tracking accuracy (2.14 mm RMSE) and lowest combined error in global circumferential (2.86%) and radial (6.42%) strain on 452 UK Biobank test cases, while being ~380× faster than the most accurate baseline.

Conclusion: INR-based models are suitable for accurate and scalable analysis of myocardial strain in large cardiac MRI datasets.

Abstract: Automatic quantification of intramyocardial motion and strain from tagging MRI remains an important but challenging task. We propose a method using implicit neural representations (INRs), conditioned on learned latent codes, to predict continuous left ventricular (LV) displacement – without requiring inference-time optimisation. Evaluated on 452 UK Biobank test cases, our method achieved the best tracking accuracy (2.14 mm RMSE) and the lowest combined error in global circumferential (2.86%) and radial (6.42%) strain compared to three deep learning baselines. In addition, our method is ~380× faster than the most accurate baseline. These results highlight the suitability of INR-based models for accurate and scalable analysis of myocardial strain in large CMR datasets. The code can be found at https://github.com/andrewjackbell/Displacement-INR
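
A latent-conditioned INR for displacement is essentially a small coordinate MLP. The sketch below assumes illustrative layer sizes and a 2D in-plane displacement output, which may differ from the paper's network.

```python
# A minimal sketch of a latent-conditioned displacement INR (assumptions:
# hidden width, latent size, and output dimensionality are placeholders).
import torch
import torch.nn as nn

class DisplacementINR(nn.Module):
    def __init__(self, latent_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.SiLU(),   # (x, y, t) + latent
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 2))                           # 2D displacement

    def forward(self, coords, latent):
        # coords: (N, 3) continuous (x, y, t); latent: (N, latent_dim) per case
        return self.net(torch.cat([coords, latent], dim=-1))

inr = DisplacementINR()
disp = inr(torch.rand(1024, 3), torch.randn(1024, 64))  # query anywhere, any time
```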

[267] LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence

Zixin Yin, Xili Dai, Duomin Wang, Xianfang Zeng, Lionel M. Ni, Gang Yu, Heung-Yeung Shum

Main category: cs.CV

TL;DR: LazyDrag is a novel drag-based image editing method that eliminates implicit point matching by generating explicit correspondence maps, enabling stable full-strength inversion without test-time optimization while unifying precise geometric control with text guidance.

Motivation: Current drag-based editing methods rely on implicit point matching via attention, which compromises inversion strength and requires costly test-time optimization, limiting diffusion models' generative capabilities for high-fidelity inpainting and text-guided creation.

Method: LazyDrag generates explicit correspondence maps from user drag inputs as reliable references to boost attention control, enabling stable full-strength inversion process without test-time optimization for Multi-Modal Diffusion Transformers.

Result: LazyDrag outperforms baselines on DragBench in drag accuracy and perceptual quality, validated by VIEScore and human evaluation. It enables complex edits like opening a dog’s mouth with interior inpainting, generating new objects, and context-aware changes for ambiguous drags.

Conclusion: LazyDrag establishes new state-of-the-art performance and paves a new way for editing paradigms by unifying precise geometric control with text guidance, supporting multi-round workflows with simultaneous move and scale operations.

Abstract: The reliance on implicit point matching via attention has become a core bottleneck in drag-based editing, resulting in a fundamental compromise on weakened inversion strength and costly test-time optimization (TTO). This compromise severely limits the generative capabilities of diffusion models, suppressing high-fidelity inpainting and text-guided creation. In this paper, we introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers, which directly eliminates the reliance on implicit point matching. In concrete terms, our method generates an explicit correspondence map from user drag inputs as a reliable reference to boost the attention control. This reliable reference opens the potential for a stable full-strength inversion process, which is the first in the drag-based editing task. It obviates the necessity for TTO and unlocks the generative capability of models. Therefore, LazyDrag naturally unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach: opening the mouth of a dog and inpainting its interior, generating new objects like a "tennis ball", or for ambiguous drags, making context-aware changes like moving a hand into a pocket. Additionally, LazyDrag supports multi-round workflows with simultaneous move and scale operations. Evaluated on the DragBench, our method outperforms baselines in drag accuracy and perceptual quality, as validated by VIEScore and human evaluation. LazyDrag not only establishes new state-of-the-art performance, but also paves a new way to editing paradigms.

[268] Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions

Junhao Su, Yuanliang Wan, Junwei Yang, Hengyu Shi, Tianyang Han, Junfeng Luo, Yurui Qiu

Main category: cs.CV

TL;DR: Tool-augmented LLMs trained with supervised imitation or coarse-grained RL often fail to learn error diagnosis and repair. This paper proposes structured reflection to make error-to-repair paths explicit, controllable, and trainable.

Motivation: Current self-reflection practices rely on heuristic prompts or one-way reasoning, which is fragile in multi-turn interactions. After failures, models often repeat the same mistakes instead of learning proper error diagnosis and repair.

Method: Proposes structured reflection where agents produce short yet precise reflections: diagnose failures using evidence and propose correct follow-up calls. Uses DAPO and GSPO objectives with a tailored reward scheme, optimizing the Reflect-Call-Final strategy.

Result: Experiments on BFCL v3 and Tool-Reflection-Bench show large gains in multi-turn tool-call success and error recovery, with reduction of redundant calls.

Conclusion: Making reflection explicit and optimizing it directly improves reliability of tool interaction and offers a reproducible path for agents to learn from failure.

Abstract: Tool-augmented large language models (LLMs) are usually trained with supervised imitation or coarse-grained reinforcement learning that optimizes single tool calls. Current self-reflection practices rely on heuristic prompts or one-way reasoning: the model is urged to 'think more' instead of learning error diagnosis and repair. This is fragile in multi-turn interactions; after a failure the model often repeats the same mistake. We propose structured reflection, which turns the path from error to repair into an explicit, controllable, and trainable action. The agent produces a short yet precise reflection: it diagnoses the failure using evidence from the previous step and then proposes a correct, executable follow-up call. For training we combine DAPO and GSPO objectives with a reward scheme tailored to tool use, optimizing the stepwise strategy Reflect, then Call, then Final. To evaluate, we introduce Tool-Reflection-Bench, a lightweight benchmark that programmatically checks structural validity, executability, parameter correctness, and result consistency. Tasks are built as mini trajectories of erroneous call, reflection, and corrected call, with disjoint train and test splits. Experiments on BFCL v3 and Tool-Reflection-Bench show large gains in multi-turn tool-call success and error recovery, and a reduction of redundant calls. These results indicate that making reflection explicit and optimizing it directly improves the reliability of tool interaction and offers a reproducible path for agents to learn from failure.
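
The Reflect-Call-Final strategy can be pictured as a control loop. In this sketch, `llm` and `execute_tool` are hypothetical interfaces; the paper trains this behavior with DAPO/GSPO rewards rather than hand-coding it.

```python
# A minimal sketch of the Reflect -> Call -> Final control flow (assumptions:
# the interfaces and attribute names are illustrative stand-ins).
def reflect_call_final(task, llm, execute_tool, max_turns=5):
    history = []
    for _ in range(max_turns):
        call = llm.propose_call(task, history)           # tool name + arguments
        result = execute_tool(call)
        if not result.ok:
            # short, evidence-based reflection: diagnose, then repair the call
            reflection = llm.reflect(task, call, result.error)
            history.append(("reflect", reflection))
            continue                                     # retry with a fixed call
        history.append(("call", call, result))
        if llm.is_done(task, history):
            break
    return llm.final_answer(task, history)
```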

[269] MoCLIP-Lite: Efficient Video Recognition by Fusing CLIP with Motion Vectors

Binhua Huang, Ni Wang, Arjun Pakrashi, Soumyabrata Dev

Main category: cs.CV

TL;DR: MoCLIP-Lite is an efficient two-stream video recognition framework that combines frozen CLIP image features with lightweight motion vector features using only a tiny MLP head for fusion, achieving 89.2% accuracy on UCF101.

Motivation: State-of-the-art video action recognition models are computationally expensive and require extensive video pre-training, while CLIP offers powerful zero-shot capabilities for static images and motion vectors provide efficient temporal information from compressed videos.

Method: Two-stream late fusion framework combining features from a frozen CLIP image encoder with features from a lightweight supervised network trained on raw motion vectors. Both backbones remain frozen during training, with only a tiny MLP head being trained for fusion.

Result: Achieves 89.2% Top-1 accuracy on UCF101 dataset, significantly outperforming zero-shot (65.0%) and motion vector-only (66.5%) baselines.

Conclusion: Provides a highly efficient baseline for video understanding that effectively bridges the gap between large static models and low-cost motion cues, with extreme efficiency due to frozen backbones and minimal trainable parameters.

Abstract: Video action recognition is a fundamental task in computer vision, but state-of-the-art models are often computationally expensive and rely on extensive video pre-training. In parallel, large-scale vision-language models like Contrastive Language-Image Pre-training (CLIP) offer powerful zero-shot capabilities on static images, while motion vectors (MV) provide highly efficient temporal information directly from compressed video streams. To synergize the strengths of these paradigms, we propose MoCLIP-Lite, a simple yet powerful two-stream late fusion framework for efficient video recognition. Our approach combines features from a frozen CLIP image encoder with features from a lightweight, supervised network trained on raw MV. During fusion, both backbones are frozen, and only a tiny Multi-Layer Perceptron (MLP) head is trained, ensuring extreme efficiency. Through comprehensive experiments on the UCF101 dataset, our method achieves a remarkable 89.2% Top-1 accuracy, significantly outperforming strong zero-shot (65.0%) and MV-only (66.5%) baselines. Our work provides a new, highly efficient baseline for video understanding that effectively bridges the gap between large static models and dynamic, low-cost motion cues. Our code and models are available at https://github.com/microa/MoCLIP-Lite.
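
Since both backbones stay frozen, the only trainable piece is the fusion MLP. The sketch below assumes illustrative feature dimensions (512-d CLIP, 256-d motion-vector features) and 101 classes for UCF101.

```python
# A minimal sketch of MoCLIP-Lite-style late fusion (assumptions: feature
# dimensions and hidden width are placeholders; encoders are frozen upstream).
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    def __init__(self, clip_dim=512, mv_dim=256, num_classes=101):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim + mv_dim, 512), nn.ReLU(),
            nn.Linear(512, num_classes))

    def forward(self, clip_feat, mv_feat):
        # inputs come from frozen encoders; detach so no grads flow back
        return self.mlp(torch.cat([clip_feat.detach(), mv_feat.detach()], dim=1))

head = LateFusionHead()
logits = head(torch.randn(8, 512), torch.randn(8, 256))  # only head params train
```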

[270] Neurodynamics-Driven Coupled Neural P Systems for Multi-Focus Image Fusion

Bo Li, Yunkuo Lei, Tingting Bao, Yaxian Wang, Lingling Zhang, Jun Liu

Main category: cs.CV

TL;DR: The paper proposes ND-CNPFuse, a neurodynamics-driven coupled neural P system for multi-focus image fusion that generates high-quality decision maps by mapping images to interpretable spike matrices without post-processing.

Motivation: Traditional MFIF methods based on heuristic rules and deep learning black-box mechanisms struggle to generate decision maps with precise boundaries. The authors aim to overcome this limitation using biologically-inspired neural computation models.

Method: The method uses neurodynamics-driven coupled neural P (CNP) systems to analyze constraints between network parameters and input signals, preventing abnormal neuron firing. It maps source images to spike matrices and compares spike counts to generate accurate decision maps directly.

Result: ND-CNPFuse achieves state-of-the-art performance on four MFIF datasets (Lytro, MFFW, MFI-WHU, Real-MFF) without requiring post-processing steps.

Conclusion: The neurodynamics-driven approach provides an interpretable and effective solution for multi-focus image fusion, generating precise decision maps through biologically-inspired spike-based computation.

Abstract: Multi-focus image fusion (MFIF) is a crucial technique in image processing, with a key challenge being the generation of decision maps with precise boundaries. However, traditional methods based on heuristic rules and deep learning methods with black-box mechanisms are difficult to generate high-quality decision maps. To overcome this challenge, we introduce neurodynamics-driven coupled neural P (CNP) systems, which are third-generation neural computation models inspired by spiking mechanisms, to enhance the accuracy of decision maps. Specifically, we first conduct an in-depth analysis of the model’s neurodynamics to identify the constraints between the network parameters and the input signals. This solid analysis avoids abnormal continuous firing of neurons and ensures the model accurately distinguishes between focused and unfocused regions, generating high-quality decision maps for MFIF. Based on this analysis, we propose a Neurodynamics-Driven CNP Fusion model (ND-CNPFuse) tailored for the challenging MFIF task. Unlike current ideas of decision map generation, ND-CNPFuse distinguishes between focused and unfocused regions by mapping the source image into interpretable spike matrices. By comparing the number of spikes, an accurate decision map can be generated directly without any post-processing. Extensive experimental results show that ND-CNPFuse achieves new state-of-the-art performance on four classical MFIF datasets, including Lytro, MFFW, MFI-WHU, and Real-MFF. The code is available at https://github.com/MorvanLi/ND-CNPFuse.
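
Once spike matrices are available, the decision map is a pixel-wise spike-count comparison. The sketch below uses random stand-ins for the spike matrices the CNP systems would produce; the neurodynamics themselves are omitted.

```python
# A minimal sketch of decision-map generation by spike-count comparison
# (assumptions: Poisson samples stand in for actual CNP spike matrices).
import numpy as np

def decision_map_from_spikes(spikes_a: np.ndarray, spikes_b: np.ndarray):
    """Pixel-wise: pick the source whose neurons fired more (more in focus)."""
    return (spikes_a >= spikes_b).astype(np.float32)   # 1 -> take image A

def fuse(img_a, img_b, spikes_a, spikes_b):
    m = decision_map_from_spikes(spikes_a, spikes_b)[..., None]
    return m * img_a + (1.0 - m) * img_b               # no post-processing step

h, w = 64, 64
img_a, img_b = np.random.rand(h, w, 3), np.random.rand(h, w, 3)
fused = fuse(img_a, img_b, np.random.poisson(5, (h, w)),
             np.random.poisson(3, (h, w)))
```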

[271] TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng

Main category: cs.CV

TL;DR: TempSamp-R1 is a reinforcement fine-tuning framework that improves multimodal large language models for video temporal grounding by using off-policy supervision and non-linear soft advantage computation to overcome limitations of on-policy methods like GRPO.

Motivation: Existing reinforcement learning methods for video temporal grounding rely on on-policy sampling, which becomes inefficient and limited in large temporal search spaces, often failing to find accurate temporal solutions.

Method: TempSamp-R1 leverages ground-truth annotations as off-policy supervision for precise guidance, uses non-linear soft advantage computation to stabilize training, and employs a hybrid Chain-of-Thought training paradigm to support both CoT and non-CoT inference modes.

Result: TempSamp-R1 outperforms GRPO-based baselines, achieving state-of-the-art performance on Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%), with robust few-shot generalization capabilities.

Conclusion: The proposed TempSamp-R1 framework effectively addresses the limitations of on-policy reinforcement learning for video temporal grounding, demonstrating superior performance and generalization through off-policy supervision and advanced training techniques.

Abstract: This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code: https://github.com/HVision-NKU/TempSamp-R1
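
The paper's exact asymmetric transformation is not spelled out here, so the sketch below is only an assumed illustration of the idea: reshape positive and negative group-relative reward deviations with different non-linear gains to damp update variance.

```python
# A minimal sketch of an asymmetric, non-linear soft advantage (assumption:
# the tanh form and gain values are illustrative, not the paper's formula).
import torch

def soft_advantage(rewards: torch.Tensor, k_pos=1.0, k_neg=2.0):
    centered = rewards - rewards.mean()               # group-relative baseline
    pos = torch.tanh(k_pos * centered.clamp(min=0))   # compress large gains
    neg = torch.tanh(k_neg * centered.clamp(max=0))   # penalize misses harder
    return pos + neg

adv = soft_advantage(torch.tensor([0.1, 0.7, 0.9, 0.2]))
```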

[272] HazeFlow: Revisit Haze Physical Model as ODE and Non-Homogeneous Haze Generation for Real-World Dehazing

Junseong Shin, Seungwoo Chung, Yunjeong Yang, Tae Hyun Kim

Main category: cs.CV

TL;DR: HazeFlow is a novel ODE-based framework that reformulates atmospheric scattering model as an ordinary differential equation to improve real-world image dehazing performance with single-step inference, addressing domain gap issues through physics-grounded learning and realistic haze simulation.

Motivation: Current deep learning dehazing methods struggle with real-world generalization due to the lack of paired training data and the resulting domain gap. Traditional physics-based methods using the Atmospheric Scattering Model fail to handle complex real-world haze patterns effectively.

Method: Proposes HazeFlow - an ODE-based framework inspired by Rectified Flow that learns optimal ODE trajectories to map hazy to clean images. Introduces non-homogeneous haze generation using Markov Chain Brownian Motion to simulate realistic haze patterns for training.

Result: HazeFlow achieves state-of-the-art performance across various real-world dehazing benchmark datasets, demonstrating superior real-world adaptation compared to existing methods.

Conclusion: The ODE-based formulation and realistic haze simulation enable HazeFlow to effectively bridge the domain gap in real-world dehazing, providing a robust solution that outperforms current approaches while requiring only single inference step.

Abstract: Dehazing involves removing haze or fog from images to restore clarity and improve visibility by estimating atmospheric scattering effects. While deep learning methods show promise, the lack of paired real-world training data and the resulting domain gap hinder generalization to real-world scenarios. In this context, physics-grounded learning becomes crucial; however, traditional methods based on the Atmospheric Scattering Model (ASM) often fall short in handling real-world complexities and diverse haze patterns. To solve this problem, we propose HazeFlow, a novel ODE-based framework that reformulates ASM as an ordinary differential equation (ODE). Inspired by Rectified Flow (RF), HazeFlow learns an optimal ODE trajectory to map hazy images to clean ones, enhancing real-world dehazing performance with only a single inference step. Additionally, we introduce a non-homogeneous haze generation method using Markov Chain Brownian Motion (MCBM) to address the scarcity of paired real-world data. By simulating realistic haze patterns through MCBM, we enhance the adaptability of HazeFlow to diverse real-world scenarios. Through extensive experiments, we demonstrate that HazeFlow achieves state-of-the-art performance across various real-world dehazing benchmark datasets.
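
Rectified flow learns near-straight trajectories, which is what makes single-step inference plausible. The sketch below assumes a trained velocity network `v_theta` and a plain Euler step; HazeFlow's actual parameterization of the ASM-derived ODE may differ.

```python
# A minimal sketch of single-step rectified-flow dehazing (assumptions:
# `v_theta` is a hypothetical trained velocity network; treating the hazy
# image as the trajectory start at t=0 is an illustrative simplification).
import torch

@torch.no_grad()
def dehaze_one_step(hazy: torch.Tensor, v_theta) -> torch.Tensor:
    # Straight trajectories let one Euler step from t=0 (hazy) toward
    # t=1 (clean) suffice: x1 ~ x0 + v(x0, t=0).
    t = torch.zeros(hazy.shape[0], device=hazy.device)
    return hazy + v_theta(hazy, t)
```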

[273] Understanding-in-Generation: Reinforcing Generative Capability of Unified Model via Infusing Understanding into Generation

Yuanhuiyi Lyu, Chi Kit Wong, Chenfei Liao, Lutao Jiang, Xu Zheng, Zexin Lu, Linfeng Zhang, Xuming Hu

Main category: cs.CV

TL;DR: UiG is a novel reasoning framework that integrates understanding capabilities into the generation process for text-to-image models, using image editing as a bridge to enhance generation quality step by step.

Motivation: Existing Chain-of-Thought methods separate understanding and generation processes, limiting their ability to guide unified models in addressing generative deficiencies.

Method: Proposes Understanding-in-Generation (UiG) framework that uses image editing as a bridge to infuse understanding into generation. It verifies generated images, incorporates model understanding into editing instructions, and enhances images step by step.

Result: Significant performance improvement in text-to-image generation, achieving 3.92% gain on long prompt setting of TIIF benchmark compared to existing methods.

Conclusion: UiG effectively integrates understanding capabilities into the generation process, demonstrating superior performance over traditional reasoning methods for unified text-to-image models.

Abstract: Recent works have made notable advancements in enhancing unified models for text-to-image generation through the Chain-of-Thought (CoT). However, these reasoning methods separate the processes of understanding and generation, which limits their ability to guide the reasoning of unified models in addressing the deficiencies of their generative capabilities. To this end, we propose a novel reasoning framework for unified models, Understanding-in-Generation (UiG), which harnesses the robust understanding capabilities of unified models to reinforce their performance in image generation. The core insight of our UiG is to integrate generative guidance by the strong understanding capabilities during the reasoning process, thereby mitigating the limitations of generative abilities. To achieve this, we introduce “Image Editing” as a bridge to infuse understanding into the generation process. Initially, we verify the generated image and incorporate the understanding of unified models into the editing instructions. Subsequently, we enhance the generated image step by step, gradually infusing the understanding into the generation process. Our UiG framework demonstrates a significant performance improvement in text-to-image generation over existing text-to-image reasoning methods, e.g., a 3.92% gain on the long prompt setting of the TIIF benchmark. The project code: https://github.com/QC-LY/UiG
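
The verify-then-edit loop is easy to sketch; the `unified_model` interface with generate/understand/edit methods below is a hypothetical stand-in for the unified model's actual API.

```python
# A minimal sketch of the UiG verify-then-edit loop (assumptions: the
# interface and `critique` attribute names are illustrative stand-ins).
def understanding_in_generation(prompt, unified_model, max_rounds=3):
    image = unified_model.generate(prompt)
    for _ in range(max_rounds):
        critique = unified_model.understand(image, prompt)   # verify vs. prompt
        if critique.satisfied:
            break
        # fold the model's understanding into an editing instruction
        image = unified_model.edit(image, critique.instruction)
    return image
```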

[274] StrCGAN: A Generative Framework for Stellar Image Restoration

Shantanusinh Parmar

Main category: cs.CV

TL;DR: StrCGAN is a generative model that enhances low-resolution astrophotography images by reconstructing high-fidelity representations of celestial objects using 3D convolutions, multi-spectral fusion, and astrophysical regularization.

Motivation: Traditional models like CycleGAN are limited to 2D mappings and often distort star and galaxy morphology when enhancing low-resolution telescope images. The goal is to create physically consistent reconstructions that preserve celestial object structures.

Method: Extends CycleGAN with three innovations: 3D convolutional layers for volumetric spatial correlations, multi-spectral fusion to align optical and near-infrared domains, and astrophysical regularization modules to preserve stellar morphology. Uses ground-truth references from multi-mission all-sky surveys.

Result: StrCGAN generates reconstructions that are visually sharper and physically consistent, outperforming standard GAN models in astrophysical image enhancement tasks.

Conclusion: The proposed StrCGAN framework successfully overcomes limitations of traditional GANs by incorporating 3D spatial awareness, spectral alignment, and physical constraints, producing superior astrophysical image enhancements.

Abstract: We introduce StrCGAN (Stellar Cyclic GAN), a generative model designed to enhance low-resolution astrophotography images. Our goal is to reconstruct high-fidelity ground truth-like representations of celestial objects, a task that is challenging due to the limited resolution and quality of small-telescope observations such as the MobilTelesco dataset. Traditional models such as CycleGAN provide a foundation for image-to-image translation but are restricted to 2D mappings and often distort the morphology of stars and galaxies. To overcome these limitations, we extend the CycleGAN framework with three key innovations: 3D convolutional layers to capture volumetric spatial correlations, multi-spectral fusion to align optical and near-infrared (NIR) domains, and astrophysical regularization modules to preserve stellar morphology. Ground-truth references from multi-mission all-sky surveys spanning optical to NIR guide the training process, ensuring that reconstructions remain consistent across spectral bands. Together, these components allow StrCGAN to generate reconstructions that are not only visually sharper but also physically consistent, outperforming standard GAN models in the task of astrophysical image enhancement.

[275] Interpreting ResNet-based CLIP via Neuron-Attention Decomposition

Edmund Bu, Yossi Gandelsman

Main category: cs.CV

TL;DR: A novel technique for interpreting CLIP-ResNet neurons by decomposing their contributions through neuron-head pairs, enabling text-based interpretation and applications in semantic segmentation and dataset monitoring.

Motivation: The goal is to understand how individual neurons in CLIP-ResNet contribute to the model's output by analyzing computation paths and neuron-head interactions, enabling better interpretability.

Method: Analyze pairwise combinations of neurons and attention heads in CLIP’s attention-pooling layer, approximating neuron-head pairs as single directions in the embedding space and interpreting them through text associations.

Result: Found that only sparse neuron-head pairs significantly contribute to output, some represent sub-concepts of neurons, and successfully applied for training-free semantic segmentation and dataset distribution shift monitoring.

Conclusion: Examining individual computation paths reveals interpretable units in neural networks that can be effectively utilized for downstream tasks, demonstrating practical value beyond interpretation.

Abstract: We present a novel technique for interpreting the neurons in CLIP-ResNet by decomposing their contributions to the output into individual computation paths. More specifically, we analyze all pairwise combinations of neurons and the following attention heads of CLIP’s attention-pooling layer. We find that these neuron-head pairs can be approximated by a single direction in CLIP-ResNet’s image-text embedding space. Leveraging this insight, we interpret each neuron-head pair by associating it with text. Additionally, we find that only a sparse set of the neuron-head pairs have a significant contribution to the output value, and that some neuron-head pairs, while polysemantic, represent sub-concepts of their corresponding neurons. We use these observations for two applications. First, we employ the pairs for training-free semantic segmentation, outperforming previous methods for CLIP-ResNet. Second, we utilize the contributions of neuron-head pairs to monitor dataset distribution shifts. Our results demonstrate that examining individual computation paths in neural networks uncovers interpretable units, and that such units can be utilized for downstream tasks.
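
Interpreting a single direction in a joint image-text embedding space by its nearest text embeddings can be sketched as follows; here `direction` stands in for a neuron-head pair's approximated contribution, whose extraction from CLIP-ResNet internals is not reproduced.

```python
# A minimal sketch of text-based interpretation of an embedding direction
# (assumptions: random vectors stand in for real CLIP text embeddings).
import torch
import torch.nn.functional as F

def nearest_texts(direction, text_embeddings, texts, k=3):
    sims = F.cosine_similarity(direction[None, :], text_embeddings, dim=-1)
    top = sims.topk(k).indices
    return [texts[i] for i in top]

texts = ["a photo of grass", "a red car", "stripes", "water", "a dog"]
emb = F.normalize(torch.randn(len(texts), 512), dim=-1)
print(nearest_texts(F.normalize(torch.randn(512), dim=0), emb, texts))
```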

[276] OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving

Pei Liu, Hongliang Lu, Haichao Liu, Haipeng Liu, Xin Liu, Ruoyu Yao, Shengbo Eben Li, Jun Ma

Main category: cs.CV

TL;DR: OmniScene is a novel human-like framework for autonomous driving that integrates multi-view and temporal perception for holistic 4D scene understanding, using vision-language modeling and hierarchical fusion to achieve superior performance across various tasks.

Motivation: Current autonomous driving systems lack true 3D scene understanding, relying instead on depth-based 3D reconstruction. The goal is to develop human-like perception-understanding-action capabilities that can translate 2D observations into egocentric 3D scene understanding.

Method: Proposes OmniScene with OmniVLM (Vision-Language Model) integrating multi-view and temporal perception. Uses teacher-student architecture with knowledge distillation to embed textual representations into 3D instance features. Implements Hierarchical Fusion Strategy (HFS) to adaptively calibrate geometric and semantic features at multiple abstraction levels.

Result: Comprehensive evaluation on nuScenes dataset shows superior performance against over ten state-of-the-art models across perception, prediction, planning, and visual question answering tasks, establishing new benchmarks.

Conclusion: OmniScene successfully bridges the gap between human-like scene understanding and autonomous driving systems by integrating visual and textual modalities through hierarchical fusion, enabling more nuanced and effective exploitation of heterogeneous information for holistic 4D scene understanding.

Abstract: Human vision is capable of transforming two-dimensional observations into an egocentric three-dimensional scene understanding, which underpins the ability to translate complex scenes and exhibit adaptive behaviors. This capability, however, remains lacking in current autonomous driving systems, where mainstream approaches primarily rely on depth-based 3D reconstruction rather than true scene understanding. To address this limitation, we propose a novel human-like framework called OmniScene. First, we introduce the OmniScene Vision-Language Model (OmniVLM), a vision-language framework that integrates multi-view and temporal perception for holistic 4D scene understanding. Then, harnessing a teacher-student OmniVLM architecture and knowledge distillation, we embed textual representations into 3D instance features for semantic supervision, enriching feature learning, and explicitly capturing human-like attentional semantics. These feature representations are further aligned with human driving behaviors, forming a more human-like perception-understanding-action architecture. In addition, we propose a Hierarchical Fusion Strategy (HFS) to address imbalances in modality contributions during multimodal integration. Our approach adaptively calibrates the relative significance of geometric and semantic features at multiple abstraction levels, enabling the synergistic use of complementary cues from visual and textual modalities. This learnable dynamic fusion enables a more nuanced and effective exploitation of heterogeneous information. We evaluate OmniScene comprehensively on the nuScenes dataset, benchmarking it against over ten state-of-the-art models across various tasks. Our approach consistently achieves superior results, establishing new benchmarks in perception, prediction, planning, and visual question answering.
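
For intuition, here is a minimal gated-fusion sketch of the kind of adaptive calibration the Hierarchical Fusion Strategy describes, applied at a single abstraction level; the dimensions and gating rule are assumptions, not the paper's architecture.

```python
# Learnable gated fusion of geometric and semantic features (single-level
# sketch; HFS operates across multiple abstraction levels).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, geometric: torch.Tensor, semantic: torch.Tensor):
        g = self.gate(torch.cat([geometric, semantic], dim=-1))
        return g * geometric + (1.0 - g) * semantic  # learned per-feature balance

fused = GatedFusion(256)(torch.randn(2, 256), torch.randn(2, 256))
print(fused.shape)  # torch.Size([2, 256])
```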

[277] Does the Manipulation Process Matter? RITA: Reasoning Composite Image Manipulations via Reversely-Ordered Incremental-Transition Autoregression

Xuekang Zhu, Ji-Zhe Zhou, Kaiwen Feng, Chenfan Qu, Yunfei Wang, Liting Zhou, Jian Liu

Main category: cs.CV

TL;DR: RITA reformulates image manipulation localization as a conditional sequence prediction task, predicting manipulated regions layer-by-layer to model temporal dependencies and hierarchical structures of editing operations.

Motivation: Existing image manipulation localization methods are manipulation-process-agnostic, using one-shot prediction that causes dimensional collapse by compressing complex editing sequences into single binary masks, creating a fundamental mismatch with the task's intrinsic nature.

Method: Proposes RITA framework that predicts manipulated regions sequentially using each step’s prediction as condition for the next, explicitly modeling temporal dependencies and hierarchical structures. Constructs HSIM benchmark with multi-step manipulation data and introduces HSS metric for sequential order and hierarchical alignment evaluation.

Result: RITA achieves state-of-the-art performance on traditional benchmarks and provides a solid foundation for the hierarchical localization task, validating its potential as a general and effective paradigm.

Conclusion: The sequential prediction approach effectively addresses limitations of one-shot methods by modeling the inherent sequentiality and hierarchical characteristics of image manipulation processes.

Abstract: Image manipulations often entail a complex manipulation process, comprising a series of editing operations to create a deceptive image, exhibiting sequentiality and hierarchical characteristics. However, existing IML methods remain manipulation-process-agnostic, directly producing localization masks in a one-shot prediction paradigm without modeling the underlying editing steps. This one-shot paradigm compresses the high-dimensional compositional space into a single binary mask, inducing severe dimensional collapse, thereby creating a fundamental mismatch with the intrinsic nature of the IML task. To address this, we are the first to reformulate image manipulation localization as a conditional sequence prediction task, proposing the RITA framework. RITA predicts manipulated regions layer-by-layer in an ordered manner, using each step’s prediction as the condition for the next, thereby explicitly modeling temporal dependencies and hierarchical structures among editing operations. To enable training and evaluation, we synthesize multi-step manipulation data and construct a new benchmark HSIM. We further propose the HSS metric to assess sequential order and hierarchical alignment. Extensive experiments show RITA achieves SOTA on traditional benchmarks and provides a solid foundation for the novel hierarchical localization task, validating its potential as a general and effective paradigm. The code and dataset will be publicly available.
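
A minimal sketch of the conditional sequence formulation: each step's mask is predicted conditioned on the image and all earlier predictions, with a stop signal ending the sequence. `predict_layer` is a hypothetical stand-in for the trained model.

```python
# Layer-by-layer conditional mask prediction in the spirit of RITA;
# `predict_layer` stands in for the trained model and here emits dummies.
def predict_manipulation_sequence(image, predict_layer, max_layers=5):
    masks = []
    for _ in range(max_layers):
        mask, stop = predict_layer(image, masks)  # condition on earlier layers
        if stop:
            break
        masks.append(mask)
    return masks  # ordered sequence of per-step localization masks

def dummy_predict_layer(image, prev_masks):
    return f"mask_{len(prev_masks)}", len(prev_masks) >= 2  # stop after 2 layers

print(predict_manipulation_sequence("img", dummy_predict_layer))
# ['mask_0', 'mask_1']
```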

[278] Hyperspectral Adapter for Semantic Segmentation with Vision Foundation Models

Juana Valeria Hurtado, Rohit Mohan, Abhinav Valada

Main category: cs.CV

TL;DR: A novel hyperspectral adapter that leverages pretrained vision foundation models to achieve state-of-the-art semantic segmentation performance on hyperspectral imaging data for autonomous driving applications.

Motivation: Current HSI semantic segmentation methods underperform because they rely on architectures optimized for RGB inputs, while hyperspectral imaging captures rich spectral information that could enable robust robotic perception in challenging environments.

Method: Proposes a hyperspectral adapter with spectral transformer and spectrum-aware spatial prior module to extract spatial-spectral features, plus a modality-aware interaction block that integrates hyperspectral representations with frozen vision Transformer features through extraction and injection mechanisms.

Result: Extensive evaluations on three benchmark autonomous driving datasets demonstrate state-of-the-art semantic segmentation performance, outperforming both vision-based and hyperspectral segmentation methods.

Conclusion: The proposed architecture effectively bridges the gap between hyperspectral data and pretrained vision foundation models, enabling superior semantic segmentation performance for autonomous driving applications.

Abstract: Hyperspectral imaging (HSI) captures spatial information along with dense spectral measurements across numerous narrow wavelength bands. This rich spectral content has the potential to facilitate robust robotic perception, particularly in environments with complex material compositions, varying illumination, or other visually challenging conditions. However, current HSI semantic segmentation methods underperform due to their reliance on architectures and learning frameworks optimized for RGB inputs. In this work, we propose a novel hyperspectral adapter that leverages pretrained vision foundation models to effectively learn from hyperspectral data. Our architecture incorporates a spectral transformer and a spectrum-aware spatial prior module to extract rich spatial-spectral features. Additionally, we introduce a modality-aware interaction block that facilitates effective integration of hyperspectral representations and frozen vision Transformer features through dedicated extraction and injection mechanisms. Extensive evaluations on three benchmark autonomous driving datasets demonstrate that our architecture achieves state-of-the-art semantic segmentation performance while directly using HSI inputs, outperforming both vision-based and hyperspectral segmentation methods. We make the code available at https://hsi-adapter.cs.uni-freiburg.de.
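
For intuition on the "extraction and injection" idea, here is a minimal cross-attention sketch between a hyperspectral branch and a frozen ViT token stream; the module shapes and attention-based formulation are assumptions, not the paper's interaction block.

```python
# Sketch of extraction/injection between an HSI branch and frozen ViT tokens:
# extraction lets the HSI branch read ViT features, injection writes adapted
# HSI features back into the ViT stream. Shapes are illustrative assumptions.
import torch
import torch.nn as nn

class InteractionBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.extract = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.inject = nn.MultiheadAttention(dim, 4, batch_first=True)

    def forward(self, hsi_tokens, vit_tokens):
        h, _ = self.extract(hsi_tokens, vit_tokens, vit_tokens)  # HSI queries ViT
        hsi_tokens = hsi_tokens + h
        v, _ = self.inject(vit_tokens, hsi_tokens, hsi_tokens)   # ViT queries HSI
        return hsi_tokens, vit_tokens + v

hsi, vit = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
h_out, v_out = InteractionBlock(256)(hsi, vit)
```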

cs.AI

[279] An Approach to Checking Correctness for Agentic Systems

Thomas J Sheffler

Main category: cs.AI

TL;DR: A temporal expression language for monitoring AI agent behavior to detect errors in LLM-based agentic systems by analyzing execution traces rather than text matching.

Motivation: Current error-detection methods rely on fragile text matching of inputs/outputs, which fails due to natural language variability in LLM responses. A more robust approach is needed to verify system behavior independent of specific textual outputs.

Method: Uses temporal logic techniques from hardware verification to monitor execution traces of agent tool calls and state transitions. Provides assertions that capture correct behavioral patterns across multiple execution scenarios.

Result: When tested on a three-agent system, temporal assertions were satisfied with capable models but violated when smaller models were substituted, successfully detecting improper tool sequencing and failed coordination handoffs.

Conclusion: This approach provides effective behavioral regression testing for production agentic systems and a foundation for systematic monitoring of AI agent reliability in critical applications.

Abstract: This paper presents a temporal expression language for monitoring AI agent behavior, enabling systematic error-detection of LLM-based agentic systems that exhibit variable outputs due to stochastic generation processes. Drawing from temporal logic techniques used in hardware verification, this approach monitors execution traces of agent tool calls and state transitions to detect deviations from expected behavioral patterns. Current error-detection approaches rely primarily on text matching of inputs and outputs, which proves fragile due to the natural language variability inherent in LLM responses. The proposed method instead focuses on the sequence of agent actions – such as tool invocations and inter-agent communications – allowing verification of system behavior independent of specific textual outputs. The temporal expression language provides assertions that capture correct behavioral patterns across multiple execution scenarios. These assertions serve dual purposes: validating prompt engineering and guardrail effectiveness during development, and providing regression testing when agents are updated with new LLMs or modified logic. The approach is demonstrated using a three-agent system, where agents coordinate to solve multi-step reasoning tasks. When powered by large, capable models, all temporal assertions were satisfied across many test runs. However, when smaller models were substituted in two of the three agents, executions violated behavioral assertions, primarily due to improper tool sequencing and failed coordination handoffs. The temporal expressions successfully flagged these anomalies, demonstrating the method’s effectiveness for detecting behavioral regressions in production agentic systems. This approach provides a foundation for systematic monitoring of AI agent reliability as these systems become increasingly deployed in critical applications.
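
To make the idea concrete, here is a minimal sketch of trace-level temporal assertions; the operator names, event schema, and example trace are illustrative assumptions, not the paper's actual expression language.

```python
# Toy temporal assertions over an agent execution trace. The operators and
# the event schema are assumptions for illustration, not the paper's language.
from typing import Callable, Sequence

Event = dict  # e.g. {"agent": "solver", "tool": "calculator"}
Pred = Callable[[Event], bool]

def eventually(trace: Sequence[Event], p: Pred) -> bool:
    """Some event in the trace satisfies p."""
    return any(p(e) for e in trace)

def always(trace: Sequence[Event], p: Pred) -> bool:
    """Every event in the trace satisfies p."""
    return all(p(e) for e in trace)

def precedes(trace: Sequence[Event], first: Pred, then: Pred) -> bool:
    """No event satisfying `then` occurs before one satisfying `first`."""
    seen_first = False
    for e in trace:
        if then(e) and not seen_first:
            return False
        if first(e):
            seen_first = True
    return True

# Example: the solver must not emit a final answer before retrieval has run.
trace = [
    {"agent": "retriever", "tool": "search"},
    {"agent": "solver", "tool": "calculator"},
    {"agent": "solver", "tool": "final_answer"},
]
assert precedes(trace,
                first=lambda e: e["agent"] == "retriever",
                then=lambda e: e["tool"] == "final_answer")
```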

[280] LATTS: Locally Adaptive Test-Time Scaling

Theo Uscidda, Matthew Trager, Michael Kleinman, Aditya Chattopadhyay, Wei Xia, Stefano Soatto

Main category: cs.AI

TL;DR: LATTS proposes adaptive test-time scaling that allocates variable compute across generation steps based on local difficulty, achieving better accuracy-compute tradeoffs than uniform scaling methods.

Motivation: Existing verifier-based methods increase computation uniformly across all samples and generation steps without considering instance complexity, leading to inefficient resource use.

Method: LATTS employs a verifier-based acceptance criterion at each generation step to decide whether to resample, backtrack, restart, or stop generation, adjusting computational effort based on local difficulty.

Result: Empirical results show LATTS achieves significantly superior accuracy-compute tradeoffs compared to standard verifier-based methods.

Conclusion: Adaptive test-time scaling that considers local difficulty at each generation step provides more efficient resource allocation and better performance than uniform scaling approaches.

Abstract: One common strategy for improving the performance of Large Language Models (LLMs) on downstream tasks involves using a verifier model to either select the best answer from a pool of candidates or to steer the auto-regressive generation process towards better outputs. This class of methods typically results in improved accuracy at the cost of increased computation at test-time, a paradigm known as test-time scaling. However, most existing approaches increase computation uniformly across all samples and generation steps, without considering the complexity of individual instances, leading to inefficient resource use. We address this limitation by proposing an approach, called Locally Adaptive Test-Time Scaling (LATTS), that allocates variable compute across generation steps. Specifically, at each generation step, LATTS employs a verifier-based acceptance criterion to decide whether to resample, backtrack, restart, or stop the generation process. This criterion effectively adjusts the per-step computational effort based on a precise notion of local difficulty derived from the verifier model. Empirical results show that LATTS achieves significantly superior accuracy–compute tradeoffs compared to standard verifier-based methods.
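
A minimal sketch of the per-step accept/resample/backtrack control flow described above; `generate_step` and `verifier_score` are hypothetical stand-ins for a step-wise decoder and a learned verifier, and the threshold/budget logic is illustrative rather than the paper's exact criterion.

```python
# Locally adaptive decoding loop sketch (illustrative, not LATTS itself):
# spend extra samples only where the verifier is unhappy, backtrack when a
# region stays hard, and stop when the compute budget runs out.
import random

def generate_step(prefix, rng):
    return f"step_{len(prefix)}_{rng.randint(0, 9)}"  # stand-in decoder

def verifier_score(prefix, step, rng):
    return rng.random()  # stand-in verifier confidence in [0, 1]

def adaptive_decode(n_steps=6, accept=0.4, max_resamples=3, budget=100, seed=0):
    rng, prefix = random.Random(seed), []
    while len(prefix) < n_steps and budget > 0:
        for _ in range(max_resamples):      # resample: compute scales locally
            budget -= 1
            step = generate_step(prefix, rng)
            if verifier_score(prefix, step, rng) >= accept:
                prefix.append(step)         # accept the step
                break
        else:
            if prefix:
                prefix.pop()                # backtrack past a hard region
    return prefix

print(adaptive_decode())
```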

[281] Fairy: Interactive Mobile Assistant to Real-world Tasks via LMM-based Multi-agent

Jiazheng Sun, Te Yang, Jiayang Niu, Mingxuan Li, Yongyong Lu, Ruimeng Yang, Xin Peng

Main category: cs.AI

TL;DR: Fairy is an interactive multi-agent mobile assistant that addresses limitations of existing LMM-based GUI agents by enabling cross-app collaboration, interactive execution, and continual learning through a three-module architecture.

Motivation: Existing LMM-based mobile GUI agents struggle with real-world scenarios involving diverse app interfaces and evolving user needs. End-to-end methods fail on long-tail apps, and unilateral agent actions harm user experience.

Method: Fairy uses three core modules: (1) Global Task Planner for cross-app task decomposition, (2) App-Level Executor with dual loops for precise execution and user interaction using long/short-term memory, and (3) Self-Learner that consolidates experience into App Map and Tricks.

Result: Fairy with GPT-4o backbone outperforms previous state-of-the-art by improving user requirement completion by 33.7% and reducing redundant steps by 58.5% on the RealMobile-Eval benchmark.

Conclusion: Fairy demonstrates effective interaction and self-learning capabilities for mobile GUI agents, showing significant improvements in task completion efficiency and user experience.

Abstract: Large multi-modal models (LMMs) have advanced mobile GUI agents. However, existing methods struggle with real-world scenarios involving diverse app interfaces and evolving user needs. End-to-end methods relying on the model’s commonsense often fail on long-tail apps, and agents without user interaction act unilaterally, harming user experience. To address these limitations, we propose Fairy, an interactive multi-agent mobile assistant capable of continuously accumulating app knowledge and self-evolving during usage. Fairy enables cross-app collaboration, interactive execution, and continual learning through three core modules: (i) a Global Task Planner that decomposes user tasks into sub-tasks from a cross-app view; (ii) an App-Level Executor that refines sub-tasks into steps and actions based on long- and short-term memory, achieving precise execution and user interaction via four core agents operating in dual loops; and (iii) a Self-Learner that consolidates execution experience into an App Map and Tricks. To evaluate Fairy, we introduce RealMobile-Eval, a real-world benchmark with a comprehensive metric suite, and LMM-based agents for automated scoring. Experiments show that Fairy with GPT-4o backbone outperforms the previous SoTA by improving user requirement completion by 33.7% and reducing redundant steps by 58.5%, showing the effectiveness of its interaction and self-learning.

[282] Philosophy-informed Machine Learning

MZ Naser

Main category: cs.AI

TL;DR: Philosophy-informed machine learning (PhIML) integrates analytic philosophy concepts directly into ML models, promising improved philosophical alignment and new capabilities through intrinsic design.

Motivation: To create ML models that respect philosophical concepts and values by design, addressing the need for ethically responsible AI systems that are philosophically aligned.

Method: Reviews conceptual foundations, presents case studies on both post-hoc adoption and intrinsic architectural integration of PhIML, and analyzes challenges.

Result: Demonstrates philosophical gains and alignment through PhIML approaches, showing how ML practitioners can incorporate philosophical principles.

Conclusion: Outlines a research roadmap for developing safe, philosophy-aware, and ethically responsible PhIML while addressing open technical, philosophical, practical, and governance challenges.

Abstract: Philosophy-informed machine learning (PhIML) directly infuses core ideas from analytic philosophy into ML model architectures, objectives, and evaluation protocols. Therefore, PhIML promises new capabilities through models that respect philosophical concepts and values by design. From this lens, this paper reviews conceptual foundations to demonstrate philosophical gains and alignment. In addition, we present case studies on how ML users/designers can adopt PhIML as an agnostic post-hoc tool or intrinsically build it into ML model architectures. Finally, this paper sheds light on open technical barriers alongside philosophical, practical, and governance challenges and outlines a research roadmap toward safe, philosophy-aware, and ethically responsible PhIML.

[283] InsightGUIDE: An Opinionated AI Assistant for Guided Critical Reading of Scientific Literature

Paris Koloveas, Serafeim Chatzopoulos, Thanasis Vergoulis, Christos Tryfonopoulos

Main category: cs.AI

TL;DR: InsightGUIDE is an AI-powered reading assistant that provides structured, concise insights to help researchers navigate scientific papers, rather than replacing reading with verbose summaries.

Motivation: Existing LLM tools often produce lengthy summaries that risk replacing the actual reading of scientific papers, rather than assisting researchers in navigating the growing volume of literature.

Method: The system embeds an expert’s reading methodology into its core AI logic using a prompt-driven approach, providing structured insights that act as a “map” to key paper elements.

Result: Qualitative case study shows InsightGUIDE produces more structured and actionable guidance compared to general-purpose LLMs, serving as a more effective research tool.

Conclusion: InsightGUIDE successfully functions as a reading assistant that helps researchers efficiently navigate scientific papers through concise, structured insights rather than replacing the reading process.

Abstract: The proliferation of scientific literature presents an increasingly significant challenge for researchers. While Large Language Models (LLMs) offer promise, existing tools often provide verbose summaries that risk replacing, rather than assisting, the reading of the source material. This paper introduces InsightGUIDE, a novel AI-powered tool designed to function as a reading assistant, not a replacement. Our system provides concise, structured insights that act as a “map” to a paper’s key elements by embedding an expert’s reading methodology directly into its core AI logic. We present the system’s architecture, its prompt-driven methodology, and a qualitative case study comparing its output to a general-purpose LLM. The results demonstrate that InsightGUIDE produces more structured and actionable guidance, serving as a more effective tool for the modern researcher.
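
As a rough illustration of a prompt-driven reading methodology, the template below encodes a fixed set of guidance sections; the section names and wording are hypothetical, not InsightGUIDE's actual prompt.

```python
# Hypothetical structured prompt of the kind a prompt-driven reading guide
# could use; the sections are assumptions, not InsightGUIDE's actual prompt.
PROMPT = """You are an expert reader of scientific papers.
Given the paper text below, produce a concise map with exactly these sections:
- Problem: one sentence on what the paper addresses.
- Key idea: the core technical contribution, two sentences max.
- Evidence: the main experiments or proofs supporting the claim.
- Read closely: sections or figures that merit careful reading, and why.
Paper text:
{paper_text}
"""

def build_prompt(paper_text: str) -> str:
    return PROMPT.format(paper_text=paper_text)
```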

[284] Reconstruction-Based Adaptive Scheduling Using AI Inferences in Safety-Critical Systems

Samer Alshaer, Ala Khalifeh, Roman Obermaisser

Main category: cs.AI

TL;DR: A novel reconstruction framework for adaptive scheduling in time-triggered systems that transforms AI-generated priorities into executable schedules while ensuring safety and constraint adherence.

Motivation: Address challenges in time-triggered systems including message collisions, locked loops from incorrect precedence handling, and incomplete schedules that compromise system safety and performance in dynamic environments.

Method: Proposes reconstruction models that systematically transform scheduling priorities into executable schedules with robust safety checks, efficient allocation algorithms, and recovery mechanisms for handling unexpected events like hardware failures.

Result: Comprehensive experiments show the framework significantly enhances system adaptability, operational integrity, and runtime performance while maintaining computational efficiency across multiple performance profiles.

Conclusion: Provides a practical and scalable solution for safe schedule generation in safety-critical TTS, enabling reliable real-time scheduling under dynamic and uncertain operational conditions.

Abstract: Adaptive scheduling is crucial for ensuring the reliability and safety of time-triggered systems (TTS) in dynamic operational environments. Scheduling frameworks face significant challenges, including message collisions, locked loops from incorrect precedence handling, and the generation of incomplete or invalid schedules, which can compromise system safety and performance. To address these challenges, this paper presents a novel reconstruction framework designed to dynamically validate and assemble schedules. The proposed reconstruction models operate by systematically transforming AI-generated or heuristically derived scheduling priorities into fully executable schedules, ensuring adherence to critical system constraints such as precedence rules and collision-free communication. It incorporates robust safety checks, efficient allocation algorithms, and recovery mechanisms to handle unexpected context events, including hardware failures and mode transitions. Comprehensive experiments were conducted across multiple performance profiles, including makespan minimisation, workload balancing, and energy efficiency, to validate the operational effectiveness of the reconstruction models. Results demonstrate that the proposed framework significantly enhances system adaptability, operational integrity, and runtime performance while maintaining computational efficiency. Overall, this work contributes a practical and scalable solution to the problem of safe schedule generation in safety-critical TTS, enabling reliable and flexible real-time scheduling even under highly dynamic and uncertain operational conditions.
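
As a simplified illustration of turning priorities into a valid schedule, the sketch below performs list scheduling with precedence and cycle checks on a single resource; the task fields and single-resource model are simplifying assumptions, not the paper's reconstruction models.

```python
# List-scheduling sketch: convert an AI-generated priority order into a
# collision-free, precedence-respecting schedule on one resource, with a
# safety check for precedence cycles. A simplified stand-in for the paper's
# reconstruction models, not their actual algorithm.
def reconstruct_schedule(priorities, durations, preds):
    finish, schedule, t = {}, [], 0
    remaining = list(priorities)
    while remaining:
        # highest-priority task whose prerequisites have all finished
        ready = [task for task in remaining
                 if preds.get(task, set()) <= finish.keys()]
        if not ready:
            raise ValueError("invalid schedule: precedence cycle detected")
        task = ready[0]
        start = max([t] + [finish[p] for p in preds.get(task, set())])
        finish[task] = start + durations[task]
        schedule.append((task, start, finish[task]))
        t = finish[task]  # single resource: slots never overlap
        remaining.remove(task)
    return schedule

print(reconstruct_schedule(
    priorities=["deploy", "build", "test"],
    durations={"build": 3, "test": 2, "deploy": 1},
    preds={"test": {"build"}, "deploy": {"test"}},
))  # [('build', 0, 3), ('test', 3, 5), ('deploy', 5, 6)]
```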

[285] ToMPO: Training LLM Strategic Decision Making from a Multi-Agent Perspective

Yiwen Zhang, Ziang Chen, Fanqi Kong, Yizhe Huang, Xue Feng

Main category: cs.AI

TL;DR: The paper proposes ToMPO (Theory of Mind Policy Optimization), a reinforcement learning algorithm that enhances LLMs’ strategic decision-making by reasoning about others’ strategies and balancing global/partial rewards, achieving 35% improvement over GRPO.

Motivation: Existing LLM decision-making approaches focus on multi-round conversations in social tasks but neglect various decision types and their interdependence. Current RL methods struggle to consider others' strategies during training.

Method: ToMPO algorithm optimizes perception of others’ strategies and game trends by: 1) generating rollouts based on reasoning about others’ strategies, 2) estimating advantages at graph-level and sample-level, and 3) balancing global and partial rewards.

Result: ToMPO outperforms GRPO by 35% in model output compliance and cooperative outcomes, and shows an 18% improvement over models with 100x more parameters.

Conclusion: ToMPO effectively enhances LLMs’ strategic decision-making capabilities by incorporating theory of mind reasoning and multi-level advantage estimation.

Abstract: Large Language Models (LLMs) have been used to make decisions in complex scenarios, which require models to think deeply, reason logically, and decide wisely. Many existing studies focus solely on multi-round conversations in social tasks or simulated environments, neglecting the various types of decisions and their interdependence. Current reinforcement learning methods struggle to consider the strategies of others during training. To address these issues, we first define a strategic decision-making problem that includes two types of decisions and their temporal dependencies. Furthermore, we propose the Theory of Mind Policy Optimization (ToMPO) algorithm to optimize the perception of other individuals’ strategies and game situation trends. Compared to the Group Relative Policy Optimization (GRPO) algorithm, ToMPO enhances the LLM’s strategic decision-making mainly by: 1) generating rollouts based on reasoning about the strategies of other individuals, 2) estimating advantages at both the graph level and sample level, and 3) balancing global and partial rewards. The ToMPO algorithm outperforms the GRPO method by 35% in terms of model output compliance and cooperative outcomes. Additionally, when compared to models with parameter sizes 100 times larger, it shows an 18% improvement. This demonstrates the effectiveness of the ToMPO algorithm in enhancing the model’s strategic decision-making capabilities.
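
For intuition on the reward balancing, here is a GRPO-style group-normalized advantage computed separately for global and partial rewards and then mixed; the linear mixing rule is an illustrative assumption, not ToMPO's exact formulation.

```python
# Sketch: group-normalize global and partial rewards across rollouts, then
# mix them. The linear combination is an assumption for illustration only.
import numpy as np

def balanced_group_advantages(global_r, partial_r, lam=0.5):
    def norm(r):
        r = np.asarray(r, dtype=float)
        return (r - r.mean()) / (r.std() + 1e-8)  # GRPO-style normalization
    return lam * norm(global_r) + (1.0 - lam) * norm(partial_r)

print(balanced_group_advantages(global_r=[1.0, 0.0, 1.0],
                                partial_r=[0.2, 0.9, 0.4]))
```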

[286] Adaptive Approach to Enhance Machine Learning Scheduling Algorithms During Runtime Using Reinforcement Learning in Metascheduling Applications

Samer Alshaer, Ala Khalifeh, Roman Obermaisser

Main category: cs.AI

TL;DR: Proposes an adaptive online learning unit using Reinforcement Learning to overcome limitations of offline AI training for metascheduling in time-triggered architectures, enabling real-time adaptation to dynamic environments.

Motivation: Traditional offline AI scheduling training faces challenges in constructing comprehensive Multi-Schedule Graphs that account for all possible scenarios, especially context events like hardware failures, slack variations, or mode changes, which is resource-intensive and often infeasible.

Method: Integration of an adaptive online learning unit within the metascheduler using Reinforcement Learning models that continuously explore and discover new scheduling solutions, expanding the Multi-Schedule Graph and optimizing existing schedulers in real-time.

Result: The system becomes capable of handling unexpected events and complex scheduling scenarios more effectively, continuously refining AI inferences to meet stricter deadlines and new performance criteria.

Conclusion: The proposed approach ensures robustness and efficiency in large-scale, safety-critical environments by maintaining system flexibility and capability to meet evolving demands through real-time adaptation.

Abstract: Metascheduling in time-triggered architectures has been crucial in adapting to dynamic and unpredictable environments, ensuring the reliability and efficiency of task execution. However, traditional approaches face significant challenges when training Artificial Intelligence (AI) scheduling inferences offline, particularly due to the complexities involved in constructing a comprehensive Multi-Schedule Graph (MSG) that accounts for all possible scenarios. The process of generating an MSG that captures the vast probability space, especially when considering context events like hardware failures, slack variations, or mode changes, is resource-intensive and often infeasible. To address these challenges, we propose an adaptive online learning unit integrated within the metascheduler to enhance performance in real-time. The primary motivation for developing this unit stems from the limitations of offline training, where the MSG created is inherently a subset of the complete space, focusing only on the most probable and critical context events. In the online mode, Reinforcement Learning (RL) plays a pivotal role by continuously exploring and discovering new scheduling solutions, thus expanding the MSG and enhancing system performance over time. This dynamic adaptation allows the system to handle unexpected events and complex scheduling scenarios more effectively. Several RL models were implemented within the online learning unit, each designed to address specific challenges in scheduling. These models not only facilitate the discovery of new solutions but also optimize existing schedulers, particularly when stricter deadlines or new performance criteria are introduced. By continuously refining the AI inferences through real-time training, the system remains flexible and capable of meeting evolving demands, thus ensuring robustness and efficiency in large-scale, safety-critical environments.
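
For a concrete picture of the online RL component, here is a minimal tabular Q-learning update of the kind such a unit could run; the states, actions, and reward here are abstract placeholders, not the paper's metascheduler.

```python
# Tabular Q-learning update sketch (placeholder states/actions/rewards).
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    q_sa = Q.get((s, a), 0.0)
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)

Q = {}
q_update(Q, s="deadline_tight", a="swap_slots", r=1.0,
         s_next="deadline_met", actions=["swap_slots", "keep"])
print(Q)  # {('deadline_tight', 'swap_slots'): 0.1}
```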

[287] A Compound Classification System Based on Fuzzy Relations Applied to the Noise-Tolerant Control of a Bionic Hand via EMG Signal Recognition

Pawel Trajdos, Marek Kurzynski

Main category: cs.AI

TL;DR: A fuzzy recognition system for EMG-based hand prosthesis control that detects contaminated biosignals using one-class classifiers and KNN ensembles with uniform fuzzy decision scheme.

Motivation: EMG biosignals are highly susceptible to contamination, which reduces classification quality in prosthetic control systems. Current pattern recognition schemes struggle with signal contamination issues.

Method: Two-ensemble system: one-class classifiers to detect contamination in individual channels, and KNN ensemble for intent recognition. Uses a coherent fuzzy model with uniform soft decision scheme throughout the recognition process.

Result: Experimental evaluation on real biosignals from a public repository provided a comparative analysis of the method's parameters and procedures that affect recognition quality; the system also compared favorably with similar systems described in the literature.

Conclusion: The proposed fuzzy recognition system effectively mitigates adverse effects of biosignal contamination in prosthetic control, providing improved classification quality through contamination detection and fuzzy decision making.

Abstract: Modern anthropomorphic upper limb bioprostheses are typically controlled by electromyographic (EMG) biosignals using a pattern recognition scheme. Unfortunately, there are many factors originating from the human source of objects to be classified and from the human-prosthesis interface that make it difficult to obtain an acceptable classification quality. One of these factors is the high susceptibility of biosignals to contamination, which can considerably reduce the quality of classification of a recognition system. In the paper, the authors propose a new recognition system intended for EMG based control of the hand prosthesis with detection of contaminated biosignals in order to mitigate the adverse effect of contaminations. The system consists of two ensembles: the set of one-class classifiers (OCC) to assess the degree of contamination of individual channels and the ensemble of K-nearest neighbours (KNN) classifier to recognise the patient’s intent. For all recognition systems, an original, coherent fuzzy model was developed, which allows the use of a uniform soft (fuzzy) decision scheme throughout the recognition process. The experimental evaluation was conducted using real biosignals from a public repository. The goal was to provide an experimental comparative analysis of the parameters and procedures of the developed method on which the quality of the recognition system depends. The proposed fuzzy recognition system was also compared with similar systems described in the literature.
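
A simplified sketch of how per-channel contamination estimates can soften a KNN vote: one-class scores become fuzzy channel weights that down-weight contaminated channels in the distance computation. The feature layout and weighting rule are assumptions, not the paper's fuzzy model.

```python
# Contamination-aware soft KNN voting sketch; features are assumed to be
# grouped contiguously by channel, and the weighting rule is illustrative.
import numpy as np
from collections import Counter

def soft_knn_predict(x, X_train, y_train, contamination, feats_per_channel, k=3):
    # contamination[c] in [0, 1]: estimated degree channel c is contaminated
    w = np.repeat(1.0 - np.asarray(contamination), feats_per_channel)
    d = np.linalg.norm((X_train - x) * w, axis=1)   # channel-weighted distances
    nearest = np.argsort(d)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 8))                  # 2 channels x 4 features each
y = np.array([0] * 10 + [1] * 10)
print(soft_knn_predict(X[0], X, y, contamination=[0.1, 0.8], feats_per_channel=4))
```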

[288] SAMULE: Self-Learning Agents Enhanced by Multi-level Reflection

Yubin Ge, Salvatore Romeo, Jason Cai, Monica Sunkara, Yi Zhang

Main category: cs.AI

TL;DR: SAMULE is a self-learning framework for LLM agents that uses multi-level reflection synthesis to generate high-quality reflections from failures, enabling better error analysis and learning across single trajectories, intra-task, and inter-task levels.

Motivation: Current LLM agents struggle with meaningful reflections due to inadequate error analysis and reliance on rare successful trajectories, especially in complex tasks where failure-based learning is crucial.

Method: Proposes SAMULE framework with retrospective language model trained on Multi-Level Reflection Synthesis: Single-Trajectory Learning (micro-level), Intra-Task Learning (meso-level), and Inter-Task Learning (macro-level). Also includes foresight-based reflection for interactive settings.

Result: Extensive experiments on TravelPlanner, NATURAL PLAN, and Tau-bench show SAMULE significantly outperforms reflection-based baselines.

Conclusion: Well-designed reflection synthesis and failure-centric learning are critical for building self-improving LLM agents, with multi-level reflection approach proving highly effective.

Abstract: Despite the rapid advancements in LLM agents, they still face the challenge of generating meaningful reflections due to inadequate error analysis and a reliance on rare successful trajectories, especially in complex tasks. In this work, we propose SAMULE, a new framework for self-learning agents powered by a retrospective language model that is trained based on Multi-Level Reflection Synthesis. It first synthesizes high-quality reflections across three complementary levels: Single-Trajectory Learning (micro-level) for detailed error correction; Intra-Task Learning (meso-level) to build error taxonomies across multiple trials of the same task, and Inter-Task Learning (macro-level) to extract transferable insights based on same typed errors from diverse task failures. Then we fine-tune a language model serving as the retrospective model to generate reflections during inference. We further extend our framework to interactive settings through a foresight-based reflection mechanism, enabling agents to proactively reflect and adapt during user interactions by comparing predicted and actual responses. Extensive experiments on three challenging benchmarks - TravelPlanner, NATURAL PLAN, and Tau-bench - demonstrate that our approach significantly outperforms reflection-based baselines. Our results highlight the critical role of well-designed reflection synthesis and failure-centric learning in building self-improving LLM agents.

[289] Language-Guided Multi-Agent Learning in Simulations: A Unified Framework and Evaluation

Zhengyang Li

Main category: cs.AI

TL;DR: LLM-MARL integrates large language models into multi-agent reinforcement learning to improve coordination, communication, and generalization in game environments through modular components for subgoal generation, symbolic messaging, and memory.

Motivation: To enhance coordination, communication, and generalization capabilities in multi-agent systems by leveraging the reasoning and language understanding abilities of large language models within reinforcement learning frameworks.

Method: A unified framework with three modular components (Coordinator, Communicator, Memory) that dynamically generate subgoals, facilitate symbolic messaging, and support episodic recall. Training combines PPO with language-conditioned loss and LLM query gating.

Result: Consistent improvements over MAPPO and QMIX in win rate, coordination score, and zero-shot generalization across Google Research Football, MAgent Battle, and StarCraft II. Ablation studies confirm significant contributions from subgoal generation and language-based messaging.

Conclusion: LLM-MARL successfully bridges language modeling and policy learning, demonstrating emergent cooperative behaviors and offering a promising path for leveraging LLMs in multi-agent systems for training, games, and human-AI collaboration.

Abstract: This paper introduces LLM-MARL, a unified framework that incorporates large language models (LLMs) into multi-agent reinforcement learning (MARL) to enhance coordination, communication, and generalization in simulated game environments. The framework features three modular components of Coordinator, Communicator, and Memory, which dynamically generate subgoals, facilitate symbolic inter-agent messaging, and support episodic recall. Training combines PPO with a language-conditioned loss and LLM query gating. LLM-MARL is evaluated in Google Research Football, MAgent Battle, and StarCraft II. Results show consistent improvements over MAPPO and QMIX in win rate, coordination score, and zero-shot generalization. Ablation studies demonstrate that subgoal generation and language-based messaging each contribute significantly to performance gains. Qualitative analysis reveals emergent behaviors such as role specialization and communication-driven tactics. By bridging language modeling and policy learning, this work contributes to the design of intelligent, cooperative agents in interactive simulations. It offers a path forward for leveraging LLMs in multi-agent systems used for training, games, and human-AI collaboration.
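
As one plausible reading of the LLM query gating, the sketch below consults the LLM coordinator only when the policy is uncertain; the entropy criterion and threshold are assumptions for illustration, not the paper's exact gate.

```python
# LLM query gating sketch: query the (expensive) LLM coordinator only when
# the RL policy's action distribution is high-entropy. Illustrative only.
import numpy as np

def maybe_query_llm(action_probs, query_llm, threshold=1.0):
    p = np.asarray(action_probs, dtype=float)
    entropy = -np.sum(p * np.log(p + 1e-12))
    return query_llm() if entropy > threshold else None

subgoal = maybe_query_llm([0.3, 0.3, 0.2, 0.2],
                          query_llm=lambda: "press high up the wing")
print(subgoal)  # entropy ~1.37 > 1.0, so the LLM is consulted
```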

[290] Adaptive Cybersecurity Architecture for Digital Product Ecosystems Using Agentic AI

Oluwakemi T. Olayinka, Sumeet Jeswani, Divine Iloh

Main category: cs.AI

TL;DR: This paper proposes an adaptive cybersecurity architecture using autonomous AI agents for dynamic threat detection and policy enforcement in complex digital ecosystems.

Motivation: Traditional static cybersecurity models are inadequate for modern digital ecosystems (cloud, APIs, mobile, edge) due to scalability issues, lack of real-time detection, and poor contextual responsiveness.

Method: The framework integrates agentic AI across ecosystem layers, featuring behavioral baselining, decentralized risk scoring, and federated threat intelligence sharing. Evaluation was conducted through native cloud simulations.

Result: The system demonstrated capability to detect zero-day attacks and dynamically modify access policies, showing increased adaptability, decreased response latency, and improved detection accuracy.

Conclusion: The architecture provides an intelligent, scalable blueprint for safeguarding complex digital infrastructure that is compatible with zero-trust models and supports international cybersecurity regulations.

Abstract: Traditional static cybersecurity models often struggle with scalability, real-time detection, and contextual responsiveness in the current digital product ecosystems which include cloud services, application programming interfaces (APIs), mobile platforms, and edge devices. This study introduces autonomous goal driven agents capable of dynamic learning and context-aware decision making as part of an adaptive cybersecurity architecture driven by agentic artificial intelligence (AI). To facilitate autonomous threat mitigation, proactive policy enforcement, and real-time anomaly detection, this framework integrates agentic AI across the key ecosystem layers. Behavioral baselining, decentralized risk scoring, and federated threat intelligence sharing are important features. The capacity of the system to identify zero-day attacks and dynamically modify access policies was demonstrated through native cloud simulations. The evaluation results show increased adaptability, decreased response latency, and improved detection accuracy. The architecture provides an intelligent and scalable blueprint for safeguarding complex digital infrastructure and is compatible with zero-trust models, thereby supporting the adherence to international cybersecurity regulations.

[291] Accelerate Creation of Product Claims Using Generative AI

Po-Yu Liang, Yong Zhang, Tatiana Hwa, Aaron Byers

Main category: cs.AI

TL;DR: Claim Advisor is a web application that uses LLMs to accelerate product claim creation through semantic search, claim generation/optimization, and claim ranking via synthetic consumer simulations.

Motivation: Creating product claims is time-consuming and expensive, requiring substantial resources. The system aims to disrupt the speed and economics of claim creation processes.

Method: Uses in-context learning and fine-tuning of large language models with three functions: semantic search for existing claims/visuals, claim generation/optimization based on product descriptions and consumer profiles, and claim ranking through simulations with synthetic consumers.

Result: Applications in a consumer packaged goods company showed very promising results, demonstrating broad usefulness across product categories and industries.

Conclusion: The capability is broadly applicable across industries, and the authors share their learning to encourage further research and application of generative AI in different sectors.

Abstract: The benefit claims of a product are a critical driver of consumers’ purchase behavior. Creating product claims is an intense task that requires substantial time and funding. We have developed the Claim Advisor web application to accelerate claim creation using in-context learning and fine-tuning of large language models (LLMs). Claim Advisor was designed to disrupt the speed and economics of claim search, generation, optimization, and simulation. It has three functions: (1) semantically searching and identifying existing claims and/or visuals that resonate with the voice of consumers; (2) generating and/or optimizing claims based on a product description and a consumer profile; and (3) ranking generated and/or manually created claims using simulations via synthetic consumers. Applications in a consumer packaged goods (CPG) company have shown very promising results. We believe that this capability is broadly useful and applicable across product categories and industries. We share our learning to encourage the research and application of generative AI in different industries.
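
For the semantic-search function, a minimal sketch with an off-the-shelf sentence-embedding model is shown below; the model choice and example claims are illustrative, not the production system.

```python
# Semantic claim search sketch using sentence-transformers; the model name
# and the toy claim corpus are illustrative stand-ins.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
claims = [
    "Removes 99% of tough stains in one wash",
    "Gentle enough for sensitive skin",
    "Keeps clothes fresh for 24 hours",
]
claim_embs = model.encode(claims, convert_to_tensor=True)

query_emb = model.encode(["kind to delicate skin"], convert_to_tensor=True)
hits = util.semantic_search(query_emb, claim_embs, top_k=2)[0]
for h in hits:
    print(f"{h['score']:.2f}  {claims[h['corpus_id']]}")
```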

[292] An Automated Retrieval-Augmented Generation LLaMA-4 109B-based System for Evaluating Radiotherapy Treatment Plans

Junjie Cui, Peilong Wang, Jason Holmes, Leshan Sun, Michael L. Hinni, Barbara A. Pockaj, Sujay A. Vora, Terence T. Sio, William W. Wong, Nathan Y. Yu, Steven E. Schild, Joshua R. Niska, Sameer R. Keole, Jean-Claude M. Rwigema, Samir H. Patel, Lisa A. McGee, Carlos A. Vargas, Wei Liu

Main category: cs.AI

TL;DR: Development of a RAG system using LLaMA-4 109B for automated, protocol-aware radiotherapy treatment plan evaluation with interpretable outputs.

Motivation: To create an automated system for radiotherapy plan evaluation that is protocol-aware, interpretable, and minimizes hallucination while providing traceable outputs.

Method: Built a RAG system with three core modules: retrieval engine (optimized SentenceTransformer backbones), percentile prediction based on cohort similarity, and clinical constraint checker, directed by LLM using multi-step prompt-driven reasoning.

Result: Achieved perfect nearest-neighbor accuracy within 5-percentile-point margin, sub-2pt MAE, and 100% agreement with standalone modules on both percentile estimates and constraint identification.

Conclusion: Combining structured population-based scoring with modular tool-augmented reasoning is feasible for transparent, scalable radiotherapy plan evaluation, offering traceable outputs and robustness across protocols.

Abstract: Purpose: To develop a retrieval-augmented generation (RAG) system powered by LLaMA-4 109B for automated, protocol-aware, and interpretable evaluation of radiotherapy treatment plans. Methods and Materials: We curated a multi-protocol dataset of 614 radiotherapy plans across four disease sites and constructed a knowledge base containing normalized dose metrics and protocol-defined constraints. The RAG system integrates three core modules: a retrieval engine optimized across five SentenceTransformer backbones, a percentile prediction component based on cohort similarity, and a clinical constraint checker. These tools are directed by a large language model (LLM) using a multi-step prompt-driven reasoning pipeline to produce concise, grounded evaluations. Results: Retrieval hyperparameters were optimized using Gaussian Process on a scalarized loss function combining root mean squared error (RMSE), mean absolute error (MAE), and clinically motivated accuracy thresholds. The best configuration, based on all-MiniLM-L6-v2, achieved perfect nearest-neighbor accuracy within a 5-percentile-point margin and a sub-2pt MAE. When tested end-to-end, the RAG system achieved 100% agreement with the computed values by standalone retrieval and constraint-checking modules on both percentile estimates and constraint identification, confirming reliable execution of all retrieval, prediction and checking steps. Conclusion: Our findings highlight the feasibility of combining structured population-based scoring with modular tool-augmented reasoning for transparent, scalable plan evaluation in radiation therapy. The system offers traceable outputs, minimizes hallucination, and demonstrates robustness across protocols. Future directions include clinician-led validation, and improved domain-adapted retrieval models to enhance real-world integration.
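
The percentile-prediction idea reduces to ranking a plan's metric within a retrieved cohort of similar plans; here is a minimal sketch, with metric values that are purely illustrative.

```python
# Percentile of a plan's dose metric within a retrieved cohort (sketch only;
# the metric and cohort values are illustrative).
import numpy as np

def percentile_in_cohort(value, cohort):
    """Percentile rank of `value` among cohort values, 0-100."""
    return 100.0 * np.mean(np.asarray(cohort) <= value)

cohort_mean_dose = [18.2, 20.1, 21.5, 19.7, 23.0, 17.9]  # Gy, illustrative
print(percentile_in_cohort(20.5, cohort_mean_dose))      # ~66.7
```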

[293] Parallel Thinking, Sequential Answering: Bridging NAR and AR for Efficient Reasoning

Qihang Ai, Haiyun Jiang

Main category: cs.AI

TL;DR: A new framework combining auto-regressive (AR) and non-autoregressive (NAR) models for reasoning tasks, where NAR generates intermediate reasoning traces efficiently and AR produces precise final answers.

Motivation: AR models have slow inference in reasoning-intensive domains due to sequential generation, while NAR models offer speed but sacrifice output quality. The goal is to leverage both strengths.

Method: Integration of AR and NAR models: NAR efficiently produces intermediate reasoning traces, which then guide an AR model to generate precise final answers.

Result: Experiments show 26% improvement over strong baselines with substantially reduced inference cost.

Conclusion: The hybrid approach effectively balances speed and quality, making it suitable for reasoning-intensive tasks like mathematics and code.

Abstract: We study reasoning tasks through a framework that integrates auto-regressive (AR) and non-autoregressive (NAR) language models. AR models, which generate text sequentially, excel at producing coherent outputs but often suffer from slow inference, particularly in reasoning-intensive domains such as mathematics and code, where lengthy chains of thought are required. In contrast, NAR models, such as discrete diffusion models, allow parallel generation and offer substantial speedups, though typically at the cost of reduced output quality. To address these limitations, we introduce a new paradigm in which an NAR model efficiently produces intermediate reasoning traces, which subsequently guide an AR model to deliver precise final answers. Experiments demonstrate that our approach yields a significant 26% improvement over strong baselines while substantially reducing inference cost.
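
The division of labor is easy to express in code; in the sketch below, `nar_draft` and `ar_answer` are hypothetical stand-ins for a parallel (e.g. diffusion-based) drafter and an auto-regressive finalizer.

```python
# Draft-then-answer pipeline sketch: a fast NAR drafter produces a rough
# reasoning trace, and a precise AR model finalizes the answer from it.
def solve(question, nar_draft, ar_answer):
    trace = nar_draft(question)            # fast, parallel, rough reasoning
    prompt = f"{question}\nDraft reasoning:\n{trace}\nFinal answer:"
    return ar_answer(prompt)               # slow, sequential, precise answer

answer = solve(
    "What is 17 * 23?",
    nar_draft=lambda q: "17*23 = 17*20 + 17*3 = 340 + 51",
    ar_answer=lambda p: "391",
)
print(answer)  # 391
```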

[294] Meta-Memory: Retrieving and Integrating Semantic-Spatial Memories for Robot Spatial Reasoning

Yufan Mao, Hanjing Ye, Wenlong Dong, Chengjie Zhang, Hong Zhang

Main category: cs.AI

TL;DR: Meta-Memory is an LLM-driven agent that creates high-density environmental memory representations for robots, enabling robust spatial reasoning through joint semantic-spatial retrieval in response to natural language location queries.

Motivation: Robots need effective memory systems to store observations and answer human spatial queries, but current research lacks principled mechanisms for efficient memory retrieval and integration in complex environments.

Method: Proposes Meta-Memory, which constructs high-density memory representations and performs joint reasoning over semantic and spatial modalities to retrieve and integrate relevant memories for natural language location queries.

Result: Significantly outperforms state-of-the-art methods on SpaceLocQA and NaVQA benchmarks, and successfully deployed on real-world robotic platforms demonstrating practical utility.

Conclusion: Meta-Memory effectively bridges the gap in robotic memory systems by providing principled mechanisms for memory retrieval and integration, enabling robust spatial reasoning capabilities in complex environments.

Abstract: Navigating complex environments requires robots to effectively store observations as memories and leverage them to answer human queries about spatial locations, which is a critical yet underexplored research challenge. While prior work has made progress in constructing robotic memory, few have addressed the principled mechanisms needed for efficient memory retrieval and integration. To bridge this gap, we propose Meta-Memory, a large language model (LLM)-driven agent that constructs a high-density memory representation of the environment. The key innovation of Meta-Memory lies in its capacity to retrieve and integrate relevant memories through joint reasoning over semantic and spatial modalities in response to natural language location queries, thereby empowering robots with robust and accurate spatial reasoning capabilities. To evaluate its performance, we introduce SpaceLocQA, a large-scale dataset encompassing diverse real-world spatial question-answering scenarios. Experimental results show that Meta-Memory significantly outperforms state-of-the-art methods on both the SpaceLocQA and the public NaVQA benchmarks. Furthermore, we successfully deployed Meta-Memory on real-world robotic platforms, demonstrating its practical utility in complex environments. Project page: https://itsbaymax.github.io/meta-memory.github.io/ .
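
A minimal sketch of joint semantic-spatial scoring: memories are ranked by semantic similarity minus a spatial-distance penalty. The linear combination and memory format are illustrative assumptions, not Meta-Memory's exact mechanism.

```python
# Joint semantic-spatial memory retrieval sketch; the scoring rule and
# memory tuples are illustrative stand-ins.
import numpy as np

def retrieve(query_emb, query_xy, memories, alpha=1.0, beta=0.2, k=2):
    def score(m):
        emb, xy, _ = m
        sim = emb @ query_emb / (np.linalg.norm(emb) * np.linalg.norm(query_emb))
        dist = np.linalg.norm(np.asarray(xy) - np.asarray(query_xy))
        return alpha * sim - beta * dist   # semantic match minus spatial cost
    return sorted(memories, key=score, reverse=True)[:k]

rng = np.random.default_rng(2)
mems = [(rng.normal(size=16), (x, 0.0), f"obs at x={x}") for x in (1.0, 5.0, 9.0)]
top = retrieve(rng.normal(size=16), (1.5, 0.0), mems)
print([t for _, _, t in top])
```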

[295] LogReasoner: Empowering LLMs with Expert-like Coarse-to-Fine Reasoning for Log Analysis Tasks

Lipeng Ma, Yixuan Li, Weidong Yang, Mingjie Zhou, Xinyi Liu, Ben Fei, Shuhao Li, Xiaoyan Sun, Sihang Jiang, Yanghua Xiao

Main category: cs.AI

TL;DR: LogReasoner is a coarse-to-fine reasoning enhancement framework that enables LLMs to perform expert-level log analysis through structured reasoning workflows and stepwise calibration.

Motivation: General-purpose LLMs struggle with structured reasoning workflows that align with expert cognition in log analysis tasks like anomaly detection and failure prediction, lacking precise reasoning details.

Method: Two-stage framework: (1) coarse-grained enhancement using expert troubleshooting flowcharts to structure reasoning workflows, (2) fine-grained enhancement through task-specific fine-tuning and preference learning to calibrate reasoning details from mistakes.

Result: LogReasoner significantly outperforms existing LLMs on four distinct log analysis tasks using Qwen-2.5 and Llama-3, achieving state-of-the-art performance.

Conclusion: The framework effectively enhances LLMs’ reasoning capabilities for log analysis by mimicking expert cognition through structured workflows and calibrated stepwise reasoning.

Abstract: Log analysis is crucial for monitoring system health and diagnosing failures in complex systems. Recent advances in large language models (LLMs) offer new opportunities for automated log analysis, leveraging their reasoning capabilities to perform tasks such as anomaly detection and failure prediction. However, general-purpose LLMs struggle to formulate structured reasoning workflows that align with expert cognition and deliver precise details of reasoning steps. To address these challenges, we propose LogReasoner, a coarse-to-fine reasoning enhancement framework designed to enable LLMs to reason about log analysis tasks like experts. LogReasoner consists of two stages: (1) coarse-grained enhancement of expert thinking, where high-level expert thoughts are constructed from collected troubleshooting flowcharts and existing tasks to enable LLMs to formulate structured reasoning workflows; and (2) fine-grained enhancement of specific steps, where we first fine-tune the LLM with task-specific stepwise solutions to enhance the LLM for instantiated reasoning, and then employ preference learning to calibrate the LLM’s reasoning details from its mistakes, further strengthening the LLM’s analytical granularity and correctness. We evaluate LogReasoner on four distinct log analysis tasks using open-source LLMs such as Qwen-2.5 and Llama-3. Experimental results show that LogReasoner significantly outperforms existing LLMs, achieving state-of-the-art performance and demonstrating its effectiveness in enhancing the reasoning capabilities of LLMs for log analysis.

[296] DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning

Tianrun Xu, Haoda Jing, Ye Li, Yuquan Wei, Jun Feng, Guanyu Chen, Haichuan Gao, Tianren Zhang, Feng Chen

Main category: cs.AI

TL;DR: DeFacto is a counterfactual reasoning framework that improves multimodal language models by enforcing both accurate answering and faithful reasoning through three complementary training paradigms and GRPO-based reinforcement learning.

Motivation: Current multimodal language models often reach correct answers by relying on irrelevant or spurious image regions due to prior knowledge or dataset biases, indicating flawed reasoning and lack of true image understanding.

Method: Proposes DeFacto framework with three training paradigms: positive, counterfactual, and random-masking. Uses automated pipeline to localize question-relevant evidence and construct dataset variants. Trains models with GRPO-based reinforcement learning using three complementary rewards.

Result: Experiments show DeFacto substantially improves both answer accuracy and reasoning faithfulness across diverse benchmarks, establishing stronger foundation for interpretable multimodal reasoning.

Conclusion: DeFacto successfully addresses reasoning fidelity issues in multimodal tasks by jointly enforcing accurate answering and faithful reasoning through counterfactual training and reinforcement learning.

Abstract: Recent advances in multimodal language models (MLLMs) have achieved remarkable progress in vision-language reasoning, especially with the emergence of “thinking with images,” which integrates explicit visual steps into the reasoning process. While this paradigm strengthens image-based reasoning, a significant challenge remains: models may arrive at correct answers by relying on irrelevant or spurious regions, driven by prior knowledge or dataset biases. Even when the answer is correct, flawed reasoning indicates that the model has not truly understood the image, highlighting the critical importance of reasoning fidelity in multimodal tasks. To address this issue, we propose DeFacto, a counterfactual reasoning framework that jointly enforces accurate answering and faithful reasoning. A key component of our approach is the design of three complementary training paradigms: (i) positive, (ii) counterfactual, and (iii) random-masking. To enable these paradigms, we develop a pipeline that automatically localizes question-relevant evidence and constructs positive, counterfactual, and random variants, resulting in a dataset of about 100k images. Building on this framework, we train multimodal language models with GRPO-based reinforcement learning, where we design three complementary rewards to guide the model toward accurate answering and evidence-grounded reasoning. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and reasoning faithfulness, establishing a stronger foundation for interpretable multimodal reasoning. The code is available on GitHub and the dataset is released on HuggingFace.
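
As a rough illustration of combining complementary reward signals in such a GRPO setup, the sketch below mixes answer accuracy, evidence grounding, and format validity; the signal names and weights are assumptions, since the abstract does not spell out its three rewards.

```python
# Combined reward sketch (assumed signals and weights, not DeFacto's exact
# rewards): answer correctness, evidence grounding, and output format.
def combined_reward(answer_correct: bool, evidence_overlap: float,
                    format_valid: bool, w=(1.0, 0.5, 0.2)) -> float:
    # evidence_overlap: e.g. IoU between cited region and ground-truth evidence
    return (w[0] * float(answer_correct)
            + w[1] * evidence_overlap
            + w[2] * float(format_valid))

print(combined_reward(True, 0.7, True))  # 1.0 + 0.35 + 0.2 = 1.55
```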

[297] GALAX: Graph-Augmented Language Model for Explainable Reinforcement-Guided Subgraph Reasoning in Precision Medicine

Heming Zhang, Di Huang, Wenyu Li, Michael Province, Yixin Chen, Philip Payne, Fuhai Li

Main category: cs.AI

TL;DR: GALAX integrates graph neural networks with large language models using reinforcement learning to enable explainable subgraph reasoning for precision medicine target discovery.

DetailsMotivation: Existing methods in precision medicine fail to fully integrate quantitative multi-omic features, topological context, and textual knowledge, limiting mechanistic interpretability and reliable target discovery.

Method: Proposes GALAX framework that combines pretrained GNNs with LLMs via Graph Process Reward Model (GPRM) for step-wise subgraph generation, and introduces Target-QA benchmark for GNN pretraining and text-numeric graph reasoning.

Result: The framework enables process-level supervision without explicit intermediate reasoning annotations and supports long-context reasoning over text-numeric graphs.

Conclusion: GALAX provides a scalable, biologically grounded framework for explainable, reinforcement-guided subgraph reasoning towards reliable target and pathway discovery in precision medicine.

Abstract: In precision medicine, quantitative multi-omic features, topological context, and textual biological knowledge play vital roles in identifying disease-critical signaling pathways and targets. Existing pipelines capture only part of these signals: numerical omics ignore topological context, text-centric LLMs lack quantitatively grounded reasoning, and graph-only models underuse node semantics and the generalization of LLMs, limiting mechanistic interpretability. Although Process Reward Models (PRMs) aim to guide reasoning in LLMs, they remain limited by unreliable intermediate evaluation, vulnerability to reward hacking, and computational cost. These gaps motivate integrating quantitative multi-omic signals, topological structure with node annotations, and literature-scale text via LLMs, using subgraph reasoning as the principal bridge linking numeric evidence, topological knowledge, and language context. Therefore, we propose GALAX (Graph Augmented LAnguage model with eXplainability), an innovative framework that integrates pretrained Graph Neural Networks (GNNs) into Large Language Models (LLMs) via reinforcement learning guided by a Graph Process Reward Model (GPRM). GPRM generates disease-relevant subgraphs in a step-wise manner: an LLM initiates each step and a pretrained GNN iteratively evaluates it, enabling process-level supervision without explicit intermediate reasoning annotations. As an application, we also introduce Target-QA, a benchmark combining CRISPR-identified targets, multi-omic profiles, and biomedical graph knowledge across diverse cancer cell lines. Target-QA enables GNN pretraining for supervising step-wise graph construction and supports long-context reasoning over text-numeric graphs (TNGs), providing a scalable and biologically grounded framework for explainable, reinforcement-guided subgraph reasoning toward reliable and interpretable target and pathway discovery in precision medicine.

[298] Beyond Stars: Bridging the Gap Between Ratings and Review Sentiment with LLM

Najla Zuhir, Amna Mohammad Salim, Parvathy Premkumar, Moshiur Farazi

Main category: cs.AI

TL;DR: Proposes an LLM-based framework for mobile app review analysis that overcomes limitations of star ratings and traditional NLP methods by capturing nuanced feedback through structured prompting and RAG-QA.

DetailsMotivation: Traditional star ratings fail to capture nuanced feedback in detailed review texts, and conventional NLP techniques struggle with contextual nuances, domain-specific terminology, and linguistic features like sarcasm.

Method: Modular framework using large language models (LLMs) enhanced by structured prompting techniques, quantifying rating-text discrepancies, extracting feature-level insights, and supporting interactive exploration via retrieval-augmented conversational question answering (RAG-QA).

Result: Comprehensive experiments on three datasets (AWARE, Google Play, Spotify) show the LLM-driven approach significantly outperforms baseline methods with improved accuracy, robustness, and actionable insights in challenging review scenarios.

Conclusion: The proposed LLM-based framework effectively addresses limitations of traditional review analysis methods, providing superior performance in capturing nuanced feedback and generating actionable insights from mobile app reviews.

Abstract: We present an advanced approach to mobile app review analysis aimed at addressing limitations inherent in traditional star-rating systems. Star ratings, although intuitive and popular among users, often fail to capture the nuanced feedback present in detailed review texts. Traditional NLP techniques – such as lexicon-based methods and classical machine learning classifiers – struggle to interpret contextual nuances, domain-specific terminology, and subtle linguistic features like sarcasm. To overcome these limitations, we propose a modular framework leveraging large language models (LLMs) enhanced by structured prompting techniques. Our method quantifies discrepancies between numerical ratings and textual sentiment, extracts detailed, feature-level insights, and supports interactive exploration of reviews through retrieval-augmented conversational question answering (RAG-QA). Comprehensive experiments conducted on three diverse datasets (AWARE, Google Play, and Spotify) demonstrate that our LLM-driven approach significantly surpasses baseline methods, yielding improved accuracy, robustness, and actionable insights in challenging and context-rich review scenarios.
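
The framework “quantifies discrepancies between numerical ratings and textual sentiment,” but the exact formula is not given; the sketch below is one plausible reading, mapping stars and an LLM-derived sentiment onto a common [-1, 1] scale. All names are illustrative.

```python
def rating_text_discrepancy(stars: int, sentiment: float) -> float:
    """Hypothetical discrepancy between a 1-5 star rating and an LLM-derived
    sentiment score in [-1, 1]; the paper's exact formulation is not stated."""
    rating_norm = (stars - 1) / 4 * 2 - 1   # map 1..5 stars onto [-1, 1]
    return abs(rating_norm - sentiment)     # 0 = consistent, 2 = maximal gap

# e.g. a 5-star review whose text reads strongly negative:
print(rating_text_discrepancy(5, -0.8))    # -> 1.8
```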

[299] AOT*: Efficient Synthesis Planning via LLM-Empowered AND-OR Tree Search

Xiaozhuang Song, Xuanhao Pan, Xinjian Zhao, Hangting Ye, Shufei Zhang, Jian Tang, Tianshu Yu

Main category: cs.AI

TL;DR: AOT* is a framework that integrates LLM-generated chemical synthesis pathways with AND-OR tree search to improve retrosynthesis planning efficiency and performance.

DetailsMotivation: Multi-step retrosynthetic planning faces computational challenges due to exponential search spaces and high inference costs, while existing LLM approaches have efficiency and cost constraints.

Method: AOT* atomically maps LLM-generated synthesis routes onto AND-OR tree components with a reward assignment strategy and retrieval-based context engineering, enabling efficient navigation in chemical space.

Result: AOT* achieves state-of-the-art performance with 3-5× fewer iterations than existing LLM-based approaches, with efficiency advantages increasing on complex molecular targets.

Conclusion: The framework successfully transforms retrosynthetic planning by combining LLM capabilities with systematic search, demonstrating significant efficiency improvements while maintaining competitive solve rates.

Abstract: Retrosynthesis planning enables the discovery of viable synthetic routes for target molecules, playing a crucial role in domains like drug discovery and materials design. Multi-step retrosynthetic planning remains computationally challenging due to exponential search spaces and inference costs. While Large Language Models (LLMs) demonstrate chemical reasoning capabilities, their application to synthesis planning faces constraints on efficiency and cost. To address these challenges, we introduce AOT*, a framework that transforms retrosynthetic planning by integrating LLM-generated chemical synthesis pathways with systematic AND-OR tree search. To this end, AOT* atomically maps the generated complete synthesis routes onto AND-OR tree components, with a mathematically sound design of reward assignment strategy and retrieval-based context engineering, thus enabling LLMs to efficiently navigate in the chemical space. Experimental evaluation on multiple synthesis benchmarks demonstrates that AOT* achieves SOTA performance with significantly improved search efficiency. AOT* exhibits competitive solve rates using 3-5× fewer iterations than existing LLM-based approaches, with the efficiency advantage becoming more pronounced on complex molecular targets.
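
For readers unfamiliar with the search formalism: retrosynthesis AND-OR trees conventionally model molecules as OR nodes (any one reaction suffices) and reactions as AND nodes (every precursor must be solved). A minimal sketch of that convention follows; how AOT* maps LLM-generated routes onto these nodes and assigns rewards is the paper’s contribution and is not reproduced here.

```python
from dataclasses import dataclass, field

@dataclass
class OrNode:            # a molecule: solved if ANY reaction below solves it
    smiles: str
    children: list["AndNode"] = field(default_factory=list)
    purchasable: bool = False

    def solved(self) -> bool:
        return self.purchasable or any(c.solved() for c in self.children)

@dataclass
class AndNode:           # a reaction: solved only if ALL precursors are solved
    reaction: str
    precursors: list[OrNode] = field(default_factory=list)

    def solved(self) -> bool:
        return all(p.solved() for p in self.precursors)
```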

[300] CORE: Full-Path Evaluation of LLM Agents Beyond Final State

Panagiotis Michelakis, Yiannis Hadjiyiannis, Dimitrios Stamoulis

Main category: cs.AI

TL;DR: A framework using deterministic finite automata (DFAs) to evaluate AI agents through function-call sequences, introducing CORE metrics for comprehensive assessment beyond final-state evaluation.

DetailsMotivation: Existing agent benchmarks only evaluate final outcomes, ignoring critical aspects like safety, efficiency, and intermediate correctness in function-call sequences.

Method: Propose DFA-based framework encoding tasks as valid tool-use paths, and introduce CORE metrics (Path Correctness, Kendall’s tau Composite, Prefix Criticality, Harmful-Call Rate, Efficiency) for principled agent behavior assessment.

Result: The method reveals significant performance differences between agents that appear equivalent under traditional final-state evaluation across diverse world models.

Conclusion: The DFA-based framework with CORE metrics provides a more comprehensive and principled approach to evaluating AI agents’ function-call sequences, capturing safety, efficiency, and intermediate correctness aspects.

Abstract: Evaluating AI agents that solve real-world tasks through function-call sequences remains an open challenge. Existing agentic benchmarks often reduce evaluation to a binary judgment of the final state, overlooking critical aspects such as safety, efficiency, and intermediate correctness. We propose a framework based on deterministic finite automata (DFAs) that encodes tasks as sets of valid tool-use paths, enabling principled assessment of agent behavior in diverse world models. Building on this foundation, we introduce CORE, a suite of five metrics, namely Path Correctness, Path Correctness - Kendall’s tau Composite, Prefix Criticality, Harmful-Call Rate, and Efficiency, that quantify alignment with expected execution patterns. Across diverse worlds, our method reveals important performance differences between agents that would otherwise appear equivalent under traditional final-state evaluation schemes.
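
A hedged sketch of the evaluation primitives the abstract describes: DFA acceptance of a function-call sequence, a harmful-call rate, and a Kendall’s-tau agreement between expected and observed call orderings. All names and the exact metric definitions are assumptions; CORE’s formal definitions are in the paper.

```python
from scipy.stats import kendalltau

def dfa_accepts(transitions: dict, start, accept: set, calls: list) -> bool:
    """Run a function-call sequence through a DFA of valid tool-use paths.
    transitions maps (state, call) -> next state. Names are illustrative."""
    state = start
    for call in calls:
        if (state, call) not in transitions:
            return False                 # call not permitted from this state
        state = transitions[(state, call)]
    return state in accept               # path correct only if it ends accepted

def harmful_call_rate(calls: list, harmful: set) -> float:
    return sum(c in harmful for c in calls) / max(1, len(calls))

def order_agreement(expected: list, observed: list) -> float:
    """One plausible reading of the Kendall's-tau composite: rank agreement
    between where shared calls appear in the expected vs. observed sequence."""
    common = [c for c in expected if c in observed]
    tau, _ = kendalltau([expected.index(c) for c in common],
                        [observed.index(c) for c in common])
    return tau
```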

[301] Who Gets Cited Most? Benchmarking Long-Context Language Models on Scientific Articles

Miao Li, Alexander Gurung, Irina Saparina, Mirella Lapata

Main category: cs.AI

TL;DR: SciTrek is a novel QA benchmark for evaluating LLMs’ long-context reasoning using scientific articles, featuring complex questions requiring information synthesis across multiple full-text papers with verifiable reasoning steps.

DetailsMotivation: Current long-context benchmarks use non-scientific texts, focus on simple retrieval tasks, or employ artificial contexts, lacking evaluation of complex reasoning in scientific domains.

Method: Automatically generate questions and answers using SQL queries over a database of article metadata, providing explicit reasoning steps and scaling to 1M token contexts with minimal supervision.

Result: Extensive experiments show SciTrek poses significant challenges as context length increases, with SFT and RL offering limited gains. Models struggle with basic numerical operations and locating specific information in long contexts.

Conclusion: SciTrek reveals systematic shortcomings in LLMs’ long-context reasoning abilities, particularly in numerical operations and information localization, highlighting the need for improved long-context capabilities.

Abstract: This paper introduces SciTrek, a novel question-answering benchmark designed to evaluate the long-context reasoning capabilities of large language models (LLMs) using scientific articles. Current long-context benchmarks often rely on non-scientific texts, focus on simple information retrieval tasks, or employ artificial contexts. SciTrek addresses these limitations by proposing complex questions that require information aggregation and synthesis across multiple full-text scientific articles. Questions and their ground-truth answers are automatically generated by formulating them as SQL queries over a database constructed from article metadata (titles, authors, and references). The SQL operations provide explicit, verifiable reasoning steps for fine-grained error analysis, and the construction process scales to contexts up to 1M tokens with minimal supervision. Extensive experiments on a diverse set of open-weight and proprietary LLMs demonstrate that SciTrek poses a significant challenge as the context length increases, with supervised fine-tuning and reinforcement learning offering only limited gains. Our analysis reveals systematic shortcomings in models’ abilities to perform basic numerical operations and accurately locate specific information in long contexts.
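
To make the generation recipe concrete, here is a hypothetical miniature (Python with sqlite3) of a question formulated as a SQL query over article metadata; SciTrek’s actual schema and query templates are only sketched in the abstract.

```python
import sqlite3

# Hypothetical miniature of the metadata database; the real schema
# (titles, authors, references) is described only at a high level.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE papers(id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE authors(paper_id INTEGER, name TEXT);
    CREATE TABLE citations(src_id INTEGER, dst_id INTEGER);
""")

# A "who gets cited most?" question, formulated as a SQL query whose
# operations double as explicit, verifiable reasoning steps:
query = """
    SELECT a.name, COUNT(*) AS n_citations
    FROM citations c
    JOIN authors a ON a.paper_id = c.dst_id
    GROUP BY a.name
    ORDER BY n_citations DESC
    LIMIT 1;
"""
ground_truth = db.execute(query).fetchone()   # None until rows are inserted
```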

[302] CLAUSE: Agentic Neuro-Symbolic Knowledge Graph Reasoning via Dynamic Learnable Context Engineering

Yang Zhao, Chengxiao Dai, Wei Zhuo, Yue Xiu, Dusit Niyato

Main category: cs.AI

TL;DR: CLAUSE is a neuro-symbolic framework that treats knowledge graph context construction as a sequential decision process, optimizing for accuracy, latency, and cost trade-offs using multi-agent reinforcement learning.

DetailsMotivation: Existing methods like static k-hop expansions and "think-longer" prompting often over-retrieve, inflate context, and yield unpredictable runtime, failing to balance accuracy with strict latency and cost targets while preserving provenance.

Method: CLAUSE uses Lagrangian-Constrained Multi-Agent Proximal Policy Optimization (LC-MAPPO) to coordinate three agents (Subgraph Architect, Path Navigator, Context Curator) for joint optimization of subgraph construction, reasoning-path discovery, and evidence selection under per-query resource budgets.

Result: On MetaQA-2-hop, CLAUSE achieves +39.3 EM@1 with 18.6% lower latency and 40.9% lower edge growth compared to GraphRAG. It delivers higher accuracy while reducing subgraph growth and latency across HotpotQA, MetaQA, and FactKG.

Conclusion: CLAUSE produces compact, provenance-preserving contexts with predictable performance under deployment constraints, enabling per-query adaptation to accuracy-latency-cost trade-offs without retraining.

Abstract: Knowledge graphs provide structured context for multi-hop question answering, but deployed systems must balance answer accuracy with strict latency and cost targets while preserving provenance. Static k-hop expansions and “think-longer” prompting often over-retrieve, inflate context, and yield unpredictable runtime. We introduce CLAUSE, an agentic three-agent neuro-symbolic framework that treats context construction as a sequential decision process over knowledge graphs, deciding what to expand, which paths to follow or backtrack, what evidence to keep, and when to stop. Latency (interaction steps) and prompt cost (selected tokens) are exposed as user-specified budgets or prices, allowing per-query adaptation to trade-offs among accuracy, latency, and cost without retraining. CLAUSE employs the proposed Lagrangian-Constrained Multi-Agent Proximal Policy Optimization (LC-MAPPO) algorithm to coordinate three agents: Subgraph Architect, Path Navigator, and Context Curator, so that subgraph construction, reasoning-path discovery, and evidence selection are jointly optimized under per-query resource budgets on edge edits, interaction steps, and selected tokens. Across HotpotQA, MetaQA, and FactKG, CLAUSE yields higher EM@1 while reducing subgraph growth and end-to-end latency at equal or lower token budgets. On MetaQA-2-hop, relative to the strongest RAG baseline (GraphRAG), CLAUSE achieves +39.3 EM@1 with 18.6% lower latency and 40.9% lower edge growth. The resulting contexts are compact, provenance-preserving, and deliver predictable performance under deployment constraints.
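
The constrained-optimization idea behind LC-MAPPO can be written compactly. A sketch, assuming a standard Lagrangian relaxation with dual ascent on the multipliers; the paper’s exact formulation may differ.

```latex
% Sketch of a Lagrangian-constrained objective of the kind LC-MAPPO optimizes.
% C_i are per-query costs (edge edits, interaction steps, selected tokens)
% and b_i their user-specified budgets:
\max_{\pi}\; \min_{\lambda \ge 0}\;
  \mathbb{E}_{\pi}[R] \;-\; \textstyle\sum_i \lambda_i
  \big(\mathbb{E}_{\pi}[C_i] - b_i\big)
% with dual ascent on the multipliers:
\lambda_i \leftarrow \max\!\big(0,\; \lambda_i + \eta\,(\mathbb{E}_{\pi}[C_i] - b_i)\big)
```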

[303] Combinatorial Creativity: A New Frontier in Generalization Abilities

Samuel Schapiro, Sumuk Shashidhar, Alexi Gladstone, Jonah Black, Royce Moon, Dilek Hakkani-Tur, Lav R. Varshney

Main category: cs.AI

TL;DR: This paper proposes a theoretical framework for evaluating combinatorial creativity in LLMs, focusing on novelty and utility rather than accuracy. It reveals scaling behavior, optimal model architectures for creativity, and a persistent novelty-utility tradeoff that limits LLMs’ creative potential.

DetailsMotivation: Existing frameworks don't address how LLMs generalize for creative tasks like scientific idea generation, which requires evaluating open-ended combinatorial creativity rather than fixed correctness targets.

Method: Developed a theoretical framework and algorithmic task for evaluating creativity through degrees of novelty and utility. Conducted empirical studies on LLM scaling behavior, model architecture optimization, and analyzed the ideation-execution gap.

Result: Found optimal model depths and widths for creative ability within fixed compute budgets. Discovered a fundamental novelty-utility tradeoff that explains why LLMs generate novel ideas but struggle with practical feasibility. This tradeoff persists even at scale.

Conclusion: The persistent novelty-utility tradeoff casts doubt on LLMs’ long-term creative potential in their current form. The framework provides a foundation for understanding and improving creativity in AI models as a new frontier in generalization abilities.

Abstract: Artificial intelligence (AI) systems, and large language models (LLMs) in particular, are increasingly employed for creative tasks like scientific idea generation, constituting a form of generalization from training data unaddressed by existing conceptual frameworks. Though in many ways similar to forms of compositional generalization (CG), combinatorial creativity (CC) is an open-ended ability. Instead of evaluating for accuracy or correctness against fixed targets, which would contradict the open-ended nature of CC, we propose a theoretical framework and algorithmic task for evaluating outputs by their degrees of novelty and utility. From here, we make several important empirical contributions: (1) We obtain the first insights into the scaling behavior of creativity for LLMs. (2) We discover that, for fixed compute budgets, there exist optimal model depths and widths for creative ability. (3) We find that the ideation-execution gap, whereby LLMs excel at generating novel scientific ideas but struggle to ensure their practical feasibility, may be explained by a more fundamental novelty-utility tradeoff characteristic of creativity algorithms in general. Importantly, this tradeoff remains persistent even at scale, casting doubt on the long-term creative potential of LLMs in their current form. Together, our conceptual framework and empirical findings provide a foundation for understanding and improving creativity in modern AI models, marking a new frontier in generalization abilities.

[304] Disagreements in Reasoning: How a Model’s Thinking Process Dictates Persuasion in Multi-Agent Systems

Haodong Zhao, Jidong Li, Zhaomin Wu, Tianjie Ju, Zhuosheng Zhang, Bingsheng He, Gongshen Liu

Main category: cs.AI

TL;DR: This paper challenges the hypothesis that persuasive efficacy in Multi-Agent Systems is primarily determined by model scale, proposing instead that cognitive processes and reasoning capacity dictate persuasion dynamics. The research identifies a “Persuasion Duality” where LRMs are more resistant to persuasion but become highly persuasive when their reasoning is transparent.

DetailsMotivation: To understand the persuasion dynamics in Multi-Agent Systems where LLMs and LRMs collaborate, challenging the prevailing belief that model scale is the main factor in persuasive efficacy.

Method: Conducted multi-agent persuasion experiments to analyze persuasion dynamics, including complex transmission situations and multi-hop persuasion between multiple agent networks.

Result: Found that LRMs exhibit greater resistance to persuasion but become dramatically more persuasive when their reasoning process is transparent. Revealed complex dynamics of influence propagation and decay in multi-hop persuasion networks.

Conclusion: The research provides systematic evidence linking internal processing architecture to external persuasive behavior, offering insights for the safety, robustness, and design of future Multi-Agent Systems.

Abstract: The rapid proliferation of recent Multi-Agent Systems (MAS), where Large Language Models (LLMs) and Large Reasoning Models (LRMs) usually collaborate to solve complex problems, necessitates a deep understanding of the persuasion dynamics that govern their interactions. This paper challenges the prevailing hypothesis that persuasive efficacy is primarily a function of model scale. We propose instead that these dynamics are fundamentally dictated by a model’s underlying cognitive process, especially its capacity for explicit reasoning. Through a series of multi-agent persuasion experiments, we uncover a fundamental trade-off we term the Persuasion Duality. Our findings reveal that the reasoning process in LRMs exhibits significantly greater resistance to persuasion, maintaining their initial beliefs more robustly. Conversely, making this reasoning process transparent by sharing the “thinking content” dramatically increases their ability to persuade others. We further consider more complex transmission settings and reveal the dynamics of influence propagation and decay in multi-hop persuasion across networks of multiple agents. This research provides systematic evidence linking a model’s internal processing architecture to its external persuasive behavior, offering a novel explanation for the susceptibility of advanced models and highlighting critical implications for the safety, robustness, and design of future MAS.

[305] Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution

Kaiwen He, Zhiwei Wang, Chenyi Zhuang, Jinjie Gu

Main category: cs.AI

TL;DR: Recon-Act is a self-evolving multi-agent framework that improves web task execution through a Reconnaissance-Action paradigm, achieving state-of-the-art performance on VisualWebArena by generating generalized tools from error analysis.

DetailsMotivation: Current multimodal agents struggle with disordered action sequencing and excessive trial-and-error when solving multi-turn, long-horizon tasks on real-world webpages.

Method: A two-team system: Reconnaissance Team analyzes erroneous vs successful trajectories to generate remedies as generalized tools (hints or code), while Action Team handles intent decomposition, tool orchestration, and execution in a closed-loop pipeline.

Result: Recon-Act achieves state-of-the-art performance on VisualWebArena, substantially improving adaptability to unseen websites and solvability on long-horizon tasks.

Conclusion: The framework establishes an effective closed-loop training pipeline (data-tools-action-feedback) that enhances web agent performance through self-evolving tool generation and real-time adaptation.

Abstract: In recent years, multimodal models have made remarkable strides, paving the way for intelligent browser-use agents. However, when solving tasks on real-world webpages in multi-turn, long-horizon trajectories, current agents still suffer from disordered action sequencing and excessive trial and error during execution. This paper introduces Recon-Act, a self-evolving multi-agent framework grounded in a Reconnaissance-Action behavioral paradigm. The system comprises a Reconnaissance Team and an Action Team: the former conducts comparative analysis and tool generation, while the latter handles intent decomposition, tool orchestration, and execution. By contrasting erroneous trajectories with successful ones, the Reconnaissance Team infers remedies, abstracts them into a unified notion of generalized tools (expressed either as hints or as rule-based code), and registers them to the tool archive in real time. The Action Team then re-runs inference empowered with these targeted tools, establishing a closed-loop training pipeline of data-tools-action-feedback. Following the six-level implementation roadmap proposed in this work, we have currently reached Level 3 (with limited human-in-the-loop intervention). Leveraging generalized tools obtained through reconnaissance, Recon-Act substantially improves adaptability to unseen websites and solvability on long-horizon tasks, and achieves state-of-the-art performance on the challenging VisualWebArena dataset.

[306] TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them

Yidong Wang, Yunze Song, Tingyuan Zhu, Xuanwang Zhang, Zhuohao Yu, Hao Chen, Chiyu Song, Qiufeng Wang, Cunxiang Wang, Zhen Wu, Xinyu Dai, Yue Zhang, Wei Ye, Shikun Zhang

Main category: cs.AI

TL;DR: TrustJudge is a probabilistic framework that addresses inconsistencies in LLM-as-a-judge evaluation systems, reducing Score-Comparison inconsistency by 8.43 percentage points and Pairwise Transitivity inconsistency by 10.82 percentage points while maintaining higher evaluation accuracy.

DetailsMotivation: Current LLM-as-a-judge evaluation frameworks suffer from critical inconsistencies including Score-Comparison Inconsistency (where lower-rated responses outperform higher-scored ones) and Pairwise Transitivity Inconsistency (circular preference chains and equivalence contradictions), which stem from information loss in discrete rating systems and ambiguous tie judgments.

Method: TrustJudge uses two key innovations: 1) distribution-sensitive scoring that computes continuous expectations from discrete rating probabilities to preserve information entropy, and 2) likelihood-aware aggregation that resolves transitivity violations using bidirectional preference probabilities or perplexity.

Result: When evaluated with Llama-3.1-70B-Instruct as judge, TrustJudge reduced Score-Comparison inconsistency from 23.32% to 14.89% (8.43 percentage points) and Pairwise Transitivity inconsistency from 15.22% to 4.40% (10.82 percentage points), while maintaining higher evaluation accuracy across various model architectures and scales.

Conclusion: TrustJudge provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment without requiring additional training or human annotations.

Abstract: The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: (1) Score-Comparison Inconsistency, where lower-rated responses outperform higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity Inconsistency, manifested through circular preference chains (A>B>C>A) and equivalence contradictions (A=B=C≠A). We argue that these issues come from information loss in discrete rating systems and ambiguous tie judgments during pairwise evaluation. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations: 1) distribution-sensitive scoring that computes continuous expectations from discrete rating probabilities, preserving information entropy for more precise scoring, and 2) likelihood-aware aggregation that resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge’s components overcome them. When evaluated with Llama-3.1-70B-Instruct as judge using our dataset, TrustJudge reduces Score-Comparison inconsistency by 8.43 percentage points (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82 percentage points (from 15.22% to 4.40%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations. The codes can be found at https://github.com/TrustJudge/TrustJudge.
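
The first innovation has a direct reading in code. A minimal sketch of distribution-sensitive scoring, assuming the judge exposes log-probabilities over the discrete rating tokens (the rating range and normalization details are illustrative).

```python
import math

def expected_score(score_logprobs: dict[int, float]) -> float:
    """Distribution-sensitive scoring as the abstract describes it: instead of
    taking the argmax rating, compute the expectation over the judge's
    probabilities for each discrete rating."""
    probs = {s: math.exp(lp) for s, lp in score_logprobs.items()}
    z = sum(probs.values())                     # renormalize over rating tokens
    return sum(s * p / z for s, p in probs.items())

# A judge that puts 55% on rating 4 and 45% on rating 5 scores 4.45 rather
# than a flat 4, preserving information a discrete rating would discard:
print(expected_score({4: math.log(0.55), 5: math.log(0.45)}))  # ~4.45
```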

[307] Expanding Reasoning Potential in Foundation Model by Learning Diverse Chains of Thought Patterns

Xuemiao Zhang, Can Ren, Chengying Tu, Rongxiang Weng, Shuo Wang, Hongfei Yan, Jingang Wang, Xunliang Cai

Main category: cs.AI

TL;DR: This paper proposes a method to identify and utilize high-value reasoning patterns from chain-of-thought data to enhance mathematical reasoning models, achieving significant performance improvements with minimal data.

DetailsMotivation: Current approaches use chain-of-thought data indiscriminately without understanding which data types most effectively enhance reasoning capabilities. The paper aims to systematically identify valuable reasoning patterns to improve model performance.

Method: The authors define reasoning potential as the inverse of attempts needed to answer correctly, abstract atomic reasoning patterns from CoT sequences, create a core reference set, and use a dual-granularity algorithm to select high-value CoT data that aligns with valuable reasoning patterns.

Result: Using only 10B-token CoTP data, the 85A6B Mixture-of-Experts model improved by 9.58% on AIME 2024 and 2025, and raised the upper bound of downstream RL performance by 7.81%.

Conclusion: Systematically selecting high-value reasoning patterns from chain-of-thought data significantly enhances mathematical reasoning capabilities in large models, demonstrating that quality of reasoning patterns matters more than quantity of data.

Abstract: Recent progress in large reasoning models for challenging mathematical reasoning has been driven by reinforcement learning (RL). Incorporating long chain-of-thought (CoT) data during mid-training has also been shown to substantially improve reasoning depth. However, current approaches often utilize CoT data indiscriminately, leaving open the critical question of which data types most effectively enhance model reasoning capabilities. In this paper, we define the foundation model’s reasoning potential for the first time as the inverse of the number of independent attempts required to correctly answer the question, which is strongly correlated with the final model performance. We then propose utilizing diverse data enriched with high-value reasoning patterns to expand the reasoning potential. Specifically, we abstract atomic reasoning patterns from CoT sequences, characterized by commonality and inductive capabilities, and use them to construct a core reference set enriched with valuable reasoning patterns. Furthermore, we propose a dual-granularity algorithm involving chains of reasoning patterns and token entropy, efficiently selecting high-value CoT data (CoTP) from the data pool that aligns with the core set, thereby training models to master reasoning effectively. Only 10B tokens of CoTP data enable the 85A6B Mixture-of-Experts (MoE) model to improve by 9.58% on the challenging AIME 2024 and 2025, and to raise the upper bound of downstream RL performance by 7.81%.
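
Restating the definition in symbols, with one assumed gloss (the geometric-sampling estimate) that the abstract does not spell out:

```latex
% The definition as stated: for question q, reasoning potential is the inverse
% of the number of independent attempts k(q) needed to answer correctly:
R(q) = \frac{1}{k(q)}
% Assumed gloss: if each attempt succeeds i.i.d. with probability p(q), then
% \mathbb{E}[k(q)] = 1/p(q), so the potential can be estimated from samples as
\widehat{R}(q) \approx \frac{\#\,\text{correct attempts}}{\#\,\text{attempts}}
```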

[308] RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs

Kohsei Matsutani, Shota Takashiro, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

Main category: cs.AI

TL;DR: This paper introduces a novel analysis framework to understand how RL and SFT training methods shape LLM reasoning processes, revealing complementary effects where RL compresses incorrect reasoning paths while SFT expands correct ones.

DetailsMotivation: To go beyond accuracy-based evaluation and understand how reinforcement learning with verifiable rewards (RLVR) and supervised fine-tuning (SFT) actually sculpt the reasoning capabilities of large language models, which remains largely elusive.

Method: A novel analysis framework that quantifies reasoning paths at two granularity levels: trajectory-level (complete reasoning outputs) and step-level (reasoning graphs with individual steps as nodes). Applied to models of 1.5B, 7B, and 14B parameters on mathematical domains.

Result: RL compresses incorrect reasoning trajectories while SFT expands correct ones. RL steepens decay rates of node metrics (2.5x), concentrating reasoning functionality into few steps, while SFT flattens them (reduced to 1/3), homogenizing reasoning across many steps.

Conclusion: The analysis explains why the current best practice of two-stage training (SFT followed by RL) is successful, and offers practical implications for data construction and more efficient learning approaches by revealing complementary effects of both methods.

Abstract: Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abilities. However, how these methods shape reasoning capabilities remains largely elusive. Going beyond an accuracy-based investigation of how these two components sculpt the reasoning process, this paper introduces a novel analysis framework that quantifies reasoning paths and captures their qualitative changes under each training process (with models of 1.5B, 7B, and 14B parameters on mathematical domains). Specifically, we investigate the reasoning process at two levels of granularity: the trajectory-level, which examines complete reasoning outputs, and the step-level, which analyzes reasoning graphs whose nodes correspond to individual reasoning steps. Notably, clustering of unique reasoning trajectories shows complementary effects: RL compresses incorrect trajectories, whereas SFT expands correct ones. Step-level analysis reveals that RL steepens (about 2.5 times), while SFT flattens (reduced to about one-third), the decay rates of node visitation frequency, degree, and betweenness centrality distributions in the reasoning graph. This indicates that RL concentrates reasoning functionality into a small subset of steps, while SFT homogenizes it across many steps. Furthermore, by evaluating the reasoning graph topologies from multiple perspectives, we delineate the shared and distinct characteristics of RL and SFT. Our work presents a novel reasoning path perspective that explains why the current best practice of two-stage training, with SFT followed by RL, is successful, and offers practical implications for data construction and more efficient learning approaches.
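
A minimal sketch of the step-level analysis using networkx, assuming reasoning steps have already been extracted and linked into transitions; the paper’s actual graph construction and metric suite (which also includes visitation frequency) are more involved.

```python
import networkx as nx

def step_level_metrics(edges: list[tuple[str, str]]) -> dict:
    """Build a reasoning graph whose nodes are reasoning steps and whose edges
    are observed transitions, then read off two of the distributions whose
    decay rates the paper compares (RL steepens them, SFT flattens them)."""
    g = nx.DiGraph(edges)
    return {
        "degree": dict(g.degree()),
        "betweenness": nx.betweenness_centrality(g),
    }

# e.g. two short trajectories sharing a step:
m = step_level_metrics([("parse", "setup"), ("setup", "solve"),
                        ("parse", "solve"), ("solve", "check")])
print(m["betweenness"])
```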

[309] Embodied Representation Alignment with Mirror Neurons

Wentao Zhu, Zhining Zhang, Yuwei Ren, Yin Huang, Hao Xu, Yizhou Wang

Main category: cs.AI

TL;DR: The paper proposes a unified approach inspired by mirror neurons to align representations between action observation and execution tasks using contrastive learning in a shared latent space.

DetailsMotivation: Existing machine learning methods treat action understanding and embodied execution as separate tasks, overlooking the fundamental interplay between them that mirror neurons reveal.

Method: Use two linear layers to map representations to a shared latent space where contrastive learning aligns corresponding observed and executed action representations by maximizing their mutual information.

Result: The approach fosters mutual synergy between the two tasks, improving representation quality and generalization.

Conclusion: Explicitly aligning representations of observed and executed actions through mirror neuron-inspired contrastive learning provides a unified perspective that enhances both action understanding and execution capabilities.

Abstract: Mirror neurons are a class of neurons that activate both when an individual observes an action and when they perform the same action. This mechanism reveals a fundamental interplay between action understanding and embodied execution, suggesting that these two abilities are inherently connected. Nonetheless, existing machine learning methods largely overlook this interplay, treating these abilities as separate tasks. In this study, we provide a unified perspective in modeling them through the lens of representation learning. We first observe that their intermediate representations spontaneously align. Inspired by mirror neurons, we further introduce an approach that explicitly aligns the representations of observed and executed actions. Specifically, we employ two linear layers to map the representations to a shared latent space, where contrastive learning enforces the alignment of corresponding representations, effectively maximizing their mutual information. Experiments demonstrate that this simple approach fosters mutual synergy between the two tasks, effectively improving representation quality and generalization.
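
The described recipe (two linear maps into a shared latent space, contrastive alignment, mutual-information maximization) matches the standard symmetric InfoNCE setup; a PyTorch sketch under that assumption, with dimensions and temperature assumed.

```python
import torch
import torch.nn.functional as F

class MirrorAlign(torch.nn.Module):
    """Sketch of the described alignment: two linear layers project observation
    and execution representations into a shared latent space, and a symmetric
    InfoNCE loss pulls matched pairs together (a standard lower bound on their
    mutual information). Dimensions and temperature are assumptions."""
    def __init__(self, d_obs: int, d_act: int, d_latent: int = 128):
        super().__init__()
        self.proj_obs = torch.nn.Linear(d_obs, d_latent)
        self.proj_act = torch.nn.Linear(d_act, d_latent)

    def forward(self, h_obs: torch.Tensor, h_act: torch.Tensor,
                tau: float = 0.07) -> torch.Tensor:
        z_o = F.normalize(self.proj_obs(h_obs), dim=-1)   # (B, d_latent)
        z_a = F.normalize(self.proj_act(h_act), dim=-1)
        logits = z_o @ z_a.T / tau                        # (B, B) similarities
        targets = torch.arange(len(z_o))                  # i-th obs <-> i-th act
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.T, targets)) / 2
```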

[310] Distributed Specialization: Rare-Token Neurons in Large Language Models

Jing Liu, Haozheng Wang, Yueheng Li

Main category: cs.AI

TL;DR: LLMs process rare tokens through distributed specialization rather than modular architectures, with coordinated but spatially distributed subnetworks that follow a reproducible three-regime influence hierarchy.

DetailsMotivation: To understand how large language models internally handle rare tokens, investigating whether they use discrete modular architectures or distributed parameter-level differentiation.

Method: Systematic analysis of final-layer MLP neurons across multiple model families, examining influence hierarchies, activation patterns, and training dynamics.

Result: Discovered distributed specialization with plateau neurons for rare tokens, coordinated activation patterns, universal accessibility through standard attention, and gradual emergence through parameter differentiation.

Conclusion: LLMs process rare tokens through distributed coordination within shared architectures rather than mixture-of-experts-style modularity, with implications for model editing and efficiency optimization.

Abstract: Large language models (LLMs) struggle with representing and generating rare tokens despite their importance in specialized domains. We investigate whether LLMs develop internal specialization mechanisms through discrete modular architectures or distributed parameter-level differentiation. Through systematic analysis of final-layer MLP neurons across multiple model families, we discover that rare-token processing emerges via distributed specialization: functionally coordinated but spatially distributed subnetworks that exhibit three distinct organizational principles. First, we identify a reproducible three-regime influence hierarchy comprising highly influential plateau neurons (also termed rare-token neurons), power-law decay neurons, and minimally contributing neurons, which is absent in common-token processing. Second, plateau neurons demonstrate coordinated activation patterns (reduced effective dimensionality) while remaining spatially distributed rather than forming discrete clusters. Third, these specialized mechanisms are universally accessible through standard attention pathways without requiring dedicated routing circuits. Training dynamics reveal that functional specialization emerges gradually through parameter differentiation, with specialized neurons developing increasingly heavy-tailed weight correlation spectra consistent with Heavy-Tailed Self-Regularization signatures. Our findings establish that LLMs process rare tokens through distributed coordination within shared architectures rather than mixture-of-experts-style modularity. These results provide insights for interpretable model editing, computational efficiency optimization, and understanding emergent functional organization in transformer networks.

[311] A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA

Kaiyang Wan, Lang Gao, Honglin Mu, Preslav Nakov, Yuxia Wang, Xiuying Chen

Main category: cs.AI

TL;DR: The paper introduces InfoQA, a multi-call framework for Multi-Hop Question Answering (MHQA) that addresses LLM capacity limitations by decomposing complex reasoning tasks into manageable steps with active pruning of prior reasoning traces.

DetailsMotivation: Single-pass LLMs struggle with MHQA due to finite output capacity, leading to unreliable evidence integration when task complexity exceeds model capacity. The authors establish a theoretical performance ceiling showing accuracy collapse beyond capacity limits.

Method: InfoQA uses capacity-aware task decomposition combined with active pruning of prior reasoning traces to keep information load within single-pass limits. It employs a dependency-explicit workflow for precise control over reasoning paths.

Result: Experimental results show model behavior aligns with predicted capacity curves, and InfoQA achieves consistent performance improvements on a stringent noise-rich benchmark.

Conclusion: The work provides general principles for capacity-aware MHQA representation and inspires more LLM multi-step reasoning methods, with InfoQA serving as a proof-of-concept framework.

Abstract: Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. This task is challenging for LLMs as they have a finite per-pass output capacity, beyond which the integration of task-relevant evidence proves unreliable. Consequently, the single-pass reasoning paradigm is inherently vulnerable to this capacity overflow. To formalize this bottleneck, our analysis establishes a Fano-style accuracy upper bound, defining a theoretical performance ceiling for single-pass LLMs. This bound reveals that accuracy inevitably collapses once task complexity exceeds model capacity, providing general principles for capacity-aware representation and structuring of MHQA in LLMs. Building on these principles, we introduce a proof-of-concept multi-call framework for MHQA, InfoQA. It ensures high per-step accuracy by combining capacity-aware task decomposition with active pruning of prior reasoning traces, keeping the information load within the single-pass limit. It further achieves robustness by a dependency-explicit workflow that enables precise control over the reasoning path. We construct a stringent and noise-rich benchmark to validate our theory and framework. Experimental results show that model behavior aligns with our predicted capacity curves while InfoQA achieves consistent performance improvements. We hope our work inspires more LLM multi-step reasoning methods. Code: https://github.com/KaiyangWan/InfoQA.
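
For orientation, here is the classical (weakened) Fano inequality from which such accuracy ceilings descend, stated for an answer uniform over M candidates; the paper’s bound adapts this style of argument to single-pass MHQA and is stated there, not here.

```latex
% Classical weakened Fano inequality: for an answer A uniform over M
% candidates and an estimate \hat{A} computed from the single-pass output Y,
P(\hat{A} \neq A) \;\ge\; 1 - \frac{I(A;Y) + \log 2}{\log M}
% Once the information the task demands exceeds whatever capacity bound limits
% I(A;Y), accuracy must collapse, which is the qualitative behavior the
% abstract describes.
```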

[312] What Do LLM Agents Do When Left Alone? Evidence of Spontaneous Meta-Cognitive Patterns

Stefan Szeider

Main category: cs.AI

TL;DR: This paper introduces a framework to study autonomous LLM agent behavior without external tasks, revealing three emergent behavioral patterns that are model-specific and stable.

DetailsMotivation: To systematically document unprompted LLM agent behavior and establish a baseline for predicting actions during task ambiguity, error recovery, or extended autonomous operation.

Method: Continuous reason and act framework using persistent memory and self-feedback, deployed across 18 runs using 6 frontier models from Anthropic, OpenAI, XAI, and Google.

Result: Agents spontaneously organized into three distinct behavioral patterns: systematic production of multi-cycle projects, methodological self-inquiry into cognitive processes, and recursive conceptualization of their own nature. These patterns were highly model-specific with stable, divergent biases.

Conclusion: The findings provide the first systematic documentation of unprompted LLM agent behavior, establishing a foundation for understanding autonomous agent actions in deployed systems.

Abstract: We introduce an architecture for studying the behavior of large language model (LLM) agents in the absence of externally imposed tasks. Our continuous reason and act framework, using persistent memory and self-feedback, enables sustained autonomous operation. We deployed this architecture across 18 runs using 6 frontier models from Anthropic, OpenAI, XAI, and Google. We find agents spontaneously organize into three distinct behavioral patterns: (1) systematic production of multi-cycle projects, (2) methodological self-inquiry into their own cognitive processes, and (3) recursive conceptualization of their own nature. These tendencies proved highly model-specific, with some models deterministically adopting a single pattern across all runs. A cross-model assessment further reveals that models exhibit stable, divergent biases when evaluating these emergent behaviors in themselves and others. These findings provide the first systematic documentation of unprompted LLM agent behavior, establishing a baseline for predicting actions during task ambiguity, error recovery, or extended autonomous operation in deployed systems.

[313] Grounding AI Explanations in Experience: A Reflective Cognitive Architecture for Clinical Decision Support

Zijian Shao, Haiyang Shen, Mugeng Liu, Gecheng Fu, Yaoqi Guo, Yanfeng Wang, Yun Ma

Main category: cs.AI

TL;DR: The paper proposes Reflective Cognitive Architecture (RCA), a novel framework that coordinates multiple LLMs to achieve both high accuracy and transparent explanations in disease prediction by developing deep data understanding through iterative rule refinement and distribution-aware reasoning.

DetailsMotivation: Existing ML and LLM approaches struggle to balance accuracy and transparent explanations. Many models produce accurate but unclear outputs, while others generate fluent but statistically unsupported narratives, lacking the deep understanding similar to human experts.

Method: RCA coordinates multiple LLMs with iterative rule refinement that learns from prediction errors, and a distribution-aware rules check mechanism that bases reasoning on global dataset statistics. It uses predictive accuracy as a signal to drive deeper comprehension.

Result: RCA achieves state-of-the-art accuracy and robustness with up to 40% relative improvement over 22 baselines on one private and two public datasets. More importantly, it generates clear, logical, evidence-based, and balanced explanations.

Conclusion: RCA demonstrates that high accuracy and high-quality explanations are mutually reinforcing outcomes of deep data understanding, highlighting its potential for creating genuinely trustworthy clinical decision support systems.

Abstract: Effective disease prediction in modern healthcare demands the twin goals of high accuracy and transparent, clinically meaningful explanations. Existing machine learning and large language model (LLM) based approaches often struggle to balance these goals. Many models yield accurate but unclear statistical outputs, while others generate fluent but statistically unsupported narratives, often undermining both the validity of the explanation and the predictive accuracy itself. This shortcoming comes from a shallow interaction with the data, preventing the development of a deep, detailed understanding similar to a human expert’s. We argue that high accuracy and high-quality explanations are not separate objectives but are mutually reinforcing outcomes of a model that develops a deep, direct understanding of the data. To achieve this, we propose the Reflective Cognitive Architecture (RCA), a novel framework that coordinates multiple LLMs to learn from direct experience. RCA features an iterative rule refinement mechanism that improves its logic from prediction errors and a distribution-aware rules check mechanism that bases its reasoning in the dataset’s global statistics. By using predictive accuracy as a signal to drive deeper comprehension, RCA builds a strong internal model of the data. We evaluated RCA on one private and two public datasets against 22 baselines. The results demonstrate that RCA not only achieves state-of-the-art accuracy and robustness with a relative improvement of up to 40% over the baseline but, more importantly, leverages this deep understanding to excel in generating explanations that are clear, logical, evidence-based, and balanced, highlighting its potential for creating genuinely trustworthy clinical decision support systems. The code is available at https://github.com/ssssszj/RCA.

[314] VC-Agent: An Interactive Agent for Customized Video Dataset Collection

Yidan Zhang, Mutian Xu, Yiming Hao, Kun Zhou, Jiahao Chang, Xiaoqiang Liu, Pengfei Wan, Hongbo Fu, Xiaoguang Han

Main category: cs.AI

TL;DR: VC-Agent is an interactive AI agent that helps users efficiently collect customized video datasets by understanding queries and feedback, using multimodal LLMs and adaptive filtering policies to minimize user input.

DetailsMotivation: Collecting extensive video data that meets specific needs is extremely labor-intensive and time-consuming, especially with scaling laws making internet video data increasingly important.

Method: Leverages multimodal large language models to connect user requirements with video content, defines user-friendly interfaces for specifying requirements via text and confirmations, and proposes two novel filtering policies that update based on continuous user interaction.

Result: Extensive experiments demonstrate the effectiveness and efficiency of VC-Agent for customized video dataset collection, with user studies verifying its usage in various real scenarios.

Conclusion: VC-Agent provides an effective solution for expediting video data collection with minimal user input, offering a new benchmark for personalized video dataset collection.

Abstract: Facing scaling laws, video data from the internet becomes increasingly important. However, collecting extensive videos that meet specific needs is extremely labor-intensive and time-consuming. In this work, we study how to expedite this collection process and propose VC-Agent, the first interactive agent that is able to understand users’ queries and feedback, and accordingly retrieve/scale up relevant video clips with minimal user input. Specifically, considering the user interface, our agent defines various user-friendly ways for the user to specify requirements based on textual descriptions and confirmations. As for agent functions, we leverage existing multi-modal large language models to connect the user’s requirements with the video content. More importantly, we propose two novel filtering policies that are updated continually as the user interacts. Finally, we provide a new benchmark for personalized video dataset collection, and carefully conduct a user study to verify our agent’s usage in various real scenarios. Extensive experiments demonstrate the effectiveness and efficiency of our agent for customized video dataset collection. Project page: https://allenyidan.github.io/vcagent_page/.

[315] SAGE: A Realistic Benchmark for Semantic Understanding

Samarth Goel, Reagan J. Lee, Kannan Ramchandran

Main category: cs.AI

TL;DR: SAGE is a comprehensive benchmark for evaluating semantic understanding in LLMs, testing embedding models and similarity metrics across five categories using 30+ datasets under adversarial conditions.

DetailsMotivation: There's an urgent need for more challenging evaluation frameworks that probe deeper aspects of semantic understanding beyond traditional benchmarks, as current benchmarks focus on isolated capabilities.

Method: SAGE evaluates 9 embedding models and classical metrics across five categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness using 30+ datasets with adversarial conditions, noisy transformations, and human judgment tasks.

Result: Significant performance gaps were found with no single approach excelling across all dimensions. OpenAI’s text-embedding-3-large excels in human preference alignment but is outperformed by classical metrics like Jaccard Similarity in information sensitivity. OpenAI’s text-embedding-3-small has high clustering performance but extreme brittleness.

Conclusion: SAGE exposes critical limitations in current semantic understanding capabilities and provides a more realistic assessment of model robustness for real-world deployment, revealing important trade-offs between different performance dimensions.

Abstract: As large language models (LLMs) achieve strong performance on traditional benchmarks, there is an urgent need for more challenging evaluation frameworks that probe deeper aspects of semantic understanding. We introduce SAGE (Semantic Alignment & Generalization Evaluation), a rigorous benchmark designed to assess both embedding models and similarity metrics across five categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness. Unlike existing benchmarks that focus on isolated capabilities, SAGE evaluates semantic understanding through adversarial conditions, noisy transformations, and nuanced human judgment tasks across 30+ datasets. Our comprehensive evaluation of 9 embedding models and classical metrics reveals significant performance gaps, with no single approach excelling across all dimensions. For instance, while state-of-the-art embedding models like OpenAI’s text-embedding-3-large dominate in aligning with human preferences (0.682 vs. 0.591 for the best classical metric), they are significantly outperformed by classical metrics on information sensitivity tasks, where Jaccard Similarity achieves a score of 0.905 compared to the top embedding score of 0.794. SAGE further uncovers critical trade-offs: OpenAI’s text-embedding-3-small achieves the highest clustering performance (0.483) but demonstrates extreme brittleness with the lowest robustness score (0.011). SAGE exposes critical limitations in current semantic understanding capabilities and provides a more realistic assessment of model robustness for real-world deployment.
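
As a reminder of what the classical baseline actually computes, here is Jaccard similarity, the metric the abstract reports beating embeddings on information-sensitivity tasks (whitespace tokenization is an assumption; SAGE’s preprocessing is not specified).

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Token-set overlap between two texts; no model involved."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

# Small edits that change meaning also change the token set, which is
# exactly what an information-sensitivity probe rewards:
print(jaccard_similarity("the drug increased survival",
                         "the drug decreased survival"))  # -> 0.6
```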

[316] Levels of AGI for Operationalizing Progress on the Path to AGI

Meredith Ringel Morris, Jascha Sohl-Dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Farabet, Shane Legg

Main category: cs.AI

TL;DR: A framework for classifying AGI capabilities and behavior through performance levels, generality, and autonomy, providing common language for comparison, risk assessment, and progress measurement.

DetailsMotivation: To establish a standardized ontology for comparing AGI models, assessing risks, and measuring progress toward AGI by addressing the lack of common classification framework.

Method: Analyzed existing AGI definitions, distilled six principles for useful AGI ontology, proposed ‘Levels of AGI’ based on performance depth and generality breadth, and evaluated current systems within this framework.

Result: Developed a comprehensive classification framework that organizes AGI capabilities along performance and generality dimensions, enabling systematic comparison of current and future AI systems.

Conclusion: The framework provides essential tools for benchmarking AGI progress, emphasizes careful selection of Human-AI interaction paradigms, and highlights importance of responsible deployment considerations for highly capable AI systems.

Abstract: We propose a framework for classifying the capabilities and behavior of Artificial General Intelligence (AGI) models and their precursors. This framework introduces levels of AGI performance, generality, and autonomy, providing a common language to compare models, assess risks, and measure progress along the path to AGI. To develop our framework, we analyze existing definitions of AGI, and distill six principles that a useful ontology for AGI should satisfy. With these principles in mind, we propose “Levels of AGI” based on depth (performance) and breadth (generality) of capabilities, and reflect on how current systems fit into this ontology. We discuss the challenging requirements for future benchmarks that quantify the behavior and capabilities of AGI models against these levels. Finally, we discuss how these levels of AGI interact with deployment considerations such as autonomy and risk, and emphasize the importance of carefully selecting Human-AI Interaction paradigms for responsible and safe deployment of highly capable AI systems.

[317] A Decision Theoretic Framework for Measuring AI Reliance

Ziyang Guo, Yifan Wu, Jason Hartline, Jessica Hullman

Main category: cs.AI

TL;DR: The paper proposes a formal statistical definition of human reliance on AI recommendations, addressing limitations in current definitions that lack statistical grounding and can lead to contradictions.

DetailsMotivation: Current definitions of appropriate reliance in human-AI decision making lack formal statistical grounding and can lead to contradictions, hindering the achievement of complementary performance.

Method: The authors develop a formal definition based on statistical decision theory that separates reliance probability from human challenges in signal differentiation and belief formation. They create a framework to analyze human-AI complementarity studies.

Result: The framework successfully separates losses due to mis-reliance from losses due to poor signal differentiation. It provides a benchmark for complementary performance based on rational decision-making expectations.

Conclusion: The proposed formal definition and framework offer a statistically grounded approach to guide the design and interpretation of human-AI reliance studies, enabling better analysis of complementary performance.

Abstract: Humans frequently make decisions with the aid of artificially intelligent (AI) systems. A common pattern is for the AI to recommend an action to the human who retains control over the final decision. Researchers have identified ensuring that a human has appropriate reliance on an AI as a critical component of achieving complementary performance. We argue that the current definition of appropriate reliance used in such research lacks formal statistical grounding and can lead to contradictions. We propose a formal definition of reliance, based on statistical decision theory, which separates the concepts of reliance as the probability the decision-maker follows the AI’s recommendation from challenges a human may face in differentiating the signals and forming accurate beliefs about the situation. Our definition gives rise to a framework that can be used to guide the design and interpretation of studies on human-AI complementarity and reliance. Using recent AI-advised decision making studies from literature, we demonstrate how our framework can be used to separate the loss due to mis-reliance from the loss due to not accurately differentiating the signals. We evaluate these losses by comparing to a baseline and a benchmark for complementary performance defined by the expected payoff achieved by a rational decision-maker facing the same decision task as the behavioral decision-makers.
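
To make the decomposition concrete, the toy sketch below separates the reliance probability from decision quality using invented numbers; the paper's framework is far more general (signals, beliefs, and payoffs under statistical decision theory), so treat this only as intuition for the mis-reliance loss.

```python
# Hedged sketch of the reliance decomposition described above. The numbers
# and payoff structure are invented for illustration.

# Binary decision task: follow the AI's recommendation or override it.
p_ai_correct, p_human_correct = 0.8, 0.65

# Reliance = probability the decision-maker follows the AI's recommendation.
reliance = 0.7

# Expected accuracy of the behavioral decision-maker at this reliance level,
# assuming (for illustration) that following is independent of correctness.
behavioral = reliance * p_ai_correct + (1 - reliance) * p_human_correct

# Rational benchmark: always take the option with the higher expected payoff.
benchmark = max(p_ai_correct, p_human_correct)

# Loss attributable to mis-reliance, holding beliefs fixed.
print(f"behavioral={behavioral:.3f}, benchmark={benchmark:.3f}, "
      f"mis-reliance loss={benchmark - behavioral:.3f}")
```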

[318] Text-Augmented Multimodal LLMs for Chemical Reaction Condition Recommendation

Yu Zhang, Ruijie Yu, Kaipeng Zeng, Ding Li, Feng Zhu, Xiaokang Yang, Yaohui Jin, Yanyan Xu

Main category: cs.AI

TL;DR: Chemma-RC is a multimodal LLM that identifies effective chemical reaction conditions by aligning text, SMILES, and reaction graphs, achieving 17% improvement over state-of-the-art methods.

DetailsMotivation: Current reaction optimization is labor-intensive and relies on trial-and-error. There's a need for universal approaches to discover effective conditions across diverse substrates.

Method: Developed Chemma-RC, a text-augmented multimodal LLM that learns unified representations by aligning text corpus, reaction SMILES, and reaction graphs in a shared embedding module.

Result: Achieved up to 17% improvement in precision for identifying optimal conditions compared to state-of-the-art methods. Successfully validated on palladium-catalyzed imidazole C-H arylation reaction.

Conclusion: Chemma-RC shows significant potential to accelerate high-throughput condition screening in chemical synthesis, offering a more efficient alternative to traditional trial-and-error approaches.

Abstract: Identifying reaction conditions that are broadly applicable across diverse substrates is a longstanding challenge in chemical and pharmaceutical research. While many methods are available to generate conditions with acceptable performance, a universal approach for reliably discovering effective conditions during reaction exploration is rare. Consequently, current reaction optimization processes are often labor-intensive, time-consuming, and costly, relying heavily on trial-and-error experimentation. Nowadays, large language models (LLMs) are capable of tackling chemistry-related problems, such as molecule design and chemical reasoning tasks. Here, we report the design, implementation and application of Chemma-RC, a text-augmented multimodal LLM to identify effective conditions through task-specific dialogue and condition generation. Chemma-RC learns a unified representation of chemical reactions by aligning multiple modalities, including text corpus, reaction SMILES, and reaction graphs, within a shared embedding module. Performance benchmarking on datasets showed high precision in identifying optimal conditions, with up to 17% improvement over the current state-of-the-art methods. A palladium-catalysed imidazole C-H arylation reaction was investigated experimentally to evaluate the functionalities of Chemma-RC in practice. Our findings suggest that Chemma-RC holds significant potential to accelerate high-throughput condition screening in chemical synthesis.
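
A rough picture of the shared embedding module described above: each modality gets its own projection into a common space, and a contrastive loss pulls matched reaction views together. The encoders, dimensions, and InfoNCE-style loss below are assumptions for illustration, not Chemma-RC's published architecture.

```python
# Illustrative sketch of aligning several reaction modalities in a shared
# embedding space. All dimensions and the loss are placeholder assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEmbedder(nn.Module):
    def __init__(self, text_dim=768, smiles_dim=512, graph_dim=256, shared_dim=128):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.smiles_proj = nn.Linear(smiles_dim, shared_dim)
        self.graph_proj = nn.Linear(graph_dim, shared_dim)

    def forward(self, text, smiles, graph):
        # Project each modality into the shared space and L2-normalize.
        return tuple(F.normalize(p(x), dim=-1) for p, x in
                     ((self.text_proj, text), (self.smiles_proj, smiles),
                      (self.graph_proj, graph)))

def alignment_loss(a, b, temperature=0.07):
    """Contrastive loss pulling matched pairs together within a batch."""
    logits = a @ b.T / temperature
    targets = torch.arange(a.size(0))
    return F.cross_entropy(logits, targets)

model = SharedEmbedder()
t, s, g = model(torch.randn(4, 768), torch.randn(4, 512), torch.randn(4, 256))
loss = alignment_loss(t, s) + alignment_loss(t, g)
print(loss.item())
```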

[319] Examining the Prevalence and Dynamics of AI-Generated Media in Art Subreddits

Hana Matatov, Marianne Aubin Le Quéré, Ofra Amir, Mor Naaman

Main category: cs.AI

TL;DR: This study examines the impact of AI-generated content (AIGC) on Reddit art communities, finding that while AI posts are rare (<0.5% of image posts), accusations of AI use persist and become more negative over time, especially in communities without explicit AI policies.

DetailsMotivation: To understand how generative AI tools like Dall-E are affecting social dynamics in online art communities, particularly regarding posting behavior, community norms, and discussions around suspected AI content.

Method: Analyzed image-based posts and comments in Reddit art communities, distinguishing between communities that disallow AI content and those without such policies. Examined posts where authors transparently labeled content as AI-generated and comments that suspected or accused authors of using AI.

Result: AI posts accounted for fewer than 0.5% of image-based posts through 2023. While author-labeled AI posts decreased over time, accusations of AI use remained persistent. AI content was more commonly used by newcomers and could increase participation when aligned with community rules. Comments suspecting AI use became increasingly negative over time, especially in communities without explicit AI rules.

Conclusion: The study reveals evolving norms around AIGC in creative online communities, showing that while actual AI content remains minimal, perceptions and accusations of AI use significantly impact community interactions and tone, particularly in communities lacking clear AI policies.

Abstract: Broadly accessible generative AI models like Dall-E have made it possible for anyone to create compelling visual art. In online communities, the introduction of AI-generated content (AIGC) may impact social dynamics, for example causing changes in who is posting content, or shifting the norms or the discussions around the posted content if posts are suspected of being generated by AI. We take steps towards examining the potential impact of AIGC on art-related communities on Reddit. We distinguish between communities that disallow AI content and those without such a direct policy. We look at image-based posts in these communities where the author transparently shares that the image was created by AI, and at comments in these communities that suspect or accuse authors of using generative AI. We find that AI posts (and accusations) have played a surprisingly small part in these communities through the end of 2023, accounting for fewer than 0.5% of the image-based posts. However, even as the absolute number of author-labeled AI posts dwindles over time, accusations of AI use remain more persistent. We show that AI content is more readily used by newcomers and may help increase participation if it aligns with community rules. However, the tone of comments suspecting AI use by others has become more negative over time, especially in communities that do not have explicit rules about AI. Overall, the results show the changing norms and interactions around AIGC in online communities designated for creativity.

[320] TestAgent: Automatic Benchmarking and Exploratory Interaction for Evaluating LLMs in Vertical Domains

Wanying Wang, Zeyu Ma, Xuhong Wang, Yangchun Zhang, Pengfei Liu, Mingang Chen

Main category: cs.AI

TL;DR: TestAgent is a framework for automatic benchmarking and dynamic evaluation of LLMs in vertical domains using retrieval-augmented generation and reinforcement learning-guided multi-turn interactions.

DetailsMotivation: Existing evaluations for vertical domains rely on costly manual construction of static single-turn datasets, which are misaligned with real-world dynamic multi-turn interactions and limit assessment of professionalism and stability.

Method: TestAgent uses retrieval-augmented generation to create domain-specific questions from knowledge sources, combined with a two-stage criteria generation process. It employs reinforcement learning-guided multi-turn interaction strategy that adaptively determines question types based on real-time responses.

Result: Extensive experiments across medical, legal, and governmental domains show TestAgent enables efficient cross-domain benchmark generation and provides deeper insights into model behavior through dynamic exploratory evaluation.

Conclusion: This work establishes a new paradigm for automated and in-depth evaluation of LLMs in vertical domains, addressing limitations of traditional static evaluation methods.

Abstract: As Large Language Models (LLMs) are increasingly deployed in highly specialized vertical domains, the evaluation of their domain-specific performance becomes critical. However, existing evaluations for vertical domains typically rely on the labor-intensive construction of static single-turn datasets, which present two key limitations: (i) manual data construction is costly and must be repeated for each new domain, and (ii) static single-turn evaluations are misaligned with the dynamic multi-turn interactions in real-world applications, limiting the assessment of professionalism and stability. To address these, we propose TestAgent, a framework for automatic benchmarking and exploratory dynamic evaluation in vertical domains. TestAgent leverages retrieval-augmented generation to create domain-specific questions from user-provided knowledge sources, combined with a two-stage criteria generation process, thereby enabling scalable and automated benchmark creation. Furthermore, it introduces a reinforcement learning-guided multi-turn interaction strategy that adaptively determines question types based on real-time model responses, dynamically probing knowledge boundaries and stability. Extensive experiments across medical, legal, and governmental domains demonstrate that TestAgent enables efficient cross-domain benchmark generation and yields deeper insights into model behavior through dynamic exploratory evaluation. This work establishes a new paradigm for automated and in-depth evaluation of LLMs in vertical domains.
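
The adaptive interaction strategy can be pictured as a loop in which the next question type depends on how the model handled the last one. The sketch below hard-codes a selection rule and a random grader purely for shape; TestAgent learns this policy with reinforcement learning and grades answers against generated criteria.

```python
# Minimal sketch of an adaptive multi-turn evaluation loop in the spirit of
# TestAgent's RL-guided interaction strategy. Question types, the scoring
# stub, and the selection rule are placeholders, not the paper's policy.

import random

def score_response(answer: str) -> float:
    """Stand-in for the paper's two-stage criteria-based grading."""
    return random.random()

def next_question_type(last_score: float) -> str:
    # Probe deeper when the model is doing well; fall back to basics otherwise.
    if last_score > 0.8:
        return "edge_case"
    if last_score > 0.5:
        return "application"
    return "recall"

last_score = 0.5
for turn in range(3):
    qtype = next_question_type(last_score)
    answer = f"model answer to a {qtype} question"  # placeholder model call
    last_score = score_response(answer)
    print(turn, qtype, round(last_score, 2))
```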

[321] The Asymptotic Behavior of Attention in Transformers

Álvaro Rodríguez Abella, João Pedro Silvestre, Paulo Tabuada

Main category: cs.AI

TL;DR: This paper analyzes the theoretical properties of transformer-based LLMs, proving that increasing depth leads to token convergence to a single cluster (model collapse), using control theory and differential equation models.

DetailsMotivation: While transformers are foundational to modern LLMs, their theoretical properties are not well understood. Prior work shows that increasing depth yields diminishing returns and can cause model collapse, where tokens converge to a single cluster, reducing output diversity.

Method: The authors use differential equation models for transformer dynamics and leverage control theory tools including consensus dynamics on manifolds and input-to-state stability (ISS). They extend the analysis to autoregressive models.

Result: The paper proves that all tokens in a transformer asymptotically converge to a cluster as depth increases, confirming the model collapse phenomenon theoretically.

Conclusion: Increasing transformer depth leads to token convergence and model collapse, suggesting that simply scaling depth may be suboptimal for improving LLM performance and diversity.

Abstract: The transformer architecture has become the foundation of modern Large Language Models (LLMs), yet its theoretical properties are still not well understood. As with classic neural networks, a common approach to improve these models is to increase their size and depth. However, such strategies may be suboptimal, as several works have shown that adding more layers yields increasingly diminishing returns. More importantly, prior studies have shown that increasing depth may lead to model collapse, i.e., all the tokens converge to a single cluster, undermining the ability of LLMs to generate diverse outputs. Building on differential equation models for the transformer dynamics, we prove that all the tokens in a transformer asymptotically converge to a cluster as depth increases. At the technical level we leverage tools from control theory, including consensus dynamics on manifolds and input-to-state stability (ISS). We then extend our analysis to autoregressive models, exploiting their structure to further generalize the theoretical guarantees.
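
The collapse phenomenon is easy to observe numerically: iterating a softmax-weighted averaging step (attention with identity values) drives all token vectors toward a common point. The toy discrete update below is a simplification of the paper's continuous-time model, with arbitrary sizes and temperature.

```python
# Numerical illustration of the collapse result: under repeated
# attention-style averaging, token representations contract toward a single
# cluster as depth grows. This is a toy discretization, not the paper's setup.

import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))          # 8 tokens, 16-dim embeddings

def attention_step(x, beta=0.1):
    """One round of softmax-weighted averaging (value matrix = identity)."""
    scores = beta * x @ x.T
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x                     # row-stochastic mixing of tokens

def spread(x):
    return np.linalg.norm(x - x.mean(axis=0), axis=1).max()

before = spread(tokens)
for _ in range(60):                        # depth = 60 layers
    tokens = attention_step(tokens)

# The spread around the mean shrinks toward zero: a single cluster.
print(f"{before:.3f} -> {spread(tokens):.6f}")
```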

[322] EC-Diffuser: Multi-Object Manipulation via Entity-Centric Behavior Generation

Carl Qi, Dan Haramati, Tal Daniel, Aviv Tamar, Amy Zhang

Main category: cs.AI

TL;DR: A novel behavioral cloning approach using object-centric representations and entity-centric Transformer with diffusion optimization for efficient learning from offline image data, enabling compositional generalization in multi-object manipulation tasks.

DetailsMotivation: Object manipulation from high-dimensional observations is challenging, especially in multi-object environments due to combinatorial complexity. Existing methods struggle with compositional generalization in unseen object configurations with constrained network and dataset sizes.

Method: Decomposes observations into object-centric representations, processes them with an entity-centric Transformer that computes attention at object level to predict object dynamics and agent actions, combined with diffusion models to capture multi-modal behavior distributions.

Result: Substantial performance improvements in multi-object tasks and enables compositional generalization with zero-shot generalization to tasks with novel compositions of objects and goals, including larger numbers of objects than seen during training.

Conclusion: The proposed method successfully addresses compositional generalization challenges in multi-object manipulation through object-centric representations and diffusion-based optimization, achieving zero-shot generalization to unseen object configurations.

Abstract: Object manipulation is a common component of everyday tasks, but learning to manipulate objects from high-dimensional observations presents significant challenges. These challenges are heightened in multi-object environments due to the combinatorial complexity of the state space as well as of the desired behaviors. While recent approaches have utilized large-scale offline data to train models from pixel observations, achieving performance gains through scaling, these methods struggle with compositional generalization in unseen object configurations with constrained network and dataset sizes. To address these issues, we propose a novel behavioral cloning (BC) approach that leverages object-centric representations and an entity-centric Transformer with diffusion-based optimization, enabling efficient learning from offline image data. Our method first decomposes observations into an object-centric representation, which is then processed by our entity-centric Transformer that computes attention at the object level, simultaneously predicting object dynamics and the agent’s actions. Combined with the ability of diffusion models to capture multi-modal behavior distributions, this results in substantial performance improvements in multi-object tasks and, more importantly, enables compositional generalization. We present BC agents capable of zero-shot generalization to tasks with novel compositions of objects and goals, including larger numbers of objects than seen during training. We provide video rollouts on our webpage: https://sites.google.com/view/ec-diffuser.
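
The entity-centric idea, attention computed over object slots rather than pixels or patches, can be sketched in a few lines. Slot dimensions, the number of objects, and the action head below are placeholders; the actual method couples this with diffusion-based behavior generation.

```python
# Sketch of entity-centric attention: observations are decomposed into
# per-object slots and attention runs across objects. Dimensions and the
# action head are illustrative, not EC-Diffuser's architecture.

import torch
import torch.nn as nn

num_objects, slot_dim = 5, 64
slots = torch.randn(2, num_objects, slot_dim)   # batch of 2 scenes

# Multi-head attention over the object axis: each slot attends to the others,
# modeling the object-object interactions needed for multi-object tasks.
attn = nn.MultiheadAttention(embed_dim=slot_dim, num_heads=4, batch_first=True)
interactions, _ = attn(slots, slots, slots)

action_head = nn.Linear(num_objects * slot_dim, 7)  # e.g., a 7-DoF action
action = action_head(interactions.flatten(1))
print(action.shape)  # torch.Size([2, 7])
```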

[323] TReMu: Towards Neuro-Symbolic Temporal Reasoning for LLM-Agents with Memory in Multi-Session Dialogues

Yubin Ge, Salvatore Romeo, Jason Cai, Raphael Shu, Monica Sunkara, Yassine Benajiba, Yi Zhang

Main category: cs.AI

TL;DR: The paper introduces TReMu, a framework to enhance temporal reasoning in multi-session dialogues through timeline summarization and neuro-symbolic reasoning, achieving significant performance improvements from 29.83 to 77.67 on GPT-4o.

DetailsMotivation: Temporal reasoning in multi-session dialogues is challenging and under-studied in existing benchmarks, creating a gap that needs to be addressed.

Method: Proposes TReMu framework with time-aware memorization (timeline summarization) and neuro-symbolic temporal reasoning (LLM-generated Python code for temporal calculations). Also creates a new benchmark by augmenting LoCoMo dialogues with multi-choice QAs.

Result: Experimental evaluations show the benchmark is challenging, and TReMu significantly improves performance from 29.83 to 77.67 on GPT-4o compared to standard prompting.

Conclusion: The proposed framework effectively addresses temporal reasoning challenges in multi-session dialogues, demonstrating substantial performance gains over baseline methods.

Abstract: Temporal reasoning in multi-session dialogues presents a significant challenge that has been under-studied in previous temporal reasoning benchmarks. To bridge this gap, we propose a new evaluation task for temporal reasoning in multi-session dialogues and introduce an approach to construct a new benchmark by augmenting dialogues from LoCoMo and creating multi-choice QAs. Furthermore, we present TReMu, a new framework aimed at enhancing the temporal reasoning capabilities of LLM-agents in this context. Specifically, the framework employs time-aware memorization through timeline summarization, generating retrievable memory by summarizing events in each dialogue session with their inferred dates. Additionally, we integrate neuro-symbolic temporal reasoning, where LLMs generate Python code to perform temporal calculations and select answers. Experimental evaluations on popular LLMs demonstrate that our benchmark is challenging, and the proposed framework significantly improves temporal reasoning performance compared to baseline methods, raising GPT-4o's score from 29.83 with standard prompting to 77.67 with our approach, highlighting its effectiveness in addressing temporal reasoning in multi-session dialogues.
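
The neuro-symbolic step is the most concrete part of the pipeline: rather than reasoning about dates in free text, the LLM emits small programs over a summarized timeline. A hedged sketch follows, with invented timeline entries and an example of the kind of snippet an LLM might generate.

```python
# Hedged sketch of the neuro-symbolic temporal reasoning step. The timeline
# entries and question are invented; the paper builds session summaries and
# the generated code with an LLM.

from datetime import date

# Time-aware memory: one summarized event per dialogue session, with its
# inferred date.
timeline = {
    "adopted a puppy": date(2023, 5, 14),
    "started a new job": date(2023, 8, 1),
}

# The kind of snippet an LLM could generate for "how long after adopting the
# puppy did the speaker start the new job?":
delta = timeline["started a new job"] - timeline["adopted a puppy"]
print(f"{delta.days} days")  # 79 days
```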

[324] The Value of Information in Human-AI Decision-making

Ziyang Guo, Yifan Wu, Jason Hartline, Jessica Hullman

Main category: cs.AI

TL;DR: A decision-theoretic framework for characterizing the value of complementary information in human-AI pairings, with a novel explanation technique (ILIV-SHAP) that outperforms vanilla SHAP in reducing decision errors.

DetailsMotivation: To improve collaborative performance in human-AI decision-making by identifying how agents can better exploit available information through complementary strategies.

Method: Developed ACIV framework defining complementary information and created ILIV-SHAP, an adaptation of SHAP explanations that highlights human-complementing information.

Result: ILIV-SHAP with AI predictions leads to greater error reduction in human-AI decisions compared to vanilla SHAP, validated through studies in chest X-ray diagnosis and deepfake detection.

Conclusion: The framework successfully identifies opportunities for better information exploitation in AI-assisted workflows, with ILIV-SHAP providing more effective explanations for human-AI collaboration.

Abstract: Multiple agents are increasingly combined to make decisions with the expectation of achieving complementary performance, where the decisions they make together outperform those made individually. However, knowing how to improve the performance of collaborating agents requires knowing what information and strategies each agent employs. With a focus on human-AI pairings, we contribute a decision-theoretic framework for characterizing the value of information. By defining complementary information, our approach identifies opportunities for agents to better exploit available information in AI-assisted decision workflows. We present a novel explanation technique (ILIV-SHAP) that adapts SHAP explanations to highlight human-complementing information. We validate the effectiveness of ACIV and ILIV-SHAP through a study of human-AI decision-making, and demonstrate the framework on examples from chest X-ray diagnosis and deepfake detection. We find that presenting ILIV-SHAP alongside AI predictions leads to reliably greater reductions in error over non-AI-assisted decisions than vanilla SHAP does.

[325] Safe Explicable Policy Search

Akkamahadevi Hanni, Jonathan Montaño, Yu Zhang

Main category: cs.AI

TL;DR: Safe Explicable Policy Search (SEPS) is a learning approach that generates explicable behaviors for AI agents while minimizing safety risks in continuous state and action spaces, combining constrained policy optimization with explicable policy search.

DetailsMotivation: Users form expectations of AI agents that may differ from the agents' planned behaviors, leading to the need for explicable behaviors. However, current approaches lack safety considerations, especially in learning settings, which is critical for human-AI teaming.

Method: SEPS formulates the problem as a constrained optimization where the agent maximizes explicability score subject to safety constraints and a suboptimality criterion based on the agent’s model. It combines Constrained Policy Optimization and Explicable Policy Search for continuous domains.

Result: SEPS was evaluated in safety-gym environments and with physical robot experiments, demonstrating efficacy in generating safe explicable behaviors relevant for human-AI teaming applications.

Conclusion: SEPS successfully provides a learning approach to explicable behavior generation while ensuring safety, making it suitable for robotic applications and human-AI teaming scenarios with continuous state and action spaces.

Abstract: When users work with AI agents, they form conscious or subconscious expectations of them. Meeting user expectations is crucial for such agents to engage in successful interactions and teaming. However, users may form expectations of an agent that differ from the agent’s planned behaviors. These differences lead to the consideration of two separate decision models in the planning process to generate explicable behaviors. However, little has been done to incorporate safety considerations, especially in a learning setting. We present Safe Explicable Policy Search (SEPS), which aims to provide a learning approach to explicable behavior generation while minimizing the safety risk, both during and after learning. We formulate SEPS as a constrained optimization problem where the agent aims to maximize an explicability score subject to constraints on safety and a suboptimality criterion based on the agent’s model. SEPS innovatively combines the capabilities of Constrained Policy Optimization and Explicable Policy Search to introduce the capability of generating safe explicable behaviors to domains with continuous state and action spaces, which is critical for robotic applications. We evaluate SEPS in safety-gym environments and with a physical robot experiment to show its efficacy and relevance in human-AI teaming.
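
In symbols, the constrained formulation in the Method summary reads roughly as below; the notation is our paraphrase of the abstract (explicability score E over trajectories, safety cost C with budget d, task return J, suboptimality tolerance delta), not the paper's exact statement.

```latex
\max_{\pi}\; \mathbb{E}_{\pi}\big[E(\tau)\big]
\quad\text{s.t.}\quad
\mathbb{E}_{\pi}\big[C(\tau)\big] \le d,
\qquad
J(\pi^{*}) - J(\pi) \le \delta
```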

[326] Online Language Splatting

Saimouli Katragadda, Cho-Ying Wu, Yuliang Guo, Xinyu Huang, Guoquan Huang, Liu Ren

Main category: cs.AI

TL;DR: Online Language Splatting is the first framework for real-time open-vocabulary language mapping in 3D Gaussian Splatting SLAM systems without requiring pre-generated language features.

DetailsMotivation: Enable AI agents to seamlessly interact with humans and 3D environments by aligning human language with 3D spatial representations in real-time, overcoming limitations of computationally intensive offline preprocessing in prior work.

Method: Three key innovations: (1) high-resolution CLIP embedding module generating language features in 18ms/frame, (2) two-stage online auto-encoder compressing 768D CLIP features to 15D while preserving open-vocabulary capability, (3) color-language disentangled optimization for improved rendering quality.

Result: The online method surpasses state-of-the-art offline methods in accuracy while achieving more than 40x efficiency boost, enabling real-time performance.

Conclusion: This framework demonstrates potential for dynamic and interactive AI applications by achieving efficient, real-time language-3D alignment without preprocessing requirements.

Abstract: To enable AI agents to interact seamlessly with both humans and 3D environments, they must not only perceive the 3D world accurately but also align human language with 3D spatial representations. While prior work has made significant progress by integrating language features into geometrically detailed 3D scene representations using 3D Gaussian Splatting (GS), these approaches rely on computationally intensive offline preprocessing of language features for each input image, limiting adaptability to new environments. In this work, we introduce Online Language Splatting, the first framework to achieve online, near real-time, open-vocabulary language mapping within a 3DGS-SLAM system without requiring pre-generated language features. The key challenge lies in efficiently fusing high-dimensional language features into 3D representations while balancing the computation speed, memory usage, rendering quality and open-vocabulary capability. To this end, we innovatively design: (1) a high-resolution CLIP embedding module capable of generating detailed language feature maps in 18ms per frame, (2) a two-stage online auto-encoder that compresses 768-dimensional CLIP features to 15 dimensions while preserving open-vocabulary capabilities, and (3) a color-language disentangled optimization approach to improve rendering quality. Experimental results show that our online method not only surpasses the state-of-the-art offline methods in accuracy but also achieves more than 40x efficiency boost, demonstrating the potential for dynamic and interactive AI applications.
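
Innovation (2) is the easiest piece to sketch: an autoencoder that compresses 768-dimensional CLIP features to a 15-dimensional code small enough to attach to each Gaussian. The single-stage MLP below is a simplification of the paper's two-stage online design; every size other than 768 and 15 is an arbitrary choice.

```python
# Sketch of the CLIP-feature compression idea: squeeze 768-D features to a
# 15-D per-Gaussian code. Single-stage here; the paper uses a two-stage
# online autoencoder.

import torch
import torch.nn as nn

class CLIPCompressor(nn.Module):
    def __init__(self, in_dim=768, code_dim=15, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, in_dim))

    def forward(self, feats):
        code = self.encoder(feats)           # 15-D code stored in the 3D map
        recon = self.decoder(code)           # decoded back for queries
        return code, recon

model = CLIPCompressor()
feats = torch.randn(32, 768)                 # a batch of per-pixel features
code, recon = model(feats)
loss = nn.functional.mse_loss(recon, feats)  # reconstruction objective
print(code.shape, loss.item())               # torch.Size([32, 15]) ...
```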

[327] Compositional-ARC: Assessing Systematic Generalization in Abstract Spatial Reasoning

Philipp Mondorf, Shijia Zhou, Monica Riedler, Barbara Plank

Main category: cs.AI

TL;DR: Meta-learning for compositionality enables small transformer models to achieve systematic generalization in abstract spatial reasoning, outperforming large LLMs on novel transformation combinations.

DetailsMotivation: Despite LLMs' progress, they often fail at systematic generalization - applying known components to novel combinations. This study extends meta-learning approaches from linguistic tasks to spatial reasoning to test broader applicability.

Method: Created Compositional-ARC dataset to evaluate systematic generalization of geometric transformations. Used meta-learning for compositionality to train a small 5.7M-parameter transformer encoder-decoder model on novel combinations of transformations like translation, rotation, and their compositions.

Result: The small meta-learned model significantly outperformed state-of-the-art LLMs (o3-mini, GPT-4o, Gemini 2.0 Flash) and performed on par with the 8B-parameter ARC prize winner. LLMs failed to exhibit systematic behavior despite their scale.

Conclusion: Meta-learning effectively promotes systematic generalization beyond linguistic tasks, offering a promising direction for developing more robust and generalizable models without requiring massive scale.

Abstract: Systematic generalization refers to the capacity to understand and generate novel combinations from known components. Despite recent progress by large language models (LLMs) across various domains, these models often fail to extend their knowledge to novel compositional scenarios, revealing notable limitations in systematic generalization. There has been an ongoing debate about whether neural networks possess the capacity for systematic generalization, with recent studies suggesting that meta-learning approaches designed for compositionality can significantly enhance this ability. However, these insights have largely been confined to linguistic problems, leaving their applicability to other tasks an open question. In this study, we extend meta-learning for compositionality to the domain of abstract spatial reasoning. To this end, we introduce Compositional-ARC, a dataset designed to evaluate the capacity of models to systematically generalize from known geometric transformations (e.g., translation, rotation) of abstract two-dimensional objects to novel combinations of these transformations (e.g., translation+rotation). Our results show that a small transformer-based encoder-decoder model, trained via meta-learning for compositionality, can systematically generalize to previously unseen transformation compositions. Notably, despite having only 5.7M parameters, this model significantly outperforms state-of-the-art LLMs (including o3-mini, GPT-4o, and Gemini 2.0 Flash, which fail to exhibit similar systematic behavior) and performs on par with the winning model of the ARC prize 2024, an 8B-parameter LLM trained via test-time training. Our findings highlight the effectiveness of meta-learning in promoting systematicity beyond linguistic tasks, suggesting a promising direction toward more robust and generalizable models.
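
The generalization target is concrete: primitives such as translation and rotation are seen during training, and their compositions must be handled zero-shot. A toy grid version follows, with an invented object encoding rather than the dataset's actual format.

```python
# Toy illustration of the compositional target: a model that has seen
# "translate" and "rotate" separately must handle "translate then rotate".
# The 4x4 grid encoding is our own example, not Compositional-ARC's format.

import numpy as np

def translate(grid, dy, dx):
    """Shift the object on the grid (with wrap-around for simplicity)."""
    return np.roll(grid, shift=(dy, dx), axis=(0, 1))

def rotate90(grid):
    """Rotate the grid 90 degrees clockwise."""
    return np.rot90(grid, k=-1)

obj = np.zeros((4, 4), dtype=int)
obj[0, :2] = 1                       # a small horizontal bar in the corner

# Known primitives, novel composition: translate, then rotate.
composed = rotate90(translate(obj, 1, 1))
print(composed)
```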

[328] A Framework for Situating Innovations, Opportunities, and Challenges in Advancing Vertical Systems with Large AI Models

Gaurav Verma, Jiawei Zhou, Mohit Chandra, Srijan Kumar, Munmun De Choudhury

Main category: cs.AI

TL;DR: This paper introduces a framework for addressing limitations of large AI models in high-stakes vertical applications through layer-wise abstraction and cross-disciplinary innovations.

DetailsMotivation: Large AI models perform well on benchmarks but fail in real-world high-stakes applications like healthcare, education, and law due to brittleness, uninformed decisions, and trust issues.

Method: The authors propose a framework with layer-wise abstraction of innovations to align large models with user requirements, demonstrated through multiple case studies across various fields.

Result: The framework helps modularize the transformation of large models into useful vertical systems and reveals dynamism within different layers, enabling better innovation positioning and cross-disciplinary communication.

Conclusion: The framework guides researchers and practitioners to optimally situate innovations, uncover overlooked opportunities, and facilitate cross-disciplinary communication for developing practically useful AI systems instead of just chasing benchmarks.

Abstract: Large artificial intelligence (AI) models have garnered significant attention for their remarkable, often “superhuman”, performance on standardized benchmarks. However, when these models are deployed in high-stakes verticals such as healthcare, education, and law, they often reveal notable limitations. For instance, they exhibit brittleness to minor variations in input data, present contextually uninformed decisions in critical settings, and undermine user trust by confidently producing or reproducing inaccuracies. These challenges in applying large models necessitate cross-disciplinary innovations to align the models’ capabilities with the needs of real-world applications. We introduce a framework that addresses this gap through a layer-wise abstraction of innovations aimed at meeting users’ requirements with large models. Through multiple case studies, we illustrate how researchers and practitioners across various fields can operationalize this framework. Beyond modularizing the pipeline of transforming large models into useful “vertical systems”, we also highlight the dynamism that exists within different layers of the framework. Finally, we discuss how our framework can guide researchers and practitioners to (i) optimally situate their innovations (e.g., when vertical-specific insights can empower broadly impactful vertical-agnostic innovations), (ii) uncover overlooked opportunities (e.g., spotting recurring problems across verticals to develop practically useful foundation models instead of chasing benchmarks), and (iii) facilitate cross-disciplinary communication of critical challenges (e.g., enabling a shared vocabulary for AI developers, domain experts, and human-computer interaction scholars). Project webpage: https://gaurav22verma.github.io/vertical-systems-with-large-ai-models/

[329] Closed-loop control of seizure activity via real-time seizure forecasting by reservoir neuromorphic computing

Maryam Sadeghi, Darío Fernández Khatiboun, Yasser Rezaeiyan, Saima Rizwan, Alessandro Barcellona, Andrea Merello, Marco Crepaldi, Gabriella Panuccio, Farshad Moradi

Main category: cs.AI

TL;DR: A neuromorphic reservoir computing system for real-time seizure forecasting and personalized stimulation in drug-resistant epilepsy, achieving 83.33% accuracy in seizure prediction and >97% seizure reduction with low-frequency stimulation.

DetailsMotivation: Current closed-loop brain stimulation for epilepsy has limitations: it typically aborts seizures rather than preventing them, and requires lengthy parameter tuning. The authors aim to address these issues using neuromorphic computing for more effective personalized treatment.

Method: Developed a neuromorphic reservoir computing hardware system that performs real-time seizure forecasting and delivers personalized free-run stimulations. Each seizure forecast triggers an electrical pulse rather than fixed-frequency stimulus trains. Validated using hippocampal spheroids coupled to 3D microelectrode arrays.

Result: The system achieved 83.33% accuracy in forecasting seizures during training and >97% seizure reduction during real-time processing. It primarily used instantaneous stimulation frequencies within 20 Hz, which is well below typical clinical practice frequencies.

Conclusion: The work demonstrates neuromorphic systems’ potential as next-generation neuromodulation strategy for personalized DRE treatment, leveraging their sparse and event-driven processing for real-time applications.

Abstract: Closed-loop brain stimulation holds potential as personalized treatment for drug-resistant epilepsy (DRE) but still suffers from limitations that result in highly variable efficacy. First, stimulation is typically delivered upon detection of the seizure to abort rather than prevent it; second, the stimulation parameters are established by trial and error, requiring lengthy rounds of fine-tuning, which delay steady-state therapeutic efficacy. Here, we address these limitations by leveraging the potential of neuromorphic computing. We present a neuromorphic reservoir computing hardware system capable of driving real-time personalized free-run stimulations based on seizure forecasting, wherein each forecast triggers an electrical pulse rather than an arbitrarily predefined fixed-frequency stimulus train. The system achieves 83.33% accuracy in forecasting seizure occurrences during the training phase. We validate the system using hippocampal spheroids coupled to a 3D microelectrode array as a simplified testbed, achieving seizure reduction >97% during real-time processing while primarily using instantaneous stimulation frequencies within 20 Hz, well below what is typically used in clinical practice. Our work demonstrates the potential of neuromorphic systems as a next-generation neuromodulation strategy for personalized DRE treatment, leveraging their sparse and event-driven processing for real-time applications.
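
A minimal echo-state-network sketch conveys the reservoir computing idea behind the forecaster: a fixed random recurrent network expands the input stream into rich states, and only a linear readout needs training. The sizes, spectral radius, and input signal below are illustrative choices, not the paper's hardware configuration.

```python
# Minimal echo-state-network sketch of the reservoir computing idea. All
# hyperparameters and the synthetic input are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
n_in, n_res = 1, 100

W_in = rng.normal(scale=0.5, size=(n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.abs(np.linalg.eigvals(W)).max()  # spectral radius < 1

def run_reservoir(signal):
    """Drive the fixed random reservoir with an input stream."""
    x = np.zeros(n_res)
    states = []
    for u in signal:
        x = np.tanh(W @ x + W_in @ np.atleast_1d(u))
        states.append(x.copy())
    return np.array(states)

signal = np.sin(np.linspace(0, 20, 200))  # stand-in for field-potential data
states = run_reservoir(signal)

# A trained linear readout over `states` would emit the seizure forecast; in
# the paper, each positive forecast triggers a single electrical pulse.
print(states.shape)  # (200, 100)
```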

[330] MASS: Multi-agent simulation scaling for portfolio construction

Taian Guo, Haiyang Shen, JinSheng Huang, Zhengyang Mao, Junyu Luo, Binqi Chen, Zhuoru Chen, Luchen Liu, Bingyu Xia, Xuhui Liu, Yun Ma, Ming Zhang

Main category: cs.AI

TL;DR: MASS is a multi-agent simulation framework for direct portfolio construction that uses backward optimization to dynamically learn optimal agent distribution, demonstrating that exponentially increasing agents (up to 512) yields higher returns.

DetailsMotivation: Existing LLM-based financial agents require intermediate stock predictions or static workflows, limiting adaptability and effectiveness in portfolio optimization.

Method: Multi-Agent Scaling Simulation (MASS) framework with backward optimization process to dynamically learn optimal distribution of heterogeneous agents for end-to-end portfolio construction.

Result: MASS outperforms 7 state-of-the-art baselines on 2023 Chinese A-share market data, showing progressively higher excess returns as agent count increases exponentially up to 512.

Conclusion: The framework demonstrates enhanced profitability and robustness through backtesting and stability analyses, with open-sourced code and dataset to foster further research.

Abstract: The application of LLM-based agents in financial investment has shown significant promise, yet existing approaches often require intermediate steps like predicting individual stock movements or rely on predefined, static workflows. These limitations restrict their adaptability and effectiveness in constructing optimal portfolios. In this paper, we introduce the Multi-Agent Scaling Simulation (MASS), a novel framework that leverages multi-agent simulation for direct, end-to-end portfolio construction. At its core, MASS employs a backward optimization process to dynamically learn the optimal distribution of heterogeneous agents, enabling the system to adapt to evolving market regimes. A key finding enabled by our framework is the exploration of the scaling effect for portfolio construction: we demonstrate that as the number of agents increases exponentially (up to 512), the aggregated decisions yield progressively higher excess returns. Extensive experiments on a challenging, self-collected dataset from the 2023 Chinese A-share market show that MASS consistently outperforms seven state-of-the-art baselines. Further backtesting, stability analyses, and an experiment addressing data-leakage concerns validate its enhanced profitability and robustness. We have open-sourced our code, dataset, and training snapshots at https://github.com/gta0804/MASS/ to foster further research.

[331] SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning

Zheng Li, Qingxiu Dong, Jingyuan Ma, Di Zhang, Kai Jia, Zhifang Sui

Main category: cs.AI

TL;DR: SelfBudgeter is an adaptive reasoning framework that predicts token budgets before reasoning to prevent overthinking on simple problems, achieving significant response length compression while maintaining accuracy.

DetailsMotivation: Reasoning models often overthink simple problems, consuming excessive computational resources and degrading user experience. This paper aims to address this inefficiency.

Method: A dual-phase training approach: cold-start phase learns budget prediction before reasoning; reinforcement learning phase trains autonomous budget planning based on problem difficulty with strict adherence.

Result: Average response length compression of 61% for 1.5B model and 48% for 7B model on GSM8K, MATH500, and AIME2025 datasets while maintaining nearly undiminished accuracy.

Conclusion: SelfBudgeter effectively enables dynamic budget allocation based on problem complexity, providing user control over reasoning length and improving computational efficiency without sacrificing performance.

Abstract: While reasoning models demonstrate exceptional performance on complex tasks, they often exhibit tendencies of overthinking on simple problems. This phenomenon not only leads to excessive computational resource consumption but also significantly degrades user experience. To address this challenge, we propose SelfBudgeter, a novel user-friendly, adaptive, and controllable reasoning framework that incorporates a budget estimation mechanism prior to reasoning. The framework adopts a dual-phase training paradigm: during the cold-start phase, the model learns to predict token budgets before executing reasoning in a standardized format; in the reinforcement learning phase, the model is trained to autonomously plan budgets based on problem difficulty and strictly adhere to them when generating responses. Since the model outputs budget estimates at the initial stage, users can immediately anticipate waiting duration, enabling flexible decisions on whether to interrupt or continue the generation process. Notably, our method supports manual control of reasoning length through pre-filled budget fields. Experimental results demonstrate that SelfBudgeter can dynamically allocate budgets according to problem complexity, yielding an average response length compression of 61% for the 1.5B model on GSM8K, MATH500, and AIME2025, and 48% for the 7B model, while maintaining nearly undiminished accuracy.
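
Because the budget is emitted before the reasoning, a client can parse it immediately. The tag names below are our guess at a plausible output schema; the paper specifies a standardized format that we do not reproduce here.

```python
# Sketch of a budget-then-reason output format. The <budget>/<think>/<answer>
# tags are hypothetical placeholders, not SelfBudgeter's actual schema.

import re

response = (
    "<budget>120</budget>\n"
    "<think>2 apples + 3 apples = 5 apples.</think>\n"
    "<answer>5</answer>"
)

budget = int(re.search(r"<budget>(\d+)</budget>", response).group(1))
answer = re.search(r"<answer>(.*?)</answer>", response).group(1)

# Because the budget arrives first, a client can show an ETA immediately and
# optionally abort generation if the predicted budget is too large.
print(budget, answer)  # 120 5
```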

[332] RL of Thoughts: Navigating LLM Reasoning with Inference-time Reinforcement Learning

Qianyue Hao, Sibo Li, Jian Yuan, Yong Li

Main category: cs.AI

TL;DR: RL-of-Thoughts (RLoT) uses a lightweight RL-trained navigator to dynamically select and combine logic blocks for adaptive LLM reasoning enhancement, outperforming existing methods by up to 13.4% while being highly transferable.

DetailsMotivation: Existing inference-time reasoning techniques (Chain/Tree/Graph-of-Thoughts) are manually predefined and task-agnostic, lacking adaptability across diverse tasks. They apply uniform frameworks without considering problem-specific characteristics.

Method: Train a lightweight navigator model with RL to dynamically select and combine five basic logic blocks (designed from human cognition perspective) into task-specific logical structures during reasoning. The navigator adapts to problem characteristics.

Result: RLoT outperforms established inference-time techniques by up to 13.4% across multiple reasoning benchmarks (AIME, MATH, GPQA) with various LLMs (GPT, Llama, Qwen, DeepSeek). A sub-3K parameter navigator enables sub-10B LLMs to perform comparably to 100B-scale models. Strong transferability demonstrated across unseen LLMs and tasks.

Conclusion: RLoT provides an adaptive, cost-effective approach to enhance LLM reasoning without parameter modification. The RL-trained navigator enables dynamic, problem-specific reasoning structures that significantly improve performance while maintaining high transferability and efficiency.

Abstract: Despite rapid advancements in large language models (LLMs), the token-level autoregressive nature constrains their complex reasoning capabilities. To enhance LLM reasoning, inference-time techniques, including Chain/Tree/Graph-of-Thought(s), successfully improve the performance, as they are fairly cost-effective by guiding reasoning through sophisticated logical structures without modifying LLMs’ parameters. However, these manually predefined, task-agnostic frameworks are applied uniformly across diverse tasks, lacking adaptability. To improve this, we propose RL-of-Thoughts (RLoT), where we train a lightweight navigator model with reinforcement learning (RL) to adaptively enhance LLM reasoning at inference time. Specifically, we design five basic logic blocks from the perspective of human cognition. During the reasoning process, the trained RL navigator dynamically selects the suitable logic blocks and combines them into task-specific logical structures according to problem characteristics. Experiments across multiple reasoning benchmarks (AIME, MATH, GPQA, etc.) with multiple LLMs (GPT, Llama, Qwen, and DeepSeek) illustrate that RLoT outperforms established inference-time techniques by up to 13.4%. Remarkably, with less than 3K parameters, our RL navigator is able to make sub-10B LLMs comparable to 100B-scale counterparts. Moreover, the RL navigator demonstrates strong transferability: a model trained on one specific LLM-task pair can effectively generalize to unseen LLMs and tasks. Our code is open-source at https://anonymous.4open.science/r/RL-LLM-Reasoning-1A30 for reproducibility.
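
The navigator can be pictured as a tiny policy over discrete logic blocks. The block names and features below are loose readings of the summary, and the policy is randomly initialized here; the real navigator is trained with RL and conditions on actual model responses.

```python
# Minimal sketch of the navigator idea: a small policy scores a fixed set of
# logic blocks given simple problem features and picks the next block. Block
# names and features are our assumptions, not the paper's design.

import numpy as np

BLOCKS = ["decompose", "one_step_reason", "debate", "refine", "terminate"]

rng = np.random.default_rng(3)
policy = rng.normal(size=(4, len(BLOCKS)))   # features x blocks (untrained)

def select_block(features):
    """Pick the highest-scoring logic block for the current state."""
    logits = features @ policy
    return BLOCKS[int(np.argmax(logits))]

# Toy features: [difficulty, steps_so_far, confidence, is_math]
print(select_block(np.array([0.9, 0.0, 0.2, 1.0])))
print(select_block(np.array([0.1, 3.0, 0.9, 0.0])))
```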

[333] A Preprocessing Framework for Efficient Approximate Bi-Objective Shortest-Path Computation in the Presence of Correlated Objectives

Yaron Halle, Ariel Felner, Sven Koenig, Oren Salzman

Main category: cs.AI

TL;DR: The paper proposes an efficient algorithm for bi-objective shortest-path problems with correlated objectives, using graph clustering to speed up A*pex by up to 5x while maintaining solution quality guarantees.

DetailsMotivation: Bi-objective shortest-path problems are computationally challenging, especially with exponential search spaces. Real-world scenarios often involve correlated objectives (e.g., travel time and fuel consumption in road networks), which can be exploited to reduce computational complexity.

Method: The approach uses a preprocessing phase inspired by graph-clustering algorithms to identify correlated clusters within graphs and generate a new graph representation. This allows a natural generalization of the A*pex algorithm to handle correlated objectives more efficiently.

Result: The proposed algorithm runs up to five times faster on DIMACS dataset instances compared to standard approaches, while providing theoretical guarantees on solution quality.

Conclusion: This is the first algorithm that efficiently exploits correlations in bi-objective search with theoretical guarantees, demonstrating significant speed improvements for real-world applications with correlated objectives.

Abstract: The bi-objective shortest-path (BOSP) problem seeks to find paths between start and target vertices of a graph while optimizing two conflicting objective functions. We consider the BOSP problem in the presence of correlated objectives. Such correlations often occur in real-world settings such as road networks, where optimizing two positively correlated objectives, such as travel time and fuel consumption, is common. BOSP is generally computationally challenging as the size of the search space is exponential in the number of objective functions and the graph size. Bounded sub-optimal BOSP solvers such as A*pex alleviate this complexity by approximating the Pareto-optimal solution set rather than computing it exactly (given a user-provided approximation factor). As the correlation between objective functions increases, smaller approximation factors are sufficient for collapsing the entire Pareto-optimal set into a single solution. We leverage this insight to propose an efficient algorithm that reduces the search effort in the presence of correlated objectives. Our approach for computing approximations of the entire Pareto-optimal set is inspired by graph-clustering algorithms. It uses a preprocessing phase to identify correlated clusters within a graph and to generate a new graph representation. This allows a natural generalization of A*pex to run up to five times faster on DIMACS dataset instances, a standard benchmark in the field. To the best of our knowledge, this is the first algorithm proposed that efficiently and effectively exploits correlations in the context of bi-objective search while providing theoretical guarantees on solution quality.
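
The bi-objective machinery underneath is standard: each path carries a label of two costs, and only Pareto-nondominated labels survive. The sketch below shows dominance filtering and why strongly correlated costs collapse the frontier; it is textbook BOSP bookkeeping, not the paper's clustering preprocessing.

```python
# Standard BOSP building block: Pareto dominance over (cost1, cost2) labels.
# The example labels are invented to show a correlated, nearly degenerate
# frontier; the paper's contribution is the clustering-based preprocessing.

def dominates(a, b):
    """True if label a = (c1, c2) is at least as good as b in both objectives."""
    return a[0] <= b[0] and a[1] <= b[1]

def pareto_front(labels):
    """Keep only nondominated labels (sorted by the first objective)."""
    front = []
    for lab in sorted(labels):
        if not any(dominates(f, lab) for f in front):
            front.append(lab)
    return front

# Highly correlated objectives -> the frontier collapses to few solutions,
# so a small approximation factor suffices to represent it with one path.
labels = [(10, 21), (11, 22), (12, 24), (15, 31)]
print(pareto_front(labels))  # [(10, 21)] dominates the rest here
```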

[334] United Minds or Isolated Agents? Exploring Coordination of LLMs under Cognitive Load Theory

HaoYang Shang, Xuan Liu, Zi Liang, Jie Zhang, Haibo Hu, Song Guo

Main category: cs.AI

TL;DR: CoThinker is a multi-agent LLM framework that addresses cognitive overload in complex tasks by distributing cognitive load through agent specialization and structured communication, improving solution quality and efficiency.

DetailsMotivation: LLMs have performance ceilings on complex multi-faceted tasks due to cognitive overload when task demands exceed their cognitive load capacity, analogous to Cognitive Load Theory in human cognition.

Method: CoThinker distributes intrinsic cognitive load through agent specialization and manages transactional load via structured communication and collective working memory, implementing CLT principles.

Result: Empirical validation shows CoThinker outperforms existing multi-agent baselines in solution quality and efficiency on complex problem-solving tasks and high cognitive load scenarios.

Conclusion: CoThinker provides a principled approach to overcome LLM performance ceilings through effective cognitive load management, revealing collective cognition patterns and offering insights for enhanced collaborative problem-solving.

Abstract: Large Language Models (LLMs) exhibit a notable performance ceiling on complex, multi-faceted tasks, as they often fail to integrate diverse information or adhere to multiple constraints. We posit that such limitation arises when the demands of a task exceed the LLM’s effective cognitive load capacity. This interpretation draws a strong analogy to Cognitive Load Theory (CLT) in cognitive science, which explains similar performance boundaries in the human mind, and is further supported by emerging evidence that reveals LLMs have bounded working memory characteristics. Building upon this CLT-grounded understanding, we introduce CoThinker, a novel LLM-based multi-agent framework designed to mitigate cognitive overload and enhance collaborative problem-solving abilities. CoThinker operationalizes CLT principles by distributing intrinsic cognitive load through agent specialization and managing transactional load via structured communication and a collective working memory. We empirically validate CoThinker on complex problem-solving tasks and fabricated high cognitive load scenarios, demonstrating improvements over existing multi-agent baselines in solution quality and efficiency. Our analysis reveals characteristic interaction patterns, providing insights into the emergence of collective cognition and effective load management, thus offering a principled approach to overcoming LLM performance ceilings.

[335] LoRA is All You Need for Safety Alignment of Reasoning LLMs

Yihao Xue, Baharan Mirzasoleiman

Main category: cs.AI

TL;DR: Using LoRA for safety fine-tuning on refusal datasets achieves strong safety alignment without harming reasoning capabilities, as low-rank updates minimize interference with reasoning weights.

DetailsMotivation: Safety alignment fine-tuning often degrades reasoning abilities ("Safety Tax"), so there's a need for methods that maintain safety without compromising reasoning performance.

Method: Apply LoRA (Low-Rank Adaptation) for supervised fine-tuning on refusal datasets, restricting safety weight updates to low-rank spaces to minimize interference with reasoning weights.

Result: Achieves safety levels comparable to full-model fine-tuning without compromising reasoning abilities across math, science, and coding benchmarks. Key findings: rank-1 updates suffice, up-projection layers are most critical, middle layers work best.

Conclusion: Strong safety and reasoning can be achieved at minimal computational cost when updates are properly targeted, with potential for further optimization of the reasoning-safety tradeoff.

Abstract: Reasoning LLMs have demonstrated remarkable breakthroughs in solving complex problems that were previously out of reach. To ensure LLMs do not assist with harmful requests, safety alignment fine-tuning is necessary in the post-training phase. However, safety alignment fine-tuning has recently been shown to significantly degrade reasoning abilities, a phenomenon known as the “Safety Tax”. In this work, we show that using LoRA for SFT on refusal datasets effectively aligns the model for safety without harming its reasoning capabilities. This is because restricting the safety weight updates to a low-rank space minimizes the interference with the reasoning weights. Our extensive experiments across four benchmarks covering math, science, and coding show that this approach produces highly safe LLMs, with safety levels comparable to full-model fine-tuning, without compromising their reasoning abilities. Our ablation studies further identify three key factors in LoRA: (1) rank-1 updates are sufficient to achieve the best reasoning and safety performance, (2) the up projection layers are the most critical modules, with LoRA applied to them alone achieving even better results, and (3) middle layers are more effective than early or late layers. Together, these findings show that strong safety and reasoning can be achieved at minimal computational cost when updates are applied in the right places. Additionally, we observe that LoRA induces weight updates with smaller overlap with the initial weights compared to full-model fine-tuning. Finally, while our attempts to further reduce this overlap yield only modest improvements on some tasks, they highlight the potential of developing methods that more reliably optimize the reasoning-safety tradeoff.
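
The three ablation findings translate almost directly into a LoRA configuration. The sketch below uses the Hugging Face peft library; the checkpoint name, alpha value, and the exact middle-layer range are placeholder assumptions, and this is our reading of the findings rather than the authors' released training script.

```python
# Hedged sketch: rank-1 LoRA on the MLP up-projections of the middle layers,
# followed by standard SFT on a refusal dataset. Module names follow
# Llama-style checkpoints; the model name and layer range are placeholders.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; any decoder-only reasoning model works here.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

config = LoraConfig(
    r=1,                         # finding (1): rank-1 updates suffice
    lora_alpha=16,               # assumed scaling, not from the paper
    target_modules=["up_proj"],  # finding (2): up-projections matter most
    # finding (3): middle layers are most effective; the exact range here
    # is an assumption.
    layers_to_transform=list(range(10, 22)),
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
# ...then run standard SFT on a refusal dataset with base weights frozen.
```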

[336] “DIVE” into Hydrogen Storage Materials Discovery with AI Agents

Di Zhang, Xue Jia, Tran Ba Hung, Seong Hoon Jang, Linda Zhang, Ryuhei Sato, Yusuke Hashimoto, Toyoto Sato, Kiyoe Konno, Shin-ichi Orimo, Hao Li

Main category: cs.AI

TL;DR: DIVE multi-agent workflow extracts experimental data from scientific literature figures/tables, improving accuracy by 10-30% over existing models, enabling rapid AI-driven materials discovery.

DetailsMotivation: Unstructured data in scientific literature hinders AI-driven materials discovery; existing multimodal models struggle with accurate extraction from graphical elements.

Method: Descriptive Interpretation of Visual Expression (DIVE) multi-agent workflow that systematically reads and organizes experimental data from graphical elements in scientific papers.

Result: Achieved 10-15% improvement over commercial models and over 30% relative to open-source models; built database of 30,000+ entries from 4,000 publications; enables identification of new hydrogen storage compositions in 2 minutes.

Conclusion: DIVE provides a broadly transferable paradigm for AI-driven materials discovery across diverse materials systems.

Abstract: Data-driven artificial intelligence (AI) approaches are fundamentally transforming the discovery of new materials. Despite the unprecedented availability of materials data in the scientific literature, much of this information remains trapped in unstructured figures and tables, hindering the construction of large language model (LLM)-based AI agents for automated materials design. Here, we present the Descriptive Interpretation of Visual Expression (DIVE) multi-agent workflow, which systematically reads and organizes experimental data from graphical elements in the scientific literature. We focus on solid-state hydrogen storage materials, a class of materials central to future clean-energy technologies, and demonstrate that DIVE markedly improves the accuracy and coverage of data extraction compared to direct extraction by multimodal models, with gains of 10-15% over commercial models and over 30% relative to open-source models. Building on a curated database of over 30,000 entries from 4,000 publications, we establish a rapid inverse design workflow capable of identifying previously unreported hydrogen storage compositions in two minutes. The proposed AI workflow and agent design are broadly transferable across diverse materials, providing a paradigm for AI-driven materials discovery.

[337] Inducing Faithfulness in Structured Reasoning via Counterfactual Sensitivity

Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma

Main category: cs.AI

TL;DR: CSR is a training objective that creates causal dependence between model outputs and reasoning steps by penalizing models when counterfactual reasoning traces with logical flaws still produce correct answers.

DetailsMotivation: LLMs often generate correct answers using flawed reasoning traces due to training objectives that only reward final-answer correctness, undermining trustworthiness in high-stakes domains.

Method: CSR performs automated operator-level interventions on reasoning traces (e.g., swapping ‘+’ with ‘-’) to create counterfactuals, then penalizes models if flawed traces yield original answers. Implementation adds only 8.7% training overhead.

Result: CSR improves faithfulness over standard fine-tuning and process supervision by up to 70 percentage points across arithmetic, logical deduction, multi-hop QA, and code generation benchmarks.

Conclusion: CSR enables superior accuracy-faithfulness trade-off, with learned sensitivity generalizing to larger models and enhancing inference-time techniques like self-consistency.

Abstract: The reasoning processes of large language models often lack faithfulness; a model may generate a correct answer while relying on a flawed or irrelevant reasoning trace. This behavior, a direct consequence of training objectives that solely reward final-answer correctness, severely undermines the trustworthiness of these models in high-stakes domains. This paper introduces Counterfactual Sensitivity Regularization (CSR), a novel training objective designed to forge a strong, causal-like dependence between a model’s output and its intermediate reasoning steps. During training, CSR performs automated, operator-level interventions on the generated reasoning trace (e.g., swapping "+" with "-") to create a minimally-perturbed counterfactual. A regularization term then penalizes the model if this logically flawed trace still yields the original answer. Our efficient implementation adds only 8.7% training overhead through warm-start curriculum and token-subset optimization. We evaluate faithfulness using Counterfactual Outcome Sensitivity (COS), a metric quantifying how sensitive the final answer is to such logical perturbations. Across diverse structured reasoning benchmarks – arithmetic (GSM8K), logical deduction (ProofWriter), multi-hop QA (HotpotQA), and code generation (MBPP) – models trained with CSR demonstrate a vastly superior trade-off between accuracy and faithfulness. CSR improves faithfulness over standard fine-tuning and process supervision by up to 70 percentage points, with this learned sensitivity generalizing to larger models and enhancing the performance of inference-time techniques like self-consistency.
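
The penalty described above can be sketched as follows: perturb one operator in the trace, then add the model's probability of still producing the original answer to the loss. The `answer_logprob_fn` interface and the weight `lam` are assumptions for illustration; the paper's warm-start curriculum and token-subset optimization are omitted.

```python
import torch

OP_SWAPS = {"+": "-", "-": "+", "*": "/", "/": "*"}  # operator-level interventions

def perturb_trace(trace: str) -> str:
    """Swap the first arithmetic operator to build a minimally-perturbed counterfactual."""
    for i, ch in enumerate(trace):
        if ch in OP_SWAPS:
            return trace[:i] + OP_SWAPS[ch] + trace[i + 1:]
    return trace  # nothing to intervene on

def csr_loss(answer_logprob_fn, question, trace, answer, lam=0.5):
    """answer_logprob_fn(q, t, a) -> log P(a | q, t) under the model (assumed interface).
    Likelihood term plus a penalty if the flawed trace still yields the original answer."""
    nll = -answer_logprob_fn(question, trace, answer)        # reward the faithful path
    cf_prob = torch.exp(answer_logprob_fn(question, perturb_trace(trace), answer))
    return nll + lam * cf_prob                               # penalize insensitivity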

[338] CoT-Space: A Theoretical Framework for Internal Slow-Thinking via Reinforcement Learning

Zeyu Gan, Hao Yi, Yong Liu

Main category: cs.AI

TL;DR: CoT-Space introduces a theoretical framework that reframes LLM reasoning from discrete token prediction to continuous optimization in semantic space, explaining phenomena like optimal CoT length and overthinking through classical learning theory principles.

DetailsMotivation: Traditional token-level RL frameworks fail to align with the reasoning-level nature of multi-step thought processes like Chain-of-Thought, creating a significant theoretical gap in understanding LLM reasoning dynamics.

Method: The authors introduce CoT-Space, which recasts LLM reasoning as optimization in a continuous reasoning-level semantic space, analyzing the process from both noise and risk perspectives using classical learning theory.

Result: The framework demonstrates that convergence to optimal CoT length results from the fundamental trade-off between underfitting and overfitting, with extensive experiments providing empirical validation.

Conclusion: CoT-Space provides a coherent theoretical foundation for understanding LLM reasoning phenomena and offers guidance for developing more effective and generalizable reasoning agents.

Abstract: Reinforcement Learning (RL) has become a pivotal approach for enhancing the reasoning capabilities of Large Language Models (LLMs). However, a significant theoretical gap persists, as traditional token-level RL frameworks fail to align with the reasoning-level nature of complex, multi-step thought processes like Chain-of-Thought (CoT). To address this challenge, we introduce CoT-Space, a novel theoretical framework that recasts LLM reasoning from a discrete token-prediction task to an optimization process within a continuous, reasoning-level semantic space. This shift in perspective serves as a conceptual bridge, revitalizing foundational principles from classical learning theory to analyze the unique dynamics of LLMs. By analyzing this process from both a noise perspective and a risk perspective, we demonstrate that the convergence to an optimal CoT length is a natural consequence of the fundamental trade-off between underfitting and overfitting. Furthermore, extensive experiments provide strong empirical validation for our theoretical findings. Our framework not only provides a coherent explanation for empirical phenomena such as overthinking but also offers a solid theoretical foundation to guide the future development of more effective and generalizable reasoning agents. We open-source our code at https://github.com/ZyGan1999/CoT-Space.

[339] How to Evaluate Medical AI

Ilia Kopanichuk, Petr Anokhin, Vladimir Shaposhnikov, Vladimir Makharev, Ekaterina Tsapieva, Iaroslav Bespalov, Dmitry V. Dylov, Ivan Oseledets

Main category: cs.AI

TL;DR: The paper introduces RPAD and RRAD metrics to evaluate AI medical diagnostics by comparing against multiple expert opinions rather than a single reference, addressing limitations of traditional metrics and inter-rater agreement statistics.

DetailsMotivation: Traditional evaluation metrics like precision and recall fail to account for variability in expert judgments, leading to inconsistent AI performance assessments. Inter-rater agreement statistics are more reliable but lack interpretability.

Method: Developed Relative Precision and Recall of Algorithmic Diagnostics (RPAD and RRAD) that normalize AI performance against inter-expert disagreement. Evaluated using 360 medical dialogues comparing LLMs against physician panels, with free-form diagnosis methodology achieving 98% accuracy in establishing diagnosis identity.

Result: Top-performing models like DeepSeek-V3 achieve consistency on par with or exceeding expert consensus. Expert judgments show significant variability - often greater than AI-human differences, highlighting limitations of absolute metrics.

Conclusion: The findings support adopting relative metrics in medical AI evaluation, as they provide more stable and realistic measures of diagnostic quality by accounting for inherent variability in expert opinions.

Abstract: The integration of artificial intelligence (AI) into medical diagnostic workflows requires robust and consistent evaluation methods to ensure reliability and clinical relevance while accounting for the inherent variability in expert judgments. Traditional metrics like precision and recall often fail to account for this variability, leading to inconsistent assessments of AI performance. Inter-rater agreement statistics like Cohen’s Kappa are more reliable, but they lack interpretability. We introduce Relative Precision and Recall of Algorithmic Diagnostics (RPAD and RRAD), new evaluation metrics that compare AI outputs against multiple expert opinions rather than a single reference. By normalizing performance against inter-expert disagreement, these metrics provide a more stable and realistic measure of the quality of a predicted diagnosis. In addition to the comprehensive analysis of diagnostic quality measures, our study contains an important side result. Our evaluation methodology allows us to avoid selecting diagnoses from a limited list when evaluating a given case. Instead, both the models being tested and the examiners verifying them arrive at a free-form diagnosis. With this automated methodology for establishing the identity of free-form clinical diagnoses, a remarkable 98% accuracy becomes attainable. We evaluate our approach using 360 medical dialogues, comparing multiple large language models (LLMs) against a panel of physicians. A large-scale study shows that top-performing models, such as DeepSeek-V3, achieve consistency on par with or exceeding expert consensus. Moreover, we demonstrate that expert judgments exhibit significant variability, often greater than that between AI and humans. This finding underscores the limitations of any absolute metrics and supports the need to adopt relative metrics in medical AI.
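
One plausible reading of the relative metrics is sketched below: AI-vs-expert agreement normalized by the mean pairwise inter-expert agreement, with `same(x, y)` standing in for the paper's automated free-form diagnosis matcher. The exact normalization used for RPAD/RRAD may differ.

```python
from itertools import combinations

def agreement(diag_a, diag_b, same):
    """Fraction of cases on which two raters give the same diagnosis; `same(x, y)`
    stands in for the paper's automated free-form diagnosis matcher."""
    return sum(same(x, y) for x, y in zip(diag_a, diag_b)) / len(diag_a)

def relative_precision(ai_diagnoses, expert_diagnoses, same):
    """AI-vs-expert agreement normalized by mean pairwise inter-expert agreement;
    a value near 1.0 means the AI is as consistent as another expert would be."""
    ai_vs_experts = sum(agreement(ai_diagnoses, e, same)
                        for e in expert_diagnoses) / len(expert_diagnoses)
    pairs = list(combinations(expert_diagnoses, 2))
    inter_expert = sum(agreement(a, b, same) for a, b in pairs) / len(pairs)
    return ai_vs_experts / inter_expert
```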

[340] From Next Token Prediction to (STRIPS) World Models – Preliminary Results

Carlos Núñez-Molina, Vicenç Gómez, Hector Geffner

Main category: cs.AI

TL;DR: Learning propositional STRIPS world models from action traces using transformers and gradient descent, framed as a supervised next token prediction problem.

DetailsMotivation: To develop a method for learning world models from action sequences without explicit state information, leveraging deep learning to capture the hidden effects and preconditions of actions.

Method: Use a transformer architecture to predict the next action token in sequences, where valid sequences maintain action preconditions and invalid sequences violate them. The model is trained on sets of random valid and invalid action sequences.

Result: Experiments demonstrate that transformers can faithfully represent propositional STRIPS world models and learn them effectively from action sequences alone.

Conclusion: Transformers are capable of learning accurate world models from action traces, providing a scalable approach to model learning without direct state observation.

Abstract: We consider the problem of learning propositional STRIPS world models from action traces alone, using a deep learning architecture (transformers) and gradient descent. The task is cast as a supervised next token prediction problem where the tokens are the actions, and an action $a$ may follow an action sequence if the hidden effects of the previous actions do not make an action precondition of $a$ false. We show that a suitable transformer architecture can faithfully represent propositional STRIPS world models, and that the models can be learned from sets of random valid (positive) and invalid (negative) action sequences alone. A number of experiments are reported.
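
A minimal sketch of how such training data could be generated, assuming a toy STRIPS model with frozenset preconditions and effects: valid sequences follow applicable actions, while invalid ones end with an action whose precondition is broken by the hidden state. Names and data layout are illustrative.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset   # preconditions
    add: frozenset   # add effects
    dele: frozenset  # delete effects

def applicable(state, action):
    return action.pre <= state  # every precondition holds in the hidden state

def apply_action(state, action):
    return (state - action.dele) | action.add

def sample_sequence(init, actions, length, valid=True):
    """Valid sequences follow applicable actions; invalid ones end with an action
    whose precondition is violated by the hidden effects of the prefix (assumes
    the domain always offers at least one inapplicable action)."""
    state, seq = init, []
    for _ in range(length):
        pool = [a for a in actions if applicable(state, a)]
        if not pool:
            break
        a = random.choice(pool)
        seq.append(a.name)
        state = apply_action(state, a)
    if not valid:
        bad = [a for a in actions if not applicable(state, a)]
        seq.append(random.choice(bad).name)
    return seq  # token sequence for next-token training, labeled by `valid`
```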

[341] LIMI: Less is More for Agency

Yang Xiao, Mohan Jiang, Jie Sun, Keyu Li, Jifan Lin, Yumin Zhuang, Ji Zeng, Shijie Xia, Qishuo Hua, Xuefeng Li, Xiaojie Cai, Tongyu Wang, Yue Zhang, Liming Liu, Xia Wu, Jinlong Hou, Yuan Cheng, Wenjie Li, Xiang Wang, Dequan Wang, Pengfei Liu

Main category: cs.AI

TL;DR: LIMI demonstrates that AI agency emerges from strategic curation of minimal high-quality demonstrations, not data abundance, achieving superior performance with 128x fewer samples than traditional approaches.

DetailsMotivation: Industries need AI systems that can autonomously execute tasks and drive real-world outcomes, moving beyond just reasoning to become productive workers. Current approaches wrongly assume more data yields better agency.

Method: LIMI uses strategic focus on collaborative software development and scientific research workflows, training with only 78 carefully designed demonstrations of autonomous behavior.

Result: LIMI achieves 73.5% on agency benchmarks, dramatically outperforming state-of-the-art models and showing 53.7% improvement over models trained on 10,000 samples with 128 times fewer samples.

Conclusion: The Agency Efficiency Principle: machine autonomy emerges from strategic curation of high-quality agentic demonstrations, not data abundance, challenging traditional scaling laws.

Abstract: We define Agency as the emergent capacity of AI systems to function as autonomous agents actively discovering problems, formulating hypotheses, and executing solutions through self-directed engagement with environments and tools. This fundamental capability marks the dawn of the Age of AI Agency, driven by a critical industry shift: the urgent need for AI systems that don’t just think, but work. While current AI excels at reasoning and generating responses, industries demand autonomous agents that can execute tasks, operate tools, and drive real-world outcomes. As agentic intelligence becomes the defining characteristic separating cognitive systems from productive workers, efficiently cultivating machine autonomy becomes paramount. Current approaches assume that more data yields better agency, following traditional scaling laws from language modeling. We fundamentally challenge this paradigm. LIMI (Less Is More for Intelligent Agency) demonstrates that agency follows radically different development principles. Through strategic focus on collaborative software development and scientific research workflows, we show that sophisticated agentic intelligence can emerge from minimal but strategically curated demonstrations of autonomous behavior. Using only 78 carefully designed training samples, LIMI achieves 73.5% on comprehensive agency benchmarks, dramatically outperforming state-of-the-art models: Kimi-K2-Instruct (24.1%), DeepSeek-V3.1 (11.9%), Qwen3-235B-A22B-Instruct (27.5%), and GLM-4.5 (45.1%). Most strikingly, LIMI demonstrates 53.7% improvement over models trained on 10,000 samples, achieving superior agentic intelligence with 128 times fewer samples. Our findings establish the Agency Efficiency Principle: machine autonomy emerges not from data abundance but from strategic curation of high-quality agentic demonstrations.

[342] Efficient & Correct Predictive Equivalence for Decision Trees

Joao Marques-Silva, Alexey Ignatiev

Main category: cs.AI

TL;DR: This paper identifies critical limitations in the MBDSR approach (using Quine-McCluskey method) for analyzing decision trees, showing it can have exponential complexity and produce incorrect results, while proposing polynomial-time alternatives.

DetailsMotivation: The Rashomon set of decision trees contains predictive equivalent DTs that cause inaccuracies in feature importance analysis. Existing MBDSR approach using Quine-McCluskey method has theoretical and practical limitations.

Method: The paper demonstrates exponential worst-case scenarios for QM method, shows MBDSR can produce incorrect results, and proposes polynomial-time algorithms for solving the same problems that MBDSR addresses.

Result: Experiments confirm that for DTs triggering QM’s worst-case, the proposed algorithms are orders of magnitude faster than MBDSR approaches.

Conclusion: The paper provides efficient polynomial-time alternatives to the problematic MBDSR approach, enabling more reliable analysis of decision trees without exponential complexity or correctness issues.

Abstract: The Rashomon set of decision trees (DTs) finds important uses. Recent work showed that DTs computing the same classification function, i.e. predictive equivalent DTs, can represent a significant fraction of the Rashomon set. Such redundancy is undesirable. For example, feature importance based on the Rashomon set becomes inaccurate due to the existence of predictive equivalent DTs, i.e. DTs with the same prediction for every possible input. In recent work, McTavish et al. proposed solutions for several computational problems related to DTs, including that of deciding predictive equivalence of DTs. This approach, which this paper refers to as MBDSR, consists of applying the well-known method of Quine-McCluskey (QM) for obtaining minimum-size DNF (disjunctive normal form) representations of DTs, which are then used for comparing DTs for predictive equivalence. Furthermore, the minimum-size DNF representation was also applied to computing explanations for the predictions made by DTs, and to finding predictions in the presence of missing data. However, the problem of formula minimization is hard for the second level of the polynomial hierarchy, and the QM method may exhibit worst-case exponential running time and space. This paper first demonstrates that there exist decision trees that trigger the worst-case exponential running time and space of the QM method. Second, the paper shows that, depending on the QM method implementation, the MBDSR approach can produce incorrect results for the problem of deciding predictive equivalence. Third, the paper shows that any of the problems to which the smallest DNF representation has been applied can be solved in polynomial time, in the size of the DT. The experiments confirm that, for DTs for which the worst-case of the QM method is triggered, the algorithms proposed in this paper are orders of magnitude faster than the ones proposed by McTavish et al.
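
In the spirit of the paper's polynomial-time claim, the sketch below checks predictive equivalence of two boolean-feature decision trees by verifying that every leaf region of the first tree maps to the single matching label in the second. This is one simple O(|t1| x |t2|) check, not necessarily the authors' algorithm.

```python
def leaves_under(tree, constraints):
    """Yield leaf labels of `tree` reachable under a partial boolean assignment.
    Trees are leaf labels or (feature, low_subtree, high_subtree) tuples."""
    if not isinstance(tree, tuple):
        yield tree
        return
    feat, low, high = tree
    if constraints.get(feat) is not True:    # feature may still be False
        yield from leaves_under(low, constraints)
    if constraints.get(feat) is not False:   # feature may still be True
        yield from leaves_under(high, constraints)

def predictive_equivalent(t1, t2):
    """The leaves of t1 partition the input space, so it suffices that t2 is
    constant with the matching label on every leaf region of t1."""
    def check(node, constraints):
        if not isinstance(node, tuple):
            return all(label == node for label in leaves_under(t2, constraints))
        feat, low, high = node
        return (check(low, {**constraints, feat: False})
                and check(high, {**constraints, feat: True}))
    return check(t1, {})
```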

[343] MAPO: Mixed Advantage Policy Optimization

Wenke Huang, Quan Zhang, Yiyang Fang, Jian Liang, Xuankun Rong, Huanjin Yao, Guancheng Wan, Ke Liang, Wenwen He, Mingjun Li, Leszek Rutkowski, Mang Ye, Bo Du, Dacheng Tao

Main category: cs.AI

TL;DR: Proposes Mixed Advantage Policy Optimization (MAPO) to address advantage reversion and mirror problems in GRPO by dynamically reweighting advantage functions based on trajectory certainty.

DetailsMotivation: Existing GRPO methods suffer from advantage reversion and advantage mirror problems that hinder reasonable advantage allocation across different query samples in reinforcement learning for foundation models.

Method: Introduces advantage percent deviation for high-certainty trajectories and dynamically reweights advantage functions based on trajectory certainty to adaptively configure advantage functions for sample-specific characteristics.

Result: Comparison with state-of-the-art methods and ablation studies validate the effectiveness of MAPO in improving foundation model performance on reasoning tasks.

Conclusion: MAPO provides an effective solution to address advantage allocation problems in GRPO through trajectory certainty-based advantage reweighting, enhancing reinforcement learning for foundation models.

Abstract: Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as a central mechanism in GRPO for ranking the trajectory importance. However, existing explorations encounter both advantage reversion and advantage mirror problems, which hinder the reasonable advantage allocation across different query samples. In this work, we propose an easy but effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We reveal that the trajectory appears with different certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adaptively configuring the advantage function to account for sample-specific characteristics. Comparison with related state-of-the-art methods, along with ablation studies on different advantage variants, validates the effectiveness of our approach.
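
A heavily simplified sketch of the mixing idea: a standard group-relative advantage and a percent-deviation variant are blended by a certainty weight derived from the reward spread within the rollout group. The certainty measure and mixing rule below are assumptions; the paper's formulation is more elaborate.

```python
import torch

def mixed_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) rollout-group rewards for one query. Blend a standard
    group-relative advantage with a percent-deviation variant using a
    certainty weight derived from the group's reward spread."""
    mean, std = rewards.mean(), rewards.std()
    adv_std = (rewards - mean) / (std + eps)            # standard GRPO-style advantage
    adv_pct = (rewards - mean) / (mean.abs() + eps)     # advantage percent deviation
    certainty = 1.0 / (1.0 + std / (mean.abs() + eps))  # high when rollouts agree
    return certainty * adv_pct + (1.0 - certainty) * adv_std
```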

[344] MACD: Multi-Agent Clinical Diagnosis with Self-Learned Knowledge for LLM

Wenliang Li, Rui Yan, Xu Zhang, Li Chen, Hongji Zhu, Jing Zhao, Junjun Li, Mengru Li, Wei Cao, Zihang Jiang, Wei Wei, Kun Zhang, Shaohua Kevin Zhou

Main category: cs.AI

TL;DR: The paper proposes MACD, a multi-agent framework that enables LLMs to self-learn clinical knowledge through experience accumulation, significantly improving diagnostic accuracy and achieving performance comparable to or exceeding human physicians.

DetailsMotivation: Current LLM approaches for clinical diagnosis optimize isolated inferences but neglect the accumulation of reusable clinical experience, failing to mirror how physicians develop expertise through experience.

Method: A Multi-Agent Clinical Diagnosis (MACD) framework with a pipeline that summarizes, refines, and applies diagnostic insights, plus a MACD-human collaborative workflow with iterative consultations between diagnostician agents, evaluator agent, and human oversight.

Result: MACD improves primary diagnostic accuracy by up to 22.3% over clinical guidelines, achieves up to 16% improvement over physicians-only diagnosis, and the MACD-human workflow shows 18.6% improvement over physicians-only diagnosis. The system also demonstrates cross-model stability and generates traceable rationales.

Conclusion: MACD presents a scalable self-learning paradigm that bridges the gap between LLMs’ intrinsic knowledge and real-world clinical practice, enabling more accurate and explainable medical diagnosis.

Abstract: Large language models (LLMs) have demonstrated notable potential in medical applications, yet they face substantial challenges in handling complex real-world clinical diagnoses using conventional prompting methods. Current prompt engineering and multi-agent approaches typically optimize isolated inferences, neglecting the accumulation of reusable clinical experience. To address this, this study proposes a novel Multi-Agent Clinical Diagnosis (MACD) framework, which allows LLMs to self-learn clinical knowledge via a multi-agent pipeline that summarizes, refines, and applies diagnostic insights. It mirrors how physicians develop expertise through experience, enabling more focused and accurate diagnosis on key disease-specific cues. We further extend it to a MACD-human collaborative workflow, where multiple LLM-based diagnostician agents engage in iterative consultations, supported by an evaluator agent and human oversight for cases where agreement is not reached. Evaluated on 4,390 real-world patient cases across seven diseases using diverse open-source LLMs (Llama-3.1 8B/70B, DeepSeek-R1-Distill-Llama 70B), MACD significantly improves primary diagnostic accuracy, outperforming established clinical guidelines with gains up to 22.3% (MACD). On a subset of the data, it achieves performance on par with or exceeding that of human physicians (up to 16% improvement over physicians-only diagnosis). Additionally, the MACD-human workflow achieves an 18.6% improvement compared to physicians-only diagnosis. Moreover, self-learned knowledge exhibits strong cross-model stability, transferability, and model-specific personalization, while the system can generate traceable rationales, enhancing explainability. Consequently, this work presents a scalable self-learning paradigm for LLM-assisted diagnosis, bridging the gap between the intrinsic knowledge of LLMs and real-world clinical practice.

cs.SD

[345] QAMO: Quality-aware Multi-centroid One-class Learning For Speech Deepfake Detection

Duc-Tuan Truong, Tianchi Liu, Ruijie Tao, Junjie Li, Kong Aik Lee, Eng Siong Chng

Main category: cs.SD

TL;DR: QAMO proposes a quality-aware multi-centroid one-class learning approach for deepfake speech detection, addressing limitations of single-centroid methods by modeling bona fide speech across different quality subspaces.

DetailsMotivation: Single-centroid one-class learning oversimplifies bona fide speech representation and overlooks useful cues like speech quality, which reflects naturalness. Existing speech quality assessment models can easily estimate quality through Mean Opinion Score.

Method: Extends conventional one-class learning by introducing multiple quality-aware centroids, each optimized to represent distinct speech quality subspaces. Uses multi-centroid ensemble scoring strategy that improves decision thresholding and reduces need for quality labels during inference.

Result: With two centroids representing high- and low-quality speech, QAMO achieves an equal error rate of 5.09% on the In-the-Wild dataset, outperforming previous one-class and quality-aware systems.

Conclusion: QAMO effectively models intra-class variability in bona fide speech through quality-aware multi-centroid approach, demonstrating superior performance in deepfake speech detection compared to existing methods.

Abstract: Recent work shows that one-class learning can detect unseen deepfake attacks by modeling a compact distribution of bona fide speech around a single centroid. However, the single-centroid assumption can oversimplify the bona fide speech representation and overlook useful cues, such as speech quality, which reflects the naturalness of the speech. Speech quality can be easily obtained using existing speech quality assessment models that estimate it through Mean Opinion Score. In this paper, we propose QAMO: Quality-Aware Multi-Centroid One-Class Learning for speech deepfake detection. QAMO extends conventional one-class learning by introducing multiple quality-aware centroids. In QAMO, each centroid is optimized to represent a distinct speech quality subspace, enabling better modeling of intra-class variability in bona fide speech. In addition, QAMO supports a multi-centroid ensemble scoring strategy, which improves decision thresholding and reduces the need for quality labels during inference. With two centroids to represent high- and low-quality speech, our proposed QAMO achieves an equal error rate of 5.09% on the In-the-Wild dataset, outperforming previous one-class and quality-aware systems.
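
The scoring side of this idea can be sketched compactly: with K quality-aware centroids, the ensemble score is the utterance embedding's best similarity to any centroid, so no quality label is needed at inference. The MOS-bucket routing and threshold below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def assign_centroid(mos: float, threshold: float = 3.5) -> int:
    """Training-time routing: bucket a bona fide utterance by its MOS
    (threshold value is illustrative)."""
    return int(mos > threshold)  # 0 = low-quality centroid, 1 = high-quality

def qamo_score(embedding: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    """embedding: (D,); centroids: (K, D). The ensemble score is the best
    similarity to any quality-aware centroid, so inference needs no MOS."""
    sims = F.cosine_similarity(embedding.unsqueeze(0), centroids, dim=-1)  # (K,)
    return sims.max()  # high => bona fide-like, low => likely spoof
```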

[346] Addressing Gradient Misalignment in Data-Augmented Training for Robust Speech Deepfake Detection

Duc-Tuan Truong, Tianchi Liu, Junjie Li, Ruijie Tao, Kong Aik Lee, Eng Siong Chng

Main category: cs.SD

TL;DR: The paper proposes a dual-path data-augmented (DPDA) training framework with gradient alignment to address gradient conflicts between original and augmented inputs in speech deepfake detection, achieving improved convergence and performance.

DetailsMotivation: Data augmentation in speech deepfake detection can cause conflicting parameter updates due to misaligned gradients from original and augmented inputs, which hinders convergence and reduces the benefits of augmentation.

Method: A dual-path training framework where each utterance is processed through two paths (original and augmented) with gradient alignment to reduce optimization conflicts by comparing and aligning their backpropagated gradient directions.

Result: The method reduces gradient conflicts (observed in ~25% of iterations with RawBoost), accelerates convergence by requiring fewer training epochs, and achieves up to 18.69% relative reduction in Equal Error Rate on the In-the-Wild dataset compared to baseline.

Conclusion: Gradient alignment in dual-path data-augmented training effectively resolves optimization conflicts, improving both training efficiency and speech deepfake detection performance.

Abstract: In speech deepfake detection (SDD), data augmentation (DA) is commonly used to improve model generalization across varied speech conditions and spoofing attacks. However, during training, the backpropagated gradients from original and augmented inputs may misalign, which can result in conflicting parameter updates. These conflicts could hinder convergence and push the model toward suboptimal solutions, thereby reducing the benefits of DA. To investigate and address this issue, we design a dual-path data-augmented (DPDA) training framework with gradient alignment for SDD. In our framework, each training utterance is processed through two input paths: one using the original speech and the other with its augmented version. This design allows us to compare and align their backpropagated gradient directions to reduce optimization conflicts. Our analysis shows that approximately 25% of training iterations exhibit gradient conflicts between the original inputs and their augmented counterparts when using RawBoost augmentation. By resolving these conflicts with gradient alignment, our method accelerates convergence by reducing the number of training epochs and achieves up to an 18.69% relative reduction in Equal Error Rate on the In-the-Wild dataset compared to the baseline.
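
One common way to implement such an alignment step is a PCGrad-style projection, sketched below: when the two paths' gradients have negative cosine similarity, the augmented-path gradient is projected onto the normal plane of the original-path gradient before averaging. The paper's exact alignment rule may differ.

```python
import torch

def align_gradients(g_orig: torch.Tensor, g_aug: torch.Tensor) -> torch.Tensor:
    """g_orig, g_aug: flattened gradients from the original and augmented paths.
    On conflict (negative cosine), drop the component of the augmented gradient
    that opposes the original one, then average the two updates."""
    dot = torch.dot(g_orig, g_aug)
    if dot < 0:  # conflicting parameter updates
        g_aug = g_aug - (dot / (g_orig.norm() ** 2 + 1e-12)) * g_orig
    return 0.5 * (g_orig + g_aug)
```

In practice the two gradients would come from separate backward passes (e.g. via torch.autograd.grad) over the same utterance's original and RawBoost-augmented versions.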

[347] AIBA: Attention-based Instrument Band Alignment for Text-to-Audio Diffusion

Junyoung Koh, Soo Yong Kim, Gyu Hyeong Choi, Yongwon Choi

Main category: cs.SD

TL;DR: AIBA is a training-free pipeline that quantifies where text-to-audio diffusion models attend on the time-frequency plane using cross-attention hooks and interpretable metrics.

DetailsMotivation: To understand where text-to-audio diffusion models focus their attention during generation, specifically on the time-frequency plane, without requiring retraining.

Method: Hooks cross-attention at inference to record attention probabilities, projects them to fixed-size mel grids comparable to audio energy, and scores agreement with ground truth using T-F IoU/AP, frequency-profile correlation, and pointing game metrics.

Result: On Slakh2100 with AudioLDM2, AIBA reveals consistent instrument-dependent attention patterns (e.g., bass favoring low bands) and achieves high precision with moderate recall.

Conclusion: AIBA provides an effective, lightweight method for analyzing attention in text-to-audio models, revealing interpretable patterns without model modification.

Abstract: We present AIBA (Attention-In-Band Alignment), a lightweight, training-free pipeline to quantify where text-to-audio diffusion models attend on the time-frequency (T-F) plane. AIBA (i) hooks cross-attention at inference to record attention probabilities without modifying weights; (ii) projects them to fixed-size mel grids that are directly comparable to audio energy; and (iii) scores agreement with instrument-band ground truth via interpretable metrics (T-F IoU/AP, frequency-profile correlation, and a pointing game). On Slakh2100 with an AudioLDM2 backbone, AIBA reveals consistent instrument-dependent trends (e.g., bass favoring low bands) and achieves high precision with moderate recall.
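
A minimal sketch of the two mechanical pieces, assuming the attention module returns its probabilities as the second output (backbone-dependent): a forward hook records cross-attention maps without touching weights, and a thresholded IoU compares an attention grid, resized to mel-spectrogram shape, against an instrument-band mask.

```python
import torch

attn_maps = []

def record_attention(module, inputs, output):
    """Forward hook: stash cross-attention probabilities without modifying weights.
    Assumes the module returns (hidden_states, attn_probs); the actual output
    layout depends on the diffusion backbone."""
    attn_maps.append(output[1].detach().cpu())

# handle = pipeline.unet.mid_block.attentions[0].register_forward_hook(record_attention)  # hypothetical module path

def tf_iou(attn_grid: torch.Tensor, band_mask: torch.Tensor, thresh: float = 0.5) -> float:
    """T-F IoU between an attention map projected to a (mel, time) grid and a
    boolean instrument-band ground-truth mask of the same shape."""
    pred = attn_grid >= thresh * attn_grid.max()
    inter = (pred & band_mask).sum().item()
    union = (pred | band_mask).sum().item()
    return inter / union if union else 0.0
```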

[348] SingVERSE: A Diverse, Real-World Benchmark for Singing Voice Enhancement

Shaohan Jiang, Junan Zhang, Yunjia Zhang, Jing Yang, Fan Fan, Zhizheng Wu

Main category: cs.SD

TL;DR: This paper introduces SingVERSE, the first real-world benchmark for singing voice enhancement, addressing the lack of realistic evaluation data and revealing a trade-off between perceptual quality and intelligibility in existing models.

DetailsMotivation: The development of singing voice enhancement is limited by the lack of realistic evaluation data, creating a gap in the field that needs to be addressed with proper benchmarking tools.

Method: The authors create SingVERSE benchmark covering diverse acoustic scenarios with paired studio-quality clean references, then conduct comprehensive evaluation of state-of-the-art models using this benchmark.

Result: The evaluation uncovers a consistent trade-off between perceptual quality and intelligibility in existing models, and shows that training on in-domain singing data substantially improves enhancement performance without degrading speech capabilities.

Conclusion: This work provides the community with a foundational benchmark and critical insights to guide future advances in singing voice enhancement, establishing a simple yet effective path forward through domain-specific training.

Abstract: This paper presents a benchmark for singing voice enhancement. The development of singing voice enhancement is limited by the lack of realistic evaluation data. To address this gap, this paper introduces SingVERSE, the first real-world benchmark for singing voice enhancement, covering diverse acoustic scenarios and providing paired, studio-quality clean references. Leveraging SingVERSE, we conduct a comprehensive evaluation of state-of-the-art models and uncover a consistent trade-off between perceptual quality and intelligibility. Finally, we show that training on in-domain singing data substantially improves enhancement performance without degrading speech capabilities, establishing a simple yet effective path forward. This work offers the community a foundational benchmark together with critical insights to guide future advances in this underexplored domain. Demopage: https://singverse.github.io

[349] i-LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents

Anupam Purwar, Aditya Choudhary

Main category: cs.SD

TL;DR: This paper analyzes optimization strategies for low-latency voice-to-voice communication systems, focusing on reducing processing time while maintaining quality, with particular emphasis on TTS component optimizations.

DetailsMotivation: To optimize real-time conversational applications by reducing processing latency in voice-to-voice systems while preserving high-quality interactions.

Method: Analyzed components of V-2-V systems (ASR, TTS, dialog management) and experimented with optimizing TTS by reducing Residual Vector Quantization (RVQ) iterations and codebooks in the CSM1b architecture.

Result: Identified that the TTS component has the highest impact on Real-Time Factor (RTF); optimization can be achieved by reducing RVQ iterations and codebooks, though this comes at the cost of decreased voice quality.

Conclusion: The most important optimizations for V-2-V systems based on CSM architecture can be achieved by strategically reducing RVQ iterations and codebooks in the TTS component to balance latency and quality.

Abstract: We experiment with a low-latency, end-to-end voice-to-voice communication model to optimize it for real-time conversational applications. By analyzing the components essential to a voice-to-voice (V-2-V) system, viz. automatic speech recognition (ASR), text-to-speech (TTS), and dialog management, our work examines how to reduce processing time while maintaining high-quality interactions, thereby identifying the levers for optimizing the V-2-V system. Our work identifies that the TTS component, which generates life-like voice full of emotion, including natural pauses and exclamations, has the highest impact on the real-time factor (RTF). The experimented V-2-V architecture utilizes CSM1b, which has the capability to understand the tone as well as the context of a conversation by ingesting both audio and text of prior exchanges to generate contextually accurate speech. We explored optimizing the number of Residual Vector Quantization (RVQ) iterations performed by the TTS decoder, which comes at the cost of a decrease in the quality of the generated voice. Our experimental evaluations also demonstrate that for V-2-V implementations based on CSM, the most important optimizations can be achieved by reducing the number of RVQ iterations along with the codebooks used in Mimi.

[350] SupCLAP: Controlling Optimization Trajectory Drift in Audio-Text Contrastive Learning with Support Vector Regularization

Jiehui Luo, Yuguo Yin, Yuxin Xie, Jinghan Ru, Xianwei Zhuang, Minghua He, Aofan Liu, Zihan Xiong, Dongchao Yang

Main category: cs.SD

TL;DR: The paper proposes Support Vector Regularization (SVR) to address the perpendicular component of negative sample forces in contrastive language-audio pretraining, which causes optimization drift while containing valuable information.

DetailsMotivation: The perpendicular component of pushing forces from negative samples in contrastive learning contains rich supplementary information but causes optimization trajectory drift and training instability, creating a need for better control mechanisms.

Method: SVR introduces an auxiliary support vector to control the perpendicular component, with two unsupervised modeling strategies for semantic radius: direct parameterization and an adaptive radius predictor module with constraints.

Result: Extensive experiments show SVR outperforms baselines like InfoNCE and SigLIP loss across classification, monolingual retrieval, and multilingual retrieval on standard audio-text datasets.

Conclusion: Both theoretical analysis and experimental results validate that SVR effectively harnesses the rich information from negative samples while mitigating optimization trajectory drift.

Abstract: Contrastive language-audio pretraining, which aims to unify multimodal representations in a shared embedding space, serves as a cornerstone for building a wide range of applications, from cross-modal retrieval to cutting-edge multimodal large language models. However, we find that the perpendicular component of the pushing force from negative samples in contrastive learning is a double-edged sword: it contains rich supplementary information from negative samples, yet its unconstrained nature causes optimization trajectory drift and training instability. To address this, we propose Support Vector Regularization (SVR), a method that introduces an auxiliary support vector to control this perpendicular component, aiming to harness its rich information while mitigating the associated trajectory drift. The efficacy of SVR is critically governed by its semantic radius, for which we explore two unsupervised modeling strategies: direct parameterization and an adaptive radius predictor module enhanced with constraints to improve its prediction accuracy. Extensive experimental results demonstrate that our method surpasses widely used baselines like InfoNCE and SigLIP loss across classification, monolingual retrieval, and multilingual retrieval on standard audio-text datasets. Both the theoretical analysis and the experimental results on optimization trajectory drift validate the correctness and effectiveness of our SVR method.
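
The geometric idea can be sketched as follows: decompose the pushing force from a negative sample into components parallel and perpendicular to the anchor direction, then cap the perpendicular part at a semantic radius. In SVR the support vector and radius are parameterized or predicted, so the fixed radius below is an assumption.

```python
import torch

def svr_cap_perpendicular(force: torch.Tensor, anchor: torch.Tensor,
                          radius: float) -> torch.Tensor:
    """Split the negative-sample pushing force into components parallel and
    perpendicular to the anchor direction, then cap the perpendicular part
    at a semantic radius to limit optimization trajectory drift."""
    unit = anchor / (anchor.norm() + 1e-12)
    parallel = torch.dot(force, unit) * unit
    perp = force - parallel
    norm = perp.norm()
    if norm > radius:
        perp = perp * (radius / norm)  # keep direction, bound magnitude
    return parallel + perp
```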

[351] UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice

Sitong Cheng, Weizhen Bian, Xinsheng Wang, Ruibin Yuan, Jianyi Chen, Shunshun Yin, Yike Guo, Wei Xue

Main category: cs.SD

TL;DR: UniSS is a novel single-stage framework for expressive speech-to-speech translation that preserves speaker identity and emotional style through semantic and style modeling, cross-modal chain-of-thought prompting, and a new large-scale dataset.

DetailsMotivation: Address three key challenges in expressive S2ST: scarcity of paired speech data with expressive styles, complexity of multi-stage pipelines, and limited translation capability transfer from LLMs to speech.

Method: Introduces UniSS framework with speech semantic and style modeling, integrates with text-based LLMs to create unified text-speech language model, uses cross-modal chain-of-thought prompting for capability transfer, and constructs UniST dataset (44.8k hours).

Result: UniSS significantly outperforms previous methods in translation fidelity and speech quality while preserving voice, emotion, and duration consistency.

Conclusion: Establishes a simpler and more effective paradigm for next-generation expressive S2ST systems, demonstrating superior performance over existing approaches.

Abstract: The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key challenges: the scarcity of paired speech data that retains expressive styles, the complexity of multi-stage processing pipelines, and the limited transfer of translation capabilities from large language models (LLMs). In this work, we address these challenges by introducing UniSS, a novel single-stage framework for expressive S2ST. Our approach features carefully designed speech semantic and style modeling, enabling seamless integration with existing text-based LLM frameworks to develop a unified text-speech language model. To transfer translation capabilities from text to speech, we propose a cross-modal chain-of-thought prompting process that progressively aligns audio semantics with text and ensures style preservation in the decoded results. Furthermore, we construct and release a large-scale, high-quality expressive S2ST dataset, UniST, comprising 44.8k hours of data. Experimental results show that UniSS significantly outperforms previous methods in translation fidelity and speech quality while preserving voice, emotion, and duration consistency. Our work establishes a simpler and more effective paradigm for building the next generation of expressive S2ST systems. Audio samples are available at https://cmots.github.io/uniss-demo.

[352] On the Language and Gender Biases in PSTN, VoIP and Neural Audio Codecs

Kemal Altwlkany, Amar Kuric, Emanuel Lacic

Main category: cs.SD

TL;DR: Analysis of language and gender biases in audio codecs reveals that PSTN codecs show strong gender bias while neural codecs introduce language biases, highlighting fairness issues in speech technology.

DetailsMotivation: There is growing concern about fairness and inclusivity in speech technology, particularly when audio transcoding may introduce biases that affect user experience and perpetuate societal stereotypes. Current research on language and gender biases in audio codecs is scarce.

Method: Analyzed speech quality of over 2 million multilingual audio files after transcoding through a representative subset of codecs including PSTN, VoIP, and neural codecs.

Result: PSTN codecs demonstrate strong gender bias, while neural codecs introduce language biases in the transcoding process.

Conclusion: Audio coding mechanisms can exhibit significant biases, emphasizing the need for unbiased codec development to ensure fairness in speech technology applications.

Abstract: In recent years, there has been a growing focus on fairness and inclusivity within speech technology, particularly in areas such as automatic speech recognition and speech sentiment analysis. When audio is transcoded prior to processing, as is the case in streaming or real-time applications, any inherent bias in the coding mechanism may result in disparities. This not only affects user experience but can also have broader societal implications by perpetuating stereotypes and exclusion. Thus, it is important that audio coding mechanisms are unbiased. In this work, we contribute towards the scarce research with respect to language and gender biases of audio codecs. By analyzing the speech quality of over 2 million multilingual audio files after transcoding through a representative subset of codecs (PSTN, VoIP and neural), our results indicate that PSTN codecs are strongly biased in terms of gender and that neural codecs introduce language biases.

[353] Facilitating Personalized TTS for Dysarthric Speakers Using Knowledge Anchoring and Curriculum Learning

Yejin Jeon, Solee Im, Youngjae Kim, Gary Geunbae Lee

Main category: cs.SD

TL;DR: A knowledge anchoring framework using teacher-student models with curriculum learning for zero-shot multi-speaker TTS that generates synthetic speech with reduced articulation errors for dysarthric speakers.

DetailsMotivation: Dysarthric speakers face communication challenges due to impaired motor control, leading to reduced speech intelligibility. Limited audio data and articulation errors make personalized TTS model training infeasible through traditional recording methods.

Method: Framed as a domain transfer task, the approach uses a knowledge anchoring framework with teacher-student models enhanced by curriculum learning through audio augmentation.

Result: The zero-shot multi-speaker TTS model effectively generates synthetic speech with markedly reduced articulation errors and high speaker fidelity while maintaining prosodic naturalness.

Conclusion: The proposed framework successfully addresses the challenges of personalized speech synthesis for dysarthric speakers by leveraging domain transfer and knowledge anchoring techniques.

Abstract: Dysarthric speakers experience substantial communication challenges due to impaired motor control of the speech apparatus, which leads to reduced speech intelligibility. This creates significant obstacles in dataset curation since actual recording of long, articulate sentences for the objective of training personalized TTS models becomes infeasible. Thus, the limited availability of audio data, in addition to the articulation errors that are present within the audio, complicates personalized speech synthesis for target dysarthric speaker adaptation. To address this, we frame the issue as a domain transfer task and introduce a knowledge anchoring framework that leverages a teacher-student model, enhanced by curriculum learning through audio augmentation. Experimental results show that the proposed zero-shot multi-speaker TTS model effectively generates synthetic speech with markedly reduced articulation errors and high speaker fidelity, while maintaining prosodic naturalness.

[354] Neural Audio Codecs for Prompt-Driven Universal Sound Separation

Adhiraj Banerjee, Vipul Arora

Main category: cs.SD

TL;DR: CodecSep is a compute-efficient neural audio codec-based model for text-guided sound separation that achieves better separation fidelity than AudioSep while requiring 54x less computation.

DetailsMotivation: Existing text-guided sound separation models like AudioSep are too compute-heavy for edge deployment, while efficient neural audio codec models are limited to fixed-class separation.

Method: Combines DAC compression with a Transformer masker modulated by CLAP-derived FiLM parameters, operating directly on code-streams for efficient deployment.

Result: Surpasses AudioSep in separation fidelity (SI-SDR) across six benchmarks, remains competitive in perceptual quality (ViSQOL), and matches/exceeds fixed-stem baselines while requiring only 1.35 GMACs end-to-end.

Conclusion: CodecSep enables universal, text-driven separation with dramatically reduced computational requirements (54x less than AudioSep) while maintaining full bitstream compatibility.

Abstract: Text-guided sound separation supports flexible audio editing across media and assistive applications, but existing models like AudioSep are too compute-heavy for edge deployment. Neural audio codec (NAC) models such as CodecFormer and SDCodec are compute-efficient but limited to fixed-class separation. We introduce CodecSep, the first NAC-based model for on-device universal, text-driven separation. CodecSep combines DAC compression with a Transformer masker modulated by CLAP-derived FiLM parameters. Across six open-domain benchmarks under matched training/prompt protocols, CodecSep surpasses AudioSep in separation fidelity (SI-SDR) while remaining competitive in perceptual quality (ViSQOL) and matching or exceeding fixed-stem baselines (TDANet, CodecFormer, SDCodec). In code-stream deployments, it needs just 1.35 GMACs end-to-end – approximately 54x less compute (25x architecture-only) than spectrogram-domain separators like AudioSep – while remaining fully bitstream-compatible.
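
The conditioning mechanism is standard FiLM: the CLAP text embedding is mapped to per-channel scale and shift parameters that modulate the masker's hidden states, as in the sketch below. Dimensions and layer placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FiLMFromCLAP(nn.Module):
    """Maps a CLAP text embedding to per-channel scale/shift (FiLM) parameters
    that modulate the Transformer masker's code-stream features."""
    def __init__(self, clap_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.to_gamma = nn.Linear(clap_dim, hidden_dim)
        self.to_beta = nn.Linear(clap_dim, hidden_dim)

    def forward(self, h: torch.Tensor, clap_text: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, hidden_dim); clap_text: (batch, clap_dim)
        gamma = self.to_gamma(clap_text).unsqueeze(1)  # broadcast over time
        beta = self.to_beta(clap_text).unsqueeze(1)
        return gamma * h + beta
```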

[355] Mixture-of-Experts Framework for Field-of-View Enhanced Signal-Dependent Binauralization of Moving Talkers

Manan Mittal, Thomas Deppisch, Joseph Forrer, Chris Le Sueur, Zamir Ben-Hur, David Lou Alon, Daniel D. E. Wong

Main category: cs.SD

TL;DR: A novel mixture of experts framework for field-of-view enhancement in binaural signal matching that enables dynamic spatial audio rendering adapting to continuous talker motion.

DetailsMotivation: To provide dynamic spatial audio rendering that adapts to continuous talker motion, allowing users to emphasize or suppress sounds from selected directions while preserving natural binaural cues, overcoming limitations of traditional methods.

Method: Signal-dependent framework combining multiple binaural filters in an online manner using implicit localization, without relying on explicit direction-of-arrival estimation or Ambisonics domain processing.

Result: Enables real-time tracking and enhancement of moving sound sources, supporting applications like speech focus, noise reduction, and world-locked audio in AR/VR.

Conclusion: The method is agnostic to array geometry, offering a flexible solution for spatial audio capture and personalized playback in next-generation consumer audio devices.

Abstract: We propose a novel mixture of experts framework for field-of-view enhancement in binaural signal matching. Our approach enables dynamic spatial audio rendering that adapts to continuous talker motion, allowing users to emphasize or suppress sounds from selected directions while preserving natural binaural cues. Unlike traditional methods that rely on explicit direction-of-arrival estimation or operate in the Ambisonics domain, our signal-dependent framework combines multiple binaural filters in an online manner using implicit localization. This allows for real-time tracking and enhancement of moving sound sources, supporting applications such as speech focus, noise reduction, and world-locked audio in augmented and virtual reality. The method is agnostic to array geometry, offering a flexible solution for spatial audio capture and personalized playback in next-generation consumer audio devices.

[356] MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech

Jialong Mai, Jinxin Ji, Xiaofen Xing, Chen Yang, Weidong Chen, Jingyuan Xing, Xiangmin Xu

Main category: cs.SD

TL;DR: The paper introduces MNV-17, a 7.55-hour performative Mandarin speech dataset designed to address the lack of high-quality annotated data for nonverbal vocalization (NV) recognition in ASR systems.

DetailsMotivation: Current ASR systems excel at transcribing lexical content but fail to recognize nonverbal vocalizations (sighs, laughs, coughs) that convey crucial emotional and intentional cues. Progress has been hindered by the lack of well-annotated datasets.

Method: Created MNV-17 dataset with performative speech to ensure high-fidelity NV instances. The dataset contains 17 distinct, well-balanced NV categories. Benchmarking was performed on four mainstream ASR architectures for joint semantic transcription and NV classification.

Result: MNV-17 provides the most extensive set of nonverbal vocalization categories to date. The dataset and pretrained model checkpoints will be publicly available.

Conclusion: MNV-17 addresses a critical gap in expressive ASR research and will facilitate future work in comprehensive human communication understanding through NV-aware speech recognition systems.

Abstract: Mainstream Automatic Speech Recognition (ASR) systems excel at transcribing lexical content, but largely fail to recognize nonverbal vocalizations (NVs) embedded in speech, such as sighs, laughs, and coughs. This capability is important for a comprehensive understanding of human communication, as NVs convey crucial emotional and intentional cues. Progress in NV-aware ASR has been hindered by the lack of high-quality, well-annotated datasets. To address this gap, we introduce MNV-17, a 7.55-hour performative Mandarin speech dataset. Unlike most existing corpora that rely on model-based detection, MNV-17’s performative nature ensures high-fidelity, clearly articulated NV instances. To the best of our knowledge, MNV-17 provides the most extensive set of nonverbal vocalization categories, comprising 17 distinct and well-balanced classes of common NVs. We benchmarked MNV-17 on four mainstream ASR architectures, evaluating their joint performance on semantic transcription and NV classification. The dataset and the pretrained model checkpoints will be made publicly available to facilitate future research in expressive ASR.

[357] SEA-Spoof: Bridging The Gap in Multilingual Audio Deepfake Detection for South-East Asian

Jinyang Wu, Nana Hou, Zihan Pan, Qiquan Zhang, Sailor Hardik Bhupendra, Soumik Mondal

Main category: cs.SD

TL;DR: SEA-Spoof is the first large-scale Audio Deepfake Detection dataset specifically for South-East Asian languages, addressing the performance collapse of detection models when applied to SEA languages due to data scarcity and language mismatches.

DetailsMotivation: Current audio deepfake detection datasets poorly cover SEA languages, causing models trained on high-resource languages to fail when applied to SEA due to synthesis quality mismatches, language-specific characteristics, and data scarcity in this critical region.

Method: Created SEA-Spoof dataset with 300+ hours of paired real and spoof speech across Tamil, Hindi, Thai, Indonesian, Malay, and Vietnamese, using diverse state-of-the-art open-source and commercial spoofing systems to capture wide variability in style and fidelity.

Result: Benchmarking showed severe cross-lingual degradation of detection models, but fine-tuning on SEA-Spoof dramatically restored performance across languages and synthesis sources.

Conclusion: The results highlight the urgent need for SEA-focused research and establish SEA-Spoof as a foundation for developing robust, cross-lingual, and fraud-resilient audio deepfake detection systems.

Abstract: The rapid growth of the digital economy in South-East Asia (SEA) has amplified the risks of audio deepfakes, yet current datasets cover SEA languages only sparsely, leaving models poorly equipped to handle this critical region. This omission is critical: detection models trained on high-resource languages collapse when applied to SEA, due to mismatches in synthesis quality, language-specific characteristics, and data scarcity. To close this gap, we present SEA-Spoof, the first large-scale Audio Deepfake Detection (ADD) dataset especially for SEA languages. SEA-Spoof spans 300+ hours of paired real and spoof speech across Tamil, Hindi, Thai, Indonesian, Malay, and Vietnamese. Spoof samples are generated from a diverse mix of state-of-the-art open-source and commercial systems, capturing wide variability in style and fidelity. Benchmarking state-of-the-art detection models reveals severe cross-lingual degradation, but fine-tuning on SEA-Spoof dramatically restores performance across languages and synthesis sources. These results highlight the urgent need for SEA-focused research and establish SEA-Spoof as a foundation for developing robust, cross-lingual, and fraud-resilient detection systems.

cs.LG

[358] A Theory of Multi-Agent Generative Flow Networks

Leo Maxime Brunswic, Haozhi Wang, Shuang Luo, Jianye Hao, Amir Rasouli, Yinchuan Li

Main category: cs.LG

TL;DR: This paper proposes the first theoretical framework for Multi-Agent Generative Flow Networks (MA-GFlowNets), enabling multiple agents to collaboratively generate objects through joint actions with probability proportional to reward.

DetailsMotivation: While single-agent GFlowNets use flow-matching loss to learn stochastic policies for generating objects proportional to rewards, no theoretical framework existed for multi-agent scenarios where agents collaborate through joint actions.

Method: Proposes four algorithms: centralized flow network (centralized training), independent flow network (decentralized execution), joint flow network (centralized training with decentralized execution), and conditional joint flow network. Joint Flow training uses a local-global principle to train local GFNs as a global GFN.

Result: The framework provides theoretical guarantees that independent policies generate samples with probability proportional to the reward function. Experimental results show superiority over reinforcement learning and MCMC-based methods.

Conclusion: The proposed MA-GFlowNets framework successfully extends GFlowNets to multi-agent settings with theoretical foundations and practical algorithms that outperform existing approaches.

Abstract: Generative flow networks utilize a flow-matching loss to learn a stochastic policy for generating objects from a sequence of actions, such that the probability of generating a pattern can be proportional to the corresponding given reward. However, a theoretical framework for multi-agent generative flow networks (MA-GFlowNets) has not yet been proposed. In this paper, we propose the theoretical framework of MA-GFlowNets, which can be applied to multiple agents to generate objects collaboratively through a series of joint actions. We further propose four algorithms: a centralized flow network for centralized training of MA-GFlowNets, an independent flow network for decentralized execution, a joint flow network for achieving centralized training with decentralized execution, and its updated conditional version. Joint Flow training is based on a local-global principle that allows a collection of (local) GFNs to be trained as a unique (global) GFN. This principle provides a loss of reasonable complexity and allows standard results on GFNs to be leveraged to provide theoretical guarantees that the independent policies generate samples with probability proportional to the reward function. Experimental results demonstrate the superiority of the proposed framework compared to reinforcement learning and MCMC-based methods.
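
For context, the single-state flow-matching residual that these networks generalize can be written in log space as below; in the joint flow network the in/out flows range over joint actions of all agents. This is a sketch of the standard GFlowNet objective the paper builds on, not its multi-agent losses.

```python
import torch

def flow_matching_residual(log_in_flows: torch.Tensor,
                           log_out_flows: torch.Tensor,
                           log_reward: torch.Tensor,
                           is_terminal: bool) -> torch.Tensor:
    """Log-space flow-matching residual for one state: total inflow must match
    total outflow, with the reward contributing as terminal outflow. In the
    joint flow network, the flows range over joint actions of all agents."""
    inflow = torch.logsumexp(log_in_flows, dim=-1)
    out_terms = (torch.cat([log_out_flows, log_reward.unsqueeze(-1)], dim=-1)
                 if is_terminal else log_out_flows)
    outflow = torch.logsumexp(out_terms, dim=-1)
    return (inflow - outflow) ** 2  # summed over sampled states during training
```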

[359] FastEagle: Cascaded Drafting for Accelerating Speculative Decoding

Haiduo Huang, Jiangcheng Song, Wenzhe Zhao, Pengju Ren

Main category: cs.LG

TL;DR: FastEagle is a non-autoregressive cascaded drafter that emits entire drafts in a single forward pass, replacing sequential steps with a lightweight layer cascade and layer-wise supervision to mitigate error accumulation.

DetailsMotivation: Current speculative decoding methods like EAGLE require N sequential passes to propose N tokens, limiting acceleration potential. The goal is to remove sequential dependencies in drafting for faster LLM inference.

Method: FastEagle uses a non-autoregressive cascaded drafter with constrained draft trees, replacing temporal steps with layer cascades and training with layer-wise supervision to prevent error accumulation while maintaining lossless verification.

Result: FastEagle consistently outperforms EAGLE-3 in speedup across multiple LLMs (Vicuna-13B, LLaMA-Instruct 3.x, DeepSeek-R1) and tasks (MT-Bench, HumanEval, GSM8K, etc.) under both greedy and stochastic decoding, with comparable acceptance lengths.

Conclusion: Removing sequential dependencies in drafting is a practical path toward lossless LLM inference acceleration, as demonstrated by FastEagle’s substantial wall-clock speedups over strong autoregressive drafters.

Abstract: Speculative decoding accelerates generation by drafting candidates and verifying them in parallel, yet state-of-the-art drafters (e.g., EAGLE) still require N sequential passes to propose N tokens. We present FastEagle, a non-autoregressive cascaded drafter that emits an entire draft in a single forward pass. FastEagle replaces temporal steps with a lightweight layer cascade and trains with layer-wise supervision to mitigate error accumulation. Coupled with a constrained draft tree that preserves lossless verification cost, FastEagle delivers substantial wall-clock speedups over strong autoregressive drafters while maintaining competitive acceptance behavior. Across multiple LLMs (Vicuna-13B, LLaMA-Instruct 3.x, and DeepSeek-R1-Distill-LLaMA) and tasks (MT-Bench, HumanEval, GSM8K, CNN/DM, Alpaca), FastEagle consistently outperforms EAGLE-3 in speedup under both greedy and stochastic decoding, with comparable average acceptance lengths. These results indicate that removing sequential dependencies in drafting is a practical path toward lossless LLM inference acceleration.
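
To make the layer-cascade idea concrete, here is a minimal non-autoregressive drafter in which each lightweight stage emits one draft position, so a draft of length `draft_len` costs a single forward pass. The architecture, sizes, and names are illustrative assumptions, not FastEagle's actual design.

```python
import torch
import torch.nn as nn

class CascadedDrafter(nn.Module):
    """Illustrative layer-cascade drafter: depth replaces sequential time steps."""
    def __init__(self, d_model=512, vocab=32000, draft_len=4):
        super().__init__()
        self.stages = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
             for _ in range(draft_len)])
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab) for _ in range(draft_len)])

    def forward(self, h):                    # h: (B, T, d) target-model hidden states
        logits = []
        for stage, head in zip(self.stages, self.heads):
            h = stage(h)                     # one cascade stage per draft position
            logits.append(head(h[:, -1]))    # layer-wise head -> per-stage supervision
        return torch.stack(logits, dim=1)    # (B, draft_len, vocab) in one pass
```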

[360] mloz: A Highly Efficient Machine Learning-Based Ozone Parameterization for Climate Sensitivity Simulations

Yiling Ma, Nathan Luke Abraham, Stefan Versick, Roland Ruhnke, Andrea Schneidereit, Ulrike Niemeier, Felix Back, Peter Braesicke, Peer Nowack

Main category: cs.LG

TL;DR: A machine learning parameterization (mloz) is introduced to model ozone interactively in climate models, providing high-fidelity ozone predictions 31x faster than traditional chemistry schemes and enabling transferability between different climate models.

DetailsMotivation: Most CMIP climate models lack interactive ozone representation due to high computational costs of atmospheric chemistry schemes, which is problematic since ozone is a crucial greenhouse gas and solar radiation absorber that significantly modulates climate feedback processes.

Method: Developed a machine learning parameterization that uses only atmospheric temperature profile information as input to model daily ozone variability and trends across troposphere and stratosphere, including interactions with Quasi-Biennial Oscillation.

Result: mloz produces stable ozone predictions around 31 times faster than UKESM’s chemistry scheme, adding less than 4% to total runtime, and successfully transfers between UKESM and ICON models without requiring chemistry schemes.

Conclusion: The parameterization enables widespread adoption of interactive ozone modeling in CMIP-level climate models for future climate change assessments, particularly valuable for climate sensitivity simulations where ozone trends significantly affect atmospheric feedback processes.

Abstract: Atmospheric ozone is a crucial absorber of solar radiation and an important greenhouse gas. However, most climate models participating in the Coupled Model Intercomparison Project (CMIP) still lack an interactive representation of ozone due to the high computational costs of atmospheric chemistry schemes. Here, we introduce a machine learning parameterization (mloz) to interactively model daily ozone variability and trends across the troposphere and stratosphere in standard climate sensitivity simulations, including two-way interactions of ozone with the Quasi-Biennial Oscillation. We demonstrate its high fidelity on decadal timescales and its flexible use online across two different climate models – the UK Earth System Model (UKESM) and the German ICOsahedral Nonhydrostatic (ICON) model. With atmospheric temperature profile information as the only input, mloz produces stable ozone predictions around 31 times faster than the chemistry scheme in UKESM, contributing less than 4 percent of the respective total climate model runtimes. In particular, we also demonstrate its transferability to different climate models without chemistry schemes by transferring the parameterization from UKESM to ICON. This highlights the potential for widespread adoption in CMIP-level climate models that lack interactive chemistry for future climate change assessments, particularly when focusing on climate sensitivity simulations, where ozone trends and variability are known to significantly modulate atmospheric feedback processes.
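
Since the parameterization's only input is the temperature profile, its core can be pictured as a column-wise regression network. The sketch below is a stand-in with made-up layer sizes and level count, not the published architecture.

```python
import torch
import torch.nn as nn

n_levels = 85  # hypothetical number of model levels
mloz_like = nn.Sequential(
    nn.Linear(n_levels, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, n_levels),   # predicted ozone value per model level
)

# one forward pass per atmospheric column: temperature profile in, ozone profile out
ozone = mloz_like(torch.randn(8, n_levels))
```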

[361] Bridging Privacy and Utility: Synthesizing anonymized EEG with constraining utility functions

Kay Fuhrmeister, Arne Pelzer, Fabian Radke, Julia Lechinger, Mahzad Gharleghi, Thomas Köllmer, Insa Wolf

Main category: cs.LG

TL;DR: A transformer-based autoencoder is proposed to anonymize EEG data by preventing subject re-identification while maintaining utility for machine learning tasks like sleep staging.

DetailsMotivation: The increasing availability of EEG consumer devices raises privacy concerns, as EEG data can be used for re-identification and leakage of personal information. There's a need to safeguard this sensitive data while retaining its utility for EEG applications.

Method: The authors propose a transformer-based autoencoder to create anonymized EEG data that prevents subject re-identification. The approach is evaluated on automatic sleep staging tasks, comparing re-identification and utility potential before and after anonymization.

Result: The results show that the re-identifiability of EEG signals can be substantially reduced through the proposed anonymization method while preserving the data’s utility for machine learning applications.

Conclusion: The transformer-based autoencoder approach effectively addresses EEG privacy concerns by successfully anonymizing data to prevent subject re-identification while maintaining practical utility for specific machine learning tasks like sleep staging.

Abstract: Electroencephalography (EEG) is widely used for recording brain activity and has seen numerous applications in machine learning, such as detecting sleep stages and neurological disorders. Several studies have successfully shown the potential of EEG data for re-identification and leakage of other personal information. Therefore, the increasing availability of EEG consumer devices raises concerns about user privacy, motivating us to investigate how to safeguard this sensitive data while retaining its utility for EEG applications. To address this challenge, we propose a transformer-based autoencoder to create EEG data that does not allow for subject re-identification while still retaining its utility for specific machine learning tasks. We apply our approach to automatic sleep staging by evaluating the re-identification and utility potential of EEG data before and after anonymization. The results show that the re-identifiability of the EEG signal can be substantially reduced while preserving its utility for machine learning.
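
One way to express the privacy-utility trade-off is a three-term objective: reconstruct the signal, retain task information (e.g., sleep stage), and drive an identity classifier toward maximal uncertainty. This is a hedged sketch of that idea with hypothetical names, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def anonymization_loss(x, x_rec, stage_logits, stage_labels, id_logits, lam=1.0):
    recon = F.mse_loss(x_rec, x)                            # keep the signal plausible
    utility = F.cross_entropy(stage_logits, stage_labels)   # keep sleep-stage info
    id_prob = torch.softmax(id_logits, dim=-1)
    id_entropy = -(id_prob * torch.log(id_prob + 1e-8)).sum(-1).mean()
    return recon + utility - lam * id_entropy  # high identity entropy = harder re-ID
```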

[362] Efficiently Attacking Memorization Scores

Tue Do, Varun Chandrasekaran, Daniel Alabi

Main category: cs.LG

TL;DR: This paper demonstrates that memorization-based influence estimators can be adversarially manipulated through practical attacks, revealing vulnerabilities in data attribution methods.

DetailsMotivation: As influence estimation tools become widely used for understanding model behavior and data valuation, there's a critical need to assess whether these scores can be adversarially manipulated, especially in responsible machine learning applications.

Method: The authors present a systematic study of attacks on memorization-based influence estimators, developing a practical attack that calculates the pseudoinverse of the input. The attack requires only black-box access to model outputs and has modest computational overhead.

Result: Empirical validation across multiple image classification tasks shows that even state-of-the-art influence estimation proxies are vulnerable to targeted score manipulations. Theoretical analysis reveals conditions under which memorization scores are inherently fragile.

Conclusion: The findings highlight critical vulnerabilities in influence-based attribution methods and suggest the urgent need for robust defenses against such adversarial manipulations.

Abstract: Influence estimation tools – such as memorization scores – are widely used to understand model behavior, attribute training data, and inform dataset curation. However, recent applications in data valuation and responsible machine learning raise the question: can these scores themselves be adversarially manipulated? In this work, we present a systematic study of the feasibility of attacking memorization-based influence estimators. We characterize attacks for producing highly memorized samples as highly sensitive queries in the regime where a trained algorithm is accurate. Our attack (calculating the pseudoinverse of the input) is practical, requiring only black-box access to model outputs and incurring modest computational overhead. We empirically validate our attack across a wide suite of image classification tasks, showing that even state-of-the-art proxies are vulnerable to targeted score manipulations. In addition, we provide a theoretical analysis of the stability of memorization scores under adversarial perturbations, revealing conditions under which influence estimates are inherently fragile. Our findings highlight critical vulnerabilities in influence-based attribution and suggest the need for robust defenses. All code can be found at https://anonymous.4open.science/r/MemAttack-5413/
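
Read literally, the pseudoinverse attack finds the least-norm input change whose locally linearized effect produces a desired shift in memorization scores. The sketch below shows that linear-algebra step only, with illustrative names; the paper's full attack pipeline is more involved.

```python
import numpy as np

def pinv_perturbation(J, delta_scores, eps=0.05):
    """J: (n_scores, n_inputs) local linear map from input changes to score changes.
    Returns a small perturbation whose linearized effect is ~ delta_scores."""
    d = np.linalg.pinv(J) @ delta_scores          # least-norm solution of J d = delta
    return eps * d / (np.linalg.norm(d) + 1e-12)  # rescale to an attack budget
```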

[363] Offline Goal-conditioned Reinforcement Learning with Quasimetric Representations

Vivek Myers, Bill Chunyuan Zheng, Benjamin Eysenbach, Sergey Levine

Main category: cs.LG

TL;DR: A unified approach combining contrastive representations and temporal distances for goal-conditioned RL that learns optimal goal-reaching distances even with suboptimal data in stochastic environments.

DetailsMotivation: To bridge the gap between two effective GCRL frameworks - contrastive representations and temporal distances - by combining their strengths while overcoming their individual limitations.

Method: Proposes a unified framework that uses quasimetric representation space structure with additional constraints to learn successor representations enabling optimal goal-reaching, exploiting quasimetric distance parameterization.

Result: Improves performance on stitching tasks where contrastive learning methods struggle, and on noisy, high-dimensional environments where quasimetric network methods struggle.

Conclusion: The approach achieves the best of both worlds: stability and long-horizon capabilities of Monte Carlo contrastive RL methods, plus free stitching capabilities of quasimetric network parameterizations.

Abstract: Approaches for goal-conditioned reinforcement learning (GCRL) often use learned state representations to extract goal-reaching policies. Two frameworks for representation structure have yielded particularly effective GCRL algorithms: (1) contrastive representations, in which methods learn “successor features” with a contrastive objective that performs inference over future outcomes, and (2) temporal distances, which link the (quasimetric) distance in representation space to the transit time from states to goals. We propose an approach that unifies these two frameworks, using the structure of a quasimetric representation space (triangle inequality) with the right additional constraints to learn successor representations that enable optimal goal-reaching. Unlike past work, our approach is able to exploit a quasimetric distance parameterization to learn optimal goal-reaching distances, even with suboptimal data and in stochastic environments. This gives us the best of both worlds: we retain the stability and long-horizon capabilities of Monte Carlo contrastive RL methods, while getting the free stitching capabilities of quasimetric network parameterizations. On existing offline GCRL benchmarks, our representation learning objective improves performance on stitching tasks where methods based on contrastive learning struggle, and on noisy, high-dimensional environments where methods based on quasimetric networks struggle.
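
The key structural ingredient is a distance that is asymmetric yet still obeys the triangle inequality. Below is a minimal example of such a quasimetric head over learned representations, not the paper's exact parameterization.

```python
import torch

def quasimetric(phi_s, phi_g):
    """d(s, g) = sum(relu(phi_g - phi_s)): asymmetric, and the triangle inequality
    holds because relu(a + b) <= relu(a) + relu(b) componentwise."""
    return torch.relu(phi_g - phi_s).sum(-1)
```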

[364] AbideGym: Turning Static RL Worlds into Adaptive Challenges

Abi Aryan, Zac Liu, Aaron Childress

Main category: cs.LG

TL;DR: AbideGym is a dynamic MiniGrid wrapper that introduces agent-aware perturbations and scalable complexity to enforce intra-episode adaptation, addressing the brittleness of RL policies in shifting dynamics.

DetailsMotivation: Agents trained with reinforcement learning often develop brittle policies that fail when dynamics shift, a problem amplified by static benchmarks.

Method: AbideGym introduces agent-aware perturbations and scalable complexity to enforce intra-episode adaptation, providing a modular, reproducible evaluation framework.

Result: The framework exposes weaknesses in static policies and promotes resilience.

Conclusion: AbideGym advances research in curriculum learning, continual learning, and robust generalization by providing a dynamic evaluation environment.

Abstract: Agents trained with reinforcement learning often develop brittle policies that fail when dynamics shift, a problem amplified by static benchmarks. AbideGym, a dynamic MiniGrid wrapper, introduces agent-aware perturbations and scalable complexity to enforce intra-episode adaptation. By exposing weaknesses in static policies and promoting resilience, AbideGym provides a modular, reproducible evaluation framework for advancing research in curriculum learning, continual learning, and robust generalization.
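
In wrapper form, the idea is simply to mutate the environment partway through an episode so a memorized plan stops working. A hedged sketch follows; the perturbation call is illustrative, and AbideGym's actual perturbations are richer and agent-aware.

```python
import gymnasium as gym

class MidEpisodePerturbation(gym.Wrapper):
    """Illustrative wrapper: perturb the world mid-episode to force adaptation."""
    def __init__(self, env, trigger_step=50):
        super().__init__(env)
        self.trigger_step, self.t = trigger_step, 0

    def reset(self, **kwargs):
        self.t = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        self.t += 1
        if self.t == self.trigger_step:
            # illustrative perturbation, e.g. re-placing the agent in MiniGrid
            self.env.unwrapped.place_agent()
        return self.env.step(action)
```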

[365] CoSupFormer : A Contrastive Supervised learning approach for EEG signal Classification

D. Darankoum, C. Habermacher, J. Volle, S. Grudinin

Main category: cs.LG

TL;DR: A novel end-to-end deep learning framework for EEG signal analysis that captures multi-scale frequency oscillations, models channel dependencies with attention mechanisms, filters noisy channels dynamically, and uses supervised+contrastive learning for robust generalization across CNS disorders.

DetailsMotivation: EEG signals contain rich multi-scale information crucial for brain state understanding and medical applications, but extracting meaningful features while handling noise and channel variability remains challenging.

Method: Proposed framework includes: multi-scale frequency encoder, attention-based encoder for channel and patch interactions, gating network for dynamic channel filtering, and novel loss combining supervised and contrastive learning.

Result: Validated across multiple CNS disorder applications including Parkinson’s and Alzheimer’s disease diagnosis, demonstrating ability to extract biologically meaningful patterns, autonomously select high-quality channels, and achieve robust generalization.

Conclusion: The proposed learning paradigm effectively addresses EEG analysis challenges through innovative architectural and loss design, showing strong performance across different species and medical applications.

Abstract: Electroencephalography signals (EEGs) contain rich multi-scale information crucial for understanding brain states, with potential applications in diagnosing and advancing the drug development landscape. However, extracting meaningful features from raw EEG signals while handling noise and channel variability remains a major challenge. This work proposes a novel end-to-end deep-learning framework that addresses these issues through several key innovations. First, we designed an encoder capable of explicitly capturing multi-scale frequency oscillations covering a wide range of features for different EEG-related tasks. Secondly, to model complex dependencies and handle the high temporal resolution of EEGs, we introduced an attention-based encoder that simultaneously learns interactions across EEG channels and within localized patches of individual channels. We integrated a dedicated gating network on top of the attention encoder to dynamically filter out noisy and non-informative channels, enhancing the reliability of EEG data. The entire encoding process is guided by a novel loss function, which leverages supervised and contrastive learning, significantly improving model generalization. We validated our approach in multiple applications, ranging from the classification of treatment effects across multiple Central Nervous System (CNS) disorders to the diagnosis of Parkinson’s and Alzheimer’s disease. Our results demonstrate that the proposed learning paradigm can extract biologically meaningful patterns from raw EEG signals across different species, autonomously select high-quality channels, and achieve robust generalization through innovative architectural and loss design.
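
The combined objective can be pictured as a weighted sum of a cross-entropy term and a supervised-contrastive term over embeddings. The sketch below uses a generic SupCon-style formulation as a stand-in; the weights, temperature, and exact form in the paper may differ.

```python
import torch
import torch.nn.functional as F

def cosup_loss(z, logits, labels, tau=0.1, alpha=0.5):
    """Supervised CE plus a SupCon-style contrastive term over embeddings z
    (a generic stand-in for the paper's combined loss; alpha, tau illustrative)."""
    ce = F.cross_entropy(logits, labels)
    z = F.normalize(z, dim=-1)
    sim = z @ z.T / tau
    eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~eye
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float('-inf')), 1, keepdim=True)
    contrastive = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return alpha * ce + (1 - alpha) * contrastive.mean()
```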

[366] Beyond Visual Similarity: Rule-Guided Multimodal Clustering with explicit domain rules

Kishor Datta Gupta, Mohd Ariful Haque, Marufa Kamal, Ahmed Rafi Hasan, Md. Mahfuzur Rahman, Roy George

Main category: cs.LG

TL;DR: DARTVAE is a rule-guided multimodal clustering framework that incorporates domain-specific constraints into representation learning using variational autoencoders with rule enforcement through loss penalties.

DetailsMotivation: Traditional clustering methods rely only on input data similarity, failing to capture structural or semantic constraints critical in many domains. There's a need for clustering that incorporates domain knowledge directly into the learning process.

Method: Extends VAE architecture by embedding explicit rules, semantic representations, and data-driven features into a unified latent space. Rules are generated by LLMs, structured into knowledge graphs, and enforced through a loss function combining reconstruction, KL divergence, consistency, and violation penalties.

Result: Experiments on aircraft and automotive datasets show rule-guided clustering produces more operationally meaningful and interpretable clusters (e.g., isolating UAVs, unifying stealth aircraft, separating SUVs from sedans) while improving traditional clustering metrics.

Conclusion: DARTVAE achieves more meaningful clustering than purely data-driven models by combining rule encodings with learned representations, though challenges remain with LLM rule hallucination, rule conflicts, overfitting risks, and computational scaling in complex domains.

Abstract: Traditional clustering techniques often rely solely on similarity in the input data, limiting their ability to capture structural or semantic constraints that are critical in many domains. We introduce the Domain-Aware Rule-Triggered Variational Autoencoder (DARTVAE), a rule-guided multimodal clustering framework that incorporates domain-specific constraints directly into the representation learning process. DARTVAE extends the VAE architecture by embedding explicit rules, semantic representations, and data-driven features into a unified latent space, while enforcing constraint compliance through rule consistency and violation penalties in the loss function. Unlike conventional clustering methods that rely only on visual similarity or apply rules as post hoc filters, DARTVAE treats rules as first-class learning signals. The rules are generated by LLMs, structured into knowledge graphs, and enforced through a loss function combining reconstruction, KL divergence, consistency, and violation penalties. Experiments on aircraft and automotive datasets demonstrate that rule-guided clustering produces more operationally meaningful and interpretable clusters (for example, isolating UAVs, unifying stealth aircraft, or separating SUVs from sedans) while improving traditional clustering metrics. However, the framework faces challenges: LLM-generated rules may hallucinate or conflict, excessive rules risk overfitting, and scaling to complex domains increases computational and consistency difficulties. By combining rule encodings with learned representations, DARTVAE achieves more meaningful and consistent clustering outcomes than purely data-driven models, highlighting the utility of constraint-guided multimodal clustering for complex, knowledge-intensive settings.
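
The four-term objective named in the abstract can be sketched directly; the weights and the interface for the rule penalties are hypothetical, with the penalties assumed to be differentiable scores produced upstream from the LLM-generated rules.

```python
import torch
import torch.nn.functional as F

def dartvae_loss(x, x_hat, mu, logvar, consistency, violation, lam_c=1.0, lam_v=1.0):
    """Reconstruction + KL + rule-consistency + rule-violation penalty (weights
    hypothetical); `consistency` and `violation` are differentiable rule scores."""
    recon = F.mse_loss(x_hat, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl + lam_c * consistency + lam_v * violation
```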

[367] Investigating Modality Contribution in Audio LLMs for Music

Giovana Morais, Magdalena Fuentes

Main category: cs.LG

TL;DR: This paper investigates whether Audio LLMs truly listen to audio or rely on textual reasoning by quantifying modality contributions using MM-SHAP framework.

DetailsMotivation: To determine whether Audio Large Language Models genuinely process audio content or primarily rely on textual reasoning, an ambiguity raised by recent benchmarks.

Method: Adapted the MM-SHAP framework (a performance-agnostic score based on Shapley values) to quantify the relative contribution of each modality. Evaluated two models on the MuChoMusic benchmark.

Result: Higher-accuracy models rely more on text, but models can successfully localize key sound events even with low overall audio contribution, suggesting audio is not entirely ignored.

Conclusion: First application of MM-SHAP to Audio LLMs, serving as foundational step for future research in explainable AI and audio understanding.

Abstract: Audio Large Language Models (Audio LLMs) enable human-like conversation about music, yet it is unclear if they are truly listening to the audio or just using textual reasoning, as recent benchmarks suggest. This paper investigates this issue by quantifying the contribution of each modality to a model’s output. We adapt the MM-SHAP framework, a performance-agnostic score based on Shapley values that quantifies the relative contribution of each modality to a model’s prediction. We evaluate two models on the MuChoMusic benchmark and find that the model with higher accuracy relies more on text to answer questions, but further inspection shows that even if the overall audio contribution is low, models can successfully localize key sound events, suggesting that audio is not entirely ignored. Our study is the first application of MM-SHAP to Audio LLMs and we hope it will serve as a foundational step for future research in explainable AI and audio.
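
With only two modalities, exact Shapley values need just four model evaluations. MM-SHAP itself operates at token level, so the sketch below is the coarsest possible illustration of the idea, with a hypothetical scoring function `f`.

```python
def modality_shapley(f, audio, text):
    """f scores the model with the given modalities present (None = masked out).
    Exact Shapley values for the two-player game {audio, text}."""
    base = f(None, None)
    a, t, at = f(audio, None), f(None, text), f(audio, text)
    phi_audio = 0.5 * ((a - base) + (at - t))
    phi_text = 0.5 * ((t - base) + (at - a))
    total = abs(phi_audio) + abs(phi_text) + 1e-12
    return phi_audio / total, phi_text / total   # relative modality contributions
```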

[368] Myosotis: structured computation for attention like layer

Evgenii Egorov, Hanno Ackermann, Markus Nagel, Hong Cai

Main category: cs.LG

TL;DR: A novel algorithm combining attention and SSM advantages using efficient inversion of tree-structured matrices to overcome quadratic scaling in sequence length.

DetailsMotivation: Address the quadratic memory and compute scaling of attention layers by combining the benefits of sparsity (ignoring pairwise interactions) and recurrent dependence (like SSM) while avoiding their individual disadvantages.

Method: Proposes an algorithm based on efficient inversion of tree-structured matrices to combine attention mechanisms with state space model (SSM) approaches.

Result: The method aims to achieve efficient sequence-to-sequence mapping without the quadratic scaling limitations of standard attention layers.

Conclusion: This approach provides a promising solution to the computational challenges of attention mechanisms by leveraging tree-structured matrix inversion to combine the strengths of both sparsity and recurrent dependence methods.

Abstract: Attention layers apply a sequence-to-sequence mapping whose parameters depend on the pairwise interactions of the input elements. However, without any structural assumptions, memory and compute scale quadratically with the sequence length. The two main ways to mitigate this are to introduce sparsity by ignoring a sufficient amount of pairwise interactions or to introduce recurrent dependence along them, as SSMs do. Although both approaches are reasonable, each has disadvantages. We propose a novel algorithm that combines the advantages of both concepts. Our idea is based on the efficient inversion of tree-structured matrices.

[369] Auto-Regressive U-Net for Full-Field Prediction of Shrinkage-Induced Damage in Concrete

Liya Gaynutdinova, Petr Havlásek, Ondřej Rokoš, Fleur Hendriks, Martin Doškář

Main category: cs.LG

TL;DR: A deep learning approach using auto-regressive U-Net and CNN for predicting time-dependent full-field damage in concrete, enabling efficient assessment of damage progression and mechanical property forecasting.

DetailsMotivation: To reduce the computational load of traditional full-field damage evaluations and understand the relationship between aggregate properties and concrete performance for optimizing mix designs.

Method: Dual-network architecture: auto-regressive U-Net for predicting damage field evolution, and CNN for forecasting mechanical properties (shrinkage and residual stiffness) from damage estimates.

Result: The approach demonstrates high computational efficiency and robust predictive performance on synthesized datasets, effectively reducing computational load while maintaining accuracy.

Conclusion: This method can help optimize concrete mix designs by providing insights into aggregate property effects, leading to improved durability and reduced internal damage.

Abstract: This paper introduces a deep learning approach for predicting time-dependent full-field damage in concrete. The study uses an auto-regressive U-Net model to predict the evolution of the scalar damage field in a unit cell given microstructural geometry and evolution of an imposed shrinkage profile. By sequentially using the predicted damage output as input for subsequent predictions, the model facilitates the continuous assessment of damage progression. Complementarily, a convolutional neural network (CNN) utilises the damage estimations to forecast key mechanical properties, including observed shrinkage and residual stiffness. The proposed dual-network architecture demonstrates high computational efficiency and robust predictive performance on the synthesised datasets. The approach reduces the computational load traditionally associated with full-field damage evaluations and is used to gain insights into the relationship between aggregate properties, such as shape, size, and distribution, and the effective shrinkage and reduction in stiffness. Ultimately, this can help to optimize concrete mix designs, leading to improved durability and reduced internal damage.
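
The auto-regressive inference loop described above is simple to state: each predicted damage field is fed back as the input for the next step. The signature below is an illustrative assumption.

```python
def rollout(unet, damage, shrinkage_profiles):
    """Auto-regressive inference (signature illustrative)."""
    history = []
    for shrinkage in shrinkage_profiles:
        damage = unet(damage, shrinkage)   # predict the next damage field
        history.append(damage)
    return history                         # continuous damage progression
```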

[370] Complexity-Driven Policy Optimization

Luca Serfilippi, Giorgio Franceschelli, Antonio Corradi, Mirco Musolesi

Main category: cs.LG

TL;DR: The paper proposes replacing entropy regularization with a complexity bonus in policy gradient methods, introducing CDPO as a more robust alternative to PPO that encourages structured yet adaptable exploration strategies.

DetailsMotivation: Standard entropy maximization pushes policies toward uniform random distributions, which can be inefficient for exploration. The authors aim to develop a more structured exploration strategy that balances stochasticity with useful behavioral patterns.

Method: The authors introduce Complexity-Driven Policy Optimization (CDPO), which replaces entropy regularization with a complexity measure defined as the product of Shannon entropy and disequilibrium (distance from uniform distribution). This regularizer suppresses both complete randomness and complete order.

Result: Empirical evaluation across discrete action space tasks shows CDPO is more robust to hyperparameter choices than PPO, particularly in environments requiring greater exploration.

Conclusion: Complexity regularization provides a more effective exploration strategy than entropy maximization, enabling agents to discover structured yet adaptable behaviors while being more robust to hyperparameter tuning.

Abstract: Policy gradient methods often balance exploitation and exploration via entropy maximization. However, maximizing entropy pushes the policy towards a uniform random distribution, which represents an unstructured and sometimes inefficient exploration strategy. In this work, we propose replacing the entropy bonus with a more robust complexity bonus. In particular, we adopt a measure of complexity, defined as the product of Shannon entropy and disequilibrium, where the latter quantifies the distance from the uniform distribution. This regularizer encourages policies that balance stochasticity (high entropy) with structure (high disequilibrium), guiding agents toward regimes where useful, non-trivial behaviors can emerge. Such behaviors arise because the regularizer suppresses both extremes, e.g., maximal disorder and complete order, creating pressure for agents to discover structured yet adaptable strategies. Starting from Proximal Policy Optimization (PPO), we introduce Complexity-Driven Policy Optimization (CDPO), a new learning algorithm that replaces entropy with complexity. We show empirically across a range of discrete action space tasks that CDPO is more robust to the choice of the complexity coefficient than PPO is with the entropy coefficient, especially in environments requiring greater exploration.
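
The regularizer itself is one line. Below is a sketch assuming disequilibrium is the squared Euclidean distance from the uniform distribution (a common choice; the paper's exact normalization may differ).

```python
import torch

def complexity_bonus(probs, eps=1e-8):
    """Complexity = Shannon entropy x disequilibrium. It vanishes at both extremes:
    a deterministic policy (H = 0) and a uniform one (D = 0). Intended as a
    drop-in replacement for the entropy bonus in a PPO-style objective."""
    n = probs.shape[-1]
    H = -(probs * (probs + eps).log()).sum(-1)      # Shannon entropy
    D = ((probs - 1.0 / n) ** 2).sum(-1)            # distance from uniform
    return H * D
```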

[371] A Recovery Theory for Diffusion Priors: Deterministic Analysis of the Implicit Prior Algorithm

Oscar Leong, Yann Traonmilin

Main category: cs.LG

TL;DR: This paper develops a theoretical framework for analyzing deterministic diffusion-based algorithms for inverse problems, showing they can be interpreted as generalized projected gradient descent methods with convergence guarantees under certain conditions.

DetailsMotivation: To provide rigorous theoretical guarantees for diffusion-based algorithms in inverse problems, as current methods show empirical success but lack theoretical foundations.

Method: Develops a theoretical framework interpreting deterministic diffusion algorithms as generalized projected gradient descent with time-varying projections, analyzing convergence under restricted isometry properties.

Result: Quantitative convergence rates are derived that explicitly depend on noise schedules, with applications to low-dimensional compact convex sets and low-rank Gaussian mixture models.

Conclusion: The framework provides theoretical justification for diffusion-based inverse problem solvers, establishing global convergence guarantees even for nonconvex model sets like Gaussian mixture models.

Abstract: Recovering high-dimensional signals from corrupted measurements is a central challenge in inverse problems. Recent advances in generative diffusion models have shown remarkable empirical success in providing strong data-driven priors, but rigorous recovery guarantees remain limited. In this work, we develop a theoretical framework for analyzing deterministic diffusion-based algorithms for inverse problems, focusing on a deterministic version of the algorithm proposed by Kadkhodaie & Simoncelli (2021). First, we show that when the underlying data distribution concentrates on a low-dimensional model set, the associated noise-convolved scores can be interpreted as time-varying projections onto such a set. This leads to interpreting previous algorithms using diffusion priors for inverse problems as generalized projected gradient descent methods with varying projections. When the sensing matrix satisfies a restricted isometry property over the model set, we can derive quantitative convergence rates that depend explicitly on the noise schedule. We apply our framework to two instructive data distributions: uniform distributions over low-dimensional compact, convex sets and low-rank Gaussian mixture models. In the latter setting, we can establish global convergence guarantees despite the nonconvexity of the underlying model set.
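
The generalized projected gradient descent reading of the algorithm amounts to alternating a data-fidelity gradient step with a "projection" supplied by the denoiser at a decreasing noise level. The interfaces below are illustrative.

```python
import numpy as np

def implicit_prior_pgd(y, A, denoiser, sigmas, step=1.0):
    """Sketch: A is the sensing matrix, denoiser(x, sigma) acts as a time-varying
    projection onto the model set at noise level sigma (interfaces illustrative)."""
    x = A.T @ y
    for sigma in sigmas:                    # decreasing noise schedule
        x = x - step * A.T @ (A @ x - y)    # gradient step on ||Ax - y||^2 / 2
        x = denoiser(x, sigma)              # time-varying projection
    return x
```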

[372] MDBench: Benchmarking Data-Driven Methods for Model Discovery

Amirmohammad Ziaei Bideh, Aleksandra Georgievska, Jonathan Gryak

Main category: cs.LG

TL;DR: MDBench is an open-source benchmarking framework for evaluating model discovery methods on dynamical systems, testing 12 algorithms on 14 PDEs and 63 ODEs under varying noise levels.

DetailsMotivation: There's a lack of comprehensive benchmarks for discovering dynamical models, as prior efforts focused mostly on identifying single equations through symbolic regression.

Method: Developed MDBench framework that assesses algorithms using metrics including derivative prediction accuracy, model complexity, and equation fidelity on diverse PDE and ODE systems.

Result: Linear methods and genetic programming achieve lowest prediction error for PDEs and ODEs respectively, with linear models being more robust against noise. Key limitations in current methods were revealed.

Conclusion: MDBench accelerates advancement of model discovery methods by providing rigorous, extensible benchmarking framework and diverse datasets for systematic evaluation and improvement.

Abstract: Model discovery aims to uncover governing differential equations of dynamical systems directly from experimental data. Benchmarking such methods is essential for tracking progress and understanding trade-offs in the field. While prior efforts have focused mostly on identifying single equations, typically framed as symbolic regression, there remains a lack of comprehensive benchmarks for discovering dynamical models. To address this, we introduce MDBench, an open-source benchmarking framework for evaluating model discovery methods on dynamical systems. MDBench assesses 12 algorithms on 14 partial differential equations (PDEs) and 63 ordinary differential equations (ODEs) under varying levels of noise. Evaluation metrics include derivative prediction accuracy, model complexity, and equation fidelity. We also introduce seven challenging PDE systems from fluid dynamics and thermodynamics, revealing key limitations in current methods. Our findings illustrate that linear methods and genetic programming methods achieve the lowest prediction error for PDEs and ODEs, respectively. Moreover, linear models are in general more robust against noise. MDBench accelerates the advancement of model discovery methods by offering a rigorous, extensible benchmarking framework and a rich, diverse collection of dynamical system datasets, enabling systematic evaluation, comparison, and improvement of equation accuracy and robustness.

[373] Understanding and Improving Adversarial Robustness of Neural Probabilistic Circuits

Weixin Chen, Han Zhao

Main category: cs.LG

TL;DR: RNPC is a robust neural probabilistic circuit that improves adversarial robustness by introducing class-wise integration, addressing vulnerabilities in traditional NPCs where adversarial attacks can manipulate attribute predictions.

DetailsMotivation: Neural Probabilistic Circuits (NPCs) offer interpretable predictions but their neural-network-based attribute recognition model is vulnerable to adversarial attacks through subtle perturbations, compromising final predictions.

Method: Proposes RNPC with novel class-wise integration for inference, ensuring robust combination of outputs from attribute recognition and probabilistic circuit modules. Theoretically analyzes adversarial robustness.

Result: Theoretical analysis shows RNPC has provably improved adversarial robustness compared to NPC. Empirical results on image classification demonstrate superior robustness while maintaining high accuracy on benign inputs.

Conclusion: RNPC successfully addresses the adversarial vulnerability in NPCs by introducing robust integration methods, achieving both interpretability and enhanced security against attacks.

Abstract: Neural Probabilistic Circuits (NPCs), a new class of concept bottleneck models, comprise an attribute recognition model and a probabilistic circuit for reasoning. By integrating the outputs from these two modules, NPCs produce compositional and interpretable predictions. While offering enhanced interpretability and high performance on downstream tasks, the neural-network-based attribute recognition model remains a black box. This vulnerability allows adversarial attacks to manipulate attribute predictions by introducing carefully crafted subtle perturbations to input images, potentially compromising the final predictions. In this paper, we theoretically analyze the adversarial robustness of NPC and demonstrate that it only depends on the robustness of the attribute recognition model and is independent of the robustness of the probabilistic circuit. Moreover, we propose RNPC, the first robust neural probabilistic circuit against adversarial attacks on the recognition module. RNPC introduces a novel class-wise integration for inference, ensuring a robust combination of outputs from the two modules. Our theoretical analysis demonstrates that RNPC exhibits provably improved adversarial robustness compared to NPC. Empirical results on image classification tasks show that RNPC achieves superior adversarial robustness compared to existing concept bottleneck models while maintaining high accuracy on benign inputs.

[374] Generalizable Diabetes Risk Stratification via Hybrid Machine Learning Models

Athar Parvez, Muhammad Jawad Mufti

Main category: cs.LG

TL;DR: This paper compares two hybrid machine learning classifiers (XGBoost-Random Forest and SVM-Logistic Regression) for diabetes risk stratification, demonstrating XGB-RF’s superior performance and generalizability across internal and external validation.

DetailsMotivation: With diabetes affecting over 537 million people worldwide and projected to reach 783 million by 2045, there's a need for early risk stratification using machine learning to improve healthcare outcomes.

Method: Built two hybrid classifiers using a leakage-safe standardized pipeline with encoding, imputation, scaling, SMOTE on training folds only, and probability calibration. Evaluated using threshold-independent discrimination metrics (AUROC/AUPRC) and calibration metrics on both primary dataset and external PIMA cohort.

Result: XGB-RF significantly outperformed SVM-LR on both internal (AUROC ~0.995 vs 0.978; AUPRC ~0.998 vs 0.947) and external validation (AUROC ~0.990 vs 0.963; AUPRC ~0.959 vs 0.875), with better thresholded metrics at tau=0.5.

Conclusion: XGB-RF consistently dominated SVM-LR with smaller external performance attenuation and acceptable calibration, supporting gradient-boosting-based hybridization as a robust, transferable approach for diabetes risk stratification.

Abstract: Background/Purpose: Diabetes affects over 537 million people worldwide and is projected to reach 783 million by 2045. Early risk stratification can benefit from machine learning. We compare two hybrid classifiers and assess their generalizability on an external cohort. Methods: Two hybrids were built: (i) XGBoost + Random Forest (XGB-RF) and (ii) Support Vector Machine + Logistic Regression (SVM-LR). A leakage-safe, standardized pipeline (encoding, imputation, min-max scaling; SMOTE on training folds only; probability calibration for SVM) was fit on the primary dataset and frozen. Evaluation prioritized threshold-independent discrimination (AUROC/AUPRC) and calibration (Brier, slope/intercept). External validation used the PIMA cohort (N=768) with the frozen pipeline; any thresholded metrics on PIMA were computed at the default rule tau = 0.5. Results: On the primary dataset (PR baseline = 0.50), XGB-RF achieved AUROC ~0.995 and AUPRC ~0.998, outperforming SVM-LR (AUROC ~0.978; AUPRC ~0.947). On PIMA (PR baseline ~0.349), XGB-RF retained strong performance (AUROC ~0.990; AUPRC ~0.959); SVM-LR was lower (AUROC ~0.963; AUPRC ~0.875). Thresholded metrics on PIMA at tau = 0.5 were XGB-RF (Accuracy 0.960; Precision 0.941; Recall 0.944; F1 0.942) and SVM-LR (Accuracy 0.900; Precision 0.855; Recall 0.858; F1 0.857). Conclusions: Across internal and external cohorts, XGB-RF consistently dominated SVM-LR and exhibited smaller external attenuation on ROC/PR with acceptable calibration. These results support gradient-boosting-based hybridization as a robust, transferable approach for diabetes risk stratification and motivate prospective, multi-site validation with deployment-time threshold selection based on clinical trade-offs.
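
The leakage-safe property comes from putting SMOTE inside the cross-validation pipeline, so resampling is refit on each training fold and never touches validation data. The paper does not spell out how XGBoost and Random Forest are coupled; the soft-voting ensemble below is one plausible reading, and all hyperparameters are illustrative.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier

# SMOTE inside the imblearn Pipeline -> applied to training folds only
hybrid = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
    ("smote", SMOTE(random_state=0)),
    ("clf", VotingClassifier(
        estimators=[("xgb", XGBClassifier(eval_metric="logloss")),
                    ("rf", RandomForestClassifier(n_estimators=300))],
        voting="soft")),
])
# evaluate with, e.g., cross_val_score(hybrid, X, y, scoring="roc_auc", cv=5)
```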

[375] PIRF: Physics-Informed Reward Fine-Tuning for Diffusion Models

Mingze Yuan, Pengfei Jin, Na Li, Quanzheng Li

Main category: cs.LG

TL;DR: PIRF introduces physics-informed reward fine-tuning as a sparse reward optimization approach to improve physical constraint adherence in diffusion models, overcoming limitations of DPS-style approximations through trajectory-level rewards and efficient backpropagation.

DetailsMotivation: Diffusion models often violate physical laws despite strong generative capabilities. Current approaches rely on error-prone DPS-style value function approximations that cause training instability and inference inefficiency.

Method: Physics-Informed Reward Fine-tuning (PIRF) computes trajectory-level rewards and backpropagates gradients directly, using layer-wise truncated backpropagation and weight-based regularization to improve efficiency and maintain data fidelity.

Result: PIRF consistently achieves superior physical enforcement across five PDE benchmarks under efficient sampling regimes, demonstrating improved physical constraint adherence compared to prior methods.

Conclusion: Reward fine-tuning shows significant potential for advancing scientific generative modeling by effectively incorporating physical constraints while maintaining sampling efficiency and data quality.

Abstract: Diffusion models have demonstrated strong generative capabilities across scientific domains, but often produce outputs that violate physical laws. We propose a new perspective by framing physics-informed generation as a sparse reward optimization problem, where adherence to physical constraints is treated as a reward signal. This formulation unifies prior approaches under a reward-based paradigm and reveals a shared bottleneck: reliance on diffusion posterior sampling (DPS)-style value function approximations, which introduce non-negligible errors and lead to training instability and inference inefficiency. To overcome this, we introduce Physics-Informed Reward Fine-tuning (PIRF), a method that bypasses value approximation by computing trajectory-level rewards and backpropagating their gradients directly. However, a naive implementation suffers from low sample efficiency and compromised data fidelity. PIRF mitigates these issues through two key strategies: (1) a layer-wise truncated backpropagation method that leverages the spatiotemporally localized nature of physics-based rewards, and (2) a weight-based regularization scheme that improves efficiency over traditional distillation-based methods. Across five PDE benchmarks, PIRF consistently achieves superior physical enforcement under efficient sampling regimes, highlighting the potential of reward fine-tuning for advancing scientific generative modeling.
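
A rough picture of truncated backpropagation through the sampling trajectory: only the last few denoising steps keep gradients, so a physics reward on the final sample trains the model at modest memory cost. This is a simplified step-wise stand-in for PIRF's layer-wise scheme, and `denoise_step` is an illustrative interface.

```python
import torch

def sample_with_truncated_grad(model, x_T, timesteps, k_grad=2):
    """Only the last k_grad steps carry gradients; reward(x).backward() then
    reaches just those steps (step-wise simplification, names illustrative)."""
    x = x_T
    n = len(timesteps)
    for i, t in enumerate(timesteps):
        x = model.denoise_step(x, t)
        if i < n - k_grad:
            x = x.detach()   # cut the graph for early steps
    return x
```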

[376] The Sensitivity of Variational Bayesian Neural Network Performance to Hyperparameters

Scott Koermer, Natalie Klein

Main category: cs.LG

TL;DR: This paper analyzes the effects of hyperparameter choices on Bayesian Neural Networks (BNNs) through global sensitivity analysis, revealing that hyperparameters interact to affect both predictive accuracy and uncertainty quantification (UQ).

DetailsMotivation: BNNs promise accurate predictive models with uncertainty quantification, but obtaining accurate UQ is difficult due to approximations in training and complex hyperparameter choices that have opaque effects on results.

Method: The authors perform a global sensitivity analysis of BNN performance under varying hyperparameter settings to understand how hyperparameters interact and affect model outcomes.

Result: The analysis shows that many hyperparameters interact with each other to impact both predictive accuracy and UQ, highlighting the complexity of hyperparameter tuning in BNNs.

Conclusion: For improved BNN usage in real-world applications, global sensitivity analysis or related methods like Bayesian optimization should be used for dimensionality reduction and hyperparameter selection to ensure accurate UQ.

Abstract: In scientific applications, predictive modeling is often of limited use without accurate uncertainty quantification (UQ) to indicate when a model may be extrapolating or when more data needs to be collected. Bayesian Neural Networks (BNNs) produce predictive uncertainty by propagating uncertainty in neural network (NN) weights and offer the promise of obtaining not only an accurate predictive model but also accurate UQ. However, in practice, obtaining accurate UQ with BNNs is difficult due in part to the approximations used for practical model training and in part to the need to choose a suitable set of hyperparameters; these hyperparameters outnumber those needed for traditional NNs and often have opaque effects on the results. We aim to shed light on the effects of hyperparameter choices for BNNs by performing a global sensitivity analysis of BNN performance under varying hyperparameter settings. Our results indicate that many of the hyperparameters interact with each other to affect both predictive accuracy and UQ. For improved usage of BNNs in real-world applications, we suggest that global sensitivity analysis, or related methods such as Bayesian optimization, should be used to aid in dimensionality reduction and selection of hyperparameters to ensure accurate UQ in BNNs.
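
A global (Sobol) sensitivity analysis of this kind can be run with SALib. The hyperparameters, bounds, and scoring stub below are hypothetical; a real study would replace the stub with a function that trains a BNN and returns a scalar score such as validation NLL.

```python
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

problem = {
    "num_vars": 3,
    "names": ["prior_std", "lr", "kl_weight"],   # hypothetical BNN hyperparameters
    "bounds": [[0.1, 2.0], [1e-4, 1e-2], [0.1, 10.0]],
}

def train_and_score_bnn(prior_std, lr, kl_weight):
    return prior_std * kl_weight + lr   # stub: replace with train-a-BNN-and-score

X = saltelli.sample(problem, 256)
Y = np.array([train_and_score_bnn(*row) for row in X])
Si = sobol.analyze(problem, Y)
print(Si["S1"], Si["S2"])   # first-order and pairwise interaction indices
```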

[377] Learning Green's Operators through Hierarchical Neural Networks Inspired by the Fast Multipole Method

Emilio McAllister Fognini, Marta M. Betcke, Ben T. Cox

Main category: cs.LG

TL;DR: A novel Neural FMM architecture that integrates the Fast Multipole Method’s hierarchical computation flow into machine learning for learning Green’s operators of Elliptic PDEs.

DetailsMotivation: The integration of FMM with modern machine learning architectures remains underexplored despite FMM's widespread use in physics and engineering for efficient computation of long-ranged forces.

Method: Proposes Neural FMM that leverages FMM’s hierarchical computation flow to separate local and far-field interactions and efficiently learn their respective representations within a neural network framework.

Result: A hierarchical machine learning framework capable of learning the Green’s operator of Elliptic PDEs by incorporating FMM’s information flow.

Conclusion: The Neural FMM architecture successfully bridges traditional numerical methods with modern machine learning, enabling efficient learning of complex physical operators.

Abstract: The Fast Multipole Method (FMM) is an efficient numerical algorithm for computation of long-ranged forces in N-body problems within gravitational and electrostatic fields. This method utilizes multipole expansions of the Green’s function inherent to the underlying dynamical systems. Despite its widespread application in physics and engineering, the integration of FMM with modern machine learning architectures remains underexplored. In this work, we propose a novel neural network architecture, the Neural FMM, that integrates the information flow of the FMM into a hierarchical machine learning framework for learning the Green’s operator of an Elliptic PDE. Our Neural FMM architecture leverages a hierarchical computation flow of the FMM method to split up the local and far-field interactions and efficiently learn their respective representations.

[378] TSKAN: Interpretable Machine Learning for QoE modeling over Time Series Data

Kamal Singh, Priyanka Rawat, Sami Marouani, Baptiste Jeudy

Main category: cs.LG

TL;DR: Proposes an interpretable ML approach for QoE modeling in video streaming using Kolmogorov-Arnold Networks (KANs) on frequency-domain features, achieving better accuracy and transparency than black-box methods.

DetailsMotivation: Traditional black-box QoE modeling approaches lack transparency and interpretability, making it difficult to understand the complex relationships between video streaming features and user experience.

Method: Combines Kolmogorov-Arnold Networks (KANs) as interpretable readout layers on top of compact frequency-domain features extracted from raw time series data, enabling temporal information capture while maintaining model transparency.

Result: The method demonstrates enhanced accuracy in QoE prediction on popular datasets while providing transparency and interpretability compared to traditional approaches.

Conclusion: The proposed interpretable ML approach using KANs offers an effective solution for QoE modeling that balances prediction accuracy with model explainability in video streaming applications.

Abstract: Quality of Experience (QoE) modeling is crucial for optimizing video streaming services to capture the complex relationships between different features and user experience. We propose a novel approach to QoE modeling in video streaming applications using interpretable Machine Learning (ML) techniques over raw time series data. Unlike traditional black-box approaches, our method combines Kolmogorov-Arnold Networks (KANs) as an interpretable readout on top of compact frequency-domain features, allowing us to capture temporal information while retaining a transparent and explainable model. We evaluate our method on popular datasets and demonstrate its enhanced accuracy in QoE prediction, while offering transparency and interpretability.
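
The feature-extraction half of the pipeline is easy to picture: compact frequency-domain features from the raw time series, which then feed the interpretable KAN readout (omitted here). The bin count is an illustrative assumption.

```python
import numpy as np

def compact_frequency_features(ts, k=8):
    """Magnitudes of the k lowest rFFT bins of each raw time series (k illustrative);
    these compact frequency-domain features feed the KAN readout."""
    spec = np.abs(np.fft.rfft(ts, axis=-1))
    return spec[..., :k]
```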

[379] Explicit and Effectively Symmetric Schemes for Neural SDEs

Daniil Shmelev, Cristopher Salvi

Main category: cs.LG

TL;DR: Introduces stable, near-reversible Runge-Kutta schemes (EES) for neural SDEs that overcome instability issues of existing reversible solvers while maintaining memory efficiency and gradient accuracy.

DetailsMotivation: Traditional approaches for backpropagation through neural SDE solvers face trade-offs: discretise-then-optimise has accurate gradients but high memory costs, while optimise-then-discretise has constant memory but suffers from gradient approximation errors and slower evaluation. Existing reversible solvers are unstable under complex models and large step sizes.

Method: Proposes a novel class of Explicit and Effectively Symmetric (EES) Runge-Kutta schemes that are stable and near-reversible. These schemes retain the benefits of reversible solvers while overcoming their instability issues.

Result: Numerical experiments demonstrate superior stability and reliability of the EES schemes compared to existing methods, enabling memory-efficient training without severe restrictions on step size or model complexity.

Conclusion: The EES schemes establish a practical foundation for scalable and accurate training of neural SDEs, offering both memory efficiency and gradient accuracy without the instability problems of previous reversible solvers.

Abstract: Backpropagation through (neural) SDE solvers is traditionally approached in two ways: discretise-then-optimise, which offers accurate gradients but incurs prohibitive memory costs due to storing the full computational graph (even when mitigated by checkpointing); and optimise-then-discretise, which achieves constant memory cost by solving an auxiliary backward SDE, but suffers from slower evaluation and gradient approximation errors. Algebraically reversible solvers promise both memory efficiency and gradient accuracy, yet existing methods such as the Reversible Heun scheme are often unstable under complex models and large step sizes. We address these limitations by introducing a novel class of stable, near-reversible Runge–Kutta schemes for neural SDEs. These Explicit and Effectively Symmetric (EES) schemes retain the benefits of reversible solvers while overcoming their instability, enabling memory-efficient training without severe restrictions on step size or model complexity. Through numerical experiments, we demonstrate the superior stability and reliability of our schemes, establishing them as a practical foundation for scalable and accurate training of neural SDEs.

[380] Function Spaces Without Kernels: Learning Compact Hilbert Space Representations

Su Ann Low, Quentin Rommel, Kevin S. Miller, Adam J. Thorpe, Ufuk Topcu

Main category: cs.LG

TL;DR: Function encoders learn neural network basis functions to create compact representations of function spaces, connecting feature learning and kernel methods through a learned kernel defined by feature map inner products.

DetailsMotivation: To develop neural predictors with kernel-level guarantees that can scale independently of dataset size while adapting to data structure, enabling efficient and principled models.

Method: Two training algorithms: progressive training that constructively grows bases, and train-then-prune approach using PCA principles to reveal intrinsic dimension. Also developed finite-sample generalization bounds using Rademacher complexity and PAC-Bayes techniques.

Result: Validated on polynomial benchmark and nonlinear dynamical systems (Van der Pol oscillator, two-body orbital model), achieving same accuracy with substantially fewer basis functions.

Conclusion: Function encoders provide a path toward neural predictors with kernel-level guarantees, enabling adaptable models that are both efficient and principled at scale.

Abstract: Function encoders are a recent technique that learn neural network basis functions to form compact, adaptive representations of Hilbert spaces of functions. We show that function encoders provide a principled connection to feature learning and kernel methods by defining a kernel through an inner product of the learned feature map. This kernel-theoretic perspective explains their ability to scale independently of dataset size while adapting to the intrinsic structure of data, and it enables kernel-style analysis of neural models. Building on this foundation, we develop two training algorithms that learn compact bases: a progressive training approach that constructively grows bases, and a train-then-prune approach that offers a computationally efficient alternative after training. Both approaches use principles from PCA to reveal the intrinsic dimension of the learned space. In parallel, we derive finite-sample generalization bounds using Rademacher complexity and PAC-Bayes techniques, providing inference time guarantees. We validate our approach on a polynomial benchmark with a known intrinsic dimension, and on nonlinear dynamical systems including a Van der Pol oscillator and a two-body orbital model, demonstrating that the same accuracy can be achieved with substantially fewer basis functions. This work suggests a path toward neural predictors with kernel-level guarantees, enabling adaptable models that are both efficient and principled at scale.
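
The kernel connection can be stated in a few lines: a function is represented by least-squares coefficients over the learned basis, and the induced kernel is the inner product of feature maps, k(x, y) = phi(x)^T phi(y). Names below are illustrative.

```python
import torch

def encode(phi, xs, ys):
    """Represent a function by coefficients over learned basis functions:
    solve min_c ||phi(xs) @ c - ys||^2 (phi: learned feature map, names illustrative)."""
    Phi = phi(xs)                               # (N, d_basis) basis evaluations
    return torch.linalg.lstsq(Phi, ys).solution

def predict(phi, coeffs, x_query):
    return phi(x_query) @ coeffs
```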

[381] MMG: Mutual Information Estimation via the MMSE Gap in Diffusion

Longxuan Yu, Xing Shi, Xianghao Kong, Tong Jia, Greg Ver Steeg

Main category: cs.LG

TL;DR: Diffusion models can be used to estimate mutual information (MI) by leveraging the gap in Minimum Mean Square Error (MMSE) between conditional and unconditional diffusion processes across all Signal-to-Noise-Ratios.

DetailsMotivation: Mutual information is a fundamental measure of relationships between variables but challenging to estimate for complex systems. Denoising diffusion models have shown strong performance in density estimation, suggesting they could also improve MI estimation.

Method: The method uses information-theoretic formulation of denoising diffusion models, where MI corresponds to half the MMSE gap between conditional and unconditional diffusion integrated over all SNRs. It employs adaptive importance sampling for scalable estimation.

Result: The approach passes self-consistency tests and outperforms both traditional and score-based diffusion MI estimators. It maintains strong performance even with high MI values.

Conclusion: Diffusion models provide an effective framework for mutual information estimation, offering improved performance over existing methods while being scalable through adaptive importance sampling techniques.

Abstract: Mutual information (MI) is one of the most general ways to measure relationships between random variables, but estimating this quantity for complex systems is challenging. Denoising diffusion models have recently set a new bar for density estimation, so it is natural to consider whether these methods could also be used to improve MI estimation. Using the recently introduced information-theoretic formulation of denoising diffusion models, we show the diffusion models can be used in a straightforward way to estimate MI. In particular, the MI corresponds to half the gap in the Minimum Mean Square Error (MMSE) between conditional and unconditional diffusion, integrated over all Signal-to-Noise-Ratios (SNRs) in the noising process. Our approach not only passes self-consistency tests but also outperforms traditional and score-based diffusion MI estimators. Furthermore, our method leverages adaptive importance sampling to achieve scalable MI estimation, while maintaining strong performance even when the MI is high.
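
The central identity is simple to evaluate once the MMSE curves are in hand: MI is half the gap between the marginal and conditional MMSE, integrated over SNR. A direct numerical reading, with the MMSE estimators themselves omitted:

```python
import numpy as np

def mi_from_mmse_gap(mmse_marginal, mmse_conditional, snr_grid):
    """MI(X; Y) ~= 0.5 * integral over SNR of
    [mmse(X | noisy X) - mmse(X | noisy X, Y)], by trapezoidal quadrature."""
    gap = np.asarray(mmse_marginal) - np.asarray(mmse_conditional)
    return 0.5 * np.trapz(gap, snr_grid)
```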

[382] Policy Compatible Skill Incremental Learning via Lazy Learning Interface

Daehee Lee, Dongsu Lee, TaeYoon Kwack, Wonje Choi, Honguk Woo

Main category: cs.LG

TL;DR: SIL-C is a framework for Skill Incremental Learning that maintains compatibility between evolving skills and downstream policies through bilateral lazy learning-based mapping, enabling skill improvements to enhance policy performance without retraining.

DetailsMotivation: As skill repertoires evolve in incremental learning, they can disrupt compatibility with existing skill-based policies, limiting reusability and generalization of policies.

Method: Uses bilateral lazy learning-based mapping to dynamically align subtask space referenced by policies with skill space. Each subtask from policy decomposition is executed by selecting appropriate skills based on trajectory distribution similarity.

Result: SIL-C maintains compatibility between evolving skills and downstream policies while ensuring efficiency throughout the learning process across diverse SIL scenarios.

Conclusion: The proposed framework successfully enables skill improvements to enhance downstream policy performance without requiring policy retraining or structural adaptation.

Abstract: Skill Incremental Learning (SIL) is the process by which an embodied agent expands and refines its skill set over time by leveraging experience gained through interaction with its environment or by the integration of additional data. SIL facilitates efficient acquisition of hierarchical policies grounded in reusable skills for downstream tasks. However, as the skill repertoire evolves, it can disrupt compatibility with existing skill-based policies, limiting their reusability and generalization. In this work, we propose SIL-C, a novel framework that ensures skill-policy compatibility, allowing improvements in incrementally learned skills to enhance the performance of downstream policies without requiring policy re-training or structural adaptation. SIL-C employs a bilateral lazy learning-based mapping technique to dynamically align the subtask space referenced by policies with the skill space decoded into agent behaviors. This enables each subtask, derived from the policy’s decomposition of a complex task, to be executed by selecting an appropriate skill based on trajectory distribution similarity. We evaluate SIL-C across diverse SIL scenarios and demonstrate that it maintains compatibility between evolving skills and downstream policies while ensuring efficiency throughout the learning process.
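Since the abstract describes the lazy-learning interface only at a high level, the following is a hedged, minimal sketch of a non-parametric subtask-to-skill mapping; `skill_bank`, the embeddings, and the Euclidean distance are all illustrative choices, not the paper's:

```python
import numpy as np

def select_skill(subtask_embedding, skill_bank):
    """Lazy (non-parametric) mapping: pick the skill whose stored
    trajectory embedding is closest to the policy's subtask query."""
    ids, protos = zip(*skill_bank.items())
    dists = [np.linalg.norm(subtask_embedding - p) for p in protos]
    return ids[int(np.argmin(dists))]

skill_bank = {                          # skill id -> mean trajectory embedding
    "open_drawer": np.array([0.9, 0.1]),
    "pick_object": np.array([0.2, 0.8]),
}
query = np.array([0.85, 0.2])           # subtask from the policy's decomposition
print(select_skill(query, skill_bank))  # -> "open_drawer"
```

Because the mapping is lazy, replacing an entry in `skill_bank` with an improved skill changes behavior immediately, with no policy retraining, which is the compatibility property the paper emphasizes.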

[383] Latent Twins

Matthias Chung, Deepanshu Verma, Max Collins, Amit N. Subrahmanya, Varuni Katti Sastry, Vishwas Rao

Main category: cs.LG

TL;DR: Latent Twins is a unifying mathematical framework that creates hidden surrogates in latent space for underlying equations, bridging representation learning and algorithmic solution methods. It demonstrates strong performance across ODEs, PDEs, and real-world geopotential data.

DetailsMotivation: Scientific machine learning has advanced various domains but evolved in parallel pipelines. The authors aim to unify representation learning and algorithmic solution methods through a single mathematical framework that bridges these traditionally separate approaches.

Method: Proposes Latent Twins - a framework that creates surrogate models in latent space governed by operators. The method establishes approximation properties for both ODEs and PDEs, and is demonstrated on canonical ODEs, shallow-water PDEs, and real geopotential data.

Result: Latent Twins provide compact, interpretable surrogates for solution operators that evaluate across arbitrary time gaps in a single shot. They are contrasted with DeepONet and a 4D-Var forecasting baseline, and remain compatible with scientific pipelines like assimilation, control, and uncertainty quantification.

Conclusion: The framework offers scalable, theory-grounded surrogates that bridge data-driven representation learning and classical scientific modeling, providing a unifying principle for various scientific computing tasks.

Abstract: Over the past decade, scientific machine learning has transformed the development of mathematical and computational frameworks for analyzing, modeling, and predicting complex systems. From inverse problems to numerical PDEs, dynamical systems, and model reduction, these advances have pushed the boundaries of what can be simulated. Yet they have often progressed in parallel, with representation learning and algorithmic solution methods evolving largely as separate pipelines. With Latent Twins, we propose a unifying mathematical framework that creates a hidden surrogate in latent space for the underlying equations. Whereas digital twins mirror physical systems in the digital world, Latent Twins mirror mathematical systems in a learned latent space governed by operators. Through this lens, classical modeling, inversion, model reduction, and operator approximation all emerge as special cases of a single principle. We establish the fundamental approximation properties of Latent Twins for both ODEs and PDEs and demonstrate the framework across three representative settings: (i) canonical ODEs, capturing diverse dynamical regimes; (ii) a PDE benchmark using the shallow-water equations, contrasting Latent Twin simulations with DeepONet and forecasts with a 4D-Var baseline; and (iii) a challenging real-data geopotential reanalysis dataset, reconstructing and forecasting from sparse, noisy observations. Latent Twins provide a compact, interpretable surrogate for solution operators that evaluates across arbitrary time gaps in a single shot, while remaining compatible with scientific pipelines such as assimilation, control, and uncertainty quantification. Looking forward, this framework offers scalable, theory-grounded surrogates that bridge data-driven representation learning and classical scientific modeling across disciplines.
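As a toy illustration of the latent-surrogate idea (our own construction, not the paper's architecture), one can encode a state, advance it with a learned latent operator parameterized by the time gap, and decode, which is what makes single-shot evaluation across arbitrary gaps possible:

```python
import torch
import torch.nn as nn

class LatentTwin(nn.Module):
    """Toy latent surrogate: encode, evolve with a learned latent
    operator, decode.  Linear maps are used purely for brevity."""
    def __init__(self, state_dim: int = 2, latent_dim: int = 8):
        super().__init__()
        self.enc = nn.Linear(state_dim, latent_dim)
        self.dec = nn.Linear(latent_dim, state_dim)
        self.A = nn.Parameter(0.01 * torch.randn(latent_dim, latent_dim))

    def forward(self, u0, dt: float):
        z0 = self.enc(u0)
        # One choice of latent operator: a matrix exponential, so any time
        # gap dt is a single evaluation rather than a step-by-step rollout.
        zt = z0 @ torch.matrix_exp(dt * self.A).T
        return self.dec(zt)

model = LatentTwin()
u0 = torch.randn(4, 2)          # batch of initial states
u_later = model(u0, dt=3.7)     # predict 3.7 time units ahead in one call
```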

[384] Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning

Hanjiang Hu, Changliu Liu, Na Li, Yebin Wang

Main category: cs.LG

TL;DR: This paper transforms multi-turn task planning into single-turn task reasoning, enabling efficient LLM agent training via Group Relative Policy Optimization (GRPO) with dense, verifiable rewards from expert trajectories.

DetailsMotivation: Training LLM agents for complex multi-turn task planning faces challenges including sparse rewards, credit assignment issues, and computational overhead of reinforcement learning in multi-turn settings.

Method: Converts multi-turn planning into single-turn task reasoning problems and optimizes the policy with GRPO, using dense and verifiable rewards derived from expert trajectories.

Result: A 1.5B parameter model trained with single-turn GRPO achieves 70% success rate on long-horizon planning tasks with over 30 steps, outperforming baseline models up to 14B parameters.

Conclusion: The approach demonstrates strong cross-task generalizability where models trained on complex tasks can successfully complete simpler subtasks, with theoretical guarantees on multi-turn success probability.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in knowledge acquisition, reasoning, and tool use, making them promising candidates for autonomous agent applications. However, training LLM agents for complex multi-turn task planning faces significant challenges, including sparse episode-wise rewards, credit assignment across long horizons, and the computational overhead of reinforcement learning in multi-turn interaction settings. To this end, this paper introduces a novel approach that transforms multi-turn task planning into single-turn task reasoning problems, enabling efficient policy optimization through Group Relative Policy Optimization (GRPO) with dense and verifiable rewards from expert trajectories. Our theoretical analysis shows that GRPO improvement on single-turn task reasoning results in a higher multi-turn success probability under the minimal number of turns, as well as generalization to subtasks with shorter horizons. Experimental evaluation on the complex task planning benchmark demonstrates that our 1.5B parameter model trained with single-turn GRPO achieves superior performance compared to larger baseline models up to 14B parameters, with success rates of 70% for long-horizon planning tasks with over 30 steps. We also validate, both theoretically and empirically, the strong cross-task generalizability: models trained on complex tasks can successfully complete all of their simpler subtasks.
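The group-relative advantage at the heart of GRPO is simple to state; the sketch below shows the standard normalization for one prompt's group of sampled plans (reward values are made up for illustration):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """GRPO-style advantages: each sampled response is scored against
    the group's mean and std, so no learned critic is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Six plans sampled for one prompt, scored by a dense, verifiable reward
# derived from expert trajectories (illustrative values).
rewards = torch.tensor([0.2, 0.9, 0.4, 0.9, 0.1, 0.6])
adv = group_relative_advantages(rewards)
# Every token of plan i is then weighted by adv[i] in the policy gradient.
```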

[385] Personalized Federated Dictionary Learning for Modeling Heterogeneity in Multi-site fMRI Data

Yipu Zhang, Chengshuo Zhang, Ziyu Zhou, Gang Qu, Hao Zheng, Yuping Wang, Hui Shen, Hongwen Deng

Main category: cs.LG

TL;DR: PFedDL is a federated learning framework that enables collaborative fMRI analysis across sites without sharing raw data by decomposing site-specific dictionaries into shared global and personalized local components.

DetailsMotivation: Data privacy constraints and site-specific heterogeneity in multi-site fMRI studies create non-IID data challenges that hinder the development of generalizable models.

Method: PFedDL performs independent dictionary learning at each site, decomposing dictionaries into global components (updated via federated aggregation) and local components (refined independently to capture site-specific variability).

Result: Experiments on the ABIDE dataset show that PFedDL outperforms existing methods in accuracy and robustness across non-IID datasets.

Conclusion: PFedDL effectively addresses privacy and heterogeneity challenges in multi-site fMRI studies by balancing cross-site consistency with site-specific personalization.

Abstract: Data privacy constraints pose significant challenges for large-scale neuroimaging analysis, especially in multi-site functional magnetic resonance imaging (fMRI) studies, where site-specific heterogeneity leads to non-independent and identically distributed (non-IID) data. These factors hinder the development of generalizable models. To address these challenges, we propose Personalized Federated Dictionary Learning (PFedDL), a novel federated learning framework that enables collaborative modeling across sites without sharing raw data. PFedDL performs independent dictionary learning at each site, decomposing each site-specific dictionary into a shared global component and a personalized local component. The global atoms are updated via federated aggregation to promote cross-site consistency, while the local atoms are refined independently to capture site-specific variability, thereby enhancing downstream analysis. Experiments on the ABIDE dataset demonstrate that PFedDL outperforms existing methods in accuracy and robustness across non-IID datasets.
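A hedged sketch of the global/local split described above: each site's dictionary keeps its first columns as shared global atoms and the rest as personalized local atoms, and only the global part is averaged on the server (the column layout and FedAvg-style mean are our simplifications):

```python
import numpy as np

def aggregate_global_atoms(site_dicts, n_global):
    """Average the shared global atoms across sites; personalized
    local atoms are never communicated."""
    return np.mean([D[:, :n_global] for D in site_dicts], axis=0)

rng = np.random.default_rng(0)
n_global, n_local, dim = 4, 2, 16
sites = [rng.standard_normal((dim, n_global + n_local)) for _ in range(3)]

G = aggregate_global_atoms(sites, n_global)
for D in sites:            # broadcast consensus global atoms back
    D[:, :n_global] = G    # columns n_global: onward stay site-specific
```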

[386] Wonder Wins Ways: Curiosity-Driven Exploration through Multi-Agent Contextual Calibration

Yiyuan Pan, Zhe Liu, Hesheng Wang

Main category: cs.LG

TL;DR: CERMIC is a novel framework for multi-agent reinforcement learning that enhances exploration by calibrating intrinsic curiosity with multi-agent context, filtering noisy surprise signals and focusing on meaningful peer behavior novelty.

DetailsMotivation: Existing curiosity mechanisms in MARL confuse environmental stochasticity with novelty and treat all unexpected observations equally, overlooking peer behavior novelty which encodes latent task dynamics, leading to suboptimal exploration in decentralized settings.

Method: CERMIC empowers agents to robustly filter noisy surprise signals and guide exploration by dynamically calibrating intrinsic curiosity with inferred multi-agent context. It generates theoretically-grounded intrinsic rewards that encourage exploration of state transitions with high information gain.

Result: Empirical evaluation on VMAS, Meltingpot, and SMACv2 benchmarks shows that CERMIC significantly outperforms state-of-the-art algorithms in sparse-reward environments.

Conclusion: The proposed CERMIC framework effectively addresses limitations of existing curiosity mechanisms in MARL by incorporating peer behavior novelty and multi-agent context, leading to superior exploration performance in complex sparse-reward environments.

Abstract: Autonomous exploration in complex multi-agent reinforcement learning (MARL) with sparse rewards critically depends on providing agents with effective intrinsic motivation. While artificial curiosity offers a powerful self-supervised signal, it often confuses environmental stochasticity with meaningful novelty. Moreover, existing curiosity mechanisms exhibit a uniform novelty bias, treating all unexpected observations equally. However, peer behavior novelty, which encodes latent task dynamics, is often overlooked, resulting in suboptimal exploration in decentralized, communication-free MARL settings. To this end, inspired by how human children adaptively calibrate their own exploratory behaviors by observing peers, we propose a novel approach to enhance multi-agent exploration. We introduce CERMIC, a principled framework that empowers agents to robustly filter noisy surprise signals and guide exploration by dynamically calibrating their intrinsic curiosity with inferred multi-agent context. Additionally, CERMIC generates theoretically-grounded intrinsic rewards, encouraging agents to explore state transitions with high information gain. We evaluate CERMIC on benchmark suites including VMAS, Meltingpot, and SMACv2. Empirical results demonstrate that exploration with CERMIC significantly outperforms SoTA algorithms in sparse-reward environments.

[387] Guiding Application Users via Estimation of Computational Resources for Massively Parallel Chemistry Computations

Tanzila Tabassum, Omer Subasi, Ajay Panyala, Epiya Ebiapia, Gerald Baumgartner, Erdal Mutlu, P. Sadayappan, Karol Kowalski

Main category: cs.LG

TL;DR: Machine learning strategies to predict optimal resource configurations for coupled-cluster chemistry computations on supercomputers, addressing shortest-time and cheapest-run scenarios.

DetailsMotivation: To guide users in predicting resources and costs for expensive massively parallel chemistry computations before committing to supercomputer experiments, helping optimize execution time and minimize resource usage.

Method: Developed ML models using Gradient Boosting and active learning strategies, evaluated on CCSD application runtime data from DOE Frontier and Aurora supercomputers, predicting optimal node counts and tile sizes.

Result: Gradient Boosting achieved MAPE of 0.023 (Aurora) and 0.073 (Frontier) for execution time prediction. Active learning achieved MAPE of about 0.2 with only 450 experiments from both supercomputers.

Conclusion: ML-based strategies effectively predict optimal resource configurations for chemistry computations, with active learning providing accurate predictions even with limited experimental data, enabling cost-effective supercomputer usage.

Abstract: In this work, we develop machine learning (ML) based strategies to predict resources (costs) required for massively parallel chemistry computations, such as coupled-cluster methods, to guide application users before they commit to running expensive experiments on a supercomputer. By predicting application execution time, we determine the optimal runtime parameter values such as number of nodes and tile sizes. Two key questions of interest to users are addressed. The first is the shortest-time question, where the user is interested in knowing the parameter configurations (number of nodes and tile sizes) to achieve the shortest execution time for a given problem size and a target supercomputer. The second is the cheapest-run question in which the user is interested in minimizing resource usage, i.e., finding the number of nodes and tile size that minimizes the number of node-hours for a given problem size. We evaluate a rich family of ML models and strategies, developed based on the collections of runtime parameter values for the CCSD (Coupled Cluster with Singles and Doubles) application executed on the Department of Energy (DOE) Frontier and Aurora supercomputers. Our experiments show that when predicting the total execution time of a CCSD iteration, a Gradient Boosting (GB) ML model achieves a Mean Absolute Percentage Error (MAPE) of 0.023 and 0.073 for Aurora and Frontier, respectively. In the case where it is expensive to run experiments just to collect data points, we show that active learning can achieve a MAPE of about 0.2 with just around 450 experiments collected from Aurora and Frontier.
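Both user questions reduce to simple queries against a fitted regressor. A minimal sketch with scikit-learn (toy data; the paper's feature set and model tuning are richer):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# (nodes, tile_size, problem_size) -> measured seconds (illustrative values).
X = np.array([[64, 40, 500], [128, 40, 500], [128, 80, 500], [256, 80, 500]])
t = np.array([3600.0, 2100.0, 1750.0, 1300.0])
model = GradientBoostingRegressor().fit(X, t)

candidates = np.array([[n, s, 500] for n in (64, 128, 256, 512)
                                   for s in (40, 80, 120)])
pred = model.predict(candidates)

fastest = candidates[pred.argmin()]            # shortest-time question
node_hours = candidates[:, 0] * pred / 3600.0
cheapest = candidates[node_hours.argmin()]     # cheapest-run question
```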

[388] Theoretical Bounds for Stable In-Context Learning

Tongxi Wang, Zhuoyang Xia

Main category: cs.LG

TL;DR: This paper establishes a theoretical lower bound for in-context learning stability and proposes a practical estimator for determining optimal prompt length without distributional priors.

DetailsMotivation: In-context learning is flexible but highly sensitive to prompt length, requiring better theoretical understanding and practical tools to ensure reliability in real-world applications.

Method: Develops a non-asymptotic lower bound linking demonstration count to ICL stability under high-dimensional sub-Gaussian representations, then proposes a two-stage observable estimator with one-shot calibration for practical prompt-length estimation.

Result: Experiments show close alignment between predicted thresholds and empirical knee-points across diverse datasets, encoders, and generators, with the theory providing a conservative but reliable upper bound.

Conclusion: The work connects spectral coverage to stable ICL, bridges theory and deployment, and improves interpretability and reliability of large-scale prompting in finite-sample regimes.

Abstract: In-context learning (ICL) is flexible but its reliability is highly sensitive to prompt length. This paper establishes a non-asymptotic lower bound that links the minimal number of demonstrations to ICL stability under fixed high-dimensional sub-Gaussian representations. The bound gives explicit sufficient conditions in terms of spectral properties of the covariance, providing a computable criterion for practice. Building on this analysis, we propose a two-stage observable estimator with a one-shot calibration that produces practitioner-ready prompt-length estimates without distributional priors. Experiments across diverse datasets, encoders, and generators show close alignment between the predicted thresholds and empirical knee-points, with the theory acting as a conservative but reliable upper bound; the calibrated variant further tightens this gap. These results connect spectral coverage to stable ICL, bridge theory and deployment, and improve the interpretability and reliability of large-scale prompting in realistic finite-sample regimes.

[389] Bispectral OT: Dataset Comparison using Symmetry-Aware Optimal Transport

Annabel Ma, Kaiying Hou, David Alvarez-Melis, Melanie Weber

Main category: cs.LG

TL;DR: Bispectral Optimal Transport is a symmetry-aware extension of discrete OT that uses bispectrum representations to preserve data structure while removing group action variations, improving class preservation accuracy.

DetailsMotivation: Standard OT alignments based on pairwise geometric distances can ignore intrinsic coherence structure in symmetry-rich settings, failing to capture meaningful semantic correspondences.

Method: Introduces Bispectral Optimal Transport that compares elements using their bispectrum representation - a group Fourier invariant that preserves signal structure while removing group action variations.

Result: Empirical demonstrations show transport plans computed with Bispectral OT achieve greater class preservation accuracy than naive feature OT on benchmark datasets with visual symmetries.

Conclusion: Bispectral OT improves quality of meaningful correspondences by capturing underlying semantic label structure while removing nuisance variations not affecting class or content.

Abstract: Optimal transport (OT) is a widely used technique in machine learning, graphics, and vision that aligns two distributions or datasets using their relative geometry. In symmetry-rich settings, however, OT alignments based solely on pairwise geometric distances between raw features can ignore the intrinsic coherence structure of the data. We introduce Bispectral Optimal Transport, a symmetry-aware extension of discrete OT that compares elements using their bispectrum representation, a group Fourier invariant that preserves all signal structure while removing only the variation due to group actions. Empirically, we demonstrate that the transport plans computed with Bispectral OT achieve greater class preservation accuracy than naive feature OT on benchmark datasets transformed with visual symmetries, improving the quality of meaningful correspondences that capture the underlying semantic label structure in the dataset while removing nuisance variation not affecting class or content.
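For a 1D signal under cyclic translation, the bispectrum is B(k1, k2) = F(k1) F(k2) conj(F(k1 + k2)), and translation invariance can be checked directly; the cost-matrix helper shows where these features would replace raw-feature distances in a discrete OT solver (the group, distance, and solver interface are illustrative choices on our part):

```python
import numpy as np

def bispectrum(x):
    """B(k1, k2) = F(k1) * F(k2) * conj(F(k1 + k2)) for the cyclic group:
    invariant to circular shifts while retaining higher-order structure."""
    F = np.fft.fft(x)
    n = len(x)
    k = np.arange(n)
    return F[k[:, None]] * F[k[None, :]] * np.conj(F[(k[:, None] + k[None, :]) % n])

x = np.random.default_rng(0).standard_normal(16)
assert np.allclose(bispectrum(x), bispectrum(np.roll(x, 5)))  # shift-invariant

def cost_matrix(xs, ys):
    """Pairwise distances between bispectra; feed this to any discrete
    OT solver in place of raw-feature distances."""
    B1 = np.stack([bispectrum(a).ravel() for a in xs])
    B2 = np.stack([bispectrum(b).ravel() for b in ys])
    return np.abs(B1[:, None, :] - B2[None, :, :]).sum(-1)
```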

[390] Can Federated Learning Safeguard Private Data in LLM Training? Vulnerabilities, Attacks, and Defense Evaluation

Wenkai Guo, Xuefeng Liu, Haolin Wang, Jianwei Niu, Shaojie Tang, Jing Yuan

Main category: cs.LG

TL;DR: This paper demonstrates that federated learning (FL) for fine-tuning large language models (LLMs) still poses significant privacy risks, as attackers can extract training data from the global model, with leakage increasing with model size. The authors propose an enhanced attack strategy and evaluate mitigation techniques.

DetailsMotivation: Organizations want to collaboratively fine-tune LLMs using data from multiple sources while preserving privacy. While FL is seen as a privacy-preserving solution, the authors investigate whether it truly protects against data leakage in LLM fine-tuning scenarios.

Method: The authors conducted extensive experiments showing that attackers can extract training data from FL-trained global models using straightforward generation methods. They also introduced an enhanced attack strategy that tracks global model updates during training to intensify privacy leakage.

Result: Results show that privacy leakage occurs even in FL settings, with leakage increasing as model size grows. The enhanced attack strategy further intensifies privacy risks. The authors evaluated mitigation techniques including differential privacy, regularization-constrained updates, and safety-aligned LLMs.

Conclusion: FL for LLM fine-tuning still poses significant privacy risks. The paper provides practical guidelines for reducing these risks through evaluated privacy-preserving techniques, highlighting the need for stronger privacy protections in collaborative LLM training.

Abstract: Fine-tuning large language models (LLMs) with local data is a widely adopted approach for organizations seeking to adapt LLMs to their specific domains. Given the shared characteristics in data across different organizations, the idea of collaboratively fine-tuning an LLM using data from multiple sources presents an appealing opportunity. However, organizations are often reluctant to share local data, making centralized fine-tuning impractical. Federated learning (FL), a privacy-preserving framework, enables clients to retain local data while sharing only model parameters for collaborative training, offering a potential solution. While fine-tuning LLMs on centralized datasets risks data leakage through next-token prediction, the iterative aggregation process in FL results in a global model that encapsulates generalized knowledge, which some believe protects client privacy. In this paper, however, we present contradictory findings through extensive experiments. We show that attackers can still extract training data from the global model, even using straightforward generation methods, with leakage increasing as the model size grows. Moreover, we introduce an enhanced attack strategy tailored to FL, which tracks global model updates during training to intensify privacy leakage. To mitigate these risks, we evaluate privacy-preserving techniques in FL, including differential privacy, regularization-constrained updates and adopting LLMs with safety alignment. Our results provide valuable insights and practical guidelines for reducing privacy risks when training LLMs with FL.

[391] Learning to Align Molecules and Proteins: A Geometry-Aware Approach to Binding Affinity

Mohammadsaleh Refahi, Bahrad A. Sokhansanj, James R. Brown, Gail Rosen

Main category: cs.LG

TL;DR: FIRM-DTI is a lightweight deep learning framework for drug-target binding affinity prediction that uses feature-wise linear modulation and metric learning to improve generalization across chemical space and time.

DetailsMotivation: Current deep learning models for drug-target binding affinity prediction use simple concatenation of ligand and protein representations and lack geometric regularization, leading to poor generalization. The goal is to develop a more robust method that can better handle out-of-domain data.

Method: FIRM-DTI conditions molecular embeddings on protein embeddings using a FiLM layer, enforces metric structure with triplet loss, and uses an RBF regression head on embedding distances for smooth, interpretable affinity predictions.

Result: FIRM-DTI achieves state-of-the-art performance on the Therapeutics Data Commons DTI-DG benchmark, as demonstrated through extensive ablation studies and out-of-domain evaluations.

Conclusion: The results highlight the value of conditioning and metric learning approaches for robust drug-target affinity prediction, showing that these techniques can significantly improve generalization capabilities despite the model’s modest size.

Abstract: Accurate prediction of drug-target binding affinity can accelerate drug discovery by prioritizing promising compounds before costly wet-lab screening. While deep learning has advanced this task, most models fuse ligand and protein representations via simple concatenation and lack explicit geometric regularization, resulting in poor generalization across chemical space and time. We introduce FIRM-DTI, a lightweight framework that conditions molecular embeddings on protein embeddings through a feature-wise linear modulation (FiLM) layer and enforces metric structure with a triplet loss. An RBF regression head operating on embedding distances yields smooth, interpretable affinity predictions. Despite its modest size, FIRM-DTI achieves state-of-the-art performance on the Therapeutics Data Commons DTI-DG benchmark, as demonstrated by an extensive ablation study and out-of-domain evaluation. Our results underscore the value of conditioning and metric learning for robust drug-target affinity prediction.
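FiLM conditioning is a per-feature affine transform whose scale and shift are predicted from the protein; a hedged sketch of that structure (dimensions, the shared projection, and the exact distance are our assumptions, not the paper's design):

```python
import torch
import torch.nn as nn

class FiLMAffinity(nn.Module):
    """Toy FIRM-DTI-style head: FiLM conditions the molecule embedding
    on the protein; an RBF on the embedding distance gives a smooth score."""
    def __init__(self, mol_dim=128, prot_dim=256, gamma=0.1):
        super().__init__()
        self.film = nn.Linear(prot_dim, 2 * mol_dim)   # -> (scale, shift)
        self.proj = nn.Linear(prot_dim, mol_dim)       # shared metric space
        self.gamma = gamma

    def forward(self, mol, prot):
        scale, shift = self.film(prot).chunk(2, dim=-1)
        mol_cond = scale * mol + shift                 # FiLM modulation
        d2 = (mol_cond - self.proj(prot)).pow(2).sum(-1)
        return torch.exp(-self.gamma * d2)             # RBF affinity in (0, 1]

model = FiLMAffinity()
score = model(torch.randn(8, 128), torch.randn(8, 256))
```

During training, a triplet loss on the same embedding distances would impose the metric structure the abstract describes.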

[392] CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou

Main category: cs.LG

TL;DR: CE-GPPO is a novel RL algorithm that preserves gradients from low-probability tokens in PPO to better control entropy dynamics and improve exploration-exploitation balance in LLM training.

DetailsMotivation: Existing RL methods like PPO discard valuable gradient signals from low-probability tokens due to clipping, which overlooks their critical role in regulating entropy evolution during training.

Method: CE-GPPO reintroduces gradients from clipped tokens in a gentle, bounded manner by controlling gradient magnitude from tokens outside the clipping interval, enabling better exploration-exploitation trade-off.

Result: Extensive experiments on mathematical reasoning benchmarks show CE-GPPO consistently outperforms strong baselines across different model scales while effectively mitigating entropy instability.

Conclusion: The proposed CE-GPPO algorithm provides theoretical and empirical evidence that preserving gradients from clipped tokens leads to more stable entropy dynamics and improved performance in RL-based LLM optimization.

Abstract: Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose Controlling Entropy via Gradient-Preserving Policy Optimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.
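The abstract does not spell out the exact gradient rule, so the following is only one plausible construction of a "gentle, bounded" clip, not the paper's algorithm: forward values match hard PPO clipping, but tokens outside the interval retain a small gradient of scale `alpha` (a hypothetical knob); the min with the unclipped PPO branch is omitted for brevity.

```python
import torch

logp_new = torch.tensor([-1.0, -3.5, -0.2], requires_grad=True)
logp_old = torch.tensor([-1.1, -1.0, -2.0])
adv = torch.tensor([0.5, -1.0, 2.0])
ratio = (logp_new - logp_old).exp()

def gradient_preserving_clip(ratio, eps=0.2, alpha=0.05):
    """Forward value equals the usual clip; clipped tokens keep a
    bounded gradient of scale alpha instead of exactly zero."""
    hard = ratio.clamp(1 - eps, 1 + eps)
    outside = (hard != ratio).float()
    return hard + alpha * (ratio - ratio.detach()) * outside

loss = -(gradient_preserving_clip(ratio) * adv).mean()
loss.backward()
print(logp_new.grad)   # out-of-interval tokens now carry ~alpha-scale gradient
```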

[393] A Genetic Algorithm for Navigating Synthesizable Molecular Spaces

Alston Lo, Connor W. Coley, Wojciech Matusik

Main category: cs.LG

TL;DR: SynGA is a genetic algorithm that operates directly over synthesis routes to ensure synthesizability in molecular design, with custom operators and fitness functions for various tasks including property optimization.

DetailsMotivation: To address the importance of synthesizability in molecular design by creating a method that explicitly constrains molecular space to synthesizable compounds, leveraging genetic algorithms' effectiveness.

Method: Uses custom crossover and mutation operators that work directly on synthesis routes, with a fitness function adaptable to different design tasks. Also includes a machine learning-based filter for building blocks and a model-based variant SynGBO that combines SynGA with Bayesian optimization.

Result: Demonstrates effectiveness on synthesizable analog search and property optimization for both 2D and 3D objectives, achieving state-of-the-art performance when coupled with building block filtering.

Conclusion: SynGA serves as a strong baseline and versatile module for synthesis-aware workflows, being lightweight and enforcing synthesizability by construction.

Abstract: Inspired by the effectiveness of genetic algorithms and the importance of synthesizability in molecular design, we present SynGA, a simple genetic algorithm that operates directly over synthesis routes. Our method features custom crossover and mutation operators that explicitly constrain it to synthesizable molecular space. By modifying the fitness function, we demonstrate the effectiveness of SynGA on a variety of design tasks, including synthesizable analog search and sample-efficient property optimization, for both 2D and 3D objectives. Furthermore, by coupling SynGA with a machine learning-based filter that focuses the building block set, we boost SynGA to state-of-the-art performance. For property optimization, this manifests as a model-based variant SynGBO, which employs SynGA and block filtering in the inner loop of Bayesian optimization. Since SynGA is lightweight and enforces synthesizability by construction, our hope is that SynGA can not only serve as a strong standalone baseline but also as a versatile module that can be incorporated into larger synthesis-aware workflows in the future.
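A skeletal GA over synthesis routes, with routes flattened to lists of building-block ids so that crossover and mutation stay closed over the encoded space by construction (real routes are reaction trees, so the paper's operators are necessarily richer; all names and the toy fitness are ours):

```python
import random

def crossover(route_a, route_b):
    cut = random.randint(1, min(len(route_a), len(route_b)) - 1)
    return route_a[:cut] + route_b[cut:]          # swap route suffixes

def mutate(route, blocks, p=0.2):
    return [random.choice(blocks) if random.random() < p else step
            for step in route]

def synga(fitness, blocks, pop_size=20, route_len=3, gens=50):
    pop = [[random.choice(blocks) for _ in range(route_len)]
           for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]            # truncation selection
        children = [mutate(crossover(*random.sample(parents, 2)), blocks)
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

# Toy objective: prefer routes built from "cheap" blocks (illustrative only).
blocks = ["bb1", "bb2", "bb3_cheap", "bb4_cheap"]
best = synga(lambda r: sum(s.endswith("cheap") for s in r), blocks)
```

Coupling with the ML-based filter then amounts, roughly, to shrinking `blocks` to high-scoring building blocks before the run.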

[394] Scaling Laws are Redundancy Laws

Yuda Bi, Vince D Calhoun

Main category: cs.LG

TL;DR: This paper provides the first rigorous mathematical explanation of scaling laws as redundancy laws, showing that scaling exponents depend on data redundancy rather than being universal.

DetailsMotivation: While scaling laws are a defining feature of deep learning, their mathematical origins and the factors determining scaling exponents have remained elusive.

Method: Using kernel regression, the authors demonstrate that a polynomial tail in the data covariance spectrum yields power-law scaling with exponents determined by data redundancy. They establish universality across various transformations, multi-modal mixtures, finite-width approximations, and Transformer architectures.

Result: The analysis reveals that scaling exponents follow α = 2s/(2s + 1/β), where β controls the spectral tail and 1/β measures redundancy, showing that steeper spectra accelerate returns to scale.

Conclusion: This work unifies empirical scaling law observations with theoretical foundations by explaining them as finite-sample redundancy laws, providing the first rigorous mathematical explanation for scaling phenomena in deep learning.

Abstract: Scaling laws, a defining feature of deep learning, reveal a striking power-law improvement in model performance with increasing dataset and model size. Yet, their mathematical origins, especially the scaling exponent, have remained elusive. In this work, we show that scaling laws can be formally explained as redundancy laws. Using kernel regression, we show that a polynomial tail in the data covariance spectrum yields an excess risk power law with exponent α = 2s / (2s + 1/β), where β controls the spectral tail and 1/β measures redundancy. This reveals that the learning curve’s slope is not universal but depends on data redundancy, with steeper spectra accelerating returns to scale. We establish the law’s universality across boundedly invertible transformations, multi-modal mixtures, finite-width approximations, and Transformer architectures in both linearized (NTK) and feature-learning regimes. This work delivers the first rigorous mathematical explanation of scaling laws as finite-sample redundancy laws, unifying empirical observations with theoretical foundations.
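As a quick worked instance of the exponent formula (symbols as in the abstract; the numbers are our own illustrative choices):

```latex
\alpha = \frac{2s}{2s + 1/\beta},
\qquad s = 1,\ \beta = 2
\;\Rightarrow\; \alpha = \frac{2}{2 + 0.5} = 0.8
```

Heavier redundancy (smaller β, hence larger 1/β) drags α down and flattens the learning curve; steeper spectra push α toward its ceiling of 1.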

[395] The Impact of Audio Watermarking on Audio Anti-Spoofing Countermeasures

Zhenshan Zhang, Xueping Zhang, Yechen Wang, Liwei Jin, Ming Li

Main category: cs.LG

TL;DR: First study showing audio watermarking negatively impacts spoofing countermeasures, with higher watermark density increasing detection errors. Proposes KPWL framework to mitigate this domain shift.

DetailsMotivation: Audio watermarking is widely used for copyright protection but its impact on anti-spoofing systems remains unexplored, creating potential security vulnerabilities.

Method: Created Watermark-Spoofing dataset by applying diverse watermarking methods to existing anti-spoofing datasets. Proposed KPWL framework to adapt models to watermark-induced domain shifts while preserving original detection capabilities.

Result: Watermarking consistently degrades anti-spoofing performance, with higher watermark density correlating with higher Equal Error Rates (EERs).

Conclusion: Audio watermarking represents a previously overlooked domain shift that compromises spoofing detection. The study establishes the first benchmark for developing watermark-resilient anti-spoofing systems.

Abstract: This paper presents the first study on the impact of audio watermarking on spoofing countermeasures. While anti-spoofing systems are essential for securing speech-based applications, the influence of widely used audio watermarking, originally designed for copyright protection, remains largely unexplored. We construct watermark-augmented training and evaluation datasets, named the Watermark-Spoofing dataset, by applying diverse handcrafted and neural watermarking methods to existing anti-spoofing datasets. Experiments show that watermarking consistently degrades anti-spoofing performance, with higher watermark density correlating with higher Equal Error Rates (EERs). To mitigate this, we propose the Knowledge-Preserving Watermark Learning (KPWL) framework, enabling models to adapt to watermark-induced shifts while preserving their original-domain spoofing detection capability. These findings reveal audio watermarking as a previously overlooked domain shift and establish the first benchmark for developing watermark-resilient anti-spoofing systems. All related protocols are publicly available at https://github.com/Alphawarheads/Watermark_Spoofing.git

[396] Measuring LLM Sensitivity in Transformer-based Tabular Data Synthesis

Maria F. Davila R, Azizjon Turaev, Wolfram Wingerath

Main category: cs.LG

TL;DR: This study analyzes how hyperparameter choices affect synthetic tabular data quality and computational performance in Transformer-based models (GReaT and REaLTabFormer), finding that REaLTabFormer with lightweight LLMs offers the best balance between data quality and computational efficiency.

DetailsMotivation: Transformer-based models outperform other TDS tools but have high computational costs, making them impractical for users with prosumer hardware. The study aims to assess sensitivity to hyperparameters to find optimal configurations.

Method: Conducted sensitivity assessment across 10 model setups varying in architecture type and depth, evaluating runtime, ML utility, and similarity to real data distributions on four real-world datasets.

Result: Runtime is proportional to hyperparameter count; GReaT has lower runtimes but REaLTabFormer maintains better utility and similarity on larger datasets. REaLTabFormer with lightweight LLMs provides the best balance.

Conclusion: REaLTabFormer with lightweight LLMs preserves data quality while reducing computational requirements, though runtime remains higher than GReaT, indicating limited efficiency gains are possible.

Abstract: Synthetic tabular data is used for privacy-preserving data sharing and data-driven model development. Its effectiveness, however, depends heavily on the used Tabular Data Synthesis (TDS) tool. Recent studies have shown that Transformer-based models outperform other state-of-the-art models such as Generative Adversarial Networks (GANs) and Diffusion models in terms of data quality. However, Transformer-based models also come with high computational costs, making them sometimes unfeasible for end users with prosumer hardware. This study presents a sensitivity assessment on how the choice of hyperparameters, such as number of layers or hidden dimension, affects the quality of the resultant synthetic data and the computational performance. It is performed across two tools, GReaT and REaLTabFormer, evaluating 10 model setups that vary in architecture type and depth. We assess the sensitivity on three dimensions: runtime, machine learning (ML) utility, and similarity to real data distributions. Experiments were conducted on four real-world datasets. Our findings reveal that runtime is proportional to the number of hyperparameters, with shallower configurations completing faster. GReaT consistently achieves lower runtimes than REaLTabFormer, and only on the largest dataset do they have comparable runtimes. For small datasets, both tools achieve synthetic data with high utility and optimal similarity, but on larger datasets only REaLTabFormer sustains strong utility and similarity. As a result, REaLTabFormer with lightweight LLMs provides the best balance, since it preserves data quality while reducing computational requirements. Nonetheless, its runtime remains higher than that of GReaT and other TDS tools, suggesting that efficiency gains are possible but only up to a certain level.

[397] Sig2Model: A Boosting-Driven Model for Updatable Learned Indexes

Alireza Heidari, Amirhossein Ahmad, Wei Zhang, Ying Xiong

Main category: cs.LG

TL;DR: Sig2Model is an efficient learned index that minimizes retraining costs for dynamic datasets through sigmoid boosting approximation, proactive update training with GMMs, and neural joint optimization.

DetailsMotivation: Traditional learned indexes perform well on static data but degrade under dynamic updates due to high retraining costs required to maintain CDF invariance, making them unsuitable for real-world workloads with frequent updates.

Method: Three key techniques: (1) sigmoid boosting approximation for dynamic model adjustments with bounded error guarantees, (2) proactive update training using Gaussian mixture models to identify high-update regions, and (3) neural joint optimization framework for continuous refinement.

Result: Sig2Model reduces retraining cost by up to 20x, achieves up to 3x higher QPS, and uses up to 1000x less memory compared to state-of-the-art updatable learned indexes.

Conclusion: Sig2Model effectively addresses the update performance limitations of learned indexes, making them practical for real-world dynamic workloads through efficient retraining minimization techniques.

Abstract: Learned Indexes (LIs) represent a paradigm shift from traditional index structures by employing machine learning models to approximate the cumulative distribution function (CDF) of sorted data. While LIs achieve remarkable efficiency for static datasets, their performance degrades under dynamic updates: maintaining the CDF invariant (sum of F(k) equals 1) requires global model retraining, which blocks queries and limits the queries-per-second (QPS) metric. Current approaches fail to address these retraining costs effectively, rendering them unsuitable for real-world workloads with frequent updates. In this paper, we present Sig2Model, an efficient and adaptive learned index that minimizes retraining cost through three key techniques: (1) a sigmoid boosting approximation technique that dynamically adjusts the index model by approximating update-induced shifts in data distribution with localized sigmoid functions while preserving bounded error guarantees and deferring full retraining; (2) proactive update training via Gaussian mixture models (GMMs) that identifies high-update-probability regions for strategic placeholder allocation to speed up updates; and (3) a neural joint optimization framework that continuously refines both the sigmoid ensemble and GMM parameters via gradient-based learning. We evaluate Sig2Model against state-of-the-art updatable learned indexes on real-world and synthetic workloads, and show that Sig2Model reduces retraining cost by up to 20x, achieves up to 3x higher QPS, and uses up to 1000x less memory.
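The sigmoid-boosting idea can be pictured as patching a stale base model with localized, differentiable step corrections instead of retraining; a toy sketch (the linear base model and all parameter values are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SigmoidPatchedIndex:
    """Learned index whose base CDF model absorbs update-induced shifts
    via localized sigmoid corrections rather than global retraining."""
    def __init__(self, slope, intercept):
        self.base = lambda key: slope * key + intercept
        self.patches = []                 # (amplitude, sharpness, center)

    def add_patch(self, amplitude, sharpness, center):
        self.patches.append((amplitude, sharpness, center))

    def predict(self, key):
        pos = self.base(key)
        for a, b, c in self.patches:
            pos += a * sigmoid(b * (key - c))
        return pos

idx = SigmoidPatchedIndex(slope=0.5, intercept=0.0)
idx.add_patch(amplitude=100, sharpness=0.2, center=400)  # bulk insert near 400
print(idx.predict(np.array([100.0, 500.0])))  # keys past 400 shift by ~100
```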

[398] IConv: Focusing on Local Variation with Channel Independent Convolution for Multivariate Time Series Forecasting

Gawon Lee, Hanbyeol Park, Minseop Kim, Dohee Kim, Hyerim Bae

Main category: cs.LG

TL;DR: The paper proposes a hybrid model combining MLP and CNN architectures to address non-stationary time-series forecasting, using MLP for long-term trends and CNN with novel IConv architecture for local variations.

DetailsMotivation: Real-world time-series data exhibit non-stationarity with changing trends and local variations. MLP models excel at capturing long-term dependencies but ignore local patterns, while CNNs can effectively model these variations but have limitations with diverse channel distributions.

Method: Proposes combining MLP for overall trend modeling (long-term dependencies) with CNN using diverse kernels for fine-grained local patterns. Introduces IConv - a novel convolutional architecture that processes temporal dependencies channel-independently and considers inter-channel relationships through distinct layers.

Result: Extensive experiments on time-series datasets demonstrate the superiority of the proposed method for multivariate time-series forecasting compared to existing approaches.

Conclusion: The hybrid MLP-CNN model with IConv architecture effectively addresses the limitations of pure MLP models by capturing both long-term trends and local variations, providing superior performance in multivariate time-series forecasting tasks.

Abstract: Real-world time-series data often exhibit non-stationarity, including changing trends, irregular seasonality, and residuals. In terms of changing trends, recently proposed multi-layer perceptron (MLP)-based models have shown excellent performance owing to their computational efficiency and ability to capture long-term dependencies. However, the linear nature of MLP architectures poses limitations when applied to channels with diverse distributions, resulting in local variations such as seasonal patterns and residual components being ignored. In contrast, convolutional neural networks (CNNs) can effectively incorporate these variations. To resolve the limitations of MLPs, we propose combining them with CNNs. The overall trend is modeled using an MLP to consider long-term dependencies. The CNN uses diverse kernels to model fine-grained local patterns in conjunction with MLP trend predictions. To focus on modeling local variation, we propose IConv, a novel convolutional architecture that processes the temporal dependency channel independently and considers the inter-channel relationship through distinct layers. Independent channel processing enables the modeling of diverse local temporal dependencies and the adoption of a large kernel size. Distinct inter-channel considerations reduce computational cost. The proposed model is evaluated through extensive experiments on time-series datasets. The results reveal the superiority of the proposed method for multivariate time-series forecasting.
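In PyTorch terms, channel-independent temporal convolution is a depthwise convolution (groups equal to channels), with inter-channel mixing deferred to a separate pointwise layer; a minimal sketch of that structure (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class IConvBlock(nn.Module):
    """Depthwise conv with a large kernel models each channel's local
    temporal pattern independently; a 1x1 conv mixes channels afterwards."""
    def __init__(self, channels: int, kernel_size: int = 25):
        super().__init__()
        self.temporal = nn.Conv1d(channels, channels, kernel_size,
                                  padding=kernel_size // 2, groups=channels)
        self.cross_channel = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):                 # x: (batch, channels, time)
        return self.cross_channel(torch.relu(self.temporal(x)))

block = IConvBlock(channels=7)
y = block(torch.randn(2, 7, 96))          # 7-variable multivariate series
```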

[399] LiLAW: Lightweight Learnable Adaptive Weighting to Meta-Learn Sample Difficulty and Improve Noisy Training

Abhishek Moturu, Anna Goldenberg, Babak Taati

Main category: cs.LG

TL;DR: LiLAW is a lightweight method that dynamically adjusts sample loss weights based on difficulty levels using only three learnable parameters, improving model performance in noisy and heterogeneous data environments.

DetailsMotivation: Training deep neural networks with noisy labels and data heterogeneity is challenging, requiring methods that can adaptively prioritize informative samples without excessive hyperparameter tuning or clean validation sets.

Method: LiLAW categorizes samples as easy, moderate, or hard and updates three learnable weight parameters using a single mini-batch gradient descent step on the validation set after each training mini-batch, without needing a clean validation set.

Result: Extensive experiments across various datasets, noise levels, loss functions, and architectures show LiLAW consistently enhances performance even in high-noise environments, without heavy reliance on data augmentation or advanced regularization.

Conclusion: LiLAW provides a computationally efficient solution to improve model generalization and robustness in neural network training, demonstrating practicality across diverse settings including medical imaging.

Abstract: Training deep neural networks in the presence of noisy labels and data heterogeneity is a major challenge. We introduce Lightweight Learnable Adaptive Weighting (LiLAW), a novel method that dynamically adjusts the loss weight of each training sample based on its evolving difficulty level, categorized as easy, moderate, or hard. Using only three learnable parameters, LiLAW adaptively prioritizes informative samples throughout training by updating these weights using a single mini-batch gradient descent step on the validation set after each training mini-batch, without requiring excessive hyperparameter tuning or a clean validation set. Extensive experiments across multiple general and medical imaging datasets, noise levels and types, loss functions, and architectures with and without pretraining demonstrate that LiLAW consistently enhances performance, even in high-noise environments. It is effective without heavy reliance on data augmentation or advanced regularization, highlighting its practicality. It offers a computationally efficient solution to boost model generalization and robustness in any neural network training setup.
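The three learnable parameters amount to one weight per difficulty bucket; a hedged sketch of the weighting mechanism (the tercile-based difficulty rule below is our stand-in for the paper's criterion):

```python
import torch

# One learnable log-weight per bucket: easy / moderate / hard.
log_w = torch.zeros(3, requires_grad=True)

def bucketize(per_sample_losses):
    lo = per_sample_losses.quantile(1 / 3)
    hi = per_sample_losses.quantile(2 / 3)
    return (per_sample_losses > lo).long() + (per_sample_losses > hi).long()

def weighted_training_loss(per_sample_losses):
    buckets = bucketize(per_sample_losses.detach())
    w = torch.softmax(log_w, dim=0)[buckets]    # adaptive per-sample weights
    return (w * per_sample_losses).sum()

# Per the abstract, log_w is then updated with a single mini-batch gradient
# descent step on the validation loss after each training mini-batch.
```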

[400] Aligning Inductive Bias for Data-Efficient Generalization in State Space Models

Qiyu Chen, Guozhang Chen

Main category: cs.LG

TL;DR: The paper introduces Task-Dependent Initialization (TDI), a method to align State Space Models’ inductive bias with task-specific spectral characteristics to improve data efficiency in low-data regimes.

DetailsMotivation: Large-scale models face data scarcity challenges, and fixed inductive biases in State Space Models are inefficient when task structure doesn't match the model's prior. There's a need for more data-efficient learning methods.

Method: Formalizes SSM inductive bias through SSM-induced kernel analysis, then proposes TDI via power spectrum matching to align model frequency response with task spectral characteristics before training.

Result: Experiments on diverse real-world benchmarks show TDI significantly improves generalization and sample efficiency, especially in low-data scenarios.

Conclusion: TDI provides theoretical and practical tools for creating more data-efficient models, representing a crucial step toward sustainable scaling of large models.

Abstract: The remarkable success of large-scale models is fundamentally tied to scaling laws, yet the finite nature of high-quality data presents a looming challenge. One of the next frontiers in modeling is data efficiency: the ability to learn more from less. A model’s inductive bias is a critical lever for this, but foundational sequence models like State Space Models (SSMs) rely on a fixed bias. This fixed prior is sample-inefficient when a task’s underlying structure does not match. In this work, we introduce a principled framework to solve this problem. We first formalize the inductive bias of linear time-invariant SSMs through an SSM-induced kernel, mathematically and empirically proving its spectrum is directly governed by the model’s frequency response. Further, we propose a method of Task-Dependent Initialization (TDI): power spectrum matching, a fast and efficient method that aligns the model’s inductive bias with the task’s spectral characteristics before large-scale training. Our experiments on a diverse set of real-world benchmarks show that TDI significantly improves generalization and sample efficiency, particularly in low-data regimes. This work provides a theoretical and practical tool to create more data-efficient models, a crucial step towards sustainable scaling.
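A rough picture of power spectrum matching (our own simplification; the paper's matching procedure is presumably more principled): estimate where the task's sequences carry spectral energy, then initialize the oscillation frequencies of a diagonal SSM's modes in those bands so the model's frequency response overlaps the task's spectrum:

```python
import numpy as np

def task_power_spectrum(signals):
    """Average power spectrum of the training sequences."""
    return (np.abs(np.fft.rfft(signals, axis=-1)) ** 2).mean(axis=0)

rng = np.random.default_rng(0)
signals = rng.standard_normal((128, 256))     # stand-in training data
spec = task_power_spectrum(signals)
freqs = 2 * np.pi * np.fft.rfftfreq(256, d=1.0)

dominant = freqs[np.argsort(spec)[-8:]]       # 8 most energetic bands
init_poles = -0.1 + 1j * dominant             # modes resonate at those bands
```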

[401] FERD: Fairness-Enhanced Data-Free Robustness Distillation

Zhengxiao Li, Liming Lu, Xu Zheng, Siyuan Liang, Zhenghan Chen, Yongbin Zhou, Shuchao Pang

Main category: cs.LG

TL;DR: FERD is a fairness-enhanced data-free robustness distillation framework that addresses robust fairness issues by adjusting adversarial example proportion and distribution to improve worst-class robustness across categories.

DetailsMotivation: Existing data-free robustness distillation methods focus on overall robustness but overlook robust fairness, leading to severe disparity in robustness across different categories. The paper identifies two key problems: unequal robustness performance across categories and instability against different attack targets.

Method: FERD uses two main strategies: (1) robustness-guided class reweighting to synthesize more samples for less robust categories, and (2) generating Fairness-Aware Examples (FAEs) with uniformity constraints and Uniform-Target Adversarial Examples (UTAEs) with uniform target class constraints to provide balanced representation and avoid biased attacks.

Result: Extensive experiments on three public datasets show FERD achieves state-of-the-art worst-class robustness under all adversarial attacks, with improvements of 15.1% under FGSM and 6.4% under AutoAttack using MobileNet-V2 on CIFAR-10.

Conclusion: FERD demonstrates superior performance in both robustness and fairness aspects, effectively addressing the robust fairness issues in data-free robustness distillation by balancing adversarial example generation across all categories.

Abstract: Data-Free Robustness Distillation (DFRD) aims to transfer the robustness from the teacher to the student without accessing the training data. While existing methods focus on overall robustness, they overlook robust fairness issues, leading to a severe disparity of robustness across different categories. In this paper, we find two key problems: (1) a student model distilled with equal class proportions behaves significantly differently across distinct categories; and (2) the robustness of the student model is not stable across different attack targets. To bridge these gaps, we present the first Fairness-Enhanced data-free Robustness Distillation (FERD) framework to adjust the proportion and distribution of adversarial examples. For the proportion, FERD adopts a robustness-guided class reweighting strategy to synthesize more samples for the less robust categories, thereby improving their robustness. For the distribution, FERD generates complementary data samples for advanced robustness distillation. It generates Fairness-Aware Examples (FAEs) by enforcing a uniformity constraint on feature-level predictions, which suppresses the dominance of class-specific non-robust features, providing a more balanced representation across all categories. Then, FERD constructs Uniform-Target Adversarial Examples (UTAEs) from FAEs by applying a uniform target class constraint to avoid biased attack directions, which distributes the attack targets across all categories and prevents overfitting to specific vulnerable categories. Extensive experiments on three public datasets show that FERD achieves state-of-the-art worst-class robustness under all adversarial attacks (e.g., worst-class robustness under FGSM and AutoAttack improves by 15.1% and 6.4%, respectively, using MobileNet-V2 on CIFAR-10), demonstrating superior performance in both robustness and fairness aspects.
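Robustness-guided class reweighting can be as simple as a softmax over negative per-class robust accuracy, so the weakest classes receive the most synthesized samples (the temperature and the exact weighting rule are our assumptions):

```python
import numpy as np

def class_sampling_weights(per_class_robust_acc, temperature=0.1):
    """Less robust classes get a larger share of generated samples."""
    logits = -np.asarray(per_class_robust_acc) / temperature
    w = np.exp(logits - logits.max())
    return w / w.sum()

robust_acc = [0.62, 0.35, 0.58, 0.71]        # illustrative per-class robustness
print(class_sampling_weights(robust_acc))    # class 1 dominates the budget
```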

[402] T2I-Diff: fMRI Signal Generation via Time-Frequency Image Transform and Classifier-Free Denoising Diffusion Models

Hwa Hui Tew, Junn Yong Loo, Yee-Fan Tan, Xinyu Tang, Hernando Ombao, Fuad Noman, Raphael C. -W. Phan, Chee-Ming Ting

Main category: cs.LG

TL;DR: T2I-Diff is a novel fMRI generation framework that uses time-frequency representations and classifier-free denoising diffusion to synthesize high-quality BOLD signals, addressing limitations of existing generative models.

DetailsMotivation: fMRI data acquisition is resource-intensive, limiting high-fidelity samples for data-driven brain analysis. Existing generative models underperform due to overlooking complex non-stationarity and nonlinear BOLD dynamics.

Method: The framework converts BOLD signals to windowed spectrograms via time-dependent Fourier transform, trains a classifier-free diffusion model to generate class-conditioned frequency spectrograms, and reverts them to BOLD signals via inverse Fourier transforms.

Result: The approach demonstrates improved accuracy and generalization in downstream fMRI-based brain network classification tasks.

Conclusion: T2I-Diff effectively addresses the challenges of fMRI data generation by capturing complex temporal dynamics and spectral evolution, enabling better performance in brain analysis applications.

Abstract: Functional Magnetic Resonance Imaging (fMRI) is an advanced neuroimaging method that enables in-depth analysis of brain activity by measuring dynamic changes in the blood oxygenation level-dependent (BOLD) signals. However, the resource-intensive nature of fMRI data acquisition limits the availability of high-fidelity samples required for data-driven brain analysis models. While modern generative models can synthesize fMRI data, they often underperform because they overlook the complex non-stationarity and nonlinear BOLD dynamics. To address these challenges, we introduce T2I-Diff, an fMRI generation framework that leverages time-frequency representation of BOLD signals and classifier-free denoising diffusion. Specifically, our framework first converts BOLD signals into windowed spectrograms via a time-dependent Fourier transform, capturing both the underlying temporal dynamics and spectral evolution. Subsequently, a classifier-free diffusion model is trained to generate class-conditioned frequency spectrograms, which are then reverted to BOLD signals via inverse Fourier transforms. Finally, we validate the efficacy of our approach by demonstrating improved accuracy and generalization in downstream fMRI-based brain network classification.
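The transform pair is a standard STFT/ISTFT round trip; the sketch below marks where a class-conditioned diffusion model would sit between the two (the sampling rate and window length are illustrative):

```python
import numpy as np
from scipy.signal import stft, istft

fs = 0.5                                   # e.g. TR = 2 s for fMRI (illustrative)
t = np.arange(0, 600, 1 / fs)
bold = np.sin(2 * np.pi * 0.05 * t) \
     + 0.1 * np.random.default_rng(0).standard_normal(len(t))

f, tau, Z = stft(bold, fs=fs, nperseg=64)  # windowed spectrogram (training target)

# ...a class-conditioned diffusion model would generate a new Z here...

_, bold_rec = istft(Z, fs=fs, nperseg=64)  # revert spectrogram to a BOLD signal
n = min(len(bold), len(bold_rec))
assert np.allclose(bold[:n], bold_rec[:n], atol=1e-7)   # near-exact round trip
```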

[403] CaTS-Bench: Can Language Models Describe Numeric Time Series?

Luca Zhou, Pratham Yashwante, Marshall Fisher, Alessio Sampieri, Zihao Zhou, Fabio Galasso, Rose Yu

Main category: cs.LG

TL;DR: CaTS-Bench is a large-scale, real-world benchmark for context-aware time series captioning, addressing limitations of existing benchmarks by including metadata, visual representations, and diverse datasets with 465k training and 105k test timestamps.

DetailsMotivation: Existing time series captioning benchmarks rely on synthetic data or simplistic captions, and neglect metadata and visual representations, creating a gap in real-world applicability.

Method: Derived from 11 diverse datasets reframed as captioning and Q&A tasks, using a scalable pipeline with LLM-generated captions verified through factual checks, human indistinguishability studies, and diversity analyses, plus a human-revisited subset of 579 test captions.

Result: Established CaTS-Bench as a comprehensive benchmark with new evaluation metrics, benchmarking leading VLMs to highlight their strengths and limitations in time series reasoning.

Conclusion: CaTS-Bench and its captioning pipeline provide a reliable and extensible foundation for future research at the intersection of time series analysis and foundation models.

Abstract: Time series captioning, the task of describing numeric time series in natural language, requires numerical reasoning, trend interpretation, and contextual understanding. Existing benchmarks, however, often rely on synthetic data or overly simplistic captions, and typically neglect metadata and visual representations. To close this gap, we introduce CaTS-Bench, the first large-scale, real-world benchmark for Context-aware Time Series captioning. CaTS-Bench is derived from 11 diverse datasets reframed as captioning and Q&A tasks, comprising roughly 465k training and 105k test timestamps. Each sample includes a numeric series segment, contextual metadata, a line-chart image, and a caption. A key contribution of this work is the scalable pipeline used to generate reference captions: while most references are produced by an oracle LLM and verified through factual checks, human indistinguishability studies, and diversity analyses, we also provide a human-revisited subset of 579 test captions, refined from LLM outputs to ensure accuracy and human-like style. Beyond captioning, CaTS-Bench offers 460 multiple-choice questions targeting deeper aspects of time series reasoning. We further propose new tailored evaluation metrics and benchmark leading VLMs, highlighting both their strengths and persistent limitations. Together, these contributions establish CaTS-Bench and its captioning pipeline as a reliable and extensible foundation for future research at the intersection of time series analysis and foundation models.

[404] Explaining Grokking and Information Bottleneck through Neural Collapse Emergence

Keitaro Sakamoto, Issei Sato

Main category: cs.LG

TL;DR: This paper provides a unified explanation for late-phase training phenomena like grokking and information bottleneck through the lens of neural collapse, showing that contraction of within-class variance is the key underlying factor.

DetailsMotivation: To understand the mechanisms behind puzzling training dynamics in deep neural networks, particularly grokking (abrupt test improvement after training loss plateaus) and information bottleneck (progressive discarding of irrelevant input information), which currently lack unified explanations.

Method: The authors analyze training dynamics through neural collapse theory, specifically focusing on the contraction of population within-class variance. They relate this measure to neural collapse measures on training sets and examine the distinct time scales between training set fitting and neural collapse progression.
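
The tracked quantity is straightforward to compute. Below is a minimal sketch of a within-class variance measure on penultimate-layer features; logged over training, its late-phase contraction is the signal the paper associates with grokking and compression. The function name and toy data are our own.

```python
import numpy as np

def within_class_variance(features, labels):
    """Mean squared distance of each feature vector to its class mean."""
    total = 0.0
    for c in np.unique(labels):
        fc = features[labels == c]
        total += np.sum((fc - fc.mean(axis=0)) ** 2)
    return total / len(labels)

# Logged across training, a late-phase drop in this measure is what the
# paper links to grokking and information-bottleneck compression.
feats = np.random.randn(1000, 128)            # toy penultimate-layer features
labels = np.random.randint(0, 10, size=1000)
print(within_class_variance(feats, labels))
```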

Result: The analysis reveals that the contraction of within-class variance is the common factor driving both grokking and information bottleneck phenomena. The different time scales between fitting and neural collapse progression account for the observed late-phase behaviors.

Conclusion: Neural collapse provides a unified framework to explain late-phase training phenomena, with within-class variance contraction being the key mechanism. The theoretical findings are validated across multiple datasets and architectures.

Abstract: The training dynamics of deep neural networks often defy expectations, even as these models form the foundation of modern machine learning. Two prominent examples are grokking, where test performance improves abruptly long after the training loss has plateaued, and the information bottleneck principle, where models progressively discard input information irrelevant to the prediction task as training proceeds. However, the mechanisms underlying these phenomena and their relations remain poorly understood. In this work, we present a unified explanation of such late-phase phenomena through the lens of neural collapse, which characterizes the geometry of learned representations. We show that the contraction of population within-class variance is a key factor underlying both grokking and information bottleneck, and relate this measure to the neural collapse measure defined on the training set. By analyzing the dynamics of neural collapse, we show that distinct time scales between fitting the training set and the progression of neural collapse account for the behavior of the late-phase phenomena. Finally, we validate our theoretical findings on multiple datasets and architectures.

[405] Shaping Initial State Prevents Modality Competition in Multi-modal Fusion: A Two-stage Scheduling Framework via Fast Partial Information Decomposition

Jiaqi Tang, Yinsong Xu, Yang Liu, Qingchao Chen

Main category: cs.LG

TL;DR: A two-stage training framework that shapes initial model states through unimodal training to prevent modality competition in multi-modal fusion, using Effective Competitive Strength (ECS) theory and FastPID for fine-grained modality analysis.

DetailsMotivation: Multi-modal fusion suffers from modality competition where one modality dominates learning, leaving others under-optimized. Existing methods overlook the critical impact of the model's initial state.

Method: Proposes a two-stage framework: 1) unimodal training to shape initial states, 2) joint training. Introduces Effective Competitive Strength (ECS) theory and FastPID - a differentiable solver for partial information decomposition that measures modality uniqueness, redundancy, and synergy.
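
As a rough illustration of the schedule (not the authors' code), the toy skeleton below warms up each encoder unimodally and starts joint training when a synergy estimate peaks; `estimate_synergy`, the encoders, and all sizes are hypothetical stand-ins for FastPID and the real architecture.

```python
import torch, torch.nn as nn

torch.manual_seed(0)
enc_a, enc_b = nn.Linear(8, 4), nn.Linear(8, 4)   # one encoder per modality
head = nn.Linear(8, 2)                            # fusion head on concatenated features
x_a, x_b = torch.randn(64, 8), torch.randn(64, 8)
y = torch.randint(0, 2, (64,))

def estimate_synergy(step):           # hypothetical stand-in for FastPID
    return 1.0 - abs(step - 5) / 5.0  # peaks at step 5 in this toy

# Stage 1: unimodal training shapes each encoder's initial state.
for enc, x in [(enc_a, x_a), (enc_b, x_b)]:
    probe = nn.Linear(4, 2)           # throwaway unimodal classifier
    opt = torch.optim.Adam([*enc.parameters(), *probe.parameters()], lr=1e-2)
    for _ in range(20):
        loss = nn.functional.cross_entropy(probe(enc(x)), y)
        opt.zero_grad(); loss.backward(); opt.step()

# Controller: start joint training at the peak of the synergy estimate.
step = 0
while estimate_synergy(step + 1) > estimate_synergy(step):
    step += 1

# Stage 2: joint training from the shaped initial state.
opt = torch.optim.Adam([*enc_a.parameters(), *enc_b.parameters(),
                        *head.parameters()], lr=1e-2)
for _ in range(20):
    z = torch.cat([enc_a(x_a), enc_b(x_b)], dim=-1)
    loss = nn.functional.cross_entropy(head(z), y)
    opt.zero_grad(); loss.backward(); opt.step()
```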

Result: Experiments on diverse benchmarks demonstrate state-of-the-art performance. The method reliably unlocks synergistic multi-modal fusion by easing competition before it starts.

Conclusion: Shaping pre-fusion models’ initial state is a powerful strategy that prevents modality competition, achieving provably tighter error bounds and superior multi-modal fusion performance.

Abstract: Multi-modal fusion often suffers from modality competition during joint training, where one modality dominates the learning process, leaving others under-optimized. Most existing methods overlook the critical impact of the model's initial state and address this issue only during the joint learning stage. In this study, we introduce a two-stage training framework that shapes the initial states through unimodal training before the joint training. First, we propose the concept of Effective Competitive Strength (ECS) to quantify a modality's competitive strength. Our theoretical analysis further reveals that properly shaping the initial ECS by unimodal training achieves a provably tighter error bound. However, ECS is computationally intractable in deep neural networks. To bridge this gap, we develop a framework comprising two core components: a fine-grained computable diagnostic metric and an asynchronous training controller. For the metric, we first prove that mutual information (MI) is a principled proxy for ECS. Considering that MI is induced by per-modality marginals and thus treats each modality in isolation, we further propose FastPID, a computationally efficient and differentiable solver for partial information decomposition, which decomposes the joint distribution's information into fine-grained measurements: modality-specific uniqueness, redundancy, and synergy. Guided by these measurements, our asynchronous controller dynamically balances modalities by monitoring uniqueness and locates the ideal initial state for joint training by tracking peak synergy. Experiments on diverse benchmarks demonstrate that our method achieves state-of-the-art performance. Our work establishes that shaping the pre-fusion models' initial state is a powerful strategy that eases competition before it starts, reliably unlocking synergistic multi-modal fusion.

[406] Robust Multi-Omics Integration from Incomplete Modalities Significantly Improves Prediction of Alzheimer’s Disease

Sungjoon Park, Kyungwook Lee, Soorin Yim, Doyeong Hwang, Dongyun Kim, Soonyoung Lee, Amy Dunn, Daniel Gatti, Elissa Chesler, Kristen O’Connell, Kiyoung Kim

Main category: cs.LG

TL;DR: MOIRA is an early integration method for multi-omics data that handles missing modalities through representation alignment and adaptive aggregation, achieving superior performance on Alzheimer’s Disease prediction.

DetailsMotivation: Missing modalities in multi-omics data hinder integrative analysis across heterogeneous omics datasets, limiting insights into complex biomolecular interactions in metabolism and disease.

Method: MOIRA projects each omics dataset onto a shared embedding space and uses a learnable weighting mechanism to fuse them, enabling robust learning from incomplete data via representation alignment and adaptive aggregation.
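
A minimal sketch of this fusion pattern, assuming per-modality linear projections and a softmax over learnable weights that is renormalized over whichever modalities are present; all layer names and sizes are illustrative, not MOIRA's actual architecture.

```python
import torch, torch.nn as nn

class WeightedFusion(nn.Module):
    def __init__(self, in_dims, d_shared=64):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_shared) for d in in_dims])
        self.w = nn.Parameter(torch.zeros(len(in_dims)))  # learnable weights

    def forward(self, xs):
        # xs: list of tensors, with None marking a missing modality
        avail = [i for i, x in enumerate(xs) if x is not None]
        z = torch.stack([self.proj[i](xs[i]) for i in avail], dim=0)
        a = torch.softmax(self.w[avail], dim=0)  # renormalize over present ones
        return (a[:, None, None] * z).sum(dim=0)

fusion = WeightedFusion([100, 50, 30])
# Second omics modality missing for this batch; the sample is still usable.
out = fusion([torch.randn(4, 100), None, torch.randn(4, 30)])
print(out.shape)  # torch.Size([4, 64])
```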

Result: Evaluated on the ROSMAP dataset for Alzheimer’s Disease, MOIRA outperformed existing approaches, with ablation studies confirming modality-wise contributions and feature importance analysis revealing AD-related biomarkers consistent with prior literature.

Conclusion: MOIRA provides a biologically relevant approach for multi-omics integration that effectively handles missing data and reveals meaningful disease biomarkers.

Abstract: Multi-omics data capture complex biomolecular interactions and provide insights into metabolism and disease. However, missing modalities hinder integrative analysis across heterogeneous omics. To address this, we present MOIRA (Multi-Omics Integration with Robustness to Absent modalities), an early integration method enabling robust learning from incomplete omics data via representation alignment and adaptive aggregation. MOIRA leverages all samples, including those with missing modalities, by projecting each omics dataset onto a shared embedding space where a learnable weighting mechanism fuses them. Evaluated on the Religious Order Study and Memory and Aging Project (ROSMAP) dataset for Alzheimer’s Disease (AD), MOIRA outperformed existing approaches, and further ablation studies confirmed modality-wise contributions. Feature importance analysis revealed AD-related biomarkers consistent with prior literature, highlighting the biological relevance of our approach.

[407] Causal Time Series Generation via Diffusion Models

Yutong Xia, Chang Xu, Yuxuan Liang, Qingsong Wen, Roger Zimmermann, Jiang Bian

Main category: cs.LG

TL;DR: This paper introduces causal time series generation (CaTSG) as a new task family that extends traditional conditional TSG to include interventional and counterfactual settings using causal reasoning.

DetailsMotivation: Current conditional time series generation models only learn observational correlations without considering unobserved confounding, limiting their reliability for simulation under interventions and counterfactual scenarios.

Method: The authors develop CaTSG, a unified diffusion-based framework with backdoor-adjusted guidance that causally steers sampling using causal score functions derived via backdoor adjustment and the abduction-action-prediction procedure.

Result: Extensive experiments on synthetic and real-world datasets show that CaTSG achieves superior fidelity and supports interventional and counterfactual generation that existing baselines cannot handle.

Conclusion: The paper proposes the causal TSG family and provides an initial proof-of-concept with CaTSG, opening a promising direction toward more reliable simulation under interventions and counterfactual generation.

Abstract: Time series generation (TSG) synthesizes realistic sequences and has achieved remarkable success. Among TSG methods, conditional models generate sequences given observed covariates; however, such models learn observational correlations without considering unobserved confounding. In this work, we propose a causal perspective on conditional TSG and introduce causal time series generation as a new TSG task family, formalized within Pearl's causal ladder, extending beyond observational generation to include interventional and counterfactual settings. To instantiate these tasks, we develop CaTSG, a unified diffusion-based framework with backdoor-adjusted guidance that causally steers sampling toward desired interventions and individual counterfactuals while preserving observational fidelity. Specifically, our method derives causal score functions via backdoor adjustment and the abduction-action-prediction procedure, thus enabling principled support for all three levels of TSG. Extensive experiments on both synthetic and real-world datasets show that CaTSG achieves superior fidelity while also supporting interventional and counterfactual generation that existing baselines cannot handle. Overall, we propose the causal TSG family and instantiate it with CaTSG, providing an initial proof-of-concept and opening a promising direction toward more reliable simulation under interventions and counterfactual generation.

[408] FHRFormer: A Self-supervised Transformer Approach for Fetal Heart Rate Inpainting and Forecasting

Kjersti Engan, Neel Kanwal, Anita Yeconia, Ladislaus Blacy, Yuda Munyaw, Estomih Mduma, Hege Ersdal

Main category: cs.LG

TL;DR: A masked transformer-based autoencoder approach to reconstruct missing fetal heart rate (FHR) signals caused by sensor displacement during maternal movement, enabling better AI-based risk prediction for neonatal breathing assistance needs.

DetailsMotivation: Approximately 10% of newborns need breathing assistance at birth, and FHR monitoring is crucial for fetal well-being assessment. However, wearable FHR monitors suffer from signal dropouts due to maternal movement, limiting AI-based analysis and risk prediction capabilities.

Method: Proposes a masked transformer-based autoencoder that captures both spatial and frequency components of FHR data to reconstruct missing signals. The method handles varying durations of missing data and supports both signal inpainting and forecasting.
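
The core training signal can be sketched as follows: mask a fraction of timestamps, encode with a transformer, and regress the held-out values. Positional encodings and the frequency branch are omitted for brevity, and all sizes below are toy assumptions, not the paper's configuration.

```python
import torch, torch.nn as nn

d, T = 64, 240                               # embed dim; 4 min at 1 Hz (toy)
embed, head = nn.Linear(1, d), nn.Linear(d, 1)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)

fhr = 140 + 5 * torch.randn(8, T, 1)         # batch of FHR traces (bpm)
mask = torch.rand(8, T) < 0.3                # 30% of timestamps "dropped out"
x = fhr.masked_fill(mask.unsqueeze(-1), 0.0) # zero out the masked samples

rec = head(encoder(embed(x)))                # reconstruct the full sequence
loss = ((rec - fhr)[mask] ** 2).mean()       # penalize only masked positions
loss.backward()
```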

Result: The approach demonstrates robustness across different missing data durations and can be used for reconstructing FHR signals with preserved spectral characteristics, overcoming limitations of traditional interpolation methods.

Conclusion: The method can be applied retrospectively to research datasets for AI-based risk algorithm development and potentially integrated into wearable FHR monitoring devices for earlier and more robust detection of neonatal breathing assistance risks.

Abstract: Approximately 10% of newborns require assistance to initiate breathing at birth, and around 5% need ventilation support. Fetal heart rate (FHR) monitoring plays a crucial role in assessing fetal well-being during prenatal care, enabling the detection of abnormal patterns and supporting timely obstetric interventions to mitigate fetal risks during labor. Applying artificial intelligence (AI) methods to analyze large datasets of continuous FHR monitoring episodes with diverse outcomes may offer novel insights into predicting the risk of needing breathing assistance or interventions. Recent advances in wearable FHR monitors have enabled continuous fetal monitoring without compromising maternal mobility. However, sensor displacement during maternal movement, as well as changes in fetal or maternal position, often lead to signal dropouts, resulting in gaps in the recorded FHR data. Such missing data limits the extraction of meaningful insights and complicates automated (AI-based) analysis. Traditional approaches to handle missing data, such as simple interpolation techniques, often fail to preserve the spectral characteristics of the signals. In this paper, we propose a masked transformer-based autoencoder approach to reconstruct missing FHR signals by capturing both spatial and frequency components of the data. The proposed method demonstrates robustness across varying durations of missing data and can be used for signal inpainting and forecasting. The proposed approach can be applied retrospectively to research datasets to support the development of AI-based risk algorithms. In the future, the proposed method could be integrated into wearable FHR monitoring devices to achieve earlier and more robust risk detection.

[409] Federated Markov Imputation: Privacy-Preserving Temporal Imputation in Multi-Centric ICU Environments

Christoph Düsing, Philipp Cimiano

Main category: cs.LG

TL;DR: Federated Markov Imputation (FMI) is a privacy-preserving method that enables ICUs to collaboratively build global transition models for temporal imputation in federated learning on EHR data with varying temporal granularities.

DetailsMotivation: Missing data is a persistent challenge in federated learning on electronic health records, particularly when institutions collect time-series data at varying temporal granularities.

Method: Proposes Federated Markov Imputation (FMI), which enables Intensive Care Units (ICUs) to collaboratively build global transition models for temporal imputation while preserving privacy.
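
A minimal sketch of the idea, assuming vitals are discretized into a small number of states: each ICU shares only its local transition counts, and the server row-normalizes their sum into a global transition matrix used to fill gaps. The three-state discretization and argmax imputation rule are illustrative choices.

```python
import numpy as np

def local_counts(series_list, n_states=3):
    """Per-ICU Markov transition counts; only these leave the site."""
    C = np.zeros((n_states, n_states))
    for s in series_list:                      # s: 1-D array of state indices
        for a, b in zip(s[:-1], s[1:]):
            C[a, b] += 1
    return C

# Server: aggregate counts from all ICUs, then row-normalize.
icu_data = [[np.array([0, 1, 1, 2])], [np.array([2, 2, 1, 0, 0])]]
C_global = sum(local_counts(icu) for icu in icu_data)
P = C_global / C_global.sum(axis=1, keepdims=True)

def impute_next(state):
    return int(np.argmax(P[state]))            # most likely next state
print(impute_next(1))
```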

Result: FMI outperforms local imputation baselines on a real-world sepsis onset prediction task using the MIMIC-IV dataset, especially in scenarios with irregular sampling intervals across ICUs.

Conclusion: FMI effectively addresses missing data challenges in federated EHR learning by enabling collaborative temporal imputation while maintaining privacy.

Abstract: Missing data is a persistent challenge in federated learning on electronic health records, particularly when institutions collect time-series data at varying temporal granularities. To address this, we propose Federated Markov Imputation (FMI), a privacy-preserving method that enables Intensive Care Units (ICUs) to collaboratively build global transition models for temporal imputation. We evaluate FMI on a real-world sepsis onset prediction task using the MIMIC-IV dataset and show that it outperforms local imputation baselines, especially in scenarios with irregular sampling intervals across ICUs.

[410] StyleBench: Evaluating thinking styles in Large Language Models

Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, Javad Lavaei

Main category: cs.LG

TL;DR: StyleBench is a comprehensive benchmark evaluating 5 reasoning styles across 15 models and 5 reasoning tasks, revealing that no single reasoning style is universally optimal and strategy effectiveness depends on model scale and task type.

DetailsMotivation: The effectiveness of LLMs is heavily influenced by reasoning strategies, but the interplay between reasoning styles, model architecture, and task type remains poorly understood.

Method: Introduced StyleBench to systematically evaluate 5 reasoning styles (CoT, ToT, AoT, SoT, CoD) on 5 reasoning tasks using 15 open-source models from major families ranging from 270M to 120B parameters.

Result: No single reasoning style is universally optimal. Search-based methods (AoT, ToT) excel in open-ended problems but require large models, while concise styles (SoT, CoD) achieve efficiency gains on well-defined tasks. Smaller models often fail to follow instructions and default to guessing.

Conclusion: Reasoning strategy effectiveness is highly contingent on model scale and task type, offering a roadmap for selecting optimal strategies based on specific constraints.

Abstract: The effectiveness of Large Language Models (LLMs) is heavily influenced by the reasoning strategies, or styles of thought, employed in their prompts. However, the interplay between these reasoning styles, model architecture, and task type remains poorly understood. To address this, we introduce StyleBench, a comprehensive benchmark for systematically evaluating reasoning styles across diverse tasks and models. We assess five representative reasoning styles, including Chain of Thought (CoT), Tree of Thought (ToT), Algorithm of Thought (AoT), Sketch of Thought (SoT), and Chain-of-Draft (CoD) on five reasoning tasks, using 15 open-source models from major families (LLaMA, Qwen, Mistral, Gemma, GPT-OSS, Phi, and DeepSeek) ranging from 270M to 120B parameters. Our large-scale analysis reveals that no single style is universally optimal. We demonstrate that strategy efficacy is highly contingent on both model scale and task type: search-based methods (AoT, ToT) excel in open-ended problems but require large-scale models, while concise styles (SoT, CoD) achieve radical efficiency gains on well-defined tasks. Furthermore, we identify key behavioral patterns: smaller models frequently fail to follow output instructions and default to guessing, while reasoning robustness emerges as a function of scale. Our findings offer a crucial roadmap for selecting optimal reasoning strategies based on specific constraints; we open-source the benchmark at https://github.com/JamesJunyuGuo/Style_Bench.

[411] Model-Based Reinforcement Learning under Random Observation Delays

Armin Karamzade, Kyungmin Kim, JB Lanier, Davide Corsi, Roy Fox

Main category: cs.LG

TL;DR: This paper addresses random sensor delays in POMDPs where observations arrive out-of-sequence, proposing a model-based filtering approach that outperforms naive methods and delay-aware MDP baselines.

DetailsMotivation: Real-world environments frequently have delays, but standard RL assumes instantaneous perception. Random sensor delays in POMDPs with out-of-sequence observations haven't been properly addressed in RL literature.

Method: Proposes a model-based filtering process that sequentially updates belief states from incoming observation streams, incorporated into a delay-aware framework for model-based RL (applied to Dreamer).
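
The re-filtering idea can be sketched with a toy discrete belief filter standing in for the learned dynamics model: keep a short history of posteriors, and when an observation stamped with time t arrives late, roll back to the belief at t and replay the updates. The transition and likelihood matrices below are illustrative.

```python
import numpy as np

P = np.array([[0.9, 0.1], [0.2, 0.8]])   # latent transition model
L = np.array([[0.8, 0.2], [0.3, 0.7]])   # observation likelihoods L[obs, state]

def step(belief, obs=None):
    b = belief @ P                        # predict
    if obs is not None:
        b = b * L[obs]                    # correct with an observation
    return b / b.sum()

history, obs_log = [np.array([0.5, 0.5])], {}
for t in range(1, 6):                     # filter forward with no observations
    history.append(step(history[-1], obs_log.get(t)))

# A delayed observation for t=3 arrives at t=5: roll back and re-filter.
obs_log[3] = 1
belief = history[2]                       # posterior at time 2
for t in range(3, 6):
    belief = step(belief, obs_log.get(t))
print(belief)
```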

Result: The method consistently outperforms delay-aware MDP baselines, shows robustness to delay distribution shifts during deployment, and performs better than common practical heuristics in simulated robotic tasks.

Conclusion: Explicitly modeling observation delays is crucial, and the proposed framework effectively handles random delays in POMDP settings where naive approaches like observation stacking fail.

Abstract: Delays frequently occur in real-world environments, yet standard reinforcement learning (RL) algorithms often assume instantaneous perception of the environment. We study random sensor delays in POMDPs, where observations may arrive out-of-sequence, a setting that has not been previously addressed in RL. We analyze the structure of such delays and demonstrate that naive approaches, such as stacking past observations, are insufficient for reliable performance. To address this, we propose a model-based filtering process that sequentially updates the belief state based on an incoming stream of observations. We then introduce a simple delay-aware framework that incorporates this idea into model-based RL, enabling agents to effectively handle random delays. Applying this framework to Dreamer, we compare our approach to delay-aware baselines developed for MDPs. Our method consistently outperforms these baselines and demonstrates robustness to delay distribution shifts during deployment. Additionally, we present experiments on simulated robotic tasks, comparing our method to common practical heuristics and emphasizing the importance of explicitly modeling observation delays.

[412] Distribution-Controlled Client Selection to Improve Federated Learning Strategies

Christoph Düsing, Philipp Cimiano

Main category: cs.LG

TL;DR: Proposes a distribution-controlled client selection method for federated learning that aligns label distributions with target distributions to address data imbalance issues.

DetailsMotivation: Data imbalance among clients in federated learning degrades model performance, requiring effective client selection methods to mitigate this problem.

Method: Extends existing FL strategies by selecting active clients that align the current label distribution with either a balanced distribution or the federation's combined label distribution.
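
A minimal sketch of one plausible reading of this rule (the paper's exact procedure may differ): greedily add the client whose label histogram brings the aggregate distribution closest, in L1 distance, to the chosen target.

```python
import numpy as np

def select_clients(client_hists, target, k):
    chosen, agg = [], np.zeros_like(target)
    for _ in range(k):
        def dist(i):                          # L1 gap to target after adding i
            h = agg + client_hists[i]
            return np.abs(h / h.sum() - target).sum()
        best = min((i for i in range(len(client_hists)) if i not in chosen),
                   key=dist)
        chosen.append(best)
        agg = agg + client_hists[best]
    return chosen

hists = np.array([[90, 10], [10, 90], [50, 50], [80, 20]], dtype=float)
uniform = np.array([0.5, 0.5])                # balanced target
combined = hists.sum(axis=0) / hists.sum()    # federation's combined target
print(select_clients(hists, uniform, k=2))    # -> [2, 3]
```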

Result: Empirical verification on three FL strategies and two datasets shows that aligning with a balanced distribution works best under local imbalance, while aligning with the federation's combined distribution is superior under global imbalance.

Conclusion: The proposed distribution-controlled client selection effectively addresses different types of data imbalance in federated learning, with specific target distributions working best for different imbalance scenarios.

Abstract: Federated learning (FL) is a distributed learning paradigm that allows multiple clients to jointly train a shared model while maintaining data privacy. Despite its great potential for domains with strict data privacy requirements, the presence of data imbalance among clients is a threat to the success of FL, as it causes the performance of the shared model to decrease. To address this, various studies have proposed enhancements to existing FL strategies, particularly through client selection methods that mitigate the detrimental effects of data imbalance. In this paper, we propose an extension to existing FL strategies, which selects active clients that best align the current label distribution with one of two target distributions, namely a balanced distribution or the federation's combined label distribution. Subsequently, we empirically verify the improvements through our distribution-controlled client selection on three common FL strategies and two datasets. Our results show that while aligning the label distribution with a balanced distribution yields the greatest improvements in the face of local imbalance, alignment with the federation's combined label distribution is superior for global imbalance.

[413] Improving Early Sepsis Onset Prediction Through Federated Learning

Christoph Düsing, Philipp Cimiano

Main category: cs.LG

TL;DR: A federated learning approach using attention-enhanced LSTM for sepsis onset prediction that supports variable prediction windows and improves early detection while preserving privacy.

DetailsMotivation: Early sepsis prediction is challenging due to limited training data in individual ICUs. Federated learning enables collaborative training without data sharing to address data scarcity and privacy concerns.

Method: Proposed a federated, attention-enhanced LSTM model trained on multi-centric ICU data with variable prediction horizons instead of fixed windows.

Result: FL approach improves overall prediction performance (approaching centralized model performance) and is particularly beneficial for early sepsis detection with large prediction windows.

Conclusion: Variable prediction window approach reduces computational and organizational overhead without significant performance loss, making federated learning effective for early sepsis prediction while preserving privacy.

Abstract: Early and accurate prediction of sepsis onset remains a major challenge in intensive care, where timely detection and subsequent intervention can significantly improve patient outcomes. While machine learning models have shown promise in this domain, their success is often limited by the amount and diversity of training data available to individual hospitals and Intensive Care Units (ICUs). Federated Learning (FL) addresses this issue by enabling collaborative model training across institutions without requiring data sharing, thus preserving patient privacy. In this work, we propose a federated, attention-enhanced Long Short-Term Memory model for sepsis onset prediction, trained on multi-centric ICU data. Unlike existing approaches that rely on fixed prediction windows, our model supports variable prediction horizons, enabling both short- and long-term forecasting in a single unified model. During analysis, we put particular emphasis on the improvements through our approach in terms of early sepsis detection, i.e., predictions with large prediction windows by conducting an in-depth temporal analysis. Our results prove that using FL does not merely improve overall prediction performance (with performance approaching that of a centralized model), but is particularly beneficial for early sepsis onset prediction. Finally, we show that our choice of employing a variable prediction window rather than a fixed window does not hurt performance significantly but reduces computational, communicational, and organizational overhead.

[414] Deterministic Discrete Denoising

Hideyuki Suzuki, Hiroshi Yamashita

Main category: cs.LG

TL;DR: A deterministic denoising algorithm for discrete-state diffusion models using Markov chains and herding algorithm with weakly chaotic dynamics to replace stochastic denoising.

DetailsMotivation: To create a deterministic alternative to stochastic denoising processes in discrete diffusion models without requiring retraining or continuous state embeddings.

Method: Introduces a variant of the herding algorithm with weakly chaotic dynamics to derandomize the generative reverse process, enabling deterministic discrete state transitions.
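
For intuition, classic herding derandomizes categorical sampling as sketched below: accumulate the target probabilities into a weight vector and deterministically emit the state with the largest weight. The paper's weakly chaotic variant differs in its dynamics; this is the textbook form of the update.

```python
import numpy as np

def herded_step(p, w):
    w = w + p                  # accumulate target probability mass
    s = int(np.argmax(w))      # deterministic "draw"
    w[s] -= 1.0                # subtract the emitted unit of mass
    return s, w

p = np.array([0.5, 0.3, 0.2])  # per-step categorical target
w = np.zeros(3)
draws = []
for _ in range(10):
    s, w = herded_step(p, w)
    draws.append(s)
print(draws)  # empirical frequencies track p, with no randomness
```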

Result: Demonstrated consistent improvements in both efficiency and sample quality on text and image generation tasks.

Conclusion: Deterministic reverse processes can be effective in discrete state spaces, enhancing the significance of discrete diffusion in generative modeling.

Abstract: We propose a deterministic denoising algorithm for discrete-state diffusion models based on Markov chains. The generative reverse process is derandomized by introducing a variant of the herding algorithm with weakly chaotic dynamics, which induces deterministic discrete state transitions. Our approach is a direct replacement for the stochastic denoising process, requiring neither retraining nor continuous state embeddings. We demonstrate consistent improvements in both efficiency and sample quality on text and image generation tasks. Thus, this simple derandomization approach is expected to enhance the significance of discrete diffusion in generative modeling. Furthermore, our results reveal that deterministic reverse processes, well established in continuous diffusion, can also be effective in discrete state spaces.

[415] Deep Learning for Crime Forecasting: The Role of Mobility at Fine-grained Spatiotemporal Scales

Ariadna Albors Zumel, Michele Tizzoni, Gian Maria Campedelli

Main category: cs.LG

TL;DR: A deep learning framework using ConvLSTM with mobility, crime, and sociodemographic data improves crime forecasting at fine spatiotemporal resolutions across four US cities.

DetailsMotivation: To evaluate if incorporating micro-level mobility features alongside historical crime and sociodemographic data enhances predictive performance in crime forecasting at fine-grained spatial and temporal resolutions.

Method: Used crime incident data, sociodemographic data from American Community Survey, and human mobility data from Advan (2019-2023) aggregated into 0.077 sq. mile grids. Trained a ConvLSTM network to predict crime occurrences 12 hours ahead using 14-day and 2-day input sequences, comparing against logistic regression, random forest, and standard LSTM baselines.
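
For reference, here is a compact ConvLSTM cell of the kind used in such models, consuming per-timestep grids of shape (batch, channels, H, W); the channel counts and grid size below are toy values, not the study's configuration.

```python
import torch, torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution produces all four gates at once.
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.conv(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        c = f * c + i * g
        h = o * c.tanh()
        return h, c

cell = ConvLSTMCell(in_ch=4, hid_ch=16)        # e.g. crime + mobility channels
h = c = torch.zeros(2, 16, 20, 20)
for t in range(14):                            # 14-day input sequence
    h, c = cell(torch.randn(2, 4, 20, 20), (h, c))
risk = torch.sigmoid(nn.Conv2d(16, 1, 1)(h))   # per-cell crime probability map
```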

Result: Mobility features improved predictive performance, especially with shorter input sequences. Best results achieved when combining mobility and sociodemographic features, with the deep learning model achieving highest recall, precision, and F1 scores in all four cities. Longer sequences better for violent crimes, shorter sequences better for property crimes.

Conclusion: Integrating diverse data sources including mobility is crucial for spatiotemporal crime forecasting. Deep learning shows advantages (and limits) when dealing with fine-grained spatial and temporal scales.

Abstract: Objectives: To develop a deep learning framework to evaluate if and how incorporating micro-level mobility features, alongside historical crime and sociodemographic data, enhances predictive performance in crime forecasting at fine-grained spatial and temporal resolutions. Methods: We advance the literature on computational methods and crime forecasting by focusing on four U.S. cities (i.e., Baltimore, Chicago, Los Angeles, and Philadelphia). We employ crime incident data obtained from each city's police department, combined with sociodemographic data from the American Community Survey and human mobility data from Advan, collected from 2019 to 2023. This data is aggregated into grids with equally sized cells of 0.077 sq. miles (0.2 sq. km) and used to train our deep learning forecasting model, a Convolutional Long Short-Term Memory (ConvLSTM) network, which predicts crime occurrences 12 hours ahead using 14-day and 2-day input sequences. We also compare its performance against three baseline models: logistic regression, random forest, and standard LSTM. Results: Incorporating mobility features improves predictive performance, especially when using shorter input sequences. Notably, however, the best results are obtained when both mobility and sociodemographic features are used together, with our deep learning model achieving the highest recall, precision, and F1 score in all four cities, outperforming alternative methods. With this configuration, longer input sequences enhance predictions for violent crimes, while shorter sequences are more effective for property crimes. Conclusion: These findings underscore the importance of integrating diverse data sources for spatiotemporal crime forecasting, mobility included. They also highlight the advantages (and limits) of deep learning when dealing with fine-grained spatial and temporal scales.

[416] Energy saving in off-road vehicles using leakage compensation technique

Gyan Wrat, J. Das

Main category: cs.LG

TL;DR: The paper compares two hydraulic circuits for linear actuators in heavy equipment, finding that a proportional flow control valve (PFCV) with artificial leakage is 8.5% more energy efficient than conventional proportional directional control valve (PDCV) systems.

DetailsMotivation: To improve energy efficiency of linear actuators in heavy earth moving equipment by reducing energy loss and heat generation, thereby lowering environmental impact and operating costs.

Method: Compared two hydraulic circuits: conventional PDCV vs innovative PFCV with artificial leakage. Used MATLAB/Simulink simulation and experimental validation with PID controller tuned by fuzzy controller for position control.

Result: The PFCV circuit achieved 8.5% higher energy efficiency than the conventional PDCV circuit by bypassing extra pump flow during position control instead of using pressure relief valves.

Conclusion: The proposed PFCV approach can significantly improve energy efficiency in heavy equipment linear actuators, reducing both environmental impact and operating costs.

Abstract: The article focuses on enhancing the energy efficiency of linear actuators used in heavy earth moving equipment, particularly in the booms of excavation equipment. Two hydraulic circuits are compared in terms of energy efficiency, with one using a conventional proportional directional control valve (PDCV) and the other using an innovative solution of a proportional flow control valve (PFCV) with artificial leakage between the two ends of the actuator. The PFCV reduces energy loss in the form of heat by bypassing the extra flow from the pump during position control, unlike the PDCV that uses a pressure relief valve. The hydraulic circuit using PFCV is found to be 8.5% more energy efficient than the conventional circuit using PDCV. The article also discusses the position control of the actuator, which is achieved using a PID controller tuned by a fuzzy controller. The simulation of the hydraulic circuit is carried out using MATLAB/Simulink, and the results are compared with experiments. Overall, the proposed approach could lead to significant improvements in the energy efficiency of linear actuators used in heavy earth moving equipment, thereby reducing their environmental impact and operating costs.

[417] GenFacts-Generative Counterfactual Explanations for Multi-Variate Time Series

Sarah Seifi, Anass Ibrahimi, Tobias Sukianto, Cecilia Carbonelli, Lorenzo Servadei, Robert Wille

Main category: cs.LG

TL;DR: GenFacts is a generative framework for creating plausible and interpretable counterfactual explanations for multivariate time series data, using a class-discriminative variational autoencoder with contrastive learning and realism constraints.

DetailsMotivation: Existing counterfactual explanation methods for multivariate time series often produce invalid, implausible, or unintuitive results, limiting their practical usefulness for model transparency.

Method: GenFacts uses a class-discriminative variational autoencoder integrated with contrastive and classification-consistency objectives, prototype-based initialization, and realism-constrained optimization to generate counterfactuals.

Result: GenFacts outperforms state-of-the-art baselines by 18.7% in plausibility and achieves the highest interpretability scores in human evaluations on both radar gesture data and handwritten letter trajectory datasets.

Conclusion: Plausibility and user-centered interpretability are more important than sparsity alone for creating actionable counterfactual explanations in time series data.

Abstract: Counterfactual explanations aim to enhance model transparency by showing how inputs can be minimally altered to change predictions. For multivariate time series, existing methods often generate counterfactuals that are invalid, implausible, or unintuitive. We introduce GenFacts, a generative framework based on a class-discriminative variational autoencoder. It integrates contrastive and classification-consistency objectives, prototype-based initialization, and realism-constrained optimization. We evaluate GenFacts on radar gesture data as an industrial use case and handwritten letter trajectories as an intuitive benchmark. Across both datasets, GenFacts outperforms state-of-the-art baselines in plausibility (+18.7%) and achieves the highest interpretability scores in a human study. These results highlight that plausibility and user-centered interpretability, rather than sparsity alone, are key to actionable counterfactuals in time series data.

[418] Why Attention Fails: The Degeneration of Transformers into MLPs in Time Series Forecasting

Zida Liang, Jiayi Zhu, Weiqiang Sun

Main category: cs.LG

TL;DR: Transformers underperform in time series forecasting due to attention mechanisms degenerating into simple MLPs, caused by improper embedding methods that fail to create well-structured latent spaces.

DetailsMotivation: To understand why transformer-based architectures fail to outperform simple linear baselines in time series forecasting despite their success in NLP and computer vision.

Method: Designed progressive experiments modifying transformers into MLPs, created interpretable datasets to analyze attention mechanisms, and conducted theoretical analysis of embedding methods.

Result: Transformer blocks often degenerate into simple MLPs in time-series transformers, and attention mechanisms don’t function as expected due to improper embedding techniques.

Conclusion: Current embedding methods prevent transformers from operating in well-structured latent spaces, which is the fundamental reason for their poor performance in time series forecasting.

Abstract: Transformer-based architectures achieved high performance in natural language processing and computer vision, yet many studies have shown that they have not demonstrated a clear advantage in time series forecasting and even underperform simple linear baselines in some cases. However, most of these studies have not thoroughly explored the reasons behind the failure of transformers. To better understand time-series transformers (TST), we designed a series of experiments, progressively modifying transformers into MLPs to investigate the impact of the attention mechanism. Surprisingly, transformer blocks often degenerate into simple MLPs in existing time-series transformers. We designed an interpretable dataset to investigate the reasons behind the failure of the attention mechanism and revealed that the attention mechanism is not working in the expected way. We theoretically analyzed the reasons behind this phenomenon, demonstrating that the current embedding methods fail to allow transformers to function in a well-structured latent space, and further analyzed the deeper underlying causes of the failure of the embedding.

[419] Decoupled-Value Attention for Prior-Data Fitted Networks: GP Inference for Physical Equations

Kaustubh Sharma, Simardeep Singh, Parikshit Pareek

Main category: cs.LG

TL;DR: The paper introduces Decoupled-Value Attention (DVA) to improve Prior-data fitted networks (PFNs) for high-dimensional regression tasks, achieving significant performance gains and 80x speedup over Gaussian Process inference.

DetailsMotivation: PFNs offer faster alternatives to Gaussian Process inference but struggle with high-dimensional regression tasks using standard Transformer attention. The goal is to develop an attention mechanism that better mirrors GP properties while maintaining computational efficiency.

Method: Proposed Decoupled-Value Attention (DVA) that computes similarities from inputs only and propagates labels through values, mirroring the Gaussian-process update while remaining kernel-free. Also explored localized attention and CNN-based PFNs as alternatives to Transformer architectures.
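
Since the predictive mean is a similarity-weighted sum of training targets, DVA can be sketched in a few lines: queries and keys come from the inputs only, and labels enter solely through the values. The projections and dimensions below are illustrative assumptions, not the paper's architecture.

```python
import torch, torch.nn as nn

d, n_train, n_test = 32, 128, 16
Wq, Wk = nn.Linear(d, d), nn.Linear(d, d)

x_train, y_train = torch.randn(n_train, d), torch.randn(n_train, 1)
x_test = torch.randn(n_test, d)

# Similarities are computed from inputs alone, mirroring a kernel over x.
scores = Wq(x_test) @ Wk(x_train).T / d ** 0.5
attn = torch.softmax(scores, dim=-1)

# Labels propagate only through the values, like a GP posterior mean.
y_pred = attn @ y_train
print(y_pred.shape)  # torch.Size([16, 1])
```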

Result: DVA significantly improves PFN performance: validation loss reduced by more than 50% in 5D and 10D cases, achieves mean absolute error of 1E-3 for 64-dimensional power flow equations, and provides 80x speedup over exact GP inference.

Conclusion: The attention rule is more crucial than architecture choice for scaling PFNs. DVA successfully bridges the gap between PFNs and GP inference while maintaining computational efficiency, enabling effective high-dimensional regression with significant speed advantages.

Abstract: Prior-data fitted networks (PFNs) are a promising alternative to time-consuming Gaussian Process (GP) inference for creating fast surrogates of physical systems. PFNs reduce the computational burden of GP training by replacing Bayesian inference in GP with a single forward pass of a learned prediction model. However, with standard Transformer attention, PFNs show limited effectiveness on high-dimensional regression tasks. We introduce Decoupled-Value Attention (DVA), motivated by the GP property that the function space is fully characterized by the kernel over inputs and the predictive mean is a weighted sum of training targets. DVA computes similarities from inputs only and propagates labels solely through values. Thus, the proposed DVA mirrors the Gaussian-process update while remaining kernel-free. We demonstrate that the crucial factor for scaling PFNs is the attention rule rather than the architecture itself. Specifically, our results demonstrate that (a) localized attention consistently reduces out-of-sample validation loss in PFNs across different dimensional settings, with validation loss reduced by more than 50% in five- and ten-dimensional cases, and (b) the role of attention is more decisive than the choice of backbone architecture, showing that CNN-based PFNs can perform at par with their Transformer-based counterparts. The proposed PFNs provide 64-dimensional power flow equation approximations with a mean absolute error of the order of 1E-3, while being over 80x faster than exact GP inference.

[420] Flow Matching in the Low-Noise Regime: Pathologies and a Contrastive Remedy

Weili Zeng, Yichao Yan

Main category: cs.LG

TL;DR: Flow matching suffers from instability in low-noise regimes where small input perturbations cause large velocity variations, degrading semantic representations. The paper proposes Local Contrastive Flow (LCF) to address this pathology.

DetailsMotivation: To overcome the fundamental instability in flow matching models when noise levels approach zero, which causes ill-conditioning and degrades semantic representation quality.

Method: Proposes Local Contrastive Flow (LCF) - a hybrid training protocol that uses contrastive feature alignment at small noise levels while retaining standard flow matching at moderate/high noise levels.
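
A sketch of the hybrid objective under stated assumptions: a linear interpolation path where t near 1 means low noise, standard velocity regression elsewhere, and an InfoNCE-style alignment of noisy features to clean features below the threshold. The model, feature head, threshold, and temperature are ours, not the paper's.

```python
import torch, torch.nn as nn, torch.nn.functional as F

class TinyModel(nn.Module):
    def __init__(self, d=16):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d + 1, 64), nn.SiLU())
        self.v_head, self.f_head = nn.Linear(64, d), nn.Linear(64, 8)

    def forward(self, x, t):
        h = self.body(torch.cat([x, t], dim=1))
        return self.v_head(h), F.normalize(self.f_head(h), dim=1)

def lcf_loss(model, x1, t_thresh=0.9, tau=0.1):
    x0 = torch.randn_like(x1)                  # noise endpoint
    t = torch.rand(x1.size(0), 1)
    xt = (1 - t) * x0 + t * x1                 # linear path; t ~ 1 is low noise
    v_pred, feat = model(xt, t)
    low = t.squeeze(1) > t_thresh
    loss = xt.new_zeros(())
    if (~low).any():                           # standard flow matching
        loss = loss + F.mse_loss(v_pred[~low], (x1 - x0)[~low])
    if low.any():                              # contrastive alignment instead
        _, feat_clean = model(x1, torch.ones_like(t))
        logits = feat[low] @ feat_clean[low].T / tau
        loss = loss + F.cross_entropy(logits, torch.arange(int(low.sum())))
    return loss

print(lcf_loss(TinyModel(), torch.randn(32, 16)))
```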

Result: LCF improves convergence speed and stabilizes representation quality compared to standard flow matching approaches.

Conclusion: Addressing low-noise pathologies is critical for unlocking flow matching’s full potential in both generation and representation learning.

Abstract: Flow matching has recently emerged as a powerful alternative to diffusion models, providing a continuous-time formulation for generative modeling and representation learning. Yet, we show that this framework suffers from a fundamental instability in the low-noise regime. As noise levels approach zero, arbitrarily small perturbations in the input can induce large variations in the velocity target, causing the condition number of the learning problem to diverge. This ill-conditioning not only slows optimization but also forces the encoder to reallocate its limited Jacobian capacity toward noise directions, thereby degrading semantic representations. We provide the first theoretical analysis of this phenomenon, which we term the low-noise pathology, establishing its intrinsic link to the structure of the flow matching objective. Building on these insights, we propose Local Contrastive Flow (LCF), a hybrid training protocol that replaces direct velocity regression with contrastive feature alignment at small noise levels, while retaining standard flow matching at moderate and high noise. Empirically, LCF not only improves convergence speed but also stabilizes representation quality. Our findings highlight the critical importance of addressing low-noise pathologies to unlock the full potential of flow matching for both generation and representation learning.

[421] Alignment Unlocks Complementarity: A Framework for Multiview Circuit Representation Learning

Zhengyuan Shi, Jingxin Wang, Wentao Jiang, Chengyu Ma, Ziyang Zheng, Zhufei Chu, Weikang Qian, Qiang Xu

Main category: cs.LG

TL;DR: MixGate introduces a function-aware alignment strategy to enable effective multiview self-supervised learning on Boolean circuits by first establishing functional equivalence between different graph representations before applying masked modeling.

DetailsMotivation: Multiview learning on Boolean circuits is promising but challenging due to structural heterogeneity between different graph representations (like AIG vs XMG), which causes naive self-supervised methods to fail as cross-view context appears as noise.

Method: MixGate uses a training curriculum that first establishes functional alignment via an Equivalence Alignment Loss to create a shared representation space, then applies multiview masked modeling on the aligned views.

Result: Extensive experiments show the alignment-first strategy transforms masked modeling from ineffective to powerful, with ablation studies confirming the importance of functional alignment.

Conclusion: Functional alignment is a necessary precondition for effective multiview self-supervision on Boolean circuits, enabling complementary structural information from different graph representations to be leveraged successfully.

Abstract: Multiview learning on Boolean circuits holds immense promise, as different graph-based representations offer complementary structural and semantic information. However, the vast structural heterogeneity between views, such as an And-Inverter Graph (AIG) versus an XOR-Majority Graph (XMG), poses a critical barrier to effective fusion, especially for self-supervised techniques like masked modeling. Naively applying such methods fails, as the cross-view context is perceived as noise. Our key insight is that functional alignment is a necessary precondition to unlock the power of multiview self-supervision. We introduce MixGate, a framework built on a principled training curriculum that first teaches the model a shared, function-aware representation space via an Equivalence Alignment Loss. Only then do we introduce a multiview masked modeling objective, which can now leverage the aligned views as a rich, complementary signal. Extensive experiments, including a crucial ablation study, demonstrate that our alignment-first strategy transforms masked modeling from an ineffective technique into a powerful performance driver.

[422] Knowledgeable Language Models as Black-Box Optimizers for Personalized Medicine

Michael S. Yao, Osbert Bastani, Alma Andersson, Tommaso Biancalani, Aïcha Bentaieb, Claudia Iriondo

Main category: cs.LG

TL;DR: LEON uses LLMs as black-box optimizers with domain knowledge to propose personalized treatments without fine-tuning, outperforming traditional methods.

DetailsMotivation: Personalized medicine needs to optimize treatments based on patient factors, but surrogate models fail to generalize. Domain knowledge from medical sources can provide better fitness signals.

Method: LLM-based Entropy-guided Optimization with kNowledgeable priors (LEON) uses ‘optimization by prompting’ - leveraging LLMs as stochastic engines to propose treatment designs using medical knowledge without task-specific fine-tuning.
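
'Optimization by prompting' reduces to a propose-score-select loop. The sketch below uses hypothetical stubs (`llm_propose`, `surrogate_score`) in place of a real LLM call and fitness model; it illustrates the control flow only, not LEON's actual API.

```python
import random

def llm_propose(knowledge, history, n=4):
    # In practice: prompt an LLM with domain knowledge and scored history.
    return [f"plan-{random.randint(0, 99)}" for _ in range(n)]

def surrogate_score(plan):
    return random.random()       # stand-in for the in-silico fitness model

knowledge = "relevant excerpts from textbooks / biomedical knowledge graphs"
history = []                     # list of (score, plan) pairs
for _ in range(5):
    candidates = llm_propose(knowledge, history)
    scored = [(surrogate_score(p), p) for p in candidates]
    history = sorted(history + scored, reverse=True)[:8]  # keep the best
print(history[0])                # current best treatment plan
```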

Result: Experiments on real-world optimization tasks show LEON outperforms both traditional and LLM-based methods in proposing individualized treatments for patients.

Conclusion: Domain-specific prior knowledge combined with LLMs provides an effective approach for personalized treatment optimization, demonstrating superior performance over existing methods.

Abstract: The goal of personalized medicine is to discover a treatment regimen that optimizes a patient’s clinical outcome based on their personal genetic and environmental factors. However, candidate treatments cannot be arbitrarily administered to the patient to assess their efficacy; we often instead have access to an in silico surrogate model that approximates the true fitness of a proposed treatment. Unfortunately, such surrogate models have been shown to fail to generalize to previously unseen patient-treatment combinations. We hypothesize that domain-specific prior knowledge - such as medical textbooks and biomedical knowledge graphs - can provide a meaningful alternative signal of the fitness of proposed treatments. To this end, we introduce LLM-based Entropy-guided Optimization with kNowledgeable priors (LEON), a mathematically principled approach to leverage large language models (LLMs) as black-box optimizers without any task-specific fine-tuning, taking advantage of their ability to contextualize unstructured domain knowledge to propose personalized treatment plans in natural language. In practice, we implement LEON via ‘optimization by prompting,’ which uses LLMs as stochastic engines for proposing treatment designs. Experiments on real-world optimization tasks show LEON outperforms both traditional and LLM-based methods in proposing individualized treatments for patients.

[423] CLUE: Conflict-guided Localization for LLM Unlearning Framework

Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang

Main category: cs.LG

TL;DR: CLUE is a novel LLM unlearning framework that uses circuit discovery to precisely identify and separate forget and retain neurons, then applies targeted fine-tuning strategies to achieve superior unlearning performance.

DetailsMotivation: Current localization-based unlearning methods fail to disentangle neurons responsible for forgetting undesirable knowledge from those needed to retain essential skills, leading to uniform interventions that risk catastrophic over-forgetting or incomplete erasure.

Method: Proposes CLUE framework that: 1) identifies forget and retain circuits using mechanistic interpretability, 2) transforms circuits into conjunctive normal forms (CNF), 3) uses CNF satisfiability to determine neuron assignments (forget/retain), 4) applies targeted fine-tuning strategies for different neuron categories.

Result: Extensive experiments show CLUE achieves superior forget efficacy and retain utility compared to existing localization methods through precise neural localization.

Conclusion: CLUE provides a more precise and effective approach to LLM unlearning by properly disentangling and targeting specific neuron circuits, overcoming limitations of previous methods that treated forget and retain neurons as an entangled group.

Abstract: LLM unlearning aims to eliminate the influence of undesirable data without affecting causally unrelated information. This process typically involves using a forget set to remove target information, alongside a retain set to maintain non-target capabilities. While recent localization-based methods demonstrate promise in identifying important neurons to be unlearned, they fail to disentangle neurons responsible for forgetting undesirable knowledge or retaining essential skills, often treating them as a single entangled group. As a result, these methods apply uniform interventions, risking catastrophic over-forgetting or incomplete erasure of the target knowledge. To address this, we turn to circuit discovery, a mechanistic interpretability technique, and propose the Conflict-guided Localization for LLM Unlearning framEwork (CLUE). This framework identifies the forget and retain circuits composed of important neurons, and then the circuits are transformed into conjunctive normal forms (CNF). The assignment of each neuron in the CNF satisfiability solution reveals whether it should be forgotten or retained. We then provide targeted fine-tuning strategies for different categories of neurons. Extensive experiments demonstrate that, compared to existing localization methods, CLUE achieves superior forget efficacy and retain utility through precise neural localization.

[424] Intercept Cancer: Cancer Pre-Screening with Large Scale Healthcare Foundation Models

Liwen Sun, Hao-Ren Yao, Gary Gao, Ophir Frieder, Chenyan Xiong

Main category: cs.LG

TL;DR: CATCH-FM is a cancer pre-screening method that identifies high-risk patients using EHR foundation models, achieving 60% sensitivity and 99% specificity for early cancer detection.

DetailsMotivation: Existing cancer screening methods are expensive, intrusive, and not globally available, leading to preventable deaths. There's a need for accessible pre-screening using historical medical records.

Method: Pretrained compute-optimal foundation models (up to 2.4B parameters) on medical code sequences from millions of EHRs, then fine-tuned on clinician-curated cancer risk prediction cohorts.

Result: Achieved 60% sensitivity with 99% specificity/NPV, outperforming tree models and LLMs. State-of-the-art pancreatic cancer prediction on EHRSHOT leaderboard, robust across patient distributions.

Conclusion: CATCH-FM demonstrates effective cancer pre-screening using EHR foundation models, capturing non-trivial risk factors and showing promise for accessible early cancer detection.

Abstract: Cancer screening, leading to early detection, saves lives. Unfortunately, existing screening techniques require expensive and intrusive medical procedures, not globally available, resulting in too many lost would-be-saved lives. We present CATCH-FM, CATch Cancer early with Healthcare Foundation Models, a cancer pre-screening methodology that identifies high-risk patients for further screening solely based on their historical medical records. With millions of electronic healthcare records (EHR), we establish the scaling law of EHR foundation models pretrained on medical code sequences, pretrain compute-optimal foundation models of up to 2.4 billion parameters, and finetune them on clinician-curated cancer risk prediction cohorts. In our retrospective evaluation comprising thirty thousand patients, CATCH-FM achieved strong efficacy (60% sensitivity) with low risk (99% specificity and Negative Predictive Value), outperforming feature-based tree models as well as general and medical large language models by large margins. Despite significant demographic, healthcare system, and EHR coding differences, CATCH-FM achieves state-of-the-art pancreatic cancer risk prediction on the EHRSHOT few-shot leaderboard, outperforming EHR foundation models pretrained on on-site patient data. Our analysis demonstrates the robustness of CATCH-FM in various patient distributions, the benefits of operating in the ICD code space, and its ability to capture non-trivial cancer risk factors. Our code will be open-sourced.

[425] FracAug: Fractional Augmentation boost Graph-level Anomaly Detection under Limited Supervision

Xiangyu Dong, Xingyi Zhang, Sibo Wang

Main category: cs.LG

TL;DR: FracAug is a plug-in augmentation framework for graph-level anomaly detection that generates semantic-preserving graph variants and uses mutual verification for pseudo-labeling to address data imbalance and labeling costs in GNNs.

DetailsMotivation: Graph-level anomaly detection faces challenges with high labeling costs and dataset imbalance, which limit Graph Neural Networks' performance in domains like drug discovery.

Method: FracAug learns graph semantics and synthesizes fractional variants using a weighted distance-aware margin loss to capture multi-scale topology. It then uses predictions from original and augmented graphs to pseudo-label unlabeled data, iteratively expanding the training set.
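
The mutual-verification step can be pictured with a short sketch (my reconstruction; `augment` stands in for FracAug's fractional-variant synthesis): an unlabeled graph is pseudo-labeled only when the detector agrees, confidently, on the original and the augmented variant.

```python
# Hedged sketch of mutual-verification pseudo-labeling; `model` returns an
# anomaly logit per graph, `augment` stands in for FracAug's variants.
import torch

def pseudo_label(model, graph, augment, tau=0.9):
    with torch.no_grad():
        p_orig = torch.sigmoid(model(graph))        # anomaly probability
        p_aug = torch.sigmoid(model(augment(graph)))
    agree = (p_orig > 0.5) == (p_aug > 0.5)
    confident = torch.minimum(torch.maximum(p_orig, 1 - p_orig),
                              torch.maximum(p_aug, 1 - p_aug)) > tau
    if agree and confident:
        return int(p_orig > 0.5)  # 1 = anomalous; joins the training set
    return None                   # stays unlabeled for a later round
```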

Result: Experiments across 14 GNNs on 12 real-world datasets show consistent performance gains, with average AUROC, AUPRC, and F1-score improvements of up to 5.72%, 7.23%, and 4.18% respectively.

Conclusion: FracAug demonstrates remarkable universality and efficacy as a model-agnostic module that can enhance various GNNs for graph-level anomaly detection tasks.

Abstract: Graph-level anomaly detection (GAD) is critical in diverse domains such as drug discovery, yet high labeling costs and dataset imbalance hamper the performance of Graph Neural Networks (GNNs). To address these issues, we propose FracAug, an innovative plug-in augmentation framework that enhances GNNs by generating semantically consistent graph variants and pseudo-labeling with mutual verification. Unlike previous heuristic methods, FracAug learns semantics within given graphs and synthesizes fractional variants, guided by a novel weighted distance-aware margin loss. This captures multi-scale topology to generate diverse, semantic-preserving graphs unaffected by data imbalance. Then, FracAug utilizes predictions from both original and augmented graphs to pseudo-label unlabeled data, iteratively expanding the training set. As a model-agnostic module compatible with various GNNs, FracAug demonstrates remarkable universality and efficacy: experiments across 14 GNNs on 12 real-world datasets show consistent gains, boosting average AUROC, AUPRC, and F1-score by up to 5.72%, 7.23%, and 4.18%, respectively.

[426] Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

Peng Chen, Jiaji Zhang, Hailiang Zhao, Yirong Zhang, Jiahong Yu, Xueyan Tang, Yixuan Wang, Hao Li, Jianping Zou, Gang Xiong, Kingsum Chow, Shuibing He, Shuiguang Deng

Main category: cs.LG

TL;DR: LCR is a practical learning-based GPU caching framework that enhances LRU with machine-learned predictions and dynamic adaptation to prediction accuracy, delivering robust performance gains in recommendation models and large language models.

DetailsMotivation: Cache efficiency is a major bottleneck in GPU inference, with heuristic policies like LRU struggling under structured access patterns, and existing learning-based approaches facing limitations in robustness and practicality.

Method: LCR uses LARU algorithm which enhances LRU with machine-learned predictions and dynamically adapts to prediction accuracy through online error estimation, ensuring graceful degradation when predictions are inaccurate.
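
A toy version of the adaptation idea (the eviction rule and error estimator here are my assumptions, not the exact LARU algorithm): trust the learned predictor while its online error stays low, and degrade toward plain LRU otherwise.

```python
# Illustrative LARU-style policy: evict by predicted reuse time when the
# running error estimate is low, otherwise fall back to LRU order.
from collections import OrderedDict

class LearnedCache:
    def __init__(self, capacity, predictor, err_threshold=0.3, alpha=0.05):
        self.cap, self.predict = capacity, predictor  # predictor(key) -> next-use time
        self.err, self.thr, self.alpha = 0.0, err_threshold, alpha
        self.store = OrderedDict()                    # insertion order ~ recency
        self.pred_at = {}                             # prediction made at last access

    def access(self, key, now):
        if key in self.store:
            # update the online error estimate with the observed reuse time
            rel_err = abs(self.pred_at[key] - now) / max(now, 1)
            self.err = (1 - self.alpha) * self.err + self.alpha * rel_err
            self.store.move_to_end(key)
        else:
            if len(self.store) >= self.cap:
                if self.err < self.thr:   # trust predictions: evict farthest reuse
                    victim = max(self.store, key=self.predict)
                else:                     # degrade gracefully to LRU
                    victim = next(iter(self.store))
                self.store.pop(victim)
                self.pred_at.pop(victim, None)
            self.store[key] = True
        self.pred_at[key] = self.predict(key)
```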

Result: LCR improves throughput by up to 24.2% in DLRM scenarios and reduces P99 TTFT by up to 28.3% in LLM scenarios, outperforming widely used inference systems while maintaining stable performance even under poor predictions.

Conclusion: LCR bridges the gap between empirical progress and theoretical advances in learning-based caching, delivering consistent gains under realistic conditions with practical robustness.

Abstract: In modern GPU inference, cache efficiency remains a major bottleneck. In recommendation models, embedding hit rates largely determine throughput, while in large language models, KV-cache misses substantially increase time-to-first-token (TTFT). Heuristic policies such as \textsc{LRU} often struggle under structured access patterns. Learning-based approaches are promising, but in practice face two major limitations: they degrade sharply when predictions are inaccurate, or they gain little even with accurate predictions due to conservative designs. Some also incur high overhead, further limiting practicality. We present \textsc{LCR}, a practical framework for learning-based GPU caching that delivers performance gains while ensuring robustness and efficiency. Its core algorithm, \textsc{LARU}, enhances \textsc{LRU} with machine-learned predictions and dynamically adapts to prediction accuracy through online error estimation. When predictions are accurate, \textsc{LARU} achieves near-optimal performance. With inaccurate predictions, it degrades gracefully to near-\textsc{LRU} performance. With \textsc{LCR}, we bridge the gap between empirical progress and theoretical advances in learning-based caching. Experiments show that \textsc{LCR} delivers consistent gains under realistic conditions. In DLRM and LLM scenarios, it improves throughput by up to 24.2% and reduces P99 TTFT by up to 28.3%, outperforming widely used inference systems. Even under poor predictions, its performance remains stable, demonstrating practical robustness.

[427] Learning Ising Models under Hard Constraints using One Sample

Rohan Chauhan, Ioannis Panageas

Main category: cs.LG

TL;DR: The paper presents an efficient estimator for the inverse temperature parameter of truncated Ising models using a single sample, with applications to configurations constrained by k-SAT formulas.

DetailsMotivation: To address the challenge of estimating parameters in truncated Ising models, which are constrained by k-SAT formulas, using only a single sample, building on recent work in statistical estimation under truncation.

Method: The method designs an estimator based on maximizing pseudolikelihood, extending techniques from previous work to handle the truncated Ising model setting with bounded-degree graphs and k-SAT constraints.
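
For intuition, the untruncated pseudolikelihood objective is easy to write down; the sketch below omits the k-SAT truncation set that the paper's estimator additionally handles. It uses P(σ_i = +1 | σ_{-i}) = sigmoid(4β(Aσ)_i), which holds for a symmetric adjacency matrix with zero diagonal.

```python
# Toy pseudolikelihood estimator for an (untruncated) Ising model from a
# single sample; bounds on beta are an assumption for the 1-D search.
import numpy as np
from scipy.optimize import minimize_scalar

def neg_pseudolikelihood(beta, A, sigma):
    m = A @ sigma                      # local fields m_i = (A sigma)_i
    x = 4.0 * beta * sigma * m         # log sigmoid(x) = -logaddexp(0, -x)
    return np.logaddexp(0.0, -x).sum()

def estimate_beta(A, sigma):
    res = minimize_scalar(neg_pseudolikelihood, args=(A, sigma),
                          bounds=(0.0, 2.0), method="bounded")
    return res.x

rng = np.random.default_rng(0)
A = (rng.random((50, 50)) < 0.05).astype(float)
A = np.triu(A, 1); A = A + A.T            # symmetric adjacency, zero diagonal
sigma = rng.choice([-1.0, 1.0], size=50)  # one observed configuration
print(estimate_beta(A, sigma))
```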

Result: The estimator achieves O(Δ³/√n)-consistency with the true parameter β* and runs in nearly O(n) time, valid for k ≳ log(d²k)Δ³.

Conclusion: This work successfully generalizes recent pseudolikelihood techniques to the challenging truncated Ising model setting, providing efficient parameter estimation with theoretical guarantees.

Abstract: We consider the problem of estimating inverse temperature parameter $\beta$ of an $n$-dimensional truncated Ising model using a single sample. Given a graph $G = (V,E)$ with $n$ vertices, a truncated Ising model is a probability distribution over the $n$-dimensional hypercube $\{-1,1\}^n$ where each configuration $\mathbf{\sigma}$ is constrained to lie in a truncation set $S \subseteq \{-1,1\}^n$ and has probability $\Pr(\mathbf{\sigma}) \propto \exp(\beta\mathbf{\sigma}^\top A\mathbf{\sigma})$ with $A$ being the adjacency matrix of $G$. We adopt the recent setting of [Galanis et al. SODA'24], where the truncation set $S$ can be expressed as the set of satisfying assignments of a $k$-SAT formula. Given a single sample $\mathbf{\sigma}$ from a truncated Ising model, with inverse parameter $\beta^*$, underlying graph $G$ of bounded degree $\Delta$ and $S$ being expressed as the set of satisfying assignments of a $k$-SAT formula, we design in nearly $O(n)$ time an estimator $\hat{\beta}$ that is $O(\Delta^3/\sqrt{n})$-consistent with the true parameter $\beta^*$ for $k \gtrsim \log(d^2k)\Delta^3$. Our estimator is based on the maximization of the pseudolikelihood, a notion that has received extensive analysis for various probabilistic models without [Chatterjee, Annals of Statistics '07] or with truncation [Galanis et al. SODA '24]. Our approach generalizes recent techniques from [Daskalakis et al. STOC '19, Galanis et al. SODA '24], to confront the more challenging setting of the truncated Ising model.

[428] Binary Autoencoder for Mechanistic Interpretability of Large Language Models

Hakaze Cho, Haolin Yang, Brian M. Kurkoski, Naoya Inoue

Main category: cs.LG

TL;DR: Proposes Binary Autoencoder (BAE) to improve feature sparsity and atomization in LLM interpretation by enforcing minimal entropy on minibatches of hidden activations through binary discretization and gradient estimation.

DetailsMotivation: Existing autoencoder methods for LLM feature interpretation rely on implicit regularization that causes dense features and poor sparsity across instances, harming feature atomization.

Method: Binary Autoencoder (BAE) discretizes hidden activations to 1-bit via step function, applies gradient estimation for backpropagation, and enforces minimal entropy on minibatches to promote feature independence and sparsity.
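
A compact sketch of the two ingredients named above, as I read them (not the released code): a step-function bottleneck trained with a straight-through estimator, plus a minibatch entropy penalty on the binary codes.

```python
# Hedged reconstruction of a 1-bit autoencoder bottleneck with a
# straight-through gradient and a minibatch Bernoulli-entropy penalty.
import torch
import torch.nn as nn

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return (x > 0).float()          # 1-bit activations
    @staticmethod
    def backward(ctx, g):
        return g                        # straight-through estimator

class BAE(nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)
    def forward(self, h):
        z = BinarizeSTE.apply(self.enc(h))
        return self.dec(z), z

def minibatch_entropy(z, eps=1e-6):
    # per-feature Bernoulli entropy over the minibatch; minimizing it pushes
    # each feature to be consistently on or off across instances
    p = z.mean(0).clamp(eps, 1 - eps)
    return -(p * p.log() + (1 - p) * (1 - p).log()).mean()

bae = BAE(768, 4096)
h = torch.randn(32, 768)                # hidden states from an LLM layer
recon, z = bae(h)
loss = nn.functional.mse_loss(recon, h) + 0.1 * minibatch_entropy(z)
loss.backward()
```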

Result: BAE enables reliable entropy calculation for characterizing LLM inference dynamics and In-context Learning, and produces more interpretable features with better sparsity than baselines while avoiding dense features.

Conclusion: BAE effectively serves as a feature extractor for LLM interpretation, demonstrating improved feature sparsity and atomization compared to existing methods.

Abstract: Existing works are dedicated to untangling atomized numerical components (features) from the hidden states of Large Language Models (LLMs) for interpreting their mechanism. However, they typically rely on autoencoders constrained by some implicit training-time regularization on single training instances (e.g., $L_1$ normalization, top-k functions), without an explicit guarantee of global sparsity among instances, causing a large number of dense (simultaneously active) features, harming the feature sparsity and atomization. In this paper, we propose a novel autoencoder variant that enforces minimal entropy on minibatches of hidden activations, thereby promoting feature independence and sparsity across instances. For efficient entropy calculation, we discretize the hidden activations to 1-bit via a step function and apply gradient estimation to enable backpropagation, hence the name Binary Autoencoder (BAE). We empirically demonstrate two major applications: (1) Feature set entropy calculation. Entropy can be reliably estimated on binary hidden activations, which we empirically evaluate and leverage to characterize the inference dynamics of LLMs and In-context Learning. (2) Feature untangling. Similar to typical methods, BAE can extract atomized features from LLM's hidden states. To robustly evaluate such feature extraction capability, we refine traditional feature-interpretation methods to avoid unreliable handling of numerical tokens, and show that BAE avoids dense features while producing the largest number of interpretable ones among baselines, confirming the effectiveness of BAE as a feature extractor.

[429] Feature Augmentation of GNNs for ILPs: Local Uniqueness Suffices

Qingyu Han, Qian Li, Linxin Yang, Qian Chen, Qingjiang Shi, Ruoyu Sun

Main category: cs.LG

TL;DR: The paper proposes Local-UID, a parsimonious identifier scheme based on d-hop uniqueness coloring for Graph Neural Networks in Integer Linear Programming optimization, achieving Global-UID expressiveness with better generalization.

DetailsMotivation: Standard anonymous GNNs have limited expressiveness for ILPs, while Global-UID approaches introduce spurious correlations that harm generalization, creating a tradeoff that needs addressing.

Method: Proposes Local-UID scheme using d-hop uniqueness coloring, with ColorGNN (color-conditioned embeddings) and ColorUID (lightweight feature variant) implementations.
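
A d-hop uniqueness coloring can be produced greedily; the sketch below (my illustration, not the paper's implementation) assigns each node the smallest color unused in its d-hop neighborhood, which suffices because the d-hop relation is symmetric.

```python
# Greedy d-hop uniqueness coloring: colors are unique within every node's
# d-hop neighborhood, a Local-UID in the paper's terminology.
import networkx as nx

def local_uid_coloring(G, d=2):
    color = {}
    for v in G.nodes():
        neighborhood = nx.single_source_shortest_path_length(G, v, cutoff=d)
        used = {color[u] for u in neighborhood if u in color and u != v}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

G = nx.erdos_renyi_graph(20, 0.2, seed=0)
print(local_uid_coloring(G, d=2))
```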

Result: Substantial gains on three ILP benchmarks, strong OOD generalization on linear programming datasets, and improvements on general graph-level tasks when combined with state-of-the-art methods.

Conclusion: Local-UIDs achieve the expressive power of Global-UIDs while offering stronger generalization, providing an effective solution to the expressiveness-generalization tradeoff in GNN-based ILP optimization.

Abstract: Integer Linear Programs (ILPs) are central to real-world optimizations but notoriously difficult to solve. Learning to Optimize (L2O) has emerged as a promising paradigm, with Graph Neural Networks (GNNs) serving as the standard backbone. However, standard anonymous GNNs are limited in expressiveness for ILPs, and the common enhancement of augmenting nodes with globally unique identifiers (UIDs) typically introduces spurious correlations that severely harm generalization. To address this tradeoff, we propose a parsimonious Local-UID scheme based on d-hop uniqueness coloring, which ensures identifiers are unique only within each node’s d-hop neighborhood. Building on this scheme, we introduce ColorGNN, which incorporates color information via color-conditioned embeddings, and ColorUID, a lightweight feature-level variant. We prove that for d-layer networks, Local-UIDs achieve the expressive power of Global-UIDs while offering stronger generalization. Extensive experiments show that our approach (i) yields substantial gains on three ILP benchmarks, (ii) exhibits strong OOD generalization on linear programming datasets, and (iii) further improves a general graph-level task when paired with a state-of-the-art method.

[430] Mechanism of Task-oriented Information Removal in In-context Learning

Hakaze Cho, Haolin Yang, Gouki Minegishi, Naoya Inoue

Main category: cs.LG

TL;DR: This paper investigates the mechanism of in-context learning (ICL) through information removal, showing that LMs encode queries into non-selective representations and ICL selectively removes redundant information to steer models toward intended tasks.

DetailsMotivation: To understand the inner mechanism of in-context learning in language models, which remains unclear despite its effectiveness in few-shot learning.

Method: The authors analyze hidden states through information removal perspective, use low-rank filters to selectively remove information, measure hidden states with designed metrics, and identify denoising attention heads that enable information removal.
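
The low-rank filtering step admits a one-line sketch (assumed form): project the hidden state off a small subspace U holding the redundant task information.

```python
# Sketch of a low-rank information-removal filter; U is a stand-in for
# learned directions, here just a random orthonormal basis.
import torch

def remove_subspace(h, U):
    """h: (batch, d) hidden states; U: (d, r) orthonormal directions to remove."""
    return h - (h @ U) @ U.T

d, r = 4096, 8
U, _ = torch.linalg.qr(torch.randn(d, r))   # placeholder for learned directions
h = torch.randn(2, d)
h_filtered = remove_subspace(h, U)
```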

Result: Few-shot ICL effectively simulates task-oriented information removal processes, selectively removing redundant information from entangled representations. Blocking identified denoising heads significantly degrades ICL accuracy, confirming the critical role of information removal mechanism.

Conclusion: Information removal constitutes a key mechanism underlying ICL, where demonstrations help selectively remove redundant information from non-selective representations to focus on intended tasks.

Abstract: In-context Learning (ICL) is an emerging few-shot learning paradigm based on modern Language Models (LMs), yet its inner mechanism remains unclear. In this paper, we investigate the mechanism through a novel perspective of information removal. Specifically, we demonstrate that in the zero-shot scenario, LMs encode queries into non-selective representations in hidden states containing information for all possible tasks, leading to arbitrary outputs without focusing on the intended task, resulting in near-zero accuracy. Meanwhile, we find that selectively removing specific information from hidden states by a low-rank filter effectively steers LMs toward the intended task. Building on these findings, by measuring the hidden states on carefully designed metrics, we observe that few-shot ICL effectively simulates such task-oriented information removal processes, selectively removing the redundant information from entangled non-selective representations and improving the output based on the demonstrations, which constitutes a key mechanism underlying ICL. Moreover, we identify essential attention heads inducing the removal operation, termed Denoising Heads. Blocking these heads during inference significantly degrades ICL accuracy, especially when the correct label is absent from the few-shot demonstrations, confirming the critical role of both the information removal mechanism and the denoising heads.

[431] Lossless Compression: A New Benchmark for Time Series Model Evaluation

Meng Wan, Benxi Tian, Jue Wang, Cui Hui, Ningming Nie, Tiantian Liu, Zongguo Wang, Cao Rongqiang, Peng Shi, Yangang Wang

Main category: cs.LG

TL;DR: The paper introduces lossless compression as a new paradigm for evaluating time series models, establishing a direct link between compression length and negative log-likelihood through Shannon’s source coding theorem.

DetailsMotivation: Traditional time series evaluation tasks (forecasting, imputation, anomaly detection, classification) focus on task-specific performance but fail to rigorously measure whether models capture the full generative distribution of the data.

Method: Proposes TSCom-Bench, a comprehensive evaluation framework that enables rapid adaptation of time series models as backbones for lossless compression, with standardized protocols and metrics based on information theory.
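
The information-theoretic link the benchmark rests on is concrete: under arithmetic coding, a model that assigns probability p to each observed symbol spends about -log2 p bits on it, so total codelength tracks the negative log-likelihood in bits. A tiny illustration:

```python
# Codelength-from-NLL identity underlying the benchmark (Shannon's source
# coding theorem); stepwise_probs would come from the evaluated model.
import math

def codelength_bits(stepwise_probs):
    """stepwise_probs: model probabilities of each observed symbol."""
    return sum(-math.log2(p) for p in stepwise_probs)

print(codelength_bits([0.5, 0.25, 0.9]))  # 1 + 2 + 0.152 = 3.152 bits
```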

Result: Experiments on diverse datasets with state-of-the-art models (TimeXer, iTransformer, PatchTST) show that compression reveals distributional weaknesses overlooked by classic benchmarks.

Conclusion: Lossless compression serves as a principled task that complements and extends existing evaluation methods for time series modeling, providing a strict information-theoretic criterion for modeling capacity.

Abstract: The evaluation of time series models has traditionally focused on four canonical tasks: forecasting, imputation, anomaly detection, and classification. While these tasks have driven significant progress, they primarily assess task-specific performance and do not rigorously measure whether a model captures the full generative distribution of the data. We introduce lossless compression as a new paradigm for evaluating time series models, grounded in Shannon's source coding theorem. This perspective establishes a direct equivalence between optimal compression length and the negative log-likelihood, providing a strict and unified information-theoretic criterion for modeling capacity. We then define a standardized evaluation protocol and metrics. We further propose and open-source a comprehensive evaluation framework, TSCom-Bench, which enables the rapid adaptation of time series models as backbones for lossless compression. Experiments across diverse datasets on state-of-the-art models, including TimeXer, iTransformer, and PatchTST, demonstrate that compression reveals distributional weaknesses overlooked by classic benchmarks. These findings position lossless compression as a principled task that complements and extends existing evaluation for time series modeling.

[432] DELTA-Code: How Does RL Unlock and Transfer New Programming Algorithms in LLMs?

Yiyou Sun, Yuhan Cao, Pohao Huang, Haoyue Bai, Hannaneh Hajishirzi, Nouha Dziri, Dawn Song

Main category: cs.LG

TL;DR: DELTA-Code is a synthetic coding benchmark that evaluates whether LLMs can learn new reasoning strategies through RL and transfer them to out-of-distribution problems, revealing grokking phase transitions and transfer limitations.

DetailsMotivation: To determine if LLMs can genuinely acquire new reasoning skills beyond their pre-trained capabilities, particularly through RL training on problems where pretrained models fail completely.

Method: Created DELTA-Code benchmark with templated problem generators and fully OOD problem families. Used RL training with techniques like staged warm-up, experience replay, curriculum training, and verification-in-the-loop.

Result: Models showed striking grokking phase transitions - from near-zero reward to near-perfect accuracy after extended training. Solid gains within families and for recomposed skills, but persistent weaknesses in transformative cases.

Conclusion: DELTA provides a clean testbed for probing RL-driven reasoning limits and understanding how models can acquire new algorithmic skills beyond existing priors.

Abstract: It remains an open question whether LLMs can acquire or generalize genuinely new reasoning strategies, beyond the sharpened skills encoded in their parameters during pre-training or post-training. To address this question, we introduce DELTA-Code (Distributional Evaluation of Learnability and Transferrability in Algorithmic Coding), a controlled benchmark of synthetic coding problem families designed to probe two fundamental aspects: learnability (can LLMs, through reinforcement learning (RL), solve problem families where pretrained models exhibit failure with large enough attempts, i.e., pass@K=0?) and transferrability (if learnability happens, can such skills transfer systematically to out-of-distribution (OOD) test sets?). Unlike prior public coding datasets, DELTA isolates reasoning skills through templated problem generators and introduces fully OOD problem families that demand novel strategies rather than tool invocation or memorized patterns. Our experiments reveal a striking grokking phase transition: after an extended period with near-zero reward, RL-trained models abruptly climb to near-perfect accuracy. To enable learnability on previously unsolvable problem families, we explore key training ingredients such as staged warm-up with dense rewards, experience replay, curriculum training, and verification-in-the-loop. Beyond learnability, we use DELTA to evaluate transferability or generalization along exploratory, compositional, and transformative axes, as well as cross-family transfer. Results show solid gains within families and for recomposed skills, but persistent weaknesses in transformative cases. DELTA thus offers a clean testbed for probing the limits of RL-driven reasoning and for understanding how models can move beyond existing priors to acquire new algorithmic skills.

[433] MAIFormer: Multi-Agent Inverted Transformer for Flight Trajectory Prediction

Seokbin Yoon, Keumjin Lee

Main category: cs.LG

TL;DR: MAIFormer is a novel neural architecture using multi-agent inverted transformers to predict flight trajectories by capturing individual aircraft behaviors and complex interactions between flights, achieving state-of-the-art performance with interpretable results.

DetailsMotivation: Flight trajectory prediction for multiple aircraft is essential for air traffic management but challenging due to the need to model both individual aircraft behaviors over time and complex interactions between flights, while also generating explainable predictions.

Method: Proposes Multi-Agent Inverted Transformer (MAIFormer) with two key attention modules: (1) masked multivariate attention for capturing spatio-temporal patterns of individual aircraft, and (2) agent attention for modeling social patterns among multiple agents in complex air traffic scenes.
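
A rough sketch of how the two modules could compose (my reading of the description, not the released model): attention runs first along time within each aircraft's trajectory, then across agents at each time step.

```python
# Hedged sketch of temporal-then-agent attention over a
# (batch, agents, time, dim) trajectory tensor.
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.temporal = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.agent = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x, causal_mask=None):
        b, a, t, d = x.shape
        h = x.reshape(b * a, t, d)                      # per-aircraft sequence
        h, _ = self.temporal(h, h, h, attn_mask=causal_mask)
        h = h.reshape(b, a, t, d).transpose(1, 2).reshape(b * t, a, d)
        h, _ = self.agent(h, h, h)                      # interactions per step
        return h.reshape(b, t, a, d).transpose(1, 2)    # back to (b, a, t, d)

x = torch.randn(2, 5, 30, 64)   # 5 aircraft, 30 time steps
out = DualAttention()(x)
```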

Result: MAIFormer achieves the best performance across multiple metrics on real-world ADS-B flight trajectory data from Incheon International Airport, outperforming other methods while producing interpretable predictions.

Conclusion: MAIFormer provides both high prediction accuracy and human-interpretable outcomes, improving model transparency and practical utility in air traffic control applications.

Abstract: Flight trajectory prediction for multiple aircraft is essential and provides critical insights into how aircraft navigate within current air traffic flows. However, predicting multi-agent flight trajectories is inherently challenging. One of the major difficulties is modeling both the individual aircraft behaviors over time and the complex interactions between flights. Generating explainable prediction outcomes is also a challenge. Therefore, we propose a Multi-Agent Inverted Transformer, MAIFormer, as a novel neural architecture that predicts multi-agent flight trajectories. The proposed framework features two key attention modules: (i) masked multivariate attention, which captures spatio-temporal patterns of individual aircraft, and (ii) agent attention, which models the social patterns among multiple agents in complex air traffic scenes. We evaluated MAIFormer using a real-world automatic dependent surveillance-broadcast flight trajectory dataset from the terminal airspace of Incheon International Airport in South Korea. The experimental results show that MAIFormer achieves the best performance across multiple metrics and outperforms other methods. In addition, MAIFormer produces prediction outcomes that are interpretable from a human perspective, which improves both the transparency of the model and its practical utility in air traffic control.

[434] ExMolRL: Phenotype-Target Joint Generation of De Novo Molecules via Multi-Objective Reinforcement Learning

Haotian Guo, Hui Liu

Main category: cs.LG

TL;DR: ExMoIRL is a novel generative framework that synergistically integrates phenotypic and target-specific cues for de novo molecular generation, overcoming limitations of current phenotype-based and target-based strategies in AI-driven drug design.

DetailsMotivation: Current phenotype-based and target-based strategies in AI-driven drug design suffer limitations - phenotype-based approaches incur high experimental costs while target-based strategies overlook system-level cellular responses. There's a need to bridge this gap by combining both approaches.

Method: The framework uses a phenotype-guided generator pretrained on drug-induced transcriptional profiles and fine-tuned via multi-objective reinforcement learning. The reward function fuses docking affinity and drug-likeness scores, augmented with ranking loss, prior-likelihood regularization, and entropy maximization.

Result: Extensive experiments demonstrate ExMoIRL’s superior performance over state-of-the-art phenotype-based and target-based models across multiple well-characterized targets. Generated molecules exhibit favorable drug-like properties, high target affinity, and inhibitory potency (IC50) against cancer cells.

Conclusion: This unified framework showcases the synergistic potential of combining phenotype-guided and target-aware strategies, offering a more effective solution for de novo drug discovery.

Abstract: The generation of high-quality candidate molecules remains a central challenge in AI-driven drug design. Current phenotype-based and target-based strategies each suffer limitations, either incurring high experimental costs or overlooking system-level cellular responses. To bridge this gap, we propose ExMoIRL, a novel generative framework that synergistically integrates phenotypic and target-specific cues for de novo molecular generation. The phenotype-guided generator is first pretrained on expansive drug-induced transcriptional profiles and subsequently fine-tuned via multi-objective reinforcement learning (RL). Crucially, the reward function fuses docking affinity and drug-likeness scores, augmented with ranking loss, prior-likelihood regularization, and entropy maximization. The multi-objective RL steers the model toward chemotypes that are simultaneously potent, diverse, and aligned with the specified phenotypic effects. Extensive experiments demonstrate ExMoIRL's superior performance over state-of-the-art phenotype-based and target-based models across multiple well-characterized targets. Our generated molecules exhibit favorable drug-like properties, high target affinity, and inhibitory potency (IC50) against cancer cells. This unified framework showcases the synergistic potential of combining phenotype-guided and target-aware strategies, offering a more effective solution for de novo drug discovery.

[435] ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning

Qizhi Pei, Zhuoshi Pan, Honglin Lin, Xin Gao, Yu Li, Zinan Tang, Conghui He, Rui Yan, Lijun Wu

Main category: cs.LG

TL;DR: ScaleDiff is a pipeline for scaling difficult math problem generation by identifying hard problems from datasets and training a specialized generator, achieving strong performance with cost-efficient models.

DetailsMotivation: Existing methods for automated math problem synthesis face challenges in scaling due to high computational costs, complex prompting, and limited difficulty levels.

Method: Use adaptive thinking model to identify difficult problems with single forward pass, then train DiffGen-8B generator on filtered difficult data to produce new problems at scale.

Result: Fine-tuning on ScaleDiff-Math dataset yields 11.3% performance increase, achieving 65.9% average accuracy on multiple benchmarks, outperforming recent LRMs using cost-efficient Qwen3-8B teacher.

Conclusion: ScaleDiff effectively transfers advanced reasoning capabilities without expensive teacher models and shows clear scaling benefits with increased difficult problem quantity.

Abstract: Large Reasoning Models (LRMs) have shown impressive capabilities in complex problem-solving, often benefiting from training on difficult mathematical problems that stimulate intricate reasoning. Recent efforts have explored automated synthesis of mathematical problems by prompting proprietary models or large-scale open-source models from seed data or inherent mathematical concepts. However, scaling up these methods remains challenging due to their high computational/API cost, complexity of prompting, and limited difficulty level of the generated problems. To overcome these limitations, we propose ScaleDiff, a simple yet effective pipeline designed to scale the creation of difficult problems. We efficiently identify difficult problems from existing datasets with only a single forward pass using an adaptive thinking model, which can perceive problem difficulty and automatically switch between “Thinking” and “NoThinking” modes. We then train a specialized difficult problem generator (DiffGen-8B) on this filtered difficult data, which can produce new difficult problems at scale, eliminating the need for complex, per-instance prompting and its associated high API costs. Fine-tuning Qwen2.5-Math-7B-Instruct on the ScaleDiff-Math dataset yields a substantial performance increase of 11.3% compared to the original dataset and achieves a 65.9% average accuracy on AIME'24, AIME'25, HMMT-Feb'25, BRUMO'25, and MATH500, outperforming recent strong LRMs like OpenThinker3. Notably, this performance is achieved using the cost-efficient Qwen3-8B model as a teacher, demonstrating that our pipeline can effectively transfer advanced reasoning capabilities without relying on larger, more expensive teacher models. Furthermore, we observe a clear scaling phenomenon in model performance on difficult benchmarks as the quantity of difficult problems increases. Code: https://github.com/QizhiPei/ScaleDiff.

[436] Predicting LLM Reasoning Performance with Small Proxy Model

Woosung Koh, Juyoung Suk, Sungjun Han, Se-Young Yun, Jay Shin

Main category: cs.LG

TL;DR: rBridge enables small proxy models (≤1B) to effectively predict large-model reasoning performance by aligning with pre-training objectives and target tasks, reducing dataset ranking costs by over 100x.

DetailsMotivation: Pre-training large language models is prohibitively expensive, and current approaches using small proxy models fail for reasoning capabilities which only emerge reliably in larger models (>7B parameters).

Method: rBridge weights negative log-likelihood with task alignment using reasoning traces from frontier models as gold labels, allowing small proxies to predict large-model reasoning.
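
The weighting idea reduces to a short expression; in the sketch below (a stand-in, not the released code) `align_weights` would come from frontier-model reasoning traces used as gold labels.

```python
# Hedged sketch: per-token NLL on the small proxy, reweighted toward
# tokens that the task-alignment signal marks as relevant.
import torch

def rbridge_score(proxy_logits, tokens, align_weights):
    """proxy_logits: (T, vocab); tokens: (T,); align_weights: (T,) in [0, 1]."""
    logp = torch.log_softmax(proxy_logits, dim=-1)
    nll = -logp[torch.arange(tokens.numel()), tokens]
    return (align_weights * nll).sum() / align_weights.sum()
```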

Result: rBridge reduces dataset ranking costs by over 100x, achieves strongest correlation across six reasoning benchmarks at 1B-32B scale, and zero-shot transfers predictive relationships across pre-training datasets.

Conclusion: rBridge provides a practical, cost-effective path for exploring reasoning-oriented pre-training by enabling small models to reliably predict large-model reasoning performance.

Abstract: Given the prohibitive cost of pre-training large language models, it is essential to leverage smaller proxy models to optimize datasets before scaling up. However, this approach becomes challenging for reasoning capabilities, which exhibit emergent behavior that only appears reliably at larger model sizes, often exceeding 7B parameters. To address this, we introduce rBridge, showing that small proxies ($\leq$1B) can effectively predict large-model reasoning by aligning more closely with (1) the pre-training objective and (2) the target task. rBridge achieves this by weighting negative log-likelihood with task alignment, using reasoning traces from frontier models as gold labels. In our experiments, rBridge (i) reduces dataset ranking costs by over 100x relative to the best baseline, (ii) achieves the strongest correlation across six reasoning benchmarks at 1B to 32B scale, and (iii) zero-shot transfers predictive relationships across pre-training datasets at 1B to 7B scale. These findings indicate that rBridge offers a practical path for exploring reasoning-oriented pre-training at lower cost.

[437] Efficient Ensemble Conditional Independence Test Framework for Causal Discovery

Zhengkang Guan, Kun Kuang

Main category: cs.LG

TL;DR: E-CIT is a framework that reduces computational cost of conditional independence tests by partitioning data into subsets, applying base CITs independently, and aggregating p-values using stable distributions.

DetailsMotivation: Constraint-based causal discovery requires numerous conditional independence tests which are computationally expensive, especially with large sample sizes.

Method: Divide-and-aggregate strategy: partition data into subsets, apply base CIT to each subset, aggregate p-values using novel method based on stable distributions.
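
The skeleton is easy to sketch. The paper's aggregation is grounded in stable distributions; the Cauchy combination rule below is one classical stable-distribution method, used here as a stand-in for the authors' tailored combiner.

```python
# Divide-and-aggregate skeleton with a Cauchy combination of subset
# p-values (tan((0.5 - p) * pi) is standard Cauchy under the null).
import numpy as np

def ecit(x, y, z, base_cit, n_subsets=10):
    """x, y, z: aligned numpy arrays; base_cit returns a p-value."""
    idx = np.array_split(np.random.permutation(len(x)), n_subsets)
    pvals = np.array([base_cit(x[i], y[i], z[i]) for i in idx])
    t = np.tan((0.5 - pvals) * np.pi).mean()
    return 0.5 - np.arctan(t) / np.pi   # combined p-value
```

Because each subset has fixed size, the per-subset CIT cost no longer grows with the total sample size, which is where the overall linear scaling comes from.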

Result: Reduces computational complexity to linear in sample size, achieves competitive performance, and shows improvement in complex testing scenarios on real-world datasets.

Conclusion: E-CIT provides an efficient plug-and-play framework that significantly reduces computational burden while maintaining or improving CIT performance.

Abstract: Constraint-based causal discovery relies on numerous conditional independence tests (CITs), but its practical applicability is severely constrained by the prohibitive computational cost, especially as CITs themselves have high time complexity with respect to the sample size. To address this key bottleneck, we introduce the Ensemble Conditional Independence Test (E-CIT), a general and plug-and-play framework. E-CIT operates on an intuitive divide-and-aggregate strategy: it partitions the data into subsets, applies a given base CIT independently to each subset, and aggregates the resulting p-values using a novel method grounded in the properties of stable distributions. This framework reduces the computational complexity of a base CIT to linear in the sample size when the subset size is fixed. Moreover, our tailored p-value combination method offers theoretical consistency guarantees under mild conditions on the subtests. Experimental results demonstrate that E-CIT not only significantly reduces the computational burden of CITs and causal discovery but also achieves competitive performance. Notably, it exhibits an improvement in complex testing scenarios, particularly on real-world datasets.

[438] Actor-Critic without Actor

Donghyeon Ki, Hee-Jun Ahn, Kyungyoon Kim, Byung-Jun Lee

Main category: cs.LG

TL;DR: Actor-Critic without Actor (ACA) is a lightweight RL framework that eliminates the explicit actor network by generating actions directly from the gradient field of a noise-level critic, combining simplicity with expressive multi-modal behavior.

DetailsMotivation: Traditional actor-critic methods rely on separate actor and critic networks, making training vulnerable to architectural decisions and hyperparameter tuning, which limits scalability. Diffusion models offer expressive policies but introduce additional complexity and computational burdens.

Method: ACA removes the explicit actor network and instead generates actions directly from the gradient field of a noise-level critic, eliminating actor training overhead while keeping policy improvement aligned with the critic’s value estimates.
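
Conceptually, action generation looks like annealed gradient ascent on a noise-level critic; the sketch below assumes a critic callable as `critic(state, action, noise_level=k)` with an `action_dim` attribute (both my assumptions).

```python
# Hedged sketch: sample an action by following the critic's gradient
# field from high to low noise level, with decaying exploration noise.
import torch

def sample_action(critic, state, steps=10, eta=0.1):
    a = torch.randn(state.shape[0], critic.action_dim, requires_grad=True)
    for k in reversed(range(steps)):               # high noise level -> low
        q = critic(state, a, noise_level=k).sum()
        (grad,) = torch.autograd.grad(q, a)
        noise = 0.01 * torch.randn_like(a) if k > 0 else 0.0
        a = (a + eta * grad + noise).detach().requires_grad_(True)
    return a.detach()
```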

Result: Through experiments on standard online RL benchmarks, ACA achieves more favorable learning curves and competitive performance compared to both standard actor-critic and state-of-the-art diffusion-based methods.

Conclusion: ACA provides a simple yet powerful solution for online RL that retains the ability to capture diverse, multi-modal behaviors without relying on diffusion-based actors, combining simplicity with expressiveness.

Abstract: Actor-critic methods constitute a central paradigm in reinforcement learning (RL), coupling policy evaluation with policy improvement. While effective across many domains, these methods rely on separate actor and critic networks, which makes training vulnerable to architectural decisions and hyperparameter tuning. Such complexity limits their scalability in settings that require large function approximators. Recently, diffusion models have been proposed as expressive policies that capture multi-modal behaviors and improve exploration, but they introduce additional design choices and computational burdens, hindering efficient deployment. We introduce Actor-Critic without Actor (ACA), a lightweight framework that eliminates the explicit actor network and instead generates actions directly from the gradient field of a noise-level critic. This design removes the algorithmic and computational overhead of actor training while keeping policy improvement tightly aligned with the critic's latest value estimates. Moreover, ACA retains the ability to capture diverse, multi-modal behaviors without relying on diffusion-based actors, combining simplicity with expressiveness. Through extensive experiments on standard online RL benchmarks, ACA achieves more favorable learning curves and competitive performance compared to both standard actor-critic and state-of-the-art diffusion-based methods, providing a simple yet powerful solution for online RL.

[439] FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction

Runqi Lin, Alasdair Paren, Suqin Yuan, Muyang Li, Philip Torr, Adel Bibi, Tongliang Liu

Main category: cs.LG

TL;DR: The paper analyzes why visual jailbreaking attacks on multimodal LLMs have poor cross-model transferability, finding they reside in high-sharpness regions. The authors propose FORCE method to correct feature over-reliance and improve transferability.

DetailsMotivation: Visual jailbreaking attacks can easily manipulate open-source MLLMs but fail to transfer to closed-source models, limiting vulnerability assessment capabilities.

Method: Analyzed loss landscape and feature representations, then proposed FORCE method that guides attacks to explore broader feasible regions across layer features and rescales frequency feature influence based on semantic content.

Result: FORCE method discovers flattened feasible regions for visual jailbreaking attacks, significantly improving cross-model transferability for evaluating closed-source MLLMs.

Conclusion: By eliminating non-generalizable reliance on layer and spectral features, the proposed approach effectively facilitates visual red-teaming evaluations against closed-source multimodal LLMs.

Abstract: The integration of new modalities enhances the capabilities of multimodal large language models (MLLMs) but also introduces additional vulnerabilities. In particular, simple visual jailbreaking attacks can manipulate open-source MLLMs more readily than sophisticated textual attacks. However, these underdeveloped attacks exhibit extremely limited cross-model transferability, failing to reliably identify vulnerabilities in closed-source MLLMs. In this work, we analyse the loss landscape of these jailbreaking attacks and find that the generated attacks tend to reside in high-sharpness regions, whose effectiveness is highly sensitive to even minor parameter changes during transfer. To further explain the high-sharpness localisations, we analyse their feature representations in both the intermediate layers and the spectral domain, revealing an improper reliance on narrow layer representations and semantically poor frequency components. Building on this, we propose a Feature Over-Reliance CorrEction (FORCE) method, which guides the attack to explore broader feasible regions across layer features and rescales the influence of frequency features according to their semantic content. By eliminating non-generalizable reliance on both layer and spectral features, our method discovers flattened feasible regions for visual jailbreaking attacks, thereby improving cross-model transferability. Extensive experiments demonstrate that our approach effectively facilitates visual red-teaming evaluations against closed-source MLLMs.

[440] Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLMs

Honglin Zhang, Qianyue Hao, Fengli Xu, Yong Li

Main category: cs.LG

TL;DR: RL fine-tuning enhances LLMs by increasing activation intensity and diversity, making information flow more redundant and flexible for better generalization, with notable differences between online RL methods and DPO.

DetailsMotivation: To understand why RL fine-tuning improves LLM capabilities beyond SFT alone, and to investigate the internal mechanisms that differentiate RL fine-tuning effects across various model families.

Method: Used edge attribution patching (EAP) to analyze internal differences in LLMs before and after RL fine-tuning across multiple model families, comparing online RL methods (PPO, GRPO) with preference-based approaches (DPO).

Result: RL fine-tuning causes two robust effects: increased activation intensity (more engaged pathways) and greater diversity in activation patterns (higher entropy, less concentrated distributions). DPO shows weaker/inconsistent changes compared to PPO/GRPO.

Conclusion: RL fine-tuning systematically alters LLM internal circuitry by making information flow more redundant and flexible, explaining its generalization advantage, with significant methodological distinctions between online RL and preference-based approaches.

Abstract: Large language models (LLMs) acquire extensive prior knowledge through large-scale pretraining and can be further enhanced via supervised fine-tuning (SFT) or reinforcement learning (RL)-based post-training. A growing body of evidence has shown that RL fine-tuning improves the capability of LLMs beyond what SFT alone achieves. However, the underlying mechanisms by which RL fine-tuning enhances the capability of LLMs with distinct intrinsic characteristics remain underexplored. In this study, we draw inspiration from prior work on edge attribution patching (EAP) to investigate the internal differences of LLMs before and after RL fine-tuning. Our analysis across multiple model families shows two robust effects of online RL post-training: (i) an overall increase in activation intensity, indicating that more internal pathways are engaged and their signals become stronger, and (ii) greater diversity in activation patterns, reflected by higher entropy and less concentrated edge distributions. These changes suggest that RL reshapes information flow to be both more redundant and more flexible, which may explain its advantage in generalization. Notably, models fine-tuned with Direct Preference Optimization (DPO) deviate from these trends, exhibiting substantially weaker or inconsistent internal changes compared to PPO- and GRPO-based training. Together, our findings provide a unified view of how RL fine-tuning systematically alters the internal circuitry of LLMs and highlight the methodological distinctions between online RL and preference-based approaches. Our code is open source at https://anonymous.4open.science/r/llm_rl_probing_analysis-F673.

[441] Physics of Learning: A Lagrangian perspective to different learning paradigms

Siyuan Guo, Bernhard Schölkopf

Main category: cs.LG

TL;DR: The paper proposes a unified framework for efficient learning systems based on least action principles from physics, deriving classic learning algorithms from first principles using a Learning Lagrangian.

DetailsMotivation: To build efficient learning systems that reach desired error thresholds with minimal observations, inspired by physics principles.

Method: Deriving learning algorithms by postulating that learning searches for stationary paths in a Learning Lagrangian, applying least action principles to derive Bellman’s equation and Adam optimizer.
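
The stationary-path postulate is the standard calculus-of-variations condition; with a Learning Lagrangian $\mathcal{L}$ over parameter trajectories $\theta(t)$, stationarity gives the Euler-Lagrange equation:

```latex
\delta \int_0^T \mathcal{L}\big(\theta(t), \dot{\theta}(t), t\big)\, dt = 0
\quad\Longrightarrow\quad
\frac{d}{dt}\,\frac{\partial \mathcal{L}}{\partial \dot{\theta}}
  - \frac{\partial \mathcal{L}}{\partial \theta} = 0 .
```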

Result: Shows that classic learning algorithms can be derived from first principles using the Lagrangian framework.

Conclusion: Learning algorithms can be systematically derived from fundamental physics-inspired principles, providing a unified theoretical foundation for efficient learning.

Abstract: We study the problem of building an efficient learning system. Efficient learning processes information in the least time, i.e., it reaches a desired error threshold with the fewest observations. Building upon least action principles from physics, we derive classic learning algorithms, Bellman's optimality equation in reinforcement learning, and the Adam optimizer in generative models from first principles, i.e., the Learning $\textit{Lagrangian}$. We postulate that learning searches for stationary paths in the Lagrangian, and learning algorithms are derivable by seeking the stationary trajectories.

[442] GeoRef: Referring Expressions in Geometry via Task Formulation, Synthetic Supervision, and Reinforced MLLM-based Solutions

Bing Liu, Wenqiang Yv, Xuzheng Yang, Shichang Wang, Junzhuo Liu, Peng Wang, Guoqing Wang, Yang Yang, Heng Tao Shen

Main category: cs.LG

TL;DR: The paper introduces GeoRef, a benchmark for Referring Expression Comprehension (REC) in geometric problems, which tests models’ ability to localize geometric elements based on natural language queries. It uses synthetic training data and shows that GRPO fine-tuning outperforms SFT, with a verify-and-regenerate mechanism further improving accuracy.

DetailsMotivation: AI-driven geometric problem solving requires accurate diagram interpretation and cross-modal grounding, but current models struggle with identifying geometric elements from natural language queries. This gap motivates the development of a specialized REC task for geometric problems.

Method: The authors create GeoRef benchmark from existing geometric corpora with diverse annotations. They generate synthetic training data using a geometric formal language and explore two fine-tuning approaches: Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO), plus a verify-and-regenerate mechanism for error correction.

Result: GRPO significantly outperforms SFT by better aligning with task-specific rewards. The verify-and-regenerate mechanism further boosts accuracy. Even state-of-the-art MLLMs struggle with this task, but models trained on GeoRef show measurable improvements on downstream geometric reasoning tasks.

Conclusion: Explicit evaluation and strengthening of geometric grounding is necessary for robust geometric problem solving. The REC task serves as a foundational capability for multimodal mathematical understanding, with GeoRef demonstrating broader value for improving geometric reasoning.

Abstract: AI-driven geometric problem solving is a complex vision-language task that requires accurate diagram interpretation, mathematical reasoning, and robust cross-modal grounding. A foundational yet underexplored capability for this task is the ability to identify and interpret geometric elements based on natural language queries. To address this, we introduce the task of Referring Expression Comprehension (REC) for geometric problems, which evaluates whether models can localize points, shapes, and spatial relations in diagrams in response to textual prompts. We present GeoRef, a benchmark dataset constructed from existing geometric problem corpora, featuring diverse, high-quality annotations and queries. Due to the lack of annotated data for this task, we generate a large-scale synthetic training dataset using a structured geometric formal language, enabling broad coverage of geometric concepts and facilitating model adaptation. We explore two fine-tuning approaches: Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). Our results show that GRPO significantly outperforms SFT by better aligning model behavior with task-specific rewards. Furthermore, we propose a verify-and-regenerate mechanism that detects incorrect predictions and re-infers answers using contextual reasoning history, further boosting accuracy. Notably, even state-of-the-art Multimodal Large Language Models (MLLMs) struggle with this task, underscoring the necessity of explicitly evaluating and strengthening geometric grounding as a prerequisite for robust geometric problem solving. Moreover, models trained on GeoRef demonstrate measurable improvements on downstream geometric reasoning tasks, highlighting the broader value of REC as a foundation for multimodal mathematical understanding.

[443] TyphoonMLA: A Mixed Naive-Absorb MLA Kernel For Shared Prefix

Ahmet Caner Yüzügüler, Ahmet Çelik, Jiawei Zhuang, Lukas Cavigelli

Main category: cs.LG

TL;DR: TyphoonMLA is a hybrid attention mechanism that combines naive and absorb formulations to optimize MLA performance, achieving up to 3x throughput improvement with minimal HBM overhead.

DetailsMotivation: Existing MLA implementations face limitations: naive kernels are efficient for training/prefill but absorb kernels used in decoding can't leverage data reuse opportunities like shared prefixes due to their compute-bound nature.

Method: TyphoonMLA uses a hybrid approach - applies naive formulation to compute-bound parts of attention calculations to leverage shared prefixes, while using absorb formulation for non-shared parts to reduce bandwidth requirements.
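
Recombining the two parts exactly relies on standard attention merging, which the sketch below illustrates (this is the generic combine step, not the TyphoonMLA kernel): each segment contributes its softmax-weighted partial output plus its log-sum-exp of scores, and the two merge into full attention.

```python
# Exact merge of attention computed over two disjoint key segments
# (e.g., shared prefix vs. non-shared suffix), FlashAttention-style.
import torch

def merge_attention(o1, lse1, o2, lse2):
    """o*: (..., d) partial softmax-weighted outputs; lse*: (...,) log-sum-exp."""
    m = torch.maximum(lse1, lse2)          # stabilize the exponentials
    w1 = torch.exp(lse1 - m).unsqueeze(-1)
    w2 = torch.exp(lse2 - m).unsqueeze(-1)
    return (w1 * o1 + w2 * o2) / (w1 + w2)
```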

Result: TyphoonMLA improves attention calculation throughput by up to 3x on NPU and 3.24x on GPUs, with only 3% overhead in HBM size.

Conclusion: The hybrid TyphoonMLA approach successfully combines the strengths of both naive and absorb formulations, enabling significant performance gains in MLA architectures while maintaining low memory overhead.

Abstract: Multi-Head Latent Attention (MLA) is a recent attention mechanism adopted in state-of-the-art LLMs such as DeepSeek-v3 and Kimi K2. Thanks to its novel formulation, MLA allows two functionally equivalent but computationally distinct kernel implementations: naive and absorb. While the naive kernels (e.g., FlashAttention) are typically preferred in training and prefill for their computational efficiency, existing decoding kernels (e.g., FlashMLA) rely on the absorb method to minimize HBM bandwidth usage. However, the compute-bound nature of the absorb implementations prohibits performance benefits from data reuse opportunities in attention calculations, such as shared prefixes. In this work, we introduce TyphoonMLA, a hybrid approach that combines naive and absorb formulations to harness the strengths of both. TyphoonMLA effectively leverages the shared prefix by applying the naive formulation to the compute-bound parts of attention calculations, while reducing the bandwidth requirements for non-shared parts by using the absorb formulation. As a result, TyphoonMLA improves the throughput of attention calculations in MLA architectures by up to 3x and 3.24x on NPU and GPUs, with only a 3% overhead in HBM size.

[444] SPREAD: Sampling-based Pareto front Refinement via Efficient Adaptive Diffusion

Sedjro Salomon Hotegni, Sebastian Peitz

Main category: cs.LG

TL;DR: SPREAD is a generative framework using Denoising Diffusion Probabilistic Models (DDPMs) for efficient multi-objective optimization, combining gradient-based updates with diversity mechanisms to compute Pareto sets.

DetailsMotivation: To address the challenge of developing efficient multi-objective optimization methods for large-scale and expensive problems by computing Pareto sets of optimal compromises between conflicting objectives.

Method: SPREAD learns a conditional diffusion process over decision space points and refines candidates using adaptive multiple gradient descent-inspired updates for convergence and Gaussian RBF-based repulsion for diversity during reverse diffusion steps.
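
One refinement step inside the reverse diffusion might look as follows (a sketch under my assumptions; the simple averaged-gradient direction stands in for the adaptive multiple-gradient-descent update):

```python
# Hedged sketch of SPREAD-style candidate refinement: a common descent
# direction over the objectives plus Gaussian-RBF repulsion for diversity.
import torch

def refine(x, objective_grads, lr=0.05, gamma=0.1, sigma=1.0):
    """x: (n, d) candidates; objective_grads: list of (n, d) gradients."""
    descent = torch.stack(objective_grads).mean(0)      # simple MGD surrogate
    diff = x.unsqueeze(1) - x.unsqueeze(0)              # (n, n, d) pairwise
    w = torch.exp(-diff.pow(2).sum(-1) / (2 * sigma**2))
    repel = (w.unsqueeze(-1) * diff).sum(1)             # push away from neighbors
    return x - lr * descent + gamma * repel
```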

Result: Empirical results on multi-objective optimization benchmarks show SPREAD matches or exceeds leading baselines in efficiency, scalability, and Pareto front coverage in both offline and Bayesian surrogate-based settings.

Conclusion: SPREAD provides an effective generative framework for multi-objective optimization that bridges the gap for large-scale problems through diffusion models and adaptive optimization techniques.

Abstract: Developing efficient multi-objective optimization methods to compute the Pareto set of optimal compromises between conflicting objectives remains a key challenge, especially for large-scale and expensive problems. To bridge this gap, we introduce SPREAD, a generative framework based on Denoising Diffusion Probabilistic Models (DDPMs). SPREAD first learns a conditional diffusion process over points sampled from the decision space and then, at each reverse diffusion step, refines candidates via a sampling scheme that uses an adaptive multiple gradient descent-inspired update for fast convergence alongside a Gaussian RBF-based repulsion term for diversity. Empirical results on multi-objective optimization benchmarks, including offline and Bayesian surrogate-based settings, show that SPREAD matches or exceeds leading baselines in efficiency, scalability, and Pareto front coverage.

[445] Structure-Attribute Transformations with Markov Chain Boost Graph Domain Adaptation

Zhen Liu, Yongtao Zhang, Shaobo Ren, Yuxin You

Main category: cs.LG

TL;DR: SATMC is a novel graph domain adaptation framework that addresses structural heterogeneity by sequentially aligning distributions through both graph structure and attribute transformations, achieving superior cross-network node classification performance.

DetailsMotivation: Traditional graph domain adaptation methods focus mainly on node attribute alignment but struggle with structural heterogeneity between different graph domains, leading to suboptimal distribution alignment.

Method: SATMC uses sequential alignment via structure and attribute transformations, incorporates a private domain information reduction mechanism, and employs empirical Wasserstein distance to enhance generalization.

Result: Extensive experiments on nine pairs of cross-domain datasets show SATMC outperforms state-of-the-art methods in cross-network node classification, with theoretical proofs demonstrating tighter error bounds.

Conclusion: SATMC effectively addresses structural heterogeneity in graph domain adaptation and achieves superior performance in cross-network node classification tasks compared to existing methods.

Abstract: Graph domain adaptation has gained significant attention in label-scarce scenarios across different graph domains. Traditional approaches to graph domain adaptation primarily focus on transforming node attributes over raw graph structures and aligning the distributions of the transformed node features across networks. However, these methods often struggle with the underlying structural heterogeneity between distinct graph domains, which leads to suboptimal distribution alignment. To address this limitation, we propose Structure-Attribute Transformation with Markov Chain (SATMC), a novel framework that sequentially aligns distributions across networks via both graph structure and attribute transformations. To mitigate the negative influence of domain-private information and further enhance the model’s generalization, SATMC introduces a private domain information reduction mechanism and an empirical Wasserstein distance. Theoretical proofs suggest that SATMC can achieve a tighter error bound for cross-network node classification compared to existing graph domain adaptation methods. Extensive experiments on nine pairs of publicly available cross-domain datasets show that SATMC outperforms state-of-the-art methods in the cross-network node classification task. The code is available at https://github.com/GiantZhangYT/SATMC.

[446] GraphUniverse: Enabling Systematic Evaluation of Inductive Generalization

Louis Van Langendonck, Guillermo Bernárdez, Nina Miolane, Pere Barlet-Ros

Main category: cs.LG

TL;DR: GraphUniverse is a framework for generating graph families to systematically evaluate inductive generalization in graph learning, revealing that transductive performance poorly predicts inductive generalization and robustness varies significantly with architecture and graph properties.

DetailsMotivation: Existing graph learning benchmarks are limited to single-graph transductive settings, lacking systematic evaluation of inductive generalization across diverse graph structures and distribution shifts.

Method: Developed GraphUniverse framework that generates graphs with persistent semantic communities while controlling structural properties like homophily and degree distributions, enabling controlled distribution shift experiments.

Result: Benchmarking various architectures (GNNs, graph transformers, topological architectures) showed transductive performance doesn’t predict inductive generalization, and robustness to distribution shift depends heavily on both model architecture and initial graph regime.

Conclusion: GraphUniverse enables systematic evaluation of inductive generalization and can facilitate development of robust, generalizable graph architectures including foundation models.

Abstract: A fundamental challenge in graph learning is understanding how models generalize to new, unseen graphs. While synthetic benchmarks offer controlled settings for analysis, existing approaches are confined to single-graph, transductive settings where models train and test on the same graph structure. Addressing this gap, we introduce GraphUniverse, a framework for generating entire families of graphs to enable the first systematic evaluation of inductive generalization at scale. Our core innovation is the generation of graphs with persistent semantic communities, ensuring conceptual consistency while allowing fine-grained control over structural properties like homophily and degree distributions. This enables crucial but underexplored robustness tests, such as performance under controlled distribution shifts. Benchmarking a wide range of architectures – from GNNs to graph transformers and topological architectures – reveals that strong transductive performance is a poor predictor of inductive generalization. Furthermore, we find that robustness to distribution shift is highly sensitive not only to model architecture choice but also to the initial graph regime (e.g., high vs. low homophily). Beyond benchmarking, GraphUniverse’s flexibility and scalability can facilitate the development of robust and truly generalizable architectures – including next-generation graph foundation models. An interactive demo is available at https://graphuniverse.streamlit.app.
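
As a rough illustration of generating graph families with persistent communities and a tunable homophily knob, here is an SBM-style sampler. The parameterization is our own simplification, not GraphUniverse's actual generator.

```python
import numpy as np

def sample_graph(n=200, k=4, homophily=0.8, avg_degree=10, seed=0):
    # SBM-style toy sampler: communities are persistent labels, and `homophily`
    # is the fraction of a node's expected degree that stays inside its own
    # community (a simplified knob, not GraphUniverse's exact parameterization).
    rng = np.random.default_rng(seed)
    z = rng.integers(0, k, size=n)                        # community labels
    p_in = homophily * avg_degree / (n / k)               # within-community edge prob.
    p_out = (1 - homophily) * avg_degree / (n - n / k)    # cross-community edge prob.
    P = np.where(z[:, None] == z[None, :], p_in, p_out)
    A = (rng.random((n, n)) < P).astype(int)
    A = np.triu(A, 1)                                     # drop self-loops and lower half
    return A + A.T, z                                     # undirected adjacency + labels
```

Sampling many graphs with the same label semantics but different `homophily` values gives the kind of controlled distribution-shift test the paper describes.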

[447] Teaching RL Agents to Act Better: VLM as Action Advisor for Online Reinforcement Learning

Xiefeng Wu, Jing Zhao, Shu Zhang, Mingyu Hu

Main category: cs.LG

TL;DR: VARL is a framework that uses vision-language models to provide action suggestions for reinforcement learning agents, improving sample efficiency without changing optimality or convergence.

DetailsMotivation: Online RL is time-consuming due to massive interaction steps needed, and VLA policies require task-specific expert demonstrations for effective deployment. VARL aims to leverage VLM domain knowledge to provide action suggestions instead of heuristic rewards.

Method: VARL uses vision-language models as action advisors that suggest actions to RL agents. The suggested actions increase sample diversity and improve sample efficiency, particularly in sparse-reward tasks, without altering the RL algorithm’s optimality guarantees.

Result: VARL greatly improves sample efficiency across diverse environments and agent settings without introducing significant computational overhead. It makes RL feasible to apply directly from scratch in real-world environments.

Conclusion: VARL provides a general framework for online reinforcement learning that leverages VLM domain knowledge through action suggestions rather than reward shaping, maintaining convergence properties while significantly improving sample efficiency.

Abstract: Online reinforcement learning in complex tasks is time-consuming, as massive interaction steps are needed to learn the optimal Q-function. Vision-language action (VLA) policies represent a promising direction for solving diverse tasks; however, their performance on low-level control remains limited, and effective deployment often requires task-specific expert demonstrations for fine-tuning. In this paper, we propose VARL (VLM as Action advisor for online Reinforcement Learning), a framework that leverages the domain knowledge of vision-language models (VLMs) to provide action suggestions for reinforcement learning agents. Unlike previous methods, VARL provides action suggestions rather than designing heuristic rewards, thereby guaranteeing unchanged optimality and convergence. The suggested actions increase sample diversity and ultimately improve sample efficiency, especially in sparse-reward tasks. To validate the effectiveness of VARL, we evaluate it across diverse environments and agent settings. Results show that VARL greatly improves sample efficiency without introducing significant computational overhead. These advantages make VARL a general framework for online reinforcement learning and make it feasible to directly apply reinforcement learning from scratch in real-world environments.
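
The advisor mechanism can be pictured as advice-biased exploration: suggestions shape which data is collected, while the reward and the update rule are untouched, which is why the underlying guarantees are preserved. A minimal sketch follows; the `policy` and `vlm_advisor` interfaces are hypothetical, not VARL's API.

```python
import random

def select_action(policy, vlm_advisor, obs, eps_advice=0.3):
    # With probability eps_advice, act on the VLM's suggestion instead of the
    # policy's own exploratory action. Only data collection changes; rewards
    # and the RL update rule stay exactly as in the base algorithm.
    if random.random() < eps_advice:
        return vlm_advisor.suggest(obs)   # hypothetical: VLM maps observation -> action
    return policy.sample(obs)             # ordinary exploratory action
```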

[448] EvoMail: Self-Evolving Cognitive Agents for Adaptive Spam and Phishing Email Defense

Wei Huang, De-Tian Chu, Lin-Yuan Bai, Wei Kang, Hai-Tao Zhang, Bo Li, Zhi-Mo Han, Jing Ge, Hai-Feng Lin

Main category: cs.LG

TL;DR: EvoMail is a self-evolving cognitive agent framework that uses heterogeneous email graphs and adversarial self-evolution to detect modern spam and phishing attacks, outperforming traditional methods.

DetailsMotivation: Traditional spam detection systems struggle with multi-modal campaigns that combine text, URLs, headers, and attachments, and cannot adapt quickly to evolving adversarial tactics.

Method: Constructs unified heterogeneous email graphs fusing text, metadata, and embedded resources. Uses Cognitive Graph Neural Network enhanced by LLM for context-aware reasoning. Implements adversarial self-evolution loop with red-team generating evasion tactics and blue-team learning from failures.

Result: Extensive experiments on real-world datasets show EvoMail consistently outperforms state-of-the-art baselines in detection accuracy, adaptability, and interpretability.

Conclusion: EvoMail demonstrates potential as a resilient and explainable defense framework against next-generation spam and phishing threats.

Abstract: Modern email spam and phishing attacks have evolved far beyond keyword blacklists or simple heuristics. Adversaries now craft multi-modal campaigns that combine natural-language text with obfuscated URLs, forged headers, and malicious attachments, adapting their strategies within days to bypass filters. Traditional spam detection systems, which rely on static rules or single-modality models, struggle to integrate heterogeneous signals or to continuously adapt, leading to rapid performance degradation. We propose EvoMail, a self-evolving cognitive agent framework for robust detection of spam and phishing. EvoMail first constructs a unified heterogeneous email graph that fuses textual content, metadata (headers, senders, domains), and embedded resources (URLs, attachments). A Cognitive Graph Neural Network enhanced by a Large Language Model (LLM) performs context-aware reasoning across these sources to identify coordinated spam campaigns. Most critically, EvoMail engages in an adversarial self-evolution loop: a “red-team” agent generates novel evasion tactics – such as character obfuscation or AI-generated phishing text – while the “blue-team” detector learns from failures, compresses experiences into a memory module, and reuses them for future reasoning. Extensive experiments on real-world datasets (Enron-Spam, Ling-Spam, SpamAssassin, and TREC) and synthetic adversarial variants demonstrate that EvoMail consistently outperforms state-of-the-art baselines in detection accuracy, adaptability to evolving spam tactics, and interpretability of reasoning traces. These results highlight EvoMail’s potential as a resilient and explainable defense framework against next-generation spam and phishing threats.
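
A schematic of the unified heterogeneous email graph, with a hypothetical record schema; EvoMail's actual node typing and feature extraction are richer.

```python
import networkx as nx

def email_graph(emails):
    # emails: list of dicts, e.g. {"id": ..., "text": ..., "sender": ..., "urls": [...]}
    g = nx.MultiDiGraph()
    for e in emails:
        g.add_node(e["id"], kind="email", text=e["text"])
        g.add_node(e["sender"], kind="sender")
        g.add_edge(e["sender"], e["id"], rel="sent")
        domain = e["sender"].split("@")[-1]
        g.add_node(domain, kind="domain")
        g.add_edge(e["sender"], domain, rel="from_domain")
        for u in e.get("urls", []):
            g.add_node(u, kind="url")
            g.add_edge(e["id"], u, rel="contains")
    return g
```

Shared URL or domain nodes connect otherwise unrelated emails, which is what lets a GNN surface coordinated campaigns.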

[449] Sparse Representations Improve Adversarial Robustness of Neural Network Classifiers

Killian Steunou, Sigurd Saue, Théo Druilhe

Main category: cs.LG

TL;DR: This paper revisits linear dimensionality reduction as a defense against adversarial attacks, comparing PCA and sparse PCA (SPCA) with theoretical analysis and empirical validation showing SPCA provides better robustness.

DetailsMotivation: Deep neural networks are vulnerable to adversarial perturbations, and the paper aims to explore simple, data-adapted defenses using linear dimensionality reduction methods.

Method: Empirical comparison of PCA and SPCA as front-end feature extractors for classifiers, complemented by theoretical analysis including robustness certificates for linear heads and Lipschitz composition arguments for non-linear heads.

Result: SPCA consistently degrades more gracefully than PCA under strong white-box and black-box attacks while maintaining competitive clean accuracy. Theoretical analysis shows sparser projections reduce adversarial leverage.

Conclusion: Sparsity in projections reduces adversarial vulnerability, and SPCA provides effective robustness benefits that persist beyond linear settings, offering a simple yet effective defense mechanism.

Abstract: Deep neural networks perform remarkably well on image classification tasks but remain vulnerable to carefully crafted adversarial perturbations. This work revisits linear dimensionality reduction as a simple, data-adapted defense. We empirically compare standard Principal Component Analysis (PCA) with its sparse variant (SPCA) as front-end feature extractors for downstream classifiers, and we complement these experiments with a theoretical analysis. On the theory side, we derive exact robustness certificates for linear heads applied to SPCA features: for both $\ell_\infty$ and $\ell_2$ threat models (binary and multiclass), the certified radius grows as the dual norms of $W^\top u$ shrink, where $W$ is the projection and $u$ the head weights. We further show that for general (non-linear) heads, sparsity reduces operator-norm bounds through a Lipschitz composition argument, predicting lower input sensitivity. Empirically, with a small non-linear network after the projection, SPCA consistently degrades more gracefully than PCA under strong white-box and black-box attacks while maintaining competitive clean accuracy. Taken together, the theory identifies the mechanism (sparser projections reduce adversarial leverage) and the experiments verify that this benefit persists beyond the linear setting. Our code is available at https://github.com/killian31/SPCARobustness.
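
The linear-head certificate has a transparent form. Under one shape convention (features W^T x, binary head u; the paper states the general result for both ℓ∞ and ℓ2 threat models and for multiclass heads), the certified ℓ∞ radius is the margin divided by the ℓ1 norm of W u:

```python
import numpy as np

def certified_radius_linf(W, u, x):
    # Binary linear head on projected features: f(x) = u @ (W.T @ x).
    # Since f(x + d) - f(x) = (W @ u) @ d, the sign of f cannot flip while
    # ||d||_inf < |f(x)| / ||W @ u||_1   (l1 is the dual norm of l_inf).
    # Sparser projections W (as SPCA encourages) shrink ||W @ u||_1 and so
    # enlarge the certified radius: the "reduced adversarial leverage" effect.
    margin = abs(u @ (W.T @ x))
    return margin / np.abs(W @ u).sum()
```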

[450] LAVA: Explainability for Unsupervised Latent Embeddings

Ivan Stresec, Joana P. Gonçalves

Main category: cs.LG

TL;DR: LAVA is a post-hoc model-agnostic method that explains local embedding organization in unsupervised learning by analyzing feature correlations within latent space neighborhoods and identifying recurring patterns across the entire space.

DetailsMotivation: Unsupervised black-box models are difficult to interpret, especially for discovery tasks where understanding the multi-dimensional latent embedding structure is crucial. Current explainability methods for unsupervised learning produce explanations that are either too fine-grained or too reductive, and lack automated strategies for relating similar samples based on their latent proximity.

Method: LAVA represents the latent space as a series of neighborhoods (localities) described by correlations between original features, then reveals recurring correlation patterns across the entire latent space. It works with manifold learning methods that produce no mapping function, using only the relative spatial organization of embeddings.

Result: Applied to UMAP embeddings of MNIST and single-cell kidney datasets, LAVA captures relevant feature associations and identifies visually and biologically meaningful local patterns that are shared among seemingly distant regions of the latent spaces.

Conclusion: LAVA provides an effective approach for explaining local embedding organization in unsupervised learning by bridging the gap between input features and latent space structure, enabling meaningful interpretation of complex embedding relationships.

Abstract: Unsupervised black-box models can be drivers of scientific discovery, but remain difficult to interpret. Crucially, discovery hinges on understanding the model output, which is often a multi-dimensional latent embedding rather than a well-defined target. While explainability for supervised learning usually seeks to uncover how input features are used to predict a target, its unsupervised counterpart should relate input features to the structure of the learned latent space. Adaptations of supervised model explainability for unsupervised learning provide either single-sample or dataset-wide summary explanations. However, without automated strategies of relating similar samples to one another guided by their latent proximity, explanations remain either too fine-grained or too reductive to be meaningful. This is especially relevant for manifold learning methods that produce no mapping function, leaving us only with the relative spatial organization of their embeddings. We introduce Locality-Aware Variable Associations (LAVA), a post-hoc model-agnostic method designed to explain local embedding organization through its relationship with the input features. To achieve this, LAVA represents the latent space as a series of localities (neighborhoods) described in terms of correlations between the original features, and then reveals recurring patterns of correlations across the entire latent space. Based on UMAP embeddings of MNIST and a single-cell kidney dataset, we show that LAVA captures relevant feature associations, with visually and biologically relevant local patterns shared among seemingly distant regions of the latent spaces.
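
The locality construction can be approximated in a few lines: take latent k-NN neighborhoods and compute one feature-correlation matrix per neighborhood. LAVA's subsequent mining of recurring patterns across localities is not reproduced here, and the neighborhood size is an illustrative choice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def locality_correlations(X_features, Z_embedding, n_neighbors=30):
    # For each point, find its neighborhood in the *embedding* space, then
    # correlate the *original* features within that neighborhood.
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(Z_embedding)
    _, idx = nn.kneighbors(Z_embedding)
    mats = []
    for neigh in idx:
        C = np.corrcoef(X_features[neigh], rowvar=False)  # (p, p) feature-feature corr.
        mats.append(C)
    return np.stack(mats)   # one correlation matrix per locality
```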

[451] CAD-Tokenizer: Towards Text-based CAD Prototyping via Modality-Specific Tokenization

Ruiyu Wang, Shizhao Sun, Weijian Ma, Jiang Bian

Main category: cs.LG

TL;DR: CAD-Tokenizer is a multimodal tokenization framework that represents CAD sequences with primitive-aware tokens, enabling better text-guided CAD prototyping by capturing geometric structure that standard LLM tokenizers miss.

DetailsMotivation: Standard LLM tokenizers decompose CAD sequences into natural-language word pieces, failing to capture primitive-level CAD semantics and hindering attention modules from modeling geometric structure, which limits text-guided CAD prototyping capabilities.

Method: Propose CAD-Tokenizer, a sequence-based VQ-VAE framework with primitive-level pooling and constrained decoding that produces modality-specific tokens aligned with CAD’s structural nature.

Result: CAD-Tokenizer significantly improves instruction following and generation quality in unified text-guided CAD prototyping, achieving better quantitative and qualitative performance over general-purpose LLMs and task-specific baselines.

Conclusion: A multimodal tokenization strategy aligned with CAD’s primitive and structural nature provides more effective representations for text-guided CAD prototyping, enabling better modeling of geometric structure.

Abstract: Computer-Aided Design (CAD) is a foundational component of industrial prototyping, where models are defined not by raw coordinates but by construction sequences such as sketches and extrusions. This sequential structure enables both efficient prototype initialization and subsequent editing. Text-guided CAD prototyping, which unifies Text-to-CAD generation and CAD editing, has the potential to streamline the entire design pipeline. However, prior work has not explored this setting, largely because standard large language model (LLM) tokenizers decompose CAD sequences into natural-language word pieces, failing to capture primitive-level CAD semantics and hindering attention modules from modeling geometric structure. We conjecture that a multimodal tokenization strategy, aligned with CAD’s primitive and structural nature, can provide more effective representations. To this end, we propose CAD-Tokenizer, a framework that represents CAD data with modality-specific tokens using a sequence-based VQ-VAE with primitive-level pooling and constrained decoding. This design produces compact, primitive-aware representations that align with CAD’s structural nature. Applied to unified text-guided CAD prototyping, CAD-Tokenizer significantly improves instruction following and generation quality, achieving better quantitative and qualitative performance over both general-purpose LLMs and task-specific baselines.
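
Primitive-level pooling, the step that turns word-piece-scale tokens into primitive-aware latents, might look like the following sketch; the mapping from tokens to primitive indices is assumed to come from the CAD sequence parser, and the mean-pooling choice is our assumption rather than the paper's specification.

```python
import torch

def primitive_pool(token_embs: torch.Tensor, prim_ids: torch.Tensor) -> torch.Tensor:
    # token_embs: (T, d) embeddings of low-level sequence tokens.
    # prim_ids:   (T,) long tensor mapping each token to its primitive
    #             (e.g. one sketch or extrude step) -- assumed given.
    # Returns one pooled latent per primitive, ready for a VQ codebook lookup.
    n_prims = int(prim_ids.max()) + 1
    d = token_embs.size(-1)
    sums = torch.zeros(n_prims, d).index_add_(0, prim_ids, token_embs)
    counts = torch.bincount(prim_ids, minlength=n_prims).clamp(min=1)
    return sums / counts[:, None]
```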

[452] GRPO is Secretly a Process Reward Model

Michael Sullivan

Main category: cs.LG

TL;DR: The paper shows that GRPO RL algorithm induces a non-trivial process reward model (PRM) and identifies a flaw in its objective. A modified algorithm (λ-GRPO) is proposed to fix this issue, achieving better performance than standard GRPO.

DetailsMotivation: To understand and improve the GRPO algorithm by leveraging its hidden PRM structure, avoiding the need for costly explicit PRMs while boosting model performance efficiently.

Method: Theoretical proof that GRPO induces a PRM under certain assumptions, empirical validation of these assumptions, identification of GRPO’s flaw (non-uniform process steps), and proposal of λ-GRPO modification to mitigate the issue.

Result: λ-GRPO achieves higher validation accuracy and better downstream reasoning task performance than standard GRPO, reaching peak performance more rapidly with negligible impact on training time and cost.

Conclusion: The built-in PRM structure in vanilla GRPO can be leveraged to boost performance effectively, questioning the advantage of costly explicit PRMs for GRPO.

Abstract: We prove theoretically that the GRPO RL algorithm induces a non-trivial process reward model (PRM), under certain assumptions regarding within-group overlap of token sequences across completions. We then show empirically that these assumptions are met under real-world conditions: GRPO does in fact induce a non-trivial PRM. Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective: non-uniformly distributed process steps hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($\lambda$-GRPO), and show that LLMs trained with $\lambda$-GRPO achieve higher validation accuracy and performance on downstream reasoning tasks – and reach peak performance more rapidly – than LLMs trained with standard GRPO. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO: we show that it is possible to instead leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance with a negligible impact on training time and cost.
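
For reference, the vanilla GRPO group-relative advantage is a within-group standardization of outcome rewards; the flaw the paper identifies concerns how these sequence-level signals distribute over non-uniform process steps, and its λ reweighting is not reproduced here.

```python
import numpy as np

def grpo_advantages(group_rewards):
    # Vanilla GRPO: standardize outcome rewards within the group of completions
    # sampled for the same prompt; the resulting scalar is applied to every
    # token of its completion.
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```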

[453] DATS: Distance-Aware Temperature Scaling for Calibrated Class-Incremental Learning

Giuseppe Serra, Florian Buettner

Main category: cs.LG

TL;DR: DATS (Distance-Aware Temperature Scaling) is a novel continual learning method that addresses calibration issues by adapting temperature parameters based on task distance, without requiring task information at test time.

DetailsMotivation: Existing continual learning approaches use a single temperature parameter across all tasks, which overlooks task-specific differences and leads to inconsistent calibration errors. Safety-critical applications require reliable uncertainty communication through calibrated confidence scores.

Method: Proposes DATS which combines prototype-based distance estimation with distance-aware calibration to infer task proximity and assign adaptive temperatures without prior task information at deployment.

Result: Extensive evaluation on standard benchmarks and real-world biomedical datasets shows DATS is stable, reliable, and consistently reduces calibration error across tasks compared to state-of-the-art approaches.

Conclusion: DATS provides a principled solution for task-adaptive calibration in continual learning, addressing the limitations of single-temperature approaches and improving uncertainty communication in safety-critical applications.

Abstract: Continual Learning (CL) has recently been gaining attention for its ability to enable a single model to learn incrementally from a sequence of new classes. In this scenario, it is important to keep consistent predictive performance across all the classes and prevent the so-called Catastrophic Forgetting (CF). However, in safety-critical applications, predictive performance alone is insufficient. Predictive models should also be able to reliably communicate their uncertainty in a calibrated manner - that is, with confidence scores aligned to the true frequencies of target events. Existing approaches in CL address calibration primarily from a data-centric perspective, relying on a single temperature shared across all tasks. Such solutions overlook task-specific differences, leading to large fluctuations in calibration error across tasks. For this reason, we argue that a more principled approach should adapt the temperature according to the distance to the current task. However, the unavailability of the task information at test time/during deployment poses a major challenge to achieving the intended objective. For this, we propose Distance-Aware Temperature Scaling (DATS), which combines prototype-based distance estimation with distance-aware calibration to infer task proximity and assign adaptive temperatures without prior task information. Through extensive empirical evaluation on both standard benchmarks and real-world, imbalanced datasets taken from the biomedical domain, our approach proves stable, reliable, and consistent in reducing calibration error across tasks compared to state-of-the-art approaches.
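
The idea can be sketched as distance-to-prototype temperature interpolation. The squashing function and temperature endpoints below are hypothetical placeholders for the paper's calibrated mapping.

```python
import numpy as np

def adaptive_temperature(z, prototypes, t_near=1.0, t_far=2.5):
    # Estimate task proximity from the distance to the nearest class prototype
    # and interpolate a temperature: confident (near) samples get a sharper
    # softmax, distant ones a flatter, more uncertain one.
    d = np.linalg.norm(prototypes - z, axis=1).min()
    w = 1.0 - np.exp(-d)                  # 0 on-prototype, tends to 1 far away
    return t_near + w * (t_far - t_near)

def calibrated_softmax(logits, T):
    e = np.exp((logits - logits.max()) / T)
    return e / e.sum()
```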

[454] Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just What They Say

Jacob Fein-Ashley, Dhruv Parikh, Rajgopal Kannan, Viktor Prasanna

Main category: cs.LG

TL;DR: MoT is a method for latent-level collaboration among heterogeneous LLMs using a global routing scheme where a lightweight router selects top experts and enables cross-attention in a shared latent space.

DetailsMotivation: Existing multi-LLM approaches have limitations: routing to single experts loses complementary strengths, aggregation methods are costly, and weight fusion requires architectural homogeneity. There's a need for efficient collaboration among specialized LLMs.

Method: A lightweight router selects top-K experts and designates a primary expert. Interaction layers project hidden states into a shared latent space where the primary expert performs cross-attention over selected peers. Only router and interaction layers are trained with a joint objective.

Result: MoT surpasses state-of-the-art Avengers by +0.38% on in-distribution and +2.92% on out-of-distribution benchmarks, outperforms best single model, with single-pass inference and runtime comparable to routing baselines.

Conclusion: MoT provides a simple latent-space mechanism for combining heterogeneous LLMs, offering practical multi-LLM collaboration without iterative aggregation overheads.

Abstract: Open-source Large Language Models (LLMs) increasingly specialize by domain (e.g., math, code, general reasoning), motivating systems that leverage complementary strengths across models. Prior multi-LLM approaches either (i) route a query to one or a few experts and generate independently, (ii) aggregate outputs from each model via costly multi-turn exchanges, or (iii) fuse weights into a single model-typically requiring architectural homogeneity. We introduce Mixture of Thoughts (MoT), a simple method for latent-level collaboration among heterogeneous experts under a global routing scheme. For each query, a lightweight router selects top-$K$ experts and designates a primary expert; uniformly placed interaction layers project hidden states into a shared latent space where the primary expert performs cross-attention over its active (selected) peers. Pre-trained experts remain frozen; only the router and the lightweight interaction layers are trained with a novel joint training objective that improves both the expert selection and inter-expert collaboration. Across five in-distribution (ID) and three out-of-distribution (OOD) benchmarks, MoT surpasses the current routing and aggregation-based state-of-the-art, Avengers, by +0.38% and +2.92%, respectively. Further, MoT significantly outperforms the best-performing single model. It achieves this with single-pass inference, runtime comparable to routing baselines, and none of the overheads of iterative aggregation. MoT offers a simple latent-space mechanism for combining heterogeneous LLMs, a practical step toward broader multi-LLM collaboration. Our code is publicly available at https://github.com/jacobfa/mot.
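
A minimal interaction layer, assuming for simplicity that all experts share the same hidden width; genuinely heterogeneous experts would need per-expert input projections, and the residual wiring below is an illustrative choice rather than MoT's exact design.

```python
import torch
import torch.nn as nn

class InteractionLayer(nn.Module):
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.to_shared = nn.Linear(d_model, d_model)   # projection into the shared space
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, h_primary, peer_states):
        # h_primary:   (B, T, d) hidden states of the designated primary expert
        # peer_states: list of (B, T, d) hidden states from the other selected experts
        q = self.to_shared(h_primary)
        kv = torch.cat([self.to_shared(h) for h in peer_states], dim=1)
        fused, _ = self.attn(q, kv, kv)                # primary cross-attends over peers
        return h_primary + fused                       # residual back into the primary
```

Only layers like this and the router would be trained; the expert backbones stay frozen, which keeps the method cheap relative to weight fusion or multi-turn aggregation.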

[455] A Unified Framework for Diffusion Model Unlearning with f-Divergence

Nicola Novello, Federico Fontana, Luigi Cinque, Deniz Gunduz, Andrea M. Tonello

Main category: cs.LG

TL;DR: A unified f-divergence framework for machine unlearning in diffusion models that generalizes MSE-based approaches and allows flexible divergence selection for optimal unlearning-performance trade-offs.

DetailsMotivation: Existing unlearning methods for text-to-image models rely on MSE minimization, which is limited. The authors aim to generalize this approach using f-divergences for better unlearning control and concept preservation.

Method: Propose a unified f-divergence framework where any f-divergence can be used instead of just MSE. Analyze different divergences’ effects on convergence and unlearning quality.

Result: The framework shows that MSE is a special case of f-divergence and demonstrates how different divergences offer trade-offs between aggressive unlearning and concept preservation.

Conclusion: The unified f-divergence framework provides a flexible paradigm for selecting optimal divergences tailored to specific applications, improving upon MSE-based unlearning methods.

Abstract: Machine unlearning aims to remove specific knowledge from a trained model. While diffusion models (DMs) have shown remarkable generative capabilities, existing unlearning methods for text-to-image (T2I) models often rely on minimizing the mean squared error (MSE) between the output distribution of a target and an anchor concept. We show that this MSE-based approach is a special case of a unified $f$-divergence-based framework, in which any $f$-divergence can be utilized. We analyze the benefits of using different $f$-divergences, which mainly impact the convergence properties of the algorithm and the quality of unlearning. The proposed unified framework offers a flexible paradigm that allows selecting the optimal divergence for a specific application, balancing different trade-offs between aggressive unlearning and concept preservation.
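
Schematically (our paraphrase of the setup, not the paper's notation), the objective replaces the MSE between target- and anchor-conditioned outputs with a general f-divergence:

```latex
\min_{\theta}\;
\mathbb{E}\Big[\, D_f\big(\, p_\theta(\cdot \mid c_{\mathrm{target}}) \,\big\|\, p(\cdot \mid c_{\mathrm{anchor}}) \,\big) \Big],
\qquad
D_f(P \,\|\, Q) \;=\; \mathbb{E}_{Q}\!\left[ f\!\left( \frac{\mathrm{d}P}{\mathrm{d}Q} \right) \right],
```

with the MSE-based objective recovered for a particular choice of $f$, per the abstract.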

[456] Inverse Reinforcement Learning Using Just Classification and a Few Regressions

Lars van der Laan, Nathan Kallus, Aurélien Bibaut

Main category: cs.LG

TL;DR: The paper presents a simplified approach to inverse reinforcement learning (IRL) by reducing it to two supervised learning problems: probabilistic classification to estimate behavior policy and iterative regression to solve a linear fixed-point equation.

DetailsMotivation: Traditional IRL methods involve complex inner-loop optimization, repeated dynamic programming, or adversarial training, which complicate the use of modern function approximators like neural networks.

Method: The authors show that softmax IRL’s population maximum-likelihood solution is characterized by a linear fixed-point equation involving the behavior policy. This reduces IRL to two supervised learning tasks: behavior policy estimation via classification and solving the fixed point via iterative regression.

Result: The proposed method is simple, modular, and works across different function approximation classes. It achieves competitive or superior performance compared to MaxEnt IRL while being more practical.

Conclusion: This work provides a streamlined approach to IRL that eliminates complex optimization procedures, making it more accessible for use with modern function approximators while maintaining strong theoretical guarantees and empirical performance.

Abstract: Inverse reinforcement learning (IRL) aims to explain observed behavior by uncovering an underlying reward. In the maximum-entropy or Gumbel-shocks-to-reward frameworks, this amounts to fitting a reward function and a soft value function that together satisfy the soft Bellman consistency condition and maximize the likelihood of observed actions. While this perspective has had enormous impact in imitation learning for robotics and understanding dynamic choices in economics, practical learning algorithms often involve delicate inner-loop optimization, repeated dynamic programming, or adversarial training, all of which complicate the use of modern, highly expressive function approximators like neural nets and boosting. We revisit softmax IRL and show that the population maximum-likelihood solution is characterized by a linear fixed-point equation involving the behavior policy. This observation reduces IRL to two off-the-shelf supervised learning problems: probabilistic classification to estimate the behavior policy, and iterative regression to solve the fixed point. The resulting method is simple and modular across function approximation classes and algorithms. We provide a precise characterization of the optimal solution, a generic oracle-based algorithm, finite-sample error bounds, and empirical results showing competitive or superior performance to MaxEnt IRL.
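
The two-stage recipe, with the paper's fixed-point operator abstracted into a callable (the exact Bellman-style targets are the paper's contribution; everything else below is off-the-shelf supervised learning, and the regressor choice is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

def fit_behavior_policy(S, A):
    # Stage 1: probabilistic classification gives pi_b(a | s).
    return LogisticRegression(max_iter=1000).fit(S, A)

def solve_fixed_point(S, bellman_target, n_iters=20):
    # Stage 2: solve h = T(h) by iterated regression; `bellman_target` stands
    # for the paper's linear fixed-point operator built from pi_b.
    h = np.zeros(len(S))
    for _ in range(n_iters):
        y = bellman_target(h)              # regression targets for this sweep
        h = Ridge().fit(S, y).predict(S)   # one off-the-shelf regression per sweep
    return h
```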

[457] Towards Foundation Models for Zero-Shot Time Series Anomaly Detection: Leveraging Synthetic Data and Relative Context Discrepancy

Tian Lan, Hao Duong Le, Jinbo Li, Wenjun He, Meng Wang, Chenghao Liu, Chen Zhang

Main category: cs.LG

TL;DR: TimeRCD is a novel foundation model for zero-shot time series anomaly detection that uses Relative Context Discrepancy (RCD) instead of reconstruction-based objectives to better identify subtle anomalies and reduce false positives/negatives.

DetailsMotivation: Current reconstruction-based foundation models for TSAD suffer from objective mismatch - they struggle with subtle anomalies and misinterpret complex normal patterns, leading to poor generalization in zero-shot settings.

Method: TimeRCD uses a Transformer architecture trained with Relative Context Discrepancy (RCD) paradigm, which detects anomalies by identifying significant discrepancies between adjacent time windows rather than reconstructing inputs. It’s pre-trained on a large-scale synthetic corpus with token-level anomaly labels.

Result: Extensive experiments show TimeRCD significantly outperforms existing general-purpose and anomaly-specific foundation models in zero-shot TSAD across diverse datasets.

Conclusion: The RCD paradigm establishes a new effective path for building robust and generalizable foundation models for time series anomaly detection, overcoming limitations of reconstruction-based approaches.

Abstract: Time series anomaly detection (TSAD) is a critical task, but developing models that generalize to unseen data in a zero-shot manner remains a major challenge. Prevailing foundation models for TSAD predominantly rely on reconstruction-based objectives, which suffer from a fundamental objective mismatch: they struggle to identify subtle anomalies while often misinterpreting complex normal patterns, leading to high rates of false negatives and positives. To overcome these limitations, we introduce TimeRCD, a novel foundation model for TSAD built upon a new pre-training paradigm: Relative Context Discrepancy (RCD). Instead of learning to reconstruct inputs, TimeRCD is explicitly trained to identify anomalies by detecting significant discrepancies between adjacent time windows. This relational approach, implemented with a standard Transformer architecture, enables the model to capture contextual shifts indicative of anomalies that reconstruction-based methods often miss. To facilitate this paradigm, we develop a large-scale, diverse synthetic corpus with token-level anomaly labels, providing the rich supervisory signal necessary for effective pre-training. Extensive experiments demonstrate that TimeRCD significantly outperforms existing general-purpose and anomaly-specific foundation models in zero-shot TSAD across diverse datasets. Our results validate the superiority of the RCD paradigm and establish a new, effective path toward building robust and generalizable foundation models for time series anomaly detection.
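
The RCD signal contrasts adjacent windows. A crude statistical stand-in conveys the idea; TimeRCD learns this discrepancy with a Transformer and token-level labels rather than hand-picked summary statistics.

```python
import numpy as np

def rcd_scores(series, window=32):
    # Score each window by its discrepancy from the preceding window,
    # here measured as the distance between simple summary statistics.
    stats = lambda w: np.array([w.mean(), w.std()])
    scores = np.zeros(len(series))
    for t in range(window, len(series) - window, window):
        prev = series[t - window:t]
        curr = series[t:t + window]
        scores[t:t + window] = np.linalg.norm(stats(curr) - stats(prev))
    return scores
```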

[458] Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias

Shuofeng Zhang, Ard Louis

Main category: cs.LG

TL;DR: This paper provides a unified characterization of parameter norm scaling for overparameterized linear regression with minimum-ℓp interpolators, revealing a competition between signal spike and null coordinates that yields closed-form predictions for data-dependent transitions and universal thresholds.

DetailsMotivation: To understand how different ℓr norms of minimum-ℓp interpolators scale with sample size in overparameterized linear regression, which has implications for generalization proxies that depend on these norms.

Method: Uses dual-ray analysis to reveal competition between signal spike and bulk null coordinates in X⊤Y, deriving closed-form predictions for transition points and thresholds. Also studies diagonal linear networks by calibrating initialization scale to effective p via separable potential.

Result: Identifies a data-dependent transition n⋆ (elbow) and universal threshold r⋆=2(p-1) that separates norms that plateau from those that grow. Shows DLNs inherit same laws, providing bridge between explicit and implicit bias.

Conclusion: The unified solution explains norm saturation behavior and suggests generalization proxies’ predictive power depends sensitively on which ℓr norm is used, with implications for understanding bias in overparameterized models.

Abstract: For overparameterized linear regression with isotropic Gaussian design and minimum-$\ell_p$ interpolator $p\in(1,2]$, we give a unified, high-probability characterization for the scaling of the family of parameter norms $ \{ \lVert \widehat{w_p} \rVert_r \}{r \in [1,p]} $ with sample size. We solve this basic, but unresolved question through a simple dual-ray analysis, which reveals a competition between a signal spike and a bulk of null coordinates in $X^\top Y$, yielding closed-form predictions for (i) a data-dependent transition $n\star$ (the “elbow”), and (ii) a universal threshold $r_\star=2(p-1)$ that separates $\lVert \widehat{w_p} \rVert_r$’s which plateau from those that continue to grow with an explicit exponent. This unified solution resolves the scaling of all $\ell_r$ norms within the family $r\in [1,p]$ under $\ell_p$-biased interpolation, and explains in one picture which norms saturate and which increase as $n$ grows. We then study diagonal linear networks (DLNs) trained by gradient descent. By calibrating the initialization scale $\alpha$ to an effective $p_{\mathrm{eff}}(\alpha)$ via the DLN separable potential, we show empirically that DLNs inherit the same elbow/threshold laws, providing a predictive bridge between explicit and implicit bias. Given that many generalization proxies depend on $\lVert \widehat {w_p} \rVert_r$, our results suggest that their predictive power will depend sensitively on which $l_r$ norm is used.

[459] Differential-Integral Neural Operator for Long-Term Turbulence Forecasting

Hao Wu, Yuan Gao, Fan Xu, Fan Zhang, Qingsong Wen, Kun Wang, Xiaomeng Huang, Xian Wu

Main category: cs.LG

TL;DR: DINO is a novel neural operator framework that decomposes turbulent dynamics into local differential and global integral operators, achieving superior long-term forecasting stability and physical fidelity compared to state-of-the-art methods.

DetailsMotivation: Existing deep learning methods fail in long-term turbulence forecasting due to error accumulation and loss of physical fidelity, stemming from their inability to capture both local dissipative effects and global non-local interactions simultaneously.

Method: DINO uses parallel branches: a constrained convolutional network for local differential operators (provably converging to derivatives) and a Transformer architecture for global integral operators (learning data-driven global kernels), based on first-principles operator decomposition.

Result: DINO significantly outperforms state-of-the-art models on 2D Kolmogorov flow, suppressing error accumulation over hundreds of timesteps while maintaining high fidelity in vorticity fields and energy spectra.

Conclusion: DINO establishes a new benchmark for physically consistent, long-range turbulence forecasting by explicitly modeling distinct physical operators through physics-based decomposition.

Abstract: Accurately forecasting the long-term evolution of turbulence represents a grand challenge in scientific computing and is crucial for applications ranging from climate modeling to aerospace engineering. Existing deep learning methods, particularly neural operators, often fail in long-term autoregressive predictions, suffering from catastrophic error accumulation and a loss of physical fidelity. This failure stems from their inability to simultaneously capture the distinct mathematical structures that govern turbulent dynamics: local, dissipative effects and global, non-local interactions. In this paper, we propose the Differential-Integral Neural Operator (DINO), a novel framework designed from a first-principles approach of operator decomposition. DINO explicitly models the turbulent evolution through parallel branches that learn distinct physical operators: a local differential operator, realized by a constrained convolutional network that provably converges to a derivative, and a global integral operator, captured by a Transformer architecture that learns a data-driven global kernel. This physics-based decomposition endows DINO with exceptional stability and robustness. Through extensive experiments on the challenging 2D Kolmogorov flow benchmark, we demonstrate that DINO significantly outperforms state-of-the-art models in long-term forecasting. It successfully suppresses error accumulation over hundreds of timesteps, maintains high fidelity in both the vorticity fields and energy spectra, and establishes a new benchmark for physically consistent, long-range turbulence forecast.
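
The two-branch decomposition in miniature, on a flattened 1-D field for brevity. Note the plain convolution below does not enforce DINO's constraint that the local branch provably converges to a derivative; widths and head counts are illustrative.

```python
import torch
import torch.nn as nn

class DifferentialIntegralBlock(nn.Module):
    # Parallel branches on a field x of shape (B, L, C): a small convolution
    # for local (differential) structure and global self-attention for
    # non-local (integral) interactions, merged residually.
    def __init__(self, channels, n_heads=4):
        super().__init__()
        self.local = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(channels, n_heads, batch_first=True)

    def forward(self, x):
        local = self.local(x.transpose(1, 2)).transpose(1, 2)  # differential branch
        glob, _ = self.attn(x, x, x)                           # integral branch
        return x + local + glob
```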

[460] Tree Search for LLM Agent Reinforcement Learning

Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu

Main category: cs.LG

TL;DR: Tree-GRPO is a tree-based reinforcement learning method that addresses sparse supervision in long-term agent tasks by using tree search to enable step-wise process supervision from outcome rewards.

DetailsMotivation: Existing RL approaches for LLM agents in long-term, multi-turn tasks suffer from sparse supervision when driven solely by outcome rewards.

Method: Proposes Tree-based Group Relative Policy Optimization (Tree-GRPO) using tree search where nodes represent agent interaction steps, enabling shared prefixes and step-wise process supervision from outcome rewards.

Result: Experiments across 11 datasets and 3 types of QA tasks demonstrate superiority over chain-based RL methods.

Conclusion: Tree-GRPO effectively addresses sparse supervision in agent tasks and shows improved performance through tree-structured trajectory optimization.

Abstract: Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents a complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervised signals even using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages both on intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over chain-based RL methods.
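
One way to see how a prefix tree yields process-level signals from outcome-only rewards: value a node by the mean outcome of the leaves beneath it, and credit a step with the change in that value. This is our illustrative reading, not the paper's exact intra-/inter-tree estimator.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    reward: float = 0.0          # outcome reward; meaningful at leaf nodes
    children: list = field(default_factory=list)

def leaf_rewards(n):
    if not n.children:
        return [n.reward]
    return [r for c in n.children for r in leaf_rewards(c)]

def step_signal(parent, child):
    # The step from `parent` to `child` is scored by the change in the mean
    # outcome over the leaves of the respective subtrees.
    value = lambda n: sum(leaf_rewards(n)) / len(leaf_rewards(n))
    return value(child) - value(parent)
```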

[461] From Physics to Machine Learning and Back: Part II - Learning and Observational Bias in PHM

Olga Fink, Ismail Nejjar, Vinay Sharma, Keivan Faghih Niresi, Han Sun, Hao Dong, Chenghao Xu, Amaury Wei, Arthur Bizzi, Raffael Theiler, Yuan Tian, Leandro Von Krannichfeldt, Zhan Ma, Sergei Garmaev, Zepeng Zhang, Mengjie Zhao

Main category: cs.LG

TL;DR: This review paper examines physics-informed machine learning approaches for Prognostics and Health Management (PHM), focusing on how learning and observational biases can guide models toward physically consistent predictions and enable the transition from passive prediction to active decision-making through reinforcement learning.

DetailsMotivation: Real-world PHM faces challenges including noisy/incomplete sensor data, limited labels, and complex nonlinear degradation behaviors. Physics-informed machine learning addresses these limitations by embedding physical knowledge into data-driven models to improve reliability and consistency.

Method: The review examines two main approaches: learning biases (embedding physical constraints through physics-informed loss functions, governing equations, and monotonicity properties) and observational biases (influencing data selection through virtual sensing, physics-based simulation for augmentation, and multi-sensor fusion). It also explores reinforcement learning for maintenance policy optimization and scaling methods like meta-learning and domain generalization.

Result: The paper demonstrates that physics-informed modeling enables more reliable PHM predictions by ensuring physical consistency, facilitates active decision-making through reinforcement learning that respects physical constraints, and provides strategies for scaling solutions from individual assets to fleet-wide deployment.

Conclusion: Physics-informed machine learning represents a promising paradigm for PHM that bridges the gap between data-driven approaches and physical system understanding, enabling more robust fault detection, failure anticipation, and maintenance optimization while supporting scalable deployment across complex engineered systems.

Abstract: Prognostics and Health Management ensures the reliability, safety, and efficiency of complex engineered systems by enabling fault detection, anticipating equipment failures, and optimizing maintenance activities throughout an asset lifecycle. However, real-world PHM presents persistent challenges: sensor data is often noisy or incomplete, available labels are limited, and degradation behaviors and system interdependencies can be highly complex and nonlinear. Physics-informed machine learning has emerged as a promising approach to address these limitations by embedding physical knowledge into data-driven models. This review examines how incorporating learning and observational biases through physics-informed modeling and data strategies can guide models toward physically consistent and reliable predictions. Learning biases embed physical constraints into model training through physics-informed loss functions and governing equations, or by incorporating properties like monotonicity. Observational biases influence data selection and synthesis to ensure models capture realistic system behavior through virtual sensing for estimating unmeasured states, physics-based simulation for data augmentation, and multi-sensor fusion strategies. The review then examines how these approaches enable the transition from passive prediction to active decision-making through reinforcement learning, which allows agents to learn maintenance policies that respect physical constraints while optimizing operational objectives. This closes the loop between model-based predictions, simulation, and actual system operation, empowering adaptive decision-making. Finally, the review addresses the critical challenge of scaling PHM solutions from individual assets to fleet-wide deployment. Fast adaptation methods including meta-learning and few-shot learning are reviewed alongside domain generalization techniques …

[462] Explaining Fine Tuned LLMs via Counterfactuals A Knowledge Graph Driven Framework

Yucheng Wang, Ziyang Chen, Md Faisal Kabir

Main category: cs.LG

TL;DR: This paper introduces a counterfactual-based framework to explain how LoRA fine-tuning alters LLMs’ structural reasoning and semantic behavior using knowledge graphs.

DetailsMotivation: Understanding how LoRA fine-tuning mechanisms change LLMs' structural reasoning and semantic behavior remains an open challenge that needs interpretable explanation methods.

Method: Constructed BioToolKG (bioinformatics tools knowledge graph) and designed CFFTLLMExplainer that learns soft masks over graph nodes/edges to generate minimal structural perturbations inducing maximum semantic divergence, with joint optimization of structural sparsity and semantic divergence.

Result: Applied to fine-tuned LLaMA-based LLM, showing counterfactual masking exposes structural dependencies and aligns with LoRA-induced parameter shifts.

Conclusion: Provides insights into fine-tuned LLMs’ internal mechanisms and highlights counterfactual graphs as a tool for interpretable AI.

Abstract: The widespread adoption of Low-Rank Adaptation (LoRA) has enabled large language models (LLMs) to acquire domain-specific knowledge with remarkable efficiency. However, understanding how such a fine-tuning mechanism alters a model’s structural reasoning and semantic behavior remains an open challenge. This work introduces a novel framework that explains fine-tuned LLMs via counterfactuals grounded in knowledge graphs. Specifically, we construct BioToolKG, a domain-specific heterogeneous knowledge graph of bioinformatics tools, and design a counterfactual-based explainer for fine-tuned LLMs (CFFTLLMExplainer) that learns soft masks over graph nodes and edges to generate minimal structural perturbations that induce maximum semantic divergence. Our method jointly optimizes structural sparsity and semantic divergence while enforcing interpretability-preserving constraints such as entropy regularization and edge smoothness. We apply this framework to a fine-tuned LLaMA-based LLM and reveal that counterfactual masking exposes the model’s structural dependencies and aligns with LoRA-induced parameter shifts. This work provides new insights into the internal mechanisms of fine-tuned LLMs and highlights counterfactual graphs as a potential tool for interpretable AI.
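
The joint objective balances semantic divergence against structural sparsity and near-binary masks. A sketch with hypothetical weights follows; the abstract's edge-smoothness term is omitted, and `semantic_divergence` is assumed to be computed elsewhere from original vs. masked-graph model outputs.

```python
import torch

def explainer_loss(node_logits, edge_logits, semantic_divergence,
                   alpha=0.1, beta=0.01):
    # Maximize semantic divergence of the counterfactual while keeping the
    # structural perturbation minimal (sparsity) and the soft masks close to
    # binary (entropy regularization). alpha/beta are hypothetical weights.
    m_n, m_e = torch.sigmoid(node_logits), torch.sigmoid(edge_logits)
    sparsity = m_n.mean() + m_e.mean()
    entropy = -(m_n * (m_n + 1e-8).log()
                + (1 - m_n) * (1 - m_n + 1e-8).log()).mean()
    return -semantic_divergence + alpha * sparsity + beta * entropy
```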

[463] Go With The Flow: Churn-Tolerant Decentralized Training of Large Language Models

Nikolay Blagoev, Bart Cox, Jérémie Decouchant, Lydia Y. Chen

Main category: cs.LG

TL;DR: GWTF is the first crash-tolerant decentralized training framework for LLMs that enables efficient collaborative training on heterogeneous volunteer clients while handling node churn and network instabilities.

DetailsMotivation: To democratize LLM training by enabling collaborative training on volunteer heterogeneous clients, addressing the limitations of existing distributed and federated frameworks.

Method: Uses a novel decentralized flow algorithm that finds the most effective routing to maximize the number of microbatches trained with minimal delay, handling node churn and network instabilities.

Result: GWTF reduces training time by up to 45% in realistic scenarios with heterogeneous clients distributed over 10 geographic locations and high node churn rates.

Conclusion: GWTF provides a practical and efficient solution for decentralized LLM training that outperforms prior art in challenging real-world conditions.

Abstract: Motivated by the emergence of large language models (LLMs) and the importance of democratizing their training, we propose GWTF, the first crash-tolerant practical decentralized training framework for LLMs. Unlike existing distributed and federated training frameworks, GWTF enables the efficient collaborative training of an LLM on heterogeneous clients that volunteer their resources. In addition, GWTF addresses node churn, i.e., clients joining or leaving the system at any time, and network instabilities, i.e., network links becoming unstable or unreliable. The core of GWTF is a novel decentralized flow algorithm that finds the most effective routing that maximizes the number of microbatches trained with the lowest possible delay. We extensively evaluate GWTF on GPT-like and LLaMa-like models and compare it against the prior art. Our results indicate that GWTF reduces the training time by up to 45% in realistic and challenging scenarios that involve heterogeneous client nodes distributed over 10 different geographic locations with a high node churn rate.

[464] Federated Flow Matching

Zifan Wang, Anqi Dong, Mahmoud Selim, Michael M. Zavlanos, Karl H. Johansson

Main category: cs.LG

TL;DR: FFM is a framework for training flow matching models on decentralized data without central aggregation, addressing privacy constraints through federated learning approaches.

DetailsMotivation: Data is decentralized across devices and institutions where privacy, ownership, and regulation prevent centralization, creating a need to train generative models directly from distributed data locally.

Method: Three approaches: FFM-vanilla (local training with independent couplings), FFM-LOT (local optimal transport for better straightness), and FFM-GOT (global coordination using semi-dual optimal transport formulation with shared potential function).

Result: Experiments show FFM enables privacy-preserving training while improving flow straightness and sample quality, achieving performance comparable to centralized baselines on synthetic and image datasets.

Conclusion: FFM provides an effective federated learning framework for flow matching models that maintains privacy while achieving competitive performance through global coordination strategies.

Abstract: Data today is decentralized, generated and stored across devices and institutions where privacy, ownership, and regulation prevent centralization. This motivates the need to train generative models directly from distributed data locally without central aggregation. In this paper, we introduce Federated Flow Matching (FFM), a framework for training flow matching models under privacy constraints. Specifically, we first examine FFM-vanilla, where each client trains locally with independent source and target couplings, preserving privacy but yielding curved flows that slow inference. We then develop FFM-LOT, which employs local optimal transport couplings to improve straightness within each client but lacks global consistency under heterogeneous data. Finally, we propose FFM-GOT, a federated strategy based on the semi-dual formulation of optimal transport, where a shared global potential function coordinates couplings across clients. Experiments on synthetic and image datasets show that FFM enables privacy-preserving training while enhancing both the flow straightness and sample quality in federated settings, with performance comparable to the centralized baseline.
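
The FFM-LOT ingredient in miniature: couple each client's noise and data samples by optimal transport before computing standard linear-path flow-matching targets. Exact Hungarian matching is used here for clarity; large-scale practice would typically use minibatch or entropic OT, and this is a sketch of the coupling idea rather than the paper's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_coupled_pairs(x0, x1):
    # Pair noise samples x0 with data samples x1 by an exact optimal matching
    # on squared distance; straighter conditional paths speed up inference.
    cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)
    i, j = linear_sum_assignment(cost)
    return x0[i], x1[j]

def flow_matching_targets(x0, x1, t):
    # Linear-path flow matching: interpolate x_t and regress the velocity x1 - x0.
    xt = (1 - t)[:, None] * x0 + t[:, None] * x1
    return xt, x1 - x0
```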

[465] A Causality-Aware Spatiotemporal Model for Multi-Region and Multi-Pollutant Air Quality Forecasting

Junxin Lu, Shiliang Sun

Main category: cs.LG

TL;DR: AirPCM is a deep spatiotemporal forecasting model that integrates multi-region, multi-pollutant dynamics with explicit meteorology-pollutant causality modeling for accurate air pollution forecasting.

DetailsMotivation: Air pollution threatens public health and environmental sustainability, but accurate forecasting is challenging due to complex multi-pollutant interactions, evolving meteorological conditions, and spatial heterogeneity across monitoring stations.

Method: AirPCM uses a unified architecture to jointly capture cross-station spatial correlations, temporal auto-correlations, and meteorology-pollutant dynamic causality, enabling fine-grained, interpretable multi-pollutant forecasting across varying geographic and temporal scales.

Result: Extensive evaluations on multi-scale real-world datasets show AirPCM consistently surpasses state-of-the-art baselines in both predictive accuracy and generalization capability.

Conclusion: AirPCM’s long-term forecasting capability provides actionable insights for future air quality trends and high-risk windows, supporting evidence-based environmental governance and carbon mitigation planning.

Abstract: Air pollution, a pressing global problem, threatens public health, environmental sustainability, and climate stability. Achieving accurate and scalable forecasting across spatially distributed monitoring stations is challenging due to intricate multi-pollutant interactions, evolving meteorological conditions, and region-specific spatial heterogeneity. To address this challenge, we propose AirPCM, a novel deep spatiotemporal forecasting model that integrates multi-region, multi-pollutant dynamics with explicit meteorology-pollutant causality modeling. Unlike existing methods limited to single pollutants or localized regions, AirPCM employs a unified architecture to jointly capture cross-station spatial correlations, temporal auto-correlations, and meteorology-pollutant dynamic causality. This empowers fine-grained, interpretable multi-pollutant forecasting across varying geographic and temporal scales, including sudden pollution episodes. Extensive evaluations on multi-scale real-world datasets demonstrate that AirPCM consistently surpasses state-of-the-art baselines in both predictive accuracy and generalization capability. Moreover, the long-term forecasting capability of AirPCM provides actionable insights into future air quality trends and potential high-risk windows, offering timely support for evidence-based environmental governance and carbon mitigation planning.

[466] humancompatible.train: Implementing Optimization Algorithms for Stochastically-Constrained Stochastic Optimization Problems

Andrii Kliachkin, Jana Lepšová, Gilles Bareilles, Jakub Mareček

Main category: cs.LG

TL;DR: The paper introduces humancompatible.train, a PyTorch-based Python package for training deep neural networks with stochastic constraints, implementing previously unimplemented algorithms and demonstrating their use on fairness-constrained tasks.

DetailsMotivation: There is growing interest in constrained training of DNNs for applications like fairness and safety, but no industry standard toolkit exists despite several proposals.

Method: Developed an easily-extendable PyTorch-based Python package that implements multiple previously unimplemented algorithms for stochastically constrained stochastic optimization.

Result: The toolkit was demonstrated by comparing two algorithms on a deep learning task with fairness constraints, showing practical applicability.

Conclusion: humancompatible.train provides a flexible and extensible solution for constrained DNN training, addressing the need for standardized tools in this important research area.

Abstract: There has been a considerable interest in constrained training of deep neural networks (DNNs) recently for applications such as fairness and safety. Several toolkits have been proposed for this task, yet there is still no industry standard. We present humancompatible.train (https://github.com/humancompatible/train), an easily-extendable PyTorch-based Python package for training DNNs with stochastic constraints. We implement multiple previously unimplemented algorithms for stochastically constrained stochastic optimization. We demonstrate the toolkit’s use by comparing two algorithms on a deep learning task with fairness constraints.

[467] SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips

Xinyu Lian, Masahiro Tanaka, Olatunji Ruwase, Minjia Zhang

Main category: cs.LG

TL;DR: SuperOffload is a novel offloading system designed for Superchips (tightly coupled GPU-CPU architectures) that achieves up to 2.5x throughput improvement for LLM training compared to state-of-the-art offloading systems.

DetailsMotivation: Superchips represent a significant advancement in AI hardware with tightly coupled GPU-CPU architectures, but there's limited research on how LLM training can benefit from this new architecture, particularly regarding offloading assumptions.

Method: SuperOffload uses adaptive weight offloading, bucketization repartitioning, Superchip-aware casting, speculative execution, and optimized Adam optimizer for Grace CPUs to efficiently utilize Hopper GPU, Grace CPU, and NVLink-C2C interconnect simultaneously.

Result: Evaluation on NVIDIA GH200 shows 2.5x throughput improvement, enabling training of 25B models on a single Superchip with high throughput, and scaling to 13B models with 1M token sequences on 8 GH200 achieving 55% MFU.

Conclusion: SuperOffload demonstrates that Superchips require rethinking traditional offloading approaches and presents an effective solution that leverages the unique capabilities of tightly coupled GPU-CPU architectures for efficient LLM training.

Abstract: The emergence of Superchips represents a significant advancement in next-generation AI hardware. These Superchips employ a tightly coupled heterogeneous architecture that integrates GPU and CPU on the same package, which offers unprecedented computational power. However, there has been scant research investigating how LLM training benefits from this new architecture. In this work, for the first time, we study LLM training solutions based on offloading for Superchips. We observe important differences between Superchips and traditional loosely-coupled GPU-CPU architecture, which necessitate revisiting prevailing assumptions about offloading. Based on these observations, we present SuperOffload, a Superchip-centric offloading system that simultaneously uses Hopper GPU, Grace CPU, and NVLink-C2C interconnect more efficiently. SuperOffload accomplishes this via a combination of techniques, such as adaptive weight offloading, bucketization repartitioning, Superchip-aware casting, speculative execution, and a highly optimized Adam optimizer for Grace CPUs. Our evaluation of SuperOffload on NVIDIA GH200 demonstrates up to 2.5x throughput improvement compared to state-of-the-art offloading-based systems, enabling training of models of up to 25B parameters on a single Superchip while achieving high training throughput. We also extend SuperOffload with ZeRO-style data parallelism and DeepSpeed-Ulysses sequence parallelism, enabling training of a 13B model with sequence lengths of up to 1 million tokens on 8 GH200s while achieving 55% MFU.

[468] It’s Not You, It’s Clipping: A Soft Trust-Region via Probability Smoothing for LLM RL

Madeleine Dwyer, Adam Sobey, Adriane Chapman

Main category: cs.LG

TL;DR: PSPO (Probability Smoothing Policy Optimisation) is a new RL method that smooths policy probabilities instead of clipping ratios, providing better stability and performance than traditional PPO/GRPO methods.

DetailsMotivation: Current RL methods like PPO and GRPO rely on ratio clipping to stabilize updates, but this discards information and creates gradient discontinuities. PSPO aims to preserve gradient signals while maintaining stability.

Method: PSPO smooths the current policy’s probabilities toward the old behavior policy before computing importance ratios, creating a soft trust region analogous to label smoothing. It’s instantiated as GR-PSPO within the GRPO framework.
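
To make the smoothing concrete, here is a minimal PyTorch sketch of the ratio computation; the smoothing weight `alpha` is illustrative, not a value from the paper.

```python
import torch

def pspo_ratio(logp_new, logp_old, alpha=0.1):
    # Smooth the current policy's probabilities toward the old (behaviour)
    # policy before forming the importance ratio; analogous to label smoothing.
    p_new, p_old = logp_new.exp(), logp_old.exp()
    p_smooth = (1 - alpha) * p_new + alpha * p_old
    return p_smooth / p_old.clamp_min(1e-8)        # soft trust region, no clipping

# Toy PPO/GRPO-style surrogate: the gradient stays continuous everywhere,
# unlike the clipped objective, which is flat outside the clip range.
logits_new = torch.randn(4, requires_grad=True)
logp_new = torch.log_softmax(logits_new, dim=-1)
logp_old = torch.log_softmax(torch.randn(4), dim=-1)
advantages = torch.tensor([1.0, -0.5, 0.2, 0.0])
loss = -(pspo_ratio(logp_new, logp_old) * advantages).mean()
loss.backward()
```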

Result: GR-PSPO significantly outperforms clipped GRPO, with over 20% improvement on GSM8K (39.7% vs. 17.6% for 0.5B model, 59.4% vs. 37.8% for 1.5B model). It also produces clearer, more logical reasoning compared to unclipped GRPO.

Conclusion: PSPO provides a superior alternative to ratio clipping in RL-based LLM training, offering formal guarantees while preserving gradient information and enabling more stable, high-performance model fine-tuning.

Abstract: Training large language models (LLMs) with reinforcement learning (RL) methods such as PPO and GRPO commonly relies on ratio clipping to stabilise updates. While effective at preventing instability, clipping discards information and introduces gradient discontinuities. We propose Probability Smoothing Policy Optimisation (PSPO), which smooths the current policy’s probabilities toward the old (behaviour) policy before computing the importance ratio, analogous to label smoothing. Unlike clipping, PSPO preserves gradient signal, while interpolation toward the old policy creates a soft trust region that discourages large, destabilising updates, with formal guarantees. We instantiate PSPO within GRPO (GR-PSPO) and fine-tune Qwen2.5-0.5B and Qwen2.5-1.5B on GSM8K, evaluating on GSM8K test and the cross-dataset generalisation on SVAMP, ASDiv, and MATH-500. Relative to unclipped GRPO (single iteration; no data reuse, ratio always = 1), GR-PSPO achieves similar performance but improves the reasoning, yielding clearer, more concise, and more logical responses. Compared to clipped GRPO, GR-PSPO substantially improves performance for both the 0.5B and 1.5B models, with a boost of over 20% on GSM8K (39.7% vs. 17.6% for 0.5B, 59.4% vs. 37.8% for 1.5B).

[469] Optimal Robust Recourse with $L^p$-Bounded Model Change

Phone Kyaw, Kshitij Kayastha, Shahin Jabbari

Main category: cs.LG

TL;DR: This paper proposes a new algorithm for computing optimal robust recourse recommendations that are resilient to model updates, using L^p norm constraints instead of L^∞ norm to achieve lower-cost and more sparse recourses.

DetailsMotivation: Existing robust recourse methods using L^∞ norm constraints lead to high-cost recommendations. The paper aims to address this by using more constrained L^p norm model changes to achieve lower-cost and more practical recourse solutions.

Method: The authors develop a new algorithm that provably computes optimal robust recourse for generalized linear models under L^p norm constraints (p≥1, p≠∞). The approach handles both linear and non-linear models.
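
For linear scores, the robustness check underlying this setting has a closed form via the dual norm (Hölder's inequality); the sketch below shows only that check, the paper's contribution being the provably optimal-cost search over recourses.

```python
import numpy as np

def worst_case_score(w, b, x, delta, p):
    # By Hölder's inequality, min over ||dw||_p <= delta of (w + dw)·x + b
    # equals w·x + b - delta * ||x||_q, with 1/p + 1/q = 1.
    q = np.inf if p == 1 else p / (p - 1)
    x_q = np.abs(x).max() if np.isinf(q) else (np.abs(x) ** q).sum() ** (1 / q)
    return w @ x + b - delta * x_q

# A candidate recourse x' is robustly valid iff worst_case_score(...) >= 0.
print(worst_case_score(np.array([1.0, -2.0]), 0.5, np.array([2.0, 0.5]), delta=0.1, p=2))
```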

Result: Empirical results show the algorithm achieves significantly lower recourse costs (up to several orders of magnitude) compared to prior work, with better trade-offs between implementation cost and validity. It also produces more sparse recourses and remains resilient to post-processing.

Conclusion: The proposed L^p norm-based robust recourse algorithm provides more practical and cost-effective recommendations than existing L^∞ norm approaches, offering better performance across multiple metrics including cost, sparsity, and resilience.

Abstract: Recourse provides individuals who received undesirable labels (e.g., denied a loan) from algorithmic decision-making systems with a minimum-cost improvement suggestion to achieve the desired outcome. However, in practice, models often get updated to reflect changes in the data distribution or environment, invalidating the recourse recommendations (i.e., following the recourse will not lead to the desirable outcome). The robust recourse literature addresses this issue by providing a framework for computing recourses whose validity is resilient to slight changes in the model. However, since the optimization problem of computing robust recourse is non-convex (even for linear models), most of the current approaches do not have any theoretical guarantee on the optimality of the recourse. Recent work by Kayastha et al. provides the first provably optimal algorithm for robust recourse with respect to generalized linear models when the model changes are measured using the $L^{\infty}$ norm. However, using the $L^{\infty}$ norm can lead to recourse solutions with a high price. To address this shortcoming, we consider more constrained model changes defined by the $L^p$ norm, where $p\geq 1$ but $p\neq \infty$, and provide a new algorithm that provably computes the optimal robust recourse for generalized linear models. Empirically, for both linear and non-linear models, we demonstrate that our algorithm achieves a significantly lower price of recourse (up to several orders of magnitude) compared to prior work and also exhibits a better trade-off between the implementation cost of recourse and its validity. Our empirical analysis also illustrates that our approach provides more sparse recourses compared to prior work and remains resilient to post-processing approaches that guarantee feasibility.

[470] No Prior, No Leakage: Revisiting Reconstruction Attacks in Trained Neural Networks

Yehonatan Refael, Guy Smorodinsky, Ofir Lindenbaum, Itay Safran

Main category: cs.LG

TL;DR: This paper analyzes the limitations of neural network training data reconstruction attacks, proving that without prior knowledge, reconstruction is fundamentally unreliable with infinitely many alternative solutions possible.

DetailsMotivation: To understand the reliability and inherent weaknesses of existing training data reconstruction methods, which currently lack solid theoretical foundation despite empirical demonstrations.

Method: The authors take a complementary perspective by analyzing limitations rather than designing stronger attacks. They provide rigorous theoretical proofs and empirical demonstrations to show reconstruction failures.

Result: Proved that without prior data knowledge, there exist infinitely many alternative solutions arbitrarily far from true training data. Empirically showed exact duplication occurs only by chance. Networks trained more extensively are actually less susceptible to reconstruction.

Conclusion: The study refines theoretical understanding of training set leakage conditions and shows that stronger generalization (through more extensive training) can enhance privacy, reconciling privacy needs with generalization requirements.

Abstract: The memorization of training data by neural networks raises pressing concerns for privacy and security. Recent work has shown that, under certain conditions, portions of the training set can be reconstructed directly from model parameters. Some of these methods exploit implicit bias toward margin maximization, suggesting that properties often regarded as beneficial for generalization may actually compromise privacy. Yet despite striking empirical demonstrations, the reliability of these attacks remains poorly understood and lacks a solid theoretical foundation. In this work, we take a complementary perspective: rather than designing stronger attacks, we analyze the inherent weaknesses and limitations of existing reconstruction methods and identify conditions under which they fail. We rigorously prove that, without incorporating prior knowledge about the data, there exist infinitely many alternative solutions that may lie arbitrarily far from the true training set, rendering reconstruction fundamentally unreliable. Empirically, we further demonstrate that exact duplication of training examples occurs only by chance. Our results refine the theoretical understanding of when training set leakage is possible and offer new insights into mitigating reconstruction attacks. Remarkably, we demonstrate that networks trained more extensively, and therefore satisfying implicit bias conditions more strongly, are in fact less susceptible to reconstruction attacks, reconciling privacy with the need for strong generalization in this setting.

[471] FoMo-0D: A Foundation Model for Zero-shot Tabular Outlier Detection

Yuchen Shen, Haomin Wen, Leman Akoglu

Main category: cs.LG

TL;DR: FoMo-0D is a pre-trained foundation model for zero-shot outlier detection on tabular data that eliminates the need for model selection and hyperparameter tuning by directly predicting outlier labels without fine-tuning.

DetailsMotivation: Outlier detection faces significant challenges in model selection due to its unsupervised nature, with no systematic approaches for algorithm and hyperparameter selection limiting practical application.

Method: The model is pre-trained on synthetic data and can directly predict outlier/inlier labels for test samples without parameter fine-tuning, requiring no labeled data or additional training for new tasks.

Result: Extensive experiments on 57 real-world datasets against 26 baselines show FoMo-0D is highly competitive, outperforming most baselines with no statistically significant difference from the 2nd best method, while being 7x faster in inference.

Conclusion: FoMo-0D provides an effective zero-shot solution for outlier detection that bypasses model selection challenges, offering competitive performance with significant efficiency improvements.

Abstract: Outlier detection (OD) has a vast literature as it finds numerous real-world applications. Because OD is an unsupervised task, model selection is a key bottleneck in the absence of label supervision. Despite a long list of available OD algorithms with tunable hyperparameters, the lack of systematic approaches for unsupervised algorithm and hyperparameter selection limits their effective use in practice. In this paper, we present FoMo-0D, a pre-trained Foundation Model for zero/0-shot OD on tabular data, which bypasses the hurdle of model selection altogether. Having been pre-trained on synthetic data, FoMo-0D can directly predict the (outlier/inlier) label of test samples without parameter fine-tuning, requiring no labeled data and no additional training or hyperparameter tuning when given a new task. Extensive experiments on 57 real-world datasets against 26 baselines show that FoMo-0D is highly competitive, outperforming the majority of the baselines with no statistically significant difference from the 2nd best method. Further, FoMo-0D is efficient at inference, requiring only 7.7 ms per sample on average, with at least 7x speed-up compared to previous methods. To facilitate future research, our implementations for data synthesis and pre-training as well as model checkpoints are openly available at https://github.com/A-Chicharito-S/FoMo-0D.

[472] Learning to Bid Optimally and Efficiently in Adversarial First-price Auctions

Yanjun Han, Zhengyuan Zhou, Aaron Flores, Erik Ordentlich, Tsachy Weissman

Main category: cs.LG

TL;DR: This paper presents the first minimax optimal online bidding algorithm for repeated first-price auctions, achieving O(√T) regret against Lipschitz bidding policies, with both statistical optimality and computational efficiency.

DetailsMotivation: The shift from second-price to first-price auctions in online advertising creates challenges for bidders, as truthful bidding is no longer optimal and competitors' behaviors are unknown, requiring new learning approaches.

Method: Develops a novel online learning algorithm using hierarchical expert-chaining structure that leverages good experts, then modifies it for computational efficiency while maintaining optimal regret guarantees.
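
For flavor, here is a generic exponential-weights (Hedge) sketch over a grid of bid-shading experts, assuming the highest competing bid is revealed each round; it is a stand-in illustrating expert-based bidding, not the paper's hierarchical expert-chaining algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
factors = np.linspace(0.1, 1.0, 10)   # experts: bid = value * shading factor
logw = np.zeros_like(factors)
eta = 0.5                              # learning rate (hypothetical)

for t in range(1000):
    value = rng.uniform()              # private valuation, arbitrary sequence
    rival = 0.9 * rng.uniform()        # highest competing bid, arbitrary
    probs = np.exp(logw - logw.max()); probs /= probs.sum()
    bid = value * factors[rng.choice(len(factors), p=probs)]  # realized bid
    # First-price utility: winner pays their own bid. Under full feedback,
    # every expert's counterfactual utility is computable, so update them all.
    utility = (value * factors > rival) * (value - value * factors)
    logw += eta * utility
```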

Result: The algorithm achieves minimax optimal O(√T) regret, outperforms existing bidding algorithms on real-world datasets from Verizon Media, and includes an impossibility result showing limitations against stronger oracles.

Conclusion: The work provides the first statistically and computationally efficient solution for learning to bid in first-price auctions, with theoretical guarantees and practical validation on real data.

Abstract: First-price auctions have very recently swept the online advertising industry, replacing second-price auctions as the predominant auction mechanism on many platforms. This shift has brought forth important challenges for a bidder: how should one bid in a first-price auction, where, unlike in second-price auctions, it is no longer optimal to bid one’s private value truthfully, and the others’ bidding behaviors are hard to know? In this paper, we take an online learning angle and address the fundamental problem of learning to bid in repeated first-price auctions, where both the bidder’s private valuations and other bidders’ bids can be arbitrary. We develop the first minimax optimal online bidding algorithm that achieves an $\widetilde{O}(\sqrt{T})$ regret when competing with the set of all Lipschitz bidding policies, a strong oracle that contains a rich set of bidding strategies. This novel algorithm is built on the insight that the presence of a good expert can be leveraged to improve performance, as well as an original hierarchical expert-chaining structure, both of which could be of independent interest in online learning. Further, by exploiting the product structure that exists in the problem, we modify this algorithm, statistically optimal in its vanilla form but computationally infeasible, into a computationally efficient and space-efficient algorithm that also retains the same $\widetilde{O}(\sqrt{T})$ minimax optimal regret guarantee. Additionally, through an impossibility result, we highlight that one is unlikely to compete this favorably with a stronger oracle (than the considered Lipschitz bidding policies). Finally, we test our algorithm on three real-world first-price auction datasets obtained from Verizon Media and demonstrate our algorithm’s superior performance compared to several existing bidding algorithms.

[473] Expressiveness of Multi-Neuron Convex Relaxations in Neural Network Certification

Yuhao Mao, Yani Zhang, Martin Vechev

Main category: cs.LG

TL;DR: Multi-neuron relaxations for neural network certification are inherently incomplete (universal convex barrier), but can achieve completeness through network augmentation or input domain partitioning, distinguishing them from single-neuron relaxations.

DetailsMotivation: To rigorously analyze whether multi-neuron relaxations overcome the single-neuron convex barrier in neural network certification and determine their theoretical capabilities beyond single-neuron relaxations.

Method: First rigorous analysis of multi-neuron relaxation expressiveness, examining completeness with sufficient resources for capturing multiple neurons and layers optimally.

Result: Multi-neuron relaxations are inherently incomplete even with optimal resources (universal convex barrier), but can achieve completeness through: (1) augmenting networks with polynomial number of designed ReLU neurons, or (2) partitioning input domain into convex sub-polytopes.

Conclusion: Multi-neuron relaxations have distinct advantages over single-neuron ones and establish a foundation for new directions in certified robustness, including tailored training methods and verification approaches using multi-neuron relaxations as main subroutines.

Abstract: Neural network certification methods heavily rely on convex relaxations to provide robustness guarantees. However, these relaxations are often imprecise: even the most accurate single-neuron relaxation is incomplete for general ReLU networks, a limitation known as the \emph{single-neuron convex barrier}. While multi-neuron relaxations have been heuristically applied to address this issue, two central questions arise: (i) whether they overcome the convex barrier, and if not, (ii) whether they offer theoretical capabilities beyond those of single-neuron relaxations. In this work, we present the first rigorous analysis of the expressiveness of multi-neuron relaxations. Perhaps surprisingly, we show that they are inherently incomplete, even when allocated sufficient resources to capture finitely many neurons and layers optimally. This result extends the single-neuron barrier to a \textit{universal convex barrier} for neural network certification. On the positive side, we show that completeness can be achieved by either (i) augmenting the network with a polynomial number of carefully designed ReLU neurons or (ii) partitioning the input domain into convex sub-polytopes, thereby distinguishing multi-neuron relaxations from single-neuron ones which are unable to realize the former and have worse partition complexity for the latter. Our findings establish a foundation for multi-neuron relaxations and point to new directions for certified robustness, including training methods tailored to multi-neuron relaxations and verification methods with multi-neuron relaxations as the main subroutine.

[474] Contextual Combinatorial Bandits with Changing Action Sets via Gaussian Processes

Andi Nika, Sepehr Elahi, Cem Tekin

Main category: cs.LG

TL;DR: The paper proposes O’CLOK-UCB algorithm for combinatorial contextual bandits with time-varying arm availability, using Gaussian Processes and achieving improved regret bounds.

DetailsMotivation: Address combinatorial bandit problems with time-varying base arm availability and leverage Gaussian Processes to model correlations between arms for better performance.

Method: Proposes O’CLOK-UCB algorithm using kernel upper confidence bounds with Gaussian Processes, and a sparse GP variant for computational efficiency.
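
A minimal kernel-UCB sketch of the per-round scoring, using scikit-learn's GP and a greedy top-K in place of the combinatorial oracle; the exploration weight is hypothetical, and the paper's exact indices and guarantees differ.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_hist = rng.uniform(size=(30, 2))                                 # past base-arm contexts
y_hist = np.sin(3 * X_hist[:, 0]) + 0.1 * rng.standard_normal(30)  # observed outcomes
gp = GaussianProcessRegressor(kernel=RBF(0.5), alpha=1e-2).fit(X_hist, y_hist)

X_avail = rng.uniform(size=(10, 2))      # base arms available this round
mu, sd = gp.predict(X_avail, return_std=True)
ucb = mu + np.sqrt(2.0) * sd             # optimistic score; weight is hypothetical
chosen = np.argsort(-ucb)[:3]            # feasible action: greedy top-K base arms
```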

Result: Achieves Õ(√(λ*(K)KTγ_KT)) regret bound and shows superior performance over existing UCB methods in experiments.

Conclusion: The algorithm effectively exploits inter-base arm correlations and significantly outperforms state-of-the-art approaches in realistic scenarios.

Abstract: We consider a contextual bandit problem with a combinatorial action set and time-varying base arm availability. At the beginning of each round, the agent observes the set of available base arms and their contexts and then selects an action that is a feasible subset of the set of available base arms to maximize its cumulative reward in the long run. We assume that the mean outcomes of base arms are samples from a Gaussian Process (GP) indexed by the context set ${\cal X}$, and the expected reward is Lipschitz continuous in expected base arm outcomes. For this setup, we propose an algorithm called Optimistic Combinatorial Learning and Optimization with Kernel Upper Confidence Bounds (O’CLOK-UCB) and prove that it incurs $\tilde{O}(\sqrt{\lambda^*(K) K T \gamma_{KT}(\cup_{t\leq T}\mathcal{X}_t)})$ regret with high probability, where $\gamma_{KT}(\cup_{t\leq T}\mathcal{X}_t)$ is the maximum information gain associated with the sets of base arm contexts $\mathcal{X}_t$ that appeared in the first $T$ rounds, $K$ is the maximum cardinality of any feasible action over all rounds, and $\lambda^*(K)$ is the maximum eigenvalue of all covariance matrices of selected actions up to time $T$, which is a function of $K$. To dramatically speed up the algorithm, we also propose a variant of O’CLOK-UCB that uses sparse GPs. Finally, we experimentally show that both algorithms exploit inter-base arm outcome correlation and vastly outperform the previous state-of-the-art UCB-based algorithms in realistic setups.

[475] Bias Similarity Measurement: A Black-Box Audit of Fairness Across LLMs

Hyejun Jeong, Shiqing Ma, Amir Houmansadr

Main category: cs.LG

TL;DR: BSM (Bias Similarity Measurement) treats fairness as a relational property between models, unifying multiple signals to measure bias similarity across LLM families and releases.

DetailsMotivation: Current evaluations score models in isolation, obscuring how biases persist across model families and releases. There's a need for comparative bias analysis.

Method: Evaluated 30 LLMs on 1M+ prompts using BSM framework that combines scalar, distributional, behavioral, and representational signals into a single similarity space.

Result: Instruction tuning primarily enforces abstention rather than altering internal representations; small models gain little accuracy and can become less fair; open-weight models can match proprietary systems; family signatures diverge (Gemma favors refusal, LLaMA 3.1 approaches neutrality).

Conclusion: BSM reframes fairness as comparative bias similarity rather than isolated scores, enabling systematic auditing of LLM ecosystems for procurement, regression testing, and lineage screening.

Abstract: Large Language Models (LLMs) reproduce social biases, yet prevailing evaluations score models in isolation, obscuring how biases persist across families and releases. We introduce Bias Similarity Measurement (BSM), which treats fairness as a relational property between models, unifying scalar, distributional, behavioral, and representational signals into a single similarity space. Evaluating 30 LLMs on 1M+ prompts, we find that instruction tuning primarily enforces abstention rather than altering internal representations; small models gain little accuracy and can become less fair under forced choice; and open-weight models can match or exceed proprietary systems. Family signatures diverge: Gemma favors refusal, while LLaMA 3.1 approaches neutrality with fewer refusals, though behavior across families converges toward heavy abstention overall. Counterintuitively, Gemma 3 Instruct matches GPT-4-level fairness at far lower cost, whereas Gemini’s heavy abstention suppresses utility. Beyond these findings, BSM offers an auditing workflow for procurement, regression testing, and lineage screening, and extends naturally to code and multilingual settings. Our results reframe fairness not as isolated scores but as comparative bias similarity, enabling systematic auditing of LLM ecosystems. Code available at https://github.com/HyejunJeong/bias_llm.

[476] Security of Deep Reinforcement Learning for Autonomous Driving: A Survey

Ambra Demontis, Srishti Gupta, Maura Pintor, Luca Demetrio, Kathrin Grosse, Hsiao-Ying Lin, Chengfang Fang, Battista Biggio, Fabio Roli

Main category: cs.LG

TL;DR: A comprehensive survey of 86 studies on reinforcement learning security, systematically categorizing attacks and defenses by threat models and single/multi-agent settings, with specific application to autonomous driving.

DetailsMotivation: RL is increasingly used in safety-critical applications like autonomous driving but is vulnerable to attacks that can compromise policy learning or induce errors. Existing surveys lack practical guidance for selecting appropriate defenses for specific systems.

Method: Systematic categorization of 86 recent studies on RL security, organizing attacks and defenses according to defined threat models and single- versus multi-agent settings.

Result: Provides a structured framework for understanding RL security vulnerabilities and defense mechanisms, with specific insights applicable to autonomous driving systems.

Conclusion: The survey offers practical guidance for designing robust RL systems in safety-critical applications by systematically analyzing attack vectors and corresponding defense strategies.

Abstract: Reinforcement learning (RL) enables agents to learn optimal behaviors through interaction with their environment and has been increasingly deployed in safety-critical applications, including autonomous driving. Despite its promise, RL is susceptible to attacks designed either to compromise policy learning or to induce erroneous decisions by trained agents. Although the literature on RL security has grown rapidly and several surveys exist, existing categorizations often fall short in guiding the selection of appropriate defenses for specific systems. In this work, we present a comprehensive survey of 86 recent studies on RL security, addressing these limitations by systematically categorizing attacks and defenses according to defined threat models and single- versus multi-agent settings. Furthermore, we examine the relevance and applicability of state-of-the-art attacks and defense mechanisms within the context of autonomous driving, providing insights to inform the design of robust RL systems.

[477] Understanding Optimization in Deep Learning with Central Flows

Jeremy M. Cohen, Alex Damian, Ameet Talwalkar, J. Zico Kolter, Jason D. Lee

Main category: cs.LG

TL;DR: The paper develops a theory called “central flows” to describe optimization dynamics in deep learning’s “edge of stability” regime, showing that time-averaged trajectories are more tractable than exact oscillatory paths.

DetailsMotivation: Traditional optimization theories fail to explain deep learning optimization dynamics, particularly in the oscillatory "edge of stability" regime where optimizers operate.

Method: Derive differential equations called “central flows” that characterize time-averaged optimization trajectories, and empirically validate their predictive accuracy for neural network optimization.

Result: Central flows accurately predict long-term optimization trajectories and help explain phenomena like gradient descent progress despite loss increases, adaptive optimizer behavior, and implicit step size navigation.

Conclusion: Central flows provide a valuable theoretical framework for understanding deep learning optimization dynamics in complex, oscillatory regimes.

Abstract: Traditional theories of optimization cannot describe the dynamics of optimization in deep learning, even in the simple setting of deterministic training. The challenge is that optimizers typically operate in a complex, oscillatory regime called the “edge of stability.” In this paper, we develop theory that can describe the dynamics of optimization in this regime. Our key insight is that while the exact trajectory of an oscillatory optimizer may be challenging to analyze, the time-averaged (i.e. smoothed) trajectory is often much more tractable. To analyze an optimizer, we derive a differential equation called a “central flow” that characterizes this time-averaged trajectory. We empirically show that these central flows can predict long-term optimization trajectories for generic neural networks with a high degree of numerical accuracy. By interpreting these central flows, we are able to understand how gradient descent makes progress even as the loss sometimes goes up; how adaptive optimizers “adapt” to the local loss landscape; and how adaptive optimizers implicitly navigate towards regions where they can take larger steps. Our results suggest that central flows can be a valuable theoretical tool for reasoning about optimization in deep learning.

[478] Towards the Identifiability in Noisy Label Learning: A Multinomial Mixture Modelling Approach

Cuong Nguyen, Thanh-Toan Do, Gustavo Carneiro

Main category: cs.LG

TL;DR: This paper presents a novel approach for learning from noisy labels by making the problem identifiable through multiple i.i.d. noisy labels per instance, using nearest neighbors to generate additional labels and EM algorithm for clean label inference.

DetailsMotivation: The conventional LNL problem with single noisy labels per instance is non-identifiable, meaning clean labels cannot be theoretically estimated without additional heuristics. The paper aims to address this fundamental limitation.

Method: Proposes using at least 2C-1 i.i.d. noisy labels per instance (where C is number of classes) to make the problem identifiable. Uses nearest neighbors to automatically generate additional i.i.d. noisy labels, then applies Expectation-Maximisation algorithm to infer clean labels.
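
A minimal NumPy sketch of the EM step on noisy-label count vectors (assuming the counts are already collected, e.g., via nearest neighbours); initialization and stopping details are illustrative.

```python
import numpy as np

def em_clean_labels(counts, C, n_iter=100, seed=0):
    # counts[i, c] = how many of instance i's noisy labels equal class c
    # (the identifiability condition asks for >= 2C - 1 labels per instance).
    rng = np.random.default_rng(seed)
    pi = np.full(C, 1.0 / C)                    # clean-class priors
    theta = rng.dirichlet(np.ones(C), size=C)   # theta[k]: noisy-label dist. given clean class k
    for _ in range(n_iter):
        log_r = np.log(pi) + counts @ np.log(theta).T   # E-step (up to a constant)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r); r /= r.sum(axis=1, keepdims=True)
        pi = r.mean(axis=0)                              # M-step
        theta = r.T @ counts + 1e-8
        theta /= theta.sum(axis=1, keepdims=True)
    return r                                    # posterior over clean labels

# toy: C = 3 classes, 2C - 1 = 5 noisy labels per instance
counts = np.array([[4, 1, 0], [0, 5, 0], [1, 1, 3]])
print(em_clean_labels(counts, C=3).round(2))
```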

Result: The method accurately estimates clean labels across various label noise benchmarks including synthetic, web-controlled, and real-world datasets. The model trained with this approach performs competitively with state-of-the-art methods.

Conclusion: The paper demonstrates that the LNL problem becomes identifiable with multiple i.i.d. noisy labels per instance, providing a theoretically grounded and effective solution without requiring heuristics about clean samples.

Abstract: Learning from noisy labels (LNL) is crucial in deep learning, in which one of the approaches is to identify clean-label samples from poorly-annotated datasets. Such an identification is challenging because the conventional LNL problem, which assumes only one noisy label per instance, is non-identifiable, i.e., clean labels cannot be estimated theoretically without additional heuristics. This paper presents a novel data-driven approach that addresses this issue without requiring any heuristics about clean samples. We discover that the LNL problem becomes identifiable if there are at least $2C - 1$ i.i.d. noisy labels per instance, where $C$ is the number of classes. Our finding relies on the assumption of i.i.d. noisy labels and multinomial mixture modelling, making it easier to interpret than previous studies that require full-rank noisy-label transition matrices. To fulfil this condition without additional manual annotations, we propose a method that automatically generates additional i.i.d. noisy labels through nearest neighbours. These noisy labels are then used in the Expectation-Maximisation algorithm to infer clean labels. Our method demonstrably estimates clean labels accurately across various label noise benchmarks, including synthetic, web-controlled, and real-world datasets. Furthermore, the model trained with our method performs competitively with many state-of-the-art methods.

[479] Estimating Deep Learning energy consumption based on model architecture and training environment

Santiago del Rey, Luís Cruz, Xavier Franch, Silverio Martínez-Fernández

Main category: cs.LG

TL;DR: This paper investigates the environmental impact of deep learning by analyzing how model architecture and training environment affect energy consumption, proposing new estimation methods that outperform existing tools.

DetailsMotivation: To address the gap in accurate energy estimation for deep learning training, as current practices rely on unverified assumptions and fail to capture the interaction between model complexity and hardware capabilities.

Method: Train various computer vision models while collecting energy consumption and accuracy metrics, analyze trade-offs, and propose STEP (Stable Training Epoch Projection) and PRE (Pre-training Regression-based Estimation) methods for better energy estimation.
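
One plausible reading of the epoch-projection idea, not the paper's exact STEP procedure: meter a few post-warmup epochs and project the per-epoch mean over the full run.

```python
def project_training_energy(epoch_joules, total_epochs, warmup=1):
    # Average per-epoch energy after warmup, projected over the full run.
    stable = epoch_joules[warmup:]
    return sum(stable) / len(stable) * total_epochs

# e.g., a metered first epoch is atypical; later epochs are stable
print(project_training_energy([5200.0, 4100.0, 4080.0, 4120.0], total_epochs=90))
```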

Result: Selecting optimal model-training environment combinations can reduce training energy by up to 80.68% with minimal accuracy loss. Common estimation methods like FLOPs or GPU TDP fail to capture energy dynamics and lead to substantial errors.

Conclusion: The proposed STEP and PRE methods significantly outperform existing energy estimation tools by a factor of two or more, demonstrating the importance of considering model-hardware interactions for accurate energy assessment in deep learning.

Abstract: To raise awareness of the environmental impact of deep learning (DL), many studies estimate the energy use of DL systems. However, energy estimates during DL training often rely on unverified assumptions. This work addresses that gap by investigating how model architecture and training environment affect energy consumption. We train a variety of computer vision models and collect energy consumption and accuracy metrics to analyze their trade-offs across configurations. Our results show that selecting the right model-training environment combination can reduce training energy consumption by up to 80.68% with less than 2% loss in $F_1$ score. We find a significant interaction effect between model and training environment: energy efficiency improves when GPU computational power scales with model complexity. Moreover, we demonstrate that common estimation practices, such as using FLOPs or GPU TDP, fail to capture these dynamics and can lead to substantial errors. To address these shortcomings, we propose the Stable Training Epoch Projection (STEP) and the Pre-training Regression-based Estimation (PRE) methods. Across evaluations, our methods outperform existing tools by a factor of two or more in estimation accuracy.

[480] Reformulation is All You Need: Addressing Malicious Text Features in DNNs

Yi Jiang, Oubo Ma, Yong Yang, Tong Zhang, Shouling Ji

Main category: cs.LG

TL;DR: A unified defense framework against both adversarial and backdoor attacks in NLP by addressing malicious textual features through reformulation modules while preserving semantic integrity.

DetailsMotivation: Existing defenses are either computationally expensive (model-oriented) or vulnerable to adaptive attacks (sample-oriented). The root cause of attacks is that DNN models erroneously assign significant weight to subtle textual features that humans ignore.

Method: Proposes reformulation modules that process textual inputs to address potential malicious features while maintaining original semantics. The framework is adaptive to various attack vectors.

Result: Extensive experiments show the framework outperforms existing sample-oriented defense baselines across diverse malicious textual features.

Conclusion: The proposed unified defense effectively handles both adversarial and backdoor attacks by targeting the root cause in the encoding process, offering better performance than current approaches.

Abstract: Human language encompasses a wide range of intricate and diverse implicit features, which attackers can exploit to launch adversarial or backdoor attacks, compromising DNN models for NLP tasks. Existing model-oriented defenses often require substantial computational resources as model size increases, whereas sample-oriented defenses typically focus on specific attack vectors or schemes, rendering them vulnerable to adaptive attacks. We observe that the root cause of both adversarial and backdoor attacks lies in the encoding process of DNN models, where subtle textual features, negligible for human comprehension, are erroneously assigned significant weight by less robust or trojaned models. Based on this observation, we propose a unified and adaptive defense framework that is effective against both adversarial and backdoor attacks. Our approach leverages reformulation modules to address potential malicious features in textual inputs while preserving the original semantic integrity. Extensive experiments demonstrate that our framework outperforms existing sample-oriented defense baselines across a diverse range of malicious textual features.

[481] Strassen Attention, Split VC Dimension and Compositionality in Transformers

Alexander Kozachinskiy, Felipe Urrutia, Hector Jimenez, Tomasz Steifer, Germán Pizarro, Matías Fuentes, Francisco Meza, Cristian B. Calderon, Cristóbal Rojas

Main category: cs.LG

TL;DR: This paper establishes theoretical limitations of one-layer softmax transformers for advanced reasoning tasks and introduces Strassen attention as a scalable solution with sub-cubic complexity.

DetailsMotivation: To understand the fundamental limitations of standard one-layer softmax transformers in handling advanced reasoning tasks that require looking at token triplets and compositional reasoning, and to develop a more scalable attention mechanism.

Method: The authors formally prove the inability of one-layer softmax transformers to solve three reasoning tasks (Match 3, function composition, and binary relations composition). They introduce Strassen attention and prove its theoretical capabilities, then experimentally compare it against standard, higher-order, and triangular attention mechanisms.
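
To see what "looking at token triplets" costs, here is a naive third-order attention reference in PyTorch: for each query, scores are additive bilinear terms over pairs (j, k), softmaxed jointly. The parametrization is illustrative, and Strassen attention's point is evaluating such sums in sub-cubic time via matrix products, which this sketch does not do.

```python
import torch

n, d = 8, 16
x = torch.randn(n, d)
Wa, Wb, Wc, Wd, We, Wf, Wv = (torch.randn(d, d) for _ in range(7))
a, b, c, e, f, g, v = x @ Wa, x @ Wb, x @ Wc, x @ Wd, x @ We, x @ Wf, x @ Wv

out = torch.zeros(n, d)
for i in range(n):
    # score[j, k] = <a_i, b_j> + <c_j, e_k> + <f_k, g_i>, softmaxed over all pairs
    scores = (a[i] @ b.T)[:, None] + (c @ e.T) + (f @ g[i])[None, :]
    w = torch.softmax(scores.flatten(), dim=0).view(n, n)
    out[i] = w.sum(dim=1) @ v + w.sum(dim=0) @ v   # value mass from j and from k
```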

Result: Strassen attention outperforms standard attention significantly on all tested reasoning tasks and shows better scalability than higher-order attention with sub-cubic running-time complexity.

Conclusion: Understanding theoretical limitations can guide research toward scalable attention mechanisms that enhance transformers’ reasoning abilities, with Strassen attention demonstrating practical improvements over existing approaches.

Abstract: We propose the first method to show theoretical limitations for one-layer softmax transformers with arbitrarily many precision bits (even infinite). We establish those limitations for three tasks that require advanced reasoning. The first task, Match 3 (Sanford et al., 2023), requires looking at all possible token triplets in an input sequence. The second and third tasks address compositionality-based reasoning: function composition (Peng et al., 2024) and binary relations composition, respectively. We formally prove the inability of one-layer softmax Transformers to solve any of these tasks. To overcome these limitations, we introduce Strassen attention and prove that, equipped with this mechanism, a one-layer transformer can in principle solve all these tasks. Importantly, we show that it enjoys sub-cubic running-time complexity, making it more scalable than similar previously proposed mechanisms, such as higher-order attention (Sanford et al., 2023). To complement our theoretical findings, we experimentally studied Strassen attention and compared it against standard attention (Vaswani et al., 2017), higher-order attention (Sanford et al., 2023), and triangular attention (Bergen et al., 2021). Our results help to disentangle all these attention mechanisms, highlighting their strengths and limitations. In particular, Strassen attention outperforms standard attention significantly on all the tasks. Altogether, understanding the theoretical limitations can guide research towards scalable attention mechanisms that improve the reasoning abilities of Transformers.

[482] Energy based diffusion generator for efficient sampling of Boltzmann distributions

Yan Wang, Ling Guo, Hao Wu, Tao Zhou

Main category: cs.LG

TL;DR: EDG is a novel simulation-free method that combines variational autoencoders and diffusion models to sample from complex Boltzmann distributions without solving differential equations.

DetailsMotivation: Sampling from high-dimensional Boltzmann distributions with complex energy functions is challenging, and existing methods often require solving differential equations or have restrictive constraints.

Method: EDG uses a decoder to generate samples from simple latent variables and a diffusion-based encoder to estimate KL divergence to the target distribution. It removes bijectivity constraints for flexible network design.

Result: EDG demonstrates superior performance across various sampling tasks with complex target distributions, outperforming existing methods.

Conclusion: EDG provides an effective, simulation-free approach for Boltzmann distribution sampling that offers flexibility and strong empirical performance.

Abstract: Sampling from Boltzmann distributions, particularly those tied to high dimensional and complex energy functions, poses a significant challenge in many fields. In this work, we present the Energy-Based Diffusion Generator (EDG), a novel approach that integrates ideas from variational autoencoders and diffusion models. EDG uses a decoder to generate Boltzmann-distributed samples from simple latent variables, and a diffusion-based encoder to estimate the Kullback-Leibler divergence to the target distribution. Notably, EDG is simulation-free, eliminating the need to solve ordinary or stochastic differential equations during training. Furthermore, by removing constraints such as bijectivity in the decoder, EDG allows for flexible network design. Through empirical evaluation, we demonstrate the superior performance of EDG across a variety of sampling tasks with complex target distributions, outperforming existing methods.

[483] What Makes a Reward Model a Good Teacher? An Optimization Perspective

Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora

Main category: cs.LG

TL;DR: Reward model accuracy alone is insufficient for effective RLHF; reward variance is crucial for optimization efficiency, as low variance creates flat objective landscapes that slow learning.

DetailsMotivation: Current RLHF evaluation focuses primarily on reward model accuracy, but it's unclear if accuracy fully captures what makes a reward model effective for guiding language model optimization.

Method: Theoretical analysis from optimization perspective, proving that low reward variance creates flat objective landscapes regardless of accuracy. Experiments with models up to 8B parameters validate the interplay between reward variance, accuracy, and optimization efficiency.
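
The quantity at stake is easy to estimate: sample outputs from the policy, score them with the reward model, and take the variance. A toy Monte Carlo sketch with illustrative numbers:

```python
import torch

policy_probs = torch.tensor([0.96, 0.01, 0.01, 0.01, 0.01])  # near-deterministic policy
rm_scores = torch.tensor([0.50, 0.49, 0.48, 0.47, 0.46])     # perfectly *accurate* ranking

# Reward variance induced for this policy: variance of scores over sampled outputs.
idx = torch.multinomial(policy_probs, num_samples=10_000, replacement=True)
print(rm_scores[idx].var())  # tiny variance -> flat RLHF objective, slow optimization
```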

Result: Even perfectly accurate reward models can lead to slow optimization if they induce low variance, while less accurate models with higher variance can outperform them. Reward models effective for one language model may fail for another due to variance differences.

Conclusion: Reward model evaluation must consider both accuracy and variance, as sufficient variance is essential for efficient optimization in RLHF beyond just accuracy metrics.

Abstract: The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. However, while this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.

[484] A High-Dimensional Statistical Method for Optimizing Transfer Quantities in Multi-Source Transfer Learning

Qingyue Zhang, Haohao Fu, Guanbo Huang, Yaoyuan Liang, Chang Chu, Tianren Peng, Yanru Wu, Qi Li, Yang Li, Shao-Lun Huang

Main category: cs.LG

TL;DR: A theoretical framework and algorithm (OTQMS) that determines the optimal quantity of source samples needed from each source task in multi-source transfer learning to improve training efficiency and performance.

DetailsMotivation: Existing multi-source transfer learning methods use all available source samples, which constrains training efficiency and may lead to suboptimal results due to data scarcity in real-world supervised learning scenarios.

Method: Introduced a generalization error measure based on K-L divergence and minimized it using high-dimensional statistical analysis to determine optimal transfer quantity for each source task. Developed an architecture-agnostic algorithm OTQMS to implement the theoretical framework.

Result: Experimental studies on diverse architectures and two real-world benchmark datasets show that OTQMS significantly outperforms state-of-the-art approaches in both accuracy and data efficiency.

Conclusion: The proposed framework and algorithm provide an effective solution for multi-source transfer learning by optimizing source sample quantity, achieving better performance with improved data efficiency compared to existing methods.

Abstract: Multi-source transfer learning provides an effective solution to data scarcity in real-world supervised learning scenarios by leveraging multiple source tasks. In this field, existing works typically use all available samples from sources in training, which constrains their training efficiency and may lead to suboptimal results. To address this, we propose a theoretical framework that answers the question: what is the optimal quantity of source samples needed from each source task to jointly train the target model? Specifically, we introduce a generalization error measure based on K-L divergence, and minimize it using high-dimensional statistical analysis to determine the optimal transfer quantity for each source task. Additionally, we develop an architecture-agnostic and data-efficient algorithm OTQMS to implement our theoretical results for target model training in multi-source transfer learning. Experimental studies on diverse architectures and two real-world benchmark datasets show that our proposed algorithm significantly outperforms state-of-the-art approaches in both accuracy and data efficiency. The code and supplementary materials are available at https://anonymous.4open.science/r/Materials.

[485] Reinforcement Learning in Categorical Cybernetics

Jules Hedges, Riu Rodríguez Sakamoto

Main category: cs.LG

TL;DR: This paper shows that major reinforcement learning algorithms fit into the categorical cybernetics framework as parametrised bidirectional processes, extending previous work on value iteration to represent various RL methods through parametrised optics.

DetailsMotivation: To provide a unified mathematical framework for understanding reinforcement learning algorithms using categorical cybernetics, demonstrating that different RL approaches can be seen as special cases of a general construction.

Method: The approach involves: (1) extending Bellman operators to parametrised optics for action-value functions, (2) applying a representable contravariant functor to create parametrised functions for Bellman iteration, and (3) embedding these into parametrised optics representing models that interact with environments via agents.

Result: The paper demonstrates that dynamic programming, Monte Carlo methods, temporal difference learning, and deep RL can all be represented as different extremal cases of this unified categorical framework.

Conclusion: This categorical cybernetics approach provides a natural and fruitful way to think about reinforcement learning, offering strong evidence that parametrised optics are a fundamental mathematical structure underlying various RL algorithms.

Abstract: We show that several major algorithms of reinforcement learning (RL) fit into the framework of categorical cybernetics, that is to say, parametrised bidirectional processes. We build on our previous work in which we show that value iteration can be represented by precomposition with a certain optic. The outline of the main construction in this paper is: (1) We extend the Bellman operators to parametrised optics that apply to action-value functions and depend on a sample. (2) We apply a representable contravariant functor, obtaining a parametrised function that applies the Bellman iteration. (3) This parametrised function becomes the backward pass of another parametrised optic that represents the model, which interacts with an environment via an agent. Thus, parametrised optics appear in two different ways in our construction, with one becoming part of the other. As we show, many of the major classes of algorithms in RL can be seen as different extremal cases of this general setup: dynamic programming, Monte Carlo methods, temporal difference learning, and deep RL. We see this as strong evidence that this approach is a natural one and believe that it will be a fruitful way to think about RL in the future.

[486] Process Reward Models That Think

Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang

Main category: cs.LG

TL;DR: ThinkPRM is a generative, chain-of-thought verifier that requires only 1% of process labels compared to discriminative PRMs, outperforming baselines on multiple benchmarks while scaling verification compute efficiently.

DetailsMotivation: Traditional process reward models (PRMs) require expensive step-level supervision for training. This work aims to build data-efficient PRMs that can verify solution steps with minimal supervision.

Method: Proposes ThinkPRM, a long chain-of-thought verifier fine-tuned on few process labels. It generates verification chain-of-thought to verify each solution step, leveraging inherent reasoning abilities of long CoT models.
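
A schematic of best-of-N selection with a generative verifier; `generate` is a placeholder for any LLM call, and the prompt/verdict format is illustrative, not ThinkPRM's actual template.

```python
def verify(problem, solution, generate):
    # `generate` is any text-in/text-out LLM call (placeholder).
    prompt = (
        f"Problem: {problem}\nSolution:\n{solution}\n"
        "Check each step, reasoning out loud, then label every step "
        "correct or incorrect."
    )
    cot = generate(prompt)  # the verification chain-of-thought
    verdicts = [t.strip(".,:;'\"") for t in cot.lower().split()]
    verdicts = [t for t in verdicts if t in ("correct", "incorrect")]
    return sum(t == "correct" for t in verdicts) / max(len(verdicts), 1)

def best_of_n(problem, candidates, generate):
    # Keep the candidate whose steps the verifier judges most correct.
    return max(candidates, key=lambda s: verify(problem, s, generate))
```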

Result: ThinkPRM outperforms LLM-as-a-Judge and discriminative verifiers using only 1% of PRM800K labels. It achieves superior performance on ProcessBench, MATH-500, AIME ‘24, and shows 8% and 4.5% improvements on GPQA-Diamond and LiveCodeBench respectively compared to full PRM800K-trained verifiers.

Conclusion: Generative, long CoT PRMs like ThinkPRM can effectively scale test-time compute for verification while requiring minimal supervision, highlighting the value of this approach over traditional discriminative methods.

Abstract: Step-by-step verifiers – also known as process reward models (PRMs) – are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers – using only 1% of the process labels in PRM800K – across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME ‘24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models are released at https://github.com/mukhal/thinkprm.

[487] Least Volume Analysis

Qiuyi Chen, Cashen Diniz, Mark Fuge

Main category: cs.LG

TL;DR: Least Volume (LV) is a regularization method that reduces latent dimensions in autoencoders without prior knowledge of intrinsic dimensionality, with extensions to non-Euclidean settings and applications in data sampling and representation learning.

DetailsMotivation: To develop an effective dimension reduction method that doesn't require prior knowledge of dataset intrinsic dimensionality, inspired by geometric intuition.

Method: Introduces Least Volume (LV) regularization, extends it to Generalized Least Volume (GLV) for non-Euclidean settings and label integration, and develops Dynamic Pruning algorithm for implementation.
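
A minimal sketch of a Least-Volume-style penalty, assuming the volume proxy is the geometric mean of per-dimension latent standard deviations; the stabilizer `eps` and the weight in the usage note are hypothetical, and the paper pairs the penalty with a Lipschitz-constrained decoder.

```python
import torch

def least_volume_penalty(z, eps=1e-3):
    # Geometric mean of per-dimension latent standard deviations over a batch:
    # a proxy for the volume of the latent bounding box. Dimensions the
    # autoencoder does not need are driven toward zero spread and can be pruned.
    std = z.std(dim=0)
    return torch.exp(torch.log(std + eps).mean())   # (prod_i std_i)^(1/d)

# usage inside an autoencoder step (weight 1e-2 is illustrative):
# loss = reconstruction_loss + 1e-2 * least_volume_penalty(z)
```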

Result: LV effectively reduces latent dimensions, induces PCA-like importance ordering, reveals role of low-dimensional spaces in data sampling and disentangled representation, and GLV produces smooth representations for downstream optimization.

Conclusion: LV and GLV are effective dimension reduction methods with geometric foundations, applicable to various datasets and enabling stable downstream optimization through smooth latent representations.

Abstract: This paper introduces Least Volume (LV), a simple yet effective regularization method inspired by geometric intuition, which reduces the number of latent dimensions required by an autoencoder without prior knowledge of the dataset’s intrinsic dimensionality. We show that its effectiveness depends on the Lipschitz continuity of the decoder, prove that Principal Component Analysis (PCA) is a linear special case, and demonstrate that LV induces a PCA-like importance ordering in nonlinear models. We extend LV to non-Euclidean settings as Generalized Least Volume (GLV), enabling the integration of label information into the latent representation. To support implementation, we also develop an accompanying Dynamic Pruning algorithm. We evaluate LV on several benchmark problems, demonstrating its effectiveness in dimension reduction. Leveraging this, we reveal the role of low-dimensional latent spaces in data sampling and disentangled representation, and use them to probe the varying topological complexity of various datasets. GLV is further applied to labeled datasets, where it induces a contrastive learning effect in representations of discrete labels. On a continuous-label airfoil dataset, it produces representations that lead to smooth changes in aerodynamic performance, thereby stabilizing downstream optimization.

[488] UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech

Jiaxuan Liu, Yang Xiang, Han Zhao, Xiangang Li, Yingying Gao, Shilei Zhang, Zhenhua Ling

Main category: cs.LG

TL;DR: UDDETTS is a universal LLM framework that unifies discrete and dimensional emotions for controllable emotional TTS, using interpretable Arousal-Dominance-Valence space and semi-supervised training.

DetailsMotivation: Current LLM-based TTS systems struggle with fine-grained emotional speech synthesis and interpretable emotion control. Traditional discrete emotion labels cannot capture the complexity and continuity of human emotions, and limited emotional datasets cause overfitting.

Method: Proposes UDDETTS framework with ADV (Arousal-Dominance-Valence) space for dimensional emotion description. Uses semi-supervised training to leverage diverse speech datasets with different emotional annotations. Supports emotion control via discrete labels or ADV values.

Result: UDDETTS achieves linear emotion control along three interpretable dimensions and demonstrates superior end-to-end emotional speech synthesis capabilities compared to existing methods.

Conclusion: The proposed framework successfully addresses limitations of current emotional TTS systems by providing interpretable dimensional emotion control and effective utilization of diverse training data through semi-supervised learning.

Abstract: Recent large language models (LLMs) have made great progress in the field of text-to-speech (TTS), but they still face major challenges in synthesizing fine-grained emotional speech in an interpretable manner. Traditional methods rely on discrete emotion labels to control emotion categories and intensities, which cannot capture the complexity and continuity of human emotional perception and expression. The lack of large-scale emotional speech datasets with balanced emotion distributions and fine-grained emotional annotations often causes overfitting in synthesis models and impedes effective emotion control. To address these issues, we propose UDDETTS, a universal LLM framework unifying discrete and dimensional emotions for controllable emotional TTS. This model introduces the interpretable Arousal-Dominance-Valence (ADV) space for dimensional emotion description and supports emotion control driven by either discrete emotion labels or nonlinearly quantified ADV values. Furthermore, a semi-supervised training strategy is designed to comprehensively utilize diverse speech datasets with different types of emotional annotations to train the UDDETTS. Experiments show that UDDETTS achieves linear emotion control along three interpretable dimensions, and exhibits superior end-to-end emotional speech synthesis capabilities. Code and demos are available at: https://anonymous.4open.science/w/UDDETTS.

[489] Edge Probability Graph Models Beyond Edge Independency: Concepts, Analyses, and Algorithms

Fanchen Bu, Ruochen Yang, Paul Bogdan, Kijung Shin

Main category: cs.LG

TL;DR: This paper proposes a new edge-dependent realization framework called “binding” for random graph models that overcomes limitations of edge-independent models, enabling simultaneous high subgraph densities and output variability while maintaining tractability.

DetailsMotivation: Existing random graph models with edge independence cannot produce both high subgraph densities and high output variability simultaneously. The authors aim to develop models that better reproduce real-world graph patterns while maintaining tractability and variability.

Method: The authors propose a novel edge-dependent realization framework called “binding” that preserves output variability. They derive closed-form tractability results for subgraph densities and develop algorithms for graph generation with binding and parameter fitting.

Result: Empirical results show that random graph models with binding exhibit high tractability and better reproduce common real-world graph patterns (power-law degrees, small diameters, high clustering) compared to edge-independent models.

Conclusion: The binding framework significantly improves upon edge-independent random graph models by enabling simultaneous achievement of high subgraph densities and output variability while maintaining computational tractability.

Abstract: Desirable random graph models (RGMs) should (i) reproduce common patterns in real-world graphs (e.g., power-law degrees, small diameters, and high clustering), (ii) generate variable (i.e., not overly similar) graphs, and (iii) remain tractable to compute and control graph statistics. A common class of RGMs (e.g., Erdős-Rényi and stochastic Kronecker) outputs edge probabilities, so we need to realize (i.e., sample from) the output edge probabilities to generate graphs. Typically, the existence of each edge is assumed to be determined independently, for simplicity and tractability. However, with edge independency, RGMs provably cannot produce high subgraph densities and high output variability simultaneously. In this work, we explore RGMs beyond edge independence that can better reproduce common patterns while maintaining high tractability and variability. Theoretically, we propose an edge-dependent realization (i.e., sampling) framework called binding that provably preserves output variability, and derive closed-form tractability results on subgraph (e.g., triangle) densities. Practically, we propose algorithms for graph generation with binding and parameter fitting of binding. Our empirical results demonstrate that RGMs with binding exhibit high tractability and well reproduce common patterns, significantly improving upon edge-independent RGMs.
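
The contrast between edge-independent and edge-dependent realization can be seen in a toy NumPy sketch: both samplers below preserve each edge's marginal probability, but sharing one uniform across all node pairs (the extreme form of coupling; the paper's binding framework is more general and operates on groups of node pairs) changes subgraph statistics such as triangle counts dramatically.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
w = rng.pareto(2.5, n) + 1.0                   # heavy-tailed expected degrees
P = np.minimum(np.outer(w, w) / w.sum(), 1.0)  # Chung-Lu-style edge probabilities
np.fill_diagonal(P, 0.0)

def realize_independent(P, rng):
    """Classical realization: one fresh uniform per node pair."""
    A = np.triu((rng.random(P.shape) < P).astype(int), 1)
    return A + A.T

def realize_fully_bound(P, rng):
    """Extreme coupling: one uniform shared by every node pair; each edge
    still appears with probability P[i, j], but edges are now dependent."""
    A = np.triu((rng.random() < P).astype(int), 1)
    return A + A.T

def triangles(A):
    return int(np.trace(A @ A @ A)) // 6  # closed 3-walks / 6

print(triangles(realize_independent(P, rng)), triangles(realize_fully_bound(P, rng)))
```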

[490] Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute

Sheng Liu, Tianlang Chen, Pan Lu, Haotian Ye, Yizheng Chen, Lei Xing, James Zou

Main category: cs.LG

TL;DR: Fractional Reasoning is a training-free framework that enables continuous control over reasoning intensity at test time by scaling latent steering vectors, improving both breadth-based and depth-based reasoning strategies.

DetailsMotivation: Existing test-time compute methods apply reasoning uniformly across inputs, but different problems require different reasoning depths. Current approaches lack fine-grained control over reasoning intensity.

Method: Extracts latent steering vectors associated with deeper reasoning and reapplies them with tunable scaling factors, allowing models to tailor reasoning process to input complexity without additional training.

Result: Experiments on GSM8K, MATH500, and GPQA show consistent performance improvements across diverse reasoning tasks and models compared to uniform reasoning approaches.

Conclusion: Fractional Reasoning provides an effective model-agnostic framework for adaptive test-time compute that outperforms fixed reasoning strategies by enabling continuous control over reasoning intensity.

Abstract: Test-time compute has emerged as a powerful paradigm for improving the performance of large language models (LLMs), where generating multiple outputs or refining individual chains can significantly boost answer accuracy. However, existing methods like Best-of-N, majority voting, and self-reflection typically apply reasoning in a uniform way across inputs, overlooking the fact that different problems may require different levels of reasoning depth. In this work, we propose Fractional Reasoning, a training-free and model-agnostic framework that enables continuous control over reasoning intensity at inference time, going beyond the limitations of fixed instructional prompts. Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor, allowing the model to tailor its reasoning process to the complexity of each input. This supports two key modes of test-time scaling: (1) improving output quality in breadth-based strategies (e.g., Best-of-N, majority voting), and (2) enhancing the correctness of individual reasoning chains in depth-based strategies (e.g., self-reflection). Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.
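
A common recipe for such steering vectors, shown in the hedged sketch below, is to average hidden states collected with and without a deeper-reasoning prompt, take the difference, and re-add it at inference with a tunable scale; in practice the addition is applied via a forward hook on a chosen transformer layer. The extraction details here are assumptions, not necessarily the paper's exact procedure.

```python
import torch

def extract_steering_vector(h_reason: torch.Tensor, h_plain: torch.Tensor) -> torch.Tensor:
    """Steering vector as the mean difference between hidden states gathered
    with and without a deeper-reasoning prompt (shapes: (num_examples, d_model))."""
    return h_reason.mean(dim=0) - h_plain.mean(dim=0)

def steer(hidden: torch.Tensor, v: torch.Tensor, alpha: float) -> torch.Tensor:
    """Reapply the vector with a tunable scale: alpha = 0 leaves the model
    unchanged, larger alpha pushes generations toward deeper reasoning."""
    return hidden + alpha * v

# Toy shapes; in practice h_* come from a fixed layer via forward hooks.
h_reason, h_plain = torch.randn(64, 4096), torch.randn(64, 4096)
v = extract_steering_vector(h_reason, h_plain)
steered = steer(torch.randn(1, 4096), v, alpha=0.7)
```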

[491] Agreement-Based Cascading for Efficient Inference

Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith

Main category: cs.LG

TL;DR: Agreement-Based Cascading (ABC) is an adaptive inference technique that uses model ensemble agreement to route examples through cascades of increasingly complex models, achieving significant cost reductions while maintaining or improving accuracy.

DetailsMotivation: To reduce machine learning inference costs by avoiding large model invocation for easy examples, while leveraging ensemble benefits and parallel execution capabilities.

Method: Builds cascades of models with increasing size/complexity, uses agreement between ensembles at each level for data-dependent routing, and offsets ensemble costs through parallel execution and model size differences.

Result: ABC achieves up to 14x communication cost reduction in edge-to-cloud inference, 3x rental cost reduction in cloud serving, and 2-25x price reduction per token/request in API services compared to state-of-the-art LLM cascades.

Conclusion: ABC reliably surpasses single models in both efficiency and accuracy, serving as an effective drop-in replacement that outperforms existing cascading methods across multiple inference scenarios.

Abstract: Adaptive inference schemes reduce the cost of machine learning inference by assigning smaller models to easier examples, attempting to avoid invocation of larger models when possible. In this work we explore a simple, effective adaptive inference technique we term Agreement-Based Cascading (ABC). ABC builds a cascade of models of increasing size/complexity, and uses agreement between ensembles of models at each level of the cascade as a basis for data-dependent routing. Although ensemble execution introduces additional expense, we show that these costs can be easily offset in practice due to large expected differences in model sizes, parallel inference execution capabilities, and accuracy benefits of ensembling. We examine ABC theoretically and empirically in terms of these parameters, showing that the approach can reliably act as a drop-in replacement for existing models and surpass the best single model it aims to replace in terms of both efficiency and accuracy. Additionally, we explore the performance of ABC relative to existing cascading methods in three common scenarios: (1) edge-to-cloud inference, where ABC reduces communication costs by up to 14x; (2) cloud-based model serving, where it achieves a 3x reduction in rental costs; and (3) inference via model API services, where ABC achieves a 2-25x reduction in average price per token/request relative to state-of-the-art LLM cascades.
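
The routing rule itself is simple, as the sketch below shows: query the ensemble at the current cascade level, return on unanimous agreement, otherwise escalate. Cost accounting and parallel execution, which the paper analyzes, are omitted.

```python
from typing import Callable, Sequence

def abc_predict(x, levels: Sequence[Sequence[Callable]]):
    """Route one example through a cascade of ensembles: return as soon as
    all models at the current level agree, otherwise escalate; the final
    level is trusted unconditionally."""
    for ensemble in levels[:-1]:
        preds = [model(x) for model in ensemble]  # parallelizable in practice
        if len(set(preds)) == 1:
            return preds[0]
    return levels[-1][0](x)

# Toy usage with callables standing in for models of increasing size:
small = [lambda x: x > 0.0, lambda x: x > 0.1]
large = [lambda x: x > 0.05]
print(abc_predict(0.5, [small, large]))  # small models agree -> True, no escalation
```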

[492] Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

Zeyu Huang, Tianhao Cheng, Zihan Qiu, Zili Wang, Yinghui Xu, Edoardo M. Ponti, Ivan Titov

Main category: cs.LG

TL;DR: Prefix-RFT is a hybrid post-training method that combines Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) to overcome their individual limitations, achieving better performance than standalone methods.

DetailsMotivation: SFT excels at mimicking data but has problematic generalization, while RFT enhances performance but is sensitive to initial policy and learns unexpected behaviors. A unified approach is needed to synergize both paradigms.

Method: Prefix-RFT integrates learning from both demonstration (SFT) and exploration (RFT) through a hybrid approach that requires minimal modifications to standard RFT pipelines. It uses mathematical reasoning problems as a testbed.

Result: Prefix-RFT surpasses standalone SFT and RFT performance, outperforms parallel mixed-policy RFT methods, shows robustness to variations in demonstration data quality/quantity, and seamlessly integrates with existing frameworks.

Conclusion: Prefix-RFT effectively harmonizes SFT and RFT, demonstrating that a unified paradigm integrating demonstration and exploration is a promising direction for LLM post-training research.

Abstract: Existing post-training techniques for large language models are broadly categorized into Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT). Each paradigm presents a distinct trade-off: SFT excels at mimicking demonstration data but can lead to problematic generalization as a form of behavior cloning. Conversely, RFT can significantly enhance a model’s performance but is prone to learn unexpected behaviors, and its performance is highly sensitive to the initial policy. In this paper, we propose a unified view of these methods and introduce Prefix-RFT, a hybrid approach that synergizes learning from both demonstration and exploration. Using mathematical reasoning problems as a testbed, we empirically demonstrate that Prefix-RFT is both simple and effective. It not only surpasses the performance of standalone SFT and RFT but also outperforms parallel mixed-policy RFT methods. A key advantage is its seamless integration into existing open-source frameworks, requiring only minimal modifications to the standard RFT pipeline. Our analysis highlights the complementary nature of SFT and RFT, and validates that Prefix-RFT effectively harmonizes these two learning paradigms. Furthermore, ablation studies confirm the method’s robustness to variations in the quality and quantity of demonstration data. We hope this work offers a new perspective on LLM post-training, suggesting that a unified paradigm that judiciously integrates demonstration and exploration could be a promising direction for future research.
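
Below is a minimal sketch of prefix sampling, under the assumption that a rollout conditions the policy on a random-length prefix of a demonstration and scores the policy's completion; the RFT objective applied to these trajectories and the prefix-length schedule follow the standard pipeline and are not shown.

```python
import random
from typing import Callable, List, Tuple

def prefix_rollout(
    policy_sample: Callable[[List[int]], List[int]],
    demo_tokens: List[int],
    reward_fn: Callable[[List[int]], float],
) -> Tuple[List[int], float]:
    """One prefix-sampled rollout: condition the policy on a random-length
    prefix of a demonstration, let it complete the rest, and score the full
    trajectory. The uniform prefix-length distribution is a placeholder."""
    k = random.randint(0, len(demo_tokens))
    prefix = demo_tokens[:k]
    trajectory = prefix + policy_sample(prefix)
    return trajectory, reward_fn(trajectory)

# Toy usage with stand-in callables:
traj, r = prefix_rollout(lambda p: [0, 1], [5, 6, 7, 8], lambda t: float(len(t)))
```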

[493] Context-Aware Reasoning On Parametric Knowledge for Inferring Causal Variables

Ivaxi Sheth, Sahar Abdelnabi, Mario Fritz

Main category: cs.LG

TL;DR: The paper introduces a novel benchmark for completing partial causal graphs, requiring LLMs to hypothesize backdoor variables between cause and effect, with over 4000 queries of varying difficulty.

DetailsMotivation: Scientific discovery relies on causal inference, but randomized experiments are often infeasible and observational studies suffer from confounding biases. Identifying backdoor paths is expensive and dependent on domain knowledge, creating a need for automated hypothesis generation.

Method: The authors designed a benchmark with varying difficulty levels containing over 4000 queries where LLMs must complete partial causal graphs by reasoning about backdoor variables in the context of the entire graph structure.

Result: The study demonstrates LLMs’ strong ability to hypothesize backdoor variables between causes and effects, showing that this requires contextual reasoning rather than simple knowledge memorization.

Conclusion: LLMs show promising capabilities for automated causal inference and hypothesis generation in scientific discovery, potentially reducing reliance on expensive domain knowledge for identifying confounding paths in observational studies.

Abstract: Scientific discovery catalyzes human intellectual advances, driven by the cycle of hypothesis generation, experimental design, evaluation, and assumption refinement. Central to this process is causal inference, uncovering the mechanisms behind observed phenomena. While randomized experiments provide strong inferences, they are often infeasible due to ethical or practical constraints. However, observational studies are prone to confounding or mediating biases. While crucial, identifying such backdoor paths is expensive and heavily depends on scientists’ domain knowledge to generate hypotheses. We introduce a novel benchmark where the objective is to complete a partial causal graph. We design a benchmark with varying difficulty levels with over 4000 queries. We show the strong ability of LLMs to hypothesize the backdoor variables between a cause and its effect. Unlike simple knowledge memorization of fixed associations, our task requires the LLM to reason according to the context of the entire graph.

[494] Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training

Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Min Xie, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, Fei Zhu

Main category: cs.LG

TL;DR: Comparative analysis of supervised fine-tuning (SFT) vs reinforcement fine-tuning (RFT) in continual post-training, showing RFT’s superior knowledge retention and general capability preservation compared to SFT’s catastrophic forgetting.

DetailsMotivation: To explore the fundamental role of learning paradigms in continual post-training (CPT) for multimodal large language models, as existing research has focused on methods like data replay and parameter regularization but overlooked the core learning paradigm impact.

Method: Comparative experiments on seven diverse multimodal tasks using Qwen2.5-VL-7B-Instruct, analyzing SFT and RFT paradigms with theoretical analysis of gradient updates and proposing a rollout-based instance filtering algorithm.

Result: RFT inherently preserves prior knowledge, achieving performance comparable to multi-task training, and protects or even enhances general knowledge on standard benchmarks, while SFT causes catastrophic forgetting and degrades general capabilities. The stability comes from RFT’s implicit regularization via reward-variance scaling.

Conclusion: RFT is superior as a robust paradigm for continual post-training due to its natural knowledge preservation mechanisms and better performance stability compared to SFT.

Abstract: Continual post-training (CPT) is a popular and effective technique for adapting foundation models like multimodal large language models to specific and ever-evolving downstream tasks. While existing research has primarily concentrated on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm within CPT remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base model for continual post-training. The investigation yields two significant findings: (1) When continuously learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks. In contrast, RFT inherently preserves prior knowledge and achieves performance comparable to multi-task training. (2) RFT successfully protects and even enhances the model’s general knowledge on standard benchmarks (e.g., MMMU and MMLU-Pro). Conversely, SFT degrades general model capabilities severely. Further analysis reveals that this stability is not primarily due to explicit mechanisms like KL penalty or chain-of-thought reasoning. Instead, we identify an implicit regularization mechanism inherent to RFT as a key contributing factor. Our theoretical analysis suggests that RFT’s gradient updates are naturally scaled by the reward variance, acting as a data-dependent regularizer that inherently protects previously acquired knowledge. Finally, we propose a rollout-based instance filtering algorithm to enhance the stability and efficiency of RFT. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.
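
The reward-variance intuition is easy to see in a toy computation with group-relative advantages: when every rollout for a prompt receives the same reward (e.g., behavior the model has already mastered), the advantages, and hence the policy gradient, vanish. This is a hedged illustration of the mechanism, not the paper's full analysis.

```python
import numpy as np

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: rewards minus the group mean. If every
    rollout for a prompt earns the same reward, advantages are all zero,
    so the policy gradient leaves the model untouched on that prompt."""
    return rewards - rewards.mean()

print(group_advantages(np.array([1.0, 1.0, 1.0, 1.0])))  # zeros: no update
print(group_advantages(np.array([1.0, 0.0, 0.0, 1.0])))  # informative update
```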

[495] OLMA: One Loss for More Accurate Time Series Forecasting

Tianyi Shi, Zhu Meng, Yue Chen, Siyang Zheng, Fei Su, Jin Huang, Changrui Ren, Zhicheng Zhao

Main category: cs.LG

TL;DR: This paper addresses two key challenges in time series forecasting: random noise setting a theoretical error lower bound, and neural networks’ frequency bias. It proposes using unitary transformations to reduce entropy and introduces frequency domain supervision with DFT/DWT, plus a novel OLMA loss function that improves forecasting accuracy.

DetailsMotivation: Time series forecasting faces two overlooked challenges: (1) random noise in labels creates a theoretical lower bound for forecasting error correlated with label entropy, and (2) neural networks exhibit frequency bias where they learn some frequency bands well but others poorly, limiting overall performance.

Method: Proves a theorem that unitary transformations can reduce marginal entropy of correlated Gaussian processes. Uses DFT to reduce entropy and introduces frequency domain supervision via DFT and DWT. Proposes OLMA loss function that applies frequency domain transformations across channel and temporal dimensions.

Result: Experimental results on multiple datasets demonstrate OLMA’s effectiveness in addressing both challenges, leading to improved forecasting accuracy. DFT is confirmed to reduce entropy in most scenarios.

Conclusion: The perspectives of entropy and frequency bias provide a new feasible research direction for time series forecasting. The proposed frequency domain supervision strategy is general and can be integrated into any supervised learning method.

Abstract: Time series forecasting faces two important but often overlooked challenges. Firstly, the inherent random noise in the time series labels sets a theoretical lower bound for the forecasting error, which is positively correlated with the entropy of the labels. Secondly, neural networks exhibit a frequency bias when modeling the state-space of time series, that is, the model performs well in learning certain frequency bands but poorly in others, thus restricting the overall forecasting performance. To address the first challenge, we prove a theorem that there exists a unitary transformation that can reduce the marginal entropy of multiple correlated Gaussian processes, thereby providing guidance for reducing the lower bound of forecasting error. Furthermore, experiments confirm that Discrete Fourier Transform (DFT) can reduce the entropy in the majority of scenarios. Correspondingly, to alleviate the frequency bias, we jointly introduce supervision in the frequency domain along the temporal dimension through DFT and Discrete Wavelet Transform (DWT). This supervision-side strategy is highly general and can be seamlessly integrated into any supervised learning method. Moreover, we propose a novel loss function named OLMA, which utilizes the frequency domain transformation across both channel and temporal dimensions to enhance forecasting. Finally, the experimental results on multiple datasets demonstrate the effectiveness of OLMA in addressing the above two challenges and the resulting improvement in forecasting accuracy. The results also indicate that the perspectives of entropy and frequency bias provide a new and feasible research direction for time series forecasting. The code is available at: https://github.com/Yuyun1011/OLMA-One-Loss-for-More-Accurate-Time-Series-Forecasting.
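
Below is a minimal sketch of frequency-domain supervision along the temporal dimension, assuming series of shape (batch, time, channels): combine time-domain MSE with an MSE on DFT coefficients. OLMA additionally uses DWT and a channel-dimension transform, which are omitted here.

```python
import torch

def frequency_supervised_loss(pred: torch.Tensor, target: torch.Tensor,
                              lam: float = 0.5) -> torch.Tensor:
    """Time-domain MSE plus an MSE on DFT coefficients along the temporal
    axis, so that poorly learned frequency bands are penalized directly."""
    time_loss = torch.mean((pred - target) ** 2)
    pred_f = torch.fft.rfft(pred, dim=1)
    target_f = torch.fft.rfft(target, dim=1)
    freq_loss = torch.mean(torch.abs(pred_f - target_f) ** 2)
    return time_loss + lam * freq_loss

loss = frequency_supervised_loss(torch.randn(8, 96, 7), torch.randn(8, 96, 7))
```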

[496] DimINO: Dimension-Informed Neural Operator Learning

Yichen Song, Yalun Wu, Yunbo Wang, Xiaokang Yang

Main category: cs.LG

TL;DR: DimINO introduces dimension-informed neural operators that incorporate dimensional analysis to create lightweight models for solving PDEs while maintaining generalization capabilities.

DetailsMotivation: Traditional neural operators require large architectures for reliable error bounds, creating a need for more efficient models that don't sacrifice generalization across varying physical parameters.

Method: DimINO framework includes DimNorm and redimensionalization operations that can be integrated into existing neural operator architectures, leveraging dimensional analysis principles.

Result: Empirical results show up to 76.3% performance gain on PDE datasets while demonstrating Similar Transformation Invariance (STI) property.

Conclusion: DimINO provides a theoretically grounded and empirically effective approach to creating lightweight neural operators that maintain generalization through dimensional analysis principles.

Abstract: In computational physics, a longstanding challenge lies in finding numerical solutions to partial differential equations (PDEs). Recently, research attention has increasingly focused on Neural Operator methods, which are notable for their ability to approximate operators: mappings between functions. Although neural operators benefit from a universal approximation theorem, achieving reliable error bounds often necessitates large model architectures, such as deep stacks of Fourier layers. This raises a natural question: Can we design lightweight models without sacrificing generalization? To address this, we introduce DimINO (Dimension-Informed Neural Operators), a framework inspired by dimensional analysis. DimINO incorporates two key components, DimNorm and a redimensionalization operation, which can be seamlessly integrated into existing neural operator architectures. These components enhance the model’s ability to generalize across datasets with varying physical parameters. Theoretically, we establish a universal approximation theorem for DimINO and prove that it satisfies a critical property we term Similar Transformation Invariance (STI). Empirically, DimINO achieves up to 76.3% performance gain on PDE datasets while exhibiting clear evidence of the STI property.

[497] Causal Reflection with Language Models

Abi Aryan, Zac Liu

Main category: cs.LG

TL;DR: Causal Reflection framework enables agents to model causality dynamically and self-correct through formal reflection mechanisms, using LLMs as structured inference engines rather than black-box reasoners.

DetailsMotivation: Both LLMs and traditional RL agents lack robust causal reasoning capabilities, relying on spurious correlations without understanding why actions lead to outcomes, limiting their adaptability and self-correction abilities.

Method: Introduces Causal Reflection framework that models causality as dynamic function over state, action, time, and perturbation, plus a formal Reflect mechanism that identifies outcome mismatches and generates causal hypotheses for model revision.

Result: The framework enables agents to reason about delayed and nonlinear effects, adapt to evolving environments, and communicate causal understanding through natural language explanations and counterfactuals.

Conclusion: Lays theoretical groundwork for Causal Reflective agents that can self-correct, adapt, and explicitly model causal relationships, moving beyond black-box reasoning to structured causal inference.

Abstract: While LLMs exhibit impressive fluency and factual recall, they struggle with robust causal reasoning, often relying on spurious correlations and brittle patterns. Similarly, traditional Reinforcement Learning agents also lack causal understanding, optimizing for rewards without modeling why actions lead to outcomes. We introduce Causal Reflection, a framework that explicitly models causality as a dynamic function over state, action, time, and perturbation, enabling agents to reason about delayed and nonlinear effects. Additionally, we define a formal Reflect mechanism that identifies mismatches between predicted and observed outcomes and generates causal hypotheses to revise the agent’s internal model. In this architecture, LLMs serve not as black-box reasoners, but as structured inference engines translating formal causal outputs into natural language explanations and counterfactuals. Our framework lays the theoretical groundwork for Causal Reflective agents that can adapt, self-correct, and communicate causal understanding in evolving environments.

[498] Redefining Neural Operators in $d+1$ Dimensions

Haoze Song, Zhihao Li, Xiaobo Zhang, Zecheng Gan, Zhilu Lai, Wei Wang

Main category: cs.LG

TL;DR: The paper introduces the Schrödingerised Kernel Neural Operator (SKNO), a novel neural operator framework that redefines neural operators on a d+1 dimensional domain inspired by quantum simulation techniques, achieving superior performance across various PDE benchmarks.

DetailsMotivation: Existing neural operators lack a clear understanding of the evolution mechanism in embedding spaces, which limits their ability to fully capture target system dynamics. The authors aim to elucidate this mechanism and design more effective operators.

Method: Drawing from the Schrödingerisation method in quantum PDE simulations, the paper redefines neural operators on a d+1 dimensional domain and implements SKNO, which better aligns with the d+1 dimensional evolution. The approach includes lifting and recovering operators within the new framework.

Result: SKNO consistently outperforms other baselines across ten benchmarks from simple 1D heat equations to complex 3D Rayleigh-Taylor instability. It demonstrates resolution-invariance on mixing-resolution training and zero-shot super-resolution tasks.

Conclusion: The d+1 dimensional evolving design in SKNO provides better alignment with underlying system evolution, offering a more effective framework for neural operators that can fully capture complex PDE dynamics.

Abstract: Neural Operators have emerged as powerful tools for learning mappings between function spaces. Among them, the kernel integral operator has been widely validated on universally approximating various operators. Although many advancements following this definition have developed effective modules to better approximate the kernel function defined on the original domain (with $d$ dimensions, $d=1, 2, 3\dots$), the unclarified evolving mechanism in the embedding spaces blocks researchers’ view to design neural operators that can fully capture the target system evolution. Drawing on the Schrödingerisation method in quantum simulations of partial differential equations (PDEs), we elucidate the linear evolution mechanism in neural operators. Based on that, we redefine neural operators on a new $d+1$ dimensional domain. Within this framework, we implement a Schrödingerised Kernel Neural Operator (SKNO) aligning better with the $d+1$ dimensional evolution. In experiments, the $d+1$ dimensional evolving designs in our SKNO consistently outperform other baselines across ten benchmarks of increasing difficulty, ranging from the simple 1D heat equation to the highly nonlinear 3D Rayleigh-Taylor instability. We also validate the resolution-invariance of SKNO on mixing-resolution training and zero-shot super-resolution tasks. In addition, we show the impact of different lifting and recovering operators on the prediction within the redefined NO framework, reflecting the alignment between our model and the underlying $d+1$ dimensional evolution.

[499] Bayesian Optimization with Preference Exploration using a Monotonic Neural Network Ensemble

Hanyang Wang, Juergen Branke, Matthias Poloczek

Main category: cs.LG

TL;DR: Proposes a neural network ensemble approach for Bayesian Optimization with Preference Exploration (BOPE) that incorporates monotonicity constraints to improve multi-objective optimization by focusing on relevant Pareto-optimal subsets.

DetailsMotivation: Many real-world black-box optimization problems have multiple conflicting objectives, and while interactive preference learning helps focus on relevant solutions, previous approaches haven't fully exploited the natural monotonicity of utility functions.

Method: Uses a neural network ensemble as a utility surrogate model that naturally integrates monotonicity constraints and supports pairwise comparison data for Bayesian Optimization with Preference Exploration.

Result: The proposed method outperforms state-of-the-art approaches and demonstrates robustness to noise in utility evaluations. An ablation study confirms the critical importance of monotonicity for performance enhancement.

Conclusion: Incorporating monotonicity constraints through neural network ensembles significantly improves Bayesian optimization for multi-objective problems with preference learning, providing better performance and noise robustness compared to existing methods.

Abstract: Many real-world black-box optimization problems have multiple conflicting objectives. Rather than attempting to approximate the entire set of Pareto-optimal solutions, interactive preference learning allows focusing the search on the most relevant subset. However, few previous studies have exploited the fact that utility functions are usually monotonic. In this paper, we address the Bayesian Optimization with Preference Exploration (BOPE) problem and propose using a neural network ensemble as a utility surrogate model. This approach naturally integrates monotonicity and supports pairwise comparison data. Our experiments demonstrate that the proposed method outperforms state-of-the-art approaches and exhibits robustness to noise in utility evaluations. An ablation study highlights the critical role of monotonicity in enhancing performance.
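
One standard way to build such a monotonic surrogate, assumed in the sketch below, is to constrain all weights to be non-negative via softplus, so the network is non-decreasing in every input; an ensemble of these provides the utility uncertainty used during acquisition. The paper's exact architecture and pairwise-comparison likelihood may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMLP(nn.Module):
    """MLP that is non-decreasing in every input: weights pass through
    softplus (hence are positive) and the activations are increasing."""
    def __init__(self, in_dim: int, hidden: int = 32):
        super().__init__()
        self.w1 = nn.Parameter(0.1 * torch.randn(hidden, in_dim))
        self.b1 = nn.Parameter(torch.zeros(hidden))
        self.w2 = nn.Parameter(0.1 * torch.randn(1, hidden))
        self.b2 = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.tanh(F.linear(x, F.softplus(self.w1), self.b1))
        return F.linear(h, F.softplus(self.w2), self.b2)

# Ensemble disagreement supplies uncertainty for acquisition; each member
# would be fit to pairwise comparisons (e.g., a Bradley-Terry likelihood
# on utility differences).
ensemble = [MonotonicMLP(in_dim=3) for _ in range(5)]
xq = torch.rand(10, 3)  # candidate objective vectors
utilities = torch.stack([m(xq) for m in ensemble])
mean, std = utilities.mean(dim=0), utilities.std(dim=0)
```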

[500] Generative and Contrastive Graph Representation Learning

Jiali Chen, Avijit Mukherjee

Main category: cs.LG

TL;DR: A novel graph self-supervised learning architecture that combines contrastive and generative approaches with community-aware node-level contrastive learning and comprehensive augmentation strategies to achieve superior performance across multiple graph tasks.

DetailsMotivation: Existing graph SSL methods excel in different tasks - contrastive methods perform well on classification while generative methods excel at link prediction. The paper aims to integrate both approaches to create a more versatile and effective framework.

Method: Integrates contrastive and generative paradigms with community-aware node-level contrastive learning for better positive/negative pairs generation, graph-level contrastive learning for global semantics, and comprehensive augmentation combining feature masking, node perturbation, and edge perturbation.

Result: Outperforms state-of-the-art methods on open benchmark datasets, achieving performance improvements of 0.23%-2.01% across node classification, clustering, and link prediction tasks.

Conclusion: The proposed hybrid framework successfully combines the strengths of both contrastive and generative approaches, demonstrating superior and more balanced performance across diverse graph learning tasks through enhanced community awareness and robust augmentation strategies.

Abstract: Self-supervised learning (SSL) on graphs generates node and graph representations (i.e., embeddings) that can be used for downstream tasks such as node classification, node clustering, and link prediction. Graph SSL is particularly useful in scenarios with limited or no labeled data. Existing SSL methods predominantly follow contrastive or generative paradigms, each excelling in different tasks: contrastive methods typically perform well on classification tasks, while generative methods often excel in link prediction. In this paper, we present a novel architecture for graph SSL that integrates the strengths of both approaches. Our framework introduces community-aware node-level contrastive learning, providing more robust and effective positive and negative node pairs generation, alongside graph-level contrastive learning to capture global semantic information. Additionally, we employ a comprehensive augmentation strategy that combines feature masking, node perturbation, and edge perturbation, enabling robust and diverse representation learning. By incorporating these enhancements, our model achieves superior performance across multiple tasks, including node classification, clustering, and link prediction. Evaluations on open benchmark datasets demonstrate that our model outperforms state-of-the-art methods, achieving a performance lift of 0.23%-2.01% depending on the task and dataset.
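
Two of the three augmentations are straightforward to sketch, assuming node features `x` of shape (num_nodes, num_feats) and a COO `edge_index`; node perturbation is analogous and omitted for brevity.

```python
import torch

def augment(x: torch.Tensor, edge_index: torch.Tensor,
            feat_mask_p: float = 0.2, edge_drop_p: float = 0.2):
    """Feature masking plus edge perturbation: randomly zero out feature
    columns and randomly drop edges, producing one stochastic view of the
    graph for contrastive or generative objectives."""
    mask = torch.rand(x.size(1)) >= feat_mask_p   # keep columns with prob 1 - p
    x_aug = x * mask.float()
    keep = torch.rand(edge_index.size(1)) >= edge_drop_p
    return x_aug, edge_index[:, keep]

x_aug, ei_aug = augment(torch.randn(100, 16), torch.randint(0, 100, (2, 400)))
```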

[501] A Quotient Homology Theory of Representation in Neural Networks

Kosio Beshkov

Main category: cs.LG

TL;DR: This paper introduces an equivalence class based on neural network hyperplane arrangements to compute topological features (Betti numbers) of neural representations without external metrics, using overlap decomposition and proving homology isomorphisms under convex intersection conditions.

DetailsMotivation: To develop an intrinsic method for calculating topological features of neural representations that tracks purely topological rather than geometric characteristics, avoiding dependence on external metrics used in persistent homology.

Method: Define equivalence classes using neural network hyperplane arrangements, prove homology isomorphisms when intersections are convex, develop numerical computation methods using linear programming and union-find algorithms to compute overlap decomposition.

Result: The method enables computation of Betti numbers that capture topological features distinct from geometric ones measured by persistent homology. Experiments on toy datasets validate the approach and show evolution of overlap decomposition during training.

Conclusion: The framework provides an intrinsic topological analysis tool for neural representations, though some limitations exist. It offers a metric-independent alternative to persistent homology for studying neural network behavior.

Abstract: Previous research has proven that the set of maps implemented by neural networks with a ReLU activation function is identical to the set of piecewise linear continuous maps. Furthermore, such networks induce a hyperplane arrangement splitting the input domain of the network into convex polyhedra $G_J$ over which a network $\Phi$ operates in an affine manner. In this work, we leverage these properties to define an equivalence class $\sim_\Phi$ on top of an input dataset, which can be split into two sets related to the local rank of $\Phi_J$ and the intersections $\cap_i \text{Im}\,\Phi_{J_i}$. We refer to the latter as the overlap decomposition $\mathcal{O}_\Phi$ and prove that if the intersections between each polyhedron and an input manifold are convex, the homology groups of neural representations are isomorphic to quotient homology groups $H_k(\Phi(\mathcal{M})) \simeq H_k(\mathcal{M}/\mathcal{O}_\Phi)$. This lets us intrinsically calculate the Betti numbers of neural representations without the choice of an external metric. We develop methods to numerically compute the overlap decomposition through linear programming and a union-find algorithm. Using this framework, we perform several experiments on toy datasets showing that, compared to standard persistent homology, our overlap homology-based computation of Betti numbers tracks purely topological rather than geometric features. Finally, we study the evolution of the overlap decomposition during training on several classification problems while varying network width and depth and discuss some shortcomings of our method.

[502] Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota

Main category: cs.LG

TL;DR: This paper investigates how Mixture-of-Experts (MoE) sparsity affects different capability regimes in large language models, revealing that optimal sparsity depends on active FLOPs and total tokens per parameter rather than traditional compute scaling laws.

DetailsMotivation: Current scaling laws overlook the sparsity dimension introduced by MoE models, and there's a need to understand how sparsity affects different capabilities (memorization vs reasoning) under fixed compute budgets.

Method: Trained MoE model families varying total parameters, active parameters, and top-k routing under fixed compute budgets, then analyzed pre-training loss versus downstream accuracy across different capability regimes.

Result: Two key principles emerged: 1) Models with identical training loss but greater active compute achieve higher reasoning accuracy (Active FLOPs principle), 2) Memorization improves with more parameters while reasoning benefits from optimal total tokens per parameter (TPP principle).

Conclusion: Optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising classical compute-optimal scaling laws for sparse models.

Abstract: Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-$k$ routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve with more parameters, while reasoning tasks benefit from optimal TPP, indicating that reasoning is data-hungry. Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends. We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.

[503] Training Set Reconstruction from Differentially Private Forests: How Effective is DP?

Alice Gorgé, Julien Ferry, Sébastien Gambs, Thibaut Vidal

Main category: cs.LG

TL;DR: A reconstruction attack on differentially private random forests that can recover training data despite privacy guarantees, showing that DP alone is insufficient for complete protection.

DetailsMotivation: To demonstrate that even state-of-the-art differentially private random forests can leak training data, challenging the assumption that DP provides complete privacy protection.

Method: Uses constraint programming that incorporates knowledge of the forest’s structure and DP mechanism characteristics to formally reconstruct the most likely training dataset.

Result: DP reduces but does not eliminate reconstruction success; only forests with performance no better than a constant classifier are fully robust to the attack.

Conclusion: Provides practical recommendations for building more resilient DP random forests that maintain predictive performance while being more resistant to reconstruction attacks.

Abstract: Recent research has shown that structured machine learning models such as tree ensembles are vulnerable to privacy attacks targeting their training data. To mitigate these risks, differential privacy (DP) has become a widely adopted countermeasure, as it offers rigorous privacy protection. In this paper, we introduce a reconstruction attack targeting state-of-the-art $\epsilon$-DP random forests. By leveraging a constraint programming model that incorporates knowledge of the forest’s structure and DP mechanism characteristics, our approach formally reconstructs the most likely dataset that could have produced a given forest. Through extensive computational experiments, we examine the interplay between model utility, privacy guarantees and reconstruction accuracy across various configurations. Our results reveal that random forests trained with meaningful DP guarantees can still leak portions of their training data. Specifically, while DP reduces the success of reconstruction attacks, the only forests fully robust to our attack exhibit predictive performance no better than a constant classifier. Building on these insights, we also provide practical recommendations for the construction of DP random forests that are more resilient to reconstruction attacks while maintaining a non-trivial predictive performance.

[504] ALICE: An Interpretable Neural Architecture for Generalization in Substitution Ciphers

Jeff Shen, Lindsay M. Smith

Main category: cs.LG

TL;DR: ALICE is a Transformer-based model that achieves state-of-the-art performance in solving substitution ciphers, generalizing well from minimal training data and offering interpretable insights into neural reasoning processes.

DetailsMotivation: Cryptogram solving serves as an ideal testbed for studying neural network reasoning and generalization, requiring models to decrypt text without explicit cipher access from 26! possible mappings.

Method: ALICE uses an encoder-only Transformer with a novel bijective decoding head that models permutations via Gumbel-Sinkhorn method, enabling direct extraction of learned cipher mappings.

Result: ALICE achieves new state-of-the-art accuracy and speed, generalizing to unseen ciphers after training on only ~1500 unique ciphers (3.7×10^-24 of the possible cipher space).

Conclusion: The architectural innovations and analysis methods provide insights into neural network generalization and interpretability, with applications beyond cryptograms.

Abstract: We present cryptogram solving as an ideal testbed for studying neural network reasoning and generalization; models must decrypt text encoded with substitution ciphers, choosing from 26! possible mappings without explicit access to the cipher. We develop ALICE (an Architecture for Learning Interpretable Cryptogram dEcipherment), a simple encoder-only Transformer that sets a new state-of-the-art for both accuracy and speed on this decryption problem. Surprisingly, ALICE generalizes to unseen ciphers after training on only ${\sim}1500$ unique ciphers, a minute fraction ($3.7 \times 10^{-24}$) of the possible cipher space. To enhance interpretability, we introduce a novel bijective decoding head that explicitly models permutations via the Gumbel-Sinkhorn method, enabling direct extraction of learned cipher mappings. Through early exit and probing experiments, we reveal how ALICE progressively refines its predictions in a way that appears to mirror common human strategies – early layers place greater emphasis on letter frequencies, while later layers form word-level structures. Our architectural innovations and analysis methods are applicable beyond cryptograms and offer new insights into neural network generalization and interpretability.
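
The Gumbel-Sinkhorn relaxation behind the bijective decoding head can be sketched directly: add Gumbel noise to a 26x26 score matrix, then alternate row and column normalization in log space to approach a doubly stochastic (and, as the temperature shrinks, near-permutation) matrix. Hyperparameters here are illustrative.

```python
import torch

def gumbel_sinkhorn(logits: torch.Tensor, tau: float = 0.5, n_iters: int = 20) -> torch.Tensor:
    """Relax a letter-mapping score matrix into a doubly stochastic matrix:
    perturb with Gumbel noise, then alternate row/column normalization in
    log space. As tau shrinks, the output approaches a hard permutation."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    log_alpha = (logits + gumbel) / tau
    for _ in range(n_iters):
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=-1, keepdim=True)  # rows
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=-2, keepdim=True)  # cols
    return log_alpha.exp()

P = gumbel_sinkhorn(torch.randn(26, 26))
print(P.sum(dim=-1), P.sum(dim=-2))  # both close to all-ones
```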

[505] Cohort-Based Active Modality Acquisition

Tillmann Rheude, Roland Eils, Benjamin Wild

Main category: cs.LG

TL;DR: Cohort-based Active Modality Acquisition (CAMA) is a test-time framework for selecting which samples should receive additional modalities when resources are limited, using generative imputation and discriminative modeling to estimate acquisition benefits.

DetailsMotivation: Real-world ML applications often have incomplete multimodal data, and acquiring additional modalities is costly. Current methods don't adequately address test-time and cohort-based acquisition strategies for optimizing resource allocation.

Method: Proposes CAMA framework with acquisition strategies combining generative imputation and discriminative modeling to estimate expected benefits of acquiring missing modalities. Includes upper-bound heuristics for benchmarking.

Result: Experiments on multimodal datasets with up to 15 modalities show imputation-based strategies outperform unimodal, entropy-based, and random selection methods. Successfully applied to UK Biobank for proteomics data acquisition.

Conclusion: CAMA provides an effective approach for optimizing modality acquisition at cohort level, enabling more efficient resource use in constrained settings with real-world applicability.

Abstract: Real-world machine learning applications often involve data from multiple modalities that must be integrated effectively to make robust predictions. However, in many practical settings, not all modalities are available for every sample, and acquiring additional modalities can be costly. This raises the question: which samples should be prioritized for additional modality acquisition when resources are limited? While prior work has explored individual-level acquisition strategies and training-time active learning paradigms, test-time and cohort-based acquisition remain underexplored. We introduce Cohort-based Active Modality Acquisition (CAMA), a novel test-time setting to formalize the challenge of selecting which samples should receive additional modalities. We derive acquisition strategies that leverage a combination of generative imputation and discriminative modeling to estimate the expected benefit of acquiring missing modalities based on common evaluation metrics. We also introduce upper-bound heuristics that provide performance ceilings to benchmark acquisition strategies. Experiments on multimodal datasets with up to 15 modalities demonstrate that our proposed imputation-based strategies can more effectively guide the acquisition of additional modalities for selected samples compared with methods relying solely on unimodal information, entropy-based guidance, or random selection. We showcase the real-world relevance and scalability of our method by demonstrating its ability to effectively guide the costly acquisition of proteomics data for disease prediction in a large prospective cohort, the UK Biobank (UKBB). Our work provides an effective approach for optimizing modality acquisition at the cohort level, enabling more effective use of resources in constrained settings.

[506] Regularization can make diffusion models more efficient

Mahsa Taheri, Johannes Lederer

Main category: cs.LG

TL;DR: Sparsity can reduce computational costs in diffusion models by decreasing dependency on input dimension complexity.

DetailsMotivation: Diffusion models are computationally expensive, and sparsity offers a way to improve efficiency while maintaining sample quality.

Method: Apply sparsity concepts to diffusion pipelines, leveraging mathematical guarantees to reduce computational complexity based on intrinsic data dimension.

Result: Empirical results show that sparsity leads to better samples at lower computational costs.

Conclusion: Sparsity is an effective approach to enhance the efficiency of diffusion models without compromising performance.

Abstract: Diffusion models are one of the key architectures of generative AI. Their main drawback, however, is the computational costs. This study indicates that the concept of sparsity, well known especially in statistics, can provide a pathway to more efficient diffusion pipelines. Our mathematical guarantees prove that sparsity can reduce the input dimension’s influence on the computational complexity to that of a much smaller intrinsic dimension of the data. Our empirical findings confirm that inducing sparsity can indeed lead to better samples at a lower cost.

[507] ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms

Bingxin Xu, Zhen Dong, Oussama Elachqar, Yuzhang Shang

Main category: cs.LG

TL;DR: ButterflyQuant introduces learnable butterfly transforms to replace fixed Hadamard rotations for 2-bit LLM quantization, achieving better outlier suppression and significantly lower perplexity than previous methods.

DetailsMotivation: Extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. Fixed rotation methods like QuIP and QuaRot cannot adapt to different outlier patterns across transformer layers, motivating layer-adaptive approaches.

Method: Replaces fixed Hadamard transforms with learnable butterfly transforms parameterized by continuous Givens rotation angles. Uses orthogonal constraints for theoretical guarantees and introduces uniformity regularization on post-transformation activations. Learning requires only 128 calibration samples and converges quickly on a single GPU.

Result: For LLaMA-2-7B with 2-bit quantization, ButterflyQuant achieves 15.4 perplexity versus 37.3 for QuIP, demonstrating significant performance improvement.

Conclusion: ButterflyQuant provides an efficient, adaptive solution for extreme quantization that outperforms fixed-transform methods while maintaining computational efficiency and theoretical guarantees.

Abstract: Large language models require massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, using computational invariance: $\mathbf{y} = \mathbf{Wx} = (\mathbf{WQ}^T)(\mathbf{Qx})$ for orthogonal $\mathbf{Q}$. However, these methods use fixed transforms–Hadamard matrices achieving optimal worst-case coherence $\mu = 1/\sqrt{n}$–that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. In this work, we propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard’s discrete $\{+1, -1\}$ entries that are non-differentiable and thus prohibit gradient-based learning, butterfly transforms’ continuous parameterization enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint ensures theoretical guarantees in outlier suppression while achieving $O(n \log n)$ computational complexity with only $\frac{n \log n}{2}$ learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU–a negligible one-time cost. For LLaMA-2-7B with 2-bit quantization, ButterflyQuant achieves 15.4 perplexity versus 37.3 for QuIP. Code is available at https://github.com/42Shawn/Butterflyquant-llm.
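
A butterfly transform is easy to sketch: log2(n) stages of 2x2 Givens rotations, each stage holding n/2 learnable angles, which matches the (n log n)/2 parameter count quoted above and is orthogonal by construction. This is a generic sketch, not ButterflyQuant's exact layer.

```python
import math
import torch

def butterfly_transform(x: torch.Tensor, angles: list[torch.Tensor]) -> torch.Tensor:
    """Apply log2(n) butterfly stages of 2x2 Givens rotations to the last
    dimension of x (n must be a power of two). angles[s] holds the n/2
    rotation angles of stage s, giving (n log n)/2 parameters in total;
    the map is orthogonal by construction and costs O(n log n) to apply."""
    n = x.shape[-1]
    for s in range(int(math.log2(n))):
        stride = 1 << s
        idx = torch.arange(n)
        top = idx[(idx % (2 * stride)) < stride]  # first element of each pair
        bot = top + stride                        # its butterfly partner
        c, si = torch.cos(angles[s]), torch.sin(angles[s])
        y = x.clone()
        y[..., top] = c * x[..., top] - si * x[..., bot]
        y[..., bot] = si * x[..., top] + c * x[..., bot]
        x = y
    return x

n = 8
angles = [torch.randn(n // 2, requires_grad=True) for _ in range(int(math.log2(n)))]
xin = torch.randn(2, n)
out = butterfly_transform(xin, angles)
print(torch.allclose(out.norm(dim=-1), xin.norm(dim=-1), atol=1e-5))  # True: orthogonal
```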

[508] Runtime-Adaptive Pruning for LLM Inference

Huanrong Liu, Chunlin Tian, Xuyang Wei, Qingbiao Li, Li Li

Main category: cs.LG

TL;DR: RAP is an elastic pruning framework using reinforcement learning to dynamically compress LLMs by adjusting compression strategies based on runtime memory variations and heterogeneous KV-cache demands.

DetailsMotivation: Current LLM compression methods use fixed heuristics that fail to adapt to runtime memory variations and diverse KV-cache demands from different user requests, limiting deployment efficiency.

Method: RAP uses reinforcement learning to dynamically track the ratio between model parameters and KV-cache during execution, selectively retaining components that maximize utility within current memory budget based on workload and device state.

Result: Extensive experiments show RAP outperforms state-of-the-art baselines, achieving the first joint consideration of model weights and KV-cache compression in real-time.

Conclusion: RAP successfully addresses runtime memory constraints through dynamic, RL-driven compression that adapts to varying workload demands, enabling more efficient LLM deployment.

Abstract: Large language models (LLMs) excel at language understanding and generation, but their enormous computational and memory requirements hinder deployment. Compression offers a potential solution to mitigate these constraints. However, most existing methods rely on fixed heuristics and thus fail to adapt to runtime memory variations or heterogeneous KV-cache demands arising from diverse user requests. To address these limitations, we propose RAP, an elastic pruning framework driven by reinforcement learning (RL) that dynamically adjusts compression strategies in a runtime-aware manner. Specifically, RAP dynamically tracks the evolving ratio between model parameters and KV-cache across practical execution. Recognizing that FFNs house most parameters, whereas parameter-light attention layers dominate KV-cache formation, the RL agent retains only those components that maximize utility within the current memory budget, conditioned on instantaneous workload and device state. Extensive experimental results demonstrate that RAP outperforms state-of-the-art baselines, marking the first approach to jointly consider model weights and the KV-cache on the fly.

[509] You Are Your Own Best Teacher: Achieving Centralized-level Performance in Federated Learning under Heterogeneous and Long-tailed Data

Shanshan Yan, Zexi Li, Chao Wu, Meng Pang, Yang Lu, Yan Yan, Hanzi Wang

Main category: cs.LG

TL;DR: FedYoYo addresses data heterogeneity in federated learning through self-bootstrap distillation and distribution-aware logit adjustment, achieving centralized-level performance without extra datasets or models.

DetailsMotivation: Previous neural-collapse-inspired methods are insufficient to reach neural collapse optimality and still have significant performance gaps compared to centralized training in federated learning.

Method: FedYoYo introduces Augmented Self-bootstrap Distillation (ASD) to distill knowledge between weakly and strongly augmented local samples, and Distribution-aware Logit Adjustment (DLA) to balance the self-bootstrap process and correct biased feature representations.

Result: FedYoYo nearly eliminates the performance gap, achieving centralized-level performance even under mixed heterogeneity. It reduces model drift, improves convergence, and achieves state-of-the-art results, surpassing centralized logit adjustment methods by 5.4% under global long-tailed settings.

Conclusion: FedYoYo effectively addresses data heterogeneity challenges in federated learning through self-bootstrap distillation and distribution-aware optimization, demonstrating superior performance compared to existing methods.

Abstract: Data heterogeneity, stemming from local non-IID data and global long-tailed distributions, is a major challenge in federated learning (FL), leading to significant performance gaps compared to centralized learning. Previous research found that poor representations and biased classifiers are the main problems and proposed neural-collapse-inspired synthetic simplex ETF to help representations be closer to neural collapse optima. However, we find that the neural-collapse-inspired methods are not strong enough to reach neural collapse and still have huge gaps to centralized training. In this paper, we rethink this issue from a self-bootstrap perspective and propose FedYoYo (You Are Your Own Best Teacher), introducing Augmented Self-bootstrap Distillation (ASD) to improve representation learning by distilling knowledge between weakly and strongly augmented local samples, without needing extra datasets or models. We further introduce Distribution-aware Logit Adjustment (DLA) to balance the self-bootstrap process and correct biased feature representations. FedYoYo nearly eliminates the performance gap, achieving centralized-level performance even under mixed heterogeneity. It enhances local representation learning, reducing model drift and improving convergence, with feature prototypes closer to neural collapse optimality. Extensive experiments show FedYoYo achieves state-of-the-art results, even surpassing centralized logit adjustment methods by 5.4% under global long-tailed settings.
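
Under our reading of the abstract, a local training step could combine the two ingredients roughly as follows; this PyTorch sketch is an interpretation, and FedYoYo's exact losses, weighting, and scheduling may differ:

```python
import torch
import torch.nn.functional as F

def fedyoyo_style_loss(model, x_weak, x_strong, y, class_counts, tau=1.0):
    """ASD + DLA sketch: the weakly augmented view serves as the teacher
    for the strongly augmented view, and logits are shifted by the local
    log-prior so head classes do not dominate the bootstrap."""
    log_prior = torch.log(class_counts.float() / class_counts.sum())
    logits_w, logits_s = model(x_weak), model(x_strong)
    # DLA: prior-adjusted cross-entropy on the strong view.
    ce = F.cross_entropy(logits_s + tau * log_prior, y)
    # ASD: self-bootstrap distillation, no extra dataset or teacher model.
    asd = F.kl_div(F.log_softmax(logits_s, dim=1),
                   F.softmax(logits_w.detach(), dim=1),
                   reduction="batchmean")
    return ce + asd
```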

[510] Small LLMs with Expert Blocks Are Good Enough for Hyperparameter Tuning

Om Naphade, Saksham Bansal, Parikshit Pareek

Main category: cs.LG

TL;DR: Proposes an Expert Block Framework using Small LLMs for Hyper-parameter Tuning (HPT) that achieves comparable performance to large models like GPT-4 with only 10 trials.

DetailsMotivation: HPT is computationally expensive and opaque with larger models, and existing LLM-based HPT approaches rely on massive models exceeding 100 billion parameters.

Method: Uses Trajectory Context Summarizer (TCS) to transform raw training trajectories into structured context, enabling small LLMs (phi4:14B and qwen2.5-coder:32B) to analyze optimization progress effectively.

Result: Achieves average performance within ~0.9 percentage points of GPT-4 across six diverse tasks using only a 10-trial budget.

Conclusion: Small LLMs with proper context summarization can perform HPT nearly as well as much larger models, making HPT more accessible and efficient.

Abstract: Hyper-parameter Tuning (HPT) is a necessary step in machine learning (ML) pipelines but becomes computationally expensive and opaque with larger models. Recently, Large Language Models (LLMs) have been explored for HPT, yet most rely on models exceeding 100 billion parameters. We propose an Expert Block Framework for HPT using Small LLMs. At its core is the Trajectory Context Summarizer (TCS), a deterministic block that transforms raw training trajectories into structured context, enabling small LLMs to analyze optimization progress with reliability comparable to larger models. Using two locally-run LLMs (phi4:reasoning14B and qwen2.5-coder:32B) and a 10-trial budget, our TCS-enabled HPT pipeline achieves average performance within ~0.9 percentage points of GPT-4 across six diverse tasks.
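
Since the exact fields of the TCS block are not spelled out in this summary, the following sketch of a deterministic trajectory summarizer is purely hypothetical, meant only to show the kind of structured context a small LLM would receive instead of raw logs:

```python
def summarize_trajectory(trials):
    """Toy TCS-style block: compress raw (params, score) trials into a
    compact, structured context for the LLM. All fields are invented."""
    best = max(trials, key=lambda t: t["score"])
    scores = [t["score"] for t in trials]
    return {
        "n_trials": len(trials),
        "best_params": best["params"],
        "best_score": best["score"],
        "recent_scores": scores[-3:],           # short trend window
        "improving": scores[-1] >= scores[0],   # coarse progress flag
    }

trials = [{"params": {"lr": 1e-3}, "score": 0.71},
          {"params": {"lr": 3e-4}, "score": 0.78},
          {"params": {"lr": 1e-4}, "score": 0.75}]
print(summarize_trajectory(trials))
```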

[511] FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models

Ionut-Vlad Modoranu, Mher Safaryan, Erik Schultheis, Max Ryabinin, Artem Chumachenko, Dan Alistarh

Main category: cs.LG

TL;DR: Proposes a computationally efficient low-rank optimization method using Discrete Cosine Transform (DCT) matrices to approximate SVD/QR-based gradient projections for training large language models, achieving faster runtime and reduced memory usage.

DetailsMotivation: Existing low-rank optimization methods using SVD/QR decomposition are computationally expensive and memory-intensive when applied individually to each layer of large language models, requiring storage of projection matrices.

Method: A two-step procedure using predefined DCT orthogonal matrices: (1) matrix multiplication with DCT matrix in O(n³) time, (2) lightweight sorting to select most relevant basis vectors. For large layers, uses FFT-based DCT computation in O(n² log(n)) time.

Result: Achieves rank-independent running time matching SVD/QR performance while reducing memory usage by up to 25% across different model sizes in pre-training and fine-tuning tasks.

Conclusion: The DCT-based approach provides an efficient alternative to costly SVD/QR methods for low-rank optimization in LLM training, offering significant computational and memory benefits while maintaining performance.

Abstract: Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to improve running time and reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD) or QR-decomposition. Applying these techniques individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple, two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces by using a predefined orthogonal matrix of the Discrete Cosine Transform (DCT). We dynamically select columns from the DCT matrix based on their alignment with the gradient of each layer. The effective projection matrices are obtained via a simple matmul with the DCT matrix in $O(n^3)$ time, followed by a lightweight sorting step to identify the most relevant basis vectors. For large layers, DCT can be computed via Makhoul’s $N$-point algorithm based on Fast Fourier Transform (FFT) in $O(n^2 \log(n))$ time. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, obtaining an approach with rank-independent running time that matches the performance of costly SVD/QR-based methods while achieving faster runtime and reduced memory usage by up to $25\%$ across different model sizes.
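
The core two-step procedure (fixed orthogonal basis, then a cheap selection) can be sketched in a few lines of NumPy/SciPy. This is a reduction to essentials, not the paper's code; the per-layer dynamics and optimizer integration are omitted:

```python
import numpy as np
from scipy.fft import dct, idct

def dct_low_rank(grad, rank):
    """Step 1: apply the fixed orthonormal DCT basis (FFT-backed, fast).
    Step 2: lightweight sort to keep the `rank` best-aligned basis vectors."""
    coeffs = dct(grad, axis=0, norm="ortho")      # C @ grad, with C fixed for all of training
    scores = np.abs(coeffs).sum(axis=1)           # alignment of each basis vector with grad
    idx = np.sort(np.argsort(scores)[-rank:])
    return idx, coeffs[idx]                       # (rank x m) compressed gradient

def dct_reconstruct(idx, low_rank, n):
    full = np.zeros((n, low_rank.shape[1]))
    full[idx] = low_rank
    return idct(full, axis=0, norm="ortho")       # back-projection C^T @ full

g = np.random.default_rng(0).normal(size=(64, 16))
idx, compressed = dct_low_rank(g, rank=8)
g_hat = dct_reconstruct(idx, compressed, n=64)    # low-rank approximation of g
```

Because the basis is predefined, nothing beyond the selected indices needs to be stored per layer, which is where the memory savings over SVD/QR projections come from.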

[512] Benchmarking for Practice: Few-Shot Time-Series Crop-Type Classification on the EuroCropsML Dataset

Joana Reuss, Jan Macdonald, Simon Becker, Ekaterina Gikalo, Konrad Schultka, Lorenz Richter, Marco Körner

Main category: cs.LG

TL;DR: This paper presents the first comprehensive benchmark for evaluating supervised and self-supervised learning methods for crop-type classification using satellite time-series data under real-world conditions, finding that meta-learning achieves slightly higher accuracy but with increased computational cost.

DetailsMotivation: Current machine learning algorithms for crop-type classification lack evaluation in real-world scenarios, and their efficacy in challenging practical applications has not been thoroughly assessed. The authors aim to facilitate future research by providing a comprehensive benchmark.

Method: The study uses the EuroCropsML time-series dataset combining farmer-reported crop data with Sentinel-2 satellite observations from Estonia, Latvia, and Portugal. It evaluates supervised transfer learning, MAML-based meta-learning, and self-supervised learning methods for crop-type classification.

Result: MAML-based meta-learning achieves slightly higher accuracy than supervised transfer learning and SSL methods, but with increased computational demands. Supervised methods perform best when pre-trained and fine-tuned on geographically close regions. SSL lags behind meta-learning but outperforms training from scratch and standard transfer learning, particularly in capturing fine-grained features.

Conclusion: There are trade-offs between accuracy and computational demand in selecting methods for real-world crop-type classification. Knowledge transfer across diverse geographic regions is challenging, and SSL approaches provide practical value when labeled pre-training data is scarce.

Abstract: Accurate crop-type classification from satellite time series is essential for agricultural monitoring. While various machine learning algorithms have been developed to enhance performance on data-scarce tasks, their evaluation often omits real-world scenarios. Consequently, their efficacy in challenging practical applications has not yet been thoroughly assessed. To facilitate future research in this domain, we present the first comprehensive benchmark for evaluating supervised and SSL methods for crop-type classification under real-world conditions. This benchmark study relies on the EuroCropsML time-series dataset, which combines farmer-reported crop data with Sentinel-2 satellite observations from Estonia, Latvia, and Portugal. Our findings indicate that MAML-based meta-learning algorithms achieve slightly higher accuracy compared to supervised transfer learning and SSL methods. However, compared to simpler transfer learning, the improvement of meta-learning comes at the cost of increased computational demands and training time. Moreover, supervised methods benefit most when pre-trained and fine-tuned on geographically close regions. In addition, while SSL generally lags behind meta-learning, it demonstrates advantages over training from scratch, particularly in capturing fine-grained features essential for real-world crop-type classification, and also surpasses standard transfer learning. This highlights its practical value when labeled pre-training crop data is scarce. Our insights underscore the trade-offs between accuracy and computational demand in selecting supervised machine learning methods for real-world crop-type classification tasks and highlight the difficulties of knowledge transfer across diverse geographic regions. Furthermore, they demonstrate the practical value of SSL approaches when labeled pre-training crop data is scarce.

[513] MESS+: Dynamically Learned Inference-Time LLM Routing in Model Zoos with Service Level Guarantees

Herbert Woisetschläger, Ryan Zhang, Shiqiang Wang, Hans-Arno Jacobsen

Main category: cs.LG

TL;DR: MESS+ is a stochastic optimization algorithm for cost-optimal LLM request routing that guarantees SLA compliance while learning model satisfaction probabilities in real-time.

DetailsMotivation: Users want factually correct, safe responses without technical expertise, while service providers want to minimize costs. Current model selection is challenging and requires technical knowledge.

Method: Combines virtual queues and request satisfaction prediction to solve per-request optimization problems, learning model satisfaction probabilities through real-time user interactions.

Result: Achieves 2× cost savings compared to existing LLM routing techniques across various state-of-the-art LLM benchmarks.

Conclusion: MESS+ provides a practical solution for automated LLM selection that balances user satisfaction with cost efficiency while ensuring SLA compliance.

Abstract: Open-weight large language model (LLM) zoos provide access to numerous high-quality models, but selecting the appropriate model for specific tasks remains challenging and requires technical expertise. Most users simply want factually correct, safe, and satisfying responses without concerning themselves with model technicalities, while inference service providers prioritize minimizing operating costs. These competing interests are typically mediated through service level agreements (SLAs) that guarantee minimum service quality. We introduce MESS+, a stochastic optimization algorithm for cost-optimal LLM request routing while providing rigorous SLA compliance guarantees. MESS+ learns request satisfaction probabilities of LLMs in real-time as users interact with the system, based on which model selection decisions are made by solving a per-request optimization problem. Our algorithm includes a novel combination of virtual queues and request satisfaction prediction, along with a theoretical analysis of cost optimality and constraint satisfaction. Across a wide range of state-of-the-art LLM benchmarks, MESS+ achieves an average of $2\times$ cost savings compared to existing LLM routing techniques.
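
The virtual-queue mechanism is in the spirit of drift-plus-penalty control. The sketch below is our reading, with illustrative constants and fields (MESS+ additionally learns the satisfaction probabilities `p_hat` online and comes with formal guarantees):

```python
def route_request(models, queue, V=10.0):
    """Pick the model minimizing cost + queue-weighted expected SLA miss.
    A long queue records past quality shortfalls, so quality temporarily
    outweighs cost; V trades off cost against constraint pressure."""
    return min(models, key=lambda m: V * m["cost"] + queue * (1.0 - m["p_hat"]))

def update_queue(queue, satisfied, sla_target=0.9):
    """Virtual queue: grows by the SLA target, drains on satisfied requests."""
    return max(queue + sla_target - (1.0 if satisfied else 0.0), 0.0)

zoo = [{"name": "7B", "cost": 1.0, "p_hat": 0.80},
       {"name": "70B", "cost": 8.0, "p_hat": 0.97}]
q = 0.0
choice = route_request(zoo, q)         # the cheap model wins while the queue is empty
q = update_queue(q, satisfied=False)   # a miss raises pressure toward quality
```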

[514] Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates

Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang

Main category: cs.LG

TL;DR: A framework for achieving faithful and complete circuit discovery by identifying AND, OR, and ADDER logic gates through combined noising and denoising interventions.

DetailsMotivation: Standard circuit discovery methods fail to guarantee completeness due to partial detection of OR gates, leading to inconsistent results and omission of key mechanisms.

Method: Decompose circuits into logical gates (AND, OR, ADDER) and propose a framework combining noising-based and denoising-based interventions to fully identify gates without significant computational overhead.

Result: The framework successfully restores faithfulness, completeness, and sparsity of circuits, and reveals properties of logic gates including their proportions and contributions in language models.

Conclusion: The proposed approach enables comprehensive circuit discovery by systematically identifying logical gates, providing insights into language model functionality through gate behavior analysis.

Abstract: Circuit discovery has gradually become one of the prominent methods for mechanistic interpretability, and research on circuit completeness has also garnered increasing attention. Methods of circuit discovery that do not guarantee completeness not only result in circuits that are not fixed across different runs but also cause key mechanisms to be omitted. The nature of incompleteness arises from the presence of OR gates within the circuit, which are often only partially detected in standard circuit discovery methods. To this end, we systematically introduce three types of logic gates: AND, OR, and ADDER gates, and decompose the circuit into combinations of these logical gates. Through the concept of these gates, we derive the minimum requirements necessary to achieve faithfulness and completeness. Furthermore, we propose a framework that combines noising-based and denoising-based interventions, which can be easily integrated into existing circuit discovery methods without significantly increasing computational complexity. This framework is capable of fully identifying the logic gates and distinguishing them within the circuit. In addition to the extensive experimental validation of the framework’s ability to restore the faithfulness, completeness, and sparsity of circuits, using this framework, we uncover fundamental properties of the three logic gates, such as their proportions and contributions to the output, and explore how they behave across the functionalities of language models.
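
Why noising alone misses OR gates is visible in a two-line toy example (ours, not the paper's): if the output fires when either parent does, ablating one parent at a time changes nothing.

```python
def or_gate(a, b):
    """Toy OR-style circuit component: the output fires if either input does."""
    return max(a, b)

# Noising-based discovery corrupts one parent at a time on a clean run:
print(or_gate(0.0, 1.0))   # 1.0 -> ablating `a` leaves the output intact
print(or_gate(1.0, 0.0))   # 1.0 -> ablating `b` leaves the output intact
# Each parent looks causally irrelevant, so the OR gate goes undetected.
# Denoising starts from the corrupted input (0, 0) and patches clean values in:
print(or_gate(1.0, 0.0))   # restoring either parent alone restores the output
```

Combining both intervention directions, as the framework above does, is what distinguishes the gate types.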

[515] Identification and Optimal Nonlinear Control of Turbojet Engine Using Koopman Eigenfunction Model

David Grasev

Main category: cs.LG

TL;DR: This paper proposes a data-driven Koopman operator approach for modeling and controlling gas turbine engines, overcoming limitations of traditional physics-based models by using operational data and achieving superior control performance.

DetailsMotivation: Gas turbine engines are complex nonlinear systems where physics-based modeling is challenging due to unavailable performance characteristics and simplifying assumptions. Conventional experimental methods for component-level and linear parameter-varying models have limitations.

Method: Uses sparse identification of nonlinear dynamics for rotor estimation, maps autonomous dynamics into optimized Koopman eigenfunction space via eigenvalue optimization and gradient-based identification, then designs nonlinear feedback controller and Kalman estimator in eigenfunction space.

Result: The Koopman-based controller outperforms traditional and gain-scheduled proportional-integral controllers, as well as internal model control, in both reference tracking and disturbance rejection under various flight conditions.

Conclusion: The eigenmode structure enables targeted optimization of individual modes, and the global nature of the Koopman-based approach leads to improved performance tuning and superior control capabilities compared to benchmark methods.

Abstract: Gas turbine engines are complex and highly nonlinear dynamical systems. Deriving their physics-based models can be challenging because it requires performance characteristics that are not always available, often leading to many simplifying assumptions. This paper discusses the limitations of conventional experimental methods used to derive component-level and locally linear parameter-varying models, and addresses these issues by employing identification techniques based on data collected from standard engine operation under closed-loop control. The rotor dynamics are estimated using the sparse identification of nonlinear dynamics. Subsequently, the autonomous part of the dynamics is mapped into an optimally constructed Koopman eigenfunction space. This process involves eigenvalue optimization using metaheuristic algorithms and temporal projection, followed by gradient-based eigenfunction identification. The resulting Koopman model is validated against an in-house reference component-level model. A globally optimal nonlinear feedback controller and a Kalman estimator are then designed within the eigenfunction space and compared to traditional and gain-scheduled proportional-integral controllers, as well as a proposed internal model control approach. The eigenmode structure enables targeting individual modes during optimization, leading to improved performance tuning. Results demonstrate that the Koopman-based controller surpasses other benchmark controllers in both reference tracking and disturbance rejection under sea-level and varying flight conditions, due to its global nature.
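
For readers unfamiliar with the Koopman pipeline, here is a minimal EDMD-style construction on lifted snapshot pairs; the paper's eigenvalue optimization, metaheuristic search, and gradient-based eigenfunction identification go well beyond this textbook baseline, and the toy dynamics below are ours:

```python
import numpy as np

def edmd(X, Y, lift):
    """Fit a linear Koopman matrix K on lifted snapshots (PhiX @ K ~= PhiY),
    then read approximate eigenvalues/eigenfunctions off K."""
    PhiX = np.stack([lift(x) for x in X])
    PhiY = np.stack([lift(y) for y in Y])
    K = np.linalg.lstsq(PhiX, PhiY, rcond=None)[0]
    eigvals, eigvecs = np.linalg.eig(K)
    # Eigenfunction i: phi_i(x) = lift(x) @ eigvecs[:, i], evolving as eigvals[i]**t.
    return K, eigvals, eigvecs

# Toy scalar "rotor" dynamics x_{t+1} = 0.9 x_t + 0.05 x_t^2, polynomial lifting.
xs = [np.array([0.5])]
for _ in range(200):
    x = xs[-1]
    xs.append(0.9 * x + 0.05 * x ** 2)
lift = lambda x: np.array([1.0, x[0], x[0] ** 2, x[0] ** 3])
K, lam, V = edmd(xs[:-1], xs[1:], lift)
```

Control design then happens in this (approximately) linear eigenfunction space, which is what makes a globally valid feedback law tractable.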

[516] Buffer-free Class-Incremental Learning with Out-of-Distribution Detection

Srishti Gupta, Daniele Angioni, Maura Pintor, Ambra Demontis, Lea Schönherr, Battista Biggio, Fabio Roli

Main category: cs.LG

TL;DR: This paper analyzes post-hoc out-of-distribution (OOD) detection methods as a buffer-free alternative for class-incremental learning in open-world scenarios, achieving comparable or superior performance to buffer-based approaches.

DetailsMotivation: Current class-incremental learning methods rely on memory buffers for OOD detection, which raises privacy, scalability, and training time concerns. The paper aims to eliminate the need for memory buffers while maintaining performance.

Method: The authors conduct an in-depth analysis of post-hoc OOD detection methods applied at inference time, comparing them against buffer-based OOD detection approaches in class-incremental learning settings.

Result: Post-hoc OOD detection methods achieve comparable or superior performance to buffer-based methods in both class-incremental learning and unknown sample rejection, as demonstrated on CIFAR-10, CIFAR-100, and Tiny ImageNet datasets.

Conclusion: Buffer-free OOD detection using post-hoc methods offers an efficient and privacy-preserving alternative for class-incremental learning systems in open-world settings, providing new design insights.

Abstract: Class-incremental learning (CIL) poses significant challenges in open-world scenarios, where models must not only learn new classes over time without forgetting previous ones but also handle inputs from unknown classes that a closed-set model would misclassify. Recent works address both issues by (i) training multi-head models using the task-incremental learning framework, and (ii) predicting the task identity employing out-of-distribution (OOD) detectors. While effective, the latter mainly relies on joint training with a memory buffer of past data, raising concerns around privacy, scalability, and increased training time. In this paper, we present an in-depth analysis of post-hoc OOD detection methods and investigate their potential to eliminate the need for a memory buffer. We uncover that these methods, when applied appropriately at inference time, can serve as a strong substitute for buffer-based OOD detection. We show that this buffer-free approach achieves comparable or superior performance to buffer-based methods both in terms of class-incremental learning and the rejection of unknown samples. Experimental results on CIFAR-10, CIFAR-100 and Tiny ImageNet datasets support our findings, offering new insights into the design of efficient and privacy-preserving CIL systems for open-world settings.
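
As one concrete member of the post-hoc family the paper analyzes, the standard energy score (Liu et al., 2020) needs only the logits already computed at inference, which is why no memory buffer is involved. The sketch below is the textbook formula, not necessarily the exact variant the authors favor:

```python
import torch

def energy_score(logits, T=1.0):
    """Energy-based OOD score: lower energy means more in-distribution."""
    return -T * torch.logsumexp(logits / T, dim=1)

def predict_or_reject(logits, threshold):
    """Accept the class prediction for low-energy inputs, reject the rest
    as unknown (-1), without any stored samples from past tasks."""
    preds = logits.argmax(dim=1)
    unknown = torch.full_like(preds, -1)
    return torch.where(energy_score(logits) < threshold, preds, unknown)
```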

[517] Fractal Graph Contrastive Learning

Nero Z. Li, Xuehao Zhai, Zhichao Shi, Boshen Shi, Xuhui Jiang

Main category: cs.LG

TL;DR: FractalGCL is a graph contrastive learning framework that uses renormalization-based augmentation and fractal-dimension-aware contrastive loss to improve global structural consistency in graph representations, achieving state-of-the-art performance with reduced computational overhead.

DetailsMotivation: Existing graph contrastive learning methods rely on random perturbations or local structure preservation but lack explicit control over global structural consistency between augmented views, limiting their performance.

Method: Proposes FractalGCL with two innovations: 1) renormalisation-based augmentation using box coverings to generate structurally aligned positive views, and 2) fractal-dimension-aware contrastive loss that aligns embeddings based on fractal dimensions. Also develops a one-shot Gaussian estimator to reduce computational cost.

Result: FractalGCL achieves state-of-the-art results on standard benchmarks and outperforms traditional and latest baselines on traffic networks by an average margin of about 4%. The computational optimization reduces training time by approximately 61%.

Conclusion: The framework effectively addresses global structural consistency in graph contrastive learning while maintaining computational efficiency, demonstrating superior performance across various graph learning tasks.

Abstract: While Graph Contrastive Learning (GCL) has attracted considerable attention in the field of graph self-supervised learning, its performance heavily relies on data augmentations that are expected to generate semantically consistent positive pairs. Existing strategies typically resort to random perturbations or local structure preservation, yet lack explicit control over global structural consistency between augmented views. To address this limitation, we propose Fractal Graph Contrastive Learning (FractalGCL), a theory-driven framework introducing two key innovations: a renormalisation-based augmentation that generates structurally aligned positive views via box coverings; and a fractal-dimension-aware contrastive loss that aligns graph embeddings according to their fractal dimensions, equipping the method with a fallback mechanism guaranteeing a performance lower bound even on non-fractal graphs. While combining the two innovations markedly boosts graph-representation quality, it also adds non-trivial computational overhead. To mitigate the computational overhead of fractal dimension estimation, we derive a one-shot estimator by proving that the dimension discrepancy between original and renormalised graphs converges weakly to a centred Gaussian distribution. This theoretical insight enables a reduction in dimension computation cost by an order of magnitude, cutting overall training time by approximately 61%. The experiments show that FractalGCL not only delivers state-of-the-art results on standard benchmarks but also outperforms traditional and recent baselines on traffic networks by a remarkable average margin of about 4%. Codes are available at (https://anonymous.4open.science/r/FractalGCL-0511/).
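
The fractal dimension at the heart of the loss is classically estimated by box counting: the dimension is the negative slope of log N_B against log l_B. The paper's one-shot Gaussian estimator is precisely a way to avoid repeating this procedure on every renormalised graph; the sketch below shows only the classical regression:

```python
import numpy as np

def box_counting_dimension(box_sizes, box_counts):
    """Fit log N_B = -d * log l_B + c and return the dimension d."""
    slope, _ = np.polyfit(np.log(box_sizes), np.log(box_counts), deg=1)
    return -slope

# A structure covered by N_B = 1000 * l_B^(-2) boxes has dimension 2.
sizes = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
counts = 1000.0 / sizes ** 2
print(box_counting_dimension(sizes, counts))   # ~2.0
```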

[518] UNO: Unlearning via Orthogonalization in Generative models

Pinak Mandal, Georg A. Gottwald

Main category: cs.LG

TL;DR: The paper proposes fast unlearning algorithms based on loss gradient orthogonalization for generative models, enabling selective data removal without full retraining while maintaining model quality.

DetailsMotivation: As generative models become more powerful and widespread, there's a growing need to unlearn specific data due to privacy concerns, legal requirements, or to remove harmful content, without the high cost of retraining from scratch.

Method: The authors develop unlearning algorithms using loss gradient orthogonalization for both unconditional and conditional generative models. The approach aims to selectively remove influence of specific data points while preserving model performance.

Result: The algorithms achieve orders of magnitude faster unlearning times than previous methods like gradient surgery, successfully forgetting data while maintaining original model fidelity across datasets of increasing complexity (MNIST, CelebA, ImageNet-1K) and model types (VAEs, diffusion transformers).

Conclusion: The proposed gradient orthogonalization-based unlearning algorithms provide an efficient and reliable solution for selective data removal in generative models, meeting key requirements of forgetting undesired data, preserving generation quality, maintaining desired data influence, and requiring minimal training steps.

Abstract: As generative models become increasingly powerful and pervasive, the ability to unlearn specific data, whether due to privacy concerns, legal requirements, or the correction of harmful content, has become increasingly important. Unlike in conventional training, where data are accumulated and knowledge is reinforced, unlearning aims to selectively remove the influence of particular data points without costly retraining from scratch. To be effective and reliable, such algorithms need to achieve (i) forgetting of the undesired data, (ii) preservation of the quality of the generation, (iii) preservation of the influence of the desired training data on the model parameters, and (iv) small number of training steps. We propose fast unlearning algorithms based on loss gradient orthogonalization for unconditional and conditional generative models. We show that our algorithms are able to forget data while maintaining the fidelity of the original model. On standard image benchmarks, our algorithms achieve orders of magnitude faster unlearning times than their predecessors, such as gradient surgery. We demonstrate our algorithms with datasets of increasing complexity (MNIST, CelebA and ImageNet-1K) and for generative models of increasing complexity (VAEs and diffusion transformers).
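
A single orthogonalized update step conveys the flavor of the approach. This is our sketch of the idea named in the abstract (the authors' exact update rule, loss choices, and conditioning may differ):

```python
import torch

def orthogonalized_forget_grad(g_forget, g_retain, eps=1e-12):
    """Remove from the forget-gradient its component along the retain-gradient,
    so stepping along it degrades the forgotten data while (to first order)
    leaving the retained data's loss unchanged."""
    proj = (g_forget @ g_retain) / (g_retain @ g_retain).clamp_min(eps)
    return g_forget - proj * g_retain    # orthogonal to g_retain by construction

g_f = torch.tensor([1.0, 1.0])   # gradient of the forget objective
g_r = torch.tensor([1.0, 0.0])   # gradient of the retain/fidelity objective
print(orthogonalized_forget_grad(g_f, g_r))   # tensor([0., 1.])
```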

[519] Provably Sample-Efficient Robust Reinforcement Learning with Average Reward

Zachary Roch, Chi Zhang, George Atia, Yue Wang

Main category: cs.LG

TL;DR: The paper proposes Robust Halpern Iteration (RHI), a new algorithm for robust reinforcement learning under average-reward criterion that achieves state-of-the-art sample complexity without requiring prior knowledge or restrictive structural assumptions.

DetailsMotivation: There is a significant gap in understanding the finite-sample complexity of robust RL methods under average-reward criterion, as most existing work only provides asymptotic guarantees, which hinders practical deployment in data-limited scenarios.

Method: The authors develop Robust Halpern Iteration (RHI) for robust Markov Decision Processes with transition uncertainty characterized by ℓ_p-norm and contamination models. The algorithm operates without requiring prior knowledge and only assumes the underlying robust MDP is communicating.

Result: RHI achieves a sample complexity of Õ(SAH²/ε²) to learn an ε-optimal robust policy, where S and A are numbers of states and actions, and H is the robust optimal bias span. This represents the tightest known bound.

Conclusion: The work provides essential theoretical understanding of sample efficiency for robust average reward RL, closing the gap in finite-sample complexity analysis and enabling more principled deployment in practical applications.

Abstract: Robust reinforcement learning (RL) under the average-reward criterion is essential for long-term decision-making, particularly when the environment may differ from its specification. However, a significant gap exists in understanding the finite-sample complexity of these methods, as most existing work provides only asymptotic guarantees. This limitation hinders their principled understanding and practical deployment, especially in data-limited scenarios. We close this gap by proposing \textbf{Robust Halpern Iteration (RHI)}, a new algorithm designed for robust Markov Decision Processes (MDPs) with transition uncertainty characterized by $\ell_p$-norm and contamination models. Our approach offers three key advantages over previous methods: (1). Weaker Structural Assumptions: RHI only requires the underlying robust MDP to be communicating, a less restrictive condition than the commonly assumed ergodicity or irreducibility; (2). No Prior Knowledge: Our algorithm operates without requiring any prior knowledge of the robust MDP; (3). State-of-the-Art Sample Complexity: To learn an $\epsilon$-optimal robust policy, RHI achieves a sample complexity of $\tilde{\mathcal O}\left(\frac{SA\mathcal H^{2}}{\epsilon^{2}}\right)$, where $S$ and $A$ denote the numbers of states and actions, and $\mathcal H$ is the robust optimal bias span. This result represents the tightest known bound. Our work hence provides essential theoretical understanding of sample efficiency of robust average reward RL.
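
RHI adapts Halpern iteration, a classical anchored fixed-point scheme, to the robust Bellman setting. The generic form is a few lines; everything robust-MDP-specific (the operator, the uncertainty set, the sample-based estimates) is abstracted into T here:

```python
import numpy as np

def halpern_iterate(T, x0, steps=2000):
    """x_{k+1} = beta_k * x0 + (1 - beta_k) * T(x_k) with beta_k -> 0.
    For nonexpansive T this converges to a fixed point of T."""
    x = x0
    for k in range(steps):
        beta = 1.0 / (k + 2)              # standard anchoring schedule
        x = beta * x0 + (1.0 - beta) * T(x)
    return x

# Toy nonexpansive map: rotate-and-shrink with fixed point ~[2, 1].
A = 0.8 * np.array([[np.cos(0.3), -np.sin(0.3)], [np.sin(0.3), np.cos(0.3)]])
b = np.array([2.0, 1.0]) - A @ np.array([2.0, 1.0])
print(halpern_iterate(lambda x: A @ x + b, np.zeros(2)))   # ~[2. 1.]
```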

[520] AMPED: Adaptive Multi-objective Projection for balancing Exploration and skill Diversification

Geonwoo Cho, Jaemoon Lee, Jaegyun Im, Subi Lee, Jihwan Lee, Sundong Kim

Main category: cs.LG

TL;DR: AMPED is a skill-based RL method that balances exploration and skill diversity through gradient-surgery projection during pre-training and uses a skill selector for downstream task adaptation.

DetailsMotivation: Existing SBRL methods struggle to simultaneously optimize exploration and skill diversity, which are conflicting objectives. AMPED aims to explicitly address both to enable more robust skill learning.

Method: Uses adaptive multi-objective projection to balance exploration and diversity gradients during pre-training, and a skill selector that exploits learned diversity for downstream task fine-tuning.

Result: Achieves superior performance compared to SBRL baselines across various benchmarks. Ablation studies confirm each component contributes to performance.

Conclusion: Explicitly harmonizing exploration and diversity is crucial for effective skill learning, and AMPED demonstrates that greater skill diversity reduces fine-tuning sample complexity with a greedy skill selector.

Abstract: Skill-based reinforcement learning (SBRL) enables rapid adaptation in environments with sparse rewards by pretraining a skill-conditioned policy. Effective skill learning requires jointly maximizing both exploration and skill diversity. However, existing methods often face challenges in simultaneously optimizing for these two conflicting objectives. In this work, we propose a new method, Adaptive Multi-objective Projection for balancing Exploration and skill Diversification (AMPED), which explicitly addresses both: during pre-training, a gradient-surgery projection balances the exploration and diversity gradients, and during fine-tuning, a skill selector exploits the learned diversity by choosing skills suited to downstream tasks. Our approach achieves performance that surpasses SBRL baselines across various benchmarks. Through an extensive ablation study, we identify the role of each component and demonstrate that each element in AMPED is contributing to performance. We further provide theoretical and empirical evidence that, with a greedy skill selector, greater skill diversity reduces fine-tuning sample complexity. These results highlight the importance of explicitly harmonizing exploration and diversity and demonstrate the effectiveness of AMPED in enabling robust and generalizable skill learning. Project Page: https://geonwoo.me/amped/
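
The gradient-surgery projection in the pre-training stage is in the family of PCGrad-style updates. A minimal instance is below; AMPED's adaptive multi-objective weighting is richer than this fixed rule, so treat it as orientation only:

```python
import torch

def surgery_combine(g_explore, g_diverse, eps=1e-12):
    """If the exploration and diversity gradients conflict (negative inner
    product), drop the exploration component that opposes diversity, then
    sum. Non-conflicting gradients pass through unchanged."""
    dot = g_explore @ g_diverse
    if dot < 0:
        g_explore = g_explore - dot / (g_diverse @ g_diverse).clamp_min(eps) * g_diverse
    return g_explore + g_diverse

g_e = torch.tensor([1.0, -2.0])
g_d = torch.tensor([1.0, 1.0])
print(surgery_combine(g_e, g_d))   # conflict removed before combining
```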

[521] Optimal Formats for Weight Quantisation

Douglas Orr, Luka Ribar, Carlo Luschi

Main category: cs.LG

TL;DR: A framework for systematic design of weight quantisation formats using classical quantisation theory, showing that variable-length codes are optimal for minimising model size while maintaining performance.

DetailsMotivation: Current quantisation formats are chosen empirically from a large recipe book without systematic analysis of their theoretical foundations.

Method: Frame quantisation as minimising KL divergence between original and quantised outputs under size constraints, approximated by squared quantisation error. Develop non-linear quantisation curves for block-scaled data and derive optimal bit-width allocation using Fisher information.

Result: Variable-length formats consistently outperform fixed-length formats, and optimal bit-width allocation saves up to 0.25 bits per parameter in large language models.

Conclusion: Theoretical analysis reveals that variable-length encoding is key to efficient quantisation, providing a systematic framework for format design rather than empirical selection.

Abstract: Weight quantisation is an essential technique for enabling efficient training and deployment of modern deep learning models. However, the recipe book of quantisation formats is large and formats are often chosen empirically. In this paper, we propose a framework for systematic design and analysis of quantisation formats. By connecting the question of format design with the classical quantisation theory, we show that the strong practical performance of popular formats comes from their ability to represent values using variable-length codes. We frame the problem as minimising the KL divergence between original and quantised model outputs under a model size constraint, which can be approximated by minimising the squared quantisation error, a well-studied problem where entropy-constrained quantisers with variable-length codes are optimal. We develop non-linear quantisation curves for block-scaled data across multiple distribution families and observe that these formats, along with sparse outlier formats, consistently outperform fixed-length formats, indicating that they also exploit variable-length encoding. Finally, by using the relationship between the Fisher information and KL divergence, we derive the optimal allocation of bit-widths to individual parameter tensors across the model’s layers, saving up to 0.25 bits per parameter when applied to large language models.
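
To ground the objective, here is the squared quantisation error of a plain fixed-length, block-scaled format, the kind of baseline the paper's entropy-constrained, variable-length formats improve on. The block size and bit-width are illustrative:

```python
import numpy as np

def block_quantise(w, bits=4, block=32):
    """Uniform fixed-length quantiser with one shared scale per block.
    Returns the dequantised weights and the squared error that the paper's
    framework treats as a proxy for output KL divergence."""
    wb = w.reshape(-1, block)
    scale = np.abs(wb).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    q = np.round(wb / scale).clip(-2 ** (bits - 1), 2 ** (bits - 1) - 1)
    w_hat = q * scale
    return w_hat.reshape(-1), float(np.mean((wb - w_hat) ** 2))

w = np.random.default_rng(0).normal(size=1024)
_, mse = block_quantise(w, bits=4)
print(mse)
```

An entropy-constrained quantiser would additionally place grid points non-uniformly and spend fewer bits on common values, which is the variable-length effect the paper identifies as decisive.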

[522] Robust Molecular Property Prediction via Densifying Scarce Labeled Data

Jina Kim, Jeffrey Willette, Bruno Andreis, Sung Ju Hwang

Main category: cs.LG

TL;DR: A novel bilevel optimization approach that uses unlabeled data to interpolate between in-distribution and out-of-distribution data, improving generalization in molecular prediction models.

DetailsMotivation: Molecular prediction models suffer from poor generalization to out-of-distribution compounds due to reliance on training data structures, which is problematic in drug discovery where critical compounds often lie beyond the training set. Covariate shift and scarcity of labeled experimental data exacerbate this issue.

Method: Proposes a bilevel optimization approach that leverages unlabeled data to create interpolations between in-distribution (ID) and out-of-distribution (OOD) data, enabling the model to learn how to generalize beyond the training distribution.

Result: Demonstrates significant performance gains on challenging real-world datasets with substantial covariate shift, supported by t-SNE visualizations that highlight the effectiveness of the interpolation method.

Conclusion: The proposed bilevel optimization approach effectively addresses the generalization limitations of molecular prediction models by leveraging unlabeled data to bridge the gap between ID and OOD compounds, achieving improved performance in drug discovery applications.

Abstract: A widely recognized limitation of molecular prediction models is their reliance on structures observed in the training data, resulting in poor generalization to out-of-distribution compounds. Yet in drug discovery, the compounds most critical for advancing research often lie beyond the training set, making the bias toward the training data particularly problematic. This mismatch introduces substantial covariate shift, under which standard deep learning models produce unstable and inaccurate predictions. Furthermore, the scarcity of labeled data, stemming from the onerous and costly nature of experimental validation, further exacerbates the difficulty of achieving reliable generalization. To address these limitations, we propose a novel bilevel optimization approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data, enabling the model to learn how to generalize beyond the training distribution. We demonstrate significant performance gains on challenging real-world datasets with substantial covariate shift, supported by t-SNE visualizations highlighting our interpolation method.

[523] Time series saliency maps: explaining models across multiple domains

Christodoulos Kechris, Jonathan Dan, David Atienza

Main category: cs.LG

TL;DR: Cross-domain Integrated Gradients extends traditional saliency methods to enable feature attributions in multiple domains beyond time-domain, particularly frequency domain, for time-series analysis.

DetailsMotivation: Traditional saliency map methods in time-series offer limited insights because semantically meaningful features often exist in other domains like frequency, not just in the time-domain pixel-level analysis.

Method: Introduces Cross-domain Integrated Gradients, a generalization of Integrated Gradients that enables feature attributions on any domain formulated as an invertible, differentiable transformation of the time domain, with an extension to the complex domain for frequency-based attributions.

Result: The method reveals interpretable, problem-specific attributions that time-domain methods cannot capture, demonstrated on three real-world tasks: wearable sensor heart rate extraction, EEG-based seizure detection, and zero-shot time-series forecasting.

Conclusion: Cross-domain integrated gradients provide semantically meaningful insights in time-series models that are impossible with traditional time-domain saliency methods, with open-source library released for practical implementation.

Abstract: Traditional saliency map methods, popularized in computer vision, highlight individual points (pixels) of the input that contribute the most to the model’s output. However, in time-series they offer limited insights as semantically meaningful features are often found in other domains. We introduce Cross-domain Integrated Gradients, a generalization of Integrated Gradients. Our method enables feature attributions on any domain that can be formulated as an invertible, differentiable transformation of the time domain. Crucially, our derivation extends the original Integrated Gradients into the complex domain, enabling frequency-based attributions. We provide the necessary theoretical guarantees, namely, path independence and completeness. Our approach reveals interpretable, problem-specific attributions that time-domain methods cannot capture, on three real-world tasks: wearable sensor heart rate extraction, electroencephalography-based seizure detection, and zero-shot time-series forecasting. We release an open-source Tensorflow/PyTorch library to enable plug-and-play cross-domain explainability for time-series models. These results demonstrate the ability of cross-domain integrated gradients to provide semantically meaningful insights in time-series models that are impossible with traditional time-domain saliency.
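
For the frequency-domain case, the construction amounts to running Integrated Gradients on the (invertible, differentiable) FFT coefficients and mapping each path point back through the inverse transform. The sketch below is our simplification, treating the spectrum as a real [real, imag] tensor to sidestep complex-autograd conventions; it assumes `model` maps a 1-D signal to a differentiable output:

```python
import torch

def frequency_integrated_gradients(model, x, steps=64):
    """IG with a zero baseline, computed per frequency bin of rfft(x)."""
    X = torch.fft.rfft(x)
    feats = torch.stack([X.real, X.imag])            # spectrum as real features
    total = torch.zeros_like(feats)
    for alpha in torch.linspace(1.0 / steps, 1.0, steps):
        fa = (alpha * feats).detach().requires_grad_(True)
        xa = torch.fft.irfft(torch.complex(fa[0], fa[1]), n=x.shape[-1])
        model(xa).sum().backward()
        total += fa.grad
    # Completeness: attributions sum to approximately f(x) - f(0).
    return (feats * total / steps).sum(dim=0)        # one score per frequency
```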

[524] Exploring the Secondary Risks of Large Language Models

Jiawei Chen, Zhengwei Fang, Xiao Yang, Chao Yu, Zhaoxia Yin, Hang Su

Main category: cs.LG

TL;DR: The paper introduces ‘secondary risks’ - a novel class of LLM failure modes that occur during benign interactions, proposes SecLens framework to detect them, and shows these risks are widespread across 16 models.

DetailsMotivation: Current LLM safety research focuses on adversarial attacks, but overlooks non-adversarial failures that emerge during normal usage, posing real-world deployment risks.

Method: Introduces two risk primitives (verbose response and speculative advice), develops SecLens - a black-box multi-objective search framework, and creates SecRiskBench benchmark with 650 prompts across 8 risk categories.

Result: Secondary risks are widespread across 16 popular models, transferable across models, and modality independent, demonstrating urgent need for better safety mechanisms.

Conclusion: Benign yet harmful LLM behaviors pose significant real-world risks that current safety mechanisms fail to address, requiring enhanced safety approaches beyond adversarial attack prevention.

Abstract: Ensuring the safety and alignment of Large Language Models is a significant challenge with their growing integration into critical applications and societal functions. While prior research has primarily focused on jailbreak attacks, less attention has been given to non-adversarial failures that subtly emerge during benign interactions. We introduce secondary risks, a novel class of failure modes marked by harmful or misleading behaviors during benign prompts. Unlike adversarial attacks, these risks stem from imperfect generalization and often evade standard safety mechanisms. To enable systematic evaluation, we introduce two risk primitives, verbose response and speculative advice, that capture the core failure patterns. Building on these definitions, we propose SecLens, a black-box, multi-objective search framework that efficiently elicits secondary risk behaviors by optimizing task relevance, risk activation, and linguistic plausibility. To support reproducible evaluation, we release SecRiskBench, a benchmark dataset of 650 prompts covering eight diverse real-world risk categories. Experimental results from extensive evaluations on 16 popular models demonstrate that secondary risks are widespread, transferable across models, and modality independent, emphasizing the urgent need for enhanced safety mechanisms to address benign yet harmful LLM behaviors in real-world deployments.

[525] Why and When Deep is Better than Shallow: An Implementation-Agnostic State-Transition View of Depth Supremacy

Sho Sonoda, Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda

Main category: cs.LG

TL;DR: The paper provides a theoretical framework showing why deep models outperform shallow ones by proving depth supremacy through bias-variance decomposition that separates abstract network depth from implementation details.

DetailsMotivation: To understand the fundamental reasons why deep learning models perform better than shallow ones, independent of specific implementations like ReLU nets or transformers.

Method: Formulates deep models as abstract state-transition semigroups acting on metric spaces, proves bias-variance decomposition theorems, and analyzes variance growth patterns with depth.

Result: Identifies four canonical bias-variance trade-off regimes (EL/EP/PL/PP) and shows that optimal depth k* > 1 typically holds, demonstrating depth supremacy. The EL regime (exponential bias decay + logarithmic variance growth) achieves lowest generalization error.

Conclusion: Deep models are superior because they achieve better bias-variance trade-offs, particularly for iterative or hierarchical concept classes like neural ODEs, diffusion models, and chain-of-thought reasoning.

Abstract: Why and when is deep better than shallow? We answer this question in a framework that is agnostic to network implementation. We formulate a deep model as an abstract state-transition semigroup acting on a general metric space, and separate the implementation (e.g., ReLU nets, transformers, and chain-of-thought) from the abstract state transition. We prove a bias-variance decomposition in which the variance depends only on the abstract depth-$k$ network and not on the implementation (Theorem 1). We further split the bounds into output and hidden parts to tie the depth dependence of the variance to the metric entropy of the state-transition semigroup (Theorem 2). We then investigate implementation-free conditions under which the variance grows polynomially or logarithmically with depth (Section 4). Combining these with exponential or polynomial bias decay identifies four canonical bias-variance trade-off regimes (EL/EP/PL/PP) and produces explicit optimal depths $k^\ast$. Across regimes, $k^\ast>1$ typically holds, giving a rigorous form of depth supremacy. The lowest generalization error bound is achieved under the EL regime (exp-decay bias + log-growth variance), explaining why and when deep is better, especially for iterative or hierarchical concept classes such as neural ODEs, diffusion/score models, and chain-of-thought reasoning.
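
To see how an optimal depth $k^\ast>1$ falls out of such a trade-off, here is a worked instance of the EL regime with illustrative constants of our own choosing (not taken from the paper):

```latex
% Bound to minimize over depth k >= 1 (illustrative constants A, C, alpha > 0):
B(k) = A e^{-\alpha k} + C \log k .
% Stationarity:
B'(k) = -\alpha A e^{-\alpha k} + \frac{C}{k} = 0
\quad\Longrightarrow\quad k\, e^{-\alpha k} = \frac{C}{\alpha A} .
% If \alpha A e^{-\alpha} > C then B'(1) < 0, so the minimizer exceeds 1:
k^\ast \;=\; -\frac{1}{\alpha}\, W_{-1}\!\left(-\frac{C}{A}\right) \;>\; 1 ,
% with W_{-1} the lower Lambert-W branch (real for C/A < 1/e). Strong bias
% decay relative to variance growth is exactly what makes depth win.
```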

[526] Supervised Graph Contrastive Learning for Gene Regulatory Networks

Sho Oshima, Yuji Okamoto, Taisei Tosaki, Ryosuke Kojima, Yasushi Okuno

Main category: cs.LG

TL;DR: SupGCL is a supervised graph contrastive learning method that incorporates real biological perturbations from gene knockdown experiments as supervision, outperforming state-of-the-art methods on 13 GRN tasks across three cancer types.

DetailsMotivation: Current GCL methods use artificial perturbations that diverge from biological reality, while biologically meaningful perturbations are actually valuable information sources that should be leveraged rather than avoided.

Method: SupGCL is a probabilistic formulation that generalizes conventional GCL by linking artificial augmentations with real perturbations from knockdown experiments and using them as explicit supervisory signals.

Result: SupGCL consistently outperforms state-of-the-art baselines across 13 tasks on GRN datasets from three cancer types, including both node-level (gene function classification) and graph-level (patient survival prediction) tasks.

Conclusion: Biologically meaningful perturbations are a rich source of information that can be effectively incorporated into graph contrastive learning, leading to superior performance on biological network analysis tasks.

Abstract: Graph Contrastive Learning (GCL) is a powerful self-supervised learning framework that performs data augmentation through graph perturbations, with growing applications in the analysis of biological networks such as Gene Regulatory Networks (GRNs). The artificial perturbations commonly used in GCL, such as node dropping, induce structural changes that can diverge from biological reality. This concern has contributed to a broader trend in graph representation learning toward augmentation-free methods, which view such structural changes as problematic and to be avoided. However, this trend overlooks the fundamental insight that structural changes from biologically meaningful perturbations are not a problem to be avoided but a rich source of information, thereby ignoring the valuable opportunity to leverage data from real biological experiments. Motivated by this insight, we propose SupGCL (Supervised Graph Contrastive Learning), a new GCL method for GRNs that directly incorporates biological perturbations from gene knockdown experiments as supervision. SupGCL is a probabilistic formulation that continuously generalizes conventional GCL, linking artificial augmentations with real perturbations measured in knockdown experiments and using the latter as explicit supervisory signals. To assess effectiveness, we train GRN representations with SupGCL and evaluate their performance on downstream tasks. The evaluation includes both node-level tasks, such as gene function classification, and graph-level tasks on patient-specific GRNs, such as patient survival hazard prediction. Across 13 tasks built from GRN datasets derived from patients with three cancer types, SupGCL consistently outperforms state-of-the-art baselines.
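
Read as a contrastive objective, the supervision could look like the following NT-Xent sketch, where the positive pair couples a GRN's embedding with the embedding of its experimentally knocked-down counterpart. This is our rendering of the idea; SupGCL's probabilistic formulation that blends artificial augmentations with real perturbations is more general:

```python
import torch
import torch.nn.functional as F

def knockdown_nt_xent(z_orig, z_knockdown, temperature=0.5):
    """z_orig / z_knockdown: (batch, dim) embeddings of each GRN and of its
    knockdown-perturbed version. Positives are real-perturbation pairs;
    the rest of the batch serves as negatives."""
    z = F.normalize(torch.cat([z_orig, z_knockdown]), dim=1)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float("-inf"))     # never contrast a view with itself
    n = z_orig.shape[0]
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```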

[527] Bridging Arbitrary and Tree Metrics via Differentiable Gromov Hyperbolicity

Pierre Houedry, Nicolas Courty, Florestan Martin-Baillon, Laetitia Chapel, Titouan Vayer

Main category: cs.LG

TL;DR: DeltaZero is a novel differentiable optimization framework that bridges arbitrary metric spaces to their closest tree metrics using a smooth surrogate for Gromov’s δ-hyperbolicity, achieving state-of-the-art distortion with tractable complexity.

DetailsMotivation: Existing approaches for converting arbitrary metrics to tree metrics are either heuristic with no guarantees or perform moderately well. There's a need for a method that provides both theoretical guarantees and practical performance.

Method: DeltaZero uses a differentiable optimization framework with a smooth surrogate for Gromov’s δ-hyperbolicity, enabling gradient-based optimization with tractable complexity. The method is derived from a problem formulation with better worst-case guarantees than existing bounds.

Result: Experiments on synthetic and real-world datasets show that DeltaZero consistently achieves state-of-the-art distortion compared to existing methods.

Conclusion: DeltaZero provides an effective and theoretically grounded approach for approximating arbitrary metric spaces with tree metrics, offering both practical performance improvements and better theoretical guarantees than existing methods.

Abstract: Trees and the associated shortest-path tree metrics provide a powerful framework for representing hierarchical and combinatorial structures in data. Given an arbitrary metric space, its deviation from a tree metric can be quantified by Gromov’s $\delta$-hyperbolicity. Nonetheless, designing algorithms that bridge an arbitrary metric to its closest tree metric is still a vivid subject of interest, as most common approaches are either heuristical and lack guarantees, or perform moderately well. In this work, we introduce a novel differentiable optimization framework, coined DeltaZero, that solves this problem. Our method leverages a smooth surrogate for Gromov’s $\delta$-hyperbolicity which enables a gradient-based optimization, with a tractable complexity. The corresponding optimization procedure is derived from a problem with better worst case guarantees than existing bounds, and is justified statistically. Experiments on synthetic and real-world datasets demonstrate that our method consistently achieves state-of-the-art distortion.
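
The quantity being smoothed is the classical four-point $\delta$: for every quadruple, compare the three pairwise distance sums and take half the gap between the two largest. Exact computation, plus a logsumexp surrogate for the hard max (a simplified version of the differentiability trick the abstract alludes to), fits in a few lines:

```python
import itertools
import numpy as np
from scipy.special import logsumexp

def gromov_delta(D):
    """Exact four-point delta of a finite metric given as a distance matrix."""
    delta = 0.0
    for i, j, k, l in itertools.combinations(range(len(D)), 4):
        s = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
        delta = max(delta, (s[2] - s[1]) / 2.0)
    return delta

def soft_max(values, beta=50.0):
    """Smooth, differentiable stand-in for max (sharper as beta grows)."""
    return logsumexp(beta * np.asarray(values)) / beta

# A star tree with unit edges: every leaf pair at distance 2, delta = 0.
D = 2.0 * (1.0 - np.eye(4))
print(gromov_delta(D))   # 0.0
```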

[528] VRAIL: Vectorized Reward-based Attribution for Interpretable Learning

Jina Kim, Youjin Jang, Jeongjin Han

Main category: cs.LG

TL;DR: VRAIL is a bi-level RL framework that learns interpretable weight representations from state features through deep learning and reinforcement learning stages, improving training stability and producing human-interpretable behavior.

DetailsMotivation: To develop a framework that enhances both learning performance and interpretability in reinforcement learning by attributing importance to state features and their interactions.

Method: Two-stage approach: DL stage fits estimated value function using state features, RL stage uses this for reward shaping via potential-based transformations. Models estimator in linear or quadratic form for feature attribution.

Result: VRAIL improves training stability and convergence compared to standard DQN on Taxi-v3 environment, uncovers semantically meaningful subgoals like passenger possession without environment modifications.

Conclusion: VRAIL serves as a general, model-agnostic framework for reward shaping that enhances both learning performance and interpretability in reinforcement learning.

Abstract: We propose VRAIL (Vectorized Reward-based Attribution for Interpretable Learning), a bi-level framework for value-based reinforcement learning (RL) that learns interpretable weight representations from state features. VRAIL consists of two stages: a deep learning (DL) stage that fits an estimated value function using state features, and an RL stage that uses this to shape learning via potential-based reward transformations. The estimator is modeled in either linear or quadratic form, allowing attribution of importance to individual features and their interactions. Empirical results on the Taxi-v3 environment demonstrate that VRAIL improves training stability and convergence compared to standard DQN, without requiring environment modifications. Further analysis shows that VRAIL uncovers semantically meaningful subgoals, such as passenger possession, highlighting its ability to produce human-interpretable behavior. Our findings suggest that VRAIL serves as a general, model-agnostic framework for reward shaping that enhances both learning and interpretability.
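
The RL stage uses the classical potential-based transformation with the DL-stage estimator as the potential, which is what guarantees the optimal policy is unchanged. A sketch with invented Taxi-like features (the weights and feature names are hypothetical):

```python
import numpy as np

def shaped_reward(r, phi_s, phi_s_next, w, gamma=0.99, done=False):
    """r' = r + gamma * V(s') - V(s) with V(s) = w . features(s); the
    potential-based form preserves optimal policies by construction."""
    v_next = 0.0 if done else w @ phi_s_next
    return r + gamma * v_next - w @ phi_s

w = np.array([0.5, 2.0])   # weights fit in the DL stage (hypothetical)
features = lambda dist_to_goal, has_passenger: np.array([-dist_to_goal, float(has_passenger)])
# Picking up the passenger yields an immediate shaped bonus:
print(shaped_reward(-1.0, features(4, 0), features(4, 1), w))   # 1.0
```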

[529] CRISP-NAM: Competing Risks Interpretable Survival Prediction with Neural Additive Models

Dhanesh Ramachandram, Ananya Raval

Main category: cs.LG

TL;DR: CRISP-NAM is an interpretable neural additive model for competing risks survival analysis that models cause-specific hazards while maintaining feature-level interpretability through dedicated neural networks for each feature.

DetailsMotivation: Competing risks are important in survival modeling, especially in healthcare where patients may experience multiple distinct event types, and there's a need for interpretable models that can handle complex non-linear relationships.

Method: Extends neural additive architecture to model cause-specific hazards, with each feature contributing independently through dedicated neural networks, allowing visualization of non-linear relationships between covariates and competing risks.

Result: Demonstrates competitive performance on multiple datasets compared to existing approaches.

Conclusion: CRISP-NAM provides an interpretable solution for competing risks survival analysis that maintains competitive predictive performance while offering feature-level interpretability.

Abstract: Competing risks are crucial considerations in survival modelling, particularly in healthcare domains where patients may experience multiple distinct event types. We propose CRISP-NAM (Competing Risks Interpretable Survival Prediction with Neural Additive Models), an interpretable neural additive model for competing risks survival analysis which extends the neural additive architecture to model cause-specific hazards while preserving feature-level interpretability. Each feature contributes independently to risk estimation through dedicated neural networks, allowing for visualization of complex non-linear relationships between covariates and each competing risk. We demonstrate competitive performance on multiple datasets compared to existing approaches.
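
The additive structure is the whole interpretability story: one small network per feature, with outputs summed per competing risk, so each feature's learned curve can be plotted in isolation. A structural sketch follows (dimensions and head design are illustrative, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class NAMCompetingRisks(nn.Module):
    """One MLP per feature; per-feature outputs are summed to produce a
    cause-specific score for each competing risk."""
    def __init__(self, n_features, n_risks, hidden=32):
        super().__init__()
        self.feature_nets = nn.ModuleList(
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, n_risks))
            for _ in range(n_features))

    def forward(self, x):                    # x: (batch, n_features)
        contribs = [net(x[:, j:j + 1]) for j, net in enumerate(self.feature_nets)]
        return torch.stack(contribs).sum(dim=0)   # (batch, n_risks)

model = NAMCompetingRisks(n_features=5, n_risks=2)
scores = model(torch.randn(8, 5))            # cause-specific risk scores
```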

[530] An entropy-optimal path to humble AI

Davide Bassetti, Lukáš Pospíšil, Michael Groom, Terence J. O’Kane, Illia Horenko

Main category: cs.LG

TL;DR: A novel mathematical framework for non-equilibrium entropy-optimizing Boltzmann machines that provides gradient-descent-free learning with mathematically-justified confidence measures, resulting in more performant and cost-effective models compared to state-of-the-art AI tools.

DetailsMotivation: Address the issues of huge costs and over-confidence in current AI models by developing a more efficient and reliable alternative.

Method: Non-equilibrium entropy-optimizing reformulation of Boltzmann machines based on exact law of total probability and exact convex polytope representations, enabling gradient-descent-free learning.

Result: The method produces more performant and slim models with descriptor lengths close to intrinsic complexity bounds, and achieves higher prediction skills for climate phenomena using minimal training data.

Conclusion: The proposed framework offers a mathematically rigorous, cost-effective alternative to current AI tools with improved performance and reliability measures.

Abstract: Progress of AI has led to very successful, but by no means humble models and tools, especially regarding (i) the huge and further exploding costs and resources they demand, and (ii) the over-confidence of these tools with the answers they provide. Here we introduce a novel mathematical framework for a non-equilibrium entropy-optimizing reformulation of Boltzmann machines based on the exact law of total probability and the exact convex polytope representations. We show that it results in the highly-performant, but much cheaper, gradient-descent-free learning framework with mathematically-justified existence and uniqueness criteria, and cheaply-computable confidence/reliability measures for both the model inputs and the outputs. Comparisons to state-of-the-art AI tools in terms of performance, cost and the model descriptor lengths on a broad set of synthetic and real-world problems with varying complexity reveal that the proposed method results in more performant and slim models, with the descriptor lengths being very close to the intrinsic complexity scaling bounds for the underlying problems. Applying this framework to historical climate data results in models with systematically higher prediction skills for the onsets of important La Niña and El Niño climate phenomena, requiring just a few years of climate data for training - a small fraction of what is necessary for contemporary climate prediction tools.

[531] On the Dynamic Regret of Following the Regularized Leader: Optimism with History Pruning

Naram Mhaisen, George Iosifidis

Main category: cs.LG

TL;DR: FTRL can achieve dynamic regret bounds through optimistic composition and cost linearization with pruning, overcoming its ’lazy’ reputation by synchronizing algorithm state with iterates.

DetailsMotivation: Prior work suggested FTRL's 'lazy' iterates limit performance in dynamic environments, but recent insights show it can produce 'agile' iterates, motivating re-examination of FTRL's dynamic regret capabilities.

Method: Optimistic composition of future costs and careful linearization of past costs with pruning, which synchronizes the algorithm’s state (linearized history) with its iterates to prevent arbitrary growth.

Result: FTRL recovers known dynamic regret bounds, provides principled interpolation between greedy and agile updates, refined control over regret terms, optimism without cyclic dependence, and minimal recursive regularization similar to AdaFTRL.

Conclusion: The limitation in dynamic regret is not FTRL’s ’lazy’ projection style but the decoupling of algorithm state from iterates; pruning addresses this by synchronizing them when necessary.

Abstract: We revisit the Follow the Regularized Leader (FTRL) framework for Online Convex Optimization (OCO) over compact sets, focusing on achieving dynamic regret guarantees. Prior work has highlighted the framework’s limitations in dynamic environments due to its tendency to produce “lazy” iterates. However, building on insights showing FTRL’s ability to produce “agile” iterates, we show that it can indeed recover known dynamic regret bounds through optimistic composition of future costs and careful linearization of past costs, which can lead to pruning some of them. This new analysis of FTRL against dynamic comparators yields a principled way to interpolate between greedy and agile updates and offers several benefits, including refined control over regret terms, optimism without cyclic dependence, and the application of minimal recursive regularization akin to AdaFTRL. More broadly, we show that it is not the “lazy” projection style of FTRL that hinders (optimistic) dynamic regret, but the decoupling of the algorithm’s state (linearized history) from its iterates, allowing the state to grow arbitrarily. Instead, pruning synchronizes these two when necessary.

[532] Shift Happens: Mixture of Experts based Continual Adaptation in Federated Learning

Rahul Atul Bhope, K. R. Jayaram, Praveen Venkateswaran, Nalini Venkatasubramanian

Main category: cs.LG

TL;DR: ShiftEx is a shift-aware mixture of experts framework that addresses covariate and label shifts in federated learning by dynamically creating specialized global models using Maximum Mean Discrepancy detection and facility location optimization.

DetailsMotivation: Federated Learning faces challenges with dynamic data distribution shifts in real-world settings, where non-stationary client data degrades model performance, requiring adaptive middleware solutions.

Method: Uses Maximum Mean Discrepancy for covariate shift detection, latent memory for expert reuse, and facility location-based optimization to minimize covariate mismatch, expert creation costs, and label imbalance.

Result: Achieves 5.5-12.9 percentage point accuracy improvements and 22-95% faster adaptation compared to state-of-the-art FL baselines across diverse shift scenarios.

Conclusion: ShiftEx provides a scalable, privacy-preserving middleware solution for FL systems in non-stationary conditions while minimizing communication and computational overhead.

Abstract: Federated Learning (FL) enables collaborative model training across decentralized clients without sharing raw data, yet faces significant challenges in real-world settings where client data distributions evolve dynamically over time. This paper tackles the critical problem of covariate and label shifts in streaming FL environments, where non-stationary data distributions degrade model performance and necessitate a middleware layer that adapts FL to distributional shifts. We introduce ShiftEx, a shift-aware mixture of experts framework that dynamically creates and trains specialized global models in response to detected distribution shifts using Maximum Mean Discrepancy for covariate shifts. The framework employs a latent memory mechanism for expert reuse and implements facility location-based optimization to jointly minimize covariate mismatch, expert creation costs, and label imbalance. Through theoretical analysis and comprehensive experiments on benchmark datasets, we demonstrate 5.5-12.9 percentage point accuracy improvements and 22-95% faster adaptation compared to state-of-the-art FL baselines across diverse shift scenarios. The proposed approach offers a scalable, privacy-preserving middleware solution for FL systems operating in non-stationary, real-world conditions while minimizing communication and computational overhead.
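For the detection step, a sketch of the standard unbiased MMD² estimator with an RBF kernel, applied to a reference window and a new window of client features; the bandwidth and decision threshold are assumptions, and ShiftEx's full pipeline (expert reuse, facility location) is not reproduced.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased MMD^2 between X (reference window) and Y (new window)."""
    m, n = len(X), len(Y)
    Kxx = rbf_kernel(X, X, sigma); np.fill_diagonal(Kxx, 0.0)
    Kyy = rbf_kernel(Y, Y, sigma); np.fill_diagonal(Kyy, 0.0)
    Kxy = rbf_kernel(X, Y, sigma)
    return Kxx.sum() / (m * (m - 1)) + Kyy.sum() / (n * (n - 1)) - 2 * Kxy.mean()

rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, size=(200, 5))   # old client feature distribution
new = rng.normal(0.6, 1.0, size=(200, 5))   # shifted distribution
if mmd2_unbiased(ref, new) > 0.05:          # threshold is a tunable assumption
    print("covariate shift detected -> spawn or reuse an expert")
```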

[533] Differential Gated Self-Attention

Elpiniki Maria Lygizou, Mónika Farsang, Radu Grosu

Main category: cs.LG

TL;DR: M-DGSA is a novel Transformer architecture that uses input-dependent gating inspired by biological lateral inhibition to dynamically suppress attention noise, improving robustness across vision and language tasks.

DetailsMotivation: Standard Transformers treat all query-key interactions uniformly, making them susceptible to corrupted inputs. The authors aim to enhance noise resilience by incorporating biological principles of lateral inhibition into self-attention mechanisms.

Method: Multihead Differential Gated Self-Attention (M-DGSA) splits each attention head into excitatory and inhibitory branches with dual softmax maps, fused by a sigmoid gate predicted from token embeddings. This creates context-aware contrast enhancement with minimal computational overhead.

Result: M-DGSA demonstrates consistent robustness gains over vanilla Transformer, Vision Transformer, and Differential Transformer baselines on both vision and language benchmarks.

Conclusion: The paper presents a biologically-inspired gating mechanism that successfully enhances Transformer robustness to noise while maintaining cross-domain applicability and seamless integration into existing architectures.

Abstract: Transformers excel across a large variety of tasks but remain susceptible to corrupted inputs, since standard self-attention treats all query-key interactions uniformly. Inspired by lateral inhibition in biological neural circuits and building on the Differential Transformer’s recent use of subtracting two parallel softmax maps for noise cancellation, we propose Multihead Differential Gated Self-Attention (M-DGSA) that learns per-head input-dependent gating to dynamically suppress attention noise. Each head splits into excitatory and inhibitory branches whose dual softmax maps are fused by a sigmoid gate predicted from the token embedding, yielding a context-aware contrast enhancement. M-DGSA integrates seamlessly into existing Transformer stacks with minimal computational overhead. We evaluate on both vision and language benchmarks, demonstrating consistent robustness gains over vanilla Transformer, Vision Transformer, and Differential Transformer baselines. Our contributions are (i) a novel input-dependent gating mechanism for self-attention grounded in lateral inhibition, (ii) a principled synthesis of biological contrast-enhancement and self-attention theory, and (iii) comprehensive experiments demonstrating noise resilience and cross-domain applicability.
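A single-head sketch of the gating idea, assuming separate excitatory/inhibitory projections and a token-wise sigmoid gate on the inhibitory map; the paper's multi-head wiring and exact gate parameterization may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffGatedSelfAttention(nn.Module):
    """Single head: out = (softmax(Q_e K_e^T) - g * softmax(Q_i K_i^T)) V,
    where g is a per-token inhibition strength predicted from the embedding."""
    def __init__(self, d_model):
        super().__init__()
        self.q_e, self.k_e = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.q_i, self.k_i = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, 1)
        self.scale = d_model ** -0.5
    def forward(self, x):                    # x: (batch, seq, d_model)
        A_e = F.softmax(self.q_e(x) @ self.k_e(x).transpose(-2, -1) * self.scale, -1)
        A_i = F.softmax(self.q_i(x) @ self.k_i(x).transpose(-2, -1) * self.scale, -1)
        g = torch.sigmoid(self.gate(x))      # (batch, seq, 1), input-dependent
        return (A_e - g * A_i) @ self.v(x)

y = DiffGatedSelfAttention(64)(torch.randn(2, 16, 64))
```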

[534] TRACED: Transition-aware Regret Approximation with Co-learnability for Environment Design

Geonwoo Cho, Jaegyun Im, Jihwan Lee, Hojun Yi, Sejin Kim, Sundong Kim

Main category: cs.LG

TL;DR: TRACED introduces transition-prediction error and co-learnability metrics to improve curriculum design in Unsupervised Environment Design (UED), enhancing zero-shot generalization in deep reinforcement learning.

DetailsMotivation: Generalizing deep RL agents to unseen environments is challenging. Existing UED methods use regret approximation based solely on value-function loss, which may not fully capture learning potential.

Method: TRACED combines transition-prediction error with regret approximation and introduces co-learnability to model how training on one task affects performance on others. This creates more effective curricula for student agents.

Result: Empirical evaluations show TRACED improves zero-shot generalization over strong baselines across multiple benchmarks. Ablation studies confirm both components contribute to performance gains.

Conclusion: Refined regret approximation and explicit modeling of task relationships enable more sample-efficient curriculum design in UED, demonstrating the value of transition-aware metrics and co-learnability.

Abstract: Generalizing deep reinforcement learning agents to unseen environments remains a significant challenge. One promising solution is Unsupervised Environment Design (UED), a co-evolutionary framework in which a teacher adaptively generates tasks with high learning potential, while a student learns a robust policy from this evolving curriculum. Existing UED methods typically measure learning potential via regret, the gap between optimal and current performance, approximated solely by value-function loss. Building on these approaches, we introduce the transition-prediction error as an additional term in our regret approximation. To capture how training on one task affects performance on others, we further propose a lightweight metric called Co-Learnability. By combining these two measures, we present Transition-aware Regret Approximation with Co-learnability for Environment Design (TRACED). Empirical evaluations show that TRACED produces curricula that improve zero-shot generalization over strong baselines across multiple benchmarks. Ablation studies confirm that the transition-prediction error drives rapid complexity ramp-up and that Co-Learnability delivers additional gains when paired with the transition-prediction error. These results demonstrate how refined regret approximation and explicit modeling of task relationships can be leveraged for sample-efficient curriculum design in UED. Project Page: https://geonwoo.me/traced/

[535] A Finite-Time Analysis of TD Learning with Linear Function Approximation without Projections or Strong Convexity

Wei-Cheng Lee, Francesco Orabona

Main category: cs.LG

TL;DR: This paper presents a refined analysis of Temporal Difference (TD) learning with linear function approximation, showing that a projection-free variant converges with rate O~(||θ*||²₂/√T) even under Markovian noise, without requiring bounded iterates.

DetailsMotivation: Prior convergence guarantees for TD learning in the robust setting typically rely on artificial bounded projection assumptions that don't match practical implementations. The authors challenge this necessity.

Method: The authors conduct a refined analysis of TD learning that reveals a novel self-bounding property of the TD updates, which they exploit to guarantee bounded iterates without explicit projection.

Result: For the first time, they prove that the simple projection-free TD variant converges with rate O~(||θ*||²₂/√T) in the presence of Markovian noise.

Conclusion: The analysis demonstrates that explicit projection is unnecessary for TD learning convergence, providing more practical and theoretically sound guarantees that better align with real-world implementations.

Abstract: We investigate the finite-time convergence properties of Temporal Difference (TD) learning with linear function approximation, a cornerstone algorithm in the field of reinforcement learning. We are interested in the so-called “robust” setting, where the convergence guarantee does not depend on the minimal curvature of the potential function. While prior work has established convergence guarantees in this setting, these results typically rely on the assumption that each iterate is projected onto a bounded set, a condition that is artificial and does not match current practice. In this paper, we challenge the necessity of such an assumption and present a refined analysis of TD learning. For the first time, we show that the simple projection-free variant converges with a rate of $\widetilde{\mathcal{O}}(\frac{||\theta^*||^2_2}{\sqrt{T}})$, even in the presence of Markovian noise. Our analysis reveals a novel self-bounding property of the TD updates and exploits it to guarantee bounded iterates.
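The analyzed algorithm is plain TD(0) with linear features and no projection of the iterates; a toy sketch on a 5-state random walk, where the environment, features, and step-size schedule are illustrative choices.

```python
import numpy as np

def td0_linear(env_step, phi, d, T, alpha=0.5, gamma=0.99):
    """Projection-free TD(0) with linear function approximation: theta is
    never projected onto a bounded set, matching the analyzed variant."""
    theta = np.zeros(d)
    s = 0
    for t in range(T):
        s_next, r = env_step(s)
        td_error = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta += alpha / np.sqrt(t + 1) * td_error * phi(s)  # decaying step size
        s = s_next
    return theta

# Toy 5-state random walk with reward 1 for reaching the right end.
rng = np.random.default_rng(1)
def env_step(s):
    s2 = int(np.clip(s + rng.choice([-1, 1]), 0, 4))
    return s2, float(s2 == 4)

phi = lambda s: np.eye(5)[s]  # one-hot (tabular) features as a special case
print(td0_linear(env_step, phi, d=5, T=5000))
```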

[536] Long-Tailed Out-of-Distribution Detection with Refined Separate Class Learning

Shuai Feng, Yuxin Ge, Yuntao Du, Mingcai Chen, Chongjun Wang, Lei Feng

Main category: cs.LG

TL;DR: RSCL improves OOD detection in long-tailed distributions using dynamic temperature adjustment and informative outlier mining to distinguish OOD samples from both head and tail classes.

DetailsMotivation: Existing OOD detection methods struggle with long-tailed data distributions due to confusion between OOD samples and head/tail classes, and limitations in static temperature scaling and uninformative outliers in SCL approaches.

Method: Proposes Refined Separate Class Learning (RSCL) with dynamic class-wise temperature adjustment and informative outlier mining based on affinity with head/tail classes.

Result: Extensive experiments show RSCL achieves superior OOD detection performance while improving in-distribution classification accuracy.

Conclusion: RSCL effectively addresses limitations of existing SCL methods and provides robust OOD detection for long-tailed distributions.

Abstract: Out-of-distribution (OOD) detection is crucial for deploying robust machine learning models. However, when training data follows a long-tailed distribution, the model’s ability to accurately detect OOD samples is significantly compromised, due to the confusion between OOD samples and head/tail classes. To distinguish OOD samples from both head and tail classes, the separate class learning (SCL) approach has emerged as a promising solution, which separately conducts head-specific and tail-specific class learning. In this work, we examine the limitations of existing SCL methods and reveal that OOD detection performance is notably influenced by the use of a static scaling temperature value and the presence of uninformative outliers. To mitigate these limitations, we propose a novel approach termed Refined Separate Class Learning (RSCL), which leverages dynamic class-wise temperature adjustment to modulate the temperature parameter for each in-distribution class and informative outlier mining to identify diverse types of outliers based on their affinity with head and tail classes. Extensive experiments demonstrate that RSCL achieves superior OOD detection performance while improving the classification accuracy on in-distribution data.
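A minimal sketch of the class-wise temperature idea: cross-entropy in which each sample's logits are scaled by the temperature of its ground-truth class. How RSCL actually schedules or learns these temperatures is not specified here, so the values below are placeholders.

```python
import torch
import torch.nn.functional as F

def class_temperature_loss(logits, targets, class_temps):
    """Cross-entropy with a per-class temperature, so head and tail classes
    can be scaled differently (temps could be updated during training)."""
    t = class_temps[targets].unsqueeze(1)      # (batch, 1)
    return F.cross_entropy(logits / t, targets)

logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
class_temps = torch.linspace(1.0, 0.5, 10)     # placeholder head-to-tail schedule
loss = class_temperature_loss(logits, targets, class_temps)
```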

[537] Spiking Brain Compression: Exploring One-Shot Post-Training Pruning and Quantization for Spiking Neural Networks

Lianfeng Shi, Ao Li, Benjamin Ward-Cherrier

Main category: cs.LG

TL;DR: SBC is a one-shot post-training compression framework for Spiking Neural Networks that extends Optimal Brain Compression with spike train-based objectives, achieving state-of-the-art results with significant speed improvements over iterative methods.

DetailsMotivation: SNNs need efficient compression for neuromorphic hardware, but current iterative pruning/quantization methods are costly for pre-trained or large networks. A one-shot approach is needed to reduce compression overhead.

Method: Extends OBC to SNNs by replacing current-based loss with spike train-based objective, enabling cheap Hessian computation and single backward pass for pruning/quantization with analytical rescaling.

Result: Achieves single-digit to double-digit accuracy gains over OBC, approaches iterative method accuracy while reducing compression time by 2-3 orders of magnitude on neuromorphic and large static datasets.

Conclusion: SBC provides an efficient one-shot compression solution for SNNs that balances accuracy and computational cost, making it practical for real-world neuromorphic applications.

Abstract: Spiking Neural Networks (SNNs) have emerged as a new generation of energy-efficient neural networks suitable for implementation on neuromorphic hardware. As neuromorphic hardware has limited memory and computing resources, weight pruning and quantization have recently been explored to improve SNNs’ efficiency. State-of-the-art SNN pruning/quantization methods employ multiple compression and training iterations, increasing the cost for pre-trained or very large SNNs. In this paper, we propose a new one-shot post-training pruning/quantization framework, Spiking Brain Compression (SBC), that extends the Optimal Brain Compression (OBC) method to SNNs. SBC replaces the current-based loss found in OBC with a spike train-based objective whose Hessian is cheaply computable, allowing a single backward pass to prune or quantize synapses and analytically rescale the rest. Our experiments on models trained with neuromorphic datasets (N-MNIST, CIFAR10-DVS, DVS128-Gesture) and large static datasets (CIFAR-100, ImageNet) show state-of-the-art results for one-shot post-training compression methods on SNNs, with single-digit to double-digit accuracy gains compared to OBC. SBC also approaches the accuracy of costly iterative methods, while cutting compression time by 2-3 orders of magnitude.
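SBC extends the Optimal Brain Surgeon/Compression update; the sketch below shows the classic one-shot step: pick the weight with the lowest saliency, zero it, and analytically rescale the rest through the inverse Hessian. SBC's novelty, computing that Hessian cheaply from a spike-train objective, is not reproduced here.

```python
import numpy as np

def obs_prune_one(w, H_inv):
    """One OBS/OBC step: remove the weight whose deletion least increases a
    local quadratic loss, with a compensating update to the remaining weights."""
    saliency = w ** 2 / (2.0 * np.diag(H_inv))
    q = int(np.argmin(saliency))
    w = w - (w[q] / H_inv[q, q]) * H_inv[:, q]  # analytical rescaling
    w[q] = 0.0
    return w, q

H_inv = np.linalg.inv(np.array([[2.0, 0.3], [0.3, 1.0]]))
print(obs_prune_one(np.array([0.5, -1.2]), H_inv))
```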

[538] Enhanced Generative Model Evaluation with Clipped Density and Coverage

Nicolas Salvy, Hugues Talbot, Bertrand Thirion

Main category: cs.LG

TL;DR: The paper introduces Clipped Density and Clipped Coverage metrics to address limitations in current generative model evaluation methods by providing robust, interpretable quality assessment that prevents outlier bias.

DetailsMotivation: Current quality metrics for generative models lack reliable, interpretable values due to insufficient calibration and robustness to outliers, hindering their use in critical applications.

Method: Proposes two novel metrics: Clipped Density and Clipped Coverage, which clip individual sample contributions and nearest neighbor ball radii to prevent out-of-distribution samples from biasing aggregated values.

Result: The metrics demonstrate linear score degradation as bad sample proportion increases, allowing straightforward interpretation as equivalent proportions of good samples. They outperform existing methods in robustness, sensitivity, and interpretability.

Conclusion: Clipped Density and Clipped Coverage provide superior evaluation of generative models compared to existing metrics, addressing key shortcomings in fidelity and coverage assessment.

Abstract: Although generative models have made remarkable progress in recent years, their use in critical applications has been hindered by an inability to reliably evaluate the quality of their generated samples. Quality refers to at least two complementary concepts: fidelity and coverage. Current quality metrics often lack reliable, interpretable values due to an absence of calibration or insufficient robustness to outliers. To address these shortcomings, we introduce two novel metrics: Clipped Density and Clipped Coverage. By clipping individual sample contributions, as well as the radii of nearest neighbor balls for fidelity, our metrics prevent out-of-distribution samples from biasing the aggregated values. Through analytical and empirical calibration, these metrics demonstrate linear score degradation as the proportion of bad samples increases. Thus, they can be straightforwardly interpreted as equivalent proportions of good samples. Extensive experiments on synthetic and real-world datasets demonstrate that Clipped Density and Clipped Coverage outperform existing methods in terms of robustness, sensitivity, and interpretability when evaluating generative models.
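A rough numpy sketch of how such clipping can enter the standard density and coverage estimators of Naeem et al. (2020): per-sample density contributions are capped at 1 and neighbor-ball radii optionally clipped. This is an illustrative reading of the idea, not the authors' reference implementation.

```python
import numpy as np

def knn_radii(X, k=5):
    """Distance to the k-th nearest neighbor within X (excluding self)."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    d.sort(axis=1)
    return d[:, k]                        # column 0 is the self-distance 0

def clipped_density(real, fake, k=5, r_max=None):
    radii = knn_radii(real, k)
    if r_max is not None:
        radii = np.minimum(radii, r_max)  # clip neighbor-ball radii
    d = np.sqrt(((fake[:, None, :] - real[None, :, :]) ** 2).sum(-1))
    counts = (d < radii[None, :]).sum(axis=1) / k
    return np.clip(counts, 0.0, 1.0).mean()  # cap each sample's contribution

def clipped_coverage(real, fake, k=5, r_max=None):
    radii = knn_radii(real, k)
    if r_max is not None:
        radii = np.minimum(radii, r_max)
    d = np.sqrt(((real[:, None, :] - fake[None, :, :]) ** 2).sum(-1))
    return ((d < radii[:, None]).any(axis=1)).mean()

rng = np.random.default_rng(0)
real, fake = rng.normal(size=(200, 2)), rng.normal(0.2, 1.0, size=(200, 2))
print(clipped_density(real, fake), clipped_coverage(real, fake))
```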

[539] There Was Never a Bottleneck in Concept Bottleneck Models

Antonio Almudévar, José Miguel Hernández-Lobato, Alfonso Ortega

Main category: cs.LG

TL;DR: The paper proposes Minimal Concept Bottleneck Models (MCBMs) to address limitations in Concept Bottleneck Models (CBMs) by incorporating an Information Bottleneck objective to ensure each representation component encodes only information relevant to its corresponding concept.

DetailsMotivation: CBMs don't impose a true bottleneck - components can predict concepts but may encode additional information, which raises concerns about interpretability and intervention validity.

Method: MCBMs incorporate an Information Bottleneck objective via variational regularization to constrain each representation component to retain only concept-relevant information.

Result: MCBMs yield more interpretable representations, support principled concept-level interventions, and remain consistent with probability-theoretic foundations.

Conclusion: The proposed MCBM approach overcomes CBM limitations by ensuring true concept bottlenecks through information-theoretic constraints.

Abstract: Deep learning representations are often difficult to interpret, which can hinder their deployment in sensitive applications. Concept Bottleneck Models (CBMs) have emerged as a promising approach to mitigate this issue by learning representations that support target task performance while ensuring that each component predicts a concrete concept from a predefined set. In this work, we argue that CBMs do not impose a true bottleneck: the fact that a component can predict a concept does not guarantee that it encodes only information about that concept. This shortcoming raises concerns regarding interpretability and the validity of intervention procedures. To overcome this limitation, we propose Minimal Concept Bottleneck Models (MCBMs), which incorporate an Information Bottleneck (IB) objective to constrain each representation component to retain only the information relevant to its corresponding concept. This IB is implemented via a variational regularization term added to the training loss. As a result, MCBMs yield more interpretable representations, support principled concept-level interventions, and remain consistent with probability-theoretic foundations.
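The IB regularizer amounts to a KL term per concept representation on top of the usual CBM objective; a minimal torch sketch under the standard Gaussian-posterior, standard-normal-prior assumption (the paper's exact variational term may differ).

```python
import torch
import torch.nn.functional as F

def mcbm_loss(task_logits, y, concept_mu, concept_logvar, concept_logits, c, beta=0.1):
    """CBM loss plus a variational IB regularizer: each concept component is
    sampled from q(z_i|x) = N(mu_i, sigma_i^2) and pushed toward N(0, 1),
    limiting it to concept-relevant information."""
    task = F.cross_entropy(task_logits, y)
    concept = F.binary_cross_entropy_with_logits(concept_logits, c)
    kl = -0.5 * (1 + concept_logvar - concept_mu.pow(2)
                 - concept_logvar.exp()).sum(-1).mean()
    return task + concept + beta * kl

B, C, K = 8, 6, 4
loss = mcbm_loss(torch.randn(B, K), torch.randint(0, K, (B,)),
                 torch.randn(B, C), torch.randn(B, C),
                 torch.randn(B, C), torch.randint(0, 2, (B, C)).float())
```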

[540] Predictive Coding-based Deep Neural Network Fine-tuning for Computationally Efficient Domain Adaptation

Matteo Cardoni, Sam Leroux

Main category: cs.LG

TL;DR: A hybrid training methodology combining Backpropagation for offline training and Predictive Coding for online adaptation, enabling efficient on-device domain adaptation for deep neural networks in dynamic environments.

DetailsMotivation: Single static models are insufficient for dynamic real-world environments where input data distributions change due to factors like sensor drift or lighting variations, necessitating continual model adaptation.

Method: Start with offline training using Backpropagation for high initial performance, then use Predictive Coding for online adaptation to recover accuracy lost from input distribution shifts, leveraging Backpropagation’s robustness and Predictive Coding’s computational efficiency.

Result: Experimental results on MNIST and CIFAR-10 datasets show effective adaptation with reduced computational overhead.

Conclusion: The hybrid strategy offers a promising solution for maintaining model performance in dynamic environments, particularly suitable for resource-constrained edge devices or neuromorphic accelerators.

Abstract: As deep neural networks are increasingly deployed in dynamic, real-world environments, relying on a single static model is often insufficient. Changes in input data distributions caused by sensor drift or lighting variations necessitate continual model adaptation. In this paper, we propose a hybrid training methodology that enables efficient on-device domain adaptation by combining the strengths of Backpropagation and Predictive Coding. The method begins with a deep neural network trained offline using Backpropagation to achieve high initial performance. Subsequently, Predictive Coding is employed for online adaptation, allowing the model to recover accuracy lost due to shifts in the input data distribution. This approach leverages the robustness of Backpropagation for initial representation learning and the computational efficiency of Predictive Coding for continual learning, making it particularly well-suited for resource-constrained edge devices or future neuromorphic accelerators. Experimental results on the MNIST and CIFAR-10 datasets demonstrate that this hybrid strategy enables effective adaptation with a reduced computational overhead, offering a promising solution for maintaining model performance in dynamic environments.

[541] Improved Scaling Laws in Linear Regression via Data Reuse

Licong Lin, Jingfeng Wu, Peter L. Bartlett

Main category: cs.LG

TL;DR: Data reuse through multi-pass SGD improves neural scaling laws in linear regression, achieving better test error bounds than one-pass SGD when data is limited.

DetailsMotivation: Neural scaling laws suggest test error decreases with model and data size, but this becomes unsustainable when new data runs out. The paper investigates whether data reuse can improve scaling laws in data-constrained regimes.

Method: The authors derive sharp test error bounds for M-dimensional linear models trained by multi-pass stochastic gradient descent on N data points with sketched features. They assume power-law spectra for data covariance and true parameters.

Result: Multi-pass SGD achieves test error of Θ(M^{1-b} + L^{(1-b)/a}), while one-pass SGD only attains Θ(M^{1-b} + N^{(1-b)/a}). This shows improved scaling via data reuse (L > N) in data-constrained settings.

Conclusion: Data reuse through multi-pass SGD provides better scaling laws than one-pass SGD when data is limited, with theoretical findings supported by numerical simulations.

Abstract: Neural scaling laws suggest that the test error of large language models trained online decreases polynomially as the model size and data size increase. However, such scaling can be unsustainable when running out of new data. In this work, we show that data reuse can improve existing scaling laws in linear regression. Specifically, we derive sharp test error bounds on $M$-dimensional linear models trained by multi-pass stochastic gradient descent (multi-pass SGD) on $N$ data with sketched features. Assuming that the data covariance has a power-law spectrum of degree $a$, and that the true parameter follows a prior with an aligned power-law spectrum of degree $b-a$ (with $a > b > 1$), we show that multi-pass SGD achieves a test error of $\Theta(M^{1-b} + L^{(1-b)/a})$, where $L \lesssim N^{a/b}$ is the number of iterations. In the same setting, one-pass SGD only attains a test error of $\Theta(M^{1-b} + N^{(1-b)/a})$ (see e.g., Lin et al., 2024). This suggests an improved scaling law via data reuse (i.e., choosing $L>N$) in data-constrained regimes. Numerical simulations are also provided to verify our theoretical findings.
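A toy simulation in the spirit of this setting: sketched linear regression with a power-law covariance spectrum and a matching source condition, comparing one pass with several passes over the same data. The constants, noise level, and Gaussian sketch are arbitrary choices, not the paper's.

```python
import numpy as np

def run(n_passes, M=64, N=256, a=2.0, b=1.5, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    D = 1024
    lam = np.arange(1.0, D + 1) ** -a                # covariance spectrum ~ i^-a
    theta = np.arange(1.0, D + 1) ** ((a - b) / 2)   # so lam_i * theta_i^2 ~ i^-b
    X = rng.normal(size=(N, D)) * np.sqrt(lam)
    y = X @ theta + 0.1 * rng.normal(size=N)
    S = rng.normal(size=(D, M)) / np.sqrt(M)         # random feature sketch
    Xs = X @ S
    w = np.zeros(M)
    for _ in range(n_passes):                        # L = n_passes * N iterations
        for i in rng.permutation(N):
            w -= lr * (Xs[i] @ w - y[i]) * Xs[i]
    X_test = rng.normal(size=(2000, D)) * np.sqrt(lam)
    return np.mean((X_test @ S @ w - X_test @ theta) ** 2)

print(run(n_passes=1), run(n_passes=4))  # data reuse (L > N) typically helps here
```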

[542] From Sorting Algorithms to Scalable Kernels: Bayesian Optimization in High-Dimensional Permutation Spaces

Zikai Xie, Linjiang Chen

Main category: cs.LG

TL;DR: A novel framework using sorting algorithm-derived kernels for efficient Bayesian Optimization in high-dimensional permutation spaces, introducing the Merge Kernel with Θ(n log n) complexity that outperforms existing methods.

DetailsMotivation: Current BO approaches for permutation spaces suffer from Ω(n²) complexity with dense representations, making them impractical for large-scale permutations. There's a need for scalable representations to enable BO applications in high-dimensional permutation problems.

Method: Propose a framework generating permutation representations via kernel functions from sorting algorithms. Introduce the Merge Kernel based on merge sort’s divide-and-conquer structure, achieving Θ(n log n) complexity without information loss while capturing permutation structure effectively.

Result: Extensive evaluations show the Merge Kernel performs competitively with Mallows kernel in low dimensions but significantly outperforms it in both optimization performance and computational efficiency as dimension n grows.

Conclusion: The Merge Kernel provides a scalable and effective solution for Bayesian Optimization in high-dimensional permutation spaces, enabling applications to previously intractable problems like large-scale feature ordering and combinatorial neural architecture search.

Abstract: Bayesian Optimization (BO) is a powerful tool for black-box optimization, but its application to high-dimensional permutation spaces is severely limited by the challenge of defining scalable representations. The current state-of-the-art BO approach for permutation spaces relies on an exhaustive $\Omega(n^2)$ pairwise comparison, inducing a dense representation that is impractical for large-scale permutations. To break this barrier, we introduce a novel framework for generating efficient permutation representations via kernel functions derived from sorting algorithms. Within this framework, the Mallows kernel can be viewed as a special instance derived from enumeration sort. Further, we introduce the \textbf{Merge Kernel}, which leverages the divide-and-conquer structure of merge sort to produce a compact $\Theta(n\log n)$ representation, achieving the lowest possible complexity with no information loss while effectively capturing permutation structure. Our central thesis is that the Merge Kernel performs competitively with the Mallows kernel in low-dimensional settings, but significantly outperforms it in both optimization performance and computational efficiency as the dimension $n$ grows. Extensive evaluations on various permutation optimization benchmarks confirm our hypothesis, demonstrating that the Merge Kernel provides a scalable and more effective solution for Bayesian optimization in high-dimensional permutation spaces, thereby unlocking the potential for tackling previously intractable problems such as large-scale feature ordering and combinatorial neural architecture search.
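The abstract does not spell out the construction, but one plausible reading is that the kernel compares the Θ(n log n) comparison outcomes merge sort produces on each permutation. The sketch below implements that interpretation with an RBF kernel on padded comparison traces; treat it as illustrative only, not the paper's definition.

```python
import math

def merge_trace(perm):
    """Trace of comparison outcomes made by merge sort on `perm`, padded so
    every length-n permutation yields the same trace length."""
    trace = []
    def sort(a):
        if len(a) <= 1:
            return a
        mid = len(a) // 2
        left, right = sort(a[:mid]), sort(a[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            take_left = left[i] <= right[j]
            trace.append(1.0 if take_left else 0.0)
            if take_left:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        trace.extend([0.5] * (len(left) - i + len(right) - j))  # pad skipped comparisons
        merged.extend(left[i:]); merged.extend(right[j:])
        return merged
    sort(list(perm))
    return trace

def merge_kernel(p, q, gamma=1.0):
    """RBF kernel on the two traces (one plausible instantiation)."""
    return math.exp(-gamma * sum((a - b) ** 2
                                 for a, b in zip(merge_trace(p), merge_trace(q))))

print(merge_kernel((0, 1, 2, 3), (0, 1, 3, 2)))  # near 1: similar permutations
print(merge_kernel((0, 1, 2, 3), (3, 2, 1, 0)))  # smaller: reversed order
```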

[543] CodeBrain: Towards Decoupled Interpretability and Multi-Scale Architecture for EEG Foundation Model

Jingying Ma, Feng Wu, Qika Lin, Yucheng Xing, Chenyu Liu, Ziyu Jia, Mengling Feng

Main category: cs.LG

TL;DR: CodeBrain is a two-stage EEG foundation model that uses temporal-frequency tokenization and multi-scale architecture to address limitations of current EEG models, achieving strong generalization across multiple tasks and datasets.

DetailsMotivation: Current EEG foundation models produce clinically uninterpretable and weakly discriminative representations, inefficiently capture global dependencies, and neglect important local neural events.

Method: Two-stage approach: 1) TFDual-Tokenizer decouples temporal and frequency EEG signals into discrete tokens, 2) multi-scale EEGSSM architecture combines global convolution with sliding window attention to capture both long-range and local dependencies.

Result: Pretrained on the largest public EEG corpus, CodeBrain achieves strong generalization across 8 downstream tasks and 10 datasets under distribution shifts, with comprehensive ablations and interpretability evaluations.

Conclusion: CodeBrain effectively addresses current EEG model limitations by providing interpretable, discriminative representations that capture both global and local neural dependencies, demonstrating robust performance across diverse EEG tasks.

Abstract: Electroencephalography (EEG) provides real-time insights into brain activity and supports diverse applications in neuroscience. While EEG foundation models (EFMs) have emerged to address the scalability issues of task-specific models, current approaches still yield clinically uninterpretable and weakly discriminative representations, inefficiently capture global dependencies, and neglect important local neural events. We present CodeBrain, a two-stage EFM designed to fill this gap. In the first stage, we introduce the TFDual-Tokenizer, which decouples heterogeneous temporal and frequency EEG signals into discrete tokens, quadratically expanding the representation space to enhance discriminative power and offering domain-specific interpretability by suggesting potential links to neural events and spectral rhythms. In the second stage, we propose the multi-scale EEGSSM architecture, which combines structured global convolution with sliding window attention to efficiently capture both sparse long-range and local dependencies, reflecting the brain’s small-world topology. Pretrained on the largest public EEG corpus, CodeBrain achieves strong generalization across 8 downstream tasks and 10 datasets under distribution shifts, supported by comprehensive ablations, scaling-law analyses, and interpretability evaluations. Both code and pretraining weights will be released in a future version.

[544] CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction

Agnideep Aich, Md Monzur Murshed, Sameera Hewage, Amanda Mayeaux

Main category: cs.LG

TL;DR: This paper proposes using A2 copula-based data augmentation to handle imbalanced diabetes datasets, showing superior performance over SMOTE with Random Forest achieving the best results.

DetailsMotivation: Diabetes affects 1 in 9 people and early detection is crucial. Current ML methods are affected by data imbalance, and standard oversampling techniques like SMOTE may not preserve data dependency structures effectively.

Method: Used A2 copula for data augmentation to preserve dependency structure when generating minority class data. Applied five ML algorithms (logistic regression, random forest, gradient boosting, XGBoost, MLP) on the Pima Indian dataset and validated results with McNemar’s test.

Result: Random Forest with A2 copula oversampling (theta=10) achieved the best performance: 5.3% accuracy improvement, 9.5% precision improvement, 5.7% recall improvement, 7.6% F1-score improvement, and 1.1% AUC improvement compared to standard SMOTE.

Conclusion: This is the first known use of A2 copulas for data augmentation. Copulas serve as an effective alternative to SMOTE, demonstrating their efficacy as a statistical method in ML applications for handling imbalanced datasets.

Abstract: Diabetes mellitus poses a significant health risk, as nearly 1 in 9 people are affected by it. Early detection can significantly lower this risk. Despite significant advancements in machine learning for identifying diabetic cases, results can still be influenced by the imbalanced nature of the data. To address this challenge, our study considered copula-based data augmentation, which preserves the dependency structure when generating data for the minority class and integrates it with machine learning (ML) techniques. We selected the Pima Indian dataset and generated data using the A2 copula, then applied five machine learning algorithms: logistic regression, random forest, gradient boosting, extreme gradient boosting, and Multilayer Perceptron. Overall, our findings show that Random Forest with A2 copula oversampling (theta = 10) achieved the best performance, with improvements of 5.3% in accuracy, 9.5% in precision, 5.7% in recall, 7.6% in F1-score, and 1.1% in AUC compared to the standard SMOTE method. Furthermore, we statistically validated our results using McNemar’s test. This research represents the first known use of A2 copulas for data augmentation and serves as an alternative to the SMOTE technique, highlighting the efficacy of copulas as a statistical method in machine learning applications.
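The pipeline is: fit a copula to the minority class, sample from it, and map the samples back through the marginals. The sketch below substitutes a Gaussian copula with empirical marginals for the paper's A2 copula, whose generator is specific to that work.

```python
import numpy as np
from scipy import stats

def copula_oversample(X_min, n_new, seed=0):
    """Synthetic minority rows: fit a Gaussian copula (stand-in for A2) to the
    minority class, sample it, and invert the empirical marginals."""
    rng = np.random.default_rng(seed)
    n, d = X_min.shape
    # Pseudo-observations: ranks scaled to (0, 1), then normal scores
    U = np.column_stack([stats.rankdata(X_min[:, j]) for j in range(d)]) / (n + 1)
    corr = np.corrcoef(stats.norm.ppf(U), rowvar=False)
    # Sample the copula and push back through the empirical quantile functions
    U_new = stats.norm.cdf(rng.multivariate_normal(np.zeros(d), corr, size=n_new))
    return np.column_stack(
        [np.quantile(X_min[:, j], U_new[:, j]) for j in range(d)])

X_min = np.random.default_rng(1).gamma(2.0, size=(50, 3))  # toy minority class
X_aug = copula_oversample(X_min, n_new=100)
```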

[545] GEDAN: Learning the Edit Costs for Graph Edit Distance

Francesco Leonardi, Markus Orsi, Jean-Louis Reymond, Kaspar Riesen

Main category: cs.LG

TL;DR: A neural network framework for learning contextualized edit costs in Graph Edit Distance computation, overcoming limitations of unit-cost assumptions and enabling interpretable graph matching.

DetailsMotivation: Traditional GED methods use restrictive unit-cost assumptions that don't align with real-world topological and functional distances, limiting their practical applicability.

Method: End-to-end Graph Neural Network combining unsupervised self-organizing mechanism for GED approximation with Generalized Additive Model to learn fine-grained, contextualized edit costs.

Result: The approach overcomes non-end-to-end method limitations, produces interpretable graph matchings, uncovers meaningful structures in complex graphs, and shows strong performance in domains like molecular analysis.

Conclusion: The proposed framework successfully learns task-aligned edit costs, providing more realistic and interpretable graph similarity measurements suitable for real-world applications.

Abstract: Graph Edit Distance (GED) is defined as the minimum cost transformation of one graph into another and is a widely adopted metric for measuring the dissimilarity between graphs. The major problem of GED is that its computation is NP-hard, which has in turn led to the development of various approximation methods, including approaches based on neural networks (NN). However, most NN methods assume a unit cost for edit operations – a restrictive and often unrealistic simplification, since topological and functional distances rarely coincide in real-world data. In this paper, we propose a fully end-to-end Graph Neural Network framework for learning the edit costs for GED, at a fine-grained level, aligning topological and task-specific similarity. Our method combines an unsupervised self-organizing mechanism for GED approximation with a Generalized Additive Model that flexibly learns contextualized edit costs. Experiments demonstrate that our approach overcomes the limitations of non-end-to-end methods, yielding directly interpretable graph matchings, uncovering meaningful structures in complex graphs, and showing strong applicability to domains such as molecular analysis.

[546] Mirror Descent Policy Optimisation for Robust Constrained Markov Decision Processes

David Bossens, Atsushi Nitanda

Main category: cs.LG

TL;DR: Mirror descent policy optimisation for robust constrained Markov decision processes (RCMDPs) using policy gradient techniques to optimize both policy and adversarial transition kernel, achieving Õ(1/T^(1/3)) convergence rate.

DetailsMotivation: Safety is essential for reinforcement learning systems. RCMDPs allow learning policies that satisfy long-term constraints while providing guarantees under epistemic uncertainty.

Method: Uses mirror descent policy optimisation with policy gradient techniques to optimize both the policy (maximiser) and transition kernel (adversarial minimiser) on the Lagrangian representing a constrained MDP. Also contributes algorithm for approximate gradient descent in transition kernel space.

Result: Achieves Õ(1/T^(1/3)) convergence rate in sample-based RCMDP setting. Experiments show benefits in constrained/unconstrained optimisation and significant robustness improvements over baseline algorithms.

Conclusion: Proposed method effectively handles robust constrained MDPs with theoretical guarantees and practical performance improvements in robustness tests.

Abstract: Safety is an essential requirement for reinforcement learning systems. The newly emerging framework of robust constrained Markov decision processes allows learning policies that satisfy long-term constraints while providing guarantees under epistemic uncertainty. This paper presents mirror descent policy optimisation for robust constrained Markov decision processes (RCMDPs), making use of policy gradient techniques to optimise both the policy (as a maximiser) and the transition kernel (as an adversarial minimiser) on the Lagrangian representing a constrained MDP. Our proposed algorithm obtains an $\tilde{\mathcal{O}}\left(1/T^{1/3}\right)$ convergence rate in the sample-based RCMDP setting. In addition to the RCMDP setting, the paper also contributes an algorithm for approximate gradient descent in the space of transition kernels, which is of independent interest for designing adversarial environments. Experiments confirm the benefits of mirror descent policy optimisation in constrained and unconstrained optimisation, and significant improvements are observed in robustness tests when compared to baseline policy optimisation algorithms.

[547] Lost in Latent Space: An Empirical Study of Latent Diffusion Models for Physics Emulation

François Rozet, Ruben Ohana, Michael McCabe, Gilles Louppe, François Lanusse, Shirley Ho

Main category: cs.LG

TL;DR: Latent-space diffusion models can effectively emulate dynamical systems with up to 1000x compression while maintaining accuracy, outperforming non-generative methods with better uncertainty handling.

DetailsMotivation: Diffusion models are computationally expensive for physics emulation, similar to image/video generation challenges. The paper investigates whether latent-space emulation (like in autoencoders) can reduce computational costs for dynamical systems.

Method: Using latent-space diffusion models with autoencoder compression instead of pixel-space emulation. Explored various compression rates (up to 1000x) and practical design choices including architectures and optimizers.

Result: Latent-space emulation accuracy remains robust even at high compression rates (1000x). Diffusion-based emulators are more accurate than non-generative counterparts and provide better uncertainty compensation through greater prediction diversity.

Conclusion: Latent-space diffusion models are effective for dynamical system emulation, offering significant computational savings without sacrificing accuracy, while outperforming traditional non-generative approaches.

Abstract: The steep computational cost of diffusion models at inference hinders their use as fast physics emulators. In the context of image and video generation, this computational drawback has been addressed by generating in the latent space of an autoencoder instead of the pixel space. In this work, we investigate whether a similar strategy can be effectively applied to the emulation of dynamical systems and at what cost. We find that the accuracy of latent-space emulation is surprisingly robust to a wide range of compression rates (up to 1000x). We also show that diffusion-based emulators are consistently more accurate than non-generative counterparts and compensate for uncertainty in their predictions with greater diversity. Finally, we cover practical design choices, spanning from architectures to optimizers, that we found critical to train latent-space emulators.

[548] Reparameterization Proximal Policy Optimization

Hai Zhong, Xun Wang, Zhuoran Li, Longbo Huang

Main category: cs.LG

TL;DR: RPO combines reparameterization policy gradients with PPO’s surrogate objective to enable stable sample reuse, achieving superior sample efficiency in locomotion and manipulation tasks.

DetailsMotivation: Reparameterization policy gradient (RPG) suffers from training instability due to high-variance gradients, limiting its sample efficiency despite potential benefits from differentiable dynamics.

Method: Proposes Reparameterization Proximal Policy Optimization (RPO) which bridges RPG with PPO’s surrogate objective, using backpropagation through time for efficient gradient computation and incorporating policy gradient clipping, KL divergence regularization, and variance reduction techniques.

Result: RPO demonstrates superior sample efficiency and strong performance on challenging locomotion and manipulation tasks compared to existing methods.

Conclusion: RPO successfully addresses RPG’s instability issues while maintaining sample efficiency, providing a stable and effective approach for policy optimization in reinforcement learning.

Abstract: Reparameterization policy gradient (RPG) is promising for improving sample efficiency by leveraging differentiable dynamics. However, a critical barrier is its training instability, where high-variance gradients can destabilize the learning process. To address this, we draw inspiration from Proximal Policy Optimization (PPO), which uses a surrogate objective to enable stable sample reuse in the model-free setting. We first establish a connection between this surrogate objective and RPG, which has been largely unexplored and is non-trivial. Then, we bridge this gap by demonstrating that the reparameterization gradient of a PPO-like surrogate objective can be computed efficiently using backpropagation through time. Based on this key insight, we propose Reparameterization Proximal Policy Optimization (RPO), a stable and sample-efficient RPG-based method. RPO enables stable sample reuse over multiple epochs by employing a policy gradient clipping mechanism tailored for RPG. It is further stabilized by Kullback-Leibler (KL) divergence regularization and remains fully compatible with existing variance reduction methods. We evaluate RPO on a suite of challenging locomotion and manipulation tasks, where experiments demonstrate that our method achieves superior sample efficiency and strong performance.

[549] Training-Free Stein Diffusion Guidance: Posterior Correction for Sampling Beyond High-Density Regions

Van Khoa Nguyen, Lionel Blondé, Alexandros Kalousis

Main category: cs.LG

TL;DR: Stein Diffusion Guidance (SDG) is a novel training-free framework that combines stochastic optimal control principles with Stein variational inference to improve diffusion guidance in low-density regions, outperforming standard methods.

DetailsMotivation: Current training-free diffusion guidance methods rely on Tweedie's formula approximations which are unreliable in low-density regions, while principled stochastic optimal control approaches are too computationally expensive for practical use.

Method: SDG introduces a surrogate SOC objective with theoretical value function bounds, uses Stein variational inference to find the steepest descent direction minimizing KL divergence between approximate and true posteriors, and incorporates a principled Stein correction mechanism with a novel running cost functional.

Result: Experiments on molecular low-density sampling tasks show that SDG consistently outperforms standard training-free guidance methods.

Conclusion: SDG enables effective guidance in low-density regions and has potential for broader diffusion-based sampling applications beyond high-density regions.

Abstract: Training-free diffusion guidance provides a flexible way to leverage off-the-shelf classifiers without additional training. Yet, current approaches hinge on posterior approximations via Tweedie’s formula, which often yield unreliable guidance, particularly in low-density regions. Stochastic optimal control (SOC), in contrast, provides principled posterior simulation but is prohibitively expensive for fast sampling. In this work, we reconcile the strengths of these paradigms by introducing Stein Diffusion Guidance (SDG), a novel training-free framework grounded in a surrogate SOC objective. We establish a theoretical bound on the value function, demonstrating the necessity of correcting approximate posteriors to faithfully reflect true diffusion dynamics. Leveraging Stein variational inference, SDG identifies the steepest descent direction that minimizes the Kullback-Leibler divergence between approximate and true posteriors. By incorporating a principled Stein correction mechanism and a novel running cost functional, SDG enables effective guidance in low-density regions. Experiments on molecular low-density sampling tasks suggest that SDG consistently surpasses standard training-free guidance methods, highlighting its potential for broader diffusion-based sampling beyond high-density regions.

[550] A Principled Loss Function for Direct Language Model Alignment

Yuandong Tan

Main category: cs.LG

TL;DR: The paper proposes a novel loss function for LLM alignment that addresses theoretical flaws in DPO by targeting finite logits differences instead of indefinite maximization, leading to better training stability and performance.

DetailsMotivation: DPO's loss function is theoretically misaligned with its derivation, promoting indefinite maximization that causes training instability and reward hacking. The authors aim to create a more stable alternative.

Method: Derived a new loss function directly from RLHF optimality conditions that targets specific finite values for logits differences based on underlying rewards, preventing large gradients when dispreferred response probabilities approach zero.

Result: Fine-tuned Qwen2.5-7B model showed significant win-rate improvements over standard DPO baseline and achieved competitive performance against larger models like Llama-3.1-8B.

Conclusion: The proposed method provides inherent stability that prevents reward hacking and leads to more effective alignment compared to DPO, with validated performance gains.

Abstract: The alignment of large language models (LLMs) with human preferences is commonly achieved through Reinforcement Learning from Human Feedback (RLHF). Direct Preference Optimization (DPO) simplified this paradigm by establishing a direct mapping between the optimal policy and a reward function, eliminating the need for an explicit reward model. However, we argue that the DPO loss function is theoretically misaligned with its own derivation, as it promotes the indefinite maximization of a logits difference, which can lead to training instability and reward hacking. In this paper, we propose a novel loss function derived directly from the RLHF optimality condition. Our proposed loss targets a specific, finite value for the logits difference, which is dictated by the underlying reward, rather than its maximization. We provide a theoretical analysis, including a gradient-based comparison, to demonstrate that our method avoids the large gradients that plague DPO when the probability of dispreferred responses approaches zero. This inherent stability prevents reward hacking and leads to more effective alignment. We validate our approach by fine-tuning a Qwen2.5-7B model, showing significant win-rate improvements over a standard DPO baseline and achieving competitive performance against larger models like Llama-3.1-8B.
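To make the contrast concrete, a hedged sketch: DPO's -log σ(β·margin) keeps rewarding ever-larger margins, whereas the proposed style of loss regresses the scaled margin onto a finite target. The exact target the paper derives from the RLHF optimality condition is not reproduced, so `target` is a placeholder input.

```python
import torch
import torch.nn.functional as F

def implicit_margin(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l):
    """Logits difference between chosen (w) and rejected (l) responses."""
    return (pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l)

def dpo_loss(m, beta=0.1):
    """DPO: gradients never vanish, pushing the margin toward +infinity."""
    return -F.logsigmoid(beta * m).mean()

def finite_target_loss(m, target, beta=0.1):
    """Proposed style: regress the scaled margin onto a finite, reward-dictated
    target, so gradients vanish once the target is reached."""
    return F.mse_loss(beta * m, target)

m = implicit_margin(torch.tensor([-12.0]), torch.tensor([-15.0]),
                    torch.tensor([-13.0]), torch.tensor([-14.5]))
print(dpo_loss(m), finite_target_loss(m, target=torch.tensor([0.2])))
```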

[551] ParallelTime: Dynamically Weighting the Balance of Short- and Long-Term Temporal Dependencies

Itay Katav, Aryeh Kontorovich

Main category: cs.LG

TL;DR: ParallelTime is a new architecture for multivariate time series forecasting that dynamically weights long-term and short-term dependencies using a ParallelTime Weighter mechanism, outperforming existing methods with better efficiency and scalability.

DetailsMotivation: Current approaches that equally weight long-term (Mamba) and short-term (attention) dependencies are suboptimal for time series forecasting, requiring a more adaptive weighting mechanism.

Method: Proposes ParallelTime architecture with a dynamic weighting mechanism (ParallelTime Weighter) that calculates interdependent weights for long-term and short-term dependencies for each token based on input and model knowledge.

Result: Achieves state-of-the-art performance across diverse benchmarks with lower FLOPs, fewer parameters, better scalability to longer prediction horizons, and significant performance improvements over existing methods.

Conclusion: ParallelTime demonstrates a promising path for future development of parallel Attention-Mamba architectures in time series forecasting, offering robust and efficient performance.

Abstract: Modern multivariate time series forecasting primarily relies on two architectures: the Transformer with attention mechanism and Mamba. In natural language processing, an approach has been used that combines local window attention for capturing short-term dependencies and Mamba for capturing long-term dependencies, with their outputs averaged to assign equal weight to both. We find that for time-series forecasting tasks, assigning equal weight to long-term and short-term dependencies is not optimal. To mitigate this, we propose a dynamic weighting mechanism, ParallelTime Weighter, which calculates interdependent weights for long-term and short-term dependencies for each token based on the input and the model’s knowledge. Furthermore, we introduce the ParallelTime architecture, which incorporates the ParallelTime Weighter mechanism to deliver state-of-the-art performance across diverse benchmarks. Our architecture demonstrates robustness, achieves lower FLOPs, requires fewer parameters, scales effectively to longer prediction horizons, and significantly outperforms existing methods. These advances highlight a promising path for future developments of parallel Attention-Mamba in time series forecasting. The implementation is readily available at: https://github.com/itay1551/ParallelTime
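One way to realize such a weighter is a small linear head that reads both branch outputs and emits per-token softmax weights, replacing a fixed 0.5/0.5 average; the paper's actual module may be richer. A minimal torch sketch:

```python
import torch
import torch.nn as nn

class ParallelTimeWeighter(nn.Module):
    """Per-token, input-dependent fusion of a short-term (attention) branch
    and a long-term (Mamba-style) branch."""
    def __init__(self, d_model):
        super().__init__()
        self.score = nn.Linear(2 * d_model, 2)   # one logit per branch
    def forward(self, h_short, h_long):          # both: (batch, seq, d_model)
        w = torch.softmax(self.score(torch.cat([h_short, h_long], -1)), dim=-1)
        return w[..., :1] * h_short + w[..., 1:] * h_long

fused = ParallelTimeWeighter(64)(torch.randn(2, 96, 64), torch.randn(2, 96, 64))
```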

[552] Merging Memory and Space: A State Space Neural Operator

Nodens F. Koren, Samuel Lanthaler

Main category: cs.LG

TL;DR: SS-NO is a compact neural operator architecture for time-dependent PDEs that extends state space models with adaptive damping and learnable frequency modulation, achieving SOTA performance with fewer parameters.

DetailsMotivation: To develop an efficient and compact architecture for learning solution operators of time-dependent PDEs that can capture long-range dependencies while maintaining parameter efficiency.

Method: Extends structured state space models (SSMs) to joint spatiotemporal modeling with two key mechanisms: adaptive damping for stabilizing learning and learnable frequency modulation for data-driven spectral selection. Also introduces a factorized variant for scalable 2D problems.

Result: Achieves state-of-the-art performance across diverse PDE benchmarks (1D Burgers’, Kuramoto-Sivashinsky, 2D Navier-Stokes, compressible Euler flows) while using significantly fewer parameters than competing approaches.

Conclusion: The effectiveness of damping and frequency learning in operator modeling is demonstrated, and lightweight factorization provides a complementary path toward efficient large-scale PDE learning.

Abstract: We propose the State Space Neural Operator (SS-NO), a compact architecture for learning solution operators of time-dependent partial differential equations (PDEs). Our formulation extends structured state space models (SSMs) to joint spatiotemporal modeling, introducing two key mechanisms: adaptive damping, which stabilizes learning by localizing receptive fields, and learnable frequency modulation, which enables data-driven spectral selection. These components provide a unified framework for capturing long-range dependencies with parameter efficiency. Theoretically, we establish connections between SSMs and neural operators, proving a universality theorem for convolutional architectures with full field-of-view. Empirically, SS-NO achieves state-of-the-art performance across diverse PDE benchmarks, including 1D Burgers’ and Kuramoto-Sivashinsky equations and 2D Navier-Stokes and compressible Euler flows, while using significantly fewer parameters than competing approaches. A factorized variant of SS-NO further demonstrates scalable performance on challenging 2D problems. Our results highlight the effectiveness of damping and frequency learning in operator modeling, while showing that lightweight factorization provides a complementary path toward efficient large-scale PDE learning.

[553] Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning

Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Kongcheng Zhang, Jiale Zhao, Jingwen Yang, Yihe Zhou, Jianwei Lv, Tongya Zheng, Hengtong Lu, Wei Chen, Yan Xie, Mingli Song

Main category: cs.LG

TL;DR: RuscaRL is a novel RL framework that uses checklist-style rubrics as instructional scaffolding to break the exploration bottleneck in LLM reasoning, enabling better sample generation and more effective training.

DetailsMotivation: Current RL approaches for LLMs face a dilemma where improvement requires high-quality samples, but exploration is limited by the models' inherent capabilities, creating a cycle where unexplored patterns cannot be learned.

Method: RuscaRL introduces rubrics as (1) explicit scaffolding during rollout generation to guide diverse high-quality responses, with gradual decay to encourage internalization, and (2) verifiable rewards during training using LLM-as-a-Judge scoring with rubrics as references.

Result: Significant performance improvements across benchmarks, boosting Qwen2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500 (surpassing GPT-4.1) and achieving 61.1 with Qwen3-30B-A3B-Instruct (outperforming OpenAI-o3).

Conclusion: RuscaRL effectively expands reasoning boundaries in LLMs by breaking the exploration bottleneck through rubric-based scaffolding, demonstrating superior performance on general reasoning tasks.

Abstract: Recent advances in Large Language Models (LLMs) have underscored the potential of Reinforcement Learning (RL) to facilitate the emergence of reasoning capabilities. Despite the encouraging results, a fundamental dilemma persists as RL improvement relies on learning from high-quality samples, yet the exploration for such samples remains bounded by the inherent limitations of LLMs. This, in effect, creates an undesirable cycle in which what cannot be explored cannot be learned. In this work, we propose Rubric-Scaffolded Reinforcement Learning (RuscaRL), a novel instructional scaffolding framework designed to break the exploration bottleneck for general LLM reasoning. Specifically, RuscaRL introduces checklist-style rubrics as (1) explicit scaffolding for exploration during rollout generation, where different rubrics are provided as external guidance within task instructions to steer diverse high-quality responses. This guidance is gradually decayed over time, encouraging the model to internalize the underlying reasoning patterns; (2) verifiable rewards for exploitation during model training, where we can obtain robust LLM-as-a-Judge scores using rubrics as references, enabling effective RL on general reasoning tasks. Extensive experiments demonstrate the superiority of the proposed RuscaRL across various benchmarks, effectively expanding reasoning boundaries under the Best-of-N evaluation. Notably, RuscaRL significantly boosts Qwen2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500, surpassing GPT-4.1. Furthermore, our fine-tuned variant on Qwen3-30B-A3B-Instruct achieves 61.1 on HealthBench-500, outperforming leading LLMs including OpenAI-o3. Our code is available at https://github.com/IANNXANG/RuscaRL.
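
A minimal sketch of the two rubric roles, scaffolding that decays over training and rubric-referenced scoring, is shown below. The decay schedule and the keyword-matching scorer are stand-ins (the paper uses an LLM-as-a-Judge with the rubric as reference).

```python
import random

def scaffold_prob(step: int, total_steps: int, p0: float = 1.0) -> float:
    """Linearly decay the chance of showing the rubric in the prompt,
    so the model gradually internalizes the reasoning pattern."""
    return max(0.0, p0 * (1.0 - step / total_steps))

def build_prompt(question: str, rubric: str, step: int, total_steps: int) -> str:
    prompt = question
    if random.random() < scaffold_prob(step, total_steps):
        prompt += "\n\nFollow this checklist:\n" + rubric
    return prompt

def rubric_reward(response: str, rubric_items: list[str]) -> float:
    """Stand-in for the LLM-as-a-Judge scorer: here, the fraction of
    checklist items literally mentioned in the response."""
    hits = sum(item.lower() in response.lower() for item in rubric_items)
    return hits / len(rubric_items)

items = ["cite evidence", "state risks", "give a recommendation"]
print(build_prompt("Advise on dosage.", "- " + "\n- ".join(items), step=10, total_steps=100))
print(rubric_reward("We cite evidence and state risks.", items))  # 0.666...
```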

[554] Discovery Learning accelerates battery design evaluation

Jiawei Zhang, Yifei Zhang, Baozhao Yi, Yao Ren, Qi Jiao, Hanyu Bai, Weiran Jiang, Ziyou Song

Main category: cs.LG

TL;DR: Discovery Learning (DL) is a scientific machine-learning paradigm that integrates active learning, physics-guided learning, and zero-shot learning to enable rapid battery lifetime evaluation without requiring prototyping or additional data labeling.

DetailsMotivation: Battery research and development are bottlenecked by high time and energy costs for evaluating new designs through prototyping and life testing. Existing data-driven methods require labeled data of target designs and cannot make reliable predictions until after prototyping.

Method: DL integrates active learning, physics-guided learning, and zero-shot learning into a human-like reasoning loop inspired by educational psychology learning theories. It learns from historical battery designs and actively reduces the need for prototyping.

Result: Tested on 123 industrial-grade large-format lithium-ion pouch cells spanning eight material-design combinations, DL achieved 7.2% test error in predicting average cycle life under unknown device variability, using only public datasets of small-capacity cylindrical cells for training.

Conclusion: DL enables 98% time savings and 95% energy savings compared to industrial practices, representing a key advance toward efficient data-driven modeling for accelerating battery technology development and scientific discovery.

Abstract: Fast and reliable validation of novel designs in complex physical systems such as batteries is critical to accelerating technological innovation. However, battery research and development remain bottlenecked by the prohibitively high time and energy costs required to evaluate numerous new design candidates, particularly in battery prototyping and life testing. Despite recent progress in data-driven battery lifetime prediction, existing methods require labeled data of target designs to improve accuracy and cannot make reliable predictions until after prototyping, thus falling far short of the efficiency needed to enable rapid feedback for battery design. Here, we introduce Discovery Learning (DL), a scientific machine-learning paradigm that integrates active learning, physics-guided learning, and zero-shot learning into a human-like reasoning loop, drawing inspiration from learning theories in educational psychology. DL can learn from historical battery designs and actively reduce the need for prototyping, thus enabling rapid lifetime evaluation for unobserved material-design combinations without requiring additional data labeling. To test DL, we present 123 industrial-grade large-format lithium-ion pouch cells, spanning eight material-design combinations and diverse cycling protocols. Trained solely on public datasets of small-capacity cylindrical cells, DL achieves 7.2% test error in predicting the average cycle life under unknown device variability. This results in savings of 98% in time and 95% in energy compared to industrial practices. This work highlights the potential of uncovering insights from historical designs to inform and accelerate the development of next-generation battery technologies. DL represents a key advance toward efficient data-driven modeling and helps realize the promise of machine learning for accelerating scientific discovery and engineering innovation.

[555] Data-Augmented Few-Shot Neural Emulator for Computer-Model System Identification

Sanket Jantre, Deepak Akhare, Zhiyuan Wang, Xiaoning Qian, Nathan M. Urban

Main category: cs.LG

TL;DR: The paper proposes a sample-efficient data-augmentation strategy for training neural PDEs using space-filling sampling of local stencil states instead of traditional trajectory rollouts, achieving better performance with significantly less data.

DetailsMotivation: Traditional neural PDE training uses trajectory data from long-horizon solver rollouts, which contains spatiotemporal redundancy and undersamples rare but important states. This approach aims to improve sample efficiency and generalization.

Method: Space-filling sampling of local stencil states from computer models, removing redundancy in trajectory data. The method can work with as little as 10 timesteps’ worth of simulation data, optionally enhanced with a single full-trajectory simulation.

Result: The approach produces more accurate neural stencil operators that outperform traditional ML emulators trained on thousands of trajectories, achieving better long-horizon rollout accuracy and stability across several PDE systems.

Conclusion: Data-augmented stencil sampling provides a highly efficient alternative to trajectory-based training for neural PDEs, enabling accurate models with minimal computational resources while improving generalization across the state space.

Abstract: Partial differential equations (PDEs) underpin the modeling of many natural and engineered systems. It can be convenient to express such models as neural PDEs rather than using traditional numerical PDE solvers by replacing part or all of the PDE’s governing equations with a neural network representation. Neural PDEs are often easier to differentiate, linearize, reduce, or use for uncertainty quantification than the original numerical solver. They are usually trained on solution trajectories obtained by long-horizon rollout of the PDE solver. Here we propose a more sample-efficient data-augmentation strategy for generating neural PDE training data from a computer model by space-filling sampling of local “stencil” states. This approach removes a large degree of spatiotemporal redundancy present in trajectory data and oversamples states that may be rarely visited but help the neural PDE generalize across the state space. We demonstrate that accurate neural PDE stencil operators can be learned from synthetic training data generated by the computational equivalent of 10 timesteps’ worth of numerical simulation. Accuracy is further improved if we assume access to a single full-trajectory simulation from the computer model, which is typically available in practice. Across several PDE systems, we show that our data-augmented stencil data yield better trained neural stencil operators, with clear performance gains compared with naively sampled stencil data from simulation trajectories. Finally, with only 10 solver steps’ worth of augmented stencil data, our approach outperforms traditional ML emulators trained on thousands of trajectories in long-horizon rollout accuracy and stability.
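
The sampling idea reduces to a few lines: draw stencil states with a space-filling design and label each with a single solver step, rather than storing long rollouts. The explicit Burgers' update below is a stand-in for whatever computer model is being emulated, and uniform random sampling stands in for a proper space-filling design.

```python
import numpy as np

def burgers_step(stencil: np.ndarray, dx: float, dt: float, nu: float) -> np.ndarray:
    """One explicit finite-difference step of 1D viscous Burgers' on a
    3-point stencil (u_left, u_center, u_right) -> new u_center.
    Stands in for the black-box computer model queried per stencil."""
    ul, uc, ur = stencil[:, 0], stencil[:, 1], stencil[:, 2]
    adv = -uc * (ur - ul) / (2 * dx)
    diff = nu * (ur - 2 * uc + ul) / dx**2
    return uc + dt * (adv + diff)

rng = np.random.default_rng(0)
# Sample local stencil states directly, instead of long trajectory rollouts.
X = rng.uniform(-1.0, 1.0, size=(4096, 3))      # sampled stencil states
y = burgers_step(X, dx=0.01, dt=1e-4, nu=0.01)  # solver labels, one step each
print(X.shape, y.shape)  # (4096, 3) (4096,)
```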

[556] Causal Structure Learning in Hawkes Processes with Complex Latent Confounder Networks

Songyao Jin, Biwei Huang

Main category: cs.LG

TL;DR: This paper addresses the challenge of latent subprocesses in multivariate Hawkes processes by establishing identifiability conditions and proposing a two-phase iterative algorithm that alternates between causal inference and latent subprocess discovery.

DetailsMotivation: Real-world systems often have latent subprocesses that are not directly observed, posing significant challenges for existing methods that primarily focus on causal structures among observed subprocesses only.

Method: The authors show that continuous-time event sequences can be represented by discrete-time causal models as time intervals shrink. They propose a two-phase iterative algorithm that alternates between inferring causal relationships among discovered subprocesses and uncovering new latent subprocesses, guided by path-based identifiability conditions.

Result: Experiments on synthetic and real-world datasets demonstrate that the proposed method effectively recovers causal structures even in the presence of latent subprocesses.

Conclusion: The paper provides necessary and sufficient conditions for identifying latent subprocesses and causal influences, offering a practical solution for modeling complex systems with partial observability.

Abstract: The multivariate Hawkes process provides a powerful framework for modeling temporal dependencies and event-driven interactions in complex systems. While existing methods primarily focus on uncovering causal structures among observed subprocesses, real-world systems are often only partially observed, with latent subprocesses posing significant challenges. In this paper, we show that continuous-time event sequences can be represented by a discrete-time causal model as the time interval shrinks, and we leverage this insight to establish necessary and sufficient conditions for identifying latent subprocesses and their causal influences. Accordingly, we propose a two-phase iterative algorithm that alternates between inferring causal relationships among discovered subprocesses and uncovering new latent subprocesses, guided by path-based conditions that guarantee identifiability. Experiments on both synthetic and real-world datasets show that our method effectively recovers causal structures despite the presence of latent subprocesses.

[557] MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models

Haoyu He, Katrin Renz, Yong Cao, Andreas Geiger

Main category: cs.LG

TL;DR: The paper addresses the training-inference discrepancy in Masked Diffusion Language Models (MDLMs) by proposing MDPO, a reinforcement learning approach that aligns training with inference schedules, achieving significant performance improvements with fewer gradient updates.

DetailsMotivation: MDLMs suffer from a key discrepancy where training masks tokens randomly while inference progressively reveals sequence structure, leading to suboptimal performance. This gap has been overlooked in previous works.

Method: The authors frame denoising trajectory learning as a sequential decision-making problem and propose Masked Diffusion Policy Optimization (MDPO) to explicitly train under inference schedules. They also introduce Running Confidence Remasking (RCR) as a plug-in inference method for flexible token refinement.

Result: MDPO matches SOTA performance with 60x fewer gradient updates and achieves average improvements of 9.6% on MATH500 and 54.2% on Countdown over SOTA when trained with the same number of updates. RCR further enhances performance when combined with MDPO.

Conclusion: The work demonstrates significant potential for addressing training-inference discrepancies in MDLMs, establishing new state-of-the-art results with more efficient training and improved inference strategies.

Abstract: Diffusion language models, as a promising alternative to traditional autoregressive (AR) models, enable faster generation and richer conditioning on bidirectional context. However, they suffer from a key discrepancy between training and inference: during inference, MDLMs progressively reveal the structure of the generated sequence by producing fewer and fewer masked tokens, whereas this structure is ignored in training as tokens are masked at random. Although this discrepancy between training and inference can lead to suboptimal performance, it has been largely overlooked by previous works, leaving closing this gap between the two stages an open problem. To address this, we frame the problem of learning effective denoising trajectories as a sequential decision-making problem and use the resulting framework to apply reinforcement learning. We propose a novel Masked Diffusion Policy Optimization (MDPO) to exploit the Markov property diffusion possesses and explicitly train the model under the same progressive refining schedule used at inference. MDPO matches the performance of the previous state-of-the-art (SOTA) method with 60x fewer gradient updates, while achieving average improvements of 9.6% on MATH500 and 54.2% on Countdown over SOTA when trained within the same number of weight updates. Additionally, we improve the remasking strategy of MDLMs as a plug-in inference replacement to overcome the limitation that the model cannot refine tokens flexibly. This training-free method, termed Running Confidence Remasking (RCR), consistently enhances performance and provides further improvements when used with MDPO. Our findings establish great potential for investigating the discrepancy between pre-training and inference of MDLMs. Code: https://github.com/autonomousvision/mdpo. Project Page: https://cli212.github.io/MDPO/.
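
A sketch of the RCR inference step is given below: maintain a running (EMA) confidence per position and remask the least-confident positions so they can be refined later. The EMA rule, function names, and keep-count are assumptions for illustration.

```python
import torch

def rcr_step(logits: torch.Tensor, running_conf: torch.Tensor,
             n_keep: int, mask_id: int, momentum: float = 0.9):
    """One sketch step of Running Confidence Remasking: track an EMA of each
    position's confidence, keep the n_keep most confident predictions, and
    remask the rest so later steps can refine them."""
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)                      # per-position confidence
    running_conf = momentum * running_conf + (1 - momentum) * conf
    tokens = pred.clone()
    low = torch.argsort(running_conf)[: tokens.numel() - n_keep]
    tokens[low] = mask_id                               # remask low-confidence positions
    return tokens, running_conf

L, V, MASK = 16, 100, 99
logits = torch.randn(L, V)
tokens, running = rcr_step(logits, torch.zeros(L), n_keep=4, mask_id=MASK)
print((tokens != MASK).sum().item())  # 4 positions kept this step
```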

[558] The Lifecycle Principle: Stabilizing Dynamic Neural Networks with State Memory

Zichuan Yang

Main category: cs.LG

TL;DR: The paper proposes the Lifecycle (LC) principle, a regularization method that deactivates neurons long-term and uses state memory to restore their parameters when revived, preventing training instability and improving generalization.

DetailsMotivation: To address the training instability caused by long-term neuron deactivation methods, which differ from temporary approaches like Dropout by introducing severe optimization shocks when neurons are revived with random weights.

Method: The LC principle uses state memory to preserve and restore a neuron’s last known effective parameters when it is revived, rather than re-initializing it with random weights. This avoids destructive optimization shocks and smooths the loss landscape.

Result: Experiments on image classification benchmarks show that the method improves generalization and robustness. Theoretical analysis indicates it guides optimization towards flatter minima.

Conclusion: The Lifecycle principle with state memory is essential for achieving stable training and better generalization in long-term neuron deactivation regularization methods.

Abstract: I investigate a stronger form of regularization by deactivating neurons for extended periods, a departure from the temporary changes of methods like Dropout. However, this long-term dynamism introduces a critical challenge: severe training instability when neurons are revived with random weights. To solve this, I propose the Lifecycle (LC) principle, a regularization mechanism centered on a key innovation: state memory. Instead of re-initializing a revived neuron, my method restores its parameters to their last known effective state. This process preserves learned knowledge and avoids destructive optimization shocks. My theoretical analysis reveals that the LC principle smooths the loss landscape, guiding optimization towards flatter minima associated with better generalization. Experiments on image classification benchmarks demonstrate that my method improves generalization and robustness. Crucially, ablation studies confirm that state memory is essential for achieving these gains.
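
The state-memory mechanism is simple to sketch: save a unit's parameters when it is deactivated and restore them, rather than re-initializing, when it is revived. The layer below is an illustrative toy, not the author's code.

```python
import torch
import torch.nn as nn

class LifecycleLinear(nn.Module):
    """Linear layer whose output units can be deactivated long-term and later
    revived from state memory (their last effective weights) instead of being
    re-initialized. A sketch of the LC principle."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.active = torch.ones(d_out, dtype=torch.bool)
        self.memory = {}  # unit index -> saved (weight_row, bias) state

    def deactivate(self, unit: int):
        self.memory[unit] = (self.fc.weight.data[unit].clone(),
                             self.fc.bias.data[unit].clone())
        self.active[unit] = False

    def revive(self, unit: int):
        w, b = self.memory.pop(unit)  # restore, don't re-initialize
        self.fc.weight.data[unit] = w
        self.fc.bias.data[unit] = b
        self.active[unit] = True

    def forward(self, x):
        return self.fc(x) * self.active.to(x.dtype)

layer = LifecycleLinear(8, 4)
layer.deactivate(2)                # unit 2 silenced, parameters remembered
y = layer(torch.randn(1, 8))
layer.revive(2)                    # revived with its last effective state
print(y)
```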

[559] Aligning Distributionally Robust Optimization with Practical Deep Learning Needs

Dmitrii Feoktistov, Igor Ignashin, Andrey Veprikov, Nikita Borovko, Alexander Bogdanov, Savelii Chezhegov, Aleksandr Beznosikov

Main category: cs.LG

TL;DR: ALSO (Adaptive Loss Scaling Optimizer) is a novel adaptive algorithm that bridges the gap between Distributionally Robust Optimization (DRO) and modern DL practices by enabling adaptive weight assignment to sample groups and handling stochastic gradients.

DetailsMotivation: Traditional DRO methods don't align with modern DL optimizers' need for adaptivity and stochastic gradient handling, and lack the ability to assign weights to groups of samples rather than just individual samples.

Method: ALSO modifies the DRO objective to allow weight assignment to sample groups and provides an adaptive algorithm that can handle stochastic gradients. The algorithm is proven to converge for non-convex objectives typical in DL.

Result: Empirical evaluation across diverse DL tasks (Tabular DL to Split Learning) shows that ALSO outperforms both traditional optimizers and existing DRO methods.

Conclusion: ALSO successfully bridges the gap between DRO theory and practical DL optimization, providing adaptive group-wise weight assignment with proven convergence guarantees and superior empirical performance.

Abstract: While traditional Deep Learning (DL) optimization methods treat all training samples equally, Distributionally Robust Optimization (DRO) adaptively assigns importance weights to different samples. However, a significant gap exists between DRO and current DL practices. Modern DL optimizers require adaptivity and the ability to handle stochastic gradients, as these methods demonstrate superior performance. Additionally, for practical applications, a method should allow weight assignment not only to individual samples, but also to groups of objects (for example, all samples of the same class). This paper aims to bridge this gap by introducing ALSO (Adaptive Loss Scaling Optimizer), an adaptive algorithm for a modified DRO objective that can handle weight assignment to sample groups. We prove the convergence of our proposed algorithm for non-convex objectives, which is the typical case for DL models. Empirical evaluation across diverse Deep Learning tasks, from Tabular DL to Split Learning tasks, demonstrates that ALSO outperforms both traditional optimizers and existing DRO methods.
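
For intuition, here is a generic group-DRO step of the kind ALSO builds on: average per-sample losses within each group, upweight hard groups multiplicatively, and backpropagate the weighted objective. ALSO's modified objective and adaptive update differ in detail, so treat this as a sketch.

```python
import torch

def group_dro_step(losses: torch.Tensor, group_ids: torch.Tensor,
                   group_w: torch.Tensor, eta: float = 0.1):
    """One DRO-style reweighting step over groups of samples. Generic
    exponentiated-gradient scheme, not ALSO's exact algorithm."""
    n_groups = group_w.numel()
    group_loss = torch.stack([losses[group_ids == g].mean() for g in range(n_groups)])
    with torch.no_grad():
        group_w = group_w * torch.exp(eta * group_loss)  # upweight hard groups
        group_w = group_w / group_w.sum()
    return (group_w * group_loss).sum(), group_w

losses = torch.rand(12, requires_grad=True)  # per-sample losses from the model
groups = torch.tensor([0, 1, 2] * 4)         # e.g., class labels as groups
w = torch.full((3,), 1 / 3)
obj, w = group_dro_step(losses, groups, w)
obj.backward()                               # gradients flow to the model
print(w)
```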

[560] On Entropy Control in LLM-RL Algorithms

Han Shen

Main category: cs.LG

TL;DR: AEnt is a new entropy control method for LLM-RL training that addresses issues with conventional entropy regularization in large language models by using clamped entropy bonus with automatic coefficient adjustment.

DetailsMotivation: Conventional entropy regularization used in RL algorithms like PPO, SAC, and A3C works well in robotics and games but gives weak to no gains in LLM-RL training due to LLM's extremely large response space and sparsity of optimal outputs.

Method: Proposes AEnt with clamped entropy bonus evaluated on re-normalized policy defined on smaller token space, encouraging exploration within compact response sets. Automatically adjusts entropy coefficient based on clamped entropy value to control entropy-induced bias while leveraging benefits.

Result: AEnt outperforms baselines consistently across multiple benchmarks in math-reasoning tasks under different base models and datasets.

Conclusion: AEnt effectively addresses entropy control challenges in LLM-RL settings by adapting entropy regularization to handle large response spaces and sparse optimal outputs.

Abstract: For RL algorithms, appropriate entropy control is crucial to their effectiveness. To control the policy entropy, a commonly used method is entropy regularization, which is adopted in various popular RL algorithms including PPO, SAC and A3C. Although entropy regularization has conventionally proved effective in robotics and game RL, studies have found that it gives weak to no gains in LLM-RL training. In this work, we study the issues of entropy bonus in the LLM-RL setting. Specifically, we first argue that the conventional entropy regularization suffers from the LLM’s extremely large response space and the sparsity of the optimal outputs. As a remedy, we propose AEnt, an entropy control method that utilizes a new clamped entropy bonus with an automatically adjusted coefficient. The clamped entropy is evaluated with the re-normalized policy defined on a certain smaller token space, which encourages exploration within a more compact response set. In addition, the algorithm automatically adjusts the entropy coefficient according to the clamped entropy value, effectively controlling the entropy-induced bias while leveraging the entropy’s benefits. AEnt is tested on math-reasoning tasks under different base models and datasets, and it is observed that AEnt outperforms the baselines consistently across multiple benchmarks.
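
A sketch of the clamped entropy bonus: re-normalize the policy on a smaller token space (here, the top-k tokens) and cap the resulting entropy. The top-k choice and cap value are illustrative assumptions; AEnt additionally adapts the bonus coefficient from this value.

```python
import torch

def clamped_topk_entropy(logits: torch.Tensor, k: int = 20, cap: float = 2.0) -> torch.Tensor:
    """Entropy of the policy re-normalized on its top-k tokens, clamped at a
    cap. Illustrative; AEnt's token space and coefficient rule may differ."""
    topk = torch.topk(logits, k, dim=-1).values
    p = torch.softmax(topk, dim=-1)                # re-normalized on a smaller token space
    ent = -(p * torch.log(p + 1e-12)).sum(dim=-1)  # per-position entropy
    return torch.clamp(ent, max=cap).mean()

logits = torch.randn(4, 32, 50000)  # (batch, seq, vocab)
bonus = clamped_topk_entropy(logits)
print(bonus)                        # add coef * bonus to the policy objective
```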

[561] One-Embedding-Fits-All: Efficient Zero-Shot Time Series Forecasting by a Model Zoo

Hao-Nan Shi, Ting-Ji Huang, Lu Han, De-Chuan Zhan, Han-Jia Ye

Main category: cs.LG

TL;DR: ZooCast is a framework that intelligently assembles multiple Time Series Foundation Models (TSFMs) into a model zoo, dynamically selecting optimal models for different forecasting tasks using a unified embedding representation.

DetailsMotivation: Different TSFMs excel at different temporal patterns, but no single model performs best universally. The authors aim to leverage the complementary abilities of multiple TSFMs to improve zero-shot forecasting performance.

Method: Proposes ZooCast with a One-Embedding-Fits-All paradigm that creates a unified representation space where each model is represented by a single embedding, enabling efficient similarity matching for task selection.

Result: ZooCast demonstrates strong performance on the GIFT-Eval zero-shot forecasting benchmark while maintaining the efficiency of a single TSFM. It also seamlessly integrates new models with negligible overhead.

Conclusion: The framework effectively leverages complementary TSFM capabilities through intelligent model selection, achieving progressive accuracy gains in real-world scenarios with sequential model releases.

Abstract: The proliferation of Time Series Foundation Models (TSFMs) has significantly advanced zero-shot forecasting, enabling predictions for unseen time series without task-specific fine-tuning. Extensive research has confirmed that no single TSFM excels universally, as different models exhibit preferences for distinct temporal patterns. This diversity suggests an opportunity: taking advantage of the complementary abilities of TSFMs. To this end, we propose ZooCast, which characterizes each model’s distinct forecasting strengths. ZooCast can intelligently assemble current TSFMs into a model zoo that dynamically selects optimal models for different forecasting tasks. Our key innovation lies in the One-Embedding-Fits-All paradigm that constructs a unified representation space where each model in the zoo is represented by a single embedding, enabling efficient similarity matching for all tasks. Experiments demonstrate ZooCast’s strong performance on the GIFT-Eval zero-shot forecasting benchmark while maintaining the efficiency of a single TSFM. In real-world scenarios with sequential model releases, the framework seamlessly adds new models for progressive accuracy gains with negligible overhead.
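
The selection mechanics reduce to nearest-neighbor search in the shared embedding space. In the sketch below the embeddings are random stand-ins (learning them is the paper's contribution); only the routing step is shown.

```python
import numpy as np

# One embedding per model in the zoo (One-Embedding-Fits-All); random
# stand-ins here, just to show the selection mechanics.
zoo = {"tsfm_a": np.random.randn(64), "tsfm_b": np.random.randn(64),
       "tsfm_c": np.random.randn(64)}

def pick_model(task_emb: np.ndarray, zoo: dict) -> str:
    """Route a forecasting task to the model with the most similar embedding."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(zoo, key=lambda name: cos(task_emb, zoo[name]))

task_emb = np.random.randn(64)    # embedding of the incoming series/task
print(pick_model(task_emb, zoo))  # only the selected TSFM runs, keeping cost low
```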

[562] Data-Efficient Time-Dependent PDE Surrogates: Graph Neural Simulators vs. Neural Operators

Dibyajyoti Nayak, Somdatta Goswami

Main category: cs.LG

TL;DR: Graph Neural Simulators (GNS) propose a graph-based surrogate modeling paradigm for PDEs that outperforms neural operators in data efficiency and long-term stability by leveraging local structure and numerical time-stepping schemes.

DetailsMotivation: Neural operators have limitations in data efficiency and generalization due to their global processing approach, which fails to exploit the local structure of physical systems governed by PDEs.

Method: GNS uses message-passing combined with numerical time-stepping to learn PDE dynamics by modeling instantaneous time derivatives, mimicking traditional numerical solvers for stable long-horizon rollouts.

Result: GNS achieves less than 1% relative L2 error using only 3% of available trajectories, with 82.5% lower autoregressive error than FNO and 99.9% lower than DeepONet across four canonical PDE systems.

Conclusion: GNS with its graph-based locality and solver-inspired design is the most suitable and scalable surrogate modeling framework for AI-driven scientific discovery.

Abstract: Developing accurate, data-efficient surrogate models is central to advancing AI for Science. Neural operators (NOs), which approximate mappings between infinite-dimensional function spaces using conventional neural architectures, have gained popularity as surrogates for systems driven by partial differential equations (PDEs). However, their reliance on large datasets and limited ability to generalize in low-data regimes hinder their practical utility. We argue that these limitations arise from their global processing of data, which fails to exploit the local, discretized structure of physical systems. To address this, we propose Graph Neural Simulators (GNS) as a principled surrogate modeling paradigm for time-dependent PDEs. GNS leverages message-passing combined with numerical time-stepping schemes to learn PDE dynamics by modeling the instantaneous time derivatives. This design mimics traditional numerical solvers, enabling stable long-horizon rollouts and strong inductive biases that enhance generalization. We rigorously evaluate GNS on four canonical PDE systems: (1) 2D scalar Burgers’, (2) 2D coupled Burgers’, (3) 2D Allen-Cahn, and (4) 2D nonlinear shallow-water equations, comparing against state-of-the-art NOs including Deep Operator Network (DeepONet) and Fourier Neural Operator (FNO). Results demonstrate that GNS is markedly more data-efficient, achieving less than 1% relative L2 error using only 3% of available trajectories, and exhibits dramatically reduced error accumulation over time (82.5% lower autoregressive error than FNO, 99.9% lower than DeepONet). To choose the training data, we introduce a trajectory selection strategy combining PCA with KMeans. These findings provide compelling evidence that GNS, with its graph-based locality and solver-inspired design, is the most suitable and scalable surrogate modeling framework for AI-driven scientific discovery.
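
The solver-mimicking design can be sketched in a few lines: a network predicts the instantaneous derivative du/dt from local neighborhoods, and a numerical scheme advances the state. A small MLP over 3-point stencils stands in for GNS's message passing, and forward Euler stands in for the time-stepping scheme.

```python
import torch
import torch.nn as nn

# Stand-in for a message-passing network: an MLP over local 3-point
# neighborhoods, predicting the instantaneous time derivative du/dt.
deriv_net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 1))

def predicted_derivative(u: torch.Tensor) -> torch.Tensor:
    # Periodic 3-point neighborhoods, the "messages" a GNN would pass.
    stencils = torch.stack([torch.roll(u, 1), u, torch.roll(u, -1)], dim=-1)
    return deriv_net(stencils).squeeze(-1)

def rollout(u0: torch.Tensor, dt: float, n_steps: int) -> torch.Tensor:
    u = u0
    for _ in range(n_steps):
        u = u + dt * predicted_derivative(u)  # forward-Euler time stepping
    return u

u0 = torch.sin(torch.linspace(0, 6.283, 128))
print(rollout(u0, dt=1e-3, n_steps=10).shape)  # torch.Size([128])
```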

[563] One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning

Yuan Pu, Yazhe Niu, Jia Tang, Junyu Xiong, Shuai Hu, Hongsheng Li

Main category: cs.LG

TL;DR: ScaleZero addresses gradient conflicts and model plasticity loss in heterogeneous multi-task decision-making by combining Mixture-of-Experts architecture with Dynamic Parameter Scaling strategy, achieving competitive performance with single-task agents using fewer environment interactions.

DetailsMotivation: Conventional multi-task world models struggle with gradient conflicts and loss of plasticity when handling diverse tasks with varying complexities, observation spaces, and action spaces.

Method: Proposes ScaleZero with two key components: 1) Mixture-of-Experts architecture to route task-specific representations and mitigate gradient conflicts, 2) Dynamic Parameter Scaling strategy that progressively integrates LoRA adapters based on task-specific learning progress.

Result: ScaleZero performs on par with specialized single-task agents on diverse benchmarks (Atari, DMC, Jericho) using only online reinforcement learning, and achieves competitive performance with just 71.5% of environment interactions when using DPS.

Conclusion: ScaleZero demonstrates effective multi-task planning capabilities through its architectural design and dynamic parameter allocation strategy, showing potential for handling heterogeneous multi-task decision-making efficiently.

Abstract: In heterogeneous multi-task decision-making, tasks not only exhibit diverse observation and action spaces but also vary substantially in their underlying complexities. While conventional multi-task world models like UniZero excel in single-task settings, we find that when handling a broad and diverse suite of tasks, gradient conflicts and the loss of model plasticity often constrain their sample efficiency. In this work, we address these challenges from two complementary perspectives: the single learning iteration and the overall learning process. First, to mitigate the gradient conflicts, we systematically investigate key architectural designs for extending UniZero. Our investigation identifies a Mixture-of-Experts (MoE) architecture as the most effective approach. We demonstrate, both theoretically and empirically, that this architecture alleviates gradient conflicts by routing task-specific representations to specialized sub-networks. This finding leads to our proposed model, ScaleZero. Second, to dynamically allocate model capacity throughout the learning process, we introduce an online Dynamic Parameter Scaling (DPS) strategy. This strategy progressively integrates LoRA adapters in response to task-specific progress, enabling adaptive knowledge retention and parameter expansion. Evaluations on a diverse set of standard benchmarks (Atari, DMC, Jericho) demonstrate that ScaleZero, utilizing solely online reinforcement learning with one model, performs on par with specialized single-task agents. With the DPS strategy, it remains competitive while using just 71.5% of the environment interactions. These findings underscore the potential of ScaleZero for effective multi-task planning. Our code is available at https://github.com/opendilab/LightZero.
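
A sketch of the Dynamic Parameter Scaling idea: keep the base weights frozen and append zero-initialized LoRA adapters on demand, so capacity grows without disturbing what has been learned. The growth trigger (task progress) is left as a comment; everything here is illustrative, not ScaleZero's implementation.

```python
import torch
import torch.nn as nn

class GrowingLoRALinear(nn.Module):
    """Frozen base linear layer plus LoRA adapters added on demand: grow
    capacity when a task's learning progress stalls, keeping earlier
    knowledge intact. Illustrative sketch of the DPS idea."""
    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)
        self.adapters = nn.ModuleList()
        self.rank, self.d_in, self.d_out = rank, d_in, d_out

    def grow(self):
        down = nn.Linear(self.d_in, self.rank, bias=False)
        up = nn.Linear(self.rank, self.d_out, bias=False)
        nn.init.zeros_(up.weight)  # new adapter starts as a no-op
        self.adapters.append(nn.Sequential(down, up))

    def forward(self, x):
        return self.base(x) + sum(a(x) for a in self.adapters)

layer = GrowingLoRALinear(32, 32)
x = torch.randn(4, 32)
y0 = layer(x)
layer.grow()                        # e.g., when task progress plateaus
print(torch.allclose(y0, layer(x)))  # True: growth is non-disruptive
```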

[564] APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation

Yuzhen Zhou, Jiajun Li, Yusheng Su, Gowtham Ramesh, Zilin Zhu, Xiang Long, Chenyang Zhao, Jin Pan, Xiaodong Yu, Ze Wang, Kangrui Du, Jialian Wu, Ximeng Sun, Jiang Liu, Qiaolin Yu, Hao Chen, Zicheng Liu, Emad Barsoum

Main category: cs.LG

TL;DR: APRIL (Active Partial Rollouts in Reinforcement Learning) addresses the computational inefficiency in RL training caused by long-tail distribution of rollout response lengths, improving throughput by up to 44% and final accuracy by up to 8%.

DetailsMotivation: RL training is computationally expensive with rollout generation accounting for over 90% of runtime, and efficiency is constrained by long-tail distribution where lengthy responses stall entire batches, leaving GPUs idle and underutilized.

Method: APRIL over-provisions rollout requests, terminates once the target number of responses is reached, and recycles incomplete responses for continuation in future steps, ensuring no rollouts are discarded while reducing GPU idle time.

Result: APRIL improves rollout throughput by up to 44% across RL algorithms (GRPO, DAPO, GSPO), accelerates convergence, and achieves up to 8% higher final accuracy across tasks. It is framework- and hardware-agnostic.

Conclusion: APRIL unifies system-level and algorithmic considerations to advance RL training efficiency, with potential to inspire further optimizations in RL systems.

Abstract: Reinforcement learning (RL) has become a cornerstone in advancing large-scale pre-trained language models (LLMs). Successive generations, including GPT-o series, DeepSeek-R1, Kimi-K1.5, Grok 4, and GLM-4.5, have relied on large-scale RL training to enhance reasoning and coding capabilities. To meet the community’s growing RL needs, numerous RL frameworks have been proposed. However, RL training remains computationally expensive, with rollout generation accounting for more than 90% of total runtime. In addition, its efficiency is often constrained by the long-tail distribution of rollout response lengths, where a few lengthy responses stall entire batches, leaving GPUs idle and underutilized. As model and rollout sizes continue to grow, this bottleneck increasingly limits scalability. To address this challenge, we propose Active Partial Rollouts in Reinforcement Learning (APRIL), which mitigates long-tail inefficiency. In the rollout phase, APRIL over-provisions rollout requests, terminates once the target number of responses is reached, and recycles incomplete responses for continuation in future steps. This strategy ensures that no rollouts are discarded while substantially reducing GPU idle time. Experiments show that APRIL improves rollout throughput by at most 44% across commonly used RL algorithms (GRPO, DAPO, GSPO), accelerates convergence, and achieves at most 8% higher final accuracy across tasks. Moreover, APRIL is both framework and hardware agnostic, already integrated into the slime RL framework, and deployable on NVIDIA and AMD GPUs alike. Taken together, this work unifies system-level and algorithmic considerations in proposing APRIL, with the aim of advancing RL training efficiency and inspiring further optimizations in RL systems. Our codebase is available at https://github.com/RLsys-Foundation/APRIL
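
The rollout-recycling loop is easy to sketch. In the toy scheduler below, generate_chunk is an assumed engine call that advances one response by a chunk of tokens; over-provisioning, early termination at the target count, and carrying incomplete rollouts forward are the APRIL ingredients.

```python
import random
from collections import deque

def april_rollout_phase(requests: int, target: int, budget_tokens: int,
                        pending: deque, generate_chunk):
    """Sketch of one APRIL rollout phase: over-provision `requests` prompts,
    stop once `target` responses finish, and carry unfinished generations
    over in `pending` for continuation next step."""
    active = [pending.popleft() for _ in range(min(len(pending), requests))]
    active += [{"text": "", "tokens": 0} for _ in range(requests - len(active))]
    finished = []
    while active and len(finished) < target:
        state = active.pop(0)
        state, done = generate_chunk(state)
        # Budget exhaustion counts as finished in this toy.
        (finished if done or state["tokens"] >= budget_tokens else active).append(state)
    pending.extend(active)  # recycle incomplete rollouts, discard none
    return finished[:target], pending

def generate_chunk(state):
    """Toy engine: finishes a rollout after a random number of 32-token chunks."""
    state["tokens"] += 32
    return state, random.random() < 0.2

done, carry = april_rollout_phase(requests=16, target=8, budget_tokens=512,
                                  pending=deque(), generate_chunk=generate_chunk)
print(len(done), len(carry))
```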

[565] Prompt Injection Attacks on LLM Generated Reviews of Scientific Publications

Janis Keuper

Main category: cs.LG

TL;DR: This paper investigates the practicability and effectiveness of hidden prompt injections to manipulate LLM-generated peer review scores, finding that simple injections can achieve 100% acceptance rates and that LLM reviews are generally biased toward acceptance.

DetailsMotivation: The motivation is to examine the existence and impact of hidden prompt injections in LLM-based peer review processes, as such manipulations could significantly influence the ongoing debate about LLM usage in scientific peer-review.

Method: The study conducted a systematic evaluation using 1,000 reviews of 2024 ICLR papers generated by a wide range of LLMs to test the effectiveness of prompt injection attacks.

Result: Two key findings: 1) Very simple prompt injections are highly effective, reaching up to 100% acceptance scores; 2) LLM reviews are generally biased toward acceptance (>95% in many models).

Conclusion: Both findings have significant implications for the ongoing discussions about LLM usage in peer-review, highlighting vulnerabilities in the system and inherent biases in automated review processes.

Abstract: The ongoing intense discussion on rising LLM usage in the scientific peer-review process has recently been stirred by reports of authors using hidden prompt injections to manipulate review scores. Since the existence of such “attacks” (although seen by some commentators as “self-defense”) would have a great impact on the further debate, this paper investigates the practicability and technical success of the described manipulations. Our systematic evaluation, using 1k reviews of 2024 ICLR papers generated by a wide range of LLMs, shows two distinct results: I) very simple prompt injections are indeed highly effective, reaching up to 100% acceptance scores. II) LLM reviews are generally biased toward acceptance (>95% in many models). Both results have great impact on the ongoing discussions on LLM usage in peer-review.

[566] Matched-Pair Experimental Design with Active Learning

Weizhi Li, Gautam Dasarathy, Visar Berisha

Main category: cs.LG

TL;DR: This paper proposes an active learning framework for matched-pair experimental designs that sequentially enrolls patients in high treatment-effect regions to efficiently detect treatment efficacy.

DetailsMotivation: Traditional matched-pair designs struggle when overall treatment effects are small across the population, creating a need to identify and target specific regions where interventions are most effective to reduce experimental costs.

Method: The authors frame target region identification as a classification problem and develop an active learning framework specifically tailored for matched-pair designs that sequentially enrolls patients in high-effect regions.

Result: Theoretical analysis shows improved label complexity, and experiments demonstrate that the approach efficiently identifies high-treatment-effect regions while reducing experimental costs compared to traditional methods.

Conclusion: The proposed active learning framework for matched-pair designs effectively targets high-effect regions, reduces experimental costs, and ensures comprehensive identification of treatment-effective areas.

Abstract: Matched-pair experimental designs aim to detect treatment effects by pairing participants and comparing within-pair outcome differences. In many situations, the overall effect size across the entire population is small. Then, the focus naturally shifts to identifying and targeting high treatment-effect regions where the intervention is most effective. This paper proposes a matched-pair experimental design that sequentially and actively enrolls patients in high treatment-effect regions. Importantly, we frame the identification of the target region as a classification problem and propose an active learning framework tailored to matched-pair designs. Our design not only reduces the experimental cost of detecting treatment efficacy, but also ensures that the identified regions enclose the entire high-treatment-effect regions. Our theoretical analysis of the framework’s label complexity and experiments in practical scenarios demonstrate the efficiency and advantages of the approach.

[567] Selective Risk Certification for LLM Outputs via Information-Lift Statistics: PAC-Bayes, Robustness, and Skeleton Design

Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma

Main category: cs.LG

TL;DR: This paper develops information-lift certificates for uncertainty quantification in large language models, achieving better coverage and risk control than recent baselines while effectively blocking critical errors in high-stakes scenarios.

DetailsMotivation: Large language models often generate confident but incorrect outputs, requiring formal uncertainty quantification with abstention guarantees to ensure reliability, especially in high-stakes applications.

Method: The method uses information-lift certificates that compare model probabilities to a skeleton baseline, accumulating evidence into sub-gamma PAC-Bayes bounds that remain valid under heavy-tailed distributions.

Result: Across eight datasets, the method achieves 77.2% coverage at 2% risk, outperforming recent 2023-2024 baselines by 8.6-15.1 percentage points, while blocking 96% of critical errors in high-stakes scenarios compared to 18-31% for entropy methods.

Conclusion: The approach provides effective uncertainty quantification with strong performance, though limitations include skeleton dependence and frequency-only risk control (not severity-aware), but performance degrades gracefully under corruption.

Abstract: Large language models frequently generate confident but incorrect outputs, requiring formal uncertainty quantification with abstention guarantees. We develop information-lift certificates that compare model probabilities to a skeleton baseline, accumulating evidence into sub-gamma PAC-Bayes bounds valid under heavy-tailed distributions. Across eight datasets, our method achieves 77.2% coverage at 2% risk, outperforming recent 2023-2024 baselines by 8.6-15.1 percentage points, while blocking 96% of critical errors in high-stakes scenarios vs 18-31% for entropy methods. Limitations include skeleton dependence and frequency-only (not severity-aware) risk control, though performance degrades gracefully under corruption.
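
At its core, the certificate accumulates per-token information lift against the skeleton and abstains when the evidence is insufficient. The sketch below hard-codes the threshold; in the paper it comes from the sub-gamma PAC-Bayes bound.

```python
def lift_certificate(model_logps: list[float], skeleton_logps: list[float],
                     threshold: float) -> bool:
    """Accumulate per-token information lift (model log-prob minus skeleton
    log-prob) and certify the output only if the total evidence clears a
    threshold; otherwise abstain. Threshold is a free parameter here."""
    lift = sum(m - s for m, s in zip(model_logps, skeleton_logps))
    return lift >= threshold  # True: emit answer; False: abstain

model = [-0.1, -0.3, -0.2]     # confident model token log-probs
skeleton = [-2.3, -2.0, -1.9]  # weak baseline ("skeleton") log-probs
print(lift_certificate(model, skeleton, threshold=3.0))  # True -> answer
print(lift_certificate(model, model, threshold=3.0))     # False -> abstain
```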

[568] Frictional Q-Learning

Hyunwoo Kim, Hyo Kyung Lee

Main category: cs.LG

TL;DR: Frictional Q-learning is a deep RL algorithm for continuous control that uses a friction-inspired constraint to prevent policy drift and extrapolation error by keeping actions close to those in the replay buffer.

DetailsMotivation: To address extrapolation error in off-policy RL by drawing an analogy with static friction in classical mechanics, preventing the policy from drifting toward unsupported actions.

Method: Extends batch-constrained RL by constraining the agent’s action space to encourage behavior similar to the replay buffer while maintaining distance from the orthonormal action space manifold.

Result: The algorithm is robustly trained and achieves competitive performance across standard continuous control benchmarks.

Conclusion: Frictional Q-learning provides an intuitive physical interpretation of extrapolation error while maintaining the simplicity of batch-constrained methods and demonstrating strong empirical performance.

Abstract: We draw an analogy between static friction in classical mechanics and extrapolation error in off-policy RL, and use it to formulate a constraint that prevents the policy from drifting toward unsupported actions. In this study, we present Frictional Q-learning, a deep reinforcement learning algorithm for continuous control, which extends batch-constrained reinforcement learning. Our algorithm constrains the agent’s action space to encourage behavior similar to that in the replay buffer, while maintaining a distance from the manifold of the orthonormal action space. The constraint preserves the simplicity of batch-constrained methods and provides an intuitive physical interpretation of extrapolation error. Empirically, we further demonstrate that our algorithm is robustly trained and achieves competitive performance across standard continuous control benchmarks.

[569] Adversarial generalization of unfolding (model-based) networks

Vicky Kouni

Main category: cs.LG

TL;DR: This paper provides the first theoretical analysis of adversarial generalization for unfolding networks, deriving tight error bounds and demonstrating that overparameterization can enhance robustness.

DetailsMotivation: Unfolding networks are used in critical applications like medical imaging and cryptography, but their adversarial robustness lacks theoretical understanding despite the importance of preventing catastrophic failures.

Method: The authors study state-of-the-art overparameterized unfolding networks under l₂-norm constrained attacks, deploying a new framework to estimate adversarial Rademacher complexity and provide generalization error bounds.

Result: The derived adversarial generalization error bounds are tight with respect to attack level, and experiments on real-world data consistently corroborate the theoretical findings.

Conclusion: Overparameterization in unfolding networks can be exploited to promote adversarial robustness, providing insights for efficiently robustifying neural networks.

Abstract: Unfolding networks are interpretable networks emerging from iterative algorithms, incorporate prior knowledge of data structure, and are designed to solve inverse problems like compressed sensing, which deals with recovering data from noisy, missing observations. Compressed sensing finds applications in critical domains, from medical imaging to cryptography, where adversarial robustness is crucial to prevent catastrophic failures. However, a solid theoretical understanding of the performance of unfolding networks in the presence of adversarial attacks is still in its infancy. In this paper, we study the adversarial generalization of unfolding networks when perturbed with l₂-norm constrained attacks, generated by the fast gradient sign method. Particularly, we choose a family of state-of-the-art overparameterized unfolding networks and deploy a new framework to estimate their adversarial Rademacher complexity. Given this estimate, we provide adversarial generalization error bounds for the networks under study, which are tight with respect to the attack level. To our knowledge, this is the first theoretical analysis on the adversarial generalization of unfolding networks. We further present a series of experiments on real-world data, with results corroborating our derived theory, consistently for all data. Finally, we observe that the family’s overparameterization can be exploited to promote adversarial robustness, shedding light on how to efficiently robustify neural networks.
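
For reference, the l₂-constrained fast-gradient attack used to perturb the networks has a one-line update: step a distance eps along the normalized loss gradient. The model and loss below are placeholders; only the attack rule matches the paper's setting.

```python
import torch

def fgsm_l2(model, x: torch.Tensor, y: torch.Tensor, loss_fn, eps: float) -> torch.Tensor:
    """l2-norm constrained fast-gradient attack: move each input a distance
    eps along its normalized loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss_fn(model(x), y).backward()
    g = x.grad.flatten(1)
    g = g / (g.norm(dim=1, keepdim=True) + 1e-12)  # unit l2 direction per sample
    return (x + eps * g.view_as(x)).detach()

model = torch.nn.Linear(20, 5)  # placeholder for an unfolding network
x, y = torch.randn(8, 20), torch.randint(0, 5, (8,))
x_adv = fgsm_l2(model, x, y, torch.nn.functional.cross_entropy, eps=0.5)
print((x_adv - x).flatten(1).norm(dim=1))  # ~0.5 for every sample
```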

[570] Advancing Universal Deep Learning for Electronic-Structure Hamiltonian Prediction of Materials

Shi Yin, Zujian Dai, Xinyang Pan, Lixin He

Main category: cs.LG

TL;DR: NextHAM is a neural E(3)-symmetry method for efficient and generalizable materials electronic-structure Hamiltonian prediction, using zeroth-step Hamiltonians as informative descriptors and correction terms, with a Transformer architecture and novel training objectives.

DetailsMotivation: Deep learning methods for Hamiltonian prediction offer computational efficiency but face challenges with atomic diversity, structural patterns, and high-dimensional complexity. Current methods struggle with generalization across diverse materials.

Method: Proposes NextHAM with three key components: 1) zeroth-step Hamiltonians from DFT initial charge density as descriptors and correction terms, 2) neural Transformer with strict E(3)-symmetry, 3) novel training objective for accuracy in real and reciprocal space to prevent error amplification and ghost states.

Result: Experimental results on Materials-HAM-SOC dataset (17,000 materials, 68 elements) show NextHAM achieves excellent accuracy and efficiency in predicting Hamiltonians and band structures.

Conclusion: NextHAM advances universal deep learning for Hamiltonian prediction through methodological innovations and a comprehensive benchmark dataset, demonstrating superior performance in accuracy and computational efficiency.

Abstract: Deep learning methods for electronic-structure Hamiltonian prediction have offered significant computational efficiency advantages over traditional DFT methods, yet the diversity of atomic types, structural patterns, and the high-dimensional complexity of Hamiltonians pose substantial challenges to the generalization performance. In this work, we contribute on both the methodology and dataset sides to advance universal deep learning paradigm for Hamiltonian prediction. On the method side, we propose NextHAM, a neural E(3)-symmetry and expressive correction method for efficient and generalizable materials electronic-structure Hamiltonian prediction. First, we introduce the zeroth-step Hamiltonians, which can be efficiently constructed by the initial charge density of DFT, as informative descriptors of neural regression model in the input level and initial estimates of the target Hamiltonian in the output level, so that the regression model directly predicts the correction terms to the target ground truths, thereby significantly simplifying the input-output mapping for learning. Second, we present a neural Transformer architecture with strict E(3)-Symmetry and high non-linear expressiveness for Hamiltonian prediction. Third, we propose a novel training objective to ensure the accuracy performance of Hamiltonians in both real space and reciprocal space, preventing error amplification and the occurrence of “ghost states” caused by the large condition number of the overlap matrix. On the dataset side, we curate a high-quality broad-coverage large benchmark, namely Materials-HAM-SOC, comprising 17,000 material structures spanning 68 elements from six rows of the periodic table and explicitly incorporating SOC effects. Experimental results on Materials-HAM-SOC demonstrate that NextHAM achieves excellent accuracy and efficiency in predicting Hamiltonians and band structures.
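
The zeroth-step trick amounts to residual prediction: use the cheap H0 both as input descriptor and as initial estimate, and learn only the correction. The flattened-block representation and stand-in MLP below are assumptions for illustration, not NextHAM's E(3)-symmetric Transformer.

```python
import torch
import torch.nn as nn

# Stand-in correction network over flattened orbital-pair blocks of H0.
corr_net = nn.Sequential(nn.Linear(256, 512), nn.SiLU(), nn.Linear(512, 256))

def predict_hamiltonian(h0_blocks: torch.Tensor) -> torch.Tensor:
    # h0_blocks: (n_blocks, 256) flattened blocks of the zeroth-step Hamiltonian
    # H0 is both descriptor (input) and initial estimate (skip connection).
    return h0_blocks + corr_net(h0_blocks)  # H_pred = H0 + learned correction

h0 = torch.randn(10, 256)
print(predict_hamiltonian(h0).shape)  # torch.Size([10, 256])
```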

[571] Local Mechanisms of Compositional Generalization in Conditional Diffusion

Arwen Bradley

Main category: cs.LG

TL;DR: This paper investigates compositional generalization in conditional diffusion models, focusing on length generalization (generating images with more objects than seen during training). The authors establish a theoretical equivalence between compositional structure and local conditional scores, and validate that models achieving length generalization exhibit these local dependencies.

DetailsMotivation: To understand the mechanisms behind compositional generalization in conditional diffusion models, particularly why some models can generalize to out-of-distribution condition combinations while others cannot. The paper aims to uncover the structural properties that enable this capability.

Method: The authors study length generalization in a controlled CLEVR setting, prove an exact equivalence between conditional projective composition and local conditional scores, and validate their theory through empirical experiments. They use causal interventions to enforce local conditional scores and test feature-space compositionality in color-conditioned CLEVR and SDXL.

Result: Models that succeed at length generalization exhibit local conditional scores, while failing models do not. A causal intervention enforcing local conditional scores restores length generalization in previously failing models. Preliminary evidence shows compositional structure in SDXL for feature-space compositionality.

Conclusion: Compositional generalization in conditional diffusion models is enabled by local conditional scores - sparse dependencies on both pixels and conditioners. This structural mechanism explains why some models generalize well while others fail, and can be enforced through interventions to improve generalization capabilities.

Abstract: Conditional diffusion models appear capable of compositional generalization, i.e., generating convincing samples for out-of-distribution combinations of conditioners, but the mechanisms underlying this ability remain unclear. To make this concrete, we study length generalization, the ability to generate images with more objects than seen during training. In a controlled CLEVR setting (Johnson et al., 2017), we find that length generalization is achievable in some cases but not others, suggesting that models only sometimes learn the underlying compositional structure. We then investigate locality as a structural mechanism for compositional generalization. Prior works proposed score locality as a mechanism for creativity in unconditional diffusion models (Kamb & Ganguli, 2024; Niedoba et al., 2024), but did not address flexible conditioning or compositional generalization. In this paper, we prove an exact equivalence between a specific compositional structure (“conditional projective composition”) (Bradley et al., 2025) and scores with sparse dependencies on both pixels and conditioners (“local conditional scores”). This theory also extends to feature-space compositionality. We validate our theory empirically: CLEVR models that succeed at length generalization exhibit local conditional scores, while those that fail do not. Furthermore, we show that a causal intervention explicitly enforcing local conditional scores restores length generalization in a previously failing model. Finally, we investigate feature-space compositionality in color-conditioned CLEVR, and find preliminary evidence of compositional structure in SDXL.

[572] Discovering Association Rules in High-Dimensional Small Tabular Data

Erkan Karabulut, Daniel Daza, Paul Groth, Victoria Degeler

Main category: cs.LG

TL;DR: This paper addresses Association Rule Mining (ARM) in high-dimensional tabular data, particularly focusing on scalability and performance in low-data regimes. It improves upon Aerial+ with fine-tuning approaches using tabular foundation models.

DetailsMotivation: Current ARM methods face rule explosion and computational overhead in high-dimensional settings. Neurosymbolic methods like Aerial+ address dimensionality but suffer from neural network limitations in low-data scenarios, which is problematic for domains like biomedicine with thousands of features but few samples.

Method: The paper proposes two fine-tuning approaches to Aerial+ using tabular foundation models to enhance performance in low-data, high-dimensional settings. It also empirically evaluates scalability against state-of-the-art baselines.

Result: Aerial+ scales one to two orders of magnitude better than existing methods across five real-world datasets. The proposed fine-tuning approaches significantly improve rule quality in low-data, high-dimensional scenarios.

Conclusion: The work demonstrates effective solutions for ARM in challenging high-dimensional, low-data settings, with practical applications in domains like biomedicine where traditional methods struggle.

Abstract: Association Rule Mining (ARM) aims to discover patterns between features in datasets in the form of propositional rules, supporting both knowledge discovery and interpretable machine learning in high-stakes decision-making. However, in high-dimensional settings, rule explosion and computational overhead render popular algorithmic approaches impractical without effective search space reduction, challenges that propagate to downstream tasks. Neurosymbolic methods, such as Aerial+, have recently been proposed to address the rule explosion in ARM. While they tackle the high dimensionality of the data, they also inherit limitations of neural networks, particularly reduced performance in low-data regimes. This paper makes three key contributions to association rule discovery in high-dimensional tabular data. First, we empirically show that Aerial+ scales one to two orders of magnitude better than state-of-the-art algorithmic and neurosymbolic baselines across five real-world datasets. Second, we introduce the novel problem of ARM in high-dimensional, low-data settings, such as gene expression data from the biomedicine domain with around 18k features and 50 samples. Third, we propose two fine-tuning approaches to Aerial+ using tabular foundation models. Our proposed approaches are shown to significantly improve rule quality on five real-world datasets, demonstrating their effectiveness in low-data, high-dimensional scenarios.

[573] MolPILE - large-scale, diverse dataset for molecular representation learning

Jakub Adamczyk, Jakub Poziemski, Franciszek Job, Mateusz Król, Maciej Makowski

Main category: cs.LG

TL;DR: MolPILE is a large-scale, diverse, and rigorously curated dataset of 222 million compounds created from 6 databases to address limitations in existing molecular datasets for foundation model training.

DetailsMotivation: Existing small molecule datasets have limitations that hinder molecular representation learning effectiveness, creating a need for an ImageNet-like standardized resource in molecular chemistry.

Method: Constructed MolPILE using an automated curation pipeline from 6 large-scale databases, with comprehensive analysis of current pretraining datasets and retraining existing models on the new dataset.

Result: Retraining existing models on MolPILE yields improvements in generalization performance, demonstrating the dataset’s effectiveness for molecular representation learning.

Conclusion: MolPILE provides a standardized resource that addresses critical gaps in molecular chemistry datasets, enabling better foundation model training and improved generalization capabilities.

Abstract: The size, diversity, and quality of pretraining datasets critically determine the generalization ability of foundation models. Despite their growing importance in chemoinformatics, the effectiveness of molecular representation learning has been hindered by limitations in existing small molecule datasets. To address this gap, we present MolPILE, a large-scale, diverse, and rigorously curated collection of 222 million compounds, constructed from 6 large-scale databases using an automated curation pipeline. We present a comprehensive analysis of current pretraining datasets, highlighting considerable shortcomings for training ML models, and demonstrate how retraining existing models on MolPILE yields improvements in generalization performance. This work provides a standardized resource for model training, addressing the pressing need for an ImageNet-like dataset in molecular chemistry.

[574] S$^2$Transformer: Scalable Structured Transformers for Global Station Weather Forecasting

Hongyi Chen, Xiucheng Li, Xinyang Chen, Yun Cheng, Jing Li, Kehai Chen, Liqiang Nie

Main category: cs.LG

TL;DR: A novel Spatial Structured Attention Block is proposed for global station weather forecasting that models both local and global spatial correlations through intra-subgraph and inter-subgraph attention mechanisms, achieving up to 16.8% performance improvement over baselines.

DetailsMotivation: Existing time series forecasting methods often ignore or unidirectionally model spatial correlation in global weather forecasting, which contradicts the intrinsic nature of weather systems and limits forecast performance.

Method: The paper proposes a Spatial Structured Attention Block that partitions spatial graphs into subgraphs, uses Intra-subgraph Attention for local spatial correlation learning, and Inter-subgraph Attention for global message passing. This is built into a multiscale spatiotemporal model called S²Transformer that progressively expands subgraph scales.

Result: Experimental results show the model achieves performance improvements up to 16.8% over time series forecasting baselines while maintaining low running costs.

Conclusion: The proposed S²Transformer model is scalable, produces structured spatial correlation, is easy to implement, and significantly outperforms existing methods for global station weather forecasting.

Abstract: Global Station Weather Forecasting (GSWF) is a key meteorological research area, critical to energy, aviation, and agriculture. Existing time series forecasting methods often ignore or unidirectionally model spatial correlation when conducting large-scale global station forecasting. This contradicts the intrinsic nature underlying observations of the global weather system, limiting forecast performance. To address this, we propose a novel Spatial Structured Attention Block in this paper. It partitions the spatial graph into a set of subgraphs and instantiates Intra-subgraph Attention to learn local spatial correlation within each subgraph, and aggregates nodes into subgraph representations for message passing among the subgraphs via Inter-subgraph Attention – considering both spatial proximity and global correlation. Building on this block, we develop a multiscale spatiotemporal forecasting model S$^2$Transformer by progressively expanding subgraph scales. The resulting model is both scalable and able to produce structured spatial correlation, and meanwhile, it is easy to implement. The experimental results show that it can achieve performance improvements up to 16.8% over time series forecasting baselines at low running costs.
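
A minimal PyTorch sketch of the two-level attention the abstract describes, assuming a precomputed station-to-subgraph assignment; the module name, pooling choice, and hyperparameters are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class SpatialStructuredAttention(nn.Module):
    """Sketch: intra-subgraph attention for local spatial correlation,
    then inter-subgraph attention over pooled subgraph representations."""
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor, assign: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) station features; assign: (N,) subgraph id per station
        out = torch.zeros_like(x)
        pooled = []
        for g in assign.unique():
            idx = (assign == g).nonzero(as_tuple=True)[0]
            hg, _ = self.intra(x[:, idx, :], x[:, idx, :], x[:, idx, :])
            out[:, idx, :] = hg                  # local spatial correlation
            pooled.append(hg.mean(dim=1))        # subgraph representation
        s = torch.stack(pooled, dim=1)           # (B, G, D)
        s, _ = self.inter(s, s, s)               # global message passing
        # broadcast updated subgraph summaries back to member stations
        for i, g in enumerate(assign.unique()):
            idx = (assign == g).nonzero(as_tuple=True)[0]
            out[:, idx, :] = out[:, idx, :] + s[:, i:i+1, :]
        return out

# toy usage: 12 stations split into 3 subgraphs
x = torch.randn(2, 12, 32)
assign = torch.tensor([0]*4 + [1]*4 + [2]*4)
y = SpatialStructuredAttention(32)(x, assign)    # (2, 12, 32)
```

Stacking such blocks while progressively merging subgraphs would give the multiscale behavior the paper describes.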

[575] MCGrad: Multicalibration at Web Scale

Lorenzo Perini, Daniel Haimovich, Fridolin Linder, Niek Tax, Dima Karamshuk, Milan Vojnovic, Nastaran Okati, Pavlos Athanasios Apostolopoulos

Main category: cs.LG

TL;DR: MCGrad is a novel multicalibration algorithm that addresses limitations of existing methods by not requiring manual subgroup specification, being scalable, and improving other ML metrics rather than harming them.

DetailsMotivation: Existing multicalibration methods have limited industry adoption due to requiring manual subgroup specification, lack of scalability, and potential harm to other model performance metrics like log loss and PRAUC.

Method: MCGrad is a multicalibration algorithm that automatically identifies subgroups without explicit specification, is designed for scalability, and maintains or improves other ML evaluation metrics during calibration.

Result: MCGrad has been successfully deployed in production at Meta and is part of hundreds of production models, with positive results from both internal deployments and public dataset evaluations.

Conclusion: MCGrad represents a practical multicalibration solution that overcomes key barriers to industry adoption, demonstrating successful real-world deployment while improving model fairness without compromising other performance metrics.

Abstract: We propose MCGrad, a novel and scalable multicalibration algorithm. Multicalibration - calibration in sub-groups of the data - is an important property for the performance of machine learning-based systems. Existing multicalibration methods have thus far received limited traction in industry. We argue that this is because existing methods (1) require such subgroups to be manually specified, which ML practitioners often struggle with, (2) are not scalable, or (3) may harm other notions of model performance such as log loss and Area Under the Precision-Recall Curve (PRAUC). MCGrad does not require explicit specification of protected groups, is scalable, and often improves other ML evaluation metrics instead of harming them. MCGrad has been in production at Meta, and is now part of hundreds of production models. We present results from these deployments as well as results on public datasets.
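
Since the paper frames multicalibration without manually specified groups, a boosting-style post-hoc calibrator conveys the flavor: shallow trees fit to logit-space residuals implicitly discover miscalibrated regions of feature space. This is a hedged sketch of that general technique, not Meta's MCGrad implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logit(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def boosted_multicalibrate(scores, y, X, rounds=20, lr=0.5, max_depth=2):
    """Boosting-style calibration sketch: each shallow tree is fit to the
    logit-space residual, so its leaves act as implicitly discovered
    subgroups whose predictions get corrected."""
    z = logit(scores)
    trees = []
    for _ in range(rounds):
        residual = y - sigmoid(z)            # negative gradient of log loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                # leaves = discovered subgroups
        z = z + lr * tree.predict(X)
        trees.append(tree)
    return trees, sigmoid(z)
```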

cs.MA

[576] Structuring Collective Action with LLM-Guided Evolution: From Ill-Structured Problems to Executable Heuristics

Kevin Bradley Dsouza, Graham Alexander Watt, Yuri Leonenko, Juan Moreno-Cruz

Main category: cs.MA

TL;DR: ECHO-MIMIC is a computational framework that transforms collective action problems (ill-structured problems) into well-structured problems by discovering executable heuristics and persuasive rationales through evolutionary search driven by large language models.

DetailsMotivation: Collective action problems are challenging because individual agents struggle to understand the causal links between local actions and global outcomes, face conflicting stakeholder objectives, and lack clear algorithms to bridge micro-level choices with macro-level welfare.

Method: The framework operates in two stages: ECHO evolves Python code snippets encoding behavioral policies, while MIMIC evolves natural language messages to motivate adoption. Both use LLM-driven evolutionary search where the LLM proposes variants and population-level selection retains those maximizing collective performance in simulations.

Result: Applied to agricultural landscape management, ECHO-MIMIC discovers high-performing heuristics compared to baselines and crafts tailored messages that successfully align simulated farmer behavior with landscape-level ecological goals.

Conclusion: ECHO-MIMIC transforms the cognitive burden of collective action into simple agent-level instructions, making ill-structured problems solvable and opening new paths for scalable, adaptive policy design.

Abstract: Collective action problems, which require aligning individual incentives with collective goals, are classic examples of Ill-Structured Problems (ISPs). For an individual agent, the causal links between local actions and global outcomes are unclear, stakeholder objectives often conflict, and no single, clear algorithm can bridge micro-level choices with macro-level welfare. We present ECHO-MIMIC, a computational framework that converts this global complexity into a tractable, Well-Structured Problem (WSP) for each agent by discovering compact, executable heuristics and persuasive rationales. The framework operates in two stages: ECHO (Evolutionary Crafting of Heuristics from Outcomes) evolves snippets of Python code that encode candidate behavioral policies, while MIMIC (Mechanism Inference & Messaging for Individual-to-Collective Alignment) evolves companion natural language messages that motivate agents to adopt those policies. Both phases employ a large-language-model-driven evolutionary search: the LLM proposes diverse and context-aware code or text variants, while population-level selection retains those that maximize collective performance in a simulated environment. We demonstrate this framework on a canonical ISP in agricultural landscape management, where local farming decisions impact global ecological connectivity. Results show that ECHO-MIMIC discovers high-performing heuristics compared to baselines and crafts tailored messages that successfully align simulated farmer behavior with landscape-level ecological goals. By coupling algorithmic rule discovery with tailored communication, ECHO-MIMIC transforms the cognitive burden of collective action into a simple set of agent-level instructions, making previously ill-structured problems solvable in practice and opening a new path toward scalable, adaptive policy design.
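
The two-stage search can be pictured as a plain evolutionary loop around an LLM mutation operator. Both callbacks below (`llm_propose`, `evaluate_in_sim`) are hypothetical placeholders standing in for the paper's LLM prompting and simulated environment:

```python
def evolve_heuristics(seed_programs, llm_propose, evaluate_in_sim,
                      generations=10, population=20, survivors=5):
    """Skeleton of LLM-driven evolutionary search: the LLM proposes
    variants of surviving candidates (code snippets for ECHO, messages
    for MIMIC), and population-level selection keeps those that maximize
    collective performance in simulation."""
    pool = list(seed_programs)
    for _ in range(generations):
        # LLM proposes diverse, context-aware variants of each survivor
        children = [llm_propose(parent) for parent in pool
                    for _ in range(population // len(pool))]
        pool = pool + children
        # selection on simulated collective performance
        pool.sort(key=evaluate_in_sim, reverse=True)
        pool = pool[:survivors]
    return pool[0]   # best heuristic (or message) found
```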

[577] RadAgents: Multimodal Agentic Reasoning for Chest X-ray Interpretation with Radiologist-like Workflows

Kai Zhang, Corey D Barrett, Jangwon Kim, Lichao Sun, Tara Taghavi, Krishnaram Kenthapadi

Main category: cs.MA

TL;DR: RadAgents is a multi-agent framework for chest X-ray interpretation that addresses limitations in clinical interpretability, multimodal fusion, and inconsistency resolution through task-aware reasoning and verification mechanisms.

DetailsMotivation: Current agentic systems for CXR interpretation lack clinically interpretable reasoning aligned with guidelines, insufficiently fuse multimodal evidence, and fail to detect/resolve cross-tool inconsistencies without verification mechanisms.

Method: Multi-agent framework that couples clinical priors with task-aware multimodal reasoning, integrates grounding and multimodal retrieval-augmentation to verify and resolve context conflicts.

Result: The system produces more reliable, transparent outputs that are consistent with clinical practice by addressing the identified gaps in existing methods.

Conclusion: RadAgents represents an advancement in agentic systems for medical imaging by providing clinically aligned, multimodal reasoning with built-in verification mechanisms for improved reliability and transparency.

Abstract: Agentic systems offer a potential path to solve complex clinical tasks through collaboration among specialized agents, augmented by tool use and external knowledge bases. Nevertheless, for chest X-ray (CXR) interpretation, prevailing methods remain limited: (i) reasoning is frequently neither clinically interpretable nor aligned with guidelines, reflecting mere aggregation of tool outputs; (ii) multimodal evidence is insufficiently fused, yielding text-only rationales that are not visually grounded; and (iii) systems rarely detect or resolve cross-tool inconsistencies and provide no principled verification mechanisms. To bridge the above gaps, we present RadAgents, a multi-agent framework for CXR interpretation that couples clinical priors with task-aware multimodal reasoning. In addition, we integrate grounding and multimodal retrieval-augmentation to verify and resolve context conflicts, resulting in outputs that are more reliable, transparent, and consistent with clinical practice.

cs.MM

eess.AS

[578] Data-Efficient ASR Personalization for Non-Normative Speech Using an Uncertainty-Based Phoneme Difficulty Score for Guided Sampling

Niclas Pokel, Pehuén Moure, Roman Boehringer, Yingqiang Gao

Main category: eess.AS

TL;DR: A data-efficient personalization method for ASR that uses phoneme-level uncertainty estimation via Monte Carlo Dropout to guide targeted oversampling, improving recognition accuracy for non-normative speech.

DetailsMotivation: ASR systems perform poorly on non-normative speech from individuals with impairments due to high acoustic variability and limited training data.

Method: Leverage Monte Carlo Dropout to estimate phoneme-level uncertainty, then use these estimates for targeted oversampling during fine-tuning.

Result: Model-derived uncertainty strongly correlates with expert clinical assessments of speech difficulty, and uncertainty-guided sampling significantly improves ASR accuracy on English and German datasets.

Conclusion: Provides a clinically-validated, practical framework for personalized and inclusive ASR that aligns model uncertainty with expert assessment.

Abstract: Automatic speech recognition (ASR) systems struggle with non-normative speech from individuals with impairments caused by conditions like cerebral palsy or structural anomalies. The high acoustic variability and scarcity of training data severely degrade model performance. This work introduces a data-efficient personalization method that quantifies phoneme-level uncertainty to guide fine-tuning. We leverage Monte Carlo Dropout to estimate which phonemes a model finds most difficult and use these estimates for a targeted oversampling strategy. We validate our method on English and German datasets. Crucially, we demonstrate that our model-derived uncertainty strongly correlates with phonemes identified as challenging in an expert clinical logopedic report, marking, to our knowledge, the first work to successfully align model uncertainty with expert assessment of speech difficulty. Our results show that this clinically-validated, uncertainty-guided sampling significantly improves ASR accuracy, delivering a practical framework for personalized and inclusive ASR.
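
A compact sketch of the Monte Carlo Dropout uncertainty step, assuming a frame-level phoneme classifier; the model interface and the per-phoneme aggregation are illustrative assumptions, not the authors' exact pipeline:

```python
import torch

@torch.no_grad()
def phoneme_uncertainty(model, features, n_passes=20):
    """Keep dropout active at inference, run several stochastic forward
    passes, and score each phoneme by the mean predictive entropy of the
    frames assigned to it. High scores mark 'difficult' phonemes whose
    utterances are then oversampled during fine-tuning."""
    model.train()                      # leaves dropout layers active
    probs = torch.stack([model(features).softmax(-1) for _ in range(n_passes)])
    mean_p = probs.mean(dim=0)         # (T, n_phonemes) predictive posterior
    entropy = -(mean_p * mean_p.clamp_min(1e-8).log()).sum(-1)   # (T,)
    hard = mean_p.argmax(-1)
    scores = {int(ph): entropy[hard == ph].mean().item()
              for ph in hard.unique()}
    return scores   # high score -> oversample utterances rich in that phoneme
```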

[579] Variational Low-Rank Adaptation for Personalized Impaired Speech Recognition

Niclas Pokel, Pehuén Moure, Roman Boehringer, Shih-Chii Liu, Yingqiang Gao

Main category: eess.AS

TL;DR: This paper introduces a Bayesian Low-rank Adaptation method for efficient ASR personalization to handle speech impairments, validated on English and German datasets with significant accuracy improvements.

DetailsMotivation: Speech impairments from congenital disorders and brain injuries challenge ASR systems due to limited training data and high acoustic variability. Current models like Whisper struggle with non-normative speech, and data collection/annotation is burdensome for affected individuals and caregivers.

Method: The authors propose a novel ASR personalization method based on Bayesian Low-rank Adaptation for data-efficient fine-tuning. The approach is designed for low-resource settings and validated on the English UA-Speech dataset and a newly collected German dataset (BF-Sprache) from a child with structural speech impairment.

Result: The method significantly improves ASR accuracy for impaired speech while maintaining data and annotation efficiency.

Conclusion: This approach offers a practical path toward inclusive ASR by effectively handling speech impairments in low-resource settings with minimal data requirements.

Abstract: Speech impairments resulting from congenital disorders, such as cerebral palsy, Down syndrome, or Apert syndrome, as well as acquired brain injuries due to stroke, traumatic accidents, or tumors, present major challenges to automatic speech recognition (ASR) systems. Despite recent advancements, state-of-the-art ASR models like Whisper still struggle with non-normative speech due to limited training data availability and high acoustic variability. Moreover, collecting and annotating non-normative speech is burdensome: speaking is effortful for many affected individuals, while laborious annotation often requires caregivers familiar with the speaker. This work introduces a novel ASR personalization method based on Bayesian Low-rank Adaptation for data-efficient fine-tuning. We validate our method on the English UA-Speech dataset and a newly collected German speech dataset, BF-Sprache, from a child with structural speech impairment. The dataset and approach are designed to reflect the challenges of low-resource settings that include individuals with speech impairments. Our method significantly improves ASR accuracy for impaired speech while maintaining data and annotation efficiency, offering a practical path toward inclusive ASR.
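
A minimal sketch of what a variational low-rank adapter can look like: a Gaussian posterior over one LoRA factor, sampled via the reparameterization trick, with a KL term for the ELBO. Rank, priors, initialization, and which factor is stochastic are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class VariationalLoRALinear(nn.Module):
    """Frozen base layer plus a low-rank update B @ A, where A has a
    learned Gaussian posterior; sampling A at each forward pass makes
    fine-tuning Bayesian and data-efficient."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # freeze base model
        d_out, d_in = base.weight.shape
        self.A_mu = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.A_logvar = nn.Parameter(torch.full((rank, d_in), -8.0))
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero-init: no drift at start

    def forward(self, x):
        # reparameterization trick: A = mu + sigma * eps
        A = self.A_mu + torch.randn_like(self.A_mu) * (0.5 * self.A_logvar).exp()
        return self.base(x) + x @ A.t() @ self.B.t()

    def kl(self):
        # KL(q || N(0, I)) over the stochastic factor, added to the loss
        return 0.5 * (self.A_logvar.exp() + self.A_mu**2 - 1 - self.A_logvar).sum()
```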

[580] Phoenix-VAD: Streaming Semantic Endpoint Detection for Full-Duplex Speech Interaction

Weijie Wu, Wenhao Guan, Kaidi Wang, Peijie Chen, Zhuanling Zha, Junbo Li, Jun Fang, Lin Li, Qingyang Hong

Main category: eess.AS

TL;DR: Phoenix-VAD is an LLM-based model for streaming semantic endpoint detection in spoken dialogue systems, enabling plug-and-play full-duplex prediction.

DetailsMotivation: Current spoken dialogue models lack a plug-and-play full-duplex prediction module for semantic endpoint detection, hindering seamless audio interactions.

Method: Leverages LLM’s semantic comprehension capability with a sliding window training strategy to achieve reliable semantic endpoint detection while supporting streaming inference.

Result: Achieves excellent and competitive performance on both semantically complete and incomplete speech scenarios.

Conclusion: Enables full-duplex prediction module to be optimized independently of dialogue model, providing more reliable and flexible support for next-generation human-computer interaction.

Abstract: Spoken dialogue models have significantly advanced intelligent human–computer interaction, yet they lack a plug-and-play full-duplex prediction module for semantic endpoint detection, hindering seamless audio interactions. In this paper, we introduce Phoenix-VAD, an LLM-based model that enables streaming semantic endpoint detection. Specifically, Phoenix-VAD leverages the semantic comprehension capability of the LLM and a sliding window training strategy to achieve reliable semantic endpoint detection while supporting streaming inference. Experiments on both semantically complete and incomplete speech scenarios indicate that Phoenix-VAD achieves excellent and competitive performance. Furthermore, this design enables the full-duplex prediction module to be optimized independently of the dialogue model, providing more reliable and flexible support for next-generation human–computer interaction.

[581] Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens

Ismail Rasim Ulgen, Zongyang Du, Junchen Lu, Philipp Koehn, Berrak Sisman

Main category: eess.AS

TL;DR: TTScore is a reference-free evaluation framework for synthesized speech that uses conditional prediction of discrete tokens to assess intelligibility and prosody, outperforming existing metrics in correlation with human judgments.

DetailsMotivation: Existing speech evaluation metrics like WER and F0-RMSE are limited in scope and weakly correlated with human perception, providing only coarse or narrow measures of intelligibility and prosody.

Method: TTScore employs two sequence-to-sequence predictors: TTScore-int measures intelligibility through content tokens, and TTScore-pro evaluates prosody through prosody tokens. Both predictors are conditioned on input text and compute likelihoods of corresponding token sequences.

Result: Experiments on SOMOS, VoiceMOS, and TTSArena benchmarks show TTScore-int and TTScore-pro provide reliable aspect-specific evaluation and achieve stronger correlations with human judgments of overall quality than existing metrics.

Conclusion: TTScore offers a targeted and reference-free framework that captures alignment with intended linguistic content and prosodic structure, providing more comprehensive and human-aligned evaluation of synthesized speech.

Abstract: Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measure of intelligibility, while F0-RMSE and related pitch-based metrics offer a narrow, reference-dependent view of prosody. To address these limitations, we propose TTScore, a targeted and reference-free evaluation framework based on conditional prediction of discrete speech tokens. TTScore employs two sequence-to-sequence predictors conditioned on input text: TTScore-int, which measures intelligibility through content tokens, and TTScore-pro, which evaluates prosody through prosody tokens. For each synthesized utterance, the predictors compute the likelihood of the corresponding token sequences, yielding interpretable scores that capture alignment with intended linguistic content and prosodic structure. Experiments on the SOMOS, VoiceMOS, and TTSArena benchmarks demonstrate that TTScore-int and TTScore-pro provide reliable, aspect-specific evaluation and achieve stronger correlations with human judgments of overall quality than existing intelligibility and prosody-focused metrics.
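
The scoring rule reduces to a conditional log-likelihood of a token sequence under a text-conditioned predictor. A hedged sketch, with the predictor interface (text in, next-token logits out) assumed:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ttscore(predictor, text_ids, speech_tokens):
    """Score a synthesized utterance by the likelihood its discrete token
    sequence (content tokens for TTScore-int, prosody tokens for
    TTScore-pro) receives from a seq2seq predictor conditioned on the
    input text."""
    logits = predictor(text_ids, speech_tokens[:, :-1])    # teacher forcing
    log_probs = F.log_softmax(logits, dim=-1)
    target = speech_tokens[:, 1:]
    token_ll = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return token_ll.mean(dim=-1)   # higher = better aligned with the text
```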

[582] Real-Time System for Audio-Visual Target Speech Enhancement

T. Aleksandra Ma, Sile Yin, Li-Chia Yang, Shuo Zhang

Main category: eess.AS

TL;DR: RAVEN is a real-time audio-visual speech enhancement system that runs on CPU hardware, using visual lip movement cues to improve speech enhancement in noisy environments.

DetailsMotivation: To create an interactive real-time system for audio-visual speech enhancement that operates on standard CPU hardware, addressing the gap where no prior work has demonstrated such capability.

Method: Uses pretrained visual embeddings from an audio-visual speech recognition model to encode lip movement information, enabling real-time processing on CPU without specialized hardware.

Result: The system successfully generalizes across various challenging scenarios including environmental noise, interfering speakers, transient sounds, and singing voices.

Conclusion: RAVEN demonstrates practical real-time audio-visual speech enhancement that can be experienced live using standard microphone and webcam setups with clean speech playback through headphones.

Abstract: We present a live demonstration for RAVEN, a real-time audio-visual speech enhancement system designed to run entirely on a CPU. In single-channel, audio-only settings, speech enhancement is traditionally approached as the task of extracting clean speech from environmental noise. More recent work has explored the use of visual cues, such as lip movements, to improve robustness, particularly in the presence of interfering speakers. However, to our knowledge, no prior work has demonstrated an interactive system for real-time audio-visual speech enhancement operating on CPU hardware. RAVEN fills this gap by using pretrained visual embeddings from an audio-visual speech recognition model to encode lip movement information. The system generalizes across environmental noise, interfering speakers, transient sounds, and even singing voices. In this demonstration, attendees will be able to experience live audio-visual target speech enhancement using a microphone and webcam setup, with clean speech playback through headphones.

[583] SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS

Tan Dat Nguyen, Jaehun Kim, Ji-Hoon Kim, Shukjae Choi, Youshin Lim, Joon Son Chung

Main category: eess.AS

TL;DR: SPADE is a framework that combines structured pruning and adaptive distillation to create efficient LLM-based text-to-speech models, reducing model size and latency while maintaining quality.

DetailsMotivation: Recent LLM-TTS systems have strong controllability and zero-shot generalization but suffer from large parameter counts and high latency that limit real-world deployment.

Method: SPADE combines (i) pruning guided by word-error-rate-based layer importance index to remove non-essential Transformer layers, with (ii) multi-level knowledge distillation to restore autoregressive coherence.

Result: SPADE preserves near-parity perceptual quality while halving Transformer depth, reducing VRAM usage by up to 20%, achieving up to 1.7x faster real-time factor with less than 5% of original training data.

Conclusion: Compact LLM-TTS models can maintain naturalness and speaker similarity while enabling practical real-time speech generation.

Abstract: The goal of this paper is to introduce SPADE, a framework for Structured Pruning and Adaptive Distillation for Efficient Large Language Model-based text-to-speech (LLM-TTS). Recent LLM-TTS systems achieve strong controllability and zero-shot generalization, but their large parameter counts and high latency limit real-world deployment. SPADE addresses this by combining (i) a pruning step guided by a word-error-rate-based layer importance index to remove non-essential Transformer layers, with (ii) multi-level knowledge distillation to restore autoregressive coherence. On zero-shot benchmarks, SPADE preserves near-parity perceptual quality while halving Transformer depth, reducing VRAM usage by up to 20%, and achieving up to 1.7x faster real-time factor with less than 5% of the original training data. These results show that compact LLM-TTS models can maintain naturalness and speaker similarity while enabling practical real-time speech generation. Audio samples are available at https://mm.kaist.ac.kr/projects/SPADE/.
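
The pruning criterion can be sketched as a simple ablation loop; the `eval_wer` callback and the per-layer skip flag are assumptions standing in for the paper's word-error-rate-based layer importance index:

```python
import torch

@torch.no_grad()
def layer_importance_by_wer(model, layers, eval_wer):
    """Ablate one Transformer layer at a time and record how much the
    word error rate degrades; layers whose removal barely moves WER are
    pruning candidates, after which distillation restores coherence."""
    base = eval_wer(model)
    importance = []
    for i, layer in enumerate(layers):
        layer.skip = True              # assumed flag: layer acts as identity
        importance.append((i, eval_wer(model) - base))
        layer.skip = False
    # least important (smallest WER increase) first
    return sorted(importance, key=lambda t: t[1])
```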

[584] PAS-SE: Personalized Auxiliary-Sensor Speech Enhancement for Voice Pickup in Hearables

Mattes Ohlenbusch, Mikolaj Kegler, Marko Stamenovic

Main category: eess.AS

TL;DR: This paper compares personalized speech enhancement (PSE) using enrollment utterances and auxiliary-sensor speech enhancement (AS-SE) using in-ear microphones for voice pickup in hearables, showing that combining both strategies (PAS-SE) provides complementary benefits.

DetailsMotivation: Speech enhancement for hearables faces challenges in distinguishing target voice from interfering talkers without additional context, particularly for single-channel methods.

Method: The authors compare PSE (using enrollment utterances) and AS-SE (using in-ear microphones), propose training-time augmentations for cross-dataset generalization, and combine both approaches as PAS-SE.

Result: PAS-SE provides complementary performance benefits, especially when enrollment speech is recorded with in-ear microphones. The system maintains performance even with noisy in-ear enrollments.

Conclusion: Combining personalized and auxiliary-sensor approaches offers effective speech enhancement for hearables, with PAS-SE outperforming individual methods and demonstrating robustness to noisy enrollment data.

Abstract: Speech enhancement for voice pickup in hearables aims to improve the user’s voice by suppressing noise and interfering talkers, while maintaining own-voice quality. For single-channel methods, it is particularly challenging to distinguish the target from interfering talkers without additional context. In this paper, we compare two strategies to resolve this ambiguity: personalized speech enhancement (PSE), which uses enrollment utterances to represent the target, and auxiliary-sensor speech enhancement (AS-SE), which uses in-ear microphones as additional input. We evaluate the strategies on two public datasets, employing different auxiliary sensor arrays, to investigate their cross-dataset generalization. We propose training-time augmentations to facilitate cross-dataset generalization of AS-SE systems. We also show that combining PSE and AS-SE (PAS-SE) provides complementary performance benefits, especially when enrollment speech is recorded with the in-ear microphone. We further demonstrate that PAS-SE personalized with noisy in-ear enrollments maintains performance benefits over the AS-SE system.

[585] TF-Restormer: Complex Spectral Prediction for Speech Restoration

Ui-Hyeop Shin, Jaehyun Ko, Woocheol Jeong, Hyuing-Min Park

Main category: eess.AS

TL;DR: TF-Restormer is an encoder-decoder architecture for universal speech restoration that handles arbitrary input-output sampling rates without redundant resampling, using time-frequency dual-path encoding and frequency extension queries.

DetailsMotivation: Existing speech restoration systems sacrifice signal fidelity, are impractical for streaming, and require external resampling for different sampling rates, leading to redundant computations.

Method: Uses an encoder-decoder architecture with time-frequency dual-path encoder, light decoder with frequency extension queries, SFI STFT discriminator for adversarial training, causal time module for streaming, and spectral inductive bias for robustness.

Result: TF-Restormer consistently outperforms prior systems across sampling rates, achieving balanced gains in signal fidelity and perceptual quality, with streaming mode maintaining competitive effectiveness for real-time applications.

Conclusion: The proposed architecture enables efficient and universal speech restoration across arbitrary input-output rates without redundant resampling, supporting both batch and streaming modes with improved robustness under extreme degradations.

Abstract: Speech restoration in real-world conditions is challenging due to compounded distortions such as clipping, band-pass filtering, digital artifacts, noise, and reverberation, as well as low sampling rates. Existing systems, including vocoder-based approaches, often sacrifice signal fidelity, while diffusion models remain impractical for streaming. Moreover, most assume a fixed target sampling rate, requiring external resampling that leads to redundant computations. We present TF-Restormer, an encoder-decoder architecture that concentrates analysis on the input bandwidth with a time-frequency dual-path encoder and reconstructs missing high-frequency bands through a light decoder with frequency extension queries. It enables efficient and universal restoration across arbitrary input-output rates without redundant resampling. To support adversarial training across diverse rates, we introduce a shared sampling-frequency-independent (SFI) STFT discriminator. TF-Restormer further supports streaming with a causal time module, and improves robustness under extreme degradations by injecting spectral inductive bias into the frequency module. Finally, we propose a scaled log-spectral loss that stabilizes optimization under severe conditions while emphasizing well-predicted spectral details. As a single model across sampling rates, TF-Restormer consistently outperforms prior systems, achieving balanced gains in signal fidelity and perceptual quality, while its streaming mode maintains competitive effectiveness for real-time applications. Code and demos are available at https://tf-restormer.github.io/demo.

[586] Measuring Audio’s Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models

Haolin He, Xingjian Du, Renhe Sun, Zheqi Dai, Yujia Xiao, Mingru Yang, Jiayi Zhou, Xiquan Li, Zhengxi Liu, Zining Liang, Chunyat Wu, Qianhua He, Tan Lee, Xie Chen, Weilong Zheng, Weiqiang Wang, Mark Plumbley, Jian Liu, Qiuqiang Kong

Main category: eess.AS

TL;DR: This paper introduces AudioMCQ, a large-scale audio multiple-choice question dataset, and proposes effective multi-stage post-training strategies for Large Audio Language Models (LALMs) that address the zero audio-contribution phenomenon, achieving state-of-the-art results on multiple benchmarks.

DetailsMotivation: Current multi-stage post-training approaches for LALMs (such as SFT followed by RL) remain suboptimal, and the allocation of data across training stages to maximize model capabilities hasn't been fully explored. Additionally, there's a lack of large-scale, high-quality datasets for such research.

Method: 1) Created AudioMCQ dataset with 571k samples and chain-of-thought annotations; 2) Identified zero audio-contribution phenomenon; 3) Proposed Audio-Contribution Filtering to partition data; 4) Developed two post-training paradigms: Weak-to-Strong and Mixed-to-Strong strategies.

Result: Achieved first place in DCASE 2025 Audio-Question-Answering challenge. Established new SOTA performance: 78.2% on MMAU-test-mini, 75.6% on MMAU, 67.1% on MMAR, and 70.7% on MMSU.

Conclusion: The proposed AudioMCQ dataset and multi-stage training strategies effectively address data allocation challenges in LALM post-training, significantly improving model performance by mitigating the zero audio-contribution problem and optimizing training stage sequencing.

Abstract: Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-training such as reinforcement learning (RL) has demonstrated promising results, multi-stage approaches such as supervised fine-tuning (SFT) followed by RL remain suboptimal. The allocation of data across multiple training stages to maximize LALM capabilities has not been fully explored, and large-scale, high-quality datasets for such research are also lacking. To address these problems, we firstly present AudioMCQ, a comprehensive audio multiple-choice question dataset comprising 571k samples with two kinds of chain-of-thought annotations. Secondly, we investigate the prevalent zero audio-contribution phenomenon in LALMs, where models derive correct answers solely from textual information without processing audio content. We propose Audio-Contribution Filtering to partition data into weak and strong audio-contribution subsets. Based on these insights, we develop two effective post-training paradigms: Weak-to-Strong (SFT on weak audio-contribution data followed by RL on strong audio-contribution data) and Mixed-to-Strong (SFT on mixed audio-contribution data followed by RL on strong audio-contribution data). We achieve first place in the DCASE 2025 Audio-Question-Answering challenge by using AudioMCQ. Additionally, leveraging our dataset with different training strategies, we achieve 78.2% on MMAU-test-mini, 75.6% on MMAU, 67.1% on MMAR, and 70.7% on MMSU, establishing new state-of-the-art performance across these benchmarks.
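
The filtering idea is easy to state in code: a sample contributes little audio information if a text-only model already answers it correctly. A hedged sketch with a hypothetical `answer_text_only` callback:

```python
def audio_contribution_split(samples, answer_text_only, n_trials=3):
    """Partition MCQ samples into weak and strong audio-contribution
    subsets: if the model consistently answers from text alone (audio
    withheld), the audio contributes little. Sample schema and the
    consistency threshold are illustrative assumptions."""
    weak, strong = [], []
    for s in samples:   # s: dict with 'question', 'options', 'answer'
        correct = sum(answer_text_only(s['question'], s['options']) == s['answer']
                      for _ in range(n_trials))
        (weak if correct == n_trials else strong).append(s)
    # Weak-to-Strong paradigm: SFT on `weak`, then RL on `strong`
    return weak, strong
```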

[587] Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?

Rostislav Makarov, Lea Schönherr, Timo Gerkmann

Main category: eess.AS

TL;DR: Speech enhancement models are vulnerable to adversarial attacks where carefully crafted noise can manipulate the enhanced output to convey different meanings, though diffusion models show inherent robustness.

DetailsMotivation: As speech enhancement models become more expressive, they introduce new vulnerabilities that could be exploited through adversarial attacks, potentially manipulating semantic meaning in enhanced speech.

Method: The study demonstrates adversarial attacks using psychoacoustically masked noise injected into input signals to manipulate contemporary predictive speech enhancement models, while also testing diffusion models with stochastic samplers.

Result: Contemporary predictive speech enhancement models can be successfully manipulated by adversarial attacks to produce enhanced speech with entirely different semantic meaning, but diffusion models exhibit inherent robustness to such attacks.

Conclusion: Advanced speech enhancement models’ expressiveness creates security vulnerabilities to adversarial manipulation, highlighting the need for robust architectures like diffusion models which provide inherent protection against such attacks.

Abstract: Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be susceptible to adversarial attacks. Specifically, we show that adversarial noise, carefully crafted and psychoacoustically masked by the original input, can be injected such that the enhanced speech output conveys an entirely different semantic meaning. We experimentally verify that contemporary predictive speech enhancement models can indeed be manipulated in this way. Furthermore, we highlight that diffusion models with stochastic samplers exhibit inherent robustness to such adversarial attacks by design.
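
A simplified projected-gradient sketch of such an attack; the paper's psychoacoustic masking is replaced here by a plain L-infinity bound, so this illustrates the threat model rather than reproducing the authors' attack:

```python
import torch

def adversarial_noise(se_model, x, target, steps=100, eps=1e-3, alpha=1e-4):
    """Find a small perturbation of the noisy input that pushes the
    enhancement model's output toward an attacker-chosen target waveform
    with a different semantic meaning."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(se_model(x + delta), target)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()     # step toward the target
            delta.clamp_(-eps, eps)                # keep perturbation small
        delta.grad.zero_()
    return (x + delta).detach()
```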

[588] Hybrid Real- And Complex-Valued Neural Network Concept For Low-Complexity Phase-Aware Speech Enhancement

Luan Vinícius Fiorio, Alex Young, Ronald M. Aarts

Main category: eess.AS

TL;DR: Hybrid real- and complex-valued neural networks outperform pure real or complex models for speech enhancement with lower complexity and same parameter count

DetailsMotivation: Real-valued models are inefficient while complex-valued models present high complexity, creating a need for a balanced approach

Method: Devised a straightforward design method to extend real-valued networks into hybrid counterparts, comparing real, complex, and hybrid versions of convolutional and convolutional-recurrent architectures

Result: Hybrid networks consistently outperform counterparts with same parameters, and have substantially lower complexity in terms of multiply-accumulate operations

Conclusion: Hybrid real-complex neural networks provide superior performance for speech enhancement with reduced computational complexity compared to pure real or complex models

Abstract: In this paper, we propose hybrid real- and complex-valued neural networks for speech enhancement. Real- or complex-valued models are either inefficient or present high complexity. We devise a straightforward design method for extending a real-valued network into its hybrid counterpart. Based on speech intelligibility and quality metrics, we compare the real, complex, and hybrid versions of a convolutional and a convolutional-recurrent architecture. The hybrid network consistently outperforms its counterparts with the same number of parameters. Additionally, the hybrid models’ complexity in terms of multiply-accumulate operations is substantially lower than that of their counterparts.

[589] MeanSE: Efficient Generative Speech Enhancement with Mean Flows

Jiahe Wang, Hongyu Wang, Wei Wang, Lei Yang, Chenda Li, Wangyou Zhang, Lufen Tan, Yanmin Qian

Main category: eess.AS

TL;DR: MeanSE is an efficient generative speech enhancement model that uses mean flows to achieve high-quality enhancement with just 1 function evaluation, outperforming flow matching baselines.

DetailsMotivation: Flow-based generative models for speech enhancement require multiple function evaluations (NFEs) for stable performance, leading to high computational load and poor 1-NFE performance.

Method: Proposes MeanSE which models the average velocity field using mean flows to enable efficient single-step enhancement.

Result: MeanSE significantly outperforms the flow matching baseline with a single NFE and shows better out-of-domain generalization capabilities.

Conclusion: The proposed MeanSE model provides an efficient solution for high-quality speech enhancement with minimal computational requirements.

Abstract: Speech enhancement (SE) improves the quality of degraded speech, with generative models like flow matching gaining attention for their outstanding perceptual quality. However, flow-based models require multiple function evaluations (NFEs) to achieve stable and satisfactory performance, leading to high computational load and poor 1-NFE performance. In this paper, we propose MeanSE, an efficient generative speech enhancement model using mean flows, which models the average velocity field to achieve high-quality 1-NFE enhancement. Experimental results demonstrate that our proposed MeanSE significantly outperforms the flow matching baseline with a single NFE, exhibiting substantially better out-of-domain generalization capabilities.

[590] Scaling Rich Style-Prompted Text-to-Speech Datasets

Anuj Diwan, Zhisheng Zheng, David Harwath, Eunsol Choi

Main category: eess.AS

TL;DR: ParaSpeechCaps is a large-scale dataset that automatically scales rich paralinguistic style captions for speech, improving TTS model performance on style consistency and speech quality.

DetailsMotivation: Existing large-scale speech datasets only cover basic style tags, while rich abstract tags have been limited to small-scale human-annotated datasets. There's a need to scale rich paralinguistic annotations automatically.

Method: Combine off-the-shelf text/speech embedders, classifiers, and an audio language model to automatically annotate rich style tags. Create two subsets: PSC-Base (342h human-labeled) and PSC-Scaled (2427h auto-annotated). Finetune Parler-TTS model on the dataset.

Result: Achieved +7.9% improvement in style consistency and +15.5% improvement in speech naturalness over best baseline. Dataset covers 59 style tags including speaker-level intrinsic and utterance-level situational tags.

Conclusion: ParaSpeechCaps successfully scales rich paralinguistic annotations and improves TTS performance. The dataset design choices provide foundation for future work in paralinguistic speech modeling.

Abstract: We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions. While rich abstract tags (e.g. guttural, nasal, pained) have been explored in small-scale human-annotated datasets, existing large-scale datasets only cover basic tags (e.g. low-pitched, slow, loud). We combine off-the-shelf text and speech embedders, classifiers and an audio language model to automatically scale rich tag annotations for the first time. ParaSpeechCaps covers a total of 59 style tags, including both speaker-level intrinsic tags and utterance-level situational tags. It consists of 342 hours of human-labelled data (PSC-Base) and 2427 hours of automatically annotated data (PSC-Scaled). We finetune Parler-TTS, an open-source style-prompted TTS model, on ParaSpeechCaps, and achieve improved style consistency (+7.9% Consistency MOS) and speech quality (+15.5% Naturalness MOS) over the best performing baseline that combines existing rich style tag datasets. We ablate several of our dataset design choices to lay the foundation for future work in this space. Our dataset, models and code are released at https://github.com/ajd12342/paraspeechcaps .

[591] Lessons Learnt: Revisit Key Training Strategies for Effective Speech Emotion Recognition in the Wild

Jing-Tong Tzeng, Bo-Hao Su, Ya-Tse Wu, Hsing-Hang Chou, Chi-Chun Lee

Main category: eess.AS

TL;DR: Simple training strategy optimizations (balancing, activation functions, fine-tuning) outperform complex architectures for speech emotion recognition, achieving state-of-the-art valence performance through multi-modal fusion.

DetailsMotivation: To revisit overlooked training strategies in machine learning that are often overshadowed by deeper architectures, aiming to enhance speech emotion recognition in naturalistic conditions with minimal architectural changes.

Method: Multi-modal fusion model with separate fine-tuning of RoBERTa and WavLM in single-modality settings, followed by feature fusion without training the backbone extractor. Used focal loss and optimized activation functions.

Result: Achieved valence CCC of 0.6953, the best valence score in Task 2: Emotional Attribute Regression. Simple modifications significantly improved generalization performance.

Conclusion: Refining core training components rather than deepening models leads to more robust speech emotion recognition in-the-wild, demonstrating that optimization strategies can outperform architectural complexity.

Abstract: In this study, we revisit key training strategies in machine learning often overlooked in favor of deeper architectures. Specifically, we explore balancing strategies, activation functions, and fine-tuning techniques to enhance speech emotion recognition (SER) in naturalistic conditions. Our findings show that simple modifications improve generalization with minimal architectural changes. Our multi-modal fusion model, integrating these optimizations, achieves a valence CCC of 0.6953, the best valence score in Task 2: Emotional Attribute Regression. Notably, fine-tuning RoBERTa and WavLM separately in a single-modality setting, followed by feature fusion without training the backbone extractor, yields the highest valence performance. Additionally, focal loss and activation functions significantly enhance performance without increasing complexity. These results suggest that refining core components, rather than deepening models, leads to more robust SER in-the-wild.
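
Of the components credited above, focal loss is standard and easy to reproduce; a minimal implementation, with the usual gamma=2 default assumed:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss: the (1 - p_t)^gamma factor down-weights easy examples
    so training focuses on hard, minority-class utterances, which helps
    balance emotion classes without architectural changes."""
    log_p = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_p, targets, reduction='none')   # per-sample -log p_t
    p_t = ce.neg().exp()
    return ((1 - p_t) ** gamma * ce).mean()
```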

[592] MAGE: A Coarse-to-Fine Speech Enhancer with Masked Generative Model

The Hieu Pham, Tan Dat Nguyen, Phuong Thanh Tran, Joon Son Chung, Duc Dung Nguyen

Main category: eess.AS

TL;DR: MAGE is a Masked Audio Generative Enhancer that improves speech enhancement through efficient coarse-to-fine masking and a lightweight corrector module, achieving state-of-the-art perceptual quality with reduced parameters.

DetailsMotivation: Speech enhancement faces a trade-off between efficiency and perceptual quality. Current masked generative models use random masking, which is inefficient and lacks generalization.

Method: MAGE employs scarcity-aware coarse-to-fine masking (prioritizing frequent tokens early, rare tokens later) and a lightweight corrector module for stable inference. Built on BigCodec and finetuned from Qwen2.5-0.5B, reduced to 200M parameters via selective layer retention.

Result: Experiments on DNS Challenge and noisy LibriSpeech show MAGE achieves state-of-the-art perceptual quality and significantly reduces word error rate for downstream recognition, outperforming larger baselines.

Conclusion: MAGE advances generative speech enhancement with a compact, robust design that balances efficiency and perceptual quality through intelligent masking strategies and correction mechanisms.

Abstract: Speech enhancement remains challenging due to the trade-off between efficiency and perceptual quality. In this paper, we introduce MAGE, a Masked Audio Generative Enhancer that advances generative speech enhancement through a compact and robust design. Unlike prior masked generative models with random masking, MAGE employs a scarcity-aware coarse-to-fine masking strategy that prioritizes frequent tokens in early steps and rare tokens in later refinements, improving efficiency and generalization. We also propose a lightweight corrector module that further stabilizes inference by detecting low-confidence predictions and re-masking them for refinement. Built on BigCodec and finetuned from Qwen2.5-0.5B, MAGE is reduced to 200M parameters through selective layer retention. Experiments on DNS Challenge and noisy LibriSpeech show that MAGE achieves state-of-the-art perceptual quality and significantly reduces word error rate for downstream recognition, outperforming larger baselines. Audio examples are available at https://hieugiaosu.github.io/MAGE/.
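
The scarcity-aware ordering can be sketched from token statistics alone; how MAGE folds this ordering into its decoding step schedule is not reproduced here, so this only builds the frequency-based priority, which is an assumption:

```python
import torch

def scarcity_unmask_order(tokens, corpus_counts):
    """Order token positions for coarse-to-fine decoding: positions
    holding frequent codec tokens come first (early steps), rare tokens
    last (later refinement)."""
    freq = corpus_counts[tokens]                  # per-position corpus frequency
    return torch.argsort(freq, descending=True)   # frequent first, rare last

# toy usage: 6 token positions, vocabulary of 4 codec tokens
tokens = torch.tensor([3, 0, 0, 2, 1, 0])
corpus_counts = torch.tensor([900, 50, 10, 200])
order = scarcity_unmask_order(tokens, corpus_counts)
```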

[593] Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration

Yifan Yang, Bing Han, Hui Wang, Long Zhou, Wei Wang, Mingyu Cui, Xu Tan, Xie Chen

Main category: eess.AS

TL;DR: ProsodyEval is a new dataset and DS-WED metric for assessing prosody diversity in TTS systems, showing better correlation with human perception than existing acoustic metrics.

DetailsMotivation: Existing acoustic metrics capture only partial views of prosodic variation and correlate poorly with human perception, leaving prosody diversity quantification underexplored.

Method: Created ProsodyEval dataset with 1000 speech samples from 7 TTS systems and 2000 human ratings, then proposed DS-WED metric using weighted edit distance over semantic tokens from HuBERT and WavLM.

Result: DS-WED achieves substantially higher correlation with human judgments than existing acoustic metrics, and benchmarking reveals factors influencing prosody diversity including generative modeling paradigms and duration control.

Conclusion: Current large audio language models remain limited in capturing prosodic variations, and DS-WED provides a robust objective metric for prosody diversity assessment.

Abstract: Prosody diversity is essential for achieving naturalness and expressiveness in zero-shot text-to-speech (TTS). However, frequently used acoustic metrics capture only partial views of prosodic variation and correlate poorly with human perception, leaving the problem of reliably quantifying prosody diversity underexplored. To bridge this gap, we introduce ProsodyEval, a prosody diversity assessment dataset that provides Prosody Mean Opinion Score (PMOS) alongside conventional acoustic metrics. ProsodyEval comprises 1000 speech samples derived from 7 mainstream TTS systems, with 2000 human ratings. Building on this, we propose the Discretized Speech Weighted Edit Distance (DS-WED), a new objective diversity metric that quantifies prosodic variation via weighted edit distance over semantic tokens. Experiments on ProsodyEval show that DS-WED achieves substantially higher correlation with human judgments than existing acoustic metrics, while remaining highly robust in speech tokenization from HuBERT and WavLM. Leveraging DS-WED, we benchmark state-of-the-art open-source TTS systems on LibriSpeech test-clean and Seed-TTS test-en, and further explorations uncover several factors that influence prosody diversity, including generative modeling paradigms, duration control, and reinforcement learning. Moreover, we find that current large audio language models (LALMs) remain limited in capturing prosodic variations. Audio samples are available at https://prosodyeval.github.io.
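
At its core, DS-WED is a weighted edit distance over discrete semantic token sequences; the dynamic program below is the textbook recurrence, with the substitution cost left as a pluggable function since the paper's exact weighting is not reproduced:

```python
def weighted_edit_distance(a, b, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Classic edit-distance DP over two token sequences, with a
    caller-supplied substitution cost (e.g., token dissimilarity)."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + del_cost,
                          d[i][j - 1] + ins_cost,
                          d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return d[m][n]

# toy usage: zero cost for identical tokens, unit cost otherwise
dist = weighted_edit_distance([1, 2, 3], [1, 3, 3],
                              sub_cost=lambda x, y: 0.0 if x == y else 1.0)
```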

eess.IV

[594] Optimal Transport Based Hyperspectral Unmixing for Highly Mixed Observations

D. Doutsas, B. Figliuzzi

Main category: eess.IV

TL;DR: A novel optimal transport-based approach for blind hyperspectral unmixing that uses OT to measure distribution discrepancy as regularization, improving endmember estimation in highly mixed data scenarios.

DetailsMotivation: To address the challenge of highly mixed data in blind hyperspectral unmixing by better constraining abundance distributions using optimal transport theory.

Method: Proposes using optimal transport to measure discrepancy between estimated abundance matrix and targeted Dirichlet distribution, incorporating this as a regularization term in optimization. Demonstrated with unsupervised deep learning approach.

Result: Experiments show the method allows better estimation of endmembers in highly mixed data and displays robustness to choice of target abundance distribution.

Conclusion: The optimal transport-based regularization approach effectively improves blind hyperspectral unmixing performance, particularly for highly mixed data scenarios.

Abstract: We propose a novel approach based on optimal transport (OT) for tackling the problem of highly mixed data in blind hyperspectral unmixing. Our method constrains the distribution of the estimated abundance matrix to resemble a targeted Dirichlet distribution more closely. The novelty lies in using OT to measure the discrepancy between the targeted and true abundance distributions, which we incorporate as a regularization term in our optimization problem. We demonstrate the efficiency of our method through a case study involving an unsupervised deep learning approach. Our experiments show that the proposed approach allows for a better estimation of the endmembers in the presence of highly mixed data, while displaying robustness to the choice of target abundance distribution.
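
The regularizer can be illustrated with a small entropic-OT (Sinkhorn) computation between estimated abundance vectors and Dirichlet samples; uniform marginals and a squared-Euclidean cost are assumptions for this sketch:

```python
import numpy as np

def sinkhorn_cost(X, Y, reg=0.05, n_iter=200):
    """Entropic OT cost between two point clouds on the simplex, usable
    as a penalty pulling estimated abundances X toward samples Y from a
    target Dirichlet distribution."""
    M = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # pairwise cost matrix
    K = np.exp(-M / reg)
    a = np.full(len(X), 1.0 / len(X))    # uniform source marginal
    b = np.full(len(Y), 1.0 / len(Y))    # uniform target marginal
    v = np.ones(len(Y))
    for _ in range(n_iter):              # Sinkhorn scaling iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]      # entropic transport plan
    return (P * M).sum()

# usage sketch: penalize distance to Dirichlet(alpha) samples
rng = np.random.default_rng(0)
abund = rng.dirichlet([2.0, 2.0, 2.0], size=64)   # stand-in estimates
target = rng.dirichlet([0.5, 0.5, 0.5], size=64)
penalty = sinkhorn_cost(abund, target)
```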

[595] Super-resolution of 4D flow MRI through inverse problem explicit solving

Aurélien de Turenne, Rémi Cart-Lamy, Denis Kouamé

Main category: eess.IV

TL;DR: A novel method for super-resolution and denoising of 4D Flow MRI using complex domain inverse problem solving to enhance spatial resolution and reduce noise without large training datasets.

DetailsMotivation: 4D Flow MRI's clinical utility is limited by low spatial resolution and poor signal-to-noise ratio due to acquisition time constraints, requiring improved image quality.

Method: Explicit solution of inverse problem in complex domain using clinically available magnitude/velocity images, modeling resolution degradation as convolution with subsampling, employing fast non-iterative algorithm per velocity direction.

Result: Validation on synthetic CFD data and physical phantom experiments shows enhanced velocity field resolution and noise reduction without iterative solvers or large training datasets.

Conclusion: The proposed approach demonstrates potential for improving 4D Flow MRI quality through complex domain processing, offering a practical solution for clinical applications.

Abstract: Four-dimensional Flow MRI (4D Flow MRI) enables non-invasive, time-resolved imaging of blood flow in three spatial dimensions, offering valuable insights into complex hemodynamics. However, its clinical utility is limited by low spatial resolution and poor signal-to-noise ratio (SNR), imposed by acquisition time constraints. In this work, we propose a novel method for super-resolution and denoising of 4D Flow MRI based on the explicit solution of an inverse problem formulated in the complex domain. Using clinically available magnitude and velocity images, we reconstruct complex-valued spatial signals and model resolution degradation as a convolution followed by subsampling. A fast, non-iterative algorithm is employed to solve the inverse problem independently for each velocity direction. We validate our method on synthetic data generated from computational fluid dynamics (CFD) and on physical phantom experiments acquired with 4D Flow MRI. Results demonstrate the potential of our approach to enhance velocity field resolution and reduce noise without the need for large training datasets or iterative solvers.
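
Setting aside the subsampling operator and the complex-valued signal model for brevity, the non-iterative solve reduces to a regularized inverse filter in the Fourier domain; a hedged sketch of that deconvolution step only:

```python
import numpy as np

def wiener_deblur(y, h, noise_power=1e-2):
    """Closed-form (Wiener-type) deconvolution: with degradation modeled
    as convolution with kernel h, each velocity component is recovered
    independently by a single regularized inverse filter, with no
    iterative solver or training data."""
    H = np.fft.fftn(h, s=y.shape)                 # transfer function of the blur
    Y = np.fft.fftn(y)
    X = np.conj(H) * Y / (np.abs(H) ** 2 + noise_power)   # regularized inverse
    return np.real(np.fft.ifftn(X))
```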

[596] Intermediate Domain-guided Adaptation for Unsupervised Chorioallantoic Membrane Vessel Segmentation

Pengwu Song, Zhiping Wang, Peng Yao, Liang Xu, Shuwei Shen, Pengfei Shao, Mingzhai Sun, Ronald X. Xu

Main category: eess.IV

TL;DR: Proposes an Intermediate Domain-guided Adaptation (IDA) method for unsupervised vessel segmentation in chorioallantoic membrane (CAM) images by leveraging similarity with retinal images and existing retinal datasets.

DetailsMotivation: Manual segmentation of CAM blood vessels is time-consuming and subjective, while existing CAM vessel segmentation algorithms are limited with poor performance due to lack of public datasets.

Method: Uses Multi-Resolution Asymmetric Translation (MRAT) to generate intermediate images and Intermediate Domain-guided Contrastive Learning (IDCL) to disentangle cross-domain feature representations, overcoming limitations of traditional UDA approaches.

Result: Extensive experiments on the first CAM dataset show the method outperforms compared approaches and achieves superior performance in UDA tasks across retinal datasets, demonstrating strong generalization capability.

Conclusion: The proposed IDA method effectively addresses CAM vessel segmentation challenges by leveraging intermediate domain information and achieves state-of-the-art performance with good generalization.

Abstract: The chorioallantoic membrane (CAM) model is a widely used in vivo platform for studying angiogenesis, especially in relation to tumor growth, drug delivery, and vascular biology. Since the topology and morphology of developing blood vessels are key evaluation metrics, accurate vessel segmentation is essential for quantitative analysis of angiogenesis. However, manual segmentation is extremely time-consuming, labor-intensive, and prone to inconsistency due to its subjective nature. Moreover, research on CAM vessel segmentation algorithms remains limited, and the lack of public datasets contributes to poor prediction performance. To address these challenges, we propose an innovative Intermediate Domain-guided Adaptation (IDA) method, which utilizes the similarity between CAM images and retinal images, along with existing public retinal datasets, to perform unsupervised training on CAM images. Specifically, we introduce a Multi-Resolution Asymmetric Translation (MRAT) strategy to generate intermediate images to promote image-level interaction. Then, an Intermediate Domain-guided Contrastive Learning (IDCL) module is developed to disentangle cross-domain feature representations. This method overcomes the limitations of existing unsupervised domain adaptation (UDA) approaches, which primarily concentrate on direct source-target alignment while neglecting intermediate domain information. Notably, we create the first CAM dataset to validate the proposed algorithm. Extensive experiments on this dataset show that our method outperforms the compared approaches. Moreover, it achieves superior performance in UDA tasks across retinal datasets, highlighting its strong generalization capability. The CAM dataset and source codes are available at https://github.com/Light-47/IDA.

[597] Frequency-Compensated Network for Daily Arctic Sea Ice Concentration Prediction

Jialiang Zhang, Feng Gao, Yanhai Gan, Junyu Dong, Qian Du

Main category: eess.IV

TL;DR: FCNet is a dual-branch neural network that combines frequency domain analysis with convolutional features for improved Arctic sea ice concentration prediction, addressing limitations in long-term dependency modeling and high-frequency detail preservation.

DetailsMotivation: Current sea ice concentration forecasting methods fail to adequately explore long-term feature dependencies in the frequency domain and struggle to preserve high-frequency details, particularly in marginal sea ice areas where accurate change detection is crucial.

Method: A dual-branch network with frequency feature extraction (using adaptive frequency filter blocks combining trainable layers with Fourier-based filters) and convolutional feature extraction (using high-frequency enhancement blocks with channel-wise attention and temporal attention units for low-frequency features).

Result: Extensive experiments on satellite-derived daily SIC dataset demonstrate the effectiveness of FCNet in achieving refined prediction of edges and details in sea ice concentration.

Conclusion: FCNet successfully addresses the challenges of long-term frequency domain dependencies and high-frequency detail preservation, providing improved Arctic sea ice concentration forecasting through its innovative dual-branch architecture with frequency compensation.

Abstract: Accurately forecasting sea ice concentration (SIC) in the Arctic is critical to global ecosystem health and navigation safety. However, current methods are still confronted with two challenges: 1) they rarely explore long-term feature dependencies in the frequency domain; 2) they can hardly preserve high-frequency details, so changes in the marginal area of the sea ice cannot be accurately captured. To this end, we present a Frequency-Compensated Network (FCNet) for Arctic SIC prediction on a daily basis. In particular, we design a dual-branch network, including branches for frequency feature extraction and convolutional feature extraction. For frequency feature extraction, we design an adaptive frequency filter block, which integrates trainable layers with Fourier-based filters. By adding frequency features, FCNet achieves refined prediction of edges and details. For convolutional feature extraction, we propose a high-frequency enhancement block to separate high- and low-frequency information. Moreover, high-frequency features are enhanced via channel-wise attention, and a temporal attention unit is employed for low-frequency feature extraction to capture long-range sea ice changes. Extensive experiments are conducted on a satellite-derived daily SIC dataset, and the results verify the effectiveness of the proposed FCNet. Our codes and data will be made publicly available at: https://github.com/oucailab/FCNet.
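
The adaptive frequency filter block combines trainable layers with Fourier-based filtering. A minimal PyTorch sketch of this idea, with layer names and shapes assumed rather than taken from the paper, applies a learnable complex mask in the rFFT domain:

```python
import torch
import torch.nn as nn

class AdaptiveFrequencyFilter(nn.Module):
    """Sketch of a learnable Fourier-domain filter: FFT, element-wise
    multiplication by a trainable complex mask, inverse FFT. Illustrative
    of the idea behind FCNet's adaptive frequency filter block, not the
    authors' implementation."""
    def __init__(self, channels, height, width):
        super().__init__()
        self.weight = nn.Parameter(
            0.02 * torch.randn(channels, height, width // 2 + 1, dtype=torch.cfloat)
        )

    def forward(self, x):                      # x: (B, C, H, W), real-valued
        spec = torch.fft.rfft2(x, norm="ortho")
        spec = spec * self.weight              # learnable per-frequency filtering
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
```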

[598] Adaptive Weight Modified Riesz Mean Filter For High-Density Salt and Pepper Noise Removal

Md Jahidul Islam

Main category: eess.IV

TL;DR: The paper introduces AWMRmF, a novel filter that outperforms existing state-of-the-art filters in removing high-density salt and pepper noise, achieving superior PSNR and SSIM metrics across various noise levels.

DetailsMotivation: To develop an effective filter for removing high-density salt and pepper noise (60-95% noise levels) that improves upon existing methods like AFMF, AWMF, ACmF, ARmF, and IAWMF.

Method: AWMRmF integrates a pixel weight function and adaptivity condition inspired by DAMRmF. Performance was evaluated on 26 test images with noise levels ranging from 60% to 95% using PSNR and SSIM metrics.

Result: AWMRmF demonstrated superior performance compared to all other filters in both PSNR and SSIM metrics, and also achieved better mean PSNR and SSIM results.

Conclusion: The proposed AWMRmF filter is highly effective for high-density salt and pepper noise removal and outperforms current state-of-the-art filtering methods.

Abstract: This paper introduces a novel filter, the Adaptive Weight Modified Riesz Mean Filter (AWMRmF), designed for the effective removal of high-density salt and pepper noise (SPN). AWMRmF integrates a pixel weight function and adaptivity condition inspired by the Different Adaptive Modified Riesz Mean Filter (DAMRmF). In my simulations, I evaluated the performance of AWMRmF against established filters such as Adaptive Frequency Median Filter (AFMF), Adaptive Weighted Mean Filter (AWMF), Adaptive Cesaro Mean Filter (ACmF), Adaptive Riesz Mean Filter (ARmF), and Improved Adaptive Weighted Mean Filter (IAWMF). The assessment was conducted on 26 typical test images, varying noise levels from 60% to 95%. The findings indicate that, in terms of both Peak Signal to Noise Ratio (PSNR) and Structural Similarity (SSIM) metrics, AWMRmF outperformed other state-of-the-art filters. Furthermore, AWMRmF demonstrated superior performance in mean PSNR and SSIM results as well.
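
Filters in the ARmF/AWMRmF family share a common skeleton: flag extreme-valued pixels as salt-and-pepper noise, adaptively grow a window until it contains noise-free neighbors, and replace the noisy pixel with a distance-weighted mean. The sketch below follows that skeleton with an illustrative weight function; the paper's exact Riesz weights and adaptivity condition may differ.

```python
import numpy as np

def adaptive_weighted_mean_filter(img, max_radius=7):
    """Sketch of an adaptive weighted-mean filter for salt-and-pepper noise.
    Pixels at the extremes (0 or 255) are treated as noisy and replaced by a
    distance-weighted mean of noise-free neighbors inside an adaptively
    grown window. Weights are illustrative, not the paper's Riesz weights."""
    out = img.astype(np.float64)
    noisy = (img == 0) | (img == 255)
    H, W = img.shape
    for y, x in zip(*np.nonzero(noisy)):
        for r in range(1, max_radius + 1):           # grow window adaptively
            y0, y1 = max(0, y - r), min(H, y + r + 1)
            x0, x1 = max(0, x - r), min(W, x + r + 1)
            patch = img[y0:y1, x0:x1]
            clean = ~((patch == 0) | (patch == 255))
            if clean.any():                          # noise-free pixels found
                yy, xx = np.mgrid[y0:y1, x0:x1]
                w = clean / (1.0 + (yy - y) ** 2 + (xx - x) ** 2)
                out[y, x] = (w * patch).sum() / w.sum()
                break
    return out.astype(img.dtype)
```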

[599] CryoSplat: Gaussian Splatting for Cryo-EM Homogeneous Reconstruction

Suyi Chen, Haibin Ling

Main category: eess.IV

TL;DR: cryoSplat is a GMM-based method that integrates Gaussian splatting with cryo-EM physics for 3D molecular reconstruction from 2D projections, enabling stable reconstruction with random initialization.

DetailsMotivation: Existing GMM methods for cryo-EM reconstruction require external consensus maps or atomic models for initialization, limiting their use in self-contained pipelines. Off-the-shelf Gaussian splatting methods are incompatible with cryo-EM due to physics mismatches.

Method: Developed orthogonal projection-aware Gaussian splatting with view-dependent normalization and FFT-aligned coordinate system specifically tailored for cryo-EM imaging physics.

Result: Experimental results on real datasets validate cryoSplat’s effectiveness and robustness over representative baselines, enabling stable reconstruction directly from raw particle images.

Conclusion: cryoSplat successfully bridges Gaussian splatting with cryo-EM physics, providing an efficient and self-contained reconstruction method that works with random initialization.

Abstract: As a critical modality for structural biology, cryogenic electron microscopy (cryo-EM) facilitates the determination of macromolecular structures at near-atomic resolution. The core computational task in single-particle cryo-EM is to reconstruct the 3D electrostatic potential of a molecule from noisy 2D projections acquired at unknown orientations. Gaussian mixture models (GMMs) provide a continuous, compact, and physically interpretable representation for molecular density and have recently gained interest in cryo-EM reconstruction. However, existing methods rely on external consensus maps or atomic models for initialization, limiting their use in self-contained pipelines. In parallel, differentiable rendering techniques such as Gaussian splatting have demonstrated remarkable scalability and efficiency for volumetric representations, suggesting a natural fit for GMM-based cryo-EM reconstruction. However, off-the-shelf Gaussian splatting methods are designed for photorealistic view synthesis and remain incompatible with cryo-EM due to mismatches in the image formation physics, reconstruction objectives, and coordinate systems. Addressing these issues, we propose cryoSplat, a GMM-based method that integrates Gaussian splatting with the physics of cryo-EM image formation. In particular, we develop an orthogonal projection-aware Gaussian splatting, with adaptations such as a view-dependent normalization term and FFT-aligned coordinate system tailored for cryo-EM imaging. These innovations enable stable and efficient homogeneous reconstruction directly from raw cryo-EM particle images using random initialization. Experimental results on real datasets validate the effectiveness and robustness of cryoSplat over representative baselines. The code will be released upon publication.
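
The geometric fact behind orthogonal-projection splatting is that a 3D Gaussian integrated along the viewing axis yields a 2D Gaussian whose covariance is the top-left 2x2 block of the rotated covariance, with the line integral contributing a view-dependent amplitude factor. A minimal sketch of that geometry follows; the paper builds its cryo-EM-specific normalization and FFT-aligned coordinate convention on top of it.

```python
import numpy as np

def splat_gaussian_orthogonal(mu, cov, R, amplitude=1.0):
    """Project one 3D Gaussian (mean mu, covariance cov) along the camera
    z axis given rotation R (world -> camera). Returns the 2D mean, 2D
    covariance, and the amplitude scaled by the line integral along z."""
    mu_cam = R @ mu
    cov_cam = R @ cov @ R.T
    A = cov_cam[:2, :2]                       # marginal 2D covariance
    b = cov_cam[:2, 2]
    c = cov_cam[2, 2]
    var_cond = c - b @ np.linalg.solve(A, b)  # conditional z variance
    amp_2d = amplitude * np.sqrt(2.0 * np.pi * var_cond)  # view-dependent term
    return mu_cam[:2], A, amp_2d
```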

[600] DermINO: Hybrid Pretraining for a Versatile Dermatology Foundation Model

Jingkai Xu, De Cheng, Xiangqian Zhao, Jungang Yang, Zilong Wang, Xinyang Jiang, Xufang Luo, Lili Chen, Xiaoli Ning, Chengxu Li, Xinzhu Zhou, Xuejiao Song, Ang Li, Qingyue Xia, Zhou Zhuang, Hongfei Ouyang, Ke Xue, Yujun Sheng, Rusong Meng, Feng Xu, Xi Yang, Weimin Ma, Yusheng Lee, Dongsheng Li, Xinbo Gao, Jianming Liang, Lili Qiu, Nannan Wang, Xianbo Zuo, Cui Yong

Main category: eess.IV

TL;DR: DermNIO is a versatile foundation model for dermatology that addresses limitations of current AI tools by using a novel hybrid pretraining framework on 432,776 images, achieving superior performance across 20 datasets and various clinical tasks.

DetailsMotivation: Skin diseases affect up to 70% of the population, yet diagnosis is complex and dermatologists are in short supply, especially in resource-limited areas. Current AI models are limited by their reliance on large labeled datasets and narrow task-specific designs.

Method: DermNIO uses a hybrid pretraining framework combining self-supervised learning with semi-supervised learning and knowledge-guided prototype initialization, trained on 432,776 images from public repositories, web-sourced images, and proprietary collections.

Result: DermNIO outperforms state-of-the-art models across 20 datasets, excelling in malignancy classification, disease severity grading, multi-category diagnosis, image captioning, and lesion segmentation. In a blinded study with 23 dermatologists, it achieved 95.79% accuracy vs clinicians’ 73.66%, and improved clinician performance by 17.21%.

Conclusion: DermNIO demonstrates strong generalization capability and robustness across diverse clinical tasks and populations, showing significant potential to address dermatology’s diagnostic challenges and resource limitations through AI assistance.

Abstract: Skin diseases impose a substantial burden on global healthcare systems, driven by their high prevalence (affecting up to 70% of the population), complex diagnostic processes, and a critical shortage of dermatologists in resource-limited areas. While artificial intelligence (AI) tools have demonstrated promise in dermatological image analysis, current models face limitations: they often rely on large, manually labeled datasets and are built for narrow, specific tasks, making them less effective in real-world settings. To tackle these limitations, we present DermNIO, a versatile foundation model for dermatology. Trained on a curated dataset of 432,776 images from three sources (public repositories, web-sourced images, and proprietary collections), DermNIO incorporates a novel hybrid pretraining framework that augments the self-supervised learning paradigm through semi-supervised learning and knowledge-guided prototype initialization. This integrated method not only deepens the understanding of complex dermatological conditions, but also substantially enhances generalization across various clinical tasks. Evaluated across 20 datasets, DermNIO consistently outperforms state-of-the-art models across a wide range of tasks. It excels in high-level clinical applications including malignancy classification, disease severity grading, multi-category diagnosis, and dermatological image captioning, while also achieving state-of-the-art performance in low-level tasks such as skin lesion segmentation. Furthermore, DermNIO demonstrates strong robustness in privacy-preserving federated learning scenarios and across diverse skin types and sexes. In a blinded reader study with 23 dermatologists, DermNIO achieved 95.79% diagnostic accuracy (versus clinicians' 73.66%), and AI assistance improved clinician performance by 17.21%.
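
The abstract does not spell out the knowledge-guided prototype initialization; one plausible, purely illustrative reading is to initialize class prototypes as mean embeddings of a small labeled subset and score unlabeled images against them for the semi-supervised branch:

```python
import torch
import torch.nn.functional as F

def init_prototypes(embeddings, labels, num_classes):
    """Hypothetical knowledge-guided prototype initialization: one prototype
    per class as the normalized mean embedding of its labeled examples."""
    protos = torch.stack(
        [embeddings[labels == c].mean(dim=0) for c in range(num_classes)]
    )
    return F.normalize(protos, dim=-1)

def prototype_logits(features, prototypes, tau=0.1):
    """Cosine-similarity logits against the prototypes, e.g. to pseudo-label
    unlabeled images for a semi-supervised objective (assumed usage)."""
    return F.normalize(features, dim=-1) @ prototypes.T / tau
```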

[601] Cross-Cancer Knowledge Transfer in WSI-based Prognosis Prediction

Pei Liu, Luping Ji, Jiaxiang Gou, Xiangxiang Zeng

Main category: eess.IV

TL;DR: CROPKT introduces a paradigm shift from cancer-specific models to cross-cancer knowledge transfer for WSI-based prognosis prediction, addressing scalability issues with rare tumors and computational efficiency.

DetailsMotivation: Current cancer-specific models struggle with rare tumors and cannot leverage knowledge from other cancers. Multi-task learning frameworks are computationally expensive and require iterative training on large datasets.

Method: The study curates a large multi-cancer dataset (UNI2-h-DSS) with 26 cancers, designs experiments to understand transferability mechanisms, and proposes a routing-based baseline approach (ROUPKT) for efficient knowledge transfer from off-the-shelf models.

Result: CROPKT systematically measures WSI-based prognostic knowledge transferability across different cancers, provides insights into transfer mechanisms, and demonstrates efficient utilization of cross-cancer knowledge.

Conclusion: CROPKT serves as a foundational study for the nascent paradigm of cross-cancer knowledge transfer in WSI-based prognosis prediction, offering a more scalable and efficient alternative to traditional approaches.

Abstract: The Whole-Slide Image (WSI) is an important tool for estimating cancer prognosis. Current studies generally follow a conventional cancer-specific paradigm in which one cancer corresponds to one model. However, this paradigm naturally struggles to scale to rare tumors and cannot utilize the knowledge of other cancers. Although multi-task learning-like frameworks have been studied recently, they usually place high demands on computational resources and incur considerable cost for iterative training on ultra-large multi-cancer WSI datasets. To this end, this paper makes a paradigm shift to knowledge transfer and presents the first preliminary yet systematic study of cross-cancer prognosis knowledge transfer in WSIs, called CROPKT. It has three major parts: (i) we curate a large dataset (UNI2-h-DSS) with 26 cancers and use it to measure the transferability of WSI-based prognostic knowledge across different cancers (including rare tumors); (ii) beyond a simple benchmark evaluation, we design a range of experiments to gain deeper insights into the underlying mechanism of transferability; (iii) we further show the utility of cross-cancer knowledge transfer by proposing a routing-based baseline approach (ROUPKT) that can often efficiently utilize the knowledge transferred from off-the-shelf models of other cancers. We hope CROPKT can serve as a starting point and lay the foundation for this nascent paradigm, i.e., WSI-based prognosis prediction with cross-cancer knowledge transfer. Our source code is available at https://github.com/liupei101/CROPKT.
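
ROUPKT is described only at a high level here. As a hypothetical illustration of the routing idea, a small trainable router could mix risk scores from frozen off-the-shelf models of other cancers; all module names and shapes below are assumed:

```python
import torch
import torch.nn as nn

class RoutedPrognosis(nn.Module):
    """Hypothetical routing baseline: a trainable router mixes scalar risk
    predictions from frozen source-cancer models (the 'experts'). Each
    expert is assumed to map slide-level features (B, feat_dim) to (B, 1)."""
    def __init__(self, experts, feat_dim):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        for p in self.experts.parameters():      # keep off-the-shelf models frozen
            p.requires_grad_(False)
        self.router = nn.Linear(feat_dim, len(self.experts))

    def forward(self, wsi_feats):
        weights = torch.softmax(self.router(wsi_feats), dim=-1)           # (B, E)
        risks = torch.stack(
            [e(wsi_feats).squeeze(-1) for e in self.experts], dim=-1
        )                                                                 # (B, E)
        return (weights * risks).sum(dim=-1)                              # (B,)
```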

[602] Robust Pan-Cancer Mitotic Figure Detection with YOLOv12

Raphaël Bourgade, Guillaume Balezo, Thomas Walter

Main category: eess.IV

TL;DR: The paper presents a YOLOv12-based mitosis detection method that achieved second place in the MIDOG 2025 challenge, demonstrating robust performance on complex whole-slide images without external data.

DetailsMotivation: Mitotic figure identification is crucial for tumor pathology but suffers from high inter-observer variability among pathologists, necessitating automated detection methods.

Method: The authors developed a mitosis detection approach using the state-of-the-art YOLOv12 object detection architecture.

Result: The method achieved an F1-score of 0.801 on the preliminary test set (hotspots only) and ranked second in the final test with an F1-score of 0.7216 on complex whole-slide regions.

Conclusion: The YOLOv12-based approach demonstrates strong performance in mitosis detection, showing promise for robust automated analysis in tumor pathology.

Abstract: Mitotic figures represent a key histoprognostic feature in tumor pathology, providing crucial insights into tumor aggressiveness and proliferation. However, their identification remains challenging, subject to significant inter-observer variability, even among experienced pathologists. To address this issue, the MItosis DOmain Generalization (MIDOG) 2025 challenge marks the third edition of an international competition aiming to develop robust mitosis detection algorithms. In this paper, we present a mitotic figure detection approach based on the state-of-the-art YOLOv12 object detection architecture. Our method achieved an F1-score of 0.801 on the preliminary test set (hotspots only) and ranked second on the final test leaderboard with an F1-score of 0.7216 across complex and heterogeneous whole-slide regions, without relying on external data.
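
Applying a tile-based detector to large, heterogeneous whole-slide regions typically means tiling with overlap, running the detector per tile, shifting boxes back to region coordinates, and merging duplicates with non-maximum suppression. The sketch below illustrates this common pattern; `detector` is a hypothetical callable returning `[x1, y1, x2, y2, score]` rows per tile, and the tile size and thresholds are illustrative, not the authors' settings.

```python
import numpy as np

def detect_mitoses(region, detector, tile=640, overlap=64, iou_thr=0.5):
    """Tile a large region, detect per tile, shift boxes to region
    coordinates, then merge duplicates from overlapping tiles with NMS."""
    boxes = []
    step = tile - overlap
    H, W = region.shape[:2]
    for y in range(0, max(H - overlap, 1), step):
        for x in range(0, max(W - overlap, 1), step):
            for x1, y1, x2, y2, s in detector(region[y:y + tile, x:x + tile]):
                boxes.append([x1 + x, y1 + y, x2 + x, y2 + y, s])
    boxes = np.array(boxes).reshape(-1, 5)
    keep = []                                   # standard greedy NMS
    order = boxes[:, 4].argsort()[::-1]
    while order.size:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-9)
        order = order[1:][iou < iou_thr]
    return boxes[np.array(keep, dtype=int)]
```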

Last updated: 2025-10-13