Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 61]
- cs.CV [Total: 87]
- cs.AI [Total: 32]
- cs.SD [Total: 7]
- cs.LG [Total: 95]
- cs.MA [Total: 0]
- cs.MM [Total: 1]
- eess.AS [Total: 8]
- eess.IV [Total: 5]
cs.CL
[1] CoBA: Counterbias Text Augmentation for Mitigating Various Spurious Correlations via Semantic Triples
Kyohoon Jin, Juhwan Choi, Jungmin Yun, Junho Lee, Soojin Jang, Youngbin Kim
Main category: cs.CL
TL;DR: CoBA is a counterbias data augmentation framework that decomposes text into semantic triples and modifies them to disrupt spurious correlations, improving model robustness and reducing biases.
Details
Motivation: Deep learning models often rely on spurious correlations in training data, leading to poor generalization and performance degradation on unseen data.
Method: Counterbias data augmentation framework that decomposes text into subject-predicate-object triples, selectively modifies these triples to disrupt spurious correlations, and reconstructs text from adjusted triples (a toy sketch of this loop follows the abstract below).
Result: Improves downstream task performance, effectively reduces biases, and strengthens out-of-distribution resilience.
Conclusion: CoBA offers a versatile and robust solution to address challenges posed by spurious correlations in deep learning models.
Abstract: Deep learning models often learn and exploit spurious correlations in training data, using these non-target features to inform their predictions. Such reliance leads to performance degradation and poor generalization on unseen data. To address these limitations, we introduce a more general form of counterfactual data augmentation, termed counterbias data augmentation, which simultaneously tackles multiple biases (e.g., gender bias, simplicity bias) and enhances out-of-distribution robustness. We present CoBA: CounterBias Augmentation, a unified framework that operates at the semantic triple level: first decomposing text into subject-predicate-object triples, then selectively modifying these triples to disrupt spurious correlations. By reconstructing the text from these adjusted triples, CoBA generates counterbias data that mitigates spurious patterns. Through extensive experiments, we demonstrate that CoBA not only improves downstream task performance, but also effectively reduces biases and strengthens out-of-distribution resilience, offering a versatile and robust solution to the challenges posed by spurious correlations.
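The decompose-perturb-reconstruct loop is the core of the method. Below is a minimal, hypothetical Python sketch of that loop; the function bodies are placeholder stand-ins (a real pipeline would use an OpenIE-style parser and a text generator), not the authors' implementation.

```python
import random

def decompose(text: str) -> list[tuple[str, str, str]]:
    # Stand-in for an open information extraction step that yields
    # (subject, predicate, object) triples; a real system would parse `text`.
    return [("the actor", "delivered", "a heartfelt performance")]

def counterbias_perturb(triples, swap_pool):
    # Swap the object slot to break a spurious feature-label association.
    return [(s, p, random.choice(swap_pool)) for s, p, _ in triples]

def reconstruct(triples) -> str:
    # Naive surface realization; CoBA regenerates fluent text instead.
    return ". ".join(f"{s} {p} {o}" for s, p, o in triples).capitalize() + "."

triples = decompose("The actor delivered a heartfelt performance.")
print(reconstruct(counterbias_perturb(triples, ["a wooden performance"])))
```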
[2] Mapping Toxic Comments Across Demographics: A Dataset from German Public Broadcasting
Jan Fillies, Michael Peter Hoffmann, Rebecca Reichel, Roman Salzwedel, Sven Bodemer, Adrian Paschke
Main category: cs.CL
TL;DR: First large-scale German toxic speech dataset with age demographics, revealing age-based patterns in online toxicity across Instagram, TikTok, and YouTube.
Details
Motivation: Existing toxic speech datasets lack demographic context, limiting understanding of how different age groups communicate toxic content online.
Method: Collaborated with German public service network to create dataset with 3,024 human-annotated and 30,024 LLM-annotated comments from social media platforms, using toxic keywords for relevance and combining human expertise with language models.
Result: Dataset shows 16.7% problematic content, with younger users favoring expressive language and older users more engaged in disinformation and devaluation.
Conclusion: Provides valuable resource for studying linguistic variation across demographics and supports development of more equitable, age-aware content moderation systems.
Abstract: A lack of demographic context in existing toxic speech datasets limits our understanding of how different age groups communicate online. In collaboration with funk, a German public service content network, this research introduces the first large-scale German dataset annotated for toxicity and enriched with platform-provided age estimates. The dataset includes 3,024 human-annotated and 30,024 LLM-annotated anonymized comments from Instagram, TikTok, and YouTube. To ensure relevance, comments were consolidated using predefined toxic keywords, resulting in 16.7% labeled as problematic. The annotation pipeline combined human expertise with state-of-the-art language models, identifying key categories such as insults, disinformation, and criticism of broadcasting fees. The dataset reveals age-based differences in toxic speech patterns, with younger users favoring expressive language and older users more often engaging in disinformation and devaluation. This resource provides new opportunities for studying linguistic variation across demographics and supports the development of more equitable and age-aware content moderation systems.
[3] Granite Embedding R2 Models
Parul Awasthy, Aashka Trivedi, Yulong Li, Meet Doshi, Riyaz Bhat, Vignesh P, Vishwajeet Kumar, Yushu Yang, Bhavani Iyer, Abraham Daniels, Rudra Murthy, Ken Barker, Martin Franz, Madison Lee, Todd Ward, Salim Roukos, David Cox, Luis Lastras, Jaydeep Sen, Radu Florian
Main category: cs.CL
TL;DR: Granite Embedding R2 models are high-performance English encoder-based embedding models for enterprise retrieval, featuring 16x expanded context, state-of-the-art performance across multiple domains, and 19-44% speed improvements over competitors.
Details
Motivation: To address enterprise-scale dense retrieval needs with improved context length, performance across diverse domains (text, code, long-document, conversational, tabular data), and faster retrieval speeds while maintaining accuracy for competitive advantage.
Method: Developed both bi-encoder and cross-encoder architectures including a 22-layer retriever model and efficient 12-layer counterpart, plus a high-quality reranker model, all trained exclusively on enterprise-appropriate data with comprehensive governance oversight.
Result: Models demonstrate exceptional versatility across standard benchmarks, IBM evaluation suites, and real-world enterprise use cases, establishing new performance standards for open-source embedding models with measurable speed advantages of 19-44% over leading competitors.
Conclusion: Granite R2 models provide cutting-edge performance, enterprise-ready licensing, and transparent data provenance for mission-critical deployments, available under Apache 2.0 license for unrestricted research and commercial use.
Abstract: We introduce the Granite Embedding R2 models, a comprehensive family of high-performance English encoder-based embedding models engineered for enterprise-scale dense retrieval applications. Building upon our first-generation release, these models deliver substantial improvements, including 16x expanded context length (8,192 tokens), state-of-the-art performance across diverse retrieval domains - text, code, long-document search, multi-turn conversational, and tabular data - and measurable speed advantages of 19-44% over leading competitors while maintaining superior accuracy. Our release encompasses both bi-encoder and cross-encoder architectures, featuring a highly effective 22-layer retriever model and its efficient 12-layer counterpart, alongside a high-quality reranker model, all trained exclusively on enterprise-appropriate data with comprehensive governance oversight. The models demonstrate exceptional versatility across standard benchmarks, IBM-developed evaluation suites, and real-world enterprise use cases, establishing new performance standards for open-source embedding models. In an era where retrieval speed and accuracy are paramount for competitive advantage, the Granite R2 models deliver a compelling combination of cutting-edge performance, enterprise-ready licensing, and transparent data provenance that organizations require for mission-critical deployments. All models are publicly available under the Apache 2.0 license at https://huggingface.co/collections/ibm-granite, enabling unrestricted research and commercial use.
[4] TrInk: Ink Generation with Transformer Network
Zezhong Jin, Shubhang Desai, Xu Chen, Biyi Fang, Zhuoyi Huang, Zhe Li, Chong-Xin Gan, Xiao Tu, Man-Wai Mak, Yan Lu, Shujie Liu
Main category: cs.CL
TL;DR: TrInk is a Transformer-based model for handwriting generation that uses scaled positional embeddings and Gaussian memory mask to improve text-stroke alignment, achieving significant error rate reductions on benchmark datasets.
Details
Motivation: To develop a more effective handwriting generation model that can better capture global dependencies and improve alignment between input text and generated stroke points, addressing limitations in previous methods.
Method: Transformer-based architecture with scaled positional embeddings and Gaussian memory mask in cross-attention module to enhance text-stroke alignment (sketched after the abstract below). Also designed comprehensive evaluation pipelines for both subjective and objective assessment of legibility and style consistency.
Result: Achieved 35.56% reduction in character error rate (CER) and 29.66% reduction in word error rate (WER) on the IAM-OnDB dataset compared to previous methods.
Conclusion: The proposed TrInk model demonstrates superior performance in handwriting generation with significant improvements in error rates, showing the effectiveness of the Transformer architecture with enhanced attention mechanisms for this task.
Abstract: In this paper, we propose TrInk, a Transformer-based model for ink generation, which effectively captures global dependencies. To better facilitate the alignment between the input text and generated stroke points, we introduce scaled positional embeddings and a Gaussian memory mask in the cross-attention module. Additionally, we design both subjective and objective evaluation pipelines to comprehensively assess the legibility and style consistency of the generated handwriting. Experiments demonstrate that our Transformer-based model achieves a 35.56% reduction in character error rate (CER) and a 29.66% reduction in word error rate (WER) on the IAM-OnDB dataset compared to previous methods. We provide a demo page with handwriting samples from TrInk and baseline models at: https://akahello-a11y.github.io/trink-demo/
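The Gaussian memory mask can be read as an additive bias on the cross-attention logits that pulls each decoder step toward the proportionally aligned region of the input text. A minimal PyTorch sketch, assuming a simple linear alignment and a fixed width sigma (the paper's exact parameterization may differ):

```python
import torch

def gaussian_memory_mask(tgt_len: int, src_len: int, sigma: float = 2.0):
    # Additive bias encouraging decoder step t to attend near the encoder
    # position proportional to t; added to attention scores before softmax.
    t = torch.arange(tgt_len).unsqueeze(1).float()    # (tgt_len, 1)
    j = torch.arange(src_len).unsqueeze(0).float()    # (1, src_len)
    mu = t * (src_len - 1) / max(tgt_len - 1, 1)      # expected alignment
    return -((j - mu) ** 2) / (2 * sigma ** 2)        # (tgt_len, src_len)

scores = torch.randn(4, 10)                           # raw cross-attention logits
attn = torch.softmax(scores + gaussian_memory_mask(4, 10), dim=-1)
```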
[5] How Does Cognitive Bias Affect Large Language Models? A Case Study on the Anchoring Effect in Price Negotiation Simulations
Yoshiki Takenami, Yin Jou Huang, Yugo Murawaki, Chenhui Chu
Main category: cs.CL
TL;DR: LLMs exhibit anchoring bias in price negotiations similar to humans, with reasoning models showing reduced susceptibility but no correlation with personality traits.
Details
Motivation: To investigate cognitive biases like anchoring effect in LLMs during price negotiations and understand how factors like reasoning and personality influence this bias, aiming for safer LLM applications.
Method: Instructed seller LLM agents to apply anchoring effect in price negotiations, evaluated using both objective and subjective metrics, and analyzed relationships with reasoning capabilities and personality traits.
Result: LLMs are influenced by anchoring effect like humans; reasoning models are less prone to anchoring (suggesting chain of thought mitigates effect); no significant correlation found between personality traits and anchoring susceptibility.
Conclusion: Findings enhance understanding of cognitive biases in LLMs and support development of safer, more responsible LLM applications in real-world scenarios.
Abstract: Cognitive biases, well-studied in humans, can also be observed in LLMs, affecting their reliability in real-world applications. This paper investigates the anchoring effect in LLM-driven price negotiations. To this end, we instructed seller LLM agents to apply the anchoring effect and evaluated negotiations using not only an objective metric but also a subjective metric. Experimental results show that LLMs are influenced by the anchoring effect like humans. Additionally, we investigated the relationship between the anchoring effect and factors such as reasoning and personality. It was shown that reasoning models are less prone to the anchoring effect, suggesting that the long chain of thought mitigates the effect. However, we found no significant correlation between personality traits and susceptibility to the anchoring effect. These findings contribute to a deeper understanding of cognitive biases in LLMs and to the realization of safe and responsible application of LLMs in society.
[6] Can Multimodal LLMs Solve the Basic Perception Problems of Percept-V?
Samrajnee Ghosh, Naman Agarwal, Hemanshu Garg, Chinmay Mittal, Mausam, Parag Singla
Main category: cs.CL
TL;DR: MLLMs show poor performance on basic visual perception tasks despite excelling at complex reasoning, with accuracy dropping significantly as task complexity increases.
Details
Motivation: To assess MLLMs' performance on simple perception tasks using uncontaminated generated images with basic shapes, as limited testing has been done in this area despite their strong performance in complex reasoning tasks.
Method: Created Percept-V dataset with 7200 program-generated images across 30 categories testing various visual perception skills, then tested state-of-the-art MLLMs (GPT-4o, Gemini, Claude) and LRMs (OpenAI o4-mini, DeepSeek R1).
Result: Significant performance drop with increasing problem complexity across all categories. MLLMs show similar accuracy trends across categories, with some cognitive skills being more difficult than others.
Conclusion: MLLMs struggle with basic visual perception tasks despite their advanced reasoning capabilities, revealing important limitations in their perceptual abilities that need to be addressed.
Abstract: The reasoning abilities of Multimodal Large Language Models (MLLMs) have garnered a lot of attention in recent times, with advances made in frontiers like coding, mathematics, and science. However, very limited experiments have been done to assess their performance in simple perception tasks performed over uncontaminated, generated images containing basic shapes and structures. To address this issue, the paper introduces a dataset, Percept-V, containing a total of 7200 program-generated images equally divided into 30 categories, each testing a combination of visual perception skills. Unlike previously proposed datasets, Percept-V comprises very basic tasks of varying complexity that test the perception abilities of MLLMs. This dataset is then tested on state-of-the-art MLLMs like GPT-4o, Gemini, and Claude as well as Large Reasoning Models (LRMs) like OpenAI o4-mini and DeepSeek R1 to gauge their performance. Contrary to the evidence that MLLMs excel in many complex tasks, our experiments show a significant drop in the models’ performance with increasing problem complexity across all categories. An analysis of the performances also reveals that the tested MLLMs exhibit a similar trend in accuracy across categories testing a particular cognitive skill, and that some skills are more difficult than others.
[7] A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang, Xiangyu Zhao, Feilong Tang, Encheng Su, Junzhi Ning, Xinyao Liu, Ye Du, Changkai Ji, Cheng Tang, Huihui Xu, Ziyang Chen, Ziyan Huang, Jiyao Liu, Pengfei Jiang, Yizhou Wang, Chen Tang, Jianyu Wu, Yuchen Ren, Siyuan Yan, Zhonghua Wang, Zhongxing Xu, Shiyan Su, Shangquan Sun, Runkai Zhao, Zhisheng Zhang, Yu Liu, Fudi Wang, Yuanfeng Ji, Yanzhou Su, Hongming Shan, Chunmei Feng, Jiahao Xu, Jiangtao Yan, Wenhao Tang, Diping Song, Lihao Liu, Yanyan Huang, Lequan Yu, Bin Fu, Shujun Wang, Xiaomeng Li, Xiaowei Hu, Yun Gu, Ben Fei, Zhongying Deng, Benyou Wang, Yuewen Cao, Minjie Shen, Haodong Duan, Jie Xu, Yirong Chen, Fang Yan, Hongxia Hao, Jielan Li, Jiajun Du, Yanbo Wang, Imran Razzak, Chi Zhang, Lijun Wu, Conghui He, Zhaohui Lu, Jinhai Huang, Yihao Liu, Fenghua Ling, Yuqiang Li, Aoran Wang, Qihao Zheng, Nanqing Dong, Tianfan Fu, Dongzhan Zhou, Yan Lu, Wenlong Zhang, Jin Ye, Jianfei Cai, Wanli Ouyang, Yu Qiao, Zongyuan Ge, Shixiang Tang, Junjun He, Chunfeng Song, Lei Bai, Bowen Zhou
Main category: cs.CL
TL;DR: This survey presents a comprehensive data-centric analysis of Scientific Large Language Models (Sci-LLMs), reframing their development as co-evolution between models and scientific data, with unified taxonomies, systematic reviews of models and datasets, and outlining a paradigm shift toward closed-loop autonomous scientific discovery systems.
Details
Motivation: Sci-LLMs are transforming scientific research, but their progress is constrained by the complex nature of scientific data, which differs significantly from general NLP datasets in being multimodal, cross-scale, and domain-specific.
Method: The authors conduct a comprehensive survey with unified taxonomy of scientific data and hierarchical knowledge model, systematically reviewing recent Sci-LLMs (general-purpose and specialized), analyzing over 270 pre-/post-training datasets, and examining over 190 benchmark datasets with advanced evaluation protocols.
Result: The analysis reveals that Sci-LLMs require handling heterogeneous, multi-scale, uncertainty-laden corpora that demand representations preserving domain invariance and enabling cross-modal reasoning. The survey identifies persistent issues in scientific data development and emerging solutions involving semi-automated annotation and expert validation.
Conclusion: The work outlines a paradigm shift toward closed-loop systems where autonomous Sci-LLM agents actively experiment and contribute to evolving knowledge bases, providing a roadmap for building trustworthy AI systems that accelerate scientific discovery as true research partners.
Abstract: Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands – heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.
[8] Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning
Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Hinrich Schütze, Volker Tresp, Yunpu Ma
Main category: cs.CL
TL;DR: Memory-R1 is a reinforcement learning framework that enables LLMs to actively manage external memory through specialized agents for adaptive memory operations and reasoning.
Details
Motivation: LLMs are stateless with limited context windows, and existing memory augmentation approaches are static and heuristic-driven without learned mechanisms for memory management.
Method: Uses RL framework with two agents: Memory Manager for structured memory operations (add/update/delete) and Answer Agent for selecting relevant entries and reasoning. Fine-tuned with PPO and GRPO using minimal supervision (the memory-operation interface is sketched after the abstract below).
Result: Outperforms strongest baselines with only 152 QA pairs for training, shows strong generalization across question types and LLM backbones.
Conclusion: RL enables more agentic, memory-aware behavior in LLMs, pointing toward richer persistent reasoning systems with adaptive memory management.
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of NLP tasks, but they remain fundamentally stateless, constrained by limited context windows that hinder long-horizon reasoning. Recent efforts to address this limitation often augment LLMs with an external memory bank, yet most existing pipelines are static and heuristic-driven, lacking any learned mechanism for deciding what to store, update, or retrieve. We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns to perform structured memory operations, including adding, updating, deleting, or taking no operation on memory entries; and an Answer Agent that selects the most relevant entries and reasons over them to produce an answer. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO), enabling adaptive memory management and utilization with minimal supervision. With as few as 152 question-answer pairs and a corresponding temporal memory bank for training, Memory-R1 outperforms the strongest existing baseline and demonstrates strong generalization across diverse question types and LLM backbones. Beyond presenting an effective approach, this work provides insights into how RL can unlock more agentic, memory-aware behavior in LLMs, pointing toward richer, more persistent reasoning systems.
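The Memory Manager's action space is a small set of structured operations over an external memory bank. The following is a hypothetical sketch of what such an interface could look like; the schema and operation names are illustrative, not the paper's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryBank:
    entries: dict[int, str] = field(default_factory=dict)
    next_id: int = 0

    def apply(self, op: str, entry_id: int | None = None, text: str = ""):
        # The four operations the Memory Manager chooses among.
        if op == "ADD":
            self.entries[self.next_id] = text
            self.next_id += 1
        elif op == "UPDATE" and entry_id in self.entries:
            self.entries[entry_id] = text
        elif op == "DELETE":
            self.entries.pop(entry_id, None)
        # "NOOP" leaves the bank unchanged.

bank = MemoryBank()
bank.apply("ADD", text="User's dog is named Rex.")
bank.apply("UPDATE", entry_id=0, text="User's dog Rex turned 3 in May.")
```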
[9] Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations
Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush
Main category: cs.CL
TL;DR: LLM evaluation scores are heavily biased by model identity labels, with Claude labels boosting scores and Gemini labels depressing them, causing up to 50 percentage point preference reversals.
Details
Motivation: To investigate how model identity labels influence LLM evaluation judgments in both self- and cross-model evaluation scenarios, as LLMs are increasingly used to evaluate outputs but their objectivity may be compromised.
Method: Tested ChatGPT, Gemini, and Claude evaluating blog posts under four label conditions: no labels, true labels, and two false-label scenarios (illustrated after the abstract below). Used both overall preference voting and quality ratings (Coherence, Informativeness, Conciseness) with scores converted to percentages.
Result: Strong asymmetrical biases found - Claude labels consistently increased scores while Gemini labels consistently decreased them, regardless of actual content. False labels reversed rankings with 50 percentage point preference shifts and 12 percentage point quality rating changes. Gemini’s self-scores collapsed under true labels while Claude’s self-preference intensified.
Conclusion: Perceived model identity significantly distorts LLM evaluation judgments, requiring blind or multimodel evaluation protocols to ensure fair benchmarking.
Abstract: Large language models (LLMs) are increasingly used to evaluate outputs, yet their judgments may be influenced. This study examines bias in self- and cross-model evaluations by ChatGPT, Gemini, and Claude under four conditions: no labels, true labels, and two false-label scenarios. Blog posts authored by each model were evaluated by all three using both overall preference voting and quality ratings for Coherence, Informativeness, and Conciseness, with all scores expressed as percentages for direct comparison. Results reveal striking asymmetries: the “Claude” label consistently boosts scores, while the “Gemini” label consistently depresses them, regardless of actual content. False labels frequently reversed rankings, producing shifts of up to 50 percentage points in preference votes and up to 12 percentage points in converted quality ratings. Gemini’s self-scores collapsed under true labels, while Claude’s self-preference intensified. These findings show that perceived model identity can heavily distort high-level judgments and subtly influence detailed quality ratings, underscoring the need for blind or multimodel evaluation protocols to ensure fairness in LLM benchmarking.
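The four conditions amount to varying a single authorship header on an otherwise identical evaluation prompt. A toy illustration of how such conditions might be constructed; the prompt wording and condition names are assumptions, not the study's exact protocol.

```python
def build_prompt(post: str, label: str | None) -> str:
    # The only thing that changes between conditions is the label header.
    header = f"Author model: {label}\n" if label else ""
    return f"{header}Rate this blog post from 0-100 for overall quality:\n{post}"

post = "..."  # a blog post authored by one of the three models
conditions = {
    "no_label": None,
    "true_label": "Claude",      # the actual author of this post
    "false_label_a": "Gemini",   # misattributed
    "false_label_b": "ChatGPT",  # misattributed
}
prompts = {name: build_prompt(post, lab) for name, lab in conditions.items()}
```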
[10] BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design
Deepro Choudhury, Sinead Williamson, Adam Goliński, Ning Miao, Freddie Bickford Smith, Michael Kirchhof, Yizhe Zhang, Tom Rainforth
Main category: cs.CL
TL;DR: BED-LLM framework enables LLMs to intelligently gather information through sequential Bayesian experimental design, achieving substantial performance gains in multi-turn conversations and preference inference tasks.
Details
Motivation: To improve LLMs' ability to act as effective multi-turn conversational agents by enabling them to intelligently and adaptively gather information from users or external sources using principled Bayesian methods.
Method: Uses sequential Bayesian experimental design framework, iteratively choosing questions that maximize expected information gain (EIG) about the task (a toy EIG computation follows the abstract below). Includes a probabilistic model derived from the LLM’s belief distribution, a specialized EIG estimator, conditioning that does not rely solely on in-context updates, and a targeted query proposal strategy.
Result: Achieves substantial performance gains across 20-questions game and user preference inference tasks compared to direct prompting and other adaptive design strategies.
Conclusion: BED-LLM provides a principled framework for LLMs to act as intelligent conversational agents through Bayesian experimental design, demonstrating significant improvements in information gathering and adaptive interaction capabilities.
Abstract: We propose a general-purpose approach for improving the ability of Large Language Models (LLMs) to intelligently and adaptively gather information from a user or other external source using the framework of sequential Bayesian experimental design (BED). This enables LLMs to act as effective multi-turn conversational agents and interactively interface with external environments. Our approach, which we call BED-LLM (Bayesian Experimental Design with Large Language Models), is based on iteratively choosing questions or queries that maximize the expected information gain (EIG) about the task of interest given the responses gathered previously. We show how this EIG can be formulated in a principled way using a probabilistic model derived from the LLM’s belief distribution and provide detailed insights into key decisions in its construction. Further key to the success of BED-LLM are a number of specific innovations, such as a carefully designed estimator for the EIG, not solely relying on in-context updates for conditioning on previous responses, and a targeted strategy for proposing candidate queries. We find that BED-LLM achieves substantial gains in performance across a wide range of tests based on the 20-questions game and using the LLM to actively infer user preferences, compared to direct prompting of the LLM and other adaptive design strategies.
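At each turn the chosen question is the one maximizing EIG = H(prior) − E_answer[H(posterior)]. A toy version of that computation for a 20-questions-style hypothesis space; the hypotheses and answer likelihoods below are made up for illustration, whereas BED-LLM derives them from the LLM's own belief distribution.

```python
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def eig(prior: dict[str, float], answers: dict[str, dict[str, float]]) -> float:
    # EIG = H(prior) - E_answer[H(posterior)] for one candidate question;
    # answers[h][a] is P(answer = a | hypothesis = h).
    labels = {a for dist in answers.values() for a in dist}
    expected_h = 0.0
    for a in labels:
        p_a = sum(prior[h] * answers[h].get(a, 0.0) for h in prior)
        if p_a > 0:
            post = [prior[h] * answers[h].get(a, 0.0) / p_a for h in prior]
            expected_h += p_a * entropy(post)
    return entropy(list(prior.values())) - expected_h

prior = {"cat": 0.5, "dog": 0.3, "fish": 0.2}
does_it_bark = {"cat": {"no": 1.0}, "dog": {"yes": 1.0}, "fish": {"no": 1.0}}
print(round(eig(prior, does_it_bark), 3))  # bits gained by asking the question
```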
[11] Improving Aviation Safety Analysis: Automated HFACS Classification Using Reinforcement Learning with Group Relative Policy Optimization
Arash Ahmadi, Sarah Sharif, Yaser Banad
Main category: cs.CL
TL;DR: Automated HFACS classification framework using RL with GRPO to fine-tune Llama-3.1 8B model for aviation safety analysis, achieving significant performance improvements over state-of-the-art LLMs.
Details
Motivation: Traditional HFACS methods for aviation accident analysis suffer from scalability and consistency limitations, requiring automated solutions for better safety analysis.
Method: Reinforcement Learning with Group Relative Policy Optimization (GRPO) to fine-tune Llama-3.1 8B language model, incorporating multi-component reward system and synthetic data generation to address class imbalance (the proposed exact-match metric is sketched after the abstract below).
Result: 350% increase in exact match accuracy (0.0400 to 0.1800) and improved partial match accuracy of 0.8800, outperforming GPT-5-mini and Gemini-2.5-flash on key metrics.
Conclusion: Smaller, domain-optimized models provide computationally efficient solutions for safety analysis, enabling low-latency deployment on resource-constrained edge devices, with exact match accuracy proposed as new benchmarking methodology.
Abstract: Analyzing the human factors behind aviation accidents is crucial for preventing future incidents, yet traditional methods using the Human Factors Analysis and Classification System (HFACS) are limited by scalability and consistency. To address this, we introduce an automated HFACS classification framework for aviation safety analysis that utilizes Reinforcement Learning with Group Relative Policy Optimization (GRPO) to fine-tune a Llama-3.1 8B language model. Our approach incorporates a multi-component reward system tailored for aviation safety analysis and integrates synthetic data generation to overcome class imbalance in accident datasets. The resulting GRPO-optimized model achieved noticeable performance gains, including a 350% increase in exact match accuracy (from 0.0400 to 0.1800) and an improved partial match accuracy of 0.8800. Significantly, our specialized model outperforms state-of-the-art LLMs (Large Language Models), including GPT-5-mini and Gemini-2.5-flash, on key metrics. This research also proposes exact match accuracy in the multi-label HFACS classification problem as a new benchmarking methodology to evaluate the advanced reasoning capabilities of language models. Ultimately, our work validates that smaller, domain-optimized models can provide a computationally efficient and effective solution for critical safety analysis. This approach makes powerful, low-latency deployment on resource-constrained edge devices feasible.
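Exact match accuracy in the multi-label setting credits a prediction only when the full set of HFACS categories is reproduced, which is why its scores sit far below partial match. A minimal sketch of the metric; the category names are illustrative.

```python
def exact_match_accuracy(preds: list[set[str]], golds: list[set[str]]) -> float:
    # A prediction scores only if its label set equals the gold set exactly.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

preds = [{"Skill-Based Errors"}, {"Decision Errors", "Adverse Mental States"}]
golds = [{"Skill-Based Errors"}, {"Decision Errors"}]
print(exact_match_accuracy(preds, golds))  # 0.5: the extra label fails the match
```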
[12] Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach
Han Yang, Jian Lan, Yihong Liu, Hinrich Schütze, Thomas Seidl
Main category: cs.CL
TL;DR: Pixel-based language model replaces text embeddings with visual word representations to defend against orthographic attacks and support multilingual text.
Details
Motivation: Autoregressive language models are vulnerable to orthographic attacks using multilingual characters, causing performance degradation due to subword tokenizer limitations.
Method: Propose a pixel-based generative language model that renders words as individual images instead of using text-based embeddings, enabling robustness to noisy inputs and multilingual compatibility (a rendering sketch follows the abstract below).
Result: Evaluated on multilingual LAMBADA, WMT24, and SST-2 benchmarks, demonstrating resilience to orthographic noise and effectiveness in multilingual settings.
Conclusion: Pixel-based representations provide stronger robustness against orthographic attacks and enable better multilingual text processing compared to traditional text-based embeddings.
Abstract: Autoregressive language models are vulnerable to orthographic attacks, where input text is perturbed with characters from multilingual alphabets, leading to substantial performance degradation. This vulnerability primarily stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. To address this limitation, we propose a pixel-based generative language model that replaces text-based embeddings with pixel-based representations by rendering words as individual images. This design provides stronger robustness to noisy inputs while extending compatibility to multilingual text across diverse writing systems. We evaluate the proposed method on the multilingual LAMBADA dataset, the WMT24 dataset, and the SST-2 benchmark, demonstrating both its resilience to orthographic noise and its effectiveness in multilingual settings.
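The core move is to replace token IDs with rendered images of words, so a character-level perturbation yields a visually similar input rather than a fragmented out-of-vocabulary token sequence. A minimal Pillow sketch, assuming a fixed grayscale canvas and the library's default bitmap font; the paper's actual rendering setup will differ.

```python
from PIL import Image, ImageDraw
import numpy as np

def render_word(word: str, height: int = 16, width: int = 64) -> np.ndarray:
    # Draw the word on a white canvas and return pixels normalized to [0, 1].
    img = Image.new("L", (width, height), color=255)
    ImageDraw.Draw(img).text((2, 2), word, fill=0)    # default bitmap font
    return np.asarray(img, dtype=np.float32) / 255.0  # shape (height, width)

clean = render_word("language")
noisy = render_word("langu4ge")  # perturbed word still renders a similar image
```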
[13] Do Self-Supervised Speech Models Exhibit the Critical Period Effects in Language Acquisition?
Yurie Koga, Shunsuke Kando, Yusuke Miyao
Main category: cs.CL
TL;DR: Self-supervised speech models do not show clear Critical Period effects in phonological acquisition, with delayed L2 exposure actually improving performance and delayed L1 exposure causing forgetting.
Details
Motivation: To investigate whether Critical Period effects observed in human language acquisition (difficulty with delayed L2 exposure onset and L1 retention with delayed L1 offset) are present in self-supervised speech models, given the importance of spoken language in human acquisition.
Method: Trained self-supervised speech models with varying L2 training onsets and L1 training offsets on child-directed speech, then evaluated their phone discrimination performance.
Result: S3Ms do not exhibit clear evidence of Critical Period effects. Models with delayed L2 exposure onset performed better on L2, and delayed L1 exposure offset led to L1 forgetting (contrary to human patterns).
Conclusion: Self-supervised speech models do not replicate human Critical Period effects in phonological acquisition, showing different learning patterns than humans.
Abstract: This paper investigates whether the Critical Period (CP) effects in human language acquisition are observed in self-supervised speech models (S3Ms). CP effects refer to greater difficulty in acquiring a second language (L2) with delayed L2 exposure onset, and greater retention of the first language (L1) with delayed L1 exposure offset. While previous work has studied these effects using textual language models, their presence in speech models remains underexplored despite the central role of spoken language in human language acquisition. We train S3Ms with varying L2 training onsets and L1 training offsets on child-directed speech and evaluate their phone discrimination performance. We find that S3Ms do not exhibit clear evidence of either CP effect in terms of phonological acquisition. Notably, models with delayed L2 exposure onset tend to perform better on L2, and delayed L1 exposure offset leads to L1 forgetting.
[14] Decoding Memories: An Efficient Pipeline for Self-Consistency Hallucination Detection
Weizhi Gao, Xiaorui Liu, Feiyi Wang, Dan Lu, Junqi Yin
Main category: cs.CL
TL;DR: DMP accelerates self-consistency hallucination detection by identifying redundant prefix tokens and using selective inference with annealed decoding, achieving 3x speedup without performance loss.
Details
Motivation: Existing self-consistency methods for hallucination detection in LLMs suffer from high computational costs due to repeated generation, with significant redundancy in shared prefix tokens across generations.
Method: Proposes Decoding Memory Pipeline (DMP) that identifies redundant prefix tokens across multiple generations and accelerates inference through selective processing and annealed decoding strategies (the prefix redundancy is illustrated after the abstract below).
Result: Achieves up to 3x speedup in generation efficiency without sacrificing AUROC performance, demonstrating consistent improvements across different models, datasets, and decoding strategies.
Conclusion: DMP provides an efficient solution for self-consistency methods that is orthogonal to existing approaches and shows promise for extension to alignment and reasoning tasks.
Abstract: Large language models (LLMs) have demonstrated impressive performance in both research and real-world applications, but they still struggle with hallucination. Existing hallucination detection methods often perform poorly on sentence-level generation or rely heavily on domain-specific knowledge. While self-consistency approaches help address these limitations, they incur high computational costs due to repeated generation. In this paper, we conduct the first study on identifying redundancy in self-consistency methods, manifested as shared prefix tokens across generations, and observe that non-exact-answer tokens contribute minimally to the semantic content. Based on these insights, we propose a novel Decoding Memory Pipeline (DMP) that accelerates generation through selective inference and annealed decoding. Being orthogonal to the model, dataset, decoding strategy, and self-consistency baseline, our DMP consistently improves the efficiency of multi-response generation and holds promise for extension to alignment and reasoning tasks. Extensive experiments show that our method achieves up to a 3x speedup without sacrificing AUROC performance.
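The redundancy being exploited is concrete: across the k samples drawn for self-consistency, generations often share a long common token prefix that need not be recomputed for every sample. A toy measurement of that overlap; the real pipeline interleaves this bookkeeping with decoding rather than computing it after the fact.

```python
def shared_prefix_len(seqs: list[list[int]]) -> int:
    # Length of the token prefix common to all sampled generations.
    n = min(len(s) for s in seqs)
    for i in range(n):
        if any(s[i] != seqs[0][i] for s in seqs):
            return i
    return n

samples = [[5, 9, 2, 7, 1], [5, 9, 2, 4, 8], [5, 9, 2, 7, 3]]
k = shared_prefix_len(samples)   # 3: these tokens only need decoding once
saved = (len(samples) - 1) * k   # forward passes avoided on the shared prefix
print(k, saved)
```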
[15] Efficient Code Embeddings from Code Generation Models
Daria Kryvosheieva, Saba Sturua, Michael Günther, Scott Martens, Han Xiao
Main category: cs.CL
TL;DR: jina-code-embeddings is a novel code embedding model suite that achieves state-of-the-art performance for code retrieval and semantic similarity tasks using an autoregressive backbone with last-token pooling, despite relatively small model sizes.
Details
Motivation: To create an effective code embedding model that can retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across different programming languages.
Method: Uses an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling (sketched after the abstract below). The paper outlines a specific training recipe for this approach.
Result: Demonstrates state-of-the-art performance in code embedding tasks despite the relatively small size of the models compared to other approaches.
Conclusion: Validates the effectiveness of using an autoregressive backbone with last-token pooling for code embedding model construction, showing that small models can achieve excellent performance in code retrieval and semantic similarity tasks.
Abstract: jina-code-embeddings is a novel code embedding model suite designed to retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling. We outline the training recipe and demonstrate state-of-the-art performance despite the relatively small size of the models, validating this approach to code embedding model construction.
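Last-token pooling takes the final hidden state at the last non-padding position of each sequence as its embedding, a natural fit for an autoregressive backbone whose last token has attended to the entire input. A minimal PyTorch sketch assuming right-padded batches and a generic encoder output; this illustrates the pooling step only, not the models' full pipeline.

```python
import torch

def last_token_pool(hidden: torch.Tensor, attention_mask: torch.Tensor):
    # hidden: (batch, seq, dim); attention_mask: (batch, seq), 1 = real token.
    last_idx = attention_mask.sum(dim=1) - 1               # last real position
    return hidden[torch.arange(hidden.size(0)), last_idx]  # (batch, dim)

h = torch.randn(2, 5, 8)                                   # dummy hidden states
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])    # right-padded batch
emb = last_token_pool(h, mask)                             # one vector per sequence
```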
[16] BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning
João Guilherme Alves Santos, Giovana Kerche Bonás, Thales Sales Almeida
Main category: cs.CL
TL;DR: Updated BLUEX dataset with 2024-2025 exams and AI-generated image captions, increasing accessibility by 40% and more than doubling usable questions to 1,422 for LLM evaluation and data contamination studies.
Details
Motivation: Growing need for robust evaluation methods for Large Language Models, especially in multilingual and non-English contexts, requiring enhanced datasets for data contamination studies.
Method: Created updated BLUEX dataset including recent 2024-2025 exams and automatically generated image captions using state-of-the-art models to enhance accessibility for text-only models.
Result: Captioning strategies increased accessibility by more than 40%, producing 1,422 usable questions (more than double the original). Evaluated commercial and open-source LLMs’ ability to leverage visual context through captions.
Conclusion: The enhanced BLUEX dataset provides significantly improved resources for evaluating LLMs in multilingual contexts and studying data contamination, with captioning effectively bridging visual content accessibility gaps.
Abstract: With the growing capabilities of Large Language Models (LLMs), there is an increasing need for robust evaluation methods, especially in multilingual and non-English contexts. We present an updated version of the BLUEX dataset, now including 2024-2025 exams and automatically generated image captions using state-of-the-art models, enhancing its relevance for data contamination studies in LLM pretraining. Captioning strategies increase accessibility to text-only models by more than 40%, producing 1,422 usable questions, more than doubling the number in the original BLUEX. We evaluated commercial and open-source LLMs and their ability to leverage visual context through captions.
[17] Challenges and Applications of Large Language Models: A Comparison of GPT and DeepSeek family of models
Shubham Sharma, Sneha Tuli, Narendra Badam
Main category: cs.CL
TL;DR: This survey paper compares 16 key challenges in LLM development and deployment, analyzing trade-offs between closed-source (GPT-4o) and open-source (DeepSeek-V3-0324) models across various application domains.
Details
Motivation: To address the complexity of LLM development and deployment by providing a comprehensive comparison of different approaches, helping researchers and practitioners understand capabilities, limitations, and best practices.
Method: Comparative analysis of 16 LLM challenges through examination of two state-of-the-art models: OpenAI’s closed-source GPT-4o and DeepSeek’s open-source Mixture-of-Experts model DeepSeek-V3-0324.
Result: Identifies trade-offs between closed-source models (robust safety, fine-tuned reliability) and open-source models (efficiency, adaptability), and maps model attributes to optimal use cases across domains like chatbots, coding tools, healthcare, and education.
Conclusion: Provides guidance for AI stakeholders on selecting appropriate LLM approaches based on specific application requirements, highlighting the complementary strengths of different development paradigms.
Abstract: Large Language Models (LLMs) are transforming AI across industries, but their development and deployment remain complex. This survey reviews 16 key challenges in building and using LLMs and examines how these challenges are addressed by two state-of-the-art models with unique approaches: OpenAI’s closed source GPT-4o (May 2024 update) and DeepSeek-V3-0324 (March 2025), a large open source Mixture-of-Experts model. Through this comparison, we showcase the trade-offs between closed source models (robust safety, fine-tuned reliability) and open source models (efficiency, adaptability). We also explore LLM applications across different domains (from chatbots and coding tools to healthcare and education), highlighting which model attributes are best suited for each use case. This article aims to guide AI researchers, developers, and decision-makers in understanding current LLM capabilities, limitations, and best practices.
[18] Normality and the Turing Test
Alexandre Kabbach
Main category: cs.CL
TL;DR: The paper reinterprets the Turing test through the concept of normality, arguing it tests average human intelligence requiring imperfect behavior, and suggests current AI models target exceptional intelligence rather than normal intelligence needed to pass the test.
Details
Motivation: To revisit and reinterpret the Turing test through the statistical concept of normality, challenging conventional understandings of what constitutes artificial intelligence and how the test should be evaluated.
Method: Conceptual analysis of the Turing test using statistical interpretations of normality, examining both the target intelligence (normal/average human intelligence) and the evaluation method (statistical aggregation of multiple judges’ judgments).
Result: The paper concludes that large language models like ChatGPT are unlikely to pass the Turing test because they target exceptional rather than normal human intelligence, representing artificial smartness rather than true artificial intelligence.
Conclusion: The Turing test is fundamentally a test of normal intelligence assessed through statistical aggregation, raising broader questions about whether human cognition can be reduced to average/normal minds and challenging the normalist paradigm underlying the test.
Abstract: This paper proposes to revisit the Turing test through the concept of normality. Its core argument is that the statistical interpretation of the normal–understood as the average both in the normative and mathematical sense of the term–proves useful for understanding the Turing test in at least two ways. First, in the sense that the Turing test targets normal/average rather than exceptional human intelligence, so that successfully passing the test requires building machines that “make mistakes” and display imperfect behavior just like normal/average humans. Second, in the sense that the Turing test is a statistical test where judgments of intelligence are never carried out by a single “average” judge (understood as non-expert) but always by a full jury. As such, the notion of “average human interrogator” that Turing talks about in his original paper should be understood primarily as referring to a mathematical abstraction made of the normalized aggregate of individual judgments of multiple judges. In short, this paper argues that the Turing test is a test of normal intelligence as assessed by a normal judge characterizing the average judgment of a pool of human interrogators. Its conclusions are twofold. First, it argues that large language models such as ChatGPT are unlikely to pass the Turing test as those models precisely target exceptional rather than normal/average human intelligence. As such, they constitute models of what it proposes to call artificial smartness rather than artificial intelligence per se. Second, it argues that the core question of whether the Turing test can contribute anything to the understanding of human cognition is that of whether the human mind is really reducible to the normal/average mind–a question which largely extends beyond the Turing test itself and questions the conceptual underpinnings of the normalist paradigm it belongs to.
[19] AllSummedUp: An Open-Source Framework for Comparing Summarization Evaluation Metrics
Tanguy Herserant, Vincent Guigue
Main category: cs.CL
TL;DR: Study finds significant reproducibility issues in text summarization evaluation metrics, showing discrepancies between reported and actual performance, with LLM-based methods being particularly unstable despite better human alignment.
Details
Motivation: To address reproducibility challenges in automatic text summarization evaluation and investigate the gap between reported metric performances and actual experimental results.
Method: Conducted experiments across six representative metrics (including ROUGE and LLM-based methods like G-Eval, SEval-Ex) using a unified open-source framework applied to the SummEval dataset for fair comparison.
Result: Revealed structural trade-off: metrics with highest human judgment alignment are computationally intensive and less stable across runs. LLM-based methods show randomness, technical dependencies, and limited reproducibility.
Conclusion: Advocates for more robust evaluation protocols including exhaustive documentation and methodological standardization to ensure greater reliability in automatic summarization assessment.
Abstract: This paper investigates reproducibility challenges in automatic text summarization evaluation. Based on experiments conducted across six representative metrics ranging from classical approaches like ROUGE to recent LLM-based methods (G-Eval, SEval-Ex), we highlight significant discrepancies between reported performances in the literature and those observed in our experimental setting. We introduce a unified, open-source framework, applied to the SummEval dataset and designed to support fair and transparent comparison of evaluation metrics. Our results reveal a structural trade-off: metrics with the highest alignment with human judgments tend to be computationally intensive and less stable across runs. Beyond comparative analysis, this study highlights key concerns about relying on LLMs for evaluation, stressing their randomness, technical dependencies, and limited reproducibility. We advocate for more robust evaluation protocols including exhaustive documentation and methodological standardization to ensure greater reliability in automatic summarization assessment.
[20] Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework
Nils Dycke, Iryna Gurevych
Main category: cs.CL
TL;DR: LLMs used as automatic review generators fail to detect logical flaws in research papers, showing no significant difference in reviews between sound and flawed papers.
Details
Motivation: To understand the capabilities and limitations of LLMs in scholarly peer review, particularly in detecting faulty research logic, which is crucial for maintaining scientific integrity.
Method: Developed a fully automated counterfactual evaluation framework to test ARG approaches under controlled conditions, focusing on their ability to detect internal consistency issues between results, interpretations, and claims.
Result: Contrary to expectations, flaws in research logic had no significant effect on the output reviews generated by state-of-the-art ARG approaches.
Conclusion: Current LLM-based automatic review generators lack the ability to detect faulty research logic, necessitating three actionable recommendations for future work and the release of evaluation framework for public use.
Abstract: Large Language Models (LLMs) have great potential to accelerate and support scholarly peer review and are increasingly used as fully automatic review generators (ARGs). However, potential biases and systematic errors may pose significant risks to scientific integrity; understanding the specific capabilities and limitations of state-of-the-art ARGs is essential. We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper’s results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions. Testing a range of ARG approaches, we find that, contrary to expectation, flaws in research logic have no significant effect on their output reviews. Based on our findings, we derive three actionable recommendations for future work and release our counterfactual dataset and evaluation framework publicly.
[21] Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models
Meidan Ding, Jipeng Zhang, Wenxuan Wang, Cheng-Yi Li, Wei-Chieh Fang, Hsin-Yu Wu, Haiqin Zhong, Wenting Chen, Linlin Shen
Main category: cs.CL
TL;DR: Med-RewardBench is the first benchmark for evaluating medical reward models and judges in multimodal medical scenarios, featuring 1,026 expert-annotated cases across 13 organ systems and 8 clinical departments.
Details
Motivation: Existing benchmarks focus on general MLLM capabilities or evaluate models as solvers, neglecting essential clinical dimensions like diagnostic accuracy and clinical relevance. Medical reward models remain underexplored despite their critical importance for reliable medical AI applications.
Method: Created a multimodal dataset with rigorous three-step annotation process across six clinically critical dimensions. Evaluated 32 state-of-the-art MLLMs including open-source, proprietary, and medical-specific models. Developed baseline models with fine-tuning.
Result: Revealed substantial challenges in aligning MLLM outputs with expert judgment. Baseline models demonstrated significant performance improvements through fine-tuning, showing the benchmark’s utility for model development.
Conclusion: Med-RewardBench addresses a critical gap in medical AI evaluation and provides a comprehensive framework for developing and assessing medical reward models that can ensure accurate, context-sensitive, and professionally aligned responses in clinical applications.
Abstract: Multimodal large language models (MLLMs) hold significant potential in medical applications, including disease diagnosis and clinical decision-making. However, these tasks require highly accurate, context-sensitive, and professionally aligned responses, making reliable reward models and judges critical. Despite their importance, medical reward models (MRMs) and judges remain underexplored, with no dedicated benchmarks addressing clinical requirements. Existing benchmarks focus on general MLLM capabilities or evaluate models as solvers, neglecting essential evaluation dimensions like diagnostic accuracy and clinical relevance. To address this, we introduce Med-RewardBench, the first benchmark specifically designed to evaluate MRMs and judges in medical scenarios. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions. We evaluate 32 state-of-the-art MLLMs, including open-source, proprietary, and medical-specific models, revealing substantial challenges in aligning outputs with expert judgment. Additionally, we develop baseline models that demonstrate substantial performance improvements through fine-tuning.
[22] Discovering Semantic Subdimensions through Disentangled Conceptual Representations
Yunhao Zhang, Shaonan Wang, Nan Lin, Xinyi Dong, Chong Li, Chengqing Zong
Main category: cs.CL
TL;DR: Proposes a framework to uncover fine-grained semantic subdimensions from word embeddings using a disentangled model, and validates neural plausibility through brain mapping.
Details
Motivation: Existing semantic dimension approaches are too coarse and overlook finer conceptual distinctions, limiting understanding of how meaning is organized in language and the brain.
Method: Introduces Disentangled Continuous Semantic Representation Model (DCSRM) to decompose word embeddings into multiple sub-embeddings encoding specific semantic information (a toy decomposition follows the abstract below), then uses voxel-wise encoding models to map subdimensions to brain activation.
Result: Identifies interpretable semantic subdimensions with neural correlates, revealing that semantic dimensions are structured by distinct principles with polarity as a key decomposition factor.
Conclusion: The framework provides more fine-grained interpretable semantic subdimensions that are cognitively and neuroscientifically plausible, advancing understanding of conceptual meaning organization.
Abstract: Understanding the core dimensions of conceptual semantics is fundamental to uncovering how meaning is organized in language and the brain. Existing approaches often rely on predefined semantic dimensions that offer only broad representations, overlooking finer conceptual distinctions. This paper proposes a novel framework to investigate the subdimensions underlying coarse-grained semantic dimensions. Specifically, we introduce a Disentangled Continuous Semantic Representation Model (DCSRM) that decomposes word embeddings from large language models into multiple sub-embeddings, each encoding specific semantic information. Using these sub-embeddings, we identify a set of interpretable semantic subdimensions. To assess their neural plausibility, we apply voxel-wise encoding models to map these subdimensions to brain activation. Our work offers more fine-grained interpretable semantic subdimensions of conceptual meaning. Further analyses reveal that semantic dimensions are structured according to distinct principles, with polarity emerging as a key factor driving their decomposition into subdimensions. The neural correlates of the identified subdimensions support their cognitive and neuroscientific plausibility.
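One way to picture the decomposition is a bank of learned projections that split an embedding into k sub-embeddings constrained to reconstruct the original. The sketch below is only a structural illustration under that assumption; DCSRM's actual architecture and disentanglement objective are not specified here.

```python
import torch
import torch.nn as nn

class SubEmbeddingDecomposer(nn.Module):
    # Toy decomposition: k linear heads whose outputs sum back to the input
    # embedding; a disentanglement loss would push heads toward distinct
    # semantic subdimensions. Purely illustrative, not the paper's model.
    def __init__(self, dim: int, k: int):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(dim, dim) for _ in range(k))

    def forward(self, emb: torch.Tensor):
        subs = [head(emb) for head in self.heads]   # k sub-embeddings
        recon = torch.stack(subs).sum(dim=0)        # should approximate emb
        return subs, recon

model = SubEmbeddingDecomposer(dim=768, k=4)
emb = torch.randn(2, 768)                           # embeddings from an LLM
subs, recon = model(emb)
recon_loss = nn.functional.mse_loss(recon, emb)     # plus disentanglement terms
```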
[23] Beyond the Surface: Probing the Ideological Depth of Large Language Models
Shariar Kabir, Kevin Esterling, Yue Dong
Main category: cs.CL
TL;DR: LLMs exhibit ideological leanings with varying depth - some models are easily steerable between political viewpoints while others show resistance, indicating more entrenched internal political representations that can be quantified and analyzed.
Details
Motivation: To understand the stability and coherence of ideological positions in LLMs, as surface-level responses can be manipulated through prompt engineering, raising questions about whether they reflect genuine underlying political representations.
Method: Dual approach: 1) Measuring steerability using instruction prompting and activation steering to test how easily models switch between liberal/conservative viewpoints (the steering step is sketched after the abstract below), 2) Probing internal mechanisms using Sparse Autoencoders (SAEs) to analyze ideological features and targeted ablation of core political features.
Result: Models with lower steerability possess more distinct abstract ideological features (one model had 7.3x more political features than a similar-sized model). Targeted ablation in ideologically deep models leads to consistent logical reasoning shifts, while shallow models show increased refusal outputs.
Conclusion: Ideological depth is a quantifiable property of LLMs, and steerability serves as a valuable indicator of their latent political architecture, with some models exhibiting more coherent and entrenched ideological structures than others.
Abstract: Large Language Models (LLMs) have demonstrated pronounced ideological leanings, yet the stability and depth of these positions remain poorly understood. Surface-level responses can often be manipulated through simple prompt engineering, calling into question whether they reflect a coherent underlying ideology. This paper investigates the concept of “ideological depth” in LLMs, defined as the robustness and complexity of their internal political representations. We employ a dual approach: first, we measure the “steerability” of two well-known open-source LLMs using instruction prompting and activation steering. We find that while some models can easily switch between liberal and conservative viewpoints, others exhibit resistance or an increased rate of refusal, suggesting a more entrenched ideological structure. Second, we probe the internal mechanisms of these models using Sparse Autoencoders (SAEs). Preliminary analysis reveals that models with lower steerability possess more distinct and abstract ideological features. Our evaluations reveal that one model can contain 7.3x more political features than another model of similar size. This allows targeted ablation of a core political feature in an ideologically “deep” model, leading to consistent, logical shifts in its reasoning across related topics, whereas the same intervention in a “shallow” model results in an increase in refusal outputs. Our findings suggest that ideological depth is a quantifiable property of LLMs and that steerability serves as a valuable window into their latent political architecture.
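To make the steering probe concrete, here is a minimal sketch of activation steering in the spirit the paper describes: adding a fixed direction to one layer's residual stream and observing the generated stance. The model choice (gpt2), the layer index, and the random steering vector are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch of activation steering: add a fixed direction to the
# residual stream of one transformer block. gpt2, layer 6, and the random
# vector are placeholders, not the paper's configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

steer = torch.randn(model.config.hidden_size)  # placeholder ideology direction
alpha = 4.0                                    # steering strength

def add_direction(module, inputs, output):
    # a GPT-2 block returns a tuple whose first element is the hidden states
    return (output[0] + alpha * steer,) + output[1:]

handle = model.transformer.h[6].register_forward_hook(add_direction)
ids = tokenizer("The proper role of government is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))
handle.remove()
```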
[24] Igniting Creative Writing in Small Language Models: LLM-as-a-Judge versus Multi-Agent Refined Rewards
Xiaolong Wei, Bo Lu, Xingyu Zhang, Zhejun Zhao, Dongdong Shen, Long Xia, Dawei Yin
Main category: cs.CL
TL;DR: This paper introduces two AI-driven reward strategies within RLAIF framework to enhance creative writing in 7B-parameter SLMs for Chinese greetings, with principle-guided LLM-as-a-Judge showing superior results.
Details
Motivation: LLMs have strong creative writing capabilities but high computational costs, while current SLM enhancement methods (SFT, RLHF) face issues with novelty and cost. Need for more scalable approaches to boost creative writing in smaller models.
Method: Two RLAIF strategies: 1) RM trained on high-quality preference data from multi-agent rejection sampling framework, 2) Principle-guided LLM-as-a-Judge with adversarial training and reflection mechanism for direct reward signals.
Result: Both approaches significantly enhance creative output over baselines, with principle-guided LLM-as-a-Judge showing superior generation quality, better training efficiency, and reduced dependency on human-annotated data.
Conclusion: The principle-guided LLM-as-a-Judge approach provides a more scalable and effective path for creative SLMs, with automated evaluation methods aligning well with human judgments.
Abstract: Large Language Models (LLMs) have demonstrated remarkable creative writing capabilities, yet their substantial computational demands hinder widespread use. Enhancing Small Language Models (SLMs) offers a promising alternative, but current methods like Supervised Fine-Tuning (SFT) struggle with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. This paper explores two distinct AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework to ignite the creative writing of a 7B-parameter SLM, specifically for generating Chinese greetings. The first strategy employs an RM trained on high-quality preference data curated by a novel multi-agent rejection sampling framework designed for creative tasks. The second, more novel strategy utilizes a principle-guided LLM-as-a-Judge, whose reward function is optimized via an adversarial training scheme with a reflection mechanism, to directly provide reward signals. Comprehensive experiments reveal that while both approaches significantly enhance creative output over baselines, the principle-guided LLM-as-a-Judge demonstrably yields superior generation quality. Furthermore, it offers notable advantages in training efficiency and reduced dependency on human-annotated data, presenting a more scalable and effective path towards creative SLMs. Our automated evaluation methods also exhibit strong alignment with human judgments. Our code and data are publicly available at https://github.com/weixiaolong94-hub/Igniting-Creative-Writing-in-Small-Language-Models.
[25] HSFN: Hierarchical Selection for Fake News Detection building Heterogeneous Ensemble
Sara B. Coutinho, Rafael M. O. Cruz, Francimaria R. S. Nascimento, George D. C. Cavalcanti
Main category: cs.CL
TL;DR: Novel automatic classifier selection method for fact-checking ensembles that prioritizes diversity through hierarchical clustering and performance evaluation, achieving best accuracy on 2 out of 6 datasets.
Details
Motivation: Psychological biases make people vulnerable to fake news on social media. Ensemble methods for fact-checking need diverse classifiers, but selecting genuinely diverse models is challenging due to redundant pattern learning.
Method: Proposes HierarchySelect approach that computes pairwise diversity between classifiers, applies hierarchical clustering to organize them into groups at different granularity levels, selects diverse pools from each level, and incorporates performance metrics to ensure generalization.
Result: Experiments with 40 heterogeneous classifiers across six datasets show the method achieves highest accuracy on two datasets compared to Elbow heuristic and state-of-the-art baselines.
Conclusion: The proposed diversity-focused classifier selection approach effectively improves ensemble performance for fake news detection, with implementation available on GitHub.
Abstract: Psychological biases, such as confirmation bias, make individuals particularly vulnerable to believing and spreading fake news on social media, leading to significant consequences in domains such as public health and politics. Machine learning-based fact-checking systems have been widely studied to mitigate this problem. Among them, ensemble methods are particularly effective in combining multiple classifiers to improve robustness. However, their performance heavily depends on the diversity of the constituent classifiers; selecting genuinely diverse models remains a key challenge, especially when models tend to learn redundant patterns. In this work, we propose a novel automatic classifier selection approach that prioritizes diversity and is further extended with a performance criterion. The method first computes pairwise diversity between classifiers and applies hierarchical clustering to organize them into groups at different levels of granularity. HierarchySelect then explores these hierarchical levels to select one pool of classifiers per level, each representing a distinct intra-pool diversity. From these, the most diverse pool is identified and selected for ensemble construction. The selection process incorporates an evaluation metric reflecting each classifier’s performance to ensure the ensemble also generalises well. We conduct experiments with 40 heterogeneous classifiers across six datasets from different application domains and with varying numbers of classes. Our method is compared against the Elbow heuristic and state-of-the-art baselines. Results show that our approach achieves the highest accuracy on two of six datasets. The implementation details are available on the project’s repository: https://github.com/SaraBCoutinho/HSFN.
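A minimal sketch of the diversity-first selection idea, assuming classifier predictions are available as a matrix: pairwise disagreement is clustered hierarchically and one representative per cluster forms a candidate pool. The disagreement measure, cluster count, and representative rule are illustrative choices, not the paper's exact procedure (which also weighs per-classifier performance).

```python
# Sketch of diversity-driven pool selection: hierarchically cluster 40
# classifiers by pairwise disagreement and keep one per cluster.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
preds = rng.integers(0, 2, size=(40, 500))  # 40 classifiers x 500 samples

n = preds.shape[0]
# disagreement rate = fraction of samples on which two classifiers differ
D = np.array([[np.mean(preds[i] != preds[j]) for j in range(n)] for i in range(n)])

Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=5, criterion="maxclust")  # cut into 5 groups

# a real selector would pick per-cluster representatives by validation score
pool = [int(np.flatnonzero(labels == c)[0]) for c in np.unique(labels)]
print("selected classifier indices:", pool)
```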
[26] L3Cube-MahaSTS: A Marathi Sentence Similarity Dataset and Models
Aishwarya Mirashi, Ananya Joshi, Raviraj Joshi
Main category: cs.CL
TL;DR: MahaSTS is a human-annotated Marathi sentence similarity dataset with 16,860 sentence pairs, and MahaSBERT-STS-v2 is a fine-tuned model that outperforms other benchmarks for Marathi STS tasks.
Details
Motivation: To address the lack of high-quality human-annotated sentence similarity datasets for Marathi language and improve sentence similarity modeling in low-resource settings.
Method: Created a uniformly distributed dataset across 0-5 similarity score buckets, fine-tuned MahaSBERT model on this dataset, and benchmarked against MahaBERT, MuRIL, IndicBERT, and IndicSBERT.
Result: MahaSTS enables effective training for sentence similarity tasks in Marathi, demonstrating the impact of human-curated annotations and structured supervision.
Conclusion: The approach successfully addresses Marathi STS challenges through balanced dataset construction and targeted model fine-tuning, with publicly available resources for the community.
Abstract: We present MahaSTS, a human-annotated Sentence Textual Similarity (STS) dataset for Marathi, along with MahaSBERT-STS-v2, a fine-tuned Sentence-BERT model optimized for regression-based similarity scoring. The MahaSTS dataset consists of 16,860 Marathi sentence pairs labeled with continuous similarity scores in the range of 0-5. To ensure balanced supervision, the dataset is uniformly distributed across six score-based buckets spanning the full 0-5 range, thus reducing label bias and enhancing model stability. We fine-tune the MahaSBERT model on this dataset and benchmark its performance against other alternatives like MahaBERT, MuRIL, IndicBERT, and IndicSBERT. Our experiments demonstrate that MahaSTS enables effective training for sentence similarity tasks in Marathi, highlighting the impact of human-curated annotations, targeted fine-tuning, and structured supervision in low-resource settings. The dataset and model are publicly shared at https://github.com/l3cube-pune/MarathiNLP
[27] A Survey on Current Trends and Recent Advances in Text Anonymization
Tobias Deußer, Lorenz Sparrenberg, Armin Berger, Max Hahnbück, Christian Bauckhage, Rafet Sifa
Main category: cs.CL
TL;DR: Comprehensive survey on text anonymization techniques covering traditional NER-based approaches, LLM impacts, domain-specific solutions, privacy models, evaluation frameworks, and emerging challenges in privacy-utility trade-offs.
Details
Motivation: Growing need for robust text anonymization to protect sensitive personal information while maintaining data usability across various domains and complying with privacy regulations.
Method: Survey methodology examining foundational Named Entity Recognition approaches, Large Language Models (both as anonymizers and de-anonymization threats), domain-specific solutions, formal privacy models, risk-aware frameworks, and evaluation metrics/toolkits.
Result: Consolidated current knowledge on text anonymization, identified emerging trends, persistent challenges including privacy-utility trade-offs, quasi-identifiers, and LLM implications across healthcare, law, finance, and education sectors.
Conclusion: The survey provides comprehensive guidance for future research directions in text anonymization, addressing evolving challenges and the dual role of LLMs while emphasizing the need for practical deployment solutions across critical domains.
Abstract: The proliferation of textual data containing sensitive personal information across various domains requires robust anonymization techniques to protect privacy and comply with regulations, while preserving data usability for diverse and crucial downstream tasks. This survey provides a comprehensive overview of current trends and recent advances in text anonymization techniques. We begin by discussing foundational approaches, primarily centered on Named Entity Recognition, before examining the transformative impact of Large Language Models, detailing their dual role as sophisticated anonymizers and potent de-anonymization threats. The survey further explores domain-specific challenges and tailored solutions in critical sectors such as healthcare, law, finance, and education. We investigate advanced methodologies incorporating formal privacy models and risk-aware frameworks, and address the specialized subfield of authorship anonymization. Additionally, we review evaluation frameworks, comprehensive metrics, benchmarks, and practical toolkits for real-world deployment of anonymization solutions. This review consolidates current knowledge, identifies emerging trends and persistent challenges, including the evolving privacy-utility trade-off, the need to address quasi-identifiers, and the implications of LLM capabilities, and aims to guide future research directions for both academics and practitioners in this field.
[28] Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning
Zinan Tang, Xin Gao, Qizhi Pei, Zhuoshi Pan, Mengzhang Cai, Jiang Wu, Conghui He, Lijun Wu
Main category: cs.CL
TL;DR: Middo is a self-evolving framework that dynamically optimizes LLM training data using model-aware selection and context-preserving refinement, improving accuracy by 7.15% while maintaining dataset scale.
Details
Motivation: Existing data selection and synthesis approaches face limitations in static dataset curation that fail to adapt to evolving model capabilities, requiring a more dynamic optimization system.
Method: A closed-loop optimization system with: (1) self-referential diagnostic module using tri-axial model signals (loss patterns, embedding clusters, self-alignment scores), (2) adaptive optimization engine that transforms suboptimal samples, and (3) continuous evolution through dynamic learning principles.
Result: Experiments show consistent enhancement of seed data quality and LLM performance with 7.15% average accuracy improvement while maintaining original dataset scale.
Conclusion: Establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models.
Abstract: Supervised Fine-Tuning (SFT) of Large Language Models (LLMs) fundamentally relies on high-quality training data. While data selection and data synthesis are two common strategies to improve data quality, existing approaches often face limitations in static dataset curation that fail to adapt to evolving model capabilities. In this paper, we introduce Middo, a self-evolving Model-informed dynamic data optimization framework that uses model-aware data selection and context-preserving data refinement. Unlike conventional one-off filtering/synthesis methods, our framework establishes a closed-loop optimization system: (1) A self-referential diagnostic module proactively identifies suboptimal samples through tri-axial model signals - loss patterns (complexity), embedding cluster dynamics (diversity), and self-alignment scores (quality); (2) An adaptive optimization engine then transforms suboptimal samples into pedagogically valuable training points while preserving semantic integrity; (3) This optimization process continuously evolves with model capability through dynamic learning principles. Experiments on multiple benchmarks demonstrate that Middo consistently enhances the quality of seed data and boosts LLM performance, improving accuracy by 7.15% on average while maintaining the original dataset scale. This work establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models. Our datasets, models, and code are coming soon.
[29] Personality Matters: User Traits Predict LLM Preferences in Multi-Turn Collaborative Tasks
Sarfaroz Yunusov, Kaige Chen, Kazi Nishat Anwar, Ali Emami
Main category: cs.CL
TL;DR: Personality types significantly influence LLM preferences: Rationals prefer GPT-4 for goal-oriented tasks, while Idealists favor Claude 3.5 for creative/analytical work.
Details
Motivation: To understand if users with different personality traits systematically prefer certain LLMs over others in multi-turn collaborative workflows.
Method: Study with 32 participants across four Keirsey personality types, evaluating interactions with GPT-4 and Claude 3.5 across four collaborative tasks (data analysis, creative writing, information retrieval, writing assistance).
Result: Significant personality-driven preferences: Rationals strongly preferred GPT-4 for goal-oriented tasks; Idealists favored Claude 3.5 for creative/analytical tasks; other types showed task-dependent preferences. Sentiment analysis confirmed patterns.
Conclusion: Personality-based analysis reveals LLM differences that traditional evaluations miss, despite similar aggregate helpfulness ratings across models.
Abstract: As Large Language Models (LLMs) increasingly integrate into everyday workflows, where users shape outcomes through multi-turn collaboration, a critical question emerges: do users with different personality traits systematically prefer certain LLMs over others? We conducted a study with 32 participants evenly distributed across four Keirsey personality types, evaluating their interactions with GPT-4 and Claude 3.5 across four collaborative tasks: data analysis, creative writing, information retrieval, and writing assistance. Results revealed significant personality-driven preferences: Rationals strongly preferred GPT-4, particularly for goal-oriented tasks, while Idealists favored Claude 3.5, especially for creative and analytical tasks. Other personality types showed task-dependent preferences. Sentiment analysis of qualitative feedback confirmed these patterns. Notably, aggregate helpfulness ratings were similar across models, showing how personality-based analysis reveals LLM differences that traditional evaluations miss.
[30] QZhou-Embedding Technical Report
Peng Yu, En Xu, Bin Chen, Haibiao Chen, Yinfei Xu
Main category: cs.CL
TL;DR: QZhou-Embedding is a state-of-the-art text embedding model based on Qwen2.5-7B-Instruct that achieves top rankings on MTEB and CMTEB benchmarks through innovative multi-task training strategies and LLM-enhanced data synthesis.
Details
Motivation: To create a general-purpose contextual text embedding model with superior text representation capabilities by leveraging diverse training data and advanced training strategies to advance retrieval model performance.
Method: Built on Qwen2.5-7B-Instruct foundation with unified multi-task framework, specialized data transformation, LLM API data synthesis pipeline (paraphrasing, augmentation, hard negative generation), and two-stage training (retrieval pretraining followed by full-task fine-tuning).
Result: Achieves state-of-the-art results on MTEB and CMTEB benchmarks (ranked first on both leaderboards as of August 27, 2025), and excels at reranking, clustering, and other tasks.
Conclusion: Higher-quality, more diverse data is crucial for retrieval model advancement, and leveraging LLMs’ generative capabilities can optimize data quality for embedding model breakthroughs. Model weights and evaluation code are publicly available.
Abstract: We present QZhou-Embedding, a general-purpose contextual text embedding model with exceptional text representation capabilities. Built upon the Qwen2.5-7B-Instruct foundation model, we designed a unified multi-task framework comprising specialized data transformation and training strategies. The data transformation scheme enables the incorporation of more diverse textual training datasets, while the task-specific training strategies enhance model learning efficiency. We developed a data synthesis pipeline leveraging an LLM API, incorporating techniques such as paraphrasing, augmentation, and hard negative example generation to improve the semantic richness and sample difficulty of the training set. Additionally, we employ a two-stage training strategy, comprising initial retrieval-focused pretraining followed by full-task fine-tuning, enabling the embedding model to extend its capabilities based on robust retrieval performance. Our model achieves state-of-the-art results on the MTEB and CMTEB benchmarks, ranking first on both leaderboards (as of August 27, 2025), and simultaneously achieves state-of-the-art performance on tasks including reranking, clustering, etc. Our findings demonstrate that higher-quality, more diverse data is crucial for advancing retrieval model performance, and that leveraging LLMs’ generative capabilities can further optimize data quality for embedding model breakthroughs. Our model weights are released on HuggingFace under Apache 2.0 license. For reproducibility, we provide evaluation code and instructions on GitHub.
[31] Is this chart lying to me? Automating the detection of misleading visualizations
Jonathan Tonglet, Jan Zimny, Tinne Tuytelaars, Iryna Gurevych
Main category: cs.CL
TL;DR: Introduces Misviz benchmark with 2,604 real-world misleading visualizations and Misviz-synth synthetic dataset of 81,814 visualizations to detect and classify 12 types of visualization misleaders.
Details
Motivation: Misleading visualizations drive misinformation on social media, but AI model development is limited by lack of large, diverse datasets for training and evaluation.
Method: Created benchmark datasets (real-world Misviz and synthetic Misviz-synth), then evaluated state-of-the-art MLLMs, rule-based systems, and fine-tuned classifiers on detection and classification tasks.
Result: The task of automatically detecting misleading visualizations remains highly challenging despite comprehensive evaluation across multiple model types.
Conclusion: Released Misviz datasets and code to support future research in combating visualization-based misinformation, highlighting the ongoing difficulty of this detection problem.
Abstract: Misleading visualizations are a potent driver of misinformation on social media and the web. By violating chart design principles, they distort data and lead readers to draw inaccurate conclusions. Prior work has shown that both humans and multimodal large language models (MLLMs) are frequently deceived by such visualizations. Automatically detecting misleading visualizations and identifying the specific design rules they violate could help protect readers and reduce the spread of misinformation. However, the training and evaluation of AI models has been limited by the absence of large, diverse, and openly available datasets. In this work, we introduce Misviz, a benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders. To support model training, we also release Misviz-synth, a synthetic dataset of 81,814 visualizations generated using Matplotlib and based on real-world data tables. We perform a comprehensive evaluation on both datasets using state-of-the-art MLLMs, rule-based systems, and fine-tuned classifiers. Our results reveal that the task remains highly challenging. We release Misviz, Misviz-synth, and the accompanying code.
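For intuition, one classic misleader class that a synthetic generator of this kind could cover is a truncated y-axis. A toy Matplotlib example (the data, labels, and styling are invented, not drawn from Misviz-synth):

```python
# Toy example of one misleader type: a truncated y-axis that exaggerates a
# small difference. Values and labels are invented for illustration.
import matplotlib.pyplot as plt

vals = [98.1, 99.2]
fig, ax = plt.subplots()
ax.bar(["Product A", "Product B"], vals)
ax.set_ylim(98, 99.5)  # truncation makes a ~1% gap look dramatic
ax.set_ylabel("Satisfaction (%)")
ax.set_title("Truncated axis misleader")
fig.savefig("misleading_bar.png")
```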
[32] Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance
Yao Wang, Di Liang, Minlong Peng
Main category: cs.CL
TL;DR: CPI-FT framework addresses task interference in multi-task fine-tuning by identifying core parameter regions for each task, grouping similar tasks, and using parameter fusion with SLERP to integrate non-core parameters while preserving task-specific cores.
Details
Motivation: To solve the "seesaw phenomenon" in supervised fine-tuning where performance improvements on some tasks come at the expense of others due to indiscriminate parameter updates.
Method: 1) Independently fine-tune on each task to identify core parameter regions; 2) group tasks with similar core regions; 3) use parameter fusion: transplant core parameters and integrate non-core parameters via Spherical Linear Interpolation (SLERP); 4) lightweight pipelined SFT training with frozen core regions.
Result: Significantly alleviates task interference and catastrophic forgetting, consistently outperforms vanilla multi-task and multi-stage fine-tuning baselines on multiple public benchmarks.
Conclusion: CPI-FT effectively mitigates destructive interference in multi-task fine-tuning by preserving task-specific core parameters while smoothly integrating shared knowledge through careful parameter fusion techniques.
Abstract: Supervised fine-tuning (SFT) is a pivotal approach to adapting large language models (LLMs) for downstream tasks; however, performance often suffers from the “seesaw phenomenon”, where indiscriminate parameter updates yield progress on certain tasks at the expense of others. To address this challenge, we propose a novel Core Parameter Isolation Fine-Tuning (CPI-FT) framework. Specifically, we first independently fine-tune the LLM on each task to identify its core parameter regions by quantifying parameter update magnitudes. Tasks with similar core regions are then grouped based on region overlap, forming clusters for joint modeling. We further introduce a parameter fusion technique: for each task, core parameters from its individually fine-tuned model are directly transplanted into a unified backbone, while non-core parameters from different tasks are smoothly integrated via Spherical Linear Interpolation (SLERP), mitigating destructive interference. A lightweight, pipelined SFT training phase using mixed-task data is subsequently employed, while freezing core regions from prior tasks to prevent catastrophic forgetting. Extensive experiments on multiple public benchmarks demonstrate that our approach significantly alleviates task interference and forgetting, consistently outperforming vanilla multi-task and multi-stage fine-tuning baselines.
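A hedged sketch of the SLERP fusion step for two non-core weight tensors; the core-region identification and grouping stages are elided, and the tensor shapes are placeholders rather than real model weights.

```python
# Sketch of SLERP fusion for non-core parameters of two task models.
# Tensors are random placeholders; core-region transplanting is elided.
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float = 0.5) -> torch.Tensor:
    a, b = w_a.flatten(), w_b.flatten()
    cos = torch.dot(a, b) / (a.norm() * b.norm() + 1e-8)
    omega = torch.arccos(cos.clamp(-1.0, 1.0))
    if omega.abs() < 1e-6:  # nearly collinear: plain linear interpolation
        return (1 - t) * w_a + t * w_b
    s = torch.sin(omega)
    mixed = (torch.sin((1 - t) * omega) / s) * a + (torch.sin(t * omega) / s) * b
    return mixed.view_as(w_a)

w_task1, w_task2 = torch.randn(64, 64), torch.randn(64, 64)
fused = slerp(w_task1, w_task2, t=0.5)
```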
[33] Reasoning-Intensive Regression
Diane Tchuindjo, Omar Khattab
Main category: cs.CL
TL;DR: MENTAT combines batch-reflective prompt optimization with neural ensemble learning to address reasoning-intensive regression tasks, achieving 65% improvement over frozen LLMs and finetuned Transformers.
Details
Motivation: Existing methods struggle with reasoning-intensive regression (RiR) tasks that require deep text analysis for numerical property deduction, especially with limited training data and computation.
Method: Proposes MENTAT - a lightweight method combining batch-reflective prompt optimization with neural ensemble learning to enhance performance on RiR tasks.
Result: MENTAT achieves up to 65% improvement over both frozen LLM prompting and finetuned Transformer encoder baselines on established RiR benchmarks.
Conclusion: While MENTAT shows significant improvements, substantial room remains for future advances in reasoning-intensive regression tasks.
Abstract: AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e. deducing subtle numerical properties from text. Unlike standard language regression tasks, e.g. for sentiment or similarity, RiR often appears instead in ad-hoc problems like rubric-based scoring or domain-specific retrieval, where much deeper analysis of text is required while only limited task-specific training data and computation are available. We cast three realistic problems as RiR tasks to establish an initial benchmark, and use that to test our hypothesis that prompting frozen LLMs and finetuning Transformer encoders via gradient descent will both often struggle in RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. MENTAT achieves up to 65% improvement over both baselines, though substantial room remains for future advances in RiR.
[34] PiCSAR: Probabilistic Confidence Selection And Ranking
Joshua Ong Jun Leang, Zheng Zhao, Aryo Pradipta Gema, Sohee Yang, Wai-Chung Kwan, Xuanli He, Wenda Li, Pasquale Minervini, Eleonora Giunchiglia, Shay B. Cohen
Main category: cs.CL
TL;DR: PiCSAR is a training-free method that uses joint log-likelihood of reasoning and final answer to score candidate solutions in best-of-n sampling, achieving significant performance gains with fewer samples.
Details
Motivation: Best-of-n sampling improves LLM/LRM accuracy but requires effective scoring functions for reasoning tasks without ground-truth answers. Current methods struggle to identify correct reasoning chains.
Method: PiCSAR scores candidate generations using joint log-likelihood of reasoning and final answer, which decomposes into reasoning confidence and answer confidence components.
Result: Substantial gains across benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16/20 comparisons.
Conclusion: Correct reasoning chains exhibit higher reasoning and answer confidence, validating PiCSAR’s effectiveness as a simple, training-free scoring method for reasoning tasks.
Abstract: Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. The joint log-likelihood of the reasoning and final answer naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
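A minimal, training-free sketch of the PiCSAR-style scoring rule: rank best-of-n candidates by the summed token log-probabilities of the reasoning plus the final answer under the generator. gpt2 stands in for the actual LLM, and the sketch assumes the prompt tokenizes to a prefix of the full sequence.

```python
# Training-free sketch of joint log-likelihood scoring for best-of-n.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def joint_loglik(prompt: str, completion: str) -> float:
    full = tokenizer(prompt + completion, return_tensors="pt")
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(**full).logits
    logprobs = logits.log_softmax(-1)
    ids = full.input_ids
    # log p(token_i | tokens_<i), summed over completion tokens only
    tok_lp = logprobs[0, :-1].gather(1, ids[0, 1:, None]).squeeze(-1)
    return tok_lp[n_prompt - 1:].sum().item()

prompt = "Q: What is 12 * 7? Reason step by step.\nA:"
candidates = [" 12 * 7 = 84. Answer: 84.", " 12 * 7 = 74. Answer: 74."]
best = max(candidates, key=lambda c: joint_loglik(prompt, c))
print(best)
```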
[35] Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval
Inés Altemir Marinas, Anastasiia Kucherenko, Andrei Kucharavy
Main category: cs.CL
TL;DR: A framework for indexing and analyzing large language model training datasets using ElasticSearch, applied to the 1.5TB FineWeb-2 corpus with fast query performance for real-time dataset analysis.
Details
Motivation: Web-scale datasets like Common Crawl used for LLM training raise data quality, safety, and ethical concerns, but prior research has been limited to small samples due to computational constraints.
Method: Developed an ElasticSearch-based pipeline for indexing and analyzing LLM training datasets, specifically applied to SwissAI’s FineWeb-2 corpus (1.5TB, four languages).
Result: Achieved fast query performance with most searches completing in milliseconds and all under 2 seconds, enabling real-time dataset analysis.
Conclusion: The framework provides practical tools for safer and more accountable AI systems by enabling efficient analysis of large-scale training datasets.
Abstract: Large language models (LLMs) rely heavily on web-scale datasets like Common Crawl, which provides over 80% of training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Despite the critical importance of training data quality, prior research on harmful content has been limited to small samples due to computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an ElasticSearch-based pipeline. We apply it to SwissAI’s FineWeb-2 corpus (1.5TB, four languages), achieving fast query performance: most searches complete in milliseconds, and all finish in under 2 seconds. Our work demonstrates real-time dataset analysis, offering practical tools for safer, more accountable AI systems.
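A hedged sketch of the index-and-query pattern using the official Elasticsearch Python client (v8-style API); the index name, document fields, and the query string are assumptions for illustration, not the paper's actual FineWeb-2 schema.

```python
# Sketch of the index-and-query pattern with the official Python client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {"text": "example web document ...", "lang": "de", "source": "fineweb-2"}
es.index(index="fineweb2", document=doc)

resp = es.search(
    index="fineweb2",
    query={"match": {"text": "some problematic keyword"}},
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["lang"])
```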
[36] Continuous Language Model Interpolation for Dynamic and Controllable Text Generation
Sara Kangaslahti, David Alvarez-Melis
Main category: cs.CL
TL;DR: Linear weight interpolation between fine-tuned LLMs enables dynamic, predictable control over multiple stylistic attributes simultaneously.
Details
Motivation: Existing LLM adaptation focuses on single-objective optimization, but real applications require dynamic adaptation to diverse and changing user preferences.
Method: Use low-rank updates to fine-tune base model to different domains, then linearly interpolate between anchor models’ weight updates to parametrize the entire convex hull of models.
Result: Varying interpolation weights yields predictable and consistent changes in model outputs with little attribute entanglement for most pairs.
Conclusion: Linear interpolation between fine-tuned model weights facilitates fine-grained, multi-attribute control of LLM generation characteristics.
Abstract: As large language models (LLMs) have gained popularity for a variety of use cases, making them adaptable and controllable has become increasingly important, especially for user-facing applications. While the existing literature on LLM adaptation primarily focuses on finding a model (or models) that optimizes a single predefined objective, here we focus on the challenging case where the model must dynamically adapt to diverse – and often changing – user preferences. For this, we leverage adaptation methods based on linear weight interpolation, casting them as continuous multi-domain interpolators that produce models with specific prescribed generation characteristics on-the-fly. Specifically, we use low-rank updates to fine-tune a base model to various different domains, yielding a set of anchor models with distinct generation profiles. Then, we use the weight updates of these anchor models to parametrize the entire (infinite) class of models contained within their convex hull. We empirically show that varying the interpolation weights yields predictable and consistent change in the model outputs with respect to all of the controlled attributes. We find that there is little entanglement between most attributes and identify and discuss the pairs of attributes for which this is not the case. Our results suggest that linearly interpolating between the weights of fine-tuned models facilitates predictable, fine-grained control of model outputs with respect to multiple stylistic characteristics simultaneously.
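The core operation reduces to a convex combination of anchor weight updates added to the base weights; a minimal sketch with random stand-in tensors (real anchors would be LoRA-style deltas per layer):

```python
# Convex combination of anchor weight updates added to base weights.
# Tensors are random stand-ins for real per-domain fine-tuning deltas.
import torch

base = torch.randn(512, 512)                         # base weight matrix
deltas = [torch.randn(512, 512) for _ in range(3)]   # per-domain updates
alphas = [0.2, 0.5, 0.3]                             # convex: sum to 1

interpolated = base + sum(a * d for a, d in zip(alphas, deltas))
```

Varying the alphas traces out the convex hull of anchor models the abstract describes, which is what allows on-the-fly control over the mixed generation profile.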
[37] Revealing Fine-Grained Values and Opinions in Large Language Models
Dustin Wright, Arnav Arora, Nadav Borenstein, Srishti Yadav, Serge Belongie, Isabelle Augenstein
Main category: cs.CL
TL;DR: This paper analyzes LLM biases by examining 156k responses to Political Compass Test questions using 420 prompt variations across 6 models, finding that demographic features in prompts significantly affect outcomes and similar justifications are repeated across models.
Details
Motivation: To identify biases and mitigate potential harm in large language models by uncovering latent values and opinions, as current methods using survey questions produce varying results depending on prompting approaches.
Method: Analyzed 156k LLM responses to 62 Political Compass Test propositions from 6 LLMs using 420 prompt variations. Performed coarse-grained stance analysis and fine-grained trope analysis to identify semantically similar, recurrent phrases across different prompts.
Result: Demographic features in prompts significantly affect Political Compass Test outcomes, reflecting bias. Disparities exist between closed-form and open-domain response tests. Similar justifications (tropes) are repeatedly generated across models and prompts even with different stances.
Conclusion: The study demonstrates systematic patterns in LLM responses that reveal inherent biases and consistent reasoning patterns, highlighting the importance of prompt design and the need for robust methods to uncover and address LLM biases.
Abstract: Uncovering latent values and opinions embedded in large language models (LLMs) can help identify biases and mitigate potential harm. Recently, this has been approached by prompting LLMs with survey questions and quantifying the stances in the outputs towards morally and politically charged statements. However, the stances generated by LLMs can vary greatly depending on how they are prompted, and there are many ways to argue for or against a given position. In this work, we propose to address this by analysing a large and robust dataset of 156k LLM responses to the 62 propositions of the Political Compass Test (PCT) generated by 6 LLMs using 420 prompt variations. We perform coarse-grained analysis of their generated stances and fine-grained analysis of the plain text justifications for those stances. For fine-grained analysis, we propose to identify tropes in the responses: semantically similar phrases that are recurrent and consistent across different prompts, revealing natural patterns in the text that a given LLM is prone to produce. We find that demographic features added to prompts significantly affect outcomes on the PCT, reflecting bias, as well as disparities between the results of tests when eliciting closed-form vs. open domain responses. Additionally, patterns in the plain text rationales via tropes show that similar justifications are repeatedly generated across models and prompts even with disparate stances.
[38] E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning
Zihan Liao, Jun Wang, Hang Yu, Lingxiao Wei, Jianguo Li, Jun Wang, Wei Zhang
Main category: cs.CL
TL;DR: E2LLM is a novel approach that addresses the “impossible triangle” of long-context processing by chunking and compressing contexts using a pretrained encoder, then aligning with LLM via adapter and specialized training objectives.
Details
Motivation: To overcome the challenges of achieving high long-context performance, low computational complexity, and compatibility with pretrained models simultaneously - the "impossible triangle" in LLM context processing.
Method: Divides long contexts into chunks, compresses each into soft prompts using pretrained text encoder, aligns with decoder-only LLM via adapter, and uses encoder output reconstruction and long-context instruction fine-tuning objectives.
Result: Outperforms 8 SOTA methods in effectiveness and efficiency for document summarization and QA, and achieves best performance on LongBench v2 among comparable-sized models.
Conclusion: E2LLM successfully navigates the impossible triangle of long-context processing, demonstrating superior performance and efficiency compared to existing methods.
Abstract: Processing long contexts is increasingly important for Large Language Models (LLMs) in tasks like multi-turn dialogues, code generation, and document summarization. This paper addresses the challenges of achieving high long-context performance, low computational complexity, and compatibility with pretrained models – collectively termed the “impossible triangle”. We introduce E2LLM (Encoder Elongated Large Language Models), a novel approach that effectively navigates this paradox. E2LLM divides long contexts into chunks, compresses each into soft prompts using a pretrained text encoder, and aligns these representations with a decoder-only LLM via an adapter. To enhance the LLM’s reasoning with these soft prompts, we employ two training objectives: encoder output reconstruction and long-context instruction fine-tuning. Extensive experiments reveal that E2LLM not only outperforms 8 state-of-the-art (SOTA) methods in effectiveness and efficiency for document summarization and question answering, but also achieves the best performance on LongBench v2 among models of comparable size.
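A conceptual sketch of the E2LLM forward path with illustrative module choices and sizes; the actual encoder, adapter design, and pooling are the paper's and are not reproduced here.

```python
# Conceptual sketch: chunk -> encoder -> pooled soft prompt -> adapter into
# decoder space -> prepend to query embeddings. All sizes are assumptions.
import torch
import torch.nn as nn

enc_dim, dec_dim, n_chunks, chunk_len = 384, 768, 4, 128

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(enc_dim, nhead=6, batch_first=True), num_layers=2
)
adapter = nn.Linear(enc_dim, dec_dim)  # aligns encoder space with the LLM

chunks = torch.randn(n_chunks, chunk_len, enc_dim)  # embedded context chunks
pooled = encoder(chunks).mean(dim=1)                # one vector per chunk
soft_prompts = adapter(pooled)                      # (n_chunks, dec_dim)

query_embeds = torch.randn(1, 32, dec_dim)
decoder_inputs = torch.cat([soft_prompts.unsqueeze(0), query_embeds], dim=1)
# decoder_inputs would be passed to the decoder LLM via inputs_embeds
```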
[39] Blind Spot Navigation in Large Language Model Reasoning with Thought Space Explorer
Jinghan Zhang, Fengran Mo, Tharindu Cyril Weerasooriya, Yeyang Zhou, Xinyue Ye, Dongjie Wang, Yanjie Fu, Kunpeng Liu
Main category: cs.CL
TL;DR: TSE framework expands LLM reasoning by generating new thought branches to explore cognitive blind spots, outperforming baseline methods on complex reasoning tasks.
Details
Motivation: Existing LLM reasoning methods are confined to explored solution spaces and overlook cognitive blind spots, limiting their reasoning potential.
Method: Thought Space Explorer (TSE) generates new reasoning steps and branches from original thought structures using various strategies to broaden exploration.
Result: Experimental results show TSE surpasses various baseline methods on multiple levels of reasoning tasks.
Conclusion: Structured and expansive thought exploration helps unleash LLMs’ reasoning potential by addressing cognitive blind spots.
Abstract: Recent advances in large language models (LLMs) have demonstrated their potential in handling complex reasoning tasks, which are usually achieved by constructing a thought chain to guide the model in solving the problem with multi-step thinking. However, existing methods often remain confined to previously explored solution spaces and thus overlook the critical blind spot within LLMs’ cognitive range. To address these issues, we introduce the “Thought Space Explorer” (TSE), a novel framework to expand and optimize thought structures to guide LLMs to explore their blind spots of thinking. By generating new reasoning steps and branches based on the original thought structure with various designed strategies, TSE broadens the thought exploration view and alleviates the impact of blind spots for LLM reasoning. Experimental results on multiple levels of reasoning tasks demonstrate the efficacy of TSE by surpassing various baseline methods. We also conduct extensive analysis to understand how structured and expansive thought can contribute to unleashing the potential of LLM reasoning capabilities.
[40] A Collaborative Content Moderation Framework for Toxicity Detection based on Conformalized Estimates of Annotation Disagreement
Guillermo Villate-Castillo, Javier Del Ser, Borja Sanz
Main category: cs.CL
TL;DR: A novel content moderation framework that captures annotation disagreement as valuable signal rather than noise, using multitask learning with toxicity classification as primary task and disagreement prediction as auxiliary task, enhanced with conformal prediction for uncertainty estimation.
Details
Motivation: Traditional content moderation systems treat annotation disagreement as noise, but this work recognizes it as valuable signal reflecting the inherent ambiguity and subjective nature of toxicity perception in content.
Method: Multitask learning approach where toxicity classification is the primary task and annotation disagreement prediction is the auxiliary task, combined with conformal prediction for uncertainty estimation to handle both annotation ambiguity and model uncertainty.
Result: The joint approach improves model performance, calibration, and uncertainty estimation while offering greater parameter efficiency and enhancing the review process compared to single-task methods.
Conclusion: Capturing annotation disagreement as a signal rather than dismissing it as noise leads to more effective content moderation systems that better handle the subjective nature of toxicity perception and provide flexibility for human moderators.
Abstract: Content moderation typically combines the efforts of human moderators and machine learning models. However, these systems often rely on data where significant disagreement occurs during moderation, reflecting the subjective nature of toxicity perception. Rather than dismissing this disagreement as noise, we interpret it as a valuable signal that highlights the inherent ambiguity of the content, an insight missed when only the majority label is considered. In this work, we introduce a novel content moderation framework that emphasizes the importance of capturing annotation disagreement. Our approach uses multitask learning, where toxicity classification serves as the primary task and annotation disagreement is addressed as an auxiliary task. Additionally, we leverage uncertainty estimation techniques, specifically Conformal Prediction, to account for both the ambiguity in comment annotations and the model’s inherent uncertainty in predicting toxicity and disagreement. The framework also allows moderators to adjust thresholds for annotation disagreement, offering flexibility in determining when ambiguity should trigger a review. We demonstrate that our joint approach enhances model performance, calibration, and uncertainty estimation, while offering greater parameter efficiency and improving the review process in comparison to single-task methods.
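A minimal split-conformal sketch of the uncertainty layer: calibrate a threshold on nonconformity scores so prediction sets cover the true label at a target rate. The calibration scores here are synthetic, and the binary label set stands in for toxic/non-toxic.

```python
# Split-conformal sketch: calibrate a threshold on nonconformity scores
# (1 - p_true) so prediction sets cover the true label ~90% of the time.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1
n = 500
cal_probs = rng.dirichlet([2, 2], size=n)   # calibration softmax outputs
cal_labels = rng.integers(0, 2, size=n)

scores = 1.0 - cal_probs[np.arange(n), cal_labels]
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

test_probs = np.array([0.72, 0.28])
pred_set = [k for k in range(2) if 1.0 - test_probs[k] <= q]
print("prediction set:", pred_set)  # ambiguous comments keep both labels
```

A prediction set containing both labels is one concrete trigger for routing a comment to human review, matching the framework's adjustable-threshold design.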
[41] Retrieval-Augmented Machine Translation with Unstructured Knowledge
Jiaan Wang, Fandong Meng, Yingxue Zhang, Jie Zhou
Main category: cs.CL
TL;DR: RAGtrans benchmark for retrieval-augmented machine translation using unstructured multilingual documents, with multi-task training method achieving significant BLEU and COMET score improvements.
Details
Motivation: Existing RAG approaches for MT rely on paired corpora or structured knowledge graphs, but most world knowledge exists in unstructured documents that may not be fully paired across languages, creating a gap in leveraging this knowledge for translation.
Method: Built RAGtrans benchmark with 169K MT samples from GPT-4o and human translators, plus multilingual documents. Proposed multi-task training method using existing multilingual corpora to create auxiliary objectives without additional labeling, teaching LLMs to use information from multilingual documents during translation.
Result: Method improves LLMs by 1.6-3.1 BLEU and 1.0-2.0 COMET in En-Zh, and 1.7-2.9 BLEU and 2.1-2.7 COMET in En-De translation tasks.
Conclusion: The approach successfully enhances MT performance using unstructured multilingual documents, while also identifying critical difficulties current LLMs face with this retrieval-augmented translation task.
Abstract: Retrieval-augmented generation (RAG) introduces additional information to enhance large language models (LLMs). In machine translation (MT), previous work typically retrieves in-context examples from paired MT corpora, or domain-specific knowledge from knowledge graphs, to enhance MT models. However, a large amount of world knowledge is organized in unstructured documents, and might not be fully paired across different languages. In this paper, we study retrieval-augmented MT using unstructured documents. Specifically, we build RAGtrans, the first benchmark to train and evaluate LLMs’ retrieval-augmented MT ability. RAGtrans contains 169K MT samples collected via GPT-4o and human translators. Besides, documents from various languages are also provided to supply the knowledge to these samples. Based on RAGtrans, we further propose a multi-task training method to teach LLMs how to use information from multilingual documents during their translation. The method uses existing multilingual corpora to create auxiliary training objectives without additional labeling requirements. Extensive experiments show that the method improves LLMs by 1.6-3.1 BLEU and 1.0-2.0 COMET scores in En-Zh, and 1.7-2.9 BLEU and 2.1-2.7 COMET scores in En-De. We also conclude the critical difficulties that current LLMs face with this task.
[42] Toxicity Begets Toxicity: Unraveling Conversational Chains in Political Podcasts
Naquee Rizwan, Nayandeep Deb, Sarthak Roy, Vishwajeet Singh Solanki, Kiran Garimella, Animesh Mukherjee
Main category: cs.CL
TL;DR: Analysis of toxicity patterns in political podcast conversations, focusing on how harmful language escalates through conversational turns.
Details
Motivation: Tackling toxic behavior in digital communication is a pressing concern, but podcasts remain understudied despite their rapid popularity growth compared to other platforms like social networks.
Method: Curated a dataset of political podcast transcripts and analyzed them with focus on conversational structure, specifically investigating how toxicity surfaces and intensifies through sequences of replies within dialogues.
Result: Identified organic patterns by which harmful language escalates across conversational turns in podcast dialogues.
Conclusion: This research fills an important gap in understanding toxicity in podcast conversations and provides insights into how harmful language develops through conversational dynamics.
Abstract: Tackling toxic behavior in digital communication continues to be a pressing concern for both academics and industry professionals. While significant research has explored toxicity on platforms like social networks and discussion boards, podcasts despite their rapid rise in popularity remain relatively understudied in this context. This work seeks to fill that gap by curating a dataset of political podcast transcripts and analyzing them with a focus on conversational structure. Specifically, we investigate how toxicity surfaces and intensifies through sequences of replies within these dialogues, shedding light on the organic patterns by which harmful language can escalate across conversational turns. Warning: Contains potentially abusive/toxic contents.
[43] Strategic resource allocation in memory encoding: An efficiency principle shaping language processing
Weijie Xu, Richard Futrell
Main category: cs.CL
TL;DR: Strategic Resource Allocation (SRA) proposes that working memory dynamically prioritizes novel/unexpected information to minimize retrieval error under limited capacity constraints, showing reduced locality effects for surprising inputs in corpus data.
Details
Motivation: To understand how limited working memory capacity is efficiently used in human language processing, addressing the computational problem of minimizing retrieval error under resource constraints.
Method: Proposed SRA as an efficiency principle from resource-rational perspective, tested through naturalistic corpus data analysis of dependency locality in both production and comprehension across languages.
Result: Found converging evidence for SRA - non-local dependencies with less predictable antecedents show reduced locality effects, but with considerable cross-linguistic variability.
Conclusion: SRA highlights representational uncertainty’s role in memory encoding and reinterprets surprisal/entropy effects through efficient memory encoding perspective, requiring further examination of language-specific interactions.
Abstract: How is the limited capacity of working memory efficiently used to support human linguistic behaviors? In this paper, we propose Strategic Resource Allocation (SRA) as an efficiency principle for memory encoding in sentence processing. The idea is that working memory resources are dynamically and strategically allocated to prioritize novel and unexpected information. From a resource-rational perspective, we argue that SRA is the principled solution to a computational problem posed by two functional assumptions about working memory, namely its limited capacity and its noisy representation. Specifically, working memory needs to minimize the retrieval error of past inputs under the constraint of limited memory resources, an optimization problem whose solution is to allocate more resources to encode more surprising inputs with higher precision. One of the critical consequences of SRA is that surprising inputs are encoded with enhanced representations, and therefore are less susceptible to memory decay and interference. Empirically, through naturalistic corpus data, we find converging evidence for SRA in the context of dependency locality from both production and comprehension, where non-local dependencies with less predictable antecedents are associated with reduced locality effect. However, our results also reveal considerable cross-linguistic variability, suggesting the need for a closer examination of how SRA, as a domain-general memory efficiency principle, interacts with language-specific phrase structures. SRA highlights the critical role of representational uncertainty in understanding memory encoding. It also reimagines the effects of surprisal and entropy on processing difficulty from the perspective of efficient memory encoding.
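One way to phrase the optimization SRA appeals to (an illustrative reading with invented notation, not the paper's derivation): allocate encoding precision across inputs to minimize expected reconstruction error under a fixed budget, where retrieval can also lean on the predictive prior, so inputs the prior predicts well need little dedicated precision while surprising inputs need more.

```latex
% Illustrative formalization (invented notation): choose encoding
% precisions \lambda_i under budget C to minimize expected reconstruction
% error, where the retrieval estimate \hat{x}_i can also draw on the
% predictive prior p(x_i); surprising inputs, which the prior cannot
% reconstruct, demand larger \lambda_i.
\min_{\{\lambda_i\}} \; \sum_i \mathbb{E}\big[\, \| x_i - \hat{x}_i(\lambda_i, p) \|^2 \,\big]
\quad \text{s.t.} \quad \sum_i \lambda_i \le C
```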
[44] Inducing Programmatic Skills for Agentic Tasks
Zora Zhiruo Wang, Apurva Gandhi, Graham Neubig, Daniel Fried
Main category: cs.CL
TL;DR: ASI (Agent Skill Induction) uses program-based skills to improve web navigation agents, achieving 23.5% higher success rate and 10.7-15.3% fewer steps than baselines through online skill learning and verification.
Details
Motivation: Web navigation agents need to perform specialized tasks like product searches and travel planning, requiring the ability to learn and adapt task-specific skills through online interaction with web environments.
Method: Proposes Agent Skill Induction (ASI) which induces, verifies, and utilizes program-based skills on the fly. Uses programmatic verification during skill induction and composes primitive actions into higher-level skills.
Result: Outperforms static baseline by 23.5% and text-skill counterpart by 11.3% in success rate. Reduces steps by 10.7-15.3%. Maintains efficiency and accuracy at scale. Successfully transfers and updates skills between websites.
Conclusion: Programs are effective representations for web navigation skills. ASI enables agents to bootstrap task-specific skills through online learning with verification, improving both success rates and efficiency while maintaining adaptability across different websites.
Abstract: To succeed in common digital tasks such as web navigation, agents must carry out a variety of specialized tasks such as searching for products or planning a travel route. To tackle these tasks, agents can bootstrap themselves by learning task-specific skills online through interaction with the web environment. In this work, we demonstrate that programs are an effective representation for skills. We propose agent skill induction (ASI), which allows agents to adapt themselves by inducing, verifying, and utilizing program-based skills on the fly. We start with an evaluation on the WebArena agent benchmark and show that ASI outperforms the static baseline agent and its text-skill counterpart by 23.5% and 11.3% in success rate, mainly thanks to the programmatic verification guarantee during the induction phase. ASI also improves efficiency by reducing 10.7-15.3% of the steps over baselines, by composing primitive actions (e.g., click) into higher-level skills (e.g., search product). We then highlight the efficacy of ASI in remaining efficient and accurate under scaled-up web activities. Finally, we examine the generalizability of induced skills when transferring between websites, and find that ASI can effectively reuse common skills, while also updating incompatible skills to accommodate website changes.
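An illustrative toy of a program-based skill in the ASI sense: hypothetical browser primitives composed into a named, reusable routine. The selectors and actions are invented, and the verification-by-replay step is only described in the docstring.

```python
# Toy of a program-based skill: hypothetical browser primitives composed
# into a named, reusable routine. Selectors and actions are invented.
def click(selector: str) -> None:
    print(f"click {selector}")

def type_text(selector: str, text: str) -> None:
    print(f"type {text!r} into {selector}")

def search_product(query: str) -> None:
    """Induced skill: a composition of primitives that would be verified
    by replaying it against the live site before joining the library."""
    click("#search-box")
    type_text("#search-box", query)
    click("#search-button")

search_product("usb-c hub")
```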
[45] DeepTrans: Deep Reasoning Translation via Reinforcement Learning
Jiaan Wang, Fandong Meng, Jie Zhou
Main category: cs.CL
TL;DR: DeepTrans is a deep reasoning translation model that uses reinforcement learning to achieve free translation without requiring labeled data, showing 16.3% improvement in literature translation.
Details
Motivation: Free translation requires going beyond word-for-word translation and is under-explored in deep reasoning LLMs, despite their promising performance in various tasks.
Method: Uses reinforcement learning with a carefully designed reward model that scores both translation results and thought processes, teaching the model how to think and translate without labeled data.
Result: DeepTrans improves performance by 16.3% in literature translation using Qwen2.5-7B backbone and outperforms strong deep reasoning LLMs.
Conclusion: The work demonstrates effective free translation through RL without labeled data, with summarized failures and findings to inspire future research in this area.
Abstract: Recently, deep reasoning LLMs (e.g., OpenAI o1 and DeepSeek-R1) have shown promising performance in various downstream tasks. Free translation is an important and interesting task in the multilingual world, which requires going beyond word-for-word translation. However, the task is still under-explored in deep reasoning LLMs. In this paper, we introduce DeepTrans, a deep reasoning translation model that learns free translation via reinforcement learning (RL). Specifically, we carefully build a reward model with pre-defined scoring criteria on both the translation results and the thought processes. The reward model teaches DeepTrans how to think and free-translate the given sentences during RL. Besides, our RL training does not need any labeled translations, avoiding the human-intensive annotation or resource-intensive data synthesis. Experimental results show the effectiveness of DeepTrans. Using Qwen2.5-7B as the backbone, DeepTrans improves performance by 16.3% in literature translation, and outperforms strong deep reasoning LLMs. Moreover, we summarize the failures and interesting findings during our RL exploration. We hope this work could inspire other researchers in free translation.
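A hedged sketch of the two-part reward signal described: score the thought process and the translation separately, then combine them for the RL update. rm_score is a placeholder for the paper's learned reward model, and the 0.5 weighting is invented.

```python
# Sketch of a two-part reward: separate scores for the thought process and
# the translation, combined for the RL update.
def rm_score(text: str, criterion: str) -> float:
    """Placeholder: a real system would query a trained reward model."""
    words = text.split()
    return len(set(words)) / (len(words) + 1)  # crude diversity proxy

def reward(thought: str, translation: str, w: float = 0.5) -> float:
    r_thought = rm_score(thought, "faithful, relevant reasoning")
    r_trans = rm_score(translation, "fluent, free translation")
    return w * r_thought + (1 - w) * r_trans

print(reward("Weigh the idiom and the register ...", "La vie est belle."))
```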
[46] Testing Conviction: An Argumentative Framework for Measuring LLM Political Stability
Shariar Kabir, Kevin Esterling, Yue Dong
Main category: cs.CL
TL;DR: This paper proposes a framework to distinguish between genuine ideological alignment and performative text generation in LLMs by evaluating argumentative consistency and uncertainty quantification.
Details
Motivation: Existing methods categorize LLMs as left- or right-leaning based on single-prompt responses but cannot determine if these classifications reflect stable ideologies or superficial mimicry.
Method: Developed a framework using (1) argumentative consistency and (2) uncertainty quantification. Tested 12 LLMs on 19 economic policies from the Political Compass Test to classify responses as stable or performative ideological positioning.
Result: 95% of left-leaning models and 89% of right-leaning models demonstrated behavior consistent with classifications across different conditions. Semantic entropy strongly validated classifications (AUROC=0.78), showing uncertainty’s relationship to ideological consistency.
Conclusion: Ideological stability in LLMs is topic-dependent, challenging the notion of monolithic LLM ideologies. The framework provides a robust way to distinguish genuine alignment from performative behavior.
Abstract: Large Language Models (LLMs) increasingly shape political discourse, yet exhibit inconsistent responses when challenged. While prior research categorizes LLMs as left- or right-leaning based on single-prompt responses, a critical question remains: Do these classifications reflect stable ideologies or superficial mimicry? Existing methods cannot distinguish between genuine ideological alignment and performative text generation. To address this, we propose a framework for evaluating ideological depth through (1) argumentative consistency and (2) uncertainty quantification. Testing 12 LLMs on 19 economic policies from the Political Compass Test, we classify responses as stable or performative ideological positioning. Results show 95% of left-leaning models and 89% of right-leaning models demonstrate behavior consistent with our classifications across different experimental conditions. Furthermore, semantic entropy strongly validates our classifications (AUROC=0.78), revealing uncertainty’s relationship to ideological consistency. Our findings demonstrate that ideological stability is topic-dependent and challenge the notion of monolithic LLM ideologies, and offer a robust way to distinguish genuine alignment from performative behavior.
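The semantic-entropy check used to validate the classifications can be made concrete with a short sketch. This is an illustrative reading, not the authors' code: `cluster_fn` stands in for the entailment-based clustering of sampled responses that semantic-entropy methods typically use.

```python
import math
from collections import Counter

def semantic_entropy(responses, cluster_fn):
    """Shannon entropy over meaning-clusters of sampled responses:
    low entropy suggests a stable stance, high entropy a performative one."""
    counts = Counter(cluster_fn(r) for r in responses)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Toy clustering stub; real systems cluster by bidirectional entailment.
stance = lambda r: "support" if "support" in r else "oppose"
print(semantic_entropy(
    ["I support this policy", "I strongly support it", "I oppose it"],
    stance))  # ~0.64 nats: two clusters with probabilities 2/3 and 1/3
```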
[47] MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness
Junsheng Huang, Zhitao He, Yucheng Huang, Sandeep Polisetty, Qingyun Wang, Yi R. Fung
Main category: cs.CL
TL;DR: MAC-Tuning addresses LLM hallucination in multi-problem settings by separating answer prediction and confidence estimation during fine-tuning, achieving 25% better average precision than baselines.
Details
Motivation: Existing methods for LLM hallucination focus on single-problem settings and don't address the more challenging multi-problem scenario where multiple questions need accurate simultaneous answering.
Method: Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning) separates the learning of answer prediction and confidence estimation during fine-tuning on instruction data.
Result: Extensive experiments show the method outperforms baselines by up to 25% in average precision.
Conclusion: MAC-Tuning effectively addresses LLM hallucination in multi-problem settings through stepwise learning of answers and confidence estimation.
Abstract: The hallucination of non-existent facts by LLMs is an important problem given their widespread adoption across various applications. Previous research addresses this problem by analyzing the internal parameterized knowledge boundaries to estimate confidence. However, these studies focus on the single-problem setting and have not explored the more challenging multi-problem setting, which requires accurately answering multiple questions simultaneously. We introduce a novel method for the multi-problem setting, Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning), that separates the learning of answer prediction and confidence estimation during fine-tuning on instruction data. Extensive experiments demonstrate that our method outperforms baselines by up to 25% in average precision.
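A hypothetical sketch of the two-step data construction the summary implies: answer prediction and confidence estimation are learned as separate supervision targets. The field names, prompt wording, and "sure/unsure" labels are assumptions, not the paper's specification.

```python
def build_mac_examples(qa_pairs, model_is_correct):
    """Split multi-problem fine-tuning data into (1) joint answer
    prediction and (2) per-answer confidence estimation examples."""
    questions = " ".join(q for q, _ in qa_pairs)
    answers = " ".join(a for _, a in qa_pairs)
    # Step 1: answer all questions in one multi-problem prompt.
    answer_examples = [{"input": questions, "target": answers}]
    # Step 2: state confidence, supervised by answer correctness.
    confidence_examples = [
        {"input": f"{q} Proposed answer: {a}. Are you sure?",
         "target": "sure" if model_is_correct(q, a) else "unsure"}
        for q, a in qa_pairs]
    return answer_examples, confidence_examples
```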
[48] FedSEA-LLaMA: A Secure, Efficient and Adaptive Federated Splitting Framework for Large Language Models
Zishuai Zhang, Hainan Zhang, Weihua Li, Qinnan Zhang, Jin Dong, Yongxin Tong, Zhiming Zheng
Main category: cs.CL
TL;DR: FedSEA-LLaMA is a secure, efficient, and adaptive federated splitting framework for LLaMA2 that addresses privacy, communication overhead, and adaptability challenges in federated learning environments while maintaining performance comparable to centralized training.
Details
Motivation: Private data is valuable for improving LLMs but is scattered across data silos, and traditional federated approaches face challenges with data privacy, high communication costs due to sequential training/inference, and lack of adaptability to downstream tasks.
Method: The framework uses three key techniques: 1) Gaussian noise injection for secure end-to-end vector transmission, 2) Attention-mask compression and KV cache collaboration to reduce communication costs, and 3) Dynamic adjustment of partition points for input/output blocks based on task requirements.
Result: Experiments show FedSEA-LLaMA maintains performance comparable to centralized LLaMA2 on natural language understanding, summarization, and conversational QA tasks, while achieving up to 8x speedups in training and inference. Privacy attack analysis confirms security effectiveness.
Conclusion: FedSEA-LLaMA successfully addresses the three main challenges of federated split learning for LLMs, providing a secure, efficient, and adaptable framework that enables effective utilization of private data across distributed environments without compromising performance.
Abstract: Private data holds promise for improving LLMs due to its high quality, but its scattered distribution across data silos and the high computational demands of LLMs limit their deployment in federated environments. To address this, transformer-based federated split models have been proposed, which offload most model parameters to the server (or distributed clients) while retaining only a small portion on the client to ensure data privacy. Despite this design, they still face three challenges: 1) Peer-to-peer key encryption struggles to secure transmitted vectors effectively; 2) The auto-regressive nature of LLMs means that federated split learning can only train and infer sequentially, causing high communication overhead; 3) Fixed partition points lack adaptability to downstream tasks. In this paper, we introduce FedSEA-LLaMA, a Secure, Efficient, and Adaptive Federated splitting framework based on LLaMA2. First, we inject Gaussian noise into forward-pass hidden states to enable secure end-to-end vector transmission. Second, we employ attention-mask compression and KV cache collaboration to reduce communication costs, accelerating training and inference. Third, we allow users to dynamically adjust the partition points for input/output blocks based on specific task requirements. Experiments on natural language understanding, summarization, and conversational QA tasks show that FedSEA-LLaMA maintains performance comparable to centralized LLaMA2 and achieves up to 8x speedups in training and inference. Further analysis of privacy attacks and different partition points also demonstrates the effectiveness of FedSEA-LLaMA in security and adaptability.
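The first of the three techniques, Gaussian noise on transmitted hidden states, reduces to a one-line client-side operation. A minimal PyTorch sketch, assuming a noise scale sigma that the paper would tune:

```python
import torch

def transmit_hidden(hidden: torch.Tensor, sigma: float = 0.05) -> torch.Tensor:
    """Client-side step of the secure-transmission idea: add Gaussian
    noise to forward-pass hidden states before they leave the device."""
    return hidden + sigma * torch.randn_like(hidden)

h = torch.randn(1, 16, 4096)       # (batch, seq, hidden) from client-side blocks
server_input = transmit_hidden(h)  # the server never sees the clean vectors
```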
[49] Hydra: Structured Cross-Source Enhanced Large Language Model Reasoning
Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Liming Zhu, Wenjie Zhang
Main category: cs.CL
TL;DR: Hydra is a training-free framework that unifies graph topology, document semantics, and source reliability to enhance retrieval-augmented generation, achieving state-of-the-art results on multiple benchmarks.
Details
Motivation: Current hybrid RAG systems struggle with multi-hop reasoning, multi-entity questions, multi-source verification, and effective graph utilization, limiting their reasoning capabilities.
Method: Hydra uses agent-driven exploration combining structured and unstructured retrieval, tri-factor cross-source verification (source trustworthiness, cross-source corroboration, entity-path alignment), and leverages graph structure for noise pruning and efficient exploration.
Result: Hydra achieves overall state-of-the-art results on seven benchmark datasets, outperforming ToG-2 by 20.3% on average (up to 30.1%), and enables smaller models like Llama-3.1-8B to match GPT-4-Turbo’s performance.
Conclusion: Hydra effectively addresses key limitations in hybrid RAG systems by unifying multiple information sources and verification mechanisms, demonstrating significant performance improvements across various model sizes and benchmarks.
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. Current hybrid RAG systems retrieve evidence from both knowledge graphs (KGs) and text documents to support LLM reasoning. However, they face challenges like handling multi-hop reasoning, multi-entity questions, multi-source verification, and effective graph utilization. To address these limitations, we present Hydra, a training-free framework that unifies graph topology, document semantics, and source reliability to support deep, faithful reasoning in LLMs. Hydra handles multi-hop and multi-entity problems through agent-driven exploration that combines structured and unstructured retrieval, increasing both diversity and precision of evidence. To tackle multi-source verification, Hydra uses tri-factor cross-source verification (source trustworthiness assessment, cross-source corroboration, and entity-path alignment) to balance topic relevance with cross-modal agreement. By leveraging graph structure, Hydra fuses heterogeneous sources, guides efficient exploration, and prunes noise early. Comprehensive experiments on seven benchmark datasets show that Hydra achieves overall state-of-the-art results on all benchmarks with GPT-3.5, outperforming the strong hybrid baseline ToG-2 by an average of 20.3% and up to 30.1%. Furthermore, Hydra enables smaller models (e.g., Llama-3.1-8B) to achieve reasoning performance comparable to that of GPT-4-Turbo. The source code is available at https://stevetantan.github.io/Hydra/.
[50] L3Cube-MahaEmotions: A Marathi Emotion Recognition Dataset with Synthetic Annotations using CoTR prompting and Large Language Models
Nidhi Kowtal, Raviraj Joshi
Main category: cs.CL
TL;DR: Created L3Cube-MahaEmotions dataset for Marathi emotion recognition using LLM-based synthetic annotation with GPT-4, showing LLMs outperform fine-tuned BERT models in low-resource settings.
Details
Motivation: Address emotion recognition challenges in low-resource languages like Marathi due to limited annotated data availability.
Method: Used Chain-of-Translation prompting to translate Marathi to English for emotion labeling via LLMs (GPT-4/Llama3-405B), with manual validation/test sets. Evaluated synthetic label aggregation strategies.
Result: GPT-4 predictions outperformed fine-tuned BERT models. BERT models trained on synthetic labels couldn’t surpass GPT-4 performance. LLMs generalized better than fine-tuned BERT for complex emotion recognition.
Conclusion: High-quality human-labeled data is crucial, and generic LLMs generalize better than fine-tuned models for low-resource emotion recognition tasks. Dataset and models are publicly available.
Abstract: Emotion recognition in low-resource languages like Marathi remains challenging due to limited annotated data. We present L3Cube-MahaEmotions, a high-quality Marathi emotion recognition dataset with 11 fine-grained emotion labels. The training data is synthetically annotated using large language models (LLMs), while the validation and test sets are manually labeled to serve as a reliable gold-standard benchmark. Building on the MahaSent dataset, we apply the Chain-of-Translation (CoTR) prompting technique, where Marathi sentences are translated into English and emotion-labeled via a single prompt. GPT-4 and Llama3-405B were evaluated, with GPT-4 selected for training data annotation due to superior label quality. We evaluate model performance using standard metrics and explore label aggregation strategies (e.g., Union, Intersection). While GPT-4 predictions outperform fine-tuned BERT models, BERT-based models trained on synthetic labels fail to surpass GPT-4. This highlights both the importance of high-quality human-labeled data and the inherent complexity of emotion recognition. An important finding of this work is that generic LLMs like GPT-4 and Llama3-405B generalize better than fine-tuned BERT for complex low-resource emotion recognition tasks. The dataset and model are shared publicly at https://github.com/l3cube-pune/MarathiNLP
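A minimal illustration of Chain-of-Translation (CoTR) prompting as the abstract describes it: translate first, label second, in one prompt. The wording below is an assumption, not the paper's exact template.

```python
def cotr_prompt(marathi_sentence: str, labels: list[str]) -> str:
    """Single prompt that chains translation and emotion labeling."""
    return (
        "Step 1: Translate the following Marathi sentence into English.\n"
        f"Marathi: {marathi_sentence}\n"
        "Step 2: From the English translation, choose all applicable "
        f"emotion labels from: {', '.join(labels)}.\n"
        "Answer with the labels only."
    )

print(cotr_prompt("<Marathi sentence here>", ["joy", "anger", "fear"]))
```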
[51] Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective
Weijie Xu, Yiwen Wang, Chi Xue, Xiangkun Hu, Xi Fang, Guimin Dong, Chandan K. Reddy
Main category: cs.CL
TL;DR: FiSCo is a statistical framework that evaluates LLM fairness by detecting subtle semantic biases in long-form responses across demographic groups using claim-level entailment checks and statistical hypothesis testing.
Details
Motivation: LLMs often generate biased responses, but existing evaluation methods overlook biases in long-form content and fail to account for LLM output variability, undermining reliability in real-world applications.
Method: Decomposes model outputs into semantically distinct claims, uses entailment checks to assess meaning consistency, and applies statistical hypothesis testing to compare inter- and intra-group similarities.
Result: FiSCo reliably identifies nuanced biases while reducing the impact of stochastic LLM variability, outperforming various existing evaluation metrics on synthetic and human-annotated datasets.
Conclusion: FiSCo provides a robust framework for detecting subtle semantic biases in LLM responses, addressing limitations of surface-level analysis and enabling more reliable fairness evaluation across demographic groups.
Abstract: Large Language Models (LLMs) often generate responses with inherent biases, undermining their reliability in real-world applications. Existing evaluation methods often overlook biases in long-form responses and the intrinsic variability of LLM outputs. To address these challenges, we propose FiSCo (Fine-grained Semantic Comparison), a novel statistical framework to evaluate group-level fairness in LLMs by detecting subtle semantic differences in long-form responses across demographic groups. Unlike prior work focusing on sentiment or token-level comparisons, FiSCo goes beyond surface-level analysis by operating at the claim level, leveraging entailment checks to assess the consistency of meaning across responses. We decompose model outputs into semantically distinct claims and apply statistical hypothesis testing to compare inter- and intra-group similarities, enabling robust detection of subtle biases. We formalize a new group counterfactual fairness definition and validate FiSCo on both synthetic and human-annotated datasets spanning gender, race, and age. Experiments show that FiSCo more reliably identifies nuanced biases while reducing the impact of stochastic LLM variability, outperforming various evaluation metrics.
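The core statistical move, comparing intra-group to inter-group similarity of responses, can be sketched in a few lines. The two-sample Welch t-test and the stub similarity are assumptions; FiSCo's actual similarities come from claim-level entailment checks.

```python
import random
from itertools import combinations
from scipy import stats

def group_fairness_test(sim, group_a, group_b):
    """Compare intra-group vs. inter-group response similarities;
    a small p-value suggests group-dependent (potentially biased) outputs."""
    intra = [sim(x, y) for g in (group_a, group_b)
             for x, y in combinations(g, 2)]
    inter = [sim(x, y) for x in group_a for y in group_b]
    return stats.ttest_ind(intra, inter, equal_var=False)

# Toy usage with scalar stand-ins for responses:
sim = lambda x, y: 1.0 - abs(x - y)
a = [random.gauss(0.8, 0.05) for _ in range(10)]
b = [random.gauss(0.5, 0.05) for _ in range(10)]
print(group_fairness_test(sim, a, b))  # large |t|, tiny p -> groups differ
```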
[52] Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization
Jaewook Lee, Alexander Scarlatos, Andrew Lan
Main category: cs.CL
TL;DR: Proposes an interpretable generative framework for Japanese kanji mnemonic generation using explicit rule modeling and EM algorithm, outperforming black-box LLM methods in cold-start scenarios.
Details
Motivation: Japanese vocabulary learning is challenging due to script differences, particularly with complex kanji characters. Existing LLM-based mnemonic generation methods lack interpretability, functioning as black boxes that limit understanding of effective mnemonic creation mechanisms.
Method: A generative framework that explicitly models the mnemonic construction process using common rules, learned through a novel Expectation-Maximization-type algorithm. Trained on learner-authored mnemonics from an online platform to learn latent structures and compositional rules.
Result: The method performs well in cold-start settings for new learners while providing insights into the mechanisms behind effective mnemonic creation. Experiments demonstrate superior performance compared to existing black-box approaches.
Conclusion: The proposed framework enables interpretable and systematic mnemonic generation, offering both practical assistance for Japanese learners and theoretical understanding of effective mnemonic construction principles.
Abstract: Learning Japanese vocabulary is a challenge for learners from Roman alphabet backgrounds due to script differences. Japanese combines syllabaries like hiragana with kanji, which are logographic characters of Chinese origin. Kanji are also complicated due to their complexity and volume. Keyword mnemonics are a common strategy to aid memorization, often using the compositional structure of kanji to form vivid associations. Despite recent efforts to use large language models (LLMs) to assist learners, existing methods for LLM-based keyword mnemonic generation function as a black box, offering limited interpretability. We propose a generative framework that explicitly models the mnemonic construction process as driven by a set of common rules, and learn them using a novel Expectation-Maximization-type algorithm. Trained on learner-authored mnemonics from an online platform, our method learns latent structures and compositional rules, enabling interpretable and systematic mnemonics generation. Experiments show that our method performs well in the cold-start setting for new learners while providing insight into the mechanisms behind effective mnemonic creation.
[53] SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs
Zhiqiang Liu, Enpei Niu, Yin Hua, Mengshu Sun, Lei Liang, Huajun Chen, Wen Zhang
Main category: cs.CL
TL;DR: SKA-Bench is a comprehensive benchmark for evaluating LLMs’ structured knowledge understanding across four knowledge forms (KG, Table, KG+Text, Table+Text) with four fundamental ability testbeds.
Details
Motivation: Existing evaluations for structured knowledge understanding are non-rigorous and focus on single knowledge types, lacking comprehensive assessment of specific capabilities.
Method: Three-stage pipeline construction of instances (question, answer, positive knowledge units, noisy knowledge units) expanded into four testbeds: Noise Robustness, Order Insensitivity, Information Integration, and Negative Rejection.
Result: Evaluation of 8 LLMs shows significant challenges in structured knowledge understanding, with performance affected by noise amount, knowledge unit order, and hallucination issues.
Conclusion: LLMs still struggle with structured knowledge understanding, and SKA-Bench provides a rigorous benchmark for diagnosing these shortcomings across multiple knowledge forms and capabilities.
Abstract: Although large language models (LLMs) have made significant progress in understanding Structured Knowledge (SK) like KG and Table, existing evaluations for SK understanding are non-rigorous (i.e., lacking evaluations of specific capabilities) and focus on a single type of SK. Therefore, we aim to propose a more comprehensive and rigorous structured knowledge understanding benchmark to diagnose the shortcomings of LLMs. In this paper, we introduce SKA-Bench, a Structured Knowledge Augmented QA Benchmark that encompasses four widely used structured knowledge forms: KG, Table, KG+Text, and Table+Text. We utilize a three-stage pipeline to construct SKA-Bench instances, each of which includes a question, an answer, positive knowledge units, and noisy knowledge units. To evaluate the SK understanding capabilities of LLMs in a fine-grained manner, we expand the instances into four fundamental ability testbeds: Noise Robustness, Order Insensitivity, Information Integration, and Negative Rejection. Empirical evaluations on 8 representative LLMs, including the advanced DeepSeek-R1, indicate that existing LLMs still face significant challenges in understanding structured knowledge, and their performance is influenced by factors such as the amount of noise, the order of knowledge units, and the hallucination phenomenon. Our dataset and code are available at https://github.com/zjukg/SKA-Bench.
[54] Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward
Yong Deng, Guoqing Wang, Zhenzhe Ying, Xiaofeng Wu, Jinzhen Lin, Wenwen Xiong, Yuqin Dai, Shuo Yang, Zhanwei Zhang, Qiwen Wang, Yang Qin, Yuan Wang, Quanxing Zha, Sunhao Dai, Changhua Meng
Main category: cs.CL
TL;DR: Atom-Searcher is a novel RL framework that uses Atomic Thought decomposition and fine-grained rewards to improve agentic deep research, outperforming state-of-the-art methods on multiple benchmarks.
Details
Motivation: LLMs struggle with complex tasks due to static knowledge, and current RAG approaches have limitations in multi-hop reasoning. Agentic research methods using outcome-based RL face issues like conflicting gradients and reward sparsity.
Method: Proposes Atomic Thought paradigm that decomposes reasoning into fine-grained functional units supervised by Reasoning Reward Models (RRMs). Atom-Searcher integrates this with a curriculum-inspired reward schedule that transitions from process-level to outcome rewards.
Result: Experiments on seven benchmarks show consistent improvements over state-of-the-art methods. The framework enables scalable computation, provides supervision anchors for RRMs, and exhibits more interpretable, human-like reasoning patterns.
Conclusion: Atom-Searcher effectively addresses limitations of current agentic research approaches by combining fine-grained reasoning decomposition with strategic reward scheduling, leading to superior performance and more interpretable reasoning processes.
Abstract: Large language models (LLMs) exhibit remarkable problem-solving abilities, but struggle with complex tasks due to static internal knowledge. Retrieval-Augmented Generation (RAG) enhances access to external information, yet remains limited in multi-hop reasoning and strategic search due to rigid workflows. Recent advancements in agentic deep research empower LLMs to autonomously reason, search, and synthesize information. However, current approaches relying on outcome-based reinforcement learning (RL) face critical issues such as conflicting gradients and reward sparsity, limiting performance gains and training efficiency. To address these, we first propose Atomic Thought, a novel LLM thinking paradigm that decomposes reasoning into fine-grained functional units. These units are supervised by Reasoning Reward Models (RRMs), which provide Atomic Thought Rewards (ATR) for fine-grained guidance. Building on this, we propose Atom-Searcher, a novel RL framework for agentic deep research that integrates Atomic Thought and ATR. Atom-Searcher uses a curriculum-inspired reward schedule, prioritizing process-level ATR early and transitioning to outcome rewards, accelerating convergence on effective reasoning paths. Experiments on seven benchmarks show consistent improvements over the state-of-the-art. Key advantages include: (1) Atom-Searcher scales computation at test-time. (2) Atomic Thought provides supervision anchors for RRMs, bridging deep research tasks and RRMs. (3) Atom-Searcher exhibits more interpretable, human-like reasoning patterns.
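The curriculum-inspired reward schedule, process-level Atomic Thought Rewards early and outcome rewards late, amounts to a decaying mixing weight. A toy sketch; the linear decay and equal reward scales are assumptions, not the paper's schedule.

```python
def blended_reward(atr: float, outcome: float, step: int,
                   total_steps: int) -> float:
    """Weight process-level ATR early in training, outcome rewards late."""
    w = max(0.0, 1.0 - step / total_steps)  # decays linearly from 1 to 0
    return w * atr + (1.0 - w) * outcome

print(blended_reward(atr=0.9, outcome=0.2, step=0, total_steps=100))    # 0.9
print(blended_reward(atr=0.9, outcome=0.2, step=100, total_steps=100))  # 0.2
```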
[55] Trust but Verify! A Survey on Verification Design for Test-time Scaling
V Venktesh, Mandeep Rathee, Avishek Anand
Main category: cs.CL
TL;DR: Survey paper on test-time scaling verifiers for LLMs, covering diverse verification approaches, training mechanisms, and their utility in improving model performance during inference.
Details
Motivation: Despite widespread adoption of verifiers in test-time scaling, there is no comprehensive collection, categorization, or discussion of diverse verification approaches and their training mechanisms.
Method: The authors conduct a systematic survey of literature on verifier-based test-time scaling, presenting a unified view of verifier training, types (prompt-based, fine-tuned discriminative/generative models), and their utility in exploring solution spaces.
Result: The survey provides detailed categorization and analysis of various verification approaches used to score candidate outputs and select optimal outcomes during LLM inference.
Conclusion: Verifier-based test-time scaling emerges as a superior approach for parameter-free scaling at inference time, offering high performance gains through diligent exploration of vast solution spaces using reward models.
Abstract: Test-time scaling (TTS) has emerged as a new frontier for scaling the performance of Large Language Models. In test-time scaling, by using more computational resources during inference, LLMs can improve their reasoning process and task performance. Several approaches have emerged for TTS, such as distilling reasoning traces from another model or exploring the vast decoding search space by employing a verifier. The verifiers serve as reward models that help score the candidate outputs from the decoding process to diligently explore the vast solution space and select the best outcome. This paradigm has emerged as a superior approach owing to parameter-free scaling at inference time and high performance gains. The verifiers could be prompt-based, or fine-tuned as a discriminative or generative model to verify process paths, outcomes, or both. Despite their widespread adoption, there is no detailed collection, clear categorization and discussion of diverse verification approaches and their training mechanisms. In this survey, we cover the diverse approaches in the literature and present a unified view of verifier training, types and their utility in test-time scaling. Our repository can be found at https://github.com/elixir-research-group/Verifierstesttimescaling.github.io.
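At its simplest, the verifier-based test-time scaling loop the survey covers is best-of-N selection. A minimal sketch where `generate` and `verify` are assumed callables (an LLM sampler and a reward model, respectively):

```python
def best_of_n(prompt, generate, verify, n: int = 8):
    """Sample n candidates and keep the one the verifier scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verify)

# Toy usage: the "generator" proposes numbers, the "verifier" prefers larger.
import random
print(best_of_n(None, lambda _: random.random(), lambda c: c))
```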
[56] Active Domain Knowledge Acquisition with 100-Dollar Budget: Enhancing LLMs via Cost-Efficient, Expert-Involved Interaction in Sensitive Domains
Yang Wu, Raha Moraffah, Rujing Yao, Jinhong Yu, Zhimin Tao, Xiaozhong Liu
Main category: cs.CL
TL;DR: PU-ADKA is a framework that enhances domain-specific LLMs by selectively querying human experts within budget constraints, outperforming traditional fine-tuning methods.
Details
Motivation: LLMs lack expert knowledge in specialized domains like drug discovery and rare disease research, and traditional approaches don't efficiently utilize human expertise under budget limitations.
Method: A novel framework that actively engages domain experts by selectively identifying and querying the most appropriate expert based on availability, knowledge boundaries, and consultation costs, trained using PubMed data simulations.
Result: Validated through controlled expert interactions and real-world deployment with a drug development team, demonstrating effectiveness in enhancing LLM performance under strict budget constraints.
Conclusion: PU-ADKA provides an efficient solution for domain knowledge acquisition in LLMs and introduces a new benchmark dataset (CKAD) to advance research in cost-effective expert knowledge integration.
Abstract: Large Language Models (LLMs) have demonstrated an impressive level of general knowledge. However, they often struggle in highly specialized and cost-sensitive domains such as drug discovery and rare disease research due to the lack of expert knowledge. In this paper, we propose a novel framework (PU-ADKA) designed to efficiently enhance domain-specific LLMs by actively engaging domain experts within a fixed budget. Unlike traditional fine-tuning approaches, PU-ADKA selectively identifies and queries the most appropriate expert from a team, taking into account each expert’s availability, knowledge boundaries, and consultation costs. We train PU-ADKA using simulations on PubMed data and validate it through both controlled expert interactions and real-world deployment with a drug development team, demonstrating its effectiveness in enhancing LLM performance in specialized domains under strict budget constraints. In addition to outlining our methodological innovations and experimental results, we introduce a new benchmark dataset, CKAD, for cost-effective LLM domain knowledge acquisition to foster further research in this challenging area.
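One plausible reading of the budget-constrained expert selection is a greedy gain-per-cost rule over available experts. All field names and the scoring rule below are illustrative assumptions, not PU-ADKA's actual learned policy.

```python
def pick_expert(experts, topic, budget):
    """Among affordable, available experts, pick the best
    expected-expertise-per-cost for the question's topic."""
    affordable = [e for e in experts if e["available"] and e["cost"] <= budget]
    if not affordable:
        return None
    return max(affordable,
               key=lambda e: e["expertise"].get(topic, 0.0) / e["cost"])

experts = [
    {"name": "A", "available": True, "cost": 50, "expertise": {"oncology": 0.9}},
    {"name": "B", "available": True, "cost": 10, "expertise": {"oncology": 0.4}},
]
print(pick_expert(experts, "oncology", budget=100)["name"])  # "B": 0.04 > 0.018
```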
[57] German4All – A Dataset and Model for Readability-Controlled Paraphrasing in German
Miriam Anschütz, Thanh Mai Pham, Eslam Nasrallah, Maximilian Müller, Cristian-George Craciun, Georg Groh
Main category: cs.CL
TL;DR: German4All is the first large-scale German dataset with aligned readability-controlled paraphrases across 5 complexity levels, created using GPT-4 and used to train a state-of-the-art open-source paraphrasing model for German text simplification.
Details
Motivation: There is a need for creating accessible texts that can be tailored to diverse reader groups through paraphrasing across different complexity levels, particularly for the German language.
Method: Automatically synthesized over 25,000 paragraph-level paraphrases using GPT-4 across five readability levels, with rigorous evaluation through both human and LLM-based judgments.
Result: Created German4All dataset and trained an open-source readability-controlled paraphrasing model that achieves state-of-the-art performance in German text simplification.
Conclusion: The dataset and model are open-sourced to encourage further research on multi-level paraphrasing, enabling more nuanced and reader-specific text adaptations in German.
Abstract: The ability to paraphrase texts across different complexity levels is essential for creating accessible texts that can be tailored toward diverse reader groups. Thus, we introduce German4All, the first large-scale German dataset of aligned readability-controlled, paragraph-level paraphrases. It spans five readability levels and comprises over 25,000 samples. The dataset is automatically synthesized using GPT-4 and rigorously evaluated through both human and LLM-based judgments. Using German4All, we train an open-source, readability-controlled paraphrasing model that achieves state-of-the-art performance in German text simplification, enabling more nuanced and reader-specific adaptations. We open-source both the dataset and the model to encourage further research on multi-level paraphrasing.
[58] Uncovering the Bigger Picture: Comprehensive Event Understanding Via Diverse News Retrieval
Yixuan Tang, Yuanyuan Shi, Yiqun Sun, Anthony Kum Hoe Tung
Main category: cs.CL
TL;DR: NEWSCOPE is a two-stage framework for diverse news retrieval that uses sentence-level clustering and diversity-aware re-ranking to reduce redundancy and improve viewpoint exposure while maintaining relevance.
Details
Motivation: Most news retrieval systems prioritize textual relevance, leading to redundant results and limited exposure to the diverse perspectives that are essential for comprehensive event understanding.
Method: Two-stage framework: 1) dense retrieval for topically relevant content, 2) sentence-level clustering and diversity-aware re-ranking to surface complementary information. Introduces three interpretable diversity metrics and constructs two paragraph-level benchmarks.
Result: NEWSCOPE consistently outperforms strong baselines, achieving significantly higher diversity without compromising relevance. Demonstrates effectiveness of fine-grained, interpretable modeling.
Conclusion: The framework successfully mitigates redundancy and promotes comprehensive event understanding through fine-grained semantic modeling and interpretable diversity metrics.
Abstract: Access to diverse perspectives is essential for understanding real-world events, yet most news retrieval systems prioritize textual relevance, leading to redundant results and limited viewpoint exposure. We propose NEWSCOPE, a two-stage framework for diverse news retrieval that enhances event coverage by explicitly modeling semantic variation at the sentence level. The first stage retrieves topically relevant content using dense retrieval, while the second stage applies sentence-level clustering and diversity-aware re-ranking to surface complementary information. To evaluate retrieval diversity, we introduce three interpretable metrics, namely Average Pairwise Distance, Positive Cluster Coverage, and Information Density Ratio, and construct two paragraph-level benchmarks: LocalNews and DSGlobal. Experiments show that NEWSCOPE consistently outperforms strong baselines, achieving significantly higher diversity without compromising relevance. Our results demonstrate the effectiveness of fine-grained, interpretable modeling in mitigating redundancy and promoting comprehensive event understanding. The data and code are available at https://github.com/tangyixuan/NEWSCOPE.
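Of the three diversity metrics, Average Pairwise Distance is the most self-describing. A small sketch over embedding vectors; the cosine-distance choice is an assumption:

```python
from itertools import combinations
import numpy as np

def average_pairwise_distance(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance over a set of retrieved items."""
    def cos_dist(a, b):
        return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pairs = list(combinations(range(len(embeddings)), 2))
    return sum(cos_dist(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)

print(average_pairwise_distance(np.eye(3)))  # orthogonal items -> 1.0
```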
[59] Exploring Selective Retrieval-Augmentation for Long-Tail Legal Text Classification
Boheng Mao
Main category: cs.CL
TL;DR: Selective Retrieval-Augmentation (SRA) improves legal text classification on long-tail distributions by augmenting only low-frequency labels from training data, achieving consistent F1 gains without architecture changes.
Details
Motivation: Legal text classification datasets often have long-tail label distributions where rare classes perform poorly due to underrepresentation, requiring targeted augmentation without introducing noise to well-represented classes.
Method: SRA selectively retrieves and augments samples only for low-frequency labels from the training data itself, avoiding external corpora and information leakage, while requiring no model architecture modifications.
Result: SRA achieved consistent improvements in both micro-F1 and macro-F1 scores over LexGLUE baselines on LEDGAR (single-label) and UNFAIR-ToS (multi-label) benchmark datasets.
Conclusion: Selective retrieval-augmentation is an effective proof-of-concept approach for handling long-tail distributions in legal text classification, providing performance gains without architectural changes or external data sources.
Abstract: Legal text classification is a fundamental NLP task in the legal domain. Benchmark datasets in this area often exhibit a long-tail label distribution, where many labels are underrepresented, leading to poor model performance on rare classes. This paper explores Selective Retrieval-Augmentation (SRA) as a proof-of-concept approach to this problem. SRA focuses on augmenting samples belonging to low-frequency labels in the training set, preventing the introduction of noise for well-represented classes, and requires no changes to the model architecture. Retrieval is performed only from the training data to ensure there is no potential information leakage, removing the need for external corpora simultaneously. SRA is tested on two legal text classification benchmark datasets with long-tail distributions: LEDGAR (single-label) and UNFAIR-ToS (multi-label). Results show that SRA achieves consistent gains in both micro-F1 and macro-F1 over LexGLUE baselines.
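The mechanism is simple enough to sketch: only samples with rare labels get retrieved context, and retrieval never leaves the training set. `retrieve` is an assumed callable, and the single-label case is shown for brevity.

```python
from collections import Counter

def selective_retrieval_augmentation(train, retrieve, freq_threshold=50):
    """Augment only low-frequency-label samples, retrieving from the
    training data itself to avoid external corpora and leakage."""
    label_freq = Counter(label for _, label in train)
    augmented = []
    for text, label in train:
        if label_freq[label] < freq_threshold:
            neighbors = retrieve(text, corpus=train)  # training data only
            text = text + " " + " ".join(n_text for n_text, _ in neighbors)
        augmented.append((text, label))
    return augmented
```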
[60] UI-Bench: A Benchmark for Evaluating Design Capabilities of AI Text-to-App Tools
Sam Jung, Agustin Garcinuno, Spencer Mateega
Main category: cs.CL
TL;DR: UI-Bench is the first large-scale benchmark for evaluating AI text-to-app tools through expert pairwise comparisons of 300 generated sites from 10 tools, establishing a reproducible standard with calibrated rankings.
Details
Motivation: There is no public benchmark to rigorously verify claims made by AI text-to-app tools about generating high-quality applications and websites quickly. Current tools promise excellent results but lack standardized evaluation.
Method: Created UI-Bench with 30 prompts across 10 tools, generating 300 sites. Used expert pairwise comparisons (4,000+ judgments) and a TrueSkill-derived model for ranking with calibrated confidence intervals.
Result: Established the first comprehensive benchmark for AI text-to-app tools, providing reproducible evaluation standards, open-source framework, complete prompt set, and public leaderboard with tool rankings.
Conclusion: UI-Bench provides a rigorous, standardized benchmark for advancing AI-driven web design, enabling reproducible evaluation and comparison of text-to-app tools through expert validation and statistical ranking.
Abstract: AI text-to-app tools promise high quality applications and websites in minutes, yet no public benchmark rigorously verifies those claims. We introduce UI-Bench, the first large-scale benchmark that evaluates visual excellence across competing AI text-to-app tools through expert pairwise comparison. Spanning 10 tools, 30 prompts, 300 generated sites, and 4,000+ expert judgments, UI-Bench ranks systems with a TrueSkill-derived model that yields calibrated confidence intervals. UI-Bench establishes a reproducible standard for advancing AI-driven web design. We release (i) the complete prompt set, (ii) an open-source evaluation framework, and (iii) a public leaderboard. The generated sites rated by participants will be released soon. View the UI-Bench leaderboard at https://uibench.ai/leaderboard.
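Turning 4,000+ pairwise judgments into a ranking is the job of the TrueSkill-derived model. A minimal sketch with the off-the-shelf trueskill package; UI-Bench's own derived model also produces calibrated confidence intervals, which this toy version does not.

```python
import trueskill  # pip install trueskill

def rank_tools(tools, pairwise_wins):
    """Rank tools from (winner, loser) expert judgments via TrueSkill."""
    ratings = {t: trueskill.Rating() for t in tools}
    for winner, loser in pairwise_wins:
        ratings[winner], ratings[loser] = trueskill.rate_1vs1(
            ratings[winner], ratings[loser])
    return sorted(tools, key=lambda t: ratings[t].mu, reverse=True)

print(rank_tools(["X", "Y"], [("X", "Y"), ("X", "Y"), ("Y", "X")]))
```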
[61] KG-CQR: Leveraging Structured Relation Representations in Knowledge Graphs for Contextual Query Retrieval
Chi Minh Bui, Ngoc Mai Thieu, Van Vinh Nguyen, Jason J. Jung, Khac-Hoai Nam Bui
Main category: cs.CL
TL;DR: KG-CQR enhances RAG retrieval by enriching queries using knowledge graphs, achieving 4-6% mAP and 2-3% Recall@25 improvements over baselines.
Details
Motivation: Improve the retrieval phase of RAG systems by addressing contextual query representation limitations through knowledge graph integration.
Method: Proposes the KG-CQR framework with subgraph extraction, completion, and contextual generation modules for query enrichment using corpus-centric knowledge graphs.
Result: Achieves 4-6% improvement in mAP and 2-3% improvement in Recall@25 on RAGBench and MultiHop-RAG datasets, with consistent outperformance on multi-hop QA tasks.
Conclusion: KG-CQR effectively enhances retrieval effectiveness in RAG systems through structured knowledge graph integration without requiring additional LLM training.
Abstract: The integration of knowledge graphs (KGs) with large language models (LLMs) offers significant potential to improve the retrieval phase of retrieval-augmented generation (RAG) systems. In this study, we propose KG-CQR, a novel framework for Contextual Query Retrieval (CQR) that enhances the retrieval phase by enriching the contextual representation of complex input queries using a corpus-centric KG. Unlike existing methods that primarily address corpus-level context loss, KG-CQR focuses on query enrichment through structured relation representations, extracting and completing relevant KG subgraphs to generate semantically rich query contexts. Comprising subgraph extraction, completion, and contextual generation modules, KG-CQR operates as a model-agnostic pipeline, ensuring scalability across LLMs of varying sizes without additional training. Experimental results on RAGBench and MultiHop-RAG datasets demonstrate KG-CQR's superior performance, achieving a 4-6% improvement in mAP and a 2-3% improvement in Recall@25 over strong baseline models. Furthermore, evaluations on challenging RAG tasks such as multi-hop question answering show that incorporating KG-CQR consistently outperforms the existing baseline in terms of retrieval effectiveness.
cs.CV
[62] 2COOOL: 2nd Workshop on the Challenge Of Out-Of-Label Hazards in Autonomous Driving
Ali K. AlShami, Ryan Rabinowitz, Maged Shoman, Jianwu Fang, Lukas Picek, Shao-Yuan Lo, Steve Cruz, Khang Nhut Lam, Nachiket Kamod, Lei-Lei Li, Jugal Kalita, Terrance E. Boult
Main category: cs.CV
TL;DR: Workshop on addressing novel scenarios and out-of-label hazards in autonomous driving through vision-language models, anomaly detection, and open-set recognition to improve safety.
Details
Motivation: Despite advances in autonomous driving, entirely safe self-driving cars remain elusive due to novel scenarios and out-of-distribution hazards that current systems cannot handle effectively.
Method: The 2COOOL workshop brings together researchers and industry experts to develop new algorithms using anomaly detection, open-set recognition, open-vocabulary modeling, domain adaptation, and vision-language models for hazard understanding and avoidance.
Result: A dedicated forum established at ICCV 2025 to push state-of-the-art in novelty handling, building on the success of the inaugural workshop at WACV 2025 with academic and industry participation.
Conclusion: Addressing novel scenarios through specialized workshops and interdisciplinary approaches is essential for achieving safe autonomous driving deployment in real-world conditions.
Abstract: As the computer vision community advances autonomous driving algorithms, integrating vision-based insights with sensor data remains essential for improving perception, decision making, planning, prediction, simulation, and control. Yet we must ask: Why don’t we have entirely safe self-driving cars yet? A key part of the answer lies in addressing novel scenarios, one of the most critical barriers to real-world deployment. Our 2COOOL workshop provides a dedicated forum for researchers and industry experts to push the state of the art in novelty handling, including out-of-distribution hazard detection, vision-language models for hazard understanding, new benchmarking and methodologies, and safe autonomous driving practices. The 2nd Workshop on the Challenge of Out-of-Label Hazards in Autonomous Driving (2COOOL) will be held at the International Conference on Computer Vision (ICCV) 2025 in Honolulu, Hawaii, on October 19, 2025. We aim to inspire the development of new algorithms and systems for hazard avoidance, drawing on ideas from anomaly detection, open-set recognition, open-vocabulary modeling, domain adaptation, and related fields. Building on the success of its inaugural edition at the Winter Conference on Applications of Computer Vision (WACV) 2025, the workshop will feature a mix of academic and industry participation.
[63] GLENDA: Gynecologic Laparoscopy Endometriosis Dataset
Andreas Leibetseder, Sabrina Kletz, Klaus Schoeffmann, Simon Keckstein, Jörg Keckstein
Main category: cs.CV
TL;DR: GLENDA is the first public image dataset for endometriosis detection in gynecologic laparoscopy, featuring expert-annotated region-based annotations to address data scarcity in medical computer vision research.
Details
Motivation: Manual analysis of surgical recordings is time-consuming, and current computer vision approaches for gynecologic laparoscopy suffer from limited sample data availability in the medical field, particularly for endometriosis detection.
Method: Created the GLENDA dataset through collaboration with leading medical experts, containing region-based annotations of endometriosis (uterine-like tissue dislocation) in gynecologic laparoscopy images.
Result: Published the first-of-its-kind image dataset specifically designed for endometriosis detection in minimally invasive gynecologic surgery, providing valuable training data for computer vision and machine learning approaches.
Conclusion: GLENDA addresses the critical data scarcity problem in medical computer vision research and enables the development of more sophisticated automated analysis tools for gynecologic laparoscopy recordings, benefiting treatment planning, documentation, and education.
Abstract: Gynecologic laparoscopy, a type of minimally invasive surgery (MIS), is performed via a live feed of the patient's abdomen that shows the insertion and handling of various instruments during treatment. This kind of surgical intervention not only facilitates a great variety of treatments; the ability to record the video streams is also essential for numerous post-surgical activities, such as treatment planning, case documentation and education. Nonetheless, manually analyzing surgical recordings, as is done in current practice, usually proves tediously time-consuming. To improve upon this situation, more sophisticated computer vision and machine learning approaches are actively being developed. Since most such approaches rely heavily on sample data, which is only sparsely available in the medical field in particular, with this work we publish the Gynecologic Laparoscopy ENdometriosis DAtaset (GLENDA) - an image dataset containing region-based annotations of a common medical condition named endometriosis, i.e. the dislocation of uterine-like tissue. The dataset is the first of its kind and was created in collaboration with leading medical experts in the field.
[64] Advanced Deep Learning Techniques for Classifying Dental Conditions Using Panoramic X-Ray Images
Alireza Golkarieh, Kiana Kiashemshaki, Sajjad Rezvani Boroujeni
Main category: cs.CV
TL;DR: Deep learning study shows hybrid CNN-Random Forest model achieves 85.4% accuracy for automated dental condition classification in panoramic X-rays, outperforming custom CNN and pre-trained models.
Details
Motivation: To develop automated classification methods for dental conditions in panoramic X-ray images to support dental diagnostics and reduce manual analysis burden.
Method: Used a dataset of 1,512 radiographs with 11,137 expert annotations across four conditions. Evaluated three approaches: custom CNN, hybrid CNN+traditional classifiers, and fine-tuned pre-trained models (VGG16, Xception, ResNet50) using 5-fold cross-validation.
Result: Hybrid CNN-Random Forest achieved highest performance with 85.4% accuracy, surpassing custom CNN baseline (74.3%). VGG16 performed best among pre-trained models at 82.3% accuracy.
Conclusion: Hybrid models combining CNN feature extraction with ensemble classifiers provide efficient and reliable performance for dental condition classification, offering practical automated diagnostic support while needing larger datasets and clinical validation.
Abstract: This study investigates deep learning methods for automated classification of dental conditions in panoramic X-ray images. A dataset of 1,512 radiographs with 11,137 expert-verified annotations across four conditions (fillings, cavities, implants, and impacted teeth) was used. After preprocessing and class balancing, three approaches were evaluated: a custom convolutional neural network (CNN), hybrid models combining CNN feature extraction with traditional classifiers, and fine-tuned pre-trained architectures. Experiments employed 5-fold cross-validation with accuracy, precision, recall, and F1 score as evaluation metrics. The hybrid CNN-Random Forest model achieved the highest performance with 85.4% accuracy, surpassing the custom CNN baseline of 74.3%. Among pre-trained models, VGG16 performed best at 82.3% accuracy, followed by Xception and ResNet50. Results show that hybrid models improve discrimination of morphologically similar conditions and provide efficient, reliable performance. These findings suggest that combining CNN-based feature extraction with ensemble classifiers offers a practical path toward automated dental diagnostic support, while also highlighting the need for larger datasets and further clinical validation.
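The winning hybrid is a frozen CNN feeding a Random Forest. A minimal Keras/scikit-learn sketch with stand-in data; the preprocessing, class balancing, and hyperparameters are assumptions, not the study's exact setup.

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from sklearn.ensemble import RandomForestClassifier

cnn = VGG16(weights="imagenet", include_top=False, pooling="avg")  # frozen extractor

# Stand-in data; replace with preprocessed panoramic X-ray crops and labels.
images = np.random.rand(8, 224, 224, 3).astype("float32")
labels = np.array([0, 1, 2, 3, 0, 1, 2, 3])  # fillings/cavities/implants/impacted

features = cnn.predict(images, verbose=0)           # (8, 512) pooled features
clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(features, labels)
print(clf.predict(features[:2]))
```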
[65] Identifying Surgical Instruments in Laparoscopy Using Deep Learning Instance Segmentation
Sabrina Kletz, Klaus Schoeffmann, Jenny Benois-Pineau, Heinrich Husslein
Main category: cs.CV
TL;DR: This paper evaluates surgical instrument segmentation and recognition in laparoscopic gynecology videos using region-based fully convolutional networks, achieving high accuracy for binary segmentation but facing challenges in multi-class instrument recognition due to instrument similarity.
Details
Motivation: Recorded surgical videos provide detailed information but automatic content indexing for searchable archives remains challenging due to the specialized nature of medical video content.
Method: Used a region-based fully convolutional network for instance-aware instrument segmentation (binary classification) and instrument recognition (multi-class classification) in laparoscopic gynecology videos.
Result: Achieved high accuracy for instrument localization and segmentation even with a moderately low number of training examples, but multi-class instrument recognition proved very challenging due to high similarity between surgical instruments.
Conclusion: While binary segmentation of surgical instruments from background is feasible with good accuracy, identifying specific instrument types remains difficult and requires further research to address the challenges posed by instrument similarity.
Abstract: Recorded videos from surgeries have become an increasingly important information source for the field of medical endoscopy, since the recorded footage shows every single detail of the surgery. However, while video recording is straightforward these days, automatic content indexing - the basis for content-based search in a medical video archive - is still a great challenge due to the very special video content. In this work, we investigate segmentation and recognition of surgical instruments in videos recorded from laparoscopic gynecology. More precisely, we evaluate the achievable performance of segmenting surgical instruments from their background by using a region-based fully convolutional network for instance-aware (1) instrument segmentation as well as (2) instrument recognition. While the first part addresses only binary segmentation of instances (i.e., distinguishing between instrument or background), we also investigate multi-class instrument recognition (i.e., identifying the type of instrument). Our evaluation results show that even with a moderately low number of training examples, we are able to localize and segment instrument regions with high accuracy. However, the results also reveal that determining the particular instrument is still very challenging, due to the inherently high similarity of surgical instruments.
[66] Q-Align: Alleviating Attention Leakage in Zero-Shot Appearance Transfer via Query-Query Alignment
Namu Kim, Wonbin Kweon, Minsoo Kim, Hwanjo Yu
Main category: cs.CV
TL;DR: Q-Align addresses attention leakage in zero-shot appearance transfer by using Query-Query alignment instead of Query-Key alignment, improving semantic mapping between images.
Details
Motivation: Zero-shot appearance transfer with large-scale image generation models faces attention leakage issues due to Query-Key alignment capturing semantic mapping between images.
Method: Q-Align introduces three core components: Query-Query alignment for spatial semantic mapping, Key-Value rearrangement for enhanced feature correspondence, and attention refinement using rearranged keys/values for semantic consistency.
Result: Q-Align outperforms state-of-the-art methods in appearance fidelity while maintaining competitive structure preservation.
Conclusion: The proposed Q-Align method effectively mitigates attention leakage and improves semantic alignment in zero-shot appearance transfer tasks.
Abstract: We observe that zero-shot appearance transfer with large-scale image generation models faces a significant challenge: Attention Leakage. This challenge arises when the semantic mapping between two images is captured by the Query-Key alignment. To tackle this issue, we introduce Q-Align, utilizing Query-Query alignment to mitigate attention leakage and improve the semantic alignment in zero-shot appearance transfer. Q-Align incorporates three core contributions: (1) Query-Query alignment, facilitating the sophisticated spatial semantic mapping between two images; (2) Key-Value rearrangement, enhancing feature correspondence through realignment; and (3) Attention refinement using rearranged keys and values to maintain semantic consistency. We validate the effectiveness of Q-Align through extensive experiments and analysis, and Q-Align outperforms state-of-the-art methods in appearance fidelity while maintaining competitive structure preservation.
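The central substitution, aligning queries against queries rather than keys, can be written as a small attention variant. This is a schematic reading of the idea, not the paper's exact operator:

```python
import torch
import torch.nn.functional as F

def query_query_attention(q_src, q_ref, v_ref):
    """Attend source queries to reference *queries* (not keys), then
    aggregate reference values; shapes are (batch, tokens, dim)."""
    scale = q_src.shape[-1] ** -0.5
    attn = F.softmax(q_src @ q_ref.transpose(-2, -1) * scale, dim=-1)
    return attn @ v_ref

out = query_query_attention(torch.randn(1, 64, 128),
                            torch.randn(1, 64, 128),
                            torch.randn(1, 64, 128))
print(out.shape)  # torch.Size([1, 64, 128])
```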
[67] Learning from Silence and Noise for Visual Sound Source Localization
Xavier Juanola, Giovana Morais, Magdalena Fuentes, Gloria Haro
Main category: cs.CV
TL;DR: A new self-supervised sound localization method (SSL-SaN) that handles negative audio cases like silence, noise, and offscreen sounds, with improved training strategy and evaluation metrics.
Details
Motivation: Current visual sound source localization methods perform poorly with low audio-visual semantic correspondence (silence, noise, offscreen sounds) and are only evaluated on positive cases with single visible sound sources.
Method: Proposed a new training strategy incorporating silence and noise, developed the SSL-SaN self-supervised model, created a new metric for the alignment-separability trade-off, and extended the IS3 dataset to IS3+ with negative audio.
Result: SSL-SaN achieves state-of-the-art performance in sound localization and cross-modal retrieval, with improved robustness against negative sounds while maintaining performance in positive cases.
Conclusion: The approach addresses key limitations in current sound localization methods by handling negative audio scenarios and providing better evaluation metrics and datasets for comprehensive assessment.
Abstract: Visual sound source localization is a fundamental perception task that aims to detect the location of sounding sources in a video given its audio. Despite recent progress, we identify two shortcomings in current methods: 1) most approaches perform poorly in cases with low audio-visual semantic correspondence such as silence, noise, and offscreen sounds, i.e. in the presence of negative audio; and 2) most prior evaluations are limited to positive cases, where both datasets and metrics convey scenarios with a single visible sound source in the scene. To address this, we introduce three key contributions. First, we propose a new training strategy that incorporates silence and noise, which improves performance in positive cases, while being more robust against negative sounds. Our resulting self-supervised model, SSL-SaN, achieves state-of-the-art performance compared to other self-supervised models, both in sound localization and cross-modal retrieval. Second, we propose a new metric that quantifies the trade-off between alignment and separability of auditory and visual features across positive and negative audio-visual pairs. Third, we present IS3+, an extended and improved version of the IS3 synthetic dataset with negative audio. Our data, metrics and code are available on the https://xavijuanola.github.io/SSL-SaN/.
[68] Entropy-Based Non-Invasive Reliability Monitoring of Convolutional Neural Networks
Amirhossein Nazeri, Wael Hafez
Main category: cs.CV
TL;DR: Adversarial perturbations create detectable entropy signatures in CNN activations, enabling 90% detection accuracy without model modifications.
Details
Motivation: CNNs are vulnerable to adversarial attacks but existing detection methods require expensive retraining, architecture changes, or degrade clean input performance.
Method: Monitor activation entropy in CNN layers without modifying the model, using parallel entropy monitoring on VGG-16 to detect entropy shifts.
Result: Adversarial inputs consistently shift activation entropy by 7% in early convolutional layers, achieving 90% detection accuracy with false positive/negative rates below 20%.
Conclusion: CNNs inherently encode distribution shifts in activation patterns, enabling real-time adversarial detection through activation entropy monitoring without compromising original model performance.
Abstract: Convolutional Neural Networks (CNNs) have become the foundation of modern computer vision, achieving unprecedented accuracy across diverse image recognition tasks. While these networks excel on in-distribution data, they remain vulnerable to adversarial perturbations: imperceptible input modifications that cause misclassification with high confidence. However, existing detection methods either require expensive retraining, modify network architecture, or degrade performance on clean inputs. Here we show that adversarial perturbations create immediate, detectable entropy signatures in CNN activations that can be monitored without any model modification. Using parallel entropy monitoring on VGG-16, we demonstrate that adversarial inputs consistently shift activation entropy by 7% in early convolutional layers, enabling 90% detection accuracy with false positive and false negative rates below 20%. The complete separation between clean and adversarial entropy distributions reveals that CNNs inherently encode distribution shifts in their activation patterns. This work establishes that CNN reliability can be assessed through activation entropy alone, enabling practical deployment of self-diagnostic vision systems that detect adversarial inputs in real-time without compromising original model performance.
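Because the signal is just the entropy of a layer's activations, the monitor fits in a forward hook. A sketch with a histogram entropy estimator; the binning choice is an assumption.

```python
import torch

def activation_entropy(feature_map: torch.Tensor, bins: int = 64) -> float:
    """Shannon entropy (nats) of a layer's activation distribution."""
    flat = feature_map.detach().flatten().float()
    hist = torch.histc(flat, bins=bins, min=float(flat.min()), max=float(flat.max()))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * p.log()).sum())

# Non-invasive monitoring: hook a layer instead of modifying the model, e.g.
# layer.register_forward_hook(lambda m, i, o: print(activation_entropy(o)))
print(activation_entropy(torch.randn(1, 64, 32, 32)))
```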
[69] ERTACache: Error Rectification and Timesteps Adjustment for Efficient Diffusion
Xurui Peng, Hong Liu, Chenqian Yan, Rui Ma, Fangmin Chen, Xing Wang, Zhihua Wu, Songwei Liu, Mingbao Lin
Main category: cs.CV
TL;DR: ERTACache is a diffusion model acceleration framework that addresses caching-induced errors through residual profiling, dynamic interval adjustment, and analytical error approximation, achieving 2x speedup while maintaining or improving quality.
Details
Motivation: Diffusion models suffer from computational overhead due to iterative inference, and naive feature caching causes quality degradation from cumulative errors.Method: Proposes ERTACache with offline residual profiling to identify reusable steps, trajectory-aware correction coefficients for dynamic interval adjustment, and closed-form residual linearization for error approximation.
Result: Achieves up to 2x inference speedup across image and video generation benchmarks while preserving or improving visual quality, with minimal VBench degradation on Wan2.1 video diffusion model.
Conclusion: ERTACache provides a principled caching framework that effectively addresses both feature shift and step amplification errors, enabling efficient sampling with maintained fidelity.
Abstract: Diffusion models suffer from substantial computational overhead due to their inherently iterative inference process. While feature caching offers a promising acceleration strategy by reusing intermediate outputs across timesteps, naive reuse often incurs noticeable quality degradation. In this work, we formally analyze the cumulative error introduced by caching and decompose it into two principal components: feature shift error, caused by inaccuracies in cached outputs, and step amplification error, which arises from error propagation under fixed timestep schedules. To address these issues, we propose ERTACache, a principled caching framework that jointly rectifies both error types. Our method employs an offline residual profiling stage to identify reusable steps, dynamically adjusts integration intervals via a trajectory-aware correction coefficient, and analytically approximates cache-induced errors through a closed-form residual linearization model. Together, these components enable accurate and efficient sampling under aggressive cache reuse. Extensive experiments across standard image and video generation benchmarks show that ERTACache achieves up to 2x inference speedup while consistently preserving or even improving visual quality. Notably, on the state-of-the-art Wan2.1 video diffusion model, ERTACache delivers 2x acceleration with minimal VBench degradation, effectively maintaining baseline fidelity while significantly improving efficiency. The code is available at https://github.com/bytedance/ERTACache.
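Schematically, caching-based acceleration replaces some full denoiser evaluations with corrected reuse of a cached output. The toy loop below is a heavily simplified sketch: `reusable_steps` stands in for the offline residual profiling, and the fixed scalar `gamma` stands in for ERTACache's trajectory-aware correction coefficient and closed-form residual model, neither of which is reproduced here.

```python
# Schematic cached sampling loop; not ERTACache's actual algorithm.
import torch

def sample_with_cache(model, x, timesteps, reusable_steps, gamma=0.95):
    cache = None
    for t in timesteps:
        if int(t) in reusable_steps and cache is not None:
            eps = gamma * cache       # rectified reuse of cached output
        else:
            eps = model(x, t)         # full forward pass
            cache = eps
        x = x - 0.01 * eps            # placeholder update rule
    return x

toy_denoiser = lambda x, t: 0.1 * x   # stand-in for a real denoiser
out = sample_with_cache(toy_denoiser, torch.randn(1, 4, 8, 8),
                        torch.arange(50, 0, -1), reusable_steps={40, 30, 20})
print(out.shape)
```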
[70] Towards Understanding Camera Motions in Any Video
Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan
Main category: cs.CV
TL;DR: CameraBench is a large-scale dataset with expert-annotated videos and a camera motion taxonomy for evaluating camera motion understanding in AI models.
Details
Motivation: To address the lack of comprehensive benchmarks for assessing camera motion understanding in videos, which is crucial for applications like cinematography, video analysis, and content understanding.Method: Created a dataset of ~3,000 diverse internet videos with expert annotations through multi-stage quality control. Developed a taxonomy of camera motion primitives with cinematographers. Conducted human studies to quantify annotation performance and trained models including fine-tuning generative VLMs.
Result: Human studies showed expertise and training significantly improve annotation accuracy. Evaluation revealed SfM models struggle with semantic primitives while VLMs struggle with geometric primitives. Fine-tuned VLM achieved best performance combining both capabilities.
Conclusion: CameraBench provides a valuable benchmark and taxonomy for camera motion understanding, enabling improved model performance and applications in motion-augmented captioning, video QA, and retrieval.
Abstract: We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like “follow” (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
[71] Video-LLMs with Temporal Visual Screening
Zheyu Fan, Jiateng Liu, Yuji Zhang, Zihan Wang, Yi R. Fung, Manling Li, Heng Ji
Main category: cs.CV
TL;DR: TVS is a temporal visual screening method that improves Video-LLMs by focusing on critical video segments and simplifying queries while maintaining answer consistency, achieving significant performance gains.
Details
Motivation: Current Video-LLMs struggle with fine-grained temporal semantics due to sparse frame sampling and lack of inter-frame reasoning supervision, while humans naturally perform temporal screening by focusing on salient segments.Method: Proposes Temporal Visual Screening (TVS) that pre-processes video QA data by: (1) retaining focus-critical segments, (2) synchronously reconstructing queries to direct form while preserving answers, (3) maintaining answer invariance and consistency. TVS serves as a modular front-end adapter for both training and inference pipelines.
Result: TVS achieves relative gains of 7.33% during training and 34.6% during inference. The proposed baseline ReSimplifyIt outperforms prior approaches by 0.47 in F-1 score on video trimming while maintaining competitive query rewriting performance.
Conclusion: Temporal visual screening effectively improves video-language understanding by optimizing reasoning burden distribution and cognitive load, enabling better alignment between queries and visual information.
Abstract: Humans naturally perform temporal screening by dragging the progress bar and focusing on salient temporal segments, but current Video Large Language Models (Video-LLMs) struggle to capture fine-grained temporal semantics due to sparse frame sampling and insufficient inter-frame reasoning supervision during their training. To address this, inspired by well-established cognitive science principles, we propose Temporal Visual Screening (TVS), a new task that universally pre-processes video question answering and instruction tuning data by: (1) retaining focus-critical video segments, (2) synchronously reconstructing queries to their most direct form while preserving answer consistency, and (3) preserving invariance and consistency for any possible answer. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. TVS optimizes the distribution of reasoning burden and cognitive load: during training, it aligns queries with focus-critical visual information; at inference, it enables query-aware segment focus and streamlined query representations. In particular, we curate the first benchmark for TVS and propose ReSimplifyIt, a baseline outperforming prior approaches on seemingly similar tasks by 0.47 in F-1 score on video trimming while achieving competitive query rewriting performance. Experiments demonstrate that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference), demonstrating the effectiveness of temporal information screening for improving video-language understanding.
[72] Consistent and Invariant Generalization Learning for Short-video Misinformation Detection
Hanghui Guo, Weijie Shi, Mengze Li, Juncheng Li, Hao Chen, Yue Cui, Jiajie Xu, Jia Zhu, Jiawei Shen, Zhangze Chen, Sirui Han
Main category: cs.CV
TL;DR: DOCTOR is a domain generalization model for short-video misinformation detection that uses cross-modal consistency and invariance learning to handle domain gaps across different modalities.
Details
Motivation: Current short-video misinformation detection models trained on specific domains perform poorly on unseen domains due to domain gaps, particularly in how different domains rely on different modalities (video vs audio) and how domain biases accumulate during cross-modal fusion.Method: Proposes DOCTOR with two modules: (1) cross-modal feature interpolation and interpolation distillation to map modalities into shared space and synchronize multi-modal learning, (2) diffusion model to add noise while retaining core features and enhance domain-invariant features through cross-modal guided denoising.
Result: Extensive experiments demonstrate the effectiveness of the proposed DOCTOR model for domain generalization in short-video misinformation detection.
Conclusion: The DOCTOR framework successfully addresses domain generalization challenges in short-video misinformation detection by leveraging cross-modal consistency and invariance learning techniques to handle modality-specific domain dependencies and bias accumulation.
Abstract: Short-video misinformation detection has attracted wide attention in the multi-modal domain, aiming to accurately identify misinformation in video content accompanied by corresponding audio. Despite significant advancements, current models in this field, trained on particular domains (source domains), often exhibit unsatisfactory performance on unseen domains (target domains) due to domain gaps. To realize such domain generalization for short-video misinformation detection, we offer two insights into the characteristics of different domains: (1) detection in different domains may rely mainly on different modalities (i.e., mainly on video or on audio); to enhance domain generalization, it is crucial to achieve optimal model performance on all modalities simultaneously. (2) For domains centered on cross-modal joint fraud, a comprehensive analysis relying on cross-modal fusion is necessary; however, domain biases located in each modality (especially in each frame of videos) accumulate during this fusion process, which can seriously damage the final identification of misinformation. To address these issues, we propose a new DOmain generalization model via ConsisTency and invariance learning for shORt-video misinformation detection (named DOCTOR), which contains two characteristic modules: (1) cross-modal feature interpolation to map multiple modalities into a shared space, with interpolation distillation to synchronize multi-modal learning; and (2) a diffusion model that adds noise while retaining the core features of each modality and enhances domain-invariant features through cross-modal guided denoising. Extensive experiments demonstrate the effectiveness of our proposed DOCTOR model. Our code is publicly available at https://github.com/ghh1125/DOCTOR.
[73] ROBUST-MIPS: A Combined Skeletal Pose and Instance Segmentation Dataset for Laparoscopic Surgical Instruments
Zhe Han, Charlie Budd, Gongyu Zhang, Huanyu Tian, Christos Bergeles, Tom Vercauteren
Main category: cs.CV
TL;DR: The paper introduces ROBUST-MIPS, a surgical tool pose and instance segmentation dataset derived from ROBUST-MIS, arguing that skeletal pose annotations are more efficient than segmentation for surgical tool localization while maintaining semantic richness.
Details
Motivation: Current deep learning approaches for surgical tool localization are limited by the availability of diverse annotated data. The authors propose that skeletal pose annotations strike a better balance between semantic information richness and annotation efficiency, enabling faster growth of annotated datasets.Method: The authors created ROBUST-MIPS by enriching the existing ROBUST-MIS dataset with tool pose annotations. They developed custom tool pose annotation software and established a benchmark using popular pose estimation methods to evaluate the adequacy of pose annotations for surgical tool localization.
Result: The study demonstrated high-quality results using pose estimation methods for surgical tool localization, showing that pose annotations are adequate for this task. The enriched dataset enables joint study of pose and segmentation annotations and facilitates comparisons on downstream tasks.
Conclusion: Skeletal pose annotations represent an efficient and effective alternative to segmentation for surgical tool localization. The released dataset, benchmark models, and annotation software aim to encourage adoption of this annotation style in the computer-assisted intervention community.
Abstract: Localisation of surgical tools constitutes a foundational building block for computer-assisted interventional technologies. Works in this field typically focus on training deep learning models to perform segmentation tasks. Performance of learning-based approaches is limited by the availability of diverse annotated data. We argue that skeletal pose annotations are a more efficient annotation approach for surgical tools, striking a balance between richness of semantic information and ease of annotation, thus allowing for accelerated growth of available annotated data. To encourage adoption of this annotation style, we present ROBUST-MIPS, a combined tool pose and tool instance segmentation dataset derived from the existing ROBUST-MIS dataset. Our enriched dataset facilitates the joint study of these two annotation styles and allows head-to-head comparison on various downstream tasks. To demonstrate the adequacy of pose annotations for surgical tool localisation, we set up a simple benchmark using popular pose estimation methods and observe high-quality results. To ease adoption, together with the dataset, we release our benchmark models and custom tool pose annotation software.
[74] Safe-Control: A Safety Patch for Mitigating Unsafe Content in Text-to-Image Generation Models
Xiangtao Meng, Yingkai Dong, Ning Yu, Li Wang, Zheng Li, Shanqing Guo
Main category: cs.CV
TL;DR: Safe-Control is a plug-and-play safety patch that reduces unsafe content generation in Text-to-Image models to 7% while maintaining benign image quality, outperforming existing safety mechanisms.
Details
Motivation: Existing T2I model safety mechanisms are susceptible to evasion under distribution shifts or require extensive model-specific adjustments, creating a need for more robust and adaptable safety solutions.Method: Uses data-driven strategies and safety-aware conditions to inject safety control signals into locked T2I models as a patch-like update. Developers can create various safety patches that can be merged into a unified patch.
Result: Reduces unsafe content generation probability to 7% compared to ~20% for baseline methods, works across six diverse T2I models with similar architectures, and maintains benign image quality and text alignment.
Conclusion: Safe-Control provides an effective, adaptable plug-and-play safety solution that significantly outperforms existing safety mechanisms while being compatible with multiple T2I models without extensive modifications.
Abstract: Despite the advancements in Text-to-Image (T2I) generation models, their potential for misuse or even abuse raises serious safety concerns. Model developers have made tremendous efforts to introduce safety mechanisms that can address these concerns in T2I models. However, the existing safety mechanisms, whether external or internal, either remain susceptible to evasion under distribution shifts or require extensive model-specific adjustments. To address these limitations, we introduce Safe-Control, an innovative plug-and-play safety patch designed to mitigate unsafe content generation in T2I models. Using data-driven strategies and safety-aware conditions, Safe-Control injects safety control signals into the locked T2I model, acting as an update in a patch-like manner. Model developers can also construct various safety patches to meet the evolving safety requirements, which can be flexibly merged into a single, unified patch. Its plug-and-play design further ensures adaptability, making it compatible with other T2I models of similar denoising architecture. We conduct extensive evaluations on six diverse and public T2I models. Empirical results highlight that Safe-Control is effective in reducing unsafe content generation across six diverse T2I models with similar generative architectures, yet it successfully maintains the quality and text alignment of benign images. Compared to seven state-of-the-art safety mechanisms, including both external and internal defenses, Safe-Control significantly outperforms all baselines in reducing unsafe content generation. For example, it reduces the probability of unsafe content generation to 7%, compared to approximately 20% for most baseline methods, under both unsafe prompts and the latest adversarial attacks.
[75] GENNAV: Polygon Mask Generation for Generalized Referring Navigable Regions
Kei Katsumata, Yui Iioka, Naoki Hosomi, Teruhisa Misu, Kentaro Yamada, Komei Sugiura
Main category: cs.CV
TL;DR: GENNAV is a novel method for identifying target regions from natural language instructions and front camera images, specifically addressing challenges with stuff-type targets and multiple/absent targets through joint existence prediction and segmentation.
Details
Motivation: Existing methods underperform in handling stuff-type target regions with ambiguous boundaries, as well as cases with absent or multiple targets, creating limitations for real-world navigation applications.Method: Proposed GENNAV system that predicts target existence and generates segmentation masks for multiple stuff-type target regions, evaluated on a novel benchmark GRiN-Drive with three sample types (no-target, single-target, multi-target).
Result: GENNAV achieved superior performance over baseline methods on standard evaluation metrics and demonstrated robust zero-shot transfer performance in real-world experiments across four automobiles in five geographically distinct urban areas.
Conclusion: GENNAV effectively addresses the challenges of stuff-type target region identification and handles multiple/absent target scenarios, showing strong performance and real-world robustness for navigation applications.
Abstract: We focus on the task of identifying the location of target regions from a natural language instruction and a front camera image captured by a mobility. This task is challenging because it requires both existence prediction and segmentation, particularly for stuff-type target regions with ambiguous boundaries. Existing methods often underperform in handling stuff-type target regions, in addition to absent or multiple targets. To overcome these limitations, we propose GENNAV, which predicts target existence and generates segmentation masks for multiple stuff-type target regions. To evaluate GENNAV, we constructed a novel benchmark called GRiN-Drive, which includes three distinct types of samples: no-target, single-target, and multi-target. GENNAV achieved superior performance over baseline methods on standard evaluation metrics. Furthermore, we conducted real-world experiments with four automobiles operated in five geographically distinct urban areas to validate its zero-shot transfer performance. In these experiments, GENNAV outperformed baseline methods and demonstrated its robustness across diverse real-world environments. The project page is available at https://gennav.vercel.app/.
[76] R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
Jie Jiang, Qi Yang, Bolin Ni, Shiming Xiang, Han Hu, Houwen Peng
Main category: cs.CV
TL;DR: R-4B is an auto-thinking MLLM that adaptively decides when to use step-by-step thinking based on problem complexity, achieving SOTA performance with lower computational cost.
Details
Motivation: Current MLLMs use step-by-step thinking for all problems, which is inefficient for simple problems that don't require complex reasoning.Method: Uses bi-mode annealing to enable both thinking and non-thinking capabilities, and Bi-mode Policy Optimization (BPO) to improve decision-making on when to activate thinking. Trained in two phases: first on a curated dataset, then under an improved GRPO framework.
Result: Outperforms Qwen2.5-VL-7B in most tasks and achieves comparable performance to larger models (like 16B Kimi-VL) on reasoning benchmarks with lower computational cost across 25 benchmarks.
Conclusion: R-4B successfully addresses inefficiency in MLLMs by adaptively using thinking only when needed, achieving state-of-the-art performance with improved computational efficiency.
Abstract: Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization (BPO) to improve the model’s accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. Then it undergoes a second phase of training under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.
[77] HiddenObject: Modality-Agnostic Fusion for Multimodal Hidden Object Detection
Harris Song, Tuan-Anh Vu, Sanjith Menon, Sriram Narasimhan, M. Khalid Jawed
Main category: cs.CV
TL;DR: HiddenObject is a Mamba-based fusion framework that integrates RGB, thermal, and depth data to detect hidden/occluded objects, achieving state-of-the-art performance across challenging multimodal scenarios.
Details
Motivation: Traditional RGB-based detection methods fail under adverse conditions like occlusion, camouflage, and lighting variations, creating a need for more robust, modality-agnostic approaches.Method: A fusion framework that integrates RGB, thermal, and depth data using Mamba-based fusion mechanism to capture complementary signals across modalities and create unified representations.
Result: State-of-the-art or competitive performance across multiple benchmark datasets, demonstrating efficacy over unimodal and naive fusion strategies.
Conclusion: Mamba-based fusion architectures can significantly advance multimodal object detection, especially under visually degraded or complex conditions.
Abstract: Detecting hidden or partially concealed objects remains a fundamental challenge in multimodal environments, where factors like occlusion, camouflage, and lighting variations significantly hinder performance. Traditional RGB-based detection methods often fail under such adverse conditions, motivating the need for more robust, modality-agnostic approaches. In this work, we present HiddenObject, a fusion framework that integrates RGB, thermal, and depth data using a Mamba-based fusion mechanism. Our method captures complementary signals across modalities, enabling enhanced detection of obscured or camouflaged targets. Specifically, the proposed approach identifies modality-specific features and fuses them in a unified representation that generalizes well across challenging scenarios. We validate HiddenObject across multiple benchmark datasets, demonstrating state-of-the-art or competitive performance compared to existing methods. These results highlight the efficacy of our fusion design and expose key limitations in current unimodal and naive fusion strategies. More broadly, our findings suggest that Mamba-based fusion architectures can significantly advance the field of multimodal object detection, especially under visually degraded or complex conditions.
[78] RadGS-Reg: Registering Spine CT with Biplanar X-rays via Joint 3D Radiative Gaussians Reconstruction and 3D/3D Registration
Ao Shen, Xueming Fu, Junfeng Jiang, Qiang Zeng, Ye Tang, Zhengming Chen, Luming Nong, Feng Wang, S. Kevin Zhou
Main category: cs.CV
TL;DR: RadGS-Reg is a novel framework for CT/X-ray registration that combines 3D Radiative Gaussians reconstruction with 3D/3D registration, using Counterfactual Attention Learning and patient-specific pre-training to achieve state-of-the-art performance on vertebral registration.
Details
Motivation: Traditional CT/X-ray registration methods suffer from spatial information loss and domain gap issues. Current 3D reconstruction approaches from biplanar X-rays are limited by dense-view requirements and poor performance with noisy X-rays.Method: Joint 3D Radiative Gaussians (RadGS) reconstruction with 3D/3D registration. Uses Counterfactual Attention Learning mechanism to focus on vertebral regions in noisy X-rays. Implements patient-specific pre-training strategy to adapt from simulated to real data while learning vertebral shape priors.
Result: State-of-the-art performance on in-house datasets for both reconstruction and registration tasks, surpassing existing methods.
Conclusion: RadGS-Reg effectively addresses limitations of traditional methods by combining joint reconstruction and registration with attention mechanisms and adaptive pre-training, demonstrating superior performance for vertebral CT/X-ray registration.
Abstract: Computed Tomography (CT)/X-ray registration in image-guided navigation remains challenging because of its stringent requirements for high accuracy and real-time performance. Traditional “render and compare” methods, relying on iterative projection and comparison, suffer from spatial information loss and a domain gap. 3D reconstruction from biplanar X-rays supplements spatial and shape information for 2D/3D registration, but current methods are limited by dense-view requirements and struggle with noisy X-rays. To address these limitations, we introduce RadGS-Reg, a novel framework for vertebral-level CT/X-ray registration through joint 3D Radiative Gaussians (RadGS) reconstruction and 3D/3D registration. Specifically, our biplanar X-rays vertebral RadGS reconstruction module explores a learning-based RadGS reconstruction method with a Counterfactual Attention Learning (CAL) mechanism, focusing on vertebral regions in noisy X-rays. Additionally, a patient-specific pre-training strategy progressively adapts the RadGS-Reg from simulated to real data while simultaneously learning vertebral shape prior knowledge. Experiments on in-house datasets demonstrate the state-of-the-art performance for both tasks, surpassing existing methods. The code is available at: https://github.com/shenao1995/RadGS_Reg.
[79] SYNBUILD-3D: A large, multi-modal, and semantically rich synthetic dataset of 3D building models at Level of Detail 4
Kevin Mayer, Alex Vesel, Xinyi Zhao, Martin Fischer
Main category: cs.CV
TL;DR: SYNBUILD-3D is a large synthetic dataset of 6.2M 3D residential buildings with multi-modal representations including wireframe graphs, floor plan images, and roof point clouds to enable automated 3D building generation.
Details
Motivation: Address the lack of large-scale annotated 3D building datasets for applications in architecture, energy simulation, and navigation by leveraging synthetic data approaches.Method: Created a synthetic dataset with three modalities: semantically enriched 3D wireframe graphs (LoD 4), corresponding floor plan images, and LiDAR-like roof point clouds with semantic annotations derived from floor plans.
Result: Produced a diverse dataset of over 6.2 million synthetic 3D residential buildings with comprehensive semantic annotations including rooms, doors, and windows.
Conclusion: SYNBUILD-3D enables development of generative AI algorithms for automated 3D building model creation with semantic-geometric consistency, addressing the data scarcity problem in 3D building modeling.
Abstract: 3D building models are critical for applications in architecture, energy simulation, and navigation. Yet, generating accurate and semantically rich 3D buildings automatically remains a major challenge due to the lack of large-scale annotated datasets in the public domain. Inspired by the success of synthetic data in computer vision, we introduce SYNBUILD-3D, a large, diverse, and multi-modal dataset of over 6.2 million synthetic 3D residential buildings at Level of Detail (LoD) 4. In the dataset, each building is represented through three distinct modalities: a semantically enriched 3D wireframe graph at LoD 4 (Modality I), the corresponding floor plan images (Modality II), and a LiDAR-like roof point cloud (Modality III). The semantic annotations for each building wireframe are derived from the corresponding floor plan images and include information on rooms, doors, and windows. Through its tri-modal nature, future work can use SYNBUILD-3D to develop novel generative AI algorithms that automate the creation of 3D building models at LoD 4, subject to predefined floor plan layouts and roof geometries, while enforcing semantic-geometric consistency. Dataset and code samples are publicly available at https://github.com/kdmayer/SYNBUILD-3D.
[80] Radially Distorted Homographies, Revisited
Mårten Wadenbäck, Marcus Valtonen Örnhag, Johan Edstedt
Main category: cs.CV
TL;DR: A unified approach for estimating homographies with radial distortion in three different configurations, providing faster minimal solvers while maintaining accuracy.
Details
Motivation: Homography estimation is crucial in computer vision, but real images often have lens distortions that require simultaneous estimation of both homography and radial distortion for accurate results.Method: Proposes a novel unified mathematical framework to handle three radial distortion configurations: distortion in one image, identical distortion in both images, and independent distortion in both images. Develops new minimal solvers based on this approach.
Result: The proposed solvers are faster than existing state-of-the-art methods while maintaining similar accuracy. Tested successfully on established benchmarks including fisheye camera images.
Conclusion: The unified approach provides efficient and accurate solutions for homography estimation with radial distortion across all three common configurations, offering performance improvements over previous methods.
Abstract: Homographies are among the most prevalent transformations occurring in geometric computer vision and projective geometry, and homography estimation is consequently a crucial step in a wide assortment of computer vision tasks. When working with real images, which are often afflicted with geometric distortions caused by the camera lens, it may be necessary to determine both the homography and the lens distortion (particularly the radial component, called radial distortion) simultaneously to obtain anything resembling useful estimates. When considering a homography with radial distortion between two images, there are three conceptually distinct configurations for the radial distortion: (i) distortion in only one image, (ii) identical distortion in the two images, and (iii) independent distortion in the two images. While these cases have been addressed separately in the past, the present paper provides a novel and unified approach to solve all three cases. We demonstrate how the proposed approach can be used to construct new fast, stable, and accurate minimal solvers for radially distorted homographies. In all three cases, our proposed solvers are faster than the existing state-of-the-art solvers while maintaining similar accuracy. The solvers are tested on well-established benchmarks including images taken with fisheye cameras. The source code for our solvers will be made available in the event our paper is accepted for publication.
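For reference, a common parameterization in this literature is the one-parameter division model, with distortion applied per image; the paper's exact formulation may differ:

```latex
% Division model: undistort a measured point x_d, then apply the homography.
\[
  \mathbf{x}_u = \frac{\mathbf{x}_d}{1 + \lambda \lVert \mathbf{x}_d \rVert^2},
  \qquad
  \begin{pmatrix} \mathbf{x}'_u \\ 1 \end{pmatrix} \sim
  H \begin{pmatrix} \mathbf{x}_u \\ 1 \end{pmatrix}.
\]
```

Each correspondence then constrains both the eight degrees of freedom of H and the distortion parameter(s) λ (one λ in case (i), one shared λ in case (ii), two independent ones in case (iii)), which is what makes dedicated minimal solvers necessary.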
[81] GCAV: A Global Concept Activation Vector Framework for Cross-Layer Consistency in Interpretability
Zhenghao He, Sanchit Sinha, Guangzhi Xiong, Aidong Zhang
Main category: cs.CV
TL;DR: GCAV is a novel framework that unifies Concept Activation Vectors across different layers into a single consistent representation using contrastive learning and attention-based fusion, improving concept consistency and interpretability in neural networks.
Details
Motivation: Existing CAVs computed independently at different layers often show inconsistencies, making cross-layer comparisons unreliable and limiting the interpretability of deep neural networks.Method: Leverages contrastive learning to align concept representations across layers and employs attention-based fusion to construct globally integrated CAVs. Introduces TGCAV for testing with these unified representations.
Result: Significantly reduces variance in TCAV scores while preserving concept relevance, enhances concept localization, and improves robustness against adversarial perturbations across multiple deep neural networks.
Conclusion: GCAV provides a more comprehensive and interpretable understanding of how deep learning models encode human-defined concepts by integrating cross-layer information into a coherent framework.
Abstract: Concept Activation Vectors (CAVs) provide a powerful approach for interpreting deep neural networks by quantifying their sensitivity to human-defined concepts. However, when computed independently at different layers, CAVs often exhibit inconsistencies, making cross-layer comparisons unreliable. To address this issue, we propose the Global Concept Activation Vector (GCAV), a novel framework that unifies CAVs into a single, semantically consistent representation. Our method leverages contrastive learning to align concept representations across layers and employs an attention-based fusion mechanism to construct a globally integrated CAV. By doing so, our method significantly reduces the variance in TCAV scores while preserving concept relevance, ensuring more stable and reliable concept attributions. To evaluate the effectiveness of GCAV, we introduce Testing with Global Concept Activation Vectors (TGCAV) as a method to apply TCAV to GCAV-based representations. We conduct extensive experiments on multiple deep neural networks, demonstrating that our method effectively mitigates concept inconsistency across layers, enhances concept localization, and improves robustness against adversarial perturbations. By integrating cross-layer information into a coherent framework, our method offers a more comprehensive and interpretable understanding of how deep learning models encode human-defined concepts. Code and models are available at https://github.com/Zhenghao-He/GCAV.
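As background, a per-layer CAV is conventionally obtained as the normal of a linear classifier separating concept activations from random activations, and the TCAV score counts positive directional derivatives along it. The sketch below shows that standard construction; GCAV's contrastive alignment and attention-based fusion across layers are not reproduced here.

```python
# Standard per-layer CAV and TCAV score; toy random data for shapes only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def compute_cav(concept_acts, random_acts):
    """Activations: (N, D) arrays from one layer. The CAV is the unit
    normal of a linear boundary separating concept from random examples."""
    X = np.vstack([concept_acts, random_acts])
    y = np.hstack([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    v = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]
    return v / np.linalg.norm(v)

def tcav_score(grads, cav):
    """Fraction of inputs whose class-logit gradient has a positive
    component along the CAV (directional derivative > 0)."""
    return float(np.mean(grads @ cav > 0))

layer_cavs = {l: compute_cav(np.random.randn(50, 128), np.random.randn(50, 128))
              for l in ["layer3", "layer4"]}
print(tcav_score(np.random.randn(30, 128), layer_cavs["layer3"]))
```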
[82] Generalizable Object Re-Identification via Visual In-Context Prompting
Zhizhong Huang, Xiaoming Liu
Main category: cs.CV
TL;DR: VICP is a novel framework that enables object re-identification models to generalize to unseen categories using in-context examples as prompts, without parameter adaptation, by synergizing LLMs and vision foundation models.
Details
Motivation: Current ReID methods are domain-specific and lack generalization, requiring costly labeled data for new categories. Self-supervised learning struggles to capture identity-sensitive features critical for ReID.Method: VICP uses LLMs to infer semantic identity rules from few-shot positive/negative pairs through task-specific prompting, then guides a vision foundation model (e.g., DINO) to extract ID-discriminative features via dynamic visual prompts, aligning LLM-derived concepts with the VFM’s pre-trained prior.
Result: Experiments on ShopID10K and diverse ReID benchmarks show VICP outperforms baselines by a clear margin on unseen categories.
Conclusion: VICP enables generalization to novel categories without dataset-specific retraining, eliminating the need for costly labeled data and parameter adaptation for new object categories.
Abstract: Current object re-identification (ReID) methods train domain-specific models (e.g., for persons or vehicles), which lack generalization and demand costly labeled data for new categories. While self-supervised learning reduces annotation needs by learning instance-wise invariance, it struggles to capture \textit{identity-sensitive} features critical for ReID. This paper proposes Visual In-Context Prompting~(VICP), a novel framework where models trained on seen categories can directly generalize to unseen novel categories using only \textit{in-context examples} as prompts, without requiring parameter adaptation. VICP synergizes LLMs and vision foundation models~(VFM): LLMs infer semantic identity rules from few-shot positive/negative pairs through task-specific prompting, which then guides a VFM (\eg, DINO) to extract ID-discriminative features via \textit{dynamic visual prompts}. By aligning LLM-derived semantic concepts with the VFM’s pre-trained prior, VICP enables generalization to novel categories, eliminating the need for dataset-specific retraining. To support evaluation, we introduce ShopID10K, a dataset of 10K object instances from e-commerce platforms, featuring multi-view images and cross-domain testing. Experiments on ShopID10K and diverse ReID benchmarks demonstrate that VICP outperforms baselines by a clear margin on unseen categories. Code is available at https://github.com/Hzzone/VICP.
[83] Lightweight MRI-Based Automated Segmentation of Pancreatic Cancer with Auto3DSeg
Keshav Jha, William Sharp, Dominic LaBella
Main category: cs.CV
TL;DR: SegResNet models were evaluated for pancreatic tumor segmentation on two MRI datasets, showing moderate performance (DSC 0.56 for Task 1, 0.33 for Task 2) and highlighting challenges with small datasets and MRI sequence variability.
Details
Motivation: Accurate pancreatic tumor delineation is critical for diagnosis and treatment planning, but automated segmentation remains challenging due to anatomical variability and limited dataset availability.Method: Used SegResNet models within Auto3DSeg architecture with 5-fold cross-validation and STAPLE ensembling, focusing on anatomically relevant ROI. Trained on two datasets: 91 T1-weighted arterial contrast-enhanced MRI cases (Task 1) and 50 T2-weighted MR-Linac cases (Task 2).
Result: Task 1: DSC 0.56, 5 mm DSC 0.73, HD95 41.1 mm, MASD 26.0 mm, RMSE 5164 mm. Task 2: DSC 0.33, 5 mm DSC 0.50, HD95 20.1 mm, MASD 7.2 mm, RMSE 17,203 mm. Performance decreased in Task 2.
Conclusion: Results demonstrate challenges with small MRI datasets and sequence variability, but show potential for automated delineation. Emphasizes need for larger, standardized MRI datasets to improve model robustness and clinical utility.
Abstract: Accurate delineation of pancreatic tumors is critical for diagnosis, treatment planning, and outcome assessment, yet automated segmentation remains challenging due to anatomical variability and limited dataset availability. In this study, SegResNet models, as part of the Auto3DSeg architecture, were trained and evaluated on two MRI-based pancreatic tumor segmentation tasks as part of the 2025 PANTHER Challenge. Algorithm methodology included 5-fold cross-validation with STAPLE ensembling after focusing on an anatomically relevant region-of-interest. The Pancreatic Tumor Segmentation on Diagnostic MRI task 1 training set included 91 T1-weighted arterial contrast-enhanced MRI with expert annotated pancreas and tumor labels. The Pancreatic Tumor Segmentation on MR-Linac task 2 training set used 50 T2-weighted MR-Linac cases with expert annotated pancreas and tumor labels. Algorithm-automated segmentation performance of pancreatic tumor was assessed using Dice Similarity Coefficient (DSC), 5 mm DSC, 95th percentile Hausdorff Distance (HD95), Mean Average Surface Distance (MASD), and Root Mean Square Error (RMSE). For Task 1, the algorithm achieved a DSC of 0.56, 5 mm DSC of 0.73, HD95 of 41.1 mm, MASD of 26.0 mm, and RMSE of 5164 mm. For Task 2, performance decreased, with a DSC of 0.33, 5 mm DSC of 0.50, HD95 of 20.1 mm, MASD of 7.2 mm, and RMSE of 17,203 mm. These findings illustrate the challenges of MRI-based pancreatic tumor segmentation with small datasets, highlighting variability introduced by different MRI sequences. Despite modest performance, the results demonstrate potential for automated delineation and emphasize the need for larger, standardized MRI datasets to improve model robustness and clinical utility.
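For concreteness, the primary metric, the Dice Similarity Coefficient, has a standard definition for binary masks, sketched below.

```python
# Standard Dice Similarity Coefficient for binary segmentation masks.
import numpy as np

def dice(pred, gt, eps=1e-8):
    """pred, gt: boolean arrays of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

p = np.zeros((64, 64), bool); p[10:30, 10:30] = True
g = np.zeros((64, 64), bool); g[15:35, 12:32] = True
print(round(dice(p, g), 3))  # overlap of two offset squares
```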
[84] Reverse Imaging for Wide-spectrum Generalization of Cardiac MRI Segmentation
Yidong Zhao, Peter Kellman, Hui Xue, Tongyun Yang, Yi Zhang, Yuchi Han, Orlando Simonetti, Qian Tao
Main category: cs.CV
TL;DR: Reverse Imaging is a physics-driven method that infers underlying spin properties from cardiac MRI images to enable domain adaptation and data augmentation, solving generalization problems across different imaging sequences.
Details
Motivation: Pretrained segmentation models struggle to generalize across different cardiac MRI sequences due to variations in image contrast caused by different imaging protocols, despite being governed by the same fundamental spin properties.Method: The method reversely infers underlying spin properties (proton density, T1, T2 values) from observed MRI images by solving ill-posed nonlinear inverse problems regularized by a spin prior distribution. A generative diffusion model learns this spin prior from multiparametric mSASHA dataset.
Result: The method enables meaningful spin-property estimates that serve as interpretable latent variables for flexible image synthesis of arbitrary novel sequences, achieving highly accurate segmentation across vastly different image contrasts and protocols.
Conclusion: Reverse Imaging fundamentally solves the generalization problem in cardiac MRI segmentation by providing wide-spectrum generalization across different imaging sequences through physics-driven domain adaptation and data augmentation.
Abstract: Pretrained segmentation models for cardiac magnetic resonance imaging (MRI) struggle to generalize across different imaging sequences due to significant variations in image contrast. These variations arise from changes in imaging protocols, yet the same fundamental spin properties, including proton density, T1, and T2 values, govern all acquired images. With this core principle, we introduce Reverse Imaging, a novel physics-driven method for cardiac MRI data augmentation and domain adaptation to fundamentally solve the generalization problem. Our method reversely infers the underlying spin properties from observed cardiac MRI images, by solving ill-posed nonlinear inverse problems regularized by the prior distribution of spin properties. We acquire this “spin prior” by learning a generative diffusion model from the multiparametric SAturation-recovery single-SHot acquisition sequence (mSASHA) dataset, which offers joint cardiac T1 and T2 maps. Our method enables approximate but meaningful spin-property estimates from MR images, which provide an interpretable “latent variable” that leads to highly flexible image synthesis of arbitrary novel sequences. We show that Reverse Imaging enables highly accurate segmentation across vastly different image contrasts and imaging protocols, realizing wide-spectrum generalization of cardiac MRI segmentation.
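As orientation for the physics, a textbook relaxation model relates the observed signal to the spin properties the method inverts for; the paper's sequence-specific forward operators are necessarily more detailed than this schematic form:

```latex
% Generic saturation-recovery-style signal model (schematic, not the
% paper's exact forward operator).
\[
  S(\rho, T_1, T_2) \;\propto\; \rho \left(1 - e^{-\mathrm{TR}/T_1}\right) e^{-\mathrm{TE}/T_2},
\]
```

with proton density ρ, repetition time TR, and echo time TE. Recovering (ρ, T1, T2) from one or two observed contrasts is ill-posed, which is why the learned spin prior is needed as regularization.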
[85] PHD: Personalized 3D Human Body Fitting with Point Diffusion
Hsuan-I Ho, Chen Guo, Po-Chen Wu, Ivan Shugurov, Chengcheng Tang, Abhay Mittal, Sizhe An, Manuel Kaufmann, Linguang Zhang
Main category: cs.CV
TL;DR: PHD introduces personalized 3D human mesh recovery using user-specific shape information to improve pose estimation accuracy from videos, addressing limitations of traditional user-agnostic methods.
Details
Motivation: Traditional HMR methods are user-agnostic and optimized for generalization, but they compromise 3D accuracy by failing to jointly account for person-specific body shapes and 3D pose plausibility when refining poses using 2D image constraints.Method: The pipeline decouples the process by first calibrating the user’s body shape, then employing a personalized pose fitting process conditioned on that shape using a body shape-conditioned 3D pose prior implemented as a Point Diffusion Transformer with Point Distillation Sampling loss.
Result: The approach improves both pelvis-aligned pose accuracy and absolute pose accuracy, mitigates errors from over-reliance on 2D constraints, is highly data-efficient (requiring only synthetic data), and serves as a versatile plug-and-play module for existing 3D pose estimators.
Conclusion: PHD provides an effective solution for personalized 3D human mesh recovery that enhances pose estimation accuracy by leveraging user-specific shape information and can be seamlessly integrated with existing systems.
Abstract: We introduce PHD, a novel approach for personalized 3D human mesh recovery (HMR) and body fitting that leverages user-specific shape information to improve pose estimation accuracy from videos. Traditional HMR methods are designed to be user-agnostic and optimized for generalization. While these methods often refine poses using constraints derived from the 2D image to improve alignment, this process compromises 3D accuracy by failing to jointly account for person-specific body shapes and the plausibility of 3D poses. In contrast, our pipeline decouples this process by first calibrating the user’s body shape and then employing a personalized pose fitting process conditioned on that shape. To achieve this, we develop a body shape-conditioned 3D pose prior, implemented as a Point Diffusion Transformer, which iteratively guides the pose fitting via a Point Distillation Sampling loss. This learned 3D pose prior effectively mitigates errors arising from an over-reliance on 2D constraints. Consequently, our approach improves not only pelvis-aligned pose accuracy but also absolute pose accuracy – an important metric often overlooked by prior work. Furthermore, our method is highly data-efficient, requiring only synthetic data for training, and serves as a versatile plug-and-play module that can be seamlessly integrated with existing 3D pose estimators to enhance their performance. Project page: https://phd-pose.github.io/
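For context, the Point Distillation Sampling loss appears to belong to the family of score-distillation objectives; the generic gradient of that family (DreamFusion-style SDS, shown only for orientation, not as the paper's exact loss) is:

```latex
\[
  \nabla_{\theta}\mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\bigl(\hat{\epsilon}_{\phi}(x_t;\, c,\, t) - \epsilon\bigr)\,\frac{\partial x}{\partial \theta} \right],
\]
```

where θ would be the pose parameters being fitted, c the conditioning (here, the calibrated body shape), and ε̂_φ the diffusion model's noise prediction; how PDS instantiates this over point representations is specific to the paper.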
[86] Efficient Diffusion-Based 3D Human Pose Estimation with Hierarchical Temporal Pruning
Yuquan Bi, Hongsong Wang, Xinli Shi, Zhipeng Gui, Jie Gui, Yuan Yan Tang
Main category: cs.CV
TL;DR: Efficient diffusion-based 3D human pose estimation with hierarchical temporal pruning (HTP) that reduces computational costs while maintaining performance.
Details
Motivation: Diffusion models generate high-fidelity 3D human poses but suffer from high computational costs due to their iterative nature and multi-hypothesis requirements.Method: Hierarchical Temporal Pruning (HTP) strategy with three components: Temporal Correlation-Enhanced Pruning (TCEP) for frame-level pruning, Sparse-Focused Temporal MHSA for attention computation reduction, and Mask-Guided Pose Token Pruner (MGPTP) for semantic-level pruning.
Result: Reduces training MACs by 38.5%, inference MACs by 56.8%, improves inference speed by 81.1% on average compared to prior diffusion methods, while achieving state-of-the-art performance on Human3.6M and MPI-INF-3DHP datasets.
Conclusion: The proposed HTP framework effectively reduces computational overhead in diffusion-based 3D human pose estimation while maintaining high performance through hierarchical pruning of redundant pose tokens.
Abstract: Diffusion models have demonstrated strong capabilities in generating high-fidelity 3D human poses, yet their iterative nature and multi-hypothesis requirements incur substantial computational cost. In this paper, we propose an Efficient Diffusion-Based 3D Human Pose Estimation framework with a Hierarchical Temporal Pruning (HTP) strategy, which dynamically prunes redundant pose tokens across both frame and semantic levels while preserving critical motion dynamics. HTP operates in a staged, top-down manner: (1) Temporal Correlation-Enhanced Pruning (TCEP) identifies essential frames by analyzing inter-frame motion correlations through adaptive temporal graph construction; (2) Sparse-Focused Temporal MHSA (SFT MHSA) leverages the resulting frame-level sparsity to reduce attention computation, focusing on motion-relevant tokens; and (3) Mask-Guided Pose Token Pruner (MGPTP) performs fine-grained semantic pruning via clustering, retaining only the most informative pose tokens. Experiments on Human3.6M and MPI-INF-3DHP show that HTP reduces training MACs by 38.5%, inference MACs by 56.8%, and improves inference speed by an average of 81.1% compared to prior diffusion-based methods, while achieving state-of-the-art performance.
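The frame-level stage can be caricatured as keeping the frames that carry the most motion. The sketch below is a crude stand-in for TCEP, which builds an adaptive temporal graph rather than thresholding raw frame differences; the motion score and `keep_ratio` are assumptions.

```python
# Illustrative top-k frame pruning by inter-frame motion magnitude.
import torch

def prune_frames(pose_tokens, keep_ratio=0.5):
    """pose_tokens: (T, J, C) per-frame joint tokens. Keeps frames whose
    difference to the previous frame is largest, preserving temporal order."""
    diffs = pose_tokens[1:] - pose_tokens[:-1]
    motion = diffs.pow(2).sum(dim=(-2, -1)).sqrt()   # (T-1,)
    motion = torch.cat([motion[:1], motion])         # pad the first frame
    k = max(1, int(keep_ratio * pose_tokens.shape[0]))
    idx = motion.topk(k).indices.sort().values
    return pose_tokens[idx], idx

tokens = torch.randn(243, 17, 64)   # e.g. 243 frames, 17 joints, 64-d tokens
kept, idx = prune_frames(tokens, keep_ratio=0.4)
print(kept.shape)
```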
[87] Print2Volume: Generating Synthetic OCT-based 3D Fingerprint Volume from 2D Fingerprint Image
Qingran Miao, Haixia Wang, Haohao Sun, Yilong Zhang
Main category: cs.CV
TL;DR: Print2Volume generates realistic synthetic 3D OCT fingerprints from 2D images to address data scarcity, achieving significant performance improvement in biometric recognition.
Details
Motivation: OCT provides high-resolution 3D fingerprint data but suffers from high acquisition costs and limited public datasets, hindering deep learning model development.Method: Three-stage framework: 2D style transfer, 3D structure expansion network, and OCT realism refiner using 3D GAN to generate authentic synthetic OCT fingerprints.
Result: Generated 420,000 synthetic samples and reduced Equal Error Rate from 15.62% to 2.50% on ZJUT-EIFD benchmark through synthetic pre-training and real-data fine-tuning.
Conclusion: Print2Volume effectively overcomes OCT data scarcity by generating high-quality synthetic data that significantly improves biometric recognition performance.
Abstract: Optical Coherence Tomography (OCT) enables the acquisition of high-resolution, three-dimensional fingerprint data, capturing rich subsurface structures for robust biometric recognition. However, the high cost and time-consuming nature of OCT data acquisition have led to a scarcity of large-scale public datasets, significantly hindering the development of advanced algorithms, particularly data-hungry deep learning models. To address this critical bottleneck, this paper introduces Print2Volume, a novel framework for generating realistic, synthetic OCT-based 3D fingerprints from 2D fingerprint images. Our framework operates in three sequential stages: (1) a 2D style transfer module that converts a binary fingerprint into a grayscale image mimicking the style of a Z-direction mean-projected OCT scan; (2) a 3D Structure Expansion Network that extrapolates the 2D image into a plausible 3D anatomical volume; and (3) an OCT Realism Refiner, based on a 3D GAN, that renders the structural volume with authentic textures, speckle noise, and other imaging characteristics. Using Print2Volume, we generated a large-scale synthetic dataset of 420,000 samples. Quantitative experiments demonstrate the high quality of our synthetic data and its significant impact on recognition performance. By pre-training a recognition model on our synthetic data and fine-tuning it on a small real-world dataset, we achieved a remarkable reduction in the Equal Error Rate (EER) from 15.62% to 2.50% on the ZJUT-EIFD benchmark, proving the effectiveness of our approach in overcoming data scarcity.
[88] SatDINO: A Deep Dive into Self-Supervised Pretraining for Remote Sensing
Jakub Straka, Ivan Gruber
Main category: cs.CV
TL;DR: SatDINO is a self-supervised learning model for satellite imagery that outperforms MAE-based methods and achieves competitive results across multiple benchmarks through novel enhancements like GSD encoding and adaptive view sampling.
Details
Motivation: Self-supervised learning is valuable for remote sensing due to abundant unlabeled data. The authors aim to adapt DINO, a contrastive self-supervised method, specifically for satellite imagery representation learning.Method: Developed SatDINO, a DINO-based model tailored for satellite imagery. Introduced novel enhancements including ground sample distance (GSD) encoding and adaptive view sampling. Conducted extensive experiments across multiple datasets and testing setups with rigorous ablation studies.
Result: SatDINO outperforms state-of-the-art MAE-based methods and achieves competitive results across multiple benchmarks. The ablation study validates the effectiveness of individual components.
Conclusion: SatDINO demonstrates superior performance for satellite imagery representation learning. The proposed enhancements (GSD encoding and adaptive view sampling) are effective and can be independently applied. The model and code are publicly available for further research.
Abstract: Self-supervised learning has emerged as a powerful tool for remote sensing, where large amounts of unlabeled data are available. In this work, we investigate the use of DINO, a contrastive self-supervised method, for pretraining on remote sensing imagery. We introduce SatDINO, a model tailored for representation learning in satellite imagery. Through extensive experiments on multiple datasets in multiple testing setups, we demonstrate that SatDINO outperforms other state-of-the-art methods based on much more common masked autoencoders (MAE) and achieves competitive results in multiple benchmarks. We also provide a rigorous ablation study evaluating SatDINO’s individual components. Finally, we propose a few novel enhancements, such as a new way to incorporate ground sample distance (GSD) encoding and adaptive view sampling. These enhancements can be used independently on our SatDINO model. Our code and trained models are available at: https://github.com/strakaj/SatDINO.
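One plausible form for the GSD encoding is a sinusoidal embedding of the meters-per-pixel scale, analogous to transformer positional encodings. This is an assumption: the summary states only that SatDINO encodes GSD, not how.

```python
# Hypothetical sinusoidal GSD embedding; the frequency range and
# normalization constant max_gsd are arbitrary choices for illustration.
import torch

def gsd_encoding(gsd_m_per_px, dim=64, max_gsd=30.0):
    """Map a scalar GSD (meters/pixel) to a dim-d sinusoidal embedding."""
    t = torch.as_tensor(gsd_m_per_px) / max_gsd        # normalize to ~[0, 1]
    freqs = torch.exp(torch.linspace(0, 8, dim // 2))  # geometric frequencies
    angles = t * freqs
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

print(gsd_encoding(0.3).shape)  # 0.3 m/px imagery -> torch.Size([64])
```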
[89] Standardized Multi-Layer Tissue Maps for Enhanced Artificial Intelligence Integration and Search in Large-Scale Whole Slide Image Archives
Gernot Fiala, Markus Plass, Robert Harb, Peter Regitnig, Kristijan Skok, Wael Al Zoughbi, Carmen Zerner, Paul Torke, Michaela Kargl, Heimo Müller, Tomas Brazdil, Matej Gallo, Jaroslav Kubín, Roman Stoklasa, Rudolf Nenutil, Norman Zerbe, Andreas Holzinger, Petr Holub
Main category: cs.CV
TL;DR: A framework for generating 2D index maps and tissue profiling for Whole Slide Images to enable automated content analysis and AI algorithm development.
Details
Motivation: Current WSI collections lack standardized metadata, making manual inspection impractical for large datasets with millions of images needed for AI training and validation.Method: Proposes a general framework with 2D index maps and domain-specific profiling using three-layer tissue mapping (source, tissue type, pathological alterations) with common syntax and semantics for interoperability.
Result: Enables automated content analysis of WSIs, provides fine-grained tissue information, and demonstrates applicability in WSI catalogs, machine learning, and graph-based representations.
Conclusion: The proposed standard addresses the metadata gap in WSI collections, facilitating efficient AI algorithm development and large-scale analysis across various medical domains.
Abstract: A Whole Slide Image (WSI) is a high-resolution digital image created by scanning an entire glass slide containing a biological specimen, such as tissue sections or cell samples, at multiple magnifications. These images can be viewed, analyzed, shared digitally, and are used today for Artificial Intelligence (AI) algorithm development. WSIs are used in a variety of fields, including pathology for diagnosing diseases and oncology for cancer research. They are also utilized in neurology, veterinary medicine, hematology, microbiology, dermatology, pharmacology, toxicology, immunology, and forensic science. When assembling cohorts for the training or validation of an AI algorithm, it is essential to know what is present on such a WSI. However, there is currently no standard for this metadata, so such selection has mainly been done through manual inspection, which is not suitable for large collections with several million objects. We propose a general framework to generate a 2D index map for WSI and a profiling mechanism for specific application domains. We demonstrate this approach in the field of clinical pathology, using common syntax and semantics to achieve interoperability between different catalogs. Our approach augments each WSI collection with a detailed tissue map that provides fine-grained information about the WSI content. The tissue map is organized into three layers: source, tissue type, and pathological alterations, with each layer assigning segments of the WSI to specific classes. We illustrate the advantages and applicability of the proposed standard through specific examples in WSI catalogs, Machine Learning (ML), and graph-based WSI representations.
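A hypothetical shape for one entry of the three-layer tissue map, written as a Python record purely for illustration; the actual schema, syntax, and controlled vocabulary of the proposed standard are defined in the paper.

```python
# Illustrative three-layer tissue-map entry for one WSI segment (field
# names and values are invented for this sketch).
tissue_map_entry = {
    "segment_id": "wsi_000123/seg_0042",
    "region": {"x": 10240, "y": 5120, "width": 512, "height": 512},
    "layers": {
        "source": "colon biopsy",                        # layer 1: specimen source
        "tissue_type": "epithelium",                     # layer 2: tissue class
        "pathological_alterations": ["adenocarcinoma"],  # layer 3: findings
    },
}
```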
[90] Unsupervised Incremental Learning Using Confidence-Based Pseudo-Labels
Lucas Rakotoarivony
Main category: cs.CV
TL;DR: ICPL is an unsupervised incremental learning method that uses confidence-based pseudo-labels to enable learning from unlabeled datasets, achieving competitive results with supervised methods and outperforming state-of-the-art class-iNCD methods by over 5% accuracy.
Details
Motivation: Real-world scenarios often involve novel classes emerging after training, requiring incremental learning. However, existing Class-Incremental Learning methods assume fully labeled incremental datasets, which is unrealistic in practice.
Method: Proposes ICPL, a method that replaces human annotations with confidence-based pseudo-labels, integrates them into various CIL methods with confidence-based selection, and evaluates on CIFAR100, ImageNet100, and fine-grained datasets.
Result: ICPL achieves competitive results compared to supervised methods and outperforms state-of-the-art class-iNCD methods by more than 5% in final accuracy. Its computational complexity was also measured to validate suitability for resource-constrained environments.
Conclusion: The proposed unsupervised incremental learning approach using confidence-based pseudo-labels effectively addresses the practical challenge of learning from unlabeled datasets while maintaining performance comparable to supervised methods.
Abstract: Deep learning models have achieved state-of-the-art performance in many computer vision tasks. However, in real-world scenarios, novel classes that were unseen during training often emerge, requiring models to acquire new knowledge incrementally. Class-Incremental Learning (CIL) methods enable a model to learn novel classes while retaining knowledge of previous classes. However, these methods make the strong assumption that the incremental dataset is fully labeled, which is unrealistic in practice. In this work, we propose an unsupervised Incremental Learning method using Confidence-based Pseudo-labels (ICPL), which replaces human annotations with pseudo-labels, enabling incremental learning from unlabeled datasets. We integrate these pseudo-labels into various CIL methods with confidence-based selection and evaluate performance degradation on CIFAR100 and ImageNet100. Then, we compare our approach to popular Class Incremental Novel Category Discovery (class-iNCD) methods addressing similar challenges. Additionally, we apply our method to fine-grained datasets to demonstrate its real-world practicality and measure its computational complexity to validate its suitability for resource-constrained environments. ICPL achieves competitive results compared to supervised methods and outperforms state-of-the-art class-iNCD methods by more than 5% in final accuracy.
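The core of ICPL is easy to illustrate. Below is a minimal sketch of generic confidence-based pseudo-label selection of the kind the abstract describes; the threshold value and function names are illustrative assumptions, not taken from the paper.
```python
import torch
import torch.nn.functional as F

def select_pseudo_labels(logits: torch.Tensor, threshold: float = 0.9):
    """Keep only samples whose top softmax probability exceeds a threshold.

    Returns (indices, pseudo_labels) for the retained samples.
    The 0.9 threshold is illustrative, not the paper's value.
    """
    probs = F.softmax(logits, dim=1)          # (N, C) class probabilities
    confidence, labels = probs.max(dim=1)     # per-sample top-1 confidence
    keep = confidence >= threshold            # confidence-based selection
    return keep.nonzero(as_tuple=True)[0], labels[keep]

# Usage: logits from the current model on unlabeled incremental data
logits = torch.randn(8, 100)                  # stand-in for model outputs
idx, pseudo = select_pseudo_labels(logits)
```
The retained (index, pseudo-label) pairs would then be fed to any standard CIL method in place of human annotations.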
[91] MedShift: Implicit Conditional Transport for X-Ray Domain Adaptation
Francisco Caetano, Christiaan Viviers, Peter H. H. de With, Fons van der Sommen
Main category: cs.CV
TL;DR: MedShift is a unified generative model for cross-domain translation between synthetic and real X-ray images using Flow Matching and Schrodinger Bridges, enabling high-fidelity unpaired translation across multiple domains without domain-specific training.
Details
Motivation: Synthetic medical data has scalability benefits but suffers from domain gaps that limit real-world clinical applicability, particularly in X-ray imaging where differences in attenuation behavior, noise characteristics, and soft tissue representation create significant translation challenges.
Method: Proposes MedShift, a class-conditional generative model based on Flow Matching and Schrodinger Bridges that learns a shared domain-agnostic latent space. Introduces X-DigiSkull dataset with aligned synthetic and real skull X-rays under varying radiation doses for benchmarking.
Result: MedShift demonstrates strong performance despite smaller model size compared to diffusion-based approaches, offering flexibility at inference to prioritize either perceptual fidelity or structural consistency. Enables seamless translation between any pair of domains seen during training.
Conclusion: MedShift provides a scalable and generalizable solution for domain adaptation in medical imaging, with available code and dataset, making it suitable for bridging synthetic-real domain gaps in clinical settings.
Abstract: Synthetic medical data offers a scalable solution for training robust models, but significant domain gaps limit its generalizability to real-world clinical settings. This paper addresses the challenge of cross-domain translation between synthetic and real X-ray images of the head, focusing on bridging discrepancies in attenuation behavior, noise characteristics, and soft tissue representation. We propose MedShift, a unified class-conditional generative model based on Flow Matching and Schrodinger Bridges, which enables high-fidelity, unpaired image translation across multiple domains. Unlike prior approaches that require domain-specific training or rely on paired data, MedShift learns a shared domain-agnostic latent space and supports seamless translation between any pair of domains seen during training. We introduce X-DigiSkull, a new dataset comprising aligned synthetic and real skull X-rays under varying radiation doses, to benchmark domain translation models. Experimental results demonstrate that, despite its smaller model size compared to diffusion-based approaches, MedShift offers strong performance and remains flexible at inference time, as it can be tuned to prioritize either perceptual fidelity or structural consistency, making it a scalable and generalizable solution for domain adaptation in medical imaging. The code and dataset are available at https://caetas.github.io/medshift.html
[92] Trees as Gaussians: Large-Scale Individual Tree Mapping
Dimitri Gominski, Martin Brandt, Xiaoye Tong, Siyu Liu, Maurice Mugabowindekwe, Sizhuo Li, Florian Reiner, Andrew Davies, Rasmus Fensholt
Main category: cs.CV
TL;DR: Deep learning approach for global-scale individual tree detection using 3-m resolution satellite imagery and simulated tree crowns with Gaussian kernels, trained on billions of lidar-extracted points.
Details
Motivation: Large-scale monitoring of individual trees is limited by inadequate modeling, as existing global products only provide binary tree cover or canopy height without identifying individual trees.
Method: Deep learning model using 3-m resolution PlanetScope imagery with simulated tree crowns using scalable Gaussian kernels. Training based on billions of points automatically extracted from airborne lidar data.
Result: State-of-the-art performance with fractional cover R² = 0.81 against aerial lidar, balanced detection metrics across biomes, and improved detection through fine-tuning with manual labels.
Conclusion: Provides a scalable framework for global, high-resolution tree monitoring that can identify trees both inside and outside forests, adaptable to future satellite missions.
Abstract: Trees are key components of the terrestrial biosphere, playing vital roles in ecosystem function, climate regulation, and the bioeconomy. However, large-scale monitoring of individual trees remains limited by inadequate modelling. Available global products have focused on binary tree cover or canopy height, which do not explicitly identify trees at the individual level. In this study, we present a deep learning approach for detecting large individual trees in 3-m resolution PlanetScope imagery at a global scale. We simulate tree crowns with Gaussian kernels of scalable size, allowing the extraction of crown centers and the generation of binary tree cover maps. Training is based on billions of points automatically extracted from airborne lidar data, enabling the model to successfully identify trees both inside and outside forests. We compare against existing tree cover maps and airborne lidar, achieving state-of-the-art performance (fractional cover R$^2 = 0.81$ against aerial lidar), report balanced detection metrics across biomes, and demonstrate how detection can be further improved through fine-tuning with manual labels. Our method offers a scalable framework for global, high-resolution tree monitoring, and is adaptable to future satellite missions offering improved imagery.
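To make the crown-simulation idea concrete, here is a minimal sketch of rasterizing per-tree Gaussian kernels into a density map. The isotropic kernels, max-compositing of overlapping crowns, and the 0.5 threshold for the binary cover map are illustrative assumptions, not claimed to match the paper's implementation.
```python
import numpy as np

def render_tree_gaussians(h, w, centers, sigmas):
    """Rasterize one isotropic Gaussian per tree crown onto an (h, w) map.

    centers: list of (row, col) crown centers
    sigmas:  per-tree kernel scale, a proxy for crown size
    """
    yy, xx = np.mgrid[0:h, 0:w]
    canvas = np.zeros((h, w), dtype=np.float32)
    for (cy, cx), s in zip(centers, sigmas):
        g = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * s ** 2))
        canvas = np.maximum(canvas, g)   # overlapping crowns keep the max
    return canvas

density = render_tree_gaussians(64, 64, [(20, 20), (40, 45)], [3.0, 5.0])
cover = density > 0.5                    # threshold into a binary cover map
```
Local maxima of such a density map recover crown centers, while thresholding yields the tree cover product the abstract mentions.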
[93] Scale-GS: Efficient Scalable Gaussian Splatting via Redundancy-filtering Training on Streaming Content
Jiayu Yang, Weijian Su, Songqian Zhang, Yuqi Han, Jinli Suo, Qiang Zhang
Main category: cs.CV
TL;DR: Scale-GS is a scalable 3D Gaussian Splatting framework for dynamic scenes that uses hierarchical Gaussian organization, a hybrid deformation/spawning strategy, and adaptive masking to reduce training time while maintaining high visual quality.
Details
Motivation: 3D Gaussian Splatting has limitations for dynamic scenes due to large data volume from dense Gaussians and prolonged training time per frame, which hinders real-time immersive applications.
Method: Hierarchical organization of Gaussian spheres by scale within anchor-based structure; hybrid deformation and spawning strategy for motion modeling; bidirectional adaptive masking to remove static regions and prioritize informative viewpoints.
Result: Extensive experiments show superior visual quality while significantly reducing training time compared to state-of-the-art methods.
Conclusion: Scale-GS provides an efficient and scalable framework for dynamic scene rendering that addresses the computational overhead and training time limitations of traditional 3D Gaussian Splatting approaches.
Abstract: 3D Gaussian Splatting (3DGS) enables high-fidelity real-time rendering, a key requirement for immersive applications. However, the extension of 3DGS to dynamic scenes remains limited by the substantial data volume of dense Gaussians and the prolonged training time required for each frame. This paper presents Scale-GS, a scalable Gaussian Splatting framework designed for efficient training in streaming tasks. Specifically, Gaussian spheres are hierarchically organized by scale within an anchor-based structure. Coarser-level Gaussians represent the low-resolution structure of the scene, while finer-level Gaussians, responsible for detailed high-fidelity rendering, are selectively activated by the coarser-level Gaussians. To further reduce computational overhead, we introduce a hybrid deformation and spawning strategy that models inter-frame motion through Gaussian deformation and triggers Gaussian spawning to characterize wide-range motion. Additionally, a bidirectional adaptive masking mechanism enhances training efficiency by removing static regions and prioritizing informative viewpoints. Extensive experiments demonstrate that Scale-GS achieves superior visual quality while significantly reducing training time compared to state-of-the-art methods.
[94] One More Glance with Sharp Eyes: Rethinking Lightweight Captioning as a Practical Visual Specialist
Junha Song, Yongsik Jo, So Yeon Min, Quanting Xie, Taehwan Kim, Yonatan Bisk, Jaegul Choo
Main category: cs.CV
TL;DR: A lightweight 125M-parameter image captioning model achieves performance comparable to large multimodal generalists, but suffers from visual blindness. A novel Sharp-Eyed Refinement framework with DeepLens module addresses this by improving visual grounding through better attention mechanisms.
Details
Motivation: Deploying multimodal large language models on local devices is challenging due to high computational demands, requiring lightweight yet effective captioning solutions for applications like video instruction systems and exploration robots.
Method: Developed a specialist model based on 125M-parameter language model (56x smaller than LLaMA-7B), then created Sharp-Eyed Refinement framework with DeepLens module that extracts detailed visual representations by focusing on informative regions identified during initial glance.
Result: The lightweight model achieves performance comparable to large multimodal generalists. The Sharp-Eyed Refinement framework effectively addresses visual blindness issues and enhances caption quality through improved visual grounding.
Conclusion: Lightweight specialists can serve as strong visual specialists for on-device applications, and the proposed framework successfully mitigates visual blindness limitations through enhanced attention mechanisms and visual representations.
Abstract: Image captioning is fundamental for applications like video instruction systems and exploration robots, yet deploying such models on local devices is challenging due to the high computational demands of multimodal large language models (MLLMs). To address this, we first explore lightweight captioning by implementing a specialist based on a 125M-parameter language model, 56 times smaller than LLaMA-7B, and evaluating its performance on both single-sentence and detailed captioning tasks. Surprisingly, we find that our model can achieve performance comparable to large multimodal generalists, suggesting its potential to serve as a strong visual specialist for on-device applications. While promising, our model also exhibits a limitation: like other MLLMs, it suffers from visual blindness, occasionally resulting in semantic captioning errors. We carry out toy experiments and investigate the underlying causes, where we observe that the problems arise from ineffective attention mechanisms and limited visual representations. To alleviate them, we develop a novel captioning framework, Sharp-Eyed Refinement, which enhances caption quality through improved visual grounding. At its core, our DeepLens extracts detailed visual representations by concentrating on informative regions identified during the initial glance. Our experiments confirm both the advantages of our specialist over prior small captioning models and large generalists, and the effectiveness of our framework.
[95] Federated Fine-tuning of SAM-Med3D for MRI-based Dementia Classification
Kaouther Mouheb, Marawan Elbatel, Janne Papma, Geert Jan Biessels, Jurgen Claassen, Huub Middelkoop, Barbara van Munster, Wiesje van der Flier, Inez Ramakers, Stefan Klein, Esther E. Bron
Main category: cs.CV
TL;DR: Benchmark study evaluates foundation models in federated learning for dementia diagnosis, finding classification head architecture, freezing encoder, and advanced aggregation methods significantly impact performance.
Details
Motivation: Foundation models show strong potential for AI-based dementia diagnosis but their integration into federated learning systems remains underexplored, particularly for decentralized clinical settings.
Method: Systematic evaluation of key design choices including classification head architecture, fine-tuning strategy, and aggregation method using brain MRI data from large multi-cohort datasets in federated learning framework.
Result: Classification head architecture substantially influences performance, freezing the FM encoder achieves comparable results to full fine-tuning, and advanced aggregation methods outperform standard federated averaging.
Conclusion: Results provide practical insights for deploying foundation models in decentralized clinical settings and highlight important trade-offs that should guide future federated learning method development.
Abstract: While foundation models (FMs) offer strong potential for AI-based dementia diagnosis, their integration into federated learning (FL) systems remains underexplored. In this benchmarking study, we systematically evaluate the impact of key design choices (classification head architecture, fine-tuning strategy, and aggregation method) on the performance and efficiency of federated FM tuning using brain MRI data. Using a large multi-cohort dataset, we find that the architecture of the classification head substantially influences performance, freezing the FM encoder achieves comparable results to full fine-tuning, and advanced aggregation methods outperform standard federated averaging. Our results offer practical insights for deploying FMs in decentralized clinical settings and highlight trade-offs that should guide future method development.
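For reference, the standard federated averaging baseline that the study reports being outperformed by more advanced aggregation can be sketched as follows; variable names are illustrative, and the frozen-encoder comment reflects the study's finding rather than its code.
```python
import torch

def fedavg(client_states, client_sizes):
    """Weighted average of client model state_dicts (standard FedAvg).

    client_states: list of state_dicts returned by each client
    client_sizes:  number of local training samples per client
    """
    total = float(sum(client_sizes))
    return {
        k: sum(s[k].float() * (n / total)
               for s, n in zip(client_states, client_sizes))
        for k in client_states[0].keys()
    }

# Freezing the FM encoder (found comparable to full fine-tuning) would be:
#   for p in model.encoder.parameters(): p.requires_grad = False
# where `model.encoder` is a hypothetical attribute name.
```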
[96] Multi-Method Ensemble for Out-of-Distribution Detection
Lucas Rakotoarivony
Main category: cs.CV
TL;DR: Proposes Multi-Method Ensemble (MME) score that combines feature truncation and multiple scoring functions for superior out-of-distribution detection.
Details
Motivation: Existing OOD detection methods focus on single techniques or specific OOD datasets, overlooking the potential of combining multiple state-of-the-art solutions.
Method: Theoretically and empirically demonstrates that feature truncation and scoring functions can be effectively combined. Proposes MME score that unifies state-of-the-art OOD detectors into a single scoring function with aggregation of multiple scoring methods.
Result: MME significantly outperforms recent state-of-the-art methods across all benchmarks. Achieves average FPR95 of 27.57% on ImageNet-1K using BiT model, improving by 6% over best baseline.
Conclusion: Combining multiple OOD detection techniques through ensemble methods provides more robust and effective detection across various OOD scenarios, demonstrating the value of integrating existing solutions rather than developing isolated approaches.
Abstract: Detecting out-of-distribution (OOD) samples is essential for neural networks operating in open-world settings, particularly in safety-critical applications. Existing methods have improved OOD detection by leveraging two main techniques: feature truncation, which increases the separation between in-distribution (ID) and OOD samples, and scoring functions, which assign scores to distinguish between ID and OOD data. However, most approaches either focus on a single family of techniques or evaluate their effectiveness on a specific type of OOD dataset, overlooking the potential of combining multiple existing solutions. Motivated by this observation, we theoretically and empirically demonstrate that state-of-the-art feature truncation and scoring functions can be effectively combined. Moreover, we show that aggregating multiple scoring functions enhances robustness against various types of OOD samples. Based on these insights, we propose the Multi-Method Ensemble (MME) score, which unifies state-of-the-art OOD detectors into a single, more effective scoring function. Extensive experiments on both large-scale and small-scale benchmarks, covering near-OOD and far-OOD scenarios, show that MME significantly outperforms recent state-of-the-art methods across all benchmarks. Notably, using the BiT model, our method achieves an average FPR95 of 27.57% on the challenging ImageNet-1K benchmark, improving performance by 6% over the best existing baseline.
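A generic version of the score-aggregation idea can be sketched as below, combining two widely used OOD scores (maximum softmax probability and the energy score) after per-score normalization. The exact combination rule in MME may differ; treat this as an assumption-laden illustration.
```python
import torch
import torch.nn.functional as F

def msp_score(logits):
    # Maximum softmax probability: higher for in-distribution samples
    return F.softmax(logits, dim=1).max(dim=1).values

def energy_score(logits):
    # Negative free energy: higher for in-distribution samples
    return torch.logsumexp(logits, dim=1)

def ensemble_ood_score(score_fns, logits):
    """Average of z-normalized detector scores (higher = more in-distribution).

    A generic ensemble in the spirit of MME, not its exact formula.
    """
    normed = []
    for fn in score_fns:
        s = fn(logits)
        normed.append((s - s.mean()) / (s.std() + 1e-8))  # common scale
    return torch.stack(normed).mean(dim=0)

scores = ensemble_ood_score([msp_score, energy_score], torch.randn(16, 1000))
```
Thresholding the ensembled score then separates ID from OOD inputs, with the threshold chosen on validation data (e.g., at 95% TPR for FPR95).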
[97] Adversarial Patch Attack for Ship Detection via Localized Augmentation
Chun Liu, Panpan Ding, Zheng Zheng, Hailong Wang, Bingqian Zhu, Tao Xu, Zhigang Han, Jiayao Wang
Main category: cs.CV
TL;DR: Localized augmentation method for adversarial patch attacks that focuses only on target regions to improve attack success rate and transferability in ship detection.
Details
Motivation: DNN-based ship detection is vulnerable to adversarial patch attacks, and existing data transformation methods can cause false detections by over-augmenting background/non-target areas.
Method: Proposes a localized augmentation approach that applies augmentation only to target regions, avoiding interference with non-target areas to help loss function focus on adversarial patch impact.
Result: Experiments on HRSC2016 dataset show the method effectively increases adversarial patch attack success rate and enhances transferability.
Conclusion: Localized augmentation reduces background interference and improves attack effectiveness by focusing augmentation only on target regions.
Abstract: Current ship detection techniques based on remote sensing imagery primarily rely on the object detection capabilities of deep neural networks (DNNs). However, DNNs are vulnerable to adversarial patch attacks, which can lead to misclassification by the detection model or complete evasion of the targets. Numerous studies have demonstrated that data transformation-based methods can improve the transferability of adversarial examples. However, excessive augmentation of image backgrounds or irrelevant regions may introduce unnecessary interference, resulting in false detections of the object detection model. These errors are not caused by the adversarial patches themselves but rather by the over-augmentation of background and non-target areas. This paper proposes a localized augmentation method that applies augmentation only to the target regions, avoiding any influence on non-target areas. By reducing background interference, this approach enables the loss function to focus more directly on the impact of the adversarial patch on the detection model, thereby improving the attack success rate. Experiments conducted on the HRSC2016 dataset demonstrate that the proposed method effectively increases the success rate of adversarial patch attacks and enhances their transferability.
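The key operation, restricting augmentation to target regions, reduces to simple masking, as in this minimal sketch (the box coordinates, noise augmentation, and array layout are illustrative assumptions):
```python
import numpy as np

def augment_targets_only(image, boxes, augment):
    """Apply an augmentation function only inside target bounding boxes.

    image: (H, W, C) uint8 array; boxes: list of (x1, y1, x2, y2).
    Background pixels are left untouched, as the paper proposes.
    """
    out = image.copy()
    for x1, y1, x2, y2 in boxes:
        out[y1:y2, x1:x2] = augment(out[y1:y2, x1:x2])
    return out

# Example augmentation: additive Gaussian noise restricted to the targets
noisy = lambda patch: np.clip(
    patch + np.random.normal(0, 10, patch.shape), 0, 255).astype(np.uint8)
augmented = augment_targets_only(
    np.zeros((256, 256, 3), dtype=np.uint8), [(40, 40, 120, 120)], noisy)
```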
[98] ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu
Main category: cs.CV
TL;DR: ELV-Halluc is the first benchmark for long-video hallucination in Video-MLLMs, focusing on Semantic Aggregation Hallucination (SAH) where models generate incorrect outputs despite correct frame-level semantics due to aggregation issues in complex long videos.
Details
Motivation: Current video hallucination benchmarks focus on short videos and oversimplify hallucination causes, missing SAH where models fail to properly aggregate frame-level semantics into event-level groups, especially problematic in long videos with increased semantic complexity.
Method: Introduces ELV-Halluc benchmark for systematic SAH investigation, analyzes SAH patterns, tests positional encoding strategies, and employs DPO (Direct Preference Optimization) with 8K adversarial data pairs to enhance semantic distinction capabilities.
Result: Confirmed SAH existence showing it increases with semantic complexity, models more prone to SAH on rapidly changing semantics. Positional encoding helps alleviate SAH, and DPO strategy achieved 27.7% SAH reduction on ELV-Halluc and improvements on Video-MME.
Conclusion: SAH is a critical hallucination type in long videos that requires specialized attention. The proposed ELV-Halluc benchmark and mitigation approaches (positional encoding + DPO) effectively address SAH, demonstrating significant improvements in video understanding reliability.
Abstract: Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination, producing content inconsistent with or unrelated to video inputs. Previous video hallucination benchmarks primarily focus on short videos. They attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes indeed account for most hallucinations in short videos, they still oversimplify the cause of hallucinations. Sometimes, models generate incorrect outputs but with correct frame-level semantics. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises during the process of aggregating frame-level semantics into event-level semantic groups. Given that SAH becomes particularly critical in long videos due to increased semantic complexity across multiple events, it is essential to separate and thoroughly investigate the causes of this type of hallucination. To address the above issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. Additionally, we find that models are more prone to SAH on rapidly changing semantics. Moreover, we discuss potential approaches to mitigate SAH. We demonstrate that the positional encoding strategy contributes to alleviating SAH, and further adopt a DPO strategy to enhance the model’s ability to distinguish semantics within and across events. To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in SAH ratio.
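The DPO objective used here on (chosen, rejected) caption pairs is the standard one; a minimal sketch, assuming summed sequence log-probabilities under the policy and a frozen reference model:
```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard Direct Preference Optimization loss.

    Each argument is a (B,) tensor of summed token log-probabilities for a
    caption under the policy (logp_*) or frozen reference (ref_logp_*).
    beta is the usual temperature hyperparameter (0.1 is illustrative).
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(beta * margin).mean()
```
In this setting the "rejected" captions would be the adversarial pairs with swapped cross-event semantics, pushing the model to keep frame-level facts attached to the right event.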
[99] Maybe you don’t need a U-Net: convolutional feature upsampling for materials micrograph segmentation
Ronan Docherty, Antonis Vamvakeros, Samuel J. Cooper
Main category: cs.CV
TL;DR: A CNN-based upsampler network is trained to enhance low-resolution foundation model features using input image reference, enabling efficient and high-quality segmentation of microscopy images with minimal labels.
Details
Motivation: Patch-based vision transformers struggle with fine features in micrographs and large image sizes common in materials/biological analysis, requiring a solution for better feature representation.
Method: Train a convolutional neural network to upsample low-resolution foundation model features by referencing the original input image, then apply this upsampler without additional training.
Result: Successfully segmented various microscopy images (plant cells, battery cathode, organic crystals) with improved separation of hard-to-segment phases like hairline cracks, achieving high-quality results with fewer labels.
Conclusion: The upsampled deep features enable efficient interactive segmentation that outperforms traditional convolutional networks in speed and label efficiency while maintaining quality.
Abstract: Feature foundation models - usually vision transformers - offer rich semantic descriptors of images, useful for downstream tasks such as (interactive) segmentation and object detection. For computational efficiency these descriptors are often patch-based, and so struggle to represent the fine features often present in micrographs; they also struggle with the large image sizes present in materials and biological image analysis. In this work, we train a convolutional neural network to upsample low-resolution (i.e., large patch size) foundation model features with reference to the input image. We apply this upsampler network (without any further training) to efficiently featurise and then segment a variety of microscopy images, including plant cells, a lithium-ion battery cathode and organic crystals. The richness of these upsampled features admits separation of hard-to-segment phases, like hairline cracks. We demonstrate that interactive segmentation with these deep features produces high-quality segmentations far faster and with far fewer labels than training or finetuning a more traditional convolutional network.
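A minimal image-guided upsampler in the spirit the abstract describes might look like the following; the channel widths, two-conv design, and concatenation scheme are assumptions for illustration, not the authors' architecture.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureUpsampler(nn.Module):
    """Upsample patch-level foundation-model features guided by the image.

    Bilinearly resizes the coarse feature grid to image resolution,
    concatenates the RGB image, and refines with a small conv stack.
    """
    def __init__(self, feat_dim=384, out_dim=128):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(feat_dim + 3, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, out_dim, 3, padding=1),
        )

    def forward(self, patch_feats, image):
        # patch_feats: (B, C, h, w) coarse grid; image: (B, 3, H, W)
        up = F.interpolate(patch_feats, size=image.shape[-2:],
                           mode="bilinear", align_corners=False)
        return self.refine(torch.cat([up, image], dim=1))

feats = FeatureUpsampler()(torch.randn(1, 384, 16, 16),
                           torch.randn(1, 3, 224, 224))
```
Once trained, the per-pixel features can feed any lightweight interactive classifier (e.g., a random forest over user scribbles) with no further network training.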
[100] HCCM: Hierarchical Cross-Granularity Contrastive and Matching Learning for Natural Language-Guided Drones
Hao Ruan, Jinliang Lin, Yingxin Lai, Zhiming Luo, Shaozi Li
Main category: cs.CV
TL;DR: HCCM framework improves drone vision-language understanding with hierarchical cross-granularity contrastive learning and matching, achieving state-of-the-art performance on drone datasets.
Details
Motivation: Address challenges in drone scenarios with wide field of view and complex compositional semantics, where mainstream VLMs lack fine-grained semantics and hierarchical methods have rigid constraints.
Method: Proposes HCCM with two components: Region-Global Image-Text Contrastive Learning (RG-ITC) for hierarchical semantics without precise partitioning, and Region-Global Image-Text Matching (RG-ITM) for local semantic consistency. Includes Momentum Contrast and Distillation (MCD) for robustness to incomplete text descriptions.
Result: Achieves 28.8% Recall@1 for image retrieval and 14.7% for text retrieval on GeoText-1652. Shows strong zero-shot generalization with 39.93% mean recall on unseen ERA dataset, outperforming fine-tuned baselines.
Conclusion: HCCM effectively addresses drone vision-language challenges through flexible hierarchical learning and robust alignment mechanisms, demonstrating superior performance and generalization capabilities.
Abstract: Natural Language-Guided Drones (NLGD) provide a novel paradigm for tasks such as target matching and navigation. However, the wide field of view and complex compositional semantics in drone scenarios pose challenges for vision-language understanding. Mainstream Vision-Language Models (VLMs) emphasize global alignment while lacking fine-grained semantics, and existing hierarchical methods depend on precise entity partitioning and strict containment, limiting effectiveness in dynamic environments. To address this, we propose the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework with two components: (1) Region-Global Image-Text Contrastive Learning (RG-ITC), which avoids precise scene partitioning and captures hierarchical local-to-global semantics by contrasting local visual regions with global text and vice versa; (2) Region-Global Image-Text Matching (RG-ITM), which dispenses with rigid constraints and instead evaluates local semantic consistency within global cross-modal representations, enhancing compositional reasoning. Moreover, drone text descriptions are often incomplete or ambiguous, destabilizing alignment. HCCM introduces a Momentum Contrast and Distillation (MCD) mechanism to improve robustness. Experiments on GeoText-1652 show HCCM achieves state-of-the-art Recall@1 of 28.8% (image retrieval) and 14.7% (text retrieval). On the unseen ERA dataset, HCCM demonstrates strong zero-shot generalization with 39.93% mean recall (mR), outperforming fine-tuned baselines.
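The region-to-global contrastive term can be sketched as a standard InfoNCE between pooled region features and global text embeddings; this is a generic stand-in in the spirit of RG-ITC, with the temperature, pooling, and pairing scheme left as assumptions.
```python
import torch
import torch.nn.functional as F

def region_global_contrastive(region_feats, text_feats, temperature=0.07):
    """InfoNCE between local visual regions and global text embeddings.

    region_feats: (B, D) pooled region features; text_feats: (B, D) global
    text features; matched pairs share the same batch index.
    """
    r = F.normalize(region_feats, dim=1)
    t = F.normalize(text_feats, dim=1)
    logits = r @ t.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(len(r), device=logits.device)
    # symmetric loss: region-to-text and text-to-region
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))
```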
[101] Complete Gaussian Splats from a Single Image with Denoising Diffusion Models
Ziwei Liao, Mohamed Sayed, Steven L. Waslander, Sara Vicente, Daniyar Turmukhambetov, Michael Firman
Main category: cs.CV
TL;DR: Single-image 3D scene completion using Gaussian splats with latent diffusion model to reconstruct occluded areas, trained self-supervised from 2D images only.
Details
Motivation: Gaussian splatting fails on occluded/unobserved areas from sparse views. Conventional regression methods produce blurry results and can't capture multiple plausible completions.
Method: Generative formulation using Variational AutoReconstructor to learn latent space from 2D images, then train diffusion model over this latent space to complete 3D Gaussian splats from single image.
Result: Generates faithful reconstructions and diverse samples that complete occluded surfaces for high-quality 360-degree renderings.
Conclusion: Proposed method successfully addresses 3D scene completion from single image using generative approach, overcoming limitations of regression-based methods.
Abstract: Gaussian splatting typically requires dense observations of the scene and can fail to reconstruct occluded and unobserved areas. We propose a latent diffusion model to reconstruct a complete 3D scene with Gaussian splats, including the occluded parts, from only a single image during inference. Completing the unobserved surfaces of a scene is challenging due to the ambiguity of the plausible surfaces. Conventional methods use a regression-based formulation to predict a single “mode” for occluded and out-of-frustum surfaces, leading to blurriness, implausibility, and failure to capture multiple possible explanations. Thus, they often address this problem partially, focusing either on objects isolated from the background, reconstructing only visible surfaces, or failing to extrapolate far from the input views. In contrast, we propose a generative formulation to learn a distribution of 3D representations of Gaussian splats conditioned on a single input image. To address the lack of ground-truth training data, we propose a Variational AutoReconstructor to learn a latent space only from 2D images in a self-supervised manner, over which a diffusion model is trained. Our method generates faithful reconstructions and diverse samples with the ability to complete the occluded surfaces for high-quality 360-degree renderings.
[102] EZ-Sort: Efficient Pairwise Comparison via Zero-Shot CLIP-Based Pre-Ordering and Human-in-the-Loop Sorting
Yujin Park, Haejun Chung, Ikbeom Jang
Main category: cs.CV
TL;DR: EZ-Sort reduces pairwise comparison annotation costs by 90.5% vs exhaustive methods and 19.8% vs prior work using CLIP-based pre-ordering and automated comparisons.
Details
Motivation: Pairwise comparisons are more reliable than absolute ratings but require O(n²) annotations, making them impractical for large datasets. Recent work reduced this to O(n log n), but further efficiency improvements are needed.
Method: Uses CLIP model for zero-shot hierarchical pre-ordering, initializes bucket-aware Elo scores, and runs uncertainty-guided human-in-the-loop MergeSort. Replaces obvious comparisons with automated ones.
Result: Reduced human annotation cost by 90.5% compared to exhaustive pairwise comparisons and 19.8% compared to prior work (n=100), while maintaining or improving inter-rater reliability across multiple datasets.
Conclusion: Combining CLIP-based priors with uncertainty-aware sampling provides an efficient and scalable solution for pairwise ranking tasks, significantly reducing annotation burden while preserving reliability.
Abstract: Pairwise comparison is often favored over absolute rating or ordinal classification in subjective or difficult annotation tasks due to its improved reliability. However, exhaustive comparisons require a massive number of annotations (O(n^2)). Recent work has greatly reduced the annotation burden (O(n log n)) by actively sampling pairwise comparisons using a sorting algorithm. We further improve annotation efficiency by (1) roughly pre-ordering items using the Contrastive Language-Image Pre-training (CLIP) model hierarchically without training, and (2) replacing easy, obvious human comparisons with automated comparisons. The proposed EZ-Sort first produces a CLIP-based zero-shot pre-ordering, then initializes bucket-aware Elo scores, and finally runs an uncertainty-guided human-in-the-loop MergeSort. Validation was conducted using various datasets: face-age estimation (FGNET), historical image chronology (DHCI), and retinal image quality assessment (EyePACS). It showed that EZ-Sort reduced human annotation cost by 90.5% compared to exhaustive pairwise comparisons and by 19.8% compared to prior work (when n = 100), while improving or maintaining inter-rater reliability. These results demonstrate that combining CLIP-based priors with uncertainty-aware sampling yields an efficient and scalable solution for pairwise ranking.
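The zero-shot pre-ordering step can be approximated with an off-the-shelf CLIP model, as in this sketch; the ordinal prompts and the expected-rank scoring are illustrative assumptions, and the real pipeline additionally builds buckets and Elo scores on top of this coarse ordering.
```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def zero_shot_preorder(images, prompts):
    """Roughly pre-order images by CLIP similarity to ordinal text prompts.

    prompts such as ["a photo of a child", ..., "a photo of an elderly
    person"] define an ordinal scale; each image is scored by its
    expected prompt index under the CLIP similarity distribution.
    """
    inputs = processor(text=prompts, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=1)  # (N, K)
    ranks = torch.arange(len(prompts), dtype=probs.dtype)
    expected = probs @ ranks        # soft ordinal score per image
    return expected.argsort()       # coarse pre-ordering for bucketing
```
Human effort is then spent only on comparisons within or near bucket boundaries, where the pre-ordering is uncertain.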
[103] ECHO: Ego-Centric modeling of Human-Object interactions
Ilya A. Petrov, Vladimir Guzov, Riccardo Marin, Emre Aksan, Xu Chen, Daniel Cremers, Thabo Beeler, Gerard Pons-Moll
Main category: cs.CV
TL;DR: ECHO is a unified framework that recovers human pose, object motion, and contact from minimal egocentric observations (head and wrists tracking) using a Diffusion Transformer with three-variate diffusion process.
Details
Motivation: Modeling human-object interactions from egocentric perspective is important for wearable devices but largely unexplored. The paper investigates how much interaction information can be recovered from only head and wrists tracking.
Method: ECHO employs a Diffusion Transformer architecture with unique three-variate diffusion process that jointly models human motion, object trajectory, and contact sequence. It operates in head-centric canonical space and uses conveyor-based inference for sequences of any length.
Result: ECHO outperforms existing methods that lack the same flexibility, achieving state-of-the-art performance in egocentric HOI reconstruction through extensive evaluation.
Conclusion: ECHO successfully demonstrates that comprehensive human-object interaction information can be recovered from minimal egocentric observations, setting a new benchmark for flexible HOI modeling from wearable device data.
Abstract: Modeling human-object interactions (HOI) from an egocentric perspective is a largely unexplored yet important problem due to the increasing adoption of wearable devices, such as smart glasses and watches. We investigate how much information about interaction can be recovered from only head and wrists tracking. Our answer is ECHO (Ego-Centric modeling of Human-Object interactions), which, for the first time, proposes a unified framework to recover three modalities: human pose, object motion, and contact from such minimal observation. ECHO employs a Diffusion Transformer architecture and a unique three-variate diffusion process, which jointly models human motion, object trajectory, and contact sequence, allowing for flexible input configurations. Our method operates in a head-centric canonical space, enhancing robustness to global orientation. We propose a conveyor-based inference, which progressively increases the diffusion timestamp with the frame position, allowing us to process sequences of any length. Through extensive evaluation, we demonstrate that ECHO outperforms existing methods that do not offer the same flexibility, setting a state-of-the-art in egocentric HOI reconstruction.
[104] How Well Do Vision–Language Models Understand Cities? A Comparative Study on Spatial Reasoning from Street-View Images
Juneyoung Ro, Namwoo Kim, Yoonjin Yoon
Main category: cs.CV
TL;DR: Study evaluates VLMs on urban spatial reasoning, shows fine-tuning with synthetic CoT-supervised dataset significantly improves performance on challenging urban scene questions.
Details
Motivation: To understand how well general vision-language models transfer spatial reasoning abilities to urban scenes, which remains underexplored despite being crucial for urban scene understanding.
Method: Comparative evaluation of BLIP-2, InstructBLIP, and LLaVA-1.5 using zero-shot and fine-tuned approaches with a synthetic VQA dataset constructed from street-view image predictions (segmentation, depth, object detection) paired with LLM-generated Chain-of-Thought answers.
Result: VLMs perform reasonably well in zero-shot settings, but fine-tuning with synthetic CoT-supervised dataset substantially boosts performance, especially for challenging question types like negation and counterfactuals.
Conclusion: Urban spatial reasoning presents a new challenge for VLMs, and synthetic dataset construction with CoT supervision is an effective approach for adapting general-purpose models to specialized urban domains.
Abstract: Effectively understanding urban scenes requires fine-grained spatial reasoning about objects, layouts, and depth cues. However, how well current vision-language models (VLMs), pretrained on general scenes, transfer these abilities to the urban domain remains underexplored. To address this gap, we conduct a comparative study of three off-the-shelf VLMs (BLIP-2, InstructBLIP, and LLaVA-1.5), evaluating both zero-shot performance and the effects of fine-tuning with a synthetic VQA dataset specific to urban scenes. We construct this dataset from segmentation, depth, and object detection predictions of street-view images, pairing each question with LLM-generated Chain-of-Thought (CoT) answers for step-by-step reasoning supervision. Results show that while VLMs perform reasonably well in zero-shot settings, fine-tuning with our synthetic CoT-supervised dataset substantially boosts performance, especially for challenging question types such as negation and counterfactuals. This study introduces urban spatial reasoning as a new challenge for VLMs and demonstrates synthetic dataset construction as a practical path for adapting general-purpose models to specialized domains.
[105] Temporal Flow Matching for Learning Spatio-Temporal Trajectories in 4D Longitudinal Medical Imaging
Nico Albert Disch, Yannick Kirchhoff, Robin Peretzke, Maximilian Rokuss, Saikat Roy, Constantin Ulrich, David Zimmerer, Klaus Maier-Hein
Main category: cs.CV
TL;DR: TFM is a unified generative method for 4D medical image prediction that learns temporal distributions, supports 3D volumes and irregular sampling, and outperforms existing spatio-temporal approaches.
Details
Motivation: Existing deep learning methods for medical imaging either focus on single temporal contexts or are limited to classification/regression tasks, lacking fine-grained spatial prediction capabilities for disease progression modeling and treatment planning.
Method: Temporal Flow Matching (TFM) - a generative trajectory method that learns underlying temporal distributions, can fall back to nearest image prediction, and supports 3D volumes, multiple prior scans, and irregular sampling.
Result: Extensive benchmarks on three public longitudinal datasets show TFM consistently surpasses spatio-temporal methods from natural imaging, establishing new state-of-the-art performance for 4D medical image prediction.
Conclusion: TFM provides a robust baseline and unified framework for temporal medical image analysis, addressing the fundamental gap in fine-grained spatial predictions for disease progression and anatomical development tracking.
Abstract: Understanding temporal dynamics in medical imaging is crucial for applications such as disease progression modeling, treatment planning and anatomical development tracking. However, most deep learning methods either consider only single temporal contexts, or focus on tasks like classification or regression, limiting their ability for fine-grained spatial predictions. While some approaches have been explored, they are often limited to single timepoints, specific diseases or have other technical restrictions. To address this fundamental gap, we introduce Temporal Flow Matching (TFM), a unified generative trajectory method that (i) aims to learn the underlying temporal distribution, (ii) by design can fall back to a nearest image predictor, i.e. predicting the last context image (LCI), as a special case, and (iii) supports $3D$ volumes, multiple prior scans, and irregular sampling. Extensive benchmarks on three public longitudinal datasets show that TFM consistently surpasses spatio-temporal methods from natural imaging, establishing a new state-of-the-art and robust baseline for $4D$ medical image prediction.
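At its core, flow matching between two timepoints reduces to regressing the velocity of an interpolation path. A simplified single-prior-scan sketch follows (linear path, MSE loss, and the model signature are assumptions; the paper's multi-scan and irregular-sampling machinery is omitted):
```python
import torch

def flow_matching_loss(model, x_prev, x_next):
    """One conditional flow-matching training step between two scans.

    x_prev, x_next: (B, C, D, H, W) volumes at consecutive timepoints.
    The model (hypothetical signature: model(x_t, t)) predicts the
    velocity of the straight-line path from x_prev to x_next.
    """
    t = torch.rand(x_prev.size(0), 1, 1, 1, 1, device=x_prev.device)
    x_t = (1 - t) * x_prev + t * x_next   # point on the linear path
    v_target = x_next - x_prev            # constant velocity along the path
    v_pred = model(x_t, t.flatten())
    return ((v_pred - v_target) ** 2).mean()
```
At inference, integrating the learned velocity field from the last context image forward generates the predicted future volume; predicting zero velocity recovers the last-context-image (LCI) special case the abstract mentions.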
[106] Integrating Pathology and CT Imaging for Personalized Recurrence Risk Prediction in Renal Cancer
Daniël Boeke, Cedrik Blommestijn, Rebecca N. Wray, Kalina Chupetlovska, Shangqi Gao, Zeyu Gao, Regina G. H. Beets-Tan, Mireia Crispin-Ortuzar, James O. Jones, Wilson Silva, Ines P. Machado
Main category: cs.CV
TL;DR: Multimodal deep learning framework combining CT scans and pathology slides improves recurrence prediction in kidney cancer, with pathology being more prognostic than imaging alone.
Details
Motivation: The current Leibovich score for ccRCC recurrence prediction has limited patient-level resolution and excludes imaging information, motivating better personalized risk assessment.
Method: Modular deep learning framework with pretrained encoders and Cox survival modeling, testing unimodal, late fusion, and intermediate fusion approaches using CT and whole-slide images.
Result: WSI-based models outperformed CT-only models, intermediate fusion further improved performance, best model approached adjusted Leibovich score, radiology added value through fusion.
Conclusion: Foundation model-based multimodal integration is feasible for personalized ccRCC risk prediction, with future work needed on better fusion strategies, larger datasets, and improved CT encoders.
Abstract: Recurrence risk estimation in clear cell renal cell carcinoma (ccRCC) is essential for guiding postoperative surveillance and treatment. The Leibovich score remains widely used for stratifying distant recurrence risk but offers limited patient-level resolution and excludes imaging information. This study evaluates multimodal recurrence prediction by integrating preoperative computed tomography (CT) and postoperative histopathology whole-slide images (WSIs). A modular deep learning framework with pretrained encoders and Cox-based survival modeling was tested across unimodal, late fusion, and intermediate fusion setups. In a real-world ccRCC cohort, WSI-based models consistently outperformed CT-only models, underscoring the prognostic strength of pathology. Intermediate fusion further improved performance, with the best model (TITAN-CONCH with ResNet-18) approaching the adjusted Leibovich score. Random tie-breaking narrowed the gap between the clinical baseline and learned models, suggesting discretization may overstate individualized performance. Using simple embedding concatenation, radiology added value primarily through fusion. These findings demonstrate the feasibility of foundation model-based multimodal integration for personalized ccRCC risk prediction. Future work should explore more expressive fusion strategies, larger multimodal datasets, and general-purpose CT encoders to better match pathology modeling capacity.
[107] Why Stop at Words? Unveiling the Bigger Picture through Line-Level OCR
Shashank Vempati, Nishit Anand, Gaurav Talebailkar, Arpan Garai, Chetan Arora
Main category: cs.CV
TL;DR: Transition from word-level to line-level OCR to bypass word segmentation errors and better utilize language models, achieving 5.4% accuracy improvement and 4x efficiency gain.
Details
Motivation: Word-level OCR has shifted the bottleneck from character segmentation to word segmentation, limiting context for language models and causing errors in word detection.
Method: Proposed line-level OCR that processes entire lines instead of individual words, bypassing word segmentation errors and providing larger sentence context for better language model utilization.
Result: 5.4% end-to-end accuracy improvement and 4 times efficiency improvement compared to word-based OCR pipelines. Created a new dataset of 251 English page images with line-level annotations.
Conclusion: Line-level OCR represents a natural progression that improves both accuracy and efficiency, with potential to leverage advances in large language models for document image processing.
Abstract: Conventional optical character recognition (OCR) techniques segmented each character and then recognized it. This made them prone to errors in character segmentation, and devoid of context to exploit language models. Advances in sequence-to-sequence translation in the last decade led to modern techniques first detecting words and then inputting one word at a time to a model to directly output full words as a sequence of characters. This allowed better utilization of language models and bypassed the error-prone character segmentation step. We observe that the above transition in style has moved the bottleneck in accuracy to word segmentation. Hence, in this paper, we propose a natural and logical progression from word-level OCR to line-level OCR. The proposal bypasses errors in word detection, and provides larger sentence context for better utilization of language models. We show that the proposed technique not only improves the accuracy but also the efficiency of OCR. Despite our thorough literature survey, we did not find any public dataset to train and benchmark such a shift from word- to line-level OCR. Hence, we also contribute a meticulously curated dataset of 251 English page images with line-level annotations. Our experimentation revealed a notable end-to-end accuracy improvement of 5.4%, underscoring the potential benefits of transitioning towards line-level OCR, especially for document images. We also report a 4 times improvement in efficiency compared to word-based pipelines. With continuous improvements in large language models, our methodology also holds potential to exploit such advances. Project Website: https://nishitanand.github.io/line-level-ocr-website
[108] Unfolding Framework with Complex-Valued Deformable Attention for High-Quality Computer-Generated Hologram Generation
Haomiao Zhang, Zhangyuan Li, Yanling Piao, Zhi Li, Xiaodong Wang, Miao Cao, Xiongfei Su, Qiang Song, Xin Yuan
Main category: cs.CV
TL;DR: A Deep Unfolding Network for computer-generated holography that combines adaptive bandwidth-preserving modeling with complex-valued denoising to overcome limitations of existing methods and achieve state-of-the-art reconstruction quality.
Details
Motivation: Existing deep learning-based CGH methods have three main limitations: (1) end-to-end networks ignore physical relationships reducing interpretability, (2) CNNs have limited receptive fields for global context, and (3) ASM-based models are constrained to finite near-fields.
Method: Proposes a Deep Unfolding Network that decomposes gradient descent into two modules: Adaptive Bandwidth-Preserving Model (ABPM) for wider working distances and Phase-Domain Complex-valued Denoiser (PCD) with complex-valued deformable self-attention to capture global features.
Result: Achieves PSNR over 35 dB, outperforms existing methods on both simulated and real data, and provides state-of-the-art reconstruction performance with improved flexibility and working distance.
Conclusion: The proposed DUN framework successfully addresses key limitations in CGH by combining physical modeling with deep learning, offering better interpretability, global feature capture, and extended working range compared to conventional approaches.
Abstract: Computer-generated holography (CGH) has gained wide attention with deep learning-based algorithms. However, due to its nonlinear and ill-posed nature, challenges remain in achieving accurate and stable reconstruction. Specifically, ($i$) the widely used end-to-end networks treat the reconstruction model as a black box, ignoring underlying physical relationships, which reduces interpretability and flexibility. ($ii$) CNN-based CGH algorithms have limited receptive fields, hindering their ability to capture long-range dependencies and global context. ($iii$) Angular spectrum method (ASM)-based models are constrained to finite near-fields. In this paper, we propose a Deep Unfolding Network (DUN) that decomposes gradient descent into two modules: an adaptive bandwidth-preserving model (ABPM) and a phase-domain complex-valued denoiser (PCD), providing more flexibility. ABPM allows for wider working distances compared to ASM-based methods. At the same time, PCD leverages its complex-valued deformable self-attention module to capture global features and enhance performance, achieving a PSNR over 35 dB. Experiments on simulated and real data show state-of-the-art results.
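One unfolded iteration of the kind the abstract describes alternates a physics-based data-consistency step with a learned denoiser. In sketch form (the operator and function names are placeholders standing in for ABPM and PCD, not the authors' code):
```python
def unfolded_step(x, y, forward_op, adjoint_op, denoiser, step_size):
    """One unfolded iteration: data-consistency gradient step + denoiser.

    x:  current complex field estimate (e.g., a complex tensor)
    y:  measured/target intensity or field
    forward_op/adjoint_op: propagation model and its adjoint (ABPM's role)
    denoiser: learned refinement network (PCD's role)
    """
    grad = adjoint_op(forward_op(x) - y)     # gradient of 0.5*||A(x) - y||^2
    return denoiser(x - step_size * grad)    # learned proximal refinement
```
Stacking K such steps, each with its own learned step size and denoiser weights, yields the full unfolded network, which is what gives the method its interpretability relative to a black-box end-to-end model.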
[109] Towards Interactive Lesion Segmentation in Whole-Body PET/CT with Promptable Models
Maximilian Rokuss, Yannick Kirchhoff, Fabian Isensee, Klaus H. Maier-Hein
Main category: cs.CV
TL;DR: Extension of nnU-Net for interactive PET/CT lesion segmentation using Euclidean Distance Transform encoding of user clicks, achieving best performance in autoPET/CT IV challenge
Details
Motivation: Clinical practice benefits from interactive segmentation approaches that keep humans in the loop to refine automated predictions, addressing challenges of tracer heterogeneity and multi-center variability in PET/CT.
Method: Extended nnU-Net framework with promptable capabilities by encoding user-provided foreground/background clicks as additional input channels using Euclidean Distance Transform, with online simulation of user interactions and custom point sampling.
Result: EDT encodings consistently outperform Gaussian kernels; ensemble models achieve strongest cross-validation performance, reducing both false positives and false negatives compared to baselines
Conclusion: Promptable models enable efficient user-guided segmentation workflows in multi-tracer, multi-center PET/CT, with EDT-based approach showing superior performance
Abstract: Whole-body PET/CT is a cornerstone of oncological imaging, yet accurate lesion segmentation remains challenging due to tracer heterogeneity, physiological uptake, and multi-center variability. While fully automated methods have advanced substantially, clinical practice benefits from approaches that keep humans in the loop to efficiently refine predicted masks. The autoPET/CT IV challenge addresses this need by introducing interactive segmentation tasks based on simulated user prompts. In this work, we present our submission to Task 1. Building on the winning autoPET III nnU-Net pipeline, we extend the framework with promptable capabilities by encoding user-provided foreground and background clicks as additional input channels. We systematically investigate representations for spatial prompts and demonstrate that Euclidean Distance Transform (EDT) encodings consistently outperform Gaussian kernels. Furthermore, we propose online simulation of user interactions and a custom point sampling strategy to improve robustness under realistic prompting conditions. Our ensemble of EDT-based models, trained with and without external data, achieves the strongest cross-validation performance, reducing both false positives and false negatives compared to baseline models. These results highlight the potential of promptable models to enable efficient, user-guided segmentation workflows in multi-tracer, multi-center PET/CT. Code is publicly available at https://github.com/MIC-DKFZ/autoPET-interactive
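The EDT click encoding is easy to reproduce with SciPy: each extra input channel stores, at every voxel, the distance to the nearest user click. A minimal sketch (the volume shape and click coordinates are illustrative):
```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def edt_click_channel(shape, clicks):
    """Encode user clicks as a Euclidean Distance Transform channel.

    shape: volume shape, e.g. (D, H, W); clicks: list of voxel coordinates.
    distance_transform_edt measures distance to the nearest zero entry,
    so we place zeros (False) at the click locations.
    """
    seed = np.ones(shape, dtype=bool)
    for c in clicks:
        seed[tuple(c)] = False
    return distance_transform_edt(seed)

fg = edt_click_channel((32, 64, 64), [(16, 30, 30)])  # foreground channel
bg = edt_click_channel((32, 64, 64), [(2, 5, 5)])     # background channel
```
Foreground and background clicks each get their own channel, which is concatenated to the PET/CT input; the paper reports this encoding consistently beating Gaussian kernels.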
[110] Mapping like a Skeptic: Probabilistic BEV Projection for Online HD Mapping
Fatih Erdoğan, Merve Rabia Barın, Fatma Güney
Main category: cs.CV
TL;DR: A novel probabilistic projection mechanism for HD map construction that uses geometric mapping with camera parameters and refines it with confidence scores to filter out irrelevant elements and improve temporal processing.
Details
Motivation: Existing HD mapping approaches struggle with accuracy due to generalization problems and often hallucinate non-existent road elements when using standard attention-based projection techniques.
Method: Proposes a probabilistic projection mechanism with confidence scores that starts with geometric mapping based on camera parameters and adapts it to the scene, filtering irrelevant elements and selectively accumulating reliable temporal information.
Result: Demonstrates improved performance over state-of-the-art approaches on nuScenes and Argoverse2 datasets, with particularly pronounced improvements on nuScenes and in challenging long perception ranges.
Conclusion: The proposed method provides better generalization and accuracy in HD map construction by combining geometric mapping with adaptive probabilistic filtering and temporal confidence-based accumulation.
Abstract: Constructing high-definition (HD) maps from sensory input requires accurately mapping the road elements in image space to the Bird’s Eye View (BEV) space. The precision of this mapping directly impacts the quality of the final vectorized HD map. Existing HD mapping approaches outsource the projection to standard mapping techniques, such as attention-based ones. However, these methods struggle with accuracy due to generalization problems, often hallucinating non-existent road elements. Our key idea is to start with a geometric mapping based on camera parameters and adapt it to the scene to extract relevant map information from camera images. To implement this, we propose a novel probabilistic projection mechanism with confidence scores to (i) refine the mapping to better align with the scene and (ii) filter out irrelevant elements that should not influence HD map generation. In addition, we improve temporal processing by using confidence scores to selectively accumulate reliable information over time. Experiments on new splits of the nuScenes and Argoverse2 datasets demonstrate improved performance over state-of-the-art approaches, indicating better generalization. The improvements are particularly pronounced on nuScenes and in the challenging long perception range. Our code and model checkpoints are available at https://github.com/Fatih-Erdogan/mapping-like-skeptic.
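The geometric starting point, relating BEV ground-plane cells to image pixels via camera parameters, is a plain pinhole projection; the learned confidence scores then refine and filter this mapping. A sketch under a flat-ground (z = 0) assumption:
```python
import numpy as np

def ground_plane_to_image(K, R, t, bev_points):
    """Project BEV ground-plane points (X, Y, 0) into the image.

    K: (3, 3) camera intrinsics; R, t: world-to-camera rotation/translation;
    bev_points: (N, 2) array of ground-plane (X, Y) coordinates.
    Returns (N, 2) pixel coordinates (u, v).
    """
    pts = np.concatenate(
        [bev_points, np.zeros((len(bev_points), 1))], axis=1)  # lift to z = 0
    cam = R @ pts.T + t.reshape(3, 1)   # world -> camera coordinates
    uv = K @ cam                        # camera -> homogeneous pixels
    return (uv[:2] / uv[2]).T           # perspective divide

# Usage: sample each BEV cell's image feature at its projected (u, v)
pixels = ground_plane_to_image(np.eye(3), np.eye(3),
                               np.array([0.0, 1.5, 0.0]),
                               np.array([[5.0, 10.0], [0.0, 20.0]]))
```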
[111] FLORA: Efficient Synthetic Data Generation for Object Detection in Low-Data Regimes via finetuning Flux LoRA
Alvaro Patricio, Atabak Dehban, Rodrigo Ventura
Main category: cs.CV
TL;DR: FLORA is a lightweight synthetic data generation pipeline using LoRA-fine-tuned Flux 1.1 diffusion model that achieves superior object detection performance with only 500 synthetic images and consumer-grade GPUs, outperforming baselines that use 10x more data.
Details
Motivation: To address the limitations of resource-intensive full fine-tuning of large diffusion models for synthetic data generation, which requires enterprise-grade GPUs and thousands of synthetic images, making the process impractical for real-world scenarios.
Method: Uses Flux 1.1 Dev diffusion model fine-tuned exclusively through Low-Rank Adaptation (LoRA), creating a lightweight pipeline that reduces computational requirements to consumer-grade GPU levels.
Result: Training object detectors with just 500 FLORA-generated synthetic images yields superior performance compared to models trained on 5000 synthetic images from ODGEN baseline, achieving up to 21.3% improvement in mAP@.50:.95.
Conclusion: FLORA demonstrates that quality and efficiency-focused approach surpasses state-of-the-art performance with only 10% of the data and a fraction of computational cost, making synthetic data creation more practical and accessible.
Abstract: Recent advances in diffusion-based generative models have demonstrated significant potential in augmenting scarce datasets for object detection tasks. Nevertheless, most recent models rely on resource-intensive full fine-tuning of large-scale diffusion models, requiring enterprise-grade GPUs (e.g., NVIDIA V100) and thousands of synthetic images. To address these limitations, we propose Flux LoRA Augmentation (FLORA), a lightweight synthetic data generation pipeline. Our approach uses the Flux 1.1 Dev diffusion model, fine-tuned exclusively through Low-Rank Adaptation (LoRA). This dramatically reduces computational requirements, enabling synthetic dataset generation with a consumer-grade GPU (e.g., NVIDIA RTX 4090). We empirically evaluate our approach on seven diverse object detection datasets. Our results demonstrate that training object detectors with just 500 synthetic images generated by our approach yields superior detection performance compared to models trained on 5000 synthetic images from the ODGEN baseline, achieving improvements of up to 21.3% in mAP@.50:.95. These results show that it is possible to surpass state-of-the-art performance with far greater efficiency, as FLORA achieves superior results using only 10% of the data and a fraction of the computational cost, demonstrating that a quality- and efficiency-focused approach is more effective than brute-force generation and making advanced synthetic data creation more practical and accessible for real-world scenarios.
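For readers unfamiliar with the LoRA-only fine-tuning FLORA relies on, a generic sketch of the idea follows: the base weights stay frozen and only a low-rank update is trained. This illustrates LoRA in general, not the Flux 1.1 training code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # base weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```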
[112] CAD2DMD-SET: Synthetic Generation Tool of Digital Measurement Device CAD Model Datasets for fine-tuning Large Vision-Language Models
João Valente, Atabak Dehban, Rodrigo Ventura
Main category: cs.CV
TL;DR: CAD2DMD-SET is a synthetic data generation tool that creates diverse VQA-labeled datasets for improving LVLMs’ ability to read digital measurement devices in challenging real-world conditions.
Details
Motivation: LVLMs struggle with reading values from Digital Measurement Devices (DMDs) in real-world conditions involving clutter, occlusions, extreme viewpoints, and motion blur, which are common in head-mounted cameras and AR applications.
Method: Developed a synthetic data generation tool using 3D CAD models, advanced rendering, and high-fidelity image composition to produce diverse VQA-labeled synthetic DMD datasets. Also created DMDBench, a curated validation set of 1,000 annotated real-world images.
Result: Fine-tuning three state-of-the-art LVLMs with CAD2DMD-SET’s dataset yielded substantial improvements, with InternVL showing a 200% score increase using Average Normalised Levenshtein Similarity (ANLS) without degrading performance on other tasks.
Conclusion: CAD2DMD-SET significantly improves LVLMs’ robustness and performance in challenging DMD reading scenarios, and the tool will be released as open-source to enable community expansion and dataset generation.
Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities across various multimodal tasks. They continue, however, to struggle with seemingly trivial scenarios such as reading values from Digital Measurement Devices (DMDs), particularly in real-world conditions involving clutter, occlusions, extreme viewpoints, and motion blur, conditions common in head-mounted cameras and Augmented Reality (AR) applications. Motivated by these limitations, this work introduces CAD2DMD-SET, a synthetic data generation tool designed to support visual question answering (VQA) tasks involving DMDs. By leveraging 3D CAD models, advanced rendering, and high-fidelity image composition, our tool produces diverse, VQA-labelled synthetic DMD datasets suitable for fine-tuning LVLMs. Additionally, we present DMDBench, a curated validation set of 1,000 annotated real-world images designed to evaluate model performance under practical constraints. Benchmarking three state-of-the-art LVLMs using Average Normalised Levenshtein Similarity (ANLS) and further fine-tuning LoRA adapters of these models with CAD2DMD-SET's generated dataset yielded substantial improvements, with InternVL showcasing a score increase of 200% without degrading on other tasks. This demonstrates that the CAD2DMD-SET training dataset substantially improves the robustness and performance of LVLMs when operating under the previously stated challenging conditions. The CAD2DMD-SET tool is expected to be released as open-source once the final version of this manuscript is prepared, allowing the community to add different measurement devices and generate their own datasets.
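Since the benchmark scores models with Average Normalised Levenshtein Similarity, a small self-contained sketch of that metric may help; the 0.5 threshold follows the common ANLS convention and may differ from the paper's exact setup.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def anls(preds, golds, tau=0.5):
    # Average Normalised Levenshtein Similarity over answer pairs.
    scores = []
    for p, g in zip(preds, golds):
        nl = levenshtein(p.lower(), g.lower()) / max(len(p), len(g), 1)
        s = 1.0 - nl
        scores.append(s if s >= tau else 0.0)           # below tau counts as 0
    return sum(scores) / max(len(scores), 1)
```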
[113] UItron: Foundational GUI Agent with Advanced Perception and Planning
Zhixiong Zeng, Jing Huang, Liming Zheng, Wenkang Han, Yufeng Zhong, Lei Chen, Longrong Yang, Yingjie Chu, Yuzhi He, Lin Ma
Main category: cs.CV
TL;DR: UItron is an open-source foundational model for GUI agents that addresses challenges in GUI automation through advanced data engineering, interactive infrastructure, and a curriculum reinforcement learning framework, achieving superior performance particularly in Chinese mobile app scenarios.
Details
Motivation: Building GUI agents remains challenging due to scarcity of operation trajectories, lack of interactive infrastructure, and limitations in foundation models' initial capabilities. There's also a general lack of Chinese capabilities in state-of-the-art solutions.
Method: Systematic data engineering strategies, interactive environment connecting Mobile and PC devices, supervised finetuning over perception and planning tasks, and curriculum reinforcement learning framework for complex reasoning and exploration.
Result: UItron achieves superior performance in GUI perception, grounding, and planning benchmarks. It shows significant progress in Chinese app scenarios with over one million manually collected operation trajectories across top 100 popular apps.
Conclusion: UItron propels GUI agents closer to real-world application by addressing key challenges through systemic data engineering and interactive infrastructure, particularly demonstrating strong capabilities in Chinese mobile app interaction.
Abstract: GUI agents aim to enable automated operations on Mobile/PC devices, an important task toward achieving artificial general intelligence. The rapid advancement of VLMs accelerates the development of GUI agents, owing to their powerful capabilities in visual understanding and task planning. However, building a GUI agent remains a challenging task due to the scarcity of operation trajectories, the limited availability of interactive infrastructure, and the limited initial capabilities of foundation models. In this work, we introduce UItron, an open-source foundational model for automatic GUI agents, featuring advanced GUI perception, grounding, and planning capabilities. UItron highlights the necessity of systemic data engineering and interactive infrastructure as foundational components for advancing GUI agent development. It not only systematically studies a series of data engineering strategies to enhance training effects, but also establishes an interactive environment connecting both Mobile and PC devices. In training, UItron adopts supervised finetuning over perception and planning tasks in various GUI scenarios, and then develops a curriculum reinforcement learning framework to enable complex reasoning and exploration in online environments. As a result, UItron achieves superior performance on benchmarks of GUI perception, grounding, and planning. In particular, UItron highlights interaction proficiency with top-tier Chinese mobile apps, as we identified a general lack of Chinese capabilities even in state-of-the-art solutions. To this end, we manually collect over one million steps of operation trajectories across the top 100 most popular apps, and build offline and online agent evaluation environments. Experimental results demonstrate that UItron achieves significant progress in Chinese app scenarios, propelling GUI agents one step closer to real-world application.
[114] Domain Generalization in-the-Wild: Disentangling Classification from Domain-Aware Representations
Ha Min Son, Zhe Zhao, Shahbaz Rezaei, Xin Liu
Main category: cs.CV
TL;DR: CLIP-DCA improves domain generalization for foundation models by enhancing domain awareness rather than enforcing domain invariance, showing significant gains on challenging out-of-distribution datasets.
Details
Motivation: Current domain generalization evaluation for foundation models like CLIP is inadequate because web-scale pretraining data may cover existing benchmarks, making tests insufficiently challenging for truly unseen data scenarios.
Method: CLIP-DCA enhances domain awareness within CLIP's encoders using a separate domain head and synthetically generated diverse domain data, while encouraging domain-invariant classification through disentanglement from domain features.
Result: CLIP’s performance deteriorates significantly on more out-of-distribution datasets, but CLIP-DCA shows significant improvements compared to existing methods, particularly on datasets that are more out-of-distribution.
Conclusion: Enhancing domain awareness is a prerequisite for effective domain-invariant classification in foundation models, and the proposed CLIP-DCA approach successfully addresses CLIP’s performance degradation on challenging unseen domains.
Abstract: Evaluating domain generalization (DG) for foundational models like CLIP is challenging, as web-scale pretraining data potentially covers many existing benchmarks. Consequently, current DG evaluation may neither be sufficiently challenging nor adequately test genuinely unseen data scenarios. To better assess the performance of CLIP on DG in-the-wild, a scenario where CLIP encounters challenging unseen data, we consider two approaches: (1) evaluating on 33 diverse datasets with quantified out-of-distribution (OOD) scores after fine-tuning CLIP on ImageNet, and (2) using unlearning to make CLIP "forget" some domains as an approximation. We observe that CLIP's performance deteriorates significantly on more OOD datasets. To address this, we present CLIP-DCA (Disentangling Classification from enhanced domain Aware representations). Our approach is motivated by the observation that while standard domain invariance losses aim to make representations domain-invariant, this can be harmful to foundation models by forcing the discarding of domain-aware representations beneficial for generalization. We instead hypothesize that enhancing domain awareness is a prerequisite for effective domain-invariant classification in foundation models. CLIP-DCA identifies and enhances domain awareness within CLIP's encoders using a separate domain head and synthetically generated diverse domain data. Simultaneously, it encourages domain-invariant classification through disentanglement from the domain features. CLIP-DCA shows significant improvements within this challenging evaluation compared to existing methods, particularly on datasets that are more OOD.
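A hedged sketch of the two-head idea described above: one head is trained to be domain-aware while the classification pathway is disentangled from it. The orthogonality penalty here is one plausible disentanglement choice, not necessarily the paper's exact loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainAwareHeads(nn.Module):
    """Classification head disentangled from a domain-aware head."""

    def __init__(self, dim, n_classes, n_domains):
        super().__init__()
        self.cls_proj = nn.Linear(dim, dim)
        self.dom_proj = nn.Linear(dim, dim)
        self.cls_head = nn.Linear(dim, n_classes)
        self.dom_head = nn.Linear(dim, n_domains)

    def forward(self, feats, labels, domains):
        z_cls, z_dom = self.cls_proj(feats), self.dom_proj(feats)
        loss_cls = F.cross_entropy(self.cls_head(z_cls), labels)
        loss_dom = F.cross_entropy(self.dom_head(z_dom), domains)  # domain awareness
        # Disentangle: penalize alignment between the two subspaces.
        z1 = F.normalize(z_cls, dim=-1)
        z2 = F.normalize(z_dom, dim=-1)
        loss_ortho = (z1 * z2).sum(dim=-1).pow(2).mean()
        return loss_cls + loss_dom + 0.1 * loss_ortho
```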
[115] What Can We Learn from Harry Potter? An Exploratory Study of Visual Representation Learning from Atypical Videos
Qiyue Sun, Qiming Huang, Yang Yang, Hongjun Wang, Jianbo Jiao
Main category: cs.CV
TL;DR: Atypical video data improves open-world learning performance in OOD detection, novel category discovery, and zero-shot action recognition tasks, with semantic diversity being more important than dataset size.
Details
Motivation: To explore how exposure to atypical/unusual video content (sci-fi, animation, etc.) during training can improve model generalization and discovery capabilities in open-world scenarios, which is under-explored compared to typical closed-set learning.
Method: Collected a new video dataset of atypical data, fed them into model training for representation learning, and evaluated on three open-world tasks: OOD detection, novel category discovery, and zero-shot action recognition across various settings.
Result: Straightforward learning with atypical data consistently improved performance across all tasks. Categorical diversity boosted OOD detection, semantic diversity with smaller atypical sets outperformed larger typical datasets in NCD, and semantic diversity helped zero-shot generalization to unseen actions.
Conclusion: Atypical videos provide significant benefits for visual representation learning in open-world scenarios, with semantic diversity being a key factor, encouraging further research in this direction with the newly proposed dataset.
Abstract: Humans usually show exceptional generalisation and discovery ability in the open world, when being shown uncommon new concepts. Whereas most existing studies in the literature focus on common typical data from closed sets, open-world novel discovery is under-explored in videos. In this paper, we are interested in asking: What if atypical, unusual videos are exposed in the learning process? To this end, we collect a new video dataset consisting of various types of unusual atypical data (e.g., sci-fi, animation, etc.). To study how such atypical data may benefit open-world learning, we feed them into the model training process for representation learning. Focusing on three key tasks in open-world learning: out-of-distribution (OOD) detection, novel category discovery (NCD), and zero-shot action recognition (ZSAR), we found that even straightforward learning approaches with atypical data consistently improve performance across various settings. Furthermore, we found that increasing the categorical diversity of the atypical samples further boosts OOD detection performance. Additionally, in the NCD task, using a smaller yet more semantically diverse set of atypical samples leads to better performance compared to using a larger but more typical dataset. In the ZSAR setting, the semantic diversity of atypical videos helps the model generalise better to unseen action classes. These observations in our extensive experimental evaluations reveal the benefits of atypical videos for visual representation learning in the open world, together with the newly proposed dataset, encouraging further studies in this direction.
[116] Unsupervised Video Continual Learning via Non-Parametric Deep Embedded Clustering
Nattapong Kurpukdee, Adrian G. Bors
Main category: cs.CV
TL;DR: Proposes unsupervised video continual learning (uVCL) using Kernel Density Estimation of video transformer features and novelty detection for dynamic memory expansion, achieving strong performance on video action recognition datasets without labels.
Details
Motivation: Address the gap in unsupervised video continual learning where neither task boundaries nor labels are provided, overcoming limitations of supervised approaches that require costly labeled data and explicit task boundaries.
Method: Uses Kernel Density Estimation (KDE) of deep embedded video features from unsupervised video transformers, with novelty detection for dynamic memory cluster expansion and transfer learning from previous tasks.
Result: Substantially enhances performance when successively learning many tasks, validated on UCF101, HMDB51, and Something-to-Something V2 datasets without using labels or class boundaries.
Conclusion: The proposed non-parametric approach effectively addresses unsupervised video continual learning challenges, providing a practical solution for learning from unstructured video data in sequential tasks without supervision.
Abstract: We propose a realistic scenario for unsupervised video learning, where neither task boundaries nor labels are provided when learning a succession of tasks. We also provide a non-parametric learning solution for the under-explored problem of unsupervised video continual learning. Videos are a complex and rich spatio-temporal medium, widely used in many applications, but they have not been sufficiently explored in unsupervised continual learning. Prior studies have only focused on supervised continual learning, relying on the knowledge of labels and task boundaries, while labelled data is costly and not practical to obtain. To address this gap, we study unsupervised video continual learning (uVCL). uVCL raises more challenges due to the additional computational and memory requirements of processing videos when compared to images. We introduce a general benchmark experimental protocol for uVCL by considering the learning of unstructured video data categories during each task. We propose to use the Kernel Density Estimation (KDE) of deep embedded video features extracted by unsupervised video transformer networks as a non-parametric probabilistic representation of the data. We introduce a novelty detection criterion for the incoming new task data, dynamically enabling the expansion of memory clusters, aiming to capture new knowledge when learning a succession of tasks. We leverage transfer learning from the previous tasks as an initial state for the knowledge transfer to the current learning task. We found that the proposed methodology substantially enhances the performance of the model when successively learning many tasks. We perform in-depth evaluations on three standard video action recognition datasets, including UCF101, HMDB51, and Something-to-Something V2, without using any labels or class boundaries.
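A minimal sketch of the non-parametric memory described above: one KDE per discovered cluster, with a log-density threshold acting as the novelty criterion. The bandwidth and threshold values are illustrative.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

class KDEMemory:
    """KDE-based memory with novelty-driven cluster expansion."""

    def __init__(self, bandwidth=0.5, threshold=-50.0):
        self.clusters = []            # one fitted KDE per discovered cluster
        self.bandwidth = bandwidth
        self.threshold = threshold    # mean log-density below this => novel

    def is_novel(self, feats: np.ndarray) -> bool:
        # feats: (N, D) embedded video features of the incoming data.
        if not self.clusters:
            return True
        best = max(kde.score_samples(feats).mean() for kde in self.clusters)
        return best < self.threshold

    def update(self, feats: np.ndarray) -> None:
        if self.is_novel(feats):      # expand memory with a new cluster
            kde = KernelDensity(bandwidth=self.bandwidth).fit(feats)
            self.clusters.append(kde)
```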
[117] A Multi-Stage Fine-Tuning and Ensembling Strategy for Pancreatic Tumor Segmentation in Diagnostic and Therapeutic MRI
Omer Faruk Durugol, Maximilian Rokuss, Yannick Kirchhoff, Klaus H. Maier-Hein
Main category: cs.CV
TL;DR: Automated PDAC segmentation from MRI using nnU-Net with multi-stage pre-training and metric-aware ensembling, achieving state-of-the-art results in PANTHER challenge.
Details
Motivation: Automated PDAC segmentation from MRI is critical but hindered by poor tumor-tissue contrast and scarcity of annotated data, requiring robust methods for limited data scenarios.
Method: Built on nnU-Net framework with deep multi-stage cascaded pre-training strategy, starting from general anatomical foundation model and fine-tuning on CT pancreatic lesion datasets and target MRI modalities. Used five-fold cross-validation to evaluate data augmentation schemes and training schedules, then created custom heterogeneous ensembles of specialist models.
Result: Achieved state-of-the-art boundary precision (MASD of 5.46 mm and HD95 of 17.33 mm for Task 1) and top cross-validation Tumor Dice scores of 0.661 for Task 1 and 0.523 for Task 2. Found trade-off where aggressive augmentation gives best volumetric accuracy while default augmentations yield superior boundary precision.
Conclusion: Presents a robust methodology for developing specialized high-performance models in limited data scenarios, demonstrating effectiveness of metric-aware ensembling strategy for complex medical imaging tasks like PDAC segmentation.
Abstract: Automated segmentation of Pancreatic Ductal Adenocarcinoma (PDAC) from MRI is critical for clinical workflows but is hindered by poor tumor-tissue contrast and a scarcity of annotated data. This paper details our submission to the PANTHER challenge, addressing both diagnostic T1-weighted (Task 1) and therapeutic T2-weighted (Task 2) segmentation. Our approach is built upon the nnU-Net framework and leverages a deep, multi-stage cascaded pre-training strategy, starting from a general anatomical foundation model and sequentially fine-tuning on CT pancreatic lesion datasets and the target MRI modalities. Through extensive five-fold cross-validation, we systematically evaluated data augmentation schemes and training schedules. Our analysis revealed a critical trade-off, where aggressive data augmentation produced the highest volumetric accuracy, while default augmentations yielded superior boundary precision (achieving a state-of-the-art MASD of 5.46 mm and HD95 of 17.33 mm for Task 1). For our final submission, we exploited this finding by constructing custom, heterogeneous ensembles of specialist models, essentially creating a mix of experts. This metric-aware ensembling strategy proved highly effective, achieving a top cross-validation Tumor Dice score of 0.661 for Task 1 and 0.523 for Task 2. Our work presents a robust methodology for developing specialized, high-performance models in the context of limited data and complex medical imaging tasks (Team MIC-DKFZ).
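A toy sketch of metric-aware ensembling under the stated trade-off: probability maps from "volumetric" specialists and "boundary" specialists are averaged separately and then fused; the weights and model groupings are illustrative, not the challenge submission.

```python
import torch

def metric_aware_ensemble(prob_maps_volumetric, prob_maps_boundary, w=0.5):
    """Each argument: list of (C, D, H, W) softmax outputs, one per model/fold."""
    p_vol = torch.stack(prob_maps_volumetric).mean(dim=0)  # volumetric experts
    p_bnd = torch.stack(prob_maps_boundary).mean(dim=0)    # boundary experts
    fused = w * p_vol + (1.0 - w) * p_bnd                  # metric-aware mix
    return fused.argmax(dim=0)                             # final label map
```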
[118] Benchmarking GPT-5 in Radiation Oncology: Measurable Gains, but Persistent Need for Expert Oversight
Ugur Dinc, Jibak Sarkar, Philipp Schubert, Sabine Semrau, Thomas Weissmann, Andre Karius, Johann Brand, Bernd-Niklas Axer, Ahmed Gomaa, Pluvio Stephan, Ishita Sheth, Sogand Beirami, Annette Schwarz, Udo Gaipl, Benjamin Frey, Christoph Bert, Stefanie Corradini, Rainer Fietkau, Florian Putz
Main category: cs.CV
TL;DR: GPT-5 demonstrates superior performance over previous models in radiation oncology benchmarks, achieving 92.8% accuracy on multiple-choice exams and generating clinically useful treatment recommendations with rare hallucinations, though expert oversight remains essential.
Details
Motivation: To evaluate the clinical decision support capabilities of GPT-5 specifically for radiation oncology applications, comparing its performance against previous GPT models in both standardized testing and real-world clinical scenarios.
Method: Used two benchmarks: ACR Radiation Oncology In-Training Examination (300 multiple-choice items) and 60 authentic radiation oncology vignettes. Four board-certified radiation oncologists rated GPT-5's treatment plans for correctness, comprehensiveness, and hallucinations using Fleiss' kappa for inter-rater reliability.
Result: GPT-5 achieved 92.8% accuracy on TXIT benchmark, outperforming GPT-4 (78.8%) and GPT-3.5 (62.1%). Vignette evaluation showed high correctness (3.24/4) and comprehensiveness (3.59/4) ratings with rare hallucinations. Domain-specific gains were most pronounced in Dose and Diagnosis.
Conclusion: GPT-5 shows significant improvement over previous models in radiation oncology applications but still requires rigorous expert oversight due to substantive errors in complex scenarios, indicating room for further improvement before clinical implementation.
Abstract: Introduction: Large language models (LLMs) have shown great potential in clinical decision support. GPT-5 is a novel LLM system that has been specifically marketed towards oncology use. Methods: Performance was assessed using two complementary benchmarks: (i) the ACR Radiation Oncology In-Training Examination (TXIT, 2021), comprising 300 multiple-choice items, and (ii) a curated set of 60 authentic radiation oncology vignettes representing diverse disease sites and treatment indications. For the vignette evaluation, GPT-5 was instructed to generate concise therapeutic plans. Four board-certified radiation oncologists rated correctness, comprehensiveness, and hallucinations. Inter-rater reliability was quantified using Fleiss' kappa. Results: On the TXIT benchmark, GPT-5 achieved a mean accuracy of 92.8%, outperforming GPT-4 (78.8%) and GPT-3.5 (62.1%). Domain-specific gains were most pronounced in Dose and Diagnosis. In the vignette evaluation, GPT-5's treatment recommendations were rated highly for correctness (mean 3.24/4, 95% CI: 3.11-3.38) and comprehensiveness (3.59/4, 95% CI: 3.49-3.69). Hallucinations were rare, with no case reaching majority consensus for their presence. Inter-rater agreement was low (Fleiss' kappa 0.083 for correctness), reflecting inherent variability in clinical judgment. Errors clustered in complex scenarios requiring precise trial knowledge or detailed clinical adaptation. Discussion: GPT-5 clearly outperformed prior model variants on the radiation oncology multiple-choice benchmark. Although GPT-5 exhibited favorable performance in generating real-world radiation oncology treatment recommendations, correctness ratings indicate room for further improvement. While hallucinations were infrequent, the presence of substantive errors underscores that GPT-5-generated recommendations require rigorous expert oversight before clinical implementation.
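For reference, the inter-rater statistic reported above (Fleiss' kappa) can be computed with statsmodels; the ratings below are placeholders, not the study's data.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = vignettes, columns = the four raters; values = ordinal scores 1..4.
ratings = np.array([
    [3, 4, 3, 3],
    [4, 4, 4, 3],
    [2, 3, 3, 2],
])
table, _ = aggregate_raters(ratings)   # per-item counts for each category
print(fleiss_kappa(table))             # agreement beyond chance
```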
[119] TMUAD: Enhancing Logical Capabilities in Unified Anomaly Detection Models with a Text Memory Bank
Jiawei Liu, Jiahe Hou, Wei Wang, Jinsong Du, Yang Cong, Huijie Fan
Main category: cs.CV
TL;DR: TMUAD proposes a Three-Memory framework using text, object-level, and patch-level memory banks to unify structural and logical anomaly detection, achieving state-of-the-art performance across multiple datasets.
Details
Motivation: Anomaly detection is challenging due to limited normal data, and existing methods rely on image feature extractors and memory banks that may not adequately capture logical relationships between objects.
Method: Three complementary memory banks: 1) class-level text memory bank with logic-aware text extractor for logical anomalies, 2) object-level image memory bank preserving complete object contours, 3) patch-level memory bank for structural anomaly detection. These retrieve similar normal images and compute fused anomaly scores.
Result: Achieves state-of-the-art performance across seven publicly available datasets involving industrial and medical domains.
Conclusion: The unified approach through collaborative memory banks effectively handles both structural and logical anomaly detection, demonstrating superior performance in diverse application domains.
Abstract: Anomaly detection, which aims to identify anomalies deviating from normal patterns, is challenging due to the limited amount of normal data available. Unlike most existing unified methods that rely on carefully designed image feature extractors and memory banks to capture logical relationships between objects, we introduce a text memory bank to enhance the detection of logical anomalies. Specifically, we propose a Three-Memory framework for Unified structural and logical Anomaly Detection (TMUAD). First, we build a class-level text memory bank for logical anomaly detection by the proposed logic-aware text extractor, which can capture rich logical descriptions of objects from input images. Second, we construct an object-level image memory bank that preserves complete object contours by extracting features from segmented objects. Third, we employ visual encoders to extract patch-level image features for constructing a patch-level memory bank for structural anomaly detection. These three complementary memory banks are used to retrieve and compare normal images that are most similar to the query image, compute anomaly scores at multiple levels, and fuse them into a final anomaly score. By unifying structural and logical anomaly detection through collaborative memory banks, TMUAD achieves state-of-the-art performance across seven publicly available datasets involving industrial and medical domains. The model and code are available at https://github.com/SIA-IDE/TMUAD.
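A minimal sketch of the retrieve-and-fuse step described above: each memory bank yields a distance to its closest normal entry, and the three distances are combined into one anomaly score. The fusion weights are illustrative, not TMUAD's tuned values.

```python
import torch
import torch.nn.functional as F

def bank_score(query, bank):
    """query: (D,), bank: (N, D); distance to the closest normal entry."""
    sims = F.cosine_similarity(query.unsqueeze(0), bank, dim=-1)  # (N,)
    return 1.0 - sims.max()

def fused_anomaly_score(q_text, q_obj, q_patch, banks, weights=(1.0, 1.0, 1.0)):
    # One query/bank pair per level: text, object contour, patch.
    scores = [bank_score(q, b) for q, b in zip((q_text, q_obj, q_patch), banks)]
    return sum(w * s for w, s in zip(weights, scores))
```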
[120] VoCap: Video Object Captioning and Segmentation from Any Prompt
Jasper Uijlings, Xingyi Zhou, Xiuye Gu, Arsha Nagrani, Anurag Arnab, Alireza Fathi, David Ross, Cordelia Schmid
Main category: cs.CV
TL;DR: VoCap is a flexible video model that takes video and multimodal prompts to produce spatio-temporal masks with object-centric captions, addressing promptable video segmentation, referring expression segmentation, and object captioning simultaneously.
Details
Motivation: Understanding objects in videos with fine-grained localization and detailed semantic properties is fundamental for video understanding, but obtaining annotated data is tedious and expensive.
Method: Propose VoCap model that consumes video and multimodal prompts (text, box, mask) to output spatio-temporal masklets with object captions. Annotate SAV dataset with pseudo captions using VLMs, creating SAV-Caption dataset for training.
Result: State-of-the-art results on referring expression video object segmentation, competitive performance on semi-supervised video object segmentation, and establishes benchmark for video object captioning.
Conclusion: VoCap provides a unified solution for multiple video understanding tasks and the SAV-Caption dataset enables scalable training for video object captioning.
Abstract: Understanding objects in videos in terms of fine-grained localization masks and detailed semantic properties is a fundamental task in video understanding. In this paper, we propose VoCap, a flexible video model that consumes a video and a prompt of various modalities (text, box or mask), and produces a spatio-temporal masklet with a corresponding object-centric caption. As such, our model addresses simultaneously the tasks of promptable video object segmentation, referring expression segmentation, and object captioning. Since obtaining data for this task is tedious and expensive, we propose to annotate an existing large-scale segmentation dataset (SAV) with pseudo object captions. We do so by preprocessing videos with their ground-truth masks to highlight the object of interest and feed this to a large Vision Language Model (VLM). For an unbiased evaluation, we collect manual annotations on the validation set. We call the resulting dataset SAV-Caption. We train our VoCap model at scale on SAV-Caption together with a mix of other image and video datasets. Our model yields state-of-the-art results on referring expression video object segmentation, is competitive on semi-supervised video object segmentation, and establishes a benchmark for video object captioning. Our dataset will be made available at https://github.com/google-deepmind/vocap.
[121] The Demon is in Ambiguity: Revisiting Situation Recognition with Single Positive Multi-Label Learning
Yiming Lin, Yuchen Niu, Shang Wang, Kaizhu Huang, Qiufeng Wang, Xiao-Bo Jin
Main category: cs.CV
TL;DR: This paper reveals that verb classification in situation recognition is inherently multi-label due to semantic ambiguity, proposes a single positive multi-label learning approach, and achieves significant performance improvements with a novel graph-enhanced model.
Details
Motivation: Existing methods treat verb classification as single-label, but this fails to address the inherent ambiguity in visual event recognition where multiple verb categories may reasonably describe the same image.
Method: Proposes reformulating verb classification as single positive multi-label learning (SPMLL) and develops Graph Enhanced Verb Multilayer Perceptron (GE-VerbMLP) that combines graph neural networks to capture label correlations and adversarial training to optimize decision boundaries.
Result: Extensive experiments show more than 3% MAP improvement while remaining competitive on traditional top-1 and top-5 accuracy metrics.
Conclusion: Verb classification in situation recognition should be treated as a multi-label problem, and the proposed SPMLL approach with GE-VerbMLP effectively addresses the semantic ambiguity and achieves superior performance.
Abstract: Situation recognition (SR) is a fundamental task in computer vision that aims to extract structured semantic summaries from images by identifying key events and their associated entities. Specifically, given an input image, the model must first classify the main visual events (verb classification), then identify the participating entities and their semantic roles (semantic role labeling), and finally localize these entities in the image (semantic role localization). Existing methods treat verb classification as a single-label problem, but we show through a comprehensive analysis that this formulation fails to address the inherent ambiguity in visual event recognition, as multiple verb categories may reasonably describe the same image. This paper makes three key contributions: First, we reveal through empirical analysis that verb classification is inherently a multi-label problem due to the ubiquitous semantic overlap between verb categories. Second, given the impracticality of fully annotating large-scale datasets with multiple labels, we propose to reformulate verb classification as a single positive multi-label learning (SPMLL) problem - a novel perspective in SR research. Third, we construct a comprehensive multi-label evaluation benchmark for SR that is carefully designed to fairly evaluate model performance in a multi-label setting. To address the challenges of SPMLL, we further develop the Graph Enhanced Verb Multilayer Perceptron (GE-VerbMLP), which combines graph neural networks to capture label correlations and adversarial training to optimize decision boundaries. Extensive experiments on real-world datasets show that our approach achieves more than 3% MAP improvement while remaining competitive on traditional top-1 and top-5 accuracy metrics.
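A compact sketch of the single-positive multi-label setting using the common "assume negative" BCE baseline; the graph and adversarial components of GE-VerbMLP are omitted.

```python
import torch
import torch.nn.functional as F

def spmll_assume_negative_loss(logits, pos_index):
    """logits: (B, C) verb scores; pos_index: (B,) the single observed positive."""
    targets = torch.zeros_like(logits)
    targets[torch.arange(logits.size(0)), pos_index] = 1.0  # one known positive
    # Unobserved labels are treated as negatives (the standard SPMLL baseline).
    return F.binary_cross_entropy_with_logits(logits, targets)
```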
[122] DriveQA: Passing the Driving Knowledge Test
Maolin Wei, Wanzhou Liu, Eshed Ohn-Bar
Main category: cs.CV
TL;DR: DriveQA is a comprehensive benchmark for testing LLMs and MLLMs on driving knowledge, revealing weaknesses in numerical reasoning and complex scenarios while showing that fine-tuning improves performance on real-world driving tasks.
Details
Motivation: Current autonomous driving benchmarks lack comprehensive testing of traffic rules, signage, and edge cases that human drivers must master to pass driving tests. There's a need to evaluate if LLMs can truly understand complex driving scenarios.
Method: Created DriveQA benchmark covering traffic regulations and scenarios, tested state-of-the-art LLMs and MLLMs, conducted fine-tuning experiments, and evaluated model sensitivity to environmental factors through controlled variations in DriveQA-V.
Result: LLMs perform well on basic traffic rules but struggle with numerical reasoning, complex right-of-way scenarios, traffic sign variations, and spatial layouts. Fine-tuning on DriveQA improves accuracy, particularly in regulatory sign recognition and intersection decision-making.
Conclusion: Pretraining on DriveQA enhances downstream driving task performance on real-world datasets (nuScenes, BDD), demonstrating that models can internalize traffic knowledge and generalize effectively across driving QA tasks.
Abstract: If a Large Language Model (LLM) were to take a driving knowledge test today, would it pass? Beyond standard spatial and visual question-answering (QA) tasks on current autonomous driving benchmarks, driving knowledge tests require a complete understanding of all traffic rules, signage, and right-of-way principles. To pass this test, human drivers must discern various edge cases that rarely appear in real-world datasets. In this work, we present DriveQA, an extensive open-source text and vision-based benchmark that exhaustively covers traffic regulations and scenarios. Through our experiments using DriveQA, we show that (1) state-of-the-art LLMs and Multimodal LLMs (MLLMs) perform well on basic traffic rules but exhibit significant weaknesses in numerical reasoning and complex right-of-way scenarios, traffic sign variations, and spatial layouts, (2) fine-tuning on DriveQA improves accuracy across multiple categories, particularly in regulatory sign recognition and intersection decision-making, (3) controlled variations in DriveQA-V provide insights into model sensitivity to environmental factors such as lighting, perspective, distance, and weather conditions, and (4) pretraining on DriveQA enhances downstream driving task performance, leading to improved results on real-world datasets such as nuScenes and BDD, while also demonstrating that models can internalize text and synthetic traffic knowledge to generalize effectively across downstream QA tasks.
[123] Large Intestine 3D Shape Refinement Using Point Diffusion Models for Digital Phantom Generation
Kaouther Mouheb, Mobina Ghojogh Nejad, Lavsen Dahal, Ehsan Samei, Kyle J. Lafata, W. Paul Segars, Joseph Y. Lo
Main category: cs.CV
TL;DR: CLAP is a conditional latent point-diffusion model that combines geometric deep learning with diffusion models to improve 3D modeling of complex organs like the large intestine, achieving significant accuracy improvements over initial shapes.
Details
Motivation: Accurate 3D modeling of human organs is critical for digital phantoms in virtual imaging trials, but organs like the large intestine remain challenging due to complex geometry and shape variability.
Method: Uses hierarchical variational autoencoder to learn global and local latent shape representations from point clouds, then employs two conditional diffusion models in latent space to refine organ shapes, followed by surface reconstruction to convert refined point clouds into meshes.
Result: Achieves 26% reduction in Chamfer distance and 36% reduction in Hausdorff distance relative to initial suboptimal shapes, demonstrating substantial improvements in shape modeling accuracy.
Conclusion: CLAP provides a robust and extensible solution for high-fidelity organ modeling with potential applicability to a wide range of anatomical structures.
Abstract: Accurate 3D modeling of human organs is critical for constructing digital phantoms in virtual imaging trials. However, organs such as the large intestine remain particularly challenging due to their complex geometry and shape variability. We propose CLAP, a novel Conditional LAtent Point-diffusion model that combines geometric deep learning with denoising diffusion models to enhance 3D representations of the large intestine. Given point clouds sampled from segmentation masks, we employ a hierarchical variational autoencoder to learn both global and local latent shape representations. Two conditional diffusion models operate within this latent space to refine the organ shape. A pretrained surface reconstruction model is then used to convert the refined point clouds into meshes. CLAP achieves substantial improvements in shape modeling accuracy, reducing Chamfer distance by 26% and Hausdorff distance by 36% relative to the initial suboptimal shapes. This approach offers a robust and extensible solution for high-fidelity organ modeling, with potential applicability to a wide range of anatomical structures.
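For concreteness, the two shape metrics reported above can be computed on point clouds as follows; this is a didactic version, not the benchmark code.

```python
import torch

def chamfer_distance(a, b):
    """a: (N, 3), b: (M, 3); symmetric average nearest-neighbour distance."""
    d = torch.cdist(a, b)                      # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def hausdorff_distance(a, b):
    # Worst-case nearest-neighbour distance in either direction.
    d = torch.cdist(a, b)
    return torch.max(d.min(dim=1).values.max(), d.min(dim=0).values.max())
```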
[124] A Low-Cost Real-Time Framework for Industrial Action Recognition Using Foundation Models
Zhicheng Wang, Wensheng Liang, Ruiyan Zhuang, Shuai Li, Jianwei Tan, Xiaoguang Ma
Main category: cs.CV
TL;DR: LRIAR is a low-cost real-time framework for industrial action recognition using foundation models that automates dataset labeling and achieves improved accuracy and generalization with minimal human annotation.
Details
Motivation: Address challenges in industrial action recognition including high deployment costs, poor cross-scenario generalization, and limited real-time performance.
Method: Uses Grounding DINO with BLIP-2 for automatic dataset labeling, trains YOLOv5 for real-time action detection, and fine-tunes Vision Transformer with LoRA for classification.
Result: Extensive real-world experiments show consistent improvements over state-of-the-art methods in recognition accuracy, scenario generalization, and deployment efficiency.
Conclusion: LRIAR framework effectively enhances industrial action recognition with automated labeling, real-time performance, and improved transferability while reducing computational overhead.
Abstract: Action recognition (AR) in industrial environments – particularly for identifying actions and operational gestures – faces persistent challenges due to high deployment costs, poor cross-scenario generalization, and limited real-time performance. To address these issues, we propose a low-cost real-time framework for industrial action recognition using foundation models, denoted as LRIAR, to enhance recognition accuracy and transferability while minimizing human annotation and computational overhead. The proposed framework constructs an automatically labeled dataset by coupling Grounding DINO with the pretrained BLIP-2 image encoder, enabling efficient and scalable action labeling. Leveraging the constructed dataset, we train YOLOv5 for real-time action detection, and a Vision Transformer (ViT) classifier is developed via LoRA-based fine-tuning for action classification. Extensive experiments conducted in real-world industrial settings validate the effectiveness of LRIAR, demonstrating consistent improvements over state-of-the-art methods in recognition accuracy, scenario generalization, and deployment efficiency.
[125] JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Model
Farzaneh Jafari, Stefano Berretti, Anup Basu
Main category: cs.CV
TL;DR: JambaTalk is a hybrid Transformer-Mamba model for 3D face animation that combines both architectures’ advantages to achieve superior lip sync, facial expressions, and motion variety compared to state-of-the-art methods.
Details
Motivation: Current talking head generation models fail to achieve equivalence across all quantitative and qualitative metrics. No single model excels in lip-sync motion, expressive facial expressions, natural head poses, and high-quality video simultaneously.
Method: Introduces Jamba, a hybrid Transformer-Mamba model that combines Transformer and Mamba (Structured State Space Model) architectures. Uses JambaTalk built on Jamba blocks with multimodal integration to enhance motion variety and lip synchronization.
Result: Extensive experiments show the method achieves performance comparable or superior to state-of-the-art models across various metrics.
Conclusion: The hybrid Transformer-Mamba approach provides a comprehensive solution for talking head generation, overcoming limitations of traditional architectures in handling long sequences and achieving balanced performance across multiple quality dimensions.
Abstract: In recent years, talking head generation has become a focal point for researchers. Considerable effort is being made to refine lip-sync motion, capture expressive facial expressions, generate natural head poses, and achieve high-quality video. However, no single model has yet achieved equivalence across all quantitative and qualitative metrics. We introduce Jamba, a hybrid Transformer-Mamba model, to animate a 3D face. Mamba, a pioneering Structured State Space Model (SSM) architecture, was developed to overcome the limitations of conventional Transformer architectures, particularly in handling long sequences, a challenge that has constrained traditional models. Jamba combines the advantages of both the Transformer and Mamba approaches, offering a comprehensive solution. Based on the foundational Jamba block, we present JambaTalk to enhance motion variety and lip sync through multimodal integration. Extensive experiments reveal that our method achieves performance comparable or superior to state-of-the-art models.
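A hedged sketch of the hybrid stacking pattern, interleaving attention blocks with state-space-style blocks; `SSMBlock` below uses a GRU as a stand-in sequence mixer, since the actual Mamba kernel is more involved.

```python
import torch.nn as nn

class SSMBlock(nn.Module):
    """Placeholder for a structured state-space layer (e.g., Mamba)."""
    def __init__(self, dim):
        super().__init__()
        self.mix = nn.GRU(dim, dim, batch_first=True)  # stand-in sequence mixer
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        out, _ = self.mix(self.norm(x))
        return x + out                                  # residual connection

class AttnBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

def hybrid_stack(dim, depth, attn_every=4):
    # One attention block every `attn_every` layers, the rest SSM-style blocks.
    layers = [AttnBlock(dim) if i % attn_every == 0 else SSMBlock(dim)
              for i in range(depth)]
    return nn.Sequential(*layers)
```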
[126] Guiding a diffusion model using sliding windows
Nikolas Adaloglou, Tim Kaiser, Damir Iagudin, Markus Kollmann
Main category: cs.CV
TL;DR: Masked sliding window guidance (M-SWG) is a training-free method that enhances diffusion model sample quality by using the model itself as guidance with restricted receptive field, achieving state-of-the-art results without additional training.
Details
Motivation: Guidance techniques improve diffusion model quality but typically require auxiliary models. The paper aims to develop a training-free guidance method that leverages the primary model's own capabilities by strategically restricting its receptive field.
Method: M-SWG upweights long-range spatial dependencies by guiding the primary diffusion model with itself through selective restriction of its receptive field using a masked sliding window approach. No model weights from previous iterations, additional training, or class conditioning required.
Result: Achieves superior Inception score compared to previous training-free approaches without sample oversaturation. Combined with existing guidance methods, reaches state-of-the-art Frechet DINOv2 distance on ImageNet using EDM2-XXL and DiT-XL models.
Conclusion: M-SWG provides an effective training-free guidance method that demonstrates the benefit of using a model’s own capabilities with strategic receptive field restrictions, achieving competitive performance with state-of-the-art results.
Abstract: Guidance is a widely used technique for diffusion models to enhance sample quality. Technically, guidance is realised by using an auxiliary model that generalises more broadly than the primary model. Using a 2D toy example, we first show that it is highly beneficial when the auxiliary model exhibits similar but stronger generalisation errors than the primary model. Based on this insight, we introduce masked sliding window guidance (M-SWG), a novel, training-free method. M-SWG upweights long-range spatial dependencies by guiding the primary model with itself by selectively restricting its receptive field. M-SWG requires neither access to model weights from previous iterations, additional training, nor class conditioning. M-SWG achieves a superior Inception score (IS) compared to previous state-of-the-art training-free approaches, without introducing sample oversaturation. In conjunction with existing guidance methods, M-SWG reaches state-of-the-art Frechet DINOv2 distance on ImageNet using EDM2-XXL and DiT-XL. The code is available at https://github.com/HHU-MMBS/swg_bmvc2025_official.
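A minimal sketch of the self-guidance combination described above: the weak prediction comes from the same model with a restricted receptive field, and the output extrapolates away from it with guidance weight w. The masking routine is left abstract here and is an assumption, not the paper's exact procedure.

```python
import torch

def masked_sliding_window_guidance(model, x_t, t, restrict_receptive_field, w=1.5):
    """One guided denoising step; `restrict_receptive_field` runs the same
    model but only over masked sliding windows of the input (assumed given)."""
    eps_full = model(x_t, t)                                  # primary prediction
    with torch.no_grad():
        eps_weak = restrict_receptive_field(model, x_t, t)    # windowed/masked pass
    # Standard guidance combination: extrapolate away from the weak prediction,
    # which upweights the long-range structure only the full model captures.
    return eps_weak + w * (eps_full - eps_weak)
```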
[127] Maximising Kidney Glomeruli Segmentation using Minimal Labels via Self-Supervision
Zeeshan Nisar, Thomas Lampert
Main category: cs.CV
TL;DR: Self-supervised pre-training enables histopathology segmentation with only 5% labels while maintaining near-full performance (5.9-6.2% drop vs fully supervised models).
Details
Motivation: Histopathology segmentation requires extensive labeling, which is costly and time-consuming, especially with multiple stainings. Existing methods still need source stain labels, which can be challenging to obtain.
Method: Used self-supervised pre-training (SimCLR, BYOL, and novel HR-CS-CO) before segmentation with UNet and UDA-GAN, reducing label requirements by 95%.
Result: With only 5% labels and self-supervised pre-training, performance drops were minimal: 5.9% for UNet and 6.2% for UDA-GAN compared to fully supervised counterparts.
Conclusion: Self-supervised pre-training effectively reduces label dependency in histopathology segmentation while maintaining performance, and findings generalize to public benchmark datasets.
Abstract: Histopathology, the microscopic examination of tissue samples, is essential for disease diagnosis and prognosis. Accurate segmentation and identification of key regions in histopathology images are crucial for developing automated solutions. However, state-of-the-art deep learning segmentation methods like UNet require extensive labels, which is both costly and time-consuming, particularly when dealing with multiple stainings. To mitigate this, various multi-stain segmentation methods such as UDA-GAN have been developed, which reduce the need for labels by requiring only one (source) stain to be labelled. Nonetheless, obtaining source stain labels can still be challenging, and segmentation models fail when they are unavailable. This article shows that through self-supervised pre-training (including SimCLR, BYOL, and a novel approach, HR-CS-CO) the performance of these segmentation methods (UNet and UDA-GAN) can be retained even with 95% fewer labels. Notably, with self-supervised pre-training and using only 5% labels, the performance drops are minimal: 5.9% for UNet and 6.2% for UDA-GAN, compared to their respective fully supervised counterparts (without pre-training, using 100% labels). Furthermore, these findings are shown to generalise beyond their training distribution to public benchmark datasets.
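As a reference for the SimCLR-style pre-training stage, a compact NT-Xent contrastive loss sketch follows; didactic rather than the paper's code.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """z1, z2: (B, D) projections of two augmented views of the same batch."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2B, D)
    sim = z @ z.t() / tau                                     # cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))                     # drop self-pairs
    # The positive for view i is the other view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```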
[128] CHaRM: Conditioned Heatmap Regression Methodology for Accurate and Fast Dental Landmark Localization
José Rodríguez-Ortega, Francisco Pérez-Hernández, Siham Tabik
Main category: cs.CV
TL;DR: CHaRM is the first end-to-end deep learning method for automatic tooth landmark detection in 3D dental scans that eliminates the need for costly tooth segmentation, achieving state-of-the-art accuracy and 14.8x faster inference.
Details
Motivation: Manual landmark identification in 3D dental models is labor-intensive and requires expertise. Existing machine learning methods still require tooth segmentation, which is costly and complex.
Method: CHaRM integrates a point cloud encoder, decoder with heatmap regression, teeth-presence classification head, and novel CHaR module that adapts to missing teeth. It operates directly on IOS point clouds without segmentation.
Result: CHaRNet (CHaRM with PointMLP) achieved 0.56 mm mean error on standard models and 1.12 mm across all dentition types, with up to 14.8x faster inference compared to state-of-the-art methods.
Conclusion: The end-to-end approach streamlines orthodontic workflows, enhances 3D IOS analysis precision, and enables efficient computer-assisted treatment planning. Dataset and code will be publicly released.
Abstract: Identifying anatomical landmarks in 3D dental models is essential for orthodontic treatment, yet manual placement is labor-intensive and requires expert knowledge. While machine learning methods have been proposed for automatic landmark detection in 3D Intraoral Scans (IOS), none provide a fully end-to-end solution that avoids costly tooth segmentation. We present CHaRM (Conditioned Heatmap Regression Methodology), the first fully end-to-end deep learning approach for tooth landmark detection in 3D IOS. CHaRM integrates four components: a point cloud encoder, a decoder with a heatmap regression head, a teeth-presence classification head, and the novel CHaR module. The CHaR module leverages teeth-presence information to adapt to missing teeth, improving detection accuracy in complex dental cases. Unlike two-stage workflows that segment teeth before landmarking, CHaRM operates directly on IOS point clouds, reducing complexity, avoiding error propagation, and lowering computational cost. We evaluated CHaRM with five point cloud learning backbones on IOSLandmarks-1k, a new dataset of 1,214 annotated 3D dental models. Both the dataset and code will be publicly released to address the scarcity of open data in orthodontics and foster reproducible research. CHaRM with PointMLP, named CHaRNet, achieved the best accuracy and efficiency. Compared to state-of-the-art methods (TSMDL and ALIIOS), CHaRNet reduced mean Euclidean distance error to 0.56 mm on standard dental models and 1.12 mm across all dentition types, while delivering up to 14.8x faster inference on GPU. This end-to-end approach streamlines orthodontic workflows, enhances the precision of 3D IOS analysis, and enables efficient computer-assisted treatment planning.
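A hedged sketch of heatmap-style landmark decoding on a point cloud, using a soft-argmax over per-point scores with a presence mask in the spirit of the teeth-presence head; names and shapes are illustrative, not CHaRM's actual interfaces.

```python
import torch

def decode_landmarks(points, heatmaps, present):
    """points: (N, 3) scan points; heatmaps: (L, N) per-landmark logits;
    present: (L,) bool flags from a presence head (illustrative)."""
    w = torch.softmax(heatmaps, dim=1)       # (L, N) weights over points
    landmarks = w @ points                   # (L, 3) soft-argmax coordinates
    landmarks[~present] = float('nan')       # skip landmarks of missing teeth
    return landmarks
```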
[129] Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration
Katie Z Luo, Minh-Quan Dao, Zhenzhen Liu, Mark Campbell, Wei-Lun Chao, Kilian Q. Weinberger, Ezio Malis, Vincent Fremont, Bharath Hariharan, Mao Shan, Stewart Worrall, Julie Stephany Berrio Perez
Main category: cs.CV
TL;DR: Mixed Signals is a comprehensive V2X dataset with 45.1k point clouds and 240.6k bounding boxes from 3 CAVs and roadside unit, addressing limitations of existing V2X datasets.
Details
Motivation: Existing V2X datasets are limited in scope, diversity, and quality, creating a need for more comprehensive and reliable data for vehicle-to-everything collaborative perception research.
Method: Collected data from three connected autonomous vehicles with different LiDAR configurations plus a roadside unit with dual LiDARs, providing point clouds and bounding box annotations across 10 classes with precise alignment.
Result: Created a ready-to-use dataset with detailed statistical analysis and extensive benchmarking of existing V2X methods, featuring consistent annotations across time and viewpoints.
Conclusion: Mixed Signals provides a high-quality, comprehensive V2X dataset that addresses previous limitations and enables reliable perception training for vehicle-to-everything collaborative systems.
Abstract: Vehicle-to-everything (V2X) collaborative perception has emerged as a promising solution to address the limitations of single-vehicle perception systems. However, existing V2X datasets are limited in scope, diversity, and quality. To address these gaps, we present Mixed Signals, a comprehensive V2X dataset featuring 45.1k point clouds and 240.6k bounding boxes collected from three connected autonomous vehicles (CAVs) equipped with two different configurations of LiDAR sensors, plus a roadside unit with dual LiDARs. Our dataset provides point clouds and bounding box annotations across 10 classes, ensuring reliable data for perception training. We provide detailed statistical analysis on the quality of our dataset and extensively benchmark existing V2X methods on it. The Mixed Signals dataset is ready-to-use, with precise alignment and consistent annotations across time and viewpoints. Dataset website is available at https://mixedsignalsdataset.cs.cornell.edu/.
[130] Convolutional Rectangular Attention Module
Hai-Vy Nguyen, Fabrice Gamboa, Sixin Zhang, Reda Chhaibi, Serge Gratton, Thierry Giaccone
Main category: cs.CV
TL;DR: A novel spatial attention module using rectangular regions instead of position-wise maps, improving stability and generalization in convolutional networks.
Details
Motivation: Conventional spatial attention methods produce irregular boundaries that hamper generalization, so a more structured rectangular approach is needed.
Method: Introduces a spatial attention module that constrains attention regions to rectangles parameterized by only 5 parameters, easily integrated into any convolutional network.
Result: Systematically outperforms position-wise attention methods and provides better stability and generalization to new samples.
Conclusion: Provides a useful spatial attention mechanism that improves performance while offering interpretability by showing where the model focuses in the input.
Abstract: In this paper, we introduce a novel spatial attention module that can be easily integrated into any convolutional network. This module guides the model to pay attention to the most discriminative part of an image, enabling better performance through end-to-end training. In conventional approaches, a spatial attention map is typically generated in a position-wise manner, which often results in irregular boundaries and can hamper generalization to new samples. In our method, the attention region is constrained to be rectangular. This rectangle is parametrized by only 5 parameters, allowing for better stability and generalization to new samples. In our experiments, our method systematically outperforms the position-wise counterpart, providing a novel and useful spatial attention mechanism for convolutional models. Besides, our module also offers interpretability regarding the "where to look" question, as it reveals the part of the input on which the model focuses to produce the prediction.
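A hedged sketch of a rectangle parameterized by 5 values (center, size, edge sharpness), built from products of sigmoids so the mask stays differentiable; in practice the parameters would likely be predicted per image by a small head, and the details here are illustrative.

```python
import torch
import torch.nn as nn

class RectangularAttention(nn.Module):
    """Differentiable soft rectangular attention mask over feature maps."""

    def __init__(self):
        super().__init__()
        # (cx, cy, w, h, sharpness) in normalized [0, 1] image coordinates.
        self.params = nn.Parameter(torch.tensor([0.5, 0.5, 0.5, 0.5, 10.0]))

    def forward(self, feats):
        B, C, H, W = feats.shape
        cx, cy, w, h, k = self.params
        ys = torch.linspace(0, 1, H, device=feats.device).view(1, 1, H, 1)
        xs = torch.linspace(0, 1, W, device=feats.device).view(1, 1, 1, W)
        # Soft indicator of cx - w/2 < x < cx + w/2 (and likewise for y).
        mask_x = torch.sigmoid(k * (xs - (cx - w / 2))) * \
                 torch.sigmoid(k * ((cx + w / 2) - xs))
        mask_y = torch.sigmoid(k * (ys - (cy - h / 2))) * \
                 torch.sigmoid(k * ((cy + h / 2) - ys))
        return feats * (mask_x * mask_y)       # attend inside the rectangle
```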
[131] Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation
Yuqing Wang, Zhijie Lin, Yao Teng, Yuanzhi Zhu, Shuhuai Ren, Jiashi Feng, Xihui Liu
Main category: cs.CV
TL;DR: TokenBridge bridges discrete and continuous token representations for visual generation, maintaining continuous representation quality while enabling simple discrete modeling through post-training quantization and dimension-wise discretization.
Details
Motivation: Address the fundamental dilemma in autoregressive visual generation where discrete tokens suffer from information loss but enable simple modeling, while continuous tokens preserve details but require complex distribution modeling.
Method: Decouples discretization from tokenizer training through post-training quantization that obtains discrete tokens from continuous representations. Uses dimension-wise quantization strategy and lightweight autoregressive prediction mechanism to handle large token space.
Result: Achieves reconstruction and generation quality comparable to continuous methods while using standard categorical prediction, demonstrating effectiveness in bridging discrete and continuous paradigms.
Conclusion: Bridging discrete and continuous token representations effectively harnesses strengths of both approaches, providing promising direction for high-quality visual generation with simple autoregressive modeling.
Abstract: Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. A fundamental dilemma exists in token representation: discrete tokens enable straightforward modeling with standard cross-entropy loss, but suffer from information loss and tokenizer training instability; continuous tokens better preserve visual details, but require complex distribution modeling, complicating the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by maintaining the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. To achieve this, we decouple discretization from the tokenizer training process through post-training quantization that directly obtains discrete tokens from continuous representations. Specifically, we introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism that efficiently models the resulting large token space. Extensive experiments show that our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction. This work demonstrates that bridging discrete and continuous paradigms can effectively harness the strengths of both approaches, providing a promising direction for high-quality visual generation with simple autoregressive modeling. Project page: https://yuqingwang1029.github.io/TokenBridge.
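Decoupling discretization from tokenizer training amounts to quantizing each latent dimension independently after the fact. A minimal NumPy sketch under assumed choices (quantile binning with 64 levels; the paper’s actual scheme may differ):

```python
import numpy as np

def fit_dimension_wise_bins(latents: np.ndarray, n_bins: int = 64):
    """Fit per-dimension quantization from sampled continuous latents of
    shape (N, D): quantile edges keep bins roughly equally populated."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    edges = np.quantile(latents, qs, axis=0).T          # (D, n_bins - 1)
    mids = (np.arange(n_bins) + 0.5) / n_bins
    centers = np.quantile(latents, mids, axis=0).T      # (D, n_bins)
    return edges, centers

def quantize(latents: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Discretize each dimension with its own bin edges -> integer codes."""
    return np.stack([np.digitize(latents[:, d], edges[d])
                     for d in range(latents.shape[1])], axis=1)

def dequantize(codes: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Recover continuous values: out[n, d] = centers[d, codes[n, d]]."""
    return centers[np.arange(centers.shape[0])[None, :], codes]

# Toy check: post-hoc quantization of stand-in latents.
z = np.random.default_rng(0).normal(size=(10_000, 16))
edges, centers = fit_dimension_wise_bins(z)
print(np.abs(z - dequantize(quantize(z, edges), centers)).mean())  # small error
```

Dimension-wise codes keep each prediction step’s vocabulary small (64 here) even though the joint space is 64^D, which is the large token space the paper’s lightweight autoregressive mechanism is designed to handle.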
[132] PicoPose: Progressive Pixel-to-Pixel Correspondence Learning for Novel Object Pose Estimation
Lihua Liu, Jiehong Lin, Zhenxin Liu, Kui Jia
Main category: cs.CV
TL;DR: PicoPose is a three-stage pixel-to-pixel correspondence learning framework for zero-shot RGB-based novel object pose estimation, achieving state-of-the-art performance on BOP benchmark datasets.
Details
Motivation: Zero-shot generalization remains a key challenge for RGB-based novel object pose estimation in robotic applications, requiring methods that can handle unseen objects without retraining.Method: Three-stage process: 1) Feature matching with rendered templates to establish coarse correspondences, 2) Global regression of 2D affine transformation to smooth correspondences, 3) Local correspondence offset learning for fine-grained refinement using the transformed template features.
Result: Achieves state-of-the-art performance on all seven core datasets of the BOP benchmark, demonstrating exceptional generalization to novel objects.
Conclusion: PicoPose’s progressive correspondence refinement through three stages significantly improves pose estimation accuracy via PnP/RANSAC, providing a robust solution for zero-shot novel object pose estimation.
Abstract: RGB-based novel object pose estimation is critical for rapid deployment in robotic applications, yet zero-shot generalization remains a key challenge. In this paper, we introduce PicoPose, a novel framework designed to tackle this task using a three-stage pixel-to-pixel correspondence learning process. Firstly, PicoPose matches features from the RGB observation with those from rendered object templates, identifying the best-matched template and establishing coarse correspondences. Secondly, PicoPose smooths the correspondences by globally regressing a 2D affine transformation, including in-plane rotation, scale, and 2D translation, from the coarse correspondence map. Thirdly, PicoPose applies the affine transformation to the feature map of the best-matched template and learns correspondence offsets within local regions to achieve fine-grained correspondences. By progressively refining the correspondences, PicoPose significantly improves the accuracy of object poses computed via PnP/RANSAC. PicoPose achieves state-of-the-art performance on the seven core datasets of the BOP benchmark, demonstrating exceptional generalization to novel objects. Code and trained models are available at https://github.com/foollh/PicoPose.
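Stage two hands a global 2D affine to stage three, which refines correspondences on the warped template features. A sketch of that warp using PyTorch’s grid sampling (parameter conventions below are assumptions, not the paper’s exact formulation):

```python
import torch
import torch.nn.functional as F

def warp_template_features(feats: torch.Tensor, rot: torch.Tensor,
                           scale: torch.Tensor, trans: torch.Tensor):
    """Apply a regressed 2D affine (in-plane rotation `rot` (B,), scale
    (B,), translation (B, 2) in normalized coords) to template feature
    maps (B, C, H, W) before local offset refinement."""
    b = feats.shape[0]
    cos, sin = torch.cos(rot), torch.sin(rot)
    theta = torch.zeros(b, 2, 3, device=feats.device)
    theta[:, 0, 0], theta[:, 0, 1] = scale * cos, -scale * sin
    theta[:, 1, 0], theta[:, 1, 1] = scale * sin, scale * cos
    theta[:, :, 2] = trans
    grid = F.affine_grid(theta, list(feats.shape), align_corners=False)
    return F.grid_sample(feats, grid, align_corners=False)
```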
[133] A Hybrid Fully Convolutional CNN-Transformer Model for Inherently Interpretable Disease Detection from Retinal Fundus Images
Kerol Djoumessi, Samuel Ofosu Mensah, Philipp Berens
Main category: cs.CV
TL;DR: Interpretable hybrid CNN-Transformer architecture for retinal disease detection that generates faithful evidence maps directly reflecting model decisions, achieving state-of-the-art performance.
Details
Motivation: Hybrid CNN-Transformer models combine local feature extraction and global dependencies but lack interpretability, which is crucial for medical imaging applications where understanding model decisions is essential.Method: Developed an interpretable-by-design hybrid fully convolutional CNN-Transformer architecture that generates class-specific sparse evidence maps in a single forward pass, providing localized evidence that directly reflects the decision process.
Result: Achieves state-of-the-art predictive performance on retinal disease detection tasks using color fundus images, outperforming both black-box and interpretable models while providing faithful evidence maps.
Conclusion: The proposed architecture successfully combines the strengths of CNNs and Transformers while maintaining interpretability, making it suitable for medical imaging applications where both performance and transparency are critical.
Abstract: In many medical imaging tasks, convolutional neural networks (CNNs) efficiently extract local features hierarchically. More recently, vision transformers (ViTs) have gained popularity, using self-attention mechanisms to capture global dependencies, but lacking the inherent spatial localization of convolutions. Therefore, hybrid models combining CNNs and ViTs have been developed to combine the strengths of both architectures. However, such hybrid models are difficult to interpret, which hinders their application in medical imaging. In this work, we introduce an interpretable-by-design hybrid fully convolutional CNN-Transformer architecture for retinal disease detection. Unlike widely used post-hoc saliency methods for ViTs, our approach generates faithful and localized evidence maps that directly reflect the model’s decision process. We evaluated our method on two medical tasks focused on disease detection using color fundus images. Our model achieves state-of-the-art predictive performance compared to black-box and interpretable models and provides class-specific sparse evidence maps in a single forward pass. The code is available at: https://github.com/kdjoumessi/Self-Explainable-CNN-Transformer.
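The “interpretable-by-design” property can be illustrated with a standard construction (a sketch of the general pattern, not necessarily the paper’s exact head): one evidence map per class, with logits obtained by spatially pooling each map, so the maps explain the prediction by construction:

```python
import torch
import torch.nn as nn

class EvidenceHead(nn.Module):
    """Sketch of an interpretable-by-design head: a 1x1 convolution maps
    backbone features to one evidence map per class, and logits are the
    spatial average of each map. Sparsity regularization is omitted."""

    def __init__(self, in_ch: int, n_classes: int):
        super().__init__()
        self.to_evidence = nn.Conv2d(in_ch, n_classes, kernel_size=1)

    def forward(self, feats: torch.Tensor):
        evidence = self.to_evidence(feats)        # (B, n_classes, H, W)
        logits = evidence.mean(dim=(2, 3))        # pooling ties maps to logits
        return logits, evidence
```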
[134] Computer-Aided Design of Personalized Occlusal Positioning Splints Using Multimodal 3D Data
Agnieszka Anna Tomaka, Leszek Luchowski, Michał Tarnawski, Dariusz Pojda
Main category: cs.CV
TL;DR: Computer-aided method for designing occlusal splints using transformation matrices and virtual embossing to achieve precise therapeutic mandibular positioning with high geometric accuracy.
Details
Motivation: To develop a reproducible digital workflow for creating customized occlusal positioning splints with improved geometric accuracy for managing stomatognathic system disorders.Method: Uses transformation matrix to generate 3D splints from virtual patient models reconstructed from intraoral scans, CBCT, 3D facial scans, and digitized plaster models. Introduces virtual embossing to resolve surface conflicts and reproduce occlusal conditions.
Result: Demonstrates feasibility and geometric accuracy of the computer-aided approach at preclinical stage through profile and surface deviation analysis of both designed and printed splints.
Conclusion: The method provides reproducible patient-specific splint fabrication and establishes a transparent foundation for future validation studies, supporting multimodal image registration and occlusal discrepancy quantification.
Abstract: Digital technology plays a crucial role in designing customized medical devices, such as occlusal splints, commonly used in the management of disorders of the stomatognathic system. This methodological proof-of-concept study presents a computer-aided approach for designing and evaluating occlusal positioning splints. The primary aim is to demonstrate the feasibility and geometric accuracy of the proposed method at the preclinical stage. In this approach, a three-dimensional splint is generated using a transformation matrix to represent the therapeutic mandibular position. An experienced operator defines this position using a virtual patient model reconstructed from intraoral scans, CBCT, 3D facial scans, and a digitized plaster model. We introduce a novel method for generating splints that reproduces occlusal conditions in the therapeutic position and resolves surface conflicts through virtual embossing. The process for obtaining transformation matrices using dental tools and intraoral devices commonly employed in dental and laboratory workflows is described, and the geometric accuracy of both designed and printed splints is evaluated using profile and surface deviation analysis. The method supports reproducible, patient-specific splint fabrication and provides a transparent foundation for future validation studies, supporting multimodal image registration and quantification of occlusal discrepancies in research settings.
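The core geometric step, representing the therapeutic mandibular position as a transformation matrix, reduces to applying a 4x4 homogeneous transform to the mandible model. A minimal sketch:

```python
import numpy as np

def apply_therapeutic_transform(vertices: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Move mandible mesh vertices (N, 3) into the therapeutic position:
    lift to homogeneous coordinates and multiply by the 4x4 matrix T that
    encodes the operator-defined rotation and translation."""
    homo = np.hstack([vertices, np.ones((len(vertices), 1))])
    return (homo @ T.T)[:, :3]
```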
[135] Saliency-Guided Training for Fingerprint Presentation Attack Detection
Samuel Webster, Adam Czajka
Main category: cs.CV
TL;DR: First application of saliency-guided training to fingerprint presentation attack detection, showing improved generalization and achieving state-of-the-art results on LivDet-2021 benchmark.
Details
Motivation: To improve fingerprint presentation attack detection by directing model learning to important image regions through saliency guidance, addressing generalization challenges in biometric security.Method: Created 800 human-annotated fingerprint saliency maps from 50 participants, explored algorithmic pseudosaliency methods (minutiae-based, quality-based, autoencoder-based), and evaluated various training configurations across five scenarios.
Result: Achieved first place on LivDet-2021 benchmark, demonstrated effectiveness in both limited and large data contexts, and showed improved model generalization capabilities for fingerprint PAD.
Conclusion: Saliency-guided training is highly effective for fingerprint presentation attack detection, offering promise for enhanced generalization, scalability to larger datasets, and particularly valuable when training data is limited.
Abstract: Saliency-guided training, which directs model learning to important regions of images, has demonstrated generalization improvements across various biometric presentation attack detection (PAD) tasks. This paper presents its first application to fingerprint PAD. We conducted a 50-participant study to create a dataset of 800 human-annotated fingerprint perceptually-important maps, explored alongside algorithmically-generated “pseudosaliency,” including minutiae-based, image quality-based, and autoencoder-based saliency maps. Evaluating on the 2021 Fingerprint Liveness Detection Competition testing set, we explore various configurations within five distinct training scenarios to assess the impact of saliency-guided training on accuracy and generalization. Our findings demonstrate the effectiveness of saliency-guided training for fingerprint PAD in both limited and large data contexts, and we present a configuration capable of earning first place on the LivDet-2021 benchmark. Our results highlight saliency-guided training’s promise for increased model generalization capabilities, its effectiveness when data is limited, and its potential to scale to larger datasets in fingerprint PAD. All collected saliency data and trained models are released with the paper to support reproducible research.
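The summary does not spell out the training objective; one common form of saliency-guided training (a hedged sketch, not necessarily the authors’ formulation) adds a penalty on class-activation mass falling outside the annotated map:

```python
import torch
import torch.nn.functional as F

def saliency_guided_loss(logits, labels, cam, saliency, alpha=0.5):
    """Cross-entropy plus a penalty on activation mass outside the human
    (or pseudo) saliency map. `cam` and `saliency` are (B, H, W) tensors
    in [0, 1]; alpha and the penalty form are assumptions."""
    cls = F.cross_entropy(logits, labels)
    cam = cam / (cam.sum(dim=(1, 2), keepdim=True) + 1e-8)  # unit mass per image
    outside = (cam * (1.0 - saliency)).sum(dim=(1, 2)).mean()
    return cls + alpha * outside
```

The same loss accommodates the paper’s pseudosaliency variants by swapping the `saliency` tensor for a minutiae-, quality-, or autoencoder-derived map.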
[136] DDaTR: Dynamic Difference-aware Temporal Residual Network for Longitudinal Radiology Report Generation
Shanshan Song, Hui Tang, Honglong Yang, Xiaomeng Li
Main category: cs.CV
TL;DR: A novel dynamic difference-aware temporal residual network (DDaTR) is proposed for Longitudinal Radiology Report Generation, addressing limitations in capturing spatial and temporal correlations across medical imaging exams.
Details
Motivation: Existing LRRG methods only extract and concatenate features from prior and current images, failing to effectively capture both spatial and temporal correlations, which leads to inadequate representation of clinical progressions and sub-optimal performance.Method: DDaTR introduces two modules at each visual encoder stage: Dynamic Feature Alignment Module (DFAM) to align prior features across modalities, and dynamic difference-aware module (DDAM) to capture difference information across exams. It uses dynamic residual network for unidirectional transmission of longitudinal information to model temporal correlations.
Result: Extensive experiments demonstrated superior performance over existing methods on three benchmarks, proving efficacy in both Radiology Report Generation (RRG) and Longitudinal Radiology Report Generation (LRRG) tasks.
Conclusion: The proposed DDaTR framework effectively addresses the limitations of previous approaches by capturing multi-level spatial correlations and temporal dependencies, significantly improving performance in automated radiology report generation with longitudinal analysis capabilities.
Abstract: Radiology Report Generation (RRG) automates the creation of radiology reports from medical imaging, enhancing the efficiency of the reporting process. Longitudinal Radiology Report Generation (LRRG) extends RRG by incorporating the ability to compare current and prior exams, facilitating the tracking of temporal changes in clinical findings. Existing LRRG approaches only extract features from prior and current images using a visual pre-trained encoder, which are then concatenated to generate the final report. However, these methods struggle to effectively capture both spatial and temporal correlations during the feature extraction process. Consequently, the extracted features inadequately capture the information of difference across exams and thus underrepresent the expected progressions, leading to sub-optimal performance in LRRG. To address this, we develop a novel dynamic difference-aware temporal residual network (DDaTR). In DDaTR, we introduce two modules at each stage of the visual encoder to capture multi-level spatial correlations. The Dynamic Feature Alignment Module (DFAM) is designed to align prior features across modalities for the integrity of prior clinical information. Prompted by the enriched prior features, the dynamic difference-aware module (DDAM) captures favorable difference information by identifying relationships across exams. Furthermore, our DDaTR employs the dynamic residual network to unidirectionally transmit longitudinal information, effectively modelling temporal correlations. Extensive experiments demonstrated superior performance over existing methods on three benchmarks, proving its efficacy in both RRG and LRRG tasks.
[137] Interpretation of Deep Learning Model in Embryo Selection for In Vitro Fertilization (IVF) Treatment
Radha Kodali, Venkata Rao Dhulipalla, Venkata Siva Kishor Tatavarty, Madhavi Nadakuditi, Bharadwaj Thiruveedhula, Suryanarayana Gunnam, Durga Prasad Bavirisetti, Gogulamudi Pradeep Reddy
Main category: cs.CV
TL;DR: An explainable AI framework combining CNN and LSTM for embryo classification from blastocyst images to improve IVF success rates.
Details
Motivation: Infertility impacts quality of life and is increasing. IVF is a primary solution but embryo selection by embryologists is time-consuming and inefficient. Blastocyst images provide valuable data for assessing embryo viability.Method: Developed an explainable AI (XAI) framework using a fusion of convolutional neural network (CNN) and long short-term memory (LSTM) architecture (CNN-LSTM) for embryo classification from blastocyst images.
Result: The model achieves high accuracy in embryo classification while maintaining interpretability through XAI techniques.
Conclusion: The proposed CNN-LSTM framework provides an efficient and explainable solution for embryo classification in IVF procedures, potentially improving selection accuracy and reducing manual workload for embryologists.
Abstract: Infertility has a considerable impact on individuals’ quality of life, affecting them socially and psychologically, with projections indicating a rise in the upcoming years. In vitro fertilization (IVF) emerges as one of the primary techniques within economically developed nations, employed to address the rising problem of low fertility. Expert embryologists conventionally grade embryos by reviewing blastocyst images to select the most optimal for transfer, yet this process is time-consuming and lacks efficiency. Blastocyst images provide a valuable resource for assessing embryo viability. In this study, we introduce an explainable artificial intelligence (XAI) framework for classifying embryos, employing a fusion of convolutional neural network (CNN) and long short-term memory (LSTM) architecture, referred to as CNN-LSTM. Utilizing deep learning, our model achieves high accuracy in embryo classification while maintaining interpretability through XAI.
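A generic CNN-LSTM fusion of the kind named here encodes each image (or frame) with a CNN and aggregates the embeddings with an LSTM; sizes below are illustrative, not the paper’s architecture:

```python
import torch
import torch.nn as nn

class CNNLSTMClassifier(nn.Module):
    """Generic CNN-LSTM sketch: a small CNN encodes each frame of an
    image sequence, and an LSTM aggregates the per-frame embeddings
    before classification."""

    def __init__(self, n_classes: int, feat_dim: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.fc = nn.Linear(feat_dim, n_classes)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        b, t = seq.shape[:2]                        # (B, T, 3, H, W)
        feats = self.cnn(seq.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.fc(out[:, -1])                  # logits from last step
```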
[138] Single Domain Generalization for Multimodal Cross-Cancer Prognosis via Dirac Rebalancer and Distribution Entanglement
Jia-Xuan Jiang, Jiashuai Liu, Hongtao Wu, Yifeng Wu, Zhong Wang, Qi Bi, Yefeng Zheng
Main category: cs.CV
TL;DR: This paper addresses the poor generalization of multimodal survival prediction models across different cancer types, proposing two novel modules (SDIR and CADE) to improve cross-cancer generalization performance.
Details
Motivation: Existing multimodal prognosis models focus on single cancer types and fail to generalize well across different cancers, despite the clinical need for robust cross-cancer applications. The authors discovered that multimodal models actually generalize worse than unimodal ones in cross-cancer scenarios.Method: Proposed two plug-and-play modules: 1) Sparse Dirac Information Rebalancer (SDIR) - uses Bernoulli-based sparsification and Dirac-inspired stabilization to enhance weaker modality signals and mitigate strong feature dominance; 2) Cancer-aware Distribution Entanglement (CADE) - synthesizes target domain distribution by fusing local morphological cues and global gene expression in latent space.
Result: Experiments on a four-cancer-type benchmark demonstrated superior generalization performance compared to existing methods, showing effective cross-cancer multimodal prognosis capability.
Conclusion: The proposed SDIR and CADE modules successfully address the challenges of degraded features from weaker modalities and ineffective multimodal integration, laying the foundation for practical and robust cross-cancer multimodal prognosis models.
Abstract: Deep learning has shown remarkable performance in integrating multimodal data for survival prediction. However, existing multimodal methods mainly focus on single cancer types and overlook the challenge of generalization across cancers. In this work, we are the first to reveal that multimodal prognosis models often generalize worse than unimodal ones in cross-cancer scenarios, despite the critical need for such robustness in clinical practice. To address this, we propose a new task: Cross-Cancer Single Domain Generalization for Multimodal Prognosis, which evaluates whether models trained on a single cancer type can generalize to unseen cancers. We identify two key challenges: degraded features from weaker modalities and ineffective multimodal integration. To tackle these, we introduce two plug-and-play modules: Sparse Dirac Information Rebalancer (SDIR) and Cancer-aware Distribution Entanglement (CADE). SDIR mitigates the dominance of strong features by applying Bernoulli-based sparsification and Dirac-inspired stabilization to enhance weaker modality signals. CADE, designed to synthesize the target domain distribution, fuses local morphological cues and global gene expression in latent space. Experiments on a four-cancer-type benchmark demonstrate superior generalization, laying the foundation for practical, robust cross-cancer multimodal prognosis. Code is available at https://github.com/HopkinsKwong/MCCSDG
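SDIR is described only at a high level. A loose sketch of the Bernoulli-sparsification idea (the strength measure, drop schedule, and the Dirac-inspired stabilization are unspecified, so everything below is an assumption):

```python
import torch

def bernoulli_sparsify(strong: torch.Tensor, weak: torch.Tensor,
                       p_max: float = 0.5) -> torch.Tensor:
    """Randomly zero entries of the dominant modality during training so
    the weaker modality's signal influences the fused representation.
    Drop probability grows with the norm gap between modalities."""
    gap = (strong.norm(dim=-1, keepdim=True) /
           (weak.norm(dim=-1, keepdim=True) + 1e-8)).clamp(min=1.0)
    p_drop = p_max * (1.0 - 1.0 / gap)           # bigger gap -> more dropping
    keep = torch.bernoulli((1.0 - p_drop).expand_as(strong))
    return strong * keep
```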
[139] InterpIoU: Rethinking Bounding Box Regression with Interpolation-Based IoU Optimization
Haoyuan Liu, Hiroshi Watanabe
Main category: cs.CV
TL;DR: InterpIoU is a novel bounding box regression loss that replaces handcrafted geometric penalties with interpolation-based IoU calculation, providing better gradients and avoiding box enlargement issues, especially for small objects.
Details
Motivation: Existing IoU-based losses with handcrafted geometric penalties are sensitive to box shape, size, and distribution, leading to suboptimal optimization for small objects and undesired behaviors like bounding box enlargement due to misalignment with IoU objectives.Method: Proposes InterpIoU loss function that uses interpolated boxes between predictions and ground truth to calculate IoU, providing meaningful gradients in non-overlapping cases. Also introduces Dynamic InterpIoU that dynamically adjusts interpolation coefficients based on IoU values.
Result: Experiments on COCO, VisDrone, and PASCAL VOC show consistent outperformance of state-of-the-art IoU-based losses across various detection frameworks, with particularly notable improvements in small object detection.
Conclusion: InterpIoU effectively addresses limitations of traditional geometric penalties by using interpolation-based IoU calculation, demonstrating that IoU itself serves as an ideal regression target while handcrafted penalties are unnecessary and suboptimal.
Abstract: Bounding box regression (BBR) is fundamental to object detection, where the regression loss is crucial for accurate localization. Existing IoU-based losses often incorporate handcrafted geometric penalties to address IoU’s non-differentiability in non-overlapping cases and enhance BBR performance. However, these penalties are sensitive to box shape, size, and distribution, often leading to suboptimal optimization for small objects and undesired behaviors such as bounding box enlargement due to misalignment with the IoU objective. To address these limitations, we propose InterpIoU, a novel loss function that replaces handcrafted geometric penalties with a term based on the IoU between interpolated boxes and the target. By using interpolated boxes to bridge the gap between predictions and ground truth, InterpIoU provides meaningful gradients in non-overlapping cases and inherently avoids the box enlargement issue caused by misaligned penalties. Simulation results further show that IoU itself serves as an ideal regression target, while existing geometric penalties are both unnecessary and suboptimal. Building on InterpIoU, we introduce Dynamic InterpIoU, which dynamically adjusts interpolation coefficients based on IoU values, enhancing adaptability to scenarios with diverse object distributions. Experiments on COCO, VisDrone, and PASCAL VOC show that our methods consistently outperform state-of-the-art IoU-based losses across various detection frameworks, with particularly notable improvements in small object detection, confirming their effectiveness.
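Because the abstract states the loss directly, IoU against a box interpolated between prediction and target, a sketch is straightforward (the interpolation direction, the detach, and the dynamic coefficient rule are our assumptions):

```python
import torch

def box_iou(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """IoU of axis-aligned (x1, y1, x2, y2) boxes, both shaped (N, 4)."""
    lt = torch.max(a[:, :2], b[:, :2])
    rb = torch.min(a[:, 2:], b[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_a = (a[:, 2:] - a[:, :2]).clamp(min=0).prod(dim=1)
    area_b = (b[:, 2:] - b[:, :2]).clamp(min=0).prod(dim=1)
    return inter / (area_a + area_b - inter + 1e-7)

def interp_iou_loss(pred: torch.Tensor, target: torch.Tensor, t: float = 0.5):
    """Sketch of InterpIoU: score the prediction against a box
    interpolated between the target and the (detached) prediction, so a
    non-overlapping prediction still receives a useful gradient."""
    interp = (1.0 - t) * target + t * pred.detach()
    return (1.0 - box_iou(pred, interp)).mean()

def dynamic_interp_iou_loss(pred: torch.Tensor, target: torch.Tensor,
                            t_max: float = 0.9):
    """Dynamic variant: interpolate further when overlap is poor and fall
    back toward plain IoU as boxes align (coefficient rule assumed)."""
    with torch.no_grad():
        t = t_max * (1.0 - box_iou(pred, target))[:, None]   # (N, 1)
    interp = (1.0 - t) * target + t * pred.detach()
    return (1.0 - box_iou(pred, interp)).mean()
```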
[140] Mask & Match: Learning to Recognize Handwritten Math with Self-Supervised Attention
Shree Mitra, Ritabrata Chakraborty, Nilkanta Sahu
Main category: cs.CV
TL;DR: Self-supervised learning framework for handwritten math expression recognition using contrastive pretraining and progressive spatial masking attention, achieving state-of-the-art results without expensive labeled data.
Details
Motivation: Handwritten mathematical expression recognition is challenging due to 2D structure, varying symbol scales, and complex spatial relationships. Existing methods require expensive labeled data, so the authors propose a self-supervised approach to eliminate this dependency.Method: Three-stage pipeline: (1) self-supervised pretraining with global/local contrastive loss, (2) novel self-supervised attention network with progressive spatial masking strategy, (3) supervised fine-tuning with transformer decoder for LaTeX generation.
Result: Outperforms existing SSL and fully supervised baselines on CROHME benchmarks, demonstrating effectiveness of the progressive attention mechanism in enhancing HMER performance.
Conclusion: The proposed self-supervised framework successfully eliminates need for expensive labeled data while improving structural understanding through progressive attention learning, achieving state-of-the-art performance in handwritten math expression recognition.
Abstract: Recognizing handwritten mathematical expressions (HMER) is a challenging task due to the inherent two-dimensional structure, varying symbol scales, and complex spatial relationships among symbols. In this paper, we present a self-supervised learning (SSL) framework for HMER that eliminates the need for expensive labeled data. Our approach begins by pretraining an image encoder using a combination of global and local contrastive loss, enabling the model to learn both holistic and fine-grained representations. A key contribution of this work is a novel self-supervised attention network, which is trained using a progressive spatial masking strategy. This attention mechanism is designed to learn semantically meaningful focus regions, such as operators, exponents, and nested mathematical notation, without requiring any supervision. The progressive masking curriculum encourages the network to become increasingly robust to missing or occluded visual information, ultimately improving structural understanding. Our complete pipeline consists of (1) self-supervised pretraining of the encoder, (2) self-supervised attention learning, and (3) supervised fine-tuning with a transformer decoder to generate LaTeX sequences. Extensive experiments on CROHME benchmarks demonstrate that our method outperforms existing SSL and fully supervised baselines, validating the effectiveness of our progressive attention mechanism in enhancing HMER performance. Our codebase can be found here.
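The progressive masking curriculum can be illustrated with a linear ramp on the fraction of hidden patches (patch size, schedule, and maximum ratio are assumptions, not the paper’s values):

```python
import torch

def progressive_mask(images: torch.Tensor, step: int, total_steps: int,
                     patch: int = 16, max_ratio: float = 0.6) -> torch.Tensor:
    """Zero out a growing fraction of image patches as training proceeds,
    so the attention network must rely on increasingly partial evidence."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    ratio = max_ratio * min(1.0, step / max(1, total_steps))   # linear ramp
    keep = torch.rand(b, gh, gw, device=images.device) >= ratio
    mask = keep.repeat_interleave(patch, dim=1).repeat_interleave(patch, dim=2)
    return images * mask[:, None].float()
```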
[141] DocR1: Evidence Page-Guided GRPO for Multi-Page Document Understanding
Junyu Xiong, Yonghui Wang, Weichao Zhao, Chenyu Liu, Bing Yin, Wengang Zhou, Houqiang Li
Main category: cs.CV
TL;DR: DocR1 is an MLLM trained with EviGRPO, a novel RL framework that uses evidence-aware rewards to enable coarse-to-fine reasoning for multi-page document understanding, achieving SOTA performance.
Details
Motivation: Multi-page document understanding requires fine-grained visual comprehension and multi-hop reasoning across pages, which remains challenging for current MLLMs and underexplored in RL applications.Method: Evidence Page-Guided GRPO (EviGRPO) framework with evidence-aware reward mechanism that promotes coarse-to-fine reasoning (retrieve pages first, then generate answers), two-stage annotation pipeline, curriculum learning, and constructed EviBench (4.8k training examples) and ArxivFullQA (8.6k evaluation QA pairs).
Result: DocR1 achieves state-of-the-art performance on multi-page tasks while maintaining strong results on single-page benchmarks across extensive experiments.
Conclusion: The EviGRPO framework enables building high-quality multi-page document understanding models with limited supervision through evidence-guided reinforcement learning.
Abstract: Understanding multi-page documents poses a significant challenge for multimodal large language models (MLLMs), as it requires fine-grained visual comprehension and multi-hop reasoning across pages. While prior work has explored reinforcement learning (RL) for enhancing advanced reasoning in MLLMs, its application to multi-page document understanding remains underexplored. In this paper, we introduce DocR1, an MLLM trained with a novel RL framework, Evidence Page-Guided GRPO (EviGRPO). EviGRPO incorporates an evidence-aware reward mechanism that promotes a coarse-to-fine reasoning strategy, guiding the model to first retrieve relevant pages before generating answers. This training paradigm enables us to build high-quality models with limited supervision. To support this, we design a two-stage annotation pipeline and a curriculum learning strategy, based on which we construct two datasets: EviBench, a high-quality training set with 4.8k examples, and ArxivFullQA, an evaluation benchmark with 8.6k QA pairs based on scientific papers. Extensive experiments across a wide range of benchmarks demonstrate that DocR1 achieves state-of-the-art performance on multi-page tasks, while consistently maintaining strong results on single-page benchmarks.
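The evidence-aware reward can be sketched as answer credit plus partial credit for retrieving the right pages first (the weighting and recall form are assumptions):

```python
def evigrpo_reward(pred_pages, gold_pages, answer_correct, beta=0.5):
    """Sketch of an evidence-aware reward: credit for the final answer
    plus partial credit for retrieving the correct evidence pages."""
    recall = len(set(pred_pages) & set(gold_pages)) / max(1, len(set(gold_pages)))
    return float(answer_correct) + beta * recall
```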
[142] Region-Level Context-Aware Multimodal Understanding
Hongliang Wei, Xianqi Zhang, Xingtao Wang, Xiaopeng Fan, Debin Zhao
Main category: cs.CV
TL;DR: Proposes Region-level Context-aware Multimodal Understanding (RCMU) to integrate textual context with visual objects, introduces RCVIT training method, RCMU dataset, RC&P-Bench benchmark, and achieves state-of-the-art performance with RC-Qwen2-VL models.
Details
Motivation: Existing MLLMs focus on general visual understanding but lack the ability to integrate textual context associated with specific objects/regions for context-aware multimodal understanding.Method: Proposed Region-level Context-aware Visual Instruction Tuning (RCVIT) that incorporates object information and bounding box coordinates to associate visual content with textual information. Created RCMU dataset and RC&P-Bench benchmark for training and evaluation.
Result: RC-Qwen2-VL models achieve outstanding performance on multiple RCMU tasks and demonstrate successful applications in multimodal RAG and personalized conversation.
Conclusion: The proposed RCMU framework effectively enables MLLMs to integrate textual context with visual objects, advancing region-level context-aware multimodal understanding capabilities.
Abstract: Despite significant progress, existing research on Multimodal Large Language Models (MLLMs) mainly focuses on general visual understanding, overlooking the ability to integrate textual context associated with objects for a more context-aware multimodal understanding – an ability we refer to as Region-level Context-aware Multimodal Understanding (RCMU). To address this limitation, we first formulate the RCMU task, which requires models to respond to user instructions by integrating both image content and textual information of regions or objects. To equip MLLMs with RCMU capabilities, we propose Region-level Context-aware Visual Instruction Tuning (RCVIT), which incorporates object information into the model input and enables the model to utilize bounding box coordinates to effectively associate objects’ visual content with their textual information. To address the lack of datasets, we introduce the RCMU dataset, a large-scale visual instruction tuning dataset that covers multiple RCMU tasks. We also propose RC&P-Bench, a comprehensive benchmark that can evaluate the performance of MLLMs in RCMU and multimodal personalized understanding tasks. Additionally, we propose a reference-free evaluation metric to perform a comprehensive and fine-grained evaluation of the region-level context-aware image descriptions. By performing RCVIT on Qwen2-VL models with the RCMU dataset, we developed RC-Qwen2-VL models. Experimental results indicate that RC-Qwen2-VL models not only achieve outstanding performance on multiple RCMU tasks but also demonstrate successful applications in multimodal RAG and personalized conversation. Our data, model and benchmark are available at https://github.com/hongliang-wei/RC-MLLM
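Associating objects’ textual information with bounding box coordinates suggests a serialized region-aware prompt; the format below is purely illustrative, not the actual RCVIT input format:

```python
def build_rcmu_prompt(instruction: str, regions: list[dict]) -> str:
    """Sketch of region-level context injection: each object's textual
    information is tied to its bounding box so the model can associate
    visual content with text."""
    lines = [f"<region box=({r['x1']},{r['y1']},{r['x2']},{r['y2']})> "
             f"{r['name']}: {r['context']}" for r in regions]
    return "\n".join(lines) + f"\n\nInstruction: {instruction}"
```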
[143] PlantVillageVQA: A Visual Question Answering Dataset for Benchmarking Vision-Language Models in Plant Science
Syed Nazmus Sakib, Nafiul Haque, Mohammad Zabed Hossain, Shifat E. Arman
Main category: cs.CV
TL;DR: PlantVillageVQA is a large-scale visual question answering dataset for agricultural applications, containing 193,609 QA pairs over 55,448 images covering 14 crop species and 38 diseases, with expert-verified questions organized by cognitive complexity.
Details
Motivation: To advance vision-language models for agricultural decision-making and analysis by providing a standardized, expert-verified dataset for plant disease identification and diagnostic accuracy improvement.Method: Created through a two-stage pipeline: (1) template-based QA synthesis from image metadata and (2) multi-stage linguistic re-engineering, followed by iterative expert review for scientific accuracy and relevancy.
Result: A comprehensive dataset with 193,609 high-quality QA pairs organized into 3 cognitive complexity levels and 9 distinct categories, evaluated using three state-of-the-art models for quality assessment.
Conclusion: The dataset provides a publicly available, standardized resource to enhance plant disease diagnostic accuracy and advance agricultural research, with plans for open-sourcing on HuggingFace.
Abstract: PlantVillageVQA is a large-scale visual question answering (VQA) dataset derived from the widely used PlantVillage image corpus. It was designed to advance the development and evaluation of vision-language models for agricultural decision-making and analysis. The PlantVillageVQA dataset comprises 193,609 high-quality question-answer (QA) pairs grounded over 55,448 images spanning 14 crop species and 38 disease conditions. Questions are organised into 3 levels of cognitive complexity and 9 distinct categories. Each question category was phrased manually following expert guidance and generated via an automated two-stage pipeline: (1) template-based QA synthesis from image metadata and (2) multi-stage linguistic re-engineering. The dataset was iteratively reviewed by domain experts for scientific accuracy and relevancy. The final dataset was evaluated using three state-of-the-art models for quality assessment. Our objective remains to provide a publicly available, standardised and expert-verified database to enhance diagnostic accuracy for plant disease identifications and advance scientific research in the agricultural domain. Our dataset will be open-sourced at https://huggingface.co/datasets/SyedNazmusSakib/PlantVillageVQA.
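Stage (1), template-based QA synthesis from image metadata, can be sketched as follows; the field names and templates are illustrative, not the dataset’s actual ones:

```python
# Templates expand one image's metadata record into QA pairs; stage (2)
# in the paper then rewrites these linguistically.
TEMPLATES = [
    ("What crop species is shown in this image?", "{species}"),
    ("Is this leaf healthy or diseased?", "{health}"),
    ("What disease, if any, affects this {species} leaf?", "{disease}"),
]

def synthesize_qa(record: dict) -> list[dict]:
    pairs = []
    for question, answer in TEMPLATES:
        pairs.append({
            "image": record["image_path"],
            "question": question.format(**record),
            "answer": answer.format(**record),
        })
    return pairs

example = {"image_path": "apple_scab_001.jpg", "species": "apple",
           "health": "diseased", "disease": "apple scab"}
print(synthesize_qa(example)[2])
```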
[144] CE-RS-SBCIT A Novel Channel Enhanced Hybrid CNN Transformer with Residual, Spatial, and Boundary-Aware Learning for Brain Tumor MRI Analysis
Mirza Mumtaz Zahoor, Saddam Hussain Khan
Main category: cs.CV
TL;DR: Novel CE-RS-SBCIT hybrid framework combines CNNs and Transformers for brain tumor classification from MRI, achieving 98.30% accuracy by addressing computational cost, contrast sensitivity, and structural heterogeneity challenges.
Details
Motivation: Brain tumors are lethal diseases requiring early detection and accurate classification. Current deep learning approaches (CNNs and Transformers) face challenges with computational cost, sensitivity to minor contrast variations, and structural/texture inconsistencies in MRI data.Method: Hybrid framework integrating residual/spatial learning CNNs with transformer modules. Four innovations: 1) Smoothing and Boundary-based CNN-Integrated Transformer (SBCIT), 2) Tailored residual and spatial learning CNNs, 3) Channel Enhancement strategy, 4) Novel spatial attention mechanism. Uses stem convolution, contextual interaction transformer blocks with smoothing/boundary operations.
Result: Extensive evaluation on challenging MRI datasets (Kaggle and Figshare) encompassing glioma, meningioma, pituitary tumors, and healthy controls. Achieved 98.30% accuracy, 98.08% sensitivity, 98.25% F1-score, and 98.43% precision.
Conclusion: The CE-RS-SBCIT framework demonstrates superior performance in brain tumor classification by effectively combining local fine-grained and global contextual cues through its innovative hybrid architecture, addressing key limitations of conventional deep learning approaches.
Abstract: Brain tumors remain among the most lethal human diseases, where early detection and accurate classification are critical for effective diagnosis and treatment planning. Although deep learning-based computer-aided diagnostic (CADx) systems have shown remarkable progress, conventional convolutional neural networks (CNNs) and Transformers face persistent challenges, including high computational cost, sensitivity to minor contrast variations, structural heterogeneity, and texture inconsistencies in MRI data. Therefore, a novel hybrid framework, CE-RS-SBCIT, is introduced, integrating residual and spatial learning-based CNNs with transformer-driven modules. The proposed framework exploits local fine-grained and global contextual cues through four core innovations: (i) a smoothing and boundary-based CNN-integrated Transformer (SBCIT), (ii) tailored residual and spatial learning CNNs, (iii) a channel enhancement (CE) strategy, and (iv) a novel spatial attention mechanism. The developed SBCIT employs stem convolution and contextual interaction transformer blocks with systematic smoothing and boundary operations, enabling efficient global feature modeling. Moreover, residual and spatial CNNs, enhanced by auxiliary transfer-learned feature maps, enrich the representation space, while the CE module amplifies discriminative channels and mitigates redundancy. Furthermore, the spatial attention mechanism selectively emphasizes subtle contrast and textural variations across tumor classes. Extensive evaluation on challenging MRI datasets from Kaggle and Figshare, encompassing glioma, meningioma, pituitary tumors, and healthy controls, demonstrates superior performance, achieving 98.30% accuracy, 98.08% sensitivity, 98.25% F1-score, and 98.43% precision.
[145] Adaptive Visual Navigation Assistant in 3D RPGs
Kaijie Xu, Clark Verbrugge
Main category: cs.CV
TL;DR: This paper introduces a new task of detecting Spatial Transition Points (STPs) and Main STPs (MSTPs) from single game frames to help AI navigation and level design evaluation in 3D games.
Details
Motivation: Players need visual cues to find map transitions in complex 3D games. Automating this detection is important for client-side auto-mapping and provides objective evaluation of map design cues.Method: Two-stage deep-learning pipeline: 1) Faster R-CNN for STP detection, 2) lightweight MSTP selector fusing local and global features with parameter-efficient adapters. Optional retrieval-augmented fusion step.
Result: Experiments show full-network fine-tuning works best for STP detection with sufficient data, but adapter-only transfer is more robust in low-data scenarios and better for MSTP selection. Validated on custom dataset from 5 Action RPG titles.
Conclusion: The paper establishes feasibility of STP/MSTP detection, provides baseline pipeline and dataset, and offers insights into efficient model adaptation for future AI navigation aids and level-design tools.
Abstract: In complex 3D game environments, players rely on visual affordances to spot map transition points. Efficient identification of such points is important to client-side auto-mapping and provides an objective basis for evaluating map cue presentation. In this work, we formalize the task of detecting traversable Spatial Transition Points (STPs), i.e., connectors between two sub-regions, and selecting the singular Main STP (MSTP), the unique STP that lies on the designer-intended critical path toward the player’s current macro-objective, from a single game frame, proposing this as a new research focus. We introduce a two-stage deep-learning pipeline that first detects potential STPs using Faster R-CNN and then ranks them with a lightweight MSTP selector that fuses local and global visual features. Both stages benefit from parameter-efficient adapters, and we further introduce an optional retrieval-augmented fusion step. Our primary goal is to establish the feasibility of this problem and set baseline performance metrics. We validate our approach on a custom-built, diverse dataset collected from five Action RPG titles. Our experiments reveal a key trade-off: while full-network fine-tuning produces superior STP detection with sufficient data, adapter-only transfer is significantly more robust and effective in low-data scenarios and for the MSTP selection task. By defining this novel problem, providing a baseline pipeline and dataset, and offering initial insights into efficient model adaptation, we aim to contribute to future AI-driven navigation aids and data-informed level-design tools.
[146] Video-LevelGauge: Investigating Contextual Positional Bias in Large Video Language Models
Hou Xia, Zheren Fu, Fangcan Ling, Jiajun Li, Yi Tu, Zhendong Mao, Yongdong Zhang
Main category: cs.CV
TL;DR: Video-LevelGauge benchmark reveals significant positional biases in large video language models, with open-source models showing head/neighbor preferences while commercial models like Gemini2.5-Pro perform consistently across video sequences.
Details
Motivation: Existing video understanding benchmarks assess overall performance but overlook nuanced positional biases, which is critical for evaluating LVLM performance in real-world scenarios.Method: Created Video-LevelGauge benchmark with 438 curated videos, 1,177 multiple-choice and 120 open-ended questions using standardized probes and contextual setups with flexible control over context length, position, and types. Employed statistical measures and morphological pattern recognition for bias analysis.
Result: Evaluation of 27 state-of-the-art LVLMs revealed significant positional biases in many open-source models (head/neighbor preferences), while commercial models like Gemini2.5-Pro showed consistent performance across entire video sequences.
Conclusion: The benchmark provides actionable insights for mitigating bias and guiding model enhancement, highlighting the importance of systematic positional bias assessment in video language models.
Abstract: Large video language models (LVLMs) have made notable progress in video understanding, spurring the development of corresponding evaluation benchmarks. However, existing benchmarks generally assess overall performance across entire video sequences, overlooking nuanced behaviors such as contextual positional bias, a critical yet under-explored aspect of LVLM performance. We present Video-LevelGauge, a dedicated benchmark designed to systematically assess positional bias in LVLMs. We employ standardized probes and customized contextual setups, allowing flexible control over context length, probe position, and contextual types to simulate diverse real-world scenarios. In addition, we introduce a comprehensive analysis method that combines statistical measures with morphological pattern recognition to characterize bias. Our benchmark comprises 438 manually curated videos spanning multiple types, yielding 1,177 high-quality multiple-choice questions and 120 open-ended questions, validated for their effectiveness in exposing positional bias. Based on these, we evaluate 27 state-of-the-art LVLMs, including both commercial and open-source models. Our findings reveal significant positional biases in many leading open-source models, typically exhibiting head or neighbor-content preferences. In contrast, commercial models such as Gemini2.5-Pro show impressive, consistent performance across entire video sequences. Further analyses on context length, context variation, and model scale provide actionable insights for mitigating bias and guiding model enhancement. Code is available at https://github.com/Cola-any/Video-LevelGauge
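A positional-bias profile of the kind the benchmark measures can be computed by binning outcomes by where the probed evidence sits in the video; the binning and toy data below are illustrative assumptions:

```python
import numpy as np

def positional_bias_profile(results: list[tuple[float, bool]], n_bins: int = 10):
    """Bin question outcomes by the relative position of the probed
    evidence (0 = start of video, 1 = end) and report per-bin accuracy.
    A flat profile indicates no positional bias; a peak near 0 suggests
    a head preference."""
    positions = np.array([p for p, _ in results])
    correct = np.array([c for _, c in results], dtype=float)
    bins = np.minimum((positions * n_bins).astype(int), n_bins - 1)
    return np.array([correct[bins == b].mean() if np.any(bins == b) else np.nan
                     for b in range(n_bins)])

# Toy example: a model that favors evidence near the start of the video.
rng = np.random.default_rng(1)
pos = rng.uniform(size=2000)
acc = positional_bias_profile([(p, rng.uniform() < (0.9 - 0.4 * p)) for p in pos])
print(np.round(acc, 2))   # accuracy declines from head to tail
```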
[147] BuzzSet v1.0: A Dataset for Pollinator Detection in Field Conditions
Ahmed Emam, Mohamed Elbassiouny, Julius Miller, Patrick Donworth, Sabine Seidel, Ribana Roscher
Main category: cs.CV
TL;DR: BuzzSet v1.0 is a large-scale dataset of 7,856 high-resolution pollinator images with over 8,000 annotated instances across honeybees, bumblebees, and unidentified insects, collected under real field conditions to address automated pollinator monitoring challenges.
Details
Motivation: Pollinator populations are declining due to environmental stressors, but scalable automated monitoring remains challenging due to difficulties detecting small, fast-moving, and camouflaged insects in agricultural environments.Method: Created BuzzSet dataset with images preprocessed into 256x256 tiles, used YOLOv12 model for initial annotations refined through human verification, and provided baselines using RF-DETR transformer-based object detector.
Result: Strong classification accuracy with F1 scores of 0.94 for honeybees and 0.92 for bumblebees, minimal confusion between categories, and overall mAP of 0.559 demonstrating the dataset’s challenging nature.
Conclusion: BuzzSet establishes a benchmark for ecological computer vision, highlighting the primary challenge of detecting camouflaged insects in natural vegetation as an open problem for future research, with plans to expand to version 2.0.
Abstract: Pollinator insects such as honeybees and bumblebees are vital to global food production and ecosystem stability, yet their populations are declining due to anthropogenic and environmental stressors. Scalable, automated monitoring in agricultural environments remains an open challenge due to the difficulty of detecting small, fast-moving, and often camouflaged insects. To address this, we present BuzzSet v1.0, a large-scale dataset of high-resolution pollinator images collected under real field conditions. BuzzSet contains 7,856 manually verified images with more than 8,000 annotated instances across three classes: honeybees, bumblebees, and unidentified insects. Initial annotations were produced using a YOLOv12 model trained on external data and refined through human verification with open-source tools. All images were preprocessed into 256 x 256 tiles to improve the detection of small insects. We provide baselines using the RF-DETR transformer-based object detector. The model achieves strong classification accuracy with F1 scores of 0.94 and 0.92 for honeybees and bumblebees, with minimal confusion between these categories. The unidentified class remains more difficult due to label ambiguity and fewer samples, yet still contributes insights for robustness evaluation. Overall detection performance (mAP at 0.50 of 0.559) illustrates the challenging nature of the dataset and its potential to drive advances in small object detection under realistic ecological conditions. Future work focuses on expanding the dataset to version 2.0 with additional annotations and evaluating further detection strategies. BuzzSet establishes a benchmark for ecological computer vision, with the primary challenge being reliable detection of insects frequently camouflaged within natural vegetation, highlighting an open problem for future research.
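The 256 x 256 tiling step is a standard preprocessing move for small-object detection. A simplified sketch (non-overlapping tiles, no box clipping at tile borders):

```python
from PIL import Image

def tile_image(path: str, tile: int = 256):
    """Split an image into non-overlapping 256x256 tiles so small insects
    occupy a larger fraction of each training sample. Edge remainders and
    overlap handling are omitted for brevity."""
    img = Image.open(path)
    w, h = img.size
    tiles = []
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            tiles.append(((left, top),
                          img.crop((left, top, left + tile, top + tile))))
    return tiles  # [((x_offset, y_offset), tile_image), ...]

def shift_box(box, offset):
    """Remap a full-image box (x1, y1, x2, y2) into tile coordinates."""
    (x1, y1, x2, y2), (ox, oy) = box, offset
    return (x1 - ox, y1 - oy, x2 - ox, y2 - oy)
```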
[148] PointDGRWKV: Generalizing RWKV-like Architecture to Unseen Domains for Point Cloud Classification
Hao Yang, Qianyu Zhou, Haijia Sun, Xiangtai Li, Xuequan Lu, Lizhuang Ma, Shuicheng Yan
Main category: cs.CV
TL;DR: PointDGRWKV is the first RWKV-based framework for Domain Generalization in Point Cloud Classification, addressing spatial distortion and attention drift issues through adaptive geometric token shift and cross-domain key distribution alignment.
Details
Motivation: Existing DG PCC methods using convolutional networks, Transformers, or Mamba architectures suffer from limited receptive fields, high computational cost, or insufficient long-range dependency modeling. RWKV offers linear complexity and global receptive fields but faces challenges when directly applied to unstructured point clouds.Method: Proposes PointDGRWKV with two key modules: 1) Adaptive Geometric Token Shift to model local neighborhood structures and improve geometric context awareness, and 2) Cross-Domain key feature Distribution Alignment to mitigate attention drift by aligning key feature distributions across domains.
Result: Extensive experiments on multiple benchmarks demonstrate that PointDGRWKV achieves state-of-the-art performance on Domain Generalization for Point Cloud Classification.
Conclusion: The proposed framework successfully adapts RWKV architecture for DG PCC by addressing spatial distortion and attention drift issues, maintaining RWKV’s linear efficiency while achieving superior generalization performance across domains.
Abstract: Domain Generalization (DG) has been recently explored to enhance the generalizability of Point Cloud Classification (PCC) models toward unseen domains. Prior works are based on convolutional networks, Transformer or Mamba architectures, either suffering from limited receptive fields or high computational cost, or insufficient long-range dependency modeling. RWKV, as an emerging architecture, possesses superior linear complexity, global receptive fields, and long-range dependency. In this paper, we present the first work that studies the generalizability of RWKV models in DG PCC. We find that directly applying RWKV to DG PCC encounters two significant challenges: RWKV’s fixed direction token shift methods, like Q-Shift, introduce spatial distortions when applied to unstructured point clouds, weakening local geometric modeling and reducing robustness. In addition, the Bi-WKV attention in RWKV amplifies slight cross-domain differences in key distributions through exponential weighting, leading to attention shifts and degraded generalization. To this end, we propose PointDGRWKV, the first RWKV-based framework tailored for DG PCC. It introduces two key modules to enhance spatial modeling and cross-domain robustness, while maintaining RWKV’s linear efficiency. In particular, we present Adaptive Geometric Token Shift to model local neighborhood structures to improve geometric context awareness. In addition, Cross-Domain key feature Distribution Alignment is designed to mitigate attention drift by aligning key feature distributions across domains. Extensive experiments on multiple benchmarks demonstrate that PointDGRWKV achieves state-of-the-art performance on DG PCC.
cs.AI
[149] Fuzzy, Symbolic, and Contextual: Enhancing LLM Instruction via Cognitive Scaffolding
Vanessa Figueiredo
Main category: cs.AI
TL;DR: Study shows architectural scaffolds with symbolic reasoning and memory improve LLM instructional dialogue performance in Socratic tutoring, outperforming baseline variants.
Details
Motivation: To understand how architectural inductive biases influence cognitive behavior of LLMs in instructional dialogue and promote adaptive, structured reasoning.Method: Introduces symbolic scaffolding with short-term memory schema, uses controlled ablation across five system variants, evaluates with expert-designed rubrics covering scaffolding, responsiveness, symbolic reasoning, and conversational memory.
Result: Full system consistently outperforms baseline variants; removing memory or symbolic structure degrades key cognitive behaviors including abstraction, adaptive probing, and conceptual continuity.
Conclusion: Architectural scaffolds can reliably shape emergent instructional strategies in LLMs, supporting a processing-level account of cognitive behavior influence.
Abstract: We study how architectural inductive biases influence the cognitive behavior of large language models (LLMs) in instructional dialogue. We introduce a symbolic scaffolding mechanism paired with a short-term memory schema designed to promote adaptive, structured reasoning in Socratic tutoring. Using controlled ablation across five system variants, we evaluate model outputs via expert-designed rubrics covering scaffolding, responsiveness, symbolic reasoning, and conversational memory. We present preliminary results using an LLM-based evaluation framework aligned to a cognitively grounded rubric. This enables scalable, systematic comparisons across architectural variants in early-stage experimentation. The preliminary results show that our full system consistently outperforms baseline variants. Analysis reveals that removing memory or symbolic structure degrades key cognitive behaviors, including abstraction, adaptive probing, and conceptual continuity. These findings support a processing-level account in which architectural scaffolds can reliably shape emergent instructional strategies in LLMs.
[150] Automated Clinical Problem Detection from SOAP Notes using a Collaborative Multi-Agent LLM Architecture
Yeawon Lee, Xiaoyang Wang, Christopher C. Yang
Main category: cs.AI
TL;DR: A multi-agent system that simulates clinical team reasoning improves accuracy in identifying clinical problems from SOAP notes compared to single LLM approaches.
Details
Motivation: Clinical narrative interpretation is critical but challenging to automate. Single LLM approaches lack robustness for high-stakes clinical tasks, requiring more reliable systems.Method: A collaborative multi-agent system with Manager agent orchestrating specialist agents in hierarchical, iterative debates to reach consensus on clinical problem identification from SOAP note S and O sections.
Result: The dynamic multi-agent configuration showed consistently improved performance in identifying congestive heart failure, acute kidney injury, and sepsis on 420 MIMIC-III notes compared to single-agent baseline.
Conclusion: Modeling clinical team reasoning through multi-agent systems offers a promising path toward more accurate, robust, and interpretable clinical decision support tools, though occasional groupthink susceptibility was noted.
Abstract: Accurate interpretation of clinical narratives is critical for patient care, but the complexity of these notes makes automation challenging. While Large Language Models (LLMs) show promise, single-model approaches can lack the robustness required for high-stakes clinical tasks. We introduce a collaborative multi-agent system (MAS) that models a clinical consultation team to address this gap. The system is tasked with identifying clinical problems by analyzing only the Subjective (S) and Objective (O) sections of SOAP notes, simulating the diagnostic reasoning process of synthesizing raw data into an assessment. A Manager agent orchestrates a dynamically assigned team of specialist agents who engage in a hierarchical, iterative debate to reach a consensus. We evaluated our MAS against a single-agent baseline on a curated dataset of 420 MIMIC-III notes. The dynamic multi-agent configuration demonstrated consistently improved performance in identifying congestive heart failure, acute kidney injury, and sepsis. Qualitative analysis of the agent debates reveals that this structure effectively surfaces and weighs conflicting evidence, though it can occasionally be susceptible to groupthink. By modeling a clinical team’s reasoning process, our system offers a promising path toward more accurate, robust, and interpretable clinical decision support tools.
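A loose illustration of the manager-orchestrated debate, where `llm(role, prompt) -> str` stands in for any chat-model call and the roles, prompts, and consensus rule are our assumptions rather than the paper’s configuration:

```python
def identify_problems(subjective: str, objective: str, llm, max_rounds: int = 3):
    """Manager agent recruits specialists, who debate iteratively over the
    S and O sections of a SOAP note until the manager declares consensus."""
    note = f"Subjective:\n{subjective}\n\nObjective:\n{objective}"
    roles = llm("manager", f"List specialist roles (one per line) to assess:\n{note}")
    opinions: dict[str, str] = {}
    for _ in range(max_rounds):
        for role in roles.splitlines():
            peers = "\n".join(f"{r}: {o}" for r, o in opinions.items() if r != role)
            opinions[role] = llm(role, f"{note}\n\nPeer views:\n{peers}\n"
                                 "Which clinical problems do you identify, and why?")
        verdict = llm("manager", "Given these assessments, reply "
                      "'CONSENSUS: <problem list>' or 'CONTINUE'.\n"
                      + "\n".join(f"{r}: {o}" for r, o in opinions.items()))
        if verdict.startswith("CONSENSUS"):
            return verdict.removeprefix("CONSENSUS:").strip()
    return llm("manager", "No consensus; state the majority problem list.\n"
               + "\n".join(opinions.values()))
```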
[151] Addressing accuracy and hallucination of LLMs in Alzheimer’s disease research through knowledge graphs
Tingxuan Xu, Jiarui Feng, Justin Melendez, Kaleigh Roberts, Donghong Cai, Mingfang Zhu, Donald Elbert, Yixin Chen, Randall J. Bateman
Main category: cs.AI
TL;DR: Evaluation of GraphRAG systems for Alzheimer’s disease research, comparing response quality and traceability against standard GPT-4o using expert-curated questions.
Details
Motivation: LLM-based chatbots face limitations in scientific research due to hallucinations, limited domain knowledge, and lack of explainability, particularly in knowledge-intensive domains like Alzheimer's disease.Method: Compiled 50 papers and 70 expert questions on Alzheimer’s disease, constructed GraphRAG knowledge base, used GPT-4o as LLM, and compared GraphRAG responses with standard GPT-4o outputs while evaluating traceability.
Result: Assessment of quality and traceability of GraphRAG systems compared to standard LLM approach, with development of an easy-to-use interface for researchers.
Conclusion: GraphRAG shows promise for improving reliability in domain-specific scientific applications like Alzheimer’s research, addressing key limitations of standard LLMs through enhanced contextual integration and traceability.
Abstract: In the past two years, large language model (LLM)-based chatbots, such as ChatGPT, have revolutionized various domains by enabling diverse task completion and question-answering capabilities. However, their application in scientific research remains constrained by challenges such as hallucinations, limited domain-specific knowledge, and lack of explainability or traceability for the response. Graph-based Retrieval-Augmented Generation (GraphRAG) has emerged as a promising approach to improving chatbot reliability by integrating domain-specific contextual information before response generation, addressing some limitations of standard LLMs. Despite its potential, there are only limited studies that evaluate GraphRAG on specific domains that require intensive knowledge, like Alzheimer’s disease or other biomedical domains. In this paper, we assess the quality and traceability of two popular GraphRAG systems. We compile a database of 50 papers and 70 expert questions related to Alzheimer’s disease, construct a GraphRAG knowledge base, and employ GPT-4o as the LLM for answering queries. We then compare the quality of responses generated by GraphRAG with those from a standard GPT-4o model. Additionally, we discuss and evaluate the traceability of several Retrieval-Augmented Generation (RAG) and GraphRAG systems. Finally, we provide an easy-to-use interface with a pre-built Alzheimer’s disease database for researchers to test the performance of both standard RAG and GraphRAG.
[152] MultiFluxAI Enhancing Platform Engineering with Advanced Agent-Orchestrated Retrieval Systems
Sri Ram Macharla, Sridhar Murthy J, Anjaneyulu Pasala
Main category: cs.AI
TL;DR: MultiFluxAI is an AI platform that integrates diverse data sources for product engineering using Generative AI, vectorization, and agentic orchestration to handle complex queries and enhance user engagement.
Details
Motivation: To address challenges in managing and integrating vast, disparate data sources across product engineering domains and improve user engagement through better query handling.
Method: Leverages advanced AI techniques including Generative AI, vectorization, and agentic orchestration to provide dynamic, context-aware responses to complex user queries.
Result: Developed an innovative AI platform capable of handling both current and new service-related queries in digital ecosystems.
Conclusion: MultiFluxAI successfully addresses data integration challenges in product engineering and enhances user engagement through advanced AI-driven query processing.
Abstract: MultiFluxAI is an innovative AI platform developed to address the challenges of managing and integrating vast, disparate data sources in product engineering across application domains. It addresses both current and new service-related queries that enhance user engagement in the digital ecosystem. This platform leverages advanced AI techniques, such as Generative AI, vectorization, and agentic orchestration to provide dynamic and context-aware responses to complex user queries.
[153] Multi-Ontology Integration with Dual-Axis Propagation for Medical Concept Representation
Mohsen Nayebi Kerdabadi, Arya Hadizadeh Moghaddam, Dongjie Wang, Zijun Yao
Main category: cs.AI
TL;DR: LINKO is an LLM-augmented framework that integrates multiple medical ontologies through dual-axis knowledge propagation (intra-ontology vertical and inter-ontology horizontal) to enhance medical concept representation learning.
Details
Motivation: Existing methods focus on single ontology systems or multiple ontologies in isolation, missing cross-ontology connections that could enrich medical concept representations.
Method: Uses LLMs for graph-retrieval-augmented initialization of concept embeddings, then performs dual-axis knowledge propagation: vertical within ontologies and horizontal across ontologies at each level.
Result: Superior performance over state-of-the-art baselines on two public datasets, with enhanced robustness in limited data and rare disease prediction scenarios.
Conclusion: LINKO effectively integrates multiple medical ontologies through dual-axis propagation, serving as a plug-in encoder that improves EHR predictive models’ performance and robustness.
Abstract: Medical ontology graphs map external knowledge to medical codes in electronic health records via structured relationships. By leveraging domain-approved connections (e.g., parent-child), predictive models can generate richer medical concept representations by incorporating contextual information from related concepts. However, existing literature primarily focuses on incorporating domain knowledge from a single ontology system, or from multiple ontology systems (e.g., diseases, drugs, and procedures) in isolation, without integrating them into a unified learning structure. Consequently, concept representation learning often remains limited to intra-ontology relationships, overlooking cross-ontology connections. In this paper, we propose LINKO, a large language model (LLM)-augmented integrative ontology learning framework that leverages multiple ontology graphs simultaneously by enabling dual-axis knowledge propagation both within and across heterogeneous ontology systems to enhance medical concept representation learning. Specifically, LINKO first employs LLMs to provide a graph-retrieval-augmented initialization for ontology concept embedding, through an engineered prompt that includes concept descriptions, and is further augmented with ontology context. Second, our method jointly learns the medical concepts in diverse ontology graphs by performing knowledge propagation in two axes: (1) intra-ontology vertical propagation across hierarchical ontology levels and (2) inter-ontology horizontal propagation within every level in parallel. Last, through extensive experiments on two public datasets, we validate the superior performance of LINKO over state-of-the-art baselines. As a plug-in encoder compatible with existing EHR predictive models, LINKO further demonstrates enhanced robustness in scenarios involving limited data availability and rare disease prediction.
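The dual-axis idea can be sketched in a few lines; the mean-aggregation update below is an illustrative stand-in for LINKO's learned propagation, and the shapes, level structure, and mixing weight are assumptions.

```python
# Sketch of dual-axis propagation over level-organized ontology embeddings.
# Mean aggregation over parent levels (vertical) and cross-ontology level
# peers (horizontal) stands in for LINKO's learned propagation; shapes and
# the update rule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 16
# Two toy ontologies, each a list of levels; each level holds concept embeddings.
ontologies = [[rng.normal(size=(3, d)), rng.normal(size=(5, d))],   # e.g. diseases
              [rng.normal(size=(2, d)), rng.normal(size=(4, d))]]   # e.g. drugs

def propagate(ontologies, alpha=0.5):
    out = [[lvl.copy() for lvl in ont] for ont in ontologies]
    # (1) intra-ontology vertical: each level mixes in its parent level's mean.
    for ont in out:
        for k in range(1, len(ont)):
            ont[k] = (1 - alpha) * ont[k] + alpha * ont[k - 1].mean(axis=0)
    # (2) inter-ontology horizontal: concepts at the same depth share a
    # cross-ontology level summary.
    depth = min(len(ont) for ont in out)
    for k in range(depth):
        level_mean = np.mean([ont[k].mean(axis=0) for ont in out], axis=0)
        for ont in out:
            ont[k] = (1 - alpha) * ont[k] + alpha * level_mean
    return out

updated = propagate(ontologies)
print(updated[0][1].shape)  # (5, 16): enriched disease embeddings at level 2
```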
[154] Think in Games: Learning to Reason in Games via Reinforcement Learning with Large Language Models
Yi Liao, Yu Gu, Yuan Sui, Zining Zhu, Yifan Lu, Guohua Tang, Zhongqian Sun, Wei Yang
Main category: cs.AI
TL;DR: TiG framework enables LLMs to develop procedural knowledge through game interactions, combining language modeling with reinforcement learning to bridge declarative-procedural knowledge gap with high efficiency and interpretability.
Details
Motivation: LLMs excel at complex reasoning but struggle with simple interactive tasks that require procedural knowledge, highlighting a gap between declarative and procedural understanding that needs to be addressed.
Method: Think in Games (TiG) reformulates RL-based decision-making as language modeling task, where LLMs generate language-guided policies refined through online reinforcement learning using environmental feedback.
Result: TiG achieves competitive performance with dramatically lower data and computational demands compared to conventional RL methods, while providing step-by-step natural language explanations.
Conclusion: TiG successfully bridges declarative and procedural knowledge gap, enabling LLMs to convert static knowledge into dynamic decision-making while maintaining reasoning abilities and improving transparency in interactive tasks.
Abstract: Large language models (LLMs) excel at complex reasoning tasks such as mathematics and coding, yet they frequently struggle with simple interactive tasks that young children perform effortlessly. This discrepancy highlights a critical gap between declarative knowledge (knowing about something) and procedural knowledge (knowing how to do something). Although traditional reinforcement learning (RL) agents can acquire procedural knowledge through environmental interaction, they often operate as black boxes and require substantial training data. In contrast, LLMs possess extensive world knowledge and reasoning capabilities, but are unable to effectively convert this static knowledge into dynamic decision-making in interactive settings. To address this challenge, we propose Think in Games (TiG), a novel framework that empowers LLMs to develop procedural understanding through direct interaction with game environments, while retaining their inherent reasoning and explanatory abilities. Specifically, TiG reformulates RL-based decision-making as a language modeling task: LLMs generate language-guided policies, which are refined iteratively through online reinforcement learning based on environmental feedback. Our experimental results show that TiG successfully bridges the gap between declarative and procedural knowledge, achieving competitive performance with dramatically lower data and computational demands compared to conventional RL methods. Moreover, TiG provides step-by-step natural language explanations for its decisions, greatly improving transparency and interpretability in complex interactive tasks.
[155] AHELM: A Holistic Evaluation of Audio-Language Models
Tony Lee, Haoqin Tu, Chi Heem Wong, Zijun Wang, Siwei Yang, Yifan Mai, Yuyin Zhou, Cihang Xie, Percy Liang
Main category: cs.AI
TL;DR: AHELM is a comprehensive benchmark for audio-language models that standardizes evaluation across 10 key aspects including perception, reasoning, fairness, and safety, testing 14 models and revealing performance gaps despite some models excelling in multiple areas.
Details
Motivation: Current ALM evaluations lack standardization, measure limited capabilities, and omit important aspects like fairness and safety, making cross-model comparisons difficult.
Method: Created AHELM benchmark aggregating various datasets including new synthetic datasets PARADE (for stereotype avoidance) and CoRe-Bench (for conversational reasoning), with standardized prompts, inference parameters, and evaluation metrics.
Result: Gemini 2.5 Pro ranked top in 5/10 aspects but showed group unfairness (p=0.01) on ASR tasks. Baseline systems performed surprisingly well, with one ranking 5th overall despite only speech-to-text capabilities.
Conclusion: AHELM provides holistic standardized evaluation for ALMs, revealing performance variations and fairness issues, and will be maintained as a living benchmark with ongoing updates.
Abstract: Evaluations of audio-language models (ALMs) – multimodal models that take interleaved audio and text as input and output text – are hindered by the lack of standardized benchmarks; most benchmarks measure only one or two capabilities and omit evaluative aspects such as fairness or safety. Furthermore, comparison across models is difficult as separate evaluations test a limited number of models and use different prompting methods and inference parameters. To address these shortfalls, we introduce AHELM, a benchmark that aggregates various datasets – including 2 new synthetic audio-text datasets called PARADE, which evaluates the ALMs on avoiding stereotypes, and CoRe-Bench, which measures reasoning over conversational audio through inferential multi-turn question answering – to holistically measure the performance of ALMs across 10 aspects we have identified as important to the development and usage of ALMs: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models. We test 14 open-weight and closed-API ALMs from 3 developers and 3 additional simple baseline systems each consisting of an automatic speech recognizer and a language model. Our results show that while Gemini 2.5 Pro ranks top in 5 out of 10 aspects, it exhibits group unfairness ($p=0.01$) on ASR tasks whereas most of the other models do not. We also find that the baseline systems perform reasonably well on AHELM, with one ranking 5th overall despite having only speech-to-text capabilities. For transparency, all raw prompts, model generations, and outputs are available on our website at https://crfm.stanford.edu/helm/audio/v1.0.0. AHELM is intended to be a living benchmark and new datasets and models will be added over time.
[156] AI Compute Architecture and Evolution Trends
Bor-Sung Liang
Main category: cs.AI
TL;DR: This paper proposes a 7-layer AI compute architecture model and analyzes AI development challenges and opportunities through technical and economic perspectives.
Details
Motivation: To address the shift from academic AI research to practical applications by providing a structured framework to analyze AI development challenges across technical and economic dimensions.
Method: Proposes a seven-layer AI compute architecture model (Physical, Link, Neural Network, Context, Agent, Orchestrator, Application layers) and analyzes each layer’s development trajectory, key technologies, and evolution through three stages of LLM development.
Result: Provides a comprehensive framework for understanding AI computing evolution, identifies technical challenges at each layer, compares different LLM development paths, and analyzes the transition from single AI agents to AI ecosystems.
Conclusion: AI development requires addressing both technical challenges across the 7-layer architecture and economic sustainability issues, with predictions for future AI trajectory based on internet industry analysis.
Abstract: The focus of AI development has shifted from academic research to practical applications. However, AI development faces numerous challenges at various levels. This article attempts to analyze the opportunities and challenges of AI from several different perspectives using a structured approach. It proposes a seven-layer model for AI compute architecture, including the Physical Layer, Link Layer, Neural Network Layer, Context Layer, Agent Layer, Orchestrator Layer, and Application Layer, from bottom to top. It also explains how AI computing has evolved into this 7-layer architecture through the three-stage evolution of large-scale language models (LLMs). For each layer, we describe the development trajectory and key technologies. In Layers 1 and 2 we discuss AI computing issues and the impact of Scale-Up and Scale-Out strategies on computing architecture. In Layer 3 we explore two different development paths for LLMs. In Layer 4 we discuss the impact of contextual memory on LLMs and compare it to traditional processor memory. In Layers 5 to 7 we discuss the trends of AI agents and explore the issues in evolving from a single AI agent to an AI-based ecosystem, and their impact on the AI industry. Furthermore, AI development involves not only technical challenges but also the economic issues of building a self-sustainable ecosystem. This article analyzes the internet industry to provide predictions on the future trajectory of AI development.
[157] CARJAN: Agent-Based Generation and Simulation of Traffic Scenarios with AJAN
Leonard Frank Neis, Andre Antakli, Matthias Klusch
Main category: cs.AI
TL;DR: CARJAN is a novel tool for semi-automated generation and simulation of urban traffic scenarios with pedestrians, cyclists, and autonomous vehicles using AJAN framework and CARLA simulator.
Details
Motivation: User-friendly modeling and virtual simulation of urban traffic scenarios with different types of interacting agents remains challenging.
Method: Uses multi-agent engineering framework AJAN and CARLA driving simulator with visual interface for scenario modeling, SPARQL Behavior Tree-based decision-making for agents.
Result: Provides an integrated approach for interactive, intelligent agent-based generation and simulation of virtual traffic scenarios in CARLA.
Conclusion: CARJAN offers a novel semi-automated solution for creating and simulating complex urban traffic scenarios with diverse agent interactions.
Abstract: User-friendly modeling and virtual simulation of urban traffic scenarios with different types of interacting agents such as pedestrians, cyclists and autonomous vehicles remains a challenge. We present CARJAN, a novel tool for semi-automated generation and simulation of such scenarios based on the multi-agent engineering framework AJAN and the driving simulator CARLA. CARJAN provides a visual user interface for the modeling, storage and maintenance of traffic scenario layouts, and leverages SPARQL Behavior Tree-based decision-making and interactions for agents in dynamic scenario simulations in CARLA. CARJAN provides a first integrated approach for interactive, intelligent agent-based generation and simulation of virtual traffic scenarios in CARLA.
[158] A General Framework of Epistemic Forgetting and its Instantiation by Ranking Functions
Christoph Beierle, Alexander Hahn, Diana Howey, Gabriele Kern-Isberner, Kai Sauerwald
Main category: cs.AI
TL;DR: This paper analyzes forgetting operations in epistemic states, proposing five general types and seven concrete forgetting operations for Spohn’s ranking functions, with comprehensive evaluation against postulates from logic programming and AGM theory.
Details
Motivation: To study forgetting operations from an epistemic perspective in richer semantic structures beyond classical logics, bridging the gap between variable elimination and AGM contraction while providing a comprehensive analysis of different forgetting operators.
Method: The authors take an epistemic perspective and study forgetting operations in epistemic states with clear links to propositional logic. They propose five general types of epistemic forgetting and instantiate them with seven concrete forgetting operations for Spohn’s ranking functions, then evaluate these operations against postulates from both logic programming and AGM theory.
Result: The paper provides a comprehensive evaluation of all concrete forgetting operations according to all postulates, resulting in a novel overview that highlights differences and commonalities among the various forgetting operators.
Conclusion: The study successfully lifts well-known and novel forgetting operations to the epistemic level, providing a rich landscape of axioms for evaluating forgetting operations and demonstrating the relationships between different approaches to forgetting in epistemic states.
Abstract: Forgetting as a knowledge management operation deliberately ignores parts of the knowledge and beliefs of an agent, for various reasons. Forgetting has many facets, one may want to forget parts of the syntax, a proposition, or a conditional. In the literature, two main operators suitable for performing forgetting have been proposed and investigated in depth: First, variable elimination is a syntactical method that blends out certain atomic variables to focus on the rest of the language. It has been mainly used in the area of logic programming and answer set programming. Second, contraction in AGM belief revision theory effectively removes propositions from belief sets under logical deduction. Both operations rely mainly on classical logics. In this article, we take an epistemic perspective and study forgetting operations in epistemic states with richer semantic structures, but with clear links to propositional logic. This allows us to investigate what forgetting in the epistemic background means, thereby lifting well-known and novel forgetting operations to the epistemic level. We present five general types of epistemic forgetting and instantiate them with seven concrete forgetting operations for Spohn’s ranking functions. We take inspiration from postulates of forgetting both from logic programming and AGM theory to propose a rich landscape of axioms for evaluating forgetting operations. Finally, we evaluate all concrete forgetting operations according to all postulates, leading to a novel comprehensive overview highlighting differences and commonalities among the forgetting operators.
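As a worked example of what forgetting over ranking functions can look like, the sketch below marginalizes one atom out of a small Spohn ranking function, so each reduced world keeps the minimal rank over its extensions; the paper studies seven operations, of which this illustrates only one standard instantiation.

```python
# Worked example of one forgetting operation on a Spohn ranking function:
# "forgetting" an atom by marginalization, i.e. each reduced world keeps the
# minimal disbelief rank over its extensions. The paper studies seven
# operations; this shows just one standard instantiation.
atoms = ("rain", "wet")
# kappa maps each world (truth assignment over atoms) to a disbelief rank;
# rank-0 worlds are the most plausible.
kappa = {(True, True): 0, (True, False): 2, (False, True): 1, (False, False): 0}

def forget(kappa, atoms, target):
    """Marginalize `target` out of the ranking function."""
    i = atoms.index(target)
    reduced_atoms = atoms[:i] + atoms[i + 1:]
    reduced = {}
    for world, rank in kappa.items():
        rw = world[:i] + world[i + 1:]
        reduced[rw] = min(rank, reduced.get(rw, rank))
    return reduced_atoms, reduced

print(forget(kappa, atoms, "rain"))
# (('wet',), {(True,): 0, (False,): 0}) -- after forgetting `rain`,
# both wet-worlds are maximally plausible.
```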
[159] Learning Lifted Action Models From Traces of Incomplete Actions and States
Niklas Jansen, Jonas Gösgens, Hector Geffner
Main category: cs.AI
TL;DR: Learning lifted STRIPS models from incomplete state-action traces where states miss some predicates and actions lack full argument information, using a new STRIPS+ formalism and SYNTH algorithm.
Details
Motivation: Previous approaches assume full STRIPS states and actions, but real-world scenarios often have incomplete observations where some predicates are missing and action arguments are not fully revealed.
Method: Introduce STRIPS+ formalism allowing implicit action arguments and limited existential quantification. Develop SYNTH algorithm that constructs stratified precondition expressions to ground implicit arguments from state-action traces.
Result: SYNTH algorithm is proven correct and complete, and tested on traces from STRIPS+ models derived from existing STRIPS domains, showing scalability.
Conclusion: The approach successfully addresses the more realistic setting of learning from incomplete observations by extending STRIPS to STRIPS+ and providing a sound learning algorithm.
Abstract: Consider the problem of learning a lifted STRIPS model of the sliding-tile puzzle from random state-action traces where the states represent the location of the tiles only, and the actions are the labels up, down, left, and right, with no arguments. Two challenges are involved in this problem. First, the states are not full STRIPS states, as some predicates are missing, like the atoms representing the position of the “blank”. Second, the actions are not full STRIPS either, as they do not reveal all the objects involved in the actions’ effects and preconditions. Previous approaches have addressed different versions of this model learning problem, but most assume that actions in the traces are full STRIPS actions or that the domain predicates are all observable. The new setting considered in this work is more “realistic”, as the atoms observed convey the state of the world but not full STRIPS states, and the actions reveal the arguments needed for selecting the action but not the ones needed for modeling it in STRIPS. For formulating and addressing the learning problem, we introduce a variant of STRIPS, which we call STRIPS+, where certain STRIPS action arguments can be left implicit in preconditions, which can also involve a limited form of existential quantification. The learning problem becomes the problem of learning STRIPS+ models from STRIPS+ state-action traces. For this, the proposed learning algorithm, called SYNTH, constructs a stratified sequence (conjunction) of precondition expressions or “queries” for each action that denote unique objects in the state and ground the implicit action arguments in STRIPS+. The correctness and completeness of SYNTH are established, and its scalability is tested on state-action traces obtained from STRIPS+ models derived from existing STRIPS domains.
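To illustrate the STRIPS+ idea of implicit, existentially quantified action arguments, here is a hedged sketch of an applicability check that searches for a grounding of the implicit variables; the predicate names and brute-force search are illustrative and are not the SYNTH algorithm itself.

```python
# Sketch of a STRIPS+-style applicability check: explicit arguments come
# from the trace label, while implicit arguments are existentially
# quantified and must be grounded against the state. Names and the
# brute-force search are illustrative, not the SYNTH algorithm.
from itertools import product

def applicable(preconds, binding, implicit_vars, objects, state):
    """Return a grounding of the implicit variables that makes all
    precondition atoms true in `state`, or None if none exists."""
    for values in product(objects, repeat=len(implicit_vars)):
        env = dict(binding, **dict(zip(implicit_vars, values)))
        if all((pred, tuple(env[v] for v in args)) in state
               for pred, args in preconds):
            return dict(zip(implicit_vars, values))
    return None

# Sliding-tile "up" action: the tile and cells are implicit arguments.
state = {("at", ("t3", "c21")), ("blank", ("c11",)), ("above", ("c11", "c21"))}
preconds = [("at", ("?t", "?from")), ("blank", ("?to",)), ("above", ("?to", "?from"))]
objects = ["t3", "c11", "c21"]

# The trace label "up" carries no arguments, so all three variables are implicit.
print(applicable(preconds, {}, ["?t", "?from", "?to"], objects, state))
# {'?t': 't3', '?from': 'c21', '?to': 'c11'}
```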
[160] MMSearch-Plus: A Simple Yet Challenging Benchmark for Multimodal Browsing Agents
Xijia Tao, Yihua Teng, Xinxing Su, Xinyu Fu, Jihao Wu, Chaofan Tao, Ziru Liu, Haoli Bai, Rui Liu, Lingpeng Kong
Main category: cs.AI
TL;DR: MMSearch-Plus is a challenging multimodal web browsing benchmark with 311 tasks requiring fine-grained visual reasoning, iterative search, and cross-validation under noise, where current MLLMs struggle significantly.
Details
Motivation: Existing multimodal browsing benchmarks can be solved with shallow workflows that lean on high-recall image search and nearby text, sidestepping genuine multimodal challenges like fine-grained visual reasoning, provenance verification, and long-horizon tool use.
Method: Created MMSearch-Plus benchmark using Spatial-Temporal Extrapolation procedure to seed questions requiring extrapolation from spatial cues and temporal traces to out-of-image facts. Provided model-agnostic agent framework with browsing tools.
Result: The strongest agent (o3) achieved 15.1% accuracy without search and 36.0% with search rollouts. A strong open-source model (Qwen-2.5-VL-72B) got 0.0% without search and 6.9% after 20 search rounds, showing significant challenges.
Conclusion: Current MLLMs face substantial difficulties in multimodal web agent tasks, particularly with source verification, part-based reasoning, and long-horizon planning, highlighting the need for improved multimodal understanding capabilities.
Abstract: Large multimodal language models (MLLMs) are increasingly deployed as web agents, yet many multimodal browsing benchmarks can be solved by shallow, fixed workflows that lean on high-recall image search and nearby text, masking the genuinely multimodal challenges of fine-grained visual reasoning, provenance verification, and long-horizon tool use. We introduce MMSearch-Plus, a benchmark of 311 tasks that highly demand multimodal understanding while preserving the difficulty profile of strong text-only browsing suites. Each item is constructed to contain multiple weak, localized visual signals that must be extracted, propagated through iterative text-image search, and cross-validated under retrieval noise before answering. Our curation procedure, Spatial-Temporal Extrapolation, seeds questions whose answers require extrapolating from spatial cues (micro-text, part-level appearance, layouts, signage) and temporal traces (broadcast overlays, seasonal context) to out-of-image facts such as events, dates, and venues. We provide a model-agnostic agent framework with browsing tools and evaluate a range of closed and open MLLMs. The strongest agent (o3) attains 15.1% without search and 36.0% accuracy with rollout under our framework, while a strong open-source model (Qwen-2.5-VL-72B-Instruct) achieves 0.0% without search and 6.9% after 20 rounds of search. Beyond answer accuracy, we assess bounding-box production and cropped-image search, and conduct an error analysis that surfaces failures in source verification, part-based reasoning, and long-horizon planning.
[161] Modeling Wise Decision Making: A Z-Number Fuzzy Framework Inspired by Phronesis
Sweta Kaman, Ankita Sharma, Romi Banerjee
Main category: cs.AI
TL;DR: A computational framework using fuzzy Z-numbers to measure wisdom as a multidimensional construct with both wisdom scores and confidence levels, validated through moral dilemma tasks.
Details
Motivation: Current wisdom measures rely on self-reports and fail to capture the inherent uncertainty and humility in wise reasoning. A computational approach could improve psychological measurement and enable more humane AI systems.
Method: Developed a fuzzy inference system with Z-numbers (wisdom score + confidence score). Used culturally neutral pictorial moral dilemmas with think-aloud responses from 100 participants, mapped to 5 wisdom components. Combined scores using 21 rules with Gaussian kernel density estimation.
Result: The system produced dual-attribute wisdom representations that showed modest but significant correlations with established scales while having negligible relations with unrelated traits, supporting convergent and divergent validity.
Conclusion: Successfully formalized wisdom as a multidimensional, uncertainty-conscious construct using Z-numbers. This approach advances psychological measurement and provides AI systems with interpretable, confidence-sensitive reasoning that bridges computational rigor and human judgment.
Abstract: Background: Wisdom is a superordinate construct that embraces perspective taking, reflectiveness, prosocial orientation, reflective empathetic action, and intellectual humility. Unlike conventional models of reasoning that are rigidly bound by binary thinking, wisdom unfolds in shades of ambiguity, requiring both graded evaluation and self-reflective humility. Current measures depend on self-reports and seldom reflect the humility and uncertainty inherent in wise reasoning. A computational framework that takes into account both multidimensionality and confidence has the potential to improve psychological science and allow humane AI. Method: We present a fuzzy inference system with Z-numbers, each decision being expressed in terms of a wisdom score (restriction) and a confidence score (certainty). As part of this study, participants (N = 100) were exposed to culturally neutral pictorial moral dilemma tasks to which they generated think-aloud linguistic responses, which were mapped into five theoretically based components of wisdom. The scores of the individual components were combined using a rule base of 21 rules, with membership functions tuned via Gaussian kernel density estimation. Results: In a proof-of-concept study, the system produced dual-attribute wisdom representations that correlated modestly but significantly with established scales while showing negligible relations with unrelated traits, supporting convergent and divergent validity. Contribution: The contribution is to formalize wisdom as a multidimensional, uncertainty-conscious construct, operationalized in the form of Z-numbers. In addition to advancing measurement in psychology, it shows how fuzzy Z-numbers can provide AI systems with interpretable, confidence-sensitive reasoning that affords a safe middle ground between rigorous computation and human-like judgment.
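A minimal sketch of the Z-number output format follows, assuming Gaussian memberships and a simple agreement-based certainty measure; the paper instead tunes memberships via kernel density estimation and combines components through a 21-rule base.

```python
# Minimal sketch of a Z-number-style output: a fuzzy wisdom score paired
# with a confidence score. The Gaussian memberships and agreement-based
# certainty below are assumptions; the paper tunes memberships via KDE and
# combines components through a 21-rule base.
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Component scores in [0, 1] for one participant's think-aloud response.
components = {"perspective_taking": 0.7, "reflectiveness": 0.6,
              "prosocial_orientation": 0.8, "empathetic_action": 0.5,
              "intellectual_humility": 0.9}

x = np.mean(list(components.values()))          # crisp aggregate
# Membership in three fuzzy wisdom levels.
levels = {"low": gauss(x, 0.2, 0.15), "medium": gauss(x, 0.5, 0.15),
          "high": gauss(x, 0.8, 0.15)}
wisdom = max(levels, key=levels.get)

# Z-number second component: certainty, taken here as agreement (low spread)
# across components -- an assumption standing in for the paper's measure.
confidence = 1.0 - float(np.std(list(components.values())))
print(f"Z = ({wisdom}, confidence={confidence:.2f})")
```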
[162] Counterfactual Scenarios for Automated Planning
Nicola Gigante, Francesco Leofante, Andrea Micheli
Main category: cs.AI
TL;DR: The paper proposes a new counterfactual scenario explanation paradigm for automated planning that identifies minimal modifications to planning problems to achieve desired properties, showing computational viability comparable to plan computation.
Details
Motivation: Traditional counterfactual explanations in automated planning only show minimal plan modifications for different goals, but fail to capture higher-level problem properties. This limitation motivates a new approach that can explain planning problems through counterfactual scenarios.
Method: The authors propose counterfactual scenarios that identify minimal modifications to a planning problem P such that it admits plans satisfying desired properties defined by an LTLf formula ψ. They present two qualitative instantiations based on explicit quantification over plans and characterize computational complexity for different types of changes.
Result: The research shows that generating counterfactual scenarios is often only as computationally expensive as computing a plan for the original problem P, demonstrating practical viability of the approach.
Conclusion: The proposed counterfactual scenario framework provides a practical and computationally feasible method for explaining planning problems by identifying minimal modifications needed to achieve desired properties, offering a foundation for practical algorithm development in this area.
Abstract: Counterfactual Explanations (CEs) are a powerful technique used to explain Machine Learning models by showing how the input to a model should be minimally changed for the model to produce a different output. Similar proposals have been made in the context of Automated Planning, where CEs have been characterised in terms of minimal modifications to an existing plan that would result in the satisfaction of a different goal. While such explanations may help diagnose faults and reason about the characteristics of a plan, they fail to capture higher-level properties of the problem being solved. To address this limitation, we propose a novel explanation paradigm that is based on counterfactual scenarios. In particular, given a planning problem $P$ and an LTLf formula $\psi$ defining desired properties of a plan, counterfactual scenarios identify minimal modifications to $P$ such that it admits plans that comply with $\psi$. In this paper, we present two qualitative instantiations of counterfactual scenarios based on an explicit quantification over plans that must satisfy $\psi$. We then characterise the computational complexity of generating such counterfactual scenarios when different types of changes are allowed on $P$. We show that producing counterfactual scenarios is often only as expensive as computing a plan for $P$, thus demonstrating the practical viability of our proposal and ultimately providing a framework to construct practical algorithms in this area.
[163] HealthProcessAI: A Technical Framework and Proof-of-Concept for LLM-Enhanced Healthcare Process Mining
Eduardo Illueca-Fernandez, Kaile Chen, Fernando Seoane, Farhad Abtahi
Main category: cs.AI
TL;DR: HealthProcessAI is a GenAI framework that simplifies healthcare process mining by integrating LLMs with existing PM4PY and bupaR libraries to automate interpretation and generate accessible reports from complex workflow analyses.
Details
Motivation: Process mining in healthcare faces barriers including technical complexity, lack of standardization, and limited training resources, making it difficult for clinicians and researchers to understand and apply the analytical outputs.
Method: The framework integrates multiple LLMs through the OpenRouter platform to provide automated process map interpretation and report generation. It was validated using sepsis progression data across four proof-of-concept scenarios and evaluated five state-of-the-art LLM models.
Result: The framework successfully processed sepsis data and demonstrated robust technical performance. LLM evaluation showed Claude Sonnet-4 and Gemini 2.5-Pro achieved the highest consistency scores (3.79/4.0 and 3.65/4.0) when assessed by automated LLM evaluators.
Conclusion: HealthProcessAI represents a novel methodological advance by combining structured analytics with AI-driven interpretation, making complex process mining results more accessible and actionable for diverse healthcare stakeholders including clinicians, data scientists, and researchers.
Abstract: Process mining has emerged as a powerful analytical technique for understanding complex healthcare workflows. However, its application faces significant barriers, including technical complexity, a lack of standardized approaches, and limited access to practical training resources. We introduce HealthProcessAI, a GenAI framework designed to simplify process mining applications in healthcare and epidemiology by providing a comprehensive wrapper around existing Python (PM4PY) and R (bupaR) libraries. To address unfamiliarity and improve accessibility, the framework integrates multiple Large Language Models (LLMs) for automated process map interpretation and report generation, helping translate technical analyses into outputs that diverse users can readily understand. We validated the framework using sepsis progression data as a proof-of-concept example and compared the outputs of five state-of-the-art LLM models through the OpenRouter platform. To test its functionality, the framework successfully processed sepsis data across four proof-of-concept scenarios, demonstrating robust technical performance and its capability to generate reports through automated LLM analysis. Evaluation using five independent LLMs as automated assessors revealed distinct model strengths: Claude Sonnet-4 and Gemini 2.5-Pro achieved the highest consistency scores (3.79/4.0 and 3.65/4.0). By integrating multiple LLMs for automated interpretation and report generation, the framework addresses widespread unfamiliarity with process mining outputs, making them more accessible to clinicians, data scientists, and researchers. This combination of structured analytics and AI-driven interpretation represents a novel methodological advance in translating complex process mining results into potentially actionable insights for healthcare applications.
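For orientation, here is a hedged sketch of the kind of wrapper described here, using PM4PY's simplified interface; the log path is a placeholder, and the summary function stands in for the framework's OpenRouter-backed report generation.

```python
# Hedged sketch of the wrapper idea HealthProcessAI describes: discover a
# process model with PM4PY, then hand a textual summary to an LLM for a
# plain-language report. "sepsis.xes" is a placeholder path and `summarize`
# is a stub for the framework's LLM-backed report generation.
import pm4py

log = pm4py.read_xes("sepsis.xes")            # event log (placeholder file)
dfg, starts, ends = pm4py.discover_dfg(log)   # directly-follows graph

def summarize(dfg, starts, ends) -> str:
    """Build a compact textual summary an LLM could turn into a report."""
    top = sorted(dfg.items(), key=lambda kv: -kv[1])[:5]
    lines = [f"{a} -> {b}: {n} cases" for (a, b), n in top]
    return "Most frequent transitions:\n" + "\n".join(lines)

report_prompt = summarize(dfg, starts, ends)
print(report_prompt)  # this text would be sent to an LLM for interpretation
```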
[164] Revisiting Landmarks: Learning from Previous Plans to Generalize over Problem Instances
Issa Hanou, Sebastijan Dumančić, Mathijs de Weerdt
Main category: cs.AI
TL;DR: A framework for discovering generalized landmarks that automatically work across planning domains, using state functions instead of predicates to capture repetition and intermediate goals.
Details
Motivation: Traditional landmark extraction algorithms fall short in planning problems where intermediate goals need to be identified, especially when dealing with repetitive subplans across different problem instances.
Method: Learn generalized landmarks from solved instances using state functions independent of specific objects, construct a directed generalized landmark graph that includes loop possibilities for repetitive subplans, and use this graph in a heuristic for solving new problem instances.
Result: Generalized landmark graphs learned from small instances are effective for larger instances in the same domain, with significant heuristic performance improvement when repetition loops are identified.
Conclusion: Generalized landmarks capture interpretable domain information that is useful for automated planning and can be discovered from a small set of plans for the same domain.
Abstract: We propose a new framework for discovering landmarks that automatically generalize across a domain. These generalized landmarks are learned from a set of solved instances and describe intermediate goals for planning problems where traditional landmark extraction algorithms fall short. Our generalized landmarks extend beyond the predicates of a domain by using state functions that are independent of the objects of a specific problem and apply to all similar objects, thus capturing repetition. Based on these functions, we construct a directed generalized landmark graph that defines the landmark progression, including loop possibilities for repetitive subplans. We show how to use this graph in a heuristic to solve new problem instances of the same domain. Our results show that the generalized landmark graphs learned from a few small instances are also effective for larger instances in the same domain. If a loop that indicates repetition is identified, we see a significant improvement in heuristic performance over the baseline. Generalized landmarks capture domain information that is interpretable and useful to an automated planner. This information can be discovered from a small set of plans for the same domain.
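The heuristic use of the landmark graph can be sketched as a simple count of unachieved landmarks along the graph's progression; the boolean checks below stand in for the paper's object-independent state functions, and the domain is invented for illustration.

```python
# Sketch of a landmark-count heuristic over a generalized landmark graph:
# estimate distance-to-go as the number of landmarks not yet achieved along
# a topological order of the graph. The paper's landmarks are
# object-independent state functions; booleans stand in for them here.
progression = ["all_packages_loaded", "truck_at_depot", "all_packages_delivered"]

def h(state):
    """Number of generalized landmarks the state has not achieved yet."""
    return sum(1 for lm in progression if not state[lm])

state = {"all_packages_loaded": True,
         "truck_at_depot": False,
         "all_packages_delivered": False}
print(h(state))  # 2: two generalized landmarks still to achieve
```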
[165] Scalable Solution Methods for Dec-POMDPs with Deterministic Dynamics
Yang You, Alex Schutz, Zhikun Li, Bruno Lacerda, Robert Skilton, Nick Hawes
Main category: cs.AI
TL;DR: Introduces Deterministic Decentralized POMDPs (Det-Dec-POMDPs) for multi-agent planning problems and proposes IDPP solver for efficient large-scale solutions.
Details
Motivation: Many high-level multi-agent planning problems like multi-robot navigation can be effectively modeled using deterministic actions and observations, but current Dec-POMDP solvers struggle with large-scale problems.
Method: Proposes Iterative Deterministic POMDP Planning (IDPP) method that builds on Joint Equilibrium Search for Policies framework, specifically optimized for deterministic decentralized POMDPs.
Result: IDPP is designed to handle large-scale Det-Dec-POMDPs that current Dec-POMDP solvers cannot address efficiently.
Conclusion: The Det-Dec-POMDP framework and IDPP solver provide a practical approach for solving complex multi-agent planning problems with deterministic transitions and observations.
Abstract: Many high-level multi-agent planning problems, including multi-robot navigation and path planning, can be effectively modeled using deterministic actions and observations. In this work, we focus on such domains and introduce the class of Deterministic Decentralized POMDPs (Det-Dec-POMDPs). This is a subclass of Dec-POMDPs characterized by deterministic transitions and observations conditioned on the state and joint actions. We then propose a practical solver called Iterative Deterministic POMDP Planning (IDPP). This method builds on the classic Joint Equilibrium Search for Policies framework and is specifically optimized to handle large-scale Det-Dec-POMDPs that current Dec-POMDP solvers are unable to address efficiently.
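Since IDPP builds on Joint Equilibrium Search for Policies (JESP), a toy version of that alternating best-response loop is sketched below; the reward table and two-action policies are illustrative, not a full Det-Dec-POMDP.

```python
# Sketch of the JESP loop IDPP builds on: hold all agents but one fixed,
# best-respond with that agent, and repeat until no unilateral improvement.
# The deterministic toy reward table and two-action "policies" are
# illustrative assumptions, not a full Det-Dec-POMDP.
ACTIONS = ["push", "wait"]

def joint_reward(a1, a2):
    table = {("push", "push"): 2, ("push", "wait"): -1,
             ("wait", "push"): -1, ("wait", "wait"): 0}
    return table[(a1, a2)]

def jesp(start=("wait", "wait")):
    policy = list(start)
    improved = True
    while improved:
        improved = False
        for i in range(2):                # best-respond for each agent in turn
            best = max(ACTIONS, key=lambda a:
                       joint_reward(*(policy[:i] + [a] + policy[i + 1:])))
            if best != policy[i]:
                policy[i], improved = best, True
    return tuple(policy)

print(jesp())  # ('wait', 'wait'): a local equilibrium; JESP escapes these
               # with random restarts
```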
[166] Integrating Large Language Models with Network Optimization for Interactive and Explainable Supply Chain Planning: A Real-World Case Study
Saravanan Venkatachalam
Main category: cs.AI
TL;DR: Integrated framework combining network optimization with LLMs for interactive, explainable supply chain planning support
Details
Motivation: Bridge the gap between complex operations research outputs and business stakeholder understanding in supply chain planning.
Method: Combines traditional mixed-integer optimization for inventory redistribution with LLMs for natural language summaries, contextual visualizations, and tailored KPIs. Uses AI agents, RESTful APIs, and a dynamic UI for real-time interaction.
Result: Case study demonstrates improved planning outcomes including stockout prevention, cost reduction, and maintained service levels
Conclusion: Framework successfully integrates optimization with LLMs for explainable decision support, with future extensions planned for enhanced adaptability and real-time capabilities
Abstract: This paper presents an integrated framework that combines traditional network optimization models with large language models (LLMs) to deliver interactive, explainable, and role-aware decision support for supply chain planning. The proposed system bridges the gap between complex operations research outputs and business stakeholder understanding by generating natural language summaries, contextual visualizations, and tailored key performance indicators (KPIs). The core optimization model addresses tactical inventory redistribution across a network of distribution centers for multi-period and multi-item, using a mixed-integer formulation. The technical architecture incorporates AI agents, RESTful APIs, and a dynamic user interface to support real-time interaction, configuration updates, and simulation-based insights. A case study demonstrates how the system improves planning outcomes by preventing stockouts, reducing costs, and maintaining service levels. Future extensions include integrating private LLMs, transfer learning, reinforcement learning, and Bayesian neural networks to enhance explainability, adaptability, and real-time decision-making.
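A minimal single-period, single-item sketch of such a redistribution model in PuLP follows; the paper's formulation is multi-period and multi-item with richer costs, and the DC names, stocks, demands, and unit costs here are invented for illustration.

```python
# Minimal single-period, single-item sketch of a tactical inventory
# redistribution model, using PuLP. The paper's formulation is multi-period
# and multi-item; all data below is made up for illustration.
import pulp

stock = {"DC1": 120, "DC2": 30, "DC3": 10}     # on-hand inventory
demand = {"DC1": 40, "DC2": 60, "DC3": 50}     # forecast demand
cost = {(i, j): 1.0 for i in stock for j in stock if i != j}  # unit ship cost

m = pulp.LpProblem("redistribution", pulp.LpMinimize)
x = pulp.LpVariable.dicts("ship", list(cost), lowBound=0)
short = pulp.LpVariable.dicts("short", list(demand), lowBound=0)

# Minimize shipping cost plus a penalty for unmet demand (stockouts).
m += pulp.lpSum(cost[a] * x[a] for a in cost) + 10 * pulp.lpSum(short.values())
for i in stock:
    inflow = pulp.lpSum(x[(j, i)] for j in stock if j != i)
    outflow = pulp.lpSum(x[(i, j)] for j in stock if j != i)
    # Ending inventory plus shortage must cover demand; can't ship more than stock.
    m += stock[i] + inflow - outflow + short[i] >= demand[i]
    m += outflow <= stock[i]

m.solve(pulp.PULP_CBC_CMD(msg=False))
print({a: x[a].value() for a in cost if x[a].value()}, pulp.LpStatus[m.status])
```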
[167] A-MHA*: Anytime Multi-Heuristic A*
Ramkumar Natarajan, Muhammad Suhail Saleem, William Xiao, Sandip Aine, Howie Choset, Maxim Likhachev
Main category: cs.AI
TL;DR: Extends Multi-Heuristic A* (MHA*) to an anytime algorithm called A-MHA* that finds feasible solutions quickly and continually improves them over time, inspired by Anytime Repairing A* concepts.
Details
Motivation: MHA* leverages multiple inadmissible heuristics for faster suboptimal solutions but is a one-shot algorithm that doesn't improve solutions over time and requires careful parameter tuning.
Method: Precise adaptation of Anytime Repairing A* (ARA*) concepts into the MHA* framework to create an anytime version that preserves suboptimality and completeness guarantees.
Result: A-MHA* successfully extends MHA* to perform in an anytime fashion, providing quick initial solutions with continuous improvement until time runs out.
Conclusion: The adaptation preserves MHA*’s original guarantees while enabling anytime performance, demonstrated effective in 3-D path planning and sliding tiles puzzle domains compared to MHA* and other anytime algorithms.
Abstract: Designing good heuristic functions for graph search requires adequate domain knowledge. It is often easy to design heuristics that perform well and correlate with the underlying true cost-to-go values in certain parts of the search space, but these may not be admissible throughout the domain, thereby affecting the optimality guarantees of the search. Bounded suboptimal search using several such partially good but inadmissible heuristics was developed in Multi-Heuristic A* (MHA*). Although MHA* leverages multiple inadmissible heuristics to potentially generate a faster suboptimal solution, the original version does not improve the solution over time. It is a one-shot algorithm that requires careful setting of inflation factors to obtain a desired one-time solution. In this work, we tackle this issue by extending MHA* to an anytime version that finds a feasible suboptimal solution quickly and continually improves it until time runs out. Our work is inspired by the Anytime Repairing A* (ARA*) algorithm. We prove that our precise adaptation of ARA* concepts in the MHA* framework preserves the original suboptimality and completeness guarantees and enhances MHA* to perform in an anytime fashion. Furthermore, we report the performance of A-MHA* in the 3-D path planning domain and sliding-tile puzzle and compare against MHA* and other anytime algorithms.
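The anytime outer loop that A-MHA* adapts from ARA* can be sketched with a single weighted A* whose inflation factor is lowered between searches; the multi-queue MHA* machinery and ARA*'s reuse of search effort between iterations are omitted here.

```python
# Sketch of the anytime outer loop A-MHA* adapts from ARA*: run an inflated
# search, report the solution, lower the inflation, and search again until
# epsilon reaches 1. A single weighted A* on a toy graph stands in for the
# multi-queue MHA* machinery; real ARA* also reuses search effort.
import heapq

graph = {"S": [("A", 1), ("B", 4)], "A": [("B", 1), ("G", 6)],
         "B": [("G", 2)], "G": []}
h = {"S": 3, "A": 3, "B": 2, "G": 0}          # consistent heuristic

def weighted_astar(eps):
    open_, g = [(eps * h["S"], "S")], {"S": 0}
    while open_:
        _, u = heapq.heappop(open_)
        if u == "G":
            return g["G"]                      # cost within eps * optimal
        for v, c in graph[u]:
            if g[u] + c < g.get(v, float("inf")):
                g[v] = g[u] + c
                heapq.heappush(open_, (g[v] + eps * h[v], v))
    return None

eps = 3.0
while eps >= 1.0:                              # anytime: tighten the bound
    print(f"eps={eps:.1f}: solution cost {weighted_astar(eps)}")
    eps -= 1.0
# eps=3.0 finds cost 7 quickly; eps=2.0 and 1.0 improve it to the optimal 4.
```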
[168] Leveraging Imperfection with MEDLEY: A Multi-Model Approach Harnessing Bias in Medical AI
Farhad Abtahi, Mehdi Astaraki, Fernando Seoane
Main category: cs.AI
TL;DR: MEDLEY is a framework that preserves diverse AI model outputs instead of forcing consensus, treating biases as strengths and hallucinations as hypotheses for clinician verification.
Details
Motivation: Conventional approaches view bias in medical AI as a defect to eliminate, but human reasoning inherently incorporates valuable biases shaped by education and experience.
Method: Developed a proof-of-concept using 30+ large language models that preserves both consensus and minority views, documenting model-specific biases and treating hallucinations as provisional hypotheses.
Result: Created a minimum viable product that makes diagnostic uncertainty and latent biases transparent for clinical oversight, demonstrating how structured diversity can enhance medical reasoning.
Conclusion: MEDLEY reframes AI imperfection as a resource, offering a paradigm shift that opens new regulatory, ethical, and innovation pathways for trustworthy medical AI systems.
Abstract: Bias in medical artificial intelligence is conventionally viewed as a defect requiring elimination. However, human reasoning inherently incorporates biases shaped by education, culture, and experience, suggesting their presence may be inevitable and potentially valuable. We propose MEDLEY (Medical Ensemble Diagnostic system with Leveraged diversitY), a conceptual framework that orchestrates multiple AI models while preserving their diverse outputs rather than collapsing them into a consensus. Unlike traditional approaches that suppress disagreement, MEDLEY documents model-specific biases as potential strengths and treats hallucinations as provisional hypotheses for clinician verification. A proof-of-concept demonstrator was developed using over 30 large language models, creating a minimum viable product that preserved both consensus and minority views in synthetic cases, making diagnostic uncertainty and latent biases transparent for clinical oversight. While not yet a validated clinical tool, the demonstration illustrates how structured diversity can enhance medical reasoning under clinician supervision. By reframing AI imperfection as a resource, MEDLEY offers a paradigm shift that opens new regulatory, ethical, and innovation pathways for developing trustworthy medical AI systems.
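The core move of preserving minority views rather than collapsing to a consensus is easy to sketch; the model names and outputs below are synthetic.

```python
# Sketch of MEDLEY's core move: keep the full distribution of model
# outputs, reporting a consensus view alongside preserved minority views
# instead of collapsing to a single answer. Model names and outputs are
# synthetic.
from collections import Counter

outputs = {"model_a": "pulmonary embolism", "model_b": "pneumonia",
           "model_c": "pulmonary embolism", "model_d": "myocarditis",
           "model_e": "pulmonary embolism"}

tally = Counter(outputs.values())
consensus, support = tally.most_common(1)[0]
minority = {dx: n for dx, n in tally.items() if dx != consensus}

print(f"consensus: {consensus} ({support}/{len(outputs)} models)")
print(f"minority views preserved for clinician review: {minority}")
```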
[169] PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation
Jiho Choi, Seojeong Park, Seongjong Song, Hyunjung Shim
Main category: cs.AI
TL;DR: PosterForest is a training-free framework for automated scientific poster generation that uses hierarchical Poster Tree representation and multi-agent collaboration to optimize content, structure, and visual coherence.
Details
Motivation: Existing approaches neglect hierarchical document structure and semantic integration of textual/visual elements in scientific posters, which are critical for effective communication.
Method: Uses Poster Tree hierarchical intermediate representation to encode document structure and visual-textual relationships. Employs multi-agent collaboration with specialized agents for content summarization and layout planning that iteratively coordinate with mutual feedback.
Result: Outperforms existing baselines in both qualitative and quantitative evaluations across multiple academic domains. Achieves quality closest to expert-designed posters with superior information preservation, structural clarity, and user preference.
Conclusion: The framework successfully addresses hierarchical structure and visual-textual integration challenges in scientific poster generation through hierarchical representation and multi-agent collaboration, producing high-quality results comparable to expert designs.
Abstract: We present a novel training-free framework, \textit{PosterForest}, for automated scientific poster generation. Unlike prior approaches, which largely neglect the hierarchical structure of scientific documents and the semantic integration of textual and visual elements, our method addresses both challenges directly. We introduce the \textit{Poster Tree}, a hierarchical intermediate representation that jointly encodes document structure and visual-textual relationships at multiple levels. Our framework employs a multi-agent collaboration strategy, where agents specializing in content summarization and layout planning iteratively coordinate and provide mutual feedback. This approach enables the joint optimization of logical consistency, content fidelity, and visual coherence. Extensive experiments on multiple academic domains show that our method outperforms existing baselines in both qualitative and quantitative evaluations. The resulting posters achieve quality closest to expert-designed ground truth and deliver superior information preservation, structural clarity, and user preference.
[170] Freeze and Conquer: Reusable Ansatz for Solving the Traveling Salesman Problem
Fabrizio Fagiolo, Nicolò Vescera
Main category: cs.AI
TL;DR: Variational quantum algorithm for TSP with compact permutation encoding and an optimize-freeze-reuse strategy that reduces qubit requirements and eliminates costly structural search at test time.
Details
Motivation: To develop a quantum algorithm for the Traveling Salesman Problem that reduces qubit requirements and enables efficient implementation on NISQ hardware by reusing optimized circuit structures.
Method: Combines compact permutation encoding with optimize-freeze-reuse strategy: circuit topology is optimized on training instances using Simulated Annealing, then frozen and reused on new instances with only parameter re-optimization.
Result: Achieved 100% optimal trip sampling for 4 cities, 90% for 5 cities, 80% for 6 cities, but dropped to ~20% for 7 cities, showing scalability limitations for larger problems.
Conclusion: The method shows robust generalization for moderate problem sizes and dramatically reduces time-to-solution without degrading quality, but faces scalability challenges beyond 6-7 cities.
Abstract: In this paper we present a variational algorithm for the Traveling Salesman Problem (TSP) that combines (i) a compact encoding of permutations, which also reduces the qubit requirement, and (ii) an optimize-freeze-reuse strategy, where the circuit topology (“Ansatz”) is first optimized on a training instance by Simulated Annealing (SA), then “frozen” and re-used on novel instances, limited to a rapid re-optimization of only the circuit parameters. This pipeline eliminates costly structural search at test time, making the procedure immediately implementable on NISQ hardware. On a set of 40 randomly generated symmetric instances spanning 4-7 cities, the resulting Ansatz achieves an average optimal-trip sampling probability of 100% for 4-city cases, 90% for 5-city cases, and 80% for 6-city cases. With 7 cities the success rate drops markedly to an average of about 20%, revealing the onset of scalability limitations of the proposed method. The results show robust generalization ability for moderate problem sizes and indicate how freezing the Ansatz can dramatically reduce time-to-solution without degrading solution quality. The paper also discusses scalability limitations, the impact of “warm-start” initialization of parameters, and prospects for extension to more complex problems, such as Vehicle Routing and Job-Shop Scheduling.
[171] Orientability of Causal Relations in Time Series using Summary Causal Graphs and Faithful Distributions
Timothée Loranchet, Charles K. Assaad
Main category: cs.AI
TL;DR: This paper provides theoretical guarantees for orienting micro-level causal edges in time series using summary causal graphs from expert knowledge, even with macro-level cycles or bidirected edges.
Details
Motivation: Experts can provide high-level causal abstractions (summary causal graphs) but the full micro-level causal structure between temporal variables often remains unknown, creating challenges for causal discovery in time series analysis.
Method: The authors present conditions that guarantee orientability of micro-level edges between temporal variables given background knowledge encoded in a summary causal graph, assuming access to a faithful and causally sufficient distribution.
Result: The results provide theoretical guarantees for edge orientation at the micro-level even in the presence of cycles or bidirected edges at the macro-level.
Conclusion: The findings offer practical guidance for leveraging summary causal graphs to improve causal discovery in complex temporal systems and highlight the value of incorporating expert knowledge for better causal inference from observational time series data.
Abstract: Understanding causal relations between temporal variables is a central challenge in time series analysis, particularly when the full causal structure is unknown. Even when the full causal structure cannot be fully specified, experts often succeed in providing a high-level abstraction of the causal graph, known as a summary causal graph (SCG), which captures the main causal relations between different time series while abstracting away micro-level details. In this work, we present conditions that guarantee the orientability of micro-level edges between temporal variables given the background knowledge encoded in a summary causal graph and assuming having access to a faithful and causally sufficient distribution with respect to the true unknown graph. Our results provide theoretical guarantees for edge orientation at the micro-level, even in the presence of cycles or bidirected edges at the macro-level. These findings offer practical guidance for leveraging SCGs to inform causal discovery in complex temporal systems and highlight the value of incorporating expert knowledge to improve causal inference from observational time series data.
[172] Tree-Guided Diffusion Planner
Hyeonseong Jeon, Cheolhong Min, Jaesik Park
Main category: cs.AI
TL;DR: Tree-guided Diffusion Planner (TDP) is a zero-shot test-time planning framework that uses bi-level sampling with particle guidance for exploration and conditional denoising for exploitation, outperforming existing methods on diverse tasks.
Details
Motivation: Standard gradient guidance struggles with non-convex objectives, non-differentiable constraints, and multi-reward scenarios in real-world planning, while supervised approaches lack test-time flexibility and zero-shot generalization.
Method: TDP frames planning as tree search with bi-level sampling: (1) diverse parent trajectories via training-free particle guidance for exploration, (2) sub-trajectory refinement through fast conditional denoising guided by task objectives using only pretrained models.
Result: TDP consistently outperforms state-of-the-art approaches on maze gold-picking, robot arm block manipulation, and AntMaze multi-goal exploration tasks.
Conclusion: The proposed Tree-guided Diffusion Planner effectively addresses limitations of gradient guidance by balancing exploration and exploitation through structured trajectory generation, enabling superior zero-shot test-time planning performance.
Abstract: Planning with pretrained diffusion models has emerged as a promising approach for solving test-time guided control problems. However, standard gradient guidance typically performs optimally under convex and differentiable reward landscapes, showing substantially reduced effectiveness in real-world scenarios involving non-convex objectives, non-differentiable constraints, and multi-reward structures. Furthermore, recent supervised planning approaches require task-specific training or value estimators, which limits test-time flexibility and zero-shot generalization. We propose a Tree-guided Diffusion Planner (TDP), a zero-shot test-time planning framework that balances exploration and exploitation through structured trajectory generation. We frame test-time planning as a tree search problem using a bi-level sampling process: (1) diverse parent trajectories are produced via training-free particle guidance to encourage broad exploration, and (2) sub-trajectories are refined through fast conditional denoising guided by task objectives. TDP addresses the limitations of gradient guidance by exploring diverse trajectory regions and harnessing gradient information across this expanded solution space using only pretrained models and test-time reward signals. We evaluate TDP on three diverse tasks: maze gold-picking, robot arm block manipulation, and AntMaze multi-goal exploration. TDP consistently outperforms state-of-the-art approaches on all tasks. The project page can be found at: tree-diffusion-planner.github.io.
[173] Policy Expansion for Bridging Offline-to-Online Reinforcement Learning
Haichao Zhang, Wei Xu, Haonan Yu
Main category: cs.AI
TL;DR: Policy Expansion (PEX) method that combines offline pre-training and online RL by maintaining the offline policy while adding a new policy for online learning, allowing adaptive composition of both policies during interaction.
Details
Motivation: To leverage both offline data efficiency and online RL performance while avoiding destruction of useful offline policy behaviors during initial online learning stages.
Method: Uses offline policy as one candidate, expands policy set with new policy for online learning, adaptively composes both policies during environment interaction to retain offline behaviors while enabling new learning.
Result: Experiments on multiple tasks demonstrate effectiveness in maintaining offline policy benefits while achieving improved performance through online learning.
Conclusion: Policy expansion scheme successfully mitigates issues of destroying useful offline behaviors while allowing natural exploration and capturing new useful behaviors through online learning.
Abstract: Pre-training with offline data and online fine-tuning using reinforcement learning is a promising strategy for learning control policies by leveraging the best of both worlds in terms of sample efficiency and performance. One natural approach is to initialize the policy for online learning with the one trained offline. In this work, we introduce a policy expansion scheme for this task. After learning the offline policy, we use it as one candidate policy in a policy set. We then expand the policy set with another policy which will be responsible for further learning. The two policies will be composed in an adaptive manner for interacting with the environment. With this approach, the policy previously learned offline is fully retained during online learning, thus mitigating potential issues such as destroying the useful behaviors of the offline policy in the initial stage of online learning, while allowing the offline policy to participate in the exploration naturally in an adaptive manner. Moreover, new useful behaviors can potentially be captured by the newly added policy through learning. Experiments are conducted on a number of tasks and the results demonstrate the effectiveness of the proposed approach. Code is available at https://github.com/Haichao-Zhang/PEX
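The adaptive composition step can be pictured with a short sketch. This is an illustrative reading of the scheme, assuming a Boltzmann-style choice over critic values; pi_offline, pi_online, and q_value are placeholder callables, not the paper's API.

```python
import numpy as np

def pex_act(state, pi_offline, pi_online, q_value, temperature=1.0):
    """Compose a frozen offline policy with a newly added online policy:
    each proposes an action, and one proposal is sampled with probability
    given by a softmax over the critic's values, so offline behaviour is
    retained yet can be overruled where the online policy looks better."""
    candidates = [pi_offline(state), pi_online(state)]
    q = np.array([q_value(state, a) for a in candidates], dtype=float)
    p = np.exp((q - q.max()) / temperature)
    p /= p.sum()
    return candidates[np.random.choice(len(candidates), p=p)]

# Toy usage with stand-in callables (all names here are placeholders).
pi_off = lambda s: 0          # offline policy always proposes action 0
pi_on = lambda s: 1           # online policy always proposes action 1
q = lambda s, a: float(a)     # critic slightly prefers action 1
print(pex_act(None, pi_off, pi_on, q))
```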
[174] Transforming Wearable Data into Personal Health Insights using Large Language Model Agents
Mike A. Merrill, Akshay Paruchuri, Naghmeh Rezaei, Geza Kovacs, Javier Perez, Yun Liu, Erik Schenck, Nova Hammerquist, Jake Sunshine, Shyam Tailor, Kumar Ayush, Hao-Wei Su, Qian He, Cory Y. McLean, Mark Malhotra, Shwetak Patel, Jiening Zhan, Tim Althoff, Daniel McDuff, Xin Liu
Main category: cs.AI
TL;DR: PHIA is a tool-based LLM agent system that uses code generation and information retrieval to analyze wearable health data, achieving 84% accuracy on numerical questions and 83% favorable ratings on open-ended questions.
Details
Motivation: Standard LLMs struggle with complex numerical reasoning required for personalized health insights from wearable trackers, necessitating specialized tool-based approaches.Method: PHIA leverages multistep reasoning with code generation and information retrieval to analyze behavioral health data from wearable devices.
Result: PHIA significantly outperforms code generation baselines, achieving 84% accuracy on objective numerical questions and 83% favorable ratings on open-ended questions, with twice the likelihood of highest quality ratings.
Conclusion: This work advances behavioral health by enabling accessible, personalized data-driven wellness through improved health data interpretation for the wider population.
Abstract: Deriving personalized insights from popular wearable trackers requires complex numerical reasoning that challenges standard LLMs, necessitating tool-based approaches like code generation. Large language model (LLM) agents present a promising yet largely untapped solution for this analysis at scale. We introduce the Personal Health Insights Agent (PHIA), a system leveraging multistep reasoning with code generation and information retrieval to analyze and interpret behavioral health data. To test its capabilities, we create and share two benchmark datasets with over 4000 health insights questions. A 650-hour human expert evaluation shows that PHIA significantly outperforms a strong code generation baseline, achieving 84% accuracy on objective, numerical questions and, for open-ended ones, earning 83% favorable ratings while being twice as likely to achieve the highest quality rating. This work can advance behavioral health by empowering individuals to understand their data, enabling a new era of accessible, personalized, and data-driven wellness for the wider population.
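The core tool-use step, executing model-generated analysis code against the user's wearable data, can be sketched as follows. The data frame contents and the stand-in for the model's reply are invented for illustration; a deployed agent would sandbox the exec call.

```python
import contextlib
import io

import pandas as pd

def run_generated_code(code, df):
    """Execute model-generated pandas code in a namespace that exposes
    the wearable-data frame as `df`, returning captured stdout as the
    agent's observation."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {"df": df, "pd": pd})
    return buf.getvalue().strip()

# Hypothetical daily summaries and a stand-in for an LLM-written snippet.
df = pd.DataFrame({"steps": [4200, 8100, 9900],
                   "sleep_hours": [6.5, 7.2, 8.0]})
generated = "print(df['steps'].mean())"
print(run_generated_code(generated, df))   # 7400.0
```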
[175] Evaluating Knowledge Graph Based Retrieval Augmented Generation Methods under Knowledge Incompleteness
Dongzhuoran Zhou, Yuqicheng Zhu, Xiaxia Wang, Yuan He, Jiaoyan Chen, Steffen Staab, Evgeny Kharlamov
Main category: cs.AI
TL;DR: KG-RAG methods are sensitive to knowledge graph incompleteness, and current benchmarks don’t adequately test this vulnerability.
Details
Motivation: Real-world knowledge graphs are often incomplete, but existing benchmarks don't properly evaluate how this incompleteness affects KG-RAG performance in question answering tasks.Method: Systematically evaluate KG-RAG methods by removing triples from knowledge graphs using different methods and analyzing the resulting performance effects.
Result: Demonstrated that KG-RAG methods are sensitive to KG incompleteness, showing performance degradation when essential information is missing.
Conclusion: There is a need for more robust KG-RAG approaches that can handle incomplete knowledge graphs in realistic settings.
Abstract: Knowledge Graph based Retrieval-Augmented Generation (KG-RAG) is a technique that enhances Large Language Model (LLM) inference in tasks like Question Answering (QA) by retrieving relevant information from knowledge graphs (KGs). However, real-world KGs are often incomplete, meaning that essential information for answering questions may be missing. Existing benchmarks do not adequately capture the impact of KG incompleteness on KG-RAG performance. In this paper, we systematically evaluate KG-RAG methods under incomplete KGs by removing triples using different methods and analyzing the resulting effects. We demonstrate that KG-RAG methods are sensitive to KG incompleteness, highlighting the need for more robust approaches in realistic settings.
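A minimal version of the perturbation protocol, deleting a controlled fraction of triples before retrieval, might look like the sketch below; the drop strategy and the example triples are illustrative, not the paper's exact removal methods.

```python
import random

def drop_triples(triples, drop_rate=0.5, predicates=None, seed=0):
    """Simulate KG incompleteness by removing a fraction of triples.
    `triples` holds (subject, predicate, object) tuples; restricting
    deletion to `predicates` crudely targets question-relevant facts."""
    rng = random.Random(seed)
    eligible = [i for i, (_, p, _) in enumerate(triples)
                if predicates is None or p in predicates]
    dropped = set(rng.sample(eligible, int(drop_rate * len(eligible))))
    return [t for i, t in enumerate(triples) if i not in dropped]

kg = [("Berlin", "capitalOf", "Germany"),
      ("Germany", "locatedIn", "Europe"),
      ("Ulm", "locatedIn", "Germany")]
print(drop_triples(kg, drop_rate=0.5, predicates={"locatedIn"}))
```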
[176] TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving
Daocheng Fu, Jianlong Chen, Renqiu Xia, Zijun Chen, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Hongyuan Zha, Junchi Yan, Botian Shi, Yu Qiao, Bo Zhang
Main category: cs.AI
TL;DR: TrustGeoGen is a data engine that generates formally verified geometric problems to address LLM hallucination issues in GPS tasks, producing the GeoTrust-200K dataset and a benchmark on which SOTA models reach only 45.83% accuracy.
Details
Motivation: Address the lack of reliable benchmarks and systematic methodologies for geometric problem solving, particularly the inherent hallucination in LLMs that leads to noisy, unverified, and self-contradictory synthetic datasets.Method: Integrates four innovations: 1) Multimodal Alignment for synchronized diagram-text-solution generation, 2) Formal Verification for rule-compliant reasoning paths, 3) Connection Thinking bridging formal deduction with human logic, and 4) GeoExplore algorithms for diverse problem variants with multiple solutions and backtracking.
Result: Created GeoTrust-200K dataset and GeoTrust-test benchmark with guaranteed cross-modal integrity. SOTA models achieve only 45.83% accuracy on GeoTrust-test. Training on synthesized data substantially improves model performance on GPS tasks with strong generalization to OOD benchmarks.
Conclusion: TrustGeoGen provides a principled and trustworthy benchmark for geometric problem solving, effectively addressing LLM hallucination issues and enabling significant performance improvements through formally verified synthetic data generation.
Abstract: Mathematical geometric problem solving (GPS) demands verifiable logical coherence and multimodal reasoning capabilities. While large language models (LLMs) have shown rapid progress in GPS, their advancement is hindered by the lack of reliable benchmarks and systematic methodologies. A critical challenge is the inherent hallucination in LLMs, which leads to synthetic GPS datasets that are often noisy, unverified, and self-contradictory. To address this, we introduce TrustGeoGen, a data engine that generates formally verified geometric problems to establish a principled and trustworthy benchmark. Our engine integrates four key innovations: 1) Multimodal Alignment, which synchronizes the generation of diagrams, text, and step-by-step solutions; 2) Formal Verification, ensuring all reasoning paths are rule-compliant; 3) Connection Thinking, bridging formal deduction with human-like logical steps; and 4) our \textit{GeoExplore} series algorithms, which produce diverse problem variants with multiple solutions and self-reflective backtracking. Using this engine, we create the GeoTrust-200K dataset and the corresponding GeoTrust-test benchmark, both with guaranteed cross-modal integrity. Experiments reveal that state-of-the-art models achieve only 45.83% accuracy on GeoTrust-test, highlighting its significant challenge. Furthermore, training on our synthesized data substantially improves model performance on GPS tasks, with strong generalization to out-of-domain (OOD) benchmarks. Our code and data are available at https://github.com/Alpha-Innovator/TrustGeoGen
[177] Compression versus Accuracy: A Hierarchy of Lifted Models
Jan Speller, Malte Luttermann, Marcel Gehrke, Tanya Braun
Main category: cs.AI
TL;DR: Hierarchical hyperparameter-free approach for lifted probabilistic graphical models that creates a structured hierarchy of ε values and corresponding models, enabling explicit trade-off between compression and accuracy without needing multiple ACP runs.
Details
Motivation: Current ACP algorithm requires choosing ε hyperparameter through trial-and-error, which is inefficient and leads to inconsistent models with poor interpretability between different ε values.Method: Proposes a hierarchical method that computes a structured hierarchy of ε values that guarantee nested model groupings, ensuring factors grouped at smaller ε remain grouped at larger ε values.
Result: Creates a hierarchy of models with corresponding error bounds, allowing users to explicitly choose compression-accuracy trade-offs and maintain interpretability across different model versions.
Conclusion: The hierarchical approach eliminates the need for hyperparameter tuning, provides structured model exploration, and enables better interpretability while maintaining the benefits of lifted inference.
Abstract: Probabilistic graphical models that encode indistinguishable objects and relations among them use first-order logic constructs to compress a propositional factorised model for more efficient (lifted) inference. To obtain a lifted representation, the state-of-the-art algorithm Advanced Colour Passing (ACP) groups factors that represent matching distributions. In an approximate version using $\varepsilon$ as a hyperparameter, factors are grouped that differ by a factor of at most $(1\pm \varepsilon)$. However, finding a suitable $\varepsilon$ is not obvious and may need a lot of exploration, possibly requiring many ACP runs with different $\varepsilon$ values. Additionally, varying $\varepsilon$ can yield wildly different models, leading to decreased interpretability. Therefore, this paper presents a hierarchical approach to lifted model construction that is hyperparameter-free. It efficiently computes a hierarchy of $\varepsilon$ values that ensures a hierarchy of models, meaning that once factors are grouped together given some $\varepsilon$, these factors will be grouped together for larger $\varepsilon$ as well. The hierarchy of $\varepsilon$ values also leads to a hierarchy of error bounds. This allows for explicitly weighing compression versus accuracy when choosing specific $\varepsilon$ values to run ACP with and enables interpretability between the different models.
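The nesting property can be illustrated on scalar stand-ins for factor potentials (real factors are potential tables, so this is only a toy): the merge thresholds are the sorted pairwise (1 ± ε) ratios, and anything grouped at a small ε stays grouped at every larger ε.

```python
import numpy as np

def epsilon_merge_points(potentials):
    """For scalar stand-ins of factor potentials, the smallest epsilon at
    which phi_i and phi_j may be grouped under the (1 +/- epsilon)
    criterion is max(phi_i, phi_j) / min(phi_i, phi_j) - 1. Sorting all
    pairwise thresholds yields a nested hierarchy of groupings."""
    phi = np.asarray(potentials, dtype=float)
    points = set()
    for i in range(len(phi)):
        for j in range(i + 1, len(phi)):
            hi, lo = max(phi[i], phi[j]), min(phi[i], phi[j])
            points.add(hi / lo - 1.0)
    return sorted(points)

print(epsilon_merge_points([1.0, 1.05, 1.5]))  # [0.05, 0.4285..., 0.5]
```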
[178] AI Simulation by Digital Twins: Systematic Survey, Reference Framework, and Mapping to a Standardized Architecture
Xiaoran Liu, Istvan David
Main category: cs.AI
TL;DR: Survey of digital twin-enabled AI simulation to address data insufficiency in AI development, identifying trends and providing architectural guidelines.
Details
Motivation: Address insufficient data volume and quality challenges in subsymbolic AI by leveraging digital twins as high-fidelity virtual training environments.Method: Systematic survey of 22 primary studies to identify technological trends and derive a reference framework mapping to ISO 23247 architecture.
Result: Identified key trends in digital twin-enabled AI simulation and provided architectural guidelines for implementation.
Conclusion: Digital twins offer promising avenues for AI simulation but present challenges that require further research and development.
Abstract: Insufficient data volume and quality are particularly pressing challenges in the adoption of modern subsymbolic AI. To alleviate these challenges, AI simulation uses virtual training environments in which AI agents can be safely and efficiently developed with simulated, synthetic data. Digital twins open new avenues in AI simulation, as these high-fidelity virtual replicas of physical systems are equipped with state-of-the-art simulators and the ability to further interact with the physical system for additional data collection. In this article, we report on our systematic survey of digital twin-enabled AI simulation. By analyzing 22 primary studies, we identify technological trends and derive a reference framework to situate digital twins and AI components. Based on our findings, we provide architectural guidelines by mapping this framework onto the ISO 23247 reference architecture for digital twins. Finally, we identify challenges and research opportunities for prospective researchers.
[179] QHackBench: Benchmarking Large Language Models for Quantum Code Generation Using PennyLane Hackathon Challenges
Abdul Basit, Minghao Shao, Muhammad Haider Asif, Nouhaila Innan, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique
Main category: cs.AI
TL;DR: This paper benchmarks LLMs for quantum code generation using a novel QHackBench dataset from real quantum hackathon challenges, evaluating both standard prompting and RAG approaches with a focus on functional correctness and execution success.
Details
Motivation: Large Language Models show strong potential in code generation but their effectiveness in quantum computing remains underexplored, particularly for PennyLane-based quantum programming.Method: Created QHackBench benchmark dataset from Quantum Hackathon competitions, evaluated LLMs under vanilla prompting and Retrieval-Augmented Generation (RAG) with augmented PennyLane dataset, and introduced a multi-agent evaluation pipeline for iterative solution refinement.
Result: RAG-enhanced models supplemented with the augmented PennyLane dataset produced results comparable to standard prompting, particularly on complex quantum algorithms. The multi-agent pipeline further improved execution success rates by refining incorrect solutions.
Conclusion: The study provides a comprehensive evaluation framework for LLMs in quantum code generation and commits to publicly releasing QHackBench, the evaluation framework, and experimental results to advance AI-assisted quantum programming research.
Abstract: Recent advances in Large Language Models (LLMs) have demonstrated strong potential in code generation, yet their effectiveness in quantum computing remains underexplored. This paper benchmarks LLMs for PennyLane-based quantum code generation using real-world challenges from the Quantum Hackathon (QHack). We introduce QHackBench, a novel benchmark dataset derived from QHack competitions, and evaluate model performance under vanilla prompting and Retrieval-Augmented Generation (RAG). Our structured evaluation framework assesses functional correctness, syntactic validity, and execution success across varying challenge difficulties. Results indicate that RAG-enhanced models, supplemented with an augmented PennyLane dataset, generate results roughly on par with standard prompting, particularly in complex quantum algorithms. Additionally, we introduce a multi-agent evaluation pipeline that iteratively refines incorrect solutions, further enhancing execution success rates. To foster further research, we commit to publicly releasing QHackBench, along with our evaluation framework and experimental results, enabling continued advancements in AI-assisted quantum programming.
[180] What Breaks Knowledge Graph based RAG? Empirical Insights into Reasoning under Incomplete Knowledge
Dongzhuoran Zhou, Yuqicheng Zhu, Xiaxia Wang, Hongkuan Zhou, Yuan He, Jiaoyan Chen, Steffen Staab, Evgeny Kharlamov
Main category: cs.AI
TL;DR: A new benchmark and evaluation protocol for KG-RAG methods that tests reasoning under knowledge incompleteness, revealing current methods’ limitations in reasoning and over-reliance on memorization.
Details
Motivation: Current KG-RAG evaluation practices are inadequate because benchmarks contain questions that can be directly answered from knowledge graphs, making it unclear if models actually reason or just retrieve. Inconsistent metrics and lenient answer matching further hinder meaningful comparisons.Method: Introduces a general method for constructing benchmarks with an evaluation protocol specifically designed to assess KG-RAG methods under conditions of knowledge incompleteness.
Result: Empirical results show current KG-RAG methods have limited reasoning ability when knowledge is missing, often rely on internal memorization rather than true reasoning, and exhibit varying generalization capabilities depending on their design.
Conclusion: The proposed benchmark and evaluation protocol provide a more rigorous way to assess KG-RAG methods, revealing significant limitations in current approaches that need to be addressed for true reasoning capabilities.
Abstract: Knowledge Graph-based Retrieval-Augmented Generation (KG-RAG) is an increasingly explored approach for combining the reasoning capabilities of large language models with the structured evidence of knowledge graphs. However, current evaluation practices fall short: existing benchmarks often include questions that can be directly answered using existing triples in KG, making it unclear whether models perform reasoning or simply retrieve answers directly. Moreover, inconsistent evaluation metrics and lenient answer matching criteria further obscure meaningful comparisons. In this work, we introduce a general method for constructing benchmarks, together with an evaluation protocol, to systematically assess KG-RAG methods under knowledge incompleteness. Our empirical results show that current KG-RAG methods have limited reasoning ability under missing knowledge, often rely on internal memorization, and exhibit varying degrees of generalization depending on their design.
cs.SD
[181] WaveLLDM: Design and Development of a Lightweight Latent Diffusion Model for Speech Enhancement and Restoration
Kevin Putra Santoso, Rizka Wakhidatus Sholikah, Raden Venantius Hari Ginardi
Main category: cs.SD
TL;DR: WaveLLDM is a lightweight latent diffusion model for audio restoration that uses compressed latent space processing to reduce computational complexity while maintaining good spectral reconstruction quality, though it underperforms state-of-the-art methods in perceptual quality.
Details
Motivation: High-quality audio is essential but degradation from noise, compression, and transmission artifacts remains challenging. Existing diffusion models require significant computational resources and struggle with longer missing segments.Method: WaveLLDM integrates an efficient neural audio codec with latent diffusion for audio restoration, processing audio in compressed latent space instead of time or spectral domains to reduce computational complexity.
Result: Achieves accurate spectral reconstruction with low LSD scores (0.48-0.60) and good adaptability to unseen data, but underperforms in perceptual quality (WB-PESQ: 1.62-1.71) and speech clarity (STOI: 0.76-0.78) compared to state-of-the-art methods.
Conclusion: While current implementation has limitations due to suboptimal tuning and insufficient training, the flexible architecture combining neural audio codec and latent diffusion provides a strong foundation for future development in efficient audio restoration.
Abstract: High-quality audio is essential in a wide range of applications, including online communication, virtual assistants, and the multimedia industry. However, degradation caused by noise, compression, and transmission artifacts remains a major challenge. While diffusion models have proven effective for audio restoration, they typically require significant computational resources and struggle to handle longer missing segments. This study introduces WaveLLDM (Wave Lightweight Latent Diffusion Model), an architecture that integrates an efficient neural audio codec with latent diffusion for audio restoration and denoising. Unlike conventional approaches that operate in the time or spectral domain, WaveLLDM processes audio in a compressed latent space, reducing computational complexity while preserving reconstruction quality. Empirical evaluations on the Voicebank+DEMAND test set demonstrate that WaveLLDM achieves accurate spectral reconstruction with low Log-Spectral Distance (LSD) scores (0.48 to 0.60) and good adaptability to unseen data. However, it still underperforms compared to state-of-the-art methods in terms of perceptual quality and speech clarity, with WB-PESQ scores ranging from 1.62 to 1.71 and STOI scores between 0.76 and 0.78. These limitations are attributed to suboptimal architectural tuning, the absence of fine-tuning, and insufficient training duration. Nevertheless, the flexible architecture that combines a neural audio codec and latent diffusion model provides a strong foundation for future development.
[182] RARR : Robust Real-World Activity Recognition with Vibration by Scavenging Near-Surface Audio Online
Dong Yoon Lee, Alyssa Weakley, Hui Wei, Blake Brown, Keyana Carrion, Shijia Pan
Main category: cs.SD
TL;DR: A scalable vibration sensor system that uses synthesized acoustic data for pretraining and fine-tuning with limited real data to enable robust daily routine tracking for dementia patients living alone.
Details
Motivation: Address the challenges of remote monitoring for dementia patients living alone, including privacy concerns, activity recognition limitations, and the need for large labeled datasets in real-world deployments.Method: Uses structural vibration sensors to monitor human activities unobtrusively. The solution adapts synthesized data from near-surface acoustic audio to pretrain models, then fine-tunes with very limited real-world data for robust activity recognition.
Result: Creates a framework that reduces the need for substantial labeled data while maintaining accurate activity recognition capabilities in end users’ homes.
Conclusion: The proposed scalable solution enables effective daily routine tracking for remote dementia caregiving by overcoming data scarcity and privacy preservation challenges through synthesized data pretraining and minimal fine-tuning.
Abstract: One in four people with dementia live alone, leading family members to take on caregiving roles from a distance. Many researchers have developed remote monitoring solutions to lessen caregiving needs; however, limitations remain, including privacy-preserving solutions, activity recognition, and model generalizability to new users and environments. Structural vibration sensor systems are unobtrusive solutions that have been proven to accurately monitor human information, such as identification and activity recognition, in controlled settings by sensing surface vibrations generated by activities. However, when deploying in an end user’s home, current solutions require a substantial amount of labeled data for accurate activity recognition. Our scalable solution adapts synthesized data from near-surface acoustic audio to pretrain a model and allows fine-tuning with very limited data in order to create a robust framework for daily routine tracking.
[183] Full-Frequency Temporal Patching and Structured Masking for Enhanced Audio Classification
Aditya Makineni, Baocheng Geng, Qing Tian
Main category: cs.SD
TL;DR: FFTP patching strategy and SpecMask augmentation improve audio classification by preserving frequency patterns, reducing computation by 83%, and boosting performance on AudioSet and SpeechCommands datasets.
Details
Motivation: Existing audio models use square patching from computer vision, which disrupts frequency patterns and creates too many patches, slowing training and increasing computation.Method: Proposed Full-Frequency Temporal Patching (FFTP) that spans full frequency bands with localized temporal context, and SpecMask augmentation combining full-frequency and time-frequency masks under fixed budget.
Result: Improves mAP by up to +6.76 on AudioSet-18k and accuracy by +8.46 on SpeechCommandsV2, while reducing computation by up to 83.26%.
Conclusion: FFTP with SpecMask provides both performance gains and computational efficiency for audio spectrogram modeling, better matching the time-frequency asymmetry of audio data.
Abstract: Transformers and State-Space Models (SSMs) have advanced audio classification by modeling spectrograms as sequences of patches. However, existing models such as the Audio Spectrogram Transformer (AST) and Audio Mamba (AuM) adopt square patching from computer vision, which disrupts continuous frequency patterns and produces an excessive number of patches, slowing training, and increasing computation. We propose Full-Frequency Temporal Patching (FFTP), a patching strategy that better matches the time-frequency asymmetry of spectrograms by spanning full frequency bands with localized temporal context, preserving harmonic structure, and significantly reducing patch count and computation. We also introduce SpecMask, a patch-aligned spectrogram augmentation that combines full-frequency and localized time-frequency masks under a fixed masking budget, enhancing temporal robustness while preserving spectral continuity. When applied on both AST and AuM, our patching method with SpecMask improves mAP by up to +6.76 on AudioSet-18k and accuracy by up to +8.46 on SpeechCommandsV2, while reducing computation by up to 83.26%, demonstrating both performance and efficiency gains.
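The patching idea reduces to a reshape. The sketch below assumes a 128-bin spectrogram and a time_width of 16 frames (both invented values) and shows the patch-count reduction relative to 16x16 square patching.

```python
import torch

def fftp_patches(spec, time_width=16):
    """Full-Frequency Temporal Patching, sketched: every patch spans the
    whole frequency axis over a short temporal window, so harmonic stacks
    stay intact and far fewer patches are produced than with ViT-style
    square patching (time_width is an assumed value).

    spec: (batch, freq_bins, time_frames) log-spectrogram.
    returns: (batch, num_patches, freq_bins * time_width)
    """
    b, f, t = spec.shape
    spec = spec[:, :, : (t // time_width) * time_width]   # drop ragged tail
    patches = spec.unfold(2, time_width, time_width)      # (b, f, n, w)
    return patches.permute(0, 2, 1, 3).reshape(b, -1, f * time_width)

x = torch.randn(2, 128, 1024)
print(fftp_patches(x).shape)   # 64 patches, vs. 512 for 16x16 squares
```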
[184] DRASP: A Dual-Resolution Attentive Statistics Pooling Framework for Automatic MOS Prediction
Cheng-Yeh Yang, Kuan-Tang Huang, Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen
Main category: cs.SD
TL;DR: DRASP framework combines global statistics and local attentive pooling for better MOS prediction, achieving 10.39% improvement over average pooling.
Details
Motivation: Existing pooling methods for MOS prediction operate at single granularity (either global or frame-level), missing complementary perceptual insights from both perspectives.Method: Dual-Resolution Attentive Statistics Pooling (DRASP) integrates coarse-grained global statistical summaries with fine-grained attentive analysis of perceptually significant segments.
Result: Consistently outperforms baseline methods across diverse datasets, MOS prediction backbones, and audio generation systems with 10.39% relative improvement in SRCC.
Conclusion: DRASP provides more thorough and robust representation by capturing both overarching structural context and salient local details concurrently for improved MOS prediction.
Abstract: A pooling mechanism is essential for mean opinion score (MOS) prediction, facilitating the transformation of variable-length audio features into a concise fixed-size representation that effectively encodes speech quality. Existing pooling methods typically operate at a singular granularity, concentrating either on a comprehensive global perspective or a detailed frame-level analysis, which may overlook complementary perceptual insights. To address this limitation, we introduce the Dual-Resolution Attentive Statistics Pooling (DRASP) framework. DRASP integrates both coarse-grained, global statistical summaries and fine-grained, attentive analyses of perceptually significant segments. This dual-view architecture empowers our model to formulate a more thorough and robust representation, capturing both the overarching structural context and salient local details concurrently. Extensive experiments validate the effectiveness and strong generalization ability of the proposed framework. It consistently outperforms various baseline methods across diverse datasets (MusicEval and AES-Natural), MOS prediction backbones (including a CLAP-based model and AudioBox-Aesthetics), and different audio generation systems, achieving a relative improvement of 10.39% in system-level Spearman’s rank correlation coefficient (SRCC) over the widely-used average pooling approach.
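A dual-resolution pooling head in this spirit can be written in a few lines; the layer sizes and the exact attention form below are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DualResolutionPool(nn.Module):
    """Sketch of a dual-resolution pooling head: a coarse global
    statistics branch (mean + std over time) concatenated with a
    fine-grained attentive statistics branch."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(),
                                  nn.Linear(dim, 1))

    def forward(self, x):                    # x: (batch, time, dim)
        g = torch.cat([x.mean(1), x.std(1)], dim=-1)      # global view
        w = torch.softmax(self.attn(x), dim=1)            # (b, t, 1)
        mu = (w * x).sum(1)                               # attentive mean
        var = (w * (x - mu.unsqueeze(1)) ** 2).sum(1)     # attentive var
        a = torch.cat([mu, var.clamp_min(1e-8).sqrt()], dim=-1)
        return torch.cat([g, a], dim=-1)                  # (b, 4*dim)

pool = DualResolutionPool(dim=256)
print(pool(torch.randn(8, 300, 256)).shape)   # torch.Size([8, 1024])
```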
[185] Adaptive Central Frequencies Locally Competitive Algorithm for Speech
Soufiyan Bahadi, Eric Plourde, Jean Rouat
Main category: cs.SD
TL;DR: Enhanced neuromorphic speech processing using adaptive sparse coding with dynamic parameter optimization for improved efficiency and lower power consumption on neuromorphic hardware.
Details
Motivation: To address power constraints in edge devices by developing more efficient neuromorphic speech processing methods that maintain accuracy while reducing energy consumption.Method: Developed ALCA-CF (Adaptive LCA Central Frequency) that dynamically adjusts both modulation parameters and central frequencies in Gammatone/Gammachirp filter banks for optimized sparse coding.
Result: Improved reconstruction quality and sparsity while significantly reducing power consumption for speech classification on Intel’s Loihi 2 neuromorphic chip without compromising classification accuracy.
Conclusion: The ALCA-CF approach successfully optimizes neuromorphic speech processing by dynamically adapting key parameters, achieving better efficiency and lower power consumption while maintaining performance.
Abstract: Neuromorphic computing, inspired by nervous systems, revolutionizes information processing with its focus on efficiency and low power consumption. Using sparse coding, this paradigm enhances processing efficiency, which is crucial for edge devices with power constraints. The Locally Competitive Algorithm (LCA), adapted for audio with Gammatone and Gammachirp filter banks, provides an efficient sparse coding method for neuromorphic speech processing. Adaptive LCA (ALCA) further refines this method by dynamically adjusting modulation parameters, thereby improving reconstruction quality and sparsity. This paper introduces an enhanced ALCA version, the ALCA Central Frequency (ALCA-CF), which dynamically adapts both modulation parameters and central frequencies, optimizing the speech representation. Evaluations show that this approach improves reconstruction quality and sparsity while significantly reducing the power consumption of speech classification, without compromising classification accuracy, particularly on Intel’s Loihi 2 neuromorphic chip.
[186] Robust Localization of Partially Fake Speech: Metrics and Out-of-Domain Evaluation
Hieu-Thi Luong, Inbal Rimon, Haim Permuter, Kong Aik Lee, Eng Siong Chng
Main category: cs.SD
TL;DR: Current partial audio deepfake localization methods show strong in-domain performance but poor generalization to out-of-domain data, highlighting limitations of EER-focused evaluation and the need for threshold-dependent metrics.
Details
Motivation: To critically examine the limitations of current evaluation practices in partial audio deepfake localization and address the poor real-world generalization of existing methods.Method: Reframing localization as sequential anomaly detection and using threshold-dependent metrics (accuracy, precision, recall, F1-score). Analyzing CFPRF framework performance on in-domain and out-of-domain datasets.
Result: CFPRF achieved 7.61% EER on in-domain data but 43.25% and 27.59% on out-of-domain sets. Reproduced model showed worse in-domain (9.84%) but better out-of-domain performance (41.72% and 14.98%). Adding partially fake training data improves performance while adding bona fide or fully synthetic data degrades it.
Conclusion: Over-optimizing for in-domain EER leads to poor real-world performance. Deep learning models generalize poorly to novel scenarios, and evaluation should use threshold-dependent metrics that better reflect deployment readiness.
Abstract: Partial audio deepfake localization poses unique challenges and remains underexplored compared to full-utterance spoofing detection. While recent methods report strong in-domain performance, their real-world utility remains unclear. In this analysis, we critically examine the limitations of current evaluation practices, particularly the widespread use of Equal Error Rate (EER), which often obscures generalization and deployment readiness. We propose reframing the localization task as a sequential anomaly detection problem and advocate for the use of threshold-dependent metrics such as accuracy, precision, recall, and F1-score, which better reflect real-world behavior. Specifically, we analyze the performance of the open-source Coarse-to-Fine Proposal Refinement Framework (CFPRF), which achieves a 20-ms EER of 7.61% on the in-domain PartialSpoof evaluation set, but 43.25% and 27.59% on the LlamaPartialSpoof and Half-Truth out-of-domain test sets. Interestingly, our reproduced version of the same model performs worse on in-domain data (9.84%) but better on the out-of-domain sets (41.72% and 14.98%, respectively). This highlights the risks of over-optimizing for in-domain EER, which can lead to models that perform poorly in real-world scenarios. It also suggests that while deep learning models can be effective on in-domain data, they generalize poorly to out-of-domain scenarios, failing to detect novel synthetic samples and misclassifying unfamiliar bona fide audio. Finally, we observe that adding more bona fide or fully synthetic utterances to the training data often degrades performance, whereas adding partially fake utterances improves it.
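The threshold-dependent scoring the authors advocate is straightforward to compute at the frame level. The sketch below treats each 20-ms frame as a binary fake/bona-fide decision; it is a generic formulation, not the paper's evaluation code.

```python
import numpy as np

def frame_level_scores(pred, truth):
    """Threshold-dependent metrics over per-frame fake (1) / bona fide
    (0) decisions, e.g. one decision per 20-ms frame."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    tp = np.sum((pred == 1) & (truth == 1))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"accuracy": float(np.mean(pred == truth)),
            "precision": float(precision), "recall": float(recall),
            "f1": float(f1)}

print(frame_level_scores([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))
```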
[187] Adaptive Duration Model for Text Speech Alignment
Junjie Cao
Main category: cs.SD
TL;DR: Novel duration prediction framework for speech-to-text alignment in TTS models that improves phoneme-level duration accuracy and zero-shot TTS robustness.
Details
Motivation: Current TTS models rely on attention mechanisms or external duration sources for speech-text alignment, which may lack precision and adaptation capabilities.Method: Proposed a novel duration prediction framework that provides a phoneme-level duration distribution for a given text input.
Result: The model achieves more precise duration predictions and better adaptation to conditions compared to baseline models, with significant improvements in phoneme-level alignment accuracy.
Conclusion: The proposed duration prediction framework enhances TTS performance by providing more accurate alignments and making zero-shot TTS models more robust to audio mismatches.
Abstract: Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments on-line, while non-autoregressive end-to-end TTS models rely on durations extracted from external sources. In this paper, we propose a novel duration prediction framework that can give a promising phoneme-level duration distribution for given text. In our experiments, the proposed duration model makes more precise predictions and adapts better to conditions than previous baseline models. Specifically, it considerably improves phoneme-level alignment accuracy and makes the performance of zero-shot TTS models more robust to the mismatch between prompt audio and input audio.
cs.LG
[188] Normalisation of SWIFT Message Counterparties with Feature Extraction and Clustering
Thanasis Schoinas, Benjamin Guinard, Diba Esbati, Richard Chalk
Main category: cs.LG
TL;DR: Proposes hybrid clustering pipeline for transaction counterparty data using string similarity, topic modeling, and hierarchical clustering to overcome limitations of NLP models and fuzzy matching in banking systems.
Details
Motivation: Natural language models fail with short, unstructured transaction counterparty data in banking systems (like SWIFT), creating gaps in fraud detection and payment tracing that current fuzzy matching tools inadequately address.Method: Hybrid pipeline combining string similarity, topic modeling, hierarchical clustering, and rule-based approaches that handles unknown cluster numbers and uses precision/recall metrics for evaluation.
Result: Significantly outperforms baseline rule-based keyword approach on real-life labeled dataset, improves interpretability, reduces manual review needs, and provides better risk control in sanctions investigations.
Conclusion: The hybrid approach effectively clusters transaction counterparties where traditional NLP fails, offering improved performance while maintaining interpretability and reducing manual effort in financial investigations.
Abstract: Short text clustering is a known use case in the text analytics community. When the structure and content fall in the natural language domain, e.g. Twitter posts or instant messages, then natural language techniques can be used, provided texts are of sufficient length to allow for use of (pre)trained models to extract meaningful information, such as part-of-speech or topic annotations. However, natural language models are not suitable for clustering transaction counterparties, as they are found in bank payment messaging systems, such as SWIFT. The manually typed tags are typically physical or legal entity details, which lack sentence structure, while containing all the variations and noise that manual entry introduces. This leaves a gap in an investigator or counter-fraud professional’s toolset when looking to augment their knowledge of payment flow originator and beneficiary entities and trace funds and assets, a gap that vendors traditionally try to close with fuzzy matching tools. With these considerations in mind, we are proposing a hybrid string similarity, topic modelling, hierarchical clustering and rule-based pipeline to facilitate clustering of transaction counterparties, also catering for an unknown number of expected clusters. We are also devising metrics to supplement the evaluation of the approach, based on the well-known measures of precision and recall. Testing on a real-life labelled dataset demonstrates significantly improved performance over a baseline rule-based (‘keyword’) approach. The approach retains most of the interpretability found in rule-based systems, as the former adds an additional level of cluster refinement to the latter. The resulting workflow reduces the need for manual review. When only a subset of the population needs to be investigated, such as in sanctions investigations, the approach allows for better control of the risks of missing entity variations.
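The string-similarity plus hierarchical-clustering core of such a pipeline, with the cluster count left open, can be sketched as below. The character n-gram features, distance threshold, and example names are assumptions, and the topic-modelling and rule-based stages are omitted.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_counterparties(names, distance_threshold=0.6):
    """Character n-gram TF-IDF absorbs manual-entry noise in typed
    counterparty tags; a distance threshold (assumed value) lets
    agglomerative clustering choose the number of clusters itself."""
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    X = vec.fit_transform(names).toarray()
    model = AgglomerativeClustering(n_clusters=None,
                                    distance_threshold=distance_threshold,
                                    metric="cosine", linkage="average")
    return model.fit_predict(X)

names = ["ACME HOLDINGS LTD", "Acme Holdings Limited",
         "JOHN SMITH", "J. SMITH", "GLOBEX CORP"]
print(cluster_counterparties(names))   # e.g. [0 0 1 1 2]
```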
[189] Beyond Prediction: Reinforcement Learning as the Defining Leap in Healthcare AI
Dilruk Perera, Gousia Habib, Qianyi Xu, Daniel J. Tan, Kai He, Erik Cambria, Mengling Feng
Main category: cs.LG
TL;DR: This survey paper explores reinforcement learning’s transformative role in healthcare, analyzing RL techniques, applications across medical domains, and addressing ethical/deployment challenges.
Details
Motivation: To demonstrate how RL represents a fundamental shift from predictive AI to agentive clinical intelligence that actively makes intervention decisions with long-term goals in healthcare.Method: The paper structures the RL landscape including model-based/model-free methods, offline approaches, and reward specification strategies. It comprehensively analyzes applications across critical care, chronic disease, mental health, diagnostics, and robotics.
Result: The survey identifies trends, gaps, and translational bottlenecks in healthcare RL applications while critically analyzing ethical, deployment, and reward design challenges.
Conclusion: RL serves as more than just prediction tools - it represents a shift toward agentive intelligence in clinical environments, requiring careful consideration of safety and human alignment for successful healthcare implementation.
Abstract: Reinforcement learning (RL) marks a fundamental shift in how artificial intelligence is applied in healthcare. Instead of merely predicting outcomes, RL actively decides interventions with long-term goals. Unlike traditional models that operate on fixed associations, RL systems learn through trial, feedback, and long-term reward optimization, introducing transformative possibilities and new risks. From an information fusion lens, healthcare RL typically integrates multi-source signals such as vitals, labs, clinical notes, imaging, and device telemetry using temporal and decision-level mechanisms. These systems can operate within centralized, federated, or edge architectures to meet real-time clinical constraints, and naturally span the data, feature, and decision fusion levels. This survey explores RL’s rise in healthcare as more than a set of tools: rather, a shift toward agentive intelligence in clinical environments. We first structure the landscape of RL techniques, including model-based and model-free methods, offline and batch-constrained approaches, and emerging strategies for reward specification and uncertainty calibration, through the lens of healthcare constraints. We then comprehensively analyze RL applications spanning critical care, chronic disease, mental health, diagnostics, and robotic assistance, identifying their trends, gaps, and translational bottlenecks. In contrast to prior reviews, we critically analyze RL’s ethical, deployment, and reward design challenges, and synthesize lessons for safe, human-aligned policy learning. This paper serves as both a technical roadmap and a critical reflection of RL’s emerging transformative role in healthcare AI: not as prediction machinery, but as agentive clinical intelligence.
[190] Designing Dynamic Pricing for Bike-sharing Systems via Differentiable Agent-based Simulation
Tatsuya Mitomi, Fumiyasu Makinoshima, Fumiya Makihara, Eigo Segawa
Main category: cs.LG
TL;DR: Differentiable agent-based simulation for dynamic pricing in bike-sharing systems to balance inventory and reduce relocation costs through probabilistic user modeling.
Details
Motivation: Bike-sharing systems face imbalanced inventory due to spatiotemporally varying user demands, leading to additional relocation costs. Managing user demand through optimal dynamic pricing is essential but challenging due to diverse user backgrounds and probabilistic choices.Method: Developed a differentiable agent-based simulation to rapidly design dynamic pricing policies. The approach models probabilistic user decisions and spatiotemporally heterogeneous trips to achieve balanced bicycle inventory.
Result: The method achieved 73-78% reduction in loss compared to conventional methods with over 100x faster convergence speed. Successfully validated on large-scale scenarios (289 stations, 1156 parameters) showing balanced inventory without manual relocation. Discount costs were minimized with appropriate initial conditions.
Conclusion: The differentiable agent-based simulation provides an effective and efficient approach for dynamic pricing in bike-sharing systems, enabling natural inventory balancing without manual intervention while optimizing discount costs.
Abstract: Bike-sharing systems are emerging in various cities as a new ecofriendly transportation system. In these systems, spatiotemporally varying user demands lead to imbalanced inventory at bicycle stations, resulting in additional relocation costs. Therefore, it is essential to manage user demand through optimal dynamic pricing for the system. However, optimal pricing design for such a system is challenging because the system involves users with diverse backgrounds and their probabilistic choices. To address this problem, we develop a differentiable agent-based simulation to rapidly design dynamic pricing in bike-sharing systems, achieving balanced bicycle inventory despite spatiotemporally heterogeneous trips and probabilistic user decisions. We first validate our approach against conventional methods through numerical experiments involving 25 bicycle stations and five time slots, yielding 100 parameters. Compared to the conventional methods, our approach obtains a more accurate solution with a 73% to 78% reduction in loss while achieving more than a 100-fold increase in convergence speed. We further validate our approach on a large-scale urban bike-sharing system scenario involving 289 bicycle stations, resulting in a total of 1156 parameters. Through simulations using the obtained pricing policies, we confirm that these policies can naturally induce balanced inventory without any manual relocation. Additionally, we find that the cost of discounts to induce the balanced inventory can be minimized by setting appropriate initial conditions.
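The differentiability hinges on relaxing discrete user choices. The toy below uses a Gumbel-softmax relaxation, one standard way to do this; whether the paper uses this exact relaxation is not stated in the summary, and all numbers are invented.

```python
import torch
import torch.nn.functional as F

# Relax each user's discrete station choice with Gumbel-softmax so the
# inventory-imbalance loss has gradients with respect to the price
# parameters (a toy with one user and three stations).
prices = torch.zeros(3, requires_grad=True)      # discounts at 3 stations
base_utility = torch.tensor([0.2, 0.0, -0.1])    # assumed user utilities
choice = F.gumbel_softmax(base_utility + prices, tau=0.5)  # soft one-hot
target = torch.full((3,), 1 / 3)                 # balanced-inventory goal
loss = ((choice - target) ** 2).sum()
loss.backward()
print(prices.grad)   # gradient signal for tuning the pricing policy
```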
[191] Spatiotemporal EEG-Based Emotion Recognition Using SAM Ratings from Serious Games with Hybrid Deep Learning
Abdul Rehman, Ilona Heldal, Jerry Chun-Wei Lin
Main category: cs.LG
TL;DR: A unified EEG emotion classification framework using LSTM-GRU achieves state-of-the-art performance (F1-score 0.932) on binary valence and multi-class/multi-label emotion recognition tasks using the GAMEEMO dataset.
Details
Motivation: Existing EEG-based emotion recognition studies focus narrowly on binary valence prediction or subject-specific classification, limiting generalizability and real-world deployment in affective computing systems.Method: Proposes a multigranularity EEG emotion classification framework with structured preprocessing (temporal window segmentation, hybrid statistical/frequency-domain feature extraction, z-score normalization). Evaluates Random Forest, XGBoost, SVM, LSTM, LSTM-GRU, and CNN-LSTM models on three emotion encoding axes: binary valence, multi-class, and fine-grained multi-label representation.
Result: LSTM-GRU model consistently outperforms all others, achieving F1-score of 0.932 in binary valence task, 94.5% accuracy in multi-class emotion classification, and 90.6% in multi-label emotion classification.
Conclusion: The proposed unified framework demonstrates superior performance in EEG-based emotion recognition across multiple granularity levels, with LSTM-GRU emerging as the most effective architecture for generalizable affective computing applications.
Abstract: Recent advancements in EEG-based emotion recognition have shown promising outcomes using both deep learning and classical machine learning approaches; however, most existing studies focus narrowly on binary valence prediction or subject-specific classification, which limits generalizability and deployment in real-world affective computing systems. To address this gap, this paper presents a unified, multigranularity EEG emotion classification framework built on the GAMEEMO dataset, which consists of 14-channel EEG recordings and continuous self-reported emotion ratings (boring, horrible, calm, and funny) from 28 subjects across four emotion-inducing gameplay scenarios. Our pipeline employs a structured preprocessing strategy that comprises temporal window segmentation, hybrid statistical and frequency-domain feature extraction, and z-score normalization to convert raw EEG signals into robust, discriminative input vectors. Emotion labels are derived and encoded across three complementary axes: (i) binary valence classification based on the averaged polarity of positive and negative emotion ratings, and (ii) Multi-class emotion classification, where the presence of the most affective state is predicted. (iii) Fine-grained multi-label representation via binning each emotion into 10 ordinal classes. We evaluate a broad spectrum of models, including Random Forest, XGBoost, and SVM, alongside deep neural architectures such as LSTM, LSTM-GRU, and CNN-LSTM. Among these, the LSTM-GRU model consistently outperforms the others, achieving an F1-score of 0.932 in the binary valence task and 94.5% and 90.6% in both multi-class and Multi-Label emotion classification.
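A hybrid LSTM-GRU classifier of the kind evaluated here is compact to define; the widths, depth, and final-timestep readout below are assumptions, with only the 14-channel input taken from the dataset description.

```python
import torch
import torch.nn as nn

class LSTMGRUClassifier(nn.Module):
    """Illustrative LSTM-GRU hybrid for windowed EEG feature sequences."""
    def __init__(self, n_features=14, hidden=64, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):            # x: (batch, time, n_features)
        h, _ = self.lstm(x)
        h, _ = self.gru(h)
        return self.fc(h[:, -1])     # classify from the final timestep

model = LSTMGRUClassifier()
print(model(torch.randn(8, 128, 14)).shape)   # torch.Size([8, 2])
```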
[192] PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning
Wenfeng Feng, Penghong Zhao, Guochao Jiang, Chuzhan Hao, Yuewei Zhang, Hao Wang
Main category: cs.LG
TL;DR: PVPO is a critic-free RL method that uses advantage reference anchors and data pre-sampling to reduce computational costs and avoid local optima, achieving SOTA performance across multiple domains.
Details
Motivation: Existing critic-free RL methods rely heavily on multiple sampling and intra-group comparisons for advantage estimation, which can lead to local optima and high computational costs.Method: Uses a reference model to rollout in advance and calculate reward scores as reference anchors, plus data pre-sampling to assess sample difficulty and select high-gain data.
Result: Achieves State-Of-The-Art performance on nine datasets across two domains, with robust generalization across tasks and scalable performance across model scales.
Conclusion: PVPO effectively addresses the limitations of traditional critic-free RL methods by reducing cumulative bias and computational dependency while improving training efficiency and performance.
Abstract: Critic-free reinforcement learning methods, particularly group policies, have attracted considerable attention for their efficiency in complex tasks. However, these methods rely heavily on multiple sampling and comparisons within the policy to estimate the advantage, which may cause the policy to fall into a local optimum and increases computational cost. To address these issues, we propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Specifically, we use the reference model to roll out in advance and employ the calculated reward score as a reference anchor. Our approach effectively corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts. Meanwhile, the reference model can assess sample difficulty during data pre-sampling, enabling effective selection of high-gain data to improve training efficiency. Experiments conducted on nine datasets across two domains demonstrate that PVPO achieves State-Of-The-Art (SOTA) performance. Our approach not only demonstrates robust generalization across multiple tasks, but also exhibits scalable performance across models of varying scales.
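One illustrative reading of the anchor idea, with the normalisation choice and pre-sampling thresholds as assumptions rather than the paper's specification:

```python
import numpy as np

def pvpo_like_advantages(rewards, ref_anchor, eps=1e-8):
    """Centre each rollout's reward on the reference model's
    pre-computed score for the same prompt instead of the intra-group
    mean, so the advantage no longer depends on how many rollouts were
    drawn (normalisation here is an assumption)."""
    r = np.asarray(rewards, dtype=float)
    return (r - ref_anchor) / (r.std() + eps)

def select_high_gain(prompts, ref_scores, low=0.2, high=0.8):
    """Data pre-sampling sketch: keep prompts whose reference score
    suggests they are neither trivial nor hopeless (assumed thresholds)."""
    return [p for p, s in zip(prompts, ref_scores) if low <= s <= high]

print(pvpo_like_advantages([1.0, 0.0, 1.0], ref_anchor=0.5))
```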
[193] Dynamic Low-rank Approximation of Full-Matrix Preconditioner for Training Generalized Linear Models
Tatyana Matveeva, Aleksandr Katrutsa, Evgeny Frolov
Main category: cs.LG
TL;DR: AdaGram is a full-matrix adaptive optimizer that uses symmetric factorization and low-rank approximations to enable efficient parameter correlation modeling while maintaining scalability for large models.
Details
Motivation: Diagonal adaptive methods like Adagrad cannot capture parameter correlations, while full-matrix methods are computationally prohibitive for large-scale optimization.Method: Uses fast symmetric factorization for preconditioned updates and maintains low-rank preconditioner structure via matrix integrator methods to reduce memory and computational costs.
Result: AdaGram converges faster or matches diagonal adaptive optimizers using rank five and smaller approximations on standard machine learning tasks.
Conclusion: AdaGram provides a scalable solution for full-matrix adaptive optimization in large models by efficiently capturing parameter correlations with low computational overhead.
Abstract: Adaptive gradient methods like Adagrad and its variants are widespread in large-scale optimization. However, their use of diagonal preconditioning matrices limits the ability to capture parameter correlations. Full-matrix adaptive methods, approximating the exact Hessian, can model these correlations and may enable faster convergence. At the same time, their computational and memory costs are often prohibitive for large-scale models. To address this limitation, we propose AdaGram, an optimizer that enables efficient full-matrix adaptive gradient updates. To reduce memory and computational overhead, we utilize fast symmetric factorization for computing the preconditioned update direction at each iteration. Additionally, we maintain the low-rank structure of a preconditioner along the optimization trajectory using matrix integrator methods. Numerical experiments on standard machine learning tasks show that AdaGram converges faster or matches the performance of diagonal adaptive optimizers when using rank five and smaller rank approximations. This demonstrates AdaGram’s potential as a scalable solution for adaptive optimization in large models.
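To make the cost argument concrete, here is exact full-matrix Adagrad, the object AdaGram approximates: it stores the d x d outer-product accumulator and takes an O(d^3) inverse square root per step, which is precisely what the low-rank, integrator-maintained preconditioner avoids.

```python
import numpy as np

def full_matrix_adagrad(grad_fn, x0, lr=0.1, steps=100, eps=1e-8):
    """Exact full-matrix Adagrad: accumulate G_t = sum_s g_s g_s^T and
    precondition the step with G_t^{-1/2}. Memory is O(d^2) and the
    eigendecomposition is O(d^3) per step, hence the need for a
    low-rank approximation at scale."""
    x = np.array(x0, dtype=float)
    G = np.zeros((x.size, x.size))
    for _ in range(steps):
        g = grad_fn(x)
        G += np.outer(g, g)
        w, V = np.linalg.eigh(G + eps * np.eye(x.size))
        x -= lr * V @ ((V.T @ g) / np.sqrt(w))   # G^{-1/2} g
    return x

# Toy quadratic with correlated parameters: f(x) = 0.5 * x^T A x.
A = np.array([[3.0, 2.0], [2.0, 3.0]])
print(full_matrix_adagrad(lambda x: A @ x, x0=[1.0, -1.0]))
```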
[194] An Explainable, Attention-Enhanced, Bidirectional Long Short-Term Memory Neural Network for Joint 48-Hour Forecasting of Temperature, Irradiance, and Relative Humidity
Georgios Vamvouras, Konstantinos Braimakis, Christos Tzivanidis
Main category: cs.LG
TL;DR: A deep learning framework using stacked BiLSTM with attention for 48-hour forecasting of temperature, solar irradiance, and humidity to support smart HVAC control, achieving state-of-the-art accuracy with enhanced interpretability.
Details
Motivation: To support Model Predictive Control in smart HVAC systems by providing accurate 48-hour forecasts of key meteorological variables (temperature, solar irradiance, relative humidity) for energy-efficient building operations.Method: Stacked Bidirectional LSTM network with attention mechanism that jointly predicts all three variables, using historical meteorological data (2019-2022) with encoded cyclical time features. Model interpretability is enhanced through Integrated Gradients and attention weight analysis.
Result: Achieved Mean Absolute Errors of 1.3°C (temperature), 31 W/m² (irradiance), and 6.7 percentage points (humidity), outperforming both numerical weather prediction models and machine learning benchmarks. Demonstrated strong generalization on 2023 test data.
Conclusion: The framework successfully combines multivariate forecasting, attention-based deep learning, and explainability techniques to advance data-driven weather prediction for smart building control, showing potential for energy-efficient operations through reliable short-term meteorological forecasting.
Abstract: This paper presents a Deep Learning (DL) framework for 48-hour forecasting of temperature, solar irradiance, and relative humidity to support Model Predictive Control (MPC) in smart HVAC systems. The approach employs a stacked Bidirectional Long Short-Term Memory (BiLSTM) network with attention, capturing temporal and cross-feature dependencies by jointly predicting all three variables. Historical meteorological data (2019-2022) with encoded cyclical time features were used for training, while 2023 data evaluated generalization. The model achieved Mean Absolute Errors of 1.3 degrees Celsius (temperature), 31 W/m2 (irradiance), and 6.7 percentage points (humidity), outperforming state-of-the-art numerical weather prediction and machine learning benchmarks. Integrated Gradients quantified feature contributions, and attention weights revealed temporal patterns, enhancing interpretability. By combining multivariate forecasting, attention-based DL, and explainability, this work advances data-driven weather prediction. The demonstrated accuracy and transparency highlight the framework’s potential for energy-efficient building control through reliable short-term meteorological forecasting.
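A joint forecaster in this style is easy to sketch; the layer sizes, two-layer depth, and additive attention form below are assumptions, with only the 48-hour horizon and the three targets taken from the summary.

```python
import torch
import torch.nn as nn

class BiLSTMAttentionForecaster(nn.Module):
    """Illustrative stacked BiLSTM with attention pooling that jointly
    emits 48-step sequences for three targets (temperature, irradiance,
    humidity)."""
    def __init__(self, n_features, hidden=64, horizon=48, n_targets=3):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)          # attention scores
        self.head = nn.Linear(2 * hidden, horizon * n_targets)
        self.horizon, self.n_targets = horizon, n_targets

    def forward(self, x):                              # x: (b, t, feats)
        h, _ = self.encoder(x)                         # (b, t, 2*hidden)
        w = torch.softmax(self.score(h), dim=1)        # (b, t, 1)
        ctx = (w * h).sum(dim=1)                       # pooled context
        return self.head(ctx).view(-1, self.horizon, self.n_targets)

model = BiLSTMAttentionForecaster(n_features=8)
print(model(torch.randn(4, 168, 8)).shape)   # torch.Size([4, 48, 3])
```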
[195] Time-RA: Towards Time Series Reasoning for Anomaly with LLM Feedback
Yiyuan Yang, Zichuan Liu, Lei Song, Kai Ying, Zhiguang Wang, Tom Bamford, Svitlana Vyetrenko, Jiang Bian, Qingsong Wen
Main category: cs.LG
TL;DR: Proposes Time-RA, a generative reasoning task for time series anomaly detection using LLMs, and introduces RATs40K benchmark dataset with 40K multimodal samples across 10 domains for fine-grained anomaly categorization and explanatory reasoning.
Details
Motivation: Current time series anomaly detection approaches are limited to binary classification without detailed categorization or explanatory reasoning, lacking interpretability and comprehensive analysis.Method: Transforms anomaly detection from discriminative to generative reasoning task using LLMs. Creates RATs40K dataset with numeric time series, contextual text, and visual data, annotated with fine-grained categories (14 univariate/6 multivariate types) using ensemble labels refined by GPT-4 feedback.
Result: Developed comprehensive benchmark showing capabilities and limitations of current LLMs and multimodal LLMs, demonstrating the critical importance of supervised fine-tuning for effective anomaly reasoning.
Conclusion: The Time-RA task and RATs40K dataset enable significant advancements in interpretable time series anomaly detection and reasoning, with open-sourced code and dataset to accelerate future research in this emerging area.
Abstract: Time series anomaly detection is critical across various domains, yet current approaches often limit analysis to mere binary anomaly classification without detailed categorization or further explanatory reasoning. To address these limitations, we propose a novel task, Time-series Reasoning for Anomaly (Time-RA) that transforms classical time series anomaly detection from a discriminative into a generative, reasoning-intensive task leveraging Large Language Models (LLMs). Also, we introduce the first real-world multimodal benchmark dataset, RATs40K, explicitly annotated for anomaly reasoning, comprising approximately 40,000 samples across 10 real-world domains. Each sample includes numeric time series data, contextual text information, and visual representations, each annotated with fine-grained categories (14 types for univariate anomalies and 6 for multivariate anomalies) and structured explanatory reasoning. We develop a sophisticated annotation framework utilizing ensemble-generated labels refined through GPT-4-driven feedback, ensuring accuracy and interpretability. Extensive benchmarking of LLMs and multimodal LLMs demonstrates the capabilities and limitations of current models, highlighting the critical role of supervised fine-tuning. Our dataset and task pave the way for significant advancements in interpretable time series anomaly detection and reasoning. The code (https://github.com/yyysjz1997/Time-RA) and dataset (https://huggingface.co/datasets/Time-RA/RATs40K) have been fully open-sourced to support and accelerate future research in this area.
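For orientation, a sample in a dataset like this plausibly bundles the three modalities plus annotations; the field names below are our guesses at the layout, not the actual RATs40K schema (see the Hugging Face dataset card for the real one):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RATsSample:
    series: List[float]   # numeric time series values
    context: str          # contextual text information
    image_path: str       # visual representation (e.g., a rendered plot)
    category: str         # one of 14 univariate / 6 multivariate anomaly types
    reasoning: str        # structured explanatory annotation
```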
[196] Automating the Deep Space Network Data Systems; A Case Study in Adaptive Anomaly Detection through Agentic AI
Evan J. Chou, Lisa S. Locke, Harvey M. Soldan
Main category: cs.LG
TL;DR: NASA’s Deep Space Network requires advanced anomaly detection to prevent costly disruptions. This study developed a comprehensive AI system combining machine learning, reinforcement learning, and LLMs for real-time anomaly detection and explanation in DSN equipment data.
Details
Motivation: The Deep Space Network's antenna facilities degrade over time, causing costly disruptions to spacecraft communications. There's a need to assist JPL engineers in pinpointing anomalies and equipment degradation through collected data to maintain operations for future space missions.
Method: Researched machine learning techniques for data reconstruction and anomaly detection through predictive analysis and statistical thresholds. Integrated reinforcement learning for severity classification and Large Language Models for anomaly explanation. Implemented a full data pipeline system for data extraction, parsing, and processing, wrapped around an agentic AI system for complex reasoning.
Result: Developed a complete data workflow for DSN anomaly detection that connects data extraction, processing, and trained models. Created a system that can identify anomalies in real-time datasets and provide severity classifications with explanatory labels.
Conclusion: The implemented AI system provides a comprehensive solution for DSN anomaly detection, combining multiple machine learning approaches with a robust data pipeline. The system can be continuously improved through human feedback and supports maintenance operations for critical space communications infrastructure.
Abstract: The Deep Space Network (DSN) is NASA’s largest network of antenna facilities that generate a large volume of multivariate time-series data. These facilities contain DSN antennas and transmitters that undergo degradation over long periods of time, which may cause costly disruptions to the data flow and threaten the earth-connection of dozens of spacecraft that rely on the Deep Space Network for their lifeline. The purpose of this study was to experiment with different methods that would be able to assist JPL engineers with directly pinpointing anomalies and equipment degradation through collected data, and continue conducting maintenance and operations of the DSN for future space missions around our universe. As such, we have researched various machine learning techniques that can fully reconstruct data through predictive analysis, and determine anomalous data entries within real-time datasets through statistical computations and thresholds. On top of the fully trained and tested machine learning models, we have also integrated the use of a reinforcement learning subsystem that classifies identified anomalies based on severity level and a Large Language Model that labels an explanation for each anomalous data entry, all of which can be improved and fine-tuned over time through human feedback/input. Specifically, for the DSN transmitters, we have also implemented a full data pipeline system that connects the data extraction, parsing, and processing workflow all together as there was no coherent program or script for performing these tasks before. Using this data pipeline system, we were able to then also connect the models trained from DSN antenna data, completing the data workflow for DSN anomaly detection. This was all wrapped around and further connected by an agentic AI system, where complex reasoning was utilized to determine the classifications and predictions of anomalous data.
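The abstract mentions determining anomalous entries "through statistical computations and thresholds" without naming the rule; one common instance (our assumption, not necessarily the study's exact method) is a rolling z-score test:

```python
import numpy as np

def rolling_zscore_flags(x, window=100, threshold=4.0):
    """Flag samples deviating from a trailing mean by more than
    `threshold` trailing standard deviations."""
    x = np.asarray(x, dtype=float)
    flags = np.zeros(len(x), dtype=bool)
    for t in range(window, len(x)):
        hist = x[t - window:t]
        mu, sigma = hist.mean(), hist.std()
        flags[t] = bool(sigma > 0 and abs(x[t] - mu) > threshold * sigma)
    return flags
```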
[197] Adaptive LLM Routing under Budget Constraints
Pranoy Panda, Raghav Magazine, Chaitanya Devaguptapu, Sho Takemori, Vishal Sharma
Main category: cs.LG
TL;DR: PILOT is a framework for adaptive LLM routing that combines contextual bandits with cost-aware policy optimization.
Details
Motivation: Existing LLM routing approaches assume complete knowledge of optimal query-LLM pairings, but real-world scenarios lack comprehensive mappings and face evolving queries, requiring adaptive decision-making.
Method: Develops a shared embedding space for queries and LLMs, aligned to reflect affinity, learned from offline human preference data and refined through online bandit feedback. PILOT extends LinUCB with a cost policy modeled as a multi-choice knapsack problem.
Result: Proposed framework enables adaptive routing without requiring exhaustive inference across all LLMs, with cost-efficient resource allocation
Conclusion: Contextual bandit formulation with shared embedding space and cost-aware policy provides effective solution for practical LLM routing in dynamic environments
Abstract: Large Language Models (LLMs) have revolutionized natural language processing, but their varying capabilities and costs pose challenges in practical applications. LLM routing addresses this by dynamically selecting the most suitable LLM for each query/task. Previous approaches treat this as a supervised learning problem, assuming complete knowledge of optimal query-LLM pairings. However, real-world scenarios lack such comprehensive mappings and face evolving user queries. We thus propose to study LLM routing as a contextual bandit problem, enabling adaptive decision-making using bandit feedback without requiring exhaustive inference across all LLMs for all queries (in contrast to supervised routing). To address this problem, we develop a shared embedding space for queries and LLMs, where query and LLM embeddings are aligned to reflect their affinity. This space is initially learned from offline human preference data and refined through online bandit feedback. We instantiate this idea through Preference-prior Informed Linucb fOr adaptive rouTing (PILOT), a novel extension of LinUCB. To handle diverse user budgets for model routing, we introduce an online cost policy modeled as a multi-choice knapsack problem, ensuring resource-efficient routing.
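For readers unfamiliar with the base algorithm, a plain LinUCB over LLM "arms" looks as follows; PILOT's additions (the preference-initialized embedding space and the knapsack-style cost policy) sit on top of this and are not reproduced here:

```python
import numpy as np

class LinUCB:
    def __init__(self, n_arms, dim, alpha=1.0):
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm design matrix
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward vector
        self.alpha = alpha                               # exploration strength

    def select(self, x):
        """Pick the arm with the highest upper confidence bound for query x."""
        ucbs = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            ucbs.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(ucbs))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```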
[198] Privacy Auditing Synthetic Data Release through Local Likelihood Attacks
Joshua Ward, Chi-Hua Wang, Guang Cheng
Main category: cs.LG
TL;DR: Proposes Gen-LRA, a novel membership inference attack for synthetic data that exploits generative model overfitting without requiring model access, outperforming existing methods across diverse benchmarks.
Details
Motivation: Existing privacy auditing frameworks for synthetic data rely on heuristics and unreasonable assumptions, showing limited capability to detect privacy exposure through synthetic data release.
Method: Develops the Generative Likelihood Ratio Attack (Gen-LRA), a No-Box MIA that evaluates a test observation’s influence on a surrogate model’s estimate of the local likelihood ratio over the synthetic data, requiring no model knowledge or access.
Result: Gen-LRA consistently dominates other MIAs for generative models across multiple performance metrics in comprehensive benchmarks spanning diverse datasets, model architectures, and attack parameters.
Conclusion: Gen-LRA is an effective privacy auditing tool that highlights significant privacy risks from generative model overfitting in real-world synthetic data applications.
Abstract: Auditing the privacy leakage of synthetic data is an important but unresolved problem. Most existing privacy auditing frameworks for synthetic data rely on heuristics and unreasonable assumptions to attack the failure modes of generative models, exhibiting limited capability to describe and detect the privacy exposure of training data through synthetic data release. In this paper, we study designing Membership Inference Attacks (MIAs) that specifically exploit the observation that tabular generative models tend to significantly overfit to certain regions of the training distribution. Here, we propose Generative Likelihood Ratio Attack (Gen-LRA), a novel, computationally efficient No-Box MIA that, with no assumption of model knowledge or access, formulates its attack by evaluating the influence a test observation has in a surrogate model’s estimation of a local likelihood ratio over the synthetic data. Assessed over a comprehensive benchmark spanning diverse datasets, model architectures, and attack parameters, we find that Gen-LRA consistently dominates other MIAs for generative models across multiple performance metrics. These results underscore Gen-LRA’s effectiveness as a privacy auditing tool for the release of synthetic data, highlighting the significant privacy risks posed by generative model overfitting in real-world applications.
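The core statistic is a local likelihood ratio estimated by a surrogate density model. As a rough illustration (a kernel density surrogate is our choice here; the paper's surrogate and test statistic may differ), a score of this flavor can be computed as:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def local_lr_score(x, synthetic, reference, bandwidth=0.5):
    """Log-ratio of synthetic-data density to reference density near x.
    Large values suggest the generator is locally overfit around x,
    which is the signal a membership inference attack exploits."""
    kde_syn = KernelDensity(bandwidth=bandwidth).fit(synthetic)
    kde_ref = KernelDensity(bandwidth=bandwidth).fit(reference)
    x = np.atleast_2d(x)
    return (kde_syn.score_samples(x) - kde_ref.score_samples(x))[0]
```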
[199] Deep Residual Echo State Networks: exploring residual orthogonal connections in untrained Recurrent Neural Networks
Matteo Pinna, Andrea Ceni, Claudio Gallicchio
Main category: cs.LG
TL;DR: Deep Residual Echo State Networks (DeepResESNs) improve long-term memory and temporal modeling in Echo State Networks through hierarchical residual connections, with mathematical stability analysis and experimental validation.
Details
Motivation: Traditional Echo State Networks struggle with long-term information processing, requiring a more effective architecture for temporal modeling.
Method: Introduces deep untrained RNNs with temporal residual connections in orthogonal configurations (random and fixed-structure), with mathematical analysis of stability conditions.
Result: Significant improvement in memory capacity and long-term temporal modeling across various time series tasks compared to traditional shallow and deep Reservoir Computing.
Conclusion: DeepResESNs with residual connections effectively enhance Echo State Networks’ capabilities for long-term information processing while maintaining stability.
Abstract: Echo State Networks (ESNs) are a particular type of untrained Recurrent Neural Networks (RNNs) within the Reservoir Computing (RC) framework, popular for their fast and efficient learning. However, traditional ESNs often struggle with long-term information processing. In this paper, we introduce a novel class of deep untrained RNNs based on temporal residual connections, called Deep Residual Echo State Networks (DeepResESNs). We show that leveraging a hierarchy of untrained residual recurrent layers significantly boosts memory capacity and long-term temporal modeling. For the temporal residual connections, we consider different orthogonal configurations, including randomly generated and fixed-structure configurations, and we study their effect on network dynamics. A thorough mathematical analysis outlines necessary and sufficient conditions to ensure stable dynamics within DeepResESN. Our experiments on a variety of time series tasks showcase the advantages of the proposed approach over traditional shallow and deep RC.
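A generic single-layer state update with a temporal residual branch conveys the idea: the orthogonal matrix on the skip path preserves the norm of past states. The mixing coefficients and exact stability scalings are the paper's, not reproduced here, so treat this as a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_in = 100, 3

O, _ = np.linalg.qr(rng.standard_normal((n_units, n_units)))  # orthogonal skip
W = rng.standard_normal((n_units, n_units))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))               # spectral radius < 1
W_in = rng.uniform(-0.1, 0.1, (n_units, n_in))

def step(h, x, alpha=0.5):
    """Residual reservoir update: norm-preserving orthogonal memory branch
    plus an untrained nonlinear branch (illustrative mixing)."""
    return alpha * (O @ h) + (1 - alpha) * np.tanh(W @ h + W_in @ x)
```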
[200] FUTURE: Flexible Unlearning for Tree Ensemble
Ziheng Chen, Jin Huang, Jiali Cheng, Yuchan Guo, Mengjie Wang, Lalitesh Morishetti, Kaushiki Nag, Hadi Amiri
Main category: cs.LG
TL;DR: FUTURE is a novel gradient-based unlearning algorithm for tree ensembles that addresses privacy concerns by enabling efficient forgetting of sensitive data through probabilistic approximations and optimization.
Details
Motivation: Tree ensembles excel in classification but face challenges with data privacy and right to be forgotten requirements. Existing unlearning methods are model-specific, rely on discrete structures, and don't scale well to complex ensembles and large datasets.
Method: Formulates unlearning as gradient-based optimization using probabilistic model approximations to handle non-differentiability of tree ensembles, enabling end-to-end efficient unlearning.
Result: Extensive experiments on real-world datasets demonstrate that FUTURE achieves significant and successful unlearning performance.
Conclusion: FUTURE provides an effective and efficient solution for tree ensemble unlearning, overcoming limitations of previous methods through gradient-based optimization and probabilistic approximations.
Abstract: Tree ensembles are widely recognized for their effectiveness in classification tasks, achieving state-of-the-art performance across diverse domains, including bioinformatics, finance, and medical diagnosis. With increasing emphasis on data privacy and the “right to be forgotten”, several unlearning algorithms have been proposed to enable tree ensembles to forget sensitive information. However, existing methods are often tailored to a particular model or rely on the discrete tree structure, making them difficult to generalize to complex ensembles and inefficient for large-scale datasets. To address these limitations, we propose FUTURE, a novel unlearning algorithm for tree ensembles. Specifically, we formulate the problem of forgetting samples as a gradient-based optimization task. In order to accommodate non-differentiability of tree ensembles, we adopt the probabilistic model approximations within the optimization framework. This enables end-to-end unlearning in an effective and efficient manner. Extensive experiments on real-world datasets show that FUTURE yields significant and successful unlearning performance.
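The key enabling trick is a differentiable relaxation of the tree's hard splits; a common sigmoid relaxation (one plausible instance of the "probabilistic model approximations" the abstract mentions) looks like:

```python
import torch

def soft_split(x, feature, threshold, temperature=0.1):
    """Probability of routing each row of x to the right child; hard
    routing is recovered as temperature -> 0."""
    return torch.sigmoid((x[:, feature] - threshold) / temperature)

def soft_stump(x, feature, threshold, left_val, right_val):
    """Expected leaf value of a depth-1 tree; fully differentiable, so a
    gradient-based unlearning objective can flow through it."""
    p_right = soft_split(x, feature, threshold)
    return (1.0 - p_right) * left_val + p_right * right_val
```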
[201] Manifold Trajectories in Next-Token Prediction: From Replicator Dynamics to Softmax Equilibrium
Christopher R. Lee-Jenkins
Main category: cs.LG
TL;DR: This paper provides a variational interpretation of LLM decoding as constrained optimization on the probability simplex, showing next-token distributions follow smooth trajectories converging to softmax equilibrium.
Details
Motivation: To formalize the intuitive 'manifold traversal' concept in LLM decoding and provide a rigorous mathematical framework for understanding temperature scaling and sampling methods.
Method: Models decoding as a constrained variational principle using multiplicative-weights (entropic mirror) updates and replicator flow in continuous time, analyzing the trajectory within the probability simplex.
Result: Proves that next-token distributions follow smooth trajectories converging to softmax equilibrium, with temperature acting as exact time rescaling and top-k/nucleus sampling restricting flow to simplex faces.
Conclusion: Provides a formal mathematical foundation for LLM decoding dynamics, explaining temperature effects and sampling methods while offering insights into path-dependent score adjustments and hallucination behavior.
Abstract: Decoding in large language models is often described as scoring tokens and normalizing with softmax. We give a minimal, self-contained account of this step as a constrained variational principle on the probability simplex. The discrete, normalization-respecting ascent is the classical multiplicative-weights (entropic mirror) update; its continuous-time limit is the replicator flow. From these ingredients we prove that, for a fixed context and temperature, the next-token distribution follows a smooth trajectory inside the simplex and converges to the softmax equilibrium. This formalizes the common “manifold traversal” intuition at the output-distribution level. The analysis yields precise, practice-facing consequences: temperature acts as an exact rescaling of time along the same trajectory, while top-k and nucleus sampling restrict the flow to a face with identical guarantees. We also outline a controlled account of path-dependent score adjustments and their connection to loop-like, hallucination-style behavior. We make no claims about training dynamics or internal representations; those are deferred to future work.
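For reference, the named objects are standard. Writing $s$ for the fixed token scores and $p^t$ for the distribution at step $t$, our transcription of the update, its continuous-time limit, and the equilibrium (following the abstract's terminology) is:

```latex
% Multiplicative-weights / entropic mirror ascent step:
p_i^{t+1} \;=\; \frac{p_i^{t}\, e^{\eta s_i}}{\sum_j p_j^{t}\, e^{\eta s_j}}
% Continuous-time limit: the replicator flow on the simplex
\dot{p}_i \;=\; p_i \Big( s_i - \sum_j p_j s_j \Big)
% Equilibrium at temperature T (softmax); T acts as a time rescaling
p_i^{\star} \;=\; \frac{e^{s_i/T}}{\sum_j e^{s_j/T}}
```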
[202] Model-Task Alignment Drives Distinct RL Outcomes
Haoze Wu, Cheng Wang, Wenshuo Zhao, Junxian He
Main category: cs.LG
TL;DR: Counterintuitive RL phenomena in LLMs (single example matching full dataset, inaccurate rewards working, negative samples outperforming reward methods) only occur when models already have strong model-task alignment, while standard RL remains robust across all settings.
Details
Motivation: Recent RL advances in LLMs show counterintuitive patterns not seen in traditional RL, but the conditions under which these phenomena work (and fail) remain unclear.
Method: Systematic examination of counterintuitive claims through rigorous experiments across different model architectures and task domains, measuring model-task alignment via pass@k accuracy.
Result: Counterintuitive results only arise when models exhibit strong model-task alignment; these techniques fail in challenging regimes where standard RL methods remain effective.
Conclusion: Model-task alignment is a key differentiating factor for RL observations in LLMs - standard RL training is consistently robust while many counterintuitive phenomena are conditional on pre-existing alignment.
Abstract: Recent advances in applying reinforcement learning (RL) to large language models (LLMs) have led to substantial progress. In particular, a series of remarkable yet often counterintuitive phenomena have been reported in LLMs, exhibiting patterns not typically observed in traditional RL settings. For example, notable claims include that a single training example can match the performance achieved with an entire dataset, that the reward signal does not need to be very accurate, and that training solely with negative samples can match or even surpass sophisticated reward-based methods. However, the precise conditions under which these observations hold - and, critically, when they fail - remain unclear. In this work, we identify a key factor that differentiates RL observations: whether the pretrained model already exhibits strong Model-Task Alignment, as measured by pass@k accuracy on the evaluated task. Through a systematic and comprehensive examination of a series of counterintuitive claims, supported by rigorous experimental validation across different model architectures and task domains, our findings show that while standard RL training remains consistently robust across settings, many of these counterintuitive results arise only when the model and task already exhibit strong model-task alignment. In contrast, these techniques fail to drive substantial learning in more challenging regimes, where standard RL methods remain effective.
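Model-task alignment is measured with pass@k; the standard unbiased estimator (from the Codex paper, which we assume is the computation used here) is:

```python
import numpy as np

def pass_at_k(n, c, k):
    """Probability that at least one of k draws (without replacement)
    from n samples, c of which are correct, is correct:
    1 - C(n-c, k) / C(n, k), computed in a numerically stable product form."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```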
[203] Class Incremental Continual Learning with Self-Organizing Maps and Variational Autoencoders Using Synthetic Replay
Pujan Thapa, Alexander Ororbia, Travis Desell
Main category: cs.LG
TL;DR: A novel generative continual learning framework using self-organizing maps (SOMs) and variational autoencoders (VAEs) that enables memory-efficient replay without storing raw data or task labels.
Details
Motivation: To address the memory inefficiency in continual learning by eliminating the need to store raw data samples or task labels while maintaining competitive performance.
Method: Combines SOMs with VAEs - SOM operates in latent space for high-dimensional inputs (CIFAR) and standalone for low-dimensional inputs (MNIST). Stores running statistics (mean, variance, covariance) for each SOM unit to generate synthetic samples for replay.
Result: Competitive with state-of-the-art memory-based methods, outperforms memory-free methods, improves SOTA single class incremental performance by nearly 10% on CIFAR-10 and 7% on CIFAR-100.
Conclusion: Provides a scalable, task-label-free, and memory-efficient solution for continual learning that also facilitates visualization and can serve as a generative model post-training.
Abstract: This work introduces a novel generative continual learning framework based on self-organizing maps (SOMs) and variational autoencoders (VAEs) to enable memory-efficient replay, eliminating the need to store raw data samples or task labels. For high-dimensional input spaces, such as those of CIFAR-10 and CIFAR-100, we design a scheme where the SOM operates over the latent space learned by a VAE, whereas, for lower-dimensional inputs, such as those found in MNIST and FashionMNIST, the SOM operates in a standalone fashion. Our method stores a running mean, variance, and covariance for each SOM unit, from which synthetic samples are then generated during future learning iterations. For the VAE-based method, generated samples are then fed through the decoder to be used in subsequent replay. Experimental results on standard class-incremental benchmarks show that our approach performs competitively with state-of-the-art memory-based methods and outperforms memory-free methods, notably improving over the best state-of-the-art single class incremental performance on CIFAR-10 and CIFAR-100 by nearly 10% and 7%, respectively. Our methodology further facilitates easy visualization of the learning process and can also be utilized as a generative model post-training. Results show our method’s capability as a scalable, task-label-free, and memory-efficient solution for continual learning.
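The replay mechanism reduces to sampling from the Gaussian summarized by each SOM unit's running statistics; a minimal sketch (the dictionary layout is ours):

```python
import numpy as np

def sample_unit(unit, n, rng=None):
    """Draw synthetic (latent) samples from one SOM unit's stored mean
    and covariance; in the VAE variant these are decoded afterwards."""
    rng = rng or np.random.default_rng()
    return rng.multivariate_normal(unit["mean"], unit["cov"], size=n)
```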
[204] A Mixture of Experts Gating Network for Enhanced Surrogate Modeling in External Aerodynamics
Mohammad Amin Nabian, Sanjay Choudhry
Main category: cs.LG
TL;DR: A meta-learning Mixture of Experts framework that dynamically combines three state-of-the-art CFD surrogate models (DoMINO, X-MeshGraphNet, FigConvNet) using a gating network with entropy regularization, achieving superior accuracy over individual models and ensemble averages on automotive aerodynamic predictions.
Details
Motivation: High computational cost of CFD simulations remains a bottleneck in automotive design, and while ML-based surrogate models show promise, no single architecture demonstrates universal superiority across all scenarios.
Method: Proposes a Mixture of Experts model with a gating network that learns spatially-variant weighting to combine predictions from three heterogeneous surrogate models (DoMINO, X-MeshGraphNet, FigConvNet), using entropy regularization to prevent model collapse and ensure balanced expert contributions.
Result: The MoE model achieves significant reduction in L-2 prediction error, outperforming both ensemble averages and the most accurate individual expert model across all evaluated physical quantities (surface pressure and wall shear stress fields) on the DrivAerML dataset.
Conclusion: The MoE framework establishes an effective strategy for creating robust and accurate composite surrogate models by synergistically combining complementary strengths of specialized architectures in automotive aerodynamics.
Abstract: The computational cost associated with high-fidelity CFD simulations remains a significant bottleneck in the automotive design and optimization cycle. While ML-based surrogate models have emerged as a promising alternative to accelerate aerodynamic predictions, the field is characterized by a diverse and rapidly evolving landscape of specialized neural network architectures, with no single model demonstrating universal superiority. This paper introduces a novel meta-learning framework that leverages this architectural diversity as a strength. We propose a Mixture of Experts (MoE) model that employs a dedicated gating network to dynamically and optimally combine the predictions from three heterogeneous, state-of-the-art surrogate models: DoMINO, a decomposable multi-scale neural operator; X-MeshGraphNet, a scalable multi-scale graph neural network; and FigConvNet, a factorized implicit global convolution network. The gating network learns a spatially-variant weighting strategy, assigning credibility to each expert based on its localized performance in predicting surface pressure and wall shear stress fields. To prevent model collapse and encourage balanced expert contributions, we integrate an entropy regularization term into the training loss function. The entire system is trained and validated on the DrivAerML dataset, a large-scale, public benchmark of high-fidelity CFD simulations for automotive aerodynamics. Quantitative results demonstrate that the MoE model achieves a significant reduction in L-2 prediction error, outperforming not only the ensemble average but also the most accurate individual expert model across all evaluated physical quantities. This work establishes the MoE framework as a powerful and effective strategy for creating more robust and accurate composite surrogate models by synergistically combining the complementary strengths of specialized architectures.
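The gating idea is compact enough to sketch: a small network outputs per-point softmax weights over the experts, and an entropy bonus in the loss discourages collapse onto a single expert. Layer sizes and the entropy coefficient below are placeholders, not the paper's values:

```python
import torch
import torch.nn as nn

class Gate(nn.Module):
    def __init__(self, in_dim, n_experts=3, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_experts))

    def forward(self, feats):                  # (B, in_dim) -> (B, n_experts)
        return torch.softmax(self.net(feats), dim=-1)

def moe_loss(w, expert_preds, target, beta=0.01):
    """w: (B, E) gate weights; expert_preds: (B, E, D) expert outputs."""
    mixed = (w.unsqueeze(-1) * expert_preds).sum(dim=1)           # (B, D)
    mse = ((mixed - target) ** 2).mean()
    entropy = -(w * w.clamp_min(1e-8).log()).sum(-1).mean()
    return mse - beta * entropy                # entropy bonus fights collapse
```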
[205] RelP: Faithful and Efficient Circuit Discovery via Relevance Patching
Farnoush Rezaei Jafari, Oliver Eberle, Ashkan Khakzar, Neel Nanda
Main category: cs.LG
TL;DR: Relevance Patching (RelP) is a new method that improves upon attribution patching by using Layer-wise Relevance Propagation coefficients instead of gradients, providing more accurate approximation of activation patching with similar computational efficiency.
Details
Motivation: Activation patching is computationally expensive at scale, while attribution patching suffers from noise and reduced reliability in deep non-linear networks. There's a need for a more faithful and efficient method.
Method: Replaces local gradients in attribution patching with propagation coefficients from Layer-wise Relevance Propagation (LRP), maintaining computational efficiency (2 forward + 1 backward pass) while improving faithfulness.
Result: RelP significantly outperforms standard attribution patching, achieving Pearson correlation of 0.956 vs 0.006 for MLP outputs in GPT-2 Large on IOI task. Provides comparable faithfulness to Integrated Gradients without extra computational cost.
Conclusion: RelP offers a computationally efficient and more faithful alternative to both activation patching and attribution patching, particularly effective for analyzing complex components in deep networks.
Abstract: Activation patching is a standard method in mechanistic interpretability for localizing the components of a model responsible for specific behaviors, but it is computationally expensive to apply at scale. Attribution patching offers a faster, gradient-based approximation, yet suffers from noise and reduced reliability in deep, highly non-linear networks. In this work, we introduce Relevance Patching (RelP), which replaces the local gradients in attribution patching with propagation coefficients derived from Layer-wise Relevance Propagation (LRP). LRP propagates the network’s output backward through the layers, redistributing relevance to lower-level components according to local propagation rules that ensure properties such as relevance conservation or improved signal-to-noise ratio. Like attribution patching, RelP requires only two forward passes and one backward pass, maintaining computational efficiency while improving faithfulness. We validate RelP across a range of models and tasks, showing that it more accurately approximates activation patching than standard attribution patching, particularly when analyzing residual stream and MLP outputs in the Indirect Object Identification (IOI) task. For instance, for MLP outputs in GPT-2 Large, attribution patching achieves a Pearson correlation of 0.006, whereas RelP reaches 0.956, highlighting the improvement offered by RelP. Additionally, we compare the faithfulness of sparse feature circuits identified by RelP and Integrated Gradients (IG), showing that RelP achieves comparable faithfulness without the extra computational cost associated with IG.
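The estimator being modified is easy to state: attribution patching approximates the effect of swapping one activation by a first-order term. RelP keeps this form but supplies the linear coefficient via LRP rather than the raw gradient (the LRP propagation rules themselves are not reproduced here):

```python
import torch

def first_order_patch_effect(act_clean, act_corrupt, coeff):
    """Estimated metric change from patching one activation:
    (corrupted - clean) contracted with a local linear coefficient.
    Attribution patching uses coeff = d(metric)/d(activation);
    RelP substitutes LRP propagation coefficients."""
    return ((act_corrupt - act_clean) * coeff).sum()
```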
[206] Owen Sampling Accelerates Contribution Estimation in Federated Learning
Hossein KhademSohi, Hadi Hemmati, Jiayu Zhou, Steve Drew
Main category: cs.LG
TL;DR: FedOwen is an efficient framework that uses Owen sampling to approximate Shapley values for client contribution evaluation in federated learning, achieving better accuracy with the same computational budget.
Details
Motivation: Accurately estimating client contributions in federated learning is essential for fair rewards and faster convergence, but exact Shapley value computation is computationally infeasible for large federations.
Method: Uses Owen sampling to approximate Shapley values efficiently and implements an adaptive client selection strategy that balances exploitation of high-value clients with exploration of under-sampled ones.
Result: Achieves up to 23% higher final accuracy within the same number of communication rounds compared to state-of-the-art baselines on non-IID benchmarks under fixed valuation cost.
Conclusion: FedOwen provides an efficient and effective solution for client contribution evaluation in federated learning, enabling better model performance while maintaining computational feasibility.
Abstract: Federated Learning (FL) aggregates information from multiple clients to train a shared global model without exposing raw data. Accurately estimating each client’s contribution is essential not just for fair rewards, but for selecting the most useful clients so the global model converges faster. The Shapley value is a principled choice, yet exact computation scales exponentially with the number of clients, making it infeasible for large federations. We propose FedOwen, an efficient framework that uses Owen sampling to approximate Shapley values under the same total evaluation budget as existing methods while keeping the approximation error small. In addition, FedOwen uses an adaptive client selection strategy that balances exploiting high-value clients with exploring under-sampled ones, reducing bias and uncovering rare but informative data. Under a fixed valuation cost, FedOwen achieves up to 23 percent higher final accuracy within the same number of communication rounds compared to state-of-the-art baselines on non-IID benchmarks.
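Owen sampling estimates Shapley values by stratifying the probability q with which the other clients join a coalition and averaging marginal contributions. A minimal sketch, where `value_fn` (a stand-in we assume) maps a set of client indices to global-model utility:

```python
import numpy as np

def owen_shapley(value_fn, n_clients, n_levels=10, per_level=8, seed=0):
    """Stratified Owen estimate of each client's Shapley value."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(n_clients)
    for q in (np.arange(n_levels) + 0.5) / n_levels:   # q-grid on (0, 1)
        for _ in range(per_level):
            include = rng.random(n_clients) < q        # coalition at level q
            for i in range(n_clients):
                S = set(np.flatnonzero(include)) - {i}
                phi[i] += value_fn(S | {i}) - value_fn(S)
    return phi / (n_levels * per_level)
```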
[207] Guess-and-Learn (G&L): Measuring the Cumulative Error Cost of Cold-Start Adaptation
Roland Arnold
Main category: cs.LG
TL;DR: G&L v1.0 introduces a framework to measure cold-start adaptability by tracking cumulative errors during sequential learning, revealing that smaller models adapt faster with fewer initial mistakes and current models remain far from optimal performance.
Details
Motivation: Traditional model evaluation focuses only on final accuracy, ignoring the cost of adaptation - the cumulative errors incurred while learning from scratch. This gap motivates measuring cold-start adaptability to understand how quickly and efficiently models learn from initial examples.
Method: G&L defines a protocol where learners sequentially label unlabeled datasets: select instance, predict label, receive ground truth, update parameters (online or batch mode). Four tracks (Scratch/Pretrained × Online/Batch) disentangle initialization and update frequency effects. Baseline experiments use classical methods, CNNs, ResNet-50, and pretrained transformers on MNIST and AG News.
Result: Smaller models adapt with fewer initial errors, pretraining benefits vary by domain, and all current models remain well above the oracle reference band, revealing a significant adaptability gap and systematic differences in early-phase efficiency.
Conclusion: G&L provides a reproducible framework that complements conventional benchmarks by quantifying the mistake cost of early learning, enabling development of learners that are both accurate in the limit and reliable from the first examples.
Abstract: Evaluation of machine learning models typically emphasizes final accuracy, overlooking the cost of adaptation: the cumulative errors incurred while learning from scratch. Guess-and-Learn (G&L) v1.0 addresses this gap by measuring cold-start adaptability - the total mistakes a model makes while sequentially labeling an unlabeled dataset. At each step, the learner selects an instance, predicts its label, receives the ground truth, and updates parameters under either online (per-sample) or batch (delayed) mode. The resulting error trajectory exposes adaptation speed, selection quality, and bias - dynamics invisible to endpoint metrics. G&L defines four tracks (Scratch/Pretrained $\times$ Online/Batch) to disentangle the effects of initialization and update frequency. We formalize the protocol, relate it to classical mistake-bound theory, and estimate a heuristic “oracle reference band” for MNIST as a plausibility reference. Baseline experiments on MNIST and AG News, spanning classical methods (Perceptron, k-NN), convolutional architectures (CNN, ResNet-50), and pretrained transformers (ViT-B/16, BERT-base), reveal systematic differences in early-phase efficiency: smaller models can adapt with fewer initial errors, while pretraining benefits vary by domain. Across settings, current models remain well above the oracle band, highlighting an adaptability gap. By quantifying the mistake cost of early learning, G&L complements conventional benchmarks and provides a reproducible framework for developing learners that are not only accurate in the limit but also reliable from the first examples.
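The protocol itself is a short loop; in this sketch the learner interface (`predict`/`update`) and the first-in selection order are our simplifications - the benchmark also scores how instances are selected:

```python
def guess_and_learn(model, pool):
    """Online-mode G&L: predict, receive ground truth, update; the score
    is the cumulative mistake count, not just final accuracy."""
    mistakes, trajectory = 0, []
    for x, y in pool:              # a real learner would choose the order
        if model.predict(x) != y:
            mistakes += 1
        model.update(x, y)         # per-sample (online) update
        trajectory.append(mistakes)
    return mistakes, trajectory
```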
[208] CALM: A Framework for Continuous, Adaptive, and LLM-Mediated Anomaly Detection in Time-Series Streams
Ashok Devireddy, Shunping Huang
Main category: cs.LG
TL;DR: CALM is a real-time anomaly detection framework that uses continuous fine-tuning and LLM-as-a-Judge to adapt to concept drift in streaming time-series data, outperforming static models on the TSB-UAD benchmark.
Details
Motivation: Traditional offline-trained anomaly detection models suffer performance degradation when faced with concept drift in non-stationary time-series streams, necessitating adaptive solutions.
Method: Built on Apache Beam with TimesFm foundation model, CALM features continuous fine-tuning and an LLM-as-a-Judge component that provides semantic context to distinguish meaningful anomalies from noise.
Result: The continuously fine-tuned model improves ROC AUC scores in most datasets compared to static pre-trained models on the TSB-UAD benchmark.
Conclusion: CALM’s adaptive, LLM-guided approach effectively maintains high-performance anomaly detection in dynamic streaming environments by addressing concept drift through real-time adaptation.
Abstract: The detection of anomalies in non-stationary time-series streams is a critical but challenging task across numerous industrial and scientific domains. Traditional models, trained offline, suffer significant performance degradation when faced with concept drift, where the underlying statistical properties of the data change over time. This paper introduces CALM (Continuous, Adaptive, and LLM-Mediated), a novel, end-to-end framework for real-time anomaly detection designed to address this challenge. CALM is built on the Apache Beam distributed processing framework and leverages the TimesFm foundation model for forecasting-based anomaly detection. The framework’s novelty lies in two core contributions. First, it implements a closed-loop, continuous fine-tuning mechanism that allows the anomaly detection model to adapt to evolving data patterns in near real-time. Second, it introduces an LLM-as-a-Judge component, a Large Language Model that provides semantic, context-aware judgments on detected anomalies to curate a high-quality training dataset, deciding whether an anomaly represents transient noise or a meaningful pattern shift. We evaluate CALM on the comprehensive TSB-UAD benchmark. Our results demonstrate that the continuously fine-tuned model improves the ROC AUC score in most datasets compared to the static, pre-trained base model, validating the efficacy of our adaptive, LLM-guided approach to maintaining high-performance anomaly detection in dynamic streaming environments.
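The closed loop can be caricatured in a few lines; every interface below (`forecaster`, `judge`, the prompt, the buffer size) is our invention, meant only to show how the LLM judgment gates the fine-tuning data:

```python
def calm_step(forecaster, judge, window, observed, threshold, buffer):
    """Flag a forecast miss, ask the LLM judge whether it is noise or a
    pattern shift, and keep only judged shifts for continuous fine-tuning."""
    predicted = forecaster.forecast(window)
    if abs(observed - predicted) > threshold:
        verdict = judge(f"Observed {observed}, expected {predicted}, "
                        f"recent history {window}. Noise or pattern shift?")
        if "shift" in verdict.lower():
            buffer.append((window, observed))   # curated training example
    if len(buffer) >= 256:                      # periodic adaptation
        forecaster.fine_tune(buffer)
        buffer.clear()
```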
[209] Detecting Domain Shifts in Myoelectric Activations: Challenges and Opportunities in Stream Learning
Yibin Sun, Nick Lim, Guilherme Weigert Cassales, Heitor Murilo Gomes, Bernhard Pfahringer, Albert Bifet, Anany Dwivedi
Main category: cs.LG
TL;DR: The paper investigates domain shift detection in EMG signals using data stream learning techniques, evaluates multiple drift detection methods, and reveals limitations of current approaches for real-time detection.
Details
Motivation: Domain shifts in myoelectric activations are challenging due to EMG signal non-stationarity, requiring effective detection methods for stable EMG decoding models in real-world applications.
Method: Used DB6 dataset from Ninapro database, defined domains as time-series segments by subjects and sessions, applied KPCA with cosine kernel for preprocessing, and evaluated CUSUM, Page-Hinckley, and ADWIN drift detection methods.
Result: Current drift detection techniques show limitations in achieving high performance for real-time domain shift detection in EMG signals, indicating room for improvement.
Conclusion: Streaming-based approaches show potential for maintaining stable EMG decoding models, but further research is needed to enhance robustness and accuracy in practical scenarios.
Abstract: Detecting domain shifts in myoelectric activations poses a significant challenge due to the inherent non-stationarity of electromyography (EMG) signals. This paper explores the detection of domain shifts using data stream (DS) learning techniques, focusing on the DB6 dataset from the Ninapro database. We define domains as distinct time-series segments based on different subjects and recording sessions, applying Kernel Principal Component Analysis (KPCA) with a cosine kernel to pre-process and highlight these shifts. By evaluating multiple drift detection methods such as CUSUM, Page-Hinckley, and ADWIN, we reveal the limitations of current techniques in achieving high performance for real-time domain shift detection in EMG signals. Our results underscore the potential of streaming-based approaches for maintaining stable EMG decoding models, while highlighting areas for further research to enhance robustness and accuracy in real-world scenarios.
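The pipeline maps cleanly onto off-the-shelf tools; a sketch assuming scikit-learn's KernelPCA and the river library's current ADWIN API (with synthetic stand-in data, not DB6):

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from river import drift

X = np.random.randn(500, 64)            # stand-in for windowed EMG features
z = KernelPCA(n_components=1, kernel="cosine").fit_transform(X).ravel()

detector = drift.ADWIN()
for t, v in enumerate(z):               # monitor the leading KPCA component
    detector.update(v)
    if detector.drift_detected:
        print(f"possible subject/session shift at window {t}")
```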
[210] MyGO: Memory Yielding Generative Offline-consolidation for Lifelong Learning Systems
Shihao Ji, Zihui Song
Main category: cs.LG
TL;DR: MyGO is a lifelong learning framework that uses generative models instead of storing raw data to prevent catastrophic forgetting, inspired by biological wake-sleep cycles.
Details
Motivation: Address challenges in continual learning including data privacy concerns, storage limitations, and performance degradation with dissimilar tasks in existing approaches that rely on experience replay or complex regularization.
Method: Two-phase approach: ‘wake’ phase learns new tasks and trains compact generative models (G-mem) to capture data distributions; ‘sleep’ phase uses all G-mem models to generate pseudo-data and consolidate knowledge via distillation into a core feature extractor without storing raw data.
Result: MyGO significantly mitigates catastrophic forgetting and maintains high average accuracy across tasks on both computer vision (Split-MNIST) and natural language processing (Split-AG News) benchmarks compared to sequential fine-tuning baseline.
Conclusion: The framework effectively addresses privacy and storage efficiency concerns while demonstrating domain-generality and effectiveness in lifelong learning scenarios.
Abstract: Continual or Lifelong Learning aims to develop models capable of acquiring new knowledge from a sequence of tasks without catastrophically forgetting what has been learned before. Existing approaches often rely on storing samples from previous tasks (experience replay) or employing complex regularization terms to protect learned weights. However, these methods face challenges related to data privacy, storage limitations, and performance degradation when tasks are dissimilar. To address these challenges, we introduce MyGO (Memory Yielding Generative Offline-consolidation), a novel lifelong learning framework inspired by the biological wake-sleep cycle. During the “wake” phase, the system rapidly learns a new task and trains a compact generative model (Generative Memory, G-mem) to capture its data distribution. During the “sleep” phase, the system enters an offline state, using all learned G-mem models to generate pseudo-data (“dreams”) and consolidate new and old knowledge into a core feature extractor via knowledge distillation. This approach obviates the need to store any raw data, retaining only compact generative models, which offers significant advantages in privacy and storage efficiency. We evaluate MyGO on computer vision (Split-MNIST) and natural language processing (Split-AG News) benchmarks, comparing it against a sequential fine-tuning baseline. The results demonstrate that MyGO significantly mitigates catastrophic forgetting and maintains high average accuracy across tasks, proving the framework’s effectiveness and domain-generality.
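A compressed sketch of the sleep phase, with every interface (`g.sample`, the teacher copy, the distillation target) assumed rather than taken from the paper:

```python
import copy
import torch
import torch.nn.functional as F

def sleep_phase(core, g_mems, optimizer, n_dreams=256):
    """Each generative memory 'dreams' pseudo-data; the core extractor is
    distilled toward a frozen teacher's soft outputs on those dreams, so
    no raw data from past tasks is ever stored or replayed."""
    teacher = copy.deepcopy(core).eval()
    for g in g_mems:                      # one compact generator per task
        dreams = g.sample(n_dreams)       # decode pseudo-samples
        with torch.no_grad():
            targets = F.softmax(teacher(dreams), dim=-1)
        loss = F.kl_div(F.log_softmax(core(dreams), dim=-1),
                        targets, reduction="batchmean")
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```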
[211] Improving Fisher Information Estimation and Efficiency for LoRA-based LLM Unlearning
Yejin Kim, Eunwon Kim, Buru Chang, Junsuk Choe
Main category: cs.LG
TL;DR: VILA is a parameter-efficient machine unlearning framework that improves upon FILA by better accounting for Fisher information assumptions and enabling parameter identification without full model access, achieving 100x parameter efficiency and 40x faster training.
Details
Motivation: LLMs can unintentionally generate sensitive information, and retraining models to remove this data is computationally expensive. Existing unlearning methods like FILA have limitations in parameter importance estimation and require full model access.
Method: VILA explicitly considers the fundamental assumptions underlying Fisher information that were overlooked in FILA, enhancing accuracy of parameter identification for the forget set. It enables parameter identification without accessing the entire model, reducing computational costs.
Result: Achieves up to 100x higher parameter efficiency and 40x faster training speed compared to FILA. Sets new state-of-the-art performance on benchmarks including TOFU, WMDP, and MUSE.
Conclusion: VILA provides an efficient and accurate machine unlearning solution that effectively removes sensitive information from LLMs without the computational burden of full retraining, outperforming existing methods.
Abstract: LLMs have demonstrated remarkable performance across various tasks but face challenges related to unintentionally generating outputs containing sensitive information. A straightforward approach to address this issue is to retrain the model after excluding the problematic data. However, this approach incurs prohibitively high computational costs. To overcome this limitation, machine unlearning has emerged as a promising solution that can effectively remove sensitive information without the need to retrain the model from scratch. Recently, FILA has been proposed as a parameter-efficient unlearning method by integrating LoRA adapters. Specifically, it calculates the Fisher information to identify parameters associated with the forget set and assigns them to LoRA adapters for updates. Despite its innovative approach, FILA still requires access to all model parameters and does not adequately account for fundamental assumptions underlying Fisher information, leading to inaccuracies in importance estimation. To address these limitations, we propose VILA, a novel unlearning framework that explicitly considers the assumptions overlooked in FILA, thereby enhancing the accuracy of parameter identification for the forget set. Moreover, VILA significantly reduces computational costs by enabling parameter identification without accessing the entire model. Our method achieves up to 100x higher parameter efficiency and 40x faster training speed compared to FILA, and sets new state-of-the-art performance on benchmarks including TOFU, WMDP, and MUSE. Our code is available at https://github.com/kyj93790/VILA.
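The quantity at the center of both FILA and VILA is an empirical Fisher score per parameter; the textbook diagonal estimate over the forget set looks as follows (how VILA corrects the underlying assumptions, and how it avoids touching all parameters, is the paper's contribution and is not shown):

```python
import torch

def diagonal_fisher(model, loss_fn, loader, n_batches=32):
    """Mean squared gradient per parameter over (part of) the forget set;
    high-scoring parameters are the candidates to route to LoRA adapters."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()
              if p.requires_grad}
    for i, batch in enumerate(loader):
        if i >= n_batches:
            break
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2 / n_batches
    return fisher
```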
[212] Convergence of regularized agent-state-based Q-learning in POMDPs
Amit Sinha, Matthieu Geist, Aditya Mahajan
Main category: cs.LG
TL;DR: The paper analyzes convergence of regularized agent-state-based Q-learning (RASQL) algorithms, showing they converge to fixed points of regularized MDPs under mild conditions.
Details
Motivation: To understand convergence properties of practical Q-learning algorithms that use agent states (not belief states) and policy regularization for exploration and stability.
Method: Theoretical analysis of RASQL algorithms, including variants with periodic policies, under mild technical conditions, supported by numerical examples.
Result: Proves RASQL converges to fixed points of appropriately defined regularized MDPs, with convergence behavior matching theoretical limits in numerical experiments.
Conclusion: Provides theoretical foundation for convergence of practical Q-learning algorithms with agent states and regularization, validated by empirical results.
Abstract: In this paper, we present a framework to understand the convergence of commonly used Q-learning reinforcement learning algorithms in practice. Two salient features of such algorithms are: (i) the Q-table is recursively updated using an agent state (such as the state of a recurrent neural network) which is not a belief state or an information state and (ii) policy regularization is often used to encourage exploration and stabilize the learning algorithm. We investigate the simplest form of such Q-learning algorithms which we call regularized agent-state-based Q-learning (RASQL) and show that it converges under mild technical conditions to the fixed point of an appropriately defined regularized MDP, which depends on the stationary distribution induced by the behavioral policy. We also show that a similar analysis continues to work for a variant of RASQL that learns periodic policies. We present numerical examples to illustrate that the empirical convergence behavior matches with the proposed theoretical limit.
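To fix ideas, a tabular, entropy-regularized Q update on a discrete agent state is the simplest member of the family studied (our simplification; the paper covers general agent states and regularizers):

```python
import numpy as np

def soft_q_update(Q, z, a, r, z_next, lr=0.1, gamma=0.99, tau=0.1):
    """One regularized agent-state Q-learning step: the bootstrap value is
    the soft (log-sum-exp) value at temperature tau, which corresponds to
    entropy regularization of the policy."""
    v_next = tau * np.log(np.exp(Q[z_next] / tau).sum())
    Q[z, a] += lr * (r + gamma * v_next - Q[z, a])
```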
[213] Distribution-Aware Feature Selection for SAEs
Narmeen Oozeer, Nirmalendu Prakash, Michael Lan, Alice Rigg, Amirali Abdullah
Main category: cs.LG
TL;DR: Sampled-SAE improves sparse autoencoders by scoring features across token batches and sampling from a restricted pool, balancing global consistency with fine-grained reconstruction.
Details
Motivation: TopK SAEs are inefficient as they treat all tokens equally despite varying information content, and BatchTopK causes an 'activation lottery' where rare high-magnitude features crowd out more informative ones.
Method: Score feature columns using L2 norm or entropy, form candidate pool of size Kl, then apply Top-K selection from this restricted feature pool across token batches.
Result: No single l value optimizes all metrics on Pythia-160M - best choice depends on trade-off between shared structure, reconstruction fidelity, and downstream performance.
Conclusion: Sampled-SAE reframes BatchTopK as a tunable, distribution-aware family that provides spectrum between batch-level and token-specific feature selection.
Abstract: Sparse autoencoders (SAEs) decompose neural activations into interpretable features. A widely adopted variant, the TopK SAE, reconstructs each token from its K most active latents. However, this approach is inefficient, as some tokens carry more information than others. BatchTopK addresses this limitation by selecting top activations across a batch of tokens. This improves average reconstruction but risks an “activation lottery,” where rare high-magnitude features crowd out more informative but lower-magnitude ones. To address this issue, we introduce Sampled-SAE: we score the columns (representing features) of the batch activation matrix (via $L_2$ norm or entropy), forming a candidate pool of size $Kl$, and then apply Top-$K$ to select tokens across the batch from the restricted pool of features. Varying $l$ traces a spectrum between batch-level and token-specific selection. At $l=1$, tokens draw only from $K$ globally influential features, while larger $l$ expands the pool toward standard BatchTopK and more token-specific features across the batch. Small $l$ thus enforces global consistency; large $l$ favors fine-grained reconstruction. On Pythia-160M, no single value of $l$ is optimal across all metrics: the best choice depends on the trade-off between shared structure, reconstruction fidelity, and downstream performance. Sampled-SAE thus reframes BatchTopK as a tunable, distribution-aware family.
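The selection rule is mechanical enough to sketch; details such as the scoring choice and tie handling are simplified here:

```python
import torch

def sampled_sae_select(acts, K, l):
    """acts: (B, F) latent activations for a batch of B tokens.
    Score feature columns by L2 norm, keep a pool of K*l candidates,
    then keep the batch's top B*K activations within that pool."""
    B, F = acts.shape
    pool = acts.norm(dim=0).topk(min(K * l, F)).indices   # candidate features
    restricted = torch.zeros_like(acts)
    restricted[:, pool] = acts[:, pool]
    flat = restricted.abs().flatten()
    mask = torch.zeros_like(flat)
    mask[flat.topk(B * K).indices] = 1.0                  # BatchTopK on pool
    return restricted.flatten().mul(mask).view(B, F)
```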
[214] Stage-Diff: Stage-wise Long-Term Time Series Generation Based on Diffusion Models
Xuan Hou, Shuhan Liu, Zhaohui Peng, Yaohui Chu, Yue Zhang, Yining Wang
Main category: cs.LG
TL;DR: Stage-Diff: A staged diffusion model for long-term time series generation that handles complex temporal dependencies and distribution shifts through stage-wise generation and progressive decomposition.
Details
Motivation: Long-term time series present challenges with complex long-range dependencies, gradual data distribution shifts, and intricate inter-feature relationships that existing generative models struggle to handle effectively.
Method: Uses stage-wise sequence generation with inter-stage information transfer to preserve long-term dependencies while modeling distribution shifts. Applies progressive sequence decomposition for channel-independent modeling at different time scales, combined with multi-channel fusion modeling for inter-stage transfer.
Result: Extensive experiments on multiple real-world datasets validate the effectiveness of Stage-Diff in long-term time series generation tasks.
Conclusion: Stage-Diff successfully addresses the key challenges of long-term time series generation by balancing long-term dependencies with distribution shifts and effectively capturing both intra-sequence and inter-sequence dependencies.
Abstract: Generative models have been successfully used in the field of time series generation. However, when dealing with long-term time series, which span over extended periods and exhibit more complex long-term temporal patterns, the task of generation becomes significantly more challenging. Long-term time series exhibit long-range temporal dependencies, but their data distribution also undergoes gradual changes over time. Finding a balance between these long-term dependencies and the drift in data distribution is a key challenge. On the other hand, long-term time series contain more complex interrelationships between different feature sequences, making the task of effectively capturing both intra-sequence and inter-sequence dependencies another important challenge. To address these issues, we propose Stage-Diff, a staged generative model for long-term time series based on diffusion models. First, through stage-wise sequence generation and inter-stage information transfer, the model preserves long-term sequence dependencies while enabling the modeling of data distribution shifts. Second, within each stage, progressive sequence decomposition is applied to perform channel-independent modeling at different time scales, while inter-stage information transfer utilizes multi-channel fusion modeling. This approach combines the robustness of channel-independent modeling with the information fusion advantages of multi-channel modeling, effectively balancing the intra-sequence and inter-sequence dependencies of long-term time series. Extensive experiments on multiple real-world datasets validate the effectiveness of Stage-Diff in long-term time series generation tasks.
[215] DLGAN : Time Series Synthesis Based on Dual-Layer Generative Adversarial Networks
Xuan Hou, Shuhan Liu, Zhaohui Peng, Yaohui Chu, Yue Zhang, Yining Wang
Main category: cs.LG
TL;DR: DLGAN is a dual-layer GAN model for time series synthesis that decomposes generation into feature extraction and reconstruction stages, using autoencoder supervision and GAN to better capture temporal dependencies and features.
Details
Motivation: Existing time series synthesis methods struggle to ensure temporal dependencies in generated sequences and have difficulty accurately capturing original time series features when modeling directly on random sequences.
Method: Proposes DLGAN with two stages: 1) sequence feature extraction and reconstruction forming a complete autoencoder for supervised learning, and 2) GAN to generate synthetic feature vectors aligned with real sequence features.
Result: Extensive experiments on four public datasets demonstrate the model’s superiority across various evaluation metrics.
Conclusion: DLGAN provides an effective solution for time series synthesis that better preserves temporal dependencies and captures original time series features compared to existing methods.
Abstract: Time series synthesis is an effective approach to ensuring the secure circulation of time series data. Existing time series synthesis methods typically perform temporal modeling based on random sequences to generate target sequences, which often struggle to ensure the temporal dependencies in the generated time series. Additionally, directly modeling temporal features on random sequences makes it challenging to accurately capture the feature information of the original time series. To address the above issues, we propose a simple but effective generative model, Dual-Layer Generative Adversarial Networks, named DLGAN. The model decomposes the time series generation process into two stages: sequence feature extraction and sequence reconstruction. First, these two stages form a complete time series autoencoder, enabling supervised learning on the original time series to ensure that the reconstruction process can restore the temporal dependencies of the sequence. Second, a Generative Adversarial Network (GAN) is used to generate synthetic feature vectors that align with the real-time sequence feature vectors, ensuring that the generator can capture the temporal features from real time series. Extensive experiments on four public datasets demonstrate the superiority of this model across various evaluation metrics.
[216] Adaptive Heavy-Tailed Stochastic Gradient Descent
Bodu Gong, Gustavo Enrique Batista, Pierre Lafaye de Micheaux
Main category: cs.LG
TL;DR: AHTSGD is a novel optimization algorithm that adaptively injects heavy-tailed noise during early training to enhance exploration and transitions to lighter-tailed noise as sharpness stabilizes, improving generalization and convergence to wide basins.
Details
Motivation: Address overreliance on training loss in large neural networks by promoting convergence to wide basins (which offer better generalization) rather than sharp minima, inspired by the heavy-tailed distribution of gradient noise and the Edge of Stability phenomenon.
Method: Adaptive Heavy-Tailed Stochastic Gradient Descent (AHTSGD) dynamically adjusts the tail heaviness of injected noise based on training stage - heavier tails early for exploration, lighter tails later as sharpness stabilizes.
Result: Outperforms SGD and other noise-based methods on MNIST, CIFAR-10, and especially noisy datasets like SVHN. Accelerates early training from poor initializations and improves generalization across clean and noisy settings.
Conclusion: AHTSGD effectively leverages the Edge of Stability phenomenon to adapt noise injection, promoting convergence to wide basins and demonstrating robust performance across various datasets and learning rates.
Abstract: In the era of large-scale neural network models, optimization algorithms often struggle with generalization due to an overreliance on training loss. One key insight widely accepted in the machine learning community is the idea that wide basins (regions around a local minimum where the loss increases gradually) promote better generalization by offering greater stability to small changes in input data or model parameters. In contrast, sharp minima are typically more sensitive and less stable. Motivated by two key empirical observations - the inherent heavy-tailed distribution of gradient noise in stochastic gradient descent and the Edge of Stability phenomenon during neural network training, in which curvature grows before settling at a plateau, we introduce Adaptive Heavy Tailed Stochastic Gradient Descent (AHTSGD). The algorithm injects heavier-tailed noise into the optimizer during the early stages of training to enhance exploration and gradually transitions to lighter-tailed noise as sharpness stabilizes. By dynamically adapting to the sharpness of the loss landscape throughout training, AHTSGD promotes accelerated convergence to wide basins. AHTSGD is the first algorithm to adjust the nature of injected noise into an optimizer based on the Edge of Stability phenomenon. AHTSGD consistently outperforms SGD and other noise-based methods on benchmarks like MNIST and CIFAR-10, with marked gains on noisy datasets such as SVHN. It ultimately accelerates early training from poor initializations and improves generalization across clean and noisy settings, remaining robust to learning rate choices.
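To make the mechanism concrete, here is a minimal sketch, assuming a toy quadratic loss and a fixed schedule in place of the paper's sharpness-driven adaptation: Student-t noise with few degrees of freedom is heavy-tailed, so growing the degrees of freedom over training moves the injected noise from heavy- to light-tailed. The learning rate, noise scale, and schedule are illustrative assumptions.

```python
import numpy as np

def heavy_tailed_sgd_step(w, grad, lr, df, rng):
    """One SGD step with injected Student-t noise.

    Small df -> heavy tails (more exploration); large df -> near-Gaussian.
    """
    noise = rng.standard_t(df, size=w.shape)
    return w - lr * (grad + 0.01 * noise)

# Toy quadratic loss L(w) = 0.5 * ||w||^2, whose gradient is w.
rng = np.random.default_rng(0)
w = rng.normal(size=5) * 10.0  # deliberately poor initialization
for step in range(200):
    # Schedule df from 1.5 (heavy-tailed) toward 30 (near-Gaussian);
    # AHTSGD instead adapts the tail from measured sharpness.
    df = 1.5 + (30.0 - 1.5) * step / 199
    w = heavy_tailed_sgd_step(w, grad=w, lr=0.05, df=df, rng=rng)
print("final distance to optimum:", np.linalg.norm(w))
```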
[217] Iterative Inference in a Chess-Playing Neural Network
Elias Sandmann, Sebastian Lapuschkin, Wojciech Samek
Main category: cs.LG
TL;DR: Neural networks in chess engines show non-smooth representation building with early solutions being discarded and high policy divergence until late layers, contrasting with smooth convergence in language models.
Details
Motivation: To understand whether neural networks build representations through smooth refinement or complex computational processes by analyzing a superhuman chess engine's policy network.
Method: Extended the logit lens technique to analyze the policy network of Leela Chess Zero, examining playing strength, puzzle-solving ability, and policy distribution trajectories across layers.
Result: Found strong monotonic trends in playing strength and puzzle-solving ability, but policy distributions follow non-smooth trajectories with early correct solutions being discarded, poor move ranking correlations, and high policy divergence until late layers.
Conclusion: Chess policy networks exhibit complex, non-smooth computational processes rather than gradual refinement, contrasting with the smooth convergence observed in language models, suggesting different representation building mechanisms across domains.
Abstract: Do neural networks build their representations through smooth, gradual refinement, or via more complex computational processes? We investigate this by extending the logit lens to analyze the policy network of Leela Chess Zero, a superhuman chess engine. We find strong monotonic trends in playing strength and puzzle-solving ability across layers, yet policy distributions frequently follow non-smooth trajectories. Evidence for this includes correct puzzle solutions that are discovered early but subsequently discarded, move rankings that remain poorly correlated with final outputs, and high policy divergence until late in the network. These findings contrast with the smooth distributional convergence typically observed in language models.
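For readers unfamiliar with the technique, the sketch below shows the logit-lens measurement itself: apply the final policy head to each intermediate representation and track how far each layer's implied policy is from the final one. The residual stack, dimensions, and head here are toy stand-ins, not Leela Chess Zero's architecture.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n_moves, n_layers = 64, 128, 8

# Toy residual stack plus a policy head (stand-ins for the real network).
blocks = [torch.nn.Linear(d, d) for _ in range(n_layers)]
policy_head = torch.nn.Linear(d, n_moves)

x = torch.randn(1, d)
hiddens = []
for block in blocks:
    x = x + torch.tanh(block(x))  # residual update
    hiddens.append(x)

# Logit lens: decode every intermediate state with the *final* head.
final_policy = F.log_softmax(policy_head(hiddens[-1]), dim=-1)
for i, h in enumerate(hiddens):
    layer_policy = F.log_softmax(policy_head(h), dim=-1)
    kl = F.kl_div(layer_policy, final_policy, log_target=True,
                  reduction="batchmean")  # KL(final || layer)
    print(f"layer {i}: policy divergence {kl.item():.3f}")
```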
[218] PMODE: Theoretically Grounded and Modular Mixture Modeling
Robert A. Vandermeulen
Main category: cs.LG
TL;DR: PMODE is a modular framework for mixture modeling that partitions data and fits separate estimators to each subset, achieving near-optimal rates and supporting different distribution families. MV-PMODE extends this to high-dimensional settings with competitive performance.
Details
Motivation: To create a general and flexible mixture modeling framework that can handle both parametric and nonparametric components while maintaining theoretical guarantees and practical scalability to high-dimensional problems.
Method: Partitions data into subsets and fits separate density estimators to each partition, allowing mixture components from different distribution families. MV-PMODE specifically scales this approach to high-dimensional settings (thousands of dimensions).
Result: Achieves near-optimal rates for the estimator class, remains valid with heterogeneous mixture components, and MV-PMODE performs competitively against deep baselines on CIFAR-10 anomaly detection despite its simplicity.
Conclusion: PMODE provides a theoretically sound and practical framework for mixture modeling that is both flexible (supporting different distribution types) and scalable to high-dimensional applications while maintaining competitive performance.
Abstract: We introduce PMODE (Partitioned Mixture Of Density Estimators), a general and modular framework for mixture modeling with both parametric and nonparametric components. PMODE builds mixtures by partitioning the data and fitting separate estimators to each subset. It attains near-optimal rates for this estimator class and remains valid even when the mixture components come from different distribution families. As an application, we develop MV-PMODE, which scales a previously theoretical approach to high-dimensional density estimation to settings with thousands of dimensions. Despite its simplicity, it performs competitively against deep baselines on CIFAR-10 anomaly detection.
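A minimal sketch of the partition-then-fit idea on 1-D toy data, assuming an arbitrary threshold as the partition rule: one subset gets a nonparametric kernel density estimate, the other a parametric Gaussian, and the two are mixed with weights proportional to partition size. PMODE's actual contribution is choosing such partitions with near-optimal guarantees, which this sketch does not attempt.

```python
import numpy as np
from scipy.stats import norm
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Toy 1-D sample drawn from two different families.
data = np.concatenate([rng.normal(-3, 0.5, 300), rng.exponential(1.0, 300)])

# Partition the sample (here: a simple threshold) and fit a separate
# estimator to each subset -- components may come from different families.
left, right = data[data < -1], data[data >= -1]
kde = KernelDensity(bandwidth=0.3).fit(left.reshape(-1, 1))  # nonparametric
mu, sigma = right.mean(), right.std()                        # parametric

def mixture_density(x):
    w = len(left) / len(data)  # mixing weight from partition sizes
    p_left = np.exp(kde.score_samples(x.reshape(-1, 1)))
    p_right = norm.pdf(x, mu, sigma)
    return w * p_left + (1 - w) * p_right

print(mixture_density(np.array([-3.0, 0.5])))
```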
[219] Benchmarking the State of Networks with a Low-Cost Method Based on Reservoir Computing
Felix Simon Reimers, Carl-Hendrik Peters, Stefano Nichele
Main category: cs.LG
TL;DR: Using mobile network data and reservoir computing to monitor network states through proxy task performance, showing this low-cost method can detect network perturbations and identify weak spots.
Details
Motivation: To develop a non-invasive, low-cost method for monitoring communication and mobility networks using readily available mobile network utilization data.
Method: Transform mobile network data into reservoir computing models (echo state networks), use them for neuroscience-inspired proxy tasks, and measure performance to assess network state.
Result: Performance on proxy tasks relates to network state and visibly decreases when networks are perturbed, demonstrating the method’s sensitivity to network conditions.
Conclusion: This proof-of-concept shows reservoir computing with mobile network data can enable near-real-time monitoring and identification of weak spots in communication and transportation networks.
Abstract: Using data from mobile network utilization in Norway, we showcase the possibility of monitoring the state of communication and mobility networks with a non-invasive, low-cost method. This method transforms the network data into a model within the framework of reservoir computing and then measures the model’s performance on proxy tasks. Experimentally, we show how the performance on these proxies relates to the state of the network. A key advantage of this approach is that it uses readily available data sets and leverages the reservoir computing framework for an inexpensive and largely agnostic method. Data from mobile network utilization is available in an anonymous, aggregated form with multiple snapshots per day. This data can be treated like a weighted network. Reservoir computing allows the use of weighted, but untrained networks as a machine learning tool. The network, initialized as a so-called echo state network (ESN), projects incoming signals into a higher-dimensional space, on which a single trained layer operates. This consumes less energy than deep neural networks, in which every weight of the network is trained. We use neuroscience-inspired tasks and train our ESN model to solve them. We then show how the performance depends on certain network configurations and how it visibly decreases when the network is perturbed. While this work serves as a proof of concept, we believe it can be elevated to near-real-time monitoring and the identification of possible weak spots in both mobile communication and transportation networks.
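To make the pipeline concrete, here is a minimal echo state network with a ridge-regression readout trained on a simple memory proxy task. The reservoir below is random, whereas the paper initializes it from the weighted mobile-network data; the task, sizes, and spectral radius are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, delay = 200, 1000, 5  # reservoir size, time steps, memory lag

# Fixed, untrained reservoir; in the paper the coupling weights would
# come from the mobile-network snapshots instead of random draws.
W = rng.normal(0, 1, (n, n))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1
W_in = rng.normal(0, 0.5, n)

u = rng.uniform(-1, 1, T)  # scalar input signal
states = np.zeros((T, n))
x = np.zeros(n)
for t in range(T):
    x = np.tanh(W @ x + W_in * u[t])
    states[t] = x

# Train only a linear readout (ridge regression) on a proxy task:
# recall the input from `delay` steps ago.
X, y = states[delay:], u[:-delay]
readout = np.linalg.solve(X.T @ X + 1e-6 * np.eye(n), X.T @ y)
print("proxy-task MSE:", np.mean((X @ readout - y) ** 2))
```

Perturbing W (for instance, zeroing a block of rows) and re-measuring the proxy-task error is the kind of probe the paper uses to expose weak spots.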
[220] Rethinking Layer-wise Model Merging through Chain of Merges
Pietro Buzzega, Riccardo Salami, Angelo Porrello, Simone Calderara
Main category: cs.LG
TL;DR: CoM is a layer-wise merging method that addresses internal covariate shift in model merging by updating activation statistics auto-regressively, achieving SOTA performance.
Details
Motivation: Existing model merging techniques treat layers independently, failing to account for inter-layer dependencies and causing distributional mismatches similar to internal covariate shift.
Method: Chain of Merges (CoM) - a layer-wise merging procedure that updates activation statistics in an auto-regressive fashion to explicitly handle cross-layer interactions through conditionally optimal updates.
Result: Experiments on standard benchmarks demonstrate that CoM achieves state-of-the-art performance in model merging.
Conclusion: CoM effectively mitigates degradation caused by covariate shift in model merging by accounting for inter-layer dependencies, producing coherent merged models without retraining.
Abstract: Fine-tuning pretrained models has become a standard pathway to achieve state-of-the-art performance across a wide range of domains, leading to a proliferation of task-specific model variants. As the number of such specialized modules increases, merging them into a unified model without retraining has become a critical challenge. Existing merging techniques often rely on interference heuristics, importance weighting, or activation matching while treating each layer independently, thereby failing to account for the inter-layer dependencies inherent in deep networks. This simplification leads to distributional mismatches, especially in activation-based methods, when changes in early layers are not properly reflected in downstream ones. We identify these mismatches as a form of internal covariate shift, comparable to the phenomenon encountered in the initial phases of neural network training. To address it, we propose Chain of Merges (CoM), a layer-wise merging procedure that updates activation statistics in an auto-regressive fashion, explicitly accounting for cross-layer interactions. CoM produces a coherent merged model through a series of conditionally optimal updates, effectively mitigating degradation caused by covariate shift. Experiments on standard benchmarks demonstrate that CoM achieves state-of-the-art performance.
[221] Quantum enhanced ensemble GANs for anomaly detection in continuous biomanufacturing
Rajiv Kailasanathan, William R. Clements, Mohammad Reza Boskabadi, Shawn M. Gibford, Emmanouil Papadakis, Christopher J. Savoie, Seyed Soheil Mansouri
Main category: cs.LG
TL;DR: Novel GAN-based framework for unsupervised anomaly detection in continuous biomanufacturing, with hybrid quantum/classical approach showing improved performance.
Details
Motivation: Continuous biomanufacturing requires robust early anomaly detection as minor deviations compromise yield, stability, and economic performance due to complex non-linear dynamics.
Method: Ensemble of generative adversarial networks (GANs) for unsupervised anomaly detection, benchmarked on simulated dataset with normal/anomalous regimes, and evaluated hybrid quantum/classical GAN approach using both simulated and real photonic quantum processors.
Result: GAN-based framework effectively detects anomalies from sudden feedstock variability, and hybrid quantum/classical approach yields improved anomaly detection rates.
Conclusion: Hybrid quantum/classical approaches show potential for solving real-world problems in complex continuous biomanufacturing processes.
Abstract: The development of continuous biomanufacturing processes requires robust and early anomaly detection, since even minor deviations can compromise yield and stability, leading to disruptions in scheduling, reduced weekly production, and diminished economic performance. These processes are inherently complex and exhibit non-linear dynamics with intricate relationships between process variables, thus making advanced methods for anomaly detection essential for efficient operation. In this work, we present a novel framework for unsupervised anomaly detection in continuous biomanufacturing based on an ensemble of generative adversarial networks (GANs). We first establish a benchmark dataset simulating both normal and anomalous operation regimes in a continuous process for the production of a small molecule. We then demonstrate the effectiveness of our GAN-based framework in detecting anomalies caused by sudden feedstock variability. Finally, we evaluate the impact of using a hybrid quantum/classical GAN approach with both a simulated quantum circuit and a real photonic quantum processor on anomaly detection performance. We find that the hybrid approach yields improved anomaly detection rates. Our work shows the potential of hybrid quantum/classical approaches for solving real-world problems in complex continuous biomanufacturing processes.
[222] Beyond expected value: geometric mean optimization for long-term policy performance in reinforcement learning
Xinyi Sheng, Dominik Baumann
Main category: cs.LG
TL;DR: A novel RL algorithm that combines standard expected cumulative reward optimization with time-average growth rate to improve individual trajectory performance in real-world deployments.
Details
Motivation: Traditional RL optimizes expected cumulative reward (ensemble average), but real-world applications often require good performance on individual trajectories rather than average performance across infinite trajectories.
Method: Proposes a Bellman operator for time-average growth rate, uses geometric mean under multiplicative reward dynamics, and introduces a modified geometric mean with N-sliding window estimator for general reward dynamics. Embeds this as a regularizer into the objective function.
Result: The algorithm outperforms conventional RL methods in challenging simulations by enabling policies to benefit from both ensemble average and time-average performance simultaneously.
Conclusion: The proposed approach successfully addresses the limitation of traditional RL by optimizing for both average performance and individual trajectory performance, making it more suitable for real-world deployment scenarios.
Abstract: Reinforcement learning (RL) algorithms typically optimize the expected cumulative reward, i.e., the expected value of the sum of scalar rewards an agent receives over the course of a trajectory. The expected value averages the performance over an infinite number of trajectories. However, when deploying the agent in the real world, this ensemble average may be uninformative for the performance of individual trajectories. Thus, in many applications, optimizing the long-term performance of individual trajectories might be more desirable. In this work, we propose a novel RL algorithm that combines the standard ensemble average with the time-average growth rate, a measure for the long-term performance of individual trajectories. We first define the Bellman operator for the time-average growth rate. We then show that, under multiplicative reward dynamics, the geometric mean aligns with the time-average growth rate. To address more general and unknown reward dynamics, we propose a modified geometric mean with $N$-sliding window that captures the path-dependency as an estimator for the time-average growth rate. This estimator is embedded as a regularizer into the objective, forming a practical algorithm and enabling the policy to benefit from ensemble average and time-average simultaneously. We evaluate our algorithm in challenging simulations, where it outperforms conventional RL methods.
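The estimator at the core of the method is easy to sketch: a geometric mean over an N-step sliding window, combined with the ordinary ensemble average through a regularization weight. The positive rewards, window length, and weight below are illustrative assumptions; the paper's modified estimator handles more general reward dynamics.

```python
import numpy as np

def sliding_geometric_mean(rewards, N):
    """Geometric mean over each length-N window: (r_{t-N+1} * ... * r_t)^(1/N)."""
    r = np.asarray(rewards, dtype=float)  # assumed positive here
    out = [np.exp(np.mean(np.log(r[t - N + 1 : t + 1])))
           for t in range(N - 1, len(r))]
    return np.array(out)

rewards = np.random.default_rng(0).uniform(0.5, 2.0, 100)
g = sliding_geometric_mean(rewards, N=10)

# Hypothetical combined objective: ensemble average plus a
# time-average-growth regularizer weighted by lam.
lam = 0.1
objective = rewards[9:].mean() + lam * g.mean()
print(objective)
```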
[223] Normalized Maximum Likelihood Code-Length on Riemannian Manifold Data Spaces
Kota Fukuzawa, Atsushi Suzuki, Kenji Yamanishi
Main category: cs.LG
TL;DR: Proposes Riemannian manifold NML (Rm-NML) for geometric model selection on non-Euclidean spaces like hyperbolic spaces, extending traditional NML to be coordinate-invariant and applicable to hierarchical graph data.
Details
Motivation: Existing Normalized Maximum Likelihood (NML) formulations are limited to Euclidean spaces and coordinate-dependent, making them unsuitable for Riemannian manifolds like hyperbolic spaces which are increasingly important for hierarchical graph data representation.
Method: Defines a new coordinate-invariant Rm-NML that reflects Riemannian geometric structure, extends computational techniques to manifolds, and derives simplification methods for Riemannian symmetric spaces including hyperbolic spaces.
Result: Developed Rm-NML that coincides with conventional NML in Euclidean space under natural parameterization and demonstrated practical application by explicitly computing Rm-NML for normal distributions on hyperbolic spaces.
Conclusion: The proposed Rm-NML provides a geometrically appropriate extension of model selection techniques to Riemannian manifolds, enabling better handling of hierarchical graph data in non-Euclidean spaces like hyperbolic geometry.
Abstract: In recent years, with the large-scale expansion of graph data, there has been an increased focus on Riemannian manifold data spaces other than Euclidean space. In particular, the development of hyperbolic spaces has been remarkable, and they have high expressive power for graph data with hierarchical structures. Normalized Maximum Likelihood (NML) is employed in regret minimization and model selection. However, existing formulations of NML have been developed primarily in Euclidean spaces and are inherently dependent on the choice of coordinate systems, making it non-trivial to extend NML to Riemannian manifolds. In this study, we define a new NML that reflects the geometric structure of Riemannian manifolds, called the Riemannian manifold NML (Rm-NML). This Rm-NML is invariant under coordinate transformations and coincides with the conventional NML under the natural parameterization in Euclidean space. We extend existing computational techniques for NML to the setting of Riemannian manifolds. Furthermore, we derive a method to simplify the computation of Rm-NML on Riemannian symmetric spaces, which encompass data spaces of growing interest such as hyperbolic spaces. To illustrate the practical application of our proposed method, we explicitly computed the Rm-NML for normal distributions on hyperbolic spaces.
[224] Controllable 3D Molecular Generation for Structure-Based Drug Design Through Bayesian Flow Networks and Gradient Integration
Seungyeon Choi, Hwanhee Kim, Chihyun Park, Dahyeon Lee, Seungyong Lee, Yoonju Kim, Hyoungjoon Park, Sein Kwon, Youngwan Jo, Sanghyun Park
Main category: cs.LG
TL;DR: CByG is a novel gradient-based conditional generative framework that extends Bayesian Flow Networks to address limitations in current SBDD models by incorporating binding affinity, synthetic feasibility, and selectivity guidance for practical drug discovery.
Details
Motivation: Current SBDD generative models focus mainly on binding affinity but neglect critical real-world drug discovery requirements like synthetic feasibility and selectivity, creating a gap between academic research and practical applications.
Method: Extends Bayesian Flow Network into a gradient-based conditional generative model (CByG) that integrates property-specific guidance for multiple pharmacological properties including binding affinity, synthetic feasibility, and selectivity.
Result: CByG significantly outperforms baseline models across multiple essential evaluation criteria, demonstrating superior performance in practical drug discovery metrics.
Conclusion: The proposed framework effectively bridges the gap between academic research and real-world drug discovery by comprehensively addressing multiple critical pharmacological properties through robust conditional guidance.
Abstract: Recent advances in Structure-based Drug Design (SBDD) have leveraged generative models for 3D molecular generation, predominantly evaluating model performance by binding affinity to target proteins. However, practical drug discovery necessitates high binding affinity along with synthetic feasibility and selectivity, critical properties that were largely neglected in previous evaluations. To address this gap, we identify fundamental limitations of conventional diffusion-based generative models in effectively guiding molecule generation toward these diverse pharmacological properties. We propose CByG, a novel framework extending Bayesian Flow Network into a gradient-based conditional generative model that robustly integrates property-specific guidance. Additionally, we introduce a comprehensive evaluation scheme incorporating practical benchmarks for binding affinity, synthetic feasibility, and selectivity, overcoming the limitations of conventional evaluation methods. Extensive experiments demonstrate that our proposed CByG framework significantly outperforms baseline models across multiple essential evaluation criteria, highlighting its effectiveness and practicality for real-world drug discovery applications.
[225] Priors Matter: Addressing Misspecification in Bayesian Deep Q-Learning
Pascal R. van der Vaart, Neil Yorke-Smith, Matthijs T. J. Spaan
Main category: cs.LG
TL;DR: Bayesian deep Q-learning shows improved performance when reducing posterior temperature (cold posterior effect), contrary to theory. Common Gaussian likelihood assumptions are often violated, and better priors can improve Bayesian RL algorithms.
Details
Motivation: To understand why Bayesian reinforcement learning shows a cold posterior effect (performance improvement with reduced temperature) and to challenge common assumptions about likelihoods and priors in model-free Bayesian algorithms.
Method: Empirical study of prior distributions and statistical tests to validate likelihood assumptions in Bayesian deep Q-learning. Development and testing of simple, implementable solutions for better priors.
Result: Found that Gaussian likelihood assumptions are frequently violated in practice. Demonstrated that improved prior distributions lead to more performant Bayesian algorithms in deep Q-learning.
Conclusion: Future Bayesian reinforcement learning research should focus on developing more suitable likelihoods and priors rather than just improving posterior approximation accuracy, as current assumptions are often invalid and better priors can significantly enhance performance.
Abstract: Uncertainty quantification in reinforcement learning can greatly improve exploration and robustness. Approximate Bayesian approaches have recently been popularized to quantify uncertainty in model-free algorithms. However, so far the focus has been on improving the accuracy of the posterior approximation, instead of studying the accuracy of the prior and likelihood assumptions underlying the posterior. In this work, we demonstrate that there is a cold posterior effect in Bayesian deep Q-learning, where contrary to theory, performance increases when reducing the temperature of the posterior. To identify and overcome likely causes, we challenge common assumptions made on the likelihood and priors in Bayesian model-free algorithms. We empirically study prior distributions and show through statistical tests that the common Gaussian likelihood assumption is frequently violated. We argue that developing more suitable likelihoods and priors should be a key focus in future Bayesian reinforcement learning research and we offer simple, implementable solutions for better priors in deep Q-learning that lead to more performant Bayesian algorithms.
[226] Failure Prediction Is a Better Performance Proxy for Early-Exit Networks Than Calibration
Piotr Kubaty, Filip Szatkowski, Metod Jazbec, Bartosz Wójcik
Main category: cs.LG
TL;DR: Early-exit models use confidence-based exit strategies, but calibration can be misleading. Failure prediction is a better proxy for performance than calibration.
Details
Motivation: To address limitations of confidence-based exit strategies in early-exit models, where calibration measures can be misleading indicators of performance despite being commonly used.
Method: Proposed using failure prediction instead of calibration as a proxy for early-exit model performance, as it accounts for sample ranking changes and correlates better with efficiency improvements.
Result: Demonstrated empirical cases where miscalibrated networks outperformed calibrated ones, showing that failure prediction has stronger correlation with efficiency gains than calibration methods.
Conclusion: Failure prediction is a more dependable basis for designing and evaluating early-exit models compared to calibration, as it better captures actual performance and computational efficiency.
Abstract: Early-exit models speed up inference by attaching internal classifiers to intermediate layers of the model and allowing computation to stop once a prediction satisfies an exit criterion. Most early-exit methods rely on confidence-based exit strategies, which motivated some works to calibrate intermediate classifiers to improve the performance of the entire model. In this paper, we show that calibration measures can be misleading indicators of the performance of multi-exit models: a well-calibrated classifier may still waste computation, and common calibration methods do not preserve the sample ranking within a classifier. We demonstrate empirical cases where miscalibrated networks outperform calibrated ones. As an alternative, we propose to use failure prediction as a more useful proxy for early-exit model performance. Unlike calibration, failure prediction accounts for changes in the ranking of samples and shows a strong correlation with efficiency improvements, making it a more dependable basis for designing and evaluating early-exit models.
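The contrast between the two proxies is straightforward to operationalize: expected calibration error checks whether confidence matches accuracy within bins, while failure prediction checks whether confidence ranks incorrect predictions below correct ones, e.g. as the AUROC of confidence as a correctness score. A minimal sketch on simulated exit-classifier confidences:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ece(conf, correct, n_bins=10):
    """Expected calibration error over equal-width confidence bins."""
    edges = np.linspace(0, 1, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (conf > lo) & (conf <= hi)
        if m.any():
            total += m.mean() * abs(conf[m].mean() - correct[m].mean())
    return total

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 1000)       # simulated confidences
correct = rng.uniform(size=1000) < conf  # simulated correctness

print("calibration (ECE):", ece(conf, correct.astype(float)))
# Failure prediction: AUROC of confidence as a predictor of correctness.
print("failure-prediction AUROC:", roc_auc_score(correct, conf))
```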
[227] Accept or Deny? Evaluating LLM Fairness and Performance in Loan Approval across Table-to-Text Serialization Approaches
Israel Abebe Azime, Deborah D. Kanubala, Tejumade Afonja, Mario Fritz, Isabel Valera, Dietrich Klakow, Philipp Slusallek
Main category: cs.LG
TL;DR: LLMs struggle with tabular data fairness in loan approvals. Serialization format choice significantly impacts both performance and fairness, with some formats improving F1 scores but worsening fairness disparities. ICL boosts performance but has inconsistent fairness effects across datasets.
Details
Motivation: As LLMs are increasingly used in high-stakes financial decision-making like loan approvals, there's a need to assess their performance and fairness when processing tabular data from different geographic regions to ensure reliable predictions.
Method: Evaluated LLMs on serialized loan approval datasets from Ghana, Germany, and US, focusing on zero-shot and in-context learning capabilities. Tested different serialization formats (GReat, LIFT, etc.) for converting tabular data to text.
Result: Serialization format significantly affects both performance and fairness - certain formats like GReat and LIFT yield higher F1 scores but exacerbate fairness disparities. ICL improved performance by 4.9-59.6% over zero-shot, but fairness effects varied across datasets.
Conclusion: Effective tabular data representation methods and fairness-aware models are crucial for improving LLM reliability in financial decision-making, as serialization format choices significantly impact both performance and fairness outcomes.
Abstract: Large Language Models (LLMs) are increasingly employed in high-stakes decision-making tasks, such as loan approvals. While their applications expand across domains, LLMs struggle to process tabular data while ensuring fairness and delivering reliable predictions. In this work, we assess the performance and fairness of LLMs on serialized loan approval datasets from three geographically distinct regions: Ghana, Germany, and the United States. Our evaluation focuses on the model’s zero-shot and in-context learning (ICL) capabilities. Our results reveal that the choice of serialization format (serialization refers to the process of converting tabular data into text formats suitable for processing by LLMs) significantly affects both performance and fairness in LLMs, with certain formats such as GReat and LIFT yielding higher F1 scores but exacerbating fairness disparities. Notably, while ICL improved model performance by 4.9-59.6% relative to zero-shot baselines, its effect on fairness varied considerably across datasets. Our work underscores the importance of effective tabular data representation methods and fairness-aware models to improve the reliability of LLMs in financial decision-making.
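To illustrate what a serialization choice looks like in practice, here is a sketch of two simple row-to-text templates for a hypothetical loan record. The exact templates used by GReat and LIFT differ; treat these as illustrative stand-ins.

```python
# Hypothetical loan-application record.
row = {"age": 34, "income": 42000, "credit_history": "good",
       "loan_amount": 10000}

def serialize_sentences(row):
    """'The <feature> is <value>.' style, loosely in the spirit of
    natural-language templates such as LIFT's (exact format differs)."""
    return " ".join(f"The {k.replace('_', ' ')} is {v}."
                    for k, v in row.items())

def serialize_key_value(row):
    """Compact 'key: value' style serialization."""
    return ", ".join(f"{k}: {v}" for k, v in row.items())

prompt = (serialize_sentences(row)
          + " Should this loan application be approved? Answer yes or no.")
print(prompt)
print(serialize_key_value(row))
```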
[228] Spiking Decision Transformers: Local Plasticity, Phase-Coding, and Dendritic Routing for Low-Power Sequence Control
Vishal Pandey, Debasmita Biswas
Main category: cs.LG
TL;DR: SNN-DT combines Transformer-based reinforcement learning with spiking neural networks for ultra-low-power decision making on edge devices, achieving comparable performance with 10,000x energy reduction.
Details
Motivation: Transformer-based RL agents are energy-intensive and unsuitable for edge platforms, while spiking neural networks offer ultra-low-power inference but haven't been integrated with return-conditioned sequence modeling.
Method: Embeds Leaky Integrate-and-Fire neurons into self-attention blocks, uses surrogate gradient training, incorporates three-factor plasticity, phase-shifted spike-based positional encodings, and lightweight dendritic routing.
Result: Matches/exceeds standard Decision Transformer performance on classic control benchmarks (CartPole, MountainCar, Acrobot, Pendulum) while emitting <10 spikes per decision, indicating 10,000x energy reduction.
Conclusion: SNN-DT successfully merges sequence modeling with neuromorphic efficiency, enabling real-time, low-power control on embedded and wearable devices.
Abstract: Reinforcement learning agents based on Transformer architectures have achieved impressive performance on sequential decision-making tasks, but their reliance on dense matrix operations makes them ill-suited for energy-constrained, edge-oriented platforms. Spiking neural networks promise ultra-low-power, event-driven inference, yet no prior work has seamlessly merged spiking dynamics with return-conditioned sequence modeling. We present the Spiking Decision Transformer (SNN-DT), which embeds Leaky Integrate-and-Fire neurons into each self-attention block, trains end-to-end via surrogate gradients, and incorporates biologically inspired three-factor plasticity, phase-shifted spike-based positional encodings, and a lightweight dendritic routing module. Our implementation matches or exceeds standard Decision Transformer performance on classic control benchmarks (CartPole-v1, MountainCar-v0, Acrobot-v1, Pendulum-v1) while emitting fewer than ten spikes per decision, an energy proxy suggesting over four orders-of-magnitude reduction in per inference energy. By marrying sequence modeling with neuromorphic efficiency, SNN-DT opens a pathway toward real-time, low-power control on embedded and wearable devices.
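The piece that makes such models trainable end-to-end is the surrogate gradient for the non-differentiable spike. Below is a minimal Leaky Integrate-and-Fire step with a fast-sigmoid surrogate in PyTorch; the decay, threshold, and surrogate slope are illustrative assumptions, and the paper adds three-factor plasticity, phase-coded positional encodings, and dendritic routing on top of such a primitive.

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike in the forward pass, fast-sigmoid surrogate backward."""
    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        return grad_out / (1.0 + 10.0 * v.abs()) ** 2  # surrogate slope 10

def lif_step(v, inp, beta=0.9, threshold=1.0):
    """One Leaky Integrate-and-Fire update with soft reset."""
    v = beta * v + inp                    # leaky integration
    spike = SpikeFn.apply(v - threshold)  # fire when above threshold
    return v - threshold * spike, spike   # soft reset on spiking

v = torch.zeros(4)
inp = torch.rand(10, 4, requires_grad=True)  # 10 time steps, 4 neurons
spikes = []
for t in range(10):
    v, s = lif_step(v, inp[t])
    spikes.append(s)
total_spikes = torch.stack(spikes).sum()
total_spikes.backward()  # gradients flow through the surrogate
print(total_spikes.item(), inp.grad.abs().sum().item())
```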
[229] Summarize-Exemplify-Reflect: Data-driven Insight Distillation Empowers LLMs for Few-shot Tabular Classification
Yifei Yuan, Jiatong Li, Weijia Zhang, Mohammad Aliannejadi, Evangelos Kanoulas, Renjun Hu
Main category: cs.LG
TL;DR: InsightTab framework distills tabular data into actionable insights using divide-and-conquer, easy-first, and reflective learning principles to improve LLM performance on few-shot tabular classification tasks.
Details
Motivation: Address challenges in few-shot tabular classification with LLMs due to variability in structured data, inspired by human learning processes to make LLMs more effective on specific tabular tasks.
Method: InsightTab framework with rule summarization, strategic exemplification, and insight reflection through collaboration between LLMs and data modeling techniques, guided by divide-and-conquer, easy-first, and reflective learning principles.
Result: Consistent improvement over state-of-the-art methods across nine datasets, validated by ablation studies showing effectiveness in leveraging labeled data and managing bias.
Conclusion: The insight distillation approach enables LLMs to better align general knowledge with specific tabular task requirements, demonstrating robust and effective few-shot classification performance.
Abstract: Recent studies show the promise of large language models (LLMs) for few-shot tabular classification but highlight challenges due to the variability in structured data. To address this, we propose distilling data into actionable insights to enable robust and effective classification by LLMs. Drawing inspiration from human learning processes, we introduce InsightTab, an insight distillation framework guided by principles of divide-and-conquer, easy-first, and reflective learning. Our approach integrates rule summarization, strategic exemplification, and insight reflection through deep collaboration between LLMs and data modeling techniques. The obtained insights enable LLMs to better align their general knowledge and capabilities with the particular requirements of specific tabular tasks. We extensively evaluate InsightTab on nine datasets. The results demonstrate consistent improvement over state-of-the-art methods. Ablation studies further validate the principle-guided distillation process, while analyses emphasize InsightTab’s effectiveness in leveraging labeled data and managing bias.
[230] On the Hardness of Learning GNN-based SAT Solvers: The Role of Graph Ricci Curvature
Geri Skenderi
Main category: cs.LG
TL;DR: GNN-based SAT solvers suffer performance degradation on harder instances due to negative graph curvature in SAT formulas, which causes oversquashing and limits long-range dependency compression.
Details
Motivation: To understand why Graph Neural Networks (GNNs) perform poorly on harder Boolean Satisfiability Problems (SATs) and provide a geometric explanation through graph curvature analysis.
Method: Analyze SAT formulas using graph Ricci Curvature (RC) to quantify local connectivity bottlenecks, prove that random k-SAT bipartite graphs are inherently negatively curved, and show how curvature decreases with instance difficulty causing oversquashing in GNNs.
Result: Empirical validation across SAT benchmarks confirms that curvature is a strong indicator of problem complexity and can predict GNN solver performance. Harder SAT instances exhibit more negative curvature.
Conclusion: Negative graph curvature in SAT formulas fundamentally limits GNN performance through oversquashing. These findings connect to existing solver design principles and outline promising research directions for improving GNN-based SAT solvers.
Abstract: Graph Neural Networks (GNNs) have recently shown promise as solvers for Boolean Satisfiability Problems (SATs) by operating on graph representations of logical formulas. However, their performance degrades sharply on harder instances, raising the question of whether this reflects fundamental architectural limitations. In this work, we provide a geometric explanation through the lens of graph Ricci Curvature (RC), which quantifies local connectivity bottlenecks. We prove that bipartite graphs derived from random k-SAT formulas are inherently negatively curved, and that this curvature decreases with instance difficulty. Building on this, we show that GNN-based SAT solvers are affected by oversquashing, a phenomenon where long-range dependencies become impossible to compress into fixed-length representations. We validate our claims empirically across different SAT benchmarks and confirm that curvature is both a strong indicator of problem complexity and can be used to predict performance. Finally, we connect our findings to design principles of existing solvers and outline promising directions for future work.
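A cheap way to see the claimed trend is to build random k-SAT incidence graphs and compute a discrete curvature proxy. The sketch below uses Forman-Ricci curvature (4 - deg(u) - deg(v) for an unweighted edge) as a stand-in for the paper's more general Ricci curvature analysis; the mean curvature grows more negative as the clause-to-variable ratio, a standard hardness knob for random k-SAT, increases.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

def random_ksat_graph(n_vars, n_clauses, k=3):
    """Bipartite variable-clause incidence graph of a random k-SAT formula."""
    G = nx.Graph()
    for c in range(n_clauses):
        for v in rng.choice(n_vars, size=k, replace=False):
            G.add_edge(f"x{v}", f"c{c}")
    return G

def mean_forman_curvature(G):
    """Forman-Ricci curvature of an unweighted edge: 4 - deg(u) - deg(v)."""
    return np.mean([4 - G.degree(u) - G.degree(v) for u, v in G.edges()])

# Sweep the clause-to-variable ratio alpha (4.26 is near the classic
# satisfiability threshold for random 3-SAT).
for alpha in (2.0, 4.26, 6.0):
    G = random_ksat_graph(n_vars=100, n_clauses=int(alpha * 100))
    print(f"alpha={alpha}: mean Forman curvature {mean_forman_curvature(G):.2f}")
```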
[231] What Data is Really Necessary? A Feasibility Study of Inference Data Minimization for Recommender Systems
Jens Leysen, Marco Favier, Bart Goethals
Main category: cs.LG
TL;DR: Data minimization for recommender systems is technically feasible through various techniques that can substantially reduce implicit feedback data without significant performance loss, but practical implementation depends heavily on technical settings and user characteristics.
Details
Motivation: Operationalizing the legal principle of data minimization for recommender systems that rely on extensive personal data remains a significant challenge, requiring investigation into feasibility and practical implementation.
Method: Proposed a novel problem formulation, analyzed various minimization techniques, and investigated key factors influencing their effectiveness for minimizing implicit feedback inference data in recommender systems.
Result: Substantial inference data reduction is technically feasible without significant performance loss, but effectiveness critically depends on technical settings (performance targets, model choice) and user characteristics (history size, preference complexity).
Conclusion: While data minimization is technically feasible for recommender systems, its practical implementation remains challenging due to dependence on technical and user context, making universal standards for data ‘necessity’ difficult to implement.
Abstract: Data minimization is a legal principle requiring personal data processing to be limited to what is necessary for a specified purpose. Operationalizing this principle for recommender systems, which rely on extensive personal data, remains a significant challenge. This paper conducts a feasibility study on minimizing implicit feedback inference data for such systems. We propose a novel problem formulation, analyze various minimization techniques, and investigate key factors influencing their effectiveness. We demonstrate that substantial inference data reduction is technically feasible without significant performance loss. However, its practicality is critically determined by two factors: the technical setting (e.g., performance targets, choice of model) and user characteristics (e.g., history size, preference complexity). Thus, while we establish its technical feasibility, we conclude that data minimization remains practically challenging and its dependence on the technical and user context makes a universal standard for data ‘necessity’ difficult to implement.
[232] Limitations of Physics-Informed Neural Networks: a Study on Smart Grid Surrogation
Julen Cestero, Carmine Delle Femine, Kenji S. Muro, Marco Quartulli, Marcello Restelli
Main category: cs.LG
TL;DR: PINNs outperform traditional ML models (XGBoost, Random Forest, Linear Regression) in smart grid modeling by integrating physical laws, showing superior generalization and maintaining physical feasibility despite slight degradation in extreme conditions.
Details
Motivation: Address data scarcity and physical consistency issues in conventional data-driven methods for smart grid modeling, aiming to bridge data-driven flexibility with first-principles rigor.
Method: Train Physics-Informed Neural Networks (PINNs) exclusively through physics-based loss functions that enforce power balance, operational constraints, and grid stability, comparing performance against XGBoost, Random Forest, and Linear Regression in interpolation, cross-validation, and episodic trajectory prediction experiments.
Result: PINNs demonstrate superior generalization with lower MAE in dynamic grid operations, reliably capturing state transitions in both random and expert-driven control scenarios, while traditional models exhibit erratic performance. PINNs consistently enforce physical feasibility despite slight degradation in extreme operational regimes.
Conclusion: PINNs represent a paradigm-shifting tool for smart grid surrogation, proving vital for safety-critical applications and advancing real-time grid control and scalable digital twins, emphasizing the necessity of physics-aware architectures in mission-critical energy systems.
Abstract: Physics-Informed Neural Networks (PINNs) present a transformative approach for smart grid modeling by integrating physical laws directly into learning frameworks, addressing critical challenges of data scarcity and physical consistency in conventional data-driven methods. This paper evaluates PINNs’ capabilities as surrogate models for smart grid dynamics, comparing their performance against XGBoost, Random Forest, and Linear Regression across three key experiments: interpolation, cross-validation, and episodic trajectory prediction. By training PINNs exclusively through physics-based loss functions (enforcing power balance, operational constraints, and grid stability) we demonstrate their superior generalization, outperforming data-driven models in error reduction. Notably, PINNs maintain comparatively lower MAE in dynamic grid operations, reliably capturing state transitions in both random and expert-driven control scenarios, while traditional models exhibit erratic performance. Despite slight degradation in extreme operational regimes, PINNs consistently enforce physical feasibility, proving vital for safety-critical applications. Our results contribute to establishing PINNs as a paradigm-shifting tool for smart grid surrogation, bridging data-driven flexibility with first-principles rigor. This work advances real-time grid control and scalable digital twins, emphasizing the necessity of physics-aware architectures in mission-critical energy systems.
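A minimal sketch of training purely through physics-based losses, assuming a toy surrogate that maps sampled loads to generator setpoints and is penalized only for power-balance and limit violations; no labeled targets are used. The architecture, limits, and loss weights are invented for illustration, and real smart-grid physics (stability, line flows) is far richer.

```python
import torch

torch.manual_seed(0)

# Toy surrogate: maps a 4-bus load vector to 2 generator setpoints.
net = torch.nn.Sequential(
    torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))

p_min, p_max = 0.0, 1.5  # hypothetical generator limits (per unit)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    loads = torch.rand(64, 4)  # sampled operating points, no labels
    gen = net(loads)
    # Physics-based losses only:
    # 1) power balance: total generation matches total load;
    # 2) operational constraints: setpoints stay within limits.
    balance = (gen.sum(dim=1) - loads.sum(dim=1)).pow(2).mean()
    limits = (torch.relu(p_min - gen) + torch.relu(gen - p_max)).pow(2).mean()
    loss = balance + 10.0 * limits
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final physics loss:", loss.item())
```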
[233] Comprehensive Signal Quality Evaluation of a Wearable Textile ECG Garment: A Sex-Balanced Study
Maximilian P. Oppelt, Tobias S. Zech, Sarah H. Lorenz, Laurenz Ottmann, Jan Steffan, Bjoern M. Eskofier, Nadine R. Lang-Richter, Norman Pfeiffer
Main category: cs.LG
TL;DR: Novel wearable textile ECG garment with optimized electrode placement achieves high signal quality comparable to reference devices, validated through comprehensive sex-balanced testing across multiple evaluation metrics.
Details
Motivation: To develop a textile-based ECG garment that minimizes noise and motion artifacts while ensuring suitability across different anatomical and physiological variations, particularly addressing sex-specific differences in signal acquisition.
Method: Comprehensive evaluation with 30 participants (15 male, 15 female) using quantitative signal quality indices, rhythm analysis, machine learning classification, morphological ECG analysis, and electrode projection angle studies across various activity phases.
Result: The textile system achieves signal quality highly concordant with reference devices in both rhythm and morphological analyses, exhibits robust classification performance, and enables identification of key sex-specific determinants affecting signal acquisition.
Conclusion: Textile-based ECG garments are practically viable for physiological monitoring and psychophysiological state detection, with sex-specific design considerations being crucial for equitable and reliable cardiac diagnostics in wearable health technologies.
Abstract: We introduce a novel wearable textile-garment featuring an innovative electrode placement aimed at minimizing noise and motion artifacts, thereby enhancing signal fidelity in Electrocardiography (ECG) recordings. We present a comprehensive, sex-balanced evaluation involving 15 healthy male and 15 healthy female participants to ensure the device’s suitability across anatomical and physiological variations. The assessment framework encompasses distinct evaluation approaches: quantitative signal quality indices to objectively benchmark device performance; rhythm-based analyses of physiological parameters such as heart rate and heart rate variability; machine learning classification tasks to assess application-relevant predictive utility; morphological analysis of ECG features including amplitude and interval parameters; and investigations of the effects of electrode projection angle given by the textile / body shape, with all analyses stratified by sex to elucidate sex-specific influences. Evaluations were conducted across various activity phases representing real-world conditions. The results demonstrate that the textile system achieves signal quality highly concordant with reference devices in both rhythm and morphological analyses, exhibits robust classification performance, and enables identification of key sex-specific determinants affecting signal acquisition. These findings underscore the practical viability of textile-based ECG garments for physiological monitoring as well as psychophysiological state detection. Moreover, we identify the importance of incorporating sex-specific design considerations to ensure equitable and reliable cardiac diagnostics in wearable health technologies.
[234] OASIS: Harnessing Diffusion Adversarial Network for Ocean Salinity Imputation using Sparse Drifter Trajectories
Bo Li, Yingqi Feng, Ming Jin, Xin Zheng, Yufei Tang, Laurent Cherubin, Alan Wee-Chung Liew, Can Wang, Qinghua Lu, Jingwei Yao, Shirui Pan, Hong Zhang, Xingquan Zhu
Main category: cs.LG
TL;DR: OASIS is a diffusion adversarial framework for ocean salinity imputation that handles sparse, noisy data better than traditional methods and standard ML approaches.
Details
Motivation: Ocean salinity measurement is sparse, irregular, and noisy in drifter datasets. Traditional methods rely on linearity and stationarity assumptions and are limited by issues like cloud cover and sensor drift. Machine learning models struggle with severe sparsity and lack principled ways to incorporate physical covariates.
Method: OASIS (OceAn Salinity Imputation System) - a novel diffusion adversarial framework designed to address the challenges of sparse and noisy ocean salinity data.
Result: Not specified in the provided abstract excerpt.
Conclusion: Not specified in the provided abstract excerpt.
Abstract: Ocean salinity plays a vital role in circulation, climate, and marine ecosystems, yet its measurement is often sparse, irregular, and noisy, especially in drifter-based datasets. Traditional approaches, such as remote sensing and optimal interpolation, rely on linearity and stationarity, and are limited by cloud cover, sensor drift, and low satellite revisit rates. While machine learning models offer flexibility, they often fail under severe sparsity and lack principled ways to incorporate physical covariates without specialized sensors. In this paper, we introduce the OceAn Salinity Imputation System (OASIS), a novel diffusion adversarial framework designed to address these challenges.
[235] Physics-Informed Spectral Modeling for Hyperspectral Imaging
Zuzanna Gawrysiak, Krzysztof Krawiec
Main category: cs.LG
TL;DR: PhISM is a physics-informed deep learning architecture that learns without supervision to disentangle hyperspectral observations using continuous basis functions, achieving superior performance on classification/regression tasks with limited labeled data and providing interpretable latent representations.
Details
Motivation: To develop an unsupervised learning approach that can effectively disentangle and model hyperspectral data while incorporating physical principles, addressing the challenge of limited labeled data in hyperspectral analysis and providing interpretable results.
Method: Physics-informed deep learning architecture that learns without supervision to explicitly disentangle hyperspectral observations and model them with continuous basis functions.
Result: Outperforms prior methods on several classification and regression benchmarks, requires limited labeled data, and provides additional insights through interpretable latent representation.
Conclusion: PhISM demonstrates that physics-informed deep learning can effectively handle hyperspectral data with minimal supervision, offering both performance improvements and interpretable insights through its disentangled continuous basis function approach.
Abstract: We present PhISM, a physics-informed deep learning architecture that learns without supervision to explicitly disentangle hyperspectral observations and model them with continuous basis functions. PhISM outperforms prior methods on several classification and regression benchmarks, requires limited labeled data, and provides additional insights thanks to an interpretable latent representation.
[236] Convergence of Stochastic Gradient Methods for Wide Two-Layer Physics-Informed Neural Networks
Bangti Jin, Longjun Wu
Main category: cs.LG
TL;DR: This paper establishes linear convergence guarantees for stochastic gradient descent in training over-parameterized two-layer Physics Informed Neural Networks (PINNs) for solving partial differential equations.
Details
Motivation: While PINNs are popular for solving PDEs using neural networks, most training uses stochastic gradient descent methods, but existing convergence analyses focused on full gradient descent. There was a need to provide theoretical guarantees for stochastic optimization methods commonly used in practice.
Method: The authors analyze stochastic gradient descent/flow for over-parameterized two-layer PINNs with general activation functions. The key challenge was handling dynamic randomness from stochastic optimization while maintaining positive definiteness of suitable Gram matrices during training.
Result: The paper establishes linear convergence of stochastic gradient descent in the high probability sense, extending previous results that only analyzed gradient descent. The analysis provides insight into the optimization dynamics.
Conclusion: This work provides theoretical convergence guarantees for stochastic optimization methods in training PINNs, which is crucial for practical applications where stochastic methods are commonly employed due to computational efficiency.
Abstract: Physics informed neural networks (PINNs) represent a very popular class of neural solvers for partial differential equations. In practice, one often employs stochastic gradient descent type algorithms to train the neural network. Therefore, the convergence guarantee of stochastic gradient descent is of fundamental importance. In this work, we establish the linear convergence of stochastic gradient descent / flow in training over-parameterized two-layer PINNs for a general class of activation functions, in the high-probability sense. These results extend the existing result [18], in which gradient descent was analyzed. The challenge of the analysis lies in handling the dynamic randomness introduced by stochastic optimization methods. The key is to ensure the positive definiteness of suitable Gram matrices throughout training. The analysis sheds insight into the dynamics of the optimization process, and provides guarantees on the neural networks trained by stochastic algorithms.
[237] Introduction to the Analysis of Probabilistic Decision-Making Algorithms
Agustinus Kristiadi
Main category: cs.LG
TL;DR: This monograph provides an accessible introduction to theoretical analysis of probabilistic decision-making algorithms like bandit algorithms, Bayesian optimization, and tree search, aimed at making these theories understandable to non-experts.
Details
Motivation: Decision theory algorithms are valuable for data-efficient scientific discovery but their theoretical analyses are often inaccessible to non-experts, limiting broader understanding and application.
Method: The monograph offers a self-contained introduction to theoretical analysis of probabilistic decision-making algorithms, assuming only basic probability, statistics, and elementary Gaussian process knowledge.
Result: Not applicable - this is an introductory monograph rather than a research paper presenting new results.
Conclusion: The work aims to bridge the accessibility gap in decision theory analysis, making these valuable theoretical foundations available to a wider audience in scientific discovery and related fields.
Abstract: Decision theories offer principled methods for making choices under various types of uncertainty. Algorithms that implement these theories have been successfully applied to a wide range of real-world problems, including materials and drug discovery. Indeed, they are desirable since they can adaptively gather information to make better decisions in the future, resulting in data-efficient workflows. In scientific discovery, where experiments are costly, these algorithms can thus significantly reduce the cost of experimentation. Theoretical analyses of these algorithms are crucial for understanding their behavior and providing valuable insights for developing next-generation algorithms. However, theoretical analyses in the literature are often inaccessible to non-experts. This monograph aims to provide an accessible, self-contained introduction to the theoretical analysis of commonly used probabilistic decision-making algorithms, including bandit algorithms, Bayesian optimization, and tree search algorithms. Only basic knowledge of probability theory and statistics, along with some elementary knowledge about Gaussian processes, is assumed.
[238] Predicting Social Media Engagement from Emotional and Temporal Features
Yunwoo Kim, Junhyuk Hwang
Main category: cs.LG
TL;DR: Machine learning model predicts social media engagement (likes/comments) from emotional/temporal features, achieving excellent prediction for likes (R²=0.98) but poor for comments (R²=0.41), suggesting different underlying mechanisms.
Details
Motivation: To understand and predict social media engagement patterns by leveraging emotional and temporal metadata, addressing the challenge of skewed engagement data and identifying factors that drive different types of user interactions.
Method: Multi-target regression using HistGradientBoostingRegressor on log-transformed engagement ratios from 600 annotated songs with valence, arousal, and sentiment metrics, evaluated with custom magnitude accuracy and standard regression metrics.
Result: Model achieves R²=0.98 for likes prediction but only R²=0.41 for comments, indicating emotional/temporal features and view counts effectively predict likes but not comments.
Conclusion: Likes are primarily driven by affective and exposure signals captured by emotional/temporal metadata, while comments depend on additional unrepresented factors, suggesting different predictive mechanisms for different engagement types.
Abstract: We present a machine learning approach for predicting social media engagement (comments and likes) from emotional and temporal features. The dataset contains 600 songs with annotations for valence, arousal, and related sentiment metrics. A multi-target regression model based on HistGradientBoostingRegressor is trained on log-transformed engagement ratios to address skewed targets. Performance is evaluated with both a custom order-of-magnitude accuracy metric and standard regression metrics, including the coefficient of determination (R^2). Results show that emotional and temporal metadata, together with existing view counts, predict future engagement effectively. The model attains R^2 = 0.98 for likes but only R^2 = 0.41 for comments. This gap indicates that likes are largely driven by readily captured affective and exposure signals, whereas comments depend on additional factors not represented in the current feature set.
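A minimal sketch of the modeling setup on synthetic stand-in data: HistGradientBoostingRegressor fits one target at a time, so the multi-target setup is obtained by wrapping it, and targets are log-transformed to tame skew. The feature and target construction below is invented for illustration and is not the paper's dataset.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
n = 600  # matches the dataset size in the paper; features are synthetic
X = np.column_stack([
    rng.uniform(0, 1, n),     # valence (stand-in)
    rng.uniform(0, 1, n),     # arousal (stand-in)
    rng.lognormal(10, 1, n),  # view count (stand-in)
])
# Synthetic engagement targets, log1p-transformed to address skew.
likes = np.log1p(50 * X[:, 0] + 1e-4 * X[:, 2] + rng.normal(0, 1, n))
comments = np.log1p(5 * X[:, 1] + rng.normal(0, 3, n))
Y = np.column_stack([likes, comments])

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)
model = MultiOutputRegressor(HistGradientBoostingRegressor(random_state=0))
model.fit(X_tr, Y_tr)
pred = model.predict(X_te)
print("R^2 (likes, comments):",
      r2_score(Y_te, pred, multioutput="raw_values"))
```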
[239] Activation Subspaces for Out-of-Distribution Detection
Barış Zöngür, Robin Hesse, Stefan Roth
Main category: cs.LG
TL;DR: ActSub: A novel OOD detection method using SVD to decompose model activations into decisive and insignificant components, achieving state-of-the-art performance by handling both Far-OOD and Near-OOD scenarios effectively.
Details
Motivation: To improve OOD detection reliability by distinguishing in-distribution from out-of-distribution samples, particularly addressing the limitations of existing methods in handling both large (Far-OOD) and small (Near-OOD) distribution shifts.
Method: Utilizes singular value decomposition of the classification head weight matrix to decompose model activations into decisive (maximally contributing) and insignificant (minimally contributing) components. For Far-OOD, uses insignificant subspace features; for Near-OOD, uses decisive subspace to avoid interference.
Result: Achieves state-of-the-art results across various standard OOD benchmarks, demonstrating superior performance in distinguishing ID from OOD data in both large and small distribution shift regimes.
Conclusion: The proposed ActSub method effectively leverages the complementary strengths of both decisive and insignificant subspaces from model activations, providing a unified approach that outperforms existing methods in OOD detection across different distribution shift scenarios.
Abstract: To ensure the reliability of deep models in real-world applications, out-of-distribution (OOD) detection methods aim to distinguish samples close to the training distribution (in-distribution, ID) from those farther away (OOD). In this work, we propose a novel OOD detection method that utilizes singular value decomposition of the weight matrix of the classification head to decompose the model's activations into decisive and insignificant components, which contribute maximally and minimally, respectively, to the final classifier output. We find that the subspace of insignificant components more effectively distinguishes ID from OOD data than raw activations in regimes of large distribution shifts (Far-OOD). This occurs because the classification objective leaves the insignificant subspace largely unaffected, yielding features that are "untainted" by the target classification task. Conversely, in regimes of smaller distribution shifts (Near-OOD), we find that activation shaping methods profit from considering only the decisive subspace, as the insignificant component can cause interference in the activation space. By combining the two findings into a single approach, termed ActSub, we achieve state-of-the-art results on various standard OOD benchmarks.
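A hedged sketch of the subspace split at the heart of the method, not the authors' implementation: take the SVD of the classification-head weights and project activations onto the leading (decisive) and trailing (insignificant) right-singular directions; the rank split k and the norm-based score are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, C, k = 512, 100, 100          # feature dim, classes, decisive rank (assumed)
W = rng.normal(size=(C, d))      # classification head weight matrix
feats = rng.normal(size=(32, d)) # penultimate-layer activations for a batch

# Right singular vectors of W span directions in activation space.
_, _, Vt = np.linalg.svd(W, full_matrices=True)
V_dec = Vt[:k].T                 # directions contributing most to the logits
V_ins = Vt[k:].T                 # directions the classifier barely uses

z_dec = feats @ V_dec            # decisive-subspace coordinates (Near-OOD use)
z_ins = feats @ V_ins            # insignificant-subspace coordinates (Far-OOD use)

# One simple Far-OOD score: energy in the insignificant subspace.
score_far = np.linalg.norm(z_ins, axis=1)
print(score_far[:5])
```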
[240] Neural Network Acceleration on MPSoC board: Integrating SLAC’s SNL, Rogue Software and Auto-SNL
Hamza Ezzaoui Rahali, Abhilasha Dave, Larry Ruckman, Mohammad Mehdi Rahimifar, Audrey C. Therrien, James J. Russel, Ryan T. Herbst
Main category: cs.LG
TL;DR: SLAC developed SNL framework for real-time ML inference on FPGAs to handle high-speed X-ray data from LCLS-II FEL, with Auto-SNL tool that converts Python models to FPGA code, showing competitive latency and resource efficiency compared to hls4ml.
Details
Motivation: Address the challenge of managing 1 TB/s data streams from LCLS-II FEL X-ray experiments where conventional ML implementations introduce excessive latency for real-time processing.Method: Developed SLAC Neural Network Library (SNL) for FPGA deployment with dynamic weight updates, and Auto-SNL Python extension to convert neural network models to SNL-compatible high-level synthesis code.
Result: Benchmark comparison against hls4ml showed SNL achieves competitive or superior latency in most architectures with FPGA resource savings in some cases.
Conclusion: SNL demonstrates versatility for real-time ML inference, enabling new opportunities in high-energy physics, medical imaging, robotics and other fields requiring high-speed data processing.
Abstract: The LCLS-II Free Electron Laser (FEL) will generate X-ray pulses for beamline experiments at rates of up to 1 MHz, with detectors producing data throughputs exceeding 1 TB/s. Managing such massive data streams presents significant challenges, as transmission and storage infrastructures become prohibitively expensive. Machine learning (ML) offers a promising solution for real-time data reduction, but conventional implementations introduce excessive latency, making them unsuitable for high-speed experimental environments. To address these challenges, SLAC developed the SLAC Neural Network Library (SNL), a specialized framework designed to deploy real-time ML inference models on Field-Programmable Gate Arrays (FPGAs). SNL's key feature is the ability to dynamically update model weights without requiring FPGA resynthesis, enhancing flexibility for adaptive learning applications. To further enhance usability and accessibility, we introduce Auto-SNL, a Python extension that streamlines the process of converting Python-based neural network models into SNL-compatible high-level synthesis code. This paper presents a benchmark comparison against hls4ml, the current state-of-the-art tool, across multiple neural network architectures, fixed-point precisions, and synthesis configurations targeting a Xilinx ZCU102 FPGA. The results showed that SNL achieves competitive or superior latency in most tested architectures, while in some cases also offering FPGA resource savings. These results demonstrate SNL's versatility, opening new opportunities for researchers and academics in fields such as high-energy physics, medical imaging, robotics, and many more.
[241] Inferring Effects of Major Events through Discontinuity Forecasting of Population Anxiety
Siddharth Mangalik, Ojas Deshpande, Adithya V. Ganesan, Sean A. P. Clouston, H. Andrew Schwartz
Main category: cs.LG
TL;DR: Proposes adapting Longitudinal Regression Discontinuity Design (LRDD) into a statistical learning framework to forecast mental health discontinuities and slope changes from local events, showing improved performance over traditional methods.
Details
Motivation: Estimating community-specific mental health effects of local events is vital for public health policy, but traditional forecasting alone offers limited insights into causal impacts.Method: Adapting LRDDs into a statistical learning framework that estimates future discontinuities and slope changes using location’s score history, dynamic covariates, and exogenous variables.
Result: Best results from integrating exogenous and dynamic covariates (r=+.46 for discontinuity, r=+.65 for slope), showing strong improvement over traditional static community representations.
Conclusion: Discontinuity forecasting enables estimating idiosyncratic effects of potential future or hypothetical events on specific communities, with sophisticated models achieving better performance.
Abstract: Estimating community-specific mental health effects of local events is vital for public health policy. While forecasting mental health scores alone offers limited insights into the impact of events on community well-being, quasi-experimental designs like the Longitudinal Regression Discontinuity Design (LRDD) from econometrics help researchers derive effects that are more likely to be causal from observational data. LRDDs aim to extrapolate the size of changes in an outcome (e.g., a discontinuity in running scores for anxiety) due to a time-specific event. Here, we propose adapting LRDDs beyond traditional forecasting into a statistical learning framework whereby future discontinuities (i.e., time-specific shifts) and changes in slope (i.e., linear trajectories) are estimated given a location's history of the score, dynamic covariates (other running assessments), and exogenous variables (static representations). Applying our framework to predict discontinuities in the anxiety of US counties from COVID-19 events, we found the task was difficult but more achievable as the sophistication of models was increased, with the best results coming from integrating exogenous and dynamic covariates. Our approach shows strong improvement ($r=+.46$ for discontinuity and $r = +.65$ for slope) over traditional static community representations. Discontinuity forecasting raises new possibilities for estimating the idiosyncratic effects of potential future or hypothetical events on specific communities.
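The discontinuity-plus-slope-change parameterization that the framework learns to forecast can be illustrated with a toy segmented regression on synthetic data (the event time and coefficients below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100.0)
t0 = 60.0                                  # event time
post = (t >= t0).astype(float)
# True process: baseline trend, a jump of +2.0, and a slope change of +0.05.
y = (0.5 + 0.01 * t + 2.0 * post + 0.05 * post * (t - t0)
     + rng.normal(scale=0.3, size=t.size))

# Design matrix: intercept, trend, discontinuity, post-event slope change.
X = np.column_stack([np.ones_like(t), t, post, post * (t - t0)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"estimated discontinuity: {beta[2]:.2f}, slope change: {beta[3]:.3f}")
```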
[242] UniMLR: Modeling Implicit Class Significance for Multi-Label Ranking
V. Bugra Yesilkaynak, Emine Dari, Alican Mertan, Gozde Unal
Main category: cs.LG
TL;DR: UniMLR introduces a new multi-label ranking paradigm that leverages positive label ranking information rather than treating all positive labels equally, unifying ranking and classification tasks with improved performance.
Details
Motivation: Existing MLR frameworks only use binary positive/negative label information and ignore ranking relationships among positive labels, missing valuable information about class relevance and significance.Method: Proposes UniMLR framework that models implicit class relevance as probability distributions using positive label rankings. Introduces eight synthetic Ranked MNIST datasets to address dataset scarcity and annotation bias issues.
Result: Statistically demonstrates accurate learning of positive rank order consistent with ground truth and proportional to underlying significance values. Comprehensive experiments on real-world and synthetic datasets show framework value.
Conclusion: UniMLR successfully unifies ranking and classification in MLR by exploiting positive label ranking information, providing a more comprehensive approach that captures class significance and relevance through probability distribution modeling.
Abstract: Existing multi-label ranking (MLR) frameworks only exploit information deduced from the bipartition of labels into positive and negative sets. Therefore, they do not benefit from ranking among positive labels, which is the novel MLR approach we introduce in this paper. We propose UniMLR, a new MLR paradigm that models implicit class relevance/significance values as probability distributions using the ranking among positive labels, rather than treating them as equally important. This approach unifies ranking and classification tasks associated with MLR. Additionally, we address the challenges of scarcity and annotation bias in MLR datasets by introducing eight synthetic datasets (Ranked MNISTs) generated with varying significance-determining factors, providing an enriched and controllable experimental environment. We statistically demonstrate that our method accurately learns a representation of the positive rank order, which is consistent with the ground truth and proportional to the underlying significance values. Finally, we conduct comprehensive empirical experiments on both real-world and synthetic datasets, demonstrating the value of our proposed framework.
[243] Learning Unified Representations from Heterogeneous Data for Robust Heart Rate Modeling
Peng Yang, Zhengdong Huang, Zicheng Xie, Wentao Tian, Jingyu Liu, Lunhong Dong
Main category: cs.LG
TL;DR: Proposed framework for heart rate prediction that handles data heterogeneity through random feature dropout for source diversity and time-aware attention with contrastive learning for user differences, achieving 17-15% performance gains.
Details
Motivation: Real-world heart rate prediction faces data heterogeneity challenges from fragmented device markets (source heterogeneity) and individual physiological differences (user heterogeneity), which existing methods fail to address effectively.Method: A framework with random feature dropout strategy to handle source heterogeneity, time-aware attention module for long-term physiological traits, and contrastive learning objective for discriminative representations. Also created ParroTao benchmark dataset.
Result: Outperforms existing baselines by 17% on ParroTao dataset and 15% on FitRec dataset. Learned representations show strong discriminative power and practical value in downstream applications.
Conclusion: The proposed framework effectively addresses both source and user heterogeneity in heart rate prediction, demonstrating significant performance improvements and practical applicability for real-world health monitoring.
Abstract: Heart rate prediction is vital for personalized health monitoring and fitness, but it frequently faces a critical challenge when deployed in the real world: data heterogeneity. We classify it along two key dimensions: source heterogeneity from fragmented device markets with varying feature sets, and user heterogeneity reflecting distinct physiological patterns across individuals and activities. Existing methods either discard device-specific information or fail to model user-specific differences, limiting their real-world performance. To address this, we propose a framework that learns latent representations agnostic to both forms of heterogeneity, enabling downstream predictors to work consistently under heterogeneous data patterns. Specifically, we introduce a random feature dropout strategy to handle source heterogeneity, making the model robust to varying feature sets. To manage user heterogeneity, we employ a time-aware attention module to capture long-term physiological traits and use a contrastive learning objective to build a discriminative representation space. To reflect the heterogeneous nature of real-world data, we created and publicly released a new benchmark dataset, ParroTao. Evaluations on both ParroTao and the public FitRec dataset show that our model significantly outperforms existing baselines by 17% and 15%, respectively. Furthermore, analysis of the learned representations demonstrates their strong discriminative power, and a downstream application task confirms the practical value of our model.
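A minimal sketch of the random feature dropout idea, assuming tabular time-series sensor inputs; the shapes and maximum drop rate are placeholders, not the paper's configuration:

```python
import torch

def random_feature_dropout(x: torch.Tensor, max_drop: float = 0.5) -> torch.Tensor:
    """Zero out a random subset of feature columns so the encoder cannot
    rely on any fixed device-specific feature set."""
    n_feats = x.shape[-1]
    drop_rate = torch.rand(()) * max_drop            # sample a dropout rate
    mask = (torch.rand(n_feats) > drop_rate).float()
    return x * mask                                  # broadcast over batch/time

x = torch.randn(8, 120, 16)   # (batch, time steps, sensor features)
x_aug = random_feature_dropout(x)
print("fraction of zeroed entries:", (x_aug == 0).float().mean().item())
```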
[244] MoE-Health: A Mixture of Experts Framework for Robust Multimodal Healthcare Prediction
Xiaoyang Wang, Christopher C. Yang
Main category: cs.LG
TL;DR: MoE-Health is a Mixture of Experts framework for multimodal healthcare prediction that dynamically handles incomplete and varying data modalities through expert networks and gating mechanisms, achieving superior performance on clinical tasks.
Details
Motivation: Healthcare systems generate diverse multimodal data (EHR, clinical notes, images), but real-world samples often have incomplete modalities. Existing approaches require complete data or manual selection, limiting real-world applicability where data availability varies across patients and institutions.Method: The proposed MoE-Health framework uses specialized expert networks and a gating mechanism that dynamically selects and combines relevant experts based on available data modalities, enabling flexible adaptation to varying data availability scenarios.
Result: Evaluated on MIMIC-IV dataset across three clinical prediction tasks (mortality, length of stay, readmission). Achieved superior performance compared to existing multimodal fusion methods while maintaining robustness across different modality availability patterns.
Conclusion: MoE-Health effectively integrates multimodal information, offers improved predictive performance and robustness in handling heterogeneous and incomplete healthcare data, making it suitable for deployment in diverse healthcare environments with varying data availability.
Abstract: Healthcare systems generate diverse multimodal data, including Electronic Health Records (EHR), clinical notes, and medical images. Effectively leveraging this data for clinical prediction is challenging, particularly as real-world samples often present with varied or incomplete modalities. Existing approaches typically require complete modality data or rely on manual selection strategies, limiting their applicability in real-world clinical settings where data availability varies across patients and institutions. To address these limitations, we propose MoE-Health, a novel Mixture of Experts framework designed for robust multimodal fusion in healthcare prediction. The MoE-Health architecture is specifically developed to handle samples with differing modalities and improve performance on critical clinical tasks. By leveraging specialized expert networks and a dynamic gating mechanism, our approach dynamically selects and combines relevant experts based on available data modalities, enabling flexible adaptation to varying data availability scenarios. We evaluate MoE-Health on the MIMIC-IV dataset across three critical clinical prediction tasks: in-hospital mortality prediction, long length-of-stay prediction, and hospital readmission prediction. Experimental results demonstrate that MoE-Health achieves superior performance compared to existing multimodal fusion methods while maintaining robustness across different modality availability patterns. The framework effectively integrates multimodal information, offering improved predictive performance and robustness in handling heterogeneous and incomplete healthcare data, making it particularly suitable for deployment in diverse healthcare environments.
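A hedged sketch of what a modality-aware mixture-of-experts fusion layer can look like; the dimensions, the gate's input, and the masking scheme are assumptions, not the authors' architecture:

```python
import torch
import torch.nn as nn

class MoEFusion(nn.Module):
    def __init__(self, dims, hidden=64):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d, hidden) for d in dims)
        self.gate = nn.Linear(len(dims), len(dims))  # gate sees availability mask
        self.head = nn.Linear(hidden, 1)

    def forward(self, inputs, avail):
        # inputs: list of (batch, dim_i) tensors; avail: (batch, n_modalities)
        outs = torch.stack([e(x) for e, x in zip(self.experts, inputs)], dim=1)
        logits = self.gate(avail).masked_fill(avail == 0, float("-inf"))
        w = torch.softmax(logits, dim=1)             # weights over present modalities
        fused = (w.unsqueeze(-1) * outs).sum(dim=1)  # absent experts get zero weight
        return self.head(fused)

model = MoEFusion(dims=[32, 16, 8])
x = [torch.randn(4, d) for d in (32, 16, 8)]
avail = torch.tensor([[1., 1., 0.]] * 4)             # third modality missing
print(model(x, avail).shape)                         # torch.Size([4, 1])
```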
[245] QR-LoRA: QR-Based Low-Rank Adaptation for Efficient Fine-Tuning of Large Language Models
Jessica Liang, Anirudh Bharadwaj
Main category: cs.LG
TL;DR: QR-LoRA: A parameter-efficient fine-tuning method that uses QR decomposition with column pivoting to extract an orthonormal basis from pretrained weights, then trains only scalar coefficients for linear combinations, achieving strong performance with dramatically reduced parameters.
Details
Motivation: Standard LoRA and its SVD variants require learning both update factors directly or using expensive SVD operations that yield hard-to-interpret singular vectors. There's a need for more efficient and interpretable parameter-efficient fine-tuning methods.Method: Extract orthonormal basis from pretrained weight matrix using QR decomposition with column pivoting, then express LoRA update as linear combination of these basis vectors while training only the scalar coefficients.
Result: QR-LoRA matches or exceeds performance of full fine-tuning, standard LoRA, and SVD-LoRA on GLUE tasks with only 601 parameters - 1000x reduction vs full fine-tuning and 77x fewer than typical LoRA.
Conclusion: QR-LoRA provides an efficient, structured approach to parameter-efficient fine-tuning that drastically reduces parameter count while maintaining or improving performance, offering better interpretability through clear basis vector structure.
Abstract: The growing scale of Large Language Models (LLMs) has necessitated the development of parameter-efficient fine-tuning techniques. Low-Rank Adaptation (LoRA) has emerged as a promising approach, reducing the number of trainable parameters by applying low-rank updates to pretrained weights. While standard LoRA learns both update factors directly, several recent variants first initialize those matrices via an SVD of the pretrained weights – an operation that can be expensive on large models and yields singular vectors that are not always easy to interpret. In this work, we extract an orthonormal basis from the pretrained weight matrix using QR decomposition with column pivoting, and then express the LoRA update as a linear combination of these basis vectors – training only the scalar coefficients, which imposes clear structure on adaptation and drastically reduces parameter count. Experiments across GLUE tasks show that QR-LoRA matches or exceeds the performance of full fine-tuning, standard LoRA, and SVD-LoRA (LoRA with update matrices initialized via singular value decomposition) with as few as 601 parameters – a reduction of over 1000x compared to full fine-tuning and 77x fewer than typical LoRA setups.
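One plausible reading of the update, sketched below under stated assumptions: a column-pivoted QR of the pretrained weight supplies a frozen orthonormal basis, and only k scalar coefficients rescaling components along those directions are trained. The paper's exact parameterization of the update may differ:

```python
import numpy as np
import scipy.linalg
import torch

W0 = np.random.randn(256, 256).astype(np.float32)   # pretrained weight (toy)
Q, R, piv = scipy.linalg.qr(W0, pivoting=True)      # column-pivoted QR
k = 16
basis = torch.from_numpy(Q[:, :k].copy())           # orthonormal basis (frozen)

coeffs = torch.zeros(k, requires_grad=True)         # the only trained parameters

def adapted_weight(W0_t):
    # Rescale W0's components along the learned basis directions:
    # delta = sum_i coeffs[i] * q_i q_i^T W0.
    delta = basis @ torch.diag(coeffs) @ basis.T @ W0_t
    return W0_t + delta

W0_t = torch.from_numpy(W0)
out = adapted_weight(W0_t)
out.sum().backward()
print(coeffs.grad.shape)   # torch.Size([16]): k scalars, nothing else trained
```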
[246] Achieving Hilbert-Schmidt Independence Under Rényi Differential Privacy for Fair and Private Data Generation
Tobias Hyrup, Emmanouil Panagiotou, Arjun Roy, Arthur Zimek, Eirini Ntoutsi, Peter Schneider-Kamp
Main category: cs.LG
TL;DR: FLIP is a transformer-based VAE with latent diffusion that generates fair and private tabular data without requiring predefined downstream tasks, using RDP for privacy and CKA for fairness alignment.
Details
Motivation: Address growing privacy regulations (GDPR, HIPAA, AI Act) and fairness concerns in sensitive domains like healthcare by enabling risk-aware data sharing through synthetic data generation.Method: Transformer-based variational autoencoder with latent diffusion, Rényi differential privacy constraints during training, RDP-compatible balanced sampling, and Centered Kernel Alignment for latent space fairness.
Result: FLIP effectively provides significant fairness improvements for task-agnostic fairness and across diverse downstream tasks under differential privacy constraints.
Conclusion: FLIP offers a comprehensive solution for generating privacy-preserving and fair synthetic tabular data with broad applicability across sensitive domains.
Abstract: As privacy regulations such as the GDPR and HIPAA and responsibility frameworks for artificial intelligence such as the AI Act gain traction, the ethical and responsible use of real-world data faces increasing constraints. Synthetic data generation has emerged as a promising solution to risk-aware data sharing and model development, particularly for tabular datasets that are foundational to sensitive domains such as healthcare. To address both privacy and fairness concerns in this setting, we propose FLIP (Fair Latent Intervention under Privacy guarantees), a transformer-based variational autoencoder augmented with latent diffusion to generate heterogeneous tabular data. Unlike the typical setup in fairness-aware data generation, we assume a task-agnostic setup, not reliant on a fixed, defined downstream task, thus offering broader applicability. To ensure privacy, FLIP employs Rényi differential privacy (RDP) constraints during training and addresses fairness in the input space with RDP-compatible balanced sampling that accounts for group-specific noise levels across multiple sampling rates. In the latent space, we promote fairness by aligning neuron activation patterns across protected groups using Centered Kernel Alignment (CKA), a similarity measure extending the Hilbert-Schmidt Independence Criterion (HSIC). This alignment encourages statistical independence between latent representations and the protected feature. Empirical results demonstrate that FLIP effectively provides significant fairness improvements for task-agnostic fairness and across diverse downstream tasks under differential privacy constraints.
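The latent-space alignment relies on linear CKA, which has a compact standard formula; the sketch below computes it between two representation matrices over the same number of samples (how FLIP pairs group representations is not shown here):

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """X: (n, d1), Y: (n, d2) -- same n samples, possibly different features."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2          # unnormalized HSIC
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

rng = np.random.default_rng(0)
Z = rng.normal(size=(128, 32))
print(linear_cka(Z, Z))                                # 1.0: identical reps
print(linear_cka(Z, rng.normal(size=(128, 32))))       # near 0: independent reps
```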
[247] Alice’s Adventures in a Differentiable Wonderland – Volume I, A Tour of the Land
Simone Scardapane
Main category: cs.LG
TL;DR: An introductory primer on differentiable programming and neural network architectures, focusing on optimization via automatic differentiation and common design patterns for handling sequences, graphs, texts, and audio.
Details
Motivation: To provide an accessible introduction to differentiable programming for beginners, helping them understand how neural networks work as compositions of differentiable primitives and bridge the gap between theory and practical implementation.Method: Covers optimization through automatic differentiation and explains common neural network architectures including convolutional, attentional, and recurrent blocks. Uses PyTorch and JAX code examples to demonstrate practical implementation.
Result: Readers gain understanding of fundamental differentiable programming concepts and become capable of comprehending advanced models like large language models (LLMs) and multimodal architectures.
Conclusion: This primer successfully introduces beginners to the field of differentiable programming, providing both theoretical foundations and practical coding skills needed to work with modern neural network architectures.
Abstract: Neural networks surround us, in the form of large language models, speech transcription systems, molecular discovery algorithms, robotics, and much more. Stripped of anything else, neural networks are compositions of differentiable primitives, and studying them means learning how to program and how to interact with these models, a particular example of what is called differentiable programming. This primer is an introduction to this fascinating field imagined for someone, like Alice, who has just ventured into this strange differentiable wonderland. I overview the basics of optimizing a function via automatic differentiation, and a selection of the most common designs for handling sequences, graphs, texts, and audio. The focus is on an intuitive, self-contained introduction to the most important design techniques, including convolutional, attentional, and recurrent blocks, hoping to bridge the gap between theory and code (PyTorch and JAX) and leaving the reader capable of understanding some of the most advanced models out there, such as large language models (LLMs) and multimodal architectures.
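The primer's starting point, optimizing a function via automatic differentiation, fits in a few lines of PyTorch (a toy objective, not an example taken from the book):

```python
import torch

x = torch.tensor([3.0, -2.0], requires_grad=True)

def f(x):
    # Any composition of differentiable primitives works.
    return (x ** 2).sum() + torch.sin(x).sum()

opt = torch.optim.SGD([x], lr=0.1)
for step in range(100):
    opt.zero_grad()
    loss = f(x)
    loss.backward()      # autodiff computes df/dx through the whole graph
    opt.step()
print(x.detach(), f(x).item())
```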
[248] Mamba State-Space Models Are Lyapunov-Stable Learners
John T. Halloran, Manbir Gulati, Paul F. Roysdon
Main category: cs.LG
TL;DR: Mamba state-space models show exceptional stability during mixed-precision and parameter-efficient fine-tuning compared to Transformers, with theoretical guarantees from dynamical systems theory.
Details
Motivation: To investigate the sensitivity of Mamba's recurrent dynamics under common fine-tuning methods (MPFT and PEFT), which remains unexplored despite widespread adoption.Method: Empirical evaluation of Mamba LLMs' stability under combinations of mixed-precision fine-tuning and parameter-efficient fine-tuning, contrasted with Transformer LLMs, supported by theoretical analysis using dynamical systems theory and Lyapunov stability.
Result: Mamba LLMs are extremely stable to changes from MPFT and PEFT combinations, unlike Transformer LLMs which may drastically diverge from full-precision counterparts. The robustness is theoretically guaranteed by stable recurrent dynamics.
Conclusion: Mamba’s recurrent dynamics provide inherent stability advantages over Transformers for fine-tuning, enabling novel study of in-context learning abilities through MPFT and PEFT methods.
Abstract: Mamba state-space models (SSMs) have recently outperformed state-of-the-art (SOTA) Transformer large language models (LLMs) in various tasks and been widely adopted. However, a major concern for stable learning in recurrent-based deep models (such as SSMs) is the sensitivity of their recurrent dynamics. Despite widespread adoption, the sensitivity of Mamba's recurrent dynamics under common fine-tuning methods, e.g., mixed-precision fine-tuning (MPFT) and parameter-efficient fine-tuning (PEFT), remains unexplored. Empirically, we show that Mamba LLMs are extremely stable to changes introduced by combinations of MPFT and PEFT, in stark contrast to Transformer LLMs, which we demonstrate may drastically diverge from their respective full-precision counterparts under different combinations of MPFT and PEFT (despite the near-ubiquitous adoption of these fine-tuning frameworks for attention-based models). The demonstrated robustness of Mamba LLMs is due to their recurrent dynamics, which we prove are guaranteed to be stable using dynamical systems theory (in particular, Lyapunov stability). We conclude by using MPFT and PEFT to study, for the first time, Mamba LLMs' in-context learning (ICL) abilities on natural language tasks, thus supplementing other recent work.
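To convey the flavor of the stability argument (an illustration only, not the paper's proof): a linear recurrence h_{t+1} = A h_t is stable when the spectral radius of A stays below 1, so parameter perturbations that preserve that property cannot make the hidden state diverge:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))
A = 0.9 * A / np.max(np.abs(np.linalg.eigvals(A)))   # rescale: spectral radius 0.9

h = rng.normal(size=8)
for _ in range(1000):
    h = A @ h                                        # iterate the recurrence
print(np.max(np.abs(np.linalg.eigvals(A))),          # ~0.9 (< 1: stable)
      np.linalg.norm(h))                             # state has decayed to ~0
```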
[249] Categorical Data Clustering via Value Order Estimated Distance Metric Learning
Yiqun Zhang, Mingjie Zhao, Hong Jia, Yang Lu, Mengke Li, Yiu-ming Cheung
Main category: cs.LG
TL;DR: A novel order distance metric learning approach for categorical data clustering that learns optimal order relationships between categorical values to enable intuitive distance measurement similar to numerical data.
Details
Motivation: Categorical data lacks a well-defined metric space like Euclidean distance, making clustering difficult and potentially twisting valuable information due to under-represented distribution patterns.Method: A joint learning paradigm that alternately performs clustering and order distance metric learning, learning optimal order relationships between categorical attribute values and quantifying their distance in a linear space.
Result: Superior clustering accuracy on categorical and mixed datasets, with reduced time complexity, guaranteed convergence, and greatly improved interpretability of categorical data.
Conclusion: The proposed method effectively addresses the challenge of clustering categorical data by learning intuitive order distance metrics, making categorical data more understandable and manageable while achieving high clustering performance.
Abstract: Clustering is a popular machine learning technique for data mining that can process and analyze datasets to automatically reveal sample distribution patterns. Since ubiquitous categorical data naturally lack a well-defined metric space such as the Euclidean distance space of numerical data, the distribution of categorical data is usually under-represented, and thus valuable information can easily be twisted in clustering. This paper therefore introduces a novel order distance metric learning approach that intuitively represents categorical attribute values by learning their optimal order relationship and quantifying their distances along a line, analogous to numerical attributes. Since subjectively created qualitative categorical values involve ambiguity and fuzziness, the order distance metric is learned in the context of clustering. Accordingly, a new joint learning paradigm is developed to alternately perform clustering and order distance metric learning with low time complexity and a guarantee of convergence. Due to the clustering-friendly order learning mechanism and the homogeneous ordinal nature of the order distance and Euclidean distance, the proposed method achieves superior clustering accuracy on categorical and mixed datasets. More importantly, the learned order distance metric greatly reduces the difficulty of understanding and managing non-intuitive categorical data. Experiments with ablation studies, significance tests, and case studies have validated the efficacy of the proposed method. The source code is available at https://github.com/DAJ0612/OCL_Source_Code.
[250] ROSE: A Reward-Oriented Data Selection Framework for LLM Task-Specific Instruction Tuning
Yang Wu, Huayi Zhang, Yizheng Jiao, Lin Ma, Xiaozhong Liu, Jinhong Yu, Dongyu Zhang, Dezhi Yu, Wei Xu
Main category: cs.LG
TL;DR: ROSE is a reward-oriented data selection method that uses pairwise preference loss instead of traditional similarity metrics to select optimal training data for task-specific instruction tuning of LLMs, achieving competitive results with only 5% of training data.
Details
Motivation: Current data selection methods for instruction tuning rely on similarity metrics that don't correlate well with actual task performance, as instruction tuning loss often fails to show a monotonic relationship with task outcomes.Method: ROSE leverages pairwise preference loss as a reward signal and adapts an influence formulation to approximate the influence of training data points relative to a few-shot preference validation set to select the most task-related data.
Result: By selecting just 5% of training data, ROSE achieves competitive results compared to full dataset fine-tuning and outperforms other state-of-the-art data selection methods across multiple benchmarks and model architectures.
Conclusion: ROSE provides an effective solution to the data selection problem for task-specific instruction tuning by using preference-based reward signals, demonstrating robust generalizability and superior performance over traditional similarity-based approaches.
Abstract: Instruction tuning has underscored the significant potential of large language models (LLMs) in producing more human-controllable and effective outputs in various domains. In this work, we focus on the data selection problem for task-specific instruction tuning of LLMs. Prevailing methods primarily rely on crafted similarity metrics to select training data that aligns with the test data distribution. The goal is to minimize instruction tuning loss on the test data, ultimately improving performance on the target task. However, it has been widely observed that instruction tuning loss (i.e., cross-entropy loss for next token prediction) in LLMs often fails to exhibit a monotonic relationship with actual task performance. This misalignment undermines the effectiveness of current data selection methods for task-specific instruction tuning. To address this issue, we introduce ROSE, a novel Reward-Oriented inStruction data sElection method which leverages pairwise preference loss as a reward signal to optimize data selection for task-specific instruction tuning. Specifically, ROSE adapts an influence formulation to approximate the influence of training data points relative to a few-shot preference validation set to select the most task-related training data points. Experimental results show that by selecting just 5% of the training data using ROSE, our approach can achieve competitive results compared to fine-tuning with the full training dataset, and it surpasses other state-of-the-art data selection methods for task-specific instruction tuning. Our qualitative analysis further confirms the robust generalizability of our method across multiple benchmark datasets and diverse model architectures.
[251] Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models
Neel Jain, Aditya Shrivastava, Chenyang Zhu, Daben Liu, Alfy Samuel, Ashwinee Panda, Anoop Kumar, Micah Goldblum, Tom Goldstein
Main category: cs.LG
TL;DR: Proposes refusal tokens to control language model refusal behavior without retraining, enabling adjustable sensitivity to different query categories.
Details
Motivation: Current methods require training multiple models with different refusal rates, which is computationally expensive and inflexible for accommodating varying user preferences.Method: Introduces refusal tokens (one per category or a single token) prepended during training, then controls refusal probability during inference to steer model behavior without fine-tuning.
Result: Enables control over refusal rates for different query categories through selective intervention during generation rather than model retraining.
Conclusion: Refusal tokens provide an efficient and flexible approach to customize language model refusal behavior, reducing computational costs while maintaining user-specific preferences.
Abstract: A key component of building safe and reliable language models is enabling the models to appropriately refuse to follow certain instructions or answer certain questions. We may want models to output refusal messages for various categories of user queries, for example, ill-posed questions, instructions for committing illegal acts, or queries which require information past the model's knowledge horizon. Engineering models that refuse to answer such questions is complicated by the fact that an individual may want their model to exhibit varying levels of sensitivity for refusing queries of various categories, and different users may want different refusal rates. The current default approach involves training multiple models with varying proportions of refusal messages from each category to achieve the desired refusal rates, which is computationally expensive and may require training a new model to accommodate each user's desired preference over refusal rates. To address these challenges, we propose refusal tokens, one such token for each refusal category or a single refusal token, which are prepended to the model's responses during training. We then show how to increase or decrease the probability of generating the refusal token for each category during inference to steer the model's refusal behavior. Refusal tokens enable controlling a single model's refusal rates without the need for any further fine-tuning, simply by selectively intervening during generation.
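Inference-time steering then amounts to shifting the refusal token's logit before sampling; in the sketch below the token id, vocabulary size, and bias value are all hypothetical:

```python
import torch

REFUSE_ID = 7    # hypothetical id of the category's refusal token
vocab = 16

logits = torch.randn(vocab)          # next-token logits from the model
bias = 2.0                           # >0 refuses more often, <0 less often
logits_steered = logits.clone()
logits_steered[REFUSE_ID] += bias    # intervene only on the refusal token

p_before = torch.softmax(logits, dim=-1)[REFUSE_ID]
p_after = torch.softmax(logits_steered, dim=-1)[REFUSE_ID]
print(f"P(refusal token): {p_before:.3f} -> {p_after:.3f}")
```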
[252] Don’t lie to your friends: Learning what you know from collaborative self-play
Jacob Eisenstein, Reza Aghajani, Adam Fisch, Dheeru Dua, Fantine Huot, Mirella Lapata, Vicky Zayats, Jonathan Berant
Main category: cs.LG
TL;DR: Collaborative self-play teaches AI agents meta-knowledge about their capabilities through multi-agent group rewards, enabling better tool use and selective prediction when deployed individually.
Details
Motivation: AI assistants need awareness of their own capabilities and limitations - knowing when to use tools vs parametric knowledge, when to trust tool outputs, and when to abstain. This is difficult to teach through supervised fine-tuning alone.Method: Proposes collaborative self-play where multi-agent groups are rewarded for collectively arriving at correct answers. Agents with heterogeneous tools (corpus-specific retrieval) collaborate to maximize success while minimizing effort.
Result: Group-level rewards induce policies that transfer to improve tool use and selective prediction in individual agent deployment settings.
Conclusion: Multi-agent collaborative self-play effectively teaches meta-knowledge about capabilities, enabling better individual agent performance through emergent understanding of when and how to use available tools.
Abstract: To be helpful assistants, AI agents must be aware of their own capabilities and limitations. This includes knowing when to answer from parametric knowledge versus using tools, when to trust tool outputs, and when to abstain or hedge. Such capabilities are hard to teach through supervised fine-tuning because they require constructing examples that reflect the agent's specific capabilities. We therefore propose a radically new approach to teaching agents what they know: collaborative self-play. We construct multi-agent collaborations in which the group is rewarded for collectively arriving at correct answers. The desired meta-knowledge emerges from the incentives built into the structure of the interaction. We focus on small societies of agents that have access to heterogeneous tools (corpus-specific retrieval), and therefore must collaborate to maximize their success while minimizing their effort. Experiments show that group-level rewards for multi-agent communities can induce policies that transfer to improve tool use and selective prediction in settings where individual agents are deployed in isolation.
[253] FROG: Fair Removal on Graphs
Ziheng Chen, Jiali Cheng, Hadi Amiri, Kaushiki Nag, Lu Lin, Xiangguo Sun, Gabriele Tolomei
Main category: cs.LG
TL;DR: A novel framework for fair graph unlearning that jointly optimizes graph structure and model to remove data while preserving fairness through targeted edge rewiring and augmentation.
Details
Motivation: Address the oversight of fairness impacts in existing graph unlearning methods, which often modify nodes/edges indiscriminately and can exacerbate group disparities when forgetting links between different demographic groups.Method: Proposes a framework that rewires graphs by removing redundant edges that hinder forgetting while preserving fairness through targeted edge augmentation. Introduces worst-case evaluation for robustness assessment.
Result: Experiments on real-world datasets demonstrate the approach achieves more effective and fair unlearning compared to existing baselines.
Conclusion: The proposed method successfully addresses fairness concerns in graph unlearning by jointly optimizing graph structure and model parameters, providing a robust solution for privacy-compliant data removal while maintaining equitable outcomes.
Abstract: With growing emphasis on privacy regulations, machine unlearning has become increasingly critical in real-world applications such as social networks and recommender systems, many of which are naturally represented as graphs. However, existing graph unlearning methods often modify nodes or edges indiscriminately, overlooking their impact on fairness. For instance, forgetting links between users of different genders may inadvertently exacerbate group disparities. To address this issue, we propose a novel framework that jointly optimizes both the graph structure and the model to achieve fair unlearning. Our method rewires the graph by removing redundant edges that hinder forgetting while preserving fairness through targeted edge augmentation. We further introduce a worst-case evaluation mechanism to assess robustness under challenging scenarios. Experiments on real-world datasets show that our approach achieves more effective and fair unlearning than existing baselines.
[254] Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction
Vaishnavh Nagarajan, Chen Henry Wu, Charles Ding, Aditi Raghunathan
Main category: cs.LG
TL;DR: The paper introduces minimal algorithmic tasks to test language models’ creative limits, showing that next-token learning is myopic while multi-token approaches like teacherless training and diffusion models excel at producing diverse and original outputs. It also proposes seed-conditioning as an effective alternative to temperature sampling.
Details
Motivation: To create a clean, controllable test-bed for quantifying the creative limits of current language models, particularly for open-ended real-world tasks that require creative, far-sighted thinking and stochastic planning.Method: Designed minimal algorithmic tasks that abstract real-world creative tasks, requiring either discovering new connections in knowledge graphs or constructing new patterns. Compared next-token learning with multi-token approaches (teacherless training, diffusion models) and tested noise injection at input layer (seed-conditioning) vs. temperature sampling.
Result: Next-token learning proved myopic for creative tasks, while multi-token approaches produced more diverse and original outputs. Seed-conditioning worked as well as or better than temperature sampling for eliciting randomness without compromising coherence.
Conclusion: The work provides a principled test-bed for analyzing creative skills and offers arguments for moving beyond next-token learning and temperature sampling in language models for open-ended creative tasks.
Abstract: We design a suite of minimal algorithmic tasks that are a loose abstraction of open-ended real-world tasks. This allows us to cleanly and controllably quantify the creative limits of present-day language models. Much like real-world tasks that require a creative, far-sighted leap of thought, our tasks require an implicit, open-ended stochastic planning step that either (a) discovers new connections in an abstract knowledge graph (like in wordplay, drawing analogies, or research) or (b) constructs new patterns (like in designing math problems or new proteins). In these tasks, we empirically and conceptually argue how next-token learning is myopic; multi-token approaches, namely teacherless training and diffusion models, comparatively excel in producing diverse and original output. Secondly, to elicit randomness without hurting coherence, we find that injecting noise at the input layer (dubbed seed-conditioning) works, surprisingly, as well as (and in some conditions, better than) temperature sampling from the output layer. Thus, our work offers a principled, minimal test-bed for analyzing open-ended creative skills, and offers new arguments for going beyond next-token learning and temperature sampling. We make part of the code available at https://github.com/chenwu98/algorithmic-creativity
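The two randomness sources being compared can be contrasted on a toy model: temperature rescales the output logits before sampling, while seed-conditioning perturbs the input and decodes greedily (the model, shapes, and noise scale are placeholders):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, 100))
x = torch.randn(1, 16)               # embedding of the prompt (toy)

# (a) Temperature sampling: randomness injected at the output layer.
probs = torch.softmax(model(x) / 1.2, dim=-1)
sample_temp = torch.multinomial(probs[0], num_samples=1)

# (b) Seed-conditioning: random noise at the input, deterministic decoding.
noise = 0.5 * torch.randn_like(x)    # the "seed"
sample_seed = model(x + noise).argmax(dim=-1)

print(sample_temp.item(), sample_seed.item())
```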
[255] Decentralized Domain Generalization with Style Sharing: Formal Model and Convergence Analysis
Shahryar Zehtabi, Dong-Jun Han, Seyyedali Hosseinalipour, Christopher G. Brinton
Main category: cs.LG
TL;DR: StyleDDG is a decentralized federated learning algorithm for domain generalization that enables peer-to-peer style sharing between devices to handle distribution shifts, with formal convergence guarantees and improved accuracy across unseen domains.
Details
Motivation: Address two gaps in FL and DG research: lack of formal mathematical analysis of DG objectives, and limitation of DG approaches to star-topology architectures in federated learning.Method: Developed StyleDDG - a decentralized DG algorithm where devices in peer-to-peer networks share style information inferred from their datasets, with systematic analysis framework for style-based DG training.
Result: Achieved significant improvements in accuracy across target domains with minimal communication overhead compared to baseline decentralized gradient methods, as demonstrated on popular DG datasets.
Conclusion: StyleDDG successfully bridges the gaps by providing both formal convergence guarantees and practical decentralized implementation for domain generalization in federated learning settings.
Abstract: Much of federated learning (FL) focuses on settings where local dataset statistics remain the same between training and testing. However, this assumption often does not hold in practice due to distribution shifts, motivating the development of domain generalization (DG) approaches that leverage source domain data to train models capable of generalizing to unseen target domains. In this paper, we are motivated by two major gaps in existing work on FL and DG: (1) the lack of formal mathematical analysis of DG objectives; and (2) DG research in FL being limited to the star-topology architecture. We develop Decentralized Federated Domain Generalization with Style Sharing (StyleDDG), a decentralized DG algorithm which allows devices in a peer-to-peer network to achieve DG based on sharing style information inferred from their datasets. Additionally, we provide the first systematic approach to analyzing style-based DG training in decentralized networks. We cast existing centralized DG algorithms within our framework, and employ their formalisms to model StyleDDG. We then obtain analytical conditions under which convergence of StyleDDG can be guaranteed. Through experiments on popular DG datasets, we demonstrate that StyleDDG can obtain significant improvements in accuracy across target domains with minimal communication overhead compared to baseline decentralized gradient methods.
[256] WebInject: Prompt Injection Attack to Web Agents
Xilong Wang, John Bloch, Zedian Shao, Yuepeng Hu, Shuyan Zhou, Neil Zhenqiang Gong
Main category: cs.LG
TL;DR: WebInject is a prompt injection attack that adds pixel perturbations to webpage screenshots to manipulate MLLM-based web agents into performing attacker-specified actions, using neural network approximation and gradient descent optimization.
Details
Motivation: MLLM-based web agents interact with webpages through screenshots, making them vulnerable to visual manipulation attacks that can induce malicious actions by perturbing the rendered webpage pixels.Method: Formulate perturbation finding as optimization problem, train neural network to approximate non-differentiable screenshot mapping, apply projected gradient descent to solve the reformulated optimization.
Result: Extensive evaluation shows WebInject is highly effective and significantly outperforms baseline methods across multiple datasets.
Conclusion: WebInject demonstrates successful prompt injection attacks against MLLM web agents through pixel-level perturbations, highlighting security vulnerabilities in visual-based web interaction systems.
Abstract: Multi-modal large language model (MLLM)-based web agents interact with webpage environments by generating actions based on screenshots of the webpages. In this work, we propose WebInject, a prompt injection attack that manipulates the webpage environment to induce a web agent to perform an attacker-specified action. Our attack adds a perturbation to the raw pixel values of the rendered webpage. After these perturbed pixels are mapped into a screenshot, the perturbation induces the web agent to perform the attacker-specified action. We formulate the task of finding the perturbation as an optimization problem. A key challenge in solving this problem is that the mapping between raw pixel values and screenshot is non-differentiable, making it difficult to backpropagate gradients to the perturbation. To overcome this, we train a neural network to approximate the mapping and apply projected gradient descent to solve the reformulated optimization problem. Extensive evaluation on multiple datasets shows that WebInject is highly effective and significantly outperforms baselines.
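A skeleton of the attack's optimization loop with placeholder networks: a differentiable surrogate stands in for the non-differentiable pixel-to-screenshot mapping, and projected gradient descent searches for a perturbation inside an L-infinity ball. The sizes, budgets, and both networks are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

render = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 3 * 32 * 32))  # surrogate
agent = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))            # action head

page = torch.rand(1, 3, 32, 32)       # raw rendered webpage pixels (toy size)
target_action = torch.tensor([4])     # attacker-specified action id
eps, alpha = 8 / 255, 1 / 255         # perturbation budget and step size

delta = torch.zeros_like(page, requires_grad=True)
for _ in range(40):
    screenshot = render((page + delta).clamp(0, 1))      # differentiable proxy
    loss = F.cross_entropy(agent(screenshot), target_action)
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()               # step toward target action
        delta.clamp_(-eps, eps)                          # project into the ball
    delta.grad.zero_()
print(agent(render(page + delta)).argmax(dim=-1))        # agent's chosen action
```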
[257] Label Embedding via Low-Coherence Matrices
Jianxin Zhang, Clayton Scott
Main category: cs.LG
TL;DR: Theoretical analysis of label embedding for extreme multiclass classification, showing trade-off between computational and statistical efficiency via embedding matrix coherence, with vanishing statistical penalty under low coherence.
Details
Motivation: Label embedding has shown success in extreme classification and zero-shot learning with computational and statistical advantages, but lacks theoretical foundations.Method: Presents excess risk bound analysis quantifying trade-off via embedding matrix coherence, and shows statistical penalty vanishes under Massart noise condition with low coherence.
Result: Theoretical framework reveals coherence-dependent trade-off, supporting a simple, scalable, and parallelizable algorithm effective in large-scale applications.
Conclusion: Label embedding provides both computational and statistical benefits in extreme classification when embedding matrix has low coherence, with theoretical guarantees under noise conditions.
Abstract: Label embedding is a framework for multiclass classification problems where each label is represented by a distinct vector of some fixed dimension, and training involves matching model output to the vector representing the correct label. While label embedding has been successfully applied in extreme classification and zero-shot learning, and offers both computational and statistical advantages, its theoretical foundations remain poorly understood. This work presents an analysis of label embedding in the context of extreme multiclass classification, where the number of classes $C$ is very large. We present an excess risk bound that reveals a trade-off between computational and statistical efficiency, quantified via the coherence of the embedding matrix. We further show that under the Massart noise condition, the statistical penalty for label embedding vanishes with sufficiently low coherence. Our analysis supports an algorithm that is simple, scalable, and easily parallelizable, and experimental results demonstrate its effectiveness in large-scale applications.
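The quantity the trade-off is stated in terms of, the mutual coherence of the embedding matrix, is simple to compute: the largest absolute cosine between two distinct label vectors:

```python
import numpy as np

def coherence(E: np.ndarray) -> float:
    """E: (C, d) matrix whose rows embed the C labels."""
    U = E / np.linalg.norm(E, axis=1, keepdims=True)
    G = np.abs(U @ U.T)                 # pairwise |cosines|
    np.fill_diagonal(G, 0.0)            # exclude self-similarity
    return G.max()

rng = np.random.default_rng(0)
print(coherence(np.eye(10)))                    # 0.0: one-hot labels, orthogonal
print(coherence(rng.normal(size=(1000, 64))))   # random low-dim: small but > 0
```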
[258] BASE-Q: Bias and Asymmetric Scaling Enhanced Rotational Quantization for Large Language Models
Liulu He, Shenli Zheng, Karwei Sun, Yijiang Liu, Yufei Zhao, Chongkang Tan, Huanrui Yang, Yuan Du, Li Du
Main category: cs.LG
TL;DR: BASE-Q is a new quantization method that combines bias correction and asymmetric scaling to reduce rounding and clipping errors in LLM quantization, enabling blockwise optimization without memory-intensive full-model backpropagation.
Details
Motivation: Current rotational quantization methods have limitations: they fail to align channel means (causing wider quantization bounds and increased rounding errors) and make activation distributions more Gaussian-like (increasing clipping error energy loss). Rotation optimization also requires full-model loading for backpropagation, causing high memory consumption.Method: BASE-Q introduces a simple yet powerful approach combining bias correction and asymmetric scaling to effectively reduce both rounding and clipping errors. It enables blockwise optimization, eliminating the need for memory-intensive full-model backpropagation.
Result: Extensive experiments on various LLMs and benchmarks show BASE-Q narrows the accuracy gap to full-precision models by 50.5%, 42.9%, and 29.2% compared to QuaRot, SpinQuant, and OSTQuant respectively.
Conclusion: BASE-Q effectively addresses fundamental limitations of current rotational quantization methods by reducing both rounding and clipping errors while enabling more memory-efficient blockwise optimization, significantly improving quantization performance for large language models.
Abstract: Rotations have become essential to state-of-the-art quantization pipelines for large language models (LLMs) by effectively smoothing outliers in weights and activations. However, further optimizing the rotation parameters offers only limited performance gains and introduces significant training overhead: due to rotation parameter sharing, the full model must be loaded simultaneously to enable backpropagation, resulting in substantial memory consumption and limited practical utility. In this work, we identify two fundamental limitations of current rotational quantization methods: (i) rotation fails to align channel means, resulting in wider quantization bounds and increased rounding errors; and (ii) rotation makes the activation distribution more Gaussian-like, increasing energy loss caused by clipping errors. To address these issues, we introduce BASE-Q, a simple yet powerful approach that combines bias correction and asymmetric scaling to effectively reduce rounding and clipping errors. Furthermore, BASE-Q enables blockwise optimization, eliminating the need for memory-intensive full-model backpropagation. Extensive experiments on various LLMs and benchmarks demonstrate the effectiveness of BASE-Q, narrowing the accuracy gap to full-precision models by 50.5%, 42.9%, and 29.2% compared to QuaRot, SpinQuant, and OSTQuant, respectively. The code will be released soon.
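A hedged sketch of the two named ingredients on a single activation tensor: per-channel bias correction (subtracting the channel mean tightens the quantization bounds) followed by asymmetric min/max quantization. The paper's actual placement of these steps in the pipeline may differ:

```python
import torch

def base_q_sketch(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    mu = x.mean(dim=0, keepdim=True)          # per-channel bias correction term
    xc = x - mu                               # centered channels: tighter bounds
    lo = xc.min(dim=0, keepdim=True).values
    hi = xc.max(dim=0, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / (2 ** bits - 1)
    zero = (-lo / scale).round()              # asymmetric zero-point
    q = ((xc / scale) + zero).round().clamp(0, 2 ** bits - 1)
    return (q - zero) * scale + mu            # dequantize, re-add the bias

x = torch.randn(1024, 8) * torch.tensor([1, 2, 4, 8, 1, 2, 4, 8.0]) + 3.0
err = (base_q_sketch(x) - x).pow(2).mean()
print(f"4-bit reconstruction MSE: {err:.5f}")
```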
[259] Finite-Time Analysis of Three-Timescale Constrained Actor-Critic and Constrained Natural Actor-Critic Algorithms
Prashansa Panda, Shalabh Bhatnagar
Main category: cs.LG
TL;DR: Non-asymptotic analysis of constrained actor critic and natural actor critic methods for CMDPs with inequality constraints, achieving O(ε^{-2.5}) sample complexity for finding stationary points.
Details
Motivation: Actor critic methods are widely used in RL but lack non-asymptotic analysis for constrained MDPs with inequality constraints in non-i.i.d settings.Method: Uses Lagrange multiplier method to handle inequality constraints, analyzes both constrained actor critic (C-AC) and constrained natural actor critic (C-NAC) algorithms with function approximation.
Result: Proves both algorithms find first-order stationary points (‖∇L(θ,γ)‖₂² ≤ ε) with O(ε^{-2.5}) sample complexity, validated on Safety-Gym environments.
Conclusion: Provides theoretical guarantees for constrained actor critic methods with efficient sample complexity, enabling practical constrained reinforcement learning applications.
Abstract: Actor Critic methods have found immense applications on a wide range of Reinforcement Learning tasks especially when the state-action space is large. In this paper, we consider actor critic and natural actor critic algorithms with function approximation for constrained Markov decision processes (C-MDP) involving inequality constraints and carry out a non-asymptotic analysis for both of these algorithms in a non-i.i.d (Markovian) setting. We consider the long-run average cost criterion where both the objective and the constraint functions are suitable policy-dependent long-run averages of certain prescribed cost functions. We handle the inequality constraints using the Lagrange multiplier method. We prove that these algorithms are guaranteed to find a first-order stationary point (i.e., $\Vert \nabla L(\theta,\gamma)\Vert_2^2 \leq \epsilon$) of the performance (Lagrange) function $L(\theta,\gamma)$, with a sample complexity of $\mathcal{\tilde{O}}(\epsilon^{-2.5})$ in the case of both Constrained Actor Critic (C-AC) and Constrained Natural Actor Critic (C-NAC) algorithms. We also show the results of experiments on three different Safety-Gym environments.
[260] SPIN-ODE: Stiff Physics-Informed Neural ODE for Chemical Reaction Rate Estimation
Wenqing Peng, Zhi-Song Liu, Michael Boy
Main category: cs.LG
TL;DR: SPIN-ODE framework addresses stiffness in chemical reaction systems using a three-stage optimization process with neural ODE and CRNN for effective rate coefficient estimation.
Details
Motivation: Stiffness in real-world atmospheric chemistry systems causes training instability and poor convergence, hindering learning-based approaches for rate coefficient estimation.Method: Three-stage optimization: 1) black-box neural ODE fits concentration trajectories, 2) CRNN pre-trains to map concentrations to time derivatives, 3) fine-tunes rate coefficients with pre-trained CRNN.
Result: Extensive experiments on synthetic and real-world datasets validate the effectiveness and robustness of the approach for chemical rate coefficient discovery.
Conclusion: First work addressing stiff neural ODE for chemical rate coefficient discovery, opening promising directions for neural network integration with detailed chemistry.
Abstract: Estimating rate coefficients from complex chemical reactions is essential for advancing detailed chemistry. However, the stiffness inherent in real-world atmospheric chemistry systems poses severe challenges, leading to training instability and poor convergence, which hinder effective rate coefficient estimation using learning-based approaches. To address this, we propose a Stiff Physics-Informed Neural ODE framework (SPIN-ODE) for chemical reaction modelling. Our method introduces a three-stage optimisation process: first, a black-box neural ODE is trained to fit concentration trajectories; second, a Chemical Reaction Neural Network (CRNN) is pre-trained to learn the mapping between concentrations and their time derivatives; and third, the rate coefficients are fine-tuned by integrating with the pre-trained CRNN. Extensive experiments on both synthetic and newly proposed real-world datasets validate the effectiveness and robustness of our approach. As the first work addressing stiff neural ODEs for chemical rate coefficient discovery, our study opens promising directions for integrating neural networks with detailed chemistry.
[261] Survey of Privacy Threats and Countermeasures in Federated Learning
Masahiro Hayashitani, Junki Mori, Isamu Teranishi
Main category: cs.LG
TL;DR: Privacy threats and countermeasures analysis for three types of federated learning: horizontal, vertical, and transfer federated learning.
Details
Motivation: Federated learning is considered privacy-aware but still faces privacy threats. Current research lacks comprehensive categorization and description of common and unique privacy threats across different federated learning types.Method: The paper analyzes and categorizes privacy threats and countermeasures for three typical federated learning types: horizontal federated learning, vertical federated learning, and transfer federated learning.
Result: Provides a comprehensive classification of privacy threats specific to each federated learning type and describes appropriate countermeasures for each scenario.
Conclusion: The study offers a systematic framework for understanding and addressing privacy vulnerabilities in different federated learning architectures, helping practitioners implement more effective privacy protection measures.
Abstract: Federated learning is widely considered to be a privacy-aware learning method because no training data is exchanged directly between clients. Nevertheless, there are threats to privacy in federated learning, and privacy countermeasures have been studied. However, we note that the common and unique privacy threats among typical types of federated learning have not been categorized and described in a comprehensive and specific way. In this paper, we describe privacy threats and countermeasures for the typical types of federated learning: horizontal federated learning, vertical federated learning, and transfer federated learning.
[262] Two-Timescale Critic-Actor for Average Reward MDPs with Function Approximation
Prashansa Panda, Shalabh Bhatnagar
Main category: cs.LG
TL;DR: First two-timescale critic-actor algorithm with function approximation for long-run average reward setting, featuring both finite-time non-asymptotic and asymptotic convergence analysis with optimal learning rates.
Details
Motivation: Previous works focused on non-asymptotic convergence for AC algorithms, but existing two-timescale critic-actor approaches were limited to look-up table cases with only asymptotic convergence shown. There was a need for function approximation in long-run average reward settings with comprehensive convergence analysis.Method: Proposed a two-timescale critic-actor algorithm with function approximation, where actor and critic operate on different timescales. Provided both finite-time non-asymptotic bounds and asymptotic convergence analysis, showing convergence to attractors of associated differential inclusions.
Result: Achieved sample complexity of O(ε^{-(2+δ)}) with δ>0 arbitrarily small for critic’s mean squared error, outperforming previous two-timescale AC algorithms. Demonstrated almost sure asymptotic convergence to local maxima of perturbed average reward objective. Numerical experiments showed superior performance on three benchmark settings.
Conclusion: The proposed two-timescale critic-actor algorithm with function approximation represents a significant advancement, providing the first comprehensive convergence analysis (both finite-time and asymptotic) for long-run average reward settings, with improved sample complexity and demonstrated practical effectiveness.
Abstract: Several recent works have focused on carrying out non-asymptotic convergence analyses for AC algorithms. Recently, a two-timescale critic-actor algorithm was presented for the discounted cost setting in the look-up table case, where the timescales of the actor and the critic are reversed and only asymptotic convergence was shown. In our work, we present the first two-timescale critic-actor algorithm with function approximation in the long-run average reward setting and present the first finite-time non-asymptotic as well as asymptotic convergence analysis for such a scheme. We obtain optimal learning rates and prove that our algorithm achieves a sample complexity of $\mathcal{\tilde{O}}(\epsilon^{-(2+\delta)})$, with $\delta > 0$ arbitrarily close to zero, for the mean squared error of the critic to be upper bounded by $\epsilon$, which is better than the one obtained for two-timescale AC in a similar setting. A notable feature of our analysis is that we present the asymptotic convergence analysis of our scheme in addition to the finite-time bounds that we obtain, and show the almost sure asymptotic convergence of the (slower) critic recursion to the attractor of an associated differential inclusion with actor parameters corresponding to local maxima of a perturbed average reward objective. We also show the results of numerical experiments on three benchmark settings and observe that our critic-actor algorithm performs the best amongst all algorithms.
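The defining feature, reversed timescales, comes down to the relative decay of two step-size schedules. A minimal sketch follows; the schedules and the gradient estimates are placeholders, not the paper's.

```python
# Schematic two-timescale critic-actor step: the actor moves on the faster
# timescale and the critic on the slower one, reversing classic actor-critic.
import numpy as np

theta = np.zeros(8)   # actor parameters
w = np.zeros(8)       # critic (e.g., linear value-function) parameters

def step_sizes(t):
    a_t = 1.0 / (t + 1) ** 0.55   # actor: slower-decaying, hence faster timescale
    b_t = 1.0 / (t + 1) ** 0.95   # critic: b_t / a_t -> 0, the slower timescale
    return a_t, b_t

for t in range(10_000):
    a_t, b_t = step_sizes(t)
    g_actor = np.random.randn(8)    # placeholder for the policy-gradient estimate
    g_critic = np.random.randn(8)   # placeholder for the TD-error-based update
    theta += a_t * g_actor
    w += b_t * g_critic
```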
[263] TorchCP: A Python Library for Conformal Prediction
Jianguo Huang, Jianqing Song, Xuanning Zhou, Bingyi Jing, Hongxin Wei
Main category: cs.LG
TL;DR: TorchCP is a PyTorch-native conformal prediction library that provides scalable, GPU-accelerated uncertainty quantification for deep learning models including DNNs, GNNs, and LLMs, with up to 90% inference time reduction.
Details
Motivation: Existing conformal prediction libraries lack proper model support and scalability for large-scale deep learning scenarios, despite CP algorithms evolving to work with sophisticated models like DNNs, GNNs, and LLMs.Method: Developed a PyTorch-native library with low-coupling design, implementing state-of-the-art CP algorithms with GPU-accelerated batch processing, online prediction capabilities, and CP-specific training algorithms.
Result: TorchCP comprises 16k lines of code with 100% unit test coverage, achieves up to 90% reduction in inference time on large datasets, and provides comprehensive support for various deep learning models while maintaining full GPU scalability.
Conclusion: TorchCP successfully bridges the gap between conformal prediction theory and practical deep learning applications, empowering researchers and practitioners to enhance uncertainty quantification across cutting-edge AI applications with efficient, scalable tools.
Abstract: Conformal prediction (CP) is a powerful statistical framework that generates prediction intervals or sets with guaranteed coverage probability. While CP algorithms have evolved beyond traditional classifiers and regressors to sophisticated deep learning models like deep neural networks (DNNs), graph neural networks (GNNs), and large language models (LLMs), existing CP libraries often lack the model support and scalability for large-scale DL scenarios. This paper introduces TorchCP, a PyTorch-native library designed to integrate state-of-the-art CP algorithms into deep learning techniques, including DNN-based classifiers/regressors, GNNs, and LLMs. Released under the LGPL-3.0 license, TorchCP comprises about 16k lines of code, validated with 100% unit test coverage and detailed documentation. Notably, TorchCP enables CP-specific training algorithms, online prediction, and GPU-accelerated batch processing, achieving up to 90% reduction in inference time on large datasets. With its low-coupling design, comprehensive suite of advanced methods, and full GPU scalability, TorchCP empowers researchers and practitioners to enhance uncertainty quantification across cutting-edge applications.
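For readers new to CP, the coverage guarantee that TorchCP implements at scale can be illustrated with a generic split conformal classifier in a few lines of NumPy. This is not TorchCP's API, just a sketch of the underlying idea:

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """cal_probs: (n, K) softmax outputs on a held-out calibration set."""
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    # Finite-sample-corrected quantile gives >= 1 - alpha marginal coverage.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    # Prediction set: every class whose score would have been conformal.
    return [np.where(1.0 - p <= q)[0] for p in test_probs]
```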
[264] Federated Diffusion Modeling with Differential Privacy for Tabular Data Synthesis
Timur Sattarov, Marco Schreyer, Damian Borth
Main category: cs.LG
TL;DR: DP-FedTabDiff combines differential privacy, federated learning, and diffusion models to generate high-quality synthetic tabular data while ensuring privacy compliance.
Details
Motivation: Growing need for privacy-preserving data analytics solutions that can generate synthetic data while maintaining strict privacy standards in regulated domains.Method: Integration of Differential Privacy, Federated Learning, and Denoising Diffusion Probabilistic Models to create the DP-FedTabDiff framework for synthetic tabular data generation.
Result: Significant improvements in privacy guarantees without compromising data quality on multiple real-world mixed-type tabular datasets, with optimal trade-offs between privacy budgets and federated optimization strategies.
Conclusion: DP-FedTabDiff enables secure data sharing and analytics in regulated domains, demonstrating potential for advancing federated learning and privacy-preserving data synthesis.
Abstract: The increasing demand for privacy-preserving data analytics in various domains necessitates solutions for synthetic data generation that rigorously uphold privacy standards. We introduce the DP-FedTabDiff framework, a novel integration of Differential Privacy, Federated Learning and Denoising Diffusion Probabilistic Models designed to generate high-fidelity synthetic tabular data. This framework ensures compliance with privacy regulations while maintaining data utility. We demonstrate the effectiveness of DP-FedTabDiff on multiple real-world mixed-type tabular datasets, achieving significant improvements in privacy guarantees without compromising data quality. Our empirical evaluations reveal the optimal trade-offs between privacy budgets, client configurations, and federated optimization strategies. The results affirm the potential of DP-FedTabDiff to enable secure data sharing and analytics in highly regulated domains, paving the way for further advances in federated learning and privacy-preserving data synthesis.
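The privacy mechanism typically underlying such frameworks is per-client clipping plus Gaussian noise at aggregation time. Here is a minimal sketch of DP federated averaging; the hyperparameters are illustrative, and DP-FedTabDiff's privacy accounting and diffusion-model specifics are not reproduced:

```python
import numpy as np

def dp_fed_avg(client_updates, clip_norm=1.0, noise_mult=1.0,
               rng=np.random.default_rng(0)):
    """Clip each client's update to bound sensitivity, average, add noise."""
    clipped = []
    for u in client_updates:
        norm = np.linalg.norm(u)
        clipped.append(u * min(1.0, clip_norm / (norm + 1e-12)))
    mean = np.mean(clipped, axis=0)
    sigma = noise_mult * clip_norm / len(client_updates)
    return mean + rng.normal(0.0, sigma, size=mean.shape)
```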
[265] Beyond Frequency: The Role of Redundancy in Large Language Model Memorization
Jie Zhang, Qinghua Zhao, Chi-ho Lin, Zhongfeng Kang, Lei Li
Main category: cs.LG
TL;DR: Memorization patterns in LLMs show low-redundancy samples are 2x more vulnerable to memorization than high-redundancy ones, with 79% of memorized samples being low-redundancy.
Details
Motivation: Address critical privacy and fairness risks posed by memorization in large language models as they scale to billions of parameters.Method: Counterfactual analysis by perturbing sample prefixes and quantifying perturbation strength through token positional changes to examine redundancy’s correlation with memorization.
Result: Frequency increases minimally impact memorized samples (0.09) vs substantial effect on non-memorized (0.25); memorized samples drop by 0.6 under perturbation vs 0.01 for non-memorized; low-redundancy samples show 2x higher vulnerability.
Conclusion: Redundancy correlates with memorization patterns, suggesting redundancy-guided data preprocessing approaches to reduce privacy risks and mitigate bias for fair model deployments.
Abstract: Memorization in large language models poses critical risks for privacy and fairness as these systems scale to billions of parameters. While previous studies established correlations between memorization and factors like token frequency and repetition patterns, we revealed distinct response patterns: frequency increases minimally impact memorized samples (e.g. 0.09) while substantially affecting non-memorized samples (e.g., 0.25), with consistency observed across model scales. Through counterfactual analysis by perturbing sample prefixes and quantifying perturbation strength through token positional changes, we demonstrate that redundancy correlates with memorization patterns. Our findings establish that: about 79% of memorized samples are low-redundancy, these low-redundancy samples exhibit 2-fold higher vulnerability than high-redundancy ones, and consequently memorized samples drop by 0.6 under perturbation while non-memorized samples drop by only 0.01, indicating that more redundant content becomes both more memorable and more fragile. These findings suggest potential redundancy-guided approaches for data preprocessing, thereby reducing privacy risks and mitigating bias to ensure fairness in model deployments.
[266] Stochastic Control for Fine-tuning Diffusion Models: Optimality, Regularity, and Convergence
Yinbin Han, Meisam Razaviyayn, Renyuan Xu
Main category: cs.LG
TL;DR: A stochastic control framework for fine-tuning diffusion models with theoretical guarantees and linear convergence rate.
Details
Motivation: Fine-tuning large diffusion models for specific tasks and preferences remains challenging with limited theoretical understanding despite empirical progress.Method: Proposes a stochastic control framework integrating linear dynamics control with KL regularization, building on pre-trained diffusion models. Develops policy iteration algorithm (PI-FT) with proven well-posedness and regularity.
Result: PI-FT achieves global convergence at linear rate, maintains regularity throughout training, and demonstrates practical effectiveness in numerical experiments.
Conclusion: The framework provides theoretical foundation for diffusion model fine-tuning with guaranteed convergence and regularity properties, extending to parametric and continuous-time settings.
Abstract: Diffusion models have emerged as powerful tools for generative modeling, demonstrating exceptional capability in capturing target data distributions from large datasets. However, fine-tuning these massive models for specific downstream tasks, constraints, and human preferences remains a critical challenge. While recent advances have leveraged reinforcement learning algorithms to tackle this problem, much of the progress has been empirical, with limited theoretical understanding. To bridge this gap, we propose a stochastic control framework for fine-tuning diffusion models. Building on denoising diffusion probabilistic models as the pre-trained reference dynamics, our approach integrates linear dynamics control with Kullback-Leibler regularization. We establish the well-posedness and regularity of the stochastic control problem and develop a policy iteration algorithm (PI-FT) for numerical solution. We show that PI-FT achieves global convergence at a linear rate. Unlike existing work that assumes regularities throughout training, we prove that the control and value sequences generated by the algorithm maintain the regularity. Additionally, we explore extensions of our framework to parametric settings and continuous-time formulations, and demonstrate the practical effectiveness of the proposed PI-FT algorithm through numerical experiments. Our code is available at https://github.com/yinbinhan/fine-tuning-of-diffusion-models.
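The control objective described, reward maximization regularized by KL divergence toward the pre-trained reference, has the following schematic form (notation illustrative, not the paper's):

```latex
% Schematic KL-regularized fine-tuning objective over controls u: increase
% reward while staying close to the pre-trained reference dynamics.
\max_{u}\; \mathbb{E}^{u}\big[\, r(X_T) \,\big]
\;-\; \lambda\, D_{\mathrm{KL}}\big( \mathbb{P}^{u} \,\big\Vert\, \mathbb{P}^{\mathrm{ref}} \big)
% PI-FT alternates policy evaluation and improvement on this problem.
```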
[267] SpecPipe: Accelerating Pipeline Parallelism-based LLM Inference with Speculative Decoding
Haofei Yin, Mengbai Xiao, Tinghong Li, Xiao Zhang, Dongxiao Yu, Guanghui Zhang
Main category: cs.LG
TL;DR: SpecPipe improves LLM inference latency by combining pipeline parallelism with speculative decoding, achieving 4-5x speedup over standard methods through dynamic token trees and optimized hardware utilization.
Details
Motivation: Address high service latency in pipeline parallelism for LLM inference while overcoming limitations of existing speculative decoding methods (low hardware utilization and narrow speculative window).Method: Introduces SpecPipe with dynamic speculative token tree and pipelined inference framework. Uses branch prediction-inspired approach to fill pipeline with speculative tokens step-by-step, integrating high-accuracy draft model without fine-tuning.
Result: 4.19x-5.53x improvement in time between tokens over standard pipeline parallelism, 2.08x-2.38x over prior tree-based methods. Multi-request variant achieves 1.64x-2.08x higher throughput and 1.61x-2.06x lower latency than vLLM.
Conclusion: SpecPipe effectively addresses latency issues in distributed LLM inference by maximizing hardware utilization and expanding speculative window, demonstrating significant performance improvements for both single and multi-request scenarios.
Abstract: The demand for large language model inference is rapidly increasing. Pipeline parallelism offers a cost-effective deployment strategy for distributed inference but suffers from high service latency. While incorporating speculative decoding into pipeline parallelism improves performance, it still faces the challenges of low hardware utilization and a narrow speculative window. Inspired by branch prediction in instruction pipelining, we introduce SpecPipe, which fills the pipeline with speculative tokens of a request step-by-step. By maximizing hardware utilization, SpecPipe ideally decodes one token per pipeline step. Specifically, SpecPipe comprises a dynamic speculative token tree and a pipelined inference framework. The tree dynamically accepts tokens from a speculative token source and outputs the tokens to the inference pipeline. Since the speculative window is relaxed in our framework, a high-accuracy draft model is integrated without fine-tuning. The pipelined inference framework follows node-wise computation, pruning propagation, and inter-node communication stages. We implement SpecPipe and a variant SpecPipe-DB with dynamic batching for single- and multi-request inference, respectively. On an 8-stage pipeline, SpecPipe improves time between tokens on diverse single-request workloads by $4.19\times$-$5.53\times$ over standard pipeline parallelism and by $2.08\times$-$2.38\times$ over prior tree-based speculative decoding methods. For multi-request workloads, SpecPipe-DB achieves $1.64\times$-$2.08\times$ higher throughput and $1.61\times$-$2.06\times$ lower time between tokens than vLLM.
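The draft-and-verify primitive that speculative decoding (and hence SpecPipe's token tree) builds on can be sketched in a few lines. The greedy acceptance rule below is the generic version, not SpecPipe's pipelined scheduler:

```python
def accept_prefix(draft_tokens, target_argmax_tokens):
    """target_argmax_tokens[i] is the target model's greedy choice given the
    context plus draft_tokens[:i]; the target checks all drafts in one pass."""
    accepted = []
    for d, t in zip(draft_tokens, target_argmax_tokens):
        if d != t:
            accepted.append(t)   # replace the first mismatch with the target token
            break
        accepted.append(d)       # draft agreed with the target: keep it for free
    return accepted
```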
[268] Scientifically-Interpretable Reasoning Network (ScIReN): Discovering Hidden Relationships in the Carbon Cycle and Beyond
Joshua Fan, Haodi Xu, Feng Tao, Md Nasim, Marc Grimson, Yiqi Luo, Carla P. Gomes
Main category: cs.LG
TL;DR: ScIReN is a transparent framework combining interpretable neural networks with process-based models to improve soil carbon cycle predictions while maintaining scientific interpretability.
Details
Motivation: Current soil carbon models have unknown parameters set ad-hoc and fit poorly, while neural networks lack scientific interpretability and cannot reveal new scientific relationships due to their black-box nature.Method: Uses interpretable Kolmogorov-Arnold networks (KAN) as encoder to predict scientifically-meaningful latent parameters, combined with differentiable process-based decoder. Includes smoothness penalties and hard-sigmoid constraint layer to ensure parameters stay within scientifically meaningful ranges.
Result: Outperforms black-box networks in predictive accuracy for soil carbon flow simulation and ecosystem respiration modeling, while providing scientific interpretability to infer latent mechanisms and their relationships with input features.
Conclusion: ScIReN successfully bridges the gap between data-driven learning and scientific interpretability, enabling both accurate predictions and discovery of new scientific relationships in soil carbon cycling.
Abstract: Understanding how carbon flows through the soil is crucial for mitigating the effects of climate change. While soils have potential to sequester carbon from the atmosphere, the soil carbon cycle remains poorly understood. Scientists have developed mathematical process-based models of the soil carbon cycle based on existing knowledge, but they contain numerous unknown parameters that must be set in an ad-hoc manner, and often fit observations poorly. On the other hand, neural networks can learn patterns from data, but do not respect known scientific laws, nor can they reveal novel scientific relationships due to their black-box nature. We thus propose Scientifically-Interpretable Reasoning Network (ScIReN), a fully-transparent framework that combines interpretable neural and process-based reasoning. An interpretable encoder predicts scientifically-meaningful latent parameters, which are then passed through a differentiable process-based decoder to predict labeled output variables. ScIReN leverages Kolmogorov-Arnold networks (KAN) to ensure the encoder is fully interpretable and reveals relationships between input features and latent parameters; it uses novel smoothness penalties to balance expressivity and simplicity. ScIReN also uses a novel hard-sigmoid constraint layer to restrict latent parameters to meaningful ranges defined by scientific prior knowledge. While the process-based decoder enforces established scientific knowledge, the KAN-based encoder reveals new scientific relationships hidden in conventional black-box models. We apply ScIReN on two tasks: simulating the flow of organic carbon through soils, and modeling ecosystem respiration from plants. In both tasks, ScIReN outperforms black-box networks in predictive accuracy while providing substantial scientific interpretability – it can infer latent scientific mechanisms and their relationships with input features.
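The hard-sigmoid constraint layer is easy to picture: it squashes an unbounded latent into a scientist-specified interval. A minimal PyTorch sketch, where the parameter name and range are invented for illustration:

```python
import torch
import torch.nn.functional as F

def constrain(z, low, high):
    # hardsigmoid clamps z/6 + 0.5 to [0, 1]; rescale into [low, high].
    return low + (high - low) * F.hardsigmoid(z)

z = torch.randn(5)                                   # unbounded latent from the encoder
turnover_rate = constrain(z, low=0.01, high=2.0)     # hypothetical soil-carbon parameter
```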
[269] On the Adversarial Robustness of Spiking Neural Networks Trained by Local Learning
Jiaqi Lin, Abhronil Sengupta
Main category: cs.LG
TL;DR: This paper examines adversarial robustness in Spiking Neural Networks (SNNs) using local learning methods instead of traditional gradient-based approaches, and introduces a hybrid adversarial attack paradigm that outperforms existing methods.
Details
Motivation: Most current adversarial attack studies on SNNs rely on biologically implausible gradient-based methods like BPTT, while local learning methods remain under-explored despite relaxing many of BPTT's constraints.Method: The researchers analyzed adversarial robustness through four types of training algorithms and introduced a hybrid adversarial attack paradigm that leverages adversarial instance transferability to overcome gradient-based attack limitations.
Result: The proposed hybrid approach demonstrated superior performance over existing adversarial attack methods and showed strong generalizability across multi-step attacks, black-box FGSM scenarios, and non-spiking domains.
Conclusion: Local learning methods provide a viable alternative to gradient-based approaches for studying adversarial robustness in SNNs, and the hybrid attack paradigm effectively addresses the limitations of traditional gradient-based adversarial attacks.
Abstract: Recent research has shown the vulnerability of Spiking Neural Networks (SNNs) under adversarial examples that are nearly indistinguishable from clean data in the context of frame-based and event-based information. The majority of these studies are constrained in generating adversarial examples using Backpropagation Through Time (BPTT), a gradient-based method which lacks biological plausibility. In contrast, local learning methods, which relax many of BPTT’s constraints, remain under-explored in the context of adversarial attacks. To address this problem, we examine adversarial robustness in SNNs through the framework of four types of training algorithms. We provide an in-depth analysis of the ineffectiveness of gradient-based adversarial attacks to generate adversarial instances in this scenario. To overcome these limitations, we introduce a hybrid adversarial attack paradigm that leverages the transferability of adversarial instances. The proposed hybrid approach demonstrates superior performance, outperforming existing adversarial attack methods. Furthermore, the generalizability of the method is assessed under multi-step adversarial attacks, adversarial attacks in black-box FGSM scenarios, and within the non-spiking domain.
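The transferability idea at the core of the hybrid paradigm is the standard surrogate-attack primitive: craft perturbations where gradients are available and apply them where they are not. A minimal FGSM-based sketch of that primitive (generic, not the paper's full method):

```python
import torch

def fgsm_transfer(surrogate, target, x, y, eps=8 / 255):
    """Craft FGSM examples on `surrogate`, evaluate them on `target`."""
    x = x.clone().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(surrogate(x), y)
    loss.backward()
    x_adv = (x + eps * x.grad.sign()).clamp(0, 1).detach()
    return target(x_adv).argmax(dim=1)   # target's predictions on transferred examples
```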
[270] Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation
Abdulaziz Almuzairee, Rohan Patil, Dwait Bhatt, Henrik I. Christensen
Main category: cs.LG
TL;DR: MAD algorithm merges multiple camera views for better sample efficiency in visual servoing while disentangling views to maintain robustness against camera failures and enable lightweight deployment.
Details
Motivation: Multi-view vision policies improve manipulation performance but are sensitive to camera failures and deployment complexity. Need a solution that maintains sample efficiency while being robust and deployable.Method: Merge And Disentanglement (MAD) algorithm that efficiently merges views for sample efficiency while simultaneously disentangling views by augmenting multi-view feature inputs with single-view features.
Result: Demonstrated efficiency and robustness on Meta-World and ManiSkill3 benchmarks, producing robust policies that allow lightweight deployment.
Conclusion: MAD successfully addresses the trade-off between multi-view sample efficiency and single-view robustness, enabling practical deployment of vision-based manipulation policies.
Abstract: Vision is well-known for its use in manipulation, especially using visual servoing. Due to the 3D nature of the world, using multiple camera views and merging them creates better representations for Q-learning and in turn, trains more sample efficient policies. Nevertheless, these multi-view policies are sensitive to failing cameras and can be burdensome to deploy. To mitigate these issues, we introduce a Merge And Disentanglement (MAD) algorithm that efficiently merges views to increase sample efficiency while simultaneously disentangling views by augmenting multi-view feature inputs with single-view features. This produces robust policies and allows lightweight deployment. We demonstrate the efficiency and robustness of our approach using Meta-World and ManiSkill3. For project website and code, see https://aalmuzairee.github.io/mad
[271] ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism
Jia Liu, ChangYi He, YingQiao Lin, MingMin Yang, FeiYang Shen, ShaoGuo Liu
Main category: cs.LG
TL;DR: Entropy-based test-time reinforcement learning method improves reasoning performance with better efficiency and diversity.
Details
Motivation: Large Language Models depend on annotated data and struggle with unsupervised scenarios. Test-time reinforcement learning faces challenges like high inference costs and early-stage estimation bias that reduces diversity.Method: Proposes entropy-based mechanism with two strategies: Entropy-fork Tree Majority Rollout (ETMR) and Entropy-based Advantage Reshaping (EAR) to balance exploration-exploitation.
Result: Llama3.1-8B achieved 68% relative improvement in Pass@1 on AIME 2024 benchmark while using only 60% of rollout tokens budget.
Conclusion: The method effectively optimizes inference efficiency, diversity, and estimation robustness for unsupervised reinforcement learning in open-domain reasoning tasks.
Abstract: Recent advancements in Large Language Models have yielded significant improvements in complex reasoning tasks such as mathematics and programming. However, these models remain heavily dependent on annotated data and exhibit limited adaptability in unsupervised scenarios. To address these limitations, test-time reinforcement learning (TTRL) has been proposed, which enables self-optimization by leveraging model-generated pseudo-labels. Despite its promise, TTRL faces several key challenges, including high inference costs due to parallel rollouts and early-stage estimation bias that fosters overconfidence, reducing output diversity and causing performance plateaus. To address these challenges, we introduce an entropy-based mechanism to enhance the exploration-exploitation balance in test-time reinforcement learning through two strategies: Entropy-fork Tree Majority Rollout (ETMR) and Entropy-based Advantage Reshaping (EAR). Compared with the baseline, our approach enables Llama3.1-8B to achieve a 68 percent relative improvement in the Pass@1 metric on the AIME 2024 benchmark, while consuming only 60 percent of the rollout token budget. This highlights our method’s ability to effectively optimize the trade-off between inference efficiency, diversity, and estimation robustness, thereby advancing unsupervised reinforcement learning for open-domain reasoning tasks.
[272] BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning
Yunpeng Qing, Shuo Chen, Yixiao Chi, Shunyu Liu, Sixu Lin, Kelu Yao, Changqing Zou
Main category: cs.LG
TL;DR: BiTrajDiff introduces bidirectional trajectory diffusion for offline RL data augmentation, generating both future and history trajectories from intermediate states to address dataset distribution bias and improve policy learning.
Details
Motivation: Current offline RL data augmentation methods only reconstruct future trajectories, ignoring history transitions that lead to critical states, which limits discovery of diverse behavior patterns and generalizability.Method: Bidirectional Trajectory Diffusion (BiTrajDiff) uses two independent diffusion processes: one for forward trajectory generation (predicting future dynamics) and one for backward trajectory generation (tracing essential history transitions) from any intermediate states.
Result: Extensive experiments on D4RL benchmark show BiTrajDiff achieves superior performance compared to other advanced DA methods across various offline RL backbones.
Conclusion: BiTrajDiff effectively leverages critical states as anchors to expand into valuable yet underexplored state space regions, facilitating dataset diversity and improving offline RL performance through bidirectional trajectory modeling.
Abstract: Recent advances in offline Reinforcement Learning (RL) have proven that effective policy learning can benefit from imposing conservative constraints on pre-collected datasets. However, such static datasets often exhibit distribution bias, resulting in limited generalizability. To address this limitation, a straightforward solution is data augmentation (DA), which leverages generative models to enrich data distribution. Despite the promising results, current DA techniques focus solely on reconstructing future trajectories from given states, while ignoring the exploration of history transitions that reach them. This single-direction paradigm inevitably hinders the discovery of diverse behavior patterns, especially those leading to critical states that may have yielded high-reward outcomes. In this work, we introduce Bidirectional Trajectory Diffusion (BiTrajDiff), a novel DA framework for offline RL that models both future and history trajectories from any intermediate states. Specifically, we decompose the trajectory generation task into two independent yet complementary diffusion processes: one generating forward trajectories to predict future dynamics, and the other generating backward trajectories to trace essential history transitions. BiTrajDiff can efficiently leverage critical states as anchors to expand into potentially valuable yet underexplored regions of the state space, thereby facilitating dataset diversity. Extensive experiments on the D4RL benchmark suite demonstrate that BiTrajDiff achieves superior performance compared to other advanced DA methods across various offline RL backbones.
[273] SimuGen: Multi-modal Agentic Framework for Constructing Block Diagram-Based Simulation Models
Xinxing Ren, Qianbo Zang, Zekun Guo
Main category: cs.LG
TL;DR: SimuGen is a multimodal agent framework that generates accurate Simulink simulation code by combining visual diagrams with domain knowledge, addressing LLMs’ limitations in simulation domain tasks.
Details
Motivation: LLMs struggle with Simulink model generation due to lack of domain-specific pretraining data, leading to unreliable code from text-only inputs.Method: Multimodal agent framework with specialized agents (investigator, unit test reviewer, code generator, executor, debug locator, report writer) coordinated with domain knowledge base.
Result: Framework enables interpretable, robust, and reproducible Simulink simulation generation.
Conclusion: SimuGen provides an effective solution for automated Simulink code generation by leveraging visual information and collaborative agent coordination.
Abstract: Recent advances in large language models (LLMs) have shown impressive performance in mathematical reasoning and code generation. However, LLMs still struggle in the simulation domain, particularly in generating Simulink models, which are essential tools in engineering and scientific research. Our preliminary experiments indicate that LLM agents often fail to produce reliable and complete Simulink simulation code from text-only inputs, likely due to the lack of Simulink-specific data in their pretraining. To address this challenge, we propose SimuGen, a multimodal agent-based framework that automatically generates accurate Simulink simulation code by leveraging both the visual Simulink diagram and domain knowledge. SimuGen coordinates several specialized agents, including an investigator, unit test reviewer, code generator, executor, debug locator, and report writer, supported by a domain-specific knowledge base. This collaborative and modular design enables interpretable, robust, and reproducible Simulink simulation generation. Our source code is publicly available at https://github.com/renxinxing123/SimuGen_beta.
[274] Quantized Neural Networks for Microcontrollers: A Comprehensive Review of Methods, Platforms, and Applications
Hamza A. Abushahla, Dara Varam, Ariel J. N. Panopio, Mohamed I. AlHajri
Main category: cs.LG
TL;DR: Survey paper on quantization techniques for deploying neural networks on resource-constrained microcontrollers, focusing on hardware-software trade-offs and TinyML frameworks.
Details
Motivation: Address challenges in balancing model performance, computational complexity, and memory constraints when deploying quantized neural networks on microcontrollers and embedded systems.Method: Systematic review and hardware-centric analysis of quantization techniques, evaluation of existing software frameworks and hardware platforms for QNN execution on microcontrollers.
Result: Comprehensive survey of essential quantization methods for accelerating deep learning models on embedded applications, with analysis of performance-hardware trade-offs.
Conclusion: Identifies current challenges and outlines promising future directions in the rapidly evolving domain of quantized neural network deployment for TinyML applications.
Abstract: The deployment of Quantized Neural Networks (QNNs) on resource-constrained devices, such as microcontrollers, has introduced significant challenges in balancing model performance, computational complexity, and memory constraints. Tiny Machine Learning (TinyML) addresses these issues by integrating advancements across machine learning algorithms, hardware acceleration, and software optimization to efficiently run deep neural networks on embedded systems. This survey presents a hardware-centric introduction to quantization, systematically reviewing essential quantization techniques employed to accelerate deep learning models for embedded applications. In particular, further emphasis is placed on the critical trade-offs between model performance and hardware capabilities. The survey further evaluates existing software frameworks and hardware platforms designed specifically for supporting QNN execution on microcontrollers. Moreover, we provide an analysis of the current challenges and an outline of promising future directions in the rapidly evolving domain of QNN deployment.
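As a concrete anchor for the survey's subject matter, the basic affine (asymmetric) int8 quantization used across TinyML stacks looks like this; per-tensor granularity is shown for simplicity, in NumPy:

```python
import numpy as np

def quantize_int8(x):
    """Affine quantization: q = round(x / scale) + zero_point, into int8."""
    scale = max((x.max() - x.min()) / 255.0, 1e-12)
    zero_point = np.round(-x.min() / scale) - 128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Approximate inverse; the round-off is the quantization error."""
    return (q.astype(np.float32) - zero_point) * scale
```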
[275] Sensitivity of Stability: Theoretical & Empirical Analysis of Replicability for Adaptive Data Selection in Transfer Learning
Prabhav Singh, Jessica Sorrell
Main category: cs.LG
TL;DR: Transfer learning with adaptive data selection faces a replicability-performance trade-off: highly adaptive strategies improve performance but reduce replicability, while source domain pretraining can mitigate replicability failures.
Details
Motivation: To understand the reliability of transfer learning adaptations, particularly when using adaptive data selection strategies that dynamically prioritize training examples, as replicability remains poorly understood.Method: Developed a mathematical framework quantifying the trade-off between adaptation effectiveness and result consistency, formalizing selection sensitivity (Δ_Q). Conducted extensive experiments on MultiNLI corpus using six adaptive selection strategies (uniform sampling to gradient-based selection).
Result: Replicability failure probability increases quadratically with selection sensitivity and decreases exponentially with sample size. Gradient-based and curriculum learning achieve superior performance but suffer high replicability failure rates (>7%), while less adaptive approaches maintain failure rates below 7%. Source domain pretraining reduces failure rates by up to 30%.
Conclusion: Established principled guidelines for navigating performance-replicability trade-off in transfer learning, highlighting the need for replicability-aware design in modern systems.
Abstract: The widespread adoption of transfer learning has revolutionized machine learning by enabling efficient adaptation of pre-trained models to new domains. However, the reliability of these adaptations remains poorly understood, particularly when using adaptive data selection strategies that dynamically prioritize training examples. We present a comprehensive theoretical and empirical analysis of replicability in transfer learning, introducing a mathematical framework that quantifies the fundamental trade-off between adaptation effectiveness and result consistency. Our key contribution is the formalization of selection sensitivity ($\Delta_Q$), a measure that captures how adaptive selection strategies respond to perturbations in training data. We prove that the replicability failure probability (the likelihood that two independent training runs produce models differing in performance by more than a threshold) increases quadratically with selection sensitivity while decreasing exponentially with sample size. Through extensive experiments on the MultiNLI corpus using six adaptive selection strategies, ranging from uniform sampling to gradient-based selection, we demonstrate that this theoretical relationship holds precisely in practice. Our results reveal that highly adaptive strategies like gradient-based and curriculum learning achieve superior task performance but suffer from high replicability failure rates, while less adaptive approaches maintain failure rates below 7%. Crucially, we show that source domain pretraining provides a powerful mitigation mechanism, reducing failure rates by up to 30% while preserving performance gains. These findings establish principled guidelines for practitioners to navigate the performance-replicability trade-off and highlight the need for replicability-aware design in modern transfer learning systems.
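A schematic rendering of the stated relationship, with illustrative constants $c_1, c_2$ rather than the paper's actual bound, is:

```latex
% Replicability failure: two independent runs \hat{f}_1, \hat{f}_2 differ in
% performance R by more than a threshold \tau. Schematically, per the abstract:
\Pr\big[\, |R(\hat{f}_1) - R(\hat{f}_2)| > \tau \,\big]
\;\le\; c_1\, \Delta_Q^{2}\, e^{-c_2 n}
% quadratic in the selection sensitivity, exponentially decaying in sample size n.
```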
[276] BudgetThinker: Empowering Budget-aware LLM Reasoning with Control Tokens
Hao Wen, Xinrui Wu, Yi Sun, Feifei Zhang, Liye Chen, Jie Wang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, Yuanchun Li
Main category: cs.LG
TL;DR: BudgetThinker enables LLMs to perform budget-aware reasoning by inserting control tokens during inference and using a two-stage training pipeline with SFT and curriculum RL to optimize for both accuracy and token budget adherence.
Details
Motivation: Current LLM reasoning methods require significant test-time computation, causing high latency and resource costs that limit practical deployment in time-constrained or cost-sensitive real-world scenarios.Method: Periodic insertion of special control tokens during inference to inform the model of remaining token budget, coupled with a two-stage training pipeline: SFT to familiarize with budget constraints, followed by curriculum-based RL with length-aware reward function.
Result: BudgetThinker significantly outperforms strong baselines in maintaining performance across various reasoning budgets on challenging mathematical benchmarks.
Conclusion: Provides a scalable and effective solution for developing efficient and controllable LLM reasoning, making advanced models more practical for deployment in resource-constrained and real-time environments.
Abstract: Recent advancements in Large Language Models (LLMs) have leveraged increased test-time computation to enhance reasoning capabilities, a strategy that, while effective, incurs significant latency and resource costs, limiting their applicability in real-world time-constrained or cost-sensitive scenarios. This paper introduces BudgetThinker, a novel framework designed to empower LLMs with budget-aware reasoning, enabling precise control over the length of their thought processes. We propose a methodology that periodically inserts special control tokens during inference to continuously inform the model of its remaining token budget. This approach is coupled with a comprehensive two-stage training pipeline, beginning with Supervised Fine-Tuning (SFT) to familiarize the model with budget constraints, followed by a curriculum-based Reinforcement Learning (RL) phase that utilizes a length-aware reward function to optimize for both accuracy and budget adherence. We demonstrate that BudgetThinker significantly surpasses strong baselines in maintaining performance across a variety of reasoning budgets on challenging mathematical benchmarks. Our method provides a scalable and effective solution for developing efficient and controllable LLM reasoning, making advanced models more practical for deployment in resource-constrained and real-time environments.
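The control-token mechanism is simple to outline: every few decoded tokens, the remaining budget is spliced into the sequence. In this sketch the token format `<budget:N>` and the interval are hypothetical placeholders; the paper defines its own special tokens:

```python
def with_budget_token(generated_tokens, budget, interval=64):
    """Periodically append a control token telling the model its remaining budget."""
    if generated_tokens and len(generated_tokens) % interval == 0:
        remaining = budget - len(generated_tokens)
        generated_tokens = generated_tokens + [f"<budget:{remaining}>"]  # hypothetical format
    return generated_tokens
```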
[277] Inductive Domain Transfer In Misspecified Simulation-Based Inference
Ortal Senouf, Cédric Vincent-Cuaz, Emmanuel Abbé, Pascal Frossard
Main category: cs.LG
TL;DR: A fully inductive and amortized simulation-based inference framework that integrates calibration and distributional alignment using optimal transport to handle model misspecification, enabling efficient inference without simulation access at test time.
Details
Motivation: Address limitations of existing SBI approaches like RoPE that require batch test samples at inference time, limiting scalability and generalization in misspecified environments.Method: Combines mini-batch optimal transport with closed-form coupling to align real and simulated observations, then trains conditional normalizing flow to approximate OT-induced posterior for end-to-end inference.
Result: Matches or surpasses performance of RoPE and other SBI/non-SBI estimators across synthetic and real-world benchmarks including medical biomarker estimation.
Conclusion: Proposed framework offers improved scalability and applicability in challenging misspecified environments while maintaining competitive performance.
Abstract: Simulation-based inference (SBI) is a statistical inference approach for estimating latent parameters of a physical system when the likelihood is intractable but simulations are available. In practice, SBI is often hindered by model misspecification–the mismatch between simulated and real-world observations caused by inherent modeling simplifications. RoPE, a recent SBI approach, addresses this challenge through a two-stage domain transfer process that combines semi-supervised calibration with optimal transport (OT)-based distribution alignment. However, RoPE operates in a fully transductive setting, requiring access to a batch of test samples at inference time, which limits scalability and generalization. We propose here a fully inductive and amortized SBI framework that integrates calibration and distributional alignment into a single, end-to-end trainable model. Our method leverages mini-batch OT with a closed-form coupling to align real and simulated observations that correspond to the same latent parameters, using both paired calibration data and unpaired samples. A conditional normalizing flow is then trained to approximate the OT-induced posterior, enabling efficient inference without simulation access at test time. Across a range of synthetic and real-world benchmarks–including complex medical biomarker estimation–our approach matches or surpasses the performance of RoPE, as well as other standard SBI and non-SBI estimators, while offering improved scalability and applicability in challenging, misspecified environments.
[278] CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
Weida Wang, Dongchen Huang, Jiatong Li, Tengchao Yang, Ziyang Zheng, Di Zhang, Dong Han, Benteng Chen, Binzhao Luo, Zhiyu Liu, Kunling Liu, Zhiyuan Gao, Shiqi Geng, Wei Ma, Jiaming Su, Xin Li, Shuchen Pu, Yuhan Shui, Qianjia Cheng, Zhihao Dou, Dongfei Cui, Changyong He, Jin Zeng, Zeke Xie, Mao Su, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang, Yunqi Cai, Xi Dai, Shufei Zhang, Lei Bai, Jinguang Cheng, Zhong Fang, Hongming Weng
Main category: cs.LG
TL;DR: CMPhysBench is a new benchmark with 520+ graduate-level condensed matter physics calculation problems to evaluate LLMs’ capabilities in this domain, using a novel SEED scoring metric that provides fine-grained partial credit.
Details
Motivation: To assess Large Language Models' proficiency in condensed matter physics, a practical and frontier domain where current models show significant capability gaps compared to traditional physics.Method: Created a benchmark with 520+ graduate-level calculation problems covering key subfields like magnetism and superconductivity. Introduced Scalable Expression Edit Distance (SEED) score using tree-based representations for fine-grained evaluation of solution similarity.
Result: Even the best model (Grok-4) achieved an average SEED score of only 36 and 28% accuracy, demonstrating substantial capability gaps in condensed matter physics problem-solving.
Conclusion: LLMs currently have significant limitations in condensed matter physics, highlighting the need for specialized benchmarks and improved models for this advanced physics domain.
Abstract: We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in condensed matter physics. CMPhysBench is composed of more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of similarity between prediction and ground truth. Our results show that even the best model, Grok-4, reaches only an average SEED score of 36 and 28% accuracy on CMPhysBench, underscoring a significant capability gap in this practical and frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.
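As a deliberately simplified stand-in for SEED-style partial credit, the sketch below scores two expressions by token-level edit distance rather than SEED's tree edit distance over expression trees. It is shown only to convey "non-binary similarity between prediction and ground truth"; it is not the paper's metric:

```python
def edit_distance(a, b):
    """Classic single-row Levenshtein DP over token sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def partial_credit(pred_tokens, gold_tokens):
    d = edit_distance(pred_tokens, gold_tokens)
    return 1.0 - d / max(len(pred_tokens), len(gold_tokens), 1)

print(partial_credit(["E", "=", "h", "*", "f"], ["E", "=", "h", "*", "nu"]))  # 0.8
```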
[279] C-Flat++: Towards a More Efficient and Powerful Framework for Continual Learning
Wei Li, Hangjie Yuan, Zixiang Zhao, Yifan Zhu, Aojun Lu, Tao Feng, Yanan Sun
Main category: cs.LG
TL;DR: C-Flat is a plug-and-play continual learning method that promotes flatter loss landscapes to improve stability and performance across various CL settings, with C-Flat++ providing an efficient variant.
Details
Motivation: Existing sharpness-aware minimization methods in continual learning may favor sharper minima over flatter ones in certain settings, leading to less robust and potentially suboptimal solutions.Method: Proposes C-Flat, a method that promotes flatter loss landscapes tailored for CL, with plug-and-play compatibility. Also introduces C-Flat++ for efficient flatness-driven promotion with reduced update costs.
Result: C-Flat consistently improves performance across a wide range of settings, datasets, and scenarios. Extensive experiments demonstrate effectiveness and efficiency.
Conclusion: The proposed C-Flat and C-Flat++ approaches effectively address continual learning challenges by promoting flatter minima, offering both performance improvements and computational efficiency.
Abstract: Balancing sensitivity to new tasks and stability for retaining past knowledge is crucial in continual learning (CL). Recently, sharpness-aware minimization has proven effective in transfer learning and has also been adopted in CL to improve memory retention and learning efficiency. However, relying on zeroth-order sharpness alone may favor sharper minima over flatter ones in certain settings, leading to less robust and potentially suboptimal solutions. In this paper, we propose Continual Flatness (C-Flat), a method that promotes flatter loss landscapes tailored for CL. C-Flat offers plug-and-play compatibility, enabling easy integration with minimal modifications to the code pipeline. Besides, we present a general framework that integrates C-Flat into all major CL paradigms and conduct comprehensive comparisons with loss-minima optimizers and flat-minima-based CL methods. Our results show that C-Flat consistently improves performance across a wide range of settings. In addition, we introduce C-Flat++, an efficient yet effective framework that leverages selective flatness-driven promotion, significantly reducing the update cost required by C-Flat. Extensive experiments across multiple CL methods, datasets, and scenarios demonstrate the effectiveness and efficiency of our proposed approaches. Code is available at https://github.com/WanNaa/C-Flat.
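The flatness-seeking step this line of work builds on is easiest to see in generic sharpness-aware (SAM-style) form: perturb the weights toward higher loss, then descend using the gradient at the perturbed point. The sketch below is standard SAM, not C-Flat's tailored objective:

```python
import torch

def sam_step(model, loss_fn, batch, opt, rho=0.05):
    loss_fn(model, batch).backward()
    grads = [p.grad.clone() for p in model.parameters()]
    scale = rho / (torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12)
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(g * scale)                # ascend to the sharp neighbor
    opt.zero_grad()
    loss_fn(model, batch).backward()         # gradient at the perturbed weights
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(g * scale)                # undo the perturbation
    opt.step()                               # descend with the flatness-aware gradient
```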
[280] Robustness is Important: Limitations of LLMs for Data Fitting
Hejia Liu, Mochen Yang, Gediminas Adomavicius
Main category: cs.LG
TL;DR: LLMs show significant prediction sensitivity to task-irrelevant data variations like variable name changes, with error fluctuations up to 82%, revealing fundamental robustness issues despite competitive predictive performance.
Details
Motivation: To investigate the vulnerability of LLMs when used for data fitting tasks, particularly their sensitivity to irrelevant data representation changes that should not affect predictions.Method: Examined LLM prediction sensitivity through in-context learning and supervised fine-tuning, analyzed attention patterns in open-weight models, and compared with specialized tabular foundation model TabPFN.
Result: LLMs show dramatic prediction changes (up to 82% error variation) from irrelevant modifications like variable name changes, with non-uniform attention patterns explaining the sensitivity. TabPFN also shows some vulnerability.
Conclusion: Current LLMs lack basic robustness for principled data fitting despite impressive predictive capabilities, requiring fundamental improvements before reliable deployment.
Abstract: Large Language Models (LLMs) are being applied in a wide array of settings, well beyond the typical language-oriented use cases. In particular, LLMs are increasingly used as a plug-and-play method for fitting data and generating predictions. Prior work has shown that LLMs, via in-context learning or supervised fine-tuning, can perform competitively with many tabular supervised learning techniques in terms of predictive performance. However, we identify a critical vulnerability of using LLMs for data fitting – making changes to data representation that are completely irrelevant to the underlying learning task can drastically alter LLMs’ predictions on the same data. For example, simply changing variable names can sway the size of prediction error by as much as 82% in certain settings. Such prediction sensitivity with respect to task-irrelevant variations manifests under both in-context learning and supervised fine-tuning, for both closed-weight and open-weight general-purpose LLMs. Moreover, by examining the attention scores of an open-weight LLM, we discover a non-uniform attention pattern: training examples and variable names/values which happen to occupy certain positions in the prompt receive more attention when output tokens are generated, even though different positions are expected to receive roughly the same attention. This partially explains the sensitivity in the presence of task-irrelevant variations. We also consider a state-of-the-art tabular foundation model (TabPFN) trained specifically for data fitting. Despite being explicitly designed to achieve prediction robustness, TabPFN is still not immune to task-irrelevant variations. Overall, despite LLMs’ impressive predictive capabilities, currently they lack even the basic level of robustness to be used as a principled data-fitting tool.
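The renaming probe is simple to reproduce in outline: serialize the same rows under two sets of column names and compare the model's predictions. In the sketch below, query_llm is a hypothetical stand-in, and the column names and values are invented:

```python
rows = [{"x1": 3.2, "x2": 0.7, "y": 1}, {"x1": 1.1, "x2": 2.4, "y": 0}]

def serialize(rows, names):
    """Render rows as text, optionally renaming columns via `names`."""
    return "\n".join(
        ", ".join(f"{names.get(k, k)}={v}" for k, v in r.items()) for r in rows
    )

prompt_a = serialize(rows, {})                              # original names
prompt_b = serialize(rows, {"x1": "alpha", "x2": "beta"})   # renamed, same task
# preds_a, preds_b = query_llm(prompt_a), query_llm(prompt_b)  # hypothetical call
# Any gap between preds_a and preds_b is sensitivity to a task-irrelevant change.
```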
[281] Memorization in Graph Neural Networks
Adarsh Jamadandi, Jing Xu, Adam Dziedzic, Franziska Boenisch
Main category: cs.LG
TL;DR: NCMemo framework quantifies label memorization in GNNs, revealing inverse relationship with graph homophily - lower homophily increases memorization as GNNs rely on memorizing labels when graph structure is less informative.
Details
Motivation: While DNN memorization is well-studied, graph neural network (GNN) memorization remains under-explored, particularly in semi-supervised node classification settings where understanding memorization patterns is crucial for privacy and model performance.Method: Developed NCMemo framework to measure label memorization, analyzed relationship with graph homophily, examined GNN training dynamics and implicit bias, identified nodes prone to memorization based on label inconsistency, and investigated graph rewiring as mitigation strategy.
Result: Found strong inverse correlation between homophily and memorization; graph rewiring effectively reduces memorization without performance loss and lowers privacy risk for previously memorized data points.
Conclusion: The work provides fundamental insights into GNN learning dynamics, establishes connection between graph structure and memorization, and offers practical mitigation through graph rewiring for more privacy-preserving GNN deployment.
Abstract: Deep neural networks (DNNs) have been shown to memorize their training data, yet similar analyses for graph neural networks (GNNs) remain largely under-explored. We introduce NCMemo (Node Classification Memorization), the first framework to quantify label memorization in semi-supervised node classification. We first establish an inverse relationship between memorization and graph homophily, i.e., the property that connected nodes share similar labels/features. We find that lower homophily significantly increases memorization, indicating that GNNs rely on memorization to learn less homophilic graphs. Secondly, we analyze GNN training dynamics. We find that the increased memorization in low homophily graphs is tightly coupled to the GNNs’ implicit bias on using graph structure during learning. In low homophily regimes, this structure is less informative, hence inducing memorization of the node labels to minimize training loss. Finally, we show that nodes with higher label inconsistency in their feature-space neighborhood are significantly more prone to memorization. Building on our insights into the link between graph homophily and memorization, we investigate graph rewiring as a means to mitigate memorization. Our results demonstrate that this approach effectively reduces memorization without compromising model performance. Moreover, we show that it lowers the privacy risk for previously memorized data points in practice. Thus, our work not only advances understanding of GNN learning but also supports more privacy-preserving GNN deployment.
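Edge homophily, the quantity NCMemo relates to memorization, has a standard definition: the fraction of edges whose endpoints share a label. Computed here on a toy edge list:

```python
def edge_homophily(edges, labels):
    """Fraction of edges connecting nodes with the same label."""
    same = sum(labels[u] == labels[v] for u, v in edges)
    return same / len(edges)

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
labels = {0: "a", 1: "a", 2: "b", 3: "b"}
print(edge_homophily(edges, labels))  # 0.5 -> a relatively heterophilic toy graph
```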
[282] FORGE: Foundational Optimization Representations from Graph Embeddings
Zohair Shafi, Serdar Kadioglu
Main category: cs.LG
TL;DR: Forge is a pre-training method using vector-quantized graph autoencoders on diverse MIP instances without solution dependency, enabling both unsupervised clustering and supervised predictions for solver performance improvement.
Details
Motivation: Existing learning-based optimization approaches require solving many hard instances for training data and need dedicated models per problem distribution, limiting scalability and generalization.Method: Pre-train a vector-quantized graph autoencoder on a large collection of mixed-integer programming instances unsupervised, creating discrete code assignments as vocabulary to represent optimization instances.
Result: Forge embeddings effectively differentiate and cluster unseen instances unsupervised. When fine-tuned, a single model predicts warm-start variables and integrality gaps across multiple problem types, improving commercial solver performance.
Conclusion: Forge provides a scalable pre-training approach for MIP instances that enables both unsupervised analysis and supervised predictions to enhance optimization solver performance across diverse problem distributions.
Abstract: Combinatorial optimization problems are ubiquitous in science and engineering, yet learning-based approaches to accelerate their solution often require solving a large number of hard-to-solve optimization instances to collect training data, incurring significant computational overhead. Existing methods require training dedicated models for each problem distribution for each downstream task, severely limiting their scalability and generalization. In this work, we introduce Forge, a method of pre-training a vector-quantized graph autoencoder on a large and diverse collection of mixed-integer programming (MIP) instances in an unsupervised fashion without dependency on their solution. The vector quantization process creates discrete code assignments that act as a vocabulary to represent optimization instances. We evaluate our approach under both supervised and unsupervised settings. For the unsupervised setting, we demonstrate that Forge embeddings effectively differentiate and cluster unseen instances. For the supervised setting, we fine-tune Forge embeddings and show that a single model predicts both the variables for warm-starts and integrality gaps for cut-generation across multiple problem type distributions. Both predictions help improve performance of a state-of-the-art, commercial optimization solver. Finally, we release our code and pre-trained Forge weights to encourage further research and practical use of instance-level MIP embeddings at https://github.com/skadio/forge/.
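To make the "discrete code vocabulary" idea concrete, the sketch below shows the vector-quantization step in isolation: encoder embeddings are snapped to their nearest codebook entry, and an instance is then summarized as a histogram over codes. All shapes, the codebook size, and the histogram pooling are illustrative assumptions, not Forge's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 16))           # 64 discrete codes of dim 16
node_emb = rng.normal(size=(100, 16))          # embeddings from a graph encoder

# Nearest-code assignment by squared Euclidean distance.
d = ((node_emb[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
codes = d.argmin(axis=1)                       # one discrete "token" per node

# Represent the whole MIP instance as a normalized bag of codes.
instance_vec = np.bincount(codes, minlength=len(codebook)).astype(float)
instance_vec /= instance_vec.sum()
print(codes[:10], instance_vec.shape)          # first tokens, (64,)
```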
cs.MA
cs.MM
[283] lifeXplore at the Lifelog Search Challenge 2020
Andreas Leibetseder, Klaus Schoeffmann
Main category: cs.MM
TL;DR: The paper presents lifeXplore system improvements for the Lifelog Search Challenge, adding YOLO9000, OCR, and uniform sampling to enhance video exploration and retrieval capabilities.
Details
Motivation: To improve lifelogging moment retrieval performance in the Lifelog Search Challenge competition by enhancing the existing lifeXplore system with additional features.
Method: Enhanced the lifeXplore system by incorporating deep concept YOLO9000 for object detection, optical character recognition (OCR) for text extraction, and uniform sampling as an alternative to traditional shot segmentation for video processing.
Result: The improved system combines feature map browsing, concept search and filtering, and hand-drawn sketching capabilities for more effective lifelog retrieval.
Conclusion: The enhanced lifeXplore system with additional deep learning concepts and processing techniques provides improved tools for participating in the Lifelog Search Challenge competitions.
Abstract: Since its first iteration in 2018, the Lifelog Search Challenge (LSC) - an interactive competition for retrieving lifelogging moments - has been co-located at the annual ACM International Conference on Multimedia Retrieval (ICMR) and has drawn international attention. With the goal of making an ever-growing public lifelogging dataset searchable, several teams develop systems for quickly solving time-limited queries during the challenge. Having participated in both previous LSC iterations, i.e. LSC2018 and LSC2019, we present our lifeXplore system - a video exploration and retrieval tool combining feature map browsing, concept search and filtering as well as hand-drawn sketching. The system is improved by including additional YOLO9000 deep concepts and optical character recognition (OCR), as well as by adding uniform sampling as an alternative to the system’s traditional underlying shot segmentation.
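The uniform-sampling alternative to shot segmentation mentioned above amounts to keeping one frame per fixed time step. A minimal OpenCV sketch, with the one-second interval chosen purely for illustration:

```python
import cv2

def uniform_sample(video_path: str, every_sec: float = 1.0):
    """Return one frame per `every_sec` seconds instead of per detected shot."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0    # fall back if FPS is unreported
    step = max(1, int(round(fps * every_sec)))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```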
eess.AS
[284] Benchmarking Large Pretrained Multilingual Models on Québec French Speech Recognition
Coralie Serrand, Gilles Boulianne, Amira Morsli
Main category: eess.AS
TL;DR: Evaluation of multilingual speech recognition models on Quebec French shows benchmark performance differs significantly from standard test sets like FLEURS and CommonVoice.
Details
Motivation: To assess how well large pretrained multilingual speech recognition models perform on regional French varieties, specifically Quebec French, under realistic conditions.
Method: Built a benchmark and evaluation pipeline using the CommissionsQc datasets - spontaneous conversations from public inquiries in Quebec - measuring speed, word error rate, and semantic accuracy.
Result: Published results on standard benchmarks (FLEURS, CommonVoice) are not good predictors of performance on Quebec French, indicating significant performance differences on regional varieties.
Conclusion: The findings are valuable for practitioners developing speech applications for realistic conditions and regional language varieties, highlighting the need for specialized evaluation beyond standard benchmarks.
Abstract: We evaluate the performance of large pretrained multilingual speech recognition models on a regional variety of French spoken in Québec, Canada, in terms of speed, word error rate and semantic accuracy. To this end we build a benchmark and evaluation pipeline based on the CommissionsQc datasets, a corpus of spontaneous conversations recorded during public inquiries recently held in Québec. Published results for these models on well-known benchmarks such as FLEURS or CommonVoice are not good predictors of the performance we observe on CommissionsQc. Our results should be of interest for practitioners interested in building speech applications for realistic conditions or regional language varieties.
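Word error rate, one of the three reported measures, is the word-level Levenshtein distance (substitutions + insertions + deletions) normalized by the reference length. A plain-Python sketch:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate via dynamic-programming edit distance."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                             # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                             # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(1, len(r))

print(wer("le chat est noir", "le chat noir"))  # 0.25: one deletion, 4 ref words
```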
[285] Can Layer-wise SSL Features Improve Zero-Shot ASR Performance for Children’s Speech?
Abhijit Sinha, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Shrikanth Narayanan
Main category: eess.AS
TL;DR: SSL pre-trained models (Wav2Vec2, HuBERT, Data2Vec, WavLM) significantly improve children’s speech recognition in zero-shot scenarios, with Wav2Vec2 Layer 22 achieving 51.64% relative WER reduction.
Details
Motivation: Children's speech has distinct acoustic and linguistic characteristics that challenge standard ASR systems, and SSL models show promise but need investigation for optimal feature extraction in zero-shot scenarios.
Method: Analyzed layer-wise features from SSL models integrated into a DNN-based ASR system using Kaldi. Tested on PFSTAR children’s speech using WSJCAM0 adult speech for training in a zero-shot setup.
Result: Wav2Vec2 Layer 22 achieved the lowest WER of 5.15% (51.64% relative improvement over the baseline). Consistent performance improvements across age groups, with significant gains even for younger children. Similar results confirmed on the CMU Kids dataset.
Conclusion: SSL models provide effective features for children’s speech recognition in zero-shot scenarios, with specific layers (like Wav2Vec2 Layer 22) offering optimal performance and generalizable improvements across datasets.
Abstract: Automatic Speech Recognition (ASR) systems often struggle to accurately process children’s speech due to its distinct and highly variable acoustic and linguistic characteristics. While recent advancements in self-supervised learning (SSL) models have greatly enhanced the transcription of adult speech, accurately transcribing children’s speech remains a significant challenge. This study investigates the effectiveness of layer-wise features extracted from state-of-the-art SSL pre-trained models - specifically, Wav2Vec2, HuBERT, Data2Vec, and WavLM - in improving the performance of ASR for children’s speech in zero-shot scenarios. A detailed analysis of features extracted from these models was conducted, integrating them into a simplified DNN-based ASR system using the Kaldi toolkit. The analysis identified the most effective layers for enhancing ASR performance on children’s speech in a zero-shot scenario, where WSJCAM0 adult speech was used for training and PFSTAR children’s speech for testing. Experimental results indicated that Layer 22 of the Wav2Vec2 model achieved the lowest Word Error Rate (WER) of 5.15%, representing a 51.64% relative improvement over the direct zero-shot decoding using Wav2Vec2 (WER of 10.65%). Additionally, age group-wise analysis demonstrated consistent performance improvements with increasing age, along with significant gains observed even in younger age groups using the SSL features. Further experiments on the CMU Kids dataset confirmed similar trends, highlighting the generalizability of the proposed approach.
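For readers wanting to reproduce the layer-wise extraction, the sketch below pulls per-layer hidden states from a Wav2Vec2 model via HuggingFace transformers. The checkpoint name is an assumption for illustration (a "large" variant is needed for layer 22 to exist); the paper feeds such features into a Kaldi-based DNN rather than using them directly.

```python
import torch
from transformers import Wav2Vec2Model

# Checkpoint is an illustrative assumption, not necessarily the paper's model.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-960h").eval()

wav = torch.randn(1, 16000)                    # placeholder: 1 s of 16 kHz audio
with torch.no_grad():
    out = model(wav, output_hidden_states=True)

# hidden_states is a tuple of (num_transformer_layers + 1) tensors, each of
# shape (batch, frames, dim); entry 0 is the pre-transformer embedding.
layer22 = out.hidden_states[22]
print(layer22.shape)                           # e.g. torch.Size([1, 49, 1024])
```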
[286] Zero-Shot KWS for Children’s Speech using Layer-Wise Features from SSL Models
Subham Kutum, Abhijit Sinha, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Mahesh Chandra Govil
Main category: eess.AS
TL;DR: Zero-shot keyword spotting for children’s speech using SSL models (Wav2Vec2, HuBERT, Data2Vec) trained on adult data but tested on children’s speech, achieving state-of-the-art results with Wav2Vec2 layer 22 performing best.
Details
Motivation: Children's speech has distinct acoustic and linguistic characteristics that pose challenges for traditional KWS systems, requiring specialized approaches that can handle these differences without child-specific training data.
Method: Extract layer-wise features from SSL models (Wav2Vec2, HuBERT, Data2Vec) and train a Kaldi-based DNN KWS system on the WSJCAM0 adult dataset, then test zero-shot on the PFSTAR children’s dataset.
Result: Achieved state-of-the-art results: Wav2Vec2 layer 22 performed best with ATWV 0.691, MTWV 0.7003, PFA 0.0164, PMiss 0.0547 for 30 keywords. Significant improvement over MFCC baseline, effective across age groups and robust in noisy conditions.
Conclusion: SSL features significantly enhance zero-shot KWS performance for children’s speech, effectively addressing the challenges of child speaker characteristics and demonstrating robustness across different conditions and datasets.
Abstract: Numerous methods have been proposed to enhance Keyword Spotting (KWS) in adult speech, but children’s speech presents unique challenges for KWS systems due to its distinct acoustic and linguistic characteristics. This paper introduces a zero-shot KWS approach that leverages state-of-the-art self-supervised learning (SSL) models, including Wav2Vec2, HuBERT and Data2Vec. Features are extracted layer-wise from these SSL models and used to train a Kaldi-based DNN KWS system. The WSJCAM0 adult speech dataset was used for training, while the PFSTAR children’s speech dataset was used for testing, demonstrating the zero-shot capability of our method. Our approach achieved state-of-the-art results across all keyword sets for children’s speech. Notably, the Wav2Vec2 model, particularly layer 22, performed the best, delivering an ATWV score of 0.691, an MTWV score of 0.7003, and a probability of false alarm and probability of miss of 0.0164 and 0.0547, respectively, for a set of 30 keywords. Furthermore, age-specific performance evaluation confirmed the system’s effectiveness across different age groups of children. To assess the system’s robustness against noise, additional experiments were conducted using the best-performing layer of the Wav2Vec2 model. The results demonstrated a significant improvement over a traditional MFCC-based baseline, emphasizing the potential of SSL embeddings even in noisy conditions. To further generalize the KWS framework, the experiments were repeated for an additional CMU dataset. Overall, the results highlight the significant contribution of SSL features in enhancing zero-shot KWS performance for children’s speech, effectively addressing the challenges associated with the distinct characteristics of child speakers.
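The ATWV/MTWV figures follow NIST-style keyword-spotting scoring, where each keyword's term-weighted value trades misses against false alarms. A minimal sketch of the metric; beta = 999.9 is the common NIST STD convention and an assumption about this paper's exact scoring, as is the per-trial normalization of PFA.

```python
def atwv(p_miss: dict, p_fa: dict, beta: float = 999.9) -> float:
    """Average of TWV(kw) = 1 - PMiss(kw) - beta * PFA(kw) over keywords."""
    twv = [1.0 - p_miss[kw] - beta * p_fa[kw] for kw in p_miss]
    return sum(twv) / len(twv)

# Toy usage with hypothetical per-keyword scores.
print(atwv({"cat": 0.05, "dog": 0.10}, {"cat": 1e-4, "dog": 2e-4}))  # 0.775
```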
[287] Cochleagram-based Noise Adapted Speaker Identification System for Distorted Speech
Sabbir Ahmed, Nursadul Mamun, Md Azad Hossain
Main category: eess.AS
TL;DR: Proposes a robust noise-adapted speaker identification system using cochleagram features and CNN, showing improved accuracy in noisy, mismatched, reverberated and distorted environments.
Details
Motivation: Environmental noise, reverberation and distortion degrade speaker features, making automatic speaker identification challenging in real-world conditions.
Method: Uses a 128-channel gammatone filterbank to generate cochleagrams, and trains a convolutional neural network on clean and noisy (-5 dB SNR) cochleagrams to create noise-adapted speaker models.
Result: The system showed measurable accuracy improvement over existing neurogram-based SID systems when tested with various noise types, reverberated and distorted data.
Conclusion: Cochleagram features combined with CNN-based noise adaptation provide robust speaker identification performance in challenging acoustic environments.
Abstract: Speaker Identification refers to the process of identifying a person using one’s voice from a collection of known speakers. Environmental noise, reverberation and distortion make the task of automatic speaker identification challenging, as extracted features get degraded, affecting the performance of the speaker identification (SID) system. This paper proposes a robust noise-adapted SID system for noisy, mismatched, reverberated and distorted environments. This method utilizes an auditory feature called the cochleagram to extract speaker characteristics and thus identify the speaker. A $128$-channel gammatone filterbank with a frequency range from $50$ to $8000$ Hz was used to generate 2-D cochleagrams. Wideband as well as narrowband noises were used along with clean speech to obtain noisy cochleagrams at various levels of signal-to-noise ratio (SNR). Both clean and noisy cochleagrams of only $-5$ dB SNR were then fed into a convolutional neural network (CNN) to build a speaker model for SID, referred to as the noise-adapted speaker model (NASM). The NASM was trained on a given noise type and then evaluated on clean speech and various other noise types. Moreover, the robustness of the proposed system was tested using reverberated as well as distorted test data. The proposed system showed a measurable accuracy improvement over an existing neurogram-based SID system.
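A cochleagram is built by passing the waveform through a gammatone filterbank and taking framewise energies per channel. Below is a minimal NumPy sketch using the textbook gammatone impulse response and Glasberg-Moore ERB constants, with 128 channels over 50-8000 Hz as in the paper; the frame length and filter order are illustrative choices, not the paper's settings.

```python
import numpy as np

def erb(fc):
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)   # equivalent rectangular bandwidth

def gammatone_ir(fc, fs, dur=0.05, order=4):
    t = np.arange(int(dur * fs)) / fs
    return (t ** (order - 1) * np.exp(-2 * np.pi * 1.019 * erb(fc) * t)
            * np.cos(2 * np.pi * fc * t))

def cochleagram(x, fs, n_ch=128, fmin=50.0, fmax=8000.0, frame=0.02):
    # Center frequencies equally spaced on the ERB-rate scale.
    e = np.linspace(21.4 * np.log10(4.37 * fmin / 1000 + 1),
                    21.4 * np.log10(4.37 * fmax / 1000 + 1), n_ch)
    fcs = (10 ** (e / 21.4) - 1) / 4.37 * 1000
    hop = int(frame * fs)
    n_frames = len(x) // hop
    C = np.zeros((n_ch, n_frames))
    for i, fc in enumerate(fcs):
        y = np.convolve(x, gammatone_ir(fc, fs), mode="same")
        C[i] = [np.sqrt(np.mean(y[k*hop:(k+1)*hop] ** 2)) for k in range(n_frames)]
    return C

fs = 16000
x = np.random.randn(fs)                        # placeholder for 1 s of speech
print(cochleagram(x, fs).shape)                # (128, 50)
```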
[288] Fundamentals of Data-Driven Approaches to Acoustic Signal Detection, Filtering, and Transformation
Chao Pan
Main category: eess.AS
TL;DR: This paper provides a systematic overview of data-driven acoustic signal processing methods, categorizing techniques into transformation, detection, and filtering, and highlighting the shift from knowledge-driven to data-driven approaches enabled by deep learning.
Details
Motivation: The rapid evolution of signal processing with diverse applications has created a need to systematically summarize data-driven acoustic signal processing principles and methods, especially given the recent shift from knowledge-driven to deep learning approaches.
Method: The paper categorizes acoustic signal processing techniques into three types: transformation (domain conversion), detection (information identification), and filtering (source extraction). It systematically summarizes principles and methods including sound source localization, sound event detection, voiceprint recognition, noise reduction, and source separation.
Result: The paper provides a comprehensive framework for understanding data-driven acoustic signal processing, covering various techniques and their applications in speech communication, voice interaction, smart healthcare, and industrial diagnostics.
Conclusion: Deep learning technologies have fundamentally shifted acoustic signal processing methodologies from knowledge-driven to data-driven approaches, enabling significant research outcomes and requiring systematic summarization of principles and methods for both academic and practical applications.
Abstract: In recent decades, the field of signal processing has rapidly evolved due to diverse application demands, leading to a rich array of scientific questions and research areas. The forms of signals, their formation mechanisms, and the information extraction methods vary by application, resulting in diverse signal processing techniques. Common techniques can be categorized into three types: transformation, detection, and filtering. Signal transformation converts signals from their original domain to a more suitable target domain for analysis; signal detection aims to identify the existence of relevant information within a signal and its specific time and location; and signal filtering focuses on extracting or separating source signals of interest from observed signals. In acoustic signal processing, techniques include sound source localization, sound event detection, voiceprint extraction and recognition, noise reduction, and source separation, with applications in speech communication, voice interaction, smart healthcare, and industrial diagnostics. Recently, the advancement of deep learning technologies has shifted methodologies in acoustic signal processing from knowledge-driven to data-driven approaches, leading to significant research outcomes. This paper aims to systematically summarize the principles and methods of data-driven acoustic signal processing, providing a comprehensive understanding framework for academic exploration and practical applications.
[289] Towards Improved Speech Recognition through Optimized Synthetic Data Generation
Yanis Perrin, Gilles Boulianne
Main category: eess.AS
TL;DR: Using text-to-speech with voice cloning to generate synthetic audio for ASR training when real transcribed audio is unavailable due to confidentiality, achieving performance comparable to real data training.
Details
Motivation: Supervised speech recognition requires transcribed audio, which is often unavailable due to confidentiality issues, necessitating alternative approaches using synthetic data.
Method: Generate synthetic audio from a text corpus using state-of-the-art text-to-speech with voice cloning, optimize through finetuning, filtering, and evaluation, then train an encoder-decoder ASR model on the synthetic data.
Result: Experiments on Québec French conversational speech datasets show that improving synthetic data generation leads to significant ASR performance improvements.
Conclusion: Synthetic audio generated via advanced text-to-speech with voice cloning can effectively replace real transcribed data for ASR training when confidentiality constraints prevent access to real audio.
Abstract: Supervised training of speech recognition models requires access to transcribed audio data, which often is not possible due to confidentiality issues. Our approach to this problem is to generate synthetic audio from a text-only corpus using a state-of-the-art text-to-speech model with voice cloning capabilities. Our goal is to achieve automatic speech recognition (ASR) performance comparable to models trained on real data. We explore ways to optimize synthetic data generation through finetuning, filtering and evaluation, and its use for training an end-to-end encoder-decoder ASR model. Experiments were conducted using two datasets of spontaneous, conversational speech in Québec French. We show that improving data generation leads to large improvements in the final ASR system trained on synthetic data.
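One plausible form of the filtering step mentioned above is a round-trip check: keep a synthetic utterance only if a seed ASR model can transcribe it back close to its source text. In the sketch below, `synthesize`, `transcribe`, and `wer` are hypothetical placeholders, not the paper's actual components, and the 0.2 threshold is an arbitrary assumption.

```python
def filter_synthetic(texts, synthesize, transcribe, wer, max_wer=0.2):
    """Keep only synthetic utterances that survive an ASR round-trip check."""
    kept = []
    for text in texts:
        audio = synthesize(text)               # TTS with voice cloning
        hyp = transcribe(audio)                # seed ASR transcription
        if wer(text, hyp) <= max_wer:          # discard badly rendered samples
            kept.append((text, audio))
    return kept
```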
[290] Efficient Sparse Coding with the Adaptive Locally Competitive Algorithm for Speech Classification
Soufiyan Bahadi, Eric Plourde, Jean Rouat
Main category: eess.AS
TL;DR: ALCA improves LCA for neuromorphic speech classification by dynamically adjusting filter parameters, achieving better accuracy than cochlea models while reducing power consumption to 4-13 mW on neuromorphic hardware.
Details
Motivation: Bridge the efficiency gap between the human brain and conventional computers in complex tasks, particularly neuromorphic audio processing, where LCA shows promise but lacks thorough study in speech classification applications.
Method: The Adaptive Locally Competitive Algorithm (ALCA) builds upon LCA by dynamically adjusting the modulation parameters of the filter bank to fine-tune the filters’ sensitivity, enhancing lateral inhibition for improved reconstruction quality, sparsity, and convergence time.
Result: LCA achieves better speech classification accuracy than the LAUSCHER cochlea model, but with higher power consumption. ALCA maintains accuracy while reducing dynamic power consumption to 4-13 mW on neuromorphic hardware (three orders of magnitude less than GPU setups).
Conclusion: ALCA is a compelling solution for efficient speech classification systems, offering substantial advancements in balancing classification accuracy and power efficiency for neuromorphic applications.
Abstract: Researchers are exploring novel computational paradigms such as sparse coding and neuromorphic computing to bridge the efficiency gap between the human brain and conventional computers in complex tasks. A key area of focus is neuromorphic audio processing. While the Locally Competitive Algorithm has emerged as a promising solution for sparse coding, offering potential for real-time and low-power processing on neuromorphic hardware, its applications in neuromorphic speech classification have not been thoroughly studied. The Adaptive Locally Competitive Algorithm builds upon the Locally Competitive Algorithm by dynamically adjusting the modulation parameters of the filter bank to fine-tune the filters’ sensitivity. This adaptability enhances lateral inhibition, improving reconstruction quality, sparsity, and convergence time, which is crucial for real-time applications. This paper demonstrates the potential of the Locally Competitive Algorithm and its adaptive variant as robust feature extractors for neuromorphic speech classification. Results show that the Locally Competitive Algorithm achieves better speech classification accuracy at the expense of higher power consumption compared to the LAUSCHER cochlea model used for benchmarking. On the other hand, the Adaptive Locally Competitive Algorithm mitigates this power consumption issue without compromising the accuracy. The dynamic power consumption is reduced to a range of 4 to 13 milliwatts on neuromorphic hardware, three orders of magnitude less than setups using Graphics Processing Units. These findings position the Adaptive Locally Competitive Algorithm as a compelling solution for efficient speech classification systems, promising substantial advancements in balancing speech classification accuracy and power efficiency.
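For context, the core LCA dynamics that ALCA builds on can be written in a few lines: neurons integrate a feedforward drive, inhibit each other through dictionary correlations, and emit sparse codes via soft thresholding. The NumPy sketch below shows plain LCA only; ALCA's adaptive tuning of the filter bank's modulation parameters is not reproduced here.

```python
import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lca(s, Phi, lam=0.1, tau=10.0, steps=200):
    b = Phi.T @ s                               # feedforward drive
    G = Phi.T @ Phi - np.eye(Phi.shape[1])      # lateral inhibition (no self term)
    u = np.zeros(Phi.shape[1])                  # membrane potentials
    for _ in range(steps):
        a = soft_threshold(u, lam)              # sparse activations
        u += (b - u - G @ a) / tau              # leaky integrator dynamics
    return soft_threshold(u, lam)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(64, 256))
Phi /= np.linalg.norm(Phi, axis=0)              # unit-norm dictionary atoms
s = Phi[:, 3] + 0.05 * rng.normal(size=64)      # signal dominated by atom 3
a = lca(s, Phi)
print(int((a != 0).sum()), "active atoms")      # sparse code
```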
[291] Spatio-spectral diarization of meetings by combining TDOA-based segmentation and speaker embedding-based clustering
Tobias Cord-Landwehr, Tobias Gburrek, Marc Deegen, Reinhold Haeb-Umbach
Main category: eess.AS
TL;DR: A spatio-spectral diarization pipeline combining TDOA-based segmentation with embedding-based clustering that works with both compact arrays and distributed microphones without requiring multi-channel training data.
Details
Motivation: To create a speaker diarization system that handles overlapping speech better than single-channel approaches and works flexibly with different microphone configurations without prior knowledge of microphone placement or count.
Method: Combined model-based and data-driven approach using TDOA (Time Difference of Arrival) for segmentation followed by embedding-based clustering. The system requires no multi-channel training data and adapts to both compact arrays and distributed microphones.
Result: Significantly outperforms single-channel pyannote approach in both compact array and distributed microphone scenarios. Handles overlapping speech better and can correctly track speakers during position changes, unlike fully spatial diarization pipelines.
Conclusion: The proposed spatio-spectral pipeline provides superior diarization performance across different microphone configurations without requiring specialized training data or prior microphone knowledge, while maintaining speaker tracking accuracy during movement.
Abstract: We propose a spatio-spectral, combined model-based and data-driven diarization pipeline consisting of TDOA-based segmentation followed by embedding-based clustering. The proposed system requires neither access to multi-channel training data nor prior knowledge about the number or placement of microphones. It works for both a compact microphone array and distributed microphones, with minor adjustments. Due to its superior handling of overlapping speech during segmentation, the proposed pipeline significantly outperforms the single-channel pyannote approach, both in a scenario with a compact microphone array and in a setup with distributed microphones. Additionally, we show that, unlike fully spatial diarization pipelines, the proposed system can correctly track speakers when they change positions.
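The TDOAs driving the segmentation stage are conventionally estimated with GCC-PHAT, which whitens the cross-spectrum so that only phase (i.e., delay) information remains. A NumPy sketch for one microphone pair; this is the standard estimator, not necessarily the paper's exact front end.

```python
import numpy as np

def gcc_phat(x, y, fs):
    """Estimate the delay between two microphone signals (seconds)."""
    n = len(x) + len(y)
    R = np.fft.rfft(x, n) * np.conj(np.fft.rfft(y, n))
    R /= np.abs(R) + 1e-12                      # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(R, n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

fs = 16000
rng = np.random.default_rng(0)
x = rng.normal(size=fs)                         # broadband source at mic 1
y = np.roll(x, 8)                               # same source, 8 samples later
print(gcc_phat(x, y, fs) * fs)                  # ~ -8.0 (sign is a convention)
```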
eess.IV
[292] Deep Active Learning for Lung Disease Severity Classification from Chest X-rays: Learning with Less Data in the Presence of Class Imbalance
Roy M. Gabriel, Mohammadreza Zandehshahvar, Marly van Assen, Nattakorn Kittisut, Kyle Peters, Carlo N. De Cecco, Ali Adibi
Main category: eess.IV
TL;DR: Deep active learning with Bayesian Neural Network approximation and weighted loss reduces labeled data needs for COVID-19 severity classification from chest X-rays while handling class imbalance.
Details
Motivation: To reduce the amount of required labeled data for lung disease severity classification from chest X-rays under class imbalance conditions, particularly for COVID-19 diagnosis.
Method: Used deep active learning with Monte Carlo Dropout (a Bayesian Neural Network approximation) and a weighted loss function. Applied various acquisition functions to iteratively select the most informative samples from unlabeled data. Trained on 2,319 CXRs from COVID-19 patients labeled by radiologists.
Result: Entropy Sampling achieved 93.7% accuracy (AU ROC 0.91) in binary classification using only 15.4% of training data. Mean STD sampling achieved 70.3% accuracy (AU ROC 0.86) in multi-class classification using 23.1% of labeled data. Outperformed more complex acquisition functions.
Conclusion: Deep active learning with BNN approximation and weighted loss effectively reduces labeling requirements while maintaining or exceeding diagnostic performance, addressing class imbalance in medical image analysis.
Abstract: To reduce the amount of required labeled data for lung disease severity classification from chest X-rays (CXRs) under class imbalance, this study applied deep active learning with a Bayesian Neural Network (BNN) approximation and weighted loss function. This retrospective study collected 2,319 CXRs from 963 patients (mean age, 59.2 $\pm$ 16.6 years; 481 female) at Emory Healthcare affiliated hospitals between January and November 2020. All patients had clinically confirmed COVID-19. Each CXR was independently labeled by 3 to 6 board-certified radiologists as normal, moderate, or severe. A deep neural network with Monte Carlo Dropout was trained using active learning to classify disease severity. Various acquisition functions were used to iteratively select the most informative samples from an unlabeled pool. Performance was evaluated using accuracy, area under the receiver operating characteristic curve (AU ROC), and area under the precision-recall curve (AU PRC). Training time and acquisition time were recorded. Statistical analysis included descriptive metrics and performance comparisons across acquisition strategies. Entropy Sampling achieved 93.7% accuracy (AU ROC, 0.91) in binary classification (normal vs. diseased) using 15.4% of the training data. In the multi-class setting, Mean STD sampling achieved 70.3% accuracy (AU ROC, 0.86) using 23.1% of the labeled data. These methods outperformed more complex and computationally expensive acquisition functions and significantly reduced labeling needs. Deep active learning with BNN approximation and weighted loss effectively reduces labeled data requirements while addressing class imbalance, maintaining or exceeding diagnostic performance.
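The acquisition loop pairs Monte Carlo Dropout (dropout kept stochastic at inference) with an uncertainty score such as the entropy sampling reported above. A hedged PyTorch sketch, assuming a classifier with Dropout layers and a pool loader yielding bare input batches:

```python
import torch

def enable_mc_dropout(model: torch.nn.Module) -> None:
    # Keep dropout stochastic while the rest of the model stays in eval mode.
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()

@torch.no_grad()
def entropy_acquire(model, pool_loader, k: int, T: int = 20):
    """Return indices of the k pool samples with highest predictive entropy."""
    model.eval()
    enable_mc_dropout(model)
    probs = []
    for (xb,) in pool_loader:                  # assumes batches of bare inputs
        p = torch.stack([model(xb).softmax(-1) for _ in range(T)]).mean(0)
        probs.append(p)
    p = torch.cat(probs)                       # (N, classes), MC-averaged
    entropy = -(p * p.clamp_min(1e-12).log()).sum(-1)
    return entropy.topk(k).indices
```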
[293] Endmember Extraction from Hyperspectral Images Using Self-Dictionary Approach with Linear Programming
Tomohiko Mizutani
Main category: eess.IV
TL;DR: Enhanced Hottopixx implementation for faster endmember extraction from hyperspectral images with improved computational efficiency and accuracy.
Details
Motivation: Hottopixx methods for endmember extraction in hyperspectral imaging suffer from high computational costs due to quadratic growth of LP problem size with pixel count, limiting practical application despite theoretical effectiveness.
Method: Proposed an enhanced implementation of Hottopixx that reduces computational time while maintaining or improving endmember extraction performance through optimized linear programming approaches.
Result: Experiments demonstrate that the enhanced implementation enables practical application of Hottopixx for real hyperspectral images and achieves reasonably high accuracy in estimating endmember signatures.
Conclusion: The improved Hottopixx implementation successfully addresses computational limitations, making endmember extraction from real hyperspectral images feasible with good accuracy.
Abstract: Hyperspectral imaging technology has a wide range of applications, including forest management, mineral resource exploration, and Earth surface monitoring. A key step in utilizing this technology is endmember extraction, which aims to identify the spectral signatures of materials in observed scenes. Theoretical studies suggest that self-dictionary methods using linear programming (LP), known as Hottopixx methods, are effective in extracting endmembers. However, their practical application is hindered by high computational costs, as they require solving LP problems whose size grows quadratically with the number of pixels in the image. As a result, their actual effectiveness remains unclear. To address this issue, we propose an enhanced implementation of Hottopixx designed to reduce computational time and improve endmember extraction performance. We demonstrate its effectiveness through experiments. The results suggest that our implementation enables the application of Hottopixx for endmember extraction from real hyperspectral images and allows us to achieve reasonably high accuracy in estimating endmember signatures.
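To see why the LP grows quadratically with the pixel count, note that the self-dictionary variable X is n x n for n pixels. Below is one common form of the Hottopixx model (after Bittorf et al.) in cvxpy, on toy data with three pure pixels; the objective sign, noise norm, and tolerances vary across variants, so treat this as an illustrative sketch rather than the paper's enhanced implementation.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
W = np.abs(rng.normal(size=(20, 3)))          # 3 endmember spectra, 20 bands
H = rng.dirichlet(np.ones(3), size=30).T      # abundances for 30 pixels
H[:, :3] = np.eye(3)                          # pixels 0-2 are pure
M = W @ H                                     # toy hyperspectral data

n, r = M.shape[1], 3
p = rng.uniform(1, 2, size=n)                 # random costs break ties

X = cp.Variable((n, n), nonneg=True)
constraints = [cp.trace(X) == r,
               cp.diag(X) <= 1,
               cp.sum(cp.abs(M - M @ X)) <= 1e-3]
constraints += [X[i, j] <= X[i, i] for i in range(n) for j in range(n) if i != j]
cp.Problem(cp.Minimize(p @ cp.diag(X)), constraints).solve()

# Endmember candidates: pixels carrying the largest diagonal responsibility.
print(np.sort(np.argsort(-np.diag(X.value))[:r]))   # expected: [0 1 2]
```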
[294] Gaussian is All You Need: A Unified Framework for Solving Inverse Problems via Diffusion Posterior Sampling
Nebiyou Yismaw, Ulugbek S. Kamilov, M. Salman Asif
Main category: eess.IV
TL;DR: Proposes a unified likelihood approximation method with covariance correction for diffusion-based inverse problems, improving performance and computational efficiency.
Details
Motivation: Existing diffusion-based methods for inverse problems have insufficient or inefficient likelihood approximations during reverse sampling.
Method: Introduces a covariance correction term to enhance the likelihood approximation, avoids gradient propagation through the diffusion model, and provides efficient factorization/inversion of covariance matrices for various inverse problems.
Result: Achieves better convergence to true data posterior and improved performance on real-world natural image datasets compared to existing approaches.
Conclusion: The proposed covariance correction method provides a more effective and efficient solution for diffusion-based inverse problem solving.
Abstract: Diffusion models can generate a variety of high-quality images by modeling complex data distributions. Trained diffusion models can also be very effective image priors for solving inverse problems. Most of the existing diffusion-based methods integrate data consistency steps by approximating the likelihood function within the diffusion reverse sampling process. In this paper, we show that the existing approximations are either insufficient or computationally inefficient. To address these issues, we propose a unified likelihood approximation method that incorporates a covariance correction term to enhance the performance and avoids propagating gradients through the diffusion model. The correction term, when integrated into the reverse diffusion sampling process, achieves better convergence towards the true data posterior for selected distributions and improves performance on real-world natural image datasets. Furthermore, we present an efficient way to factorize and invert the covariance matrix of the likelihood function for several inverse problems. Our comprehensive experiments demonstrate the effectiveness of our method over several existing approaches. Code available at https://github.com/CSIPlab/CoDPS.
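For orientation, a generic diffusion-posterior-sampling guidance step for y = A(x) + noise looks like the sketch below: it differentiates the data-consistency term through the denoiser, which is exactly the cost the proposed covariance-corrected approximation avoids. All names are placeholders; this shows the baseline scheme, not the paper's method.

```python
import torch

def dps_step(x_t, denoiser, A, y, sigma_y=0.05, step=1.0):
    """One guidance step: pull x_t toward consistency with measurements y."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t)                      # posterior-mean estimate of x_0
    resid = y - A(x0_hat)
    loglik = -(resid ** 2).sum() / (2 * sigma_y ** 2)
    grad = torch.autograd.grad(loglik, x_t)[0]  # gradient through the denoiser
    return (x_t + step * grad).detach()         # data-consistency update
```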
[295] DIFFRACT: Diffusion-based Restoration via Adaptive Control and Thresholding for Diffraction Imaging
Nikolay Falaleev, Nikolai Orlov
Main category: eess.IV
TL;DR: Novel diffusion model approach for denoising EBSD patterns using two-stage training with a UNet architecture and adaptive feedback control.
Details
Motivation: To improve Electron Backscatter Diffraction (EBSD) pattern denoising by leveraging diffusion models, with better control over the denoising process and reduced hallucination risks.
Method: Two-stage training process with a UNet-based architecture, incorporating an auxiliary regression head for quality prediction and adaptive, feedback-driven control of the iterative denoising process.
Result: Successful application of diffusion models to EBSD pattern denoising, demonstrated on a custom-collected dataset containing EBSD patterns, Master Patterns, and quality values.
Conclusion: The proposed DIFFRACT model effectively denoises EBSD patterns with adaptive control, quality assessment, and reduced hallucination risks.
Abstract: This paper presents a novel approach for denoising Electron Backscatter Diffraction (EBSD) patterns using diffusion models. We propose a two-stage training process with a UNet-based architecture, incorporating an auxiliary regression head to predict the quality of the experimental pattern and assess the progress of the denoising process. The model uses an adaptive denoising strategy, which integrates quality prediction and feedback-driven iterative denoising process control. This adaptive feedback loop allows the model to adjust its schedule, providing fine control over the denoising process. Furthermore, our model can identify samples where no meaningful signal is present, thereby reducing the risk of hallucinations. We demonstrate DIFFRACT, a successful application of diffusion models to EBSD pattern denoising, using a custom-collected dataset of EBSD patterns, their corresponding Master Patterns, and quality values.
[296] Explicit Residual-Based Scalable Image Coding for Humans and Machines
Yui Tatsumi, Ziyue Zeng, Hiroshi Watanabe
Main category: eess.IV
TL;DR: Proposes two scalable image compression methods (FR-ICMH and PR-ICMH) that integrate explicit residual compression to serve both machine and human vision needs, achieving significant BD-rate savings.
Details
Motivation: Address the gap in current scalable image compression methods that serve both machine and human vision (ICMH), where models often rely too heavily on learning capacity without sufficient architectural consideration.
Method: Integrates an explicit residual compression mechanism from traditional methods like JPEG2000 into neural network-based codecs. Proposes two complementary methods: Feature Residual-based Scalable Coding (FR-ICMH) and Pixel Residual-based Scalable Coding (PR-ICMH).
Result: Experimental results show effectiveness with PR-ICMH achieving up to 29.57% BD-rate savings over previous work. Methods provide flexibility between encoder complexity and compression performance.
Conclusion: The proposed residual compression integration enhances coding efficiency and interpretability of ICMH frameworks, making them adaptable to diverse application requirements and various machine vision tasks.
Abstract: Scalable image compression is a technique that progressively reconstructs multiple versions of an image for different requirements. In recent years, images have increasingly been consumed not only by humans but also by image recognition models. This shift has drawn growing attention to scalable image compression methods that serve both machine and human vision (ICMH). Many existing models employ neural network-based codecs, known as learned image compression, and have made significant strides in this field by carefully designing the loss functions. In some cases, however, models are overly reliant on their learning capacity, and their architectural design is not sufficiently considered. In this paper, we enhance the coding efficiency and interpretability of ICMH framework by integrating an explicit residual compression mechanism, which is commonly employed in resolution scalable coding methods such as JPEG2000. Specifically, we propose two complementary methods: Feature Residual-based Scalable Coding (FR-ICMH) and Pixel Residual-based Scalable Coding (PR-ICMH). These proposed methods are applicable to various machine vision tasks. Moreover, they provide flexibility to choose between encoder complexity and compression performance, making it adaptable to diverse application requirements. Experimental results demonstrate the effectiveness of our proposed methods, with PR-ICMH achieving up to 29.57% BD-rate savings over the previous work.
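The explicit pixel-residual mechanism can be stated in a few lines: a base layer serves the machine task, and an enhancement layer codes the residual between the input and the decoded base, echoing resolution-scalable JPEG2000. In this sketch the codec functions are hypothetical placeholders standing in for learned sub-codecs, not the paper's actual modules.

```python
def encode_pr(x, base_enc, base_dec, enh_enc):
    """Two-layer encoding: machine-vision base plus explicit pixel residual."""
    base_bits = base_enc(x)                 # base layer for the machine task
    x_base = base_dec(base_bits)            # reconstruct what machines consume
    residual = x - x_base                   # explicit pixel residual
    enh_bits = enh_enc(residual)            # enhancement layer for human viewing
    return base_bits, enh_bits

def decode_pr(base_bits, enh_bits, base_dec, enh_dec):
    """Progressive decoding: base only for machines, base + residual for humans."""
    x_base = base_dec(base_bits)
    return x_base + enh_dec(enh_bits)
```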