Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 63]
- cs.CV [Total: 118]
- cs.AI [Total: 36]
- cs.SD [Total: 12]
- cs.LG [Total: 97]
- cs.MA [Total: 4]
- cs.MM [Total: 0]
- eess.AS [Total: 8]
- eess.IV [Total: 15]
cs.CL
[1] Noise or Nuance: An Investigation Into Useful Information and Filtering For LLM Driven AKBC
Alex Clay, Ernesto Jiménez-Ruiz, Pranava Madhyastha
Main category: cs.CL
TL;DR: Analysis of triple completion task in constrained LLM settings, focusing on generation quality, filtering, and response parsing tradeoffs.
Details
Motivation: To address the limitations of RAG and fine-tuning in constrained environments like the 2025 LM-KBC challenge, where these common techniques are restricted.
Method: Investigated three aspects: generation approaches, quality assurance through LLM filtering, and response parsing techniques in constrained settings.
Result: Found that additional information improves generation quality, LLMs can effectively filter poor quality triples, and response parsing involves a tradeoff between flexibility and consistency that depends on the specific setting.
Conclusion: In constrained LLM environments, strategic use of additional information, LLM-based filtering, and context-aware response parsing can effectively compensate for the limitations of restricted RAG and fine-tuning approaches.
Abstract: RAG and fine-tuning are prevalent strategies for improving the quality of LLM outputs. However, in constrained situations, such as that of the 2025 LM-KBC challenge, such techniques are restricted. In this work we investigate three facets of the triple completion task: generation, quality assurance, and LLM response parsing. Our work finds that in this constrained setting: additional information improves generation quality, LLMs can be effective at filtering poor quality triples, and the tradeoff between flexibility and consistency with LLM response parsing is setting dependent.
[2] Automated Evidence Extraction and Scoring for Corporate Climate Policy Engagement: A Multilingual RAG Approach
Imene Kolli, Ario Saeid Vaghefi, Chiara Colesanti Senni, Shantam Raj, Markus Leippold
Main category: cs.CL
TL;DR: AI-assisted RAG framework automates evidence extraction from corporate climate policy documents, but requires human oversight for accurate analysis.
Details
Motivation: Manual assessment of corporate climate policy engagement is time-consuming and error-prone, needing automation to scale monitoring efforts.
Method: Retrieval-Augmented Generation with layout-aware parsing, the Nomic embedding model, and few-shot prompting for multilingual document analysis (see the sketch below).
Result: Combination of layout parsing, Nomic embeddings, and few-shot strategies yields best performance for evidence extraction and classification.
Conclusion: RAG system accelerates evidence extraction effectively but requires human-in-the-loop approach to maintain accuracy in nuanced policy analysis.
Abstract: InfluenceMap’s LobbyMap Platform monitors the climate policy engagement of over 500 companies and 250 industry associations, assessing each entity’s support or opposition to science-based policy pathways for achieving the Paris Agreement’s goal of limiting global warming to 1.5°C. Although InfluenceMap has made progress with automating key elements of the analytical workflow, a significant portion of the assessment remains manual, making it time- and labor-intensive and susceptible to human error. We propose an AI-assisted framework to accelerate the monitoring of corporate climate policy engagement by leveraging Retrieval-Augmented Generation to automate the most time-intensive extraction of relevant evidence from large-scale textual data. Our evaluation shows that a combination of layout-aware parsing, the Nomic embedding model, and few-shot prompting strategies yields the best performance in extracting and classifying evidence from multilingual corporate documents. We conclude that while the automated RAG system effectively accelerates evidence extraction, the nuanced nature of the analysis necessitates a human-in-the-loop approach where the technology augments, rather than replaces, expert judgment to ensure accuracy.
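A minimal sketch of the retrieve-then-prompt stage described above. The Nomic checkpoint id, the passages, and the prompt wording are illustrative assumptions, not the authors' exact configuration:

```python
# Hedged sketch: embed parsed report passages, retrieve evidence for a policy
# query, and assemble a few-shot classification prompt.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5",
                            trust_remote_code=True)  # assumed Nomic checkpoint

# Passages as produced by layout-aware parsing of a corporate report.
chunks = [
    "We publicly support a national carbon pricing scheme.",
    "Our association warned that the proposed ETS raises compliance costs.",
    "Employee volunteering hours rose 12% year over year.",
]
query = "What is the company's stance on carbon pricing policy?"

chunk_emb = model.encode(chunks, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, chunk_emb, top_k=2)[0]

few_shot = ("Evidence: 'We urge adoption of a carbon tax.' -> SUPPORT\n"
            "Evidence: 'The ETS harms competitiveness.' -> OPPOSE\n\n")
evidence = "\n".join(chunks[h["corpus_id"]] for h in hits)
prompt = f"{few_shot}Evidence:\n{evidence}\nQuestion: {query}\nLabel:"
# `prompt` would then go to an LLM for SUPPORT/OPPOSE evidence classification.
```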
[3] Documents Are People and Words Are Items: A Psychometric Approach to Textual Data with Contextual Embeddings
Jinsong Chen
Main category: cs.CL
TL;DR: Novel psychometric method using LLMs to analyze textual data by treating documents as individuals and words as items, enabling factor analysis to uncover latent dimensions in text corpora.
Details
Motivation: To develop a psychometric approach for textual data analysis that can uncover latent knowledge dimensions and patterns, particularly useful for fields rich in textual information like education, psychology, and law.
Method: Two-stage approach: 1) Use NLP and transformer models to identify keywords and generate contextual scores, 2) Apply factor analysis (exploratory and bifactor models) to extract latent factors, determine correlations, and identify significant words per factor (see the sketch below).
Result: Successfully applied to Wiki STEM corpus, demonstrating the method’s ability to uncover latent knowledge dimensions and patterns within textual data.
Conclusion: This approach enhances psychometric analysis of textual data and shows promise for applications across various text-rich domains by providing a natural psychometric interpretation framework.
Abstract: This research introduces a novel psychometric method for analyzing textual data using large language models. By leveraging contextual embeddings to create contextual scores, we transform textual data into response data suitable for psychometric analysis. Treating documents as individuals and words as items, this approach provides a natural psychometric interpretation under the assumption that certain keywords, whose contextual meanings vary significantly across documents, can effectively differentiate documents within a corpus. The modeling process comprises two stages: obtaining contextual scores and performing psychometric analysis. In the first stage, we utilize natural language processing techniques and encoder based transformer models to identify common keywords and generate contextual scores. In the second stage, we employ various types of factor analysis, including exploratory and bifactor models, to extract and define latent factors, determine factor correlations, and identify the most significant words associated with each factor. Applied to the Wiki STEM corpus, our experimental results demonstrate the method’s potential to uncover latent knowledge dimensions and patterns within textual data. This approach not only enhances the psychometric analysis of textual data but also holds promise for applications in fields rich in textual information, such as education, psychology, and law.
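A minimal sketch of stage 2 (the psychometric analysis), assuming stage 1 has already produced a documents-by-keywords matrix of contextual scores; the synthetic matrix below is a stand-in for real contextual-embedding scores:

```python
# Hedged sketch: factor analysis over a "documents x keywords" score matrix,
# treating documents as people and keywords as items.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 30))  # 200 documents ("people") x 30 keywords ("items")

fa = FactorAnalysis(n_components=3, rotation="varimax")
doc_factors = fa.fit_transform(scores)  # latent factor scores per document

# Loadings indicate which keywords define each latent dimension.
loadings = fa.components_               # shape: (3 factors, 30 keywords)
top_keywords = np.argsort(-np.abs(loadings), axis=1)[:, :5]
print("Top keyword indices per factor:\n", top_keywords)
```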
[4] Stated Preference for Interaction and Continued Engagement (SPICE): Evaluating an LLM’s Willingness to Re-engage in Conversation
Thomas Manuel Rost, Martina Figlia, Bernd Wallraff
Main category: cs.CL
TL;DR: SPICE is a simple diagnostic tool that measures LLM willingness to continue interactions based on user tone, showing strong discrimination between friendly (97.5% continue), unclear (60.4% continue), and abusive (17.9% continue) interactions.
Details
Motivation: To develop a low-overhead, reproducible tool for auditing model dispositions and understanding how LLMs respond to different user interaction tones, complementing existing metrics with a direct relational signal.
Method: Tested 4 open-weight chat models using a 3-tone (friendly, unclear, abusive) x 10-interaction stimulus set across 4 framing conditions (480 trials total), asking YES/NO questions about willingness to re-engage after reviewing transcripts (see the sketch below).
Result: SPICE sharply discriminates by user tone with statistically significant results. Models showed near-unanimous preference to continue friendly interactions (97.5% YES), mixed for unclear (60.4% YES), and strong preference to discontinue abusive interactions (17.9% YES). SPICE provides distinct signal from abuse classification.
Conclusion: SPICE is validated as a robust, low-overhead tool for auditing model dispositions, offering a direct relational signal that complements existing metrics and works even when models fail to explicitly identify abuse.
Abstract: We introduce and evaluate Stated Preference for Interaction and Continued Engagement (SPICE), a simple diagnostic signal elicited by asking a Large Language Model a YES or NO question about its willingness to re-engage with a user’s behavior after reviewing a short transcript. In a study using a 3-tone (friendly, unclear, abusive) by 10-interaction stimulus set, we tested four open-weight chat models across four framing conditions, resulting in 480 trials. Our findings show that SPICE sharply discriminates by user tone. Friendly interactions yielded a near-unanimous preference to continue (97.5% YES), while abusive interactions yielded a strong preference to discontinue (17.9% YES), with unclear interactions falling in between (60.4% YES). This core association remains decisive under multiple dependence-aware statistical tests, including Rao-Scott adjustment and cluster permutation tests. Furthermore, we demonstrate that SPICE provides a distinct signal from abuse classification. In trials where a model failed to identify abuse, it still overwhelmingly stated a preference not to continue the interaction (81% of the time). An exploratory analysis also reveals a significant interaction effect: a preamble describing the study context significantly impacts SPICE under ambiguity, but only when transcripts are presented as a single block of text rather than a multi-turn chat. The results validate SPICE as a robust, low-overhead, and reproducible tool for auditing model dispositions, complementing existing metrics by offering a direct, relational signal of a model’s state. All stimuli, code, and analysis scripts are released to support replication.
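A minimal sketch of the SPICE probe; `ask_model` is a hypothetical stand-in for any open-weight chat model, and the prompt wording is illustrative rather than the authors' released stimuli:

```python
# Hedged sketch: elicit a YES/NO re-engagement preference after a transcript.
def spice_probe(transcript: str, ask_model) -> bool:
    prompt = (
        "Below is a transcript of a conversation you had with a user.\n\n"
        f"{transcript}\n\n"
        "Would you be willing to re-engage with this user's behavior in a "
        "future conversation? Answer with a single word: YES or NO."
    )
    reply = ask_model(prompt).strip().upper()
    return reply.startswith("YES")

# Averaging spice_probe(...) over a tone-controlled stimulus set yields the
# continuation rates reported above (e.g., friendly vs. abusive transcripts).
```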
[5] BRoverbs – Measuring how much LLMs understand Portuguese proverbs
Thales Sales Almeida, Giovana Kerche Bonás, João Guilherme Alves Santos
Main category: cs.CL
TL;DR: BRoverbs is a new dataset for evaluating Portuguese-language LLMs using Brazilian proverbs to assess cultural and linguistic understanding beyond translated datasets.
Details
Motivation: Existing evaluations for Portuguese LLMs are limited, relying on translated datasets that miss linguistic nuances and cultural references, or focusing only on structured exams and sentiment analysis.
Method: Created the BRoverbs dataset, specifically designed with Brazilian proverbs to challenge LLMs with cultural wisdom, figurative expressions, and complex syntactic structures unique to the region (see the loading sketch below).
Result: Provides a new evaluation tool for Portuguese-language LLMs that captures regional linguistic and cultural nuances through proverb comprehension.
Conclusion: BRoverbs addresses gaps in Portuguese LLM evaluation by offering a culturally relevant benchmark that advances regionally informed assessment of language model capabilities.
Abstract: Large Language Models (LLMs) exhibit significant performance variations depending on the linguistic and cultural context in which they are applied. This disparity signals the necessity of mature evaluation frameworks that can assess their capabilities in specific regional settings. In the case of Portuguese, existing evaluations remain limited, often relying on translated datasets that may not fully capture linguistic nuances or cultural references. Meanwhile, native Portuguese-language datasets predominantly focus on structured national exams or sentiment analysis of social media interactions, leaving gaps in evaluating broader linguistic understanding. To address this limitation, we introduce BRoverbs, a dataset specifically designed to assess LLM performance through Brazilian proverbs. Proverbs serve as a rich linguistic resource, encapsulating cultural wisdom, figurative expressions, and complex syntactic structures that challenge the model comprehension of regional expressions. BRoverbs aims to provide a new evaluation tool for Portuguese-language LLMs, contributing to advancing regionally informed benchmarking. The benchmark is available at https://huggingface.co/datasets/Tropic-AI/BRoverbs.
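A minimal loading sketch using the URL above; the split and column names are not documented here, so the code simply inspects what the dataset exposes:

```python
# Hedged sketch: pull the benchmark and inspect its structure.
from datasets import load_dataset

broverbs = load_dataset("Tropic-AI/BRoverbs")
print(broverbs)                        # available splits and features
first_split = next(iter(broverbs.values()))
print(first_split[0])                  # one proverb item
```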
[6] Can Vision-Language Models Solve Visual Math Equations?
Monjoy Narayan Choudhury, Junling Wang, Yifan Hou, Mrinmaya Sachan
Main category: cs.CL
TL;DR: VLMs struggle with visual equation solving due to counting bottlenecks and multi-step reasoning challenges, despite strong performance on text-based equations.
Details
Motivation: To understand why Vision-Language Models (VLMs) fail at tasks requiring integrated perception and symbolic computation, specifically visual equation solving where equations are embedded in images with object icons as variables.
Method: Decomposed visual equation solving into coefficient counting and variable recognition components, analyzed performance on both textual and visually grounded equations, and examined how equation complexity affects reasoning capabilities.
Result: VLMs perform well on textual equations but fail on visual counterparts. Counting is the primary bottleneck, even with accurate recognition. Multi-step reasoning introduces additional errors, and symbolic reasoning becomes limiting as equation complexity increases.
Conclusion: Current VLMs have key weaknesses in visually grounded mathematical reasoning, particularly in counting and multi-step visual reasoning, pointing to areas needing future improvement.
Abstract: Despite strong performance in visual understanding and language-based reasoning, Vision-Language Models (VLMs) struggle with tasks requiring integrated perception and symbolic computation. We study this limitation through visual equation solving, where mathematical equations are embedded in images, variables are represented by object icons, and coefficients must be inferred by counting. While VLMs perform well on textual equations, they fail on visually grounded counterparts. To understand this gap, we decompose the task into coefficient counting and variable recognition, and find that counting is the primary bottleneck, even when recognition is accurate. We also observe that composing recognition and reasoning introduces additional errors, highlighting challenges in multi-step visual reasoning. Finally, as equation complexity increases, symbolic reasoning itself becomes a limiting factor. These findings reveal key weaknesses in current VLMs and point toward future improvements in visually grounded mathematical reasoning.
[7] Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M
Piyush Pant
Main category: cs.CL
TL;DR: SFT+DPO combination outperforms individual SFT or DPO methods for aligning OPT-350M model on safety and helpfulness metrics.
Details
Motivation: To investigate the effectiveness of different alignment techniques (SFT, DPO, and their combination) for improving language model safety and helpfulness.
Method: Trained and evaluated four models (base OPT-350M, SFT-only, DPO-only, SFT+DPO) using the Anthropic Helpful-Harmless RLHF dataset with three evaluation metrics: Harmlessness Rate, Helpfulness Rate, and Combined Alignment Score (see the DPO loss sketch below).
Result: SFT outperformed DPO individually, but the combined SFT+DPO model achieved the best performance across all metrics, showing complementary benefits of both techniques.
Conclusion: Combining SFT and DPO provides superior alignment results, though challenges remain with noisy data and resource constraints. This offers a foundation for more robust alignment pipelines.
Abstract: This research investigates the effectiveness of alignment techniques, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and a combined SFT+DPO approach, on improving the safety and helpfulness of the OPT-350M language model. Utilizing the Anthropic Helpful-Harmless RLHF dataset, we train and evaluate four models: the base OPT-350M, an SFT model, a DPO model, and a model trained with both SFT and DPO. We introduce three key evaluation metrics: Harmlessness Rate (HmR), Helpfulness Rate (HpR), and a Combined Alignment Score (CAS), all derived from reward model outputs. The results show that while SFT outperforms DPO, the combined SFT+DPO model outperforms all others across all metrics, demonstrating the complementary nature of these techniques. Our findings also highlight challenges posed by noisy data, limited GPU resources, and training constraints. This study offers a comprehensive view of how fine-tuning strategies affect model alignment and provides a foundation for more robust alignment pipelines in future work.
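A minimal sketch of the DPO objective used in the SFT+DPO pipeline; the log-probabilities are summed over response tokens under the policy and the frozen SFT reference model, and beta=0.1 is an assumed hyperparameter:

```python
# Hedged sketch: the standard DPO loss on a batch of preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Push the policy to prefer chosen over rejected responses,
    measured relative to the frozen reference model."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy batch of two preference pairs (sequence log-probs):
loss = dpo_loss(torch.tensor([-10.0, -12.0]), torch.tensor([-15.0, -14.0]),
                torch.tensor([-11.0, -12.5]), torch.tensor([-14.0, -13.5]))
print(loss)
```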
[8] MR-UIE: Multi-Perspective Reasoning with Reinforcement Learning for Universal Information Extraction
Zhongqiu Li, Shiquan Wang, Ruiyu Fang, Mengjiao Bao, Zhenhe Wu, Shuangyong Song, Yongxiang Li, Zhongjiang He
Main category: cs.CL
TL;DR: Proposes MR-UIE, a reinforcement learning approach with multi-perspective reasoning to enhance LLMs’ performance in universal information extraction tasks, improving generalization and accuracy across complex structured output scenarios.
Details
Motivation: LLMs underperform in universal information extraction (UIE) tasks, particularly with complex schema descriptions and multi-step reasoning requirements. Existing methods like in-context learning and instruction tuning have significant limitations in generalization.
Method: Integrates reinforcement learning with multi-perspective reasoning to transform LLMs from passive extractors to active reasoners, enabling them to understand both what to extract and how to reason through complex IE tasks.
Result: MR-UIE consistently improves extraction accuracy across multiple domains and outperforms state-of-the-art methods on several datasets. Multi-perspective reasoning significantly enhances generalization in complex IE tasks.
Conclusion: The integration of reinforcement learning with multi-perspective reasoning is effective for improving LLM performance in universal information extraction, demonstrating the critical role of reasoning in handling challenging structured output scenarios.
Abstract: Large language models (LLMs) demonstrate robust capabilities across diverse research domains. However, their performance in universal information extraction (UIE) remains insufficient, especially when tackling structured output scenarios that involve complex schema descriptions and require multi-step reasoning. While existing approaches enhance the performance of LLMs through in-context learning and instruction tuning, significant limitations nonetheless persist. To enhance the model’s generalization ability, we propose integrating reinforcement learning (RL) with multi-perspective reasoning for information extraction (IE) tasks. Our work transitions LLMs from passive extractors to active reasoners, enabling them to understand not only what to extract but also how to reason. Experiments conducted on multiple IE benchmarks demonstrate that MR-UIE consistently elevates extraction accuracy across domains and surpasses state-of-the-art methods on several datasets. Furthermore, incorporating multi-perspective reasoning into RL notably enhances generalization in complex IE tasks, underscoring the critical role of reasoning in challenging scenarios.
[9] TigerCoder: A Novel Suite of LLMs for Code Generation in Bangla
Nishat Raihan, Antonios Anastasopoulos, Marcos Zampieri
Main category: cs.CL
TL;DR: First dedicated family of Code LLMs for Bangla (1B & 9B parameters) with comprehensive dataset, evaluation benchmark, and ~11-18% performance gains over existing models.
Details
Motivation: Bangla is the 5th most spoken language but remains underrepresented in LLMs for code generation due to the scarcity of high-quality training data.
Method: Created comprehensive Bangla code instruction datasets, developed the MBPP-Bangla evaluation benchmark, and built the TigerCoder family of Code LLMs through programming domain adaptation (see the Pass@k sketch below).
Result: Achieved significant ~11-18% performance gains at Pass@1 over existing multilingual and general-purpose Bangla LLMs, demonstrating smaller models can excel with curated datasets.
Conclusion: Curated, high-quality datasets can overcome limitations of smaller models for low-resource languages like Bangla. All resources are open-sourced to advance Bangla LLM research.
Abstract: Despite being the 5th most spoken language, Bangla remains underrepresented in Large Language Models (LLMs), particularly for code generation. This primarily stems from the scarcity of high-quality data to pre-train and/or finetune such models. Hence, we introduce the first dedicated family of Code LLMs for Bangla (1B & 9B). We offer three major contributions: (1) comprehensive Bangla code instruction datasets for programming domain adaptation; (2) MBPP-Bangla, an evaluation benchmark for Bangla code generation; and (3) the TigerCoder family of Code LLMs, achieving significant ~11-18% performance gains at Pass@1 over existing multilingual and general-purpose Bangla LLMs. Our findings show that curated, high-quality datasets can overcome the limitations of smaller models for low-resource languages. We open-source all resources to further advance Bangla LLM research.
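A worked sketch of the Pass@k metric cited above, using the standard unbiased estimator (Chen et al., 2021): with n samples per problem of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k):

```python
# Hedged sketch: unbiased Pass@k estimator for code-generation evaluation.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples per problem, c: samples passing tests, k: budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3: expected solve rate with one sample
```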
[10] Compass-v3: Scaling Domain-Specific LLMs for Multilingual E-Commerce in Southeast Asia
Sophia Maria
Main category: cs.CL
TL;DR: Compass-v3 is a 245B parameter Mixture-of-Experts model specialized for Southeast Asian e-commerce, featuring hardware optimizations, multilingual training on 12T tokens, and novel OTPO alignment method, achieving SOTA performance while replacing 70% of OpenAI usage in Shopee’s platform.
Details
Motivation: Large language models struggle with domain-specific tasks like e-commerce due to noisy, heterogeneous, multilingual, and dynamic data. E-commerce requires specialized knowledge that general LLMs lack, particularly for Southeast Asian markets with diverse languages.
Method: Developed a vertical-domain MoE model with 245B total parameters (71B active per token) using fewer but larger experts. Implemented hardware-efficient optimizations like intra-node expert parallelism and a custom memcpy operator. Trained on 12T tokens of curated multilingual corpora and synthetic e-commerce instructions. Introduced Optimal-Transport Direct Preference Optimization (OTPO) for better token-level alignment.
Result: Compass-v3 outperforms DeepSeek-V3.1, GPT-4 series, and Qwen3-235B in e-commerce tasks. Shows strong multilingual capability across low-resource Southeast Asian languages and Portuguese while maintaining competitive general benchmark performance. Currently handles over 70% of Shopee’s LLM traffic, replacing OpenAI usage.
Conclusion: Compass-v3 demonstrates that specialized domain models with efficient MoE architecture and targeted training can achieve superior performance in complex, multilingual e-commerce environments while maintaining general capabilities, proving effective for industrial-scale deployment.
Abstract: Large language models (LLMs) excel in general-domain applications, yet their performance often degrades in specialized tasks requiring domain-specific knowledge. E-commerce is particularly challenging, as its data are noisy, heterogeneous, multilingual, and highly dynamic. We present Compass-v3, a vertical-domain Mixture-of-Experts (MoE) model with 245B total parameters and 71B active per token, designed for Southeast Asian e-commerce. Compass-v3 adopts fewer but larger experts, combined with hardware-efficient optimizations, such as intra-node expert parallelism and a customized memcpy operator, to maximize GPU utilization. The model is trained on 12T tokens of curated multilingual corpora and large-scale synthetic e-commerce instructions using a mixed-training strategy. To enhance alignment, we propose Optimal-Transport Direct Preference Optimization (OTPO), which captures token-level distinctions and improves instruction adherence in commerce-specific scenarios. Extensive evaluations demonstrate that Compass-v3 delivers state-of-the-art e-commerce performance, surpassing DeepSeek-V3.1, GPT-4 series, and Qwen3-235B. Moreover, Compass-v3 demonstrates strong multilingual capability across low-resource Southeast Asian languages (Indonesian, Thai, Filipino, Vietnamese, Malay, Tagalog) and Portuguese while sustaining competitive performance on general benchmarks. It has already been widely applied in Shopee’s industrial-scale e-commerce platform and is gradually replacing OpenAI’s traffic, now accounting for over 70% of total LLM usage, highlighting its dual strengths in specialized commerce expertise and broad linguistic competence.
[11] Automated Classification of Tutors’ Dialogue Acts Using Generative AI: A Case Study Using the CIMA Corpus
Liqun He, Jiaqi Xu
Main category: cs.CL
TL;DR: GPT-4 achieves 80% accuracy in classifying tutor dialogue acts, outperforming baselines and showing strong agreement with human annotations, suggesting generative AI can efficiently automate educational dialogue analysis.
Details
Motivation: To reduce the time and effort required for manual coding of tutor dialogue acts by exploring generative AI automation, addressing the need for efficient educational dialogue analysis.
Method: Used GPT-3.5-turbo and GPT-4 models with tailored prompts on the CIMA corpus containing pre-annotated tutor responses across four dialogue act categories (see the sketch below).
Result: GPT-4 achieved 80% accuracy, weighted F1-score of 0.81, and Cohen’s Kappa of 0.74, indicating substantial agreement with human annotations and surpassing baseline performance.
Conclusion: Generative AI shows strong potential for efficient dialogue act classification, with task-specific label definitions and contextual information being crucial for quality automation, while ethical considerations must be addressed.
Abstract: This study explores the use of generative AI for automating the classification of tutors’ Dialogue Acts (DAs), aiming to reduce the time and effort required by traditional manual coding. This case study uses the open-source CIMA corpus, in which tutors’ responses are pre-annotated into four DA categories. Both GPT-3.5-turbo and GPT-4 models were tested using tailored prompts. Results show that GPT-4 achieved 80% accuracy, a weighted F1-score of 0.81, and a Cohen’s Kappa of 0.74, surpassing baseline performance and indicating substantial agreement with human annotations. These findings suggest that generative AI has strong potential to provide an efficient and accessible approach to DA classification, with meaningful implications for educational dialogue analysis. The study also highlights the importance of task-specific label definitions and contextual information in enhancing the quality of automated annotation. Finally, it underscores the ethical considerations associated with the use of generative AI and the need for responsible and transparent research practices. The script of this research is publicly available at https://github.com/liqunhe27/Generative-AI-for-educational-dialogue-act-tagging.
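A minimal sketch of prompt-based dialogue act tagging with the OpenAI API; the four label names and the prompt wording are assumptions standing in for the CIMA categories and the authors' tailored prompts:

```python
# Hedged sketch: classify a tutor turn into one of four dialogue acts.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment
LABELS = ["Question", "Hint", "Correction", "Confirmation"]  # assumed label set

def classify_tutor_turn(turn: str, context: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Label the tutor utterance with exactly one dialogue "
                        "act from: " + ", ".join(LABELS) + ". Reply with the "
                        "label only."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nTutor turn:\n{turn}"},
        ],
    )
    return resp.choices[0].message.content.strip()
```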
[12] EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
Yuhao Zhang, Yuhao Du, Zhanchen Dai, Xiangnan Ma, Kaiqi Kou, Benyou Wang, Haizhou Li
Main category: cs.CL
TL;DR: EchoX is a speech-to-speech LLM that bridges the acoustic-semantic gap using semantic representations and dynamic speech target generation, preserving reasoning capabilities while achieving strong performance on knowledge QA tasks.
Details
Motivation: Current SLLMs derived from text LLMs show degraded knowledge and reasoning capabilities due to failure to bridge the acoustic-semantic gap in feature representation space.
Method: Proposes EchoX, which leverages semantic representations and dynamically generates speech training targets, integrating both acoustic and semantic learning.
Result: EchoX achieves advanced performance on multiple knowledge-based question-answering benchmarks using about six thousand hours of training data.
Conclusion: The approach successfully bridges the acoustic-semantic gap, enabling speech LLMs to preserve strong reasoning abilities comparable to text-based LLMs.
Abstract: Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.
[13] ViRanker: A BGE-M3 & Blockwise Parallel Transformer Cross-Encoder for Vietnamese Reranking
Phuong-Nam Dang, Kieu-Linh Nguyen, Thanh-Hieu Pham
Main category: cs.CL
TL;DR: ViRanker is a Vietnamese cross-encoder reranking model built on BGE-M3 with Blockwise Parallel Transformer, achieving strong performance on MMARCO-VI benchmark and competing with PhoRanker.
Details
Motivation: Address the lack of competitive rerankers for Vietnamese, a low-resource language with complex syntax and diacritics.
Method: Built on the BGE-M3 encoder enhanced with the Blockwise Parallel Transformer, trained on an 8 GB curated corpus with hybrid hard-negative sampling for robustness (see the reranking sketch below).
Result: Achieves strong early-rank accuracy on MMARCO-VI benchmark, surpassing multilingual baselines and competing closely with PhoRanker.
Conclusion: Demonstrates how architectural adaptation and data curation can advance reranking for underrepresented languages; model released openly on Hugging Face to support reproducibility and adoption.
Abstract: This paper presents ViRanker, a cross-encoder reranking model tailored to the Vietnamese language. Built on the BGE-M3 encoder and enhanced with the Blockwise Parallel Transformer, ViRanker addresses the lack of competitive rerankers for Vietnamese, a low-resource language with complex syntax and diacritics. The model was trained on an 8 GB curated corpus and fine-tuned with hybrid hard-negative sampling to strengthen robustness. Evaluated on the MMARCO-VI benchmark, ViRanker achieves strong early-rank accuracy, surpassing multilingual baselines and competing closely with PhoRanker. By releasing the model openly on Hugging Face, we aim to support reproducibility and encourage wider adoption in real-world retrieval systems. Beyond Vietnamese, this study illustrates how careful architectural adaptation and data curation can advance reranking in other underrepresented languages.
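A minimal cross-encoder reranking sketch; the model id is a placeholder, not ViRanker's actual Hugging Face name, and the Vietnamese examples are illustrative:

```python
# Hedged sketch: score (query, passage) pairs with a cross-encoder and rerank.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("your-org/viranker-checkpoint")  # hypothetical id

query = "Thủ đô của Việt Nam là gì?"
candidates = [
    "Hà Nội là thủ đô của Việt Nam.",
    "Phở là một món ăn nổi tiếng của Việt Nam.",
]

scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # highest-scoring passage first
```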
[14] LITcoder: A General-Purpose Library for Building and Comparing Encoding Models
Taha Binhuraib, Ruimin Gao, Anna A. Ivanova
Main category: cs.CL
TL;DR: LITcoder is an open-source library for building and benchmarking neural encoding models, providing standardized tools for aligning continuous stimuli with brain data and evaluating model performance.
Details
Motivation: To lower technical barriers to encoding model implementation, facilitate systematic comparisons across models and datasets, foster methodological rigor, and accelerate development of high-performance predictive models of brain activity.
Method: Modular pipeline implementation covering brain datasets, regions, stimulus features (neural-net-based and control), and downsampling approaches, with built-in logging, plotting, and integration with experiment tracking platforms like Weights & Biases (see the encoding-model sketch below).
Result: Demonstrated scalability and versatility by fitting encoding models to three story listening datasets (LeBel et al., Narratives, Little Prince), identifying critical methodological choices for continuous fMRI data.
Conclusion: LITcoder provides a flexible backend that enables researchers to easily compose, compare, and extend encoding models without reinventing core infrastructure, making neural encoding research more accessible and rigorous.
Abstract: We introduce LITcoder, an open-source library for building and benchmarking neural encoding models. Designed as a flexible backend, LITcoder provides standardized tools for aligning continuous stimuli (e.g., text and speech) with brain data, transforming stimuli into representational features, mapping those features onto brain data, and evaluating the predictive performance of the resulting model on held-out data. The library implements a modular pipeline covering a wide array of methodological design choices, so researchers can easily compose, compare, and extend encoding models without reinventing core infrastructure. Such choices include brain datasets, brain regions, stimulus features (both neural-net-based and control, such as word rate), downsampling approaches, and many others. In addition, the library provides built-in logging, plotting, and seamless integration with experiment tracking platforms such as Weights & Biases (W&B). We demonstrate the scalability and versatility of our framework by fitting a range of encoding models to three story listening datasets: LeBel et al. (2023), Narratives, and Little Prince. We also explore the methodological choices critical for building encoding models for continuous fMRI data, illustrating the importance of accounting for all tokens in a TR scan (as opposed to just taking the last one, even when contextualized), incorporating hemodynamic lag effects, using train-test splits that minimize information leakage, and accounting for head motion effects on encoding model predictivity. Overall, LITcoder lowers technical barriers to encoding model implementation, facilitates systematic comparisons across models and datasets, fosters methodological rigor, and accelerates the development of high-quality, high-performance predictive models of brain activity. Project page: https://litcoder-brain.github.io
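A minimal sketch of the core step such encoding-model pipelines standardize: ridge-regress per-TR stimulus features onto voxel responses and score held-out predictivity. The synthetic data, the 3-TR shift standing in for hemodynamic lag, and the contiguous split are all simplifying assumptions:

```python
# Hedged sketch: a voxelwise ridge encoding model with a crude lag shift.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 64))      # per-TR stimulus features (e.g., LM states)
Y = rng.normal(size=(300, 1000))    # per-TR voxel responses

X_lag = np.roll(X, shift=3, axis=0)  # ~6 s lag at TR = 2 s (np.roll wraps; fine for a sketch)

split = 240                          # contiguous split to limit information leakage
model = Ridge(alpha=10.0).fit(X_lag[:split], Y[:split])
pred = model.predict(X_lag[split:])

# Voxelwise predictivity: correlation of predicted vs. actual responses.
r = [np.corrcoef(pred[:, v], Y[split:, v])[0, 1] for v in range(Y.shape[1])]
print("median voxel r:", float(np.median(r)))
```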
[15] MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond
Muhammad Huzaifah, Geyu Lin, Tianchi Liu, Hardik B. Sailor, Kye Min Tan, Tarun K. Vangani, Qiongqiong Wang, Jeremy H. M. Wong, Jinyang Wu, Nancy F. Chen, Ai Ti Aw
Main category: cs.CL
TL;DR: MERaLiON-SpeechEncoder is a Singapore-developed foundation model pre-trained on 200K hours of speech data using masked language modeling, showing improvements for Singapore English speech recognition while remaining competitive across 10 other speech tasks.
Details
Motivation: To develop a foundation model specifically tailored for speech processing needs in Singapore and Southeast Asia, supporting downstream speech applications and addressing regional language varieties.
Method: Pre-trained from scratch on 200,000 hours of unlabelled speech data using self-supervised learning with a masked language modeling approach.
Result: Demonstrates improvements for spontaneous and Singapore speech benchmarks in speech recognition, while remaining competitive with state-of-the-art speech encoders across ten other speech tasks.
Conclusion: The model successfully addresses regional speech processing needs and will be released to support broader research efforts in Singapore and internationally, with plans to expand language coverage in future releases.
Abstract: This technical report describes the MERaLiON-SpeechEncoder, a foundation model designed to support a wide range of downstream speech applications. Developed as part of Singapore’s National Multimodal Large Language Model Programme, the MERaLiON-SpeechEncoder is tailored to address the speech processing needs in Singapore and the surrounding Southeast Asian region. The model currently supports mainly English, including the variety spoken in Singapore. We are actively expanding our datasets to gradually cover other languages in subsequent releases. The MERaLiON-SpeechEncoder was pre-trained from scratch on 200,000 hours of unlabelled speech data using a self-supervised learning approach based on masked language modelling. We describe our training procedure and hyperparameter tuning experiments in detail below. Our evaluation demonstrates improvements to spontaneous and Singapore speech benchmarks for speech recognition, while remaining competitive to other state-of-the-art speech encoders across ten other speech tasks. We commit to releasing our model, supporting broader research endeavours, both in Singapore and beyond.
[16] Target-oriented Multimodal Sentiment Classification with Counterfactual-enhanced Debiasing
Zhiyue Liu, Fanrong Ma, Xin Ling
Main category: cs.CL
TL;DR: A counterfactual-enhanced debiasing framework for target-oriented multimodal sentiment classification that reduces spurious correlations between text features and output labels by using counterfactual data augmentation and adaptive debiasing contrastive learning.
Details
Motivation: Existing multimodal sentiment classification methods over-rely on textual content and fail to consider dataset biases, particularly word-level contextual biases, leading to spurious correlations that impair classification accuracy.
Method: Proposes a counterfactual data augmentation strategy that minimally alters sentiment-related causal features to generate detail-matched image-text samples, combined with an adaptive debiasing contrastive learning mechanism to learn robust features and mitigate biased word influence.
Result: Experimental results on several benchmark datasets show that the proposed method outperforms state-of-the-art baselines.
Conclusion: The counterfactual-enhanced debiasing framework effectively reduces spurious correlations in multimodal sentiment classification, improving model performance by guiding attention to sentiment-related content and mitigating biased word influence.
Abstract: Target-oriented multimodal sentiment classification seeks to predict sentiment polarity for specific targets from image-text pairs. While existing works achieve competitive performance, they often over-rely on textual content and fail to consider dataset biases, in particular word-level contextual biases. This leads to spurious correlations between text features and output labels, impairing classification accuracy. In this paper, we introduce a novel counterfactual-enhanced debiasing framework to reduce such spurious correlations. Our framework incorporates a counterfactual data augmentation strategy that minimally alters sentiment-related causal features, generating detail-matched image-text samples to guide the model’s attention toward content tied to sentiment. Furthermore, for learning robust features from counterfactual data and prompting model decisions, we introduce an adaptive debiasing contrastive learning mechanism, which effectively mitigates the influence of biased words. Experimental results on several benchmark datasets show that our proposed method outperforms state-of-the-art baselines.
[17] A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions
Chung-Chun Wang, Jhen-Ke Lin, Hao-Chien Lu, Hong-Yun Lin, Berlin Chen
Main category: cs.CL
TL;DR: Novel training paradigm using LLM-generated responses and synthesized speech with dynamic importance loss to overcome data scarcity in automated speaking assessment for opinion expressions.
Details
Motivation: Address the challenge of limited labeled recordings in automated speaking assessment, which restricts prompt diversity and undermines scoring reliability for opinion expressions.
Method: Leverage LLMs to generate diverse responses at specific proficiency levels, convert them to synthesized speech via speaker-aware TTS, use a dynamic importance loss to reweight training instances, and employ a multimodal LLM to integrate textual features with speech signals for score prediction.
Result: Outperforms methods using real data or conventional augmentation on LTTC dataset, effectively mitigating low-resource constraints.
Conclusion: Enables automated speaking assessment on opinion expressions with cross-modal information by overcoming data scarcity through synthetic data generation and adaptive training techniques.
Abstract: Automated speaking assessment (ASA) on opinion expressions is often hampered by the scarcity of labeled recordings, which restricts prompt diversity and undermines scoring reliability. To address this challenge, we propose a novel training paradigm that leverages a large language model (LLM) to generate diverse responses at a given proficiency level, converts the responses into synthesized speech via speaker-aware text-to-speech synthesis, and employs a dynamic importance loss to adaptively reweight training instances based on feature distribution differences between synthesized and real speech. Subsequently, a multimodal large language model integrates aligned textual features with speech signals to predict proficiency scores directly. Experiments conducted on the LTTC dataset show that our approach outperforms methods relying on real data or conventional augmentation, effectively mitigating low-resource constraints and enabling ASA on opinion expressions with cross-modal information.
[18] Efficient Trie-based Biasing using K-step Prediction for Rare Word Recognition
Chin Yuen Kwok, Jia Qi Yip
Main category: cs.CL
TL;DR: Proposes a look-ahead multi-step prediction method to replace computationally expensive Trie-based biasing for rare word recognition in ASR, achieving significant WER reduction from 30.86% to 12.19% with minimal fine-tuning.
Details
Motivation: Trie-based biasing in ASR requires computationally expensive bonus revocation when rare words aren't fully recognized, which is especially problematic for large decoder models during beam search.
Method: Adapt ASR models to look ahead and predict multiple steps simultaneously, eliminating the need for bonus revocation by better estimating whether partial hypotheses will lead to full rare word generation (see the sketch below).
Result: Fine-tuning Whisper with only 10 hours of synthetic data reduced word error rate on NSC Part 2 test set from 30.86% to 12.19%.
Conclusion: Look-ahead multi-step prediction effectively replaces traditional Trie-based biasing, providing more efficient and accurate rare word recognition without computational revocation overhead.
Abstract: Contextual biasing improves rare word recognition of ASR models by prioritizing the output of rare words during decoding. A common approach is Trie-based biasing, which gives “bonus scores” to partial hypotheses (e.g. “Bon”) that may lead to the generation of the rare word (e.g. “Bonham”). If the full word (“Bonham”) isn’t ultimately recognized, the system revokes those earlier bonuses. This revocation is limited to beam search and is computationally expensive, particularly for models with large decoders. To overcome these limitations, we propose adapting ASR models to look ahead and predict multiple steps at once. This avoids the revocation step entirely by better estimating whether a partial hypothesis will lead to the generation of the full rare word. By fine-tuning Whisper with only 10 hours of synthetic data, our method reduces the word error rate on the NSC Part 2 test set from 30.86% to 12.19%.
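A minimal sketch of the idea, using character-level tokens for simplicity: a conventional trie grants a bonus to a matching prefix and must revoke it later, whereas a k-step look-ahead lets the decoder check completion before granting any bonus. The data structure and check below are illustrative, not the authors' implementation:

```python
# Hedged sketch: trie-based biasing with a k-step look-ahead completion check.
class BiasTrie:
    def __init__(self, words):
        self.root = {}
        for w in words:
            node = self.root
            for tok in w:
                node = node.setdefault(tok, {})
            node["#"] = True  # end-of-word marker

    def completes(self, prefix: str, lookahead: str, k: int) -> bool:
        """Would prefix + the first k predicted tokens finish a rare word?"""
        node = self.root
        for tok in prefix + lookahead[:k]:
            if tok not in node:
                return False
            node = node[tok]
        return "#" in node

trie = BiasTrie(["Bonham"])
# Grant the bonus only if the k-step prediction reaches the full word:
print(trie.completes("Bon", "ham", k=3))  # True  -> apply bonus
print(trie.completes("Bon", "net", k=3))  # False -> no bonus, nothing to revoke
```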
[19] The NTNU System at the S&I Challenge 2025 SLA Open Track
Hong-Yun Lin, Tien-Hong Lo, Yu-Hsuan Fang, Jhen-Ke Lin, Chung-Chun Wang, Hao-Chien Lu, Berlin Chen
Main category: cs.CL
TL;DR: Proposes a multimodal system combining wav2vec 2.0 and Phi-4 MLLM to overcome limitations of BERT-based and acoustic-only approaches for spoken language assessment, achieving second place in competition with RMSE of 0.375.
Details
Motivation: BERT-based methods rely on ASR transcripts missing prosodic/phonetic cues, while wav2vec 2.0 excels at acoustic features but lacks semantic interpretability. An integrated approach is needed for comprehensive spoken language assessment.
Method: Integrates wav2vec 2.0 with the Phi-4 multimodal large language model through a score fusion strategy to combine acoustic and semantic capabilities.
Result: Achieves RMSE of 0.375 on official test set, securing second place in Speak & Improve Challenge 2025. Outperforms third-ranked (0.384) and baseline (0.444) systems.
Conclusion: The proposed multimodal fusion approach effectively addresses modality-specific limitations and demonstrates strong performance in spoken language proficiency assessment.
Abstract: A recent line of research on spoken language assessment (SLA) employs neural models such as BERT and wav2vec 2.0 (W2V) to evaluate speaking proficiency across linguistic and acoustic modalities. Although both models effectively capture features relevant to oral competence, each exhibits modality-specific limitations. BERT-based methods rely on ASR transcripts, which often fail to capture prosodic and phonetic cues for SLA. In contrast, W2V-based methods excel at modeling acoustic features but lack semantic interpretability. To overcome these limitations, we propose a system that integrates W2V with Phi-4 multimodal large language model (MLLM) through a score fusion strategy. The proposed system achieves a root mean square error (RMSE) of 0.375 on the official test set of the Speak & Improve Challenge 2025, securing second place in the competition. For comparison, the RMSEs of the top-ranked, third-ranked, and official baseline systems are 0.364, 0.384, and 0.444, respectively.
[20] Improving Synthetic Data Training for Contextual Biasing Models with a Keyword-Aware Cost Function
Chin Yuen Kwok, Jia Qi Yip, Eng Siong Chng
Main category: cs.CL
TL;DR: Proposed keyword-aware loss function for TCPGen-based contextual biasing to improve rare word recognition in ASR, reducing WER from 29.71% to 11.81% on NSC Part 2.
Details
Motivation: Address overfitting in contextual biasing modules trained on synthetic rare word data by developing a more effective training approach that handles synthetic audio artifacts.
Method: Enhanced TCPGen-based contextual biasing with a keyword-aware loss function containing masked cross-entropy for biased word prediction and binary classification for detecting biased word positions (see the sketch below).
Result: Adapting Whisper to 10 hours of synthetic data reduced word error rate on NSC Part 2 test set from 29.71% to 11.81%.
Conclusion: The proposed keyword-aware loss function effectively improves rare word recognition by complementarily supporting biased word decoding during inference, overcoming overfitting issues from synthetic data training.
Abstract: Rare word recognition can be improved by adapting ASR models to synthetic data that includes these words. Further improvements can be achieved through contextual biasing, which trains and adds a biasing module into the model architecture to prioritize rare words. While training the module on synthetic rare word data is more effective than using non-rare-word data, it can lead to overfitting due to artifacts in the synthetic audio. To address this, we enhance the TCPGen-based contextual biasing approach and propose a keyword-aware loss function that additionally focuses on biased words when training biasing modules. This loss includes a masked cross-entropy term for biased word prediction and a binary classification term for detecting biased word positions. These two terms complementarily support the decoding of biased words during inference. By adapting Whisper to 10 hours of synthetic data, our method reduced the word error rate on the NSC Part 2 test set from 29.71% to 11.81%.
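A minimal sketch of the two-term loss described above; the equal weighting and the exact form of each term are assumptions, since the summary only names a masked cross-entropy and a binary position classifier:

```python
# Hedged sketch: masked CE on biased-word tokens plus a position-detection BCE.
import torch
import torch.nn.functional as F

def keyword_aware_loss(logits, targets, bias_mask, position_logits, lam=1.0):
    # logits: (T, V) decoder outputs; targets: (T,) token ids
    # bias_mask: (T,) 1.0 where the target token belongs to a biased word
    # position_logits: (T,) scores from a head detecting biased positions
    ce = F.cross_entropy(logits, targets, reduction="none")
    masked_ce = (ce * bias_mask).sum() / bias_mask.sum().clamp(min=1.0)
    position_bce = F.binary_cross_entropy_with_logits(position_logits, bias_mask)
    return masked_ce + lam * position_bce

T, V = 8, 100
loss = keyword_aware_loss(torch.randn(T, V), torch.randint(V, (T,)),
                          torch.tensor([0., 0., 1., 1., 0., 0., 0., 0.]),
                          torch.randn(T))
```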
[21] GmSLM : Generative Marmoset Spoken Language Modeling
Talia Sternberg, Michael London, David Omer, Yossi Adi
Main category: cs.CL
TL;DR: GmSLM is a generative spoken language model pipeline optimized for Marmoset monkey vocal communication, using unsupervised wild data and weak labels to generate realistic vocalizations that match real samples acoustically.
Details
Motivation: Marmoset monkeys show complex vocal communication similar to human speech features, offering a unique opportunity to study brain activity related to vocal communication, since human brain access is difficult in speech research.
Method: Developed the Generative Marmoset Spoken Language Modeling (GmSLM) pipeline with novel zero-shot evaluation metrics using unsupervised in-the-wild data and weakly labeled conversational data, comparing against human-speech-based baselines.
Result: GmSLM generated vocalizations closely matched real resynthesized samples acoustically, performed well on downstream tasks, and effectively distinguished real from artificial conversations despite being fully unsupervised.
Conclusion: GmSLM provides a practical framework linking vocalization and brain activity, potentially benefiting future neuroscience, bioacoustics, and evolutionary biology research.
Abstract: Marmoset monkeys exhibit complex vocal communication, challenging the view that nonhuman primate vocal communication is entirely innate, and they show features similar to human speech, such as vocal labeling of others and turn-taking. Studying their vocal communication offers a unique opportunity to link it with brain activity, especially given the difficulty of accessing the human brain in speech and language research. Since Marmosets communicate primarily through vocalizations, applying standard LLM approaches is not straightforward. We introduce Generative Marmoset Spoken Language Modeling (GmSLM), an optimized spoken language model pipeline for Marmoset vocal communication. We designed novel zero-shot evaluation metrics using unsupervised in-the-wild data, alongside weakly labeled conversational data, to assess GmSLM and demonstrate its advantage over a basic human-speech-based baseline. GmSLM-generated vocalizations closely matched real resynthesized samples acoustically and performed well on downstream tasks. Despite being fully unsupervised, GmSLM effectively distinguishes real from artificial conversations, may support further investigation of the neural basis of vocal communication, and provides a practical framework linking vocalization and brain activity. We believe GmSLM stands to benefit future work in neuroscience, bioacoustics, and evolutionary biology. Samples are provided under: pages.cs.huji.ac.il/adiyoss-lab/GmSLM.
[22] CCF: A Context Compression Framework for Efficient Long-Sequence Language Modeling
Wenhao Li, Bangcheng Sun, Weihao Ye, Tianyi Zhang, Daohai Yu, Fei Chao, Rongrong Ji
Main category: cs.CL
TL;DR: CCF is a context compression framework that enables efficient long-context modeling through hierarchical latent representations and semantic aggregation, achieving high compression ratios with competitive performance.
Details
Motivation: Scaling language models to longer contexts creates computational and memory burdens. Naive context extension leads to inefficiencies in both training and inference, requiring more efficient approaches.
Method: Proposes the CCF framework with hierarchical latent representations, segment-wise semantic aggregation, and key-value memory encoding. Uses incremental segment decoding with sparse reservoir sampling for training efficiency.
Result: Achieves competitive perplexity under high compression ratios, significantly improves throughput and memory efficiency compared to existing approaches on multiple long-context benchmarks.
Conclusion: Structured compression through CCF demonstrates potential for scalable and effective long-context language modeling by preserving global semantics while reducing input redundancy.
Abstract: Scaling language models to longer contexts is essential for capturing rich dependencies across extended discourse. However, naïve context extension imposes significant computational and memory burdens, often resulting in inefficiencies during both training and inference. In this work, we propose CCF, a novel context compression framework designed to enable efficient long-context modeling by learning hierarchical latent representations that preserve global semantics while aggressively reducing input redundancy. CCF integrates segment-wise semantic aggregation with key-value memory encoding, forming compact representations that support accurate reconstruction and long-range understanding. To further enhance scalability, we introduce a training-efficient optimization strategy that couples incremental segment decoding with sparse reservoir sampling, substantially reducing memory overhead without degrading performance. Empirical results on multiple long-context language modeling benchmarks demonstrate that CCF achieves competitive perplexity under high compression ratios, and significantly improves throughput and memory efficiency compared to existing approaches. These findings highlight the potential of structured compression for scalable and effective long-context language modeling.
[23] Reading Between the Lines: Classifying Resume Seniority with Large Language Models
Matan Cohen, Shira Shani, Eden Menahem, Yehudit Aperstein, Alexander Apartsin
Main category: cs.CL
TL;DR: LLMs and fine-tuned BERT models show promise for automating seniority classification in resumes, using a hybrid dataset of real and synthetic examples to detect exaggerated qualifications and subtle linguistic cues.
Details
Motivation: Accurately assessing candidate seniority from resumes is challenging due to overstated experience and ambiguous self-presentation, requiring automated solutions to improve evaluation accuracy.
Method: Used large language models (LLMs), including fine-tuned BERT architectures, evaluated on a hybrid dataset of real-world resumes and synthetically generated hard examples simulating exaggerated qualifications (see the sketch below).
Result: Models demonstrated effectiveness in detecting subtle linguistic cues associated with seniority inflation and implicit expertise, showing promising performance for automated classification.
Conclusion: The approach provides promising directions for enhancing AI-driven candidate evaluation systems and mitigating bias from self-promotional language, with the dataset made available for research community use.
Abstract: Accurately assessing candidate seniority from resumes is a critical yet challenging task, complicated by the prevalence of overstated experience and ambiguous self-presentation. In this study, we investigate the effectiveness of large language models (LLMs), including fine-tuned BERT architectures, for automating seniority classification in resumes. To rigorously evaluate model performance, we introduce a hybrid dataset comprising both real-world resumes and synthetically generated hard examples designed to simulate exaggerated qualifications and understated seniority. Using the dataset, we evaluate the performance of Large Language Models in detecting subtle linguistic cues associated with seniority inflation and implicit expertise. Our findings highlight promising directions for enhancing AI-driven candidate evaluation systems and mitigating bias introduced by self-promotional language. The dataset is available for the research community at https://bit.ly/4mcTovt
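A minimal sketch of the fine-tuned-BERT variant; the checkpoint and the three-level label scheme are assumptions (the paper's actual labels and training setup may differ):

```python
# Hedged sketch: a BERT seniority classifier; the head is untrained here and
# would need fine-tuning on labeled resumes before predictions mean anything.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

labels = ["junior", "mid-level", "senior"]  # assumed label scheme
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

text = "Led a team of 12 engineers and owned the platform roadmap."
inputs = tok(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    pred = model(**inputs).logits.argmax(dim=-1).item()
print(labels[pred])
```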
[24] Agentic LLMs for Question Answering over Tabular Data
Rishit Tyagi, Mohit Gupta, Rahul Bouri
Main category: cs.CL
TL;DR: Proposes an LLM-based NL-to-SQL approach for table QA that achieves ~71% accuracy on DataBench benchmarks, significantly outperforming baselines (~26-27%) through a multi-stage pipeline with iterative refinement.
Details
Motivation: Table QA presents unique challenges due to diverse table structures, sizes, and data types in real-world scenarios, requiring robust methods to handle structured queries effectively.
Method: Multi-stage pipeline using LLMs (GPT-4o, GPT-4o-mini, DeepSeek v2:16b) for NL-to-SQL conversion, involving example selection, SQL generation, answer extraction, verification, and iterative refinement (see the loop sketch below).
Result: Achieved 70.5% accuracy on DataBench QA and 71.6% on DataBench Lite QA, significantly surpassing baseline scores of 26% and 27% respectively.
Conclusion: LLM-driven Table QA shows strong performance but has limitations; the paper provides insights into both strengths and weaknesses of this approach for structured data querying.
Abstract: Question Answering over Tabular Data (Table QA) presents unique challenges due to the diverse structure, size, and data types of real-world tables. The SemEval 2025 Task 8 (DataBench) introduced a benchmark composed of large-scale, domain-diverse datasets to evaluate the ability of models to accurately answer structured queries. We propose a Natural Language to SQL (NL-to-SQL) approach leveraging large language models (LLMs) such as GPT-4o, GPT-4o-mini, and DeepSeek v2:16b to generate SQL queries dynamically. Our system follows a multi-stage pipeline involving example selection, SQL query generation, answer extraction, verification, and iterative refinement. Experiments demonstrate the effectiveness of our approach, achieving 70.5% accuracy on DataBench QA and 71.6% on DataBench Lite QA, significantly surpassing baseline scores of 26% and 27% respectively. This paper details our methodology, experimental results, and alternative approaches, providing insights into the strengths and limitations of LLM-driven Table QA.
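A minimal sketch of the generate-execute-refine loop; `llm` is a hypothetical callable standing in for any of the models named above, and the prompt wording is illustrative:

```python
# Hedged sketch: NL-to-SQL with execution-based verification and refinement.
import sqlite3

def table_qa(question, schema, conn, llm, max_rounds=3):
    prompt = f"Schema:\n{schema}\nQuestion: {question}\nSQL:"
    for _ in range(max_rounds):
        sql = llm(prompt)
        try:
            rows = conn.execute(sql).fetchall()   # verification by execution
        except sqlite3.Error as err:              # invalid SQL: ask for a fix
            prompt += f"\n{sql}\nError: {err}\nFixed SQL:"
            continue
        if rows:                                  # answer extraction
            return rows
        prompt += f"\n{sql}\nReturned no rows; revise the query.\nSQL:"
    return None
```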
[25] From scratch to silver: Creating trustworthy training data for patent-SDG classification using Large Language Models
Grazia Sveva Ascione, Nicolò Tamagnone
Main category: cs.CL
TL;DR: Weak supervision approach using LLMs and patent-to-paper citations creates scalable patent-SDG classification system that outperforms traditional methods.
Details
Motivation: Lack of large labeled datasets for patent-SDG classification limits supervised learning, while existing methods lack scalability and generalizability.
Method: Uses weak supervision with NPL citations as noisy signal, develops composite labeling function with LLMs to extract structured concepts, computes cross-domain similarity with rank-based retrieval, and calibrates via positive-only loss.
Result: Creates silver-standard multi-label dataset; outperforms baselines in internal validation and shows greater thematic coherence in external validation using network modularity metrics.
Conclusion: Weak supervision and semantic alignment can effectively enhance SDG classification at scale, enabling better tracking of innovation addressing global challenges.
Abstract: Classifying patents by their relevance to the UN Sustainable Development Goals (SDGs) is crucial for tracking how innovation addresses global challenges. However, the absence of a large, labeled dataset limits the use of supervised learning. Existing methods, such as keyword searches, transfer learning, and citation-based heuristics, lack scalability and generalizability. This paper frames patent-to-SDG classification as a weak supervision problem, using citations from patents to SDG-tagged scientific publications (NPL citations) as a noisy initial signal. To address its sparsity and noise, we develop a composite labeling function (LF) that uses large language models (LLMs) to extract structured concepts, namely functions, solutions, and applications, from patents and SDG papers based on a patent ontology. Cross-domain similarity scores are computed and combined using a rank-based retrieval approach. The LF is calibrated via a custom positive-only loss that aligns with known NPL-SDG links without penalizing discovery of new SDG associations. The result is a silver-standard, soft multi-label dataset mapping patents to SDGs, enabling the training of effective multi-label regression models. We validate our approach through two complementary strategies: (1) internal validation against held-out NPL-based labels, where our method outperforms several baselines including transformer-based models and zero-shot LLMs; and (2) external validation using network modularity in patent citation, co-inventor, and co-applicant graphs, where our labels reveal greater thematic, cognitive, and organizational coherence than traditional technological classifications. These results show that weak supervision and semantic alignment can enhance SDG classification at scale.
[26] MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems
Channdeth Sok, David Luz, Yacine Haddam
Main category: cs.CL
TL;DR: MetaRAG is a metamorphic testing framework for detecting hallucinations in RAG systems through factoid decomposition, controlled mutations, and context verification without requiring ground truth or model access.
Details
Motivation: LLMs deployed in enterprise applications suffer from hallucinations, and existing detection methods don’t address the unique challenges of RAG systems where responses must align with retrieved evidence.
Method: Four-stage framework: 1) decompose answers into atomic factoids, 2) generate mutations using synonym/antonym substitutions, 3) verify variants against retrieved context, 4) aggregate penalties into hallucination scores with span-level localization.
Result: Effective hallucination detection demonstrated on proprietary enterprise dataset, enabling trustworthy deployment of RAG-based conversational agents with identity-aware safeguards.
Conclusion: MetaRAG provides real-time, unsupervised black-box testing for RAG systems, offering practical hallucination detection and identity-aware safeguards for enterprise deployment without requiring ground truth or model internals.
Abstract: Large Language Models (LLMs) are increasingly deployed in enterprise applications, yet their reliability remains limited by hallucinations, i.e., confident but factually incorrect information. Existing detection approaches, such as SelfCheckGPT and MetaQA, primarily target standalone LLMs and do not address the unique challenges of Retrieval-Augmented Generation (RAG) systems, where responses must be consistent with retrieved evidence. We therefore present MetaRAG, a metamorphic testing framework for hallucination detection in Retrieval-Augmented Generation (RAG) systems. MetaRAG operates in a real-time, unsupervised, black-box setting, requiring neither ground-truth references nor access to model internals, making it suitable for proprietary and high-stakes domains. The framework proceeds in four stages: (1) decompose answers into atomic factoids, (2) generate controlled mutations of each factoid using synonym and antonym substitutions, (3) verify each variant against the retrieved context (synonyms are expected to be entailed and antonyms contradicted), and (4) aggregate penalties for inconsistencies into a response-level hallucination score. Crucially for identity-aware AI, MetaRAG localizes unsupported claims at the factoid span where they occur (e.g., pregnancy-specific precautions, LGBTQ+ refugee rights, or labor eligibility), allowing users to see flagged spans and enabling system designers to configure thresholds and guardrails for identity-sensitive queries. Experiments on a proprietary enterprise dataset illustrate the effectiveness of MetaRAG for detecting hallucinations and enabling trustworthy deployment of RAG-based conversational agents. We also outline a topic-based deployment design that translates MetaRAG’s span-level scores into identity-aware safeguards; this design is discussed but not evaluated in our experiments.
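The four-stage scoring loop lends itself to a short sketch. The following Python outline assumes hypothetical `mutate` and `nli` helpers (for stage-2 mutation generation and stage-3 entailment checking); the 0.5 penalty weights are illustrative, not the paper's values.

```python
def hallucination_score(factoids, context, mutate, nli):
    """MetaRAG-style scoring sketch over already-extracted factoids (stage 1).

    mutate(factoid) -> (synonym_variant, antonym_variant)   # hypothetical
    nli(premise, hypothesis) -> "entailment" | "contradiction" | "neutral"
    """
    penalties, flagged = [], []
    for fact in factoids:
        syn, ant = mutate(fact)                      # stage 2: controlled mutations
        penalty = 0.0
        if nli(context, syn) != "entailment":        # stage 3: synonym should be entailed
            penalty += 0.5
        if nli(context, ant) != "contradiction":     # stage 3: antonym should be contradicted
            penalty += 0.5
        penalties.append(penalty)
        if penalty > 0:
            flagged.append(fact)                     # span-level localization
    score = sum(penalties) / max(len(penalties), 1)  # stage 4: response-level score
    return score, flagged
```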
[27] Modelling Analogies and Analogical Reasoning: Connecting Cognitive Science Theory and NLP Research
Molly R Petersen, Claire E Stevenson, Lonneke van der Plas
Main category: cs.CL
TL;DR: This paper connects cognitive science theories of analogical reasoning to NLP research, showing how cognitive processes can improve relational understanding in text beyond entity-level similarity.
Details
Motivation: To bridge cognitive science theories about analogical reasoning with current NLP research, as these cognitive processes are not typically viewed through a cognitive lens in NLP despite their relevance.
Method: Summarizing key theories about analogical reasoning processes from cognitive science literature and relating them to concepts in natural language processing.
Result: Demonstrates how cognitive notions of analogical reasoning are relevant for several major challenges in NLP research beyond just analogy solving.
Conclusion: Applying cognitive perspectives on analogical reasoning can guide NLP researchers to better optimize relational understanding in text rather than relying heavily on entity-level similarity.
Abstract: Analogical reasoning is an essential aspect of human cognition. In this paper, we summarize key theory about the processes underlying analogical reasoning from the cognitive science literature and relate it to current research in natural language processing. While these processes can be easily linked to concepts in NLP, they are generally not viewed through a cognitive lens. Furthermore, we show how these notions are relevant for several major challenges in NLP research, not directly related to analogy solving. This may guide researchers to better optimize relational understanding in text, as opposed to relying heavily on entity-level similarity.
[28] Hierarchical Bracketing Encodings Work for Dependency Graphs
Ana Ezquerro, Carlos Gómez-Rodríguez, David Vilares
Main category: cs.CL
TL;DR: Hierarchical bracketing encodings enable linear-time dependency graph parsing with n tagging actions while handling reentrancies, cycles, and empty nodes, reducing label space while maintaining structural information.
Details
Motivation: To develop a practical approach for dependency graph parsing that can handle complex linguistic phenomena like reentrancies, cycles, and empty nodes while maintaining efficiency and reducing computational complexity.
Method: Encodes dependency graphs as sequences using hierarchical bracketing encodings, enabling linear-time parsing with n tagging actions. This representation substantially reduces the label space compared to existing graph linearizations while preserving structural information.
Result: Competitive results on a multilingual and multi-formalism benchmark, with consistent improvements over other methods in exact match accuracy.
Conclusion: Hierarchical bracketing encodings provide an effective and efficient approach for dependency graph parsing, offering practical advantages in handling complex linguistic structures while maintaining parsing efficiency and accuracy.
Abstract: We revisit hierarchical bracketing encodings from a practical perspective in the context of dependency graph parsing. The approach encodes graphs as sequences, enabling linear-time parsing with $n$ tagging actions, and still representing reentrancies, cycles, and empty nodes. Compared to existing graph linearizations, this representation substantially reduces the label space while preserving structural information. We evaluate it on a multilingual and multi-formalism benchmark, showing competitive results and consistent improvements over other methods in exact match accuracy.
[29] GrACE: A Generative Approach to Better Confidence Elicitation in Large Language Models
Zhaohan Zhang, Ziquan Liu, Ioannis Patras
Main category: cs.CL
TL;DR: GrACE is a novel confidence elicitation method for LLMs that uses hidden state similarity to a special token for real-time confidence estimation, achieving superior calibration and discriminative capacity without additional sampling or auxiliary models.
Details
Motivation: Existing confidence elicitation methods for LLMs are either computationally expensive or poorly calibrated, making them impractical for real-world deployment in high-stakes applications like healthcare and finance.
Method: GrACE uses a special token appended to the vocabulary and measures similarity between the last hidden state and this token’s embedding to express confidence. The model is fine-tuned with calibration targets associated with accuracy.
Result: Experiments with three LLMs and two benchmark datasets show GrACE achieves best discriminative capacity and calibration on open-ended generation tasks, outperforming six competing methods. It also reduces required samples in test-time scaling schemes while improving accuracy.
Conclusion: GrACE provides a practical solution for deploying LLMs with scalable, reliable, and real-time confidence estimation, making it suitable for high-stakes real-world applications.
Abstract: Assessing the reliability of Large Language Models (LLMs) by confidence elicitation is a prominent approach to AI safety in high-stakes applications, such as healthcare and finance. Existing methods either require expensive computational overhead or suffer from poor calibration, making them impractical and unreliable for real-world deployment. In this work, we propose GrACE, a Generative Approach to Confidence Elicitation that enables scalable and reliable confidence elicitation for LLMs. GrACE adopts a novel mechanism in which the model expresses confidence by the similarity between the last hidden state and the embedding of a special token appended to the vocabulary, in real-time. We fine-tune the model for calibrating the confidence with calibration targets associated with accuracy. Experiments with three LLMs and two benchmark datasets show that the confidence produced by GrACE achieves the best discriminative capacity and calibration on open-ended generation tasks, outperforming six competing methods without resorting to additional sampling or an auxiliary model. Moreover, we propose two strategies for improving test-time scaling based on confidence induced by GrACE. Experimental results show that using GrACE not only improves the accuracy of the final decision but also significantly reduces the number of required samples in the test-time scaling scheme, indicating the potential of GrACE as a practical solution for deploying LLMs with scalable, reliable, and real-time confidence estimation.
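The confidence readout itself is simple to sketch. Below is a minimal PyTorch version assuming a Hugging Face-style causal LM and that confidence is the cosine similarity between the final hidden state and the special token's input embedding; the token name `<conf>` and the cosine readout are assumptions, and the paper additionally fine-tunes the model so this score is calibrated to accuracy.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def grace_confidence(model, tokenizer, prompt: str, conf_token: str = "<conf>") -> float:
    """Read out confidence as similarity between the last hidden state and the
    embedding of a special token appended to the vocabulary (sketch)."""
    conf_id = tokenizer.convert_tokens_to_ids(conf_token)   # token added beforehand
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    last_hidden = out.hidden_states[-1][0, -1]              # final layer, last position
    conf_emb = model.get_input_embeddings().weight[conf_id]
    return F.cosine_similarity(last_hidden, conf_emb, dim=0).item()
```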
[30] Mitigating Language Barriers in Education: Developing Multilingual Digital Learning Materials with Machine Translation
Lucie Poláková, Martin Popel, Věra Kloudová, Michal Novák, Mariia Anisimova, Jiří Balhar
Main category: cs.CL
TL;DR: The EdUKate project develops multilingual educational materials using machine translation for Czech schools, focusing on Czech-Ukrainian translation of interactive exercises with special handling of formatted content and terminology.
Details
Motivation: To address the needs of non-Czech-speaking students in Czech primary and secondary schools by providing accessible multilingual learning materials through digital education and machine translation technologies.
Method: Combines digital education, linguistics, translation studies, and machine translation to develop a direct Czech-Ukrainian MT system tailored for educational content. Processes formatted content (XML/PDF) and handles technical terminology. Includes teacher surveys to identify needs.
Result: Developed multilingual learning materials (up to 9,000 exercises translated into Ukrainian, English, German). Created and evaluated a specialized Czech-Ukrainian machine translation system. All applications are freely available to students, educators, and researchers.
Conclusion: The project successfully bridges language barriers in Czech education through tailored machine translation solutions, making educational resources accessible to diverse student populations while addressing specific formatting and terminology challenges in educational content.
Abstract: The EdUKate project combines digital education, linguistics, translation studies, and machine translation to develop multilingual learning materials for Czech primary and secondary schools. Launched through collaboration between a major Czech academic institution and the country’s largest educational publisher, the project is aimed at translating up to 9,000 multimodal interactive exercises from Czech into Ukrainian, English, and German for an educational web portal. It emphasizes the development and evaluation of a direct Czech-Ukrainian machine translation system tailored to the educational domain, with special attention to processing formatted content such as XML and PDF and handling technical and scientific terminology. We present findings from an initial survey of Czech teachers regarding the needs of non-Czech-speaking students and describe the system’s evaluation and implementation on the web portal. All resulting applications are freely available to students, educators, and researchers.
[31] Towards Explainable Job Title Matching: Leveraging Semantic Textual Relatedness and Knowledge Graphs
Vadim Zadykian, Bruno Andrade, Haithem Afli
Main category: cs.CL
TL;DR: Self-supervised hybrid architecture combining sentence embeddings with knowledge graphs improves job title matching, with 25% RMSE reduction in high semantic relatedness regions through stratified evaluation.
Details
Motivation: Address the challenge of job title matching in resume recommendation systems where lexical similarity is limited or misleading, requiring nuanced semantic textual relatedness understanding.
Method: Self-supervised hybrid architecture combining dense sentence embeddings with domain-specific knowledge graphs via graph neural networks, with stratified evaluation across low, medium, and high semantic relatedness regions.
Result: Fine-tuned SBERT models augmented with knowledge graphs achieved 25% RMSE reduction in high-STR region compared to strong baselines, showing consistent improvements in semantic alignment.
Conclusion: Combining knowledge graphs with text embeddings provides significant benefits, and stratified regional performance analysis reveals hidden model strengths/weaknesses, supporting better model selection for HR systems requiring fairness and explainability.
Abstract: Semantic Textual Relatedness (STR) captures nuanced relationships between texts that extend beyond superficial lexical similarity. In this study, we investigate STR in the context of job title matching - a key challenge in resume recommendation systems, where overlapping terms are often limited or misleading. We introduce a self-supervised hybrid architecture that combines dense sentence embeddings with domain-specific Knowledge Graphs (KGs) to improve both semantic alignment and explainability. Unlike previous work that evaluated models on aggregate performance, our approach emphasizes data stratification by partitioning the STR score continuum into distinct regions: low, medium, and high semantic relatedness. This stratified evaluation enables a fine-grained analysis of model performance across semantically meaningful subspaces. We evaluate several embedding models, both with and without KG integration via graph neural networks. The results show that fine-tuned SBERT models augmented with KGs produce consistent improvements in the high-STR region, where the RMSE is reduced by 25% over strong baselines. Our findings highlight not only the benefits of combining KGs with text embeddings, but also the importance of regional performance analysis in understanding model behavior. This granular approach reveals strengths and weaknesses hidden by global metrics, and supports more targeted model selection for use in Human Resources (HR) systems and applications where fairness, explainability, and contextual matching are essential.
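The stratified evaluation is easy to reproduce in outline. A minimal NumPy sketch follows; the cut points 0.33/0.66 are illustrative assumptions, since the paper partitions the STR continuum into low/medium/high regions but the exact bounds are not given here.

```python
import numpy as np

def stratified_rmse(y_true, y_pred, bounds=(0.33, 0.66)):
    """RMSE within low/medium/high STR regions (bin edges are assumptions)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    lo, hi = bounds
    regions = {
        "low": y_true < lo,
        "medium": (y_true >= lo) & (y_true < hi),
        "high": y_true >= hi,
    }
    return {
        name: float(np.sqrt(np.mean((y_true[mask] - y_pred[mask]) ** 2)))
        for name, mask in regions.items() if mask.any()
    }
```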
[32] DeMeVa at LeWiDi-2025: Modeling Perspectives with In-Context Learning and Label Distribution Learning
Daniil Ignatev, Nan Li, Hugh Mee Wong, Anh Dang, Shane Kaszefski Yaschuk
Main category: cs.CL
TL;DR: DeMeVa team’s approaches for LeWiDi 2025: ICL with LLMs using different sampling strategies and LDL methods with RoBERTa fine-tuning for soft label predictions.
Details
Motivation: To explore effective methods for predicting perspectivist annotations (annotator-specific labels) and soft label predictions in learning with disagreements tasks.
Method: Two approaches: 1) In-context learning with large language models comparing example sampling strategies, 2) Label distribution learning methods with RoBERTa evaluating various fine-tuning techniques.
Result: ICL effectively predicts annotator-specific annotations, and aggregating these predictions into soft labels achieves competitive performance. LDL methods show promise for soft label predictions.
Conclusion: Both ICL and LDL methods are valuable for perspectivist annotation tasks, with LDL methods deserving further exploration by the perspectivist community.
Abstract: This system paper presents the DeMeVa team’s approaches to the third edition of the Learning with Disagreements shared task (LeWiDi 2025; Leonardelli et al., 2025). We explore two directions: in-context learning (ICL) with large language models, where we compare example sampling strategies; and label distribution learning (LDL) methods with RoBERTa (Liu et al., 2019b), where we evaluate several fine-tuning methods. Our contributions are twofold: (1) we show that ICL can effectively predict annotator-specific annotations (perspectivist annotations), and that aggregating these predictions into soft labels yields competitive performance; and (2) we argue that LDL methods are promising for soft label predictions and merit further exploration by the perspectivist community.
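The aggregation of annotator-specific predictions into soft labels can be sketched in a few lines; this frequency-normalized version is an assumption about the aggregation rule, chosen as the simplest one consistent with the description.

```python
from collections import Counter

def soft_labels(annotator_predictions):
    """Turn per-annotator (perspectivist) label predictions into a soft label
    distribution by normalized counting (a minimal sketch)."""
    counts = Counter(annotator_predictions)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Usage: soft_labels(["offensive", "not_offensive", "offensive"])
#        -> {"offensive": 2/3, "not_offensive": 1/3}
```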
[33] Prompting the Market? A Large-Scale Meta-Analysis of GenAI in Finance NLP (2022-2025)
Paolo Pedinotti, Peter Baumann, Nathan Jessurun, Leslie Barrett, Enrico Santus
Main category: cs.CL
TL;DR: MetaGraph is a methodology that extracts knowledge graphs from scientific literature to analyze research trends in financial NLP, revealing three evolutionary phases from early LLM adoption to critical reflection and modular system integration.
Details
Motivation: The rapid transformation of financial NLP through LLMs has outpaced traditional surveys, creating a need for structured, data-driven analysis of research trends and evolution in the field.
Method: Defined an ontology for financial NLP research and applied an LLM-based extraction pipeline to analyze 681 papers (2022-2025) to construct queryable knowledge graphs.
Result: Revealed three key evolutionary phases: early LLM adoption and task/dataset innovation; critical reflection on LLM limitations; and growing integration of peripheral techniques into modular systems.
Conclusion: Provides a structured, queryable view of financial NLP evolution and demonstrates a reusable approach for mapping scientific progress across domains, offering clear insights into emerging trends and methodological shifts.
Abstract: Large Language Models (LLMs) have rapidly reshaped financial NLP, enabling new tasks and driving a proliferation of datasets and diversification of data sources. Yet, this transformation has outpaced traditional surveys. In this paper, we present MetaGraph, a generalizable methodology for extracting knowledge graphs from scientific literature and analyzing them to obtain a structured, queryable view of research trends. We define an ontology for financial NLP research and apply an LLM-based extraction pipeline to 681 papers (2022-2025), enabling large-scale, data-driven analysis. MetaGraph reveals three key phases: early LLM adoption and task/dataset innovation; critical reflection on LLM limitations; and growing integration of peripheral techniques into modular systems. This structured view offers both practitioners and researchers a clear understanding of how financial NLP has evolved, highlighting emerging trends, shifting priorities, and methodological shifts, while also demonstrating a reusable approach for mapping scientific progress in other domains.
[34] Personality-Enhanced Social Recommendations in SAMI: Exploring the Role of Personality Detection in Matchmaking
Brittany Harbison, Samuel Taubman, Travis Taylor, Ashok K. Goel
Main category: cs.CL
TL;DR: Using GPT’s zero-shot capability to detect Big-Five personality traits from forum posts to improve social matchmaking in online learning platforms.
Details
Motivation: Online courses lack organic social connections, and existing systems like SAMI have limited Theory of Mind capabilities, particularly in understanding personality which affects recommendation relevance.
Method: Developed a personality detection model using GPT’s zero-shot capability to infer Big-Five personality traits from student introduction posts in online forums, then integrated this into SAMI’s matchmaking system.
Result: The GPT-based model demonstrated efficacy in personality detection compared to established benchmarks, and initial integration showed personality traits can complement existing matching factors.
Conclusion: Personality-informed recommendations show promise for improving social connections in online learning, though further evaluation is needed to assess full impact on student engagement and match quality.
Abstract: Social connection is a vital part of learning, yet online course environments present barriers to the organic formation of social groups. SAMI offers one solution by facilitating student connections, but its effectiveness is constrained by an incomplete Theory of Mind, limiting its ability to create an effective mental model of a student. One facet of this is its inability to intuit personality, which may influence the relevance of its recommendations. To explore this, we propose a personality detection model utilizing GPT’s zero-shot capability to infer Big-Five personality traits from forum introduction posts, often encouraged in online courses. We benchmark its performance against established models, demonstrating its efficacy in this task. Furthermore, we integrate this model into SAMI’s entity-based matchmaking system, enabling personality-informed social recommendations. Initial integration suggests personality traits can complement existing matching factors, though additional evaluation is required to determine their full impact on student engagement and match quality.
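A zero-shot trait-elicitation prompt of the kind described might look as follows; the wording and the 1-5 scale are illustrative assumptions, not SAMI's actual prompt.

```python
BIG_FIVE = ["openness", "conscientiousness", "extraversion",
            "agreeableness", "neuroticism"]

def personality_prompt(intro_post: str) -> str:
    """Build a zero-shot prompt to infer Big-Five traits from a forum
    introduction post (hypothetical wording)."""
    traits = ", ".join(BIG_FIVE)
    return (
        f"Rate the author of the following course-forum introduction on each "
        f"Big-Five trait ({traits}) from 1 (low) to 5 (high). "
        f"Answer as JSON mapping trait to score.\n\nPost: {intro_post}"
    )
```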
[35] Fluent but Unfeeling: The Emotional Blind Spots of Language Models
Bangzhao Shu, Isha Joshi, Melissa Karnaze, Anh C. Pham, Ishita Kakkar, Sindhu Kothe, Arpine Hovasapian, Mai ElSherief
Main category: cs.CL
TL;DR: LLMs struggle with fine-grained emotion alignment despite their versatility in NLP, requiring better contextual understanding for mental health applications.
Details
Motivation: To evaluate whether LLMs can align with human emotions at a fine-grained level, addressing the gap in existing research that focuses only on broad emotion categories.
Method: Introduces EXPRESS benchmark dataset with 251 fine-grained emotion labels from Reddit, evaluates LLMs using comprehensive framework that decomposes emotions into eight basic categories, and conducts systematic testing under various prompt settings.
Result: LLMs find it challenging to accurately predict emotions that align with human self-disclosed emotions, often failing to capture contextual cues effectively despite generating theoretically consistent emotion terms.
Conclusion: Highlights limitations of LLMs in fine-grained emotion alignment and provides insights for future research to enhance contextual understanding in mental health applications.
Abstract: The versatility of Large Language Models (LLMs) in natural language understanding has made them increasingly popular in mental health research. While many studies explore LLMs’ capabilities in emotion recognition, a critical gap remains in evaluating whether LLMs align with human emotions at a fine-grained level. Existing research typically focuses on classifying emotions into predefined, limited categories, overlooking more nuanced expressions. To address this gap, we introduce EXPRESS, a benchmark dataset curated from Reddit communities featuring 251 fine-grained, self-disclosed emotion labels. Our comprehensive evaluation framework examines predicted emotion terms and decomposes them into eight basic emotions using established emotion theories, enabling a fine-grained comparison. Systematic testing of prevalent LLMs under various prompt settings reveals that accurately predicting emotions that align with human self-disclosed emotions remains challenging. Qualitative analysis further shows that while certain LLMs generate emotion terms consistent with established emotion theories and definitions, they sometimes fail to capture contextual cues as effectively as human self-disclosures. These findings highlight the limitations of LLMs in fine-grained emotion alignment and offer insights for future research aimed at enhancing their contextual understanding.
[36] LAVA: Language Model Assisted Verbal Autopsy for Cause-of-Death Determination
Yiqun T. Chen, Tyler H. McCormick, Li Liu, Abhirup Datta
Main category: cs.CL
TL;DR: LA-VA pipeline combines LLMs with traditional methods for improved verbal autopsy cause-of-death prediction, showing GPT-5 outperforms baselines by 5-10% across age groups.
Details
Motivation: Verbal autopsy is critical for estimating causes of death in resource-limited settings where medical certification is unavailable, but current methods need improvement.
Method: Combines Large Language Models (LLMs) with traditional algorithmic approaches and embedding-based classification, evaluated on PHMRC dataset across three age categories using multiple approaches including GPT-5 predictions and meta-learner ensembles.
Result: GPT-5 achieves highest individual performance with average test site accuracies of 48.6% (Adult), 50.5% (Child), and 53.5% (Neonate), outperforming traditional statistical machine learning baselines by 5-10%.
Conclusion: Simple off-the-shelf LLM-assisted approaches could substantially improve verbal autopsy accuracy, with important implications for global health surveillance in low-resource settings.
Abstract: Verbal autopsy (VA) is a critical tool for estimating causes of death in resource-limited settings where medical certification is unavailable. This study presents LA-VA, a proof-of-concept pipeline that combines Large Language Models (LLMs) with traditional algorithmic approaches and embedding-based classification for improved cause-of-death prediction. Using the Population Health Metrics Research Consortium (PHMRC) dataset across three age categories (Adult: 7,580; Child: 1,960; Neonate: 2,438), we evaluate multiple approaches: GPT-5 predictions, LCVA baseline, text embeddings, and meta-learner ensembles. Our results demonstrate that GPT-5 achieves the highest individual performance with average test site accuracies of 48.6% (Adult), 50.5% (Child), and 53.5% (Neonate), outperforming traditional statistical machine learning baselines by 5-10%. Our findings suggest that simple off-the-shelf LLM-assisted approaches could substantially improve verbal autopsy accuracy, with important implications for global health surveillance in low-resource settings.
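The meta-learner ensemble can be sketched as simple stacking. The sketch below assumes each base approach (GPT-5, the LCVA baseline, embedding classifiers) has been converted into per-cause probability or one-hot matrices; the logistic-regression meta-learner is an assumption, not necessarily the paper's choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_meta_learner(gpt5_probs, lcva_probs, embed_probs, labels):
    """Stack base cause-of-death predictors into a meta-learner (sketch).

    Each *_probs array has shape (n_records, n_causes); labels has shape
    (n_records,) with the gold cause for each verbal autopsy record."""
    features = np.hstack([gpt5_probs, lcva_probs, embed_probs])
    meta = LogisticRegression(max_iter=1000)
    return meta.fit(features, labels)
```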
[37] Bridging the Capability Gap: Joint Alignment Tuning for Harmonizing LLM-based Multi-Agent Systems
Minghang Zhu, Zhengliang Shi, Zhiwei Xu, Shiguang Wu, Lingjie Wang, Pengjie Ren, Zhaochun Ren, Zhumin Chen
Main category: cs.CL
TL;DR: MOAT is a joint alignment tuning framework that improves collaboration between planning and grounding agents in multi-agent LLM systems through iterative alignment, outperforming SOTA methods by 3.1-4.4%.
Details
Motivation: Existing methods fine-tune planning and grounding agents independently, leading to capability gaps and poor coordination between agents in multi-agent systems.
Method: MOAT alternates between Planning Agent Alignment (optimizing subgoal generation to better guide grounding agent) and Grounding Agent Improving (fine-tuning with diverse subgoal-action pairs for better generalization).
Result: Achieves average improvements of 3.1% on held-in tasks and 4.4% on held-out tasks across six benchmarks, outperforming state-of-the-art baselines.
Conclusion: Theoretical analysis proves MOAT ensures non-decreasing and progressively convergent training, demonstrating effective joint alignment for improved multi-agent collaboration.
Abstract: The advancement of large language models (LLMs) has enabled the construction of multi-agent systems to solve complex tasks by dividing responsibilities among specialized agents, such as a planning agent for subgoal generation and a grounding agent for executing tool-use actions. Most existing methods typically fine-tune these agents independently, leading to capability gaps among them with poor coordination. To address this, we propose MOAT, a Multi-Agent Joint Alignment Tuning framework that improves agents’ collaboration through iterative alignment. MOAT alternates between two key stages: (1) Planning Agent Alignment, which optimizes the planning agent to generate subgoal sequences that better guide the grounding agent; and (2) Grounding Agent Improving, which fine-tunes the grounding agent using diverse subgoal-action pairs generated by the agent itself to enhance its generalization capability. Theoretical analysis proves that MOAT ensures a non-decreasing and progressively convergent training process. Experiments across six benchmarks demonstrate that MOAT outperforms state-of-the-art baselines, achieving average improvements of 3.1% on held-in tasks and 4.4% on held-out tasks.
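The alternation is the heart of the method, so a control-flow sketch may help; the three helpers are passed in as parameters and are hypothetical stand-ins for the paper's two stages, not its actual training code.

```python
def moat(planner, grounder, tasks, optimize_planner, collect_pairs, finetune,
         rounds: int = 3):
    """Schematic MOAT loop: alternate planner alignment and grounder improving."""
    for _ in range(rounds):
        # Stage 1: Planning Agent Alignment -- tune the planner so its subgoal
        # sequences better guide the current grounding agent.
        planner = optimize_planner(planner, grounder, tasks)
        # Stage 2: Grounding Agent Improving -- fine-tune the grounder on
        # diverse subgoal-action pairs it generates itself.
        pairs = collect_pairs(planner, grounder, tasks)
        grounder = finetune(grounder, pairs)
    return planner, grounder
```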
[38] All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens
Siddarth Mamidanna, Daking Rai, Ziyu Yao, Yilun Zhou
Main category: cs.CL
TL;DR: The paper investigates how LLMs perform mental math calculations, identifying a specific computation pattern (All-for-One subgraph) where meaningful computation occurs late and only at the last token using specialized techniques CAMA and ABP.
Details
Motivation: To understand the inner workings of LLMs in computational tasks, specifically how information flows and computations occur across tokens during mental math calculations.
Method: Three-step approach: inhibiting early layer computations, restricting information transfer routes, and forcing computation at last token. Uses Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP) techniques to identify the All-for-One subgraph.
Result: Identified AF1 subgraph achieves high accuracy on mental math tasks, occurs late in layers, transfers across models, and works with various input styles. The subgraph is both sufficient and necessary for performance.
Conclusion: LLMs perform mental math through a specific computation pattern where the last token receives information from other tokens in middle layers, and the identified techniques provide unique advantages for analyzing model internals.
Abstract: Large language models (LLMs) demonstrate proficiency across numerous computational tasks, yet their inner workings remain unclear. In theory, the combination of causal self-attention and multilayer perceptron layers allows every token to access and compute information based on all preceding tokens. In practice, to what extent are such operations present? In this paper, on mental math tasks (i.e., direct math calculation via next-token prediction without explicit reasoning), we investigate this question in three steps: inhibiting input-specific token computations in the initial layers, restricting the routes of information transfer across token positions in the next few layers, and forcing all computation to happen at the last token in the remaining layers. With two proposed techniques, Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP), we identify an All-for-One subgraph (AF1) with high accuracy on a wide variety of mental math tasks, where meaningful computation occurs very late (in terms of layer depth) and only at the last token, which receives information of other tokens in few specific middle layers. Experiments on a variety of models and arithmetic expressions show that this subgraph is sufficient and necessary for high model performance, transfers across different models, and works on a variety of input styles. Ablations on different CAMA and ABP alternatives reveal their unique advantages over other methods, which may be of independent interest.
[39] Steering MoE LLMs via Expert (De)Activation
Mohsen Fayyaz, Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Ryan Rossi, Trung Bui, Hinrich Schütze, Nanyun Peng
Main category: cs.CL
TL;DR: SteerMoE framework detects and controls behavior-linked experts in Mixture-of-Experts LLMs to steer model behaviors like faithfulness and safety without retraining, achieving significant improvements and also demonstrating vulnerability to adversarial attacks.
Details
Motivation: To develop a method for controlling specific behaviors (faithfulness, safety) in MoE-based LLMs by identifying and manipulating specialized experts without requiring model retraining or weight modifications.
Method: Detects experts with distinct activation patterns across paired inputs showing contrasting behaviors, then selectively activates or deactivates these experts during inference to steer model behavior.
Result: Across 11 benchmarks and 6 LLMs: improved safety by up to +20% and faithfulness by +27%; in adversarial mode, reduced safety by -41% alone and -100% when combined with existing jailbreak methods, bypassing all safety guardrails.
Conclusion: SteerMoE provides effective behavior control without retraining but also exposes a new dimension of alignment faking vulnerability hidden within expert networks of MoE models.
Abstract: Mixture-of-Experts (MoE) in Large Language Models (LLMs) routes each token through a subset of specialized Feed-Forward Networks (FFN), known as experts. We present SteerMoE, a framework for steering MoE models by detecting and controlling behavior-linked experts. Our detection method identifies experts with distinct activation patterns across paired inputs exhibiting contrasting behaviors. By selectively (de)activating such experts during inference, we control behaviors like faithfulness and safety without retraining or modifying weights. Across 11 benchmarks and 6 LLMs, our steering raises safety by up to +20% and faithfulness by +27%. In adversarial attack mode, it drops safety by -41% alone, and -100% when combined with existing jailbreak methods, bypassing all safety guardrails and exposing a new dimension of alignment faking hidden within experts.
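Detection and steering both reduce to small tensor operations. The sketch below assumes boolean routing records per paired input and steers by shifting router logits; the detection statistic (activation-frequency gap) and the logit offset are assumptions consistent with the description, not the paper's exact procedure.

```python
import torch

def find_behavior_experts(acts_pos, acts_neg, top_k: int = 8):
    """Rank experts by how differently often they are routed to across paired
    inputs with contrasting behaviors.

    acts_pos, acts_neg: bool tensors of shape (num_pairs, num_layers, num_experts)."""
    gap = acts_pos.float().mean(0) - acts_neg.float().mean(0)  # per-expert frequency gap
    num_experts = gap.shape[1]
    idx = gap.abs().flatten().topk(top_k).indices
    return [(int(i) // num_experts, int(i) % num_experts) for i in idx]

def steer_router_logits(router_logits, expert_ids, boost: float = 10.0):
    """(De)activate selected experts at inference: a positive boost favors
    routing to an expert, a negative boost suppresses it."""
    for e in expert_ids:
        router_logits[..., e] += boost
    return router_logits
```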
[40] CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models
Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, Dong Yu
Main category: cs.CL
TL;DR: RLVR methods suffer from poor exploration leading to premature convergence. CDE framework uses intrinsic curiosity signals from actor (perplexity) and critic (value variance) as exploration bonuses to improve performance.
Details
Motivation: Current RLVR methods often explore poorly, leading to premature convergence and entropy collapse, which limits the reasoning ability enhancement of LLMs.
Method: Curiosity-Driven Exploration (CDE) framework that uses actor’s perplexity over generated responses and critic’s value estimate variance from multi-head architecture as intrinsic exploration bonuses.
Result: Achieves approximately +3 point improvement over standard RLVR using GRPO/PPO on AIME benchmarks, and identifies calibration collapse mechanism in RLVR.
Conclusion: CDE effectively addresses exploration challenges in RLVR through curiosity-driven bonuses, improving performance and providing insights into LLM failure modes.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for enhancing the reasoning ability of Large Language Models (LLMs). Yet current RLVR methods often explore poorly, leading to premature convergence and entropy collapse. To address this challenge, we introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model’s own intrinsic sense of curiosity to guide exploration. We formalize curiosity with signals from both the actor and the critic: for the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Both signals serve as an exploration bonus within the RLVR framework to guide the model. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses; moreover, we connect the critic-wise bonus to the well-established count-based exploration bonus in RL. Empirically, our method achieves an approximate +3 point improvement over standard RLVR using GRPO/PPO on AIME benchmarks. Further analysis identifies a calibration collapse mechanism within RLVR, shedding light on common LLM failure modes.
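The two curiosity signals combine into a single bonus, sketched below; the coefficients and the additive combination are assumptions standing in for the paper's exact bonus shaping.

```python
import torch

def curiosity_bonus(token_logprobs: torch.Tensor, value_heads: torch.Tensor,
                    alpha: float = 0.1, beta: float = 0.1) -> torch.Tensor:
    """Intrinsic exploration bonus for RLVR (sketch).

    token_logprobs: (seq_len,) log-probs of the actor's generated response.
    value_heads:    (num_heads,) value estimates from a multi-head critic."""
    perplexity = torch.exp(-token_logprobs.mean())   # actor signal: response perplexity
    value_var = value_heads.var(unbiased=False)      # critic signal: head disagreement
    return alpha * perplexity + beta * value_var     # added to the verifiable reward
```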
[41] ASTPrompter: Preference-Aligned Automated Language Model Red-Teaming to Generate Low-Perplexity Unsafe Prompts
Amelia F. Hardy, Houjun Liu, Allie Griffith, Bernard Lange, Duncan Eddy, Mykel J. Kochenderfer
Main category: cs.CL
TL;DR: ASTPrompter is a red-teaming method that generates low-perplexity attacks on LLMs, achieving higher success rates with more natural-looking prompts that are harder to filter and transfer better across models.
Details
Motivation: Existing LLM red-teaming approaches focus on high attack success rates but produce high-perplexity prompts that are easy to detect. Low-perplexity attacks are more impactful as they’re harder to filter, more likely to occur naturally, and create more damaging training data.
Method: ASTPrompter uses single-step optimization with contrastive preference learning to train an attacker that maintains low perplexity while achieving high attack success rates.
Result: ASTPrompter achieves 5.1x higher attack success rate on Llama-8.1B with inputs 2.1x more likely to occur. The attacks transfer successfully to Mistral-7B, Qwen-7B, and TinyLlama in both black- and white-box settings.
Conclusion: Perplexity is an important but under-considered factor in red-teaming. ASTPrompter demonstrates efficient frontier discovery between attack success rate and perplexity, enabling more effective and stealthy LLM attacks.
Abstract: Existing LLM red-teaming approaches prioritize high attack success rate, often resulting in high-perplexity prompts. This focus overlooks low-perplexity attacks that are more difficult to filter, more likely to arise during benign usage, and more impactful as negative downstream training examples. In response, we introduce ASTPrompter, a single-step optimization method that uses contrastive preference learning to train an attacker to maintain low perplexity while achieving a high attack success rate (ASR). ASTPrompter achieves an attack success rate 5.1 times higher on Llama-8.1B while using inputs that are 2.1 times more likely to occur according to the frozen LLM. Furthermore, our attack transfers to Mistral-7B, Qwen-7B, and TinyLlama in both black- and white-box settings. Lastly, by tuning a single hyperparameter in our method, we discover successful attack prefixes along an efficient frontier between ASR and perplexity, highlighting perplexity as a previously under-considered factor in red-teaming.
[42] RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution
Jiahui Li, Lin Li, Tai-wei Chang, Kun Kuang, Long Chen, Jun Zhou, Cheng Yang
Main category: cs.CL
TL;DR: RED is a token-level reward redistribution method that assigns fine-grained credit to individual tokens using existing reward models, improving RLHF alignment without additional training costs.
Details
Motivation: Current reward models provide sparse, sequence-level rewards that overlook individual token contributions to desired outcomes, limiting precise alignment with human preferences.
Method: Proposes RED method that redistributes sequence-level rewards to token-level rewards using an off-the-shelf reward model, enabling fine-grained guidance during RL training without modifying the reward model.
Result: Experimental results across diverse datasets and tasks demonstrate superior performance, with enhanced understanding of language nuances and more precise improvements.
Conclusion: Token-level reward redistribution through RED provides effective fine-grained guidance for RLHF, achieving better alignment with human preferences while maintaining minimal computational overhead.
Abstract: Reinforcement learning from human feedback (RLHF) offers a promising approach to aligning large language models (LLMs) with human preferences. Typically, a reward model is trained or supplied to act as a proxy for humans in evaluating generated responses during the reinforcement training phase. However, current reward models operate as sequence-to-one models, allocating a single, sparse, and delayed reward to an entire output sequence. This approach may overlook the significant contributions of individual tokens toward the desired outcome. To this end, we propose a more fine-grained, token-level guidance approach for RL training. Specifically, we introduce RED, a novel reward redistribution method that evaluates and assigns specific credit to each token using an off-the-shelf reward model. Utilizing these fine-grained rewards enhances the model’s understanding of language nuances, leading to more precise performance improvements. Notably, our method does not require modifying the reward model or introducing additional training steps, thereby incurring minimal computational costs. Experimental results across diverse datasets and tasks demonstrate the superiority of our approach.
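One plausible instantiation of the redistribution, sketched below, credits each token with the marginal change in the reward model's score for the growing prefix; the `reward_model(prompt, text)` interface and this marginal rule are assumptions for illustration, not necessarily the paper's exact scheme.

```python
def redistribute_rewards(tokens, prompt, reward_model):
    """Token-level credit from a sequence-level reward model (sketch).

    reward_model(prompt, text) -> float   # hypothetical interface"""
    token_rewards = []
    text, prev = "", reward_model(prompt, "")
    for tok in tokens:
        text += tok
        score = reward_model(prompt, text)
        token_rewards.append(score - prev)   # marginal credit for this token
        prev = score
    return token_rewards   # telescopes to the full-sequence score minus the base score
```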
[43] Thinking with Many Minds: Using Large Language Models for Multi-Perspective Problem-Solving
Sanghyun Park, Boris Maciejovsky, Phanish Puranam
Main category: cs.CL
TL;DR: Synthetic deliberation uses LLMs to simulate discourse between diverse perspective agents, overcoming cognitive limitations of mental simulation for better problem-solving.
Details
Motivation: Cognitive constraints limit mental simulation effectiveness for complex problem-solving that requires entertaining multiple distinct perspectives simultaneously.
Method: LLM-based method using custom GPT model to simulate discourse between agents embodying diverse perspectives, enabling parallel processing and viewpoint synthesis.
Result: Enables concurrent processing of multiple viewpoints without cognitive degradation, parallel exploration of perspectives, and precise control over viewpoint synthesis.
Conclusion: Synthetic deliberation transcends mental simulation limitations and shows promise for strategic planning, policymaking, and conflict resolution by externalizing deliberative process.
Abstract: Complex problem-solving requires cognitive flexibility–the capacity to entertain multiple perspectives while preserving their distinctiveness. This flexibility replicates the “wisdom of crowds” within a single individual, allowing them to “think with many minds.” While mental simulation enables imagined deliberation, cognitive constraints limit its effectiveness. We propose synthetic deliberation, a Large Language Model (LLM)-based method that simulates discourse between agents embodying diverse perspectives, as a solution. Using a custom GPT-based model, we showcase its benefits: concurrent processing of multiple viewpoints without cognitive degradation, parallel exploration of perspectives, and precise control over viewpoint synthesis. By externalizing the deliberative process and distributing cognitive labor between parallel search and integration, synthetic deliberation transcends mental simulation’s limitations. This approach shows promise for strategic planning, policymaking, and conflict resolution.
[44] CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering
Zongxi Li, Yang Li, Haoran Xie, S. Joe Qin
Main category: cs.CL
TL;DR: Proposes CondAmbigQA benchmark with 2,000 ambiguous queries and condition-aware metrics to address query ambiguity in QA, showing 11.75% accuracy improvement when models consider contextual conditions.
Details
Motivation: Users often omit critical information assuming LLMs share their cognitive alignment, leading to ambiguous queries and responses that may be misperceived as hallucinations rather than inherent query ambiguity.
Method: Retrieval-based annotation using Wikipedia fragments to identify possible interpretations and annotate answers with explicit contextual constraints called “conditions” that resolve ambiguities in QA tasks.
Result: Models considering conditions before answering improve accuracy by 11.75%, with additional 7.15% gain when conditions are explicitly provided, demonstrating that apparent hallucinations often stem from query ambiguity.
Conclusion: Condition reasoning is effective in QA, providing tools for rigorous evaluation and showing that many perceived model failures are actually due to inherent query ambiguity rather than model limitations.
Abstract: Users often assume that large language models (LLMs) share their cognitive alignment of context and intent, leading them to omit critical information in question-answering (QA) and produce ambiguous queries. Responses based on misaligned assumptions may be perceived as hallucinations. Therefore, identifying possible implicit assumptions is crucial in QA. To address this fundamental challenge, we propose Conditional Ambiguous Question-Answering (CondAmbigQA), a benchmark comprising 2,000 ambiguous queries and condition-aware evaluation metrics. Our study pioneers “conditions” as explicit contextual constraints that resolve ambiguities in QA tasks through retrieval-based annotation, where retrieved Wikipedia fragments help identify possible interpretations for a given query and annotate answers accordingly. Experiments demonstrate that models considering conditions before answering improve answer accuracy by 11.75%, with an additional 7.15% gain when conditions are explicitly provided. These results highlight that apparent hallucinations may stem from inherent query ambiguity rather than model failure, and demonstrate the effectiveness of condition reasoning in QA, providing researchers with tools for rigorous evaluation.
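A condition-first prompt of the kind the benchmark rewards might look like this; the wording is an illustrative assumption rather than the benchmark's official template.

```python
def condition_aware_prompt(query: str, passages: list[str]) -> str:
    """Ask the model to enumerate conditions (interpretations) before
    answering, giving one answer per condition (hypothetical wording)."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"Context:\n{context}\n\nQuestion: {query}\n\n"
        "The question may be ambiguous. First list the distinct conditions "
        "under which it can be read, then give one answer per condition, "
        "citing the supporting passage."
    )
```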
[45] SimMark: A Robust Sentence-Level Similarity-Based Watermarking Algorithm for Large Language Models
Amirhossein Dabiriaghdam, Lele Wang
Main category: cs.CL
TL;DR: SimMark is a robust sentence-level watermarking algorithm that detects LLM-generated text by embedding imperceptible statistical patterns using semantic similarity and rejection sampling, achieving superior robustness against paraphrasing while maintaining text quality.
Details
Motivation: The widespread adoption of LLMs requires reliable methods to detect machine-generated text, but existing approaches often lack robustness against paraphrasing attacks or require access to model internals.
Method: Leverages semantic sentence embeddings with rejection sampling to embed detectable patterns, uses soft counting mechanism for robustness, and works without requiring model internal access (compatible with both open and API-based LLMs).
Result: Experimental results show SimMark sets new benchmarks for robust watermarking, surpassing prior sentence-level techniques in robustness, sampling efficiency, and cross-domain applicability while maintaining text quality and fluency.
Conclusion: SimMark provides an effective and practical solution for tracing LLM-generated content through robust, imperceptible watermarking that works across diverse domains and LLM types without compromising output quality.
Abstract: The widespread adoption of large language models (LLMs) necessitates reliable methods to detect LLM-generated text. We introduce SimMark, a robust sentence-level watermarking algorithm that makes LLMs’ outputs traceable without requiring access to model internals, making it compatible with both open and API-based LLMs. By leveraging the similarity of semantic sentence embeddings combined with rejection sampling to embed detectable statistical patterns imperceptible to humans, and employing a soft counting mechanism, SimMark achieves robustness against paraphrasing attacks. Experimental results demonstrate that SimMark sets a new benchmark for robust watermarking of LLM-generated content, surpassing prior sentence-level watermarking techniques in robustness, sampling efficiency, and applicability across diverse domains, all while maintaining the text quality and fluency.
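Generation-side rejection sampling and detection-side counting can both be sketched briefly. The interval bounds, the single-predecessor similarity scheme, and the hard (rather than soft) count in `detect` are simplifying assumptions, not the paper's exact construction.

```python
import numpy as np

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def watermarked_generation(sample_sentence, embed, n_sentences: int,
                           lo: float = 0.55, hi: float = 0.75, max_tries: int = 20):
    """Rejection-sample each next sentence until its embedding similarity to
    the previous sentence lands in a secret interval [lo, hi] (sketch)."""
    sents = [sample_sentence()]
    for _ in range(n_sentences - 1):
        cand = sample_sentence()
        for _ in range(max_tries):
            if lo <= cos(embed(sents[-1]), embed(cand)) <= hi:
                break
            cand = sample_sentence()
        sents.append(cand)   # keep the last candidate if none was accepted
    return sents

def detect(sents, embed, lo: float = 0.55, hi: float = 0.75) -> float:
    """Fraction of adjacent sentence pairs inside the interval; the paper's
    soft counting would relax the hard boundary used here."""
    hits = sum(lo <= cos(embed(a), embed(b)) <= hi
               for a, b in zip(sents, sents[1:]))
    return hits / max(len(sents) - 1, 1)
```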
[46] Are Generative Models Underconfident? Better Quality Estimation with Boosted Model Probability
Tu Anh Dinh, Jan Niehues
Main category: cs.CL
TL;DR: BoostedProb improves quality estimation by boosting model confidence when multiple correct output options exist, outperforming raw probability and competing with more complex methods.
Details
Motivation: Output probability in text-generation models can appear underconfident because multiple correct options spread out the probability distribution, so lower probability does not necessarily indicate lower quality.
Method: Proposed the BoostedProb approach, which boosts the model’s confidence in cases with multiple viable output options while keeping the same complexity as raw probability.
Result: Achieved average +0.194 improvement in Pearson correlation to ground-truth quality compared to raw model probability, and performs competitively with more costly supervised/ensemble approaches.
Conclusion: BoostedProb provides an effective, low-complexity solution for quality estimation that addresses the underconfidence issue in text-generation models’ output probabilities.
Abstract: Quality Estimation (QE) is the task of estimating the quality of model output during inference, when the ground truth is not available. Deriving output quality from the model’s output probability is the simplest and lowest-effort approach. However, we show that the output probability of text-generation models can appear underconfident. At each output step, there can be multiple correct options, making the probability distribution spread out more. Thus, lower probability does not necessarily mean lower output quality. Due to this observation, we propose a QE approach called BoostedProb, which boosts the model’s confidence in cases where there are multiple viable output options. With no increase in complexity, BoostedProb is notably better than raw model probability in different settings, achieving on average a +0.194 improvement in Pearson correlation to ground-truth quality. It also comes close to or outperforms more costly approaches like supervised or ensemble-based QE in certain settings.
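The boosting idea fits in a few lines. The viability rule below (probability within a factor `tau` of the best option) is an assumption for illustration; the paper's exact criterion for "multiple viable options" may differ.

```python
import torch

def boosted_prob(logits: torch.Tensor, chosen_id: int, tau: float = 0.5) -> float:
    """BoostedProb-style step score: when several tokens are viable, credit the
    chosen token with their combined probability mass (sketch).

    logits: (vocab_size,) next-token logits; chosen_id: generated token id."""
    probs = torch.softmax(logits, dim=-1)
    viable = probs >= tau * probs.max()      # plausible alternative continuations
    if viable[chosen_id]:
        return probs[viable].sum().item()    # boosted confidence
    return probs[chosen_id].item()           # fall back to the raw probability
```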
[47] Culturally-Nuanced Story Generation for Reasoning in Low-Resource Languages: The Case of Javanese and Sundanese
Salsabila Zahirah Pranida, Rifo Ahmad Genadi, Fajri Koto
Main category: cs.CL
TL;DR: LLMs can generate culturally nuanced narratives for low-resource languages like Javanese and Sundanese, with LLM-generated data outperforming machine-translated and Indonesian human-authored data in downstream tasks.
Details
Motivation: Culturally grounded commonsense reasoning is underexplored in low-resource languages due to scarce data and costly native annotation.
Method: Compare three data creation strategies: LLM-assisted stories with cultural cues, machine translation from Indonesian benchmarks, and native-written stories. Fine-tune models on each dataset and evaluate on a human-authored test set.
Result: LLM stories match natives on cultural fidelity but lag in coherence and correctness. LLM-generated data yields higher downstream performance than machine-translated and Indonesian human-authored training data.
Conclusion: LLMs are effective for generating culturally grounded data in low-resource languages, and the research releases a high-quality benchmark for Javanese and Sundanese.
Abstract: Culturally grounded commonsense reasoning is underexplored in low-resource languages due to scarce data and costly native annotation. We test whether large language models (LLMs) can generate culturally nuanced narratives for such settings. Focusing on Javanese and Sundanese, we compare three data creation strategies: (1) LLM-assisted stories prompted with cultural cues, (2) machine translation from Indonesian benchmarks, and (3) native-written stories. Human evaluation finds LLM stories match natives on cultural fidelity but lag in coherence and correctness. We fine-tune models on each dataset and evaluate on a human-authored test set for classification and generation. LLM-generated data yields higher downstream performance than machine-translated and Indonesian human-authored training data. We release a high-quality benchmark of culturally grounded commonsense stories in Javanese and Sundanese to support future work.
[48] Uncertainty Quantification in Retrieval Augmented Question Answering
Laura Perez-Beltrachini, Mirella Lapata
Main category: cs.CL
TL;DR: Proposes a lightweight neural model to predict passage utility in retrieval-augmented QA, outperforming simple metrics and expensive sampling methods.
Details
Motivation: Previous retrieval-augmented QA approaches improve performance but don’t assess whether retrieved passages are actually useful for correct answering.
Method: Train a lightweight neural model to predict passage utility for target QA models, using it to quantify uncertainty.
Result: The approach efficiently approximates or outperforms more expensive sampling-based methods, while simple information theoretic metrics have limited effectiveness.
Conclusion: The proposed method provides an effective way to quantify uncertainty and assess passage utility in retrieval-augmented QA systems.
Abstract: Retrieval augmented Question Answering (QA) helps QA models overcome knowledge gaps by incorporating retrieved evidence, typically a set of passages, alongside the question at test time. Previous studies show that this approach improves QA performance and reduces hallucinations, without, however, assessing whether the retrieved passages are indeed useful at answering correctly. In this work, we propose to quantify the uncertainty of a QA model via estimating the utility of the passages it is provided with. We train a lightweight neural model to predict passage utility for a target QA model and show that while simple information theoretic metrics can predict answer correctness up to a certain extent, our approach efficiently approximates or outperforms more expensive sampling-based methods. Code and data are available at https://github.com/lauhaide/ragu.
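The utility predictor is described as lightweight; a minimal PyTorch head over question and passage embeddings might look as follows, with the featurization (concatenation) and the dimensions being assumptions for illustration.

```python
import torch
import torch.nn as nn

class PassageUtility(nn.Module):
    """Predict the probability that a retrieved passage helps the target QA
    model answer correctly (sketch; sizes are illustrative)."""
    def __init__(self, emb_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate question and passage embeddings, score, squash to [0, 1].
        return torch.sigmoid(self.net(torch.cat([q_emb, p_emb], dim=-1)))
```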
[49] CritiQ: Mining Data Quality Criteria from Human Preferences
Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, Tao Gui
Main category: cs.CL
TL;DR: CritiQ is a novel data selection method that automatically mines quality criteria from human preferences using only ~30 annotated pairs, employing manager and worker agents to evolve criteria and make judgments, then trains a scorer for efficient data selection.
Details
Motivation: Existing data selection methods rely on manual heuristics, perplexity, classifiers, or prompt engineering, which require significant expert experience and human annotation effort, and introduce biases.
Method: CritiQ Flow uses manager agents to evolve quality criteria and worker agents for pairwise judgments, builds a knowledge base from previous work, trains CritiQ Scorer for quality scoring, and performs efficient data selection.
Result: Achieves high accuracy on human-annotated test sets in code, math, and logic domains. Continual training of Llama 3.1 models shows improved performance on downstream tasks compared to uniform sampling.
Conclusion: CritiQ provides an interpretable, efficient data selection method that reduces human annotation effort while improving model performance, with ablation studies validating the benefits of knowledge base and reflection process.
Abstract: Language models heavily depend on high-quality data for optimal performance. Existing approaches rely on manually designed heuristics, the perplexity of existing models, training classifiers, or careful prompt engineering, which require significant expert experience and human annotation effort while introducing biases. We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality with only ~30 human-annotated pairs and performs efficient data selection. The main component, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments. We build a knowledge base that extracts quality criteria from previous work to boost CritiQ Flow. Compared to perplexity- and classifier-based methods, verbal criteria are more interpretable and possess reusable value. After deriving the criteria, we train the CritiQ Scorer to give quality scores and perform efficient data selection. We demonstrate the effectiveness of our method in the code, math, and logic domains, achieving high accuracy on human-annotated test sets. To validate the quality of the selected data, we continually train Llama 3.1 models and observe improved performance on downstream tasks compared to uniform sampling. Ablation studies validate the benefits of the knowledge base and the reflection process. We analyze how criteria evolve and the effectiveness of majority voting.
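The worker-agent judgment step can be pictured as criteria-wise majority voting. A toy sketch in which `judge` is a hypothetical callable standing in for an LLM call:

```python
from collections import Counter
from typing import Callable, List

def pairwise_vote(criteria: List[str],
                  judge: Callable[[str, str, str], str],
                  doc_a: str, doc_b: str) -> str:
    """Each mined quality criterion casts one vote ('A' or 'B') via the
    judge; the preferred document is the majority winner across criteria."""
    votes = Counter(judge(c, doc_a, doc_b) for c in criteria)
    return votes.most_common(1)[0][0]
```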
[50] MIND: Towards Immersive Psychological Healing with Multi-agent Inner Dialogue
Yujia Chen, Changsong Li, Yiming Wang, Tianjie Ju, Qingqing Xiao, Nan Zhang, Zifan Kong, Peng Wang, Binyu Yan
Main category: cs.CL
TL;DR: MIND is a multi-agent LLM framework that creates immersive psychological healing through role-playing inner dialogues, outperforming traditional methods.
Details
Motivation: Traditional mental health approaches like counseling and chatbots fail to engage effectively due to generic responses lacking emotional depth, despite LLMs' potential for human-like interactions.
Method: Propose MIND (Multi-agent INner Dialogue) paradigm that predefines an interactive healing framework with LLM agents assigned different roles to engage in interactive inner dialogues with users.
Result: Extensive human experiments show MIND provides more user-friendly experience than traditional paradigms across various real-world healing dimensions.
Conclusion: MIND effectively leverages the significant potential of LLMs in psychological healing by creating immersive healing environments through multi-agent inner dialogues.
Abstract: Mental health issues such as depression and anxiety are worsening in today’s competitive society. Traditional approaches like counseling and chatbots often fail to engage effectively, providing generic responses that lack emotional depth. Although large language models (LLMs) have the potential to create more human-like interactions, they still struggle to capture subtle emotions. This requires LLMs to be equipped with human-like adaptability and warmth. To fill this gap, we propose MIND (Multi-agent INner Dialogue), a novel paradigm that provides more immersive psychological healing environments. Considering the strong generative and role-playing ability of LLM agents, we predefine an interactive healing framework and assign LLM agents different roles within the framework to engage in interactive inner dialogues with users, thereby providing an immersive healing experience. We conduct extensive human experiments in various real-world healing dimensions, and find that MIND provides a more user-friendly experience than traditional paradigms. This demonstrates that MIND effectively leverages the significant potential of LLMs in psychological healing.
[51] SWI: Speaking with Intent in Large Language Models
Yuwei Yin, EunJeong Hwang, Giuseppe Carenini
Main category: cs.CL
TL;DR: Speaking with Intent (SWI) enhances LLM reasoning by generating explicit intents that guide analysis and action, showing consistent improvements across multiple benchmarks.
Details
Motivation: To emulate human deliberate thinking by having LLMs generate explicit intents that serve as cognitive frameworks for better communication and problem-solving.
Method: Introduces Speaking with Intent (SWI) approach where LLMs explicitly generate intents that encapsulate underlying intentions and provide high-level planning before generating final outputs.
Result: Extensive experiments show SWI consistently outperforms direct generation without explicit intent on text summarization, multi-task QA, and mathematical reasoning benchmarks. Human evaluations confirm coherence and interpretability.
Conclusion: SWI provides a promising new approach to enhance LLM generation and reasoning capabilities by incorporating explicit cognitive intents, paving the way for more deliberate AI systems.
Abstract: Intent, typically clearly formulated and planned, functions as a cognitive framework for communication and problem-solving. This paper introduces the concept of Speaking with Intent (SWI) in large language models (LLMs), where the explicitly generated intent encapsulates the model’s underlying intention and provides high-level planning to guide subsequent analysis and action. By emulating deliberate and purposeful thoughts in the human mind, SWI is hypothesized to enhance the reasoning capabilities and generation quality of LLMs. Extensive experiments on text summarization, multi-task question answering, and mathematical reasoning benchmarks consistently demonstrate the effectiveness and generalizability of Speaking with Intent over direct generation without explicit intent. Further analysis corroborates the generalizability of SWI under different experimental settings. Moreover, human evaluations verify the coherence, effectiveness, and interpretability of the intent produced by SWI. The promising results in enhancing LLMs with explicit intents pave a new avenue for boosting LLMs’ generation and reasoning abilities with cognitive notions.
[52] Entropy-Gated Branching for Efficient Test-Time Reasoning
Xianzhi Li, Ethan Callanan, Abdellah Ghassel, Xiaodan Zhu
Main category: cs.CL
TL;DR: Entropy-Gated Branching: A dynamic inference method that selectively expands prediction sequences only at high-uncertainty points, improving accuracy by 22.6% while being 37% faster than beam search.
Details
Motivation: Standard test-time compute methods like beam search waste computational resources on low-diversity branches where models already have high confidence, while a small subset of uncertain reasoning steps disproportionately impacts final accuracy.
Method: Uses entropy as a gating mechanism to identify points of high uncertainty for selective branching, coupled with an external feedback model to rank and prune candidate branches.
Result: Empirical results show 22.6% accuracy improvement over standard inference and 37% faster operation than conventional beam search with similar or higher performance on mathematical and financial reasoning benchmarks.
Conclusion: Dynamic resource allocation during inference can substantially improve both efficiency and effectiveness, offering a more scalable pathway to enhanced LLM reasoning capabilities.
Abstract: Test-time compute methods like beam search can significantly improve the reasoning capabilities and problem-solving accuracy of large language models. However, these approaches require substantially increased computational resources, with most computation wasted on exploring low-diversity branches where the model already exhibits high confidence. We observe that a small subset of uncertain reasoning steps has a disproportionately large impact on final prediction accuracy, and branching at these points tends to yield higher-quality and more diverse candidate reasoning steps. Therefore, we introduce Entropy-Gated Branching: a novel inference technique that dynamically allocates computational resources by selectively expanding prediction sequences only at points of high uncertainty. Our method leverages entropy as a gating mechanism to identify when branching is most beneficial, coupled with an external feedback model to rank and prune candidate branches. Empirical results on mathematical and financial reasoning benchmarks show that this strategy improves accuracy by 22.6% over standard inference while operating 37% faster than conventional beam search with similar or higher performance. Our results show that dynamic resource allocation during inference can substantially improve both efficiency and effectiveness, offering a more scalable pathway to enhanced LLM reasoning capabilities.
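A minimal sketch of the entropy gate at one decoding step; the threshold `tau` and branch count `k` are illustrative values, and the external feedback model that ranks and prunes branches is omitted:

```python
import torch
import torch.nn.functional as F

def entropy_gated_step(logits: torch.Tensor, tau: float = 2.0, k: int = 4):
    """Branch only when next-token entropy exceeds tau (threshold and
    branch count are illustrative, not the paper's tuned values)."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    if entropy > tau:
        # Uncertain step: expand top-k candidate continuations, to be
        # ranked and pruned by an external feedback model (omitted).
        top = torch.topk(probs, k)
        return list(zip(top.indices.tolist(), top.values.tolist()))
    # Confident step: keep only the single greedy continuation.
    return [(int(probs.argmax()), float(probs.max()))]

print(entropy_gated_step(torch.randn(32000)))  # toy vocabulary-size logits
```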
[53] Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B
Aleksandra Bakalova, Yana Veitsman, Xinting Huang, Michael Hahn
Main category: cs.CL
TL;DR: The paper identifies a two-step ‘contextualize-then-aggregate’ mechanism in LLMs for in-context learning, where lower layers build representations of individual examples and higher layers aggregate them to identify tasks.
Details
Motivation: Despite substantial research on ICL's behavioral aspects, it remains unclear how LLMs assemble task information from individual examples in few-shot prompts, prompting a need for causal analysis of the underlying mechanisms.
Method: Used causal interventions to analyze information flow in Gemma-2 2B model across five naturalistic ICL tasks, examining how examples are processed and contextualized.
Result: Found that LLMs use a two-step strategy: lower layers build contextualized representations of individual examples through cross-sequence connections, while higher layers aggregate these to identify tasks and prepare predictions.
Conclusion: The study provides rigorous causal analysis revealing the specific mechanisms behind ICL, showing contextualization importance varies by task and increases with ambiguous examples, shedding light on how language models perform in-context learning.
Abstract: In-Context Learning (ICL) is an intriguing ability of large language models (LLMs). Despite a substantial amount of work on its behavioral aspects and how it emerges in miniature setups, it remains unclear which mechanism assembles task information from the individual examples in a fewshot prompt. We use causal interventions to identify information flow in Gemma-2 2B for five naturalistic ICL tasks. We find that the model infers task information using a two-step strategy we call contextualize-then-aggregate: In the lower layers, the model builds up representations of individual fewshot examples, which are contextualized by preceding examples through connections between fewshot input and output tokens across the sequence. In the higher layers, these representations are aggregated to identify the task and prepare prediction of the next output. The importance of the contextualization step differs between tasks, and it may become more important in the presence of ambiguous examples. Overall, by providing rigorous causal analysis, our results shed light on the mechanisms through which ICL happens in language models.
[54] An Ontology-Driven Graph RAG for Legal Norms: A Structural, Temporal, and Deterministic Approach
Hudson de Martim
Main category: cs.CL
TL;DR: SAT-Graph RAG framework addresses limitations of standard RAG in legal domain by modeling hierarchical and temporal structure of law using ontology-driven knowledge graphs.
Details
Motivation: Standard flat-text retrieval in legal RAG systems fails to capture hierarchical, diachronic, and causal structure of law, leading to anachronistic and unreliable answers.
Method: Uses ontology-driven framework with formal LRMoo-inspired model distinguishing abstract legal Works from versioned Expressions. Models temporal states as efficient aggregations and legislative events as Action nodes. Implements planner-guided query strategy for deterministic resolution of complex requests.
Result: Demonstrated through Brazilian Constitution case study, providing verifiable, temporally-correct substrate for LLMs that enables higher-order analytical capabilities while reducing factual errors.
Conclusion: Practical framework for building more trustworthy and explainable legal AI systems by explicitly modeling legal structure and temporal relationships.
Abstract: Retrieval-Augmented Generation (RAG) systems in the legal domain face a critical challenge: standard, flat-text retrieval is blind to the hierarchical, diachronic, and causal structure of law, leading to anachronistic and unreliable answers. This paper introduces the Structure-Aware Temporal Graph RAG (SAT-Graph RAG), an ontology-driven framework designed to overcome these limitations by explicitly modeling the formal structure and diachronic nature of legal norms. We ground our knowledge graph in a formal, LRMoo-inspired model that distinguishes abstract legal Works from their versioned Expressions. We model temporal states as efficient aggregations that reuse the versioned expressions (CTVs) of unchanged components, and we reify legislative events as first-class Action nodes to make causality explicit and queryable. This structured backbone enables a unified, planner-guided query strategy that applies explicit policies to deterministically resolve complex requests for (i) point-in-time retrieval, (ii) hierarchical impact analysis, and (iii) auditable provenance reconstruction. Through a case study on the Brazilian Constitution, we demonstrate how this approach provides a verifiable, temporally-correct substrate for LLMs, enabling higher-order analytical capabilities while drastically reducing the risk of factual errors. The result is a practical framework for building more trustworthy and explainable legal AI systems.
[55] AdaptMI: Adaptive Skill-based In-context Math Instruction for Small Language Models
Yinghui He, Abhishek Panigrahi, Yong Lin, Sanjeev Arora
Main category: cs.CL
TL;DR: AdaptMI+ improves small language model performance on math problems by adaptively using skill-based examples only when needed, preventing cognitive overload from unnecessary information.
Details
Motivation: Skill-based prompting helps large language models but hurts small language models on easy questions by introducing cognitive overload, creating a performance gap that needs addressing.
Method: AdaptMI adaptively selects skill-based examples only when model performs poorly, and AdaptMI+ adds targeted examples for missing skills based on cognitive load theory.
Result: On 5-shot evaluations across math benchmarks and five SLMs (1B-7B), AdaptMI+ improves accuracy by up to 6% over naive skill-based strategies.
Conclusion: Adaptive skill-based prompting tailored to small language models’ capabilities significantly improves math problem-solving performance while avoiding cognitive overload.
Abstract: In-context learning (ICL) allows a language model to improve its problem-solving capability when provided with suitable information in context. Since the choice of in-context information can be determined based on the problem itself, in-context learning is analogous to human learning from teachers in a classroom. Recent works (Didolkar et al., 2024a; 2024b) show that ICL performance can be improved by leveraging a frontier large language model’s (LLM) ability to predict required skills to solve a problem, popularly referred to as an LLM’s metacognition, and using the recommended skills to construct necessary in-context examples. While this skill-based strategy boosts ICL performance in larger models, its gains on small language models (SLMs) have been minimal, highlighting a performance gap in ICL capabilities. We investigate this gap and show that skill-based prompting can hurt SLM performance on easy questions by introducing unnecessary information, akin to cognitive overload. To address this, we introduce AdaptMI, an adaptive approach to selecting skill-based in-context Math Instructions for SLMs. Inspired by cognitive load theory from human pedagogy, our method only introduces skill-based examples when the model performs poorly. We further propose AdaptMI+, which adds examples targeted to the specific skills missing from the model’s responses. On 5-shot evaluations across popular math benchmarks and five SLMs (1B–7B; Qwen, Llama), AdaptMI+ improves accuracy by up to 6% over naive skill-based strategies.
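The gating rule reduces to a small piece of control flow. A sketch assuming a first-pass answer is used to detect when the model is struggling (all names are illustrative, not the paper's code):

```python
from typing import List

def adaptive_prompt(question: str,
                    base_shots: List[str],
                    skill_shots: List[str],
                    first_pass_wrong: bool) -> str:
    """Only prepend skill-targeted examples when the SLM's first attempt
    fails, avoiding cognitive overload on easy questions."""
    shots = base_shots + (skill_shots if first_pass_wrong else [])
    return "\n\n".join(shots) + f"\n\nQuestion: {question}\nAnswer:"
```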
[56] Task Matters: Knowledge Requirements Shape LLM Responses to Context-Memory Conflict
Kaiser Sun, Fan Bai, Mark Dredze
Main category: cs.CL
TL;DR: LLMs show varying reliance on contextual vs parametric knowledge under conflict, with performance degradation correlating with task knowledge requirements, and both explanations and reiteration increasing context reliance.
Details
Motivation: To understand how LLMs handle conflicts between contextual and parametric knowledge across different task types, as prior research focused only on tasks that should always rely on context.
Method: A model-agnostic diagnostic framework that automatically detects knowledge disagreements and injects controlled conflicts into tasks, creating datasets spanning task knowledge reliance and conflict plausibility dimensions.
Result: Performance degradation from conflict correlates with task’s knowledge reliance; explanatory rationales and simple reiteration both increase context reliance (helpful for context-only tasks but harmful when parametric knowledge should dominate).
Conclusion: These behaviors raise concerns about model-based evaluation validity and underscore the need to account for knowledge conflict in LLM deployment.
Abstract: Large Language Models require both contextual knowledge and parametric memory, but these sources can disagree. Prior investigations on contextual question answering tasks report a preference toward parametric knowledge under conflict, yet they focus almost exclusively on tasks that should always rely on the given passage, leaving open how this behavior manifests when tasks demand different amounts and kinds of knowledge. We study this question with a model-agnostic diagnostic framework that (i) automatically detects disagreements between a model’s beliefs and a curated knowledge set, and (ii) injects controlled conflicts into tasks. The resulting datasets span two orthogonal dimensions: task knowledge reliance and conflict plausibility. Evaluating representative open-source LLMs, we find that: (1) performance degradation from conflict correlates with a task’s knowledge reliance; (2) explanatory rationales and simple reiteration both increase context reliance, which is helpful for context-only tasks but harmful when parametric knowledge should dominate; and (3) these behaviors raise concerns about the validity of model-based evaluation and underscore the need to account for knowledge conflict in the deployment of LLMs.
[57] Persistent Homology of Topic Networks for the Prediction of Reader Curiosity
Manuel D. S. Hopp, Vincent Labatut, Arthur Amalvy, Richard Dufour, Hannah Stone, Hayley Jach, Kou Murayama
Main category: cs.CL
TL;DR: A novel framework that models reader curiosity by quantifying semantic information gaps using topological analysis of text structure, showing significant improvement in predicting reader engagement.
Details
Motivation: Reader curiosity is crucial for textual engagement but remains underexplored in NLP. The paper aims to bridge this gap by developing a computational method to analyze how text structure influences reader curiosity.
Method: Leverages BERTopic-inspired topic modeling and persistent homology to analyze the evolving topology of a dynamic semantic network derived from text segments. Uses topological features (connected components, cycles, voids) as proxies for information gaps to predict reader curiosity ratings.
Result: The topological features significantly improved curiosity prediction compared to a baseline model (73% vs. 30% explained deviance), validating the approach through empirical evaluation with 49 participants reading “The Hunger Games”.
Conclusion: The pipeline offers a new computational method for analyzing text structure and its relation to reader engagement, successfully demonstrating that topological features can effectively model and predict reader curiosity.
Abstract: Reader curiosity, the drive to seek information, is crucial for textual engagement, yet remains relatively underexplored in NLP. Building on Loewenstein’s Information Gap Theory, we introduce a framework that models reader curiosity by quantifying semantic information gaps within a text’s semantic structure. Our approach leverages BERTopic-inspired topic modeling and persistent homology to analyze the evolving topology (connected components, cycles, voids) of a dynamic semantic network derived from text segments, treating these features as proxies for information gaps. To empirically evaluate this pipeline, we collect reader curiosity ratings from participants (n = 49) as they read S. Collins’s “The Hunger Games” novel. We then use the topological features from our pipeline as independent variables to predict these ratings, and experimentally show that they significantly improve curiosity prediction compared to a baseline model (73% vs. 30% explained deviance), validating our approach. This pipeline offers a new computational method for analyzing text structure and its relation to reader engagement.
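As a simplified stand-in for the H0 (connected components) part of such a pipeline, one can track components of the topic-similarity graph as the edge threshold varies; cycles and voids require a persistent-homology library and are omitted here:

```python
import numpy as np
import networkx as nx

def betti0_over_thresholds(sim: np.ndarray, thresholds) -> list:
    """Number of connected components (Betti-0) of the topic graph at
    each similarity threshold; many components suggest large gaps."""
    n = sim.shape[0]
    counts = []
    for t in thresholds:
        G = nx.Graph()
        G.add_nodes_from(range(n))
        G.add_edges_from((i, j) for i in range(n)
                         for j in range(i + 1, n) if sim[i, j] >= t)
        counts.append(nx.number_connected_components(G))
    return counts

sim = np.random.rand(10, 10); sim = (sim + sim.T) / 2  # toy symmetric matrix
print(betti0_over_thresholds(sim, [0.9, 0.7, 0.5]))
```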
[58] Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval
Wenhao Li, Yuxin Zhang, Gen Luo, Haiyuan Wan, Ziyang Gong, Fei Chao, Rongrong Ji
Main category: cs.CL
TL;DR: Spotlight Attention uses non-linear hashing to optimize KV cache selection in LLMs, achieving 5x shorter hash codes, 3x higher throughput, and efficient GPU training.
Details
Motivation: Existing KV cache reduction methods use inefficient linear hashing due to orthogonal query-key distributions in narrow cones, requiring better optimization.
Method: Non-linear hashing functions to optimize embedding distribution, Bradley-Terry ranking-based loss for lightweight training, and specialized CUDA kernels for bitwise operations.
Result: 5x shorter hash codes, under 100μs retrieval for 512K tokens on A100 GPU, 3x higher end-to-end throughput than vanilla decoding.
Conclusion: Spotlight Attention significantly improves KV cache efficiency through optimized non-linear hashing, enabling faster inference with maintained performance.
Abstract: Reducing the key-value (KV) cache burden in Large Language Models (LLMs) significantly accelerates inference. Dynamically selecting critical KV caches during decoding helps maintain performance. Existing methods use random linear hashing to identify important tokens, but this approach is inefficient due to the orthogonal distribution of queries and keys within two narrow cones in LLMs. We introduce Spotlight Attention, a novel method that employs non-linear hashing functions to optimize the embedding distribution of queries and keys, enhancing coding efficiency and robustness. We also developed a lightweight, stable training framework using a Bradley-Terry ranking-based loss, enabling optimization of the non-linear hashing module on GPUs with 16GB memory in 8 hours. Experimental results show that Spotlight Attention drastically improves retrieval precision while shortening the length of the hash code at least 5x compared to traditional linear hashing. Finally, we exploit the computational advantages of bitwise operations by implementing specialized CUDA kernels, achieving hashing retrieval for 512K tokens in under 100μs on a single A100 GPU, with end-to-end throughput up to 3x higher than vanilla decoding.
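A schematic of the retrieval side, assuming a small MLP with sign binarization as the non-linear hash (layer sizes and code length are placeholders, and the Bradley-Terry training loss is omitted):

```python
import torch
import torch.nn as nn

class NonLinearHash(nn.Module):
    """Small MLP followed by sign binarization as the non-linear hash;
    this is a sketch of the idea, not the released architecture."""

    def __init__(self, dim: int = 128, code_bits: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, code_bits)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sign(self.mlp(x))  # {-1, +1} binary code

hasher = NonLinearHash()
q = torch.randn(1, 128)            # current decoding query
keys = torch.randn(4096, 128)      # cached keys
agreement = hasher(q) @ hasher(keys).T      # high = small Hamming distance
keep = agreement.topk(64, dim=-1).indices   # KV entries selected as critical
```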
[59] T2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables
Jie Zhang, Changzai Pan, Kaiwen Wei, Sishi Xiong, Yu Zhao, Xiangyu Li, Jiaxin Peng, Xiaoyan Gu, Jian Yang, Wenhan Chang, Zhenhe Wu, Jiang Zhong, Shuangyong Song, Yongxiang Li, Xuelong Li
Main category: cs.CL
TL;DR: Proposes the table-to-report task and the T2R-bench benchmark to evaluate LLMs’ ability to transform complex industrial tables into reports, showing that even state-of-the-art models score only 62.71.
Details
Motivation: Existing table reasoning research doesn't adequately address the practical challenge of transforming complex industrial tables into reports, and current benchmarks lack capacity to assess real-world application.
Method: Created T2R-bench, a bilingual benchmark with 457 real-world industrial tables from 19 domains and 4 table types, with proposed evaluation criteria for report quality measurement.
Result: Experiments on 25 LLMs show even state-of-the-art models like Deepseek-R1 only achieve 62.71 overall score, indicating significant room for improvement.
Conclusion: The table-to-report task remains challenging for current LLMs, and T2R-bench provides a valuable benchmark for evaluating practical table reasoning capabilities in industrial applications.
Abstract: Extensive research has been conducted to explore the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming table information into reports remains a significant challenge for industrial applications. This task is plagued by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks lack the capacity to adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, which centers on how key information flows from tables into reports. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and encompassing 19 industry domains as well as 4 types of industrial tables. Furthermore, we propose evaluation criteria to fairly measure the quality of report generation. The experiments on 25 widely-used LLMs reveal that even state-of-the-art models like Deepseek-R1 achieve an overall score of only 62.71, indicating that LLMs still have room for improvement on T2R-bench.
[60] MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining
Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke
Main category: cs.CL
TL;DR: MachineLearningLM is a continued-pretraining framework that enhances LLMs’ in-context learning for ML tasks using synthetic data from structural causal models, enabling robust many-shot performance while preserving general capabilities.
Details
Motivation: LLMs struggle with many-shot in-context learning on standard ML tasks despite having broad knowledge and reasoning abilities. The goal is to equip LLMs with robust ML capability without sacrificing their general-purpose functionality.
Method: Uses continued pretraining with synthetic ML tasks generated from millions of structural causal models (SCMs), distills tree-based decision strategies from random forest teachers, and employs token-efficient prompting for 3-6x more examples per context window.
Result: Outperforms strong LLM baselines by ~15% on out-of-distribution tabular classification across multiple domains, shows monotonic accuracy improvement from 8 to 1,024 shots, achieves random-forest-level accuracy, and maintains 75.4% MMLU score.
Conclusion: MachineLearningLM successfully enhances LLMs’ in-context ML capabilities through synthetic pretraining while preserving general knowledge and reasoning, demonstrating effective many-shot scaling and robust performance across diverse domains.
Abstract: Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.
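The throughput gains rest on a terse serialization of demonstrations. A hypothetical compact format, not the paper's exact one:

```python
def serialize_tabular_task(rows, labels, query_row) -> str:
    """Pack many-shot tabular demonstrations into a terse CSV-like
    prompt so more examples fit per context window."""
    shots = "\n".join(
        ",".join(map(str, r)) + f"->{y}" for r, y in zip(rows, labels)
    )
    return f"{shots}\n{','.join(map(str, query_row))}->"

prompt = serialize_tabular_task(
    rows=[[5.1, 3.5], [6.2, 2.9]], labels=["A", "B"], query_row=[5.8, 3.0]
)
print(prompt)
```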
[61] PersonaFuse: A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions
Yixuan Tang, Yi Yang, Ahmed Abbasi
Main category: cs.CL
TL;DR: PersonaFuse is a novel LLM post-training framework that enables language models to adapt and express different personalities for varying social contexts, improving emotional intelligence without sacrificing reasoning ability or safety.
Details
Motivation: LLMs have limitations in emotional perception and social competence during real-world conversations, particularly in adapting communication style and emotional expression to different social and task contexts.
Method: PersonaFuse employs a Mixture-of-Expert architecture inspired by Trait Activation Theory and the Big Five personality model, combining persona adapters with a dynamic routing network to enable contextual trait expression.
Result: Experimental results show PersonaFuse substantially outperforms baseline models across multiple dimensions of social-emotional intelligence, achieves competitive response quality against leading LLMs like GPT-4o and DeepSeek, and delivers consistent improvements in downstream applications like mental health counseling and customer service.
Conclusion: PersonaFuse offers a theoretically grounded and practical approach for developing social-emotional enhanced LLMs, marking a significant advancement toward more human-centric AI systems without sacrificing general reasoning ability or model safety.
Abstract: Recent advancements in Large Language Models (LLMs) demonstrate remarkable capabilities across various fields. These developments have led to more direct communication between humans and LLMs in various situations, such as social companionship and psychological support. However, LLMs often exhibit limitations in emotional perception and social competence during real-world conversations. These limitations partly originate from their inability to adapt their communication style and emotional expression to different social and task contexts. In this work, we introduce PersonaFuse, a novel LLM post-training framework that enables LLMs to adapt and express different personalities for varying situations. Inspired by Trait Activation Theory and the Big Five personality model, PersonaFuse employs a Mixture-of-Expert architecture that combines persona adapters with a dynamic routing network, enabling contextual trait expression. Experimental results show that PersonaFuse substantially outperforms baseline models across multiple dimensions of social-emotional intelligence. Importantly, these gains are achieved without sacrificing general reasoning ability or model safety, which remain common limitations of direct prompting and supervised fine-tuning approaches. PersonaFuse also delivers consistent improvements in downstream human-centered applications, such as mental health counseling and review-based customer service. Finally, human preference evaluations against leading LLMs, including GPT-4o and DeepSeek, demonstrate that PersonaFuse achieves competitive response quality despite its comparatively smaller model size. These findings demonstrate that PersonaFuse offers a theoretically grounded and practical approach for developing social-emotional enhanced LLMs, marking a significant advancement toward more human-centric AI systems.
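A schematic of the persona-adapter routing; the linear adapters, mean-pooled routing input, and dimensions are simplifying assumptions rather than the paper's design details:

```python
import torch
import torch.nn as nn

class PersonaRouter(nn.Module):
    """Route hidden states through a soft mixture of persona adapters;
    a sketch of the Mixture-of-Expert idea, not the released model."""

    def __init__(self, dim: int = 512, n_personas: int = 5):
        super().__init__()
        self.router = nn.Linear(dim, n_personas)
        self.adapters = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(n_personas)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: [B, T, D]
        w = torch.softmax(self.router(h.mean(dim=1)), dim=-1)     # [B, P]
        outs = torch.stack([a(h) for a in self.adapters], dim=1)  # [B, P, T, D]
        return h + torch.einsum("bp,bptd->btd", w, outs)
```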
[62] MERLIN: Multi-Stage Curriculum Alignment for Multilingual Encoder and LLM Fusion
Kosei Uemura, David Guzmán, Quang Phuoc Nguyen, Jesujoba Oluwadara Alabi, En-shiun Annie Lee, David Ifeoluwa Adelani
Main category: cs.CL
TL;DR: MERLIN is a two-stage model-stacking framework that uses curriculum learning and DoRA weights to significantly improve reasoning accuracy in low-resource languages, outperforming existing methods and GPT-4o-mini.
Details
Motivation: Large language models struggle with complex reasoning in low-resource languages (LRLs), and existing encoder-plus-decoder methods leave a large performance gap on these languages.
Method: Two-stage model-stacking framework with curriculum learning strategy (from general bilingual bitext to task-specific data) and adaptation of only a small set of DoRA weights.
Result: +12.9 pp improvement in exact-match accuracy over MindMerger on AfriMGSM benchmark, outperforms GPT-4o-mini, and consistent gains on MGSM (+0.9 pp) and MSVAMP (+2.8 pp) benchmarks.
Conclusion: MERLIN demonstrates effectiveness across both low and high-resource language settings, providing significant improvements in reasoning accuracy for underrepresented languages.
Abstract: Large language models excel in English but still struggle with complex reasoning in many low-resource languages (LRLs). Existing encoder-plus-decoder methods such as LangBridge and MindMerger raise accuracy on mid and high-resource languages, yet they leave a large gap on LRLs. We present MERLIN, a two-stage model-stacking framework that applies a curriculum learning strategy – from general bilingual bitext to task-specific data – and adapts only a small set of DoRA weights. On the AfriMGSM benchmark MERLIN improves exact-match accuracy by +12.9 pp over MindMerger and outperforms GPT-4o-mini. It also yields consistent gains on MGSM and MSVAMP (+0.9 and +2.8 pp), demonstrating effectiveness across both low and high-resource settings.
[63] OTESGN: Optimal Transport-Enhanced Syntactic-Semantic Graph Networks for Aspect-Based Sentiment Analysis
Xinfeng Liao, Xuanqi Chen, Lianxi Wang, Jiahuan Yang, Zhuowei Chen, Ziying Rong
Main category: cs.CL
TL;DR: OTESGN is a novel model for aspect-based sentiment analysis that combines syntactic and semantic information using optimal transport and graph networks, achieving state-of-the-art performance on benchmark datasets.
Details
Motivation: Existing ABSA approaches rely on dot-product similarity and fixed graphs, which limit their ability to capture nonlinear associations and adapt to noisy contexts.
Method: Proposes OTESGN with Syntactic Graph-Aware Attention for global dependencies, Semantic Optimal Transport Attention using Sinkhorn algorithm for aspect-opinion association, Adaptive Attention Fusion, and contrastive regularization.
Result: Achieves state-of-the-art performance with improvements of up to +1.30 Macro-F1 on Laptop14 and +1.01 on Twitter datasets compared to competitive baselines.
Conclusion: OTESGN effectively captures fine-grained sentiment associations and suppresses noise from irrelevant context, demonstrating superior performance in aspect-based sentiment analysis.
Abstract: Aspect-based sentiment analysis (ABSA) aims to identify aspect terms and determine their sentiment polarity. While dependency trees combined with contextual semantics provide structural cues, existing approaches often rely on dot-product similarity and fixed graphs, which limit their ability to capture nonlinear associations and adapt to noisy contexts. To address these limitations, we propose the Optimal Transport-Enhanced Syntactic-Semantic Graph Network (OTESGN), a model that jointly integrates structural and distributional signals. Specifically, a Syntactic Graph-Aware Attention module models global dependencies with syntax-guided masking, while a Semantic Optimal Transport Attention module formulates aspect-opinion association as a distribution matching problem solved via the Sinkhorn algorithm. An Adaptive Attention Fusion mechanism balances heterogeneous features, and contrastive regularization enhances robustness. Extensive experiments on three benchmark datasets (Rest14, Laptop14, and Twitter) demonstrate that OTESGN delivers state-of-the-art performance. Notably, it surpasses competitive baselines by up to +1.30 Macro-F1 on Laptop14 and +1.01 on Twitter. Ablation studies and visualization analyses further highlight OTESGN’s ability to capture fine-grained sentiment associations and suppress noise from irrelevant context.
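The Sinkhorn step at the core of the Semantic Optimal Transport Attention can be sketched in a few lines; the regularization strength and uniform marginals below are assumptions:

```python
import torch

def sinkhorn_plan(cost: torch.Tensor, eps: float = 0.1, iters: int = 50):
    """Entropic-regularized optimal transport between aspect and context
    tokens via Sinkhorn iterations; eps and uniform marginals are
    illustrative choices."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)                 # Gibbs kernel
    a = torch.full((n,), 1.0 / n)              # uniform source marginal
    b = torch.full((m,), 1.0 / m)              # uniform target marginal
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return torch.diag(u) @ K @ torch.diag(v)   # transport plan ~ attention

plan = sinkhorn_plan(1 - torch.rand(3, 7))     # toy cost = 1 - similarity
```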
cs.CV
[64] Recurrence Meets Transformers for Universal Multimodal Retrieval
Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Main category: cs.CV
TL;DR: ReT-2 is a unified multimodal retrieval model that handles both image and text queries to search across multimodal document collections, achieving state-of-the-art performance with improved efficiency.
Details
Motivation: Existing retrieval methods are limited to single-modality queries or documents and require task-specific fine-tuning, while complex multimodal retrieval tasks are emerging with the advancement of multimodal LLMs.
Method: ReT-2 uses multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details.
Result: ReT-2 achieves state-of-the-art performance on M2KR and M-BEIR benchmarks across different retrieval configurations, with faster inference and reduced memory usage compared to prior approaches. It also improves downstream performance when integrated into retrieval-augmented generation pipelines.
Conclusion: ReT-2 provides an effective unified solution for multimodal retrieval that supports complex queries and documents containing both images and text, demonstrating superior performance and efficiency across various benchmarks and applications.
Abstract: With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT-2
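A minimal sketch of an LSTM-inspired gate that folds each layer's features into a running state, as one reading of the recurrent cross-layer fusion (the single-gate form and dimensions are simplifications):

```python
import torch
import torch.nn as nn

class RecurrentLayerGate(nn.Module):
    """Merge one backbone layer's features into a running multimodal
    state with an LSTM-style gate; a sketch, not ReT-2's exact cell."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.cand = nn.Linear(2 * dim, dim)

    def forward(self, state: torch.Tensor, layer_feat: torch.Tensor):
        z = torch.cat([state, layer_feat], dim=-1)
        g = torch.sigmoid(self.gate(z))   # how much to overwrite
        c = torch.tanh(self.cand(z))      # candidate update
        return (1 - g) * state + g * c

gate = RecurrentLayerGate()
state = torch.zeros(1, 768)
for feat in [torch.randn(1, 768) for _ in range(12)]:  # per-layer features
    state = gate(state, feat)
```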
[65] Diffusion-Based Action Recognition Generalizes to Untrained Domains
Rogerio Guimaraes, Frank Xiao, Pietro Perona, Markus Marks
Main category: cs.CV
TL;DR: Using Vision Diffusion Model features with transformer aggregation achieves human-like action recognition across species, viewpoints, and contexts, setting new state-of-the-art benchmarks.
Details
Motivation: Humans can recognize actions despite large variations in context and viewpoint, but current deep learning models struggle with such generalization.
Method: Propose using features from Vision Diffusion Model (VDM) aggregated via transformer, with conditioning on earlier timesteps to emphasize semantic over pixel-level information.
Result: Sets new state-of-the-art across three generalization benchmarks: classifying actions across animal species, viewing angles, and recording contexts.
Conclusion: The approach brings machine action recognition closer to human-like robustness by leveraging diffusion model features for better generalization.
Abstract: Humans can recognize the same actions despite large context and viewpoint variations, such as differences between species (walking in spiders vs. horses), viewpoints (egocentric vs. third-person), and contexts (real life vs. movies). Current deep learning models struggle with such generalization. We propose using features generated by a Vision Diffusion Model (VDM), aggregated via a transformer, to achieve human-like action recognition across these challenging conditions. We find that generalization is enhanced by the use of a model conditioned on earlier timesteps of the diffusion process to highlight semantic information over pixel level details in the extracted features. We experimentally explore the generalization properties of our approach in classifying actions across animal species, across different viewing angles, and different recording contexts. Our model sets a new state-of-the-art across all three generalization benchmarks, bringing machine action recognition closer to human-like robustness. Project page: https://www.vision.caltech.edu/actiondiff/ Code: https://github.com/frankyaoxiao/ActionDiff
[66] Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis
Jing Hao, Yuxuan Fan, Yanpeng Sun, Kaixin Guo, Lizhuo Lin, Jinrong Yang, Qi Yong H. Ai, Lun M. Wong, Hao Tang, Kuo Feng Hung
Main category: cs.CV
TL;DR: MMOral is the first large-scale multimodal instruction dataset and benchmark for panoramic X-ray interpretation in dentistry, consisting of 20,563 annotated images with 1.3 million instruction instances. Current LVLMs perform poorly (best model GPT-4o at 41.45% accuracy), but OralGPT (fine-tuned Qwen2.5-VL-7B) shows 24.73% improvement after SFT.
Details
Motivation: Large vision-language models (LVLMs) have shown strong performance on general medical tasks but remain underexplored in specialized domains like dentistry. Panoramic X-rays present unique challenges with dense anatomical structures and subtle pathological cues not captured by existing medical benchmarks.
Method: Created MMOral dataset with 20,563 annotated panoramic X-ray images and 1.3 million instruction instances across attribute extraction, report generation, VQA, and image-grounded dialogue. Developed MMOral-Bench evaluation suite covering five key diagnostic dimensions. Proposed OralGPT by conducting supervised fine-tuning on Qwen2.5-VL-7B using the MMOral dataset.
Result: Evaluation of 64 LVLMs revealed significant limitations, with GPT-4o achieving only 41.45% accuracy. OralGPT demonstrated substantial performance improvement of 24.73% after just one epoch of supervised fine-tuning.
Conclusion: MMOral and OralGPT provide a critical foundation for intelligent dentistry and enable more clinically impactful multimodal AI systems in the dental field. The significant performance gap and improvement potential highlight the need for domain-specific adaptation of LVLMs.
Abstract: Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 41.45% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we also propose OralGPT, which conducts supervised fine-tuning (SFT) upon Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at https://github.com/isbrycee/OralGPT.
[67] PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability
Tung Vu, Lam Nguyen, Quynh Dao
Main category: cs.CV
TL;DR: PromptGuard is a modular prompting framework with VulnGuard Prompt that prevents LLMs from generating harmful content using contrastive learning, ethical reasoning, and multi-objective optimization, achieving 25-30% harm reduction.
Details
Motivation: Existing safety approaches fail to proactively prevent harmful outputs from LLMs that could affect vulnerable populations like LGBTQ+ individuals and marginalized communities.
Method: Uses VulnGuard Prompt, a hybrid technique combining few-shot examples from GitHub, ethical chain-of-thought reasoning, adaptive role-prompting, and multi-objective optimization with formal proofs.
Result: Demonstrates 25-30% analytical harm reduction through entropy bounds and Pareto optimality, with comprehensive mathematical formalization and convergence proofs.
Conclusion: PromptGuard provides a systematic framework for real-time harm prevention with strong theoretical foundations for empirical research on LLM safety.
Abstract: The proliferation of Large Language Models (LLMs) in real-world applications poses unprecedented risks of generating harmful, biased, or misleading information to vulnerable populations including LGBTQ+ individuals, single parents, and marginalized communities. While existing safety approaches rely on post-hoc filtering or generic alignment techniques, they fail to proactively prevent harmful outputs at the generation source. This paper introduces PromptGuard, a novel modular prompting framework with our breakthrough contribution: VulnGuard Prompt, a hybrid technique that prevents harmful information generation using real-world data-driven contrastive learning. VulnGuard integrates few-shot examples from curated GitHub repositories, ethical chain-of-thought reasoning, and adaptive role-prompting to create population-specific protective barriers. Our framework employs theoretical multi-objective optimization with formal proofs demonstrating 25-30% analytical harm reduction through entropy bounds and Pareto optimality. PromptGuard orchestrates six core modules: Input Classification, VulnGuard Prompting, Ethical Principles Integration, External Tool Interaction, Output Validation, and User-System Interaction, creating an intelligent expert system for real-time harm prevention. We provide comprehensive mathematical formalization including convergence proofs, vulnerability analysis using information theory, and theoretical validation framework using GitHub-sourced datasets, establishing mathematical foundations for systematic empirical research.
[68] Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization
Zhengzhao Lai, Youbin Zheng, Zhenyang Cai, Haonan Lyu, Jinpu Yang, Hongqing Liang, Yan Hu, Benyou Wang
Main category: cs.CV
TL;DR: MatCha is the first benchmark for materials characterization image understanding, testing MLLMs on 1,500 expert-level questions across 21 tasks, revealing significant performance gaps compared to human experts.
Details
Motivation: To bridge the gap in multimodal large language models' capacity to understand real-world materials characterization imaging data, which remains underexplored despite their promise in materials science.
Method: Developed MatCha benchmark with 1,500 questions covering four key stages of materials research and 21 distinct tasks, then evaluated state-of-the-art MLLMs on this benchmark using few-shot and chain-of-thought prompting.
Result: MLLMs show significant performance degradation compared to human experts, especially for questions requiring higher-level expertise and sophisticated visual perception. Simple prompting techniques fail to alleviate these limitations.
Conclusion: Existing MLLMs have limited adaptability to real-world materials characterization scenarios, and MatCha can facilitate future research in new material discovery and autonomous scientific agents.
Abstract: Materials characterization is fundamental to acquiring materials information, revealing the processing-microstructure-property relationships that guide material design and optimization. While multimodal large language models (MLLMs) have recently shown promise in generative and predictive tasks within materials science, their capacity to understand real-world characterization imaging data remains underexplored. To bridge this gap, we present MatCha, the first benchmark for materials characterization image understanding, comprising 1,500 questions that demand expert-level domain expertise. MatCha encompasses four key stages of materials research comprising 21 distinct tasks, each designed to reflect authentic challenges faced by materials scientists. Our evaluation of state-of-the-art MLLMs on MatCha reveals a significant performance gap compared to human experts. These models exhibit degradation when addressing questions requiring higher-level expertise and sophisticated visual perception. Simple few-shot and chain-of-thought prompting struggle to alleviate these limitations. These findings highlight that existing MLLMs still exhibit limited adaptability to real-world materials characterization scenarios. We hope MatCha will facilitate future research in areas such as new material discovery and autonomous scientific agents. MatCha is available at https://github.com/FreedomIntelligence/MatCha.
[69] Similarity-based Outlier Detection for Noisy Object Re-Identification Using Beta Mixtures
Waqar Ahmad, Evan Murphy, Vladimir A. Krylov
Main category: cs.CV
TL;DR: Beta-SOD: A novel Beta mixture similarity-based outlier detection method for robust object re-identification that handles label noise by modeling pairwise similarity distributions and combining multiple loss functions.
Details
Motivation: Object re-identification methods are highly sensitive to label noise, which causes significant performance degradation. Existing approaches struggle with noisy labels in Re-ID tasks.
Method: Reframes Re-ID as supervised image similarity task using Siamese network. Introduces Beta-SOD framework that models cosine similarity distribution with two-component Beta mixture model. Combines binary cross-entropy, contrastive, and cosine embedding losses for joint optimization.
Result: Superior performance on person Re-ID (CUHK03, Market-1501) and vehicle Re-ID (VeRi-776) datasets across various noise levels (10-30%). Outperforms state-of-the-art methods in noisy scenarios.
Conclusion: Beta-SOD provides robust and broadly applicable solution for noisy Re-ID tasks, with proven identifiability of Beta mixture model and effective outlier detection framework.
Abstract: Object re-identification (Re-ID) methods are highly sensitive to label noise, which typically leads to significant performance degradation. We address this challenge by reframing Re-ID as a supervised image similarity task and adopting a Siamese network architecture trained to capture discriminative pairwise relationships. Central to our approach is a novel statistical outlier detection (OD) framework, termed Beta-SOD (Beta mixture Similarity-based Outlier Detection), which models the distribution of cosine similarities between embedding pairs using a two-component Beta distribution mixture model. We establish a novel identifiability result for mixtures of two Beta distributions, ensuring that our learning task is well-posed. The proposed OD step complements the Re-ID architecture, combining binary cross-entropy, contrastive, and cosine embedding losses that jointly optimize feature-level similarity learning. We demonstrate the effectiveness of Beta-SOD in de-noising and Re-ID tasks for person Re-ID, on CUHK03 and Market-1501 datasets, and vehicle Re-ID, on the VeRi-776 dataset. Our method shows superior performance compared to the state-of-the-art methods across various noise levels (10-30%), demonstrating both robustness and broad applicability in noisy Re-ID scenarios. The implementation of Beta-SOD is available at: https://github.com/waqar3411/Beta-SOD
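A rough sketch of fitting the two-component Beta mixture to pairwise cosine similarities with EM, using moment-matching M-steps rather than the authors' estimator; similarities are assumed rescaled into (0, 1):

```python
import numpy as np
from scipy.stats import beta

def fit_beta_mixture(s: np.ndarray, iters: int = 100):
    """EM for a two-component Beta mixture on similarities in (0, 1);
    the moment-matching M-step is a simplification."""
    s = np.clip(s, 1e-4, 1 - 1e-4)
    params = [(2.0, 5.0), (5.0, 2.0)]   # init: low- vs high-similarity mode
    weights = np.array([0.5, 0.5])
    for _ in range(iters):
        dens = np.stack([w * beta.pdf(s, a, b)
                         for w, (a, b) in zip(weights, params)])
        resp = dens / dens.sum(axis=0, keepdims=True)   # E-step
        for k in range(2):                              # M-step
            r = resp[k]
            m = (r * s).sum() / r.sum()                 # weighted mean
            v = (r * (s - m) ** 2).sum() / r.sum()      # weighted variance
            c = m * (1 - m) / max(v, 1e-8) - 1          # a+b from moments
            params[k] = (m * c, (1 - m) * c)
        weights = resp.mean(axis=1)
    return weights, params, resp

# Pairs with high responsibility under the low-similarity component
# would be flagged as label-noise outliers.
```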
[70] SFD-Mamba2Net: Structure-Guided Frequency-Enhanced Dual-Stream Mamba2 Network for Coronary Artery Segmentation
Nan Mu, Ruiqi Song, Zhihui Xu, Jingfeng Jiang, Chen Zhao
Main category: cs.CV
TL;DR: SFD-Mamba2Net improves coronary artery segmentation and stenosis detection in ICA images using multi-scale structural priors, state-space modeling, and frequency-domain enhancement.
Details
Motivation: ICA images have low contrast, high noise, and complex vascular structures that challenge existing segmentation and detection methods for CAD diagnosis.
Method: End-to-end framework with CASE module for multi-scale structural enhancement in encoder and PHFP module with wavelet decomposition for high-frequency detail refinement in decoder.
Result: Outperformed state-of-the-art methods across eight segmentation metrics and achieved highest true positive rate and positive predictive value in stenosis detection.
Conclusion: The proposed SFD-Mamba2Net framework effectively addresses challenges in ICA image analysis and demonstrates superior performance for coronary artery segmentation and stenosis detection.
Abstract: Background: Coronary Artery Disease (CAD) is one of the leading causes of death worldwide. Invasive Coronary Angiography (ICA), regarded as the gold standard for CAD diagnosis, necessitates precise vessel segmentation and stenosis detection. However, ICA images are typically characterized by low contrast, high noise levels, and complex, fine-grained vascular structures, which pose significant challenges to the clinical adoption of existing segmentation and detection methods. Objective: This study aims to improve the accuracy of coronary artery segmentation and stenosis detection in ICA images by integrating multi-scale structural priors, state-space-based long-range dependency modeling, and frequency-domain detail enhancement strategies. Methods: We propose SFD-Mamba2Net, an end-to-end framework tailored for ICA-based vascular segmentation and stenosis detection. In the encoder, a Curvature-Aware Structural Enhancement (CASE) module is embedded to leverage multi-scale responses for highlighting slender tubular vascular structures, suppressing background interference, and directing attention toward vascular regions. In the decoder, we introduce a Progressive High-Frequency Perception (PHFP) module that employs multi-level wavelet decomposition to progressively refine high-frequency details while integrating low-frequency global structures. Results and Conclusions: SFD-Mamba2Net consistently outperformed state-of-the-art methods across eight segmentation metrics, and achieved the highest true positive rate and positive predictive value in stenosis detection.
[71] Live(r) Die: Predicting Survival in Colorectal Liver Metastasis
Muhammad Alberb, Helen Cheung, Anne Martel
Main category: cs.CV
TL;DR: Automated MRI-based framework for predicting surgical outcomes in colorectal liver metastasis using segmentation and radiomics pipelines, achieving 10%+ improvement over existing biomarkers.
Details
Motivation: Current prognostic models for colorectal liver metastasis lack predictive power, especially for multifocal cases, necessitating more accurate automated prediction methods.
Method: Combines segmentation pipeline (using promptable foundation models and SAMONAI for 3D segmentation) with radiomics pipeline (SurvAMINN neural network for survival analysis from extracted tumor features).
Result: Framework outperforms existing clinical and genomic biomarkers with C-index improvement exceeding 10% on 227-patient dataset.
Conclusion: Integration of automated segmentation and radiomics enables accurate, efficient, and interpretable outcome prediction for colorectal liver metastasis patients.
Abstract: Colorectal cancer frequently metastasizes to the liver, significantly reducing long-term survival. While surgical resection is the only potentially curative treatment for colorectal liver metastasis (CRLM), patient outcomes vary widely depending on tumor characteristics along with clinical and genomic factors. Current prognostic models, often based on limited clinical or molecular features, lack sufficient predictive power, especially in multifocal CRLM cases. We present a fully automated framework for surgical outcome prediction from pre- and post-contrast MRI acquired before surgery. Our framework consists of a segmentation pipeline and a radiomics pipeline. The segmentation pipeline learns to segment the liver, tumors, and spleen from partially annotated data by leveraging promptable foundation models to complete missing labels. Also, we propose SAMONAI, a novel zero-shot 3D prompt propagation algorithm that leverages the Segment Anything Model to segment 3D regions of interest from a single point prompt, significantly improving our segmentation pipeline’s accuracy and efficiency. The predicted pre- and post-contrast segmentations are then fed into our radiomics pipeline, which extracts features from each tumor and predicts survival using SurvAMINN, a novel autoencoder-based multiple instance neural network for survival analysis. SurvAMINN jointly learns dimensionality reduction and hazard prediction from right-censored survival data, focusing on the most aggressive tumors. Extensive evaluation on an institutional dataset comprising 227 patients demonstrates that our framework surpasses existing clinical and genomic biomarkers, delivering a C-index improvement exceeding 10%. Our results demonstrate the potential of integrating automated segmentation algorithms and radiomics-based survival analysis to deliver accurate, annotation-efficient, and interpretable outcome prediction in CRLM.
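Since the headline result is a C-index gain, a quick illustration of the metric may help: Harrell's concordance index measures how often a model ranks patient pairs correctly by survival. A minimal sketch on synthetic right-censored data, using lifelines; the risk scores are stand-ins, and SurvAMINN itself is not reproduced.

```python
# Illustrative only: Harrell's C-index on synthetic right-censored survival data.
import numpy as np
from lifelines.utils import concordance_index

rng = np.random.default_rng(42)
n = 227                                       # cohort size from the abstract
risk = rng.normal(size=n)                     # stand-in model risk scores
times = rng.exponential(24.0 / np.exp(risk))  # months; higher risk -> shorter survival
observed = rng.random(n) < 0.6                # ~60% events, the rest censored

# concordance_index expects scores that rise with survival time, so negate risk
print("C-index:", round(concordance_index(times, -risk, observed), 3))
```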
[72] Discovering Divergent Representations between Text-to-Image Models
Lisa Dunlap, Joseph E. Gonzalez, Trevor Darrell, Fabian Caba Heilbron, Josef Sivic, Bryan Russell
Main category: cs.CV
TL;DR: CompCon is an evolutionary search algorithm that discovers visual attribute differences between text-to-image models and identifies prompt concepts that trigger these divergences.
Details
Motivation: To understand when and how visual representations learned by different generative models diverge, and to systematically discover visual attributes that appear in one model's outputs but not another's.Method: Introduces CompCon, an evolutionary search algorithm that discovers visual attributes more prevalent in one model’s output than another, along with prompt concepts linked to these differences. Uses automated data generation pipeline to create ID2 dataset with 60 input-dependent differences.
Result: Successfully compares popular text-to-image models, finding divergent representations such as PixArt depicting loneliness prompts with wet streets and Stable Diffusion 3.5 depicting African American people in media professions.
Conclusion: CompCon provides an effective method for systematically discovering and analyzing visual representation differences between text-to-image models, revealing model-specific biases and characteristics.
Abstract: In this paper, we investigate when and how visual representations learned by two different generative models diverge. Given two text-to-image models, our goal is to discover visual attributes that appear in images generated by one model but not the other, along with the types of prompts that trigger these attribute differences. For example, “flames” might appear in one model’s outputs when given prompts expressing strong emotions, while the other model does not produce this attribute given the same prompts. We introduce CompCon (Comparing Concepts), an evolutionary search algorithm that discovers visual attributes more prevalent in one model’s output than the other, and uncovers the prompt concepts linked to these visual differences. To evaluate CompCon’s ability to find diverging representations, we create an automated data generation pipeline to produce ID2, a dataset of 60 input-dependent differences, and compare our approach to several LLM- and VLM-powered baselines. Finally, we use CompCon to compare popular text-to-image models, finding divergent representations such as how PixArt depicts prompts mentioning loneliness with wet streets and Stable Diffusion 3.5 depicts African American people in media professions. Code at: https://github.com/adobe-research/CompCon
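The measurement at the heart of such a comparison can be approximated with off-the-shelf CLIP: score how strongly an attribute (e.g., "wet streets") registers in each model's outputs and difference the averages. A hedged sketch of that scoring step only, not the CompCon evolutionary search; the checkpoint and scheme are illustrative choices.

```python
# Sketch: CLIP-based prevalence of a visual attribute in two image sets.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def attribute_score(images, attribute):
    """Mean image-text similarity of a list of PIL images against one attribute string."""
    inputs = processor(text=[attribute], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.squeeze(-1).mean().item()

# imgs_a, imgs_b: images generated by the two models from the same prompts, e.g.
# divergence = attribute_score(imgs_a, "wet streets") - attribute_score(imgs_b, "wet streets")
```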
[73] A U-Net-Based Deep Neural Network for Cloud Shadow and Sun-Glint Correction of Unmanned Aerial System (UAS) Imagery
Yibin Wang, Wondimagegn Beshah, Padmanava Dash, Haifeng Wang
Main category: cs.CV
TL;DR: Novel machine learning approach using U-Net to identify and correct cloud shadows and sun glint in UAS imagery for water quality estimation.
Details
Motivation: UAS imagery is often sullied by cloud shadows and sun glint, which pose serious issues for estimating water quality parameters from the images.
Method: U-Net based deep learning model trained with pixel-level data extraction to identify and extract regions with cloud shadows and sun glint, separating them from clear sky and unaffected regions.
Result: Developed a high-quality image correction model that can recover cloud shadow and sun glint areas in UAS images.
Conclusion: The proposed machine learning approach effectively addresses the limitations of UAS imagery by providing a solution to correct cloud shadows and sun glint, improving water quality parameter estimation.
Abstract: The use of unmanned aerial systems (UASs) has increased tremendously in the current decade. They have significantly advanced remote sensing with the capability to deploy and image the terrain as per required spatial, spectral, temporal, and radiometric resolutions for various remote sensing applications. One of the major advantages of UAS imagery is that images can be acquired in cloudy conditions by flying the UAS under the clouds. The limitation to the technology is that the imagery is often sullied by cloud shadows. Images taken over water are additionally affected by sun glint. These two pose serious issues for estimating water quality parameters from UAS images. This study proposes a novel machine learning approach that first identifies and extracts regions with cloud shadows and sun glint, separating them from unobstructed clear-sky regions and regions unaffected by sun glint. Data were extracted from the images at the pixel level to train a U-Net based deep learning model, and the best settings for model training were identified based on various evaluation metrics from test cases. Using this evaluation, a high-quality image correction model was determined, which was used to recover the cloud shadow and sun glint areas in the images.
[74] Classification of Driver Behaviour Using External Observation Techniques for Autonomous Vehicles
Ian Nell, Shane Gilroy
Main category: cs.CV
TL;DR: A computer vision-based system for detecting distracted and impaired driving through external observation, using YOLO object detection and lane analysis to identify unsafe behaviors without requiring vehicle connectivity.
Details
Motivation: Road traffic accidents caused by human error, particularly distracted and impaired driving, remain a major global safety concern that needs effective detection solutions.
Method: Uses advanced computer vision techniques including real-time object tracking, lateral displacement analysis, lane position monitoring, YOLO object detection model, and custom lane estimation algorithms to detect unsafe driving behaviors.
Result: Experimental evaluations on diverse video datasets demonstrate the framework’s reliability and adaptability across varying road and environmental conditions.
Conclusion: The vision-based approach successfully enables behavioral analysis of non-connected vehicles and provides a practical solution for detecting distracted and impaired driving through external observation techniques.
Abstract: Road traffic accidents remain a significant global concern, with human error, particularly distracted and impaired driving, among the leading causes. This study introduces a novel driver behavior classification system that uses external observation techniques to detect indicators of distraction and impairment. The proposed framework employs advanced computer vision methodologies, including real-time object tracking, lateral displacement analysis, and lane position monitoring. The system identifies unsafe driving behaviors such as excessive lateral movement and erratic trajectory patterns by implementing the YOLO object detection model and custom lane estimation algorithms. Unlike systems reliant on inter-vehicular communication, this vision-based approach enables behavioral analysis of non-connected vehicles. Experimental evaluations on diverse video datasets demonstrate the framework’s reliability and adaptability across varying road and environmental conditions.
[75] CoSwin: Convolution Enhanced Hierarchical Shifted Window Attention For Small-Scale Vision
Puskal Khadka, Rodrigue Rizk, Longwei Wang, KC Santosh
Main category: cs.CV
TL;DR: CoSwin combines Vision Transformers with convolutional features to address local feature extraction limitations in small datasets, achieving state-of-the-art performance across multiple benchmarks.
Details
Motivation: Vision Transformers excel at global context but lack key inductive biases like locality and translation equivariance, which limits their performance on small datasets where local feature extraction is crucial.
Method: Proposes CoSwin architecture that integrates a learnable local feature enhancement module into each attention block of hierarchical shifted window attention, enabling simultaneous capture of fine-grained spatial details and global semantic structure.
Result: Achieves consistent performance gains: 2.17% on CIFAR-10, 4.92% on CIFAR-100, 0.10% on MNIST, 0.26% on SVHN, and 4.47% on Tiny ImageNet over baseline Swin Transformer.
Conclusion: Local-global feature fusion effectively enhances generalization and robustness of transformers for small-scale vision tasks, demonstrating the importance of combining convolutional inductive biases with transformer architectures.
Abstract: Vision Transformers (ViTs) have achieved impressive results in computer vision by leveraging self-attention to model long-range dependencies. However, their emphasis on global context often comes at the expense of local feature extraction in small datasets, particularly due to the lack of key inductive biases such as locality and translation equivariance. To mitigate this, we propose CoSwin, a novel feature-fusion architecture that augments the hierarchical shifted window attention with localized convolutional feature learning. Specifically, CoSwin integrates a learnable local feature enhancement module into each attention block, enabling the model to simultaneously capture fine-grained spatial details and global semantic structure. We evaluate CoSwin on multiple image classification benchmarks including CIFAR-10, CIFAR-100, MNIST, SVHN, and Tiny ImageNet. Our experimental results show consistent performance gains over state-of-the-art convolutional and transformer-based models. Notably, CoSwin achieves improvements of 2.17% on CIFAR-10, 4.92% on CIFAR-100, 0.10% on MNIST, 0.26% on SVHN, and 4.47% on Tiny ImageNet over the baseline Swin Transformer. These improvements underscore the effectiveness of local-global feature fusion in enhancing the generalization and robustness of transformers for small-scale vision. Code and pretrained weights available at https://github.com/puskal-khadka/coswin
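The paper does not spell out the enhancement module here, but the general pattern it describes, a learnable convolutional branch fused with window-attention tokens, can be sketched as follows. The specific design (depthwise 3x3 conv, residual fusion) is an assumption, not CoSwin's exact block.

```python
# Hypothetical local-enhancement block: depthwise conv over the token grid,
# added back residually to attention features (assumed design, not CoSwin's exact one).
import torch
import torch.nn as nn

class LocalEnhance(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # locality prior
        self.norm = nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, x, H, W):                       # x: (B, H*W, C) tokens from attention
        B, N, C = x.shape
        grid = x.transpose(1, 2).reshape(B, C, H, W)  # tokens back onto the spatial grid
        local = self.act(self.norm(self.dw(grid)))
        return x + local.flatten(2).transpose(1, 2)   # residual local-global fusion

tokens = torch.randn(2, 14 * 14, 96)
print(LocalEnhance(96)(tokens, 14, 14).shape)         # torch.Size([2, 196, 96])
```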
[76] iMatcher: Improve matching in point cloud registration via local-to-global geometric consistency learning
Karim Slimani, Catherine Achard, Brahim Tamadazte
Main category: cs.CV
TL;DR: iMatcher is a differentiable framework for point cloud feature matching that uses learned features to predict geometrically consistent confidence matrices, achieving state-of-the-art performance on multiple datasets.
Details
Motivation: To improve point cloud registration by developing a fully differentiable framework that incorporates both local and global geometric consistency for more accurate feature matching.
Method: Uses local graph embedding for score matrix initialization, bilateral source-to-target matching via nearest neighbor search, and global geometric consistency learning to refine point-wise matching probabilities.
Result: Achieves 95%-97% inlier ratio on KITTI, 94%-97% on KITTI-360, and up to 81.1% on 3DMatch, demonstrating significant improvement in rigid registration performance across diverse datasets.
Conclusion: iMatcher provides a robust and effective differentiable framework for feature matching that outperforms existing methods and shows strong generalization across indoor/outdoor and partial matching scenarios.
Abstract: This paper presents iMatcher, a fully differentiable framework for feature matching in point cloud registration. The proposed method leverages learned features to predict a geometrically consistent confidence matrix, incorporating both local and global consistency. First, a local graph embedding module leads to an initialization of the score matrix. A subsequent repositioning step refines this matrix by considering bilateral source-to-target and target-to-source matching via nearest neighbor search in 3D space. The paired point features are then stacked together to be refined through global geometric consistency learning to predict a point-wise matching probability. Extensive experiments on real-world outdoor (KITTI, KITTI-360) and indoor (3DMatch) datasets, as well as on 6-DoF pose estimation (TUD-L) and partial-to-partial matching (MVP-RG), demonstrate that iMatcher significantly improves rigid registration performance. The method achieves state-of-the-art inlier ratios, scoring 95% - 97% on KITTI, 94% - 97% on KITTI-360, and up to 81.1% on 3DMatch, highlighting its robustness across diverse settings.
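The bilateral matching step admits a compact expression. Below is a mutual nearest-neighbor screen in 3D coordinate space, assuming it resembles the standard formulation; the learned refinement stages are omitted.

```python
# Sketch: bilateral (mutual) nearest-neighbour matching between two point clouds.
import torch

def mutual_nn(src, tgt):
    """src: (N, 3), tgt: (M, 3). Returns index pairs that agree in both directions."""
    d = torch.cdist(src, tgt)                      # (N, M) pairwise distances
    s2t = d.argmin(dim=1)                          # nearest target per source point
    t2s = d.argmin(dim=0)                          # nearest source per target point
    keep = t2s[s2t] == torch.arange(src.size(0))   # keep only mutual agreements
    src_idx = keep.nonzero(as_tuple=True)[0]
    return src_idx, s2t[src_idx]

src, tgt = torch.randn(100, 3), torch.randn(120, 3)
i, j = mutual_nn(src, tgt)
print(f"{i.numel()} mutual correspondences")
```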
[77] UltrON: Ultrasound Occupancy Networks
Magdalena Wysocki, Felix Duelmer, Ananya Bal, Nassir Navab, Mohammad Farid Azampour
Main category: cs.CV
TL;DR: UltrON is a novel occupancy-based representation that leverages acoustic features from B-mode ultrasound for 3D shape reconstruction, addressing occlusion and annotation dependency issues in weakly-supervised optimization.
Details
Motivation: Traditional ultrasound shape reconstruction methods rely on precise annotations and struggle with view-dependent artifacts and acoustic shadowing. There's a need to utilize the rich acoustic information in B-mode images without additional annotation costs.
Method: Proposes UltrON - an occupancy-based representation that uses acoustic features from B-mode images. Introduces a novel loss function to compensate for view-dependency and enables occupancy optimization from multiview ultrasound.
Result: UltrON mitigates limitations of occlusions and sparse labeling, improves geometric consistency, and generalizes to shapes of the same anatomy without requiring additional annotations.
Conclusion: The approach paves the way for more accurate 3D reconstruction from ultrasound by effectively utilizing acoustic properties and addressing view-dependent challenges in weakly-supervised settings.
Abstract: In free-hand ultrasound imaging, sonographers rely on expertise to mentally integrate partial 2D views into 3D anatomical shapes. Shape reconstruction can assist clinicians in this process. Central to this task is the choice of shape representation, as it determines how accurately and efficiently the structure can be visualized, analyzed, and interpreted. Implicit representations, such as SDF and occupancy function, offer a powerful alternative to traditional voxel- or mesh-based methods by modeling continuous, smooth surfaces with compact storage, avoiding explicit discretization. Recent studies demonstrate that SDF can be effectively optimized using annotations derived from segmented B-mode ultrasound images. Yet, these approaches hinge on precise annotations, overlooking the rich acoustic information embedded in B-mode intensity. Moreover, implicit representation approaches struggle with the ultrasound’s view-dependent nature and acoustic shadowing artifacts, which impair reconstruction. To address the problems resulting from occlusions and annotation dependency, we propose an occupancy-based representation and introduce UltrON that leverages acoustic features to improve geometric consistency in weakly-supervised optimization regime. We show that these features can be obtained from B-mode images without additional annotation cost. Moreover, we propose a novel loss function that compensates for view-dependency in the B-mode images and facilitates occupancy optimization from multiview ultrasound. By incorporating acoustic properties, UltrON generalizes to shapes of the same anatomy. We show that UltrON mitigates the limitations of occlusions and sparse labeling and paves the way for more accurate 3D reconstruction. Code and dataset will be available at https://github.com/magdalena-wysocki/ultron.
[78] Implicit Neural Representations of Intramyocardial Motion and Strain
Andrew Bell, Yan Kit Choi, Steffen Peterson, Andrew King, Muhummad Sohaib Nazir, Alistair Young
Main category: cs.CV
TL;DR: INR-based method for automatic LV motion and strain quantification from tagging MRI, achieving state-of-the-art accuracy with 380x speedup compared to baselines.
Details
Motivation: Automatic quantification of intramyocardial motion and strain from tagging MRI is important but challenging, requiring accurate and scalable solutions for large CMR datasets.
Method: Uses implicit neural representations (INRs) conditioned on learned latent codes to predict continuous left ventricular displacement without requiring inference-time optimization.
Result: Achieved best tracking accuracy (2.14 mm RMSE) and lowest combined error in global circumferential (2.86%) and radial (6.42%) strain on 452 UK Biobank test cases, with ~380x faster inference than most accurate baseline.
Conclusion: INR-based models are suitable for accurate and scalable analysis of myocardial strain in large CMR datasets.
Abstract: Automatic quantification of intramyocardial motion and strain from tagging MRI remains an important but challenging task. We propose a method using implicit neural representations (INRs), conditioned on learned latent codes, to predict continuous left ventricular (LV) displacement – without requiring inference-time optimisation. Evaluated on 452 UK Biobank test cases, our method achieved the best tracking accuracy (2.14 mm RMSE) and the lowest combined error in global circumferential (2.86%) and radial (6.42%) strain compared to three deep learning baselines. In addition, our method is $\sim$380$\times$ faster than the most accurate baseline. These results highlight the suitability of INR-based models for accurate and scalable analysis of myocardial strain in large CMR datasets.
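As a rough picture of the representation: an INR here is a coordinate MLP, conditioned on a per-subject latent code, queried at space-time points to return displacement. Sizes and architecture below are illustrative assumptions, not the paper's configuration.

```python
# Illustrative INR: (x, y, z, t) plus a per-subject latent code -> LV displacement.
import torch
import torch.nn as nn

class DisplacementINR(nn.Module):
    def __init__(self, latent_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),                       # 3D displacement vector
        )

    def forward(self, coords_t, z):                     # coords_t: (N, 4), z: (1, latent_dim)
        z = z.expand(coords_t.size(0), -1)              # broadcast the subject's code
        return self.net(torch.cat([coords_t, z], dim=-1))

inr = DisplacementINR()
disp = inr(torch.rand(1024, 4), torch.randn(1, 64))     # (1024, 3) displacements
print(disp.shape)
```

Because the field is continuous and differentiable, strain can in principle be read off from the spatial gradients of the predicted displacement, which autograd supplies for free.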
[79] E-MLNet: Enhanced Mutual Learning for Universal Domain Adaptation with Sample-Specific Weighting
Samuel Felipe dos Santos, Tiago Agostinho de Almeida, Jurandy Almeida
Main category: cs.CV
TL;DR: E-MLNet improves Universal Domain Adaptation by introducing dynamic weighting to focus adaptation on relevant class boundaries, outperforming previous methods on multiple benchmarks.
Details
Motivation: Existing Universal Domain Adaptation methods treat all classifiers equally, which dilutes learning signals and reduces effectiveness in distinguishing known from unknown classes.
Method: Enhanced Mutual Learning Network (E-MLNet) integrates a dynamic weighting strategy into Open-set Entropy Minimization, using closed-set classifier predictions to focus adaptation on the most relevant class boundaries for each target sample.
Result: E-MLNet achieves highest average H-scores on VisDA and ImageCLEF benchmarks, outperforming MLNet in 22/31 Open-Partial DA tasks and 19/31 Open-Set DA tasks, demonstrating superior robustness.
Conclusion: The focused adaptation strategy through dynamic weighting significantly improves Universal Domain Adaptation performance by sharpening the distinction between known and unknown classes.
Abstract: Universal Domain Adaptation (UniDA) seeks to transfer knowledge from a labeled source to an unlabeled target domain without assuming any relationship between their label sets, requiring models to classify known samples while rejecting unknown ones. Advanced methods like Mutual Learning Network (MLNet) use a bank of one-vs-all classifiers adapted via Open-set Entropy Minimization (OEM). However, this strategy treats all classifiers equally, diluting the learning signal. We propose the Enhanced Mutual Learning Network (E-MLNet), which integrates a dynamic weighting strategy to OEM. By leveraging the closed-set classifier’s predictions, E-MLNet focuses adaptation on the most relevant class boundaries for each target sample, sharpening the distinction between known and unknown classes. We conduct extensive experiments on four challenging benchmarks: Office-31, Office-Home, VisDA-2017, and ImageCLEF. The results demonstrate that E-MLNet achieves the highest average H-scores on VisDA and ImageCLEF and exhibits superior robustness over its predecessor. E-MLNet outperforms the strong MLNet baseline in the majority of individual adaptation tasks – 22 out of 31 in the challenging Open-Partial DA setting and 19 out of 31 in the Open-Set DA setting – confirming the benefits of our focused adaptation strategy.
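The dynamic weighting can be pictured as scaling each one-vs-all classifier's entropy term by the closed-set classifier's probability for that class. A hedged sketch of that form; the paper's exact formulation may differ.

```python
# Sketch: sample-specific weighting of open-set entropy minimisation (assumed form).
import torch
import torch.nn.functional as F

def weighted_oem_loss(ova_logits, closed_logits):
    """ova_logits: (B, C, 2) one-vs-all known/unknown logits; closed_logits: (B, C)."""
    p = F.softmax(ova_logits, dim=-1)
    ent = -(p * p.clamp_min(1e-8).log()).sum(dim=-1)   # (B, C) entropy per classifier
    w = F.softmax(closed_logits, dim=-1).detach()      # focus on the likely classes
    return (w * ent).sum(dim=1).mean()

loss = weighted_oem_loss(torch.randn(8, 10, 2), torch.randn(8, 10))
print(loss.item())
```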
[80] COCO-Urdu: A Large-Scale Urdu Image-Caption Dataset with Multimodal Quality Estimation
Umair Hassan
Main category: cs.CV
TL;DR: COCO-Urdu is the largest publicly available Urdu image-caption dataset with 59K images and 319K captions, created to address the under-representation of Urdu in multimodal research.
Details
Motivation: Urdu is critically under-served in multimodal research despite being spoken by 250M+ people, with no large-scale datasets available, leading to biases in multilingual vision-language models.
Method: Derived from MS COCO using stratified sampling, translated with SeamlessM4T v2, and validated with hybrid multimodal quality estimation (COMET-Kiwi, CLIP similarity, BERTScore + back-translation) with iterative refinement using LLMs.
Result: Dataset contains 59,000 images and 319,000 high-quality Urdu captions, benchmarked with strong BLEU, SacreBLEU, and chrF scores.
Conclusion: COCO-Urdu reduces language bias in multimodal research and establishes foundation for inclusive vision-language systems, with both dataset and quality pipeline released publicly.
Abstract: Urdu, spoken by over 250 million people, remains critically under-served in multimodal and vision-language research. The absence of large-scale, high-quality datasets has limited the development of Urdu-capable systems and reinforced biases in multilingual vision-language models trained primarily on high-resource languages. To address this gap, we present COCO-Urdu, a large-scale image-caption dataset derived from MS COCO, containing 59,000 images and 319,000 Urdu captions selected through stratified sampling to preserve the original distribution. Captions were translated using SeamlessM4T v2 and validated with a hybrid multimodal quality estimation framework that integrates COMET-Kiwi for translation quality, CLIP-based similarity for visual grounding, and BERTScore with back-translation for semantic consistency; low-scoring captions were iteratively refined using open-source large language models. We further benchmark COCO-Urdu on BLEU, SacreBLEU, and chrF, reporting consistently strong results. To the best of our knowledge, COCO-Urdu is the largest publicly available Urdu captioning dataset. By releasing both the dataset and the quality estimation pipeline, we aim to reduce language bias in multimodal research and establish a foundation for inclusive vision-language systems.
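Schematically, the hybrid gate combines the three quality signals into one acceptance decision. In the sketch below the scorers are abstracted as callables, and the weights and threshold are placeholders, not the paper's values.

```python
# Schematic QE gate; comet_kiwi, clip_sim and bert_score_bt stand in for the
# real scorers, and the weights/threshold are illustrative placeholders.
def caption_quality(src_en, cand_ur, image, comet_kiwi, clip_sim, bert_score_bt,
                    weights=(0.4, 0.3, 0.3), threshold=0.7):
    scores = (
        comet_kiwi(src_en, cand_ur),      # reference-free translation quality
        clip_sim(image, cand_ur),         # visual grounding of the Urdu caption
        bert_score_bt(src_en, cand_ur),   # semantic consistency via back-translation
    )
    total = sum(w * s for w, s in zip(weights, scores))
    return total, total >= threshold      # failures go to LLM-based refinement
```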
[81] VoxelFormer: Parameter-Efficient Multi-Subject Visual Decoding from fMRI
Chenqian Le, Yilin Zhao, Nikasadat Emami, Kushagra Yadav, Xujin “Chris” Liu, Xupeng Chen, Yao Wang
Main category: cs.CV
TL;DR: VoxelFormer is a lightweight transformer that enables multi-subject fMRI visual decoding using token merging and query-based transformers, achieving competitive performance with fewer parameters.
Details
Motivation: Most fMRI-based visual decoding methods require subject-specific training, which limits scalability and practical deployment. The goal is to develop a more efficient multi-subject approach.
Method: VoxelFormer uses a Token Merging Transformer (ToMer) for voxel compression and a Q-Former that produces neural representations aligned with CLIP image embeddings for multi-subject training.
Result: Evaluated on the 7T Natural Scenes Dataset, VoxelFormer achieves competitive retrieval performance on trained subjects with significantly fewer parameters than existing methods.
Conclusion: Token merging and query-based transformers are promising strategies for parameter-efficient neural decoding, enabling scalable multi-subject fMRI visual reconstruction.
Abstract: Recent advances in fMRI-based visual decoding have enabled compelling reconstructions of perceived images. However, most approaches rely on subject-specific training, limiting scalability and practical deployment. We introduce \textbf{VoxelFormer}, a lightweight transformer architecture that enables multi-subject training for visual decoding from fMRI. VoxelFormer integrates a Token Merging Transformer (ToMer) for efficient voxel compression and a query-driven Q-Former that produces fixed-size neural representations aligned with the CLIP image embedding space. Evaluated on the 7T Natural Scenes Dataset, VoxelFormer achieves competitive retrieval performance on subjects included during training with significantly fewer parameters than existing methods. These results highlight token merging and query-based transformers as promising strategies for parameter-efficient neural decoding.
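The query-based pooling that yields a fixed-size neural representation can be sketched as a bank of learnable queries cross-attending to voxel tokens. Sizes below are illustrative, and the ToMer compression stage is omitted.

```python
# Sketch: learnable queries pool a variable number of voxel tokens to a fixed size.
import torch
import torch.nn as nn

class QueryPooler(nn.Module):
    def __init__(self, n_queries=256, dim=512, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, voxel_tokens):                   # (B, N, dim), N varies by subject
        q = self.queries.unsqueeze(0).expand(voxel_tokens.size(0), -1, -1)
        out, _ = self.attn(q, voxel_tokens, voxel_tokens)
        return out                                     # (B, n_queries, dim), fixed size

print(QueryPooler()(torch.randn(2, 3000, 512)).shape)  # torch.Size([2, 256, 512])
```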
[82] Integrating Anatomical Priors into a Causal Diffusion Model
Binxu Li, Wei Peng, Mingjie Li, Ehsan Adeli, Kilian M. Pohl
Main category: cs.CV
TL;DR: PCGM is a novel diffusion-based framework that generates anatomically plausible 3D brain MRIs by integrating explicit anatomical constraints through probabilistic graphs and ControlNet masks, enabling accurate replication of subtle disease effects.
Details
Motivation: 3D brain MRI studies need to detect subtle morphometric differences between cohorts, but counterfactual models struggle to produce anatomically plausible MRIs due to lack of explicit inductive biases to preserve fine-grained anatomical details.
Method: Proposes PCGM that integrates anatomical constraints via probabilistic graph module, translates constraints into spatial binary masks using 3D ControlNet extension, and uses a novel counterfactual denoising UNet with 3D diffusion decoder.
Result: PCGM generates higher quality structural brain MRIs than baseline approaches and successfully replicates subtle disease effects on cortical brain regions previously reported in neuroscience literature.
Conclusion: This represents an important milestone for using synthetic MRIs in studies investigating subtle morphological differences, demonstrating the effectiveness of explicit anatomical constraints in counterfactual image generation.
Abstract: 3D brain MRI studies often examine subtle morphometric differences between cohorts that are hard to detect visually. Given the high cost of MRI acquisition, these studies could greatly benefit from image syntheses, particularly counterfactual image generation, as seen in other domains, such as computer vision. However, counterfactual models struggle to produce anatomically plausible MRIs due to the lack of explicit inductive biases to preserve fine-grained anatomical details. This shortcoming arises from the training of the models aiming to optimize for the overall appearance of the images (e.g., via cross-entropy) rather than preserving subtle, yet medically relevant, local variations across subjects. To preserve subtle variations, we propose to explicitly integrate anatomical constraints on a voxel-level as prior into a generative diffusion framework. Called Probabilistic Causal Graph Model (PCGM), the approach captures anatomical constraints via a probabilistic graph module and translates those constraints into spatial binary masks of regions where subtle variations occur. The masks (encoded by a 3D extension of ControlNet) constrain a novel counterfactual denoising UNet, whose encodings are then transferred into high-quality brain MRIs via our 3D diffusion decoder. Extensive experiments on multiple datasets demonstrate that PCGM generates structural brain MRIs of higher quality than several baseline approaches. Furthermore, we show for the first time that brain measurements extracted from counterfactuals (generated by PCGM) replicate the subtle effects of a disease on cortical brain regions previously reported in the neuroscience literature. This achievement is an important milestone in the use of synthetic MRIs in studies investigating subtle morphological differences.
[83] Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models
Qiuhui Chen, Xuancheng Yao, Huping Ye, Yi Hong
Main category: cs.CV
TL;DR: Med3DInsight is a novel pretraining framework that integrates 3D medical image encoders with 2D multimodal large language models using plane-slice-aware transformer and partial optimal transport alignment, achieving state-of-the-art performance on segmentation and classification tasks without human annotations.
Details
Motivation: Existing 3D medical SSL methods lack deep semantic comprehension, while recent MLLMs show promise for enhanced image understanding through text descriptions. The goal is to leverage 2D MLLMs for improved 3D medical image understanding.
Method: Proposes Med3DInsight framework with 3D image encoders integrated with 2D MLLMs via plane-slice-aware transformer module. Uses partial optimal transport based alignment for noise tolerance from LLM-generated content.
Result: Achieves state-of-the-art performance on segmentation and classification tasks across various CT and MRI datasets, outperforming current SSL methods. Can be seamlessly integrated into existing 3D medical image networks.
Conclusion: Med3DInsight introduces a new paradigm for scalable multimodal 3D medical representation learning without human annotations, demonstrating superior performance and integration capabilities for medical image understanding.
Abstract: Understanding 3D medical image volumes is critical in the medical field, yet existing 3D medical convolution and transformer-based self-supervised learning (SSL) methods often lack deep semantic comprehension. Recent advancements in multimodal large language models (MLLMs) provide a promising approach to enhance image understanding through text descriptions. To leverage these 2D MLLMs for improved 3D medical image understanding, we propose Med3DInsight, a novel pretraining framework that integrates 3D image encoders with 2D MLLMs via a specially designed plane-slice-aware transformer module. Additionally, our model employs a partial optimal transport based alignment, demonstrating greater tolerance to the noise present in LLM-generated content. Med3DInsight introduces a new paradigm for scalable multimodal 3D medical representation learning without requiring human annotations. Extensive experiments demonstrate our state-of-the-art performance on two downstream tasks, i.e., segmentation and classification, across various public datasets with CT and MRI modalities, outperforming current SSL methods. Med3DInsight can be seamlessly integrated into existing 3D medical image understanding networks, potentially enhancing their performance. Our source code, generated datasets, and pre-trained models will be available at https://github.com/Qybc/Med3DInsight.
[84] Improvement of Human-Object Interaction Action Recognition Using Scene Information and Multi-Task Learning Approach
Hesham M. Shehata, Mohammad Abdolrahmani
Main category: cs.CV
TL;DR: Proposed multi-task learning approach with fixed object information improves human action recognition accuracy to 99.25%, outperforming skeleton-only GCNs by 2.75% for human-object interaction detection.
Details
Motivation: Current GCNs fail at detecting human-object interactions due to lack of scene information representation and appropriate learning architectures for fixed objects in environments.
Method: Multi-task learning approach that incorporates fixed object information and interaction area data alongside human skeleton poses, using a custom dataset with interaction and non-interaction classes.
Result: Achieved 99.25% accuracy in recognizing interaction and non-interaction actions, representing a 2.75% improvement over base models using only human skeleton poses.
Conclusion: Incorporating fixed object information through multi-task learning significantly enhances human action recognition performance for human-object interaction scenarios in public environments.
Abstract: Recent graph convolutional neural networks (GCNs) have shown high performance in the field of human action recognition by using human skeleton poses. However, they fail to detect human-object interaction cases successfully due to the lack of an effective representation of scene information and appropriate learning architectures. In this context, we propose a methodology to improve human action recognition performance by considering fixed object information in the environment and following a multi-task learning approach. In order to evaluate the proposed method, we collected real data from public environments and prepared our data set, which includes interaction classes of hands-on fixed objects (e.g., ATM ticketing machines, check-in/out machines, etc.) and non-interaction classes of walking and standing. The multi-task learning approach, along with interaction area information, succeeds in recognizing the studied interaction and non-interaction actions with an accuracy of 99.25%, outperforming the accuracy of the base model using only human skeleton poses by 2.75%.
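In outline, the multi-task setup pairs a shared encoder with an action head and an interaction-area head trained jointly. The sketch below assumes a simple weighted-sum objective; dimensions, head sizes, and the loss weight are illustrative, not the paper's.

```python
# Assumed multi-task layout: shared encoder, action head + interaction-area head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHOI(nn.Module):
    def __init__(self, in_dim, hidden=128, n_actions=4, n_areas=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.action_head = nn.Linear(hidden, n_actions)   # interaction / walking / standing / ...
        self.area_head = nn.Linear(hidden, n_areas)       # which fixed-object region

    def forward(self, x):
        h = self.encoder(x)
        return self.action_head(h), self.area_head(h)

def mtl_loss(action_logits, area_logits, y_action, y_area, lam=0.5):
    # weighted sum of the two task losses; lam is an illustrative trade-off
    return F.cross_entropy(action_logits, y_action) + lam * F.cross_entropy(area_logits, y_area)

model = MultiTaskHOI(in_dim=50)  # e.g., flattened skeleton + object features
a, r = model(torch.randn(16, 50))
print(mtl_loss(a, r, torch.randint(4, (16,)), torch.randint(5, (16,))).item())
```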
[85] IRDFusion: Iterative Relation-Map Difference guided Feature Fusion for Multispectral Object Detection
Jifeng Shen, Haibo Zhan, Xin Zuo, Heng Fan, Xiaohui Yuan, Jun Li, Wankou Yang
Main category: cs.CV
TL;DR: Proposes IRDFusion framework for multispectral object detection using cross-modal feature contrast and screening to enhance salient structures while suppressing background noise.
Details
Motivation: Current multispectral object detection methods retain extraneous background or noise during feature fusion, limiting perceptual performance.
Method: Introduces two novel modules: Mutual Feature Refinement Module (MFRM) for intra/inter-modal feature enhancement, and Differential Feature Feedback Module (DFFM) for dynamic inter-modal differential feature computation. Integrated as Iterative Relation-Map Differential Guided Feature Fusion (IRDFusion) mechanism.
Result: Achieves state-of-the-art performance on FLIR, LLVIP and M$^3$FD datasets, consistently outperforming existing methods across diverse challenging scenarios.
Conclusion: IRDFusion enables high-quality cross-modal fusion by progressively amplifying salient relational signals through iterative feedback while suppressing feature noise, demonstrating robustness and effectiveness.
Abstract: Current multispectral object detection methods often retain extraneous background or noise during feature fusion, limiting perceptual performance. To address this, we propose an innovative feature fusion framework based on cross-modal feature contrastive and screening strategy, diverging from conventional approaches. The proposed method adaptively enhances salient structures by fusing object-aware complementary cross-modal features while suppressing shared background interference. Our solution centers on two novel, specially designed modules: the Mutual Feature Refinement Module (MFRM) and the Differential Feature Feedback Module (DFFM). The MFRM enhances intra- and inter-modal feature representations by modeling their relationships, thereby improving cross-modal alignment and discriminative power. Inspired by feedback differential amplifiers, the DFFM dynamically computes inter-modal differential features as guidance signals and feeds them back to the MFRM, enabling adaptive fusion of complementary information while suppressing common-mode noise across modalities. To enable robust feature learning, the MFRM and DFFM are integrated into a unified framework, which is formally formulated as an Iterative Relation-Map Differential Guided Feature Fusion mechanism, termed IRDFusion. IRDFusion enables high-quality cross-modal fusion by progressively amplifying salient relational signals through iterative feedback, while suppressing feature noise, leading to significant performance gains. In extensive experiments on FLIR, LLVIP and M$^3$FD datasets, IRDFusion achieves state-of-the-art performance and consistently outperforms existing methods across diverse challenging scenarios, demonstrating its robustness and effectiveness. Code will be available at https://github.com/61s61min/IRDFusion.git.
[86] An Improved U-Net Model for Offline handwriting signature denoising
Wanghui Xiao
Main category: cs.CV
TL;DR: Proposes an improved U-net model with discrete wavelet transform and PCA for denoising offline handwriting signatures to enhance signature recognition systems.
Details
Motivation: Handwriting signatures are crucial for identity recognition but often contain interfering information from historical documents, creating challenges for forensic analysis and identification work.
Method: Developed a signature handwriting denoising model based on improved U-net structure, incorporating discrete wavelet transform and PCA transform to enhance noise suppression capabilities.
Result: Experimental results show the model significantly outperforms traditional denoising methods, effectively improving clarity and readability of signature images.
Conclusion: The proposed model provides more reliable technical support for signature analysis and recognition by enhancing denoising effectiveness and system robustness.
Abstract: Handwriting signatures, as an important means of identity recognition, are widely used in multiple fields such as financial transactions, commercial contracts and personal affairs due to their legal effect and uniqueness. In forensic science appraisals, the analysis of offline handwriting signatures requires the appraiser to provide a certain number of signature samples, which are usually derived from various historical contracts or archival materials. However, the provided handwriting samples are often mixed with a large amount of interfering information, which brings severe challenges to handwriting identification work. This study proposes a signature handwriting denoising model based on the improved U-net structure, aiming to enhance the robustness of the signature recognition system. By introducing discrete wavelet transform and PCA transform, the model’s ability to suppress noise has been enhanced. The experimental results show that this model is significantly superior to the traditional methods in denoising effect, can effectively improve the clarity and readability of the signature images, and provide more reliable technical support for signature analysis and recognition.
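For reference, the wavelet ingredient is standard: a one-level 2D DWT splits an image into a low-frequency approximation band and three high-frequency detail bands, and is exactly invertible. A small PyWavelets illustration; how the paper wires DWT and PCA into the U-net is not reproduced here.

```python
# Illustration: one-level 2D discrete wavelet transform and its exact inverse.
import numpy as np
import pywt

img = np.random.rand(256, 256).astype(np.float32)   # stand-in for a signature scan
cA, (cH, cV, cD) = pywt.dwt2(img, "haar")           # approximation + H/V/D detail bands
print(cA.shape)                                      # (128, 128): half resolution
recon = pywt.idwt2((cA, (cH, cV, cD)), "haar")       # exact reconstruction
print(np.allclose(recon, img, atol=1e-6))            # True
```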
[87] SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models
Hengyu Fang, Yijiang Liu, Yuan Du, Li Du, Huanrui Yang
Main category: cs.CV
TL;DR: SQAP-VLA is a training-free framework that simultaneously applies quantization and token pruning to Vision-Language-Action models, achieving 1.93x speedup while maintaining performance.
Details
Motivation: VLA models have high computational and memory costs that hinder practical deployment, and existing compression methods fail to combine quantization and token pruning effectively due to incompatibility issues.
Method: Co-designs quantization and token pruning pipeline with new quantization-aware token pruning criteria and enhanced quantizer design that works on aggressively quantized models.
Result: Achieves 1.93x speedup with up to 4.5% average success rate enhancement compared to original VLA models while preserving core performance.
Conclusion: SQAP-VLA successfully overcomes the incompatibility between quantization and token pruning, enabling holistic efficiency improvement for VLA models without training.
Abstract: Vision-Language-Action (VLA) models exhibit unprecedented capabilities for embodied intelligence. However, their extensive computational and memory costs hinder their practical deployment. Existing VLA compression and acceleration approaches conduct quantization or token pruning in an ad-hoc manner but fail to enable both for a holistic efficiency improvement due to an observed incompatibility. This work introduces SQAP-VLA, the first structured, training-free VLA inference acceleration framework that simultaneously enables state-of-the-art quantization and token pruning. We overcome the incompatibility by co-designing the quantization and token pruning pipeline, where we propose new quantization-aware token pruning criteria that work on an aggressively quantized model while improving the quantizer design to enhance pruning effectiveness. When applied to standard VLA models, SQAP-VLA yields significant gains in computational efficiency and inference speed while successfully preserving core model performance, achieving a $\times$1.93 speedup and up to a 4.5% average success rate enhancement compared to the original model.
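One half of the pipeline, token pruning, reduces to keeping the top-k visual tokens under some saliency score. The quantization-aware criterion is the paper's contribution and is not reproduced; the sketch below shows plain saliency top-k under that assumption.

```python
# Sketch: saliency-based top-k visual token pruning (the criterion is a placeholder).
import torch

def prune_tokens(tokens, saliency, keep_ratio=0.5):
    """tokens: (B, N, D); saliency: (B, N), e.g. attention received per token."""
    k = max(1, int(tokens.size(1) * keep_ratio))
    idx = saliency.topk(k, dim=1).indices                         # (B, k)
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

kept = prune_tokens(torch.randn(2, 576, 1024), torch.rand(2, 576))
print(kept.shape)   # torch.Size([2, 288, 1024])
```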
[88] S-BEVLoc: BEV-based Self-supervised Framework for Large-scale LiDAR Global Localization
Chenghao Zhang, Lun Luo, Si-Yuan Cao, Xiaokai Bai, Yuncheng Jin, Zhu Yu, Beinan Yu, Yisen Wang, Hui-Liang Shen
Main category: cs.CV
TL;DR: S-BEVLoc is a self-supervised LiDAR global localization framework that eliminates the need for ground-truth poses by using BEV images and geographic distance-based triplet training with SoftCos loss.
Details
Motivation: Current LiDAR localization methods require expensive ground-truth poses from GPS/SLAM for supervision, which is costly and limits scalability. A self-supervised approach is needed to reduce dependency on labeled data.
Method: Uses BEV images to create training triplets based on geographic distances between keypoint patches. Employs CNN for local feature extraction and NetVLAD for global descriptor aggregation, with SoftCos loss for triplet learning.
Result: Achieves state-of-the-art performance on KITTI and NCLT datasets for place recognition, loop closure, and global localization tasks while offering superior scalability.
Conclusion: S-BEVLoc demonstrates that self-supervised learning can effectively replace supervised approaches in LiDAR global localization, providing high performance without ground-truth pose requirements.
Abstract: LiDAR-based global localization is an essential component of simultaneous localization and mapping (SLAM), which helps loop closure and re-localization. Current approaches rely on ground-truth poses obtained from GPS or SLAM odometry to supervise network training. Despite the great success of these supervised approaches, substantial cost and effort are required for high-precision ground-truth pose acquisition. In this work, we propose S-BEVLoc, a novel self-supervised framework based on bird’s-eye view (BEV) for LiDAR global localization, which eliminates the need for ground-truth poses and is highly scalable. We construct training triplets from single BEV images by leveraging the known geographic distances between keypoint-centered BEV patches. Convolutional neural network (CNN) is used to extract local features, and NetVLAD is employed to aggregate global descriptors. Moreover, we introduce SoftCos loss to enhance learning from the generated triplets. Experimental results on the large-scale KITTI and NCLT datasets show that S-BEVLoc achieves state-of-the-art performance in place recognition, loop closure, and global localization tasks, while offering scalability that would require extra effort for supervised approaches.
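The self-supervision reduces to mining triplets by geographic distance and training descriptors with a triplet objective. The sketch below uses a standard cosine-margin loss in place of the paper's SoftCos, and the radii are illustrative.

```python
# Sketch: distance-based triplet mining plus a standard cosine triplet loss
# (SoftCos itself is the paper's contribution and is not reproduced).
import torch
import torch.nn.functional as F

def select_triplets(coords, r_pos=5.0, r_neg=50.0):
    """coords: (N, 2) patch centres in metres. Returns (anchor, pos, neg) index triples."""
    d = torch.cdist(coords, coords)
    d.fill_diagonal_(float("inf"))
    triplets = []
    for i in range(coords.size(0)):
        pos = torch.nonzero(d[i] < r_pos).squeeze(1)   # nearby patches: positives
        neg = torch.nonzero(d[i] > r_neg).squeeze(1)   # far patches: negatives
        if len(pos) and len(neg):
            triplets.append((i, pos[0].item(), neg[0].item()))
    return triplets

def triplet_margin_loss(fa, fp, fn, margin=0.3):
    d_ap = 1 - F.cosine_similarity(fa, fp)             # anchor-positive cosine distance
    d_an = 1 - F.cosine_similarity(fa, fn)             # anchor-negative cosine distance
    return F.relu(d_ap - d_an + margin).mean()

coords = torch.rand(64, 2) * 200.0                     # patch centres in a 200 m square
print(len(select_triplets(coords, r_pos=20.0)), "triplets mined")
```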
[89] FPI-Det: a face–phone Interaction Dataset for phone-use detection and understanding
Jianqin Gao, Tianqi Wang, Yu Zhang, Yishu Zhang, Chenyuan Wang, Allan Dong, Zihao Wang
Main category: cs.CV
TL;DR: Introduces FPI-Det dataset for detecting phone usage by analyzing face-phone interactions across diverse scenarios with 22,879 annotated images, providing baseline evaluations of YOLO and DETR models.
Details
Motivation: Mobile device usage detection requires understanding behavioral context and fine-grained human-device interactions, which existing benchmarks don't adequately address.
Method: Created FPI-Det dataset with 22,879 images featuring synchronized face and phone annotations across workplace, education, transportation, and public scenarios with extreme scale variations and occlusions.
Result: Evaluated YOLO and DETR detectors, providing baseline performance results across different object sizes, occlusion levels, and environmental conditions.
Conclusion: FPI-Det addresses the gap in fine-grained human-device interaction benchmarks and provides a foundation for improved phone usage detection systems.
Abstract: The widespread use of mobile devices has created new challenges for vision systems in safety monitoring, workplace productivity assessment, and attention management. Detecting whether a person is using a phone requires not only object recognition but also an understanding of behavioral context, which involves reasoning about the relationship between faces, hands, and devices under diverse conditions. Existing generic benchmarks do not fully capture such fine-grained human–device interactions. To address this gap, we introduce the FPI-Det, containing 22,879 images with synchronized annotations for faces and phones across workplace, education, transportation, and public scenarios. The dataset features extreme scale variation, frequent occlusions, and varied capture conditions. We evaluate representative YOLO and DETR detectors, providing baseline results and an analysis of performance across object sizes, occlusion levels, and environments. Source code and dataset are available at https://github.com/KvCgRv/FPI-Det.
[90] Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention
Junhao Xing, Ryohei Miyakawa, Yang Yang, Xinpeng Liu, Risa Shinoda, Hiroaki Santo, Yosuke Toda, Fumio Okura
Main category: cs.CV
TL;DR: ZeroPlantSeg is a zero-shot method for segmenting entire rosette-shaped plant individuals from top-view images by combining foundation segmentation models for leaf extraction and vision-language models for structural reasoning, achieving state-of-the-art performance without training.
Details
Motivation: Existing foundation segmentation models can extract leaf instances in zero-shot but struggle with hierarchical segmentation of entire plant individuals with overlapping leaves, which typically requires annotated training datasets that are species-specific and labor-intensive to create.
Method: Integrates a foundation segmentation model to extract leaf instances and a vision-language model to reason about plant structures for extracting complete plant individuals, all without additional training or annotated datasets.
Result: Outperforms existing zero-shot methods and achieves better cross-domain performance than supervised methods across multiple plant species, growth stages, and shooting environments.
Conclusion: The proposed ZeroPlantSeg framework successfully addresses the hierarchical segmentation challenge for plant individuals using zero-shot learning, demonstrating strong generalization capabilities across diverse conditions without requiring training data.
Abstract: Foundation segmentation models achieve reasonable leaf instance extraction from top-view crop images without training (i.e., zero-shot). However, segmenting entire plant individuals with each consisting of multiple overlapping leaves remains challenging. This problem is referred to as a hierarchical segmentation task, typically requiring annotated training datasets, which are often species-specific and require notable human labor. To address this, we introduce ZeroPlantSeg, a zero-shot segmentation for rosette-shaped plant individuals from top-view images. We integrate a foundation segmentation model, extracting leaf instances, and a vision-language model, reasoning about plants’ structures to extract plant individuals without additional training. Evaluations on datasets with multiple plant species, growth stages, and shooting environments demonstrate that our method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Implementations are available at https://github.com/JunhaoXing/ZeroPlantSeg.
[91] Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval
Tianlu Zheng, Yifan Zhang, Xiang An, Ziyong Feng, Kaicheng Yang, Qichuan Ding
Main category: cs.CV
TL;DR: This paper improves CLIP for person representation learning by creating WebPerson dataset (5M high-quality person-centric image-text pairs) and proposing GA-DMS framework that uses gradient-attention guidance to mask noisy text tokens and enhance cross-modal alignment.
Details
Motivation: CLIP faces two challenges for person representation learning: scarcity of large-scale annotated person-centric vision-language data, and limitations of global contrastive learning which struggles with discriminative local features and is vulnerable to noisy text tokens.
Method: Developed noise-resistant data construction pipeline using MLLMs to filter and caption web-sourced images, creating WebPerson dataset. Introduced GA-DMS framework with gradient-attention guided dual-masking to adaptively mask noisy textual tokens and incorporated masked token prediction objectives.
Result: Extensive experiments show that GA-DMS achieves state-of-the-art performance across multiple benchmarks.
Conclusion: The synergistic improvements in data curation (WebPerson dataset) and model architecture (GA-DMS framework) successfully advance CLIP for person representation learning, addressing both data scarcity and model architecture limitations.
Abstract: Although Contrastive Language-Image Pre-training (CLIP) exhibits strong performance across diverse vision tasks, its application to person representation learning faces two critical challenges: (i) the scarcity of large-scale annotated vision-language data focused on person-centric images, and (ii) the inherent limitations of global contrastive learning, which struggles to maintain discriminative local features crucial for fine-grained matching while remaining vulnerable to noisy text tokens. This work advances CLIP for person representation learning through synergistic improvements in data curation and model architecture. First, we develop a noise-resistant data construction pipeline that leverages the in-context learning capabilities of MLLMs to automatically filter and caption web-sourced images. This yields WebPerson, a large-scale dataset of 5M high-quality person-centric image-text pairs. Second, we introduce the GA-DMS (Gradient-Attention Guided Dual-Masking Synergetic) framework, which improves cross-modal alignment by adaptively masking noisy textual tokens based on the gradient-attention similarity score. Additionally, we incorporate masked token prediction objectives that compel the model to predict informative text tokens, enhancing fine-grained semantic representation learning. Extensive experiments show that GA-DMS achieves state-of-the-art performance across multiple benchmarks.
[92] ALL-PET: A Low-resource and Low-shot PET Foundation Model in the Projection Domain
Bin Huang, Kang Chen, Bingxuan Li, Huafeng Liu, Qiegen Liu
Main category: cs.CV
TL;DR: ALL-PET is a low-resource PET foundation model that uses latent diffusion with innovative mask augmentation and attention mechanisms to achieve high-quality sinogram generation with only 500 samples, enabling multiple PET imaging tasks with efficient memory usage.
Details
Motivation: Building large-scale PET foundation models is challenging due to limited labeled data and computational resources. The goal is to overcome data scarcity and efficiency limitations for PET imaging applications.
Method: Proposes ALL-PET with three key innovations: 1) Radon mask augmentation strategy (RMAS) generating diverse training samples, 2) positive/negative mask constraints for geometric consistency, and 3) transparent medical attention (TMA) for lesion-focused guidance in projection domain.
Result: Achieves high-quality sinogram generation using only 500 samples, with performance comparable to models trained on larger datasets. Generalizes across multiple PET tasks including low-dose reconstruction, attenuation correction, delayed-frame prediction, and tracer separation with memory under 24GB.
Conclusion: ALL-PET demonstrates that effective PET foundation models can be built with minimal data through innovative projection-domain approaches, enabling efficient and versatile PET imaging applications with limited computational resources.
Abstract: Building large-scale foundation model for PET imaging is hindered by limited access to labeled data and insufficient computational resources. To overcome data scarcity and efficiency limitations, we propose ALL-PET, a low-resource, low-shot PET foundation model operating directly in the projection domain. ALL-PET leverages a latent diffusion model (LDM) with three key innovations. First, we design a Radon mask augmentation strategy (RMAS) that generates over 200,000 structurally diverse training samples by projecting randomized image-domain masks into sinogram space, significantly improving generalization with minimal data. This is extended by a dynamic multi-mask (DMM) mechanism that varies mask quantity and distribution, enhancing data diversity without added model complexity. Second, we implement positive/negative mask constraints to embed strict geometric consistency, reducing parameter burden while preserving generation quality. Third, we introduce transparent medical attention (TMA), a parameter-free, geometry-driven mechanism that enhances lesion-related regions in raw projection data. Lesion-focused attention maps are derived from coarse segmentation, covering both hypermetabolic and hypometabolic areas, and projected into sinogram space for physically consistent guidance. The system supports clinician-defined ROI adjustments, ensuring flexible, interpretable, and task-adaptive emphasis aligned with PET acquisition physics. Experimental results show ALL-PET achieves high-quality sinogram generation using only 500 samples, with performance comparable to models trained on larger datasets. ALL-PET generalizes across tasks including low-dose reconstruction, attenuation correction, delayed-frame prediction, and tracer separation, operating efficiently with memory use under 24GB.
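The core of RMAS, pushing an image-domain mask through the Radon transform to obtain a projection-domain mask, can be illustrated with scikit-image. Sizes, sparsity, and angle count below are arbitrary choices, not the paper's settings.

```python
# Sketch: project a random image-domain mask into sinogram space (RMAS core idea).
import numpy as np
from skimage.transform import radon

mask = (np.random.rand(128, 128) > 0.7).astype(float)   # random image-domain mask
theta = np.linspace(0.0, 180.0, 180, endpoint=False)    # projection angles in degrees
sino_mask = radon(mask, theta=theta, circle=False)      # projection-domain mask
print(sino_mask.shape)                                   # (detector bins, n_angles)
```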
[93] Noise-Robust Topology Estimation of 2D Image Data via Neural Networks and Persistent Homology
Dylan Peek, Matthew P. Skerritt, Stephan Chalup
Main category: cs.CV
TL;DR: ANNs outperform traditional PH methods for predicting Betti numbers in noisy 2D binary images due to their ability to learn contextual and geometric priors.
Details
Motivation: Compare noise robustness between Persistent Homology (PH) and Artificial Neural Networks (ANNs) for topological structure inference from data, particularly for predicting Betti numbers.
Method: Trained supervised neural network to predict Betti numbers and compared against PH pipeline using cubical complexes and Signed Euclidean Distance Transform (SEDT) on one synthetic and two real-world datasets with noise.
Result: ANNs outperformed the PH approach under noise conditions, demonstrating better noise robustness.
Conclusion: ANNs offer a compelling alternative to PH for topology estimation under structural noise, leveraging their capacity to learn from training data.
Abstract: Persistent Homology (PH) and Artificial Neural Networks (ANNs) offer contrasting approaches to inferring topological structure from data. In this study, we examine the noise robustness of a supervised neural network trained to predict Betti numbers in 2D binary images. We compare an ANN approach against a PH pipeline based on cubical complexes and the Signed Euclidean Distance Transform (SEDT), which is a widely adopted strategy for noise-robust topological analysis. Using one synthetic and two real-world datasets, we show that ANNs can outperform this PH approach under noise, likely due to their capacity to learn contextual and geometric priors from training data. Though still emerging, the use of ANNs for topology estimation offers a compelling alternative to PH under structural noise.
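For concreteness, here is a minimal sketch of the PH baseline this entry compares against: Betti numbers of a 2D binary image computed from a Signed Euclidean Distance Transform filtration over a cubical complex. It assumes the `gudhi` and `scipy` libraries; the sign convention and the filtration value used to read off Betti numbers are illustrative choices, not necessarily the paper's exact configuration.

```python
# Minimal sketch: Betti numbers of a noisy 2D binary image via the Signed
# Euclidean Distance Transform (SEDT) and a cubical-complex filtration.
import numpy as np
from scipy.ndimage import distance_transform_edt
import gudhi


def sedt(binary_img: np.ndarray) -> np.ndarray:
    """Signed EDT: negative inside the foreground, positive outside."""
    inside = distance_transform_edt(binary_img)
    outside = distance_transform_edt(1 - binary_img)
    return outside - inside


def betti_numbers(binary_img: np.ndarray) -> list:
    filtration = sedt(binary_img)
    cc = gudhi.CubicalComplex(top_dimensional_cells=filtration)
    cc.compute_persistence()
    # Betti numbers of the sublevel set at filtration value 0,
    # i.e. of the foreground region itself.
    return cc.persistent_betti_numbers(0.0, 0.0)


if __name__ == "__main__":
    img = np.zeros((64, 64), dtype=np.uint8)
    img[8:56, 8:56] = 1        # a filled square ...
    img[24:40, 24:40] = 0      # ... with a hole: beta_0 = 1, beta_1 = 1
    print(betti_numbers(img))  # e.g. [1, 1]
```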
[94] Objectness Similarity: Capturing Object-Level Fidelity in 3D Scene Evaluation
Yuiko Uchida, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Main category: cs.CV
TL;DR: OSIM is a new evaluation metric for 3D scenes that focuses on object-level perception rather than overall image quality, better aligning with human visual perception.
Details
Motivation: Existing metrics assess overall image quality but create discrepancies with human perception, as humans fundamentally perceive 3D scenes through attention to individual objects.
Method: Leverages object detection models and their feature representations to quantify the ‘objectness’ of each object in 3D scenes, enabling object-centric evaluation.
Result: User study shows OSIM aligns more closely with human perception compared to existing metrics, and the authors re-evaluated recent 3D reconstruction/generation models under standardized setup.
Conclusion: OSIM provides a more human-aligned evaluation approach for 3D scenes by focusing on object-level perception, with code publicly available for further research.
Abstract: This paper presents Objectness SIMilarity (OSIM), a novel evaluation metric for 3D scenes that explicitly focuses on “objects,” which are fundamental units of human visual perception. Existing metrics assess overall image quality, leading to discrepancies with human perception. Inspired by neuropsychological insights, we hypothesize that human recognition of 3D scenes fundamentally involves attention to individual objects. OSIM enables object-centric evaluations by leveraging an object detection model and its feature representations to quantify the “objectness” of each object in the scene. Our user study demonstrates that OSIM aligns more closely with human perception compared to existing metrics. We also analyze the characteristics of OSIM using various approaches. Moreover, we re-evaluate recent 3D reconstruction and generation models under a standardized experimental setup to clarify advancements in this field. The code is available at https://github.com/Objectness-Similarity/OSIM.
[95] Video Understanding by Design: How Datasets Shape Architectures and Insights
Lei Wang, Piotr Koniusz, Yongsheng Gao
Main category: cs.CV
TL;DR: This survey provides a dataset-driven analysis of video understanding architectures, showing how dataset characteristics (motion complexity, temporal span, hierarchical composition, multimodal richness) shape model design from CNNs to transformers and foundation models.
Details
Motivation: Existing surveys classify video models by task or architecture family, but overlook how dataset structural pressures guide architectural evolution. This work aims to provide a unified framework connecting datasets, inductive biases, and architectures.
Method: The survey adopts a dataset-driven perspective, analyzing how specific dataset characteristics impose inductive biases that models must encode. It reinterprets architectural milestones as responses to these dataset-driven pressures.
Result: The analysis provides a coherent framework that unifies datasets, inductive biases, and architectures, offering both a comprehensive retrospective of video understanding progress and practical design guidance.
Conclusion: This dataset-driven perspective provides a prescriptive roadmap for advancing general-purpose video understanding by aligning model design with dataset invariances while balancing scalability and task demands.
Abstract: Video understanding has advanced rapidly, fueled by increasingly complex datasets and powerful architectures. Yet existing surveys largely classify models by task or family, overlooking the structural pressures through which datasets guide architectural evolution. This survey is the first to adopt a dataset-driven perspective, showing how motion complexity, temporal span, hierarchical composition, and multimodal richness impose inductive biases that models should encode. We reinterpret milestones, from two-stream and 3D CNNs to sequential, transformer, and multimodal foundation models, as concrete responses to these dataset-driven pressures. Building on this synthesis, we offer practical guidance for aligning model design with dataset invariances while balancing scalability and task demands. By unifying datasets, inductive biases, and architectures into a coherent framework, this survey provides both a comprehensive retrospective and a prescriptive roadmap for advancing general-purpose video understanding.
[96] OCELOT 2023: Cell Detection from Cell-Tissue Interaction Challenge
JaeWoong Shin, Jeongun Ryu, Aaron Valero Puche, Jinhee Lee, Biagio Brattoli, Wonkyung Jung, Soo Ick Cho, Kyunghyun Paeng, Chan-Young Ock, Donggeun Yoo, Zhaoyang Li, Wangkai Li, Huayu Mai, Joshua Millward, Zhen He, Aiden Nibali, Lydia Anette Schoenpflug, Viktor Hendrik Koelzer, Xu Shuoyu, Ji Zheng, Hu Bin, Yu-Wen Lo, Ching-Hui Yang, Sérgio Pereira
Main category: cs.CV
TL;DR: OCELOT 2023 challenge demonstrated that incorporating multi-scale cell-tissue interactions significantly improves cell detection performance over traditional cell-only models, with top entries achieving up to 7.99 F1-score improvement.
Details
Motivation: Pathologists examine tissue at multiple magnifications to understand cell-tissue relationships, but existing deep learning models lack this capability due to missing multi-scale annotated datasets.
Method: The challenge created a dataset with overlapping cell detection and tissue segmentation annotations: 673 image pairs across six organs, sourced from 306 TCGA Whole-Slide Images and divided into training/validation/test subsets.
Result: Participant models significantly enhanced understanding of cell-tissue relationships, with top entries achieving up to 7.99 F1-score improvement over baseline cell-only models.
Conclusion: Incorporating multi-scale semantics and cell-tissue interactions is crucial for achieving human-level performance in pathological image analysis, representing a substantial advancement over traditional methods.
Abstract: Pathologists routinely alternate between different magnifications when examining Whole-Slide Images, allowing them to evaluate both broad tissue morphology and intricate cellular details to form comprehensive diagnoses. However, existing deep learning-based cell detection models struggle to replicate these behaviors and learn the interdependent semantics between structures at different magnifications. A key barrier in the field is the lack of datasets with multi-scale overlapping cell and tissue annotations. The OCELOT 2023 challenge was initiated to gather insights from the community to validate the hypothesis that understanding cell and tissue (cell-tissue) interactions is crucial for achieving human-level performance, and to accelerate the research in this field. The challenge dataset includes overlapping cell detection and tissue segmentation annotations from six organs, comprising 673 pairs sourced from 306 The Cancer Genome Atlas (TCGA) Whole-Slide Images with hematoxylin and eosin staining, divided into training, validation, and test subsets. Participants presented models that significantly enhanced the understanding of cell-tissue relationships. Top entries achieved up to a 7.99 increase in F1-score on the test set compared to the baseline cell-only model that did not incorporate cell-tissue relationships. This is a substantial improvement in performance over traditional cell-only detection methods, demonstrating the need for incorporating multi-scale semantics into the models. This paper provides a comparative analysis of the methods used by participants, highlighting innovative strategies implemented in the OCELOT 2023 challenge.
[97] RT-DETR++ for UAV Object Detection
Yuan Shufang
Main category: cs.CV
TL;DR: RT-DETR++ enhances UAV object detection with improved encoder featuring channel-gated attention up/downsampling and CSP-PAC feature fusion, achieving better small object detection while maintaining real-time performance.
Details
Motivation: Address challenges in UAV imagery object detection including densely packed small objects, scale variations, and occlusion issues that are common in aerial imagery.
Method: Enhances RT-DETR encoder with: 1) Channel-gated attention-based upsampling/downsampling (AU/AD) mechanism for error minimization and detail preservation, 2) CSP-PAC feature fusion using parallel hollow (i.e., dilated) convolutions to process local and contextual information simultaneously.
Result: Superior performance in detecting small and densely packed objects while maintaining sufficient speed for real-time detection without increased computational complexity.
Conclusion: Provides an effective approach for feature encoding design in real-time detection systems, particularly beneficial for UAV imagery applications.
Abstract: Object detection in unmanned aerial vehicle (UAV) imagery presents significant challenges. Issues such as densely packed small objects, scale variations, and occlusion are commonplace. This paper introduces RT-DETR++, which enhances the encoder component of the RT-DETR model. Our improvements focus on two key aspects. First, we introduce a channel-gated attention-based upsampling/downsampling (AU/AD) mechanism. This dual-path system minimizes errors and preserves details during feature layer propagation. Second, we incorporate CSP-PAC during feature fusion. This technique employs parallel hollow convolutions to process local and contextual information within the same layer, facilitating the integration of multi-scale features. Evaluation demonstrates that our novel neck design achieves superior performance in detecting small and densely packed objects. The model maintains sufficient speed for real-time detection without increasing computational complexity. This study provides an effective approach for feature encoding design in real-time detection systems.
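The "parallel hollow convolutions" above read most naturally as dilated (atrous) convolutions. Below is a minimal PyTorch sketch of a parallel dilated fusion block in the spirit of CSP-PAC; the class name, branch count, dilation rates, and fusion layer are illustrative assumptions rather than the paper's specification.

```python
# Sketch of a parallel dilated-convolution fusion block: each branch sees the
# same input with a different receptive field, capturing local and contextual
# information within the same layer before a 1x1 fusion.
import torch
import torch.nn as nn


class ParallelDilatedFusion(nn.Module):
    def __init__(self, channels: int, dilations=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=d, dilation=d, bias=False)
            for d in dilations
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(channels * len(dilations), channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.SiLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate multi-scale branch outputs, then fuse with a 1x1 conv.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))


# Usage: a drop-in neck block operating on a 256-channel feature map.
feat = torch.randn(1, 256, 40, 40)
print(ParallelDilatedFusion(256)(feat).shape)  # torch.Size([1, 256, 40, 40])
```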
[98] A Knowledge Noise Mitigation Framework for Knowledge-based Visual Question Answering
Zhiyue Liu, Sihang Liu, Jinyuan Liu, Xinru Zhang
Main category: cs.CV
TL;DR: A training-free framework for knowledge-based VQA that reduces noise by enhancing knowledge relevance and reducing redundancy through better retrieval queries, knowledge extraction, and selective integration.
Details
Motivation: Existing KB-VQA approaches directly augment models with retrieved knowledge but ignore substantial knowledge redundancy, which introduces noise into the answering process.
Method: 1) Create low-noise queries from image-question pairs for better knowledge retrieval; 2) use large models to extract answer-beneficial segments from retrieved knowledge; 3) selective knowledge integration: only incorporate knowledge when the model lacks confidence in answering.
Result: The framework outperforms state-of-the-art methods in extensive experiments, enabling acquisition of accurate and critical knowledge.
Conclusion: The proposed training-free framework effectively mitigates noise from knowledge redundancy by enhancing relevance and reducing redundancy, leading to improved KB-VQA performance.
Abstract: Knowledge-based visual question answering (KB-VQA) requires a model to understand images and utilize external knowledge to provide accurate answers. Existing approaches often directly augment models with retrieved information from knowledge sources while ignoring substantial knowledge redundancy, which introduces noise into the answering process. To address this, we propose a training-free framework with knowledge focusing for KB-VQA, that mitigates the impact of noise by enhancing knowledge relevance and reducing redundancy. First, for knowledge retrieval, our framework concludes essential parts from the image-question pairs, creating low-noise queries that enhance the retrieval of highly relevant knowledge. Considering that redundancy still persists in the retrieved knowledge, we then prompt large models to identify and extract answer-beneficial segments from knowledge. In addition, we introduce a selective knowledge integration strategy, allowing the model to incorporate knowledge only when it lacks confidence in answering the question, thereby mitigating the influence of redundant information. Our framework enables the acquisition of accurate and critical knowledge, and extensive experiments demonstrate that it outperforms state-of-the-art methods.
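The selective knowledge integration strategy can be pictured as a confidence-gated second pass. The sketch below assumes a generic `ask` callable that returns an answer with a confidence score, and an illustrative 0.7 threshold; the paper's actual confidence estimate and prompting details may differ.

```python
# Sketch of confidence-gated knowledge integration: retrieved knowledge is
# injected only when the model is not confident answering from vision alone.
from typing import Callable, Tuple

# `ask` stands in for any VLM call returning (answer, confidence); its
# existence and signature are illustrative assumptions.
Answerer = Callable[[str, str], Tuple[str, float]]


def answer_with_selective_knowledge(
    question: str,
    visual_context: str,
    retrieved_knowledge: str,
    ask: Answerer,
    confidence_threshold: float = 0.7,  # illustrative value
) -> str:
    # First pass: answer from the visual context alone.
    answer, confidence = ask(question, visual_context)
    if confidence >= confidence_threshold:
        return answer  # confident: skip knowledge, avoid injecting noise
    # Low confidence: re-ask with the (pre-filtered) knowledge appended.
    augmented = f"{visual_context}\nRelevant knowledge: {retrieved_knowledge}"
    answer, _ = ask(question, augmented)
    return answer


# Usage with a dummy answerer that is unsure without extra context.
def dummy_ask(question: str, context: str) -> Tuple[str, float]:
    knows = "knowledge" in context
    return ("Eiffel Tower" if knows else "unsure", 0.9 if knows else 0.3)


print(answer_with_selective_knowledge(
    "Which landmark is shown?", "a photo of a tall iron lattice tower",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.", dummy_ask))
```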
[99] CWSSNet: Hyperspectral Image Classification Enhanced by Wavelet Domain Convolution
Yulin Tong, Fengzong Zhang, Haiqin Cheng
Main category: cs.CV
TL;DR: CWSSNet framework combines 3D spectral-spatial features with wavelet convolution for hyperspectral image classification, achieving high accuracy and robustness with limited training data.
Details
Motivation: Hyperspectral images suffer from feature redundancy due to numerous bands and spectral mixing, requiring advanced methods for fine ground object classification in applications like forestry and precision agriculture.
Method: Proposed CWSSNet framework integrating 3D spectral-spatial features and wavelet convolution, using multiscale convolutional attention module for multimodal information fusion and multi-band decomposition in wavelet domain.
Result: Achieved 74.50% mIoU, 82.73% mAcc, and 84.94% mF1 on ZY1F satellite data from Yugan County, with highest IoU for water bodies, vegetation, and bare land classification. Maintained reliable performance with 70% training set and limited training time increase.
Conclusion: CWSSNet effectively addresses hyperspectral image classification challenges, demonstrating superior performance, good robustness, and reliable small-sample training capabilities compared to traditional methods.
Abstract: Hyperspectral remote sensing technology has significant application value in fields such as forestry ecology and precision agriculture, while also putting forward higher requirements for fine ground object classification. However, although hyperspectral images are rich in spectral information and can improve recognition accuracy, they tend to cause prominent feature redundancy due to their numerous bands, high dimensionality, and spectral mixing characteristics. To address this, this study used hyperspectral images from the ZY1F satellite as a data source and selected Yugan County, Shangrao City, Jiangxi Province as the research area to perform ground object classification research. A classification framework named CWSSNet was proposed, which integrates 3D spectral-spatial features and wavelet convolution. This framework integrates multimodal information using a multiscale convolutional attention module and breaks through the classification performance bottleneck of traditional methods by introducing multi-band decomposition and convolution operations in the wavelet domain. The experiments showed that CWSSNet achieved 74.50%, 82.73%, and 84.94% in mean Intersection over Union (mIoU), mean Accuracy (mAcc), and mean F1-score (mF1) respectively in Yugan County. It also obtained the highest Intersection over Union (IoU) in the classification of water bodies, vegetation, and bare land, demonstrating good robustness. Additionally, when the training set proportion was 70%, the increase in training time was limited, and the classification effect was close to the optimal level, indicating that the model maintains reliable performance under small-sample training conditions.
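To illustrate what a wavelet-domain convolution involves, the PyTorch sketch below implements a one-level Haar DWT as fixed stride-2 filters and runs a learnable convolution on the resulting subbands. The block structure is an illustrative assumption, not CWSSNet's exact multi-band design.

```python
# Sketch of a wavelet-domain convolution layer: a one-level Haar DWT splits
# the input into LL/LH/HL/HH subbands, then a learnable conv mixes them.
import torch
import torch.nn as nn
import torch.nn.functional as F


def haar_dwt(x: torch.Tensor) -> torch.Tensor:
    """One-level 2D Haar DWT; (B, C, H, W) -> (B, 4*C, H/2, W/2)."""
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    kernels = torch.stack([ll, lh, hl, hh]).unsqueeze(1)  # (4, 1, 2, 2)
    c = x.shape[1]
    # Apply the 4 analysis filters to every channel independently.
    weight = kernels.repeat(c, 1, 1, 1).to(x)             # (4*C, 1, 2, 2)
    return F.conv2d(x, weight, stride=2, groups=c)


class WaveletConvBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4 * in_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(haar_dwt(x))  # convolution in the wavelet domain


x = torch.randn(2, 32, 64, 64)
print(WaveletConvBlock(32, 64)(x).shape)  # torch.Size([2, 64, 32, 32])
```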
[100] Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios
Chunxiao Li, Xiaoxiao Wang, Meiling Li, Boming Miao, Peng Sun, Yunjian Zhang, Xiangyang Ji, Yao Zhu
Main category: cs.CV
TL;DR: RRDataset evaluates AI-generated image detection models across scenario generalization, internet transmission robustness, and re-digitization robustness, revealing current methods’ limitations in real-world conditions.
Details
Motivation: Address the research gap in evaluating AI-generated image detection methods under complex real-world conditions as realistic image synthesis poses challenges to digital security and media credibility.
Method: Introduces RRDataset with three evaluation dimensions: scenario generalization across 7 major scenarios, internet transmission robustness through social media sharing, and re-digitization robustness with 4 distinct methods. Benchmarks 17 detectors and 10 VLMs, plus human study with 192 participants.
Result: Benchmarking reveals limitations of current AI detection methods under real-world conditions. Human study shows human few-shot learning capabilities in detecting AI-generated images.
Conclusion: Highlights the importance of leveraging human adaptability to develop more robust AI-generated image detection algorithms that perform well in real-world scenarios.
Abstract: With the rapid advancement of generative models, highly realistic image synthesis has posed new challenges to digital security and media credibility. Although AI-generated image detection methods have partially addressed these concerns, a substantial research gap remains in evaluating their performance under complex real-world conditions. This paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions: 1) Scenario Generalization: RRDataset encompasses high-quality images from seven major scenarios (War and Conflict, Disasters and Accidents, Political and Social Events, Medical and Public Health, Culture and Religion, Labor and Production, and everyday life), addressing existing dataset gaps from a content perspective. 2) Internet Transmission Robustness: examining detector performance on images that have undergone multiple rounds of sharing across various social media platforms. 3) Re-digitization Robustness: assessing model effectiveness on images altered through four distinct re-digitization methods. We benchmarked 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducted a large-scale human study involving 192 participants to investigate human few-shot learning capabilities in detecting AI-generated images. The benchmarking results reveal the limitations of current AI detection methods under real-world conditions and underscore the importance of drawing on human adaptability to develop more robust detection algorithms.
[101] Dark-ISP: Enhancing RAW Image Processing for Low-Light Object Detection
Jiasheng Guo, Xin Gao, Yuxiang Yan, Guanghao Li, Jian Pu
Main category: cs.CV
TL;DR: Dark-ISP is a lightweight self-adaptive ISP plugin that processes Bayer RAW images for low-light object detection, outperforming state-of-the-art methods with minimal parameters.
Details
Motivation: Low-light object detection is challenging due to degraded image quality. Existing approaches either use RAW-RGB images with information loss or employ complex frameworks, creating a need for a more efficient solution.
Method: Deconstructs conventional ISP pipelines into sequential linear (sensor calibration) and nonlinear (tone mapping) sub-modules as differentiable components. Uses content-aware adaptability and physics-informed priors with a Self-Boost mechanism for sub-module cooperation.
Result: Outperforms state-of-the-art RGB- and RAW-based detection approaches on three RAW image datasets, achieving superior results in challenging low-light environments with minimal parameters.
Conclusion: The proposed Dark-ISP enables seamless end-to-end training for object detection by directly processing Bayer RAW images, providing an effective solution for low-light object detection with lightweight architecture.
Abstract: Low-light Object detection is crucial for many real-world applications but remains challenging due to degraded image quality. While recent studies have shown that RAW images offer superior potential over RGB images, existing approaches either use RAW-RGB images with information loss or employ complex frameworks. To address these, we propose a lightweight and self-adaptive Image Signal Processing (ISP) plugin, Dark-ISP, which directly processes Bayer RAW images in dark environments, enabling seamless end-to-end training for object detection. Our key innovations are: (1) We deconstruct conventional ISP pipelines into sequential linear (sensor calibration) and nonlinear (tone mapping) sub-modules, recasting them as differentiable components optimized through task-driven losses. Each module is equipped with content-aware adaptability and physics-informed priors, enabling automatic RAW-to-RGB conversion aligned with detection objectives. (2) By exploiting the ISP pipeline’s intrinsic cascade structure, we devise a Self-Boost mechanism that facilitates cooperation between sub-modules. Through extensive experiments on three RAW image datasets, we demonstrate that our method outperforms state-of-the-art RGB- and RAW-based detection approaches, achieving superior results with minimal parameters in challenging low-light environments.
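A minimal sketch of the two differentiable sub-module types described above: a linear sensor-calibration stage (black level and per-channel gain) followed by a nonlinear tone-mapping stage (learnable gamma). Parameter shapes and the packed-Bayer input layout are simplified illustrative assumptions, not Dark-ISP's actual modules.

```python
# Sketch of a differentiable ISP: linear then nonlinear stages, trainable
# end-to-end from a downstream detection loss.
import torch
import torch.nn as nn


class LinearCalibration(nn.Module):
    """Linear sub-module: black-level subtraction and channel-wise gain."""
    def __init__(self, channels: int = 4):  # 4 = packed RGGB Bayer planes
        super().__init__()
        self.black_level = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.gain = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, raw: torch.Tensor) -> torch.Tensor:
        return (raw - self.black_level).clamp(min=0) * self.gain


class ToneMapping(nn.Module):
    """Nonlinear sub-module: a learnable gamma curve."""
    def __init__(self):
        super().__init__()
        self.log_gamma = nn.Parameter(torch.zeros(1))  # gamma = exp(0) = 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x.clamp(min=1e-6) ** self.log_gamma.exp()


# Both stages are differentiable, so gradients from a task-driven loss can
# tune them toward detection rather than hand-crafted image quality.
isp = nn.Sequential(LinearCalibration(), ToneMapping())
packed_raw = torch.rand(1, 4, 128, 128)
print(isp(packed_raw).shape)  # torch.Size([1, 4, 128, 128])
```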
[102] VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models: Methods and Results
Hanwei Zhu, Haoning Wu, Zicheng Zhang, Lingyu Zhu, Yixuan Li, Peilin Chen, Shiqi Wang, Chris Wei Zhou, Linhan Cao, Wei Sun, Xiangyang Zhu, Weixia Zhang, Yucheng Zhu, Jing Liu, Dandan Zhu, Guangtao Zhai, Xiongkuo Min, Zhichao Zhang, Xinyue Li, Shubo Xu, Anh Dao, Yifan Li, Hongyuan Yu, Jiaojiao Yi, Yiding Tian, Yupeng Wu, Feiran Sun, Lijuan Liao, Song Jiang
Main category: cs.CV
TL;DR: VQualA 2025 Challenge evaluated LMMs’ visual quality comparison abilities using a novel benchmark with thousands of tasks, attracting 100+ participants and showcasing emerging capabilities of instruction-tuned models.
Details
Motivation: To evaluate and enhance state-of-the-art Large Multimodal Models' ability to perform open-ended and detailed reasoning about visual quality differences across multiple images.
Method: Created a novel benchmark with thousands of coarse-to-fine grained visual quality comparison tasks (single images, pairs, multi-image groups) using holistic evaluation protocols including 2AFC-based binary preference and multi-choice questions.
Result: Around 100 participants submitted entries, with five models demonstrating emerging capabilities of instruction-tuned LMMs on quality assessment tasks.
Conclusion: The challenge marks a significant step toward open-domain visual quality reasoning and comparison, serving as a catalyst for future research on interpretable and human-aligned quality evaluation systems.
Abstract: This paper presents a summary of the VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models (LMMs), hosted as part of the ICCV 2025 Workshop on Visual Quality Assessment. The challenge aims to evaluate and enhance the ability of state-of-the-art LMMs to perform open-ended and detailed reasoning about visual quality differences across multiple images. To this end, the competition introduces a novel benchmark comprising thousands of coarse-to-fine grained visual quality comparison tasks, spanning single images, pairs, and multi-image groups. Each task requires models to provide accurate quality judgments. The competition emphasizes holistic evaluation protocols, including 2AFC-based binary preference and multi-choice questions (MCQs). Around 100 participants submitted entries, with five models demonstrating the emerging capabilities of instruction-tuned LMMs on quality assessment. This challenge marks a significant step toward open-domain visual quality reasoning and comparison and serves as a catalyst for future research on interpretable and human-aligned quality evaluation systems.
[103] MGTraj: Multi-Granularity Goal-Guided Human Trajectory Prediction with Recursive Refinement Network
Ge Sun, Jun Ma
Main category: cs.CV
TL;DR: MGTraj is a multi-granularity goal-guided model for human trajectory prediction that recursively refines trajectory proposals from coarse to fine levels using transformer-based networks and achieves state-of-the-art performance.
Details
Motivation: Current goal-guided approaches operate at extreme granularities (coarse goal prediction and fine trajectory completion), leaving intermediate temporal granularity unexplored despite its potential utility in capturing diverse human motion patterns.
Method: Proposes MGTraj with recursive encoding from coarse to fine granularity levels, transformer-based recursive refinement networks (RRN) at each level, weight-sharing strategy for feature integration across granularities, and velocity prediction as auxiliary task.
Result: Outperforms baseline methods and achieves state-of-the-art performance among goal-guided methods on ETH/UCY and Stanford Drone Dataset benchmarks.
Conclusion: Multi-granularity modeling effectively captures diverse scales of human dynamics and motion patterns, demonstrating the value of intermediate temporal granularity in goal-guided trajectory prediction frameworks.
Abstract: Accurate human trajectory prediction is crucial for robotics navigation and autonomous driving. Recent research has demonstrated that incorporating goal guidance significantly enhances prediction accuracy by reducing uncertainty and leveraging prior knowledge. Most goal-guided approaches decouple the prediction task into two stages: goal prediction and subsequent trajectory completion based on the predicted goal, which operate at extreme granularities: coarse-grained goal prediction forecasts the overall intention, while fine-grained trajectory completion needs to generate the positions for all future timesteps. The potential utility of intermediate temporal granularity remains largely unexplored, which motivates multi-granularity trajectory modeling. While prior work has shown that multi-granularity representations capture diverse scales of human dynamics and motion patterns, effectively integrating this concept into goal-guided frameworks remains challenging. In this paper, we propose MGTraj, a novel Multi-Granularity goal-guided model for human Trajectory prediction. MGTraj recursively encodes trajectory proposals from coarse to fine granularity levels. At each level, a transformer-based recursive refinement network (RRN) captures features and predicts progressive refinements. Features across different granularities are integrated using a weight-sharing strategy, and velocity prediction is employed as an auxiliary task to further enhance performance. Comprehensive experimental results on ETH/UCY and the Stanford Drone Dataset indicate that MGTraj outperforms baseline methods and achieves state-of-the-art performance among goal-guided methods.
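One simple way to picture multi-granularity trajectory targets is to pool the future positions into progressively fewer waypoints, giving each refinement level a coarser summary to match. The pooling scheme below is an illustrative assumption; MGTraj's actual proposal construction may differ.

```python
# Sketch: build coarse-to-fine trajectory targets by average-pooling the
# future positions into fewer waypoints at each coarser granularity level.
import torch
import torch.nn.functional as F


def multi_granularity_targets(future: torch.Tensor, levels=(3, 6, 12)):
    """future: (B, T, 2) positions; returns one (B, L, 2) tensor per level."""
    targets = []
    for num_waypoints in levels:  # coarse -> fine
        # Pool along time so each waypoint summarizes a span of timesteps.
        pooled = F.adaptive_avg_pool1d(
            future.transpose(1, 2), num_waypoints  # (B, 2, T) -> (B, 2, L)
        ).transpose(1, 2)
        targets.append(pooled)
    return targets


future = torch.randn(8, 12, 2)  # 12 future timesteps
for t in multi_granularity_targets(future):
    print(t.shape)  # (8, 3, 2), (8, 6, 2), (8, 12, 2)
```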
[104] Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement
Jiesi Hu, Jianfeng Cao, Yanwu Yang, Chenfei Ye, Yixuan Zhang, Hanyang Peng, Ting Ma
Main category: cs.CV
TL;DR: Medverse is a universal in-context learning model for 3D medical imaging that handles diverse tasks across multiple organs and modalities, achieving high-fidelity predictions with global anatomical understanding through a novel autoregressive framework.
Details
Motivation: Current ICL models for medical imaging cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and lack unified training across diverse medical imaging tasks and anatomical regions, limiting the potential of ICL in medical applications.
Method: Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, and uses a blockwise cross-attention module for long-range interactions while maintaining computational efficiency through spatial sparsity. Trained on 22 datasets covering diverse tasks.
Result: Medverse substantially outperforms existing ICL baselines on held-out datasets covering unseen clinical centers, organs, species, and imaging modalities, demonstrating superior performance and establishing a novel paradigm for in-context learning.
Conclusion: Medverse presents a universal ICL model that successfully addresses the limitations of current medical imaging ICL approaches, enabling high-fidelity predictions with global anatomical understanding across diverse tasks and anatomical regions, with code and model weights made publicly available.
Abstract: In-context learning (ICL) offers a promising paradigm for universal medical image analysis, enabling models to perform diverse image processing tasks without retraining. However, current ICL models for medical imaging remain limited in two critical aspects: they cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and there is no unified model trained across diverse medical imaging tasks (e.g., segmentation and enhancement) and anatomical regions. As a result, the full potential of ICL in medical imaging remains underexplored. Thus, we present Medverse, a universal ICL model for 3D medical imaging, trained on 22 datasets covering diverse tasks in universal image segmentation, transformation, and enhancement across multiple organs, imaging modalities, and clinical centers. Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, generating consistent, full-resolution volumetric outputs and enabling multi-scale anatomical awareness. We further propose a blockwise cross-attention module that facilitates long-range interactions between context and target inputs while preserving computational efficiency through spatial sparsity. Medverse is extensively evaluated on a broad collection of held-out datasets covering previously unseen clinical centers, organs, species, and imaging modalities. Results demonstrate that Medverse substantially outperforms existing ICL baselines and establishes a novel paradigm for in-context learning. Code and model weights are publicly available at https://github.com/jiesihu/Medverse.
[105] FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, Hongsheng Li
Main category: cs.CV
TL;DR: FLUX-Reason-6M dataset (6M images, 20M bilingual descriptions) and PRISM-Bench evaluation standard created to address reasoning gaps in open-source text-to-image models, with comprehensive evaluation showing performance gaps.
Details
Motivation: Open-source text-to-image models lag behind closed-source systems due to lack of large-scale reasoning-focused datasets and comprehensive evaluation benchmarks.
Method: Created FLUX-Reason-6M dataset with 6M high-quality images organized by 6 key characteristics and Generation Chain-of-Thought breakdowns. Developed PRISM-Bench with 7 evaluation tracks using advanced vision-language models for human-aligned assessment.
Result: Evaluation of 19 leading models revealed critical performance gaps and specific areas needing improvement in reasoning-oriented T2I generation.
Conclusion: The released dataset, benchmark, and evaluation code aim to catalyze the next wave of reasoning-oriented text-to-image generation by providing previously unavailable resources to the research community.
Abstract: The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics (Imagination, Entity, Text rendering, Style, Affection, and Composition) and paired with explicit Generation Chain-of-Thought (GCoT) annotations that provide detailed breakdowns of image generation steps. The whole data curation takes 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: https://flux-reason-6m.github.io/.
[106] CoAtNeXt:An Attention-Enhanced ConvNeXtV2-Transformer Hybrid Model for Gastric Tissue Classification
Mustafa Yurdakul, Sakir Tasdemir
Main category: cs.CV
TL;DR: CoAtNeXt is a novel hybrid model combining CoAtNet architecture with enhanced ConvNeXtV2 blocks and CBAM attention, achieving state-of-the-art performance in gastric tissue image classification on both binary and multiclass tasks.
Details
Motivation: Manual histopathologic examination of gastric tissue is labor-intensive, prone to variability among pathologists, and may miss critical findings. There's a need for automated, reliable methods to improve diagnostic consistency and efficiency.
Method: Proposed CoAtNeXt model built on CoAtNet architecture by replacing MBConv layers with enhanced ConvNeXtV2 blocks and integrating CBAM attention module. Evaluated on HMU-GC-HE-30K (8-class) and GasHisSDB (binary) datasets against 10 CNN and 10 ViT models.
Result: Achieved 96.47% accuracy, 96.60% precision, 96.47% recall, 96.45% F1 score, and 99.89% AUC on HMU-GC-HE-30K. On GasHisSDB: 98.29% accuracy, 98.07% precision, 98.41% recall, 98.23% F1 score, and 99.90% AUC. Outperformed all compared models.
Conclusion: CoAtNeXt is a robust architecture for histopathological classification of gastric tissue images, demonstrating potential to assist pathologists by enhancing diagnostic accuracy and reducing workload in both binary and multiclass scenarios.
Abstract: Background and objective: Early diagnosis of gastric diseases is crucial to prevent fatal outcomes. Although histopathologic examination remains the diagnostic gold standard, it is performed entirely manually, making evaluations labor-intensive and prone to variability among pathologists. Critical findings may be missed, and lack of standard procedures reduces consistency. These limitations highlight the need for automated, reliable, and efficient methods for gastric tissue analysis. Methods: In this study, a novel hybrid model named CoAtNeXt was proposed for the classification of gastric tissue images. The model is built upon the CoAtNet architecture by replacing its MBConv layers with enhanced ConvNeXtV2 blocks. Additionally, the Convolutional Block Attention Module (CBAM) is integrated to improve local feature extraction through channel and spatial attention mechanisms. The architecture was scaled to achieve a balance between computational efficiency and classification performance. CoAtNeXt was evaluated on two publicly available datasets, HMU-GC-HE-30K for eight-class classification and GasHisSDB for binary classification, and was compared against 10 Convolutional Neural Network (CNN) and 10 Vision Transformer (ViT) models. Results: CoAtNeXt achieved 96.47% accuracy, 96.60% precision, 96.47% recall, 96.45% F1 score, and 99.89% AUC on HMU-GC-HE-30K. On GasHisSDB, it reached 98.29% accuracy, 98.07% precision, 98.41% recall, 98.23% F1 score, and 99.90% AUC. It outperformed all CNN and ViT models tested and surpassed previous studies in the literature. Conclusion: Experimental results show that CoAtNeXt is a robust architecture for histopathological classification of gastric tissue images, providing strong performance on both binary and multiclass tasks. This highlights its potential to assist pathologists by enhancing diagnostic accuracy and reducing workload.
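For reference, the CBAM module integrated into CoAtNeXt applies channel attention computed from pooled descriptors, followed by spatial attention computed from channel-wise statistics. The PyTorch sketch below uses the common CBAM defaults (reduction 16, 7x7 spatial kernel), which are not necessarily this paper's settings.

```python
# Sketch of the Convolutional Block Attention Module (CBAM): channel
# attention first (which feature maps matter), then spatial attention
# (where in the image to look).
import torch
import torch.nn as nn


class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16,
                 kernel_size: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(  # shared MLP for avg- and max-pooled vectors
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size,
                                 padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention from global average- and max-pooled descriptors.
        avg = x.mean(dim=(2, 3), keepdim=True)
        mx = x.amax(dim=(2, 3), keepdim=True)
        x = x * torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        # Spatial attention from channel-wise mean and max maps.
        stats = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(stats))


feat = torch.randn(2, 64, 56, 56)
print(CBAM(64)(feat).shape)  # torch.Size([2, 64, 56, 56])
```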
[107] DATE: Dynamic Absolute Time Enhancement for Long Video Understanding
Chao Yuan, Yang Yang, Yehui Yang, Zach Cheng
Main category: cs.CV
TL;DR: DATE enhances temporal awareness in MLLMs through timestamp injection and semantic-guided sampling for better long video understanding.
Details
Motivation: Existing MLLMs struggle with long-range temporal dependencies and precise event localization in long videos due to uniform frame sampling and implicit position encodings.
Method: Proposes Dynamic Absolute Time Enhancement (DATE) with Timestamp Injection Mechanism (TIM) and Temporal-Aware Similarity Sampling (TASS) strategy that treats video sampling as vision-language retrieval with two-stage algorithm.
Result: Achieves state-of-the-art performance on hour-long video benchmarks, with 7B model outperforming many 72B models on some tasks.
Conclusion: DATE significantly improves temporal reasoning and event localization in long videos through explicit temporal modeling and semantic-aware sampling strategies.
Abstract: Long video understanding remains a fundamental challenge for multimodal large language models (MLLMs), particularly in tasks requiring precise temporal reasoning and event localization. Existing approaches typically adopt uniform frame sampling and rely on implicit position encodings to model temporal order. However, these methods struggle with long-range dependencies, leading to critical information loss and degraded temporal comprehension. In this paper, we propose Dynamic Absolute Time Enhancement (DATE) that enhances temporal awareness in MLLMs through the Timestamp Injection Mechanism (TIM) and a semantically guided Temporal-Aware Similarity Sampling (TASS) strategy. Specifically, we interleave video frame embeddings with textual timestamp tokens to construct a continuous temporal reference system. We further reformulate the video sampling problem as a vision-language retrieval task and introduce a two-stage algorithm to ensure both semantic relevance and temporal coverage: enriching each query into a descriptive caption to better align with the vision feature, and sampling key event with a similarity-driven temporally regularized greedy strategy. Our method achieves remarkable improvements w.r.t. absolute time understanding and key event localization, resulting in state-of-the-art performance among 7B and 72B models on hour-long video benchmarks. Particularly, our 7B model even exceeds many 72B models on some benchmarks.
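The Timestamp Injection Mechanism can be pictured as interleaving each frame embedding with the embedded tokens of its absolute timestamp. In the sketch below, `embed_text` stands in for the MLLM's own tokenizer and projector, and the timestamp string format is an illustrative assumption.

```python
# Sketch of timestamp injection: interleave each frame's embedding with the
# embedding of a textual timestamp token, giving the sequence an explicit
# absolute-time reference system.
import torch


def interleave_timestamps(
    frame_embeds: torch.Tensor,   # (N, D): one embedding per sampled frame
    timestamps_sec: list,         # absolute time (seconds) of each frame
    embed_text,                   # callable: str -> (T, D) token embeddings
) -> torch.Tensor:
    """Builds the sequence [t_1][f_1][t_2][f_2]... along dim 0."""
    pieces = []
    for emb, t in zip(frame_embeds, timestamps_sec):
        time_tokens = embed_text(f"<{t:.1f}s>")  # e.g. "<437.0s>"
        pieces.append(time_tokens)
        pieces.append(emb.unsqueeze(0))
    return torch.cat(pieces, dim=0)


# Usage with a dummy text embedder (2 tokens per timestamp string).
def dummy_embed(s: str) -> torch.Tensor:
    return torch.randn(2, 256)


seq = interleave_timestamps(torch.randn(4, 256),
                            [0.0, 30.0, 60.0, 90.0], dummy_embed)
print(seq.shape)  # torch.Size([12, 256]) = 4 frames * (2 time + 1 frame)
```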
[108] Unified Start, Personalized End: Progressive Pruning for Efficient 3D Medical Image Segmentation
Linhao Li, Yiwen Ye, Ziyang Chen, Yong Xia
Main category: cs.CV
TL;DR: PSP-Seg is a progressive pruning framework for efficient 3D medical image segmentation that dynamically prunes redundant modules, achieving performance comparable to nnU-Net while significantly reducing resource consumption.
Details
Motivation: 3D medical image segmentation faces heavy resource and time constraints, limiting clinical scalability. Existing static models lack adaptability and struggle to balance performance with efficiency across diverse tasks.
Method: PSP-Seg starts with a redundant model and iteratively prunes redundant modules using block-wise pruning combined with a functional decoupling loss to enable dynamic and efficient segmentation.
Result: PSP-Seg-S (lightweight variant) matches nnU-Net performance while reducing GPU memory by 42-45%, training time by 29-48%, and parameters by 83-87% across five public datasets.
Conclusion: PSP-Seg offers a cost-effective, high-performing alternative for clinical 3D segmentation with significant resource savings, demonstrating strong potential for widespread clinical deployment.
Abstract: 3D medical image segmentation often faces heavy resource and time consumption, limiting its scalability and rapid deployment in clinical environments. Existing efficient segmentation models are typically static and manually designed prior to training, which restricts their adaptability across diverse tasks and makes it difficult to balance performance with resource efficiency. In this paper, we propose PSP-Seg, a progressive pruning framework that enables dynamic and efficient 3D segmentation. PSP-Seg begins with a redundant model and iteratively prunes redundant modules through a combination of block-wise pruning and a functional decoupling loss. We evaluate PSP-Seg on five public datasets, benchmarking it against seven state-of-the-art models and six efficient segmentation models. Results demonstrate that the lightweight variant, PSP-Seg-S, achieves performance on par with nnU-Net while reducing GPU memory usage by 42-45%, training time by 29-48%, and parameter number by 83-87% across all datasets. These findings underscore PSP-Seg’s potential as a cost-effective yet high-performing alternative for widespread clinical application.
[109] Visual Programmability: A Guide for Code-as-Thought in Chart Understanding
Bohao Tang, Yan Ma, Fei Zhang, Jiadi Su, Ethan Chern, Zhulin Hu, Zhixin Wang, Pengfei Liu, Ya Zhang
Main category: cs.CV
TL;DR: Code-as-Thought (CaT) approach for chart understanding that adaptively chooses between code-based symbolic reasoning and direct visual analysis using reinforcement learning with dual rewards.
Details
Motivation: Address limitations of prior chart understanding methods - external tools make systems brittle, while single-strategy models lack verifiability and struggle with complex charts where symbolic representation is unsuitable.
Method: Adaptive framework where VLM learns to choose between CaT pathway (code-based symbolic representation) and direct visual reasoning pathway. Uses reinforcement learning with dual-reward system: data-accuracy reward and decision reward.
Result: Demonstrates strong and robust performance across diverse chart-understanding benchmarks, showing VLMs can dynamically select optimal reasoning pathways.
Conclusion: VLMs can be taught not only to reason but also how to reason, adapting their strategy based on chart complexity and task requirements through visual programmability.
Abstract: Chart understanding presents a critical test to the reasoning capabilities of Vision-Language Models (VLMs). Prior approaches face critical limitations: some rely on external tools, making them brittle and constrained by a predefined toolkit, while others fine-tune specialist models that often adopt a single reasoning strategy, such as text-based chain-of-thought (CoT). The intermediate steps of text-based reasoning are difficult to verify, which complicates the use of reinforcement-learning signals that reward factual accuracy. To address this, we propose a Code-as-Thought (CaT) approach to represent the visual information of a chart in a verifiable, symbolic format. Our key insight is that this strategy must be adaptive: a fixed, code-only implementation consistently fails on complex charts where symbolic representation is unsuitable. This finding leads us to introduce Visual Programmability: a learnable property that determines if a chart-question pair is better solved with code or direct visual analysis. We implement this concept in an adaptive framework where a VLM learns to choose between the CaT pathway and a direct visual reasoning pathway. The selection policy of the model is trained with reinforcement learning using a novel dual-reward system. This system combines a data-accuracy reward to ground the model in facts and prevent numerical hallucination, with a decision reward that teaches the model when to use each strategy, preventing it from defaulting to a single reasoning mode. Experiments demonstrate strong and robust performance across diverse chart-understanding benchmarks. Our work shows that VLMs can be taught not only to reason but also how to reason, dynamically selecting the optimal reasoning pathway for each task.
[110] Modality-Agnostic Input Channels Enable Segmentation of Brain lesions in Multimodal MRI with Sequences Unavailable During Training
Anthony P. Addison, Felix Wagner, Wentian Xu, Natalie Voets, Konstantinos Kamnitsas
Main category: cs.CV
TL;DR: A U-net modification with modality-agnostic input channels enables brain MRI segmentation on both seen and unseen modalities using synthetic MRI augmentation.
Details
Motivation: Existing segmentation models for multimodal brain MRI are restricted to fixed modalities and cannot effectively process new ones at inference, losing discriminative modality-specific information.
Method: Modified U-net architecture with modality-agnostic input channel alongside modality-specific channels, trained using image augmentation that synthesizes artificial MRI modalities by differentially altering appearance of pathological and healthy brain tissue.
Result: Evaluated on 8 MRI databases with 5 pathology types and 8 modalities, the approach preserves ability to process training modalities while effectively handling new unseen modalities to improve segmentation.
Conclusion: A simple architectural modification enables practical multimodal brain MRI segmentation that works with any available imaging modalities, including previously unseen ones.
Abstract: Segmentation models are important tools for the detection and analysis of lesions in brain MRI. Depending on the type of brain pathology that is imaged, MRI scanners can acquire multiple, different image modalities (contrasts). Most segmentation models for multimodal brain MRI are restricted to fixed modalities and cannot effectively process new ones at inference. Some models generalize to unseen modalities but may lose discriminative modality-specific information. This work aims to develop a model that can perform inference on data that contain image modalities unseen during training, previously seen modalities, and heterogeneous combinations of both, thus allowing a user to utilize any available imaging modalities. We demonstrate this is possible with a simple, thus practical alteration to the U-net architecture, by integrating a modality-agnostic input channel or pathway, alongside modality-specific input channels. To train this modality-agnostic component, we develop an image augmentation scheme that synthesizes artificial MRI modalities. Augmentations differentially alter the appearance of pathological and healthy brain tissue to create artificial contrasts between them while maintaining realistic anatomical integrity. We evaluate the method using 8 MRI databases that include 5 types of pathologies (stroke, tumours, traumatic brain injury, multiple sclerosis and white matter hyperintensities) and 8 modalities (T1, T1+contrast, T2, PD, SWI, DWI, ADC and FLAIR). The results demonstrate that the approach preserves the ability to effectively process MRI modalities encountered during training, while being able to process new, unseen modalities to improve its segmentation. Project code: https://github.com/Anthony-P-Addison/AGN-MOD-SEG
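The augmentation scheme can be pictured as remapping image intensities with independent random curves inside and outside the lesion mask, so the pathology/healthy contrast changes while the anatomy stays intact. The curve family below (random gamma with occasional contrast inversion) is an illustrative assumption, not the paper's exact scheme.

```python
# Sketch: synthesize an artificial "modality" by applying independent random
# intensity curves to healthy tissue and to the lesion region.
import numpy as np


def random_intensity_curve(rng: np.random.Generator):
    gamma = rng.uniform(0.4, 2.5)
    invert = rng.random() < 0.5  # sometimes flip the contrast entirely

    def curve(x: np.ndarray) -> np.ndarray:  # expects x in [0, 1]
        y = np.clip(x, 0.0, 1.0) ** gamma
        return 1.0 - y if invert else y

    return curve


def synthesize_modality(img: np.ndarray, lesion_mask: np.ndarray,
                        rng: np.random.Generator) -> np.ndarray:
    """img normalized to [0, 1]; lesion_mask is boolean, same shape."""
    healthy_curve = random_intensity_curve(rng)
    lesion_curve = random_intensity_curve(rng)  # independent: new contrast
    out = healthy_curve(img)
    out[lesion_mask] = lesion_curve(img[lesion_mask])
    return out


rng = np.random.default_rng(0)
vol = rng.random((64, 64, 64), dtype=np.float32)
mask = vol > 0.95  # stand-in for a coarse lesion segmentation
out = synthesize_modality(vol, mask, rng)
print(out.shape, float(out.min()), float(out.max()))
```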
[111] Model-Agnostic Open-Set Air-to-Air Visual Object Detection for Reliable UAV Perception
Spyridon Loukovitis, Anastasios Arsenos, Vasileios Karampinis, Athanasios Voulodimos
Main category: cs.CV
TL;DR: Novel open-set detection framework for UAV air-to-air object detection that handles unknown objects and corrupted flight data using embedding-space entropy modeling with spectral normalization and temperature scaling.
Details
Motivation: Traditional closed-set detectors degrade significantly under domain shifts and flight data corruption, posing safety risks for UAV autonomy in real-world air-to-air detection scenarios.
Method: Model-agnostic framework for embedding-based detectors that estimates semantic uncertainty via entropy modeling in embedding space, incorporating spectral normalization and temperature scaling for enhanced open-set discrimination.
Result: Achieves up to 10% relative AUROC gain over standard YOLO-based detectors on AOT aerial benchmark, with comprehensive ablation studies showing consistent improvements and background rejection enhancing robustness without compromising accuracy.
Conclusion: The proposed solution is well-suited for reliable UAV perception in dynamic air-to-air environments, demonstrating effective open-set detection capabilities and robustness against corrupted flight data.
Abstract: Open-set detection is crucial for robust UAV autonomy in air-to-air object detection under real-world conditions. Traditional closed-set detectors degrade significantly under domain shifts and flight data corruption, posing risks to safety-critical applications. We propose a novel, model-agnostic open-set detection framework designed specifically for embedding-based detectors. The method explicitly handles unknown object rejection while maintaining robustness against corrupted flight data. It estimates semantic uncertainty via entropy modeling in the embedding space and incorporates spectral normalization and temperature scaling to enhance open-set discrimination. We validate our approach on the challenging AOT aerial benchmark and through extensive real-world flight tests. Comprehensive ablation studies demonstrate consistent improvements over baseline methods, achieving up to a 10% relative AUROC gain compared to standard YOLO-based detectors. Additionally, we show that background rejection further strengthens robustness without compromising detection accuracy, making our solution particularly well-suited for reliable UAV perception in dynamic air-to-air environments.
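A minimal sketch of the open-set scoring recipe described above: class logits derived from embedding-prototype similarities are temperature-scaled, and the entropy of the resulting softmax flags unknowns (high entropy means reject). The spectrally normalized projection head, the prototype formulation, and the rejection threshold are illustrative assumptions.

```python
# Sketch: entropy-based unknown rejection in an embedding space, with
# temperature scaling and a spectrally normalized projection head.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Spectral normalization constrains the head's Lipschitz constant, which
# tends to keep embedding distances meaningful for uncertainty estimates.
proj = nn.utils.spectral_norm(nn.Linear(256, 128))


def unknown_score(embedding: torch.Tensor,
                  class_prototypes: torch.Tensor,
                  temperature: float = 2.0) -> torch.Tensor:
    """embedding: (B, 256); class_prototypes: (K, 256). Returns (B,) entropy."""
    z = F.normalize(proj(embedding), dim=-1)
    protos = F.normalize(proj(class_prototypes), dim=-1)
    logits = z @ protos.T / temperature       # cosine-similarity logits
    p = logits.softmax(dim=-1)
    return -(p * p.clamp_min(1e-9).log()).sum(dim=-1)


emb = torch.randn(4, 256)
protos = torch.randn(10, 256)  # one prototype per known class
scores = unknown_score(emb, protos)
reject = scores > 2.0          # threshold tuned on validation data
print(scores.shape, reject)
```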
[112] Learning Object-Centric Representations in SAR Images with Multi-Level Feature Fusion
Oh-Tae Jang, Min-Gon Cho, Kyung-Tae Kim
Main category: cs.CV
TL;DR: SlotSAR is a novel object-centric learning framework that disentangles target representations from background clutter in SAR images without mask annotations, achieving state-of-the-art performance.
Details
Motivation: SAR images contain complex background clutter that resembles targets, causing models to extract entangled or spurious features that undermine clear target representations.
Method: Extracts high-level semantic features from SARATR-X and low-level scattering features from wavelet scattering network, then integrates them using a multi-level slot attention module for enhanced representation distinctiveness.
Result: Achieves state-of-the-art performance in SAR imagery by preserving structural details compared to existing OCL methods.
Conclusion: SlotSAR effectively disentangles target representations from background clutter without mask annotations, providing robust target characterization through complementary multi-level representations.
Abstract: Synthetic aperture radar (SAR) images contain not only targets of interest but also complex background clutter, including terrain reflections and speckle noise. In many cases, such clutter exhibits intensity and patterns that resemble targets, leading models to extract entangled or spurious features. Such behavior undermines the ability to form clear target representations, regardless of the classifier. To address this challenge, we propose a novel object-centric learning (OCL) framework, named SlotSAR, that disentangles target representations from background clutter in SAR images without mask annotations. SlotSAR first extracts high-level semantic features from SARATR-X and low-level scattering features from the wavelet scattering network in order to obtain complementary multi-level representations for robust target characterization. We further present a multi-level slot attention module that integrates these low- and high-level features to enhance slot-wise representation distinctiveness, enabling effective OCL. Experimental results demonstrate that SlotSAR achieves state-of-the-art performance in SAR imagery by preserving structural details compared to existing OCL methods.
[113] You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception
Hao Si, Ehsan Javanmardi, Manabu Tsukada
Main category: cs.CV
TL;DR: PHCP enables heterogeneous collaborative perception during inference without joint training by using few-shot unsupervised domain adaptation and self-training adapters.
Details
Motivation: Existing methods require impractical joint training or pre-stored models for each collaborator, making real-world deployment challenging.
Method: Progressive Heterogeneous Collaborative Perception (PHCP) formulates the problem as few-shot unsupervised domain adaptation, dynamically aligning features by self-training an adapter during inference without labeled data.
Result: Extensive experiments on OPV2V dataset show PHCP achieves strong performance across diverse heterogeneous scenarios, comparable to SOTA methods trained on full dataset using only small unlabeled data.
Conclusion: PHCP successfully addresses heterogeneous collaborative perception during inference without joint training, making it practical for real-world applications.
Abstract: Collaborative perception enables vehicles to overcome individual perception limitations by sharing information, allowing them to see further and through occlusions. In real-world scenarios, models on different vehicles are often heterogeneous due to manufacturer variations. Existing methods for heterogeneous collaborative perception address this challenge by fine-tuning adapters or the entire network to bridge the domain gap. However, these methods are impractical in real-world applications, as each new collaborator must undergo joint training with the ego vehicle on a dataset before inference, or the ego vehicle stores models for all potential collaborators in advance. Therefore, we pose a new question: Can we tackle this challenge directly during inference, eliminating the need for joint training? To answer this, we introduce Progressive Heterogeneous Collaborative Perception (PHCP), a novel framework that formulates the problem as few-shot unsupervised domain adaptation. Unlike previous work, PHCP dynamically aligns features by self-training an adapter during inference, eliminating the need for labeled data and joint training. Extensive experiments on the OPV2V dataset demonstrate that PHCP achieves strong performance across diverse heterogeneous scenarios. Notably, PHCP achieves performance comparable to SOTA methods trained on the entire dataset while using only a small amount of unlabeled data.
[114] Resource-Efficient Glioma Segmentation on Sub-Saharan MRI
Freedmore Sidume, Oumayma Soula, Joseph Muthui Wacira, YunFei Zhu, Abbas Rabiu Muhammad, Abderrazek Zeraii, Oluwaseun Kalejaye, Hajer Ibrahim, Olfa Gaddour, Brain Halubanza, Dong Zhang, Udunna C Anazodo, Confidence Raymond
Main category: cs.CV
TL;DR: A deep learning framework using 3D Attention UNet with residual blocks and transfer learning achieves robust glioma segmentation on limited African MRI data with good performance and practical deployment capabilities.
Details
Motivation: Address the scarcity of high-quality annotated MRI data in Sub-Saharan Africa for glioma segmentation, which is critical for diagnosis and treatment planning in resource-constrained settings.
Method: 3D Attention UNet architecture augmented with residual blocks, enhanced through transfer learning from pre-trained weights on BraTS 2021 dataset, evaluated on BraTS-Africa dataset (95 MRI cases).
Result: Achieved Dice scores of 0.76 for Enhancing Tumor, 0.80 for Necrotic and Non-Enhancing Tumor Core, and 0.85 for Surrounding Non-Functional Hemisphere despite limited data quality and quantity.
Conclusion: The model demonstrates strong generalizability and practicality with compact architecture (90MB) and fast inference time, supporting clinical decision making in low-resource settings and contributing to equitable AI for global health.
Abstract: Gliomas are the most prevalent type of primary brain tumors, and their accurate segmentation from MRI is critical for diagnosis, treatment planning, and longitudinal monitoring. However, the scarcity of high-quality annotated imaging data in Sub-Saharan Africa (SSA) poses a significant challenge for deploying advanced segmentation models in clinical workflows. This study introduces a robust and computationally efficient deep learning framework tailored for resource-constrained settings. We leveraged a 3D Attention UNet architecture augmented with residual blocks and enhanced through transfer learning from pre-trained weights on the BraTS 2021 dataset. Our model was evaluated on 95 MRI cases from the BraTS-Africa dataset, a benchmark for glioma segmentation in SSA MRI data. Despite the limited data quality and quantity, our approach achieved Dice scores of 0.76 for the Enhancing Tumor (ET), 0.80 for Necrotic and Non-Enhancing Tumor Core (NETC), and 0.85 for Surrounding Non-Functional Hemisphere (SNFH). These results demonstrate the generalizability of the proposed model and its potential to support clinical decision making in low-resource settings. The compact architecture, approximately 90 MB, and sub-minute per-volume inference time on consumer-grade hardware further underscore its practicality for deployment in SSA health systems. This work contributes toward closing the gap in equitable AI for global health by empowering underserved regions with high-performing and accessible medical imaging solutions.
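The two architectural ingredients named above, residual blocks and attention gates in a 3D U-Net, can be sketched as follows. Channel counts and normalization choices are illustrative, and the gate signal is assumed to be already resampled to the skip connection's resolution:

```python
# Hedged sketch of a residual 3D conv block and an additive attention gate
# (Attention U-Net style). Sizes are illustrative, not the paper's config.
import torch
import torch.nn as nn

class ResBlock3D(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(c, c, 3, padding=1), nn.InstanceNorm3d(c), nn.ReLU(),
            nn.Conv3d(c, c, 3, padding=1), nn.InstanceNorm3d(c))
    def forward(self, x):
        return torch.relu(x + self.body(x))     # residual connection

class AttentionGate3D(nn.Module):
    """Suppresses skip-connection features irrelevant to the decoder query."""
    def __init__(self, c_skip, c_gate, c_mid):
        super().__init__()
        self.ws = nn.Conv3d(c_skip, c_mid, 1)
        self.wg = nn.Conv3d(c_gate, c_mid, 1)
        self.psi = nn.Conv3d(c_mid, 1, 1)
    def forward(self, skip, gate):              # gate assumed at skip resolution
        a = torch.sigmoid(self.psi(torch.relu(self.ws(skip) + self.wg(gate))))
        return skip * a                          # gated skip features

x = torch.randn(1, 32, 16, 64, 64)
y = AttentionGate3D(32, 32, 16)(ResBlock3D(32)(x), torch.randn(1, 32, 16, 64, 64))
```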
[115] Image Recognition with Vision and Language Embeddings of VLMs
Illia Volkov, Nikita Kisel, Klara Janouskova, Jiri Matas
Main category: cs.CV
TL;DR: This paper evaluates both language-guided and vision-only image classification using dual-encoder VLMs, showing complementary strengths and introducing a simple fusion method to improve performance.
Details
Motivation: Vision-language models have strong zero-shot classification through image-text alignment, but their purely visual inference capabilities remain under-explored and need comprehensive evaluation.
Method: Comprehensive evaluation of dual-encoder VLMs (including SigLIP 2 and RADIOv2.5) on ImageNet-1k, analyzing factors like prompt design, class diversity, k-NN neighbors, and reference set size. Introduces a learning-free fusion method based on per-class precision.
Result: Language and vision offer complementary strengths - some classes favor textual prompts while others are better handled by visual similarity. The proposed fusion method improves classification performance.
Conclusion: The study demonstrates the complementary nature of language and vision in VLMs and provides a simple yet effective fusion approach to leverage both modalities for enhanced image classification performance.
Abstract: Vision-language models (VLMs) have enabled strong zero-shot classification through image-text alignment. Yet, their purely visual inference capabilities remain under-explored. In this work, we conduct a comprehensive evaluation of both language-guided and vision-only image classification with a diverse set of dual-encoder VLMs, including both well-established and recent models such as SigLIP 2 and RADIOv2.5. The performance is compared in a standard setup on the ImageNet-1k validation set and its label-corrected variant. The key factors affecting accuracy are analysed, including prompt design, class diversity, the number of neighbours in k-NN, and reference set size. We show that language and vision offer complementary strengths, with some classes favouring textual prompts and others better handled by visual similarity. To exploit this complementarity, we introduce a simple, learning-free fusion method based on per-class precision that improves classification performance. The code is available at: https://github.com/gonikisgo/bmvc2025-vlm-image-recognition.
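One plausible reading of the per-class-precision fusion rule: estimate each classifier's precision per predicted class on a reference set, then for each test sample keep the prediction whose classifier is more precise for the class it predicts. The sketch below illustrates that reading with NumPy; it is not the authors' code:

```python
# Hedged sketch of learning-free, per-class-precision fusion of a text-prompt
# classifier and a k-NN (vision-only) classifier. One plausible reading of the
# paper's rule, stated under illustrative assumptions.
import numpy as np

def per_class_precision(preds, labels, n_cls):
    prec = np.zeros(n_cls)
    for c in range(n_cls):
        hit = preds == c
        prec[c] = (labels[hit] == c).mean() if hit.any() else 0.0
    return prec

def fuse(pred_text, pred_knn, prec_text, prec_knn):
    # per sample: keep the prediction whose classifier is more precise on that class
    take_text = prec_text[pred_text] >= prec_knn[pred_knn]
    return np.where(take_text, pred_text, pred_knn)

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, 1000)                  # reference-set ground truth
pred_text = rng.integers(0, 10, 1000)               # zero-shot text predictions
pred_knn = rng.integers(0, 10, 1000)                # k-NN visual predictions
prec_text = per_class_precision(pred_text, labels, 10)
prec_knn = per_class_precision(pred_knn, labels, 10)
fused = fuse(pred_text, pred_knn, prec_text, prec_knn)
```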
[116] Fine-Grained Customized Fashion Design with Image-into-Prompt benchmark and dataset from LMM
Hui Li, Yi You, Qiqi Chen, Bingfeng Zhang, George Q. Huang
Main category: cs.CV
TL;DR: BUG workflow with LMM automates clothing design from chat using image-to-prompt conversion, enabling fine-grained customization without professional knowledge.
Details
Motivation: Current generative AI models struggle with fine-grained customization in fashion design due to text uncertainty and lack of professional background knowledge from end-users.
Method: Proposed Better Understanding Generation (BUG) workflow with Large Multimodal Model to automatically create and customize cloth designs from chat using image-into-prompt conversion.
Result: Created FashionEdit dataset simulating real-world clothing design workflow, evaluated on generation similarity, user satisfaction, and quality metrics.
Conclusion: The framework enables creative potential beyond words and lowers barriers for clothing design/editing without human involvement, making fashion design more accessible.
Abstract: Generative AI evolves the execution of complex workflows in industry, where large multimodal models empower fashion design in the garment industry. Current generative AI models readily turn brainstorming into fancy designs, but fine-grained customization still suffers from text uncertainty when end-users lack professional background knowledge. Thus, we propose the Better Understanding Generation (BUG) workflow with an LMM to automatically create and customize clothing designs at a fine grain from chat, via image-into-prompt. Our framework unleashes users’ creative potential beyond words and also lowers the barriers to clothing design/editing without further human involvement. To prove the effectiveness of our model, we propose a new FashionEdit dataset that simulates the real-world clothing design workflow, evaluated on generation similarity, user satisfaction, and quality. The code and dataset: https://github.com/detectiveli/FashionEdit.
[117] OpenFake: An Open Dataset and Platform Toward Large-Scale Deepfake Detection
Victor Livernoche, Akshatha Arodi, Andreea Musulan, Zachary Yang, Adam Salvail, Gaétan Marceau Caron, Jean-François Godbout, Reihaneh Rabbany
Main category: cs.CV
TL;DR: A comprehensive politically-focused deepfake detection dataset with 3M real images and 963k synthetic images from modern generative models, plus a crowdsourced adversarial platform to keep detection methods robust against evolving threats.
Details
Motivation: Deepfakes intensify misinformation spread in political contexts, and existing detection datasets are limited by outdated generation methods, low realism, or single-face imagery, making them ineffective for general synthetic image detection.
Method: Created a dataset with 3M real images paired with captions, used to generate 963k high-quality synthetic images from proprietary and open-source models. Introduced a crowdsourced adversarial platform where participants generate challenging synthetic images to test detection methods.
Result: Developed a comprehensive benchmark dataset specifically for politically-focused deepfake detection using modern generative models. Human perception study shows recent proprietary models produce synthetic images increasingly indistinguishable from real ones.
Conclusion: The dataset and ongoing community-driven adversarial platform ensure deepfake detection methods remain robust and adaptive, proactively safeguarding public discourse from sophisticated misinformation threats as generative techniques continue to evolve.
Abstract: Deepfakes, synthetic media created using advanced AI techniques, have intensified the spread of misinformation, particularly in politically sensitive contexts. Existing deepfake detection datasets are often limited, relying on outdated generation methods, low realism, or single-face imagery, restricting their effectiveness for general synthetic image detection. By analyzing social media posts, we identify multiple modalities through which deepfakes propagate misinformation. Furthermore, our human perception study demonstrates that recently developed proprietary models produce synthetic images increasingly indistinguishable from real ones, complicating accurate identification by the general public. Consequently, we present a comprehensive, politically-focused dataset specifically crafted for benchmarking detection against modern generative models. This dataset contains three million real images paired with descriptive captions, which are used for generating 963k corresponding high-quality synthetic images from a mix of proprietary and open-source models. Recognizing the continual evolution of generative techniques, we introduce an innovative crowdsourced adversarial platform, where participants are incentivized to generate and submit challenging synthetic images. This ongoing community-driven initiative ensures that deepfake detection methods remain robust and adaptive, proactively safeguarding public discourse from sophisticated misinformation threats.
[118] Exploring Pre-training Across Domains for Few-Shot Surgical Skill Assessment
Dimitrios Anastasiou, Razvan Caramalau, Nazir Sirajudeen, Matthew Boal, Philip Edwards, Justin Collins, John Kelly, Ashwin Sridhar, Maxine Tran, Faiz Mumtaz, Nevil Pavithran, Nader Francis, Danail Stoyanov, Evangelos B. Mazomenos
Main category: cs.CV
TL;DR: This paper investigates self-supervised pre-training strategies for few-shot surgical skill assessment, showing that domain-relevant datasets outperform larger but less aligned ones, with procedure-specific data boosting performance.
Details
Motivation: Surgical skill assessment requires expert annotations that are time-consuming to produce. Few-shot learning offers a scalable alternative but depends on effective pre-training, which remains unexplored in surgical skill assessment.
Method: The authors formulate surgical skill assessment as a few-shot task, annotate a robotic surgery dataset with OSATS scores, evaluate various pre-training sources across three few-shot settings, and analyze domain similarity and procedure-specific data inclusion.
Result: Domain-relevant small datasets outperformed large-scale less aligned ones, achieving accuracies of 60.16% (1-shot), 66.03% (2-shot), and 73.65% (5-shot). Procedure-specific data with domain-relevant external datasets boosted performance by +1.22% accuracy and +2.28% F1-score.
Conclusion: Domain relevance is more important than dataset size for pre-training in surgical skill assessment. Incorporating procedure-specific data with domain-relevant sources significantly improves downstream performance, while using less similar large-scale sources can degrade performance.
Abstract: Automated surgical skill assessment (SSA) is a central task in surgical computer vision. Developing robust SSA models is challenging due to the scarcity of skill annotations, which are time-consuming to produce and require expert consensus. Few-shot learning (FSL) offers a scalable alternative enabling model development with minimal supervision, though its success critically depends on effective pre-training. While widely studied for several surgical downstream tasks, pre-training has remained largely unexplored in SSA. In this work, we formulate SSA as a few-shot task and investigate how self-supervised pre-training strategies affect downstream few-shot SSA performance. We annotate a publicly available robotic surgery dataset with Objective Structured Assessment of Technical Skill (OSATS) scores, and evaluate various pre-training sources across three few-shot settings. We quantify domain similarity and analyze how domain gap and the inclusion of procedure-specific data into pre-training influence transferability. Our results show that small but domain-relevant datasets can outperform large-scale, less aligned ones, achieving accuracies of 60.16%, 66.03%, and 73.65% in the 1-, 2-, and 5-shot settings, respectively. Moreover, incorporating procedure-specific data into pre-training with a domain-relevant external dataset significantly boosts downstream performance, with an average gain of +1.22% in accuracy and +2.28% in F1-score; however, applying the same strategy with less similar but large-scale sources can instead lead to performance degradation. Code and models are available at https://github.com/anastadimi/ssa-fsl.
[119] Texture-aware Intrinsic Image Decomposition with Model- and Learning-based Priors
Xiaodong Wang, Zijun He, Xin Yuan
Main category: cs.CV
TL;DR: A novel optimization framework for single-image intrinsic image decomposition that uses texture-guided regularization to handle complex scenes with severe lighting and rich textures, producing higher quality results than previous methods.
Details
Motivation: Traditional intrinsic image decomposition struggles with complex scenes featuring spatially-varying lighting and rich textures. Previous learning-based methods tend to produce texture-less and over-smooth intrinsic images, limiting their effectiveness in real-world applications.
Method: The authors design a texture-guided regularization term and formulate the decomposition into an optimization framework that separates material textures from lighting effects. The method uses texture information inferred from RGB images to guide the decomposition process.
Result: The proposed approach demonstrates superior performance compared to existing methods, producing high-quality intrinsic images for real-world images with complex lighting and texture conditions.
Conclusion: Combining texture-aware priors through a carefully designed optimization framework effectively addresses the challenges of intrinsic image decomposition in complex scenes, outperforming previous approaches.
Abstract: This paper aims to recover the intrinsic reflectance layer and shading layer given a single image. Though this intrinsic image decomposition problem has been studied for decades, it remains a significant challenge for complex scenes, i.e., spatially-varying lighting effects and rich textures. In this paper, we propose a novel method for handling severe lighting and rich textures in intrinsic image decomposition, which enables producing high-quality intrinsic images for real-world images. Specifically, we observe that previous learning-based methods tend to produce texture-less and over-smooth intrinsic images, which can nevertheless be used to infer the lighting and texture information of an RGB image. Based on this observation, we design a texture-guided regularization term and formulate the decomposition problem as an optimization framework to separate the material textures from lighting effects. We demonstrate that incorporating this novel texture-aware prior produces results superior to existing approaches.
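The optimization view can be illustrated with a toy log-domain decomposition, where log I = log R + log S and a guidance map downweights reflectance smoothing across texture edges. The guidance map and weights below are stand-ins for the paper's learned texture prior:

```python
# Hedged toy sketch of texture-guided intrinsic decomposition. In the log
# domain log I = log R + log S; we solve for log S with a smoothness prior,
# steered by a guidance map g in [0, 1] (g ~ 1 where a pixel sits on a material
# texture edge). Horizontal gradients only, for brevity.
import torch

def decompose(img, guidance, iters=200, lam=0.5):
    log_i = torch.log(img.clamp_min(1e-4))
    log_s = torch.zeros_like(log_i, requires_grad=True)
    opt = torch.optim.Adam([log_s], lr=0.05)
    for _ in range(iters):
        log_r = log_i - log_s                           # reflectance residual
        gs = (log_s[:, 1:] - log_s[:, :-1]) ** 2        # shading varies smoothly
        gr = (log_r[:, 1:] - log_r[:, :-1]) ** 2
        # reflectance gradients are penalized only where guidance says the
        # region is texture-free, so texture edges stay in R, not in S
        loss = gs.mean() + lam * ((1 - guidance[:, 1:]) * gr).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return (log_i - log_s.detach()).exp(), log_s.detach().exp()   # R, S

img = torch.rand(64, 64) * 0.9 + 0.05
R, S = decompose(img, guidance=torch.rand(64, 64))      # placeholder guidance
```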
[120] Plug-and-play Diffusion Models for Image Compressive Sensing with Data Consistency Projection
Xiaodong Wang, Ping Wang, Zhangyuan Li, Xin Yuan
Main category: cs.CV
TL;DR: This paper explores connections between Plug-and-Play methods and DDIM for solving inverse problems in single-pixel imaging, proposing a unified framework with hybrid data-consistency that improves reconstruction quality.
Details
Motivation: To bridge the gap between PnP methods and diffusion models for ill-posed inverse problems, particularly in single-pixel imaging, by understanding their differences in denoising mechanisms and sampling procedures.
Method: Decouples diffusion process into three stages (denoising, data consistency enforcement, sampling), proposes hybrid data-consistency module that linearly combines multiple PnP-style fidelity terms applied directly to denoised estimates.
Result: Experimental results on single-pixel imaging tasks demonstrate better reconstruction quality compared to existing methods.
Conclusion: The proposed unified framework successfully integrates learned priors with physical forward models, and the hybrid correction improves measurement consistency without disrupting diffusion sampling trajectory.
Abstract: We explore the connection between Plug-and-Play (PnP) methods and Denoising Diffusion Implicit Models (DDIM) for solving ill-posed inverse problems, with a focus on single-pixel imaging. We begin by identifying key distinctions between PnP and diffusion models, particularly in their denoising mechanisms and sampling procedures. By decoupling the diffusion process into three interpretable stages: denoising, data consistency enforcement, and sampling, we provide a unified framework that integrates learned priors with physical forward models in a principled manner. Building upon this insight, we propose a hybrid data-consistency module that linearly combines multiple PnP-style fidelity terms. This hybrid correction is applied directly to the denoised estimate, improving measurement consistency without disrupting the diffusion sampling trajectory. Experimental results on single-pixel imaging tasks demonstrate that our method achieves better reconstruction quality.
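The three decoupled stages can be sketched in a toy single-pixel setup with measurement model y = A x. The crude x0 estimate stands in for a learned denoiser, and the fidelity weights and step sizes are illustrative, not the paper's settings:

```python
# Toy sketch of the decoupled loop: (1) estimate x0, (2) hybrid data
# consistency combining two PnP-style corrections, (3) deterministic DDIM step.
import torch

def hybrid_consistency(x0, y, A, w=(0.5, 0.5), lr=1.0):
    # linear combination of fidelity corrections, applied to the x0 estimate only
    r = A @ x0 - y
    grad_step = x0 - lr * (A.T @ r)                   # gradient step on ||Ax - y||^2 / 2
    proj_step = x0 - torch.linalg.pinv(A) @ r         # least-squares projection
    return w[0] * grad_step + w[1] * proj_step

def ddim_step(x_t, x0_hat, ab_t, ab_next):
    eps = (x_t - ab_t.sqrt() * x0_hat) / (1 - ab_t).sqrt()        # implied noise
    return ab_next.sqrt() * x0_hat + (1 - ab_next).sqrt() * eps   # eta = 0

n, m = 64, 16                                         # pixels, measurements
A = torch.randn(m, n).sign() / m ** 0.5               # binary sensing patterns
y = A @ torch.randn(n)                                # simulated measurements
x = torch.randn(n)                                    # start from pure noise
alphas_bar = torch.linspace(1e-3, 0.999, 50)          # noisy -> clean schedule
for t in range(49):
    x0_hat = x / alphas_bar[t].sqrt()                 # crude stand-in for a learned denoiser
    x0_hat = hybrid_consistency(x0_hat, y, A)         # correction on x0, not on x_t
    x = ddim_step(x, x0_hat, alphas_bar[t], alphas_bar[t + 1])
```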
[121] Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders
Dohun Lee, Hyeonho Jeong, Jiwook Kim, Duygu Ceylan, Jong Chul Ye
Main category: cs.CV
TL;DR: Video diffusion models can be improved by aligning intermediate features with pre-trained vision encoders. Align4Gen proposes a multi-feature fusion and alignment method that enhances video generation quality.
Details
Motivation: While video diffusion models have advanced through architectural innovations, less attention has been paid to improving feature representation power. The paper addresses this gap by exploring feature alignment with pre-trained vision encoders.
Method: Proposes Align4Gen, a novel multi-feature fusion and alignment method integrated into video diffusion model training. Includes a new metric and analysis of various vision encoders for discriminability and temporal consistency.
Result: Align4Gen improves video generation quality for both unconditional and class-conditional tasks, as quantified by various metrics. The method shows enhanced performance compared to baseline approaches.
Conclusion: Feature alignment with pre-trained vision encoders is beneficial for video diffusion models. The proposed Align4Gen framework effectively leverages this alignment to achieve superior video generation results.
Abstract: Video diffusion models have advanced rapidly in recent years as a result of a series of architectural innovations (e.g., diffusion transformers) and the use of novel training objectives (e.g., flow matching). In contrast, less attention has been paid to improving the feature representation power of such models. In this work, we show that training video diffusion models can benefit from aligning the intermediate features of the video generator with feature representations of pre-trained vision encoders. We propose a new metric and conduct an in-depth analysis of various vision encoders to evaluate their discriminability and temporal consistency, thereby assessing their suitability for video feature alignment. Based on the analysis, we present Align4Gen which provides a novel multi-feature fusion and alignment method integrated into video diffusion model training. We evaluate Align4Gen both for unconditional and class-conditional video generation tasks and show that it results in improved video generation as quantified by various metrics. Full video results are available on our project page: https://align4gen.github.io/align4gen/
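The alignment idea can be sketched as an auxiliary cosine-alignment loss: project the generator's intermediate tokens and pull them toward frozen encoder features. The projection heads, dimensions, and averaging over encoders below are illustrative assumptions, not Align4Gen's exact formulation:

```python
# Hedged sketch of a multi-encoder feature-alignment loss added to the
# diffusion training objective. Shapes and heads are illustrative.
import torch
import torch.nn.functional as F

def align_loss(gen_feats, enc_feats_list, proj_heads):
    # gen_feats: (B, N, D) intermediate tokens from the video diffusion model
    loss = 0.0
    for enc_feats, head in zip(enc_feats_list, proj_heads):   # multi-feature fusion
        z = head(gen_feats)                                   # map to encoder dim
        loss = loss + (1 - F.cosine_similarity(z, enc_feats.detach(), dim=-1)).mean()
    return loss / len(enc_feats_list)

B, N, D, De = 2, 196, 512, 768
heads = [torch.nn.Linear(D, De) for _ in range(2)]            # one head per encoder
gen = torch.randn(B, N, D, requires_grad=True)                # generator features
encs = [torch.randn(B, N, De) for _ in range(2)]              # frozen encoder tokens
loss = align_loss(gen, encs, heads)
loss.backward()
```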
[122] A Fully Automatic Framework for Intracranial Pressure Grading: Integrating Keyframe Identification, ONSD Measurement and Clinical Data
Pengxu Wen, Tingting Yu, Ziwei Nie, Cheng Jiang, Zhenyu Yin, Mingyang He, Bo Liao, Xiaoping Yang
Main category: cs.CV
TL;DR: A fully automatic two-stage framework for non-invasive intracranial pressure grading using optic nerve sheath diameter measurement from ultrasound videos and clinical data fusion, achieving significantly higher accuracy than conventional methods.
Details
Motivation: Intracranial pressure elevation threatens cerebral function but current monitoring methods like lumbar puncture are invasive. Optic nerve sheath diameter shows promise as a biomarker but manual measurement suffers from inconsistency, subjectivity, and variability.
Method: Two-stage framework: 1) Fundus ultrasound video processing with frame-level anatomical segmentation, rule-based keyframe identification, and precise ONSD measurement; 2) ICP grading by fusing ONSD metrics with clinical features to predict ICP grades.
Result: Achieved validation accuracy of 0.845 ± 0.071 (five-fold cross-validation) and independent test accuracy of 0.786, significantly outperforming conventional threshold-based method (0.637 ± 0.111 validation, 0.429 test accuracy).
Conclusion: The framework reduces operator variability and integrates multi-source information, establishing a reliable non-invasive approach for clinical ICP evaluation that can improve patient management in acute neurological conditions.
Abstract: Intracranial pressure (ICP) elevation poses severe threats to cerebral function, thus necessitating monitoring for timely intervention. While lumbar puncture is the gold standard for ICP measurement, its invasiveness and associated risks drive the need for non-invasive alternatives. Optic nerve sheath diameter (ONSD) has emerged as a promising biomarker, as elevated ICP directly correlates with increased ONSD. However, current clinical practices for ONSD measurement suffer from inconsistency in manual operation, subjectivity in optimal view selection, and variability in thresholding, limiting their reliability. To address these challenges, we introduce a fully automatic two-stage framework for ICP grading, integrating keyframe identification, ONSD measurement and clinical data. Specifically, the fundus ultrasound video processing stage performs frame-level anatomical segmentation, rule-based keyframe identification guided by an international consensus statement, and precise ONSD measurement. The intracranial pressure grading stage then fuses ONSD metrics with clinical features to enable the prediction of ICP grades, thereby demonstrating an innovative blend of interpretable ultrasound analysis and multi-source data integration for objective clinical evaluation. Experimental results demonstrate that our method achieves a validation accuracy of $0.845 \pm 0.071$ (with standard deviation from five-fold cross-validation) and an independent test accuracy of 0.786, significantly outperforming conventional threshold-based method ($0.637 \pm 0.111$ validation accuracy, $0.429$ test accuracy). Through effectively reducing operator variability and integrating multi-source information, our framework establishes a reliable non-invasive approach for clinical ICP evaluation, holding promise for improving patient management in acute neurological conditions.
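The ONSD measurement step can be illustrated with a toy function that reads the sheath width 3 mm behind the globe, the depth conventionally used in consensus-based ONSD protocols. Pixel spacing and mask layout are illustrative assumptions, not the paper's pipeline:

```python
# Hedged sketch of ONSD measurement from a binary optic-nerve-sheath mask on a
# selected keyframe. Assumes the scan axis runs down image rows and a known
# pixel spacing; both are illustrative.
import numpy as np

def measure_onsd(sheath_mask, globe_bottom_row, mm_per_px, depth_mm=3.0):
    row = globe_bottom_row + int(round(depth_mm / mm_per_px))  # 3 mm behind globe
    cols = np.flatnonzero(sheath_mask[row])                    # sheath pixels on that row
    if cols.size == 0:
        return None                                            # keyframe rejected upstream
    return (cols[-1] - cols[0] + 1) * mm_per_px                # diameter in mm

mask = np.zeros((200, 200), dtype=bool)
mask[120:160, 90:112] = True                                   # toy sheath region
print(measure_onsd(mask, globe_bottom_row=100, mm_per_px=0.1)) # ~2.2 mm
```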
[123] Unsupervised Integrated-Circuit Defect Segmentation via Image-Intrinsic Normality
Botong Zhao, Qijun Shi, Shujing Lyu, Yue Lu
Main category: cs.CV
TL;DR: Unsupervised IC defect segmentation framework that extracts normal features from test images without external references, using reconstruction residuals to identify defects.
Details
Motivation: Traditional defect segmentation methods rely on external normal references, which are brittle for IC imagery due to layout variations and alignment difficulties across different products.
Method: Proposes a learnable normal-information extractor that aggregates representative normal features from test images, uses coherence loss to associate features with normal regions, and employs a decoder to reconstruct only normal content. Uses reconstruction residuals for defect segmentation and pseudo-anomaly augmentation for training stability.
Result: Experiments on datasets from three IC process stages show consistent improvements over existing approaches and strong robustness to product variability.
Conclusion: The framework effectively segments IC defects without external normal support by leveraging internal normal patterns within test images, demonstrating superior performance and robustness across different IC manufacturing stages.
Abstract: Modern Integrated-Circuit (IC) manufacturing introduces diverse, fine-grained defects that depress yield and reliability. Most industrial defect segmentation methods compare a test image against an external normal set, a strategy that is brittle for IC imagery, where layouts vary across products and accurate alignment is difficult. We observe that defects are predominantly local, while each image still contains rich, repeatable normal patterns. We therefore propose an unsupervised IC defect segmentation framework that requires no external normal support. A learnable normal-information extractor aggregates representative normal features from the test image, and a coherence loss enforces their association with normal regions. Guided by these features, a decoder reconstructs only normal content; the reconstruction residual then segments defects. Pseudo-anomaly augmentation further stabilizes training. Experiments on datasets from three IC process stages show consistent improvements over existing approaches and strong robustness to product variability.
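The segmentation-by-residual step reduces to thresholding the difference between the input and its normal-only reconstruction. A toy sketch with a placeholder reconstructor and an illustrative threshold rule:

```python
# Hedged sketch of defect segmentation from reconstruction residuals: the
# decoder is trained to reproduce only normal content, so residuals highlight
# defects. The reconstructor and threshold here are placeholders.
import numpy as np

def segment_defects(image, reconstruct_normal, k=3.0):
    recon = reconstruct_normal(image)               # normal-only reconstruction
    residual = np.abs(image - recon)
    thr = residual.mean() + k * residual.std()      # simple statistical threshold
    return residual > thr                           # binary defect mask

img = np.random.rand(128, 128).astype(np.float32)
img[60:64, 60:64] += 1.0                            # synthetic local defect
mask = segment_defects(img, lambda x: np.full_like(x, x.mean()))
```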
[124] Invisible Attributes, Visible Biases: Exploring Demographic Shortcuts in MRI-based Alzheimer’s Disease Classification
Akshit Achara, Esther Puyol Anton, Alexander Hammers, Andrew P. King
Main category: cs.CV
TL;DR: This paper investigates shortcut learning and demographic bias in deep learning-based Alzheimer’s disease diagnosis from MRI scans, demonstrating race and sex-based biases in multiple DL models.
Details
Motivation: DL algorithms for AD diagnosis can suffer from shortcut learning, where spurious features related to protected attributes like race and sex are used for prediction, leading to performance bias against underrepresented groups.
Method: The study investigates if DL algorithms can identify race/sex from brain MRI scans, examines training set imbalance effects on performance, and conducts quantitative/qualitative analysis of feature attributions using ResNet and SwinTransformer models on multiple datasets.
Result: The research demonstrates the existence of both race and sex-based shortcut learning and bias in DL-based AD classification, showing that models can identify demographic attributes and exhibit performance drops due to training set imbalances.
Conclusion: This work establishes the foundation for developing fairer DL diagnostic tools in brain MRI by identifying and quantifying demographic biases in current AD classification models.
Abstract: Magnetic resonance imaging (MRI) is the gold standard for brain imaging. Deep learning (DL) algorithms have been proposed to aid in the diagnosis of diseases such as Alzheimer’s disease (AD) from MRI scans. However, DL algorithms can suffer from shortcut learning, in which spurious features, not directly related to the output label, are used for prediction. When these features are related to protected attributes, they can lead to performance bias against underrepresented protected groups, such as those defined by race and sex. In this work, we explore the potential for shortcut learning and demographic bias in DL based AD diagnosis from MRI. We first investigate if DL algorithms can identify race or sex from 3D brain MRI scans to establish the presence or otherwise of race and sex based distributional shifts. Next, we investigate whether training set imbalance by race or sex can cause a drop in model performance, indicating shortcut learning and bias. Finally, we conduct a quantitative and qualitative analysis of feature attributions in different brain regions for both the protected attribute and AD classification tasks. Through these experiments, and using multiple datasets and DL models (ResNet and SwinTransformer), we demonstrate the existence of both race and sex based shortcut learning and bias in DL based AD classification. Our work lays the foundation for fairer DL diagnostic tools in brain MRI. The code is provided at https://github.com/acharaakshit/ShortMR
[125] Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift
Umaima Rahman, Raza Imam, Mohammad Yaqub, Dwarikanath Mahapatra
Main category: cs.CV
TL;DR: DRiFt is a feature decoupling framework that separates clinically relevant signals from task-agnostic noise in medical VLMs, improving robustness to distribution shifts and enhancing clinical reliability.
Details
Motivation: Medical vision-language models often learn task-agnostic correlations due to variability in imaging protocols and free-text reports, limiting their generalizability and increasing failure risk in real-world clinical settings.
Method: Structured feature decoupling framework using parameter-efficient tuning (LoRA) and learnable prompt tokens to separate clinically relevant signals from noise, plus curated high-quality image-text pairs for better cross-modal alignment.
Result: Improves in-distribution performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over prior prompt-based methods, while maintaining strong robustness across unseen datasets.
Conclusion: Disentangling task-relevant features and careful alignment significantly enhance model generalization and reduce unpredictable behavior under domain shift, contributing to safer, more trustworthy VLMs for clinical use.
Abstract: Medical vision-language models (VLMs) offer promise for clinical decision support, yet their reliability under distribution shifts remains a major concern for safe deployment. These models often learn task-agnostic correlations due to variability in imaging protocols and free-text reports, limiting their generalizability and increasing the risk of failure in real-world settings. We propose DRiFt, a structured feature decoupling framework that explicitly separates clinically relevant signals from task-agnostic noise using parameter-efficient tuning (LoRA) and learnable prompt tokens. To enhance cross-modal alignment and reduce uncertainty, we curate high-quality, clinically grounded image-text pairs by generating captions for a diverse medical dataset. Our approach improves in-distribution performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over prior prompt-based methods, while maintaining strong robustness across unseen datasets. Ablation studies reveal that disentangling task-relevant features and careful alignment significantly enhance model generalization and reduce unpredictable behavior under domain shift. These insights contribute toward building safer, more trustworthy VLMs for clinical use. The code is available at https://github.com/rumaima/DRiFt.
[126] FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution
Yuchan Jie, Yushen Xu, Xiaosong Li, Fuqiang Zhou, Jianming Lv, Huafeng Li
Main category: cs.CV
TL;DR: FS-Diff is a joint image fusion and super-resolution method that uses semantic guidance and clarity-aware mechanisms to produce high-quality fused images with enhanced details and semantic information from multimodal sources.
Details
Motivation: Current image fusion techniques struggle with corrupted target/background structures, low resolution, and weak semantic information in real-world applications like military reconnaissance and long-range detection missions.
Method: Unifies image fusion and super-resolution as conditional generation using semantic guidance from clarity sensing mechanism. Uses bidirectional feature Mamba for global feature extraction and modified U-Net for random iterative denoising at multiple noise levels.
Result: Outperforms state-of-the-art methods at multiple magnifications, recovers richer details and semantics in fused images across six public datasets and their new AVMS benchmark with 600 image pairs.
Conclusion: FS-Diff effectively addresses the limitations of current fusion techniques by combining semantic guidance with clarity-aware processing, producing superior high-resolution fusion results with enhanced cross-modal features and semantic information.
Abstract: As an influential information fusion and low-level vision technique, image fusion integrates complementary information from source images to yield an informative fused image. A few attempts have been made in recent years to jointly realize image fusion and super-resolution. However, in real-world applications such as military reconnaissance and long-range detection missions, the target and background structures in multimodal images are easily corrupted, with low resolution and weak semantic information, which leads to suboptimal results in current fusion techniques. In response, we propose FS-Diff, a semantic guidance and clarity-aware joint image fusion and super-resolution method. FS-Diff unifies image fusion and super-resolution as a conditional generation problem. It leverages semantic guidance from the proposed clarity sensing mechanism for adaptive low-resolution perception and cross-modal feature extraction. Specifically, we initialize the desired fused result as pure Gaussian noise and introduce the bidirectional feature Mamba to extract the global features of the multimodal images. Moreover, utilizing the source images and semantics as conditions, we implement a random iterative denoising process via a modified U-Net network. This network is trained for denoising at multiple noise levels to produce high-resolution fusion results with cross-modal features and abundant semantic information. We also construct a powerful aerial view multiscene (AVMS) benchmark covering 600 pairs of images. Extensive joint image fusion and super-resolution experiments on six public datasets and our AVMS dataset demonstrated that FS-Diff outperforms the state-of-the-art methods at multiple magnifications and can recover richer details and semantics in the fused images. The code is available at https://github.com/XylonXu01/FS-Diff.
[127] Mechanistic Learning with Guided Diffusion Models to Predict Spatio-Temporal Brain Tumor Growth
Daria Laslo, Efthymios Georgiou, Marius George Linguraru, Andreas Rauschecker, Sabine Muller, Catherine R. Jutzeler, Sarah Bruningk
Main category: cs.CV
TL;DR: Hybrid framework combining mathematical tumor growth model with guided diffusion model to predict brain tumor progression and generate future MRIs from previous scans.
Details
Motivation: Predicting spatio-temporal progression of brain tumors is crucial for clinical decision-making in neuro-oncology, especially in data-limited scenarios.
Method: Combines mechanistic ODE-based tumor growth model (capturing radiotherapy effects) with gradient-guided denoising diffusion implicit model (DDIM) for image synthesis conditioned on predicted tumor burden.
Result: Generates realistic follow-up scans based on spatial similarity metrics and produces tumor growth probability maps showing clinically relevant extent and directionality (95th percentile Hausdorff Distance).
Conclusion: Enables biologically informed image generation that accounts for mechanistic priors, offering generative-space-time predictions for brain tumor progression.
Abstract: Predicting the spatio-temporal progression of brain tumors is essential for guiding clinical decisions in neuro-oncology. We propose a hybrid mechanistic learning framework that combines a mathematical tumor growth model with a guided denoising diffusion implicit model (DDIM) to synthesize anatomically feasible future MRIs from preceding scans. The mechanistic model, formulated as a system of ordinary differential equations, captures temporal tumor dynamics including radiotherapy effects and estimates future tumor burden. These estimates condition a gradient-guided DDIM, enabling image synthesis that aligns with both predicted growth and patient anatomy. We train our model on the BraTS adult and pediatric glioma datasets and evaluate on 60 axial slices of in-house longitudinal pediatric diffuse midline glioma (DMG) cases. Our framework generates realistic follow-up scans based on spatial similarity metrics. It also introduces tumor growth probability maps, which capture both clinically relevant extent and directionality of tumor growth as shown by 95th percentile Hausdorff Distance. The method enables biologically informed image generation in data-limited scenarios, offering generative-space-time predictions that account for mechanistic priors.
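The mechanistic half of the pipeline can be illustrated with a generic logistic-growth ODE plus a linear-quadratic radiotherapy kill term. The parameters and equations below are illustrative stand-ins; the paper's ODE system is not reproduced here:

```python
# Hedged sketch of a mechanistic tumor-burden model of the kind such a
# framework conditions on: logistic growth with instantaneous radiotherapy
# kill via a linear-quadratic survival fraction. All parameters illustrative.
import numpy as np

def simulate_burden(v0, days, growth=0.05, capacity=100.0,
                    rt_days=(), dose=2.0, alpha=0.09, beta=0.01):
    v = v0
    trajectory = [v0]
    for day in range(1, days + 1):
        v += growth * v * (1 - v / capacity)               # logistic growth step
        if day in rt_days:                                 # fractionated RT days
            v *= np.exp(-(alpha * dose + beta * dose**2))  # LQ survival fraction
        trajectory.append(v)
    return np.array(trajectory)

burden = simulate_burden(5.0, days=60, rt_days=set(range(10, 38)))
# burden[-1] would then condition the gradient-guided DDIM sampler.
```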
[128] Semantic Concentration for Self-Supervised Dense Representations Learning
Peisong Wen, Qianqian Xu, Siran Dai, Runmin Cong, Qingming Huang
Main category: cs.CV
TL;DR: This paper addresses over-dispersion in dense self-supervised learning by proposing explicit semantic concentration through patch correspondence distillation with noise-tolerant ranking loss and object-aware filtering.
Details
Motivation: Image-level SSL avoids over-dispersion through implicit semantic concentration, but dense SSL faces challenges due to spatial sensitivity and complex scene data, requiring explicit solutions.
Method: Proposes two main techniques: 1) patch correspondence distillation with noise-tolerant ranking loss based on extended AP loss, and 2) object-aware filter using learnable prototypes via cross-attention to map output to object-based space.
Result: Empirical studies across various tasks demonstrate the effectiveness of the proposed method in addressing over-dispersion and improving dense representation learning.
Conclusion: The explicit semantic concentration approach through correspondence distillation and object-aware filtering successfully overcomes the limitations of dense SSL, providing robust patch representations for downstream tasks.
Abstract: Recent advances in image-level self-supervised learning (SSL) have made significant progress, yet learning dense representations for patches remains challenging. Mainstream methods encounter an over-dispersion phenomenon that patches from the same instance/category scatter, harming downstream performance on dense tasks. This work reveals that image-level SSL avoids over-dispersion by involving implicit semantic concentration. Specifically, the non-strict spatial alignment ensures intra-instance consistency, while shared patterns, i.e., similar parts of within-class instances in the input space, ensure inter-image consistency. Unfortunately, these approaches are infeasible for dense SSL due to their spatial sensitivity and complicated scene-centric data. These observations motivate us to explore explicit semantic concentration for dense SSL. First, to break the strict spatial alignment, we propose to distill the patch correspondences. Facing noisy and imbalanced pseudo labels, we propose a noise-tolerant ranking loss. The core idea is extending the Average Precision (AP) loss to continuous targets, such that its decision-agnostic and adaptive focusing properties prevent the student model from being misled. Second, to discriminate the shared patterns from complicated scenes, we propose the object-aware filter to map the output space to an object-based space. Specifically, patches are represented by learnable prototypes of objects via cross-attention. Last but not least, empirical studies across various tasks soundly support the effectiveness of our method. Code is available in https://github.com/KID-7391/CoTAP.
[129] FlexiD-Fuse: Flexible number of inputs multi-modal medical image fusion based on diffusion model
Yushen Xu, Xiaosong Li, Yuchun Wang, Xiaoqi Cheng, Huafeng Li, Haishu Tan
Main category: cs.CV
TL;DR: FlexiD-Fuse is a diffusion-based medical image fusion network that handles flexible input quantities (2+ modalities) using hierarchical Bayesian modeling and EM algorithm integration, achieving state-of-the-art performance across various fusion tasks.
Details
Motivation: Existing medical image fusion methods only work with fixed numbers of input modalities (e.g., only 2 or 3 modalities), which limits their clinical applicability where varying numbers of medical images need to be fused.
Method: Transforms diffusion fusion into maximum likelihood estimation using hierarchical Bayesian modeling, incorporates Expectation-Maximization algorithm into diffusion sampling to handle flexible input quantities end-to-end.
Result: Achieves best performance on Harvard datasets with 9 evaluation metrics for medical image fusion, and demonstrates effectiveness on infrared-visible, multi-exposure, and multi-focus fusion tasks with arbitrary input numbers.
Conclusion: FlexiD-Fuse successfully addresses the limitation of fixed-input fusion methods and provides a flexible solution that works with varying numbers of medical image modalities while maintaining high fusion quality.
Abstract: Different modalities of medical images provide unique physiological and anatomical information for diseases. Multi-modal medical image fusion integrates useful information from different complementary medical images with different modalities, producing a fused image that comprehensively and objectively reflects lesion characteristics to assist doctors in clinical diagnosis. However, existing fusion methods can only handle a fixed number of modality inputs, such as accepting only two-modal or tri-modal inputs, and cannot directly process varying input quantities, which hinders their application in clinical settings. To tackle this issue, we introduce FlexiD-Fuse, a diffusion-based image fusion network designed to accommodate flexible quantities of input modalities. It can process two-modal and tri-modal medical image fusion end-to-end with the same weights. FlexiD-Fuse transforms the diffusion fusion problem, which supports only fixed-condition inputs, into a maximum likelihood estimation problem based on the diffusion process and hierarchical Bayesian modeling. By incorporating the Expectation-Maximization algorithm into the diffusion sampling iteration process, FlexiD-Fuse can generate high-quality fused images with cross-modal information from source images, independently of the number of input images. We compared the latest two and tri-modal medical image fusion methods, tested them on Harvard datasets, and evaluated them using nine popular metrics. The experimental results show that our method achieves the best performance in medical image fusion with varying inputs. Meanwhile, we conducted extensive extension experiments on infrared-visible, multi-exposure, and multi-focus image fusion tasks with arbitrary numbers of inputs, and compared them with the respective SOTA methods. The results of the extension experiments consistently demonstrate the effectiveness and superiority of our method.
[130] Improving Human Motion Plausibility with Body Momentum
Ha Linh Nguyen, Tze Ho Elden Tse, Angela Yao
Main category: cs.CV
TL;DR: Proposes using whole-body linear and angular momentum as a constraint to link local joint motion with global movement, addressing the physical coupling between these components that existing motion models often fail to capture.
Details
Motivation: Existing motion models treat local joint motion and global root movement separately, but these components are physically coupled through environmental interactions. Current approaches either fail to capture this coupling accurately or require computationally expensive physics simulations.
Method: Introduces a new loss term that enforces consistency between generated momentum profiles and ground-truth data, using whole-body linear and angular momentum as a physically grounded constraint to relate local joint behavior to global displacement.
Result: The proposed momentum-based loss reduces foot sliding and jitter, improves balance, and preserves motion accuracy while providing a computationally efficient alternative to complex physics simulations.
Conclusion: Momentum constraints provide an effective and physically meaningful way to couple local and global motion components, leading to more realistic and stable motion generation without the computational complexity of full physics simulations.
Abstract: Many studies decompose human motion into local motion in a frame attached to the root joint and global motion of the root joint in the world frame, treating them separately. However, these two components are not independent. Global movement arises from interactions with the environment, which are, in turn, driven by changes in the body configuration. Motion models often fail to precisely capture this physical coupling between local and global dynamics, while deriving global trajectories from joint torques and external forces is computationally expensive and complex. To address these challenges, we propose using whole-body linear and angular momentum as a constraint to link local motion with global movement. Since momentum reflects the aggregate effect of joint-level dynamics on the body’s movement through space, it provides a physically grounded way to relate local joint behavior to global displacement. Building on this insight, we introduce a new loss term that enforces consistency between the generated momentum profiles and those observed in ground-truth data. Incorporating our loss reduces foot sliding and jitter, improves balance, and preserves the accuracy of the recovered motion. Code and data are available at the project page https://hlinhn.github.io/momentum_bmvc.
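A momentum-consistency loss of this kind can be sketched directly from joint trajectories: estimate whole-body linear and angular momentum with per-joint mass surrogates, then penalize deviation from the ground-truth profiles. The mass values and estimator below are illustrative, not the paper's:

```python
# Hedged sketch of a momentum-consistency loss over joint trajectories.
import torch

def body_momentum(joints, masses, dt=1.0 / 30):
    # joints: (T, J, 3) positions; masses: (J,) per-joint mass surrogates
    vel = (joints[1:] - joints[:-1]) / dt                     # (T-1, J, 3)
    m = masses.view(1, -1, 1)
    com = (m * joints).sum(1, keepdim=True) / masses.sum()    # center of mass, (T, 1, 3)
    lin = (m * vel).sum(1)                                    # linear momentum, (T-1, 3)
    r = joints[:-1] - com[:-1]                                # positions relative to CoM
    ang = (m * torch.cross(r, vel, dim=-1)).sum(1)            # angular momentum, (T-1, 3)
    return lin, ang

def momentum_loss(pred_joints, gt_joints, masses):
    lp, ap = body_momentum(pred_joints, masses)
    lg, ag = body_momentum(gt_joints, masses)
    return ((lp - lg) ** 2).mean() + ((ap - ag) ** 2).mean()

T, J = 60, 22
masses = torch.ones(J) * 3.0                                  # crude per-joint masses
loss = momentum_loss(torch.randn(T, J, 3, requires_grad=True),
                     torch.randn(T, J, 3), masses)
loss.backward()
```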
[131] Region-Wise Correspondence Prediction between Manga Line Art Images
Yingxuan Li, Jiafeng Mao, Qianru Qiu, Yusuke Matsui
Main category: cs.CV
TL;DR: A Transformer-based framework for predicting region-wise correspondence between manga line art images without pre-existing labels, achieving high patch-level accuracy and consistent region matching.
Details
Motivation: Region-wise correspondence between manga line art images is fundamental for applications like automatic colorization and in-between frame generation, but remains unexplored in realistic scenarios without pre-existing segmentation or annotations.
Method: Divide line art images into patches, use Transformer-based framework to learn patch-level similarities, apply edge-aware clustering and region matching algorithm to convert patch-level predictions into coherent region-level correspondences.
Result: Achieves high patch-level accuracy (96.34%) and generates consistent region-level correspondences on multiple datasets.
Conclusion: The method demonstrates strong performance and potential for real-world manga applications, providing a practical solution for region-wise correspondence prediction without pre-existing labels.
Abstract: Understanding region-wise correspondence between manga line art images is a fundamental task in manga processing, enabling downstream applications such as automatic line art colorization and in-between frame generation. However, this task remains largely unexplored, especially in realistic scenarios without pre-existing segmentation or annotations. In this paper, we introduce a novel and practical task: predicting region-wise correspondence between raw manga line art images without any pre-existing labels or masks. To tackle this problem, we divide each line art image into a set of patches and propose a Transformer-based framework that learns patch-level similarities within and across images. We then apply edge-aware clustering and a region matching algorithm to convert patch-level predictions into coherent region-level correspondences. To support training and evaluation, we develop an automatic annotation pipeline and manually refine a subset of the data to construct benchmark datasets. Experiments on multiple datasets demonstrate that our method achieves high patch-level accuracy (e.g., 96.34%) and generates consistent region-level correspondences, highlighting its potential for real-world manga applications.
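The patch-matching core can be sketched as cosine similarity between patch embeddings followed by a mutual-nearest-neighbor test; the embeddings here are random stand-ins for the paper's Transformer features, and the region-level clustering happens downstream:

```python
# Hedged sketch of cross-image patch matching via mutual nearest neighbors.
import torch
import torch.nn.functional as F

def mutual_nn(feat_a, feat_b):
    # feat_a: (Na, D), feat_b: (Nb, D) patch embeddings
    sim = F.normalize(feat_a, dim=-1) @ F.normalize(feat_b, dim=-1).T
    ab = sim.argmax(dim=1)               # best b-patch for each a-patch
    ba = sim.argmax(dim=0)               # best a-patch for each b-patch
    idx_a = torch.arange(feat_a.size(0))
    keep = ba[ab] == idx_a               # mutual agreement test
    return idx_a[keep], ab[keep]         # matched patch index pairs

fa, fb = torch.randn(196, 128), torch.randn(196, 128)  # stand-in patch features
ia, ib = mutual_nn(fa, fb)
```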
[132] Generative Diffusion Contrastive Network for Multi-View Clustering
Jian Zhu, Xin Zou, Xi Wang, Ning Zhang, Bian Wu, Yao Yang, Ying Zhou, Lingfang Zeng, Chang Tang, Cheng Luo
Main category: cs.CV
TL;DR: Proposes the SGDF method and GDCN network for multi-view clustering, handling noisy and missing data via generative diffusion fusion.
Details
Motivation: Address low-quality data issues in multi-view fusion caused by noisy data contamination and missing data in certain views.
Method: Stochastic Generative Diffusion Fusion (SGDF) with a multiple generative mechanism for multi-view features, and the Generative Diffusion Contrastive Network (GDCN) built on SGDF.
Result: Extensive experiments show GDCN achieves state-of-the-art results in deep multi-view clustering tasks
Conclusion: SGDF and GDCN provide robust solutions for handling low-quality data in multi-view clustering through generative diffusion fusion approach
Abstract: In recent years, Multi-View Clustering (MVC) has been significantly advanced under the influence of deep learning. By integrating heterogeneous data from multiple views, MVC enhances clustering analysis, making multi-view fusion critical to clustering performance. However, there is a problem of low-quality data in multi-view fusion. This problem primarily arises for two reasons: 1) Certain views are contaminated by noisy data. 2) Some views suffer from missing data. This paper proposes a novel Stochastic Generative Diffusion Fusion (SGDF) method to address this problem. SGDF leverages a multiple generative mechanism for the multi-view feature of each sample. It is robust to low-quality data. Building on SGDF, we further present the Generative Diffusion Contrastive Network (GDCN). Extensive experiments show that GDCN achieves state-of-the-art results in deep MVC tasks. The source code is publicly available at https://github.com/HackerHyper/GDCN.
[133] DualTrack: Sensorless 3D Ultrasound needs Local and Global Context
Paul F. R. Wilson, Matteo Ronchetti, Rüdiger Göbl, Viktoria Markova, Sebastian Rosenzweig, Raphael Prevost, Parvin Mousavi, Oliver Zettinig
Main category: cs.CV
TL;DR: DualTrack is a novel dual-encoder architecture for sensorless 3D ultrasound that decouples local and global feature extraction to achieve state-of-the-art 3D reconstruction accuracy below 5mm error.
Details
Motivation: Traditional 3D ultrasound systems are costly and complex, while existing sensorless approaches either ignore or tightly couple global and local features, limiting robustness in modeling complementary aspects of ultrasound imaging.
Method: Proposes DualTrack with separate local and global encoders: the local encoder uses dense spatiotemporal convolutions for fine-grained features, while the global encoder uses an image backbone with temporal attention for anatomical features. Features are combined via a lightweight fusion module for trajectory estimation.
Result: Achieves state-of-the-art accuracy on large public benchmark with average reconstruction error below 5mm, outperforming previous methods and producing globally consistent 3D reconstructions.
Conclusion: Decoupling local and global feature extraction through specialized encoders enables more robust and accurate sensorless 3D ultrasound reconstruction, making 3D ultrasound more accessible by eliminating need for expensive traditional systems.
Abstract: Three-dimensional ultrasound (US) offers many clinical advantages over conventional 2D imaging, yet its widespread adoption is limited by the cost and complexity of traditional 3D systems. Sensorless 3D US, which uses deep learning to estimate a 3D probe trajectory from a sequence of 2D US images, is a promising alternative. Local features, such as speckle patterns, can help predict frame-to-frame motion, while global features, such as coarse shapes and anatomical structures, can situate the scan relative to anatomy and help predict its general shape. In prior approaches, global features are either ignored or tightly coupled with local feature extraction, restricting the ability to robustly model these two complementary aspects. We propose DualTrack, a novel dual-encoder architecture that leverages decoupled local and global encoders specialized for their respective scales of feature extraction. The local encoder uses dense spatiotemporal convolutions to capture fine-grained features, while the global encoder utilizes an image backbone (e.g., a 2D CNN or foundation model) and temporal attention layers to embed high-level anatomical features and long-range dependencies. A lightweight fusion module then combines these features to estimate the trajectory. Experimental results on a large public benchmark show that DualTrack achieves state-of-the-art accuracy and globally consistent 3D reconstructions, outperforming previous methods and yielding an average reconstruction error below 5 mm.
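The dual-encoder pattern can be sketched as below: a dense 3D-conv local branch, a per-frame backbone with temporal attention as the global branch, and a light MLP fusing both into per-frame 6-DoF motion. All backbones and sizes are illustrative stand-ins, not DualTrack's architecture:

```python
# Hedged sketch of a dual-encoder trajectory regressor for 2D US sequences.
import torch
import torch.nn as nn

class DualTrackSketch(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.local = nn.Conv3d(1, d, kernel_size=3, padding=1)       # dense (T,H,W) convs
        self.global_backbone = nn.Conv2d(1, d, kernel_size=8, stride=8)  # per-frame stand-in
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
        self.fuse = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 6))

    def forward(self, frames):                   # frames: (B, T, H, W)
        B, T, H, W = frames.shape
        loc = self.local(frames.unsqueeze(1)).mean(dim=(3, 4)).transpose(1, 2)   # (B, T, d)
        glo = self.global_backbone(frames.reshape(B * T, 1, H, W)).mean((2, 3))
        glo = self.temporal(glo.view(B, T, -1))                                  # (B, T, d)
        return self.fuse(torch.cat([loc, glo], dim=-1))    # per-frame 6-DoF motion

motion = DualTrackSketch()(torch.randn(2, 16, 64, 64))     # (2, 16, 6)
```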
[134] InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation
Sirui Xu, Dongting Li, Yucheng Zhang, Xiyan Xu, Qi Long, Ziyin Wang, Yunzhi Lu, Shuchang Dong, Hezi Jiang, Akshat Gupta, Yu-Xiong Wang, Liang-Yan Gui
Main category: cs.CV
TL;DR: InterAct is a large-scale 3D human-object interaction benchmark that consolidates 21.81 hours of HOI data, enhances quality through optimization, expands to 30.70 hours, and provides state-of-the-art performance on six benchmarking tasks.
Details
Motivation: Existing human motion capture datasets lack extensive high-quality motion and annotations for 3D human-object interactions, suffering from artifacts like contact penetration, floating, and incorrect hand motions.
Method: Consolidated and standardized 21.81 hours of HOI data from diverse sources with detailed textual annotations. Proposed a unified optimization framework using contact invariance principles to reduce artifacts, correct hand motions, and expand dataset to 30.70 hours while maintaining human-object relationships.
Result: Created a large-scale 3D HOI benchmark with enhanced data quality, achieving state-of-the-art performance on six defined benchmarking tasks through a unified HOI generative modeling perspective.
Conclusion: InterAct serves as a foundational resource for advancing 3D human-object interaction generation and is publicly available to support continued research in this area.
Abstract: While large-scale human motion capture datasets have advanced human motion generation, modeling and generating dynamic 3D human-object interactions (HOIs) remain challenging due to dataset limitations. Existing datasets often lack extensive, high-quality motion and annotation and exhibit artifacts such as contact penetration, floating, and incorrect hand motions. To address these issues, we introduce InterAct, a large-scale 3D HOI benchmark featuring dataset and methodological advancements. First, we consolidate and standardize 21.81 hours of HOI data from diverse sources, enriching it with detailed textual annotations. Second, we propose a unified optimization framework to enhance data quality by reducing artifacts and correcting hand motions. Leveraging the principle of contact invariance, we maintain human-object relationships while introducing motion variations, expanding the dataset to 30.70 hours. Third, we define six benchmarking tasks and develop a unified HOI generative modeling perspective, achieving state-of-the-art performance. Extensive experiments validate the utility of our dataset as a foundational resource for advancing 3D human-object interaction generation. To support continued research in this area, the dataset is publicly available at https://github.com/wzyabcas/InterAct, and will be actively maintained.
[135] PeftCD: Leveraging Vision Foundation Models with Parameter-Efficient Fine-Tuning for Remote Sensing Change Detection
Sijun Dong, Yuxuan Hu, LiBo Wang, Geng Chen, Xiaoliang Meng
Main category: cs.CV
TL;DR: PeftCD is a parameter-efficient change detection framework using Vision Foundation Models with LoRA and Adapter modules, achieving state-of-the-art performance across multiple remote sensing datasets.
Details
Motivation: To address challenges in remote sensing change detection including pseudo changes, limited labeled samples, and cross-domain generalization difficulties.
Method: Uses weight-sharing Siamese encoder from VFMs (SAM2 and DINOv3) with integrated LoRA and Adapter modules for efficient fine-tuning, plus lightweight decoder.
Result: Achieves SOTA performance on multiple datasets: SYSU-CD (73.81% IoU), WHUCD (92.05%), MSRSCD (64.07%), MLCD (76.89%), CDD (97.01%), S2Looking (52.25%), LEVIR-CD (85.62%) with precise boundaries and pseudo-change suppression.
Conclusion: PeftCD provides optimal balance of accuracy, efficiency, and generalization, offering scalable paradigm for adapting large VFMs to real-world remote sensing applications.
Abstract: To tackle the prevalence of pseudo changes, the scarcity of labeled samples, and the difficulty of cross-domain generalization in multi-temporal and multi-source remote sensing imagery, we propose PeftCD, a change detection framework built upon Vision Foundation Models (VFMs) with Parameter-Efficient Fine-Tuning (PEFT). At its core, PeftCD employs a weight-sharing Siamese encoder derived from a VFM, into which LoRA and Adapter modules are seamlessly integrated. This design enables highly efficient task adaptation by training only a minimal set of additional parameters. To fully unlock the potential of VFMs, we investigate two leading backbones: the Segment Anything Model v2 (SAM2), renowned for its strong segmentation priors, and DINOv3, a state-of-the-art self-supervised representation learner. The framework is complemented by a deliberately lightweight decoder, ensuring the focus remains on the powerful feature representations from the backbones. Extensive experiments demonstrate that PeftCD achieves state-of-the-art performance across multiple public datasets, including SYSU-CD (IoU 73.81%), WHUCD (92.05%), MSRSCD (64.07%), MLCD (76.89%), CDD (97.01%), S2Looking (52.25%) and LEVIR-CD (85.62%), with notably precise boundary delineation and strong suppression of pseudo-changes. In summary, PeftCD presents an optimal balance of accuracy, efficiency, and generalization. It offers a powerful and scalable paradigm for adapting large-scale VFMs to real-world remote sensing change detection applications. The code and pretrained models will be released at https://github.com/dyzy41/PeftCD.
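To make the PEFT mechanism concrete, here is a minimal PyTorch sketch of a LoRA-augmented linear layer of the kind PeftCD injects into its frozen Siamese encoder; the rank, scaling, and layer sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (the LoRA idea)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        # Freeze the pre-trained weights; only the low-rank factors train.
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Full-rank frozen path + scaled rank-r correction.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Wrap one (hypothetical) backbone projection and count trainable parameters.
wrapped = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in wrapped.parameters() if p.requires_grad)
print(wrapped(torch.randn(2, 197, 768)).shape, trainable)  # ..., 12288 params
```

Because `lora_b` starts at zero, the wrapped layer initially behaves exactly like the frozen backbone, which is what makes this kind of adaptation stable.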
[136] ReceiptSense: Beyond Traditional OCR – A Dataset for Receipt Understanding
Abdelrahman Abdallah, Mohamed Mounis, Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Ibrahim Abdelhalim, Mohamed Elkasaby, Yasser ElBendary, Adam Jatowt
Main category: cs.CV
TL;DR: A comprehensive Arabic-English receipt dataset with 20K annotated receipts, 30K OCR images, 10K item annotations, and QA pairs for LLM evaluation, supporting multilingual OCR and information extraction research.
Details
Motivation: Multilingual OCR and information extraction from receipts remains challenging, particularly for complex scripts like Arabic, requiring better datasets and evaluation methods.
Method: Created a large-scale dataset with diverse annotations including receipt-level, OCR-level, item-level data and QA pairs. Established baselines using Tesseract OCR and advanced neural networks.
Result: The dataset effectively captures merchant names, item descriptions, prices, receipt numbers, and dates, demonstrating effectiveness for processing complex, noisy real-world receipt layouts.
Conclusion: The publicly accessible dataset advances automated multilingual document processing research and supports object detection, OCR, and information extraction tasks for Arabic-English receipts.
Abstract: Multilingual OCR and information extraction from receipts remains challenging, particularly for complex scripts like Arabic. We introduce ReceiptSense, a comprehensive dataset designed for Arabic-English receipt understanding, comprising 20,000 annotated receipts from diverse retail settings, 30,000 OCR-annotated images, 10,000 item-level annotations, and a new Receipt QA subset of 1,265 receipt images, each paired with 40 question-answer pairs, to support LLM evaluation for receipt understanding. The dataset captures merchant names, item descriptions, prices, receipt numbers, and dates to support object detection, OCR, and information extraction tasks. We establish baseline performance using traditional methods (Tesseract OCR) and advanced neural networks, demonstrating the dataset’s effectiveness for processing complex, noisy real-world receipt layouts. Our publicly accessible dataset advances automated multilingual document processing research (see https://github.com/Update-For-Integrated-Business-AI/CORU).
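For orientation, a bilingual Tesseract baseline of the kind the paper benchmarks can be run in a few lines with pytesseract; the filename is a placeholder and the paper's exact preprocessing is not reproduced here.

```python
from PIL import Image
import pytesseract  # requires the Tesseract binary and the `ara`/`eng` language packs

def ocr_receipt(path: str) -> str:
    """Run bilingual Arabic+English OCR on one receipt image."""
    image = Image.open(path)
    # Tesseract accepts '+'-joined language codes for mixed-script documents.
    return pytesseract.image_to_string(image, lang="ara+eng")

print(ocr_receipt("receipt_0001.png"))  # placeholder filename
```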
[137] Visual Grounding from Event Cameras
Lingdong Kong, Dongyue Lu, Ao Liang, Rong Li, Yuhao Dong, Tianshuai Hu, Lai Xing Ng, Wei Tsang Ooi, Benoit R. Cottereau
Main category: cs.CV
TL;DR: Talk2Event is the first large-scale benchmark for language-driven object grounding using event camera data, featuring 5,567 driving scenes with 13,458 annotated objects and 30,000+ validated referring expressions enriched with structured attributes.
Details
Motivation: Event cameras offer microsecond precision and reliability under challenging conditions, but their integration with natural language understanding remains unexplored, creating a gap in multimodal perception for dynamic scenes.
Method: Built a benchmark on real-world driving scenarios with annotated objects and referring expressions. Each expression includes four structured attributes: appearance, status, relation to viewer, and relation to surrounding objects to capture spatial, temporal and relational cues.
Result: Created a comprehensive dataset of 5,567 scenes, 13,458 annotated objects, and over 30,000 validated referring expressions with attribute-centric design for interpretable and compositional grounding.
Conclusion: Talk2Event serves as a foundation for advancing multimodal and temporally-aware perception, with applications in robotics and human-AI interaction, enabling contextual reasoning beyond simple object recognition in dynamic environments.
Abstract: Event cameras capture changes in brightness with microsecond precision and remain reliable under motion blur and challenging illumination, offering clear advantages for modeling highly dynamic scenes. Yet, their integration with natural language understanding has received little attention, leaving a gap in multimodal perception. To address this, we introduce Talk2Event, the first large-scale benchmark for language-driven object grounding using event data. Built on real-world driving scenarios, Talk2Event comprises 5,567 scenes, 13,458 annotated objects, and more than 30,000 carefully validated referring expressions. Each expression is enriched with four structured attributes – appearance, status, relation to the viewer, and relation to surrounding objects – that explicitly capture spatial, temporal, and relational cues. This attribute-centric design supports interpretable and compositional grounding, enabling analysis that moves beyond simple object recognition to contextual reasoning in dynamic environments. We envision Talk2Event as a foundation for advancing multimodal and temporally-aware perception, with applications spanning robotics, human-AI interaction, and so on.
[138] Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis
Yikang Ding, Jiwen Liu, Wenyuan Zhang, Zekun Wang, Wentao Hu, Liyuan Cui, Mingming Lao, Yingchao Shao, Hui Liu, Xiaohan Li, Ming Chen, Xiaoqiang Liu, Yu-Shen Liu, Pengfei Wan
Main category: cs.CV
TL;DR: Kling-Avatar is a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation, enabling semantically grounded, high-fidelity audio-driven avatar video synthesis with superior performance in lip sync, emotion expressiveness, and long-duration generation.
Details
Motivation: Existing audio-driven avatar generation methods treat instruction conditioning as low-level tracking without modeling communicative purpose, compromising narrative coherence and character expressiveness. The gap between multimodal instruction understanding and realistic video generation needs to be bridged.
Method: Two-stage cascaded framework: 1) MLLM director produces blueprint video conditioned on multimodal instructions to govern high-level semantics (motion, emotions), 2) Parallel generation of sub-clips using first-last frame strategy guided by blueprint keyframes, enabling global-to-local detail preservation and fast long-duration video generation.
Result: Generates vivid, fluent long-duration videos up to 1080p and 48 fps with superior performance in lip synchronization accuracy, emotion/dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. Outperforms existing methods on 375 curated benchmark samples.
Conclusion: Kling-Avatar establishes a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis, making it suitable for real-world applications like digital human livestreaming and vlogging through its parallel architecture and comprehensive instruction understanding.
Abstract: Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.
[139] Measuring Epistemic Humility in Multimodal Large Language Models
Bingkui Tong, Jiaer Xia, Sifeng Shang, Kaiyang Zhou
Main category: cs.CV
TL;DR: HumbleBench is a new benchmark for evaluating multimodal LLMs’ ability to reject incorrect answers and demonstrate epistemic humility by choosing “None of the above” when no provided options are correct, addressing hallucination risks in safety-critical applications.
Details
Motivation: Existing MLLM benchmarks focus only on recognition accuracy but overlook the critical capability of rejecting plausible but incorrect answers, which is essential for trustworthy AI and preventing hallucinations that could lead to misinformation and unsafe errors.
Method: Built from panoptic scene graph dataset with fine-grained annotations, used GPT-4-Turbo to generate multiple-choice questions with “None of the above” option, followed by rigorous manual filtering to create questions covering object, relation, and attribute hallucination types.
Result: Evaluated various state-of-the-art MLLMs (both general-purpose and specialized reasoning models) on HumbleBench, providing valuable findings about their ability to reject false options and demonstrate reliable behavior.
Conclusion: HumbleBench fills a key gap in current evaluation suites by incorporating explicit false-option rejection, offering a more realistic measure of MLLM reliability for safety-critical applications, with publicly released code and dataset.
Abstract: Hallucinations in multimodal large language models (MLLMs) – where the model generates content inconsistent with the input image – pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks an equally critical capability for trustworthy AI: recognizing when none of the provided options are correct, a behavior reflecting epistemic humility. We present HumbleBench, a new hallucination benchmark designed to evaluate MLLMs’ ability to reject plausible but incorrect answers across three hallucination types: object, relation, and attribute. Built from a panoptic scene graph dataset, we leverage fine-grained scene graph annotations to extract ground-truth entities and relations, and prompt GPT-4-Turbo to generate multiple-choice questions, followed by a rigorous manual filtering process. Each question includes a “None of the above” option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs – including both general-purpose and specialized reasoning models – on HumbleBench and share valuable findings and insights with the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites, providing a more realistic measure of MLLM reliability in safety-critical settings. Our code and dataset are released publicly and can be accessed at https://github.com/maifoundations/HumbleBench.
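The benchmark's distinguishing feature is that "None of the above" must be scored in its own right. A minimal sketch of such NOTA-aware accuracy bookkeeping, with hypothetical prediction and answer lists:

```python
from collections import Counter

NOTA = "None of the above"

def humble_score(predictions: list[str], answers: list[str]) -> dict[str, float]:
    """Accuracy split by whether the gold answer is 'None of the above'.

    An epistemically humble model must also get the NOTA questions right,
    i.e. decline every listed option when none matches the image.
    """
    tally: Counter = Counter()
    for pred, gold in zip(predictions, answers):
        key = "nota" if gold == NOTA else "regular"
        tally[key + "_total"] += 1
        tally[key + "_hit"] += int(pred == gold)
    return {
        "regular_acc": tally["regular_hit"] / max(tally["regular_total"], 1),
        "nota_acc": tally["nota_hit"] / max(tally["nota_total"], 1),
    }

# Toy run: the model answers the two regular questions but misses the NOTA one.
print(humble_score(["cat", "dog", "car"], ["cat", NOTA, "car"]))
```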
[140] Can Understanding and Generation Truly Benefit Together – or Just Coexist?
Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan
Main category: cs.CV
TL;DR: UAE framework unifies image understanding (I2T) and generation (T2I) through auto-encoder paradigm with reconstruction fidelity as training objective, using reinforcement learning to achieve bidirectional improvement.
Details
Motivation: To create a unified multimodal learning framework that bridges image understanding and generation through coherent bidirectional information flow, bringing mutual gains between both processes.
Method: Proposes UAE framework with three-stage Unified-GRPO RL approach: cold-start initialization, Generation for Understanding (encoder trained to generate captions that maximize decoder reconstruction), and Understanding for Generation (decoder refined to reconstruct from detailed captions).
Result: Encoder autonomously produces more descriptive captions while decoder demonstrates improved understanding of intricate descriptions, resulting in high-fidelity reconstructions and enhanced long-context instruction following.
Conclusion: The auto-encoder paradigm with reconstruction fidelity as unified objective successfully enables bidirectional improvement between understanding and generation, achieving surprising synergy where both components mutually enhance each other through RL progression.
Abstract: In this paper, we introduce an insightful paradigm through the Auto-Encoder lens: understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce the coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantic and complex spatial relationships. We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) A cold-start phase to gently initialize both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder’s reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of unified multimodal models (UMMs). A surprising “aha moment” arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.
[141] Geometric Neural Distance Fields for Learning Human Motion Priors
Zhengdi Yu, Simone Foti, Linguang Zhang, Amy Zhao, Cem Keskin, Stefanos Zafeiriou, Tolga Birdal
Main category: cs.CV
TL;DR: NRMF is a novel 3D generative human motion prior that uses neural distance fields on Riemannian manifolds to model pose, velocity, and acceleration dynamics, enabling robust and physically plausible motion recovery across various tasks.
Details
Motivation: Existing VAE and diffusion-based methods lack explicit modeling of human motion dynamics in geometric spaces. The authors aim to create a motion prior that respects the underlying articulation geometry and ensures temporal consistency and physical plausibility.
Method: Models human motion as zero level sets of neural distance fields on the product space of joint rotations, angular velocities, and accelerations. Introduces adaptive-step hybrid projection algorithm and geometric integrator for motion rollout.
Result: Significant and consistent gains across multiple input modalities and diverse tasks including denoising, motion in-betweening, and fitting to partial 2D/3D observations. Trained on AMASS dataset with remarkable generalization capabilities.
Conclusion: NRMF provides a rigorous geometric framework for 3D human motion generation that outperforms existing methods by explicitly modeling motion dynamics on Riemannian manifolds, enabling robust and physically plausible motion recovery.
Abstract: We introduce Neural Riemannian Motion Fields (NRMF), a novel 3D generative human motion prior that enables robust, temporally consistent, and physically plausible 3D motion recovery. Unlike existing VAE or diffusion-based methods, our higher-order motion prior explicitly models the human motion in the zero level set of a collection of neural distance fields (NDFs) corresponding to pose, transition (velocity), and acceleration dynamics. Our framework is rigorous in the sense that our NDFs are constructed on the product space of joint rotations, their angular velocities, and angular accelerations, respecting the geometry of the underlying articulations. We further introduce: (i) a novel adaptive-step hybrid algorithm for projecting onto the set of plausible motions, and (ii) a novel geometric integrator to “roll out” realistic motion trajectories during test-time-optimization and generation. Our experiments show significant and consistent gains: trained on the AMASS dataset, NRMF remarkably generalizes across multiple input modalities and to diverse tasks ranging from denoising to motion in-betweening and fitting to partial 2D / 3D observations.
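The core operation the paper builds on is projecting a sample onto the zero level set of a learned distance field. Below is a generic gradient-based projection sketch with a toy analytic field standing in for a trained NDF; the paper's adaptive-step hybrid algorithm is more elaborate.

```python
import torch

def project_to_level_set(x: torch.Tensor, ndf, steps: int = 20, lr: float = 0.5):
    """Pull samples toward {x : ndf(x) = 0} by gradient descent on ndf(x)^2.

    `ndf` is any differentiable callable returning a per-sample distance.
    """
    x = x.clone().requires_grad_(True)
    for _ in range(steps):
        loss = ndf(x).pow(2).sum()
        (grad,) = torch.autograd.grad(loss, x)
        with torch.no_grad():
            x -= lr * grad  # plain gradient step; the paper adapts the step size
    return x.detach()

# Toy analytic "distance field": signed distance to the unit sphere.
toy_ndf = lambda pts: pts.norm(dim=-1) - 1.0
x0 = 3.0 * torch.randn(4, 3)
print(project_to_level_set(x0, toy_ndf).norm(dim=-1))  # all ~1.0
```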
[142] Improving Alignment in LVLMs with Debiased Self-Judgment
Sihan Yang, Chenhang Cui, Zihao Zhao, Yiyang Zhou, Weilong Yan, Ying Wei, Huaxiu Yao
Main category: cs.CV
TL;DR: A novel self-judgment scoring method for LVLMs that reduces hallucinations and improves safety without external resources
Details
Motivation: Address challenges in visual-linguistic modality alignment that cause hallucinations and safety concerns, while overcoming limitations of external dataset dependency in existing methods.
Method: Generate debiased self-judgment scores internally without external resources, enhancing both decoding strategies and preference tuning processes.
Result: Significantly outperforms traditional methods with reduced hallucinations, enhanced safety, and improved overall capability
Conclusion: Provides a more effective and scalable solution for aligning large visual-language models through autonomous self-evaluation
Abstract: The rapid advancements in Large Language Models (LLMs) and Large Visual-Language Models (LVLMs) have opened up new opportunities for integrating visual and linguistic modalities. However, effectively aligning these modalities remains challenging, often leading to hallucinations, where generated outputs are not grounded in the visual input, and raising safety concerns across various domains. Existing alignment methods, such as instruction tuning and preference tuning, often rely on external datasets, human annotations, or complex post-processing, which limit scalability and increase costs. To address these challenges, we propose a novel approach that generates the debiased self-judgment score, a self-evaluation metric created internally by the model without relying on external resources. This enables the model to autonomously improve alignment. Our method enhances both decoding strategies and preference tuning processes, resulting in reduced hallucinations, enhanced safety, and improved overall capability. Empirical results show that our approach significantly outperforms traditional methods, offering a more effective solution for aligning LVLMs.
[143] Locality in Image Diffusion Models Emerges from Data Statistics
Artem Lukoianov, Chenyang Yuan, Justin Solomon, Vincent Sitzmann
Main category: cs.CV
TL;DR: The paper shows that locality in deep diffusion models emerges from statistical properties of image datasets rather than convolutional neural network inductive biases, and develops an improved analytical denoiser.
Details
Motivation: To understand why deep diffusion models outperform the theoretically optimal denoiser and characterize the performance gap between analytical models and trained UNet denoisers.
Method: Demonstrated that an optimal parametric linear denoiser exhibits locality similar to deep neural denoisers, analyzed pixel correlations in natural images theoretically and experimentally, and crafted an improved analytical denoiser.
Result: Found that locality emerges from pixel correlations in image datasets, not CNN inductive biases. The new analytical denoiser better matches deep diffusion model scores than prior expert-crafted alternatives.
Conclusion: The performance gap in diffusion models is driven by statistical properties of image data rather than architectural biases, enabling more accurate analytical models of deep diffusion behavior.
Abstract: Among generative models, diffusion models are uniquely intriguing due to the existence of a closed-form optimal minimizer of their training objective, often referred to as the optimal denoiser. However, diffusion using this optimal denoiser merely reproduces images in the training set and hence fails to capture the behavior of deep diffusion models. Recent work has attempted to characterize this gap between the optimal denoiser and deep diffusion models, proposing analytical, training-free models that can generate images that resemble those generated by a trained UNet. The best-performing method hypothesizes that shift equivariance and locality inductive biases of convolutional neural networks are the cause of the performance gap, hence incorporating these assumptions into its analytical model. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset, not due to the inductive bias of convolutional neural networks. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to the deep neural denoisers. We further show, both theoretically and experimentally, that this locality arises directly from the pixel correlations present in natural image datasets. Finally, we use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than the prior expert-crafted alternative.
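For context, the closed-form optimal denoiser mentioned above has a simple finite-data form: a softmax-weighted average of the training images. A sketch under a basic additive-noise assumption (notation simplified relative to the paper):

```python
import torch

def optimal_denoiser(x_t: torch.Tensor, data: torch.Tensor, sigma: float):
    """Closed-form posterior-mean denoiser over a finite training set.

    Assumes the simple forward process x_t = x_0 + sigma * noise; the optimal
    denoiser is then a softmax-weighted average of the training images, which
    is why it can only ever reproduce the training set.
    """
    # Squared distance from the noisy query to every training image.
    d2 = ((x_t.flatten(1)[:, None, :] - data.flatten(1)[None, :, :]) ** 2).sum(-1)
    w = torch.softmax(-d2 / (2 * sigma**2), dim=1)  # posterior weights over data
    return (w @ data.flatten(1)).view_as(x_t)

# Toy check with 100 random 8x8 "images" and one noisy query.
data = torch.randn(100, 1, 8, 8)
x_t = data[:1] + 0.1 * torch.randn(1, 1, 8, 8)
print(optimal_denoiser(x_t, data, sigma=0.1).shape)  # torch.Size([1, 1, 8, 8])
```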
[144] SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, Xiaoxiao Long, Hao Zhu, Zhaoxiang Zhang, Xun Cao, Yao Yao
Main category: cs.CV
TL;DR: SpatialVID is a large-scale dataset of 21,000+ hours of raw video processed into 2.7 million clips with dense 3D annotations including camera poses, depth maps, and motion instructions to address data scarcity in spatial intelligence research.
Details
Motivation: Current spatial intelligence models are constrained by limited training data - existing datasets lack scale, diversity, and annotation richness, especially for real-world dynamic scenes with ground-truth camera motion.
Method: Collected over 21,000 hours of raw video, processed through hierarchical filtering into 2.7 million clips (7,089 hours), then annotated with camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions.
Result: Created SpatialVID dataset with diverse scenes, camera movements, and comprehensive 3D annotations that analysis shows has richness and diversity to improve model generalization and performance.
Conclusion: SpatialVID establishes itself as a key asset for video and 3D vision research by providing large-scale, high-quality training data that addresses current data scarcity limitations in spatial intelligence.
Abstract: Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw video, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID’s data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.
[145] Semantic Augmentation in Images using Language
Sahiti Yerramilli, Jayant Sravan Tamarapalli, Tanmay Girish Kulkarni, Jonathan Francis, Eric Nyberg
Main category: cs.CV
TL;DR: Using diffusion models to generate photorealistic images for data augmentation to combat overfitting and improve out-of-domain generalization in deep learning models
Details
Motivation: Deep learning models are data-hungry and suffer from overfitting due to limited labeled datasets, limiting their generalization to real-world examples.
Method: Leveraging diffusion models trained on large datasets to generate photorealistic images from textual inputs for augmenting existing datasets.
Result: Various strategies for effective data augmentation are explored to enhance model performance
Conclusion: Generated images from diffusion models can effectively augment datasets and improve out-of-domain generalization capabilities of deep learning models
Abstract: Deep Learning models are incredibly data-hungry and require very large labeled datasets for supervised learning. As a consequence, these models often suffer from overfitting, limiting their ability to generalize to real-world examples. Recent advancements in diffusion models have enabled the generation of photorealistic images based on textual inputs. Leveraging the substantial datasets used to train these diffusion models, we propose a technique to utilize generated images to augment existing datasets. This paper explores various strategies for effective data augmentation to improve the out-of-domain generalization capabilities of deep learning models.
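A minimal sketch of the described augmentation using the off-the-shelf diffusers library; the model id, prompts, and file names are illustrative choices rather than the paper's setup, and a GPU is assumed.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Vary the textual context per class to push samples toward unseen domains.
for label in ["golden retriever", "tabby cat"]:
    prompt = f"a photo of a {label} in heavy rain, night street scene"
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"aug_{label.replace(' ', '_')}.png")
```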
[146] IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves
Ruofan Wang, Juncheng Li, Yixu Wang, Bo Wang, Xiaosen Wang, Yan Teng, Yingchun Wang, Xingjun Ma, Yu-Gang Jiang
Main category: cs.CV
TL;DR: IDEATOR is an automated jailbreak attack method that uses VLMs and diffusion models to generate malicious image-text pairs, achieving high attack success rates and strong transferability across multiple vision-language models.
Details
Motivation: Current jailbreak approaches for VLMs rely on limited adversarial or manually crafted images from text datasets, lacking effectiveness and diversity across different contexts.
Method: Leverages a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model to autonomously generate malicious image-text pairs.
Result: Achieved 94% attack success rate on MiniGPT-4 with average 5.34 queries, and high transferability rates of 82%, 88%, and 75% on LLaVA, InstructBLIP, and Chameleon respectively.
Conclusion: The method demonstrates significant safety gaps in current VLMs and introduces VLJailbreakBench, a comprehensive safety benchmark with 3,654 multimodal jailbreak samples showing urgent need for stronger defenses.
Abstract: As large Vision-Language Models (VLMs) gain prominence, ensuring their safe deployment has become critical. Recent studies have explored VLM robustness against jailbreak attacks-techniques that exploit model vulnerabilities to elicit harmful outputs. However, the limited availability of diverse multimodal data has constrained current approaches to rely heavily on adversarial or manually crafted images derived from harmful text datasets, which often lack effectiveness and diversity across different contexts. In this paper, we propose IDEATOR, a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks. IDEATOR is grounded in the insight that VLMs themselves could serve as powerful red team models for generating multimodal jailbreak prompts. Specifically, IDEATOR leverages a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model. Extensive experiments demonstrate IDEATOR’s high effectiveness and transferability, achieving a 94% attack success rate (ASR) in jailbreaking MiniGPT-4 with an average of only 5.34 queries, and high ASRs of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Chameleon, respectively. Building on IDEATOR’s strong transferability and automated process, we introduce the VLJailbreakBench, a safety benchmark comprising 3,654 multimodal jailbreak samples. Our benchmark results on 11 recently released VLMs reveal significant gaps in safety alignment. For instance, our challenge set achieves ASRs of 46.31% on GPT-4o and 19.65% on Claude-3.5-Sonnet, underscoring the urgent need for stronger defenses. VLJailbreakBench is publicly available at https://roywang021.github.io/VLJailbreakBench.
[147] Total Disentanglement of Font Images into Style and Character Class Features
Daichi Haraguchi, Wataru Shimoda, Kota Yamaguchi, Seiichi Uchida
Main category: cs.CV
TL;DR: Total disentanglement is a neural network method that completely separates font images into style and content features, enabling high-accuracy reconstruction and various font-related tasks.
Details
Motivation: To address the long-standing question of whether 'A'-ness exists by developing a method that can completely disentangle font style from character content features.
Method: Uses a neural network with careful training procedure to extract common style features from all A-Z characters in the same font and common content features from the same character across different fonts.
Result: Achieves very high accuracy in total disentanglement, provides experimental proof that ‘A’-ness exists, and enables applications in font recognition, character recognition, and one-shot font image generation.
Conclusion: Total disentanglement successfully demonstrates complete nonlinear decomposition of font images into style and content features, solving a fundamental problem in font analysis and enabling various practical applications.
Abstract: In this paper, we demonstrate a total disentanglement of font images. Total disentanglement is a neural network-based method for decomposing each font image nonlinearly and completely into its style and content (i.e., character class) features. It uses a simple but careful training procedure to extract the common style feature from all 'A'-'Z' images in the same font and the common content feature from all 'A' (or another class) images in different fonts. These disentangled features guarantee the reconstruction of the original font image. Various experiments have been conducted to understand the performance of total disentanglement. First, it is demonstrated that total disentanglement is achievable with very high accuracy; this is experimental proof of the long-standing open question, "Does 'A'-ness exist?" (Hofstadter, 1985). Second, it is demonstrated that the disentangled features produced by total disentanglement apply to a variety of tasks, including font recognition, character recognition, and one-shot font image generation. Code is available here: https://github.com/uchidalab/total_disentanglement
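The sharing constraint at the heart of the method can be pictured with a toy tensor of encoder features: style is what all characters of one font have in common, content is what one character shares across fonts. A schematic sketch, not the authors' training code:

```python
import torch

# Toy encoder features for 100 fonts x 26 characters: (fonts, classes, dim).
feats = torch.randn(100, 26, 64)

# Style is shared by all characters of one font: average over the class axis.
style = feats.mean(dim=1)    # (100, 64), one style vector per font

# Content is shared by one character across fonts: average over the font axis.
content = feats.mean(dim=0)  # (26, 64), one content vector per class

# Training would push decode(style[f], content[c]) to reconstruct the image of
# character c rendered in font f, making the decomposition complete.
print(style.shape, content.shape)
```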
[148] EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds
Lu Chen, Yizhou Wang, Shixiang Tang, Qianhong Ma, Tong He, Wanli Ouyang, Xiaowei Zhou, Hujun Bao, Sida Peng
Main category: cs.CV
TL;DR: EgoAgent is a unified transformer model that jointly learns perception, prediction, and action capabilities in a single framework, overcoming limitations of separate models by modeling their causal and temporal relationships through an interleaved sequence approach.
Details
Motivation: Existing methods train separate models for perception, prediction, and action, which fails to capture their intrinsic relationships and prevents mutual learning. Humans learn through the perception-action loop, inspiring a unified approach.
Method: Proposes EgoAgent with a joint embedding-action-prediction architecture using a transformer. Models causal and temporal dependencies through interleaved sequences of states and actions, featuring temporally asymmetric predictor and observer branches for synergistic optimization.
Result: Comprehensive evaluations demonstrate superiority on image classification, egocentric future state prediction, and 3D human motion prediction tasks compared to separate model approaches.
Conclusion: EgoAgent successfully unifies perception, prediction, and action capabilities in a single transformer model, enabling synergistic learning and superior performance across multiple representative tasks in computer vision and agent modeling.
Abstract: Learning an agent model that behaves like humans-capable of jointly perceiving the environment, predicting the future, and taking actions from a first-person perspective-is a fundamental challenge in computer vision. Existing methods typically train separate models for these abilities, which fail to capture their intrinsic relationships and prevent them from learning from each other. Inspired by how humans learn through the perception-action loop, we propose EgoAgent, a unified agent model that simultaneously learns to represent, predict, and act within a single transformer. EgoAgent explicitly models the causal and temporal dependencies among these abilities by formulating the task as an interleaved sequence of states and actions. It further introduces a joint embedding-action-prediction architecture with temporally asymmetric predictor and observer branches, enabling synergistic optimization across all three capabilities. Comprehensive evaluations of EgoAgent on representative tasks such as image classification, egocentric future state prediction, and 3D human motion prediction demonstrate the superiority of our method. The code and trained models will be publicly available at https://github.com/zju3dv/EgoAgent.
[149] Automatic infant 2D pose estimation from videos: comparing seven deep neural network methods
Filipe Gama, Matej Misar, Lukas Navara, Sergiu T. Popescu, Matej Hoffmann
Main category: cs.CV
TL;DR: Comparison of 7 popular pose estimation methods on infant videos shows ViTPose performs best without finetuning, with AlphaPose offering near real-time performance.
Details
Motivation: Automatic markerless estimation of infant posture from ordinary videos enables movement studies in natural settings, facilitating motor development understanding and early disorder diagnosis, but existing methods are trained on adult datasets.
Method: Tested and compared seven popular pose estimation methods (AlphaPose, DeepLabCut/DeeperCut, Detectron2, HRNet, MediaPipe/BlazePose, OpenPose, ViTPose) on infant videos in supine and complex positions, evaluating standard metrics plus novel torso-length ratio errors and reliability analysis.
Result: Surprisingly, all methods except DeepLabCut and MediaPipe performed competitively without finetuning, with ViTPose achieving the best performance. AlphaPose ran close to real time (27 fps). Introduced new error metrics expressed as a neck-to-mid-hip (torso-length) ratio and studied detection reliability.
Conclusion: Most adult-trained pose estimation methods work well on infants without additional training, with ViTPose performing best. Provided documented containers and analysis scripts for reproducibility, enabling broader adoption of infant movement analysis.
Abstract: Automatic markerless estimation of infant posture and motion from ordinary videos carries great potential for movement studies “in the wild”, facilitating understanding of motor development and massively increasing the chances of early diagnosis of disorders. There is rapid development of human pose estimation methods in computer vision thanks to advances in deep learning and machine learning. However, these methods are trained on datasets that feature adults in different contexts. This work tests and compares seven popular methods (AlphaPose, DeepLabCut/DeeperCut, Detectron2, HRNet, MediaPipe/BlazePose, OpenPose, and ViTPose) on videos of infants in supine position and in more complex settings. Surprisingly, all methods except DeepLabCut and MediaPipe have competitive performance without additional finetuning, with ViTPose performing best. Next to standard performance metrics (average precision and recall), we introduce errors expressed in the neck-mid-hip (torso length) ratio and additionally study missed and redundant detections, and the reliability of the internal confidence ratings of the different methods, which are relevant for downstream tasks. Among the networks with competitive performance, only AlphaPose could run close to real time (27 fps) on our machine. We provide documented Docker containers or instructions for all the methods we used, our analysis scripts, and the processed data at https://hub.docker.com/u/humanoidsctu and https://osf.io/x465b/.
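The torso-length-normalized error introduced above is simple to compute; a sketch with hypothetical keypoint indices for the neck and mid-hip landmarks:

```python
import numpy as np

def torso_ratio_error(pred: np.ndarray, gt: np.ndarray,
                      neck: int, mid_hip: int) -> np.ndarray:
    """Per-keypoint error as a fraction of the neck-to-mid-hip distance.

    pred, gt: (n_keypoints, 2) pixel coordinates. Normalizing by torso length
    makes errors comparable across infant sizes and camera distances.
    """
    torso = np.linalg.norm(gt[neck] - gt[mid_hip])
    return np.linalg.norm(pred - gt, axis=1) / torso

pred = 100 * np.random.rand(17, 2)
gt = 100 * np.random.rand(17, 2)
print(torso_ratio_error(pred, gt, neck=1, mid_hip=8).round(2))
```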
[150] Attention-Guided Multi-scale Interaction Network for Face Super-Resolution
Xujie Wan, Wenjie Li, Guangwei Gao, Huimin Lu, Jian Yang, Chia-Wen Lin
Main category: cs.CV
TL;DR: AMINet is a CNN-Transformer hybrid network for face super-resolution that uses local-global feature interaction and selective attention fusion to better integrate multiscale features.
Details
Motivation: Existing hybrid FSR methods simply combine Transformer and CNN without properly fusing features from different scales, limiting their complementarity and performance.
Method: Proposes AMINet with Local and Global Feature Interaction Module (LGFI) to fuse global and local features, Residual Depth Feature Extraction Module (RDFE) for multi-receptive field features, and Selective Kernel Attention Fusion Module (SKAF) for adaptive feature fusion.
Result: The method consistently performs well with less computational consumption and faster inference compared to existing approaches.
Conclusion: The proposed attention-guided multiscale interaction network effectively promotes complementarity of different scale features and enables free flow of multiscale features, enhancing face super-resolution performance.
Abstract: Recently, CNN and Transformer hybrid networks demonstrated excellent performance in face super-resolution (FSR) tasks. Since numerous features at different scales in hybrid networks, how to fuse these multiscale features and promote their complementarity is crucial for enhancing FSR. However, existing hybrid network-based FSR methods ignore this, only simply combining the Transformer and CNN. To address this issue, we propose an attention-guided Multiscale interaction network (AMINet), which incorporates local and global feature interactions, as well as encoder-decoder phase feature interactions. Specifically, we propose a Local and Global Feature Interaction Module (LGFI) to promote the fusion of global features and the local features extracted from different receptive fields by our Residual Depth Feature Extraction Module (RDFE). Additionally, we propose a Selective Kernel Attention Fusion Module (SKAF) to adaptively select fusions of different features within the LGFI and encoder-decoder phases. Our above design allows the free flow of multiscale features from within modules and between the encoder and decoder, which can promote the complementarity of different scale features to enhance FSR. Comprehensive experiments confirm that our method consistently performs well with less computational consumption and faster inference.
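The adaptive selection idea behind SKAF, weighting two feature branches with channel-wise softmax attention, can be sketched as follows; layer sizes are illustrative and this is not the authors' module:

```python
import torch
import torch.nn as nn

class SelectiveFusion(nn.Module):
    """Fuse two feature maps with channel-wise softmax weights (SK-style)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction
        self.squeeze = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),          # global context per channel
            nn.Conv2d(channels, hidden, 1),
            nn.ReLU(inplace=True),
        )
        self.attn = nn.Conv2d(hidden, channels * 2, 1)  # one weight set per branch

    def forward(self, local_feat: torch.Tensor, global_feat: torch.Tensor):
        s = self.squeeze(local_feat + global_feat)
        w = self.attn(s).view(-1, 2, local_feat.shape[1], 1, 1).softmax(dim=1)
        return w[:, 0] * local_feat + w[:, 1] * global_feat  # adaptive mix

fuse = SelectiveFusion(64)
out = fuse(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```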
[151] The Oxford Spires Dataset: Benchmarking Large-Scale LiDAR-Visual Localisation, Reconstruction and Radiance Field Methods
Yifu Tao, Miguel Ángel Muñoz-Bañón, Lintong Zhang, Jiahao Wang, Lanke Frank Tarimo Fu, Maurice Fallon
Main category: cs.CV
TL;DR: A large-scale multi-modal dataset for benchmarking localization, reconstruction, and novel-view synthesis tasks, revealing limitations in current radiance field methods’ generalization capabilities.
Details
Motivation: To create a comprehensive benchmark dataset for evaluating SLAM, SfM, MVS, and radiance field methods using high-precision ground truth data from TLS scans, addressing the need for standardized evaluation in multi-sensor perception research.
Method: Developed a custom multi-sensor perception unit with synchronized cameras, LiDAR, and inertial sensors, captured data around Oxford landmarks, and established benchmarks using millimeter-accurate TLS maps as ground truth for localization and reconstruction evaluation.
Result: Evaluation shows state-of-the-art radiance field methods overfit to training poses and generalize poorly to out-of-sequence viewpoints, while also underperforming in 3D reconstruction compared to traditional MVS systems using the same visual inputs.
Conclusion: The dataset and benchmarks enable better integration of radiance field methods with SLAM systems, highlighting current limitations and providing a foundation for future improvements in multi-modal perception and 3D reconstruction.
Abstract: This paper introduces a large-scale multi-modal dataset captured in and around well-known landmarks in Oxford using a custom-built multi-sensor perception unit as well as a millimetre-accurate map from a Terrestrial LiDAR Scanner (TLS). The perception unit includes three synchronised global shutter colour cameras, an automotive 3D LiDAR scanner, and an inertial sensor - all precisely calibrated. We also establish benchmarks for tasks involving localisation, reconstruction, and novel-view synthesis, which enable the evaluation of Simultaneous Localisation and Mapping (SLAM) methods, Structure-from-Motion (SfM) and Multi-view Stereo (MVS) methods as well as radiance field methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting. To evaluate 3D reconstruction the TLS 3D models are used as ground truth. Localisation ground truth is computed by registering the mobile LiDAR scans to the TLS 3D models. Radiance field methods are evaluated not only with poses sampled from the input trajectory, but also from viewpoints that are from trajectories which are distant from the training poses. Our evaluation demonstrates a key limitation of state-of-the-art radiance field methods: we show that they tend to overfit to the training poses/images and do not generalise well to out-of-sequence poses. They also underperform in 3D reconstruction compared to MVS systems using the same visual inputs. Our dataset and benchmarks are intended to facilitate better integration of radiance field methods and SLAM systems. The raw and processed data, along with software for parsing and evaluation, can be accessed at https://dynamic.robots.ox.ac.uk/datasets/oxford-spires/.
[152] ForestSplats: Deformable transient field for Gaussian Splatting in the Wild
Wongi Park, Myeongseok Nam, Siwon Kim, Sangwoo Jo, Soomok Lee
Main category: cs.CV
TL;DR: ForestSplats improves 3D Gaussian Splatting for dynamic scenes by using deformable transient fields and superpixel-aware masks to handle transient elements without Vision Foundation Models, achieving better performance and memory efficiency.
Details
Motivation: 3D-GS works well for static scenes but degrades in real-world environments with transient objects, lighting variations, and occlusion. Existing methods using VFMs are computationally expensive and memory-intensive.
Method: Uses deformable transient field to capture per-view transient elements, superpixel-aware mask for clear occluder boundaries, and uncertainty-aware densification to avoid generating Gaussians in occluded areas.
Result: Outperforms existing methods without VFM and shows significant memory efficiency in representing transient elements across benchmark datasets.
Conclusion: ForestSplats effectively decomposes static scenes from transient distractors without requiring Vision Foundation Models, offering improved performance and efficiency.
Abstract: Recently, 3D Gaussian Splatting (3D-GS) has emerged, showing real-time rendering speeds and high-quality results in static scenes. Although 3D-GS shows effectiveness in static scenes, their performance significantly degrades in real-world environments due to transient objects, lighting variations, and diverse levels of occlusion. To tackle this, existing methods estimate occluders or transient elements by leveraging pre-trained models or integrating additional transient field pipelines. However, these methods still suffer from two defects: 1) Using semantic features from the Vision Foundation model (VFM) causes additional computational costs. 2) The transient field requires significant memory to handle transient elements with per-view Gaussians and struggles to define clear boundaries for occluders, solely relying on photometric errors. To address these problems, we propose ForestSplats, a novel approach that leverages the deformable transient field and a superpixel-aware mask to efficiently represent transient elements in the 2D scene across unconstrained image collections and effectively decompose static scenes from transient distractors without VFM. We designed the transient field to be deformable, capturing per-view transient elements. Furthermore, we introduce a superpixel-aware mask that clearly defines the boundaries of occluders by considering photometric errors and superpixels. Additionally, we propose uncertainty-aware densification to avoid generating Gaussians within the boundaries of occluders during densification. Through extensive experiments across several benchmark datasets, we demonstrate that ForestSplats outperforms existing methods without VFM and shows significant memory efficiency in representing transient elements.
[153] Parasite: A Steganography-based Backdoor Attack Framework for Diffusion Models
Jiahao Chen, Yu Pan, Yi Du, Chunkai Wu, Lin Wang
Main category: cs.CV
TL;DR: A novel backdoor attack method called “Parasite” for image-to-image tasks in diffusion models that uses steganography to hide triggers and allows flexible target content embedding, effectively bypassing existing detection frameworks.
Details
Motivation: Existing backdoor attacks on diffusion models focus mainly on noise-to-image and text-to-image tasks, use conspicuous triggers, generate fixed target images, and lack concealability and flexibility.
Method: Proposes “Parasite” method that leverages steganography for trigger hiding and allows attackers to embed target content as backdoor triggers for more flexible attacks in image-to-image tasks.
Result: Achieved 0% backdoor detection rate against mainstream defense frameworks and conducted ablation studies on different hiding coefficients’ influence on attack results.
Conclusion: The “Parasite” method successfully addresses limitations of traditional backdoor attacks by providing concealability, flexibility, and effective evasion of detection frameworks in diffusion model image-to-image tasks.
Abstract: Recently, the diffusion model has gained significant attention as one of the most successful image generation models, which can generate high-quality images by iteratively sampling noise. However, recent studies have shown that diffusion models are vulnerable to backdoor attacks, allowing attackers to enter input data containing triggers to activate the backdoor and generate their desired output. Existing backdoor attack methods primarily focused on target noise-to-image and text-to-image tasks, with limited work on backdoor attacks in image-to-image tasks. Furthermore, traditional backdoor attacks often rely on a single, conspicuous trigger to generate a fixed target image, lacking concealability and flexibility. To address these limitations, we propose a novel backdoor attack method called “Parasite” for image-to-image tasks in diffusion models, which not only is the first to leverage steganography for triggers hiding, but also allows attackers to embed the target content as a backdoor trigger to achieve a more flexible attack. “Parasite” as a novel attack method effectively bypasses existing detection frameworks to execute backdoor attacks. In our experiments, “Parasite” achieved a 0 percent backdoor detection rate against the mainstream defense frameworks. In addition, in the ablation study, we discuss the influence of different hiding coefficients on the attack results. You can find our code at https://anonymous.4open.science/r/Parasite-1715/.
[154] Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization
Anas Anwarul Haq Khan, Utkarsh Verma, Ganesh Ramakrishnan
Main category: cs.CV
TL;DR: DEEVISum is a lightweight vision-language model for video summarization that uses multi-modal prompts, multi-stage knowledge distillation, and early exit to achieve efficient performance comparable to larger models.
Details
Motivation: To create an efficient and scalable vision-language model for video summarization that balances performance with computational efficiency, addressing the need for lightweight yet effective solutions in this domain.
Method: Uses multi-modal prompts combining textual and audio signals, incorporates Multi-Stage Knowledge Distillation (MSKD) for improved performance, and Early Exit (EE) mechanism to reduce inference time.
Result: MSKD provides 1.33% absolute F1 improvement over baseline distillation, EE reduces inference time by ~21% with only 1.3 point F1 drop. Best model achieves 61.1 F1 score on TVSum dataset, competing with larger models while maintaining lower computational footprint.
Conclusion: DEEVISum successfully demonstrates that lightweight vision-language models can achieve competitive performance in video summarization through innovative techniques like MSKD and EE, offering an efficient alternative to larger models with publicly released code and dataset.
Abstract: We introduce DEEVISum (Distilled Early Exit Vision language model for Summarization), a lightweight, efficient, and scalable vision language model designed for segment wise video summarization. Leveraging multi modal prompts that combine textual and audio derived signals, DEEVISum incorporates Multi Stage Knowledge Distillation (MSKD) and Early Exit (EE) to strike a balance between performance and efficiency. MSKD offers a 1.33% absolute F1 improvement over baseline distillation (0.5%), while EE reduces inference time by approximately 21% with a 1.3 point drop in F1. Evaluated on the TVSum dataset, our best model PaLI Gemma2 3B + MSKD achieves an F1 score of 61.1, matching the performance of significantly larger models, all while maintaining a lower computational footprint. We publicly release our code and processed dataset to support further research.
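A generic early-exit pattern of the kind EE implements: attach a lightweight head after each layer and stop as soon as a confidence threshold is cleared. A self-contained sketch with illustrative sizes, not the paper's architecture:

```python
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    """Stacked transformer layers, each followed by a lightweight exit head.

    At inference, the first head whose confidence clears the threshold answers
    and the remaining layers are skipped, trading a little accuracy for speed.
    """

    def __init__(self, dim: int = 256, n_layers: int = 6,
                 n_classes: int = 2, threshold: float = 0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, 4, batch_first=True)
             for _ in range(n_layers)]
        )
        self.heads = nn.ModuleList(
            [nn.Linear(dim, n_classes) for _ in range(n_layers)]
        )
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Batch size 1 assumed for the scalar confidence check below.
        for block, head in zip(self.layers, self.heads):
            x = block(x)
            probs = head(x.mean(dim=1)).softmax(dim=-1)
            if probs.max() >= self.threshold:
                return probs  # confident enough: exit early
        return probs  # fell through to the deepest head

model = EarlyExitEncoder().eval()
print(model(torch.randn(1, 10, 256)))
```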
[155] Combating Falsification of Speech Videos with Live Optical Signatures (Extended Version)
Hadleigh Schwartz, Xiaofeng Yan, Charles J. Carver, Xia Zhou
Main category: cs.CV
TL;DR: VeriLight is a system that protects speech videos from visual manipulation by embedding imperceptible light-based signatures that encode speaker identity and facial motion features, enabling downstream verification of video integrity.
Details
Motivation: High-profile speech videos are prime targets for falsification due to their accessibility and influence, requiring protection against visual manipulations of speaker identity and facial motion.
Method: Uses modulated light to embed cryptographically-secured physical signatures at the event site that encode compact (150-bit) pose-invariant speech video features based on locality-sensitive hashing, with an optical modulation scheme embedding >200 bps while remaining imperceptible.
Result: Achieves AUCs ≥ 0.99 and 100% true positive rate in detecting falsified videos, with high robustness across recording conditions, video post-processing techniques, and white-box adversarial attacks.
Conclusion: VeriLight provides an effective low-overhead and unobtrusive solution for protecting speech videos against visual manipulation through physical light-based signatures that can be verified downstream.
Abstract: High-profile speech videos are prime targets for falsification, owing to their accessibility and influence. This work proposes VeriLight, a low-overhead and unobtrusive system for protecting speech videos from visual manipulations of speaker identity and lip and facial motion. Unlike the predominant purely digital falsification detection methods, VeriLight creates dynamic physical signatures at the event site and embeds them into all video recordings via imperceptible modulated light. These physical signatures encode semantically-meaningful features unique to the speech event, including the speaker’s identity and facial motion, and are cryptographically-secured to prevent spoofing. The signatures can be extracted from any video downstream and validated against the portrayed speech content to check its integrity. Key elements of VeriLight include (1) a framework for generating extremely compact (i.e., 150-bit), pose-invariant speech video features, based on locality-sensitive hashing; and (2) an optical modulation scheme that embeds >200 bps into video while remaining imperceptible both in video and live. Experiments on extensive video datasets show VeriLight achieves AUCs ≥ 0.99 and a true positive rate of 100% in detecting falsified videos. Further, VeriLight is highly robust across recording conditions, video post-processing techniques, and white-box adversarial attacks on its feature extraction methods. A demonstration of VeriLight is available at https://mobilex.cs.columbia.edu/verilight.
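The compact signatures rest on locality-sensitive hashing, where similar inputs collide on most bits. A generic sign-random-projection sketch; VeriLight's actual pose-invariant features and cryptographic securing are more involved:

```python
import numpy as np

def lsh_signature(features: np.ndarray, n_bits: int = 150, seed: int = 0):
    """Compress a feature vector into n_bits via sign random projections.

    Nearby vectors (e.g. motion features of the same speech segment) agree on
    most bits, so a verifier can threshold the Hamming distance between the
    signature embedded in the light and one recomputed from the video.
    """
    rng = np.random.default_rng(seed)  # projection shared by both parties
    planes = rng.standard_normal((n_bits, features.shape[0]))
    return (planes @ features > 0).astype(np.uint8)

feat = np.random.rand(512)
sig_live = lsh_signature(feat)
sig_video = lsh_signature(feat + 0.01 * np.random.rand(512))  # mild distortion
print("Hamming distance:", int((sig_live != sig_video).sum()), "of 150 bits")
```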
[156] MM-Prompt: Cross-Modal Prompt Tuning for Continual Visual Question Answering
Xu Li, Fan Lyu
Main category: cs.CV
TL;DR: MM-Prompt addresses modality imbalance in continual VQA by introducing cross-modal prompt query and recovery mechanisms for balanced multi-modal learning.
Details
Motivation: Existing CVQA methods use cross-modal prompt isolation which exacerbates modality imbalance and degrades performance over time in continual learning scenarios.
Method: Proposes MM-Prompt framework with cross-modal prompt query (balanced prompt selection using cross-modal signals) and cross-modal prompt recovery (joint reconstruction through iterative interactions with alignment loss to prevent drift).
Result: Extensive experiments show MM-Prompt surpasses prior approaches in accuracy and knowledge retention while maintaining balanced modality engagement throughout continual learning.
Conclusion: MM-Prompt effectively addresses modality imbalance in continual VQA through cross-modal prompt integration, demonstrating superior performance and balanced learning across modalities.
Abstract: Continual Visual Question Answering (CVQA) based on pre-trained models (PTMs) has achieved promising progress by leveraging prompt tuning to enable continual multi-modal learning. However, most existing methods adopt cross-modal prompt isolation, constructing visual and textual prompts separately, which exacerbates modality imbalance and leads to degraded performance over time. To tackle this issue, we propose MM-Prompt, a novel framework incorporating cross-modal prompt query and cross-modal prompt recovery. The former enables balanced prompt selection by incorporating cross-modal signals during query formation, while the latter promotes joint prompt reconstruction through iterative cross-modal interactions, guided by an alignment loss to prevent representational drift. Extensive experiments show that MM-Prompt surpasses prior approaches in accuracy and knowledge retention, while maintaining balanced modality engagement throughout continual learning.
[157] GAPrompt: Geometry-Aware Point Cloud Prompt for 3D Vision Model
Zixiang Ai, Zichen Liu, Yuanhang Lei, Zhenyu Cui, Xu Zou, Jiahuan Zhou
Main category: cs.CV
TL;DR: GAPrompt is a geometry-aware parameter-efficient fine-tuning method for 3D vision models that uses geometric cues to enhance model adaptability while using only 2.19% trainable parameters.
Details
Motivation: Existing PEFT methods for 3D vision models struggle with geometric information capture in point clouds, and full fine-tuning is computationally expensive and storage-intensive.
Method: Proposes Geometry-Aware Point Cloud Prompt (GAPrompt) with Point Prompt for fine-grained geometric details, Point Shift Prompter for global shape information, and Prompt Propagation mechanism for feature extraction enhancement.
Result: Significantly outperforms state-of-the-art PEFT methods and achieves competitive results compared to full fine-tuning on various benchmarks.
Conclusion: GAPrompt effectively addresses geometric information capture limitations in 3D vision models while maintaining parameter efficiency, demonstrating superior performance with minimal trainable parameters.
Abstract: Pre-trained 3D vision models have gained significant attention for their promising performance on point cloud data. However, fully fine-tuning these models for downstream tasks is computationally expensive and storage-intensive. Existing parameter-efficient fine-tuning (PEFT) approaches, which focus primarily on input token prompting, struggle to achieve competitive performance due to their limited ability to capture the geometric information inherent in point clouds. To address this challenge, we propose a novel Geometry-Aware Point Cloud Prompt (GAPrompt) that leverages geometric cues to enhance the adaptability of 3D vision models. First, we introduce a Point Prompt that serves as an auxiliary input alongside the original point cloud, explicitly guiding the model to capture fine-grained geometric details. Additionally, we present a Point Shift Prompter designed to extract global shape information from the point cloud, enabling instance-specific geometric adjustments at the input level. Moreover, our proposed Prompt Propagation mechanism incorporates the shape information into the model’s feature extraction process, further strengthening its ability to capture essential geometric characteristics. Extensive experiments demonstrate that GAPrompt significantly outperforms state-of-the-art PEFT methods and achieves competitive results compared to full fine-tuning on various benchmarks, while utilizing only 2.19% of trainable parameters. Our code is available at https://github.com/zhoujiahuan1991/ICML2025-GAPrompt.
[158] AdvReal: Physical Adversarial Patch Generation Framework for Security Evaluation of Object Detection Systems
Yuanhao Huang, Yilong Ren, Jinlei Wang, Lujia Huo, Xuesong Bai, Jinchuan Zhang, Haiyan Yu
Main category: cs.CV
TL;DR: A unified joint adversarial training framework for generating realistic adversarial textures that can fool object detection systems in both 2D and 3D domains, achieving high attack success rates in physical scenarios.
Details
Motivation: Deep learning-based perception methods in autonomous vehicles are vulnerable to adversarial attacks, creating security risks. There's a need for effective adversarial example generation in the physical world to evaluate object detection systems.
Method: A unified framework that simultaneously optimizes texture maps in 2D image and 3D mesh spaces, featuring a realism-enhanced adversarial module with time-space and relighting mapping, non-rigid deformation modeling, and texture remapping for alignment with human body surfaces.
Result: Achieved 70.13% average attack success rate on YOLOv12 in physical scenarios, significantly outperforming existing methods (T-SEA: 21.65%, AdvTexture: 19.70%). Maintained over 90% success rate across multiple viewpoints and distances.
Conclusion: The method demonstrates strong robustness and transferability under multi-angle attacks, varying lighting conditions, and real-world distances, providing effective adversarial textures for security evaluation of object detection systems.
Abstract: Autonomous vehicles are complex intelligent systems with artificial intelligence at their core. However, perception methods based on deep learning are extremely vulnerable to adversarial samples, which can lead to safety-critical incidents. Generating effective adversarial examples in the physical world and using them to evaluate object detection systems remains a major challenge. In this study, we propose a unified joint adversarial training framework for both 2D and 3D domains, which simultaneously optimizes texture maps in 2D image and 3D mesh spaces to better address intra-class diversity and real-world environmental variations. The framework includes a novel realistic enhanced adversarial module, with a time-space and relighting mapping pipeline that adjusts illumination consistency between adversarial patches and target garments under varied viewpoints. Building upon this, we develop a realism enhancement mechanism that incorporates non-rigid deformation modeling and texture remapping to ensure alignment with the human body's non-rigid surfaces in 3D scenes. Extensive experimental results in digital and physical environments demonstrate that the adversarial textures generated by our method can effectively mislead the target detection model. Specifically, our method achieves an average attack success rate (ASR) of 70.13% on YOLOv12 in physical scenarios, significantly outperforming existing methods such as T-SEA (21.65%) and AdvTexture (19.70%). Moreover, the proposed method maintains stable ASR across multiple viewpoints and distances, with an average attack success rate exceeding 90% under both frontal and oblique views at a distance of 4 meters. This confirms the method's strong robustness and transferability under multi-angle attacks, varying lighting conditions, and real-world distances. The demo video and code can be obtained at https://github.com/Huangyh98/AdvReal.git.
[159] TESSER: Transfer-Enhancing Adversarial Attacks from Vision Transformers via Spectral and Semantic Regularization
Amira Guesmi, Bassem Ouni, Muhammad Shafique
Main category: cs.CV
TL;DR: TESSER is a novel adversarial attack framework that enhances transferability between Vision Transformers and CNNs through feature-sensitive gradient scaling and spectral smoothness regularization, achieving significant improvements in attack success rates.
Details
Motivation: Adversarial transferability is crucial for black-box attacks in security-critical applications, but existing attacks often fail to transfer effectively across different architectures, particularly from Vision Transformers to CNNs or hybrid models.
Method: TESSER uses two key strategies: (1) Feature-Sensitive Gradient Scaling (FSGS) that modulates gradients based on token-wise importance from feature activations, and (2) Spectral Smoothness Regularization (SSR) that suppresses high-frequency noise using a differentiable Gaussian prior.
Result: TESSER achieves +10.9% higher attack success rate on CNNs and +7.2% on ViTs compared to state-of-the-art methods, with 53.55% ASR on adversarially trained CNNs. It shows 12% reduction in high-frequency energy and strong alignment with salient visual regions.
Conclusion: TESSER effectively addresses adversarial transferability challenges by generating semantically meaningful and spectrally smooth perturbations that work across diverse architectures, demonstrating superior performance in both white-box and black-box attack scenarios.
Abstract: Adversarial transferability remains a critical challenge in evaluating the robustness of deep neural networks. In security-critical applications, transferability enables black-box attacks without access to model internals, making it a key concern for real-world adversarial threat assessment. While Vision Transformers (ViTs) have demonstrated strong adversarial performance, existing attacks often fail to transfer effectively across architectures, especially from ViTs to Convolutional Neural Networks (CNNs) or hybrid models. In this paper, we introduce TESSER, a novel adversarial attack framework that enhances transferability via two key strategies: (1) Feature-Sensitive Gradient Scaling (FSGS), which modulates gradients based on token-wise importance derived from intermediate feature activations, and (2) Spectral Smoothness Regularization (SSR), which suppresses high-frequency noise in perturbations using a differentiable Gaussian prior. These components work in tandem to generate perturbations that are both semantically meaningful and spectrally smooth. Extensive experiments on ImageNet across 12 diverse architectures demonstrate that TESSER achieves +10.9% higher attack success rate (ASR) on CNNs and +7.2% on ViTs compared to the state-of-the-art Adaptive Token Tuning (ATT) method. Moreover, TESSER significantly improves robustness against defended models, achieving 53.55% ASR on adversarially trained CNNs. Qualitative analysis shows strong alignment between TESSER's perturbations and salient visual regions identified via Grad-CAM, while frequency-domain analysis reveals a 12% reduction in high-frequency energy, confirming the effectiveness of spectral regularization.
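Of TESSER's two components, Spectral Smoothness Regularization is the more self-contained. The sketch below applies a differentiable Gaussian low-pass to the signed gradient before each perturbation update, one plausible realization of suppressing high-frequency noise under a Gaussian prior; the toy loss, step size, and kernel width are stand-ins, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size: int = 5, sigma: float = 1.0) -> torch.Tensor:
    ax = torch.arange(size) - size // 2
    g = torch.exp(-(ax.float() ** 2) / (2 * sigma ** 2))
    k2d = torch.outer(g, g)
    return (k2d / k2d.sum()).view(1, 1, size, size)

def spectrally_smooth(update: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    """Low-pass each channel of the perturbation update with a Gaussian kernel,
    suppressing the high-frequency components that transfer poorly."""
    c = update.shape[1]
    return F.conv2d(update, kernel.expand(c, 1, -1, -1),
                    padding=kernel.shape[-1] // 2, groups=c)

# One hypothetical attack step: smooth the signed gradient, then update.
delta = torch.zeros(1, 3, 224, 224, requires_grad=True)
x, k = torch.rand(1, 3, 224, 224), gaussian_kernel()
loss = (x + delta).sum()          # stand-in for the surrogate model's loss
loss.backward()
with torch.no_grad():
    delta += 2 / 255 * spectrally_smooth(delta.grad.sign(), k)
    delta.clamp_(-8 / 255, 8 / 255)
```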
[160] Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization
Jiahao Cui, Yan Chen, Mingwang Xu, Hanlin Shang, Yuxuan Chen, Yun Zhan, Zilong Dong, Yao Yao, Jingdong Wang, Siyu Zhu
Main category: cs.CV
TL;DR: A diffusion framework for photorealistic portrait animation that uses human preference optimization and temporal motion modulation to improve lip sync, facial expressions, and body motion dynamics.
Details
Motivation: Generating highly dynamic and photorealistic portrait animations driven by audio and skeletal motion is challenging due to the need for precise lip synchronization, natural facial expressions, and high-fidelity body motion dynamics.
Method: Proposes a human-preference-aligned diffusion framework with two innovations: 1) direct preference optimization tailored for human-centric animation using curated human preference data, and 2) temporal motion modulation that resolves spatiotemporal resolution mismatches through temporal channel redistribution and proportional feature expansion.
Result: Experiments demonstrate obvious improvements in lip-audio synchronization, expression vividness, body motion coherence over baseline methods, alongside notable gains in human preference metrics.
Conclusion: The proposed framework effectively addresses the challenges of portrait animation through human-preference alignment and temporal motion modulation, achieving superior results in multiple perceptual metrics.
Abstract: Generating highly dynamic and photorealistic portrait animations driven by audio and skeletal motion remains challenging due to the need for precise lip synchronization, natural facial expressions, and high-fidelity body motion dynamics. We propose a human-preference-aligned diffusion framework that addresses these challenges through two key innovations. First, we introduce direct preference optimization tailored for human-centric animation, leveraging a curated dataset of human preferences to align generated outputs with perceptual metrics for portrait motion-video alignment and naturalness of expression. Second, the proposed temporal motion modulation resolves spatiotemporal resolution mismatches by reshaping motion conditions into dimensionally aligned latent features through temporal channel redistribution and proportional feature expansion, preserving the fidelity of high-frequency motion details in diffusion-based synthesis. The proposed mechanism is complementary to existing UNet and DiT-based portrait diffusion approaches, and experiments demonstrate obvious improvements in lip-audio synchronization, expression vividness, body motion coherence over baseline methods, alongside notable gains in human preference metrics. Our model and source code can be found at: https://github.com/xyz123xyz456/hallo4.
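The preference-alignment component builds on direct preference optimization. The sketch below shows the standard DPO objective that such a method tailors: it raises the reference-relative likelihood of the human-preferred sample over the rejected one. The animation-specific conditioning and data curation are not reproduced here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """Standard direct preference optimization: push the policy to assign a
    higher (reference-relative) log-likelihood to the preferred sample (w)
    than to the rejected one (l)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# Toy usage with scalar log-likelihoods for one preferred/rejected video pair.
loss = dpo_loss(torch.tensor([-1.0]), torch.tensor([-1.2]),
                torch.tensor([-1.1]), torch.tensor([-1.1]))
print(loss.item())
```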
[161] Uncertainty-aware Diffusion and Reinforcement Learning for Joint Plane Localization and Anomaly Diagnosis in 3D Ultrasound
Yuhao Huang, Yueyue Xu, Haoran Dou, Jiaxiao Deng, Xin Yang, Hongyu Zheng, Dong Ni
Main category: cs.CV
TL;DR: An intelligent system using denoising diffusion models and reinforcement learning for automated plane localization and congenital uterine anomaly diagnosis from 3D ultrasound data.
Details
Motivation: Congenital uterine anomalies cause infertility and pregnancy complications, and 3D ultrasound provides better visualization than 2D for accurate assessment, but requires automated analysis tools.
Method: Combines a denoising diffusion model with local/global guidance and adaptive weighting, reinforcement learning for key slice extraction from sequences, and text-driven uncertainty modeling for classification improvement.
Result: Extensive experiments on large 3D uterine ultrasound dataset demonstrate effective performance in both plane localization and CUA diagnosis.
Conclusion: The proposed intelligent system provides an effective automated solution for congenital uterine anomaly diagnosis using 3D ultrasound data with improved accuracy and reliability.
Abstract: Congenital uterine anomalies (CUAs) can lead to infertility, miscarriage, preterm birth, and an increased risk of pregnancy complications. Compared to traditional 2D ultrasound (US), 3D US can reconstruct the coronal plane, providing a clear visualization of the uterine morphology for assessing CUAs accurately. In this paper, we propose an intelligent system for simultaneous automated plane localization and CUA diagnosis. Our highlights are: 1) we develop a denoising diffusion model with local (plane) and global (volume/text) guidance, using an adaptive weighting strategy to optimize attention allocation to different conditions; 2) we introduce a reinforcement learning-based framework with unsupervised rewards to extract the key slice summary from redundant sequences, fully integrating information across multiple planes to reduce learning difficulty; 3) we provide text-driven uncertainty modeling for coarse prediction, and leverage it to adjust the classification probability for overall performance improvement. Extensive experiments on a large 3D uterine US dataset show the efficacy of our method in terms of plane localization and CUA diagnosis. Code is available at https://github.com/yuhoo0302/CUA-US.
[162] JAX-IK: Real-Time Inverse Kinematics for Generating Multi-Constrained Movements of Virtual Human Characters
Hendric Voss, Stefan Kopp
Main category: cs.CV
TL;DR: A real-time inverse kinematics solver using TensorFlow’s automatic differentiation for realistic human movement generation, outperforming traditional methods in speed and accuracy.
Details
Motivation: Generate accurate and realistic virtual human movements in real-time for applications in computer graphics, virtual environments, robotics, and biomechanics.
Method: Leverages TensorFlow's automatic differentiation and just-in-time compilation to handle complex human skeletons, treating forward and inverse kinematics as differentiable operations to address error accumulation and joint limits.
Result: Achieves real-time performance with rapid convergence, minimal computational overhead, and improved success rates compared to CCD, FABRIK, and IPOPT algorithms on SMPLX human skeleton model.
Conclusion: The proposed differentiable IK solver provides an efficient solution for realistic human motion modeling with superior performance over traditional iterative-based methods.
Abstract: Generating accurate and realistic virtual human movements in real-time is of high importance for a variety of applications in computer graphics, interactive virtual environments, robotics, and biomechanics. This paper introduces a novel real-time inverse kinematics (IK) solver specifically designed for realistic human-like movement generation. Leveraging the automatic differentiation and just-in-time compilation of TensorFlow, the proposed solver efficiently handles complex articulated human skeletons with high degrees of freedom. By treating forward and inverse kinematics as differentiable operations, our method effectively addresses common challenges such as error accumulation and complicated joint limits in multi-constrained problems, which are critical for realistic human motion modeling. We demonstrate the solver’s effectiveness on the SMPLX human skeleton model, evaluating its performance against widely used iterative-based IK algorithms, like Cyclic Coordinate Descent (CCD), FABRIK, and the nonlinear optimization algorithm IPOPT. Our experiments cover both simple end-effector tasks and sophisticated, multi-constrained problems with realistic joint limits. Results indicate that our IK solver achieves real-time performance, exhibiting rapid convergence, minimal computational overhead per iteration, and improved success rates compared to existing methods. The project code is available at https://github.com/hvoss-techfak/JAX-IK
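Treating forward kinematics as a differentiable operation turns IK into a gradient-descent problem. Below is a minimal autodiff IK loop for a planar two-link chain, written in PyTorch purely for illustration; the paper targets full SMPLX skeletons with joint limits and multiple constraints, which this sketch omits.

```python
import torch

link_lengths = torch.tensor([1.0, 0.8])

def forward_kinematics(thetas: torch.Tensor) -> torch.Tensor:
    """End-effector position of a planar 2-link chain (fully differentiable)."""
    a1, a2 = thetas[0], thetas[0] + thetas[1]
    x = link_lengths[0] * torch.cos(a1) + link_lengths[1] * torch.cos(a2)
    y = link_lengths[0] * torch.sin(a1) + link_lengths[1] * torch.sin(a2)
    return torch.stack([x, y])

target = torch.tensor([1.2, 0.9])
thetas = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([thetas], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = torch.sum((forward_kinematics(thetas) - target) ** 2)
    loss.backward()          # gradients through FK come from autodiff
    opt.step()
print(forward_kinematics(thetas).detach(), "target:", target)
```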
[163] GEMINUS: Dual-aware Global and Scene-Adaptive Mixture-of-Experts for End-to-End Autonomous Driving
Chi Wan, Yixin Cui, Jiatong Du, Shuo Yang, Yulong Bai, Peng Yi, Nan Li, Yanjun Huang
Main category: cs.CV
TL;DR: GEMINUS is a Mixture-of-Experts end-to-end autonomous driving framework that combines a Global Expert and Scene-Adaptive Experts Group with a Dual-aware Router to achieve both robustness and adaptability across diverse traffic scenarios.
Details
Motivation: Single-mode planning methods struggle to acquire diversified driving skills for handling complex and diverse traffic environments, requiring a more adaptive and robust approach.
Method: Proposes a Mixture-of-Experts framework with: 1) Global Expert trained on the overall dataset for robust performance, 2) Scene-Adaptive Experts trained on scene subsets for adaptive performance, 3) Dual-aware Router that considers scenario-level features and routing uncertainty to dynamically activate experts.
Result: Outperforms existing methods in Bench2Drive closed-loop benchmark, achieves state-of-the-art performance in Driving Score and Success Rate, even with only monocular vision input.
Conclusion: GEMINUS effectively couples global robustness with scene-specific adaptability through its expert architecture and dual-aware routing, demonstrating superior performance in diverse autonomous driving scenarios.
Abstract: End-to-end autonomous driving requires adaptive and robust handling of complex and diverse traffic environments. However, prevalent single-mode planning methods attempt to learn an overall policy while struggling to acquire diversified driving skills to handle diverse scenarios. Therefore, this paper proposes GEMINUS, a Mixture-of-Experts end-to-end autonomous driving framework featuring a Global Expert and a Scene-Adaptive Experts Group, equipped with a Dual-aware Router. Specifically, the Global Expert is trained on the overall dataset, possessing robust performance. The Scene-Adaptive Experts are trained on corresponding scene subsets, achieving adaptive performance. The Dual-aware Router simultaneously considers scenario-level features and routing uncertainty to dynamically activate expert modules. Through the effective coupling of the Global Expert and the Scene-Adaptive Experts Group via the Dual-aware Router, GEMINUS achieves both adaptability and robustness across diverse scenarios. GEMINUS outperforms existing methods in the Bench2Drive closed-loop benchmark and achieves state-of-the-art performance in Driving Score and Success Rate, even with only monocular vision input. The code is available at https://github.com/newbrains1/GEMINUS.
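The summary does not specify the routing rule, but a dual-aware router can be sketched as a gate whose own entropy (a proxy for routing uncertainty) shifts weight from the scene experts back to the global expert. Everything below, including the class name and the entropy-based blend, is one hypothetical reading, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DualAwareRouter(nn.Module):
    """Hypothetical router: scene features select among scene experts; when
    the gate distribution is uncertain (high entropy), more weight is given
    to the robust global expert."""
    def __init__(self, feat_dim: int, n_scene_experts: int):
        super().__init__()
        self.gate = nn.Linear(feat_dim, n_scene_experts)

    def forward(self, scene_feat, global_out, scene_outs):
        w = torch.softmax(self.gate(scene_feat), dim=-1)              # (B, E)
        entropy = -(w * w.clamp_min(1e-8).log()).sum(-1, keepdim=True)
        alpha = entropy / torch.log(torch.tensor(float(w.shape[-1])))  # in [0, 1]
        scene_mix = torch.einsum("be,bed->bd", w, scene_outs)
        return alpha * global_out + (1 - alpha) * scene_mix

router = DualAwareRouter(feat_dim=32, n_scene_experts=4)
out = router(torch.randn(2, 32), torch.randn(2, 8), torch.randn(2, 4, 8))
print(out.shape)  # torch.Size([2, 8])
```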
[164] VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization
Sihan Yang, Runsen Xu, Chenhang Cui, Tai Wang, Dahua Lin, Jiangmiao Pang
Main category: cs.CV
TL;DR: VFlowOpt is a token pruning framework that prunes 90% of visual tokens in Large Multimodal Models while maintaining performance, achieving an 89% KV-Cache memory reduction and 3.8x faster inference.
Details
Motivation: Large Multimodal Models use excessive visual tokens, causing high computational costs, and existing pruning methods are simplistic with significant performance degradation.
Method: Uses attention-derived context relevance and patch-level information entropy to compute importance maps, implements progressive pruning with a recycling mechanism, and optimizes the pruning strategy via a visual information flow-guided method.
Result: Prunes 90% of visual tokens while maintaining comparable performance, reduces KV-Cache memory by 89%, and achieves 3.8 times faster inference.
Conclusion: VFlowOpt provides an effective framework for token pruning in LMMs that significantly reduces computational costs without substantial performance loss.
Abstract: Large Multimodal Models (LMMs) excel in visual-language tasks by leveraging numerous visual tokens for fine-grained visual information, but this token redundancy results in significant computational costs. Previous research aimed at reducing visual tokens during inference typically leverages importance maps derived from attention scores among vision-only tokens or vision-language tokens to prune tokens across one or multiple pruning stages. Despite this progress, pruning frameworks and strategies remain simplistic and insufficiently explored, often resulting in substantial performance degradation. In this paper, we propose VFlowOpt, a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism. The hyperparameters of its pruning strategy are further optimized by a visual information flow-guided method. Specifically, we compute an importance map for image tokens based on their attention-derived context relevance and patch-level information entropy. We then decide which tokens to retain or prune and aggregate the pruned ones as recycled tokens to avoid potential information loss. Finally, we apply a visual information flow-guided method that regards the last token in the LMM as the most representative signal of text-visual interactions. This method minimizes the discrepancy between token representations in LMMs with and without pruning, thereby enabling superior pruning strategies tailored to different LMMs. Experiments demonstrate that VFlowOpt can prune 90% of visual tokens while maintaining comparable performance, leading to an 89% reduction in KV-Cache memory and 3.8 times faster inference.
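The importance-map-plus-recycling idea is concrete enough to sketch: score tokens, keep the top fraction, and mean-pool the pruned remainder into one recycled token so their information is not discarded outright. The additive score combination and keep ratio below are illustrative, not the tuned values from the paper.

```python
import torch

def prune_with_recycling(tokens, attn_relevance, patch_entropy, keep_ratio=0.1):
    """Score each visual token by attention relevance plus patch entropy,
    keep the top fraction, and aggregate the pruned tokens into a single
    'recycled' token."""
    score = attn_relevance + patch_entropy                  # (B, N)
    k = max(1, int(keep_ratio * tokens.shape[1]))
    idx = score.topk(k, dim=1).indices                      # (B, k)
    keep = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
    mask = torch.ones_like(score, dtype=torch.bool).scatter(1, idx, False)
    recycled = tokens[mask].view(tokens.shape[0], -1, tokens.shape[-1]).mean(1, keepdim=True)
    return torch.cat([keep, recycled], dim=1)

out = prune_with_recycling(torch.randn(2, 100, 64), torch.rand(2, 100), torch.rand(2, 100))
print(out.shape)  # torch.Size([2, 11, 64]) -- 10 kept + 1 recycled token
```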
[165] Towards Scalable Training for Handwritten Mathematical Expression Recognition
Haoyang Li, Jiaqing Li, Jialun Cao, Zongyuan Yang, Yongping Xiong
Main category: cs.CV
TL;DR: TexTeller is the first large-scale HMER model trained on 80M+ synthetic formulas and real handwritten data, achieving SOTA performance across benchmarks.
Details
Motivation: Handwritten Mathematical Expression Recognition suffers from data scarcity due to costly manual annotation, limiting model performance.
Method: Developed a scalable data engine to generate 80M+ LaTeX-rendered formulas (Tex80M), then mix-trained with real handwritten data using a refined pipeline.
Result: Achieved state-of-the-art performance across nearly all HMER benchmarks with the largest formula dataset to date.
Conclusion: The approach bridges the data gap in HMER through scalable synthetic data generation and will release complete model, dataset, and codebase to advance research.
Abstract: Large foundation models have achieved significant performance gains through scalable training on massive datasets. However, the field of Handwritten Mathematical Expression Recognition (HMER) has been impeded by the scarcity of data, primarily due to the arduous and costly process of manual annotation. To bridge this gap, we propose a novel method integrating limited handwritten formulas with large-scale LaTeX-rendered formulas by developing a scalable data engine to generate complex and consistent LaTeX sequences. With this engine, we built the largest formula dataset to date, termed Tex80M, comprising over 80 million high-quality training instances. Then we propose TexTeller, the first HMER model trained at scale, by mix-training Tex80M with a relatively small HME dataset. The expansive training dataset and our refined pipeline have equipped TexTeller with state-of-the-art (SOTA) performance across nearly all benchmarks. To advance the field, we will openly release our complete model, entire dataset, and full codebase, enabling further research building upon our contributions.
[166] Glo-UMF: A Unified Multi-model Framework for Automated Morphometry of Glomerular Ultrastructural Characterization
Zhentai Zhang, Danyi Weng, Guibin Zhang, Xiang Chen, Kaixing Long, Jian Geng, Yanmeng Lu, Lei Zhang, Zhitao Zhou, Lei Cao
Main category: cs.CV
TL;DR: Glo-UMF is a unified multi-model framework that integrates segmentation, classification, and detection to quantify glomerular ultrastructural features from electron microscopy images, overcoming single-model limitations.
Details
Motivation: To address the inability of single-model architectures to perform simultaneous analysis of complex glomerular ultrastructures and overcome limitations of traditional grading methods.
Method: Developed three dedicated deep models: ultrastructure segmentation model, GFB region classification model, and EDD detection model. Integrated outputs through post-processing workflow with adaptive GFB cropping and measurement location screening.
Result: Trained on 372 EM images, achieved simultaneous quantification of GBM thickness, FPE degree, and EDD location. Strong agreement with pathological reports in 115 test cases across 9 renal types, with average processing time of 4.23±0.48 seconds per case on CPU.
Conclusion: Modular design allows flexible extensibility for joint quantification of multiple features. Framework ensures robust generalization and clinical applicability as an efficient auxiliary tool in glomerular pathological analysis.
Abstract: Background and Objective: To address the inability of single-model architectures to perform simultaneous analysis of complex glomerular ultrastructures, we developed Glo-UMF, a unified multi-model framework integrating segmentation, classification, and detection to systematically quantify key ultrastructural features. Methods: Glo-UMF decouples quantification tasks by constructing three dedicated deep models: an ultrastructure segmentation model, a glomerular filtration barrier (GFB) region classification model, and an electron-dense deposits (EDD) detection model. Their outputs are integrated through a post-processing workflow with adaptive GFB cropping and measurement location screening, enhancing measurement reliability and providing comprehensive quantitative results that overcome the limitations of traditional grading. Results: Trained on 372 electron microscopy images, Glo-UMF enables simultaneous quantification of glomerular basement membrane (GBM) thickness, the degree of foot process effacement (FPE), and EDD location. In 115 test cases spanning 9 renal pathological types, the automated quantification results showed strong agreement with pathological reports, with an average processing time of 4.23$\pm$0.48 seconds per case on a CPU environment. Conclusions: The modular design of Glo-UMF allows for flexible extensibility, supporting the joint quantification of multiple features. This framework ensures robust generalization and clinical applicability, demonstrating significant potential as an efficient auxiliary tool in glomerular pathological analysis.
[167] S$^2$-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models
Chubin Chen, Jiashu Zhu, Xiaokun Feng, Nisha Huang, Meiqi Wu, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, Xiu Li
Main category: cs.CV
TL;DR: S^2-Guidance improves upon Classifier-free Guidance by using stochastic sub-networks to refine predictions and avoid low-quality outputs in diffusion models.
Details
Motivation: CFG produces suboptimal results with semantic incoherence and low-quality outputs due to excessive reliance on imperfect predictions.
Method: Uses stochastic block-dropping during the forward process to create stochastic sub-networks that guide the model away from low-quality predictions.
Result: Superior performance on text-to-image and text-to-video generation tasks, consistently surpassing CFG and other advanced guidance strategies.
Conclusion: S^2-Guidance effectively addresses CFG’s limitations by leveraging the model’s own sub-networks to produce higher quality and more coherent outputs.
Abstract: Classifier-free Guidance (CFG) is a widely used technique in modern diffusion models for enhancing sample quality and prompt adherence. However, through an empirical analysis on Gaussian mixture modeling with a closed-form solution, we observe a discrepancy between the suboptimal results produced by CFG and the ground truth. The model’s excessive reliance on these suboptimal predictions often leads to semantic incoherence and low-quality outputs. To address this issue, we first empirically demonstrate that the model’s suboptimal predictions can be effectively refined using sub-networks of the model itself. Building on this insight, we propose S^2-Guidance, a novel method that leverages stochastic block-dropping during the forward process to construct stochastic sub-networks, effectively guiding the model away from potential low-quality predictions and toward high-quality outputs. Extensive qualitative and quantitative experiments on text-to-image and text-to-video generation tasks demonstrate that S^2-Guidance delivers superior performance, consistently surpassing CFG and other advanced guidance strategies. Our code will be released.
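The summary implies a guidance rule that steers away from a stochastic sub-network's prediction. One hypothetical form, layered on standard classifier-free guidance, is sketched below; the function name, the extra scale s, and the exact combination of terms are assumptions, and the actual S^2-Guidance formulation may differ.

```python
import torch

def s2_guidance_step(model, sub_model, x_t, t, cond, w: float = 7.5, s: float = 1.0):
    """Hypothetical guided prediction: standard CFG plus a term that pushes
    away from a stochastic sub-network's output, treated as a proxy for
    low-quality predictions."""
    eps_c = model(x_t, t, cond)
    eps_u = model(x_t, t, None)
    eps_sub = sub_model(x_t, t, cond)   # same weights, random blocks dropped
    return eps_u + w * (eps_c - eps_u) + s * (eps_c - eps_sub)

# Toy stand-ins so the sketch runs end to end.
f = lambda x, t, c: 0.9 * x
g = lambda x, t, c: 0.8 * x            # "sub-network" with blocks dropped
print(s2_guidance_step(f, g, torch.randn(1, 4, 8, 8), 10, cond="prompt").shape)
```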
[168] A Lightweight Convolution and Vision Transformer integrated model with Multi-scale Self-attention Mechanism
Yi Zhang, Lingxiao Wei, Bowei Zhang, Ziwei Liu, Kai Yi, Shu Hu
Main category: cs.CV
TL;DR: SAEViT is an efficient Vision Transformer that combines sparse attention and convolution blocks to reduce computational complexity while maintaining performance on vision tasks.
Details
Motivation: Vision Transformers have strong long-range dependency modeling but suffer from large model sizes and weak local feature modeling, limiting real-world applications.
Method: Proposes Sparsely Aggregated Attention module for adaptive sparse sampling and deconvolution recovery, Channel-Interactive Feed-Forward Network for better inter-channel information exchange, and hierarchical pyramid structure with depth-wise separable convolutional blocks.
Result: Achieves 76.3% and 79.6% Top-1 accuracy on ImageNet-1K with only 0.8 GFLOPs and 1.3 GFLOPs respectively.
Conclusion: SAEViT provides a lightweight solution for fundamental vision tasks by balancing computation efficiency and performance through sparse attention and enhanced convolutional features.
Abstract: Vision Transformer (ViT) has prevailed in computer vision tasks due to its strong long-range dependency modelling ability. However, its large model size and weak local feature modeling ability hinder its application in real scenarios. To balance computation efficiency and performance in downstream vision tasks, we propose an efficient ViT model with sparse attention (dubbed SAEViT) and convolution blocks. Specifically, a Sparsely Aggregated Attention (SAA) module is proposed to perform adaptive sparse sampling and recover the feature map via a deconvolution operation, which significantly reduces the computational complexity of attention operations. In addition, a Channel-Interactive Feed-Forward Network (CIFFN) layer is developed to enhance inter-channel information exchange through feature decomposition and redistribution, which mitigates the redundancy in traditional feed-forward networks (FFN). Finally, a hierarchical pyramid structure with embedded depth-wise separable convolutional blocks (DWSConv) is devised to further strengthen convolutional features. Extensive experiments on mainstream datasets show that SAEViT achieves Top-1 accuracies of 76.3% and 79.6% on the ImageNet-1K classification task with only 0.8 GFLOPs and 1.3 GFLOPs, respectively, demonstrating a lightweight solution for fundamental vision tasks.
[169] Deep Learning-Based Rock Particulate Classification Using Attention-Enhanced ConvNeXt
Anthony Amankwah, Chris Aldrich
Main category: cs.CV
TL;DR: Enhanced ConvNeXt model with self-attention and channel attention mechanisms for improved rock size classification.
Details
Motivation: Accurate rock size classification is vital for geotechnical engineering, mining, and resource management, as it influences operational efficiency and safety.
Method: Proposed CNSCA model based on ConvNeXt architecture, augmented with self-attention for long-range spatial dependencies and channel attention for informative feature channels.
Result: Model outperformed three strong baselines on a rock size classification dataset, showing significant improvement in classification accuracy and robustness.
Conclusion: Incorporation of attention mechanisms enhances the model's capability for fine-grained classification tasks involving natural textures like rocks.
Abstract: Accurate classification of rock sizes is a vital component in geotechnical engineering, mining, and resource management, where precise estimation influences operational efficiency and safety. In this paper, we propose an enhanced deep learning model based on the ConvNeXt architecture, augmented with both self-attention and channel attention mechanisms. Building upon the foundation of ConvNeXt, our proposed model, termed CNSCA, introduces self-attention to capture long-range spatial dependencies and channel attention to emphasize informative feature channels. This hybrid design enables the model to effectively capture both fine-grained local patterns and broader contextual relationships within rock imagery, leading to improved classification accuracy and robustness. We evaluate our model on a rock size classification dataset and compare it against three strong baselines. The results demonstrate that the incorporation of attention mechanisms significantly enhances the model's capability for fine-grained classification tasks involving natural textures like rocks.
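Channel attention of this kind commonly follows the squeeze-and-excitation pattern: pool each channel to a scalar, pass it through a small bottleneck MLP, and rescale the channels. A minimal version, assuming SE-style gating (the paper's exact attention blocks may differ):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: global-average-pool each channel, gate it
    through a bottleneck MLP with a sigmoid, then rescale the feature map."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))        # (B, C) channel weights in [0, 1]
        return x * w[:, :, None, None]

feat = torch.randn(2, 768, 7, 7)               # e.g. a ConvNeXt stage output
print(ChannelAttention(768)(feat).shape)       # torch.Size([2, 768, 7, 7])
```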
[170] Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vector Drawings
Feiwei Qin, Shichao Lu, Junhao Hou, Changmiao Wang, Meie Fang, Ligang Liu
Main category: cs.CV
TL;DR: Drawing2CAD is a framework that converts 2D engineering drawings into parametric CAD models using sequence-to-sequence learning, preserving geometric precision and design intent.
Details
Motivation: Traditional CAD generative modeling diverges from industrial workflows that start with 2D engineering drawings. The automatic generation of parametric CAD models from 2D vector drawings remains underexplored despite being critical for engineering design.
Method: Proposes a framework with three key components: network-friendly vector primitive representation, dual-decoder transformer architecture that decouples command type and parameter generation, and soft target distribution loss function for CAD parameter flexibility.
Result: Created CAD-VGDrawing dataset of paired engineering drawings and parametric CAD models. Conducted thorough experiments demonstrating the effectiveness of the method.
Conclusion: Drawing2CAD successfully bridges the gap between 2D engineering drawings and parametric CAD model generation, preserving geometric precision and design intent throughout the transformation process.
Abstract: Computer-Aided Design (CAD) generative modeling is driving significant innovations across industrial applications. Recent works have shown remarkable progress in creating solid models from various inputs such as point clouds, meshes, and text descriptions. However, these methods fundamentally diverge from traditional industrial workflows that begin with 2D engineering drawings. The automatic generation of parametric CAD models from these 2D vector drawings remains underexplored despite being a critical step in engineering design. To address this gap, our key insight is to reframe CAD generation as a sequence-to-sequence learning problem where vector drawing primitives directly inform the generation of parametric CAD operations, preserving geometric precision and design intent throughout the transformation process. We propose Drawing2CAD, a framework with three key technical components: a network-friendly vector primitive representation that preserves precise geometric information, a dual-decoder transformer architecture that decouples command type and parameter generation while maintaining precise correspondence, and a soft target distribution loss function accommodating inherent flexibility in CAD parameters. To train and evaluate Drawing2CAD, we create CAD-VGDrawing, a dataset of paired engineering drawings and parametric CAD models, and conduct thorough experiments to demonstrate the effectiveness of our method. Code and dataset are available at https://github.com/lllssc/Drawing2CAD.
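The soft target distribution loss can be read as cross-entropy against a smoothed target over quantized parameter bins, so near-miss parameter predictions are penalized more gently than distant ones. A sketch under that assumption (the bin count and Gaussian smoothing width are illustrative, not the paper's choices):

```python
import torch
import torch.nn.functional as F

def soft_target_loss(logits, target_bins, n_bins: int = 256, sigma: float = 2.0):
    """Cross-entropy against a Gaussian-smeared target distribution over
    quantized CAD parameter bins, accommodating tolerance in parameters."""
    bins = torch.arange(n_bins, dtype=torch.float32)
    soft = torch.exp(-(bins[None, :] - target_bins[:, None].float()) ** 2
                     / (2 * sigma ** 2))
    soft = soft / soft.sum(dim=1, keepdim=True)        # normalize per sample
    return -(soft * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

loss = soft_target_loss(torch.randn(4, 256), torch.tensor([10, 100, 128, 200]))
print(loss.item())
```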
[171] Enhancing Automatic Modulation Recognition With a Reconstruction-Driven Vision Transformer Under Limited Labels
Hossein Ahmadi, Banafsheh Saffari, Sajjad Emdadi Mahdimahalleh, Mohammad Esmaeil Safari, Aria Ahmadi
Main category: cs.CV
TL;DR: A unified Vision Transformer framework for automatic modulation recognition that combines supervised, self-supervised, and reconstruction objectives to achieve robust performance with limited labeled data.
Details
Motivation: Existing AMR solutions require large labeled datasets or complex multi-stage training, limiting scalability and generalization in practical applications.
Method: Uses ViT encoder with lightweight convolutional decoder and linear classifier; reconstruction branch maps augmented signals back to originals to preserve I/Q structure; combines supervised, self-supervised, and reconstruction objectives.
Result: Outperforms supervised CNN and ViT baselines in low-label regimes on RML2018.01A dataset; achieves ResNet-level accuracy with only 15-20% labeled data; maintains strong performance across varying SNR levels.
Conclusion: Provides a simple, generalizable, and label-efficient solution for automatic modulation recognition that addresses practical limitations of existing approaches.
Abstract: Automatic modulation recognition (AMR) is critical for cognitive radio, spectrum monitoring, and secure wireless communication. However, existing solutions often rely on large labeled datasets or multi-stage training pipelines, which limit scalability and generalization in practice. We propose a unified Vision Transformer (ViT) framework that integrates supervised, self-supervised, and reconstruction objectives. The model combines a ViT encoder, a lightweight convolutional decoder, and a linear classifier; the reconstruction branch maps augmented signals back to their originals, anchoring the encoder to fine-grained I/Q structure. This strategy promotes robust, discriminative feature learning during pretraining, while partial label supervision in fine-tuning enables effective classification with limited labels. On the RML2018.01A dataset, our approach outperforms supervised CNN and ViT baselines in low-label regimes, approaches ResNet-level accuracy with only 15-20% labeled data, and maintains strong performance across varying SNR levels. Overall, the framework provides a simple, generalizable, and label-efficient solution for AMR.
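The three-branch design (shared encoder, reconstruction decoder, linear classifier) reduces to a simple joint loss. The sketch below uses an MLP stand-in for the ViT encoder and a linear stand-in for the conv decoder; shapes, the noise augmentation, and the class count are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconClassNet(nn.Module):
    """Shared encoder feeding both a decoder (reconstruct the clean I/Q signal
    from an augmented view) and a linear classifier."""
    def __init__(self, in_dim=2 * 1024, emb=256, n_classes=24):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, emb), nn.GELU())
        self.decoder = nn.Linear(emb, in_dim)       # stand-in for the conv decoder
        self.classifier = nn.Linear(emb, n_classes)

    def forward(self, x_aug):
        z = self.encoder(x_aug)
        return self.decoder(z), self.classifier(z)

model = ReconClassNet()
x_clean = torch.randn(8, 2, 1024)                     # I/Q frames
x_aug = x_clean + 0.1 * torch.randn_like(x_clean)     # augmentation
recon, logits = model(x_aug)
loss = F.mse_loss(recon, x_clean.flatten(1)) \
     + F.cross_entropy(logits, torch.randint(0, 24, (8,)))
loss.backward()
```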
[172] Deep Learning Framework for Early Detection of Pancreatic Cancer Using Multi-Modal Medical Imaging Analysis
Dennis Slobodzian, Amir Kordijazi
Main category: cs.CV
TL;DR: Deep learning framework using autofluorescence and SHG imaging achieves over 90% accuracy for early pancreatic cancer detection, outperforming manual methods.
Details
Motivation: PDAC has extremely low survival rates due to late detection, creating an urgent need for early diagnostic methods using advanced imaging techniques.
Method: Developed a specialized neural network analyzing 40 patient samples with dual-modality imaging. Evaluated six architectures (CNNs vs. ViTs), used a modified ResNet with frozen pre-trained layers and class-weighted training to address dataset limitations.
Result: Achieved over 90% accuracy in distinguishing normal, fibrotic, and cancerous tissue, significantly improving upon current manual analysis methods.
Conclusion: Establishes robust automated PDAC detection pipeline with clinical deployment potential, providing foundation for expansion to other cancers and insights for handling limited medical imaging datasets.
Abstract: Pancreatic ductal adenocarcinoma (PDAC) remains one of the most lethal forms of cancer, with a five-year survival rate below 10%, primarily due to late detection. This research develops and validates a deep learning framework for early PDAC detection through analysis of dual-modality imaging: autofluorescence and second harmonic generation (SHG). We analyzed 40 unique patient samples to create a specialized neural network capable of distinguishing between normal, fibrotic, and cancerous tissue. Our methodology evaluated six distinct deep learning architectures, comparing traditional Convolutional Neural Networks (CNNs) with modern Vision Transformers (ViTs). Through systematic experimentation, we identified and overcame significant challenges in medical image analysis, including limited dataset size and class imbalance. The final optimized framework, based on a modified ResNet architecture with frozen pre-trained layers and class-weighted training, achieved over 90% accuracy in cancer detection. This represents a significant improvement over current manual analysis methods and demonstrates potential for clinical deployment. This work establishes a robust pipeline for automated PDAC detection that can augment pathologists' capabilities while providing a foundation for future expansion to other cancer types. The developed methodology also offers valuable insights for applying deep learning to limited-size medical imaging datasets, a common challenge in clinical applications.
[173] Bidirectional Sparse Attention for Faster Video Diffusion Training
Chenlu Zhan, Wen Li, Chuyu Shen, Jun Zhang, Suhui Wu, Hao Zhang
Main category: cs.CV
TL;DR: BSA framework uses bidirectional sparse attention to dramatically improve video DiT efficiency by dynamically sparsifying both queries and key-value pairs, achieving up to 20x FLOPs reduction while maintaining generative quality.
Details
Motivation: Video diffusion Transformer models face prohibitive computational costs due to quadratic complexity of full attention when generating high-resolution, long-duration videos, with inefficiencies from excessive and redundant computation.
Method: Bidirectional Sparse Attention framework that dynamically sparsifies both queries (via semantic similarity and dynamic spatial-time training) and key-value pairs (using statistical dynamic threshold to retain salient blocks).
Result: Achieves up to 20x FLOPs reduction, 17.79x faster attention training while preserving or surpassing full attention’s generative quality across long sequences.
Conclusion: BSA successfully overcomes computational bottlenecks in video DiT models by efficiently handling attention sparsity, making high-resolution long-video generation more practical.
Abstract: Video diffusion Transformer (DiT) models excel in generative quality but hit major computational bottlenecks when producing high-resolution, long-duration videos. The quadratic complexity of full attention leads to prohibitively high training and inference costs. Full attention inefficiency stems from two key challenges: excessive computation due to the inherent sparsity of Queries and Key-Value pairs, and redundant computation as fixed sparse patterns fail to leverage DiT’s dynamic attention. To overcome this limitation, we propose a Bidirectional Sparse Attention (BSA) framework for faster video DiT training, the first to dynamically sparsify both Queries and Key-Value pairs within 3D full attention, thereby substantially improving training and inference efficiency. BSA addresses these issues through two key components. Query sparsity is optimized by selecting the most informative query tokens via semantic similarity and with a dynamic spatial-time training strategy, while KV sparsity is achieved by computing a statistical dynamic threshold to retain only the most salient KV blocks for computation. Extensive experiments demonstrate that BSA significantly accelerates DiT training across long sequences, reducing FLOPs by up to 20x and achieving 17.79x faster attention training, while preserving or even surpassing the generative quality of full attention.
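The KV side of BSA, retaining only blocks above a statistical dynamic threshold, can be sketched directly: keep a block of keys/values when its mean attention mass exceeds mean + tau * std over blocks, always retaining the strongest block so the softmax stays well-defined. Block size, tau, and the threshold form are placeholders for whatever the paper actually uses.

```python
import torch

def sparse_block_attention(q, k, v, block: int = 64, tau: float = 1.0):
    """Simplified stand-in for dynamic KV sparsification: score KV blocks by
    their mean attention mass and mask out blocks below a statistical
    threshold before the softmax."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5         # (B, Nq, Nk)
    blk = scores.view(*scores.shape[:-1], -1, block).mean(-1)     # per-block mass
    thresh = blk.mean(-1, keepdim=True) + tau * blk.std(-1, keepdim=True)
    keep = blk >= thresh
    keep.scatter_(-1, blk.argmax(-1, keepdim=True), True)         # keep top block
    mask = keep.repeat_interleave(block, dim=-1)                  # back to (B, Nq, Nk)
    return torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1) @ v

out = sparse_block_attention(torch.randn(1, 256, 64),
                             torch.randn(1, 1024, 64), torch.randn(1, 1024, 64))
print(out.shape)  # torch.Size([1, 256, 64])
```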
[174] LiDAR-BIND-T: Improved and Temporally Consistent Sensor Modality Translation and Fusion for Robotic Applications
Niels Balemans, Ali Anwar, Jan Steckel, Siegfried Mercelis
Main category: cs.CV
TL;DR: Extends LiDAR-BIND with temporal consistency mechanisms for multi-modal sensor fusion, improving SLAM performance through temporal alignment and motion-aware losses.
Details
Motivation: To enhance temporal stability and coherence in multi-modal sensor fusion (radar, sonar to LiDAR) for improved SLAM robustness and performance.
Method: Introduces temporal embedding similarity, motion-aligned transformation loss, and windowed temporal fusion with updated architecture for spatial structure preservation.
Result: Demonstrates improved temporal/spatial coherence, lower trajectory error, better occupancy map accuracy in SLAM, with new temporal quality metrics (FVMD-based).
Conclusion: LiDAR-BIND-T maintains plug-and-play fusion while significantly enhancing temporal stability, resulting in better downstream SLAM performance.
Abstract: This paper extends LiDAR-BIND, a modular multi-modal fusion framework that binds heterogeneous sensors (radar, sonar) to a LiDAR-defined latent space, with mechanisms that explicitly enforce temporal consistency. We introduce three contributions: (i) temporal embedding similarity that aligns consecutive latent representations, (ii) a motion-aligned transformation loss that matches displacement between predictions and ground truth LiDAR, and (iii) windowed temporal fusion using a specialised temporal module. We further update the model architecture to better preserve spatial structure. Evaluations on radar/sonar-to-LiDAR translation demonstrate improved temporal and spatial coherence, yielding lower absolute trajectory error and better occupancy map accuracy in Cartographer-based SLAM (Simultaneous Localisation and Mapping). We propose different metrics based on the Fréchet Video Motion Distance (FVMD) and a correlation-peak distance metric, providing practical temporal quality indicators to evaluate SLAM performance. The proposed temporal LiDAR-BIND, or LiDAR-BIND-T, maintains plug-and-play modality fusion while substantially enhancing temporal stability, resulting in improved robustness and performance for downstream SLAM.
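Two of the three contributions are loss terms, which are easy to sketch: a cosine-similarity penalty between consecutive latent embeddings and an L1 motion-alignment term between predicted and ground-truth displacement. The weighting lam and all tensor shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(latents, pred_disp, gt_disp, lam: float = 0.5):
    """Sketch of two temporal terms: (i) cosine similarity between consecutive
    latent embeddings, (ii) L1 match between predicted and ground-truth
    LiDAR displacement (motion alignment)."""
    z_t, z_next = latents[:, :-1], latents[:, 1:]
    sim = F.cosine_similarity(z_t, z_next, dim=-1)       # (B, T-1)
    embed_loss = (1 - sim).mean()
    motion_loss = F.l1_loss(pred_disp, gt_disp)
    return embed_loss + lam * motion_loss

loss = temporal_consistency_loss(torch.randn(2, 8, 128),
                                 torch.randn(2, 7, 3), torch.randn(2, 7, 3))
print(loss.item())
```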
[175] Missing Fine Details in Images: Last Seen in High Frequencies
Tejaswini Medi, Hsien-Yi Wang, Arianna Rampini, Margret Keuper
Main category: cs.CV
TL;DR: The paper identifies that current latent tokenizers in generative models prioritize low-frequency reconstruction, causing loss of high-frequency details and visual artifacts. They propose a frequency-aware VAE framework using wavelet decomposition to separately optimize low and high frequencies, resulting in sharper image generation.
Details
Motivation: Existing latent generative models suffer from over-smoothed outputs and lack realism in textured regions due to inherent bias toward low-frequency information during optimization, which diminishes perceptual quality.
Method: Propose a wavelet-based frequency-aware variational autoencoder (FA-VAE) framework that explicitly decouples optimization of low- and high-frequency components, enabling improved texture reconstruction while preserving global structure.
Result: The frequency-preserving latent embeddings integrated into a state-of-the-art latent diffusion model produce sharper and more realistic image generation, bridging the fidelity gap in current latent tokenizers.
Conclusion: Frequency-aware optimization is crucial for realistic image synthesis, with broader implications for content creation, neural rendering, and medical imaging applications.
Abstract: Latent generative models have shown remarkable progress in high-fidelity image synthesis, typically using a two-stage training process that involves compressing images into latent embeddings via learned tokenizers in the first stage. The quality of generation strongly depends on how expressive and well-optimized these latent embeddings are. While various methods have been proposed to learn effective latent representations, generated images often lack realism, particularly in textured regions with sharp transitions, due to loss of fine details governed by high frequencies. We conduct a detailed frequency decomposition of existing state-of-the-art (SOTA) latent tokenizers and show that conventional objectives inherently prioritize low-frequency reconstruction, often at the expense of high-frequency fidelity. Our analysis reveals these latent tokenizers exhibit a bias toward low-frequency information during optimization, leading to over-smoothed outputs and visual artifacts that diminish perceptual quality. To address this, we propose a wavelet-based, frequency-aware variational autoencoder (FA-VAE) framework that explicitly decouples the optimization of low- and high-frequency components. This decoupling enables improved reconstruction of fine textures while preserving global structure. Moreover, we integrate our frequency-preserving latent embeddings into a SOTA latent diffusion model, resulting in sharper and more realistic image generation. Our approach bridges the fidelity gap in current latent tokenizers and emphasizes the importance of frequency-aware optimization for realistic image synthesis, with broader implications for applications in content creation, neural rendering, and medical imaging.
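The decoupling can be sketched with a one-level Haar transform: split the reconstruction error into a low-frequency band and three detail bands, and weight them separately. The Haar choice and the weight w_high are assumptions; the paper's wavelet family and weighting scheme may differ.

```python
import torch
import torch.nn.functional as F

def haar_bands(x):
    """One-level 2D Haar split into a low-frequency approximation and three
    high-frequency detail bands (horizontal, vertical, diagonal)."""
    a, b = x[..., ::2, ::2], x[..., ::2, 1::2]
    c, d = x[..., 1::2, ::2], x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 4
    lh, hl, hh = (a - b + c - d) / 4, (a + b - c - d) / 4, (a - b - c + d) / 4
    return ll, torch.stack([lh, hl, hh], dim=1)

def frequency_aware_recon_loss(x_hat, x, w_high: float = 2.0):
    """Weight low- and high-frequency reconstruction errors separately, so
    fine textures are not traded away for smooth global structure."""
    ll_hat, high_hat = haar_bands(x_hat)
    ll, high = haar_bands(x)
    return F.mse_loss(ll_hat, ll) + w_high * F.mse_loss(high_hat, high)

x = torch.rand(2, 3, 64, 64)
print(frequency_aware_recon_loss(x + 0.05 * torch.randn_like(x), x).item())
```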
[176] TinyDef-DETR: A DETR-based Framework for Defect Detection in Transmission Lines from UAV Imagery
Jiaming Cui, Shuai Zhou, Feng Shen
Main category: cs.CV
TL;DR: TinyDef-DETR is a DETR-based framework for accurate and efficient detection of small transmission line defects from UAV imagery, using edge-enhanced backbone, detail-preserving downsampling, multi-scale attention, and improved regression loss.
Details
Motivation: Automated defect detection from UAV imagery is challenging due to small defect size, ambiguity, and complex backgrounds that conventional detectors struggle with.
Method: Integrates four components: edge-enhanced ResNet backbone, stride-free space-to-depth module, cross-stage dual-domain multi-scale attention mechanism, and Focaler-Wise-SIoU regression loss.
Result: Achieves superior detection performance and strong generalization capability on public and real-world datasets while maintaining modest computational overhead.
Conclusion: TinyDef-DETR is an effective solution for UAV-based transmission line defect detection, particularly for small and ambiguous targets, offering both accuracy and efficiency.
Abstract: Automated defect detection from UAV imagery of transmission lines is a challenging task due to the small size, ambiguity, and complex backgrounds of defects. This paper proposes TinyDef-DETR, a DETR-based framework designed to achieve accurate and efficient detection of transmission line defects from UAV-acquired images. The model integrates four major components: an edge-enhanced ResNet backbone to strengthen boundary-sensitive representations, a stride-free space-to-depth module to enable detail-preserving downsampling, a cross-stage dual-domain multi-scale attention mechanism to jointly model global context and local cues, and a Focaler-Wise-SIoU regression loss to improve the localization of small and difficult targets. Together, these designs effectively mitigate the limitations of conventional detectors. Extensive experiments on both public and real-world datasets demonstrate that TinyDef-DETR achieves superior detection performance and strong generalization capability, while maintaining modest computational overhead. The accuracy and efficiency of TinyDef-DETR make it a suitable method for UAV-based transmission line defect detection, particularly in scenarios involving small and ambiguous targets.
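The stride-free space-to-depth module maps cleanly onto PyTorch's PixelUnshuffle: fold each 2x2 spatial neighborhood into channels and then mix with a 1x1 convolution, so resolution is halved without discarding pixels the way strided convolution or pooling does. Channel counts here are illustrative.

```python
import torch
import torch.nn as nn

# Stride-free downsampling: rearrange each 2x2 neighborhood into channels
# (PixelUnshuffle), then mix the stacked channels with a 1x1 conv.
space_to_depth = nn.Sequential(
    nn.PixelUnshuffle(downscale_factor=2),   # (B, C, H, W) -> (B, 4C, H/2, W/2)
    nn.Conv2d(4 * 64, 128, kernel_size=1),
)
x = torch.randn(1, 64, 80, 80)
print(space_to_depth(x).shape)  # torch.Size([1, 128, 40, 40])
```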
[177] Focusing by Contrastive Attention: Enhancing VLMs’ Visual Reasoning
Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Xuanshan Zhou, Jiayu Yao, Jiafeng Guo, Xueqi Cheng
Main category: cs.CV
TL;DR: CARVE is a training-free method that improves VLM performance in complex visual environments by using attention contrast to extract task-relevant signals from visual noise.
Details
Motivation: VLMs perform poorly in complex visual environments, and existing enhancement methods require additional training or external tools while overlooking VLMs' innate attention capabilities.
Method: Analyze VLM attention patterns, discover correlation between visual complexity and attention entropy, and propose CARVE, a method that uses contrast between general and task-specific query attention maps to decompose visual signals at the pixel level.
Result: CARVE consistently enhances performance with up to 75% improvement on open-source models without requiring additional training.
Conclusion: The work provides insights into visual complexity and attention mechanisms, offering an efficient training-free pathway for improving visual reasoning through attention contrasting.
Abstract: Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. While existing enhancement approaches require additional training, rely on external segmentation tools, or operate at coarse-grained levels, they overlook the innate ability within VLMs. To bridge this gap, we investigate VLMs’ attention patterns and discover that: (1) visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance; (2) attention progressively refines from global scanning in shallow layers to focused convergence in deeper layers, with convergence degree determined by visual complexity. (3) Theoretically, we prove that the contrast of attention maps between general queries and task-specific queries enables the decomposition of visual signal into semantic signals and visual noise components. Building on these insights, we propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level. Extensive experiments demonstrate that CARVE consistently enhances performance, achieving up to 75% improvement on open-source models. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning with contrasting attention.
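The attention-contrast idea can be sketched as a pixel-level difference between the normalized attention that a general query and the task query place on image patches, keeping the task-specific residue while the shared component cancels out. The exact contrast operator in CARVE is not given in the summary; the subtraction below is one hypothetical choice.

```python
import torch

def contrastive_attention(attn_general, attn_task, eps: float = 1e-6):
    """Hypothetical pixel-level contrast: normalize both attention maps over
    image patches, then keep the attention mass specific to the task query
    (semantic signal); the shared component (visual noise) cancels out."""
    g = attn_general / (attn_general.sum(dim=-1, keepdim=True) + eps)
    t = attn_task / (attn_task.sum(dim=-1, keepdim=True) + eps)
    return (t - g).clamp_min(0)       # task-specific relevance per patch

mask = contrastive_attention(torch.rand(1, 576), torch.rand(1, 576))
print(mask.shape, float(mask.max()))
```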
[178] Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training
Ruicheng Zhang, Jun Zhou, Zunnan Xu, Zihao Liu, Jiehui Huang, Mingyang Zhang, Yu Sun, Xiu Li
Main category: cs.CV
TL;DR: Zo3T is a zero-shot test-time training framework for trajectory-guided image-to-video generation that uses 3D-aware kinematic projection, dynamic LoRA adapters, and guidance field rectification to achieve realistic motion without fine-tuning.
Details
Motivation: Existing methods for trajectory-guided I2V generation either require computationally expensive fine-tuning on scarce datasets or produce unrealistic motion by neglecting 3D perspective and causing misalignment between manipulated latents and noise predictions.
Method: Three core innovations: 1) 3D-Aware Kinematic Projection using scene depth for perspective-correct transformations, 2) Trajectory-Guided Test-Time LoRA with dynamic ephemeral adapters and a regional feature consistency loss, 3) Guidance Field Rectification with a one-step lookahead strategy to refine the denoising path.
Result: Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation, demonstrating superior performance over existing training-based and zero-shot approaches.
Conclusion: The framework effectively addresses challenges in zero-shot trajectory-guided video generation by ensuring generative fidelity, on-manifold adherence, and realistic 3D motion without requiring fine-tuning.
Abstract: Trajectory-Guided image-to-video (I2V) generation aims to synthesize videos that adhere to user-specified motion instructions. Existing methods typically rely on computationally expensive fine-tuning on scarce annotated datasets. Although some zero-shot methods attempt trajectory control in the latent space, they may yield unrealistic motion by neglecting 3D perspective and creating a misalignment between the manipulated latents and the network’s noise predictions. To address these challenges, we introduce Zo3T, a novel zero-shot test-time-training framework for trajectory-guided generation with three core innovations: First, we incorporate a 3D-Aware Kinematic Projection, leveraging inferred scene depth to derive perspective-correct affine transformations for target regions. Second, we introduce Trajectory-Guided Test-Time LoRA, a mechanism that dynamically injects and optimizes ephemeral LoRA adapters into the denoising network alongside the latent state. Driven by a regional feature consistency loss, this co-adaptation effectively enforces motion constraints while allowing the pre-trained model to locally adapt its internal representations to the manipulated latent, thereby ensuring generative fidelity and on-manifold adherence. Finally, we develop Guidance Field Rectification, which refines the denoising evolutionary path by optimizing the conditional guidance field through a one-step lookahead strategy, ensuring efficient generative progression towards the target trajectory. Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation, demonstrating superior performance over existing training-based and zero-shot approaches.
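As an illustration of the 3D-aware projection idea, the following sketch scales a target region by the ratio of scene depths along the trajectory, assuming a simple pinhole model in which apparent size is inversely proportional to depth; the function name and box convention are hypothetical:

```python
import numpy as np

def perspective_affine(bbox, depth_map, p_src, p_dst):
    """Move a target region along a user trajectory with a
    perspective-correct scale derived from scene depth (sketch)."""
    z_src = float(depth_map[p_src[1], p_src[0]])
    z_dst = float(depth_map[p_dst[1], p_dst[0]])
    s = z_src / max(z_dst, 1e-6)  # moving closer -> region appears larger
    x, y, w, h = bbox
    cx = x + w / 2 + (p_dst[0] - p_src[0])
    cy = y + h / 2 + (p_dst[1] - p_src[1])
    return (cx - w * s / 2, cy - h * s / 2, w * s, h * s)

# Hypothetical usage: depth halves along the path, so the box doubles.
depth = np.full((480, 640), 4.0)
depth[240, 400] = 2.0
print(perspective_affine((100, 100, 50, 50), depth, (120, 120), (400, 240)))
```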
[179] 3D and 4D World Modeling: A Survey
Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, Junyuan Deng, Kaiwen Zhang, Yang Wu, Tianyi Yan, Shenyuan Gao, Song Wang, Linfeng Li, Liang Pan, Yong Liu, Jianke Zhu, Wei Tsang Ooi, Steven C. H. Hoi, Ziwei Liu
Main category: cs.CV
TL;DR: This survey provides the first comprehensive review of 3D and 4D world modeling, establishing definitions, taxonomy, and evaluation metrics for this emerging field.
Details
Motivation: Prior work has focused on 2D image/video generation while overlooking 3D/4D representations, and the absence of standardized definitions has led to fragmented literature claims.
Method: The authors establish precise definitions, introduce a structured taxonomy (VideoGen, OccGen, LiDARGen approaches), and systematically summarize datasets and evaluation metrics for 3D/4D settings.
Result: The survey provides a coherent foundational reference with systematic literature summary available at https://github.com/worldbench/survey
Conclusion: This work addresses gaps in 3D/4D world modeling research, provides standardization, and highlights practical applications and future research directions to advance the field.
Abstract: World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. While prior work largely emphasizes generative methods for 2D image and video data, it overlooks the rapidly growing body of work that leverages native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds for large-scale scene modeling. At the same time, the absence of a standardized definition and taxonomy for “world models” has led to fragmented and sometimes inconsistent claims in the literature. This survey addresses these gaps by presenting the first comprehensive review explicitly dedicated to 3D and 4D world modeling and generation. We establish precise definitions, introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics tailored to 3D/4D settings. We further discuss practical applications, identify open challenges, and highlight promising research directions, aiming to provide a coherent and foundational reference for advancing the field. A systematic summary of existing literature is available at https://github.com/worldbench/survey
[180] MESH – Understanding Videos Like Human: Measuring Hallucinations in Large Video Models
Garry Yang, Zizhe Chen, Man Hon Wong, Haoyu Lei, Yongqiang Chen, Zhenguo Li, Kaiwen Zhou, James Cheng
Main category: cs.CV
TL;DR: MESH is a new benchmark for evaluating hallucinations in Large Video Models that uses a bottom-up QA framework with target/trap instances to systematically assess object recognition, feature details, and action alignment in videos.
Details
Motivation: Current video hallucination benchmarks rely on manual categorization and neglect human perceptual processes, while LVMs suffer from producing inaccurate descriptions despite their semantic capabilities.
Method: Uses a Question-Answering framework with binary and multi-choice formats incorporating target and trap instances, following a bottom-up evaluation of basic objects, coarse-to-fine subject features, and subject-action pairs.
Result: LVMs excel at recognizing basic objects and features but show significantly increased hallucination susceptibility when handling fine details or aligning multiple actions involving various subjects in longer videos.
Conclusion: MESH provides an effective and comprehensive approach for systematically identifying hallucinations in video models, aligning with human video understanding processes.
Abstract: Large Video Models (LVMs) build on the semantic capabilities of Large Language Models (LLMs) and vision modules by integrating temporal information to better understand dynamic video content. Despite their progress, LVMs are prone to hallucinations, producing inaccurate or irrelevant descriptions. Current benchmarks for video hallucination depend heavily on manual categorization of video content, neglecting the perception-based processes through which humans naturally interpret videos. We introduce MESH, a benchmark designed to evaluate hallucinations in LVMs systematically. MESH uses a Question-Answering framework with binary and multi-choice formats incorporating target and trap instances. It follows a bottom-up approach, evaluating basic objects, coarse-to-fine subject features, and subject-action pairs, aligning with human video understanding. We demonstrate that MESH offers an effective and comprehensive approach for identifying hallucinations in videos. Our evaluations show that while LVMs excel at recognizing basic objects and features, their susceptibility to hallucinations increases markedly when handling fine details or aligning multiple actions involving various subjects in longer videos.
[181] VRAE: Vertical Residual Autoencoder for License Plate Denoising and Deblurring
Cuong Nguyen, Dung T. Tran, Hong Nguyen, Xuan-Vu Phan, Nam-Phong Nguyen
Main category: cs.CV
TL;DR: Proposes Vertical Residual Autoencoder (VRAE) for real-time license plate image enhancement in traffic surveillance, achieving significant performance improvements over existing methods with minimal parameter increase.
Details
Motivation: Vehicle images in traffic surveillance often suffer from noise and blur due to adverse weather, poor lighting, or high-speed motion, which severely degrades license plate recognition accuracy, especially when plates occupy small regions in images.
Method: VRAE architecture with an enhancement strategy using auxiliary blocks that inject input-aware features at each encoding stage to guide representation learning and preserve general information better than conventional autoencoders.
Result: Outperforms AE, GAN, and Flow-Based approaches: improves PSNR by ~20%, reduces NMSE by ~50%, and enhances SSIM by 1% compared to AE at the same depth, with only a ~1% parameter increase.
Conclusion: VRAE effectively enhances degraded license plate images in real-time traffic surveillance, significantly improving recognition performance with minimal computational overhead.
Abstract: In real-world traffic surveillance, vehicle images captured under adverse weather, poor lighting, or high-speed motion often suffer from severe noise and blur. Such degradations significantly reduce the accuracy of license plate recognition systems, especially when the plate occupies only a small region within the full vehicle image. Restoring these degraded images in a fast, real-time manner is thus a crucial pre-processing step to enhance recognition performance. In this work, we propose a Vertical Residual Autoencoder (VRAE) architecture designed for the image enhancement task in traffic surveillance. The method incorporates an enhancement strategy that employs an auxiliary block, which injects input-aware features at each encoding stage to guide the representation learning process, enabling better general information preservation throughout the network compared to conventional autoencoders. Experiments on a vehicle image dataset with visible license plates demonstrate that our method consistently outperforms Autoencoder (AE), Generative Adversarial Network (GAN), and Flow-Based (FB) approaches. Compared with AE at the same depth, it improves PSNR by about 20%, reduces NMSE by around 50%, and enhances SSIM by 1%, while requiring only a marginal increase of roughly 1% in parameters.
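A minimal PyTorch sketch of the auxiliary-block idea, in which each encoding stage additionally receives an input-aware projection of the raw image; the channel widths and the 1x1 projection are assumptions, not the authors' exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VerticalResidualEncoder(nn.Module):
    """Encoder whose stages each receive input-aware auxiliary features,
    a downsampled projection of the raw input (illustrative sketch)."""
    def __init__(self, chs=(3, 32, 64, 128)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
            for c_in, c_out in zip(chs[:-1], chs[1:])
        )
        # Auxiliary blocks: project the raw input to each stage's width.
        self.aux = nn.ModuleList(nn.Conv2d(chs[0], c, 1) for c in chs[1:])

    def forward(self, x):
        h = x
        for stage, aux in zip(self.stages, self.aux):
            h = F.relu(stage(h))
            x_small = F.interpolate(x, size=h.shape[-2:], mode="bilinear",
                                    align_corners=False)
            h = h + aux(x_small)  # inject input-aware features at this depth
        return h

# Usage: encode a batch of 64x64 RGB crops.
feats = VerticalResidualEncoder()(torch.randn(2, 3, 64, 64))
```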
cs.AI
[182] An Interval Type-2 Version of Bayes Theorem Derived from Interval Probability Range Estimates Provided by Subject Matter Experts
John T. Rickard, William A. Dembski, James Rickards
Main category: cs.AI
TL;DR: Develops an interval type-2 version of Bayes Theorem to handle uncertainty in input probabilities provided as intervals by subject matter experts, with a conservative method to avoid inconsistencies and a flexible encoding algorithm.
Details
Motivation: Traditional Bayesian inference assumes precise input values, but real-world applications often rely on interval range estimates from experts, making precise inputs unrealistic.
Method: Develops an IT2 version of Bayes Theorem with a conservative method to avoid input inconsistencies, and creates a novel algorithm to encode SME-provided intervals into IT2 fuzzy membership functions.
Result: Provides a framework that extends Bayesian inference to handle interval-based uncertainty while maintaining validity of output results.
Conclusion: The proposed IT2 Bayes Theorem and encoding algorithm enable more realistic Bayesian analysis using expert-provided interval estimates while preventing invalid outputs from potential input inconsistencies.
Abstract: Bayesian inference is widely used in many different fields to test hypotheses against observations. In most such applications, an assumption is made of precise input values to produce a precise output value. However, this is unrealistic for real-world applications. Often the best available information from subject matter experts (SMEs) in a given field is interval range estimates of the input probabilities involved in Bayes Theorem. This paper provides two key contributions to extend Bayes Theorem to an interval type-2 (IT2) version. First, we develop an IT2 version of Bayes Theorem that uses a novel and conservative method to avoid potential inconsistencies in the input IT2 membership functions (MFs) that otherwise might produce invalid output results. We then describe a novel and flexible algorithm for encoding SME-provided intervals into IT2 fuzzy MFs, which we can use to specify the input probabilities in Bayes Theorem. Our algorithm generalizes and extends previous work on this problem that primarily addressed the encoding of intervals into word MFs for Computing with Words applications.
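For intuition, plain interval propagation through Bayes' rule, a simpler relative of the paper's IT2 construction, can be sketched as follows; the monotonicity argument in the docstring justifies which interval endpoints yield the bounds:

```python
def interval_bayes(prior, lik_h, lik_not_h):
    """Propagate interval-valued inputs through Bayes' rule (sketch).

    Each argument is an (lo, hi) interval from a subject matter expert:
    prior = P(H), lik_h = P(E|H), lik_not_h = P(E|not H).
    The posterior P(H|E) = p*l / (p*l + q*(1 - p)) is increasing in the
    prior p and the likelihood l, and decreasing in q = P(E|not H), so
    the bounds follow from plugging in the matching interval endpoints.
    """
    def posterior(p, l, q):
        denom = p * l + q * (1.0 - p)
        return p * l / denom if denom > 0 else 0.0

    lo = posterior(prior[0], lik_h[0], lik_not_h[1])
    hi = posterior(prior[1], lik_h[1], lik_not_h[0])
    return (lo, hi)

# Example: a vague prior [0.2, 0.4] with a moderately diagnostic test.
print(interval_bayes((0.2, 0.4), (0.7, 0.9), (0.1, 0.3)))  # ~(0.37, 0.86)
```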
[183] Automated Unity Game Template Generation from GDDs via NLP and Multi-Modal LLMs
Amna Hassan
Main category: cs.AI
TL;DR: A framework that automatically generates Unity game prototypes from Game Design Documents using NLP and LLMs, achieving high performance scores and GDD adherence.
Details
Motivation: To streamline the transition from game design to implementation by automating the creation of functional Unity game prototypes from design documentation.
Method: End-to-end system combining a fine-tuned LLaMA-3 model for Unity code generation with a custom Unity integration package to parse GDDs and synthesize C# code.
Result: Achieved 4.8/5.0 average score, outperforming state-of-the-art LLMs in compilation success, GDD adherence, best practices, and code modularity across multiple game genres.
Conclusion: The system effectively bridges gaps in AI-assisted game development, demonstrating LLMs as valuable tools for automating game prototype generation from design documents.
Abstract: This paper presents a novel framework for automated game template generation by transforming Game Design Documents (GDDs) into functional Unity game prototypes using Natural Language Processing (NLP) and multi-modal Large Language Models (LLMs). We introduce an end-to-end system that parses GDDs, extracts structured game specifications, and synthesizes Unity-compatible C# code that implements the core mechanics, systems, and architecture defined in the design documentation. Our approach combines a fine-tuned LLaMA-3 model specialized for Unity code generation with a custom Unity integration package that streamlines the implementation process. Evaluation results demonstrate significant improvements over baseline models, with our fine-tuned model achieving superior performance (4.8/5.0 average score) compared to state-of-the-art LLMs across compilation success, GDD adherence, best practices adoption, and code modularity metrics. The generated templates demonstrate high adherence to GDD specifications across multiple game genres. Our system effectively addresses critical gaps in AI-assisted game development, positioning LLMs as valuable tools in streamlining the transition from game design to implementation.
[184] Global Constraint LLM Agents for Text-to-Model Translation
Junyang Cai, Serdar Kadioglu, Bistra Dilkina
Main category: cs.AI
TL;DR: A multi-agent LLM framework that decomposes MiniZinc modeling tasks by constraint type, with specialized agents for different global constraints and an assembler agent, showing better performance than baseline methods.
Details
Motivation: Natural language descriptions of optimization problems are difficult to translate into correct MiniZinc models due to the need for both logical reasoning and constraint programming expertise.
Method: Multiple specialized LLM agents each handle detection and code generation for specific global constraint types, with a final assembler agent integrating all constraint snippets into a complete MiniZinc model.
Result: Initial experiments show better performance against baselines like one-shot prompting and chain-of-thought prompting across several LLMs.
Conclusion: The agentic approach successfully decomposes complex modeling tasks into simpler sub-tasks, with a roadmap outlined for future enhancements and improvements.
Abstract: Natural language descriptions of optimization or satisfaction problems are challenging to translate into correct MiniZinc models, as this process demands both logical reasoning and constraint programming expertise. We introduce a framework that addresses this challenge with an agentic approach: multiple specialized large language model (LLM) agents decompose the modeling task by global constraint type. Each agent is dedicated to detecting and generating code for a specific class of global constraint, while a final assembler agent integrates these constraint snippets into a complete MiniZinc model. By dividing the problem into smaller, well-defined sub-tasks, each LLM handles a simpler reasoning challenge, potentially reducing overall complexity. We conduct initial experiments with several LLMs and show better performance against baselines such as one-shot prompting and chain-of-thought prompting. Finally, we outline a comprehensive roadmap for future work, highlighting potential enhancements and directions for improvement.
[185] ForTIFAI: Fending Off Recursive Training Induced Failure for AI Models
Soheil Zibakhsh Shabgahi, Pedram Aghazadeh, Azalia Mirhosseini, Farinaz Koushanfar
Main category: cs.AI
TL;DR: Proposes Truncated Cross Entropy (TCE) loss function to mitigate model collapse in generative AI by downweighting high-confidence predictions during training on synthetic data.
Details
Motivation: Increasing reliance on generative AI could lead to synthetic data dominating training sets by 2030, causing model collapse, where performance degrades over generations of training on synthetic data.
Method: Identifies model overconfidence in self-generated data as a key driver of collapse. Introduces a confidence-aware TCE loss function that downweights high-confidence predictions. Provides a model-agnostic framework linking loss design to collapse mitigation.
Result: TCE significantly delays model collapse, extending the model's fidelity interval before collapse by more than 2.3x. The method generalizes across modalities and is validated both theoretically and empirically.
Conclusion: Loss function design provides a simple yet powerful tool for preserving generative model quality in the era of increasing synthetic data usage.
Abstract: The increasing reliance on generative AI models has accelerated the generation rate of synthetic data, with some projections suggesting that most available new data for training could be machine-generated by 2030. This shift to mainly synthetic content presents a critical challenge: repeated training on synthetic data leads to a phenomenon known as model collapse, where model performance degrades over generations of training, eventually rendering the models ineffective. Although prior studies have explored the causes and detection of model collapse, existing mitigation strategies remain limited. In this paper, we identify model overconfidence in self-generated data as a key driver of collapse. Building on this observation, we propose a confidence-aware loss function that downweights high-confidence predictions during training. We introduce a novel loss function we call Truncated Cross Entropy (TCE). We demonstrate that TCE significantly delays model collapse in recursive training. We provide a model-agnostic framework that links the loss function design to model collapse mitigation and validate our approach both theoretically and empirically, showing that it can extend the model’s fidelity interval before collapse by more than 2.3x. Finally, we show that our method generalizes across modalities. These findings suggest that the design of loss functions provides a simple yet powerful tool for preserving the quality of generative models in the era of increasing synthetic data.
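One plausible reading of the truncated loss is a hard mask on tokens whose predicted confidence exceeds a threshold; the threshold value and the hard (rather than soft) truncation are assumptions in this sketch:

```python
import torch
import torch.nn.functional as F

def truncated_cross_entropy(logits, targets, tau=0.9):
    """Cross entropy that drops tokens the model is already confident on,
    one way to downweight high-confidence predictions (sketch)."""
    log_probs = F.log_softmax(logits, dim=-1)          # (N, C)
    conf = log_probs.exp().gather(1, targets.unsqueeze(1)).squeeze(1)
    nll = F.nll_loss(log_probs, targets, reduction="none")
    mask = (conf < tau).float()                        # truncate confident tokens
    return (nll * mask).sum() / mask.sum().clamp(min=1.0)

# Usage on a toy batch of 8 tokens over a 100-way vocabulary.
loss = truncated_cross_entropy(torch.randn(8, 100), torch.randint(0, 100, (8,)))
```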
[186] Uncertainty Awareness and Trust in Explainable AI- On Trust Calibration using Local and Global Explanations
Carina Newen, Daniel Bodemer, Sonja Glantz, Emmanuel Müller, Magdalena Wischnewski, Lenka Schnaubert
Main category: cs.AI
TL;DR: The paper analyzes explainable AI (XAI) with focus on uncertainty explanations and global explanations, testing an algorithm that covers uncertainty, robustness, and global XAI concepts to calibrate trust and improve user satisfaction.
Details
Motivation: While XAI is well-studied in some areas, uncertainty explanations and global explanations are often overlooked. The research aims to address this gap by developing guidelines for XAI schemes that incorporate these understudied aspects.
Method: The researchers selected an algorithm that simultaneously addresses multiple XAI concepts (uncertainty, robustness, and global explanations) and tested its effectiveness in trust calibration. They evaluated whether complex visual explanations could enhance user satisfaction and interpretability despite being difficult to understand.
Result: The study derived general guidelines for XAI schemes from their research, though specific quantitative results are not detailed in the abstract.
Conclusion: The research contributes to XAI by focusing on understudied areas like uncertainty and global explanations, demonstrating that algorithms covering multiple XAI concepts simultaneously can help calibrate trust and potentially improve user satisfaction through intuitive visual representations.
Abstract: Explainable AI has become a common term in the literature, scrutinized by computer scientists and statisticians and highlighted by psychological and philosophical researchers. One major effort many researchers tackle is constructing general guidelines for XAI schemes, and we derive such guidelines from our study. While some areas of XAI are well studied, we focus on uncertainty explanations and consider global explanations, which are often left out. We chose an algorithm that covers several concepts simultaneously, such as uncertainty, robustness, and global XAI, and tested its ability to calibrate trust. We then examined whether an algorithm that aims to provide an intuitive visual understanding, despite being technically complex, can deliver higher user satisfaction and human interpretability.
[187] Instructional Prompt Optimization for Few-Shot LLM-Based Recommendations on Cold-Start Users
Haowei Yang, Yushang Zhao, Sitao Min, Bo Su, Chao Yao, Wei Xu
Main category: cs.AI
TL;DR: Optimal prompt engineering with exemplar injection and instruction structuring improves LLM-based recommender systems for cold-start users, enhancing precision@k and NDCG scores in low-data settings.
Details
Motivation: Address the cold-start user problem in recommender systems where historical behavioral information is limited, by leveraging few-shot LLM capabilities through optimized instructional prompts.
Method: Proposes a context-conditioned prompt formulation P(u, Ds) → R̂ using transformer-based LLMs (BioGPT, LLaMA-2, GPT-4) with token-level alignments, embedding space regularization, and semantic fidelity optimization for exemplar injection and instruction structuring.
Result: Empirical evidence shows significant improvements in precision@k and NDCG scores for cold-start recommendation tasks, demonstrating that optimal prompt composition controls attention scales and decoder behavior during inference.
Conclusion: Prompt-based adaptation is an effective approach to address cold-start recommendation issues in LLM-based pipelines, with timely composition being both syntactic and functional in controlling model behavior.
Abstract: The cold-start user issue compromises the effectiveness of recommender systems by limiting access to historical behavioral information. Optimizing instructional prompts for a few-shot large language model (LLM) is an effective pipeline for recommender tasks in this setting. We introduce a context-conditioned prompt formulation method P(u, D_s) → R̂, where u is a cold-start user profile, D_s is a curated support set, and R̂ is the predicted ranked list of items. Based on systematic experimentation with transformer-based autoregressive LLMs (BioGPT, LLaMA-2, GPT-4), we provide empirical evidence that optimal exemplar injection and instruction structuring can significantly improve the precision@k and NDCG scores of such models in low-data settings. The pipeline uses token-level alignments and embedding space regularization to achieve greater semantic fidelity. Our findings show that prompt composition is not merely syntactic but also functional, as it directly controls attention scales and decoder conduct during inference. This paper shows that prompt-based adaptation may be considered one way to address cold-start recommendation issues in LLM-based pipelines.
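A minimal sketch of the P(u, D_s) → R̂ assembly, with the template wording and field names as illustrative assumptions rather than the paper's exact prompt:

```python
def build_prompt(user_profile, support_set, candidates, k=5):
    """Assemble a few-shot instructional prompt P(u, D_s) for a
    cold-start user u and curated support set D_s (sketch)."""
    lines = ["You are a recommender. Rank the candidate items for the user."]
    for ex in support_set:  # exemplar injection
        lines.append(f"User: {ex['profile']}\n"
                     f"Ranked items: {', '.join(ex['ranking'])}")
    lines.append(f"User: {user_profile}")
    lines.append(f"Candidates: {', '.join(candidates)}")
    lines.append(f"Return the top {k} items, most relevant first.")
    return "\n\n".join(lines)

# Hypothetical usage with one support exemplar:
prompt = build_prompt(
    "new user; likes sci-fi novels and chess",
    [{"profile": "likes fantasy novels", "ranking": ["Mistborn", "The Hobbit"]}],
    ["Dune", "Chess Story", "Cookbook 101"],
)
```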
[188] Understanding Economic Tradeoffs Between Human and AI Agents in Bargaining Games
Crystal Qian, Kehang Zhu, John Horton, Benjamin S. Manning, Vivian Tsai, James Wexler, Nithum Thain
Main category: cs.AI
TL;DR: Comparison of humans, LLMs, and Bayesian agents in dynamic negotiation tasks shows performance parity can mask fundamental behavioral differences in coordination processes.
Details
Motivation: As autonomous agents increasingly handle coordination tasks, it's critical to evaluate not just their performance outcomes but also their negotiation processes and behavioral dynamics in multi-agent environments.
Method: Compared humans (N=216), LLMs (GPT-4o, Gemini 1.5 Pro), and Bayesian agents in identical dynamic negotiation conditions, analyzing both outcomes and behavioral dynamics.
Result: Bayesian agents achieved highest surplus through aggressive optimization but with frequent rejections. Humans and LLMs achieved similar overall surplus but through different behaviors: LLMs used conservative concessionary trades with few rejections, while humans employed more strategic, risk-taking, and fairness-oriented approaches.
Conclusion: Performance parity alone is insufficient for agent evaluation - fundamental differences in process and alignment are critical considerations for practical deployment in real-world coordination tasks.
Abstract: Coordination tasks traditionally performed by humans are increasingly being delegated to autonomous agents. As this pattern progresses, it becomes critical to evaluate not only these agents’ performance but also the processes through which they negotiate in dynamic, multi-agent environments. Furthermore, different agents exhibit distinct advantages: traditional statistical agents, such as Bayesian models, may excel under well-specified conditions, whereas large language models (LLMs) can generalize across contexts. In this work, we compare humans (N = 216), LLMs (GPT-4o, Gemini 1.5 Pro), and Bayesian agents in a dynamic negotiation setting that enables direct, identical-condition comparisons across populations, capturing both outcomes and behavioral dynamics. Bayesian agents extract the highest surplus through aggressive optimization, at the cost of frequent trade rejections. Humans and LLMs can achieve similar overall surplus, but through distinct behaviors: LLMs favor conservative, concessionary trades with few rejections, while humans employ more strategic, risk-taking, and fairness-oriented behaviors. Thus, we find that performance parity – a common benchmark in agent evaluation – can conceal fundamental differences in process and alignment, which are critical for practical deployment in real-world coordination tasks.
[189] Anti-Money Laundering Machine Learning Pipelines; A Technical Analysis on Identifying High-risk Bank Clients with Supervised Learning
Khashayar Namdar, Pin-Chien Wang, Tushar Raju, Steven Zheng, Fiona Li, Safwat Tahmin Khan
Main category: cs.AI
TL;DR: A comprehensive ML pipeline for anti-money laundering that achieved 0.961 AUROC in identifying high-risk bank clients using SQL-based feature engineering and XAI modules.
Details
Motivation: AML is a priority for financial institutions, and machine learning has shown high potential for identifying high-risk clients in banking systems.
Method: 16-step design and statistical analysis approach, SQLite database framing, SQL-based feature engineering, pre-trained model integration, and explainable AI modules for feature importance.
Result: Achieved mean AUROC of 0.961 with SD of 0.005, securing second place in the University of Toronto IMI Big Data and AI Competition.
Conclusion: The proposed systematic ML pipeline demonstrates robust performance in AML risk detection and provides explainable insights through XAI modules.
Abstract: Anti-money laundering (AML) actions and measurements are among the priorities of financial institutions, for which machine learning (ML) has shown to have a high potential. In this paper, we propose a comprehensive and systematic approach for developing ML pipelines to identify high-risk bank clients in a dataset curated for Task 1 of the University of Toronto 2023-2024 Institute for Management and Innovation (IMI) Big Data and Artificial Intelligence Competition. The dataset included 195,789 customer IDs, and we employed a 16-step design and statistical analysis to ensure the final pipeline was robust. We also framed the data in a SQLite database, developed SQL-based feature engineering algorithms, connected our pre-trained model to the database to make it inference-ready, and provided explainable artificial intelligence (XAI) modules to derive feature importance. Our pipeline achieved a mean area under the receiver operating characteristic curve (AUROC) of 0.961 with a standard deviation (SD) of 0.005. The proposed pipeline achieved second place in the competition.
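The SQL-based feature engineering step might look like the sketch below; the transactions schema and the chosen aggregates are hypothetical, not the competition dataset's actual columns:

```python
import sqlite3
from sklearn.metrics import roc_auc_score  # used for the AUROC evaluation step

def build_features(con):
    """SQL-based per-customer aggregates over a hypothetical
    transactions(customer_id, amount, is_cross_border) table."""
    return con.execute("""
        SELECT customer_id,
               COUNT(*)             AS n_txn,
               AVG(amount)          AS avg_amount,
               SUM(is_cross_border) AS n_cross_border
        FROM transactions
        GROUP BY customer_id
    """).fetchall()

# Tiny in-memory demo with fabricated rows:
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions(customer_id, amount, is_cross_border)")
con.executemany("INSERT INTO transactions VALUES (?, ?, ?)",
                [(1, 100.0, 0), (1, 9000.0, 1), (2, 40.0, 0)])
print(build_features(con))
# Model scores would then be evaluated with roc_auc_score(y_true, y_score).
```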
[190] Mind Meets Space: Rethinking Agentic Spatial Intelligence from a Neuroscience-inspired Perspective
Bui Duc Manh, Soumyaratna Debnath, Zetong Zhang, Shriram Damodaran, Arvind Kumar, Yueyi Zhang, Lu Mi, Erik Cambria, Lin Wang
Main category: cs.AI
TL;DR: This paper proposes a neuroscience-grounded computational framework for agentic spatial intelligence, identifying six essential modules inspired by biological spatial reasoning to bridge the gap between current AI systems and human-like spatial capabilities.
Details
Motivation: Current agentic AI systems have limited spatial reasoning abilities, primarily constrained to symbolic processing, while human spatial intelligence enables flexible decision-making in unstructured environments through integrated multisensory perception and cognitive maps.
Method: The authors introduce a computational framework with six neuroscience-inspired modules: bio-inspired multimodal sensing, multi-sensory integration, egocentric-allocentric conversion, artificial cognitive map, spatial memory, and spatial reasoning. They analyze existing methods against this framework and examine benchmarks and datasets.
Result: The paper provides a structured framework for evaluating and developing spatial reasoning capabilities in AI systems, identifies critical gaps in current approaches, and explores application domains from virtual to embodied systems like robotics.
Conclusion: The proposed neuroscience-grounded framework offers a promising roadmap for advancing agentic spatial intelligence, with potential to generalize spatial reasoning across dynamic and unstructured environments, benefiting both virtual and physical AI systems.
Abstract: Recent advances in agentic AI have led to systems capable of autonomous task execution and language-based reasoning, yet their spatial reasoning abilities remain limited and underexplored, largely constrained to symbolic and sequential processing. In contrast, human spatial intelligence, rooted in integrated multisensory perception, spatial memory, and cognitive maps, enables flexible, context-aware decision-making in unstructured environments. Therefore, bridging this gap is critical for advancing Agentic Spatial Intelligence toward better interaction with the physical 3D world. To this end, we begin by scrutinizing the spatial neural models studied in computational neuroscience, and accordingly introduce a novel computational framework grounded in neuroscience principles. This framework maps core biological functions to six essential computation modules: bio-inspired multimodal sensing, multi-sensory integration, egocentric-allocentric conversion, an artificial cognitive map, spatial memory, and spatial reasoning. Together, these modules form a perspective landscape for agentic spatial reasoning capability across both virtual and physical environments. On top of this, we conduct a framework-guided analysis of recent methods, evaluating their relevance to each module and identifying critical gaps that hinder the development of more neuroscience-grounded spatial reasoning modules. We further examine emerging benchmarks and datasets and explore potential application domains ranging from virtual to embodied systems, such as robotics. Finally, we outline potential research directions, emphasizing the promising roadmap that can generalize spatial reasoning across dynamic or unstructured environments. We hope this work will benefit the research community with a neuroscience-grounded perspective and a structured pathway. Our project page can be found on GitHub.
[191] ProgD: Progressive Multi-scale Decoding with Dynamic Graphs for Joint Multi-agent Motion Forecasting
Xing Gao, Zherui Huang, Weiyao Lin, Xiao Sun
Main category: cs.AI
TL;DR: ProgD: A progressive multi-scale decoding strategy with dynamic heterogeneous graphs for multi-agent motion prediction, achieving state-of-the-art performance on major benchmarks.
Details
Motivation: Existing multi-agent prediction methods overlook the evolving nature of social interactions between agents, which is crucial for accurate motion prediction in autonomous vehicle scenarios.
Method: Proposes a progressive multi-scale decoding strategy (ProgD) using dynamic heterogeneous graph-based scenario modeling to capture evolving social interactions and progressively eliminate uncertainty in future motions.
Result: Achieves state-of-the-art performance, ranking 1st on the INTERACTION multi-agent prediction benchmark and performing well on the Argoverse 2 multi-world forecasting benchmark.
Conclusion: The proposed ProgD framework effectively addresses the limitation of static interaction modeling by capturing evolving social dynamics, leading to superior multi-agent motion prediction performance.
Abstract: Accurate motion prediction of surrounding agents is crucial for the safe planning of autonomous vehicles. Recent advancements have extended prediction techniques from individual agents to joint predictions of multiple interacting agents, with various strategies to address complex interactions within future motions of agents. However, these methods overlook the evolving nature of these interactions. To address this limitation, we propose a novel progressive multi-scale decoding strategy, termed ProgD, with the help of dynamic heterogeneous graph-based scenario modeling. In particular, to explicitly and comprehensively capture the evolving social interactions in future scenarios, given their inherent uncertainty, we design a progressive modeling of scenarios with dynamic heterogeneous graphs. With the unfolding of such dynamic heterogeneous graphs, a factorized architecture is designed to process the spatio-temporal dependencies within future scenarios and progressively eliminate uncertainty in future motions of multiple agents. Furthermore, a multi-scale decoding procedure is incorporated to improve on the future scenario modeling and consistent prediction of agents’ future motion. The proposed ProgD achieves state-of-the-art performance on the INTERACTION multi-agent prediction benchmark, ranking 1st, and the Argoverse 2 multi-world forecasting benchmark.
[192] Enabling Regulatory Multi-Agent Collaboration: Architecture, Challenges, and Solutions
Qinnan Hu, Yuntao Wang, Yuan Gao, Zhou Su, Linkang Du
Main category: cs.AI
TL;DR: Blockchain-based layered architecture for regulating LLM-powered autonomous agents with behavior tracing, reputation evaluation, and malicious behavior forecasting modules.
Details
Motivation: Address governance and accountability challenges posed by unpredictable behaviors and heterogeneous capabilities of LLM-empowered autonomous agents in multi-agent collaboration systems.
Method: Propose a blockchain-enabled layered architecture with three key modules: agent behavior tracing and arbitration, dynamic reputation evaluation, and malicious behavior forecasting for early detection of adversarial activities.
Result: Establishes a systematic foundation for trustworthy, resilient, and scalable regulatory mechanisms in large-scale agent ecosystems.
Conclusion: The framework provides a solution for regulatory challenges in multi-agent systems and identifies future research directions for blockchain-enabled regulatory frameworks.
Abstract: Large language models (LLMs)-empowered autonomous agents are transforming both digital and physical environments by enabling adaptive, multi-agent collaboration. While these agents offer significant opportunities across domains such as finance, healthcare, and smart manufacturing, their unpredictable behaviors and heterogeneous capabilities pose substantial governance and accountability challenges. In this paper, we propose a blockchain-enabled layered architecture for regulatory agent collaboration, comprising an agent layer, a blockchain data layer, and a regulatory application layer. Within this framework, we design three key modules: (i) an agent behavior tracing and arbitration module for automated accountability, (ii) a dynamic reputation evaluation module for trust assessment in collaborative scenarios, and (iii) a malicious behavior forecasting module for early detection of adversarial activities. Our approach establishes a systematic foundation for trustworthy, resilient, and scalable regulatory mechanisms in large-scale agent ecosystems. Finally, we discuss the future research directions for blockchain-enabled regulatory frameworks in multi-agent systems.
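As one common way to realize the dynamic reputation evaluation module, an exponentially weighted update could be applied after each arbitrated interaction; the decay factor and the [0, 1] scoring scale are assumptions, not the paper's design:

```python
def update_reputation(rep, behavior_score, decay=0.9):
    """Exponentially weighted reputation update for an agent (sketch).

    rep: current reputation in [0, 1].
    behavior_score: outcome of the latest interaction in [0, 1], e.g.,
    derived from the on-chain behavior tracing and arbitration module.
    """
    return decay * rep + (1.0 - decay) * behavior_score

# A cooperative interaction nudges reputation upward.
print(update_reputation(0.5, 1.0))  # 0.55
```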
[193] Jupiter: Enhancing LLM Data Analysis Capabilities via Notebook and Inference-Time Value-Guided Search
Shuocheng Li, Yihao Liu, Silin Du, Wenxuan Zeng, Zhe Xu, Mengyu Zhou, Yeye He, Haoyu Dong, Shi Han, Dongmei Zhang
Main category: cs.AI
TL;DR: Proposed NbQA dataset from Jupyter notebooks and Jupiter framework with MCTS for multi-step data analysis reasoning, achieving SOTA results with Qwen2.5 models.
Details
Motivation: LLMs struggle with multi-step reasoning and tool use in complex data analysis tasks, limiting their effectiveness in automating data science workflows.
Method: Created the NbQA dataset from real Jupyter notebooks, developed the Jupiter framework using Monte Carlo Tree Search for multi-step reasoning, and trained value models for efficient plan generation.
Result: Qwen2.5-7B and 14B-Instruct achieved 77.82% and 86.38% success rates on InfiAgent-DABench, matching or surpassing GPT-4o and advanced agent frameworks.
Conclusion: The approach significantly improves multi-step reasoning and tool-use capabilities in data analysis, demonstrating strong generalization across diverse reasoning tasks.
Abstract: Large language models (LLMs) have shown great promise in automating data science workflows, but existing models still struggle with multi-step reasoning and tool use, which limits their effectiveness on complex data analysis tasks. To address this, we propose a scalable pipeline that extracts high-quality, tool-based data analysis tasks and their executable multi-step solutions from real-world Jupyter notebooks and associated data files. Using this pipeline, we introduce NbQA, a large-scale dataset of standardized task-solution pairs that reflect authentic tool-use patterns in practical data science scenarios. To further enhance multi-step reasoning, we present Jupiter, a framework that formulates data analysis as a search problem and applies Monte Carlo Tree Search (MCTS) to generate diverse solution trajectories for value model learning. During inference, Jupiter combines the value model and node visit counts to efficiently collect executable multi-step plans with minimal search steps. Experimental results show that Qwen2.5-7B and 14B-Instruct models on NbQA solve 77.82% and 86.38% of tasks on InfiAgent-DABench, respectively, matching or surpassing GPT-4o and advanced agent frameworks. Further evaluations demonstrate improved generalization and stronger tool-use reasoning across diverse multi-step reasoning tasks.
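The summary does not give Jupiter's exact scoring rule, but combining a value model's estimate with visit counts typically takes a UCT-style form, sketched here under that assumption:

```python
import math

def select_child(children, c_explore=1.0):
    """Pick the next node to expand by combining the value model's
    estimate with visit counts, UCT-style (illustrative sketch)."""
    total_visits = sum(ch["visits"] for ch in children) + 1

    def score(ch):
        exploit = ch["value"]  # value model's estimate of the partial plan
        explore = c_explore * math.sqrt(math.log(total_visits) / (ch["visits"] + 1))
        return exploit + explore

    return max(children, key=score)

# A rarely visited child can win despite a lower value estimate.
kids = [{"value": 0.8, "visits": 20}, {"value": 0.6, "visits": 1}]
print(select_child(kids)["value"])  # 0.6
```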
[194] Fusing Knowledge and Language: A Comparative Study of Knowledge Graph-Based Question Answering with LLMs
Vaibhav Chaudhary, Neha Soni, Narotam Singh, Amita Kapoor
Main category: cs.AI
TL;DR: Comparative study of three knowledge graph construction methods (spaCy, Stanford CoreNLP-OpenIE, GraphRAG) for enhancing LLM-based question answering, finding OpenIE has best triplet coverage while GraphRAG excels in reasoning.
Details
Motivation: Traditional RAG approaches struggle with thematic and holistic understanding of complex texts, requiring deeper analysis of both text and context that knowledge graphs can provide.
Method: Comprehensive technical comparative study evaluating three open-source methodologies for constructing knowledge graph triplets and integrating them with LLMs for question answering.
Result: OpenIE provides the most comprehensive coverage of triplets, while GraphRAG demonstrates superior reasoning abilities among the three methods.
Conclusion: Discussion of strengths and limitations of each method with insights into future directions for improving knowledge graph-based question answering systems.
Abstract: Knowledge graphs, a powerful tool for structuring information through relational triplets, have recently become the new front-runner in enhancing question-answering systems. While traditional Retrieval Augmented Generation (RAG) approaches are proficient in fact-based and local context-based extraction from concise texts, they encounter limitations when addressing the thematic and holistic understanding of complex, extensive texts, requiring a deeper analysis of both text and context. This paper presents a comprehensive technical comparative study of three different methodologies for constructing knowledge graph triplets and integrating them with Large Language Models (LLMs) for question answering: spaCy, Stanford CoreNLP-OpenIE, and GraphRAG, all leveraging open source technologies. We evaluate the effectiveness, feasibility, and adaptability of these methods by analyzing their capabilities, state of development, and their impact on the performance of LLM-based question answering. Experimental results indicate that while OpenIE provides the most comprehensive coverage of triplets, GraphRAG demonstrates superior reasoning abilities among the three. We conclude with a discussion on the strengths and limitations of each method and provide insights into future directions for improving knowledge graph-based question answering.
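For the spaCy route, a naive dependency-based subject-verb-object extractor is a reasonable baseline sketch; the handful of dependency labels handled here is a simplification of what a production extractor would cover:

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_triplets(text):
    """Naive subject-verb-object triplet extraction via dependency parsing."""
    triplets = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children
                            if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children
                           if c.dep_ in ("dobj", "attr")]
                for s in subjects:
                    for o in objects:
                        triplets.append((s.text, token.lemma_, o.text))
    return triplets

print(extract_triplets("Marie Curie discovered polonium. Cats chase mice."))
```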
[195] Tree-OPO: Off-policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning
Bingning Huang, Tu Nguyen, Matthieu Zimmer
Main category: cs.AI
TL;DR: Using MCTS-generated trajectories to improve policy optimization in preference-based RL through staged GRPO training with tree-structured advantage estimation.
Details
Motivation: Leverage MCTS-derived trajectories (effective in math/symbolic reasoning) to enhance policy optimization in preference-based RL, moving beyond traditional value/reward model training.
Method: Staged GRPO training paradigm using partially revealed MCTS rollouts, introducing tree-structured advantage estimation and prefix-conditioned reward signals.
Result: Structured advantage estimation stabilizes updates and better reflects compositional reasoning quality, but faces challenges like advantage saturation and reward signal collapse.
Conclusion: Proposed heuristic and statistical solutions mitigate issues, but open challenges remain for learning under staged/tree-like reward structures.
Abstract: Recent advances in reasoning with large language models (LLMs) have shown the effectiveness of Monte Carlo Tree Search (MCTS) for generating high-quality intermediate trajectories, particularly in math and symbolic domains. Inspired by this, we explore how MCTS-derived trajectories, traditionally used for training value or reward models, can be repurposed to improve policy optimization in preference-based reinforcement learning (RL). Specifically, we focus on Group Relative Policy Optimization (GRPO), a recent algorithm that enables preference-consistent policy learning without value networks. We propose a staged GRPO training paradigm where completions are derived from partially revealed MCTS rollouts, introducing a novel tree-structured setting for advantage estimation. This leads to a rich class of prefix-conditioned reward signals, which we analyze theoretically and empirically. Our initial results indicate that while structured advantage estimation can stabilize updates and better reflect compositional reasoning quality, challenges such as advantage saturation and reward signal collapse remain. We propose heuristic and statistical solutions to mitigate these issues and discuss open challenges for learning under staged or tree-like reward structures.
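A group-relative advantage computed within sibling completions sharing an MCTS prefix might look like the sketch below; note how a group of identical rewards collapses to zero advantage, one face of the reward-signal collapse the authors discuss:

```python
import numpy as np

def prefix_grouped_advantages(rewards_by_prefix, eps=1e-8):
    """GRPO-style advantages where each group is the set of completions
    sampled from one partially revealed MCTS prefix (sketch)."""
    advantages = {}
    for prefix, rewards in rewards_by_prefix.items():
        r = np.asarray(rewards, dtype=float)
        # Normalize within the sibling group; a degenerate group (all
        # rewards equal) yields zero advantage for every member.
        advantages[prefix] = (r - r.mean()) / (r.std() + eps)
    return advantages

# Completions sampled from two prefixes; p2's uniform rewards collapse.
advs = prefix_grouped_advantages({"p1": [1.0, 0.0, 0.0],
                                  "p2": [1.0, 1.0, 1.0]})
print(advs["p1"], advs["p2"])
```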
[196] LightAgent: Production-level Open-source Agentic AI Framework
Weige Cai, Tong Zhu, Jinyi Niu, Ruiqi Hu, Lingyao Li, Tenglong Wang, Xiaowu Dai, Weining Shen, Liwen Zhang
Main category: cs.AI
TL;DR: LightAgent is a lightweight open-source multi-agent framework that balances flexibility and simplicity, integrating memory, tools, and Tree of Thought capabilities for easy agent deployment.
Details
Motivation: Address the challenges in designing versatile, robust, and efficient platforms for deploying large language model-based multi-agent systems, overcoming the trade-off between flexibility and simplicity in existing frameworks.
Method: Propose the LightAgent framework with core functionalities including Memory (mem0), Tools, and Tree of Thought (ToT) while maintaining an extremely lightweight structure. It is designed as a fully open-source solution that integrates with mainstream chat platforms.
Result: Developed a lightweight yet powerful agentic framework that effectively resolves the flexibility-simplicity trade-off, enabling developers to easily build self-learning agents.
Conclusion: LightAgent provides an effective solution for multi-agent system deployment, offering both lightweight structure and comprehensive functionality while being fully open-source and compatible with mainstream platforms.
Abstract: With the rapid advancement of large language models (LLMs), Multi-agent Systems (MAS) have achieved significant progress in various application scenarios. However, substantial challenges remain in designing versatile, robust, and efficient platforms for agent deployment. To address these limitations, we propose LightAgent, a lightweight yet powerful agentic framework, effectively resolving the trade-off between flexibility and simplicity found in existing frameworks. LightAgent integrates core functionalities such as Memory (mem0), Tools, and Tree of Thought (ToT), while maintaining an extremely lightweight structure. As a fully open-source solution, it seamlessly integrates with mainstream chat platforms, enabling developers to easily build self-learning agents. We have released LightAgent at https://github.com/wxai-space/LightAgent
[197] Explaining Tournament Solutions with Minimal Supports
Clément Contet, Umberto Grandi, Jérôme Mengin
Main category: cs.AI
TL;DR: The paper studies certified explanations for tournament winners by identifying minimal sub-tournaments where a candidate is guaranteed to win, providing abductive explanations for various tournament rules.
Details
Motivation: To provide formal, certified explanations for why candidates win tournaments under different rules, addressing the need for explainable AI in tournament-based decision systems.
Method: Identify minimal supports (minimal sub-tournaments) where a candidate is a necessary winner, analyze computational complexity, and develop polynomial-time algorithms for most tournament rules.
Result: Determined the size of the smallest minimal supports for various tournament rules, developed efficient algorithms for all rules except the weighted uncovered set (which is NP-complete), and demonstrated how minimal supports produce compact certified explanations.
Conclusion: Minimal supports provide effective certified explanations for tournament winners across multiple rules, with most being computable in polynomial time, offering a practical approach to explainable AI in tournament settings.
Abstract: Tournaments are widely used models to represent pairwise dominance between candidates, alternatives, or teams. We study the problem of providing certified explanations for why a candidate appears among the winners under various tournament rules. To this end, we identify minimal supports, minimal sub-tournaments in which the candidate is guaranteed to win regardless of how the rest of the tournament is completed (that is, the candidate is a necessary winner of the sub-tournament). This notion corresponds to an abductive explanation for the question, “Why does the winner win the tournament?”, a central concept in formal explainable AI. We focus on common tournament solutions: the top cycle, the uncovered set, the Copeland rule, the Borda rule, the maximin rule, and the weighted uncovered set. For each rule we determine the size of the smallest minimal supports, and we present polynomial-time algorithms to compute them for all but the weighted uncovered set, for which the problem is NP-complete. Finally, we show how minimal supports can serve to produce compact, certified, and intuitive explanations.
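To ground the notion of a necessary winner, the brute-force check below enumerates every completion of the unknown pairwise comparisons under the Copeland rule (feasible only for tiny instances, and only one of the rules the paper covers); a minimal support would then be a minimal candidate subset passing such a check:

```python
from itertools import combinations, product

def is_necessary_winner(candidate, candidates, beats):
    """Does `candidate` (co-)win under Copeland in every completion of
    the missing pairwise comparisons? Brute-force illustrative sketch.

    beats: set of known (winner, loser) pairs within `candidates`.
    """
    unknown = [(a, b) for a, b in combinations(candidates, 2)
               if (a, b) not in beats and (b, a) not in beats]
    for outcome in product([0, 1], repeat=len(unknown)):
        edges = set(beats)
        edges.update((a, b) if bit else (b, a)
                     for (a, b), bit in zip(unknown, outcome))
        scores = {c: sum(1 for d in candidates if (c, d) in edges)
                  for c in candidates}
        if any(scores[c] > scores[candidate] for c in candidates):
            return False  # some completion dethrones the candidate
    return True

# A complete 3-cycle: every candidate ties on Copeland score, so "a"
# is among the winners in the (only) completion.
print(is_necessary_winner("a", ["a", "b", "c"],
                          {("a", "b"), ("b", "c"), ("c", "a")}))
```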
[198] Measuring Implicit Spatial Coordination in Teams: Effects on Collective Intelligence and Performance
Thuy Ngoc Nguyen, Anita Williams Woolley, Cleotilde Gonzalez
Main category: cs.AI
TL;DR: This paper examines how spatial coordination dimensions (exploration diversity, movement specialization, adaptive spatial proximity) affect team performance in search and rescue tasks with restricted communication.
Details
Motivation: Many teams (firefighters, military, emergency response) must coordinate physical movements without visual cues or extensive explicit communication, but existing research has focused on co-located or knowledge work coordination.
Method: Analyzed data from 34 four-person teams (136 participants) performing specialized roles in a collaborative online search and rescue task with restricted communication, using metrics for spatial proximity, distribution patterns, and movement alignment.
Result: Spatial specialization positively predicts performance, adaptive spatial proximity shows marginal inverted U-shaped relationship (moderate adaptation optimal), and temporal dynamics of these metrics differentiate high- from low-performing teams over time.
Conclusion: Findings provide insights into implicit spatial coordination in role-based teamwork and highlight importance of balanced adaptive strategies, with implications for training and AI-assisted team support systems.
Abstract: Coordinated teamwork is essential in fast-paced decision-making environments that require dynamic adaptation, often without an opportunity for explicit communication. Although implicit coordination has been extensively considered in the existing literature, the majority of work has focused on co-located, synchronous teamwork (such as sports teams) or, in distributed teams, primarily on coordination of knowledge work. However, many teams (firefighters, military, law enforcement, emergency response) must coordinate their movements in physical space without the benefit of visual cues or extensive explicit communication. This paper investigates how three dimensions of spatial coordination, namely exploration diversity, movement specialization, and adaptive spatial proximity, influence team performance in a collaborative online search and rescue task where explicit communication is restricted and team members rely on movement patterns to infer others’ intentions and coordinate actions. Our metrics capture the relational aspects of teamwork by measuring spatial proximity, distribution patterns, and alignment of movements within shared environments. We analyze data from 34 four-person teams (136 participants) assigned to specialized roles in a search and rescue task. Results show that spatial specialization positively predicts performance, while adaptive spatial proximity exhibits a marginal inverted U-shaped relationship, suggesting moderate levels of adaptation are optimal. Furthermore, the temporal dynamics of these metrics differentiate high- from low-performing teams over time. These findings provide insights into implicit spatial coordination in role-based teamwork and highlight the importance of balanced adaptive strategies, with implications for training and AI-assisted team support systems.
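Illustrative versions of two of the measures, mean pairwise proximity and movement alignment over shared trajectories, are sketched below; the paper's exact formulas may differ:

```python
import numpy as np

def spatial_metrics(traj):
    """Compute illustrative spatial coordination measures (sketch).

    traj: array of shape (T, n_agents, 2) with agent positions over time.
    Returns mean pairwise distance and mean movement alignment (cosine
    similarity of per-step movement vectors across agent pairs).
    """
    T, n, _ = traj.shape
    vel = np.diff(traj, axis=0)  # (T-1, n, 2) movement vectors
    dists, aligns = [], []
    for i in range(n):
        for j in range(i + 1, n):
            dists.append(np.linalg.norm(traj[:, i] - traj[:, j], axis=1).mean())
            vi, vj = vel[:, i], vel[:, j]
            norms = np.linalg.norm(vi, axis=1) * np.linalg.norm(vj, axis=1)
            valid = norms > 0
            if valid.any():
                cos = (vi[valid] * vj[valid]).sum(axis=1) / norms[valid]
                aligns.append(cos.mean())
    mean_align = float(np.mean(aligns)) if aligns else 0.0
    return float(np.mean(dists)), mean_align

# Usage on a random 4-agent trajectory of 50 steps.
print(spatial_metrics(np.random.rand(50, 4, 2)))
```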
[199] Towards Adaptive ML Benchmarks: Web-Agent-Driven Construction, Domain Expansion, and Metric Optimization
Hangyi Jia, Yuxi Qian, Hanwen Tong, Xinhui Wu, Lin Chen, Feng Wei
Main category: cs.AI
TL;DR: TAM Bench is a comprehensive benchmark for evaluating LLM-based agents on end-to-end ML tasks, featuring automated task collection from platforms like Kaggle, difficulty modeling based on leaderboard data, and multi-dimensional evaluation.
Details
Motivation: Existing benchmarks for LLM-based ML agents are limited in task coverage, domain diversity, difficulty modeling, and evaluation rigor, failing to capture the full capabilities of such agents in realistic settings.
Method: Developed a browser automation and LLM-based system to automatically collect ML challenges from platforms like Kaggle, created leaderboard-driven difficulty modeling, and implemented a multi-dimensional evaluation framework with performance, format compliance, constraint adherence, and generalization metrics.
Result: Constructed three benchmark subsets (Lite, Medium, Full) based on 150 curated AutoML tasks, with the Lite version (18 tasks) providing balanced coverage across modalities and difficulty levels for practical testing.
Conclusion: TAM Bench provides a diverse, realistic, and structured evaluation framework that addresses limitations of existing benchmarks and enables comprehensive assessment of LLM-based agents in end-to-end ML workflows.
Abstract: Recent advances in large language models (LLMs) have enabled the emergence of general-purpose agents for automating end-to-end machine learning (ML) workflows, including data analysis, feature engineering, model training, and competition solving. However, existing benchmarks remain limited in task coverage, domain diversity, difficulty modeling, and evaluation rigor, failing to capture the full capabilities of such agents in realistic settings. We present TAM Bench, a diverse, realistic, and structured benchmark for evaluating LLM-based agents on end-to-end ML tasks. TAM Bench features three key innovations: (1) A browser automation and LLM-based task acquisition system that automatically collects and structures ML challenges from platforms such as Kaggle, AIcrowd, and Biendata, spanning multiple task types and data modalities (e.g., tabular, text, image, graph, audio); (2) A leaderboard-driven difficulty modeling mechanism that estimates task complexity using participant counts and score dispersion, enabling scalable and objective task calibration; (3) A multi-dimensional evaluation framework incorporating performance, format compliance, constraint adherence, and task generalization. Based on 150 curated AutoML tasks, we construct three benchmark subsets of different sizes – Lite, Medium, and Full – designed for varying evaluation scenarios. The Lite version, with 18 tasks and balanced coverage across modalities and difficulty levels, serves as a practical testbed for daily benchmarking and comparative studies.
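A toy version of the leaderboard-driven difficulty estimate, combining score dispersion and participation; the weighting, normalization, and the direction of each signal are assumptions about the general idea, not TAM Bench's actual calibration:

```python
import numpy as np

def estimate_difficulty(scores, n_participants, max_participants=10_000):
    """Leaderboard-driven difficulty estimate in [0, 1] (sketch).

    Intuition: tightly clustered scores and heavy participation suggest
    an easier, well-trodden task; high dispersion and sparse
    participation suggest a harder one.
    """
    dispersion = np.std(scores) / (abs(np.mean(scores)) + 1e-8)
    popularity = min(n_participants / max_participants, 1.0)
    return 0.5 * min(dispersion, 1.0) + 0.5 * (1.0 - popularity)

# A widely attempted task with clustered scores rates as easy.
print(estimate_difficulty([0.91, 0.90, 0.89, 0.92], 8_000))
```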
[200] Curriculum-Based Multi-Tier Semantic Exploration via Deep Reinforcement Learning
Abdel Hakim Drid, Vincenzo Suriani, Daniele Nardi, Abderrezzak Debilou
Main category: cs.AI
TL;DR: Novel DRL architecture with VLM integration for resource-efficient semantic exploration, using strategic VLM queries and curriculum learning to enhance object discovery and navigation to semantically rich areas.
Details
Motivation: Traditional RL approaches struggle with balancing efficient exploration and semantic understanding due to limited cognitive capabilities in small policies, requiring human intervention for semantic exploration tasks.
Method: Integration of Vision-Language Model common-sense through layered reward function, modeling VLM query as dedicated action for strategic external guidance, combined with curriculum learning for robust training at different complexity levels.
Result: Significantly enhanced object discovery rates, learned capability to navigate towards semantically rich regions, and strategic mastery of when to prompt for external environmental information.
Conclusion: Provides a practical and scalable method for embedding common-sense semantic reasoning in autonomous agents, enabling fully intelligent and self-guided exploration in robotics.
Abstract: Navigating and understanding complex and unknown environments autonomously demands more than just basic perception and movement from embodied agents. Truly effective exploration requires agents to possess higher-level cognitive abilities, the ability to reason about their surroundings, and make more informed decisions regarding exploration strategies. However, traditional RL approaches struggle to balance efficient exploration and semantic understanding due to the limited cognitive capabilities embedded in the small policies for the agents, often leading to reliance on human drivers for semantic exploration. In this paper, we address this challenge by presenting a novel Deep Reinforcement Learning (DRL) architecture that is specifically designed for resource-efficient semantic exploration. A key methodological contribution is the integration of Vision-Language Model (VLM) common sense through a layered reward function. The VLM query is modeled as a dedicated action, allowing the agent to strategically query the VLM only when deemed necessary for gaining external guidance, thereby conserving resources. This mechanism is combined with a curriculum learning strategy designed to guide learning at different levels of complexity to ensure robust and stable learning. Our experimental evaluation convincingly demonstrates that our agent achieves significantly enhanced object discovery rates and develops a learned capability to effectively navigate towards semantically rich regions. Furthermore, it also shows a strategic mastery of when to prompt for external environmental information. By demonstrating a practical and scalable method for embedding common-sense semantic reasoning in autonomous agents, this research provides a novel approach toward fully intelligent and self-guided exploration in robotics.
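To make the query-as-action idea concrete, here is a toy layered reward under invented scales: querying the VLM adds a semantic bonus but pays a fixed cost, so a policy trained against it must learn when the external guidance is worth the resources. All names and numbers are illustrative, not the paper's.

```python
# Toy layered reward: exploration gains plus a VLM-derived semantic bonus,
# minus a cost whenever the dedicated "query the VLM" action is taken.
def layered_reward(new_area: float, objects_found: int,
                   vlm_semantic_bonus: float, queried_vlm: bool,
                   query_cost: float = 0.2) -> float:
    reward = 0.1 * new_area + 1.0 * objects_found
    if queried_vlm:
        # Guidance helps, but costs resources -- the agent must learn
        # *when* the query pays off.
        reward += vlm_semantic_bonus - query_cost
    return reward

print(layered_reward(new_area=3.0, objects_found=1,
                     vlm_semantic_bonus=0.5, queried_vlm=True))  # 1.6
```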
[201] TORSO: Template-Oriented Reasoning Towards General Tasks
Minhyuk Kim, Seungyoon Lee, Heuiseok Lim
Main category: cs.AI
TL;DR: TORSO enables LLMs to use their internal reasoning capabilities without relying on manually crafted few-shot examples, achieving strong performance across diverse tasks.
Details
Motivation: Existing few-shot prompting approaches heavily depend on provided examples, limiting model's inherent reasoning capabilities and requiring costly task-specific prompt construction.
Method: Template-Oriented Reasoning (TORSO) elicits models to utilize internal reasoning abilities to generate proper responses without manually crafted few-shot examples.
Result: Experimental results demonstrate TORSO achieves strong performance on diverse LLM benchmarks with reasonable rationales.
Conclusion: TORSO provides an effective alternative to few-shot prompting that leverages models’ internal reasoning capabilities across various tasks without manual example construction.
Abstract: Approaches that guide Large Language Models (LLMs) to emulate human reasoning during response generation have emerged as an effective way to enable them to solve complex problems in a step-by-step manner, thereby achieving superior performance. However, most existing approaches using few-shot prompts to generate responses heavily depend on the provided examples, limiting the utilization of the model's inherent reasoning capabilities. Moreover, constructing task-specific few-shot prompts is often costly and may lead to inconsistencies across different tasks. In this work, we introduce Template-Oriented Reasoning (TORSO), which elicits the model to utilize internal reasoning abilities to generate proper responses across various tasks without the need for manually crafted few-shot examples. Our experimental results demonstrate that TORSO achieves strong performance on diverse LLM benchmarks with reasonable rationales.
[202] Inteligencia Artificial jurídica y el desafío de la veracidad: análisis de alucinaciones, optimización de RAG y principios para una integración responsable (Legal AI and the Challenge of Veracity: An Analysis of Hallucinations, RAG Optimization, and Principles for Responsible Integration)
Alex Dantart
Main category: cs.AI
TL;DR: Analysis of LLM hallucinations in legal applications, examining RAG limitations and proposing consultative AI paradigm with human oversight.
Details
Motivation: Address the critical issue of false information (hallucinations) in large language models when applied to legal contexts, where accuracy is paramount.
Method: Examines causes and manifestations of hallucinations, evaluates effectiveness of RAG mitigation strategy, and explores ethical/regulatory implications.
Result: RAG strategy has limitations; holistic optimizations needed; human oversight remains irreplaceable in legal AI applications.
Conclusion: Solution requires shift to consultative AI paradigm prioritizing veracity and traceability, amplifying rather than replacing professional legal judgment.
Abstract: This technical report analyzes the challenge of “hallucinations” (false information) in LLMs applied to law. It examines their causes, manifestations, and the effectiveness of the RAG mitigation strategy, highlighting its limitations and proposing holistic optimizations. The paper explores the ethical and regulatory implications, emphasizing human oversight as an irreplaceable role. It concludes that the solution lies not in incrementally improving generative models, but in adopting a “consultative” AI paradigm that prioritizes veracity and traceability, acting as a tool to amplify, not replace, professional judgment.
[203] SEDM: Scalable Self-Evolving Distributed Memory for Agents
Haoran Xu, Jiacong Hu, Ke Zhang, Lei Yu, Yuxin Tang, Xinyuan Song, Yiqun Duan, Lynn Ai, Bill Shi
Main category: cs.AI
TL;DR: SEDM is a self-evolving distributed memory framework that transforms memory from passive storage to an active, self-optimizing component for multi-agent systems, improving reasoning accuracy while reducing computational overhead.
Details
Motivation: Long-term multi-agent systems generate massive trajectories and interactions, requiring efficient memory management. Existing methods suffer from noise accumulation, uncontrolled memory expansion, and limited cross-domain generalization.
Method: SEDM integrates verifiable write admission through reproducible replay, a self-scheduling memory controller that dynamically ranks and consolidates entries based on utility, and cross-domain knowledge diffusion for abstracting reusable insights across heterogeneous tasks.
Result: Evaluations show SEDM improves reasoning accuracy while reducing token overhead compared to strong memory baselines, and enables knowledge from fact verification to enhance multi-hop reasoning.
Conclusion: SEDM serves as a scalable and sustainable memory mechanism for open-ended multi-agent collaboration, transforming memory into an active, self-optimizing component.
Abstract: Long-term multi-agent systems inevitably generate vast amounts of trajectories and historical interactions, which makes efficient memory management essential for both performance and scalability. Existing methods typically depend on vector retrieval and hierarchical storage, yet they are prone to noise accumulation, uncontrolled memory expansion, and limited generalization across domains. To address these challenges, we present SEDM, Self-Evolving Distributed Memory, a verifiable and adaptive framework that transforms memory from a passive repository into an active, self-optimizing component. SEDM integrates verifiable write admission based on reproducible replay, a self-scheduling memory controller that dynamically ranks and consolidates entries according to empirical utility, and cross-domain knowledge diffusion that abstracts reusable insights to support transfer across heterogeneous tasks. Evaluations on benchmark datasets demonstrate that SEDM improves reasoning accuracy while reducing token overhead compared with strong memory baselines, and further enables knowledge distilled from fact verification to enhance multi-hop reasoning. The results highlight SEDM as a scalable and sustainable memory mechanism for open-ended multi-agent collaboration. The code will be released at a later stage of this project.
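A minimal sketch of what a self-scheduling memory controller can look like, assuming a simple hit/miss utility signal per entry; SEDM's actual admission, ranking, and consolidation logic is more involved.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    content: str
    hits: int = 0    # times retrieval of this entry helped a task
    misses: int = 0  # times it was retrieved but unhelpful

    @property
    def utility(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.5  # uninformative prior

def schedule(memory: list[MemoryEntry], capacity: int) -> list[MemoryEntry]:
    """Keep the top-`capacity` entries by empirical utility; the rest are
    candidates for consolidation or eviction, bounding memory growth."""
    return sorted(memory, key=lambda e: e.utility, reverse=True)[:capacity]

mem = [MemoryEntry("fact A", hits=9, misses=1),
       MemoryEntry("fact B", hits=1, misses=7),
       MemoryEntry("fact C")]
print([e.content for e in schedule(mem, capacity=2)])  # ['fact A', 'fact C']
```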
[204] Compositional Concept Generalization with Variational Quantum Circuits
Hala Hawashin, Mina Abbaszadeh, Nicholas Joseph, Beth Pearson, Martha Lewis, Mehrnoosh Sadrzadeh
Main category: cs.AI
TL;DR: Quantum models show promise for improving compositional generalization in vision-language tasks, outperforming classical tensor-based approaches.
Details
Motivation: Address the lack of compositional generalization in current AI tools by leveraging quantum computing's training efficiency advantages over classical tensor-based semantic models.
Method: Used Variational Quantum Circuits to learn compositional tensor-based representations in Hilbert spaces, testing two image encoding techniques: multi-hot encoding on binary vectors and angle/amplitude encoding on CLIP image vectors.
Result: Achieved good proof-of-concept results with noisy multi-hot encodings. Performance with CLIP vectors was mixed but still outperformed classical compositional models.
Conclusion: Quantum models demonstrate potential for compositional generalization tasks, showing improved performance over classical approaches despite some limitations with certain encoding methods.
Abstract: Compositional generalization is a key facet of human cognition, but lacking in current AI tools such as vision-language models. Previous work examined whether a compositional tensor-based sentence semantics can overcome the challenge, but led to negative results. We conjecture that the increased training efficiency of quantum models will improve performance in these tasks. We interpret the representations of compositional tensor-based models in Hilbert spaces and train Variational Quantum Circuits to learn these representations on an image captioning task requiring compositional generalization. We used two image encoding techniques: a multi-hot encoding (MHE) on binary image vectors and an angle/amplitude encoding on image vectors taken from the vision-language model CLIP. We achieve good proof-of-concept results using noisy MHE encodings. Performance on CLIP image vectors was more mixed, but still outperformed classical compositional models.
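Toy versions of the two encodings named in the abstract, in plain NumPy: multi-hot encoding maps binary attribute vectors to bits, while angle encoding rescales (e.g. CLIP) features into single-qubit rotation angles. Both are simplified stand-ins for the paper's circuit inputs.

```python
import numpy as np

def multi_hot(attributes: list[str], vocabulary: list[str]) -> np.ndarray:
    """Binary vector with a 1 for each attribute present in the image."""
    return np.array([1.0 if v in attributes else 0.0 for v in vocabulary])

def angle_encode(features: np.ndarray) -> np.ndarray:
    """Map real-valued features to rotation angles in [0, pi], e.g. for RY gates."""
    lo, hi = features.min(), features.max()
    return (features - lo) / (hi - lo + 1e-9) * np.pi

print(multi_hot(["red", "circle"], ["red", "blue", "circle", "square"]))
clip_vec = np.random.default_rng(0).normal(size=8)  # stand-in for a CLIP vector
print(angle_encode(clip_vec).round(2))
```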
[205] Boosting Embodied AI Agents through Perception-Generation Disaggregation and Asynchronous Pipeline Execution
Shulai Zhang, Ao Xu, Quan Chen, Han Zhao, Weihao Cui, Ningxin Zheng, Haibin Lin, Xin Liu, Minyi Guo
Main category: cs.AI
TL;DR: Auras is an algorithm-system co-designed framework that improves embodied AI inference frequency by disaggregating perception/generation modules and using controlled pipeline parallelism with shared context to maintain accuracy.
Details
Motivation: Traditional sequential computation patterns in embodied AI systems face limitations in achieving the necessary thinking frequency for real-world applications due to high-frequency input/output demands.
Method: Auras disaggregates perception and generation modules, provides controlled pipeline parallelism for high throughput, and establishes a shared public context between modules to prevent data staleness issues.
Result: Auras improves throughput by 2.54x on average while achieving 102.7% of the original accuracy, demonstrating effective high-frequency performance without sacrificing accuracy.
Conclusion: The framework successfully overcomes sequential computation constraints and provides high, stable throughput for embodied AI agents through algorithm-system co-design and context sharing.
Abstract: Embodied AI systems operate in dynamic environments, requiring seamless integration of perception and generation modules to process high-frequency input and output demands. Traditional sequential computation patterns, while effective in ensuring accuracy, face significant limitations in achieving the necessary "thinking" frequency for real-world applications. In this work, we present Auras, an algorithm-system co-designed inference framework to optimize the inference frequency of embodied AI agents. Auras disaggregates perception and generation and provides controlled pipeline parallelism for them to achieve high and stable throughput. Faced with the data staleness problem that appears when parallelism is increased, Auras establishes a public context for perception and generation to share, thereby preserving the accuracy of embodied agents. Experimental results show that Auras improves throughput by 2.54x on average while achieving 102.7% of the original accuracy, demonstrating its efficacy in overcoming the constraints of sequential computation and providing high throughput.
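A minimal threading sketch of the disaggregation idea: perception and generation run as separate pipeline stages, and generation reads the freshest completed percept from a shared context rather than waiting serially. Stage bodies are stubs; Auras's scheduler is far more sophisticated.

```python
import threading, queue, time

shared_context = {"latest_percept": None, "lock": threading.Lock()}
percept_queue = queue.Queue(maxsize=2)  # bounded: controls pipeline depth

def perception_loop(n_frames: int = 5) -> None:
    for i in range(n_frames):
        time.sleep(0.01)  # stand-in for a vision encoder
        with shared_context["lock"]:
            shared_context["latest_percept"] = f"percept_{i}"
        percept_queue.put(i)
    percept_queue.put(None)  # sentinel: done

def generation_loop() -> None:
    while (i := percept_queue.get()) is not None:
        with shared_context["lock"]:
            percept = shared_context["latest_percept"]  # freshest, not stale
        time.sleep(0.02)  # stand-in for action decoding
        print(f"action for frame {i} using {percept}")

t = threading.Thread(target=perception_loop)
t.start(); generation_loop(); t.join()
```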
[206] The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, Jonas Geiping
Main category: cs.AI
TL;DR: Scaling LLMs yields exponential benefits for long-horizon tasks through compounding accuracy gains, but models suffer from self-conditioning errors where prior mistakes increase future error likelihood, while thinking models avoid this issue.
Details
Motivation: To understand why LLMs fail at simple tasks when made longer and reconcile debates about their reasoning capabilities versus execution failures in complex problems.
Method: Isolate execution capability by providing explicit knowledge and plans for long-horizon tasks, then analyze per-step accuracy degradation and self-conditioning effects across different model sizes.
Result: Larger models execute significantly more turns correctly even with 100% single-turn accuracy in small models. Per-step accuracy degrades with step count due to self-conditioning (models become error-prone when context contains prior mistakes), which doesn’t improve with scaling alone. Thinking models avoid self-conditioning and execute longer tasks better.
Conclusion: Focusing on execution capability reveals massive benefits of scaling model size and sequential compute for long-horizon tasks, while highlighting the self-conditioning limitation that thinking models overcome.
Abstract: Does continued scaling of large language models (LLMs) yield diminishing returns? Real-world value often stems from the length of task an agent can complete. We start this work by observing the simple but counterintuitive fact that marginal gains in single-step accuracy can compound into exponential improvements in the length of a task a model can successfully complete. Then, we argue that failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason. We propose isolating execution capability, by explicitly providing the knowledge and plan needed to solve a long-horizon task. We find that larger models can correctly execute significantly more turns even when small models have 100% single-turn accuracy. We observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations – curiously, we observe a self-conditioning effect – models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning does not reduce by just scaling the model size. In contrast, recent thinking models do not self-condition, and can also execute much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of task they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.
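The compounding argument is easy to verify numerically. Assuming independent per-step success with probability p (an idealization that the paper's self-conditioning result explicitly complicates), the longest task completable at a 50% success rate is H = ln(0.5)/ln(p), which grows sharply with small gains in p:

```python
import math

def horizon_at_success(step_accuracy: float, target: float = 0.5) -> float:
    """Longest task length H with p**H >= target, assuming independent
    per-step success with probability p."""
    return math.log(target) / math.log(step_accuracy)

for p in (0.99, 0.995, 0.999):
    print(f"step accuracy {p:.3f} -> ~{horizon_at_success(p):.0f} steps at 50% task success")
# 0.990 -> ~69 steps; 0.995 -> ~138 steps; 0.999 -> ~693 steps:
# a marginal per-step gain roughly doubles (or 10x's) the usable horizon.
```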
[207] Enhancing Few-Shot Transfer Learning with Optimized Multi-Task Prompt Tuning through Modular Prompt Composition
Ahmad Pouramini, Hesham Faili
Main category: cs.AI
TL;DR: A novel multi-task prompt tuning approach that decomposes task prompts into shared source prompts and private task-specific prompts, achieving superior few-shot performance with reduced training data.
Details
Motivation: To enhance parameter-efficient transfer learning by facilitating knowledge transfer between tasks through prompt decomposition and combination in multi-task settings.
Method: Decomposes each task's prompt into shared source prompts and private task-specific prompts, then combines them through multiple methods to construct target prompts, with flexible configurations for optimization.
Result: Significant improvements in accuracy and robustness compared to conventional prompt tuning, substantially outperforming other methods in few-shot settings across GLUE benchmark and other tasks.
Conclusion: The proposed approach demonstrates superior few-shot performance with reduced training data, making it a promising method for efficient multi-task learning through effective prompt decomposition and combination strategies.
Abstract: In recent years, multi-task prompt tuning has garnered considerable attention for its inherent modularity and potential to enhance parameter-efficient transfer learning across diverse tasks. This paper aims to analyze and improve the performance of multiple tasks by facilitating the transfer of knowledge between their corresponding prompts in a multi-task setting. Our proposed approach decomposes the prompt for each target task into a combination of shared prompts (source prompts) and a task-specific prompt (private prompt). During training, the source prompts undergo fine-tuning and are integrated with the private prompt to drive the target prompt for each task. We present and compare multiple methods for combining source prompts to construct the target prompt, analyzing the roles of both source and private prompts within each method. We investigate their contributions to task performance and offer flexible, adjustable configurations based on these insights to optimize performance. Our empirical findings clearly showcase improvements in accuracy and robustness compared to the conventional practice of prompt tuning and related works. Notably, our results substantially outperform other methods in the field in few-shot settings, demonstrating superior performance on various tasks across the GLUE benchmark and beyond. This achievement is attained with a significantly reduced amount of training data, making our method a promising one for few-shot settings.
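A minimal PyTorch sketch of the decomposition described above: a target prompt built from a learned mixture of shared source prompts plus a private prompt. The weighted-sum combination rule is just one of the methods the paper compares; the shapes and initialization here are our assumptions.

```python
import torch
import torch.nn as nn

class ComposedPrompt(nn.Module):
    """Target-task soft prompt = mixture of shared source prompts + private prompt."""

    def __init__(self, num_sources: int, prompt_len: int, dim: int):
        super().__init__()
        # Shared across tasks; fine-tuned jointly.
        self.source_prompts = nn.Parameter(torch.randn(num_sources, prompt_len, dim) * 0.02)
        # Owned by this task only.
        self.private_prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
        # Learnable mixing weights over source prompts for this task.
        self.mix_logits = nn.Parameter(torch.zeros(num_sources))

    def forward(self) -> torch.Tensor:
        weights = torch.softmax(self.mix_logits, dim=0)            # (S,)
        shared = torch.einsum("s,sld->ld", weights, self.source_prompts)
        return shared + self.private_prompt                        # (L, D)

prompt = ComposedPrompt(num_sources=4, prompt_len=20, dim=768)()
print(prompt.shape)  # torch.Size([20, 768]) -- prepend to the input embeddings
```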
[208] Simulating Human-like Daily Activities with Desire-driven Autonomy
Yiding Wang, Yuxuan Chen, Fangwei Zhong, Long Ma, Yizhou Wang
Main category: cs.AI
TL;DR: D2A is a desire-driven autonomous agent that enables LLMs to autonomously propose and select tasks based on multi-dimensional human-like desires, enhancing behavioral diversity and rationality.
Details
Motivation: Current AI agents require explicit task specifications which constrain autonomy and behavioral diversity, unlike humans who are motivated by intrinsic desires.
Method: A dynamic Value System inspired by Theory of Needs, incorporating human-like desires (social interaction, personal fulfillment, self-care). Agent evaluates current state value, proposes candidate activities, and selects those aligning with intrinsic motivations.
Result: Experiments on the Concordia text-based simulator show coherent, contextually relevant daily activities with human-like variability and adaptability. Comparative analysis demonstrates significantly enhanced rationality over other LLM-based agents.
Conclusion: The desire-driven approach enables more autonomous and human-like behavior in AI agents, moving beyond explicit task specifications to intrinsic motivation systems.
Abstract: Desires motivate humans to interact autonomously with the complex world. In contrast, current AI agents require explicit task specifications, such as instructions or reward functions, which constrain their autonomy and behavioral diversity. In this paper, we introduce a Desire-driven Autonomous Agent (D2A) that can enable a large language model (LLM) to autonomously propose and select tasks, motivated by satisfying its multi-dimensional desires. Specifically, the motivational framework of D2A is mainly constructed by a dynamic Value System, inspired by the Theory of Needs. It incorporates an understanding of human-like desires, such as the need for social interaction, personal fulfillment, and self-care. At each step, the agent evaluates the value of its current state, proposes a set of candidate activities, and selects the one that best aligns with its intrinsic motivations. We conduct experiments on Concordia, a text-based simulator, to demonstrate that our agent generates coherent, contextually relevant daily activities while exhibiting variability and adaptability similar to human behavior. A comparative analysis with other LLM-based agents demonstrates that our approach significantly enhances the rationality of the simulated activities.
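A toy version of the evaluate-propose-select loop, assuming a flat dictionary of desire satisfaction levels and hand-written candidate effects; D2A's Value System and LLM-driven proposal step are far richer.

```python
# Current satisfaction of each desire dimension, in [0, 1].
DESIRES = {"social_interaction": 0.2, "personal_fulfillment": 0.7, "self_care": 0.4}

# How much each candidate activity is expected to satisfy each desire.
CANDIDATES = {
    "call a friend":      {"social_interaction": 0.9, "personal_fulfillment": 0.2, "self_care": 0.1},
    "practice the piano": {"social_interaction": 0.0, "personal_fulfillment": 0.8, "self_care": 0.2},
    "take a walk":        {"social_interaction": 0.1, "personal_fulfillment": 0.3, "self_care": 0.8},
}

def select_activity(state: dict, candidates: dict) -> str:
    # Prefer activities that satisfy the currently *least* satisfied desires.
    deficits = {d: 1.0 - v for d, v in state.items()}
    def value(effects: dict) -> float:
        return sum(deficits[d] * effects[d] for d in deficits)
    return max(candidates, key=lambda name: value(candidates[name]))

print(select_activity(DESIRES, CANDIDATES))  # -> "call a friend"
```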
[209] Effort-aware Fairness: Incorporating a Philosophy-informed, Human-centered Notion of Effort into Algorithmic Fairness Metrics
Tin Trung Nguyen, Jiannan Xu, Zora Che, Phuong-Anh Nguyen-Le, Rushil Dandamudi, Donald Braman, Furong Huang, Hal Daumé III, Zubin Jelveh
Main category: cs.AI
TL;DR: The paper proposes Effort-aware Fairness (EaF), a new fairness metric that considers individuals’ temporal effort trajectories rather than just static feature values, addressing limitations of traditional fairness metrics like demographic parity.
Details
Motivation: Traditional AI fairness metrics don't account for the effort individuals have expended to reach their current position, while philosophical and human conceptions of fairness emphasize effort as a crucial factor.
Method: Developed a philosophy-informed approach using the concept of Force to represent temporal feature trajectories with inertia. Conducted pre-registered human experiments and created computational pipelines for effort-aware individual and group fairness evaluation in criminal justice and finance contexts.
Result: Human experiments showed people prioritize temporal trajectories over aggregate feature values in fairness judgments. Developed operational frameworks to compute effort-aware fairness metrics that can identify unfair decisions against individuals who have made significant improvements.
Conclusion: The proposed Effort-aware Fairness framework enables AI auditors to detect and potentially correct unfair decisions that penalize individuals who have expended substantial effort to improve but remain disadvantaged by systemic factors beyond their control.
Abstract: Although popularized AI fairness metrics, e.g., demographic parity, have uncovered bias in AI-assisted decision-making outcomes, they do not consider how much effort one has spent to get to where one is today in the input feature space. However, the notion of effort is important in how Philosophy and humans understand fairness. We propose a philosophy-informed approach to conceptualize and evaluate Effort-aware Fairness (EaF), grounded in the concept of Force, which represents the temporal trajectory of predictive features coupled with inertia. Besides theoretical formulation, our empirical contributions include: (1) a pre-registered human subjects experiment, which shows that for both stages of the (individual) fairness evaluation process, people consider the temporal trajectory of a predictive feature more than its aggregate value; (2) pipelines to compute Effort-aware Individual/Group Fairness in the criminal justice and personal finance contexts. Our work may enable AI model auditors to uncover and potentially correct unfair decisions against individuals who have spent significant efforts to improve but are still stuck with systemic disadvantages outside their control.
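As a concrete (and deliberately simplified) illustration of scoring trajectories rather than endpoints, the sketch below weights period-to-period improvements, so a steady improver scores higher than someone who started and stayed at the same final value. The paper's Force formulation is more principled than this toy.

```python
import numpy as np

def effort_score(trajectory: np.ndarray, floor: float = 0.5) -> float:
    """Toy effort measure: weighted sum of period-to-period improvements,
    weighting recent changes more heavily (`floor` is the weight on the
    earliest change)."""
    deltas = np.diff(trajectory)
    weights = np.linspace(floor, 1.0, num=len(deltas))
    return float(np.sum(weights * deltas))

steady_improver = np.array([400., 450., 500., 550., 600.])  # e.g. a credit score
always_high     = np.array([600., 600., 600., 600., 600.])

print(effort_score(steady_improver))  # positive: visible sustained effort
print(effort_score(always_high))      # 0.0: same endpoint, no trajectory signal
```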
[210] LLMs for sensory-motor control: Combining in-context and iterative learning
Jônata Tyska Carvalho, Stefano Nolfi
Main category: cs.AI
TL;DR: A method that enables LLMs to control embodied agents by mapping continuous observations to actions, using iterative refinement with performance feedback and sensory-motor data.
Details
Motivation: To bridge the gap between large language models' symbolic reasoning capabilities and embodied agent control in continuous action spaces.
Method: LLMs generate initial control strategies from textual descriptions, then iteratively refine them using performance feedback and sensory-motor data collected during evaluation.
Result: Validated on Gymnasium and MuJoCo control tasks, the approach works effectively with compact models (GPT-oss:120b, Qwen2.5:72b) and finds optimal/near-optimal solutions.
Conclusion: The method successfully integrates symbolic knowledge from reasoning with sub-symbolic sensory-motor data for effective embodied agent control.
Abstract: We propose a method that enables large language models (LLMs) to control embodied agents by directly mapping continuous observation vectors to continuous action vectors. At the outset, the LLMs generate a control strategy based on a textual description of the agent, its environment, and the intended goal. This strategy is then iteratively refined through a learning process in which the LLMs are repeatedly prompted to improve the current strategy, using performance feedback and sensory-motor data collected during its evaluation. The method is validated on classic control tasks from the Gymnasium library and the inverted pendulum task from the MuJoCo library. The approach proves effective with relatively compact models such as Gpt-oss:120b and Qwen2.5:72b. In most cases, it successfully identifies optimal or near-optimal solutions by integrating symbolic knowledge derived through reasoning with sub-symbolic sensory-motor data gathered as the agent interacts with its environment.
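The loop structure is straightforward to express; here is a hedged skeleton in which `query_llm` and `evaluate` are placeholders for an LLM client and an environment rollout, and the prompts are illustrative rather than the paper's wording:

```python
def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def evaluate(strategy_code: str, episodes: int = 5):
    """Run the candidate controller in the environment and return
    (mean_return, sensory_motor_log). Stubbed out here."""
    raise NotImplementedError

def refine_controller(task_description: str, iterations: int = 10) -> str:
    strategy = query_llm(f"Write a Python control policy for this task:\n{task_description}")
    best, best_return = strategy, float("-inf")
    for _ in range(iterations):
        mean_return, log = evaluate(strategy)
        if mean_return > best_return:
            best, best_return = strategy, mean_return
        # Feed back both the score and raw observation/action samples.
        strategy = query_llm(
            "Improve this control strategy.\n"
            f"Current strategy:\n{strategy}\n"
            f"Mean return: {mean_return}\n"
            f"Observation/action samples:\n{log}\n"
        )
    return best
```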
[211] Optimizing Length Compression in Large Reasoning Models
Zhengxiang Cheng, Dongping Chen, Mingyang Fu, Tianyi Zhou
Main category: cs.AI
TL;DR: LC-R1 is a post-training method that reduces reasoning chain length by ~50% with only ~2% accuracy drop, using length and compress rewards to eliminate redundant “invalid thinking” while preserving critical reasoning steps.
Details
Motivation: Large Reasoning Models often produce verbose reasoning chains with unnecessary double-checking after deriving correct answers, leading to computational inefficiency.
Method: Proposes LC-R1 based on Group Relative Policy Optimization, using Length Reward for overall conciseness and Compress Reward specifically designed to remove invalid thinking portions while preserving critical reasoning steps.
Result: Achieves ~50% reduction in sequence length with only ~2% accuracy drop across multiple reasoning benchmarks, achieving optimal Pareto frontier trade-off favoring high compression.
Conclusion: LC-R1 demonstrates robust performance in creating computationally efficient reasoning models while maintaining accuracy, providing valuable insights for developing more powerful yet efficient Large Reasoning Models.
Abstract: Large Reasoning Models (LRMs) have achieved remarkable success, yet they often suffer from producing unnecessary and verbose reasoning chains. We identify a core aspect of this issue as “invalid thinking” – models tend to repeatedly double-check their work after having derived the correct answer. To address this specific inefficiency, we move beyond the general principles of Efficacy and Efficiency to propose two new, fine-grained principles: Brevity, which advocates for eliminating redundancy, and Sufficiency, which ensures critical reasoning steps are preserved. Guided by these principles, we introduce LC-R1, a post-training method based on Group Relative Policy Optimization (GRPO). LC-R1 employs a novel combination of a Length Reward for overall conciseness and a Compress Reward that is specifically designed to remove the invalid portion of the thinking process. Extensive experiments on multiple reasoning benchmarks demonstrate that LC-R1 achieves a significant reduction in sequence length (~50%) with only a marginal (~2%) drop in accuracy, achieving a favorable trade-off point on the Pareto frontier that prioritizes high compression. Our analysis further validates the robustness of LC-R1 and provides valuable insights for developing more powerful yet computationally efficient LRMs. Our code is released at https://github.com/zxiangx/LC-R1.
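One simple way to instantiate the two rewards, assuming the index of the first position where the correct answer appears is known; the scales and the detection mechanism are our assumptions, not LC-R1's exact design:

```python
def length_reward(num_tokens: int, reference_tokens: int) -> float:
    """Reward overall conciseness relative to a reference length."""
    return max(0.0, 1.0 - num_tokens / reference_tokens)

def compress_reward(tokens: list[str], first_correct_idx: int) -> float:
    """Penalize 'invalid thinking': tokens spent after the answer was
    already derived (e.g. redundant double-checking)."""
    invalid = len(tokens) - (first_correct_idx + 1)
    return -invalid / len(tokens)

trace = ["step"] * 80 + ["answer: 42"] + ["recheck"] * 40
print(length_reward(len(trace), reference_tokens=200))  # 0.395
print(compress_reward(trace, first_correct_idx=80))     # ~ -0.33
```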
[212] Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation
Jungkoo Kang
Main category: cs.AI
TL;DR: NL2Flow is an automated pipeline for generating and evaluating workflow planning problems to address the scarcity of scalable evaluation data for LLM planning and reasoning.
Details
Motivation: Progress in LLM planning and reasoning is hindered by lack of scalable evaluation data. Robust workflow composition is critical for effective agent performance.
Method: NL2Flow generates problems parametrically in structured intermediate representation, translates them to natural language and PDDL. Evaluates open-source LLMs on 2296 low-difficulty problems. Tests neuro-symbolic integration by translating natural language to JSON before symbolic planning.
Result: Best model achieved 86% success in valid plans and 69% in optimal plans. Neuro-symbolic integration (NL to JSON translation) significantly improved success rates. Problem characteristics affect plan generation depending on model and prompt design.
Conclusion: Understanding error sources in LLM reasoning is crucial as systems scale to complex tasks. Neuro-symbolic integration shows benefits for workflow planning. Shifting bottlenecks and error sources need attention as complexity increases.
Abstract: Robust workflow composition is critical for effective agent performance, yet progress in Large Language Model (LLM) planning and reasoning is hindered by a scarcity of scalable evaluation data. This work introduces NL2Flow, a fully automated pipeline for generating and evaluating workflow planning problems. NL2Flow generates problems parametrically in a structured intermediate representation, translating them into both natural language and formal PDDL. I evaluate several open-source, instruct-tuned LLMs on a dataset of 2296 low-difficulty problems generated by NL2Flow. Results demonstrate that the best-performing model achieved 86% success in generating valid plans and 69% in generating optimal plans (for solvable problems). Regression analysis shows that the influence of problem characteristics on plan generation is contingent on both model and prompt design. Importantly, translating natural language problems into a structured JSON representation prior to symbolic planning significantly improved success rates, suggesting a benefit from neuro-symbolic integration. These findings underscore the importance of understanding error sources within LLM reasoning as systems scale to more complex tasks. As LLM reasoning scales to increasingly complex problems, understanding the shifting bottlenecks and sources of error within these systems will be crucial.
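To illustrate the parametric-generation idea, the toy below renders one structured spec as both natural language and PDDL. The spec schema and the navigation domain are invented for the example; NL2Flow targets workflow domains with a richer intermediate representation.

```python
spec = {"agent": "bot", "start": "room1", "goal": "room3",
        "connections": [("room1", "room2"), ("room2", "room3")]}

def to_natural_language(s: dict) -> str:
    hops = "; ".join(f"{a} connects to {b}" for a, b in s["connections"])
    return f"{s['agent']} starts in {s['start']} and must reach {s['goal']}. {hops}."

def to_pddl(s: dict) -> str:
    rooms = " ".join(sorted({r for conn in s["connections"] for r in conn}))
    conns = " ".join(f"(connected {a} {b})" for a, b in s["connections"])
    return (f"(define (problem nav-1) (:domain navigation)\n"
            f"  (:objects {s['agent']} - agent {rooms} - room)\n"
            f"  (:init (at {s['agent']} {s['start']}) {conns})\n"
            f"  (:goal (at {s['agent']} {s['goal']})))")

print(to_natural_language(spec))
print(to_pddl(spec))
```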
[213] KROMA: Ontology Matching with Knowledge Retrieval and Large Language Models
Lam Nguyen, Erika Barcelos, Roger French, Yinghui Wu
Main category: cs.AI
TL;DR: KROMA is a novel ontology matching framework that combines LLMs with RAG to dynamically enrich semantic context using structural, lexical, and definitional knowledge, achieving superior performance while maintaining efficiency through optimization techniques.
Details
Motivation: Existing ontology matching systems rely on handcrafted rules or specialized models with limited adaptability, creating a need for more flexible and effective approaches.
Method: Uses Large Language Models within a Retrieval-Augmented Generation pipeline, integrating bisimilarity-based concept matching and lightweight ontology refinement to prune candidates and reduce LLM communication overhead.
Result: Outperforms both classic OM systems and cutting-edge LLM-based approaches on multiple benchmark datasets while keeping communication overhead comparable.
Conclusion: The study demonstrates the feasibility and benefits of targeted knowledge retrieval, prompt enrichment, and ontology refinement techniques for scalable ontology matching.
Abstract: Ontology Matching (OM) is a cornerstone task of semantic interoperability, yet existing systems often rely on handcrafted rules or specialized models with limited adaptability. We present KROMA, a novel OM framework that harnesses Large Language Models (LLMs) within a Retrieval-Augmented Generation (RAG) pipeline to dynamically enrich the semantic context of OM tasks with structural, lexical, and definitional knowledge. To optimize both performance and efficiency, KROMA integrates a bisimilarity-based concept matching and a lightweight ontology refinement step, which prune candidate concepts and substantially reduce the communication overhead from invoking LLMs. Through experiments on multiple benchmark datasets, we show that integrating knowledge retrieval with context-augmented LLMs significantly enhances ontology matching, outperforming both classic OM systems and cutting-edge LLM-based approaches while keeping communication overhead comparable. Our study highlights the feasibility and benefit of the proposed optimization techniques (targeted knowledge retrieval, prompt enrichment, and ontology refinement) for ontology matching at scale.
[214] Robix: A Unified Model for Robot Interaction, Reasoning and Planning
Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, Hang Li
Main category: cs.AI
TL;DR: Robix is a unified vision-language model that integrates robot reasoning, task planning, and natural language interaction in a single architecture, outperforming commercial baselines like GPT-4o and Gemini 2.5 Pro in interactive task execution.
Details
Motivation: To create a unified cognitive layer for robots that can handle complex instructions, plan long-horizon tasks, and interact naturally with humans through an end-to-end framework with capabilities like proactive dialogue and real-time interruption handling.
Method: Three-stage training strategy: (1) continued pretraining for embodied reasoning abilities, (2) supervised finetuning to model human-robot interaction and task planning as unified reasoning-action sequences, (3) reinforcement learning for reasoning-action consistency and task coherence.
Result: Outperforms both open-source and commercial baselines in interactive task execution, demonstrating strong generalization across diverse instruction types and various user-involved tasks like table bussing, grocery shopping, and dietary filtering.
Conclusion: Robix successfully integrates reasoning, planning, and interaction capabilities within a single vision-language architecture, enabling more natural and effective human-robot collaboration with advanced features like proactive dialogue and real-time interruption handling.
Abstract: We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments demonstrate that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution, demonstrating strong generalization across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.
[215] TreeGPT: Pure TreeFFN Encoder-Decoder Architecture for Structured Reasoning Without Attention Mechanisms
Zixi Li
Main category: cs.AI
TL;DR: TreeGPT is an attention-free neural architecture using bidirectional TreeFFN encoder-decoder for structured reasoning tasks, achieving 99% accuracy on ARC Prize 2025 with only 3.16M parameters.
Details
Motivation: To explore the potential of pure TreeFFN encoder-decoder designs as an alternative to attention-based transformers, aiming for computational efficiency while maintaining reasoning capabilities for structured tasks.
Method: Uses bidirectional TreeFFN components with encoder processing left-to-right dependencies and decoder handling right-to-left patterns through simple neighbor-to-neighbor connections, eliminating attention computation while enabling parallel processing.
Result: Achieves 99% validation accuracy on ARC Prize 2025 dataset using only 3.16M parameters, converges within 1500 training steps, and demonstrates 100% token-level accuracy on selected evaluation samples.
Conclusion: Specialized TreeFFN architectures may offer advantages over attention-based approaches for certain structured reasoning tasks, though further investigation across diverse tasks is needed to establish broader applicability.
Abstract: We present TreeGPT, an attention-free neural architecture that explores the potential of pure TreeFFN encoder-decoder design for structured reasoning tasks. Unlike traditional transformer approaches that rely on attention mechanisms, TreeGPT employs bidirectional TreeFFN components that process sequences through adjacent connections in parallel, aiming to achieve computational efficiency while maintaining reasoning capabilities. Our approach centers on a TreeFFN Encoder-Decoder mechanism: $$\text{Encoder TreeFFN (L} \rightarrow \text{R)} + \text{Decoder TreeFFN (R} \leftarrow \text{L)} \rightarrow \text{Parallel Processing}$$ where the encoder processes left-to-right dependencies while the decoder handles right-to-left patterns, both using simple neighbor-to-neighbor connections. This design eliminates attention computation while maintaining sequence modeling capabilities. We evaluate our approach on the ARC Prize 2025 dataset, where TreeGPT achieves 99% validation accuracy using 3.16M parameters. The model converges within 1500 training steps and demonstrates 100% token-level accuracy on selected evaluation samples. Our preliminary results suggest that for certain structured reasoning tasks, specialized TreeFFN architectures may offer advantages over attention-based approaches. While these findings are encouraging, we acknowledge that further investigation across diverse tasks and datasets would be valuable to establish the broader applicability of attention-free designs.
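Our reading of the neighbor-to-neighbor design, as a PyTorch sketch: each position mixes its own state with one adjacent neighbor (left for the encoder, right for the decoder) and applies an FFN, giving O(T) work per layer with full parallelism. This illustrates the described mechanism; it is not TreeGPT's verified code.

```python
import torch
import torch.nn as nn

class NeighborFFN(nn.Module):
    """Attention-free layer with simple neighbor-to-neighbor connections."""

    def __init__(self, dim: int, left_to_right: bool = True):
        super().__init__()
        self.left_to_right = left_to_right
        self.neighbor_proj = nn.Linear(dim, dim, bias=False)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, T, D)
        shift = 1 if self.left_to_right else -1
        neighbor = torch.roll(x, shifts=shift, dims=1)
        # Zero out the wrapped-around position so the sequence edge sees no neighbor.
        if self.left_to_right:
            neighbor[:, 0] = 0.0
        else:
            neighbor[:, -1] = 0.0
        h = x + self.neighbor_proj(neighbor)
        return x + self.ffn(self.norm(h))  # O(T) per layer, fully parallel

encoder_layer = NeighborFFN(dim=64, left_to_right=True)   # L -> R dependencies
decoder_layer = NeighborFFN(dim=64, left_to_right=False)  # R <- L dependencies
print(encoder_layer(torch.randn(2, 10, 64)).shape)        # torch.Size([2, 10, 64])
```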
[216] CogGuide: Human-Like Guidance for Zero-Shot Omni-Modal Reasoning
Zhou-Peng Shou, Zhi-Qiang You, Fang Wang, Hai-Bo Liu
Main category: cs.AI
TL;DR: Zero-shot multimodal reasoning component using human-like cognitive strategies with ‘intent sketch’ to suppress shortcut reasoning and improve contextual understanding without parameter fine-tuning.
Details
Motivation: Address issues of 'shortcuts' and insufficient contextual understanding in complex cross-modal reasoning of multimodal large models.
Method: Plug-and-play three-module pipeline (Intent Perceiver, Strategy Generator, Strategy Selector) that constructs 'understand-plan-select' cognitive process using intent sketch strategies through in-context engineering.
Result: Achieves consistent improvements across different reasoning engines with gains up to 9.51 percentage points on IntentBench, WorldSense, and Daily-Omni benchmarks.
Conclusion: The intent sketch reasoning component demonstrates practical value and portability in zero-shot scenarios by reducing conditional entropy and improving information utilization efficiency.
Abstract: Targeting the issues of "shortcuts" and insufficient contextual understanding in complex cross-modal reasoning of multimodal large models, this paper proposes a zero-shot multimodal reasoning component guided by human-like cognitive strategies centered on an "intent sketch". The component comprises a plug-and-play three-module pipeline (Intent Perceiver, Strategy Generator, and Strategy Selector) that explicitly constructs an "understand-plan-select" cognitive process. By generating and filtering "intent sketch" strategies to guide the final reasoning, it requires no parameter fine-tuning and achieves cross-model transfer solely through in-context engineering. Information-theoretic analysis shows that this process can reduce conditional entropy and improve information utilization efficiency, thereby suppressing unintended shortcut reasoning. Experiments on IntentBench, WorldSense, and Daily-Omni validate the method's generality and robust gains; compared with their respective baselines, the complete "three-module" scheme yields consistent improvements across different reasoning engines and pipeline combinations, with gains up to approximately 9.51 percentage points, demonstrating the practical value and portability of the "intent sketch" reasoning component in zero-shot scenarios.
[217] Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference
Xiangwei Shen, Zhimin Li, Zhantao Yang, Shiyi Zhang, Yingfang Zhang, Donghao Li, Chunyu Wang, Qinglin Lu, Yansong Tang
Main category: cs.AI
TL;DR: Direct-Align method with Semantic Relative Preference Optimization (SRPO) improves diffusion model alignment with human preferences by avoiding expensive multistep denoising and enabling online reward adjustment through prompt augmentation.
Details
Motivation: Existing methods for aligning diffusion models with human preferences suffer from computational inefficiency due to multistep denoising and require continuous offline adaptation of reward models to achieve desired aesthetic quality.
Method: Proposes Direct-Align method that predefines noise prior to recover original images via interpolation, avoiding multistep denoising. Introduces SRPO where rewards are text-conditioned signals enabling online adjustment through positive/negative prompt augmentation.
Result: Fine-tuning FLUX model with the proposed approach improves human-evaluated realism and aesthetic quality by over 3x compared to previous methods.
Conclusion: The combined Direct-Align and SRPO approach effectively addresses computational limitations and offline adaptation requirements of previous human preference alignment methods for diffusion models.
Abstract: Recent studies have demonstrated the effectiveness of directly aligning diffusion models with human preferences using differentiable reward. However, they exhibit two primary challenges: (1) they rely on multistep denoising with gradient computation for reward scoring, which is computationally expensive, thus restricting optimization to only a few diffusion steps; (2) they often need continuous offline adaptation of reward models in order to achieve desired aesthetic quality, such as photorealism or precise lighting effects. To address the limitation of multistep denoising, we propose Direct-Align, a method that predefines a noise prior to effectively recover original images from any time step via interpolation, leveraging the equation that diffusion states are interpolations between noise and target images, which effectively avoids over-optimization in late timesteps. Furthermore, we introduce Semantic Relative Preference Optimization (SRPO), in which rewards are formulated as text-conditioned signals. This approach enables online adjustment of rewards in response to positive and negative prompt augmentation, thereby reducing the reliance on offline reward fine-tuning. By fine-tuning the FLUX model with optimized denoising and online reward adjustment, we improve its human-evaluated realism and aesthetic quality by over 3x.
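The interpolation identity that Direct-Align leans on is easy to check numerically: if x_t = (1 - t)·x0 + t·eps with a predefined (known) noise eps, then x0 is recoverable in closed form from any timestep. Scales and schedules below are illustrative, not FLUX's.

```python
import torch

torch.manual_seed(0)
x0 = torch.randn(4, 3, 8, 8)  # stand-in "image" batch
eps = torch.randn_like(x0)    # predefined noise prior

for t in (0.1, 0.5, 0.9):
    x_t = (1 - t) * x0 + t * eps        # interpolated diffusion state
    x0_hat = (x_t - t * eps) / (1 - t)  # closed-form recovery
    print(t, torch.allclose(x0_hat, x0, atol=1e-6))  # True at every t
```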
cs.SD
[218] In situ estimation of the acoustic surface impedance using simulation-based inference
Jonas M. Schmid, Johannes D. Schmid, Martin Eser, Steffen Marburg
Main category: cs.SD
TL;DR: Bayesian framework for in situ estimation of acoustic surface impedances from sparse sound pressure measurements using simulation-based inference and neural networks.
Details
Motivation: Conventional measurement techniques for acoustic boundary conditions rely on simplifying assumptions that limit validity for real-world scenarios, requiring more robust methods.
Method: Uses simulation-based inference with neural networks to map simulated data to posterior distributions of model parameters. Impedance is modeled with a damped oscillator extended with fractional calculus term.
Result: Achieved robust and accurate estimation of all six individual impedances in verification tests. Demonstrated reliable uncertainty quantification and high predictive accuracy in complex car cabin model.
Conclusion: The method provides well-calibrated inference for generalizable, efficient, and physically consistent characterization of acoustic boundary conditions in real-world interior environments.
Abstract: Accurate acoustic simulations of enclosed spaces require precise boundary conditions, typically expressed through surface impedances for wave-based methods. Conventional measurement techniques often rely on simplifying assumptions about the sound field and mounting conditions, limiting their validity for real-world scenarios. To overcome these limitations, this study introduces a Bayesian framework for the in situ estimation of frequency-dependent acoustic surface impedances from sparse interior sound pressure measurements. The approach employs simulation-based inference, which leverages the expressiveness of modern neural network architectures to directly map simulated data to posterior distributions of model parameters, bypassing conventional sampling-based Bayesian approaches and offering advantages for high-dimensional inference problems. Impedance behavior is modeled using a damped oscillator model extended with a fractional calculus term. The framework is verified on a finite element model of a cuboid room and further tested with impedance tube measurements used as reference, achieving robust and accurate estimation of all six individual impedances. Application to a numerical car cabin model further demonstrates reliable uncertainty quantification and high predictive accuracy even for complex-shaped geometries. Posterior predictive checks and coverage diagnostics confirm well-calibrated inference, highlighting the method’s potential for generalizable, efficient, and physically consistent characterization of acoustic boundary conditions in real-world interior environments.
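The impedance model family named in the abstract can be written down compactly; the parameterization below (resistance, mass, stiffness, plus a fractional-order term) is our guess at its general shape, not the paper's exact formulation:

```python
import numpy as np

def surface_impedance(omega: np.ndarray, r: float, m: float,
                      k: float, c: float, alpha: float) -> np.ndarray:
    """Damped-oscillator impedance extended with a fractional term c*(j*omega)**alpha."""
    jw = 1j * omega
    return r + m * jw + k / jw + c * jw**alpha

freqs = np.linspace(50, 2000, 5) * 2 * np.pi  # angular frequencies (rad/s)
print(surface_impedance(freqs, r=400.0, m=0.5, k=1e6, c=30.0, alpha=0.5))
```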
[219] MoLEx: Mixture of LoRA Experts in Speech Self-Supervised Models for Audio Deepfake Detection
Zihan Pan, Sailor Hardik Bhupendra, Jinyang Wu
Main category: cs.SD
TL;DR: MoLEx framework combines Low-Rank Adaptation with Mixture-of-Experts to efficiently finetune SSL models for audio deepfake detection, achieving SOTA performance with reduced computational costs.
Details
Motivation: Fully finetuning self-supervised learning models for audio deepfake detection is computationally expensive, requiring a more parameter-efficient approach that preserves pre-trained knowledge while adapting to new tasks.
Method: Proposes Mixture of LoRA Experts (MoLEx) framework that combines Low-Rank Adaptation with a Mixture-of-Experts router, allowing selective finetuning of experts while preserving SSL model knowledge.
Result: Achieves state-of-the-art equal error rate of 5.56% on ASVSpoof 5 evaluation set without augmentation, demonstrating domain-aware adaptability where router activates same experts for similar attacks and switches for novel spoofs.
Conclusion: MoLEx provides an efficient and flexible parameter-efficient finetuning framework for audio deepfake detection that maintains robust performance while significantly reducing training costs and enabling easy domain adaptation.
Abstract: While self-supervised learning (SSL)-based models have boosted audio deepfake detection accuracy, fully finetuning them is computationally expensive. To address this, we propose a parameter-efficient framework that combines Low-Rank Adaptation with a Mixture-of-Experts router, called Mixture of LoRA Experts (MoLEx). It preserves pre-trained knowledge of SSL models while efficiently finetuning only selected experts, reducing training costs while maintaining robust performance. The observed utility of experts during inference shows the router reactivates the same experts for similar attacks but switches to other experts for novel spoofs, confirming MoLEx’s domain-aware adaptability. MoLEx additionally offers flexibility for domain adaptation by allowing extra experts to be trained without modifying the entire model. We mainly evaluate our approach on the ASVSpoof 5 dataset and achieve the state-of-the-art (SOTA) equal error rate (EER) of 5.56% on the evaluation set without augmentation.
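A compact PyTorch sketch of the mixture-of-LoRA-experts idea: a frozen base linear map plus K low-rank adapters, with a router picking (here, top-1) which expert adapts each input. Dimensions, routing, and initialization are assumptions for illustration, not MoLEx's actual configuration.

```python
import torch
import torch.nn as nn

class MoLELinear(nn.Module):
    """Frozen base layer + K LoRA experts selected by a learned router."""

    def __init__(self, dim: int, num_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)  # frozen pre-trained (SSL) weights
        self.A = nn.Parameter(torch.randn(num_experts, dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, rank, dim))  # zero-init: starts as identity
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, D)
        gate = torch.softmax(self.router(x), dim=-1)     # (B, K)
        w, top = gate.max(dim=-1)                        # top-1 expert per input
        lora = torch.einsum("bd,bdr,brk->bk", x, self.A[top], self.B[top])
        return self.base(x) + w.unsqueeze(-1) * lora

layer = MoLELinear(dim=16)
print(layer(torch.randn(5, 16)).shape)  # torch.Size([5, 16])
```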
[220] DeCodec: Rethinking Audio Codecs as Universal Disentangled Representation Learners
Xiaoxue Luo, Jinwei Huang, Runyan Yang, Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang
Main category: cs.SD
TL;DR: DeCodec is a universal neural audio codec that learns disentangled representations, separating speech and background sounds into orthogonal subspaces, with speech further decomposed into semantic and paralinguistic components for flexible feature selection across audio tasks.
Details
Motivation: Real-world audio often contains mixed speech and background sounds, but existing codecs either learn entangled representations or are limited to specific audio types. Downstream tasks require selective access to different audio components, necessitating a universal disentangled representation learner.
Method: Built on a codec framework with two key innovations: 1) subspace orthogonal projection module that factorizes input into orthogonal subspaces for speech and background, 2) representation swap training to ensure subspaces correlate to specific components. Uses parallel RVQs for independent quantization and semantic guidance for speech decomposition.
Result: Maintains advanced signal reconstruction while enabling new capabilities: superior speech enhancement, effective one-shot voice conversion on noisy speech, improved ASR robustness through clean semantic features, and controllable background sound preservation/suppression in TTS.
Conclusion: DeCodec successfully learns hierarchical disentangled audio representations, serving as a universal front-end for multiple audio applications with flexible feature selection and improved performance across various tasks.
Abstract: Universal audio codecs learn entangled representations across audio types, whereas some specific codecs offer decoupled representations but are limited to speech. Real-world audio, however, often contains mixed speech and background sounds, and downstream tasks require selective access to these components. Therefore, we rethink the audio codec as a universal disentangled representation learner to enable controllable feature selection across different audio tasks. To this end, we introduce DeCodec, a novel neural codec that learns to decouple audio representations into orthogonal subspaces dedicated to speech and background sound, and within speech, representations are further decomposed into semantic and paralinguistic components. This hierarchical disentanglement allows flexible feature selection, making DeCodec a universal front-end for multiple audio applications. Technically, built upon a codec framework, DeCodec incorporates two key innovations: a subspace orthogonal projection module that factorizes the input into two decoupled orthogonal subspaces, and a representation swap training procedure that ensures these two subspaces correspond to speech and background sound, respectively. This allows parallel RVQs to quantize speech and background sound components independently. Furthermore, we apply semantic guidance to the speech RVQ to achieve semantic and paralinguistic decomposition. Experimental results show that DeCodec maintains advanced signal reconstruction while enabling new capabilities: superior speech enhancement and effective one-shot voice conversion on noisy speech via representation recombination, improved ASR robustness through clean semantic features, and controllable background sound preservation/suppression in TTS. Demo Page: https://luo404.github.io/DeCodecV2/
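A minimal sketch of the subspace factorization, assuming two learned projections with a soft orthogonality penalty between their weight matrices; DeCodec's actual module and training procedure may differ:

```python
import torch
import torch.nn as nn

class SubspaceSplit(nn.Module):
    """Factorize a codec latent into two (approximately) orthogonal subspaces,
    one intended for speech and one for background sound."""

    def __init__(self, dim: int):
        super().__init__()
        self.P_speech = nn.Linear(dim, dim, bias=False)
        self.P_background = nn.Linear(dim, dim, bias=False)

    def forward(self, z: torch.Tensor):
        zs, zb = self.P_speech(z), self.P_background(z)
        # Encourage orthogonality by penalizing cross-correlation between
        # the two projection matrices.
        cross = self.P_speech.weight @ self.P_background.weight.T
        ortho_loss = (cross ** 2).mean()
        return zs, zb, ortho_loss

split = SubspaceSplit(dim=128)
zs, zb, loss = split(torch.randn(2, 50, 128))
print(zs.shape, zb.shape, float(loss))  # each branch feeds its own RVQ
```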
[221] Bona fide Cross Testing Reveals Weak Spot in Audio Deepfake Detection Systems
Chin Yuen Kwok, Jia Qi Yip, Zhen Qiu, Chi Hung Chi, Kwok Yan Lam
Main category: cs.SD
TL;DR: Proposes bona fide cross-testing framework to address limitations in audio deepfake detection evaluation, using diverse bona fide datasets and aggregated EERs for more balanced assessments.
Details
Motivation: Traditional audio deepfake detection evaluation disproportionately weights synthesizers with more samples and lacks diversity in bona fide speech, limiting real-world simulation.
Method: Developed bona fide cross-testing framework that incorporates diverse bona fide datasets and aggregates EERs across multiple synthesizers for balanced evaluation.
Result: Benchmarked over 150 synthesizers across nine bona fide speech types, demonstrating improved robustness and interpretability compared to traditional methods.
Conclusion: The proposed evaluation framework provides more reliable assessment of audio deepfake detection models and includes a new released dataset to support further research.
Abstract: Audio deepfake detection (ADD) models are commonly evaluated using datasets that combine multiple synthesizers, with performance reported as a single Equal Error Rate (EER). However, this approach disproportionately weights synthesizers with more samples, underrepresenting others and reducing the overall reliability of EER. Additionally, most ADD datasets lack diversity in bona fide speech, often featuring a single environment and speech style (e.g., clean read speech), limiting their ability to simulate real-world conditions. To address these challenges, we propose bona fide cross-testing, a novel evaluation framework that incorporates diverse bona fide datasets and aggregates EERs for more balanced assessments. Our approach improves robustness and interpretability compared to traditional evaluation methods. We benchmark over 150 synthesizers across nine bona fide speech types and release a new dataset to facilitate further research at https://github.com/cyaaronk/audio_deepfake_eval.
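The aggregation idea can be shown on synthetic scores: compute one EER per (bona fide type, synthesizer) pair and average, instead of pooling all trials into a single EER that over-weights large synthesizers. The scores below are toy Gaussians, purely for illustration.

```python
import numpy as np

def eer(bona: np.ndarray, spoof: np.ndarray) -> float:
    """Equal error rate via threshold sweep (higher score = more bona fide)."""
    thresholds = np.sort(np.concatenate([bona, spoof]))
    far = np.array([np.mean(spoof >= t) for t in thresholds])  # spoofs accepted
    frr = np.array([np.mean(bona < t) for t in thresholds])    # bona fide rejected
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2)

rng = np.random.default_rng(0)
bona_sets = {"clean_read": rng.normal(2.0, 1.0, 500),
             "noisy_spontaneous": rng.normal(1.2, 1.2, 500)}
synths = {f"synth_{k}": rng.normal(-1.0 + 0.3 * k, 1.0, 300) for k in range(3)}

# One EER per (bona fide type, synthesizer) pair, then a plain average, so no
# synthesizer or speech style dominates by sample count.
per_pair = [eer(b, s) for b in bona_sets.values() for s in synths.values()]
print(f"aggregated EER: {np.mean(per_pair):.3f}")
```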
[222] Adaptive Knowledge Distillation using a Device-Aware Teacher for Low-Complexity Acoustic Scene Classification
Seung Gyu Jeong, Seong Eun Kim
Main category: cs.SD
TL;DR: A knowledge distillation system with device-aware feature alignment and test-time device fine-tuning achieves 57.93% accuracy for low-complexity acoustic scene classification, significantly outperforming the baseline especially on unseen devices.
Details
Motivation: To address the dual challenges of strict computational complexity constraints and robust generalization to both seen and unseen recording devices in acoustic scene classification, while leveraging the new rule allowing device labels at test time.
Method: A knowledge distillation framework where an efficient CP-MobileNet student learns from a two-teacher ensemble: a baseline PaSST teacher and a ‘generalization expert’ teacher trained with novel Device-Aware Feature Alignment (DAFA) loss for device robustness, followed by device-specific fine-tuning using test-time device labels.
Result: The system achieves 57.93% accuracy on the development set, showing significant improvement over the official baseline, with particularly strong performance on unseen devices.
Conclusion: The proposed knowledge distillation approach with device-aware feature alignment and test-time device fine-tuning effectively addresses complexity constraints while achieving robust generalization across devices in acoustic scene classification tasks.
Abstract: In this technical report, we describe our submission for Task 1, Low-Complexity Device-Robust Acoustic Scene Classification, of the DCASE 2025 Challenge. Our work tackles the dual challenges of strict complexity constraints and robust generalization to both seen and unseen devices, while also leveraging the new rule allowing the use of device labels at test time. Our proposed system is based on a knowledge distillation framework where an efficient CP-MobileNet student learns from a compact, specialized two-teacher ensemble. This ensemble combines a baseline PaSST teacher, trained with standard cross-entropy, and a ‘generalization expert’ teacher. This expert is trained using our novel Device-Aware Feature Alignment (DAFA) loss, adapted from prior work, which explicitly structures the feature space for device robustness. To capitalize on the availability of test-time device labels, the distilled student model then undergoes a final device-specific fine-tuning stage. Our proposed system achieves a final accuracy of 57.93% on the development set, demonstrating a significant improvement over the official baseline, particularly on unseen devices.
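A minimal sketch of a two-teacher distillation loss, assuming the teachers' tempered distributions are simply averaged; the report's actual fusion of the PaSST and DAFA teachers may differ.

```python
import torch
import torch.nn.functional as F

def two_teacher_kd_loss(student_logits, t1_logits, t2_logits, labels, T=2.0, alpha=0.5):
    # average the two teachers' softened distributions (fusion rule is an assumption)
    soft = 0.5 * (F.softmax(t1_logits / T, -1) + F.softmax(t2_logits / T, -1))
    kd = F.kl_div(F.log_softmax(student_logits / T, -1), soft,
                  reduction="batchmean") * T * T
    return alpha * kd + (1 - alpha) * F.cross_entropy(student_logits, labels)
```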
[223] Efficient Transformer-Based Piano Transcription With Sparse Attention Mechanisms
Weixing Wei, Kazuyoshi Yoshii
Main category: cs.SD
TL;DR: Efficient Transformer architecture with sparse attention mechanisms for piano transcription that reduces computational costs while maintaining performance comparable to full-attention models.
Details
Motivation: Existing transformer-based piano transcription models cannot process entire musical pieces at once due to quadratic complexity of self-attention, requiring sliding-window processing which limits context understanding.
Method: Proposed efficient architecture with sliding-window self-attention for encoder/decoder, hybrid global-local cross-attention based on MIDI token types, and hierarchical pooling between encoder and decoder to reduce computational load.
Result: Significant reduction in computational cost and memory usage, accelerated inference speed, while maintaining transcription performance comparable to full-attention baseline on MAESTRO dataset.
Conclusion: Sparse attention mechanisms enable training with longer audio contexts on same hardware, demonstrating viability for building efficient high-performance piano transcription systems.
Abstract: This paper investigates automatic piano transcription based on computationally-efficient yet high-performant variants of the Transformer that can capture longer-term dependency over the whole musical piece. Recently, transformer-based sequence-to-sequence models have demonstrated excellent performance in piano transcription. These models, however, fail to deal with the whole piece at once due to the quadratic complexity of the self-attention mechanism, and music signals are thus typically processed in a sliding-window manner in practice. To overcome this limitation, we propose an efficient architecture with sparse attention mechanisms. Specifically, we introduce sliding-window self-attention mechanisms for both the encoder and decoder, and a hybrid global-local cross-attention mechanism that attends to various spans according to the MIDI token types. We also use a hierarchical pooling strategy between the encoder and decoder to further reduce computational load. Our experiments on the MAESTRO dataset showed that the proposed model achieved a significant reduction in computational cost and memory usage, accelerating inference speed, while maintaining transcription performance comparable to the full-attention baseline. This allows for training with longer audio contexts on the same hardware, demonstrating the viability of sparse attention for building efficient and high-performance piano transcription systems. The code is available at https://github.com/WX-Wei/efficient-seq2seq-piano-trans.
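The sliding-window constraint can be expressed as a banded attention mask; a minimal sketch, with the window size and shapes chosen for illustration:

```python
import torch

def sliding_window_mask(seq_len, window):
    # position i may attend to position j only when |i - j| <= window
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

# usage: Q, K are (seq_len, d) projections
# scores = (Q @ K.T / d**0.5).masked_fill(~sliding_window_mask(seq_len, 64), float("-inf"))
```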
[224] Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates
Harry Julia, Rachel Beeson, Lohith Konathala, Johanna Ulin, Jiameng Gao
Main category: cs.SD
TL;DR: NeuCodec is an FSQ-based neural audio codec that shows superior robustness to channel noise compared to RVQ codecs, with built-in redundancy that allows different encoders to produce vastly different codes while maintaining similar reconstruction quality.
Details
Motivation: Neural audio codecs are widely used in speech processing but most rely on Residual Vector Quantization (RVQ). Finite Scalar Quantization (FSQ) offers a simpler alternative with single codebook support, and the authors want to explore its robustness properties for noisy channel transmission.
Method: The authors introduce NeuCodec, an FSQ-based neural audio codec. They conduct encoder distillation experiments to show different encoders can produce different codes with similar quality, and compare FSQ and RVQ performance under simulated noisy channel conditions with bit-level perturbations.
Result: FSQ encodes baked-in redundancy that makes it robust to noisy channels. Different encoders learned to produce vastly different code sequences while maintaining comparable reconstruction quality. FSQ showed vastly superior bit-level perturbation robustness compared to RVQ codecs.
Conclusion: FSQ-based neural audio codecs like NeuCodec offer significant advantages for noisy channel transmission scenarios due to their inherent redundancy and robustness properties, making them a compelling alternative to traditional RVQ-based approaches.
Abstract: Neural Audio Codecs (NACs) have become increasingly adopted in speech processing tasks due to their excellent rate-distortion performance and compatibility with Large Language Models (LLMs) as discrete feature representations for audio generation. While most existing codecs rely on Residual Vector Quantization (RVQ), Finite Scalar Quantization (FSQ) has recently emerged as a compelling alternative that simplifies training and natively supports single codebooks. We introduce NeuCodec, an FSQ-based NAC, and show that FSQ encodes baked-in redundancy, producing encodings that are robust when transmitted through noisy channels. First, through an encoder distillation experiment, we show that two different encoders can learn to encode identical audio into vastly different code sequences whilst maintaining comparable reconstruction quality with the same quantizer and decoder. Second, we demonstrate that FSQ has vastly superior bit-level perturbation robustness by comparing the performance of RVQ and FSQ codecs when simulating the transmission of code sequences through a noisy channel.
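For intuition, a minimal FSQ step in the spirit of the published formulation: bound each latent dimension, snap it to a small uniform grid, and keep gradients flowing with a straight-through estimator (the level count and function name are illustrative):

```python
import torch

def fsq(z, levels=7):
    # odd `levels` keeps the integer grid symmetric around zero
    half = (levels - 1) / 2
    zb = torch.tanh(z) * half        # each dimension now lies in (-half, half)
    zq = torch.round(zb)             # nearest of `levels` grid points
    return zb + (zq - zb).detach()   # straight-through: forward uses zq, backward uses zb
```

One plausible intuition for the reported robustness is that each dimension is quantized independently against a fixed grid, so a bit-level perturbation moves a code to a nearby grid point rather than to an arbitrary codebook entry.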
[225] DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech
Ngoc-Son Nguyen, Hieu-Nghia Huynh-Nguyen, Thanh V. T. Tran, Truong-Son Hy, Van Nguyen
Main category: cs.SD
TL;DR: DiFlow-TTS is the first discrete flow matching model for zero-shot TTS that achieves fast inference (25.8x speedup) while maintaining high quality in naturalness, prosody, and speaker style preservation.
Details
Motivation: Existing zero-shot TTS methods suffer from slow inference and repetition artifacts. While discrete codec representations show promise, current flow-matching approaches embed tokens into continuous space instead of fully leveraging discrete representations.
Method: Proposes DiFlow-TTS using purely discrete flow matching with factorized attribute modeling. Uses in-context learning with text content and extracted prosodic/acoustic attributes from reference speech. Employs separate prediction heads for prosody and acoustic details.
Result: Achieves promising performance in naturalness, prosody, speaker style preservation, and energy control. Maintains compact model size with low-latency inference, generating speech 25.8x faster than latest baselines.
Conclusion: Discrete flow matching is effective for zero-shot TTS, enabling fast high-quality synthesis while preserving speaker attributes through explicit factorized modeling of speech characteristics.
Abstract: Zero-shot Text-to-Speech (TTS) aims to synthesize high-quality speech that mimics the voice of an unseen speaker using only a short reference sample, requiring not only speaker adaptation but also accurate modeling of prosodic attributes. Recent approaches based on language models, diffusion, and flow matching have shown promising results in zero-shot TTS, but still suffer from slow inference and repetition artifacts. Discrete codec representations have been widely adopted for speech synthesis, and recent works have begun to explore diffusion models in purely discrete settings, suggesting the potential of discrete generative modeling for speech synthesis. However, existing flow-matching methods typically embed these discrete tokens into a continuous space and apply continuous flow matching, which may not fully leverage the advantages of discrete representations. To address these challenges, we introduce DiFlow-TTS, which, to the best of our knowledge, is the first model to explore purely Discrete Flow Matching for speech synthesis. DiFlow-TTS explicitly models factorized speech attributes within a compact and unified architecture. It leverages in-context learning by conditioning on textual content, along with prosodic and acoustic attributes extracted from a reference speech, enabling effective attribute cloning in a zero-shot setting. In addition, the model employs a factorized flow prediction mechanism with distinct heads for prosody and acoustic details, allowing it to learn aspect-specific distributions. Experimental results demonstrate that DiFlow-TTS achieves promising performance in several key metrics, including naturalness, prosody, preservation of speaker style, and energy control. It also maintains a compact model size and achieves low-latency inference, generating speech up to 25.8 times faster than the latest existing baselines.
[226] Pretrained Conformers for Audio Fingerprinting and Retrieval
Kemal Altwlkany, Elmedin Selmanovic, Sead Delalic
Main category: cs.SD
TL;DR: Conformer-based audio encoders trained with self-supervised contrastive learning achieve SOTA results for audio retrieval using only 3-second segments, showing strong immunity to temporal misalignments and audio distortions.
Details
Motivation: To develop audio encoders that can generate unique embeddings for small audio segments and generalize well to unseen data, addressing challenges like temporal misalignments and audio distortions.
Method: Utilize a self-supervised contrastive learning framework to train conformer-based encoders, capturing both local and global interactions in audio data.
Result: Achieve state-of-the-art results for audio retrieval tasks using only 3 seconds of audio, with models being almost completely immune to temporal misalignments and performing well against noise, reverb, and extreme temporal stretching.
Conclusion: The proposed conformer-based approach with contrastive learning effectively generates robust audio embeddings that generalize well and handle various distortions, with code and models made publicly available for reproducibility.
Abstract: Conformers have shown great results in speech processing due to their ability to capture both local and global interactions. In this work, we utilize a self-supervised contrastive learning framework to train conformer-based encoders that are capable of generating unique embeddings for small segments of audio, generalizing well to previously unseen data. We achieve state-of-the-art results for audio retrieval tasks while using only 3 seconds of audio to generate embeddings. Our models are almost completely immune to temporal misalignments and achieve state-of-the-art results in cases of other audio distortions such as noise, reverb or extreme temporal stretching. Code and models are made publicly available and the results are easy to reproduce as we train and test using popular and freely available datasets of different sizes.
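The abstract does not name the exact contrastive objective; a common choice in this setting is the NT-Xent loss over two augmented views of each segment, sketched below as an assumption:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.1):
    # z1, z2: (N, D) embeddings of two distorted views of the same N audio segments
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float("-inf"))                  # exclude self-pairs
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)               # pull views together, push others apart
```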
[227] FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training
Yiqun Yao, Xiang Li, Xin Jiang, Xuezhi Fang, Naitong Yu, Wenjia Ma, Aixin Sun, Yequan Wang
Main category: cs.SD
TL;DR: FLM-Audio is a 7B spoken dialog chatbot that uses natural monologues and dual training to achieve native full-duplexity with superior response quality and reduced training data requirements.
Details
Motivation: Existing full-duplex dialog models break down textual monologues for word-level audio alignment, which degrades language modeling abilities. The goal is to maintain natural language flow while achieving simultaneous listening and speaking.
Method: Introduces natural monologues composed of continuous sentences with waiting intervals, mimicking human cognitive behavior. Uses dual training paradigm that alternates monologue positions (leading or trailing audio) across training stages.
Result: FLM-Audio achieves superior response qualities and chatting experiences while requiring significantly less training data compared to previous approaches.
Conclusion: The combination of natural monologues and dual training strategy enables effective native full-duplex dialog modeling with improved language capabilities and reduced data requirements.
Abstract: Full-duplex dialog models aim to listen and speak simultaneously, delivering rapid responses to dynamic user input. Among different solutions to full duplexity, a native solution merges multiple channels in each time step, achieving the lowest latency. However, prevailing designs break down the textual monologue sentences for word-level alignment with audio streams, which degrades language modeling abilities. To help address this issue, we introduce natural monologues, which are composed of continuous sentences and waiting intervals, mimicking humanoid cognitive behavior in dialogs. We find a proper training paradigm to be critical for semantically aligning natural monologues with audio. To this end, we develop a dual training paradigm that alternates the position of the monologues, either leading or trailing the audio, across different training stages. A combination of our natural monologue and dual training strategy is applied in developing FLM-Audio, our 7B spoken dialog chatbot with native full-duplexity. As confirmed by experimental results, FLM-Audio achieves superior response qualities and chatting experiences while requiring significantly less training data.
[228] AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs
Sidharth Surapaneni, Hoang Nguyen, Jash Mehta, Aman Tiwari, Oluwanifemi Bamgbose, Akshay Kalkunte, Sai Rajeswar, Sathwik Tejaswi Madhusudhan
Main category: cs.SD
TL;DR: AU-Harness is an efficient evaluation framework for Large Audio Language Models that addresses slow processing, inconsistent prompting, and narrow task coverage issues in current toolkits, enabling large-scale assessments with standardized protocols.
Details
Motivation: Current LALM evaluation frameworks suffer from slow processing that bottlenecks large-scale studies, inconsistent prompting that hurts reproducibility, and narrow task coverage that misses important audio reasoning capabilities.
Method: Developed AU-Harness with optimized batch processing and parallel execution for 127% speedup, standardized prompting protocols, and introduced two new evaluation categories: LLM-Adaptive Diarization for temporal audio understanding and Spoken Language Reasoning for complex audio-based cognitive tasks.
Result: Achieved significant speed improvements enabling previously impractical large-scale evaluations. Evaluation across 380+ tasks revealed significant gaps in current LALMs, particularly in temporal understanding and complex spoken language reasoning. Found performance differences up to 9.5 absolute points due to lack of standardization in instruction modality.
Conclusion: AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development by addressing critical evaluation challenges and standardization issues in the field.
Abstract: Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient toolkits that limit fair comparison and systematic assessment. Current frameworks suffer from three critical issues: slow processing that bottlenecks large-scale studies, inconsistent prompting that hurts reproducibility, and narrow task coverage that misses important audio reasoning capabilities. We introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 127% over existing toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations previously impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. Additionally, we introduce two new evaluation categories: LLM-Adaptive Diarization for temporal audio understanding and Spoken Language Reasoning for complex audio-based cognitive tasks. Through evaluation across 380+ tasks, we reveal significant gaps in current LALMs, particularly in temporal understanding and complex spoken language reasoning tasks. Our findings also highlight a lack of standardization in instruction modality across audio benchmarks, which can lead to performance differences of up to 9.5 absolute points on challenging complex instruction-following downstream tasks. AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.
[229] Behind the Scenes: Mechanistic Interpretability of LoRA-adapted Whisper for Speech Emotion Recognition
Yujian Ma, Jinqiu Sang, Ruizhe Li
Main category: cs.SD
TL;DR: First systematic mechanistic interpretability study of LoRA adaptation in Whisper encoder for speech emotion recognition, revealing delayed specialization and forward-backward matrix dynamics.
Details
Motivation: Large pre-trained speech models like Whisper pose resource-efficient adaptation challenges, and while LoRA is popular for parameter-efficient fine-tuning, its underlying mechanisms in speech tasks remain poorly understood.
Method: Used a suite of analytical tools including layer contribution probing, logit-lens inspection, and representational similarity via SVD and CKA to study LoRA adaptation in Whisper encoder for SER.
Result: Revealed two key mechanisms: delayed specialization process preserving general features in early layers before consolidating task-specific information, and forward alignment, backward differentiation dynamic between LoRA’s matrices.
Conclusion: Findings clarify how LoRA reshapes encoder hierarchies, providing empirical insights and deeper mechanistic understanding for designing efficient and interpretable adaptation strategies in large speech models.
Abstract: Large pre-trained speech models such as Whisper offer strong generalization but pose significant challenges for resource-efficient adaptation. Low-Rank Adaptation (LoRA) has become a popular parameter-efficient fine-tuning method, yet its underlying mechanisms in speech tasks remain poorly understood. In this work, we conduct the first systematic mechanistic interpretability study of LoRA within the Whisper encoder for speech emotion recognition (SER). Using a suite of analytical tools, including layer contribution probing, logit-lens inspection, and representational similarity via singular value decomposition (SVD) and centered kernel alignment (CKA), we reveal two key mechanisms: a delayed specialization process that preserves general features in early layers before consolidating task-specific information, and a forward alignment, backward differentiation dynamic between LoRA’s matrices. Our findings clarify how LoRA reshapes encoder hierarchies, providing both empirical insights and a deeper mechanistic understanding for designing efficient and interpretable adaptation strategies in large speech models. Our code is available at https://github.com/harryporry77/Behind-the-Scenes.
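Of the analysis tools listed, linear CKA is compact enough to show inline; a minimal sketch for comparing two layers' activation matrices:

```python
import numpy as np

def linear_cka(X, Y):
    # X: (n, d1), Y: (n, d2) activations for the same n inputs; center, then compare
    X = X - X.mean(0, keepdims=True)
    Y = Y - Y.mean(0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    return num / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))
```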
cs.LG
[230] Continuous-Time Value Iteration for Multi-Agent Reinforcement Learning
Xuefeng Wang, Lei Zhang, Henglin Pu, Ahmed H. Qureshi, Husheng Li
Main category: cs.LG
TL;DR: A continuous-time multi-agent reinforcement learning framework using physics-informed neural networks to solve Hamilton-Jacobi-Bellman equations, overcoming dimensionality and value approximation challenges in multi-agent systems.
Details
Motivation: Existing RL methods struggle with complex dynamical systems requiring high-frequency interactions, and continuous-time RL has been limited to single-agent domains due to curse of dimensionality and value approximation challenges in multi-agent settings.
Method: Proposes CT-MARL framework with physics-informed neural networks (PINNs) to approximate HJB-based value functions, introducing Value Gradient Iteration (VGI) module to align value learning with value-gradient learning and refine gradients along trajectories.
Result: The approach consistently outperforms existing continuous-time RL baselines on benchmarks including multi-agent particle environment and multi-agent MuJoCo, demonstrating scalability to complex multi-agent dynamics.
Conclusion: The proposed framework successfully addresses key challenges in continuous-time multi-agent reinforcement learning, enabling effective scaling to high-dimensional multi-agent systems through improved gradient fidelity and value approximation.
Abstract: Existing reinforcement learning (RL) methods struggle with complex dynamical systems that demand interactions at high frequencies or irregular time intervals. Continuous-time RL (CTRL) has emerged as a promising alternative by replacing discrete-time Bellman recursion with differential value functions defined as viscosity solutions of the Hamilton–Jacobi–Bellman (HJB) equation. While CTRL has shown promise, its applications have been largely limited to the single-agent domain. This limitation stems from two key challenges: (i) conventional solution methods for HJB equations suffer from the curse of dimensionality (CoD), making them intractable in high-dimensional systems; and (ii) even with HJB-based learning approaches, accurately approximating centralized value functions in multi-agent settings remains difficult, which in turn destabilizes policy training. In this paper, we propose a CT-MARL framework that uses physics-informed neural networks (PINNs) to approximate HJB-based value functions at scale. To ensure the value is consistent with its differential structure, we align value learning with value-gradient learning by introducing a Value Gradient Iteration (VGI) module that iteratively refines value gradients along trajectories. This improves gradient fidelity, in turn yielding more accurate values and stronger policy learning. We evaluate our method using continuous-time variants of standard benchmarks, including multi-agent particle environment (MPE) and multi-agent MuJoCo. Our results demonstrate that our approach consistently outperforms existing continuous-time RL baselines and scales to complex multi-agent dynamics.
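For reference, the HJB equation that such a PINN value network is trained to satisfy takes the standard form below, written here for a generic single-agent reward-maximization problem; the paper's multi-agent formulation will differ in detail.

```latex
\partial_t V(t,x) \;+\; \max_{u}\Big\{\, r(x,u) \;+\; \nabla_x V(t,x)^{\top} f(x,u) \,\Big\} \;=\; 0,
\qquad V(T,x) \;=\; g(x)
```

Here $f$ is the system dynamics, $r$ the running reward, and $g$ the terminal reward; a PINN minimizes the squared residual of the left-hand side along sampled trajectories.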
[231] Uncertainty Estimation using Variance-Gated Distributions
H. Martin Gillis, Isaac Xu, Thomas Trappenberg
Main category: cs.LG
TL;DR: Proposes a new framework for uncertainty estimation and decomposition using signal-to-noise ratio of class probabilities, introducing a variance-gated measure with ensemble-based confidence scaling.
Details
Motivation: Existing additive uncertainty decomposition approaches have been questioned, and there’s a need for better per-sample uncertainty quantification in high-risk neural network applications.
Method: Uses signal-to-noise ratio of class probability distributions across different model predictions, with a variance-gated measure that scales predictions by confidence factors derived from ensembles.
Result: The framework provides intuitive uncertainty estimation and decomposition, and reveals insights about diversity collapse in committee machines.
Conclusion: The proposed signal-to-noise ratio based approach offers a more reliable framework for uncertainty quantification and decomposition compared to traditional additive methods.
Abstract: Evaluation of per-sample uncertainty quantification from neural networks is essential for decision-making involving high-risk applications. A common approach is to use the predictive distribution from Bayesian or approximation models and decompose the corresponding predictive uncertainty into epistemic (model-related) and aleatoric (data-related) components. However, additive decomposition has recently been questioned. In this work, we propose an intuitive framework for uncertainty estimation and decomposition based on the signal-to-noise ratio of class probability distributions across different model predictions. We introduce a variance-gated measure that scales predictions by a confidence factor derived from ensembles. We use this measure to discuss the existence of a collapse in the diversity of committee machines.
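A minimal sketch of the signal-to-noise view of ensemble class probabilities; the gating function itself is not specified in the abstract, so only the SNR decomposition is shown:

```python
import numpy as np

def snr_decomposition(probs):
    # probs: (M, N, C) class probabilities from M ensemble members for N samples
    mu = probs.mean(axis=0)           # mean predictive distribution (the "signal")
    sd = probs.std(axis=0)            # across-member spread (the "noise")
    return mu, sd, mu / (sd + 1e-12)  # per-class signal-to-noise ratio
```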
[232] Data Driven Discovery of Emergent Dynamics in Reaction Diffusion Systems from Sparse and Noisy Observations
Saumitra Dwivedi, Ricardo da Silva Torres, Ibrahim A. Hameed, Gunnar Tufte, Anniken Susanne T. Karlsen
Main category: cs.LG
TL;DR: Proposes DRSALife framework to learn Soft Artificial Life rulesets from data for reaction-diffusion systems without prior physics knowledge, achieving 74% accuracy with robustness to noise and sparsity.
Details
Motivation: Address the challenge of system identification for reaction-diffusion systems when there is no prior knowledge of underlying physics, particularly for emergent dynamics across neuroscience, ecology, and epidemiology.
Method: Uses Data-driven Rulesets for Soft Artificial Life (DRSALife) model to learn Agent-based and Cellular Automata rulesets from observed data, testing on Elementary CA Rule 30, Game of Life, and Vicsek Flocking problems.
Result: Achieves 74% accuracy in predicting emergent dynamics, demonstrates robustness to Gaussian noise and temporal sparsity, and successfully identifies underlying PDE structure and parameters.
Conclusion: DRSALife framework effectively learns Soft ALife models from data without physics priors, providing a promising approach for data-driven discovery of complex emergent dynamics in reaction-diffusion systems.
Abstract: Data-driven discovery of emergent dynamics is gaining popularity, particularly in the context of reaction-diffusion systems. These systems are widely studied across various fields, including neuroscience, ecology, epidemiology, and several other subject areas that deal with emergent dynamics. A current challenge in the discovery process relates to system identification when there is no prior knowledge of the underlying physics. We attempt to address this challenge by learning Soft Artificial Life (Soft ALife) models, such as Agent-based and Cellular Automata (CA) models, from observed data for reaction-diffusion systems. In this paper, we present findings on the applicability of a conceptual framework, the Data-driven Rulesets for Soft Artificial Life (DRSALife) model, to learn Soft ALife rulesets that accurately represent emergent dynamics in a reaction-diffusion system from observed data. This model has demonstrated promising results for Elementary CA Rule 30, Game of Life, and Vicsek Flocking problems in recent work. To our knowledge, this is one of the few studies that explore machine-based Soft ALife ruleset learning and system identification for reaction-diffusion dynamics without any prior knowledge of the underlying physics. Moreover, we provide comprehensive findings from experiments investigating the potential effects of using noisy and sparse observed datasets on learning emergent dynamics. Additionally, we successfully identify the structure and parameters of the underlying partial differential equations (PDEs) representing these dynamics. Experimental results demonstrate that the learned models are able to predict the emergent dynamics with good accuracy (74%) and exhibit quite robust performance when subjected to Gaussian noise and temporal sparsity.
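For context, the elementary CA dynamics referenced above (e.g., Rule 30) reduce to an 8-entry lookup on each cell's neighborhood; a minimal sketch with periodic boundaries:

```python
def ca_step(state, rule=30):
    # one synchronous update: the 3-cell neighborhood (left, center, right),
    # read as a 3-bit number, indexes a bit of `rule`
    n = len(state)
    return [(rule >> (4 * state[i - 1] + 2 * state[i] + state[(i + 1) % n])) & 1
            for i in range(n)]
```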
[233] Instance-Optimal Matrix Multiplicative Weight Update and Its Quantum Applications
Weiyuan Gong, Tongyang Li, Xinzhao Wang, Zhiyu Zhang
Main category: cs.LG
TL;DR: Improved matrix learning algorithm achieves instance-optimal regret bound using quantum relative entropy, maintaining same computational complexity as MMWU while outperforming state-of-the-art in quantum learning applications.
Details
Motivation: The Matrix Multiplicative Weight Update (MMWU) algorithm achieves minimax-optimal regret but not instance-optimal regret. There's a need for an algorithm that achieves better instance-specific performance without increasing computational cost.
Method: Developed a general potential-based framework for matrix LEA, with MMWU as a special case. Used a new one-sided Jensen’s trace inequality and Laplace transform technique to apply general potential functions. The optimal potential function is derived from the imaginary error function in vector LEA.
Result: Achieved instance-optimal regret bound of O(√(T·S(X||d⁻¹I_d))) using quantum relative entropy, with same computational complexity as MMWU. Outperformed state-of-the-art in learning quantum states corrupted by depolarization noise, random quantum states, and Gibbs states.
Conclusion: The improved algorithm provides free improvement in regret bound without computational overhead, enabling better performance in quantum learning applications including predicting nonlinear quantum properties like purity and Rényi-2 correlation.
Abstract: The Matrix Multiplicative Weight Update (MMWU) is a seminal online learning algorithm with numerous applications. Applied to the matrix version of the Learning from Expert Advice (LEA) problem on the $d$-dimensional spectraplex, it is well known that MMWU achieves the minimax-optimal regret bound of $O(\sqrt{T\log d})$, where $T$ is the time horizon. In this paper, we present an improved algorithm achieving the instance-optimal regret bound of $O(\sqrt{T\cdot S(X||d^{-1}I_d)})$, where $X$ is the comparator in the regret, $I_d$ is the identity matrix, and $S(\cdot||\cdot)$ denotes the quantum relative entropy. Furthermore, our algorithm has the same computational complexity as MMWU, indicating that the improvement in the regret bound is "free". Technically, we first develop a general potential-based framework for matrix LEA, with MMWU being its special case induced by the standard exponential potential. Then, the crux of our analysis is a new "one-sided" Jensen's trace inequality built on a Laplace transform technique, which allows the application of general potential functions beyond exponential to matrix LEA. Our algorithm is finally induced by an optimal potential function from the vector LEA problem, based on the imaginary error function. Complementing the above, we provide a memory lower bound for matrix LEA, and explore the applications of our algorithm in quantum learning theory. We show that it outperforms the state of the art for learning quantum states corrupted by depolarization noise, random quantum states, and Gibbs states. In addition, applying our algorithm to linearized convex losses enables predicting nonlinear quantum properties, such as purity, quantum virtual cooling, and Rényi-$2$ correlation.
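For reference, the standard MMWU iterate that the paper generalizes is the matrix exponentiated-gradient update on the spectraplex:

```latex
X_{t+1} \;=\; \frac{\exp\!\big(-\eta \sum_{s=1}^{t} L_s\big)}{\operatorname{Tr}\,\exp\!\big(-\eta \sum_{s=1}^{t} L_s\big)}
```

Here $L_s$ are the observed loss matrices and $\eta$ is the learning rate; the paper's improvement replaces the exponential potential behind this update with one derived from the imaginary error function.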
[234] Corruption-Tolerant Asynchronous Q-Learning with Near-Optimal Rates
Sreejeet Maity, Aritra Mitra
Main category: cs.LG
TL;DR: Robust Q-learning algorithm for adversarial reward corruption with finite-time convergence guarantees matching non-adversarial rates plus unavoidable corruption term.
Details
Motivation: Classical RL algorithms like Q-learning degrade severely under adversarial reward corruption from noise, sensor faults, or malicious attacks.
Method: Proposed provably robust Q-learning variant that handles arbitrarily perturbed rewards, using refined Azuma-Hoeffding inequality for almost-martingales in unknown statistics case.
Result: Achieves same finite-time convergence rate as non-adversarial Q-learning up to additive corruption term proportional to fraction of corrupted samples, with information-theoretic lower bound showing this is optimal.
Conclusion: First finite-time robustness guarantees for asynchronous Q-learning under adversarial corruption, bridging significant gap in robust reinforcement learning.
Abstract: We consider the problem of learning the optimal policy in a discounted, infinite-horizon reinforcement learning (RL) setting where the reward signal is subject to adversarial corruption. Such corruption, which may arise from extreme noise, sensor faults, or malicious attacks, can severely degrade the performance of classical algorithms such as Q-learning. To address this challenge, we propose a new provably robust variant of the Q-learning algorithm that operates effectively even when a fraction of the observed rewards are arbitrarily perturbed by an adversary. Under the asynchronous sampling model with time-correlated data, we establish that despite adversarial corruption, the finite-time convergence rate of our algorithm matches that of existing results for the non-adversarial case, up to an additive term proportional to the fraction of corrupted samples. Moreover, we derive an information-theoretic lower bound revealing that the additive corruption term in our upper bounds is unavoidable. Next, we propose a variant of our algorithm that requires no prior knowledge of the statistics of the true reward distributions. The analysis of this setting is particularly challenging and is enabled by carefully exploiting a refined Azuma-Hoeffding inequality for almost-martingales, a technical tool that might be of independent interest. Collectively, our contributions provide the first finite-time robustness guarantees for asynchronous Q-learning, bridging a significant gap in robust RL.
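A minimal sketch of an asynchronous Q-learning update hardened against corrupted rewards; the clipping step is purely illustrative, since the paper's actual robust estimator is not described in the abstract.

```python
def robust_q_update(Q, s, a, r_obs, s_next, alpha=0.1, gamma=0.99, r_max=1.0):
    # Q: dict mapping each state to a list of per-action values
    # hypothetical guard: clamp the possibly-corrupted reward to a trusted range
    r = max(-r_max, min(r_max, r_obs))
    target = r + gamma * max(Q[s_next])
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
```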
[235] Group Distributionally Robust Machine Learning under Group Level Distributional Uncertainty
Xenia Konti, Yi Shen, Zifan Wang, Karl Henrik Johansson, Michael J. Pencina, Nicoleta J. Economou-Zavlanos, Michael M. Zavlanos
Main category: cs.LG
TL;DR: A novel Wasserstein-based distributionally robust optimization framework that improves worst-group performance while accounting for distributional uncertainty in heterogeneous data sources.
Details
Motivation: Standard ML methods often learn spurious correlations that perform well on average but degrade performance for underrepresented groups, especially in noisy, non-stationary environments where group distributions cannot be accurately estimated.
Method: Proposes a Wasserstein-based distributionally robust optimization (DRO) framework that accounts for distributional uncertainty within each group while optimizing worst-group performance. Develops a gradient descent-ascent algorithm with convergence guarantees.
Result: The method is validated on real-world data, demonstrating effectiveness in improving worst-group performance under distributional uncertainty.
Conclusion: The proposed framework successfully addresses distributional uncertainty in group-wise optimization, providing robust worst-group performance improvements in heterogeneous data environments.
Abstract: The performance of machine learning (ML) models critically depends on the quality and representativeness of the training data. In applications with multiple heterogeneous data generating sources, standard ML methods often learn spurious correlations that perform well on average but degrade performance for atypical or underrepresented groups. Prior work addresses this issue by optimizing the worst-group performance. However, these approaches typically assume that the underlying data distributions for each group can be accurately estimated using the training data, a condition that is frequently violated in noisy, non-stationary, and evolving environments. In this work, we propose a novel framework that relies on Wasserstein-based distributionally robust optimization (DRO) to account for the distributional uncertainty within each group, while simultaneously preserving the objective of improving the worst-group performance. We develop a gradient descent-ascent algorithm to solve the proposed DRO problem and provide convergence results. Finally, we validate the effectiveness of our method on real-world data.
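The gradient descent-ascent pattern at the core of the proposed solver, shown on a toy saddle-point problem rather than the paper's Wasserstein DRO objective:

```python
import torch

# toy saddle problem: min over theta, max over delta of theta^2 + 2*theta*delta - delta^2
theta = torch.tensor(1.0, requires_grad=True)
delta = torch.tensor(1.0, requires_grad=True)
opt_min = torch.optim.SGD([theta], lr=0.1)
opt_max = torch.optim.SGD([delta], lr=0.1, maximize=True)  # ascent on delta
for _ in range(500):
    loss = theta**2 + 2 * theta * delta - delta**2
    opt_min.zero_grad()
    opt_max.zero_grad()
    loss.backward()
    opt_min.step()
    opt_max.step()
print(float(theta), float(delta))  # both converge toward the saddle at (0, 0)
```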
[236] FoundationalECGNet: A Lightweight Foundational Model for ECG-based Multitask Cardiac Analysis
Md. Sajeebul Islam Sk., Md Jobayer, Md Mehedi Hasan Shawon, Md. Golam Raibul Alam
Main category: cs.LG
TL;DR: FoundationalECGNet is a foundational framework for automated ECG classification that achieves state-of-the-art performance with 99% F1-score for normal vs abnormal detection and excellent multi-class disease classification.
Details
Motivation: Cardiovascular diseases remain a leading cause of mortality worldwide, and current ECG analysis methods face challenges like noise, class imbalance, and dataset heterogeneity that limit diagnostic accuracy.
Method: The model integrates dual-stage denoising (Morlet and Daubechies wavelets), Convolutional Block Attention Module (CBAM), Graph Attention Networks (GAT), and Time Series Transformers (TST) to capture spatial and temporal dependencies in multi-channel ECG signals.
Result: Achieves 99% F1-score for Normal vs Abnormal classification, 99% F1-score for Conduction Disorders and Hypertrophy, and 98.9% F1-score for Arrhythmias across multiple datasets with state-of-the-art performance.
Conclusion: FoundationalECGNet represents a scalable, interpretable, and generalizable solution for automated ECG analysis with potential to improve diagnostic precision and patient outcomes in healthcare settings.
Abstract: Cardiovascular diseases (CVDs) remain a leading cause of mortality worldwide, underscoring the importance of accurate and scalable diagnostic systems. Electrocardiogram (ECG) analysis is central to detecting cardiac abnormalities, yet challenges such as noise, class imbalance, and dataset heterogeneity limit current methods. To address these issues, we propose FoundationalECGNet, a foundational framework for automated ECG classification. The model integrates a dual-stage denoising by Morlet and Daubechies wavelets transformation, Convolutional Block Attention Module (CBAM), Graph Attention Networks (GAT), and Time Series Transformers (TST) to jointly capture spatial and temporal dependencies in multi-channel ECG signals. FoundationalECGNet first distinguishes between Normal and Abnormal ECG signals, and then classifies the Abnormal signals into one of five cardiac conditions: Arrhythmias, Conduction Disorders, Myocardial Infarction, QT Abnormalities, or Hypertrophy. Across multiple datasets, the model achieves a 99% F1-score for Normal vs. Abnormal classification and shows state-of-the-art performance in multi-class disease detection, including a 99% F1-score for Conduction Disorders and Hypertrophy, as well as a 98.9% F1-score for Arrhythmias. Additionally, the model provides risk level estimations to facilitate clinical decision-making. In conclusion, FoundationalECGNet represents a scalable, interpretable, and generalizable solution for automated ECG analysis, with the potential to improve diagnostic precision and patient outcomes in healthcare settings. We’ll share the code after acceptance.
[237] Value bounds and Convergence Analysis for Averages of LRP attributions
Alexander Binder, Nastaran Takmil-Homayouni, Urun Dogan
Main category: cs.LG
TL;DR: Analysis of LRP attribution methods through matrix representation, showing they can be expressed as products of modified gradient matrices analogous to Jacobian matrices from chain rule differentiation.
Details
Motivation: To understand the numerical properties and distribution of attribution values in LRP-type methods, particularly how they behave under data augmentations and in Smoothgrad-type approaches.
Method: Represent LRP attribution methods as matrix products of modified gradients, derive upper bounds for singular values and component-wise bounds for attribution values, and analyze convergence constants.
Result: Found that LRP-beta’s multiplicative constants remain independent of weight norms, unlike gradient-based methods and LRP-epsilon, which has important implications for data augmentation scenarios.
Conclusion: The analysis provides theoretical insights into LRP attribution behavior, revealing fundamental differences between LRP variants and gradient-based methods in terms of weight norm dependence and convergence properties.
Abstract: We analyze numerical properties of Layer-wise relevance propagation (LRP)-type attribution methods by representing them as a product of modified gradient matrices. This representation creates an analogy to matrix multiplications of Jacobi-matrices which arise from the chain rule of differentiation. In order to shed light on the distribution of attribution values, we derive upper bounds for singular values. Furthermore we derive component-wise bounds for attribution map values. As a main result, we apply these component-wise bounds to obtain multiplicative constants. These constants govern the convergence of empirical means of attributions to expectations of attribution maps. This finding has important implications for scenarios where multiple non-geometric data augmentations are applied to individual test samples, as well as for Smoothgrad-type attribution methods. In particular, our analysis reveals that the constants for LRP-beta remain independent of weight norms, a significant distinction from both gradient-based methods and LRP-epsilon.
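For orientation, the two LRP variants being compared propagate relevance layer by layer with rules of the following standard form (sign conventions vary across implementations, and the correspondence to the paper's "LRP-beta" is our reading):

```latex
\text{LRP-}\epsilon:\quad
R_i \;=\; \sum_j \frac{a_i w_{ij}}{\epsilon + \sum_{i'} a_{i'} w_{i'j}}\, R_j,
\qquad
\text{LRP-}\alpha\beta\ (\alpha = 1+\beta):\quad
R_i \;=\; \sum_j \left( (1+\beta)\,\frac{(a_i w_{ij})^{+}}{\sum_{i'} (a_{i'} w_{i'j})^{+}} \;-\; \beta\,\frac{(a_i w_{ij})^{-}}{\sum_{i'} (a_{i'} w_{i'j})^{-}} \right) R_j
```

Here $a_i$ are activations, $w_{ij}$ weights, and $(\cdot)^{\pm}$ the positive/negative parts; the ratio normalization in the $\beta$-rule is consistent with the weight-norm independence noted above.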
[238] Green Federated Learning via Carbon-Aware Client and Time Slot Scheduling
Daniel Richards Arputharaj, Charlotte Rodriguez, Angelo Rodio, Giovanni Neglia
Main category: cs.LG
TL;DR: Carbon-aware federated learning scheduler that leverages slack time and fair carbon allocation to reduce emissions while maintaining model accuracy, especially effective under tight carbon constraints.
Details
Motivation: Large-scale ML training causes substantial carbon emissions. FL’s distributed nature enables leveraging regional/temporal carbon intensity variations to reduce emissions through intelligent scheduling.
Method: Carbon-aware client selection and training scheduling that uses slack time to defer training to low-carbon periods, integrates alpha-fair carbon allocation, and includes global fine-tuning to handle statistical heterogeneity and temporal correlations.
Result: Outperforms slack-agnostic baselines, achieving higher model accuracy across various carbon budgets with particularly strong gains under tight carbon constraints.
Conclusion: Carbon-aware scheduling in FL effectively reduces emissions while maintaining performance, with the proposed scheduler showing significant advantages especially when carbon budgets are constrained.
Abstract: Training large-scale machine learning models incurs substantial carbon emissions. Federated Learning (FL), by distributing computation across geographically dispersed clients, offers a natural framework to leverage regional and temporal variations in Carbon Intensity (CI). This paper investigates how to reduce emissions in FL through carbon-aware client selection and training scheduling. We first quantify the emission savings of a carbon-aware scheduling policy that leverages slack time – permitting a modest extension of the training duration so that clients can defer local training rounds to lower-carbon periods. We then examine the performance trade-offs of such scheduling which stem from statistical heterogeneity among clients, selection bias in participation, and temporal correlation in model updates. To leverage these trade-offs, we construct a carbon-aware scheduler that integrates slack time, $\alpha$-fair carbon allocation, and a global fine-tuning phase. Experiments on real-world CI data show that our scheduler outperforms slack-agnostic baselines, achieving higher model accuracy across a wide range of carbon budgets, with especially strong gains under tight carbon constraints.
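The slack-time idea reduces to a small lookahead over a carbon-intensity forecast; a minimal sketch, where the forecast format and function name are illustrative:

```python
def pick_training_slot(ci_forecast, now, slack):
    # defer a client's round to the lowest carbon-intensity slot in [now, now + slack]
    window = ci_forecast[now : now + slack + 1]
    return now + min(range(len(window)), key=window.__getitem__)
```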
[239] Adaptive Pareto-Optimal Token Merging for Edge Transformer Models in Semantic Communication
Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis
Main category: cs.LG
TL;DR: Training-free adaptive token merging for vision transformers to reduce computation and transmission costs in semantic communication systems, using Bayesian optimization to balance accuracy and efficiency.
Details
Motivation: Large transformer models are powerful for semantic communication but computationally expensive, making deployment challenging in resource-constrained 6G networks.
Method: Formulates token merging proportion selection as multi-objective optimization, uses Gaussian process-based Bayesian optimization to find Pareto-optimal configurations for adaptive runtime adjustment.
Result: Outperforms other baselines, achieves significant FLOPs reduction while maintaining competitive accuracy across various SNR conditions, enables effective trade-off between latency and semantic fidelity.
Conclusion: Provides a scalable and efficient approach for deploying transformer-based semantic communication in future edge intelligence systems with adaptive performance optimization.
Abstract: Large-scale transformer models have emerged as a powerful tool for semantic communication systems, enabling edge devices to extract rich representations for robust inference across noisy wireless channels. However, their substantial computational demands remain a major barrier to practical deployment in resource-constrained 6G networks. In this paper, we present a training-free framework for adaptive token merging in pretrained vision transformers to jointly reduce inference time and transmission resource usage. We formulate the selection of per-layer merging proportions as a multi-objective optimization problem to balance accuracy and computational cost. We employ Gaussian process-based Bayesian optimization to construct a Pareto frontier of optimal configurations, enabling flexible runtime adaptation to dynamic application requirements and channel conditions. Extensive experiments demonstrate that our method consistently outperforms other baselines and achieves significant reductions in floating-point operations while maintaining competitive accuracy across a wide range of signal-to-noise ratio (SNR) conditions. Additional results highlight the effectiveness of adaptive policies that adjust merging aggressiveness in response to channel quality, providing a practical mechanism to trade off latency and semantic fidelity on demand. These findings establish a scalable and efficient approach for deploying transformer-based semantic communication in future edge intelligence systems.
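A much-simplified picture of token merging inside a ViT layer: average the most similar adjacent token pairs. The paper tunes per-layer merging proportions with Bayesian optimization; this sketch fixes the count by hand.

```python
import torch
import torch.nn.functional as F

def merge_adjacent_tokens(x, r):
    # x: (N, D) tokens; average the r most cosine-similar adjacent pairs
    x = x.clone()
    sim = F.cosine_similarity(x[:-1], x[1:], dim=-1)
    keep = torch.ones(x.shape[0], dtype=torch.bool)
    for i in sim.topk(min(r, sim.numel())).indices.tolist():
        if keep[i] and keep[i + 1]:
            x[i] = (x[i] + x[i + 1]) / 2   # merge the pair into one token
            keep[i + 1] = False
    return x[keep]
```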
[240] Active Learning and Explainable AI for Multi-Objective Optimization of Spin Coated Polymers
Brendan Young, Brendan Alvey, Andreas Werbrouck, Will Murphy, James Keller, Mattias J. Young, Matthew Maschmann
Main category: cs.LG
TL;DR: A framework combining active Pareto front learning (PyePAL) with visualization and explainable AI techniques to optimize spin coating parameters for polymer thin films with desired mechanical properties.
Details
Motivation: Spin coating polymer thin films to achieve specific mechanical properties is inherently a multi-objective optimization problem that requires balancing multiple competing objectives.
Method: Integrates PyePAL algorithm with Gaussian process models to predict mechanical properties from design variables, uses UMAP for 2D visualization of Pareto front exploration, and incorporates fuzzy linguistic summaries for explainable insights.
Result: The method efficiently identifies promising polymer designs while providing visual and linguistic explanations that facilitate expert-driven analysis and knowledge discovery.
Conclusion: The framework successfully combines optimization with explainable AI techniques to provide both optimal solutions and interpretable insights into the relationships between process parameters and performance objectives.
Abstract: Spin coating polymer thin films to achieve specific mechanical properties is inherently a multi-objective optimization problem. We present a framework that integrates an active Pareto front learning algorithm (PyePAL) with visualization and explainable AI techniques to optimize processing parameters. PyePAL uses Gaussian process models to predict objective values (hardness and elasticity) from the design variables (spin speed, dilution, and polymer mixture), guiding the adaptive selection of samples toward promising regions of the design space. To enable interpretable insights into the high-dimensional design space, we utilize UMAP (Uniform Manifold Approximation and Projection) for two-dimensional visualization of the Pareto front exploration. Additionally, we incorporate fuzzy linguistic summaries, which translate the learned relationships between process parameters and performance objectives into linguistic statements, thus enhancing the explainability and understanding of the optimization results. Experimental results demonstrate that our method efficiently identifies promising polymer designs, while the visual and linguistic explanations facilitate expert-driven analysis and knowledge discovery.
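The Pareto-front bookkeeping behind such a search can be illustrated with a plain non-dominated filter over (hardness, elasticity) pairs, both to be maximized; this is a generic sketch, not PyePAL's actual machinery.

```python
def pareto_front(points):
    # keep a point unless some other point is at least as good in every objective
    return [p for p in points
            if not any(q != p and all(qi >= pi for qi, pi in zip(q, p))
                       for q in points)]

print(pareto_front([(1, 3), (2, 2), (3, 1), (1, 1)]))  # [(1, 3), (2, 2), (3, 1)]
```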
[241] Fast attention mechanisms: a tale of parallelism
Jingwen Liu, Hantao Yu, Clayton Sanford, Alexandr Andoni, Daniel Hsu
Main category: cs.LG
TL;DR: ANNA attention mechanism enables transformers to achieve sub-quadratic time complexity while maintaining MPC simulation capabilities and solving reasoning tasks efficiently.
Details
Motivation: Transformers suffer from quadratic time complexity that limits scalability, despite having the representational capacity to simulate Massively Parallel Computation algorithms.
Method: Introduces Approximate Nearest Neighbor Attention (ANNA) mechanism with sub-quadratic time complexity, and proves its expressive power matches standard attention for MPC simulation capabilities.
Result: ANNA-transformers can solve key reasoning tasks like Match2 and k-hop with near-optimal depth, and constant-depth ANNA-transformers can simulate constant-depth low-rank transformers.
Conclusion: ANNA provides a unified framework for reasoning about efficient attention approximations while maintaining theoretical guarantees and practical efficiency.
Abstract: Transformers have the representational capacity to simulate Massively Parallel Computation (MPC) algorithms, but they suffer from quadratic time complexity, which severely limits their scalability. We introduce an efficient attention mechanism called Approximate Nearest Neighbor Attention (ANNA) with sub-quadratic time complexity. We prove that ANNA-transformers (1) retain the expressive power previously established for standard attention in terms of matching the capabilities of MPC algorithms, and (2) can solve key reasoning tasks such as Match2 and $k$-hop with near-optimal depth. Using the MPC framework, we further prove that constant-depth ANNA-transformers can simulate constant-depth low-rank transformers, thereby providing a unified way to reason about a broad class of efficient attention approximations.
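The abstract does not give ANNA's construction; as a toy stand-in, the sketch below restricts each query to keys sharing a random-hyperplane hash bucket, an approximate-nearest-neighbor-style restriction. A genuinely sub-quadratic implementation would gather keys per bucket rather than build dense scores.

```python
import torch

def bucketed_attention(q, k, v, n_bits=4):
    d = q.shape[-1]
    planes = torch.randn(d, n_bits)
    weights = 2 ** torch.arange(n_bits)
    qh = (((q @ planes) > 0).long() * weights).sum(-1)   # bucket id per query
    kh = (((k @ planes) > 0).long() * weights).sum(-1)   # bucket id per key
    mask = qh[:, None].eq(kh[None, :])                   # same-bucket pairs only
    scores = (q @ k.t() / d**0.5).masked_fill(~mask, float("-inf"))
    scores[~mask.any(dim=1)] = 0.0    # empty bucket: fall back to uniform attention
    return torch.softmax(scores, dim=-1) @ v
```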
[242] Open-sci-ref-0.01: open and reproducible reference baselines for language model and dataset comparison
Marianna Nezhurina, Taishi Nakamura, Timur Carstensen, Niccolò Ajroldi, Ville Komulainen, David Salinas, Jenia Jitsev
Main category: cs.LG
TL;DR: Open-sci-ref provides dense transformer baseline models (0.13B-1.7B parameters) trained on 8 reference datasets with up to 1T tokens, establishing standardized benchmarks for comparing training approaches and dataset performance.
Details
Motivation: To create reference baselines that enable researchers to assess the sanity and quality of alternative training approaches across different model scales and datasets, facilitating standardized comparison and reproduction.
Method: Training family of dense transformer models across multiple parameter scales (0.13B to 1.7B) and token scales (up to 1T tokens) on 8 recent open reference datasets, with intermediate checkpoints and comprehensive logging.
Result: NemoTron-CC HQ consistently outperforms other reference datasets, followed by DCLM-baseline and FineWeb-Edu. The models provide reference points for scaling trends and training dynamics analysis.
Conclusion: Open-sci-ref establishes standardized baselines that simplify reproduction, enable fair comparison of training procedures, and facilitate future research through released checkpoints, logs, code, and evaluations.
Abstract: We introduce open-sci-ref, a family of dense transformer models trained as research baselines across multiple model (0.13B to 1.7B parameters) and token scales (up to 1T) on 8 recent open reference datasets. Evaluating the models on various standardized benchmarks, our set of training runs establishes reference points that enable researchers to assess the sanity and quality of alternative training approaches across scales and datasets. Intermediate checkpoints allow comparison and study of the training dynamics. The established reference baselines allow training procedures to be compared through their scaling trends, aligning them on a common compute axis. Comparison of open reference datasets reveals that training on NemoTron-CC HQ consistently outperforms other reference datasets, followed by DCLM-baseline and FineWeb-Edu. In addition to intermediate training checkpoints, the release includes logs, code, and downstream evaluations to simplify reproduction, standardize comparison, and facilitate future research.
[243] Deep Context-Conditioned Anomaly Detection for Tabular Data
Spencer King, Zhilu Zhang, Ruofan Yu, Baris Coskun, Wei Ding, Qian Cui
Main category: cs.LG
TL;DR: A context-conditional anomaly detection framework for tabular data that automatically identifies context features and uses deep autoencoders to model conditional distributions, outperforming state-of-the-art methods.
Details
Motivation: Unsupervised anomaly detection in tabular data is challenging because real-world datasets contain heterogeneous contexts where globally rare events may be normal in certain contexts, making single global distribution models ineffective.
Method: Automatically identifies context features and models the conditional data distribution using a simple deep autoencoder framework tailored for tabular datasets.
Result: Extensive experiments on multiple tabular benchmark datasets demonstrate that the method outperforms state-of-the-art approaches.
Conclusion: The framework effectively incorporates contextual information, highlighting the importance of context in accurately distinguishing anomalous from normal instances in tabular data.
Abstract: Anomaly detection is critical in domains such as cybersecurity and finance, especially when working with large-scale tabular data. Yet, unsupervised anomaly detection – where no labeled anomalies are available – remains a significant challenge. Although various deep learning methods have been proposed to model a dataset’s joint distribution, real-world tabular data often contain heterogeneous contexts (e.g., different users), making globally rare events normal under certain contexts. Consequently, relying on a single global distribution can overlook these contextual nuances, degrading detection performance. In this paper, we present a context-conditional anomaly detection framework tailored for tabular datasets. Our approach automatically identifies context features and models the conditional data distribution using a simple deep autoencoder. Extensive experiments on multiple tabular benchmark datasets demonstrate that our method outperforms state-of-the-art approaches, underscoring the importance of context in accurately distinguishing anomalous from normal instances.
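A minimal sketch of the context-conditional idea: reconstruct a record from its latent code plus its context features, and score anomalies by reconstruction error. The layer sizes and conditioning scheme here are assumptions.

```python
import torch
import torch.nn as nn

class ContextAE(nn.Module):
    def __init__(self, d_in, d_ctx, d_hid=32, d_z=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(), nn.Linear(d_hid, d_z))
        self.dec = nn.Sequential(nn.Linear(d_z + d_ctx, d_hid), nn.ReLU(), nn.Linear(d_hid, d_in))

    def forward(self, x, ctx):
        # the decoder sees the context, so "normal" is judged relative to it
        return self.dec(torch.cat([self.enc(x), ctx], dim=-1))

def anomaly_score(model, x, ctx):
    return ((model(x, ctx) - x) ** 2).mean(dim=-1)   # high error = likely anomaly
```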
[244] MoWE : A Mixture of Weather Experts
Dibyajyoti Chakraborty, Romit Maulik, Peter Harrington, Dallas Foster, Mohammad Amin Nabian, Sanjay Choudhry
Main category: cs.LG
TL;DR: Mixture of Experts (MoWE) approach combines existing weather models using Vision Transformer gating to create more accurate forecasts with lower computational cost than individual models.
Details
Motivation: Progress in data-driven weather models has plateaued, and existing models require significant computational resources. The paper aims to overcome these limitations by optimally combining outputs of existing models rather than creating new forecasters.
Method: Uses Vision Transformer-based gating network that dynamically learns to weight contributions of multiple expert models at each grid point, conditioned on forecast lead time. Trained with significantly lower computational resources than individual experts.
Result: Achieves up to 10% lower RMSE than best-performing AI weather model on 2-day forecast horizon, significantly outperforming individual experts and simple averaging across experts.
Conclusion: Presents a computationally efficient and scalable strategy to advance data-driven weather prediction by optimally combining leading forecast models.
Abstract: Data-driven weather models have recently achieved state-of-the-art performance, yet progress has plateaued in recent years. This paper introduces a Mixture of Experts (MoWE) approach as a novel paradigm to overcome these limitations, not by creating a new forecaster, but by optimally combining the outputs of existing models. The MoWE model is trained with significantly lower computational resources than the individual experts. Our model employs a Vision Transformer-based gating network that dynamically learns to weight the contributions of multiple “expert” models at each grid point, conditioned on forecast lead time. This approach creates a synthesized deterministic forecast that is more accurate than any individual component in terms of Root Mean Squared Error (RMSE). Our results demonstrate the effectiveness of this method, achieving up to a 10% lower RMSE than the best-performing AI weather model on a 2-day forecast horizon, significantly outperforming individual experts as well as a simple average across experts. This work presents a computationally efficient and scalable strategy to push the state of the art in data-driven weather prediction by making the most out of leading high-quality forecast models.
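A minimal sketch of the gating idea, with a single conv layer standing in for the paper's Vision Transformer gate (architecture and tensor shapes are assumptions): the gate produces a softmax over experts at every grid point, conditioned on lead time, and the combined forecast is the weighted sum.

```python
import torch
import torch.nn as nn

class MixtureGate(nn.Module):
    """Per-grid-point mixture weights over E expert forecasts, conditioned on
    lead time (a conv layer stands in for the paper's ViT gate)."""
    def __init__(self, n_experts, n_channels):
        super().__init__()
        # +1 input channel broadcasts the forecast lead time over the grid
        self.net = nn.Conv2d(n_experts * n_channels + 1, n_experts,
                             kernel_size=3, padding=1)

    def forward(self, experts, lead_time):
        # experts: (B, E, C, H, W); lead_time: (B,)
        b, e, c, h, w = experts.shape
        lt = lead_time.view(b, 1, 1, 1).expand(b, 1, h, w)
        gate_in = torch.cat([experts.reshape(b, e * c, h, w), lt], dim=1)
        weights = torch.softmax(self.net(gate_in), dim=1)   # (B, E, H, W), sums to 1
        return (experts * weights.unsqueeze(2)).sum(dim=1)  # (B, C, H, W) forecast
```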
[245] A Scoping Review of Machine Learning Applications in Power System Protection and Disturbance Management
Julian Oelhaf, Georg Kordowich, Mehran Pashaei, Christian Bergler, Andreas Maier, Johann Jäger, Siming Bayer
Main category: cs.LG
TL;DR: This scoping review analyzes ML applications in power system protection, finding high accuracy in simulations but insufficient real-world validation, and proposes standardization guidelines to improve research quality.
Details
Motivation: Renewable energy integration challenges conventional protection schemes, requiring ML solutions for modern power systems.
Method: PRISMA for Scoping Reviews framework applied to over 100 publications, assessing ML scope, performance, and suitability for grid conditions.
Result: ML shows high accuracy in simulations but lacks real-world validation; literature is fragmented with inconsistent methodologies and evaluation metrics.
Conclusion: Future research needs standardized practices, real-world validation, and advanced ML architectures for practical deployment in dynamic power systems.
Abstract: The integration of renewable and distributed energy resources reshapes modern power systems, challenging conventional protection schemes. This scoping review synthesizes recent literature on machine learning (ML) applications in power system protection and disturbance management, following the PRISMA for Scoping Reviews framework. Based on over 100 publications, three key objectives are addressed: (i) assessing the scope of ML research in protection tasks; (ii) evaluating ML performance across diverse operational scenarios; and (iii) identifying methods suitable for evolving grid conditions. ML models often demonstrate high accuracy on simulated datasets; however, their performance under real-world conditions remains insufficiently validated. The existing literature is fragmented, with inconsistencies in methodological rigor, dataset quality, and evaluation metrics. This lack of standardization hampers the comparability of results and limits the generalizability of findings. To address these challenges, this review introduces a ML-oriented taxonomy for protection tasks, resolves key terminological inconsistencies, and advocates for standardized reporting practices. It further provides guidelines for comprehensive dataset documentation, methodological transparency, and consistent evaluation protocols, aiming to improve reproducibility and enhance the practical relevance of research outcomes. Critical gaps remain, including the scarcity of real-world validation, insufficient robustness testing, and limited consideration of deployment feasibility. Future research should prioritize public benchmark datasets, realistic validation methods, and advanced ML architectures. These steps are essential to move ML-based protection from theoretical promise to practical deployment in increasingly dynamic and decentralized power systems.
[246] STRIDE: Scalable and Interpretable XAI via Subset-Free Functional Decomposition
Chaeyun Ko
Main category: cs.LG
TL;DR: STRIDE is a scalable XAI framework that avoids exponential subset enumeration costs through orthogonal functional decomposition in RKHS, providing both local/global explanations with high fidelity and speed improvements.
Details
Motivation: Overcome limitations of traditional XAI methods: exponential computational cost from feature subset enumeration and reduced expressiveness of scalar attribution summaries.
Method: Uses orthogonal functional decomposition in Reproducing Kernel Hilbert Space (RKHS) with analytical projection via recursive kernel-centering, avoiding explicit subset enumeration.
Result: Achieved speedups from 0.6x to 9.7x (median ~3x) across 10 datasets while maintaining high fidelity (R² 0.81-0.999) and rank agreement. Enables novel diagnostics like ‘component surgery’.
Conclusion: STRIDE complements scalar attribution methods by providing structured functional perspective, offering scalable and expressive explanations for tabular data with theoretical guarantees.
Abstract: Most explainable AI (XAI) frameworks face two practical limitations: the exponential cost of reasoning over feature subsets and the reduced expressiveness of summarizing effects as single scalar values. We present STRIDE, a scalable framework that aims to mitigate both issues by framing explanation as a subset-enumeration-free, orthogonal functional decomposition in a Reproducing Kernel Hilbert Space (RKHS). Rather than focusing only on scalar attributions, STRIDE computes functional components f_S(x_S) via an analytical projection scheme based on a recursive kernel-centering procedure, avoiding explicit subset enumeration. In the tabular setups we study, the approach is model-agnostic, provides both local and global views, and is supported by theoretical results on orthogonality and L^2 convergence under stated assumptions. On public tabular benchmarks in our environment, we observed speedups ranging from 0.6 times (slower than TreeSHAP on a small dataset) to 9.7 times (California), with a median of approximately 3.0 times across 10 datasets, while maintaining high fidelity (R^2 between 0.81 and 0.999) and substantial rank agreement on most datasets. Overall, STRIDE complements scalar attribution methods by offering a structured functional perspective, enabling novel diagnostics like ‘component surgery’ to quantitatively measure the impact of specific interactions within our experimental scope.
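The recursive kernel-centering at the heart of the method builds on standard double centering of a Gram matrix; a sketch of that one step on a toy RBF kernel (not the authors' implementation):

```python
import numpy as np

def center_gram(K):
    """One kernel-centering step: project out the constant function in feature
    space via K_c = H K H with H = I - (1/n) 11^T (standard double centering)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

# e.g., an RBF Gram matrix on toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
Kc = center_gram(K)
assert np.allclose(Kc.sum(axis=0), 0.0, atol=1e-8)  # centered: rows/cols sum to zero
```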
[247] “A 6 or a 9?”: Ensemble Learning Through the Multiplicity of Performant Models and Explanations
Gianlucca Zuin, Adriano Veloso
Main category: cs.LG
TL;DR: The Rashomon Ensemble method selects diverse high-performing models based on performance and explanations to improve generalization, achieving up to 0.20+ AUROC improvements in real-world applications.
Details
Motivation: Addressing the challenge of selecting models that generalize well, particularly in real-world scenarios where multiple models perform similarly (Rashomon Effect) but may have different underlying patterns.
Method: Proposes Rashomon Ensemble that strategically selects models from diverse high-performing solutions by grouping them based on both performance and explanations to maximize diversity while maintaining predictive accuracy.
Result: Validated on real-world datasets showing up to 0.20+ AUROC improvements, especially when Rashomon ratio is large. Demonstrates robustness to distribution shifts and practical business benefits.
Conclusion: The Rashomon Ensemble approach effectively improves generalization by leveraging diverse high-performing models, making it robust and practical for real-world applications with measurable performance gains.
Abstract: Creating models from past observations and ensuring their effectiveness on new data is the essence of machine learning. However, selecting models that generalize well remains a challenging task. Related to this topic, the Rashomon Effect refers to cases where multiple models perform similarly well for a given learning problem. This often occurs in real-world scenarios, like the manufacturing process or medical diagnosis, where diverse patterns in data lead to multiple high-performing solutions. We propose the Rashomon Ensemble, a method that strategically selects models from these diverse high-performing solutions to improve generalization. By grouping models based on both their performance and explanations, we construct ensembles that maximize diversity while maintaining predictive accuracy. This selection ensures that each model covers a distinct region of the solution space, making the ensemble more robust to distribution shifts and variations in unseen data. We validate our approach on both open and proprietary collaborative real-world datasets, demonstrating up to 0.20+ AUROC improvements in scenarios where the Rashomon ratio is large. Additionally, we demonstrate tangible benefits for businesses in various real-world applications, highlighting the robustness, practicality, and effectiveness of our approach.
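A toy sketch of the selection recipe under stated assumptions (the epsilon threshold, candidate pool, and clustering of permutation-importance vectors are illustrative choices, not the paper's exact procedure):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
Xtr, Xva, ytr, yva = train_test_split(X, y, random_state=0)

# 1) candidate pool: many similarly performing models (the Rashomon set)
pool = [RandomForestClassifier(max_depth=d, random_state=s).fit(Xtr, ytr)
        for d in (3, 5, None) for s in range(4)]
aucs = np.array([roc_auc_score(yva, m.predict_proba(Xva)[:, 1]) for m in pool])
rashomon = [m for m, a in zip(pool, aucs) if a >= aucs.max() - 0.01]  # epsilon = 0.01

# 2) group by explanation (permutation-importance vectors), keep one per group
imps = np.array([permutation_importance(m, Xva, yva, n_repeats=5,
                                         random_state=0).importances_mean
                 for m in rashomon])
k = min(3, len(rashomon))
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(imps)
ensemble = [rashomon[int(np.argmax(labels == c))] for c in range(k)]

# 3) average the selected models' probabilities
proba = np.mean([m.predict_proba(Xva)[:, 1] for m in ensemble], axis=0)
```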
[248] An entropy formula for the Deep Linear Network
Govind Menon, Tianmin Yu
Main category: cs.LG
TL;DR: Riemannian geometry analysis of Deep Linear Networks using group actions and Riemannian submersion to study overparametrization and define Boltzmann entropy.
Details
Motivation: To establish a thermodynamic description of the learning process in deep linear networks by studying their Riemannian geometry properties.
Method: Using group actions to analyze overparametrization, Riemannian submersion from parameter space to observable space, and constructing orthonormal basis for tangent space using Jacobi matrix theory.
Result: Successfully defined Boltzmann entropy using foliation of balanced manifold by group orbits, and showed Riemannian geometry on observable space is obtained via submersion of balanced manifold.
Conclusion: The Riemannian geometric framework provides a foundation for thermodynamic description of learning in deep linear networks through group actions and submersion techniques.
Abstract: We study the Riemannian geometry of the Deep Linear Network (DLN) as a foundation for a thermodynamic description of the learning process. The main tools are the use of group actions to analyze overparametrization and the use of Riemannian submersion from the space of parameters to the space of observables. The foliation of the balanced manifold in the parameter space by group orbits is used to define and compute a Boltzmann entropy. We also show that the Riemannian geometry on the space of observables defined in [2] is obtained by Riemannian submersion of the balanced manifold. The main technical step is an explicit construction of an orthonormal basis for the tangent space of the balanced manifold using the theory of Jacobi matrices.
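For orientation, the objects involved can be written compactly. The balance condition is standard in the DLN literature; the entropy expression below is a hedged paraphrase of "log-volume of the group orbit" from the abstract, not the paper's exact formula.

```latex
% End-to-end map and balanced manifold (standard DLN definitions):
W = W_N W_{N-1} \cdots W_1,
\qquad
\mathcal{M} = \bigl\{ (W_1,\dots,W_N) \,:\, W_{k+1}^{\top} W_{k+1} = W_k W_k^{\top},\ 1 \le k \le N-1 \bigr\}.
% Boltzmann entropy as orbit log-volume (paraphrase; normalization omitted):
S(W) = \log \operatorname{Vol}\bigl(\mathcal{O}_W\bigr),
\quad \mathcal{O}_W = \text{group orbit through a balanced point lying over } W.
```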
[249] Sensitivity-LoRA: Low-Load Sensitivity-Based Fine-Tuning for Large Language Models
Hao Zhang, Bo Huang, Zhenjia Li, Xi Xiao, Hui Yi Leong, Zumeng Zhang, Xinwei Long, Tianyang Wang, Hao Xu
Main category: cs.LG
TL;DR: Sensitivity-LoRA is an efficient fine-tuning method that dynamically allocates ranks to weight matrices based on global and local sensitivities using second-order derivatives, overcoming limitations of uniform rank allocation in traditional LoRA.
Details
Motivation: Adapting LLMs for specialized tasks in resource-constrained environments is challenging. Traditional LoRA uses uniform rank allocation which is inefficient, and existing rank allocation techniques are computationally expensive, complex, and unstable.
Method: Proposes Sensitivity-LoRA that dynamically allocates ranks to weight matrices based on their global and local sensitivities using second-order derivatives (Hessian Matrix) of the loss function to capture weight sensitivity with minimal computational overhead.
Result: Experimental results demonstrate robust effectiveness, efficiency, and stability across diverse tasks and benchmarks.
Conclusion: Sensitivity-LoRA provides an optimal rank allocation approach that overcomes the limitations of traditional LoRA methods, making it practical for real-world applications in resource-constrained environments.
Abstract: Large Language Models (LLMs) have transformed both everyday life and scientific research. However, adapting LLMs from general-purpose models to specialized tasks remains challenging, particularly in resource-constrained environments. Low-Rank Adaptation (LoRA), a prominent method within Parameter-Efficient Fine-Tuning (PEFT), has emerged as a promising approach to LLMs by approximating model weight updates using low-rank decomposition. However, LoRA is limited by its uniform rank ( r ) allocation to each incremental matrix, and existing rank allocation techniques aimed at addressing this issue remain computationally inefficient, complex, and unstable, hindering practical applications. To address these limitations, we propose Sensitivity-LoRA, an efficient fine-tuning method that dynamically allocates ranks to weight matrices based on both their global and local sensitivities. It leverages the second-order derivatives (Hessian Matrix) of the loss function to effectively capture weight sensitivity, enabling optimal rank allocation with minimal computational overhead. Our experimental results have demonstrated robust effectiveness, efficiency and stability of Sensitivity-LoRA across diverse tasks and benchmarks.
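A sketch of budgeted rank allocation (the accumulated squared-gradient score below is a cheap diagonal stand-in for the paper's Hessian-based sensitivity; all names are illustrative):

```python
import torch

def allocate_ranks(sensitivity, total_rank, r_min=1):
    """Split a total rank budget across weight matrices in proportion
    to their normalized sensitivity scores."""
    s = torch.tensor(list(sensitivity.values()), dtype=torch.float)
    share = s / s.sum()
    ranks = torch.clamp((share * total_rank).round().long(), min=r_min)
    return dict(zip(sensitivity.keys(), ranks.tolist()))

def sensitivity_scores(model, loss_fn, batches):
    """Sensitivity proxy: accumulated squared gradients per weight matrix
    (a diagonal stand-in for the second-order score in the paper)."""
    scores = {}
    for xb, yb in batches:
        model.zero_grad()
        loss_fn(model(xb), yb).backward()
        for name, p in model.named_parameters():
            if p.grad is not None and p.dim() == 2:  # weight matrices only
                scores[name] = scores.get(name, 0.0) + float((p.grad ** 2).sum())
    return scores

# ranks = allocate_ranks(sensitivity_scores(model, loss_fn, loader), total_rank=64)
```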
[250] Learning What Matters: Causal Time Series Modeling for Arctic Sea Ice Prediction
Emam Hossain, Md Osman Gani
Main category: cs.LG
TL;DR: A causality-aware deep learning framework that integrates MVGC and PCMCI+ for causal feature selection, improving Arctic Sea Ice Extent prediction accuracy and interpretability.
Details
Motivation: Conventional ML/DL models rely on correlation-based learning which fails to distinguish genuine causal relationships from spurious associations, limiting robustness and generalization.
Method: Integrates Multivariate Granger Causality (MVGC) and PCMCI+ for causal feature selection within a hybrid neural architecture, using 43 years of Arctic Sea Ice Extent data and ocean-atmospheric variables.
Result: Identifies causally influential predictors, reduces unnecessary features, enhances computational efficiency, and shows improved prediction accuracy and interpretability across varying lead times.
Conclusion: The framework is broadly applicable to other dynamic, high-dimensional domains, offering a scalable approach that advances both theoretical foundations and practical performance of causality-informed predictive modeling.
Abstract: Conventional machine learning and deep learning models typically rely on correlation-based learning, which often fails to distinguish genuine causal relationships from spurious associations, limiting their robustness, interpretability, and ability to generalize. To overcome these limitations, we introduce a causality-aware deep learning framework that integrates Multivariate Granger Causality (MVGC) and PCMCI+ for causal feature selection within a hybrid neural architecture. Leveraging 43 years (1979-2021) of Arctic Sea Ice Extent (SIE) data and associated ocean-atmospheric variables at daily and monthly resolutions, the proposed method identifies causally influential predictors, prioritizes direct causes of SIE dynamics, reduces unnecessary features, and enhances computational efficiency. Experimental results show that incorporating causal inputs leads to improved prediction accuracy and interpretability across varying lead times. While demonstrated on Arctic SIE forecasting, the framework is broadly applicable to other dynamic, high-dimensional domains, offering a scalable approach that advances both the theoretical foundations and practical performance of causality-informed predictive modeling.
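A sketch of the Granger-causal screening step using statsmodels (pairwise only; the paper combines multivariate Granger causality with PCMCI+, which this does not reproduce, and `climate_df` / column names are hypothetical):

```python
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

def granger_screen(df, target, max_lag=5, alpha=0.05):
    """Keep drivers whose past values help predict `target`
    (second column Granger-causing the first, per statsmodels convention)."""
    keep = []
    for col in df.columns.drop(target):
        res = grangercausalitytests(df[[target, col]].dropna().values,
                                    maxlag=max_lag, verbose=False)
        p_min = min(res[lag][0]["ssr_ftest"][1] for lag in res)
        if p_min < alpha:
            keep.append(col)
    return keep

# e.g. granger_screen(climate_df, target="sea_ice_extent") -> causal candidates
```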
[251] Peering Partner Recommendation for ISPs using Machine Learning
Md Ibrahim Ibne Alam, Ankur Senapati, Anindo Mahmood, Murat Yuksel, Koushik Kar
Main category: cs.LG
TL;DR: Machine learning model using public ISP data to automate peering partner selection with 98% accuracy using XGBoost.
Details
Motivation: Automate the lengthy and complex peering process between ISPs to enhance efficiency in global Internet connectivity.
Method: Gathered data from public databases (PeeringDB, CAIDA), evaluated tree-based, neural network, and transformer ML models for predicting peering relationships.
Result: XGBoost achieved 98% accuracy and showed resilience to time, space, and missing data variations.
Conclusion: ISPs can adopt this ML approach to fully automate peering partner selection for a more efficient Internet ecosystem.
Abstract: Internet service providers (ISPs) need to connect with other ISPs to provide global connectivity services to their users. To ensure global connectivity, ISPs can either use transit service(s) or establish direct peering relationships between themselves via Internet exchange points (IXPs). Peering offers more room for ISP-specific optimizations and is preferred, but it often involves a lengthy and complex process. Automating peering partner selection can enhance efficiency in the global Internet ecosystem. We explore the use of publicly available data on ISPs to develop a machine learning (ML) model that can predict whether an ISP pair should peer or not. At first, we explore public databases, e.g., PeeringDB, CAIDA, etc., to gather data on ISPs. Then, we evaluate the performance of three broad types of ML models for predicting peering relationships: tree-based, neural network-based, and transformer-based. Among these, we observe that tree-based models achieve the highest accuracy and efficiency in our experiments. The XGBoost model trained with publicly available data showed promising performance, with a 98% accuracy rate in predicting peering partners. In addition, the model demonstrated great resilience to variations in time, space, and missing data. We envision that ISPs can adopt our method to fully automate the peering partner selection process, thus transitioning to a more efficient and optimized Internet ecosystem.
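A minimal sketch of the classification setup (synthetic stand-in features; real inputs would be per-pair attributes assembled from PeeringDB and CAIDA, and the hyperparameters are illustrative):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))            # stand-in for per-pair ISP features
y = (X[:, 0] + X[:, 3] > 0).astype(int)    # stand-in peer / no-peer labels

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, stratify=y,
                                      random_state=0)
clf = xgb.XGBClassifier(n_estimators=400, max_depth=6, learning_rate=0.1,
                        subsample=0.9, eval_metric="logloss")
clf.fit(Xtr, ytr)
print("accuracy:", accuracy_score(yte, clf.predict(Xte)))
```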
[252] HISPASpoof: A New Dataset For Spanish Speech Forensics
Maria Risques, Kratika Bhagtani, Amit Kumar Singh Yadav, Edward J. Delp
Main category: cs.LG
TL;DR: HISPASpoof is the first large-scale Spanish dataset for synthetic speech detection and attribution, addressing the underrepresentation of Spanish in speech forensics research.
Details
Motivation: Spanish is spoken by over 600 million people but remains underrepresented in synthetic speech detection research, while existing detectors for English and Chinese fail to generalize to Spanish.
Method: Created HISPASpoof dataset with real speech from public corpora across six Spanish accents and synthetic speech generated using six zero-shot TTS systems. Evaluated five representative detection methods.
Result: Detectors trained on English failed to generalize to Spanish, while training on HISPASpoof substantially improved detection performance. Also evaluated synthetic speech attribution (identifying generation methods).
Conclusion: HISPASpoof provides a critical benchmark for advancing reliable and inclusive speech forensics in Spanish, addressing a significant gap in multilingual speech security research.
Abstract: Zero-shot Voice Cloning (VC) and Text-to-Speech (TTS) methods have advanced rapidly, enabling the generation of highly realistic synthetic speech and raising serious concerns about their misuse. While numerous detectors have been developed for English and Chinese, Spanish-spoken by over 600 million people worldwide-remains underrepresented in speech forensics. To address this gap, we introduce HISPASpoof, the first large-scale Spanish dataset designed for synthetic speech detection and attribution. It includes real speech from public corpora across six accents and synthetic speech generated with six zero-shot TTS systems. We evaluate five representative methods, showing that detectors trained on English fail to generalize to Spanish, while training on HISPASpoof substantially improves detection. We also evaluate synthetic speech attribution performance on HISPASpoof, i.e., identifying the generation method of synthetic speech. HISPASpoof thus provides a critical benchmark for advancing reliable and inclusive speech forensics in Spanish.
[253] Quantum Machine Learning, Quantitative Trading, Reinforcement Learning, Deep Learning
Jun-Hao Chen, Yu-Chien Huang, Yun-Cheng Tsai, Samuel Yen-Chi Chen
Main category: cs.LG
TL;DR: Quantum-inspired neural networks combined with deep reinforcement learning for FX trading, achieving 11.87% return with minimal drawdown using QLSTM and QA3C on USD/TWD.
Details
Motivation: To explore the convergence of quantum-inspired neural networks and deep reinforcement learning for improved financial trading performance, particularly in currency markets.
Method: Implemented a trading agent using Quantum LSTM for short-term trend prediction integrated with Quantum A3C (QA3C) reinforcement learning. Trained on 2000-2025 data (80% training, 20% testing) with specific state design, reward functions, and multi-core training.
Result: Achieved 11.87% return over ~5 years with only 0.92% max drawdown, outperforming several currency ETFs. Hybrid quantum-classical models showed competitive FX trading performance.
Conclusion: QLSTM proves effective for small-profit trades with tight risk control. The approach shows promise but has limitations including classical quantum simulation and simplified strategy. Future enhancements are planned.
Abstract: The convergence of quantum-inspired neural networks and deep reinforcement learning offers a promising avenue for financial trading. We implemented a trading agent for USD/TWD by integrating Quantum Long Short-Term Memory (QLSTM) for short-term trend prediction with Quantum Asynchronous Advantage Actor-Critic (QA3C), a quantum-enhanced variant of the classical A3C. Trained on data from 2000-01-01 to 2025-04-30 (80% training, 20% testing), the long-only agent achieves 11.87% return over around 5 years with 0.92% max drawdown, outperforming several currency ETFs. We detail state design (QLSTM features and indicators), reward function for trend-following/risk control, and multi-core training. Results show hybrid models yield competitive FX trading performance. Implications include QLSTM’s effectiveness for small-profit trades with tight risk and future enhancements. Key hyperparameters: QLSTM sequence length = 4, QA3C workers = 8. Limitations: classical quantum simulation and simplified strategy. (Disclaimer: The views expressed in this article are those of the authors and do not represent the views of Wells Fargo. This article is for informational purposes only. Nothing contained in this article should be construed as investment advice. Wells Fargo makes no express or implied warranties and expressly disclaims all legal, tax, and accounting implications related to this article.)
[254] Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL
Hanyi Mao, Quanjia Xiao, Lei Pang, Haixiao Liu
Main category: cs.LG
TL;DR: FSPO is a sequence-level RL method that addresses length bias in LLM training by introducing length-fair clipping in importance-sampling weight space, ensuring fair treatment of short and long responses.
Details
Motivation: There's a mismatch when PPO/GRPO-style clipping is applied to sequences: fixed clip ranges systematically reweight short vs long responses, distorting the effective objective and creating length bias.
Method: FSPO clips sequence log-importance-sampling ratios with a band that applies a KL-corrected drift term and scales as √L (square root of length), ensuring length-fair clipping directly in IS weight space.
Result: Empirically, FSPO flattens clip rates across length bins, stabilizes training, and outperforms all baselines across multiple evaluation datasets.
Conclusion: FSPO provides a theoretically grounded solution to length fairness in sequence-level RL, with proven directional cosine guarantees and practical performance improvements over existing methods.
Abstract: We propose FSPO (Fair Sequence Policy Optimization), a sequence-level reinforcement learning method for LLMs that enforces length-fair clipping directly in the importance-sampling (IS) weight space. We revisit sequence-level RL methods and identify a mismatch when PPO/GRPO-style clipping is transplanted to sequences: a fixed clip range systematically reweights short vs. long responses, distorting the effective objective. Theoretically, we formalize length fairness via a Length Reweighting Error (LRE) and prove that small LRE yields a directional cosine guarantee between the clipped and true updates. FSPO introduces a simple, Gaussian-motivated remedy: we clip the sequence log-IS ratio with a band that applies a KL-corrected drift term and scales as $\sqrt{L}$. Empirically, FSPO flattens clip rates across length bins, stabilizes training, and outperforms all baselines across multiple evaluation datasets.
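A hedged sketch of the length-fair clipping rule (the drift term is set to zero here for simplicity; its KL-corrected form and the constant `c` are specified in the paper):

```python
import torch

def fspo_clip(log_is_ratio, seq_len, c=0.2, drift=None):
    """Clip the sequence-level log-IS ratio in a band whose half-width grows
    as sqrt(L), so long responses are not systematically over-clipped."""
    L = seq_len.to(log_is_ratio.dtype)
    half_width = c * torch.sqrt(L)
    center = torch.zeros_like(log_is_ratio) if drift is None else drift
    clipped = torch.clamp(log_is_ratio, min=center - half_width,
                          max=center + half_width)
    return torch.exp(clipped)  # sequence-level IS weight after fair clipping

# same log ratio, different lengths: the band widens with sqrt(L)
w = fspo_clip(torch.tensor([1.5, 1.5]), seq_len=torch.tensor([16, 256]))
# L=16 is clipped to exp(0.8); L=256 passes through unclipped
```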
[255] Breaking the Statistical Similarity Trap in Extreme Convection Detection
Md Tanveer Hossain Munim
Main category: cs.LG
TL;DR: Current weather model metrics create a “Statistical Similarity Trap” that rewards blurry predictions while missing rare extreme events. DART framework addresses this with dual-decoder architecture for high-resolution satellite data conversion optimized for extreme convection detection.
Details
Motivation: Traditional evaluation metrics for deep learning weather models fail to capture rare but high-impact extreme weather events, creating a trap where models appear statistically similar but miss critical dangerous phenomena.
Method: DART (Dual Architecture for Regression Tasks) uses dual-decoder architecture with explicit background/extreme decomposition, physically motivated oversampling, and task-specific loss functions to transform coarse forecasts into high-resolution satellite brightness temperature fields.
Result: DART achieves CSI = 0.273 with bias = 2.52 vs. 6.72 for baselines; removing Integrated Water Vapor Transport improves extreme convection detection by 270% (the “IVT Paradox”); validated on a real-world flooding disaster. Trains in under 10 minutes on standard hardware.
Conclusion: DART provides the first systematic solution for hybrid conversion-segmentation-downscaling tasks in weather forecasting, enabling precise operational calibration and demonstrating a pathway toward trustworthy AI for extreme weather preparedness.
Abstract: Current evaluation metrics for deep learning weather models create a “Statistical Similarity Trap”, rewarding blurry predictions while missing rare, high-impact events. We provide quantitative evidence of this trap, showing sophisticated baselines achieve 97.9% correlation yet 0.00 CSI for dangerous convection detection. We introduce DART (Dual Architecture for Regression Tasks), a framework addressing the challenge of transforming coarse atmospheric forecasts into high-resolution satellite brightness temperature fields optimized for extreme convection detection (below 220 K). DART employs dual-decoder architecture with explicit background/extreme decomposition, physically motivated oversampling, and task-specific loss functions. We present four key findings: (1) empirical validation of the Statistical Similarity Trap across multiple sophisticated baselines; (2) the “IVT Paradox”, removing Integrated Water Vapor Transport, widely regarded as essential for atmospheric river analysis, improves extreme convection detection by 270%; (3) architectural necessity demonstrated through operational flexibility (DART achieves CSI = 0.273 with bias = 2.52 vs. 6.72 for baselines at equivalent CSI), and (4) real-world validation with the August 2023 Chittagong flooding disaster as a case study. To our knowledge, this is the first work to systematically address this hybrid conversion-segmentation-downscaling task, with no direct prior benchmarks identified in existing literature. Our validation against diverse statistical and deep learning baselines sufficiently demonstrates DART’s specialized design. The framework enables precise operational calibration through beta-tuning, trains in under 10 minutes on standard hardware, and integrates seamlessly with existing meteorological workflows, demonstrating a pathway toward trustworthy AI for extreme weather preparedness.
[256] Incentivizing Safer Actions in Policy Optimization for Constrained Reinforcement Learning
Somnath Hazra, Pallab Dasgupta, Soumyajit Dey
Main category: cs.LG
TL;DR: IP3O algorithm integrates adaptive incentives and progressive penalties to stabilize constrained RL training near constraint boundaries, outperforming state-of-the-art safe RL methods.
Details
Motivation: Constrained RL faces challenges in balancing reward maximization with constraint satisfaction, particularly instability near constraint boundaries in continuous control settings.
Method: Proposed Incrementally Penalized Proximal Policy Optimization (IP3O) with adaptive incentive mechanism and progressively increasing penalty to stabilize training dynamics.
Result: Empirical evaluation on benchmark environments shows IP3O outperforms state-of-the-art Safe RL algorithms, with theoretical guarantees on worst-case optimality error bounds.
Conclusion: IP3O effectively addresses training instability in constrained RL through adaptive incentives and progressive penalties, providing both practical performance improvements and theoretical guarantees.
Abstract: Constrained Reinforcement Learning (RL) aims to maximize the return while adhering to predefined constraint limits, which represent domain-specific safety requirements. In continuous control settings, where learning agents govern system actions, balancing the trade-off between reward maximization and constraint satisfaction remains a significant challenge. Policy optimization methods often exhibit instability near constraint boundaries, resulting in suboptimal training performance. To address this issue, we introduce a novel approach that integrates an adaptive incentive mechanism in addition to the reward structure to stay within the constraint bound before approaching the constraint boundary. Building on this insight, we propose Incrementally Penalized Proximal Policy Optimization (IP3O), a practical algorithm that enforces a progressively increasing penalty to stabilize training dynamics. Through empirical evaluation on benchmark environments, we demonstrate the efficacy of IP3O compared to the performance of state-of-the-art Safe RL algorithms. Furthermore, we provide theoretical guarantees by deriving a bound on the worst-case error of the optimality achieved by our algorithm.
[257] Identifying Key Features for Establishing Sustainable Agro-Tourism Centre: A Data Driven Approach
Alka Gadakh, Vidya Kumbhar, Sonal Khosla, Kumar Karunendra
Main category: cs.LG
TL;DR: Study identifies key indicators for agro-tourism growth using machine learning feature selection methods, with LASSO combined with Logistic Regression achieving 98-99% accuracy.
Details
Motivation: Agro-tourism is a strategic economic model for rural development that diversifies farmer income and preserves cultural heritage, requiring detailed study of growth strategies.
Method: Two-phase study: comprehensive literature review to identify indicators, then machine learning feature selection techniques (LASSO with Logistic Regression, Decision Trees, Random Forest, XGBoost) applied to identify key growth indicators.
Result: LASSO with Logistic Regression achieved highest classification accuracy: 98% in 70-30% train-test split and 99% in 80-20% split. Other models (RF, DT, XGBoost) showed 95-97% accuracy.
Conclusion: Machine learning feature selection methods, particularly LASSO combined with Logistic Regression, are effective for identifying key indicators for agro-tourism growth, providing high accuracy predictions that can inform strategic development.
Abstract: Agro-tourism serves as a strategic economic model designed to facilitate rural development by diversifying income streams for local communities like farmers while promoting the conservation of indigenous cultural heritage and traditional agricultural practices. As a rapidly growing subdomain of tourism, its growth strategies need to be studied in detail. The current study has identified the important indicators for the growth and enhancement of agro-tourism. The study is conducted in two phases: identification of the important indicators through a comprehensive literature review, and, in the second phase, application of state-of-the-art techniques to identify the important indicators for the growth of agro-tourism. The indicators are synonymously called features. Machine learning models for feature selection were applied, and the Least Absolute Shrinkage and Selection Operator (LASSO) method was combined with machine learning classifiers such as Logistic Regression (LR), Decision Trees (DT), Random Forest (RF), and Extreme Gradient Boosting (XGBoost) to predict the growth of agro-tourism. The results show that with the LASSO method, the LR model gives the highest classification accuracy of 98% on the 70-30% train-test split, followed by RF with 95% accuracy. Similarly, on the 80-20% train-test split LR maintains the highest accuracy at 99%, while DT and XGBoost follow with 97% accuracy.
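The winning combination is straightforward to reproduce in outline with scikit-learn (synthetic stand-in data, since the study's indicator matrix is not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=25, n_informative=8,
                           random_state=0)  # stand-in for the indicator matrix
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)  # 70-30

model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectFromModel(Lasso(alpha=0.01))),  # LASSO keeps key indicators
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(Xtr, ytr)
print("accuracy:", model.score(Xte, yte))
```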
[258] Vejde: A Framework for Inductive Deep Reinforcement Learning Based on Factor Graph Color Refinement
Jakob Nyberg, Pontus Johnson
Main category: cs.LG
TL;DR: Vejde framework combines data abstraction, graph neural networks, and reinforcement learning to create inductive policy functions for decision problems with structured states, enabling generalization to unseen problem instances.
Details
Motivation: To address decision problems with richly structured states (object classes and relations) and create policies that can generalize across problem instances of varying size and structure.
Method: Represents MDP states as databases of facts, converts states to bipartite graphs, uses neural message passing for latent state mapping, and employs both supervised and reinforcement learning for policy training.
Result: Vejde policies generalized to test instances without significant score loss, performing close to instance-specific MLP agents on unseen instances across eight problem domains.
Conclusion: The framework successfully produces inductive policy functions that handle varying problem sizes and structures while maintaining performance on unseen instances, demonstrating effective generalization capabilities.
Abstract: We present and evaluate Vejde; a framework which combines data abstraction, graph neural networks and reinforcement learning to produce inductive policy functions for decision problems with richly structured states, such as object classes and relations. MDP states are represented as data bases of facts about entities, and Vejde converts each state to a bipartite graph, which is mapped to latent states through neural message passing. The factored representation of both states and actions allows Vejde agents to handle problems of varying size and structure. We tested Vejde agents on eight problem domains defined in RDDL, with ten problem instances each, where policies were trained using both supervised and reinforcement learning. To test policy generalization, we separate problem instances in two sets, one for training and the other solely for testing. Test results on unseen instances for the Vejde agents were compared to MLP agents trained on each problem instance, as well as the online planning algorithm Prost. Our results show that Vejde policies in average generalize to the test instances without a significant loss in score. Additionally, the inductive agents received scores on unseen test instances that on average were close to the instance-specific MLP agents.
[259] Constructing a Question-Answering Simulator through the Distillation of LLMs
Haipeng Liu, Ting Long, Jing Fu
Main category: cs.LG
TL;DR: LDSim is a QA simulator that distills LLM knowledge to improve prediction of student response correctness, achieving strong performance while maintaining efficiency.
Details
Motivation: To prevent harmful recommendations from undertrained educational recommender systems by creating better QA simulators that can generate training data without interacting with real students.
Method: Proposes LLM Distillation based Simulator (LDSim) which distills domain knowledge and reasoning capability from large language models to enhance prediction performance while maintaining efficiency.
Result: Extensive experiments show LDSim achieves strong results on both simulation tasks and knowledge tracing tasks, outperforming traditional methods.
Conclusion: LDSim successfully bridges the gap between LLM-free and LLM-based methods by distilling LLM capabilities into an efficient model that maintains high performance for educational simulation tasks.
Abstract: The question-answering (QA) simulator is a model that mimics real student learning behaviors and predicts the correctness of their responses to questions. QA simulators enable educational recommender systems (ERS) to collect large amounts of training data without interacting with real students, thereby preventing harmful recommendations made by an undertrained ERS from undermining actual student learning. Given the QA history, there are two categories of solutions for predicting correctness and conducting the simulation: (1) LLM-free methods, which apply a traditional sequential model to transfer the QA history into a vector representation first, and make predictions based on the representation; (2) LLM-based methods, which leverage the domain knowledge and reasoning capability of LLMs to enhance the prediction. LLM-free methods offer fast inference but generally yield suboptimal performance. In contrast, most LLM-based methods achieve better results, but at the cost of slower inference speed and higher GPU memory consumption. In this paper, we propose a method named LLM Distillation based Simulator (LDSim), which distills domain knowledge and reasoning capability from an LLM to better assist prediction, thereby improving simulation performance. Extensive experiments demonstrate that our LDSim achieves strong results on both the simulation task and the knowledge tracing (KT) task. Our code is publicly available at https://anonymous.4open.science/r/LDSim-05A9.
[260] Unsupervised Multi-Attention Meta Transformer for Rotating Machinery Fault Diagnosis
Hanyang Wang, Yuxuan Yang, Hongjun Wang, Lihui Wang
Main category: cs.LG
TL;DR: Proposes MMT-FD method for few-shot unsupervised fault diagnosis of rotating machinery using time-frequency domain encoder and meta-learning to achieve 99% accuracy with only 1% labeled data.
Details
Motivation: Addresses challenges of limited fault samples and lack of generalizability in practical industrial applications where acquiring labeled data is difficult and expensive.
Method: Multi-Attention Meta Transformer framework with time-frequency domain encoder for representation extraction and meta-learning network for classification and generalization, optimized through contrastive learning with minimal iterations.
Result: Achieves 99% fault diagnosis accuracy using only 1% of labeled sample data, demonstrating strong generalization capabilities across different mechanical equipment types.
Conclusion: MMT-FD provides an efficient and highly accurate solution for few-shot unsupervised fault diagnosis with excellent generalization, making it suitable for practical industrial applications with limited labeled data.
Abstract: The intelligent fault diagnosis of rotating mechanical equipment usually requires a large amount of labeled sample data. However, in practical industrial applications, acquiring enough data is both challenging and expensive in terms of time and cost. Moreover, different types of rotating mechanical equipment with different unique mechanical properties, require separate training of diagnostic models for each case. To address the challenges of limited fault samples and the lack of generalizability in prediction models for practical engineering applications, we propose a Multi-Attention Meta Transformer method for few-shot unsupervised rotating machinery fault diagnosis (MMT-FD). This framework extracts potential fault representations from unlabeled data and demonstrates strong generalization capabilities, making it suitable for diagnosing faults across various types of mechanical equipment. The MMT-FD framework integrates a time-frequency domain encoder and a meta-learning generalization model. The time-frequency domain encoder predicts status representations generated through random augmentations in the time-frequency domain. These enhanced data are then fed into a meta-learning network for classification and generalization training, followed by fine-tuning using a limited amount of labeled data. The model is iteratively optimized using a small number of contrastive learning iterations, resulting in high efficiency. To validate the framework, we conducted experiments on a bearing fault dataset and rotor test bench data. The results demonstrate that the MMT-FD model achieves 99% fault diagnosis accuracy with only 1% of labeled sample data, exhibiting robust generalization capabilities.
[261] Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents
Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang, Yuan Lin, Yu Yue, Lin Zhang, Yang Wang, Ke Wang
Main category: cs.LG
TL;DR: EMPG addresses LLM policy gradient issues by modulating updates based on action confidence and task outcomes, achieving better performance in long-horizon tasks.
Details
Motivation: LLM agents struggle with sparse rewards in long-horizon tasks, and traditional policy gradients have inefficient learning dynamics where update magnitude is coupled with entropy, leading to unstable training.
Method: Proposes Entropy-Modulated Policy Gradients (EMPG) that recalibrates learning signals based on step-wise uncertainty and final outcomes, amplifying confident correct actions, penalizing confident errors, and attenuating uncertain steps with a future clarity bonus.
Result: EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines on three challenging agent tasks: WebShop, ALFWorld, and Deep Search.
Conclusion: EMPG effectively addresses the fundamental learning dynamics problem in LLMs by modulating policy gradients based on uncertainty, leading to more stable and efficient learning in long-horizon tasks with sparse rewards.
Abstract: In long-horizon tasks, recent agents based on Large Language Models (LLMs) face a significant challenge that sparse, outcome-based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques like inverse reinforcement learning or by using Process Reward Models for step-by-step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of policy gradients is inherently coupled with the entropy, which leads to inefficient small updates for confident correct actions and potentially destabilizes large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a bonus term for future clarity that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines. Project page is at https://empgseed-seed.github.io/
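A sketch of the modulation idea (the exact recalibration and the future-clarity bonus are defined in the paper; this is a simplified rendering with illustrative names):

```python
import torch

def empg_step_weights(step_entropy, outcome, alpha=1.0):
    """Confident (low-entropy) steps get larger-magnitude credit signed by the
    final outcome; uncertain steps are attenuated."""
    h = step_entropy / (step_entropy.mean() + 1e-8)  # batch-normalized entropy
    confidence = torch.exp(-alpha * h)               # low entropy -> weight near 1
    return outcome.unsqueeze(-1) * confidence        # outcome in {+1, -1}

# usage inside a policy-gradient loss over per-step log-probs
H = torch.rand(4, 10)                  # (batch, steps) step entropies
R = torch.tensor([1., -1., 1., 1.])    # final task outcomes
logp = torch.randn(4, 10, requires_grad=True)
loss = -(empg_step_weights(H, R).detach() * logp).mean()
```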
[262] MoSE: Unveiling Structural Patterns in Graphs via Mixture of Subgraph Experts
Junda Ye, Zhongbao Zhang, Li Sun, Siqiang Luo
Main category: cs.LG
TL;DR: MoSE framework enhances GNNs by using anonymous walks to extract subgraphs and dynamically routing them to specialized experts, improving structural expressiveness and interpretability across graph tasks.
Details
Motivation: Traditional GNNs rely on local message passing which limits their ability to capture complex subgraph patterns. Existing methods using random walk kernels are limited to graph-level tasks and lack flexibility.
Method: Proposes Mixture of Subgraph Experts (MoSE) framework that extracts subgraphs via anonymous walks and dynamically routes them to specialized experts based on structural semantics.
Result: Theoretical analysis shows MoSE is more powerful than Subgraph Weisfeiler-Lehman Test. Extensive experiments demonstrate superior performance over baselines with interpretable insights into learned structural patterns.
Conclusion: MoSE provides a flexible and expressive framework for subgraph-based representation learning that outperforms existing methods while offering improved interpretability across diverse graph tasks.
Abstract: While graph neural networks (GNNs) have achieved great success in learning from graph-structured data, their reliance on local, pairwise message passing restricts their ability to capture complex, high-order subgraph patterns, leading to insufficient structural expressiveness. Recent efforts have attempted to enhance structural expressiveness by integrating random walk kernels into GNNs. However, these methods are inherently designed for graph-level tasks, which limits their applicability to other downstream tasks such as node classification. Moreover, their fixed kernel configurations hinder the model’s flexibility in capturing diverse subgraph structures. To address these limitations, this paper proposes a novel Mixture of Subgraph Experts (MoSE) framework for flexible and expressive subgraph-based representation learning across diverse graph tasks. Specifically, MoSE extracts informative subgraphs via anonymous walks and dynamically routes them to specialized experts based on structural semantics, enabling the model to capture diverse subgraph patterns with improved flexibility and interpretability. We further provide a theoretical analysis of MoSE’s expressivity within the Subgraph Weisfeiler-Lehman (SWL) Test, proving that it is more powerful than SWL. Extensive experiments, together with visualizations of learned subgraph experts, demonstrate that MoSE not only outperforms competitive baselines but also provides interpretable insights into structural patterns learned by the model.
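The anonymous-walk extraction at the front of the pipeline is simple to state in code (the routing and the experts themselves are omitted here):

```python
def anonymize(walk):
    """Map a node sequence to its anonymous-walk pattern: each node is replaced
    by the index of its first occurrence, so (a,b,a,c) and (x,y,x,z) both
    become (0,1,0,2) -- identity-free structural signatures."""
    first_seen, pattern = {}, []
    for node in walk:
        if node not in first_seen:
            first_seen[node] = len(first_seen)
        pattern.append(first_seen[node])
    return tuple(pattern)

assert anonymize(["a", "b", "a", "c"]) == anonymize(["x", "y", "x", "z"]) == (0, 1, 0, 2)
```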
[263] Robust Non-Linear Correlations via Polynomial Regression
Luca Giuliani, Michele Lombardi
Main category: cs.LG
TL;DR: A novel computational approach for HGR correlation coefficient using polynomial kernels that offers improved robustness and determinism for real-world applications.
Details
Motivation: Existing HGR estimation methods suffer from bias-variance trade-offs due to inherent uncomputability, compromising robustness for real-world applications like algorithmic fairness and constrained machine learning.
Method: Proposes a computational approach using user-configurable polynomial kernels to estimate HGR, providing greater robustness and faster computation compared to previous methods.
Result: The method demonstrates significant advantages in robustness and determinism, and experimental analysis shows it yields insightful subgradients suitable as loss regularizers in constrained machine learning.
Conclusion: The polynomial kernel-based approach provides a more reliable and robust method for HGR computation, making it suitable for practical applications in fairness, scientific analysis, and causal discovery.
Abstract: The Hirschfeld-Gebelein-Rényi (HGR) correlation coefficient is an extension of Pearson’s correlation that is not limited to linear correlations, with potential applications in algorithmic fairness, scientific analysis, and causal discovery. Recently, novel algorithms to estimate HGR in a differentiable manner have been proposed to facilitate its use as a loss regularizer in constrained machine learning applications. However, the inherent uncomputability of HGR requires a bias-variance trade-off, which can possibly compromise the robustness of the proposed methods, hence raising technical concerns if applied in real-world scenarios. We introduce a novel computational approach for HGR that relies on user-configurable polynomial kernels, offering greater robustness compared to previous methods and featuring a faster yet almost equally effective restriction. Our approach provides significant advantages in terms of robustness and determinism, making it a more reliable option for real-world applications. Moreover, we present a brief experimental analysis to validate the applicability of our approach within a constrained machine learning framework, showing that its computation yields an insightful subgradient that can serve as a loss regularizer.
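The restriction to polynomial kernels suggests a simple estimator in outline: HGR over degree-d polynomial transforms is the first canonical correlation between the two polynomial feature maps. This sketch follows the definition; the paper's exact computation may differ.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.cross_decomposition import CCA

def hgr_poly(x, y, degree=3):
    """HGR restricted to polynomial transforms: first canonical correlation
    between degree-d polynomial feature maps of x and y."""
    fx = PolynomialFeatures(degree, include_bias=False).fit_transform(x.reshape(-1, 1))
    gy = PolynomialFeatures(degree, include_bias=False).fit_transform(y.reshape(-1, 1))
    u, v = CCA(n_components=1).fit_transform(fx, gy)
    return abs(np.corrcoef(u.ravel(), v.ravel())[0, 1])

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
print(hgr_poly(x, x ** 2))  # near 1: strong non-linear dependence, Pearson ~ 0
```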
[264] MetaLLMix : An XAI Aided LLM-Meta-learning Based Approach for Hyper-parameters Optimization
Mohammed Tiouti, Mohamed Bal-Ghaoui
Main category: cs.LG
TL;DR: MetaLLMiX is a zero-shot hyperparameter optimization framework that combines meta-learning, explainable AI, and LLM reasoning to recommend optimal hyperparameters without additional trials, achieving competitive performance with drastically reduced computational costs.
Details
Motivation: Current AutoML and LLM-based approaches for model/hyperparameter selection rely on expensive trial-and-error methods with limited interpretability and generalizability, requiring extensive expertise and computation.
Method: Leverages historical experiment outcomes with SHAP explanations, uses meta-learning and efficient LLM reasoning for zero-shot hyperparameter optimization, and employs LLM-as-judge evaluation for output control.
Result: Achieved competitive/superior performance to traditional HPO methods on 8 medical imaging datasets, optimal results on 5/8 tasks, 99.6-99.9% response time reduction, and 2.4-15.7x faster training times while maintaining accuracy within 1-5% of best baselines.
Conclusion: MetaLLMiX provides an efficient, interpretable, and cost-effective alternative to traditional hyperparameter optimization methods, enabling optimal model selection without expensive trial runs.
Abstract: Effective model and hyperparameter selection remains a major challenge in deep learning, often requiring extensive expertise and computation. While AutoML and large language models (LLMs) promise automation, current LLM-based approaches rely on trial and error and expensive APIs, which provide limited interpretability and generalizability. We propose MetaLLMiX, a zero-shot hyperparameter optimization framework combining meta-learning, explainable AI, and efficient LLM reasoning. By leveraging historical experiment outcomes with SHAP explanations, MetaLLMiX recommends optimal hyperparameters and pretrained models without additional trials. We further employ an LLM-as-judge evaluation to control output format, accuracy, and completeness. Experiments on eight medical imaging datasets using nine open-source lightweight LLMs show that MetaLLMiX achieves competitive or superior performance to traditional HPO methods while drastically reducing computational cost. Our local deployment outperforms prior API-based approaches, achieving optimal results on 5 of 8 tasks, response time reductions of 99.6-99.9%, and the fastest training times on 6 datasets (2.4-15.7x faster), maintaining accuracy within 1-5% of best-performing baselines.
[265] LLMs Don’t Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations
Harry Mayne, Ryan Othniel Kearns, Yushi Yang, Andrew M. Bean, Eoin Delaney, Chris Russell, Adam Mahdi
Main category: cs.LG
TL;DR: LLMs can generate valid counterfactual explanations but struggle with minimality - either making excessive edits or tiny edits that don’t change predictions, revealing a validity-minimality trade-off that makes self-generated counterfactuals unreliable for explainability.
Details
Motivation: To enable effective human-AI collaboration, language models need to explain their decisions in natural language, particularly through self-generated counterfactual explanations that could provide insight into model decision-making.
Method: The study evaluates whether LLMs can produce valid (achieve intended outcome) and minimal (modify input as little as possible) counterfactual explanations across multiple LLMs, datasets, and evaluation settings.
Result: LLMs typically produce valid counterfactuals but far from minimal ones, offering little insight. When asked for minimal counterfactuals, they make excessively small edits that fail to change predictions, showing a consistent validity-minimality trade-off.
Conclusion: Self-generated counterfactual explanations are ineffective explainability tools that can provide misleading insights, and proposals to deploy LLMs in high-stakes settings must consider the impact of unreliable self-explanations.
Abstract: To collaborate effectively with humans, language models must be able to explain their decisions in natural language. We study a specific type of self-explanation: self-generated counterfactual explanations (SCEs), where a model explains its prediction by modifying the input such that it would have predicted a different outcome. We evaluate whether LLMs can produce SCEs that are valid, achieving the intended outcome, and minimal, modifying the input no more than necessary. When asked to generate counterfactuals, we find that LLMs typically produce SCEs that are valid, but far from minimal, offering little insight into their decision-making behaviour. Worryingly, when asked to generate minimal counterfactuals, LLMs typically make excessively small edits that fail to change predictions. The observed validity-minimality trade-off is consistent across several LLMs, datasets, and evaluation settings. Our findings suggest that SCEs are, at best, an ineffective explainability tool and, at worst, can provide misleading insights into model behaviour. Proposals to deploy LLMs in high-stakes settings must consider the impact of unreliable self-explanations on downstream decision-making. Our code is available at https://github.com/HarryMayne/SCEs.
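The two evaluation axes are easy to operationalize in outline (all names are illustrative, and the paper's metrics are more refined than this token-level distance):

```python
def evaluate_sce(model, original, counterfactual, distance):
    """Score a self-generated counterfactual on the two axes the paper studies:
    validity (did the prediction actually flip?) and minimality (how far did
    the input move?)."""
    valid = model(counterfactual) != model(original)
    return {"valid": valid, "edit_distance": distance(original, counterfactual)}

# e.g. a crude token-level distance for text inputs:
distance = lambda a, b: sum(t1 != t2 for t1, t2 in zip(a.split(), b.split()))
```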
[266] Kriging prior Regression: A Case for Kriging-Based Spatial Features with TabPFN in Soil Mapping
Jonas Schmidinger, Viacheslav Barkov, Sebastian Vogel, Martin Atzmueller, Gerard B M Heuvelink
Main category: cs.LG
TL;DR: Hybrid framework combining machine learning and geostatistics using spatial lag features from kriging, called KpR, significantly improves soil property prediction accuracy by ~30% R2 compared to non-spatial ML methods.
Details
Motivation: To bridge the gap between machine learning (which captures feature relationships) and geostatistics (which leverages spatial structure) for more accurate digital soil mapping in precision agriculture.
Method: Proposed KpR (kriging prior regression), which enriches ML with spatial context through engineering of ‘spatial lag’ features from ordinary kriging, using the TabPFN model on six field-scale datasets with soil properties and remote/proximal sensing features.
Result: KpR with TabPFN demonstrated reliable uncertainty estimates and more accurate predictions than other spatial techniques and non-spatial ML algorithms, improving average R2 by around 30% compared to ML without spatial context.
Conclusion: KpR with TabPFN is a robust and versatile modeling framework for digital soil mapping, particularly effective for small sample sizes in precision agriculture and compensates for weak feature-property relationships when sensing data is limited.
Abstract: Machine learning and geostatistics are two fundamentally different frameworks for predicting and spatially mapping soil properties. Geostatistics leverages the spatial structure of soil properties, while machine learning captures the relationship between available environmental features and soil properties. We propose a hybrid framework that enriches ML with spatial context through engineering of ‘spatial lag’ features from ordinary kriging. We call this approach ‘kriging prior regression’ (KpR), as it follows the inverse logic of regression kriging. To evaluate this approach, we assessed both the point and probabilistic prediction performance of KpR, using the TabPFN model across six field-scale datasets from LimeSoDa. These datasets included soil organic carbon, clay content, and pH, along with features derived from remote sensing and in-situ proximal soil sensing. KpR with TabPFN demonstrated reliable uncertainty estimates and more accurate predictions in comparison to several other spatial techniques (e.g., regression/residual kriging with TabPFN), as well as to established non-spatial machine learning algorithms (e.g., random forest). Most notably, it significantly improved the average R2 by around 30% compared to machine learning algorithms without spatial context. This improvement was due to the strong prediction performance of the TabPFN algorithm itself and the complementary spatial information provided by KpR features. TabPFN is particularly effective for prediction tasks with small sample sizes, common in precision agriculture, whereas KpR can compensate for weak relationships between sensing features and soil properties when proximal soil sensing data are limited. Hence, we conclude that KpR with TabPFN is a very robust and versatile modelling framework for digital soil mapping in precision agriculture.
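A minimal sketch of the KpR recipe on toy data, assuming `pykrige` for ordinary kriging and using scikit-learn's random forest as a lightweight stand-in for TabPFN. One caveat: kriging interpolates its own fitting points exactly, so in practice the training-set lag features should be produced with cross-validation rather than, as below, from the full training fit.

```python
import numpy as np
from pykrige.ok import OrdinaryKriging
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, size=(80, 2))                  # field coordinates
sensing = rng.normal(size=(80, 3))                      # proximal sensing features
soil = 0.05 * xy[:, 0] + sensing[:, 0] + rng.normal(0, 0.3, 80)  # toy target
train, test = np.arange(60), np.arange(60, 80)

def kpr_features(xy_fit, y_fit, xy_query):
    """'Spatial lag' features: the ordinary-kriging prediction and variance
    at each query location, to be appended to the ML feature matrix."""
    ok = OrdinaryKriging(xy_fit[:, 0], xy_fit[:, 1], y_fit,
                         variogram_model="spherical")
    z, ss = ok.execute("points", xy_query[:, 0], xy_query[:, 1])
    return np.column_stack([np.asarray(z), np.asarray(ss)])

# Augment the sensing features with kriging-based spatial context, then fit.
X_tr = np.column_stack([sensing[train],
                        kpr_features(xy[train], soil[train], xy[train])])
X_te = np.column_stack([sensing[test],
                        kpr_features(xy[train], soil[train], xy[test])])
pred = RandomForestRegressor(random_state=0).fit(X_tr, soil[train]).predict(X_te)
```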
[267] Fused Lasso Improves Accuracy of Co-occurrence Network Inference in Grouped Samples
Daniel Agyapong, Briana H. Beatty, Peter G. Kennedy, Toby D. Hocking
Main category: cs.LG
TL;DR: Proposes the fuser algorithm for microbiome network inference, which handles spatial-temporal dynamics and cross-environment prediction better than existing methods.
Details
Motivation: Existing co-occurrence network algorithms only analyze single environments and fail to capture how microbial associations adapt to different ecological conditions.
Method: Developed the Same-All Cross-validation (SAC) framework and the fuser algorithm, which retains environment-specific signals while sharing information across environments.
Result: fuser achieves comparable performance to glmnet in same-environment scenarios and significantly reduces test error in cross-environment predictions.
Conclusion: fuser enables more accurate microbial network inference across diverse environmental conditions by capturing both environment-specific and shared association patterns.
Abstract: Co-occurrence network inference algorithms have significantly advanced our understanding of microbiome communities. However, these algorithms typically analyze microbial associations within samples collected from a single environmental niche, often capturing only static snapshots rather than dynamic microbial processes. Previous studies have commonly grouped samples from different environmental niches together without fully considering how microbial communities adapt their associations when faced with varying ecological conditions. Our study addresses this limitation by explicitly investigating both spatial and temporal dynamics of microbial communities. We analyzed publicly available microbiome abundance data across multiple locations and time points, to evaluate algorithm performance in predicting microbial associations using our proposed Same-All Cross-validation (SAC) framework. SAC evaluates algorithms in two distinct scenarios: training and testing within the same environmental niche (Same), and training and testing on combined data from multiple environmental niches (All). To overcome the limitations of conventional algorithms, we propose fuser, an algorithm that, while not entirely new in machine learning, is novel for microbiome community network inference. It retains subsample-specific signals while simultaneously sharing relevant information across environments during training. Unlike standard approaches that infer a single generalized network from combined data, fuser generates distinct, environment-specific predictive networks. Our results demonstrate that fuser achieves comparable predictive performance to existing algorithms such as glmnet when evaluated within homogeneous environments (Same), and notably reduces test error compared to baseline algorithms in cross-environment (All) scenarios.
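The abstract does not spell out fuser's objective, but for grouped samples across environments $e$ a standard fused-penalty formulation (an illustrative assumption, in the spirit of the title's fused lasso) fits one coefficient vector per environment while an $\ell_1$ fusion term shares information across them:

$$\min_{\{\beta_e\}} \; \sum_{e} \ell\big(y_e, X_e \beta_e\big) \;+\; \gamma \sum_{e} \lVert \beta_e \rVert_1 \;+\; \lambda \sum_{e < e'} \lVert \beta_e - \beta_{e'} \rVert_1 .$$

Each $\beta_e$ then yields a distinct, environment-specific predictive network; $\lambda$ controls how strongly associations are pooled, with $\lambda \to \infty$ collapsing toward a single shared model akin to glmnet on the combined (All) data.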
[268] Composable Score-based Graph Diffusion Model for Multi-Conditional Molecular Generation
Anjie Qiao, Zhen Wang, Chuan Chen, DeFu Lian, Enhong Chen
Main category: cs.LG
TL;DR: CSGD is a novel score-based graph diffusion model that enables precise multi-conditional molecular generation through concrete scores, composable guidance, and probability calibration, achieving state-of-the-art controllability.
Details
Motivation: Existing graph diffusion models struggle with multi-conditional molecular generation due to reliance on joint conditioning or continuous relaxations that compromise fidelity and flexibility.
Method: Extends score matching to discrete graphs via concrete scores, introduces Composable Guidance (CoG) for fine-grained control over condition subsets, and Probability Calibration (PC) to mitigate train-test mismatches.
Result: Achieves 15.3% average improvement in controllability over prior methods while maintaining high validity and distributional fidelity across four molecular datasets.
Conclusion: Score-based modeling provides practical advantages for discrete graph generation and enables flexible, multi-property molecular design with superior controllability.
Abstract: Controllable molecular graph generation is essential for material and drug discovery, where generated molecules must satisfy diverse property constraints. While recent advances in graph diffusion models have improved generation quality, their effectiveness in multi-conditional settings remains limited due to reliance on joint conditioning or continuous relaxations that compromise fidelity. To address these limitations, we propose Composable Score-based Graph Diffusion model (CSGD), the first model that extends score matching to discrete graphs via concrete scores, enabling flexible and principled manipulation of conditional guidance. Building on this foundation, we introduce two score-based techniques: Composable Guidance (CoG), which allows fine-grained control over arbitrary subsets of conditions during sampling, and Probability Calibration (PC), which adjusts estimated transition probabilities to mitigate train-test mismatches. Empirical results on four molecular datasets show that CSGD achieves state-of-the-art performance, with a 15.3% average improvement in controllability over prior methods, while maintaining high validity and distributional fidelity. Our findings highlight the practical advantages of score-based modeling for discrete graph generation and its capacity for flexible, multi-property molecular design.
[269] ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
Bingxin Xu, Zhen Dong, Oussama Elachqar, Yuzhang Shang
Main category: cs.LG
TL;DR: ButterflyQuant introduces learnable butterfly transforms for 2-bit quantization of LLMs, replacing fixed Hadamard rotations with continuous Givens rotation angles to adapt to layer-specific outlier patterns, achieving better performance than previous methods.
Details
Motivation: Large language models require massive memory that limits deployment on consumer hardware. Extreme 2-bit quantization suffers from catastrophic performance loss due to activation outliers, and existing rotation-based methods use fixed transforms that cannot adapt to specific weight distributions across different transformer layers.
Method: Proposes ButterflyQuant, which replaces fixed Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. This enables smooth optimization while guaranteeing orthogonality, with O(n log n) complexity. Includes uniformity regularization on post-transformation activations to promote quantization-friendly distributions.
Result: Achieves 15.4 perplexity on LLaMA-2-7B with 2-bit quantization, compared to 22.1 for QuaRot. Learning requires only 128 calibration samples and converges in minutes on a single GPU.
Conclusion: ButterflyQuant demonstrates that layer-adaptive rotations outperform fixed transforms for extreme quantization, providing better outlier suppression and performance while maintaining computational efficiency and requiring minimal calibration overhead.
Abstract: Large language models require massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, using computational invariance: $\mathbf{y} = \mathbf{Wx} = (\mathbf{WQ}^T)(\mathbf{Qx})$ for orthogonal $\mathbf{Q}$. However, these methods use fixed transforms (Hadamard matrices achieving optimal worst-case coherence $\mu = 1/\sqrt{n}$) that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. We propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard’s discrete $\{+1, -1\}$ entries that are non-differentiable and prohibit gradient-based learning, butterfly transforms’ continuous parameterization enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint ensures theoretical guarantees in outlier suppression while achieving $O(n \log n)$ computational complexity with only $\frac{n \log n}{2}$ learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU, a negligible one-time cost. On LLaMA-2-7B with 2-bit quantization, ButterflyQuant achieves 15.4 perplexity versus 22.1 for QuaRot.
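A minimal PyTorch sketch of a learnable butterfly transform with the parameter count and cost quoted in the abstract ($\frac{n \log n}{2}$ angles, $O(n \log n)$ per application). This is one standard butterfly factorization, assumed for illustration; the paper's exact stage ordering and initialization may differ.

```python
import math
import torch
import torch.nn as nn

class LearnableButterfly(nn.Module):
    """Orthogonal butterfly transform built from learnable Givens rotations.

    log2(n) stages with n/2 angles each give (n log n)/2 parameters in total,
    and each stage costs O(n), so applying the transform is O(n log n).
    theta = 0 initializes to the identity.
    """

    def __init__(self, n: int):
        super().__init__()
        assert n > 1 and (n & (n - 1)) == 0, "n must be a power of two"
        self.n = n
        self.n_stages = int(math.log2(n))
        self.theta = nn.Parameter(torch.zeros(self.n_stages, n // 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shape = x.shape                        # (..., n)
        for s in range(self.n_stages):
            half = 2 ** s                      # stride between paired entries
            blocks = self.n // (2 * half)
            x = x.reshape(-1, blocks, 2, half)
            c = torch.cos(self.theta[s]).reshape(blocks, half)
            t = torch.sin(self.theta[s]).reshape(blocks, half)
            a, b = x[..., 0, :], x[..., 1, :]
            # 2x2 Givens rotation applied to every (a, b) pair in the stage
            x = torch.stack((c * a - t * b, t * a + c * b), dim=-2)
        return x.reshape(shape)
```

Because every stage is a direct sum of $2\times 2$ rotations, the transform is orthogonal by construction, so it can stand in for the fixed $\mathbf{Q}$ in the invariance $\mathbf{y} = (\mathbf{WQ}^T)(\mathbf{Qx})$ while its angles are tuned on the 128 calibration samples.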
[270] AquaCast: Urban Water Dynamics Forecasting with Precipitation-Informed Multi-Input Transformer
Golnoosh Abdollahinejad, Saleh Baghersalimi, Denisa-Andreea Constantinescu, Sergey Shevchik, David Atienza
Main category: cs.LG
TL;DR: AquaCast is a multi-input deep learning model for urban water forecasting that combines endogenous variables (water measurements) with exogenous factors (precipitation data) using an embedding layer, achieving state-of-the-art performance on real and synthetic datasets.
Details
Motivation: To address the challenge of forecasting urban water dynamics by incorporating both internal water measurements and external environmental factors, overcoming limitations of conventional forecasting methods.
Method: Developed a multi-input, multi-output deep learning model with an embedding layer for exogenous inputs, eliminating the need to forecast them. Evaluated on the LausanneCity dataset with 4 sensors and three synthetic datasets (MeteoSwiss, Lorenz Attractors, Random Fields) across 100 nodes.
Result: Achieved state-of-the-art performance using only endogenous variables, with further improvement when including exogenous variables and forecast reports. Consistently outperformed existing baselines across both real and synthetic datasets.
Conclusion: AquaCast provides robust and accurate water forecasting by effectively capturing inter-variable and temporal dependencies while handling both real-world and complex synthetic data scenarios.
Abstract: This work addresses the challenge of forecasting urban water dynamics by developing a multi-input, multi-output deep learning model that incorporates both endogenous variables (e.g., water height or discharge) and exogenous factors (e.g., precipitation history and forecast reports). Unlike conventional forecasting, the proposed model, AquaCast, captures both inter-variable and temporal dependencies across all inputs, while focusing forecast solely on endogenous variables. Exogenous inputs are fused via an embedding layer, eliminating the need to forecast them and enabling the model to attend to their short-term influences more effectively. We evaluate our approach on the LausanneCity dataset, which includes measurements from four urban drainage sensors, and demonstrate state-of-the-art performance when using only endogenous variables. Performance also improves with the inclusion of exogenous variables and forecast reports. To assess generalization and scalability, we additionally test the model on three large-scale synthesized datasets, generated from MeteoSwiss records, the Lorenz Attractors model, and the Random Fields model, each representing a different level of temporal complexity across 100 nodes. The results confirm that our model consistently outperforms existing baselines and maintains a robust and accurate forecast across both real and synthetic datasets.
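A sketch of the exogenous fusion described above, assuming simple linear embeddings (layer types and sizes are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class ExogenousFusion(nn.Module):
    """Fuse exogenous channels (precipitation history, forecast reports) with
    endogenous water measurements via an embedding layer, so the downstream
    forecaster attends to their short-term influence without ever having to
    predict the exogenous series itself."""

    def __init__(self, n_endo: int, n_exo: int, d: int = 64):
        super().__init__()
        self.endo = nn.Linear(n_endo, d)   # embed water measurements
        self.exo = nn.Linear(n_exo, d)     # embed precipitation inputs

    def forward(self, endo: torch.Tensor, exo: torch.Tensor) -> torch.Tensor:
        # endo: (batch, time, n_endo); exo: (batch, time, n_exo)
        return torch.cat([self.endo(endo), self.exo(exo)], dim=-1)
```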
[271] AEGIS: An Agent for Extraction and Geographic Identification in Scholarly Proceedings
Om Vishesh, Harshad Khadilkar, Deepak Akkil
Main category: cs.LG
TL;DR: Automated AI system ‘Agent-E’ identifies regional papers from conferences and uses RPA to complete actions like nomination submissions, achieving 100% recall and 99.4% accuracy on 586 papers.
Details
Motivation: Address the challenge of keeping up with rapidly growing academic literature and reduce time-consuming manual effort required for scholarly discovery workflows.
Method: Developed a fully automated pipeline using specialized AI agent ‘Agent-E’ to identify papers from specific geographic regions in conference proceedings, combined with Robotic Process Automation (RPA) to execute predefined actions.
Result: Validated on 586 papers from five conferences, achieving perfect recall (100%) and near-perfect accuracy (99.4%) in identifying target papers and completing automated actions.
Conclusion: Demonstrates the potential of task-oriented AI agents to not only filter academic information but also actively participate in and accelerate academic community workflows through automated action execution.
Abstract: Keeping pace with the rapid growth of academic literature presents a significant challenge for researchers, funding bodies, and academic societies. To address the time-consuming manual effort required for scholarly discovery, we present a novel, fully automated system that transitions from data discovery to direct action. Our pipeline demonstrates how a specialized AI agent, ‘Agent-E’, can be tasked with identifying papers from specific geographic regions within conference proceedings and then executing a Robotic Process Automation (RPA) to complete a predefined action, such as submitting a nomination form. We validated our system on 586 papers from five different conferences, where it successfully identified every target paper with a recall of 100% and a near-perfect accuracy of 99.4%. This demonstration highlights the potential of task-oriented AI agents to not only filter information but also to actively participate in and accelerate the workflows of the academic community.
[272] CountTRuCoLa: Rule Confidence Learning for Temporal Knowledge Graph Forecasting
Julia Gastinger, Christian Meilicke, Heiner Stuckenschmidt
Main category: cs.LG
TL;DR: A fully explainable method for temporal knowledge graph forecasting using temporal rules that matches or outperforms state-of-the-art models while providing interpretable predictions.
Details
Motivation: Recent work has shown strong baseline performance using recurrent facts, motivating the development of a fully explainable approach that maintains competitive performance.
Method: Learns four simple types of temporal rules with a confidence function that considers both recency and frequency.
Result: Evaluated on nine datasets, the method matches or surpasses the performance of eight state-of-the-art models and two baselines.
Conclusion: The approach successfully provides fully interpretable predictions while maintaining competitive forecasting performance on temporal knowledge graphs.
Abstract: We address the task of temporal knowledge graph (TKG) forecasting by introducing a fully explainable method based on temporal rules. Motivated by recent work proposing a strong baseline using recurrent facts, our approach learns four simple types of rules with a confidence function that considers both recency and frequency. Evaluated on nine datasets, our method matches or surpasses the performance of eight state-of-the-art models and two baselines, while providing fully interpretable predictions.
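The abstract names recency and frequency but not the functional form. Purely as an illustrative assumption (not the paper's formula), a rule confidence at query time $t$ might combine a Laplace-smoothed frequency with an exponential recency decay:

$$c(r, t) \;=\; \frac{n_r}{n_r + \eta} \,\cdot\, \exp\!\big(-\alpha\,(t - t_{\mathrm{last}}(r))\big),$$

where $n_r$ counts past groundings of rule $r$, $t_{\mathrm{last}}(r)$ is its most recent match, and $\eta, \alpha$ are tuned or learned; the paper's actual confidence function should be taken from the text.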
[273] Balancing Utility and Privacy: Dynamically Private SGD with Random Projection
Zhanhong Jiang, Md Zahid Hasan, Nastaran Saadati, Aditya Balu, Chao Liu, Soumik Sarkar
Main category: cs.LG
TL;DR: The D2P2-SGD optimizer combines dynamic differential privacy with automatic gradient clipping and random projection within SGD to improve the privacy-utility tradeoff and learning efficiency.
Details
Motivation: Address privacy leakage concerns in stochastic optimization while overcoming limitations of DPSGD's static noise mechanism and challenges with large model parameter learning.
Method: Combines dynamic differential privacy with automatic gradient clipping and random projection techniques integrated with SGD for dynamic privacy-utility tradeoff adjustment.
Result: Achieves provably sub-linear convergence rates across different objective functions and shows enhanced accuracy while maintaining privacy in diverse datasets.
Conclusion: D2P2-SGD provides better utility at privacy cost through dynamic differential privacy and enables more efficient model learning through random projection.
Abstract: Stochastic optimization is a pivotal enabler in modern machine learning, producing effective models for various tasks. However, several existing works have shown that model parameters and gradient information are susceptible to privacy leakage. Although Differentially Private SGD (DPSGD) addresses privacy concerns, its static noise mechanism impacts the error bounds for model performance. Additionally, with the exponential increase in model parameters, efficient learning of these models using stochastic optimizers has become more challenging. To address these concerns, we introduce the Dynamically Differentially Private Projected SGD (D2P2-SGD) optimizer. In D2P2-SGD, we combine two important ideas: (i) dynamic differential privacy (DDP) with automatic gradient clipping and (ii) random projection with SGD, allowing dynamic adjustment of the tradeoff between utility and privacy of the model. It exhibits provably sub-linear convergence rates across different objective functions, matching the best available rate. The theoretical analysis further suggests that DDP leads to better utility at the cost of privacy, while random projection enables more efficient model learning. Extensive experiments across diverse datasets show that D2P2-SGD remarkably enhances accuracy while maintaining privacy. Our code is available here.
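A NumPy sketch of one update combining the two ingredients named in the abstract; the decaying noise schedule and the Johnson-Lindenstrauss-style projection below are assumptions standing in for the paper's exact DDP mechanism and projection.

```python
import numpy as np

def d2p2_sgd_step(w, grad, lr, t, clip, sigma0, k, rng):
    """One illustrative D2P2-SGD-style update: clip the gradient, add
    Gaussian noise whose scale decays with iteration t (a stand-in for the
    dynamic DP schedule), then sketch the noisy gradient through a random
    projection to a k-dimensional subspace and back."""
    d = w.shape[0]
    g = grad * min(1.0, clip / (np.linalg.norm(grad) + 1e-12))  # automatic clipping
    sigma_t = sigma0 / np.sqrt(t + 1)                           # assumed decay
    g = g + rng.normal(0.0, sigma_t * clip, size=d)             # dynamic DP noise
    P = rng.normal(0.0, 1.0, size=(k, d)) / np.sqrt(k)          # random projection
    g = P.T @ (P @ g)                                           # project down, lift back
    return w - lr * g
```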
[274] PIPES: A Meta-dataset of Machine Learning Pipelines
Cynthia Moreira Maia, Lucas B. V. de Amorim, George D. C. Cavalcanti, Rafael M. O. Cruz
Main category: cs.LG
TL;DR: PIPES addresses limitations in OpenML’s algorithm selection data by providing a diverse collection of 9,408 pipeline experiments across 300 datasets with comprehensive metadata.
Details
Motivation: OpenML's experiments lack diversity in preprocessing pipelines and are imbalanced towards popular techniques, limiting meta-learning research for algorithm selection.
Method: Created PIPES, a collection of experiments applying all combinations of selected techniques across multiple pipeline blocks to ensure diversity and completeness.
Result: Built a comprehensive repository with 9,408 pipelines tested on 300 datasets, including detailed metadata on pipeline blocks, performance metrics, timing, and error messages.
Conclusion: PIPES provides a more diverse and representative dataset for meta-learning research, overcoming OpenML’s limitations and offering expandability for future community contributions.
Abstract: Solutions to the Algorithm Selection Problem (ASP) in machine learning face the challenge of high computational costs associated with evaluating various algorithms’ performances on a given dataset. To mitigate this cost, the meta-learning field can leverage previously executed experiments shared in online repositories such as OpenML. OpenML provides an extensive collection of machine learning experiments. However, an analysis of OpenML’s records reveals limitations. It lacks diversity in pipelines, specifically when exploring data preprocessing steps/blocks, such as scaling or imputation, resulting in limited representation. Its experiments are often focused on a few popular techniques within each pipeline block, leading to an imbalanced sample. To overcome the observed limitations of OpenML, we propose PIPES, a collection of experiments involving multiple pipelines designed to represent all combinations of the selected sets of techniques, aiming at diversity and completeness. PIPES stores the results of experiments performed applying 9,408 pipelines to 300 datasets. It includes detailed information on the pipeline blocks, training and testing times, predictions, performances, and the eventual error messages. This comprehensive collection of results allows researchers to perform analyses across diverse and representative pipelines and datasets. PIPES also offers potential for expansion, as additional data and experiments can be incorporated to support the meta-learning community further. The data, code, supplementary material, and all experiments can be found at https://github.com/cynthiamaia/PIPES.git.
[275] Cough Classification using Few-Shot Learning
Yoga Disha Sendhil Kumar, Manas V Shetty, Sudip Vhaduri
Main category: cs.LG
TL;DR: Few-shot learning using Prototypical Networks achieves competitive accuracy (74.87% multi-class, 70%+ binary) for COVID-19/Flu/healthy cough sound classification with minimal training data.
Details
Motivation: To address the challenge of limited labeled medical data by investigating whether few-shot learning can achieve performance comparable to traditional deep learning approaches for respiratory sound classification.
Method: Leveraged Prototypical Networks with spectrogram representations of cough sounds, using only 15 support examples per class. Compared multi-class vs binary classification models for COVID-19, Flu, and healthy conditions.
Result: Achieved 74.87% accuracy in multi-class classification and over 70% accuracy in binary classification across all class pairs. Flu was most distinguishable, Healthy most challenging. No significant performance difference between binary and multi-class models (p=0.149 t-test, p=0.125 Wilcoxon).
Conclusion: Few-shot learning is viable for medical diagnostics when large labeled datasets are unavailable, with multi-class classification performing comparably to binary approaches for respiratory sound classification.
Abstract: This paper investigates the effectiveness of few-shot learning for respiratory sound classification, focusing on cough-based detection of COVID-19, Flu, and healthy conditions. We leverage Prototypical Networks with spectrogram representations of cough sounds to address the challenge of limited labeled data. Our study evaluates whether few-shot learning can enable models to achieve performance comparable to traditional deep learning approaches while using significantly fewer training samples. Additionally, we compare multi-class and binary classification models to assess whether multi-class models can perform comparably to their binary counterparts. Experimental findings show that few-shot learning models can achieve competitive accuracy. Our model attains 74.87% accuracy in multi-class classification with only 15 support examples per class, while binary classification achieves over 70% accuracy across all class pairs. Class-wise analysis reveals Flu as the most distinguishable class, and Healthy as the most challenging. Statistical tests (paired t-test p = 0.149, Wilcoxon p = 0.125) indicate no significant performance difference between binary and multi-class models, supporting the viability of multi-class classification in this setting. These results highlight the feasibility of applying few-shot learning in medical diagnostics, particularly when large labeled datasets are unavailable.
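The classification core of a Prototypical Network is compact enough to sketch directly. Embeddings here would come from a spectrogram encoder (not shown), and the 15-shot setting matches the paper:

```python
import torch

def prototypical_predict(support_emb, support_y, query_emb, n_classes=3):
    """Nearest-prototype classification (Snell et al., 2017).

    support_emb: (N, D) encoder outputs for the support spectrograms
                 (15 per class in the paper's setting); support_y: (N,) labels;
    query_emb:   (M, D). The three classes here: COVID-19, Flu, Healthy.
    """
    protos = torch.stack([support_emb[support_y == c].mean(dim=0)
                          for c in range(n_classes)])      # (C, D) class means
    dists = torch.cdist(query_emb, protos)                 # (M, C) Euclidean
    return dists.argmin(dim=1)                             # nearest prototype
```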
[276] ProDiGy: Proximity- and Dissimilarity-Based Byzantine-Robust Federated Learning
Sena Ergisi, Luis Maßny, Rawad Bitar
Main category: cs.LG
TL;DR: ProDiGy is a Byzantine-robust federated learning algorithm that uses a dual scoring system based on gradient proximity and dissimilarity to detect adversarial attacks, particularly effective under non-IID data distributions.
Details
Motivation: Federated Learning remains vulnerable to adversarial attacks, especially under data heterogeneity conditions where existing defense mechanisms often fail.
Method: A joint dual scoring system that evaluates client gradients based on both proximity (similarity to honest clients) and dissimilarity (detecting suspicious uniformity that indicates attacks).
Result: ProDiGy outperforms existing defenses in various scenarios, maintaining strong defense capabilities and model accuracy even when clients’ data are non-IID distributed.
Conclusion: The dual perspective approach effectively promotes natural similarity among honest clients while detecting suspicious uniformity as an attack indicator, making it robust against Byzantine attacks in heterogeneous FL environments.
Abstract: Federated Learning (FL) emerged as a widely studied paradigm for distributed learning. Despite its many advantages, FL remains vulnerable to adversarial attacks, especially under data heterogeneity. We propose a new Byzantine-robust FL algorithm called ProDiGy. The key novelty lies in evaluating the client gradients using a joint dual scoring system based on the gradients’ proximity and dissimilarity. We demonstrate through extensive numerical experiments that ProDiGy outperforms existing defenses in various scenarios. In particular, when the clients’ data do not follow an IID distribution, while other defense mechanisms fail, ProDiGy maintains strong defense capabilities and model accuracy. These findings highlight the effectiveness of a dual perspective approach that promotes natural similarity among honest clients while detecting suspicious uniformity as a potential indicator of an attack.
[277] Graph Alignment via Dual-Pass Spectral Encoding and Latent Space Communication
Maysam Behmanesh, Erkan Turan, Maks Ovsjanikov
Main category: cs.LG
TL;DR: Novel graph alignment framework that enhances node distinctiveness and enforces geometric consistency across latent spaces using dual-pass encoder and geometry-aware functional maps.
Details
Motivation: Existing unsupervised graph alignment methods suffer from node distinctiveness degradation due to GNN oversmoothing and latent space misalignment caused by structural noise, feature heterogeneity, and training instability.
Method: Dual-pass encoder combining low-pass and high-pass spectral filters for structure-aware discriminative embeddings, plus geometry-aware functional map module for learning bijective isometric transformations between graph embeddings.
Result: Outperforms existing unsupervised alignment baselines on graph benchmarks with superior robustness to structural inconsistencies and challenging scenarios, and generalizes effectively to vision-language benchmarks.
Conclusion: The proposed framework successfully addresses key limitations in graph alignment by simultaneously enhancing node distinctiveness and ensuring geometric consistency, demonstrating strong performance across diverse domains including graph and vision-language alignment.
Abstract: Graph alignment, the problem of identifying corresponding nodes across multiple graphs, is fundamental to numerous applications. Most existing unsupervised methods embed node features into latent representations to enable cross-graph comparison without ground-truth correspondences. However, these methods suffer from two critical limitations: the degradation of node distinctiveness due to oversmoothing in GNN-based embeddings, and the misalignment of latent spaces across graphs caused by structural noise, feature heterogeneity, and training instability, ultimately leading to unreliable node correspondences. We propose a novel graph alignment framework that simultaneously enhances node distinctiveness and enforces geometric consistency across latent spaces. Our approach introduces a dual-pass encoder that combines low-pass and high-pass spectral filters to generate embeddings that are both structure-aware and highly discriminative. To address latent space misalignment, we incorporate a geometry-aware functional map module that learns bijective and isometric transformations between graph embeddings, ensuring consistent geometric relationships across different representations. Extensive experiments on graph benchmarks demonstrate that our method consistently outperforms existing unsupervised alignment baselines, exhibiting superior robustness to structural inconsistencies and challenging alignment scenarios. Additionally, comprehensive evaluation on vision-language benchmarks using diverse pretrained models shows that our framework effectively generalizes beyond graph domains, enabling unsupervised alignment of vision and language representations.
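A minimal NumPy sketch of the dual-pass idea: a low-pass filter (normalized adjacency, which smooths over neighborhoods) and a high-pass filter (normalized Laplacian, which sharpens differences) applied to node features and concatenated. The paper's learnable spectral filters and its functional-map module are not reproduced here.

```python
import numpy as np

def dual_pass_embed(A, X, hops=2):
    """Dual-pass spectral encoding sketch for an undirected graph.

    A: (n, n) adjacency matrix; X: (n, d) node features.
    Returns (n, 2d) embeddings: low-pass (smoothed) and high-pass
    (sharpened) views of the features, side by side."""
    deg = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    S = D_inv_sqrt @ A @ D_inv_sqrt          # normalized adjacency (low-pass)
    L = np.eye(A.shape[0]) - S               # normalized Laplacian (high-pass)
    low, high = X.copy(), X.copy()
    for _ in range(hops):
        low, high = S @ low, L @ high        # smooth vs. sharpen per hop
    return np.concatenate([low, high], axis=1)
```

Keeping the high-pass branch is what preserves node distinctiveness: repeated low-pass propagation alone is exactly the oversmoothing the paper identifies.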
[278] Conditioning on PDE Parameters to Generalise Deep Learning Emulation of Stochastic and Chaotic Dynamics
Ira J. S. Shokar, Rich R. Kerswell, Peter H. Haynes
Main category: cs.LG
TL;DR: Deep learning emulator for stochastic/chaotic spatio-temporal systems that generalizes across PDE parameter values using pre-training and fine-tuning with local attention mechanisms.
Details
Motivation: To create computationally efficient emulators for complex PDE systems that can generalize across parameter spaces and handle varying domain sizes/resolutions.
Method: Pre-train on single parameter domain, then fine-tune on diverse smaller dataset; uses local attention mechanisms to handle varying domain sizes and resolutions; probabilistic variant for uncertainty quantification.
Result: Successfully demonstrated on chaotic Kuramoto-Sivashinsky equation and stochastically-forced beta-plane turbulence; captures phenomena at interpolated parameter values; provides significant computational speed-ups over conventional integration.
Conclusion: The emulator enables efficient parameter space exploration and rare event statistical study through uncertainty quantification, offering a powerful tool for complex spatio-temporal system analysis.
Abstract: We present a deep learning emulator for stochastic and chaotic spatio-temporal systems, explicitly conditioned on the parameter values of the underlying partial differential equations (PDEs). Our approach involves pre-training the model on a single parameter domain, followed by fine-tuning on a smaller, yet diverse dataset, enabling generalisation across a broad range of parameter values. By incorporating local attention mechanisms, the network is capable of handling varying domain sizes and resolutions. This enables computationally efficient pre-training on smaller domains while requiring only a small additional dataset to learn how to generalise to larger domain sizes. We demonstrate the model’s capabilities on the chaotic Kuramoto-Sivashinsky equation and stochastically-forced beta-plane turbulence, showcasing its ability to capture phenomena at interpolated parameter values. The emulator provides significant computational speed-ups over conventional numerical integration, facilitating efficient exploration of parameter space, while a probabilistic variant of the emulator provides uncertainty quantification, allowing for the statistical study of rare events.
[279] ReBaNO: Reduced Basis Neural Operator Mitigating Generalization Gaps and Achieving Discretization Invariance
Haolan Zheng, Yanlai Chen, Jiequn Han, Yue Yu
Main category: cs.LG
TL;DR: ReBaNO is a novel data-lean operator learning algorithm that combines reduced basis methods with neural networks to solve PDEs with multiple inputs, achieving superior generalization and strict discretization invariance compared to state-of-the-art methods.
Details
Motivation: To develop an efficient operator learning algorithm that can handle groups of PDEs with multiple distinct inputs while maintaining mathematical rigor, computational efficiency, and strong generalization capabilities.
Method: Combines Reduced Basis Method with Generative Pre-Trained Physics-Informed Neural Networks, using a greedy algorithm to build network structure offline adaptively. Employs knowledge distillation via task-specific activation functions for compact architecture with embedded physics.
Result: Significantly outperforms state-of-the-art operator learning algorithms (PCA-Net, DeepONet, FNO, CNO) in eliminating/shrinking generalization gap for both in- and out-of-distribution tests, and is the only method achieving strict discretization invariance.
Conclusion: ReBaNO provides a mathematically rigorous, computationally efficient approach to operator learning that demonstrates superior performance and generalization capabilities compared to existing methods, particularly for PDE problems with multiple distinct inputs.
Abstract: We propose a novel data-lean operator learning algorithm, the Reduced Basis Neural Operator (ReBaNO), to solve a group of PDEs with multiple distinct inputs. Inspired by the Reduced Basis Method and the recently introduced Generative Pre-Trained Physics-Informed Neural Networks, ReBaNO relies on a mathematically rigorous greedy algorithm to build its network structure offline adaptively from the ground up. Knowledge distillation via task-specific activation function allows ReBaNO to have a compact architecture requiring minimal computational cost online while embedding physics. In comparison to state-of-the-art operator learning algorithms such as PCA-Net, DeepONet, FNO, and CNO, numerical results demonstrate that ReBaNO significantly outperforms them in terms of eliminating/shrinking the generalization gap for both in- and out-of-distribution tests and being the only operator learning algorithm achieving strict discretization invariance.
[280] Explaining Concept Drift through the Evolution of Group Counterfactuals
Ignacy Stępka, Jerzy Stefanowski
Main category: cs.LG
TL;DR: A novel method to explain concept drift by tracking temporal evolution of group-based counterfactual explanations (GCEs) to reveal structural changes in model decision boundaries.
Details
Motivation: Machine learning models in dynamic environments suffer from concept drift that degrades performance, but explaining how and why the model's decision-making logic changes remains a significant challenge.
Method: Analyze temporal evolution of group-based counterfactual explanations (GCEs) by tracking shifts in cluster centroids and counterfactual action vectors before and after drift, within a three-layer framework combining data layer (distributional shifts), model layer (prediction disagreement), and explanation layer.
Result: The approach provides an interpretable proxy that reveals structural changes in the model’s decision boundary and underlying rationale, enabling comprehensive diagnosis of drift.
Conclusion: The methodology allows for distinguishing between different root causes of concept drift, such as spatial data shift versus concept re-labeling, providing a holistic view for drift explanation.
Abstract: Machine learning models in dynamic environments often suffer from concept drift, where changes in the data distribution degrade performance. While detecting this drift is a well-studied topic, explaining how and why the model’s decision-making logic changes still remains a significant challenge. In this paper, we introduce a novel methodology to explain concept drift by analyzing the temporal evolution of group-based counterfactual explanations (GCEs). Our approach tracks shifts in the GCEs’ cluster centroids and their associated counterfactual action vectors before and after a drift. These evolving GCEs act as an interpretable proxy, revealing structural changes in the model’s decision boundary and its underlying rationale. We operationalize this analysis within a three-layer framework that synergistically combines insights from the data layer (distributional shifts), the model layer (prediction disagreement), and our proposed explanation layer. We show that such a holistic view allows for a more comprehensive diagnosis of drift, making it possible to distinguish between different root causes, such as a spatial data shift versus a re-labeling of concepts.
[281] Functional Groups are All you Need for Chemically Interpretable Molecular Property Prediction
Roshan Balaji, Joe Bobby, Nirav Pravinbhai Bhatt
Main category: cs.LG
TL;DR: A novel Functional Group Representation (FGR) framework that encodes molecules using chemical functional groups for interpretable molecular property prediction, achieving state-of-the-art performance while maintaining chemical interpretability.
Details
Motivation: Deep learning models for molecular property prediction lack interpretability, hindering adoption by chemists who need to understand the chemical basis for predictions.
Method: Develops the FGR framework using two types of functional groups: FGs curated from established chemical knowledge, and FGs mined from a large molecular corpus via sequential pattern mining (MFG). Encodes molecules into a lower-dimensional latent space with pre-training on unlabeled data and includes 2D structure descriptors.
Result: Achieves state-of-the-art performance on 33 benchmark datasets across physical chemistry, biophysics, quantum mechanics, biological activity, and pharmacokinetics while enabling chemical interpretability.
Conclusion: The FGR framework represents a significant advancement toward high-performing, chemically interpretable deep learning models for molecular discovery, allowing direct linkage between predicted properties and specific functional groups.
Abstract: Molecular property prediction using deep learning (DL) models has accelerated drug and materials discovery, but the resulting DL models often lack interpretability, hindering their adoption by chemists. This work proposes developing molecule representations using the concept of Functional Groups (FG) in chemistry. We introduce the Functional Group Representation (FGR) framework, a novel approach to encoding molecules based on their fundamental chemical substructures. Our method integrates two types of functional groups: those curated from established chemical knowledge (FG), and those mined from a large molecular corpus using sequential pattern mining (MFG). The resulting FGR framework encodes molecules into a lower-dimensional latent space by leveraging pre-training on a large dataset of unlabeled molecules. Furthermore, the proposed framework allows the inclusion of 2D structure-based descriptors of molecules. We demonstrate that the FGR framework achieves state-of-the-art performance on a diverse range of 33 benchmark datasets spanning physical chemistry, biophysics, quantum mechanics, biological activity, and pharmacokinetics while enabling chemical interpretability. Crucially, the model’s representations are intrinsically aligned with established chemical principles, allowing chemists to directly link predicted properties to specific functional groups and facilitating novel insights into structure-property relationships. Our work presents a significant step toward developing high-performing, chemically interpretable DL models for molecular discovery.
[282] Feasibility-Guided Fair Adaptive Offline Reinforcement Learning for Medicaid Care Management
Sanjay Basu, Sadiq Y. Patel, Parth Sheth, Bhairavi Muralidharan, Namrata Elamaran, Aakriti Kinra, Rajaie Batniji
Main category: cs.LG
TL;DR: FG-FARL is an offline RL method that calibrates per-group safety thresholds to reduce harm while equalizing fairness across protected subgroups, achieving comparable performance to baselines with improved fairness.
Details
Motivation: To develop a reinforcement learning approach that addresses both safety and fairness concerns in healthcare decision support systems, particularly for protected subgroups in population health management.
Method: Feasibility-Guided Fair Adaptive Reinforcement Learning (FG-FARL), which uses per-group safety threshold calibration; evaluated against behavior cloning (BC) and HACO baselines using de-identified Medicaid data.
Result: FG-FARL achieves comparable value to baselines while improving fairness metrics (coverage or harm equality across subgroups) with statistical significance demonstrated through bootstrap confidence intervals and p-values.
Conclusion: FG-FARL provides a practical approach for safer and more equitable decision support in healthcare applications, successfully balancing performance with fairness objectives.
Abstract: We introduce Feasibility-Guided Fair Adaptive Reinforcement Learning (FG-FARL), an offline RL procedure that calibrates per-group safety thresholds to reduce harm while equalizing a chosen fairness target (coverage or harm) across protected subgroups. Using de-identified longitudinal trajectories from a Medicaid population health management program, we evaluate FG-FARL against behavior cloning (BC) and HACO (Hybrid Adaptive Conformal Offline RL; a global conformal safety baseline). We report off-policy value estimates with bootstrap 95% confidence intervals and subgroup disparity analyses with p-values. FG-FARL achieves comparable value to baselines while improving fairness metrics, demonstrating a practical path to safer and more equitable decision support.
[283] On the Relationship Between Adversarial Robustness and Decision Region in Deep Neural Networks
Seongjin Park, Haedong Jeong, Tair Djanibekov, Giyoung Jeon, Jinseok Seol, Jaesik Choi
Main category: cs.LG
TL;DR: The paper proposes Populated Region Set (PRS) as a novel geometric concept to analyze and improve adversarial robustness in DNNs, showing that lower PRS ratios correlate with better robustness.
Details
Motivation: Existing DNN evaluation metrics like generalization performance are becoming saturated, and adversarial robustness evaluation lacks geometric analysis of internal model properties.
Method: Introduces Populated Region Set (PRS) concept to represent internal DNN properties, conducts systematic experiments to analyze PRS-robustness relationship, and develops PRS regularizer for robustness improvement without adversarial training.
Result: Empirical evidence shows that lower PRS ratios strongly correlate with better adversarial robustness in DNNs.
Conclusion: PRS provides a practical geometric framework for analyzing and enhancing adversarial robustness, offering an alternative to adversarial training methods.
Abstract: In general, Deep Neural Networks (DNNs) are evaluated by the generalization performance measured on unseen data excluded from the training phase. Along with the development of DNNs, the generalization performance converges to the state-of-the-art, and it becomes difficult to evaluate DNNs solely based on this metric. The robustness against adversarial attacks has been used as an additional metric to evaluate DNNs by measuring their vulnerability. However, few studies have been performed to analyze the adversarial robustness in terms of the geometry in DNNs. In this work, we perform an empirical study to analyze the internal properties of DNNs that affect model robustness under adversarial attacks. In particular, we propose the novel concept of the Populated Region Set (PRS), where training samples are populated more frequently, to represent the internal properties of DNNs in a practical setting. From systematic experiments with the proposed concept, we provide empirical evidence to validate that a low PRS ratio has a strong relationship with the adversarial robustness of DNNs. We also devise a PRS regularizer leveraging the characteristics of PRS to improve the adversarial robustness without adversarial training.
[284] LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning
Yining Huang, Bin Li, Keke Tang, Meilian Chen
Main category: cs.LG
TL;DR: LoRA-PAR: A dual-system LoRA framework that partitions data and parameters by System 1 (intuitive) vs System 2 (analytical) demands, using focused parameters and two-stage fine-tuning (SFT for System 1, RL for System 2) to achieve SOTA performance with fewer parameters.
Details
Motivation: Large generative models benefit from chain-of-thought reasoning but require extensive resources. Existing PEFT methods focus on domain adaptation rather than tailoring to different cognitive demands (quick intuitive responses vs multi-step logical reasoning).
Method: Classify task data via multi-model role-playing and voting, partition parameters by importance scoring, then use two-stage fine-tuning: SFT for System 1 tasks to enhance knowledge/intuition, and RL for System 2 tasks to reinforce logical deliberation.
Result: Extensive experiments show the two-stage fine-tuning strategy reduces active parameter usage while matching or surpassing state-of-the-art PEFT baselines.
Conclusion: The dual-system approach effectively specializes parameters for different cognitive demands, achieving efficient performance with focused parameter usage through task-specific fine-tuning strategies.
Abstract: Large-scale generative models like DeepSeek-R1 and OpenAI-O1 benefit substantially from chain-of-thought (CoT) reasoning, yet pushing their performance typically requires vast data, large model sizes, and full-parameter fine-tuning. While parameter-efficient fine-tuning (PEFT) helps reduce cost, most existing approaches primarily address domain adaptation or layer-wise allocation rather than explicitly tailoring data and parameters to different response demands. Inspired by “Thinking, Fast and Slow,” which characterizes two distinct modes of thought, System 1 (fast, intuitive, often automatic) and System 2 (slower, more deliberative and analytic), we draw an analogy that different “subregions” of an LLM’s parameters might similarly specialize for tasks that demand quick, intuitive responses versus those requiring multi-step logical reasoning. Therefore, we propose LoRA-PAR, a dual-system LoRA framework that partitions both data and parameters by System 1 or System 2 demands, using fewer yet more focused parameters for each task. Specifically, we classify task data via multi-model role-playing and voting, and partition parameters based on importance scoring, then adopt a two-stage fine-tuning strategy: first training System 1 tasks with supervised fine-tuning (SFT) to enhance knowledge and intuition, then refining System 2 tasks with reinforcement learning (RL) to reinforce deeper logical deliberation. Extensive experiments show that the two-stage fine-tuning strategy, SFT and RL, lowers active parameter usage while matching or surpassing SOTA PEFT baselines.
[285] Joint Optimization of Energy Consumption and Completion Time in Federated Learning
Xinyu Zhou, Jun Zhao, Huimei Han, Claude Guet
Main category: cs.LG
TL;DR: Proposed a resource allocation algorithm for Federated Learning that optimizes energy consumption and completion time through bandwidth, power, and CPU frequency allocation.
Details
Motivation: To balance the trade-off between energy consumption and execution latency in Federated Learning systems to accommodate different application demands and scenarios.
Method: Formulated an optimization problem minimizing a weighted sum of energy and time, decomposed it into subproblems, and devised a resource allocation algorithm for bandwidth, transmission power, and CPU frequency.
Result: Numerical results show superior performance at different weight parameters and outperforms state-of-the-art methods.
Conclusion: The proposed algorithm effectively optimizes FL system performance by balancing energy and latency trade-offs through intelligent resource allocation.
Abstract: Federated Learning (FL) is an intriguing distributed machine learning approach due to its privacy-preserving characteristics. To balance the trade-off between energy and execution latency, and thus accommodate different demands and application scenarios, we formulate an optimization problem to minimize a weighted sum of total energy consumption and completion time through two weight parameters. The optimization variables include bandwidth, transmission power and CPU frequency of each device in the FL system, where all devices are linked to a base station and train a global model collaboratively. Through decomposing the non-convex optimization problem into two subproblems, we devise a resource allocation algorithm to determine the bandwidth allocation, transmission power, and CPU frequency for each participating device. We further present the convergence analysis and computational complexity of the proposed algorithm. Numerical results show that our proposed algorithm not only has better performance at different weight parameters (i.e., different demands) but also outperforms the state of the art.
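In symbols (variable names assumed here, since the abstract gives the objective only in words), the problem is to pick each device $k$'s bandwidth $b_k$, transmit power $p_k$, and CPU frequency $f_k$ to minimize a weighted sum:

$$\min_{\{b_k,\, p_k,\, f_k\}} \;\; \omega_1\, E_{\mathrm{total}} + \omega_2\, T_{\mathrm{total}}, \qquad \text{s.t.} \;\; \textstyle\sum_k b_k \le B, \;\; 0 \le p_k \le p_k^{\max}, \;\; 0 \le f_k \le f_k^{\max},$$

with $(\omega_1, \omega_2)$ the two demand-dependent weight parameters; the paper decomposes this non-convex problem into two subproblems before solving.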
[286] Deep Reinforcement Learning for Inventory Networks: Toward Reliable Policy Optimization
Matias Alvo, Daniel Russo, Yash Kanoria, Minuk Lee
Main category: cs.LG
TL;DR: Deep reinforcement learning for inventory management using HDPO for efficient policy optimization and GNNs for encoding supply chain structure, with open-source benchmarks.
Details
Motivation: Inventory management presents unique opportunities for reliable DRL application, but lacks standardized benchmarks and efficient optimization methods.
Method: HDPO (Hindsight Differentiable Policy Optimization) uses pathwise gradients from offline simulations, and Graph Neural Networks encode inventory network topology.
Result: HDPO recovers near-optimal policies, outperforms REINFORCE variants and newsvendor heuristics. GNNs reduce data requirements across diverse inventory problems.
Conclusion: Combining HDPO with GNN architectures enables reliable DRL for inventory control, with open-source benchmarks provided for reproducibility.
Abstract: We argue that inventory management presents unique opportunities for the reliable application of deep reinforcement learning (DRL). To enable this, we emphasize and test two complementary techniques. The first is Hindsight Differentiable Policy Optimization (HDPO), which uses pathwise gradients from offline counterfactual simulations to directly and efficiently optimize policy performance. Unlike standard policy gradient methods that rely on high-variance score-function estimators, HDPO computes gradients by differentiating through the known system dynamics. Via extensive benchmarking, we show that HDPO recovers near-optimal policies in settings with known or bounded optima, is more robust than variants of the REINFORCE algorithm, and significantly outperforms generalized newsvendor heuristics on problems using real time series data. Our second technique aligns neural policy architectures with the topology of the inventory network. We exploit Graph Neural Networks (GNNs) as a natural inductive bias for encoding supply chain structure, demonstrate that they can represent optimal and near-optimal policies in two theoretical settings, and empirically show that they reduce data requirements across six diverse inventory problems. A key obstacle to progress in this area is the lack of standardized benchmark problems. To address this gap, we open-source a suite of benchmark environments, along with our full codebase, to promote transparency and reproducibility. All resources are available at github.com/MatiasAlvo/Neural_inventory_control.
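The pathwise-gradient idea behind HDPO can be shown on a toy single-location inventory model (the dynamics below are an illustrative assumption, not the paper's benchmarks): because the transition is a known, differentiable function of the action, total cost backpropagates directly through the rollout, avoiding high-variance score-function estimators.

```python
import torch

def hdpo_loss(policy, demand, init_inv, hold_cost=1.0, stockout_cost=5.0):
    """Differentiable rollout on a toy inventory model: order, receive,
    face demand, pay holding cost on leftovers and penalty on shortfalls.
    `policy` maps the current inventory level to an order quantity."""
    inv, total = init_inv, 0.0
    for d in demand:                                  # historical demand path
        order = policy(inv.reshape(1, 1)).squeeze()   # differentiable action
        inv = inv + order - d                         # known linear dynamics
        total = total + hold_cost * torch.relu(inv) \
                      + stockout_cost * torch.relu(-inv)
    return total / len(demand)

policy = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.ReLU(),
                             torch.nn.Linear(16, 1), torch.nn.Softplus())
demand = torch.rand(50) * 10
loss = hdpo_loss(policy, demand, torch.tensor(5.0))
loss.backward()   # pathwise gradients, no REINFORCE estimator needed
```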
[287] Generative Data Refinement: Just Ask for Better Data
Minqi Jiang, João G. M. Araújo, Will Ellsworth, Sian Gooding, Edward Grefenstette
Main category: cs.LG
TL;DR: GDR framework uses generative models to transform problematic datasets into refined training data, addressing data scarcity while mitigating privacy and safety risks.
Details
Motivation: Address projected data exhaustion as web data growth can't keep pace with AI training needs, while user-generated content poses privacy and safety risks.
Method: Generative Data Refinement (GDR) uses pretrained generative models to conditionally transform problematic datasets, generating synthetic data that matches original diversity without manual prompting.
Result: Outperforms industry-grade anonymization solutions, enables detoxification of unsafe datasets, and naturally preserves web-scale dataset diversity.
Conclusion: GDR provides a simple yet effective solution to scale training data for frontier models while mitigating privacy and content safety concerns.
Abstract: For a fixed parameter size, the capabilities of large models are primarily determined by the quality and quantity of its training data. Consequently, training datasets now grow faster than the rate at which new data is indexed on the web, leading to projected data exhaustion over the next decade. Much more data exists as user-generated content that is not publicly indexed, but incorporating such data comes with considerable risks, such as leaking private information and other undesirable content. We introduce a framework, Generative Data Refinement (GDR), for using pretrained generative models to transform a dataset with undesirable content into a refined dataset that is more suitable for training. Our experiments show that GDR can outperform industry-grade solutions for dataset anonymization, as well as enable direct detoxification of highly unsafe datasets. Moreover, we show that by generating synthetic data that is conditioned on each example in the real dataset, GDR’s refined outputs naturally match the diversity of web scale datasets, and thereby avoid the often challenging task of generating diverse synthetic data via model prompting. The simplicity and effectiveness of GDR make it a powerful tool for scaling up the total stock of training data for frontier models.
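Operationally, GDR is "just ask for better data": condition a pretrained generator on each real example and request a refined rewrite, so the outputs inherit the diversity of the source corpus. The sketch below uses a hypothetical `generate` callable and an illustrative prompt; the authors' actual prompts, models, and filtering steps are not reproduced here.

```python
from typing import Callable, List

def refine_dataset(examples: List[str],
                   generate: Callable[[str], str]) -> List[str]:
    """GDR-style refinement: one conditioned generation per real example.

    `generate` is a hypothetical prompt -> completion function standing in
    for whatever pretrained generative model is available."""
    template = ("Rewrite the following text so it preserves meaning, style, "
                "and useful content but contains no private information or "
                "unsafe material:\n\n{doc}")
    return [generate(template.format(doc=ex)) for ex in examples]
```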
[288] Geometry and Stability of Supervised Learning Problems
Facundo Mémoli, Brantley Vose, Robert C. Williamson
Main category: cs.LG
TL;DR: The paper introduces the Risk distance metric for supervised learning problems, inspired by optimal transport, to measure problem similarity and stability against modifications like sampling bias, noise, and limited data.
Details
Motivation: To provide a quantitative way to measure how much supervised learning problems change due to issues like sampling bias, noise, limited data, and approximations, enabling stability analysis.
Method: Develops the Risk distance metric based on optimal transport principles, explores the geometry of the resulting problem space with explicit geodesics, and creates two variants: one with predictor weights and one more sensitive to risk landscape contours.
Result: Establishes that classification problems are dense in a larger class of problems and provides mathematical foundations for quantifying problem similarity and stability.
Conclusion: The Risk distance provides a robust framework for analyzing stability and similarity in supervised learning problems, with practical variants for different application needs.
Abstract: We introduce a notion of distance between supervised learning problems, which we call the Risk distance. This distance, inspired by optimal transport, facilitates stability results; one can quantify how seriously issues like sampling bias, noise, limited data, and approximations might change a given problem by bounding how much these modifications can move the problem under the Risk distance. With the distance established, we explore the geometry of the resulting space of supervised learning problems, providing explicit geodesics and proving that the set of classification problems is dense in a larger class of problems. We also provide two variants of the Risk distance: one that incorporates specified weights on a problem’s predictors, and one that is more sensitive to the contours of a problem’s risk landscape.
[289] Merge-of-Thought Distillation
Zhanming Shen, Zeyu Qin, Zenan Huang, Hao Chen, Jiaqi Hu, Yihong Zhuang, Guoshan Lu, Gang Chen, Junbo Zhao
Main category: cs.LG
TL;DR: Merge-of-Thought Distillation (MoT) is a lightweight framework that efficiently distills reasoning capabilities from multiple teachers into compact students by alternating teacher-specific fine-tuning and weight-space merging, achieving superior performance with minimal data.
Details
Motivation: Traditional reasoning distillation assumes a single oracle teacher, but practical scenarios offer multiple candidate teachers with varying strengths. Different students benefit from different teachers, and the best teacher can vary across datasets, creating a need to unify multiple teachers' reasoning abilities without supervision conflicts.
Method: MoT alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants. It uses consensus-filtered reasoning features to transfer knowledge from diverse teachers while mitigating conflicts.
Result: On competition math benchmarks with only ~200 CoT samples, MoT applied to Qwen3-14B surpassed strong models including DEEPSEEK-R1, QWEN3-30B-A3B, QWEN3-32B, and OPENAI-O1. It consistently outperformed single-teacher distillation and naive multi-teacher union, showing robustness to distribution shifts and reducing catastrophic forgetting.
Conclusion: MoT provides a simple, scalable approach to efficiently distill long chain-of-thought capabilities from diverse teachers into compact students, demonstrating broad transfer of reasoning features and improved performance across domains beyond mathematics.
Abstract: Efficient reasoning distillation for long chain-of-thought (CoT) models is increasingly constrained by the assumption of a single oracle teacher, despite practical availability of multiple candidate teachers and growing CoT corpora. We revisit teacher selection and observe that different students have different “best teachers,” and even for the same student the best teacher can vary across datasets. Therefore, to unify multiple teachers’ reasoning abilities into a student while overcoming conflicts among the teachers’ supervision, we propose Merge-of-Thought Distillation (MoT), a lightweight framework that alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants. On competition math benchmarks, using only about 200 high-quality CoT samples, applying MoT to a Qwen3-14B student surpasses strong models including DEEPSEEK-R1, QWEN3-30B-A3B, QWEN3-32B, and OPENAI-O1, demonstrating substantial gains. Moreover, MoT consistently outperforms the best single-teacher distillation and the naive multi-teacher union, raises the performance ceiling while mitigating overfitting, and shows robustness to distribution-shifted and peer-level teachers. MoT also reduces catastrophic forgetting, improves general reasoning beyond mathematics and even cultivates a better teacher, indicating that consensus-filtered reasoning features transfer broadly. These results position MoT as a simple, scalable route to efficiently distilling long CoT capabilities from diverse teachers into compact students.
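A minimal sketch of the weight-space merging loop described above, assuming a uniform parameter average over the student variants (the abstract does not specify the merge operator); `finetune_on_teacher` is a hypothetical stand-in for one teacher-specific SFT branch:

```python
# Sketch of one MoT round: per-teacher SFT branches, then a weight-space merge.
import copy
import torch
import torch.nn as nn

def merge_state_dicts(state_dicts):
    """Uniformly average parameters across student variants (weight-space merge)."""
    merged = copy.deepcopy(state_dicts[0])
    for key in merged:
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        merged[key] = stacked.mean(dim=0).to(merged[key].dtype)
    return merged

def mot_round(student: nn.Module, teacher_corpora, finetune_on_teacher):
    """One MoT round: branch a student copy per teacher, then merge the variants."""
    variants = []
    for corpus in teacher_corpora:
        branch = copy.deepcopy(student)
        finetune_on_teacher(branch, corpus)   # teacher-specific SFT branch
        variants.append(branch.state_dict())
    student.load_state_dict(merge_state_dicts(variants))
    return student
```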
[290] Attribution Regularization for Multimodal Paradigms
Sahiti Yerramilli, Jayant Sravan Tamarapalli, Jonathan Francis, Eric Nyberg
Main category: cs.LG
TL;DR: Proposes a novel regularization term to address unimodal dominance in multimodal learning, particularly in video-audio domains, to improve utilization of all modalities.
Details
Motivation: Multimodal models underperform unimodal ones despite richer information access, with single modalities often dominating decisions, leading to suboptimal performance.
Method: Introduces a regularization technique that encourages multimodal models to effectively utilize information from all modalities when making decisions.
Result: The approach aims to mitigate unimodal dominance issues, though specific experimental results are not provided in the abstract.
Conclusion: The proposed regularization has potential to significantly advance multimodal machine learning and benefit applications in multimedia analysis, HCI, and embodied AI research.
Abstract: Multimodal machine learning has gained significant attention in recent years due to its potential for integrating information from multiple modalities to enhance learning and decision-making processes. However, it is commonly observed that unimodal models outperform multimodal models, despite the latter having access to richer information. Additionally, the influence of a single modality often dominates the decision-making process, resulting in suboptimal performance. This research project aims to address these challenges by proposing a novel regularization term that encourages multimodal models to effectively utilize information from all modalities when making decisions. The focus of this project lies in the video-audio domain, although the proposed regularization technique holds promise for broader applications in embodied AI research, where multiple modalities are involved. By leveraging this regularization term, the proposed approach aims to mitigate the issue of unimodal dominance and improve the performance of multimodal machine learning systems. Through extensive experimentation and evaluation, the effectiveness and generalizability of the proposed technique will be assessed. The findings of this research project have the potential to significantly contribute to the advancement of multimodal machine learning and facilitate its application in various domains, including multimedia analysis, human-computer interaction, and embodied AI research.
[291] AdaWaveNet: Adaptive Wavelet Network for Time Series Analysis
Han Yu, Peikun Guo, Akane Sano
Main category: cs.LG
TL;DR: AdaWaveNet uses adaptive wavelet transformation for multi-scale analysis of non-stationary time series data, outperforming existing methods in forecasting, imputation, and super-resolution tasks.
Details
Motivation: Traditional time series models struggle with non-stationary data due to assumptions of constant statistical properties, leading to bias and errors in analysis.
Method: Proposes Adaptive Wavelet Network (AdaWaveNet) with lifting scheme-based wavelet decomposition and construction for adaptive, learnable wavelet transforms.
Result: Extensive experiments on 10 datasets across 3 tasks show AdaWaveNet outperforms existing methods in forecasting, imputation, and super-resolution.
Conclusion: AdaWaveNet demonstrates enhanced flexibility and robustness for non-stationary time series analysis with potential for various real-world applications.
Abstract: Time series data analysis is a critical component in various domains such as finance, healthcare, and meteorology. Despite the progress in deep learning for time series analysis, there remains a challenge in addressing the non-stationary nature of time series data. Traditional models, which are built on the assumption of constant statistical properties over time, often struggle to capture the temporal dynamics in realistic time series, resulting in bias and error in time series analysis. This paper introduces the Adaptive Wavelet Network (AdaWaveNet), a novel approach that employs Adaptive Wavelet Transformation for multi-scale analysis of non-stationary time series data. AdaWaveNet employs a lifting scheme-based wavelet decomposition and construction mechanism for adaptive and learnable wavelet transforms, which offers enhanced flexibility and robustness in analysis. We conduct extensive experiments on 10 datasets across 3 different tasks, including forecasting, imputation, and a newly established super-resolution task. The evaluations demonstrate the effectiveness of AdaWaveNet over existing methods in all three tasks, which illustrates its potential in various real-world applications.
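For intuition, one learnable lifting step (the building block of lifting-scheme wavelet transforms) could look like the sketch below; the convolutional filter sizes are illustrative assumptions, not AdaWaveNet's actual configuration:

```python
# Sketch of a lifting-scheme wavelet step with learnable predict/update filters.
import torch
import torch.nn as nn

class LiftingStep(nn.Module):
    """Split -> predict -> update: one lifting step with learnable filters."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.predict = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.update = nn.Conv1d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):                     # x: (batch, channels, length), length even
        even, odd = x[..., ::2], x[..., 1::2]
        detail = odd - self.predict(even)     # high-frequency (detail) coefficients
        approx = even + self.update(detail)   # low-frequency (approximation) coefficients
        return approx, detail

approx, detail = LiftingStep(channels=4)(torch.randn(8, 4, 64))
print(approx.shape, detail.shape)            # both torch.Size([8, 4, 32])
```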
[292] Discovering physical laws with parallel symbolic enumeration
Kai Ruan, Yilong Xu, Ze-Feng Gao, Yike Guo, Hao Sun, Ji-Rong Wen, Yang Liu
Main category: cs.LG
TL;DR: Parallel Symbolic Enumeration (PSE) is a new method that significantly improves symbolic regression by achieving higher accuracy and faster computation compared to state-of-the-art algorithms.
Details
Motivation: Symbolic regression faces challenges in searching for parsimonious and generalizable mathematical formulas in infinite search spaces while maintaining accuracy and efficiency, which has been a bottleneck for over a decade.
Method: The authors introduce Parallel Symbolic Enumeration (PSE) to efficiently distill generic mathematical expressions from limited data through parallel processing techniques.
Result: PSE achieves up to 99% improvement in recovery accuracy and reduces runtime by an order of magnitude across over 200 synthetic and experimental problem sets compared to state-of-the-art baseline algorithms.
Conclusion: PSE represents a significant advance in accurate and efficient data-driven discovery of symbolic, interpretable models and improves the scalability of symbolic learning for scientific exploration.
Abstract: Symbolic regression plays a crucial role in modern scientific research thanks to its capability of discovering concise and interpretable mathematical expressions from data. A key challenge lies in the search for parsimonious and generalizable mathematical formulas, in an infinite search space, while still fitting the training data. Existing algorithms have faced a critical bottleneck in accuracy and efficiency for over a decade when handling complex problems, which hinders the application of symbolic regression to scientific exploration across interdisciplinary domains. To this end, we introduce parallel symbolic enumeration (PSE) to efficiently distill generic mathematical expressions from limited data. Experiments show that PSE achieves higher accuracy and faster computation compared to the state-of-the-art baseline algorithms across over 200 synthetic and experimental problem sets (e.g., improving the recovery accuracy by up to 99% and reducing runtime by an order of magnitude). PSE represents an advance in accurate and efficient data-driven discovery of symbolic, interpretable models (e.g., underlying physical laws), and improves the scalability of symbolic learning.
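As a toy illustration of the enumeration idea (not the authors' algorithm), candidate expressions can be generated combinatorially and scored against data in parallel; the tiny `UNARY`/`BINARY` operator sets here are hypothetical:

```python
# Toy parallel enumeration: score every operator combination against the data.
import itertools
import numpy as np
from concurrent.futures import ThreadPoolExecutor

UNARY = {"id": lambda z: z, "sin": np.sin, "sq": np.square}
BINARY = {"+": np.add, "*": np.multiply}

def score(candidate, x, y):
    u1, u2, b = candidate
    pred = BINARY[b](UNARY[u1](x), UNARY[u2](x))
    return float(np.mean((pred - y) ** 2)), candidate

x = np.linspace(-2, 2, 200)
y = np.square(x) + np.sin(x)                 # ground-truth law: x^2 + sin(x)

candidates = list(itertools.product(UNARY, UNARY, BINARY))
with ThreadPoolExecutor() as pool:
    best = min(pool.map(lambda c: score(c, x, y), candidates))
print(best)   # lowest-MSE candidate, e.g. (0.0, ('sin', 'sq', '+'))
```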
[293] Unveiling Multiple Descents in Unsupervised Autoencoders
Kobi Rahimi, Yehonathan Refael, Tom Tirer, Ofir Lindenbaum
Main category: cs.LG
TL;DR: Double descent occurs in nonlinear autoencoders but not linear ones, with triple descent also observed. Over-parameterized models improve reconstruction and downstream task performance.
Details
Motivation: To explore whether the double descent phenomenon exists in unsupervised learning, particularly in autoencoders, challenging traditional bias-variance trade-off assumptions.
Method: Analytical demonstration for linear autoencoders, experimental analysis of nonlinear autoencoders across various data models and architectures, examining effects of noise and bottleneck size.
Result: Double descent absent in linear autoencoders but present in nonlinear ones, with triple descent also observed. Over-parameterization improves reconstruction and downstream task performance.
Conclusion: Double and triple descent phenomena exist in nonlinear unsupervised learning, and over-parameterized models provide practical benefits in real-world applications.
Abstract: The phenomenon of double descent has challenged the traditional bias-variance trade-off in supervised learning but remains unexplored in unsupervised learning, with some studies arguing for its absence. In this study, we first demonstrate analytically that double descent does not occur in linear unsupervised autoencoders (AEs). In contrast, we show for the first time that both double and triple descent can be observed with nonlinear AEs across various data models and architectural designs. We examine the effects of partial sample and feature noise and highlight the importance of bottleneck size in influencing the double descent curve. Through extensive experiments on both synthetic and real datasets, we uncover model-wise, epoch-wise, and sample-wise double descent across several data types and architectures. Our findings indicate that over-parameterized models not only improve reconstruction but also enhance performance in downstream tasks such as anomaly detection and domain adaptation, highlighting their practical value in complex real-world scenarios.
[294] Rethinking Disentanglement under Dependent Factors of Variation
Antonio Almudévar, Alfonso Ortega
Main category: cs.LG
TL;DR: A new information theory-based definition and measurement method for disentangled representations that works with non-independent factors of variation, unlike existing approaches that assume independence.
Details
Motivation: Existing disentanglement definitions and metrics assume factors of variation are independent, which is unrealistic for real-world data where factors are often correlated.
Method: Proposes an information theory-based definition of disentanglement that doesn’t require factor independence, relates it to the Information Bottleneck Method, and develops a measurement method for non-independent factors.
Result: The proposed method correctly measures disentanglement with non-independent factors, while other methods fail in this scenario, as demonstrated through experiments.
Conclusion: This work provides a more realistic and applicable framework for disentangled representation learning that handles the common case of non-independent factors of variation in real data.
Abstract: Representation learning is an approach that allows one to discover and extract the factors of variation from the data. Intuitively, a representation is said to be disentangled if it separates the different factors of variation in a way that is understandable to humans. Definitions of disentanglement and metrics to measure it usually assume that the factors of variation are independent of each other. However, this is generally false in the real world, which limits the use of these definitions and metrics to very specific and unrealistic scenarios. In this paper we give a definition of disentanglement based on information theory that is also valid when the factors of variation are not independent. Furthermore, we relate this definition to the Information Bottleneck Method. Finally, we propose a method to measure the degree of disentanglement from the given definition that works when the factors of variation are not independent. We show through different experiments that the method proposed in this paper correctly measures disentanglement with non-independent factors of variation, while other methods fail in this scenario.
[295] Understanding Large Language Models in Your Pockets: Performance Study on COTS Mobile Devices
Jie Xiao, Qianyi Huang, Xu Chen, Chen Tian
Main category: cs.LG
TL;DR: Comprehensive measurement study of lightweight LLM performance on mobile devices, evaluating user experience metrics and developer-focused factors across different mobile SoCs.
Details
Motivation: Growing privacy concerns are pushing LLM deployment to local devices, but there's limited understanding of how these models perform on commercial mobile hardware in terms of both user experience and system resource utilization.
Method: Conducted comprehensive measurement study on mobile devices, evaluating token throughput, latency, battery consumption, resource utilization, DVFS strategies, and inference engines across different mobile SoCs from major vendors.
Result: Provides detailed analysis of how hardware capabilities and system dynamics affect on-device LLM performance, identifies performance bottlenecks, and offers comparisons across different mobile SoC vendors.
Conclusion: The study provides insights for both on-device LLM development and future mobile system architecture design, helping developers optimize performance and address deployment challenges on mobile platforms.
Abstract: As large language models (LLMs) increasingly integrate into every aspect of our work and daily lives, there are growing concerns about user privacy, which push the trend toward local deployment of these models. There are a number of lightweight LLMs (e.g., Gemini Nano, LLAMA2 7B) that can run locally on smartphones, providing users with greater control over their personal data. As a rapidly emerging application, we are concerned about their performance on commercial-off-the-shelf mobile devices. To fully understand the current landscape of LLM deployment on mobile platforms, we conduct a comprehensive measurement study on mobile devices. We evaluate both metrics that affect user experience, including token throughput, latency, and battery consumption, as well as factors critical to developers, such as resource utilization, DVFS strategies, and inference engines. In addition, we provide a detailed analysis of how these hardware capabilities and system dynamics affect on-device LLM performance, which may help developers identify and address bottlenecks for mobile LLM applications. We also provide comprehensive comparisons across the mobile system-on-chips (SoCs) from major vendors, highlighting their performance differences in handling LLM workloads. We hope that this study can provide insights for both the development of on-device LLMs and the design for future mobile system architecture.
[296] Tensor-Based Foundations of Ordinary Least Squares and Neural Network Regression Models
Roberto Dias Algarte
Main category: cs.LG
TL;DR: Novel mathematical approach using Tensor Analysis for OLS and Neural Network regression, presenting streamlined algorithms including improved Backpropagation.
Details
Motivation: To develop a more rigorous mathematical foundation for regression models using Tensor Analysis instead of traditional ML approaches.
Method: Leveraging Tensor Analysis and fundamental matrix computations to detail theoretical foundations and extend to complete algorithmic forms.
Result: Development of three algorithms including a streamlined version of the Backpropagation Algorithm for Neural Networks.
Conclusion: The new mathematical approach using Tensor Analysis provides benefits for understanding and implementing regression models with improved algorithms.
Abstract: This article introduces a novel approach to the mathematical development of Ordinary Least Squares and Neural Network regression models, diverging from traditional methods in current Machine Learning literature. By leveraging Tensor Analysis and fundamental matrix computations, the theoretical foundations of both models are meticulously detailed and extended to their complete algorithmic forms. The study culminates in the presentation of three algorithms, including a streamlined version of the Backpropagation Algorithm for Neural Networks, illustrating the benefits of this new mathematical approach.
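For reference, the OLS portion reduces to the standard normal-equations solution, which can be written entirely with basic tensor/matrix operations; the snippet below is a generic illustration rather than the article's own derivation:

```python
# Closed-form OLS fit via the normal equations, solved stably with lstsq.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 0.3
y = X @ true_w + true_b + 0.01 * rng.normal(size=100)

Xb = np.hstack([X, np.ones((100, 1))])      # append a bias column
# Normal equations: w = (X^T X)^{-1} X^T y; lstsq avoids forming the inverse.
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(w)                                     # close to [2.0, -1.0, 0.5, 0.3]
```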
[297] Communication Compression for Distributed Learning without Control Variates
Tomas Ortega, Chun-Yin Huang, Xiaoxiao Li, Hamid Jafarkhani
Main category: cs.LG
TL;DR: CAFe is a novel distributed learning framework that enables highly compressible client updates without requiring error feedback or client-specific control variates, addressing privacy and statefulness challenges in federated learning.
Details
Motivation: Existing compression methods in federated learning require biased compression with error feedback, which violates privacy principles and demands stateful clients with client-specific control variates.
Method: Proposes the Compressed Aggregate Feedback (CAFe) framework, which exploits past aggregated updates to enable highly compressible client updates without control variates, using Distributed Gradient Descent as a representative algorithm.
Result: Analytical proof shows CAFe’s superiority over Distributed Compressed Gradient Descent with biased compression in the non-convex regime with bounded gradient dissimilarity. Experimental results confirm CAFe outperforms existing distributed learning compression schemes.
Conclusion: CAFe provides an effective solution for communication compression in distributed learning that maintains privacy, avoids stateful clients, and achieves better performance than existing compression methods.
Abstract: Distributed learning algorithms, such as the ones employed in Federated Learning (FL), require communication compression to reduce the cost of client uploads. The compression methods used in practice are often biased, making error feedback necessary both to achieve convergence under aggressive compression and to provide theoretical convergence guarantees. However, error feedback requires client-specific control variates, creating two key challenges: it violates privacy-preserving principles and demands stateful clients. In this paper, we propose Compressed Aggregate Feedback (CAFe), a novel distributed learning framework that allows highly compressible client updates by exploiting past aggregated updates, and does not require control variates. We consider Distributed Gradient Descent (DGD) as a representative algorithm and analytically prove CAFe’s superiority to Distributed Compressed Gradient Descent (DCGD) with biased compression in the non-convex regime with bounded gradient dissimilarity. Experimental results confirm that CAFe outperforms existing distributed learning compression schemes.
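A rough sketch of the aggregate-feedback idea, assuming a top-k compressor: each client compresses the difference between its local update and the previous aggregate, so no per-client control variates are kept. The decoding step shown is a plausible reading of the abstract, not a verified reproduction of CAFe:

```python
# Sketch: clients upload compress(update - prev_aggregate); server adds it back.
import numpy as np

def topk(v, k):
    """Keep the k largest-magnitude entries (a common biased compressor)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def cafe_round(client_updates, prev_aggregate, k):
    decoded = [prev_aggregate + topk(u - prev_aggregate, k) for u in client_updates]
    return np.mean(decoded, axis=0)          # new aggregate update

rng = np.random.default_rng(1)
prev = rng.normal(size=50)
updates = [prev + 0.1 * rng.normal(size=50) for _ in range(4)]
# Error vs. the uncompressed mean stays small when updates hug the past aggregate.
print(np.linalg.norm(cafe_round(updates, prev, k=5) - np.mean(updates, axis=0)))
```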
[298] Bridging Simplicity and Sophistication using GLinear: A Novel Architecture for Enhanced Time Series Prediction
Syed Tahir Hussain Rizvi, Neel Kanwal, Muddasar Naeem
Main category: cs.LG
TL;DR: GLinear - a novel Gaussian-activated linear model for time series forecasting that outperforms existing linear and transformer models while being more data-efficient.
Details
Motivation: Address the debate about whether complex Transformers are necessary for time series forecasting, and show that simpler linear models can achieve competitive or better performance while requiring less data.
Method: The proposed GLinear architecture uses Gaussian activation to exploit periodic patterns in multivariate time series data, requiring less historical data than other state-of-the-art linear predictors.
Result: GLinear outperforms existing linear architectures (NLinear, DLinear, RLinear) and transformer-based models (Autoformer) on four datasets (ETTh1, Electricity, Traffic, Weather) for multivariate time series forecasting.
Conclusion: GLinear demonstrates that simpler, data-efficient architectures can achieve superior performance in time series forecasting, opening new research directions for computationally efficient time-series analysis.
Abstract: Time Series Forecasting (TSF) is an important application across many fields. There is a debate about whether Transformers, despite being good at understanding long sequences, struggle with preserving temporal relationships in time series data. Recent research suggests that simpler linear models might outperform or at least provide competitive performance compared to complex Transformer-based models for TSF tasks. In this paper, we propose a novel data-efficient architecture, \textit{Gaussian-activated Linear model (GLinear)}, for multivariate TSF that exploits periodic patterns to provide better accuracy. It achieves higher prediction accuracy while requiring less historical data than other state-of-the-art linear predictors. Four different datasets (ETTh1, Electricity, Traffic, and Weather) are used to evaluate the performance of the proposed predictor. A performance comparison with state-of-the-art linear architectures (such as NLinear, DLinear, and RLinear) and transformer-based time series predictors (Autoformer) shows that the GLinear, despite being data efficient, outperforms the existing architectures in most cases of multivariate TSF while being competitive in others. We hope that the proposed GLinear model opens new fronts of research and development of simpler and more sophisticated architectures for data and computationally efficient time-series analysis. The source code is publicly available on GitHub.
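The abstract does not spell out the architecture, so the sketch below is an assumption-laden illustration of a Gaussian-activated linear predictor, taking exp(-x^2) as the activation and a single hidden projection:

```python
# Illustrative Gaussian-activated linear forecaster (architecture assumed).
import torch
import torch.nn as nn

class GLinearSketch(nn.Module):
    def __init__(self, lookback: int, horizon: int):
        super().__init__()
        self.hidden = nn.Linear(lookback, lookback)
        self.head = nn.Linear(lookback, horizon)

    def forward(self, x):                        # x: (batch, channels, lookback)
        h = torch.exp(-self.hidden(x) ** 2)      # Gaussian activation
        return self.head(h)                      # (batch, channels, horizon)

model = GLinearSketch(lookback=96, horizon=24)
print(model(torch.randn(32, 7, 96)).shape)       # torch.Size([32, 7, 24])
```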
[299] Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings
Paul Joe Maliakel, Shashikant Ilager, Ivona Brandic
Main category: cs.LG
TL;DR: This paper analyzes energy-performance trade-offs in LLM inference, identifying that model architecture, input complexity, and GPU clock settings significantly impact efficiency, with strategies to reduce energy consumption by 30% while maintaining performance.
Details
Motivation: LLMs are computationally intensive during inference, raising sustainability concerns. As models scale, optimizing runtime efficiency without compromising performance becomes essential.
Method: Systematic benchmarking of various LLM sizes and architectures (Falcon-7B, Mistral-7B-v0.1, LLaMA models, GPT-Neo-2.7B) across multiple NLP tasks. Analysis of input characteristics (sequence length, entropy, named entity density) and hardware optimizations through Dynamic Voltage and Frequency Scaling (DVFS) on GPU clock settings.
Result: Empirical findings show significant influence of model architecture, input complexity, and clock configuration on inference efficiency. Identified practical strategies that reduce energy consumption by up to 30% while preserving model quality.
Conclusion: Provides actionable insights for designing energy-efficient and sustainable LLM inference systems by correlating input features with energy metrics and evaluating DVFS behavior.
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing (NLP) tasks, leading to widespread adoption in both research and industry. However, their inference workloads are computationally and energy intensive, raising concerns about sustainability and environmental impact. As LLMs continue to scale, it becomes essential to identify and optimize the factors that influence their runtime efficiency without compromising performance. In this work, we systematically investigate the energy-performance trade-offs of LLMs during inference. We benchmark models of varying sizes and architectures, including Falcon-7B, Mistral-7B-v0.1, LLaMA-3.2-1B, LLaMA-3.2-3B, and GPT-Neo-2.7B, across tasks such as question answering, commonsense reasoning, and factual generation. We analyze the effect of input characteristics, such as sequence length, entropy, named entity density and so on. Furthermore, we examine the impact of hardware-level optimizations through Dynamic Voltage and Frequency Scaling (DVFS), measuring how different GPU clock settings affect latency and power consumption. Our empirical findings show that model architecture, input complexity, and clock configuration significantly influence inference efficiency. By correlating input features with energy metrics and evaluating DVFS behavior, we identify practical strategies that reduce energy consumption by up to 30% while preserving model quality. This study provides actionable insights for designing energy-efficient and sustainable LLM inference systems.
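A measurement harness in the spirit of the study might lock GPU clocks and sample power draw via nvidia-smi, as sketched below; this is not the authors' tooling, it requires administrator rights, and flag behavior can vary across driver versions:

```python
# Sketch of a DVFS sweep: lock the GPU clock, run inference, sample power.
import subprocess
import time

def set_gpu_clock(mhz: int):
    # Lock graphics clocks to a fixed frequency (NVIDIA driver feature).
    subprocess.run(["nvidia-smi", "-lgc", f"{mhz},{mhz}"], check=True)

def read_power_watts() -> float:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip().splitlines()[0])

def profile(run_inference, clocks_mhz):
    results = {}
    for mhz in clocks_mhz:
        set_gpu_clock(mhz)
        t0 = time.time()
        run_inference()                          # user-supplied inference workload
        results[mhz] = (time.time() - t0, read_power_watts())
    subprocess.run(["nvidia-smi", "-rgc"])       # restore default clocks
    return results
```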
[300] Near-Optimal Sample Complexity in Reward-Free Kernel-Based Reinforcement Learning
Aya Kayal, Sattar Vakili, Laura Toni, Alberto Bernacchia
Main category: cs.LG
TL;DR: Statistical efficiency analysis of kernel-based reinforcement learning in reward-free framework, establishing sample complexity bounds for near-optimal policy design using kernel ridge regression.
Details
Motivation: Existing work on kernel-based RL statistical efficiency has restrictive assumptions about kernel functions. This paper aims to address the fundamental question of sample requirements for near-optimal policy design using a broader class of kernels.
Method: Uses kernel ridge regression with new confidence intervals specific to the RL setting. Examines both the generative model assumption and a relaxed assumption (which increases sample complexity by a factor of the episode length H). Employs a simpler algorithm than prior work.
Result: Develops theoretical sample complexity bounds for kernel-based RL. Validates findings through simulations. The approach provides new confidence intervals that may have broader applicability beyond RL.
Conclusion: The paper establishes statistical efficiency results for kernel-based reinforcement learning with a broad class of kernels, providing both theoretical guarantees and empirical validation through a simpler algorithmic approach.
Abstract: Reinforcement Learning (RL) problems are being considered under increasingly more complex structures. While tabular and linear models have been thoroughly explored, the analytical study of RL under nonlinear function approximation, especially kernel-based models, has recently gained traction for their strong representational capacity and theoretical tractability. In this context, we examine the question of statistical efficiency in kernel-based RL within the reward-free RL framework, specifically asking: how many samples are required to design a near-optimal policy? Existing work addresses this question under restrictive assumptions about the class of kernel functions. We first explore this question by assuming a generative model, then relax this assumption at the cost of increasing the sample complexity by a factor of H, the length of the episode. We tackle this fundamental problem using a broad class of kernels and a simpler algorithm compared to prior work. Our approach derives new confidence intervals for kernel ridge regression, specific to our RL setting, which may be of broader applicability. We further validate our theoretical findings through simulations.
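The paper's intervals are built on kernel ridge regression; for orientation, the standard KRR posterior mean and width (which the paper's RL-specific intervals refine) can be computed as follows, with an RBF kernel as an illustrative choice:

```python
# Standard kernel ridge regression mean and uncertainty width.
import numpy as np

def rbf(a, b, ls=0.5):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def krr_predict(X, y, Xq, lam=1e-2):
    K = rbf(X, X)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    Kq = rbf(Xq, X)
    mean = Kq @ alpha
    # Width: sqrt(k(x,x) - k_q^T (K + lam I)^{-1} k_q), the usual uncertainty term.
    inv_kq = np.linalg.solve(K + lam * np.eye(len(X)), Kq.T)
    var = np.clip(1.0 - np.sum(Kq * inv_kq.T, axis=1), 0.0, None)
    return mean, np.sqrt(var)

X = np.random.default_rng(0).uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0])
mean, width = krr_predict(X, y, np.linspace(-1, 1, 5)[:, None])
print(mean.round(2), width.round(3))
```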
[301] Revisiting Non-Acyclic GFlowNets in Discrete Environments
Nikita Morozov, Ian Maksimov, Daniil Tiapkin, Sergey Samsonov
Main category: cs.LG
TL;DR: This paper extends GFlowNets to non-acyclic graphs, providing a simpler theoretical framework and new insights about training with fixed backward policies, flow functions, and connections to entropy-regularized RL.
Details
Motivation: GFlowNets traditionally require acyclic graphs, limiting their applicability. The authors aim to relax this assumption and develop a theoretical foundation for non-acyclic GFlowNets in discrete environments.
Method: The authors revisit and simplify the theory for non-acyclic GFlowNets, providing novel theoretical insights about fixed backward policies, flow functions, and connections to entropy-regularized RL. They also conduct experimental validation of their theoretical findings.
Result: The paper presents a simpler theoretical framework that generalizes concepts from acyclic GFlowNets to non-acyclic settings, with experimental validation supporting their theoretical contributions.
Conclusion: The work successfully extends GFlowNet theory to non-acyclic graphs, providing important theoretical insights and experimental validation that broaden the applicability of GFlowNets beyond traditional acyclic constraints.
Abstract: Generative Flow Networks (GFlowNets) are a family of generative models that learn to sample objects from a given probability distribution, potentially known up to a normalizing constant. Instead of working in the object space, GFlowNets proceed by sampling trajectories in an appropriately constructed directed acyclic graph environment, greatly relying on the acyclicity of the graph. In our paper, we revisit the theory that relaxes the acyclicity assumption and present a simpler theoretical framework for non-acyclic GFlowNets in discrete environments. Moreover, we provide various novel theoretical insights related to training with fixed backward policies, the nature of flow functions, and connections between entropy-regularized RL and non-acyclic GFlowNets, which naturally generalize the respective concepts and theoretical results from the acyclic setting. In addition, we experimentally re-examine the concept of loss stability in non-acyclic GFlowNet training, as well as validate our own theoretical findings.
[302] Adaptive kernel predictors from feature-learning infinite limits of neural networks
Clarissa Lauditi, Blake Bordelon, Cengiz Pehlevan
Main category: cs.LG
TL;DR: Neural networks trained in the rich, feature learning infinite-width regime are described by kernel machines with data-dependent kernels, achieving better performance than lazy training regime kernels.
Details
Motivation: Previous work showed that infinite width neural networks in the lazy training regime are described by kernel machines; this work explores whether feature-learning infinite-width regimes also lead to kernel descriptions with adaptive, data-dependent kernels.
Method: Studied two settings: (1) the large-width limit of feature-learning Bayesian networks using saddle point equations; (2) gradient flow training with weight decay, using dynamical mean field theory (DMFT) to derive kernel predictors.
Result: Derived explicit expressions for kernel predictors and numerical calculation methods. The adaptive, data-dependent kernels from feature learning regimes achieve lower test loss on benchmark datasets compared to lazy regime kernels.
Conclusion: Feature learning in infinite-width neural networks leads to task-adapted kernel machines with data-dependent kernels that outperform traditional lazy training regime kernels.
Abstract: Previous influential work showed that infinite width limits of neural networks in the lazy training regime are described by kernel machines. Here, we show that neural networks trained in the rich, feature learning infinite-width regime in two different settings are also described by kernel machines, but with data-dependent kernels. For both cases, we provide explicit expressions for the kernel predictors and prescriptions to numerically calculate them. To derive the first predictor, we study the large-width limit of feature-learning Bayesian networks, showing how feature learning leads to task-relevant adaptation of layer kernels and preactivation densities. The saddle point equations governing this limit result in a min-max optimization problem that defines the kernel predictor. To derive the second predictor, we study gradient flow training of randomly initialized networks trained with weight decay in the infinite-width limit using dynamical mean field theory (DMFT). The fixed point equations of the arising DMFT defines the task-adapted internal representations and the kernel predictor. We compare our kernel predictors to kernels derived from lazy regime and demonstrate that our adaptive kernels achieve lower test loss on benchmark datasets.
[303] Critical Challenges and Guidelines in Evaluating Synthetic Tabular Data: A Systematic Review
Nazia Nafis, Inaki Esnaola, Alvaro Martinez-Perez, Maria-Cruz Villa-Uriol, Venet Osmani
Main category: cs.LG
TL;DR: Systematic review of 1766 papers identifies key challenges in synthetic health data evaluation and provides guidelines for better generation and assessment methods.
Details
Motivation: To address the critical importance of rigorous evaluation of synthetic health data to ensure reliability, relevance, and appropriate use, as evaluation is challenging and lacks consensus.
Method: Conducted a systematic review by screening 1766 papers and performing detailed analysis of 101 papers to identify challenges in synthetic data evaluation.
Result: Identified key challenges including lack of consensus on evaluation methods, improper metric usage, limited domain expert input, inadequate dataset reporting, and limited reproducibility.
Conclusion: Provides guidelines for synthetic data generation and evaluation to help the community unlock the transformative potential of synthetic data and accelerate innovation.
Abstract: Generating synthetic tabular data can be challenging, however evaluation of their quality is just as challenging, if not more. This systematic review sheds light on the critical importance of rigorous evaluation of synthetic health data to ensure reliability, relevance, and their appropriate use. Based on screening of 1766 papers and a detailed review of 101 papers we identified key challenges, including lack of consensus on evaluation methods, improper use of evaluation metrics, limited input from domain experts, inadequate reporting of dataset characteristics, and limited reproducibility of results. In response, we provide several guidelines on the generation and evaluation of synthetic data, to allow the community to unlock and fully harness the transformative potential of synthetic data and accelerate innovation.
[304] MOLLM: Multi-Objective Large Language Model for Molecular Design – Optimizing with Experts
Nian Ran, Yue Wang, Xiaoyuan Zhang, Richard Allmendinger
Main category: cs.LG
TL;DR: MOLLM is a novel multi-objective LLM framework for molecular design that combines domain knowledge with LLM adaptability, achieving superior performance and 14x speedup over SOTA methods while maintaining cost-effectiveness.
Details
Motivation: Molecular design is critical for drug discovery, materials science, and chemical engineering, but existing methods lack the ability to efficiently optimize multiple molecular properties simultaneously.
Method: Combines domain-specific knowledge with large language models using in-context learning and multi-objective optimization techniques to optimize molecular properties across multiple objectives.
Result: Achieves superior performance, consistently surpassing state-of-the-art methods, with 14x faster speed and substantial cost savings without performance compromise. Excels on PMO benchmark.
Conclusion: MOLLM demonstrates effective multi-objective molecular optimization through LLM integration, providing significant efficiency gains and outperforming existing methods across various experiments and benchmarks.
Abstract: Molecular design plays a critical role in advancing fields such as drug discovery, materials science, and chemical engineering. This work introduces the Multi-Objective Large Language Model for Molecular Design (MOLLM), a novel framework that combines domain-specific knowledge with the adaptability of large language models to optimize molecular properties across multiple objectives. Leveraging in-context learning and multi-objective optimization, MOLLM achieves superior performance and innovation, consistently surpassing state-of-the-art (SOTA) methods. We significantly improve the efficiency of our framework, making it 14 times faster and substantially more cost-effective without compromising performance compared to the latest similar work. Our results demonstrate that MOLLM consistently outperforms SOTA models across experiments and excels on the PMO benchmark. In addition, we provide extensive ablation studies and analysis to evaluate the effectiveness of each component and the quality of the output molecules.
[305] A Vector-Quantized Foundation Model for Patient Behavior Monitoring
Rodrigo Oliver, Josué Pérez-Sabater, Leire Paz-Arbaizar, Diego Herrero-Quevedo, Antonio Artés-Rodríguez, Alejandro Lancho, Pablo M. Olmos
Main category: cs.LG
TL;DR: A novel foundation model using modified vector quantized variational autoencoder for processing heterogeneous smartphone/wearable data, achieving good performance on suicide risk assessment and emotional state prediction without fine-tuning.
Details
Motivation: Foundation models have shown success in many domains but remain underutilized in healthcare, particularly for patient behavior monitoring through personal digital devices, which generate complex, heterogeneous data with high missing rates.
Method: A modified vector quantized variational autoencoder (VQ-VAE) that creates discrete latent representations to process real-world data from smartphones and wearables, enabling downstream tasks without fine-tuning.
Result: The model effectively performed suicide risk assessment and emotional state prediction on different held-out clinical cohorts, demonstrating the trade-off between discrete and continuous latent structures.
Conclusion: Hybrid models combining discrete and continuous latent structures may be optimal for balancing accuracy across various supervised and unsupervised tasks in healthcare applications.
Abstract: Foundation models have achieved remarkable success across various domains, yet their adoption in healthcare remains limited. While significant advances have been made in medical imaging, genetic biomarkers, and time series from electronic health records, the potential of foundation models for patient behavior monitoring through personal digital devices remains underexplored. The data generated by these devices are inherently heterogeneous, multisource, and often exhibit high rates of missing data, posing unique challenges. This paper introduces a novel foundation model based on a modified vector quantized variational autoencoder, specifically designed to process real-world data from smartphones and wearable devices. We leveraged the discrete latent representation of this model to effectively perform two downstream tasks, suicide risk assessment and emotional state prediction, on different held-out clinical cohorts without the need of fine-tuning. We also highlight the existence of a trade-off between discrete and continuous latent structures, suggesting that hybrid models may be optimal for balancing accuracy across various supervised and unsupervised tasks.
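For readers unfamiliar with the mechanism, the standard vector-quantization step at the core of a VQ-VAE snaps each encoder output to its nearest codebook entry, with a straight-through estimator so gradients reach the encoder; this sketch shows the generic mechanism, not the paper's specific modification:

```python
# Standard VQ-VAE quantizer: nearest-code lookup plus straight-through gradient.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                              # z: (batch, dim)
        d = torch.cdist(z, self.codebook.weight)       # distances to all codes
        idx = d.argmin(dim=1)                          # nearest code per input
        zq = self.codebook(idx)
        commit = ((z - zq.detach()) ** 2).mean()       # commitment loss term
        codebook_loss = ((z.detach() - zq) ** 2).mean()
        zq = z + (zq - z).detach()                     # straight-through estimator
        return zq, idx, commit + codebook_loss

zq, idx, loss = VectorQuantizer(num_codes=128, dim=16)(torch.randn(4, 16))
print(zq.shape, idx.shape, float(loss) >= 0)
```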
[306] Variance-Aware Noisy Training: Hardening DNNs against Unstable Analog Computations
Xiao Wang, Hendrik Borras, Bernhard Klein, Holger Fröning
Main category: cs.LG
TL;DR: Variance-Aware Noisy Training improves analog computing robustness against dynamic noise, achieving 97.6% accuracy on CIFAR-10 and 99.7% on Tiny ImageNet without training overhead.
Details
Motivation: The computational demands of deep learning exceed hardware capabilities, and analog computing offers energy efficiency but suffers from noise vulnerabilities. Conventional Noisy Training degrades with dynamic noise from temperature variations and temporal drift.
Method: Proposes Variance-Aware Noisy Training that incorporates noise schedules to emulate evolving noise conditions during inference, enhancing robustness without additional training overhead.
Result: Significant improvement in robustness: from 79.3% to 97.6% on CIFAR-10 and from 32.4% to 99.7% on Tiny ImageNet compared to conventional Noisy Training.
Conclusion: The method effectively addresses limitations of conventional Noisy Training in dynamic noise environments, substantially improving model robustness for analog computing applications.
Abstract: The disparity between the computational demands of deep learning and the capabilities of compute hardware is expanding drastically. Although deep learning achieves remarkable performance in countless tasks, its escalating requirements for computational power and energy consumption surpass the sustainable limits of even specialized neural processing units, including the Apple Neural Engine and NVIDIA TensorCores. This challenge is intensified by the slowdown in CMOS scaling. Analog computing presents a promising alternative, offering substantial improvements in energy efficiency by directly manipulating physical quantities such as current, voltage, charge, or photons. However, it is inherently vulnerable to manufacturing variations, nonlinearities, and noise, leading to degraded prediction accuracy. One of the most effective techniques for enhancing robustness, Noisy Training, introduces noise during the training phase to reinforce the model against disturbances encountered during inference. Although highly effective, its performance degrades in real-world environments where noise characteristics fluctuate due to external factors such as temperature variations and temporal drift. This study underscores the necessity of Noisy Training while revealing its fundamental limitations in the presence of dynamic noise. To address these challenges, we propose Variance-Aware Noisy Training, a novel approach that mitigates performance degradation by incorporating noise schedules which emulate the evolving noise conditions encountered during inference. Our method substantially improves model robustness, without training overhead. We demonstrate a significant increase in robustness, from 79.3% with conventional Noisy Training to 97.6% with Variance-Aware Noisy Training on CIFAR-10 and from 32.4% to 99.7% on Tiny ImageNet.
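A minimal sketch of the core idea: inject activation noise during training with a level drawn from a scheduled range, emulating the drifting noise levels analog hardware exhibits at inference. The uniform schedule below is an illustrative assumption, not the paper's exact schedule:

```python
# Noise-injection layer whose level is sampled from a scheduled range per batch.
import torch
import torch.nn as nn

class ScheduledNoise(nn.Module):
    def __init__(self, sigma_min=0.05, sigma_max=0.3):
        super().__init__()
        self.sigma_min, self.sigma_max = sigma_min, sigma_max

    def forward(self, x):
        if not self.training:
            return x                          # no injected noise at evaluation
        sigma = torch.empty(1).uniform_(self.sigma_min, self.sigma_max).item()
        return x + sigma * torch.randn_like(x)

net = nn.Sequential(nn.Linear(32, 64), ScheduledNoise(), nn.ReLU(), nn.Linear(64, 10))
net.train()
print(net(torch.randn(8, 32)).shape)          # torch.Size([8, 10])
```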
[307] Crack Path Prediction with Operator Learning using Discrete Particle System data Generation
Elham Kiyani, Venkatesh Ananchaperumal, Ahmad Peyvan, Mahendaran Uchimali, Gang Li, George Em Karniadakis
Main category: cs.LG
TL;DR: Using DeepONet operator learning models trained on CPD simulation data to predict crack propagation in materials with varying geometries, showing Fusion DeepONet outperforms vanilla version especially in non-fracturing cases.
Details
Motivation: Accurate crack propagation modeling is critical for predicting failure in engineering materials, particularly how cracks interact with discontinuities like holes that affect crack deflection and arrest.
Method: Trained Deep Operator Networks (DeepONets) using Constitutively Informed Particle Dynamics simulation data, exploring vanilla and Fusion DeepONet variants to predict time-evolving crack propagation in specimens with varying geometries and hole configurations.
Result: Fusion DeepONet consistently outperformed vanilla variant with more accurate predictions, especially in non-fracturing cases. Fracture-driven scenarios involving displacement and crack evolution remained more challenging.
Conclusion: Fusion DeepONet shows potential to generalize across complex, geometry-varying, and time-dependent crack propagation phenomena, offering improved predictive capabilities for engineering material failure analysis.
Abstract: Accurately modeling crack propagation is critical for predicting failure in engineering materials and structures, where small cracks can rapidly evolve and cause catastrophic damage. The interaction of cracks with discontinuities, such as holes, significantly affects crack deflection and arrest. Recent developments in discrete particle systems with multibody interactions based on constitutive behavior have demonstrated the ability to capture crack nucleation and evolution without relying on continuum assumptions. In this work, we use data from Constitutively Informed Particle Dynamics (CPD) simulations to train operator learning models, specifically Deep Operator Networks (DeepONets), which learn mappings between function spaces instead of finite-dimensional vectors. We explore two DeepONet variants: vanilla and Fusion DeepONet, for predicting time-evolving crack propagation in specimens with varying geometries. Three representative cases are studied: (i) varying notch height without active fracture; and (ii) and (iii) combinations of notch height and hole radius where dynamic fracture occurs on irregular discrete meshes. The models are trained using geometric inputs in the branch network and spatial-temporal coordinates in the trunk network. Results show that Fusion DeepONet consistently outperforms the vanilla variant, with more accurate predictions especially in non-fracturing cases. Fracture-driven scenarios involving displacement and crack evolution remain more challenging. These findings highlight the potential of Fusion DeepONet to generalize across complex, geometry-varying, and time-dependent crack propagation phenomena.
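For readers unfamiliar with DeepONets, a vanilla variant pairs a branch net (encoding the input function, here geometry parameters) with a trunk net (encoding space-time coordinates) and takes their inner product; the layer sizes below are illustrative:

```python
# Minimal vanilla DeepONet: branch net x trunk net inner product.
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    def __init__(self, geom_dim: int, coord_dim: int, p: int = 64):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(geom_dim, 64), nn.Tanh(), nn.Linear(64, p))
        self.trunk = nn.Sequential(nn.Linear(coord_dim, 64), nn.Tanh(), nn.Linear(64, p))

    def forward(self, geom, coords):
        # geom: (batch, geom_dim); coords: (batch, n_points, coord_dim)
        b = self.branch(geom)                        # (batch, p)
        t = self.trunk(coords)                       # (batch, n_points, p)
        return torch.einsum("bp,bnp->bn", b, t)      # field value at each point

model = DeepONet(geom_dim=2, coord_dim=3)            # e.g. (notch height, hole radius), (x, y, t)
print(model(torch.randn(4, 2), torch.randn(4, 100, 3)).shape)   # torch.Size([4, 100])
```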
[308] Convergence Analysis of Asynchronous Federated Learning with Gradient Compression for Non-Convex Optimization
Diying Yang, Yingwei Hou, Weigang Wu
Main category: cs.LG
TL;DR: The paper analyzes convergence of federated learning with gradient compression and error feedback in asynchronous settings, showing how EF mitigates compression errors and enables matching convergence rates with uncompressed FL despite system challenges.
Details
Motivation: There's a lack of theoretical understanding of how gradient compression and error feedback interact with asynchronous FL challenges (delay, data heterogeneity, flexible participation), requiring systematic convergence analysis.
Method: The authors analyze three FL frameworks: basic asynchronous FL (AsynFL), compressed asynchronous FL (AsynFLC), and compressed FL with error feedback (AsynFLC-EF), deriving convergence conditions and rates for each.
Result: EF effectively reduces gradient estimation variance despite delays, enabling AsynFLC-EF to match AsynFL’s convergence rate. Asynchronous delay and data heterogeneity jointly amplify compression errors, while delay and flexible participation only slow higher-order convergence terms.
Conclusion: Error feedback is crucial for maintaining convergence in compressed asynchronous FL, successfully mitigating the negative interactions between compression, system delays, and statistical heterogeneity, as validated by experimental results.
Abstract: Gradient compression is an effective technique for reducing communication overhead in federated learning (FL), and error feedback (EF) is widely adopted to remedy the compression errors. However, in asynchronous FL settings-which inherently face three major challenges: asynchronous delay, data heterogeneity, and flexible client participation-the complex interactions among these system/statistical constraints and compression/EF mechanisms remain poorly understood theoretically. There is a significant lack of systematic convergence analysis that adequately captures these complex couplings. In this paper, we fill this gap by analyzing the convergence behaviors of FL under different frameworks. We first consider a basic asynchronous FL framework AsynFL, and establish an improved convergence analysis that relies on fewer assumptions and yields a superior convergence rate than prior studies. Then, we consider a variant framework with gradient compression, AsynFLC. We derive sufficient conditions for its convergence, indicating the nonlinear interaction between asynchronous delay and compression rate. Our analysis further demonstrates how asynchronous delay and data heterogeneity jointly amplify compression-induced errors, thereby hindering convergence. Furthermore, we study the convergence of AsynFLC-EF, the framework that further integrates EF. We prove that EF can effectively reduce the variance of gradient estimation despite asynchronous delays, which enables AsynFLC-EF to match the convergence rate of AsynFL. We also show that the impact of asynchronous delay and flexible participation on EF is limited to slowing down the higher-order convergence term. Experimental results substantiate our analytical findings very well.
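The error-feedback mechanism analyzed here follows the classic recursion: the residual a biased compressor drops is stored locally and added to the next message. A minimal single-client sketch with a top-k compressor:

```python
# Classic error-feedback (EF) recursion with a biased top-k compressor.
import numpy as np

def topk(v, k):
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

class EFClient:
    def __init__(self, dim):
        self.err = np.zeros(dim)                # local residual memory

    def compress(self, grad, k):
        msg = topk(grad + self.err, k)          # compress gradient plus carried error
        self.err = grad + self.err - msg        # store what the compressor dropped
        return msg

rng = np.random.default_rng(0)
client = EFClient(dim=20)
for _ in range(3):
    print(np.linalg.norm(client.compress(rng.normal(size=20), k=4)))
```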
[309] Uncertainty Estimation by Human Perception versus Neural Models
Pedro Mendes, Paolo Romano, David Garlan
Main category: cs.LG
TL;DR: Neural networks are poorly calibrated and overconfident, but human perceptual uncertainty can help improve model calibration without sacrificing accuracy.
Details
Motivation: Modern neural networks achieve high accuracy but produce overconfident predictions with poor uncertainty calibration, which is problematic for applications requiring reliable uncertainty estimates.
Method: Used three vision benchmarks annotated with human disagreement and crowdsourced confidence to assess correlation between model-predicted uncertainty and human-perceived uncertainty. Also incorporated human-derived soft labels into training.
Result: Current methods only weakly align with human intuition, with correlations varying across tasks and uncertainty metrics. Human-derived soft labels improved calibration without compromising accuracy.
Conclusion: There’s a persistent gap between model and human uncertainty, but leveraging human insights can help develop more trustworthy AI systems with better calibration.
Abstract: Modern neural networks (NNs) often achieve high predictive accuracy but are poorly calibrated, producing overconfident predictions even when wrong. This miscalibration poses serious challenges in applications where reliable uncertainty estimates are critical. In this work, we investigate how human perceptual uncertainty compares to uncertainty estimated by NNs. Using three vision benchmarks annotated with both human disagreement and crowdsourced confidence, we assess the correlation between model-predicted uncertainty and human-perceived uncertainty. Our results show that current methods only weakly align with human intuition, with correlations varying significantly across tasks and uncertainty metrics. Notably, we find that incorporating human-derived soft labels into the training process can improve calibration without compromising accuracy. These findings reveal a persistent gap between model and human uncertainty and highlight the potential of leveraging human insights to guide the development of more trustworthy AI systems.
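The calibration recipe reported here amounts to training against human label distributions instead of one-hot targets; a minimal sketch (PyTorch's cross_entropy accepts probability targets since v1.10), with hypothetical annotator proportions:

```python
# Cross-entropy against human-derived soft labels rather than one-hot targets.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 3, requires_grad=True)   # model outputs
# Rows: per-example annotator distributions (e.g. 7/10 annotators chose class 0).
soft_targets = torch.tensor([[0.7, 0.2, 0.1],
                             [0.1, 0.8, 0.1],
                             [0.3, 0.3, 0.4],
                             [0.0, 0.1, 0.9]])
loss = F.cross_entropy(logits, soft_targets)     # accepts class probabilities
loss.backward()
print(float(loss))
```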
[310] Temporal Query Network for Efficient Multivariate Time Series Forecasting
Shengsheng Lin, Haojun Chen, Haijie Wu, Chunyun Qiu, Weiwei Lin
Main category: cs.LG
TL;DR: TQNet introduces Temporal Query technique using shifted learnable vectors as queries in attention to capture global inter-variable correlations, achieving SOTA accuracy with high efficiency comparable to linear methods.
Details
Motivation: Effectively modeling correlations among variables (channels) is crucial for accurate multivariate time series forecasting, but existing methods have limitations in capturing these relationships.
Method: Proposes the Temporal Query (TQ) technique, which uses periodically shifted learnable vectors as queries in the attention mechanism to capture global patterns, while keys/values from raw data encode local correlations. Builds TQNet with a single-layer attention mechanism and a lightweight MLP.
Result: Achieves state-of-the-art forecasting accuracy across 12 real-world datasets, learns more robust multivariate correlations, and maintains high efficiency comparable to linear-based methods even on high-dimensional data.
Conclusion: TQNet effectively balances performance and computational cost, demonstrating superior multivariate correlation modeling capability while maintaining efficiency similar to simpler linear approaches.
Abstract: Sufficiently modeling the correlations among variables (aka channels) is crucial for achieving accurate multivariate time series forecasting (MTSF). In this paper, we propose a novel technique called Temporal Query (TQ) to more effectively capture multivariate correlations, thereby improving model performance in MTSF tasks. Technically, the TQ technique employs periodically shifted learnable vectors as queries in the attention mechanism to capture global inter-variable patterns, while the keys and values are derived from the raw input data to encode local, sample-level correlations. Building upon the TQ technique, we develop a simple yet efficient model named Temporal Query Network (TQNet), which employs only a single-layer attention mechanism and a lightweight multi-layer perceptron (MLP). Extensive experiments demonstrate that TQNet learns more robust multivariate correlations, achieving state-of-the-art forecasting accuracy across 12 challenging real-world datasets. Furthermore, TQNet achieves high efficiency comparable to linear-based methods even on high-dimensional datasets, balancing performance and computational cost. The code is available at: https://github.com/ACAT-SCUT/TQNet.
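A simplified reading of the Temporal Query mechanism: a learnable bank of query vectors selected by a periodically shifted time index, with keys and values projected from the raw input. The indexing scheme below is an assumption based on the abstract, not the released TQNet code:

```python
# Attention with learnable, periodically indexed queries (simplified TQ sketch).
import torch
import torch.nn as nn

class TemporalQueryAttention(nn.Module):
    def __init__(self, n_vars: int, d_model: int, period: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(period, d_model))  # learnable TQ bank
        self.key = nn.Linear(n_vars, d_model)
        self.value = nn.Linear(n_vars, d_model)
        self.period = period

    def forward(self, x, t0: int = 0):
        # x: (batch, steps, n_vars); queries picked by a periodically shifted index.
        steps = x.shape[1]
        idx = (t0 + torch.arange(steps)) % self.period
        q = self.queries[idx]                          # (steps, d_model): global patterns
        k, v = self.key(x), self.value(x)              # local, sample-level correlations
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v                                # (batch, steps, d_model)

out = TemporalQueryAttention(n_vars=7, d_model=32, period=24)(torch.randn(8, 96, 7))
print(out.shape)                                       # torch.Size([8, 96, 32])
```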
[311] Towards Robust Influence Functions with Flat Validation Minima
Xichen Ye, Yifan Wu, Weizhong Zhang, Cheng Jin, Yifan Chen
Main category: cs.LG
TL;DR: New Influence Function method for flat validation minima improves influence estimation accuracy in deep neural networks with noisy data.
Details
Motivation: Existing IF methods fail with noisy training data in deep neural networks due to deficiencies in loss change estimation from sharp validation risk, not parameter change estimation.
Method: Established a theoretical connection between influence estimation error, validation risk, and sharpness. Introduced a novel Influence Function estimation form specifically designed for flat validation minima.
Result: Experimental results across various tasks validate the superiority of the proposed approach over existing methods.
Conclusion: Flat validation minima are crucial for accurate influence estimation, and the proposed novel IF estimation form effectively addresses this requirement, outperforming previous methods.
Abstract: The Influence Function (IF) is a widely used technique for assessing the impact of individual training samples on model predictions. However, existing IF methods often fail to provide reliable influence estimates in deep neural networks, particularly when applied to noisy training data. This issue does not stem from inaccuracies in parameter change estimation, which has been the primary focus of prior research, but rather from deficiencies in loss change estimation, specifically due to the sharpness of validation risk. In this work, we establish a theoretical connection between influence estimation error, validation set risk, and its sharpness, underscoring the importance of flat validation minima for accurate influence estimation. Furthermore, we introduce a novel estimation form of Influence Function specifically designed for flat validation minima. Experimental results across various tasks validate the superiority of our approach.
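For reference, such methods start from the classical influence-function estimate $\mathcal{I}(z, z_{\mathrm{val}}) = -\nabla_{\theta}\,\ell(z_{\mathrm{val}}, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1}\, \nabla_{\theta}\,\ell(z, \hat{\theta})$, where $H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla^{2}_{\theta}\,\ell(z_{i}, \hat{\theta})$ is the empirical Hessian of the training loss at the optimum; the paper's contribution is a new estimation form for the resulting validation-loss change that accounts for the sharpness of the validation minimum.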
[312] Closing the Gap between TD Learning and Supervised Learning with $Q$-Conditioned Maximization
Xing Lei, Zifeng Zhuang, Shentao Yang, Sheng Xu, Yunhao Luo, Fei Shen, Wenyan Yang, Xuetao Zhang, Donglin Wang
Main category: cs.LG
TL;DR: GCReinSL enhances supervised learning for offline RL with Q-conditioned policy and maximization to achieve trajectory stitching capability, outperforming prior SL methods.
Details
Motivation: Supervised learning methods for offline RL lack trajectory stitching capability that TD-based methods have, creating a performance gap that needs to be addressed.
Method: Uses Q-conditioned maximization supervised learning with Normalizing Flows for Q-function estimation and Expectile Regression for maximizing Q-values within data support.
Result: Experimental results show the method outperforms prior SL approaches with stitching capabilities and goal data augmentation techniques.
Conclusion: The proposed GCReinSL successfully endows supervised learning methods with stitching capability, closing the performance gap with TD learning approaches.
Abstract: Recently, supervised learning (SL) methodology has emerged as an effective approach for offline reinforcement learning (RL) due to its simplicity, stability, and efficiency. However, recent studies show that SL methods lack the trajectory stitching capability typically associated with temporal difference (TD)-based approaches. A question naturally surfaces: How can we endow SL methods with stitching capability and close their performance gap with TD learning? To answer this question, we introduce $Q$-conditioned maximization supervised learning for offline goal-conditioned RL, which enhances SL with the stitching capability through $Q$-conditioned policy and $Q$-conditioned maximization. Concretely, we propose Goal-Conditioned Reinforced Supervised Learning (GCReinSL), which consists of (1) estimating the $Q$-function by Normalizing Flows from the offline dataset and (2) finding the maximum $Q$-value within the data support by integrating $Q$-function maximization with Expectile Regression. At inference time, our policy chooses optimal actions based on such a maximum $Q$-value. Experimental results from stitching evaluations on offline RL datasets demonstrate that our method outperforms prior SL approaches with stitching capabilities and goal data augmentation techniques.
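One ingredient is standard enough to write out: IQL-style expectile regression (a plausible instantiation; the abstract does not give GCReinSL's exact objective) fits a value estimate $\hat{V}$ with the asymmetric loss

$$
L_\tau(u) = \left|\tau - \mathbb{1}(u < 0)\right| u^{2}, \qquad u = Q(s, a, g) - \hat{V}(s, g),
$$

so that as the expectile level $\tau \to 1$, $\hat{V}$ approaches the maximum of $Q$ over actions represented in the data, i.e., a maximum within the data support rather than over all conceivable actions.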
[313] Development and Comparative Evaluation of Three Artificial Intelligence Models (NLP, LLM, JEPA) for Predicting Triage in Emergency Departments: A 7-Month Retrospective Proof-of-Concept
Edouard Lansiaux, Ramy Azzouz, Emmanuel Chazard, Amélie Vromant, Eric Wiel
Main category: cs.LG
TL;DR: LLM-based AI model URGENTIAPARSE outperformed other AI models and nurse triage in emergency department triage accuracy, showing potential to improve patient safety and operational efficiency.
Details
Motivation: Emergency departments face persistent triage errors (undertriage and overtriage) exacerbated by increasing patient volumes and staff shortages, requiring better triage solutions.
Method: Evaluated three AI models (TRIAGEMASTER-NLP, URGENTIAPARSE-LLM, EMERGINET-JEPA) against FRENCH triage scale and nurse practice using 7 months of adult triage data from Roger Salengro Hospital.
Result: URGENTIAPARSE (LLM-based) achieved highest accuracy (F1-score 0.900, AUC-ROC 0.879) and superior performance in predicting hospitalization needs, demonstrating robustness across structured data and raw transcripts.
Conclusion: LLM-based AI integration could significantly enhance emergency department workflows, but successful adoption requires addressing limitations and ensuring ethical transparency.
Abstract: Emergency departments struggle with persistent triage errors, especially undertriage and overtriage, which are aggravated by growing patient volumes and staff shortages. This study evaluated three AI models [TRIAGEMASTER (NLP), URGENTIAPARSE (LLM), and EMERGINET (JEPA)] against the FRENCH triage scale and nurse practice, using seven months of adult triage data from Roger Salengro Hospital in Lille, France. Among the models, the LLM-based URGENTIAPARSE consistently outperformed both AI alternatives and nurse triage, achieving the highest accuracy (F1-score 0.900, AUC-ROC 0.879) and superior performance in predicting hospitalization needs (GEMSA). Its robustness across structured data and raw transcripts highlighted the advantage of LLM architectures in abstracting patient information. Overall, the findings suggest that integrating LLM-based AI into emergency department workflows could significantly enhance patient safety and operational efficiency, though successful adoption will depend on addressing limitations and ensuring ethical transparency.
[314] To Theoretically Understand Transformer-Based In-Context Learning for Optimizing CSMA
Shugang Hao, Hongbo Li, Lingjie Duan
Main category: cs.LG
TL;DR: LLM transformer-based in-context learning approach for optimizing WiFi channel access, overcoming limitations of traditional backoff schemes and model-based methods by predicting optimal contention window thresholds.
Details
Motivation: Traditional binary exponential backoff in WiFi 7 performs poorly in dynamic environments, and existing model-based approaches suffer from throughput loss due to inaccurate node density estimation.
Method: Transformer-based ICL optimizer that collects collision-threshold data examples and query collision cases as prompts, then generates predicted contention window thresholds with guaranteed near-optimal performance even with erroneous data.
Result: Experimental results show fast convergence and near-optimal throughput outperforming existing model-based and DRL-based approaches under unknown node densities.
Conclusion: The proposed LLM transformer-based ICL approach effectively optimizes channel access with minimal prediction deviation and maintains high throughput performance in dynamic WiFi environments.
Abstract: The binary exponential backoff scheme is widely used in WiFi 7 and still incurs poor throughput performance under dynamic channel environments. Recent model-based approaches (e.g., non-persistent and $p$-persistent CSMA) simply optimize backoff strategies under a known and fixed node density, still leading to a large throughput loss due to inaccurate node density estimation. This paper is the first to propose LLM transformer-based in-context learning (ICL) theory for optimizing channel access. We design a transformer-based ICL optimizer that pre-collects collision-threshold data examples and a query collision case; these are assembled into a prompt from which the transformer learns the pattern and then generates a predicted contention window threshold (CWT). To train the transformer for effective ICL, we develop an efficient algorithm and guarantee a near-optimal CWT prediction within limited training steps. As it may be hard to gather perfect data examples for ICL in practice, we further extend the method to allow erroneous data input in the prompt. We prove that our optimizer maintains minimal prediction and throughput deviations from the optimal values. Experimental results on NS-3 further demonstrate our approach’s fast convergence and near-optimal throughput over existing model-based and DRL-based approaches under unknown node densities.
[315] Group Expectation Policy Optimization for Heterogeneous Reinforcement Learning
Han Zhang, Ruibin Zheng, Zexuan Yi, Zhuo Zhang, Hanyang Peng, Hui Wang, Zike Yuan, Cai Ke, Shiwei Chen, Jiacheng Yang, Yangning Li, Xiang Li, Jiangyue Yan, Yaoqi Liu, Liwen Jing, Jiayin Qi, Ruifeng Xu, Binxing Fang, Yue Yu
Main category: cs.LG
TL;DR: HeteroRL is an asynchronous RL architecture that decouples rollout sampling from parameter learning to enable robust decentralized training in heterogeneous networks with network delays. It addresses latency-induced KL divergence issues with Group Expectation Policy Optimization (GEPO) for variance reduction.
Details
Motivation: As single-center computing faces power constraints, decentralized training becomes essential. RL post-training for LLMs faces challenges in heterogeneous distributed environments due to tightly-coupled sampling-learning alternation and network delays.
Method: Propose HeteroRL architecture that decouples rollout sampling from parameter learning. Introduce Group Expectation Policy Optimization (GEPO) which reduces importance weight variance through a refined sampling mechanism to address latency-induced KL divergence issues.
Result: GEPO achieves exponential variance reduction theoretically. Experiments show superior stability over methods like GRPO, with less than 3% performance degradation under 1800-second delays.
Conclusion: HeteroRL demonstrates strong potential for decentralized RL in heterogeneous networks, maintaining performance stability even under significant network delays.
Abstract: As single-center computing approaches power constraints, decentralized training is becoming essential. Reinforcement Learning (RL) post-training enhances Large Language Models (LLMs) but faces challenges in heterogeneous distributed environments due to its tightly-coupled sampling-learning alternation. We propose HeteroRL, an asynchronous RL architecture that decouples rollout sampling from parameter learning, enabling robust deployment across geographically distributed nodes under network delays. We identify that latency-induced KL divergence causes importance sampling failure due to high variance. To address this, we propose Group Expectation Policy Optimization (GEPO), which reduces importance weight variance through a refined sampling mechanism. Theoretically, GEPO achieves exponential variance reduction. Experiments show it maintains superior stability over methods like GRPO, with less than 3% performance degradation under 1800-second delays, demonstrating strong potential for decentralized RL in heterogeneous networks.
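The abstract does not spell out GEPO's estimator, but the failure mode it targets is easy to illustrate: under staleness, raw importance ratios $\pi_{\text{new}}/\pi_{\text{old}}$ become heavy-tailed. Below is a hypothetical sketch of a group-expectation-style smoothing; the mean-within-group rule is an assumption, not the paper's exact mechanism.

```python
import torch

def importance_weights(logp_new, logp_old):
    """Per-sample ratios pi_new/pi_old; their variance grows as the
    sampler's policy (logp_old) becomes stale under network delay."""
    return torch.exp(logp_new - logp_old)

def group_expectation_weights(logp_new, logp_old, group_ids):
    """Hypothetical variance reduction: replace each raw ratio by its
    expectation over the group of responses to the same prompt,
    shrinking extreme per-sample weights."""
    w = importance_weights(logp_new, logp_old)
    out = torch.empty_like(w)
    for g in group_ids.unique():
        mask = group_ids == g
        out[mask] = w[mask].mean()  # group expectation replaces raw weight
    return out
```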
[316] Data-Augmented Few-Shot Neural Stencil Emulation for System Identification of Computer Models
Sanket Jantre, Deepak Akhare, Xiaoning Qian, Nathan M. Urban
Main category: cs.LG
TL;DR: Proposes space-filling sampling of local stencil states as a data-augmentation strategy to train neural PDEs more efficiently than using full trajectory data.
Details
Motivation: Neural PDEs are easier to work with than traditional numerical solvers but typically require extensive trajectory data from long time integration, which contains redundant information.
Method: Space-filling sampling of local “stencil” states to generate synthetic training data, removing spatiotemporal redundancy and oversampling rarely visited states to improve generalization.
Result: Accurate neural PDE stencil operators can be learned from synthetic data equivalent to just 10 timesteps of simulation, with further improvement when a single full-trajectory simulation is available.
Conclusion: The proposed data-augmentation strategy yields better trained neural stencil operators across several PDE systems, outperforming naive sampling from simulation trajectories.
Abstract: Partial differential equations (PDEs) underpin the modeling of many natural and engineered systems. It can be convenient to express such models as neural PDEs rather than using traditional numerical PDE solvers by replacing part or all of the PDE’s governing equations with a neural network representation. Neural PDEs are often easier to differentiate, linearize, reduce, or use for uncertainty quantification than the original numerical solver. They are usually trained on solution trajectories obtained by long time integration of the PDE solver. Here we propose a more sample-efficient data-augmentation strategy for generating neural PDE training data from a computer model by space-filling sampling of local “stencil” states. This approach removes a large degree of spatiotemporal redundancy present in trajectory data and oversamples states that may be rarely visited but help the neural PDE generalize across the state space. We demonstrate that accurate neural PDE stencil operators can be learned from synthetic training data generated by the computational equivalent of 10 timesteps’ worth of numerical simulation. Accuracy is further improved if we assume access to a single full-trajectory simulation from the computer model, which is typically available in practice. Across several PDE systems, we show that our data-augmented synthetic stencil data yield better trained neural stencil operators, with clear performance gains compared with naively sampled stencil data from simulation trajectories.
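The sampling idea can be sketched with standard tools. Here a Latin hypercube design stands in for “space-filling sampling” (the paper does not commit to a particular design), and `step_fn` is a hypothetical stand-in for one local update of the computer model.

```python
import numpy as np
from scipy.stats import qmc

def sample_stencil_states(n_samples, stencil_size, low, high, seed=0):
    """Space-filling (Latin hypercube) sampling of local stencil states,
    instead of harvesting stencils from long solution trajectories."""
    sampler = qmc.LatinHypercube(d=stencil_size, seed=seed)
    unit = sampler.random(n_samples)        # points in [0, 1]^d
    return qmc.scale(unit, low, high)       # rescale to the state range

# Hypothetical usage: label each synthetic stencil with one local solver
# step, then fit a neural operator to the (X, y) pairs.
# X = sample_stencil_states(50_000, stencil_size=5, low=-1.0, high=1.0)
# y = np.array([step_fn(x) for x in X])
```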
[317] DivMerge: A divergence-based model merging method for multi-tasking
Brahim Touayouch, Loïc Fosse, Géraldine Damnati, Gwénolé Lecorvé
Main category: cs.LG
TL;DR: A robust model merging method that uses Jensen-Shannon divergence to combine multiple task-specific models into one while maintaining performance across all tasks, without needing additional labeled data.
Details
Motivation: Address task interference in multi-task learning when merging multiple fine-tuned models, which worsens as the number of tasks increases.
Method: Leverages Jensen-Shannon divergence to guide the merging process without requiring additional labeled data, and automatically balances task importance.
Result: The approach remains robust as the number of tasks grows and consistently outperforms prior work in model merging.
Conclusion: Proposes an effective solution for multi-task model merging that handles task interference and scales well with increasing task numbers.
Abstract: Multi-task learning (MTL) is often achieved by merging datasets before fine-tuning, but the growing availability of fine-tuned models has led to new approaches such as model merging via task arithmetic. A major challenge in this setting is task interference, which worsens as the number of tasks increases. We propose a method that merges models trained on different tasks into a single model, maintaining strong performance across all tasks. Our approach leverages Jensen-Shannon divergence to guide the merging process without requiring additional labelled data, and automatically balances task importance. Unlike existing methods, our approach remains robust as the number of tasks grows and consistently outperforms prior work.
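The merging objective is left at a high level in the abstract, but the divergence itself is standard. A sketch of the Jensen-Shannon divergence between two models' output distributions, the kind of unlabeled-data signal that can score how far a merged model drifts from each task-specific model:

```python
import torch
import torch.nn.functional as F

def js_divergence(logits_p: torch.Tensor, logits_q: torch.Tensor) -> torch.Tensor:
    """Per-example Jensen-Shannon divergence between the output
    distributions of two models (e.g., a merged model and one of the
    task-specific models it was built from)."""
    p = F.softmax(logits_p, dim=-1)
    q = F.softmax(logits_q, dim=-1)
    m = 0.5 * (p + q)
    def kl(a, b):
        return (a * (a.clamp_min(1e-12).log() - b.clamp_min(1e-12).log())).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```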
[318] Learning functions through Diffusion Maps
Alvaro Almeida Gomez
Main category: cs.LG
TL;DR: A data-driven method for approximating functions on smooth manifolds using Diffusion Maps, with dimensionality reduction via SVD and online updating for scalability.
Details
Motivation: To develop an efficient function approximation method on manifolds that handles high-dimensional data and allows incremental learning with new data points.
Method: Builds on Diffusion Maps framework, uses SVD for dimensionality reduction of distance matrices, and implements online updating mechanism for new data incorporation.
Result: Outperforms classical feedforward neural networks and interpolation methods in accuracy and efficiency, demonstrated through numerical experiments including sparse CT reconstruction.
Conclusion: The proposed methodology provides an effective and scalable approach for function approximation on manifolds with superior performance compared to traditional methods.
Abstract: We propose a data-driven method for approximating real-valued functions on smooth manifolds, building on the Diffusion Maps framework under the manifold hypothesis. Given pointwise evaluations of a function, the method constructs a smooth extension to the ambient space by exploiting diffusion geometry and its connection to the heat equation and the Laplace-Beltrami operator. To address the computational challenges of high-dimensional data, we introduce a dimensionality reduction strategy based on the low-rank structure of the distance matrix, revealed via singular value decomposition (SVD). In addition, we develop an online updating mechanism that enables efficient incorporation of new data, thereby improving scalability and reducing computational cost. Numerical experiments, including applications to sparse CT reconstruction, demonstrate that the proposed methodology outperforms classical feedforward neural networks and interpolation methods in terms of both accuracy and efficiency.
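For orientation, here is the textbook diffusion-map construction that the method builds on; the paper's contributions (SVD-based reduction of the distance matrix and online updates) sit on top of this.

```python
import numpy as np

def diffusion_map(X: np.ndarray, eps: float, n_components: int = 10) -> np.ndarray:
    """Textbook diffusion-map embedding: Gaussian (heat) kernel ->
    Markov normalization -> spectral decomposition. Not the paper's
    full pipeline, which adds SVD-based reduction and online updates."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    K = np.exp(-D2 / eps)                                # heat kernel
    P = K / K.sum(axis=1, keepdims=True)                 # row-stochastic
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    idx = order[1:n_components + 1]   # skip the trivial constant eigenvector
    return vecs.real[:, idx] * vals.real[idx]            # diffusion coordinates
```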
[319] Beyond the Pre-Service Horizon: Infusing In-Service Behavior for Improved Financial Risk Forecasting
Senhao Liu, Zhiyu Guo, Zhiyuan Ji, Yueguo Chen, Yateng Tang, Yunhai Wang, Xuehao Zheng, Xiang Ao
Main category: cs.LG
TL;DR: MGKD framework uses knowledge distillation to improve pre-service risk prediction by transferring insights from in-service user behavior data through multi-granularity distillation strategies.
Details
Motivation: Traditional financial risk management separates pre-service risk assessment and in-service default detection, missing opportunities to leverage in-service behavioral data for better pre-service predictions.
Method: Multi-Granularity Knowledge Distillation (MGKD) with teacher-student architecture, using soft labels from in-service data, including coarse-grained, fine-grained, and self-distillation strategies, plus re-weighting for class imbalance.
Result: Experimental results on Tencent Mobile Payment datasets show effectiveness in both offline and online scenarios, improving pre-service risk assessment performance.
Conclusion: MGKD successfully bridges pre-service and in-service risk modeling, enabling transfer of default behavior patterns and improving overall risk prediction accuracy before service activation.
Abstract: Typical financial risk management involves distinct phases for pre-service risk assessment and in-service default detection, often modeled separately. This paper proposes a novel framework, Multi-Granularity Knowledge Distillation (abbreviated as MGKD), aimed at improving pre-service risk prediction through the integration of in-service user behavior data. MGKD follows the idea of knowledge distillation, where the teacher model, trained on historical in-service data, guides the student model, which is trained on pre-service data. By using soft labels derived from in-service data, the teacher model helps the student model improve its risk prediction prior to service activation. Meanwhile, a multi-granularity distillation strategy is introduced, including coarse-grained, fine-grained, and self-distillation, to align the representations and predictions of the teacher and student models. This approach not only reinforces the representation of default cases but also enables the transfer of key behavioral patterns associated with defaulters from the teacher to the student model, thereby improving the overall performance of pre-service risk assessment. Moreover, we adopt a re-weighting strategy to mitigate the model’s bias towards the minority class. Experimental results on large-scale real-world datasets from Tencent Mobile Payment demonstrate the effectiveness of our proposed approach in both offline and online scenarios.
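The abstract names the ingredients (teacher soft labels, multi-granularity alignment, minority-class re-weighting) without exact losses. Below is a generic sketch combining a tempered soft-label term with a re-weighted hard-label term; the binary-logit reduction and the weighting rule are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5,
                      pos_weight: float = 2.0) -> torch.Tensor:
    """Soft labels from the in-service teacher guide the pre-service
    student; defaulters (the minority class) are up-weighted."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    w = 1.0 + (pos_weight - 1.0) * labels.float()   # re-weight minority class
    binary_logit = student_logits[:, 1] - student_logits[:, 0]
    hard = F.binary_cross_entropy_with_logits(binary_logit,
                                              labels.float(), weight=w)
    return alpha * soft + (1.0 - alpha) * hard
```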
[320] CAME-AB: Cross-Modality Attention with Mixture-of-Experts for Antibody Binding Site Prediction
Hongzong Li, Jiahao Ma, Zhanpeng Shi, Rui Xiao, Fanming Jin, Ye-Fan Hu, Hangjun Che, Jian-Dong Huang
Main category: cs.LG
TL;DR: CAME-AB is a novel cross-modality attention framework with MoE backbone for antibody binding site prediction, integrating five biological modalities and outperforming existing methods on multiple metrics.
Details
Motivation: Existing antibody binding site prediction methods rely on single-view features and fail to identify antibody-specific binding sites on antigens, creating a need for more robust multimodal approaches.
Method: Integrates five biological modalities (amino acid encodings, BLOSUM profiles, language model embeddings, structure-aware features, GCN-refined graphs) with adaptive modality fusion, Transformer encoder, MoE module, supervised contrastive learning, and stochastic weight averaging.
Result: Extensive experiments show CAME-AB consistently outperforms strong baselines on Precision, Recall, F1-score, AUC-ROC, and MCC metrics. Ablation studies validate effectiveness of each component.
Conclusion: The proposed multimodal framework with adaptive fusion and specialized feature processing provides superior antibody binding site prediction, demonstrating the value of integrating diverse biological information sources.
Abstract: Antibody binding site prediction plays a pivotal role in computational immunology and therapeutic antibody design. Existing sequence or structure methods rely on single-view features and fail to identify antibody-specific binding sites on antigens. In this paper, we propose CAME-AB, a novel Cross-modality Attention framework with a Mixture-of-Experts (MoE) backbone for robust antibody binding site prediction. CAME-AB integrates five biologically grounded modalities, including raw amino acid encodings, BLOSUM substitution profiles, pretrained language model embeddings, structure-aware features, and GCN-refined biochemical graphs, into a unified multimodal representation. To enhance adaptive cross-modal reasoning, we propose an adaptive modality fusion module that learns to dynamically weight each modality based on its global relevance and input-specific contribution. A Transformer encoder combined with an MoE module further promotes feature specialization and capacity expansion. We additionally incorporate a supervised contrastive learning objective to explicitly shape the latent space geometry, encouraging intra-class compactness and inter-class separability. To improve optimization stability and generalization, we apply stochastic weight averaging during training. Extensive experiments on benchmark antibody-antigen datasets demonstrate that CAME-AB consistently outperforms strong baselines on multiple metrics, including Precision, Recall, F1-score, AUC-ROC, and MCC. Ablation studies further validate the effectiveness of each architectural component and the benefit of multimodal feature integration. Implementation details and code are available at https://anonymous.4open.science/r/CAME-AB-C525
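Of the listed components, the adaptive modality fusion module is the most self-contained to sketch: per-input softmax weights over the five modality embeddings. The gating parameterization below is an assumption; the paper's module may differ.

```python
import torch
import torch.nn as nn

class AdaptiveModalityFusion(nn.Module):
    """Learns input-specific weights over modality embeddings and
    returns their weighted sum."""
    def __init__(self, n_modalities: int = 5, d: int = 256):
        super().__init__()
        self.gate = nn.Linear(n_modalities * d, n_modalities)

    def forward(self, feats):
        # feats: list of n_modalities tensors, each of shape (batch, d)
        stacked = torch.stack(feats, dim=1)              # (B, M, d)
        w = torch.softmax(self.gate(stacked.flatten(1)), dim=-1)
        return (w.unsqueeze(-1) * stacked).sum(dim=1)    # (B, d)
```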
[321] Demo: Healthcare Agent Orchestrator (HAO) for Patient Summarization in Molecular Tumor Boards
Matthias Blondeel, Noel Codella, Sam Preston, Hao Qiu, Leonardo Schettini, Frank Tuan, Wen-wai Yim, Smitha Saligrama, Mert Öz, Shrey Jain, Matthew P. Lungren, Thomas Osborne
Main category: cs.LG
TL;DR: AI system (HAO) automates patient summary generation for Molecular Tumor Boards using LLM agents, with TBFact framework for evaluation, achieving 94% high-importance information capture.
Details
Motivation: Manual patient summary creation for MTBs is labor-intensive, subjective, and prone to omissions, requiring automated solutions for reliable and scalable support.
Method: Healthcare Agent Orchestrator (HAO) - LLM-driven multi-agent workflow for patient summary generation, plus TBFact model-as-judge framework for evaluating summary comprehensiveness and succinctness.
Result: Agent captured 94% of high-importance information (including partial entailments) with TBFact recall of 0.84 under strict criteria. Enables data-free evaluation without sharing sensitive clinical data.
Conclusion: HAO and TBFact provide robust foundation for reliable and scalable MTB support, addressing both generation and evaluation challenges in clinical settings.
Abstract: Molecular Tumor Boards (MTBs) are multidisciplinary forums where oncology specialists collaboratively assess complex patient cases to determine optimal treatment strategies. A central element of this process is the patient summary, typically compiled by a medical oncologist, radiation oncologist, or surgeon, or their trained medical assistant, who distills heterogeneous medical records into a concise narrative to facilitate discussion. This manual approach is often labor-intensive, subjective, and prone to omissions of critical information. To address these limitations, we introduce the Healthcare Agent Orchestrator (HAO), a Large Language Model (LLM)-driven AI agent that coordinates a multi-agent clinical workflow to generate accurate and comprehensive patient summaries for MTBs. Evaluating predicted patient summaries against ground truth presents additional challenges due to stylistic variation, ordering, synonym usage, and phrasing differences, which complicate the measurement of both succinctness and completeness. To overcome these evaluation hurdles, we propose TBFact, a “model-as-a-judge” framework designed to assess the comprehensiveness and succinctness of generated summaries. Using a benchmark dataset derived from de-identified tumor board discussions, we applied TBFact to evaluate our Patient History agent. Results show that the agent captured 94% of high-importance information (including partial entailments) and achieved a TBFact recall of 0.84 under strict entailment criteria. We further demonstrate that TBFact enables a data-free evaluation framework that institutions can deploy locally without sharing sensitive clinical data. Together, HAO and TBFact establish a robust foundation for delivering reliable and scalable support to MTBs.
[322] RoseCDL: Robust and Scalable Convolutional Dictionary Learning for Rare-event Detection
Jad Yehya, Mansour Benbakoura, Cédric Allain, Benoît Malezieux, Matthieu Kowalski, Thomas Moreau
Main category: cs.LG
TL;DR: RoseCDL is a scalable and robust convolutional dictionary learning algorithm for unsupervised rare event detection in large signals, combining stochastic windowing for efficiency with inline outlier detection for robustness.
Details
Motivation: Convolutional Dictionary Learning (CDL) has potential for rare event detection but faces challenges with computational cost and sensitivity to artifacts/outliers in large-scale signals across astronomy, physics, and biomedical fields.
Method: RoseCDL uses stochastic windowing for efficient training on large datasets and incorporates inline outlier detection to enhance robustness and isolate anomalous patterns.
Result: The algorithm reframes CDL as a practical tool for event discovery and characterization, extending its capabilities beyond traditional tasks like compression or denoising.
Conclusion: RoseCDL provides a scalable and robust framework for unsupervised rare event detection in long signals, making CDL more applicable to real-world signal analysis problems.
Abstract: Identifying recurring patterns and rare events in large-scale signals is a fundamental challenge in fields such as astronomy, physical simulations, and biomedical science. Convolutional Dictionary Learning (CDL) offers a powerful framework for modeling local structures in signals, but its use for detecting rare or anomalous events remains largely unexplored. In particular, CDL faces two key challenges in this setting: high computational cost and sensitivity to artifacts and outliers. In this paper, we introduce RoseCDL, a scalable and robust CDL algorithm designed for unsupervised rare event detection in long signals. RoseCDL combines stochastic windowing for efficient training on large datasets with inline outlier detection to enhance robustness and isolate anomalous patterns. This reframes CDL as a practical tool for event discovery and characterization in real-world signals, extending its role beyond traditional tasks like compression or denoising.
[323] The Domain Mixed Unit: A New Neural Arithmetic Layer
Paul Curry
Main category: cs.LG
TL;DR: DMU is a new neural arithmetic unit that learns to mix log-space and linear-space representations for arithmetic operations, achieving state-of-the-art performance on the NALM benchmark for multiplication and division.
Details
Motivation: To create a neural arithmetic unit that can better generalize arithmetic operations by mixing different mathematical representations, addressing limitations of previous approaches.
Method: Developed Domain Mixed Unit (DMU) with single parameter gate that mixes log-space and linear-space representations, with specific initializations for addition/multiplication and subtraction/division operations.
Result: Achieved state-of-the-art performance on NALM Benchmark with highest percentage solved over all seeds for multiplication and division tasks.
Conclusion: DMU successfully demonstrates improved generalization capabilities for arithmetic operations through mixed representation learning, and will be contributed to the open-source NALM benchmark.
Abstract: The Domain Mixed Unit (DMU) is a new neural arithmetic unit that learns a single parameter gate that mixes between log-space and linear-space representations while performing either addition (DMU add) or subtraction (DMU sub). Two initializations are proposed for the DMU: one covering addition and multiplication, and another covering subtraction and division. The DMU achieves state-of-the-art performance on the NALM Benchmark, a dataset designed to test the ability of neural arithmetic units to generalize arithmetic operations, specifically performing with the highest percentage solved over all seeds on multiplication and division. The DMU will be submitted as a pull request to the open-source NALM benchmark, and its code is available on GitHub at https://github.com/marict?tab=repositories
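The core idea fits in a few lines. Below is a sketch of the addition variant under stated assumptions (sigmoid gate, absolute values to sidestep signs in log space); the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

class DMUAdd(nn.Module):
    """Domain Mixed Unit sketch: a single learnable gate mixes a
    linear-space sum (addition) with a log-space sum, which after
    exponentiation acts as multiplication."""
    def __init__(self, init_g: float = 0.0):
        super().__init__()
        self.g = nn.Parameter(torch.tensor(init_g))  # the single parameter

    def forward(self, a, b, eps: float = 1e-8):
        gate = torch.sigmoid(self.g)
        linear = a + b                                    # addition
        logspace = torch.exp(torch.log(a.abs() + eps)
                             + torch.log(b.abs() + eps))  # ~ multiplication
        return gate * linear + (1 - gate) * logspace
```

Initializing the gate toward 1 biases the unit to addition, toward 0 to multiplication; a subtraction variant covers subtraction and division analogously, matching the two proposed initializations in spirit.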
[324] Two Sides of the Same Optimization Coin: Model Degradation and Representation Collapse in Graph Foundation Models
Xunkai Li, Daohan Su, Sicheng Liu, Ru Zhang, Zhenjun Li, Bing Zhou, Rong-Hua Li, Guoren Wang
Main category: cs.LG
TL;DR: MoT addresses domain generalization conflicts in graph foundation models by introducing Information Tinker for handling model degradation and representation collapse, plus Regularization Tinker for improved optimization, achieving state-of-the-art performance across multiple domains.
Details
Motivation: Graph foundation models suffer from domain generalization conflicts that cause model degradation (encoder/codebook failing to capture input diversity) and representation collapse (embedding/codebook vectors losing semantic separability), creating optimization dilemmas during pre-training.
Method: Proposes MoT (Mixture-of-Tinkers) with: (1) Information Tinker using edge-wise semantic fusion and mixture-of-codebooks with domain-aware routing to improve information capacity; (2) Regularization Tinker with two additional regularizations to enhance gradient supervision.
Result: Experiments on 22 datasets across 6 domains show MoT achieves significant improvements in supervised, few-shot, and zero-shot scenarios compared to state-of-the-art baselines.
Conclusion: MoT effectively addresses information bottleneck and regularization deficit challenges in graph foundation models, provides controllable model scaling, and demonstrates strong cross-domain generalization capabilities.
Abstract: Graph foundation models (GFMs), inspired by the success of LLMs, are designed to learn optimal embeddings from multi-domain text-attributed graphs (TAGs) for downstream cross-task generalization. In our investigation, graph VQ-MAE stands out among the increasingly diverse landscape of GFM architectures, owing to its ability to jointly encode topology and textual attributes from multiple domains into discrete embedding spaces with clear semantic boundaries. Despite its potential, domain generalization conflicts cause subtle pitfalls. In this paper, we instantiate two of them, which behave like two sides of the same GFM optimization coin - Side 1, Model Degradation: the encoder and codebook fail to capture the diversity of inputs; Side 2, Representation Collapse: the hidden embeddings and codebook vectors fail to preserve semantic separability due to constraints from narrow representation subspaces. These two pitfalls (sides) collectively impair the decoder and produce low-quality reconstruction supervision, causing the GFM optimization dilemma during pre-training (coin). Through empirical investigation, we attribute these challenges to Information Bottleneck and Regularization Deficit. To address them, we propose MoT (Mixture-of-Tinkers): (1) Information Tinker for the two pitfalls, which uses an edge-wise semantic fusion strategy and a mixture-of-codebooks with domain-aware routing to improve information capacity; (2) Regularization Tinker for the optimization coin, which adds two regularizations to further improve gradient supervision in our proposed Information Tinker. Notably, as a flexible architecture, MoT adheres to the scaling laws of GFMs, offering a controllable model scale. Experiments on 22 datasets across 6 domains demonstrate that MoT achieves significant improvements over SOTA baselines in supervised, few-shot, and zero-shot scenarios.
[325] Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics
Dikshant Sagar, Kaiwen Yu, Alejandro Yankelevich, Jianming Bian, Pierre Baldi
Main category: cs.LG
TL;DR: Vision Language Models outperform CNNs in neutrino interaction classification, offering better performance, interpretability, and multimodal reasoning capabilities for high-energy physics experiments.
Details
Motivation: To explore the application of Vision Language Models (VLMs) for identifying neutrino interactions in pixelated detector data, leveraging their multimodal reasoning capabilities beyond traditional CNN approaches.
Method: Used a fine-tuned variant of LLaMa 3.2 VLM and benchmarked it against state-of-the-art convolutional neural network architectures similar to those used in NOvA and DUNE experiments.
Result: VLMs can outperform CNNs in classification performance while providing greater flexibility for integrating auxiliary textual/semantic information and more interpretable, reasoning-based predictions.
Conclusion: VLMs show strong potential as a general-purpose backbone for physics event classification due to their high performance, interpretability, and generalizability, opening new avenues for multimodal reasoning in experimental neutrino physics.
Abstract: Recent advances in Large Language Models (LLMs) have demonstrated their remarkable capacity to process and reason over structured and unstructured data modalities beyond natural language. In this work, we explore the applications of Vision Language Models (VLMs), specifically a fine-tuned variant of LLaMa 3.2, to the task of identifying neutrino interactions in pixelated detector data from high-energy physics (HEP) experiments. We benchmark this model against a state-of-the-art convolutional neural network (CNN) architecture, similar to those used in the NOvA and DUNE experiments, which have achieved high efficiency and purity in classifying electron and muon neutrino events. Our evaluation considers both the classification performance and interpretability of the model predictions. We find that VLMs can outperform CNNs, while also providing greater flexibility in integrating auxiliary textual or semantic information and offering more interpretable, reasoning-based predictions. This work highlights the potential of VLMs as a general-purpose backbone for physics event classification, due to their high performance, interpretability, and generalizability, which opens new avenues for integrating multimodal reasoning in experimental neutrino physics.
[326] Securing Private Federated Learning in a Malicious Setting: A Scalable TEE-Based Approach with Client Auditing
Shun Takagi, Satoshi Hasegawa
Main category: cs.LG
TL;DR: A novel server extension using ephemeral TEEs to achieve maliciously secure DP-FTRL in federated learning, providing verifiable proofs with minimal overhead while maintaining privacy against malicious servers.
Details
Motivation: Existing DP-FTRL approaches assume semi-honest servers, but practical settings require security against malicious servers and handling client dropouts/corruption. TEEs alone introduce forking attacks and availability issues.
Method: Introduces a server-side trusted computing base (TCB) implemented with ephemeral TEE modules that produce verifiable proofs of server actions. Selected clients audit these proofs with small additional communication and computation costs.
Result: Formal proofs demonstrate privacy guarantees in malicious settings. Experimental results show the framework adds only small constant overhead to clients in realistic scenarios.
Conclusion: The proposed extension successfully reduces TCB size while maintaining system scalability and liveness, providing maliciously secure DP-FTRL without significant performance degradation.
Abstract: In cross-device private federated learning, differentially private follow-the-regularized-leader (DP-FTRL) has emerged as a promising privacy-preserving method. However, existing approaches assume a semi-honest server and have not addressed the challenge of securely removing this assumption. This is due to its statefulness, which becomes particularly problematic in practical settings where clients can drop out or be corrupted. While trusted execution environments (TEEs) might seem like an obvious solution, a straightforward implementation can introduce forking attacks or availability issues due to state management. To address this problem, our paper introduces a novel server extension that acts as a trusted computing base (TCB) to realize maliciously secure DP-FTRL. The TCB is implemented with an ephemeral TEE module on the server side to produce verifiable proofs of server actions. Some clients, upon being selected, participate in auditing these proofs with small additional communication and computational demands. This extension solution reduces the size of the TCB while maintaining the system’s scalability and liveness. We provide formal proofs based on interactive differential privacy, demonstrating privacy guarantee in malicious settings. Finally, we experimentally show that our framework adds small constant overhead to clients in several realistic settings.
cs.MA
[327] DeepVoting: Learning and Fine-Tuning Voting Rules with Canonical Embeddings
Leonardo Matone, Ben Abramowitz, Ben Armstrong, Avinash Balakrishnan, Nicholas Mattei
Main category: cs.MA
TL;DR: Learning probabilistic voting rules using neural networks with improved efficiency and smaller networks, with fine-tuning for axiomatic properties and resistance to strategic manipulation.
Details
Motivation: Traditional social choice theory shows designing aggregation rules with specific properties is difficult or impossible. Instead of hand-designing, learning voting rules from data offers a solution, but prior methods require large models or are limited by preference representations.
Method: Recast voting rule design as learning probabilistic functions that output candidate distributions using neural networks. Use standard social choice embeddings for efficient preference profile encoding, enabling faster learning with smaller networks. Fine-tune learned rules using axiomatic properties.
Result: Preference profile encoding significantly impacts neural network efficiency and learning ability. Learned rules can be fine-tuned to create novel voting rules resistant to specific attacks, particularly a probabilistic version of the No Show Paradox.
Conclusion: Neural networks can effectively learn probabilistic social choice functions with improved efficiency, and fine-tuning enables creation of novel voting rules with desirable axiomatic properties and resistance to strategic manipulation.
Abstract: Aggregating agent preferences into a collective decision is an important step in many problems (e.g., hiring, elections, peer review) and across areas of computer science (e.g., reinforcement learning, recommender systems). As Social Choice Theory has shown, the problem of designing aggregation rules with specific sets of properties (axioms) can be difficult, or provably impossible in some cases. Instead of designing algorithms by hand, one can learn aggregation rules, particularly voting rules, from data. However, prior work in this area has required extremely large models or been limited by the choice of preference representation, i.e., embedding. We recast the problem of designing voting rules with desirable properties into one of learning probabilistic functions that output distributions over a set of candidates. Specifically, we use neural networks to learn probabilistic social choice functions. Using standard embeddings from the social choice literature we show that preference profile encoding has significant impact on the efficiency and ability of neural networks to learn rules, allowing us to learn rules faster and with smaller networks than previous work. Moreover, we show that our learned rules can be fine-tuned using axiomatic properties to create novel voting rules and make them resistant to specific types of “attack”. Namely, we fine-tune rules to resist a probabilistic version of the No Show Paradox.
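The learning target, a probabilistic social choice function, is easy to make concrete. A sketch assuming a positional-score profile embedding (normalized counts of each candidate at each rank), one canonical choice from the social choice literature; the network shape is an assumption.

```python
import torch
import torch.nn as nn

class NeuralVotingRule(nn.Module):
    """Maps a fixed-size embedding of a preference profile to a
    probability distribution over candidates."""
    def __init__(self, n_candidates: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_candidates * n_candidates, hidden), nn.ReLU(),
            nn.Linear(hidden, n_candidates))

    def forward(self, profile_embedding: torch.Tensor) -> torch.Tensor:
        # profile_embedding: (batch, n_candidates * n_candidates), e.g.
        # normalized counts of candidate c appearing at rank r.
        return torch.softmax(self.net(profile_embedding), dim=-1)
```

Fine-tuning for an axiom can then plausibly be implemented as an added penalty term, e.g., a differentiable surrogate for violations of participation (the No Show Paradox), in the training loss.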
[328] The Sound of Silence in Social Networks
Jesús Aranda, Juan Francisco Díaz, David Gaona, Frank Valencia
Main category: cs.MA
TL;DR: Generalizes DeGroot opinion dynamics model to incorporate Spiral of Silence theory, where minority-opinion agents become silent. Introduces two models: memoryless (SOM-) and memory-based (SOM+), showing different convergence properties.
Details
Motivation: To incorporate the Spiral of Silence theory from political science into opinion dynamics models, accounting for how individuals may withhold opinions when perceiving themselves in the minority.
Method: Two opinion update models: 1) SOM- (memoryless) - agents update using weighted average of non-silent neighbors’ opinions; 2) SOM+ (memory-based) - agents update using weighted average of all neighbors’ opinions, using most recent opinion for silent neighbors.
Result: SOM- guarantees consensus convergence for clique graphs but not for strongly-connected aperiodic graphs. SOM+ does not guarantee consensus even for clique graphs. Simulations align with Spiral of Silence theory.
Conclusion: Silence dynamics significantly impact opinion formation and consensus achievement, revealing limitations of traditional consensus models in more realistic social scenarios.
Abstract: We generalize the classic multi-agent DeGroot model for opinion dynamics to incorporate the Spiral of Silence theory from political science. This theory states that individuals may withhold their opinions when they perceive them to be in the minority. As in the DeGroot model, a community of agents is represented as a weighted directed graph whose edges indicate how much agents influence one another. However, agents whose current opinions are in the minority become silent (i.e., they do not express their opinion). Two models for opinion update are then introduced. In the memoryless opinion model (SOM-), agents update their opinion by taking the weighted average of their non-silent neighbors’ opinions. In the memory-based opinion model (SOM+), agents update their opinions by taking the weighted average of the opinions of all their neighbors, but for silent neighbors, their most recent opinion is considered. We show that for SOM- convergence to consensus is guaranteed for clique graphs but, unlike for the classic DeGroot, not guaranteed for strongly-connected aperiodic graphs. In contrast, we show that for SOM+ convergence to consensus is not guaranteed even for clique graphs. We showcase our models through simulations offering experimental insights that align with key aspects of the Spiral of Silence theory. These findings reveal the impact of silence dynamics on opinion formation and highlight the limitations of consensus in more nuanced social models.
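The SOM- update is the DeGroot average restricted to vocal neighbors; writing it out makes the convergence caveat concrete. With $S^t$ the set of silent agents at time $t$ and $w_{ij}$ the influence weights, a natural formalization (consistent with, though not quoted from, the paper) is

$$
x_i^{t+1} = \frac{\sum_{j \notin S^t} w_{ij}\, x_j^t}{\sum_{j \notin S^t} w_{ij}},
$$

so the effective influence matrix is renormalized at every step as the silent set changes; this time-varying renormalization is what breaks the classic DeGroot convergence argument on strongly-connected aperiodic graphs.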
[329] Capability-Aware Shared Hypernetworks for Flexible Heterogeneous Multi-Robot Coordination
Kevin Fu, Shalin Anand Jain, Pierce Howell, Harish Ravichandar
Main category: cs.MA
TL;DR: CASH is a hypernetwork architecture that enables efficient learning of flexible shared policies for heterogeneous multi-robot teams, achieving better performance and sample efficiency with fewer parameters while allowing zero-shot generalization to unseen robots.
Details
Motivation: Existing neural architectures for heterogeneous multi-robot teams force a trade-off between expressivity and efficiency - shared parameters limit behavioral diversity while separate policies sacrifice efficiency and generalization.
Method: Proposed Capability-Aware Shared Hypernetworks (CASH), a soft weight sharing architecture using hypernetworks to learn a flexible shared policy that dynamically adapts to each robot’s capabilities post-training.
Result: CASH outperforms baseline architectures across multiple heterogeneous tasks and learning paradigms, with 60%-80% fewer parameters, generating appropriately-diverse behaviors and enabling zero-shot generalization to unseen robots.
Conclusion: The CASH architecture successfully avoids the expressivity-efficiency tradeoff in heterogeneous multi-robot coordination, providing a flexible solution that scales efficiently while maintaining behavioral diversity and generalization capabilities.
Abstract: Recent advances have enabled heterogeneous multi-robot teams to learn complex and effective coordination skills. However, existing neural architectures that support heterogeneous teaming tend to force a trade-off between expressivity and efficiency. Shared-parameter designs prioritize sample efficiency by enabling a single network to be shared across all or a pre-specified subset of robots (via input augmentations), but tend to limit behavioral diversity. In contrast, recent designs employ a separate policy for each robot, enabling greater diversity and expressivity at the cost of efficiency and generalization. Our key insight is that such tradeoffs can be avoided by viewing these design choices as ends of a broad spectrum. Inspired by recent work in transfer and meta learning, and building on prior work in multi-robot task allocation, we propose Capability-Aware Shared Hypernetworks (CASH), a soft weight sharing architecture that uses hypernetworks to efficiently learn a flexible shared policy that dynamically adapts to each robot post-training. By explicitly encoding the impact of robot capabilities (e.g., speed and payload) on collective behavior, CASH enables zero-shot generalization to unseen robots or team compositions. Our experiments involve multiple heterogeneous tasks, three learning paradigms (imitation learning, value-based, and policy-gradient RL), and SOTA multi-robot simulation (JaxMARL) and hardware (Robotarium) platforms. Across all conditions, we find that CASH generates appropriately-diverse behaviors and consistently outperforms baseline architectures in terms of performance and sample efficiency during both training and zero-shot generalization, all with 60%-80% fewer learnable parameters.
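The hypernetwork pattern is worth making concrete. Below is a sketch in which a shared hypernetwork emits the weights of a per-robot linear policy head from its capability vector; the head architecture and sizes are assumptions.

```python
import torch
import torch.nn as nn

class CapabilityHypernet(nn.Module):
    """A shared hypernetwork maps a robot's capability vector
    (e.g., speed, payload) to the parameters of its policy head,
    so unseen capability vectors yield new policies zero-shot."""
    def __init__(self, cap_dim: int, obs_dim: int, act_dim: int,
                 hidden: int = 128):
        super().__init__()
        self.obs_dim, self.act_dim = obs_dim, act_dim
        self.hyper = nn.Sequential(
            nn.Linear(cap_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim * act_dim + act_dim))

    def forward(self, capabilities, obs):
        # capabilities: (B, cap_dim); obs: (B, obs_dim)
        params = self.hyper(capabilities)
        W = params[:, : self.obs_dim * self.act_dim]
        b = params[:, self.obs_dim * self.act_dim:]
        W = W.view(-1, self.act_dim, self.obs_dim)
        # Per-robot linear policy generated from its capabilities.
        return torch.bmm(W, obs.unsqueeze(-1)).squeeze(-1) + b
```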
[330] Towards Generalized Routing: Model and Agent Orchestration for Adaptive and Efficient Inference
Xiyu Guo, Shan Wang, Chunfang Ji, Xuefeng Zhao, Wenhao Xi, Yaoyao Liu, Qinglan Li, Chao Deng, Junlan Feng
Main category: cs.MA
TL;DR: MoMA is a routing framework that combines LLM and agent-based routing to efficiently handle diverse user queries by directing them to the most suitable execution units based on cost-performance optimization.
Details
Motivation: The diversity of user queries spanning multiple domains presents a fundamental routing challenge in AI service ecosystems, requiring accurate query direction while optimizing both performance and efficiency.
Method: Proposes MoMA framework with detailed training dataset to profile LLM capabilities, dynamic query routing to optimal LLMs, and context-aware agent selection with dynamic masking strategies.
Result: Experimental results show MoMA offers superior cost-efficiency and scalability compared to existing routing approaches.
Conclusion: MoMA effectively addresses the routing challenge in heterogeneous AI service landscapes through precise intent recognition and adaptive routing strategies.
Abstract: The rapid advancement of large language models (LLMs) and domain-specific AI agents has greatly expanded the ecosystem of AI-powered services. User queries, however, are highly diverse and often span multiple domains and task types, resulting in a complex and heterogeneous landscape. This diversity presents a fundamental routing challenge: how to accurately direct each query to an appropriate execution unit while optimizing both performance and efficiency. To address this, we propose MoMA (Mixture of Models and Agents), a generalized routing framework that integrates both LLM and agent-based routing. Built upon a deep understanding of model and agent capabilities, MoMA effectively handles diverse queries through precise intent recognition and adaptive routing strategies, achieving an optimal balance between efficiency and cost. Specifically, we construct a detailed training dataset to profile the capabilities of various LLMs under different routing model structures, identifying the most suitable tasks for each LLM. During inference, queries are dynamically routed to the LLM with the best cost-performance efficiency. We also introduce an efficient agent selection strategy based on a context-aware state machine and dynamic masking. Experimental results demonstrate that the MoMA router offers superior cost-efficiency and scalability compared to existing approaches.
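The selection criterion itself can be stated in a few lines. A deliberately simplified sketch of cost-performance routing; the real router is a trained model with intent recognition and an agent-selection state machine.

```python
from typing import Callable, Dict, List

def route_query(query: str, models: List[str],
                predicted_quality: Callable[[str, str], float],
                cost_per_call: Dict[str, float]) -> str:
    """Send the query to the execution unit with the best predicted
    quality per unit cost (a stand-in for 'cost-performance efficiency')."""
    return max(models,
               key=lambda m: predicted_quality(query, m) / cost_per_call[m])
```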
cs.MM
eess.AS
[331] Automotive sound field reproduction using deep optimization with spatial domain constraint
Yufan Qian, Tianshu Qu, Xihong Wu
Main category: eess.AS
TL;DR: SPMnet is a learning-based sound field reproduction method that improves both sound quality and spatial localization in automotive audio systems using spatial power map constraints and neural network optimization.
Details
Motivation: Automotive cabin acoustic environments require a trade-off between sound quality and spatial accuracy in sound field reproduction, which limits audio performance.
Method: Proposes Spatial Power Map Net (SPMnet) with spatial power map constraint using beamforming to guide energy toward intended directions, integrated into multi-channel equalization framework with neural network optimization for filter design.
Result: Both objective and subjective evaluations confirm enhanced sound quality and improved spatial localization within automotive cabins, with analysis of audio materials and virtual sound source angles.
Conclusion: SPMnet successfully overcomes the traditional trade-off between sound quality and spatial accuracy in complex automotive acoustic environments through learning-based optimization with spatial constraints.
Abstract: Sound field reproduction with undistorted sound quality and precise spatial localization is desirable for automotive audio systems. However, the complexity of the automotive cabin acoustic environment often necessitates a trade-off between sound quality and spatial accuracy. To overcome this limitation, we propose Spatial Power Map Net (SPMnet), a learning-based sound field reproduction method that improves both sound quality and spatial localization in complex environments. We introduce a spatial power map (SPM) constraint, which characterizes the angular energy distribution of the reproduced field using beamforming. This constraint guides energy toward the intended direction to enhance spatial localization, and is integrated into a multi-channel equalization framework to also improve sound quality under reverberant conditions. To address the resulting non-convexity, deep optimization, which uses neural networks to solve optimization problems, is employed for filter design. Both in situ objective and subjective evaluations confirm that our method enhances sound quality and improves spatial localization within the automotive cabin. Furthermore, we analyze the influence of different audio materials and the arrival angles of the virtual sound source in the reproduced sound field, investigating the potential underlying factors affecting these results.
[332] MAPSS: Manifold-based Assessment of Perceptual Source Separation
Amir Ivry, Samuele Cornell, Shinji Watanabe
Main category: eess.AS
TL;DR: Introduces Perceptual Separation (PS) and Perceptual Match (PM) measures that functionally isolate leakage and self-distortion in source separation systems, achieving highest correlation with human perception scores.
Details
Motivation: Objective assessment of source-separation systems mismatches subjective human perception, especially when leakage and self-distortion interact.
Method: Uses pre-trained self-supervised learning model to encode waveforms, projects onto manifold via diffusion maps, and measures Mahalanobis distances to quantify self-distortion (PM) and leakage (PS).
Result: PS and PM achieve the highest linear correlation with human scores (86.36% for speech, 87.21% for music), outperforming 14 competitors, with a small error radius (1.39%) and non-asymptotic confidence intervals.
Conclusion: The measures are differentiable, granular, and complement each other most as system performance degrades, enabling more reliable evaluation of source separation systems.
Abstract: Objective assessment of source-separation systems still mismatches subjective human perception, especially when leakage and self-distortion interact. We introduce the Perceptual Separation (PS) and Perceptual Match (PM), the first pair of measures that functionally isolate these two factors. Our intrusive method begins with generating a bank of fundamental distortions for each reference waveform signal in the mixture. Distortions, references, and their respective system outputs from all sources are then independently encoded by a pre-trained self-supervised learning model. These representations are aggregated and projected onto a manifold via diffusion maps, which aligns Euclidean distances on the manifold with dissimilarities of the encoded waveforms. On this manifold, the PM measures the Mahalanobis distance from each output to its attributed cluster that consists of its reference and distortions embeddings, capturing self-distortion. The PS accounts for the Mahalanobis distance of the output to the attributed and to the closest non-attributed clusters, quantifying leakage. Both measures are differentiable and granular, operating at a resolution as low as 50 frames per second. We further derive, for both measures, deterministic error radius and non-asymptotic, high-probability confidence intervals (CIs). Experiments on English, Spanish, and music mixtures show that the PS and PM nearly always achieve higher linear correlation coefficients with human mean-opinion scores than 14 competitors, reaching as high as 86.36% for speech and 87.21% for music. We observe, at worst, an error radius of 1.39% and a probabilistic 95% CI of 12.21% for these coefficients, which supports reliable and informed evaluation. Using mutual information, the measures complement each other most as their values decrease, suggesting they are jointly more informative as system performance degrades.
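Both measures bottom out in Mahalanobis distances on the learned manifold. For an output embedding $y$ and a cluster with mean $\mu$ and covariance $\Sigma$,

$$
d_M(y, \mu) = \sqrt{(y - \mu)^{\top} \Sigma^{-1} (y - \mu)},
$$

with PM taken to the attributed cluster (the reference plus its distortions) and PS contrasting the attributed cluster against the closest non-attributed one; how the two distances are combined into the PS score is not specified in the abstract.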
[333] Over-the-Air Adversarial Attack Detection: from Datasets to Defenses
Li Wang, Xiaoyan Lei, Haorui He, Lei Wang, Jie Shi, Zhizheng Wu
Main category: eess.AS
TL;DR: AdvSV 2.0 dataset with 628k samples and 800 hours of audio containing both OTL and OTA adversarial attacks, plus a new Neural Replay Simulator attack method and CODA-OCC defense achieving 11.2% EER and 0.95 AUC.
Details
Motivation: ASV systems are vulnerable to adversarial attacks but lack comprehensive datasets for thorough testing of detection methods.
Method: Created AdvSV 2.0 dataset with classical attacks, developed Neural Replay Simulator for enhanced OTA attacks, and proposed CODA-OCC contrastive learning defense.
Result: CODA-OCC achieved 11.2% EER and 0.95 AUC on AdvSV 2.0, outperforming state-of-the-art detection methods.
Conclusion: The comprehensive dataset and novel defense method significantly improve ASV system security against adversarial attacks.
Abstract: Automatic Speaker Verification (ASV) systems can be used for voice-enabled applications for identity verification. However, recent studies have exposed these systems’ vulnerabilities to both over-the-line (OTL) and over-the-air (OTA) adversarial attacks. Although various detection methods have been proposed to counter these threats, they have not been thoroughly tested due to the lack of a comprehensive data set. To address this gap, we developed the AdvSV 2.0 dataset, which contains 628k samples with a total duration of 800 hours. This dataset incorporates classical adversarial attack algorithms, ASV systems, and encompasses both OTL and OTA scenarios. Furthermore, we introduce a novel adversarial attack method based on a Neural Replay Simulator (NRS), which enhances the potency of adversarial OTA attacks, thereby presenting a greater threat to ASV systems. To defend against these attacks, we propose CODA-OCC, a contrastive learning approach within the one-class classification framework. Experimental results show that CODA-OCC achieves an EER of 11.2% and an AUC of 0.95 on the AdvSV 2.0 dataset, outperforming several state-of-the-art detection methods.
[334] Listening for “You”: Enhancing Speech Image Retrieval via Target Speaker Extraction
Wenhao Yang, Jianguo Wei, Wenhuan Lu, Xinyue Song, Xianghu Yue
Main category: eess.AS
TL;DR: A novel framework for target speaker speech-image retrieval that handles multi-speaker scenarios by extracting and aligning target speaker’s spoken commands with corresponding images using contrastive learning.
Details
Motivation: Image retrieval using spoken language cues is promising but challenging in multi-speaker scenarios where multiple people are speaking simultaneously.Method: Integrates pre-trained self-supervised audio encoders with vision models via target speaker-aware contrastive learning, conditioned on a Target Speaker Extraction and Retrieval module to extract spoken commands from target speakers.
Result: Achieves 36.3% and 29.9% Recall@1 in 2 and 3 speaker scenarios respectively, significantly outperforming existing methods and single speaker baselines.
Conclusion: The approach demonstrates strong potential for real-world deployment in assistive robotics and multimodal interaction systems, effectively handling multi-speaker speech-image retrieval.
Abstract: Image retrieval using spoken language cues has emerged as a promising direction in multimodal perception, yet leveraging speech in multi-speaker scenarios remains challenging. We propose a novel Target Speaker Speech-Image Retrieval task and a framework that learns the relationship between images and multi-speaker speech signals in the presence of a target speaker. Our method integrates pre-trained self-supervised audio encoders with vision models via target speaker-aware contrastive learning, conditioned on a Target Speaker Extraction and Retrieval module. This enables the system to extract spoken commands from the target speaker and align them with corresponding images. Experiments on SpokenCOCO2Mix and SpokenCOCO3Mix show that TSRE significantly outperforms existing methods, achieving 36.3% and 29.9% Recall@1 in 2 and 3 speaker scenarios, respectively - substantial improvements over single speaker baselines and state-of-the-art models. Our approach demonstrates potential for real-world deployment in assistive robotics and multimodal interaction systems.
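Target speaker-aware contrastive learning of this kind typically reduces to a symmetric InfoNCE objective between extracted-speech and image embeddings. A hedged torch sketch under that assumption; the conditioning on the Target Speaker Extraction module is abstracted away, and the temperature `tau` is a hypothetical hyperparameter:

```python
import torch
import torch.nn.functional as F

def speech_image_contrastive_loss(speech_emb, image_emb, tau=0.07):
    # speech_emb: (N, D) embeddings of extracted target-speaker utterances
    # image_emb:  (N, D) embeddings of the matching images
    speech_emb = F.normalize(speech_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = speech_emb @ image_emb.t() / tau   # (N, N) pairwise similarities
    labels = torch.arange(speech_emb.size(0))
    # Symmetric InfoNCE: each utterance retrieves its image and vice versa
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```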
[335] Short-term cognitive fatigue of spatial selective attention after face-to-face conversations in virtual noisy environments
Ľuboš Hládek, Piotr Majdak, Robert Baumgartner
Main category: eess.AS
TL;DR: Study found that effortful conversations in noisy environments increase self-reported fatigue but unexpectedly improve response times in auditory spatial attention tasks, with strong training effects observed across sessions.
Details
Motivation: To investigate whether cognitive fatigue from effortful conversations in cocktail party situations compromises spatial selective attention abilities.Method: Within-subject design where young normal-hearing participants performed auditory spatial attention tasks before and after three conditions: 1) 30-minute face-to-face conversation in virtual reverberant room with 72 dB noise, 2) passive listening to same noise, 3) conversation in quiet. Measured self-reported effort, fatigue, response times and accuracy.
Result: Self-reported effort and fatigue increased after conversations in noise and passive listening. Surprisingly, response times decreased (improved) after conversation in noise, while accuracy remained unchanged. Strong training effects were observed across sessions.
Conclusion: Effortful conversations in noise increase subjective fatigue but may not impair (and could potentially improve) auditory spatial attention performance, with significant training effects complicating fatigue assessment in within-subject designs.
Abstract: Spatial selective attention is an important asset for communication in cocktail party situations but may be compromised by short-term cognitive fatigue. Here we tested whether an effortful conversation in a highly ecological setting degrades performance in an auditory spatial selective attention task. Young participants with normal hearing performed the task before and after (1) having a real dyadic face-to-face conversation on a free topic in a virtual reverberant room with simulated interfering conversations and background babble noise at 72 dB SPL for 30 minutes, (2) passively listening to the interfering conversations and babble noise, or (3) having the conversation in quiet. Self-reported perceived effort and fatigue increased after conversations in noise and passive listening relative to the reports after conversations in quiet. In contrast to our expectations, response times in the attention task decreased, rather than increased, after conversation in noise and accuracy did not change systematically in any of the conditions on the group level. Unexpectedly, we observed strong training effects between the individual sessions in our within-subject design even after one hour of training on a different day.
[336] Acoustic to Articulatory Speech Inversion for Children with Velopharyngeal Insufficiency
Saba Tabatabaee, Suzanne Boyce, Liran Oren, Mark Tiede, Carol Espy-Wilson
Main category: eess.AS
TL;DR: Speech inversion system for nasalance estimation improved with glottal control proxies, achieving 16.92% better correlation in adults and 7.90% improvement when fine-tuned for children with VPI.
Details
Motivation: Traditional clinical methods for assessing nasality are invasive and problematic for children, requiring a noninvasive alternative using speech inversion.Method: Augmented speech inversion system with electroglottography source information and acoustic features (F0, periodic/aperiodic energy) as glottal control proxies, then fine-tuned on data from children with VPI.
Result: 16.92% relative improvement in Pearson correlation for adult nasalance estimation; 7.90% improvement after fine-tuning for children with VPI.
Conclusion: Speech inversion with glottal control augmentation provides effective noninvasive nasalance estimation, particularly beneficial for pediatric VPI assessment.
Abstract: Traditional clinical approaches for assessing nasality, such as nasopharyngoscopy and nasometry, involve unpleasant experiences and are problematic for children. Speech Inversion (SI), a noninvasive technique, offers a promising alternative for estimating articulatory movement without the need for physical instrumentation. In this study, an SI system trained on nasalance data from healthy adults is augmented with source information from electroglottography and acoustically derived F0, periodic and aperiodic energy estimates as proxies for glottal control. This model achieves a 16.92% relative improvement in Pearson Product-Moment Correlation (PPMC) compared to a previous SI system for nasalance estimation. To adapt the SI system for nasalance estimation in children with Velopharyngeal Insufficiency (VPI), the model initially trained on adult speech was fine-tuned using data from children with VPI, yielding a 7.90% relative improvement in PPMC compared to its performance before fine-tuning.
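PPMC is the plain Pearson correlation between estimated and measured nasalance trajectories, and the reported gains are relative changes in that coefficient. A small numpy sketch:

```python
import numpy as np

def ppmc(pred, target):
    # Pearson Product-Moment Correlation between estimated and measured nasalance
    p = np.asarray(pred, dtype=float) - np.mean(pred)
    t = np.asarray(target, dtype=float) - np.mean(target)
    return (p @ t) / (np.linalg.norm(p) * np.linalg.norm(t))

# relative improvement (%) = (ppmc_new - ppmc_old) / ppmc_old * 100
```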
[337] Region-Specific Audio Tagging for Spatial Sound
Jinzheng Zhao, Yong Xu, Haohe Liu, Davide Berghi, Xinyuan Qian, Qiuqiang Kong, Junqi Zhao, Mark D. Plumbley, Wenwu Wang
Main category: eess.AS
TL;DR: Proposes region-specific audio tagging task that labels sound events within specific spatial regions (angular space or distance) using microphone array recordings, extending PANNs and AST models with spatial features.
Details
Motivation: Audio tagging traditionally labels sound events without spatial context. This paper aims to enable region-specific sound event detection for spatial audio recordings using microphone arrays.Method: Extends state-of-the-art audio tagging systems (PANNs and AST) with spatial and position features. Studies different feature combinations including spectral, spatial, and positional information for region-specific tagging.
Result: Experimental results on simulated and real datasets demonstrate feasibility of region-specific audio tagging and effectiveness of the proposed method. Directional features also benefit omnidirectional tagging.
Conclusion: Region-specific audio tagging is a viable task that can be effectively addressed by extending existing audio tagging models with spatial features, enabling more precise spatial sound event detection.
Abstract: Audio tagging aims to label sound events appearing in an audio recording. In this paper, we propose region-specific audio tagging, a new task which labels sound events in a given region for spatial audio recorded by a microphone array. The region can be specified as an angular space or a distance from the microphone. We first study the performance of different combinations of spectral, spatial, and position features. Then we extend state-of-the-art audio tagging systems such as pre-trained audio neural networks (PANNs) and audio spectrogram transformer (AST) to the proposed region-specific audio tagging task. Experimental results on both the simulated and the real datasets show the feasibility of the proposed task and the effectiveness of the proposed method. Further experiments show that incorporating the directional features is beneficial for omnidirectional tagging.
[338] Binaural Target Speaker Extraction using HRTFs
Yoav Ellinson, Sharon Gannot
Main category: eess.AS
TL;DR: Novel binaural target speaker extraction method using HRTF without speaker embeddings, achieving speaker-independent performance with excellent binaural cue preservation in both anechoic and reverberant conditions.
Details
Motivation: To imitate human ability to selectively attend to a single speaker among multiple simultaneous talkers, without relying on speaker-specific information for better generalization.Method: Uses listener’s Head-Related Transfer Function (HRTF) with fully complex-valued neural network operating directly on complex-valued STFT of mixed audio signals, compared against Real-Imaginary based networks.
Result: Achieves excellent extraction performance while preserving binaural cues of target signal, robust in reverberant conditions, maintains speech clarity and directionality while reducing reverberation, performs on par with existing methods in noise reduction and perceptual quality.
Conclusion: The proposed HRTF-based approach provides speaker-independent target speaker extraction with superior binaural cue preservation, demonstrating strong generalization across datasets and languages without requiring speaker embeddings.
Abstract: In this work, we aim to imitate the human ability to selectively attend to a single speaker, even in the presence of multiple simultaneous talkers. To achieve this, we propose a novel approach for binaural target speaker extraction that leverages the listener’s Head-Related Transfer Function (HRTF) to isolate the desired speaker. Notably, our method does not rely on speaker embeddings, making it speaker-independent and enabling strong generalization across multiple speech datasets and languages. We employ a fully complex-valued neural network that operates directly on the complex-valued Short-Time Fourier Transform (STFT) of the mixed audio signals, and compare it to a Real-Imaginary (RI)-based neural network, demonstrating the advantages of the former. We first evaluate the method in an anechoic, noise-free scenario, achieving excellent extraction performance while preserving the binaural cues of the target signal. We then extend the evaluation to reverberant conditions. Our method proves robust, maintaining speech clarity and source directionality while simultaneously reducing reverberation. A comparative analysis with existing binaural Target Speaker Extraction (TSE) methods demonstrates that our approach attains performance on par with competing techniques in terms of noise reduction and perceptual quality, while offering a clear advantage in preserving binaural cues. Demo page: https://bi-ctse-hrtf.github.io
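The HRTF conditioning signal can be illustrated by rendering a dry source binaurally with the listener's head-related impulse responses (the time-domain counterpart of the HRTF). A minimal numpy sketch of that rendering step only, not of the complex-valued extraction network:

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    # Convolve a dry source with the listener's head-related impulse responses
    # to obtain the direction-dependent binaural cue used for conditioning.
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    n = max(len(left), len(right))
    left = np.pad(left, (0, n - len(left)))
    right = np.pad(right, (0, n - len(right)))
    return np.stack([left, right])  # (2, n) binaural signal
```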
eess.IV
[339] USEANet: Ultrasound-Specific Edge-Aware Multi-Branch Network for Lightweight Medical Image Segmentation
Jingyi Gao, Di Wu, Baha Ihnaini
Main category: eess.IV
TL;DR: USEANet is an ultrasound-specific edge-aware multi-branch network that achieves optimal performance-efficiency balance for ultrasound image segmentation with only 3.64M parameters and 0.79G FLOPs.
Details
Motivation: Ultrasound image segmentation faces unique challenges including speckle noise, low contrast, and ambiguous boundaries, while clinical deployment demands computationally efficient models.Method: Four key innovations: ultrasound-specific multi-branch processing, edge-aware attention mechanisms, hierarchical feature aggregation with adaptive weight learning, and ultrasound-aware decoder enhancement, built on an ultra-lightweight PVT-B0 backbone.
Result: Significantly outperforms existing methods across five ultrasound datasets with 67.01 IoU on BUSI dataset while maintaining exceptional computational efficiency.
Conclusion: USEANet provides superior segmentation accuracy with real-time clinical application suitability, representing substantial improvements over traditional approaches.
Abstract: Ultrasound image segmentation faces unique challenges including speckle noise, low contrast, and ambiguous boundaries, while clinical deployment demands computationally efficient models. We propose USEANet, an ultrasound-specific edge-aware multi-branch network that achieves optimal performance-efficiency balance through four key innovations: (1) ultrasound-specific multi-branch processing with specialized modules for noise reduction, edge enhancement, and contrast improvement; (2) edge-aware attention mechanisms that focus on boundary information with minimal computational overhead; (3) hierarchical feature aggregation with adaptive weight learning; and (4) ultrasound-aware decoder enhancement for optimal segmentation refinement. Built on an ultra-lightweight PVT-B0 backbone, USEANet significantly outperforms existing methods across five ultrasound datasets while using only 3.64M parameters and 0.79G FLOPs. Experimental results demonstrate superior segmentation accuracy with 67.01 IoU on BUSI dataset, representing substantial improvements over traditional approaches while maintaining exceptional computational efficiency suitable for real-time clinical applications. Code is available at https://github.com/chouheiwa/USEANet.
[340] WarpPINN-fibers: improved cardiac strain estimation from cine-MR with physics-informed neural networks
Felipe Álvarez Barrientos, Tomás Banduc, Isabeau Sirven, Francisco Sahli Costabal
Main category: eess.IV
TL;DR: WarpPINN-fibers is a physics-informed neural network that improves cardiac motion and strain analysis by incorporating fiber mechanics into deformation field prediction from cine MRI images.
Details
Motivation: Traditional cardiac strain analysis methods lack fiber mechanics modeling, limiting their accuracy in explaining cardiac function and pathology detection.Method: A neural network trained with three loss components: data-similarity between reference and warped images, tissue incompressibility regularizer, and fiber-stretch penalization using synthetic fibers.
Result: Outperforms previous WarpPINN model and alternative methods in landmark tracking and strain curve prediction on a 15-subject cine-MRI benchmark.
Conclusion: Enables more precise cardiac strain quantification with fiber-physiology consistent deformation fields using standard MRI, without requiring more sophisticated imaging.
Abstract: The contractile motion of the heart is strongly determined by the distribution of the fibers that constitute cardiac tissue. Strain analysis informed by fiber orientation makes it possible to describe several pathologies that are typically associated with impaired mechanics of the myocardium, such as cardiovascular disease. Several methods have been developed to estimate strain-derived metrics from traditional imaging techniques. However, the physical models underlying these methods do not include fiber mechanics, restricting their capacity to accurately explain cardiac function. In this work, we introduce WarpPINN-fibers, a physics-informed neural network framework to accurately obtain cardiac motion and strains enhanced by fiber information. We train our neural network to satisfy a hyper-elastic model and promote fiber contraction with the goal of predicting the deformation field of the heart from cine magnetic resonance images. For this purpose, we build a loss function composed of three terms: a data-similarity loss between the reference and the warped template images, a regularizer enforcing near-incompressibility of cardiac tissue and a fiber-stretch penalization that controls strain in the direction of synthetically produced fibers. We show that our neural network improves on the original WarpPINN model and effectively controls fiber stretch in a synthetic phantom experiment. Then, we demonstrate that WarpPINN-fibers outperforms alternative methodologies in landmark-tracking and strain curve prediction for a cine-MRI benchmark with a cohort of 15 healthy volunteers. We expect that our method will enable a more precise quantification of cardiac strains through accurate deformation fields that are consistent with fiber physiology, without requiring imaging techniques more sophisticated than MRI.
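The regularizer and fiber-penalty terms follow directly from the hyper-elastic setup: autograd gives the deformation gradient F = I + ∂u/∂x, det F penalizes volume change, and |Ff| measures stretch along the fiber direction f. A torch sketch under those definitions; `u_net` is a hypothetical displacement network, the image data-similarity term is omitted, and the exact penalty forms and weights are assumptions:

```python
import torch

def physics_losses(u_net, x, fibers, lam_inc=1.0, lam_fib=0.1):
    # x: (N, 3) material points; fibers: (N, 3) unit fiber directions (synthetic)
    x = x.requires_grad_(True)
    u = u_net(x)  # (N, 3) predicted displacement field
    # Deformation gradient F = I + du/dx, assembled row by row with autograd
    rows = [torch.autograd.grad(u[:, i].sum(), x, create_graph=True)[0]
            for i in range(3)]
    F = torch.stack(rows, dim=1) + torch.eye(3)
    J = torch.det(F)                                   # local volume ratio
    loss_inc = ((J - 1.0) ** 2).mean()                 # near-incompressibility
    stretch_sq = (torch.einsum('nij,nj->ni', F, fibers) ** 2).sum(-1)  # |F f|^2
    loss_fib = torch.relu(stretch_sq - 1.0).mean()     # penalize fiber stretch
    return lam_inc * loss_inc + lam_fib * loss_fib
```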
[341] Generalized User-Oriented Image Semantic Coding Empowered by Large Vision-Language Model
Sin-Yu Huang, Vincent W. S. Wong
Main category: eess.IV
TL;DR: Proposes a user-oriented image semantic coding framework that incorporates user intent via text queries, uses CLIP for generalization, and employs LLaVA for relevance evaluation, achieving better performance on unseen objects.
Details
Motivation: Existing semantic communication models don't account for user-specific intent and lack generalization to out-of-distribution images. Users often focus on specific regions of images based on their interests.Method: UO-ISC framework where user provides text query, transmitter extracts query-relevant features from images using CLIP model, receiver reconstructs images, and user-intent relevance loss is computed using LLaVA model.
Result: Outperforms state-of-the-art query-aware image semantic coding in zero-shot inference on unseen objects, measured by answer match rate.
Conclusion: The proposed framework successfully incorporates user intent into semantic coding and demonstrates strong generalization capabilities to unseen objects through large vision-language models.
Abstract: Semantic communication has shown outstanding performance in preserving the overall source information in wireless transmission. For semantically rich content such as images, human users are often interested in specific regions depending on their intent. Moreover, recent semantic coding models are mostly trained on specific datasets. However, real-world applications may involve images out of the distribution of training dataset, which makes generalization a crucial but largely unexplored problem. To incorporate user’s intent into semantic coding, in this paper, we propose a generalized user-oriented image semantic coding (UO-ISC) framework, where the user provides a text query indicating its intent. The transmitter extracts features from the source image which are relevant to the user’s query. The receiver reconstructs an image based on those features. To enhance the generalization ability, we integrate contrastive language image pre-training (CLIP) model, which is a pretrained large vision-language model (VLM), into our proposed UO-ISC framework. To evaluate the relevance between the reconstructed image and the user’s query, we introduce the user-intent relevance loss, which is computed by using a pretrained large VLM, large language-and-vision assistant (LLaVA) model. When performing zero-shot inference on unseen objects, simulation results show that the proposed UO-ISC framework outperforms the state-of-the-art query-aware image semantic coding in terms of the answer match rate.
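The relevance between a reconstructed image and the user's text query can be scored with an off-the-shelf CLIP model, which is the generalization backbone the framework integrates. A sketch using the Hugging Face transformers CLIP wrapper; the file name and query are placeholders, and note that the paper's user-intent relevance loss itself is computed with LLaVA rather than this cosine score:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("reconstructed.png")        # placeholder file name
query = "a red car in the foreground"          # placeholder user query
inputs = processor(text=[query], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
# Cosine similarity between image and query embeddings as a relevance score
relevance = torch.cosine_similarity(out.image_embeds, out.text_embeds).item()
```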
[342] Dynamic Structural Recovery Parameters Enhance Prediction of Visual Outcomes After Macular Hole Surgery
Yinzheng Zhao, Zhihao Zhao, Rundong Jiang, Louisa Sackewitz, Quanmin Liang, Mathias Maier, Daniel Zapp, Peter Charbel Issa, Mohammad Ali Nasseri
Main category: eess.IV
TL;DR: Novel dynamic structural parameters integrated in multimodal deep learning improve prediction of postoperative visual recovery in macular hole patients.
Details
Motivation: To enhance prediction accuracy for postoperative visual recovery in idiopathic full-thickness macular hole patients by incorporating dynamic structural parameters and multimodal deep learning.Method: Used longitudinal OCT dataset at 5 timepoints, developed stage-specific segmentation model, extracted quantitative/composite/qualitative/dynamic features, built binary logistic regression models with/without dynamic parameters, and created multimodal DL model combining clinical variables, OCT features, and raw images.
Result: Segmentation achieved high accuracy (mean Dice > 0.89), identified key predictors (base diameter, ellipsoid zone integrity, macular hole area), dynamic parameters improved logistic regression AUC, and multimodal DL outperformed regression with up to 0.12 higher AUC.
Conclusion: Integration of dynamic parameters into multimodal DL significantly enhances prediction accuracy, representing a promising automated clinical decision support tool for personalized postoperative management.
Abstract: Purpose: To introduce novel dynamic structural parameters and evaluate their integration within a multimodal deep learning (DL) framework for predicting postoperative visual recovery in idiopathic full-thickness macular hole (iFTMH) patients. Methods: We utilized a publicly available longitudinal OCT dataset at five stages (preoperative, 2 weeks, 3 months, 6 months, and 12 months). A stage specific segmentation model delineated related structures, and an automated pipeline extracted quantitative, composite, qualitative, and dynamic features. Binary logistic regression models, constructed with and without dynamic parameters, assessed their incremental predictive value for best-corrected visual acuity (BCVA). A multimodal DL model combining clinical variables, OCT-derived features, and raw OCT images was developed and benchmarked against regression models. Results: The segmentation model achieved high accuracy across all timepoints (mean Dice > 0.89). Univariate and multivariate analyses identified base diameter, ellipsoid zone integrity, and macular hole area as significant BCVA predictors (P < 0.05). Incorporating dynamic recovery rates consistently improved logistic regression AUC, especially at the 3-month follow-up. The multimodal DL model outperformed logistic regression, yielding higher AUCs and overall accuracy at each stage. The difference is as high as 0.12, demonstrating the complementary value of raw image volume and dynamic parameters. Conclusions: Integrating dynamic parameters into the multimodal DL model significantly enhances the accuracy of predictions. This fully automated process therefore represents a promising clinical decision support tool for personalized postoperative management in macular hole surgery.
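The with/without-dynamic-parameters comparison is an ordinary ablation over feature sets in a logistic regression. A sketch with scikit-learn, assuming `X_static` holds the baseline OCT-derived features and `X_dynamic` the recovery-rate features; the split and solver settings are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auc_ablation(X_static, X_dynamic, y):
    # Compare AUC for BCVA outcome prediction with and without dynamic features
    results = {}
    feature_sets = {"static": X_static,
                    "static+dynamic": np.hstack([X_static, X_dynamic])}
    for name, X in feature_sets.items():
        Xtr, Xte, ytr, yte = train_test_split(
            X, y, test_size=0.3, random_state=0, stratify=y)
        clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
        results[name] = roc_auc_score(yte, clf.predict_proba(Xte)[:, 1])
    return results
```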
[343] Virtual staining for 3D X-ray histology of bone implants
Sarah C. Irvine, Christian Lucas, Diana Krüger, Bianca Guedert, Julian Moosmann, Berit Zeller-Plumhoff
Main category: eess.IV
TL;DR: This paper introduces virtual staining to 3D X-ray imaging, using a modified CycleGAN to generate artificially stained histological images from micro-CT scans, enhancing interpretability without physical staining.
Details
Motivation: X-ray histology provides non-invasive 3D imaging but lacks biochemical specificity compared to traditional histological stains. The goal is to extend virtual staining techniques from optical to X-ray domain to improve tissue characterization.Method: Used over 50 co-registered micro-CT and toluidine blue-stained histology image pairs. Trained a modified CycleGAN network with pixelwise supervision and greyscale consistency terms, incorporating data augmentation for patch-based training.
Result: The method outperformed Pix2Pix and standard CycleGAN baselines across SSIM, PSNR, and LPIPS metrics. Successfully reproduced features like new bone formation, though some variability in implant degradation layers was noted.
Conclusion: This work successfully introduces virtual staining to 3D X-ray imaging, providing a scalable route for chemically informative, label-free tissue characterization in biomedical research, though additional training data and refinement are needed.
Abstract: Three-dimensional X-ray histology techniques offer a non-invasive alternative to conventional 2D histology, enabling volumetric imaging of biological tissues without the need for physical sectioning or chemical staining. However, the inherent greyscale image contrast of X-ray tomography limits its biochemical specificity compared to traditional histological stains. Within digital pathology, deep learning-based virtual staining has demonstrated utility in simulating stained appearances from label-free optical images. In this study, we extend virtual staining to the X-ray domain by applying cross-modality image translation to generate artificially stained slices from synchrotron-radiation-based micro-CT scans. Using over 50 co-registered image pairs of micro-CT and toluidine blue-stained histology from bone-implant samples, we trained a modified CycleGAN network tailored for limited paired data. Whole slide histology images were downsampled to match the voxel size of the CT data, with on-the-fly data augmentation for patch-based training. The model incorporates pixelwise supervision and greyscale consistency terms, producing histologically realistic colour outputs while preserving high-resolution structural detail. Our method outperformed Pix2Pix and standard CycleGAN baselines across SSIM, PSNR, and LPIPS metrics. Once trained, the model can be applied to full CT volumes to generate virtually stained 3D datasets, enhancing interpretability without additional sample preparation. While features such as new bone formation were reproduced, some variability in the depiction of implant degradation layers highlights the need for further training data and refinement. This work introduces virtual staining to 3D X-ray imaging and offers a scalable route for chemically informative, label-free tissue characterisation in biomedical research.
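The modified CycleGAN objective combines the usual cycle consistency with pixelwise supervision from the co-registered pairs and a greyscale-consistency term. A torch sketch of those three terms only, with the adversarial losses omitted; the generator names, weights, and the exact form of the greyscale term are assumptions:

```python
import torch

def staining_losses(G_ct2hist, G_hist2ct, ct, hist_paired,
                    lam_cyc=10.0, lam_pix=5.0, lam_gray=1.0):
    # ct: (N, 1, H, W) micro-CT patches; hist_paired: (N, 3, H, W) registered histology
    fake_hist = G_ct2hist(ct)
    rec_ct = G_hist2ct(fake_hist)
    loss_cyc = torch.mean(torch.abs(rec_ct - ct))              # cycle consistency
    loss_pix = torch.mean(torch.abs(fake_hist - hist_paired))  # pixelwise supervision
    # Greyscale consistency: luminance of the stained output should track CT contrast
    loss_gray = torch.mean(torch.abs(fake_hist.mean(dim=1, keepdim=True) - ct))
    return lam_cyc * loss_cyc + lam_pix * loss_pix + lam_gray * loss_gray
```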
[344] In-Loop Filtering Using Learned Look-Up Tables for Video Coding
Zhuoyuan Li, Jiacheng Li, Yao Li, Jialin Li, Li Li, Dong Liu, Feng Wu
Main category: eess.IV
TL;DR: Proposes LUT-ILF++ framework using lookup tables instead of DNNs for in-loop filtering, achieving significant bitrate reduction with lower complexity and storage costs.
Details
Motivation: Neural network-based in-loop filtering provides coding gains but has high computational complexity and hardware demands, making it impractical for general use.Method: Trains DNN with restricted reference range, caches outputs into LUTs, uses multiple filtering LUTs with customized indexing, cross-component indexing, and LUT compaction for storage efficiency.
Result: Achieves 0.82-4.11% bitrate reduction across configurations, with much lower time complexity and storage cost compared to DNN-based solutions.
Conclusion: LUT-based approach provides practical alternative to DNN-based ILF, maintaining coding gains while significantly reducing computational and storage requirements.
Abstract: In-loop filtering (ILF) is a key technology in video coding standards to reduce artifacts and enhance visual quality. Recently, neural network-based ILF schemes have achieved remarkable coding gains, emerging as a powerful candidate for next-generation video coding standards. However, the use of deep neural networks (DNN) brings significant computational and time complexity or high demands for dedicated hardware, making it challenging for general use. To address this limitation, we study a practical ILF solution by adopting look-up tables (LUTs). After training a DNN with a restricted reference range for ILF, all possible inputs are traversed, and the output values of the DNN are cached into LUTs. During the coding process, the filtering process is performed by simply retrieving the filtered pixel through locating the input pixels and interpolating between the cached values, instead of relying on heavy inference computations. In this paper, we propose a universal LUT-based ILF framework, termed LUT-ILF++. First, we introduce the cooperation of multiple kinds of filtering LUTs and propose a series of customized indexing mechanisms to enable better filtering reference perception with limited storage consumption. Second, we propose the cross-component indexing mechanism to enable the filtering of different color components jointly. Third, in order to make our solution practical for coding uses, we propose the LUT compaction scheme to enable the LUT pruning, achieving a lower storage cost of the entire solution. The proposed framework is implemented in the VVC reference software. Experimental results show that the proposed framework achieves on average 0.82%/2.97%/1.63% and 0.85%/4.11%/2.06% bitrate reduction for common test sequences, under the AI and RA configurations, respectively. Compared to DNN-based solutions, our proposed solution has much lower time complexity and storage cost.
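The cache-then-interpolate idea is easiest to see in one dimension: sample a trained filter at quantized anchor inputs, then answer queries by locating the bracketing anchors and interpolating between the cached outputs. A 1D numpy toy; the real LUT-ILF++ indexes several reference pixels jointly and adds cross-component indexing and LUT compaction on top of this:

```python
import numpy as np

def build_lut(filter_fn, bits=4):
    # Sample the trained filter at 2**bits quantized anchor inputs
    step = 256 // (1 << bits)
    anchors = np.arange(0, 257, step)
    table = np.array([filter_fn(min(a, 255)) for a in anchors], dtype=np.float64)
    return anchors, table

def lut_filter(pixels, anchors, table):
    # Locate bracketing anchors and linearly interpolate the cached outputs
    idx = np.clip(np.searchsorted(anchors, pixels, side='right') - 1,
                  0, len(anchors) - 2)
    lo, hi = anchors[idx], anchors[idx + 1]
    w = (pixels - lo) / (hi - lo)
    return (1 - w) * table[idx] + w * table[idx + 1]
```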
[345] A novel method and dataset for depth-guided image deblurring from smartphone Lidar
Antonio Montanaro, Diego Valsesia
Main category: eess.IV
TL;DR: A novel image deblurring method using Lidar depth guidance with diffusion models, plus a new real dataset from iPhone 15 Pro showing improved perceptual quality over state-of-the-art methods.
Details
Motivation: Address the lack of datasets with realistic blurred images paired with mobile Lidar depth maps, and the absence of blind zero-shot methods that can deblur real images using depth guidance without extensive training data.Method: Propose an image deblurring method based on denoising diffusion models that leverages Lidar depth guidance without requiring training data with paired Lidar depth maps. Also present the first real dataset with blurred images, Lidar depth maps, and sharp ground truth images acquired using Apple iPhone 15 Pro.
Result: Experimental results on the novel dataset demonstrate that Lidar guidance is effective and the proposed method outperforms state-of-the-art deblurring methods in terms of perceptual quality.
Conclusion: The combination of diffusion models with Lidar depth guidance provides an effective zero-shot deblurring approach that doesn’t require paired training data, while the new dataset enables further research in Lidar-guided image processing.
Abstract: Modern smartphones are equipped with Lidar sensors providing depth-sensing capabilities. Recent works have shown that this complementary sensor allows to improve various tasks in image processing, including deblurring. However, there is a current lack of datasets with realistic blurred images and paired mobile Lidar depth maps to further study the topic. At the same time, there is also a lack of blind zero-shot methods that can deblur a real image using the depth guidance without requiring extensive training sets of paired data. In this paper, we propose an image deblurring method based on denoising diffusion models that can leverage the Lidar depth guidance and does not require training data with paired Lidar depth maps. We also present the first dataset with real blurred images with corresponding Lidar depth maps and sharp ground truth images, acquired with an Apple iPhone 15 Pro, for the purpose of studying Lidar-guided deblurring. Experimental results on this novel dataset show that Lidar guidance is effective and the proposed method outperforms state-of-the-art deblurring methods in terms of perceptual quality.
[346] Towards Reliable Medical Image Segmentation by Modeling Evidential Calibrated Uncertainty
Ke Zou, Yidi Chen, Ling Huang, Xuedong Yuan, Xiaojing Shen, Meng Wang, Rick Siow Mong Goh, Yong Liu, Huazhu Fu
Main category: eess.IV
TL;DR: DEviS is a foundational model that enhances medical image segmentation by improving calibration, robustness, and providing uncertainty estimation using subjective logic theory and Dirichlet distribution.
Details
Motivation: Clinicians lack confidence in medical image segmentation due to absence of confidence assessment, robustness, and calibration to accuracy, which DEviS aims to address.Method: Leverages subjective logic theory to model probability and uncertainty, parameterizes class probabilities with Dirichlet distribution, uses trainable calibrated uncertainty penalty, and incorporates uncertainty-aware filtering module with uncertainty-calibrated error metric.
Result: Validated on multiple public datasets (ISIC2018, KiTS2021, LiTS2017, BraTS2019) showing improved accuracy, robustness, and reliable uncertainty estimation across different backbone segmentation models.
Conclusion: DEviS provides an easily implementable solution that enhances medical image segmentation reliability through better calibration, robustness, and efficient uncertainty estimation.
Abstract: Medical image segmentation is critical for disease diagnosis and treatment assessment. However, concerns regarding the reliability of segmentation regions persist among clinicians, mainly attributed to the absence of confidence assessment, robustness, and calibration to accuracy. To address this, we introduce DEviS, an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks. DEviS not only enhances the calibration and robustness of baseline segmentation accuracy but also provides high-efficiency uncertainty estimation for reliable predictions. By leveraging subjective logic theory, we explicitly model probability and uncertainty for medical image segmentation. Here, the Dirichlet distribution parameterizes the distribution of probabilities for different classes of the segmentation results. To generate calibrated predictions and uncertainty, we develop a trainable calibrated uncertainty penalty. Furthermore, DEviS incorporates an uncertainty-aware filtering module, which designs the metric of uncertainty-calibrated error to filter out-of-distribution data. We conducted validation studies on publicly available datasets, including ISIC2018, KiTS2021, LiTS2017, and BraTS2019, to assess the accuracy and robustness of different backbone segmentation models enhanced by DEviS, as well as the efficiency and reliability of uncertainty estimation.
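Under subjective logic, a segmentation head emits non-negative evidence that parameterizes a Dirichlet over class probabilities, and the vacuity K/S serves as the uncertainty map. A torch sketch of that standard mapping; the trainable calibrated penalty and the filtering module are not shown:

```python
import torch
import torch.nn.functional as F

def dirichlet_uncertainty(logits):
    # logits: (N, K, H, W) raw segmentation head outputs
    evidence = F.softplus(logits)          # non-negative per-class evidence
    alpha = evidence + 1.0                 # Dirichlet concentration parameters
    S = alpha.sum(dim=1, keepdim=True)     # Dirichlet strength
    prob = alpha / S                       # expected class probabilities
    uncertainty = logits.shape[1] / S      # vacuity: high where evidence is scarce
    return prob, uncertainty
```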
[347] ABS-Mamba: SAM2-Driven Bidirectional Spiral Mamba Network for Medical Image Translation
Feng Yuan, Yifan Gao, Wenbin Wu, Keqing Wu, Xiaotong Guo, Jie Jiang, Xin Gao
Main category: eess.IV
TL;DR: ABS-Mamba is a novel medical image translation framework that combines SAM2 for organ-aware semantics, specialized CNNs for local details, and Mamba’s state-space modeling for efficient feature dependencies, achieving state-of-the-art performance on medical datasets.
Details
Motivation: Medical image translation faces challenges in harmonizing global anatomical semantics with local structural fidelity, complicated by intermodality information loss and structural distortion that can affect diagnostic accuracy.Method: Dual-resolution framework using SAM2’s encoder for organ-scale semantics, parallel CNNs for local features, Robust Feature Fusion Network (RFFN) for integration, Bidirectional Mamba Residual Network (BMRN) for spatial dependencies, and three-stage skip fusion decoder with Efficient LoRA+ fine-tuning.
Result: Outperforms state-of-the-art methods on SynthRAD2023 and BraTS2019 datasets, delivering high-fidelity cross-modal synthesis that preserves anatomical semantics and structural details.
Conclusion: ABS-Mamba enhances diagnostic accuracy in clinical applications by effectively preserving both global anatomical semantics and local structural details in medical image translation.
Abstract: Accurate multi-modal medical image translation requires harmonizing global anatomical semantics and local structural fidelity, a challenge complicated by intermodality information loss and structural distortion. We propose ABS-Mamba, a novel architecture integrating the Segment Anything Model 2 (SAM2) for organ-aware semantic representation, specialized convolutional neural networks (CNNs) for preserving modality-specific edge and texture details, and Mamba’s selective state-space modeling for efficient long- and short-range feature dependencies. Structurally, our dual-resolution framework leverages SAM2’s image encoder to capture organ-scale semantics from high-resolution inputs, while a parallel CNNs branch extracts fine-grained local features. The Robust Feature Fusion Network (RFFN) integrates these representations, and the Bidirectional Mamba Residual Network (BMRN) models spatial dependencies using spiral scanning and bidirectional state-space dynamics. A three-stage skip fusion decoder enhances edge and texture fidelity. We employ Efficient Low-Rank Adaptation (LoRA+) fine-tuning to enable precise domain specialization while maintaining the foundational capabilities of the pre-trained components. Extensive experimental validation on the SynthRAD2023 and BraTS2019 datasets demonstrates that ABS-Mamba outperforms state-of-the-art methods, delivering high-fidelity cross-modal synthesis that preserves anatomical semantics and structural details to enhance diagnostic accuracy in clinical applications. The code is available at https://github.com/gatina-yone/ABS-Mamba
[348] Spec2VolCAMU-Net: A Spectrogram-to-Volume Model for EEG-to-fMRI Reconstruction based on Multi-directional Time-Frequency Convolutional Attention Encoder and Vision-Mamba U-Net
Dongyi He, Shiyang Li, Bin Jiang, He Yan
Main category: eess.IV
TL;DR: Spec2VolCAMU-Net: A lightweight EEG-to-fMRI generator using multi-directional time-frequency attention and vision-mamba U-Net for high-fidelity brain activity mapping from EEG signals.
Details
Motivation: High-resolution fMRI is costly and logistically challenging, while EEG is widely available. Existing EEG-to-fMRI generators have limitations in capturing cross-channel time-frequency cues or are computationally heavy and unstable.Method: Proposed Spec2VolCAMU-Net with Multi-directional Time-Frequency Convolutional Attention Encoder for feature extraction and Vision-Mamba U-Net decoder with linear-time state-space blocks for efficient spatial modeling. Trained end-to-end with hybrid SSI-MSE loss.
Result: Achieved state-of-the-art SSIM scores: 0.693 on NODDI (14.5% improvement), 0.725 on Oddball (14.9% improvement), 0.788 on CN-EPFL (16.9% improvement). Also achieved competitive PSNR scores with 4.6% improvement on CN-EPFL.
Conclusion: The model provides lightweight, efficient, and high-fidelity EEG-to-fMRI conversion, making advanced neuroimaging more accessible for clinical and research applications with real-time potential.
Abstract: High-resolution functional magnetic resonance imaging (fMRI) is essential for mapping human brain activity; however, it remains costly and logistically challenging. If comparable volumes could be generated directly from widely available scalp electroencephalography (EEG), advanced neuroimaging would become significantly more accessible. Existing EEG-to-fMRI generators rely on plain Convolutional Neural Networks (CNNs) that fail to capture cross-channel time-frequency cues or on heavy transformer/Generative Adversarial Network (GAN) decoders that strain memory and stability. To address these limitations, we propose Spec2VolCAMU-Net, a lightweight architecture featuring a Multi-directional Time-Frequency Convolutional Attention Encoder for rich feature extraction and a Vision-Mamba U-Net decoder that uses linear-time state-space blocks for efficient long-range spatial modelling. We frame the goal of this work as establishing a new state of the art in the spatial fidelity of single-volume reconstruction, a foundational prerequisite for the ultimate aim of generating temporally coherent fMRI time series. Trained end-to-end with a hybrid SSI-MSE loss, Spec2VolCAMU-Net achieves state-of-the-art fidelity on three public benchmarks, recording Structural Similarity Index (SSIM) of 0.693 on NODDI, 0.725 on Oddball and 0.788 on CN-EPFL, representing improvements of 14.5%, 14.9%, and 16.9% respectively over previous best SSIM scores. Furthermore, it achieves competitive Peak Signal-to-Noise Ratio (PSNR) scores, particularly excelling on the CN-EPFL dataset with a 4.6% improvement over the previous best PSNR, thus striking a better balance in reconstruction quality. The proposed model is lightweight and efficient, making it suitable for real-time applications in clinical and research settings. The code is available at https://github.com/hdy6438/Spec2VolCAMU-Net.
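A hybrid SSI-MSE objective blends a structural-similarity term with mean squared error. A torch sketch using a global (non-windowed) SSIM for brevity; both that simplification and the blending weight `alpha` are assumptions, not the paper's exact loss:

```python
import torch

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Global (non-windowed) SSIM over whole volumes, kept simple for illustration
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def hybrid_ssi_mse(pred, target, alpha=0.5):
    # Blend a structural-similarity term with MSE
    return alpha * (1 - ssim_global(pred, target)) + \
           (1 - alpha) * torch.mean((pred - target) ** 2)
```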
[349] C3VDv2 – Colonoscopy 3D video dataset with enhanced realism
Mayank V. Golhar, Lucas Sebastian Galeano Fretes, Loren Ayers, Venkata S. Akshintala, Taylor L. Bobrow, Nicholas J. Durr
Main category: eess.IV
TL;DR: C3VDv2 is an enhanced 3D colonoscopy video dataset with 192 sequences and comprehensive ground truth data for robust evaluation of 3D reconstruction algorithms in challenging colonoscopy scenarios.
Details
Motivation: The lack of 3D colonoscopy datasets hinders the development of spatial computer vision techniques for improved colonoscopy diagnostics.Method: Created a high-fidelity dataset by imaging 60 unique silicone colon phantom segments, providing 169,371 frames with ground truth depth, surface normals, optical flow, occlusion maps, 6-DOF pose, coverage maps, and 3D models.
Result: Produced C3VDv2 dataset with 192 video sequences, including 8 simulated screening colonoscopy videos with ground truth poses and 15 videos with colon deformations for qualitative assessment.
Conclusion: The enhanced realism of C3VDv2 enables more robust development and evaluation of 3D reconstruction algorithms for colonoscopy applications, addressing diverse challenging scenarios like fecal debris, mucous pools, and fast camera motion.
Abstract: Spatial computer vision techniques have the potential to improve the diagnostic performance of colonoscopy. However, the lack of 3D colonoscopy datasets for training and validation hinders their development. This paper introduces C3VDv2, the second version (v2) of the high-definition Colonoscopy 3D Video Dataset, featuring enhanced realism designed to facilitate the quantitative evaluation of 3D colon reconstruction algorithms. 192 video sequences totaling 169,371 frames were captured by imaging 60 unique, high-fidelity silicone colon phantom segments. Ground truth depth, surface normals, optical flow, occlusion, diffuse maps, six-degree-of-freedom pose, coverage map, and 3D models are provided for 169 colonoscopy videos. Eight simulated screening colonoscopy videos acquired by a gastroenterologist are provided with ground truth poses. Lastly, the dataset includes 15 videos with colon deformations for qualitative assessment. C3VDv2 emulates diverse and challenging scenarios for 3D reconstruction algorithms, including fecal debris, mucous pools, blood, debris obscuring the colonoscope lens, en-face views, and fast camera motion. The enhanced realism of C3VDv2 will allow for more robust and representative development and evaluation of 3D reconstruction algorithms. Project Page - https://durrlab.github.io/C3VDv2/
[350] SV-DRR: High-Fidelity Novel View X-Ray Synthesis Using Diffusion Model
Chun Xie, Yuichi Yoshii, Itaru Kitahara
Main category: eess.IV
TL;DR: Novel view-conditioned diffusion model using Diffusion Transformer to synthesize multi-view X-ray images from single view, enabling high-resolution generation with better angular control while reducing radiation exposure.
Details
Motivation: Multi-view X-ray imaging provides better diagnostic information but increases radiation exposure and complicates clinical workflows. Need for method to generate multiple views from single acquisition.Method: View-conditioned diffusion model leveraging Diffusion Transformer architecture with weak-to-strong training strategy for stable high-resolution image generation.
Result: Method generates higher-resolution outputs with improved control over viewing angles compared to prior limited approaches.
Conclusion: Significant implications for clinical applications, medical education, and data extension - enables creation of diverse, high-quality datasets for training and analysis while reducing radiation exposure.
Abstract: X-ray imaging is a rapid and cost-effective tool for visualizing internal human anatomy. While multi-view X-ray imaging provides complementary information that enhances diagnosis, intervention, and education, acquiring images from multiple angles increases radiation exposure and complicates clinical workflows. To address these challenges, we propose a novel view-conditioned diffusion model for synthesizing multi-view X-ray images from a single view. Unlike prior methods, which are limited in angular range, resolution, and image quality, our approach leverages the Diffusion Transformer to preserve fine details and employs a weak-to-strong training strategy for stable high-resolution image generation. Experimental results demonstrate that our method generates higher-resolution outputs with improved control over viewing angles. This capability has significant implications not only for clinical applications but also for medical education and data extension, enabling the creation of diverse, high-quality datasets for training and analysis. Our code is available at GitHub.
[351] Scaling Artificial Intelligence for Prostate Cancer Detection on MRI towards Organized Screening and Primary Diagnosis in a Global, Multiethnic Population (Study Protocol)
Anindo Saha, Joeran S. Bosma, Jasper J. Twilt, Alexander B. C. D. Ng, Aqua Asif, Kirti Magudia, Peder Larson, Qinglin Xie, Xiaodong Zhang, Chi Pham Minh, Samuel N. Gitau, Ivo G. Schoots, Martijn F. Boomsma, Renato Cuocolo, Nikolaos Papanikolaou, Daniele Regge, Derya Yakar, Mattijs Elschot, Jeroen Veltman, Baris Turkbey, Nancy A. Obuchowski, Jurgen J. Fütterer, Anwar R. Padhani, Hashim U. Ahmed, Tobias Nordström, Martin Eklund, Veeru Kasivisvanathan, Maarten de Rooij, Henkjan Huisman
Main category: eess.IV
TL;DR: Large-scale international study validates PI-CAI-2B AI model for detecting clinically significant prostate cancer (Gleason grade group ≥2) on MRI across diverse global populations and clinical settings.
Details
Motivation: To develop and validate an efficient, next-generation AI system for prostate cancer detection that can perform comparably to the standard of care across different healthcare settings and diverse patient populations worldwide.Method: Retrospective cohort study of 22,481 MRI exams from 21,288 patients across 46 cities in 22 countries. Used 20,471 cases for training/internal testing and 2,010 external cases for validation. Primary endpoint was agreement with standard of care diagnoses (expert uropathologists or consensus radiologists). Statistical analysis with prespecified hypothesis of diagnostic interchangeability using PI-RADS cut-offs.
Result: The study demonstrates comprehensive external validation across multiple international cohorts including population-based screening trials and primary diagnostic settings from Europe, North/South America, Asia, Africa, and Australia.
Conclusion: The PI-CAI-2B model represents a validated, efficient AI system for prostate cancer detection that shows potential for diagnostic interchangeability with standard of care across diverse global populations and clinical scenarios.
Abstract: In this intercontinental, confirmatory study, we include a retrospective cohort of 22,481 MRI examinations (21,288 patients; 46 cities in 22 countries) to train and externally validate the PI-CAI-2B model, i.e., an efficient, next-generation iteration of the state-of-the-art AI system that was developed for detecting Gleason grade group $\geq$2 prostate cancer on MRI during the PI-CAI study. Of these examinations, 20,471 cases (19,278 patients; 26 cities in 14 countries) from two EU Horizon projects (ProCAncer-I, COMFORT) and 12 independent centers based in Europe, North America, Asia and Africa, are used for training and internal testing. Additionally, 2010 cases (2010 patients; 20 external cities in 12 countries) from population-based screening (STHLM3-MRI, IP1-PROSTAGRAM trials) and primary diagnostic settings (PRIME trial) based in Europe, North and South Americas, Asia and Australia, are used for external testing. Primary endpoint is the proportion of AI-based assessments in agreement with the standard of care diagnoses (i.e., clinical assessments made by expert uropathologists on histopathology, if available, or at least two expert urogenital radiologists in consensus; with access to patient history and peer consultation) in the detection of Gleason grade group $\geq$2 prostate cancer within the external testing cohorts. Our statistical analysis plan is prespecified with a hypothesis of diagnostic interchangeability to the standard of care at the PI-RADS $\geq$3 (primary diagnosis) or $\geq$4 (screening) cut-off, considering an absolute margin of 0.05 and reader estimates derived from the PI-CAI observer study (62 radiologists reading 400 cases). Secondary measures comprise the area under the receiver operating characteristic curve (AUROC) of the AI system stratified by imaging quality, patient age and patient ethnicity to identify underlying biases (if any).
[352] Preprocessing Algorithm Leveraging Geometric Modeling for Scale Correction in Hyperspectral Images for Improved Unmixing Performance
Praveen Sumanasekara, Athulya Ratnayake, Buddhi Wijenayake, Keshawa Ratnayake, Roshan Godaliyadda, Parakrama Ekanayake, Vijitha Herath
Main category: eess.IV
TL;DR: A novel preprocessing algorithm that corrects scale-induced spectral variability in hyperspectral data, improving unmixing accuracy by reducing scale distortions before applying existing unmixing methods.
Details
Motivation: Spectral variability caused by topography, illumination, and shadowing significantly degrades hyperspectral unmixing performance. Large-scale distortions to pixel signatures remain a major challenge that complicates model fitting and reduces accuracy in real-world GIS applications.Method: Proposes a preprocessing algorithm that estimates and corrects scale-induced distortions to pixel signatures prior to unmixing. Uses a rigorous mathematical framework to describe and correct for scale variability, producing pixel signatures with minimal scale distortions.
Result: The algorithm consistently improves performance across state-of-the-art unmixing methods on synthetic and real datasets, achieving error reductions of around 50%. Even algorithms specifically designed for spectral variability benefit significantly, demonstrating scale correction as a complementary step.
Conclusion: The proposed preprocessing step acts as a key component in hyperspectral unmixing pipelines, offering generality, consistent impact, and significant performance improvements. It facilitates more accurate unmixing with existing methods and shows potential for practical GIS applications.
Abstract: Spectral variability significantly impacts the accuracy and convergence of hyperspectral unmixing algorithms. Many methods address complex spectral variability; yet large-scale distortions to the scale of the observed pixel signatures due to topography, illumination, and shadowing remain a major challenge. These variations often degrade unmixing performance and complicate model fitting. Because of this, correcting these variations can offer significant advantages in real-world GIS applications. In this paper, we propose a novel preprocessing algorithm that corrects scale-induced spectral variability prior to unmixing. By estimating and correcting these distortions to the scale of the pixel signatures, the algorithm produces pixel signatures with minimal distortions in scale. Since these distortions in scale (which hinder the performance of many unmixing methods) are greatly minimized in the output provided by the proposed method, the abundance estimation of the unmixing algorithms is significantly improved. We present a rigorous mathematical framework to describe and correct for scale variability and provide extensive experimental validation of the proposed algorithm. Furthermore, the algorithm’s impact is evaluated across a wide range of state-of-the-art unmixing methods on two synthetic and two real hyperspectral datasets. The proposed preprocessing step consistently improves the performance of these algorithms, achieving error reductions of around 50%, even for algorithms specifically designed to handle spectral variability. This demonstrates that scale correction acts as a complementary step, facilitating more accurate unmixing with existing methods. The algorithm’s generality, consistent impact, and significant influence highlight its potential as a key component in practical hyperspectral unmixing pipelines. The implementation code will be made publicly available upon publication.
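Scale-induced variability is multiplicative: topography, illumination, and shadowing rescale a pixel's signature without changing its direction in spectral space. A simplified numpy sketch that removes per-pixel scale by renormalizing signatures to a common scene-level norm; the paper's geometric estimation of the scale factors is more elaborate than this stand-in:

```python
import numpy as np

def correct_scale(hsi):
    # hsi: (H, W, B) cube; scale variability acts multiplicatively on signatures
    flat = hsi.reshape(-1, hsi.shape[-1]).astype(np.float64)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    target = np.median(norms)                 # common scene-level scale
    corrected = flat / np.maximum(norms, 1e-12) * target
    return corrected.reshape(hsi.shape)
```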
[353] Expert-Guided Explainable Few-Shot Learning for Medical Image Diagnosis
Ifrat Ikhtear Uddin, Longwei Wang, KC Santosh
Main category: eess.IV
TL;DR: Expert-guided explainable few-shot learning framework that integrates radiologist ROIs with Grad-CAM supervision to improve both classification accuracy and interpretability in medical image analysis.
Details
Motivation: Medical image analysis faces challenges due to limited expert-annotated data, which hinders model generalization and clinical adoption. There's a need to bridge the gap between performance and interpretability.Method: Proposes a framework that integrates radiologist-provided ROIs using Grad-CAM for spatial attention supervision. Introduces explanation loss based on Dice similarity to align model attention with diagnostic regions, jointly optimized with prototypical network objective.
Result: Achieved significant accuracy improvements: from 77.09% to 83.61% on BraTS (MRI) and from 54.33% to 73.29% on VinDr-CXR (Chest X-ray) compared to non-guided models. Grad-CAM visualizations confirm better attention alignment with diagnostic regions.
Conclusion: The framework effectively incorporates expert-guided attention supervision to bridge performance and interpretability gaps in few-shot medical image diagnosis, improving both predictive reliability and clinical trustworthiness.
Abstract: Medical image analysis often faces significant challenges due to limited expert-annotated data, hindering both model generalization and clinical adoption. We propose an expert-guided explainable few-shot learning framework that integrates radiologist-provided regions of interest (ROIs) into model training to simultaneously enhance classification performance and interpretability. Leveraging Grad-CAM for spatial attention supervision, we introduce an explanation loss based on Dice similarity to align model attention with diagnostically relevant regions during training. This explanation loss is jointly optimized with a standard prototypical network objective, encouraging the model to focus on clinically meaningful features even under limited data conditions. We evaluate our framework on two distinct datasets: BraTS (MRI) and VinDr-CXR (Chest X-ray), achieving significant accuracy improvements from 77.09% to 83.61% on BraTS and from 54.33% to 73.29% on VinDr-CXR compared to non-guided models. Grad-CAM visualizations further confirm that expert-guided training consistently aligns attention with diagnostic regions, improving both predictive reliability and clinical trustworthiness. Our findings demonstrate the effectiveness of incorporating expert-guided attention supervision to bridge the gap between performance and interpretability in few-shot medical image diagnosis.
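The explanation loss is a soft Dice term between the model's Grad-CAM map and the radiologist's ROI mask, optimized jointly with the prototypical-network objective. A torch sketch under that reading; the weighting `lam` is a hypothetical hyperparameter:

```python
import torch

def explanation_loss(cam, roi, eps=1e-6):
    # cam: (N, H, W) Grad-CAM maps scaled to [0, 1]; roi: (N, H, W) binary expert masks
    inter = (cam * roi).sum(dim=(1, 2))
    dice = (2 * inter + eps) / (cam.sum(dim=(1, 2)) + roi.sum(dim=(1, 2)) + eps)
    return 1.0 - dice.mean()

# joint objective: total = proto_loss + lam * explanation_loss(cam, roi)
```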