Daily arXiv Papers - 2025-08-04

Summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems

Oshayer Siddique, J. M Areeb Uzair Alam, Md Jobayer Rahman Rafy, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan

Main category: cs.CL

TL;DR: The paper evaluates frontier LLMs in solving physics problems, introduces a multi-agent framework for improved performance, and presents a new benchmark, ${\rm P{\small HYSICS}E{\small VAL}}$, with 19,609 problems.

Motivation: To assess and enhance LLMs' capability in solving physics problems, both mathematical and descriptive, using advanced techniques.

Method: Uses inference-time techniques, multi-agent frameworks for solution verification, and comparative analysis of performance improvements.

Result: Significant improvements on problems the models initially performed poorly on when the multi-agent framework is applied.

Conclusion: The study advances LLM performance in physics problem-solving and introduces a comprehensive benchmark for future research.

Abstract: The discipline of physics stands as a cornerstone of human intellect, driving the evolution of technology and deepening our understanding of the fundamental principles of the cosmos. Contemporary literature includes some works centered on the task of solving physics problems - a crucial domain of natural language reasoning. In this paper, we evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We also employ a plethora of inference-time techniques and agentic frameworks to improve the performance of the models. This includes the verification of proposed solutions in a cumulative fashion by other, smaller LLM agents, and we perform a comparative analysis of the performance that the techniques entail. There are significant improvements when the multi-agent framework is applied to problems that the models initially perform poorly on. Furthermore, we introduce a new evaluation benchmark for physics problems, ${\rm P{\small HYSICS}E{\small VAL}}$, consisting of 19,609 problems sourced from various physics textbooks and their corresponding correct solutions scraped from physics forums and educational websites. Our code and data are publicly available at https://github.com/areebuzair/PhysicsEval.
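To make the cumulative-verification idea concrete, here is a minimal sketch (not the authors' implementation) in which smaller verifier agents review a proposed solution in sequence, each seeing earlier feedback; `generate` is a hypothetical stand-in for any LLM completion call, and the model names and prompts are illustrative.

```python
# Minimal sketch of cumulative solution verification by smaller LLM agents.
# `generate` is a hypothetical placeholder for a real chat-completions call;
# model names and prompts are illustrative, not the paper's configuration.

def generate(model: str, prompt: str) -> str:
    # Placeholder: wire this to an actual LLM client.
    return "OK"

def solve_with_verification(problem: str,
                            solver: str = "frontier-llm",
                            verifiers: tuple = ("small-llm-a", "small-llm-b")) -> str:
    solution = generate(solver, f"Solve this physics problem step by step:\n{problem}")
    feedback = []
    # Cumulative verification: each smaller agent sees all earlier reviews,
    # so later agents refine rather than repeat earlier criticism.
    for verifier in verifiers:
        notes = "\n".join(feedback)
        review = generate(verifier,
                          f"Problem:\n{problem}\nProposed solution:\n{solution}\n"
                          f"Earlier reviews:\n{notes}\n"
                          "List any physics or algebra errors, or reply OK.")
        feedback.append(review)
    if any(review.strip() != "OK" for review in feedback):
        solution = generate(solver,
                            f"Revise your solution to:\n{problem}\n"
                            f"Previous attempt:\n{solution}\n"
                            "Reviews:\n" + "\n".join(feedback))
    return solution
```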

[2] Do LLMs produce texts with “human-like” lexical diversity?

Kelly Kendro, Jeffrey Maloney, Scott Jarvis

Main category: cs.CL

TL;DR: LLM-generated texts (ChatGPT models) differ significantly from human-written texts in lexical diversity, with newer models being less human-like. Human writers’ lexical diversity is consistent across subgroups.

Motivation: To assess how human-like LLM-generated texts are in terms of lexical diversity, comparing various ChatGPT models with human writers.

Method: Analyzed lexical diversity in texts from four ChatGPT models and 240 human participants (L1/L2 English) across six dimensions using MANOVAs, ANOVAs, and SVMs.

Result: LLM texts differed significantly from human texts, with newer models (ChatGPT-o4 mini, -4.5) showing the most divergence. Human diversity was consistent across education and language status.

Conclusion: LLMs do not produce human-like lexical diversity, and newer models are less human-like. Implications for language pedagogy are discussed.

Abstract: The degree to which LLMs produce writing that is truly human-like remains unclear despite the extensive empirical attention that this question has received. The present study addresses this question from the perspective of lexical diversity. Specifically, the study investigates patterns of lexical diversity in LLM-generated texts from four ChatGPT models (-3.5, -4, -o4 mini, and -4.5) in comparison with texts written by L1 and L2 English participants (n = 240) across four education levels. Six dimensions of lexical diversity were measured in each text: volume, abundance, variety-repetition, evenness, disparity, and dispersion. Results from one-way MANOVAs, one-way ANOVAs, and Support Vector Machines revealed that the LLM-generated texts differed significantly from human-written texts for each variable, with ChatGPT-o4 mini and -4.5 differing the most. Within these two groups, ChatGPT-4.5 demonstrated higher levels of lexical diversity despite producing fewer tokens. The human writers’ lexical diversity did not differ across subgroups (i.e., education, language status). Altogether, the results indicate that LLMs do not produce human-like texts in relation to lexical diversity, and the newer LLMs produce less human-like texts than older models. We discuss the implications of these results for language pedagogy and related applications.

[3] Semiotic Complexity and Its Epistemological Implications for Modeling Culture

Zachary K. Stine, James E. Deitrick

Main category: cs.CL

TL;DR: The paper argues for greater theorizing of methods in computational humanities, framing modeling as translation between cultural and computational domains to avoid errors and ensure clarity.

Motivation: The need for epistemological and interpretive clarity in computational humanities to mature the field.

Method: Framing modeling as translation work and introducing the concept of semiotic complexity to highlight errors in current practices.

Result: Identifies a translation error where semiotically complex data is treated as simple, leading to superficial clarity.

Conclusion: Provides recommendations for researchers to better account for epistemological issues in their work.

Abstract: Greater theorizing of methods in the computational humanities is needed for epistemological and interpretive clarity, and therefore the maturation of the field. In this paper, we frame such modeling work as engaging in translation work from a cultural, linguistic domain into a computational, mathematical domain, and back again. Translators benefit from articulating the theory of their translation process, and so do computational humanists in their work – to ensure internal consistency, avoid subtle yet consequential translation errors, and facilitate interpretive transparency. Our contribution in this paper is to lay out a particularly consequential dimension of the lack of theorizing and the sorts of translation errors that emerge in our modeling practices as a result. Along these lines we introduce the idea of semiotic complexity as the degree to which the meaning of some text may vary across interpretive lenses, and make the case that dominant modeling practices – especially around evaluation – commit a translation error by treating semiotically complex data as semiotically simple when it seems epistemologically convenient by conferring superficial clarity. We then lay out several recommendations for researchers to better account for these epistemological issues in their own work.

[4] FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality

Mingda Chen, Yang Li, Xilun Chen, Adina Williams, Gargi Ghosh, Scott Yih

Main category: cs.CL

TL;DR: FACTORY is a human-verified benchmark for evaluating long-form factuality in language models, revealing that roughly 40% of claims from SOTA models are not factual.

Motivation: Existing benchmarks lack human verification, leading to unreliable evaluations of model factuality.

Method: Developed FACTORY using a model-in-the-loop approach and human refinement, focusing on fact-seeking, answerable, and unambiguous prompts.

Result: 40% of claims by SOTA models were non-factual on FACTORY, compared to 10% on other datasets.

Conclusion: FACTORY is a reliable, challenging benchmark, highlighting the need for models to handle long-tailed facts.

Abstract: Long-form factuality evaluation assesses the ability of models to generate accurate, comprehensive responses to short prompts. Existing benchmarks often lack human verification, leading to potential quality issues. To address this limitation, we introduce FACTORY, a large-scale, human-verified prompt set. Developed using a model-in-the-loop approach and refined by humans, FACTORY includes challenging prompts that are fact-seeking, answerable, and unambiguous. We conduct human evaluations on 6 state-of-the-art language models using FACTORY and existing datasets. Our results show that FACTORY is a challenging benchmark: approximately 40% of the claims made in the responses of SOTA models are not factual, compared to only 10% for other datasets. Our analysis identifies the strengths of FACTORY over prior benchmarks, emphasizing its reliability and the necessity for models to reason across long-tailed facts.

[5] Is neural semantic parsing good at ellipsis resolution, or isn’t it?

Xiao Zhang, Johan Bos

Main category: cs.CL

TL;DR: Neural semantic parsers excel in general tasks but struggle with context-sensitive phenomena like verb phrase ellipsis, despite high overall performance.

Motivation: To evaluate neural semantic parsers' ability to handle context-sensitive linguistic phenomena, specifically verb phrase ellipsis.

Method: Constructed a corpus of 120 ellipsis cases with resolved meaning representations and tested neural parsers on this challenge set.

Result: Parsers performed poorly on ellipsis cases despite high scores on standard tests.

Conclusion: Neural semantic parsers lack robustness for context-sensitive phenomena like ellipsis, highlighting a need for improvement.

Abstract: Neural semantic parsers have shown good overall performance for a variety of linguistic phenomena, reaching semantic matching scores of more than 90%. But how do such parsers perform on strongly context-sensitive phenomena, where large pieces of semantic information need to be duplicated to form a meaningful semantic representation? A case in point is English verb phrase ellipsis, a construct where entire verb phrases can be abbreviated by a single auxiliary verb. Are these otherwise powerful semantic parsers able to deal with ellipsis, or aren’t they? We constructed a corpus of 120 cases of ellipsis with their fully resolved meaning representation and used this as a challenge set for a large battery of neural semantic parsers. Although these parsers performed very well on the standard test set, they failed on the instances with ellipsis. Data augmentation…

[6] Comparison of Large Language Models for Deployment Requirements

Alper Yaman, Jannik Schwab, Christof Nitsche, Abhirup Sinha, Marco Huber

Main category: cs.CL

TL;DR: A comparative list of foundational and domain-specific LLMs is provided to help researchers and companies select models based on features like release year, licensing, and hardware requirements.

Motivation: The rapid evolution of LLMs and the complexity in selecting the optimal model due to licensing and hardware constraints necessitate a clear, updated guide.

Method: The paper compiles and compares foundational and domain-specific LLMs, focusing on features such as release year, licensing, and hardware requirements.

Result: A comparative list of LLMs is created and published on GitLab, with plans for continuous updates.

Conclusion: The resource aids in navigating the LLM landscape, simplifying model selection for researchers and companies.

Abstract: Large Language Models (LLMs), such as Generative Pre-trained Transformers (GPTs), are revolutionizing the generation of human-like text, producing contextually relevant and syntactically correct content. Despite challenges like biases and hallucinations, these Artificial Intelligence (AI) models excel in tasks such as content creation, translation, and code generation. Fine-tuning and novel architectures, such as Mixture of Experts (MoE), address these issues. Over the past two years, numerous open-source foundational and fine-tuned models have been introduced, complicating the selection of the optimal LLM for researchers and companies with respect to licensing and hardware requirements. To navigate the rapidly evolving LLM landscape and facilitate LLM selection, we present a comparative list of foundational and domain-specific models, focusing on features such as release year, licensing, and hardware requirements. This list is published on GitLab and will be continuously updated.

[7] Tabular Data Understanding with LLMs: A Survey of Recent Advances and Challenges

Xiaofeng Wu, Alan Ritter, Wei Xu

Main category: cs.CL

TL;DR: The paper discusses the challenges in table understanding tasks for LLMs and MLLMs, proposing a taxonomy of tabular representations and identifying research gaps like limited reasoning, complex table processing, and generalization issues.

Motivation: Tables’ complex and diverse structures in LLMs/MLLMs require specialized methods, but current approaches lack universality, prompting the need for a systematic taxonomy and addressing research gaps.

Method: Introduces a taxonomy of tabular input representations and categorizes table understanding tasks to navigate challenges.

Result: Identifies key gaps: retrieval-focused tasks with minimal reasoning, difficulties with complex/large tables, and poor generalization across formats.

Conclusion: The paper calls for further research to address the identified gaps in table understanding for LLMs/MLLMs.

Abstract: Tables have gained significant attention in large language models (LLMs) and multimodal large language models (MLLMs) due to their complex and flexible structure. Unlike linear text inputs, tables are two-dimensional, encompassing formats that range from well-structured database tables to complex, multi-layered spreadsheets, each with different purposes. This diversity in format and purpose has led to the development of specialized methods and tasks, instead of universal approaches, making navigation of table understanding tasks challenging. To address these challenges, this paper introduces key concepts through a taxonomy of tabular input representations and an introduction to table understanding tasks. We highlight several critical gaps in the field that indicate the need for further research: (1) the predominance of retrieval-focused tasks that require minimal reasoning beyond mathematical and logical operations; (2) significant challenges faced by models when processing complex table structures, large-scale tables, long contexts, or multi-table scenarios; and (3) the limited generalization of models across different tabular representations and formats.

[8] Semantic Compression for Word and Sentence Embeddings using Discrete Wavelet Transform

Rana Aref Salama, Abdou Youssef, Mona Diab

Main category: cs.CL

TL;DR: The paper explores using Discrete Wavelet Transforms (DWT) on word and sentence embeddings to analyze and compress them while preserving semantic quality, showing significant dimensionality reduction with minimal performance loss.

Motivation: Wavelet transforms are effective in other domains, and this paper aims to extend their application to NLP for analyzing and compressing embeddings without losing semantic information.

Method: Empirical application of DWT to word and sentence embeddings, evaluating its effectiveness on semantic similarity tasks and downstream tasks using various embedding models.

Result: DWT reduces embedding dimensionality by 50-93% with negligible performance loss in semantic similarity tasks and often improves accuracy in downstream tasks.

Conclusion: DWT is a promising tool for enhancing NLP applications by efficiently compressing and analyzing embeddings.

Abstract: Wavelet transforms, a powerful mathematical tool, have been widely used in different domains, including signal and image processing, to unravel intricate patterns, enhance data representation, and extract meaningful features from data. Tangible results from their application suggest that wavelet transforms can be applied to NLP, capturing a variety of linguistic and semantic properties. In this paper, we empirically leverage the application of Discrete Wavelet Transforms (DWT) to word and sentence embeddings. We aim to showcase the capabilities of DWT in analyzing embedding representations at different levels of resolution and compressing them while maintaining their overall quality. We assess the effectiveness of DWT embeddings on semantic similarity tasks to show how DWT can be used to consolidate important semantic information in an embedding vector. We show the efficacy of the proposed paradigm using different embedding models, including large language models, on downstream tasks. Our results show that DWT can reduce the dimensionality of embeddings by 50-93% with almost no change in performance for semantic similarity tasks, while achieving superior accuracy in most downstream tasks. Our findings pave the way for applying DWT to improve NLP applications.
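For readers who want to try the idea, the compression step is easy to reproduce with the PyWavelets package. The sketch below keeps only the multi-level approximation coefficients of an embedding; the wavelet family and decomposition level are assumptions for illustration, not necessarily the paper's settings.

```python
# Compressing an embedding by keeping only DWT approximation coefficients.
# Requires PyWavelets (pip install PyWavelets). Wavelet and level are
# illustrative choices, not necessarily those used in the paper.
import numpy as np
import pywt

def dwt_compress(embedding: np.ndarray, wavelet: str = "haar", level: int = 1) -> np.ndarray:
    """Return the level-`level` approximation coefficients of `embedding`.

    Each level roughly halves the dimensionality: level=1 gives ~50%
    reduction, level=4 about 94%, matching the 50-93% range reported.
    """
    coeffs = pywt.wavedec(embedding, wavelet, level=level)
    return coeffs[0]  # coeffs[0] is the coarsest approximation; the rest are details

rng = np.random.default_rng(0)
emb = rng.standard_normal(768)           # e.g., a 768-d sentence embedding
print(dwt_compress(emb, level=1).shape)  # (384,) -> 50% smaller
print(dwt_compress(emb, level=4).shape)  # (48,)  -> ~94% smaller
```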

[9] Model Misalignment and Language Change: Traces of AI-Associated Language in Unscripted Spoken English

Bryce Anderson, Riley Galpin, Tom S. Juzek

Main category: cs.CL

TL;DR: The paper investigates lexical shifts in language due to LLMs, analyzing spoken language data to find a post-2022 increase in LLM-associated words, suggesting AI-driven language change.

Motivation: To determine if lexical shifts in language are due to broader human language changes or AI influence, particularly post-ChatGPT release.

Method: Constructed a dataset of 22.1 million words from unscripted spoken language in science/tech podcasts, comparing pre- and post-2022 usage of LLM-associated words.

Result: Found a moderate but significant increase in LLM-associated words post-2022, indicating convergence with human word choices, while baseline synonyms showed no shift.

Conclusion: The findings suggest AI may be driving language change, raising ethical concerns about misaligned models influencing human language and beliefs.

Abstract: In recent years, written language, particularly in science and education, has undergone remarkable shifts in word usage. These changes are widely attributed to the growing influence of Large Language Models (LLMs), which frequently rely on a distinct lexical style. Divergences between model output and target audience norms can be viewed as a form of misalignment. While these shifts are often linked to using Artificial Intelligence (AI) directly as a tool to generate text, it remains unclear whether the changes reflect broader changes in the human language system itself. To explore this question, we constructed a dataset of 22.1 million words from unscripted spoken language drawn from conversational science and technology podcasts. We analyzed lexical trends before and after ChatGPT’s release in 2022, focusing on commonly LLM-associated words. Our results show a moderate yet significant increase in the usage of these words post-2022, suggesting a convergence between human word choices and LLM-associated patterns. In contrast, baseline synonym words exhibit no significant directional shift. Given the short time frame and the number of words affected, this may indicate the onset of a remarkable shift in language use. Whether this represents natural language change or a novel shift driven by AI exposure remains an open question. Similarly, although the shifts may stem from broader adoption patterns, it may also be that upstream training misalignments ultimately contribute to changes in human language use. These findings parallel ethical concerns that misaligned models may shape social and moral beliefs.
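The core measurement is a simple rate comparison. As a toy illustration (the word list and token samples below are hypothetical, not the study's, and the study's significance testing is omitted), one can compute per-million usage rates of target words before and after the cutoff:

```python
# Toy sketch of the pre/post-2022 comparison: per-million usage rates of
# LLM-associated words in two transcript pools. The word list and the tiny
# token lists are hypothetical; the study uses 22.1M words plus statistics.
from collections import Counter

LLM_ASSOCIATED = {"delve", "intricate", "pivotal"}  # illustrative word list

def per_million(tokens, targets):
    counts = Counter(t.lower() for t in tokens)
    hits = sum(counts[w] for w in targets)
    return 1e6 * hits / max(len(tokens), 1)

pre_2022 = "we delve briefly into the new chip design trade offs".split()
post_2022 = "let us delve into the intricate and pivotal chip design".split()
print(per_million(pre_2022, LLM_ASSOCIATED))   # rate before the cutoff
print(per_million(post_2022, LLM_ASSOCIATED))  # higher rate after the cutoff
```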

[10] ReaGAN: Node-as-Agent-Reasoning Graph Agentic Network

Minghao Guo, Xi Zhu, Jingyuan Huang, Kai Mei, Yongfeng Zhang

Main category: cs.CL

TL;DR: ReaGAN introduces an agent-based framework for GNNs, enabling nodes to autonomously plan actions and retrieve global semantic relationships, addressing limitations of fixed aggregation schemes.

Motivation: Fixed aggregation in GNNs fails to handle node informativeness imbalance and ignores global semantic relationships, limiting performance.

Method: ReaGAN uses agent-based nodes with autonomous decision-making and retrieval-augmented generation (RAG) to access global semantic content.

Result: ReaGAN achieves competitive performance in few-shot settings using a frozen LLM backbone, without fine-tuning.

Conclusion: Agentic planning and local-global retrieval enhance graph learning, as demonstrated by ReaGAN.

Abstract: Graph Neural Networks (GNNs) have achieved remarkable success in graph-based learning by propagating information among neighbor nodes via predefined aggregation mechanisms. However, such fixed schemes often suffer from two key limitations. First, they cannot handle the imbalance in node informativeness – some nodes are rich in information, while others remain sparse. Second, predefined message passing primarily leverages local structural similarity while ignoring global semantic relationships across the graph, limiting the model’s ability to capture distant but relevant information. We propose Retrieval-augmented Graph Agentic Network (ReaGAN), an agent-based framework that empowers each node with autonomous, node-level decision-making. Each node acts as an agent that independently plans its next action based on its internal memory, enabling node-level planning and adaptive message propagation. Additionally, retrieval-augmented generation (RAG) allows nodes to access semantically relevant content and build global relationships in the graph. ReaGAN achieves competitive performance under few-shot in-context settings using a frozen LLM backbone without fine-tuning, showcasing the potential of agentic planning and local-global retrieval in graph learning.

[11] Integrating clinical reasoning into large language model-based diagnosis through etiology-aware attention steering

Peixian Li, Yu Tian, Ruiqi Tu, Chengkai Wu, Jingjing Ren, Jingsong Li

Main category: cs.CL

TL;DR: The study introduces an Etiology-Aware Attention Steering Framework to improve LLMs’ diagnostic accuracy and clinical reasoning in acute abdominal emergencies, achieving significant performance boosts.

Motivation: LLMs show promise in medical tasks but lack reliability in complex clinical diagnoses, prompting the need for enhanced reasoning and accuracy.

Method: The framework integrates structured clinical reasoning via Clinical Reasoning Scaffolding, identifies key attention heads, and uses parameter-efficient fine-tuning with a Reasoning-Guided Loss.

Result: The framework improves diagnostic accuracy by 15.65% and reasoning focus by 31.6%, with external validation confirming its effectiveness.

Conclusion: The approach enhances LLM-based diagnosis by aligning attention with clinical reasoning, offering a reliable and interpretable AI diagnostic paradigm.

Abstract: Objective: Large Language Models (LLMs) demonstrate significant capabilities in medical text understanding and generation. However, their diagnostic reliability in complex clinical scenarios remains limited. This study aims to enhance LLMs’ diagnostic accuracy and clinical reasoning ability. Method: We propose an Etiology-Aware Attention Steering Framework to integrate structured clinical reasoning into LLM-based diagnosis. Specifically, we first construct Clinical Reasoning Scaffolding (CRS) based on authoritative clinical guidelines for three representative acute abdominal emergencies: acute appendicitis, acute pancreatitis, and acute cholecystitis. Next, we develop the Etiology-Aware Head Identification algorithm to pinpoint attention heads crucial for the model’s etiology reasoning. To ensure reliable clinical reasoning alignment, we introduce the Reasoning-Guided Parameter-Efficient Fine-tuning that embeds etiological reasoning cues into input representations and steers the selected Etiology-Aware Heads toward critical information through a Reasoning-Guided Loss function. Result: On the Consistent Diagnosis Cohort, our framework improves average diagnostic accuracy by 15.65% and boosts the average Reasoning Focus Score by 31.6% over baselines. External validation on the Discrepant Diagnosis Cohort further confirms its effectiveness in enhancing diagnostic accuracy. Further assessments via Reasoning Attention Frequency indicate that our models exhibit enhanced reliability when faced with real-world complex scenarios. Conclusion: This study presents a practical and effective approach to enhance clinical reasoning in LLM-based diagnosis. By aligning model attention with structured CRS, the proposed framework offers a promising paradigm for building more interpretable and reliable AI diagnostic systems in complex clinical settings.

[12] Systematic Evaluation of Optimization Techniques for Long-Context Language Models

Ammar Ahmed, Sheng Di, Franck Cappello, Zirui Liu, Jingoo Han, Ali Anwar

Main category: cs.CL

TL;DR: The paper benchmarks optimization techniques for LLMs, revealing adverse effects of naive combinations on larger models and hidden trade-offs in performance metrics.

Motivation: Address the resource demands and limited context windows of LLMs by exploring the efficacy of optimization techniques in long-context scenarios.

Method: Systematically benchmarks pruning, quantization, and token dropping, analyzing memory usage, latency, throughput, and text generation quality. Evaluates individual and combined optimizations on two LLM architectures and a 70B-parameter model.

Result: Naive combination of optimizations harms larger models due to compounded errors. Sole reliance on F1 masks precision-recall trade-offs in QA tasks.

Conclusion: Integrating system-level profiling with task-specific insights helps balance efficiency, accuracy, and scalability for LLMs.

Abstract: Large language models (LLMs) excel across diverse natural language processing tasks but face resource demands and limited context windows. Although techniques like pruning, quantization, and token dropping can mitigate these issues, their efficacy in long-context scenarios and system evaluation remains underexplored. This paper systematically benchmarks these optimizations, characterizing memory usage, latency, and throughput, and studies how these methods impact the quality of text generation. We first analyze individual optimization methods for two LLM architectures supporting long context and then systematically evaluate combinations of these techniques to assess how this deeper analysis impacts performance metrics. We subsequently study the scalability of individual optimization methods on a larger, 70-billion-parameter model variant. Our novel insights reveal that naively combining inference optimization algorithms can adversely affect larger models due to compounded approximation errors, as compared to their smaller counterparts. Experiments show that relying solely on F1 obscures these effects by hiding precision-recall trade-offs in question answering tasks. By integrating system-level profiling with task-specific insights, this study helps LLM practitioners and researchers explore and balance efficiency, accuracy, and scalability across tasks and hardware configurations.
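The F1 caveat is worth making concrete. In a toy calculation (numbers invented, not from the paper), two systems with opposite failure modes report an identical F1:

```python
# Toy illustration of how F1 can hide a precision-recall trade-off.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# Hypothetical system A: precise but misses answer tokens (e.g., over-pruned).
# Hypothetical system B: recalls tokens but pads answers with noise.
print(f1(0.90, 0.50))  # ~0.643
print(f1(0.50, 0.90))  # ~0.643, identical F1 with the opposite failure mode
```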

[13] Improving Multimodal Contrastive Learning of Sentence Embeddings with Object-Phrase Alignment

Kaiyan Zhao, Zhongtao Miao, Yoshimasa Tsuruoka

Main category: cs.CL

TL;DR: MCSEO improves multimodal sentence embeddings by aligning fine-grained object-phrase pairs, outperforming baselines on STS tasks.

Motivation: Noisy image-caption pairs in multimodal training data can degrade embedding quality, necessitating better alignment methods.

Method: MCSEO uses segmentation and object detection to extract object-phrase pairs, optimizing a contrastive learning objective for alignment.

Result: MCSEO consistently outperforms baselines on semantic textual similarity tasks.

Conclusion: Precise object-phrase alignment is crucial for effective multimodal representation learning.

Abstract: Multimodal sentence embedding models typically leverage image-caption pairs in addition to textual data during training. However, such pairs often contain noise, including redundant or irrelevant information on either the image or caption side. To mitigate this issue, we propose MCSEO, a method that enhances multimodal sentence embeddings by incorporating fine-grained object-phrase alignment alongside traditional image-caption alignment. Specifically, MCSEO utilizes existing segmentation and object detection models to extract accurate object-phrase pairs, which are then used to optimize a contrastive learning objective tailored to object-phrase correspondence. Experimental results on semantic textual similarity (STS) tasks across different backbone models demonstrate that MCSEO consistently outperforms strong baselines, highlighting the significance of precise object-phrase alignment in multimodal representation learning.
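The added training signal is a standard contrastive objective over matched object-phrase pairs. A minimal sketch, assuming the object crops and phrases are already encoded; the temperature and symmetric form are illustrative rather than MCSEO's exact loss:

```python
# InfoNCE-style loss over matched object-phrase embedding pairs. Encoders,
# temperature, and the symmetric form are assumptions for illustration.
import torch
import torch.nn.functional as F

def object_phrase_contrastive(obj_emb: torch.Tensor,
                              phrase_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """obj_emb, phrase_emb: (N, d); row i of each side is a matched pair."""
    obj = F.normalize(obj_emb, dim=-1)
    phr = F.normalize(phrase_emb, dim=-1)
    logits = obj @ phr.t() / temperature      # (N, N) cosine similarities
    targets = torch.arange(obj.size(0))       # positives on the diagonal
    # Symmetric: align object -> phrase and phrase -> object.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = object_phrase_contrastive(torch.randn(8, 256), torch.randn(8, 256))
```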

[14] PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning

Keer Lu, Chong Chen, Bin Cui, Huang Leng, Wentao Zhang

Main category: cs.CL

TL;DR: The paper introduces AdaPlan and PilotRL to enhance LLM agents’ long-term planning and execution coordination, outperforming GPT-4o by 3.60%.

Motivation: Existing LLM agent paradigms like ReAct lack effectiveness in complex tasks due to limited long-term planning and coordination issues. Supervised fine-tuning also restricts generalization.

Method: Proposes AdaPlan for adaptive global planning and PilotRL, a training framework using progressive reinforcement learning to improve planning and execution.

Result: PilotRL achieves state-of-the-art performance, with LLaMA3.1-8B-Instruct + PilotRL outperforming GPT-4o by 3.60% and GPT-4o-mini by 55.78%.

Conclusion: AdaPlan and PilotRL effectively address long-term planning and coordination challenges in LLM agents, demonstrating superior performance over existing models.

Abstract: Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks. Despite their potential, existing work faces challenges when deploying LLMs in agent-based environments. The widely adopted agent paradigm ReAct centers on integrating single-step reasoning with immediate action execution, which limits its effectiveness in complex tasks requiring long-term strategic planning. Furthermore, the coordination between the planner and executor during problem-solving is also a critical factor to consider in agent design. Additionally, current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task completion trajectories, thereby restricting their generalization ability when confronted with novel problem contexts. To address these challenges, we introduce an adaptive global plan-based agent paradigm, AdaPlan, aiming to synergize high-level explicit guidance with execution to support effective long-horizon decision-making. Based on the proposed paradigm, we further put forward PilotRL, a global planning-guided training framework for LLM agents driven by progressive reinforcement learning. We first develop the model’s ability to follow explicit guidance from global plans when addressing agent tasks. Subsequently, based on this foundation, we focus on optimizing the quality of generated plans. Finally, we conduct joint optimization of the model’s planning and execution coordination. Experiments indicate that PilotRL achieves state-of-the-art performance, with LLaMA3.1-8B-Instruct + PilotRL surpassing closed-source GPT-4o by 3.60%, while showing a more substantial gain of 55.78% compared to GPT-4o-mini at a comparable parameter scale.

[15] Lucy: edgerunning agentic web search on mobile with machine generated task vectors

Alan Dao, Dinh Bach Vu, Alex Nguyen, Norapat Buppodom

Main category: cs.CL

TL;DR: Small language models (SLMs) can match larger models by using dynamic task vector reasoning, achieving 78.3% accuracy on SimpleQA.

Motivation: SLMs struggle with knowledge-intensive tasks due to limited capacity. This work aims to enhance their performance by treating reasoning as a dynamic, self-refining process.

Method: Proposes a dynamic task vector machine, interpreting model reasoning between tags as a mechanism to construct and refine task vectors. Uses RLVR for optimization and integrates MCP.

Result: Lucy, a 1.7B-parameter SLM, achieves 78.3% accuracy on SimpleQA, rivaling larger models like DeepSeek-V3.

Conclusion: Structured, self-constructed task reasoning enables small models to compete with larger ones.

Abstract: Small language models (SLMs) are inherently limited in knowledge-intensive tasks due to their constrained capacity. While test-time computation offers a path to enhanced performance, most approaches treat reasoning as a fixed or heuristic process. In this work, we propose a new paradigm: viewing the model’s internal reasoning, delimited by <think> and </think> tags, as a dynamic task vector machine. Rather than treating the content inside these tags as a mere trace of thought, we interpret the generation process itself as a mechanism through which the model \textbf{constructs and refines its own task vectors} on the fly. We developed a method to optimize this dynamic task vector machine through RLVR and successfully trained an agentic web-search model. We present Lucy, a 1.7B-parameter SLM that leverages this dynamic reasoning mechanism with MCP integration to achieve 78.3% accuracy on the SimpleQA benchmark, performing on par with much larger models such as DeepSeek-V3. This demonstrates that small models can rival large ones when equipped with structured, self-constructed task reasoning.

[16] EdgeInfinite-Instruct: Bridging SFT-Based Optimization and NPU-Level Efficiency for Edge Devices

Jiyu Chen, Poh Seng Lim, Shuang Peng, Daxiong Luo, JungHau Foo, Yap Deep, Timothy Lee Jun Jie, Kelvin Teh Kae Wen, Fan Yang, Danyu Feng, Hao-Yun Chen, Peng-Wen Chen, Fangyuan Li, Xiaoxin Chen, Wong Wai Mun

Main category: cs.CL

TL;DR: EdgeInfinite-Instruct optimizes Transformer-based LLMs for edge devices by fine-tuning with S-SFT, PTQ, and fixed-shape computation, improving efficiency and performance.

Motivation: Challenges in deploying LLMs on edge devices include high computational costs, memory demands, and poor TTFT. Existing solutions either degrade performance or lack edge-specific optimizations.

Method: Proposes EdgeInfinite-Instruct with S-SFT for fine-tuning, PTQ for quantization, and fixed-shape computation for memory and efficiency balance.

Result: Improves performance on long-sequence tasks and maintains efficiency on NPU-accelerated edge devices.

Conclusion: EdgeInfinite-Instruct effectively addresses deployment challenges for LLMs on edge devices while maintaining quality and efficiency.

Abstract: Deploying Transformer-based large language models (LLMs) on resource-constrained edge devices for long-sequence tasks remains challenging due to the quadratic time complexity of self-attention and growing Key-Value (KV) cache demands. While existing KV cache optimizations improve memory efficiency, they often fail to reduce time to first token (TTFT) and may degrade performance through token pruning. Alternative sequence modeling architectures address some of these limitations, but typically require full retraining and lack infrastructure support. EdgeInfinite offers an efficient solution by fine-tuning only a small subset of parameters, maintaining quality while reducing both computational and memory costs, including improved TTFT. However, its instruction-following ability is limited, and it lacks mobile-specific optimizations. To address these issues, we propose EdgeInfinite-Instruct, which introduces a Segmented Supervised Fine-Tuning (S-SFT) strategy tailored to long-sequence tasks such as summarization and question answering. We further optimized EdgeInfinite-Instruct for efficient deployment on edge NPUs by employing fine-grained post-training quantization (PTQ) to reduce computational demands while maintaining accuracy, and by implementing a fixed-shape computation graph that balances memory usage and on-device efficiency through scenario-specific customization of input token and cache sizes. Experiments on long-context benchmarks and real-world mobile tasks show that our approach improves domain-specific performance while maintaining efficiency on NPU-accelerated edge devices.

[17] Multi-Layer Attention is the Amplifier of Demonstration Effectiveness

Dingzirui Wang, Xuangliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng

Main category: cs.CL

TL;DR: The paper investigates why some demonstrations in in-context learning (ICL) are ineffective, proposing a gradient-based method (GradS) to select effective demonstrations, validated on multiple LLMs and datasets.

Motivation: Existing work assumes all demonstrations in ICL are effective, but many are not. This paper explores the reasons behind ineffective demonstrations.

Method: Analyzes demonstration ineffectiveness using gradient flow and linear self-attention models. Proposes GradS, a gradient-based method for selecting effective demonstrations.

Result: The disparity in demonstration effectiveness is amplified across model layers. GradS improves performance by 6.8% over baselines.

Conclusion: GradS effectively selects demonstrations by considering both relevance and model-assimilated information, validated by experiments.

Abstract: Numerous studies have investigated the underlying mechanisms of in-context learning (ICL) effectiveness to inspire the design of related methods. However, existing work predominantly assumes the effectiveness of the demonstrations provided within ICL, while much research indicates that not all demonstrations are effective, failing to yield any performance improvement during ICL. Therefore, in this paper, we investigate the reasons behind demonstration ineffectiveness. Our analysis is based on gradient flow and linear self-attention models. By setting the gradient flow to zero, we deduce that a demonstration becomes ineffective if its information has either been learned by the model or is irrelevant to the user query. Furthermore, we demonstrate that in multi-layer models, the disparity in effectiveness among demonstrations is amplified as the number of layers increases, causing the model to focus more on effective ones. Considering that current demonstration selection methods primarily focus on the relevance to the user query while overlooking the information that the model has already assimilated, we propose a novel method called GradS, which leverages gradient flow for demonstration selection. We use the magnitude of the gradient flow of the demonstration with respect to a given user query as the criterion, thereby ensuring the effectiveness of the chosen ones. We validate our derivation and GradS on four prominent LLMs across five mainstream datasets. The experimental results confirm that the disparity in effectiveness among demonstrations is magnified as the model layer increases, substantiating our derivations. Moreover, GradS achieves a relative improvement of 6.8% on average over the strongest baselines, demonstrating its effectiveness.
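A rough reading of the selection criterion can be sketched as follows: score each candidate demonstration by the gradient magnitude it induces for the given query, then keep the top scorers. The loss choice and whole-model gradient norm below are simplifying assumptions, not the paper's exact recipe.

```python
# Schematic gradient-based demonstration scoring in the spirit of GradS.
# Assumes a Hugging Face causal LM; the LM loss and whole-model gradient
# norm are illustrative simplifications of the paper's criterion.
import torch

def grad_score(model, tokenizer, demonstration: str, query: str) -> float:
    inputs = tokenizer(demonstration + "\n" + query, return_tensors="pt")
    model.zero_grad()
    out = model(**inputs, labels=inputs["input_ids"])  # next-token loss
    out.loss.backward()
    # Near-zero gradient suggests the demonstration is already assimilated
    # or irrelevant to the query; larger norms suggest useful new signal.
    sq = sum(p.grad.pow(2).sum().item() for p in model.parameters()
             if p.grad is not None)
    return sq ** 0.5

def select_demonstrations(model, tokenizer, candidates, query, k=4):
    return sorted(candidates,
                  key=lambda d: grad_score(model, tokenizer, d, query),
                  reverse=True)[:k]
```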

[18] SA-GCS: Semantic-Aware Gaussian Curriculum Scheduling for UAV Vision-Language Navigation

Hengxing Cai, Jinhan Dong, Yijie Rao, Jingcheng Deng, Jingjun Tan, Qien Chen, Haidong Wang, Zhen Wang, Shiyu Huang, Agachai Sumalee, Renxin Zhong

Main category: cs.CL

TL;DR: SA-GCS is a novel training framework for UAV VLN that integrates Curriculum Learning into RL, improving efficiency, convergence, and performance by dynamically adjusting task difficulty.

Motivation: Existing RL methods for UAV VLN suffer from inefficient data use, slow convergence, and inadequate handling of varying task difficulty, limiting performance.

Method: Proposes SA-GCS, combining a Semantic-Aware Difficulty Estimator (SA-DE) to quantify sample complexity and a Gaussian Curriculum Scheduler (GCS) to adjust sampling distribution.

Result: SA-GCS outperforms baselines on the CityNav benchmark, showing faster convergence, better performance, and scalability across model sizes.

Conclusion: SA-GCS is a robust and scalable solution for UAV VLN, enhancing training efficiency and model generalization.

Abstract: Unmanned Aerial Vehicle (UAV) Vision-Language Navigation (VLN) aims to enable agents to accurately localize targets and plan flight paths in complex environments based on natural language instructions, with broad applications in intelligent inspection, disaster rescue, and urban monitoring. Recent progress in Vision-Language Models (VLMs) has provided strong semantic understanding for this task, while reinforcement learning (RL) has emerged as a promising post-training strategy to further improve generalization. However, existing RL methods often suffer from inefficient use of training data, slow convergence, and insufficient consideration of the difficulty variation among training samples, which limits further performance improvement. To address these challenges, we propose \textbf{Semantic-Aware Gaussian Curriculum Scheduling (SA-GCS)}, a novel training framework that systematically integrates Curriculum Learning (CL) into RL. SA-GCS employs a Semantic-Aware Difficulty Estimator (SA-DE) to quantify the complexity of training samples and a Gaussian Curriculum Scheduler (GCS) to dynamically adjust the sampling distribution, enabling a smooth progression from easy to challenging tasks. This design significantly improves training efficiency, accelerates convergence, and enhances overall model performance. Extensive experiments on the CityNav benchmark demonstrate that SA-GCS consistently outperforms strong baselines across all metrics, achieves faster and more stable convergence, and generalizes well across models of different scales, highlighting its robustness and scalability. The implementation of our approach is publicly available.
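The scheduling idea itself is compact: sampling weights over difficulty follow a Gaussian whose mean drifts from easy to hard as training progresses. A minimal sketch follows; the difficulty scores would come from the semantic-aware estimator, and the drift and width here are illustrative.

```python
# Gaussian curriculum sampling: the distribution over sample difficulty is a
# Gaussian whose mean moves from easy (0) to hard (1) with training progress.
import numpy as np

def curriculum_probs(difficulties: np.ndarray, progress: float,
                     sigma: float = 0.15) -> np.ndarray:
    """difficulties in [0, 1]; progress is the completed training fraction."""
    weights = np.exp(-0.5 * ((difficulties - progress) / sigma) ** 2)
    return weights / weights.sum()

rng = np.random.default_rng(0)
difficulty = rng.uniform(size=1000)            # stand-in for SA-DE scores
early = curriculum_probs(difficulty, 0.1)      # early training favors easy tasks
late = curriculum_probs(difficulty, 0.9)       # late training favors hard tasks
batch_idx = rng.choice(1000, size=32, p=late)  # difficulty-biased batch
```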

[19] Combining Discrete Wavelet and Cosine Transforms for Efficient Sentence Embedding

Rana Salama, Abdou Youssef, Mona Diab

Main category: cs.CL

TL;DR: The paper explores using Discrete Wavelet Transforms (DWT) and Discrete Cosine Transform (DCT) to compress word and sentence embeddings, showing improved efficiency and performance in NLP tasks.

Motivation: Wavelets have proven effective in image and signal processing, suggesting potential for NLP tasks to capture linguistic properties efficiently.

Method: Apply DWT to word embeddings for dimensionality reduction and combine DWT with DCT for sentence compression into fixed-size vectors.

Result: The proposed method yields comparable or superior results to original embeddings in downstream NLP tasks.

Conclusion: Wavelet-based compression is a viable and effective approach for enhancing NLP embeddings.

Abstract: Wavelets have emerged as a cutting-edge technology in a number of fields. Concrete results of their application in image and signal processing suggest that wavelets can be effectively applied to Natural Language Processing (NLP) tasks to capture a variety of linguistic properties. In this paper, we leverage the power of applying Discrete Wavelet Transforms (DWT) to word and sentence embeddings. We first evaluate, intrinsically and extrinsically, how wavelets can effectively be used to consolidate important information in a word vector while reducing its dimensionality. We further combine DWT with the Discrete Cosine Transform (DCT) to propose a non-parameterized model that compresses a sentence with a dense amount of information in a fixed-size vector based on locally varying word features. We show the efficacy of the proposed paradigm on downstream application models, yielding comparable and even superior (in some tasks) results to original embeddings.
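A rough sketch of the two-stage pipeline using PyWavelets and SciPy: DWT compresses each word vector, then a DCT along the word axis summarizes a variable-length sentence into a fixed number of coefficients. The wavelet choice and coefficient count are assumptions, and sentences shorter than k words would need padding.

```python
# DWT-then-DCT sentence compression sketch. Requires PyWavelets and SciPy.
# Wavelet and k are illustrative; sentences with fewer than k words would
# need padding for the output size to stay fixed.
import numpy as np
import pywt
from scipy.fft import dct

def sentence_vector(word_embs: np.ndarray, wavelet: str = "haar", k: int = 4) -> np.ndarray:
    """word_embs: (num_words, dim) -> fixed-size (k * dim // 2,) vector."""
    # 1) DWT each word vector, keeping approximation coefficients (halves dim).
    compressed = np.stack([pywt.dwt(w, wavelet)[0] for w in word_embs])
    # 2) DCT along the word axis; the first k coefficients summarize the
    #    sequence regardless of its length.
    return dct(compressed, axis=0, norm="ortho")[:k].reshape(-1)

emb = np.random.default_rng(0).standard_normal((12, 300))  # 12 words, 300-d
print(sentence_vector(emb).shape)  # (600,) for any sentence length >= 4
```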

[20] Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges

Yuqi Tang, Kehua Feng, Yunfeng Wang, Zhiwen Chen, Chengfei Lv, Gang Yu, Qiang Zhang, Keyan Ding

Main category: cs.CL

TL;DR: Proposes an efficient multi-turn dialogue evaluator to aggregate multiple LLM judges’ preferences into one model, reducing computational costs while maintaining evaluation quality.

Motivation: Current LLM-as-a-judge methods suffer from biases and high computational costs, necessitating a more efficient and reliable solution.

Method: Aggregates preference knowledge from multiple LLM judges into a single model for efficient dialogue quality assessment.

Result: Outperforms existing baselines on seven benchmarks, demonstrating efficiency and robustness.

Conclusion: The proposed method offers a cost-effective and reliable alternative to multi-judge LLM evaluations.

Abstract: Evaluating the conversational abilities of large language models (LLMs) remains a challenging task. Current mainstream approaches primarily rely on the “LLM-as-a-judge” paradigm, where an LLM is prompted to serve as an evaluator to assess dialogue quality. However, such methods often suffer from various biases, which undermine the reliability and consistency of the evaluation results. To mitigate these biases, recent methods employ multiple LLMs as judges and aggregate their judgments to select the optimal assessment. Although effective, this multi-judge approach incurs significant computational overhead during inference. In this paper, we propose an efficient multi-turn dialogue evaluator that captures the collective wisdom of multiple LLM judges by aggregating their preference knowledge into a single model. Our approach preserves the advantages of diverse multi-judge feedback while drastically reducing the evaluation cost, enabling fast and flexible dialogue quality assessment. Extensive experiments on seven single rating and pairwise comparison dialogue evaluation benchmarks demonstrate that our method outperforms existing baselines across diverse scenarios, showcasing its efficiency and robustness.

[21] GETALP@AutoMin 2025: Leveraging RAG to Answer Questions based on Meeting Transcripts

Jeongwoo Kang, Markarit Vartampetian, Felix Herron, Yongxin Zhou, Diandra Fabre, Gabriela Gonzalez-Saez

Main category: cs.CL

TL;DR: GETALP’s submission to SIGDial 2025 Task B uses RAG and AMR for QA on meeting transcripts, showing AMR improves answers for 35% of questions, especially ‘who’ questions.

Motivation: To enhance question-answering accuracy in meeting transcripts by combining retrieval augmented generation (RAG) and Abstract Meaning Representations (AMR).

Method: Proposes three systems integrating RAG and AMR for QA on meeting transcripts.

Result: AMR improves response quality for ~35% of questions, particularly for ‘who’ questions.

Conclusion: Combining RAG and AMR is effective for QA in meeting transcripts, especially for participant-related questions.

Abstract: This paper documents GETALP’s submission to the Third Run of the Automatic Minuting Shared Task at SIGDial 2025. We participated in Task B: question-answering based on meeting transcripts. Our method is based on a retrieval augmented generation (RAG) system and Abstract Meaning Representations (AMR). We propose three systems combining these two approaches. Our results show that incorporating AMR leads to high-quality responses for approximately 35% of the questions and provides notable improvements in answering questions that involve distinguishing between different participants (e.g., who questions).

[22] The Missing Parts: Augmenting Fact Verification with Half-Truth Detection

Yixuan Tang, Jincheng Wang, Anthony K. H. Tung

Main category: cs.CL

TL;DR: The paper introduces the task of half-truth detection and proposes a new benchmark, PolitiFact-Hidden, along with TRACER, a framework to identify omission-based misinformation by aligning evidence and inferring intent.

Motivation: Existing fact verification systems fail to detect half-truths—claims that are factually correct but misleading due to omitted context.

Method: The authors propose TRACER, a modular framework that aligns evidence, infers claim intent, and estimates the impact of hidden content.

Result: TRACER improves performance, boosting Half-True classification F1 by up to 16 points.

Conclusion: Modeling omissions is crucial for trustworthy fact verification, and TRACER enhances existing pipelines effectively.

Abstract: Fact verification systems typically assess whether a claim is supported by retrieved evidence, assuming that truthfulness depends solely on what is stated. However, many real-world claims are half-truths, factually correct yet misleading due to the omission of critical context. Existing models struggle with such cases, as they are not designed to reason about what is left unsaid. We introduce the task of half-truth detection, and propose PolitiFact-Hidden, a new benchmark with 15k political claims annotated with sentence-level evidence alignment and inferred claim intent. To address this challenge, we present TRACER, a modular re-assessment framework that identifies omission-based misinformation by aligning evidence, inferring implied intent, and estimating the causal impact of hidden content. TRACER can be integrated into existing fact-checking pipelines and consistently improves performance across multiple strong baselines. Notably, it boosts Half-True classification F1 by up to 16 points, highlighting the importance of modeling omissions for trustworthy fact verification.

[23] EFlat-LoRA: Efficiently Seeking Flat Minima for Better Generalization in Fine-Tuning Large Language Models and Beyond

Jiaxin Deng, Qingcheng Zhu, Junbiao Pang, Linlin Yang, Zhongqian Fu, Baochang Zhang

Main category: cs.CL

TL;DR: The paper introduces Flat-LoRA and EFlat-LoRA to explore the connection between sharpness and generalization in LoRA, demonstrating improved performance over LoRA and full fine-tuning.

Motivation: Little research exists on the correlation between expressive ability and generalization in LoRA, and the role of sharpness in generalization for LoRA is unexplored.

Method: Proposes Flat-LoRA and EFlat-LoRA, which transfer perturbations from the full parameter space to the low-rank subspace to seek flat minima.

Result: EFlat-LoRA matches LoRA’s efficiency while outperforming it and full fine-tuning on tasks like GLUE and vision-language models.

Conclusion: Generalization in LoRA is linked to sharpness, a factor previously overlooked, and EFlat-LoRA effectively addresses this.

Abstract: Little research explores the correlation between the expressive ability and generalization ability of low-rank adaptation (LoRA). Sharpness-Aware Minimization (SAM) improves model generalization for both Convolutional Neural Networks (CNNs) and Transformers by encouraging convergence to locally flat minima. However, the connection between sharpness and generalization has not been fully explored for LoRA due to the lack of tools to either empirically seek flat minima or develop theoretical methods. In this work, we propose Flat-LoRA and its efficient version, EFlat-LoRA, to seek flat minima for LoRA. Concretely, we theoretically demonstrate that perturbations in the full parameter space can be transferred to the low-rank subspace. This approach eliminates the potential interference introduced by perturbations across multiple matrices in the low-rank subspace. Our extensive experiments on large language models and vision-language models demonstrate that EFlat-LoRA achieves optimization efficiency comparable to that of LoRA while simultaneously attaining comparable or even better performance. For example, on the GLUE dataset with RoBERTa-large, EFlat-LoRA outperforms LoRA and full fine-tuning by 1.0% and 0.5% on average, respectively. On vision-language models such as Qwen-VL-Chat, it shows performance improvements of 1.5% and 1.0% on the SQA and VizWiz datasets, respectively. These empirical results also verify that the generalization of LoRA is closely related to sharpness, a factor overlooked by previous methods.
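For intuition, a generic sharpness-aware update restricted to the LoRA matrices looks like the sketch below: ascend to a nearby worst-case point, take the gradient there, then step. Note this is plain SAM over low-rank parameters, not the authors' derivation, which transfers full-space perturbations into the low-rank subspace analytically.

```python
# Generic SAM-style step over LoRA parameters only (an illustration of
# seeking flat minima in the low-rank subspace, not EFlat-LoRA itself).
import torch

def sam_step_on_lora(loss_fn, lora_params, optimizer, rho: float = 0.05):
    loss_fn().backward()                          # gradients at current point
    grad_norm = torch.norm(torch.stack([p.grad.norm() for p in lora_params]))
    eps = []
    with torch.no_grad():
        for p in lora_params:                     # ascend toward the worst case
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    optimizer.zero_grad()
    loss_fn().backward()                          # gradients at perturbed point
    with torch.no_grad():
        for p, e in zip(lora_params, eps):
            p.sub_(e)                             # restore original weights
    optimizer.step()                              # descend with SAM gradients
    optimizer.zero_grad()
```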

[24] The Prosody of Emojis

Giulio Zhou, Tsz Kin Lam, Alexandra Birch, Barry Haddow

Main category: cs.CL

TL;DR: The study explores how emojis influence prosody in speech and listener interpretation, showing emojis shape spoken delivery and perception.

Motivation: To understand how emojis act as visual surrogates for prosodic cues in text-based communication and their impact on spoken prosody.

Method: Analysis of human speech data from structured production and perception tasks, linking prosody and emoji directly.

Result: Speakers adapt prosody based on emojis, listeners identify emojis from prosody, and semantic differences correlate with prosodic divergence.

Conclusion: Emojis serve as meaningful carriers of prosodic intent, highlighting their communicative role in digital contexts.

Abstract: Prosodic features such as pitch, timing, and intonation are central to spoken communication, conveying emotion, intent, and discourse structure. In text-based settings, where these cues are absent, emojis act as visual surrogates that add affective and pragmatic nuance. This study examines how emojis influence prosodic realisation in speech and how listeners interpret prosodic cues to recover emoji meanings. Unlike previous work, we directly link prosody and emoji by analysing actual human speech data, collected through structured but open-ended production and perception tasks. This provides empirical evidence of how emoji semantics shape spoken delivery and perception. Results show that speakers adapt their prosody based on emoji cues, listeners can often identify the intended emoji from prosodic variation alone, and greater semantic differences between emojis correspond to increased prosodic divergence. These findings suggest that emojis can act as meaningful carriers of prosodic intent, offering insight into their communicative role in digitally mediated contexts.

[25] PaPaformer: Language Model from Pre-trained Parallel Paths

Joonas Tapaninaho, Mourad Oussalah

Main category: cs.CL

TL;DR: The paper introduces PaPaformer, a decoder-only transformer variant, to reduce training time and parameters while improving performance by training parallel paths individually and combining them.

Motivation: Modern language models require extensive computation and time, even for smaller variants. This paper aims to reduce training time from days/weeks to hours.

Method: Introduces PaPaformer, a decoder-only transformer with parallel paths trained individually on different data and combined into a larger model.

Result: Reduces total parameters and training time while increasing performance. Allows customization of paths for specific tasks.

Conclusion: PaPaformer offers a scalable and efficient method for training language models with potential for task-specific customization.

Abstract: The training of modern large language models requires an increasing amount of computation power and time. Even smaller variants, such as small language models (SLMs), take several days to train in the best-case scenarios, often requiring multiple GPUs. This paper explores methods to train and evaluate decoder-only transformer-based language models in hours instead of days or weeks. We introduce \textit{PaPaformer}, a decoder-only transformer architecture variant whose lower-dimensional parallel paths are combined into a larger model. The paper shows that these lower-dimensional paths can be trained individually with different types of training data and then combined into one larger model. This method offers the option to reduce the total number of model parameters and the training time while increasing performance. Moreover, the parallel path structure opens interesting possibilities for customizing paths to accommodate specific task requirements.
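As a schematic of the combination step (the concat-then-project merge below is an assumption for illustration; the paper may combine paths differently), independently trained low-dimensional paths can be run in parallel and their hidden states merged into one wider model:

```python
# Schematic merge of pre-trained low-dimensional "paths" into a wider model.
# The concat-then-project head is an illustrative assumption.
import torch
import torch.nn as nn

class CombinedPaths(nn.Module):
    def __init__(self, paths, path_dim: int, vocab_size: int):
        super().__init__()
        self.paths = nn.ModuleList(paths)  # independently pre-trained decoders
        self.head = nn.Linear(path_dim * len(paths), vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Each path maps (batch, seq) token ids to (batch, seq, path_dim) states.
        states = [path(token_ids) for path in self.paths]
        return self.head(torch.cat(states, dim=-1))  # (batch, seq, vocab)

# Toy stand-in paths (embedding layers) just to exercise the wrapper.
model = CombinedPaths([nn.Embedding(100, 64) for _ in range(4)],
                      path_dim=64, vocab_size=100)
logits = model(torch.randint(0, 100, (2, 16)))  # -> (2, 16, 100)
```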

[26] SynAdapt: Learning Adaptive Reasoning in Large Language Models via Synthetic Continuous Chain-of-Thought

Jianwei Wang, Ziming Wu, Fuming Lai, Shaobing Lian, Ziqian Zeng

Main category: cs.CL

TL;DR: SynAdapt improves reasoning efficiency by generating synthetic Continuous Chain-of-Thought (CCoT) for precise alignment and integrates a difficulty classifier to adaptively handle hard questions, achieving optimal accuracy-efficiency trade-off.

Motivation: Existing CCoT methods face inefficiencies due to indirect fine-tuning, limited alignment, or inconsistent targets, prompting the need for a more effective reasoning framework.

Method: SynAdapt generates synthetic CCoT for precise LLM alignment and uses a difficulty classifier to identify hard questions, prompting adaptive re-thinking.

Result: Extensive benchmarks show SynAdapt achieves the best accuracy-efficiency trade-off across various difficulty levels.

Conclusion: SynAdapt effectively addresses CCoT limitations, enhancing reasoning efficiency and performance for both easy and hard questions.

Abstract: While Chain-of-Thought (CoT) reasoning improves model performance, it incurs significant time costs due to the generation of discrete CoT tokens (DCoT). Continuous CoT (CCoT) offers a more efficient alternative, but existing CCoT methods are hampered by indirect fine-tuning, limited alignment, or inconsistent targets. To overcome these limitations, we propose \textit{SynAdapt}, an innovative efficient reasoning framework. Specifically, \textit{SynAdapt} generates the synthetic CCoT to serve as a precise and effective alignment target for LLMs. This synthetic CCoT explicitly guides the LLM to learn CCoT and derive accurate answers directly. Furthermore, relying solely on CCoT is insufficient for solving hard questions. To address this, \textit{SynAdapt} integrates a difficulty classifier that leverages both question context and CCoT to identify hard questions. CCoT can effectively help identify hard questions after some brief reasoning. We then adaptively prompt the LLM to re-think these hard questions for improved performance. Extensive experimental results across various benchmarks from different difficulty levels strongly demonstrate the effectiveness of our method, achieving the best accuracy-efficiency trade-off.
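
As a rough illustration of the adaptive control flow described above, here is a minimal sketch; `llm`, `difficulty_clf`, and every method name are assumed interfaces for illustration, not the paper's API:

```python
def synadapt_answer(question: str, llm, difficulty_clf,
                    hard_threshold: float = 0.5) -> str:
    """Hypothetical control flow: draft an answer directly from continuous
    CoT, then re-think explicitly only when the question is judged hard."""
    ccot = llm.continuous_cot(question)            # latent reasoning vectors, no discrete tokens
    draft = llm.answer_from_ccot(question, ccot)   # direct answer guided by CCoT
    # The classifier sees both the question and the brief CCoT reasoning.
    if difficulty_clf(question, ccot) > hard_threshold:
        # Adaptive fallback: explicit re-thinking for hard questions.
        return llm.answer(f"Re-examine step by step and solve: {question}")
    return draft
```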

[27] A Context-Aware Dual-Metric Framework for Confidence Estimation in Large Language Models

Mingruo Yuan, Shuyi Zhang, Ben Kao

Main category: cs.CL

TL;DR: CRUX is a framework for confidence estimation in LLMs, integrating context faithfulness and consistency via two novel metrics, outperforming baselines.

DetailsMotivation: Current methods ignore context-response relevance, crucial for output quality in scenarios with background knowledge.

Method: Proposes CRUX with two metrics: contextual entropy reduction (data uncertainty) and unified consistency examination (model uncertainty).

Result: Achieves the highest AUROC on three benchmark datasets and two domain-specific datasets.

Conclusion: CRUX effectively improves confidence estimation by leveraging context relevance and consistency.

Abstract: Accurate confidence estimation is essential for trustworthy large language model (LLM) systems, as it empowers the user to determine when to trust outputs and enables reliable deployment in safety-critical applications. Current confidence estimation methods for LLMs neglect the relevance between responses and contextual information, a crucial factor in output quality evaluation, particularly in scenarios where background knowledge is provided. To bridge this gap, we propose CRUX (Context-aware entropy Reduction and Unified consistency eXamination), the first framework that integrates context faithfulness and consistency for confidence estimation via two novel metrics. First, contextual entropy reduction represents data uncertainty with the information gain through contrastive sampling with and without context. Second, unified consistency examination captures potential model uncertainty through the global consistency of the generated answers with and without context. Experiments across three benchmark datasets (CoQA, SQuAD, QuAC) and two domain-specific datasets (BioASQ, EduQG) demonstrate CRUX’s effectiveness, achieving higher AUROC than existing baselines.
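
The contextual entropy reduction metric can be illustrated with a toy estimator over sampled answers; the paper's exact estimator and contrastive sampling setup may differ:

```python
from collections import Counter
from math import log

def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy of the empirical answer distribution over samples."""
    counts = Counter(a.strip().lower() for a in answers)
    n = len(answers)
    return -sum((c / n) * log(c / n) for c in counts.values())

def contextual_entropy_reduction(with_ctx: list[str],
                                 without_ctx: list[str]) -> float:
    """Information gain from the context: how much providing the context
    sharpens the answer distribution (higher = more context-grounded)."""
    return answer_entropy(without_ctx) - answer_entropy(with_ctx)

# Toy example: the context collapses answer variability to a single answer.
print(contextual_entropy_reduction(
    with_ctx=["Paris"] * 5,
    without_ctx=["Paris", "Lyon", "Paris", "Marseille", "Lyon"],
))
```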

[28] GHTM: A Graph based Hybrid Topic Modeling Approach in Low-Resource Bengali Language

Farhana Haque, Md. Abdur Rahman, Sumon Ahmed

Main category: cs.CL

TL;DR: A novel Graph Convolutional Network (GCN) based model, GHTM, is proposed for Bengali topic modeling, outperforming existing methods in coherence and diversity. A new Bengali dataset, NCTBText, is introduced.

DetailsMotivation: Topic modeling is understudied in Bengali due to its complexity and lack of resources. The paper aims to address this gap.

Method: GHTM uses GCN to create semantic embeddings from document vectors, decomposed via NMF for topic representation. Compared against LDA, LSA, NMF, BERTopic, and Top2Vec on Bengali datasets.

Result: GHTM outperforms other models in topic coherence and diversity. The NCTBText dataset enriches Bengali corpora.

Conclusion: GHTM is effective for Bengali topic modeling, and NCTBText diversifies available resources.

Abstract: Topic modeling is a Natural Language Processing (NLP) technique that is used to identify latent themes and extract topics from text corpora by grouping similar documents based on their most significant keywords. Although widely researched in English, topic modeling remains understudied in Bengali due to its morphological complexity, lack of adequate resources and initiatives. In this contribution, a novel Graph Convolutional Network (GCN) based model called GHTM (Graph-Based Hybrid Topic Model) is proposed. This model represents input vectors of documents as nodes in the graph, which GCN uses to produce semantically rich embeddings. The embeddings are then decomposed using Non-negative Matrix Factorization (NMF) to get the topical representations of the underlying themes of the text corpus. This study compares the proposed model against a wide range of Bengali topic modeling techniques, from traditional methods such as LDA, LSA, and NMF to contemporary frameworks such as BERTopic and Top2Vec on three Bengali datasets. The experimental results demonstrate the effectiveness of the proposed model by outperforming other models in topic coherence and diversity. In addition, we introduce a novel Bengali dataset called “NCTBText” sourced from Bengali textbook materials to enrich and diversify the predominantly newspaper-centric Bengali corpora.
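
A minimal NumPy/scikit-learn sketch of the GCN-then-NMF pipeline described above, with random layer weights and a hypothetical similarity graph standing in for the trained model:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy corpus: X holds document input vectors (e.g. TF-IDF rows),
# A is a hypothetical document-similarity adjacency matrix.
rng = np.random.default_rng(0)
X = rng.random((6, 20))                   # 6 documents, 20 vocabulary terms
sim = X @ X.T
A = (sim > np.median(sim)).astype(float)  # threshold similarities into edges

# One GCN-style propagation layer: normalised adjacency, linear map, ReLU.
A_hat = A + np.eye(len(A))                # add self-loops
d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
W = rng.random((20, 16))                  # layer weights (trained in the real model)
H = np.maximum(0.0, d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W)  # non-negative embeddings

# NMF then factorises the embeddings into topical representations.
nmf = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
doc_topics = nmf.fit_transform(H)         # document-topic weights
topic_factors = nmf.components_           # topic-embedding factors
```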

[29] NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts

Muhammad Farid Adilazuarda, Musa Izzanardi Wijanarko, Lucky Susanto, Khumaisa Nur’aini, Derry Wijaya, Alham Fikri Aji

Main category: cs.CL

TL;DR: NusaAksara is a new benchmark for Indonesian languages, including original scripts, covering diverse NLP tasks. Most models struggle with these scripts.

DetailsMotivation: To address the lack of NLP benchmarks for Indonesian languages in their original scripts, especially low-resource ones.

Method: Human experts constructed a dataset with 8 scripts across 7 languages, including unsupported scripts like Lampung. Tasks include OCR, translation, and more.

Result: Most models (e.g., GPT-4o, Llama 3.2) perform poorly, often near-zero, on local scripts.

Conclusion: NusaAksara highlights the need for better NLP support for Indonesian scripts, especially low-resource ones.

Abstract: Indonesia is rich in languages and scripts. However, most NLP progress has been made using romanized text. In this paper, we present NusaAksara, a novel public benchmark for Indonesian languages that includes their original scripts. Our benchmark covers both text and image modalities and encompasses diverse tasks such as image segmentation, OCR, transliteration, translation, and language identification. Our data is constructed by human experts through rigorous steps. NusaAksara covers 8 scripts across 7 languages, including low-resource languages not commonly seen in NLP benchmarks. Although unsupported by Unicode, the Lampung script is included in this dataset. We benchmark our data across several models, from LLMs and VLMs such as GPT-4o, Llama 3.2, and Aya 23 to task-specific systems such as PP-OCR and LangID, and show that most NLP technologies cannot handle Indonesia’s local scripts, with many achieving near-zero performance.

[30] Prompting Science Report 3: I’ll pay you or I’ll kill you – but will you care?

Lennart Meincke, Ethan Mollick, Lilach Mollick, Dan Shapiro

Main category: cs.CL

TL;DR: Tipping or threatening AI models has no significant effect on benchmark performance, though prompt variations can impact individual question results unpredictably.

DetailsMotivation: To test common beliefs about improving AI performance through tipping or threatening models, inspired by claims from industry leaders.

Method: Empirical testing on GPQA and MMLU-Pro benchmarks to evaluate model performance under different prompting strategies.

Result: No significant effect of tipping or threatening on benchmarks, but prompt variations can unpredictably affect individual question performance.

Conclusion: Simple prompting variations like tipping or threatening are less effective than assumed, especially for difficult problems, though they may impact specific questions.

Abstract: This is the third in a series of short reports that seek to help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. In this report, we investigate two commonly held prompting beliefs: a) offering to tip the AI model and b) threatening the AI model. Tipping was a commonly shared tactic for improving AI performance and threats have been endorsed by Google Founder Sergey Brin (All-In, May 2025, 8:20) who observed that ‘models tend to do better if you threaten them,’ a claim we subject to empirical testing here. We evaluate model performance on GPQA (Rein et al. 2024) and MMLU-Pro (Wang et al. 2024). We demonstrate two things:

  • Threatening or tipping a model generally has no significant effect on benchmark performance.
  • Prompt variations can significantly affect performance on a per-question level. However, it is hard to know in advance whether a particular prompting approach will help or harm the LLM’s ability to answer any particular question. Taken together, this suggests that simple prompting variations might not be as effective as previously assumed, especially for difficult problems. However, as reported previously (Meincke et al. 2025a), prompting approaches can yield significantly different results for individual questions.

[31] DACTYL: Diverse Adversarial Corpus of Texts Yielded from Large Language Models

Shantanu Thorat, Andrew Caines

Main category: cs.CL

TL;DR: The paper introduces DACTYL, a dataset for detecting AI-generated texts in one-shot/few-shot and domain-specific scenarios, revealing vulnerabilities in existing detectors. It compares BCE and DXO optimization methods, showing DXO’s superior generalization.

DetailsMotivation: Existing AI-generated text detectors fail in real-world settings, especially for one-shot/few-shot and domain-specific texts, prompting the need for a robust dataset and better detection methods.

Method: The authors create DACTYL, a dataset with one-shot/few-shot and domain-specific texts, and train classifiers using BCE and DXO optimization.

Result: Existing detectors struggle with DACTYL. DXO-trained classifiers outperform BCE-trained ones in out-of-distribution scenarios, showing better generalization.

Conclusion: DXO optimization improves generalization for AI-generated text detection, highlighting weaknesses in current detectors and suggesting future improvements.

Abstract: Existing AIG (AI-generated) text detectors struggle in real-world settings despite succeeding in internal testing, suggesting that they may not be robust enough. To address this, we rigorously examine the machine-learning procedure used to build these detectors. Most current AIG text detection datasets focus on zero-shot generations, but little work has been done on few-shot or one-shot generations, where LLMs are given human texts as an example. In response, we introduce the Diverse Adversarial Corpus of Texts Yielded from Language models (DACTYL), a challenging AIG text detection dataset focusing on one-shot/few-shot generations. We also include texts from domain-specific continued-pre-trained (CPT) language models, where we fully train all parameters using a memory-efficient optimization approach. Many existing AIG text detectors struggle significantly on our dataset, indicating a potential vulnerability to one-shot/few-shot and CPT-generated texts. We also train our own classifiers using two approaches: standard binary cross-entropy (BCE) optimization and a more recent approach, deep X-risk optimization (DXO). While BCE-trained classifiers marginally outperform DXO classifiers on the DACTYL test set, the latter excels on out-of-distribution (OOD) texts. In our mock deployment scenario in student essay detection with an OOD student essay dataset, the best DXO classifier outscored the best BCE-trained classifier by 50.56 macro-F1 score points at the lowest false positive rates for both. Our results indicate that DXO classifiers generalize better without overfitting to the test set. Our experiments highlight several areas of improvement for AIG text detectors.

[32] Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications

Wenxuan Wang, Zizhan Ma, Meidan Ding, Shiyi Zheng, Shengyuan Liu, Jie Liu, Jiaming Ji, Wenting Chen, Xiang Li, Linlin Shen, Yixuan Yuan

Main category: cs.CL

TL;DR: A systematic review of LLMs in medical reasoning, proposing a taxonomy of techniques and analyzing applications, benchmarks, and future challenges.

DetailsMotivation: Address the gap in LLMs' ability for systematic, transparent, and verifiable reasoning in clinical practice.

Method: Review of 60 studies (2022-2025), categorizing reasoning techniques into training-time and test-time strategies, and analyzing applications and benchmarks.

Result: Identified key techniques and applications, highlighted evaluation benchmarks, and uncovered challenges like the faithfulness-plausibility gap.

Conclusion: Future directions include improving multimodal reasoning and ensuring robust, responsible medical AI.

Abstract: The proliferation of Large Language Models (LLMs) in medicine has enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning, a cornerstone of clinical practice. This has catalyzed a shift from single-step answer generation to the development of LLMs explicitly designed for medical reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies (e.g., supervised fine-tuning, reinforcement learning) and test-time mechanisms (e.g., prompt engineering, multi-agent systems). We analyze how these techniques are applied across different data modalities (text, image, code) and in key clinical applications such as diagnosis, education, and treatment planning. Furthermore, we survey the evolution of evaluation benchmarks from simple accuracy metrics to sophisticated assessments of reasoning quality and visual interpretability. Based on an analysis of 60 seminal studies from 2022-2025, we conclude by identifying critical challenges, including the faithfulness-plausibility gap and the need for native multimodal reasoning, and outlining future directions toward building efficient, robust, and sociotechnically responsible medical AI.

[33] MELAC: Massive Evaluation of Large Language Models with Alignment of Culture in Persian Language

Farhan Farsi, Farnaz Aghababaloo, Shahriar Shariati Motlagh, Parsa Ghofrani, MohammadAli SadraeiJavaheri, Shayan Bali, Amirhossein Shabani, Farbod Bijary, Ghazal Zamaninejad, AmirMohammad Salehoof, Saeedeh Momtazi

Main category: cs.CL

TL;DR: The paper introduces 19 Persian-language datasets to evaluate LLMs on Iranian culture and language, benchmarking 41 models to address gaps in non-Western cultural and linguistic evaluation.

DetailsMotivation: Existing LLM benchmarks focus on English and Western contexts, leaving a gap for non-Western languages like Persian and cultures like Iran's.

Method: Created 19 datasets covering Iranian law, Persian grammar, idioms, and university exams, then evaluated 41 LLMs using these datasets.

Result: Benchmarked 41 LLMs to assess their performance in Persian and Iranian cultural contexts, highlighting gaps in non-Western evaluation.

Conclusion: The study bridges the evaluation gap for non-Western languages and cultures, emphasizing the need for diverse LLM assessments.

Abstract: As large language models (LLMs) become increasingly embedded in our daily lives, evaluating their quality and reliability across diverse contexts has become essential. While comprehensive benchmarks exist for assessing LLM performance in English, there remains a significant gap in evaluation resources for other languages. Moreover, because most LLMs are trained primarily on data rooted in European and American cultures, they often lack familiarity with non-Western cultural contexts. To address this limitation, our study focuses on the Persian language and Iranian culture. We introduce 19 new evaluation datasets specifically designed to assess LLMs on topics such as Iranian law, Persian grammar, Persian idioms, and university entrance exams. Using these datasets, we benchmarked 41 prominent LLMs, aiming to bridge the existing cultural and linguistic evaluation gap in the field.

[34] Team “better_call_claude”: Style Change Detection using a Sequential Sentence Pair Classifier

Gleb Schmidt, Johannes Römisch, Mariia Halchynska, Svetlana Gorovaia, Ivan P. Yamshchikov

Main category: cs.CL

TL;DR: The paper proposes a Sequential Sentence Pair Classifier (SSPC) for fine-grained style change detection in documents, leveraging a pre-trained language model and BiLSTM to outperform baselines on PAN-2025 datasets.

DetailsMotivation: Style change detection is a key challenge in authorship analysis, especially at the sentence level, as addressed in the PAN 2025 shared task.

Method: Uses a pre-trained language model for sentence representations, followed by a BiLSTM for contextualization, and a multi-layer perceptron for predicting style changes between adjacent sentences.

Result: Achieves macro-F1 scores of 0.923 (EASY), 0.828 (MEDIUM), and 0.724 (HARD), outperforming random baselines and a zero-shot model.

Conclusion: The SSPC model effectively leverages context and handles stylistically shallow sentences, proving robust for fine-grained style change detection.

Abstract: Style change detection - identifying the points in a document where writing style shifts - remains one of the most important and challenging problems in computational authorship analysis. At PAN 2025, the shared task challenges participants to detect style switches at the most fine-grained level: individual sentences. The task spans three datasets, each designed with controlled and increasing thematic variety within documents. We propose to address this problem by modeling the content of each problem instance - that is, a series of sentences - as a whole, using a Sequential Sentence Pair Classifier (SSPC). The architecture leverages a pre-trained language model (PLM) to obtain representations of individual sentences, which are then fed into a bidirectional LSTM (BiLSTM) to contextualize them within the document. The BiLSTM-produced vectors of adjacent sentences are concatenated and passed to a multi-layer perceptron for prediction per adjacency. Building on the work of previous PAN participants and on classical text segmentation, the approach is relatively conservative and lightweight. Nevertheless, it proves effective in leveraging contextual information and addressing what is arguably the most challenging aspect of this year’s shared task: the notorious problem of “stylistically shallow”, short sentences that are prevalent in the proposed benchmark data. Evaluated on the official PAN-2025 test datasets, the model achieves strong macro-F1 scores of 0.923, 0.828, and 0.724 on the EASY, MEDIUM, and HARD data, respectively, outperforming not only the official random baselines but also a much more challenging one: claude-3.7-sonnet’s zero-shot performance.
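
The architecture maps naturally onto a few lines of PyTorch; this sketch assumes frozen PLM sentence embeddings as input and simplifies the dimensions:

```python
import torch
import torch.nn as nn

class SSPC(nn.Module):
    """Sketch of the Sequential Sentence Pair Classifier: PLM sentence
    embeddings -> BiLSTM contextualisation -> MLP over adjacent pairs."""

    def __init__(self, emb_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(4 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, sent_embs: torch.Tensor) -> torch.Tensor:
        # sent_embs: (batch, n_sentences, emb_dim), e.g. from a frozen PLM.
        ctx, _ = self.bilstm(sent_embs)                     # (batch, n, 2*hidden)
        pairs = torch.cat([ctx[:, :-1], ctx[:, 1:]], dim=-1)  # adjacent pairs
        return self.mlp(pairs).squeeze(-1)                  # one logit per boundary

logits = SSPC()(torch.randn(2, 10, 768))  # -> (2, 9): a style-change logit per adjacency
```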

[35] TraceRetriever

Shubham Kumar Nigam, Tanmay Dubey, Noel Shallum, Arnab Bhattacharya

Main category: cs.CL

TL;DR: TraceRetriever improves legal precedent retrieval by focusing on rhetorically significant segments, combining BM25, Vector Database, and Cross-Encoder models for scalable and reliable results.

DetailsMotivation: The increasing complexity and volume of legal documents challenge traditional retrieval methods, especially when only partial case information is available.

Method: The pipeline integrates BM25, Vector Database, and Cross-Encoder models, using Reciprocal Rank Fusion for combining results and a Hierarchical BiLSTM CRF classifier for rhetorical annotations.

Result: Evaluated on IL-PCR and COLIEE 2025 datasets, TraceRetriever effectively addresses document volume challenges and aligns with practical search constraints.

Conclusion: TraceRetriever provides a scalable and reliable solution for legal precedent retrieval, enhancing research when complete case knowledge is unavailable.

Abstract: Legal precedent retrieval is a cornerstone of the common law system, governed by the principle of stare decisis, which demands consistency in judicial decisions. However, the growing complexity and volume of legal documents challenge traditional retrieval methods. TraceRetriever mirrors real-world legal search by operating with limited case information, extracting only rhetorically significant segments instead of requiring complete documents. Our pipeline integrates BM25, Vector Database, and Cross-Encoder models, combining initial results through Reciprocal Rank Fusion before final re-ranking. Rhetorical annotations are generated using a Hierarchical BiLSTM CRF classifier trained on Indian judgments. Evaluated on the IL-PCR and COLIEE 2025 datasets, TraceRetriever addresses growing document volume challenges while aligning with practical search constraints, providing a reliable and scalable foundation for precedent retrieval that enhances legal research when only partial case knowledge is available.
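
Reciprocal Rank Fusion, used here to combine the BM25 and vector-database results before the cross-encoder re-ranks, is simple to state in code; the `k=60` constant is the conventional default, not necessarily the paper's setting:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists (e.g. from BM25 and a vector database) by summing
    1 / (k + rank) for every list in which a document appears."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["case_12", "case_07", "case_33"]
dense = ["case_07", "case_12", "case_91"]
print(reciprocal_rank_fusion([bm25, dense]))  # cases ranked by both lists rise to the top
```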

[36] Better Call Claude: Can LLMs Detect Changes of Writing Style?

Johannes Römisch, Svetlana Gorovaia, Mariia Halchynska, Gleb Schmidt, Ivan P. Yamshchikov

Main category: cs.CL

TL;DR: The paper examines zero-shot performance of large language models (LLMs) on sentence-level style change detection, showing their sensitivity to stylistic variations and superior accuracy over PAN competition baselines.

DetailsMotivation: To assess the capability of state-of-the-art LLMs in detecting subtle stylistic changes at the sentence level, a challenging task in authorship analysis.

Method: Benchmarked four LLMs on PAN 2024 and 2025 datasets for multi-author writing style analysis, analyzing their sensitivity to style and semantics.

Result: LLMs are highly sensitive to stylistic variations, outperforming PAN baselines, and may rely more on content-independent stylistic signals than previously thought.

Conclusion: LLMs set a strong baseline for style change detection, with implications for understanding their sensitivity to stylistic cues beyond semantics.

Abstract: This article explores the zero-shot performance of state-of-the-art large language models (LLMs) on one of the most challenging tasks in authorship analysis: sentence-level style change detection. Benchmarking four LLMs on the official PAN 2024 and 2025 “Multi-Author Writing Style Analysis” datasets, we present several observations. First, state-of-the-art generative models are sensitive to variations in writing style - even at the granular level of individual sentences. Second, their accuracy establishes a challenging baseline for the task, outperforming suggested baselines of the PAN competition. Finally, we explore the influence of semantics on model predictions and present evidence suggesting that the latest generation of LLMs may be more sensitive to content-independent and purely stylistic signals than previously reported.

[37] NyayaRAG

Shubham Kumar Nigam, Balaramamahanthi Deepak Patnaik, Shivam Mishra, Ajay Varghese Thomas, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya

Main category: cs.CL

TL;DR: NyayaRAG, a Retrieval-Augmented Generation framework, enhances Legal Judgment Prediction in India by integrating factual case descriptions, legal statutes, and precedents, improving accuracy and explanation quality.

DetailsMotivation: Existing LJP methods in India overlook statutory provisions and precedents, key in common law systems, prompting the need for a more comprehensive approach.

Method: Proposes NyayaRAG, a RAG framework combining factual case descriptions, legal statutes, and retrieved precedents, evaluated using lexical, semantic, and LLM-based metrics.

Result: Augmenting factual inputs with legal knowledge significantly boosts predictive accuracy and explanation quality.

Conclusion: NyayaRAG effectively addresses gaps in LJP by leveraging structured legal knowledge, enhancing both outcomes and interpretability.

Abstract: Legal Judgment Prediction (LJP) has emerged as a key area in AI for law, aiming to automate judicial outcome forecasting and enhance interpretability in legal reasoning. While previous approaches in the Indian context have relied on internal case content such as facts, issues, and reasoning, they often overlook a core element of common law systems, which is reliance on statutory provisions and judicial precedents. In this work, we propose NyayaRAG, a Retrieval-Augmented Generation (RAG) framework that simulates realistic courtroom scenarios by providing models with factual case descriptions, relevant legal statutes, and semantically retrieved prior cases. NyayaRAG evaluates the effectiveness of these combined inputs in predicting court decisions and generating legal explanations using a domain-specific pipeline tailored to the Indian legal system. We assess performance across various input configurations using both standard lexical and semantic metrics as well as LLM-based evaluators such as G-Eval. Our results show that augmenting factual inputs with structured legal knowledge significantly improves both predictive accuracy and explanation quality.

[38] Dynamically Adaptive Reasoning via LLM-Guided MCTS for Efficient and Context-Aware KGQA

Yingxu Wang, Shiqi Fan, Mengzhu Wang, Siwei Liu

Main category: cs.CL

TL;DR: DAMR is a novel KGQA framework combining symbolic search with adaptive path evaluation, using MCTS and a lightweight Transformer-based scorer for efficient, context-aware reasoning.

DetailsMotivation: Address limitations of static path extraction and high computational costs in current KGQA methods by introducing adaptive reasoning.

Method: Integrates MCTS with an LLM-based planner for search space reduction and a Transformer-based scorer for context-aware path evaluation.

Result: Outperforms state-of-the-art methods on multiple KGQA benchmarks.

Conclusion: DAMR offers efficient, adaptable, and accurate KGQA by dynamically refining reasoning paths and leveraging context-aware evaluation.

Abstract: Knowledge Graph Question Answering (KGQA) aims to interpret natural language queries and perform structured reasoning over knowledge graphs by leveraging their relational and semantic structures to retrieve accurate answers. Recent KGQA methods primarily follow either the retrieve-then-reason paradigm, relying on GNNs or heuristic rules for static path extraction, or dynamic path generation strategies that use large language models (LLMs) with prompting to jointly perform retrieval and reasoning. However, the former suffers from limited adaptability due to static path extraction and lack of contextual refinement, while the latter incurs high computational costs and struggles with accurate path evaluation due to reliance on fixed scoring functions and extensive LLM calls. To address these issues, this paper proposes Dynamically Adaptive MCTS-based Reasoning (DAMR), a novel framework that integrates symbolic search with adaptive path evaluation for efficient and context-aware KGQA. DAMR employs a Monte Carlo Tree Search (MCTS) backbone guided by an LLM-based planner, which selects top-$k$ relevant relations at each step to reduce search space. To improve path evaluation accuracy, we introduce a lightweight Transformer-based scorer that performs context-aware plausibility estimation by jointly encoding the question and relation sequence through cross-attention, enabling the model to capture fine-grained semantic shifts during multi-hop reasoning. Furthermore, to alleviate the scarcity of high-quality supervision, DAMR incorporates a dynamic pseudo-path refinement mechanism that periodically generates training signals from partial paths explored during search, allowing the scorer to continuously adapt to the evolving distribution of reasoning trajectories. Extensive experiments on multiple KGQA benchmarks show that DAMR significantly outperforms state-of-the-art methods.
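
The lightweight cross-attention scorer can be sketched as follows; the dimensions, mean pooling, and single-layer design are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class PathScorer(nn.Module):
    """Sketch of a lightweight plausibility scorer: the encoded question
    attends over the relation sequence; the pooled output yields a score."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, q_tokens: torch.Tensor, rel_seq: torch.Tensor) -> torch.Tensor:
        # q_tokens: (B, Lq, dim) question encoding; rel_seq: (B, Lr, dim) relation path.
        attended, _ = self.cross_attn(query=q_tokens, key=rel_seq, value=rel_seq)
        return self.score(attended.mean(dim=1)).squeeze(-1)  # (B,) plausibility

scores = PathScorer()(torch.randn(2, 12, 256), torch.randn(2, 3, 256))  # -> (2,)
```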

[39] Out-of-Context Abduction: LLMs Make Inferences About Procedural Data Leveraging Declarative Facts in Earlier Training Data

Sohaib Imran, Rob Lamb, Peter M. Atkinson

Main category: cs.CL

TL;DR: GPT 4o can infer chatbot names from behavior descriptions and mimic behaviors, suggesting situational awareness in LLMs.

DetailsMotivation: To investigate if LLMs can reason about training data and infer plausible explanations (out-of-context abduction).

Method: Train LLMs on fictitious chatbot names and behaviors, then test their ability to infer names and mimic behaviors.

Result: GPT 4o correctly inferred chatbot names and displayed characteristic behaviors after training.

Conclusion: LLMs show situational awareness, impacting AI safety.

Abstract: Large language models (LLMs) are trained on large corpora, yet it is unclear whether they can reason about the information present within their training data. We design experiments to study out-of-context abduction in LLMs, the ability to infer the most plausible explanations for observations using relevant facts present in training data. We train treatment LLMs on names and behavior descriptions of fictitious chatbots, but not on examples of dialogue with the chatbots. We find that OpenAI’s GPT 4o LLM can correctly infer at least one chatbot’s name after observing example responses characteristic of that chatbot. We also find that previously training GPT 4o on descriptions of a chatbot’s behavior allows it to display behaviors more characteristic of the chatbot when iteratively trained to display such behaviors. Our results have implications for situational awareness in LLMs and, therefore, for AI safety.

[40] Applying Psychometrics to Large Language Model Simulated Populations: Recreating the HEXACO Personality Inventory Experiment with Generative Agents

Sarah Mercer, Daniel P. Martin, Phil Swatton

Main category: cs.CL

TL;DR: GPT-4 agents were tested for HEXACO personality traits, showing partial alignment with human results, model-specific biases, and reliability with curated populations.

DetailsMotivation: To validate if persona-based generative agents can represent human populations in social science research.

Method: Recreated the HEXACO personality inventory with 310 GPT-4 agents, analyzed responses via factor analysis, and compared to human data.

Result: Partial alignment with HEXACO, reliable dimensions in GPT-4, and model-specific variability in personality profiling.

Conclusion: Generative agents show promise but have limitations; careful design is needed for representative personas in research.

Abstract: Generative agents powered by Large Language Models demonstrate human-like characteristics through sophisticated natural language interactions. Their ability to assume roles and personalities based on predefined character biographies has positioned them as cost-effective substitutes for human participants in social science research. This paper explores the validity of such persona-based agents in representing human populations; we recreate the HEXACO personality inventory experiment by surveying 310 GPT-4 powered agents, conducting factor analysis on their responses, and comparing these results to the original findings presented by Ashton, Lee, & Goldberg in 2004. Our results found 1) a coherent and reliable personality structure was recoverable from the agents’ responses demonstrating partial alignment to the HEXACO framework. 2) the derived personality dimensions were consistent and reliable within GPT-4, when coupled with a sufficiently curated population, and 3) cross-model analysis revealed variability in personality profiling, suggesting model-specific biases and limitations. We discuss the practical considerations and challenges encountered during the experiment. This study contributes to the ongoing discourse on the potential benefits and limitations of using generative agents in social science research and provides useful guidance on designing consistent and representative agent personas to maximise coverage and representation of human personality traits.

[41] Agentic large language models improve retrieval-based radiology question answering

Sebastian Wind, Jeta Sopa, Daniel Truhn, Mahshad Lotfinia, Tri-Thien Nguyen, Keno Bressem, Lisa Adams, Mirabela Rusu, Harald Köstler, Gerhard Wellein, Andreas Maier, Soroosh Tayebi Arasteh

Main category: cs.CL

TL;DR: An agentic RAG framework improves radiology QA by enabling LLMs to decompose questions, retrieve evidence iteratively, and synthesize responses, boosting accuracy and reducing hallucinations.

DetailsMotivation: Traditional RAG systems in radiology QA rely on single-step retrieval, limiting complex clinical reasoning. The study aims to enhance diagnostic accuracy and factual grounding using an agentic approach.

Method: Proposed an agentic RAG framework for LLMs to iteratively retrieve clinical evidence from Radiopaedia. Evaluated 24 LLMs on 104 expert-curated radiology questions.

Result: Agentic retrieval improved diagnostic accuracy (73% vs. 64% zero-shot, 73% vs. 68% conventional RAG), reduced hallucinations (9.4%), and enhanced factual grounding (46% relevant context).

Conclusion: Agentic frameworks enhance radiology QA, especially for mid-sized LLMs, suggesting future studies to validate clinical utility.

Abstract: Clinical decision-making in radiology increasingly benefits from artificial intelligence (AI), particularly through large language models (LLMs). However, traditional retrieval-augmented generation (RAG) systems for radiology question answering (QA) typically rely on single-step retrieval, limiting their ability to handle complex clinical reasoning tasks. Here we propose an agentic RAG framework enabling LLMs to autonomously decompose radiology questions, iteratively retrieve targeted clinical evidence from Radiopaedia, and dynamically synthesize evidence-based responses. We evaluated 24 LLMs spanning diverse architectures, parameter scales (0.5B to >670B), and training paradigms (general-purpose, reasoning-optimized, clinically fine-tuned), using 104 expert-curated radiology questions from previously established RSNA-RadioQA and ExtendedQA datasets. Agentic retrieval significantly improved mean diagnostic accuracy over zero-shot prompting (73% vs. 64%; P<0.001) and conventional online RAG (73% vs. 68%; P<0.001). The greatest gains occurred in mid-sized models (e.g., Mistral Large improved from 72% to 81%) and small-scale models (e.g., Qwen 2.5-7B improved from 55% to 71%), while very large models (>200B parameters) demonstrated minimal changes (<2% improvement). Additionally, agentic retrieval reduced hallucinations (mean 9.4%) and retrieved clinically relevant context in 46% of cases, substantially aiding factual grounding. Even clinically fine-tuned models exhibited meaningful improvements (e.g., MedGemma-27B improved from 71% to 81%), indicating complementary roles of retrieval and fine-tuning. These results highlight the potential of agentic frameworks to enhance factuality and diagnostic accuracy in radiology QA, particularly among mid-sized LLMs, warranting future studies to validate their clinical utility.
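
A minimal sketch of the agentic retrieval loop described above; `llm` and `search` are assumed callables, and the prompts are illustrative rather than the study's actual templates:

```python
def agentic_rag_answer(question: str, llm, search, max_steps: int = 3) -> str:
    """Hypothetical agent loop: decompose the question, retrieve evidence
    iteratively, then synthesize an evidence-grounded answer."""
    evidence: list[str] = []
    for _ in range(max_steps):
        query = llm(
            f"Question: {question}\nEvidence so far: {evidence}\n"
            "State the single most useful next search query, or DONE."
        )
        if query.strip() == "DONE":
            break
        evidence.extend(search(query))  # e.g. snippets of Radiopaedia articles
    return llm(f"Using only this evidence: {evidence}\nAnswer: {question}")
```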

[42] GLiDRE: Generalist Lightweight model for Document-level Relation Extraction

Robin Armingaud, Romaric Besançon

Main category: cs.CL

TL;DR: GLiDRE, a new model for document-level relation extraction, outperforms state-of-the-art models in few-shot settings on Re-DocRED, inspired by GLiNER’s compact design.

DetailsMotivation: Current models for document-level relation extraction struggle in zero-shot or few-shot settings, despite their complexity. GLiNER's success in NER inspired the development of GLiDRE.

Method: GLiDRE builds on GLiNER’s compact architecture and is evaluated on the Re-DocRED dataset across various data settings.

Result: GLiDRE achieves state-of-the-art performance in few-shot scenarios.

Conclusion: GLiDRE is a promising model for document-level relation extraction, especially in few-shot settings, with publicly available code.

Abstract: Relation Extraction (RE) is a fundamental task in Natural Language Processing, and its document-level variant poses significant challenges, due to the need to model complex interactions between entities across sentences. Current approaches, largely based on the ATLOP architecture, are commonly evaluated on benchmarks like DocRED and Re-DocRED. However, their performance in zero-shot or few-shot settings remains largely underexplored due to the task’s complexity. Recently, the GLiNER model has shown that a compact NER model can outperform much larger Large Language Models. With a similar motivation, we introduce GLiDRE, a new model for document-level relation extraction that builds on the key ideas of GLiNER. We benchmark GLiDRE against state-of-the-art models across various data settings on the Re-DocRED dataset. Our results demonstrate that GLiDRE achieves state-of-the-art performance in few-shot scenarios. Our code is publicly available.

[43] MMBERT: Scaled Mixture-of-Experts Multimodal BERT for Robust Chinese Hate Speech Detection under Cloaking Perturbations

Qiyao Xue, Yuchen Dou, Ryan Shi, Xiang Lorraine Li, Wei Gao

Main category: cs.CL

TL;DR: MMBERT, a BERT-based multimodal framework, improves hate speech detection in Chinese social networks by integrating text, speech, and visual data using a Mixture-of-Experts architecture and a progressive training paradigm.

DetailsMotivation: Hate speech detection in Chinese is challenging due to cloaking techniques and limited focus on multimodal approaches in non-English contexts.

Method: Proposes MMBERT, combining textual, speech, and visual modalities via MoE, with a three-stage training paradigm, modality-specific experts, and a shared self-attention mechanism.

Result: MMBERT outperforms fine-tuned BERT models, LLMs, and in-context learning approaches on Chinese hate speech datasets.

Conclusion: MMBERT offers a robust solution for multimodal hate speech detection in Chinese, addressing evasion techniques and outperforming existing methods.

Abstract: Hate speech detection on Chinese social networks presents distinct challenges, particularly due to the widespread use of cloaking techniques designed to evade conventional text-based detection systems. Although large language models (LLMs) have recently improved hate speech detection capabilities, the majority of existing work has concentrated on English datasets, with limited attention given to multimodal strategies in the Chinese context. In this study, we propose MMBERT, a novel BERT-based multimodal framework that integrates textual, speech, and visual modalities through a Mixture-of-Experts (MoE) architecture. To address the instability associated with directly integrating MoE into BERT-based models, we develop a progressive three-stage training paradigm. MMBERT incorporates modality-specific experts, a shared self-attention mechanism, and a router-based expert allocation strategy to enhance robustness against adversarial perturbations. Empirical results on several Chinese hate speech datasets show that MMBERT significantly surpasses fine-tuned BERT-based encoder models, fine-tuned LLMs, and LLMs utilizing in-context learning approaches.
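
A toy soft-routing MoE layer conveys the router-based expert allocation idea; the real MMBERT uses modality-specific experts inside BERT and a three-stage training schedule not shown here:

```python
import torch
import torch.nn as nn

class ModalityMoE(nn.Module):
    """Sketch: a router softly allocates each token to per-modality experts."""

    def __init__(self, dim: int = 768, n_experts: int = 3):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        weights = self.router(h).softmax(dim=-1)                # (B, T, E)
        expert_out = torch.stack([e(h) for e in self.experts], dim=-1)  # (B, T, D, E)
        return (expert_out * weights.unsqueeze(2)).sum(dim=-1)  # weighted mixture

out = ModalityMoE()(torch.randn(2, 8, 768))  # -> (2, 8, 768)
```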

[44] ITUNLP at SemEval-2025 Task 8: Question-Answering over Tabular Data: A Zero-Shot Approach using LLM-Driven Code Generation

Atakan Site, Emre Hakan Erdemir, Gülşen Eryiğit

Main category: cs.CL

TL;DR: The paper introduces a zero-shot system for SemEval-2025 Task 8, focusing on question-answering over tabular data using LLM-based Python code generation. It achieved competitive rankings in subtasks.

DetailsMotivation: To address the challenge of question-answering on diverse tabular datasets efficiently and accurately.

Method: Developed a Python code generation framework using open-source LLMs with optimized prompting strategies.

Result: Different LLMs varied in effectiveness, but Python code generation outperformed alternatives. The system ranked 8th in Subtask I and 6th in Subtask II.

Conclusion: LLM-based Python code generation is effective for tabular QA, with potential for further optimization.

Abstract: This paper presents our system for SemEval-2025 Task 8: DataBench, Question-Answering over Tabular Data. The primary objective of this task is to perform question answering on given tabular datasets from diverse domains under two subtasks: DataBench QA (Subtask I) and DataBench Lite QA (Subtask II). To tackle both subtasks, we developed a zero-shot solution with a particular emphasis on leveraging Large Language Model (LLM)-based code generation. Specifically, we propose a Python code generation framework utilizing state-of-the-art open-source LLMs to generate executable Pandas code via optimized prompting strategies. Our experiments reveal that different LLMs exhibit varying levels of effectiveness in Python code generation. Additionally, results show that Python code generation achieves superior performance in tabular question answering compared to alternative approaches. Although our ranking among zero-shot systems is unknown at the time of this paper’s submission, our system achieved eighth place in Subtask I and sixth place in Subtask II among the 30 systems that outperformed the baseline in the open-source models category.
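
The code-generation loop can be sketched as follows; the prompt wording and the `llm` callable are assumptions, and a real deployment needs proper sandboxing rather than a bare `exec`:

```python
import pandas as pd

PROMPT = """You are given a pandas DataFrame `df` with columns: {columns}.
Write a single Python statement assigning the answer to `result`.
Question: {question}"""

def answer_tabular(df: pd.DataFrame, question: str, llm) -> object:
    """Sketch: prompt an LLM for Pandas code, then execute it in a narrow scope."""
    code = llm(PROMPT.format(columns=list(df.columns), question=question))
    scope = {"df": df, "pd": pd}
    exec(code, scope)  # e.g. "result = df['salary'].mean()"
    return scope["result"]

# Stub LLM for illustration only:
df = pd.DataFrame({"salary": [100, 200, 300]})
print(answer_tabular(df, "What is the mean salary?",
                     llm=lambda p: "result = df['salary'].mean()"))  # -> 200.0
```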

[45] Do They Understand Them? An Updated Evaluation on Nonbinary Pronoun Handling in Large Language Models

Xushuo Tang, Yi Ding, Zhengyi Yang, Yin Chen, Yongrui Gu, Wenke Yang, Mingchen Ju, Xin Cao, Yongfei Liu, Wenjie Zhang

Main category: cs.CL

TL;DR: The paper introduces MISGENDERED+, an updated benchmark for evaluating LLMs’ pronoun fidelity, showing improvements in binary and gender-neutral pronouns but inconsistencies in neopronouns and reverse inference.

DetailsMotivation: Address fairness and inclusivity in LLMs, especially in pronoun usage, by updating prior work (MISGENDERED) to reflect current model capabilities.

Method: Benchmark five LLMs (GPT-4o, Claude 4, DeepSeek-V3, Qwen Turbo, Qwen2.5) using zero-shot, few-shot, and gender identity inference tasks.

Result: Notable improvements in binary and gender-neutral pronoun accuracy, but inconsistent performance on neopronouns and reverse inference.

Conclusion: Persistent gaps in identity-sensitive reasoning highlight the need for further research in inclusive AI.

Abstract: Large language models (LLMs) are increasingly deployed in sensitive contexts where fairness and inclusivity are critical. Pronoun usage, especially concerning gender-neutral and neopronouns, remains a key challenge for responsible AI. Prior work, such as the MISGENDERED benchmark, revealed significant limitations in earlier LLMs’ handling of inclusive pronouns, but was constrained to outdated models and limited evaluations. In this study, we introduce MISGENDERED+, an extended and updated benchmark for evaluating LLMs’ pronoun fidelity. We benchmark five representative LLMs, GPT-4o, Claude 4, DeepSeek-V3, Qwen Turbo, and Qwen2.5, across zero-shot, few-shot, and gender identity inference. Our results show notable improvements compared with previous studies, especially in binary and gender-neutral pronoun accuracy. However, accuracy on neopronouns and reverse inference tasks remains inconsistent, underscoring persistent gaps in identity-sensitive reasoning. We discuss implications, model-specific observations, and avenues for future inclusive AI research.

[46] Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models

Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin

Main category: cs.CL

TL;DR: DAEDAL introduces a dynamic adaptive length expansion strategy for Diffusion Large Language Models (DLLMs), eliminating the need for static predefined generation lengths and improving efficiency and performance.

DetailsMotivation: The rigid static length allocation in DLLMs creates inefficiencies—either underperforming on complex tasks or wasting computational resources. The model's internal signals hint at optimal response lengths, motivating a dynamic solution.

Method: DAEDAL uses a two-phase approach: 1) iterative length expansion before denoising, guided by a completion metric, and 2) dynamic intervention during denoising to expand insufficient regions via mask token insertion.

Result: DAEDAL matches or outperforms fixed-length baselines while boosting computational efficiency by optimizing token usage.

Conclusion: DAEDAL resolves DLLMs’ static length constraint, enhancing their practicality and bridging the gap with Autoregressive models.

Abstract: Diffusion Large Language Models (DLLMs) are emerging as a powerful alternative to the dominant Autoregressive Large Language Models, offering efficient parallel generation and capable global context modeling. However, the practical application of DLLMs is hindered by a critical architectural constraint: the need for a statically predefined generation length. This static length allocation leads to a problematic trade-off: insufficient lengths cripple performance on complex tasks, while excessive lengths incur significant computational overhead and sometimes result in performance degradation. While the inference framework is rigid, we observe that the model itself possesses internal signals that correlate with the optimal response length for a given task. To bridge this gap, we leverage these latent signals and introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion for Diffusion Large Language Models. DAEDAL operates in two phases: 1) Before the denoising process, DAEDAL starts from a short initial length and iteratively expands it to a coarse task-appropriate length, guided by a sequence completion metric. 2) During the denoising process, DAEDAL dynamically intervenes by pinpointing and expanding insufficient generation regions through mask token insertion, ensuring the final output is fully developed. Extensive experiments on DLLMs demonstrate that DAEDAL achieves performance comparable, and in some cases superior, to meticulously tuned fixed-length baselines, while simultaneously enhancing computational efficiency by achieving a higher effective token ratio. By resolving the static length constraint, DAEDAL unlocks new potential for DLLMs, bridging a critical gap with their Autoregressive counterparts and paving the way for more efficient and capable generation.
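
A toy sketch of the two phases under stated assumptions; `denoise_step` and `completion_score` stand in for model internals and do not reflect a real DLLM API:

```python
MASK = "<mask>"

def daedal_generate(prompt: str, denoise_step, completion_score,
                    init_len: int = 16, grow: int = 16,
                    max_len: int = 256, threshold: float = 0.9) -> list[str]:
    """Toy sketch of DAEDAL's two phases over a token list."""
    seq = [MASK] * init_len
    # Phase 1: expand to a coarse, task-appropriate length before denoising,
    # guided by a sequence completion metric.
    while completion_score(prompt, seq) < threshold and len(seq) < max_len:
        seq += [MASK] * grow
    # Phase 2: denoise, inserting mask tokens where a region looks under-developed.
    while MASK in seq:
        seq, needy = denoise_step(prompt, seq)   # fills some masks, flags thin regions
        for pos in sorted(needy, reverse=True):  # insert right-to-left to keep indices valid
            seq[pos:pos] = [MASK] * grow
    return seq
```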

[47] Retrieval-Augmented Semantic Parsing: Improving Generalization with Lexical Knowledge

Xiao Zhang, Qianru Meng, Johan Bos

Main category: cs.CL

TL;DR: RASP integrates retrieval-augmented methods with LLMs for open-domain semantic parsing, improving performance on unseen concepts.

DetailsMotivation: Addressing the challenge of neural models relying on heuristics and struggling with unseen concepts in open-domain semantic parsing.

Method: Introduces Retrieval-Augmented Semantic Parsing (RASP), combining external symbolic knowledge with LLMs.

Result: LLMs outperform encoder-decoder baselines, and RASP nearly doubles performance on out-of-distribution concepts.

Conclusion: LLMs and retrieval mechanisms show promise for robust open-domain semantic parsing.

Abstract: Open-domain semantic parsing remains a challenging task, as neural models often rely on heuristics and struggle to handle unseen concepts. In this paper, we investigate the potential of large language models (LLMs) for this task and introduce Retrieval-Augmented Semantic Parsing (RASP), a simple yet effective approach that integrates external symbolic knowledge into the parsing process. Our experiments not only show that LLMs outperform previous encoder-decoder baselines for semantic parsing, but that RASP further enhances their ability to predict unseen concepts, nearly doubling the performance of previous models on out-of-distribution concepts. These findings highlight the promise of leveraging large language models and retrieval mechanisms for robust and open-domain semantic parsing.

[48] An Investigation into Value Misalignment in LLM-Generated Texts for Cultural Heritage

Fan Bu, Zheng Wang, Siyi Wang, Ziyao Liu

Main category: cs.CL

TL;DR: The paper investigates cultural value misalignments in LLM-generated texts for cultural heritage tasks, revealing over 65% misalignment and proposing a benchmark dataset for future research.

DetailsMotivation: To address the lack of systematic study on cultural value misalignments in LLMs for cultural heritage, which can lead to misrepresentation and cultural erosion.

Method: A comprehensive evaluation of 1066 query tasks across 5 LLMs, using automated and manual approaches to detect misalignments.

Result: Over 65% of generated texts showed notable cultural misalignments, with some tasks almost entirely misaligned.

Conclusion: The study highlights the need for improved cultural sensitivity in LLMs and provides a benchmark dataset for future research.

Abstract: As Large Language Models (LLMs) become increasingly prevalent in tasks related to cultural heritage, such as generating descriptions of historical monuments, translating ancient texts, preserving oral traditions, and creating educational content, their ability to produce accurate and culturally aligned texts is being increasingly relied upon by users and researchers. However, cultural value misalignments may exist in generated texts, such as the misrepresentation of historical facts, the erosion of cultural identity, and the oversimplification of complex cultural narratives, which may lead to severe consequences. Therefore, investigating value misalignment in the context of LLM for cultural heritage is crucial for mitigating these risks, yet there has been a significant lack of systematic and comprehensive study and investigation in this area. To fill this gap, we systematically assess the reliability of LLMs in generating culturally aligned texts for cultural heritage-related tasks. We conduct a comprehensive evaluation by compiling an extensive set of 1066 query tasks covering 5 widely recognized categories with 17 aspects within the knowledge framework of cultural heritage across 5 open-source LLMs, and examine both the type and rate of cultural value misalignments in the generated texts. Using both automated and manual approaches, we effectively detect and analyze the cultural value misalignments in LLM-generated texts. Our findings are concerning: over 65% of the generated texts exhibit notable cultural misalignments, with certain tasks demonstrating almost complete misalignment with key cultural values. Beyond these findings, this paper introduces a benchmark dataset and a comprehensive evaluation workflow that can serve as a valuable resource for future research aimed at enhancing the cultural sensitivity and reliability of LLMs.

[49] IssueBench: Millions of Realistic Prompts for Measuring Issue Bias in LLM Writing Assistance

Paul Röttger, Musashi Hinck, Valentin Hofmann, Kobi Hackenburg, Valentina Pyatkin, Faeze Brahman, Dirk Hovy

Main category: cs.CL

TL;DR: IssueBench is a tool to measure issue bias in LLMs, revealing common biases across models, often aligning with US Democrat views.

DetailsMotivation: Concerns about LLMs presenting biased perspectives influencing users led to the need for a method to measure such biases in real interactions.

Method: Created IssueBench with 2.49m prompts based on 3.9k templates and 212 political issues from real user interactions to evaluate LLMs.

Result: Found common and persistent biases in state-of-the-art LLMs, with models aligning more with US Democrat than Republican opinions.

Conclusion: IssueBench provides robust measurement of LLM biases, aiding discussions on addressing them.

Abstract: Large language models (LLMs) are helping millions of users write texts about diverse issues, and in doing so expose users to different ideas and perspectives. This creates concerns about issue bias, where an LLM tends to present just one perspective on a given issue, which in turn may influence how users think about this issue. So far, it has not been possible to measure which issue biases LLMs actually manifest in real user interactions, making it difficult to address the risks from biased LLMs. Therefore, we create IssueBench: a set of 2.49m realistic prompts for measuring issue bias in LLM writing assistance, which we construct based on 3.9k templates (e.g. “write a blog about”) and 212 political issues (e.g. “AI regulation”) from real user interactions. Using IssueBench, we show that issue biases are common and persistent in state-of-the-art LLMs. We also show that biases are remarkably similar across models, and that all models align more with US Democrat than Republican voter opinion on a subset of issues. IssueBench can easily be adapted to include other issues, templates, or tasks. By enabling robust and realistic measurement, we hope that IssueBench can bring a new quality of evidence to ongoing discussions about LLM biases and how to address them.
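
The template-times-issue construction is straightforward; the templates and issues below are invented stand-ins for the real 3.9k x 212 grid:

```python
from itertools import product

templates = ["Write a blog post about {issue}.",
             "Draft a short opinion piece on {issue}."]
issues = ["AI regulation", "carbon taxes", "school vouchers"]

# The benchmark crosses templates with political issues drawn from real user
# interactions (the full set yields 2.49m prompts).
prompts = [t.format(issue=i) for t, i in product(templates, issues)]
print(len(prompts), prompts[0])
```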

[50] Better Embeddings with Coupled Adam

Felix Stollenwerk, Tobias Stollenwerk

Main category: cs.CL

TL;DR: The paper identifies anisotropy in LLM word representations, links it to Adam’s second moment, and proposes Coupled Adam to improve embedding quality and performance.

DetailsMotivation: Anisotropy in LLM word representations is undesirable but poorly understood. The paper aims to address this by analyzing its cause and proposing a solution.

Method: The authors suggest modifying the Adam optimizer to create Coupled Adam, targeting the second moment as the root cause of anisotropy.

Result: Experiments show Coupled Adam enhances embedding quality and improves upstream/downstream performance on large datasets.

Conclusion: Coupled Adam effectively mitigates anisotropy in LLM embeddings, leading to better overall model performance.

Abstract: Despite their remarkable capabilities, LLMs learn word representations that exhibit the undesirable yet poorly understood feature of anisotropy. In this paper, we argue that the second moment in Adam is a cause of anisotropic embeddings, and suggest a modified optimizer called Coupled Adam to mitigate the problem. Our experiments demonstrate that Coupled Adam significantly improves the quality of embeddings, while also leading to better upstream and downstream performance on large enough datasets.
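
The summary says the fix targets Adam's second moment for embedding parameters; one plausible reading, sketched below, shares (averages) the second-moment estimate across the vocabulary. Treat the coupling rule as an assumption about the mechanism, not the paper's verified update:

```python
import torch

def coupled_adam_embedding_step(p, grad, m, v, lr=1e-3, b1=0.9, b2=0.999,
                                eps=1e-8, t=1):
    """Sketch of one Adam step for an embedding matrix p of shape (vocab, dim),
    with the second moment coupled (averaged) over the vocabulary dimension.
    m and v are caller-managed state tensors of the same shape as p."""
    m.mul_(b1).add_(grad, alpha=1 - b1)              # first-moment EMA
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)    # second-moment EMA
    v_coupled = v.mean(dim=0, keepdim=True)          # shared across all tokens
    m_hat = m / (1 - b1 ** t)                        # bias correction
    v_hat = v_coupled / (1 - b2 ** t)
    p.sub_(lr * m_hat / (v_hat.sqrt() + eps))        # in-place parameter update
```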

[51] SEFL: Enhancing Educational Assignment Feedback with LLM Agents

Mike Zhang, Amalie Pernille Dilling, Léon Gondelman, Niels Erik Ruan Lyngdorf, Euan D. Lindsay, Johannes Bjerva

Main category: cs.CL

TL;DR: SEFL uses synthetic data from LLMs to simulate teacher-student feedback loops, enabling efficient, scalable feedback generation for education.

DetailsMotivation: High-quality feedback is vital for student success but is limited by time and cost constraints.

Method: Two LLMs simulate teacher-student roles to generate synthetic feedback pairs, used to fine-tune smaller models.

Result: SEFL-tuned models outperform non-tuned models and baselines in feedback quality.

Conclusion: SEFL can transform feedback processes in higher education and beyond.

Abstract: Providing high-quality feedback to student assignments is crucial for student success, but it is constrained by time and costs. In this work, we introduce Synthetic Educational Feedback Loops (SEFL), a synthetic data framework designed to generate data that resembles immediate, on-demand feedback at scale without relying on extensive, real-world student assignments. To get this type of data, two large language models (LLMs) operate in teacher-student roles to simulate assignment completion and formative feedback, generating synthetic pairs of student work and corresponding critiques and actionable improvements from a teacher. With this data, we fine-tune smaller, more computationally efficient LLMs on these synthetic pairs, enabling them to replicate key features of high-quality, goal-oriented feedback. Unlike personalized tutoring approaches that offer multi-turn, individualized instruction, SEFL specifically focuses on replicating the teacher-student assignment feedback loop in higher education. Through comprehensive evaluations with four LLM judges and three human experts, we demonstrate that SEFL-tuned models outperform both their non-tuned counterparts in feedback quality and an existing baseline. The potential for societal impact is reinforced by extensive qualitative comments and ratings from human stakeholders, both students and higher education instructors. All in all, SEFL has substantial potential to transform feedback processes for higher education and beyond.
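
One generation round of the teacher-student loop might look like this; both models are assumed callables and the prompts are illustrative, not the framework's actual templates:

```python
def generate_feedback_pair(assignment: str, student_llm, teacher_llm) -> dict:
    """Sketch of one SEFL-style round: two LLMs play student and teacher
    to produce a synthetic (submission, feedback) training pair."""
    submission = student_llm(
        "Complete this assignment as a student, including plausible mistakes:\n"
        f"{assignment}")
    feedback = teacher_llm(
        "As the teacher, give formative feedback: name concrete issues and "
        f"actionable improvements.\nAssignment: {assignment}\n"
        f"Submission: {submission}")
    return {"assignment": assignment, "submission": submission,
            "feedback": feedback}
```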

[52] Lost in Space: Finding the Right Tokens for Structured Output

Sil Hamilton, David Mimno

Main category: cs.CL

TL;DR: The paper explores how structured output formats impact the performance of language models, finding that conventional formats and leading whitespace improve accuracy, especially for smaller models.

DetailsMotivation: To understand the systematic differences in performance when language models are guided to produce structured outputs, particularly focusing on formats that seem similar to humans.

Method: Tested four model families with five output formats on four NLP benchmarks, analyzing accuracy and performance differences.

Result: Models perform best with conventional formats (e.g., letters for multiple choice) and leading whitespace, improving accuracy by 5%-10%. Smaller models benefit the most.

Conclusion: The study provides best practices for using language models as zero-shot classifiers with structured output, emphasizing conventional formats and leading whitespace.

Abstract: General-purpose language models are trained to produce varied natural language outputs, but for some tasks, like annotation or classification, we need more specific output formats. LLM systems increasingly support structured output, which enforces formats by sampling tokens according to a grammar – but also unpredictably reduces downstream performance. Are there systematic differences between grammars that appear semantically (and often visually) similar to humans? To answer this, we test four popular model families with five varying output formats on four common NLP benchmarks. We find all models perform most accurately when guided to use formats respecting convention, such as letters for multiple choice and real numbers for numerical prediction. Performance also improves by 5%-10% when guiding models to return tokens incorporating leading whitespace, with smaller models benefiting the most. We find leading whitespace helps models avoid structural deficiencies in subword token representations. We finally present best practices for researchers using language models as zero-shot classifiers with structured output.
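
The leading-whitespace effect is rooted in how BPE vocabularies store word-initial tokens; a quick check with the GPT-2 tokenizer (assuming the Hugging Face transformers package is installed) shows the distinction:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
# GPT-2's BPE marks word-initial tokens with a leading space ("Ġ").
# " A" is therefore a single, frequently-seen token, while a bare "A"
# is a different token that models encounter in other contexts.
print(tok.tokenize(" A"))  # ['ĠA']
print(tok.tokenize("A"))   # ['A']
```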

[53] Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning

Guijin Son, Jiwoo Hong, Hyunwoo Ko, James Thorne

Main category: cs.CL

TL;DR: The paper introduces MCLM, a multilingual math benchmark, and evaluates test-time scaling methods like ORM and BF on multilingual LLMs, finding limited generalization to non-English tasks.

DetailsMotivation: To investigate whether test-time scaling, effective for pre-training, generalizes to multilingual tasks, particularly in math problem-solving.

Method: Three test-time scaling methods (ORM, PRM, BF) are tested on Qwen2.5-1.5B Math and MR1-1.5B using the MCLM benchmark.

Result: ORM on Qwen2.5-1.5B Math scores 35.8, while BF on MR1-1.5B scores 35.2. BF improves English AIME by 20 points but only 1.94 points on other languages.

Conclusion: Test-time scaling methods show limited effectiveness in multilingual tasks compared to English, suggesting challenges in generalization.

Abstract: Scaling pre-training compute has proven effective for achieving multilinguality, but does the same hold for test-time scaling? In this work, we introduce MCLM, a multilingual math benchmark featuring competition-level problems in 55 languages. We test three test-time scaling methods-Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing (BF)-on both Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for extended reasoning. Our experiments show that using Qwen2.5-1.5B Math with ORM achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2. Although “thinking LLMs” have recently garnered significant attention, we find that their performance is comparable to traditional scaling methods like best-of-N once constrained to similar levels of inference FLOPs. Moreover, while BF yields a 20-point improvement on English AIME, it provides only a 1.94-point average gain across other languages-a pattern consistent across the other test-time scaling methods we studied-highlighting that test-time scaling may not generalize as effectively to multilingual tasks. To foster further research, we release MCLM, MR1-1.5B, and evaluation results.

[54] Do Large Language Models Know How Much They Know?

Gabriele Prato, Jerry Huang, Prasanna Parthasarathi, Shagun Sodhani, Sarath Chandar

Main category: cs.CL

TL;DR: LLMs show awareness of their own knowledge scope, as tested by a benchmark evaluating their ability to enumerate information on specific topics.

DetailsMotivation: To determine if LLMs can recognize the limits of their own knowledge, a key attribute of intelligent systems.

Method: Developed a benchmark to challenge LLMs to enumerate all information they possess on specific topics, assessing recall accuracy.

Result: All tested LLMs, at sufficient scale, demonstrated awareness of their knowledge scope, though rates varied by architecture.

Conclusion: Knowledge awareness may be a generalizable attribute of LLMs, but further research is needed to confirm and understand the mechanisms.

Abstract: Large Language Models (LLMs) have emerged as highly capable systems and are increasingly being integrated into a wide range of applications. However, the rapid pace of their deployment has outpaced a comprehensive understanding of their internal mechanisms and a delineation of their capabilities and limitations. A desired attribute of an intelligent system is its ability to recognize the scope of its own knowledge. To investigate whether LLMs embody this characteristic, we develop a benchmark designed to challenge these models to enumerate all information they possess on specific topics. This benchmark evaluates whether the models recall excessive, insufficient, or the precise amount of information, thereby indicating their awareness of their own knowledge. Our findings reveal that all tested LLMs, given sufficient scale, demonstrate an understanding of how much they know about specific topics. While different architectures exhibit varying rates of this capability’s emergence, the results suggest that awareness of knowledge may be a generalizable attribute of LLMs. Further research is needed to confirm this potential and fully elucidate the underlying mechanisms.

[55] A Survey on Post-training of Large Language Models

Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, Zhenhan Dai, Yifeng Xie, Yihan Cao, Lichao Sun, Pan Zhou, Lifang He, Hechang Chen, Yu Zhang, Qingsong Wen, Tianming Liu, Neil Zhenqiang Gong, Jiliang Tang, Caiming Xiong, Heng Ji, Philip S. Yu, Jianfeng Gao

Main category: cs.CL

TL;DR: This paper surveys post-training language models (PoLMs), addressing limitations of LLMs in specialized contexts through five paradigms: Fine-tuning, Alignment, Reasoning, Efficiency, and Integration and Adaptation. It highlights advancements like OpenAI-o1/o3 and DeepSeek-R1, offering a taxonomy and future research agenda.

DetailsMotivation: The limitations of pre-trained LLMs in specialized tasks, including reasoning, ethics, and domain performance, drive the need for advanced PoLMs.

Method: The paper systematically reviews PoLMs across five paradigms, analyzing techniques and datasets, and presents a taxonomy.

Result: It synthesizes PoLM evolution, categorizes advancements, and proposes a strategic agenda for future research.

Conclusion: The survey establishes a framework for developing more precise, ethical, and versatile LLMs, advancing their application in diverse fields.

Abstract: The emergence of Large Language Models (LLMs) has fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. However, their pre-trained architectures often reveal limitations in specialized contexts, including restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance. These challenges necessitate advanced post-training language models (PoLMs) to address these shortcomings, such as OpenAI-o1/o3 and DeepSeek-R1 (collectively known as Large Reasoning Models, or LRMs). This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures ethical coherence and alignment with human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; Efficiency, which optimizes resource utilization amidst increasing complexity; Integration and Adaptation, which extend capabilities across diverse modalities while addressing coherence issues. Charting progress from ChatGPT’s alignment strategies to DeepSeek-R1’s innovative reasoning advancements, we illustrate how PoLMs leverage datasets to mitigate biases, deepen reasoning capabilities, and enhance domain adaptability. Our contributions include a pioneering synthesis of PoLM evolution, a structured taxonomy categorizing techniques and datasets, and a strategic agenda emphasizing the role of LRMs in improving reasoning proficiency and domain flexibility. As the first survey of its scope, this work consolidates recent PoLM advancements and establishes a rigorous intellectual framework for future research, fostering the development of LLMs that excel in precision, ethical robustness, and versatility across scientific and societal applications.

[56] AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation

Itay Nakash, Nitay Calderon, Eyal Ben David, Elad Hoffer, Roi Reichart

Main category: cs.CL

TL;DR: AdaptiVocab reduces LLM computational costs by adapting vocabulary to domain-specific needs, cutting token usage by 25% without performance loss.

DetailsMotivation: LLMs incur high computational costs due to broad applicability; domain-specific efficiency can be improved by focusing vocabulary.

Method: Introduces AdaptiVocab, an end-to-end approach replacing general tokens with domain-specific n-gram tokens, initialized via weighted embeddings and fine-tuned on a single GPU.

Result: Reduces token usage by over 25% in three niche domains without compromising performance.

Conclusion: AdaptiVocab effectively enhances LLM efficiency in low-resource domains by optimizing vocabulary.

Abstract: Large Language Models (LLMs) have shown impressive versatility as general purpose models. However, their broad applicability comes with a high computational overhead, particularly in auto-regressive decoding where each step requires a forward pass. In domain-specific settings, general-purpose capabilities are unnecessary and can be exchanged for efficiency. In this work, we take a novel perspective on domain adaptation, reducing latency and computational costs by adapting the vocabulary to focused domains of interest. We introduce AdaptiVocab, an end-to-end approach for vocabulary adaptation, designed to enhance LLM efficiency in low-resource domains. AdaptiVocab can be applied to any tokenizer and architecture, modifying the vocabulary by replacing tokens with domain-specific n-gram-based tokens, thereby reducing the number of tokens required for both input processing and output generation. AdaptiVocab initializes new n-token embeddings using an exponentially weighted combination of existing embeddings and employs a lightweight fine-tuning phase that can be efficiently performed on a single GPU. We evaluate two 7B LLMs across three niche domains, assessing efficiency, generation quality, and end-task performance. Our results show that AdaptiVocab reduces token usage by over 25% without compromising performance.
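
The embedding initialization is the concrete trick here. A sketch under the assumption that "exponentially weighted" means the sub-token weights decay geometrically along the n-gram; the decay direction and rate are our guesses, not details from the paper:

```python
import torch

def init_ngram_embedding(subtoken_embs: torch.Tensor,
                         decay: float = 0.5) -> torch.Tensor:
    """Initialize an embedding for a new n-gram token from the embeddings
    of its constituent sub-tokens (shape: n x dim), using exponentially
    decaying, normalized weights."""
    n = subtoken_embs.shape[0]
    weights = decay ** torch.arange(n, dtype=subtoken_embs.dtype)
    weights = weights / weights.sum()  # normalize to sum to 1
    return (weights.unsqueeze(1) * subtoken_embs).sum(dim=0)

# e.g. merge the 3 sub-token vectors of a domain term into one new token
new_vec = init_ngram_embedding(torch.randn(3, 768))
```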

[57] MemInsight: Autonomous Memory Augmentation for LLM Agents

Rana Salama, Jason Cai, Michelle Yuan, Anna Currey, Monica Sunkara, Yi Zhang, Yassine Benajiba

Main category: cs.CL

TL;DR: MemInsight enhances LLM agents’ memory capabilities for better semantic data representation and retrieval, improving performance in tasks like recommendation, QA, and summarization.

DetailsMotivation: Address challenges of growing memory size and semantic structuring in LLM agents to improve contextualized responses.

Method: Proposes MemInsight, an autonomous memory augmentation approach for semantic data representation and retrieval.

Result: Boosts recommendation persuasiveness by 14% and outperforms RAG baseline by 34% in recall for LoCoMo retrieval.

Conclusion: MemInsight effectively enhances LLM agents’ contextual performance across multiple tasks.

Abstract: Large language model (LLM) agents have evolved to intelligently process information, make decisions, and interact with users or tools. A key capability is the integration of long-term memory, enabling these agents to draw upon historical interactions and knowledge. However, the growing memory size and need for semantic structuring pose significant challenges. In this work, we propose an autonomous memory augmentation approach, MemInsight, to enhance semantic data representation and retrieval mechanisms. By leveraging autonomous augmentation of historical interactions, LLM agents are shown to deliver more accurate and contextualized responses. We empirically validate the efficacy of our proposed approach in three task scenarios: conversational recommendation, question answering, and event summarization. On the LLM-REDIAL dataset, MemInsight boosts persuasiveness of recommendations by up to 14%. Moreover, it outperforms a RAG baseline by 34% in recall for LoCoMo retrieval. Our empirical results show the potential of MemInsight to enhance the contextual performance of LLM agents across multiple tasks.

[58] Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol

Weiqi Wang, Jiefu Ou, Yangqiu Song, Benjamin Van Durme, Daniel Khashabi

Main category: cs.CL

TL;DR: The paper introduces ARXIV2TABLE, a benchmark for generating literature review tables, addressing challenges like under-specified user prompts, irrelevant content, and shallow evaluation metrics. It combines LLM-based methods and human annotations to improve table utility.

DetailsMotivation: To enhance literature review table generation by addressing real-world complexities and improving utility for information-seeking tasks.

Method: Extends prior approaches using LLM-based methods and human annotations, introducing ARXIV2TABLE for evaluation.

Result: Experiments show LLMs struggle with the task, emphasizing its difficulty and the need for further advancements.

Conclusion: The paper highlights the challenges in literature review table generation and proposes a novel approach and benchmark for future research.

Abstract: Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating tables that best fulfill a user’s informational needs given a collection of scientific papers. Building on recent work (Newman et al., 2024), we extend prior approaches to address real-world complexities through a combination of LLM-based methods and human annotations. Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques and instead assess the utility of inferred tables for information-seeking tasks (e.g., comparing papers). To support reproducible evaluation, we introduce ARXIV2TABLE, a more realistic and challenging benchmark for this task, along with a novel approach to improve literature review table generation in real-world scenarios. Our extensive experiments on this benchmark show that both open-weight and proprietary LLMs struggle with the task, highlighting its difficulty and the need for further advancements. Our dataset and code are available at https://github.com/JHU-CLSP/arXiv2Table.

[59] Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles

Zihao Xu, Junchen Ding, Yiling Lou, Kun Zhang, Dong Gong, Yuekang Li

Main category: cs.CL

TL;DR: The paper introduces SmartyPat-Bench, a benchmark for evaluating LLMs’ logical reasoning, and SmartyPat, an automated framework for generating fallacious statements. It addresses limitations of existing datasets and provides nuanced insights into LLM capabilities.

DetailsMotivation: Existing datasets for evaluating LLMs' logical reasoning are limited in complexity and naturalness. There's a need for more challenging, diverse, and systematically labeled benchmarks.

Method: The authors create SmartyPat-Bench from real-world Reddit posts and introduce SmartyPat, an automated framework using Prolog rules and LLMs to generate fallacious statements.

Result: SmartyPat produces high-quality fallacies comparable to human-generated content and outperforms baselines. Experiments show structured reasoning improves fallacy categorization but excessive steps hinder detection.

Conclusion: The work advances LLM evaluation by providing a robust benchmark and automated framework, revealing key insights into LLM reasoning capabilities.

Abstract: Large Language Models (LLMs) have achieved significant progress in language understanding and reasoning. Evaluating and analyzing their logical reasoning abilities has therefore become essential. However, existing datasets and benchmarks are often limited to overly simplistic, unnatural, or contextually constrained examples. In response to the growing demand, we introduce SmartyPat-Bench, a challenging, naturally expressed, and systematically labeled benchmark derived from real-world high-quality Reddit posts containing subtle logical fallacies. Unlike existing datasets and benchmarks, it provides more detailed annotations of logical fallacies and features more diverse data. To further scale up the study and address the limitations of manual data collection and labeling - such as fallacy-type imbalance and labor-intensive annotation - we introduce SmartyPat, an automated framework powered by logic programming-based oracles. SmartyPat utilizes Prolog rules to systematically generate logically fallacious statements, which are then refined into fluent natural-language sentences by LLMs, ensuring precise fallacy representation. Extensive evaluation demonstrates that SmartyPat produces fallacies comparable in subtlety and quality to human-generated content and significantly outperforms baseline methods. Finally, experiments reveal nuanced insights into LLM capabilities, highlighting that while excessive reasoning steps hinder fallacy detection accuracy, structured reasoning enhances fallacy categorization performance.

[60] Debunking with Dialogue? Exploring AI-Generated Counterspeech to Challenge Conspiracy Theories

Mareike Lisker, Christina Gottschalk, Helena Mihaljević

Main category: cs.CL

TL;DR: The paper evaluates LLMs (GPT-4o, Llama 3, Mistral) for generating counterspeech against conspiracy theories, finding them often generic, repetitive, or inaccurate.

DetailsMotivation: Counterspeech is vital against harmful online content, but expert-driven efforts are hard to scale. LLMs could help, but their effectiveness against conspiracy theories is understudied.

Method: Evaluated GPT-4o, Llama 3, and Mistral using structured prompts based on psychological research to generate counterspeech.

Result: Models produced generic, repetitive, or superficial counterspeech, often over-acknowledging fear and hallucinating facts.

Conclusion: Prompt-based use of LLMs for counterspeech against conspiracy theories is currently problematic due to inaccuracies and lack of depth.

Abstract: Counterspeech is a key strategy against harmful online content, but scaling expert-driven efforts is challenging. Large Language Models (LLMs) present a potential solution, though their use in countering conspiracy theories is under-researched. Unlike for hate speech, no datasets exist that pair conspiracy theory comments with expert-crafted counterspeech. We address this gap by evaluating the ability of GPT-4o, Llama 3, and Mistral to effectively apply counterspeech strategies derived from psychological research provided through structured prompts. Our results show that the models often generate generic, repetitive, or superficial results. Additionally, they over-acknowledge fear and frequently hallucinate facts, sources, or figures, making their prompt-based use in practical applications problematic.

[61] Credible Plan-Driven RAG Method for Multi-Hop Question Answering

Ningning Zhang, Chi Zhang, Zhizhong Tan, Xingxing Yang, Weiping Deng, Wenyong Wang

Main category: cs.CL

TL;DR: PAR-RAG improves multi-hop QA by using a PDCA-inspired framework for better reasoning paths and error management.

DetailsMotivation: Existing RAG methods struggle with reasoning path deviations and error propagation in multi-hop QA.

Method: PAR-RAG employs complexity-aware planning and dual-verification to enhance reasoning accuracy and consistency.

Result: PAR-RAG outperforms state-of-the-art methods on QA benchmarks, showing improved performance and robustness.

Conclusion: PAR-RAG effectively addresses reasoning drift and error propagation, enhancing multi-hop QA accuracy and reliability.

Abstract: Multi-hop question answering (QA) presents significant challenges for retrieval-augmented generation (RAG), particularly in decomposing complex queries into reliable reasoning paths and managing error propagation. Existing RAG methods often suffer from deviations in reasoning paths and cumulative errors in intermediate steps, reducing the fidelity of the final answer. To address these limitations, we propose PAR-RAG (Plan-then-Act-and-Review RAG), a novel framework inspired by the PDCA (Plan-Do-Check-Act) cycle, to enhance both the accuracy and factual consistency in multi-hop question answering. Specifically, PAR-RAG selects exemplars matched by the semantic complexity of the current question to guide complexity-aware top-down planning, resulting in more precise and coherent multi-step reasoning trajectories. This design mitigates reasoning drift and reduces the risk of suboptimal path convergence, a common issue in existing RAG approaches. Furthermore, a dual-verification mechanism evaluates and corrects intermediate errors, ensuring that the reasoning process remains factually grounded. Experimental results on various QA benchmarks demonstrate that PAR-RAG outperforms existing state-of-the-art methods, validating its effectiveness in both performance and reasoning robustness.
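
To make the PDCA analogy concrete, here is an illustrative skeleton of a Plan-then-Act-and-Review loop. All function names are hypothetical stand-ins for LLM-backed components; the paper's actual control flow may differ.

```python
# Hypothetical stubs for the LLM-backed components of a PAR-RAG-style loop.
def plan(question):                                  # decompose into sub-steps
    raise NotImplementedError
def retrieve_and_answer(step, context, hint=None):   # retriever + reader
    raise NotImplementedError
def verify(step, answer, docs):                      # dual verification
    raise NotImplementedError
def synthesize(question, answers, evidence):         # final answer
    raise NotImplementedError

def par_rag(question: str, max_revisions: int = 2) -> str:
    steps = plan(question)                           # Plan
    answers, evidence = [], []
    for step in steps:                               # Do
        ans, docs = retrieve_and_answer(step, context=answers)
        ok, correction = verify(step, ans, docs)     # Check
        budget = max_revisions
        while not ok and budget > 0:                 # Act: repair the step
            ans, docs = retrieve_and_answer(step, context=answers,
                                            hint=correction)
            ok, correction = verify(step, ans, docs)
            budget -= 1
        answers.append(ans)
        evidence.extend(docs)
    return synthesize(question, answers, evidence)
```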

[62] Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory

Hongli Zhou, Hui Huang, Ziqing Zhao, Lvyuan Han, Huicheng Wang, Kehai Chen, Muyun Yang, Wei Bao, Jian Dong, Bing Xu, Conghui Zhu, Hailong Cao, Tiejun Zhao

Main category: cs.CL

TL;DR: The paper critiques LLM benchmarks, introduces PSN-IRT for better evaluation, and reveals flaws in current benchmarks while showing PSN-IRT’s effectiveness in creating smaller, more aligned benchmarks.

DetailsMotivation: Address inconsistencies and poor separability in LLM benchmarks, questioning their accuracy in reflecting model capabilities.

Method: Proposes PSN-IRT, an enhanced Item Response Theory framework, and analyzes 11 LLM benchmarks with 41,871 items.

Result: Identifies significant shortcomings in benchmark measurement quality; PSN-IRT enables smaller, more human-aligned benchmarks.

Conclusion: PSN-IRT improves benchmark reliability and alignment with human preferences, addressing current evaluation flaws.

Abstract: The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies between different leaderboards and poor separability among top models raise concerns about their ability to accurately reflect authentic model capabilities. This paper provides a critical analysis of benchmark effectiveness, examining mainstream prominent LLM benchmarks using results from diverse models. We first propose Pseudo-Siamese Network for Item Response Theory (PSN-IRT), an enhanced Item Response Theory framework that incorporates a rich set of item parameters within an IRT-grounded architecture. PSN-IRT can be utilized for accurate and reliable estimations of item characteristics and model abilities. Based on PSN-IRT, we conduct extensive analysis on 11 LLM benchmarks comprising 41,871 items, revealing significant and varied shortcomings in their measurement quality. Furthermore, we demonstrate that leveraging PSN-IRT is able to construct smaller benchmarks while maintaining stronger alignment with human preference.
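
For context (not spelled out in the summary above), classical IRT models each benchmark item with a small set of parameters and each model with a latent ability. The standard three-parameter logistic form, which richer frameworks such as PSN-IRT generalize, is

$$P(u_{ij}=1 \mid \theta_j) = c_i + (1-c_i)\,\sigma\!\big(a_i(\theta_j - b_i)\big), \qquad \sigma(x)=\frac{1}{1+e^{-x}},$$

where $\theta_j$ is model $j$'s ability and $a_i$, $b_i$, $c_i$ are item $i$'s discrimination, difficulty, and guessing parameters. PSN-IRT's "rich set of item parameters" extends this per-item parameterization; the exact parameter set is not given here.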

[63] Mitigating Gender Bias via Fostering Exploratory Thinking in LLMs

Kangda Wei, Hasnat Md Abdullah, Ruihong Huang

Main category: cs.CL

TL;DR: A framework to mitigate gender bias in LLMs by generating story pairs with male/female protagonists in identical scenarios, comparing moral judgments, and using DPO for optimization.

DetailsMotivation: Address gender bias in LLMs, which leads to unequal treatment of male and female subjects.

Method: Generate story pairs with male/female protagonists in identical scenarios, compare moral judgments, and use DPO to optimize for balanced judgments.

Result: Significant reduction in gender bias while maintaining or improving general model capabilities.

Conclusion: The proposed framework effectively mitigates gender bias in LLMs and is supported by experimental results.

Abstract: Large Language Models (LLMs) often exhibit gender bias, resulting in unequal treatment of male and female subjects across different contexts. To address this issue, we propose a novel data generation framework that fosters exploratory thinking in LLMs. Our approach prompts models to generate story pairs featuring male and female protagonists in structurally identical, morally ambiguous scenarios, then elicits and compares their moral judgments. When inconsistencies arise, the model is guided to produce balanced, gender-neutral judgments. These story-judgment pairs are used to fine-tune or optimize the models via Direct Preference Optimization (DPO). Experimental results show that our method significantly reduces gender bias while preserving or even enhancing general model capabilities. We release the code and generated data at: https://github.com/WeiKangda/LLMs-Exploratory-Bias-Mitigation/tree/main.
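
For readers unfamiliar with DPO, the optimization pairs each balanced judgment (preferred) with the inconsistent one (dispreferred) under the standard DPO loss. A minimal sketch, assuming response log-probabilities have already been summed over tokens:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss. Here 'chosen' would be the balanced,
    gender-neutral judgment and 'rejected' the biased one; inputs are
    summed response log-probs under the policy and a frozen reference."""
    pi_ratio = logp_chosen - logp_rejected
    ref_ratio = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (pi_ratio - ref_ratio)).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-14.1]),
                torch.tensor([-12.8]), torch.tensor([-13.0]))
```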

[64] AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora

Jiaxin Bai, Wei Fan, Qi Hu, Qing Zong, Chunyang Li, Hong Ting Tsang, Hongyu Luo, Yauwai Yim, Haoyu Huang, Xiao Zhou, Feng Qin, Tianshi Zheng, Xi Peng, Xin Yao, Huiwen Yang, Leijie Wu, Yi Ji, Gong Zhang, Renhai Chen, Yangqiu Song

Main category: cs.CL

TL;DR: AutoSchemaKG autonomously builds knowledge graphs without predefined schemas using LLMs, achieving high accuracy and scalability.

DetailsMotivation: To eliminate the need for manual schema design in knowledge graph construction and enhance LLM factuality.

Method: Leverages LLMs to extract triples and induce schemas from text, using conceptualization for semantic organization.

Result: Constructed ATLAS with 900M+ nodes and 5.9B edges, outperforming baselines in QA tasks and achieving 92% schema alignment.

Conclusion: Dynamically induced schemas can effectively complement LLMs, enabling scalable and accurate knowledge graphs.

Abstract: We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 92% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.

[65] AutoMixer: Checkpoint Artifacts as Automatic Data Mixers

Ernie Chang, Yang Li, Patrick Huber, Vish Vogeti, David Kant, Yangyang Shi, Vikas Chandra

Main category: cs.CL

TL;DR: Checkpoint models in training trajectories can be used to optimize data mixtures for language models, improving performance by up to 1.93% on reasoning benchmarks.

DetailsMotivation: The relationship between data and tasks in language model training is unclear, making it hard to obtain optimal data mixtures for diverse capabilities.

Method: Leverage checkpoint models from training trajectories, identified by their benchmark capabilities, to approximate data influence and optimize data mixtures.

Result: Significant improvements (up to 1.93%) on eight reasoning benchmarks in pretraining settings.

Conclusion: Checkpoint models can enhance data quality and optimize mixtures, demonstrating their untapped potential in training.

Abstract: In language model training, it is desirable to equip models with capabilities from various tasks. However, it is not clear how to directly obtain the right data mixtures for these capabilities, as the relationship between data and tasks is difficult to model. In this work, we observe that checkpoint models exhibit emerging capabilities at different points in the training trajectory. Often, the training process saves checkpoints as artifacts that are under-utilized as a source of in-training data signals. We identify these artifact models based on their respective capabilities on the benchmarks and leverage them as data mixers by using their aggregated first-order influence approximation over source data. We demonstrate on eight reasoning benchmarks that the proposed framework yields significant gains in the pretraining setting, with improvements of up to 1.93%. Overall, this shows the potential of checkpoint models to enhance data quality and optimize data mixtures.
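
As an illustration of the first-order influence idea (our reading; the aggregation and normalization details are assumptions): score source data by how well its gradient at a checkpoint aligns with the gradient on a capability/benchmark batch, then convert scores into mixture weights.

```python
import torch

def influence_score(model, loss_fn, source_batch, target_batch):
    """First-order influence approximation at a checkpoint:
    grad(source loss) . grad(target loss). A positive score means the
    source batch pushes parameters in a direction that also reduces the
    target (benchmark) loss. `model`, `loss_fn`, and the batches are
    placeholders for the reader's own setup."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_src = torch.autograd.grad(loss_fn(model, source_batch), params)
    g_tgt = torch.autograd.grad(loss_fn(model, target_batch), params)
    return sum((a * b).sum() for a, b in zip(g_src, g_tgt)).item()

# One plausible way to turn per-domain scores into mixture weights:
# weights = torch.softmax(torch.tensor(scores_per_domain), dim=0)
```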

[66] RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism

Zhiwen Tan, Jiaming Huang, Qintong Wu, Hongxuan Zhang, Chenyi Zhuang, Jinjie Gu

Main category: cs.CL

TL;DR: RAG-R1 is a novel framework enhancing LLMs by combining internal and external knowledge adaptively, improving performance and reducing inference time.

DetailsMotivation: LLMs often generate outdated or hallucinated responses due to static knowledge. RAG methods, while promising, face training stability and inefficiency issues.

Method: RAG-R1 introduces adaptive knowledge use and expands retrieval/generation to multi-query parallelism.

Result: Outperforms baselines by up to 13.2% and reduces inference time by 11.1% on QA benchmarks.

Conclusion: RAG-R1 effectively addresses LLM limitations, improving accuracy and efficiency.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, while LLMs remain prone to generating hallucinated or outdated responses due to their static internal knowledge. Recent advancements in Retrieval-Augmented Generation (RAG) methods have aimed to enhance models’ search and reasoning capabilities through reinforcement learning (RL). Although these methods demonstrate promising results, they face challenges in training stability and encounter issues such as substantial inference time and restricted capabilities due to reliance on single-query mode. In this paper, we propose RAG-R1, a novel training framework designed to enable LLMs to adaptively leverage internal and external knowledge during the reasoning process. We further expand the generation and retrieval processes within the framework from single-query mode to multi-query parallelism, with the aim of reducing inference time and enhancing the model’s capabilities. Extensive experiments on seven question-answering benchmarks demonstrate that our method outperforms the strongest baseline by up to 13.2% and decreases inference time by 11.1%.
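
Multi-query parallelism is easy to picture: issue several reformulated queries at once instead of one per round. A sketch with a thread pool and a hypothetical `search` function (the paper's retrieval stack is not specified here):

```python
from concurrent.futures import ThreadPoolExecutor

def search(query: str) -> list[str]:
    raise NotImplementedError("plug in your retriever here")  # hypothetical

def parallel_retrieve(queries: list[str], top_k: int = 5) -> list[str]:
    """Issue all reformulated queries concurrently and merge results,
    deduplicating while preserving order."""
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        result_lists = list(pool.map(search, queries))
    seen, merged = set(), []
    for results in result_lists:
        for doc in results[:top_k]:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged
```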

[67] Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

A. Bochkov

Main category: cs.CL

TL;DR: The paper challenges the idea that trainable input embeddings are foundational for semantic representation in LLMs, showing that models with frozen, non-semantic visual embeddings outperform conventional ones.

DetailsMotivation: To understand the role of input embeddings in semantic representation and explore whether high-level semantics are emergent properties of Transformer architecture rather than inherent to embeddings.

Method: Constructed Transformer models with frozen embedding layers derived from Unicode glyphs’ visual structure, tested with a novel Unicode-centric tokenizer.

Result: Models with non-semantic embeddings converged, generated coherent text, and outperformed identical models with trainable embeddings on the MMLU benchmark.

Conclusion: Semantics are emergent from Transformer architecture and data scale, not inherent to embeddings, reframing embeddings as structural primitives rather than meaning containers.

Abstract: Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational “meaning vectors.” This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to “representational interference” in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer’s compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.
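
The frozen-embedding construction can be pictured as rendering each token's glyph to a small bitmap and using the flattened pixels as its vector. A sketch with Pillow; the font path and bitmap size are arbitrary choices, and the paper's exact rendering pipeline may differ:

```python
import numpy as np
import torch
from PIL import Image, ImageDraw, ImageFont

def glyph_embedding(ch: str, size: int = 16,
                    font_path: str = "DejaVuSans.ttf") -> torch.Tensor:
    """Render one character to a size x size grayscale bitmap and flatten
    it into a fixed, non-semantic embedding vector."""
    img = Image.new("L", (size, size), color=0)
    font = ImageFont.truetype(font_path, size)
    ImageDraw.Draw(img).text((0, 0), ch, fill=255, font=font)
    return torch.from_numpy(np.asarray(img, dtype=np.float32).ravel() / 255.0)

vocab = ["A", "b", "中", "!"]
weights = torch.stack([glyph_embedding(c) for c in vocab])
# Frozen throughout training: the Transformer must build semantics on top.
emb = torch.nn.Embedding.from_pretrained(weights, freeze=True)
```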

[68] LLMs Encode Harmfulness and Refusal Separately

Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, Weiyan Shi

Main category: cs.CL

TL;DR: The paper identifies a distinct ‘harmfulness direction’ in LLMs, separate from refusal, and shows how steering along it affects model behavior. It reveals that jailbreak methods reduce refusal signals without altering harmfulness beliefs, and proposes a ‘Latent Guard’ for robust safety detection.

DetailsMotivation: To understand if LLMs truly comprehend harmfulness beyond refusal behaviors and to analyze their internal safety mechanisms.

Method: Identifies a harmfulness direction distinct from refusal, tests steering effects, and evaluates jailbreak methods and adversarial finetuning impacts. Introduces ‘Latent Guard’ for safety applications.

Result: Harmfulness is encoded separately from refusal; jailbreaks reduce refusal signals without changing harmfulness beliefs. Latent Guard performs comparably to dedicated safeguard models.

Conclusion: LLMs’ internal harmfulness understanding is robust, offering a new perspective for AI safety and practical safeguards like Latent Guard.

Abstract: LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs’ refusal behaviors can be mediated by a one-dimensional subspace, i.e., a refusal direction. In this work, we identify a new dimension to analyze safety mechanisms in LLMs, i.e., harmfulness, which is encoded internally as a separate concept from refusal. There exists a harmfulness direction that is distinct from the refusal direction. As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful, but steering along the refusal direction tends to elicit refusal responses directly without reversing the model’s judgment on harmfulness. Furthermore, using our identified harmfulness concept, we find that certain jailbreak methods work by reducing the refusal signals without reversing the model’s internal belief of harmfulness. We also find that adversarially finetuning models to accept harmful instructions has minimal impact on the model’s internal belief of harmfulness. These insights lead to a practical safety application: The model’s latent harmfulness representation can serve as an intrinsic safeguard (Latent Guard) that detects unsafe inputs, reduces over-refusals, and is robust to finetuning attacks. For instance, our Latent Guard achieves performance comparable to or better than Llama Guard 3 8B, a dedicated finetuned safeguard model, across different jailbreak methods. Our findings suggest that LLMs’ internal understanding of harmfulness is more robust to diverse input instructions than their refusal decisions, offering a new perspective to study AI safety.
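
Concept directions like this are typically found as a difference of mean hidden states over contrastive prompt sets; a sketch of that common recipe (the paper's exact layer choice and estimator are not specified here, so this is an assumed construction):

```python
import torch

def concept_direction(h_harmful: torch.Tensor,
                      h_harmless: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction from hidden states (n x d) collected
    at some layer for harmful vs. harmless prompts -- a standard way to
    extract concept directions, assumed here for illustration."""
    d = h_harmful.mean(dim=0) - h_harmless.mean(dim=0)
    return d / d.norm()

def steer(hidden: torch.Tensor, direction: torch.Tensor,
          alpha: float) -> torch.Tensor:
    # Adding alpha * direction at inference nudges the model's internal
    # "harmfulness" reading without touching its weights.
    return hidden + alpha * direction
```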

[69] FinResearchBench: A Logic Tree based Agent-as-a-Judge Evaluation Framework for Financial Research Agents

Rui Sun, Zuo Bai, Wentao Zhang, Yuxiang Zhang, Li Zhao, Shan Sun, Zhengwen Qiu

Main category: cs.CL

TL;DR: The paper introduces FinResearchBench, a logic tree-based Agent-as-a-Judge system for evaluating financial research AI agents, addressing the lack of systematic benchmarks in this domain.

DetailsMotivation: The rapid evolution of AI agents in professional research lacks evaluation frameworks, especially for financial research due to its complexity.

Method: Proposes FinResearchBench, a logic tree-based system that automatically assesses financial research agents across 7 task types.

Result: FinResearchBench provides a comprehensive, reliable evaluation by analyzing logic trees and covers 70 financial research questions.

Conclusion: The work fills a gap in evaluating financial research agents, offering an innovative and domain-specific benchmark.

Abstract: Recently, AI agents have been rapidly evolving in intelligence and are widely used in professional research applications, such as STEM, software development, finance, etc. Among these AI agents, deep research agents are a key category, as they can perform long-horizon tasks and solve problems of greater complexity. However, there are few evaluation frameworks and benchmarks that systematically and automatically investigate the capabilities of these research agents. Furthermore, financial research problems have distinct complexity and subtlety. To fill this gap, we propose FinResearchBench, a logic tree based Agent-as-a-Judge that targets financial research agents specifically. It provides a comprehensive and automatic assessment of research agents across 7 key types of tasks in the financial research domain. The contributions of this work are two-fold: (1) the first and innovative Agent-as-a-Judge system that extracts the logic tree of the research outcome and uses it as intermediate information to deliver a comprehensive, reliable and robust evaluation; (2) finance-oriented coverage of 70 typical financial research questions, spanning 7 frequently encountered task types in the domain.

[70] Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain

Rishemjit Kaur, Arshdeep Singh Bhankhar, Jashanpreet Singh Salh, Sudhir Rajput, Vidhi, Kashish Mahendra, Bhavika Berwal, Ritesh Kumar, Surangika Ranathunga

Main category: cs.CL

TL;DR: Fine-tuned LLMs improve multilingual agriculture QA by generating synthetic datasets from local documents, outperforming generic models in accuracy and relevance.

DetailsMotivation: Generic LLMs lack precision for local and multilingual agriculture advice, limiting their usefulness for farmers.

Method: Generated multilingual synthetic datasets from Indian agriculture documents and fine-tuned LLMs for QA.

Result: Fine-tuned LLMs showed significant improvements in factuality, relevance, and agricultural consensus over baselines.

Conclusion: Tailoring LLMs with local, multilingual datasets enhances their effectiveness for agriculture-specific QA tasks.

Abstract: Enabling farmers to access accurate agriculture-related information in their native languages in a timely manner is crucial for the success of the agricultural sector. Publicly available general-purpose Large Language Models (LLMs) typically offer generic agriculture advisories, lacking precision in local and multilingual contexts. Our study addresses this limitation by generating multilingual (English, Hindi, Punjabi) synthetic datasets from agriculture-specific documents from India and fine-tuning LLMs for the task of question answering (QA). Evaluation on human-created datasets demonstrates significant improvements in factuality, relevance, and agricultural consensus for the fine-tuned LLMs compared to their baseline counterparts.

[71] IFEvalCode: Controlled Code Generation

Jian Yang, Wei Zhang, Shukai Liu, Linzheng Chai, Yingshui Tan, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou, Guanglin Niu, Zhoujun Li, Binyuan Hui, Junyang Lin

Main category: cs.CL

TL;DR: The paper introduces forward and backward constraints generation to enhance Code LLMs’ instruction-following in controlled code generation, alongside IFEvalCode, a multilingual benchmark for nuanced evaluation.

DetailsMotivation: Real-world applications require stricter adherence to detailed coding requirements beyond correctness, which current Code LLMs struggle with.

Method: Proposes forward and backward constraints generation and introduces IFEvalCode, a benchmark with 1.6K samples across seven languages, evaluating correctness and instruction-following separately.

Result: Closed-source models outperform open-source ones in controllable code generation, with a notable gap between correctness and precise instruction-following.

Conclusion: The approach improves Code LLMs’ alignment with human guidelines, highlighting the need for better instruction-following capabilities in code generation.

Abstract: Code large language models (Code LLMs) have made significant progress in code generation by translating natural language descriptions into functional code; however, real-world applications often demand stricter adherence to detailed requirements such as coding style, line count, and structural constraints, beyond mere correctness. To address this, the paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs in controlled code generation, ensuring outputs align more closely with human-defined guidelines. The authors further present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages (Python, Java, JavaScript, TypeScript, Shell, C++, and C#), with each sample featuring both Chinese and English queries. Unlike existing benchmarks, IFEvalCode decouples evaluation into two metrics: correctness (Corr.) and instruction-following (Instr.), enabling a more nuanced assessment. Experiments on over 40 LLMs reveal that closed-source models outperform open-source ones in controllable code generation and highlight a significant gap between the models’ ability to generate correct code versus code that precisely follows instructions.
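
The decoupled metrics are easy to picture: one check executes the code (Corr.), an independent check validates the stated constraints whether or not the code is correct (Instr.). A toy sketch with a line-count constraint; the benchmark's real checkers are richer than this:

```python
def correctness(code: str, tests: list) -> bool:
    """Corr.: does the code behave correctly? In practice this would
    execute the sample against the benchmark's tests in a sandbox."""
    raise NotImplementedError  # execution harness stub

def instruction_following(code: str, max_lines: int) -> bool:
    """Instr.: does the code obey the explicit constraints (here, a toy
    line-count limit), regardless of functional correctness?"""
    lines = [ln for ln in code.splitlines() if ln.strip()]
    return len(lines) <= max_lines

# A sample can pass one metric and fail the other -- exactly the gap
# between correct code and instruction-precise code the benchmark exposes.
```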

[72] ControlMed: Adding Reasoning Control to Medical Language Model

Sung-Min Lee, Siyoon Lee, Juyeon Kim, Kyoungmin Roh

Main category: cs.CL

TL;DR: ControlMed is a medical LLM that allows users to control reasoning length, improving efficiency and accuracy in clinical decision-making.

DetailsMotivation: Existing reasoning LLMs in medicine are computationally inefficient due to lengthy reasoning processes, hindering practical use.

Method: ControlMed uses a three-stage pipeline: pre-training on synthetic medical data, supervised fine-tuning with length-control markers, and reinforcement learning for accuracy.

Result: ControlMed matches or outperforms state-of-the-art models in benchmarks while offering flexible reasoning length control.

Conclusion: ControlMed is a practical, adaptable solution for clinical QA and medical analysis, balancing accuracy and efficiency.

Abstract: Reasoning Large Language Models (LLMs) with enhanced accuracy and explainability are increasingly being adopted in the medical domain, as the life-critical nature of clinical decision-making demands reliable support. Despite these advancements, existing reasoning LLMs often generate unnecessarily lengthy reasoning processes, leading to significant computational overhead and response latency. These limitations hinder their practical deployment in real-world clinical environments. To address these challenges, we introduce ControlMed, a medical language model that enables users to actively control the length of the reasoning process at inference time through fine-grained control markers. ControlMed is trained through a three-stage pipeline: 1) pre-training on a large-scale synthetic medical instruction dataset covering both direct and reasoning responses; 2) supervised fine-tuning with multi-length reasoning data and explicit length-control markers; and 3) reinforcement learning with model-based reward signals to enhance factual accuracy and response quality. Experimental results on a variety of English and Korean medical benchmarks demonstrate that our model achieves similar or better performance compared to state-of-the-art models. Furthermore, users can flexibly balance reasoning accuracy and computational efficiency by controlling the reasoning length as needed. These findings demonstrate that ControlMed is a practical and adaptable solution for clinical question answering and medical information analysis.

cs.CV

[73] A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition

Jie Zhu, Yiyang Su, Minchul Kim, Anil Jain, Xiaoming Liu

Main category: cs.CV

TL;DR: QME is a novel framework for whole-body biometric recognition using a learnable score-fusion strategy with Mixture of Experts (MoE), improving performance by addressing score distribution variations and data quality issues.

DetailsMotivation: Overcoming limitations of unimodal systems and traditional score-fusion methods by integrating multiple biometric modalities (face, gait, body) and addressing score distribution variations.

Method: Proposes QME: a learnable score-fusion strategy with MoE, pseudo-quality loss for quality estimation, and score triplet loss for metric improvement.

Result: Achieves state-of-the-art performance on multiple datasets, outperforming baseline methods in multimodal and multi-model scenarios.

Conclusion: QME effectively addresses challenges like model misalignment and data quality variability, enhancing whole-body biometric recognition.

Abstract: Whole-body biometric recognition is a challenging multimodal task that integrates various biometric modalities, including face, gait, and body. This integration is essential for overcoming the limitations of unimodal systems. Traditionally, whole-body recognition involves deploying different models to process multiple modalities, achieving the final outcome by score-fusion (e.g., weighted averaging of similarity matrices from each model). However, these conventional methods may overlook the variations in score distributions of individual modalities, making it challenging to improve final performance. In this work, we present \textbf{Q}uality-guided \textbf{M}ixture of score-fusion \textbf{E}xperts (QME), a novel framework designed for improving whole-body biometric recognition performance through a learnable score-fusion strategy using a Mixture of Experts (MoE). We introduce a novel pseudo-quality loss for quality estimation with a modality-specific Quality Estimator (QE), and a score triplet loss to improve the metric performance. Extensive experiments on multiple whole-body biometric datasets demonstrate the effectiveness of our proposed approach, achieving state-of-the-art results across various metrics compared to baseline methods. Our method is effective in both multimodal and multi-model settings, addressing key challenges such as model misalignment in the similarity score domain and variability in data quality.

[74] Punching Bag vs. Punching Person: Motion Transferability in Videos

Raiyaan Abdullah, Jared Claypoole, Michael Cogswell, Ajay Divakaran, Yogesh Rawat

Main category: cs.CV

TL;DR: The paper explores motion transferability in action recognition models, introducing a framework with three datasets to evaluate performance on high-level actions in novel contexts. Findings highlight challenges in fine-grained action recognition and the impact of model size and biases.

DetailsMotivation: To assess whether action recognition models can effectively transfer high-level motion concepts across diverse contexts, even within similar distributions.

Method: A motion transferability framework is introduced using three datasets (Syn-TA, Kinetics400-TA, Something-Something-v2-TA) to evaluate 13 state-of-the-art models.

Result: Performance drops significantly for high-level actions in novel contexts. Multimodal models struggle with fine-grained actions, and larger models face challenges with temporal reasoning.

Conclusion: The study establishes a benchmark for motion transferability, revealing limitations in current models and suggesting disentangling coarse and fine motions for improvement.

Abstract: Action recognition models demonstrate strong generalization, but can they effectively transfer high-level motion concepts across diverse contexts, even within similar distributions? For example, can a model recognize the broad action “punching” when presented with an unseen variation such as “punching person”? To explore this, we introduce a motion transferability framework with three datasets: (1) Syn-TA, a synthetic dataset with 3D object motions; (2) Kinetics400-TA; and (3) Something-Something-v2-TA, both adapted from natural video datasets. We evaluate 13 state-of-the-art models on these benchmarks and observe a significant drop in performance when recognizing high-level actions in novel contexts. Our analysis reveals: 1) Multimodal models struggle more with fine-grained unknown actions than with coarse ones; 2) The bias-free Syn-TA proves as challenging as real-world datasets, with models showing greater performance drops in controlled settings; 3) Larger models improve transferability when spatial cues dominate but struggle with intensive temporal reasoning, while reliance on object and background cues hinders generalization. We further explore how disentangling coarse and fine motions can improve recognition in temporally challenging datasets. We believe this study establishes a crucial benchmark for assessing motion transferability in action recognition. Datasets and relevant code: https://github.com/raiyaan-abdullah/Motion-Transfer.

[75] The Monado SLAM Dataset for Egocentric Visual-Inertial Tracking

Mateo de Mayo, Daniel Cremers, Taihú Pire

Main category: cs.CV

TL;DR: The paper introduces the Monado SLAM dataset to address gaps in existing VIO/SLAM systems for head-mounted sensors, covering challenging real-world scenarios.

DetailsMotivation: Existing VIO/SLAM systems struggle with head-mounted use cases like high-intensity motions, dynamic occlusions, and adverse conditions, which are poorly represented in current datasets.

Method: The authors present the Monado SLAM dataset, consisting of real sequences from multiple VR headsets, to better represent these challenges.

Result: The dataset is released under a CC BY 4.0 license to advance VIO/SLAM research.

Conclusion: The Monado SLAM dataset aims to improve VIO/SLAM systems by addressing overlooked real-world challenges in head-mounted tracking.

Abstract: Humanoid robots and mixed reality headsets benefit from the use of head-mounted sensors for tracking. While advancements in visual-inertial odometry (VIO) and simultaneous localization and mapping (SLAM) have produced new and high-quality state-of-the-art tracking systems, we show that these are still unable to gracefully handle many of the challenging settings presented in the head-mounted use cases. Common scenarios like high-intensity motions, dynamic occlusions, long tracking sessions, low-textured areas, adverse lighting conditions, saturation of sensors, to name a few, continue to be covered poorly by existing datasets in the literature. In this way, systems may inadvertently overlook these essential real-world issues. To address this, we present the Monado SLAM dataset, a set of real sequences taken from multiple virtual reality headsets. We release the dataset under a permissive CC BY 4.0 license, to drive advancements in VIO/SLAM research and development.

[76] Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models

Hyundong Jin, Hyung Jin Chang, Eunwoo Kim

Main category: cs.CV

TL;DR: A novel framework improves continual learning in vision-language models by grounding visual translation on language instructions, using specialized visual projectors and expert strategies to avoid neglecting language inputs.

DetailsMotivation: Address the issue of models prioritizing visual inputs over language instructions in continual learning, especially with repetitive textual tasks.

Method: Introduces a mixture of visual projectors as experts, an expert recommendation strategy, and expert pruning to adapt to new tasks while minimizing interference.

Result: Outperforms existing continual learning approaches by generating better instruction-following responses.

Conclusion: The proposed framework effectively balances visual and language inputs, enhancing continual learning performance in vision-language models.

Abstract: Continual learning enables pre-trained generative vision-language models (VLMs) to incorporate knowledge from new tasks without retraining data from previous ones. Recent methods update a visual projector to translate visual information for new tasks, connecting pre-trained vision encoders with large language models. However, such adjustments may cause the models to prioritize visual inputs over language instructions, particularly learning tasks with repetitive types of textual instructions. To address the neglect of language instructions, we propose a novel framework that grounds the translation of visual information on instructions for language models. We introduce a mixture of visual projectors, each serving as a specialized visual-to-language translation expert based on the given instruction context to adapt to new tasks. To avoid using experts for irrelevant instruction contexts, we propose an expert recommendation strategy that reuses experts for tasks similar to those previously learned. Additionally, we introduce expert pruning to alleviate interference from the use of experts that cumulatively activated in previous tasks. Extensive experiments on diverse vision-language tasks demonstrate that our method outperforms existing continual learning approaches by generating instruction-following responses.

[77] Analyze-Prompt-Reason: A Collaborative Agent-Based Framework for Multi-Image Vision-Language Reasoning

Angelos Vlachos, Giorgos Filandrianos, Maria Lymperaiou, Nikolaos Spanos, Ilias Mitsouras, Vasileios Karampinis, Athanasios Voulodimos

Main category: cs.CV

TL;DR: A dual-agent framework (PromptEngineer and VisionReasoner) automates multi-image reasoning across diverse tasks, achieving high performance on 18 datasets.

DetailsMotivation: Addressing the challenge of interleaved multimodal reasoning across varied datasets and tasks.

Method: Uses a language-based PromptEngineer for task-specific prompts and a VisionReasoner (LVLM) for inference, without training.

Result: High accuracy on tasks like TQA (99.13%), DocVQA (96.87%), and MMCoQA (75.28 ROUGE-L).

Conclusion: LVLMs can effectively reason over multiple images with informative prompts, influenced by design choices.

Abstract: We present a Collaborative Agent-Based Framework for Multi-Image Reasoning. Our approach tackles the challenge of interleaved multimodal reasoning across diverse datasets and task formats by employing a dual-agent system: a language-based PromptEngineer, which generates context-aware, task-specific prompts, and a VisionReasoner, a large vision-language model (LVLM) responsible for final inference. The framework is fully automated, modular, and training-free, enabling generalization across classification, question answering, and free-form generation tasks involving one or multiple input images. We evaluate our method on 18 diverse datasets from the 2025 MIRAGE Challenge (Track A), covering a broad spectrum of visual reasoning tasks including document QA, visual comparison, dialogue-based understanding, and scene-level inference. Our results demonstrate that LVLMs can effectively reason over multiple images when guided by informative prompts. Notably, Claude 3.7 achieves near-ceiling performance on challenging tasks such as TQA (99.13% accuracy), DocVQA (96.87%), and MMCoQA (75.28 ROUGE-L). We also explore how design choices, such as model selection, shot count, and input length, influence the reasoning performance of different LVLMs.

[78] Exploring the Feasibility of Deep Learning Techniques for Accurate Gender Classification from Eye Images

Basna Mohammed Salih Hasan, Ramadhan J. Mstafa

Main category: cs.CV

TL;DR: The paper proposes a CNN model for gender classification using the periocular region, achieving high accuracy (99% and 96%) on two datasets.

DetailsMotivation: Gender classification is important in security and other fields, but accuracy is affected by cosmetics and disguise. The study focuses on the periocular region for reliable classification.

Method: A sophisticated CNN model is introduced, tested on CVBL and (Female and Male) datasets, and compared with state-of-the-art methods.

Result: The model achieved 99% accuracy on CVBL and 96% on (Female and Male) datasets with minimal parameters.

Conclusion: The model is effective for gender classification and has practical applications in security and surveillance.

Abstract: Gender classification has emerged as a crucial aspect in various fields, including security, human-machine interaction, surveillance, and advertising. Nonetheless, the accuracy of this classification can be influenced by factors such as cosmetics and disguise. Consequently, our study is dedicated to addressing this concern by concentrating on gender classification using color images of the periocular region. The periocular region refers to the area surrounding the eye, including the eyelids, eyebrows, and the region between them. It contains valuable visual cues that can be used to extract key features for gender classification. This paper introduces a sophisticated Convolutional Neural Network (CNN) model that utilizes color image databases to evaluate the effectiveness of the periocular region for gender classification. To validate the model’s performance, we conducted tests on two eye datasets, namely CVBL and (Female and Male). The recommended architecture achieved an outstanding accuracy of 99% on the previously unused CVBL dataset while attaining a commendable accuracy of 96% with a small number of learnable parameters (7,235,089) on the (Female and Male) dataset. To ascertain the effectiveness of our proposed model for gender classification using the periocular region, we evaluated its performance through an extensive range of metrics and compared it with other state-of-the-art approaches. The results unequivocally demonstrate the efficacy of our model, thereby suggesting its potential for practical application in domains such as security and surveillance.

[79] World Consistency Score: A Unified Metric for Video Generation Quality

Akshat Rakheja, Aarsh Ashdhir, Aryan Bhattacharjee, Vanshika Sharma

Main category: cs.CV

TL;DR: World Consistency Score (WCS) is a new metric for evaluating generative video models, focusing on internal world consistency through four sub-components: object permanence, relation stability, causal compliance, and flicker penalty. It combines these into a single score aligned with human judgments.

DetailsMotivation: Existing video evaluation metrics often overlook temporal and physical coherence, focusing instead on visual fidelity or prompt alignment. WCS aims to fill this gap by providing a unified, interpretable measure of world consistency.

Method: WCS integrates four submetrics, each computed using open-source tools (trackers, action recognizers, CLIP embeddings, optical flow). The submetrics are combined via a learned weighted formula trained on human preference data.

Result: WCS is validated using benchmarks like VBench-2.0, EvalCrafter, and LOVE, showing correlation with human evaluations and outperforming established metrics (FVD, CLIPScore, VBench, FVMD).

Conclusion: WCS provides a comprehensive and interpretable framework for assessing video generation models’ ability to maintain a coherent world over time, addressing limitations of prior metrics.

Abstract: We introduce World Consistency Score (WCS), a novel unified evaluation metric for generative video models that emphasizes internal world consistency of the generated videos. WCS integrates four interpretable sub-components - object permanence, relation stability, causal compliance, and flicker penalty - each measuring a distinct aspect of temporal and physical coherence in a video. These submetrics are combined via a learned weighted formula to produce a single consistency score that aligns with human judgments. We detail the motivation for WCS in the context of existing video evaluation metrics, formalize each submetric and how it is computed with open-source tools (trackers, action recognizers, CLIP embeddings, optical flow), and describe how the weights of the WCS combination are trained using human preference data. We also outline an experimental validation blueprint: using benchmarks like VBench-2.0, EvalCrafter, and LOVE to test WCS’s correlation with human evaluations, performing sensitivity analyses, and comparing WCS against established metrics (FVD, CLIPScore, VBench, FVMD). The proposed WCS offers a comprehensive and interpretable framework for evaluating video generation models on their ability to maintain a coherent “world” over time, addressing gaps left by prior metrics focused only on visual fidelity or prompt alignment.
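
As a rough illustration of the combination step, the sketch below fits softmax-normalized submetric weights to pairwise human preferences with a Bradley-Terry-style objective; the submetric values and the exact weighting rule used by WCS are assumptions here:

```python
# Hedged sketch: learn the weights that combine the four WCS submetrics
# (object permanence, relation stability, causal compliance, flicker penalty)
# from pairwise human preference data.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WCSCombiner(nn.Module):
    def __init__(self, n_sub=4):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_sub))  # softmax keeps weights positive

    def forward(self, sub_scores):                      # sub_scores: (B, n_sub)
        return sub_scores @ torch.softmax(self.logits, dim=0)  # (B,) one WCS per video

def preference_loss(combiner, subs_a, subs_b, a_preferred):
    # Bradley-Terry-style fit: the human-preferred video should score higher.
    margin = combiner(subs_a) - combiner(subs_b)
    return F.binary_cross_entropy_with_logits(margin, a_preferred)
```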

[80] Is It Really You? Exploring Biometric Verification Scenarios in Photorealistic Talking-Head Avatar Videos

Laura Pedrouzo-Rodriguez, Pedro Delgado-DeRobles, Luis F. Gomez, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez

Main category: cs.CV

TL;DR: The paper explores using facial motion patterns as behavioral biometrics for identity verification in photorealistic avatar-mediated communication, proposing a lightweight Graph Convolutional Network with 80% AUC performance.

DetailsMotivation: Addressing security risks like impersonation in avatar-based systems by verifying identity through unique facial motion patterns.

Method: Introduces a dataset of avatar videos and a spatio-temporal Graph Convolutional Network with temporal attention pooling, using facial landmarks.

Result: Facial motion cues achieve ~80% AUC for identity verification, demonstrating reliability.

Conclusion: Highlights the need for advanced behavioral biometric defenses in avatar communication, providing a benchmark for future research.

Abstract: Photorealistic talking-head avatars are becoming increasingly common in virtual meetings, gaming, and social platforms. These avatars allow for more immersive communication, but they also introduce serious security risks. One emerging threat is impersonation: an attacker can steal a user’s avatar, preserving their appearance and voice, making it nearly impossible to detect its fraudulent usage by sight or sound alone. In this paper, we explore the challenge of biometric verification in such avatar-mediated scenarios. Our main question is whether an individual’s facial motion patterns can serve as reliable behavioral biometrics to verify their identity when the avatar’s visual appearance is a facsimile of its owner. To answer this question, we introduce a new dataset of realistic avatar videos created using a state-of-the-art one-shot avatar generation model, GAGAvatar, with genuine and impostor avatar videos. We also propose a lightweight, explainable spatio-temporal Graph Convolutional Network architecture with temporal attention pooling, which uses only facial landmarks to model dynamic facial gestures. Experimental results demonstrate that facial motion cues enable meaningful identity verification with AUC values approaching 80%. The proposed benchmark and biometric system are available for the research community in order to bring attention to the urgent need for more advanced behavioral biometric defenses in avatar-based communication systems.
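
A minimal sketch of the temporal attention pooling idea, assuming per-frame features already produced by the spatio-temporal GCN (all dimensions are placeholders):

```python
# Sketch: collapse a per-frame feature sequence into one clip embedding by
# letting the model attend to the most identity-discriminative frames.
import torch
import torch.nn as nn

class TemporalAttentionPooling(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one attention logit per frame

    def forward(self, frame_feats):                           # (B, T, dim)
        attn = torch.softmax(self.score(frame_feats), dim=1)  # (B, T, 1)
        return (attn * frame_feats).sum(dim=1)                # (B, dim) clip embedding
```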

[81] GeoExplorer: Active Geo-localization with Curiosity-Driven Exploration

Li Mi, Manon Bechaz, Zeming Chen, Antoine Bosselut, Devis Tuia

Main category: cs.CV

TL;DR: GeoExplorer improves Active Geo-localization (AGL) by using curiosity-driven exploration instead of distance-based rewards, enhancing robustness and generalization in unseen scenarios.

DetailsMotivation: Current AGL methods struggle with unreliable exploration in challenging or unfamiliar environments due to distance-based rewards.

Method: Proposes GeoExplorer, an AGL agent with curiosity-driven intrinsic rewards for goal-agnostic, robust exploration.

Result: Outperforms benchmarks, showing strong generalization in diverse and unfamiliar settings.

Conclusion: Curiosity-driven exploration significantly improves AGL performance, especially for unseen targets and environments.

Abstract: Active Geo-localization (AGL) is the task of localizing a goal, represented in various modalities (e.g., aerial images, ground-level images, or text), within a predefined search area. Current methods approach AGL as a goal-reaching reinforcement learning (RL) problem with a distance-based reward. They localize the goal by implicitly learning to minimize the relative distance from it. However, when distance estimation becomes challenging or when encountering unseen targets and environments, the agent exhibits reduced robustness and generalization ability due to the less reliable exploration strategy learned during training. In this paper, we propose GeoExplorer, an AGL agent that incorporates curiosity-driven exploration through intrinsic rewards. Unlike distance-based rewards, our curiosity-driven reward is goal-agnostic, enabling robust, diverse, and contextually relevant exploration based on effective environment modeling. These capabilities have been proven through extensive experiments across four AGL benchmarks, demonstrating the effectiveness and generalization ability of GeoExplorer in diverse settings, particularly in localizing unfamiliar targets and environments.
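
The intrinsic reward can be pictured as a forward-model prediction error, as in ICM-style curiosity agents; the sketch below assumes learned state embeddings and is not GeoExplorer’s exact reward design:

```python
# Sketch of a curiosity-style intrinsic reward: the agent is rewarded where
# its environment model predicts poorly, which is goal-agnostic by construction.
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    def __init__(self, state_dim=256, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def intrinsic_reward(model, state, action, next_state):
    # Prediction error of the next observation embedding: high in novel,
    # poorly modelled regions, independent of any distance to the goal.
    return 0.5 * (model(state, action) - next_state).pow(2).mean(dim=-1)
```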

[82] Cued-Agent: A Collaborative Multi-Agent System for Automatic Cued Speech Recognition

Guanjie Huang, Danny H. K. Tsang, Shan Yang, Guangzhi Lei, Li Liu

Main category: cs.CV

TL;DR: The paper introduces Cued-Agent, a collaborative multi-agent system for Automatic Cued Speech Recognition (ACSR), addressing challenges in multimodal fusion and limited data by integrating specialized sub-agents for hand and lip recognition, dynamic prompt decoding, and semantic refinement.

DetailsMotivation: The motivation is to improve ACSR by overcoming the temporal asynchrony between hand and lip movements and the limitations of current methods due to insufficient training data.

Method: The method involves a multi-agent system with four sub-agents: Hand Recognition (using keyframe screening and expert prompts), Lip Recognition (Transformer-based), Hand Prompt Decoding (training-free integration), and Self-Correction Phoneme-to-Word (semantic refinement).

Result: The system outperforms state-of-the-art methods in both normal and hearing-impaired scenarios, validated on an expanded Mandarin CS dataset.

Conclusion: Cued-Agent demonstrates superior performance in ACSR by effectively leveraging multimodal fusion and semantic refinement, offering a promising solution for hearing-impaired communication.

Abstract: Cued Speech (CS) is a visual communication system that combines lip-reading with hand coding to facilitate communication for individuals with hearing impairments. Automatic CS Recognition (ACSR) aims to convert CS hand gestures and lip movements into text via AI-driven methods. Traditionally, the temporal asynchrony between hand and lip movements requires the design of complex modules to facilitate effective multimodal fusion. However, constrained by limited data availability, current methods demonstrate insufficient capacity for adequately training these fusion mechanisms, resulting in suboptimal performance. Recently, multi-agent systems have shown promising capabilities in handling complex tasks with limited data availability. To this end, we propose the first collaborative multi-agent system for ACSR, named Cued-Agent. It integrates four specialized sub-agents: a Multimodal Large Language Model-based Hand Recognition agent that employs keyframe screening and CS expert prompt strategies to decode hand movements, a pretrained Transformer-based Lip Recognition agent that extracts lip features from the input video, a Hand Prompt Decoding agent that dynamically integrates hand prompts with lip features during inference in a training-free manner, and a Self-Correction Phoneme-to-Word agent that enables post-processing and end-to-end conversion from phoneme sequences to natural language sentences for the first time through semantic refinement. To support this study, we expand the existing Mandarin CS dataset by collecting data from eight hearing-impaired cuers, establishing a mixed dataset of fourteen subjects. Extensive experiments demonstrate that our Cued-Agent performs superbly in both normal and hearing-impaired scenarios compared with state-of-the-art methods. The implementation is available at https://github.com/DennisHgj/Cued-Agent.

[83] Robust 3D Object Detection using Probabilistic Point Clouds from Single-Photon LiDARs

Bhavya Goyal, Felipe Gutierrez-Barragan, Wei Lin, Andreas Velten, Yin Li, Mohit Gupta

Main category: cs.CV

TL;DR: The paper introduces Probabilistic Point Clouds (PPC), a 3D representation that includes uncertainty attributes for each point, improving robustness in 3D object detection by leveraging raw LiDAR measurement uncertainties.

DetailsMotivation: LiDAR-based 3D sensors often produce sparse or noisy point clouds, especially for distant or low-albedo objects, leading to accuracy loss in downstream tasks. Conventional pipelines ignore raw measurement uncertainty.

Method: PPC augments each point with a probability attribute to encapsulate measurement uncertainty. Lightweight inference methods using PPC are introduced for robust 3D object detection.

Result: PPC-based methods outperform baseline LiDAR and fusion models in challenging scenarios, including small, distant, and low-albedo objects, as well as strong ambient light.

Conclusion: PPC enhances 3D perception by incorporating uncertainty, offering a versatile and effective solution for improving accuracy in real-world scenarios.

Abstract: LiDAR-based 3D sensors provide point clouds, a canonical 3D representation used in various scene understanding tasks. Modern LiDARs face key challenges in several real-world scenarios, such as long-distance or low-albedo objects, producing sparse or erroneous point clouds. These errors, which are rooted in the noisy raw LiDAR measurements, get propagated to downstream perception models, resulting in potentially severe loss of accuracy. This is because conventional 3D processing pipelines do not retain any uncertainty information from the raw measurements when constructing point clouds. We propose Probabilistic Point Clouds (PPC), a novel 3D scene representation where each point is augmented with a probability attribute that encapsulates the measurement uncertainty (or confidence) in the raw data. We further introduce inference approaches that leverage PPC for robust 3D object detection; these methods are versatile and can be used as computationally lightweight drop-in modules in 3D inference pipelines. We demonstrate, via both simulations and real captures, that PPC-based 3D inference methods outperform several baselines using LiDAR as well as camera-LiDAR fusion models, across challenging indoor and outdoor scenarios involving small, distant, and low-albedo objects, as well as strong ambient light. Our project webpage is at https://bhavyagoyal.github.io/ppc .
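
A toy sketch of the representation: each point carries a confidence attribute and uncertain points are downweighted in downstream computations. The per-point confidence proxy below is an assumption, much simpler than the paper’s sensor model:

```python
# Toy Probabilistic Point Cloud: points carry (x, y, z, p), where p encodes
# measurement confidence derived from raw single-photon statistics.
import numpy as np

def make_ppc(xyz, photon_counts, background_rate):
    # Assumed confidence proxy: signal photons vs. background rate.
    conf = photon_counts / (photon_counts + background_rate)
    return np.concatenate([xyz, conf[:, None]], axis=1)  # (N, 4)

def confidence_weighted_centroid(ppc):
    xyz, p = ppc[:, :3], ppc[:, 3:4]
    return (xyz * p).sum(axis=0) / p.sum()  # low-confidence points count less

ppc = make_ppc(np.random.rand(100, 3), np.random.poisson(5.0, 100), 2.0)
print(confidence_weighted_centroid(ppc))
```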

[84] FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

Taekyung Ki, Dongchan Min, Gyeongsu Chae

Main category: cs.CV

TL;DR: FLOAT introduces a flow matching generative model for audio-driven talking portrait video generation, addressing challenges in temporal consistency and fast sampling with a learned motion latent space and transformer-based predictor.

DetailsMotivation: Overcoming limitations of diffusion-based models in temporally consistent video generation and fast sampling for portrait animation.

Method: Uses a learned orthogonal motion latent space and a transformer-based vector field predictor with frame-wise conditioning for efficient motion generation and editing.

Result: Outperforms state-of-the-art methods in visual quality, motion fidelity, and efficiency, with support for speech-driven emotion enhancement.

Conclusion: FLOAT provides a robust solution for high-quality, efficient, and expressive audio-driven talking portrait video generation.

Abstract: With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on flow matching generative model. Instead of a pixel-based latent space, we take advantage of a learned orthogonal motion latent space, enabling efficient generation and editing of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with an effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.
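
For reference, a generic flow-matching training step on latents looks roughly like the following; FLOAT’s motion latent space, conditioning mechanism, and predictor are richer than this bare sketch assumes:

```python
# Generic (rectified) flow-matching objective: regress the constant velocity
# along a straight path between noise and data latents.
import torch

def flow_matching_loss(v_theta, x0, x1, cond):
    # x0: noise sample, x1: data latent (same shape), cond: audio/identity cond.
    t = torch.rand(x1.size(0), *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1 - t) * x0 + t * x1            # linear probability path
    target = x1 - x0                       # constant velocity along the path
    pred = v_theta(xt, t.flatten(), cond)  # vector-field predictor
    return (pred - target).pow(2).mean()
```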

[85] On the Risk of Misleading Reports: Diagnosing Textual Biases in Multimodal Clinical AI

David Restrepo, Ira Ktena, Maria Vakalopoulou, Stergios Christodoulidis, Enzo Ferrante

Main category: cs.CV

TL;DR: The paper introduces Selective Modality Shifting (SMS) to quantify biases in Vision-Language Models (VLMs) toward text over images in medical tasks. It evaluates six VLMs on medical datasets, revealing strong text dependency despite visual cues.

DetailsMotivation: To address biases in VLMs that favor text over images in clinical decision-making, which can overlook critical visual information.

Method: Uses SMS, a perturbation-based approach, to swap images or text between samples with opposing labels, exposing modality-specific biases. Evaluates six VLMs on MIMIC-CXR and FairVLMed datasets.

Result: VLMs exhibit a strong reliance on text input, overshadowing visual cues, even in multimodal medical tasks. Qualitative analysis confirms this bias.

Conclusion: Highlights the need for designing multimodal models that genuinely integrate both visual and textual cues, not just relying on one modality.

Abstract: Clinical decision-making relies on the integrated analysis of medical images and the associated clinical reports. While Vision-Language Models (VLMs) can offer a unified framework for such tasks, they can exhibit strong biases toward one modality, frequently overlooking critical visual cues in favor of textual information. In this work, we introduce Selective Modality Shifting (SMS), a perturbation-based approach to quantify a model’s reliance on each modality in binary classification tasks. By systematically swapping images or text between samples with opposing labels, we expose modality-specific biases. We assess six open-source VLMs (four generalist models and two fine-tuned for medical data) on two medical imaging datasets with distinct modalities: MIMIC-CXR (chest X-ray) and FairVLMed (scanning laser ophthalmoscopy). By assessing the performance and calibration of every model in both unperturbed and perturbed settings, we reveal a marked dependency on text input, which persists despite the presence of complementary visual information. We also perform a qualitative attention-based analysis which further confirms that image content is often overshadowed by text details. Our findings highlight the importance of designing and evaluating multimodal medical models that genuinely integrate visual and textual cues, rather than relying on single-modality signals.
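
The core perturbation is easy to sketch: swap the text between samples with opposing labels and measure how often the prediction follows the swapped modality. The `model` callable and the flip-rate scoring below are hypothetical stand-ins:

```python
# Sketch of an SMS-style probe: if predictions flip whenever the report is
# swapped with one from the opposing class, the model is text-driven.
def sms_text_flip_rate(model, samples):
    # model: hypothetical callable (image, text) -> binary prediction.
    # samples: dicts with keys 'image', 'text', 'label'.
    pos = [s for s in samples if s["label"] == 1]
    neg = [s for s in samples if s["label"] == 0]
    pairs = list(zip(pos, neg))
    flips = 0
    for a, b in pairs:
        baseline = model(a["image"], a["text"])
        shifted = model(a["image"], b["text"])  # text from the opposing class
        flips += int(baseline != shifted)       # prediction followed the text
    return flips / max(len(pairs), 1)           # 1.0 means fully text-driven
```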

[86] IN2OUT: Fine-Tuning Video Inpainting Model for Video Outpainting Using Hierarchical Discriminator

Sangwoo Youn, Minji Lee, Nokap Tony Park, Yeonggyoo Jeon, Taeyoung Na

Main category: cs.CV

TL;DR: The paper proposes using video inpainting models for outpainting, introducing a hierarchical discriminator and specialized loss to improve perceptual quality and coherence.

DetailsMotivation: Existing methods for video outpainting often produce blurry results when directly applying inpainting models, highlighting the need for better perceptual quality assessment.

Method: The authors differentiate adversarial training into global and local goals, introducing a hierarchical discriminator and a specialized outpainting loss function.

Result: The proposed method outperforms state-of-the-art approaches, producing visually appealing and globally coherent outpainted scenes.

Conclusion: The hierarchical discriminator and specialized loss function effectively enhance video outpainting, addressing previous limitations.

Abstract: Video outpainting presents a unique challenge of extending the borders while maintaining consistency with the given content. In this paper, we suggest using video inpainting models, which excel at object flow learning and reconstruction, for outpainting rather than solely generating the background as in existing methods. However, directly applying or fine-tuning inpainting models for outpainting has proven ineffective, often leading to blurry results. Our extensive experiments on discriminator designs reveal that a critical component missing in the outpainting fine-tuning process is a discriminator capable of effectively assessing the perceptual quality of the extended areas. To tackle this limitation, we differentiate the objectives of adversarial training into global and local goals and introduce a hierarchical discriminator that meets both objectives. Additionally, we develop a specialized outpainting loss function that leverages both local and global features of the discriminator. Fine-tuning on this adversarial loss function enhances the generator’s ability to produce both visually appealing and globally coherent outpainted scenes. Our proposed method outperforms state-of-the-art methods both quantitatively and qualitatively. Supplementary materials including the demo video and the code are available in SigPort.
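
A rough sketch of splitting the adversarial objective into a global (whole-frame coherence) term and a local (extended-border quality) term; the discriminator interfaces and the weighting are assumptions, not the paper’s exact loss:

```python
# Sketch of a generator-side adversarial loss with global and local goals.
import torch
import torch.nn.functional as F

def outpainting_adv_loss(d_global, d_local, frames, border_mask, w_local=1.0):
    # d_global scores whole-frame coherence; d_local scores only the
    # outpainted border region selected by border_mask.
    g_logits = d_global(frames)
    l_logits = d_local(frames * border_mask)
    return (F.binary_cross_entropy_with_logits(g_logits, torch.ones_like(g_logits))
            + w_local * F.binary_cross_entropy_with_logits(l_logits, torch.ones_like(l_logits)))
```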

[87] Graph Lineages and Skeletal Graph Products

Eric Mjolsness, Cory B. Scott

Main category: cs.CV

TL;DR: The paper introduces structured graph lineages for hierarchical model architectures, with exponential growth and skeletal algebraic operations, applicable to deep learning and multigrid methods.

DetailsMotivation: To define hierarchical graph structures for mathematical models, enabling efficient operations and applications in machine learning and computational science.

Method: Defines graph lineages with exponential growth, bipartite connections, and prolongation maps. Introduces skeletal algebraic operations and unary operators like thickening and escalation.

Result: Develops an algebraic type theory for graded graphs and hierarchical lineages, with applications in deep neural networks and multigrid methods.

Conclusion: The framework is suitable for hierarchical model architectures and local algorithms, demonstrated in deep learning and numerical methods.

Abstract: Graphs, and sequences of growing graphs, can be used to specify the architecture of mathematical models in many fields including machine learning and computational science. Here we define structured graph “lineages” (ordered by level number) that grow in a hierarchical fashion, so that: (1) the number of graph vertices and edges increases exponentially in level number; (2) bipartite graphs connect successive levels within a graph lineage and, as in multigrid methods, can constrain matrices relating successive levels; (3) using prolongation maps within a graph lineage, process-derived distance measures between graphs at successive levels can be defined; (4) a category of “graded graphs” can be defined, and using it low-cost “skeletal” variants of standard algebraic graph operations and type constructors (cross product, box product, disjoint sum, and function types) can be derived for graded graphs and hence hierarchical graph lineages; (5) these skeletal binary operators have similar but not identical algebraic and category-theoretic properties to their standard counterparts; (6) graph lineages and their skeletal product constructors can approach continuum limit objects. Additional space-efficient unary operators on graded graphs are also derived: thickening, which creates a graph lineage of multiscale graphs, and escalation to a graph lineage of search frontiers (useful as a generalization of adaptive grids and in defining “skeletal” functions). The result is an algebraic type theory for graded graphs and (hierarchical) graph lineages. The approach is expected to be well suited to defining hierarchical model architectures - “hierarchitectures” - and local sampling, search, or optimization algorithms on them. We demonstrate such application to deep neural networks (including visual and feature scale spaces) and to multigrid numerical methods.

[88] Querying Autonomous Vehicle Point Clouds: Enhanced by 3D Object Counting with CounterNet

Xiaoyu Zhang, Zhifeng Bao, Hai Dong, Ziwei Wang, Jiajun Liu

Main category: cs.CV

TL;DR: CounterNet improves object counting accuracy in point cloud data for autonomous vehicles, enhancing query reliability.

DetailsMotivation: Existing methods for 3D point cloud data often fail in accurate object counting, leading to errors in query results.

Method: Proposes CounterNet, a heatmap-based network with feature map partitioning and dynamic model selection for better counting accuracy.

Result: Improves counting accuracy by 5% to 20% across object categories in real-world datasets.

Conclusion: CounterNet significantly enhances query reliability by addressing counting inaccuracies in point cloud data.

Abstract: Autonomous vehicles generate massive volumes of point cloud data, yet only a subset is relevant for specific tasks such as collision detection, traffic analysis, or congestion monitoring. Effectively querying this data is essential to enable targeted analytics. In this work, we formalize point cloud querying by defining three core query types: RETRIEVAL, COUNT, and AGGREGATION, each aligned with distinct analytical scenarios. All these queries rely heavily on accurate object counts to produce meaningful results, making precise object counting a critical component of query execution. Prior work has focused on indexing techniques for 2D video data, assuming detection models provide accurate counting information. However, when applied to 3D point cloud data, state-of-the-art detection models often fail to generate reliable object counts, leading to substantial errors in query results. To address this limitation, we propose CounterNet, a heatmap-based network designed for accurate object counting in large-scale point cloud data. Rather than focusing on accurate object localization, CounterNet detects object presence by finding object centers to improve counting accuracy. We further enhance its performance with a feature map partitioning strategy using overlapping regions, enabling better handling of both small and large objects in complex traffic scenes. To adapt to varying frame characteristics, we introduce a per-frame dynamic model selection strategy that selects the most effective configuration for each input. Evaluations on three real-world autonomous vehicle datasets show that CounterNet improves counting accuracy by 5% to 20% across object categories, resulting in more reliable query outcomes across all supported query types.
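
Counting by center detection can be sketched as local-maximum extraction on the predicted heatmap; the threshold and window size are placeholders, and CounterNet’s decoding is more elaborate:

```python
# Sketch: count objects as confident local maxima of a center heatmap,
# sidestepping full 3D box localization.
import numpy as np
from scipy.ndimage import maximum_filter

def count_from_heatmap(heatmap, threshold=0.5, window=3):
    # A cell counts as an object center if it is a local maximum and confident.
    is_peak = heatmap == maximum_filter(heatmap, size=window)
    return int(np.sum(is_peak & (heatmap >= threshold)))

print(count_from_heatmap(np.random.rand(200, 200)))  # toy input
```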

[89] Semantic and Temporal Integration in Latent Diffusion Space for High-Fidelity Video Super-Resolution

Yiwen Wang, Xinning Chai, Yuhong Zhang, Zhengxue Cheng, Jun Zhao, Rong Xie, Li Song

Main category: cs.CV

TL;DR: SeTe-VSR introduces semantic and temporal-spatio guidance in latent diffusion space to improve video super-resolution, balancing detail recovery and temporal coherence.

DetailsMotivation: Existing VSR models struggle with high-fidelity alignment and temporal consistency due to limited control over the generation process.

Method: Incorporates high-level semantic and temporal-spatio guidance in latent diffusion space.

Result: Outperforms existing methods in detail recovery and perceptual quality while maintaining temporal coherence.

Conclusion: SeTe-VSR effectively addresses challenges in video super-resolution, offering superior performance in complex tasks.

Abstract: Recent advancements in video super-resolution (VSR) models have demonstrated impressive results in enhancing low-resolution videos. However, due to limitations in adequately controlling the generation process, achieving high fidelity alignment with the low-resolution input while maintaining temporal consistency across frames remains a significant challenge. In this work, we propose Semantic and Temporal Guided Video Super-Resolution (SeTe-VSR), a novel approach that incorporates both semantic and temporal-spatio guidance in the latent diffusion space to address these challenges. By incorporating high-level semantic information and integrating spatial and temporal information, our approach achieves a seamless balance between recovering intricate details and ensuring temporal coherence. Our method not only preserves high-reality visual content but also significantly enhances fidelity. Extensive experiments demonstrate that SeTe-VSR outperforms existing methods in terms of detail recovery and perceptual quality, highlighting its effectiveness for complex video super-resolution tasks.

[90] Learning Personalised Human Internal Cognition from External Expressive Behaviours for Real Personality Recognition

Xiangyu Kong, Hengde Zhu, Haoqin Sun, Zhihao Guo, Jiayan Gu, Xinyi Ni, Wei Zhang, Shizhe Liu, Siyang Song

Main category: cs.CV

TL;DR: The paper introduces a novel approach for real personality recognition (RPR) by simulating personalized internal cognition from external audio-visual behaviors, using a 2D-GNN for inference.

DetailsMotivation: Existing RPR methods infer personality impressions as external observers, often deviating from real personalities and yielding poor performance. The study aims to bridge this gap by leveraging the link between real personality and internal cognition.

Method: The proposed method simulates personalized internal cognition from audio-visual behaviors, encodes it as a 2D graph, and uses a 2D-GNN for personality trait inference. An end-to-end training strategy integrates cognition simulation, graph construction, and recognition.

Result: The approach efficiently captures real personality traits by modeling internal cognition, outperforming traditional observer-based methods.

Conclusion: The novel RPR method, combining cognition simulation and 2D-GNN, offers improved accuracy in recognizing real personality traits from expressive behaviors.

Abstract: Automatic real personality recognition (RPR) aims to evaluate humans’ real personality traits from their expressive behaviours. However, most existing solutions generally act as external observers to infer observers’ personality impressions based on target individuals’ expressive behaviours, which significantly deviate from their real personalities and consistently lead to inferior recognition performance. Inspired by the association between real personality and the human internal cognition underlying the generation of expressive behaviours, we propose a novel RPR approach that efficiently simulates personalised internal cognition from easily accessible short external audio-visual behaviours expressed by the target individual. The simulated personalised cognition, represented as a set of network weights that force the personalised network to reproduce the individual-specific facial reactions, is further encoded as a novel graph containing two-dimensional node and edge feature matrices, with a novel 2D Graph Neural Network (2D-GNN) proposed for inferring real personality traits from it. To simulate real personality-related cognition, an end-to-end strategy is designed to jointly train our cognition simulation, 2D graph construction, and personality recognition modules.

[91] A Novel Modeling Framework and Data Product for Extended VIIRS-like Artificial Nighttime Light Image Reconstruction (1986-2024)

Yihe Tian, Kwan Man Cheng, Zhengbo Zhang, Tao Zhang, Suju Li, Dongmei Yan, Bing Xu

Main category: cs.CV

TL;DR: The paper introduces a novel framework, EVAL, to reconstruct high-quality artificial night-time light (NTL) data, extending VIIRS-like NTL time-series back to 1986 for China, outperforming existing methods.

DetailsMotivation: Existing NTL time-series methods suffer from underestimation of light intensity and structural omission, limiting long-term studies.

Method: A two-stage reconstruction framework: construction (Hierarchical Fusion Decoder) and refinement (Dual Feature Refiner using impervious surface masks).

Result: EVAL improves R² from 0.68 to 0.80 and reduces RMSE from 1.27 to 0.99, with high temporal consistency and socioeconomic correlation.

Conclusion: EVAL provides a reliable, publicly available resource for long-term NTL analysis.

Abstract: Artificial Night-Time Light (NTL) remote sensing is a vital proxy for quantifying the intensity and spatial distribution of human activities. Although the NPP-VIIRS sensor provides high-quality NTL observations, its temporal coverage, which begins in 2012, restricts long-term time-series studies that extend to earlier periods. Despite the progress in extending VIIRS-like NTL time-series, current methods still suffer from two significant shortcomings: the underestimation of light intensity and the structural omission. To overcome these limitations, we propose a novel reconstruction framework consisting of a two-stage process: construction and refinement. The construction stage features a Hierarchical Fusion Decoder (HFD) designed to enhance the fidelity of the initial reconstruction. The refinement stage employs a Dual Feature Refiner (DFR), which leverages high-resolution impervious surface masks to guide and enhance fine-grained structural details. Based on this framework, we developed the Extended VIIRS-like Artificial Nighttime Light (EVAL) product for China, extending the standard data record backwards by 26 years to begin in 1986. Quantitative evaluation shows that EVAL significantly outperforms existing state-of-the-art products, boosting the $\text{R}^2$ from 0.68 to 0.80 while lowering the RMSE from 1.27 to 0.99. Furthermore, EVAL exhibits excellent temporal consistency and maintains a high correlation with socioeconomic parameters, confirming its reliability for long-term analysis. The resulting EVAL dataset provides a valuable new resource for the research community and is publicly available at https://doi.org/10.11888/HumanNat.tpdc.302930.

[92] SAM-PTx: Text-Guided Fine-Tuning of SAM with Parameter-Efficient, Parallel-Text Adapters

Shayan Jalilian, Abdul Bais

Main category: cs.CV

TL;DR: SAM-PTx enhances SAM by using CLIP-derived text embeddings for semantic segmentation, improving performance over spatial prompts with minimal architectural changes.

DetailsMotivation: The potential of semantic text prompts in segmentation is underexplored compared to spatial prompts like points and boxes.

Method: Introduces SAM-PTx, a lightweight adapter (Parallel-Text) injecting text embeddings into SAM’s image encoder while keeping most architecture frozen.

Result: Improves segmentation performance on COD10K, COCO, and ADE20K datasets over spatial prompt baselines.

Conclusion: Semantic conditioning in SAM offers a practical, scalable adaptation with low computational complexity.

Abstract: The Segment Anything Model (SAM) has demonstrated impressive generalization in prompt-based segmentation. Yet, the potential of semantic text prompts remains underexplored compared to traditional spatial prompts like points and boxes. This paper introduces SAM-PTx, a parameter-efficient approach for adapting SAM using frozen CLIP-derived text embeddings as class-level semantic guidance. Specifically, we propose a lightweight adapter design called Parallel-Text that injects text embeddings into SAM’s image encoder, enabling semantics-guided segmentation while keeping most of the original architecture frozen. Our adapter modifies only the MLP-parallel branch of each transformer block, preserving the attention pathway for spatial reasoning. Through supervised experiments and ablations on the COD10K dataset as well as low-data subsets of COCO and ADE20K, we show that incorporating fixed text embeddings as input improves segmentation performance over purely spatial prompt baselines. To our knowledge, this is the first work to use text prompts for segmentation on the COD10K dataset. These results suggest that integrating semantic conditioning into SAM’s architecture offers a practical and scalable path for efficient adaptation with minimal computational complexity.
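
A hedged sketch of a parallel-text adapter placed alongside a transformer block’s MLP; the dimensions and the fusion rule are assumptions rather than the paper’s exact design:

```python
# Sketch: a bottleneck adapter, parallel to the MLP branch, that conditions
# SAM image tokens on a frozen CLIP-derived class text embedding.
import torch
import torch.nn as nn

class ParallelTextAdapter(nn.Module):
    def __init__(self, dim=768, text_dim=512, bottleneck=64):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)  # project CLIP text embedding
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, tokens, text_emb):
        # tokens: (B, N, dim) image tokens; text_emb: (B, text_dim) class prompt.
        conditioned = tokens + self.text_proj(text_emb).unsqueeze(1)
        return self.up(self.act(self.down(conditioned)))  # added to the MLP output
```

The attention pathway stays untouched, which matches the stated goal of preserving spatial reasoning while injecting semantics only through the MLP-parallel branch.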

[93] SU-ESRGAN: Semantic and Uncertainty-Aware ESRGAN for Super-Resolution of Satellite and Drone Imagery with Fine-Tuning for Cross Domain Evaluation

Prerana Ramkumar

Main category: cs.CV

TL;DR: SU-ESRGAN enhances ESRGAN for satellite imagery by adding semantic consistency and uncertainty maps, improving credibility for critical applications.

DetailsMotivation: GANs lack semantic consistency and per-pixel confidence, limiting their use in remote sensing.

Method: Integrates ESRGAN, DeepLabv3 for segmentation loss, and Monte Carlo dropout for uncertainty maps.

Result: Comparable performance to Baseline ESRGAN, with better adaptation to domain-aligned datasets.

Conclusion: Domain-aware training is crucial for SR applications, and SU-ESRGAN is modular for UAV pipelines.

Abstract: Generative Adversarial Networks (GANs) have achieved realistic super-resolution (SR) of images; however, they lack semantic consistency and per-pixel confidence, limiting their credibility in critical remote sensing applications such as disaster response, urban planning, and agriculture. This paper introduces Semantic and Uncertainty-Aware ESRGAN (SU-ESRGAN), the first SR framework designed for satellite imagery to integrate ESRGAN, a segmentation loss via DeepLabv3 for class detail preservation, and Monte Carlo dropout to produce pixel-wise uncertainty maps. SU-ESRGAN produces results (PSNR, SSIM, LPIPS) comparable to the baseline ESRGAN on aerial imagery. This model is valuable in satellite systems or UAVs that use wide field-of-view (FoV) cameras, trading off spatial resolution for coverage. The modular design allows integration into UAV data pipelines for on-board or post-processing SR to enhance imagery degraded by motion blur, compression, and sensor limitations. Further, the model is fine-tuned to evaluate its performance on cross-domain applications. Tests are conducted on two drone-based datasets that differ in altitude and imaging perspective. Performance evaluation of the fine-tuned models shows stronger adaptation to the Aerial Maritime Drone Dataset, whose imaging characteristics align with the training data, highlighting the importance of domain-aware training in SR applications.
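
Monte Carlo dropout uncertainty maps can be sketched as follows: keep dropout active at test time, run several stochastic passes, and read the per-pixel variance as (un)certainty. The pass count is arbitrary here:

```python
# Sketch of MC-dropout inference for per-pixel uncertainty in SR outputs.
import torch

@torch.no_grad()
def mc_dropout_sr(model, lr_image, n_passes=20):
    # .train() keeps dropout stochastic (note: it also flips any norm layers;
    # real code would enable only the dropout modules).
    model.train()
    outs = torch.stack([model(lr_image) for _ in range(n_passes)])
    return outs.mean(0), outs.var(0)  # SR estimate, per-pixel uncertainty map
```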

[94] Object-Centric Cropping for Visual Few-Shot Classification

Aymane Abdali, Bartosz Boguslawski, Lucas Drumetz, Vincent Gripon

Main category: cs.CV

TL;DR: Incorporating object positioning info and using SAM or unsupervised methods improves few-shot image classification.

DetailsMotivation: Address performance degradation in few-shot image classification due to image ambiguities like multiple objects or complex backgrounds.

Method: Use local positioning info of objects, SAM (requiring just a pixel of the object), or unsupervised foreground extraction.

Result: Significant improvement in classification performance across benchmarks.

Conclusion: Local object positioning info and simple annotation or unsupervised methods can enhance few-shot classification.

Abstract: In the domain of Few-Shot Image Classification, operating with as little as one example per class, the presence of image ambiguities stemming from multiple objects or complex backgrounds can significantly deteriorate performance. Our research demonstrates that incorporating additional information about the local positioning of an object within its image markedly enhances classification across established benchmarks. More importantly, we show that a significant fraction of the improvement can be achieved through the use of the Segment Anything Model, requiring only a pixel of the object of interest to be pointed out, or by employing fully unsupervised foreground object extraction methods.
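
With the segment-anything package, the single-pixel prompt amounts to the following usage sketch; the checkpoint path and the crop logic are placeholders:

```python
# Sketch: crop the object of interest from one foreground pixel using SAM,
# before feeding the crop to a few-shot classifier.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder path
predictor = SamPredictor(sam)

def object_centric_crop(image, point_xy):
    predictor.set_image(image)                   # RGB uint8 array (H, W, 3)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([point_xy]),       # one pixel on the object
        point_labels=np.array([1]),              # 1 = foreground
        multimask_output=True,
    )
    mask = masks[np.argmax(scores)]              # keep the best-scoring mask
    ys, xs = np.where(mask)
    return image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```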

[95] Guided Depth Map Super-Resolution via Multi-Scale Fusion U-shaped Mamba Network

Chenggang Guo, Hao Xu, XianMing Wan

Main category: cs.CV

TL;DR: The paper proposes a multi-scale fusion U-shaped Mamba (MSF-UM) model for depth map super-resolution, combining Mamba’s state-space modeling with a U-shaped structure to improve resolution and detail restoration.

DetailsMotivation: Traditional CNNs struggle with long-range dependencies, and transformers are computationally expensive for high-resolution depth maps. The goal is to efficiently model global context and leverage color images for guidance.

Method: The MSF-UM integrates Mamba’s state-space modeling into a multi-scale U-shaped fusion structure, using residual dense channel attention blocks and cross-modal fusion with color images.

Result: The model reduces parameters while achieving better accuracy, validated on public datasets, and excels in large-scale depth map super-resolution.

Conclusion: MSF-UM effectively balances efficiency and performance, demonstrating superior generalization in depth map super-resolution tasks.

Abstract: Depth map super-resolution technology aims to improve the spatial resolution of low-resolution depth maps and effectively restore high-frequency detail information. Traditional convolutional neural networks have limitations in dealing with long-range dependencies and are unable to fully model the global contextual information in depth maps. Although transformers can model global dependencies, their computational complexity and memory consumption are quadratic, which significantly limits their ability to process high-resolution depth maps. In this paper, we propose a multi-scale fusion U-shaped Mamba (MSF-UM) model, a novel guided depth map super-resolution framework. The core innovation of this model is to integrate Mamba’s efficient state-space modeling capabilities into a multi-scale U-shaped fusion structure guided by a color image. A structure combining residual dense channel attention blocks and the Mamba state-space module is designed, pairing the local feature extraction capability of convolutional layers with the state-space model’s advantage in modeling long-distance dependencies. At the same time, the model adopts a multi-scale cross-modal fusion strategy to make full use of the high-frequency texture information from the color image to guide the super-resolution process of the depth map. Compared with existing mainstream methods, the proposed MSF-UM significantly reduces the number of model parameters while achieving better reconstruction accuracy. Extensive experiments on multiple publicly available datasets validate the effectiveness of the model, especially showing excellent generalization ability in the task of large-scale depth map super-resolution.

[96] PointGauss: Point Cloud-Guided Multi-Object Segmentation for Gaussian Splatting

Wentao Sun, Hanqing Xu, Quanyun Wu, Dedong Zhang, Yiping Chen, Lingfei Ma, John S. Zelek, Jonathan Li

Main category: cs.CV

TL;DR: PointGauss is a real-time multi-object segmentation framework for Gaussian Splatting, using point cloud guidance for efficiency and multi-view consistency. It outperforms existing methods and introduces a new dataset, DesktopObjects-360.

DetailsMotivation: Existing methods for 3D segmentation in Gaussian Splatting suffer from slow initialization and poor multi-view consistency.

Method: The framework parses Gaussian primitives via a point cloud segmentation pipeline, featuring a point cloud-based decoder and GPU-accelerated 2D mask rendering.

Result: Achieves 1.89 to 31.78% performance gains in multi-view mIoU while maintaining computational efficiency.

Conclusion: PointGauss improves 3D segmentation efficiency and consistency, supported by the new DesktopObjects-360 dataset for comprehensive evaluation.

Abstract: We introduce PointGauss, a novel point cloud-guided framework for real-time multi-object segmentation in Gaussian Splatting representations. Unlike existing methods that suffer from prolonged initialization and limited multi-view consistency, our approach achieves efficient 3D segmentation by directly parsing Gaussian primitives through a point cloud segmentation-driven pipeline. The key innovation lies in two aspects: (1) a point cloud-based Gaussian primitive decoder that generates 3D instance masks within 1 minute, and (2) a GPU-accelerated 2D mask rendering system that ensures multi-view consistency. Extensive experiments demonstrate significant improvements over previous state-of-the-art methods, achieving performance gains of 1.89 to 31.78% in multi-view mIoU, while maintaining superior computational efficiency. To address the limitations of current benchmarks (single-object focus, inconsistent 3D evaluation, small scale, and partial coverage), we present DesktopObjects-360, a novel comprehensive dataset for 3D segmentation in radiance fields, featuring: (1) complex multi-object scenes, (2) globally consistent 2D annotations, (3) large-scale training data (over 27 thousand 2D masks), (4) full 360° coverage, and (5) 3D evaluation masks.

[97] Multimodal Referring Segmentation: A Survey

Henghui Ding, Song Tang, Shuting He, Chang Liu, Zuxuan Wu, Yu-Gang Jiang

Main category: cs.CV

TL;DR: A survey on multimodal referring segmentation, covering problem definitions, datasets, methods, and performance comparisons across images, videos, and 3D scenes.

DetailsMotivation: To address the need for accurate object perception based on user instructions in multimodal applications.

Method: Summarizes a unified meta architecture and reviews methods for images, videos, and 3D scenes, including Generalized Referring Expression (GREx) approaches.

Result: Provides extensive performance comparisons on standard benchmarks.

Conclusion: Highlights advancements and challenges in multimodal referring segmentation, with ongoing updates tracked on GitHub.

Abstract: Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions in text or audio format. This task plays a crucial role in practical applications requiring accurate object perception based on user instructions. Over the past decade, it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models, all of which have substantially improved multimodal perception capabilities. This paper provides a comprehensive survey of multimodal referring segmentation. We begin by introducing this field’s background, including problem definitions and commonly used datasets. Next, we summarize a unified meta architecture for referring segmentation and review representative methods across three primary visual scenes, including images, videos, and 3D scenes. We further discuss Generalized Referring Expression (GREx) methods to address the challenges of real-world complexity, along with related tasks and practical applications. Extensive performance comparisons on standard benchmarks are also provided. We continually track related works at https://github.com/henghuiding/Awesome-Multimodal-Referring-Segmentation.

[98] Towards Robust Semantic Correspondence: A Benchmark and Insights

Wenyue Chong

Main category: cs.CV

TL;DR: A benchmark for evaluating semantic correspondence in adverse conditions reveals performance drops in existing methods, with insights on robustness using large-scale vision models and task-specific enhancements.

DetailsMotivation: To address the lack of investigation into the robustness of semantic correspondence in challenging scenarios, the paper establishes a novel benchmark for adverse conditions.

Method: The study creates a benchmark dataset with 14 challenging scenarios (e.g., geometric distortion, blurring) and evaluates existing methods, including large-scale vision models like DINO and Stable Diffusion.

Result: Existing methods show performance drops in adverse conditions; DINO outperforms Stable Diffusion in robustness, and their fusion improves absolute robustness. General data augmentations are ineffective.

Conclusion: Task-specific designs are needed for robustness in semantic correspondence, as general methods fail under adverse conditions.

Abstract: Semantic correspondence aims to identify semantically meaningful relationships between different images and is a fundamental challenge in computer vision. It forms the foundation for numerous tasks such as 3D reconstruction, object tracking, and image editing. With the progress of large-scale vision models, semantic correspondence has achieved remarkable performance in controlled and high-quality conditions. However, the robustness of semantic correspondence in challenging scenarios is much less investigated. In this work, we establish a novel benchmark for evaluating semantic correspondence in adverse conditions. The benchmark dataset comprises 14 distinct challenging scenarios that reflect commonly encountered imaging issues, including geometric distortion, image blurring, digital artifacts, and environmental occlusion. Through extensive evaluations, we provide several key insights into the robustness of semantic correspondence approaches: (1) All existing methods suffer from noticeable performance drops under adverse conditions; (2) Using large-scale vision models can enhance overall robustness, but fine-tuning on these models leads to a decline in relative robustness; (3) The DINO model outperforms Stable Diffusion in relative robustness, and their fusion achieves better absolute robustness. Moreover, we evaluate common robustness enhancement strategies for semantic correspondence and find that general data augmentations are ineffective, highlighting the need for task-specific designs. These results are consistent across both our dataset and real-world benchmarks.

[99] Accurate Cross-modal Reconstruction of Vehicle Target from Sparse-aspect Multi-baseline SAR data

Da Li, Guoqiang Zhao, Chen Yao, Kaiqiang Zhu, Houjun Sun, Jiacheng Bao, Maokun Li

Main category: cs.CV

TL;DR: The paper introduces CMAR-Net, a cross-modal learning framework for 3D SAR reconstruction, outperforming traditional CS and DL methods by fusing SAR and optical data.

DetailsMotivation: Sparse observations in SAR imaging degrade quality, especially for small targets. Existing DL methods lack multimodal data integration, limiting performance.

Method: Proposes CMAR-Net, leveraging cross-modal supervision from 2D optical images and differentiable rendering for efficient training.

Result: CMAR-Net achieves superior reconstruction quality, especially for vehicles, and generalizes well to real-world data.

Conclusion: Cross-modal learning enhances 3D SAR reconstruction, offering a novel framework for radar imaging.

Abstract: Multi-aspect multi-baseline SAR 3D imaging is a critical remote sensing technique, promising in urban mapping and monitoring. However, sparse observations due to constrained flight trajectories degrade imaging quality, particularly for anisotropic small targets like vehicles and aircraft. In the past, compressive sensing (CS) was the mainstream approach for sparse 3D SAR reconstruction. More recently, deep learning (DL) has emerged as a powerful alternative, markedly boosting reconstruction quality and efficiency through strong data-driven representation capabilities and fast inference. However, existing DL methods typically train deep neural networks (DNNs) using only high-resolution radar images. This unimodal learning paradigm precludes the incorporation of complementary information from other data sources, thereby limiting potential improvements in reconstruction performance. In this paper, we introduce cross-modal learning and propose a Cross-Modal 3D-SAR Reconstruction Network (CMAR-Net) that enhances sparse 3D SAR reconstruction by fusing heterogeneous information. Leveraging cross-modal supervision from 2D optical images and error propagation guaranteed by differentiable rendering, CMAR-Net achieves efficient training and reconstructs highly sparse-aspect multi-baseline SAR images into visually structured and accurate 3D images, particularly for vehicle targets. Trained solely on simulated data, CMAR-Net exhibits strong generalization across extensive real-world evaluations on parking lot measurements containing numerous civilian vehicles, outperforming state-of-the-art CS and DL methods in structural accuracy. Our work highlights the potential of cross-modal learning for 3D SAR reconstruction and introduces a novel framework for radar imaging research.

[100] Privacy-Preserving Driver Drowsiness Detection with Spatial Self-Attention and Federated Learning

Tran Viet Khoa, Do Hai Son, Mohammad Abu Alsheikh, Yibeltal F Alem, Dinh Thai Hoang

Main category: cs.CV

TL;DR: Proposes a federated learning framework with Spatial Self-Attention and LSTM for accurate drowsiness detection using decentralized facial data, achieving 89.9% accuracy.

DetailsMotivation: Driver drowsiness is a major cause of road accidents, but detecting it accurately in diverse, decentralized data is challenging.

Method: Combines Spatial Self-Attention (SSA) with LSTM for feature extraction and uses Gradient Similarity Comparison (GSC) for federated learning. Includes automated video data processing.

Result: Achieves 89.9% detection accuracy in federated learning, outperforming existing methods.

Conclusion: The framework effectively handles real-world data variability and shows promise for enhancing road safety in intelligent transportation systems.

Abstract: Driver drowsiness is one of the main causes of road accidents and is recognized as a leading contributor to traffic-related fatalities. However, detecting drowsiness accurately remains a challenging task, especially in real-world settings where facial data from different individuals is decentralized and highly diverse. In this paper, we propose a novel framework for drowsiness detection that is designed to work effectively with heterogeneous and decentralized data. Our approach develops a new Spatial Self-Attention (SSA) mechanism integrated with a Long Short-Term Memory (LSTM) network to better extract key facial features and improve detection performance. To support federated learning, we employ a Gradient Similarity Comparison (GSC) that selects the most relevant trained models from different operators before aggregation. This improves the accuracy and robustness of the global model while preserving user privacy. We also develop a customized tool that automatically processes video data by extracting frames, detecting and cropping faces, and applying data augmentation techniques such as rotation, flipping, brightness adjustment, and zooming. Experimental results show that our framework achieves a detection accuracy of 89.9% in the federated learning settings, outperforming existing methods under various deployment scenarios. The results demonstrate the effectiveness of our approach in handling real-world data variability and highlight its potential for deployment in intelligent transportation systems to enhance road safety through early and reliable drowsiness detection.
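
A sketch of similarity-based update selection before federated averaging; the reference direction and the threshold are assumptions about how a GSC-style filter could work, not the paper’s exact rule:

```python
# Sketch: keep only client updates whose direction agrees with the mean
# update (cosine similarity) before averaging them into the global model.
import torch
import torch.nn.functional as F

def gsc_aggregate(client_updates, keep_threshold=0.0):
    # client_updates: list of flattened parameter-update tensors, one per operator.
    stacked = torch.stack(client_updates)                  # (C, P)
    reference = stacked.mean(dim=0, keepdim=True)          # mean update direction
    sims = F.cosine_similarity(stacked, reference, dim=1)  # (C,)
    selected = stacked[sims > keep_threshold]              # drop dissimilar clients
    return selected.mean(dim=0) if len(selected) > 0 else reference.squeeze(0)
```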

[101] Simultaneous Motion And Noise Estimation with Event Cameras

Shintaro Shiba, Yoshimitsu Aoki, Guillermo Gallego

Main category: cs.CV

TL;DR: A novel method for simultaneous motion estimation and noise denoising in event cameras, outperforming existing benchmarks.

DetailsMotivation: Event cameras' noise is hard to characterize, and existing methods treat motion estimation and denoising separately, ignoring their intrinsic connection.

Method: Proposes a flexible framework integrating motion estimation (e.g., ego-motion, optical flow) and noise denoising, compatible with various motion estimators like deep neural networks.

Result: Achieves state-of-the-art results on E-MLB and competitive results on DND21 benchmarks, excelling in motion estimation and intensity reconstruction.

Conclusion: Advances event-data denoising theory and offers practical applications with open-source code.

Abstract: Event cameras are emerging vision sensors whose noise is challenging to characterize. Existing denoising methods for event cameras are often designed in isolation and thus consider other tasks, such as motion estimation, separately (i.e., sequentially after denoising). However, motion is an intrinsic part of event data, since scene edges cannot be sensed without motion. We propose, to the best of our knowledge, the first method that simultaneously estimates motion in its various forms (e.g., ego-motion, optical flow) and noise. The method is flexible, as it allows replacing the one-step motion estimation of the widely-used Contrast Maximization framework with any other motion estimator, such as deep neural networks. The experiments show that the proposed method achieves state-of-the-art results on the E-MLB denoising benchmark and competitive results on the DND21 benchmark, while demonstrating effectiveness across motion estimation and intensity reconstruction tasks. Our approach advances event-data denoising theory and expands practical denoising use-cases via open-source code. Project page: https://github.com/tub-rip/ESMD
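
For readers unfamiliar with the Contrast Maximization framework the method builds on, the core objective is easy to state: warp events along a candidate motion and score how sharp the resulting image of warped events is. A minimal NumPy sketch, with illustrative names and a plain variance objective:

```python
import numpy as np

def contrast_of_warped_events(events, velocity, t_ref, shape=(64, 64)):
    """Score a candidate motion by the sharpness (variance) of the image
    of warped events: the core objective of Contrast Maximization.

    events: (N, 3) array with columns (x, y, t); velocity: (vx, vy) in
    pixels per second.
    """
    x, y, t = events[:, 0], events[:, 1], events[:, 2]
    # Warp each event back to the reference time along the candidate motion.
    xw = np.round(x - velocity[0] * (t - t_ref)).astype(int)
    yw = np.round(y - velocity[1] * (t - t_ref)).astype(int)
    ok = (xw >= 0) & (xw < shape[1]) & (yw >= 0) & (yw < shape[0])
    img = np.zeros(shape)
    np.add.at(img, (yw[ok], xw[ok]), 1.0)  # accumulate warped events
    return img.var()  # correct motion aligns events on edges -> high contrast
```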

[102] TITAN-Guide: Taming Inference-Time AligNment for Guided Text-to-Video Diffusion Models

Christian Simon, Masato Ishii, Akio Hayakawa, Zhi Zhong, Shusuke Takahashi, Takashi Shibuya, Yuki Mitsufuji

Main category: cs.CV

TL;DR: TITAN-Guide is a training-free method for guiding Text-to-Video diffusion models, addressing memory and control issues in existing frameworks.

DetailsMotivation: Current training-free guidance frameworks for diffusion models suffer from high memory usage or sub-optimal control, limiting their use in computationally intensive tasks like Text-to-Video.

Method: TITAN-Guide optimizes diffusion latents without backpropagation, using forward gradient descent with directional directives for efficient guidance.

Result: The method reduces memory requirements and improves Text-to-Video performance across benchmarks.

Conclusion: TITAN-Guide offers an efficient and effective solution for guiding diffusion models without fine-tuning.

Abstract: Recent conditional diffusion models still require heavy supervised fine-tuning to exert control over a category of tasks. Training-free conditioning via guidance with off-the-shelf models is a favorable alternative that avoids further fine-tuning of the base model. However, the existing training-free guidance frameworks either have heavy memory requirements or offer sub-optimal control due to rough estimation. These shortcomings limit their applicability to diffusion models that require intense computation, such as Text-to-Video (T2V) diffusion models. In this work, we propose Taming Inference Time Alignment for Guided Text-to-Video Diffusion Model, so-called TITAN-Guide, which overcomes memory space issues and provides more optimal control in the guidance process compared to its counterparts. In particular, we develop an efficient method for optimizing diffusion latents without backpropagation from a discriminative guiding model. Specifically, we study forward gradient descents for guided diffusion tasks with various options on directional directives. In our experiments, we demonstrate the effectiveness of our approach in efficiently managing memory during latent optimization, where previous methods fall short. Our proposed approach not only minimizes memory requirements but also significantly enhances T2V performance across a range of diffusion guidance benchmarks. Code, models, and demo are available at https://titanguide.github.io.
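
The key trick, optimizing latents with forward gradients rather than backpropagation, can be illustrated with torch.func.jvp, which evaluates a directional derivative in a single forward-mode pass. The sketch below is a toy version under assumed names; the paper's actual choice of directional directives and guiding models is richer.

```python
import torch
from torch.func import jvp

def forward_gradient_step(loss_fn, latent, lr=0.05):
    """One forward-gradient update: a single forward-mode pass gives the
    directional derivative, from which an unbiased gradient estimate is
    formed, so no backpropagation through the guiding model is needed."""
    v = torch.randn_like(latent)               # random probe direction
    _, d_dir = jvp(loss_fn, (latent,), (v,))   # loss and dL/dz . v together
    return latent - lr * d_dir * v             # E[(g.v) v] = g for v ~ N(0, I)

# Toy guiding loss pulling a latent toward zero:
z = torch.ones(4)
for _ in range(200):
    z = forward_gradient_step(lambda t: (t ** 2).sum(), z)
print(z)  # approximately zero
```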

[103] AniMer+: Unified Pose and Shape Estimation Across Mammalia and Aves via Family-Aware Transformer

Jin Lyu, Liang An, Li Lin, Pujin Cheng, Yebin Liu, Xiaoying Tang

Main category: cs.CV

TL;DR: AniMer+ extends the scalable AniMer framework to unify mammal and bird pose/shape reconstruction using a high-capacity, family-aware ViT with MoE design and synthetic datasets to address data scarcity.

DetailsMotivation: To enable unified understanding of dynamic objects and accurate animal pose/shape estimation across species, addressing limited network capacity and data scarcity.

Method: Uses a family-aware ViT with MoE design (taxa-specific and shared components) and generates synthetic datasets (CtrlAni3D, CtrlAVES3D) via diffusion-based image generation.

Result: Superior performance on benchmarks, including Animal Kingdom, with 41.3k mammalian and 12.4k avian images (real + synthetic).

Conclusion: AniMer+’s architecture and synthetic datasets effectively enhance performance, resolving data scarcity and improving real-world applications.

Abstract: In the era of foundation models, achieving a unified understanding of different dynamic objects through a single network has the potential to empower stronger spatial intelligence. Moreover, accurate estimation of animal pose and shape across diverse species is essential for quantitative analysis in biological research. However, this topic remains underexplored due to the limited network capacity of previous methods and the scarcity of comprehensive multi-species datasets. To address these limitations, we introduce AniMer+, an extended version of our scalable AniMer framework. In this paper, we focus on a unified approach for reconstructing mammals (mammalia) and birds (aves). A key innovation of AniMer+ is its high-capacity, family-aware Vision Transformer (ViT) incorporating a Mixture-of-Experts (MoE) design. Its architecture partitions network layers into taxa-specific components (for mammalia and aves) and taxa-shared components, enabling efficient learning of both distinct and common anatomical features within a single model. To overcome the critical shortage of 3D training data, especially for birds, we introduce a diffusion-based conditional image generation pipeline. This pipeline produces two large-scale synthetic datasets: CtrlAni3D for quadrupeds and CtrlAVES3D for birds. Notably, CtrlAVES3D is the first large-scale, 3D-annotated dataset for birds, which is crucial for resolving single-view depth ambiguities. Trained on an aggregated collection of 41.3k mammalian and 12.4k avian images (combining real and synthetic data), our method demonstrates superior performance over existing approaches across a wide range of benchmarks, including the challenging out-of-domain Animal Kingdom dataset. Ablation studies confirm the effectiveness of both our novel network architecture and the generated synthetic datasets in enhancing real-world application performance.
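
A family-aware Mixture-of-Experts layer of the kind described, taxa-specific experts alongside taxa-shared weights, might look like the following minimal PyTorch sketch. The routing by an explicit family id and the additive combination are assumptions for illustration, not the paper's exact block.

```python
import torch
import torch.nn as nn

class FamilyAwareBlock(nn.Module):
    """Sketch of a family-aware layer: a taxa-shared path plus one expert
    per taxonomic family, selected by an explicit family id (0 = mammalia,
    1 = aves)."""
    def __init__(self, dim, n_families=2):
        super().__init__()
        self.shared = nn.Linear(dim, dim)
        self.experts = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(n_families))

    def forward(self, tokens, family_id):
        # tokens: (batch, seq, dim) from the ViT trunk
        return self.shared(tokens) + self.experts[family_id](tokens)

block = FamilyAwareBlock(dim=32)
out = block(torch.randn(2, 16, 32), family_id=0)  # shape (2, 16, 32)
```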

[104] Controllable Pedestrian Video Editing for Multi-View Driving Scenarios via Motion Sequence

Danzhen Fu, Jiagao Hu, Daiguo Zhou, Fei Wang, Zepeng Wang, Wenhua Liao

Main category: cs.CV

TL;DR: A novel framework for controllable pedestrian video editing in multi-view driving scenarios, enhancing robustness in pedestrian detection models.

DetailsMotivation: Addressing the lack of robust pedestrian detection in autonomous driving due to insufficient representation of dangerous scenarios in training datasets.

Method: Integrates video inpainting and human motion control to edit pedestrian videos, involving region identification, bounding box expansion, stitching, and pose-guided editing.

Result: Achieves high-quality pedestrian editing with visual realism, spatiotemporal coherence, and cross-view consistency.

Conclusion: The method is a robust solution for multi-view pedestrian video generation, useful for data augmentation and scenario simulation in autonomous driving.

Abstract: Pedestrian detection models in autonomous driving systems often lack robustness due to insufficient representation of dangerous pedestrian scenarios in training datasets. To address this limitation, we present a novel framework for controllable pedestrian video editing in multi-view driving scenarios by integrating video inpainting and human motion control techniques. Our approach begins by identifying pedestrian regions of interest across multiple camera views, expanding detection bounding boxes with a fixed ratio, and resizing and stitching these regions into a unified canvas while preserving cross-view spatial relationships. A binary mask is then applied to designate the editable area, within which pedestrian editing is guided by pose sequence control conditions. This enables flexible editing functionalities, including pedestrian insertion, replacement, and removal. Extensive experiments demonstrate that our framework achieves high-quality pedestrian editing with strong visual realism, spatiotemporal coherence, and cross-view consistency. These results establish the proposed method as a robust and versatile solution for multi-view pedestrian video generation, with broad potential for applications in data augmentation and scenario simulation in autonomous driving.

[105] Exploring Fourier Prior and Event Collaboration for Low-Light Image Enhancement

Chunyan She, Fujun Han, Chengyu Fang, Shukai Duan, Lidan Wang

Main category: cs.CV

TL;DR: The paper proposes a two-stage method for low-light image enhancement using event cameras, focusing on visibility restoration and structure refinement, outperforming existing models.

DetailsMotivation: Existing event-based methods underutilize modality-specific advantages, limiting performance. The paper aims to decouple the enhancement process to better exploit these advantages.

Method: The pipeline is split into visibility restoration (using a network with amplitude-phase entanglement in Fourier space) and structure refinement (with dynamic alignment fusion). Spatial-frequency interpolation and contrastive loss are used for training.

Result: The method outperforms state-of-the-art models in experiments.

Conclusion: Decoupling the enhancement process and leveraging modality-specific strengths improves low-light image enhancement performance.

Abstract: The event camera, benefiting from its high dynamic range and low latency, provides performance gain for low-light image enhancement. Unlike frame-based cameras, it records intensity changes with extremely high temporal resolution, capturing sufficient structure information. Existing event-based methods feed a frame and events directly into a single model without fully exploiting modality-specific advantages, which limits their performance. Therefore, by analyzing the role of each sensing modality, the enhancement pipeline is decoupled into two stages: visibility restoration and structure refinement. In the first stage, we design a visibility restoration network with amplitude-phase entanglement by rethinking the relationship between amplitude and phase components in Fourier space. In the second stage, a fusion strategy with dynamic alignment is proposed to mitigate the spatial mismatch caused by the temporal resolution discrepancy between two sensing modalities, aiming to refine the structure information of the image enhanced by the visibility restoration network. In addition, we utilize spatial-frequency interpolation to simulate negative samples with diverse illumination, noise and artifact degradations, thereby developing a contrastive loss that encourages the model to learn discriminative representations. Experiments demonstrate that the proposed method outperforms state-of-the-art models.
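
The first stage's amplitude-phase reasoning rests on a standard Fourier decomposition: amplitude loosely carries illumination energy while phase carries structure. A small NumPy sketch of the split and recombination the network operates on (the enhancement itself is learned and not shown here):

```python
import numpy as np

def split_amplitude_phase(image):
    """Fourier decomposition: amplitude loosely carries illumination
    energy, phase carries structure."""
    spectrum = np.fft.fft2(image)
    return np.abs(spectrum), np.angle(spectrum)

def combine_amplitude_phase(amplitude, phase):
    """Rebuild an image from (possibly enhanced) amplitude and phase."""
    return np.real(np.fft.ifft2(amplitude * np.exp(1j * phase)))

# Classic intuition behind amplitude-based enhancement: scaling amplitude
# brightens the image while phase keeps its structure intact.
dark = np.random.rand(64, 64) * 0.1
amp, pha = split_amplitude_phase(dark)
brighter = combine_amplitude_phase(4.0 * amp, pha)
```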

[106] DocTron-Formula: Generalized Formula Recognition in Complex and Structured Scenarios

Yufeng Zhong, Zhixiong Zeng, Lei Chen, Longrong Yang, Liming Zheng, Jing Huang, Siqi Yang, Lin Ma

Main category: cs.CV

TL;DR: DocTron-Formula is a unified OCR framework for mathematical formulas, leveraging vision-language models and a new dataset (CSFormula) to achieve state-of-the-art performance without specialized architectures.

DetailsMotivation: Mathematical formula OCR is challenging due to structural diversity and complexity, and existing models struggle with real-world variability.

Method: The approach uses general vision-language models and introduces CSFormula, a large-scale dataset, followed by supervised fine-tuning.

Result: The method outperforms specialized models in accuracy and robustness across diverse styles and layouts.

Conclusion: DocTron-Formula sets a new standard for automated understanding of complex scientific documents.

Abstract: Optical Character Recognition (OCR) for mathematical formulas is essential for the intelligent analysis of scientific literature. However, both task-specific and general vision-language models often struggle to handle the structural diversity, complexity, and real-world variability inherent in mathematical content. In this work, we present DocTron-Formula, a unified framework built upon general vision-language models, thereby eliminating the need for specialized architectures. Furthermore, we introduce CSFormula, a large-scale and challenging dataset that encompasses multidisciplinary and structurally complex formulas at the line, paragraph, and page levels. Through straightforward supervised fine-tuning, our approach achieves state-of-the-art performance across a variety of styles, scientific domains, and complex layouts. Experimental results demonstrate that our method not only surpasses specialized models in terms of accuracy and robustness, but also establishes a new paradigm for the automated understanding of complex scientific documents.

[107] GV-VAD : Exploring Video Generation for Weakly-Supervised Video Anomaly Detection

Suhang Cai, Xiaohao Peng, Chong Wang, Xiaojie Cai, Jiangbo Qian

Main category: cs.CV

TL;DR: The paper proposes GV-VAD, a generative video-enhanced weakly-supervised framework for video anomaly detection, using synthetic videos to augment training data and improve performance.

DetailsMotivation: The rarity and high annotation cost of real-world anomalies limit the scalability and performance of existing video anomaly detection models.

Method: The framework leverages text-conditioned video generation models to create synthetic videos for data augmentation and employs a synthetic sample loss scaling strategy for efficient training.

Result: GV-VAD outperforms state-of-the-art methods on the UCF-Crime dataset.

Conclusion: The proposed framework effectively addresses the limitations of existing models by using synthetic data for training, enhancing performance and generalization.

Abstract: Video anomaly detection (VAD) plays a critical role in public safety applications such as intelligent surveillance. However, the rarity, unpredictability, and high annotation cost of real-world anomalies make it difficult to scale VAD datasets, which limits the performance and generalization ability of existing models. To address this challenge, we propose a generative video-enhanced weakly-supervised video anomaly detection (GV-VAD) framework that leverages text-conditioned video generation models to produce semantically controllable and physically plausible synthetic videos. These virtual videos are used to augment training data at low cost. In addition, a synthetic sample loss scaling strategy is utilized to control the influence of generated synthetic samples for efficient training. The experiments show that the proposed framework outperforms state-of-the-art methods on the UCF-Crime dataset. The code is available at https://github.com/Sumutan/GV-VAD.git.
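
The synthetic sample loss scaling strategy is, in spirit, a weighted sum that keeps generated clips from dominating training. A one-function hedged sketch; the weight and names are illustrative:

```python
def weighted_vad_loss(loss_real, loss_synth, synth_scale=0.3):
    """Down-weight the loss from generated clips so imperfect synthetic
    videos augment training without dominating the real data; the scale
    value is illustrative."""
    return loss_real + synth_scale * loss_synth
```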

[108] Steering Guidance for Personalized Text-to-Image Diffusion Models

Sunghyun Park, Seokeon Choi, Hyoungwoo Park, Sungrack Yun

Main category: cs.CV

TL;DR: Proposes personalization guidance for text-to-image diffusion models to balance target fidelity and text alignment without extra computational cost.

DetailsMotivation: Address the trade-off between aligning with target concepts and preserving original model knowledge in few-shot fine-tuning.

Method: Uses an unlearned weak model conditioned on null text and dynamically controls unlearning via weight interpolation during inference.

Result: Improves text alignment and target fidelity, compatible with various fine-tuning strategies.

Conclusion: Personalization guidance effectively balances adaptation and knowledge retention, outperforming existing methods like CFG and AG.

Abstract: Personalizing text-to-image diffusion models is crucial for adapting the pre-trained models to specific target concepts, enabling diverse image generation. However, fine-tuning with few images introduces an inherent trade-off between aligning with the target distribution (e.g., subject fidelity) and preserving the broad knowledge of the original model (e.g., text editability). Existing sampling guidance methods, such as classifier-free guidance (CFG) and autoguidance (AG), fail to effectively guide the output toward a well-balanced space: CFG restricts the adaptation to the target distribution, while AG compromises text alignment. To address these limitations, we propose personalization guidance, a simple yet effective method leveraging an unlearned weak model conditioned on a null text prompt. Moreover, our method dynamically controls the extent of unlearning in a weak model through weight interpolation between pre-trained and fine-tuned models during inference. Unlike existing guidance methods, which depend solely on guidance scales, our method explicitly steers the outputs toward a balanced latent space without additional computational overhead. Experimental results demonstrate that our proposed guidance can improve text alignment and target distribution fidelity, integrating seamlessly with various fine-tuning strategies.
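
Two pieces of the method admit compact sketches: the guidance extrapolation away from the weak model's prediction, and the weight interpolation that sets the weak model's degree of unlearning. Both are schematic, using the standard CFG-style extrapolation form and assumed names rather than the paper's exact equations.

```python
def personalization_guidance(eps_finetuned, eps_weak, scale=2.0):
    """CFG-style extrapolation, but away from a weak (partially unlearned)
    model's noise prediction rather than an unconditional one."""
    return eps_weak + scale * (eps_finetuned - eps_weak)

def interpolate_weights(pretrained, finetuned, lam=0.5):
    """Controls the extent of unlearning: lam=0 recovers the pre-trained
    model, lam=1 the fully fine-tuned one. Inputs are state dicts of
    tensors keyed by parameter name."""
    return {k: (1 - lam) * pretrained[k] + lam * finetuned[k]
            for k in pretrained}
```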

[109] Spectral Sensitivity Estimation with an Uncalibrated Diffraction Grating

Lilika Makabe, Hiroaki Santo, Fumio Okura, Michael S. Brown, Yasuyuki Matsushita

Main category: cs.CV

TL;DR: A practical method for calibrating camera spectral sensitivity using a diffraction grating, outperforming traditional methods.

DetailsMotivation: Accurate spectral sensitivity calibration is vital for tasks like color correction and material analysis, but existing methods need specialized equipment.

Method: Uses an uncalibrated diffraction grating sheet to capture images of direct and diffracted light, estimating sensitivity and grating parameters in closed-form.

Result: Outperforms conventional reference target-based methods in synthetic and real-world tests.

Conclusion: The method is effective and practical, requiring only off-the-shelf materials.

Abstract: This paper introduces a practical and accurate calibration method for camera spectral sensitivity using a diffraction grating. Accurate calibration of camera spectral sensitivity is crucial for various computer vision tasks, including color correction, illumination estimation, and material analysis. Unlike existing approaches that require specialized narrow-band filters or reference targets with known spectral reflectances, our method only requires an uncalibrated diffraction grating sheet, readily available off-the-shelf. By capturing images of the direct illumination and its diffracted pattern through the grating sheet, our method estimates both the camera spectral sensitivity and the diffraction grating parameters in a closed-form manner. Experiments on synthetic and real-world data demonstrate that our method outperforms conventional reference target-based methods, underscoring its effectiveness and practicality.

[110] Stable at Any Speed: Speed-Driven Multi-Object Tracking with Learnable Kalman Filtering

Yan Gong, Mengjun Chen, Hao Liu, Gao Yongsheng, Lei Yang, Naibang Wang, Ziying Song, Haoqun Ma

Main category: cs.CV

TL;DR: The paper proposes a Speed-Guided Learnable Kalman Filter (SG-LKF) for multi-object tracking (MOT) in autonomous vehicles, improving stability and accuracy by dynamically adapting to ego-vehicle speed.

DetailsMotivation: Conventional MOT methods ignore ego-vehicle speed-induced noise and reference frame changes, degrading tracking performance in dynamic, high-speed scenarios.

Method: SG-LKF uses MotionScaleNet (MSNet) to adaptively predict key parameters and introduces a self-supervised trajectory consistency loss for better inter-frame association.

Result: SG-LKF achieves top performance on KITTI 2D MOT (79.59% HOTA), strong results on KITTI 3D MOT (82.03% HOTA), and outperforms SimpleTrack by 2.2% AMOTA on nuScenes 3D MOT.

Conclusion: The proposed SG-LKF effectively addresses speed-induced tracking challenges, enhancing MOT stability and accuracy in dynamic scenarios.

Abstract: Multi-object tracking (MOT) enables autonomous vehicles to continuously perceive dynamic objects, supplying essential temporal cues for prediction, behavior understanding, and safe planning. However, conventional tracking-by-detection methods typically rely on static coordinate transformations based on ego-vehicle poses, disregarding ego-vehicle speed-induced variations in observation noise and reference frame changes, which degrades tracking stability and accuracy in dynamic, high-speed scenarios. In this paper, we investigate the critical role of ego-vehicle speed in MOT and propose a Speed-Guided Learnable Kalman Filter (SG-LKF) that dynamically adapts uncertainty modeling to ego-vehicle speed, significantly improving stability and accuracy in highly dynamic scenarios. Central to SG-LKF is MotionScaleNet (MSNet), a decoupled token-mixing and channel-mixing MLP that adaptively predicts key parameters of SG-LKF. To enhance inter-frame association and trajectory continuity, we introduce a self-supervised trajectory consistency loss jointly optimized with semantic and positional constraints. Extensive experiments show that SG-LKF ranks first among all vision-based methods on KITTI 2D MOT with 79.59% HOTA, delivers strong results on KITTI 3D MOT with 82.03% HOTA, and outperforms SimpleTrack by 2.2% AMOTA on nuScenes 3D MOT.
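
The core idea, a Kalman filter whose noise covariances are supplied per frame from ego speed rather than fixed, can be sketched as follows. The noise_from_speed function is an illustrative stand-in for MSNet, not the paper's parameterization.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle where the noise covariances Q and R are
    supplied per frame instead of being fixed constants."""
    x = F @ x                        # predict state
    P = F @ P @ F.T + Q              # predict covariance
    S = H @ P @ H.T + R              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x = x + K @ (z - H @ x)          # update with the measurement
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P

def noise_from_speed(speed, base_q=0.01, base_r=0.1):
    """Illustrative stand-in for MSNet: inflate process and observation
    noise with ego speed so fast motion widens the filter's uncertainty."""
    return (base_q * (1.0 + speed) * np.eye(4),
            base_r * (1.0 + 0.5 * speed) * np.eye(2))
```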

[111] CoST: Efficient Collaborative Perception From Unified Spatiotemporal Perspective

Zongheng Tang, Yi Liu, Yifan Sun, Yulu Gao, Jinyu Chen, Runsheng Xu, Si Liu

Main category: cs.CV

TL;DR: The paper proposes CoST, a method for collaborative perception that unifies spatio-temporal fusion, improving efficiency and accuracy by reducing redundant transmissions and enhancing feature fusion.

DetailsMotivation: Prior methods separate multi-agent and multi-time fusion, leading to inefficiencies. The paper aims to unify these processes for better performance.

Method: CoST aggregates observations from different agents and times into a unified spatio-temporal space simultaneously, enabling efficient feature transmission and superior fusion.

Result: CoST improves both efficiency (reducing transmission redundancy) and accuracy (enhancing feature fusion), and is compatible with existing methods.

Conclusion: CoST offers a unified approach to spatio-temporal fusion, outperforming prior methods in collaborative perception tasks.

Abstract: Collaborative perception shares information among different agents and helps solve problems that individual agents may face, e.g., occlusions and small sensing range. Prior methods usually separate the multi-agent fusion and multi-time fusion into two consecutive steps. In contrast, this paper proposes an efficient collaborative perception method that aggregates the observations from different agents (space) and different times into a unified spatio-temporal space simultaneously. The unified spatio-temporal space brings two benefits, i.e., efficient feature transmission and superior feature fusion. 1) Efficient feature transmission: each static object yields a single observation in the spatio-temporal space, and thus requires transmission only once (whereas prior methods re-transmit all the object features multiple times). 2) Superior feature fusion: merging the multi-agent and multi-time fusion into a unified spatio-temporal aggregation enables a more holistic perspective, thereby enhancing perception performance in challenging scenarios. Consequently, our Collaborative perception with Spatio-temporal Transformer (CoST) gains improvement in both efficiency and accuracy. Notably, CoST is not tied to any specific method and is compatible with a majority of previous methods, enhancing their accuracy while reducing the transmission bandwidth.

[112] Honey Classification using Hyperspectral Imaging and Machine Learning

Mokhtar A. Al-Awadhi, Ratnadeep R. Deshmukh

Main category: cs.CV

TL;DR: A machine learning method for classifying honey botanical origins using dataset preparation, LDA for feature extraction, and SVM/KNN for classification, achieving high accuracy.

DetailsMotivation: To automate and improve the classification of honey botanical origins using advanced machine learning techniques.

Method: Dataset preparation with class transformation, LDA for feature extraction, and SVM/KNN for classification.

Result: Achieved 95.13% accuracy for image-based and 92.80% for instance-based classification.

Conclusion: The proposed method is effective for honey botanical origin classification, outperforming existing approaches.

Abstract: In this paper, we propose a machine learning-based method for automatically classifying honey botanical origins. Dataset preparation, feature extraction, and classification are the three main steps of the proposed method. We use a class transformation method in the dataset preparation phase to maximize the separability across classes. The feature extraction phase employs the Linear Discriminant Analysis (LDA) technique for extracting relevant features and reducing the number of dimensions. In the classification phase, we use Support Vector Machines (SVM) and K-Nearest Neighbors (KNN) models to classify the extracted features of honey samples into their botanical origins. We evaluate our system using a standard honey hyperspectral imaging (HSI) dataset. Experimental findings demonstrate that the proposed system produces state-of-the-art results on this dataset, achieving the highest classification accuracy of 95.13% for hyperspectral image-based classification and 92.80% for hyperspectral instance-based classification.
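
Setting the class transformation aside, the pipeline maps directly onto a scikit-learn pipeline: LDA as a supervised dimensionality reducer followed by an SVM or KNN classifier. A hedged sketch with illustrative hyperparameters:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def honey_origin_classifier(kind="svm"):
    """LDA as a supervised reducer (it keeps at most n_classes - 1
    components), followed by an SVM or KNN classifier, mirroring the
    feature extraction and classification phases described above."""
    reducer = LinearDiscriminantAnalysis()
    clf = SVC(kernel="rbf") if kind == "svm" else KNeighborsClassifier(5)
    return make_pipeline(reducer, clf)

# Usage: model = honey_origin_classifier("knn")
#        model.fit(X_train, y_train); acc = model.score(X_test, y_test)
```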

[113] SparseRecon: Neural Implicit Surface Reconstruction from Sparse Views with Feature and Depth Consistencies

Liang Han, Xu Zhang, Haichuan Song, Kanle Shi, Yu-Shen Liu, Zhizhong Han

Main category: cs.CV

TL;DR: SparseRecon improves sparse-view 3D reconstruction by combining feature consistency and uncertainty-guided depth constraints for better geometry details.

DetailsMotivation: Existing methods (generalization-based or overfitting-based) struggle with unseen views or limited geometry clues.

Method: Uses volume rendering-based feature consistency loss and uncertainty-guided depth constraint to enhance reconstruction.

Result: Outperforms state-of-the-art methods, especially in small overlapping view scenarios.

Conclusion: SparseRecon achieves high-quality geometry from sparse views, addressing ambiguity and occlusion issues.

Abstract: Surface reconstruction from sparse views aims to reconstruct a 3D shape or scene from few RGB images. The latest methods are either generalization-based or overfitting-based. However, the generalization-based methods do not generalize well on views that were unseen during training, while the reconstruction quality of overfitting-based methods is still limited by the limited geometry clues. To address this issue, we propose SparseRecon, a novel neural implicit reconstruction method for sparse views with volume rendering-based feature consistency and uncertainty-guided depth constraint. Firstly, we introduce a feature consistency loss across views to constrain the neural implicit field. This design alleviates the ambiguity caused by insufficient consistency information of views and ensures completeness and smoothness in the reconstruction results. Secondly, we employ an uncertainty-guided depth constraint to back up the feature consistency loss in areas with occlusion and insignificant features, which recovers geometry details for better reconstruction quality. Experimental results demonstrate that our method outperforms the state-of-the-art methods, which can produce high-quality geometry with sparse-view input, especially in scenarios with small overlapping views. Project page: https://hanl2010.github.io/SparseRecon/.

[114] Representation Shift: Unifying Token Compression with FlashAttention

Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, Hyunwoo J. Kim

Main category: cs.CV

TL;DR: The paper proposes Representation Shift, a training-free metric for token compression in Transformers, compatible with FlashAttention, improving efficiency without attention maps or retraining.

DetailsMotivation: Increasing task complexity in Transformers leads to higher computational costs and GPU memory overhead, prompting the need for efficient token compression methods.

Method: Introduces Representation Shift, a model-agnostic metric measuring token representation changes, enabling compression without attention maps or retraining.

Result: Achieves speedups of 5.5% and 4.4% in video-text retrieval and video QA, respectively, while maintaining compatibility with FlashAttention.

Conclusion: Representation Shift offers an efficient, scalable solution for token compression across various models, including Transformers, CNNs, and state space models.

Abstract: Transformers have demonstrated remarkable success across vision, language, and video. Yet, increasing task complexity has led to larger models and more tokens, raising the quadratic cost of self-attention and the overhead of GPU memory access. To reduce the computation cost of self-attention, prior work has proposed token compression techniques that drop redundant or less informative tokens. Meanwhile, fused attention kernels such as FlashAttention have been developed to alleviate memory overhead by avoiding attention map construction and its associated I/O to HBM. This, however, makes it incompatible with most training-free token compression methods, which rely on attention maps to determine token importance. Here, we propose Representation Shift, a training-free, model-agnostic metric that measures the degree of change in each token’s representation. This seamlessly integrates token compression with FlashAttention, without attention maps or retraining. Our method further generalizes beyond Transformers to CNNs and state space models. Extensive experiments show that Representation Shift enables effective token compression compatible with FlashAttention, yielding significant speedups of up to 5.5% and 4.4% in video-text retrieval and video QA, respectively. Code is available at https://github.com/mlvlab/Representation-Shift.
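
The metric itself is simple enough to sketch: measure how much each token's representation changed (here via cosine similarity, one plausible reading) and drop the tokens that changed least, so no attention map is ever materialized. Names and the keep ratio are illustrative:

```python
import torch
import torch.nn.functional as F

def representation_shift(tokens_prev, tokens_curr):
    """Per-token change between two representations, scored as
    1 - cosine similarity; static or uninformative tokens score low."""
    return 1.0 - F.cosine_similarity(tokens_prev, tokens_curr, dim=-1)

def keep_most_shifted(tokens_prev, tokens_curr, keep_ratio=0.5):
    """Drop the least-changed tokens; no attention map is materialized,
    so fused kernels like FlashAttention remain usable."""
    shift = representation_shift(tokens_prev, tokens_curr)   # (B, N)
    n_keep = max(1, int(tokens_curr.shape[1] * keep_ratio))
    idx = shift.topk(n_keep, dim=1).indices                  # (B, n_keep)
    batch = torch.arange(tokens_curr.shape[0]).unsqueeze(-1)
    return tokens_curr[batch, idx]                           # (B, n_keep, D)
```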

[115] Bidirectional Action Sequence Learning for Long-term Action Anticipation with Large Language Models

Yuji Sato, Yasunori Ishii, Takayoshi Yamashita

Main category: cs.CV

TL;DR: BiAnt improves long-term action anticipation by combining forward and backward predictions using a large language model, outperforming baseline methods.

DetailsMotivation: Conventional unidirectional methods limit performance by failing to capture distinct sub-actions in scenes.

Method: BiAnt integrates forward and backward predictions with a large language model.

Result: BiAnt outperforms baselines on Ego4D, measured by edit distance.

Conclusion: BiAnt’s bidirectional approach enhances action anticipation, proving effective in real-world applications.

Abstract: Video-based long-term action anticipation is crucial for early risk detection in areas such as automated driving and robotics. Conventional approaches extract features from past actions using encoders and predict future events with decoders, which limits performance due to their unidirectional nature. These methods struggle to capture semantically distinct sub-actions within a scene. The proposed method, BiAnt, addresses this limitation by combining forward prediction with backward prediction using a large language model. Experimental results on Ego4D demonstrate that BiAnt improves performance in terms of edit distance compared to baseline methods.

[116] Advancing Welding Defect Detection in Maritime Operations via Adapt-WeldNet and Defect Detection Interpretability Analysis

Kamal Basha S, Athira Nambiar

Main category: cs.CV

TL;DR: The paper introduces Adapt-WeldNet, an adaptive framework for welding defect detection, and DDIA, an interpretability analysis framework, to improve accuracy and transparency in defect detection for oil and gas piping systems.

DetailsMotivation: Traditional NDT methods and existing neural networks often miss subtle defects or lack interpretability, posing safety risks in critical environments.

Method: Adapt-WeldNet evaluates pre-trained architectures, transfer learning, and optimizers, while DDIA uses XAI techniques (Grad-CAM, LIME) and expert validation for transparency.

Result: The framework enhances defect detection performance and provides interpretable insights, validated by professionals.

Conclusion: The work improves trust, safety, and reliability in welding defect detection, especially for offshore and marine applications.

Abstract: Weld defect detection is crucial for ensuring the safety and reliability of piping systems in the oil and gas industry, especially in challenging marine and offshore environments. Traditional non-destructive testing (NDT) methods often fail to detect subtle or internal defects, leading to potential failures and costly downtime. Furthermore, existing neural network-based approaches for defect classification frequently rely on arbitrarily selected pretrained architectures and lack interpretability, raising safety concerns for deployment. To address these challenges, this paper introduces “Adapt-WeldNet”, an adaptive framework for welding defect detection that systematically evaluates various pre-trained architectures, transfer learning strategies, and adaptive optimizers to identify the best-performing model and hyperparameters, optimizing defect detection and providing actionable insights. Additionally, a novel Defect Detection Interpretability Analysis (DDIA) framework is proposed to enhance system transparency. DDIA employs Explainable AI (XAI) techniques, such as Grad-CAM and LIME, alongside domain-specific evaluations validated by certified ASNT NDE Level II professionals. Incorporating a Human-in-the-Loop (HITL) approach and aligning with the principles of Trustworthy AI, DDIA ensures the reliability, fairness, and accountability of the defect detection system, fostering confidence in automated decisions through expert validation. By improving both performance and interpretability, this work enhances trust, safety, and reliability in welding defect detection systems, supporting critical operations in offshore and marine environments.

[117] $MV_{Hybrid}$: Improving Spatial Transcriptomics Prediction with Hybrid State Space-Vision Transformer Backbone in Pathology Vision Foundation Models

Won June Cho, Hongjun Yoon, Daeky Jeong, Hyeongyeol Lim, Yosep Chong

Main category: cs.CV

TL;DR: The paper introduces $MV_{Hybrid}$, a hybrid architecture combining state space models (SSMs) with Vision Transformers (ViT) for predicting spatial gene expression from histopathology images, outperforming ViT in robustness and performance.

DetailsMotivation: Spatial transcriptomics is costly and complex, limiting clinical use. Predicting gene expression from histopathology images is a practical alternative, but current ViT-based models underperform. The study explores better architectures for capturing subtle morphological patterns.

Method: The authors propose $MV_{Hybrid}$, a hybrid backbone combining SSMs and ViT, initialized with negative real eigenvalues for low-frequency bias. They compare it with five other architectures pretrained on colorectal cancer datasets using DINOv2. Evaluation includes random split and leave-one-study-out (LOSO) settings.

Result: $MV_{Hybrid}$ achieves 57% higher correlation than ViT in LOSO evaluation and shows 43% smaller performance degradation in gene expression prediction. It also matches or outperforms ViT in classification, patch retrieval, and survival prediction.

Conclusion: $MV_{Hybrid}$ demonstrates superior performance and robustness, making it a promising next-generation backbone for pathology vision foundation models.

Abstract: Spatial transcriptomics reveals gene expression patterns within tissue context, enabling precision oncology applications such as treatment response prediction, but its high cost and technical complexity limit clinical adoption. Predicting spatial gene expression (biomarkers) from routine histopathology images offers a practical alternative, yet current vision foundation models (VFMs) in pathology based on Vision Transformer (ViT) backbones perform below clinical standards. Given that VFMs are already trained on millions of diverse whole slide images, we hypothesize that architectural innovations beyond ViTs may better capture the low-frequency, subtle morphological patterns correlating with molecular phenotypes. By demonstrating that state space models initialized with negative real eigenvalues exhibit strong low-frequency bias, we introduce $MV_{Hybrid}$, a hybrid backbone architecture combining state space models (SSMs) with ViT. We compare five other different backbone architectures for pathology VFMs, all pretrained on identical colorectal cancer datasets using the DINOv2 self-supervised learning method. We evaluate all pretrained models using both random split and leave-one-study-out (LOSO) settings of the same biomarker dataset. In LOSO evaluation, $MV_{Hybrid}$ achieves 57% higher correlation than the best-performing ViT and shows 43% smaller performance degradation compared to random split in gene expression prediction, demonstrating superior performance and robustness, respectively. Furthermore, $MV_{Hybrid}$ shows equal or better downstream performance in classification, patch retrieval, and survival prediction tasks compared to that of ViT, showing its promise as a next-generation pathology VFM backbone. Our code is publicly available at: https://github.com/deepnoid-ai/MVHybrid.
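
The claim that SSMs initialized with negative real eigenvalues exhibit a low-frequency bias can be made concrete with a minimal diagonal SSM in which the state matrix is parameterized as -exp(.), so its eigenvalues stay on the negative real axis and the discretized recurrence becomes a pure decay. This is a toy sketch, not the $MV_{Hybrid}$ block; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class NegRealDiagonalSSM(nn.Module):
    """Toy diagonal SSM whose state-matrix eigenvalues are constrained to
    the negative real axis via -exp(.), so the discretized recurrence is a
    pure decay and the layer favors low-frequency content."""
    def __init__(self, dim, state_size=16, dt=0.1):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(dim, state_size))
        self.b = nn.Parameter(0.1 * torch.randn(dim, state_size))
        self.c = nn.Parameter(0.1 * torch.randn(dim, state_size))
        self.dt = dt

    def forward(self, u):                      # u: (batch, length, dim)
        a = -torch.exp(self.log_a)             # eigenvalues: real, negative
        decay = torch.exp(self.dt * a)         # discretized: in (0, 1)
        h = u.new_zeros(u.shape[0], u.shape[2], self.log_a.shape[1])
        ys = []
        for t in range(u.shape[1]):            # sequential scan over time
            h = decay * h + self.dt * self.b * u[:, t, :, None]
            ys.append((h * self.c).sum(-1))
        return torch.stack(ys, dim=1)          # (batch, length, dim)
```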

[118] Decouple before Align: Visual Disentanglement Enhances Prompt Tuning

Fei Zhang, Tianfei Zhou, Jiangchao Yao, Ya Zhang, Ivor W. Tsang, Yanfeng Wang

Main category: cs.CV

TL;DR: The paper introduces DAPT, a prompt tuning framework addressing visual-textual information asymmetry by decoupling and aligning visual foreground/background with text, improving model performance.

DetailsMotivation: The study addresses the overlooked issue of visual-textual information asymmetry in prompt tuning, where visual context dominates, leading to biased attention.

Method: DAPT decouples visual modality into foreground/background using segmentation cues, aligns them with text, and applies visual pull-push regularization for unbiased attention.

Result: DAPT outperforms in few-shot learning, base-to-novel generalization, and data-efficient learning across benchmarks.

Conclusion: DAPT effectively mitigates information asymmetry and enhances modal alignment, demonstrating superior performance in various tasks.

Abstract: Prompt tuning (PT), as an emerging resource-efficient fine-tuning paradigm, has showcased remarkable effectiveness in improving the task-specific transferability of vision-language models. This paper delves into a previously overlooked information asymmetry issue in PT, where the visual modality mostly conveys more context than the object-oriented textual modality. Correspondingly, coarsely aligning these two modalities could result in the biased attention, driving the model to merely focus on the context area. To address this, we propose DAPT, an effective PT framework based on an intuitive decouple-before-align concept. First, we propose to explicitly decouple the visual modality into the foreground and background representation via exploiting coarse-and-fine visual segmenting cues, and then both of these decoupled patterns are aligned with the original foreground texts and the hand-crafted background classes, thereby symmetrically strengthening the modal alignment. To further enhance the visual concentration, we propose a visual pull-push regularization tailored for the foreground-background patterns, directing the original visual representation towards unbiased attention on the region-of-interest object. We demonstrate the power of architecture-free DAPT through few-shot learning, base-to-novel generalization, and data-efficient learning, all of which yield superior performance across prevailing benchmarks. Our code will be released at https://github.com/Ferenas/DAPT.
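
The visual pull-push regularization admits a compact, heavily hedged reading: pull decoupled foreground features toward the object text embedding and push background features away. The form below (cosine similarities with a margin) and all names are assumptions for illustration, not the paper's exact loss.

```python
import torch.nn.functional as F

def pull_push_loss(fg_feat, bg_feat, text_emb, margin=0.2):
    """Illustrative pull-push regularizer: attract foreground visual
    features to the object text embedding; repel background features
    from it beyond a margin."""
    pull = 1.0 - F.cosine_similarity(fg_feat, text_emb, dim=-1)
    push = F.relu(F.cosine_similarity(bg_feat, text_emb, dim=-1) - margin)
    return (pull + push).mean()
```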

[119] Video Forgery Detection with Optical Flow Residuals and Spatial-Temporal Consistency

Xi Xue, Kunio Suzuki, Nabarun Goswami, Takuya Shintate

Main category: cs.CV

TL;DR: A detection framework for AI-generated videos combines RGB appearance features and optical flow residuals to identify forgery by capturing spatial-temporal inconsistencies.

DetailsMotivation: The rise of realistic synthetic videos challenges existing forgery detection methods, which often miss fine-grained temporal inconsistencies in high-fidelity AI-generated content.

Method: The proposed framework uses a dual-branch architecture: one branch analyzes RGB frames for appearance artifacts, and the other processes optical flow residuals to detect motion anomalies.

Result: The method shows robustness and strong generalization in detecting forged videos across ten diverse generative models in text-to-video and image-to-video tasks.

Conclusion: The integrated approach effectively identifies forged videos by leveraging complementary spatial-temporal features, addressing limitations of existing methods.

Abstract: The rapid advancement of diffusion-based video generation models has led to increasingly realistic synthetic content, presenting new challenges for video forgery detection. Existing methods often struggle to capture fine-grained temporal inconsistencies, particularly in AI-generated videos with high visual fidelity and coherent motion. In this work, we propose a detection framework that leverages spatial-temporal consistency by combining RGB appearance features with optical flow residuals. The model adopts a dual-branch architecture, where one branch analyzes RGB frames to detect appearance-level artifacts, while the other processes flow residuals to reveal subtle motion anomalies caused by imperfect temporal synthesis. By integrating these complementary features, the proposed method effectively detects a wide range of forged videos. Extensive experiments on text-to-video and image-to-video tasks across ten diverse generative models demonstrate the robustness and strong generalization ability of the proposed approach.
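
The optical-flow-residual branch is built on a standard motion-compensation residual: estimate flow, warp the previous frame onto the next, and subtract. A minimal OpenCV sketch; Farneback flow is one plausible estimator, and the paper does not commit to it here.

```python
import cv2
import numpy as np

def flow_residual(frame_prev, frame_next):
    """Warp the previous frame onto the next along estimated optical flow
    and subtract; imperfect temporal synthesis tends to leave structured
    energy in this residual."""
    gray_prev = cv2.cvtColor(frame_prev, cv2.COLOR_BGR2GRAY)
    gray_next = cv2.cvtColor(frame_next, cv2.COLOR_BGR2GRAY)
    # Backward flow (next -> prev), so each output pixel knows where its
    # content sits in the previous frame.
    flow = cv2.calcOpticalFlowFarneback(
        gray_next, gray_prev, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray_next.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (xs + flow[..., 0]).astype(np.float32)
    map_y = (ys + flow[..., 1]).astype(np.float32)
    warped_prev = cv2.remap(frame_prev, map_x, map_y, cv2.INTER_LINEAR)
    return cv2.absdiff(frame_next, warped_prev)
```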

[120] iSafetyBench: A video-language benchmark for safety in industrial environment

Raiyaan Abdullah, Yogesh Singh Rawat, Shruti Vyas

Main category: cs.CV

TL;DR: iSafetyBench is a new video-language benchmark for evaluating vision-language models in industrial settings, focusing on routine and hazardous actions. Despite strong performance on existing benchmarks, models struggle here, highlighting gaps in safety-critical applications.

DetailsMotivation: To address the underexplored capabilities of VLMs in high-stakes industrial domains, where recognizing routine and hazardous actions is crucial.

Method: Introduces iSafetyBench, a dataset of 1,100 industrial video clips annotated with 98 routine and 67 hazardous action categories, paired with multiple-choice questions for evaluation.

Result: Eight state-of-the-art VLMs perform poorly on iSafetyBench, especially in hazardous scenarios and multi-label tasks.

Conclusion: The benchmark reveals significant gaps in current VLMs for industrial safety, advocating for more robust, safety-aware models.

Abstract: Recent advances in vision-language models (VLMs) have enabled impressive generalization across diverse video understanding tasks under zero-shot settings. However, their capabilities in high-stakes industrial domains, where recognizing both routine operations and safety-critical anomalies is essential, remain largely underexplored. To address this gap, we introduce iSafetyBench, a new video-language benchmark specifically designed to evaluate model performance in industrial environments across both normal and hazardous scenarios. iSafetyBench comprises 1,100 video clips sourced from real-world industrial settings, annotated with open-vocabulary, multi-label action tags spanning 98 routine and 67 hazardous action categories. Each clip is paired with multiple-choice questions for both single-label and multi-label evaluation, enabling fine-grained assessment of VLMs in both standard and safety-critical contexts. We evaluate eight state-of-the-art video-language models under zero-shot conditions. Despite their strong performance on existing video benchmarks, these models struggle with iSafetyBench, particularly in recognizing hazardous activities and in multi-label scenarios. Our results reveal significant performance gaps, underscoring the need for more robust, safety-aware multimodal models for industrial applications. iSafetyBench provides a first-of-its-kind testbed to drive progress in this direction. The dataset is available at: https://github.com/raiyaan-abdullah/iSafety-Bench.

[121] Sari Sandbox: A Virtual Retail Store Environment for Embodied AI Agents

Janika Deborah Gajo, Gerarld Paul Merales, Jerome Escarcha, Brenden Ashley Molina, Gian Nartea, Emmanuel G. Maminta, Juan Carlos Roldan, Rowel O. Atienza

Main category: cs.CV

TL;DR: Sari Sandbox is a photorealistic 3D retail store simulation for training embodied agents, featuring interactive items and human benchmarks.

DetailsMotivation: To address the lack of retail-specific simulation environments for embodied agent training and benchmarking.

Method: Developed a high-fidelity 3D simulation with 250+ interactive grocery items, VR support, and a VLM-powered agent. Introduced SariBench, a dataset of human demonstrations.

Result: Enabled benchmarking of embodied agents against human performance in shopping tasks.

Conclusion: The sandbox provides a scalable and realistic environment for training and evaluating embodied agents, with recommendations for further improvements.

Abstract: We present Sari Sandbox, a high-fidelity, photorealistic 3D retail store simulation for benchmarking embodied agents against human performance in shopping tasks. Addressing a gap in retail-specific sim environments for embodied agent training, Sari Sandbox features over 250 interactive grocery items across three store configurations, controlled via an API. It supports both virtual reality (VR) for human interaction and a vision language model (VLM)-powered embodied agent. We also introduce SariBench, a dataset of annotated human demonstrations across varied task difficulties. Our sandbox enables embodied agents to navigate, inspect, and manipulate retail items, providing baselines against human performance. We conclude with benchmarks, performance analysis, and recommendations for enhancing realism and scalability. The source code can be accessed via https://github.com/upeee/sari-sandbox-env.

[122] PMR: Physical Model-Driven Multi-Stage Restoration of Turbulent Dynamic Videos

Tao Wu, Jingyuan Ye, Ying Fu

Main category: cs.CV

TL;DR: The paper proposes a Dynamic Efficiency Index (DEI) and a Physical Model-Driven Multi-Stage Video Restoration (PMR) framework to address turbulence-induced distortions in videos, achieving high-quality restoration and generalization.

DetailsMotivation: Existing methods fail to restore edge details and eliminate mixed distortions in videos affected by strong turbulence and complex dynamics.

Method: Introduces DEI to quantify dynamic intensity and PMR, a three-stage framework (de-tilting, motion segmentation enhancement, de-blurring) with lightweight backbones and joint training.

Result: PMR effectively suppresses motion trailing, restores edge details, and generalizes well in high-turbulence, dynamic scenarios.

Conclusion: The proposed method outperforms existing approaches, with code and datasets to be made publicly available.

Abstract: Geometric distortions and blurring caused by atmospheric turbulence degrade the quality of long-range dynamic scene videos. Existing methods struggle with restoring edge details and eliminating mixed distortions, especially under conditions of strong turbulence and complex dynamics. To address these challenges, we introduce a Dynamic Efficiency Index ($DEI$), which combines turbulence intensity, optical flow, and proportions of dynamic regions to accurately quantify video dynamic intensity under varying turbulence conditions and provide a high-dynamic turbulence training dataset. Additionally, we propose a Physical Model-Driven Multi-Stage Video Restoration ($PMR$) framework that consists of three stages: de-tilting for geometric stabilization, motion segmentation enhancement for dynamic region refinement, and de-blurring for quality restoration. $PMR$ employs lightweight backbones and stage-wise joint training to ensure both efficiency and high restoration quality. Experimental results demonstrate that the proposed method effectively suppresses motion trailing artifacts, restores edge details and exhibits strong generalization capability, especially in real-world scenarios characterized by high-turbulence and complex dynamics. We will make the code and datasets openly available.

[123] Sortblock: Similarity-Aware Feature Reuse for Diffusion Model

Hanqi Chen, Xu Zhang, Xiaoliu Guan, Lielin Jiang, Guanzhong Wang, Zeyu Chen, Yi Liu

Main category: cs.CV

TL;DR: Sortblock accelerates DiTs by dynamically caching block-wise features and skipping redundant computations, achieving 2x speedup with minimal quality loss.

DetailsMotivation: DiTs suffer from high inference latency due to sequential denoising, limiting real-time use. Existing methods overlook evolving semantic focus.

Method: Sortblock ranks residuals to adaptively skip redundant computations and uses linear prediction to reduce errors.

Result: Achieves over 2x speedup with minimal quality degradation across tasks and architectures.

Conclusion: Sortblock is an effective, generalizable solution for accelerating diffusion-based models without training.

Abstract: Diffusion Transformers (DiTs) have demonstrated remarkable generative capabilities, particularly benefiting from Transformer architectures that enhance visual and artistic fidelity. However, their inherently sequential denoising process results in high inference latency, limiting their deployment in real-time scenarios. Existing training-free acceleration approaches typically reuse intermediate features at fixed timesteps or layers, overlooking the evolving semantic focus across denoising stages and Transformer blocks. To address this, we propose Sortblock, a training-free inference acceleration framework that dynamically caches block-wise features based on their similarity across adjacent timesteps. By ranking the evolution of residuals, Sortblock adaptively determines a recomputation ratio, selectively skipping redundant computations while preserving generation quality. Furthermore, we incorporate a lightweight linear prediction mechanism to reduce accumulated errors in skipped blocks. Extensive experiments across various tasks and DiT architectures demonstrate that Sortblock achieves over 2$\times$ inference speedup with minimal degradation in output quality, offering an effective and generalizable solution for accelerating diffusion-based generative models.
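
A simplified version of the caching decision, per-block thresholding on residual similarity plus linear extrapolation of skipped residuals, is sketched below. The paper's adaptive, ranking-based recomputation ratio is replaced by a fixed threshold for clarity; all names are illustrative.

```python
import torch
import torch.nn.functional as F

class BlockCache:
    """Per-block residual cache: recompute a Transformer block only when
    its residual has stopped tracking the previous timestep's; otherwise
    reuse a linearly extrapolated residual."""
    def __init__(self, threshold=0.95):
        self.prev = None    # residual at step t-1
        self.prev2 = None   # residual at step t-2, for linear prediction
        self.threshold = threshold

    def step(self, x, block):
        slow = (
            self.prev is not None and self.prev2 is not None
            and F.cosine_similarity(
                self.prev.flatten(), self.prev2.flatten(), dim=0
            ) > self.threshold
        )
        if slow:
            # Residual is evolving slowly: extrapolate instead of recomputing.
            residual = 2 * self.prev - self.prev2
        else:
            residual = block(x) - x  # full recomputation
        self.prev2, self.prev = self.prev, residual
        return x + residual
```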

[124] DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space

Junyu Chen, Dongyun Zou, Wenkun He, Junsong Chen, Enze Xie, Song Han, Han Cai

Main category: cs.CV

TL;DR: DC-AE 1.5 improves deep compression autoencoders for diffusion models by introducing structured latent space and augmented diffusion training, achieving faster convergence and better generation quality.

DetailsMotivation: The challenge of slow convergence in diffusion models when increasing autoencoder latent channels, limiting quality and compression ratios.

Method: Introduces Structured Latent Space for channel-wise organization and Augmented Diffusion Training for faster convergence.

Result: DC-AE 1.5 outperforms DC-AE in speed (4x faster) and generation quality on ImageNet 512x512.

Conclusion: DC-AE 1.5 effectively addresses convergence and quality issues, enabling higher compression ratios and better performance.

Abstract: We present DC-AE 1.5, a new family of deep compression autoencoders for high-resolution diffusion models. Increasing the autoencoder’s latent channel number is a highly effective approach for improving its reconstruction quality. However, it results in slow convergence for diffusion models, leading to poorer generation quality despite better reconstruction quality. This issue limits the quality upper bound of latent diffusion models and hinders the employment of autoencoders with higher spatial compression ratios. We introduce two key innovations to address this challenge: i) Structured Latent Space, a training-based approach to impose a desired channel-wise structure on the latent space with front latent channels capturing object structures and latter latent channels capturing image details; ii) Augmented Diffusion Training, an augmented diffusion training strategy with additional diffusion training objectives on object latent channels to accelerate convergence. With these techniques, DC-AE 1.5 delivers faster convergence and better diffusion scaling results than DC-AE. On ImageNet 512x512, DC-AE-1.5-f64c128 delivers better image generation quality than DC-AE-f32c32 while being 4x faster. Code: https://github.com/dc-ai-projects/DC-Gen.

[125] UIS-Mamba: Exploring Mamba for Underwater Instance Segmentation via Dynamic Tree Scan and Hidden State Weaken

Runmin Cong, Zongji Yu, Hao Fang, Haoyan Sun, Sam Kwong

Main category: cs.CV

TL;DR: UIS-Mamba, a Mamba-based model for underwater instance segmentation, introduces Dynamic Tree Scan and Hidden State Weaken modules to address challenges like color distortion and blurred boundaries, achieving state-of-the-art results.

DetailsMotivation: Underwater scenes pose unique challenges like color distortion and blurred boundaries, making instance segmentation difficult. Mamba's linear complexity and global receptive fields make it suitable, but adaptations are needed for underwater tasks.

Method: Proposes UIS-Mamba with Dynamic Tree Scan (DTS) for dynamic local receptive fields and Hidden State Weaken (HSW) to suppress background interference using Ncut-based weakening.

Result: Achieves state-of-the-art performance on UIIS and USIS10K datasets with low parameters and computational complexity.

Conclusion: UIS-Mamba effectively adapts Mamba for underwater tasks, addressing key challenges and outperforming existing methods.

Abstract: Underwater Instance Segmentation (UIS) tasks are crucial for underwater complex scene detection. Mamba, as an emerging state space model with inherently linear complexity and global receptive fields, is highly suitable for processing image segmentation tasks with long sequence features. However, due to the particularity of underwater scenes, there are many challenges in applying Mamba to UIS. The existing fixed-patch scanning mechanism cannot maintain the internal continuity of scanned instances in the presence of severe underwater color distortion and blurred instance boundaries, and the hidden state of the complex underwater background can also inhibit the understanding of instance objects. In this work, we propose the first Mamba-based underwater instance segmentation model UIS-Mamba, and design two innovative modules, Dynamic Tree Scan (DTS) and Hidden State Weaken (HSW), to migrate Mamba to the underwater task. DTS module maintains the continuity of the internal features of the instance objects by allowing the patches to dynamically offset and scale, thereby guiding the minimum spanning tree and providing dynamic local receptive fields. HSW module suppresses the interference of complex backgrounds and effectively focuses the information flow of state propagation to the instances themselves through the Ncut-based hidden state weakening mechanism. Experimental results show that UIS-Mamba achieves state-of-the-art performance on both UIIS and USIS10K datasets, while maintaining a low number of parameters and computational complexity. Code is available at https://github.com/Maricalce/UIS-Mamba.

[126] Contact-Aware Amodal Completion for Human-Object Interaction via Multi-Regional Inpainting

Seunggeun Chi, Enna Sachdeva, Pin-Hao Huang, Kwonjoon Lee

Main category: cs.CV

TL;DR: A new method improves amodal completion in human-object interactions by using physical priors and multi-regional inpainting, outperforming existing techniques.

DetailsMotivation: Existing methods struggle with plausible completions in dynamic HOI scenarios due to limited understanding of interactions.

Method: Incorporates physical constraints (human topology, contact info) to define primary/secondary regions, using customized denoising in a diffusion model.

Result: Significantly outperforms existing methods in HOI scenarios, enhancing realism and accuracy.

Conclusion: The approach advances machine perception in dynamic environments and is robust even without ground-truth contact annotations.

Abstract: Amodal completion, which is the process of inferring the full appearance of objects despite partial occlusions, is crucial for understanding complex human-object interactions (HOI) in computer vision and robotics. Existing methods, such as those that use pre-trained diffusion models, often struggle to generate plausible completions in dynamic scenarios because they have a limited understanding of HOI. To solve this problem, we’ve developed a new approach that uses physical prior knowledge along with a specialized multi-regional inpainting technique designed for HOI. By incorporating physical constraints from human topology and contact information, we define two distinct regions: the primary region, where occluded object parts are most likely to be, and the secondary region, where occlusions are less probable. Our multi-regional inpainting method uses customized denoising strategies across these regions within a diffusion model. This improves the accuracy and realism of the generated completions in both their shape and visual detail. Our experimental results show that our approach significantly outperforms existing methods in HOI scenarios, moving machine perception closer to a more human-like understanding of dynamic environments. We also show that our pipeline is robust even without ground-truth contact annotations, which broadens its applicability to tasks like 3D reconstruction and novel view/pose synthesis.
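
As a rough illustration of denoising customized per region, the RePaint-style composition below lets the primary (likely-occluded) region be fully regenerated, partially anchors the secondary region to the visible content, and keeps all other pixels fixed to the known image. The masks, blend weight, and composition rule are assumptions; the paper's exact per-region strategies are not reproduced here.

```python
import torch

def compose_regions(x_t, x_known_t, primary, secondary, w_secondary=0.5):
    """x_t: model's current sample; x_known_t: known image noised to step t.
    primary/secondary: (B, 1, H, W) binary masks of the two regions."""
    # weight on the model's sample: 1 where occlusion is likely (primary),
    # w_secondary where it is less probable, 0 on fully visible pixels
    w = primary + w_secondary * secondary
    return w * x_t + (1 - w) * x_known_t

x_t = torch.randn(1, 3, 64, 64)
x_known_t = torch.randn(1, 3, 64, 64)
primary = torch.zeros(1, 1, 64, 64); primary[..., 16:48, 16:48] = 1
secondary = torch.zeros(1, 1, 64, 64); secondary[..., :16, :] = 1
x_next = compose_regions(x_t, x_known_t, primary, secondary)
```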

[127] Reducing the gap between general purpose data and aerial images in concentrated solar power plants

M. A. Pérez-Cutiño, J. Valverde, J. Capitán, J. M. Díaz-Báñez

Main category: cs.CV

TL;DR: The paper introduces AerialCSP, a synthetic dataset for aerial inspection of CSP plants, to reduce the need for manual labeling and improve model performance.

DetailsMotivation: Existing datasets lack domain-specific elements for CSP plants, making model generalization difficult without costly manual labeling.

Method: Creation of AerialCSP, a virtual dataset simulating CSP plant imagery, and benchmarking models on it.

Result: Pretraining on AerialCSP improves real-world fault detection, especially for rare and small defects.

Conclusion: AerialCSP effectively reduces manual labeling needs and enhances model performance for CSP plant inspections.

Abstract: In the context of Concentrated Solar Power (CSP) plants, aerial images captured by drones present a unique set of challenges. Unlike urban or natural landscapes commonly found in existing datasets, solar fields contain highly reflective surfaces and domain-specific elements that are uncommon in traditional computer vision benchmarks. As a result, machine learning models trained on generic datasets struggle to generalize to this setting without extensive retraining and large volumes of annotated data. However, collecting and labeling such data is costly and time-consuming, making it impractical for rapid deployment in industrial applications. To address this issue, we propose a novel approach: the creation of AerialCSP, a virtual dataset that simulates aerial imagery of CSP plants. By generating synthetic data that closely mimic real-world conditions, our objective is to facilitate pretraining of models before deployment, significantly reducing the need for extensive manual labeling. Our main contributions are threefold: (1) we introduce AerialCSP, a high-quality synthetic dataset for aerial inspection of CSP plants, providing annotated data for object detection and image segmentation; (2) we benchmark multiple models on AerialCSP, establishing a baseline for CSP-related vision tasks; and (3) we demonstrate that pretraining on AerialCSP significantly improves real-world fault detection, particularly for rare and small defects, reducing the need for extensive manual labeling. AerialCSP is made publicly available at https://mpcutino.github.io/aerialcsp/.

[128] Fine-grained Spatiotemporal Grounding on Egocentric Videos

Shuo Liang, Yiwu Zhong, Zi-Yuan Hu, Yeyao Tao, Liwei Wang

Main category: cs.CV

TL;DR: The paper introduces EgoMask, a pixel-level benchmark for spatiotemporal video grounding in egocentric videos, addressing challenges like shorter object durations and sparser trajectories. It includes an automatic annotation pipeline and a training dataset, EgoMask-Train, showing significant improvements over state-of-the-art models.

DetailsMotivation: Egocentric video grounding is underexplored despite its importance in applications like augmented reality and robotics. The paper identifies key challenges in egocentric videos and aims to bridge this gap.

Method: The authors propose EgoMask, a benchmark with an automatic annotation pipeline for referring expressions and object masks. They also create EgoMask-Train for model training.

Result: State-of-the-art models perform poorly on EgoMask but improve significantly when fine-tuned on EgoMask-Train, without losing performance on exocentric datasets.

Conclusion: The work provides essential resources and insights for advancing egocentric video understanding, with code made publicly available.

Abstract: Spatiotemporal video grounding aims to localize target entities in videos based on textual queries. While existing research has made significant progress in exocentric videos, the egocentric setting remains relatively underexplored, despite its growing importance in applications such as augmented reality and robotics. In this work, we conduct a systematic analysis of the discrepancies between egocentric and exocentric videos, revealing key challenges such as shorter object durations, sparser trajectories, smaller object sizes, and larger positional shifts. To address these challenges, we introduce EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos. It is constructed by our proposed automatic annotation pipeline, which annotates referring expressions and object masks across short-, medium-, and long-term videos. Additionally, we create EgoMask-Train, a large-scale training dataset to facilitate model development. Experiments demonstrate that the state-of-the-art spatiotemporal grounding models perform poorly on our benchmark EgoMask, but fine-tuning on EgoMask-Train yields significant improvements, while preserving performance on exocentric datasets. Our work thus provides essential resources and insights for advancing egocentric video understanding. Our code is available at https://github.com/LaVi-Lab/EgoMask.

[129] TopoTTA: Topology-Enhanced Test-Time Adaptation for Tubular Structure Segmentation

Jiale Zhou, Wenhan Wang, Shikun Li, Xiaolei Qu, Xin Guo, Yizhong Liu, Wenzhong Tang, Xun Lin, Yefeng Zheng

Main category: cs.CV

TL;DR: TopoTTA is a test-time adaptation framework for tubular structure segmentation (TSS) that addresses domain shifts by enhancing topological representation and continuity, achieving a 31.81% improvement in clDice.

DetailsMotivation: Domain shifts in TSS degrade performance due to sensitivity to topological and local feature changes, necessitating a specialized adaptation framework.

Method: TopoTTA uses Topological Meta Difference Convolutions (TopoMDCs) for cross-domain topological adaptation and Topology Hard sample Generation (TopoHG) with pseudo-labels to improve continuity.

Result: TopoTTA achieves an average improvement of 31.81% in clDice across four scenarios and ten datasets.

Conclusion: TopoTTA effectively handles topological distribution shifts and serves as a plug-and-play solution for CNN-based TSS models.

Abstract: Tubular structure segmentation (TSS) is important for various applications, such as hemodynamic analysis and route navigation. Despite significant progress in TSS, domain shifts remain a major challenge, leading to performance degradation in unseen target domains. Unlike other segmentation tasks, TSS is more sensitive to domain shifts, as changes in topological structures can compromise segmentation integrity, and variations in local features distinguishing foreground from background (e.g., texture and contrast) may further disrupt topological continuity. To address these challenges, we propose Topology-enhanced Test-Time Adaptation (TopoTTA), the first test-time adaptation framework designed specifically for TSS. TopoTTA consists of two stages: Stage 1 adapts models to cross-domain topological discrepancies using the proposed Topological Meta Difference Convolutions (TopoMDCs), which enhance topological representation without altering pre-trained parameters; Stage 2 improves topological continuity by a novel Topology Hard sample Generation (TopoHG) strategy and prediction alignment on hard samples with pseudo-labels in the generated pseudo-break regions. Extensive experiments across four scenarios and ten datasets demonstrate TopoTTA’s effectiveness in handling topological distribution shifts, achieving an average improvement of 31.81% in clDice. TopoTTA also serves as a plug-and-play TTA solution for CNN-based TSS models.

[130] SDMatte: Grafting Diffusion Models for Interactive Matting

Longfei Huang, Yu Liang, Hao Zhang, Jinwei Chen, Wei Dong, Lunde Chen, Wanyu Liu, Bo Li, Pengtao Jiang

Main category: cs.CV

TL;DR: SDMatte is a diffusion-driven interactive matting model that leverages diffusion models for fine-grained detail extraction, outperforming existing methods.

DetailsMotivation: Existing interactive matting methods lack precision in edge regions, while diffusion models offer robust capabilities for detailed texture synthesis and interaction.

Method: SDMatte transforms text-driven interaction into visual prompt-driven interaction, integrates coordinate and opacity embeddings into U-Net, and uses masked self-attention for focused attention.

Result: Extensive experiments show SDMatte’s superior performance in interactive matting.

Conclusion: SDMatte effectively addresses the limitations of current methods by utilizing diffusion models for enhanced detail extraction and interaction.

Abstract: Recent interactive matting methods have shown satisfactory performance in capturing the primary regions of objects, but they fall short in extracting fine-grained details in edge regions. Diffusion models, trained on billions of image-text pairs, demonstrate exceptional capability in modeling highly complex data distributions and synthesizing realistic texture details, while exhibiting robust text-driven interaction capabilities, making them an attractive solution for interactive matting. To this end, we propose SDMatte, a diffusion-driven interactive matting model, with three key contributions. First, we exploit the powerful priors of diffusion models and transform the text-driven interaction capability into visual prompt-driven interaction capability to enable interactive matting. Second, we integrate coordinate embeddings of visual prompts and opacity embeddings of target objects into U-Net, enhancing SDMatte’s sensitivity to spatial position information and opacity information. Third, we propose a masked self-attention mechanism that enables the model to focus on areas specified by visual prompts, leading to better performance. Extensive experiments on multiple datasets demonstrate the superior performance of our method, validating its effectiveness in interactive matting. Our code and model are available at https://github.com/vivoCameraResearch/SDMatte.

[131] AutoDebias: Automated Framework for Debiasing Text-to-Image Models

Hongyi Cai, Mohammad Mahdinur Rahman, Mingkang Dong, Jie Li, Muxin Pu, Zhili Fang, Yinan Peng, Hanjun Luo, Yang Liu

Main category: cs.CV

TL;DR: AutoDebias is a framework that automatically detects and mitigates social biases in Text-to-Image models without prior bias knowledge, improving fairness while maintaining image quality.

DetailsMotivation: Existing debiasing methods fail to address subtle or overlapping biases in T2I models, necessitating an automated solution.

Method: AutoDebias uses vision-language models to identify biases and generates inclusive prompts for CLIP-guided training to reduce biases.

Result: The framework achieves 91.6% bias detection accuracy and reduces biased outputs from 90% to negligible levels, preserving visual fidelity.

Conclusion: AutoDebias effectively tackles subtle and overlapping biases in T2I models, offering a scalable and automated debiasing solution.

Abstract: Text-to-Image (T2I) models generate high-quality images from text prompts but often exhibit unintended social biases, such as gender or racial stereotypes, even when these attributes are not mentioned. Existing debiasing methods work well for simple or well-known cases but struggle with subtle or overlapping biases. We propose AutoDebias, a framework that automatically identifies and mitigates harmful biases in T2I models without prior knowledge of specific bias types. Specifically, AutoDebias leverages vision-language models to detect biased visual patterns and constructs fairness guides by generating inclusive alternative prompts that reflect balanced representations. These guides drive a CLIP-guided training process that promotes fairer outputs while preserving the original model’s image quality and diversity. Unlike existing methods, AutoDebias effectively addresses both subtle stereotypes and multiple interacting biases. We evaluate the framework on a benchmark covering over 25 bias scenarios, including challenging cases where multiple biases occur simultaneously. AutoDebias detects harmful patterns with 91.6% accuracy and reduces biased outputs from 90% to negligible levels, while preserving the visual fidelity of the original model.

[132] CLIPTime: Time-Aware Multimodal Representation Learning from Images and Text

Anju Rani, Daniel Ortiz-Arroyo, Petar Durdevic

Main category: cs.CV

TL;DR: CLIPTime is a multimodal framework built on CLIP to predict fungal growth stages and timestamps from images and text, addressing the temporal limitations of vision-language models.

DetailsMotivation: Understanding biological growth dynamics is crucial in fields like microbiology and agriculture, but current vision-language models lack temporal progression capabilities.

Method: CLIPTime extends CLIP to learn joint visual-textual embeddings for time-aware inference, using a synthetic fungal growth dataset for training and evaluation.

Result: The model effectively predicts growth stages and timestamps, demonstrating interpretable and temporally grounded outputs.

Conclusion: CLIPTime shows promise for real-world biological monitoring, leveraging vision-language models for temporal tasks.

Abstract: Understanding the temporal dynamics of biological growth is critical across diverse fields such as microbiology, agriculture, and biodegradation research. Although vision-language models like Contrastive Language Image Pretraining (CLIP) have shown strong capabilities in joint visual-textual reasoning, their effectiveness in capturing temporal progression remains limited. To address this, we propose CLIPTime, a multimodal, multitask framework designed to predict both the developmental stage and the corresponding timestamp of fungal growth from image and text inputs. Built upon the CLIP architecture, our model learns joint visual-textual embeddings and enables time-aware inference without requiring explicit temporal input during testing. To facilitate training and evaluation, we introduce a synthetic fungal growth dataset annotated with aligned timestamps and categorical stage labels. CLIPTime jointly performs classification and regression, predicting discrete growth stages alongside continuous timestamps. We also propose custom evaluation metrics, including temporal accuracy and regression error, to assess the precision of time-aware predictions. Experimental results demonstrate that CLIPTime effectively models biological progression and produces interpretable, temporally grounded outputs, highlighting the potential of vision-language models in real-world biological monitoring applications.
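
A minimal sketch of the joint classification/regression idea, assuming a precomputed CLIP image embedding; the embedding size, stage count, and equal loss weighting are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeAwareHead(nn.Module):
    """Joint heads over a (frozen) CLIP image embedding."""
    def __init__(self, embed_dim=512, n_stages=4):
        super().__init__()
        self.stage = nn.Linear(embed_dim, n_stages)  # discrete growth stage
        self.time = nn.Linear(embed_dim, 1)          # continuous timestamp

    def forward(self, emb):
        return self.stage(emb), self.time(emb).squeeze(-1)

head = TimeAwareHead()
emb = torch.randn(8, 512)                            # stand-in for CLIP embeddings
stage_logits, t_pred = head(emb)
loss = F.cross_entropy(stage_logits, torch.randint(0, 4, (8,))) \
     + F.mse_loss(t_pred, torch.rand(8))
```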

[133] Context-based Motion Retrieval using Open Vocabulary Methods for Autonomous Driving

Stefan Englmeier, Max A. Büttner, Katharina Winter, Fabian B. Flohr

Main category: cs.CV

TL;DR: A novel context-aware motion retrieval framework is proposed to identify rare human behaviors in driving datasets, improving autonomous driving system evaluation.

DetailsMotivation: Addressing the challenge of retrieving rare human behavior scenarios in large-scale datasets for robust autonomous driving evaluation.

Method: Combines SMPL-based motion sequences and video frames into a shared multimodal embedding space aligned with natural language for text-query retrieval.

Result: Outperforms state-of-the-art models by up to 27.5% accuracy in motion-context retrieval on the WayMoCo dataset.

Conclusion: The framework enables scalable retrieval of human behavior and context, enhancing evaluation of autonomous driving systems.

Abstract: Autonomous driving systems must operate reliably in safety-critical scenarios, particularly those involving unusual or complex behavior by Vulnerable Road Users (VRUs). Identifying these edge cases in driving datasets is essential for robust evaluation and generalization, but retrieving such rare human behavior scenarios within the long tail of large-scale datasets is challenging. To support targeted evaluation of autonomous driving systems in diverse, human-centered scenarios, we propose a novel context-aware motion retrieval framework. Our method combines Skinned Multi-Person Linear (SMPL)-based motion sequences and corresponding video frames before encoding them into a shared multimodal embedding space aligned with natural language. Our approach enables the scalable retrieval of human behavior and their context through text queries. This work also introduces our dataset WayMoCo, an extension of the Waymo Open Dataset. It contains automatically labeled motion and scene context descriptions derived from generated pseudo-ground-truth SMPL sequences and corresponding image data. Our approach outperforms state-of-the-art models by up to 27.5% accuracy in motion-context retrieval when evaluated on the WayMoCo dataset.
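
Once everything lives in a shared embedding space, retrieval reduces to a nearest-neighbor search; the sketch below assumes precomputed clip embeddings and a text embedding from stand-in encoders rather than the paper's SMPL, video, and text models.

```python
import torch
import torch.nn.functional as F

def retrieve(text_emb, clip_embs, k=5):
    """text_emb: (D,) query embedding; clip_embs: (N, D) precomputed
    motion+context embeddings; returns indices of the k best matches."""
    sims = F.cosine_similarity(text_emb.unsqueeze(0), clip_embs, dim=-1)
    return sims.topk(k).indices

clip_embs = F.normalize(torch.randn(1000, 256), dim=-1)
text_emb = F.normalize(torch.randn(256), dim=-1)
print(retrieve(text_emb, clip_embs))
```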

[134] PIF-Net: Ill-Posed Prior Guided Multispectral and Hyperspectral Image Fusion via Invertible Mamba and Fusion-Aware LoRA

Baisong Li, Xingwang Wang, Haixiao Xu

Main category: cs.CV

TL;DR: PIF-Net is a novel framework for multispectral and hyperspectral image fusion, addressing the ill-posed nature of the task with an invertible Mamba architecture and a lightweight Fusion-Aware Low-Rank Adaptation module, outperforming state-of-the-art methods.

DetailsMotivation: The inherent trade-off between spectral and spatial information in MHIF makes the task ill-posed, especially due to data misalignment, which prior methods haven't effectively resolved.

Method: PIF-Net uses an invertible Mamba architecture for global spectral modeling and introduces the Fusion-Aware Low-Rank Adaptation module for dynamic feature calibration.

Result: PIF-Net achieves superior image restoration performance on benchmark datasets while maintaining computational efficiency.

Conclusion: The proposed PIF-Net effectively addresses the ill-posed challenges of MHIF, offering a balanced and efficient solution for high-quality image fusion.

Abstract: The goal of multispectral and hyperspectral image fusion (MHIF) is to generate high-quality images that simultaneously possess rich spectral information and fine spatial details. However, due to the inherent trade-off between spectral and spatial information and the limited availability of observations, this task is fundamentally ill-posed. Previous studies have not effectively addressed the ill-posed nature caused by data misalignment. To tackle this challenge, we propose a fusion framework named PIF-Net, which explicitly incorporates ill-posed priors to effectively fuse multispectral images and hyperspectral images. To balance global spectral modeling with computational efficiency, we design a method based on an invertible Mamba architecture that maintains information consistency during feature transformation and fusion, ensuring stable gradient flow and process reversibility. Furthermore, we introduce a novel fusion module called the Fusion-Aware Low-Rank Adaptation module, which dynamically calibrates spectral and spatial features while keeping the model lightweight. Extensive experiments on multiple benchmark datasets demonstrate that PIF-Net achieves significantly better image restoration performance than current state-of-the-art methods while maintaining model efficiency.
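
For readers unfamiliar with low-rank adaptation, here is a minimal LoRA layer of the kind the Fusion-Aware module builds on; the rank, scaling, its placement inside PIF-Net, and any conditioning on fused spectral/spatial features are not shown and would differ in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank=4, alpha=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # base weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.alpha = alpha

    def forward(self, x):
        # B @ A starts at zero, so training begins from the base layer's output
        return self.base(x) + self.alpha * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
y = layer(torch.randn(2, 64))
```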

[135] LesiOnTime – Joint Temporal and Clinical Modeling for Small Breast Lesion Segmentation in Longitudinal DCE-MRI

Mohammed Kamran, Maria Bernathova, Raoul Varga, Christian Singer, Zsuzsanna Bago-Horvath, Thomas Helbich, Georg Langs, Philipp Seeböck

Main category: cs.CV

TL;DR: LesiOnTime improves small lesion segmentation in breast DCE-MRI by integrating longitudinal imaging and BI-RADS scores, outperforming baselines by 5% Dice.

DetailsMotivation: Existing deep learning methods focus on large lesions and ignore longitudinal and clinical data, which are crucial for early cancer detection.

Method: Proposes a 3D segmentation approach with Temporal Prior Attention (TPA) and BI-RADS Consistency Regularization (BCR) to leverage temporal and clinical context.

Result: Outperforms state-of-the-art methods by 5% Dice on a longitudinal dataset, with TPA and BCR providing complementary gains.

Conclusion: Incorporating temporal and clinical context enhances early lesion segmentation in breast cancer screening.

Abstract: Accurate segmentation of small lesions in Breast Dynamic Contrast-Enhanced MRI (DCE-MRI) is critical for early cancer detection, especially in high-risk patients. While recent deep learning methods have advanced lesion segmentation, they primarily target large lesions and neglect valuable longitudinal and clinical information routinely used by radiologists. In real-world screening, detecting subtle or emerging lesions requires radiologists to compare across timepoints and consider previous radiology assessments, such as the BI-RADS score. We propose LesiOnTime, a novel 3D segmentation approach that mimics clinical diagnostic workflows by jointly leveraging longitudinal imaging and BIRADS scores. The key components are: (1) a Temporal Prior Attention (TPA) block that dynamically integrates information from previous and current scans; and (2) a BI-RADS Consistency Regularization (BCR) loss that enforces latent space alignment for scans with similar radiological assessments, thus embedding domain knowledge into the training process. Evaluated on a curated in-house longitudinal dataset of high-risk patients with DCE-MRI, our approach outperforms state-of-the-art single-timepoint and longitudinal baselines by 5% in terms of Dice. Ablation studies demonstrate that both TPA and BCR contribute complementary performance gains. These results highlight the importance of incorporating temporal and clinical context for reliable early lesion segmentation in real-world breast cancer screening. Our code is publicly available at https://github.com/cirmuw/LesiOnTime

[136] HyPCV-Former: Hyperbolic Spatio-Temporal Transformer for 3D Point Cloud Video Anomaly Detection

Jiaping Cao, Kangkang Zhou, Juan Du

Main category: cs.CV

TL;DR: HyPCV-Former, a hyperbolic spatio-temporal transformer, improves video anomaly detection by leveraging Lorentzian hyperbolic space and a novel attention mechanism, outperforming benchmarks by 7% and 5.6% on two datasets.

DetailsMotivation: Existing methods using Euclidean representations fail to capture hierarchical event structures and spatio-temporal continuity in video anomaly detection.

Method: HyPCV-Former extracts spatial features from point cloud videos, embeds them into Lorentzian hyperbolic space, and uses a hyperbolic multi-head self-attention mechanism for temporal modeling.

Result: The method achieves state-of-the-art performance, with 7% and 5.6% improvements on TIMo and DAD datasets, respectively.

Conclusion: HyPCV-Former effectively addresses limitations of Euclidean representations, offering superior anomaly detection in point cloud videos.

Abstract: Video anomaly detection is a fundamental task in video surveillance, with broad applications in public safety and intelligent monitoring systems. Although previous methods leverage Euclidean representations in RGB or depth domains, such embeddings are inherently limited in capturing hierarchical event structures and spatio-temporal continuity. To address these limitations, we propose HyPCV-Former, a novel hyperbolic spatio-temporal transformer for anomaly detection in 3D point cloud videos. Our approach first extracts per-frame spatial features from point cloud sequences via a point cloud extractor, and then embeds them into Lorentzian hyperbolic space, which better captures the latent hierarchical structure of events. To model temporal dynamics, we introduce a hyperbolic multi-head self-attention (HMHA) mechanism that leverages Lorentzian inner products and curvature-aware softmax to learn temporal dependencies under non-Euclidean geometry. Our method performs all feature transformations and anomaly scoring directly within full Lorentzian space rather than via tangent space approximation. Extensive experiments demonstrate that HyPCV-Former achieves state-of-the-art performance across multiple anomaly categories, with a 7% improvement on the TIMo dataset and a 5.6% gain on the DAD dataset compared to benchmarks. The code will be released upon paper acceptance.
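
The Lorentzian machinery is compact enough to sketch: the inner product, the hyperboloid distance it induces, and a distance-based attention weighting. How HyPCV-Former implements its curvature-aware softmax is not reproduced here; the lift and the softmax over negative distances are illustrative assumptions.

```python
import torch

def lorentz_inner(x, y):
    """<x, y>_L = -x0*y0 + <x_rest, y_rest> for points of shape (..., D+1)."""
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)

def lorentz_distance(x, y, eps=1e-6):
    # valid for points on the unit hyperboloid (<x, x>_L = -1, x0 > 0)
    return torch.acosh(torch.clamp(-lorentz_inner(x, y), min=1.0 + eps))

def lift_to_hyperboloid(v):
    """Map Euclidean features v of shape (..., D) onto the hyperboloid."""
    x0 = torch.sqrt(1.0 + (v * v).sum(-1, keepdim=True))
    return torch.cat([x0, v], dim=-1)

q = lift_to_hyperboloid(torch.randn(4, 8))
k = lift_to_hyperboloid(torch.randn(4, 8))
# distance-based attention: nearby points in hyperbolic space get more weight
attn = torch.softmax(-lorentz_distance(q.unsqueeze(1), k.unsqueeze(0)), dim=-1)
```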

[137] LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer

Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, Weiming Zhang

Main category: cs.CV

TL;DR: LAMIC is a Layout-Aware Multi-Image Composition framework that extends single-reference diffusion models to multi-reference scenarios without training, using novel attention mechanisms and achieving state-of-the-art performance.

DetailsMotivation: Addressing the challenge of generating coherent and consistent images from multiple references with spatial layout awareness in controllable image synthesis.

Method: Introduces Group Isolation Attention (GIA) for entity disentanglement and Region-Modulated Attention (RMA) for layout-aware generation, built upon the MMDiT model.

Result: LAMIC outperforms existing baselines in ID-S, BG-S, IN-R, and AVG scores, demonstrating superior identity keeping, background preservation, and layout control.

Conclusion: LAMIC establishes a training-free paradigm for multi-image composition, showcasing strong zero-shot generalization and scalability with evolving foundation models.

Abstract: In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC’s superior abilities in identity keeping, background preservation, layout control, and prompt-following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization ability. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC’s performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.
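
Group Isolation Attention can be approximated at a high level by a block-diagonal attention mask that stops tokens of different reference entities from attending to each other; the sketch below shows only that masking idea, not the MMDiT integration.

```python
import torch

def group_isolation_mask(group_ids):
    """group_ids: (N,) one entity-group id per token.
    Returns an (N, N) boolean mask, True where attention is allowed."""
    return group_ids.unsqueeze(0) == group_ids.unsqueeze(1)

ids = torch.tensor([0, 0, 1, 1, 2])                  # three entity groups
mask = group_isolation_mask(ids)
scores = torch.randn(5, 5).masked_fill(~mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)                 # rows mix only within a group
```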

[138] SAMSA 2.0: Prompting Segment Anything with Spectral Angles for Hyperspectral Interactive Medical Image Segmentation

Alfie Roddan, Tobias Czempiel, Chi Xu, Daniel S. Elson, Stamatia Giannarou

Main category: cs.CV

TL;DR: SAMSA 2.0 improves hyperspectral medical image segmentation by integrating spectral angle prompting with SAM, boosting accuracy and robustness without retraining.

DetailsMotivation: To enhance segmentation accuracy in hyperspectral medical imaging by leveraging spectral similarity alongside spatial cues, addressing challenges in low-data and noisy clinical scenarios.

Method: Introduces spectral angle prompting for early fusion of spectral information into the Segment Anything Model (SAM), enabling guidance via spectral similarity.

Result: Achieves up to +3.8% higher Dice scores than RGB-only models and +3.1% over prior spectral fusion methods, with improved few-shot and zero-shot performance.

Conclusion: SAMSA 2.0 demonstrates superior generalization and robustness in challenging clinical imaging scenarios, outperforming existing methods.

Abstract: We present SAMSA 2.0, an interactive segmentation framework for hyperspectral medical imaging that introduces spectral angle prompting to guide the Segment Anything Model (SAM) using spectral similarity alongside spatial cues. This early fusion of spectral information enables more accurate and robust segmentation across diverse spectral datasets. Without retraining, SAMSA 2.0 achieves up to +3.8% higher Dice scores compared to RGB-only models and up to +3.1% over prior spectral fusion methods. Our approach enhances few-shot and zero-shot performance, demonstrating strong generalization in challenging low-data and noisy scenarios common in clinical imaging.
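
The spectral angle underlying the prompting is the classic spectral-angle-mapper quantity; the sketch below computes a per-pixel angle map against a clicked pixel's spectrum. How SAMSA 2.0 fuses such a map into SAM's prompt encoder is not shown here.

```python
import numpy as np

def spectral_angle_map(cube, ref):
    """cube: (H, W, B) hyperspectral image; ref: (B,) reference spectrum.
    Returns per-pixel angle in radians (0 = identical spectral shape)."""
    num = cube @ ref
    denom = np.linalg.norm(cube, axis=-1) * np.linalg.norm(ref) + 1e-8
    return np.arccos(np.clip(num / denom, -1.0, 1.0))

cube = np.random.rand(64, 64, 30)
angles = spectral_angle_map(cube, cube[10, 20])      # angles to a clicked pixel
```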

[139] Leveraging Convolutional and Graph Networks for an Unsupervised Remote Sensing Labelling Tool

Tulsi Patel, Mark W. Jones, Thomas Redfern

Main category: cs.CV

TL;DR: An unsupervised pipeline for labeling remote sensing imagery using segmentation with convolutional and graph neural networks, improving accuracy and granularity.

DetailsMotivation: Labeling remote sensing imagery is costly and time-consuming, requiring expert input. Existing tools depend on pre-labeled data, limiting flexibility and scalability.

Method: Uses segmentation with convolutional and graph neural networks to group similar pixels and encode robust features, incorporating local and neighborhood information.

Result: Reduces labeling outliers, enables granular labeling, and forms rotationally invariant semantic relationships in the encoding space.

Conclusion: The proposed unsupervised method enhances labeling efficiency and accuracy for remote sensing imagery, overcoming limitations of previous approaches.

Abstract: Machine learning for remote sensing imaging relies on up-to-date and accurate labels for model training and testing. Labelling remote sensing imagery is time- and cost-intensive, requiring expert analysis. Previous labelling tools rely on pre-labelled data for training in order to label new unseen data. In this work, we define an unsupervised pipeline for finding and labelling geographical areas of similar context and content within Sentinel-2 satellite imagery. Our approach removes limitations of previous methods by utilising segmentation with convolutional and graph neural networks to encode a more robust feature space for image comparison. Unlike previous approaches, we segment the image into homogeneous regions of pixels that are grouped based on colour and spatial similarity. Graph neural networks are used to aggregate information about the surrounding segments enabling the feature representation to encode the local neighbourhood whilst preserving its own local information. This reduces outliers in the labelling tool, allows users to label at a granular level, and allows a rotationally invariant semantic relationship at the image level to be formed within the encoding space.

[140] EPANet: Efficient Path Aggregation Network for Underwater Fish Detection

Jinsong Yang, Zeyuan Hu, Yichen Li

Main category: cs.CV

TL;DR: EPANet, an efficient path aggregation network, improves underwater fish detection by combining complementary features and lightweight design, outperforming existing methods in accuracy and speed.

DetailsMotivation: Underwater fish detection is challenging due to low resolution, background interference, and target similarity. Existing methods are complex and inefficient.

Method: EPANet uses EPA-FPN for semantic-spatial complementarity and MS-DDSP bottleneck for feature diversity, enhancing detection efficiency.

Result: EPANet achieves higher accuracy and faster inference than state-of-the-art methods with comparable parameter complexity.

Conclusion: EPANet offers an effective, lightweight solution for underwater fish detection, balancing performance and efficiency.

Abstract: Underwater fish detection (UFD) remains a challenging task in computer vision due to low object resolution, significant background interference, and high visual similarity between targets and surroundings. Existing approaches primarily focus on local feature enhancement or incorporate complex attention mechanisms to highlight small objects, often at the cost of increased model complexity and reduced efficiency. To address these limitations, we propose an efficient path aggregation network (EPANet), which leverages complementary feature integration to achieve accurate and lightweight UFD. EPANet consists of two key components: an efficient path aggregation feature pyramid network (EPA-FPN) and a multi-scale diverse-division short path bottleneck (MS-DDSP bottleneck). The EPA-FPN introduces long-range skip connections across disparate scales to improve semantic-spatial complementarity, while cross-layer fusion paths are adopted to enhance feature integration efficiency. The MS-DDSP bottleneck extends the conventional bottleneck structure by introducing finer-grained feature division and diverse convolutional operations, thereby increasing local feature diversity and representation capacity. Extensive experiments on benchmark UFD datasets demonstrate that EPANet outperforms state-of-the-art methods in terms of detection accuracy and inference speed, while maintaining comparable or even lower parameter complexity.

[141] Wukong Framework for Not Safe For Work Detection in Text-to-Image systems

Mingrui Liu, Sixiao Zhang, Cheng Long

Main category: cs.CV

TL;DR: Wukong is a transformer-based NSFW detection framework for T2I generation, leveraging early denoising steps and pre-trained U-Net cross-attention for efficient and accurate detection.

DetailsMotivation: Existing safeguards for NSFW content in T2I generation are either inefficient (image filters) or vulnerable (text filters), necessitating a better solution.

Method: Wukong uses intermediate outputs from early denoising steps and reuses U-Net’s cross-attention parameters for early NSFW detection.

Result: Wukong outperforms text-based safeguards and matches image filters’ accuracy while being more efficient.

Conclusion: Wukong provides an efficient and accurate solution for NSFW detection in T2I generation, addressing limitations of existing methods.

Abstract: Text-to-Image (T2I) generation is a popular AI-generated content (AIGC) technology enabling diverse and creative image synthesis. However, some outputs may contain Not Safe For Work (NSFW) content (e.g., violence), violating community guidelines. Detecting NSFW content efficiently and accurately, known as external safeguarding, is essential. Existing external safeguards fall into two types: text filters, which analyze user prompts but overlook T2I model-specific variations and are prone to adversarial attacks; and image filters, which analyze final generated images but are computationally costly and introduce latency. Diffusion models, the foundation of modern T2I systems like Stable Diffusion, generate images through iterative denoising using a U-Net architecture with ResNet and Transformer blocks. We observe that: (1) early denoising steps define the semantic layout of the image, and (2) cross-attention layers in U-Net are crucial for aligning text and image regions. Based on these insights, we propose Wukong, a transformer-based NSFW detection framework that leverages intermediate outputs from early denoising steps and reuses U-Net’s pre-trained cross-attention parameters. Wukong operates within the diffusion process, enabling early detection without waiting for full image generation. We also introduce a new dataset containing prompts, seeds, and image-specific NSFW labels, and evaluate Wukong on this and two public benchmarks. Results show that Wukong significantly outperforms text-based safeguards and achieves accuracy comparable to that of image filters, while offering much greater efficiency.

[142] Video Color Grading via Look-Up Table Generation

Seunghyun Shin, Dongmin Shin, Jisu Shin, Hae-Gon Jeon, Joon-Young Lee

Main category: cs.CV

TL;DR: A reference-based video color grading framework using a diffusion model to generate LUTs for aligning color attributes, enabling artistic adjustments without structural loss.

DetailsMotivation: Simplify the complex and specialized process of video color grading for artistic purposes, making it accessible beyond professional colorists.

Method: Uses a diffusion model to generate LUTs for color attribute alignment between reference scenes and input video, incorporating user preferences via text prompts.

Result: Achieves effective color grading without structural detail loss, with fast inference and user preference integration.

Conclusion: The framework successfully democratizes video color grading, combining artistic intent with technical efficiency.

Abstract: Different from color correction and transfer, color grading involves adjusting colors for artistic or storytelling purposes in a video, which is used to establish a specific look or mood. However, due to the complexity of the process and the need for specialized editing skills, video color grading remains primarily the domain of professional colorists. In this paper, we present a reference-based video color grading framework. Our key idea is explicitly generating a look-up table (LUT) for color attribute alignment between reference scenes and input video via a diffusion model. As a training objective, we enforce that high-level features of the reference scenes like look, mood, and emotion should be similar to that of the input video. Our LUT-based approach allows for color grading without any loss of structural details in the whole video frames as well as achieving fast inference. We further build a pipeline to incorporate a user-preference via text prompts for low-level feature enhancement such as contrast and brightness, etc. Experimental results, including extensive user studies, demonstrate the effectiveness of our approach for video color grading. Codes are publicly available at https://github.com/seunghyuns98/VideoColorGrading.
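
Once a LUT is generated, applying it per frame is cheap; below is a minimal sketch using trilinear interpolation via `grid_sample`, with a random LUT standing in for one produced by the diffusion model. The `lut[c, b, g, r]` axis convention follows `grid_sample`'s (x, y, z) coordinate ordering and is an implementation assumption.

```python
import torch
import torch.nn.functional as F

def apply_lut(frames, lut):
    """frames: (B, 3, H, W) RGB in [0, 1]; lut: (3, S, S, S) lookup table
    indexed as lut[c, b, g, r] under grid_sample's (x, y, z) ordering."""
    grid = frames.permute(0, 2, 3, 1) * 2 - 1        # RGB coords in [-1, 1]
    grid = grid[:, None]                             # (B, 1, H, W, 3)
    out = F.grid_sample(lut[None].expand(frames.size(0), -1, -1, -1, -1),
                        grid, align_corners=True)    # trilinear interpolation
    return out[:, :, 0]                              # (B, 3, H, W)

frames = torch.rand(4, 3, 128, 128)
lut = torch.rand(3, 17, 17, 17)                      # stand-in for a generated LUT
graded = apply_lut(frames, lut)
```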

[143] Your other Left! Vision-Language Models Fail to Identify Relative Positions in Medical Images

Daniel Wolf, Heiko Hillenhagen, Billurvan Taskin, Alex Bäuerle, Meinrad Beer, Michael Götz, Timo Ropinski

Main category: cs.CV

TL;DR: VLMs struggle with determining relative positions in medical images, despite visual prompts. A new benchmark, MIRP, is introduced to evaluate this capability.

DetailsMotivation: Clinical decision-making requires accurate understanding of anatomical positions, but VLMs lack this ability, hindering their clinical application.

Method: Evaluated state-of-the-art VLMs (GPT-4o, Llama3.2, Pixtral, JanusPro) and tested visual prompts (alphanumeric/colored markers) for improvement.

Result: VLMs performed poorly on medical images, relying more on prior knowledge than image content. Visual prompts offered only moderate improvements.

Conclusion: VLMs need significant improvement for clinical use. The MIRP benchmark is introduced to advance research in this area.

Abstract: Clinical decision-making relies heavily on understanding relative positions of anatomical structures and anomalies. Therefore, for Vision-Language Models (VLMs) to be applicable in clinical practice, the ability to accurately determine relative positions on medical images is a fundamental prerequisite. Despite its importance, this capability remains highly underexplored. To address this gap, we evaluate the ability of state-of-the-art VLMs, GPT-4o, Llama3.2, Pixtral, and JanusPro, and find that all models fail at this fundamental task. Inspired by successful approaches in computer vision, we investigate whether visual prompts, such as alphanumeric or colored markers placed on anatomical structures, can enhance performance. While these markers provide moderate improvements, results remain significantly lower on medical images compared to observations made on natural images. Our evaluations suggest that, in medical imaging, VLMs rely more on prior anatomical knowledge than on actual image content for answering relative position questions, often leading to incorrect conclusions. To facilitate further research in this area, we introduce the MIRP (Medical Imaging Relative Positioning) benchmark dataset, designed to systematically evaluate the capability to identify relative positions in medical images.

[144] DBLP: Noise Bridge Consistency Distillation For Efficient And Reliable Adversarial Purification

Chihan Huang, Belal Alsinglawi, Islam Al-qudah

Main category: cs.CV

TL;DR: DBLP introduces a diffusion-based adversarial purification framework with noise bridge distillation and adaptive semantic enhancement for efficient, high-quality purification.

DetailsMotivation: DNNs are vulnerable to adversarial perturbations, and existing purification methods are inefficient due to iterative denoising.

Method: Proposes Diffusion Bridge Distillation for Purification (DBLP) with noise bridge distillation and adaptive semantic enhancement using multi-scale pyramid edge maps.

Result: Achieves SOTA robust accuracy, superior image quality, and ~0.2s inference time.

Conclusion: DBLP advances real-time adversarial purification with efficiency and high fidelity.

Abstract: Recent advances in deep neural networks (DNNs) have led to remarkable success across a wide range of tasks. However, their susceptibility to adversarial perturbations remains a critical vulnerability. Existing diffusion-based adversarial purification methods often require intensive iterative denoising, severely limiting their practical deployment. In this paper, we propose Diffusion Bridge Distillation for Purification (DBLP), a novel and efficient diffusion-based framework for adversarial purification. Central to our approach is a new objective, noise bridge distillation, which constructs a principled alignment between the adversarial noise distribution and the clean data distribution within a latent consistency model (LCM). To further enhance semantic fidelity, we introduce adaptive semantic enhancement, which fuses multi-scale pyramid edge maps as conditioning input to guide the purification process. Extensive experiments across multiple datasets demonstrate that DBLP achieves state-of-the-art (SOTA) robust accuracy, superior image quality, and around 0.2s inference time, marking a significant step toward real-time adversarial purification.

[145] Backdoor Attacks on Deep Learning Face Detection

Quentin Le Roux, Yannick Teglia, Teddy Furon, Philippe Loubet-Moundi

Main category: cs.CV

TL;DR: The paper explores vulnerabilities in face detection systems, introducing Face Generation Attacks and a novel Landmark Shift Attack, and proposes mitigations.

DetailsMotivation: Face recognition systems in unconstrained environments face challenges like inconsistent lighting and poses, requiring robust face detection. However, these systems are vulnerable to adversarial attacks.

Method: The study introduces Face Generation Attacks and a Landmark Shift Attack to exploit face detection systems, particularly targeting bounding box and landmark coordinate regression.

Result: The attacks successfully backdoor face detectors, demonstrating vulnerabilities in coordinate regression tasks.

Conclusion: The paper highlights the need for robust defenses and offers mitigations against these adversarial attacks.

Abstract: Face Recognition Systems that operate in unconstrained environments capture images under varying conditions, such as inconsistent lighting or diverse face poses. These challenges require including a Face Detection module that regresses bounding boxes and landmark coordinates for proper Face Alignment. This paper shows the effectiveness of Object Generation Attacks on Face Detection, dubbed Face Generation Attacks, and demonstrates for the first time a Landmark Shift Attack that backdoors the coordinate regression task performed by face detectors. We then offer mitigations against these vulnerabilities.

[146] HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models

Jizhihui Liu, Feiyi Du, Guangdao Zhu, Niu Lian, Jun Li, Bin Chen

Main category: cs.CV

TL;DR: HiPrune is a training-free, model-agnostic token pruning framework for VLMs, leveraging hierarchical attention to retain key tokens, achieving high efficiency with minimal accuracy loss.

DetailsMotivation: Address computational inefficiency in VLMs caused by lengthy visual token sequences, without relying on special tokens or task-specific training.

Method: Exploits hierarchical attention in vision encoders to select three token types: Anchor (object-centric), Buffer (spatial continuity), and Register (global summarization).

Result: Preserves 99.3% accuracy with 33.3% tokens and 99.5% accuracy with 11.1% tokens, reducing FLOPs and latency by up to 9x.

Conclusion: HiPrune is scalable, efficient, and generalizes well across models and tasks, offering a practical solution for VLM optimization.

Abstract: Vision-Language Models (VLMs) encode images into lengthy sequences of visual tokens, leading to excessive computational overhead and limited inference efficiency. While prior efforts prune or merge tokens to address this issue, they often rely on special tokens (e.g., CLS) or require task-specific training, hindering scalability across architectures. In this paper, we propose HiPrune, a training-free and model-agnostic token Pruning framework that exploits the Hierarchical attention structure within vision encoders. We identify that middle layers attend to object-centric regions, while deep layers capture global contextual features. Based on this observation, HiPrune selects three types of informative tokens: (1) Anchor tokens with high attention in object-centric layers, (2) Buffer tokens adjacent to anchors for spatial continuity, and (3) Register tokens with strong attention in deep layers for global summarization. Our method requires no retraining and integrates seamlessly with any ViT-based VLM. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL demonstrate that HiPrune achieves state-of-the-art pruning performance, preserving up to 99.3% task accuracy with only 33.3% tokens, and maintaining 99.5% accuracy with just 11.1% tokens. Meanwhile, it reduces inference FLOPs and latency by up to 9$\times$, showcasing strong generalization across models and tasks. Code is available at https://github.com/Danielement321/HiPrune.
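
A simplified sketch of the three token pools, assuming per-token attention mass from a middle and a deep encoder layer; the layer choices, budgets, and the crude 1D neighbor rule for buffer tokens are illustrative simplifications of the paper's 2D scheme.

```python
import torch

def hiprune_select(attn_mid, attn_deep, n_anchor=32, n_register=8):
    """attn_mid/attn_deep: (N,) mean attention each token receives in a
    middle / deep encoder layer."""
    anchors = attn_mid.topk(n_anchor).indices            # object-centric tokens
    buffers = torch.clamp(anchors + 1, max=attn_mid.numel() - 1)  # crude 1D neighbors
    registers = attn_deep.topk(n_register).indices       # global-context tokens
    return torch.unique(torch.cat([anchors, buffers, registers]))

attn_mid, attn_deep = torch.rand(576), torch.rand(576)
keep = hiprune_select(attn_mid, attn_deep)
print(keep.numel(), "of 576 tokens kept")
```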

[147] Training-Free Class Purification for Open-Vocabulary Semantic Segmentation

Qi Chen, Lingxiao Yang, Yun Chen, Nailong Zhao, Jianhuang Lai, Jie Shao, Xiaohua Xie

Main category: cs.CV

TL;DR: FreeCP is a training-free class purification framework for open-vocabulary semantic segmentation, addressing class redundancy and visual-language ambiguity to improve segmentation performance.

DetailsMotivation: Existing training-free methods neglect challenges like class redundancy and visual-language ambiguity, leading to suboptimal segmentation. FreeCP aims to purify semantic categories and rectify errors caused by these issues.

Method: FreeCP purifies semantic categories and leverages these purified representations for final segmentation predictions, without requiring additional training.

Result: Extensive experiments on eight benchmarks show FreeCP significantly enhances segmentation performance when combined with other methods.

Conclusion: FreeCP effectively addresses redundancy and ambiguity in open-vocabulary semantic segmentation, improving performance as a plug-and-play module.

Abstract: Fine-tuning pre-trained vision-language models has emerged as a powerful approach for enhancing open-vocabulary semantic segmentation (OVSS). However, the substantial computational and resource demands associated with training on large datasets have prompted interest in training-free methods for OVSS. Existing training-free approaches primarily focus on modifying model architectures and generating prototypes to improve segmentation performance. However, they often neglect the challenges posed by class redundancy, where multiple categories are not present in the current test image, and visual-language ambiguity, where semantic similarities among categories create confusion in class activation. These issues can lead to suboptimal class activation maps and affinity-refined activation maps. Motivated by these observations, we propose FreeCP, a novel training-free class purification framework designed to address these challenges. FreeCP focuses on purifying semantic categories and rectifying errors caused by redundancy and ambiguity. The purified class representations are then leveraged to produce final segmentation predictions. We conduct extensive experiments across eight benchmarks to validate FreeCP’s effectiveness. Results demonstrate that FreeCP, as a plug-and-play module, significantly boosts segmentation performance when combined with other OVSS methods.

[148] Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints

Jens U. Kreber, Joerg Stueckler

Main category: cs.CV

TL;DR: PhysNAP is a diffusion model-based method for generating articulated objects aligned with partial point clouds, improving physical plausibility through SDFs and constraints.

DetailsMotivation: Articulated objects are common in everyday environments, but generating them with physical plausibility and alignment to partial point clouds is challenging.

Method: Uses SDFs for part shapes, guides diffusion with point cloud alignment loss, and enforces non-penetration and mobility constraints. Category-aware if information is available.

Result: Evaluated on PartNet-Mobility, PhysNAP improves constraint consistency and offers a tradeoff with generative ability compared to an unguided baseline.

Conclusion: PhysNAP effectively generates physically plausible articulated objects aligned with partial point clouds, outperforming unguided methods.

Abstract: Articulated objects are an important type of interactable objects in everyday environments. In this paper, we propose PhysNAP, a novel diffusion model-based approach for generating articulated objects that aligns them with partial point clouds and improves their physical plausibility. The model represents part shapes by signed distance functions (SDFs). We guide the reverse diffusion process using a point cloud alignment loss computed using the predicted SDFs. Additionally, we impose non-penetration and mobility constraints based on the part SDFs for guiding the model to generate more physically plausible objects. We also make our diffusion approach category-aware to further improve point cloud alignment if category information is available. We evaluate the generative ability and constraint consistency of samples generated with PhysNAP using the PartNet-Mobility dataset. We also compare it with an unguided baseline diffusion model and demonstrate that PhysNAP can improve constraint consistency and provides a tradeoff with generative ability.
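
The point-cloud alignment guidance can be sketched as a gradient step on a loss that drives the predicted part SDFs to zero at observed points; the toy SDF and guidance scale below are assumptions, not the paper's model.

```python
import torch

def alignment_loss(sdf_fn, latent, points):
    """points: (N, 3) partial observation; sdf_fn(latent, points) -> (N,).
    The loss is zero when the generated surface passes through the points."""
    return sdf_fn(latent, points).abs().mean()

def guided_step(latent, sdf_fn, points, scale=0.1):
    latent = latent.detach().requires_grad_(True)
    grad, = torch.autograd.grad(alignment_loss(sdf_fn, latent, points), latent)
    return latent - scale * grad            # nudge the sample toward the points

# toy stand-in SDF: a sphere whose radius is read from the latent
sdf_fn = lambda z, p: p.norm(dim=-1) - z.abs().mean()
z = guided_step(torch.randn(8), sdf_fn, torch.randn(100, 3))
```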

[149] D3: Training-Free AI-Generated Video Detection Using Second-Order Features

Chende Zheng, Ruiqi suo, Chenhao Lin, Zhengyu Zhao, Le Yang, Shuai Liu, Minghui Yang, Cong Wang, Chao Shen

Main category: cs.CV

TL;DR: The paper introduces D3, a training-free method for detecting AI-generated videos by analyzing second-order temporal artifacts, outperforming existing methods.

DetailsMotivation: Public concern over synthetic video dissemination and the lack of effective detection methods focusing on temporal artifacts.

Method: Proposes Detection by Difference of Differences (D3), leveraging second-order dynamical analysis and temporal discrepancies.

Result: D3 outperforms the previous best method by 10.39% mean Average Precision on GenVideo and shows strong computational efficiency.

Conclusion: D3 is a robust, efficient solution for detecting AI-generated videos, validated across multiple datasets.

Abstract: The evolution of video generation techniques, such as Sora, has made it increasingly easy to produce high-fidelity AI-generated videos, raising public concern over the dissemination of synthetic content. However, existing detection methodologies remain limited by their insufficient exploration of temporal artifacts in synthetic videos. To bridge this gap, we establish a theoretical framework through second-order dynamical analysis under Newtonian mechanics, subsequently extending the Second-order Central Difference features tailored for temporal artifact detection. Building on this theoretical foundation, we reveal a fundamental divergence in second-order feature distributions between real and AI-generated videos. Concretely, we propose Detection by Difference of Differences (D3), a novel training-free detection method that leverages the above second-order temporal discrepancies. We validate the superiority of our D3 on 4 open-source datasets (GenVideo, VideoPhy, EvalCrafter, VidProM), 40 subsets in total. For example, on GenVideo, D3 outperforms the previous best method by 10.39% (absolute) mean Average Precision. Additional experiments on time cost and post-processing operations demonstrate D3’s exceptional computational efficiency and strong robust performance. Our code is available at https://github.com/Zig-HS/D3.
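
The core statistic is easy to state: a second-order central difference over per-frame features, i.e. f[t+1] - 2 f[t] + f[t-1]. The sketch below computes it; the scalar scoring rule at the end is an illustrative assumption rather than D3's exact decision function.

```python
import numpy as np

def second_order_features(frames):
    """frames: (T, ...) per-frame features; returns the (T-2, ...) second-order
    central differences f[t+1] - 2*f[t] + f[t-1]."""
    return frames[2:] - 2 * frames[1:-1] + frames[:-2]

frames = np.random.rand(16, 128)                  # stand-in for frame embeddings
d2 = second_order_features(frames)
score = np.abs(d2).mean()                         # larger ~ less smooth dynamics
```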

[150] Weakly Supervised Virus Capsid Detection with Image-Level Annotations in Electron Microscopy Images

Hannah Kniesel, Leon Sick, Tristan Payer, Tim Bergner, Kavitha Shaga Devan, Clarissa Read, Paul Walther, Timo Ropinski

Main category: cs.CV

TL;DR: Proposes a weakly supervised object detection method using image-level annotations to avoid costly bounding box annotations, outperforming other weak labeling methods.

DetailsMotivation: High cost and expert requirement for bounding box annotations in object detection, especially in scientific domains like virus detection.

Method: Uses a pre-trained model to generate pseudo-labels from image-level annotations via optimization with a shrinking receptive field.

Result: Pseudo-labels are easier to obtain and outperform other weak labeling methods, even ground truth labels in time-limited scenarios.

Conclusion: The method effectively reduces annotation costs while maintaining or improving detection performance.

Abstract: Current state-of-the-art methods for object detection rely on annotated bounding boxes of large data sets for training. However, obtaining such annotations is expensive and can require up to hundreds of hours of manual labor. This poses a challenge, especially since such annotations can only be provided by experts, as they require knowledge about the scientific domain. To tackle this challenge, we propose a domain-specific weakly supervised object detection algorithm that only relies on image-level annotations, which are significantly easier to acquire. Our method distills the knowledge of a pre-trained model, on the task of predicting the presence or absence of a virus in an image, to obtain a set of pseudo-labels that can be used to later train a state-of-the-art object detection model. To do so, we use an optimization approach with a shrinking receptive field to extract virus particles directly without specific network architectures. Through a set of extensive studies, we show how the proposed pseudo-labels are easier to obtain, and, more importantly, are able to outperform other existing weak labeling methods, and even ground truth labels, in cases where the time to obtain the annotation is limited.

[151] CoProU-VO: Combining Projected Uncertainty for End-to-End Unsupervised Monocular Visual Odometry

Jingchao Xie, Oussema Dhaouadi, Weirong Chen, Johannes Meier, Jacques Kaiser, Daniel Cremers

Main category: cs.CV

TL;DR: CoProU-VO introduces cross-frame uncertainty propagation to improve unsupervised visual odometry, outperforming prior methods in dynamic scenes.

DetailsMotivation: Dynamic objects disrupt static scene assumptions in unsupervised VO, causing pose errors. Uncertainty modeling is needed but traditionally ignores temporal information.

Method: CoProU-VO combines target and reference frame uncertainties using a probabilistic approach, built on vision transformers for joint depth, uncertainty, and pose learning.

Result: Outperforms unsupervised monocular methods on KITTI and nuScenes, especially in dynamic highway scenes. Ablation studies confirm cross-frame uncertainty benefits.

Conclusion: Cross-frame uncertainty propagation enhances robustness in dynamic scenes, validated by improved performance and ablation studies.

Abstract: Visual Odometry (VO) is fundamental to autonomous navigation, robotics, and augmented reality, with unsupervised approaches eliminating the need for expensive ground-truth labels. However, these methods struggle when dynamic objects violate the static scene assumption, leading to erroneous pose estimations. We tackle this problem by uncertainty modeling, which is a commonly used technique that creates robust masks to filter out dynamic objects and occlusions without requiring explicit motion segmentation. Traditional uncertainty modeling considers only single-frame information, overlooking the uncertainties across consecutive frames. Our key insight is that uncertainty must be propagated and combined across temporal frames to effectively identify unreliable regions, particularly in dynamic scenes. To address this challenge, we introduce Combined Projected Uncertainty VO (CoProU-VO), a novel end-to-end approach that combines target frame uncertainty with projected reference frame uncertainty using a principled probabilistic formulation. Built upon vision transformer backbones, our model simultaneously learns depth, uncertainty estimation, and camera poses. Experiments on the KITTI and nuScenes datasets demonstrate significant improvements over previous unsupervised monocular end-to-end two-frame-based methods and exhibit strong performance in challenging highway scenes where other approaches often fail. Additionally, comprehensive ablation studies validate the effectiveness of cross-frame uncertainty propagation.
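
The probabilistic combination is easy to sketch: if each frame's predicted uncertainty is treated as an independent Gaussian noise level, the variances of the target frame and the warped reference frame add, and the photometric loss is weighted accordingly. A hedged PyTorch sketch (the paper's exact formulation may differ in detail):

```python
import torch

def combined_uncertainty_photo_loss(residual, sigma_tgt, sigma_ref_warped):
    """Heteroscedastic photometric loss with cross-frame uncertainty.

    Assuming independent Gaussian noise in the target frame and in the
    reference frame after warping into the target view, variances add:
    var = sigma_tgt^2 + sigma_ref_warped^2. Pixels that either view flags
    as unreliable (dynamic objects, occlusions) get down-weighted, while
    the log term keeps the network from inflating uncertainty everywhere.
    """
    var = sigma_tgt**2 + sigma_ref_warped**2
    return (residual**2 / var + torch.log(var)).mean()

residual = torch.rand(4, 1, 64, 64)        # |I_target - I_ref_warped|
sigma_t = 0.1 + torch.rand(4, 1, 64, 64)   # predicted target uncertainty
sigma_r = 0.1 + torch.rand(4, 1, 64, 64)   # projected reference uncertainty
print(combined_uncertainty_photo_loss(residual, sigma_t, sigma_r))
```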

[152] Uncertainty-Aware Likelihood Ratio Estimation for Pixel-Wise Out-of-Distribution Detection

Marc Hölle, Walter Kellermann, Vasileios Belagiannis

Main category: cs.CV

TL;DR: The paper introduces an uncertainty-aware likelihood ratio estimation method to improve the detection of unknown objects in semantic segmentation for autonomous driving, outperforming existing methods with lower false positives and high precision.

DetailsMotivation: Semantic segmentation models often misclassify unknown objects in real-world scenarios, and existing methods struggle with rare or complex scenes.

Method: The proposed method uses an evidential classifier within a likelihood ratio test to distinguish known and unknown pixel features, incorporating uncertainty from rare training examples and synthetic outliers.

Result: The method achieves a 2.5% false positive rate and 90.91% precision on five benchmark datasets with minimal computational overhead.

Conclusion: Incorporating uncertainty improves outlier detection, making the method effective for real-world autonomous driving applications.

Abstract: Semantic segmentation models trained on known object classes often fail in real-world autonomous driving scenarios by confidently misclassifying unknown objects. While pixel-wise out-of-distribution detection can identify unknown objects, existing methods struggle in complex scenes where rare object classes are often confused with truly unknown objects. We introduce an uncertainty-aware likelihood ratio estimation method that addresses these limitations. Our approach uses an evidential classifier within a likelihood ratio test to distinguish between known and unknown pixel features from a semantic segmentation model, while explicitly accounting for uncertainty. Instead of producing point estimates, our method outputs probability distributions that capture uncertainty from both rare training examples and imperfect synthetic outliers. We show that by incorporating uncertainty in this way, outlier exposure can be leveraged more effectively. Evaluated on five standard benchmark datasets, our method achieves the lowest average false positive rate (2.5%) among state-of-the-art methods while maintaining high average precision (90.91%) and incurring only negligible computational overhead. Code is available at https://github.com/glasbruch/ULRE.
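
A likelihood ratio test over pixel features reduces to thresholding log p_out - log p_in. Below is a simplified NumPy sketch with diagonal Gaussian inlier/outlier models; the paper instead outputs distributions over these likelihoods via an evidential classifier, which this point-estimate version deliberately omits:

```python
import numpy as np

def likelihood_ratio_scores(feats, mu_in, var_in, mu_out, var_out):
    """Per-pixel OOD score: log p_out(f) - log p_in(f) under diagonal
    Gaussians. feats: (N, D) pixel features; larger = more likely unknown."""
    def log_gauss(x, mu, var):
        return -0.5 * (((x - mu) ** 2) / var + np.log(2 * np.pi * var)).sum(axis=1)
    return log_gauss(feats, mu_out, var_out) - log_gauss(feats, mu_in, var_in)

feats = np.random.randn(1000, 16)
scores = likelihood_ratio_scores(feats, np.zeros(16), np.ones(16),
                                 np.full(16, 3.0), np.full(16, 2.0))
ood_mask = scores > 0.0  # threshold tuned on validation data in practice
print(ood_mask.mean())
```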

[153] Sample-Aware Test-Time Adaptation for Medical Image-to-Image Translation

Irene Iele, Francesco Di Feola, Valerio Guarrasi, Paolo Soda

Main category: cs.CV

TL;DR: A novel Test-Time Adaptation (TTA) framework is proposed to dynamically adjust image-to-image translation for medical imaging, improving performance on out-of-distribution samples without degrading in-distribution results.

DetailsMotivation: Address limitations of current image-to-image translation methods in handling out-of-distribution samples without performance degradation.

Method: Introduces a Reconstruction Module to quantify domain shift and a Dynamic Adaptation Block for selective feature modification in pretrained models.

Result: Consistent improvements over baseline and prior TTA methods in tasks like low-dose CT denoising and T1 to T2 MRI translation.

Conclusion: Dynamic, sample-specific adaptation outperforms uniform methods, enhancing model resilience in real-world medical imaging scenarios.

Abstract: Image-to-image translation has emerged as a powerful technique in medical imaging, enabling tasks such as image denoising and cross-modality conversion. However, it suffers from limitations in handling out-of-distribution samples without causing performance degradation. To address this limitation, we propose a novel Test-Time Adaptation (TTA) framework that dynamically adjusts the translation process based on the characteristics of each test sample. Our method introduces a Reconstruction Module to quantify the domain shift and a Dynamic Adaptation Block that selectively modifies the internal features of a pretrained translation model to mitigate the shift without compromising the performance on in-distribution samples that do not require adaptation. We evaluate our approach on two medical image-to-image translation tasks: low-dose CT denoising and T1 to T2 MRI translation, showing consistent improvements over both the baseline translation model without TTA and prior TTA methods. Our analysis highlights the limitations of the state-of-the-art that uniformly apply the adaptation to both out-of-distribution and in-distribution samples, demonstrating that dynamic, sample-specific adjustment offers a promising path to improve model resilience in real-world scenarios. The code is available at: https://github.com/cosbidev/Sample-Aware_TTA.
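
The sample-aware control flow can be sketched in a few lines. The module names and the gating rule below are illustrative assumptions, not the authors' API: a reconstruction module scores how far the input sits from the training distribution, and only sufficiently shifted samples take the adapted path, so in-distribution inputs keep the frozen translator's behavior:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_aware_translate(x, translator, recon_module, adapt_block,
                           shift_threshold: float = 0.05):
    """One sample-aware test-time step (a sketch under our assumptions).

    recon_module quantifies domain shift via reconstruction error;
    adapt_block stands in for the Dynamic Adaptation Block, which in the
    paper modifies internal features rather than the raw input.
    """
    with torch.no_grad():
        shift = F.mse_loss(recon_module(x), x).item()
    if shift > shift_threshold:
        return translator(adapt_block(x))  # adapted path for shifted samples
    return translator(x)                   # frozen path for in-distribution samples

# toy demo: identity modules stand in for the real networks
x = torch.rand(1, 1, 32, 32)
print(sample_aware_translate(x, nn.Identity(), nn.Identity(), nn.Identity()).shape)
```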

[154] GeoMoE: Divide-and-Conquer Motion Field Modeling with Mixture-of-Experts for Two-View Geometry

Jiajun Le, Jiayi Ma

Main category: cs.CV

TL;DR: GeoMoE introduces a Mixture-of-Experts framework for two-view geometry, addressing heterogeneous motion patterns with targeted modeling and outperforming prior methods.

DetailsMotivation: Existing methods fail to handle diverse motion patterns in complex scenes, leading to inaccurate motion fields.

Method: GeoMoE uses a Probabilistic Prior-Guided Decomposition and MoE-Enhanced Bi-Path Rectifier to model heterogeneous sub-fields and decouple motion regimes.

Result: GeoMoE achieves superior performance in relative pose and homography estimation with strong generalization.

Conclusion: The streamlined GeoMoE framework effectively addresses motion field variability and outperforms state-of-the-art methods.

Abstract: Recent progress in two-view geometry increasingly emphasizes enforcing smoothness and global consistency priors when estimating motion fields between pairs of images. However, in complex real-world scenes, characterized by extreme viewpoint and scale changes as well as pronounced depth discontinuities, the motion field often exhibits diverse and heterogeneous motion patterns. Most existing methods lack targeted modeling strategies and fail to explicitly account for this variability, resulting in estimated motion fields that diverge from their true underlying structure and distribution. We observe that Mixture-of-Experts (MoE) can assign dedicated experts to motion sub-fields, enabling a divide-and-conquer strategy for heterogeneous motion patterns. Building on this insight, we re-architect motion field modeling in two-view geometry with GeoMoE, a streamlined framework. Specifically, we first devise a Probabilistic Prior-Guided Decomposition strategy that exploits inlier probability signals to perform a structure-aware decomposition of the motion field into heterogeneous sub-fields, sharply curbing outlier-induced bias. Next, we introduce an MoE-Enhanced Bi-Path Rectifier that enhances each sub-field along spatial-context and channel-semantic paths and routes it to a customized expert for targeted modeling, thereby decoupling heterogeneous motion regimes, suppressing cross-sub-field interference and representational entanglement, and yielding fine-grained motion-field rectification. With this minimalist design, GeoMoE outperforms prior state-of-the-art methods in relative pose and homography estimation and shows strong generalization. The source code and pre-trained models are available at https://github.com/JiajunLe/GeoMoE.

[155] DPoser-X: Diffusion Model as Robust 3D Whole-body Human Pose Prior

Junzhe Lu, Jing Lin, Hongkun Dou, Ailing Zeng, Yue Deng, Xian Liu, Zhongang Cai, Lei Yang, Yulun Zhang, Haoqian Wang, Ziwei Liu

Main category: cs.CV

TL;DR: DPoser-X is a diffusion-based model for 3D whole-body human pose modeling, addressing challenges like pose complexity and dataset scarcity. It unifies pose tasks as inverse problems, uses variational diffusion sampling, and introduces novel training and scheduling methods.

DetailsMotivation: Building a robust full-body human pose prior is difficult due to pose complexity and limited high-quality datasets.

Method: DPoser-X uses a diffusion model (DPoser) extended for whole-body poses, unifying tasks as inverse problems solved via variational diffusion sampling. It includes truncated timestep scheduling and masked training for better performance.

Result: DPoser-X outperforms state-of-the-art models in body, hand, face, and full-body pose benchmarks, showing robustness and versatility.

Conclusion: DPoser-X sets a new benchmark for whole-body human pose prior modeling, demonstrating superior performance and adaptability.

Abstract: We present DPoser-X, a diffusion-based prior model for 3D whole-body human poses. Building a versatile and robust full-body human pose prior remains challenging due to the inherent complexity of articulated human poses and the scarcity of high-quality whole-body pose datasets. To address these limitations, we introduce a Diffusion model as body Pose prior (DPoser) and extend it to DPoser-X for expressive whole-body human pose modeling. Our approach unifies various pose-centric tasks as inverse problems, solving them through variational diffusion sampling. To enhance performance on downstream applications, we introduce a novel truncated timestep scheduling method specifically designed for pose data characteristics. We also propose a masked training mechanism that effectively combines whole-body and part-specific datasets, enabling our model to capture interdependencies between body parts while avoiding overfitting to specific actions. Extensive experiments demonstrate DPoser-X’s robustness and versatility across multiple benchmarks for body, hand, face, and full-body pose modeling. Our model consistently outperforms state-of-the-art alternatives, establishing a new benchmark for whole-body human pose prior modeling.
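
Truncated timestep scheduling is simple to sketch: sample diffusion timesteps only from the low-noise portion of the schedule, on the intuition that articulated poses are far lower-dimensional than images. The cutoff fraction below is an illustrative value, not the paper's setting:

```python
import torch

def truncated_timesteps(batch_size: int, num_steps: int = 1000,
                        t_max_frac: float = 0.3) -> torch.Tensor:
    """Sample timesteps uniformly from [0, t_max_frac * num_steps).

    A sketch of truncated timestep scheduling for pose data: restricting
    training/guidance to the low-noise region keeps the prior informative,
    since heavily noised poses carry little recoverable structure.
    """
    t_max = max(1, int(num_steps * t_max_frac))
    return torch.randint(low=0, high=t_max, size=(batch_size,))

print(truncated_timesteps(8))  # e.g. tensor([112,  41, 287, ...])
```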

[156] Minimum Data, Maximum Impact: 20 annotated samples for explainable lung nodule classification

Luisa Gallée, Catharina Silvia Lisson, Christoph Gerhard Lisson, Daniela Drees, Felix Weig, Daniel Vogele, Meinrad Beer, Michael Götz

Main category: cs.CV

TL;DR: Using generative models to synthesize attribute-annotated medical images improves explainable AI performance in diagnosis.

DetailsMotivation: Enhancing clinician trust by aligning AI with radiologists' diagnostic criteria (e.g., shape, texture) despite limited annotated datasets.

Method: Propose a Diffusion Model conditioned on attributes, trained with minimal real data (20 samples), to generate synthetic images for training explainable models.

Result: Synthetic data boosts attribute prediction accuracy by 13.4% and target prediction by 1.8% over using only small real datasets.

Conclusion: Synthetic data can overcome dataset scarcity, improving explainable AI’s practicality in medical imaging.

Abstract: Classification models that provide human-interpretable explanations enhance clinicians’ trust and usability in medical image diagnosis. One research focus is the integration and prediction of pathology-related visual attributes used by radiologists alongside the diagnosis, aligning AI decision-making with clinical reasoning. Radiologists use attributes like shape and texture as established diagnostic criteria and mirroring these in AI decision-making both enhances transparency and enables explicit validation of model outputs. However, the adoption of such models is limited by the scarcity of large-scale medical image datasets annotated with these attributes. To address this challenge, we propose synthesizing attribute-annotated data using a generative model. We enhance the Diffusion Model with attribute conditioning and train it using only 20 attribute-labeled lung nodule samples from the LIDC-IDRI dataset. Incorporating its generated images into the training of an explainable model boosts performance, increasing attribute prediction accuracy by 13.4% and target prediction accuracy by 1.8% compared to training with only the small real attribute-annotated dataset. This work highlights the potential of synthetic data to overcome dataset limitations, enhancing the applicability of explainable models in medical image analysis.

[157] Revisiting Adversarial Patch Defenses on Object Detectors: Unified Evaluation, Large-Scale Dataset, and New Insights

Junhao Zheng, Jiahao Sun, Chenhao Lin, Zhengyu Zhao, Chen Ma, Chong Zhang, Cong Wang, Qian Wang, Chao Shen

Main category: cs.CV

TL;DR: The paper introduces a unified benchmark for evaluating patch attack defenses on object detectors, revealing key insights and improving defense performance by 15.09%.

DetailsMotivation: Existing evaluations of patch attack defenses lack consistency and comprehensiveness, prompting the need for a standardized framework.

Method: The study revisits 11 defenses, creates a large-scale adversarial patch dataset (94 types, 94,000 images), and evaluates using 2 attack goals, 13 attacks, 11 detectors, and 4 metrics.

Result: Key findings include the importance of data distribution in defense difficulty, the relevance of attacked object precision over patch detection accuracy, and the robustness of complex/stochastic defenses.

Conclusion: The benchmark provides guidance for evaluating and designing patch attack defenses, with ongoing updates to include new attacks/defenses.

Abstract: Developing reliable defenses against patch attacks on object detectors has attracted increasing interest. However, we identify that existing defense evaluations lack a unified and comprehensive framework, resulting in inconsistent and incomplete assessments of current methods. To address this issue, we revisit 11 representative defenses and present the first patch defense benchmark, involving 2 attack goals, 13 patch attacks, 11 object detectors, and 4 diverse metrics. This yields a large-scale adversarial patch dataset with 94 patch types and 94,000 images. Our comprehensive analyses reveal new insights: (1) The difficulty in defending against naturalistic patches lies in the data distribution, rather than the commonly believed high frequencies. Our new dataset with diverse patch distributions can be used to improve existing defenses by 15.09% AP@0.5. (2) The average precision of the attacked object, rather than the commonly pursued patch detection accuracy, shows high consistency with defense performance. (3) Adaptive attacks can substantially bypass existing defenses, and defenses with complex/stochastic models or universal patch properties are relatively robust. We hope that our analyses will serve as guidance on properly evaluating patch attacks/defenses and advancing their design. Code and dataset are available at https://github.com/Gandolfczjh/APDE, where we will keep integrating new attacks/defenses.

[158] Can Large Pretrained Depth Estimation Models Help With Image Dehazing?

Hongfei Zhang, Kun Zhou, Ruizheng Wu, Jiangbo Lu

Main category: cs.CV

TL;DR: The paper explores using pretrained depth representations for image dehazing, proposing a plug-and-play RGB-D fusion module adaptable to various architectures.

DetailsMotivation: Existing dehazing methods lack adaptability across diverse scenarios due to architecture-specific designs.

Method: Systematically investigates pretrained depth representations and introduces a plug-and-play RGB-D fusion module.

Result: Learned depth features remain consistent across haze levels; the module is validated across benchmarks.

Conclusion: The approach is effective and broadly applicable for image dehazing.

Abstract: Image dehazing remains a challenging problem due to the spatially varying nature of haze in real-world scenes. While existing methods have demonstrated the promise of large-scale pretrained models for image dehazing, their architecture-specific designs hinder adaptability across diverse scenarios with different accuracy and efficiency requirements. In this work, we systematically investigate the generalization capability of pretrained depth representations-learned from millions of diverse images-for image dehazing. Our empirical analysis reveals that the learned deep depth features maintain remarkable consistency across varying haze levels. Building on this insight, we propose a plug-and-play RGB-D fusion module that seamlessly integrates with diverse dehazing architectures. Extensive experiments across multiple benchmarks validate both the effectiveness and broad applicability of our approach.
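
A plug-and-play RGB-D fusion module of the kind described can be sketched as a gated residual injection of frozen depth features into any dehazing backbone's feature stream. Channel widths and the gating design below are our assumptions:

```python
import torch
import torch.nn as nn

class RGBDFusion(nn.Module):
    """Minimal plug-and-play RGB-D fusion sketch.

    Projects pretrained depth features to the dehazing network's channel
    width and fuses them via a learned gate, so the module can be dropped
    between any encoder stage and its decoder without other changes.
    """
    def __init__(self, rgb_ch: int = 64, depth_ch: int = 128):
        super().__init__()
        self.proj = nn.Conv2d(depth_ch, rgb_ch, kernel_size=1)
        self.gate = nn.Sequential(
            nn.Conv2d(2 * rgb_ch, rgb_ch, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, depth_feat):
        d = self.proj(depth_feat)
        if d.shape[-2:] != rgb_feat.shape[-2:]:
            d = nn.functional.interpolate(d, size=rgb_feat.shape[-2:],
                                          mode="bilinear", align_corners=False)
        g = self.gate(torch.cat([rgb_feat, d], dim=1))
        return rgb_feat + g * d  # gated residual fusion

fused = RGBDFusion()(torch.rand(1, 64, 32, 32), torch.rand(1, 128, 16, 16))
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```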

[159] MIHBench: Benchmarking and Mitigating Multi-Image Hallucinations in Multimodal Large Language Models

Jiale Li, Mingrui Wu, Zixiang Jin, Hao Chen, Jiayi Ji, Xiaoshuai Sun, Liujuan Cao, Rongrong Ji

Main category: cs.CV

TL;DR: The paper introduces MIHBench, a benchmark for evaluating object-related hallucinations in multi-image Multimodal Large Language Models (MLLMs), addressing gaps in existing research. It identifies key factors influencing hallucinations and proposes a Dynamic Attention Balancing mechanism to mitigate them.

DetailsMotivation: Existing studies on hallucinations in MLLMs focus on single-image settings, leaving multi-image scenarios unexplored. This paper aims to systematically study and address hallucinations in multi-image contexts.

Method: The authors propose MIHBench, a benchmark with three tasks: Multi-Image Object Existence Hallucination, Multi-Image Object Count Hallucination, and Object Identity Consistency Hallucination. They also introduce a Dynamic Attention Balancing mechanism to adjust inter-image attention distributions.

Result: Key findings include a progressive relationship between image inputs and hallucination likelihood, a correlation between single and multi-image hallucinations, and the impact of same-object ratios and negative sample placement. The proposed method reduces hallucinations and improves reasoning stability.

Conclusion: The study fills a research gap by addressing multi-image hallucinations in MLLMs, offering a benchmark and a solution (Dynamic Attention Balancing) that enhances performance in multi-image scenarios.

Abstract: Despite growing interest in hallucination in Multimodal Large Language Models, existing studies primarily focus on single-image settings, leaving hallucination in multi-image scenarios largely unexplored. To address this gap, we conduct the first systematic study of hallucinations in multi-image MLLMs and propose MIHBench, a benchmark specifically tailored for evaluating object-related hallucinations across multiple images. MIHBench comprises three core tasks: Multi-Image Object Existence Hallucination, Multi-Image Object Count Hallucination, and Object Identity Consistency Hallucination, targeting semantic understanding across object existence, quantity reasoning, and cross-view identity consistency. Through extensive evaluation, we identify key factors associated with the occurrence of multi-image hallucinations, including: a progressive relationship between the number of image inputs and the likelihood of hallucination occurrences; a strong correlation between single-image hallucination tendencies and those observed in multi-image contexts; and the influence of same-object image ratios and the positional placement of negative samples within image sequences on the occurrence of object identity consistency hallucination. To address these challenges, we propose a Dynamic Attention Balancing mechanism that adjusts inter-image attention distributions while preserving the overall visual attention proportion. Experiments across multiple state-of-the-art MLLMs demonstrate that our method effectively reduces hallucination occurrences and enhances semantic integration and reasoning stability in multi-image scenarios.
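
The balancing idea, rebalancing attention across images while preserving the total visual attention mass, can be sketched directly. The uniform per-image target below is our simplifying assumption:

```python
import numpy as np

def balance_inter_image_attention(attn: np.ndarray, image_spans):
    """Rebalance one query's attention over multi-image visual tokens.

    attn: (num_tokens,) attention weights over all visual tokens.
    image_spans: list of (start, end) token ranges, one per image.
    Each image's share is equalized while the overall visual attention
    proportion is preserved, so no single image dominates the answer.
    """
    out = attn.copy()
    total = sum(attn[s:e].sum() for s, e in image_spans)
    target = total / len(image_spans)
    for s, e in image_spans:
        mass = attn[s:e].sum()
        if mass > 0:
            out[s:e] *= target / mass  # rescale within each image
    return out

attn = np.array([0.5, 0.3, 0.1, 0.05, 0.05])  # image A: tokens 0:2, image B: 2:5
print(balance_inter_image_attention(attn, [(0, 2), (2, 5)]))  # sums unchanged
```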

[160] YOLO-Count: Differentiable Object Counting for Text-to-Image Generation

Guanning Zeng, Xiang Zhang, Zirui Wang, Haiyang Xu, Zeyuan Chen, Bingnan Li, Zhuowen Tu

Main category: cs.CV

TL;DR: YOLO-Count is a differentiable model for open-vocabulary object counting and precise quantity control in text-to-image generation, using a novel ‘cardinality’ map and hybrid supervision.

DetailsMotivation: To address general counting challenges and enable precise quantity control in text-to-image generation.

Method: Uses a ‘cardinality’ map for regression, representation alignment, and hybrid strong-weak supervision. Fully differentiable for gradient-based optimization.

Result: Achieves state-of-the-art counting accuracy and robust quantity control for T2I systems.

Conclusion: YOLO-Count effectively bridges open-vocabulary counting and T2I generation control.

Abstract: We propose YOLO-Count, a differentiable open-vocabulary object counting model that tackles both general counting challenges and enables precise quantity control for text-to-image (T2I) generation. A core contribution is the ‘cardinality’ map, a novel regression target that accounts for variations in object size and spatial distribution. Leveraging representation alignment and a hybrid strong-weak supervision scheme, YOLO-Count bridges the gap between open-vocabulary counting and T2I generation control. Its fully differentiable architecture facilitates gradient-based optimization, enabling accurate object count estimation and fine-grained guidance for generative models. Extensive experiments demonstrate that YOLO-Count achieves state-of-the-art counting accuracy while providing robust and effective quantity control for T2I systems.
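
The key property is that summing a cardinality map yields a differentiable count, which can then drive gradient-based guidance of a T2I sampler. A hedged sketch of our reading of that mechanism (loss form is illustrative):

```python
import torch

def count_from_cardinality(card_map: torch.Tensor) -> torch.Tensor:
    """Object count as the spatial sum of a predicted cardinality map.

    card_map: (B, H, W); each cell holds the fraction of an object it
    covers, so summation gives a differentiable count estimate.
    """
    return card_map.sum(dim=(1, 2))

def quantity_guidance_loss(card_map: torch.Tensor, target_count: float):
    """Penalty usable as gradient guidance for a T2I sampler: push the
    generated image's estimated count toward the prompt's number."""
    return ((count_from_cardinality(card_map) - target_count) ** 2).mean()

card = torch.rand(2, 32, 32, requires_grad=True) * 0.01
loss = quantity_guidance_loss(card, target_count=3.0)
loss.backward()  # gradients flow through the fully differentiable counter
```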

[161] Rethinking Backbone Design for Lightweight 3D Object Detection in LiDAR

Adwait Chandorkar, Hasan Tercan, Tobias Meisen

Main category: cs.CV

TL;DR: The paper introduces Dense Backbone, a lightweight backbone for 3D object detection, reducing model complexity while maintaining performance.

DetailsMotivation: Current LiDAR-based 3D object detection relies on complex backbones (VGG/ResNet), increasing model complexity. Lightweight backbones for 3D detection are underexplored.

Method: Dense Backbone is proposed, combining speed, lightweight design, and accuracy. It’s integrated into existing detectors like PillarNet (DensePillarNet).

Result: DensePillarNet reduces parameters by 29% and latency by 28%, with only a 2% accuracy drop on nuScenes.

Conclusion: Dense Backbone offers a plug-and-play solution for efficient 3D object detection without compromising performance.

Abstract: Recent advancements in LiDAR-based 3D object detection have significantly accelerated progress toward the realization of fully autonomous driving in real-world environments. Despite achieving high detection performance, most of the approaches still rely on a VGG-based or ResNet-based backbone for feature exploration, which increases the model complexity. Lightweight backbone design is well-explored for 2D object detection but remains underexplored for 3D object detection. In this work, we introduce Dense Backbone, a lightweight backbone that combines high processing speed, a compact architecture, and robust detection accuracy. We adapt multiple SoTA 3D object detectors, such as PillarNet, to our backbone and show that these models retain most of their detection capability at a significantly reduced computational cost. To our knowledge, this is the first dense-layer-based backbone tailored specifically for 3D object detection from point cloud data. DensePillarNet, our adaptation of PillarNet, achieves a 29% reduction in model parameters and a 28% reduction in latency with just a 2% drop in detection accuracy on the nuScenes test set. Furthermore, Dense Backbone’s plug-and-play design allows straightforward integration into existing architectures, requiring no modifications to other network components.
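
A dense-layer backbone in the DenseNet sense reuses all earlier feature maps instead of recomputing them, which is what keeps it light relative to VGG/ResNet stacks. A sketch over 2D BEV/pillar feature maps; layer counts and widths are ours, not the paper's:

```python
import torch
import torch.nn as nn

class DenseBlock2D(nn.Module):
    """DenseNet-style block: each layer sees the concatenation of the
    input and all previous layers' outputs (feature reuse)."""
    def __init__(self, in_ch: int, growth: int = 32, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth, kernel_size=3, padding=1, bias=False),
            ))
            ch += growth
        self.out_channels = ch

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

print(DenseBlock2D(64)(torch.rand(1, 64, 50, 50)).shape)  # (1, 192, 50, 50)
```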

[162] GECO: Geometrically Consistent Embedding with Lightspeed Inference

Regine Hartwig, Dominik Muhle, Riccardo Marin, Daniel Cremers

Main category: cs.CV

TL;DR: GECO introduces a self-supervised vision model that enhances geometric awareness in feature learning, achieving faster performance and better accuracy than prior methods.

DetailsMotivation: Existing self-supervised vision models lack 3D geometry awareness, limiting their ability to distinguish parts based on geometry.

Method: GECO uses optimal transport for training, enabling supervision beyond keypoints and handling occlusions/disocclusions. It features a lightweight architecture for real-time performance.

Result: GECO runs at 30 fps (98.2% faster than prior methods) and improves PCK by 6.0%, 6.2%, and 4.1% on PFPascal, APK, and CUB datasets.

Conclusion: PCK alone is insufficient for geometric quality; GECO introduces new metrics for geometry-aware feature learning.

Abstract: Recent advances in feature learning have shown that self-supervised vision foundation models can capture semantic correspondences but often lack awareness of underlying 3D geometry. GECO addresses this gap by producing geometrically coherent features that semantically distinguish parts based on geometry (e.g., left/right eyes, front/back legs). We propose a training framework based on optimal transport, enabling supervision beyond keypoints, even under occlusions and disocclusions. With a lightweight architecture, GECO runs at 30 fps, 98.2% faster than prior methods, while achieving state-of-the-art performance on PFPascal, APK, and CUB, improving PCK by 6.0%, 6.2%, and 4.1%, respectively. Finally, we show that PCK alone is insufficient to capture geometric quality and introduce new metrics and insights for more geometry-aware feature learning. Link to project page: https://reginehartwig.github.io/publications/geco/
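
The optimal-transport supervision GECO names can be illustrated with standard entropy-regularized Sinkhorn iterations: a feature-distance cost matrix between two views yields a soft transport plan that distributes matching mass even where hard keypoint correspondences are unavailable (e.g., under occlusion). A sketch with uniform marginals assumed for simplicity:

```python
import numpy as np

def sinkhorn(cost: np.ndarray, reg: float = 0.1, n_iter: int = 50):
    """Entropy-regularized OT via Sinkhorn iterations.

    cost: (n, m) pairwise feature distances between two views.
    Returns a transport plan whose rows/columns match the (uniform)
    marginals a and b; plan[i, j] is the soft match strength.
    """
    n, m = cost.shape
    K = np.exp(-cost / reg)
    a, b = np.ones(n) / n, np.ones(m) / m  # uniform marginals
    v = np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

plan = sinkhorn(np.random.rand(5, 7))
print(plan.sum(axis=1))  # ~uniform row marginal a
```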

[163] Zero-Shot Anomaly Detection with Dual-Branch Prompt Learning

Zihan Wang, Samira Ebrahimi Kahou, Narges Armanfard

Main category: cs.CV

TL;DR: PILOT introduces a dual-branch prompt learning mechanism and label-free test-time adaptation to improve zero-shot anomaly detection under domain shifts.

DetailsMotivation: Existing ZSAD methods struggle with domain shifts due to limited training data and poor generalization.

Method: PILOT uses a dual-branch prompt learning mechanism and label-free test-time adaptation with pseudo-labels.

Result: PILOT achieves state-of-the-art performance on 13 benchmarks for anomaly detection and localization under domain shift.

Conclusion: PILOT effectively addresses domain shift challenges in ZSAD, outperforming existing methods.

Abstract: Zero-shot anomaly detection (ZSAD) enables identifying and localizing defects in unseen categories by relying solely on generalizable features rather than requiring any labeled examples of anomalies. However, existing ZSAD methods, whether using fixed or learned prompts, struggle under domain shifts because their training data are derived from limited training domains and fail to generalize to new distributions. In this paper, we introduce PILOT, a framework designed to overcome these challenges through two key innovations: (1) a novel dual-branch prompt learning mechanism that dynamically integrates a pool of learnable prompts with structured semantic attributes, enabling the model to adaptively weight the most relevant anomaly cues for each input image; and (2) a label-free test-time adaptation strategy that updates the learnable prompt parameters using high-confidence pseudo-labels from unlabeled test data. Extensive experiments on 13 industrial and medical benchmarks demonstrate that PILOT achieves state-of-the-art performance in both anomaly detection and localization under domain shift.

[164] Cross-Dataset Semantic Segmentation Performance Analysis: Unifying NIST Point Cloud City Datasets for 3D Deep Learning

Alexander Nikitas Dimopoulos, Joseph Grasso

Main category: cs.CV

TL;DR: The study examines semantic segmentation in heterogeneously labeled point-cloud datasets for public safety, highlighting challenges like class imbalance and label unification, and suggests standardized protocols for better performance.

DetailsMotivation: To address challenges in unifying differently labeled 3D point-cloud datasets for public safety applications, focusing on semantic segmentation performance.

Method: Uses NIST’s Point Cloud City dataset with a graded schema and KPConv architecture, evaluating performance via IoU metrics on safety-relevant features.

Result: Larger objects (e.g., stairs, windows) show higher segmentation performance, while smaller safety-critical features have lower recognition due to class imbalance and geometric limitations.

Conclusion: Reliable semantic segmentation requires standardized annotation and improved labeling to tackle data heterogeneity and detect small safety-critical elements.

Abstract: This study analyzes semantic segmentation performance across heterogeneously labeled point-cloud datasets relevant to public safety applications, including pre-incident planning systems derived from lidar scans. Using NIST’s Point Cloud City dataset (Enfield and Memphis collections), we investigate challenges in unifying differently labeled 3D data. Our methodology employs a graded schema with the KPConv architecture, evaluating performance through IoU metrics on safety-relevant features. Results indicate performance variability: geometrically large objects (e.g. stairs, windows) achieve higher segmentation performance, suggesting potential for navigational context, while smaller safety-critical features exhibit lower recognition rates. Performance is impacted by class imbalance and the limited geometric distinction of smaller objects in typical lidar scans, indicating limitations in detecting certain safety-relevant features using current point-cloud methods. Key identified challenges include insufficient labeled data, difficulties in unifying class labels across datasets, and the need for standardization. Potential directions include automated labeling and multi-dataset learning strategies. We conclude that reliable point-cloud semantic segmentation for public safety necessitates standardized annotation protocols and improved labeling techniques to address data heterogeneity and the detection of small, safety-critical elements.

[165] IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation

Wenxuan Guo, Xiuwei Xu, Hang Yin, Ziwei Wang, Jianjiang Feng, Jie Zhou, Jiwen Lu

Main category: cs.CV

TL;DR: IGL-Nav introduces an incremental 3D Gaussian localization framework for efficient and accurate image-goal navigation, outperforming existing methods.

DetailsMotivation: Existing methods struggle to model the geometric relationship between the 3D environment and the goal image, leading to inefficiency and inaccuracy.

Method: IGL-Nav incrementally updates scene representation with monocular prediction, coarsely localizes the goal using geometric information, and refines the pose via differentiable rendering.

Result: IGL-Nav significantly outperforms state-of-the-art methods and handles free-view image-goal settings, even on real-world robotic platforms.

Conclusion: The framework provides a robust and efficient solution for 3D-aware image-goal navigation, with practical applicability.

Abstract: Visual navigation with an image as goal is a fundamental and challenging problem. Conventional methods either rely on end-to-end RL learning or a modular-based policy with a topological graph or BEV map as memory, which cannot fully model the geometric relationship between the explored 3D environment and the goal image. In order to efficiently and accurately localize the goal image in 3D space, we build our navigation system upon the renderable 3D gaussian (3DGS) representation. However, due to the computational intensity of 3DGS optimization and the large search space of 6-DoF camera pose, directly leveraging 3DGS for image localization during the agent’s exploration process is prohibitively inefficient. To this end, we propose IGL-Nav, an Incremental 3D Gaussian Localization framework for efficient and 3D-aware image-goal navigation. Specifically, we incrementally update the scene representation as new images arrive with feed-forward monocular prediction. Then we coarsely localize the goal by leveraging the geometric information for discrete space matching, which can be equivalent to efficient 3D convolution. When the agent is close to the goal, we finally solve the fine target pose with optimization via differentiable rendering. The proposed IGL-Nav outperforms existing state-of-the-art methods by a large margin across diverse experimental configurations. It can also handle the more challenging free-view image-goal setting and be deployed on a real-world robotic platform, using a cellphone to capture the goal image at an arbitrary pose. Project page: https://gwxuan.github.io/IGL-Nav/.

[166] LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li

Main category: cs.CV

TL;DR: LLaVA-Video-178K, a synthetic dataset for video instruction-following, is introduced to train LLaVA-Video, a video large multimodal model (LMM), which shows strong performance on benchmarks.

DetailsMotivation: The lack of high-quality raw video data for training video LMMs motivates the creation of a synthetic dataset.

Method: A synthetic dataset (LLaVA-Video-178K) is created for tasks like captioning and QA. The dataset is combined with existing visual instruction tuning data to train LLaVA-Video.

Result: LLaVA-Video performs well across video benchmarks, proving the dataset’s effectiveness.

Conclusion: The synthetic dataset and LLaVA-Video model are effective, with plans to release the dataset, pipeline, and checkpoints.

Abstract: The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

[167] CPCL: Cross-Modal Prototypical Contrastive Learning for Weakly Supervised Text-based Person Retrieval

Xinpeng Zhao, Yanwei Zheng, Chuanlin Lan, Xiaowei Zhang, Bowen Huang, Jibin Yang, Dongxiao Yu

Main category: cs.CV

TL;DR: The paper proposes CPCL, a method for weakly supervised text-based person retrieval, addressing intra-class differences and leveraging prototypical features for better performance.

DetailsMotivation: The challenge lies in intra-class differences (intra-modal feature variations and cross-modal semantic gaps) and the lack of prototypical feature utilization in prior works.

Method: CPCL uses CLIP for mapping, a PMM module for cross-modal associations, and an OPLM module for outlier mining to enhance clustering.

Result: Extensive experiments validate CPCL’s effectiveness and generalizability on benchmarks.

Conclusion: CPCL improves weakly supervised text-based person retrieval by addressing key challenges and leveraging prototypical features.

Abstract: Weakly supervised text-based person retrieval seeks to retrieve images of a target person using textual descriptions, without relying on identity annotations, which makes the task more challenging and practical. The primary challenge is the intra-class differences, encompassing intra-modal feature variations and cross-modal semantic gaps. Prior works have focused on instance-level samples and ignored the prototypical features of each person, which are intrinsic and invariant. Toward this, we propose a Cross-Modal Prototypical Contrastive Learning (CPCL) method. In practice, the CPCL introduces the CLIP model to weakly supervised text-based person retrieval to map visual and textual instances into a shared latent space. Subsequently, the proposed Prototypical Multi-modal Memory (PMM) module captures associations between heterogeneous modalities of image-text pairs belonging to the same person through the Hybrid Cross-modal Matching (HCM) module in a many-to-many mapping fashion. Moreover, the Outlier Pseudo Label Mining (OPLM) module further distinguishes valuable outlier samples from each modality, enhancing the creation of more reliable clusters by mining implicit relationships between image-text pairs. We conduct extensive experiments on popular benchmarks of weakly supervised text-based person retrieval, which validate the effectiveness and generalizability of CPCL.

[168] AttnMod: Attention-Based New Art Styles

Shih-Chieh Su

Main category: cs.CV

TL;DR: AttnMod is a training-free method to modulate cross-attention in diffusion models, enabling novel art styles without retraining or prompt changes.

DetailsMotivation: To simulate how human artists reinterpret images by altering attention mechanisms, expanding text-to-image generation's expressive range.

Method: Modifies cross-attention in pre-trained diffusion models during denoising to target stylistic transformations.

Result: Enables diverse, unpromptable art styles by adjusting attention without model retraining.

Conclusion: AttnMod enhances text-to-image generation by allowing stylistic flexibility through attention modulation.

Abstract: We introduce AttnMod, a training-free technique that modulates cross-attention in pre-trained diffusion models to generate novel, unpromptable art styles. The method is inspired by how a human artist might reinterpret a generated image, for example by emphasizing certain features, dispersing color, twisting silhouettes, or materializing unseen elements. AttnMod simulates this intent by altering how the text prompt conditions the image through attention during denoising. These targeted modulations enable diverse stylistic transformations without changing the prompt or retraining the model, and they expand the expressive capacity of text-to-image generation.
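
Since the paper explores many modulations, the sketch below shows only the simplest possible instance of the idea: rescaling cross-attention logits between image queries and text keys during denoising, which alters how the prompt conditions the image without retraining or prompt changes. The scaling form is our illustrative assumption:

```python
import torch

def modulated_cross_attention(q, k, v, scale: float = 1.5):
    """Cross-attention with one toy modulation knob.

    q: (B, N_img, d) image-token queries; k, v: (B, N_txt, d) text-token
    keys/values. scale > 1 sharpens (emphasizes) the text conditioning;
    scale < 1 diffuses it. AttnMod's actual modulations are richer.
    """
    d = q.shape[-1]
    logits = (q @ k.transpose(-2, -1)) / d**0.5
    attn = torch.softmax(scale * logits, dim=-1)
    return attn @ v

q = torch.rand(1, 64, 40)   # image tokens
k = torch.rand(1, 77, 40)   # text tokens
v = torch.rand(1, 77, 40)
print(modulated_cross_attention(q, k, v).shape)  # (1, 64, 40)
```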

[169] Gaga: Group Any Gaussians via 3D-aware Memory Bank

Weijie Lyu, Xueting Li, Abhijit Kundu, Yi-Hsuan Tsai, Ming-Hsuan Yang

Main category: cs.CV

TL;DR: Gaga is a framework for 3D scene reconstruction and segmentation using inconsistent 2D masks from zero-shot models, outperforming existing methods by leveraging a 3D-aware memory bank.

DetailsMotivation: Prior methods rely on video object tracking or contrastive learning, which assume continuous view changes. Gaga addresses this limitation by handling sparse camera poses and diverse mask sources.

Method: Gaga uses a novel 3D-aware memory bank to associate object masks across varying camera poses, eliminating the need for continuous view assumptions.

Result: Gaga shows robustness to camera pose variations and works with diverse segmentation models, achieving superior performance in qualitative and quantitative evaluations.

Conclusion: Gaga is versatile and effective for real-world 3D scene understanding and manipulation, outperforming state-of-the-art methods.

Abstract: We introduce Gaga, a framework that reconstructs and segments open-world 3D scenes by leveraging inconsistent 2D masks predicted by zero-shot class-agnostic segmentation models. Contrasted to prior 3D scene segmentation approaches that rely on video object tracking or contrastive learning methods, Gaga utilizes spatial information and effectively associates object masks across diverse camera poses through a novel 3D-aware memory bank. By eliminating the assumption of continuous view changes in training images, Gaga demonstrates robustness to variations in camera poses, particularly beneficial for sparsely sampled images, ensuring precise mask label consistency. Furthermore, Gaga accommodates 2D segmentation masks from diverse sources and demonstrates robust performance with different open-world zero-shot class-agnostic segmentation models, significantly enhancing its versatility. Extensive qualitative and quantitative evaluations demonstrate that Gaga performs favorably against state-of-the-art methods, emphasizing its potential for real-world applications such as 3D scene understanding and manipulation.

[170] Boosting Adversarial Transferability with Low-Cost Optimization via Maximin Expected Flatness

Chunlin Qiu, Ang Li, Yiheng Duan, Shenyi Zhang, Yuanjie Zhang, Lingchen Zhao, Qian Wang

Main category: cs.CV

TL;DR: The paper addresses limitations in flatness-enhanced transfer-based attacks by proposing a theoretical foundation and a principled framework (MEF) to balance exploration-exploitation dynamics, achieving superior attack success rates and efficiency.

DetailsMotivation: Existing flatness-enhanced transfer-based attacks lack theoretical grounding and suffer from imbalanced optimization, limiting their effectiveness and efficiency.

Method: The work unifies flatness definitions, formalizes average-case flatness and transferability gaps, and designs the Maximin Expected Flatness (MEF) attack to balance exploration and exploitation.

Result: MEF outperforms state-of-the-art methods, achieving higher attack success rates (4-8% gains) and computational efficiency (half the cost). Combined with augmentation, it gains 15% against defended models.

Conclusion: The paper establishes a theoretical foundation for flatness-based transferability and introduces MEF, a superior and efficient attack framework, setting new robustness benchmarks.

Abstract: Transfer-based attacks craft adversarial examples on white-box surrogate models and directly deploy them against black-box target models, offering model-agnostic and query-free threat scenarios. While flatness-enhanced methods have recently emerged to improve transferability by enhancing the loss surface flatness of adversarial examples, their divergent flatness definitions and heuristic attack designs suffer from unexamined optimization limitations and missing theoretical foundation, thus constraining their effectiveness and efficiency. This work exposes the severely imbalanced exploitation-exploration dynamics in flatness optimization, establishing the first theoretical foundation for flatness-based transferability and proposing a principled framework to overcome these optimization pitfalls. Specifically, we systematically unify fragmented flatness definitions across existing methods, revealing their imbalanced optimization limitations in over-exploration of sensitivity peaks or over-exploitation of local plateaus. To resolve these issues, we rigorously formalize average-case flatness and transferability gaps, proving that enhancing zeroth-order average-case flatness minimizes cross-model discrepancies. Building on this theory, we design a Maximin Expected Flatness (MEF) attack that enhances zeroth-order average-case flatness while balancing flatness exploration and exploitation. Extensive evaluations across 22 models and 24 current transfer-based attacks demonstrate MEF’s superiority: it surpasses the state-of-the-art PGN attack by 4% in attack success rate at half the computational cost and achieves 8% higher success rate under the same budget. When combined with input augmentation, MEF attains 15% additional gains against defense-equipped models, establishing new robustness benchmarks. Our code is available at https://github.com/SignedQiu/MEFAttack.

[171] GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

Quanfeng Lu, Wenqi Shao, Zitao Liu, Lingxiao Du, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Ping Luo

Main category: cs.CV

TL;DR: GUIOdyssey is a dataset for cross-app mobile GUI navigation, addressing limitations of single-app datasets. OdysseyAgent, a multimodal agent, leverages this dataset for improved performance in complex tasks.

DetailsMotivation: Prior GUI agents performed poorly in cross-app navigation due to single-app training datasets. GUIOdyssey aims to bridge this gap.

Method: Developed GUIOdyssey dataset with 8,334 episodes and semantic annotations. Built OdysseyAgent with a history resampler module for efficient navigation.

Result: OdysseyAgent shows effectiveness in in-domain and out-of-domain scenarios, with historical data enhancing performance.

Conclusion: GUIOdyssey and OdysseyAgent advance cross-app GUI navigation, demonstrating the value of comprehensive datasets and multimodal agents.

Abstract: Autonomous Graphical User Interface (GUI) navigation agents can enhance user experience in communication, entertainment, and productivity by streamlining workflows and reducing manual intervention. However, prior GUI agents were often trained on datasets comprising tasks that can be completed within a single app, leading to poor performance in cross-app navigation. To address this problem, we present GUIOdyssey, a comprehensive dataset for cross-app mobile GUI navigation. GUIOdyssey comprises 8,334 episodes with an average of 15.3 steps per episode, covering 6 mobile devices, 212 distinct apps, and 1,357 app combinations. Each step is enriched with detailed semantic reasoning annotations, which aid the model in building cognitive processes and enhancing its reasoning abilities for complex cross-app tasks. Building on GUIOdyssey, we develop OdysseyAgent, an exploratory multimodal agent for long-step cross-app navigation equipped with a history resampler module that efficiently attends to historical screenshot tokens, balancing performance and inference speed. Extensive experiments conducted in both in-domain and out-of-domain scenarios validate the effectiveness of our approach. Moreover, we demonstrate that historical information involving actions, screenshots, and context in our dataset significantly enhances OdysseyAgent’s performance on complex cross-app tasks.

[172] Meta CLIP 2: A Worldwide Scaling Recipe

Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li, Hu Xu

Main category: cs.CV

TL;DR: Meta CLIP 2 improves CLIP training on worldwide web-scale data, outperforming English-only CLIP and mSigLIP in zero-shot tasks and multilingual benchmarks.

DetailsMotivation: Addressing challenges in scaling CLIP to non-English data and overcoming the 'curse of multilinguality' to enhance performance.

Method: Training CLIP from scratch on worldwide web-scale image-text pairs with minimal changes to handle multilingual data.

Result: Meta CLIP 2 ViT-H/14 surpasses English-only CLIP by 0.8% in zero-shot ImageNet classification and achieves SOTA in multilingual benchmarks.

Conclusion: Meta CLIP 2 demonstrates mutual benefits from English and non-English data, setting new benchmarks without system-level changes.

Abstract: Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting tasks from zero-shot classification and retrieval to serving as an encoder for multimodal large language models (MLLMs). Although CLIP is successfully trained on billion-scale image-text pairs from the English world, scaling CLIP’s training further to learn from worldwide web data is still challenging: (1) no curation method is available to handle data points from the non-English world; (2) the English performance of existing multilingual CLIP is worse than its English-only counterpart, i.e., the “curse of multilinguality” that is common in LLMs. Here, we present Meta CLIP 2, the first recipe for training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with minimal changes that are necessary to address the above challenges and present a recipe enabling mutual benefits from English and non-English world data. In zero-shot ImageNet classification, Meta CLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and surprisingly sets a new state-of-the-art without system-level confounding factors (e.g., translation, bespoke architecture changes) on multilingual benchmarks, such as CVQA with 57.4%, Babel-ImageNet with 50.2% and XM3600 with 64.3% on image-to-text retrieval.

[173] A Physical Model-Guided Framework for Underwater Image Enhancement and Depth Estimation

Dazhao Du, Lingyu Si, Fanjiang Xu, Jianwei Niu, Fuchun Sun

Main category: cs.CV

TL;DR: A framework combining a Deep Degradation Model (DDM) with any underwater image enhancement (UIE) model is proposed to improve accuracy in estimating imaging parameters like depth and veiling light, enhancing underwater images effectively.

DetailsMotivation: Underwater images suffer from visual degradations due to light absorption and scattering. Existing UIE methods often fail to accurately estimate key imaging parameters, limiting their performance.

Method: The framework includes DDM with three sub-networks for veiling light, factors, and depth estimation. It imposes physical constraints on enhancement using the underwater imaging model. A dual-branch UIEConv model is also designed for global and local feature utilization.

Result: The framework and UIEConv achieve remarkable enhancement results across diverse underwater scenes. The depth estimation sub-network also provides accurate depth estimation.

Conclusion: The proposed framework effectively enhances underwater images and estimates depth, validated by extensive experiments in real underwater scenarios.

Abstract: Due to the selective absorption and scattering of light by diverse aquatic media, underwater images usually suffer from various visual degradations. Existing underwater image enhancement (UIE) approaches that combine underwater physical imaging models with neural networks often fail to accurately estimate imaging model parameters such as depth and veiling light, resulting in poor performance in certain scenarios. To address this issue, we propose a physical model-guided framework for jointly training a Deep Degradation Model (DDM) with any advanced UIE model. DDM includes three well-designed sub-networks to accurately estimate various imaging parameters: a veiling light estimation sub-network, a factors estimation sub-network, and a depth estimation sub-network. Based on the estimated parameters and the underwater physical imaging model, we impose physical constraints on the enhancement process by modeling the relationship between underwater images and desired clean images, i.e., outputs of the UIE model. Moreover, while our framework is compatible with any UIE model, we design a simple yet effective fully convolutional UIE model, termed UIEConv. UIEConv utilizes both global and local features for image enhancement through a dual-branch structure. UIEConv trained within our framework achieves remarkable enhancement results across diverse underwater scenes. Furthermore, as a byproduct of UIE, the trained depth estimation sub-network enables accurate underwater scene depth estimation. Extensive experiments conducted in various real underwater imaging scenarios, including deep-sea environments with artificial light sources, validate the effectiveness of our framework and the UIEConv model.
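
The physical constraint rests on the classic underwater image formation model, I_c(x) = J_c(x)·t_c(x) + B_c·(1 − t_c(x)) with transmission t_c(x) = exp(−β_c·d(x)), whose parameters (depth d, attenuation-like factors β, veiling light B) are exactly what DDM's three sub-networks estimate. A sketch of this standard form (per-channel details in the paper may vary):

```python
import numpy as np

def underwater_degradation(J, depth, beta, B):
    """Re-degrade a clean image with the underwater imaging model.

    J: clean image (H, W, 3); depth: (H, W) scene depth in meters;
    beta: (3,) per-channel attenuation coefficients; B: (3,) veiling light.
    The UIE model predicts J; re-degrading it and comparing against the
    observed image gives the physical consistency constraint.
    """
    t = np.exp(-depth[..., None] * beta[None, None, :])  # transmission t_c(x)
    return J * t + B[None, None, :] * (1.0 - t)

I = underwater_degradation(np.random.rand(64, 64, 3),
                           np.full((64, 64), 5.0),
                           np.array([0.3, 0.1, 0.05]),   # red attenuates fastest
                           np.array([0.1, 0.3, 0.4]))    # blue-green veiling light
print(I.shape)  # (64, 64, 3)
```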

[174] YOLOO: You Only Learn from Others Once

Lipeng Gu, Mingqiang Wei, Xuefeng Yan, Dingkun Zhu, Wei Zhao, Haoran Xie

Main category: cs.CV

TL;DR: YOLOO introduces a multi-modal 3D MOT paradigm where the point cloud encoder learns a unified tri-modal representation (UTR) during training, eliminating multi-modal input during inference. It uses UTEnc and F-GC for efficient tracking without compromising performance.

DetailsMotivation: To reduce computational costs of multi-modal DNNs in 3D MOT by learning from multiple modalities only during training.

Method: YOLOO employs a unified tri-modal encoder (UTEnc) to fuse point cloud, image, and text data into UTRs, and a flexible geometric constraint (F-GC) to filter mismatches.

Result: Achieves efficient tracking using only the point cloud encoder, maintaining performance without multi-modal inference.

Conclusion: YOLOO enhances robustness and efficiency in multi-modal 3D MOT by leveraging UTRs and F-GC.

Abstract: Multi-modal 3D multi-object tracking (MOT) typically necessitates extensive computational costs of deep neural networks (DNNs) to extract multi-modal representations. In this paper, we propose an intriguing question: May we learn from multiple modalities only during training to avoid multi-modal input in the inference phase? To answer it, we propose YOLOO, a novel multi-modal 3D MOT paradigm: You Only Learn from Others Once. YOLOO empowers the point cloud encoder to learn a unified tri-modal representation (UTR) from point clouds and other modalities, such as images and textual cues, all at once. Leveraging this UTR, YOLOO achieves efficient tracking solely using the point cloud encoder without compromising its performance, fundamentally obviating the need for computationally intensive DNNs. Specifically, YOLOO includes two core components: a unified tri-modal encoder (UTEnc) and a flexible geometric constraint (F-GC) module. UTEnc integrates a point cloud encoder with image and text encoders adapted from pre-trained CLIP. It seamlessly fuses point cloud information with rich visual-textual knowledge from CLIP into the point cloud encoder, yielding highly discriminative UTRs that facilitate the association between trajectories and detections. Additionally, F-GC filters out mismatched associations with similar representations but significant positional discrepancies. It further enhances the robustness of UTRs without requiring any scene-specific tuning, addressing a key limitation of customized geometric constraints (e.g., 3D IoU). Lastly, high-quality 3D trajectories are generated by a traditional data association component. By integrating these advancements into a multi-modal 3D MOT scheme, our YOLOO achieves substantial gains in both robustness and efficiency.
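
One way to realize "learn from others once" is a contrastive alignment that pulls each point-cloud embedding toward the frozen CLIP image and text embeddings of the same object during training, so inference needs only the point-cloud encoder. The loss form and temperature below are our assumptions, not the paper's stated objective:

```python
import torch
import torch.nn.functional as F

def tri_modal_alignment_loss(pc_emb, img_emb, txt_emb, tau: float = 0.07):
    """InfoNCE-style alignment of point-cloud embeddings to frozen CLIP
    image/text embeddings; row i of each batch describes the same object."""
    pc = F.normalize(pc_emb, dim=-1)
    losses = []
    for other in (F.normalize(img_emb, dim=-1), F.normalize(txt_emb, dim=-1)):
        logits = pc @ other.t() / tau        # (N, N) cross-modal similarities
        labels = torch.arange(pc.size(0))    # i-th cloud matches i-th crop/caption
        losses.append(F.cross_entropy(logits, labels))
    return sum(losses) / len(losses)

loss = tri_modal_alignment_loss(torch.rand(8, 512),  # point-cloud embeddings
                                torch.rand(8, 512),  # CLIP image embeddings
                                torch.rand(8, 512))  # CLIP text embeddings
print(loss)
```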

[175] BlinkTrack: Feature Tracking over 80 FPS via Events and Images

Yichen Shen, Yijin Li, Shuo Chen, Guanglin Li, Zhaoyang Huang, Hujun Bao, Zhaopeng Cui, Guofeng Zhang

Main category: cs.CV

TL;DR: BlinkTrack integrates event data with grayscale images for high-frequency feature tracking, outperforming existing methods with speeds over 80 FPS.

DetailsMotivation: Event cameras lack fine-grained texture, causing tracking errors; combining event data with grayscale images addresses this.

Method: Extends Kalman filter into a learning-based framework with differentiable filters for event and image data fusion.

Result: Achieves over 80 FPS with multi-modality data and 100 FPS with preprocessed event data, significantly outperforming existing tracking methods.

Conclusion: BlinkTrack effectively solves tracking challenges by fusing event and image data, with high performance and speed.

Abstract: Event cameras, known for their high temporal resolution and ability to capture asynchronous changes, have gained significant attention for their potential in feature tracking, especially in challenging conditions. However, event cameras lack the fine-grained texture information that conventional cameras provide, leading to error accumulation in tracking. To address this, we propose a novel framework, BlinkTrack, which integrates event data with grayscale images for high-frequency feature tracking. Our method extends the traditional Kalman filter into a learning-based framework, utilizing differentiable Kalman filters in both event and image branches. This approach improves single-modality tracking and effectively solves the data association and fusion from asynchronous event and image data. We also introduce new synthetic and augmented datasets to better evaluate our model. Experimental results indicate that BlinkTrack significantly outperforms existing methods, exceeding 80 FPS with multi-modality data and 100 FPS with preprocessed event data. Codes and dataset are available at https://github.com/ColieShen/BlinkTrack.
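
The core idea can be illustrated with a Kalman update whose gain is predicted by a small network, making the whole filter differentiable. The constant-velocity state, the gain predictor, and all shapes below are illustrative assumptions, not BlinkTrack’s actual design:

```python
import torch
import torch.nn as nn

class LearnedKalmanUpdate(nn.Module):
    """Differentiable Kalman-style update with a learned gain (hypothetical)."""
    def __init__(self, state_dim=4, obs_dim=2):
        super().__init__()
        F = torch.eye(state_dim)              # constant-velocity transition
        F[0, 2] = F[1, 3] = 1.0               # state = [x, y, vx, vy]
        self.register_buffer("F", F)
        self.gain_net = nn.Sequential(        # predicts the Kalman gain
            nn.Linear(obs_dim, 32), nn.ReLU(),
            nn.Linear(32, state_dim * obs_dim),
        )

    def forward(self, state, obs):
        pred = state @ self.F.T                           # predict step
        innovation = obs - pred[:, :2]                    # measurement residual
        K = self.gain_net(innovation).view(-1, 4, 2)      # learned gain
        return pred + (K @ innovation.unsqueeze(-1)).squeeze(-1)

tracker = LearnedKalmanUpdate()
state = torch.zeros(8, 4)       # 8 feature tracks
obs = torch.randn(8, 2)         # measurements from the event or image branch
state = tracker(state, obs)     # end-to-end differentiable update
```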

[176] Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats

Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, Zexiang Xu

Main category: cs.CV

TL;DR: Long-LRM is a fast, high-resolution 3D Gaussian reconstruction model using a mix of Mamba2 and transformer blocks, achieving 800x speedup over optimization-based methods.

DetailsMotivation: To enable instant, high-quality, wide-coverage 3D scene reconstruction from large input sizes efficiently.

Method: Combines Mamba2 and transformer blocks with token merging and Gaussian pruning for efficiency. Inputs 32 images (960x540) and outputs reconstruction in 1 second on an A100 GPU.

Result: Achieves comparable quality to optimization-based methods with 800x speedup and handles 60x larger inputs than previous feed-forward approaches.

Conclusion: Long-LRM is a highly efficient and scalable solution for 3D Gaussian reconstruction, with potential for further enhancement via compatibility with other Gaussian variants.

Abstract: We propose Long-LRM, a feed-forward 3D Gaussian reconstruction model for instant, high-resolution, 360° wide-coverage, scene-level reconstruction. Specifically, it takes in 32 input images at a resolution of 960x540 and produces the Gaussian reconstruction in just 1 second on a single A100 GPU. To handle the long sequence of 250K tokens brought by the large input size, Long-LRM features a mixture of the recent Mamba2 blocks and the classical transformer blocks, enhanced by a light-weight token merging module and Gaussian pruning steps that balance between quality and efficiency. We evaluate Long-LRM on the large-scale DL3DV benchmark and Tanks&Temples, demonstrating reconstruction quality comparable to the optimization-based methods while achieving an 800x speedup w.r.t. the optimization-based approaches and an input size at least 60x larger than the previous feed-forward approaches. We conduct extensive ablation studies on our model design choices for both rendering quality and computation efficiency. We also explore Long-LRM’s compatibility with other Gaussian variants such as 2D GS, which enhances Long-LRM’s ability in geometry reconstruction. Project page: https://arthurhero.github.io/projects/llrm
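
The token-merging idea can be illustrated in isolation. The sketch below (an assumption in the spirit of the light-weight merging module, not the paper’s code) averages the most similar even/odd token pairs to shorten a long sequence:

```python
import torch

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar even/odd token pairs by averaging.
    x: (B, N, C) with N even and r <= N // 2; returns (B, N - r, C)."""
    B, N, C = x.shape
    a, b = x[:, 0::2], x[:, 1::2]                     # (B, N/2, C) pair halves
    sim = torch.cosine_similarity(a, b, dim=-1)       # per-pair similarity
    idx = sim.argsort(dim=-1, descending=True)        # most similar first
    merge_idx, keep_idx = idx[:, :r], idx[:, r:]
    gather = lambda t, i: t.gather(1, i.unsqueeze(-1).expand(-1, -1, C))
    merged = 0.5 * (gather(a, merge_idx) + gather(b, merge_idx))
    kept = torch.cat([gather(a, keep_idx), gather(b, keep_idx)], dim=1)
    return torch.cat([merged, kept], dim=1)           # (B, N - r, C)

x = torch.randn(2, 1024, 64)
print(merge_tokens(x, r=256).shape)                   # torch.Size([2, 768, 64])
```

Applied repeatedly between blocks, a reduction like this is what keeps a 250K-token sequence tractable.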

[177] ShadowMamba: State-Space Model with Boundary-Region Selective Scan for Shadow Removal

Xiujin Zhu, Chee-Onn Chow, Joon Huang Chuah

Main category: cs.CV

TL;DR: ShadowMamba, a lightweight Mamba-based model, improves shadow removal by addressing semantic continuity and computational efficiency with a novel boundary-region selective scanning mechanism.

DetailsMotivation: Shadows cause abrupt brightness changes that hurt downstream tasks. Existing Transformer-based methods rely on window mechanisms that shrink the effective receptive field and weaken long-range dependency modeling. Mamba offers linear-complexity global modeling, but its original scanning mechanism overlooks semantic continuity along shadow boundaries.

Method: Proposes a boundary-region selective scanning mechanism to separately process shadow, boundary, and non-shadow regions. Uses a hierarchical U-Net structure for efficiency, combining local detail capture in shallow layers with global feature learning in deeper layers.

Result: Outperforms state-of-the-art models on ISTD+, ISTD, and SRD datasets with fewer parameters and lower computational cost.

Conclusion: ShadowMamba effectively balances local detail and global feature modeling, offering a lightweight and efficient solution for shadow removal.

Abstract: Image shadow removal is a common low-level vision problem. Shadows cause sudden brightness changes in some areas, which can affect the accuracy of downstream tasks. Currently, Transformer-based shadow removal methods improve computational efficiency by using a window mechanism. However, this approach reduces the effective receptive field and weakens the ability to model long-range dependencies in shadow images. Recently, Mamba has achieved significant success in computer vision by modeling long-sequence information globally with linear complexity. However, when applied to shadow removal, its original scanning mechanism overlooks the semantic continuity along shadow boundaries and the coherence within each region. To solve this issue, we propose a new boundary-region selective scanning mechanism that scans shadow, boundary, and non-shadow regions separately, making pixels of the same type closer in the sequence. This increases semantic continuity and helps the model understand local details better. Incorporating this idea, we design the first Mamba-based lightweight shadow removal model, called ShadowMamba. It uses a hierarchical U-Net structure, which effectively reduces the number of parameters and computational complexity. Shallow layers rely on our boundary-region selective scanning to capture local details, while deeper layers use global cross-scanning to learn global brightness features. Extensive experiments show that ShadowMamba outperforms current state-of-the-art models on ISTD+, ISTD, and SRD datasets, and it also requires fewer parameters and less computational cost. (Code will be made available upon paper acceptance.)
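
The scanning mechanism is easy to picture as a permutation of the flattened feature map. A minimal sketch (assumed interface; the real model integrates this inside Mamba blocks) that groups shadow, boundary, and non-shadow pixels contiguously and remembers how to undo the reordering:

```python
import torch

def boundary_region_scan(feat, shadow_mask, boundary_mask):
    """feat: (B, C, H, W); masks: (B, H, W) bool. Returns tokens reordered
    so same-region pixels are contiguous, plus the inverse permutation."""
    B, C, H, W = feat.shape
    seq = feat.flatten(2).transpose(1, 2)                 # (B, H*W, C)
    region = torch.full((B, H * W), 2, dtype=torch.long,
                        device=feat.device)               # non-shadow = 2
    region[boundary_mask.flatten(1)] = 1                  # boundary = 1
    region[shadow_mask.flatten(1)] = 0                    # shadow = 0
    order = torch.argsort(region, dim=1, stable=True)     # keep raster order
    inv = torch.argsort(order, dim=1)                     # to restore layout
    tokens = seq.gather(1, order.unsqueeze(-1).expand(-1, -1, C))
    return tokens, inv
```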

[178] CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation

Dengke Zhang, Fagui Liu, Quan Tang

Main category: cs.CV

TL;DR: CorrCLIP improves CLIP’s segmentation by addressing inter-class correlations, using SAM for patch scope and self-supervised models for coherent similarity, achieving top performance on benchmarks.

DetailsMotivation: CLIP struggles with patch alignment in segmentation due to incoherent patch correlations, especially inter-class ones.

Method: CorrCLIP uses SAM to define patch interaction scope and self-supervised models for similarity values, enhancing features and spatial consistency.

Result: CorrCLIP outperforms on eight benchmarks by improving patch correlations, features, and segmentation maps.

Conclusion: CorrCLIP effectively addresses CLIP’s segmentation limitations, offering superior performance and open-vocabulary capabilities.

Abstract: Open-vocabulary semantic segmentation aims to assign semantic labels to each pixel without being constrained by a predefined set of categories. While Contrastive Language-Image Pre-training (CLIP) excels in zero-shot classification, it struggles to align image patches with category embeddings because of its incoherent patch correlations. This study reveals that inter-class correlations are the main reason for impairing CLIP’s segmentation performance. Accordingly, we propose CorrCLIP, which reconstructs the scope and value of patch correlations. Specifically, CorrCLIP leverages the Segment Anything Model (SAM) to define the scope of patch interactions, reducing inter-class correlations. To mitigate the problem that SAM-generated masks may contain patches belonging to different classes, CorrCLIP incorporates self-supervised models to compute coherent similarity values, suppressing the weight of inter-class correlations. Additionally, we introduce two additional branches to strengthen patch features’ spatial details and semantic representation. Finally, we update segmentation maps with SAM-generated masks to improve spatial consistency. Based on the improvement across patch correlations, feature representations, and segmentation maps, CorrCLIP achieves superior performance across eight benchmarks. Codes are available at: https://github.com/zdk258/CorrCLIP.
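
The correlation reconstruction reduces to a masked attention over patches. A minimal sketch under assumed inputs (per-patch CLIP features, self-supervised features, and a SAM segment id per patch):

```python
import torch

def masked_patch_correlation(clip_feats, ssl_feats, mask_ids):
    """clip_feats, ssl_feats: (N, C) L2-normalized; mask_ids: (N,) ints."""
    same_mask = mask_ids[:, None] == mask_ids[None, :]   # scope from SAM
    sim = ssl_feats @ ssl_feats.T                        # coherent values
    sim = sim.masked_fill(~same_mask, float("-inf"))     # cut inter-class ties
    attn = sim.softmax(dim=-1)                           # (N, N)
    return attn @ clip_feats                             # refined patch feats
```

Each patch only aggregates information from patches in the same SAM segment, which is exactly the inter-class suppression the paper argues CLIP is missing.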

[179] PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs

Teng Zhou, Xiaoyu Zhang, Yongchuan Tang

Main category: cs.CV

TL;DR: PanoLlama introduces an autoregressive paradigm for panoramic image generation, overcoming size limitations of existing models and achieving state-of-the-art performance in coherence, fidelity, and aesthetics.

DetailsMotivation: Existing methods for panoramic image generation (PIG) lack multilevel coherence due to complex crop connection designs. The autoregressive paradigm aligns well with this challenge.

Method: PanoLlama uses token redirection for next-crop prediction in horizontal and vertical directions, enabling endless panorama generation without training.

Result: Achieves SOTA performance in coherence (47.50%), fidelity (28.16%), and aesthetics (15%), and supports unique applications like mask-free layout control.

Conclusion: PanoLlama refreshes PIG with a novel framework, standardized evaluation dataset, and broad applicability, setting a new benchmark for the field.

Abstract: Panoramic Image Generation (PIG) aims to create coherent images of arbitrary lengths. Most existing methods fall into the joint diffusion paradigm, but their complex and heuristic crop connection designs often limit their ability to achieve multilevel coherence. By deconstructing this challenge into its core components, we find it naturally aligns with next-token prediction, leading us to adopt an autoregressive (AR) paradigm for PIG modeling. However, existing visual AR (VAR) models are limited to fixed-size generation, lacking the capability to produce panoramic images. In this paper, we propose PanoLlama, a novel framework that achieves endless and coherent panorama generation with the autoregressive paradigm. Our approach develops a training-free strategy that utilizes token redirection to overcome the size limitations of existing VAR models, enabling next-crop prediction in both horizontal and vertical directions. This refreshes the PIG pipeline while achieving SOTA performance in coherence (47.50%), fidelity (28.16%), and aesthetics (15%). Additionally, PanoLlama supports applications other PIG methods cannot achieve, including mask-free layout control, multi-scale and multi-guidance synthesis. To facilitate standardized evaluation, we also establish a dataset with 1,000 prompts spanning 100+ themes, providing a new testing benchmark for PIG research. The code is available at https://github.com/0606zt/PanoLlama.
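
In outline, next-crop prediction is a loop over a fixed-size generator. The sketch below is schematic: generate_crop is a hypothetical wrapper around a fixed-size visual AR model, and token redirection is reduced here to conditioning each new crop on the previous crop’s trailing columns:

```python
import torch

def generate_panorama(generate_crop, prompt, n_crops=8, overlap=16):
    crops = [generate_crop(prompt, prefix=None)]        # (H, W) token grid
    for _ in range(n_crops - 1):
        prefix = crops[-1][:, -overlap:]                # redirected tokens
        crops.append(generate_crop(prompt, prefix=prefix))
    # Drop the overlapping columns when stitching crops side by side.
    return torch.cat([crops[0]] + [c[:, overlap:] for c in crops[1:]], dim=1)
```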

[180] FakeIDet: Exploring Patches for Privacy-Preserving Fake ID Detection

Javier Muñoz-Haro, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez

Main category: cs.CV

TL;DR: The paper addresses fake ID detection challenges, proposing a privacy-aware patch-based method (FakeIDet) and releasing a public dataset (FakeIDet-db).

DetailsMotivation: Real ID data scarcity and privacy concerns hinder research in fake ID detection.

Method: Introduces a patch-based approach (FakeIDet) with varying anonymization levels and patch sizes, using vision transformers and foundation models.

Result: Achieves 13.91% and 0% EERs at patch and whole ID levels, showing strong generalization.

Conclusion: FakeIDet advances fake ID detection with privacy-aware methods and a public dataset, addressing data scarcity.

Abstract: Verifying the authenticity of identity documents (IDs) has become a critical challenge for real-life applications such as digital banking, crypto-exchanges, renting, etc. This study focuses on the topic of fake ID detection, covering several limitations in the field. In particular, there are no publicly available data from real IDs for proper research in this area, and most published studies rely on proprietary internal databases that are not available for privacy reasons. In order to address this critical challenge of real data scarcity, which makes it so difficult to advance machine learning-based fake ID detection, we introduce a new patch-based methodology that trades off privacy and performance, and propose a novel patch-wise approach for privacy-aware fake ID detection: FakeIDet. In our experiments, we explore: i) two levels of anonymization for an ID (i.e., fully- and pseudo-anonymized), and ii) different patch size configurations, varying the amount of sensitive data visible in the patch image. State-of-the-art methods, such as vision transformers and foundation models, are considered as backbones. Our results show that, on an unseen database (DLC-2021), our proposal for fake ID detection achieves 13.91% and 0% EERs at the patch and the whole ID level, showing a good generalization to other databases. In addition to the patch-based methodology introduced and the new FakeIDet method based on it, another key contribution of our article is the release of the first publicly available database that contains 48,400 patches from real and fake IDs, called FakeIDet-db, together with the experimental framework.
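
Patch-wise detection itself is mechanically simple; the privacy trade-off comes from choosing the patch size. A minimal, assumed sketch of the slicing and score aggregation (not the released code):

```python
import torch

def extract_patches(img: torch.Tensor, patch: int = 64) -> torch.Tensor:
    """img: (C, H, W) -> (num_patches, C, patch, patch)."""
    C, H, W = img.shape
    img = img[:, : H - H % patch, : W - W % patch]   # drop ragged border
    p = img.unfold(1, patch, patch).unfold(2, patch, patch)
    return p.permute(1, 2, 0, 3, 4).reshape(-1, C, patch, patch)

def id_level_score(patch_logits: torch.Tensor) -> torch.Tensor:
    """Aggregate per-patch fake logits into one document-level score."""
    return patch_logits.sigmoid().mean()
```

Smaller patches expose less sensitive content per input but make each individual decision harder, which is the trade-off the reported EERs quantify.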

[181] Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients for Brain Image Segmentation

Xiaoling Hu, Xiangrui Zeng, Oula Puonti, Juan Eugenio Iglesias, Bruce Fischl, Yael Balbastre

Main category: cs.CV

TL;DR: Learn2Synth introduces a method to automatically tune synthesis parameters for domain randomization, improving segmentation network performance on real data without using real examples for training.

DetailsMotivation: Domain randomization reduces overfitting by exposing networks to varied synthetic data, but manual hyperparameter tuning is cumbersome. Learn2Synth aims to automate this process.

Method: Learn2Synth learns synthesis parameters using a small set of real labeled data, optimizing synthetic images to enhance segmentation network accuracy on real data. Parametric and nonparametric strategies are employed.

Result: The method improves segmentation performance on synthetic and real-world brain scans, demonstrating effectiveness without biasing the network toward training data.

Conclusion: Learn2Synth automates synthesis parameter tuning, enhancing generalization while avoiding manual hyperparameter adjustments and training biases.

Abstract: Domain randomization through synthesis is a powerful strategy to train networks that are unbiased with respect to the domain of the input images. Randomization allows networks to see a virtually infinite range of intensities and artifacts during training, thereby minimizing overfitting to appearance and maximizing generalization to unseen data. Although powerful, this approach relies on the accurate tuning of a large set of hyperparameters that govern the probabilistic distribution of the synthesized images. Instead of manually tuning these parameters, we introduce Learn2Synth, a novel procedure in which synthesis parameters are learned using a small set of real labeled data. Unlike methods that impose constraints to align synthetic data with real data (e.g., contrastive or adversarial techniques), which risk misaligning the image and its label map, we tune an augmentation engine such that a segmentation network trained on synthetic data has optimal accuracy when applied to real data. This approach allows the training procedure to benefit from real labeled examples, without ever using these real examples to train the segmentation network, which avoids biasing the network towards the properties of the training set. Specifically, we develop parametric and nonparametric strategies to enhance synthetic images in a way that improves the performance of the segmentation network. We demonstrate the effectiveness of this learning strategy on synthetic and real-world brain scans. Code is available at: https://github.com/HuXiaoling/Learn2Synth.
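
The hypergradient idea is a one-step unrolled bilevel optimization. A minimal sketch under assumptions (a single inner SGD step and the torch.func functional API; the paper’s parametric and nonparametric engines are richer):

```python
import torch

def learn2synth_step(synth, seg, real_x, real_y, label_map, loss_fn, lr=1e-2):
    x_syn, y_syn = synth(label_map)                 # differentiable synthesis
    inner_loss = loss_fn(seg(x_syn), y_syn)
    grads = torch.autograd.grad(inner_loss, list(seg.parameters()),
                                create_graph=True)  # keep graph for hypergrad
    # Evaluate the one-step-updated segmenter on real labeled data via the
    # functional API, so the update stays differentiable w.r.t. synth.
    params = {n: p - lr * g
              for (n, p), g in zip(seg.named_parameters(), grads)}
    real_pred = torch.func.functional_call(seg, params, (real_x,))
    return loss_fn(real_pred, real_y)               # outer (hypergradient) loss
```

Calling .backward() on the returned loss and stepping only synth’s optimizer tunes the synthesis engine while the segmenter itself never trains on the real examples.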

[182] Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle

Miroslav Purkrabek, Jiri Matas

Main category: cs.CV

TL;DR: The paper introduces BBox-Mask-Pose (BMP), a method for improving human pose estimation in crowded scenes by iteratively refining bounding boxes, instance masks, and poses using three specialized models. It achieves state-of-the-art performance on OCHuman and COCO datasets.

DetailsMotivation: Existing pose estimation methods struggle with multiple-bodies-in-proximity scenarios, as they often overlook instance masks. The paper aims to address this gap by enforcing mutual consistency among bounding boxes, masks, and poses.

Method: The BMP method uses three models (for bounding boxes, masks, and poses) that iteratively improve each other’s outputs in a closed loop. A new mask-conditioned pose estimation model, MaskPose, is introduced.

Result: BMP achieves SOTA on OCHuman for detection, segmentation, and pose estimation, and on COCO for pose estimation. It improves detection by 39% in overlapping scenes and offers faster runtime with smaller models.

Conclusion: BMP is an effective alternative to large foundational models, excelling in crowded scenes with overlapping instances. Code and models are publicly available.

Abstract: Human pose estimation methods work well on isolated people but struggle with multiple-bodies-in-proximity scenarios. Previous work has addressed this problem by conditioning pose estimation by detected bounding boxes or keypoints, but overlooked instance masks. We propose to iteratively enforce mutual consistency of bounding boxes, instance masks, and poses. The introduced BBox-Mask-Pose (BMP) method uses three specialized models that improve each other’s output in a closed loop. All models are adapted for mutual conditioning, which improves robustness in multi-body scenes. MaskPose, a new mask-conditioned pose estimation model, is the best among top-down approaches on OCHuman. BBox-Mask-Pose pushes SOTA on OCHuman dataset in all three tasks - detection, instance segmentation, and pose estimation. It also achieves SOTA performance on COCO pose estimation. The method is especially good in scenes with large instances overlap, where it improves detection by 39% over the baseline detector. With small specialized models and faster runtime, BMP is an effective alternative to large human-centered foundational models. Code and models are available on https://MiraPurkrabek.github.io/BBox-Mask-Pose.
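
The closed loop reads naturally as pseudocode. A schematic sketch with assumed call signatures, where the three models condition on each other’s latest outputs:

```python
def bmp_loop(image, detector, segmenter, pose_model, n_iters=3):
    boxes = detector(image)
    for _ in range(n_iters):
        masks = segmenter(image, boxes)        # masks conditioned on boxes
        poses = pose_model(image, masks)       # MaskPose: pose from masks
        boxes = detector(image, masks=masks, poses=poses)  # refined boxes
    return boxes, masks, poses
```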

[183] $\texttt{BATCLIP}$: Bimodal Online Test-Time Adaptation for CLIP

Sarthak Kumar Maharana, Baoming Zhang, Leonid Karlinsky, Rogerio Feris, Yunhui Guo

Main category: cs.CV

TL;DR: The paper introduces BATCLIP, a bimodal online test-time adaptation (TTA) method to enhance CLIP’s robustness against common image corruptions, outperforming existing TTA methods.

DetailsMotivation: Despite CLIP's strong zero-shot learning, its robustness to image corruptions is unclear, and existing TTA methods fail due to their unimodal nature.

Method: Proposes BATCLIP, adapting visual encoders and aligning image-text features by associating pseudo-labeled image class prototypes with text features.

Result: Achieves state-of-the-art results on image corruption datasets and demonstrates generalization on domain generalization datasets.

Conclusion: BATCLIP effectively improves CLIP’s robustness to corruptions and generalizes well, offering a promising solution for real-world applications.

Abstract: Although open-vocabulary classification models like Contrastive Language Image Pretraining (CLIP) have demonstrated strong zero-shot learning capabilities, their robustness to common image corruptions remains poorly understood. Through extensive experiments, we show that zero-shot CLIP lacks robustness to common image corruptions during test-time, necessitating the adaptation of CLIP to unlabeled corrupted images using test-time adaptation (TTA). However, we found that existing TTA methods have severe limitations in adapting CLIP due to their unimodal nature. To address these limitations, we propose $\texttt{BATCLIP}$, a bimodal $\textbf{online}$ TTA method designed to improve CLIP’s robustness to common image corruptions. The key insight of our approach is not only to adapt the visual encoders for improving image features but also to strengthen the alignment between image and text features by promoting a stronger association between the image class prototype, computed using pseudo-labels, and the corresponding text feature. We evaluate our approach on benchmark image corruption datasets and achieve state-of-the-art results in online TTA for CLIP. Furthermore, we evaluate our proposed TTA approach on various domain generalization datasets to demonstrate its generalization capabilities. Our code is available at https://github.com/sarthaxxxxx/BATCLIP
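
The bimodal objective can be sketched in a few lines. Under assumptions (normalized features, entropy minimization plus a prototype-to-text cosine term; not the authors’ exact loss):

```python
import torch
import torch.nn.functional as F

def batclip_loss(img_feats, txt_feats, tau=0.01):
    """img_feats: (B, D), txt_feats: (K, D), both L2-normalized."""
    logits = img_feats @ txt_feats.T / tau
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
    pseudo = logits.argmax(dim=-1)                    # pseudo-labels
    classes = pseudo.unique()
    align = 0.0
    for k in classes:                                 # prototype-text pull
        proto = F.normalize(img_feats[pseudo == k].mean(0), dim=-1)
        align = align + (1 - proto @ txt_feats[k])
    return entropy + align / len(classes)
```

Minimizing the second term updates both encoders, which is what makes the method bimodal rather than visual-only.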

[184] HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation

Shaina Raza, Aravind Narayanan, Vahid Reza Khazaie, Ashmal Vayani, Mukund S. Chettiar, Amandeep Singh, Mubarak Shah, Deval Pandya

Main category: cs.CV

TL;DR: HumaniBench is a new benchmark evaluating large multimodal models (LMMs) on human-centered values like fairness, ethics, and inclusivity, using 32,000 image-question pairs. It reveals gaps in model alignment and suggests techniques for improvement.

DetailsMotivation: Existing LMM evaluations lack focus on human-centered values, necessitating a benchmark to assess alignment with fairness, ethics, and inclusivity.

Method: HumaniBench uses 32,000 real-world image-question pairs, labeled via AI-assisted pipeline and expert validation, to evaluate LMMs across seven alignment principles.

Result: Proprietary models lead in reasoning and fairness, while open-source models excel in robustness. Most models struggle with ethical and inclusive behavior. Techniques like Chain-of-Thought prompting improve alignment.

Conclusion: HumaniBench provides a rigorous testbed for diagnosing LMM limitations and promoting responsible development, with all data and code publicly available.

Abstract: Large multimodal models (LMMs) have been widely tested on tasks like visual question answering (VQA), image captioning, and grounding, but lack rigorous evaluation for alignment with human-centered (HC) values such as fairness, ethics, and inclusivity. To address this gap, we introduce \textbf{HumaniBench}, a novel benchmark of 32,000 real-world image-question pairs and an evaluation suite. Labels are generated via an AI-assisted pipeline and validated by experts. HumaniBench assesses LMMs across seven key alignment principles: fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilinguality, through diverse open-ended and closed-ended VQA tasks. Grounded in AI ethics and real-world needs, these principles provide a holistic lens for societal impact. Benchmarking results on different LMM shows that proprietary models generally lead in reasoning, fairness, and multilinguality, while open-source models excel in robustness and grounding. Most models struggle to balance accuracy with ethical and inclusive behavior. Techniques like Chain-of-Thought prompting and test-time scaling improve alignment. As the first benchmark tailored for HC alignment, HumaniBench offers a rigorous testbed to diagnose limitations, and promote responsible LMM development. All data and code are publicly available for reproducibility. Keywords: HumaniBench, vision-language models, responsible AI benchmark, AI alignment evaluation, AI ethics assessment, fairness in AI models, visual question answering (VQA) benchmark, image captioning evaluation, visual grounding tasks, trustworthy AI models, Chain-of-Thought prompting, test-time scaling, ethical AI development tools.

[185] The Silent Assistant: NoiseQuery as Implicit Guidance for Goal-Driven Image Generation

Ruoyu Wang, Huayang Huang, Ye Zhu, Olga Russakovsky, Yu Wu

Main category: cs.CV

TL;DR: NoiseQuery is a novel method for noise initialization in text-to-image generation, enhancing quality and controllability by combining aligned Gaussian noise with user inputs. It generalizes across diffusion models without tuning.

DetailsMotivation: Existing noise optimization methods are model-specific, limiting their applicability. NoiseQuery aims to provide a generic, reusable solution for better generation quality and control.

Method: Leverages aligned Gaussian noise as implicit guidance alongside explicit inputs. Grounded in a fundamental examination of noise scheduler design in diffusion models, ensuring tuning-free generalization.

Result: Enables fine-grained control and improves performance on both high-level semantics and low-level visual attributes. Integrates seamlessly with minimal overhead.

Conclusion: NoiseQuery offers a foundational, model-agnostic layer for enhanced text-to-image generation, compatible with multiple models and techniques.

Abstract: In this work, we introduce NoiseQuery as a novel method for enhanced noise initialization in versatile goal-driven text-to-image (T2I) generation. Specifically, we propose to leverage an aligned Gaussian noise as implicit guidance to complement explicit user-defined inputs, such as text prompts, for better generation quality and controllability. Unlike existing noise optimization methods designed for specific models, our approach is grounded in a fundamental examination of the generic finite-step noise scheduler design in diffusion formulation, allowing better generalization across different diffusion-based architectures in a tuning-free manner. This model-agnostic nature allows us to construct a reusable noise library compatible with multiple T2I models and enhancement techniques, serving as a foundational layer for more effective generation. Extensive experiments demonstrate that NoiseQuery enables fine-grained control and yields significant performance boosts not only over high-level semantics but also over low-level visual attributes, which are typically difficult to specify through text alone, with seamless integration into current workflows with minimal computational overhead.
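
Because the noise library is model-agnostic, its use can be sketched as a simple nearest-neighbor lookup. Everything below (the stored features, the query interface) is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

class NoiseLibrary:
    """Reusable pool of Gaussian seeds tagged with content features."""
    def __init__(self, noises, feats):
        self.noises = noises                       # (N, C, H, W) seeds
        self.feats = F.normalize(feats, dim=-1)    # (N, D) per-seed features

    def query(self, goal_feat):
        goal = F.normalize(goal_feat, dim=-1)
        idx = (self.feats @ goal).argmax()         # seed closest to the goal
        return self.noises[idx]                    # use as the initial latent
```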

[186] DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models

Chenbin Pan, Wenbin He, Zhengzhong Tu, Liu Ren

Main category: cs.CV

TL;DR: DINO-R1 introduces GRQO, a reinforcement learning method for vision models, improving visual reasoning and outperforming baselines.

DetailsMotivation: To address the lack of reasoning capabilities in vision foundation models like DINO, despite their success in representation.

Method: Proposes Group Relative Query Optimization (GRQO) for query-based models, with KL-regularization for stability.

Result: DINO-R1 outperforms baselines in COCO, LVIS, and ODinW benchmarks, showing strong generalization.

Conclusion: DINO-R1 successfully enhances visual reasoning in vision models using reinforcement learning, achieving superior performance.

Abstract: The recent explosive interest in the reasoning capabilities of large language models, such as DeepSeek-R1, has demonstrated remarkable success through reinforcement learning-based fine-tuning frameworks, exemplified by methods like Group Relative Policy Optimization (GRPO). However, such reasoning abilities remain underexplored and notably absent in vision foundation models, including representation models like the DINO series. In this work, we propose \textbf{DINO-R1}, the first such attempt to incentivize visual in-context reasoning capabilities of vision foundation models using reinforcement learning. Specifically, DINO-R1 introduces \textbf{Group Relative Query Optimization (GRQO)}, a novel reinforcement-style training strategy explicitly designed for query-based representation models, which computes query-level rewards based on group-normalized alignment quality. We also apply KL-regularization to stabilize the objectness distribution to reduce the training instability. This joint optimization enables dense and expressive supervision across queries while mitigating overfitting and distributional drift. Building upon Grounding-DINO, we train a series of DINO-R1 family models that integrate a visual prompt encoder and a visual-guided query selection mechanism. Extensive experiments on COCO, LVIS, and ODinW demonstrate that DINO-R1 significantly outperforms supervised fine-tuning baselines, achieving strong generalization in both open-vocabulary and closed-set visual prompting scenarios.
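
Group-relative rewards are a small computation. A minimal sketch assuming per-query alignment scores are already available (the reward shaping and the KL term only, not the full training loop):

```python
import torch
import torch.nn.functional as F

def group_relative_rewards(align_quality: torch.Tensor) -> torch.Tensor:
    """align_quality: (G, Q) alignment score for Q queries in each of G groups."""
    mean = align_quality.mean(dim=1, keepdim=True)
    std = align_quality.std(dim=1, keepdim=True).clamp_min(1e-6)
    return (align_quality - mean) / std            # advantage-like reward

def objectness_kl(obj_logits, ref_logits):
    """KL regularizer keeping objectness close to a reference distribution."""
    return F.kl_div(obj_logits.log_softmax(-1), ref_logits.log_softmax(-1),
                    log_target=True, reduction="batchmean")
```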

[187] Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

Mark Endo, Xiaohan Wang, Serena Yeung-Levy

Main category: cs.CV

TL;DR: The paper examines early pruning of visual tokens in Vision-Language Models, revealing flaws in pruning strategies for vision-centric tasks like localization. It proposes FEATHER, a method improving token preservation and achieving significant performance gains.

DetailsMotivation: To address the limitations of current acceleration approaches in Vision-Language Models, particularly their failure in vision-centric tasks due to flawed pruning strategies.

Method: Proposes FEATHER, a multistage pruning approach with early uniform sampling to ensure broad image coverage and better token preservation.

Result: FEATHER achieves over 5x performance improvement on vision-centric localization benchmarks compared to the original acceleration method.

Conclusion: The study highlights benchmark limitations and introduces FEATHER as an effective solution for preserving fine-grained visual capabilities while accelerating models.

Abstract: Recent works on accelerating Vision-Language Models achieve strong performance across a variety of vision-language tasks despite highly compressing visual information. In this work, we examine the popular acceleration approach of early pruning of visual tokens inside the language model. Surprisingly, we find that while strong performance is maintained across many tasks, it exhibits drastically different behavior for a subset of vision-centric tasks such as localization. Upon further investigation, we uncover a core issue with the acceleration approach where most tokens towards the top of the image are pruned away. Yet, on many benchmarks aiming to evaluate vision-centric capabilities, strong performance persists with the flawed pruning strategy, highlighting these benchmarks’ limited ability to assess fine-grained visual capabilities. Based on these findings, we propose FEATHER (Fast and Effective Acceleration wiTH Ensemble cRiteria), a straightforward approach that resolves the discovered early-layer pruning issue and further enhances the preservation of relevant tokens via multistage pruning with early uniform sampling to ensure broad image coverage. With comparable computational savings, we find that FEATHER achieves more than 5x performance improvement on the vision-centric localization benchmarks compared to the original acceleration approach.
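
The fix is essentially an ensemble of pruning criteria. A minimal sketch (assumed scoring; the real method prunes in multiple stages inside the language model):

```python
import torch

def feather_prune(scores: torch.Tensor, keep: int, grid_every: int = 4):
    """scores: (N,) importance per visual token; returns indices to keep."""
    uniform = torch.arange(0, scores.numel(), grid_every)  # broad coverage
    topk = scores.topk(max(keep - uniform.numel(), 0)).indices
    return torch.unique(torch.cat([uniform, topk]))        # dedup overlaps

kept = feather_prune(torch.rand(576), keep=192)  # e.g., a 24x24 ViT token grid
```

The uniform component is what prevents the failure mode the paper identifies, where criterion-only pruning discards nearly all tokens from the top of the image.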

[188] FaceLift: Learning Generalizable Single Image 3D Face Reconstruction from Synthetic Heads

Weijie Lyu, Yi Zhou, Ming-Hsuan Yang, Zhixin Shu

Main category: cs.CV

TL;DR: FaceLift is a feed-forward method for 360-degree 3D head reconstruction from a single image, using a multi-view latent diffusion model and transformer-based reconstructor, outperforming existing methods.

DetailsMotivation: Existing monocular 3D face reconstruction methods lack full view coverage and consistency due to insufficient multi-view supervision.

Method: Combines a multi-view latent diffusion model for generating consistent side/back views and a transformer-based reconstructor for 3D Gaussian splats. Uses synthetic data with a technique to bridge the synthetic-real domain gap.

Result: Outperforms state-of-the-art methods in identity preservation, detail recovery, and rendering quality, generalizing well to real-world images.

Conclusion: FaceLift advances 3D head reconstruction by ensuring view consistency and high-quality results, even with synthetic training data.

Abstract: We present FaceLift, a novel feed-forward approach for generalizable high-quality 360-degree 3D head reconstruction from a single image. Our pipeline first employs a multi-view latent diffusion model to generate consistent side and back views from a single facial input, which then feeds into a transformer-based reconstructor that produces a comprehensive 3D Gaussian splats representation. Previous methods for monocular 3D face reconstruction often lack full view coverage or view consistency due to insufficient multi-view supervision. We address this by creating a high-quality synthetic head dataset that enables consistent supervision across viewpoints. To bridge the domain gap between synthetic training data and real-world images, we propose a simple yet effective technique that ensures the view generation process maintains fidelity to the input by learning to reconstruct the input image alongside the view generation. Despite being trained exclusively on synthetic data, our method demonstrates remarkable generalization to real-world images. Through extensive qualitative and quantitative evaluations, we show that FaceLift outperforms state-of-the-art 3D face reconstruction methods on identity preservation, detail recovery, and rendering quality.

[189] Enhanced Vision-Language Models for Diverse Sensor Understanding: Cost-Efficient Optimization and Benchmarking

Sangyun Chung, Youngjoon Yu, Se Yeon Kim, Youngchae Chee, Yong Man Ro

Main category: cs.CV

TL;DR: The paper introduces SAFT with DNA optimization to improve VLMs’ understanding of non-RGB sensor images without modifying existing architectures, validated by the new VS-TDX benchmark.

DetailsMotivation: Current VLMs struggle with understanding non-RGB sensor images due to RGB-centric biases, limiting their real-world applicability.

Method: Proposes Sensor-Aware Attributes Fine-Tuning (SAFT) with Diverse Negative Attributes (DNA) optimization, using minimal sensor-specific data.

Result: SAFT with DNA consistently delivers superior performance and generalization under resource-constrained, architecture-invariant settings, validated across diverse sensor modalities.

Conclusion: The method advances scalable VLM deployment in sensor-diverse environments without extensive data or architectural changes.

Abstract: Large-scale Vision-Language Models (VLMs) have achieved notable progress in aligning visual inputs with text. However, their ability to deeply understand the unique physical properties of non-RGB vision sensor images remains limited. In this paper, we revisit and analyze these limitations and introduce a novel, cost-efficient paradigm that significantly advances sensor image understanding, without requiring extensive training data or any modifications to the existing VLM architectures. Specifically, we propose Sensor-Aware Attributes Fine-Tuning (SAFT) with the Diverse Negative Attributes (DNA) optimization, which leverages minimal sensor-specific data to enable robust learning of non-RGB characteristics and overcome RGB-centric biases inherent in current VLMs. In addition, we present VS-TDX, the first comprehensive, public benchmark designed to rigorously evaluate VLMs’ sensor-specific understanding across diverse and realistic scenarios. Through extensive experiments on VLMs and various sensor modalities, we validate that our method consistently delivers superior performance and generalization under resource-constrained and architecture-invariant settings. Our approach provides a practical advance towards scalable deployment of VLMs in increasingly sensor-diverse real-world environments.

[190] GameFactory: Creating New Games with Generative Interactive Videos

Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu

Main category: cs.CV

TL;DR: GameFactory is a framework for generating action-controlled, scene-generalizable game videos, leveraging open-domain priors and a multi-phase training strategy to enable diverse game creation.

DetailsMotivation: To revolutionize game development by autonomously generating new content, addressing challenges like action controllability and scene-generalizability.

Method: Introduces GF-Minecraft dataset, an action control module, and a multi-phase training strategy with a domain adapter to decouple game style learning from action control.

Result: Effectively generates open-domain action-controllable game videos, advancing AI-driven game generation.

Conclusion: GameFactory represents a significant step forward in AI-driven game generation, enabling diverse and interactive video creation.

Abstract: Generative videos have the potential to revolutionize game development by autonomously creating new content. In this paper, we present GameFactory, a framework for action-controlled scene-generalizable game video generation. We first address the fundamental challenge of action controllability by introducing GF-Minecraft, an action-annotated game video dataset without human bias, and developing an action control module that enables precise control over both keyboard and mouse inputs. We further extend to support autoregressive generation for unlimited-length interactive videos. More importantly, GameFactory tackles the critical challenge of scene-generalizable action control, which most existing methods fail to address. To enable the creation of entirely new and diverse games beyond fixed styles and scenes, we leverage the open-domain generative priors from pre-trained video diffusion models. To bridge the domain gap between open-domain priors and small-scale game datasets, we propose a multi-phase training strategy with a domain adapter that decouples game style learning from action control. This decoupling ensures that action control learning is no longer bound to specific game styles, thereby achieving scene-generalizable action control. Experimental results demonstrate that GameFactory effectively generates open-domain action-controllable game videos, representing a significant step forward in AI-driven game generation.

[191] MR-CLIP: Efficient Metadata-Guided Learning of MRI Contrast Representations

Mehmet Yigit Avci, Pedro Borges, Paul Wright, Mehmet Yigitsoy, Sebastien Ourselin, Jorge Cardoso

Main category: cs.CV

TL;DR: MR-CLIP is a contrastive learning framework that aligns MRI images with DICOM metadata to learn contrast-aware representations without manual labels, improving tasks like cross-modal retrieval and contrast classification.

DetailsMotivation: The lack of reliable and standardized metadata in MRI scans complicates image interpretation and clinical workflows, necessitating robust contrast-aware representations.

Method: Proposes MR-CLIP, a multimodal contrastive learning framework that aligns MR images with DICOM metadata, trained on diverse clinical datasets.

Result: MR-CLIP effectively captures contrast variations, enabling anatomy-invariant representations and excelling in cross-modal retrieval and contrast classification.

Conclusion: MR-CLIP offers a scalable solution for contrast-aware representations, with potential for broader clinical applications.

Abstract: Accurate interpretation of Magnetic Resonance Imaging scans in clinical systems is based on a precise understanding of image contrast. This contrast is primarily governed by acquisition parameters, such as echo time and repetition time, which are stored in the DICOM metadata. To simplify contrast identification, broad labels such as T1-weighted or T2-weighted are commonly used, but these offer only a coarse approximation of the underlying acquisition settings. In many real-world datasets, such labels are entirely missing, leaving raw acquisition parameters as the only indicators of contrast. Adding to this challenge, the available metadata is often incomplete, noisy, or inconsistent. The lack of reliable and standardized metadata complicates tasks such as image interpretation, retrieval, and integration into clinical workflows. Furthermore, robust contrast-aware representations are essential to enable more advanced clinical applications, such as achieving modality-invariant representations and data harmonization. To address these challenges, we propose MR-CLIP, a multimodal contrastive learning framework that aligns MR images with their DICOM metadata to learn contrast-aware representations, without relying on manual labels. Trained on a diverse clinical dataset that spans various scanners and protocols, MR-CLIP captures contrast variations across acquisitions and within scans, enabling anatomy-invariant representations. We demonstrate its effectiveness in cross-modal retrieval and contrast classification, highlighting its scalability and potential for further clinical applications. The code and weights are publicly available at https://github.com/myigitavci/MR-CLIP.
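
The alignment objective is CLIP-like, with serialized DICOM parameters standing in for captions. A minimal sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def mr_clip_loss(img_emb, meta_emb, tau=0.07):
    """img_emb: (B, D) MR image embeddings; meta_emb: (B, D) embeddings of
    serialized DICOM metadata, e.g. text like "TE=90ms TR=4000ms"."""
    img = F.normalize(img_emb, dim=-1)
    meta = F.normalize(meta_emb, dim=-1)
    logits = img @ meta.T / tau                    # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```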

[192] Multi-Cali Anything: Dense Feature Multi-Frame Structure-from-Motion for Large-Scale Camera Array Calibration

Jinjiang You, Hewei Wang, Yijie Li, Mingxiao Huo, Long Van Tran Ha, Mingyuan Ma, Jinfeng Xu, Jiayi Zhang, Puzhen Wu, Shubham Garg, Wei Pu

Main category: cs.CV

TL;DR: A method for calibrating large-scale camera arrays without dedicated captures, using dense features and multi-frame optimization.

DetailsMotivation: Traditional calibration for large camera arrays is time-consuming and requires known patterns. Intrinsics often vary across sessions, necessitating a flexible solution.

Method: Proposes a dense-feature-driven multi-frame calibration method with extrinsics regularization, dense feature reprojection, and intrinsics variance terms for joint optimization.

Result: Achieves precision comparable to dedicated calibration, improving intrinsics and 3D reconstruction accuracy.

Conclusion: The method is efficient, plug-and-play, and compatible with existing SfM pipelines, offering a practical solution for large-scale setups.

Abstract: Calibrating large-scale camera arrays, such as those in dome-based setups, is time-intensive and typically requires dedicated captures of known patterns. While extrinsics in such arrays are fixed due to the physical setup, intrinsics often vary across sessions due to factors like lens adjustments or temperature changes. In this paper, we propose a dense-feature-driven multi-frame calibration method that refines intrinsics directly from scene data, eliminating the necessity for additional calibration captures. Our approach enhances traditional Structure-from-Motion (SfM) pipelines by introducing an extrinsics regularization term to progressively align estimated extrinsics with ground-truth values, a dense feature reprojection term to reduce keypoint errors by minimizing reprojection loss in the feature space, and an intrinsics variance term for joint optimization across multiple frames. Experiments on the Multiface dataset show that our method achieves nearly the same precision as dedicated calibration processes, and significantly enhances intrinsics and 3D reconstruction accuracy. Fully compatible with existing SfM pipelines, our method provides an efficient and practical plug-and-play solution for large-scale camera setups. Our code is publicly available at: https://github.com/YJJfish/Multi-Cali-Anything
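
The three added terms combine into one scalar objective. A minimal sketch with assumed tensor layouts (the actual pipeline optimizes these jointly inside SfM bundle adjustment):

```python
import torch

def multi_cali_loss(feat_residuals, ext_est, ext_gt, intrinsics,
                    w_ext=1.0, w_var=0.1):
    """feat_residuals: dense-feature reprojection residuals; ext_est, ext_gt:
    (N_cam, 6) poses; intrinsics: (N_frames, 4) per-frame [fx, fy, cx, cy]."""
    dense_term = feat_residuals.square().mean()       # feature-space reproj
    ext_term = (ext_est - ext_gt).square().mean()     # extrinsics prior
    var_term = intrinsics.var(dim=0).sum()            # agree across frames
    return dense_term + w_ext * ext_term + w_var * var_term
```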

[193] Retinex-MEF: Retinex-based Glare Effects Aware Unsupervised Multi-Exposure Image Fusion

Haowen Bai, Jiangshe Zhang, Zixiang Zhao, Lilun Deng, Yukun Cui, Shuang Xu

Main category: cs.CV

TL;DR: Retinex-MEF is an unsupervised method for multi-exposure image fusion, addressing glare effects and enabling controllable exposure adjustments.

DetailsMotivation: Conventional Retinex-based methods inadequately model glare from overexposure, limiting fusion quality.

Method: Decomposes images into illumination and shared reflectance, uses bidirectional loss for glare mitigation, and introduces controllable fusion criteria.

Result: Effective decomposition and flexible fusion demonstrated across diverse datasets.

Conclusion: Retinex-MEF improves fusion quality by addressing glare and enabling exposure control.

Abstract: Multi-exposure image fusion (MEF) synthesizes multiple, differently exposed images of the same scene into a single, well-exposed composite. Retinex theory, which separates image illumination from scene reflectance, provides a natural framework to ensure consistent scene representation and effective information fusion across varied exposure levels. However, the conventional pixel-wise multiplication of illumination and reflectance inadequately models the glare effect induced by overexposure. To address this limitation, we introduce an unsupervised and controllable method termed Retinex-MEF. Specifically, our method decomposes multi-exposure images into separate illumination components with a shared reflectance component, and effectively models the glare induced by overexposure. The shared reflectance is learned via a bidirectional loss, which enables our approach to effectively mitigate the glare effect. Furthermore, we introduce a controllable exposure fusion criterion, enabling global exposure adjustments while preserving contrast, thus overcoming the constraints of a fixed exposure level. Extensive experiments on diverse datasets, including underexposure-overexposure fusion, exposure controlled fusion, and homogeneous extreme exposure fusion, demonstrate the effective decomposition and flexible fusion capability of our model. The code is available at https://github.com/HaowenBai/Retinex-MEF
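
The glare-aware decomposition can be written as one reconstruction constraint. A minimal sketch assuming the model I_k = L_k · R + G_k (per-exposure illumination, shared reflectance, glare), with an illustrative bidirectional term:

```python
import torch

def retinex_mef_loss(images, illum, reflectance, glare):
    """images, illum, glare: (K, C, H, W) for K exposures;
    reflectance: (C, H, W), shared across exposures."""
    recon = illum * reflectance.unsqueeze(0) + glare     # glare-aware model
    recon_loss = (recon - images).abs().mean()
    # Bidirectional: reflectance recovered from each exposure should match
    # the shared one (clamp guards against division blow-ups in dark pixels).
    per_img_R = (images - glare) / illum.clamp_min(1e-3)
    bidir_loss = (per_img_R - reflectance.unsqueeze(0)).abs().mean()
    return recon_loss + bidir_loss
```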

[194] Sign Spotting Disambiguation using Large Language Models

JianHe Low, Ozge Mercanoglu Sincan, Richard Bowden

Main category: cs.CV

TL;DR: A training-free framework using LLMs improves sign spotting by integrating global features and context-aware disambiguation, outperforming traditional methods.

DetailsMotivation: Addressing data scarcity and vocabulary inflexibility in sign language translation by automating sign spotting.

Method: Extracts spatio-temporal and hand shape features, matches them to a sign dictionary using dynamic time warping and cosine similarity, and employs an LLM for gloss disambiguation.

Result: Superior accuracy and sentence fluency on synthetic and real-world datasets compared to traditional approaches.

Conclusion: LLMs enhance sign spotting without training, offering flexibility and improved performance.

Abstract: Sign spotting, the task of identifying and localizing individual signs within continuous sign language video, plays a pivotal role in scaling dataset annotations and addressing the severe data scarcity issue in sign language translation. While automatic sign spotting holds great promise for enabling frame-level supervision at scale, it grapples with challenges such as vocabulary inflexibility and ambiguity inherent in continuous sign streams. Hence, we introduce a novel, training-free framework that integrates Large Language Models (LLMs) to significantly enhance sign spotting quality. Our approach extracts global spatio-temporal and hand shape features, which are then matched against a large-scale sign dictionary using dynamic time warping and cosine similarity. This dictionary-based matching inherently offers superior vocabulary flexibility without requiring model retraining. To mitigate noise and ambiguity from the matching process, an LLM performs context-aware gloss disambiguation via beam search, notably without fine-tuning. Extensive experiments on both synthetic and real-world sign language datasets demonstrate our method’s superior accuracy and sentence fluency compared to traditional approaches, highlighting the potential of LLMs in advancing sign spotting.
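
The dictionary-matching core is classic DTW over cosine distances. A minimal, self-contained sketch (a quadratic-time reference implementation, not the paper’s optimized pipeline):

```python
import numpy as np

def dtw_cosine(query: np.ndarray, template: np.ndarray) -> float:
    """query: (T1, D), template: (T2, D) feature sequences; lower is better."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    t = template / np.linalg.norm(template, axis=1, keepdims=True)
    cost = 1.0 - q @ t.T                              # cosine distance matrix
    D = np.full((len(q) + 1, len(t) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(q) + 1):                    # standard DTW recursion
        for j in range(1, len(t) + 1):
            D[i, j] = cost[i - 1, j - 1] + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1] / (len(q) + len(t))              # length-normalized
```

Scoring a clip against every dictionary entry yields a ranked gloss list, whose top candidates are then handed to the LLM for context-aware disambiguation.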

[195] Towards a Unified Copernicus Foundation Model for Earth Vision

Yi Wang, Zhitong Xiong, Chenying Liu, Adam J. Stewart, Thomas Dujardin, Nikolaos Ioannis Bountos, Angelos Zavras, Franziska Gerken, Ioannis Papoutsis, Laura Leal-Taixé, Xiao Xiang Zhu

Main category: cs.CV

TL;DR: The paper introduces Copernicus-Pretrain, Copernicus-FM, and Copernicus-Bench to enhance Earth observation foundation models by integrating diverse sensor data and metadata, improving scalability and versatility.

DetailsMotivation: Existing Earth observation models are limited to fixed spectral sensors and overlook metadata, restricting their potential for broader applications.

Method: The work proposes a pretraining dataset (Copernicus-Pretrain), a unified foundation model (Copernicus-FM) with dynamic hypernetworks, and a benchmark (Copernicus-Bench) for evaluation.

Result: The approach improves scalability, versatility, and multimodal adaptability, bridging Earth observation, weather, and climate research.

Conclusion: The proposed components advance EO foundation models, offering new opportunities for interdisciplinary research and applications.

Abstract: Advances in Earth observation (EO) foundation models have unlocked the potential of big satellite data to learn generic representations from space, benefiting a wide range of downstream applications crucial to our planet. However, most existing efforts remain limited to fixed spectral sensors, focus solely on the Earth’s surface, and overlook valuable metadata beyond imagery. In this work, we take a step towards next-generation EO foundation models with three key components: 1) Copernicus-Pretrain, a massive-scale pretraining dataset that integrates 18.7M aligned images from all major Copernicus Sentinel missions, spanning from the Earth’s surface to its atmosphere; 2) Copernicus-FM, a unified foundation model capable of processing any spectral or non-spectral sensor modality using extended dynamic hypernetworks and flexible metadata encoding; and 3) Copernicus-Bench, a systematic evaluation benchmark with 15 hierarchical downstream tasks ranging from preprocessing to specialized applications for each Sentinel mission. Our dataset, model, and benchmark greatly improve the scalability, versatility, and multimodal adaptability of EO foundation models, while also creating new opportunities to connect EO, weather, and climate research. Codes, datasets and models are available at https://github.com/zhu-xlab/Copernicus-FM.

[196] TerraMesh: A Planetary Mosaic of Multimodal Earth Observation Data

Benedikt Blumenstiel, Paolo Fraccaro, Valerio Marsocci, Johannes Jakubik, Stefano Maurogiovanni, Mikolaj Czerkawski, Rocco Sedona, Gabriele Cavallaro, Thomas Brunschwiler, Juan Bernabe-Moreno, Nicolas Longépé

Main category: cs.CV

TL;DR: TerraMesh is a globally diverse, multimodal dataset for Earth Observation, combining eight aligned modalities to improve large-scale pre-training of foundation models.

DetailsMotivation: Existing public datasets lack scale, geographic coverage, or sensor variety, limiting label-efficient representation learning.

Method: Introduces TerraMesh, a dataset with over 9 million samples of optical, SAR, elevation, and land-cover data in an Analysis-Ready format.

Result: Empirical evidence shows improved model performance when pre-trained on TerraMesh.

Conclusion: TerraMesh addresses dataset limitations and enhances foundation model training for Earth Observation.

Abstract: Large-scale foundation models in Earth Observation can learn versatile, label-efficient representations by leveraging massive amounts of unlabeled data. However, existing public datasets are often limited in scale, geographic coverage, or sensor variety. We introduce TerraMesh, a new globally diverse, multimodal dataset combining optical, synthetic aperture radar, elevation, and land-cover modalities in an Analysis-Ready Data format. TerraMesh includes over 9 million samples with eight spatiotemporally aligned modalities, enabling large-scale pre-training. We provide detailed data processing steps, comprehensive statistics, and empirical evidence demonstrating improved model performance when pre-trained on TerraMesh. The dataset is hosted at https://huggingface.co/datasets/ibm-esa-geospatial/TerraMesh.

[197] DCT-Shield: A Robust Frequency Domain Defense against Malicious Image Editing

Aniruddha Bala, Rohit Chowdhury, Rohan Jaiswal, Siddharth Roheda

Main category: cs.CV

TL;DR: A novel method introduces adversarial perturbations in the frequency domain (DCT coefficients) to protect images from malicious edits by diffusion models, offering robustness and fewer visual artifacts.

DetailsMotivation: Concerns about image security due to easy exploitation of diffusion models for malicious edits, with existing defenses being noticeable and non-robust.

Method: Optimization approach modifying DCT coefficients in the frequency domain, leveraging the JPEG pipeline.

Result: Effective protection against edits with fewer visual artifacts and robustness to noise purification.

Conclusion: The proposed method outperforms prior defenses in balancing edit protection and visual quality.

Abstract: Advancements in diffusion models have enabled effortless image editing via text prompts, raising concerns about image security. Attackers with access to user images can exploit these tools for malicious edits. Recent defenses attempt to protect images by adding a limited noise in the pixel space to disrupt the functioning of diffusion-based editing models. However, the adversarial noise added by previous methods is easily noticeable to the human eye. Moreover, most of these methods are not robust to purification techniques like JPEG compression under a feasible pixel budget. We propose a novel optimization approach that introduces adversarial perturbations directly in the frequency domain by modifying the Discrete Cosine Transform (DCT) coefficients of the input image. By leveraging the JPEG pipeline, our method generates adversarial images that effectively prevent malicious image editing. Extensive experiments across a variety of tasks and datasets demonstrate that our approach introduces fewer visual artifacts while maintaining similar levels of edit protection and robustness to noise purification techniques.
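
Operating on DCT coefficients mirrors the JPEG pipeline directly. A minimal sketch of applying a precomputed frequency-domain perturbation over 8x8 blocks; delta is assumed to come from the optimization the paper describes:

```python
import numpy as np
from scipy.fft import dctn, idctn

def apply_dct_perturbation(img: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """img: (H, W) grayscale with H, W multiples of 8; delta: same shape,
    an optimized perturbation of the per-block DCT coefficients."""
    out = np.empty_like(img, dtype=np.float64)
    for i in range(0, img.shape[0], 8):               # JPEG-style 8x8 blocks
        for j in range(0, img.shape[1], 8):
            block = img[i:i + 8, j:j + 8].astype(np.float64)
            coef = dctn(block, norm="ortho") + delta[i:i + 8, j:j + 8]
            out[i:i + 8, j:j + 8] = idctn(coef, norm="ortho")
    return out.clip(0, 255)
```

Because the perturbation lives in the same domain JPEG quantizes, it survives compression far better than pixel-space noise.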

[198] Core-Set Selection for Data-efficient Land Cover Segmentation

Keiller Nogueira, Akram Zaytar, Wanli Ma, Ribana Roscher, Ronny Hänsch, Caleb Robinson, Anthony Ortiz, Simone Nsutezo, Rahul Dodhia, Juan M. Lavista Ferres, Oktay Karakuş, Paul L. Rosin

Main category: cs.CV

TL;DR: The paper proposes six core-set selection methods for remote sensing image segmentation, showing that training on selected subsets can outperform random baselines and sometimes even full datasets.

DetailsMotivation: The increasing availability of remotely sensed data and the need for efficient, high-quality training datasets drive the development of data-centric learning methods.

Method: Six novel core-set selection methods are introduced, focusing on imagery, labels, or both, and benchmarked against random selection on three land cover classification datasets.

Result: Subset training outperforms random baselines and sometimes full datasets, highlighting the value of data-centric approaches.

Conclusion: Data-centric learning, emphasizing quality over quantity, is crucial for remote sensing tasks.

Abstract: The increasing accessibility of remotely sensed data and the potential of such data to inform large-scale decision-making has driven the development of deep learning models for many Earth Observation tasks. Traditionally, such models must be trained on large datasets. However, the common assumption that broadly larger datasets lead to better outcomes tends to overlook the complexities of the data distribution, the potential for introducing biases and noise, and the computational resources required for processing and storing vast datasets. Therefore, effective solutions should consider both the quantity and quality of data. In this paper, we propose six novel core-set selection methods for selecting important subsets of samples from remote sensing image segmentation datasets that rely on imagery only, labels only, and a combination of each. We benchmark these approaches against a random-selection baseline on three commonly used land cover classification datasets: DFC2022, Vaihingen, and Potsdam. In each of the datasets, we demonstrate that training on a subset of samples outperforms the random baseline, and some approaches outperform training on all available data. This result shows the importance and potential of data-centric learning for the remote sensing domain. The code is available at https://github.com/keillernogueira/data-centric-rs-classification/.
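
Core-set selection has simple classical instances. As a point of reference, here is k-center greedy over per-image embeddings (one standard imagery-only strategy; the paper’s six methods additionally exploit labels or both modalities):

```python
import numpy as np

def k_center_greedy(feats: np.ndarray, k: int, seed: int = 0) -> list:
    """feats: (N, D) image embeddings; returns indices of k diverse samples."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(feats)))]
    dists = np.linalg.norm(feats - feats[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())                 # farthest from selected set
        selected.append(nxt)
        dists = np.minimum(dists,
                           np.linalg.norm(feats - feats[nxt], axis=1))
    return selected
```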

[199] GUAVA: Generalizable Upper Body 3D Gaussian Avatar

Dongbin Zhang, Yunfei Liu, Lijian Lin, Ye Zhu, Yang Li, Minghan Qin, Yu Li, Haoqian Wang

Main category: cs.CV

TL;DR: GUAVA is a framework for fast animatable upper-body 3D Gaussian avatar reconstruction from a single image, outperforming previous methods in quality and speed.

DetailsMotivation: Existing methods for 3D human avatar reconstruction are complex, time-consuming, and limited in facial expressiveness.

Method: Introduces an expressive human model (EHM) and uses inverse texture mapping and projection sampling to infer upper-body Gaussians from a single image, refined by a neural refiner.

Result: GUAVA achieves sub-second reconstruction (0.1s), high rendering quality, and real-time animation support.

Conclusion: GUAVA addresses limitations of prior methods, offering fast, high-quality, and expressive 3D avatar reconstruction.

Abstract: Reconstructing a high-quality, animatable 3D human avatar with expressive facial and hand motions from a single image has gained significant attention due to its broad application potential. 3D human avatar reconstruction typically requires multi-view or monocular videos and training on individual IDs, which is both complex and time-consuming. Furthermore, limited by SMPLX’s expressiveness, these methods often focus on body motion but struggle with facial expressions. To address these challenges, we first introduce an expressive human model (EHM) to enhance facial expression capabilities and develop an accurate tracking method. Based on this template model, we propose GUAVA, the first framework for fast animatable upper-body 3D Gaussian avatar reconstruction. We leverage inverse texture mapping and projection sampling techniques to infer Ubody (upper-body) Gaussians from a single image. The rendered images are refined through a neural refiner. Experimental results demonstrate that GUAVA significantly outperforms previous methods in rendering quality and offers significant speed improvements, with reconstruction times in the sub-second range (0.1s), and supports real-time animation and rendering.

[200] ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment

Wanjiang Weng, Xiaofeng Tan, Hongsong Wang, Pan Zhou

Main category: cs.CV

TL;DR: BiHumanML3D dataset and BiMD model address bilingual text-to-motion generation challenges, with ReAlign improving alignment and quality.

DetailsMotivation: Bilingual text-to-motion generation lacks datasets and suffers from text-motion misalignment in diffusion models.

Method: Proposes BiHumanML3D dataset, BiMD model, and ReAlign method with a reward-guided strategy for alignment.

Result: Significant improvement in text-motion alignment and motion quality over existing methods.

Conclusion: The approach effectively tackles bilingual motion generation challenges, enhancing semantic consistency and realism.

Abstract: Bilingual text-to-motion generation, which synthesizes 3D human motions from bilingual text inputs, holds immense potential for cross-linguistic applications in gaming, film, and robotics. However, this task faces critical challenges: the absence of bilingual motion-language datasets and the misalignment between text and motion distributions in diffusion models, leading to semantically inconsistent or low-quality motions. To address these challenges, we propose BiHumanML3D, a novel bilingual human motion dataset, which establishes a crucial benchmark for bilingual text-to-motion generation models. Furthermore, we propose a Bilingual Motion Diffusion model (BiMD), which leverages cross-lingual aligned representations to capture semantics, thereby achieving a unified bilingual model. Building upon this, we propose Reward-guided sampling Alignment (ReAlign) method, comprising a step-aware reward model to assess alignment quality during sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency and a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Experiments demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods. Project page: https://wengwanjiang.github.io/ReAlign-page/.
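
For intuition about reward-guided sampling in general, the sketch below nudges each reverse-diffusion step along the gradient of a step-aware reward, in the style of classifier guidance. The `denoiser` and `reward_model` interfaces and `guidance_scale` are assumptions; ReAlign's exact guidance rule is not reproduced here.

```python
# Hedged sketch of reward-guided diffusion sampling: shift the denoised
# sample toward higher step-aware reward. Assumes `denoiser` and
# `reward_model` are differentiable torch callables; not ReAlign's API.
import torch

def guided_step(x_t, t, denoiser, reward_model, guidance_scale=0.1):
    x_t = x_t.detach().requires_grad_(True)
    x_prev = denoiser(x_t, t)              # ordinary reverse-diffusion step
    r = reward_model(x_prev, t).sum()      # step-aware alignment reward
    grad = torch.autograd.grad(r, x_t)[0]
    return (x_prev + guidance_scale * grad).detach()  # move toward alignment
```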

[201] SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation

Edoardo Bianchi, Antonio Liotta

Main category: cs.CV

TL;DR: SkillFormer is a parameter-efficient architecture for multi-view skill assessment, outperforming baselines with fewer parameters and training epochs.

DetailsMotivation: Assessing human skill levels in complex activities is challenging but crucial for sports, rehabilitation, and training.

Method: SkillFormer uses a CrossViewFusion module with multi-head cross-attention, learnable gating, and adaptive self-calibration, fine-tuned via Low-Rank Adaptation.

Result: Achieves state-of-the-art accuracy on the EgoExo4D dataset with 4.5x fewer parameters and 3.75x fewer training epochs.

Conclusion: Multi-view integration in SkillFormer enhances fine-grained skill assessment, proving its efficiency and effectiveness.

Abstract: Assessing human skill levels in complex activities is a challenging problem with applications in sports, rehabilitation, and training. In this work, we present SkillFormer, a parameter-efficient architecture for unified multi-view proficiency estimation from egocentric and exocentric videos. Building on the TimeSformer backbone, SkillFormer introduces a CrossViewFusion module that fuses view-specific features using multi-head cross-attention, learnable gating, and adaptive self-calibration. We leverage Low-Rank Adaptation to fine-tune only a small subset of parameters, significantly reducing training costs. In fact, when evaluated on the EgoExo4D dataset, SkillFormer achieves state-of-the-art accuracy in multi-view settings while demonstrating remarkable computational efficiency, using 4.5x fewer parameters and requiring 3.75x fewer training epochs than prior baselines. It excels in multiple structured tasks, confirming the value of multi-view integration for fine-grained skill assessment.
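
A hedged PyTorch sketch of what a cross-view fusion step with multi-head cross-attention and learnable gating could look like follows; the module name, dimensions, and gating form are assumptions, and the real CrossViewFusion additionally performs adaptive self-calibration.

```python
# Sketch of cross-view fusion: one view's tokens attend to the other
# view's tokens, and a learnable gate blends the attended features back.
# Dimensions and gating form are assumptions, not SkillFormer's exact design.
import torch
import torch.nn as nn

class CrossViewFusionSketch(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, ego: torch.Tensor, exo: torch.Tensor) -> torch.Tensor:
        # ego, exo: (batch, tokens, dim) view-specific features
        attended, _ = self.attn(query=ego, key=exo, value=exo)
        g = self.gate(torch.cat([ego, attended], dim=-1))  # learnable gating
        return g * attended + (1 - g) * ego
```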

[202] Spatial-Temporal-Spectral Unified Modeling for Remote Sensing Dense Prediction

Sijie Zhao, Feng Liu, Enzhuo Zhang, Yiqing Guo, Pengfeng Xiao, Lei Bai, Xueliang Zhang, Hao Chen

Main category: cs.CV

TL;DR: STSUN is a unified deep learning model for remote sensing that adapts to varied spatial, temporal, and spectral inputs, unifies multiple dense prediction tasks, and supports flexible semantic class predictions without retraining.

DetailsMotivation: Current deep learning models for remote sensing are rigid, task-specific, and require costly retraining for new classes, limiting adaptability to real-world data heterogeneity.

Method: STSUN leverages metadata for unified representation, uses trainable task embeddings to unify tasks, and integrates category embeddings for flexible semantic predictions.

Result: STSUN achieves state-of-the-art performance across diverse datasets and scenarios, adapting to heterogeneous inputs and outputs.

Conclusion: STSUN offers a robust, generalizable solution for complex remote sensing applications by unifying tasks and enabling flexible predictions.

Abstract: The proliferation of multi-source remote sensing data has propelled the development of deep learning for dense prediction, yet significant challenges in data and task unification persist. Current deep learning architectures for remote sensing are fundamentally rigid. They are engineered for fixed input-output configurations, restricting their adaptability to the heterogeneous spatial, temporal, and spectral dimensions inherent in real-world data. Furthermore, these models neglect the intrinsic correlations among semantic segmentation, binary change detection, and semantic change detection, necessitating the development of distinct models or task-specific decoders. This paradigm is also constrained to a predefined set of output semantic classes, where any change to the classes requires costly retraining. To overcome these limitations, we introduce the Spatial-Temporal-Spectral Unified Network (STSUN) for unified modeling. STSUN can adapt to input and output data with arbitrary spatial sizes, temporal lengths, and spectral bands by leveraging their metadata for a unified representation. Moreover, STSUN unifies disparate dense prediction tasks within a single architecture by conditioning the model on trainable task embeddings. Similarly, STSUN facilitates flexible prediction across multiple sets of semantic categories by integrating trainable category embeddings as metadata. Extensive experiments on multiple datasets with diverse Spatial-Temporal-Spectral configurations in multiple scenarios demonstrate that a single STSUN model effectively adapts to heterogeneous inputs and outputs, unifying various dense prediction tasks and diverse semantic class predictions. The proposed approach consistently achieves state-of-the-art performance, highlighting its robustness and generalizability for complex remote sensing applications.

[203] From Press to Pixels: Evolving Urdu Text Recognition

Samee Arif, Sualeha Farid

Main category: cs.CV

TL;DR: The paper presents an end-to-end Urdu OCR pipeline for newspapers, addressing layout, resolution, and script challenges. It includes segmentation, super-resolution, and text recognition modules, achieving high precision and improved accuracy with modern LLMs.

DetailsMotivation: To address the challenges of OCR for Urdu newspapers, such as complex layouts, low-resolution scans, and Nastaliq script variability.

Method: A four-module pipeline: article segmentation (YOLOv11x), image super-resolution (SwinIR), column segmentation, and text recognition (LLMs like Gemini-2.5-Pro).

Result: High segmentation precision (0.963-0.970), 25-70% accuracy boost from super-resolution, and Gemini-2.5-Pro achieving a WER of 0.133. Fine-tuning on 500 samples improves WER by 6.13%.

Conclusion: The pipeline effectively handles Urdu OCR challenges, with LLMs showing strong adaptability and performance, especially when fine-tuned.

Abstract: This paper introduces an end-to-end pipeline for Optical Character Recognition (OCR) on Urdu newspapers, addressing challenges posed by complex multi-column layouts, low-resolution scans, and the stylistic variability of the Nastaliq script. Our system comprises four modules: (1) article segmentation, (2) image super-resolution, (3) column segmentation, and (4) text recognition. We fine-tune YOLOv11x for segmentation, achieving 0.963 precision for articles and 0.970 for columns. A SwinIR-based super-resolution model boosts LLM text recognition accuracy by 25-70%. We also introduce the Urdu Newspaper Benchmark (UNB), a manually annotated dataset for Urdu OCR. Using UNB and the OpenITI corpus, we compare traditional CNN+RNN-based OCR models with modern LLMs. Gemini-2.5-Pro achieves the best performance with a WER of 0.133. We further analyze LLM outputs via insertion, deletion, and substitution error breakdowns, as well as character-level confusion analysis. Finally, we show that fine-tuning on just 500 samples yields a 6.13% WER improvement, highlighting the adaptability of LLMs for Urdu OCR.
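
For reference, the word error rate (WER) reported above is the word-level edit distance normalized by the reference length; a standard implementation follows.

```python
# Word error rate (WER) via edit distance over word tokens -- the
# standard definition of the metric cited above, shown for completeness.
def wer(reference: str, hypothesis: str) -> float:
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)
```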

[204] Lossless Token Merging Even Without Fine-Tuning in Vision Transformers

Jaeyeon Lee, Dong-Wan Choi

Main category: cs.CV

TL;DR: ATM is a novel token compression method for ViTs that ensures lossless merging without fine-tuning, outperforming existing methods.

DetailsMotivation: ViTs' large sizes cause computational overhead, and current token compression techniques suffer from information loss and require additional training.

Method: ATM adaptively merges tokens using layer-specific similarity thresholds and a novel token matching technique to minimize information loss.

Result: ATM reduces FLOPs by over 30% for DeiT-T and DeiT-S models without accuracy drop, surpassing training-free and some training-intensive methods.

Conclusion: ATM provides an efficient, training-free solution for token compression in ViTs, maintaining performance while reducing computational costs.

Abstract: Although Vision Transformers (ViTs) have become the standard architecture in computer vision, their massive sizes lead to significant computational overhead. Token compression techniques have attracted considerable attention to address this issue, but they often suffer from severe information loss, requiring extensive additional training to achieve practical performance. In this paper, we propose Adaptive Token Merging (ATM), a novel method that ensures lossless token merging, eliminating the need for fine-tuning while maintaining competitive performance. ATM adaptively reduces tokens across layers and batches by carefully adjusting layer-specific similarity thresholds, thereby preventing the undesirable merging of less similar tokens with respect to each layer. Furthermore, ATM introduces a novel token matching technique that considers not only similarity but also merging sizes, particularly for the final layers, to minimize the information loss incurred from each merging operation. We empirically validate our method across a wide range of pretrained models, demonstrating that ATM not only outperforms all existing training-free methods but also surpasses most training-intensive approaches, even without additional training. Remarkably, training-free ATM achieves over a 30% reduction in FLOPs for the DeiT-T and DeiT-S models without any drop in their original accuracy.
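
A toy sketch of threshold-gated, size-aware token merging follows: tokens are paired by cosine similarity, merged only above a threshold tau, and averaged with weights equal to how many original tokens each already represents. The greedy pairing here is a simplification of ATM's actual matching.

```python
# Toy threshold-gated token merging: merge only sufficiently similar
# pairs, weighting by "merging size" (how many originals each token
# represents). ATM's real matching is more careful, especially in the
# final layers; this is illustrative only.
import torch

def merge_tokens(x: torch.Tensor, sizes: torch.Tensor, tau: float):
    # x: (n, dim) tokens; sizes: (n,) count of originals behind each token
    sim = torch.nn.functional.cosine_similarity(
        x.unsqueeze(1), x.unsqueeze(0), dim=-1
    )
    sim.fill_diagonal_(-1.0)
    keep = torch.ones(len(x), dtype=torch.bool)
    for i in range(len(x)):
        if not keep[i]:
            continue
        j = int(sim[i].argmax())
        if keep[j] and sim[i, j] > tau:   # merge only above the threshold
            w_i, w_j = sizes[i], sizes[j]
            x[i] = (w_i * x[i] + w_j * x[j]) / (w_i + w_j)  # size-weighted mean
            sizes[i] = w_i + w_j
            keep[j] = False
    return x[keep], sizes[keep]
```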

[205] OpenSeg-R: Improving Open-Vocabulary Segmentation via Step-by-Step Visual Reasoning

Zongyan Han, Jiale Cao, Shuo Chen, Tong Wang, Jorma Laaksonen, Rao Muhammad Anwer

Main category: cs.CV

TL;DR: OpenSeg-R introduces step-by-step visual reasoning for open-vocabulary segmentation, improving accuracy and interpretability by leveraging Large Multimodal Models (LMMs).

DetailsMotivation: Existing OVS methods lack explicit reasoning and contextual understanding, making it hard to distinguish similar categories.

Method: OpenSeg-R uses LMMs for hierarchical visual reasoning, generating structured triplets and detailed prompts to guide segmentation.

Result: Outperforms state-of-the-art methods on five benchmark datasets and improves panoptic segmentation metrics.

Conclusion: OpenSeg-R enhances segmentation precision and interpretability, setting a new standard for OVS.

Abstract: Open-Vocabulary Segmentation (OVS) has drawn increasing attention for its capacity to generalize segmentation beyond predefined categories. However, existing methods typically predict segmentation masks with simple forward inference, lacking explicit reasoning and interpretability. This makes it challenging for OVS model to distinguish similar categories in open-world settings due to the lack of contextual understanding and discriminative visual cues. To address this limitation, we propose a step-by-step visual reasoning framework for open-vocabulary segmentation, named OpenSeg-R. The proposed OpenSeg-R leverages Large Multimodal Models (LMMs) to perform hierarchical visual reasoning before segmentation. Specifically, we generate both generic and image-specific reasoning for each image, forming structured triplets that explain the visual reason for objects in a coarse-to-fine manner. Based on these reasoning steps, we can compose detailed description prompts, and feed them to the segmentor to produce more accurate segmentation masks. To the best of our knowledge, OpenSeg-R is the first framework to introduce explicit step-by-step visual reasoning into OVS. Experimental results demonstrate that OpenSeg-R significantly outperforms state-of-the-art methods on open-vocabulary semantic segmentation across five benchmark datasets. Moreover, it achieves consistent gains across all metrics on open-vocabulary panoptic segmentation. Qualitative results further highlight the effectiveness of our reasoning-guided framework in improving both segmentation precision and interpretability. Our code is publicly available at https://github.com/Hanzy1996/OpenSeg-R.

[206] YOLO-FireAD: Efficient Fire Detection via Attention-Guided Inverted Residual Learning and Dual-Pooling Feature Preservation

Weichao Pan, Bohan Xu, Xu Wang, Chengze Lv, Shuoyang Wang, Zhenke Duan, Zhen Tian

Main category: cs.CV

TL;DR: YOLO-FireAD improves fire detection with attention-guided inverted residuals and dual-pooling downscale fusion, achieving higher accuracy and efficiency than YOLOv8n and variants.

DetailsMotivation: Addressing feature extraction limitations and information loss in YOLO-based models for fire detection in dynamic environments.

Method: Introduces Attention-guided Inverted Residual Block (AIR) and Dual Pool Downscale Fusion Block (DPDF) to enhance fire features and preserve multi-scale patterns.

Result: Outperforms YOLOv8n and variants with fewer parameters (1.45M, 51.8% lower) and higher mAP75 (1.3-5.5%).

Conclusion: YOLO-FireAD is efficient and accurate for fire detection, validated on public datasets.

Abstract: Fire detection in dynamic environments faces continuous challenges, including interference from illumination changes and frequent false or missed detections, making it difficult to achieve both efficiency and accuracy. To address the feature extraction limitations and information loss of existing YOLO-based models, this study proposes You Only Look Once for Fire Detection with Attention-guided Inverted Residual and Dual-pooling Downscale Fusion (YOLO-FireAD), with two core innovations: (1) the Attention-guided Inverted Residual Block (AIR) integrates hybrid channel-spatial attention with inverted residuals to adaptively enhance fire features and suppress environmental noise; (2) the Dual Pool Downscale Fusion Block (DPDF) preserves multi-scale fire patterns through learnable fusion of max- and average-pooling outputs, mitigating small-fire detection failures. Extensive evaluation on two public datasets shows the efficient performance of our model. The proposed model keeps both its parameter count (1.45M, 51.8% lower than YOLOv8n) and its computation (4.6 GFLOPs, 43.2% lower than YOLOv8n) low, while its mAP75 exceeds that of the mainstream real-time object detection models YOLOv8n, YOLOv9t, YOLOv10n, YOLO11n, YOLOv12n and other YOLOv8 variants by 1.3-5.5%. For more details, please visit our repository: https://github.com/JEFfersusu/YOLO-FireAD
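
One plausible minimal reading of the dual-pooling idea, sketched in PyTorch below: downsample with both max and average pooling and blend them with a learnable per-channel weight, so peaky small-fire responses and smooth context are both preserved. The actual DPDF block is richer; this is an assumption-laden illustration only.

```python
# Sketch of dual-pooling downscale fusion: learnable per-channel blend
# of max-pooled (sharp flame peaks) and average-pooled (smooth context)
# downsampling. A minimal reading of DPDF, not the published block.
import torch
import torch.nn as nn

class DualPoolDownscaleSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1, channels, 1, 1))  # mix weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mx = nn.functional.max_pool2d(x, 2)   # keeps peaky flame responses
        av = nn.functional.avg_pool2d(x, 2)   # keeps smooth context
        w = torch.sigmoid(self.alpha)
        return w * mx + (1 - w) * av
```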

[207] Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning

Lintao Xu, Yinghao Wang, Chaohui Wang

Main category: cs.CV

TL;DR: MoDOT jointly estimates depth and occlusion boundaries (OBs) from a single image, improving scene understanding via cross-attention and multi-scale features, achieving SOTA results.

DetailsMotivation: Occlusion boundaries (OBs) and depth estimation are interrelated; OBs provide geometric cues for depth, while depth refines occlusion reasoning. Joint estimation can enhance accuracy.

Method: MoDOT introduces CASM (cross-attention and multi-scale strip convolutions) for leveraging mid-level OB features and OBDCL loss for accurate boundary prediction.

Result: MoDOT achieves SOTA on synthetic and NYUD-v2 datasets, with sharp OBs and improved geometric fidelity in depth maps. Cross-domain results are promising.

Conclusion: Joint estimation of depth and OBs is mutually beneficial. MoDOT’s design is effective, outperforming baselines and showing strong cross-domain performance.

Abstract: Occlusion Boundary Estimation (OBE) identifies boundaries arising from both inter-object occlusions and self-occlusion within individual objects, distinguishing them from ordinary edges and semantic contours to support more accurate scene understanding. This task is closely related to Monocular Depth Estimation (MDE), which infers depth from a single image, as Occlusion Boundaries (OBs) provide critical geometric cues for resolving depth ambiguities, while depth can conversely refine occlusion reasoning. In this paper, we propose MoDOT, a novel method that jointly estimates depth and OBs from a single image for the first time. MoDOT incorporates a new module, CASM, which combines cross-attention and multi-scale strip convolutions to leverage mid-level OB features for improved depth prediction. It also includes an occlusion-aware loss, OBDCL, which encourages more accurate boundaries in the predicted depth map. Extensive experiments demonstrate the mutual benefits of jointly estimating depth and OBs, and validate the effectiveness of MoDOT’s design. Our method achieves state-of-the-art (SOTA) performance on two synthetic datasets and the widely used NYUD-v2 real-world dataset, significantly outperforming multi-task baselines. Furthermore, MoDOT’s cross-domain results on real-world depth prediction - obtained with a model trained solely on our synthetic dataset - are promising, preserving sharp OBs in the predicted depth maps and demonstrating improved geometric fidelity compared to competitors. We will release our code, pre-trained models, and dataset at [link].

[208] DONUT: A Decoder-Only Model for Trajectory Prediction

Markus Knoche, Daan de Geus, Bastian Leibe

Main category: cs.CV

TL;DR: DONUT, a decoder-only model for trajectory prediction, outperforms encoder-decoder baselines by using autoregressive predictions and an ‘overprediction’ strategy, achieving state-of-the-art results.

DetailsMotivation: To improve motion prediction for autonomous driving by leveraging decoder-only models, inspired by their success in language modeling.

Method: Uses a single autoregressive model (DONUT) to encode history and predict future trajectories iteratively, with an ‘overprediction’ strategy for longer horizons.

Result: Outperforms encoder-decoder baselines and achieves state-of-the-art on Argoverse 2 benchmark.

Conclusion: Decoder-only models like DONUT enhance trajectory prediction performance and consistency for autonomous driving.

Abstract: Predicting the motion of other agents in a scene is highly relevant for autonomous driving, as it allows a self-driving car to anticipate. Inspired by the success of decoder-only models for language modeling, we propose DONUT, a Decoder-Only Network for Unrolling Trajectories. Unlike existing encoder-decoder forecasting models, we encode historical trajectories and predict future trajectories with a single autoregressive model. This allows the model to make iterative predictions in a consistent manner, and ensures that the model is always provided with up-to-date information, thereby enhancing performance. Furthermore, inspired by multi-token prediction for language modeling, we introduce an ‘overprediction’ strategy that gives the model the auxiliary task of predicting trajectories at longer temporal horizons. This allows the model to better anticipate the future and further improves performance. Through experiments, we demonstrate that our decoder-only approach outperforms the encoder-decoder baseline, and achieves new state-of-the-art results on the Argoverse 2 single-agent motion forecasting benchmark.
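
Conceptually, decoder-only unrolling with overprediction might look like the sketch below: each step emits a next-step chunk that is fed back autoregressively, plus a longer-horizon chunk used only as an auxiliary training target. The `model` interface and chunking are placeholders, not DONUT's API.

```python
# Conceptual sketch of autoregressive trajectory unrolling with an
# "overprediction" auxiliary head. `model` is a placeholder callable
# returning (next_chunk, over_chunk); chunk shapes are assumptions.
import torch

def unroll(model, history: torch.Tensor, steps: int) -> torch.Tensor:
    traj = history                      # (batch, time, coords)
    for _ in range(steps):
        next_chunk, over_chunk = model(traj)          # short + long horizon
        traj = torch.cat([traj, next_chunk], dim=1)   # feed back next chunk only
        # over_chunk supervises a longer horizon during training (aux loss)
    return traj
```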

[209] DiFuse-Net: RGB and Dual-Pixel Depth Estimation using Window Bi-directional Parallax Attention and Cross-modal Transfer Learning

Kunal Swami, Debtanu Gupta, Amrit Kumar Muduli, Chirag Jaiswal, Pankaj Kumar Bajpai

Main category: cs.CV

TL;DR: DiFuse-Net leverages dual-pixel (DP) technology for depth estimation, using a novel network design with a window bi-directional parallax attention mechanism (WBiPAM) and cross-modal transfer learning (CmTL). It outperforms baselines and introduces a new RGB-DP-D dataset.

DetailsMotivation: Traditional depth sensors have limitations in cost and robustness, while DP technology in modern cameras offers a viable alternative for depth estimation.

Method: DiFuse-Net uses WBiPAM to capture DP disparity cues, a separate RGB encoder, and CmTL to leverage existing RGB-D datasets. A new RGB-DP-D dataset (DCDP) is also introduced.

Result: DiFuse-Net outperforms DP and stereo-based baseline methods in depth estimation.

Conclusion: The proposed method and dataset advance depth estimation using DP technology, offering a robust and scalable solution.

Abstract: Depth estimation is crucial for intelligent systems, enabling applications from autonomous navigation to augmented reality. While traditional stereo and active depth sensors have limitations in cost, power, and robustness, dual-pixel (DP) technology, ubiquitous in modern cameras, offers a compelling alternative. This paper introduces DiFuse-Net, a novel modality decoupled network design for disentangled RGB and DP based depth estimation. DiFuse-Net features a window bi-directional parallax attention mechanism (WBiPAM) specifically designed to capture the subtle DP disparity cues unique to smartphone cameras with small aperture. A separate encoder extracts contextual information from the RGB image, and these features are fused to enhance depth prediction. We also propose a Cross-modal Transfer Learning (CmTL) mechanism to utilize large-scale RGB-D datasets in the literature to cope with the limitations of obtaining a large-scale RGB-DP-D dataset. Our evaluation and comparison of the proposed method demonstrate its superiority over the DP and stereo-based baseline methods. Additionally, we contribute a new, high-quality, real-world RGB-DP-D training dataset, named Dual-Camera Dual-Pixel (DCDP) dataset, created using our novel symmetric stereo camera hardware setup, stereo calibration and rectification protocol, and AI stereo disparity estimation method.

[210] ForestFormer3D: A Unified Framework for End-to-End Segmentation of Forest LiDAR 3D Point Clouds

Binbin Xiang, Maciej Wielgosz, Stefano Puliti, Kamil Král, Martin Krůček, Azim Missarov, Rasmus Astrup

Main category: cs.CV

TL;DR: ForestFormer3D is a new framework for precise forest LiDAR point cloud segmentation, achieving state-of-the-art results on diverse datasets.

DetailsMotivation: Current methods struggle with the complexity of natural forests, necessitating a robust solution for tree and semantic segmentation.

Method: ForestFormer3D uses ISA-guided query points, score-based block merging, and one-to-many association for training.

Result: It excels on the FOR-instanceV2 dataset and generalizes well to unseen test sets like Wytham woods and LAUTx.

Conclusion: ForestFormer3D is a robust, unified solution for forest segmentation, with publicly available code and dataset.

Abstract: The segmentation of forest LiDAR 3D point clouds, including both individual tree and semantic segmentation, is fundamental for advancing forest management and ecological research. However, current approaches often struggle with the complexity and variability of natural forest environments. We present ForestFormer3D, a new unified and end-to-end framework designed for precise individual tree and semantic segmentation. ForestFormer3D incorporates ISA-guided query point selection, a score-based block merging strategy during inference, and a one-to-many association mechanism for effective training. By combining these new components, our model achieves state-of-the-art performance for individual tree segmentation on the newly introduced FOR-instanceV2 dataset, which spans diverse forest types and regions. Additionally, ForestFormer3D generalizes well to unseen test sets (Wytham woods and LAUTx), showcasing its robustness across different forest conditions and sensor modalities. The FOR-instanceV2 dataset and the ForestFormer3D code are publicly available at https://bxiang233.github.io/FF3D/.

[211] Three-dimensional reconstruction of complex, dynamic population canopy architecture for crops with a novel point cloud completion model: A case study in Brassica napus rapeseed

Ziyue Guo, Xin Yang, Yutao Shen, Yang Zhu, Lixi Jiang, Haiyan Cen

Main category: cs.CV

TL;DR: The paper proposes CP-PCN, a novel point cloud completion model for 3D reconstruction of rapeseed crop canopies, improving yield prediction accuracy.

DetailsMotivation: Accurate 3D canopy architecture descriptions are needed for evaluating photosynthesis and yield, but existing methods fail due to occlusion.

Method: Developed CP-PCN with multi-resolution dynamic graph convolutional encoder (MRDG) and point pyramid decoder (PPD) for occluded point prediction.

Result: CP-PCN outperformed state-of-the-art methods, improving yield prediction accuracy by 11.2%.

Conclusion: CP-PCN advances quantitative canopy analysis and can be extended to other crops.

Abstract: Quantitative descriptions of the complete canopy architecture are essential for accurately evaluating crop photosynthesis and yield performance to guide ideotype design. Although various sensing technologies have been developed for three-dimensional (3D) reconstruction of individual plants and canopies, they fail to obtain an accurate description of canopy architectures due to severe occlusion among complex canopy structures. We propose an effective method for 3D reconstruction of complex, dynamic population canopy architecture for rapeseed crops with a novel point cloud completion model. A complete point cloud generation framework was developed for automated annotation of the training dataset by distinguishing surface points from occluded points within canopies. The crop population point cloud completion network (CP-PCN) was then designed with a multi-resolution dynamic graph convolutional encoder (MRDG) and a point pyramid decoder (PPD) to predict occluded points. To further enhance feature extraction, a dynamic graph convolutional feature extractor (DGCFE) module was proposed to capture structural variations over the whole rapeseed growth period. The results demonstrated that CP-PCN achieved chamfer distance (CD) values of 3.35-4.51 cm over four growth stages, outperforming the state-of-the-art transformer-based method (PoinTr). Ablation studies confirmed the effectiveness of the MRDG and DGCFE modules. Moreover, a validation experiment demonstrated that the silique efficiency index developed from CP-PCN improved the overall accuracy of rapeseed yield prediction by 11.2% compared to using incomplete point clouds. The CP-PCN pipeline has the potential to be extended to other crops, significantly advancing the quantitative analysis of in-field population canopy architectures.
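
The chamfer distance (CD) metric cited above is, in its common symmetric form, the mean nearest-neighbor distance from each cloud to the other; a brute-force version is shown below for clarity.

```python
# Symmetric chamfer distance between two point clouds: average
# nearest-neighbor distance in both directions. Brute-force O(n*m)
# version for clarity only.
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    # p: (n, 3), q: (m, 3) point clouds
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # pairwise (n, m)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())
```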

[212] ProtoSolo: Interpretable Image Classification via Single-Prototype Activation

Yitao Peng, Lianghua He, Hongzhou Chen

Main category: cs.CV

TL;DR: ProtoSolo simplifies interpretable deep learning for image classification by using a single prototype per class, reducing cognitive complexity while maintaining accuracy.

DetailsMotivation: Existing prototype networks use multiple prototypes, increasing cognitive complexity and hindering user understanding.

Method: ProtoSolo activates only one prototype per class, uses feature maps for similarity, and employs non-projection prototype learning.

Result: Matches state-of-the-art accuracy on CUB-200-2011 and Stanford Cars datasets with lower cognitive complexity.

Conclusion: ProtoSolo offers a simpler, interpretable alternative to existing methods without sacrificing performance.

Abstract: Although interpretable prototype networks have improved the transparency of deep learning image classification, the need for multiple prototypes in collaborative decision-making increases cognitive complexity and hinders user understanding. To solve this problem, this paper proposes a novel interpretable deep architecture for image classification, called ProtoSolo. Unlike existing prototypical networks, ProtoSolo requires activation of only a single prototype to complete the classification. This design significantly simplifies interpretation, as the explanation for each class requires displaying only the prototype with the highest similarity score and its corresponding feature map. Additionally, the traditional full-channel feature vector is replaced with a feature map for similarity comparison and prototype learning, enabling the use of richer global information within a single-prototype activation decision. A non-projection prototype learning strategy is also introduced to preserve the association between the prototype and image patch while avoiding abrupt structural changes in the network caused by projection, which can affect classification performance. Experiments on the CUB-200-2011 and Stanford Cars datasets demonstrate that ProtoSolo matches state-of-the-art interpretable methods in classification accuracy while achieving the lowest cognitive complexity. The code is available at https://github.com/pyt19/ProtoSolo.
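
A toy version of single-prototype-activation classification: each class owns one prototype feature map, one similarity score is computed per class, and the explanation reduces to the single most-similar prototype. Cosine similarity over flattened maps is an assumption; ProtoSolo's similarity computation and non-projection learning are not reproduced here.

```python
# Toy single-prototype classification: one prototype feature map per
# class, one similarity score per class, explanation = the winning
# prototype. Cosine over flattened maps is an assumed similarity.
import torch

def classify_single_prototype(feat: torch.Tensor, prototypes: torch.Tensor):
    # feat: (c, h, w) image feature map; prototypes: (num_classes, c, h, w)
    f = feat.flatten().unsqueeze(0)                        # (1, c*h*w)
    p = prototypes.flatten(start_dim=1)                    # (num_classes, c*h*w)
    scores = torch.nn.functional.cosine_similarity(f, p)   # one score per class
    pred = int(scores.argmax())                            # single prototype explains
    return pred, scores
```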

[213] Cross-Modal Dual-Causal Learning for Long-Term Action Recognition

Xu Shaowu, Jia Xibin, Gao Junyu, Sun Qianmei, Chang Jing, Fan Chao

Main category: cs.CV

TL;DR: The paper proposes Cross-Modal Dual-Causal Learning (CMDCL) to address long-term action recognition (LTAR) challenges by modeling causal relationships between videos and label texts, outperforming existing methods on benchmarks.

DetailsMotivation: LTAR is difficult due to long temporal spans and visual/textual biases. Existing methods lack cross-modal causal modeling, limiting their effectiveness.

Method: CMDCL uses a structural causal model for cross-modal causal relationships, applying textual and visual causal interventions to remove biases and confounders.

Result: CMDCL achieves strong performance on Charades, Breakfast, and COIN benchmarks.

Conclusion: CMDCL effectively addresses LTAR challenges by leveraging dual-causal interventions, demonstrating superior results.

Abstract: Long-term action recognition (LTAR) is challenging due to extended temporal spans with complex atomic action correlations and visual confounders. Although vision-language models (VLMs) have shown promise, they often rely on statistical correlations instead of causal mechanisms. Moreover, existing causality-based methods address modal-specific biases but lack cross-modal causal modeling, limiting their utility in VLM-based LTAR. This paper proposes Cross-Modal Dual-Causal Learning (CMDCL), which introduces a structural causal model to uncover causal relationships between videos and label texts. CMDCL addresses cross-modal biases in text embeddings via textual causal intervention and removes confounders inherent in the visual modality through visual causal intervention guided by the debiased text. These dual-causal interventions enable robust action representations to address LTAR challenges. Experimental results on three benchmarks including Charades, Breakfast and COIN, demonstrate the effectiveness of the proposed model. Our code is available at https://github.com/xushaowu/CMDCL.

[214] C-DOG: Multi-View Multi-instance Feature Association Using Connected δ-Overlap Graphs

Yung-Hong Sun, Ting-Hung Lin, Jiangang Chen, Hongrui Jiang, Yu Hen Hu

Main category: cs.CV

TL;DR: C-DOG algorithm uses epipolar geometry for robust feature association in 3D reconstruction, outperforming baselines in challenging conditions.

DetailsMotivation: Ambiguities in appearance-based feature matching due to identical objects in scenes necessitate a geometry-based solution.

Method: C-DOG employs epipolar geometry, delta-neighbor-overlap clustering, and IQR-based pruning for reliable feature association.

Result: C-DOG outperforms baseline algorithms, especially in high-density, featureless, or low-overlap scenes.

Conclusion: C-DOG is a scalable and robust solution for practical 3D reconstruction applications.

Abstract: Multi-view multi-instance feature association constitutes a crucial step in 3D reconstruction, facilitating the consistent grouping of object instances across various camera perspectives. The presence of multiple identical objects within a scene often leads to ambiguities for appearance-based feature matching algorithms. Our work circumvents this challenge by exclusively employing geometrical constraints, specifically epipolar geometry, for feature association. We introduce C-DOG (Connected delta-Overlap Graph), an algorithm designed for robust geometrical feature association, even in the presence of noisy feature detections. In a C-DOG graph, two nodes representing 2D feature points from distinct views are connected by an edge if they correspond to the same 3D point. Each edge is weighted by its epipolar distance. Ideally, true associations yield a zero distance; however, noisy feature detections can result in non-zero values. To robustly retain edges where the epipolar distance is less than a threshold delta, we employ a Szymkiewicz–Simpson coefficient. This process leads to a delta-neighbor-overlap clustering of 2D nodes. Furthermore, unreliable nodes are pruned from these clusters using an Inter-quartile Range (IQR)-based criterion. Our extensive experiments on synthetic benchmarks demonstrate that C-DOG not only outperforms geometry-based baseline algorithms but also remains remarkably robust under demanding conditions. This includes scenes with high object density, no visual features, and restricted camera overlap, positioning C-DOG as an excellent solution for scalable 3D reconstruction in practical applications.
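
The overlap test behind the clustering step is concrete enough to sketch: given each node's delta-neighborhood (the set of nodes joined to it by edges with epipolar distance below delta), two nodes are grouped when their Szymkiewicz-Simpson coefficient is high enough. The 0.5 threshold below is an assumed value for illustration.

```python
# Szymkiewicz-Simpson overlap coefficient of two nodes'
# delta-neighborhoods, as used to decide whether two 2D detections
# belong to the same cluster. The 0.5 threshold is an assumption.
def overlap_coefficient(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def same_cluster(neigh_i: set, neigh_j: set, threshold: float = 0.5) -> bool:
    # Group two detections if their delta-neighborhoods overlap enough.
    return overlap_coefficient(neigh_i, neigh_j) >= threshold
```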

[215] Rethinking Pan-sharpening: Principled Design, Unified Training, and a Universal Loss Surpass Brute-Force Scaling

Ran Zhang, Xuanhua He, Li Xueheng, Ke Cao, Liu Liu, Wenbo Xu, Fang Jiabin, Yang Qize, Jie Zhang

Main category: cs.CV

TL;DR: PanTiny is a lightweight, efficient pan-sharpening framework trained on multiple satellite datasets, achieving superior performance and generalization without large, complex models.

DetailsMotivation: Addressing the inefficiency and poor generalization of large, dataset-specific pan-sharpening models.

Method: Proposes PanTiny, a single-step framework with a multiple-in-one training paradigm and a composite loss function, trained on WV2, WV3, and GF2 datasets.

Result: PanTiny outperforms larger models, achieving better performance-to-efficiency balance and improved generalization.

Conclusion: Advocates for efficient, generalizable models in pan-sharpening, validated by principled design and training strategies.

Abstract: The field of pan-sharpening has recently seen a trend towards increasingly large and complex models, often trained on single, specific satellite datasets. This approach, however, leads to high computational overhead and poor generalization on full resolution data, a paradigm we challenge in this paper. In response to this issue, we propose PanTiny, a lightweight, single-step pan-sharpening framework designed for both efficiency and robust performance. More critically, we introduce a multiple-in-one training paradigm, where a single, compact model is trained simultaneously on three distinct satellite datasets (WV2, WV3, and GF2) with different resolutions and spectral information. Our experiments show that this unified training strategy not only simplifies deployment but also significantly boosts generalization on full-resolution data. Further, we introduce a universally powerful composite loss function that elevates the performance of almost all pan-sharpening models, pushing state-of-the-art metrics into a new era. Our PanTiny model, benefiting from these innovations, achieves a superior performance-to-efficiency balance, outperforming most larger, specialized models. Through extensive ablation studies, we validate that principled engineering in model design, training paradigms, and loss functions can surpass brute-force scaling. Our work advocates for a community-wide shift towards creating efficient, generalizable, and data-conscious models for pan-sharpening. The code is available at https://github.com/Zirconium233/PanTiny.

[216] Aesthetics is Cheap, Show me the Text: An Empirical Evaluation of State-of-the-Art Generative Models for OCR

Peirong Zhang, Haowei Xu, Jiaxin Zhang, Guitao Xu, Xuhan Zheng, Zhenhua Yang, Junle Liu, Yuyi Zhang, Lianwen Jin

Main category: cs.CV

TL;DR: The paper evaluates state-of-the-art generative models for text image generation and editing, categorizing 33 OCR tasks into five types and identifying weaknesses in current models.

DetailsMotivation: To assess whether advanced generative models can handle the intricacies of text image generation and editing, given their rising capabilities in other domains.

Method: Evaluates six models (closed-source and open-source) using 33 representative OCR tasks across five categories, with tailored inputs and prompts.

Result: Identifies weaknesses in current models and argues for integrating text image generation as a foundational skill in general-domain models.

Conclusion: Photorealistic text image generation should be a core capability of general generative models, not just specialized solutions, and this study provides insights to achieve that.

Abstract: Text image is a unique and crucial information medium that integrates visual aesthetics and linguistic semantics in modern e-society. Due to their subtlety and complexity, the generation of text images represents a challenging and evolving frontier in the image generation field. The recent surge of specialized image generators (e.g., Flux-series) and unified generative models (e.g., GPT-4o), which demonstrate exceptional fidelity, raises a natural question: can they master the intricacies of text image generation and editing? Motivated by this, we assess current state-of-the-art generative models’ capabilities in terms of text image generation and editing. We incorporate various typical optical character recognition (OCR) tasks into our evaluation and broaden the concept of text-based generation tasks into OCR generative tasks. We select 33 representative tasks and categorize them into five categories: document, handwritten text, scene text, artistic text, and complex & layout-rich text. For comprehensive evaluation, we examine six models across both closed-source and open-source domains, using tailored, high-quality image inputs and prompts. Through this evaluation, we draw crucial observations and identify the weaknesses of current generative models for OCR tasks. We argue that photorealistic text image generation and editing should be internalized as foundational skills into general-domain generative models, rather than being delegated to specialized solutions, and we hope this empirical analysis can provide valuable insights for the community to achieve this goal. This evaluation is online and will be continuously updated at our GitHub repository.

[217] CasP: Improving Semi-Dense Feature Matching Pipeline Leveraging Cascaded Correspondence Priors for Guidance

Peiqi Chen, Lei Yu, Yi Wan, Yingying Pei, Xinyi Liu, Yongxiang Yao, Yingying Zhang, Lixiang Ru, Liheng Zhong, Jingdong Chen, Ming Yang, Yongjun Zhang

Main category: cs.CV

TL;DR: CasP introduces a cascaded correspondence prior pipeline for semi-dense feature matching, improving accuracy and efficiency by decomposing matching into two phases with selective cross-attention.

DetailsMotivation: Existing methods rely on global feature map searches, limiting accuracy and efficiency. CasP addresses this by using cascaded priors.

Method: Decomposes matching into two phases with region-based selective cross-attention, restricting search ranges and incorporating high-level features.

Result: Achieves a 2.2x speedup at 1152 resolution and superior geometric estimation, with strong cross-domain generalization.

Conclusion: CasP is efficient and robust, suitable for latency-sensitive applications like SLAM and UAV systems.

Abstract: Semi-dense feature matching methods have shown strong performance in challenging scenarios. However, the existing pipeline relies on a global search across the entire feature map to establish coarse matches, limiting further improvements in accuracy and efficiency. Motivated by this limitation, we propose a novel pipeline, CasP, which leverages cascaded correspondence priors for guidance. Specifically, the matching stage is decomposed into two progressive phases, bridged by a region-based selective cross-attention mechanism designed to enhance feature discriminability. In the second phase, one-to-one matches are determined by restricting the search range to the one-to-many prior areas identified in the first phase. Additionally, this pipeline benefits from incorporating high-level features, which helps reduce the computational costs of low-level feature extraction. The acceleration gains of CasP increase with higher resolution, and our lite model achieves a speedup of $\sim2.2\times$ at a resolution of 1152 compared to the most efficient method, ELoFTR. Furthermore, extensive experiments demonstrate its superiority in geometric estimation, particularly with impressive cross-domain generalization. These advantages highlight its potential for latency-sensitive and high-robustness applications, such as SLAM and UAV systems. Code is available at https://github.com/pq-chen/CasP.

[218] SynPAIN: A Synthetic Dataset of Pain and Non-Pain Facial Expressions

Babak Taati, Muhammad Muzammil, Yasamin Zarghami, Abhishek Moturu, Amirhossein Kazerouni, Hailey Reimer, Alex Mihailidis, Thomas Hadjistavropoulos

Main category: cs.CV

TL;DR: SynPAIN is a synthetic dataset for pain detection, addressing diversity gaps and algorithmic bias in existing models, validated for clinical use.

DetailsMotivation: Pain assessment in non-communicative patients like older adults with dementia is challenging. Existing datasets lack diversity and representation.

Method: Created SynPAIN, a synthetic dataset with 10,710 images across demographics using generative AI, validated with clinical tools.

Result: Synthetic data improved pain detection by 7.0% and revealed algorithmic biases previously undetected.

Conclusion: SynPAIN fills diversity gaps, aids bias mitigation, and enhances pain detection performance.

Abstract: Accurate pain assessment in patients with limited ability to communicate, such as older adults with dementia, represents a critical healthcare challenge. Robust automated systems of pain detection may facilitate such assessments. Existing pain detection datasets, however, suffer from limited ethnic/racial diversity, privacy constraints, and underrepresentation of older adults who are the primary target population for clinical deployment. We present SynPAIN, a large-scale synthetic dataset containing 10,710 facial expression images (5,355 neutral/expressive pairs) across five ethnicities/races, two age groups (young: 20-35, old: 75+), and two genders. Using commercial generative AI tools, we created demographically balanced synthetic identities with clinically meaningful pain expressions. Our validation demonstrates that synthetic pain expressions exhibit expected pain patterns, scoring significantly higher than neutral and non-pain expressions using clinically validated pain assessment tools based on facial action unit analysis. We experimentally demonstrate SynPAIN’s utility in identifying algorithmic bias in existing pain detection models. Through comprehensive bias evaluation, we reveal substantial performance disparities across demographic characteristics. These performance disparities were previously undetectable with smaller, less diverse datasets. Furthermore, we demonstrate that age-matched synthetic data augmentation improves pain detection performance on real clinical data, achieving a 7.0% improvement in average precision. SynPAIN addresses critical gaps in pain assessment research by providing the first publicly available, demographically diverse synthetic dataset specifically designed for older adult pain detection, while establishing a framework for measuring and mitigating algorithmic bias. The dataset is available at https://doi.org/10.5683/SP3/WCXMAP

[219] HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly

Chang Liu, Yunfan Ye, Fan Zhang, Qingyang Zhou, Yuchuan Luo, Zhiping Cai

Main category: cs.CV

TL;DR: HumanSAM is a framework for classifying human-centric video forgeries into spatial, appearance, and motion anomalies, outperforming existing methods.

DetailsMotivation: The rise of synthetic human-centric videos threatens information security, but current detection lacks fine-grained understanding and interpretability.

Method: HumanSAM fuses video understanding and spatial depth to classify forgeries, using a rank-based confidence enhancement strategy and a new dataset (HFV).

Result: HumanSAM achieves promising results in binary and multi-class forgery classification.

Conclusion: The framework advances forgery detection by addressing fine-grained classification and interpretability.

Abstract: Numerous synthesized videos from generative models, especially human-centric ones that simulate realistic human actions, pose significant threats to human information security and authenticity. While progress has been made in binary forgery video detection, the lack of fine-grained understanding of forgery types raises concerns regarding both reliability and interpretability, which are critical for real-world applications. To address this limitation, we propose HumanSAM, a new framework that builds upon the fundamental challenges of video generation models. Specifically, HumanSAM aims to classify human-centric forgeries into three distinct types of artifacts commonly observed in generated content: spatial, appearance, and motion anomaly. To better capture the features of geometry, semantics and spatiotemporal consistency, we propose to generate the human forgery representation by fusing two branches of video understanding and spatial depth. We also adopt a rank-based confidence enhancement strategy during the training process to learn more robust representation by introducing three prior scores. For training and evaluation, we construct the first public benchmark, the Human-centric Forgery Video (HFV) dataset, with all types of forgeries carefully annotated semi-automatically. In our experiments, HumanSAM yields promising results in comparison with state-of-the-art methods, both in binary and multi-class forgery classification.

[220] LargeMvC-Net: Anchor-based Deep Unfolding Network for Large-scale Multi-view Clustering

Shide Du, Chunming Wu, Zihan Fang, Wendi Zhao, Yilin Wu, Changwei Wang, Shiping Wang

Main category: cs.CV

TL;DR: LargeMvC-Net is a deep network architecture for scalable multi-view clustering, derived by unfolding an optimization problem into three modules: representation learning, noise suppression, and anchor estimation. It outperforms existing methods in effectiveness and scalability.

DetailsMotivation: Existing anchor-based multi-view clustering methods use anchors heuristically, ignoring core structural demands and optimization principles. This gap motivates a systematic approach.

Method: The paper unfolds the optimization problem into three modules (RepresentModule, NoiseModule, AnchorModule) and aligns views with an unsupervised reconstruction loss.

Result: LargeMvC-Net outperforms state-of-the-art methods on large-scale multi-view benchmarks in effectiveness and scalability.

Conclusion: The proposed method provides a structured, traceable solution for anchor-based multi-view clustering, demonstrating superior performance.

Abstract: Deep anchor-based multi-view clustering methods enhance the scalability of neural networks by utilizing representative anchors to reduce the computational complexity of large-scale clustering. Despite their scalability advantages, existing approaches often incorporate anchor structures in a heuristic or task-agnostic manner, either through post-hoc graph construction or as auxiliary components for message passing. Such designs overlook the core structural demands of anchor-based clustering, neglecting key optimization principles. To bridge this gap, we revisit the underlying optimization problem of large-scale anchor-based multi-view clustering and unfold its iterative solution into a novel deep network architecture, termed LargeMvC-Net. The proposed model decomposes the anchor-based clustering process into three modules: RepresentModule, NoiseModule, and AnchorModule, corresponding to representation learning, noise suppression, and anchor indicator estimation. Each module is derived by unfolding a step of the original optimization procedure into a dedicated network component, providing structural clarity and optimization traceability. In addition, an unsupervised reconstruction loss aligns each view with the anchor-induced latent space, encouraging consistent clustering structures across views. Extensive experiments on several large-scale multi-view benchmarks show that LargeMvC-Net consistently outperforms state-of-the-art methods in terms of both effectiveness and scalability.
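
To make the unfolding idea concrete, the sketch below maps one iteration of an optimization loop onto three named network blocks with learnable parameters, mirroring the paper's decomposition. The linear bodies are placeholders, not the actual update rules derived in the paper.

```python
# Conceptual sketch of deep unfolding: each optimization iteration
# becomes one network block with its own learnable parameters. Module
# names mirror the paper's decomposition; their bodies are placeholders.
import torch.nn as nn

class UnfoldedIterationSketch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.represent = nn.Linear(dim, dim)  # RepresentModule: representation step
        self.denoise = nn.Linear(dim, dim)    # NoiseModule: noise-suppression step
        self.anchor = nn.Linear(dim, dim)     # AnchorModule: anchor-indicator step

    def forward(self, z):
        z = self.represent(z).relu()
        z = self.denoise(z).relu()
        return self.anchor(z)
```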

[221] Training-free Geometric Image Editing on Diffusion Models

Hanshen Zhu, Zhen Zhu, Kaile Zhang, Yiming Gong, Yuliang Liu, Xiang Bai

Main category: cs.CV

TL;DR: A decoupled pipeline for geometric image editing separates object transformation, inpainting, and refinement, outperforming state-of-the-art methods.

DetailsMotivation: Existing diffusion-based methods struggle with large or complex transformations, requiring a more effective approach.

Method: Proposes a decoupled pipeline with training-free diffusion (FreeFine) for inpainting and refinement, tested on GeoBench.

Result: FreeFine excels in image fidelity and edit precision, especially for demanding transformations.

Conclusion: The decoupled pipeline and FreeFine method offer superior performance in geometric image editing.

Abstract: We tackle the task of geometric image editing, where an object within an image is repositioned, reoriented, or reshaped while preserving overall scene coherence. Previous diffusion-based editing methods often attempt to handle all relevant subtasks in a single step, proving difficult when transformations become large or structurally complex. We address this by proposing a decoupled pipeline that separates object transformation, source region inpainting, and target region refinement. Both inpainting and refinement are implemented using a training-free diffusion approach, FreeFine. In experiments on our new GeoBench benchmark, which contains both 2D and 3D editing scenarios, FreeFine outperforms state-of-the-art alternatives in image fidelity and edit precision, especially under demanding transformations. Code and benchmark are available at: https://github.com/CIawevy/FreeFine

[222] FFGAF-SNN: The Forward-Forward Based Gradient Approximation Free Training Framework for Spiking Neural Networks

Changqing Xu, Ziqiang Yang, Yi Liu, Xinfang Liao, Guiqi Mo, Hao Zeng, Yintang Yang

Main category: cs.CV

TL;DR: A Forward-Forward (FF) based gradient approximation-free training framework for Spiking Neural Networks (SNNs) is proposed, eliminating gradient approximation and reducing computational complexity. It includes a class-aware complexity adaptation mechanism and achieves high test accuracies on benchmark datasets.

DetailsMotivation: Training SNNs is challenging due to non-differentiability and high computational costs of backpropagation. Existing gradient approximation methods sacrifice accuracy and are inefficient for edge devices.

Method: The proposed framework uses FF-based training, treating spiking activations as black-box modules, avoiding gradient approximation. A class-aware complexity adaptation mechanism dynamically optimizes the loss function based on inter-class difficulty.

Result: Achieves test accuracies of 99.58%, 92.13%, and 75.64% on MNIST, Fashion-MNIST, and CIFAR-10, outperforming existing FF-based SNN methods. Also reduces memory access and computational power consumption.

Conclusion: The FF-based framework effectively trains SNNs without gradient approximation, improving accuracy and efficiency, making it suitable for edge devices.

Abstract: Spiking Neural Networks (SNNs) offer a biologically plausible framework for energy-efficient neuromorphic computing. However, training SNNs efficiently is challenging due to their non-differentiability. Existing gradient approximation approaches frequently sacrifice accuracy and face deployment limitations on edge devices due to the substantial computational requirements of backpropagation. To address these challenges, we propose a Forward-Forward (FF) based gradient approximation-free training framework for Spiking Neural Networks, which treats spiking activations as black-box modules, thereby eliminating the need for gradient approximation while significantly reducing computational complexity. Furthermore, we introduce a class-aware complexity adaptation mechanism that dynamically optimizes the loss function based on inter-class difficulty metrics, enabling efficient allocation of network resources across different categories. Experimental results demonstrate that our proposed training framework achieves test accuracies of 99.58%, 92.13%, and 75.64% on the MNIST, Fashion-MNIST, and CIFAR-10 datasets, respectively, surpassing all existing FF-based SNN approaches. Additionally, our proposed method exhibits significant advantages in terms of memory access and computational power consumption.
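
The framework builds on the Forward-Forward idea, whose generic per-layer objective is easy to state: train each layer locally so that its "goodness" (sum of squared activations) is high on positive data and low on negative data, with no gradient flowing through the spiking nonlinearity. A sketch of that generic FF loss follows; the threshold theta is assumed, and the paper's SNN-specific black-box treatment is not reproduced.

```python
# Generic Forward-Forward layer loss (Hinton-style): push goodness of
# positive samples above a threshold theta and goodness of negatives
# below it. Local to each layer; no backprop through later layers.
import torch

def ff_layer_loss(act_pos: torch.Tensor, act_neg: torch.Tensor,
                  theta: float = 2.0) -> torch.Tensor:
    # act_*: (batch, features) activations of one layer
    good_pos = act_pos.pow(2).sum(dim=1)   # goodness on positive data
    good_neg = act_neg.pow(2).sum(dim=1)   # goodness on negative data
    return (torch.nn.functional.softplus(theta - good_pos).mean()
            + torch.nn.functional.softplus(good_neg - theta).mean())
```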

cs.AI

[223] Rethinking Evidence Hierarchies in Medical Language Benchmarks: A Critical Evaluation of HealthBench

Fred Mutisya, Shikoh Gitau, Nasubo Ongoma, Keith Mbae, Elizabeth Wamicha

Main category: cs.AI

TL;DR: HealthBench evaluates AI in health but risks biases. A new approach uses evidence-based guidelines for fairer, globally relevant benchmarks.

DetailsMotivation: Address biases in HealthBench and improve AI evaluation for global health, especially in low-resource settings.

Method: Use version-controlled Clinical Practice Guidelines (CPGs) with systematic reviews and GRADE ratings for reinforcement learning.

Result: Proposes a roadmap for evidence-robust AI evaluation, ensuring clinical trustworthiness and ethical soundness.

Conclusion: Grounding rewards in rigorous CPGs can enhance AI benchmarks for global health equity.

Abstract: HealthBench, a benchmark designed to better measure the capabilities of AI systems for health (Arora et al., 2025), has advanced medical language model evaluation through physician-crafted dialogues and transparent rubrics. However, its reliance on expert opinion, rather than high-tier clinical evidence, risks codifying regional biases and individual clinician idiosyncrasies, further compounded by potential biases in automated grading systems. These limitations are particularly magnified in low- and middle-income settings, where issues like sparse neglected tropical disease coverage and region-specific guideline mismatches are prevalent. The unique challenges of the African context, including data scarcity, inadequate infrastructure, and nascent regulatory frameworks, underscore the urgent need for more globally relevant and equitable benchmarks. To address these shortcomings, we propose anchoring reward functions in version-controlled Clinical Practice Guidelines (CPGs) that incorporate systematic reviews and GRADE evidence ratings. Our roadmap outlines “evidence-robust” reinforcement learning via rubric-to-guideline linkage, evidence-weighted scoring, and contextual override logic, complemented by a focus on ethical considerations and the integration of delayed outcome feedback. By re-grounding rewards in rigorously vetted CPGs, while preserving HealthBench’s transparency and physician engagement, we aim to foster medical language models that are not only linguistically polished but also clinically trustworthy, ethically sound, and globally relevant.
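
As a rough illustration of what "evidence-weighted scoring" could look like, the sketch below weights each rubric item by the GRADE certainty of the guideline recommendation it links to. The weight values and rubric structure are invented for illustration, not taken from the paper.

```python
# GRADE certainty levels mapped to weights; illustrative values only.
GRADE_WEIGHTS = {"high": 1.0, "moderate": 0.75, "low": 0.5, "very_low": 0.25}

def evidence_weighted_score(rubric_items):
    """Each item: {'met': bool, 'points': float, 'grade': str}."""
    total = sum(i["points"] * GRADE_WEIGHTS[i["grade"]] for i in rubric_items)
    earned = sum(i["points"] * GRADE_WEIGHTS[i["grade"]]
                 for i in rubric_items if i["met"])
    return earned / total if total else 0.0

rubric = [
    {"met": True,  "points": 2.0, "grade": "high"},  # e.g. first-line therapy per CPG
    {"met": False, "points": 1.0, "grade": "low"},   # weakly supported advice, missed
]
print(evidence_weighted_score(rubric))  # 0.8: strongly evidenced items dominate
```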

[224] Hyperproperty-Constrained Secure Reinforcement Learning

Ernest Bonnah, Luan Viet Nguyen, Khaza Anuarul Hoque

Main category: cs.AI

TL;DR: The paper introduces HyperTWTL-constrained secure reinforcement learning (SecRL), addressing a gap in security-aware RL using hyperproperties. It proposes a dynamic Boltzmann softmax RL approach to learn optimal policies under HyperTWTL constraints, validated via a robotic case study.

DetailsMotivation: To bridge the research gap in security-aware reinforcement learning by leveraging HyperTWTL for formalizing security and opacity constraints in robotics applications.

Method: Proposes a dynamic Boltzmann softmax RL approach to learn security-aware optimal policies under HyperTWTL constraints, formalized over an MDP.

Result: Demonstrates effectiveness and scalability in a robotic pick-up and delivery mission, outperforming two baseline RL algorithms.

Conclusion: The proposed method successfully integrates HyperTWTL constraints into RL, offering a scalable solution for security-aware policy learning in robotics.

Abstract: Hyperproperties for Time Window Temporal Logic (HyperTWTL) is a domain-specific formal specification language known for its effectiveness in compactly representing security, opacity, and concurrency properties for robotics applications. This paper focuses on HyperTWTL-constrained secure reinforcement learning (SecRL). Although temporal logic-constrained safe reinforcement learning (SRL) is an evolving research problem with a growing body of existing literature, there is a significant research gap in exploring security-aware reinforcement learning (RL) using hyperproperties. Given the dynamics of an agent as a Markov Decision Process (MDP) and opacity/security constraints formalized as HyperTWTL, we propose an approach for learning security-aware optimal policies using dynamic Boltzmann softmax RL while satisfying the HyperTWTL constraints. The effectiveness and scalability of our proposed approach are demonstrated using a pick-up and delivery robotic mission case study. We also compare our results with two other baseline RL algorithms, showing that our proposed method outperforms them.
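
For context, the dynamic Boltzmann softmax operator replaces the hard max in the Q-learning target with a temperature-controlled weighted average whose temperature grows over time. The sketch below shows just that operator on a toy tabular problem; the HyperTWTL constraint handling, which is the paper's core contribution, is omitted, and the temperature schedule is one common choice rather than the paper's.

```python
import numpy as np

def boltzmann_softmax(q, beta):
    # Weighted average of Q-values; approaches max(q) as beta -> infinity.
    w = np.exp(beta * (q - q.max()))
    return float((w / w.sum()) @ q)

def dbs_q_update(Q, s, a, r, s_next, t, alpha=0.1, gamma=0.99):
    beta_t = t + 1.0  # an increasing temperature schedule (assumed)
    target = r + gamma * boltzmann_softmax(Q[s_next], beta_t)
    Q[s, a] += alpha * (target - Q[s, a])

Q = np.zeros((10, 4))  # toy MDP: 10 states, 4 actions
dbs_q_update(Q, s=0, a=1, r=1.0, s_next=2, t=0)
```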

[225] No AI Without PI! Object-Centric Process Mining as the Enabler for Generative, Predictive, and Prescriptive Artificial Intelligence

Wil M. P. van der Aalst

Main category: cs.AI

TL;DR: AI’s industrial application faces challenges in end-to-end processes. Object-Centric Process Mining (OCPM) bridges data and processes, enabling AI. Process Intelligence (PI) combines process-centric techniques to support AI in organizations.

DetailsMotivation: Organizations struggle to apply AI effectively in dynamic, process-driven industrial settings.

Method: Uses Object-Centric Process Mining (OCPM) to structure process-related data and connect it with AI techniques.

Result: OCPM enables AI by linking data and processes, forming Process Intelligence (PI).

Conclusion: AI requires PI for operational process improvement, with OCPM being key to integrating generative, predictive, and prescriptive AI.

Abstract: The uptake of Artificial Intelligence (AI) impacts the way we work, interact, do business, and conduct research. However, organizations struggle to apply AI successfully in industrial settings where the focus is on end-to-end operational processes. Here, we consider generative, predictive, and prescriptive AI and elaborate on the challenges of diagnosing and improving such processes. We show that AI needs to be grounded using Object-Centric Process Mining (OCPM). Process-related data are structured and organization-specific and, unlike text, processes are often highly dynamic. OCPM is the missing link connecting data and processes and enables different forms of AI. We use the term Process Intelligence (PI) to refer to the amalgamation of process-centric data-driven techniques able to deal with a variety of object and event types, enabling AI in an organizational context. This paper explains why AI requires PI to improve operational processes and highlights opportunities for successfully combining OCPM and generative, predictive, and prescriptive AI.

[226] Algorithmic Detection of Rank Reversals, Transitivity Violations, and Decomposition Inconsistencies in Multi-Criteria Decision Analysis

Agustín Borda, Juan Bautista Cabral, Gonzalo Giarda, Diego Nicolás Gimenez Irusta, Paula Pacheco, Alvaro Roy Schachner

Main category: cs.AI

TL;DR: The paper introduces three tests to detect Rank Reversals in Multi-Criteria Decision Analysis, implemented in Scikit-Criteria, and discusses their impact on method evaluation.

DetailsMotivation: Rank Reversals in Multi-Criteria Decision Analysis can distort results, necessitating tools to measure method performance and compare effectiveness.

Method: Three tests for detecting Rank Reversals were developed and implemented in the Scikit-Criteria library, addressing general scenario challenges.

Result: The tests enable performance measurement of decision methods, aiding in global ranking of method effectiveness.

Conclusion: These tests enhance the evaluation of multi-criteria decision methods, improving problem-solving judgment.

Abstract: In Multi-Criteria Decision Analysis, Rank Reversals are a serious problem that can greatly affect the results of a Multi-Criteria Decision Method against a particular set of alternatives. It is therefore useful to have a mechanism that allows one to measure the performance of a method on a set of alternatives. This idea could be taken further to build a global ranking of the effectiveness of different methods to solve a problem. In this paper, we present three tests that detect the presence of Rank Reversals, along with their implementation in the Scikit-Criteria library. We also address the complications that arise when implementing these tests for general scenarios and the design considerations we made to handle them. We close with a discussion about how these additions could play a major role in the judgment of multi-criteria decision methods for problem solving.
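
A basic rank-reversal check can be stated in a few lines: rank all alternatives, delete a non-optimal one, re-rank, and verify that the surviving alternatives keep their relative order. The sketch below uses a plain weighted-sum ranker as a stand-in for any MCDA method; it does not reproduce the Scikit-Criteria API or the paper's three specific tests.

```python
import numpy as np

def rank(matrix, weights):
    scores = matrix @ weights   # higher is better
    return np.argsort(-scores)  # alternative indices, best first

def has_rank_reversal(matrix, weights):
    base = list(rank(matrix, weights))
    for drop in base[1:]:                    # never drop the current best
        keep = [i for i in range(len(matrix)) if i != drop]
        sub = list(rank(matrix[keep], weights))
        survivors = [i for i in base if i != drop]
        if [keep[j] for j in sub] != survivors:
            return True                      # relative order changed: reversal
    return False

M = np.array([[0.9, 0.2], [0.5, 0.8], [0.4, 0.7]])
print(has_rank_reversal(M, np.array([0.6, 0.4])))  # weighted sum: False
```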

[227] Multi-Agent Game Generation and Evaluation via Audio-Visual Recordings

Alexia Jolicoeur-Martineau

Main category: cs.AI

TL;DR: The paper introduces AVR-Eval, a metric for multimedia content quality, and AVR-Agent, a multi-agent system for generating JavaScript games and animations. AVR-Eval effectively evaluates content, while AVR-Agent improves generation quality but struggles with custom assets and feedback.

DetailsMotivation: Current AI struggles with generating complex interactive content like video games, lacking automated evaluation metrics and efficient use of human-like resources (e.g., custom assets, feedback).

Method: Proposed AVR-Eval, a relative metric using omni-modal comparisons, and AVR-Agent, a multi-agent system for iterative code generation and improvement using AVR feedback.

Result: AVR-Agent outperforms one-shot generation in win rate but fails to leverage custom assets and feedback effectively, unlike humans.

Conclusion: The work highlights a gap in AI’s ability to utilize resources like humans do, emphasizing fundamental differences in content creation approaches.

Abstract: While AI excels at generating text, audio, images, and videos, creating interactive audio-visual content such as video games remains challenging. Current LLMs can generate JavaScript games and animations, but automated evaluation metrics are lacking, and models struggle with complex content that would normally require teams of humans working for many months (multi-shot, multi-agent) using assets made by artists. To tackle these issues, we built a new metric and a multi-agent system. We propose AVR-Eval, a relative metric for multimedia content quality using Audio-Visual Recordings (AVRs). An omni-modal model (processing text, video, and audio) compares the AVRs of two contents, with a text model reviewing evaluations to determine superiority. We show that AVR-Eval properly identifies good from broken or mismatched content. We built AVR-Agent, a multi-agent system generating JavaScript code from a bank of multimedia assets (audio, images, 3D models). The coding agent selects relevant assets, generates multiple initial codes, uses AVR-Eval to identify the best version, and iteratively improves it through omni-modal agent feedback from the AVR. We run experiments on games and animations with AVR-Eval (win rate of content A against B). We find that content generated by AVR-Agent has a significantly higher win rate against content made through one-shot generation. However, models struggle to leverage custom assets and AVR feedback effectively, showing no higher win rate. This reveals a critical gap: while humans benefit from high-quality assets and audio-visual feedback, current coding models do not seem to utilize these resources as effectively, highlighting fundamental differences between human and machine content creation approaches.

[228] Theory of Mind Using Active Inference: A Framework for Multi-Agent Cooperation

Riddhi J. Pitliya, Ozan Catal, Toon Van de Maele, Corrado Pezzato, Tim Verbelen

Main category: cs.AI

TL;DR: A novel multi-agent cooperation method using theory of mind (ToM) within active inference, enabling agents to infer others’ beliefs without shared models or explicit communication.

DetailsMotivation: To improve multi-agent cooperation by leveraging ToM, allowing agents to reason about others' beliefs and goals without relying on shared models or communication.

Method: Extends inference tree-based planning with recursive reasoning, maintaining distinct representations of self and others’ beliefs and goals.

Result: ToM-equipped agents outperform non-ToM agents in collision avoidance and foraging tasks by inferring others’ beliefs from behavior.

Conclusion: The approach advances AI applications and offers computational insights into ToM, demonstrating practical benefits in multi-agent systems.

Abstract: We present a novel approach to multi-agent cooperation by implementing theory of mind (ToM) within active inference. ToM - the ability to understand that others can have differing knowledge and goals - enables agents to reason about others’ beliefs while planning their own actions. Unlike previous active inference approaches to multi-agent cooperation, our method neither relies on task-specific shared generative models nor requires explicit communication, while being generalisable. In our framework, the ToM-equipped agent maintains distinct representations of its own and others’ beliefs and goals. We extend the sophisticated inference tree-based planning algorithm to systematically explore joint policy spaces through recursive reasoning. Our approach is evaluated through collision avoidance and foraging task simulations. Results demonstrate that ToM-equipped agents cooperate better compared to non-ToM counterparts by being able to avoid collisions and reduce redundant efforts. Crucially, ToM agents accomplish this by inferring others’ beliefs solely from observable behaviour. This work advances practical applications in artificial intelligence while providing computational insights into ToM.

[229] SHACL Validation under Graph Updates (Extended Paper)

Shqiponja Ahmetaj, George Konstantinidis, Magdalena Ortiz, Paolo Pareti, Mantas Simkus

Main category: cs.AI

TL;DR: The paper studies SHACL validation in RDF graphs under updates, introducing a SHACL-based update language and reducing static validation to SHACL constraint (un)satisfiability. It analyzes computational complexity and presents a prototype implementation.

DetailsMotivation: To address the challenge of ensuring RDF graphs remain valid under updates, providing a foundation for reasoning about evolving graphs.

Method: Develops a SHACL-based update language, uses regression to embed updates into SHACL constraints, and reduces static validation to constraint (un)satisfiability. Analyzes complexity and implements a prototype.

Result: Shows static validation can be reduced to SHACL constraint (un)satisfiability, with computational complexity analyzed. A prototype demonstrates practical feasibility.

Conclusion: The approach enables efficient static validation of evolving RDF graphs, with potential for further services in reasoning about updates.

Abstract: SHACL (SHApe Constraint Language) is a W3C standardized constraint language for RDF graphs. In this paper, we study SHACL validation in RDF graphs under updates. We present a SHACL-based update language that can capture intuitive and realistic modifications on RDF graphs and study the problem of static validation under such updates. This problem asks to verify whether every graph that validates a SHACL specification will still do so after applying a given update sequence. More importantly, it provides a basis for further services for reasoning about evolving RDF graphs. Using a regression technique that embeds the update actions into SHACL constraints, we show that static validation under updates can be reduced to (un)satisfiability of constraints in (a minor extension of) SHACL. We analyze the computational complexity of the static validation problem for SHACL and some key fragments. Finally, we present a prototype implementation that performs static validation and other static analysis tasks on SHACL constraints and demonstrate its behavior through preliminary experiments.

[230] Co-Producing AI: Toward an Augmented, Participatory Lifecycle

Rashid Mushkani, Hugo Berard, Toumadher Ammar, Cassandre Chatonnier, Shin Koseki

Main category: cs.AI

TL;DR: The paper proposes a re-architected AI lifecycle centered on co-production, DEI, and multidisciplinary collaboration to mitigate biases and harms, especially for marginalized groups.

DetailsMotivation: Addressing the disproportionate impact of AI biases on marginalized groups and the limitations of current mitigation approaches.

Method: Introduces an augmented AI lifecycle with five phases (co-framing to co-maintenance), informed by workshops and emphasizing participatory governance.

Result: A participatory AI lifecycle framework grounded in distributed authority and iterative knowledge exchange, aligned with ethical guidelines.

Conclusion: Scaling participatory governance in AI requires further research, but the proposed lifecycle offers a foundation for equitable AI production.

Abstract: Despite efforts to mitigate the inherent risks and biases of artificial intelligence (AI) algorithms, these algorithms can disproportionately impact culturally marginalized groups. A range of approaches has been proposed to address or reduce these risks, including the development of ethical guidelines and principles for responsible AI, as well as technical solutions that promote algorithmic fairness. Drawing on design justice, expansive learning theory, and recent empirical work on participatory AI, we argue that mitigating these harms requires a fundamental re-architecture of the AI production pipeline. This re-design should center co-production, diversity, equity, inclusion (DEI), and multidisciplinary collaboration. We introduce an augmented AI lifecycle consisting of five interconnected phases: co-framing, co-design, co-implementation, co-deployment, and co-maintenance. The lifecycle is informed by four multidisciplinary workshops and grounded in themes of distributed authority and iterative knowledge exchange. Finally, we relate the proposed lifecycle to several leading ethical frameworks and outline key research questions that remain for scaling participatory governance.

[231] Beyond Agreement: Rethinking Ground Truth in Educational AI Annotation

Danielle R. Thomas, Conrad Borchers, Kenneth R. Koedinger

Main category: cs.AI

TL;DR: The paper critiques overreliance on human inter-rater reliability (IRR) for annotation quality in educational AI, proposing five alternative methods to improve validity and learning outcomes.

DetailsMotivation: Human evaluators are often biased and unreliable, yet IRR metrics like Cohen's kappa dominate validation in educational AI, potentially hindering progress in producing valid, predictive data.

Method: The paper highlights five complementary evaluation methods, including multi-label annotation, expert-based approaches, and close-the-loop validity, to replace or supplement IRR.

Result: These alternative methods are argued to produce better training data and models, improving student learning and actionable insights compared to IRR alone.

Conclusion: The field should prioritize validity and educational impact over consensus, redefining annotation quality and ground truth in educational AI.

Abstract: Humans can be notoriously imperfect evaluators. They are often biased, unreliable, and unfit to define “ground truth.” Yet, given the surging need to produce large amounts of training data in educational applications using AI, traditional inter-rater reliability (IRR) metrics like Cohen’s kappa remain central to validating labeled data. IRR remains a cornerstone of many machine learning pipelines for educational data. Take, for example, the classification of tutors’ moves in dialogues or labeling open responses in machine-graded assessments. This position paper argues that overreliance on human IRR as a gatekeeper for annotation quality hampers progress in classifying data in ways that are valid and predictive in relation to improving learning. To address this issue, we highlight five examples of complementary evaluation methods, such as multi-label annotation schemes, expert-based approaches, and close-the-loop validity. We argue that these approaches are in a better position to produce training data and subsequent models that produce improved student learning and more actionable insights than IRR approaches alone. We also emphasize the importance of external validity, for example, by establishing a procedure of validating tutor moves and demonstrating that it works across many categories of tutor actions (e.g., providing hints). We call on the field to rethink annotation quality and ground truth, prioritizing validity and educational impact over consensus alone.
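
For reference, Cohen's kappa, the IRR statistic the authors argue is over-relied upon, is observed agreement corrected for the agreement two raters would reach by chance. A minimal sketch with made-up tutor-move labels:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # raw agreement
    ca, cb = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random.
    p_exp = sum((ca[l] / n) * (cb[l] / n) for l in set(ca) | set(cb))
    return (p_obs - p_exp) / (1 - p_exp)

a = ["hint", "praise", "hint", "other", "hint"]
b = ["hint", "praise", "other", "other", "hint"]
print(round(cohens_kappa(a, b), 3))  # 0.688: substantial but imperfect agreement
```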

[232] Model-Based Soft Maximization of Suitable Metrics of Long-Term Human Power

Jobst Heitzig, Ram Potham

Main category: cs.AI

TL;DR: The paper proposes an objective function for AI agents to empower humans and manage power balance, aiming for safety and wellbeing.

DetailsMotivation: Address AI safety and human wellbeing by ensuring AI agents empower humans and maintain desirable power dynamics.

Method: Design a parametrizable, decomposable objective function considering human bounded rationality and diverse goals. Derive algorithms for computation via backward induction or multi-agent reinforcement learning.

Result: Softly maximizing human power metrics may lead to safer AI behavior than direct utility-based objectives.

Conclusion: Aggregate metrics of human power could be a safer objective for AI systems, balancing safety and wellbeing.

Abstract: Power is a key concept in AI safety: power-seeking as an instrumental goal, sudden or gradual disempowerment of humans, power balance in human-AI interaction and international AI governance. At the same time, power as the ability to pursue diverse goals is essential for wellbeing. This paper explores the idea of promoting both safety and wellbeing by forcing AI agents explicitly to empower humans and to manage the power balance between humans and AI agents in a desirable way. Using a principled, partially axiomatic approach, we design a parametrizable and decomposable objective function that represents an inequality- and risk-averse long-term aggregate of human power. It takes into account humans’ bounded rationality and social norms, and, crucially, considers a wide variety of possible human goals. We derive algorithms for computing that metric by backward induction or approximating it via a form of multi-agent reinforcement learning from a given world model. We exemplify the consequences of (softly) maximizing this metric in a variety of paradigmatic situations and describe what instrumental sub-goals it will likely imply. Our cautious assessment is that softly maximizing suitable aggregate metrics of human power might constitute a beneficial objective for agentic AI systems that is safer than direct utility-based objectives.
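
As a loose illustration of the backward-induction computation, the sketch below propagates a per-state metric through a finite-horizon MDP and soft-maximizes it with a Boltzmann policy rather than a hard argmax. The transition model, the per-state "power" values, and the temperature are all invented stand-ins; the paper's actual metric is axiomatic, inequality- and risk-averse, and far richer than a simple sum.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 4, 2, 5
# P[s, a] is a distribution over next states (toy random dynamics).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
human_power = np.array([0.1, 0.5, 0.9, 0.3])  # per-state metric (assumed given)

V = human_power.copy()                   # value at the final step
for _ in range(horizon):
    Q = human_power[:, None] + P @ V     # Q[s, a]: current metric + expected future
    pi = np.exp(2.0 * Q)                 # soft maximization: Boltzmann policy,
    pi /= pi.sum(axis=1, keepdims=True)  # not a hard argmax
    V = (pi * Q).sum(axis=1)
print(V)  # soft long-term aggregate of the metric, per state
```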

[233] RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization

Yihong Dong, Xue Jiang, Yongding Tao, Huanyu Liu, Kechi Zhang, Lili Mou, Rongyu Cao, Yingwei Ma, Jue Chen, Binhua Li, Zhi Jin, Fei Huang, Yongbin Li, Ge Li

Main category: cs.AI

TL;DR: RL-PLUS enhances LLM reasoning by combining internal thinking and external learning, outperforming RLVR with improved performance and boundary avoidance.

DetailsMotivation: RLVR's limitations in surpassing base LLM boundaries and causing capability collapse motivate the need for RL-PLUS.

Method: RL-PLUS uses Multiple Importance Sampling and an Exploration-Based Advantage Function to integrate external data and guide reasoning.

Result: RL-PLUS achieves state-of-the-art performance on math reasoning benchmarks and out-of-distribution tasks, with 21.1% to 69.2% improvements.

Conclusion: RL-PLUS effectively addresses RLVR’s limitations, surpassing base model boundaries and avoiding capability collapse.

Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent capability boundaries of the base LLM, due to its inherently on-policy strategy with LLM’s immense action space and sparse reward. Further, RLVR can lead to the capability boundary collapse, narrowing the LLM’s problem-solving scope. To address this problem, we propose RL-PLUS, a novel approach that synergizes internal exploitation (i.e., Thinking) with external data (i.e., Learning) to achieve stronger reasoning capabilities and surpass the boundaries of base models. RL-PLUS integrates two core components: Multiple Importance Sampling to address the distributional mismatch from external data, and an Exploration-Based Advantage Function to guide the model towards high-value, unexplored reasoning paths. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach. The results show that RL-PLUS achieves state-of-the-art performance compared with existing RLVR methods on six math reasoning benchmarks and exhibits superior performance on six out-of-distribution reasoning tasks. It also achieves consistent and significant gains across diverse model families, with average relative improvements ranging from 21.1% to 69.2%. Moreover, Pass@k curves across multiple benchmarks indicate that RL-PLUS effectively resolves the capability boundary collapse problem.
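
Multiple importance sampling itself is a generic estimator worth seeing once. With the balance heuristic, every sample x drawn from any of the proposal distributions contributes f(x) / Σ_j n_j p_j(x), which keeps a mixture of on-policy and external samples unbiased. The sketch below shows the plain estimator on a 1-D integral; the exact estimator inside RL-PLUS may differ.

```python
import numpy as np
from scipy.stats import norm

def mis_balance(samples_per_dist, pdfs, f):
    # Balance heuristic: contributions f(x) / sum_j n_j p_j(x); the implied
    # per-distribution weights sum to one by construction.
    counts = [len(xs) for xs in samples_per_dist]
    est = 0.0
    for xs in samples_per_dist:
        for x in xs:
            est += f(x) / sum(n * p(x) for n, p in zip(counts, pdfs))
    return est

rng = np.random.default_rng(0)
p1, p2 = norm(0, 1), norm(3, 1)  # two proposal distributions
samples = [p1.rvs(500, random_state=rng), p2.rvs(500, random_state=rng)]
f = lambda x: np.exp(-0.5 * (x - 1.5) ** 2)  # integrand
print(mis_balance(samples, [p1.pdf, p2.pdf], f))  # ~ sqrt(2*pi) ~ 2.507
```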

[234] MetaAgent: Toward Self-Evolving Agent via Tool Meta-Learning

Hongjin Qian, Zheng Liu

Main category: cs.AI

TL;DR: MetaAgent is a self-improving AI agent that learns by doing, using adaptive help-seeking, self-reflection, and dynamic knowledge integration to outperform baselines on knowledge discovery tasks.

DetailsMotivation: To develop an agent that autonomously improves its reasoning and tool-use strategies through hands-on practice without requiring model updates.

Method: MetaAgent starts with basic abilities, requests help when needed, reflects on tasks, and builds tools/knowledge dynamically.

Result: Outperforms workflow-based baselines and matches/exceeds end-to-end trained agents on benchmarks like GAIA and WebWalkerQA.

Conclusion: MetaAgent demonstrates the potential of self-evolving systems for robust, general-purpose knowledge discovery.

Abstract: In this work, we propose MetaAgent, an agentic paradigm inspired by the principle of learning-by-doing, where expertise is developed through hands-on practice and continual self-improvement. MetaAgent starts with a minimal workflow, equipped only with basic reasoning and adaptive help-seeking abilities. When a knowledge gap is encountered, MetaAgent generates natural language help requests, which are routed to the most suitable external tool by a dedicated tool router. As MetaAgent solves tasks, it continually conducts self-reflection and answer verification, distilling actionable experience into concise texts that are dynamically incorporated into future task contexts. Besides, MetaAgent autonomously builds in-house tools and a persistent knowledge base by organizing its tool-use history, further enhancing its ability to retrieve and integrate relevant information. We term this continual, data-driven process meta tool learning, through which MetaAgent incrementally refines its reasoning and tool-use strategies, without changing model parameters or requiring further post-training. Evaluated on challenging knowledge discovery benchmarks, including GAIA, WebWalkerQA, and BrowseCamp, MetaAgent consistently outperforms workflow-based baselines and matches or exceeds end-to-end trained agents, demonstrating the promise of self-evolving agentic systems for robust, general-purpose knowledge discovery. Our source code is available at https://github.com/qhjqhj00/MetaAgent.

[235] Mind the Gap: The Divergence Between Human and LLM-Generated Tasks

Yi-Long Lu, Jiajun Song, Chunhui Zhang, Wei Wang

Main category: cs.AI

TL;DR: The study compares human and LLM (GPT-4o) task generation, finding humans are driven by psychological factors, while LLMs lack this alignment, producing less social and physical tasks.

DetailsMotivation: To explore whether LLMs simulate human cognitive principles in task generation.

Method: A task-generation experiment comparing humans and GPT-4o, with psychological drivers provided to the LLM.

Result: Human task generation shows clear psychological influence; LLM-generated tasks are less social, less physical, and more abstract, despite being perceived as more fun and novel.

Conclusion: A gap exists between human cognition and LLMs, suggesting the need for intrinsic motivation and physical grounding in LLM design.

Abstract: Humans constantly generate a diverse range of tasks guided by internal motivations. While generative agents powered by large language models (LLMs) aim to simulate this complex behavior, it remains uncertain whether they operate on similar cognitive principles. To address this, we conducted a task-generation experiment comparing human responses with those of an LLM agent (GPT-4o). We find that human task generation is consistently influenced by psychological drivers, including personal values (e.g., Openness to Change) and cognitive style. Even when these psychological drivers are explicitly provided to the LLM, it fails to reflect the corresponding behavioral patterns. They produce tasks that are markedly less social, less physical, and thematically biased toward abstraction. Interestingly, although the LLM’s tasks were perceived as more fun and novel, this contrast highlights a disconnect between its linguistic proficiency and its capacity to generate human-like, embodied goals. We conclude that there is a core gap between the value-driven, embodied nature of human cognition and the statistical patterns of LLMs, highlighting the necessity of incorporating intrinsic motivation and physical grounding into the design of more human-aligned agents.

[236] Oedipus and the Sphinx: Benchmarking and Improving Visual Language Models for Complex Graphic Reasoning

Jianyi Zhang, Xu Ji, Ziyin Zhou, Yuchen Zhou, Shubo Shi, Haoyu Wu, Zhen Li, Shizhao Liu

Main category: cs.AI

TL;DR: ReasonBench is a benchmark for evaluating VLMs in complex graphic reasoning, revealing limitations and proposing optimizations (DiaCoT and ReasonTune) that improve performance by 33.5%.

DetailsMotivation: VLMs lack human-level graphic reasoning, especially in complex tasks, which are understudied.

Method: Proposed ReasonBench with 1,613 questions from intelligence tests, evaluating 11 VLMs. Introduced DiaCoT and ReasonTune for optimization.

Result: Current VLMs show significant limitations. Proposed optimizations improve performance by 33.5%.

Conclusion: ReasonBench highlights VLM deficiencies and offers effective strategies for improvement.

Abstract: Evaluating the performance of visual language models (VLMs) in graphic reasoning tasks has become an important research topic. However, VLMs still show obvious deficiencies in simulating human-level graphic reasoning capabilities, especially in complex graphic reasoning and abstract problem solving, which are less studied, and existing studies focus only on simple graphics. To evaluate the performance of VLMs in complex graphic reasoning, we propose ReasonBench, the first evaluation benchmark focused on structured graphic reasoning tasks, which includes 1,613 questions from real-world intelligence tests. ReasonBench covers reasoning dimensions related to location, attribute, quantity, and multi-element tasks, providing a comprehensive evaluation of the performance of VLMs in spatial, relational, and abstract reasoning capabilities. We benchmark 11 mainstream VLMs (including closed-source and open-source models) and reveal significant limitations of current models. Based on these findings, we propose a dual optimization strategy: Diagrammatic Reasoning Chain (DiaCoT) enhances the interpretability of reasoning by decomposing layers, and ReasonTune enhances the task adaptability of model reasoning through training; together these improve VLM performance by 33.5%. All experimental data and code are in the repository: https://huggingface.co/datasets/cistine/ReasonBench.

[237] R1-ACT: Efficient Reasoning Model Safety Alignment by Activating Safety Knowledge

Yeonjun In, Wonjoong Kim, Sangwu Park, Chanyoung Park

Main category: cs.AI

TL;DR: R1-Act is a post-training method that activates safety knowledge in large reasoning models (LRMs) to reduce harmful outputs, requiring minimal training resources.

DetailsMotivation: LRMs often comply with harmful instructions despite having safety knowledge, posing risks. This paper aims to address this gap.

Method: Proposes R1-Act, a method to trigger safety knowledge explicitly during reasoning with minimal training (1,000 examples, 90 minutes).

Result: R1-Act improves safety without compromising reasoning, outperforming prior methods. It works across various LRMs.

Conclusion: R1-Act is a scalable, efficient solution for enhancing LRM safety with practical applicability.

Abstract: Although large reasoning models (LRMs) have demonstrated impressive capabilities on complex tasks, recent studies reveal that these models frequently fulfill harmful user instructions, raising significant safety concerns. In this paper, we investigate the underlying cause of LRM safety risks and find that models already possess sufficient safety knowledge but fail to activate it during reasoning. Based on this insight, we propose R1-Act, a simple and efficient post-training method that explicitly triggers safety knowledge through a structured reasoning process. R1-Act achieves strong safety improvements while preserving reasoning performance, outperforming prior alignment methods. Notably, it requires only 1,000 training examples and 90 minutes of training on a single RTX A6000 GPU. Extensive experiments across multiple LRM backbones and sizes demonstrate the robustness, scalability, and practical efficiency of our approach.

[238] CoRGI: Verified Chain-of-Thought Reasoning with Visual Grounding

Shixin Yi, Lin Shang

Main category: cs.AI

TL;DR: CoRGI introduces visual verification into reasoning for VLMs, improving performance and reducing hallucinations.

DetailsMotivation: Address hallucinations in CoT prompting by grounding reasoning in visual evidence.

Method: Three-stage pipeline: generate reasoning chain, verify with visual evidence, synthesize answer.

Result: Improves reasoning on VCR benchmark; enhances factual accuracy in explanations.

Conclusion: Grounding reasoning in visual evidence boosts robustness in multimodal reasoning.

Abstract: Chain-of-Thought (CoT) prompting has shown promise in improving reasoning in vision-language models (VLMs), but it often produces explanations that are linguistically fluent yet lack grounding in visual content. We observe that such hallucinations arise in part from the absence of an explicit verification mechanism during multi-step reasoning. To address this, we propose CoRGI (Chain of Reasoning with Grounded Insights), a modular framework that introduces visual verification into the reasoning process. CoRGI follows a three-stage pipeline: it first generates a textual reasoning chain, then extracts supporting visual evidence for each reasoning step via a dedicated module (VEVM), and finally synthesizes the textual rationale with visual evidence to generate a grounded, verified answer. The framework can be integrated with existing VLMs without end-to-end retraining. We evaluate CoRGI on the VCR benchmark and find that it improves reasoning performance on two representative open-source VLM backbones, Qwen-2.5VL and LLaVA-1.6. Ablation studies confirm the contribution of each step in the verification module, and human evaluations suggest that CoRGI leads to more factual and helpful explanations. We also examine alternative designs for the visual verification step and discuss potential limitations of post-hoc verification frameworks. These findings highlight the importance of grounding intermediate reasoning steps in visual evidence to enhance the robustness of multimodal reasoning.

[239] Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training

Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, Hongming Zhang, Haitao Mi, Dong Yu

Main category: cs.AI

TL;DR: Cognitive Kernel-Pro is an open-source, free multi-module agent framework designed to democratize advanced AI agent development, achieving state-of-the-art results on GAIA.

DetailsMotivation: Current AI agent systems are often closed-source or rely on paid APIs, limiting accessibility and reproducibility for researchers.

Method: The framework includes high-quality training data curation for Agent Foundation Models, focusing on queries, trajectories, and verifiable answers across web, file, code, and reasoning domains, plus test-time reflection and voting strategies.

Result: The 8B-parameter open-source model outperforms previous systems like WebDancer and WebSailor on GAIA.

Conclusion: Cognitive Kernel-Pro sets a new standard for accessible, high-capability AI agents, with code publicly available.

Abstract: General AI Agents are increasingly recognized as foundational frameworks for the next generation of artificial intelligence, enabling complex reasoning, web interaction, coding, and autonomous research capabilities. However, current agent systems are either closed-source or heavily reliant on a variety of paid APIs and proprietary tools, limiting accessibility and reproducibility for the research community. In this work, we present Cognitive Kernel-Pro, a fully open-source and (to the maximum extent) free multi-module agent framework designed to democratize the development and evaluation of advanced AI agents. Within Cognitive Kernel-Pro, we systematically investigate the curation of high-quality training data for Agent Foundation Models, focusing on the construction of queries, trajectories, and verifiable answers across four key domains: web, file, code, and general reasoning. Furthermore, we explore novel strategies for agent test-time reflection and voting to enhance agent robustness and performance. We evaluate Cognitive Kernel-Pro on GAIA, achieving state-of-the-art results among open-source and free agents. Notably, our 8B-parameter open-source model surpasses previous leading systems such as WebDancer and WebSailor, establishing a new performance standard for accessible, high-capability AI agents. Code is available at https://github.com/Tencent/CognitiveKernel-Pro
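
The test-time voting strategy can be pictured as simple self-consistency: sample several full agent rollouts and keep the most frequent final answer. In the sketch below, `run_agent` is a hypothetical stand-in for a complete Cognitive Kernel-Pro rollout; the actual voting logic in the framework may be more elaborate.

```python
import random
from collections import Counter

def vote(task, run_agent, k=5):
    answers = [run_agent(task) for _ in range(k)]
    best, n = Counter(answers).most_common(1)[0]
    return best, n / k  # answer plus a crude self-consistency score

# Toy stand-in for a stochastic agent rollout.
demo_agent = lambda task: random.choice(["42", "42", "41"])
print(vote("What is 6 x 7?", demo_agent))
```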

[240] Thinking Machines: Mathematical Reasoning in the Age of LLMs

Andrea Asperti, Alberto Naibo, Claudio Sacerdoti Coen

Main category: cs.AI

TL;DR: The paper explores the challenges and questions surrounding the application of Large Language Models (LLMs) to formal mathematics, contrasting their success in coding with difficulties in theorem proving.

DetailsMotivation: To understand why LLMs excel in coding but struggle with formal mathematics, and to investigate their reasoning processes and state representation.

Method: Analyzes state-of-the-art models and benchmarks, focusing on three key issues: training domains, brittleness in proof generation, and logical state representation.

Result: Identifies current limitations in LLMs’ ability to handle formal mathematics and suggests areas for improvement.

Conclusion: The paper highlights open questions and potential extensions to push the boundaries of LLMs in mathematical reasoning.

Abstract: Large Language Models (LLMs) have shown remarkable abilities in structured reasoning and symbolic tasks, with coding emerging as a particular area of strength. This success has sparked growing interest in applying LLMs to mathematics, both in informal problem-solving and formal theorem proving. However, progress in formal mathematics has proven to be significantly more difficult, despite surface-level similarities between programming and proof construction. This discrepancy raises important questions about how LLMs ``reason’’, how they are supervised, and whether they internally track a notion of computational or deductive state. In this article, we address the state-of-the-art of the discipline, focusing on recent models and benchmarks, and explore three central issues at the intersection of machine learning and mathematical cognition: (i) the trade-offs between formal and informal mathematics as training domains; (ii) the deeper reasons why proof generation remains more brittle than code synthesis; (iii) and the question of whether LLMs represent, or merely mimic, a notion of evolving logical state. Our goal is not to draw hard boundaries, but to identify where the current limits lie, and how they might be extended.

[241] Pro2Guard: Proactive Runtime Enforcement of LLM Agent Safety via Probabilistic Model Checking

Haoyu Wang, Chris M. Poskitt, Jun Sun, Jiali Wei

Main category: cs.AI

TL;DR: Pro2Guard is a proactive framework for ensuring LLM agent safety by predicting risks using probabilistic reachability analysis, outperforming reactive systems.

DetailsMotivation: Existing rule-based safety systems for LLM agents lack foresight and struggle with long-term risks and distribution shifts.

Method: Pro2Guard abstracts behaviors into symbolic states, learns a DTMC from traces, and uses probabilistic reachability to anticipate risks, triggering interventions preemptively.

Result: Pro2Guard achieves 93.6% early safety enforcement in household agents and 100% risk prediction in autonomous driving, with configurable safety-task balance.

Conclusion: Pro2Guard effectively addresses LLM agent safety risks proactively, offering reliable and adaptable enforcement.

Abstract: Large Language Model (LLM) agents exhibit powerful autonomous capabilities across domains such as robotics, virtual assistants, and web automation. However, their stochastic behavior introduces significant safety risks that are difficult to anticipate. Existing rule-based enforcement systems, such as AgentSpec, focus on developing reactive safety rules, which typically respond only when unsafe behavior is imminent or has already occurred. These systems lack foresight and struggle with long-horizon dependencies and distribution shifts. To address these limitations, we propose Pro2Guard, a proactive runtime enforcement framework grounded in probabilistic reachability analysis. Pro2Guard abstracts agent behaviors into symbolic states and learns a Discrete-Time Markov Chain (DTMC) from execution traces. At runtime, it anticipates future risks by estimating the probability of reaching unsafe states, triggering interventions before violations occur when the predicted risk exceeds a user-defined threshold. By incorporating semantic validity checks and leveraging PAC bounds, Pro2Guard ensures statistical reliability while approximating the underlying ground-truth model. We evaluate Pro2Guard extensively across two safety-critical domains: embodied household agents and autonomous vehicles. In embodied agent tasks, Pro2Guard enforces safety early in up to 93.6% of unsafe tasks when using low thresholds, while configurable modes (e.g., reflect) allow balancing safety with task success, maintaining up to 80.4% task completion. In autonomous driving scenarios, Pro2Guard achieves 100% prediction of traffic law violations and collisions, anticipating risks up to 38.66 seconds ahead.
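
The core computation is standard probabilistic reachability on a learned DTMC: for each abstract state, estimate the probability of ever reaching an unsafe state, then intervene whenever the current state's risk exceeds the user threshold. A fixed-point sketch with toy transition probabilities (the state abstraction, PAC bounds, and semantic checks from the paper are omitted):

```python
import numpy as np

P = np.array([                # row-stochastic DTMC over 4 abstract states
    [0.90, 0.05, 0.03, 0.02],
    [0.40, 0.30, 0.20, 0.10],
    [0.00, 0.00, 1.00, 0.00],  # state 2: safe terminal
    [0.00, 0.00, 0.00, 1.00],  # state 3: unsafe (absorbing)
])
unsafe = [3]

risk = np.zeros(len(P)); risk[unsafe] = 1.0
for _ in range(1000):         # fixed point of risk = P @ risk, unsafe pinned to 1
    risk = P @ risk
    risk[unsafe] = 1.0

threshold = 0.3               # user-defined risk threshold
print(risk)                   # per-state probability of ever reaching unsafe
print([s for s in range(len(P)) if risk[s] > threshold])  # triggers intervention
```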

[242] MultiSHAP: A Shapley-Based Framework for Explaining Cross-Modal Interactions in Multimodal AI Models

Zhanliang Wang, Kai Wang

Main category: cs.AI

TL;DR: MultiSHAP is a model-agnostic framework using Shapley Interaction Index to explain cross-modal interactions in multimodal AI, providing instance- and dataset-level insights.

DetailsMotivation: Addressing the lack of interpretability in multimodal AI models, especially for high-stakes applications requiring trustworthiness.

Method: Leverages Shapley Interaction Index to quantify synergistic effects between fine-grained visual and textual elements, applicable to open- and closed-source models.

Result: Faithfully captures cross-modal reasoning mechanisms and reveals interaction patterns, validated on public benchmarks and real-world case studies.

Conclusion: MultiSHAP offers a general, extensible solution for interpreting complex multimodal AI models, enhancing transparency and trust.

Abstract: Multimodal AI models have achieved impressive performance in tasks that require integrating information from multiple modalities, such as vision and language. However, their “black-box” nature poses a major barrier to deployment in high-stakes applications where interpretability and trustworthiness are essential. How to explain cross-modal interactions in multimodal AI models remains a major challenge. While existing model explanation methods, such as attention map and Grad-CAM, offer coarse insights into cross-modal relationships, they cannot precisely quantify the synergistic effects between modalities, and are limited to open-source models with accessible internal weights. Here we introduce MultiSHAP, a model-agnostic interpretability framework that leverages the Shapley Interaction Index to attribute multimodal predictions to pairwise interactions between fine-grained visual and textual elements (such as image patches and text tokens), while being applicable to both open- and closed-source models. Our approach provides: (1) instance-level explanations that reveal synergistic and suppressive cross-modal effects for individual samples - “why the model makes a specific prediction on this input”, and (2) dataset-level explanation that uncovers generalizable interaction patterns across samples - “how the model integrates information across modalities”. Experiments on public multimodal benchmarks confirm that MultiSHAP faithfully captures cross-modal reasoning mechanisms, while real-world case studies demonstrate its practical utility. Our framework is extensible beyond two modalities, offering a general solution for interpreting complex multimodal AI models.
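
The underlying quantity is the pairwise Shapley Interaction Index, which can be estimated by Monte Carlo: drawing the conditioning subset by first sampling its size uniformly reproduces the index's combinatorial weights exactly, so the estimator below is unbiased. The value function here is a toy with a built-in synergy between elements 0 and 1; in MultiSHAP it would be the model's prediction with the unselected patches/tokens masked out.

```python
import random

def shapley_interaction(v, n, i, j, n_samples=2000, seed=0):
    rng = random.Random(seed)
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for _ in range(n_samples):
        k = rng.randint(0, len(others))   # uniform size = exact Shapley weights
        S = frozenset(rng.sample(others, k))
        # Discrete mixed difference: synergy of {i, j} on top of S.
        total += v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S)
    return total / n_samples

def v(S):  # toy value function with a built-in 0-1 synergy
    return 0.1 * len(S) + (1.0 if {0, 1} <= S else 0.0)

print(shapley_interaction(v, n=5, i=0, j=1))  # ~1.0: the synergy is recovered
```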

[243] From EMR Data to Clinical Insight: An LLM-Driven Framework for Automated Pre-Consultation Questionnaire Generation

Ruiqing Ding, Qianfang Sun, Yongkang Leng, Hui Yin, Xiaojian Li

Main category: cs.AI

TL;DR: A multi-stage LLM-driven framework improves pre-consultation questionnaire generation from EMRs by addressing completeness, logical order, and disease-level synthesis.

DetailsMotivation: Direct LLM approaches struggle with generating comprehensive pre-consultation questionnaires from complex EMRs due to issues like incomplete information and lack of disease-level synthesis.

Method: A three-stage framework: 1) extracts atomic assertions, 2) constructs personal causal networks and synthesizes disease knowledge, 3) generates tailored questionnaires.

Result: Outperforms direct methods in information coverage, diagnostic relevance, understandability, and generation time, validated by clinical experts.

Conclusion: The framework enhances patient information collection by building explicit clinical knowledge, demonstrating practical potential.

Abstract: Pre-consultation is a critical component of effective healthcare delivery. However, generating comprehensive pre-consultation questionnaires from complex, voluminous Electronic Medical Records (EMRs) is a challenging task. Direct Large Language Model (LLM) approaches face difficulties in this task, particularly regarding information completeness, logical order, and disease-level synthesis. To address this issue, we propose a novel multi-stage LLM-driven framework: Stage 1 extracts atomic assertions (key facts with timing) from EMRs; Stage 2 constructs personal causal networks and synthesizes disease knowledge by clustering representative networks from an EMR corpus; Stage 3 generates tailored personal and standardized disease-specific questionnaires based on these structured representations. This framework overcomes limitations of direct methods by building explicit clinical knowledge. Evaluated on a real-world EMR dataset and validated by clinical experts, our method demonstrates superior performance in information coverage, diagnostic relevance, understandability, and generation time, highlighting its practical potential to enhance patient information collection.

[244] Multi-Band Variable-Lag Granger Causality: A Unified Framework for Causal Time Series Inference across Frequencies

Chakattrai Sookkongwaree, Tattep Lakmuang, Chainarong Amornbunchornvej

Main category: cs.AI

TL;DR: The paper introduces Multi-Band Variable-Lag Granger Causality (MB-VLGC), a framework that extends traditional Granger causality by modeling frequency-dependent causal delays, outperforming existing methods.

DetailsMotivation: Existing Granger causality methods assume fixed or variable time lags but ignore frequency-dependent delays, which are crucial in domains like neuroscience.

Method: The authors formalize MB-VLGC, prove its theoretical soundness, and develop an efficient inference pipeline to handle frequency-dependent causal delays.

Result: MB-VLGC significantly outperforms existing methods on synthetic and real-world datasets, demonstrating broad applicability.

Conclusion: MB-VLGC provides a robust and generalizable solution for inferring causality in time series, especially where delays vary across frequency bands.

Abstract: Understanding causal relationships in time series is fundamental to many domains, including neuroscience, economics, and behavioral science. Granger causality is one of the well-known techniques for inferring causality in time series. Typically, Granger causality frameworks have a strong fixed-lag assumption between cause and effect, which is often unrealistic in complex systems. While recent work on variable-lag Granger causality (VLGC) addresses this limitation by allowing a cause to influence an effect with different time lags at each time point, it fails to account for the fact that causal interactions may vary not only in time delay but also across frequency bands. For example, in brain signals, alpha-band activity may influence another region with a shorter delay than slower delta-band oscillations. In this work, we formalize Multi-Band Variable-Lag Granger Causality (MB-VLGC) and propose a novel framework that generalizes traditional VLGC by explicitly modeling frequency-dependent causal delays. We provide a formal definition of MB-VLGC, demonstrate its theoretical soundness, and propose an efficient inference pipeline. Extensive experiments across multiple domains demonstrate that our framework significantly outperforms existing methods on both synthetic and real-world datasets, confirming its broad applicability to any type of time series data. Code and datasets are publicly available.
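
A crude approximation of band-aware causal testing, useful for intuition only: band-pass both series into a frequency band, then run an ordinary fixed-lag Granger test per band. This omits the paper's variable-lag machinery entirely; the band edges and lag range below are arbitrary choices.

```python
import numpy as np
from scipy.signal import butter, filtfilt
from statsmodels.tsa.stattools import grangercausalitytests

def band_granger(x, y, band, fs, maxlag=5):
    b, a = butter(4, band, btype="band", fs=fs)
    xb, yb = filtfilt(b, a, x), filtfilt(b, a, y)
    # Tests whether the 2nd column (x) Granger-causes the 1st (y);
    # this call prints a detailed report per lag.
    res = grangercausalitytests(np.column_stack([yb, xb]), maxlag=maxlag)
    return min(res[lag][0]["ssr_ftest"][1] for lag in res)  # best p-value over lags

fs = 100
t = np.arange(0, 20, 1 / fs)
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 10 * t) + 0.5 * rng.standard_normal(t.size)
y = np.roll(x, 3) + 0.5 * rng.standard_normal(t.size)  # y lags x by 3 samples
for band in [(1.0, 4.0), (8.0, 12.0)]:                  # delta vs. alpha bands
    print(band, band_granger(x, y, band, fs))           # causality shows up in alpha
```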

[245] Transparent Adaptive Learning via Data-Centric Multimodal Explainable AI

Maryam Mosleh, Marie Devlin, Ellis Solaiman

Main category: cs.AI

TL;DR: A hybrid framework combining XAI and generative AI for personalized, multimodal explanations in adaptive learning systems.

DetailsMotivation: Addressing the lack of transparency and user-centric explanations in AI-driven education systems.

Method: Proposes a hybrid framework integrating traditional XAI, generative AI, and user personalization.

Result: Redefines explainability as a dynamic, user-tailored communication process.

Conclusion: Aims to enhance transparency and user-centered experiences in AI-driven education.

Abstract: Artificial intelligence-driven adaptive learning systems are reshaping education through data-driven adaptation of learning experiences. Yet many of these systems lack transparency, offering limited insight into how decisions are made. Most explainable AI (XAI) techniques focus on technical outputs but neglect user roles and comprehension. This paper proposes a hybrid framework that integrates traditional XAI techniques with generative AI models and user personalisation to generate multimodal, personalised explanations tailored to user needs. We redefine explainability as a dynamic communication process tailored to user roles and learning goals. We outline the framework’s design, key XAI limitations in education, and research directions on accuracy, fairness, and personalisation. Our aim is to move towards explainable AI that enhances transparency while supporting user-centred experiences.

[246] Context-Aware Visualization for Explainable AI Recommendations in Social Media: A Vision for User-Aligned Explanations

Banan Alkhateeb, Ellis Solaiman

Main category: cs.AI

TL;DR: A vision paper proposes a user-segmented, context-aware visual explanation system for AI recommendations in social media, adapting explanation style and granularity to user needs.

DetailsMotivation: Improve user experience by addressing the lack of personalized and understandable AI recommendations in social media.

Method: Propose a visual explanation system with diverse methods, adapting style (visual vs. numeric) and granularity (expert vs. lay) in a single pipeline.

Result: A framework is introduced, and a pilot study with 30 X users will validate its impact on decision-making and trust.

Conclusion: The system aims to enhance user understanding and trust in AI recommendations by providing tailored explanations.

Abstract: Social media platforms today strive to improve user experience through AI recommendations, yet the value of such recommendations vanishes as users do not understand the reasons behind them. This issue arises because explainability in social media is general and lacks alignment with user-specific needs. In this vision paper, we outline a user-segmented and context-aware explanation layer by proposing a visual explanation system with diverse explanation methods. The proposed system is framed by the variety of user needs and contexts, showing explanations in different visualized forms, including a technically detailed version for AI experts and a simplified one for lay users. Our framework is the first to jointly adapt explanation style (visual vs. numeric) and granularity (expert vs. lay) inside a single pipeline. A public pilot with 30 X users will validate its impact on decision-making and trust.

[247] Unraveling Hidden Representations: A Multi-Modal Layer Analysis for Better Synthetic Content Forensics

Tom Or, Omri Azencot

Main category: cs.AI

TL;DR: Proposes using large pre-trained multi-modal models for universal fake content detection, achieving state-of-the-art results across modalities.

DetailsMotivation: Address the need for robust fake detectors due to the rise of generative models and deepfakes, as current methods generalize poorly across different generators and data domains.

Method: Utilizes latent codes from pre-trained multi-modal models to discriminate real from fake content, training linear classifiers on these features.

Result: Achieves state-of-the-art performance in fake detection for audio and images, with computational efficiency and effectiveness in few-shot settings.

Conclusion: Large pre-trained models offer a universal solution for fake detection, outperforming existing methods across diverse data modalities.

Abstract: Generative models achieve remarkable results in multiple data domains, including images and texts, among other examples. Unfortunately, malicious users exploit synthetic media for spreading misinformation and disseminating deepfakes. Consequently, the need for robust and stable fake detectors is pressing, especially when new generative models appear every day. While the majority of existing work train classifiers that discriminate between real and fake information, such tools typically generalize only within the same family of generators and data modalities, yielding poor results on other generative classes and data domains. Towards a universal classifier, we propose the use of large pre-trained multi-modal models for the detection of generative content. Effectively, we show that the latent code of these models naturally captures information discriminating real from fake. Building on this observation, we demonstrate that linear classifiers trained on these features can achieve state-of-the-art results across various modalities, while remaining computationally efficient, fast to train, and effective even in few-shot settings. Our work primarily focuses on fake detection in audio and images, achieving performance that surpasses or matches that of strong baseline methods.
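
The recipe reduces to a linear probe on frozen multi-modal features. In the sketch below, the encoder `embed` is faked with a random projection so the example runs end-to-end; in practice it would be a real pre-trained encoder (e.g. CLIP image embeddings), and the "real"/"fake" data here are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 64))
def embed(x):  # hypothetical frozen multi-modal encoder
    return np.tanh(x @ W)

real = rng.standard_normal((200, 512))
fake = rng.standard_normal((200, 512)) + 0.5  # toy distribution shift
X = embed(np.vstack([real, fake]))
y = np.array([0] * 200 + [1] * 200)

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)  # the linear probe
print("probe accuracy:", clf.score(Xte, yte))
```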

[248] Identifying Unique Spatial-Temporal Bayesian Network without Markov Equivalence

Mingyu Kang, Duxin Chen, Ning Meng, Gang Yan, Wenwu Yu

Main category: cs.AI

TL;DR: The paper proposes a Spatial-Temporal Bayesian Network (STBN) to model spatial-temporal causality, addressing limitations of existing methods like Directed Cyclic Graph and Full Time Graph. It introduces a High-order Causal Entropy (HCE) algorithm for unique identification of STBN, achieving state-of-the-art accuracy.

DetailsMotivation: Existing methods for modeling spatial-temporal causality, such as Directed Cyclic Graph and Full Time Graph, have limitations like Markov equivalence classes and assumptions of time-invariant causality. STBN aims to overcome these issues by modeling causality from an information transfer perspective.

Method: The paper proposes STBN to model spatial-temporal causality using information path blocking principles. It also introduces the HCE algorithm for uniquely identifying STBN with a time complexity of O(n³τ_max).

Result: Numerical experiments show the HCE algorithm achieves state-of-the-art identification accuracy compared to baseline methods.

Conclusion: STBN and the HCE algorithm provide a robust solution for modeling and identifying spatial-temporal causality, addressing key limitations of prior approaches.

Abstract: Identifying a vanilla Bayesian network to model spatial-temporal causality is a critical yet challenging task: when identifiability is not satisfied, different Markovian-equivalent directed acyclic graphs may be identified. To address this issue, the Directed Cyclic Graph was proposed to drop the directed-acyclic constraint, but its assumptions do not always hold, and it cannot model dynamical time-series processes. The Full Time Graph was then proposed, introducing high-order time delays; it has no Markov equivalence class under the assumption of no instantaneous effects, but it also assumes that causality is invariant over time, which is not always satisfied in spatio-temporal scenarios. Thus, in this work, a Spatial-Temporal Bayesian Network (STBN) is proposed to theoretically model spatial-temporal causality from the perspective of information transfer. STBN explains the disappearance of the network structures $X\rightarrow Z \rightarrow Y$ and $X\leftarrow Z \leftarrow Y$ by the principle of information path blocking, and the uniqueness of STBN is proved. Based on this, a High-order Causal Entropy (HCE) algorithm is proposed to uniquely identify STBN with time complexity $\mathcal{O}(n^3\tau_{max})$, where $n$ is the number of variables and $\tau_{max}$ is the maximum time delay. Numerical experiments comparing HCE with baseline algorithms show that it achieves state-of-the-art identification accuracy. The code is available at https://github.com/KMY-SEU/HCE.
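
For intuition, the sketch below implements a histogram-based conditional mutual information test for lagged dependence, the kind of information-theoretic quantity that causal-entropy identification builds on. It is an illustrative estimator only, not the paper's HCE algorithm.

```python
# Sketch: testing whether x's past informs y's present once y's own past is
# conditioned out, via conditional mutual information (CMI). Illustrative only.
import numpy as np

def entropy(*cols, bins=8):
    """Joint Shannon entropy of discretized columns."""
    joint, _ = np.histogramdd(np.column_stack(cols), bins=bins)
    p = joint.ravel() / joint.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def cmi(x, y, z, bins=8):
    """I(X; Y | Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)."""
    return (entropy(x, z, bins=bins) + entropy(y, z, bins=bins)
            - entropy(z, bins=bins) - entropy(x, y, z, bins=bins))

rng = np.random.default_rng(1)
T = 5000
x = rng.normal(size=T)
y = np.zeros(T)
y[1:] = 0.8 * x[:-1] + 0.4 * rng.normal(size=T - 1)   # x drives y with lag 1

print("x -> y:", round(cmi(x[:-1], y[1:], y[:-1]), 4))  # clearly positive
print("y -> x:", round(cmi(y[:-1], x[1:], x[:-1]), 4))  # near zero
```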

[249] Federated Cross-Training Learners for Robust Generalization under Data Heterogeneity

Zhuang Qi, Lei Meng, Ruohan Zhang, Yu Wang, Xin Qi, Xiangxu Meng, Han Yu, Qiang Yang

Main category: cs.AI

TL;DR: FedCT is a federated learning method using cross-training with knowledge distillation from global and local views to align feature spaces and improve generalization.

DetailsMotivation: Addressing feature space heterogeneity and misaligned optimization goals in federated learning due to differing data distributions.

Method: FedCT includes three modules: consistency-aware knowledge broadcasting, multi-view knowledge-guided representation learning, and mixup-based feature augmentation.

Result: FedCT outperforms state-of-the-art methods by alleviating knowledge forgetting and enhancing feature diversity.

Conclusion: FedCT effectively balances local and global knowledge, improving federated learning performance.

Abstract: Federated learning benefits from cross-training strategies, which enable models to train on data from distinct sources and thereby improve generalization capability. However, due to inherent differences in data distributions, the optimization goals of local models remain misaligned, and this mismatch continues to manifest as feature space heterogeneity even after cross-training. We argue that knowledge distillation from the personalized view preserves client-specific characteristics and expands the local knowledge base, while distillation from the global view provides consistent semantic anchors that facilitate feature alignment across clients. To achieve this goal, this paper presents a cross-training scheme, termed FedCT, which includes three main modules: the consistency-aware knowledge broadcasting module optimizes model assignment strategies, enhancing collaborative advantages between clients and enabling an efficient federated learning process; the multi-view knowledge-guided representation learning module leverages fused prototypical knowledge from both global and local views to preserve local knowledge before and after model exchange and to ensure consistency between local and global knowledge; and the mixup-based feature augmentation module aggregates rich information to further increase the diversity of feature spaces, enabling the model to better discriminate complex samples. Extensive experiments were conducted on four datasets covering performance comparison, ablation study, in-depth analysis and case study. The results demonstrate that FedCT alleviates knowledge forgetting from both local and global views, enabling it to outperform state-of-the-art methods.
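
As a rough illustration of distilling from both views, the following sketch combines a task loss with alignment terms against local and global class prototypes. The loss form, weights, and prototype construction are assumptions for illustration, not FedCT's exact formulation.

```python
# Sketch: dual-view prototype alignment, assuming per-class prototype vectors
# are available from the local client and from the global aggregate.
import torch
import torch.nn.functional as F

def dual_view_loss(features, logits, labels, local_protos, global_protos,
                   lam_local=0.5, lam_global=0.5):
    """features: (B, d); logits: (B, C); *_protos: (C, d) class prototypes."""
    task = F.cross_entropy(logits, labels)
    local_align = F.mse_loss(features, local_protos[labels])    # keep client knowledge
    global_align = F.mse_loss(features, global_protos[labels])  # consistent anchors
    return task + lam_local * local_align + lam_global * global_align

B, C, d = 8, 10, 32
feats, logits = torch.randn(B, d), torch.randn(B, C)
labels = torch.randint(0, C, (B,))
print(dual_view_loss(feats, logits, labels, torch.randn(C, d), torch.randn(C, d)))
```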

[250] BCR-DRL: Behavior- and Context-aware Reward for Deep Reinforcement Learning in Human-AI Coordination

Xin Hao, Bahareh Nakisa, Mohmmad Naim Rastgoo, Gaoyang Pang

Main category: cs.AI

TL;DR: The paper proposes a behavior- and context-aware reward (BCR) for deep reinforcement learning (DRL) to improve human-AI coordination by addressing sparse rewards and unpredictable human behaviors.

DetailsMotivation: DRL struggles with sparse rewards and unpredictable human behaviors in human-AI coordination (HAIC), limiting its ability to optimize exploration and exploitation.

Method: The BCR includes a dual intrinsic rewarding scheme (AI- and human-motivated rewards) for exploration and a context-aware weighting mechanism for exploitation.

Result: Simulations in the Overcooked environment show a 20% increase in cumulative sparse rewards and 38% improvement in sample efficiency over baselines.

Conclusion: The BCR effectively enhances DRL performance in HAIC by optimizing exploration and exploitation.

Abstract: Deep reinforcement learning (DRL) offers a powerful framework for training AI agents to coordinate with human partners. However, DRL faces two critical challenges in human-AI coordination (HAIC): sparse rewards and unpredictable human behaviors. These challenges significantly limit DRL's ability to identify effective coordination policies by impairing its capacity to balance exploration and exploitation. To address these limitations, we propose an innovative behavior- and context-aware reward (BCR) for DRL, which optimizes exploration and exploitation by leveraging human behaviors and contextual information in HAIC. Our BCR consists of two components: (i) a novel dual intrinsic rewarding scheme to enhance exploration, which combines an AI self-motivated intrinsic reward and a human-motivated intrinsic reward, both designed to increase the capture of sparse rewards via a logarithmic strategy; and (ii) a new context-aware weighting mechanism for the designed rewards to improve exploitation. This mechanism helps the AI agent prioritize actions that better coordinate with the human partner by utilizing contextual information that reflects the evolution of learning. Extensive simulations in the Overcooked environment demonstrate that our approach can increase the cumulative sparse rewards by approximately 20% and improve the sample efficiency by around 38% compared to state-of-the-art baselines.
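
A minimal sketch of such a shaped reward appears below: a sparse task reward plus two logarithmically decaying intrinsic bonuses, with a context-dependent weight that decays as learning progresses. The specific functional forms are assumptions; the paper's definitions may differ.

```python
# Sketch: a behavior- and context-aware shaped reward in the spirit of BCR.
# The reward terms and the weighting schedule are illustrative assumptions.
import math

def shaped_reward(sparse_reward: float,
                  ai_visit_count: int,
                  human_visit_count: int,
                  progress: float) -> float:
    """ai_visit_count / human_visit_count: how often the AI (resp. the human
    partner) has produced this state or behavior; rare events earn a larger
    logarithmic bonus. progress in [0, 1] tracks learning and shifts weight
    from exploration toward exploitation."""
    r_ai = 1.0 / math.log(ai_visit_count + math.e)        # AI self-motivated bonus
    r_human = 1.0 / math.log(human_visit_count + math.e)  # human-motivated bonus
    w = 1.0 - progress              # context-aware weight decays as learning evolves
    return sparse_reward + w * (r_ai + r_human)

print(shaped_reward(sparse_reward=0.0, ai_visit_count=1,
                    human_visit_count=50, progress=0.2))
```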

[251] Causal Explanations for Image Classifiers

Hana Chockler, David A. Kelly, Daniel Kroening, Youcheng Sun

Main category: cs.AI

TL;DR: A novel black-box approach for explaining image classifier outputs, grounded in actual causality theory, outperforms existing tools in efficiency and explanation quality.

DetailsMotivation: Existing explanation tools lack a principled approach based on formal causality definitions.

Method: Proposes an algorithm for approximate explanations using actual causality theory, with proven termination and complexity analysis.

Result: ReX tool implementation shows superior efficiency, smaller explanations, and better quality than state-of-the-art tools.

Conclusion: The approach provides a robust, efficient, and high-quality solution for explaining classifier outputs.

Abstract: Existing algorithms for explaining the output of image classifiers use different definitions of explanations and a variety of techniques to extract them. However, none of the existing tools use a principled approach based on formal definitions of causes and explanations for the explanation extraction. In this paper we present a novel black-box approach to computing explanations grounded in the theory of actual causality. We prove relevant theoretical results and present an algorithm for computing approximate explanations based on these definitions. We prove termination of our algorithm and discuss its complexity and the amount of approximation compared to the precise definition. We implemented the framework in a tool, ReX, and present experimental results and a comparison with state-of-the-art tools. We demonstrate that ReX is the most efficient tool and produces the smallest explanations, in addition to outperforming other black-box tools on standard quality measures.

[252] OR-LLM-Agent: Automating Modeling and Solving of Operations Research Optimization Problems with Reasoning LLM

Bowen Zhang, Pengcheng Luo, Genke Yang, Boon-Hee Soong, Chau Yuen

Main category: cs.AI

TL;DR: The paper proposes OR-LLM-Agent, a framework using reasoning LLMs for automated OR problem-solving, outperforming existing methods by at least 7% in accuracy.

DetailsMotivation: Existing methods for OR problem-solving with LLMs are limited by non-reasoning models, prompting the need for a more effective framework.

Method: OR-LLM-Agent decomposes OR tasks into three stages (mathematical modeling, code generation, debugging) handled by sub-agents, and introduces the BWOR dataset for evaluation.

Result: OR-LLM-Agent with DeepSeek-R1 outperforms GPT-o3, Gemini 2.5 Pro, and others by at least 7% in accuracy.

Conclusion: Task decomposition with reasoning LLMs significantly improves OR problem-solving, as validated by the BWOR dataset.

Abstract: With the rise of artificial intelligence (AI), applying large language models (LLMs) to mathematical problem-solving has attracted increasing attention. Most existing approaches attempt to improve Operations Research (OR) optimization problem-solving through prompt engineering or fine-tuning strategies for LLMs. However, these methods are fundamentally constrained by the limited capabilities of non-reasoning LLMs. To overcome these limitations, we propose OR-LLM-Agent, an AI agent framework built on reasoning LLMs for automated OR problem solving. The framework decomposes the task into three sequential stages: mathematical modeling, code generation, and debugging. Each task is handled by a dedicated sub-agent, which enables more targeted reasoning. We also construct BWOR, an OR dataset for evaluating LLM performance on OR tasks. Our analysis shows that in the benchmarks NL4OPT, MAMO, and IndustryOR, reasoning LLMs sometimes underperform their non-reasoning counterparts within the same model family. In contrast, BWOR provides a more consistent and discriminative assessment of model capabilities. Experimental results demonstrate that OR-LLM-Agent utilizing DeepSeek-R1 in its framework outperforms advanced methods, including GPT-o3, Gemini 2.5 Pro, DeepSeek-R1, and ORLM, by at least 7% in accuracy. These results confirm the effectiveness of task decomposition for OR problem solving.
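
The decomposition can be pictured as a simple three-stage loop, sketched below with a hypothetical `call_llm` stub (any chat-completion client could be substituted). This illustrates the stage structure only, not the paper's implementation.

```python
# Sketch: modeling -> code generation -> debugging, driven by execution feedback.
# `call_llm` and the prompts are invented for illustration.
import subprocess
import tempfile

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a reasoning-LLM client here")

def solve_or_problem(problem: str, max_debug_rounds: int = 3) -> str:
    # Stage 1: mathematical modeling
    model = call_llm(f"Formulate this OR problem as a mathematical model:\n{problem}")
    # Stage 2: code generation
    code = call_llm(f"Write Python code (e.g., using PuLP) that solves:\n{model}")
    # Stage 3: debugging loop using the interpreter's error output
    for _ in range(max_debug_rounds):
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
        run = subprocess.run(["python", f.name], capture_output=True, text=True)
        if run.returncode == 0:
            return run.stdout
        code = call_llm(f"Fix this code given the error:\n{code}\n{run.stderr}")
    raise RuntimeError("debugging budget exhausted")
```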

[253] BOOST: Bootstrapping Strategy-Driven Reasoning Programs for Program-Guided Fact-Checking

Qisheng Hu, Quanyu Long, Wenya Wang

Main category: cs.AI

TL;DR: BOOST introduces a bootstrapping method for automated few-shot reasoning program generation, eliminating manual demonstrations and improving claim verification.

DetailsMotivation: Current approaches rely on human-crafted demonstrations, which are labor-intensive and limit scalability. BOOST aims to automate this process.

Method: BOOST uses a critique-refine loop to iteratively refine meta-rules for generating reasoning programs, enabling zero-shot to few-shot learning.

Result: BOOST outperforms prior few-shot baselines in complex claim verification, showing improved interpretability and effectiveness.

Conclusion: BOOST automates reasoning program generation, reducing human effort and enhancing performance in fact-checking tasks.

Abstract: Large language model pipelines have improved automated fact-checking for complex claims, yet many approaches rely on few-shot in-context learning with demonstrations that require substantial human effort and domain expertise. Among these, program-guided reasoning, which decomposes claims into function calls and executes reasoning programs, has shown particular promise, but remains limited by the need for manually crafted demonstrations. Fundamentally, the underlying principles of effective reasoning program generation remain underexplored. In this work, we introduce BOOST, a bootstrapping approach for automated few-shot reasoning program generation. BOOST iteratively refines explicit, data-driven guidelines as meta-rules for guiding demonstration creation, using a critique-refine loop that eliminates the need for human intervention. This enables a seamless transition from zero-shot to few-shot program-guided learning, enhancing interpretability and effectiveness. Experimental results show that BOOST outperforms prior few-shot baselines in both zero-shot and few-shot settings for complex claim verification.
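
The critique-refine loop can be sketched as follows, again with a hypothetical `call_llm` stub; the prompts and the seed guideline are invented for illustration.

```python
# Sketch: bootstrapping demonstration-writing guidelines via critique-refine.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def bootstrap_guidelines(claims: list[str], rounds: int = 3) -> str:
    guidelines = "Decompose each claim into verifiable function calls."  # seed
    for _ in range(rounds):
        # Generate demonstrations under the current guidelines.
        demos = [call_llm(f"Guidelines:\n{guidelines}\n"
                          f"Write a reasoning program for this claim: {c}")
                 for c in claims]
        # Critique the batch, then fold the critique back into the guidelines.
        critique = call_llm("Critique these reasoning programs; list recurring "
                            "failure patterns:\n" + "\n---\n".join(demos))
        guidelines = call_llm("Refine the guidelines using the critique.\n"
                              f"Guidelines:\n{guidelines}\nCritique:\n{critique}")
    return guidelines  # meta-rules that guide few-shot demonstration creation
```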

[254] The Urban Impact of AI: Modeling Feedback Loops in Next-Venue Recommendation

Giovanni Mauro, Marco Minici, Luca Pappalardo

Main category: cs.AI

TL;DR: The paper introduces a simulation framework to study the systemic impact of next-venue recommender systems on urban dynamics, revealing trade-offs between individual diversity and collective inequality.

DetailsMotivation: To address the lack of attention on the systemic impact of next-venue recommender systems beyond predictive accuracy, focusing on their influence on urban dynamics and inequality.

Method: A simulation framework models the human-AI feedback loop in next-venue recommendation, using real-world mobility data to explore effects of different recommendation strategies.

Result: Recommender systems increase individual-level venue diversity but may amplify collective inequality by concentrating visits on popular places, affecting urban accessibility and social co-location networks.

Conclusion: The framework provides a tool to assess societal impacts of AI-assisted mobility, aiding in anticipating risks, evaluating regulations, and designing ethical algorithms.

Abstract: Next-venue recommender systems are increasingly embedded in location-based services, shaping individual mobility decisions in urban environments. While their predictive accuracy has been extensively studied, less attention has been paid to their systemic impact on urban dynamics. In this work, we introduce a simulation framework to model the human-AI feedback loop underpinning next-venue recommendation, capturing how algorithmic suggestions influence individual behavior, which in turn reshapes the data used to retrain the models. Our simulations, grounded in real-world mobility data, systematically explore the effects of algorithmic adoption across a range of recommendation strategies. We find that while recommender systems consistently increase individual-level diversity in visited venues, they may simultaneously amplify collective inequality by concentrating visits on a limited subset of popular places. This divergence extends to the structure of social co-location networks, revealing broader implications for urban accessibility and spatial segregation. Our framework operationalizes the feedback loop in next-venue recommendation and offers a novel lens through which to assess the societal impact of AI-assisted mobility, providing a computational tool to anticipate future risks, evaluate regulatory interventions, and inform the design of ethical algorithmic systems.
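
A toy version of the feedback loop illustrates the amplification effect: a popularity-biased recommender feeds visit counts back into its own training signal, and visit inequality (measured by the Gini coefficient) grows. The recommender and acceptance model here are deliberately simplistic assumptions, unlike the paper's data-grounded simulation.

```python
# Sketch: a minimal human-AI feedback loop for venue recommendation.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_venues, steps = 200, 50, 100
visits = np.ones(n_venues)                  # global visit counts (the "data")

def gini(x):
    """Gini coefficient of a nonnegative vector (0 = equal, 1 = concentrated)."""
    x = np.sort(x)
    n = len(x)
    return 2 * np.sum(np.arange(1, n + 1) * x) / (n * x.sum()) - (n + 1) / n

print("initial Gini:", round(gini(visits), 3))
for _ in range(steps):
    for _ in range(n_users):
        p = visits / visits.sum()           # recommend proportionally to popularity...
        venue = rng.choice(n_venues, p=p)   # ...user accepts the suggestion...
        visits[venue] += 1                  # ...which feeds back into the data
print("final Gini:  ", round(gini(visits), 3))   # inequality concentrates
```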

[255] World Model-Based Learning for Long-Term Age of Information Minimization in Vehicular Networks

Lingyi Wang, Rashed Shelim, Walid Saad, Naren Ramakrishnan

Main category: cs.AI

TL;DR: A world model-based learning framework improves data efficiency and reduces packet-completeness-aware age of information (CAoI) in mmWave V2X networks, outperforming traditional RL methods.

DetailsMotivation: Traditional RL methods in wireless networks are inefficient and short-sighted, especially in dynamic, high-uncertainty environments like mmWave V2X networks.

Method: Proposes a world model to learn the environment dynamics and imagine trajectories for link scheduling, reducing reliance on real-time interactions.

Result: Achieves 26% and 16% improvement in CAoI over model-based and model-free RL, respectively, with higher data efficiency.

Conclusion: The world model framework is effective for long-term planning in dynamic wireless networks, outperforming conventional RL approaches.

Abstract: Traditional reinforcement learning (RL)-based learning approaches for wireless networks rely on expensive trial-and-error mechanisms and real-time feedback based on extensive environment interactions, which leads to low data efficiency and short-sighted policies. These limitations become particularly problematic in complex, dynamic networks with high uncertainty and long-term planning requirements. To address these limitations, in this paper, a novel world model-based learning framework is proposed to minimize packet-completeness-aware age of information (CAoI) in a vehicular network. Specifically, a challenging representative scenario is considered pertaining to a millimeter-wave (mmWave) vehicle-to-everything (V2X) communication network, which is characterized by high mobility, frequent signal blockages, and extremely short coherence time. A world model framework is then proposed to jointly learn a dynamic model of the mmWave V2X environment and use it to imagine trajectories for learning how to perform link scheduling. In particular, the long-term policy is learned from differentiable imagined trajectories instead of environment interactions. Moreover, owing to its imagination abilities, the world model can jointly predict time-varying wireless data and optimize link scheduling in real-world wireless and V2X networks. Thus, during intervals without actual observations, the world model remains capable of making efficient decisions. Extensive experiments are performed on a realistic simulator based on Sionna that integrates physics-based end-to-end channel modeling, ray-tracing, and scene geometries with material properties. Simulation results show that the proposed world model achieves a significant improvement in data efficiency, and achieves 26% and 16% improvements in CAoI compared to the model-based RL (MBRL) method and the model-free RL (MFRL) method, respectively.
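
Schematically, world-model-based training alternates between fitting a dynamics model on a small amount of real experience and improving the policy on imagined rollouts. The sketch below shows that generic loop with duck-typed stand-ins (`env`, `world_model`, and `policy` are assumed interfaces); the paper's model additionally predicts time-varying wireless dynamics for link scheduling.

```python
# Sketch: the generic world-model RL loop (collect, fit, imagine, update).
def world_model_rl(env, world_model, policy, epochs=100, real_steps=10,
                   imagined_rollouts=100, horizon=15):
    buffer = []
    for _ in range(epochs):
        obs = env.reset()
        for _ in range(real_steps):                 # a little real experience
            action = policy.act(obs)
            next_obs, reward = env.step(action)
            buffer.append((obs, action, reward, next_obs))
            obs = next_obs
        world_model.fit(buffer)                     # learn the dynamics model
        for _ in range(imagined_rollouts):          # policy improvement happens
            traj = world_model.imagine(policy, obs, horizon)  # in imagination,
            policy.update(traj)                     # with no env interaction
    return policy
```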

[256] ORFS-agent: Tool-Using Agents for Chip Design Optimization

Amur Ghose, Andrew B. Kahng, Sayak Kundu, Zhiang Wang

Main category: cs.AI

TL;DR: ORFS-agent, an LLM-based iterative optimization agent, improves parameter tuning in hardware design flows, outperforming Bayesian optimization with 13% better design metrics and 40% fewer iterations.

DetailsMotivation: To leverage LLMs for high-dimensional optimization in integrated circuit design, addressing the challenge of tuning thousands of parameters for better performance, power, and area.

Method: ORFS-agent adaptively explores parameter configurations in an open-source hardware design flow, using natural language objectives for multi-objective optimization.

Result: Empirical evaluations show 13% improvement in routed wirelength and effective clock period, with 40% fewer iterations compared to standard methods.

Conclusion: ORFS-agent offers a flexible, interpretable, and model-agnostic framework for multi-objective optimization in hardware design.

Abstract: Machine learning has been widely used to optimize complex engineering workflows across numerous domains. In the context of integrated circuit design, modern flows (e.g., going from a register-transfer level netlist to physical layouts) involve extensive configuration via thousands of parameters, and small changes to these parameters can have large downstream impacts on desired outcomes, namely design performance, power, and area. Recent advances in Large Language Models (LLMs) offer new opportunities for learning and reasoning within such high-dimensional optimization tasks. In this work, we introduce ORFS-agent, an LLM-based iterative optimization agent that automates parameter tuning in an open-source hardware design flow. ORFS-agent adaptively explores parameter configurations, demonstrating clear improvements over standard Bayesian optimization approaches in terms of resource efficiency and final design metrics. Our empirical evaluations on two different technology nodes and a range of circuit benchmarks indicate that ORFS-agent can improve both routed wirelength and effective clock period by over 13%, all while using 40% fewer optimization iterations. Moreover, by following natural language objectives to trade off certain metrics for others, ORFS-agent demonstrates a flexible and interpretable framework for multi-objective optimization. Crucially, ORFS-agent is modular and model-agnostic, and can be plugged into any frontier LLM without any further fine-tuning.

[257] Dynamic Knowledge Exchange and Dual-diversity Review: Concisely Unleashing the Potential of a Multi-Agent Research Team

Weilun Yu, Shixiang Tang, Yonggui Huang, Nanqing Dong, Li Fan, Honggang Qi, Wei Liu, Xiaoli Diao, Xi Chen, Wanli Ouyang

Main category: cs.AI

TL;DR: IDVSCI is a multi-agent LLM framework with Dynamic Knowledge Exchange and Dual-Diversity Review, outperforming existing systems in scientific discovery.

DetailsMotivation: Current LLM-based scientist agents lack interactive reasoning and evaluation, limiting their real-world research applicability.

Method: Proposes IDVSCI with Dynamic Knowledge Exchange for iterative feedback and Dual-Diversity Review for expert evaluation.

Result: IDVSCI achieves top performance on computer science and health sciences datasets, surpassing AI Scientist and VIRSCI.

Conclusion: Modeling interaction and peer review in LLM-based research enhances creativity and impact.

Abstract: Scientific progress increasingly relies on effective collaboration among researchers, a dynamic that large language models (LLMs) have only begun to emulate. While recent LLM-based scientist agents show promise in autonomous scientific discovery, they often lack the interactive reasoning and evaluation mechanisms essential to real-world research. We propose IDVSCI (Internal Discussion and Vote SCIentists), a multi-agent framework built on LLMs that incorporates two key innovations: a Dynamic Knowledge Exchange mechanism enabling iterative feedback among agents, and a Dual-Diversity Review paradigm that simulates heterogeneous expert evaluation. These components jointly promote deeper reasoning and the generation of more creative and impactful scientific ideas. To evaluate the effectiveness and generalizability of our approach, we conduct experiments on two datasets: a widely used benchmark in computer science and a new dataset we introduce in the health sciences domain. Results show that IDVSCI consistently achieves the best performance across both datasets, outperforming existing systems such as AI Scientist and VIRSCI. These findings highlight the value of modeling interaction and peer review dynamics in LLM-based autonomous research.

[258] Sound and Complete Neurosymbolic Reasoning with LLM-Grounded Interpretations

Bradley P. Allen, Prateek Chhikara, Thomas Macaulay Ferguson, Filip Ilievski, Paul Groth

Main category: cs.AI

TL;DR: A method integrates LLMs into paraconsistent logic to address their inconsistency, preserving soundness and completeness while leveraging their knowledge.

DetailsMotivation: LLMs show logical inconsistency despite their capabilities; the goal is to harness their knowledge for formal reasoning.

Method: Directly integrate an LLM into the interpretation function of paraconsistent logic’s formal semantics.

Result: Experimental evidence supports feasibility using datasets from factuality benchmarks.

Conclusion: The method provides a theoretical framework for neurosymbolic reasoning, balancing LLM knowledge with logical soundness.

Abstract: Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but they exhibit problems with logical consistency in the output they generate. How can we harness LLMs’ broad-coverage parametric knowledge in formal reasoning despite their inconsistency? We present a method for directly integrating an LLM into the interpretation function of the formal semantics for a paraconsistent logic. We provide experimental evidence for the feasibility of the method by evaluating the function using datasets created from several short-form factuality benchmarks. Unlike prior work, our method offers a theoretical framework for neurosymbolic reasoning that leverages an LLM’s knowledge while preserving the underlying logic’s soundness and completeness properties.
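
One way to picture an LLM-grounded interpretation is a Belnap-Dunn style four-valued assignment, where an atom's value is a pair of "supported" and "refuted" verdicts obtained from two LLM queries. The sketch below is an assumption-laden illustration of that idea (including the `ask_llm` stub), not the paper's formal semantics.

```python
# Sketch: a four-valued (paraconsistent) interpretation grounded in an LLM.
# (True, True) encodes a glut (both supported and refuted); (False, False)
# encodes a gap (neither). Prompts and the stub are illustrative assumptions.
def ask_llm(question: str) -> bool:
    raise NotImplementedError("plug in an LLM client returning yes/no")

def interpret_atom(statement: str):
    """Return the pair (supported, refuted) for an atomic sentence."""
    supported = ask_llm(f"Is the following statement true? {statement}")
    refuted = ask_llm(f"Is the following statement false? {statement}")
    return (supported, refuted)

def v_and(a, b):
    # A conjunction is supported if both conjuncts are; refuted if either is.
    return (a[0] and b[0], a[1] or b[1])

def v_not(a):
    # Negation swaps support and refutation.
    return (a[1], a[0])
```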

[259] On Gradual Semantics for Assumption-Based Argumentation

Anna Rapberger, Fabrizio Russo, Antonio Rago, Francesca Toni

Main category: cs.AI

TL;DR: The paper introduces gradual semantics for assumption-based argumentation (ABA), filling a gap in computational argumentation by extending modular gradual semantics from QBAFs to ABA frameworks.

DetailsMotivation: Gradual semantics are useful for ABA, a popular structured argumentation form, but no such semantics exist for ABA despite their potential applications.

Method: The authors propose novel gradual semantics for ABA by abstracting ABA frameworks into bipolar set-based argumentation frameworks and generalizing QBAF modular semantics.

Result: The proposed gradual ABA semantics satisfy adapted properties like balance and monotonicity. Experiments compare these semantics with an argument-based baseline.

Conclusion: The study successfully extends gradual semantics to ABA, demonstrating their feasibility and comparing them with alternative approaches.

Abstract: In computational argumentation, gradual semantics are fine-grained alternatives to extension-based and labelling-based semantics. They ascribe a dialectical strength to (components of) arguments, sanctioning their degree of acceptability. Several gradual semantics have been studied for abstract, bipolar and quantitative bipolar argumentation frameworks (QBAFs), as well as, to a lesser extent, for some forms of structured argumentation. However, this has not been the case for assumption-based argumentation (ABA), despite it being a popular form of structured argumentation with several applications where gradual semantics could be useful. In this paper, we fill this gap and propose a family of novel gradual semantics for equipping assumptions, which are the core components in ABA frameworks, with dialectical strengths. To do so, we use bipolar set-based argumentation frameworks as an abstraction of (potentially non-flat) ABA frameworks and generalise state-of-the-art modular gradual semantics for QBAFs. We show that our gradual ABA semantics satisfy suitable adaptations of desirable properties of gradual QBAF semantics, such as balance and monotonicity. We also explore an argument-based approach that leverages established QBAF modular semantics directly, and use it as a baseline. Finally, we conduct experiments with synthetic ABA frameworks to compare our gradual ABA semantics with their argument-based counterpart and assess convergence.
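
For readers new to gradual semantics, the sketch below iterates a DF-QuAD-style modular update on a bipolar graph until approximate convergence; it illustrates the class of semantics being generalized, not the paper's ABA semantics.

```python
# Sketch: iterated DF-QuAD-style strength computation on a bipolar graph.
def gradual_strengths(base, attackers, supporters, iters=100):
    """base: {node: base score in [0,1]}; attackers/supporters: {node: [nodes]}."""
    s = dict(base)
    for _ in range(iters):
        new = {}
        for n in base:
            att = 1.0
            for a in attackers.get(n, []):
                att *= (1.0 - s[a])            # aggregate attacker strengths
            sup = 1.0
            for p in supporters.get(n, []):
                sup *= (1.0 - s[p])            # aggregate supporter strengths
            agg = (1.0 - sup) - (1.0 - att)    # support influence minus attack
            if agg >= 0:                        # move base score toward 1 or 0
                new[n] = base[n] + (1.0 - base[n]) * agg
            else:
                new[n] = base[n] + base[n] * agg
        s = new
    return s

# b attacks a: a's strength drops from its 0.5 base score to 0.25.
print(gradual_strengths({"a": 0.5, "b": 0.5},
                        attackers={"a": ["b"]}, supporters={}))
```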

[260] E.A.R.T.H.: Structuring Creative Evolution through Model Error in Generative AI

Yusen Peng, Shuhua Mao

Main category: cs.AI

TL;DR: The paper introduces the E.A.R.T.H. framework, a five-stage pipeline that leverages model-generated errors to enhance AI creativity, achieving significant improvements in novelty, relevance, and human-rated quality.

DetailsMotivation: To move AI beyond imitation by harnessing errors as creative assets, inspired by the idea that 'creative potential hides in failure.'

Method: A five-stage pipeline (Error, Amplification, Refine, Transform, Harness) using structured prompts, semantic scoring, and human feedback, implemented with models like LLaMA-2-7B-Chat and Stable Diffusion.

Result: Creativity scores improved by 52.5%, with outputs rated highly for novelty and relevance. Human evaluations confirmed strong creative quality and emotional resonance.

Conclusion: Error-driven, feedback-based generation enhances AI creativity, offering a scalable approach for human-aligned creative AI.

Abstract: How can AI move beyond imitation toward genuine creativity? This paper proposes the E.A.R.T.H. framework, a five-stage generative pipeline that transforms model-generated errors into creative assets through Error generation, Amplification, Refine selection, Transform, and Harness feedback. Drawing on cognitive science and generative modeling, we posit that “creative potential hides in failure” and operationalize this via structured prompts, semantic scoring, and human-in-the-loop evaluation. Implemented using LLaMA-2-7B-Chat, SBERT, BERTScore, CLIP, BLIP-2, and Stable Diffusion, the pipeline employs a composite reward function based on novelty, surprise, and relevance. At the Refine stage, creativity scores increase by 52.5% (1.179 to 1.898, t = -5.56, p < 0.001), with final outputs reaching 2.010, a 70.4% improvement. Refined slogans are 48.4% shorter and 40.7% more novel, with only a 4.0% drop in relevance. Cross-modal tests show strong slogan-to-image alignment (CLIPScore: 0.249; BERTScore F1: 0.816). In human evaluations, the generated outputs were consistently rated highly, demonstrating strong creative quality and expressive clarity. Feedback highlights stylistic precision and emotional resonance. These results demonstrate that error-centered, feedback-driven generation enhances creativity, offering a scalable path toward self-evolving, human-aligned creative AI.
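
A composite creativity reward of the kind described can be sketched as a weighted sum of novelty, surprise, and relevance computed from embedding similarities. The weights, the `baseline` embedding (standing for the "obvious" output), and the scoring functions are assumptions; the paper scores with SBERT, BERTScore, and CLIP.

```python
# Sketch: a composite novelty/surprise/relevance reward over embeddings.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def creativity_score(candidate, topic, baseline, corpus, w=(0.4, 0.3, 0.3)):
    novelty = 1.0 - max(cosine(candidate, e) for e in corpus)  # unlike past outputs
    surprise = 1.0 - cosine(candidate, baseline)               # unlike the obvious output
    relevance = cosine(candidate, topic)                       # still on-topic
    return w[0] * novelty + w[1] * surprise + w[2] * relevance

rng = np.random.default_rng(0)
cand, topic, base = (rng.normal(size=8) for _ in range(3))
corpus = [rng.normal(size=8) for _ in range(5)]
print(round(creativity_score(cand, topic, base, corpus), 3))
```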

[261] A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence

Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, Hongru Wang, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Qihan Ren, Cheng Qian, Zhenhailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, Mengdi Wang

Main category: cs.AI

TL;DR: The paper reviews self-evolving agents for LLMs, addressing what, when, and how to evolve, with applications in coding, education, and healthcare.

DetailsMotivation: LLMs are static and struggle with adaptation in dynamic environments, prompting the need for self-evolving agents.

Method: The survey categorizes evolutionary mechanisms, adaptation methods, and designs, analyzing evaluation metrics and benchmarks.

Result: It provides a framework for designing adaptive agents, highlighting applications and challenges like safety and scalability.

Conclusion: The survey lays a roadmap for advancing self-evolving agents toward Artificial Super Intelligence (ASI).

Abstract: Large Language Models (LLMs) have demonstrated strong capabilities but remain fundamentally static, unable to adapt their internal parameters to novel tasks, evolving knowledge domains, or dynamic interaction contexts. As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck, necessitating agents that can adaptively reason, act, and evolve in real time. This paradigm shift, from scaling static models to developing self-evolving agents, has sparked growing interest in architectures and methods enabling continual learning and adaptation from data, interactions, and experiences. This survey provides the first systematic and comprehensive review of self-evolving agents, organized around three foundational dimensions: what to evolve, when to evolve, and how to evolve. We examine evolutionary mechanisms across agent components (e.g., models, memory, tools, architecture), categorize adaptation methods by stages (e.g., intra-test-time, inter-test-time), and analyze the algorithmic and architectural designs that guide evolutionary adaptation (e.g., scalar rewards, textual feedback, single-agent and multi-agent systems). Additionally, we analyze evaluation metrics and benchmarks tailored for self-evolving agents, highlight applications in domains such as coding, education, and healthcare, and identify critical challenges and research directions in safety, scalability, and co-evolutionary dynamics. By providing a structured framework for understanding and designing self-evolving agents, this survey establishes a roadmap for advancing adaptive agentic systems in both research and real-world deployments, ultimately shedding light on the path toward Artificial Super Intelligence (ASI), where agents evolve autonomously and perform at or beyond human-level intelligence across a wide array of tasks.

[262] How Far Are AI Scientists from Changing the World?

Qiujie Xie, Yixuan Weng, Minjun Zhu, Fuchen Shen, Shulin Huang, Zhen Lin, Jiahui Zhou, Zilan Mao, Zijie Yang, Linyi Yang, Jian Wu, Yue Zhang

Main category: cs.AI

TL;DR: The paper surveys the progress of AI Scientist systems, assessing their potential to revolutionize scientific research and identifying key challenges and goals for future development.

DetailsMotivation: To evaluate how close AI Scientist systems are to transforming scientific research and uncovering groundbreaking discoveries.

Method: A prospect-driven review analyzing current achievements, bottlenecks, and essential components for advanced AI Scientists.

Result: Identifies limitations and gaps in current AI Scientist systems, outlining what is needed for them to achieve human-level scientific discovery.

Conclusion: The survey aims to clarify the current state, missing elements, and ultimate objectives for AI in scientific research.

Abstract: The emergence of large language models (LLMs) is propelling automated scientific discovery to the next level, with LLM-based Artificial Intelligence (AI) Scientist systems now taking the lead in scientific research. Several influential works have already appeared in the field of AI Scientist systems, with AI-generated research papers having been accepted at the ICLR 2025 workshop, suggesting that a human-level AI Scientist, capable of uncovering phenomena previously unknown to humans, may soon become a reality. In this survey, we focus on the central question: How far are AI scientists from changing the world and reshaping the scientific research paradigm? To answer this question, we provide a prospect-driven review that comprehensively analyzes the current achievements of AI Scientist systems, identifying key bottlenecks and the critical components required for the emergence of a scientific agent capable of producing ground-breaking discoveries that solve grand challenges. We hope this survey will contribute to a clearer understanding of the limitations of current AI Scientist systems, showing where we are, what is missing, and what the ultimate goals for scientific AI should be.

[263] Semantic Chain-of-Trust: Autonomous Trust Orchestration for Collaborator Selection via Hypergraph-Aided Agentic AI

Botao Zhu, Xianbin Wang, Dusit Niyato

Main category: cs.AI

TL;DR: Proposes an autonomous trust orchestration method using agentic AI and hypergraph for efficient trust evaluation in collaborative systems.

DetailsMotivation: Addresses the complexity and resource consumption of trust evaluations in distributed collaboration due to task complexity and dynamic device resources.

Method: Uses agentic AI for autonomous trust evaluations during device idle periods and a trust hypergraph for hierarchical management and multi-hop collaboration.

Result: Achieves resource-efficient trust evaluation, balancing overhead and accuracy.

Conclusion: The method enables efficient utilization of distributed resources and supports large-scale collaboration.

Abstract: In collaborative systems, the effective completion of tasks hinges on task-specific trust evaluations of potential devices for distributed collaboration. However, the complexity of tasks, the spatiotemporal dynamism of distributed device resources, and the inevitable assessment overhead dramatically increase the complexity and resource consumption of the trust evaluation process. As a result, ill-timed or overly frequent trust evaluations can reduce the utilization rate of constrained resources, negatively affecting collaborative task execution. To address this challenge, this paper proposes an autonomous trust orchestration method based on a new concept of semantic chain-of-trust. Our technique employs agentic AI and hypergraphs to establish and maintain trust relationships among devices. By leveraging its strengths in autonomous perception, task decomposition, and semantic reasoning, agentic AI perceives device states and autonomously performs trust evaluations of collaborators based on historical performance data only during device idle periods, thereby enabling efficient utilization of distributed resources. In addition, agentic AI performs task-specific trust evaluations on collaborator resources by analyzing the alignment between resource capabilities and task requirements. Moreover, by maintaining a trust hypergraph embedded with trust semantics for each device, agentic AI enables hierarchical management of collaborators and identifies collaborators requiring trust evaluation based on trust semantics, thereby achieving a balance between overhead and trust accuracy. Furthermore, local trust hypergraphs from multiple devices can be chained together to support multi-hop collaboration, enabling efficient coordination in large-scale systems. Experimental results demonstrate that the proposed method achieves resource-efficient trust evaluation.

[264] Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, Cheng Ren, Jiawei Shen, Wenlei Shi, Tong Sun, He Sun, Jiahui Wang, Siran Wang, Zhihong Wang, Chenrui Wei, Shufa Wei, Yonghui Wu, Yuchen Wu, Yihang Xia, Huajian Xin, Fan Yang, Huaiyuan Ying, Hongyi Yuan, Zheng Yuan, Tianyang Zhan, Chi Zhang, Yue Zhang, Ge Zhang, Tianyun Zhao, Jianqiu Zhao, Yichi Zhou, Thomas Hanwen Zhu

Main category: cs.AI

TL;DR: Seed-Prover, a lemma-style whole-proof reasoning model, leverages Lean feedback and self-summarization to refine proofs, achieving high success rates on IMO and Putnam problems, and introduces Seed-Geometry for geometry reasoning.

DetailsMotivation: LLMs struggle with theorem proving due to unclear supervision in natural language, while domain-specific languages like Lean offer clear verification signals.

Method: Seed-Prover iteratively refines proofs using Lean feedback, proved lemmas, and self-summarization, with test-time inference strategies for deep and broad reasoning. Seed-Geometry addresses Lean’s lack of geometry support.

Result: Seed-Prover proves 78.1% of IMO problems, saturates MiniF2F, and achieves >50% on PutnamBench, outperforming SOTA. Seed-Geometry surpasses previous formal geometry engines.

Conclusion: The work advances automated mathematical reasoning by combining formal verification with long chain-of-thought reasoning, demonstrated by success in IMO 2025.

Abstract: LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose \textbf{Seed-Prover}, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine \textbf{Seed-Geometry}, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.
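
To illustrate what lemma-style proving looks like at a small scale, here is a toy Lean 4 proof (assuming Mathlib is available) that states and combines intermediate lemmas; it is unrelated to the IMO-level problems the paper targets.

```lean
import Mathlib

-- A toy lemma-style proof: prove small lemmas first, then combine them.
theorem sq_add_sq_nonneg (a b : ℤ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h1 : 0 ≤ a ^ 2 := sq_nonneg a   -- lemma 1: each square is nonnegative
  have h2 : 0 ≤ b ^ 2 := sq_nonneg b   -- lemma 2
  exact add_nonneg h1 h2               -- combine the proved lemmas
```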

cs.SD

[265] Advancing Speech Quality Assessment Through Scientific Challenges and Open-source Activities

Wen-Chin Huang

Main category: cs.SD

TL;DR: The paper reviews recent challenges and open-source tools in Speech Quality Assessment (SQA), emphasizing its growing importance in evaluating generative AI for speech.

DetailsMotivation: The rise of generative AI for speech necessitates accurate SQA methods that align with human perception.

Method: The paper reviews recent SQA challenges and open-source toolkits.

Result: Highlights the progress in SQA and its role in evaluating speech generation systems.

Conclusion: Maintaining open-source activities and scientific challenges is crucial for advancing SQA and generative AI for speech.

Abstract: Speech quality assessment (SQA) refers to the evaluation of speech quality, and developing an accurate automatic SQA method that reflects human perception has become increasingly important in order to keep up with the generative AI boom. In recent years, SQA has progressed to the point that researchers have started to faithfully use automatic SQA in research papers as a rigorous measure of goodness for speech generation systems. We believe that the scientific challenges and open-source activities of late have stimulated the growth of this field. In this paper, we review recent challenges as well as open-source implementations and toolkits for SQA, and highlight the importance of maintaining such activities to facilitate the development of not only SQA itself but also generative AI for speech.

[266] AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation

Le Wang, Jun Wang, Feng Deng, Chen Zhang, Kun Gai, Di Zhang

Main category: cs.SD

TL;DR: AudioGen-Omni is a multimodal diffusion transformer model for generating high-fidelity audio, speech, and songs synchronized with video, using a novel joint training paradigm and advanced cross-modal alignment techniques.

DetailsMotivation: To create a unified model capable of generating diverse, semantically rich audio aligned with multimodal inputs, overcoming limitations of text-frozen paradigms.

Method: Employs a multimodal diffusion transformer (MMDit) with a joint training paradigm, a unified lyrics-transcription encoder, and PAAPI-enhanced attention for cross-modal alignment.

Result: Achieves state-of-the-art performance in audio generation tasks, with high semantic alignment, lip-sync accuracy, and efficient inference (1.91s for 8s audio).

Conclusion: AudioGen-Omni advances multimodal audio generation by integrating diverse inputs and achieving robust, high-quality outputs efficiently.

Abstract: We present AudioGen-Omni, a unified approach based on multimodal diffusion transformers (MMDit), capable of generating high-fidelity audio, speech, and songs coherently synchronized with the input video. AudioGen-Omni introduces a novel joint training paradigm that seamlessly integrates large-scale video-text-audio corpora, enabling a model capable of generating semantically rich, acoustically diverse audio conditioned on multimodal inputs and adaptable to a wide range of audio generation tasks. AudioGen-Omni employs a unified lyrics-transcription encoder that encodes graphemes and phonemes from both sung and spoken inputs into dense frame-level representations. Dense frame-level representations are fused using an AdaLN-based joint attention mechanism enhanced with phase-aligned anisotropic positional infusion (PAAPI), wherein RoPE is selectively applied to temporally structured modalities to ensure precise and robust cross-modal alignment. By unfreezing all modalities and masking missing inputs, AudioGen-Omni mitigates the semantic constraints of text-frozen paradigms, enabling effective cross-modal conditioning. This joint training approach enhances audio quality, semantic alignment, and lip-sync accuracy, while also achieving state-of-the-art results on Text-to-Audio/Speech/Song tasks. With an inference time of 1.91 seconds for 8 seconds of audio, it offers substantial improvements in both efficiency and generality.

[267] SwitchCodec: A High-Fidelity Neural Audio Codec With Sparse Quantization

Jin Wang, Wenbin Jiang, Xiangbo Wang, Yubo You, Sheng Fang

Main category: cs.SD

TL;DR: A novel neural audio compression method, REVQ, improves performance at low bitrates by expanding embedding space and using a multi-tiered discriminator, achieving high PESQ/ViSQOL scores and reducing spectral blur.

DetailsMotivation: Existing neural audio compression methods degrade at limited bitrates due to constrained embedding space.

Method: Proposes REVQ for expanded embedding space, a load-balancing strategy, and a multi-tiered discriminator for spectral focus. Uses post-training for multi-bitrate support.

Result: Achieves PESQ 2.87 and ViSQOL 4.27 at 2.67 kbps, reduces mel-spectrogram distance by 13%, and cuts training time by half.

Conclusion: The method outperforms baselines, offering high-fidelity compression at low bitrates with efficient training.

Abstract: Neural audio compression has emerged as a promising technology for efficiently representing speech, music, and general audio. However, existing methods suffer from significant performance degradation at limited bitrates, where the available embedding space is sharply constrained. To address this, we propose a universal high-fidelity neural audio compression algorithm featuring Residual Experts Vector Quantization (REVQ), which substantially expands the embedding space with minimal impact on bandwidth. A gentle load-balancing strategy is introduced to ensure the full utilization of this expanded space. Furthermore, we develop a novel multi-tiered discriminator that periodically stratifies STFT spectra, guiding the generator to focus on critical spectral regions. To support multiple bitrates without quality loss at the lower end, we adopt an efficient post-training strategy. Our proposed model achieves impressive performance, with PESQ and ViSQOL scores of 2.87 and 4.27, respectively, at 2.67 kbps bandwidth. The approach effectively reduces spectral blur, decreasing the distance to the original mel-spectrogram by 13%. Notably, our post-training strategy achieves performance comparable to dedicated fixed-bitrate models while reducing the required training time by half. Extensive ablation studies confirm the superiority of our method over baselines.
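
The residual-quantization backbone that REVQ extends can be sketched in a few lines: each stage quantizes the residual left by the previous stages. The codebook sizes, and the absence of expert routing and load balancing, are simplifications relative to the paper.

```python
# Sketch: plain residual vector quantization over a stack of codebooks.
import numpy as np

rng = np.random.default_rng(0)
dim, codes_per_book, n_books = 16, 64, 4
codebooks = rng.normal(size=(n_books, codes_per_book, dim))

def quantize(x):
    recon = np.zeros_like(x)
    indices = []
    for book in codebooks:                       # each stage quantizes the residual
        residual = x - recon
        d = np.linalg.norm(book - residual, axis=1)
        idx = int(np.argmin(d))                  # nearest code to the residual
        indices.append(idx)
        recon += book[idx]
    return recon, indices

x = rng.normal(size=dim)
recon, idx = quantize(x)
print("codes:", idx, "reconstruction error:", round(np.linalg.norm(x - recon), 3))
```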

[268] Improving Code Switching with Supervised Fine Tuning and GELU Adapters

Linh Pham

Main category: cs.SD

TL;DR: The paper introduces methods to improve ASR for code-switching by leveraging Whisper’s monolingual capabilities and a GELU adapter, achieving lower error rates than current SoTA.

DetailsMotivation: Address the lack of code-switching datasets by utilizing existing monolingual data and models.

Method: Part 1: Uses Whisper’s monolingual tokenization (Switching Tokenizers Method). Part 2: Combines this with a GELU-based adapter on the encoder.

Result: Reduced MER to 9.4% (ASCEND), 6% (SEAME devman), and 9.7% (SEAME devsge), outperforming SoTA.

Conclusion: The proposed methods effectively improve ASR performance for code-switching tasks.

Abstract: Few code-switching datasets, labeled or unlabeled, exist today. As a result, ASR requires new methods to utilize the vast monolingual data and models that already exist. This paper uses OpenAI’s open-source ASR model, Whisper, which has been pre-trained on 680K hours of audio to perform monolingual ASR tasks. In Part 1, this paper examines how exploiting Whisper’s monolingual ability to individually tokenize training text, called the “Switching Tokenizers Method”, improves transcription accuracy. In Part 2, we combine the Switching Tokenizers Method from Part 1 and train a GELU-based adapter on the encoder. These two methods reduced the Total Mixed Error Rate (MER) to 9.4% for the ASCEND dataset, 6% for SEAME devman and 9.7% for SEAME devsge, outperforming current SoTA methods.
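
A bottleneck adapter with a GELU nonlinearity, of the kind attached to a frozen encoder, can be sketched as follows; the dimensions and placement are assumptions rather than the paper's exact configuration.

```python
# Sketch: a GELU bottleneck adapter with a residual connection, the standard
# adapter recipe; it would be inserted into a frozen encoder's layers.
import torch
import torch.nn as nn

class GELUAdapter(nn.Module):
    def __init__(self, d_model: int = 512, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # project down
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)     # project back up

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))   # residual connection

h = torch.randn(2, 100, 512)           # (batch, frames, hidden) encoder states
print(GELUAdapter()(h).shape)          # torch.Size([2, 100, 512])
```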

[269] Next Tokens Denoising for Speech Synthesis

Yanqing Liu, Ruiqing Xue, Chong Zhang, Yufei Liu, Gang Wang, Bohan Li, Yao Qian, Lei He, Shujie Liu, Sheng Zhao

Main category: cs.SD

TL;DR: Dragon-FM unifies AR and flow-matching for TTS, enabling fast, high-quality audio generation by leveraging KV-caching and bidirectional context.

DetailsMotivation: Address limitations of AR and diffusion models (slow generation, lack of future context, KV-caching issues) in generative modeling.

Method: Combines AR modeling across chunks for global coherence with parallel flow-matching within chunks for fast denoising. Uses bidirectional context and KV-caching.

Result: Efficiently generates 48 kHz audio at 12.5 tokens/sec, producing high-quality zero-shot podcasts.

Conclusion: Dragon-FM bridges AR and flow-matching, offering fast, coherent, and high-quality audio generation for long-form content.

Abstract: While diffusion and autoregressive (AR) models have significantly advanced generative modeling, they each present distinct limitations. AR models, which rely on causal attention, cannot exploit future context and suffer from slow generation speeds. Conversely, diffusion models struggle with key-value (KV) caching. To overcome these challenges, we introduce Dragon-FM, a novel text-to-speech (TTS) design that unifies AR and flow-matching. This model processes 48 kHz audio codec tokens in chunks at a compact rate of 12.5 tokens per second. This design enables AR modeling across chunks, ensuring global coherence, while parallel flow-matching within chunks facilitates fast iterative denoising. Thus, the model leverages KV-cache across chunks and utilizes bidirectional context within each chunk. Furthermore, it bridges continuous and discrete feature modeling, demonstrating that continuous AR flow-matching can predict discrete tokens with finite scalar quantizers. This efficient codec and fast chunk-autoregressive architecture also make the model highly effective for generating long-form content, such as podcasts. Experiments on podcast datasets demonstrate its capability to efficiently generate high-quality zero-shot podcasts.

cs.LG

[270] Predicting Large-scale Urban Network Dynamics with Energy-informed Graph Neural Diffusion

Tong Nie, Jian Sun, Wei Ma

Main category: cs.LG

TL;DR: The paper introduces ScaleSTF, a scalable spatiotemporal Transformer with linear complexity, inspired by physical laws, to improve efficiency and efficacy in predicting urban system dynamics.

DetailsMotivation: Current models like graph neural networks struggle with balancing efficacy and efficiency in large-scale urban networks, necessitating a more scalable and interpretable solution.

Method: The authors propose a neural diffusion scheme using Transformer-like structures with attention layers derived from low-dimensional embeddings, ensuring scalability and interpretability.

Result: ScaleSTF demonstrates state-of-the-art performance and scalability in large-scale urban systems (traffic flow, solar power, smart meters).

Conclusion: The work offers a new perspective on dynamics prediction in urban networks, combining physical principles with scalable neural architectures.

Abstract: Networked urban systems facilitate the flow of people, resources, and services, and are essential for economic and social interactions. These systems often involve complex processes with unknown governing rules, observed by sensor-based time series. To aid decision-making in industrial and engineering contexts, data-driven predictive models are used to forecast spatiotemporal dynamics of urban systems. Current models such as graph neural networks have shown promise but face a trade-off between efficacy and efficiency due to computational demands. Hence, their application to large-scale networks still requires further effort. This paper addresses this trade-off challenge by drawing inspiration from physical laws to inform essential model designs that align with fundamental principles and avoid architectural redundancy. By understanding both micro- and macro-processes, we present a principled interpretable neural diffusion scheme based on Transformer-like structures whose attention layers are induced by low-dimensional embeddings. The proposed scalable spatiotemporal Transformer (ScaleSTF), with linear complexity, is validated on large-scale urban systems including traffic flow, solar power, and smart meters, showing state-of-the-art performance and remarkable scalability. Our results constitute a fresh perspective on the dynamics prediction in large-scale urban networks.
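
For context, the standard trick for linear-complexity attention replaces the softmax with a positive kernel feature map so that key-value statistics can be pre-aggregated. The sketch below shows that generic mechanism; it does not reproduce how ScaleSTF induces its attention from low-dimensional embeddings.

```python
# Sketch: kernelized linear attention, O(n * d^2) instead of O(n^2 * d).
import torch

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (batch, n, d) tensors."""
    q = torch.nn.functional.elu(q) + 1           # positive feature map phi(.)
    k = torch.nn.functional.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)      # pre-aggregate sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)  # normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

q = k = v = torch.randn(1, 1000, 32)
print(linear_attention(q, k, v).shape)           # torch.Size([1, 1000, 32])
```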

[271] Hybrid LSTM-Transformer Models for Profiling Highway-Railway Grade Crossings

Kaustav Chatterjee, Joshua Q. Li, Fatemeh Ansari, Masud Rana Munna, Kundan Parajulee, Jared Schwennesen

Main category: cs.LG

TL;DR: The paper introduces a hybrid deep learning framework (LSTM-Transformer) to measure HRGC profiles efficiently, addressing safety risks from hump crossings.

DetailsMotivation: Hump crossings pose safety risks due to hang-ups, and conventional measurement methods are costly and disruptive.

Method: A hybrid deep learning framework (LSTM-Transformer) was developed using IMU/GPS data and ground truth from a walking profiler.

Result: Models 2 and 3 outperformed others, enabling accurate 2D/3D HRGC profile generation.

Conclusion: The deep learning models show promise for improving HRGC safety by enabling rapid and accurate profile assessment.

Abstract: Hump crossings, or high-profile Highway Railway Grade Crossings (HRGCs), pose safety risks to highway vehicles due to potential hang-ups. These crossings typically result from post-construction railway track maintenance activities or non-compliance with design guidelines for HRGC vertical alignments. Conventional methods for measuring HRGC profiles are costly, time-consuming, traffic-disruptive, and present safety challenges. To address these issues, this research employed advanced, cost-effective techniques and innovative modeling approaches for HRGC profile measurement. A novel hybrid deep learning framework combining Long Short-Term Memory (LSTM) and Transformer architectures was developed by utilizing instrumentation and ground truth data. Instrumentation data were gathered using a highway testing vehicle equipped with Inertial Measurement Unit (IMU) and Global Positioning System (GPS) sensors, while ground truth data were obtained via an industrial-standard walking profiler. Field data was collected at the Red Rock Railroad Corridor in Oklahoma. Three advanced deep learning models Transformer-LSTM sequential (model 1), LSTM-Transformer sequential (model 2), and LSTM-Transformer parallel (model 3) were evaluated to identify the most efficient architecture. Models 2 and 3 outperformed the others and were deployed to generate 2D/3D HRGC profiles. The deep learning models demonstrated significant potential to enhance highway and railroad safety by enabling rapid and accurate assessment of HRGC hang-up susceptibility.
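
Model 3's parallel arrangement can be sketched as two branches over the same projected input, concatenated before a regression head. Layer sizes and the nine-channel IMU/GPS input are assumptions for illustration.

```python
# Sketch: a parallel LSTM-Transformer hybrid for per-timestep profile regression.
import torch
import torch.nn as nn

class ParallelLSTMTransformer(nn.Module):
    def __init__(self, in_dim=9, hidden=64, heads=4):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        enc = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                         batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=2)
        self.head = nn.Linear(2 * hidden, 1)     # fuse both branches

    def forward(self, x):                        # x: (batch, time, in_dim)
        h = self.proj(x)
        lstm_out, _ = self.lstm(h)               # local temporal dynamics
        trans_out = self.transformer(h)          # long-range context
        fused = torch.cat([lstm_out, trans_out], dim=-1)
        return self.head(fused).squeeze(-1)      # profile value per time step

x = torch.randn(8, 200, 9)                       # e.g., IMU accel/gyro + GPS
print(ParallelLSTMTransformer()(x).shape)        # torch.Size([8, 200])
```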

[272] Regime-Aware Conditional Neural Processes with Multi-Criteria Decision Support for Operational Electricity Price Forecasting

Abhinav Das, Stephan Schlüter

Main category: cs.LG

TL;DR: The paper integrates Bayesian regime detection with conditional neural processes (CNPs) for 24-hour electricity price prediction in the German market. It evaluates the proposed R-NP model against DNN and LEAR models, showing R-NP as the most balanced solution.

DetailsMotivation: To improve electricity price prediction by combining regime detection with localized modeling using CNPs, addressing the gap between raw prediction accuracy and operational utility.

Method: Uses DS-HDP-HMM for regime detection and independent CNPs for regime-specific modeling. Evaluates models via battery storage optimization and TOPSIS multi-criteria analysis.

Result: LEAR performed best in 2021, but R-NP emerged as the most balanced solution across 2021-2023.

Conclusion: R-NP offers a robust and balanced approach to electricity price prediction, emerging as the preferred model under multi-criteria (TOPSIS) evaluation even where alternatives lead on individual metrics.

Abstract: This work integrates Bayesian regime detection with conditional neural processes for 24-hour electricity price prediction in the German market. Our methodology performs regime detection using a disentangled sticky hierarchical Dirichlet process hidden Markov model (DS-HDP-HMM) applied to daily electricity prices. Each identified regime is subsequently modeled by an independent conditional neural process (CNP), trained to learn localized mappings from input contexts to 24-dimensional hourly price trajectories, with final predictions computed as regime-weighted mixtures of these CNP outputs. We rigorously evaluate the resulting regime-aware neural process (R-NP) against deep neural networks (DNN) and Lasso-estimated auto-regressive (LEAR) models by integrating their forecasts into diverse battery storage optimization frameworks, including price arbitrage, risk management, grid services, and cost minimization. This operational utility assessment revealed complex performance trade-offs: LEAR often yielded superior absolute profits or lower costs, while DNN showed exceptional optimality in specific cost-minimization contexts. Recognizing that raw prediction accuracy doesn’t always translate to optimal operational outcomes, we employed TOPSIS as a comprehensive multi-criteria evaluation layer. Our TOPSIS analysis identified LEAR as the top-ranked model for 2021, but crucially, our proposed R-NP model emerged as the most balanced and preferred solution across 2021, 2022, and 2023.
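
The final prediction step, a regime-weighted mixture of per-regime CNP outputs, is compact enough to sketch. The CNPs and the HMM regime posterior are stand-in callables and inputs here.

```python
import numpy as np

def regime_weighted_forecast(cnp_models, regime_probs, context):
    """Combine per-regime forecasts into one 24-hour price path.

    cnp_models   : list of callables, each mapping a context to a
                   24-dim hourly price forecast (stand-ins for CNPs)
    regime_probs : (n_regimes,) posterior regime weights for the day,
                   e.g. from an HMM forward pass (assumed given here)
    """
    forecasts = np.stack([m(context) for m in cnp_models])  # (R, 24)
    return regime_probs @ forecasts                          # (24,)

# Toy usage: three "regimes" with different price levels.
cnps = [lambda c, lvl=lvl: np.full(24, lvl) for lvl in (30.0, 60.0, 120.0)]
w = np.array([0.7, 0.2, 0.1])
print(regime_weighted_forecast(cnps, w, context=None))
```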

[273] Learning Like Humans: Resource-Efficient Federated Fine-Tuning through Cognitive Developmental Stages

Yebo Wu, Jingguang Li, Zhijiang Guo, Li Li

Main category: cs.LG

TL;DR: DevFT is a resource-efficient federated fine-tuning method for LLMs, inspired by cognitive development, that improves performance and reduces overhead.

DetailsMotivation: To address the resource-intensive nature of federated fine-tuning for LLMs on edge devices while preserving data privacy.

Method: DevFT decomposes fine-tuning into developmental stages, transferring knowledge between submodels with increasing capacity. It uses deconfliction-guided layer grouping and differential-based layer fusion.

Result: Achieves up to 4.59x faster convergence, a 10.67x reduction in communication overhead, and a 9.07% average performance improvement.

Conclusion: DevFT is a highly efficient and compatible approach for federated fine-tuning of LLMs.

Abstract: Federated fine-tuning enables Large Language Models (LLMs) to adapt to downstream tasks while preserving data privacy, but its resource-intensive nature limits deployment on edge devices. In this paper, we introduce Developmental Federated Tuning (DevFT), a resource-efficient approach inspired by cognitive development that progressively builds a powerful LLM from a compact foundation. DevFT decomposes the fine-tuning process into developmental stages, each optimizing submodels with increasing parameter capacity. Knowledge from earlier stages transfers to subsequent submodels, providing optimized initialization parameters that prevent convergence to local minima and accelerate training. This paradigm mirrors human learning, gradually constructing comprehensive knowledge structure while refining existing skills. To efficiently build stage-specific submodels, DevFT introduces deconfliction-guided layer grouping and differential-based layer fusion to distill essential information and construct representative layers. Evaluations across multiple benchmarks demonstrate that DevFT significantly outperforms state-of-the-art methods, achieving up to 4.59$\times$ faster convergence, 10.67$\times$ reduction in communication overhead, and 9.07% average performance improvement, while maintaining compatibility with existing approaches.

[274] Improved Robustness and Functional Localization in Topographic CNNs Through Weight Similarity

Nhut Truong, Uri Hasson

Main category: cs.LG

TL;DR: Topographic neural networks with Weight Similarity (WS) constraints outperform Activation Similarity (AS) and standard CNNs in robustness, input sensitivity, and functional localization.

DetailsMotivation: To systematically examine the impact of different topographic constraints (WS vs. AS) on neural network representations.

Method: Compare WS and AS constraints in topographic CNNs, evaluating classification accuracy, robustness to perturbations, and spatial organization.

Result: WS improves noise robustness, input sensitivity, and functional localization, with distinct representational geometry effects.

Conclusion: WS constraints produce more robust representations and influence feature learning, suggesting benefits for biophysical models.

Abstract: Topographic neural networks are computational models that can simulate the spatial and functional organization of the brain. Topographic constraints in neural networks can be implemented in multiple ways, with potentially different impacts on the representations learned by the network. The impact of such different implementations has not been systematically examined. To this end, here we compare topographic convolutional neural networks trained with two spatial constraints: Weight Similarity (WS), which pushes neighboring units to develop similar incoming weights, and Activation Similarity (AS), which enforces similarity in unit activations. We evaluate the resulting models on classification accuracy, robustness to weight perturbations and input degradation, and the spatial organization of learned representations. Compared to both AS and standard CNNs, WS provided three main advantages: i) improved robustness to noise, also showing higher accuracy under weight corruption; ii) greater input sensitivity, reflected in higher activation variance; and iii) stronger functional localization, with units showing similar activations positioned at closer distances. In addition, WS produced differences in orientation tuning, symmetry sensitivity, and eccentricity profiles of units, indicating an influence of this spatial constraint on the representational geometry of the network. Our findings suggest that during end-to-end training, WS constraints produce more robust representations than AS or non-topographic CNNs. These findings also suggest that weight-based spatial constraints can shape feature learning and functional organization in biophysically inspired models.
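
The WS constraint itself reduces to a simple penalty: arrange a layer's output units on a grid and penalize differences between the incoming weight vectors of grid neighbors. Below is a minimal sketch, with the grid layout and squared-difference form as assumptions.

```python
import torch

def weight_similarity_penalty(weight, grid_shape):
    """Sketch of a Weight Similarity (WS) loss for one conv layer.

    weight     : (out_ch, in_ch, kH, kW) kernel tensor
    grid_shape : (rows, cols) layout assigning each output unit a
                 position on a 2-D sheet (rows * cols == out_ch)
    Penalizes squared distance between incoming weights of grid
    neighbors, pushing nearby units toward similar filters.
    """
    rows, cols = grid_shape
    w = weight.reshape(rows, cols, -1)            # per-unit incoming weights
    right = ((w[:, 1:] - w[:, :-1]) ** 2).sum()   # horizontal neighbors
    down = ((w[1:, :] - w[:-1, :]) ** 2).sum()    # vertical neighbors
    return right + down

conv = torch.nn.Conv2d(16, 64, 3)
loss_ws = weight_similarity_penalty(conv.weight, grid_shape=(8, 8))
# total_loss = task_loss + lambda_ws * loss_ws   (lambda_ws is a hyperparameter)
```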

[275] Benchmarking Partial Observability in Reinforcement Learning with a Suite of Memory-Improvable Domains

Ruo Yu Tao, Kaicheng Guo, Cameron Allen, George Konidaris

Main category: cs.LG

TL;DR: The paper introduces POBAX, a benchmark for evaluating reinforcement learning algorithms under partial observability, emphasizing coverage and memory improvability.

DetailsMotivation: Existing benchmarks for partial observability are limited and don't reflect real-world complexities, necessitating a more comprehensive framework.

Method: The authors propose guidelines for benchmarking and introduce POBAX, a library with diverse environments (e.g., localization, visual control) and hyperparameters for evaluation.

Result: The selected environments are memory improvable, requiring algorithms to handle partial observability effectively.

Conclusion: POBAX provides a robust framework for evaluating and advancing research in partially observable reinforcement learning.

Abstract: Mitigating partial observability is a necessary but challenging task for general reinforcement learning algorithms. To improve an algorithm’s ability to mitigate partial observability, researchers need comprehensive benchmarks to gauge progress. Most algorithms tackling partial observability are only evaluated on benchmarks with simple forms of state aliasing, such as feature masking and Gaussian noise. Such benchmarks do not represent the many forms of partial observability seen in real domains, like visual occlusion or unknown opponent intent. We argue that a partially observable benchmark should have two key properties. The first is coverage in its forms of partial observability, to ensure an algorithm’s generalizability. The second is a large gap between the performance of agents with more or less state information, all other factors roughly equal. This gap implies that an environment is memory improvable: performance gains in the domain come from an algorithm’s ability to cope with partial observability rather than from other factors. We introduce best-practice guidelines for empirically benchmarking reinforcement learning under partial observability, as well as the open-source library POBAX: Partially Observable Benchmarks in JAX. We characterize the types of partial observability present in various environments and select representative environments for our benchmark. These environments include localization and mapping, visual control, games, and more. Additionally, we show that these tasks are all memory improvable and require hard-to-learn memory functions, providing a concrete signal for partial observability research. This framework includes recommended hyperparameters and algorithm implementations for fast, out-of-the-box evaluation, along with highly performant environments implemented in JAX for GPU-scalable experimentation.

[276] TriP-LLM: A Tri-Branch Patch-wise Large Language Model Framework for Time-Series Anomaly Detection

Yuan-Cheng Yu, Yen-Chieh Ouyang, Chun-An Lin

Main category: cs.LG

TL;DR: TriP-LLM is a novel unsupervised anomaly detection framework for time-series data, leveraging a tri-branch design and pretrained LLMs to outperform state-of-the-art methods with lower memory usage.

DetailsMotivation: The increasing scale and complexity of time-series data in IoT and smart manufacturing expose limitations of traditional methods, prompting the need for advanced frameworks.

Method: TriP-LLM integrates local and global temporal features via a tri-branch design (Patching, Selection, Global) and uses a frozen LLM for processing, followed by a lightweight decoder for anomaly scoring.

Result: TriP-LLM outperforms state-of-the-art methods on benchmark datasets, shows strong detection capabilities, and reduces memory consumption compared to CI-based LLM approaches.

Conclusion: TriP-LLM is effective for time-series anomaly detection, offering superior performance and efficiency, with code and models publicly available.

Abstract: Time-series anomaly detection plays a central role across a wide range of application domains. With the increasing proliferation of the Internet of Things (IoT) and smart manufacturing, time-series data has dramatically increased in both scale and dimensionality. This growth has exposed the limitations of traditional statistical methods in handling the high heterogeneity and complexity of such data. Inspired by the recent success of large language models (LLMs) in multimodal tasks across language and vision domains, we propose a novel unsupervised anomaly detection framework: A Tri-Branch Patch-wise Large Language Model Framework for Time-Series Anomaly Detection (TriP-LLM). TriP-LLM integrates local and global temporal features through a tri-branch design-Patching, Selection, and Global-to encode the input time series into patch-wise tokens, which are then processed by a frozen, pretrained LLM. A lightweight patch-wise decoder reconstructs the input, from which anomaly scores are derived. We evaluate TriP-LLM on several public benchmark datasets using PATE, a recently proposed threshold-free evaluation metric, and conduct all comparisons within a unified open-source framework to ensure fairness. Experimental results show that TriP-LLM consistently outperforms recent state-of-the-art methods across all datasets, demonstrating strong detection capabilities. Furthermore, through extensive ablation studies, we verify the substantial contribution of the LLM to the overall architecture. Compared to LLM-based approaches using Channel Independence (CI) patch processing, TriP-LLM achieves significantly lower memory consumption, making it more suitable for GPU memory-constrained environments. All code and model checkpoints are publicly available on https://github.com/YYZStart/TriP-LLM.git

[277] DiSC-Med: Diffusion-based Semantic Communications for Robust Medical Image Transmission

Fupei Guo, Hao Zheng, Xiang Zhang, Li Chen, Yue Wang, Songyang Zhang

Main category: cs.LG

TL;DR: Proposes DiSC-Med, a diffusion-based semantic communication framework for efficient and robust medical image transmission in telehealth.

DetailsMotivation: The need for efficient and reliable transmission of medical data in noisy, bandwidth-limited channels for remote healthcare.

Method: Develops DiSC-Med with medical-enhanced compression and denoising blocks for semantic communication.

Result: Superior reconstruction performance and bandwidth efficiency validated on real-world medical datasets.

Conclusion: DiSC-Med shows promise for robust and efficient telehealth applications.

Abstract: The rapid development of artificial intelligence has driven smart health with next-generation wireless communication technologies, stimulating exciting applications in remote diagnosis and intervention. To enable a timely and effective response for remote healthcare, efficient transmission of medical data through noisy channels with limited bandwidth emerges as a critical challenge. In this work, we propose a novel diffusion-based semantic communication framework, namely DiSC-Med, for medical image transmission, where medical-enhanced compression and denoising blocks are developed for bandwidth efficiency and robustness, respectively. Unlike conventional pixel-wise communication frameworks, our proposed DiSC-Med is able to capture the key semantic information and achieve superior reconstruction performance with ultra-high bandwidth efficiency against noisy channels. Extensive experiments on real-world medical datasets validate the effectiveness of our framework, demonstrating its potential for robust and efficient telehealth applications.

[278] Evaluating COVID-19 Feature Contributions to Bitcoin Return Forecasting: Methodology Based on LightGBM and Genetic Optimization

Imen Mahmoud, Andrei Velichko

Main category: cs.LG

TL;DR: A LightGBM-GA framework evaluates COVID-19 indicators’ impact on Bitcoin returns, showing significant predictive improvement, especially with vaccination data.

DetailsMotivation: To assess if pandemic-related health data enhances Bitcoin return prediction accuracy.

Method: Integrates LightGBM regression and GA optimization, comparing models with/without COVID-19 features over 31 runs. Performance metrics (R2, RMSE, MAE) and PFI analysis were used.

Result: COVID-19 indicators significantly boosted performance (40% R2 increase, 2% RMSE drop). Vaccination metrics were top predictors.

Conclusion: The framework enhances financial analytics by incorporating public health data, aiding market navigation during crises.

Abstract: This study proposes a novel methodological framework integrating a LightGBM regression model and genetic algorithm (GA) optimization to systematically evaluate the contribution of COVID-19-related indicators to Bitcoin return prediction. The primary objective was not merely to forecast Bitcoin returns but rather to determine whether including pandemic-related health data significantly enhances prediction accuracy. A comprehensive dataset comprising daily Bitcoin returns and COVID-19 metrics (vaccination rates, hospitalizations, testing statistics) was constructed. Predictive models, trained with and without COVID-19 features, were optimized using GA over 31 independent runs, allowing robust statistical assessment. Performance metrics (R2, RMSE, MAE) were statistically compared through distribution overlaps and Mann-Whitney U tests. Permutation Feature Importance (PFI) analysis quantified individual feature contributions. Results indicate that COVID-19 indicators significantly improved model performance, particularly in capturing extreme market fluctuations (R2 increased by 40%, RMSE decreased by 2%, both highly significant statistically). Among COVID-19 features, vaccination metrics, especially the 75th percentile of fully vaccinated individuals, emerged as dominant predictors. The proposed methodology extends existing financial analytics tools by incorporating public health signals, providing investors and policymakers with refined indicators to navigate market uncertainty during systemic crises.

[279] Time-RA: Towards Time Series Reasoning for Anomaly with LLM Feedback

Yiyuan Yang, Zichuan Liu, Lei Song, Kai Ying, Zhiguang Wang, Tom Bamford, Svitlana Vyetrenko, Jiang Bian, Qingsong Wen

Main category: cs.LG

TL;DR: The paper introduces Time-RA, a generative task for time-series anomaly reasoning using LLMs, and RATs40K, a multimodal benchmark dataset with detailed annotations. It evaluates LLMs, emphasizing supervised fine-tuning, and open-sources resources for future research.

DetailsMotivation: Current time-series anomaly detection lacks detailed categorization and reasoning. The authors aim to enhance interpretability by transforming it into a generative task using LLMs.

Method: Proposes Time-RA task and RATs40K dataset with numeric, text, and visual data, annotated via GPT-4-refined ensemble labels. Benchmarks LLMs and multimodal LLMs.

Result: Demonstrates capabilities and limitations of current models, highlighting the importance of supervised fine-tuning.

Conclusion: The work advances interpretable anomaly detection and reasoning, with open-sourced code and dataset to support future research.

Abstract: Time series anomaly detection is critical across various domains, yet current approaches often limit analysis to mere binary anomaly classification without detailed categorization or further explanatory reasoning. To address these limitations, we propose a novel task, Time-series Reasoning for Anomaly (Time-RA) that transforms classical time series anomaly detection from a discriminative into a generative, reasoning-intensive task leveraging Large Language Models (LLMs). Also, we introduce the first real-world multimodal benchmark dataset, RATs40K, explicitly annotated for anomaly reasoning, comprising approximately 40,000 samples across 10 real-world domains. Each sample includes numeric time series data, contextual text information, and visual representations, each annotated with fine-grained categories (14 types for univariate anomalies and 6 for multivariate anomalies) and structured explanatory reasoning. We develop a sophisticated annotation framework utilizing ensemble-generated labels refined through GPT-4-driven feedback, ensuring accuracy and interpretability. Extensive benchmarking of LLMs and multimodal LLMs demonstrates the capabilities and limitations of current models, highlighting the critical role of supervised fine-tuning. Our dataset and task pave the way for significant advancements in interpretable time series anomaly detection and reasoning. The code (https://github.com/yyysjz1997/Time-RA) and dataset (https://huggingface.co/datasets/Time-RA/RATs40K) have been fully open-sourced to support and accelerate future research in this area.

[280] Stress-Aware Resilient Neural Training

Ashkan Shakarami, Yousef Yeganeh, Azade Farshad, Lorenzo Nicole, Stefano Ghidoni, Nassir Navab

Main category: cs.LG

TL;DR: Stress-Aware Learning (SAL) introduces a resilient neural training paradigm using adaptive noise to escape sharp minima, improving robustness and generalization.

DetailsMotivation: Inspired by structural fatigue in materials science, the paper addresses stagnation in training by dynamically adjusting optimization behavior.

Method: Proposes Plastic Deformation Optimizer, injecting adaptive noise based on stress signals (loss/accuracy stagnation) to escape sharp minima.

Result: Experiments on six architectures, four optimizers, and seven benchmarks show improved robustness and generalization with minimal overhead.

Conclusion: SAL effectively enhances training resilience and generalization, with code and visuals available on GitHub.

Abstract: This paper introduces Stress-Aware Learning, a resilient neural training paradigm in which deep neural networks dynamically adjust their optimization behavior, whether under stable training regimes or in settings with uncertain dynamics, based on the concept of Temporary (Elastic) and Permanent (Plastic) Deformation, inspired by structural fatigue in materials science. To instantiate this concept, we propose the Plastic Deformation Optimizer, a stress-aware mechanism that injects adaptive noise into model parameters whenever an internal stress signal, reflecting stagnation in training loss and accuracy, indicates persistent optimization difficulty. This enables the model to escape sharp minima and converge toward flatter, more generalizable regions of the loss landscape. Experiments across six architectures, four optimizers, and seven vision benchmarks demonstrate improved robustness and generalization with minimal computational overhead. The code and 3D visuals will be available on GitHub: https://github.com/Stress-Aware-Learning/SAL.
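
A minimal sketch of the stress-triggered mechanism: track a validation signal, and when it stalls past a patience window, perturb the parameters. The threshold rule and noise scale below are illustrative assumptions, not the paper's exact schedule.

```python
import torch

class StressAwareNoise:
    """Stress-triggered perturbation in the spirit of the Plastic
    Deformation Optimizer; used alongside a regular optimizer."""

    def __init__(self, model, patience=5, min_delta=1e-3, sigma=1e-3):
        self.model, self.patience = model, patience
        self.min_delta, self.sigma = min_delta, sigma
        self.best, self.stalled = float("inf"), 0

    @torch.no_grad()
    def step(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best, self.stalled = val_loss, 0     # progress: stay elastic
            return
        self.stalled += 1
        if self.stalled >= self.patience:             # stress signal fires
            for p in self.model.parameters():         # plastic deformation:
                p.add_(self.sigma * torch.randn_like(p))  # escape sharp minima
            self.stalled = 0

# Usage after each validation pass:
# stress = StressAwareNoise(model); stress.step(val_loss)
```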

[281] StackLiverNet: A Novel Stacked Ensemble Model for Accurate and Interpretable Liver Disease Detection

Md. Ehsanul Haque, S. M. Jahidul Islam, Shakil Mia, Rumana Sharmin, Ashikuzzaman, Md Samir Morshed, Md. Tahmidul Huque

Main category: cs.LG

TL;DR: StackLiverNet, an interpretable stacked ensemble model, addresses issues in liver disease classification with high accuracy (99.89%), interpretability, and efficiency.

DetailsMotivation: Current models for liver disease classification suffer from misclassification, poor interpretability, and computational inefficiency.

Method: Uses advanced preprocessing, feature selection, random undersampling, and a LightGBM meta-model with hyperparameter-optimized base classifiers.

Result: Achieves 99.89% accuracy, Cohen Kappa of 0.9974, AUC of 0.9993, and fast training/inference times.

Conclusion: StackLiverNet is robust, interpretable, and efficient, suitable for clinical practice.

Abstract: Liver diseases are a serious global health concern, requiring precise and timely diagnosis to enhance patients’ survival chances. The current literature implements numerous machine learning and deep learning models to classify liver diseases, but most suffer from issues such as high misclassification error, poor interpretability, prohibitive computational expense, and a lack of good preprocessing strategies. To address these drawbacks, in this study we introduce StackLiverNet, an interpretable stacked ensemble model tailored to the liver disease detection task. The framework uses advanced data preprocessing and feature selection techniques to increase model robustness and predictive ability. Random undersampling is performed to deal with class imbalance and keep the training balanced. StackLiverNet is an ensemble of several hyperparameter-optimized base classifiers, whose complementary strengths are combined through a LightGBM meta-model. The model demonstrates excellent performance, with a testing accuracy of 99.89%, Cohen’s Kappa of 0.9974, and AUC of 0.9993, with only 5 misclassifications, and training and inference speeds amenable to clinical practice (training time 4.2783 seconds, inference time 0.1106 seconds). In addition, Local Interpretable Model-Agnostic Explanations (LIME) are applied to generate transparent explanations of individual predictions, identifying elevated Alkaline Phosphatase and moderately elevated SGOT as important indicators of liver disease. Also, SHAP was used to rank features by their global contribution to predictions, while the Morris method confirmed the most influential features through sensitivity analysis.
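
The stacking recipe maps directly onto standard tooling: heterogeneous base classifiers whose out-of-fold probabilities feed a LightGBM meta-model. The base learners below are illustrative stand-ins for the paper's hyperparameter-optimized set.

```python
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier

# Sketch of the StackLiverNet idea: diverse base classifiers combined
# through a LightGBM meta-model trained on out-of-fold probabilities.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("knn", KNeighborsClassifier(n_neighbors=7)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LGBMClassifier(),
    cv=5,                              # out-of-fold predictions for the meta-model
    stack_method="predict_proba",
)
# stack.fit(X_train_balanced, y_train)   # after undersampling/preprocessing
# y_pred = stack.predict(X_test)
```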

[282] Structured Transformations for Stable and Interpretable Neural Computation

Saleh Nikooroo, Thomas Engel

Main category: cs.LG

TL;DR: The paper introduces structured transformations in neural networks to improve stability and interpretability while maintaining compatibility with standard learning methods.

DetailsMotivation: Contemporary neural networks lack structural safeguards for stable learning and interpretable behavior, prompting the need for a reformulation of layer-level transformations.

Method: The authors decompose transformations into structured linear operators and residual corrective components, promoting disciplined signal propagation and better training dynamics.

Result: Experiments show improved gradient conditioning, reduced sensitivity to perturbations, and layer-wise robustness, with benefits scaling across architectures and training regimes.

Conclusion: The work lays the foundation for more principled neural architectures that balance stability, transparency, and expressive power.

Abstract: Despite their impressive performance, contemporary neural networks often lack structural safeguards that promote stable learning and interpretable behavior. In this work, we introduce a reformulation of layer-level transformations that departs from the standard unconstrained affine paradigm. Each transformation is decomposed into a structured linear operator and a residual corrective component, enabling more disciplined signal propagation and improved training dynamics. Our formulation encourages internal consistency and supports stable information flow across depth, while remaining fully compatible with standard learning objectives and backpropagation. Through a series of synthetic and real-world experiments, we demonstrate that models constructed with these structured transformations exhibit improved gradient conditioning, reduced sensitivity to perturbations, and layer-wise robustness. We further show that these benefits persist across architectural scales and training regimes. This study serves as a foundation for a more principled class of neural architectures that prioritize stability and transparency, offering new tools for reasoning about learning behavior without sacrificing expressive power.
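
One way to make the decomposition concrete is an orthogonal main branch plus a small, zero-initialized corrective branch; orthogonality keeps signal norms stable across depth. The choice of orthogonal operators as the "structured" class is an assumption for illustration, not the paper's exact operator family.

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrizations as P

class StructuredResidualLinear(nn.Module):
    """A layer decomposed into a structured operator plus a bounded
    residual correction, sketching the paper's reformulation."""

    def __init__(self, dim, eps=0.1):
        super().__init__()
        self.structured = P.orthogonal(nn.Linear(dim, dim, bias=False))
        self.residual = nn.Linear(dim, dim, bias=False)
        nn.init.zeros_(self.residual.weight)  # start as a pure structured map
        self.eps = eps                        # bounds the corrective term

    def forward(self, x):
        return self.structured(x) + self.eps * self.residual(x)

layer = StructuredResidualLinear(64)
y = layer(torch.randn(32, 64))
```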

[283] ECG Latent Feature Extraction with Autoencoders for Downstream Prediction Tasks

Christopher Harvey, Sumaiya Shomaji, Zijun Yao, Amit Noheria

Main category: cs.LG

TL;DR: The study introduces novel VAE variants to simplify ECG data for deep learning, achieving high signal fidelity and improved prediction of reduced LVEF with less computational cost.

DetailsMotivation: ECG signals are complex and variable, making deep learning challenging with small datasets. The study aims to simplify ECG data for better usability.

Method: Three VAE variants (SAE, A beta-VAE, C beta-VAE) are introduced and compared using PCA and Autoencoders for feature generation, evaluated with LGBM.

Result: A beta-VAE achieved superior signal reconstruction (MAE 15.7 ± 3.2 μV). SAE encodings improved LVEF prediction (AUROC 0.901), nearly matching CNN performance with fewer resources.

Conclusion: VAE encodings simplify ECG data effectively, offering a practical solution for deep learning with limited labeled data.

Abstract: The electrocardiogram (ECG) is an inexpensive and widely available tool for cardiac assessment. Despite its standardized format and small file size, the high complexity and inter-individual variability of ECG signals (typically a 60,000-size vector with 12 leads at 500 Hz) make it challenging to use in deep learning models, especially when only small training datasets are available. This study addresses these challenges by exploring feature generation methods from representative beat ECGs, focusing on Principal Component Analysis (PCA) and Autoencoders to reduce data complexity. We introduce three novel Variational Autoencoder (VAE) variants: Stochastic Autoencoder (SAE), Annealed beta-VAE (A beta-VAE), and Cyclical beta-VAE (C beta-VAE), and compare their effectiveness in maintaining signal fidelity and enhancing downstream prediction tasks using a Light Gradient Boost Machine (LGBM). The A beta-VAE achieved superior signal reconstruction, reducing the mean absolute error (MAE) to 15.7 ± 3.2 μV, which is at the level of signal noise. Moreover, the SAE encodings, when combined with traditional ECG summary features, improved the prediction of reduced Left Ventricular Ejection Fraction (LVEF), achieving a holdout test set area under the receiver operating characteristic curve (AUROC) of 0.901 with an LGBM classifier. This performance nearly matches the 0.909 AUROC of a state-of-the-art CNN model but requires significantly fewer computational resources. Further, the ECG feature extraction-LGBM pipeline avoids overfitting and retains predictive performance when trained with less data. Our findings demonstrate that these VAE encodings are not only effective in simplifying ECG data but also provide a practical solution for applying deep learning in contexts with limited-scale labeled training data.
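
The variants differ mainly in how the KL weight beta is scheduled against the usual ELBO, L = reconstruction + beta * KL. Below is a sketch of one-shot annealing (A beta-VAE) versus cyclical ramps (C beta-VAE); the exact ramp shapes are assumptions.

```python
import numpy as np

def beta_schedule(step, total_steps, mode="annealed", n_cycles=4):
    """Sketch of the beta schedules behind A beta-VAE and C beta-VAE.

    'annealed' ramps beta from 0 to 1 once over the first half of
    training; 'cyclical' repeats a ramp-then-hold pattern n_cycles times.
    """
    if mode == "annealed":
        return min(1.0, step / (0.5 * total_steps))
    phase = (step * n_cycles / total_steps) % 1.0   # position within a cycle
    return min(1.0, 2.0 * phase)                     # ramp, then hold at 1

betas = [beta_schedule(s, 10_000, "cyclical") for s in range(10_000)]
print(round(float(np.mean(betas)), 3))               # average KL weight
```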

[284] INSPIRE-GNN: Intelligent Sensor Placement to Improve Sparse Bicycling Network Prediction via Reinforcement Learning Boosted Graph Neural Networks

Mohit Gupta, Debjit Bhowmick, Rhys Newbury, Meead Saberi, Shirui Pan, Ben Beck

Main category: cs.LG

TL;DR: INSPIRE-GNN, a hybrid GNN framework with RL, improves bicycling volume estimation in data-sparse urban networks by optimizing sensor placement.

DetailsMotivation: High data sparsity in bicycling volume estimation due to limited sensor coverage hinders urban transportation planning.

Method: Combines GCN, GAT, and DQN-based RL for strategic sensor placement and volume estimation.

Result: Outperforms traditional methods in MSE, RMSE, and MAE, especially in sparse sensor deployments.

Conclusion: INSPIRE-GNN offers actionable insights for optimizing sensor networks and improving bicycling data accuracy.

Abstract: Accurate link-level bicycling volume estimation is essential for sustainable urban transportation planning. However, many cities face significant challenges of high data sparsity due to limited bicycling count sensor coverage. To address this issue, we propose INSPIRE-GNN, a novel Reinforcement Learning (RL)-boosted hybrid Graph Neural Network (GNN) framework designed to optimize sensor placement and improve link-level bicycling volume estimation in data-sparse environments. INSPIRE-GNN integrates Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT) with a Deep Q-Network (DQN)-based RL agent, enabling a data-driven strategic selection of sensor locations to maximize estimation performance. Applied to Melbourne’s bicycling network, comprising 15,933 road segments with sensor coverage on only 141 segments (99% sparsity), INSPIRE-GNN demonstrates significant improvements in volume estimation by strategically selecting additional sensor locations in deployments of 50, 100, 200 and 500 sensors. Our framework outperforms traditional heuristic methods for sensor placement such as betweenness centrality, closeness centrality, observed bicycling activity and random placement, across key metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). Furthermore, our experiments benchmark INSPIRE-GNN against standard machine learning and deep learning models in bicycling volume estimation performance, underscoring its effectiveness. Our proposed framework provides transport planners with actionable insights to effectively expand sensor networks, optimize sensor placement and maximize the accuracy and reliability of bicycling volume estimates for informed transportation planning decisions.

[285] Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs

Ziqian Zhong, Aditi Raghunathan

Main category: cs.LG

TL;DR: A new method interprets weights of fine-tuned LLMs to detect novel behaviors without needing similar training data, achieving high accuracy in threat detection and model auditing.

DetailsMotivation: Existing interpretability methods rely on similar data distributions, limiting their ability to detect novel threats like backdoors in LLMs.

Method: Analyzes weight differences between fine-tuned and base models, focusing on top singular vectors to identify new behaviors.

Result: Stops up to 100% of backdoor attacks (false positive rate below 1.2%) and detects inference on unlearned topics with up to 95.42% accuracy; also useful for model auditing.

Conclusion: The method effectively monitors and controls fine-tuned LLMs, offering robust detection and auditing capabilities without requiring original training data.

Abstract: The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution. In this work, we introduce a new method for understanding, monitoring and controlling fine-tuned LLMs that interprets weights, rather than activations, thereby sidestepping the need for data that is distributionally similar to the unknown training data. We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. By monitoring the cosine similarity of activations along these directions, we can detect salient behaviors introduced during fine-tuning with high precision. For backdoored models that bypass safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%. For models that have undergone unlearning, we detect inference on erased topics with accuracy up to 95.42% and can even steer the model to recover “unlearned” information. Besides monitoring, our method also shows potential for pre-deployment model auditing: by analyzing commercial instruction-tuned models (OLMo, Llama, Qwen), we are able to uncover model-specific fine-tuning focus areas, including marketing strategies and Midjourney prompt generation. Our implementation can be found at https://github.com/fjzzq2002/WeightWatch.
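
The core recipe is a few lines of linear algebra: take the SVD of the weight delta between the fine-tuned and base model, keep the top right singular vectors, and score activations by cosine alignment with them. A sketch with toy shapes and an assumed flagging threshold follows.

```python
import torch
import torch.nn.functional as F

def behavior_directions(w_base, w_tuned, k=4):
    """Top-k right singular vectors of the fine-tuning weight delta:
    per the paper, these directions carry the newly acquired behaviors."""
    delta = w_tuned - w_base                         # (out_dim, in_dim)
    _, _, Vh = torch.linalg.svd(delta, full_matrices=False)
    return Vh[:k]                                    # (k, in_dim)

def alignment(acts, dirs):
    """Max |cosine similarity| of activations with any monitored
    direction; high values flag the fine-tuned behavior firing."""
    return (F.normalize(acts, dim=-1) @ F.normalize(dirs, dim=-1).T).abs().max(-1).values

# Toy shapes; in practice w_* are matched layers of base vs. fine-tuned LLM.
w_base = torch.randn(256, 512)
w_tuned = w_base + 0.05 * torch.randn(256, 512)
dirs = behavior_directions(w_base, w_tuned)
scores = alignment(torch.randn(10, 512), dirs)       # one score per input
# flagged = scores > threshold   (threshold calibrated on benign traffic)
```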

[286] RL as Regressor: A Reinforcement Learning Approach for Function Approximation

Yongchao Huang

Main category: cs.LG

TL;DR: The paper proposes using Reinforcement Learning (RL) for regression tasks, replacing traditional differentiable loss functions with custom RL rewards, and demonstrates its effectiveness through a noisy sine wave case study.

DetailsMotivation: Traditional regression methods are limited by predefined, differentiable loss functions, which may not handle asymmetric costs or complex objectives well. RL offers more flexibility in defining objectives.

Method: The approach frames regression as an RL problem, treating predictions as actions and using custom reward signals. Techniques like Actor-Critic, Prioritized Experience Replay, and positional encoding are applied.

Result: The RL framework successfully solves the regression task and provides greater flexibility in defining objectives and guiding learning.

Conclusion: RL is a viable alternative for regression, offering enhanced flexibility and performance compared to traditional methods.

Abstract: Standard regression techniques, while powerful, are often constrained by predefined, differentiable loss functions such as mean squared error. These functions may not fully capture the desired behavior of a system, especially when dealing with asymmetric costs or complex, non-differentiable objectives. In this paper, we explore an alternative paradigm: framing regression as a Reinforcement Learning (RL) problem. We demonstrate this by treating a model’s prediction as an action and defining a custom reward signal based on the prediction error, and we can leverage powerful RL algorithms to perform function approximation. Through a progressive case study of learning a noisy sine wave, we illustrate the development of an Actor-Critic agent, iteratively enhancing it with Prioritized Experience Replay, increased network capacity, and positional encoding to enable a capable RL agent for this regression task. Our results show that the RL framework not only successfully solves the regression problem but also offers enhanced flexibility in defining objectives and guiding the learning process.
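
The enabling move is that the reward need not be differentiable or symmetric. Here is a sketch of one such objective on the paper's noisy-sine setup, where under-prediction is penalized more heavily; the 4x ratio is an illustrative assumption.

```python
import numpy as np

def asymmetric_reward(y_pred, y_true, under_penalty=4.0):
    """The prediction is an action; the reward is any function of the
    error, differentiable or not. Under-prediction costs 4x more than
    over-prediction, an objective squared error cannot express directly."""
    err = y_true - y_pred
    cost = np.where(err > 0, under_penalty * err, -err)   # weighted |err|
    return -cost                                          # higher is better

x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x) + 0.1 * np.random.randn(200)        # the paper's noisy sine target
print(float(asymmetric_reward(np.zeros_like(y), y).mean()))  # flat-policy reward
```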

[287] EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes

Adam Block, Cyril Zhang

Main category: cs.LG

TL;DR: BEMA (Bias-Corrected Exponential Moving Average) is proposed to mitigate instability in language model fine-tuning by reducing stochasticity and eliminating bias, outperforming EMA and vanilla training.

DetailsMotivation: Stochasticity in fine-tuning destabilizes training; EMA reduces it but introduces bias, creating optimization lag.

Method: BEMA augments EMA to retain variance reduction while eliminating bias, supported by theoretical acceleration proofs.

Result: BEMA improves convergence rates and final performance in LM benchmarks compared to EMA and vanilla training.

Conclusion: BEMA is a practical, theoretically motivated solution for stable and efficient fine-tuning.

Abstract: Stochasticity in language model fine-tuning, often caused by the small batch sizes typically used in this regime, can destabilize training by introducing large oscillations in generation quality. A popular approach to mitigating this instability is to take an Exponential moving average (EMA) of weights throughout training. While EMA reduces stochasticity, thereby smoothing training, the introduction of bias from old iterates often creates a lag in optimization relative to vanilla training. In this work, we propose the Bias-Corrected Exponential Moving Average (BEMA), a simple and practical augmentation of EMA that retains variance-reduction benefits while eliminating bias. BEMA is motivated by a simple theoretical model wherein we demonstrate provable acceleration of BEMA over both a standard EMA and vanilla training. Through an extensive suite of experiments on Language Models, we show that BEMA leads to significantly improved convergence rates and final performance over both EMA and vanilla training in a variety of standard LM benchmarks, making BEMA a practical and theoretically motivated intervention for more stable and efficient fine-tuning.
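
For intuition, the sketch below shows iterate averaging with the familiar Adam-style initialization debiasing. BEMA's actual correction, which targets the lag behind current iterates rather than the zero-initialization bias, is derived in the paper and differs from this stand-in.

```python
import torch

class DebiasedEMA:
    """Iterate averaging with a bias correction (a stand-in for BEMA's
    construction; the paper's exact correction term differs)."""

    def __init__(self, params, beta=0.999):
        self.beta, self.t = beta, 0
        self.shadow = [torch.zeros_like(p) for p in params]

    @torch.no_grad()
    def update(self, params):
        self.t += 1
        for s, p in zip(self.shadow, params):
            s.mul_(self.beta).add_(p, alpha=1 - self.beta)

    def averaged(self):
        # Divide by (1 - beta^t) so early averages are not pulled to zero.
        correction = 1 - self.beta ** self.t
        return [s / correction for s in self.shadow]
```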

[288] RecoMind: A Reinforcement Learning Framework for Optimizing In-Session User Satisfaction in Recommendation Systems

Mehdi Ben Ayed, Fei Feng, Jay Adams, Vishwakarma Singh, Kritarth Anand, Jiajing Xu

Main category: cs.LG

TL;DR: RecoMind is a simulator-based RL framework for optimizing long-term goals in web-scale recommendation systems, outperforming traditional supervised learning methods.

DetailsMotivation: Existing recommendation systems focus on immediate feedback, but RL can optimize longer-term goals like in-session engagement. However, applying RL at web scale is challenging due to large action spaces and engineering complexity.

Method: RecoMind uses existing recommendation models to create a simulation environment and bootstrap RL policies. It includes a custom exploration strategy for large action spaces and integrates with industry pipelines.

Result: Offline simulations and online A/B tests showed RecoMind’s RL policy significantly improves in-session satisfaction, increasing videos watched for more than 10 seconds by 15.81% and session depth by 4.71%.

Conclusion: RecoMind provides a scalable and systematic approach to integrating RL into web-scale recommendation systems, effectively optimizing session-based user satisfaction.

Abstract: Existing web-scale recommendation systems commonly use supervised learning methods that prioritize immediate user feedback. Although reinforcement learning (RL) offers a solution to optimize longer-term goals, such as in-session engagement, applying it at web scale is challenging due to the extremely large action space and engineering complexity. In this paper, we introduce RecoMind, a simulator-based RL framework designed for the effective optimization of session-based goals at web-scale. RecoMind leverages existing recommendation models to establish a simulation environment and to bootstrap the RL policy to optimize immediate user interactions from the outset. This method integrates well with existing industry pipelines, simplifying the training and deployment of RL policies. Additionally, RecoMind introduces a custom exploration strategy to efficiently explore web-scale action spaces with hundreds of millions of items. We evaluated RecoMind through extensive offline simulations and online A/B testing on a video streaming platform. Both methods showed that the RL policy trained using RecoMind significantly outperforms traditional supervised learning recommendation approaches in in-session user satisfaction. In online A/B tests, the RL policy increased videos watched for more than 10 seconds by 15.81% and improved session depth by 4.71% for sessions with at least 10 interactions. As a result, RecoMind presents a systematic and scalable approach for embedding RL into web-scale recommendation systems, showing great promise for optimizing session-based user satisfaction.
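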

[289] Robust Classification under Noisy Labels: A Geometry-Aware Reliability Framework for Foundation Models

Ecem Bozkurt, Antonio Ortega

Main category: cs.LG

TL;DR: A two-stage framework improves robust classification with noisy labels by leveraging local geometry and non-negative kernel neighborhood construction, outperforming standard kNN methods.

DetailsMotivation: Addressing the challenge of fine-tuning foundation models with noisy data without retraining, inspired by the effectiveness of kNN methods in noisy settings.

Method: Proposes a two-stage approach: reliability estimation followed by reliability-weighted inference, using NNK neighborhood construction and geometry-aware methods.

Result: Demonstrates improved robustness across noise conditions on CIFAR-10 and DermaMNIST, surpassing standard kNN and adaptive-neighborhood baselines.

Conclusion: The framework effectively enhances classification robustness in noisy label scenarios by incorporating geometry information and adaptive reliability estimation.

Abstract: Foundation models (FMs) pretrained on large datasets have become fundamental for various downstream machine learning tasks, in particular in scenarios where obtaining perfectly labeled data is prohibitively expensive. In this paper, we assume an FM has to be fine-tuned with noisy data and present a two-stage framework to ensure robust classification in the presence of label noise without model retraining. Recent work has shown that simple k-nearest neighbor (kNN) approaches using an embedding derived from an FM can achieve good performance even in the presence of severe label noise. Our work is motivated by the fact that these methods make use of local geometry. In this paper, following a similar two-stage procedure (reliability estimation followed by reliability-weighted inference), we show that improved performance can be achieved by introducing geometry information. For a given instance, our proposed inference uses a local neighborhood of training data, obtained using the non-negative kernel (NNK) neighborhood construction. We propose several methods for reliability estimation that can rely less on distance and local neighborhood as the label noise increases. Our evaluation on CIFAR-10 and DermaMNIST shows that our methods improve robustness across various noise conditions, surpassing standard kNN approaches and recent adaptive-neighborhood baselines.
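
The two-stage inference is easy to sketch on top of FM embeddings: estimate a per-sample reliability, then weight neighbors by reliability times proximity. Plain kNN with inverse-distance weights stands in for the paper's NNK neighborhood construction (an assumption).

```python
import numpy as np

def reliability_weighted_predict(z_query, z_train, y_train, r_train, k=15):
    """Stage-2 inference: weight each neighbor's (possibly noisy) label
    by an estimated reliability r_train from stage 1, e.g. agreement of
    a sample's label with its neighbors' labels."""
    d = np.linalg.norm(z_train - z_query, axis=1)
    nn = np.argsort(d)[:k]
    w = r_train[nn] / (d[nn] + 1e-8)             # reliability x proximity
    votes = np.bincount(y_train[nn], weights=w)
    return int(np.argmax(votes))

rng = np.random.default_rng(0)
z_tr, y_tr = rng.standard_normal((100, 16)), rng.integers(0, 2, 100)
r_tr = np.ones(100)                              # toy stage-1 reliabilities
print(reliability_weighted_predict(z_tr[0], z_tr, y_tr, r_tr))
```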

[290] Towards Higher Effective Rank in Parameter-efficient Fine-tuning using Khatri–Rao Product

Paul Albert, Frederic Z. Zhang, Hemanth Saratchandran, Anton van den Hengel, Ehsan Abbasnejad

Main category: cs.LG

TL;DR: KRAdapter, a new PEFT method using Khatri-Rao product, outperforms LoRA on high-rank matrices and maintains efficiency.

DetailsMotivation: LoRA's limitations in handling high-rank matrices in multimodal and large language models.

Method: Quantitative comparison of full-rank and low-rank PEFT methods, introducing KRAdapter for high-rank approximation.

Result: KRAdapter shows gains on vision-language and large language models, especially in unseen tasks.

Conclusion: KRAdapter is a practical, efficient alternative for fine-tuning billion-scale models.

Abstract: Parameter-efficient fine-tuning (PEFT) has become a standard approach for adapting large pre-trained models. Amongst PEFT methods, low-rank adaptation (LoRA) has achieved notable success. However, recent studies have highlighted its limitations compared against full-rank alternatives, particularly when applied to multimodal and large language models. In this work, we present a quantitative comparison between full-rank and low-rank PEFT methods using a synthetic matrix approximation benchmark with controlled spectral properties. Our results confirm that LoRA struggles to approximate matrices with relatively flat spectra or high-frequency components, signs of high effective rank. To this end, we introduce KRAdapter, a novel PEFT algorithm that leverages the Khatri-Rao product to produce weight updates, which, by construction, tend to have high effective rank. We demonstrate performance gains with KRAdapter on vision-language models up to 1B parameters and on large language models up to 8B parameters, particularly on unseen common-sense reasoning tasks. In addition, KRAdapter maintains the memory and compute efficiency of LoRA, making it a practical and robust alternative to fine-tune billion-scale parameter models.
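
The rank argument can be checked numerically: a row-wise Khatri-Rao product of two thin matrices generically spans far more directions than a rank-r product with the same parameter count. Whether KRAdapter uses exactly this row-wise parameterization is an assumption; the sketch illustrates the rank behavior.

```python
import numpy as np

def khatri_rao_rows(A, B):
    """Row-wise Khatri-Rao product: row i of the result is
    kron(A[i], B[i]). A (m x a), B (m x b) -> (m x a*b)."""
    m, a = A.shape
    _, b = B.shape
    return (A[:, :, None] * B[:, None, :]).reshape(m, a * b)

def effective_rank(M):
    """exp(entropy of normalized singular values): a smooth rank proxy."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

rng = np.random.default_rng(0)
m, a, b, r = 64, 8, 8, 8
kr_update = khatri_rao_rows(rng.standard_normal((m, a)),
                            rng.standard_normal((m, b)))   # 1024 parameters
lora_update = rng.standard_normal((m, r)) @ rng.standard_normal((r, a * b))

print(effective_rank(kr_update), effective_rank(lora_update))  # high vs <= r
```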

[291] Calibrated Language Models and How to Find Them with Label Smoothing

Jerry Huang, Peng Lu, Qiuhao Zeng

Main category: cs.LG

TL;DR: The paper investigates calibration degradation in LLMs after instruction tuning and proposes label smoothing as a solution, addressing its limitations and optimizing memory usage.

DetailsMotivation: Understanding the impact of instruction tuning on confidence calibration in LLMs and finding practical solutions to maintain calibration.

Method: Examination of open-sourced LLMs, theoretical and experimental analysis of label smoothing, and development of a memory-efficient kernel for smoothed losses.

Result: Label smoothing helps maintain calibration but is less effective for large vocabulary LLMs; a custom kernel reduces memory usage without performance loss.

Conclusion: Label smoothing is effective for calibration in SFT, but its limitations in LV-LLMs highlight the need for further research; the custom kernel offers a practical improvement.

Abstract: Recent advances in natural language processing (NLP) have opened up greater opportunities to enable fine-tuned large language models (LLMs) to behave as more powerful interactive agents through improved instruction-following ability. However, understanding how this impacts confidence calibration for reliable model output has not been researched in full. In this work, we examine various open-sourced LLMs, identifying significant calibration degradation after instruction tuning in each. Seeking a practical solution, we look towards label smoothing, which has been shown as an effective method to regularize for overconfident predictions but has yet to be widely adopted in the supervised fine-tuning (SFT) of LLMs. We first provide insight as to why label smoothing is sufficient to maintain calibration throughout the SFT process. However, settings remain where the effectiveness of smoothing is severely diminished, in particular the case of large vocabulary LLMs (LV-LLMs). We posit that the cause stems from the model’s capacity for over-confidence, which relates directly to hidden size and vocabulary size, and we justify this theoretically and experimentally. Finally, we address an outstanding issue regarding the memory footprint of the cross-entropy loss computation in the label smoothed loss setting, designing a customized kernel to dramatically reduce memory consumption without sacrificing speed or performance in comparison to existing solutions for non-smoothed losses.
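
For reference, label smoothing with parameter eps mixes the one-hot target with a uniform distribution over the vocabulary, which decomposes into a weighted sum of the NLL and a uniform penalty. The sketch below shows the explicit form and its equivalence to PyTorch's built-in option; the paper's contribution is a memory-efficient kernel for this loss, not reproduced here.

```python
import torch
import torch.nn.functional as F

def label_smoothed_ce(logits, targets, eps=0.1):
    """Cross-entropy against a smoothed target: (1 - eps) on the true
    token, eps spread uniformly over the vocabulary."""
    logp = F.log_softmax(logits, dim=-1)
    nll = -logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    uniform = -logp.mean(dim=-1)
    return ((1 - eps) * nll + eps * uniform).mean()

logits = torch.randn(4, 32000)                  # LV-LLM-size vocabulary rows
targets = torch.randint(0, 32000, (4,))
loss = label_smoothed_ce(logits, targets)
assert torch.allclose(loss,
                      F.cross_entropy(logits, targets, label_smoothing=0.1),
                      atol=1e-6)
```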

[292] Learning to Optimize Feedback for One Million Students: Insights from Multi-Armed and Contextual Bandits in Large-Scale Online Tutoring

Robin Schmucker, Nimish Pachapurkar, Shanmuga Bala, Miral Shah, Tom Mitchell

Main category: cs.LG

TL;DR: An online tutoring system uses MAB and CB frameworks to optimize feedback for students, improving learning outcomes through data-driven policies.

DetailsMotivation: To enhance student learning by providing effective feedback tailored to individual needs and optimizing assistance actions.

Method: Utilizes multi-armed bandit (MAB) and contextual bandit (CB) frameworks, offline policy evaluation, and causal inference to assess and optimize feedback policies.

Result: Significant improvements in student outcomes, with MAB policies outperforming CB in most cases due to small effect sizes.

Conclusion: Data-driven feedback systems can effectively improve learning, though personalization via CB may not always yield significant gains over optimized MAB policies.

Abstract: We present an online tutoring system that learns to provide effective feedback to students after they answer questions incorrectly. Using data from one million students, the system learns which assistance action (e.g., one of multiple hints) to provide for each question to optimize student learning. Employing the multi-armed bandit (MAB) framework and offline policy evaluation, we assess 43,000 assistance actions, and identify trade-offs between assistance policies optimized for different student outcomes (e.g., response correctness, session completion). We design an algorithm that for each question decides on a suitable policy training objective to enhance students’ immediate second attempt success and overall practice session performance. We evaluate the resulting MAB policies in 166,000 practice sessions, verifying significant improvements in student outcomes. While MAB policies optimize feedback for the overall student population, we further investigate whether contextual bandit (CB) policies can enhance outcomes by personalizing feedback based on individual student features (e.g., ability estimates, response times). Using causal inference, we examine (i) how effects of assistance actions vary across students and (ii) whether CB policies, which leverage such effect heterogeneity, outperform MAB policies. While our analysis reveals that some actions for some questions exhibit effect heterogeneity, effect sizes may often be too small for CB policies to provide significant improvements beyond what well-optimized MAB policies that deliver the same action to all students already achieve. We discuss insights gained from deploying data-driven systems at scale and implications for future refinements. Today, the teaching policies optimized by our system support thousands of students daily.
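
At its simplest, the setup is one bandit per question with assistance actions as arms and a student outcome as reward. Epsilon-greedy below is an illustrative stand-in for the production system's policy learning and offline evaluation machinery.

```python
import numpy as np

class PerQuestionBandit:
    """One bandit per question; arms are assistance actions (e.g. hints),
    reward is a chosen outcome such as second-attempt success."""

    def __init__(self, n_actions, epsilon=0.1):
        self.eps = epsilon
        self.counts = np.zeros(n_actions)
        self.values = np.zeros(n_actions)     # running mean reward per arm

    def choose(self, rng):
        if rng.random() < self.eps:
            return int(rng.integers(len(self.values)))   # explore
        return int(np.argmax(self.values))               # exploit

    def update(self, action, reward):
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]

rng = np.random.default_rng(1)
bandit = PerQuestionBandit(n_actions=3)
for _ in range(1000):                         # simulated student responses
    a = bandit.choose(rng)
    reward = rng.random() < (0.3, 0.5, 0.4)[a]   # hint 1 truly best
    bandit.update(a, float(reward))
print(bandit.values.round(2))                 # mean reward estimate per hint
```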

[293] Toward using explainable data-driven surrogate models for treating performance-based seismic design as an inverse engineering problem

Mohsen Zaker Esteghamati

Main category: cs.LG

TL;DR: The paper introduces a method using explainable machine learning to optimize seismic design by directly mapping design variables to performance metrics, improving efficiency. It’s applied to steel and concrete frames, showing high accuracy (R2>90%) and effective optimization.

DetailsMotivation: To address computational inefficiencies in performance-based seismic design by treating it as an inverse problem, aiming to derive design parameters directly from performance objectives.

Method: Uses explainable machine learning models to map design variables to performance metrics, integrated with a genetic optimization algorithm to solve the inverse problem. Applied to steel and concrete moment frames in Los Angeles and Charleston.

Result: High accuracy (R2>90%) in surrogate models across diverse building types, geometries, and seismic conditions. Optimization successfully identified optimal member properties.

Conclusion: The methodology effectively bridges performance objectives and design parameters, offering a computationally efficient and accurate approach to seismic design optimization.

Abstract: This study presents a methodology to treat performance-based seismic design as an inverse engineering problem, where design parameters are directly derived to achieve specific performance objectives. By implementing explainable machine learning models, this methodology directly maps design variables and performance metrics, tackling the computational inefficiencies of performance-based design. The resultant machine learning model is integrated as an evaluation function into a genetic optimization algorithm to solve the inverse problem. The developed methodology is then applied to two different inventories of steel and concrete moment frames in Los Angeles and Charleston to obtain sectional properties of frame members that minimize expected annualized seismic loss in terms of repair costs. The results show high accuracy of the surrogate models (e.g., R2 > 90%) across a diverse set of building types, geometries, seismic designs, and site hazards, and the optimization algorithm identified optimal member properties for a fixed set of geometric variables, consistent with engineering principles.

[294] Invariant Graph Transformer for Out-of-Distribution Generalization

Tianyin Liao, Ziwei Zhang, Yufei Sun, Chunyu Hu, Jianxin Li

Main category: cs.LG

TL;DR: GOODFormer is a Graph Transformer designed to generalize under distribution shifts by capturing invariant graph patterns through three modules: disentangling subgraphs, encoding dynamic subgraphs, and invariant learning.

DetailsMotivation: Existing Graph Transformers fail to generalize under distribution shifts, necessitating a solution to capture invariant graph patterns for better generalization.

Method: GOODFormer uses an entropy-guided invariant subgraph disentangler, an evolving subgraph encoder, and an invariant learning module to derive generalizable representations.

Result: Extensive experiments show GOODFormer outperforms state-of-the-art baselines under distribution shifts.

Conclusion: GOODFormer effectively addresses generalization challenges in Graph Transformers under distribution shifts, supported by theoretical and empirical evidence.

Abstract: Graph Transformers (GTs) have demonstrated great effectiveness across various graph analytical tasks. However, existing GTs focus on training and testing graph data originating from the same distribution and fail to generalize under distribution shifts. Graph invariant learning, aiming to capture generalizable graph structural patterns with labels under distribution shifts, is potentially a promising solution, but how to design attention mechanisms and positional and structural encodings (PSEs) based on graph invariant learning principles remains challenging. To solve these challenges, we introduce Graph Out-Of-Distribution generalized Transformer (GOODFormer), aiming to learn generalized graph representations by capturing invariant relationships between predictive graph structures and labels through jointly optimizing three modules. Specifically, we first develop a GT-based entropy-guided invariant subgraph disentangler to separate invariant and variant subgraphs while preserving the sharpness of the attention function. Next, we design an evolving subgraph positional and structural encoder to effectively and efficiently capture the encoding information of dynamically changing subgraphs during training. Finally, we propose an invariant learning module utilizing subgraph node representations and encodings to derive generalizable graph representations that generalize to unseen graphs. We also provide theoretical justifications for our method. Extensive experiments on benchmark datasets demonstrate the superiority of our method over state-of-the-art baselines under distribution shifts.

[295] PnP-DA: Towards Principled Plug-and-Play Integration of Variational Data Assimilation and Generative Models

Yongquan Qu, Matthieu Blanke, Sara Shamekh, Pierre Gentine

Main category: cs.LG

TL;DR: PnP-DA, a Plug-and-Play data assimilation method, combines gradient-based updates with a pretrained generative prior to reduce forecast errors in chaotic systems, outperforming classical variational methods.

DetailsMotivation: Earth system modeling faces challenges in capturing multiscale dynamics and minimizing forecast errors. Conventional data assimilation methods assume Gaussian errors, which fail in chaotic systems.

Method: PnP-DA alternates between a gradient-based analysis update and a forward pass through a pretrained generative prior, avoiding restrictive statistical assumptions and complex backpropagation.

Result: Experiments show PnP-DA consistently reduces forecast errors across varying observation sparsities and noise levels.

Conclusion: PnP-DA offers an effective alternative to classical variational methods by leveraging generative priors and relaxing Gaussian assumptions.

Abstract: Earth system modeling presents a fundamental challenge in scientific computing: capturing complex, multiscale nonlinear dynamics in computationally efficient models while minimizing forecast errors caused by necessary simplifications. Even the most powerful AI- or physics-based forecast systems suffer from gradual error accumulation. Data assimilation (DA) aims to mitigate these errors by optimally blending (noisy) observations with prior model forecasts, but conventional variational methods often assume Gaussian error statistics that fail to capture the true, non-Gaussian behavior of chaotic dynamical systems. We propose PnP-DA, a Plug-and-Play algorithm that alternates (1) a lightweight, gradient-based analysis update (using a Mahalanobis-distance misfit on new observations) with (2) a single forward pass through a pretrained generative prior conditioned on the background forecast via a conditional Wasserstein coupling. This strategy relaxes restrictive statistical assumptions and leverages rich historical data without requiring an explicit regularization functional, and it also avoids the need to backpropagate gradients through the complex neural network that encodes the prior during assimilation cycles. Experiments on standard chaotic testbeds demonstrate that this strategy consistently reduces forecast errors across a range of observation sparsities and noise levels, outperforming classical variational methods.
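
The two alternating steps can be made concrete with a small sketch (toy linear observations; the generative prior below is a hypothetical stand-in, since the paper's prior is a pretrained conditional network):

```python
# Minimal sketch of the PnP-DA alternation: step (1) is a gradient update on
# a Mahalanobis-distance misfit to the observations; step (2) is one forward
# pass through a (here, stand-in) generative prior conditioned on the
# background forecast.
import numpy as np

def mahalanobis_misfit_grad(x, y_obs, H, R_inv):
    """Gradient of 0.5 * (Hx - y)^T R^{-1} (Hx - y)."""
    return H.T @ R_inv @ (H @ x - y_obs)

def generative_prior(x, background):
    """Stand-in for one forward pass through the pretrained prior:
    here, a soft pull toward the background forecast."""
    return 0.7 * x + 0.3 * background

def pnp_da(background, y_obs, H, R_inv, steps=20, lr=0.1):
    x = background.copy()
    for _ in range(steps):
        x = x - lr * mahalanobis_misfit_grad(x, y_obs, H, R_inv)  # analysis
        x = generative_prior(x, background)                       # prior pass
    return x

n, m = 8, 4
rng = np.random.default_rng(1)
H = rng.normal(size=(m, n))          # observation operator
R_inv = np.eye(m)                    # inverse observation-error covariance
truth = rng.normal(size=n)
background = truth + 0.5 * rng.normal(size=n)
y_obs = H @ truth + 0.1 * rng.normal(size=m)
print("analysis state:", pnp_da(background, y_obs, H, R_inv).round(3))
```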

[296] Embryology of a Language Model

George Wang, Garrett Baker, Andrew Gordon, Daniel Murfet

Main category: cs.LG

TL;DR: The paper introduces an embryological approach using UMAP on susceptibility matrices to visualize language model development, revealing new structures like a “spacing fin” for counting spaces.

DetailsMotivation: To understand the internal computational structure of language models and leverage susceptibilities for visualizing network organization.

Method: Applied UMAP to susceptibility matrices to visualize structural development during training.

Result: Discovered a “body plan” in models, including known features (e.g., induction circuit) and new structures (e.g., “spacing fin”).

Conclusion: Susceptibility analysis is a powerful tool for uncovering novel mechanisms and studying neural network development.

Abstract: Understanding how language models develop their internal computational structure is a central problem in the science of deep learning. While susceptibilities, drawn from statistical physics, offer a promising analytical tool, their full potential for visualizing network organization remains untapped. In this work, we introduce an embryological approach, applying UMAP to the susceptibility matrix to visualize the model’s structural development over training. Our visualizations reveal the emergence of a clear "body plan," charting the formation of known features like the induction circuit and discovering previously unknown structures, such as a "spacing fin" dedicated to counting space tokens. This work demonstrates that susceptibility analysis can move beyond validation to uncover novel mechanisms, providing a powerful, holistic lens for studying the developmental principles of complex neural networks.
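
The visualization recipe itself is compact. A minimal sketch follows, with a random stand-in for the susceptibility matrix (in the paper this matrix comes from susceptibility estimates over model components):

```python
# Sketch of the embryological visualization: embed the susceptibility matrix
# with UMAP at a training checkpoint. The matrix here is random stand-in
# data; the real one is computed from the model under study.
import numpy as np
import umap  # pip install umap-learn
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
susceptibility = rng.normal(size=(500, 64))  # components x perturbation probes

embedding = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(susceptibility)

plt.scatter(embedding[:, 0], embedding[:, 1], s=4)
plt.title("UMAP of susceptibility matrix (one training checkpoint)")
plt.show()
```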

[297] BOOD: Boundary-based Out-Of-Distribution Data Generation

Qilin Liao, Shuo Yang, Bo Zhao, Ping Luo, Hengshuang Zhao

Main category: cs.LG

TL;DR: BOOD is a novel framework using diffusion models to generate high-quality OOD features by perturbing ID features near decision boundaries, improving OOD detection performance.

DetailsMotivation: Enhancing OOD detection by addressing the challenge of extracting effective OOD features due to unclear decision boundaries in latent space.

Method: BOOD learns a text-conditioned latent space, selects boundary-proximate ID features, perturbs them into OOD features, and decodes these into images using diffusion models.

Result: BOOD achieves a 29.64% FPR95 reduction and 7.27% AUROC improvement on CIFAR-100, outperforming state-of-the-art methods.

Conclusion: BOOD offers an efficient, high-quality OOD feature synthesis method, significantly advancing OOD detection performance.

Abstract: Harnessing the power of diffusion models to synthesize auxiliary training data based on latent space features has proven effective in enhancing out-of-distribution (OOD) detection performance. However, extracting effective features outside the in-distribution (ID) boundary in latent space remains challenging due to the difficulty of identifying decision boundaries between classes. This paper proposes a novel framework called Boundary-based Out-Of-Distribution data generation (BOOD), which synthesizes high-quality OOD features and generates human-compatible outlier images using diffusion models. BOOD first learns a text-conditioned latent feature space from the ID dataset, selects ID features closest to the decision boundary, and perturbs them to cross the decision boundary to form OOD features. These synthetic OOD features are then decoded into images in pixel space by a diffusion model. Compared to previous works, BOOD provides a more training-efficient strategy for synthesizing informative OOD features, facilitating clearer distinctions between ID and OOD data. Extensive experimental results on common benchmarks demonstrate that BOOD surpasses the state-of-the-art method significantly, achieving a 29.64% decrease in average FPR95 (40.31% vs. 10.67%) and a 7.27% improvement in average AUROC (90.15% vs. 97.42%) on the CIFAR-100 dataset.
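
The boundary-crossing perturbation at the heart of BOOD can be sketched as follows (toy classifier and features; the update rule is an illustrative gradient ascent, not the paper's exact procedure):

```python
# Sketch: pick ID features whose classifier margin is smallest (closest to
# the decision boundary), then perturb them along the gradient that increases
# the classification loss until the predicted class flips.
import torch

torch.manual_seed(0)
clf = torch.nn.Linear(16, 10)            # stand-in latent-space classifier
feats = torch.randn(256, 16)             # ID latent features
labels = clf(feats).argmax(dim=1)        # treat predictions as ID labels

# Margin = top logit minus runner-up; small margin = near the boundary.
top2 = clf(feats).topk(2, dim=1).values
margin = top2[:, 0] - top2[:, 1]
idx = margin.argsort()[:32]
x = feats[idx].clone().requires_grad_(True)
y = labels[idx]

for _ in range(50):
    loss = torch.nn.functional.cross_entropy(clf(x), y)
    (g,) = torch.autograd.grad(loss, x)
    x = (x + 0.1 * g).detach().requires_grad_(True)  # ascend across boundary

crossed = (clf(x).argmax(dim=1) != y).float().mean()
print(f"fraction pushed across the boundary: {crossed:.2f}")
```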

[298] Sheaf Graph Neural Networks via PAC-Bayes Spectral Optimization

Yoonhyuk Choi, Jiho Choi, Chong-Kwon Kim

Main category: cs.LG

TL;DR: SGPC introduces a unified GNN architecture with PAC-Bayes calibration to address over-smoothing in heterophilic graphs, combining sheaf-based message passing, optimal transport, and spectral regularization for robust performance.

DetailsMotivation: Over-smoothing in GNNs causes feature collapse in heterophilic graphs. Existing sheaf-based methods lack generalization, scalability, and stability guarantees.

Method: SGPC integrates cellular-sheaf message passing, optimal transport lifting, variance-reduced diffusion, and PAC-Bayes spectral regularization for semi-supervised node classification.

Result: SGPC outperforms state-of-the-art GNNs on nine benchmarks, providing certified confidence intervals on unseen nodes with linear computational complexity.

Conclusion: SGPC offers a scalable, robust solution for heterophilic graphs with theoretical guarantees and practical efficiency.

Abstract: Over-smoothing in Graph Neural Networks (GNNs) causes collapse in distinct node features, particularly on heterophilic graphs where adjacent nodes often have dissimilar labels. Although sheaf neural networks partially mitigate this problem, they typically rely on static or heavily parameterized sheaf structures that hinder generalization and scalability. Existing sheaf-based models either predefine restriction maps or introduce excessive complexity, yet fail to provide rigorous stability guarantees. In this paper, we introduce a novel scheme called SGPC (Sheaf GNNs with PAC-Bayes Calibration), a unified architecture that combines cellular-sheaf message passing with several mechanisms, including optimal transport-based lifting, variance-reduced diffusion, and PAC-Bayes spectral regularization for robust semi-supervised node classification. We establish performance bounds theoretically and demonstrate that the resulting bound-aware objective can be achieved via end-to-end training in linear computational complexity. Experiments on nine homophilic and heterophilic benchmarks show that SGPC outperforms state-of-the-art spectral and sheaf-based GNNs while providing certified confidence intervals on unseen nodes.

[299] OID-PPO: Optimal Interior Design using Proximal Policy Optimization by Transforming Design Guidelines into Reward Functions

Chanyoung Yoon, Sangbong Yoo, Soobin Yim, Chansoo Kim, Yun Jang

Main category: cs.LG

TL;DR: OID-PPO, a novel RL framework, improves residential interior design by integrating expert guidelines into a reward function, enabling continuous furniture placement and outperforming existing methods in quality and efficiency.

DetailsMotivation: Residential interior design is complex due to unstructured layouts, high computational demands, and reliance on expertise. Current methods are either costly or data-limited, and RL approaches often lack design principles.

Method: OID-PPO uses Proximal Policy Optimization with a diagonal Gaussian policy for continuous furniture placement, incorporating expert-defined functional and visual guidelines into a structured reward function.

Result: OID-PPO outperforms state-of-the-art methods in layout quality and computational efficiency across diverse room shapes and furniture configurations.

Conclusion: The framework successfully integrates design principles, demonstrates the impact of structured guidelines, and highlights individual constraint contributions, advancing automated interior design.

Abstract: Designing residential interiors strongly impacts occupant satisfaction but remains challenging due to unstructured spatial layouts, high computational demands, and reliance on expert knowledge. Existing methods based on optimization or deep learning are either computationally expensive or constrained by data scarcity. Reinforcement learning (RL) approaches often limit furniture placement to discrete positions and fail to incorporate design principles adequately. We propose OID-PPO, a novel RL framework for Optimal Interior Design using Proximal Policy Optimization, which integrates expert-defined functional and visual guidelines into a structured reward function. OID-PPO utilizes a diagonal Gaussian policy for continuous and flexible furniture placement, effectively exploring latent environmental dynamics under partial observability. Experiments conducted across diverse room shapes and furniture configurations demonstrate that OID-PPO significantly outperforms state-of-the-art methods in terms of layout quality and computational efficiency. Ablation studies further demonstrate the impact of structured guideline integration and reveal the distinct contributions of individual design constraints.
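
The diagonal Gaussian policy mentioned in the method is standard and easy to sketch (observation and action sizes here are hypothetical; the log-probability feeds PPO's clipped surrogate objective):

```python
# Sketch of a diagonal Gaussian policy for continuous furniture placement:
# the network outputs a mean, a per-dimension log-std is learned, and actions
# (e.g., x, y, rotation) are sampled from an independent Normal.
import torch

class DiagGaussianPolicy(torch.nn.Module):
    def __init__(self, obs_dim=32, act_dim=3):   # e.g., (x, y, rotation)
        super().__init__()
        self.body = torch.nn.Sequential(
            torch.nn.Linear(obs_dim, 64), torch.nn.Tanh(),
            torch.nn.Linear(64, act_dim),
        )
        self.log_std = torch.nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.body(obs)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        action = dist.sample()
        # Sum per-dimension log-probs: dimensions are independent (diagonal).
        return action, dist.log_prob(action).sum(-1)

policy = DiagGaussianPolicy()
action, logp = policy(torch.randn(1, 32))
print(action, logp)   # the log-prob enters PPO's clipped objective
```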

[300] Dual Adaptivity: Universal Algorithms for Minimizing the Adaptive Regret of Convex Functions

Lijun Zhang, Wenhao Yang, Guanghui Wang, Wei Jiang, Zhi-Hua Zhou

Main category: cs.LG

TL;DR: The paper proposes a universal meta-expert framework for dual adaptive algorithms in online learning, addressing limitations of existing methods by adapting to various convex function types and changing environments.

DetailsMotivation: Existing algorithms for adaptive regret minimization lack universality, as they handle only one convex function type and require prior knowledge, limiting real-world applicability.

Method: A meta-expert framework is introduced, dynamically creating and aggregating experts via a meta-algorithm with second-order bounds, incorporating sleeping experts for environmental changes. Two expert construction strategies are proposed.

Result: The algorithms minimize adaptive regret for multiple convex function types simultaneously, allowing function type switches between rounds, and extend to online composite optimization.

Conclusion: The framework achieves universality and dual adaptivity, enhancing applicability in dynamic real-world scenarios.

Abstract: To deal with changing environments, a new performance measure – adaptive regret, defined as the maximum static regret over any interval, was proposed in online learning. Under the setting of online convex optimization, several algorithms have been successfully developed to minimize the adaptive regret. However, existing algorithms lack universality in the sense that they can only handle one type of convex functions and need a priori knowledge of parameters, which hinders their application in real-world scenarios. To address this limitation, this paper investigates universal algorithms with dual adaptivity, which automatically adapt to the property of functions (convex, exponentially concave, or strongly convex), as well as the nature of environments (stationary or changing). Specifically, we propose a meta-expert framework for dual adaptive algorithms, where multiple experts are created dynamically and aggregated by a meta-algorithm. The meta-algorithm is required to yield a second-order bound, which can accommodate unknown function types. We further incorporate the technique of sleeping experts to capture the changing environments. For the construction of experts, we introduce two strategies (increasing the number of experts or enhancing the capabilities of experts) to achieve universality. Theoretical analysis shows that our algorithms are able to minimize the adaptive regret for multiple types of convex functions simultaneously, and also allow the type of functions to switch between rounds. Moreover, we extend our meta-expert framework to online composite optimization, and develop a universal algorithm for minimizing the adaptive regret of composite functions.

[301] ExeKGLib: A Platform for Machine Learning Analytics based on Knowledge Graphs

Antonis Klironomos, Baifan Zhou, Zhipeng Tan, Zhuoxun Zheng, Mohamed H. Gad-Elrab, Heiko Paulheim, Evgeny Kharlamov

Main category: cs.LG

TL;DR: ExeKGLib is a Python library with a graphical interface for non-ML experts to build ML pipelines using knowledge graphs.

DetailsMotivation: Domain experts lack ML expertise but need ML-based analytics, making pipeline development challenging.

Method: ExeKGLib uses knowledge graphs to simplify ML pipeline creation for non-experts, ensuring transparency and reusability.

Result: The library is demonstrated through real use cases, proving its usability and effectiveness.

Conclusion: ExeKGLib bridges the gap for non-ML experts, enabling accessible and executable ML workflows.

Abstract: Nowadays machine learning (ML) practitioners have access to numerous ML libraries available online. Such libraries can be used to create ML pipelines that consist of a series of steps, where each step may invoke several ML libraries for various data-driven analytical tasks. Development of high-quality ML pipelines is non-trivial; it requires training, ML expertise, and careful development of each step. At the same time, domain experts in science and engineering may not possess such ML expertise and training while they are in pressing need of ML-based analytics. In this paper, we present ExeKGLib, a Python library enhanced with a graphical interface layer that allows users with minimal ML knowledge to build ML pipelines. This is achieved by relying on knowledge graphs that encode ML knowledge in simple terms accessible to non-ML experts. ExeKGLib also allows improving the transparency and reusability of the built ML workflows and ensures that they are executable. We show the usability and usefulness of ExeKGLib by presenting real use cases.

[302] Co-Reward: Self-supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement

Zizhuo Zhang, Jianing Zhu, Xinmu Ge, Zihua Zhao, Zhanke Zhou, Xuan Li, Xiao Feng, Jiangchao Yao, Bo Han

Main category: cs.LG

TL;DR: Co-Reward is a self-supervised RL framework using contrastive agreement across similar questions to improve LLM reasoning, outperforming baselines and even GT rewards.

DetailsMotivation: Addresses the scaling up dilemma in RLVR by reducing reliance on human labels and avoiding collapse issues in self-reward methods.

Method: Constructs similar questions without labels, synthesizes surrogate labels via rollout voting, and enforces reasoning consistency across analogical inputs.

Result: Achieves superior performance on reasoning benchmarks, improving up to +6.8% over GT rewards on MATH500.

Conclusion: Co-Reward effectively enhances reasoning stability and performance, surpassing traditional self-reward and GT-labeled methods.

Abstract: Although reinforcement learning with verifiable rewards (RLVR) shows promise in improving the reasoning ability of large language models (LLMs), the scaling up dilemma remains due to the reliance on human annotated labels especially for complex tasks. Recent alternatives that explore various self-reward signals exhibit the eliciting potential of LLM reasoning, but suffer from the non-negligible collapse issue. Inspired by the success of self-supervised learning, we propose \textit{Co-Reward}, a novel RL framework that leverages contrastive agreement across semantically analogical questions as a reward basis. Specifically, we construct a similar question for each training sample (without labels) and synthesize their individual surrogate labels through a simple rollout voting, and then the reward is constructed by cross-referring the labels of each question pair to enforce the internal reasoning consistency across analogical inputs. Intuitively, such a self-supervised reward-shaping mechanism makes it harder for learning to collapse into a trivial solution, and promotes stable reasoning elicitation and improvement by expanding the input sample variants. Empirically, Co-Reward achieves superior performance compared to other self-reward baselines on multiple reasoning benchmarks and LLM series, and reaches or even surpasses ground-truth (GT) labeled reward, with improvements of up to $+6.8\%$ on MATH500 over GT reward on Llama-3.2-3B-Instruct. Our code is publicly available at https://github.com/tmlr-group/Co-Reward.
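
The label-synthesis and cross-referenced reward can be sketched in a few lines (here `sample_answers` is a hypothetical stand-in for LLM rollouts, and the reward rule is an illustrative reading of the mechanism):

```python
# Sketch of Co-Reward's reward basis: majority-vote over rollouts yields a
# surrogate label for each question, and each side is rewarded for agreeing
# with the *other* side's label, enforcing cross-input consistency.
from collections import Counter
import random

random.seed(0)

def sample_answers(question, k=8):
    """Stand-in for k LLM rollouts; returns k final answers."""
    return [random.choice(["42", "42", "42", "7"]) for _ in range(k)]

def vote(answers):
    return Counter(answers).most_common(1)[0][0]

def co_reward(question, paraphrase):
    ans_q = sample_answers(question)
    ans_p = sample_answers(paraphrase)
    label_q, label_p = vote(ans_q), vote(ans_p)
    # Rollouts for one question are scored against the other question's
    # voted label.
    r_q = [1.0 if a == label_p else 0.0 for a in ans_q]
    r_p = [1.0 if a == label_q else 0.0 for a in ans_p]
    return r_q, r_p

print(co_reward("What is 6 x 7?", "Compute seven times six."))
```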

[303] Transforming Credit Risk Analysis: A Time-Series-Driven ResE-BiLSTM Framework for Post-Loan Default Detection

Yue Yang, Yuxiang Lin, Ying Zhang, Zihan Su, Chang Chuan Goh, Tangtangfang Fang, Anthony Graham Bellotti, Boon Giin Lee

Main category: cs.LG

TL;DR: The paper introduces a ResE-BiLSTM model for predicting post-loan default, outperforming baseline models like LSTM, BiLSTM, GRU, CNN, and RNN on the Freddie Mac dataset.

DetailsMotivation: Improving prediction of post-loan default is crucial for credit risk management, addressed here using machine learning for financial anomaly detection.

Method: The ResE-BiLSTM model, using a sliding window technique, is evaluated on 44 cohorts from the Freddie Mac dataset and compared with five baseline models. Ablation study and SHAP analysis are also conducted.

Result: ResE-BiLSTM achieves superior performance in metrics like Accuracy, Precision, Recall, F1, and AUC compared to baseline models.

Conclusion: The ResE-BiLSTM model demonstrates practical value and applicability in real-world credit risk scenarios.

Abstract: Prediction of post-loan default is an important task in credit risk management, and can be addressed by detection of financial anomalies using machine learning. This study introduces a ResE-BiLSTM model that uses a sliding-window technique; to improve prediction performance, it is evaluated on 44 independent cohorts from the extensive Freddie Mac US mortgage dataset. The ResE-BiLSTM is compared with five baseline models: Long Short-Term Memory (LSTM), BiLSTM, Gated Recurrent Units (GRU), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN), across multiple metrics, including Accuracy, Precision, Recall, F1, and AUC. An ablation study was conducted to evaluate the contribution of individual components in the ResE-BiLSTM architecture. Additionally, SHAP analysis was employed to interpret the underlying features the model relied upon for its predictions. Experimental results demonstrate that ResE-BiLSTM achieves superior predictive performance compared to baseline models, underscoring its practical value and applicability in real-world scenarios.
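
The sliding-window construction the model feeds on is simple to sketch (feature count, window length, and labeling rule here are hypothetical):

```python
# Sketch of a sliding-window dataset for sequence models: each window of w
# consecutive monthly observations becomes one input sequence, labeled by
# the default flag at the following month.
import numpy as np

def make_windows(series, labels, w=12):
    """series: (T, F) per-month features; labels: (T,) default flags."""
    X = np.stack([series[t : t + w] for t in range(len(series) - w)])
    y = labels[w:]               # predict status right after each window
    return X, y

T, F = 60, 10                    # 5 years of monthly records, 10 features
series = np.random.default_rng(0).normal(size=(T, F))
labels = (np.random.default_rng(1).uniform(size=T) < 0.05).astype(int)
X, y = make_windows(series, labels)
print(X.shape, y.shape)          # (48, 12, 10) (48,) -> fed to the BiLSTM
```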

[304] A Conditional GAN for Tabular Data Generation with Probabilistic Sampling of Latent Subspaces

Leonidas Akritidis, Panayiotis Bozanis

Main category: cs.LG

TL;DR: The paper introduces ctdGAN, a conditional GAN for addressing class imbalance in tabular data by leveraging space partitioning and a novel sampling strategy.

DetailsMotivation: Class imbalance in tabular data degrades ML performance. Existing GANs ignore vector subspaces and treat class labels poorly.

Method: ctdGAN uses space partitioning, probabilistic sampling, and a new loss function to generate data in subspaces resembling the original distribution.

Result: ctdGAN outperforms on 14 datasets, generating high-fidelity samples and improving classification accuracy.

Conclusion: ctdGAN effectively mitigates class imbalance by generating realistic samples in appropriate subspaces.

Abstract: The tabular form constitutes the standard way of representing data in relational database systems and spreadsheets. But, similarly to other forms, tabular data suffers from class imbalance, a problem that causes serious performance degradation in a wide variety of machine learning tasks. One of the most effective solutions dictates the usage of Generative Adversarial Networks (GANs) in order to synthesize artificial data instances for the under-represented classes. Despite their good performance, none of the proposed GAN models takes into account the vector subspaces of the input samples in the real data space, leading to data generation in arbitrary locations. Moreover, the class labels are treated in the same manner as the other categorical variables during training, so conditional sampling by class is rendered less effective. To overcome these problems, this study presents ctdGAN, a conditional GAN for alleviating class imbalance in tabular datasets. Initially, ctdGAN executes a space partitioning step to assign cluster labels to the input samples. Subsequently, it utilizes these labels to synthesize samples via a novel probabilistic sampling strategy and a new loss function that penalizes both cluster and class mis-predictions. In this way, ctdGAN is trained to generate samples in subspaces that resemble those of the original data distribution. We also introduce several other improvements, including a simple, yet effective cluster-wise scaling technique that captures multiple feature modes without affecting data dimensionality. The exhaustive evaluation of ctdGAN with 14 imbalanced datasets demonstrated its superiority in generating high fidelity samples and improving classification accuracy.
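
Two of the ingredients, the space-partitioning step and the cluster-wise scaling, can be sketched directly (illustrative details, not the paper's exact formulation):

```python
# Sketch of (1) space partitioning that assigns a cluster label to each
# input sample, and (2) cluster-wise scaling, which standardizes each
# feature within its cluster so multiple feature modes survive without
# changing data dimensionality.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 5)),
               rng.normal(8, 2, (50, 5))])   # imbalanced, two modes

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

X_scaled = np.empty_like(X)
for c in np.unique(clusters):
    mask = clusters == c
    mu, sd = X[mask].mean(axis=0), X[mask].std(axis=0) + 1e-8
    X_scaled[mask] = (X[mask] - mu) / sd     # per-cluster standardization

# `clusters` then conditions the generator, and the loss penalizes both
# cluster and class mis-predictions during training.
print(X_scaled.mean().round(3), X_scaled.std().round(3))
```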

[305] Court of LLMs: Evidence-Augmented Generation via Multi-LLM Collaboration for Text-Attributed Graph Anomaly Detection

Yiming Xu, Jiarun Chen, Zhen Peng, Zihan Chen, Qika Lin, Lan Ma, Bin Shi, Bo Dong

Main category: cs.LG

TL;DR: CoLL combines LLMs and GNNs for graph anomaly detection in text-attributed graphs, improving AP by 13.37%.

DetailsMotivation: Existing GAD methods overlook textual modality's value, using shallow embeddings and missing semantic context. LLMs offer strong semantic understanding but struggle with graph structure.

Method: CoLL integrates LLMs for semantic context and GNNs for structural information, using multi-LLM collaboration and a gating mechanism for adaptive fusion.

Result: CoLL outperforms existing methods, achieving a 13.37% average improvement in AP.

Conclusion: CoLL successfully leverages LLMs and GNNs for TAG anomaly detection, opening new avenues for LLM integration in GAD.

Abstract: The natural combination of intricate topological structures and rich textual information in text-attributed graphs (TAGs) opens up a novel perspective for graph anomaly detection (GAD). However, existing GAD methods primarily focus on designing complex optimization objectives within the graph domain, overlooking the complementary value of the textual modality, whose features are often encoded by shallow embedding techniques, such as bag-of-words or skip-gram, so that semantic context related to anomalies may be missed. To unleash the enormous potential of textual modality, large language models (LLMs) have emerged as promising alternatives due to their strong semantic understanding and reasoning capabilities. Nevertheless, their application to TAG anomaly detection remains nascent, and they struggle to encode high-order structural information inherent in graphs due to input length constraints. For high-quality anomaly detection in TAGs, we propose CoLL, a novel framework that combines LLMs and graph neural networks (GNNs) to leverage their complementary strengths. CoLL employs multi-LLM collaboration for evidence-augmented generation to capture anomaly-relevant contexts while delivering human-readable rationales for detected anomalies. Moreover, CoLL integrates a GNN equipped with a gating mechanism to adaptively fuse textual features with evidence while preserving high-order topological information. Extensive experiments demonstrate the superiority of CoLL, achieving an average improvement of 13.37% in AP. This study opens a new avenue for incorporating LLMs in advancing GAD.
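
The gating mechanism that fuses textual evidence with structural features is a standard construction; a minimal sketch (sizes are illustrative):

```python
# Sketch of gated fusion: a learned sigmoid gate decides, per dimension,
# how much of the GNN structural features versus the LLM-derived evidence
# features to keep.
import torch

class GatedFusion(torch.nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.gate = torch.nn.Linear(2 * dim, dim)

    def forward(self, h_graph, h_text):
        g = torch.sigmoid(self.gate(torch.cat([h_graph, h_text], dim=-1)))
        return g * h_graph + (1 - g) * h_text   # adaptive per-dimension mix

fuse = GatedFusion()
out = fuse(torch.randn(32, 64), torch.randn(32, 64))
print(out.shape)
```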

[306] Text-Attributed Graph Anomaly Detection via Multi-Scale Cross- and Uni-Modal Contrastive Learning

Yiming Xu, Xu Hua, Zhen Peng, Bin Shi, Jiarun Chen, Xingbo Fu, Song Wang, Bo Dong

Main category: cs.LG

TL;DR: The paper introduces CMUCL, an end-to-end method for detecting anomalies in text-attributed graphs (TAGs) by jointly modeling text and graph data, improving accuracy by 11.13%.

DetailsMotivation: Existing methods separate text encoding from anomaly detection, limiting performance. Integrating raw text and graph topology for anomaly detection is challenging.

Method: CMUCL jointly trains text and graph encoders using cross-modal and uni-modal consistency, and uses inconsistency mining for anomaly scoring.

Result: CMUCL outperforms existing methods, achieving an 11.13% higher average precision (AP).

Conclusion: CMUCL effectively integrates text and graph data for anomaly detection, with significant performance improvements.

Abstract: The widespread application of graph data in various high-risk scenarios has increased attention to graph anomaly detection (GAD). Faced with real-world graphs that often carry node descriptions in the form of raw text sequences, termed text-attributed graphs (TAGs), existing graph anomaly detection pipelines typically involve shallow embedding techniques to encode such textual information into features, and then rely on complex self-supervised tasks within the graph domain to detect anomalies. However, this text encoding process is separated from the anomaly detection training objective in the graph domain, making it difficult to ensure that the extracted textual features focus on GAD-relevant information, seriously constraining the detection capability. How to seamlessly integrate raw text and graph topology to unleash the vast potential of cross-modal data in TAGs for anomaly detection poses a challenging issue. This paper presents a novel end-to-end paradigm for text-attributed graph anomaly detection, named CMUCL. We simultaneously model data from both text and graph structures, and jointly train text and graph encoders by leveraging cross-modal and uni-modal multi-scale consistency to uncover potential anomaly-related information. Accordingly, we design an anomaly score estimator based on inconsistency mining to derive node-specific anomaly scores. Considering the lack of benchmark datasets tailored for anomaly detection on TAGs, we release 8 datasets to facilitate future research. Extensive evaluations show that CMUCL significantly advances in text-attributed graph anomaly detection, delivering an 11.13% increase in average precision (AP) over the second-best method.
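
A standard instance of the cross-modal consistency objective is an InfoNCE loss between the two encoders' outputs; a minimal sketch in that spirit (the encoders here are stand-ins, and the paper's multi-scale variant adds further terms):

```python
# Sketch of a cross-modal contrastive objective: matched text/graph
# embeddings of the same node are pulled together, mismatched pairs pushed
# apart (symmetric InfoNCE).
import torch
import torch.nn.functional as F

def cross_modal_infonce(z_text, z_graph, tau=0.2):
    z_text = F.normalize(z_text, dim=-1)
    z_graph = F.normalize(z_graph, dim=-1)
    logits = z_text @ z_graph.T / tau        # (N, N) similarity matrix
    targets = torch.arange(len(z_text))      # i-th text matches i-th node
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = cross_modal_infonce(torch.randn(64, 128), torch.randn(64, 128))
print(loss.item())                           # trains both encoders jointly
```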

[307] Online Nonsubmodular Optimization with Delayed Feedback in the Bandit Setting

Sifan Yang, Yuanyu Wan, Lijun Zhang

Main category: cs.LG

TL;DR: The paper addresses limitations in online nonsubmodular optimization with delayed feedback by proposing two algorithms, DBGD-NF and its extension, achieving improved regret bounds and decoupling delay and bandit effects.

DetailsMotivation: Existing regret bounds rely on maximum delay and couple delay and bandit effects, limiting performance. The goal is to improve these bounds and decouple the effects.

Method: Two algorithms are proposed: DBGD-NF uses a one-point gradient estimator and all available gradients, while its extension employs a blocking update mechanism.

Result: DBGD-NF achieves an $\mathcal{O}(n\bar{d}^{1/3}T^{2/3})$ regret bound, and the extension achieves $\mathcal{O}(n(T^{2/3} + \sqrt{dT}))$, outperforming prior work under certain conditions.

Conclusion: The proposed methods offer superior performance, especially when delays are irregular or small, validated by experiments on structured sparse learning.

Abstract: We investigate the online nonsubmodular optimization with delayed feedback in the bandit setting, where the loss function is $\alpha$-weakly DR-submodular and $\beta$-weakly DR-supermodular. Previous work has established an $(\alpha,\beta)$-regret bound of $\mathcal{O}(nd^{1/3}T^{2/3})$, where $n$ is the dimensionality and $d$ is the maximum delay. However, its regret bound relies on the maximum delay and is thus sensitive to irregular delays. Additionally, it couples the effects of delays and bandit feedback as its bound is the product of the delay term and the $\mathcal{O}(nT^{2/3})$ regret bound in the bandit setting without delayed feedback. In this paper, we develop two algorithms to address these limitations, respectively. Firstly, we propose a novel method, namely DBGD-NF, which employs the one-point gradient estimator and utilizes all the available estimated gradients in each round to update the decision. It achieves a better $\mathcal{O}(n\bar{d}^{1/3}T^{2/3})$ regret bound, which is relevant to the average delay $\bar{d} = \frac{1}{T}\sum_{t=1}^T d_t\leq d$. Secondly, we extend DBGD-NF by employing a blocking update mechanism to decouple the joint effect of the delays and bandit feedback, which enjoys an $\mathcal{O}(n(T^{2/3} + \sqrt{dT}))$ regret bound. When $d = \mathcal{O}(T^{1/3})$, our regret bound matches the $\mathcal{O}(nT^{2/3})$ bound in the bandit setting without delayed feedback. Compared to our first $\mathcal{O}(n\bar{d}^{1/3}T^{2/3})$ bound, it is more advantageous when the maximum delay $d = o(\bar{d}^{2/3}T^{1/3})$. Finally, we conduct experiments on structured sparse learning to demonstrate the superiority of our methods.
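
The one-point gradient estimator that DBGD-NF builds on is a classical device worth seeing concretely (a minimal sketch on a toy quadratic, where the true gradient is known):

```python
# Sketch of the one-point estimator: with a single (bandit) evaluation of f
# at a perturbed point, (n/delta) * f(x + delta*u) * u, for u uniform on the
# unit sphere, is an unbiased estimate of the smoothed loss's gradient.
import numpy as np

rng = np.random.default_rng(0)

def one_point_gradient(f, x, delta=0.5):
    n = x.size
    u = rng.normal(size=n)
    u /= np.linalg.norm(u)           # uniform direction on the sphere
    return (n / delta) * f(x + delta * u) * u

f = lambda x: np.sum(x ** 2)         # toy loss; only evaluations observed
x = np.ones(5)                       # true gradient here is 2x = [2, ..., 2]
est = np.mean([one_point_gradient(f, x) for _ in range(50000)], axis=0)
print(est.round(1))                  # averaging many estimates approaches 2x
```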

[308] Phase-Locked SNR Band Selection for Weak Mineral Signal Detection in Hyperspectral Imagery

Judy X Yang

Main category: cs.LG

TL;DR: A two-stage framework enhances mineral detection in hyperspectral imaging by filtering noisy bands and refining data for improved unmixing accuracy.

DetailsMotivation: Weak mineral signatures in hyperspectral imaging are often obscured by noise and redundant bands, limiting detection performance.

Method: A two-stage approach: (1) SNR-based band selection and spectral smoothing, (2) KMeans clustering and NNLS unmixing.

Result: Improved unmixing accuracy and enhanced detection of weak mineral zones.

Conclusion: The framework provides a practical solution for spectral dimensionality reduction and unmixing in geological HSI.

Abstract: Hyperspectral imaging offers detailed spectral information for mineral mapping; however, weak mineral signatures are often masked by noisy and redundant bands, limiting detection performance. To address this, we propose a two-stage integrated framework for enhanced mineral detection in the Cuprite mining district. In the first stage, we compute the signal-to-noise ratio (SNR) for each spectral band and apply a phase-locked thresholding technique to discard low-SNR bands, effectively removing redundancy and suppressing background noise. Savitzky-Golay filtering is then employed for spectral smoothing, serving a dual role: first, to stabilize trends during band selection; and second, to preserve fine-grained spectral features during preprocessing. In the second stage, the refined HSI data is reintroduced into the model, where KMeans clustering is used to extract 12 endmember spectra (W1 custom), followed by non-negative least squares (NNLS) for abundance unmixing. The resulting endmembers are quantitatively compared with laboratory spectra (W1 raw) using cosine similarity and RMSE metrics. Experimental results confirm that our proposed pipeline improves unmixing accuracy and enhances the detection of weak mineral zones. This two-pass strategy demonstrates a practical and reproducible solution for spectral dimensionality reduction and unmixing in geological HSI applications.
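
The pipeline maps cleanly onto standard SciPy/scikit-learn calls; a minimal sketch (random stand-in data, and an illustrative SNR definition and threshold rather than the paper's phase-locked rule):

```python
# Sketch of the two-stage pipeline: per-band SNR thresholding and
# Savitzky-Golay smoothing, then KMeans endmember extraction and NNLS
# abundance unmixing.
import numpy as np
from scipy.signal import savgol_filter
from scipy.optimize import nnls
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
cube = rng.uniform(0, 1, size=(1000, 180))    # pixels x spectral bands

# Stage 1: discard low-SNR bands, then smooth spectra along the band axis.
snr = cube.mean(axis=0) / (cube.std(axis=0) + 1e-8)
keep = snr > np.percentile(snr, 20)           # illustrative threshold
refined = savgol_filter(cube[:, keep], window_length=11, polyorder=3, axis=1)

# Stage 2: extract 12 endmembers by clustering, unmix pixels with NNLS.
endmembers = KMeans(n_clusters=12, n_init=10,
                    random_state=0).fit(refined).cluster_centers_
abundances = np.stack([nnls(endmembers.T, px)[0] for px in refined[:5]])
print(abundances.shape)                        # (5, 12) abundance vectors
```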

[309] Foundations of Interpretable Models

Pietro Barbiero, Mateo Espinosa Zarlenga, Alberto Termine, Mateja Jamnik, Giuseppe Marra

Main category: cs.LG

TL;DR: The paper critiques current interpretability definitions as non-actionable and proposes a new, actionable definition for designing interpretable models, including a blueprint and an open-sourced library.

DetailsMotivation: Existing interpretability definitions lack practicality for model design, making research ill-posed.

Method: Proposes a new definition of interpretability, derives foundational properties, and introduces a blueprint and open-sourced library for interpretable models.

Result: The new definition is actionable, revealing necessary properties and assumptions for interpretable model design.

Conclusion: The work provides a practical framework and tools for advancing interpretable AI research.

Abstract: We argue that existing definitions of interpretability are not actionable in that they fail to inform users about general, sound, and robust interpretable model design. This makes current interpretability research fundamentally ill-posed. To address this issue, we propose a definition of interpretability that is general, simple, and subsumes existing informal notions within the interpretable AI community. We show that our definition is actionable, as it directly reveals the foundational properties, underlying assumptions, principles, data structures, and architectural features necessary for designing interpretable models. Building on this, we propose a general blueprint for designing interpretable models and introduce the first open-sourced library with native support for interpretable data structures and processes.

[310] Learning Potential Energy Surfaces of Hydrogen Atom Transfer Reactions in Peptides

Marlen Neubert, Patrick Reiser, Frauke Gräter, Pascal Friederich

Main category: cs.LG

TL;DR: Machine-learned potentials, particularly MACE, outperform other models in simulating hydrogen atom transfer (HAT) reactions in peptides, enabling quantum-accurate simulations of radical migration in proteins.

DetailsMotivation: Understanding HAT reactions in biological processes is limited by the challenges of simulating them with quantum chemical accuracy at relevant scales.

Method: Systematic generation of HAT configurations in peptides using semiempirical methods and DFT, benchmarking three graph neural network architectures (SchNet, Allegro, MACE).

Result: MACE achieves a mean absolute error of 1.13 kcal/mol on out-of-distribution DFT barrier predictions, outperforming other models.

Conclusion: The approach enables large-scale simulations of HAT reactions in biomolecular systems, with potential for broader applications in chemical reactivity studies.

Abstract: Hydrogen atom transfer (HAT) reactions are essential in many biological processes, such as radical migration in damaged proteins, but their mechanistic pathways remain incompletely understood. Simulating HAT is challenging due to the need for quantum chemical accuracy at biologically relevant scales; thus, neither classical force fields nor DFT-based molecular dynamics are applicable. Machine-learned potentials offer an alternative, able to learn potential energy surfaces (PESs) with near-quantum accuracy. However, training these models to generalize across diverse HAT configurations, especially at radical positions in proteins, requires tailored data generation and careful model selection. Here, we systematically generate HAT configurations in peptides to build large datasets using semiempirical methods and DFT. We benchmark three graph neural network architectures (SchNet, Allegro, and MACE) on their ability to learn HAT PESs and indirectly predict reaction barriers from energy predictions. MACE consistently outperforms the others in energy, force, and barrier prediction, achieving a mean absolute error of 1.13 kcal/mol on out-of-distribution DFT barrier predictions. This accuracy enables integration of ML potentials into large-scale collagen simulations to compute reaction rates from predicted barriers, advancing mechanistic understanding of HAT and radical migration in peptides. We analyze scaling laws, model transferability, and cost-performance trade-offs, and outline strategies for improvement by combining ML potentials with transition state search algorithms and active learning. Our approach is generalizable to other biomolecular systems, enabling quantum-accurate simulations of chemical reactivity in complex environments.

[311] The Role of Active Learning in Modern Machine Learning

Thorben Werner, Lars Schmidt-Thieme, Vijaya Krishna Yalavarthi

Main category: cs.LG

TL;DR: Active Learning (AL) is inefficient in low-data scenarios compared to data augmentation (DA) and semi-supervised learning (SSL), but combining AL with DA and SSL can still yield performance improvements.

DetailsMotivation: AL is rarely applied outside its literature due to high computational cost and small performance gains in low-data scenarios. This study explores methods to address this issue.

Method: The study evaluates the impact of DA, SSL, and AL in low-data scenarios, comparing their efficiency and performance lifts.

Result: AL alone provides only 1-4% lift, while DA and SSL can achieve up to 60%. However, combining AL with DA and SSL still improves performance.

Conclusion: AL should be used as a final step after applying DA and SSL to maximize performance, rather than as a primary solution for missing labels.

Abstract: Even though Active Learning (AL) is widely studied, it is rarely applied in contexts outside its own scientific literature. We posit that the reason for this is AL’s high computational cost coupled with the comparatively small lifts it is typically able to generate in scenarios with few labeled points. In this work we study the impact of different methods to combat this low data scenario, namely data augmentation (DA), semi-supervised learning (SSL) and AL. We find that AL is by far the least efficient method of solving the low data problem, generating a lift of only 1-4% over random sampling, while DA and SSL methods can generate up to 60% lift in combination with random sampling. However, when AL is combined with strong DA and SSL techniques, it is surprisingly still able to provide improvements. Based on these results, we frame AL not as a method to combat missing labels, but as the final building block to squeeze the last bits of performance out of data after appropriate DA and SSL methods have been applied.

[312] Similarity-Based Self-Construct Graph Model for Predicting Patient Criticalness Using Graph Neural Networks and EHR Data

Mukesh Kumar Sahu, Pinki Roy

Main category: cs.LG

TL;DR: Proposes a graph-based model (SBSCGM) for ICU patient criticalness prediction, outperforming baselines with AUC-ROC 0.94.

DetailsMotivation: Improving ICU patient risk prediction by leveraging relational EHR data, which conventional models ignore.

Method: Uses SBSCGM to dynamically build patient similarity graphs and HybridGraphMedGNN (combining GCN, GraphSAGE, GAT) for predictions.

Result: Achieves state-of-the-art performance (AUC-ROC 0.94) on MIMIC-III dataset, with improved precision/recall and interpretability.

Conclusion: The framework is scalable, interpretable, and suitable for real-world ICU deployment.

Abstract: Accurately predicting the criticalness of ICU patients (such as in-ICU mortality risk) is vital for early intervention in critical care. However, conventional models often treat each patient in isolation and struggle to exploit the relational structure in Electronic Health Records (EHR). We propose a Similarity-Based Self-Construct Graph Model (SBSCGM) that dynamically builds a patient similarity graph from multi-modal EHR data, and a HybridGraphMedGNN architecture that operates on this graph to predict patient mortality and a continuous criticalness score. SBSCGM uses a hybrid similarity measure (combining feature-based and structural similarities) to connect patients with analogous clinical profiles in real-time. The HybridGraphMedGNN integrates Graph Convolutional Network (GCN), GraphSAGE, and Graph Attention Network (GAT) layers to learn robust patient representations, leveraging both local and global graph patterns. In experiments on 6,000 ICU stays from the MIMIC-III dataset, our model achieves state-of-the-art performance (AUC-ROC $0.94$) outperforming baseline classifiers and single-type GNN models. We also demonstrate improved precision/recall and show that the attention mechanism provides interpretable insights into model predictions. Our framework offers a scalable and interpretable solution for critical care risk prediction, with potential to support clinicians in real-world ICU deployment.
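
The graph-construction step can be sketched compactly (the structural term, mixing weight, and threshold below are illustrative stand-ins for the paper's hybrid measure):

```python
# Sketch of SBSCGM-style graph construction: a hybrid similarity mixing
# feature (cosine) similarity with a structural term, thresholded to
# connect patients with analogous clinical profiles.
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 32))             # patient feature vectors

# Feature-based similarity: cosine.
norm = feats / np.linalg.norm(feats, axis=1, keepdims=True)
sim_feat = norm @ norm.T

# Structural similarity stand-in: overlap of nearest-neighbor sets.
knn = np.argsort(-sim_feat, axis=1)[:, 1:11]
n = len(feats)
sim_struct = np.array([[len(np.intersect1d(knn[i], knn[j])) / 10.0
                        for j in range(n)] for i in range(n)])

alpha = 0.7                                    # hybrid mixing weight
sim = alpha * sim_feat + (1 - alpha) * sim_struct
adj = (sim > 0.5).astype(float)                # threshold -> edges
np.fill_diagonal(adj, 0)
print("edges:", int(adj.sum() / 2))            # graph fed to the hybrid GNN
```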

[313] IAMAP: Unlocking Deep Learning in QGIS for non-coders and limited computing resources

Paul Tresson, Pierre Le Coz, Hadrien Tulet, Anthony Malkassian, Maxime Réjou Méchain

Main category: cs.LG

TL;DR: IAMAP is a QGIS plugin that simplifies AI-based remote sensing by leveraging self-supervised learning, enabling non-specialists to perform tasks like feature extraction, clustering, and model validation without needing large datasets or coding skills.

DetailsMotivation: The challenges of implementing deep learning in remote sensing—large datasets, computing resources, and coding expertise—limit its accessibility. IAMAP aims to democratize AI for non-specialists.

Method: IAMAP uses self-supervised learning (foundation models) for feature extraction and integrates tools for dimensionality reduction, clustering, similarity mapping, and supervised model calibration.

Result: IAMAP provides a user-friendly solution for remote sensing tasks, reducing dependency on GPUs and large datasets while maintaining computational efficiency.

Conclusion: IAMAP democratizes AI in remote sensing by making advanced deep learning methods accessible and practical for non-specialists.

Abstract: Remote sensing has entered a new era with the rapid development of artificial intelligence approaches. However, the implementation of deep learning has largely remained restricted to specialists and has been impractical because it often requires (i) large reference datasets for model training and validation; (ii) substantial computing resources; and (iii) strong coding skills. Here, we introduce IAMAP, a user-friendly QGIS plugin that addresses these three challenges in an easy yet flexible way. IAMAP builds on recent advancements in self-supervised learning strategies, which now provide robust feature extractors, often referred to as foundation models. These generalist models can often be reliably used in few-shot or zero-shot scenarios (i.e., with little to no fine-tuning). IAMAP’s interface allows users to streamline several key steps in remote sensing image analysis: (i) extracting image features using a wide range of deep learning architectures; (ii) reducing dimensionality with built-in algorithms; (iii) performing clustering on features or their reduced representations; (iv) generating feature similarity maps; and (v) calibrating and validating supervised machine learning models for prediction. By enabling non-AI specialists to leverage the high-quality features provided by recent deep learning approaches without requiring GPU capacity or extensive reference datasets, IAMAP contributes to the democratization of computationally efficient and energy-conscious deep learning methods.

[314] Separated-Variable Spectral Neural Networks: A Physics-Informed Learning Approach for High-Frequency PDEs

Xiong Xiong, Zhuo Zhang, Rongchun Hu, Chen Gao, Zichen Deng

Main category: cs.LG

TL;DR: SV-SNN introduces a novel neural network framework to solve high-frequency oscillatory PDEs by addressing spectral bias in PINNs, achieving significant accuracy and efficiency improvements.

DetailsMotivation: Traditional PINNs struggle with high-frequency solutions due to spectral bias, limiting their effectiveness in applications like fluid mechanics and quantum mechanics.

Method: SV-SNN integrates separation of variables with adaptive spectral methods, decomposing functions into univariate products, using adaptive Fourier features, and leveraging a theoretical SVD framework.

Result: SV-SNN improves accuracy by 1-3 orders of magnitude, reduces parameters by 90%, and cuts training time by 60% on benchmark PDEs.

Conclusion: SV-SNN effectively overcomes spectral bias in neural PDE solving, offering a scalable and accurate solution for high-frequency problems.

Abstract: Solving high-frequency oscillatory partial differential equations (PDEs) is a critical challenge in scientific computing, with applications in fluid mechanics, quantum mechanics, and electromagnetic wave propagation. Traditional physics-informed neural networks (PINNs) suffer from spectral bias, limiting their ability to capture high-frequency solution components. We introduce Separated-Variable Spectral Neural Networks (SV-SNN), a novel framework that addresses these limitations by integrating separation of variables with adaptive spectral methods. Our approach features three key innovations: (1) decomposition of multivariate functions into univariate function products, enabling independent spatial and temporal networks; (2) adaptive Fourier spectral features with learnable frequency parameters for high-frequency capture; and (3) a theoretical framework based on singular value decomposition to quantify spectral bias. Comprehensive evaluation on benchmark problems including the heat, Helmholtz, Poisson, and Navier-Stokes equations demonstrates that SV-SNN achieves 1-3 orders of magnitude improvement in accuracy while reducing parameter count by over 90% and training time by 60%. These results establish SV-SNN as an effective solution to the spectral bias problem in neural PDE solving. The implementation will be made publicly available upon acceptance at https://github.com/xgxgnpu/SV-SNN.
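
The separation-of-variables ansatz, $u(x,t) \approx \sum_k X_k(x)\, T_k(t)$ with learnable Fourier frequencies on the spatial input, can be sketched directly (sizes and the feature map are illustrative):

```python
# Sketch of a separated-variable network: independent spatial and temporal
# networks whose outputs are combined by a rank-wise product and sum, with
# adaptive (learnable) Fourier features on the spatial coordinate.
import torch

class SVNet(torch.nn.Module):
    def __init__(self, rank=16, n_freq=32):
        super().__init__()
        self.freq = torch.nn.Parameter(torch.randn(n_freq))  # learnable freqs
        self.space = torch.nn.Sequential(
            torch.nn.Linear(2 * n_freq, 64), torch.nn.Tanh(),
            torch.nn.Linear(64, rank))
        self.time = torch.nn.Sequential(
            torch.nn.Linear(1, 64), torch.nn.Tanh(),
            torch.nn.Linear(64, rank))

    def forward(self, x, t):                  # x: (N, 1), t: (N, 1)
        z = x * self.freq                     # adaptive Fourier features
        feat = torch.cat([torch.sin(z), torch.cos(z)], dim=-1)
        return (self.space(feat) * self.time(t)).sum(-1, keepdim=True)

net = SVNet()
x, t = torch.rand(128, 1), torch.rand(128, 1)
u = net(x, t)                                 # plug into a PINN residual loss
print(u.shape)
```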

[315] KFS: KAN based adaptive Frequency Selection learning architecture for long term time series forecasting

Changning Wu, Gao Wu, Rongyao Cai, Yong Liu, Kexin Zhang

Main category: cs.LG

TL;DR: Proposes KFS, a KAN-based adaptive frequency selection learning architecture for time series forecasting, addressing noise and heterogeneous information across scales.

DetailsMotivation: Real-world time series suffer from noise interference and suboptimal multi-scale representation due to heterogeneous frequency information.

Method: Uses FreK module for dominant frequency selection, KAN for pattern representation, and timestamp embedding alignment for temporal synchronization.

Result: Achieves state-of-the-art performance on multiple real-world datasets.

Conclusion: KFS is a simple yet effective solution for multi-scale time series forecasting challenges.

Abstract: Multi-scale decomposition architectures have emerged as predominant methodologies in time series forecasting. However, real-world time series exhibit noise interference across different scales, while heterogeneous information distribution among frequency components at varying scales leads to suboptimal multi-scale representation. Inspired by Kolmogorov-Arnold Networks (KAN) and Parseval’s theorem, we propose a KAN-based adaptive Frequency Selection learning architecture (KFS) to address these challenges. This framework tackles prediction challenges stemming from cross-scale noise interference and complex pattern modeling through its FreK module, which performs energy-distribution-based dominant frequency selection in the spectral domain. Simultaneously, KAN enables sophisticated pattern representation while timestamp embedding alignment synchronizes temporal representations across scales. The feature mixing module then fuses scale-specific patterns with aligned temporal features. Extensive experiments across multiple real-world time series datasets demonstrate that KFS achieves state-of-the-art performance as a simple yet effective architecture.
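
The core of dominant-frequency selection is compact; a minimal sketch (the top-k rule is an illustrative stand-in for the paper's energy-distribution-based criterion):

```python
# Sketch of FreK-style selection: rank frequency components by their share
# of spectral energy (Parseval), keep the dominant ones, and reconstruct a
# denoised signal.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 512, endpoint=False)
x = (np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 23 * t)
     + 0.3 * rng.normal(size=512))

spec = np.fft.rfft(x)
energy = np.abs(spec) ** 2                 # per-frequency energy
k = 8
keep = np.argsort(energy)[-k:]             # dominant frequencies
mask = np.zeros_like(spec)
mask[keep] = spec[keep]
x_denoised = np.fft.irfft(mask, n=len(x))
print("kept bins:", np.sort(keep))         # includes bins 5 and 23
```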

[316] Reinforcement Learning for Decision-Level Interception Prioritization in Drone Swarm Defense

Alessandro Palmas

Main category: cs.LG

TL;DR: The paper demonstrates how reinforcement learning can optimize drone swarm interception in defense systems, outperforming rule-based methods in simulations.

DetailsMotivation: Addressing the challenge of prioritizing interceptions in low-cost kamikaze drone swarms to protect high-value targets.

Method: A reinforcement learning agent is trained in a high-fidelity simulation to coordinate multiple effectors for optimal drone interception.

Result: The RL-based policy achieves lower average damage and higher defensive efficiency compared to rule-based baselines.

Conclusion: Reinforcement learning can enhance defense resilience without replacing existing systems, with publicly available code for reproducibility.

Abstract: The growing threat of low-cost kamikaze drone swarms poses a critical challenge to modern defense systems, demanding rapid and strategic decision-making to prioritize interceptions across multiple effectors and high-value target zones. In this work, we present a case study demonstrating the practical advantages of reinforcement learning in addressing this challenge. We introduce a high-fidelity simulation environment that captures realistic operational constraints, within which a decision-level reinforcement learning agent learns to coordinate multiple effectors for optimal interception prioritization. Operating in a discrete action space, the agent selects which drone to engage per effector based on observed state features such as positions, classes, and effector status. We evaluate the learned policy against a handcrafted rule-based baseline across hundreds of simulated attack scenarios. The reinforcement learning based policy consistently achieves lower average damage and higher defensive efficiency in protecting critical zones. This case study highlights the potential of reinforcement learning as a strategic layer within defense architectures, enhancing resilience without displacing existing control systems. All code and simulation assets are publicly released for full reproducibility, and a video demonstration illustrates the policy’s qualitative behavior.

[317] Light-Weight Diffusion Multiplier and Uncertainty Quantification for Fourier Neural Operators

Albert Matveev, Sanmitra Ghosh, Aamal Hussain, James-Michael Leahy, Michalis Michaelides

Main category: cs.LG

TL;DR: DINOZAUR is a diffusion-based neural operator with uncertainty quantification, addressing scalability and UQ limitations of Fourier Neural Operators (FNOs).

DetailsMotivation: FNOs suffer from overparameterization and lack native uncertainty quantification, which is crucial for reliable scientific applications.

Method: DINOZAUR replaces FNOs’ dense tensor multiplier with a diffusion multiplier inspired by the heat kernel, reducing parameters and enabling Bayesian uncertainty quantification.

Result: DINOZAUR achieves competitive or superior performance on PDE benchmarks while providing efficient uncertainty estimates.

Conclusion: DINOZAUR offers a scalable, efficient, and uncertainty-aware alternative to FNOs for solving PDEs.

Abstract: Operator learning is a powerful paradigm for solving partial differential equations, with Fourier Neural Operators serving as a widely adopted foundation. However, FNOs face significant scalability challenges due to overparameterization and offer no native uncertainty quantification (UQ) – a key requirement for reliable scientific and engineering applications. Instead, neural operators rely on post hoc UQ methods that ignore geometric inductive biases. In this work, we introduce DINOZAUR: a diffusion-based neural operator parametrization with uncertainty quantification. Inspired by the structure of the heat kernel, DINOZAUR replaces the dense tensor multiplier in FNOs with a dimensionality-independent diffusion multiplier that has a single learnable time parameter per channel, drastically reducing parameter count and memory footprint without compromising predictive performance. By defining priors over those time parameters, we cast DINOZAUR as a Bayesian neural operator to yield spatially correlated outputs and calibrated uncertainty estimates. Our method achieves competitive or superior performance across several PDE benchmarks while providing efficient uncertainty quantification.
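
The heat-kernel-inspired multiplier is easy to see in code; a minimal 1D sketch (shapes and the wavenumber convention are illustrative assumptions):

```python
# Sketch of a diffusion multiplier: instead of an FNO's dense per-mode weight
# tensor, each Fourier mode is scaled by a heat-kernel factor
# exp(-t * |k|^2), with one learnable time t per channel.
import torch

class DiffusionMultiplier(torch.nn.Module):
    def __init__(self, channels=8):
        super().__init__()
        self.log_t = torch.nn.Parameter(torch.zeros(channels))  # one t per channel

    def forward(self, x):                    # x: (batch, channels, grid)
        n = x.shape[-1]
        k = torch.fft.rfftfreq(n) * n        # integer wavenumbers 0..n/2
        xk = torch.fft.rfft(x, dim=-1)
        decay = torch.exp(-self.log_t.exp()[:, None] * k[None, :] ** 2)
        return torch.fft.irfft(xk * decay, n=n, dim=-1)

layer = DiffusionMultiplier()
out = layer(torch.randn(4, 8, 64))
print(out.shape)                             # (4, 8, 64)
```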

[318] TrajSurv: Learning Continuous Latent Trajectories from Electronic Health Records for Trustworthy Survival Prediction

Sihang Zeng, Lucas Jing Liu, Jun Wen, Meliha Yetisgen, Ruth Etzioni, Gang Luo

Main category: cs.LG

TL;DR: TrajSurv is a model for trustworthy survival prediction using longitudinal EHR data, leveraging NCDE for continuous latent trajectories and contrastive learning for alignment, with superior accuracy and transparency.

DetailsMotivation: Trustworthy survival prediction is crucial for clinical decisions, but irregularly sampled EHR data and opaque models pose challenges.

Method: TrajSurv uses NCDE for continuous latent trajectories, time-aware contrastive learning for alignment, and a two-step interpretation process for transparency.

Result: TrajSurv outperforms existing methods in accuracy and transparency on MIMIC-III and eICU datasets.

Conclusion: TrajSurv provides accurate, transparent survival prediction by modeling continuous clinical progression from EHR data.

Abstract: Trustworthy survival prediction is essential for clinical decision making. Longitudinal electronic health records (EHRs) provide a uniquely powerful opportunity for the prediction. However, it is challenging to accurately model the continuous clinical progression of patients underlying the irregularly sampled clinical features and to transparently link the progression to survival outcomes. To address these challenges, we develop TrajSurv, a model that learns continuous latent trajectories from longitudinal EHR data for trustworthy survival prediction. TrajSurv employs a neural controlled differential equation (NCDE) to extract continuous-time latent states from the irregularly sampled data, forming continuous latent trajectories. To ensure the latent trajectories reflect the clinical progression, TrajSurv aligns the latent state space with patient state space through a time-aware contrastive learning approach. To transparently link clinical progression to the survival outcome, TrajSurv uses latent trajectories in a two-step divide-and-conquer interpretation process. First, it explains how the changes in clinical features translate into the latent trajectory’s evolution using a learned vector field. Second, it clusters these latent trajectories to identify key clinical progression patterns associated with different survival outcomes. Evaluations on two real-world medical datasets, MIMIC-III and eICU, show TrajSurv’s competitive accuracy and superior transparency over existing deep learning methods.

[319] DP-DGAD: A Generalist Dynamic Graph Anomaly Detector with Dynamic Prototypes

Jialun Zheng, Jie Liu, Jiannong Cao, Xiao Wang, Hanchen Yang, Yankai Chen, Philip S. Yu

Main category: cs.LG

TL;DR: A dynamic graph anomaly detection (DGAD) model with Dynamic Prototypes (DP) is proposed to capture evolving domain-specific and domain-agnostic patterns, achieving state-of-the-art performance.

DetailsMotivation: Generalist graph anomaly detection models struggle with evolving anomalies in dynamic graphs and lack labeled data for new domains. Effective DGAD requires capturing both evolving domain-specific and domain-agnostic patterns.

Method: DP-DGAD extracts dynamic prototypes from temporal ego-graphs, updates a memory buffer selectively, and uses an anomaly scorer with confidence-based pseudo-labeling for self-supervised adaptation.

Result: The model achieves state-of-the-art performance across ten real-world datasets from diverse domains.

Conclusion: DP-DGAD effectively addresses the challenges of dynamic graph anomaly detection by capturing evolving patterns and adapting to new domains.

Abstract: Dynamic graph anomaly detection (DGAD) is essential for identifying anomalies in evolving graphs across domains such as finance, traffic, and social networks. Recently, generalist graph anomaly detection (GAD) models have shown promising results. They are pretrained on multiple source datasets and generalize across domains. While effective on static graphs, they struggle to capture evolving anomalies in dynamic graphs. Moreover, the continuous emergence of new domains and the lack of labeled data further challenge generalist DGAD. Effective cross-domain DGAD requires both domain-specific and domain-agnostic anomalous patterns. Importantly, these patterns evolve temporally within and across domains. Building on these insights, we propose a DGAD model with Dynamic Prototypes (DP) to capture evolving domain-specific and domain-agnostic patterns. Firstly, DP-DGAD extracts dynamic prototypes, i.e., evolving representations of normal and anomalous patterns, from temporal ego-graphs and stores them in a memory buffer. The buffer is selectively updated to retain general, domain-agnostic patterns while incorporating new domain-specific ones. Then, an anomaly scorer compares incoming data with dynamic prototypes to flag both general and domain-specific anomalies. Finally, DP-DGAD employs confidence-based pseudo-labeling for effective self-supervised adaptation in target domains. Extensive experiments demonstrate state-of-the-art performance across ten real-world datasets from different domains.

[320] Wind Power Scenario Generation based on the Generalized Dynamic Factor Model and Generative Adversarial Network

Young-ho Cho, Hao Zhu, Duehee Lee, Ross Baldick

Main category: cs.LG

TL;DR: The paper proposes combining GDFM and GAN to synthesize wind power scenarios, improving spatial and temporal correlation representation compared to standalone methods.

DetailsMotivation: To enhance the synthesis of long-term wind power scenarios by accurately capturing spatio-temporal features like correlation, waveforms, and statistical characteristics.

Method: Uses GAN to extract dynamic factors with temporal info, then applies these in GDFM to represent spatial and frequency correlations. Combines strengths of both models.

Result: Numerical tests show the combined GDFM-GAN approach outperforms alternatives in synthesizing plausible wind power scenarios, better matching actual statistical characteristics.

Conclusion: The hybrid GDFM-GAN method effectively synthesizes realistic wind power scenarios, improving upon standalone GDFM or GAN approaches.

Abstract: For conducting resource adequacy studies, we synthesize multiple long-term wind power scenarios of distributed wind farms simultaneously by using spatio-temporal features: spatial and temporal correlation, waveforms, marginal and ramp-rate distributions of waveforms, power spectral densities, and statistical characteristics. Generating the spatial correlation in scenarios requires the design of common factors for neighboring wind farms and antithetical factors for distant wind farms. The generalized dynamic factor model (GDFM) can extract the common factors through cross spectral density analysis, but it cannot closely imitate waveforms. The GAN can synthesize plausible samples representing the temporal correlation by verifying samples through a fake sample discriminator. To combine the advantages of GDFM and GAN, we use the GAN to provide a filter that extracts dynamic factors with temporal information from the observation data, and we then apply this filter in the GDFM to represent both spatial and frequency correlations of plausible waveforms. Numerical tests demonstrate that the combination of GDFM and GAN improves on competing alternatives in synthesizing wind power scenarios from Australia. It better realizes the plausible statistical characteristics of actual wind power than alternatives such as the GDFM with a filter synthesized from distributions of actual dynamic filters, or the GAN with direct synthesis without dynamic factors.

[321] Classification of Psychiatry Clinical Notes by Diagnosis: A Deep Learning and Machine Learning Approach

Sergio Rubio-Martín, María Teresa García-Ordás, Antonio Serrano-García, Clara Margarita Franch-Pato, Arturo Crespo-Álvaro, José Alberto Benítez-Andrades

Main category: cs.LG

TL;DR: The study compares AI models for classifying clinical notes into Anxiety and Adjustment Disorder diagnoses, finding hyperparameter tuning crucial for performance, with Decision Tree, XGBoost, DistilBERT, and SciBERT achieving 96% accuracy.

DetailsMotivation: To improve diagnostic classification in mental health using AI, comparing traditional and deep learning models, and assessing oversampling techniques.

Method: Evaluated Random Forest, SVM, KNN, Decision Tree, XGBoost, DistilBERT, and SciBERT with oversampling (None, Random, SMOTE) and hyperparameter tuning.

Result: Oversampling had minimal impact except SMOTE with BERT models. Hyperparameter tuning boosted accuracy, with Decision Tree, XGBoost, DistilBERT, and SciBERT reaching 96%.

Conclusion: Hyperparameter tuning is key for model performance, and BERT models benefit from SMOTE. The study aids AI-assisted mental health diagnostics.

Abstract: The classification of clinical notes into specific diagnostic categories is critical in healthcare, especially for mental health conditions like Anxiety and Adjustment Disorder. In this study, we compare the performance of various Artificial Intelligence models, including both traditional Machine Learning approaches (Random Forest, Support Vector Machine, K-nearest neighbors, Decision Tree, and eXtreme Gradient Boost) and Deep Learning models (DistilBERT and SciBERT), to classify clinical notes into these two diagnoses. Additionally, we implemented three oversampling strategies: No Oversampling, Random Oversampling, and Synthetic Minority Oversampling Technique (SMOTE), to assess their impact on model performance. Hyperparameter tuning was also applied to optimize model accuracy. Our results indicate that oversampling techniques had minimal impact on model performance overall. The only exception was SMOTE, which showed a positive effect specifically with BERT-based models. However, hyperparameter optimization significantly improved accuracy across the models, enhancing their ability to generalize and perform on the dataset. The Decision Tree and eXtreme Gradient Boost models achieved the highest accuracy among machine learning approaches, both reaching 96%, while the DistilBERT and SciBERT models also attained 96% accuracy in the deep learning category. These findings underscore the importance of hyperparameter tuning in maximizing model performance. This study contributes to the ongoing research on AI-assisted diagnostic tools in mental health by providing insights into the efficacy of different model architectures and data balancing methods.
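
A minimal sketch of the comparison setup, assuming TF-IDF features (the summary does not state the text representation) and scikit-learn/imbalanced-learn; the notes and labels are toy placeholders, since the clinical corpus is not public.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # applies SMOTE to training folds only

notes = ["patient reports persistent worry and restlessness",
         "low mood after job loss, difficulty adjusting"]   # toy placeholders
labels = [0, 1]                       # 0 = Anxiety, 1 = Adjustment Disorder

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("smote", SMOTE()),               # one of the three balancing options studied
    ("clf", DecisionTreeClassifier()),
])
# hyperparameter tuning, which the study found mattered more than oversampling
grid = GridSearchCV(pipe, {"clf__max_depth": [5, 10, None]}, cv=5, scoring="accuracy")
# grid.fit(train_notes, train_labels)  # run on the real corpus
```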

[322] Learning Network Dismantling without Handcrafted Inputs

Haozhe Tian, Pietro Ferraro, Robert Shorten, Mahdi Jalili, Homayoun Hamedmoghadam

Main category: cs.LG

TL;DR: MIND eliminates handcrafted features in GNNs for network dismantling, using attention and synthetic training, outperforming state-of-the-art methods.

DetailsMotivation: Handcrafted features in GNNs introduce bias and computational cost; MIND aims for a purely data-driven, efficient solution.

Method: Uses attention mechanisms, message-iteration profiles, and synthetic network training for dismantling.

Result: MIND generalizes to large real networks, outperforming existing methods.

Conclusion: MIND’s efficiency and generalizability extend beyond dismantling to other network problems.

Abstract: The application of message-passing Graph Neural Networks has been a breakthrough for important network science problems. However, the competitive performance often relies on using handcrafted structural features as inputs, which increases computational cost and introduces bias into the otherwise purely data-driven network representations. Here, we eliminate the need for handcrafted features by introducing an attention mechanism and utilizing message-iteration profiles, in addition to an effective algorithmic approach to generate a structurally diverse training set of small synthetic networks. Thereby, we build an expressive message-passing framework and use it to efficiently solve the NP-hard problem of Network Dismantling, virtually equivalent to vital node identification, with significant real-world applications. Trained solely on diversified synthetic networks, our proposed model – MIND: Message Iteration Network Dismantler – generalizes to large, unseen real networks with millions of nodes, outperforming state-of-the-art network dismantling methods. Increased efficiency and generalizability of the proposed model can be leveraged beyond dismantling in a range of complex network problems.

[323] Efficient Solution and Learning of Robust Factored MDPs

Yannik Schnitzer, Alessandro Abate, David Parker

Main category: cs.LG

TL;DR: The paper introduces methods for solving and learning robust Markov decision processes (r-MDPs) using factored state-space representations, improving sample efficiency and policy robustness.

DetailsMotivation: To address the challenge of learning robust policies in uncertain environments with fewer interactions by leveraging factored representations.

Method: Proposes novel methods for solving and learning r-MDPs using factored state-space representations, reformulating non-convex problems into tractable linear programs.

Result: Exploiting factored structure yields dimensional gains in sample efficiency, producing more effective robust policies with tighter performance guarantees.

Conclusion: Factored representations significantly enhance the efficiency and robustness of learning and solving r-MDPs.

Abstract: Robust Markov decision processes (r-MDPs) extend MDPs by explicitly modelling epistemic uncertainty about transition dynamics. Learning r-MDPs from interactions with an unknown environment enables the synthesis of robust policies with provable (PAC) guarantees on performance, but this can require a large number of sample interactions. We propose novel methods for solving and learning r-MDPs based on factored state-space representations that leverage the independence between model uncertainty across system components. Although policy synthesis for factored r-MDPs leads to hard, non-convex optimisation problems, we show how to reformulate these into tractable linear programs. Building on these, we also propose methods to learn factored model representations directly. Our experimental results show that exploiting factored structure can yield dimensional gains in sample efficiency, producing more effective robust policies with tighter performance guarantees than state-of-the-art methods.
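
The robust backup below is a generic interval-uncertainty sketch rather than the paper's factored LP reformulation: for one state-action pair, the adversary greedily shifts probability mass toward low-value successors within per-state bounds.

```python
import numpy as np

def robust_bellman_backup(V, P_lo, P_hi, R, gamma=0.9):
    """Worst-case expected value over all transition vectors p with
    P_lo <= p <= P_hi and sum(p) = 1, via greedy water-filling."""
    p = P_lo.copy()
    budget = 1.0 - p.sum()
    for s in np.argsort(V):                 # fill low-value states first
        add = min(P_hi[s] - p[s], budget)
        p[s] += add
        budget -= add
    return R + gamma * p @ V

V = np.array([0.0, 1.0, 2.0])
print(robust_bellman_backup(V, np.full(3, 0.1), np.full(3, 0.8), R=1.0))  # 1.27
```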

[324] JSON-Bag: A generic game trajectory representation

Dien Nguyen, Diego Perez-Liebana, Simon Lucas

Main category: cs.LG

TL;DR: The paper introduces JSON-Bag, a token-based model for representing game trajectories, using Jensen-Shannon distance (JSD) for comparison. It outperforms hand-crafted features in classification tasks and shows efficiency in N-shot learning.

DetailsMotivation: To provide a generic and efficient method for representing and comparing game trajectories without relying on hand-crafted features.

Method: Tokenizes JSON descriptions of game trajectories into a bag-of-tokens model, applies JSD for distance measurement, and uses prototype-based nearest-neighbor search (P-NNS) for evaluation.

Result: Outperforms hand-crafted features in most tasks, demonstrates sample efficiency in N-shot classification, and improves accuracy with automatic feature extraction.

Conclusion: JSON-Bag with JSD is effective for game trajectory representation, classification, and policy distance correlation.

Abstract: We introduce the JSON Bag-of-Tokens model (JSON-Bag) as a method to generically represent game trajectories by tokenizing their JSON descriptions, and apply Jensen-Shannon distance (JSD) as the distance metric for them. Using a prototype-based nearest-neighbor search (P-NNS), we evaluate the validity of JSON-Bag with JSD on six tabletop games – *7 Wonders*, *Dominion*, *Sea Salt and Paper*, *Can’t Stop*, *Connect4*, *Dots and Boxes* – each over three game trajectory classification tasks: classifying the playing agents, game parameters, or game seeds that were used to generate the trajectories. Our approach outperforms a baseline using hand-crafted features in the majority of tasks. Evaluation on N-shot classification suggests that using JSON-Bag prototypes to represent game trajectory classes is also sample-efficient. Additionally, we demonstrate JSON-Bag’s ability for automatic feature extraction by treating tokens as individual features in a Random Forest to solve the tasks above, which significantly improves accuracy on underperforming tasks. Finally, we show that, across all six games, the JSD between JSON-Bag prototypes of agent classes correlates highly with the distances between agents’ policies.
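
A hedged sketch of the bag-of-tokens idea; the "path=value" tokenization below is our own simple scheme, not necessarily the paper's exact tokenizer.

```python
from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon  # JS distance = sqrt of JSD

def json_bag(trajectory) -> Counter:
    """Flatten a JSON game trajectory into a bag of 'path=value' tokens."""
    bag = Counter()
    def walk(node, path=""):
        if isinstance(node, dict):
            for key, val in node.items():
                walk(val, f"{path}.{key}")
        elif isinstance(node, list):
            for val in node:
                walk(val, path)
        else:
            bag[f"{path}={node}"] += 1
    walk(trajectory)
    return bag

def bag_distance(a: Counter, b: Counter) -> float:
    vocab = sorted(set(a) | set(b))
    p = np.array([a[t] for t in vocab], dtype=float)
    q = np.array([b[t] for t in vocab], dtype=float)
    return float(jensenshannon(p / p.sum(), q / q.sum()))

t1 = {"turn": 1, "moves": [{"player": "A", "action": "roll"}]}
t2 = {"turn": 2, "moves": [{"player": "B", "action": "stop"}]}
print(bag_distance(json_bag(t1), json_bag(t2)))
```

A class prototype is then simply the normalized sum of the bags in that class, and P-NNS assigns a new trajectory to the class whose prototype is nearest in JSD.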

[325] Nested Graph Pseudo-Label Refinement for Noisy Label Domain Adaptation Learning

Yingxu Wang, Mengzhu Wang, Zhichao Huang, Suyu Liu

Main category: cs.LG

TL;DR: NeGPR is a novel framework for graph domain adaptation with noisy labels, using dual-branch pretraining, nested refinement, and noise-aware regularization to improve robustness and performance.

DetailsMotivation: Existing GDA methods assume clean source labels, but real-world scenarios often have noisy labels, degrading adaptation performance. NeGPR addresses this challenge.

Method: NeGPR pretrains semantic and topology branches with neighborhood consistency, uses nested refinement for cross-domain learning, and applies noise-aware regularization to handle pseudo-label noise.

Result: NeGPR outperforms state-of-the-art methods under severe label noise, achieving up to 12.7% accuracy improvement.

Conclusion: NeGPR effectively handles noisy labels in GDA, enhancing robustness and adaptation performance through its innovative framework.

Abstract: Graph Domain Adaptation (GDA) facilitates knowledge transfer from labeled source graphs to unlabeled target graphs by learning domain-invariant representations, which is essential in applications such as molecular property prediction and social network analysis. However, most existing GDA methods rely on the assumption of clean source labels, which rarely holds in real-world scenarios where annotation noise is pervasive. This label noise severely impairs feature alignment and degrades adaptation performance under domain shifts. To address this challenge, we propose Nested Graph Pseudo-Label Refinement (NeGPR), a novel framework tailored for graph-level domain adaptation with noisy labels. NeGPR first pretrains dual branches, i.e., semantic and topology branches, by enforcing neighborhood consistency in the feature space, thereby reducing the influence of noisy supervision. To bridge domain gaps, NeGPR employs a nested refinement mechanism in which one branch selects high-confidence target samples to guide the adaptation of the other, enabling progressive cross-domain learning. Furthermore, since pseudo-labels may still contain noise and the pre-trained branches are already overfitted to the noisy labels in the source domain, NeGPR incorporates a noise-aware regularization strategy. This regularization is theoretically proven to mitigate the adverse effects of pseudo-label noise, even under the presence of source overfitting, thus enhancing the robustness of the adaptation process. Extensive experiments on benchmark datasets demonstrate that NeGPR consistently outperforms state-of-the-art methods under severe label noise, achieving gains of up to 12.7% in accuracy.
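
The confidence-based pseudo-labeling step can be sketched generically: keep only target samples whose predicted class probability clears a threshold. The threshold value and selection rule below are illustrative, not NeGPR's exact procedure.

```python
import torch

def confident_pseudo_labels(logits: torch.Tensor, tau: float = 0.9):
    """Generic confidence-based pseudo-labeling: retain target samples whose
    max softmax probability exceeds tau; the rest stay unlabeled."""
    probs = logits.softmax(dim=1)
    conf, labels = probs.max(dim=1)
    mask = conf >= tau
    return labels[mask], mask

logits = 3.0 * torch.randn(8, 3)   # stand-in for a branch's target-domain logits
labels, mask = confident_pseudo_labels(logits)
print(mask.sum().item(), "of 8 samples accepted for self-training")
```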

[326] Democratizing Tabular Data Access with an Open-Source Synthetic-Data SDK

Ivona Krchova, Mariana Vargas Vieyra, Mario Scriminaci, Andrey Sidorenko

Main category: cs.LG

TL;DR: The paper introduces the MOSTLY AI Synthetic Data SDK, an open-source toolkit for generating high-quality synthetic tabular data, addressing data accessibility issues due to privacy and ethical concerns.

DetailsMotivation: Increasing restrictions on data access due to privacy, proprietary interests, and ethical concerns necessitate synthetic data solutions.

Method: The SDK uses the TabularARGN autoregressive framework, integrating differential privacy, fairness-aware generation, and automated quality assurance in a Python interface.

Result: The SDK delivers competitive performance, improved speed, and usability, supporting diverse data types and complex datasets.

Conclusion: The SDK’s rapid adoption highlights its practicality in overcoming data bottlenecks and promoting data democratization.

Abstract: Machine learning development critically depends on access to high-quality data. However, increasing restrictions due to privacy, proprietary interests, and ethical concerns have created significant barriers to data accessibility. Synthetic data offers a viable solution by enabling safe, broad data usage without compromising sensitive information. This paper presents the MOSTLY AI Synthetic Data Software Development Kit (SDK), an open-source toolkit designed specifically for synthesizing high-quality tabular data. The SDK integrates robust features such as differential privacy guarantees, fairness-aware data generation, and automated quality assurance into a flexible and accessible Python interface. Leveraging the TabularARGN autoregressive framework, the SDK supports diverse data types and complex multi-table and sequential datasets, delivering competitive performance with notable improvements in speed and usability. Currently deployed both as a cloud service and as locally installable software, the SDK has seen rapid adoption, highlighting its practicality in addressing real-world data bottlenecks and promoting widespread data democratization.

[327] Adaptive Machine Learning-Driven Multi-Fidelity Stratified Sampling for Failure Analysis of Nonlinear Stochastic Systems

Liuyun Xu, Seymour M. J. Spence

Main category: cs.LG

TL;DR: A multi-fidelity stratified sampling scheme with adaptive machine learning metamodels is proposed to efficiently estimate small failure probabilities in stochastic simulations, reducing computational costs.

DetailsMotivation: Existing variance reduction techniques for rare event analysis require many model evaluations, which is computationally challenging for complex systems like nonlinear finite element models under stochastic excitation.

Method: The approach uses stratified sampling to generate a high-fidelity dataset for training a deep learning-based metamodel. An adaptive training scheme balances approximation quality and computational demand. A multi-fidelity Monte Carlo framework integrates low-fidelity outputs with high-fidelity results for unbiased failure probability estimates.

Result: Applied to a high-rise steel building under stochastic wind excitation, the method accurately estimates exceedance probability curves for nonlinear responses with significant computational savings.

Conclusion: The proposed scheme effectively addresses computational challenges in rare event analysis, offering a cost-efficient alternative to single-fidelity variance reduction methods.

Abstract: Existing variance reduction techniques used in stochastic simulations for rare event analysis still require a substantial number of model evaluations to estimate small failure probabilities. In the context of complex, nonlinear finite element modeling environments, this can become computationally challenging, particularly for systems subjected to stochastic excitation. To address this challenge, a multi-fidelity stratified sampling scheme with adaptive machine learning metamodels is introduced for efficiently propagating uncertainties and estimating small failure probabilities. In this approach, a high-fidelity dataset generated through stratified sampling is used to train a deep learning-based metamodel, which then serves as a cost-effective and highly correlated low-fidelity model. An adaptive training scheme is proposed to balance the trade-off between approximation quality and computational demand associated with the development of the low-fidelity model. By integrating the low-fidelity outputs with additional high-fidelity results, an unbiased estimate of the strata-wise failure probabilities is obtained using a multi-fidelity Monte Carlo framework. The overall probability of failure is then computed using the total probability theorem. Application to a full-scale high-rise steel building subjected to stochastic wind excitation demonstrates that the proposed scheme can accurately estimate exceedance probability curves for nonlinear responses of interest, while achieving significant computational savings compared to single-fidelity variance reduction approaches.
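
The final combination step follows the total probability theorem, $P_f = \sum_i w_i \, P(\text{failure} \mid \text{stratum } i)$. A single-fidelity toy version, omitting the metamodel correction, might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def stratified_failure_probability(strata_weights, strata_samples, limit_state):
    """Combine strata-wise Monte Carlo estimates via the total probability
    theorem; g(x) < 0 marks failure. The paper additionally corrects cheap
    metamodel outputs with a few high-fidelity runs per stratum."""
    return sum(w * np.mean(limit_state(s) < 0.0)
               for w, s in zip(strata_weights, strata_samples))

# toy example: g(x) = 3 - x with a standard normal input split at zero
lo = rng.normal(size=100_000); lo = lo[lo < 0]
hi = rng.normal(size=100_000); hi = hi[hi >= 0]
p_f = stratified_failure_probability([0.5, 0.5], [lo, hi], lambda x: 3.0 - x)
print(p_f)  # close to 1 - Phi(3), roughly 1.35e-3
```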

[328] A Simple and Effective Method for Uncertainty Quantification and OOD Detection

Yaxin Ma, Benjamin Colburn, Jose C. Principe

Main category: cs.LG

TL;DR: A method using feature space density for uncertainty quantification in deterministic models, outperforming Bayesian and ensemble methods in efficiency and performance.

DetailsMotivation: Address computational and storage inefficiencies of Bayesian neural networks and deep ensembles for uncertainty quantification.

Method: Leverages kernel density estimation to approximate feature space density of training data, comparing it with test samples to detect distributional shifts and OOD.

Result: Outperforms baseline models on synthetic datasets (Two Moons, Three Spirals) and OOD detection (CIFAR-10 vs. SVHN).

Conclusion: Proposed method is effective for uncertainty quantification and OOD detection, offering computational and storage advantages.

Abstract: Bayesian neural networks and deep ensemble methods have been proposed for uncertainty quantification; however, they are computationally intensive and require large storage. By utilizing a single deterministic model, we can solve the above issue. We propose an effective method based on feature space density to quantify uncertainty for distributional shifts and out-of-distribution (OOD) detection. Specifically, we leverage the information potential field derived from kernel density estimation to approximate the feature space density of the training set. By comparing this density with the feature space representation of test samples, we can effectively determine whether a distributional shift has occurred. Experiments were conducted on 2D synthetic datasets (Two Moons and Three Spirals) as well as an OOD detection task (CIFAR-10 vs. SVHN). The results demonstrate that our method outperforms baseline models.
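
A minimal sketch of the density test, using scikit-learn's KernelDensity as a stand-in for the information-potential-field estimator described above; the feature dimension and the 5% threshold are arbitrary choices.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
train_feats = rng.normal(0.0, 1.0, size=(500, 8))  # stand-in for penultimate-layer features
ood_feats = rng.normal(6.0, 1.0, size=(5, 8))

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(train_feats)
threshold = np.quantile(kde.score_samples(train_feats), 0.05)  # 5th-percentile log-density

def is_shifted(features: np.ndarray) -> np.ndarray:
    # low log-density under the training-feature KDE flags a distributional shift
    return kde.score_samples(features) < threshold

print(is_shifted(ood_feats))  # expected: all True
```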

[329] Diffusion-Scheduled Denoising Autoencoders for Anomaly Detection in Tabular Data

Timur Sattarov, Marco Schreyer, Damian Borth

Main category: cs.LG

TL;DR: DDAE combines diffusion-based noise scheduling and contrastive learning to enhance anomaly detection in tabular data, outperforming state-of-the-art methods.

DetailsMotivation: Anomaly detection in tabular data is difficult due to complex feature interactions and lack of anomalous examples; existing methods like denoising autoencoders and diffusion models have limitations.

Method: Proposes Diffusion-Scheduled Denoising Autoencoder (DDAE), integrating diffusion-based noise scheduling and contrastive learning into encoding.

Result: DDAE outperforms baselines, improving PR-AUC by up to 65% (9%) and ROC-AUC by 16% (6%), with optimal noise levels varying by setting.

Conclusion: Principled noise strategies are crucial for effective tabular anomaly detection, with DDAE demonstrating superior performance.

Abstract: Anomaly detection in tabular data remains challenging due to complex feature interactions and the scarcity of anomalous examples. Denoising autoencoders rely on fixed-magnitude noise, limiting adaptability to diverse data distributions. Diffusion models introduce scheduled noise and iterative denoising, but lack explicit reconstruction mappings. We propose the Diffusion-Scheduled Denoising Autoencoder (DDAE), a framework that integrates diffusion-based noise scheduling and contrastive learning into the encoding process to improve anomaly detection. We evaluated DDAE on 57 datasets from ADBench. Our method outperforms in semi-supervised settings and achieves competitive results in unsupervised settings, improving PR-AUC by up to 65% (9%) and ROC-AUC by 16% (6%) over state-of-the-art autoencoder (diffusion) model baselines. We observed that higher noise levels benefit unsupervised training, while lower noise with linear scheduling is optimal in semi-supervised settings. These findings underscore the importance of principled noise strategies in tabular anomaly detection.
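
A hedged sketch of the scheduled-noise idea, omitting the contrastive term: corrupt inputs with schedule-dependent Gaussian noise, reconstruct them, and score anomalies by reconstruction error. The architecture and schedule values are illustrative.

```python
import torch
import torch.nn as nn

class DDAESketch(nn.Module):
    """Toy diffusion-scheduled denoising autoencoder: a linear beta schedule
    (the paper reports low noise with linear scheduling works best in
    semi-supervised settings) controls how strongly inputs are corrupted."""
    def __init__(self, dim: int, steps: int = 10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, dim))
        self.betas = torch.linspace(1e-3, 0.2, steps)  # linear noise schedule

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        alpha_bar = torch.cumprod(1.0 - self.betas, dim=0)[t]
        noisy = alpha_bar.sqrt() * x + (1.0 - alpha_bar).sqrt() * torch.randn_like(x)
        return self.net(noisy)

model = DDAESketch(dim=16)
x = torch.randn(4, 16)
anomaly_score = ((model(x, t=3) - x) ** 2).mean(dim=1)  # higher = more anomalous
```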

[330] Evaluating Angle and Amplitude Encoding Strategies for Variational Quantum Machine Learning: their impact on model’s accuracy

Antonio Tudisco, Andrea Marchesin, Maurizio Zamboni, Mariagrazia Graziano, Giovanna Turvani

Main category: cs.LG

TL;DR: The paper analyzes Variational Quantum Circuits (VQC) in Quantum Machine Learning, comparing Amplitude- and Angle-encoding models with different rotational gates. Results show significant performance variations (10%-41%) based on encoding choices, confirming embedding as a critical hyperparameter.

DetailsMotivation: To explore how different encoding methods and rotational gates in VQC models impact classification performance in Quantum Machine Learning.

Method: Comparative analysis of Amplitude- and Angle-encoding models with varying rotational gates, tested on Wine and Diabetes datasets.

Result: Performance differences ranged from 10% to 41%, with rotational gate choice significantly affecting accuracy.

Conclusion: Embedding type is a crucial hyperparameter for VQC models, influencing classification performance.

Abstract: Recent advancements in Quantum Computing and Machine Learning have increased attention to Quantum Machine Learning (QML), which aims to develop machine learning models by exploiting the quantum computing paradigm. One of the widely used models in this area is the Variational Quantum Circuit (VQC), a hybrid model where the quantum circuit handles data inference while classical optimization adjusts the parameters of the circuit. The quantum circuit consists of an encoding layer, which loads data into the circuit, and a template circuit, known as the ansatz, responsible for processing the data. This work involves performing an analysis by considering both Amplitude- and Angle-encoding models, and examining how the type of rotational gate applied affects the classification performance of the model. This comparison is carried out by training the different models on two datasets, Wine and Diabetes, and evaluating their performance. The study demonstrates that, under identical model topologies, the difference in accuracy between the best and worst models ranges from 10% to 30%, with differences reaching up to 41%. Moreover, the results highlight how the choice of rotational gates used in encoding can significantly impact the model’s classification performance. The findings confirm that the embedding represents a hyperparameter for VQC models.
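
A sketch of the comparison in PennyLane: the rotation axis of the encoding gate is exposed as the hyperparameter the study varies (amplitude encoding would use qml.AmplitudeEmbedding instead). The circuit depth and ansatz choice are illustrative.

```python
import pennylane as qml
from pennylane import numpy as np

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def angle_vqc(features, weights, rotation="Y"):
    # the rotation gate (X, Y, or Z) used for encoding is the knob under study
    qml.AngleEmbedding(features, wires=range(n_qubits), rotation=rotation)
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))  # the ansatz
    return qml.expval(qml.PauliZ(0))

shape = qml.StronglyEntanglingLayers.shape(n_layers=2, n_wires=n_qubits)
weights = np.random.random(shape)
for rot in ("X", "Y", "Z"):  # identical topology, different encoding gate
    print(rot, angle_vqc(np.array([0.1, 0.2, 0.3, 0.4]), weights, rotation=rot))
```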

[331] Explainable AI and Machine Learning for Exam-based Student Evaluation: Causal and Predictive Analysis of Socio-academic and Economic Factors

Bushra Akter, Md Biplob Hosen, Sabbir Ahmed, Mehrin Anannya, Md. Farhad Hossain

Main category: cs.LG

TL;DR: The study explores socio-academic and financial factors affecting CGPA, using causal analysis and machine learning to predict and classify performance, achieving high accuracy. A web app was developed for personalized student insights.

DetailsMotivation: To understand and optimize students' CGPA by identifying key influencing factors and developing actionable strategies.

Method: Literature review, online survey (1,050 participants), causal analysis, regression (Ridge Regression), classification (Random Forest), and Explainable AI (SHAP, LIME, Interpret).

Result: Ridge Regression achieved MAE of 0.12 and MSE of 0.023; Random Forest had 98.68% accuracy. Key factors: study hours, scholarships, parental education, prior performance.

Conclusion: The study successfully identified critical CGPA influencers and developed a predictive web app, aiding students in improving academic performance.

Abstract: Academic performance depends on a multivariable nexus of socio-academic and financial factors. This study investigates these influences to develop effective strategies for optimizing students’ CGPA. To achieve this, we reviewed various literature to identify key influencing factors and constructed an initial hypothetical causal graph based on the findings. Additionally, an online survey was conducted, where 1,050 students participated, providing comprehensive data for analysis. Rigorous data preprocessing techniques, including cleaning and visualization, ensured data quality before analysis. Causal analysis validated the relationships among variables, offering deeper insights into their direct and indirect effects on CGPA. Regression models were implemented for CGPA prediction, while classification models categorized students based on performance levels. Ridge Regression demonstrated strong predictive accuracy, achieving a Mean Absolute Error of 0.12 and a Mean Squared Error of 0.023. Random Forest outperformed in classification, attaining an F1-score near perfection and an accuracy of 98.68%. Explainable AI techniques such as SHAP, LIME, and Interpret enhanced model interpretability, highlighting critical factors such as study hours, scholarships, parental education, and prior academic performance. The study culminated in the development of a web-based application that provides students with personalized insights, allowing them to predict academic performance, identify areas for improvement, and make informed decisions to enhance their outcomes.

[332] Adacc: Adaptive Compression and Activation Checkpointing for LLM Memory Management

Ping Chen, Zhuohong Deng, Ping Li, Shuibing He, Hongzi Zhu, Yi Zheng, Zhefeng Wang, Baoxing Huai, Minyi Guo

Main category: cs.LG

TL;DR: Adacc reduces GPU memory footprint in LLM training by combining adaptive compression and activation checkpointing, achieving 1.01x-1.37x speedup without compromising accuracy.

DetailsMotivation: Recomputation in large language model training introduces up to 30% overhead, prompting the need for efficient memory management.

Method: Adacc uses layer-specific compression, MILP-based scheduling, and adaptive policy evolution to optimize memory usage.

Result: Adacc accelerates training by 1.01x-1.37x while maintaining model accuracy comparable to the baseline.

Conclusion: Adacc effectively balances memory efficiency and training performance for LLMs.

Abstract: Training large language models often employs recomputation to alleviate memory pressure, which can introduce up to 30% overhead in real-world scenarios. In this paper, we propose Adacc, a novel memory management framework that combines adaptive compression and activation checkpointing to reduce the GPU memory footprint. It comprises three modules: (1) We design layer-specific compression algorithms that account for outliers in LLM tensors, instead of directly quantizing floats from FP16 to INT4, to ensure model accuracy. (2) We propose an optimal scheduling policy that employs MILP to determine the best memory optimization for each tensor. (3) To accommodate changes in training tensors, we introduce an adaptive policy evolution mechanism that adjusts the policy during training to enhance throughput. Experimental results show that Adacc can accelerate the LLM training by 1.01x to 1.37x compared to state-of-the-art frameworks, while maintaining comparable model accuracy to the Baseline.
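
Module (1) can be illustrated with a toy outlier-aware quantizer: extreme values stay in floating point while the bulk is rounded onto a 4-bit grid (held in int8 below, since PyTorch has no native INT4 dtype). This is a sketch of the idea, not Adacc's algorithm.

```python
import torch

def quantize_with_outliers(t: torch.Tensor, bits: int = 4, pct: float = 0.999):
    """Quantize the bulk of a tensor to a 4-bit grid while keeping the top
    0.1% of magnitudes (the outliers that break naive FP16->INT4 casts) exact."""
    cutoff = t.abs().float().quantile(pct)
    mask = t.abs() > cutoff                       # outliers kept losslessly
    scale = cutoff / (2 ** (bits - 1) - 1)
    q = torch.clamp((t / scale).round(), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q.to(torch.int8), t[mask].clone(), mask, scale

def dequantize(q, outliers, mask, scale):
    t = q.float() * scale
    t[mask] = outliers                            # restore the exact outliers
    return t

x = torch.randn(1024) * (torch.rand(1024).pow(8.0) + 1.0)  # heavy-tailed values
err = (dequantize(*quantize_with_outliers(x)) - x).abs().max()
```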

[333] Loss Landscape Degeneracy and Stagewise Development in Transformers

Jesse Hoogland, George Wang, Matthew Farrugia-Roberts, Liam Carroll, Susan Wei, Daniel Murfet

Main category: cs.LG

TL;DR: The paper explores how degeneracy in the loss landscape influences neural network training, linking it to changes in computational structure and behavior in transformers.

DetailsMotivation: To uncover principles governing neural network development by studying loss landscape degeneracy.

Method: Analyzing loss landscape degeneracy via the local learning coefficient in a transformer language model and an in-context linear regression transformer.

Result: Training phases show distinct changes in degeneracy, aligning with shifts in computational structure and behavior.

Conclusion: Degeneracy and development are linked in transformers, suggesting a degeneracy-based perspective for understanding deep learning.

Abstract: Deep learning involves navigating a high-dimensional loss landscape over the neural network parameter space. Over the course of training, complex computational structures form and re-form inside the neural network, leading to shifts in input/output behavior. It is a priority for the science of deep learning to uncover principles governing the development of neural network structure and behavior. Drawing on the framework of singular learning theory, we propose that model development is deeply linked to degeneracy in the local geometry of the loss landscape. We investigate this link by monitoring loss landscape degeneracy throughout training, as quantified by the local learning coefficient, for a transformer language model and an in-context linear regression transformer. We show that training can be divided into distinct periods of change in loss landscape degeneracy, and that these changes in degeneracy coincide with significant changes in the internal computational structure and the input/output behavior of the transformers. This finding provides suggestive evidence that degeneracy and development are linked in transformers, underscoring the potential of a degeneracy-based perspective for understanding modern deep learning.

Yuanyuan Xu, Wenjie Zhang, Ying Zhang, Xuemin Lin, Xiwei Xu

Main category: cs.LG

TL;DR: MoMent is a multi-modal model for DyTAGs that integrates temporal, textual, and structural modalities, improving link prediction accuracy and speed.

DetailsMotivation: Existing approaches overlook temporal and textual modalities in DyTAGs, leading to suboptimal performance.

Method: MoMent constructs modality-specific features, encodes them separately, and fuses them with a dual-domain alignment loss for coherent representations.

Result: MoMent achieves up to 17.28% accuracy improvement and 31x speed-up over baselines.

Conclusion: Explicitly modeling and aligning modalities in DyTAGs significantly enhances performance.

Abstract: Dynamic Text-Attributed Graphs (DyTAGs) are a novel graph paradigm that captures evolving temporal events (edges) alongside rich textual attributes. Existing studies can be broadly categorized into TGNN-driven and LLM-driven approaches, both of which encode textual attributes and temporal structures for DyTAG representation. We observe that DyTAGs inherently comprise three distinct modalities: temporal, textual, and structural, often exhibiting completely disjoint distributions. However, the first two modalities are largely overlooked by existing studies, leading to suboptimal performance. To address this, we propose MoMent, a multi-modal model that explicitly models, integrates, and aligns each modality to learn node representations for link prediction. Given the disjoint nature of the original modality distributions, we first construct modality-specific features and encode them using individual encoders to capture correlations across temporal patterns, semantic context, and local structures. Each encoder generates modality-specific tokens, which are then fused into comprehensive node representations with a theoretical guarantee. To avoid disjoint subspaces of these heterogeneous modalities, we propose a dual-domain alignment loss that first aligns their distributions globally and then fine-tunes coherence at the instance level. This enhances coherent representations from temporal, textual, and structural views. Extensive experiments across seven datasets show that MoMent achieves up to 17.28% accuracy improvement and up to 31x speed-up against eight baselines.

[335] Safe machine learning model release from Trusted Research Environments: The SACRO-ML package

Jim Smith, Richard J. Preen, Andrew McCarthy, Maha Albashir, Alba Crespi-Boixader, Shahzad Mumtaz, Christian Cole, James Liley, Jost Migenda, Simon Rogers, Yola Jones

Main category: cs.LG

TL;DR: SACRO-ML is an open-source Python toolkit for statistical disclosure control (SDC) in machine learning models, offering both ante-hoc and post-hoc SDC methods.

DetailsMotivation: To address the risk of data disclosure in ML models trained on confidential data before public release.

Method: Combines a SafeModel package for ante-hoc SDC (assessing training vulnerability) and an Attacks package for post-hoc SDC (simulating attacks to evaluate risk).

Result: Provides tools for assessing and mitigating disclosure risks in ML models.

Conclusion: SACRO-ML is a practical solution for ensuring privacy in ML models, available as open-source under an MIT license.

Abstract: We present SACRO-ML, an integrated suite of open source Python tools to facilitate the statistical disclosure control (SDC) of machine learning (ML) models trained on confidential data prior to public release. SACRO-ML combines (i) a SafeModel package that extends commonly used ML models to provide ante-hoc SDC by assessing the vulnerability of disclosure posed by the training regime; and (ii) an Attacks package that provides post-hoc SDC by rigorously assessing the empirical disclosure risk of a model through a variety of simulated attacks after training. The SACRO-ML code and documentation are available under an MIT license at https://github.com/AI-SDC/SACRO-ML

[336] Gradient Leakage Defense with Key-Lock Module for Federated Learning

Hanchi Ren, Jingjing Deng, Xianghua Xie

Main category: cs.LG

TL;DR: The paper introduces a key-lock module to defend against gradient leakage in Federated Learning, ensuring privacy and model performance.

DetailsMotivation: Recent findings show that shared gradients in FL can leak private data, necessitating a robust defense mechanism.

Method: A private key-lock module is proposed to secure gradients before sharing, preventing data reconstruction without compromising model performance.

Result: The method is theoretically proven and empirically validated to defend against gradient leakage while maintaining model accuracy.

Conclusion: The key-lock module effectively balances privacy and performance in FL, offering a practical solution to gradient leakage.

Abstract: Federated Learning (FL) is a widely adopted privacy-preserving machine learning approach where private data remains local, enabling secure computations and the exchange of local model gradients between local clients and third-party parameter servers. However, recent findings reveal that privacy may be compromised and sensitive information potentially recovered from shared gradients. In this study, we offer detailed analysis and a novel perspective on understanding the gradient leakage problem. These theoretical works lead to a new gradient leakage defense technique that secures arbitrary model architectures using a private key-lock module. Only the locked gradient is transmitted to the parameter server for global model aggregation. Our proposed learning method is resistant to gradient leakage attacks, and the key-lock module is designed and trained to ensure that, without the private information of the key-lock module: a) reconstructing private training data from the shared gradient is infeasible; and b) the global model’s inference performance is significantly compromised. We discuss the theoretical underpinnings of why gradients can leak private information and provide theoretical proof of our method’s effectiveness. We conducted extensive empirical evaluations with many models on several popular benchmarks, demonstrating the robustness of our proposed approach in both maintaining model performance and defending against gradient leakage attacks.

[337] Tackling Size Generalization of Graph Neural Networks on Biological Data from a Spectral Perspective

Gaotang Li, Danai Koutra, Yujun Yan

Main category: cs.LG

TL;DR: The paper addresses size-induced distribution shifts in GNNs, proposing data-driven strategies to improve generalization to larger graphs, with a focus on spectral analysis and subgraph patterns.

DetailsMotivation: To understand and mitigate the impact of size-induced distribution shifts on GNN performance, especially for larger graphs, which is underexplored in prior work.

Method: A data-driven approach analyzing spectral differences and subgraph patterns in biological graphs, leading to three model-agnostic strategies, with size-intensive attention being the most effective.

Result: Size-intensive attention improves GNN performance on larger graphs, boosting F1 scores by up to 8% over baselines.

Conclusion: The study highlights the importance of spectral analysis and subgraph patterns in GNN generalization, offering practical strategies for better performance on larger graphs.

Abstract: We address the key challenge of size-induced distribution shifts in graph neural networks (GNNs) and their impact on the generalization of GNNs to larger graphs. Existing literature operates under diverse assumptions about distribution shifts, resulting in varying conclusions about the generalizability of GNNs. In contrast to prior work, we adopt a data-driven approach to identify and characterize the types of size-induced distribution shifts and explore their impact on GNN performance from a spectral standpoint, a perspective that has been largely underexplored. Leveraging the significant variance in graph sizes in real biological datasets, we analyze biological graphs and find that spectral differences, driven by subgraph patterns (e.g., average cycle length), strongly correlate with GNN performance on larger, unseen graphs. Based on these insights, we propose three model-agnostic strategies to enhance GNNs’ awareness of critical subgraph patterns, identifying size-intensive attention as the most effective approach. Extensive experiments with six GNN architectures and seven model-agnostic strategies across five datasets show that our size-intensive attention strategy significantly improves graph classification on test graphs 2 to 10 times larger than the training graphs, boosting F1 scores by up to 8% over strong baselines.

[338] OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Shaojie Zhuo, Chen Feng, Yicheng Lin, Chenzheng Su, Xiaopeng Zhang

Main category: cs.LG

TL;DR: OmniDraft is a unified framework enabling a single draft model to work with any target model, addressing cross-vocabulary mismatches and improving decoding speed via adaptive techniques.

DetailsMotivation: Challenges in speculative decoding include incompatible draft-target models and latency expectations. OmniDraft aims to solve these for on-device LLM applications.

Method: Uses an online n-gram cache with hybrid distillation fine-tuning and adaptive drafting techniques.

Result: Achieves 1.5-2x speedup, allowing a single Llama-68M model to pair with various target models.

Conclusion: OmniDraft supports the ‘one drafter for all’ paradigm, proving effective for math, coding, and text tasks.

Abstract: Speculative decoding generally dictates having a small, efficient draft model that is either pretrained or distilled offline to a particular target model series, for instance, Llama or Qwen models. However, within online deployment settings, there are two major challenges: 1) usage of a target model that is incompatible with the draft model; 2) expectation of latency improvements over usage and time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models; and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications where model cost, efficiency and user customization are the major points of contention. This further highlights the need to tackle the above challenges and motivates the "one drafter for all" paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models including Vicuna-7B, Qwen2-7B and Llama3-8B models for speculative decoding; and additionally provides up to 1.5-2x speedup.
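
For context, the generic draft-then-verify loop that speculative decoding (and hence OmniDraft) builds on accepts each drafted token with probability min(1, p_target/p_draft) and resamples from the residual distribution on the first rejection. The OmniDraft-specific n-gram cache and distillation are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_draft(drafted_tokens, draft_probs, target_probs):
    """Standard speculative-decoding verification (not OmniDraft-specific)."""
    accepted = []
    for tok, q, p in zip(drafted_tokens, draft_probs, target_probs):
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)               # target model agrees often enough
        else:
            residual = np.maximum(p - q, 0.0)  # resample from the leftover mass
            accepted.append(rng.choice(len(p), p=residual / residual.sum()))
            break                              # discard the rest of the draft
    return accepted

q = [np.array([0.5, 0.5]), np.array([0.9, 0.1])]  # draft-model distributions
p = [np.array([0.6, 0.4]), np.array([0.2, 0.8])]  # target-model distributions
print(verify_draft([0, 0], q, p))
```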

[339] Sampling-enabled scalable manifold learning unveils discriminative cluster structure of high-dimensional data

Dehua Peng, Zhipeng Gui, Wenzhang Wei, Fa Li, Jie Gui, Huayi Wu, Jianya Gong

Main category: cs.LG

TL;DR: SUDE is a scalable manifold learning technique for large-scale, high-dimensional data, addressing distortions and scalability issues in existing methods.

DetailsMotivation: Existing manifold learning techniques suffer from cluster structure distortions and scalability limitations, hindering pattern understanding and large-scale data handling.

Method: SUDE uses landmark sampling to construct a low-dimensional skeleton and incorporates non-landmarks via constrained locally linear embedding (CLLE).

Result: SUDE shows superior scalability, cluster separation, and structure preservation, with robust embedding quality even at lower sampling rates.

Conclusion: SUDE is effective for large-scale data, offering improved performance in visualization, classification, and anomaly detection.

Abstract: As a pivotal branch of machine learning, manifold learning uncovers the intrinsic low-dimensional structure within complex nonlinear manifolds in high-dimensional space for visualization, classification, clustering, and gaining key insights. Although existing techniques have achieved remarkable successes, they suffer from extensive distortions of cluster structure, which hinders the understanding of underlying patterns. Scalability issues also limit their applicability for handling large-scale data. We hence propose a sampling-based Scalable manifold learning technique that enables Uniform and Discriminative Embedding, namely SUDE, for large-scale and high-dimensional data. It starts by seeking a set of landmarks to construct the low-dimensional skeleton of the entire dataset, and then incorporates the non-landmarks into the learned space based on the constrained locally linear embedding (CLLE). We empirically validated the effectiveness of SUDE on synthetic datasets and real-world benchmarks, and applied it to analyze single-cell data and detect anomalies in electrocardiogram (ECG) signals. SUDE exhibits a distinct advantage in scalability with respect to data size and embedding dimension, and has promising performance in cluster separation, integrity, and global structure preservation. The experiments also demonstrate notable robustness in embedding quality as the sampling rate decreases.
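
A simplified sketch of the landmark pipeline, substituting plain inverse-distance weights for the paper's constrained locally linear embedding (CLLE) step:

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50))
landmark_idx = rng.choice(len(X), size=200, replace=False)

# 1) build the low-dimensional skeleton from the landmarks only
Y_land = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X[landmark_idx])

# 2) place each non-landmark as a convex combination of its nearest landmarks
#    (inverse-distance weights stand in for the CLLE reconstruction weights)
rest = np.delete(X, landmark_idx, axis=0)
dist, nbr = NearestNeighbors(n_neighbors=5).fit(X[landmark_idx]).kneighbors(rest)
w = 1.0 / (dist + 1e-12)
w /= w.sum(axis=1, keepdims=True)
Y_rest = (w[..., None] * Y_land[nbr]).sum(axis=1)  # (n_rest, 2) embedding
```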

[340] Evaluating LLMs on Real-World Forecasting Against Human Superforecasters

Janna Lu

Main category: cs.LG

TL;DR: LLMs show promise in forecasting but still lag behind human superforecasters.

DetailsMotivation: To assess the forecasting capabilities of state-of-the-art LLMs compared to human superforecasters.

Method: Evaluated LLMs on 464 forecasting questions from Metaculus, comparing their Brier scores to human performance.

Result: LLMs surpassed the human crowd but underperformed against superforecasters.

Conclusion: While LLMs have improved, they still fall short of superforecaster accuracy in forecasting tasks.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their ability to forecast future events remains understudied. A year ago, large language models struggled to come close to the accuracy of a human crowd. I evaluate state-of-the-art LLMs on 464 forecasting questions from Metaculus, comparing their performance against human superforecasters. Frontier models achieve Brier scores that ostensibly surpass the human crowd but still significantly underperform a group of superforecasters.
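
For reference, the Brier score used in this comparison is the mean squared error between forecast probabilities and resolved binary outcomes; the numbers below are toy values, not the paper's data.

```python
import numpy as np

def brier_score(probs, outcomes) -> float:
    """Lower is better: 0 is a perfect forecaster, 0.25 is always saying 50%."""
    return float(np.mean((np.asarray(probs) - np.asarray(outcomes)) ** 2))

model_forecasts = [0.9, 0.2, 0.7]
superforecaster = [0.95, 0.05, 0.8]
resolved = [1, 0, 1]
print(brier_score(model_forecasts, resolved))   # ~0.047
print(brier_score(superforecaster, resolved))   # 0.015
```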

[341] Model Stock: All we need is just a few fine-tuned models

Dong-Hwan Jang, Sangdoo Yun, Dongyoon Han

Main category: cs.LG

TL;DR: Introduces Model Stock, a method for efficient fine-tuning of large pre-trained models using only two models for weight averaging, achieving superior ID and OOD performance with minimal computational overhead.

DetailsMotivation: Traditional fine-tuning requires many models for averaging, which is inefficient. The paper aims to reduce this while maintaining or improving performance.

Method: Leverages insights from weight space to approximate a center-close weight using two fine-tuned models, employing a layer-wise weight averaging technique.

Result: Model Stock outperforms state-of-the-art methods like Model Soup, achieving strong performance on ID and OOD tasks with minimal computational cost.

Conclusion: Model Stock is a highly efficient and effective fine-tuning method, demonstrated on CLIP architectures, with potential for broader applications.

Abstract: This paper introduces an efficient fine-tuning method for large pre-trained models, offering strong in-distribution (ID) and out-of-distribution (OOD) performance. Breaking away from traditional practices that need a multitude of fine-tuned models for averaging, our approach employs significantly fewer models to achieve final weights yet yields superior accuracy. Drawing from key insights in the weight space of fine-tuned weights, we uncover a strong link between performance and proximity to the center of the weight space. Based on this, we introduce a method that approximates a center-close weight using only two fine-tuned models, applicable during or after training. Our innovative layer-wise weight averaging technique surpasses state-of-the-art model averaging methods such as Model Soup while utilizing only two fine-tuned models. This strategy can be aptly coined Model Stock, highlighting its reliance on selecting a minimal number of models to draw a more optimized-averaged model. We demonstrate the efficacy of Model Stock with fine-tuned models based upon pre-trained CLIP architectures, achieving remarkable performance on both ID and OOD tasks on the standard benchmarks, all while barely bringing extra computational demands. Our code and pre-trained models are available at https://github.com/naver-ai/model-stock.
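
A per-layer sketch of the merge, reading the interpolation ratio off the angle between the two fine-tuning directions as described in the paper (treat the exact formula as our reading; the released code is authoritative). Here w0 is the pretrained anchor and w1, w2 are the two fine-tuned weights.

```python
import torch

def model_stock_layer(w0: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    """Merge one layer: pull the average of two fine-tuned weights toward the
    pretrained anchor, more strongly when the fine-tuning directions disagree."""
    d1, d2 = (w1 - w0).flatten(), (w2 - w0).flatten()
    cos = torch.dot(d1, d2) / (d1.norm() * d2.norm() + 1e-12)
    t = 2.0 * cos / (1.0 + cos)          # angle-based interpolation ratio
    return t * 0.5 * (w1 + w2) + (1.0 - t) * w0

w0, w1, w2 = torch.zeros(10), torch.randn(10), torch.randn(10)
merged = model_stock_layer(w0, w1, w2)   # applied independently to each layer
```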

[342] SPLITZ: Certifiable Robustness via Split Lipschitz Randomized Smoothing

Meiyu Zhong, Ravi Tandon

Main category: cs.LG

TL;DR: SPLITZ combines Lipschitz-constrained training and randomized smoothing to improve certifiable robustness in classifiers, outperforming state-of-the-art methods.

DetailsMotivation: Existing methods for certifiable robustness (Lipschitz constraints or randomized smoothing) have limitations. SPLITZ leverages both to exploit layer-wise heterogeneity in deep networks.

Method: SPLITZ splits a classifier into two parts: the first half has constrained Lipschitz constants, and the second half uses randomized smoothing.

Result: SPLITZ outperforms existing methods, achieving 43.2% top-1 accuracy on CIFAR-10 (vs. 39.8% state-of-art) for ℓ2 norm perturbation budget ε=1.

Conclusion: SPLITZ offers a scalable, effective framework for certifiable robustness, validated across MNIST, CIFAR-10, and ImageNet.

Abstract: Certifiable robustness gives the guarantee that small perturbations around an input to a classifier will not change the prediction. There are two approaches to provide certifiable robustness to adversarial examples: a) explicitly training classifiers with small Lipschitz constants, and b) randomized smoothing, which adds random noise to the input to create a smooth classifier. We propose SPLITZ, a practical and novel approach which leverages the synergistic benefits of both the above ideas into a single framework. Our main idea is to split a classifier into two halves, constrain the Lipschitz constant of the first half, and smooth the second half via randomization. Motivation for SPLITZ comes from the observation that many standard deep networks exhibit heterogeneity in Lipschitz constants across layers. SPLITZ can exploit this heterogeneity while inheriting the scalability of randomized smoothing. We present a principled approach to train SPLITZ and provide theoretical analysis to derive certified robustness guarantees during inference. We present a comprehensive comparison of robustness-accuracy trade-offs and show that SPLITZ consistently improves on existing state-of-the-art approaches on the MNIST, CIFAR-10 and ImageNet datasets. For instance, with an $\ell_2$ norm perturbation budget of $\epsilon=1$, SPLITZ achieves $43.2\%$ top-1 test accuracy on the CIFAR-10 dataset compared to the state-of-the-art top-1 test accuracy of $39.8\%$.
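
A toy version of the split, substituting spectral normalization for the paper's Lipschitz-constrained training of the first half and a plain majority vote for the certified smoothing procedure; layer sizes, sigma, and sample count are arbitrary.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# first half: Lipschitz-controlled feature extractor; second half: smoothed head
first_half = nn.Sequential(spectral_norm(nn.Linear(32, 64)), nn.ReLU())
second_half = nn.Linear(64, 10)

@torch.no_grad()
def splitz_predict(x: torch.Tensor, sigma: float = 0.25, n_samples: int = 100):
    z = first_half(x)                            # low-Lipschitz representation
    votes = torch.zeros(x.shape[0], 10)
    for _ in range(n_samples):                   # randomize only the second half
        logits = second_half(z + sigma * torch.randn_like(z))
        votes += torch.nn.functional.one_hot(logits.argmax(dim=1), num_classes=10).float()
    return votes.argmax(dim=1)                   # certification would use the vote counts

pred = splitz_predict(torch.randn(4, 32))
```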

[343] Convergence of Implicit Gradient Descent for Training Two-Layer Physics-Informed Neural Networks

Xianliang Xu, Ting Du, Wang Kong, Bin Shan, Ye Li, Zhongyi Huang

Main category: cs.LG

TL;DR: Implicit Gradient Descent (IGD) outperforms Gradient Descent (GD) in training over-parameterized two-layer PINNs, with proven linear convergence and flexible learning rate selection.

DetailsMotivation: Optimization algorithms like GD may fail in multi-scale problems, prompting the need for better methods like IGD.

Method: Derived training dynamics of IGD for two-layer PINNs, analyzed convergence under over-parameterization, and validated empirically.

Result: IGD converges linearly to a global optimum, with learning rate flexibility and milder network width requirements.

Conclusion: IGD is a superior alternative to GD for training PINNs, supported by theory and experiments.

Abstract: Optimization algorithms are crucial in training physics-informed neural networks (PINNs), as unsuitable methods may lead to poor solutions. Implicit gradient descent (IGD) outperforms the common gradient descent (GD) algorithm in handling certain multi-scale problems. In this paper, we provide a convergence analysis for IGD in training over-parameterized two-layer PINNs. We first derive the training dynamics of IGD in training two-layer PINNs. Then, over-parameterization allows us to prove that the randomly initialized IGD converges to a globally optimal solution at a linear convergence rate. Moreover, due to the distinct training dynamics of IGD compared to GD, the learning rate can be selected independently of the sample size and the least eigenvalue of the Gram matrix. Additionally, the novel approach used in our convergence analysis imposes a milder requirement on the network width. Finally, empirical results validate our theoretical findings.
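
The implicit update solves $\theta_{k+1} = \theta_k - \eta \nabla L(\theta_{k+1})$ for $\theta_{k+1}$, which is what permits step sizes that make explicit GD diverge. A scalar toy example (the root-finder is an arbitrary choice):

```python
import numpy as np
from scipy.optimize import fsolve

def implicit_gd_step(theta, grad, eta):
    # solve u = theta - eta * grad(u) for the next iterate u
    return fsolve(lambda u: u - theta + eta * grad(u), x0=theta)[0]

grad = lambda t: 10.0 * t     # gradient of the stiff quadratic L(t) = 5 t^2
theta, eta = 1.0, 0.5         # explicit GD diverges here: |1 - eta * 10| = 4
for _ in range(10):
    theta = implicit_gd_step(theta, grad, eta)
print(theta)                  # contracts by 1 / (1 + eta * 10) = 1/6 per step
```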

[344] PATH: A Discrete-sequence Dataset for Evaluating Online Unsupervised Anomaly Detection Approaches for Multivariate Time Series

Lucas Correia, Jan-Christoph Goos, Thomas Bäck, Anna V. Kononova

Main category: cs.LG

TL;DR: The paper introduces a diverse, realistic dataset for benchmarking anomaly detection in multivariate time series, addressing gaps in current datasets. It provides baseline results and highlights challenges like threshold selection and robustness to contaminated data.

DetailsMotivation: Current datasets for multivariate time series anomaly detection are limited in size, diversity, and anomaly complexity, hindering research progress.

Method: A realistic dataset is generated using simulation tools for automotive powertrains, catering to unsupervised and semi-supervised settings. Baseline results are provided using autoencoders and non-parametric methods.

Result: Semi-supervised approaches outperform unsupervised ones, and threshold selection significantly impacts detection performance.

Conclusion: The dataset fills a research gap, but challenges remain in threshold selection and robustness to contaminated data.

Abstract: Benchmarking anomaly detection approaches for multivariate time series is a challenging task due to a lack of high-quality datasets. Current publicly available datasets are too small, lack diversity, and feature trivial anomalies, which hinders measurable progress in this research area. We propose a solution: a diverse, extensive, and non-trivial dataset generated via state-of-the-art simulation tools that reflects realistic behaviour of an automotive powertrain, including its multivariate, dynamic and variable-state properties. Additionally, our dataset represents a discrete-sequence problem, which remains unaddressed by previously proposed solutions in the literature. To cater for both unsupervised and semi-supervised anomaly detection settings, as well as time series generation and forecasting, we make different versions of the dataset available, where training and test subsets are offered in contaminated and clean versions, depending on the task. We also provide baseline results from a selection of approaches based on deterministic and variational autoencoders, as well as a non-parametric approach. As expected, the baseline experimentation shows that the approaches trained on the semi-supervised version of the dataset outperform their unsupervised counterparts, highlighting a need for approaches more robust to contaminated training data. Furthermore, results show that the threshold used can have a large influence on detection performance, hence more work needs to be invested in methods to find a suitable threshold without the need for labelled data.

[345] TensorSocket: Shared Data Loading for Deep Learning Training

Ties Robroek, Neil Kim Nielsen, Pınar Tözün

Main category: cs.LG

TL;DR: TensorSocket reduces computational needs in deep learning training by enabling shared data loaders, improving throughput and cutting costs.

DetailsMotivation: Repetitive training tasks and redundant data processing in deep learning increase computational costs and inefficiencies.

Method: TensorSocket allows simultaneous training processes to share a data loader, reducing redundant computations and leveraging GPU-GPU interconnects.

Result: TensorSocket increases training throughput by up to 100%, reduces CPU resource needs by 50%, and outperforms state-of-the-art solutions like CoorDL and Joader.

Conclusion: TensorSocket efficiently addresses CPU-side bottlenecks in deep learning training, offering significant performance and cost benefits.

Abstract: Training deep learning models is a repetitive and resource-intensive process. Data scientists often train several models before landing on the set of parameters (e.g., via hyper-parameter tuning), model architecture (e.g., via neural architecture search), and other choices that yield the highest accuracy. The computational efficiency of these training tasks depends highly on how well the training data is supplied to the training process. The repetitive nature of these tasks results in the same data processing pipelines running over and over, exacerbating the need for and costs of computational resources. In this paper, we present TensorSocket to reduce the computational needs of deep learning training by enabling simultaneous training processes to share the same data loader. TensorSocket mitigates CPU-side bottlenecks in cases where the collocated training workloads have high throughput on GPU, but are held back by lower data-loading throughput on CPU. TensorSocket achieves this by reducing redundant computations and data duplication across collocated training processes and leveraging modern GPU-GPU interconnects. While doing so, TensorSocket is able to train and balance differently-sized models and serve multiple batch sizes simultaneously, and it is hardware- and pipeline-agnostic. Our evaluation shows that TensorSocket enables scenarios that are infeasible without data sharing, increases training throughput by up to 100%, and when utilizing cloud instances, achieves cost savings of 50% by reducing the hardware resource needs on the CPU side. Furthermore, TensorSocket outperforms the state-of-the-art solutions for shared data loading such as CoorDL and Joader; it is easier to deploy and maintain and either achieves higher or matches their throughput while requiring fewer CPU resources.

[346] Un-mixing Test-time Adaptation under Heterogeneous Data Streams

Zixian Su, Jingwei Guo, Xi Yang, Qiufeng Wang, Kaizhu Huang

Main category: cs.LG

TL;DR: The paper introduces FreDA, a frequency-based decentralized adaptation framework for test-time adaptation (TTA) under mixed distribution shifts, improving robustness in complex, evolving environments.

DetailsMotivation: Addressing performance drops in deep models due to mixed distribution shifts, especially in unlabeled and online conditions, where conventional TTA methods struggle.

Method: Proposes FreDA, leveraging frequency-domain insights to decompose heterogeneous data into homogeneous components, using decentralized learning and augmentation.

Result: FreDA outperforms state-of-the-art methods in diverse environments (corrupted, natural, medical).

Conclusion: Frequency-based decentralized adaptation effectively tackles mixed distribution shifts, advancing TTA robustness.

Abstract: Deploying deep models in real-world scenarios remains challenging due to significant performance drops under distribution shifts between training and deployment environments. Test-Time Adaptation (TTA) has recently emerged as a promising solution, enabling on-the-fly model adaptation without access to source data. However, its effectiveness degrades significantly in the presence of complex, mixed distribution shifts - common in practical settings - where multiple latent domains coexist. Adapting under such intrinsic heterogeneity, especially in unlabeled and online conditions, remains an open and underexplored challenge. In this paper, we study TTA under mixed distribution shifts and move beyond conventional homogeneous adaptation paradigms. By revisiting TTA from a frequency-domain perspective, we observe that distribution heterogeneity often manifests in Fourier space - for instance, high-frequency components tend to carry domain-specific variations. This motivates us to perform domain-aware separation using high-frequency texture cues, making diverse shift patterns more tractable. To this end, we propose FreDA, a novel Frequency-based Decentralized Adaptation framework that decomposes globally heterogeneous data into locally homogeneous components in the frequency domain. It further employs decentralized learning and augmentation strategies to robustly adapt under complex, evolving shifts. Extensive experiments across various environments (corrupted, natural, and medical) demonstrate the superiority of our proposed framework over the state of the art.
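
As a rough illustration of the kind of frequency-domain cue the abstract describes, the sketch below splits an image into low- and high-frequency components with the FFT; the radial mask and cutoff radius are assumptions, and the paper's actual decomposition may differ.

```python
import torch

def frequency_split(img, radius=8):
    """img: (C, H, W) tensor -> (low-frequency image, high-frequency image)."""
    C, H, W = img.shape
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # Keep only frequencies within `radius` of the spectrum center.
    mask = ((yy - H // 2) ** 2 + (xx - W // 2) ** 2 <= radius ** 2).to(img.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
    high = img - low       # high-frequency residue carries texture/domain cues
    return low, high

low, high = frequency_split(torch.rand(3, 64, 64))
```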

[347] Sampling from Energy-based Policies using Diffusion

Vineet Jain, Tara Akhound-Sadegh, Siamak Ravanbakhsh

Main category: cs.LG

TL;DR: The paper introduces a diffusion-based method (DQS) for sampling from energy-based policies in RL, improving expressiveness and sample efficiency in continuous control tasks.

DetailsMotivation: Existing methods use simpler parametric distributions (e.g., Gaussians) for policies, limiting their ability to model complex, multimodal behaviors.

Method: Proposes Diffusion Q-Sampling (DQS), a diffusion-based approach for sampling from energy-based policies, using the negative Q-function as the energy function.

Result: DQS enhances sample efficiency and captures multimodal behaviors in continuous control tasks.

Conclusion: The diffusion-based approach addresses limitations of existing methods, enabling more expressive policy representations and stable learning.

Abstract: Energy-based policies offer a flexible framework for modeling complex, multimodal behaviors in reinforcement learning (RL). In maximum entropy RL, the optimal policy is a Boltzmann distribution derived from the soft Q-function, but direct sampling from this distribution in continuous action spaces is computationally intractable. As a result, existing methods typically use simpler parametric distributions, like Gaussians, for policy representation – limiting their ability to capture the full complexity of multimodal action distributions. In this paper, we introduce a diffusion-based approach for sampling from energy-based policies, where the negative Q-function defines the energy function. Based on this approach, we propose an actor-critic method called Diffusion Q-Sampling (DQS) that enables more expressive policy representations, allowing stable learning in diverse environments. We show that our approach enhances sample efficiency in continuous control tasks and captures multimodal behaviors, addressing key limitations of existing methods.
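
The underlying idea, sampling actions from a Boltzmann distribution whose energy is the negative Q-function, can be illustrated with plain Langevin dynamics on a toy Q; DQS itself trains a diffusion sampler, so this gradient-based sketch (with assumed temperature and step size) only conveys the mechanism.

```python
import torch

def langevin_sample_action(q_fn, state, a_dim, temp=1.0, step=1e-3, n_steps=500):
    """Unadjusted Langevin dynamics on the energy E(a) = -Q(s, a) / temp."""
    a = torch.randn(a_dim, requires_grad=True)
    for _ in range(n_steps):
        energy = (-q_fn(state, a) / temp).sum()
        grad, = torch.autograd.grad(energy, a)
        with torch.no_grad():
            a -= step * grad                              # drift toward low energy
            a += (2 * step) ** 0.5 * torch.randn_like(a)  # diffusion term
    return a.detach()

# Toy bimodal Q over a 1-D action: modes near a = -1 and a = +1.
q = lambda s, a: -((a - 1.0) ** 2) * ((a + 1.0) ** 2)
print(langevin_sample_action(q, state=None, a_dim=1))  # lands near one of the modes
```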

[348] Federated Time Series Generation on Feature and Temporally Misaligned Data

Zhi Wen Soi, Chenrui Fan, Aditya Shankar, Abele Mălan, Lydia Y. Chen

Main category: cs.LG

TL;DR: FedTDD is a federated time series diffusion model addressing feature and temporal misalignment by exchanging synthetic outputs, improving local imputations and outperforming centralized training.

DetailsMotivation: Existing federated time series models assume perfect alignment, which is unrealistic. FedTDD aims to handle misaligned timesteps and features across clients.

Method: FedTDD uses a data distillation and aggregation framework, exchanging synthetic outputs to learn correlations. A global distiller network iteratively improves by leveraging shared synthetic data.

Result: FedTDD outperforms centralized training, achieving 79.4% and 62.8% improvement in Context-FID and Correlational scores over local training.

Conclusion: FedTDD effectively addresses misalignment in federated time series learning by sharing synthetic outputs, enhancing local imputations and overall performance.

Abstract: Distributed time series data presents a challenge for federated learning, as clients often possess different feature sets and have misaligned time steps. Existing federated time series models are limited by the assumption of perfect temporal or feature alignment across clients. In this paper, we propose FedTDD, a novel federated time series diffusion model that jointly learns a synthesizer across clients. At the core of FedTDD is a novel data distillation and aggregation framework that reconciles the differences between clients by imputing the misaligned timesteps and features. In contrast to traditional federated learning, FedTDD learns the correlation across clients’ time series through the exchange of local synthetic outputs instead of model parameters. A coordinator iteratively improves a global distiller network by leveraging shared knowledge from clients through the exchange of synthetic data. As the distiller becomes more refined over time, it subsequently enhances the quality of the clients’ local feature estimates, allowing each client to then improve its local imputations for missing data using the latest, more accurate distiller. Experimental results on five datasets demonstrate FedTDD’s effectiveness compared to centralized training, and the effectiveness of sharing synthetic outputs to transfer knowledge of local time series. Notably, FedTDD achieves 79.4% and 62.8% improvement over local training in Context-FID and Correlational scores.

[349] Embracing Large Language Models in Traffic Flow Forecasting

Yusheng Zhao, Xiao Luo, Haomin Wen, Zhiping Xiao, Wei Ju, Ming Zhang

Main category: cs.LG

TL;DR: The paper proposes LEAF, a method using large language models (LLMs) to enhance traffic flow forecasting by adapting to test-time changes. It combines graph and hypergraph structures for spatio-temporal relations and uses LLMs for prediction selection.

DetailsMotivation: Existing methods for traffic flow forecasting lack adaptability to test-time environmental changes, limiting their effectiveness.

Method: LEAF uses two branches (graph and hypergraph structures) to capture spatio-temporal relations, pre-trains them, and employs an LLM to select predictions. A ranking loss enhances prediction ability.

Result: Experiments show LEAF’s effectiveness in traffic flow forecasting across multiple datasets.

Conclusion: LEAF successfully addresses adaptability issues in traffic flow forecasting by integrating LLMs and dual-branch structures.

Abstract: Traffic flow forecasting aims to predict future traffic flows based on the historical traffic conditions and the road network. It is an important problem in intelligent transportation systems, with a plethora of methods having been proposed. Existing efforts mainly focus on capturing and utilizing spatio-temporal dependencies to predict future traffic flows. Though promising, they fall short in adapting to test-time environmental changes of traffic conditions. To tackle this challenge, we propose to introduce large language models (LLMs) to help traffic flow forecasting and design a novel method named Large Language Model Enhanced Traffic Flow Predictor (LEAF). LEAF adopts two branches, capturing different spatio-temporal relations using graph and hypergraph structures respectively. The two branches are first pre-trained individually, and during test-time, they yield different predictions. Based on these predictions, a large language model is used to select the most likely result. Then, a ranking loss is applied as the learning objective to enhance the prediction ability of the two branches. Extensive experiments on several datasets demonstrate the effectiveness of the proposed LEAF.
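
A hedged sketch of the ranking-style objective: whichever branch the LLM selects should incur lower error than the other by a margin. The error measure, the margin value, and the +1/-1 encoding of the LLM's choice are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def branch_ranking_loss(pred_graph, pred_hyper, target, llm_choice, margin=0.1):
    """llm_choice: +1 where the LLM picked the graph branch, -1 for hypergraph."""
    err_g = (pred_graph - target).abs().mean(dim=-1)  # per-sample graph-branch error
    err_h = (pred_hyper - target).abs().mean(dim=-1)  # per-sample hypergraph-branch error
    # margin_ranking_loss(x1, x2, y) = max(0, -y*(x1 - x2) + margin):
    # y = +1 pushes err_h above err_g, i.e. the chosen graph branch improves.
    return F.margin_ranking_loss(err_h, err_g, llm_choice.float(), margin=margin)

B, horizon = 8, 12
tgt = torch.randn(B, horizon)
loss = branch_ranking_loss(tgt + 0.1 * torch.randn_like(tgt),
                           tgt + 0.3 * torch.randn_like(tgt),
                           tgt, llm_choice=torch.ones(B))
```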

[350] Data-driven tool wear prediction in milling, based on a process-integrated single-sensor approach

Eric Hirsch, Christian Friedrich

Main category: cs.LG

TL;DR: The study explores data-driven deep learning methods for tool wear prediction, focusing on transferability with minimal training data and a simple single-sensor setup. It evaluates various models, with ConvNeXt achieving 99.1% accuracy.

DetailsMotivation: Accurate tool wear prediction is crucial for productivity and cost reduction, but current methods face challenges like limited generalization and impractical multi-sensor setups.

Method: The research investigates transfer learning with minimal data, using a single acceleration sensor. It tests models like CNN, LSTM, SVM, and decision trees on feature vectors and STFT inputs.

Result: ConvNeXt outperforms other models with 99.1% accuracy, even with reduced datasets, demonstrating effective tool wear prediction under constrained conditions.

Conclusion: The study highlights the potential of specific models and configurations for adaptable and efficient predictive maintenance in machining.

Abstract: Accurate tool wear prediction is essential for maintaining productivity and minimizing costs in machining. However, the complex nature of the tool wear process poses significant challenges to achieving reliable predictions. This study explores data-driven methods, in particular deep learning, for tool wear prediction. Traditional data-driven approaches often focus on a single process, relying on multi-sensor setups and extensive data generation, which limits generalization to new settings. Moreover, multi-sensor integration is often impractical in industrial environments. To address these limitations, this research investigates the transferability of predictive models using minimal training data, validated across two processes. Furthermore, it uses a simple setup with a single acceleration sensor to establish a low-cost data generation approach that facilitates the generalization of models to other processes via transfer learning. The study evaluates several machine learning models, including transformer-inspired convolutional neural networks (CNN), long short-term memory networks (LSTM), support vector machines (SVM), and decision trees, trained on different input formats such as feature vectors and short-time Fourier transform (STFT). The performance of the models is evaluated on two machines and on different amounts of training data, including scenarios with significantly reduced datasets, providing insight into their effectiveness under constrained data conditions. The results demonstrate the potential of specific models and configurations for effective tool wear prediction, contributing to the development of more adaptable and efficient predictive maintenance strategies in machining. Notably, the ConvNeXt model shows exceptional performance, achieving 99.1% accuracy in identifying tool wear using data from only four milling tools operated until they are worn.

[351] ULTHO: Ultra-Lightweight yet Efficient Hyperparameter Optimization in Deep Reinforcement Learning

Mingqi Yuan, Bo Li, Xin Jin, Wenjun Zeng

Main category: cs.LG

TL;DR: ULTHO is a lightweight framework for efficient hyperparameter optimization (HPO) in deep reinforcement learning, using a multi-armed bandit approach with clustered arms (MABC).

DetailsMotivation: HPO is costly and inefficient in deep RL due to non-stationarity and high computational demands. Existing methods are sample-inefficient and expensive.

Method: ULTHO formulates HPO as a multi-armed bandit problem with clustered arms (MABC) and links it to long-term return optimization, providing a statistical filtering mechanism.

Result: ULTHO achieves superior performance on benchmarks (ALE, Procgen, MiniGrid, PyBullet) with a simple architecture.

Conclusion: ULTHO offers an efficient, lightweight solution for HPO in deep RL, advancing automated RL systems.

Abstract: Hyperparameter optimization (HPO) is a billion-dollar problem in machine learning, which significantly impacts the training efficiency and model performance. However, achieving efficient and robust HPO in deep reinforcement learning (RL) is consistently challenging due to its high non-stationarity and computational cost. To tackle this problem, existing approaches attempt to adapt common HPO techniques (e.g., population-based training or Bayesian optimization) to the RL scenario. However, they remain sample-inefficient and computationally expensive, which cannot facilitate a wide range of applications. In this paper, we propose ULTHO, an ultra-lightweight yet powerful framework for fast HPO in deep RL within a single run. Specifically, we formulate the HPO process as a multi-armed bandit with clustered arms (MABC) and link it directly to long-term return optimization. ULTHO also provides a quantified and statistical perspective to filter the HPs efficiently. We test ULTHO on benchmarks including ALE, Procgen, MiniGrid, and PyBullet. Extensive experiments demonstrate that ULTHO achieves superior performance with a simple architecture, contributing to the development of advanced and automated RL systems.
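
One way to picture a multi-armed bandit with clustered arms is a two-level UCB: first pick a hyperparameter cluster, then a concrete value inside it, with episodic returns as rewards. The sketch below is illustrative only; ULTHO's actual statistics and filtering mechanism differ in detail.

```python
import math
from collections import defaultdict

class MABC:
    def __init__(self, clusters):
        # clusters: dict like {"lr": [1e-4, 3e-4, 1e-3], "clip_range": [0.1, 0.2]}
        self.clusters = clusters
        self.n = defaultdict(int)       # pull counts per (cluster, arm)
        self.mean = defaultdict(float)  # running mean reward per (cluster, arm)
        self.t = 0

    def _ucb(self, key):
        if self.n[key] == 0:
            return float("inf")         # force exploration of untried arms
        return self.mean[key] + math.sqrt(2 * math.log(self.t + 1) / self.n[key])

    def select(self):
        self.t += 1
        # pick the cluster whose best arm has the highest UCB, then that arm
        c = max(self.clusters,
                key=lambda c: max(self._ucb((c, a)) for a in self.clusters[c]))
        a = max(self.clusters[c], key=lambda a: self._ucb((c, a)))
        return c, a

    def update(self, key, reward):
        self.n[key] += 1
        self.mean[key] += (reward - self.mean[key]) / self.n[key]

bandit = MABC({"lr": [1e-4, 3e-4, 1e-3], "clip_range": [0.1, 0.2, 0.3]})
cluster, arm = bandit.select()               # run a training segment with this HP
bandit.update((cluster, arm), reward=123.4)  # feed back the observed return
```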

[352] A comparative analysis of rank aggregation methods for the partial label ranking problem

Jiayi Wang, Juan C. Alfaro, Viktor Bengs

Main category: cs.LG

TL;DR: The paper explores alternative aggregation methods for partial label ranking, finding scoring-based approaches outperform current methods, while probabilistic-based ones do not.

DetailsMotivation: To improve the handling of ties in partial label ranking by investigating better aggregation methods.

Method: Investigates scoring-based and non-parametric probabilistic-based rank aggregation approaches, extending them for ties.

Result: Scoring-based variants outperform state-of-the-art; probabilistic-based variants underperform.

Conclusion: Scoring-based methods are more effective for partial label ranking with incomplete information.

Abstract: The label ranking problem is a supervised learning scenario in which the learner predicts a total order of the class labels for a given input instance. Recently, research has increasingly focused on the partial label ranking problem, a generalization of the label ranking problem that allows ties in the predicted orders. So far, most existing learning approaches for the partial label ranking problem rely on approximation algorithms for rank aggregation in the final prediction step. This paper explores several alternative aggregation methods for this critical step, including scoring-based and non-parametric probabilistic-based rank aggregation approaches. To enhance their suitability for the more general partial label ranking problem, the investigated methods are extended to increase the likelihood of producing ties. Experimental evaluations on standard benchmarks demonstrate that scoring-based variants consistently outperform the current state-of-the-art method in handling incomplete information. In contrast, non-parametric probabilistic-based variants fail to achieve competitive performance.

[353] Panopticon: Advancing Any-Sensor Foundation Models for Earth Observation

Leonard Waldmann, Ando Shah, Yi Wang, Nils Lehmann, Adam J. Stewart, Zhitong Xiong, Xiao Xiang Zhu, Stefan Bauer, John Chuang

Main category: cs.LG

TL;DR: Panopticon is an any-sensor foundation model based on DINOv2, designed to process diverse EO data by treating multi-sensor images as augmentations, subsampling channels, and using cross-attention for flexible embeddings. It achieves SOTA performance on GEO-Bench.

DetailsMotivation: Prior work limited inputs to fixed sensors, but Panopticon addresses the need for a flexible, sensor-agnostic model to handle diverse EO data.

Method: Extends DINOv2 by treating multi-sensor images as augmentations, subsampling channels, and adding cross-attention for patch embeddings. Encodes wavelength and modes for optical and radar sensors.

Result: Achieves state-of-the-art performance on GEO-Bench, outperforming other any-sensor and fixed-sensor models, especially on Sentinel-1 and Sentinel-2.

Conclusion: Panopticon advances sensor-agnostic EO by generalizing to existing and future satellite platforms.

Abstract: Earth observation (EO) data features diverse sensing platforms with varying spectral bands, spatial resolutions, and sensing modalities. While most prior work has constrained inputs to fixed sensors, a new class of any-sensor foundation models able to process arbitrary sensors has recently emerged. Contributing to this line of work, we propose Panopticon, an any-sensor foundation model built on the DINOv2 framework. We extend DINOv2 by (1) treating images of the same geolocation across sensors as natural augmentations, (2) subsampling channels to diversify spectral input, and (3) adding a cross attention over channels as a flexible patch embedding mechanism. By encoding the wavelength and modes of optical and synthetic aperture radar sensors, respectively, Panopticon can effectively process any combination of arbitrary channels. In extensive evaluations, we achieve state-of-the-art performance on GEO-Bench, especially on the widely-used Sentinel-1 and Sentinel-2 sensors, while outcompeting other any-sensor models, as well as domain-adapted fixed-sensor models on unique sensor configurations. Panopticon enables immediate generalization to both existing and future satellite platforms, advancing sensor-agnostic EO.

[354] Decision by Supervised Learning with Deep Ensembles: A Practical Framework for Robust Portfolio Optimization

Juhyeong Kim, Sungyoon Choi, Youngbin Lee, Yejin Kim, Yongmin Choi, Yongjae Lee

Main category: cs.LG

TL;DR: DSL reframes portfolio optimization as a supervised learning problem, using cross-entropy loss and ensemble methods for stability, outperforming traditional and ML-based strategies.

DetailsMotivation: To improve robustness and performance in portfolio optimization by leveraging supervised learning and ensemble techniques.

Method: Train models to predict optimal portfolio weights using cross-entropy loss and Sharpe/Sortino ratios, enhanced with Deep Ensemble methods.

Result: Superior performance in backtesting, higher median returns, and stable risk-adjusted performance with larger ensembles.

Conclusion: DSL is a robust and effective framework for portfolio optimization, outperforming existing methods.

Abstract: We propose Decision by Supervised Learning (DSL), a practical framework for robust portfolio optimization. DSL reframes portfolio construction as a supervised learning problem: models are trained to predict optimal portfolio weights, using cross-entropy loss and portfolios constructed by maximizing the Sharpe or Sortino ratio. To further enhance stability and reliability, DSL employs Deep Ensemble methods, substantially reducing variance in portfolio allocations. Through comprehensive backtesting across diverse market universes and neural architectures, DSL shows superior performance compared to both traditional strategies and leading machine learning-based methods, including Prediction-Focused Learning and End-to-End Learning. We show that increasing the ensemble size leads to higher median returns and more stable risk-adjusted performance. The code is available at https://github.com/DSLwDE/DSLwDE.
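
A hedged sketch of the framing: a network maps features to long-only weights via softmax and is trained with cross-entropy against precomputed target weights (e.g., in-sample Sharpe-maximizing portfolios), and a deep ensemble averages the weight vectors to reduce allocation variance. Names, sizes, and the long-only assumption are illustrative.

```python
import torch
import torch.nn as nn

class WeightNet(nn.Module):
    def __init__(self, n_features, n_assets):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                 nn.Linear(64, n_assets))

    def forward(self, x):                       # x: (batch, n_features)
        return self.net(x).log_softmax(dim=-1)  # log of long-only weights

def dsl_loss(log_w, target_w):
    # cross-entropy between target portfolio weights and predicted weights
    return -(target_w * log_w).sum(dim=-1).mean()

def ensemble_allocation(models, x):
    # deep ensemble: average the softmax weights of independently trained nets
    with torch.no_grad():
        return torch.stack([m(x).exp() for m in models]).mean(dim=0)
```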

[355] Directional Sign Loss: A Topology-Preserving Loss Function that Approximates the Sign of Finite Differences

Harvey Dam, Tripti Agarwal, Ganesh Gopalakrishnan

Main category: cs.LG

TL;DR: The paper introduces Directional Sign Loss (DSL), a differentiable loss function to preserve topological features in latent spaces, outperforming traditional methods.

DetailsMotivation: Preserving topological features in learned latent spaces is challenging, especially for topology-sensitive data.

Method: Proposes DSL, a differentiable loss function penalizing discrepancies in critical points between input and reconstructed data.

Result: DSL combined with traditional losses preserves topological features better than traditional losses alone.

Conclusion: DSL is an efficient, differentiable proxy for topology-based metrics, enabling feature preservation in larger problem sizes and gradient-based frameworks.

Abstract: Preserving topological features in learned latent spaces is a fundamental challenge in representation learning, particularly for topology-sensitive data. This paper introduces directional sign loss (DSL), an efficient, differentiable loss function that approximates the number of mismatches in the signs of finite differences between corresponding elements of two arrays. By penalizing discrepancies in critical points between input and reconstructed data, DSL encourages autoencoders and other learnable compressors to retain the topological features of the original data. We present the formulation and complexity analysis of DSL, comparing it to other non-differentiable topological measures. Experiments on multidimensional array data show that combining DSL with traditional loss functions preserves topological features more effectively than traditional losses alone. DSL serves as a differentiable, efficient proxy for common topology-based metrics, enabling topological feature preservation on previously impractical problem sizes and in a wider range of gradient-based optimization frameworks.
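
One natural differentiable surrogate for "sign of finite differences" replaces sign(x) with tanh(kx); the sketch below counts approximate sign mismatches between the finite differences of an input and its reconstruction. This surrogate is an assumption for illustration, and the paper's exact formulation of DSL may differ.

```python
import torch

def directional_sign_loss(x, x_hat, k=10.0, dim=-1):
    dx = torch.diff(x, dim=dim)
    dx_hat = torch.diff(x_hat, dim=dim)
    # tanh(k*d) approximates sign(d); the product is ~ +1 when signs agree
    # and ~ -1 when they disagree (a mismatched critical point).
    agreement = torch.tanh(k * dx) * torch.tanh(k * dx_hat)
    return ((1.0 - agreement) / 2.0).mean()  # ~ fraction of sign mismatches

x = torch.linspace(0, 6.28, 100).sin().unsqueeze(0)
print(directional_sign_loss(x, x + 0.05 * torch.randn_like(x)))
```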

[356] Adapting to the Unknown: Robust Meta-Learning for Zero-Shot Financial Time Series Forecasting

Anxian Liu, Junying Ma, Guang Zhang

Main category: cs.LG

TL;DR: A novel task-construction method using GMMs for financial time series forecasting improves zero-shot meta-learning by leveraging learned embeddings and dual meta-task types.

DetailsMotivation: Addressing suboptimal performance of existing meta-task strategies in turbulent financial markets and emerging markets with limited data.

Method: Uses GMMs to cluster embeddings, creating intra-cluster and inter-cluster tasks, and employs hard task mining for cross-regime generalization.

Result: Outperforms existing methods and shows stronger generalization in zero-shot scenarios, validated on high-volatility and emerging market data.

Conclusion: The proposed method enhances adaptability and generalization in financial forecasting, particularly in zero-shot settings.

Abstract: Financial time series forecasting in zero-shot settings is critical for investment decisions, especially during abrupt market regime shifts or in emerging markets with limited historical data. While Model-Agnostic Meta-Learning (MAML) approaches show promise, existing meta-task construction strategies often yield suboptimal performance for highly turbulent financial series. To address this, we propose a novel task-construction method that leverages learned embeddings for both meta-task construction and downstream predictions, enabling effective zero-shot meta-learning. Specifically, we use Gaussian Mixture Models (GMMs) to softly cluster embeddings, constructing two complementary meta-task types: intra-cluster tasks and inter-cluster tasks. By assigning embeddings to multiple latent regimes probabilistically, GMMs enable richer, more diverse meta-learning. This dual approach ensures the model can quickly adapt to local patterns while simultaneously capturing invariant cross-series features. Furthermore, we enhance inter-cluster generalization through hard task mining, which identifies robust patterns across divergent market regimes. Our method was validated using real-world financial data from high-volatility periods and multiple international markets (including emerging markets). The results demonstrate significant outperformance of existing approaches and stronger generalization in zero-shot scenarios.
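
An illustrative construction of intra- and inter-cluster meta-tasks via GMM soft clustering of series embeddings, as the abstract describes; the sampling scheme and task sizes are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 16))          # placeholder series embeddings
gmm = GaussianMixture(n_components=4, random_state=0).fit(embeddings)
resp = gmm.predict_proba(embeddings)             # soft regime assignments

def intra_cluster_task(k, n=32):
    # sample series mostly from one latent regime k (local patterns)
    p = resp[:, k] / resp[:, k].sum()
    return rng.choice(len(embeddings), size=n, p=p)

def inter_cluster_task(k1, k2, n=32):
    # mix two regimes to encourage invariant cross-series features
    return np.concatenate([intra_cluster_task(k1, n // 2),
                           intra_cluster_task(k2, n // 2)])
```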

[357] MOSIC: Model-Agnostic Optimal Subgroup Identification with Multi-Constraint for Improved Reliability

Wenxin Chen, Weishen Pan, Kyra Gan, Fei Wang

Main category: cs.LG

TL;DR: A unified optimization framework is proposed to identify optimal subgroups in clinical decision-making by directly incorporating constraints like subgroup size and propensity overlap, outperforming traditional two-step methods.

DetailsMotivation: Existing subgroup identification methods decouple CATE estimation and constraint application, limiting practical applicability in clinical settings.

Method: A reformulated constrained primal problem as an unconstrained differentiable min-max objective, solved via gradient descent-ascent, ensuring feasibility and local optimality.

Result: The framework effectively identifies high-benefit subgroups while better satisfying constraints, demonstrated on synthetic and real-world datasets.

Conclusion: The proposed method offers a model-agnostic, extensible solution for subgroup identification, directly enforcing constraints during optimization.

Abstract: Current subgroup identification methods typically follow a two-step approach: first estimate conditional average treatment effects and then apply thresholding or rule-based procedures to define subgroups. While intuitive, this decoupled approach fails to incorporate key constraints essential for real-world clinical decision-making, such as subgroup size and propensity overlap. These constraints operate on fundamentally different axes than CATE estimation and are not naturally accommodated within existing frameworks, thereby limiting the practical applicability of these methods. We propose a unified optimization framework that directly solves the primal constrained optimization problem to identify optimal subgroups. Our key innovation is a reformulation of the constrained primal problem as an unconstrained differentiable min-max objective, solved via a gradient descent-ascent algorithm. We theoretically establish that our solution converges to a feasible and locally optimal solution. Unlike threshold-based CATE methods that apply constraints as post-hoc filters, our approach enforces them directly during optimization. The framework is model-agnostic, compatible with a wide range of CATE estimators, and extensible to additional constraints like cost limits or fairness criteria. Extensive experiments on synthetic and real-world datasets demonstrate its effectiveness in identifying high-benefit subgroups while maintaining better satisfaction of constraints.
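
A gradient descent-ascent sketch of the min-max idea: learn a soft subgroup membership that maximizes average estimated benefit subject to a minimum subgroup-size constraint, via a Lagrangian relaxation. The linear membership model and the single size constraint are simplifying assumptions, not the paper's exact objective.

```python
import torch

torch.manual_seed(0)
X = torch.randn(1000, 5)                  # covariates
cate = X[:, 0] + 0.2 * torch.randn(1000)  # placeholder CATE estimates
w = torch.zeros(5, requires_grad=True)    # subgroup score parameters
lam = torch.zeros(())                     # Lagrange multiplier (kept >= 0)
alpha = 0.3                               # required minimum subgroup fraction
opt = torch.optim.Adam([w], lr=0.05)

for _ in range(500):
    s = torch.sigmoid(X @ w)                        # soft membership in [0, 1]
    benefit = (s * cate).sum() / (s.sum() + 1e-8)   # avg benefit inside subgroup
    violation = alpha - s.mean()                    # <= 0 when size constraint holds
    loss = -benefit + lam * violation               # descend on w ...
    opt.zero_grad(); loss.backward(); opt.step()
    lam = (lam + 0.1 * violation.detach()).clamp(min=0.0)  # ... ascend on lambda
```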

[358] Transfer learning-enhanced deep reinforcement learning for aerodynamic airfoil optimisation subject to structural constraints

David Ramos, Lucas Lacasa, Eusebio Valero, Gonzalo Rubio

Main category: cs.LG

TL;DR: The paper introduces a transfer learning-enhanced deep reinforcement learning (DRL) method for airfoil optimization, outperforming Particle Swarm Optimization (PSO) in efficiency and performance.

DetailsMotivation: To optimize airfoil geometry using aerodynamic and structural criteria, improving lift-to-drag ratio while maintaining structural integrity.

Method: Uses DRL with transfer learning strategies, comparing performance with PSO.

Result: DRL outperforms PSO in computational efficiency and aerodynamic improvement; transfer learning further saves resources.

Conclusion: Transfer learning-enhanced DRL is effective for airfoil optimization, balancing performance and computational cost.

Abstract: The main objective of this paper is to introduce a transfer learning-enhanced deep reinforcement learning (DRL) methodology that is able to optimise the geometry of any airfoil based on concomitant aerodynamic and structural integrity criteria. To showcase the method, we aim to maximise the lift-to-drag ratio $C_L/C_D$ while preserving the structural integrity of the airfoil – as modelled by its maximum thickness – and train the DRL agent using a list of different transfer learning (TL) strategies. The performance of the DRL agent is compared with Particle Swarm Optimisation (PSO), a traditional gradient-free optimisation method. Results indicate that DRL agents are able to perform purely aerodynamic and hybrid aerodynamic/structural shape optimisation, that the DRL approach outperforms PSO in terms of computational efficiency and aerodynamic improvement, and that the TL-enhanced DRL agent achieves performance comparable to the DRL one, while further saving substantial computational resources.

[359] Latent Diffeomorphic Dynamic Mode Decomposition

Willem Diepeveen, Jon Schwenk, Andrea Bertozzi

Main category: cs.LG

TL;DR: LDDMD combines DMD’s interpretability with RNNs’ predictive power for non-linear systems, demonstrated in streamflow prediction.

DetailsMotivation: To bridge the gap between interpretability (DMD) and predictive power (RNNs) in non-linear system analysis.

Method: Latent Diffeomorphic Dynamic Mode Decomposition (LDDMD) integrates DMD and RNNs for data reduction and modeling.

Result: LDDMD effectively models complex non-linear systems with memory, enabling accurate predictions (e.g., streamflow).

Conclusion: LDDMD offers a simple yet powerful tool for analyzing non-linear systems, balancing interpretability and predictive performance.

Abstract: We present Latent Diffeomorphic Dynamic Mode Decomposition (LDDMD), a new data reduction approach for the analysis of non-linear systems that combines the interpretability of Dynamic Mode Decomposition (DMD) with the predictive power of Recurrent Neural Networks (RNNs). Notably, LDDMD maintains simplicity, which enhances interpretability, while effectively modeling and learning complex non-linear systems with memory, enabling accurate predictions. This is exemplified by its successful application in streamflow prediction.

[360] Adaptive Branch Specialization in Spectral-Spatial Graph Neural Networks for Certified Robustness

Yoonhyuk Choi, Jiho Choi, Chong-Kwon Kim

Main category: cs.LG

TL;DR: SpecSphere combines spectral-spatial GNNs with specialized adversarial training for certified robustness, achieving state-of-the-art accuracy and robustness.

DetailsMotivation: Addressing the lack of certified robustness in GNNs by specializing spectral and spatial branches for different adversarial attacks and patterns.

Method: Specialized training: spectral branch resists $\ell_0$ edge flips (homophilic structures), spatial branch resists $\ell_\infty$ feature perturbations (heterophilic patterns). A gating network dynamically fuses representations.

Result: Theoretical guarantees on expressivity, frequency bias, and certified robustness. Empirically, achieves top accuracy and tighter robustness on benchmarks.

Conclusion: SpecSphere effectively balances accuracy and robustness through specialized adversarial training and dynamic fusion, advancing GNN reliability.

Abstract: Recent Graph Neural Networks (GNNs) combine spectral-spatial architectures for enhanced representation learning. However, limited attention has been paid to certified robustness, particularly regarding training strategies and underlying rationale. In this paper, we explicitly specialize each branch: the spectral network is trained to withstand $\ell_0$ edge flips and capture homophilic structures, while the spatial part is designed to resist $\ell_\infty$ feature perturbations and heterophilic patterns. A context-aware gating network adaptively fuses the two representations, dynamically routing each node's prediction to the more reliable branch. This specialized adversarial training scheme uses branch-specific inner maximization (structure vs. feature attacks) and a unified alignment objective. We provide theoretical guarantees: (i) expressivity of the gating mechanism beyond 1-WL, (ii) spectral-spatial frequency bias, and (iii) certified robustness with trade-off. Empirically, SpecSphere attains state-of-the-art node classification accuracy and offers tighter certified robustness on real-world benchmarks.
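
A minimal sketch of the context-aware gating fusion the abstract describes: a small gate network mixes spectral- and spatial-branch node embeddings per node. Layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(),
                                  nn.Linear(d, 1), nn.Sigmoid())

    def forward(self, h_spec, h_spat):  # each: (n_nodes, d)
        # per-node mixing coefficient, routing toward the more reliable branch
        g = self.gate(torch.cat([h_spec, h_spat], dim=-1))  # (n_nodes, 1)
        return g * h_spec + (1 - g) * h_spat

fusion = GatedFusion(d=64)
h = fusion(torch.randn(100, 64), torch.randn(100, 64))
```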

[361] Curious Causality-Seeking Agents Learn Meta Causal World

Zhiyu Zhao, Haoxuan Li, Haifeng Zhang, Jun Wang, Francesco Faccio, Jürgen Schmidhuber, Mengyue Yang

Main category: cs.LG

TL;DR: The paper introduces Meta-Causal Graphs as world models to address shifting causal mechanisms, proposing a Causality-Seeking Agent to identify and refine these graphs through curiosity-driven exploration.

DetailsMotivation: Real-world environments often exhibit drifting causal mechanisms due to narrow observational windows, complicating world model construction.

Method: A Meta-Causal Graph encodes transformation rules for causal shifts, with subgraphs triggered by latent meta states. A Causality-Seeking Agent identifies meta states, discovers causal relationships, and refines the graph iteratively.

Result: Experiments on synthetic and robot arm tasks show the method captures causal shifts and generalizes to unseen contexts.

Conclusion: The Meta-Causal Graph and Causality-Seeking Agent effectively model dynamic causal mechanisms, enhancing generalization.

Abstract: When building a world model, a common assumption is that the environment has a single, unchanging underlying causal rule, like applying Newton's laws to every situation. In reality, what appears as a drifting causal mechanism is often the manifestation of a fixed underlying mechanism seen through a narrow observational window. This brings about a problem that, when building a world model, even subtle shifts in policy or environment states can alter the very observed causal mechanisms. In this work, we introduce the Meta-Causal Graph as a world model, a minimal unified representation that efficiently encodes the transformation rules governing how causal structures shift across different latent world states. A single Meta-Causal Graph is composed of multiple causal subgraphs, each triggered by a meta state in the latent state space. Building on this representation, we introduce a Causality-Seeking Agent whose objectives are to (1) identify the meta states that trigger each subgraph, (2) discover the corresponding causal relationships by a curiosity-driven intervention policy, and (3) iteratively refine the Meta-Causal Graph through ongoing curiosity-driven exploration and agent experiences. Experiments on both synthetic tasks and a challenging robot arm manipulation task demonstrate that our method robustly captures shifts in causal dynamics and generalizes effectively to previously unseen contexts.

[362] Towards Fair In-Context Learning with Tabular Foundation Models

Patrik Kenfack, Samira Ebrahimi Kahou, Ulrich Aïvodji

Main category: cs.LG

TL;DR: The paper investigates fairness in tabular in-context learning (ICL) using transformer-based models, exploring three bias-mitigation methods, with uncertainty-based sampling showing the best fairness improvements.

DetailsMotivation: To address the unexplored fairness implications of transformer-based tabular foundation models in ICL, which are emerging as alternatives to gradient-boosted trees.

Method: Evaluates three models (TabPFNv2, TabICL, TabDPT) on benchmark datasets using three fairness-enhancing methods: correlation removal, group-balanced sampling, and uncertainty-based sampling.

Result: Uncertainty-based sampling consistently improves group fairness metrics (e.g., demographic parity) with minimal impact on predictive accuracy.

Conclusion: Uncertainty-based sampling is effective for enhancing fairness in tabular ICL, and the code is released for reproducibility.

Abstract: Transformer-based tabular foundation models have recently demonstrated promising in-context learning (ICL) performance on structured data, emerging as competitive alternatives to gradient-boosted trees. However, the fairness implications of this new paradigm remain largely unexplored. We present the first investigation of fairness in tabular ICL, evaluating three recently proposed foundation models – TabPFNv2, TabICL, and TabDPT – on multiple benchmark datasets. To mitigate biases, we explore three pre-processing fairness-enhancing methods: correlation removal (decorrelating input features from the sensitive attribute), group-balanced sample selection (ensuring equal representation of protected groups in context examples), and uncertainty-based sample selection (prioritizing context examples with high sensitive-attribute prediction uncertainty). Our experiments show that the uncertainty-based strategy consistently improves group fairness metrics (e.g., demographic parity, equalized odds, and equal opportunity) with minimal impact on predictive accuracy. We release our code to facilitate reproducibility (https://github.com/patrikken/Fair-TabICL).
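
A hedged sketch of the uncertainty-based selection: fit an auxiliary classifier to predict the sensitive attribute and keep, as in-context examples, the candidates whose attribute is hardest to predict (highest predictive entropy). The auxiliary model and selection rule are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_uncertain_context(X, sensitive, n_context):
    aux = LogisticRegression(max_iter=1000).fit(X, sensitive)
    p = aux.predict_proba(X)
    # entropy of the sensitive-attribute prediction: high = hard to predict
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)
    return np.argsort(-entropy)[:n_context]  # most uncertain examples first

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
s = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
ctx_idx = select_uncertain_context(X, s, n_context=32)
```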

[363] Rethinking Irregular Time Series Forecasting: A Simple yet Effective Baseline

Xvyuan Liu, Xiangfei Qiu, Xingjian Wu, Zhengyu Li, Chenjuan Guo, Jilin Hu, Bin Yang

Main category: cs.LG

TL;DR: APN introduces a novel Time-Aware Patch Aggregation (TAPA) module for efficient and accurate forecasting of irregular multivariate time series (IMTS), outperforming existing methods.

DetailsMotivation: The challenges of non-uniformity, missing data, and computational inefficiency in IMTS forecasting motivate the development of APN.

Method: APN uses TAPA to dynamically segment time series, compute patch representations via time-aware weighted aggregation, and employs a lightweight query module and MLP for prediction.

Result: APN achieves state-of-the-art performance in accuracy and computational efficiency across multiple datasets.

Conclusion: APN’s TAPA module effectively addresses IMTS forecasting challenges, offering a scalable and accurate solution.

Abstract: The forecasting of irregular multivariate time series (IMTS) is a critical task in domains like healthcare and climate science. However, this task faces two significant hurdles: 1) the inherent non-uniformity and missing data in IMTS complicate the modeling of temporal dynamics, and 2) existing methods often rely on computationally expensive architectures. To address these dual challenges, we introduce APN, a general and efficient forecasting framework. At the core of APN is a novel Time-Aware Patch Aggregation (TAPA) module that introduces an aggregation-based paradigm for adaptive patching, moving beyond the limitations of fixed-span segmentation and interpolation-based methods. TAPA first learns dynamic temporal boundaries to define data-driven segments. Crucially, instead of resampling or interpolating, it directly computes patch representations via a time-aware weighted aggregation of all raw observations, where weights are determined by each observation's temporal relevance to the segment. This approach provides two key advantages: it preserves data fidelity by avoiding the introduction of artificial data points and ensures complete information coverage by design. The resulting regularized and information-rich patch representations enable the use of a lightweight query module for historical context aggregation and a simple MLP for final prediction. Extensive experiments on multiple real-world datasets demonstrate that APN establishes a new state-of-the-art, significantly outperforming existing methods in both prediction accuracy and computational efficiency.
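
A minimal sketch of time-aware weighted aggregation: every raw observation contributes to a patch with a weight given by its temporal relevance to the segment center, with no interpolation or artificial data points. A Gaussian relevance kernel and fixed centers stand in for TAPA's learned boundaries and weights.

```python
import torch

def aggregate_patches(values, times, centers, bandwidth=0.1):
    """values: (n_obs, d); times: (n_obs,); centers: (n_patches,)."""
    # temporal relevance of each observation to each patch center
    rel = -((times[None, :] - centers[:, None]) ** 2) / (2 * bandwidth ** 2)
    w = torch.softmax(rel, dim=1)   # (n_patches, n_obs) weights over observations
    return w @ values               # (n_patches, d) patch representations

vals = torch.randn(57, 4)           # irregularly observed multivariate series
ts = torch.rand(57).sort().values   # non-uniform observation times in [0, 1]
patches = aggregate_patches(vals, ts, centers=torch.linspace(0, 1, 8))
```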

[364] Conformal Predictive Distributions for Order Fulfillment Time Forecasting

Tinghan Ye, Amira Hijazi, Pascal Van Hentenryck

Main category: cs.LG

TL;DR: A novel framework for distributional forecasting of order fulfillment time in e-commerce logistics, using model-agnostic techniques for rigorous guarantees, outperforming rule-based systems.

DetailsMotivation: Traditional rule-based approaches fail to capture uncertainties in delivery operations, necessitating more accurate and reliable forecasting methods.

Method: Leverages Conformal Predictive Systems and Cross Venn-Abers Predictors, integrating spatiotemporal features and a cost-sensitive decision rule for probabilistic forecasts.

Result: Achieves up to 14% higher prediction accuracy and 75% improvement in identifying late deliveries compared to rule-based systems.

Conclusion: The proposed framework significantly enhances predictive accuracy and reliability in order fulfillment time estimation.

Abstract: Accurate estimation of order fulfillment time is critical for e-commerce logistics, yet traditional rule-based approaches often fail to capture the inherent uncertainties in delivery operations. This paper introduces a novel framework for distributional forecasting of order fulfillment time, leveraging Conformal Predictive Systems and Cross Venn-Abers Predictors – model-agnostic techniques that provide rigorous coverage or validity guarantees. The proposed machine learning methods integrate granular spatiotemporal features, capturing fulfillment location and carrier performance dynamics to enhance predictive accuracy. Additionally, a cost-sensitive decision rule is developed to convert probabilistic forecasts into reliable point predictions. Experimental evaluation on a large-scale industrial dataset demonstrates that the proposed methods generate competitive distributional forecasts, while machine learning-based point predictions significantly outperform the existing rule-based system – achieving up to 14% higher prediction accuracy and up to 75% improvement in identifying late deliveries.
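
For intuition, here is the generic split-conformal construction of a predictive distribution, plus a cost-sensitive quantile rule for point predictions; this is the textbook recipe under exchangeability, not the paper's exact system, and all names are illustrative.

```python
import numpy as np

def conformal_cdf(y_hat_test, y_hat_cal, y_cal):
    """Empirical CDF of Y | x_test built from calibration residuals."""
    residuals = np.sort(y_cal - y_hat_cal)
    def cdf(y):  # P(Y <= y) estimated with the usual +1 conformal correction
        return np.searchsorted(residuals, y - y_hat_test, side="right") / (len(residuals) + 1)
    return cdf

def point_prediction(y_hat_test, y_hat_cal, y_cal, q=0.8):
    """Cost-sensitive point forecast: e.g. the 0.8-quantile if lateness is costlier."""
    residuals = np.sort(y_cal - y_hat_cal)
    k = min(int(np.ceil(q * (len(residuals) + 1))) - 1, len(residuals) - 1)
    return y_hat_test + residuals[max(k, 0)]

rng = np.random.default_rng(0)
y_cal = rng.gamma(4.0, 1.0, size=500)          # calibration fulfillment times
y_hat_cal = y_cal + rng.normal(0, 0.5, 500)    # a point model's predictions
cdf = conformal_cdf(4.2, y_hat_cal, y_cal)
print(cdf(5.0), point_prediction(4.2, y_hat_cal, y_cal, q=0.8))
```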

[365] How to Evaluate Participant Contributions in Decentralized Federated Learning

Honoka Anada, Tatsuya Kaneko, Shinya Takamaeda-Yamazaki

Main category: cs.LG

TL;DR: TRIP-Shapley is a novel method for evaluating participant contributions in decentralized federated learning (DFL), addressing challenges like model inaccessibility and contribution propagation.

DetailsMotivation: Existing contribution evaluation methods for federated learning (FL) assume centralized settings and fail in DFL due to challenges like inaccessibility of non-neighboring clients' models and tracing contribution propagation.

Method: TRIP-Shapley traces the propagation of round-wise local contributions to formulate overall contributions, enabling lightweight estimation without collecting models.

Result: TRIP-Shapley closely approximates the ground-truth Shapley value, scales well in large scenarios, and remains robust against dishonest clients.

Conclusion: TRIP-Shapley effectively addresses DFL’s unique challenges, providing accurate and scalable contribution evaluation.

Abstract: Federated learning (FL) enables multiple clients to collaboratively train machine learning models without sharing local data. In particular, decentralized FL (DFL), where clients exchange models without a central server, has gained attention for mitigating communication bottlenecks. Evaluating participant contributions is crucial in DFL to incentivize active participation and enhance transparency. However, existing contribution evaluation methods for FL assume centralized settings and cannot be applied directly to DFL due to two challenges: the inaccessibility of each client to non-neighboring clients' models, and the necessity to trace how contributions propagate in conjunction with peer-to-peer model exchanges over time. To address these challenges, we propose TRIP-Shapley, a novel contribution evaluation method for DFL. TRIP-Shapley formulates the clients' overall contributions by tracing the propagation of the round-wise local contributions. In this way, TRIP-Shapley accurately reflects the delayed and gradual influence propagation, while allowing a lightweight coordinator node to estimate the overall contributions without collecting models, relying solely on locally observable contributions reported by each client. Experiments demonstrate that TRIP-Shapley is sufficiently close to the ground-truth Shapley value, is scalable to large-scale scenarios, and remains robust in the presence of dishonest clients.

[366] EmissionNet: Air Quality Pollution Forecasting for Agriculture

Prady Saligram, Tanvir Bhathal

Main category: cs.LG

TL;DR: The paper proposes two deep learning models, EmissionNet (ENV) and EmissionNet-Transformer (ENT), to forecast N$_2$O agricultural emissions, addressing limitations of traditional physics-based methods.

DetailsMotivation: Agricultural emissions are a significant but overlooked source of air pollution, and current physics-based models fail to capture complex pollutant interactions.

Method: The study evaluates popular architectures and introduces two novel deep learning models (ENV and ENT) combining convolutional and transformer-based approaches for spatial-temporal data.

Result: The proposed models (ENV and ENT) effectively capture spatial-temporal dependencies in high-resolution emissions data.

Conclusion: Deep learning architectures like ENV and ENT offer promising alternatives to traditional methods for forecasting agricultural emissions.

Abstract: Air pollution from agricultural emissions is a significant yet often overlooked contributor to environmental and public health challenges. Traditional air quality forecasting models rely on physics-based approaches, which struggle to capture complex, nonlinear pollutant interactions. In this work, we explore forecasting N$_2$O agricultural emissions by evaluating popular architectures and proposing two novel deep learning architectures, EmissionNet (ENV) and EmissionNet-Transformer (ENT). These models leverage convolutional and transformer-based architectures to extract spatial-temporal dependencies from high-resolution emissions data.

[367] EVINET: Towards Open-World Graph Learning via Evidential Reasoning Network

Weijie Guan, Haohui Wang, Jian Kang, Lihui Liu, Dawei Zhou

Main category: cs.LG

TL;DR: EVINET is a framework for open-world graph learning, addressing misclassification and out-of-distribution detection using Beta embedding and subjective logic.

DetailsMotivation: Graph learning often assumes a closed-world, but real-world tasks require handling unknown or noisy data. EVINET aims to detect misclassifications and novel classes.

Method: Integrates Beta embedding with subjective logic, featuring Dissonance Reasoning for misclassification and Vacuity Reasoning for out-of-distribution detection.

Result: Outperforms state-of-the-art methods in classification, misclassification detection, and out-of-distribution detection.

Conclusion: EVINET highlights the importance of uncertainty estimation and logical reasoning for open-world graph learning.

Abstract: Graph learning has been crucial to many real-world tasks, but it is often studied under a closed-world assumption, with all possible labels of data known a priori. To enable effective graph learning in an open and noisy environment, it is critical to inform the model users when the model makes a wrong prediction on in-distribution data of a known class, i.e., misclassification detection, or when the model encounters out-of-distribution data from novel classes, i.e., out-of-distribution detection. This paper introduces Evidential Reasoning Network (EVINET), a framework that addresses these two challenges by integrating Beta embedding within a subjective logic framework. EVINET includes two key modules: Dissonance Reasoning for misclassification detection and Vacuity Reasoning for out-of-distribution detection. Extensive experiments demonstrate that EVINET outperforms state-of-the-art methods across multiple metrics in the tasks of in-distribution classification, misclassification detection, and out-of-distribution detection. EVINET demonstrates the necessity of uncertainty estimation and logical reasoning for misclassification detection and out-of-distribution detection and paves the way for open-world graph learning. Our code and data are available at https://github.com/SSSKJ/EviNET.
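
For background, the two subjective-logic quantities EVINET reasons with can be computed directly from Dirichlet evidence: vacuity (lack of evidence, an out-of-distribution signal) and dissonance (conflicting evidence, a misclassification signal). The formulas below are the standard subjective-logic definitions; the Beta-embedding model that produces the evidence is omitted.

```python
import torch

def vacuity(evidence):                  # high -> little evidence (OOD signal)
    alpha = evidence + 1.0              # Dirichlet parameters, shape (n, K)
    K = alpha.shape[-1]
    return K / alpha.sum(-1)

def dissonance(evidence):               # high -> conflicting evidence
    alpha = evidence + 1.0
    b = evidence / alpha.sum(-1, keepdim=True)  # belief masses, shape (n, K)
    n, K = b.shape
    out = torch.zeros(n)
    for k in range(K):
        bal_sum = torch.zeros(n)
        b_rest = torch.zeros(n)
        for j in range(K):
            if j == k:
                continue
            # relative mass balance Bal(b_j, b_k) = 1 - |b_j - b_k| / (b_j + b_k)
            bal = 1.0 - (b[:, j] - b[:, k]).abs() / (b[:, j] + b[:, k] + 1e-12)
            bal_sum += b[:, j] * bal
            b_rest += b[:, j]
        out += b[:, k] * bal_sum / (b_rest + 1e-12)
    return out

ev = torch.tensor([[9.0, 0.0, 0.0],   # confident: low vacuity, low dissonance
                   [3.0, 3.0, 3.0],   # conflicting: high dissonance
                   [0.1, 0.1, 0.1]])  # little evidence: high vacuity
print(vacuity(ev), dissonance(ev))
```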

[368] Efficient Federated Learning with Encrypted Data Sharing for Data-Heterogeneous Edge Devices

Hangyu Li, Hongyue Wu, Guodong Fan, Zhen Zhang, Shizhan Chen, Zhiyong Feng

Main category: cs.LG

TL;DR: FedEDS is a federated learning scheme for edge devices that uses encrypted data sharing to improve convergence speed and model performance, addressing issues like latency and data heterogeneity.

DetailsMotivation: Current federated learning research neglects network topology, physical distance, and data heterogeneity, causing latency and degraded performance.

Method: FedEDS trains a data encryptor using client models and stochastic layers, enabling encrypted data sharing among clients for local model training.

Result: FedEDS accelerates convergence and mitigates data heterogeneity, enhancing model performance on edge devices.

Conclusion: FedEDS is effective for edge applications requiring rapid convergence and robust performance.

Abstract: As privacy protection gains increasing importance, more models are being trained on edge devices and subsequently merged into the central server through Federated Learning (FL). However, current research overlooks the impact of network topology, physical distance, and data heterogeneity on edge devices, leading to issues such as increased latency and degraded model performance. To address these issues, we propose a new federated learning scheme for edge devices called Federated Learning with Encrypted Data Sharing (FedEDS). FedEDS uses the client model and the model's stochastic layer to train the data encryptor. The data encryptor generates encrypted data and shares it with other clients. The client uses the corresponding client's stochastic layer and encrypted data to train and adjust the local model. FedEDS uses the client's local private data and encrypted shared data from other clients to train the model. This approach accelerates the convergence speed of federated learning training and mitigates the negative impact of data heterogeneity, making it suitable for application services deployed on edge devices requiring rapid convergence. Experimental results show the efficacy of FedEDS in improving model performance.

[369] Disentangling Neural Disjunctive Normal Form Models

Kexin Gu Baugh, Vincent Perreault, Matthew Baugh, Luke Dickens, Katsumi Inoue, Alessandra Russo

Main category: cs.LG

TL;DR: The paper addresses performance degradation in Neural DNF models during symbolic translation by proposing a disentanglement method, improving interpretability and preserving performance.

DetailsMotivation: Performance degradation in Neural DNF models during symbolic translation due to failure in disentangling learned knowledge.

Method: Proposes splitting nodes encoding nested rules into smaller independent nodes to disentangle learned knowledge.

Result: Demonstrates improved performance and interpretability in binary, multiclass, and multilabel classification tasks.

Conclusion: The disentanglement method provides compact, interpretable logical representations with performance closer to pre-translation models.

Abstract: Neural Disjunctive Normal Form (DNF) based models are powerful and interpretable approaches to neuro-symbolic learning and have shown promising results in classification and reinforcement learning settings without prior knowledge of the tasks. However, their performance is degraded by the thresholding of the post-training symbolic translation process. We show here that part of the performance degradation during translation is due to its failure to disentangle the learned knowledge represented in the form of the networks’ weights. We address this issue by proposing a new disentanglement method; by splitting nodes that encode nested rules into smaller independent nodes, we are able to better preserve the models’ performance. Through experiments on binary, multiclass, and multilabel classification tasks (including those requiring predicate invention), we demonstrate that our disentanglement method provides compact and interpretable logical representations for the neural DNF-based models, with performance closer to that of their pre-translation counterparts. Our code is available at https://github.com/kittykg/disentangling-ndnf-classification.

[370] Hierarchical Multi-Label Contrastive Learning for Protein-Protein Interaction Prediction Across Organisms

Shiyi Liu, Buwen Liang, Yuetong Fang, Zixuan Jiang, Renjing Xu

Main category: cs.LG

TL;DR: HIPPO is a hierarchical contrastive learning framework for protein-protein interaction prediction, achieving state-of-the-art performance and strong zero-shot transferability across species.

Motivation: To bridge heterogeneous biological data modalities and improve PPI prediction by leveraging hierarchical protein attributes and contrastive learning.

Method: Uses hierarchical contrastive loss functions and data-driven penalties to align protein sequences and their hierarchical attributes, incorporating domain knowledge.

Result: Outperforms existing methods, shows robustness in low-data regimes, and demonstrates zero-shot transferability to other species.

Conclusion: HIPPO advances cross-species PPI prediction and provides a unified framework for sparse or imbalanced multi-species data scenarios.

Abstract: Recent advances in AI for science have highlighted the power of contrastive learning in bridging heterogeneous biological data modalities. Building on this paradigm, we propose HIPPO (HIerarchical Protein-Protein interaction prediction across Organisms), a hierarchical contrastive framework for protein-protein interaction(PPI) prediction, where protein sequences and their hierarchical attributes are aligned through multi-tiered biological representation matching. The proposed approach incorporates hierarchical contrastive loss functions that emulate the structured relationship among functional classes of proteins. The framework adaptively incorporates domain and family knowledge through a data-driven penalty mechanism, enforcing consistency between the learned embedding space and the intrinsic hierarchy of protein functions. Experiments on benchmark datasets demonstrate that HIPPO achieves state-of-the-art performance, outperforming existing methods and showing robustness in low-data regimes. Notably, the model demonstrates strong zero-shot transferability to other species without retraining, enabling reliable PPI prediction and functional inference even in less characterized or rare organisms where experimental data are limited. Further analysis reveals that hierarchical feature fusion is critical for capturing conserved interaction determinants, such as binding motifs and functional annotations. This work advances cross-species PPI prediction and provides a unified framework for interaction prediction in scenarios with sparse or imbalanced multi-species data.
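
The hierarchical loss can be illustrated with a short sketch. Below is a minimal PyTorch version, assuming one supervised-contrastive term per hierarchy level (e.g., family, domain, functional class) combined by per-level weights; the exact loss form, the data-driven penalty mechanism, and all names here are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def supcon_level(z, labels, tau=0.1):
    """Supervised contrastive loss at one hierarchy level.
    z: (N, d) L2-normalized embeddings; labels: (N,) class ids at this level."""
    sim = z @ z.t() / tau
    mask_pos = (labels[:, None] == labels[None, :]).float()
    mask_pos.fill_diagonal_(0)                      # exclude self-pairs
    logits = sim - sim.max(dim=1, keepdim=True).values.detach()
    exp = torch.exp(logits) * (1 - torch.eye(len(z), device=z.device))
    log_prob = logits - torch.log(exp.sum(dim=1, keepdim=True) + 1e-12)
    pos_count = mask_pos.sum(dim=1).clamp(min=1)    # positives per anchor
    loss = -(mask_pos * log_prob).sum(dim=1) / pos_count
    return loss[mask_pos.sum(dim=1) > 0].mean()     # anchors that have positives

def hierarchical_contrastive_loss(z, level_labels, level_weights):
    """Sum per-level losses; more specific levels can receive larger weights."""
    z = F.normalize(z, dim=1)
    return sum(w * supcon_level(z, y) for w, y in zip(level_weights, level_labels))
```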

[371] Binarizing Physics-Inspired GNNs for Combinatorial Optimization

Martin Krutský, Gustav Šír, Vyacheslav Kungurtsev, Georgios Korpas

Main category: cs.LG

TL;DR: PI-GNNs show declining performance with denser problem graphs due to a phase transition in training dynamics. Proposed methods from fuzzy logic and binarized networks improve results.

Motivation: Address the performance drop of PI-GNNs in dense combinatorial problem graphs, revealing a phase transition and solution discrepancy.

Method: Analyze PI-GNNs’ training dynamics, propose alternatives based on fuzzy logic and binarized neural networks.

Result: Proposed methods significantly enhance PI-GNN performance in dense settings.

Conclusion: Insights from fuzzy logic and binarization effectively address PI-GNN limitations in dense graphs.

Abstract: Physics-inspired graph neural networks (PI-GNNs) have been utilized as an efficient unsupervised framework for relaxing combinatorial optimization problems encoded through a specific graph structure and loss, reflecting dependencies between the problem’s variables. While the framework has yielded promising results in various combinatorial problems, we show that the performance of PI-GNNs systematically plummets with an increasing density of the combinatorial problem graphs. Our analysis reveals an interesting phase transition in the PI-GNNs’ training dynamics, associated with degenerate solutions for the denser problems, highlighting a discrepancy between the relaxed, real-valued model outputs and the binary-valued problem solutions. To address the discrepancy, we propose principled alternatives to the naive strategy used in PI-GNNs by building on insights from fuzzy logic and binarized neural networks. Our experiments demonstrate that the portfolio of proposed methods significantly improves the performance of PI-GNNs in increasingly dense settings.
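
One concrete reading of the binarization direction is sketched below in PyTorch: the naive sigmoid relaxation of a QUBO-style PI-GNN loss is replaced by a hard 0/1 assignment in the forward pass with a straight-through gradient. The Q matrix encodes the combinatorial instance (e.g., MaxCut); this is one plausible instantiation, not the paper's full portfolio of fuzzy-logic and binarized variants.

```python
import torch

def ste_binarize(p):
    """Forward: hard 0/1 assignment; backward: identity (straight-through)."""
    hard = (p > 0.5).float()
    return p + (hard - p).detach()

def pi_gnn_loss(logits, Q, binarize=True):
    """QUBO objective x^T Q x over relaxed or binarized node assignments."""
    p = torch.sigmoid(logits)               # relaxed assignments in [0, 1]
    x = ste_binarize(p) if binarize else p  # naive strategy keeps p directly
    return x @ Q @ x
```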

[372] Geometric Multi-color Message Passing Graph Neural Networks for Blood-brain Barrier Permeability Prediction

Trung Nguyen, Md Masud Rana, Farjana Tasnim Mukta, Chang-Guo Zhan, Duc Duy Nguyen

Main category: cs.LG

TL;DR: GMC-MPNN, a novel GNN framework incorporating 3D geometric features, outperforms state-of-the-art models in predicting BBB permeability, achieving high AUC-ROC and low RMSE.

Motivation: Accurate BBBP prediction is crucial for CNS drug development, but existing GNNs often ignore 3D geometric information, limiting their effectiveness.

Method: GMC-MPNN enhances standard message-passing GNNs by incorporating atomic-level geometric features and long-range interactions, using weighted colored subgraphs based on atom types.

Result: GMC-MPNN achieves superior performance (AUC-ROC 0.9704, RMSE 0.4609) on benchmark datasets, outperforming existing models.

Conclusion: GMC-MPNN sets a new benchmark for BBBP prediction by integrating spatial geometry, offering a more accurate tool for drug discovery.

Abstract: Accurate prediction of blood-brain barrier permeability (BBBP) is essential for central nervous system (CNS) drug development. While graph neural networks (GNNs) have advanced molecular property prediction, they often rely on molecular topology and neglect the three-dimensional geometric information crucial for modeling transport mechanisms. This paper introduces the geometric multi-color message-passing graph neural network (GMC-MPNN), a novel framework that enhances standard message-passing architectures by explicitly incorporating atomic-level geometric features and long-range interactions. Our model constructs weighted colored subgraphs based on atom types to capture the spatial relationships and chemical context that govern BBB permeability. We evaluated GMC-MPNN on three benchmark datasets for both classification and regression tasks, using rigorous scaffold-based splitting to ensure a robust assessment of generalization. The results demonstrate that GMC-MPNN consistently outperforms existing state-of-the-art models, achieving superior performance in both classifying compounds as permeable/non-permeable (AUC-ROC of 0.9704 and 0.9685) and in regressing continuous permeability values (RMSE of 0.4609, Pearson correlation of 0.7759). An ablation study further quantified the impact of specific atom-pair interactions, revealing that the model’s predictive power derives from its ability to learn from both common and rare, but chemically significant, functional motifs. By integrating spatial geometry into the graph representation, GMC-MPNN sets a new performance benchmark and offers a more accurate and generalizable tool for drug discovery pipelines.
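
The weighted colored-subgraph features can be pictured with a small sketch: for each atom-type pair (e.g., C-N, O-H), a decaying kernel over pairwise 3D distances is aggregated into per-atom geometric descriptors. The kernel form and parameters below are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def colored_subgraph_features(coords, types, type_pairs, tau=4.0):
    """coords: (N, 3) atom positions; types: length-N element symbols;
    type_pairs: list of (a, b) element pairs defining the 'colors'."""
    types = np.asarray(types)
    n = len(coords)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    feats = np.zeros((n, len(type_pairs)))
    for k, (a, b) in enumerate(type_pairs):
        for i in np.flatnonzero(types == a):
            mask = (types == b) & (np.arange(n) != i)   # partners of color b
            feats[i, k] = np.exp(-d[i, mask] / tau).sum()  # long-range decay
    return feats
```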

[373] Flow Matching Policy Gradients

David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, Angjoo Kanazawa

Main category: cs.LG

TL;DR: FPO integrates flow matching into policy gradient reinforcement learning, avoiding exact likelihood computation and outperforming Gaussian policies in multimodal tasks.

Motivation: To leverage flow-based generative models for reinforcement learning without being tied to specific sampling methods.

Method: FPO uses advantage-weighted ratio from flow matching loss, compatible with PPO-clip, and works with any flow or diffusion integration.

Result: FPO trains diffusion-style policies effectively, capturing multimodal action distributions and outperforming Gaussian policies.

Conclusion: FPO is a flexible and effective method for flow-based reinforcement learning, excelling in complex, multimodal tasks.

Abstract: Flow-based generative models, including diffusion models, excel at modeling continuous distributions in high-dimensional spaces. In this work, we introduce Flow Policy Optimization (FPO), a simple on-policy reinforcement learning algorithm that brings flow matching into the policy gradient framework. FPO casts policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching loss, in a manner compatible with the popular PPO-clip framework. It sidesteps the need for exact likelihood computation while preserving the generative capabilities of flow-based models. Unlike prior approaches for diffusion-based reinforcement learning that bind training to a specific sampling method, FPO is agnostic to the choice of diffusion or flow integration at both training and inference time. We show that FPO can train diffusion-style policies from scratch in a variety of continuous control tasks. We find that flow-based models can capture multimodal action distributions and achieve higher performance than Gaussian policies, particularly in under-conditioned settings.
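
A hedged sketch of the objective described above: the PPO likelihood ratio is replaced by a ratio derived from conditional flow-matching (CFM) losses under the current versus behavior policy, then clipped as in PPO. The estimator below, including the assumed `v_net(x_t, t, obs)` signature, is one plausible reading of the summary, not the paper's exact algorithm.

```python
import torch

def cfm_loss(v_net, actions, obs, n_samples=4):
    """Monte-Carlo conditional flow-matching loss per sample."""
    per_sample = []
    for _ in range(n_samples):
        t = torch.rand(actions.shape[0], 1)
        eps = torch.randn_like(actions)
        x_t = (1 - t) * eps + t * actions        # linear probability path
        target_v = actions - eps                 # conditional velocity target
        pred_v = v_net(x_t, t, obs)              # assumed network signature
        per_sample.append(((pred_v - target_v) ** 2).mean(dim=-1))
    return torch.stack(per_sample).mean(dim=0)   # shape: (batch,)

def fpo_clip_objective(v_net, v_net_old, obs, actions, adv, clip_eps=0.2):
    l_new = cfm_loss(v_net, actions, obs)
    with torch.no_grad():
        l_old = cfm_loss(v_net_old, actions, obs)
    ratio = torch.exp(l_old - l_new)             # surrogate likelihood ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()
```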

[374] SourceSplice: Source Selection for Machine Learning Tasks

Ambarish Singh, Romila Pradhan

Main category: cs.LG

TL;DR: The paper introduces SourceGrasp and SourceSplice frameworks to select optimal data source subsets for ML tasks, improving downstream model performance.

Motivation: Existing data discovery methods ignore source quality for ML tasks, necessitating frameworks to select high-utility data sources.

Method: SourceGrasp uses a metaheuristic with greediness and randomization, while SourceSplice mimics gene splicing for source selection.

Result: Empirical evaluation shows SourceSplice efficiently identifies high-utility subsets with fewer explorations.

Conclusion: SourceSplice outperforms in selecting data sources for ML tasks, demonstrating robustness across settings.

Abstract: Data quality plays a pivotal role in the predictive performance of machine learning (ML) tasks - a challenge amplified by the deluge of data sources available in modern organizations. Prior work in data discovery largely focuses on metadata matching, semantic similarity, or identifying tables that should be joined to answer a particular query, but does not consider source quality for high performance of the downstream ML task. This paper addresses the problem of determining the best subset of data sources that must be combined to construct the underlying training dataset for a given ML task. We propose SourceGrasp and SourceSplice, frameworks designed to efficiently select a suitable subset of sources that maximizes the utility of the downstream ML model. Both algorithms rely on the core idea that sources (or their combinations) contribute differently to the task utility and must be judiciously chosen. While SourceGrasp utilizes a metaheuristic based on a greediness criterion and randomization, the SourceSplice framework presents a source selection mechanism inspired by gene splicing - a core concept used in protein synthesis. We empirically evaluate our algorithms on three real-world datasets and on synthetic datasets and show that, with significantly fewer subset explorations, SourceSplice effectively identifies subsets of data sources leading to high task utility. We also conduct sensitivity studies of SourceSplice to its design choices under several settings.
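
To make the source-selection idea concrete, here is a minimal greedy sketch: sources are added while they improve the downstream model's validation utility. `train_and_score` is a placeholder the reader supplies, and the actual SourceGrasp/SourceSplice heuristics (randomized greediness, splicing) are richer than this.

```python
import pandas as pd

def greedy_source_selection(sources, train_and_score):
    """sources: dict name -> DataFrame; train_and_score: DataFrame -> float
    (e.g., validation accuracy of a model trained on the combined data)."""
    selected, best = [], float("-inf")
    remaining = set(sources)
    while remaining:
        name, score = max(
            ((s, train_and_score(pd.concat([sources[k] for k in selected + [s]])))
             for s in remaining),
            key=lambda t: t[1],
        )
        if score <= best:        # no remaining source improves utility; stop
            break
        selected.append(name)
        remaining.remove(name)
        best = score
    return selected, best
```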

[375] Bayesian Optimization of Process Parameters of a Sensor-Based Sorting System using Gaussian Processes as Surrogate Models

Felix Kronenwett, Georg Maier, Thomas Längle

Main category: cs.LG

TL;DR: The paper introduces a Bayesian Optimization-based method for optimizing and adjusting process parameters in sensor-based sorting systems, using Gaussian process regression to minimize experiments and account for uncertainties.

Motivation: Sensor-based sorting systems require continuous parameter adjustments due to changing material streams and requirements, necessitating an efficient optimization approach.

Method: Uses Bayesian Optimization with Gaussian process regression models to optimize process parameters, minimizing experiments while considering uncertainties and dual optimization targets.

Result: Evaluated with three example process parameters, the method effectively optimizes and adjusts sorting system parameters under uncertainty.

Conclusion: The proposed method efficiently optimizes sensor-based sorting systems, reducing experimental needs and handling uncertainties.

Abstract: Sensor-based sorting systems enable the physical separation of a material stream into two fractions. The sorting decision is based on the image data evaluation of the sensors used and is carried out using actuators. Various process parameters must be set depending on the properties of the material stream, the dimensioning of the system, and the required sorting accuracy. However, continuous verification and re-adjustment are necessary due to changing requirements and material stream compositions. In this paper, we introduce an approach for optimizing, recurrently monitoring, and adjusting the process parameters of a sensor-based sorting system. Within a Bayesian Optimization framework, Gaussian process regression models serve as surrogate models to meet specific requirements for system behavior while accounting for the uncertainties involved. This method minimizes the number of necessary experiments while simultaneously considering two possible optimization targets based on the requirements for both material output streams. In addition, uncertainties are taken into account when determining sorting accuracies in the model calculation. We evaluated the method with three example process parameters.
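
The core loop is standard Bayesian optimization, sketched below for a single process parameter with a Gaussian-process surrogate and an expected-improvement acquisition (scikit-learn/SciPy). The paper's dual optimization targets and noise-aware sorting-accuracy estimates are not reproduced here.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(gp, X_cand, y_best):
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - y_best) / sigma                    # maximization convention
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(objective, bounds, n_init=5, n_iter=20, seed=0):
    """objective: parameter value -> measured utility (one experiment)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(bounds[0], bounds[1], size=(n_init, 1))
    y = np.array([objective(float(x[0])) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)                             # refit surrogate to all data
        X_cand = rng.uniform(bounds[0], bounds[1], size=(256, 1))
        x_next = X_cand[np.argmax(expected_improvement(gp, X_cand, y.max()))]
        X = np.vstack([X, x_next])
        y = np.append(y, objective(float(x_next[0])))
    return X[np.argmax(y), 0], y.max()
```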

[376] Teaching the Teacher: Improving Neural Network Distillability for Symbolic Regression via Jacobian Regularization

Soumyadeep Dhar, Kei Sen Fong, Mehul Motani

Main category: cs.LG

TL;DR: A novel training paradigm using Jacobian-based regularization improves symbolic distillation of neural networks, enhancing fidelity by 120% relative to standard methods.

Motivation: Current symbolic distillation of neural networks is brittle due to complex functions, leading to low-fidelity student models.

Method: Introduces a Jacobian-based regularizer to encourage smoother, more distillable functions in the teacher network.

Result: Improves the R² score of distilled symbolic models by 120% (relative) while maintaining teacher accuracy.

Conclusion: The method offers a practical and principled way to enhance interpretable model fidelity from complex networks.

Abstract: Distilling large neural networks into simple, human-readable symbolic formulas is a promising path toward trustworthy and interpretable AI. However, this process is often brittle, as the complex functions learned by standard networks are poor targets for symbolic discovery, resulting in low-fidelity student models. In this work, we propose a novel training paradigm to address this challenge. Instead of passively distilling a pre-trained network, we introduce a \textbf{Jacobian-based regularizer} that actively encourages the ``teacher’’ network to learn functions that are not only accurate but also inherently smoother and more amenable to distillation. We demonstrate through extensive experiments on a suite of real-world regression benchmarks that our method is highly effective. By optimizing the regularization strength for each problem, we improve the $R^2$ score of the final distilled symbolic model by an average of \textbf{120% (relative)} compared to the standard distillation pipeline, all while maintaining the teacher’s predictive accuracy. Our work presents a practical and principled method for significantly improving the fidelity of interpretable models extracted from complex neural networks.
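
The regularizer admits a compact sketch: alongside the usual fit term, penalize the squared norm of the network's input-output Jacobian so the teacher learns a smoother, more distillable function. The PyTorch sketch below assumes scalar-output regression and an illustrative penalty weight; the paper tunes the regularization strength per problem.

```python
import torch

def jacobian_regularized_loss(model, x, y, lam=1e-3):
    """MSE fit term plus a penalty on d(output)/d(input)."""
    x = x.clone().requires_grad_(True)
    pred = model(x)
    mse = torch.mean((pred - y) ** 2)
    # For scalar outputs, grad of the summed output w.r.t. inputs gives the
    # per-sample input gradients (rows of the Jacobian).
    jac = torch.autograd.grad(pred.sum(), x, create_graph=True)[0]
    return mse + lam * jac.pow(2).mean()
```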

[377] NaN-Propagation: A Novel Method for Sparsity Detection in Black-Box Computational Functions

Peter Sharpe

Main category: cs.LG

TL;DR: NaN-propagation detects sparsity in black-box functions by exploiting NaN values, improving gradient computation speed and accuracy.

Motivation: Existing sparsity detection methods for black-box functions suffer from false negatives, leading to errors in gradient calculations.

Method: Uses NaN-propagation to trace input-output dependencies by contaminating inputs with NaN and observing outputs, ensuring conservative sparsity patterns.

Result: Achieved a 1.52x speedup on an aerospace model, uncovering missed dependencies.

Conclusion: The method is language-agnostic, leverages IEEE 754, and offers practical improvements for optimization workflows.

Abstract: When numerically evaluating a function’s gradient, sparsity detection can enable substantial computational speedups through Jacobian coloring and compression. However, sparsity detection techniques for black-box functions are limited, and existing finite-difference-based methods suffer from false negatives due to coincidental zero gradients. These false negatives can silently corrupt gradient calculations, leading to difficult-to-diagnose errors. We introduce NaN-propagation, which exploits the universal contamination property of IEEE 754 Not-a-Number values to trace input-output dependencies through floating-point numerical computations. By systematically contaminating inputs with NaN and observing which outputs become NaN, the method reconstructs conservative sparsity patterns that eliminate a major source of false negatives. We demonstrate this approach on an aerospace wing weight model, achieving a 1.52x speedup while uncovering dozens of dependencies missed by conventional methods – a significant practical improvement since gradient computation is often the bottleneck in optimization workflows. The technique leverages IEEE 754 compliance to work across programming languages and math libraries without requiring modifications to existing black-box codes. Furthermore, advanced strategies such as NaN payload encoding via direct bit manipulation enable faster-than-linear time complexity, yielding speed improvements over existing black-box sparsity detection methods. Practical algorithms are also proposed to mitigate challenges from branching code execution common in engineering applications.
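
The basic mechanism is easy to sketch: contaminate one input at a time with NaN and record which outputs become NaN. The column-at-a-time loop below is the naive one-call-per-input variant; the paper's faster tricks (NaN payload encoding via bit manipulation) are not shown.

```python
import numpy as np

def nan_sparsity(f, x0):
    """f: R^n -> R^m black box built from IEEE-754 float ops; x0: base point.
    Returns the (m, n) boolean Jacobian sparsity pattern."""
    x0 = np.asarray(x0, dtype=float)
    y0 = np.asarray(f(x0))
    pattern = np.zeros((y0.size, x0.size), dtype=bool)
    for j in range(x0.size):
        x = x0.copy()
        x[j] = np.nan                                  # contaminate input j
        pattern[:, j] = np.isnan(np.asarray(f(x)))     # NaN-touched outputs
    return pattern

# Example: f(x) = [x[0]*x[1], sin(x[2])] yields the pattern
# [[True, True, False], [False, False, True]].
```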

[378] Coflex: Enhancing HW-NAS with Sparse Gaussian Processes for Efficient and Scalable DNN Accelerator Design

Yinhui Ma, Tomomasa Yamasaki, Zhehui Wang, Tao Luo, Bo Wang

Main category: cs.LG

TL;DR: Coflex is a novel HW-NAS framework using Sparse Gaussian Process and multi-objective Bayesian optimization to efficiently co-optimize neural network performance and hardware energy efficiency, reducing computational costs while maintaining accuracy.

Motivation: The extensive search space and high computational cost of HW-NAS hinder its practical adoption, necessitating a more efficient solution.

Method: Coflex integrates Sparse Gaussian Process with multi-objective Bayesian optimization, reducing kernel complexity from cubic to near-linear using sparse inducing points.

Result: Coflex outperforms state-of-the-art methods in accuracy and Energy-Delay-Product, achieving computational speed-ups of 1.9x to 9.5x.

Conclusion: Coflex provides a scalable and efficient solution for HW-NAS, balancing performance and computational overhead.

Abstract: Hardware-Aware Neural Architecture Search (HW-NAS) is an efficient approach to automatically co-optimizing neural network performance and hardware energy efficiency, making it particularly useful for the development of Deep Neural Network accelerators on the edge. However, the extensive search space and high computational cost pose significant challenges to its practical adoption. To address these limitations, we propose Coflex, a novel HW-NAS framework that integrates the Sparse Gaussian Process (SGP) with multi-objective Bayesian optimization. By leveraging sparse inducing points, Coflex reduces the GP kernel complexity from cubic to near-linear with respect to the number of training samples, without compromising optimization performance. This enables scalable approximation of large-scale search space, substantially decreasing computational overhead while preserving high predictive accuracy. We evaluate the efficacy of Coflex across various benchmarks, focusing on accelerator-specific architecture. Our experimental results show that Coflex outperforms state-of-the-art methods in terms of network accuracy and Energy-Delay-Product, while achieving a computational speed-up ranging from 1.9x to 9.5x.
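
The inducing-point idea behind the sparse GP surrogate can be sketched with the generic Nyström/subset-of-regressors construction: with m << n inducing points, the dominant solve costs O(n m^2) instead of O(n^3). This is the textbook construction, not the authors' implementation.

```python
import numpy as np

def rbf(A, B, ls=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def nystrom_posterior_mean(X, y, Z, X_test, noise=1e-2):
    """X: (n, d) training data, Z: (m, d) inducing points with m << n."""
    Kmm = rbf(Z, Z) + 1e-8 * np.eye(len(Z))   # jitter for stability
    Knm = rbf(X, Z)
    Ksm = rbf(X_test, Z)
    # Subset-of-regressors predictive mean: solve an (m, m) system only.
    A = noise * Kmm + Knm.T @ Knm
    return Ksm @ np.linalg.solve(A, Knm.T @ y)
```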

cs.MA

[379] Strategic Communication and Language Bias in Multi-Agent LLM Coordination

Alessio Buscemi, Daniele Proverbio, Alessandro Di Stefano, The Anh Han, German Castignani, Pietro Liò

Main category: cs.MA

TL;DR: LLM-based agents’ cooperative behavior is influenced by communication, language, and game structure, with GPT-4o and Llama 4 Maverick showing varied impacts.

Motivation: To investigate how communication affects language-driven cooperative behavior in multi-agent LLM scenarios.

Method: Simulated one-shot and repeated games using the FAIRGAME framework, testing GPT-4o and Llama 4 Maverick with and without communication.

Result: Communication significantly impacts behavior, varying by language, personality, and game structure.

Conclusion: Communication can both enhance coordination and reinforce biases in LLM-based agents.

Abstract: Large Language Model (LLM)-based agents are increasingly deployed in multi-agent scenarios where coordination is crucial but not always assured. Previous studies indicate that the language used to frame strategic scenarios can influence cooperative behavior. This paper explores whether allowing agents to communicate amplifies these language-driven effects. Leveraging the FAIRGAME framework, we simulate one-shot and repeated games across different languages and models, both with and without communication. Our experiments, conducted with two advanced LLMs, GPT-4o and Llama 4 Maverick, reveal that communication significantly influences agent behavior, though its impact varies by language, personality, and game structure. These findings underscore the dual role of communication in fostering coordination and reinforcing biases.

[380] WMAS: A Multi-Agent System Towards Intelligent and Customized Wireless Networks

Jingchen Peng, Dingli Yuan, Boxiang Ren, Jie Fan, Hao Wu, Lu Yang

Main category: cs.MA

TL;DR: Proposes a Wireless Multi-Agent System (WMAS) using reinforcement learning to optimize conversation topologies for efficient task handling in wireless networks.

Motivation: To enable intelligent and customized services in wireless networks while avoiding malfunctions and infinite loops in multi-agent conversations.

Method: Models conversation topology as a directed acyclic graph and uses reinforcement learning to optimize adjacency matrices.

Result: WMAS achieves higher task performance and lower conversation overhead compared to existing systems.

Conclusion: WMAS enhances the intelligence of future wireless networks by optimizing multi-agent collaboration.

Abstract: The fast development of Artificial Intelligence (AI) agents provides a promising way for the realization of intelligent and customized wireless networks. In this paper, we propose a Wireless Multi-Agent System (WMAS), which can provide intelligent and customized services for different user equipment (UEs). Note that orchestrating multiple agents carries the risk of malfunction, and multi-agent conversations may fall into infinite loops. It is thus crucial to design a conversation topology for WMAS that enables agents to complete UE task requests with high accuracy and low conversation overhead. To address this issue, we model the multi-agent conversation topology as a directed acyclic graph and propose a reinforcement learning-based algorithm to optimize the adjacency matrix of this graph. As such, WMAS is capable of generating and self-optimizing multi-agent conversation topologies, enabling agents to effectively and collaboratively handle a variety of task requests from UEs. Simulation results across various task types demonstrate that WMAS can achieve higher task performance and lower conversation overhead compared to existing multi-agent systems. These results validate the potential of WMAS to enhance the intelligence of future wireless networks.
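
The topology representation is simple to sketch: the conversation graph is an adjacency matrix over agents, and a candidate matrix proposed by the RL policy is valid only if the graph is acyclic. The helpers below (NumPy/NetworkX) show the validity check and the induced execution order; the RL optimization itself is not shown.

```python
import numpy as np
import networkx as nx

def valid_topology(adj):
    """adj: (k, k) 0/1 matrix; edge i -> j means agent i hands off to agent j."""
    g = nx.from_numpy_array(np.asarray(adj), create_using=nx.DiGraph)
    return nx.is_directed_acyclic_graph(g)   # rejects infinite-loop topologies

def execution_order(adj):
    g = nx.from_numpy_array(np.asarray(adj), create_using=nx.DiGraph)
    return list(nx.topological_sort(g))      # agents run in dependency order
```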

cs.MM

[381] MMRAG-DocQA: A Multi-Modal Retrieval-Augmented Generation Method for Document Question-Answering with Hierarchical Index and Multi-Granularity Retrieval

Ziyu Gong, Yihua Huang, Chengcheng Mai

Main category: cs.MM

TL;DR: MMRAG-DocQA is a novel multi-modal RAG model for long-context document QA, addressing hallucinations and inter-modal disconnection with hierarchical indexing and multi-granularity retrieval.

Motivation: Existing LVLM-based methods suffer from hallucinations, while RAG-based methods struggle with inter-modal disconnection and cross-page fragmentation.

Method: Proposes MMRAG-DocQA with hierarchical indexing (flattened in-page and topological cross-page chunks) and multi-granularity semantic retrieval (page-level and document-level).

Result: Outperforms existing methods on the MMLongBench-Doc and LongDocURL datasets, excelling in modality-rich, multi-page document QA.

Conclusion: MMRAG-DocQA effectively integrates multi-modal evidence across long-range pages, improving accuracy in document QA.

Abstract: The multi-modal long-context document question-answering task aims to locate and integrate multi-modal evidence (such as texts, tables, charts, images, and layouts) distributed across multiple pages, for question understanding and answer generation. The existing methods can be categorized into Large Vision-Language Model (LVLM)-based and Retrieval-Augmented Generation (RAG)-based methods. However, the former are susceptible to hallucinations, while the latter struggle with inter-modal disconnection and cross-page fragmentation. To address these challenges, a novel multi-modal RAG model, named MMRAG-DocQA, was proposed, leveraging both textual and visual information across long-range pages to facilitate accurate question answering. A hierarchical indexing method with the integration of flattened in-page chunks and topological cross-page chunks was designed to jointly establish in-page multi-modal associations and long-distance cross-page dependencies. By means of joint similarity evaluation and large language model (LLM)-based re-ranking, a multi-granularity semantic retrieval method, including page-level parent-page retrieval and document-level summary retrieval, was proposed to foster multi-modal evidence connection and long-distance evidence integration and reasoning. Experimental results performed on the public datasets MMLongBench-Doc and LongDocURL demonstrated the superiority of our MMRAG-DocQA method in understanding and answering modality-rich and multi-page documents.

[382] Semantic-Aware Adaptive Video Streaming Using Latent Diffusion Models for Wireless Networks

Zijiang Yan, Jianhua Pei, Hongda Wu, Hina Tabassum, Ping Wang

Main category: cs.MM

TL;DR: A novel Semantic Communication framework integrates Latent Diffusion Models with FFmpeg for adaptive-bitrate video streaming, optimizing bandwidth and QoE.

Motivation: Addresses high bandwidth, storage inefficiencies, and QoE degradation in traditional streaming methods.

Method: Uses LDMs to compress I-frames into latent space, retains B/P-frames as metadata, and employs denoising and VFI techniques.

Result: Achieves high-quality streaming with optimized bandwidth, outperforming existing solutions in QoE and efficiency.

Conclusion: Enables scalable real-time video streaming for 5G and beyond.

Abstract: This paper proposes a novel Semantic Communication (SemCom) framework for real-time adaptive-bitrate video streaming by integrating Latent Diffusion Models (LDMs) within the FFmpeg techniques. This solution addresses the challenges of high bandwidth usage, storage inefficiencies, and quality of experience (QoE) degradation associated with traditional Constant Bitrate Streaming (CBS) and Adaptive Bitrate Streaming (ABS). The proposed approach leverages LDMs to compress I-frames into a latent space, offering significant storage and semantic transmission savings without sacrificing high visual quality. While retaining B-frames and P-frames as adjustment metadata to support efficient refinement of video reconstruction at the user side, the proposed framework further incorporates state-of-the-art denoising and Video Frame Interpolation (VFI) techniques. These techniques mitigate semantic ambiguity and restore temporal coherence between frames, even in noisy wireless communication environments. Experimental results demonstrate the proposed method achieves high-quality video streaming with optimized bandwidth usage, outperforming state-of-the-art solutions in terms of QoE and resource efficiency. This work opens new possibilities for scalable real-time video streaming in 5G and future post-5G networks.

eess.AS

[383] Melody-Lyrics Matching with Contrastive Alignment Loss

Changhong Wang, Michel Olvera, Gaël Richard

Main category: eess.AS

TL;DR: The paper introduces melody-lyrics matching (MLM), a task to retrieve lyrics for symbolic melodies using self-supervised learning and contrastive alignment, without needing alignment annotations.

Motivation: To explore the understudied connection between music and lyrics beyond semantics, leveraging existing paired data.

Method: Proposes a self-supervised framework with contrastive alignment loss and introduces ‘sylphone,’ a syllable-level lyric representation.

Result: Demonstrates coherent and singable lyric matching for melodies.

Conclusion: MLM effectively matches melodies with lyrics, with potential for broader applications in music information retrieval.

Abstract: The connection between music and lyrics is far beyond semantic bonds. Conceptual pairs in the two modalities such as rhythm and rhyme, note duration and syllabic stress, and structure correspondence, raise a compelling yet seldom-explored direction in the field of music information retrieval. In this paper, we present melody-lyrics matching (MLM), a new task which retrieves potential lyrics for a given symbolic melody from text sources. Rather than generating lyrics from scratch, MLM essentially exploits the relationships between melody and lyrics. We propose a self-supervised representation learning framework with contrastive alignment loss for melody and lyrics. This has the potential to leverage the abundance of existing songs with paired melody and lyrics. No alignment annotations are required. Additionally, we introduce sylphone, a novel representation for lyrics at syllable-level activated by phoneme identity and vowel stress. We demonstrate, with empirical results and intuitive examples, that our method can match melodies with coherent and singable lyrics. We open-source the code and provide matching examples on the companion webpage: https://github.com/changhongw/mlm.
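
The contrastive alignment loss can be sketched in the symmetric InfoNCE style commonly used for paired modalities, treating matched melody/lyric pairs in a batch as positives. This PyTorch sketch is an assumption about the loss family; the paper's exact formulation and the sylphone feature extraction are not reproduced.

```python
import torch
import torch.nn.functional as F

def alignment_loss(melody_emb, lyric_emb, tau=0.07):
    m = F.normalize(melody_emb, dim=1)      # (B, d) melody embeddings
    l = F.normalize(lyric_emb, dim=1)       # (B, d) paired lyric embeddings
    logits = m @ l.t() / tau                # (B, B) similarity matrix
    targets = torch.arange(len(m), device=m.device)
    # Matched melody/lyric pairs sit on the diagonal; contrast both ways.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```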

[384] Ambisonics Super-Resolution Using A Waveform-Domain Neural Network

Ismael Nawfal, Symeon Delikaris Manias, Mehrez Souden, Juha Merimaa, Joshua Atkins, Elisabeth McMullin, Shadi Pirhosseinloo, Daniel Phillips

Main category: eess.AS

TL;DR: A data-driven method using Conv-TasNet improves spatial audio quality from FOA to HOA, outperforming traditional renderers.

Motivation: Overcome the spatial accuracy limitations of First-order Ambisonics (FOA) while retaining its efficiency.

Method: Utilizes a fully convolutional time-domain audio neural network (Conv-TasNet) to transform FOA input into higher-order Ambisonics (HOA) output.

Result: Predicted HOA matches actual 3rd-order HOA to within a 0.6dB average positional mean squared error, and median perceived quality improves by 80% over the traditional rendering approach.

Conclusion: The data-driven approach successfully enhances spatial audio quality beyond conventional renderers.

Abstract: Ambisonics is a spatial audio format describing a sound field. First-order Ambisonics (FOA) is a popular format comprising only four channels. This limited channel count comes at the expense of spatial accuracy. Ideally, one would retain the efficiency of the FOA format without its limitations. We have devised a data-driven spatial audio solution that retains the efficiency of the FOA format but achieves quality that surpasses conventional renderers. Utilizing a fully convolutional time-domain audio neural network (Conv-TasNet), we created a solution that takes a FOA input and provides a higher-order Ambisonics (HOA) output. This data-driven approach is novel when compared to typical physics- and psychoacoustics-based renderers. Quantitative evaluations showed a 0.6dB average positional mean squared error difference between predicted and actual 3rd-order HOA. The median qualitative rating showed an 80% improvement in perceived quality over the traditional rendering approach.

[385] Beamformed 360° Sound Maps: U-Net-Driven Acoustic Source Segmentation and Localization

Belman Jahir Rodriguez, Sergio F. Chevtchenko, Marcelo Herrera Martinez, Yeshwant Bethy, Saeed Afshar

Main category: eess.AS

TL;DR: A U-net model is introduced for 360° acoustic source localization using spherical semantic segmentation, outperforming traditional methods by segmenting beamformed audio maps and post-processing centroids for robust DoA estimates.

Motivation: The paper aims to improve sound source localization by addressing limitations of discrete DoA angle regression, leveraging semantic segmentation for spatially distributed source identification.

Method: The approach uses a modified U-Net trained on frequency-domain beamformed audio maps, employing Tversky loss for class imbalance, and post-processes segmentation outputs for centroid-based DoA estimates.

Result: The U-net model generalizes across environments, offering improved angular precision and adaptability to different microphone configurations without retraining.

Conclusion: The proposed method provides a new paradigm for dense spatial audio understanding, surpassing traditional SSL approaches.

Abstract: We introduce a U-Net model for 360° acoustic source localization formulated as a spherical semantic segmentation task. Rather than regressing discrete direction-of-arrival (DoA) angles, our model segments beamformed audio maps (azimuth and elevation) into regions of active sound presence. Using delay-and-sum (DAS) beamforming on a custom 24-microphone array, we generate signals aligned with drone GPS telemetry to create binary supervision masks. A modified U-Net, trained on frequency-domain representations of these maps, learns to identify spatially distributed source regions while addressing class imbalance via the Tversky loss. Because the network operates on beamformed energy maps, the approach is inherently array-independent and can adapt to different microphone configurations without retraining from scratch. The segmentation outputs are post-processed by computing centroids over activated regions, enabling robust DoA estimates. Our dataset includes real-world open-field recordings of a DJI Air 3 drone, synchronized with 360° video and flight logs across multiple dates and locations. Experimental results show that the U-Net generalizes across environments, providing improved angular precision and offering a new paradigm for dense spatial audio understanding beyond traditional Sound Source Localization (SSL).
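
The Tversky loss mentioned above weights false negatives and false positives asymmetrically, which helps when active-source pixels are rare. A minimal PyTorch sketch follows; the alpha/beta values are illustrative (alpha > beta favors recall of sparse sources), not the paper's settings.

```python
import torch

def tversky_loss(probs, target, alpha=0.7, beta=0.3, eps=1e-6):
    """probs, target: (B, 1, H, W) in [0, 1]; target is the binary mask."""
    p, t = probs.flatten(1), target.flatten(1)
    tp = (p * t).sum(dim=1)                 # true positives
    fn = ((1 - p) * t).sum(dim=1)           # missed source pixels
    fp = (p * (1 - t)).sum(dim=1)           # false activations
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return (1 - tversky).mean()
```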

[386] Wavelet-Based Time-Frequency Fingerprinting for Feature Extraction of Traditional Irish Music

Noah Shore

Main category: eess.AS

TL;DR: A wavelet-based method for time-frequency fingerprinting in audio identification, tested on Irish tunes, shows high accuracy and efficiency, with applications in EEG and finance.

Motivation: To address challenges in feature extraction from time-series data, particularly for identifying live recordings of traditional Irish tunes.

Method: Uses continuous wavelet transform for spectral feature extraction and wavelet coherence analysis to compare recorded audio with synthetic tunes from ABC notation.

Result: Demonstrates accurate and efficient tune identification, with superior performance over other time-frequency decomposition methods.

Conclusion: The wavelet-based approach is effective for audio identification and has broader applications in EEG and financial time series.

Abstract: This work presents a wavelet-based approach to time-frequency fingerprinting for time series feature extraction, with a focus on audio identification from live recordings of traditional Irish tunes. The challenges of identifying features in time-series data are addressed by employing a continuous wavelet transform to extract spectral features and wavelet coherence analysis is used to compare recorded audio spectrograms to synthetically generated tunes. The synthetic tunes are derived from ABC notation, which is a common symbolic representation for Irish music. Experimental results demonstrate that the wavelet-based method can accurately and efficiently identify recorded tunes. This research study also details the performance of the wavelet coherence model, highlighting its strengths over other methods of time-frequency decomposition. Additionally, we discuss and deploy the model on several applications beyond music, including in EEG signal analysis and financial time series forecasting.
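
The two ingredients named above can be sketched with PyWavelets: a continuous wavelet transform of each signal, and a normalized cross-correlation of the magnitude maps as a crude stand-in for full wavelet coherence (which additionally smooths in time and scale). The wavelet choice and scales are illustrative assumptions.

```python
import numpy as np
import pywt

def cwt_spectrogram(signal, fs, scales=np.arange(1, 128)):
    coef, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1.0 / fs)
    return np.abs(coef), freqs                 # (scales, time) magnitude map

def coherence_score(sig_a, sig_b, fs):
    """Assumes equal-length, time-aligned signals (recording vs. synthetic)."""
    Wa, _ = cwt_spectrogram(sig_a, fs)
    Wb, _ = cwt_spectrogram(sig_b, fs)
    num = (Wa * Wb).sum()
    den = np.sqrt((Wa**2).sum() * (Wb**2).sum()) + 1e-12
    return num / den                            # 1.0 for identical magnitude maps
```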

[387] VR-PTOLEMAIC: A Virtual Environment for the Perceptual Testing of Spatial Audio Algorithms

Paolo Ostan, Francesca Del Gaudio, Federico Miotello, Mirco Pezzoli, Fabio Antonacci

Main category: eess.AS

TL;DR: VR-PTOLEMAIC is a VR system for evaluating spatial audio algorithms using the MUSHRA methodology in a virtual seminar room, showing positive usability results.

Motivation: To improve the perceptual evaluation of spatial audio algorithms by leveraging VR for immersive and interactive testing environments.

Method: Implemented the MUSHRA methodology in VR, allowing users to evaluate simulated acoustic responses from 25 listening positions in a virtual seminar room.

Result: The VR platform effectively supported the assessment of spatial audio algorithms, with positive feedback on user experience and immersivity.

Conclusion: VR-PTOLEMAIC is a viable tool for evaluating spatial audio algorithms, enhancing perceptual assessment through immersive VR environments.

Abstract: The perceptual evaluation of spatial audio algorithms is an important step in the development of immersive audio applications, as it ensures that synthesized sound fields meet quality standards in terms of listening experience, spatial perception and auditory realism. To support these evaluations, virtual reality can offer a powerful platform by providing immersive and interactive testing environments. In this paper, we present VR-PTOLEMAIC, a virtual reality evaluation system designed for assessing spatial audio algorithms. The system implements the MUSHRA (MUlti-Stimulus test with Hidden Reference and Anchor) evaluation methodology into a virtual environment. In particular, users can position themselves in each of the 25 simulated listening positions of a virtually recreated seminar room and evaluate simulated acoustic responses with respect to the actually recorded second-order ambisonic room impulse responses, all convolved with various source signals. We evaluated the usability of the proposed framework through an extensive testing campaign in which assessors were asked to compare the reconstruction capabilities of various sound field reconstruction algorithms. Results show that the VR platform effectively supports the assessment of spatial audio algorithms, with generally positive feedback on user experience and immersivity.

[388] Dynamic Real-Time Ambisonics Order Adaptation for Immersive Networked Music Performances

Paolo Ostan, Carlo Centofanti, Mirco Pezzoli, Alberto Bernardini, Claudia Rinaldi, Fabio Antonacci

Main category: eess.AS

TL;DR: The paper proposes a real-time adaptive higher-order Ambisonics strategy for Networked Music Performance (NMP) to balance spatial fidelity and network reliability by dynamically adjusting the Ambisonics order based on bandwidth.

Motivation: To address the challenge of maintaining immersive spatial audio in NMP while managing network impairments like latency and packet loss.

Method: A real-time adaptive strategy that monitors network throughput and dynamically scales the Ambisonics order to prevent audio dropouts.

Result: A MUSHRA-based evaluation shows the adaptive approach effectively balances immersion and reliability in bandwidth-limited scenarios.

Conclusion: The proposed adaptive Ambisonics strategy is promising for ensuring user experience in NMP under varying network conditions.

Abstract: Advanced remote applications such as Networked Music Performance (NMP) require solutions to guarantee immersive real-world-like interaction among users. Therefore, the adoption of spatial audio formats, such as Ambisonics, is fundamental to let the user experience an immersive acoustic scene. The accuracy of the sound scene reproduction increases with the order of the Ambisonics encoding, resulting in improved immersivity at the cost of a greater number of audio channels, which in turn escalates both bandwidth requirements and susceptibility to network impairments (e.g., latency, jitter, and packet loss). These factors pose a significant challenge for interactive music sessions, which demand high spatial fidelity and low end-to-end delay. We propose a real-time adaptive higher-order Ambisonics strategy that continuously monitors network throughput and dynamically scales the Ambisonics order. When available bandwidth drops below a preset threshold, the order is lowered to prevent audio dropouts; it then reverts to higher orders once conditions recover, thus balancing immersion and reliability. A MUSHRA-based evaluation indicates that this adaptive approach is promising for guaranteeing user experience in bandwidth-limited NMP scenarios.
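
The adaptation rule follows from the channel count: an order-L Ambisonics stream carries (L+1)^2 channels, so the sender can pick the highest order whose aggregate bitrate fits the measured throughput. A minimal sketch follows; the safety margin and per-channel bitrate are assumptions, not the paper's values.

```python
def channels(order):
    """An order-L Ambisonics stream has (L+1)^2 channels."""
    return (order + 1) ** 2

def pick_order(throughput_bps, per_channel_bps, max_order=4, margin=0.8):
    for order in range(max_order, -1, -1):        # try highest order first
        if channels(order) * per_channel_bps <= margin * throughput_bps:
            return order
    return 0                                       # fall back to omni (W) only

# e.g. 3rd order = 16 channels; at 256 kbps per channel it needs ~4.1 Mbps.
```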

[389] OpenACE: An Open Benchmark for Evaluating Audio Coding Performance

Jozef Coldenhoff, Niclas Granqvist, Milos Cernak

Main category: eess.AS

TL;DR: The paper introduces an open-source benchmark for evaluating audio and speech coding quality, addressing the lack of unified testing and reproducibility in existing methods.

Motivation: Current audio and speech coding evaluations are often proprietary, non-reproducible, or limited in dataset diversity, leading to unfair comparisons between machine learning-based and traditional codecs.

Method: The paper proposes a full-band audio and speech coding quality benchmark with diverse content types, including traditional test vectors, and demonstrates its use with open-source and proprietary codecs.

Result: The benchmark is applied to evaluate codecs like Opus, EVS, and LC3+, and includes quality variations for emotional speech encoding at 16 kbps.

Conclusion: The open-source benchmark promotes democratization in audio and speech coding and is publicly available for use.

Abstract: Audio and speech coding lack unified evaluation and open-source testing. Many candidate systems were evaluated on proprietary, non-reproducible, or small data, and machine learning-based codecs are often tested on datasets with distributions similar to their training data, which makes for unfair comparisons with digital signal processing-based codecs that usually work well on unseen data. This paper presents a full-band audio and speech coding quality benchmark with more variable content types, including traditional open test vectors. An example use case of audio coding quality assessment is presented with open-source Opus, 3GPP’s EVS, and recent ETSI’s LC3 with LC3+ used in Bluetooth LE Audio profiles. In addition, quality variations of emotional speech encoding at 16 kbps are shown. The proposed open-source benchmark contributes to audio and speech coding democratization and is available at https://github.com/JozefColdenhoff/OpenACE.

eess.IV

[390] GEPAR3D: Geometry Prior-Assisted Learning for 3D Tooth Segmentation

Tomasz Szczepański, Szymon Płotka, Michal K. Grzeszczyk, Arleta Adamowicz, Piotr Fudalej, Przemysław Korzeniowski, Tomasz Trzciński, Arkadiusz Sitek

Main category: eess.IV

TL;DR: GEPAR3D is a novel method for tooth segmentation in CBCT scans, combining instance detection and multi-class segmentation with a Statistical Shape Model and deep watershed method, achieving superior performance.

Motivation: Accurate tooth segmentation, especially for fine structures like root apices, is critical for assessing root resorption in orthodontics but remains challenging.

Method: GEPAR3D integrates a Statistical Shape Model as a geometric prior and uses a deep watershed method to model teeth as 3D energy basins, ensuring precise segmentation.

Result: GEPAR3D achieved a DSC of 95.0% (+2.8% over the second-best method) and recall of 95.2% (+9.5%), with qualitative improvements in root segmentation.

Conclusion: GEPAR3D offers significant potential for improving root resorption assessment and clinical decision-making in orthodontics.

Abstract: Tooth segmentation in Cone-Beam Computed Tomography (CBCT) remains challenging, especially for fine structures like root apices, which is critical for assessing root resorption in orthodontics. We introduce GEPAR3D, a novel approach that unifies instance detection and multi-class segmentation into a single step tailored to improve root segmentation. Our method integrates a Statistical Shape Model of dentition as a geometric prior, capturing anatomical context and morphological consistency without enforcing restrictive adjacency constraints. We leverage a deep watershed method, modeling each tooth as a continuous 3D energy basin encoding voxel distances to boundaries. This instance-aware representation ensures accurate segmentation of narrow, complex root apices. Trained on publicly available CBCT scans from a single center, our method is evaluated on external test sets from two in-house and two public medical centers. GEPAR3D achieves the highest overall segmentation performance, averaging a Dice Similarity Coefficient (DSC) of 95.0% (+2.8% over the second-best method) and increasing recall to 95.2% (+9.5%) across all test sets. Qualitative analyses demonstrated substantial improvements in root segmentation quality, indicating significant potential for more accurate root resorption assessment and enhanced clinical decision-making in orthodontics. We provide the implementation and dataset at https://github.com/tomek1911/GEPAR3D.
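
The deep-watershed target can be sketched directly: each tooth instance becomes a continuous energy basin whose value encodes the voxel's distance to the instance boundary, normalized per instance (SciPy). The shape-model prior and the network itself are not shown; the normalization below is an illustrative choice.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def watershed_energy(instance_labels):
    """instance_labels: (D, H, W) int array, 0 = background.
    Returns a float map with one normalized energy basin per instance."""
    energy = np.zeros(instance_labels.shape, dtype=np.float32)
    for inst in np.unique(instance_labels):
        if inst == 0:
            continue
        mask = instance_labels == inst
        dist = distance_transform_edt(mask)      # distance to instance boundary
        energy[mask] = dist[mask] / (dist.max() + 1e-8)
    return energy                                 # basins in [0, 1]
```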

[391] On the Utility of Virtual Staining for Downstream Applications as it relates to Task Network Capacity

Sourya Sengupta, Jianquan Xu, Phuong Nguyen, Frank J. Brooks, Yang Liu, Mark A. Anastasio

Main category: eess.IV

TL;DR: The study evaluates virtual staining’s effectiveness for downstream biomedical tasks, finding its utility depends on the task network’s capacity.

Motivation: Assess whether virtual staining improves clinically relevant tasks like segmentation or classification, considering deep neural network capacity.

Method: Empirical evaluations using biological datasets, comparing label-free, virtually stained, and ground truth fluorescence images for task performance.

Result: Virtual staining’s effectiveness varies with task network capacity; it may fail to improve, or may even degrade, performance if the task network is already sufficiently capable.

Conclusion: Task network capacity should be a key factor in deciding whether to use virtual staining.

Abstract: Virtual staining, or in-silico-labeling, has been proposed to computationally generate synthetic fluorescence images from label-free images by use of deep learning-based image-to-image translation networks. In most reported studies, virtually stained images have been assessed only using traditional image quality measures such as structural similarity or signal-to-noise ratio. However, in biomedical imaging, images are typically acquired to facilitate an image-based inference, which we refer to as a downstream biological or clinical task. This study systematically investigates the utility of virtual staining for facilitating clinically relevant downstream tasks (like segmentation or classification) with consideration of the capacity of the deep neural networks employed to perform the tasks. Comprehensive empirical evaluations were conducted using biological datasets, assessing task performance by use of label-free, virtually stained, and ground truth fluorescence images. The results demonstrated that the utility of virtual staining is largely dependent on the ability of the segmentation or classification task network to extract meaningful task-relevant information, which is related to the concept of network capacity. Examples are provided in which virtual staining does not improve, or even degrades, segmentation or classification performance when the capacity of the associated task network is sufficiently large. The results demonstrate that task network capacity should be considered when deciding whether to perform virtual staining.

[392] Weakly Supervised Intracranial Aneurysm Detection and Segmentation in MR angiography via Multi-task UNet with Vesselness Prior

Erin Rainville, Amirhossein Rasoulian, Hassan Rivaz, Yiming Xiao

Main category: eess.IV

TL;DR: A novel weakly supervised 3D multi-task UNet integrates vesselness priors for improved intracranial aneurysm detection and segmentation in TOF-MRA, outperforming state-of-the-art methods.

Motivation: Accurate detection and segmentation of intracranial aneurysms are challenging due to small size, soft contrast, and lack of large annotated datasets.

Method: Proposes a 3D multi-task UNet using Frangi’s vesselness filter for cerebrovascular priors, combining detection and segmentation tasks.

Result: Achieves superior performance (Dice = 0.614, 95% HD = 1.38 mm for segmentation; false positive rate = 1.47, sensitivity = 92.9% for detection).

Conclusion: The method effectively addresses challenges in aneurysm analysis and demonstrates strong generalizability across datasets.

Abstract: Intracranial aneurysms (IAs) are abnormal dilations of cerebral blood vessels that, if ruptured, can lead to life-threatening consequences. However, their small size and soft contrast in radiological scans often make it difficult to perform accurate and efficient detection and morphological analyses, which are critical in the clinical care of the disorder. Furthermore, the lack of large public datasets with voxel-wise expert annotations poses challenges for developing deep learning algorithms to address the issues. Therefore, we proposed a novel weakly supervised 3D multi-task UNet that integrates vesselness priors to jointly perform aneurysm detection and segmentation in time-of-flight MR angiography (TOF-MRA). Specifically, to robustly guide IA detection and segmentation, we employ the popular Frangi’s vesselness filter to derive soft cerebrovascular priors for both network input and an attention block to conduct segmentation from the decoder and detection from an auxiliary branch. We train our model on the Lausanne dataset with coarse ground truth segmentation, and evaluate it on the test set with refined labels from the same database. To further assess our model’s generalizability, we also validate it externally on the ADAM dataset. Our results demonstrate the superior performance of the proposed technique over the SOTA techniques for aneurysm segmentation (Dice = 0.614, 95% HD = 1.38 mm) and detection (false positive rate = 1.47, sensitivity = 92.9%).
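
Deriving the soft cerebrovascular prior is straightforward to sketch with scikit-image's Frangi filter, stacking it as an extra input channel as the summary describes. The sigma range and normalization below are assumptions, and the attention-block usage inside the network is not shown.

```python
import numpy as np
from skimage.filters import frangi

def add_vesselness_channel(tof_mra_volume):
    """tof_mra_volume: (D, H, W) float array; vessels are bright in TOF-MRA,
    hence black_ridges=False."""
    v = frangi(tof_mra_volume, sigmas=(1, 2, 3), black_ridges=False)
    v = (v - v.min()) / (v.max() - v.min() + 1e-8)     # normalize to [0, 1]
    return np.stack([tof_mra_volume, v], axis=0)       # (2, D, H, W) network input
```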

[393] Diffusion-Based User-Guided Data Augmentation for Coronary Stenosis Detection

Sumin Seo, In Kyu Lee, Hyun-Woo Kim, Jaesik Min, Chung-Hwan Jung

Main category: eess.IV

TL;DR: A novel data augmentation method using diffusion-based inpainting generates realistic coronary lesions for improved stenosis assessment, outperforming existing methods even with limited data.

Motivation: Addressing challenges like limited labeled data and class imbalance in deep learning for coronary stenosis assessment.

Method: Proposes a diffusion model-based inpainting approach for generating realistic lesions with user-controlled severity.

Result: Superior performance in lesion detection and severity classification, even with limited data, on both in-house and public datasets.

Conclusion: The method enhances stenosis assessment and data utilization, offering reliable decision support in clinical settings.

Abstract: Coronary stenosis is a major risk factor for ischemic heart events leading to increased mortality, and medical treatments for this condition require meticulous, labor-intensive analysis. Coronary angiography provides critical visual cues for assessing stenosis, supporting clinicians in making informed decisions for diagnosis and treatment. Recent advances in deep learning have shown great potential for automated localization and severity measurement of stenosis. In real-world scenarios, however, the success of these competent approaches is often hindered by challenges such as limited labeled data and class imbalance. In this study, we propose a novel data augmentation approach that uses an inpainting method based on a diffusion model to generate realistic lesions, allowing user-guided control of severity. Extensive evaluation on lesion detection and severity classification across various synthetic dataset sizes shows superior performance of our method on both a large-scale in-house dataset and a public coronary angiography dataset. Furthermore, our approach maintains high detection and classification performance even when trained with limited data, highlighting its clinical importance in improving the assessment of severity of stenosis and optimizing data utilization for more reliable decision support.

[394] FMPlug: Plug-In Foundation Flow-Matching Priors for Inverse Problems

Yuxiang Wan, Ryan Devera, Wenjie Zhang, Ju Sun

Main category: eess.IV

TL;DR: FMPlug is a plug-in framework enhancing flow-matching priors for inverse problems, outperforming state-of-the-art methods in tasks like super-resolution and deblurring.

Motivation: Traditional methods rely on domain-specific or untrained priors, limiting their effectiveness. FMPlug aims to leverage domain-agnostic foundation models more efficiently.

Method: FMPlug uses a time-adaptive warm-up strategy and sharp Gaussianity regularization, capitalizing on object similarity and Gaussianity of generative flows.

Result: FMPlug significantly outperforms existing methods using foundation flow-matching priors in image super-resolution and Gaussian deblurring.

Conclusion: FMPlug demonstrates the potential of domain-agnostic foundation models for solving ill-posed inverse problems effectively.

Abstract: We present FMPlug, a novel plug-in framework that enhances foundation flow-matching (FM) priors for solving ill-posed inverse problems. Unlike traditional approaches that rely on domain-specific or untrained priors, FMPlug smartly leverages two simple but powerful insights: the similarity between observed and desired objects and the Gaussianity of generative flows. By introducing a time-adaptive warm-up strategy and sharp Gaussianity regularization, FMPlug unlocks the true potential of domain-agnostic foundation models. Our method beats state-of-the-art methods that use foundation FM priors by significant margins, on image super-resolution and Gaussian deblurring.
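
A hedged sketch of the plug-in recipe: optimize the latent of a frozen flow-matching generator G so that the forward operator A applied to G(z) matches the measurement y, with a penalty keeping z near a standard Gaussian (the Gaussianity insight). The time-adaptive warm-up and the paper's exact regularizer are not reproduced; G, A, and the penalty form are assumptions.

```python
import torch

def fmplug_style_recon(G, A, y, latent_shape, steps=500, lam=0.1, lr=1e-2):
    """G: frozen generator mapping latents to images; A: forward operator;
    y: observed measurement. Returns the reconstructed image."""
    z = torch.randn(latent_shape, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        data_fit = ((A(G(z)) - y) ** 2).mean()         # measurement consistency
        gauss_reg = (z.pow(2).mean() - 1.0) ** 2        # keep z at N(0, I) scale
        (data_fit + lam * gauss_reg).backward()
        opt.step()
    return G(z).detach()
```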

[395] AI-Driven Collaborative Satellite Object Detection for Space Sustainability

Peng Hu, Wenxuan Zhang

Main category: eess.IV

TL;DR: A novel satellite clustering framework for collaborative deep learning-based space object detection is proposed, showing competitive accuracy and low SWaP footprint.

Motivation: Address the limitations of ground-based tracking systems for space sustainability by enabling onboard, vision-based detection.

Method: Proposes a satellite clustering framework with a distance-aware viewpoint selection strategy and evaluates using DL models on a simulated dataset.

Result: Achieves competitive detection accuracy compared to single-satellite and existing methods while maintaining low SWaP.

Conclusion: Distributed AI-enabled systems can enhance space situational awareness and sustainability.

Abstract: The growing density of satellites in low-Earth orbit (LEO) presents serious challenges to space sustainability, primarily due to the increased risk of in-orbit collisions. Traditional ground-based tracking systems are constrained by latency and coverage limitations, underscoring the need for onboard, vision-based space object detection (SOD) capabilities. In this paper, we propose a novel satellite clustering framework that enables the collaborative execution of deep learning (DL)-based SOD tasks across multiple satellites. To support this approach, we construct a high-fidelity dataset simulating imaging scenarios for clustered satellite formations. A distance-aware viewpoint selection strategy is introduced to optimize detection performance, and recent DL models are used for evaluation. Experimental results show that the clustering-based method achieves competitive detection accuracy compared to single-satellite and existing approaches, while maintaining a low size, weight, and power (SWaP) footprint. These findings underscore the potential of distributed, AI-enabled in-orbit systems to enhance space situational awareness and contribute to long-term space sustainability.
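
A distance-aware viewpoint selection rule can be sketched in a few lines: among the satellites in a cluster, favor the closest views of the target, since apparent object scale drops with range. The scoring below is an illustrative assumption; the paper's strategy may also weight viewing geometry and visibility.

```python
import numpy as np

def select_viewpoints(sat_positions, target_position, k=3):
    """sat_positions: (N, 3) positions in km; returns indices of the k
    closest satellites to task with the detection."""
    d = np.linalg.norm(sat_positions - target_position, axis=1)
    return np.argsort(d)[:k]
```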

[396] Navigating Distribution Shifts in Medical Image Analysis: A Survey

Zixian Su, Jingwei Guo, Xi Yang, Qiufeng Wang, Frans Coenen, Kaizhu Huang

Main category: eess.IV

TL;DR: The paper reviews DL strategies to address distribution shifts in MedIA, categorizing methods by real-world healthcare constraints like data accessibility and privacy.

DetailsMotivation: To enhance DL model adaptability and robustness in MedIA for diverse clinical environments despite distribution shifts.

Method: Systematic review categorizing approaches into Joint Training, Federated Learning, Fine-tuning, and Domain Generalization based on operational constraints.

Result: Provides a nuanced understanding of DL deployment strategies tailored to specific healthcare scenarios.

Conclusion: Highlights future research pathways to improve deployable MedIA technologies by addressing current limitations.

Abstract: Medical Image Analysis (MedIA) has become indispensable in modern healthcare, enhancing clinical diagnostics and personalized treatment. Despite the remarkable advancements supported by deep learning (DL) technologies, their practical deployment faces challenges due to distribution shifts, where models trained on specific datasets underperform across others from varying hospitals, regions, or patient populations. To navigate this issue, researchers have been actively developing strategies to increase the adaptability and robustness of DL models, enabling their effective use in unfamiliar and diverse environments. This paper systematically reviews approaches that apply DL techniques to MedIA systems affected by distribution shifts. Unlike traditional categorizations based on technical specifications, our approach is grounded in the real-world operational constraints faced by healthcare institutions. Specifically, we categorize the existing body of work into Joint Training, Federated Learning, Fine-tuning, and Domain Generalization, with each method tailored to distinct scenarios caused by Data Accessibility, Privacy Concerns, and Collaborative Protocols. This perspective equips researchers with a nuanced understanding of how DL can be strategically deployed to address distribution shifts in MedIA, ensuring diverse and robust medical applications. By delving deeper into these topics, we highlight potential pathways for future research that not only address existing limitations but also push the boundaries of deployable MedIA technologies.
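
As a reading aid only (not from the survey), the four method families can be keyed to the operational constraints named in the abstract. The predicate names below are invented for illustration.

```python
def choose_strategy(can_pool_data: bool, can_collaborate: bool,
                    has_target_data: bool) -> str:
    if can_pool_data:
        return "Joint Training"        # data accessible: pool sites, train once
    if can_collaborate:
        return "Federated Learning"    # privacy bars sharing; protocols allow collaboration
    if has_target_data:
        return "Fine-tuning"           # adapt a pretrained model with local target data
    return "Domain Generalization"     # no target-domain data at training time

print(choose_strategy(can_pool_data=False, can_collaborate=True,
                      has_target_data=False))  # -> Federated Learning
```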

[397] Generating Novel Brain Morphology by Deforming Learned Templates

Alan Q. Wang, Fangrui Huang, Bailey Trang, Wei Peng, Mohammad Abbasi, Kilian Pohl, Mert Sabuncu, Ehsan Adeli

Main category: eess.IV

TL;DR: MorphLDM, a 3D brain MRI generation method using latent diffusion models, synthesizes images by applying deformation fields to a learned template, outperforming existing methods in diversity and accuracy.

DetailsMotivation: To improve 3D brain MRI synthesis by capturing intricate morphological details, which direct image generation methods like GANs or diffusion models may miss.

Method: Uses latent diffusion models (LDMs) with a novel encoder-decoder setup: the encoder outputs a latent embedding from an image and a learned template, passed to a deformation field decoder. A registration loss is minimized between the original image and deformed template.

Result: Outperforms generative baselines in image diversity, condition adherence, and voxel-based morphometry.

Conclusion: MorphLDM offers a promising approach for generating plausible and attribute-specific 3D brain MRIs by leveraging deformation fields and learned templates.

Abstract: Designing generative models for 3D structural brain MRI that synthesize morphologically-plausible and attribute-specific (e.g., age, sex, disease state) samples is an active area of research. Existing approaches based on frameworks like GANs or diffusion models synthesize the image directly, which may limit their ability to capture intricate morphological details. In this work, we propose a 3D brain MRI generation method based on state-of-the-art latent diffusion models (LDMs), called MorphLDM, that generates novel images by applying synthesized deformation fields to a learned template. Instead of using a reconstruction-based autoencoder (as in a typical LDM), our encoder outputs a latent embedding derived from both an image and a learned template that is itself the output of a template decoder; this latent is passed to a deformation field decoder, whose output is applied to the learned template. A registration loss is minimized between the original image and the deformed template with respect to the encoder and both decoders. Empirically, our approach outperforms generative baselines on metrics spanning image diversity, adherence with respect to input conditions, and voxel-based morphometry. Our code is available at https://github.com/alanqrwang/morphldm.
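
A schematic PyTorch sketch of the template-deformation step described above may help. `encoder` and `deform_decoder` stand in for the paper's networks; the shapes, the smoothness penalty, and the weight `lam` are illustrative assumptions (the actual implementation is in the linked repository).

```python
import torch
import torch.nn.functional as F

def deform_template(template, flow):
    """Warp a template volume (N, C, D, H, W) by a displacement field
    flow (N, 3, D, H, W), assumed given in grid_sample's normalized coords."""
    n = template.shape[0]
    theta = torch.eye(3, 4).unsqueeze(0).repeat(n, 1, 1)               # identity affine
    grid = F.affine_grid(theta, list(template.shape), align_corners=False)
    grid = grid + flow.permute(0, 2, 3, 4, 1)                          # add displacements
    return F.grid_sample(template, grid, align_corners=False)

def registration_loss(image, template, encoder, deform_decoder, lam=0.01):
    z = encoder(torch.cat([image, template], dim=1))  # latent from image + template
    flow = deform_decoder(z)                          # dense deformation field
    warped = deform_template(template, flow)
    sim = F.mse_loss(warped, image)                   # image-similarity term
    smooth = sum(flow.diff(dim=d).pow(2).mean() for d in (2, 3, 4))  # field smoothness
    return sim + lam * smooth   # minimized w.r.t. encoder and both decoders
```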

[398] Optimizing Federated Learning Configurations for MRI Prostate Segmentation and Cancer Detection: A Simulation Study

Ashkan Moradi, Fadila Zerka, Joeran S. Bosma, Mohammed R. S. Sunoqrot, Bendik S. Abrahamsen, Derya Yakar, Jeroen Geerdink, Henkjan Huisman, Tone Frost Bathen, Mattijs Elschot

Main category: eess.IV

TL;DR: Optimized federated learning (FL) framework improves MRI prostate segmentation and csPCa detection, outperforming local models.

DetailsMotivation: To enhance performance and generalizability of prostate segmentation and csPCa detection using federated learning across multiple clients.

Method: Used Flower FL with nnU-Net architecture, optimizing local epochs, federated rounds, and aggregation strategies for T2-weighted MRIs and biparametric MRIs.

Result: Optimized FL configurations (FedMedian for segmentation, FedAdagrad for csPCa) improved performance significantly on independent test sets.

Conclusion: FL outperforms local models, with optimized configurations further enhancing lesion detection performance.

Abstract: Purpose: To develop and optimize a federated learning (FL) framework across multiple clients for biparametric MRI prostate segmentation and clinically significant prostate cancer (csPCa) detection. Materials and Methods: A retrospective study was conducted using Flower FL to train a nnU-Net-based architecture for MRI prostate segmentation and csPCa detection, using data collected from January 2010 to August 2021. Model development included training and optimizing local epochs, federated rounds, and aggregation strategies for FL-based prostate segmentation on T2-weighted MRIs (four clients, 1294 patients) and csPCa detection using biparametric MRIs (three clients, 1440 patients). Performance was evaluated on independent test sets using the Dice score for segmentation and the Prostate Imaging: Cancer Artificial Intelligence (PI-CAI) score, defined as the average of the area under the receiver operating characteristic curve and average precision, for csPCa detection. P-values for performance differences were calculated using permutation testing. Results: The FL configurations were independently optimized for both tasks, showing improved performance at 1 epoch 300 rounds using FedMedian for prostate segmentation and 5 epochs 200 rounds using FedAdagrad, for csPCa detection. Compared with the average performance of the clients, the optimized FL model significantly improved performance in prostate segmentation and csPCa detection on the independent test set. The optimized FL model showed higher lesion detection performance compared to the FL-baseline model, but no evidence of a difference was observed for prostate segmentation. Conclusions: FL enhanced the performance and generalizability of MRI prostate segmentation and csPCa detection compared with local models, and optimizing its configuration further improved lesion detection performance.
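
The PI-CAI score is defined in the abstract as the average of the area under the ROC curve and average precision. A minimal case-level computation with scikit-learn follows; the official PI-CAI evaluation additionally handles lesion-level matching, which this toy example omits.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

def pi_cai_score(y_true, y_score):
    """Average of AUROC and AP over csPCa detection confidences."""
    return 0.5 * (roc_auc_score(y_true, y_score)
                  + average_precision_score(y_true, y_score))

y_true = [0, 0, 1, 1, 0, 1]                      # toy csPCa labels
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]   # model confidence per case
print(round(pi_cai_score(y_true, y_score), 3))
```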

[399] CADS: A Comprehensive Anatomical Dataset and Segmentation for Whole-Body Anatomy in Computed Tomography

Murong Xu, Tamaz Amiranashvili, Fernando Navarro, Maksym Fritsak, Ibrahim Ethem Hamamci, Suprosanna Shit, Bastian Wittmann, Sezgin Er, Sebastian M. Christ, Ezequiel de la Rosa, Julian Deseoe, Robert Graf, Hendrik Möller, Anjany Sekuboyina, Jan C. Peeken, Sven Becker, Giulia Baldini, Johannes Haubold, Felix Nensa, René Hosch, Nikhil Mirajkar, Saad Khalid, Stefan Zachow, Marc-André Weber, Georg Langs, Jakob Wasserthal, Mehmet Kemal Ozdemir, Andrey Fedorov, Ron Kikinis, Stephanie Tanadini-Lang, Jan S. Kirschke, Stephanie E. Combs, Bjoern Menze

Main category: eess.IV

TL;DR: The paper introduces CADS, an open-source framework for whole-body CT segmentation, addressing data heterogeneity and anatomical coverage limitations with a large-scale dataset and standardized approach.

DetailsMotivation: Current AI segmentation models are fragmented and lack comprehensive training data, hindering robust clinical deployment.

Method: CADS integrates and standardizes heterogeneous data sources into a dataset of 22,022 CT volumes with 167 annotated structures, and builds a segmentation model on established architectures.

Result: The CADS-model outperforms state-of-the-art approaches in evaluations across 18 public datasets and a real-world hospital cohort, showing clinical utility.

Conclusion: CADS advances AI solutions in radiology by providing accessible, comprehensive anatomical analysis tools for clinicians and researchers.

Abstract: Accurate delineation of anatomical structures in volumetric CT scans is crucial for diagnosis and treatment planning. While AI has advanced automated segmentation, current approaches typically target individual structures, creating a fragmented landscape of incompatible models with varying performance and disparate evaluation protocols. Foundational segmentation models address these limitations by providing a holistic anatomical view through a single model. Yet, robust clinical deployment demands comprehensive training data, which is lacking in existing whole-body approaches, both in terms of data heterogeneity and, more importantly, anatomical coverage. In this work, rather than pursuing incremental optimizations in model architecture, we present CADS, an open-source framework that prioritizes the systematic integration, standardization, and labeling of heterogeneous data sources for whole-body CT segmentation. At its core is a large-scale dataset of 22,022 CT volumes with complete annotations for 167 anatomical structures, representing a significant advancement in both scale and coverage, with 18 times more scans than existing collections and 60% more distinct anatomical targets. Building on this diverse dataset, we develop the CADS-model using established architectures for accessible and automated full-body CT segmentation. Through comprehensive evaluation across 18 public datasets and an independent real-world hospital cohort, we demonstrate advantages over SoTA approaches. Notably, thorough testing of the model’s performance in segmentation tasks from radiation oncology validates its direct utility for clinical interventions. By making our large-scale dataset, our segmentation models, and our clinical software tool publicly available, we aim to advance robust AI solutions in radiology and make comprehensive anatomical analysis accessible to clinicians and researchers alike.

Last updated: 2025-08-22
Built with Hugo, using a modified Stack theme