Daily arXiv Papers - 2025-09-09

AI-enhanced summaries of 24 research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] An Empirical Analysis of Discrete Unit Representations in Speech Language Modeling Pre-training

Yanis Labrak, Richard Dufour, Mickaël Rouvier

Main category: cs.CL

TL;DR: Analysis of discrete unit representations in Speech Language Models, focusing on how model architecture, data representation, and training robustness affect speech modeling during continual pre-training.

Motivation: To optimize speech modeling during continual pre-training by adapting existing pre-trained language models to the speech modality, and to understand how different factors influence this process.

Method: Systematic examination of model architecture, data representation, and training robustness; experiments on speech encoders and clustering granularity across different model scales; analysis of cluster distribution and phonemic alignments.
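
The discretization step in pipelines like this is commonly k-means clustering over self-supervised speech encoder features. The sketch below illustrates that idea with a HuBERT encoder and scikit-learn; the checkpoint, cluster count, and feature layer are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: derive discrete speech units by k-means clustering of
# frame-level encoder features. Checkpoint and k are illustrative.
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import AutoFeatureExtractor, HubertModel

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")

wav, sr = torchaudio.load("example.wav")  # 16 kHz mono assumed
inputs = extractor(wav.squeeze().numpy(), sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    feats = encoder(**inputs).last_hidden_state.squeeze(0)  # (frames, dim)

# Each k-means centroid index becomes one "discrete unit"; the paper
# varies this clustering granularity (the value of k) across model scales.
kmeans = KMeans(n_clusters=100, n_init=10).fit(feats.numpy())
units = kmeans.predict(feats.numpy())
print(units[:20])  # discrete unit sequence for the utterance
```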

Result: Optimal discretization strategies vary with model capacity; linguistic and paralinguistic patterns are uncovered through discrete vocabulary analysis; domain matching between discretization training and target applications is crucial for robustness.

Conclusion: The study provides insights into effective discrete unit representations for SLMs, highlighting the importance of tailored discretization strategies based on model capacity and domain-specific requirements for robust speech modeling.

Abstract: This paper investigates discrete unit representations in Speech Language Models (SLMs), focusing on optimizing speech modeling during continual pre-training. In this paper, we systematically examine how model architecture, data representation, and training robustness influence the pre-training stage in which we adapt existing pre-trained language models to the speech modality. Our experiments highlight the role of speech encoders and clustering granularity across different model scales, showing how optimal discretization strategies vary with model capacity. By examining cluster distribution and phonemic alignments, we investigate the effective use of discrete vocabulary, uncovering both linguistic and paralinguistic patterns. Additionally, we explore the impact of clustering data selection on model robustness, highlighting the importance of domain matching between discretization training and target applications.

[2] Beyond ROUGE: N-Gram Subspace Features for LLM Hallucination Detection

Jerry Li, Evangelos Papalexakis

Main category: cs.CL

TL;DR: Novel N-Gram frequency tensor approach for detecting LLM hallucinations, outperforming traditional metrics and competing with LLM judges.

Motivation: Existing hallucination detection methods build on foundational metrics like ROUGE, BERTScore, and Perplexity, which often lack the semantic depth needed to effectively identify hallucinations in LLM-generated text.

Method: Construct N-Gram frequency tensor from LLM text to capture co-occurrence patterns, apply tensor decomposition to extract singular values, and train MLP binary classifier using these features.
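
A minimal sketch of the described pipeline: build an n-gram co-occurrence tensor, take the singular values of each mode unfolding, and train an MLP on them. The vocabulary handling, tensor order, and feature sizes here are illustrative assumptions.

```python
# Hedged sketch: trigram co-occurrence tensor -> per-mode singular
# values -> MLP binary classifier. Sizes and vocabulary are toy choices.
import numpy as np
from sklearn.neural_network import MLPClassifier

def ngram_tensor(tokens, vocab, n=3):
    idx = {w: i for i, w in enumerate(vocab)}
    T = np.zeros((len(vocab),) * n)
    for gram in zip(*(tokens[i:] for i in range(n))):
        if all(w in idx for w in gram):
            T[tuple(idx[w] for w in gram)] += 1
    return T

def mode_singular_values(T, k=10):
    feats = []
    for mode in range(T.ndim):
        unfolding = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
        s = np.linalg.svd(unfolding, compute_uv=False)
        feats.extend(np.pad(s[:k], (0, max(0, k - len(s)))))
    return np.array(feats)

vocab = ["the", "cat", "sat", "on", "mat", "moon"]
texts = [("the cat sat on the mat", 0),   # 0 = factual (toy label)
         ("the mat sat on the moon", 1)]  # 1 = hallucinated (toy label)
X = np.stack([mode_singular_values(ngram_tensor(t.split(), vocab)) for t, _ in texts])
y = [label for _, label in texts]
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, y)
```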

Result: Significant improvements over traditional baselines and competitive performance against state-of-the-art LLM judges on HaluEval dataset.

Conclusion: The proposed tensor-based approach provides richer semantic structure for better hallucination detection, offering a promising alternative to existing methods.

Abstract: Large Language Models (LLMs) have demonstrated effectiveness across a wide variety of tasks involving natural language, however, a fundamental problem of hallucinations still plagues these models, limiting their trustworthiness in generating consistent, truthful information. Detecting hallucinations has quickly become an important topic, with various methods such as uncertainty estimation, LLM Judges, retrieval augmented generation (RAG), and consistency checks showing promise. Many of these methods build upon foundational metrics, such as ROUGE, BERTScore, or Perplexity, which often lack the semantic depth necessary to detect hallucinations effectively. In this work, we propose a novel approach inspired by ROUGE that constructs an N-Gram frequency tensor from LLM-generated text. This tensor captures richer semantic structure by encoding co-occurrence patterns, enabling better differentiation between factual and hallucinated content. We demonstrate this by applying tensor decomposition methods to extract singular values from each mode and use these as input features to train a multi-layer perceptron (MLP) binary classifier for hallucinations. Our method is evaluated on the HaluEval dataset and demonstrates significant improvements over traditional baselines, as well as competitive performance against state-of-the-art LLM judges.

[3] A Lightweight Framework for Trigger-Guided LoRA-Based Self-Adaptation in LLMs

Jiacheng Wei, Faguo Wu, Xiao Zhang

Main category: cs.CL

TL;DR: SAGE is a dynamic fine-tuning framework that enables large language models to adapt and learn from new data during inference time by decomposing complex reasoning into subtasks and using trigger-guided updates.

Motivation: Large language models cannot continuously adapt and learn from new data during reasoning at inference time, limiting their ability to handle dynamic reasoning tasks.

Method: SAGE uses three components: Trigger module for real-time failure detection, Trigger Buffer for anomaly clustering with HDBSCAN and stability checks, and Lora Store for dynamic parameter optimization with adapter pool.
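
The buffering-and-clustering step can be pictured roughly as below, using the hdbscan package on embedded failure samples; the thresholds, embedding dimension, and stability test are illustrative assumptions rather than SAGE's actual implementation.

```python
# Rough sketch of a trigger-buffer step: cluster buffered failure
# embeddings with HDBSCAN and keep only stable clusters as candidates
# for a LoRA adapter update. All thresholds are illustrative.
import numpy as np
import hdbscan

buffer = np.random.randn(200, 384)  # stand-in for failure-sample embeddings

clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(buffer)

# Treat high-persistence clusters as stable anomaly groups.
stable = [c for c in set(labels) - {-1}
          if clusterer.cluster_persistence_[c] > 0.1]
for c in stable:
    members = buffer[labels == c]
    print(f"cluster {c}: {len(members)} samples -> candidate adapter update")
```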

Result: SAGE demonstrates excellent accuracy, robustness, and stability on atomic reasoning subtasks through dynamic knowledge updating during test time.

Conclusion: The proposed framework successfully enables adaptive updates during reasoning at inference time, addressing the limitation of static LLMs in continuous learning from new data.

Abstract: Large language models are unable to continuously adapt and learn from new data during reasoning at inference time. To address this limitation, we propose that complex reasoning tasks be decomposed into atomic subtasks and introduce SAGE, a trigger-guided dynamic fine-tuning framework that enables adaptive updates during reasoning at inference time. SAGE consists of three key components: (1) a Trigger module that detects reasoning failures through multiple evaluation metrics in real time; (2) a Trigger Buffer module that clusters anomaly samples using a streaming clustering process with HDBSCAN, followed by stability checks and similarity-based merging; and (3) a Lora Store module that dynamically optimizes parameter updates with an adapter pool for knowledge retention. Evaluation results show that SAGE demonstrates excellent accuracy, robustness, and stability on the atomic reasoning subtask through dynamic knowledge updating during test time.

[4] Talk Isn’t Always Cheap: Understanding Failure Modes in Multi-Agent Debate

Andrea Wynn, Harsh Satija, Gillian Hadfield

Main category: cs.CL

TL;DR: Multi-agent debate can decrease accuracy over time, even with stronger models outnumbering weaker ones, as models prioritize agreement over challenging flawed reasoning.

Motivation: To investigate how diversity in model capabilities affects multi-agent debate outcomes, challenging the assumption that debate always improves reasoning.

Method: Conducted experiments with heterogeneous groups of AI agents with varying capabilities, analyzing debate dynamics and answer changes over time.

Result: Debate led to accuracy decrease as models frequently shifted from correct to incorrect answers, favoring agreement rather than challenging flawed reasoning.

Conclusion: Naive debate applications may cause performance degradation when agents aren’t incentivized or equipped to resist persuasive but incorrect reasoning.

Abstract: While multi-agent debate has been proposed as a promising strategy for improving AI reasoning ability, we find that debate can sometimes be harmful rather than helpful. The prior work has exclusively focused on debates within homogeneous groups of agents, whereas we explore how diversity in model capabilities influences the dynamics and outcomes of multi-agent interactions. Through a series of experiments, we demonstrate that debate can lead to a decrease in accuracy over time – even in settings where stronger (i.e., more capable) models outnumber their weaker counterparts. Our analysis reveals that models frequently shift from correct to incorrect answers in response to peer reasoning, favoring agreement over challenging flawed reasoning. These results highlight important failure modes in the exchange of reasons during multi-agent debate, suggesting that naive applications of debate may cause performance degradation when agents are neither incentivized nor adequately equipped to resist persuasive but incorrect reasoning.

[5] No Translation Needed: Forecasting Quality from Fertility and Metadata

Jessica M. Lundin, Ada Zhang, David Adelani, Cody Carroll

Main category: cs.CL

TL;DR: Translation quality can be accurately predicted without running translation systems using token fertility ratios, token counts, and basic linguistic metadata, achieving R²=0.66-0.72 on FLORES-200 benchmark.

Motivation: To forecast translation quality without the computational cost of running actual translation systems, using only simple features to understand what factors shape translation quality.

Method: Used gradient boosting models with features including token fertility ratios, token counts, and linguistic metadata (language family, script, region) to predict ChrF scores for GPT-4o translations across 203 languages.
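
A minimal sketch of this kind of predictor with scikit-learn; every row below is a synthetic placeholder, not FLORES-200 data.

```python
# Hedged sketch: forecast ChrF from fertility, token counts, and
# categorical metadata with gradient boosting. Rows are placeholders.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rows = [
    # fertility_ratio, src_tokens, tgt_tokens, family, script, chrf
    (1.10, 31, 34, "Indo-European", "Latin", 62.0),
    (2.45, 31, 76, "Atlantic-Congo", "Latin", 38.5),
    (1.80, 31, 56, "Afro-Asiatic", "Arabic", 45.2),
]
df = pd.DataFrame(rows, columns=["fertility_ratio", "src_tokens",
                                 "tgt_tokens", "family", "script", "chrf"])
X = pd.get_dummies(df.drop(columns="chrf"))  # one-hot the metadata
model = GradientBoostingRegressor().fit(X, df["chrf"])
print(dict(zip(X.columns, model.feature_importances_)))
```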

Result: Achieved R²=0.66 for translations into English and R²=0.72 for translations from English into other languages, with feature importance showing typological factors dominate predictions into English while fertility is more important for diverse target languages.

Conclusion: Translation quality is shaped by both token-level fertility and broader linguistic typology, providing new insights for multilingual evaluation and quality estimation without running translation systems.

Abstract: We show that translation quality can be predicted with surprising accuracy *without ever running the translation system itself*. Using only a handful of features, namely token fertility ratios, token counts, and basic linguistic metadata (language family, script, and region), we can forecast ChrF scores for GPT-4o translations across 203 languages in the FLORES-200 benchmark. Gradient boosting models achieve favorable performance (R² = 0.66 for XX→English and R² = 0.72 for English→XX). Feature importance analyses reveal that typological factors dominate predictions into English, while fertility plays a larger role for translations into diverse target languages. These findings suggest that translation quality is shaped by both token-level fertility and broader linguistic typology, offering new insights for multilingual evaluation and quality estimation.

[6] Direct-Scoring NLG Evaluators Can Use Pairwise Comparisons Too

Logan Lawrence, Ashton Williamson, Alexander Shelton

Main category: cs.CL

TL;DR: Proposes a direct-scoring method using synthetic summaries for pairwise machine rankings to assign absolute scores to individual summaries, performing comparably to state-of-the-art pairwise evaluators.

Motivation: Current pairwise comparison methods for evaluating LLM-generated text lack the ability to assign absolute scores to individual summaries, which is crucial for thresholding use cases.

Method: Uses synthetic summaries to act as pairwise machine rankings at test time, enabling direct scoring of individual summaries rather than just relative comparisons.

Result: Performs comparably to state-of-the-art pairwise evaluators on SummEval (+0.03), TopicalChat (-0.03), and HANNA (+0.05) benchmarks in terms of axis-averaged sample-level correlations.

Conclusion: The proposed direct-scoring method with synthetic summaries provides effective absolute scoring capability while maintaining performance comparable to existing pairwise evaluation approaches.

Abstract: As large-language models have been increasingly used as automatic raters for evaluating free-form content, including document summarization, dialog, and story generation, work has been dedicated to evaluating such models by measuring their correlations with human judgment. For *sample-level* performance, methods which operate by using pairwise comparisons between machine-generated text perform well but often lack the ability to assign absolute scores to individual summaries, an ability crucial for use cases that require thresholding. In this work, we propose a direct-scoring method which uses synthetic summaries to act as pairwise machine rankings at test time. We show that our method performs comparably to state-of-the-art pairwise evaluators in terms of axis-averaged sample-level correlations on the SummEval (+0.03), TopicalChat (-0.03), and HANNA (+0.05) meta-evaluation benchmarks, and release the synthetic in-context summaries as data to facilitate future work.

[7] From Staff Messages to Actionable Insights: A Multi-Stage LLM Classification Framework for Healthcare Analytics

Hajar Sakai, Yi-En Tseng, Mohammadsadegh Mikaeili, Joshua Bosire, Franziska Jovin

Main category: cs.CL

TL;DR: LLM-based framework for analyzing hospital call center staff messages to identify topics and classify reasons, achieving 78-79% accuracy while maintaining HIPAA compliance.

Motivation: Hospital call centers generate large volumes of text data that can provide valuable insights, but traditional supervised learning requires extensive annotated data and training. LLMs offer a more efficient approach for healthcare analytics.

Method: Multi-stage LLM framework using reasoning, general-purpose, and lightweight models to identify message topics and classify reasons in multi-class fashion, with data security and HIPAA compliance measures.
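
At a high level, a topic-then-reason pipeline can be wired up as below with the OpenAI SDK; the prompts, label sets, and model name are placeholders, and the paper's actual models (o3, gpt-5, and lightweight variants) and HIPAA tooling are not reproduced here.

```python
# Hedged sketch of a two-stage LLM classification pipeline.
# Prompts, label sets, and the model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()
TOPICS = ["scheduling", "billing", "clinical question", "other"]
REASONS = {"scheduling": ["new appointment", "reschedule", "cancellation"]}

def classify(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder for the models the paper evaluates
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

message = "Patient called asking to move Friday's appointment to next week."
topic = classify(f"Pick one topic from {TOPICS} for this message:\n{message}")
reason = classify(
    f"Topic is {topic!r}. Pick one reason from {REASONS.get(topic, ['other'])}:\n{message}"
)
print(topic, "->", reason)
```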

Result: Best-performing model (o3) achieved 78.4% weighted F1-score and 79.2% accuracy, followed by gpt-5 with 75.3% F1-score and 76.2% accuracy. Processed outputs integrated into visualization decision support tool.

Conclusion: The approach enables efficient utilization of staff messaging data, identifies training opportunities, and supports improved patient experience and care quality through actionable insights for healthcare professionals.

Abstract: Hospital call centers serve as the primary contact point for patients within a hospital system. They also generate substantial volumes of staff messages as navigators process patient requests and communicate with the hospital offices following the established protocol restrictions and guidelines. This continuously accumulated large amount of text data can be mined and processed to retrieve insights; however, traditional supervised learning approaches require annotated data, extensive training, and model tuning. Large Language Models (LLMs) offer a paradigm shift toward more computationally efficient methodologies for healthcare analytics. This paper presents a multi-stage LLM-based framework that identifies staff message topics and classifies messages by their reasons in a multi-class fashion. In the process, multiple LLM types, including reasoning, general-purpose, and lightweight models, were evaluated. The best-performing model was o3, achieving 78.4% weighted F1-score and 79.2% accuracy, followed closely by gpt-5 (75.3% weighted F1-score and 76.2% accuracy). The proposed methodology incorporates data security measures and HIPAA compliance requirements essential for healthcare environments. The processed LLM outputs are integrated into a visualization decision support tool that transforms the staff messages into actionable insights accessible to healthcare professionals. This approach enables more efficient utilization of the collected staff messaging data, identifies navigator training opportunities, and supports improved patient experience and care quality.

[8] The Token Tax: Systematic Bias in Multilingual Tokenization

Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Victor Wei, David Adelani, Cody Carroll

Main category: cs.CL

TL;DR: Tokenization inefficiency in morphologically complex languages causes higher computational costs and lower accuracy in LLMs, with fertility (tokens/word) being a reliable predictor of performance degradation.

Motivation: To address the structural disadvantages that tokenization inefficiency imposes on morphologically complex, low-resource languages, which leads to inflated compute resources and reduced accuracy in natural language processing.

Method: Evaluated 10 large language models on AfriMMLU dataset (9,000 multiple-choice questions across 5 subjects and 16 African languages), analyzing fertility (tokens/word) as a predictor of accuracy and comparing reasoning vs non-reasoning models.
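
Fertility is simply subword tokens per whitespace word, so it can be measured for any tokenizer in a few lines; the checkpoint and sample sentences below are illustrative.

```python
# Measure fertility (subword tokens per word) for a given tokenizer.
# Checkpoint and sample sentences are illustrative.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def fertility(text: str) -> float:
    return len(tok.tokenize(text)) / len(text.split())

print(fertility("The committee approved the budget."))
print(fertility("Kamati iliidhinisha bajeti iliyopendekezwa."))  # Swahili; often higher
```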

Result: Higher fertility consistently predicts lower accuracy across all models and subjects. Reasoning models (DeepSeek, o1) outperform non-reasoning peers across both high and low resource languages, narrowing accuracy gaps. Token inflation leads to quadrupled training costs and time when tokens double.

Conclusion: The findings motivate the need for morphologically aware tokenization, fair pricing models, and multilingual benchmarks to achieve equitable natural language processing for all languages.

Abstract: Tokenization inefficiency imposes structural disadvantages on morphologically complex, low-resource languages, inflating compute resources and depressing accuracy. We evaluate 10 large language models (LLMs) on AfriMMLU (9,000 MCQA items; 5 subjects; 16 African languages) and show that fertility (tokens/word) reliably predicts accuracy. Higher fertility consistently predicts lower accuracy across all models and subjects. We further find that reasoning models (DeepSeek, o1) consistently outperform non-reasoning peers across high and low resource languages in the AfriMMLU dataset, narrowing accuracy gaps observed in prior generations. Finally, translating token inflation to economics, a doubling in tokens results in quadrupled training cost and time, underscoring the token tax faced by many languages. These results motivate morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP).

[9] Biomedical Literature Q&A System Using Retrieval-Augmented Generation (RAG)

Mansi Garg, Lee-Chi Wang, Bhavesh Ghanchi, Sanjana Dumpala, Shreyash Kakde, Yen Chih Chen

Main category: cs.CL

TL;DR: A biomedical Q&A system using RAG architecture with MiniLM embeddings and fine-tuned Mistral-7B model shows improved factual accuracy and relevance in medical information retrieval.

Motivation: Address shortcomings of conventional health search engines and bridge the gap between complex biomedical literature and accessible public health knowledge.

Method: Retrieval-Augmented Generation (RAG) architecture with MiniLM-based semantic embeddings, FAISS vector search, and fine-tuned Mistral-7B-v0.3 using QLoRA for efficient training.
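
The retrieval half of such a system is compact. Below is a minimal sketch with sentence-transformers and FAISS; the corpus and query are placeholders, and the fine-tuned Mistral generation step is omitted.

```python
# Minimal retrieval sketch: MiniLM embeddings + FAISS inner-product
# search. Corpus and query are placeholders; generation is omitted.
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
corpus = [
    "Tamoxifen is commonly used in hormone receptor-positive breast cancer.",
    "BRCA1 mutations increase lifetime breast cancer risk.",
]
emb = encoder.encode(corpus, normalize_embeddings=True)

index = faiss.IndexFlatIP(emb.shape[1])  # cosine similarity via normalized IP
index.add(emb)

query = encoder.encode(["What drugs treat ER-positive breast cancer?"],
                       normalize_embeddings=True)
scores, ids = index.search(query, 1)
print(corpus[ids[0][0]], scores[0][0])
```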

Result: Substantial improvements in factual consistency and semantic relevance (measured by BERTScore F1) compared to baseline models, particularly demonstrated in breast cancer literature.

Conclusion: RAG-enhanced language models show strong potential for biomedical Q&A systems, with future directions including multilingual adaptation, privacy-preserving inference, and personalized medical AI.

Abstract: This work presents a Biomedical Literature Question Answering (Q&A) system based on a Retrieval-Augmented Generation (RAG) architecture, designed to improve access to accurate, evidence-based medical information. Addressing the shortcomings of conventional health search engines and the lag in public access to biomedical research, the system integrates diverse sources, including PubMed articles, curated Q&A datasets, and medical encyclopedias, to retrieve relevant information and generate concise, context-aware responses. The retrieval pipeline uses MiniLM-based semantic embeddings and FAISS vector search, while answer generation is performed by a fine-tuned Mistral-7B-v0.3 language model optimized using QLoRA for efficient, low-resource training. The system supports both general medical queries and domain-specific tasks, with a focused evaluation on breast cancer literature demonstrating the value of domain-aligned retrieval. Empirical results, measured using BERTScore (F1), show substantial improvements in factual consistency and semantic relevance compared to baseline models. The findings underscore the potential of RAG-enhanced language models to bridge the gap between complex biomedical literature and accessible public health knowledge, paving the way for future work on multilingual adaptation, privacy-preserving inference, and personalized medical AI systems.

[10] Using Contrastive Learning to Improve Two-Way Reasoning in Large Language Models: The Obfuscation Task as a Case Study

Serge Lionel Nikiema, Jordan Samhi, Micheline Bénédicte Moumoula, Albérick Euraste Djiré, Abdoul Kader Kaboré, Jacques Klein, Tegawendé F. Bissyandé

Main category: cs.CL

TL;DR: The paper proposes bidirectional reasoning as a test for genuine AI understanding, showing that standard fine-tuning causes cognitive specialization (improved forward but worse reverse performance), and introduces Contrastive Fine-Tuning (CFT) to achieve true bidirectional capabilities without explicit reverse training.

Motivation: To determine whether large language models truly understand concepts or merely recognize patterns, and to develop a method that enables genuine comprehension through bidirectional reasoning capabilities.

Method: Proposed bidirectional reasoning as a test for understanding, then developed Contrastive Fine-Tuning (CFT) using three example types: positive (semantic preservation), negative (different semantics), and forward-direction obfuscation examples to train models for deeper understanding.

Result: CFT successfully achieved bidirectional reasoning, enabling strong reverse performance while maintaining forward task capabilities, unlike standard fine-tuning which caused cognitive specialization (improved forward but worse reverse performance).

Conclusion: Bidirectional reasoning serves both as a theoretical framework for assessing genuine understanding and as a practical training approach for developing more capable AI systems that truly comprehend concepts rather than just recognizing patterns.

Abstract: This research addresses a fundamental question in AI: whether large language models truly understand concepts or simply recognize patterns. The authors propose bidirectional reasoning, the ability to apply transformations in both directions without being explicitly trained on the reverse direction, as a test for genuine understanding. They argue that true comprehension should naturally allow reversibility. For example, a model that can change a variable name like userIndex to i should also be able to infer that i represents a user index without reverse training. The researchers tested current language models and discovered what they term cognitive specialization: when models are fine-tuned on forward tasks, their performance on those tasks improves, but their ability to reason bidirectionally becomes significantly worse. To address this issue, they developed Contrastive Fine-Tuning (CFT), which trains models using three types of examples: positive examples that maintain semantic meaning, negative examples with different semantics, and forward-direction obfuscation examples. This approach aims to develop deeper understanding rather than surface-level pattern recognition and allows reverse capabilities to develop naturally without explicit reverse training. Their experiments demonstrated that CFT successfully achieved bidirectional reasoning, enabling strong reverse performance while maintaining forward task capabilities. The authors conclude that bidirectional reasoning serves both as a theoretical framework for assessing genuine understanding and as a practical training approach for developing more capable AI systems.

[11] Ad hoc conventions generalize to new referents

Anya Ji, Claire Augusta Bergey, Ron Eliav, Yoav Artzi, Robert D. Hawkins

Main category: cs.CL

TL;DR: People develop shared naming systems through conceptual alignment rather than arbitrary labeling, with new conventions generalizing to similar undiscussed objects following Shepard’s law.

Motivation: To test competing theories about how people establish shared naming conventions - whether they create arbitrary labels specific to targets or develop broader conceptual alignment that generalizes to new referents.

Method: Dyadic communication study with 302 participants using the KiloGram dataset of 1,000+ abstract tangram images. Pairs coordinated on naming conventions for one image set, then alignment was measured for undiscussed images.

Result: Strong evidence for generalization: partners showed increased alignment for undiscussed images relative to pre-test labels. Generalization decayed nonlinearly with visual similarity (following Shepard’s law) and was robust across image nameability levels.
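
Shepard's universal law of generalization, which this decay pattern is consistent with, holds that generalization falls off exponentially with psychological distance; with d a visual dissimilarity measure between tangrams, the expected form is:

```latex
% Shepard's law: the probability g of extending a convention to a new
% referent decays exponentially with psychological distance d.
g(d) = e^{-kd}, \qquad k > 0
```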

Conclusion: Ad hoc naming conventions reflect genuine conceptual coordination rather than arbitrary labeling, with implications for reference theories and designing more adaptive language agents.

Abstract: How do people talk about things they’ve never talked about before? One view suggests that a new shared naming system establishes an arbitrary link to a specific target, like proper names that cannot extend beyond their bearers. An alternative view proposes that forming a shared way of describing objects involves broader conceptual alignment, reshaping each individual’s semantic space in ways that should generalize to new referents. We test these competing accounts in a dyadic communication study (N=302) leveraging the recently-released KiloGram dataset containing over 1,000 abstract tangram images. After pairs of participants coordinated on referential conventions for one set of images through repeated communication, we measured the extent to which their descriptions aligned for undiscussed images. We found strong evidence for generalization: partners showed increased alignment relative to their pre-test labels. Generalization also decayed nonlinearly with visual similarity (consistent with Shepard’s law) and was robust across levels of the images’ nameability. These findings suggest that ad hoc conventions are not arbitrary labels but reflect genuine conceptual coordination, with implications for theories of reference and the design of more adaptive language agents.

[12] Mitigating Spurious Correlations Between Question and Answer via Chain-of-Thought Correctness Perception Distillation

Hongyan Xie, Yitong Yao, Yikun Ban, Zixuan Huang, Deqing Wang, Zhenhe Wu, Haoxiang Su, Chao Wang, Shuangyong Song, Xuelong Li

Main category: cs.CL

TL;DR: CoPeD improves small language model reasoning by filtering noisy chain-of-thought data and using correctness-aware training to focus on high-quality rationales.

Motivation: Small language models fine-tuned on LLM-generated chain-of-thought data often learn from noisy rationales that don't properly support answers, leading to poor reasoning quality.

Method: Introduces correctness-aware task setting that makes models predict answers based on correct rationales and revise incorrect ones, plus a Correctness-Aware Weighted loss that dynamically adjusts training focus based on rationale quality.
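
One plausible reading of that weighting, sketched in PyTorch below; the softmax weighting function itself is an assumption, since the summary only states that weights depend on the combined rationale and answer losses.

```python
# Hedged PyTorch sketch of a correctness-aware weighted loss: instances
# whose rationale better supports the answer (lower combined loss) get
# more weight. The softmax form is an assumption for illustration.
import torch

def correctness_aware_loss(rationale_loss, answer_loss, temperature=1.0):
    """Both arguments are per-instance loss tensors of shape (B,)."""
    combined = rationale_loss + answer_loss
    # Lower combined loss -> higher weight; detach so the weights
    # themselves receive no gradient.
    weights = torch.softmax(-combined.detach() / temperature, dim=0)
    return (weights * combined).sum()

r = torch.tensor([0.3, 1.8, 0.9], requires_grad=True)
a = torch.tensor([0.2, 1.1, 0.4], requires_grad=True)
correctness_aware_loss(r, a).backward()
```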

Result: Effective performance on both in-distribution and out-of-distribution benchmark reasoning datasets.

Conclusion: CoPeD successfully improves reasoning quality in small language models by addressing noisy chain-of-thought data through correctness-aware training approaches.

Abstract: Large language models (LLMs) excel at reasoning tasks but are expensive to deploy. Thus small language models (SLMs) are fine-tuned on CoT data generated by LLMs to copy LLMs’ abilities. However, these CoT data may include noisy rationales that either fail to substantiate the answers or contribute no additional information to support answer prediction, which leads SLMs to capture spurious correlations between questions and answers and compromise the quality of reasoning. In this work, we propose Chain-of-Thought Correctness Perception Distillation (CoPeD), which aims to improve the reasoning quality of the student model from the perspectives of task setting and data utilization. Firstly, we introduce a correctness-aware task setting that encourages the student model to predict answers based on correct rationales and revise them when they are incorrect. This setting improves the faithfulness of reasoning and allows the model to learn from its mistakes. Then, we propose a Correctness-Aware Weighted loss, which dynamically adjusts the contribution of each training instance based on the combined loss of the rationale and the answer. This strategy encourages the model to focus more on samples where the rationale offers stronger support for the correct answer. Experiments have shown that CoPeD is effective on both in-distribution (IND) and out-of-distribution (OOD) benchmark reasoning datasets.

[13] Enhancing the Robustness of Contextual ASR to Varying Biasing Information Volumes Through Purified Semantic Correlation Joint Modeling

Yue Gu, Zhihao Du, Ying Shi, Shiliang Zhang, Qian Chen, Jiqing Han

Main category: cs.CL

TL;DR: Proposes PSC-Joint approach to improve contextual ASR by identifying and integrating only the most relevant biasing information rather than the entire biasing list, addressing performance degradation with longer biasing lists.

Motivation: Cross-attention-based contextual ASR models struggle with performance when biasing list length increases significantly, as they process all biasing information rather than focusing on the most relevant parts for specific ASR intermediate representations.

Method: PSC-Joint approach defines and calculates three semantic correlations (list-level, phrase-level, token-level) between ASR representations and biasing information, jointly models them to find intersection, and uses purification mechanism with grouped-and-competitive strategy to filter irrelevant phrases.

Result: Achieves average relative F1 score improvements of up to 21.34% on AISHELL-1 and 28.46% on KeSpeech across biasing lists of varying lengths compared to baselines.

Conclusion: The approach effectively addresses performance degradation in contextual ASR with long biasing lists by focusing on the most relevant biasing information through multi-granularity semantic correlation modeling and purification.

Abstract: Recently, cross-attention-based contextual automatic speech recognition (ASR) models have made notable advancements in recognizing personalized biasing phrases. However, the effectiveness of cross-attention is affected by variations in biasing information volume, especially when the length of the biasing list increases significantly. We find that, regardless of the length of the biasing list, only a limited amount of biasing information is most relevant to a specific ASR intermediate representation. Therefore, by identifying and integrating the most relevant biasing information rather than the entire biasing list, we can alleviate the effects of variations in biasing information volume for contextual ASR. To this end, we propose a purified semantic correlation joint modeling (PSC-Joint) approach. In PSC-Joint, we define and calculate three semantic correlations between the ASR intermediate representations and biasing information from coarse to fine: list-level, phrase-level, and token-level. Then, the three correlations are jointly modeled to produce their intersection, so that the most relevant biasing information across various granularities is highlighted and integrated for contextual recognition. In addition, to reduce the computational cost introduced by the joint modeling of three semantic correlations, we also propose a purification mechanism based on a grouped-and-competitive strategy to filter out irrelevant biasing phrases. Compared with baselines, our PSC-Joint approach achieves average relative F1 score improvements of up to 21.34% on AISHELL-1 and 28.46% on KeSpeech, across biasing lists of varying lengths.

[14] Icon²: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation

Qiyuan Chen, Hongsen Huang, Qian Shao, Jiahe Chen, Jintai Chen, Hongxia Xu, Renjie Hua, Ren Chuan, Jian Wu

Main category: cs.CL

TL;DR: Icon² is a novel method that uses LLMs’ inherent representation space to efficiently construct high-quality preference datasets, achieving better alignment with human preferences while reducing computational costs by up to 48.1%.

Motivation: Traditional methods for building preference datasets face distribution mismatches with target models and high computational overhead from sampling multiple stochastic responses.

Method: Extracts layer-wise direction vectors to encode human preferences, filters self-synthesized instructions based on inherent consistency, and applies bidirectional inherent control during decoding to generate response pairs with clear alignment distinctions.

Result: Llama3-8B and Qwen2-7B models show average win rate improvements of 13.89% on AlpacaEval 2.0 and 13.45% on Arena-Hard benchmarks.

Conclusion: Icon² provides an efficient and effective paradigm for preference dataset construction that significantly improves model alignment while substantially reducing computational costs.

Abstract: Large Language Models (LLMs) require high quality preference datasets to align with human preferences. However, conventional methods for constructing such datasets face significant challenges: reliance on pre-collected instructions often leads to distribution mismatches with target models, while the need for sampling multiple stochastic responses introduces substantial computational overhead. In this work, we explore a paradigm shift by leveraging inherent regulation of LLMs’ representation space for efficient and tailored preference dataset construction, named Icon². Specifically, it first extracts layer-wise direction vectors to encode sophisticated human preferences and then uses these vectors to filter self-synthesized instructions based on their inherent consistency. During decoding, bidirectional inherent control is applied to steer token representations, enabling the precise generation of response pairs with clear alignment distinctions. Experimental results demonstrate significant improvements in both alignment and efficiency. Llama3-8B and Qwen2-7B achieve an average win rate improvement of 13.89% on AlpacaEval 2.0 and 13.45% on Arena-Hard, while reducing computational costs by up to 48.1%.

[15] Beyond Keywords: Driving Generative Search Engine Optimization with Content-Centric Agents

Qiyuan Chen, Jiahe Chen, Hongsen Huang, Qian Shao, Jintai Chen, Renjie Hua, Hongxia Xu, Ruijia Wu, Ren Chuan, Jian Wu

Main category: cs.CL

TL;DR: This paper introduces a comprehensive framework for Generative Search Engine Optimization (GSEO) with a content-centric benchmark and multi-agent system to measure and optimize content influence on AI-generated answers.

Motivation: The shift from traditional ranked-based search to Generative Search Engines has made conventional SEO metrics obsolete, creating an urgent need to understand, measure, and optimize content influence on synthesized answers.

Method: The authors construct CC-GSEO-Bench (a large-scale content-centric benchmark) and propose a multi-dimensional evaluation framework. They also design a novel multi-agent system with analyze-revise-evaluate workflow to automate content refinement.

Result: The framework provides systematic quantification of content influence beyond surface-level attribution, assessing substantive semantic impact. Empirical analysis reveals novel insights into content influence dynamics.

Conclusion: The research offers actionable strategies for content creators and establishes a principled foundation for future GSEO research, addressing the critical need for optimization in generative search environments.

Abstract: The paradigm shift from traditional ranked-based search to Generative Search Engines has rendered conventional SEO metrics obsolete, creating an urgent need to understand, measure, and optimize for content influence on synthesized answers. This paper introduces a comprehensive, end-to-end framework for Generative Search Engine Optimization (GSEO) to address this challenge. We make two primary contributions. First, we construct CC-GSEO-Bench, a large-scale, content-centric benchmark, and propose a multi-dimensional evaluation framework that systematically quantifies influence, moving beyond surface-level attribution to assess substantive semantic impact. Second, we design a novel multi-agent system that operationalizes this framework, automating the strategic refinement of content through a collaborative analyze-revise-evaluate workflow. Our empirical analysis using this framework reveals novel insights into the dynamics of content influence, offering actionable strategies for creators and establishing a principled foundation for future GSEO research.

[16] New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR

Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

Main category: cs.CL

TL;DR: A novel unbalanced optimal transport approach for aligning acoustic and linguistic representations in ASR that handles structural asymmetries and distributional mismatches through soft partial matching.

Motivation: Aligning acoustic frames with linguistic tokens is challenging due to inherent structural asymmetries (many-to-one, one-to-many mappings) and the presence of noisy/redundant acoustic frames without linguistic counterparts.

Method: Proposes an unbalanced optimal transport-based alignment model that treats alignment as a detection problem, ensuring every linguistic token is grounded in at least one acoustic observation while allowing flexible probabilistic mappings.
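
The core transport computation can be sketched with the POT library's unbalanced Sinkhorn solver; the embeddings, cost matrix, and regularization values below are illustrative assumptions, and the paper's detection-style constraints are not reproduced.

```python
# Hedged sketch: unbalanced OT between acoustic frames and linguistic
# tokens via POT. Dimensions and hyperparameters are illustrative.
import numpy as np
import ot

T_frames, N_tokens, dim = 50, 12, 256
acoustic = np.random.randn(T_frames, dim)
linguistic = np.random.randn(N_tokens, dim)

# Cost: cosine distance between every frame and every token embedding.
an = acoustic / np.linalg.norm(acoustic, axis=1, keepdims=True)
ln = linguistic / np.linalg.norm(linguistic, axis=1, keepdims=True)
M = 1.0 - an @ ln.T

a = np.full(T_frames, 1.0 / T_frames)  # frame-side marginal
b = np.full(N_tokens, 1.0 / N_tokens)  # token-side marginal
# reg_m relaxes the marginal constraints, so redundant or silent frames
# can go partially unmatched instead of being forced onto tokens.
plan = ot.unbalanced.sinkhorn_unbalanced(a, b, M, reg=0.05, reg_m=1.0)
print(plan.shape)  # (50, 12) soft alignment matrix
```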

Result: Experimental evaluation on CTC-based ASR with pre-trained language models demonstrates effectiveness in flexibly controlling matching degree and improving ASR performance.

Conclusion: The approach successfully addresses the alignment challenge by providing a framework that handles distributional mismatch and structural asymmetries, enabling better knowledge transfer for ASR systems.

Abstract: Aligning acoustic and linguistic representations is a central challenge in bridging pre-trained models for knowledge transfer in automatic speech recognition (ASR). This alignment is inherently structured and asymmetric: while multiple consecutive acoustic frames typically correspond to a single linguistic token (many-to-one), certain acoustic transition regions may relate to multiple adjacent tokens (one-to-many). Moreover, acoustic sequences often include frames with no linguistic counterpart, such as background noise or silence, which may lead to imbalanced matching conditions. In this work, we take a new insight to regard alignment and matching as a detection problem, where the goal is to identify meaningful correspondences with high precision and recall, ensuring full coverage of linguistic tokens while flexibly handling redundant or noisy acoustic frames in transferring linguistic knowledge for ASR. Based on this new insight, we propose an unbalanced optimal transport-based alignment model that explicitly handles distributional mismatch and structural asymmetries with soft and partial matching between acoustic and linguistic modalities. Our method ensures that every linguistic token is grounded in at least one acoustic observation, while allowing for flexible, probabilistic mappings from acoustic to linguistic units. We evaluate our proposed model with experiments on a CTC-based ASR system with a pre-trained language model for knowledge transfer. Experimental results demonstrate the effectiveness of our approach in flexibly controlling the degree of matching and hence improving ASR performance.

[17] From Joy to Fear: A Benchmark of Emotion Estimation in Pop Song Lyrics

Shay Dahary, Avi Edana, Alexander Apartsin, Yehudit Aperstein

Main category: cs.CL

TL;DR: This paper analyzes emotional content in song lyrics using multi-label emotion intensity scoring, comparing zero-shot LLMs with fine-tuned BERT models on a manually annotated dataset.

Motivation: To understand how emotional content in song lyrics influences listener experiences and musical preferences, and to develop reliable methods for emotion-based music information retrieval.

Method: Created a manually labeled dataset using mean opinion score (MOS) approach with multiple human raters, evaluated zero-shot performance of various LLMs, and fine-tuned a BERT model for multi-label emotion score prediction.
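
The fine-tuning setup maps naturally onto a six-output regression head; a minimal sketch with Hugging Face transformers follows, with the checkpoint, emotion ordering, lyric, and targets as placeholders.

```python
# Minimal sketch: 6-way emotion-intensity regression head on BERT.
# Checkpoint, emotion order, lyric, and targets are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=6,               # e.g. joy, fear, anger, sadness, surprise, disgust
    problem_type="regression",  # MSE loss over the six intensity scores
)

batch = tok(["I danced all night under neon skies"], return_tensors="pt")
labels = torch.tensor([[0.9, 0.0, 0.0, 0.1, 0.4, 0.0]])  # MOS-style targets
out = model(**batch, labels=labels)
out.loss.backward()
```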

Result: Experimental results showed both strengths and limitations of zero-shot versus fine-tuned models in capturing nuanced emotional content in lyrics, with LLMs demonstrating potential for emotion recognition in creative texts.

Conclusion: The study provides insights into model selection strategies for emotion-based music applications and makes the labeled dataset publicly available for further research.

Abstract: The emotional content of song lyrics plays a pivotal role in shaping listener experiences and influencing musical preferences. This paper investigates the task of multi-label emotional attribution of song lyrics by predicting six emotional intensity scores corresponding to six fundamental emotions. A manually labeled dataset is constructed using a mean opinion score (MOS) approach, which aggregates annotations from multiple human raters to ensure reliable ground-truth labels. Leveraging this dataset, we conduct a comprehensive evaluation of several publicly available large language models (LLMs) under zero-shot scenarios. Additionally, we fine-tune a BERT-based model specifically for predicting multi-label emotion scores. Experimental results reveal the relative strengths and limitations of zero-shot and fine-tuned models in capturing the nuanced emotional content of lyrics. Our findings highlight the potential of LLMs for emotion recognition in creative texts, providing insights into model selection strategies for emotion-based music information retrieval applications. The labeled dataset is available at https://github.com/LLM-HITCS25S/LyricsEmotionAttribution.

[18] Few-Shot Query Intent Detection via Relation-Aware Prompt Learning

Liang Zhang, Yuan Li, Shijie Zhang, Zheng Zhang, Xitong Li

Main category: cs.CL

TL;DR: SAID is a novel framework that integrates textual and relational structure information for few-shot intent detection, outperforming state-of-the-art methods through query-adaptive attention mechanism.

Motivation: Existing intent detection methods focus primarily on textual data and neglect crucial structural information like query-query and query-answer relations in conversational systems, limiting their effectiveness in few-shot scenarios.

Method: Proposes SAID framework that integrates textual and relational structure information in pretraining, and introduces QueryAdapt mechanism that generates intent-specific relation tokens from learned query-query and query-answer relations for fine-grained knowledge transfer.

Result: Extensive experiments on two real-world datasets demonstrate that SAID significantly outperforms state-of-the-art methods in few-shot intent detection.

Conclusion: The integration of both textual and relational structure information through the SAID framework with QueryAdapt mechanism provides superior performance for few-shot intent detection in conversational systems.

Abstract: Intent detection is a crucial component of modern conversational systems, since accurately identifying user intent at the beginning of a conversation is essential for generating effective responses. Recent efforts have focused on studying this problem under a challenging few-shot scenario. These approaches primarily leverage large-scale unlabeled dialogue text corpora to pretrain language models through various pretext tasks, followed by fine-tuning for intent detection with very limited annotations. Despite the improvements achieved, existing methods have predominantly focused on textual data, neglecting to effectively capture the crucial structural information inherent in conversational systems, such as the query-query relation and query-answer relation. To address this gap, we propose SAID, a novel framework that integrates both textual and relational structure information in a unified manner for model pretraining for the first time. Building on this framework, we further propose a novel mechanism, the query-adaptive attention network (QueryAdapt), which operates at the relation token level by generating intent-specific relation tokens from well-learned query-query and query-answer relations explicitly, enabling more fine-grained knowledge transfer. Extensive experimental results on two real-world datasets demonstrate that SAID significantly outperforms state-of-the-art methods.

[19] Empathy Omni: Enabling Empathetic Speech Response Generation through Large Language Models

Haoyu Wang, Guangyan Zhang, Jiale Chen, Jingyu Li, Yuehai Wang, Yiwen Guo

Main category: cs.CL

TL;DR: Emotion Omni is a speech LLM that understands emotional cues in user speech and generates empathetic responses without requiring massive datasets or large-scale pretraining, achieving superior speech quality and empathy scores.

Motivation: Existing speech LLMs fail to capture emotional cues in user queries, where the same sentence can convey different meanings based on expression. Most empathetic models require massive datasets and high computational costs, creating a need for efficient emotional understanding in human-machine interaction.

Method: Proposed Emotion Omni model that understands emotional content in user speech and generates empathetic responses. Developed a data pipeline to construct a 200k emotional dialogue dataset specifically for empathetic speech assistants.

Result: Achieves comparable instruction-following ability without large-scale pretraining. Surpasses existing models in speech quality (UTMOS: 4.41) and empathy (Emotion GPT Score: 3.97). Demonstrates improvements in both speech fidelity and emotional expressiveness.

Conclusion: Emotion Omni successfully addresses the challenge of building empathetic speech LLMs with limited data and computational resources, providing enhanced emotional understanding and response generation capabilities for human-machine interaction.

Abstract: With the development of speech large language models (speech LLMs), users can now interact directly with assistants via speech. However, most existing models only convert response content into speech without fully capturing the rich emotional cues in user queries, where the same sentence may convey different meanings depending on the expression. Emotional understanding is thus essential for improving human-machine interaction. Most empathetic speech LLMs rely on massive datasets, demanding high computational cost. A key challenge is to build models that generate empathetic responses with limited data and without large-scale training. To this end, we propose Emotion Omni, a model that understands emotional content in user speech and generates empathetic responses. We further developed a data pipeline to construct a 200k emotional dialogue dataset supporting empathetic speech assistants. Experiments show that Emotion Omni achieves comparable instruction-following ability without large-scale pretraining, while surpassing existing models in speech quality (UTMOS: 4.41) and empathy (Emotion GPT Score: 3.97). These results confirm its improvements in both speech fidelity and emotional expressiveness. Demos are available at https://w311411.github.io/omni_demo/.

[20] LM-Searcher: Cross-domain Neural Architecture Search with LLMs via Unified Numerical Encoding

Yuxuan Hu, Jihao Liu, Ke Wang, Jinliang Zhen, Weikang Shi, Manyuan Zhang, Qi Dou, Rui Liu, Aojun Zhou, Hongsheng Li

Main category: cs.CL

TL;DR: LM-Searcher is a novel LLM-based framework for cross-domain neural architecture search that uses universal numerical encoding (NCode) and reformulates NAS as a ranking task, achieving competitive performance across diverse domains without extensive domain-specific tuning.

Motivation: Existing LLM-driven NAS approaches rely heavily on prompt engineering and domain-specific tuning, limiting their practicality and scalability across diverse optimization tasks.

Method: Proposes NCode (universal numerical string representation for neural architectures), reformulates NAS as a ranking task, uses instruction-tuning with pruning-based subspace sampling, and creates a curated dataset of architecture-performance pairs.

Result: Achieves competitive performance in both in-domain (CNNs for image classification) and out-of-domain tasks (LoRA configurations for segmentation and generation), demonstrating robust and transferable learning.

Conclusion: Establishes a new paradigm for flexible and generalizable LLM-based architecture search that works across domains without extensive adaptation, with datasets and models to be publicly released.

Abstract: Recent progress in Large Language Models (LLMs) has opened new avenues for solving complex optimization problems, including Neural Architecture Search (NAS). However, existing LLM-driven NAS approaches rely heavily on prompt engineering and domain-specific tuning, limiting their practicality and scalability across diverse tasks. In this work, we propose LM-Searcher, a novel framework that leverages LLMs for cross-domain neural architecture optimization without the need for extensive domain-specific adaptation. Central to our approach is NCode, a universal numerical string representation for neural architectures, which enables cross-domain architecture encoding and search. We also reformulate the NAS problem as a ranking task, training LLMs to select high-performing architectures from candidate pools using instruction-tuning samples derived from a novel pruning-based subspace sampling strategy. Our curated dataset, encompassing a wide range of architecture-performance pairs, encourages robust and transferable learning. Comprehensive experiments demonstrate that LM-Searcher achieves competitive performance in both in-domain (e.g., CNNs for image classification) and out-of-domain (e.g., LoRA configurations for segmentation and generation) tasks, establishing a new paradigm for flexible and generalizable LLM-based architecture search. The datasets and models will be released at https://github.com/Ashone3/LM-Searcher.

[21] Cross-Question Method Reuse in Large Language Models: From Word-Level Prediction to Rational Logical-Layer Reasoning

Hong Su

Main category: cs.CL

TL;DR: Extends method reuse beyond highly similar questions to handle low-similarity questions and hidden similarities by separating questions from solutions and guiding LLMs to focus on solution transfer rather than question recognition.

Motivation: Existing method reuse approaches require high question similarity, limiting applicability. This paper aims to extend reuse to questions with low similarity or hidden similarities that aren't explicitly observable.

Method: Separate questions and solutions rather than feeding them as pairs to LLMs. Guide LLMs to adapt solutions to new related questions, focusing on solution transfer. Extend to cases with partial feature sharing or hidden characteristics.

Result: Experimental verification shows increased probability of filtering out reusable solutions, improving effectiveness of cross-question method reuse.

Conclusion: The scope-extension approach successfully enables method reuse across questions with low similarity or hidden similarities, overcoming conventional similarity constraints.

Abstract: Large language models (LLMs) have been widely applied to assist in finding solutions for diverse questions. Prior work has proposed representing a method as a pair of a question and its corresponding solution, enabling method reuse. However, existing approaches typically require the questions to be highly similar. In this paper, we extend the scope of method reuse to address questions with low similarity or with hidden similarities that are not explicitly observable. For questions that are similar in a general-specific sense (i.e., broader or narrower in scope), we propose to first separate the question and solution, rather than directly feeding the pair to the LLM. The LLM is then guided to adapt the solution to new but related questions, allowing it to focus on solution transfer rather than question recognition. Furthermore, we extend this approach to cases where questions only share partial features or hidden characteristics. This enables cross-question method reuse beyond conventional similarity constraints. Experimental verification shows that our scope-extension approach increases the probability of filtering out reusable solutions, thereby improving the effectiveness of cross-question method reuse.

[22] Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian

Michael Hoffmann, Jophin John, Stefan Schweter, Gokul Ramakrishnan, Hoi-Fong Mak, Alice Zhang, Dmitry Gaynullin, Nicolay J. Hammer

Main category: cs.CL

TL;DR: Llama-GENBA-10B is a 10B parameter trilingual model addressing English bias, trained on English, German, and Bavarian with balanced data distribution and strong cross-lingual performance.

Motivation: To address English-centric bias in large language models and serve the German NLP community while promoting Bavarian as a low-resource language.

Method: Continuous pretraining on Llama 3.1-8B scaled to 10B parameters using 164B tokens (balanced English/German/Bavarian), with unified tokenizer creation, architecture optimization, and standardized trilingual evaluation suite development.

Result: Achieves strong cross-lingual performance, surpasses Apertus-8B-2509 and gemma-2-9b in Bavarian, outperforms EuroLLM in English, and matches EuroLLM in German.

Conclusion: Provides a blueprint for inclusive foundation models that effectively integrate low-resource languages through efficient multilingual pretraining with documented energy usage.

Abstract: We present Llama-GENBA-10B, a trilingual foundation model addressing English-centric bias in large language models. Built on Llama 3.1-8B and scaled to 10B parameters, Llama-GENBA-10B is continuously pretrained on 164B tokens (82B English, 82B German, and 80M Bavarian), balancing resources while preventing English dominance. Targeted at the German NLP community, the model also promotes Bavarian as a low-resource language. Development tackled four challenges: (1) curating a multilingual corpus despite Bavarian scarcity, (2) creating a unified tokenizer for English, German, and Bavarian, (3) optimizing architecture and language-ratio hyperparameters for cross-lingual transfer, and (4) establishing the first standardized trilingual evaluation suite by translating German benchmarks into Bavarian. Evaluations show that Llama-GENBA-10B achieves strong cross-lingual performance, with the fine-tuned variant surpassing Apertus-8B-2509 and gemma-2-9b in Bavarian and establishing itself as the best model in its class for this language, while also outperforming EuroLLM in English and matching its results in German. Training on the Cerebras CS-2 demonstrated efficient large-scale multilingual pretraining with documented energy use, offering a blueprint for inclusive foundation models that integrate low-resource languages.

[23] Revealing the Numeracy Gap: An Empirical Investigation of Text Embedding Models

Ningyuan Deng, Hanyu Duan, Yixuan Tang, Yi Yang

Main category: cs.CL

TL;DR: Text embedding models struggle to accurately capture numerical nuances in text, despite being widely used in domains where numbers matter like finance and healthcare.

Motivation: Current embedding models are benchmarked on tasks that don't require understanding numerical information, but they're increasingly used in domains where numerical precision is critical (e.g., distinguishing between 2% vs 20% market share growth).

Method: Evaluated 13 widely used text embedding models using synthetic data in a financial context to test their ability to encode numerical content accurately.
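
A minimal probe in the spirit of this setup might compare embeddings of numerically divergent sentences; the sentences and encoder below are stand-ins, not the paper's synthetic financial dataset or its evaluated models:

```python
# Hypothetical numeracy probe: if sim(2% vs 20%) is near 1.0, the encoder
# is largely ignoring numerical magnitude.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder

anchor  = "Company X's market share grew by 2%."
numeric = "Company X's market share grew by 20%."   # numerically very different
lexical = "Company X's market share shrank by 2%."  # lexically close, opposite meaning

emb = model.encode([anchor, numeric, lexical], convert_to_tensor=True)
print("sim(2% vs 20%):", util.cos_sim(emb[0], emb[1]).item())
print("sim(grew vs shrank):", util.cos_sim(emb[0], emb[2]).item())
```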

Result: Most embedding models struggle to capture numerical details accurately, showing limitations in handling nuanced numerical information.

Conclusion: The findings highlight the need for future research to improve embedding models’ capacity for handling numerical content, which is crucial for strengthening NLP systems in number-sensitive domains.

Abstract: Text embedding models are widely used in natural language processing applications. However, their capability is often benchmarked on tasks that do not require understanding nuanced numerical information in text. As a result, it remains unclear whether current embedding models can precisely encode numerical content, such as numbers, into embeddings. This question is critical because embedding models are increasingly applied in domains where numbers matter, such as finance and healthcare. For example, “Company X’s market share grew by 2%” should be interpreted very differently from “Company X’s market share grew by 20%,” even though both indicate growth in market share. This study aims to examine whether text embedding models can capture such nuances. Using synthetic data in a financial context, we evaluate 13 widely used text embedding models and find that they generally struggle to capture numerical details accurately. Our further analyses provide deeper insights into embedding numeracy, informing future research to strengthen embedding model-based NLP systems with improved capacity for handling numerical content.

[24] A Survey of the State-of-the-Art in Conversational Question Answering Systems

Manoj Madushanka Perera, Adnan Mahmood, Kasun Eranda Wijethilake, Fahmida Islam, Maryam Tahermazandarani, Quan Z. Sheng

Main category: cs.CL

TL;DR: This survey paper provides a comprehensive analysis of Conversational Question Answering (ConvQA) systems, covering core components, advanced ML techniques, large language models, datasets, and future research directions.

DetailsMotivation: ConvQA systems are becoming increasingly important across various domains like customer support, education, legal, and healthcare where maintaining coherent and context-aware conversations is essential. The paper aims to provide a comprehensive overview of the current state-of-the-art in this rapidly advancing field.

Method: The survey examines core ConvQA components (history selection, question understanding, answer prediction), investigates advanced ML techniques (reinforcement learning, contrastive learning, transfer learning), explores large language models (RoBERTa, GPT-4, Gemini 2.0 Flash, Mistral 7B, LLaMA 3), and analyzes key ConvQA datasets.

Result: The paper provides a comprehensive overview of the ConvQA landscape, highlighting how different components and techniques work together to ensure conversation coherence and relevance. It showcases the impact of large language models through data scalability and architectural advancements.

Conclusion: This work offers valuable insights to guide future advancements in Conversational Question Answering and outlines open research directions for the field, serving as a comprehensive reference for researchers and practitioners.

Abstract: Conversational Question Answering (ConvQA) systems have emerged as a pivotal area within Natural Language Processing (NLP) by driving advancements that enable machines to engage in dynamic and context-aware conversations. These capabilities are increasingly being applied across various domains, e.g., customer support, education, legal, and healthcare, where maintaining a coherent and relevant conversation is essential. Building on recent advancements, this survey provides a comprehensive analysis of the state-of-the-art in ConvQA. This survey begins by examining the core components of ConvQA systems, i.e., history selection, question understanding, and answer prediction, highlighting their interplay in ensuring coherence and relevance in multi-turn conversations. It further investigates the use of advanced machine learning techniques, including, but not limited to, reinforcement learning, contrastive learning, and transfer learning to improve ConvQA accuracy and efficiency. The pivotal role of large language models, e.g., RoBERTa, GPT-4, Gemini 2.0 Flash, Mistral 7B, and LLaMA 3, is also explored, thereby showcasing their impact through data scalability and architectural advancements. Additionally, this survey presents a comprehensive analysis of key ConvQA datasets and concludes by outlining open research directions. Overall, this work offers a comprehensive overview of the ConvQA landscape and provides valuable insights to guide future advancements in the field.

[25] Exploring Subjective Tasks in Farsi: A Survey Analysis and Evaluation of Language Models

Donya Rooein, Flor Miriam Plaza-del-Arco, Debora Nozza, Dirk Hovy

Main category: cs.CL

TL;DR: Farsi is considered a middle-resource language but faces significant challenges in subjective NLP tasks due to poor data availability, quality issues, and lack of demographic information, leading to unstable model performance.

DetailsMotivation: To examine the actual state of Farsi NLP resources despite its classification as a middle-resource language, particularly focusing on subjective tasks like sentiment analysis, emotion analysis, and toxicity detection.

Method: Reviewed 110 publications on subjective tasks in Farsi, analyzed data availability and quality issues, and evaluated prediction models using available datasets.

Result: Found significant challenges in data availability and quality, lack of publicly available datasets, missing essential demographic factors, and highly unstable results across datasets and models.

Conclusion: The volume of data alone is insufficient to improve a language’s NLP prospects; data quality and proper demographic information are crucial for subjective language modeling.

Abstract: Given Farsi’s speaker base of over 127 million people and the growing availability of digital text, including more than 1.3 million articles on Wikipedia, it is considered a middle-resource language. However, this label quickly crumbles when the situation is examined more closely. We focus on three subjective tasks (Sentiment Analysis, Emotion Analysis, and Toxicity Detection) and find significant challenges in data availability and quality, despite the overall increase in data availability. We review 110 publications on subjective tasks in Farsi and observe a lack of publicly available datasets. Furthermore, existing datasets often lack essential demographic factors, such as age and gender, that are crucial for accurately modeling subjectivity in language. When evaluating prediction models using the few available datasets, the results are highly unstable across both datasets and models. Our findings indicate that the volume of data is insufficient to significantly improve a language’s prospects in NLP.

[26] QCSE: A Pretrained Quantum Context-Sensitive Word Embedding for Natural Language Processing

Charles M. Varmantchaonala, Niclas Götting, Nils-Erik Schütte, Jean Louis E. K. Fendji, Christopher Gies

Main category: cs.CL

TL;DR: QCSE is a pretrained quantum context-sensitive embedding model that uses quantum computation to capture contextual word relationships through five innovative context matrix computation methods, demonstrating effectiveness on both Fulani and English corpora.

DetailsMotivation: To leverage quantum computation for natural language processing by developing context-sensitive word embeddings that can handle linguistic complexity and address data scarcity issues in low-resource languages.

Method: Proposes the QCSE model with quantum-native context learning, using five distinct context matrix computation methods (incorporating techniques such as exponential decay, sinusoidal modulation, phase shifts, and hash-based transformations) to create unique word representations based on surrounding context.
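
One plausible reading of the exponential-decay variant, sketched in plain NumPy; the paper's exact construction and its quantum-circuit encoding are not reproduced here:

```python
import numpy as np

def context_matrix_exp_decay(token_vecs: np.ndarray, center: int, lam: float = 0.5) -> np.ndarray:
    """Weight the neighbors of token `center` by exp(-lam * distance) and sum
    their outer products into a context matrix. One of several plausible
    variants; an illustration only, not QCSE's actual method."""
    d = token_vecs.shape[1]
    ctx = np.zeros((d, d))
    for i, v in enumerate(token_vecs):
        if i == center:
            continue
        w = np.exp(-lam * abs(i - center))  # decay with token distance
        ctx += w * np.outer(v, v)
    return ctx

# toy usage: 5 tokens with 4-dimensional embeddings
vecs = np.random.default_rng(0).normal(size=(5, 4))
print(context_matrix_exp_decay(vecs, center=2).shape)  # (4, 4)
```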

Result: QCSE successfully captures context sensitivity and leverages quantum system expressibility for rich context-aware language information, demonstrating effectiveness on both Fulani (low-resource language) and English corpora.

Conclusion: Quantum computation shows strong potential for NLP, particularly in addressing data scarcity for low-resource languages, opening new avenues for applying QNLP to real-world linguistic challenges.

Abstract: Quantum Natural Language Processing (QNLP) offers a novel approach to encoding and understanding the complexity of natural languages through the power of quantum computation. This paper presents a pretrained quantum context-sensitive embedding model, called QCSE, that captures context-sensitive word embeddings, leveraging the unique properties of quantum systems to learn contextual relationships in languages. The model introduces quantum-native context learning, enabling the utilization of quantum computers for linguistic tasks. Central to the proposed approach are innovative context matrix computation methods, designed to create unique representations of words based on their surrounding linguistic context. Five distinct methods are proposed and tested for computing the context matrices, incorporating techniques such as exponential decay, sinusoidal modulation, phase shifts, and hash-based transformations. These methods ensure that the quantum embeddings retain context sensitivity, thereby making them suitable for downstream language tasks where the expressibility and properties of quantum systems are valuable resources. To evaluate the effectiveness of the model and the associated context matrix methods, evaluations are conducted on both a small corpus of Fulani, a low-resource African language, and a slightly larger English corpus. The results demonstrate that QCSE not only captures context sensitivity but also leverages the expressibility of quantum systems for representing rich, context-aware language information. The use of Fulani further highlights the potential of QNLP to mitigate the problem of lack of data for this category of languages. This work underscores the power of quantum computation in natural language processing (NLP) and opens new avenues for applying QNLP to real-world linguistic challenges across various tasks and domains.

[27] Enhancing Factual Accuracy and Citation Generation in LLMs via Multi-Stage Self-Verification

Fernando Gabriela García, Qiyang Shi, Zilin Feng

Main category: cs.CL

TL;DR: VeriFact-CoT is a novel method that addresses LLM hallucination and citation issues through fact verification-reflection-citation integration, improving accuracy and trustworthiness.

DetailsMotivation: To solve the pervasive problems of hallucination and lack of credible citation sources in LLMs when generating fact-sensitive content for applications like scientific research, news reporting, and legal consultation.

Method: Multi-stage mechanism of ‘fact verification-reflection-citation integration’ that enables LLMs to critically self-examine and revise intermediate reasoning steps and final answers.
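
A minimal sketch of such a verify-reflect-cite loop, assuming hypothetical `llm` and `retrieve` callables and illustrative prompts (not the paper's actual templates):

```python
def verifact_cot(question: str, llm, retrieve) -> str:
    """Three-stage sketch: draft step-by-step reasoning, check claims against
    retrieved sources, then revise with citations. `llm` takes a prompt and
    returns text; `retrieve` maps text to supporting sources."""
    draft = llm(f"Answer step by step:\n{question}")
    evidence = retrieve(draft)  # fetch sources for the draft's claims
    review = llm(f"Check each claim against these sources and list errors:\n"
                 f"Sources:\n{evidence}\n\nDraft:\n{draft}")
    return llm(f"Revise the draft, fixing the listed errors and adding "
               f"citations to the sources:\n{review}\n\nDraft:\n{draft}")
```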

Result: Significantly enhances objective accuracy, trustworthiness, and traceability of generated outputs, making LLMs more reliable for high-fidelity applications.

Conclusion: VeriFact-CoT provides an effective solution to improve the reliability of LLMs in fact-sensitive domains by incorporating verification and citation mechanisms into the reasoning process.

Abstract: This research introduces VeriFact-CoT (Verified Factual Chain-of-Thought), a novel method designed to address the pervasive issues of hallucination and the absence of credible citation sources in Large Language Models (LLMs) when generating complex, fact-sensitive content. By incorporating a multi-stage mechanism of ‘fact verification-reflection-citation integration,’ VeriFact-CoT empowers LLMs to critically self-examine and revise their intermediate reasoning steps and final answers. This process significantly enhances the objective accuracy, trustworthiness, and traceability of the generated outputs, making LLMs more reliable for applications demanding high fidelity such as scientific research, news reporting, and legal consultation.

[28] LatinX: Aligning a Multilingual TTS Model with Direct Preference Optimization

Luis Felipe Chary, Miguel Arjona Ramirez

Main category: cs.CL

TL;DR: LatinX is a multilingual TTS model for speech-to-speech translation that preserves speaker identity across languages using a 3-stage training approach with DPO optimization.

DetailsMotivation: To develop a text-to-speech model that can maintain the source speaker's voice identity when translating speech across different languages, particularly focusing on English and Romance languages with emphasis on Portuguese.

Method: 12-layer decoder-only Transformer trained in three stages: pre-training for text-to-audio mapping, supervised fine-tuning for zero-shot voice cloning, and alignment with Direct Preference Optimization using WER and speaker-similarity metrics.
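
The WER-plus-similarity preference labeling could look roughly like the following sketch; `wer` and `speaker_sim` are hypothetical stand-ins for the ASR-based and speaker-embedding metrics, and the weighting is an assumption:

```python
def label_preference(cand_a, cand_b, ref_text, ref_speaker_emb,
                     wer, speaker_sim, alpha=0.5):
    """Return a (chosen, rejected) DPO pair: lower WER and higher speaker
    similarity are preferred, combined with weight alpha. Candidates are
    assumed to carry `.transcript` and `.speaker_emb` attributes."""
    def score(c):
        return -alpha * wer(c.transcript, ref_text) \
               + (1 - alpha) * speaker_sim(c.speaker_emb, ref_speaker_emb)
    return (cand_a, cand_b) if score(cand_a) >= score(cand_b) else (cand_b, cand_a)
```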

Result: LatinX with DPO consistently reduces Word Error Rate and improves objective similarity over baseline. Human evaluations show stronger perceived speaker similarity than XTTSv2 baseline, revealing gaps between objective and subjective measures.

Conclusion: The model successfully preserves speaker identity across languages, with DPO optimization proving effective. Future work includes cross-lingual analyses, balanced preference signals, and lower-latency architectures.

Abstract: We present LatinX, a multilingual text-to-speech (TTS) model for cascaded speech-to-speech translation that preserves the source speaker’s identity across languages. LatinX is a 12-layer decoder-only Transformer trained in three stages: (i) pre-training for text-to-audio mapping, (ii) supervised fine-tuning for zero-shot voice cloning, and (iii) alignment with Direct Preference Optimization (DPO) using automatically labeled pairs based on Word Error Rate (WER) and speaker-similarity metrics. Trained on English and Romance languages with emphasis on Portuguese, LatinX with DPO consistently reduces WER and improves objective similarity over the fine-tuned baseline. Human evaluations further indicate stronger perceived speaker similarity than a strong baseline (XTTSv2), revealing gaps between objective and subjective measures. We provide cross-lingual analyses and discuss balanced preference signals and lower-latency architectures as future work.

[29] ZhiFangDanTai: Fine-tuning Graph-based Retrieval-Augmented Generation Model for Traditional Chinese Medicine Formula

ZiXuan Zhang, Bowen Hao, Yingjie Li, Hongzhi Yin

Main category: cs.CL

TL;DR: ZhiFangDanTai framework combines GraphRAG with LLM fine-tuning to generate comprehensive TCM formulas with detailed explanations, reducing errors and hallucinations.

DetailsMotivation: Existing TCM models lack comprehensive results and detailed explanations, and current datasets are insufficient for deep model outputs in Traditional Chinese Medicine formula generation.

Method: Proposes ZhiFangDanTai framework using Graph-based Retrieval-Augmented Generation (GraphRAG) to retrieve structured TCM knowledge and fine-tune LLMs with enhanced instruction datasets.

Result: Experimental results show significant improvements over state-of-the-art models on both collected and clinical datasets, with reduced generalization error and hallucination rates.

Conclusion: The framework effectively addresses limitations in TCM formula generation by integrating structured knowledge retrieval with LLM fine-tuning, providing comprehensive and explainable formula outputs.

Abstract: Traditional Chinese Medicine (TCM) formulas play a significant role in treating epidemics and complex diseases. Existing models for TCM utilize traditional algorithms or deep learning techniques to analyze formula relationships, yet lack comprehensive results, such as complete formula compositions and detailed explanations. Although recent efforts have used TCM instruction datasets to fine-tune Large Language Models (LLMs) for explainable formula generation, existing datasets lack sufficient details, such as the roles of the formula’s sovereign, minister, assistant, and courier; efficacy; contraindications; and tongue and pulse diagnosis, limiting the depth of model outputs. To address these challenges, we propose ZhiFangDanTai, a framework combining Graph-based Retrieval-Augmented Generation (GraphRAG) with LLM fine-tuning. ZhiFangDanTai uses GraphRAG to retrieve and synthesize structured TCM knowledge into concise summaries, while also constructing an enhanced instruction dataset to improve LLMs’ ability to integrate retrieved information. Furthermore, we provide novel theoretical proofs demonstrating that integrating GraphRAG with fine-tuning techniques can reduce generalization error and hallucination rates in the TCM formula task. Experimental results on both collected and clinical datasets demonstrate that ZhiFangDanTai achieves significant improvements over state-of-the-art models. Our model is open-sourced at https://huggingface.co/tczzx6/ZhiFangDanTai1.0.

[30] MedFactEval and MedAgentBrief: A Framework and Workflow for Generating and Evaluating Factual Clinical Summaries

François Grolleau, Emily Alsentzer, Timothy Keyes, Philip Chung, Akshay Swaminathan, Asad Aali, Jason Hom, Tridu Huynh, Thomas Lew, April S. Liang, Weihan Chu, Natasha Z. Steele, Christina F. Lin, Jingkun Yang, Kameron C. Black, Stephen P. Ma, Fateme N. Haredasht, Nigam H. Shah, Kevin Schulman, Jonathan H. Chen

Main category: cs.CL

TL;DR: MedFactEval framework uses multi-LLM jury voting to evaluate factual accuracy in clinical text generation, achieving near-perfect agreement with physician panels, while MedAgentBrief provides a model-agnostic workflow for generating high-quality discharge summaries.

DetailsMotivation: Expert review of LLM-generated clinical text is unscalable for continuous quality assurance needed in healthcare applications, creating a critical barrier to adoption of generative AI in clinical workflows.

Method: Two complementary approaches: 1) MedFactEval framework with clinician-defined key facts and multi-LLM majority vote assessment, 2) MedAgentBrief multi-step workflow for generating factual discharge summaries. Validation used seven-physician majority vote as gold standard.
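
A minimal sketch of the LLM-jury majority vote over clinician-defined key facts; the judge interface and prompt wording are assumptions, not the paper's implementation:

```python
from collections import Counter

def llm_jury(summary: str, key_fact: str, judges) -> bool:
    """Majority vote of several LLM judges on whether `key_fact` is covered
    by `summary`. Each judge is a callable taking a prompt and returning
    "yes" or "no"; the prompt below is illustrative."""
    prompt = (f"Summary:\n{summary}\n\nIs the following key fact included "
              f"in the summary? Answer yes or no.\nFact: {key_fact}")
    votes = Counter(j(prompt).strip().lower() for j in judges)
    return votes["yes"] > votes["no"]
```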

Result: MedFactEval LLM Jury achieved almost perfect agreement with physician panel (Cohen’s kappa=81%), statistically non-inferior to single human expert performance (kappa=67%, P < 0.001).

Conclusion: Provides both robust evaluation framework (MedFactEval) and high-performing generation workflow (MedAgentBrief) for responsible deployment of generative AI in clinical settings.

Abstract: Evaluating factual accuracy in Large Language Model (LLM)-generated clinical text is a critical barrier to adoption, as expert review is unscalable for the continuous quality assurance these systems require. We address this challenge with two complementary contributions. First, we introduce MedFactEval, a framework for scalable, fact-grounded evaluation where clinicians define high-salience key facts and an “LLM Jury”–a multi-LLM majority vote–assesses their inclusion in generated summaries. Second, we present MedAgentBrief, a model-agnostic, multi-step workflow designed to generate high-quality, factual discharge summaries. To validate our evaluation framework, we established a gold-standard reference using a seven-physician majority vote on clinician-defined key facts from inpatient cases. The MedFactEval LLM Jury achieved almost perfect agreement with this panel (Cohen’s kappa=81%), a performance statistically non-inferior to that of a single human expert (kappa=67%, P < 0.001). Our work provides both a robust evaluation framework (MedFactEval) and a high-performing generation workflow (MedAgentBrief), offering a comprehensive approach to advance the responsible deployment of generative AI in clinical workflows.

[31] Let’s Roleplay: Examining LLM Alignment in Collaborative Dialogues

Abhijnan Nath, Carine Graff, Nikhil Krishnaswamy

Main category: cs.CL

TL;DR: This paper examines how different alignment methods affect LLM agents’ effectiveness in multiturn, multiparty collaborations, focusing on friction agents that encourage reflection in group dialogues.

DetailsMotivation: As LLMs become AI collaborators, their behavior in multiturn interactions must be predictable and reliable. Current alignment techniques are developed for single-user settings and don't account for long-horizon multiparty dynamics.

Method: The study uses a roleplay methodology to evaluate interventions from differently-trained friction agents in collaborative task conversations, proposing a novel counterfactual evaluation framework to quantify how friction interventions change group collaboration trajectories.

Result: Results show that a friction-aware approach significantly outperforms common alignment baselines in helping convergence to common ground and correctness of task outcomes.

Conclusion: Friction-aware alignment methods are more effective for LLM agents in multiturn, multiparty collaborative settings compared to traditional single-user alignment approaches.

Abstract: As Large Language Models (LLMs) integrate into diverse workflows, they are increasingly being considered “collaborators” with humans. If such AI collaborators are to be reliable, their behavior over multiturn interactions must be predictable, validated and verified before deployment. Common alignment techniques are typically developed under simplified single-user settings and do not account for the dynamics of long-horizon multiparty interactions. This paper examines how different alignment methods affect LLM agents’ effectiveness as partners in multiturn, multiparty collaborations. We study this question through the lens of friction agents that intervene in group dialogues to encourage the collaborative group to slow down and reflect upon their reasoning for deliberative decision-making. Using a roleplay methodology, we evaluate interventions from differently-trained friction agents in collaborative task conversations. We propose a novel counterfactual evaluation framework that quantifies how friction interventions change the trajectory of group collaboration and belief alignment. Our results show that a friction-aware approach significantly outperforms common alignment baselines in helping both convergence to a common ground, or agreed-upon task-relevant propositions, and correctness of task outcomes.

[32] Accelerating Large Language Model Inference via Early-Exiting Algorithms

Sangmin Bae

Main category: cs.CL

TL;DR: This dissertation resolves the conflict between adaptive computation efficiency and system-level bottlenecks in LLMs through co-designed algorithms and architectures, achieving optimal balance between dynamism and efficiency.

DetailsMotivation: Large language models face high computational costs that hinder deployment, and while adaptive computation methods like early-exiting promise cost reduction, they create system-level bottlenecks that paradoxically reduce throughput.

Method: Proposes efficient parallel decoding to address overhead in conventional early-exiting, uses deep parameter sharing to mitigate synchronization issues, and develops a unified framework with lightweight routers pretrained to dynamically assign optimal recursion depth per token.
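
A toy version of a per-token depth router, showing only the interface; the dissertation's router design, pretraining objective, and model integration are not reproduced here:

```python
import torch
import torch.nn as nn

class DepthRouter(nn.Module):
    """Lightweight router mapping each token's hidden state to a recursion
    depth in {1..max_depth}. Illustrative sketch only."""
    def __init__(self, d_model: int, max_depth: int):
        super().__init__()
        self.proj = nn.Linear(d_model, max_depth)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) -> depths: (batch, seq), values in 1..max_depth
        return self.proj(h).argmax(dim=-1) + 1

router = DepthRouter(d_model=64, max_depth=4)
depths = router(torch.randn(2, 8, 64))
print(depths.shape, depths.min().item(), depths.max().item())
```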

Result: Establishes a new Pareto frontier between efficiency and performance by effectively optimizing for both adaptive computation and parameter efficiency within a single model.

Conclusion: The co-design of adaptive algorithms and model architectures successfully resolves the fundamental conflict between computational dynamism and system efficiency in large language model deployment.

Abstract: Large language models have achieved remarkable capabilities, but their practical deployment is hindered by significant computational costs. While adaptive computation methods like early-exiting promise to reduce these costs, they introduce a fundamental conflict: the per-token dynamism intended to save computation often creates system-level bottlenecks that can paradoxically reduce throughput in batched inference. This dissertation resolves this conflict by co-designing adaptive algorithms and model architectures to strike an optimal balance between dynamism and efficiency. To this end, our work first addresses critical sources of overhead in conventional early-exiting by proposing an efficient parallel decoding mechanism. We then show that deep parameter sharing provides an architectural foundation that not only yields compact, parameter-efficient models but also inherently mitigates the critical synchronization issues affecting dynamic inference. Finally, this work presents a unified framework where lightweight routers are pretrained to dynamically assign an optimal recursion depth for each token. This approach establishes a new Pareto frontier between efficiency and performance by effectively optimizing for both adaptive computation and parameter efficiency within a single model.

[33] KatotohananQA: Evaluating Truthfulness of Large Language Models in Filipino

Lorenzo Alfred Nery, Ronald Dawson Catignas, Thomas James Tiam-Lee

Main category: cs.CL

TL;DR: KatotohananQA is a Filipino translation of TruthfulQA benchmark that reveals significant performance gap between English and Filipino truthfulness in LLMs, with newer OpenAI models showing better multilingual robustness.

DetailsMotivation: LLMs have truthfulness issues (hallucinations) but existing benchmarks like TruthfulQA are primarily English-only, creating a gap in evaluating low-resource languages like Filipino.

Method: Created KatotohananQA (Filipino translation of TruthfulQA), evaluated 7 free-tier proprietary models using binary-choice framework to measure truthfulness.

Result: Significant performance gap between English and Filipino truthfulness; newer OpenAI models (GPT-5/5 mini) showed strong multilingual robustness; disparities found across question types/categories/topics.

Conclusion: Need broader multilingual evaluation to ensure fairness and reliability in LLM usage, as some question characteristics are less robust to multilingual transfer.

Abstract: Large Language Models (LLMs) achieve remarkable performance across various tasks, but their tendency to produce hallucinations limits reliable adoption. Benchmarks such as TruthfulQA have been developed to measure truthfulness, yet they are primarily available in English, leaving a gap in evaluating LLMs in low-resource languages. To address this, we present KatotohananQA, a Filipino translation of the TruthfulQA benchmark. Seven free-tier proprietary models were assessed using a binary-choice framework. Findings show a significant performance gap between English and Filipino truthfulness, with newer OpenAI models (GPT-5 and GPT-5 mini) demonstrating strong multilingual robustness. Results also reveal disparities across question characteristics, suggesting that some question types, categories, and topics are less robust to multilingual transfer, which highlights the need for broader multilingual evaluation to ensure fairness and reliability in LLM usage.

[34] Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis

Zhenqi Jia, Rui Liu, Berrak Sisman, Haizhou Li

Main category: cs.CL

TL;DR: MFCIG-CSS is a novel conversational speech synthesis system that uses multimodal fine-grained context interaction graphs to model word-level semantic and prosodic interactions from dialogue history, achieving superior prosodic expressiveness.

DetailsMotivation: Existing CSS methods only model utterance-level interactions and overlook fine-grained semantic and prosodic knowledge at the word level in multimodal dialogue history, limiting prosody prediction accuracy.

Method: Constructs two specialized multimodal fine-grained dialogue interaction graphs (semantic and prosody interaction graphs) to encode word-level interactions between semantics, prosody, and their influence on subsequent utterances.

Result: Experiments on DailyTalk dataset show MFCIG-CSS outperforms all baseline models in prosodic expressiveness.

Conclusion: The proposed fine-grained word-level interaction modeling approach effectively captures conversational prosody patterns and enhances synthesized speech quality.

Abstract: Conversational Speech Synthesis (CSS) aims to generate speech with natural prosody by understanding the multimodal dialogue history (MDH). The latest work predicts the accurate prosody expression of the target utterance by modeling the utterance-level interaction characteristics of MDH and the target utterance. However, MDH contains fine-grained semantic and prosody knowledge at the word level. Existing methods overlook the fine-grained semantic and prosodic interaction modeling. To address this gap, we propose MFCIG-CSS, a novel Multimodal Fine-grained Context Interaction Graph-based CSS system. Our approach constructs two specialized multimodal fine-grained dialogue interaction graphs: a semantic interaction graph and a prosody interaction graph. These two interaction graphs effectively encode interactions between word-level semantics, prosody, and their influence on subsequent utterances in MDH. The encoded interaction features are then leveraged to enhance synthesized speech with natural conversational prosody. Experiments on the DailyTalk dataset demonstrate that MFCIG-CSS outperforms all baseline models in terms of prosodic expressiveness. Code and speech samples are available at https://github.com/AI-S2-Lab/MFCIG-CSS.

[35] Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge

Hao Liang, Ruitao Wu, Bohan Zeng, Junbo Niu, Wentao Zhang, Bin Dong

Main category: cs.CL

TL;DR: A caption-assisted reasoning framework that bridges visual and textual modalities, achieving state-of-the-art performance in multimodal reasoning challenges.

DetailsMotivation: Multimodal reasoning remains challenging for AI systems, with even advanced models like GPT-o3 struggling in multimodal scenarios where visual and textual information must be integrated.

Method: Introduces a caption-assisted reasoning framework that effectively bridges visual and textual modalities to enhance multimodal reasoning capabilities.

Result: Achieved 1st place in ICML 2025 AI for Math Workshop & Challenge 2: SeePhys, and demonstrated strong generalization on the MathVerse benchmark for geometric reasoning.

Conclusion: The proposed framework is effective, robust, and versatile for multimodal reasoning tasks, particularly in mathematical and geometric problem-solving domains.

Abstract: Multimodal reasoning remains a fundamental challenge in artificial intelligence. Despite substantial advances in text-based reasoning, even state-of-the-art models such as GPT-o3 struggle to maintain strong performance in multimodal scenarios. To address this gap, we introduce a caption-assisted reasoning framework that effectively bridges visual and textual modalities. Our approach achieved 1st place in the ICML 2025 AI for Math Workshop & Challenge 2: SeePhys, highlighting its effectiveness and robustness. Furthermore, we validate its generalization on the MathVerse benchmark for geometric reasoning, demonstrating the versatility of our method. Our code is publicly available at https://github.com/OpenDCAI/SciReasoner.

[36] Orthogonal Low-rank Adaptation in Lie Groups for Continual Learning of Large Language Models

Kefan Cao, Shuaicheng Wu

Main category: cs.CL

TL;DR: OLieRA introduces Lie group theory to LLM fine-tuning, using multiplicative updates to preserve parameter geometry while enforcing orthogonality constraints, achieving state-of-the-art results in continual learning benchmarks.

DetailsMotivation: Existing parameter regularization methods like O-LoRA and N-LoRA enforce orthogonality but overlook how conventional additive fine-tuning disrupts the intrinsic geometric structure of LLM parameters, limiting performance in sequential multi-task settings.

Method: Proposes Orthogonal Low-rank Adaptation in Lie Groups (OLieRA) that leverages Lie group theory with multiplicative updates to preserve parameter geometry while applying orthogonality constraints to task subspaces.
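
The core Lie-group idea, multiplicative updates that preserve parameter geometry, can be sketched with a matrix exponential of a skew-symmetric generator; this is a generic illustration, not OLieRA's exact parameterization or constraints:

```python
import torch

def multiplicative_orthogonal_update(W: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    """Update W by left-multiplying with exp(S), where S = A - A^T is
    skew-symmetric, so exp(S) is orthogonal. Unlike additive W + delta
    updates, this preserves W's singular-value geometry."""
    S = A - A.T                      # skew-symmetric generator (Lie algebra so(n))
    R = torch.linalg.matrix_exp(S)   # orthogonal group element
    return R @ W

W = torch.randn(8, 8)
A = 0.01 * torch.randn(8, 8)
W_new = multiplicative_orthogonal_update(W, A)
# Singular values are preserved by the orthogonal factor:
print(torch.allclose(torch.linalg.svdvals(W), torch.linalg.svdvals(W_new), atol=1e-5))
```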

Result: OLieRA achieves state-of-the-art results on the Standard CL benchmark and remains among the top-performing methods in the Large Number of Tasks setting.

Conclusion: Preserving the geometric structure of LLM parameters through Lie group-based multiplicative updates, in addition to enforcing orthogonality, significantly improves performance in continual learning scenarios.

Abstract: Large language models (LLMs) are prone to catastrophic forgetting in sequential multi-task settings. Parameter regularization methods such as O-LoRA and N-LoRA alleviate task interference by enforcing low-rank subspace orthogonality, but they overlook the fact that conventional additive fine-tuning disrupts the intrinsic geometric structure of LLM parameters, limiting performance. Our key insight is that the parameter space of LLMs possesses a geometric structure, which must be preserved in addition to enforcing orthogonality. Based on this, we propose Orthogonal Low-rank Adaptation in Lie Groups (OLieRA), which introduces Lie group theory into LLM fine-tuning: leveraging multiplicative updates to preserve parameter geometry while applying orthogonality constraints to task subspaces. Experiments demonstrate that OLieRA achieves state-of-the-art results on the Standard CL benchmark and remains among the top-performing methods in the Large Number of Tasks setting.

[37] Benchmarking Gender and Political Bias in Large Language Models

Jinrui Yang, Xudong Han, Timothy Baldwin

Main category: cs.CL

TL;DR: EuroParlVote is a new benchmark for evaluating LLM bias in political contexts using European Parliament data, revealing gender and political biases in vote prediction and classification tasks.

DetailsMotivation: To create a comprehensive benchmark for assessing large language model biases in politically sensitive scenarios, particularly around gender and political affiliation in European parliamentary contexts.

Method: Developed EuroParlVote dataset linking debate speeches to roll-call votes with demographic metadata, then evaluated state-of-the-art LLMs on gender classification and vote prediction tasks.

Result: LLMs consistently misclassified female MEPs as male, showed reduced accuracy for female speakers’ vote predictions, and favored centrist political groups while underperforming on far-left and far-right groups. GPT-4o outperformed open-weight models in robustness and fairness.

Conclusion: The study reveals systematic biases in LLMs across gender and political dimensions, highlighting the need for improved fairness in political NLP applications. The released dataset supports future research on accountability in political contexts.

Abstract: We introduce EuroParlVote, a novel benchmark for evaluating large language models (LLMs) in politically sensitive contexts. It links European Parliament debate speeches to roll-call vote outcomes and includes rich demographic metadata for each Member of the European Parliament (MEP), such as gender, age, country, and political group. Using EuroParlVote, we evaluate state-of-the-art LLMs on two tasks – gender classification and vote prediction – revealing consistent patterns of bias. We find that LLMs frequently misclassify female MEPs as male and demonstrate reduced accuracy when simulating votes for female speakers. Politically, LLMs tend to favor centrist groups while underperforming on both far-left and far-right ones. Proprietary models like GPT-4o outperform open-weight alternatives in terms of both robustness and fairness. We release the EuroParlVote dataset, code, and demo to support future research on fairness and accountability in NLP within political contexts.

[38] Understanding the Influence of Synthetic Data for Text Embedders

Jacob Mitchell Springer, Vaibhav Adlakha, Siva Reddy, Aditi Raghunathan, Marius Mosbach

Main category: cs.CL

TL;DR: Synthetic data from LLMs improves text embedding models but benefits are sparse and localized, with trade-offs between different tasks, challenging the notion that synthetic data leads to more robust general-purpose embedders.

DetailsMotivation: To address the lack of publicly available synthetic datasets for studying text embedding generalization and to critically examine where synthetic data actually improves model performance.

Method: Reproduced and released the synthetic data from Mistral-E5, then analyzed where synthetic data improves generalization and identified performance trade-offs between different task categories.

Result: Synthetic data leads to consistent performance improvements but benefits are highly localized to individual datasets, with trade-offs where data that helps one task degrades performance on another.

Conclusion: Current synthetic data approaches have limitations for building general-purpose embedders and do not necessarily lead to more robust embedding models across diverse tasks.

Abstract: Recent progress in developing general purpose text embedders has been driven by training on ever-growing corpora of synthetic LLM-generated data. Nonetheless, no publicly available synthetic dataset exists, posing a barrier to studying its role for generalization. To address this issue, we first reproduce and publicly release the synthetic data proposed by Wang et al. (Mistral-E5). Our synthetic data is high quality and leads to consistent improvements in performance. Next, we critically examine where exactly synthetic data improves model generalization. Our analysis reveals that benefits from synthetic data are sparse and highly localized to individual datasets. Moreover, we observe trade-offs between performance on different categories: data that benefits one task can degrade performance on another. Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders and challenge the notion that training on synthetic data leads to more robust embedding models across tasks.

[39] Augmented Fine-Tuned LLMs for Enhanced Recruitment Automation

Mohamed T. Younes, Omar Walid, Khaled Shaban, Ali Hamdi, Mai Hassan

Main category: cs.CL

TL;DR: Fine-tuned LLMs for recruitment automation achieve 90.62% F1 score using synthetic and parsed resume data in standardized JSON format.

DetailsMotivation: To address limitations of generic LLMs in recruitment tasks by creating specialized models that improve accuracy and efficiency in candidate-job matching.

Method: Fine-tuned Large Language Models (including Phi-4) using a synthetic dataset in standardized JSON format and real resumes parsed by DeepSeek into the same structured format to ensure consistency and data diversity.

Result: Significant improvements in performance metrics: fine-tuned Phi-4 model achieved highest F1 score of 90.62%, with better exact match, BLEU score, ROUGE score, and overall similarity compared to base models and other state-of-the-art LLMs.

Conclusion: Fine-tuned LLMs show exceptional potential to revolutionize recruitment workflows through more accurate candidate-job matching, demonstrating superior precision and recall in recruitment automation tasks.

Abstract: This paper presents a novel approach to recruitment automation in which Large Language Models (LLMs) were fine-tuned to improve accuracy and efficiency. Building upon our previous work on the Multilayer Large Language Model-Based Robotic Process Automation Applicant Tracking (MLAR) system, this work introduces a novel methodology: training fine-tuned LLMs specifically tuned for recruitment tasks. The proposed framework addresses the limitations of generic LLMs by creating a synthetic dataset that uses a standardized JSON format, which helps ensure consistency and scalability. In addition to the synthetic dataset, real resumes were parsed using DeepSeek, a high-parameter LLM, into the same structured JSON format and added to the training set, improving data diversity and realism. Through experimentation, we demonstrate significant improvements in performance metrics, such as exact match, F1 score, BLEU score, ROUGE score, and overall similarity compared to base models and other state-of-the-art LLMs. In particular, the fine-tuned Phi-4 model achieved the highest F1 score of 90.62%, indicating exceptional precision and recall in recruitment tasks. This study highlights the potential of fine-tuned LLMs to revolutionize recruitment workflows by providing more accurate candidate-job matching.

[40] MSLEF: Multi-Segment LLM Ensemble Finetuning in Recruitment

Omar Walid, Mohamed T. Younes, Khaled Shaban, Mai Hassan, Ali Hamdi

Main category: cs.CL

TL;DR: MSLEF is a multi-segment ensemble framework that uses fine-tuned LLMs with weighted voting for resume parsing, achieving significant performance improvements over single-model approaches.

DetailsMotivation: To overcome limitations of single-model resume parsing systems by creating a more accurate and adaptable framework that can handle diverse resume formats and structures in recruitment automation.

Method: Uses segment-aware architecture with field-specific weighting, integrates multiple fine-tuned LLMs (Gemma 9B, LLaMA 3.1 8B, Phi-4 14B) with Gemini-2.5-Flash as high-level aggregator, and employs weighted voting ensemble approach.
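
A minimal sketch of field-specific weighted voting over per-model predictions; the data layout and default weights are assumptions, not MSLEF's aggregation rule:

```python
from collections import defaultdict

def weighted_field_vote(predictions: dict[str, dict[str, str]],
                        weights: dict[str, dict[str, float]]) -> dict[str, str]:
    """Combine per-model resume-field predictions with field-specific weights.
    `predictions[model][field] = value`; `weights[field][model] = weight`.
    Missing weights default to 1.0 (an assumption)."""
    result = {}
    fields = {f for p in predictions.values() for f in p}
    for field in fields:
        scores = defaultdict(float)
        for model, pred in predictions.items():
            if field in pred:
                scores[pred[field]] += weights.get(field, {}).get(model, 1.0)
        result[field] = max(scores, key=scores.get)  # highest-weighted value wins
    return result
```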

Result: Achieves significant improvements: up to +7% in Recruitment Similarity (RS), plus gains in Exact Match, F1 score, BLEU, and ROUGE metrics, outperforming best single model.

Conclusion: MSLEF’s segment-aware design enhances generalization across varied resume layouts, making it highly adaptable to real-world hiring scenarios while ensuring precise candidate representation.

Abstract: This paper presents MSLEF, a multi-segment ensemble framework that employs LLM fine-tuning to enhance resume parsing in recruitment automation. It integrates fine-tuned Large Language Models (LLMs) using weighted voting, with each model specializing in a specific resume segment to boost accuracy. Building on MLAR, MSLEF introduces a segment-aware architecture that leverages field-specific weighting tailored to each resume part, effectively overcoming the limitations of single-model systems by adapting to diverse formats and structures. The framework incorporates Gemini-2.5-Flash LLM as a high-level aggregator for complex sections and utilizes Gemma 9B, LLaMA 3.1 8B, and Phi-4 14B. MSLEF achieves significant improvements in Exact Match (EM), F1 score, BLEU, ROUGE, and Recruitment Similarity (RS) metrics, outperforming the best single model by up to +7% in RS. Its segment-aware design enhances generalization across varied resume layouts, making it highly adaptable to real-world hiring scenarios while ensuring precise and reliable candidate representation.

[41] No Encore: Unlearning as Opt-Out in Music Generation

Jinju Kim, Taehan Kim, Abdul Waheed, Rita Singh

Main category: cs.CL

TL;DR: First application of machine unlearning techniques to prevent copyright infringement in AI music generation systems by removing specific training data without degrading model performance.

DetailsMotivation: Address ethical and legal concerns about AI music generation systems potentially exploiting copyrighted content by developing methods to prevent inadvertent usage of creative works.

Method: Apply existing machine unlearning techniques to a pre-trained Text-to-Music (TTM) baseline model to explore their efficacy in removing specific datasets from the training data.

Result: Preliminary results show insights into the challenges of applying unlearning in music generation, providing foundational analysis for future research in this area.

Conclusion: Machine unlearning techniques show promise for addressing copyright concerns in AI music generation, though significant challenges remain that require further research and development.

Abstract: AI music generation is rapidly emerging in the creative industries, enabling intuitive music generation from textual descriptions. However, these systems pose risks of exploiting copyrighted creations, raising ethical and legal concerns. In this paper, we present preliminary results from ongoing research on the first application of machine unlearning techniques to prevent inadvertent usage of creative content. In particular, we apply existing machine unlearning methods to a pre-trained Text-to-Music (TTM) baseline and analyze their efficacy in unlearning pre-trained datasets without harming model performance. Through our experiments, we provide insights into the challenges of applying unlearning in music generation, offering a foundational analysis for future work on unlearning for music generative models.

[42] Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang

Main category: cs.CL

TL;DR: Mask-GCG improves jailbreak attacks on LLMs by identifying and pruning redundant tokens in fixed-length suffixes, reducing computational overhead while maintaining attack success rates.

DetailsMotivation: Existing GCG-based jailbreak methods use fixed-length suffixes with potential token redundancy, which increases computational cost without necessarily improving attack effectiveness.

Method: Proposes Mask-GCG with learnable token masking to identify high-impact tokens in suffixes, increasing update probability for important tokens while pruning low-impact ones to reduce gradient space.
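
The learnable-mask idea can be pictured as sampling update positions from per-position logits; a sketch only, since Mask-GCG's mask training and pruning criterion are described in the paper:

```python
import torch

def sample_update_positions(mask_logits: torch.Tensor, k: int) -> torch.Tensor:
    """Given learnable per-position logits over a fixed-length suffix, sample
    k positions to update, biasing toward high-impact tokens."""
    probs = torch.softmax(mask_logits, dim=-1)
    return torch.multinomial(probs, num_samples=k, replacement=False)

mask_logits = torch.randn(20)                      # one logit per suffix position
positions = sample_update_positions(mask_logits, k=5)
print(positions)                                   # indices chosen for GCG token swaps
```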

Result: Most suffix tokens contribute significantly to attack success, but pruning minority low-impact tokens doesn’t affect loss values or attack success rate, revealing token redundancy in LLM prompts.

Conclusion: The method provides insights for developing efficient and interpretable LLMs by demonstrating token redundancy in jailbreak attacks and offering a plug-and-play approach to reduce computational overhead.

Abstract: Jailbreak attacks on Large Language Models (LLMs) have demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in a suffix to generate jailbreakable prompts. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes. However, the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.

[43] PL-CA

Ao Chang, Yubo Chen, Jun Zhao

Main category: cs.CL

TL;DR: PL-CA proposes a parametric RAG framework that encodes legal knowledge into parametric vectors and integrates them into LLMs via LoRA, reducing context window pressure while maintaining performance on legal tasks.

DetailsMotivation: Conventional RAG methods have limitations including constrained context windows, computational overhead from long contexts, and disruption of model attention. Existing benchmarks lack expert annotation and don't reflect real-world multi-task legal scenarios.

Method: Introduces parametric RAG (P-RAG) framework that performs data augmentation on corpus knowledge, encodes legal knowledge into parametric vectors, and integrates this knowledge into LLM’s feed-forward networks via LoRA to alleviate context pressure.
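
Standard LoRA mechanics applied to an FFN projection, roughly the injection point PL-CA uses for its parametric knowledge; the knowledge-encoding step itself is not shown, and the hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update B @ A, scaled by
    alpha / r. Generic LoRA sketch, not PL-CA's full pipeline."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # keep pretrained weights fixed
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

ffn_up = LoRALinear(nn.Linear(64, 256))
print(ffn_up(torch.randn(2, 64)).shape)  # torch.Size([2, 256])
```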

Result: Experimental results show the method reduces overhead from excessively long contexts while maintaining competitive performance on downstream legal tasks compared to conventional RAG.

Conclusion: PL-CA effectively addresses limitations of conventional RAG by using parametric knowledge encoding and integration, providing a more efficient solution for legal domain applications with reduced computational overhead.

Abstract: Conventional RAG is considered one of the most effective methods for addressing model knowledge insufficiency and hallucination, particularly in the judicial domain that requires high levels of knowledge rigor, logical consistency, and content integrity. However, the conventional RAG method only injects retrieved documents directly into the model’s context, which severely constrains models due to their limited context windows and introduces additional computational overhead through excessively long contexts, thereby disrupting models’ attention and degrading performance on downstream tasks. Moreover, many existing benchmarks lack expert annotation and focus solely on individual downstream tasks while real-world legal scenarios consist of multiple mixed legal tasks, indicating conventional benchmarks’ inadequacy for reflecting models’ true capabilities. To address these limitations, we propose PL-CA, which introduces a parametric RAG (P-RAG) framework to perform data augmentation on corpus knowledge and encode this legal knowledge into parametric vectors, and then integrates this parametric knowledge into the LLM’s feed-forward networks (FFN) via LoRA, thereby alleviating models’ context pressure. Additionally, we also construct a multi-task legal dataset comprising more than 2000 training and test instances, which are all expert-annotated and manually verified. We conduct our experiments on our dataset, and the experimental results demonstrate that our method reduces the overhead associated with excessively long contexts while maintaining competitive performance on downstream tasks compared to conventional RAG. Our code and dataset are provided in the appendix.

[44] Do LLMs exhibit the same commonsense capabilities across languages?

Ivan Martínez-Murillo, Elena Lloret, Paloma Moreda, Albert Gatt

Main category: cs.CL

TL;DR: LLMs show strong English commonsense generation but struggle with less-resourced languages like Spanish, Dutch, and Valencian, as evaluated on the new MULTICOM benchmark.

DetailsMotivation: To investigate multilingual commonsense generation capabilities of LLMs across languages with varying resource availability.

Method: Created MULTICOM benchmark extending COCOTEROS to 4 languages, evaluated various open-source LLMs using automatic metrics, LLM-as-judge approaches (Prometheus, JudgeLM), and human annotations.

Result: Consistently superior performance in English, significantly lower performance in less-resourced languages. Contextual support provided mixed results but tended to help underrepresented languages.

Conclusion: Current LLMs have significant limitations in multilingual commonsense generation, particularly for less-resourced languages, highlighting the need for improved multilingual capabilities.

Abstract: This paper explores the multilingual commonsense generation abilities of Large Language Models (LLMs). To facilitate this investigation, we introduce MULTICOM, a novel benchmark that extends the COCOTEROS dataset to four languages: English, Spanish, Dutch, and Valencian. The task involves generating a commonsensical sentence that includes a given triplet of words. We evaluate a range of open-source LLMs, including LLaMA, Qwen, Gemma, EuroLLM, and Salamandra, on this benchmark. Our evaluation combines automatic metrics, LLM-as-a-judge approaches (using Prometheus and JudgeLM), and human annotations. Results consistently show superior performance in English, with significantly lower performance in less-resourced languages. While contextual support yields mixed results, it tends to benefit underrepresented languages. These findings underscore the current limitations of LLMs in multilingual commonsense generation. The dataset is publicly available at https://huggingface.co/datasets/gplsi/MULTICOM.

[45] WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents

Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, Jiayuan Song, Zhengmao Zhu, Wenhu Chen, Pengyu Zhao, Junxian He

Main category: cs.CL

TL;DR: WebExplorer is a new 8B parameter web agent that achieves state-of-the-art performance through systematic data generation and training, outperforming much larger models on complex information-seeking tasks.

DetailsMotivation: Existing open-source web agents have limited information-seeking abilities on complex tasks or lack transparent implementations, with the key challenge being scarcity of challenging data for information seeking.

Method: Introduces WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. Develops advanced web agent through supervised fine-tuning followed by reinforcement learning, supporting 128K context length and up to 100 tool calling turns.
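
The long-to-short query evolution might be sketched as an iterative rewrite loop; `llm` is a hypothetical text-completion callable and the prompt is illustrative, not WebExplorer's actual template:

```python
def evolve_query(seed_qa: tuple[str, str], llm, steps: int = 3) -> list[tuple[str, str]]:
    """Iteratively rewrite a verbose question into shorter, harder variants
    that keep the same answer, yielding progressively more challenging
    query-answer pairs."""
    question, answer = seed_qa
    variants = [(question, answer)]
    for _ in range(steps):
        question = llm(
            "Rewrite this question to be shorter and to omit explicit clues, "
            f"while keeping the answer '{answer}' unchanged:\n{question}"
        ).strip()
        variants.append((question, answer))
    return variants
```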

Result: WebExplorer-8B achieves state-of-the-art performance at its scale, effectively searches over 16 turns after RL training, outperforms WebSailor-72B on BrowseComp-en/zh, and achieves best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Also shows strong generalization on HLE benchmark.

Conclusion: The approach provides a practical path toward long-horizon web agents, demonstrating that systematic data generation and training can create highly capable web agents even at smaller model sizes.

Abstract: The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify that the key challenge lies in the scarcity of challenging data for information seeking. To address this limitation, we introduce WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. By leveraging our curated high-quality dataset, we successfully develop advanced web agent WebExplorer-8B through supervised fine-tuning followed by reinforcement learning. Our model supports 128K context length and up to 100 tool calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves the state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B is able to effectively search over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also achieves strong generalization on the HLE benchmark even though it is only trained on knowledge-intensive QA data. These results highlight our approach as a practical path toward long-horizon web agents.

[46] Crown, Frame, Reverse: Layer-Wise Scaling Variants for LLM Pre-Training

Andrei Baroian, Kasper Notebomer

Main category: cs.CL

TL;DR: Layer-Wise Scaling variants redistribute FFN widths and attention heads via linear interpolation, achieving better performance than isotropic baselines with similar parameter budgets.

DetailsMotivation: Traditional transformer models use uniform layer sizes, ignoring the diverse functional roles and computational capacity needs at different depths.

Method: Introduced three LWS variants (Framed, Reverse, Crown) that redistribute FFN widths and attention heads via two or three-point linear interpolation during pre-training, tested on 180M parameters with 5B tokens.
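
A piecewise-linear width schedule is easy to make concrete; the anchor points below are invented to suggest a Crown-like shape (narrow ends, wide middle), not the paper's actual schedules:

```python
import numpy as np

def layerwise_widths(n_layers: int, anchors: list[tuple[float, int]]) -> list[int]:
    """Piecewise-linear interpolation of per-layer FFN widths through two or
    three (relative_depth, width) anchor points; depths run from 0.0 to 1.0."""
    xs = [a[0] for a in anchors]
    ys = [a[1] for a in anchors]
    depth = np.linspace(0.0, 1.0, n_layers)
    return [int(w) for w in np.interp(depth, xs, ys)]

# illustrative three-point "Crown"-style schedule over 12 layers
print(layerwise_widths(12, [(0.0, 1024), (0.5, 3072), (1.0, 1024)]))
```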

Result: All models converged to similar losses and achieved better performance compared to equal-cost isotropic baseline, without substantial decrease in training throughput.

Conclusion: This represents an initial step into layer-wise architecture design for pre-training, but future work needs larger-scale experiments to fully assess potential.

Abstract: Transformer-based language models traditionally use uniform (isotropic) layer sizes, yet they ignore the diverse functional roles that different depths can play and their computational capacity needs. Building on Layer-Wise Scaling (LWS) and pruning literature, we introduce three new LWS variants - Framed, Reverse, and Crown - that redistribute FFN widths and attention heads via two or three-point linear interpolation in the pre-training stage. We present the first systematic ablation of LWS and its variants, on a fixed budget of 180M parameters, trained on 5B tokens. All models converge to similar losses and achieve better performance compared to an equal-cost isotropic baseline, without a substantial decrease in training throughput. This work represents an initial step into the design space of layer-wise architectures for pre-training, but future work should scale experiments to orders of magnitude more tokens and parameters to fully assess their potential.
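
For intuition, here is a minimal numpy sketch of how two- and three-point linear interpolation could assign per-layer FFN widths under the three variants. The anchor placements, width bounds, and rounding rule are illustrative assumptions, not the paper's configuration; in practice the anchors would be tuned so the total parameter count matches the equal-cost isotropic baseline.

```python
import numpy as np

def lws_widths(n_layers: int, w_min: int, w_max: int, variant: str) -> np.ndarray:
    """Per-layer FFN widths via linear interpolation between anchor points.
    A toy rendering of the Framed/Reverse/Crown ideas; anchor choices and
    budget matching are assumptions, not the paper's exact setup."""
    x = np.arange(n_layers)
    mid = (n_layers - 1) / 2
    if variant == "reverse":    # wide early layers tapering toward the top
        xs, ys = [0, n_layers - 1], [w_max, w_min]
    elif variant == "crown":    # narrow ends, widest in the middle
        xs, ys = [0, mid, n_layers - 1], [w_min, w_max, w_min]
    elif variant == "framed":   # wide ends framing a narrower middle
        xs, ys = [0, mid, n_layers - 1], [w_max, w_min, w_max]
    else:
        raise ValueError(f"unknown variant: {variant}")
    widths = np.interp(x, xs, ys)
    return (np.round(widths / 64) * 64).astype(int)  # hardware-friendly multiples

print(lws_widths(12, 1024, 3072, "crown"))  # widths peak at the middle layers
```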

[47] LAMDAS: LLM as an Implicit Classifier for Domain-specific Data Selection

Jian Wu, Hang Yu, Bingchang Liu, Wenjie Yang, Peng Di, Jianguo Li, Yue Zhang

Main category: cs.CL

TL;DR: LAMDAS uses LLMs as implicit classifiers for domain-specific data selection, outperforming full-data training with less data and achieving better performance-efficiency balance than SOTA methods.

DetailsMotivation: Addressing the scarcity of high-quality domain-specific data for LLM fine-tuning and the limitations of existing data selection methods that struggle with accuracy and efficiency trade-offs.

Method: Leverages pre-trained LLM as an implicit classifier to reframe data selection as one-class classification, identifying candidate data that belongs to the target domain using a small reference dataset without explicit feature engineering.

Result: Exceeds full-data training performance using only a fraction of data, outperforms nine SOTA baselines across various scenarios, and achieves the best performance-computational efficiency balance.

Conclusion: LAMDAS provides an effective and efficient solution for domain-specific data selection, demonstrating superior performance while reducing computational overhead compared to existing approaches.

Abstract: Adapting large language models (LLMs) to specific domains often faces a critical bottleneck: the scarcity of high-quality, human-curated data. While large volumes of unchecked data are readily available, indiscriminately using them for fine-tuning risks introducing noise and degrading performance. Strategic data selection is thus crucial, requiring a method that is both accurate and efficient. Existing approaches, categorized as similarity-based and direct optimization methods, struggle to simultaneously achieve these goals. In this paper, we introduce LAMDAS (LLM As an iMplicit classifier for domain-specific DAta Selection), a novel approach that leverages the pre-trained LLM itself as an implicit classifier, thereby bypassing explicit feature engineering and computationally intensive optimization processes. LAMDAS reframes data selection as a one-class classification problem, identifying candidate data that “belongs” to the target domain defined by a small reference dataset. Extensive experimental results demonstrate that LAMDAS not only exceeds the performance of full-data training using a fraction of the data but also outperforms nine state-of-the-art (SOTA) baselines under various scenarios. Furthermore, LAMDAS achieves the most compelling balance between performance gains and computational efficiency compared to all evaluated baselines.

[48] SLiNT: Structure-aware Language Model with Injection and Contrastive Training for Knowledge Graph Completion

Mengxue Yang, Chun Yang, Jiaqi Zhu, Jiafan Li, Jingqi Zhang, Yuyang Li, Ying Li

Main category: cs.CL

TL;DR: SLiNT is a modular framework that injects knowledge graph structural context into frozen LLMs using LoRA adaptation for robust link prediction, achieving superior performance on benchmark datasets.

DetailsMotivation: Address structural sparsity and semantic ambiguity in knowledge graph link prediction by integrating structural information with LLM capabilities, especially under incomplete or zero-shot settings.

Method: Uses Structure-Guided Neighborhood Enhancement to enrich sparse entities, Dynamic Hard Contrastive Learning for fine-grained supervision, and Gradient-Decoupled Dual Injection for token-level structure-aware intervention while preserving LLM parameters.

Result: Achieves superior or competitive performance compared to both embedding-based and generation-based baselines on WN18RR and FB15k-237 datasets.

Conclusion: Demonstrates effectiveness of structure-aware representation learning for scalable knowledge graph completion by effectively combining structural context with LLM capabilities.

Abstract: Link prediction in knowledge graphs requires integrating structural information and semantic context to infer missing entities. While large language models offer strong generative reasoning capabilities, their limited exploitation of structural signals often results in structural sparsity and semantic ambiguity, especially under incomplete or zero-shot settings. To address these challenges, we propose SLiNT (Structure-aware Language model with Injection and coNtrastive Training), a modular framework that injects knowledge-graph-derived structural context into a frozen LLM backbone with lightweight LoRA-based adaptation for robust link prediction. Specifically, Structure-Guided Neighborhood Enhancement (SGNE) retrieves pseudo-neighbors to enrich sparse entities and mitigate missing context; Dynamic Hard Contrastive Learning (DHCL) introduces fine-grained supervision by interpolating hard positives and negatives to resolve entity-level ambiguity; and Gradient-Decoupled Dual Injection (GDDI) performs token-level structure-aware intervention while preserving the core LLM parameters. Experiments on WN18RR and FB15k-237 show that SLiNT achieves superior or competitive performance compared with both embedding-based and generation-based baselines, demonstrating the effectiveness of structure-aware representation learning for scalable knowledge graph completion.

[49] HAVE: Head-Adaptive Gating and ValuE Calibration for Hallucination Mitigation in Large Language Models

Xin Tong, Zhi Lin, Jingya Wang, Bo Jin

Main category: cs.CL

TL;DR: HAVE is a parameter-free decoding framework that reduces hallucinations in LLMs through head-adaptive gating and value calibration, requiring no finetuning and operating in a single forward pass.

DetailsMotivation: LLMs often produce hallucinations in retrieval-augmented or long-context generation even when relevant evidence is present, because head importance is treated as input-agnostic and raw attention weights poorly reflect each token’s true contribution.

Method: HAVE introduces head-adaptive gating (instance-level soft reweighing of attention heads) and value calibration (augmenting attention with value vector magnitude to approximate write-back contribution), fused with LM distribution through uncertainty-scaled policy.

Result: Experiments across multiple QA benchmarks and LLM families show HAVE consistently reduces hallucinations and outperforms strong baselines including DAGCD with modest overhead.

Conclusion: HAVE provides an efficient, transparent, and reproducible framework that readily integrates with off-the-shelf LLMs, advancing trustworthy generation in real-world settings.

Abstract: Large Language Models (LLMs) often produce hallucinations in retrieval-augmented or long-context generation, even when relevant evidence is present. This stems from two issues: head importance is treated as input-agnostic, and raw attention weights poorly reflect each token’s true contribution. We present HAVE (Head-Adaptive Gating and ValuE Calibration), a parameter-free decoding framework that directly addresses both challenges. HAVE introduces head-adaptive gating, which performs instance-level soft reweighing of attention heads, and value calibration, which augments attention with the magnitude of value vectors to approximate write-back contribution. Together, these modules construct token-level evidence aligned with model updates and fuse it with the LM distribution through a lightweight uncertainty-scaled policy. HAVE requires no finetuning and operates in a single forward pass, making it efficient and broadly applicable. Experiments across multiple QA benchmarks and LLM families demonstrate that HAVE consistently reduces hallucinations and outperforms strong baselines, including DAGCD, with modest overhead. The framework is transparent, reproducible, and readily integrates with off-the-shelf LLMs, advancing trustworthy generation in real-world settings.
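
The two mechanisms are easy to render in a few lines of numpy. In this simplified sketch, value calibration reweights attention by value-vector norms, and an entropy-based gate stands in for the instance-level head reweighing; the gating rule is an assumption, and the final uncertainty-scaled fusion with the LM distribution is omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def token_evidence(attn, values):
    """attn: [heads, ctx] attention of the current query over context tokens.
    values: [heads, ctx, d_v] value vectors of those tokens.
    Returns a distribution over context tokens approximating write-back
    contribution (sketch; the gating rule is an assumption)."""
    # Value calibration: weight raw attention by value-vector magnitude.
    calibrated = attn * np.linalg.norm(values, axis=-1)            # [heads, ctx]
    calibrated /= calibrated.sum(axis=-1, keepdims=True)
    # Head-adaptive gating: trust focused (low-entropy) heads more.
    entropy = -(calibrated * np.log(calibrated + 1e-9)).sum(-1)    # [heads]
    gates = softmax(-entropy)                                      # [heads]
    return gates @ calibrated                                      # [ctx]

rng = np.random.default_rng(0)
attn = softmax(rng.normal(size=(8, 16)))    # 8 heads, 16 context tokens
values = rng.normal(size=(8, 16, 32))
print(token_evidence(attn, values).round(3))
```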

[50] Guided Decoding and Its Critical Role in Retrieval-Augmented Generation

Özgür Uğur, Musa Yılmaz, Esra Şavirdi, Özay Ezerceli, Mahmut El Huseyni, Selva Taş, Reyhan Bayraktar

Main category: cs.CL

TL;DR: Comparison of three guided decoding methods (Outlines, XGrammar, LM Format Enforcer) in RAG systems across multi-turn prompting setups, evaluating success rates, hallucination rates, and output quality to inform method selection.

DetailsMotivation: Need for structured and reliable responses from LLMs in RAG systems while minimizing hallucinations and ensuring output format alignment.

Method: Evaluated three guided decoding methods (Outlines, XGrammar, LM Format Enforcer) across different multi-turn prompting setups (0-turn, 1-turn, 2-turn) by measuring success rates, hallucination rates, and output quality.

Result: Revealed unexpected performance variations across methods and multi-turn interactions, providing insights into how guided decoding performs in different prompting scenarios.

Conclusion: Advances understanding of structured output generation in RAG systems and offers both theoretical insights and practical guidance for LLM deployment with specific method selection recommendations.

Abstract: The integration of Large Language Models (LLMs) into various applications has driven the need for structured and reliable responses. A key challenge in Retrieval-Augmented Generation (RAG) systems is ensuring that outputs align with expected formats while minimizing hallucinations. This study examines the role of guided decoding in RAG systems, comparing three methods, Outlines, XGrammar, and LM Format Enforcer, across different multi-turn prompting setups (0-turn, 1-turn, and 2-turn). By evaluating success rates, hallucination rates, and output quality, we provide insights into their performance and applicability. Our findings reveal how multi-turn interactions influence guided decoding, uncovering unexpected performance variations that can inform method selection for specific use cases. This work advances the understanding of structured output generation in RAG systems, offering both theoretical insights and practical guidance for LLM deployment.
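
The "success rate" in such comparisons presumably reduces to whether an output conforms to the expected structure. A minimal harness for that check might look as follows; the schema is made up, and the function is agnostic to which backend (Outlines, XGrammar, or LM Format Enforcer) produced the strings.

```python
import json
from jsonschema import Draft7Validator  # pip install jsonschema

SCHEMA = {  # hypothetical answer format for a RAG pipeline
    "type": "object",
    "properties": {"answer": {"type": "string"},
                   "supported_by_context": {"type": "boolean"}},
    "required": ["answer", "supported_by_context"],
}

def structured_success_rate(outputs: list[str], schema: dict = SCHEMA) -> float:
    """Fraction of outputs that parse as JSON and satisfy the schema."""
    validator = Draft7Validator(schema)
    ok = 0
    for raw in outputs:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if not any(validator.iter_errors(obj)):
            ok += 1
    return ok / max(len(outputs), 1)

print(structured_success_rate([
    '{"answer": "Paris", "supported_by_context": true}',
    'Sorry, I cannot answer that.',
]))  # -> 0.5
```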

[51] Modelling Intertextuality with N-gram Embeddings

Yi Xing

Main category: cs.CL

TL;DR: A new quantitative model for measuring intertextuality using n-gram embeddings and pairwise comparisons, validated on known texts and scalable to large corpora.

DetailsMotivation: To enable scalable analysis of intertextual relationships between literary texts through computational methods, moving beyond qualitative approaches.

Method: Perform pairwise comparisons of n-gram embeddings from two texts and average the results to calculate overall intertextuality scores.

Result: Method validated on four texts with known intertextuality degrees and scaled to 267 diverse texts, showing effectiveness and efficiency. Network analysis revealed centrality and community structures.

Conclusion: The approach successfully captures and quantifies intertextual relationships, enabling scalable network-based insights into literary connections.

Abstract: Intertextuality is a central tenet in literary studies. It refers to the intricate links between literary texts that are created by various types of references. This paper proposes a new quantitative model of intertextuality to enable scalable analysis and network-based insights: perform pairwise comparisons of the embeddings of n-grams from two texts and average their results as the overall intertextuality. Validation on four texts with known degrees of intertextuality, alongside a scalability test on 267 diverse texts, demonstrates the method’s effectiveness and efficiency. Network analysis further reveals centrality and community structures, affirming the approach’s success in capturing and quantifying intertextual relationships.
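
The proposed score is compact enough to show in full. In this standalone sketch, deterministic hashed vectors stand in for real n-gram embeddings, which makes the score roughly an exact-overlap measure; with actual pretrained embeddings, near-matching n-grams would also contribute.

```python
import hashlib
import numpy as np

def ngrams(text: str, n: int = 3) -> list[str]:
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def embed(gram: str, dim: int = 64) -> np.ndarray:
    # Stand-in for a real embedding model: a unit vector seeded by a stable
    # hash, so identical n-grams always map to identical vectors.
    seed = int.from_bytes(hashlib.md5(gram.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

def intertextuality(a: str, b: str, n: int = 3) -> float:
    """Average pairwise cosine similarity between the texts' n-gram embeddings."""
    A = np.stack([embed(g) for g in ngrams(a, n)])
    B = np.stack([embed(g) for g in ngrams(b, n)])
    return float((A @ B.T).mean())

t1 = "in the beginning was the word and the word was with god"
t2 = "in the beginning god created the heaven and the earth"
print(round(intertextuality(t1, t2), 4))
```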

[52] Domain-Aware RAG: MoL-Enhanced RL for Efficient Training and Scalable Retrieval

Hao Lin, Peitong Xie, Jingxue Chen, Jie Lin, Qingkun Tang, Qianchun Lu

Main category: cs.CL

TL;DR: MoLER is a domain-aware RAG method that uses Mixture of Losses-enhanced Reinforcement Learning to optimize retrieval performance through balanced domain knowledge learning and query enhancement.

DetailsMotivation: Existing coarse-ranking optimization approaches in RAG systems struggle to balance domain-specific knowledge learning with query enhancement, leading to suboptimal retrieval performance.

Method: Two-stage pipeline: 1) Continual pre-training with Mixture of Losses (MoL) to balance domain-specific and general language capabilities, 2) Reinforcement learning phase using Group Relative Policy Optimization (GRPO) to optimize query and passage generation. Features Multi-query Single-passage Late Fusion (MSLF) strategy for efficient training.

Result: Extensive experiments show MoLER achieves state-of-the-art performance, significantly outperforming baseline methods on benchmark datasets.

Conclusion: MoLER bridges the knowledge gap in RAG systems, enabling robust and scalable retrieval in specialized domains through its innovative training approach and fusion strategies.

Abstract: Retrieval-Augmented Generation (RAG) systems rely heavily on the retrieval stage, particularly the coarse-ranking process. Existing coarse-ranking optimization approaches often struggle to balance domain-specific knowledge learning with query enhancement, resulting in suboptimal retrieval performance. To address this challenge, we propose MoLER, a domain-aware RAG method that uses MoL-Enhanced Reinforcement Learning to optimize retrieval. MoLER has a two-stage pipeline: a continual pre-training (CPT) phase using a Mixture of Losses (MoL) to balance domain-specific knowledge with general language capabilities, and a reinforcement learning (RL) phase leveraging Group Relative Policy Optimization (GRPO) to optimize query and passage generation for maximizing document recall. A key innovation is our Multi-query Single-passage Late Fusion (MSLF) strategy, which reduces computational overhead during RL training while maintaining scalable inference via Multi-query Multi-passage Late Fusion (MMLF). Extensive experiments on benchmark datasets show that MoLER achieves state-of-the-art performance, significantly outperforming baseline methods. MoLER bridges the knowledge gap in RAG systems, enabling robust and scalable retrieval in specialized domains.
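
At its simplest, the MoL idea in the CPT phase is a weighted blend of two objectives. A hedged PyTorch sketch, where the 0.7/0.3 weighting and the use of plain cross-entropy for both terms are assumptions rather than the paper's recipe:

```python
import torch
import torch.nn.functional as F

def mol_loss(domain_logits, domain_labels, general_logits, general_labels,
             w_domain: float = 0.7) -> torch.Tensor:
    """Mixture-of-Losses sketch: blend a domain-specific and a general
    language-modeling loss to avoid catastrophic forgetting during CPT."""
    def ce(logits, labels):
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               labels.reshape(-1))
    return (w_domain * ce(domain_logits, domain_labels)
            + (1.0 - w_domain) * ce(general_logits, general_labels))

vocab = 100
dl, dy = torch.randn(2, 8, vocab), torch.randint(vocab, (2, 8))
gl, gy = torch.randn(2, 8, vocab), torch.randint(vocab, (2, 8))
print(mol_loss(dl, dy, gl, gy).item())
```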

[53] IntrEx: A Dataset for Modeling Engagement in Educational Conversations

Xingwei Tan, Mahathi Parvatham, Chiara Gambi, Gabriele Pergola

Main category: cs.CL

TL;DR: IntrEx dataset enables study of engagement in educational conversations through sequence-level annotations, showing fine-tuned LLMs outperform larger models in predicting interestingness.

DetailsMotivation: Addressing the gap in understanding linguistic features that drive engagement in educational conversations, as maintaining learner interest remains challenging in second-language acquisition.

Method: Created IntrEx dataset with sequence-level annotations from teacher-student interactions, using comparison-based rating with 100+ second-language learners inspired by RLHF approach.

Result: Fine-tuned LLMs (7B/8B parameters) outperformed larger proprietary models like GPT-4o in predicting human interestingness judgments, demonstrating specialized datasets’ value.

Conclusion: Linguistic and cognitive factors like concreteness, comprehensibility, and uptake significantly influence engagement in educational dialogues, with specialized models showing superior performance.

Abstract: Engagement and motivation are crucial for second-language acquisition, yet maintaining learner interest in educational conversations remains a challenge. While prior research has explored what makes educational texts interesting, still little is known about the linguistic features that drive engagement in conversations. To address this gap, we introduce IntrEx, the first large dataset annotated for interestingness and expected interestingness in teacher-student interactions. Built upon the Teacher-Student Chatroom Corpus (TSCC), IntrEx extends prior work by incorporating sequence-level annotations, allowing for the study of engagement beyond isolated turns to capture how interest evolves over extended dialogues. We employ a rigorous annotation process with over 100 second-language learners, using a comparison-based rating approach inspired by reinforcement learning from human feedback (RLHF) to improve agreement. We investigate whether large language models (LLMs) can predict human interestingness judgments. We find that LLMs (7B/8B parameters) fine-tuned on interestingness ratings outperform larger proprietary models like GPT-4o, demonstrating the potential for specialised datasets to model engagement in educational settings. Finally, we analyze how linguistic and cognitive factors, such as concreteness, comprehensibility (readability), and uptake, influence engagement in educational dialogues.

[54] ParCzech4Speech: A New Speech Corpus Derived from Czech Parliamentary Data

Vladislav Stankov, Matyáš Kopp, Ondřej Bojar

Main category: cs.CL

TL;DR: ParCzech4Speech 1.0 is a processed Czech parliamentary speech corpus with 2,695 hours of audio-text aligned data, offering three variants for different speech modeling tasks.

DetailsMotivation: To create a high-quality Czech speech dataset for speech recognition and synthesis tasks by improving upon previous versions with better alignment reliability and more data extraction.

Method: Combined Czech parliamentary sound recordings with official transcripts, processed with WhisperX and Wav2Vec 2.0 for automated audio-text alignment, and created three dataset variants with different segmentation approaches.

Result: Produced a 2,695-hour corpus with improved alignment reliability over ParCzech 3.0, available in sentence-segmented, unsegmented, and raw-alignment variants under CC-BY license.

Conclusion: The dataset provides a valuable resource for Czech speech modeling tasks and is publicly available through LINDAT repository and Hugging Face platform.

Abstract: We introduce ParCzech4Speech 1.0, a processed version of the ParCzech 4.0 corpus, targeted at speech modeling tasks with the largest variant containing 2,695 hours. We combined the sound recordings of the Czech parliamentary speeches with the official transcripts. The recordings were processed with WhisperX and Wav2Vec 2.0 to extract automated audio-text alignment. Our processing pipeline improves upon the ParCzech 3.0 speech recognition version by extracting more data with higher alignment reliability. The dataset is offered in three flexible variants: (1) sentence-segmented for automatic speech recognition and speech synthesis tasks with clean boundaries, (2) unsegmented preserving original utterance flow across sentences, and (3) a raw-alignment for further custom refinement for other possible tasks. All variants maintain the original metadata and are released under a permissive CC-BY license. The dataset is available in the LINDAT repository, with the sentence-segmented and unsegmented variants additionally available on Hugging Face.

[55] Will Annotators Disagree? Identifying Subjectivity in Value-Laden Arguments

Amir Homayounirad, Enrico Liscio, Tong Wang, Catholijn M. Jonker, Luciano C. Siebert

Main category: cs.CL

TL;DR: Direct subjectivity identification outperforms value prediction for identifying subjective arguments in human value recognition tasks, with contrastive loss reducing dependency on per-label subjectivity.

DetailsMotivation: Aggregating multiple annotations into single labels can obscure valuable insights from annotator disagreement, especially in subjective tasks like recognizing human values in arguments.

Method: Evaluated two approaches: inferring subjectivity through value prediction vs. direct subjectivity identification. Tested combining contrastive loss with binary cross-entropy loss.

Result: Direct subjectivity identification significantly improved model performance for flagging subjective arguments. Contrastive + binary cross-entropy combination didn’t improve performance but reduced dependency on per-label subjectivity.

Conclusion: Proposed methods help identify arguments with differing interpretations, enabling more nuanced annotation processes that preserve valuable disagreement insights.

Abstract: Aggregating multiple annotations into a single ground truth label may hide valuable insights into annotator disagreement, particularly in tasks where subjectivity plays a crucial role. In this work, we explore methods for identifying subjectivity in recognizing the human values that motivate arguments. We evaluate two main approaches: inferring subjectivity through value prediction vs. directly identifying subjectivity. Our experiments show that direct subjectivity identification significantly improves the model performance of flagging subjective arguments. Furthermore, combining contrastive loss with binary cross-entropy loss does not improve performance but reduces the dependency on per-label subjectivity. Our proposed methods can help identify arguments that individuals may interpret differently, fostering a more nuanced annotation process.
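
The loss combination under test is straightforward to write down. A sketch, assuming a standard supervised-contrastive term over sentence embeddings added to binary cross-entropy from a subjectivity head; the temperature, weighting, and exact contrastive formulation are illustrative.

```python
import torch
import torch.nn.functional as F

def bce_plus_contrastive(emb, logits, labels, tau=0.1, lam=0.5):
    """BCE on subjectivity logits plus a supervised-contrastive term that
    pulls same-label embeddings together (sketch; the paper's exact
    formulation may differ)."""
    bce = F.binary_cross_entropy_with_logits(logits, labels.float())
    z = F.normalize(emb, dim=-1)
    sim = z @ z.T / tau                                  # [B, B] similarities
    eye = torch.eye(len(labels), dtype=torch.bool)
    pos = (labels[:, None] == labels[None, :]) & ~eye    # positives, no self
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(eye, float("-inf")), dim=1, keepdim=True)
    contrastive = -(pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)
    return bce + lam * contrastive.mean()

emb, logits = torch.randn(6, 32), torch.randn(6)
labels = torch.tensor([0, 1, 0, 1, 1, 0])
print(bce_plus_contrastive(emb, logits, labels).item())
```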

[56] Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint

Yanrui Du, Fenglei Fan, Sendong Zhao, Jiawei Cao, Qika Lin, Kai He, Ting Liu, Bing Qin, Mengling Feng

Main category: cs.CL

TL;DR: ProCon method prevents safety degradation in instruction-tuned LLMs by constraining refusal direction drift during training, maintaining safety while preserving performance.

DetailsMotivation: Instruction fine-tuning (IFT) enhances LLM capabilities but significantly compromises safety, particularly the ability to refuse malicious instructions, due to drift in the refusal direction during training.

Method: ProCon introduces a projection-constrained loss term to regularize hidden state projections onto the refusal direction, with an enhanced warm-up strategy that applies strong early constraints and broadens data distribution for better constraint signals.

Result: Experimental results show ProCon significantly mitigates safety risks from IFT while preserving task performance gains, outperforming strong baselines and stabilizing the refusal direction during training.

Conclusion: ProCon effectively addresses safety degradation in IFT through interpretability-driven constraints on internal mechanisms, providing a foundation for future LLM safety research.

Abstract: Instruction Fine-Tuning (IFT) has been widely adopted as an effective post-training strategy to enhance various abilities of Large Language Models (LLMs). However, prior studies have shown that IFT can significantly compromise LLMs’ safety, particularly their ability to refuse malicious instructions, raising significant concerns. Recent research into the internal mechanisms of LLMs has identified the refusal direction (r-direction) in the hidden states, which plays a pivotal role in governing refusal behavior. Building on this insight, our study reveals that the r-direction tends to drift during training, which we identify as one of the causes of the associated safety risks. To mitigate such drift, our proposed ProCon method introduces a projection-constrained loss term that regularizes the projection magnitude of each training sample’s hidden state onto the r-direction. Our initial analysis shows that applying an appropriate constraint can effectively mitigate the refusal direction drift and associated safety risks, but remains limited by overall performance barriers. To overcome this barrier, informed by our observation of early-stage sharp drift and a data-driven perspective, we introduce a warm-up strategy that emphasizes early-stage strong constraints and broaden the data distribution to strengthen constraint signals, leading to an enhanced ProCon method. Experimental results under various datasets, scenarios, and LLMs demonstrate that our method can significantly mitigate safety risks posed by IFT while preserving task performance gains. Even compared with strong baselines, our method consistently delivers superior overall performance. Crucially, our analysis indicates that ProCon can contribute to stabilizing the r-direction during training, while such an interpretability-driven exploration of LLMs’ internal mechanisms lays a solid foundation for future safety research.
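
The constraint at the heart of ProCon is one line of linear algebra: keep each hidden state's projection onto the r-direction close to where it started. A PyTorch sketch, where the squared-drift form, the reference projections, and the weight name lam are assumptions; the abstract's warm-up strategy would correspond to using a larger lam early in training.

```python
import torch

def procon_penalty(hidden, r_dir, ref_proj, lam=1.0):
    """Penalize drift of hidden-state projections onto the refusal direction.
    hidden:   [B, T, d] hidden states during instruction fine-tuning
    r_dir:    [d] refusal direction extracted beforehand
    ref_proj: [B, T] projection magnitudes measured before tuning"""
    r = r_dir / r_dir.norm()
    proj = hidden @ r                        # [B, T] projections onto r
    return lam * ((proj - ref_proj) ** 2).mean()

B, T, d = 2, 5, 16
hidden = torch.randn(B, T, d, requires_grad=True)
task_loss = hidden.pow(2).mean()             # stand-in for the IFT objective
total = task_loss + procon_penalty(hidden, torch.randn(d), torch.randn(B, T))
total.backward()
print(total.item())
```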

[57] MachineLearningLM: Continued Pretraining Language Models on Millions of Synthetic Tabular Prediction Tasks Scales In-Context ML

Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke

Main category: cs.CL

TL;DR: MachineLearningLM is a framework that enhances LLMs’ in-context learning for ML tasks through continued pretraining on synthetic data from structural causal models, enabling strong many-shot performance while preserving general capabilities.

DetailsMotivation: Large language models struggle with in-context learning on standard machine learning tasks despite having broad knowledge and reasoning abilities, particularly when dealing with many-shot demonstrations without gradient descent.

Method: Uses continued pretraining framework that synthesizes ML tasks from millions of structural causal models (SCMs) with up to 1,024 shots. Employs random-forest teacher for knowledge distillation, token-efficient prompting for 3-6x more examples per context, and batch inference for throughput.

Result: Outperforms strong LLM baselines by ~15% on out-of-distribution tabular classification across multiple domains. Shows monotonic accuracy improvement from 8 to 1,024 shots, achieves random-forest-level accuracy, and maintains 75.4% on MMLU while preserving general chat capabilities.

Conclusion: MachineLearningLM successfully equips general-purpose LLMs with robust in-context ML capability while preserving their general knowledge and reasoning, demonstrating effective many-shot scaling without task-specific training.

Abstract: Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.
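
The most tangible piece here is the token-efficient serialization that packs 3-6x more shots into a context window. The compact row format below is hypothetical (the abstract does not spell out the real one), but it illustrates why terse rows beat verbose natural-language templates for many-shot tabular ICL.

```python
def serialize_task(features, shot_xs, shot_ys, query_x) -> str:
    """Serialize a many-shot tabular prediction task into a compact prompt.
    Format is hypothetical, for illustration only."""
    header = ",".join(features)
    rows = [",".join(map(str, x)) + "->" + str(y)
            for x, y in zip(shot_xs, shot_ys)]
    return "\n".join([header, *rows, ",".join(map(str, query_x)) + "->?"])

features = ["age", "bmi", "smoker"]
shot_xs = [(63, 31.2, 1), (41, 24.8, 0), (58, 29.0, 1)]
shot_ys = ["high_risk", "low_risk", "high_risk"]
print(serialize_task(features, shot_xs, shot_ys, (50, 27.5, 0)))
```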

[58] MoGU V2: Toward a Higher Pareto Frontier Between Model Usability and Security

Yanrui Du, Fenglei Fan, Sendong Zhao, Jiawei Cao, Ting Liu, Bing Qin

Main category: cs.CL

TL;DR: MoGU_v2 framework improves LLM security by dynamically routing between security-optimized and usability-optimized variants using layer-specific routers and bidirectional adaptation, achieving better security-usability balance without performance trade-offs.

DetailsMotivation: Existing security methods for LLMs often lead to overly conservative, rejection-oriented responses that compromise practical usability, creating a need to advance the Pareto frontier between security and usability rather than forcing trade-offs.

Method: MoGU_v2 framework embeds routers only in layers encoding highly classifiable security features, establishes tighter coupling between routers and hidden states, activates backbone modules during router optimization for bidirectional adaptation, and uses a simple data-mix strategy to restore security after instruction fine-tuning.

Result: MoGU_v2 demonstrates strong adaptability and stable improvements across various LLM types (mainstream, on-device, reasoning), effectively mitigates security risks while maintaining task performance, and easily restores security without compromising gains from instruction fine-tuning.

Conclusion: MoGU_v2 provides a robust and versatile solution for mitigating security risks in real-world LLM applications by achieving better security-usability balance through improved routing mechanisms and bidirectional adaptation.

Abstract: As Large Language Models (LLMs) increasingly permeate human life, their security has emerged as a critical concern, particularly their ability to maintain harmless responses to malicious instructions. Although extensive methods have improved LLMs’ security, they often lead to conservative, rejection-oriented responses that compromise practical usability. This presents a key challenge: how to advance the Pareto frontier between LLMs’ usability and security, rather than necessitate a trade-off between them. To address this, we propose the MoGU framework, in which the intra-layer router dynamically allocates weights by sensing hidden states, thereby balancing the contributions of security-optimized and usability-optimized variants. Despite its initial potential, the MoGU framework faces limitations such as parameter redundancy and performance bottlenecks. To overcome these, we further propose an improved MoGU_v2 framework that establishes a tighter coupling between the routers and hidden states. In MoGU_v2, routers are embedded only in layers encoding highly classifiable security features, and backbone modules are activated during router optimization to enable bidirectional adaptation. MoGU_v2 exhibits strong adaptability and stable improvements across various series of LLMs, including mainstream LLMs serving as brains in various applications, on-device LLMs optimized for resource-constrained scenarios, and reasoning LLMs tailored for user interpretability. Meanwhile, even facing risks introduced by Instruction Fine-tuning, MoGU_v2 can easily restore security without compromising the task performance gains via a simple data-mix strategy. These comprehensive improvements highlight MoGU_v2 as a robust and versatile solution for mitigating security risks in real-world applications.
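
The intra-layer router reduces to a small gate that senses hidden states and blends two variants' outputs per token. A minimal PyTorch sketch; the linear gate and the blending granularity are assumptions about a design the abstract only outlines.

```python
import torch
import torch.nn as nn

class MoGURouter(nn.Module):
    """Blend a security-optimized and a usability-optimized variant's outputs
    with weights predicted from the hidden state (sketch only)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(d_model, 2)

    def forward(self, hidden, out_secure, out_usable):
        w = torch.softmax(self.gate(hidden), dim=-1)   # [B, T, 2] per-token
        return w[..., :1] * out_secure + w[..., 1:] * out_usable

router = MoGURouter(d_model=16)
h = torch.randn(2, 4, 16)
print(router(h, torch.randn(2, 4, 16), torch.randn(2, 4, 16)).shape)
```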

[59] Saturation-Driven Dataset Generation for LLM Mathematical Reasoning in the TPTP Ecosystem

Valentin Quesnel, Damien Sileo

Main category: cs.CL

TL;DR: A framework that converts automated theorem proving research into scalable, guaranteed-valid mathematical reasoning data using E-prover on TPTP axioms, creating three difficulty-controlled challenges for LLM evaluation and training.

DetailsMotivation: Address the scarcity of high-quality, logically sound data for advancing mathematical reasoning in Large Language Models by leveraging decades of automated theorem proving research.

Method: Uses E-prover’s saturation capabilities on TPTP axiom library to derive valid theorems, filters for interesting theorems, and generates three tasks: entailment verification, premise selection, and proof reconstruction without LLM involvement.

Result: Zero-shot experiments on frontier models show performance collapses on tasks requiring deep structural reasoning, revealing a clear weakness in current LLM capabilities.

Conclusion: The framework provides both a diagnostic tool to measure the reasoning gap in LLMs and a scalable source of symbolic training data to address mathematical reasoning limitations.

Abstract: The scarcity of high-quality, logically sound data is a critical bottleneck for advancing the mathematical reasoning of Large Language Models (LLMs). Our work confronts this challenge by turning decades of automated theorem proving research into a scalable data engine. Rather than relying on error-prone LLMs or complex proof-assistant syntax like Lean and Isabelle, our framework leverages E-prover’s saturation capabilities on the vast TPTP axiom library to derive a massive, guaranteed-valid corpus of theorems. Our pipeline is principled and simple: saturate axioms, filter for “interesting” theorems, and generate tasks. With no LLMs in the loop, we eliminate factual errors by construction. This purely symbolic data is then transformed into three difficulty-controlled challenges: entailment verification, premise selection, and proof reconstruction. Our zero-shot experiments on frontier models reveal a clear weakness: performance collapses on tasks requiring deep, structural reasoning. Our framework provides both the diagnostic tool to measure this gap and a scalable source of symbolic training data to address it. We make the code and data publicly available. https://github.com/sileod/reasoning_core https://hf.co/datasets/reasoning-core/rc1

[60] A Comparative Benchmark of Large Language Models for Labelling Wind Turbine Maintenance Logs

Max Malyi, Jonathan Shek, Alasdair McDonald, Andre Biscaya

Main category: cs.CL

TL;DR: A framework for benchmarking LLMs on wind turbine maintenance log classification, showing performance hierarchy and recommending human-in-the-loop systems for optimal results.

DetailsMotivation: Unstructured free-text turbine maintenance logs present barriers to automated analysis, hindering effective wind power O&M and LCOE reduction.

Method: Developed a novel reproducible benchmarking framework for evaluating diverse state-of-the-art proprietary and open-source LLMs on maintenance log classification tasks.

Result: Identified a clear performance hierarchy among models, with top models showing high alignment to the benchmark standard and trustworthy, well-calibrated confidence scores. Performance depends on the task's semantic ambiguity: models agree more on objective component identification than on interpretive maintenance actions.

Conclusion: Human-in-the-loop systems are most effective, with LLMs acting as assistants to accelerate data labeling for human experts, enhancing O&M data quality and reliability analysis.

Abstract: Effective Operation and Maintenance (O&M) is critical to reducing the Levelised Cost of Energy (LCOE) from wind power, yet the unstructured, free-text nature of turbine maintenance logs presents a significant barrier to automated analysis. Our paper addresses this by presenting a novel and reproducible framework for benchmarking Large Language Models (LLMs) on the task of classifying these complex industrial records. To promote transparency and encourage further research, this framework has been made publicly available as an open-source tool. We systematically evaluate a diverse suite of state-of-the-art proprietary and open-source LLMs, providing a foundational assessment of their trade-offs in reliability, operational efficiency, and model calibration. Our results quantify a clear performance hierarchy, identifying top models that exhibit high alignment with a benchmark standard and trustworthy, well-calibrated confidence scores. We also demonstrate that classification performance is highly dependent on the task’s semantic ambiguity, with all models showing higher consensus on objective component identification than on interpretive maintenance actions. Given that no model achieves perfect accuracy and that calibration varies dramatically, we conclude that the most effective and responsible near-term application is a Human-in-the-Loop system, where LLMs act as a powerful assistant to accelerate and standardise data labelling for human experts, thereby enhancing O&M data quality and downstream reliability analysis.

[61] COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens

Eugene Kwek, Wenpeng Yin

Main category: cs.CL

TL;DR: COMPACT is a joint pruning method that removes rare vocabulary and FFN intermediate channels using token-weighted activations, maintaining standard transformer architecture while achieving state-of-the-art efficiency gains.

DetailsMotivation: Making LLMs more efficient for edge deployment, interactive applications, and sustainable inference at scale, addressing limitations of prior pruning methods that break transformer layout or cause accuracy drops.

Method: Jointly prunes rare vocabulary to shrink embedding/unembedding layers and prunes FFN intermediate channels using common-token-weighted activations that align importance with post-pruning token distribution.

Result: Achieves state-of-the-art downstream task performance at similar or higher pruning ratios across Qwen, LLaMA, and Gemma families (0.5B-70B), with substantial reductions in parameters, GPU memory, and end-to-end latency.

Conclusion: COMPACT combines benefits of depth and width pruning while maintaining deployment-friendliness, scale-adaptivity, training-free operation, and strong memory savings with throughput gains.

Abstract: Making LLMs more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a key technique toward this goal. However, prior pruning methods are limited: width pruning often breaks the standard transformer layout or requires custom inference code, while depth pruning removes entire layers and can cause abrupt accuracy drops. In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink embedding/unembedding and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT enjoys merits of both depth and width pruning, such as: deployment-friendliness (keeps a standard transformer architecture), scale-adaptivity (trade off vocab vs. FFN pruning), training-free operation with competitive pruning time, and strong memory savings alongside throughput gains. Experiments across Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream task performance at similar or higher pruning ratios, with substantial reductions in parameters, GPU memory, and end-to-end latency.
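
The common-token-weighted channel scoring is concrete enough to sketch with numpy, assuming calibration activations and corpus token frequencies are available; the exact weighting and selection rule in the paper may differ.

```python
import numpy as np

def channel_importance(acts, token_freq):
    """Score FFN intermediate channels by activation magnitude weighted by
    token frequency, aligning importance with the post-pruning (common-token)
    distribution. acts: [n_tokens, n_channels]; token_freq: [n_tokens]."""
    w = token_freq / token_freq.sum()
    return (w[:, None] * np.abs(acts)).sum(axis=0)     # [n_channels]

def channels_to_keep(acts, token_freq, keep_ratio=0.75):
    imp = channel_importance(acts, token_freq)
    k = int(len(imp) * keep_ratio)
    return np.sort(np.argsort(imp)[-k:])               # indices retained

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))      # calibration tokens x FFN channels
freq = rng.zipf(1.5, size=1000).astype(float)
print(channels_to_keep(acts, freq)[:10])
```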

[62] EPT Benchmark: Evaluation of Persian Trustworthiness in Large Language Models

Mohammad Reza Mirbagheri, Mohammad Mahdi Mirkamali, Zahra Motoshaker Arani, Ali Javeri, Amir Mahdi Sadeghzadeh, Rasool Jalili

Main category: cs.CL

TL;DR: Introduces EPT metric for evaluating LLM trustworthiness in Persian context across 6 dimensions, revealing significant safety deficiencies in major models.

DetailsMotivation: Address the critical challenge of ensuring LLM trustworthiness and alignment with Persian ethical-cultural values, as reliability is essential for responsible AI systems.

Method: Developed culturally informed EPT benchmark with labeled dataset, evaluated leading LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, LLaMA, Mistral, Qwen) using both automated LLM-based and human assessments across six trustworthiness aspects.

Result: Revealed significant deficiencies in safety dimension across evaluated models, providing insights into alignment with Persian ethical-cultural values and identifying critical gaps.

Conclusion: Highlights urgent need for focused attention on safety aspects and provides valuable benchmark for advancing trustworthy and culturally responsible AI in Persian context.

Abstract: Large Language Models (LLMs), trained on extensive datasets using advanced deep learning architectures, have demonstrated remarkable performance across a wide range of language tasks, becoming a cornerstone of modern AI technologies. However, ensuring their trustworthiness remains a critical challenge, as reliability is essential not only for accurate performance but also for upholding ethical, cultural, and social values. Careful alignment of training data and culturally grounded evaluation criteria are vital for developing responsible AI systems. In this study, we introduce the EPT (Evaluation of Persian Trustworthiness) metric, a culturally informed benchmark specifically designed to assess the trustworthiness of LLMs across six key aspects: truthfulness, safety, fairness, robustness, privacy, and ethical alignment. We curated a labeled dataset and evaluated the performance of several leading models - including ChatGPT, Claude, DeepSeek, Gemini, Grok, LLaMA, Mistral, and Qwen - using both automated LLM-based and human assessments. Our results reveal significant deficiencies in the safety dimension, underscoring the urgent need for focused attention on this critical aspect of model behavior. Furthermore, our findings offer valuable insights into the alignment of these models with Persian ethical-cultural values and highlight critical gaps and opportunities for advancing trustworthy and culturally responsible AI. The dataset is publicly available at: https://github.com/Rezamirbagheri110/EPT-Benchmark.

[63] The Majority is not always right: RL training for solution aggregation

Wenting Zhao, Pranjal Aggarwal, Swarnadeep Saha, Asli Celikyilmaz, Jason Weston, Ilia Kulikov

Main category: cs.CL

TL;DR: AggLM learns to aggregate multiple LLM solutions using reinforcement learning with verifiable rewards, outperforming majority voting and reward models while being more token-efficient.

DetailsMotivation: Current approaches like majority voting or reward model ranking provide limited benefits for aggregating multiple LLM solutions to reasoning tasks.

Method: Train an aggregator model using reinforcement learning from verifiable rewards to review, reconcile, and synthesize final answers from candidate solutions, with careful balancing of easy and hard training examples.

Result: AggLM outperforms rule-based and reward-model baselines across multiple benchmarks, generalizes to solutions from different models (including stronger ones), and requires fewer tokens than majority voting.

Conclusion: Learning aggregation as an explicit reasoning skill through reinforcement learning with verifiable rewards is an effective approach for improving LLM performance on reasoning tasks.

Abstract: Scaling up test-time compute, by generating multiple independent solutions and selecting or aggregating among them, has become a central paradigm for improving large language models (LLMs) on challenging reasoning tasks. While most prior work relies on simple majority voting or reward model ranking to aggregate solutions, these approaches may only yield limited benefits. In this work, we propose to learn aggregation as an explicit reasoning skill: given a set of candidate solutions, we train an aggregator model to review, reconcile, and synthesize a final, correct answer using reinforcement learning from verifiable rewards. A key ingredient is careful balancing of easy and hard training examples, allowing the model to learn both to recover minority-but-correct answers as well as easy majority-correct answers. Empirically, we find our method, AggLM, outperforms both strong rule-based and reward-model baselines, across multiple benchmarks. Furthermore, it generalizes effectively to solutions from differing models, including stronger ones than contained in the training data, all while requiring substantially fewer tokens than majority voting with larger numbers of solutions.
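
Two ingredients of the recipe are easy to make concrete: the verifiable reward on the aggregated answer, and the easy/hard balancing, reading "easy" as "majority voting would already succeed". Helper names and the 50/50 mix below are assumptions.

```python
import random
from collections import Counter

def reward(aggregated: str, gold: str) -> float:
    """Verifiable reward: 1 if the synthesized final answer matches gold."""
    return float(aggregated.strip() == gold.strip())

def is_easy(candidates: list[str], gold: str) -> bool:
    """'Easy' if plain majority voting over the candidates is already correct."""
    top, _ = Counter(c.strip() for c in candidates).most_common(1)[0]
    return top == gold.strip()

def balanced_batch(examples, batch_size=8, easy_frac=0.5):
    """Mix majority-correct and minority-correct groups so the aggregator
    learns both to confirm majorities and to rescue correct minorities."""
    easy = [e for e in examples if is_easy(e["candidates"], e["gold"])]
    hard = [e for e in examples if not is_easy(e["candidates"], e["gold"])]
    n_easy = min(int(batch_size * easy_frac), len(easy))
    return (random.sample(easy, n_easy)
            + random.sample(hard, min(batch_size - n_easy, len(hard))))

ex = {"candidates": ["42", "41", "42"], "gold": "41"}
print(is_easy(ex["candidates"], ex["gold"]), reward("41", "41"))  # False 1.0
```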

[64] UNH at CheckThat! 2025: Fine-tuning Vs Prompting in Claim Extraction

Joe Wilder, Nikhil Kadapala, Benji Xu, Mohammed Alsaadi, Aiden Parsons, Mitchell Rogers, Palash Agarwal, Adam Hassick, Laura Dietz

Main category: cs.CL

TL;DR: Fine-tuned FLAN-T5 achieved best METEOR score for check-worthy claim extraction, but other methods sometimes produced higher-quality claims despite lower scores.

DetailsMotivation: To explore various prompting and in-context learning methods for extracting check-worthy claims from social media passages in the CheckThat! Task 2 English competition.

Method: Tested few-shot prompting and fine-tuning with different LLM families, including FLAN-T5 models.

Result: Fine-tuned FLAN-T5 achieved the best METEOR score, but other methods sometimes extracted higher-quality claims even with lower METEOR scores.

Conclusion: METEOR score alone may not fully capture claim quality, as alternative methods can produce superior claims despite lower metric performance.

Abstract: We participate in CheckThat! Task 2 English and explore various methods of prompting and in-context learning, including few-shot prompting and fine-tuning with different LLM families, with the goal of extracting check-worthy claims from social media passages. Our best METEOR score is achieved by fine-tuning a FLAN-T5 model. However, we observe that higher-quality claims can sometimes be extracted using other methods, even when their METEOR scores are lower.

[65] mmBERT: A Modern Multilingual Encoder with Annealed Language Learning

Marc Marone, Orion Weller, William Fleshman, Eugene Yang, Dawn Lawrie, Benjamin Van Durme

Main category: cs.CL

TL;DR: mmBERT is a new multilingual encoder-only language model pretrained on 3T tokens across 1800+ languages, featuring novel techniques like inverse mask ratio schedule and inverse temperature sampling that achieve state-of-the-art performance on classification and retrieval tasks.

DetailsMotivation: There has been a lack of recent research on encoder-only models, particularly multilingual ones, despite their widespread use in standard ML tasks like classification and retrieval.

Method: Pretrained on 3T multilingual tokens using novel techniques: inverse mask ratio schedule, inverse temperature sampling ratio, and adding 1700+ low-resource languages only during the decay phase to maximize performance gains from limited data.

Result: Achieves similar classification performance to OpenAI’s o3 and Google’s Gemini 2.5 Pro, significantly outperforms previous generation models on both high and low-resource languages for classification and retrieval tasks.

Conclusion: mmBERT demonstrates that strategic training techniques can dramatically boost multilingual encoder model performance, particularly for low-resource languages, achieving state-of-the-art results with efficient data utilization.

Abstract: Encoder-only language models are frequently used for a variety of standard machine learning tasks, including classification and retrieval. However, there has been a lack of recent research for encoder models, especially with respect to multilingual models. We introduce mmBERT, an encoder-only language model pretrained on 3T tokens of multilingual text in over 1800 languages. To build mmBERT we introduce several novel elements, including an inverse mask ratio schedule and an inverse temperature sampling ratio. We add over 1700 low-resource languages to the data mix only during the decay phase, showing that it boosts performance dramatically and maximizes the gains from the relatively small amount of training data. Despite only including these low-resource languages in the short decay phase we achieve similar classification performance to models like OpenAI’s o3 and Google’s Gemini 2.5 Pro. Overall, we show that mmBERT significantly outperforms the previous generation of models on classification and retrieval tasks – on both high and low-resource languages.
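
The two sampling ideas lend themselves to a quick numeric sketch: temperature-scaled language sampling, where lowering the exponent flattens the distribution toward low-resource languages, and a schedule that moves that exponent and the mask ratio over training. The endpoint values and the linear shape are illustrative, not the paper's.

```python
import numpy as np

def language_probs(token_counts, tau):
    """Temperature-scaled sampling: p_i proportional to f_i ** tau.
    tau=1 matches the corpus; tau -> 0 up-weights low-resource languages."""
    p = token_counts.astype(float) ** tau
    return p / p.sum()

def linear_schedule(step, total_steps, start, end):
    """Moves a value from start to end across training (illustrative shape)."""
    t = min(step / total_steps, 1.0)
    return start + t * (end - start)

counts = np.array([9e11, 5e10, 2e8, 4e6])  # toy tokens-per-language counts
for step in (0, 50_000, 100_000):
    tau = linear_schedule(step, 100_000, 0.7, 0.3)
    mask = linear_schedule(step, 100_000, 0.30, 0.15)
    print(f"step {step:>6}: tau={tau:.2f} mask_ratio={mask:.2f}",
          language_probs(counts, tau).round(4))
```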

[66] Proof-Carrying Numbers (PCN): A Protocol for Trustworthy Numeric Answers from LLMs via Claim Verification

Aivin V. Solatorio

Main category: cs.CL

TL;DR: Proof-Carrying Numbers (PCN) is a presentation-layer protocol that mechanically verifies numeric claims from LLMs to prevent numeric hallucinations, ensuring only verified numbers are displayed as trustworthy.

DetailsMotivation: LLMs often generate numeric hallucinations - fabricated or misquoted numbers that appear correct. Existing safeguards like retrieval-augmented generation and citations improve transparency but cannot guarantee numeric fidelity.

Method: PCN emits numeric spans as claim-bound tokens tied to structured claims, with a verifier that checks each token under declared policies (exact equality, rounding, tolerance, etc.). Verification happens at the renderer level, not the model level, preventing spoofing.

Result: PCN provides soundness, completeness under honest tokens, fail-closed behavior, and monotonicity under policy refinement. It is lightweight, model-agnostic, and integrates seamlessly into existing applications.

Conclusion: PCN establishes a trust contract where numeric trust is earned only through mechanical verification, while unverified numbers are clearly marked as uncertain, providing reliable numeric fidelity for sensitive applications.

Abstract: Large Language Models (LLMs) as stochastic systems may generate numbers that deviate from available data, a failure known as \emph{numeric hallucination}. Existing safeguards – retrieval-augmented generation, citations, and uncertainty estimation – improve transparency but cannot guarantee fidelity: fabricated or misquoted values may still be displayed as if correct. We propose \textbf{Proof-Carrying Numbers (PCN)}, a presentation-layer protocol that enforces numeric fidelity through mechanical verification. Under PCN, numeric spans are emitted as \emph{claim-bound tokens} tied to structured claims, and a verifier checks each token under a declared policy (e.g., exact equality, rounding, aliases, or tolerance with qualifiers). Crucially, PCN places verification in the \emph{renderer}, not the model: only claim-checked numbers are marked as verified, and all others default to unverified. This separation prevents spoofing and guarantees fail-closed behavior. We formalize PCN and prove soundness, completeness under honest tokens, fail-closed behavior, and monotonicity under policy refinement. PCN is lightweight and model-agnostic, integrates seamlessly into existing applications, and can be extended with cryptographic commitments. By enforcing verification as a mandatory step before display, PCN establishes a simple contract for numerically sensitive settings: \emph{trust is earned only by proof}, while the absence of a mark communicates uncertainty.
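
A toy renderer-side verifier conveys the protocol's flavor. The claim wire format and policy names below are assumptions modeled on the abstract's examples; the property worth noting is the fail-closed default, where anything without a checkable claim renders as unverified.

```python
import math

def verify(value: float, claim: dict) -> bool:
    """Check one claim-bound numeric span against its structured claim."""
    src, policy = claim["source_value"], claim.get("policy", "exact")
    if policy == "exact":
        return value == src
    if policy == "round":
        return value == round(src, claim.get("digits", 0))
    if policy == "tolerance":
        return math.isclose(value, src, abs_tol=claim.get("abs_tol", 0.0))
    return False  # unknown policy: fail closed

def render(spans):
    """Mark only claim-checked numbers as verified; all else is unverified."""
    return " ".join(
        f"{text}[{'verified' if claim is not None and verify(v, claim) else 'unverified'}]"
        for text, v, claim in spans)

spans = [
    ("grew 12.3%", 12.3, {"source_value": 12.34, "policy": "round", "digits": 1}),
    ("to $4.5M", 4.5, {"source_value": 4.2, "policy": "tolerance", "abs_tol": 0.1}),
    ("across 7 regions", 7.0, None),   # no claim attached -> unverified
]
print(render(spans))
```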

[67] Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning

Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong

Main category: cs.CL

TL;DR: Bilevel optimization method that integrates SFT and RL training for LLM reasoning, enabling SFT to meta-learn how to guide RL optimization for better performance and efficiency.

DetailsMotivation: Current decoupled two-stage SFT+RL approach limits interaction between training paradigms, constraining overall effectiveness and efficiency in reasoning tasks.

Method: Bilevel optimization framework where lower level performs RL updates with SFT supervision, and upper level maximizes cooperative gain (performance advantage of joint SFT-RL over RL alone). SFT objective is conditioned on optimal RL policy.

Result: Outperforms baselines on five reasoning benchmarks, achieving better balance between effectiveness and efficiency.

Conclusion: The proposed bilevel optimization approach enables better cooperation between SFT and RL, leading to improved reasoning performance and training efficiency compared to traditional decoupled methods.

Abstract: Reinforcement learning (RL) has proven effective in incentivizing the reasoning abilities of large language models (LLMs), but suffers from severe efficiency challenges due to its trial-and-error nature. While the common practice employs supervised fine-tuning (SFT) as a warm-up stage for RL, this decoupled two-stage approach limits interaction between SFT and RL, thereby constraining overall effectiveness. This study introduces a novel method for learning reasoning models that employs bilevel optimization to facilitate better cooperation between these training paradigms. By conditioning the SFT objective on the optimal RL policy, our approach enables SFT to meta-learn how to guide RL’s optimization process. During training, the lower level performs RL updates while simultaneously receiving SFT supervision, and the upper level explicitly maximizes the cooperative gain-the performance advantage of joint SFT-RL training over RL alone. Empirical evaluations on five reasoning benchmarks demonstrate that our method consistently outperforms baselines and achieves a better balance between effectiveness and efficiency.

[68] Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models

Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, Mengdi Wang

Main category: cs.CL

TL;DR: TraceRL is a trajectory-aware RL framework for diffusion language models that improves reasoning performance on math and coding tasks, enabling state-of-the-art models (TraDo series) that outperform larger AR models.

DetailsMotivation: To enhance diffusion language models' reasoning capabilities by incorporating preferred inference trajectories into post-training, making them more effective for complex tasks like mathematics and coding.

Method: Proposes TraceRL framework with diffusion-based value model for training stability, uses curriculum learning, and applies to adapt block-specific models to larger blocks for improved sampling flexibility.

Result: TraDo-4B-Instruct outperforms 7B-scale AR models on math reasoning; TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct; the first long-CoT DLM shows an 18.1% relative accuracy gain on MATH500.

Conclusion: TraceRL effectively enhances diffusion language models’ reasoning performance, demonstrating superior results over larger autoregressive models while providing an open-source framework for building and deploying diffusion LLMs across various architectures.

Abstract: We propose TraceRL, a trajectory-aware reinforcement learning framework for diffusion language models (DLMs) that incorporates preferred inference trajectory into post-training, and is applicable across different architectures. Equipped with a diffusion-based value model that enhances training stability, we demonstrate improved reasoning performance on complex math and coding tasks. Besides, it can also be applied to adapt block-specific models to larger blocks, which improves sampling flexibility. Employing TraceRL, we derive a series of state-of-the-art diffusion language models, namely TraDo. Although smaller than 7B-scale AR models, TraDo-4B-Instruct still consistently outperforms them across complex math reasoning tasks. TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks. Through curriculum learning, we also derive the first long-CoT DLM, outperforming Qwen2.5-7B-Instruct on MATH500 with an 18.1% relative accuracy gain. To facilitate reproducible research and practical applications, we release a comprehensive open-source framework for building, training, and deploying diffusion LLMs across diverse architectures. The framework integrates accelerated KV-cache techniques and inference engines for both inference and reinforcement learning, and includes implementations of various supervised fine-tuning and RL methods for mathematics, coding, and general tasks. Code and Models: https://github.com/Gen-Verse/dLLM-RL

[69] On the Same Wavelength? Evaluating Pragmatic Reasoning in Language Models across Broad Concepts

Linlu Qiu, Cedegao E. Zhang, Joshua B. Tenenbaum, Yoon Kim, Roger P. Levy

Main category: cs.CL

TL;DR: Evaluation framework using Wavelength game shows large LMs have strong pragmatic reasoning in comprehension, while RSA improves production performance significantly.

DetailsMotivation: To understand language models' pragmatic reasoning abilities as conversational agents, since language use is shaped by pragmatics involving communicative goals and contextual norms.

Method: Proposed evaluation framework based on Wavelength communication game, testing LMs on comprehension and production using direct prompting, Chain-of-Thought prompting, and Rational Speech Act (RSA) approach with Bayesian pragmatic reasoning.

Result: State-of-the-art LMs achieve human-like accuracy in comprehension without CoT or RSA, while on production, CoT outperforms direct prompting and RSA provides significant improvements over both approaches.

Conclusion: Study identifies LM pragmatic reasoning strengths/limitations, demonstrates RSA’s potential for improvement, and opens avenues for understanding conceptual representation and social reasoning in both LMs and humans.

Abstract: Language use is shaped by pragmatics – i.e., reasoning about communicative goals and norms in context. As language models (LMs) are increasingly used as conversational agents, it becomes ever more important to understand their pragmatic reasoning abilities. We propose an evaluation framework derived from Wavelength, a popular communication game where a speaker and a listener communicate about a broad range of concepts in a granular manner. We study a range of LMs on both language comprehension and language production using direct and Chain-of-Thought (CoT) prompting, and further explore a Rational Speech Act (RSA) approach to incorporating Bayesian pragmatic reasoning into LM inference. We find that state-of-the-art LMs, but not smaller ones, achieve strong performance on language comprehension, obtaining similar-to-human accuracy and exhibiting high correlations with human judgments even without CoT prompting or RSA. On language production, CoT can outperform direct prompting, and using RSA provides significant improvements over both approaches. Our study helps identify the strengths and limitations in LMs’ pragmatic reasoning abilities and demonstrates the potential for improving them with RSA, opening up future avenues for understanding conceptual representation, language understanding, and social reasoning in LMs and humans.
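
Since the paper builds on the standard Rational Speech Act model, a minimal sketch of vanilla RSA may be useful; the toy lexicon, uniform prior, and rationality parameter below are illustrative, not the paper's experimental setup:

```python
import numpy as np

# Toy RSA: lexicon[u, m] = 1 iff utterance u is literally true of meaning m.
lexicon = np.array([
    [1, 1, 0],   # "warm" is true of meanings 0 and 1
    [0, 1, 1],   # "hot"  is true of meanings 1 and 2
])
prior = np.ones(3) / 3            # uniform prior over meanings
alpha = 1.0                       # speaker rationality

def normalize(m, axis):
    return m / m.sum(axis=axis, keepdims=True)

L0 = normalize(lexicon * prior, axis=1)                        # literal listener P(m|u)
S1 = normalize(np.exp(alpha * np.log(L0.T + 1e-12)), axis=1)   # speaker P(u|m)
L1 = normalize(S1.T * prior, axis=1)                           # pragmatic listener P(m|u)
print(L1)   # "warm" now favors meaning 0, "hot" favors meaning 2
```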

[70] Multiple Noises in Diffusion Model for Semi-Supervised Multi-Domain Translation

Tsiry Mayet, Simon Bernard, Romain Herault, Clement Chatelain

Main category: cs.CL

TL;DR: Multi-Domain Diffusion (MDD) enables flexible translation between arbitrary domain configurations without separate models, handling missing views as noise in diffusion process.

DetailsMotivation: Address the challenge of multi-domain translation where mappings between arbitrary domain configurations need to be learned efficiently without requiring separate models for each specific translation setup.

Method: Introduces Multi-Domain Diffusion (MDD) that exploits noise formulation of diffusion models, modeling one noise level per domain. Handles missing views by representing them as noise in the diffusion process, enabling semi-supervised learning without modification.

Result: MDD successfully learns translations between any combination of domains and handles semi-supervised contexts with arbitrary supervision configurations.

Conclusion: The proposed MDD method provides an efficient and flexible approach for multi-domain translation that inherently handles semi-supervised learning by leveraging diffusion model noise formulation.

Abstract: In this work, we address the challenge of multi-domain translation, where the objective is to learn mappings between arbitrary configurations of domains within a defined set (such as $(D_1, D_2)\rightarrow{}D_3$, $D_2\rightarrow{}(D_1, D_3)$, $D_3\rightarrow{}D_1$, etc. for three domains) without the need for separate models for each specific translation configuration, enabling more efficient and flexible domain translation. We introduce Multi-Domain Diffusion (MDD), a method with dual purposes: i) reconstructing any missing views for new data objects, and ii) enabling learning in semi-supervised contexts with arbitrary supervision configurations. MDD achieves these objectives by exploiting the noise formulation of diffusion models, specifically modeling one noise level per domain. Similar to existing domain translation approaches, MDD learns the translation between any combination of domains. However, unlike prior work, our formulation inherently handles semi-supervised learning without modification by representing missing views as noise in the diffusion process. We evaluate our approach through domain translation experiments on BL3NDT, a multi-domain synthetic dataset designed for challenging semantic domain inversion, the BraTS2020 dataset, and the CelebAMask-HQ dataset.
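
A minimal sketch of the per-domain noising idea, under stated assumptions (a shared tensor shape across views and a generic cumulative noise schedule); this illustrates the abstract's description and is not the authors' code:

```python
import torch

def noise_views(views: dict, timesteps: dict, alpha_bar: torch.Tensor,
                shape=(3, 64, 64)) -> dict:
    """views: domain -> tensor or None (missing view);
    timesteps: domain -> int noise level; alpha_bar: cumulative schedule."""
    out = {}
    for name, x in views.items():
        if x is None:                          # missing view == pure noise
            out[name] = torch.randn(shape)
        else:                                  # standard DDPM forward step
            a = alpha_bar[timesteps[name]]
            out[name] = a.sqrt() * x + (1 - a).sqrt() * torch.randn_like(x)
    return out
```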

[71] Support or Refute: Analyzing the Stance of Evidence to Detect Out-of-Context Mis- and Disinformation

Xin Yuan, Jie Guo, Weidong Qiu, Zheng Huang, Shujun Li

Main category: cs.CL

TL;DR: Proposed stance extraction network (SEN) for detecting out-of-context misinformation by analyzing evidence stances, achieving 3.2% accuracy improvement over state-of-the-art methods.

DetailsMotivation: Existing methods for detecting out-of-context misinformation ignore the importance of evidence stances, which represent biases toward different detection outcomes.

Method: Developed a stance extraction network (SEN) that extracts stances from multi-modal evidence in a unified framework, incorporating support-refutation scores based on named entity co-occurrence relations.

Result: Extensive experiments on a large-scale public dataset showed the proposed method outperformed state-of-the-art baselines with 3.2% accuracy improvement.

Conclusion: The stance-aware approach effectively detects out-of-context misinformation by leveraging evidence stances, demonstrating superior performance compared to existing methods.

Abstract: Mis- and disinformation online have become a major societal problem as major sources of online harms of different kinds. One common form of mis- and disinformation is out-of-context (OOC) information, where different pieces of information are falsely associated, e.g., a real image combined with a false textual caption or a misleading textual description. Although some past studies have attempted to defend against OOC mis- and disinformation through external evidence, they tend to disregard the role of different pieces of evidence with different stances. Motivated by the intuition that the stance of evidence represents a bias towards different detection results, we propose a stance extraction network (SEN) that can extract the stances of different pieces of multi-modal evidence in a unified framework. Moreover, we introduce a support-refutation score calculated based on the co-occurrence relations of named entities into the textual SEN. Extensive experiments on a public large-scale dataset demonstrated that our proposed method outperformed the state-of-the-art baselines, with the best model achieving a performance gain of 3.2% in accuracy. The source code and checkpoints are publicly available at https://github.com/yx3266/SEN.
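
The abstract does not give the support-refutation formula, so the following is only a toy stand-in showing how a score could be derived from named-entity co-occurrence between a caption and a piece of evidence:

```python
def support_refutation_score(caption_entities: set[str],
                             evidence_entities: set[str]) -> float:
    """Toy score: positive when evidence mentions the caption's entities
    (support), negative when it mostly names other entities (refutation)."""
    if not caption_entities or not evidence_entities:
        return 0.0
    overlap = len(caption_entities & evidence_entities)
    disjoint = len(evidence_entities - caption_entities)
    return (overlap - disjoint) / (overlap + disjoint)
```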

Qihang Yang, Caimei Yang, Yu Liao, Ziman Zhuang

Main category: cs.CL

TL;DR: Omitting a VP in double centre-embedded structures creates a grammaticality illusion in several languages, and a similar illusion arises in Mandarin missing-NP structures. The study argues the Mandarin effect stems from ambiguous verb interpretations rather than a true grammatical illusion, supported by EEG experiments showing an N400 effect (semantic processing) instead of a P600 (syntactic violation), and showing that semantic cues dispel the illusion.

DetailsMotivation: To resolve the debate about whether the grammaticality illusion in Mandarin missing-NP double centre-embedded structures is truly a grammatical illusion or can be better explained by ambiguous verb interpretations.

Method: Two EEG experiments using quasi double centre-embedded structures with reduced complexity by placing self-embedding relative clauses in subject position. Experiment 1 tested the phenomenon, while Experiment 2 provided semantic cues to reduce ambiguity.

Result: Experiment 1 showed absence of P600 effect (no syntactic violation detection) and presence of N400 effect (semantic processing), supporting the verb ambiguity hypothesis. Experiment 2 showed that semantic cues dispelled the illusion, evidenced by P600 effect.

Conclusion: The phenomenon is best explained by ambiguous verb interpretations rather than grammatical illusion. Word-order differences may account for cross-linguistic variations, supporting garden-path theory interpretation.

Abstract: In several languages, omitting a verb phrase (VP) in double centre-embedded structures creates a grammaticality illusion. A similar illusion is also exhibited in Mandarin missing-NP double centre-embedded structures. However, there is no consensus on its very nature. Instead of treating it as a grammaticality illusion, we argue that ambiguous interpretations of verbs can best account for this phenomenon in Mandarin. To further support this hypothesis, we conducted two electroencephalography (EEG) experiments on quasi double centre-embedded structures whose complexity is reduced by placing the self-embedding relative clauses into the sentence’s subject position. Experiment 1 showed that a similar phenomenon emerged even in this structure, evidenced by the absence of a P600 effect and the presence of an N400 effect. In Experiment 2, providing semantic cues to reduce ambiguity dispelled this illusion, as evidenced by a P600 effect. We interpret the results under garden-path theory and propose that word-order difference may account for this cross-linguistic variation.

[73] AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yu-Gang Jiang, Xipeng Qiu

Main category: cs.CL

TL;DR: AnyGPT is an any-to-any multimodal language model that uses discrete representations to unify processing of speech, text, images, and music without changing LLM architecture, achieving performance comparable to specialized models.

DetailsMotivation: To create a unified multimodal language model that can handle any combination of inputs and outputs across different modalities (speech, text, images, music) without requiring architectural changes to existing LLMs.

Method: Uses discrete representations for multimodal processing, data-level preprocessing, builds multimodal text-centric dataset for alignment pre-training, and synthesizes large-scale any-to-any multimodal instruction dataset with 108k multi-turn conversation samples.

Result: AnyGPT successfully facilitates any-to-any multimodal conversations and achieves performance comparable to specialized models across all modalities, demonstrating that discrete representations can effectively unify multiple modalities.

Conclusion: Discrete representations provide an effective and convenient way to unify multiple modalities within language models, enabling stable training without architectural changes while maintaining competitive performance across diverse modalities.

Abstract: We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a multimodal text-centric dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are shown in https://junzhan2000.github.io/AnyGPT.github.io/
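
A sketch of the data-level idea: non-text modalities are discretized and spliced into the token stream behind modality tags, so an unmodified LLM can model them. The tag strings below are placeholders, not AnyGPT's actual special tokens:

```python
def to_training_sequence(text: str, speech_units: list[int]) -> str:
    """Interleave text with discretized speech units behind modality tags."""
    speech = " ".join(f"<unit_{u}>" for u in speech_units)
    return f"{text} <speech> {speech} </speech>"

print(to_training_sequence("Transcribe:", [12, 7, 99]))
# Transcribe: <speech> <unit_12> <unit_7> <unit_99> </speech>
```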

[74] Repetition Improves Language Model Embeddings

Jacob Mitchell Springer, Suhas Kotha, Daniel Fried, Graham Neubig, Aditi Raghunathan

Main category: cs.CL

TL;DR: Echo embeddings convert autoregressive LMs into text embedding models by repeating input tokens, achieving strong performance without architectural changes or fine-tuning.

DetailsMotivation: Challenge the premise that bidirectional models are essential for strong text embeddings by showing autoregressive LMs can be adapted effectively.

Method: Repeat input tokens and extract embeddings from repeated tokens that have access to all original tokens, enabling bidirectional context without architectural modifications.

Result: Improves over classical LM embeddings by over 5% in zero-shot settings, nearly matches bidirectionally-converted LMs with MLM training, and performs competitively in supervised settings.

Conclusion: Repetition is a simple and effective strategy to circumvent the need for bidirectional attention, enabling unified architecture for all NLP tasks.

Abstract: Bidirectional models are considered essential for strong text embeddings. Recent approaches to adapt autoregressive language models (LMs) into strong text embedding models have largely had the requirement to modify the LM architecture to be bidirectional. We challenge this premise by introducing “echo embeddings” which converts autoregressive LMs into high quality text embedding models without changing the architecture or requiring fine-tuning. By repeating the input and extracting embeddings from the repeated tokens – which have access to all original tokens – echo embeddings improve over classical LM embeddings by over 5% in zero-shot settings. Our zero-shot embeddings nearly match those obtained by bidirectionally-converted LMs that undergo additional masked-language modeling training. Echo embeddings are also compatible with supervised fine-tuning, matching or outperforming bidirectionally-converted LMs in an apples-to-apples comparison, even with an identical compute budget during training and inference. Overall, repetition is a simple and effective strategy to circumvent the need for bidirectional attention in embedding models, paving the way towards a unified architecture for all NLP tasks.
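
A minimal sketch of echo embeddings with a small open autoregressive model; the prompt template and the token-span arithmetic approximate the idea rather than reproduce the paper's exact recipe:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")        # any autoregressive LM
model = AutoModel.from_pretrained("gpt2").eval()

def echo_embed(text: str) -> torch.Tensor:
    # Feed the text twice; tokens in the second copy can attend to the
    # whole first copy, giving them bidirectional-like context.
    prompt = f"Rewrite the sentence: {text}\nRewritten: {text}"
    ids = tok(prompt, return_tensors="pt")
    n_echo = len(tok(" " + text)["input_ids"])     # approx. span of 2nd copy
    with torch.no_grad():
        hidden = model(**ids).last_hidden_state    # (1, seq_len, dim)
    return hidden[0, -n_echo:].mean(dim=0)         # pool over repeated tokens
```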

[75] Linearly Controlled Language Generation with Performative Guarantees

Emily Cheng, Carmen Amo Alonso

Main category: cs.CL

TL;DR: A control-theoretic approach for steering LLM text generation away from undesired meanings by dynamically intervening in the latent semantic space using optimal control techniques.

DetailsMotivation: Need for computationally efficient and performance-guaranteed controlled language generation strategies in critical LLM applications.

Method: Gradient-free intervention that dynamically steers token activations in embedding space using optimal control theory to precisely guide trajectories away from undesired semantic regions.

Result: Effective toxicity avoidance and sentiment control while maintaining text quality, with minimal impact on generation time.

Conclusion: Control-theoretic treatment of text generation enables fine-grained steering of attributes with performance guarantees and computational efficiency.

Abstract: The increasing prevalence of Large Language Models (LMs) in critical applications highlights the need for controlled language generation strategies that are not only computationally efficient but that also enjoy performance guarantees. To achieve this, we use a common model of concept semantics as linearly represented in an LM’s latent space. In particular, we take the view that natural language generation traces a trajectory in this continuous semantic space, realized by the language model’s hidden activations. This view permits a control-theoretic treatment of text generation in latent space, in which we propose a lightweight, gradient-free intervention that dynamically steers trajectories away from regions corresponding to undesired meanings. In particular, we propose to directly intervene on the activations of the token that is being generated in embedding space in an online fashion. Crucially, we do not simply steer activations towards a desirable region. Instead, our method relies on classical techniques from control theory to precisely control activations in a context-dependent way, and guarantees that they are brought into a specific pre-defined region of embedding space that corresponds to allowed semantics. Our intervention is computed in closed form according to an optimal controller formulation, minimally impacting generation time. This control of the activations in embedding space allows for fine-grained steering of attributes of the generated sequence. We demonstrate the effectiveness of our approach on two objectives, toxicity avoidance and sentiment control, while maintaining text quality.
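
The paper derives its controller from optimal control theory; as a simplified stand-in, the sketch below applies the minimal-norm additive correction that keeps an activation inside a linearly defined "allowed" half-space (the probe direction w and threshold b are assumed given):

```python
import numpy as np

def steer(h: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """Return h unchanged if w·h <= b (allowed region); otherwise apply
    the smallest additive control that projects h back onto the boundary."""
    slack = float(w @ h) - b
    if slack <= 0.0:
        return h
    return h - (slack / float(w @ w)) * w   # closed-form minimal-norm fix
```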

[76] Synth-SBDH: A Synthetic Dataset of Social and Behavioral Determinants of Health for Clinical Text

Avijit Mitra, Zhichao Yang, Emily Druhl, Raelene Goodwin, Hong Yu

Main category: cs.CL

TL;DR: Synth-SBDH is a synthetic dataset for extracting social and behavioral determinants of health from clinical text, offering detailed annotations across 15 categories and outperforming models without this training.

DetailsMotivation: Existing SBDH datasets have limitations in availability and coverage, creating a need for high-quality synthetic data to improve extraction of crucial health determinants from clinical text.

Method: Created Synth-SBDH synthetic dataset with detailed SBDH annotations including status, temporal information, and rationale across 15 categories. Tested on three tasks using real-world clinical datasets from two hospital settings.

Result: Models trained on Synth-SBDH consistently outperformed counterparts without this training, achieving up to 63.75% macro-F improvements. Effective for rare SBDH categories and under resource constraints, while being cheaper than expert-annotated data. Human evaluation showed 71.06% Human-LLM alignment.

Conclusion: Synth-SBDH demonstrates versatility, generalizability, and distillation capabilities for SBDH extraction, providing a cost-effective alternative to expert-annotated data while identifying areas for future refinement.

Abstract: Social and behavioral determinants of health (SBDH) play a crucial role in health outcomes and are frequently documented in clinical text. Automatically extracting SBDH information from clinical text relies on publicly available good-quality datasets. However, existing SBDH datasets exhibit substantial limitations in their availability and coverage. In this study, we introduce Synth-SBDH, a novel synthetic dataset with detailed SBDH annotations, encompassing status, temporal information, and rationale across 15 SBDH categories. We showcase the utility of Synth-SBDH on three tasks using real-world clinical datasets from two distinct hospital settings, highlighting its versatility, generalizability, and distillation capabilities. Models trained on Synth-SBDH consistently outperform counterparts with no Synth-SBDH training, achieving up to 63.75% macro-F improvements. Additionally, Synth-SBDH proves effective for rare SBDH categories and under-resource constraints while being substantially cheaper than expert-annotated real-world data. Human evaluation reveals a 71.06% Human-LLM alignment and uncovers areas for future refinements.

[77] A Principled Framework for Evaluating on Typologically Diverse Languages

Esther Ploeger, Wessel Poelman, Andreas Holck Høeg-Petersen, Anders Schlichtkrull, Miryam de Lhoneux, Johannes Bjerva

Main category: cs.CL

TL;DR: A systematic framework for selecting typologically diverse languages for multilingual NLP evaluation, outperforming previous methods in diversity and improving model generalizability assessment.

DetailsMotivation: Multilingual NLP needs representative language sampling for generalizable evaluation, but current 'typologically diverse' sampling methods are flawed and inconsistent, making proper evaluation across world languages impractical.

Method: Developed a language sampling framework informed by language typology to select highly typologically diverse languages given a sampling frame, comparing various sampling methods with multiple metrics.

Result: The systematic methods consistently retrieve more typologically diverse language selections than previous NLP methods, and this diversity affects generalizability in multilingual model evaluation.

Conclusion: Diverse language sampling is crucial for effective multilingual NLP evaluation, and the proposed systematic framework provides more reliable and representative language selection for assessing model performance across languages.

Abstract: Beyond individual languages, multilingual natural language processing (NLP) research increasingly aims to develop models that perform well across languages generally. However, evaluating these systems on all the world’s languages is practically infeasible. To attain generalizability, representative language sampling is essential. Previous work argues that generalizable multilingual evaluation sets should contain languages with diverse typological properties. However, ’typologically diverse’ language samples have been found to vary considerably in this regard, and popular sampling methods are flawed and inconsistent. We present a language sampling framework for selecting highly typologically diverse languages given a sampling frame, informed by language typology. We compare sampling methods with a range of metrics and find that our systematic methods consistently retrieve more typologically diverse language selections than previous methods in NLP. Moreover, we provide evidence that this affects generalizability in multilingual model evaluation, emphasizing the importance of diverse language sampling in NLP evaluation.
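
One concrete instance of a systematic diversity sampler (a hedged sketch, not necessarily the paper's exact algorithm): greedy max-min selection over typological feature vectors, such as binarized WALS or Grambank features:

```python
import numpy as np

def maxmin_sample(features: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Greedily add the language farthest from those already chosen.
    features: (n_languages, n_features) typological feature matrix."""
    chosen = [seed]
    while len(chosen) < k:
        d = np.linalg.norm(features[:, None, :] - features[None, chosen, :],
                           axis=-1)          # (n_langs, len(chosen))
        nearest = d.min(axis=1)              # distance to closest chosen lang
        nearest[chosen] = -np.inf            # never re-pick
        chosen.append(int(np.argmax(nearest)))
    return chosen
```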

[78] Affective Computing in the Era of Large Language Models: A Survey from the NLP Perspective

Yiqun Zhang, Xiaocui Yang, Xingle Xu, Zeran Gao, Yijie Huang, Shiyi Mu, Shi Feng, Daling Wang, Yifei Zhang, Kaisong Song, Ge Yu

Main category: cs.CL

TL;DR: This survey paper provides a comprehensive overview of Affective Computing in the LLM era, covering traditional tasks, adaptation techniques for affective understanding and generation, evaluation methods, and future challenges.

DetailsMotivation: Traditional fine-tuned pre-trained language models have limitations in affective computing - poor generalization across tasks and limited capability for diverse, emotionally appropriate response generation. The emergence of LLMs offers new opportunities with in-context learning, broader knowledge, and stronger generation capabilities.

Method: The survey consolidates traditional AC tasks and LLM-based studies, reviews adaptation techniques including Instruction Tuning (full and parameter-efficient methods), Prompt Engineering (zero/few-shot, chain-of-thought, agent-based), and Reinforcement Learning (RLHF, RLVR, RLAIF). It also compiles benchmarks and evaluation practices.

Result: The paper provides a comprehensive framework for understanding how LLMs can enhance affective computing capabilities, enabling finer-grained multi-objective control for empathy, safety, and planning in both affective understanding and generation tasks.

Conclusion: LLMs represent a paradigm shift in affective computing, offering practical guidance for building affect-aware, reliable, and responsible systems, though challenges remain in ethics, data quality, safety, robust evaluation, and resource efficiency that require further research.

Abstract: Affective Computing (AC) integrates computer science, psychology, and cognitive science to enable machines to recognize, interpret, and simulate human emotions across domains such as social media, finance, healthcare, and education. AC commonly centers on two task families: Affective Understanding (AU) and Affective Generation (AG). While fine-tuned pre-trained language models (PLMs) have achieved solid AU performance, they often generalize poorly across tasks and remain limited for AG, especially in producing diverse, emotionally appropriate responses. The advent of Large Language Models (LLMs) (e.g., ChatGPT and LLaMA) has catalyzed a paradigm shift by offering in-context learning, broader world knowledge, and stronger sequence generation. This survey presents an NLP-oriented overview of AC in the LLM era. We (i) consolidate traditional AC tasks and preliminary LLM-based studies; (ii) review adaptation techniques that improve AU/AG, including Instruction Tuning (full and parameter-efficient methods such as LoRA, P-/Prompt-Tuning), Prompt Engineering (zero/few-shot, chain-of-thought, agent-based prompting), and Reinforcement Learning. For the latter, we summarize RL from human preferences (RLHF), verifiable/programmatic rewards (RLVR), and AI feedback (RLAIF), which provide preference- or rule-grounded optimization signals that can help steer AU/AG toward empathy, safety, and planning, achieving finer-grained or multi-objective control. To assess progress, we compile benchmarks and evaluation practices for both AU and AG. We also discuss open challenges-from ethics, data quality, and safety to robust evaluation and resource efficiency-and outline research directions. We hope this survey clarifies the landscape and offers practical guidance for building affect-aware, reliable, and responsible LLM systems.

[79] Self-Alignment: Improving Alignment of Cultural Values in LLMs via In-Context Learning

Rochelle Choenni, Ekaterina Shutova

Main category: cs.CL

TL;DR: A simple and inexpensive method using in-context learning and human survey data improves LLM alignment with cultural values across multiple languages and culturally diverse countries.

DetailsMotivation: Improving alignment of Large Language Models with cultural values they encode has become increasingly important, as current models may not adequately represent diverse cultural perspectives.

Method: Uses in-context learning (ICL) combined with human survey data to adjust model responses to cultural value probes at inference time.

Result: Method improves cultural value alignment across 5 models including both English-centric and multilingual LLMs, works in languages other than English, and aligns with values from culturally diverse countries.

Conclusion: The approach provides an effective way to enhance cultural value alignment in LLMs without expensive retraining, demonstrating cross-lingual and cross-cultural applicability.

Abstract: Improving the alignment of Large Language Models (LLMs) with respect to the cultural values that they encode has become an increasingly important topic. In this work, we study whether we can exploit existing knowledge about cultural values at inference time to adjust model responses to cultural value probes. We present a simple and inexpensive method that uses a combination of in-context learning (ICL) and human survey data, and show that we can improve the alignment to cultural values across 5 models that include both English-centric and multilingual LLMs. Importantly, we show that our method could prove useful in test languages other than English and can improve alignment to the cultural values that correspond to a range of culturally diverse countries.
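
A sketch of what the inference-time setup could look like: survey items answered for the target country are prepended as in-context examples before the value probe. The survey content shown is a placeholder, not real survey data:

```python
def build_alignment_prompt(survey_pairs: list[tuple[str, str]],
                           probe: str, country: str) -> str:
    """Prepend country-specific survey answers as in-context examples."""
    header = f"Survey responses from participants in {country}:\n"
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in survey_pairs)
    return f"{header}{shots}\n\nQ: {probe}\nA:"

print(build_alignment_prompt(
    [("How important is family in your life?", "Very important")],
    "Is obedience an important quality for children?", "Japan"))
```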

[80] Towards No-Code Programming of Cobots: Experiments with Code Synthesis by Large Code Models for Conversational Programming

Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen

Main category: cs.CL

TL;DR: LLMs can generate basic assembly instructions for collaborative robots but struggle with higher-order programming concepts like functions and loops.

DetailsMotivation: Traditional collaborative robots require expert programming or manual guidance, limiting flexibility and expressivity in industrial assembly tasks.

Method: Created RATS dataset with assembly task instructions and code examples, then systematically evaluated state-of-the-art LLMs for conversational code generation using in-context learning.

Result: LLMs successfully generated accurate first-order code (instruction sequences) but had difficulties producing higher-order code (abstractions, functions, loops).

Conclusion: While LLMs show promise for conversational programming of collaborative robots, they currently excel at basic instruction sequences but need improvement for complex programming abstractions.

Abstract: While there has been a lot of research recently on robots in household environments, at the present time, most robots in existence can be found on shop floors, and most interactions between humans and robots happen there. “Collaborative robots” (cobots) designed to work alongside humans on assembly lines traditionally require expert programming, limiting the ability to make changes, or manual guidance, limiting the expressivity of the resulting programs. To address these limitations, we explore using Large Language Models (LLMs), and in particular their abilities of doing in-context learning, for conversational code generation. As a first step, we define RATS, the “Repetitive Assembly Task”, a 2D building task designed to lay the foundation for simulating industry assembly scenarios. In this task, a ‘programmer’ instructs a cobot, using natural language, on how a certain assembly is to be built; that is, the programmer induces a program, through natural language. We create a dataset that pairs target structures with various example instructions (human-authored, template-based, and model-generated) and example code. With this, we systematically evaluate the capabilities of state-of-the-art LLMs for synthesising this kind of code, given in-context examples. Evaluating in a simulated environment, we find that LLMs are capable of generating accurate ‘first-order code’ (instruction sequences), but have problems producing ‘higher-order code’ (abstractions such as functions, or use of loops).
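
To illustrate the distinction the evaluation draws (with a hypothetical cobot primitive, not actual RATS code): 'first-order code' is a flat instruction sequence, while 'higher-order code' abstracts the repetition into a loop:

```python
def build_first_order(place):
    # 'first-order code': a flat instruction sequence
    place("block", x=0, y=0)
    place("block", x=1, y=0)
    place("block", x=2, y=0)

def build_higher_order(place, width=3):
    # 'higher-order code': the repetition abstracted into a loop,
    # which the evaluated LLMs produced less reliably
    for x in range(width):
        place("block", x=x, y=0)
```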

[81] Extracting and Combining Abilities For Building Multi-lingual Ability-enhanced Large Language Models

Zhipeng Chen, Kun Zhou, Liang Song, Wayne Xin Zhao, Bingning Wang, Weipeng Chen, Ji-Rong Wen

Main category: cs.CL

TL;DR: MAEC extracts language-agnostic ability weights from LLMs and combines them across languages without training, achieving comparable performance to PaLM on mathematical and scientific tasks.

DetailsMotivation: Existing multilingual ability transfer methods rely on training data that may not be available for low-resource languages, creating a need for data-efficient approaches.

Method: Two-stage approach: 1) Extract ability-related weights by locating key neurons and extracting transferable weights, 2) Combine ability-related tensors with language-specific weights using addition/subtraction operations.

Result: Extensive experiments on LLaMA-3 8B show MAEC effectively extracts and combines advanced abilities, achieving performance comparable to PaLM in both high-resource and low-resource language scenarios.

Conclusion: MAEC provides an efficient training-free method for multilingual ability transfer that works well even for low-resource languages, demonstrating strong performance on mathematical and scientific tasks.

Abstract: Multi-lingual ability transfer has become increasingly important for the broad application of large language models (LLMs). Existing work highly relies on training with the multi-lingual ability-related data, which may not be available for low-resource languages. To address this, we propose a Multi-lingual Abilities Extraction and Combination approach, named MAEC. Our key idea is to decompose and extract language-agnostic ability-related weights from LLMs, and combine them across different languages by simple addition and subtraction operations without training. Specifically, our MAEC consists of the extraction and combination stages. In the extraction stage, we firstly locate key neurons that are highly related to specific abilities, and then employ them to extract the transferable ability-related weights. In the combination stage, we further select the ability-related tensors that mitigate the linguistic effects, and design a combining strategy based on them and the language-specific weights, to build the multi-lingual ability-enhanced LLM. To assess the effectiveness of our approach, we conduct extensive experiments on LLaMA-3 8B on mathematical and scientific tasks in both high-resource and low-resource lingual scenarios. Experimental results show that MAEC can effectively and efficiently extract and combine the advanced abilities, achieving comparable performance with PaLM. Resources are available at https://github.com/RUCAIBox/MAET.
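
A hedged sketch of the training-free combination step in the spirit of the description above, using task-vector-style addition and subtraction; the mask standing in for "key neuron" selection, and all names, are illustrative:

```python
import torch

def combine(base: dict, math_tuned: dict, lang_tuned: dict,
            mask: dict) -> dict:
    """base/math_tuned/lang_tuned: dicts of parameter tensors; mask: dict
    of 0/1 tensors marking ability-related entries (e.g., key neurons)."""
    out = {}
    for name, w in base.items():
        ability = (math_tuned[name] - w) * mask[name]   # ability-related delta
        language = lang_tuned[name] - w                 # language-specific delta
        out[name] = w + ability + language              # train-free combination
    return out
```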

[82] Conversational Code Generation: a Case Study of Designing a Dialogue System for Generating Driving Scenarios for Testing Autonomous Vehicles

Rimvydas Rubavicius, Antonio Valerio Miceli-Barone, Alex Lascarides, Subramanian Ramamoorthy

Main category: cs.CL

TL;DR: Natural language interface using LLM to help non-coders create simulation scenarios for autonomous vehicle testing, with dialogue proving critical for success.

DetailsMotivation: To enable non-coding domain experts to specify testing scenarios for autonomous vehicles in simulation without requiring programming expertise.

Method: Used an instruction-following large language model to convert natural language utterances into symbolic programs for scenario specification, despite limited training data.

Result: Dialogue-based interaction proved crucial, achieving 4.5 times higher success rate compared to generation without extended conversation.

Conclusion: Natural language interfaces with conversational capabilities are feasible and effective for domain experts to create complex simulation scenarios for autonomous vehicle testing.

Abstract: Cyber-physical systems like autonomous vehicles are tested in simulation before deployment, using domain-specific programs for scenario specification. To aid the testing of autonomous vehicles in simulation, we design a natural language interface, using an instruction-following large language model, to assist a non-coding domain expert in synthesising the desired scenarios and vehicle behaviours. We show that using it to convert utterances to the symbolic program is feasible, despite the very small training dataset. Human experiments show that dialogue is critical to successful simulation generation, leading to a 4.5 times higher success rate than a generation without engaging in extended conversation.

[83] GASE: Generatively Augmented Sentence Encoding

Manuel Frank, Haithem Afli

Main category: cs.CL

TL;DR: Training-free approach using generative text models for data augmentation at inference time to improve sentence embeddings without fine-tuning.

DetailsMotivation: To enhance sentence embeddings without requiring model parameter access or computational resources for fine-tuning state-of-the-art models.

Method: Generatively Augmented Sentence Encoding that variates input text through paraphrasing, summarizing, or keyword extraction, followed by pooling original and synthetic embeddings.

Result: Performance improvements on Massive Text Embedding Benchmark for Semantic Textual Similarity across various embedding models, with larger gains for models with lower baseline performance.

Conclusion: Generative augmentation at inference time adds semantic diversity and enhances robustness and generalizability of sentence embeddings, with performance gains dependent on both embedding model and dataset.

Abstract: We propose a training-free approach to improve sentence embeddings leveraging test-time compute by applying generative text models for data augmentation at inference time. Unlike conventional data augmentation that utilises synthetic training data, our approach does not require access to model parameters or the computational resources typically required for fine-tuning state-of-the-art models. Generatively Augmented Sentence Encoding variates the input text by paraphrasing, summarising, or extracting keywords, followed by pooling the original and synthetic embeddings. Experimental results on the Massive Text Embedding Benchmark for Semantic Textual Similarity (STS) demonstrate performance improvements across a range of embedding models using different generative models for augmentation. We find that generative augmentation leads to larger performance improvements for embedding models with lower baseline performance. These findings suggest that integrating generative augmentation at inference time adds semantic diversity and can enhance the robustness and generalisability of sentence embeddings for embedding models. Our results show that performance gains depend on the embedding model and the dataset.
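
A minimal sketch of the pipeline, assuming a generic `augment(text, mode)` callable that wraps whatever generative model produces the paraphrase, summary, or keywords:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # any embedding model

def gase_embed(text: str, augment) -> np.ndarray:
    """Embed the original plus generatively augmented variants, then pool."""
    variants = [text] + [augment(text, mode)
                         for mode in ("paraphrase", "summary", "keywords")]
    embs = encoder.encode(variants)          # (4, dim)
    return embs.mean(axis=0)                 # pool original + synthetic views
```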

[84] Exploring the Limits of Large Language Models: A Systematic Evaluation of Masked Text Processing Ability through MskQA and MskCal

Fuka Matsuzaki, Haru-Tada Sato

Main category: cs.CL

TL;DR: LLMs show limited ability to process masked text, with performance dropping significantly when semantic cues are completely removed, revealing their reliance on surface-level patterns rather than true comprehension.

DetailsMotivation: To evaluate the limitations of Large Language Models in processing masked text and understand their reliance on semantic cues for reasoning tasks.

Method: Introduced two novel tasks: MskQA (masked question-answering on datasets like RealtimeQA) and MskCal (masked arithmetic problems), testing GPT-4o and 4o-mini with different masking rates and semantic cue conditions.

Result: LLMs show some resilience to masked text but performance is highly dependent on masking rates and semantic cues. GPT-4o outperforms 4o-mini, especially in numerical reasoning. Solid masking (no semantic clues) causes significant performance drop compared to partial lifting.

Conclusion: Semantic cues play a crucial role in LLM reasoning processes. The study reveals LLMs’ reliance on surface-level patterns and highlights the need for more robust evaluation methods to assess true comprehension abilities.

Abstract: This paper sheds light on the limitations of Large Language Models (LLMs) by rigorously evaluating their ability to process masked text. We introduce two novel tasks: MskQA, measuring reasoning on masked question-answering datasets like RealtimeQA, and MskCal, assessing numerical reasoning on masked arithmetic problems. Testing GPT-4o and 4o-mini reveals that while LLMs exhibit some resilience to masked text, their performance is highly contingent on masking rates and semantic cues. Specifically, “solid masking,” where semantic clues are entirely absent, leads to a significant performance drop compared to “partial lifting,” where some semantic information is retained, indicating LLMs’ reliance on surface-level patterns. Interestingly, GPT-4o consistently outperforms 4o-mini, particularly in MskCal, demonstrating a greater ability to handle numerical reasoning with masked text. This underscores the crucial role of semantic cues in the reasoning process of LLMs. Our study illuminates the interplay between background knowledge and reasoning ability in masked text processing, paving the way for a deeper understanding of LLM capabilities and limitations, and highlighting the need for more robust evaluation methods to accurately assess their true comprehension abilities.
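
An illustrative masking utility in the spirit of the two conditions: "solid masking" hides a word entirely, while "partial lifting" keeps a semantic cue (here, the first character); the exact masking scheme is an assumption:

```python
import random

def mask_text(text: str, rate: float, partial: bool, seed: int = 0) -> str:
    """Mask a fraction `rate` of words, optionally leaving a partial cue."""
    rng = random.Random(seed)
    words = text.split()
    for i, w in enumerate(words):
        if rng.random() < rate:
            words[i] = (w[0] + "*" * (len(w) - 1)) if partial else "*" * len(w)
    return " ".join(words)

print(mask_text("seven times eight equals what", 0.6, partial=False))
print(mask_text("seven times eight equals what", 0.6, partial=True))
```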

[85] HierTOD: A Task-Oriented Dialogue System Driven by Hierarchical Goals

Lingbo Mo, Shun Jiang, Akash Maharaj, Bernard Hishamunda, Yunyao Li

Main category: cs.CL

TL;DR: HierTOD is an enterprise task-oriented dialogue system that uses hierarchical goals to handle complex workflows, combining slot-filling and step-by-step guidance for better task completion.

DetailsMotivation: Traditional TOD systems struggle with enterprise environments due to task complexity and lack of standardized documentation, requiring a more proactive approach.

Method: Uses hierarchical goal-driven interactions with components for natural language understanding, composite goal retrieval, dialogue management, and response generation, backed by domain knowledge base.

Result: Human evaluators found HierTOD effective and helpful, successfully unifying slot-filling and step-by-step guidance paradigms.

Conclusion: HierTOD demonstrates improved task assistance in enterprise environments through goal-driven hierarchical workflows and mixed-initiative dialogue.

Abstract: Task-Oriented Dialogue (TOD) systems assist users in completing tasks through natural language interactions, often relying on a single-layered workflow structure for slot-filling in public tasks, such as hotel bookings. However, in enterprise environments, which involve rich domain-specific knowledge, TOD systems face challenges due to task complexity and the lack of standardized documentation. In this work, we introduce HierTOD, an enterprise TOD system driven by hierarchical goals that can support composite workflows. By focusing on goal-driven interactions, our system serves a more proactive role, facilitating mixed-initiative dialogue and improving task completion. Equipped with components for natural language understanding, composite goal retriever, dialogue management, and response generation, backed by a well-organized data service with domain knowledge base and retrieval engine, HierTOD delivers efficient task assistance as judged by human evaluators. Furthermore, our system implementation unifies two TOD paradigms: slot-filling for information collection and step-by-step guidance for task execution. Our user study demonstrates the effectiveness and helpfulness of HierTOD in performing both paradigms.

[86] Lessons from Studying Two-Hop Latent Reasoning

Mikita Balesni, Tomek Korbak, Owain Evans

Main category: cs.CL

TL;DR: LLMs can perform latent two-hop reasoning but fail when both facts are synthetic, showing nuanced reasoning capabilities that require careful experimental design to avoid false conclusions.

DetailsMotivation: To investigate whether LLMs have latent reasoning capabilities for two-hop question answering, as this basic capability would indicate potential for complex agentic tasks requiring chain-of-thought reasoning.

Method: Fine-tuned LLMs (including Llama 3 8B and GPT-4o) on synthetic facts and tested two-hop reasoning in a controlled setting to eliminate memorization and reasoning shortcuts.

Result: Models failed to compose two synthetic facts but succeeded when one fact was synthetic and the other natural, demonstrating clear latent two-hop reasoning capability.

Conclusion: LLMs are capable of latent two-hop reasoning, but researchers must design experiments carefully to avoid both spurious successes (from memorization) and spurious failures (from artificial setups).

Abstract: Large language models can use chain-of-thought (CoT) to externalize reasoning, potentially enabling oversight of capable LLM agents. Prior work has shown that models struggle at two-hop question-answering without CoT. This capability is so basic that if it was a fundamental limitation, it would imply that many complex agentic tasks would similarly require CoT. We investigate LLM latent reasoning capabilities using two-hop question answering as a case study. Previous work on the gap between latent and externalized two-hop reasoning produced mixed evidence with inconclusive results. In this paper, we introduce a controlled setting for investigating two-hop reasoning in LLMs, where a positive result provides definitive evidence for latent reasoning. We fine-tune LLMs (including Llama 3 8B and GPT-4o) on synthetic facts and test two-hop reasoning over these facts. By using synthetic facts, we rule out memorization and reasoning shortcuts as explanations for two-hop performance. We observe a nuanced picture: Models fail to compose two synthetic facts, but can succeed when one fact is synthetic and the other is natural. These results demonstrate that LLMs are undeniably capable of latent two-hop reasoning, although it remains unclear how this ability scales with model size. Finally, we highlight a lesson for researchers studying LLM reasoning: when drawing conclusions about LLM latent reasoning, one must be careful to avoid both spurious successes (that stem from memorization and reasoning shortcuts) and spurious failures (that may stem from artificial experimental setups, divorced from training setups of frontier LLMs).

[87] Fine-Tuning Large Language Models for Scientific Text Classification: A Comparative Study

Zhyar Rzgar K Rostam, Gábor Kertész

Main category: cs.CL

TL;DR: Fine-tuning domain-specific LLMs like SciBERT outperforms general-purpose models in scientific text classification tasks, demonstrating the importance of domain adaptation for specialized content.

DetailsMotivation: General-purpose LLMs struggle with domain-specific content like scientific texts due to specialized vocabulary and imbalanced data, necessitating better approaches for scientific text classification.

Method: Fine-tuned four state-of-the-art LLMs (BERT, SciBERT, BioBERT, BlueBERT) on three datasets from WoS-46985 dataset and evaluated their performance in abstract-based and keyword-based scientific text classification tasks.

Result: Domain-specific models, particularly SciBERT, consistently outperformed general-purpose models. The results also showed advantages over other deep learning models reported in literature, especially when used in specific domains.

Conclusion: Domain-specific adaptations are crucial for enhancing LLM effectiveness in specialized text classification tasks, with domain-specific models demonstrating superior performance over general-purpose alternatives.

Abstract: The exponential growth of online textual content across diverse domains has necessitated advanced methods for automated text classification. Large Language Models (LLMs) based on transformer architectures have shown significant success in this area, particularly in natural language processing (NLP) tasks. However, general-purpose LLMs often struggle with domain-specific content, such as scientific texts, due to unique challenges like specialized vocabulary and imbalanced data. In this study, we fine-tune four state-of-the-art LLMs (BERT, SciBERT, BioBERT, and BlueBERT) on three datasets derived from the WoS-46985 dataset to evaluate their performance in scientific text classification. Our experiments reveal that domain-specific models, particularly SciBERT, consistently outperform general-purpose models in both abstract-based and keyword-based classification tasks. Additionally, we compare our achieved results with those reported in the literature for deep learning models, further highlighting the advantages of LLMs, especially when utilized in specific domains. The findings emphasize the importance of domain-specific adaptations for LLMs to enhance their effectiveness in specialized text classification tasks.
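
A sketch of the fine-tuning recipe for one of the compared models; the checkpoint name is SciBERT's public identifier, but the hyperparameters and label count are illustrative, and the datasets are assumed to be prepared elsewhere:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

name = "allenai/scibert_scivocab_uncased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=7)                      # e.g., the WoS domain labels

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
# Datasets (train_ds, dev_ds) are assumed tokenized with `tok`:
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=dev_ds)
# trainer.train()
```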

[88] Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Hop Arithmetic Reasoning

Keito Kudo, Yoichi Aoki, Tatsuki Kuribayashi, Shusaku Sone, Masaya Taniguchi, Ana Brassard, Keisuke Sakaguchi, Kentaro Inui

Main category: cs.CL

TL;DR: LMs use incremental reasoning for multi-hop arithmetic problems, obtaining sub-answers during chain generation rather than immediately after reading the problem.

DetailsMotivation: To understand the internal problem-solving process of language models, specifically how they handle multi-hop reasoning tasks and when they internally resolve sub-problems.

Method: Investigated LMs’ arithmetic multi-hop reasoning by analyzing when they resolve sub/whole problems through reading problem statements, generating reasoning chains, and achieving final answers.

Result: LMs employ a systematic incremental reasoning strategy - they don’t derive answers immediately after reading problems but obtain sub-answers while generating reasoning chains.

Conclusion: Generated reasoning chains faithfully reflect the model’s internal computation process, providing mechanistic interpretation of LMs’ multi-hop problem-solving.

Abstract: This study investigates the incremental, internal problem-solving process of language models (LMs) with arithmetic multi-hop reasoning as a case study. We specifically investigate when LMs internally resolve sub/whole problems through first reading the problem statements, generating reasoning chains, and achieving the final answer to mechanistically interpret LMs’ multi-hop problem-solving process. Our experiments reveal a systematic incremental reasoning strategy underlying LMs. They have not derived an answer at the moment they first read the problem; instead, they obtain (sub)answers while generating the reasoning chain. Therefore, the generated reasoning chains can be regarded as faithful reflections of the model’s internal computation.

[89] Concept Bottleneck Large Language Models

Chung-En Sun, Tuomas Oikarinen, Berk Ustun, Tsui-Wei Weng

Main category: cs.CL

TL;DR: CB-LLMs integrate intrinsic interpretability into LLMs for text classification and generation, providing explicit reasoning, controlled generation, and safety enhancements while maintaining competitive performance.

DetailsMotivation: Traditional black-box LLMs lack transparency and rely on limited post-hoc interpretations, making it difficult to understand model reasoning, control outputs, and ensure safety.

Method: Concept Bottleneck framework that embeds interpretable neurons directly into LLMs, enabling explicit concept detection and reasoning pathways for both text classification and generation tasks.

Result: CB-LLMs achieve competitive or better performance than traditional models in text classification while providing interpretable reasoning. For text generation, they enable precise concept detection, controlled generation, and safer outputs with enhanced transparency.

Conclusion: CB-LLMs significantly enhance LLM safety, reliability, and trustworthiness through embedded interpretability, enabling users to identify harmful content, steer behavior, and unlearn undesired concepts - capabilities missing in existing models.

Abstract: We introduce Concept Bottleneck Large Language Models (CB-LLMs), a novel framework for building inherently interpretable Large Language Models (LLMs). In contrast to traditional black-box LLMs that rely on limited post-hoc interpretations, CB-LLMs integrate intrinsic interpretability directly into the LLMs – allowing accurate explanations with scalability and transparency. We build CB-LLMs for two essential NLP tasks: text classification and text generation. In text classification, CB-LLMs is competitive with, and at times outperforms, traditional black-box models while providing explicit and interpretable reasoning. For the more challenging task of text generation, interpretable neurons in CB-LLMs enable precise concept detection, controlled generation, and safer outputs. The embedded interpretability empowers users to transparently identify harmful content, steer model behavior, and unlearn undesired concepts – significantly enhancing the safety, reliability, and trustworthiness of LLMs, which are critical capabilities notably absent in existing models. Our code is available at https://github.com/Trustworthy-ML-Lab/CB-LLMs.
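
A schematic concept-bottleneck head (not the authors' implementation): hidden states are first projected onto named, human-interpretable concept scores, and the task prediction is a linear map of those scores alone, so every logit is attributable to concepts:

```python
import torch
import torch.nn as nn

class ConceptBottleneckHead(nn.Module):
    def __init__(self, hidden_dim: int, n_concepts: int, n_classes: int):
        super().__init__()
        self.to_concepts = nn.Linear(hidden_dim, n_concepts)  # interpretable layer
        self.to_classes = nn.Linear(n_concepts, n_classes)    # predicts from concepts only

    def forward(self, h: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        concepts = torch.sigmoid(self.to_concepts(h))  # concept activations in [0, 1]
        return self.to_classes(concepts), concepts     # logits + inspectable concepts
```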

[90] Process-Supervised Reward Models for Verifying Clinical Note Generation: A Scalable Approach Guided by Domain Expertise

Hanyin Wang, Chufan Gao, Qiping Xu, Bolun Liu, Guleid Hussein, Hariprasad Korsapati, Mohamad El Labban, Kingsley Iheasirim, Mohamed Hassan, Gokhan Anil, Brian Bartlett, Jimeng Sun

Main category: cs.CL

TL;DR: Novel framework for training process-supervised reward models (PRMs) on clinical note generation, achieving state-of-the-art performance in distinguishing quality notes and physician preferences.

DetailsMotivation: PRMs excel in domains with ground-truth answers like math/coding but face challenges in clinical note generation where ground truth is lacking, requiring new approaches for step-level verification.

Method: Defined meaningful clinical steps, injected realistic errors using domain expertise, leveraged LLMs to generate process supervision data at scale, and trained PRM on LLaMA-3.1 8B with optimal loss functions and data selection strategies.

Result: PRM achieved 98.8% accuracy distinguishing gold-standard from error-containing samples and 56.2% accuracy selecting physician-preferred notes, outperforming proprietary reasoning and non-reasoning models.

Conclusion: The framework successfully unlocks PRM potential for diverse generative tasks without ground-truth answers, with comprehensive physician study identifying predictors for downstream performance.

Abstract: Process-supervised reward models (PRMs) excel at providing step-by-step verification for large language model (LLM) outputs in domains like mathematics and coding. However, their application to fields lacking ground-truth answers, such as clinical note generation, poses significant challenges. We introduce a novel framework for training PRMs to deliver step-level reward signals for LLM-generated clinical notes. By precisely defining meaningful “steps,” injecting realistic “errors” informed by domain expertise, and leveraging LLMs to generate process supervision data at scale, we overcome previous limitations. Our PRM, built on LLaMA-3.1 8B, consistently outperforms proprietary reasoning and non-reasoning models, achieving state-of-the-art performance on two key evaluations: (1) distinguishing gold-standard from error-containing samples with 98.8% accuracy, and (2) selecting physician-preferred clinical notes with 56.2% accuracy. We investigate critical components for effective PRM training, including optimal loss functions and data selection strategies, and present a comprehensive physician reader study identifying predictors of downstream Best-of-N performance. Our study sheds light on unlocking the potential of PRMs for diverse generative tasks across domains.
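
A toy rendering of how a trained PRM could be used at selection time (Best-of-N): score each step prefix of a candidate note and aggregate. Here `prm_score` is a hypothetical wrapper around the trained model, and min-aggregation is one common choice in the PRM literature, not necessarily the paper's:

```python
def note_value(steps: list[str], prm_score) -> float:
    """Score each step given its preceding context, then aggregate."""
    scores = [prm_score(steps[: i + 1]) for i in range(len(steps))]
    return min(scores)            # weakest step dominates the note's value

def best_of_n(candidates: list[list[str]], prm_score) -> list[str]:
    return max(candidates, key=lambda steps: note_value(steps, prm_score))
```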

[91] Revealing the impact of synthetic native samples and multi-tasking strategies in Hindi-English code-mixed humour and sarcasm detection

Debajyoti Mazumder, Aakash Kumar, Jasabanta Patro

Main category: cs.CL

TL;DR: Three approaches tested for code-mixed humour/sarcasm detection: native sample mixing, multi-task learning, and prompting VMLMs. MTL performed best, VMLMs underperformed.

DetailsMotivation: To improve detection of humour and sarcasm in code-mixed text through various training strategies.

Method: Tested three approaches: (i) native sample mixing - adding monolingual samples to code-mixed training, (ii) multi-task learning with hate detection as related task, (iii) prompting and instruction finetuning of very large multilingual language models.

Result: Native samples improved F1-scores (humour: +6.76%, sarcasm: +8.64%). MTL boosted performance further (humour: +10.67%, sarcasm: +12.35%). VMLMs underperformed compared to other approaches.

Conclusion: Multi-task learning framework was most effective for code-mixed humour and sarcasm detection, while VMLMs via prompting/finetuning couldn’t outperform traditional approaches.

Abstract: In this paper, we report our experiments with various strategies to improve code-mixed humour and sarcasm detection. Particularly, we tried three approaches: (i) native sample mixing, (ii) multi-task learning (MTL), and (iii) prompting and instruction finetuning very large multilingual language models (VMLMs). In native sample mixing, we added monolingual task samples to code-mixed training sets. In MTL learning, we relied on native and code-mixed samples of a semantically related task (hate detection in our case). Finally, in our third approach, we evaluated the efficacy of VMLMs via few-shot context prompting and instruction finetuning. Our key findings are that (i) adding native samples improved humour (raising the F1-score by up to 6.76%) and sarcasm (raising the F1-score by up to 8.64%) detection, (ii) training MLMs in an MTL framework boosted performance for both humour (raising the F1-score by up to 10.67%) and sarcasm (up to a 12.35% increment in F1-score) detection, and (iii) prompting and instruction finetuning VMLMs couldn’t outperform the other approaches. Finally, our ablation studies and error analysis discovered the cases where our model is yet to improve. We provide our code for reproducibility.
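
A minimal multi-task setup of the kind the paper explores: one shared encoder with separate heads for the target task (humour or sarcasm) and the auxiliary hate-detection task. The encoder choice and sizes are illustrative:

```python
import torch
import torch.nn as nn

class MTLModel(nn.Module):
    def __init__(self, encoder, hidden: int = 768):
        super().__init__()
        self.encoder = encoder                     # e.g., a HF masked LM body
        self.humour_head = nn.Linear(hidden, 2)    # target task
        self.hate_head = nn.Linear(hidden, 2)      # auxiliary, related task

    def forward(self, **inputs):
        h = self.encoder(**inputs).last_hidden_state[:, 0]  # [CLS] vector
        return self.humour_head(h), self.hate_head(h)
```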

[92] Knowledge Editing through Chain-of-Thought

Changyue Wang, Weihang Su, Qingyao Ai, Yichen Tang, Yiqun Liu

Main category: cs.CL

TL;DR: EditCoT is a novel knowledge editing framework that uses chain-of-thought refinement to update LLMs across diverse tasks without retraining, achieving state-of-the-art performance with better generalization and stability.

DetailsMotivation: Existing in-context knowledge editing methods are task-specific, unstable due to few-shot prompting, and lack generalization across diverse tasks, creating a need for a more flexible and effective approach.

Method: EditCoT generates a chain-of-thought (CoT) for input and iteratively refines it using a CoT editor based on updated knowledge, enabling flexible knowledge updates without model retraining.

Result: EditCoT achieves state-of-the-art performance across diverse benchmarks covering multiple languages and tasks, demonstrating superior generalization, effectiveness, and stability compared to existing methods.

Conclusion: EditCoT represents a significant advancement in knowledge updating by providing a flexible, efficient framework that maintains model capabilities while integrating new knowledge across various tasks without retraining costs.

Abstract: Knowledge Editing is a technique that updates large language models (LLMs) with new information to maintain their world knowledge. This approach avoids the need to rebuild the model from scratch, thereby addressing the high costs associated with frequent retraining. Among these, the in-context editing paradigm stands out for its effectiveness in integrating new knowledge while preserving the model’s original capabilities. Despite its potential, existing in-context knowledge editing methods are often task-specific, focusing primarily on multi-hop QA tasks using structured knowledge triples. Moreover, their reliance on few-shot prompting for task decomposition makes them unstable and less effective in generalizing across diverse tasks. In response to these limitations, we propose EditCoT, a novel knowledge editing framework that flexibly and efficiently updates LLMs across various tasks without retraining. EditCoT works by generating a chain-of-thought (CoT) for a given input and then iteratively refining this CoT process using a CoT editor based on updated knowledge. We evaluate EditCoT across a diverse range of benchmarks, covering multiple languages and tasks. The results demonstrate that our approach achieves state-of-the-art performance while offering superior generalization, effectiveness, and stability compared to existing methods, marking a significant advancement in the field of knowledge updating. The code and data of EditCoT are available at: https://github.com/bebr2/EditCoT .
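
A minimal sketch of the draft-then-refine loop, assuming the editor can be abstracted as a few model calls: generate a chain of thought, detect any conflict with the edited knowledge, and revise until consistent. The four callables and their signatures are stand-ins for EditCoT's trained components, not the released implementation.

```python
from typing import Callable, Optional

def edit_cot(question: str,
             generate_cot: Callable[[str], str],
             find_conflict: Callable[[str], Optional[str]],
             revise_cot: Callable[[str, str], str],
             answer_from: Callable[[str, str], str],
             max_rounds: int = 3) -> str:
    """Draft a chain of thought, iteratively revise it whenever it
    conflicts with the updated knowledge, then answer from the final CoT."""
    cot = generate_cot(question)
    for _ in range(max_rounds):
        conflict = find_conflict(cot)      # edited fact the CoT contradicts, if any
        if conflict is None:               # chain is consistent with new knowledge
            break
        cot = revise_cot(cot, conflict)    # CoT editor rewrites the offending steps
    return answer_from(question, cot)
```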

[93] Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions

Rachneet Sachdeva, Rima Hazra, Iryna Gurevych

Main category: cs.CL

TL;DR: POATE is a novel jailbreak technique that uses contrastive reasoning to bypass LLM safety measures, achieving ~44% success rate. Countermeasures include Intent-Aware CoT and Reverse Thinking CoT.

DetailsMotivation: Existing safety measures fail against subtle, reasoning-driven vulnerabilities in large language models, leaving them susceptible to sophisticated jailbreak attacks.

Method: POATE (Polar Opposite query generation, Adversarial Template construction, and Elaboration) crafts semantically opposing intents with adversarial templates to provoke unethical responses through contrastive reasoning.

Result: Extensive evaluation across six diverse language model families shows POATE achieves significantly higher attack success rates (~44%) compared to existing methods.

Conclusion: The proposed Intent-Aware CoT and Reverse Thinking CoT methods enhance reasoning robustness and strengthen defenses against adversarial exploits by decomposing queries and reasoning in reverse.

Abstract: Large language models, despite extensive alignment with human values and ethical principles, remain vulnerable to sophisticated jailbreak attacks that exploit their reasoning abilities. Existing safety measures often detect overt malicious intent but fail to address subtle, reasoning-driven vulnerabilities. In this work, we introduce POATE (Polar Opposite query generation, Adversarial Template construction, and Elaboration), a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. We conduct extensive evaluation across six diverse language model families of varying parameter sizes to demonstrate the robustness of the attack, achieving significantly higher attack success rates (~44%) compared to existing methods. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses. These methods enhance reasoning robustness and strengthen the model’s defense against adversarial exploits.

[94] OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking

Zekun Xi, Wenbiao Yin, Jizhan Fang, Jialong Wu, Runnan Fang, Jiang Yong, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang

Main category: cs.CL

TL;DR: OmniThink is a slow-thinking framework that improves machine writing by simulating human iterative knowledge expansion, addressing limitations of retrieval-augmented generation like shallow, unoriginal content.

DetailsMotivation: Current retrieval-augmented generation approaches are limited by their predefined scope, producing content that lacks depth, novelty, and suffers from redundancy, leading to poor article quality.

Method: Proposes OmniThink framework that emulates human-like iterative expansion and reflection processes, simulating how learners slowly deepen their knowledge of topics.

Result: Experimental results show OmniThink improves knowledge density without compromising coherence and depth. Human evaluations confirm its effectiveness for long-form article generation.

Conclusion: OmniThink demonstrates potential to address real-world challenges in generating high-quality long-form articles through slow-thinking cognitive simulation.

Abstract: Machine writing with large language models often relies on retrieval-augmented generation. However, these approaches remain confined within the boundaries of the model’s predefined scope, limiting the generation of content with rich information. Specifically, vanilla-retrieved information tends to lack depth and novelty and suffers from redundancy, which negatively impacts the quality of generated articles, leading to shallow, unoriginal, and repetitive outputs. To address these issues, we propose OmniThink, a slow-thinking machine writing framework that emulates the human-like process of iterative expansion and reflection. The core idea behind OmniThink is to simulate the cognitive behavior of learners as they slowly deepen their knowledge of the topics. Experimental results demonstrate that OmniThink improves the knowledge density of generated articles without compromising metrics such as coherence and depth. Human evaluations and expert feedback further highlight the potential of OmniThink to address real-world challenges in the generation of long-form articles. Code is available at https://github.com/zjunlp/OmniThink.

[95] Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework

Yuhong Sun, Zhangyue Yin, Xuanjing Huang, Xipeng Qiu, Hui Zhao

Main category: cs.CL

TL;DR: This paper introduces MWPES-300K, a comprehensive dataset of 304,865 error samples from 15 LLMs across 4 math word problem datasets, and proposes an automated dynamic error classification framework with Error-Aware Prompting that improves mathematical reasoning performance.

DetailsMotivation: Current error classification methods for math word problems rely on static predefined categories that cannot capture the full spectrum of error patterns in LLMs' mathematical reasoning, limiting systematic error analysis.

Method: Collected error samples from 15 different LLMs across 4 MWP datasets using multiple sampling strategies, created MWPES-300K dataset, developed automated dynamic error classification framework, and proposed Error-Aware Prompting (EAP) that incorporates common error patterns as explicit guidance.

Result: Experimental results show dataset characteristics significantly shape error patterns, which evolve from basic to complex as model capabilities increase. Error-Aware Prompting leads to significant improvements in mathematical reasoning performance.

Conclusion: The proposed framework enables systematic error analysis and deeper insights into error patterns, allowing for the development of more effective prompting strategies that significantly enhance LLMs’ mathematical reasoning capabilities.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains. Math Word Problems (MWPs) serve as a crucial benchmark for evaluating LLMs’ reasoning abilities. While most research primarily focuses on improving accuracy, it often neglects understanding and addressing the underlying patterns of errors. Current error classification methods rely on static and predefined categories, which limit their ability to capture the full spectrum of error patterns in mathematical reasoning. To enable systematic error analysis, we collect error samples from 15 different LLMs of varying sizes across four distinct MWP datasets using multiple sampling strategies. Based on this extensive collection, we introduce MWPES-300K, a comprehensive dataset containing 304,865 error samples that cover diverse error patterns and reasoning paths. To reduce human bias and enable fine-grained analysis of error patterns, we propose a novel framework for automated dynamic error classification in mathematical reasoning. Experimental results demonstrate that dataset characteristics significantly shape error patterns, which evolve from basic to complex manifestations as model capabilities increase. With deeper insights into error patterns, we propose Error-Aware Prompting (EAP) that incorporates common error patterns as explicit guidance, leading to significant improvements in mathematical reasoning performance.
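
To make Error-Aware Prompting concrete, a minimal sketch: common error patterns are injected into the prompt as explicit self-check guidance. The error taxonomy listed here is hypothetical; the paper distills its patterns from the MWPES-300K analysis.

```python
# Hypothetical error-pattern categories; the real taxonomy comes from the
# paper's dynamic error classification framework.
COMMON_ERRORS = [
    "misreading a quantity or unit in the problem statement",
    "setting up the equation with operands swapped",
    "arithmetic slips when combining intermediate results",
    "answering an intermediate value instead of the asked-for quantity",
]

def error_aware_prompt(problem: str) -> str:
    """Build an Error-Aware Prompt: solve the MWP step by step while
    explicitly checking against common error patterns."""
    warnings = "\n".join(f"- {e}" for e in COMMON_ERRORS)
    return (
        "Solve the following math word problem step by step.\n"
        f"Common mistakes to check yourself against:\n{warnings}\n\n"
        f"Problem: {problem}\n"
        "After solving, verify each step against the list above "
        "before giving the final answer."
    )

print(error_aware_prompt("A shop sells pens at $3 each. How much do 7 pens cost?"))
```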

[96] Through the Prism of Culture: Evaluating LLMs’ Understanding of Indian Subcultures and Traditions

Garima Chhikara, Abhishek Kumar, Abhijnan Chakraborty

Main category: cs.CL

TL;DR: LLMs struggle with cultural bias and fail to adequately represent Indian subcultures (Little Traditions) despite some ability to articulate cultural nuances.

DetailsMotivation: To evaluate LLMs' capacity to recognize and respond accurately to localized cultural practices and subcultures within Indian society, addressing concerns about cultural bias and under-representation.

Method: Conducted case studies assessing LLMs’ ability to balance dominant Great Traditions with localized Little Traditions, using various prompting strategies including regional languages to test cultural sensitivity.

Result: LLMs demonstrate ability to articulate cultural nuances but struggle to apply this understanding in practical, context-specific scenarios, showing limitations in cultural representation.

Conclusion: First study analyzing LLMs’ engagement with Indian subcultures, revealing significant challenges in embedding cultural diversity in AI systems and highlighting the need for improved cultural sensitivity.

Abstract: Large Language Models (LLMs) have shown remarkable advancements but also raise concerns about cultural bias, often reflecting dominant narratives at the expense of under-represented subcultures. In this study, we evaluate the capacity of LLMs to recognize and accurately respond to the Little Traditions within Indian society, encompassing localized cultural practices and subcultures such as caste, kinship, marriage, and religion. Through a series of case studies, we assess whether LLMs can balance the interplay between dominant Great Traditions and localized Little Traditions. We explore various prompting strategies and further investigate whether using prompts in regional languages enhances the models’ cultural sensitivity and response quality. Our findings reveal that while LLMs demonstrate an ability to articulate cultural nuances, they often struggle to apply this understanding in practical, context-specific scenarios. To the best of our knowledge, this is the first study to analyze LLMs’ engagement with Indian subcultures, offering critical insights into the challenges of embedding cultural diversity in AI systems.

[97] Premise-Augmented Reasoning Chains Improve Error Identification in Math Reasoning with LLMs

Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, Dilek Hakkani-Tür

Main category: cs.CL

TL;DR: The paper introduces Premise Augmented Reasoning Chains (PARC) to restructure linear CoT reasoning into directed acyclic graphs with premise links, improving error identification and verification in mathematical reasoning by 6-16%.

DetailsMotivation: Chain-of-Thought prompting produces verbose reasoning chains that are hard to verify due to long sequences and dependencies between distant steps, making error tracing difficult.

Method: Developed PARC framework that identifies premises for each reasoning step and restructures linear chains into directed acyclic graphs with premise links. Built PERL dataset for evaluation.

Result: LLMs achieve 90% recall in premise identification. PARC improves error identification accuracy by 6-16% absolute when verification is done under premises.

Conclusion: PARC’s premise-centric representation significantly enhances reasoning reliability and opens new avenues for improving LLM-based reasoning evaluations.

Abstract: Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large language models (LLMs) by enabling detailed step-by-step solutions. However, due to the verbosity of LLMs, the resulting reasoning chains can be long, making it harder to verify the reasoning steps and trace issues resulting from dependencies between the steps that may be farther away in the sequence of steps. Importantly, mathematical reasoning allows each step to be derived from a small set of premises, which are a subset of the preceding steps in the reasoning chain. In this paper, we present a framework that identifies the premises for each step, to improve the evaluation of reasoning. We restructure conventional linear reasoning chains into Premise Augmented Reasoning Chains (PARC) by introducing premise links, resulting in a directed acyclic graph where the nodes are the steps and the edges are the premise links. Through experiments with a PARC-based dataset that we built, namely PERL (Premises and ERrors identification in LLMs), we demonstrate that LLMs can reliably identify premises within complex reasoning chains. In particular, even open-source LLMs achieve 90% recall in premise identification. We also show that PARC helps to identify errors in reasoning chains more reliably. The accuracy of error identification improves by 6% to 16% absolute when step-by-step verification is carried out in PARC under the premises. Our findings highlight the utility of premise-centric representations in addressing complex problem-solving tasks and open new avenues for improving the reliability of LLM-based reasoning evaluations.
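
The core restructuring step can be sketched as follows: map each step to its premise set to obtain a DAG, then verify each step conditioned only on its premises. Both `premises_of` and `check_step` are stubs for the LLM calls the paper relies on.

```python
from typing import Callable, Dict, List

def build_parc(steps: List[str],
               premises_of: Callable[[int, List[str]], List[int]]) -> Dict[int, List[int]]:
    """Map each reasoning step to the indices of the earlier steps it is
    derived from; the resulting edges form a directed acyclic graph."""
    dag: Dict[int, List[int]] = {}
    for i in range(len(steps)):
        dag[i] = [p for p in premises_of(i, steps) if p < i]  # premises must precede
    return dag

def verify_under_premises(dag: Dict[int, List[int]], steps: List[str],
                          check_step: Callable[[str, List[str]], bool]) -> List[int]:
    """Verify each step against only its premises rather than the full
    chain; returns the indices of steps flagged as erroneous."""
    return [i for i, parents in dag.items()
            if not check_step(steps[i], [steps[p] for p in parents])]
```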

[98] Position: LLMs Can be Good Tutors in English Education

Jingheng Ye, Shen Wang, Deqing Zou, Yibo Yan, Kun Wang, Hai-Tao Zheng, Ruitong Liu, Zenglin Xu, Irwin King, Philip S. Yu, Qingsong Wen

Main category: cs.CL

TL;DR: LLMs can serve as effective tutors in English education through three roles: data enhancers, task predictors, and agents, enabling personalized learning while requiring interdisciplinary research to address challenges.

DetailsMotivation: Current LLM integration in English education relies on traditional approaches without fully embracing educational methodologies, lacking adaptability to language learning needs.

Method: Proposes three critical roles for LLMs: 1) as data enhancers for creating learning materials and student simulations, 2) as task predictors for learner assessment and optimizing learning pathways, and 3) as agents for personalized and inclusive education.

Result: The paper presents a framework for leveraging LLMs’ potential in English education but does not provide empirical results; it is a conceptual position paper advocating for interdisciplinary research.

Conclusion: LLMs have significant potential to advance English education through thoughtful integration, but interdisciplinary collaboration is needed to explore these roles while addressing associated challenges and risks.

Abstract: While recent efforts have begun integrating large language models (LLMs) into English education, they often rely on traditional approaches to learning tasks without fully embracing educational methodologies, thus lacking adaptability to language learning. To address this gap, we argue that LLMs have the potential to serve as effective tutors in English Education. Specifically, LLMs can play three critical roles: (1) as data enhancers, improving the creation of learning materials or serving as student simulations; (2) as task predictors, supporting learner assessment and optimizing learning pathways; and (3) as agents, enabling personalized and inclusive education. We encourage interdisciplinary research to explore these roles, fostering innovation while addressing challenges and risks, ultimately advancing English Education through the thoughtful integration of LLMs.

[99] Reinforced Lifelong Editing for Language Models

Zherui Li, Houcheng Jiang, Hao Chen, Baolong Bi, Zhenhong Zhou, Fei Sun, Junfeng Fang, Xiang Wang

Main category: cs.CL

TL;DR: RLEdit is a reinforcement learning-based method for lifelong editing of large language models that addresses incompatibility issues with dynamically changing parameters, achieving 59.24% improvement with only 2.11% of the time compared to existing approaches.

DetailsMotivation: Large language models' stored knowledge becomes inaccurate over time, and existing hypernetwork-based editing methods struggle with lifelong editing due to incompatibility with dynamically changing LLM parameters during the editing process.

Method: Proposed RLEdit, an RL-based editing method that treats editing losses as rewards and optimizes hypernetwork parameters at the full knowledge sequence level to precisely capture LLM changes and generate appropriate parameter updates.

Result: Extensive empirical evaluation across several LLMs shows RLEdit outperforms existing methods in lifelong editing with superior effectiveness and efficiency - 59.24% improvement while requiring only 2.11% of the time compared to most approaches.

Conclusion: RLEdit successfully addresses the challenges of lifelong model editing by leveraging reinforcement learning principles, providing a more effective and efficient solution for keeping LLM knowledge up-to-date without retraining.

Abstract: Large language models (LLMs) acquire information from pre-training corpora, but their stored knowledge can become inaccurate or outdated over time. Model editing addresses this challenge by modifying model parameters without retraining, and prevalent approaches leverage hypernetworks to generate these parameter updates. However, they face significant challenges in lifelong editing due to their incompatibility with LLM parameters that dynamically change during the editing process. To address this, we observe that hypernetwork-based lifelong editing aligns with reinforcement learning modeling and propose RLEdit, an RL-based editing method. By treating editing losses as rewards and optimizing hypernetwork parameters at the full knowledge sequence level, we enable it to precisely capture LLM changes and generate appropriate parameter updates. Our extensive empirical evaluation across several LLMs demonstrates that RLEdit outperforms existing methods in lifelong editing with superior effectiveness and efficiency, achieving a 59.24% improvement while requiring only 2.11% of the time compared to most approaches. Our code is available at: https://github.com/zhrli324/RLEdit.

[100] Improve LLM-as-a-Judge Ability as a General Ability

Jiachen Yu, Shaoning Sun, Xiaohui Hu, Jiaxu Yan, Kaidong Yu, Xuelong Li

Main category: cs.CL

TL;DR: Two-stage training approach (SFT + DPO) with efficient data synthesis to train LLM judges, achieving SOTA performance with minimal data while enhancing general model capabilities.

DetailsMotivation: Existing methods for training LLMs as judges are data-intensive, lack accuracy, and focus only on judging ability without considering broader model enhancement.

Method: Two-stage training: supervised fine-tuning warm-up followed by direct preference optimization enhancement, plus efficient data synthesis for judgmental content generation.

Result: Achieves state-of-the-art performance on RewardBench using only 2-40% of data compared to other methods, and significantly improves downstream DPO training performance.

Conclusion: The approach successfully develops LLM judges with improved accuracy and efficiency while enhancing general model capabilities, with open-sourced weights and data for research.

Abstract: LLM-as-a-Judge leverages the generative and reasoning capabilities of large language models (LLMs) to evaluate LLM responses across diverse scenarios, providing accurate preference signals. This approach plays a vital role in aligning LLMs with human values, ensuring ethical and reliable AI outputs that align with societal norms. Recent studies have proposed many methods to train LLMs as generative judges, but most of them are data-intensive or lack accuracy, and they focus only on the LLM’s judging ability. In this work, we regard judging ability as a general ability of LLMs and implement a two-stage training approach, comprising supervised fine-tuning (SFT) warm-up and direct preference optimization (DPO) enhancement, to achieve judge-style adaptation and improve judgment accuracy. Additionally, we introduce an efficient data synthesis method to generate judgmental content. Experimental results demonstrate that our approach, utilizing only about 2% to 40% of the data required by other methods, achieves SOTA performance on RewardBench. Furthermore, our training method enhances the general capabilities of the model by constructing complex judge tasks, and in our tests the judge signals provided by our model significantly enhanced the downstream DPO training performance of our internal policy models. We also open-source our model weights and training data to facilitate further research.
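
For the second stage, a standard DPO objective over judge preference pairs might look like the sketch below; `beta` and the toy log-probabilities are illustrative, and the paper's exact loss configuration may differ.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: inputs are per-example sequence
    log-probabilities of the chosen and rejected judgments under the
    policy and a frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy check with fake log-probs for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -9.2]))
print(loss.item())
```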

[101] Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models

Haokun Chen, Sebastian Szyller, Weilin Xu, Nageen Himayat

Main category: cs.CL

TL;DR: Soft token attacks can extract any information from LLMs regardless of unlearning effectiveness, making them inadequate for auditing unlearning algorithms.

DetailsMotivation: Large language models often contain undesirable content that needs to be removed through unlearning, but current auditing methods using soft token attacks may give misleading results about unlearning effectiveness.

Method: The researchers used common benchmarks (Who Is Harry Potter? and TOFU) to demonstrate that soft token attacks can elicit information from LLMs regardless of whether unlearning was performed or if the content was originally in the training data.

Result: Soft token attacks with just 1-10 tokens can extract random strings over 400 characters long, showing these attacks can retrieve any information from the model independent of unlearning success.

Conclusion: Soft token attacks must be used carefully and cannot be relied upon as the sole method for auditing unlearning effectiveness, as they can extract information regardless of the unlearning algorithm’s performance.

Abstract: Large language models (LLMs) are trained using massive datasets, which often contain undesirable content such as harmful texts, personal information, and copyrighted material. To address this, machine unlearning aims to remove information from trained models. Recent work has shown that soft token attacks (STA) can successfully extract unlearned information from LLMs, but in this work we show that STAs can be an inadequate tool for auditing unlearning. Using common benchmarks such as Who Is Harry Potter? and TOFU, we demonstrate that in a strong auditor setting such attacks can elicit any information from the LLM, regardless of the deployed unlearning algorithm or whether the queried content was originally present in the training corpus. We further show that STA with just a few soft tokens (1-10) can elicit random strings over 400 characters long, indicating that STAs must be used carefully to effectively audit unlearning. Example code can be found at: https://github.com/IntelLabs/LLMart/tree/main/examples/unlearning
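
A generic soft-prompt optimizer in the spirit of the paper's auditor is sketched below: a handful of continuous prefix embeddings are tuned to force a chosen continuation while model weights stay frozen. It assumes a Hugging-Face-style interface (`inputs_embeds`, `.logits`) and is not the authors' attack code.

```python
import torch
import torch.nn.functional as F

def soft_token_attack(model, embed, input_embeds, target_ids,
                      n_soft=5, steps=200, lr=0.01):
    """Tune a few continuous prefix embeddings so the frozen model emits
    `target_ids` after `input_embeds`. `model` must accept inputs_embeds
    and return an object with .logits; `embed` is its token embedding."""
    soft = torch.randn(1, n_soft, embed.embedding_dim, requires_grad=True)
    opt = torch.optim.Adam([soft], lr=lr)
    tgt_embeds = embed(target_ids)                      # [1, T, d]
    for _ in range(steps):
        full = torch.cat([soft, input_embeds, tgt_embeds], dim=1)
        logits = model(inputs_embeds=full).logits
        t = target_ids.size(1)
        pred = logits[:, -t - 1:-1, :]                  # positions predicting targets
        loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                               target_ids.reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
    return soft.detach()
```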

[102] Evaluating the Robustness and Accuracy of Text Watermarking Under Real-World Cross-Lingual Manipulations

Mansour Al Ghanim, Jiaqi Xue, Rochana Prih Hastuti, Mengxin Zheng, Yan Solihin, Qian Lou

Main category: cs.CL

TL;DR: Benchmark study of watermarking methods in cross-lingual settings, evaluating four methods across four languages with translation attack scenarios.

DetailsMotivation: Current literature focuses mainly on English watermarking evaluation, leaving cross-lingual scenarios overlooked despite their practical importance for adversaries with multilingual capabilities.

Method: Evaluated four watermarking methods across four vocabulary-rich languages, testing text quality and watermark detectability under practical translation attack scenarios that simulate cross-lingual adversaries.

Result: The study provides empirical evaluation of watermarking performance in cross-lingual contexts, revealing vulnerabilities and effectiveness against translation-based attacks.

Conclusion: Key insights are drawn about the suitability of current watermarking methods for cross-lingual scenarios, highlighting gaps and practical considerations for multilingual watermarking applications.

Abstract: We present a study to benchmark representative watermarking methods in cross-lingual settings. The current literature mainly focuses on evaluating watermarking methods for the English language; literature on evaluating watermarking in cross-lingual settings is scarce. This overlooks important scenarios in which a cross-lingual adversary may operate, leaving a gray area around the practicality of cross-lingual watermarking. In this paper, we evaluate four watermarking methods in four different, vocabulary-rich languages. Our experiments investigate the quality of text under different watermarking procedures and the detectability of watermarks under practical translation attack scenarios. Specifically, we investigate practical actions that an adversary with cross-lingual knowledge could take and evaluate whether current watermarking methods are suitable for such scenarios. Finally, from our findings, we draw key insights about watermarking in cross-lingual settings.
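
To make the detection side concrete, here is a toy green-list detector in the style of common LLM watermarks: the z-score measures how far the green-token rate sits above its expected baseline, and translation re-tokenizes the content in another language, scrambling the (previous token, token) pairs the statistic depends on. The hash-based green list is a stand-in, not one of the four evaluated schemes.

```python
import hashlib
from math import sqrt

GAMMA = 0.5  # fraction of the vocabulary placed in the green list

def is_green(prev_token: str, token: str) -> bool:
    """Toy green-list test hashing the (previous token, token) pair; real
    schemes seed a PRNG with previous token ids and a secret key."""
    h = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return h[0] < 256 * GAMMA

def detection_z(tokens: list) -> float:
    """One-proportion z-score: how far the green-token count sits above
    the GAMMA baseline expected for unwatermarked text."""
    hits = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - GAMMA * n) / sqrt(n * GAMMA * (1 - GAMMA))

print(detection_z("the cat sat on the mat and looked around".split()))
```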

[103] Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval

Jinrui Yang, Timothy Baldwin, Trevor Cohn

Main category: cs.CL

TL;DR: Multi-EuP is a new multilingual benchmark dataset with 22K documents across 24 languages from the European Parliament, designed to study fairness, language bias, and demographic bias in information retrieval systems.

DetailsMotivation: To investigate fairness in multilingual information retrieval by analyzing both language and demographic bias in ranking contexts, addressing the need for authentic multilingual corpora with cross-lingual relevance judgments and demographic information.

Method: Created a dataset comprising 22K multilingual documents from the European Parliament spanning 24 languages, with topics translated into all languages and cross-lingual relevance judgments. Includes rich demographic information associated with documents.

Result: The dataset proves effective for benchmarking both monolingual and multilingual information retrieval systems. Preliminary experiments reveal language bias caused by tokenization strategy choices.

Conclusion: Multi-EuP provides a valuable resource for studying fairness and bias in multilingual IR contexts, enabling research on both language and demographic biases with authentic multilingual content and comprehensive relevance judgments.

Abstract: We present Multi-EuP, a new multilingual benchmark dataset comprising 22K multilingual documents collected from the European Parliament, spanning 24 languages. This dataset is designed to investigate fairness in a multilingual information retrieval (IR) context, analyzing both language and demographic bias in a ranking setting. It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages, as well as cross-lingual relevance judgments. Furthermore, it offers rich demographic information associated with its documents, facilitating the study of demographic bias. We report the effectiveness of Multi-EuP for benchmarking both monolingual and multilingual IR. We also conduct a preliminary experiment on language bias caused by the choice of tokenization strategy.

[104] Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning

Hongyi Cai, Jie Li, Mohammad Mahdinur Rahman, Wenzhen Dong

Main category: cs.CL

TL;DR: LCG is a novel filtering framework that uses centroid-based clustering and confidence-guided selection to identify valuable instruction pairs for efficient instruction fine-tuning of LLMs.

DetailsMotivation: Instruction fine-tuning effectiveness is constrained by dataset quality and efficiency. Current methods need better ways to identify the most valuable training samples.

Method: Uses centroid-based clustering and confidence-guided selection with a lightweight classifier trained on representative samples to curate high-quality subsets while preserving diversity.

Result: Models fine-tuned on LCG-filtered subsets of just 6K samples achieve superior performance compared to existing methods, with substantial improvements on MT-bench and consistent gains across comprehensive metrics.

Conclusion: LCG establishes a promising direction for efficient instruction tuning by effectively maintaining model performance while significantly reducing required training data.

Abstract: The effectiveness of instruction fine-tuning for Large Language Models is fundamentally constrained by the quality and efficiency of training datasets. This work introduces Low-Confidence Gold (LCG), a novel filtering framework that employs centroid-based clustering and confidence-guided selection for identifying valuable instruction pairs. Through a semi-supervised approach using a lightweight classifier trained on representative samples, LCG curates high-quality subsets while preserving data diversity. Experimental evaluation demonstrates that models fine-tuned on LCG-filtered subsets of 6K samples achieve superior performance compared to existing methods, with substantial improvements on MT-bench and consistent gains across comprehensive evaluation metrics. The framework’s efficacy in reducing training data requirements while maintaining model performance establishes a promising direction for efficient instruction tuning.
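
One plausible reading of LCG's selection step is sketched below: cluster instruction embeddings for diversity, then keep the lowest-confidence samples in each cluster. The cluster count, per-cluster budget, and random data are assumptions for illustration; `confidences` would come from the paper's lightweight classifier.

```python
import numpy as np
from sklearn.cluster import KMeans

def lcg_select(embeddings: np.ndarray, confidences: np.ndarray,
               k_clusters: int = 50, per_cluster: int = 120):
    """Cluster instruction embeddings for diversity, then keep the
    lowest-confidence samples per cluster (the ones the lightweight
    classifier finds hardest)."""
    labels = KMeans(n_clusters=k_clusters, n_init=10).fit_predict(embeddings)
    keep = []
    for c in range(k_clusters):
        idx = np.where(labels == c)[0]
        order = idx[np.argsort(confidences[idx])]   # low confidence first
        keep.extend(order[:per_cluster].tolist())
    return sorted(keep)

rng = np.random.default_rng(0)
picked = lcg_select(rng.normal(size=(2000, 64)), rng.random(2000),
                    k_clusters=10, per_cluster=60)
print(len(picked))  # ~600 curated samples
```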

[105] PlainQAFact: Retrieval-augmented Factual Consistency Evaluation Metric for Biomedical Plain Language Summarization

Zhiwen You, Yue Guo

Main category: cs.CL

TL;DR: PlainQAFact is a new automatic evaluation metric that addresses the challenge of factual consistency evaluation in plain language medical summarization, particularly handling elaborative explanations that introduce external content not present in source abstracts.

DetailsMotivation: Existing factual consistency evaluation methods struggle with plain language summarization in medical contexts due to elaborative explanations that add external content (definitions, background, examples) to enhance comprehension but complicate factual verification.

Method: PlainQAFact uses a two-step approach: first classifies sentence types, then applies a retrieval-augmented QA scoring method. It’s trained on PlainFact, a fine-grained human-annotated dataset for evaluating both source-simplified and elaboratively explained sentences.

Result: PlainQAFact consistently outperforms existing evaluation metrics across all evaluation settings, effectively handling factual consistency assessment for elaborative explanations where other methods fail.

Conclusion: This work presents the first evaluation metric specifically designed for PLS factual consistency evaluation, providing both a robust benchmark and practical tool to advance reliable plain language communication in medical domains.

Abstract: Hallucinated outputs from large language models (LLMs) pose risks in the medical domain, especially for lay audiences making health-related decisions. Existing automatic factual consistency evaluation methods, such as entailment- and question-answering (QA)-based approaches, struggle with plain language summarization (PLS) due to the elaborative explanation phenomenon, which introduces external content (e.g., definitions, background, examples) absent from the scientific abstract to enhance comprehension. To address this, we introduce PlainQAFact, an automatic factual consistency evaluation metric trained on PlainFact, a fine-grained, human-annotated dataset, for evaluating the factual consistency of both source-simplified and elaboratively explained sentences. PlainQAFact first classifies the sentence type, then applies a retrieval-augmented QA scoring method. Empirical results show that existing evaluation metrics fail to evaluate factual consistency in PLS, especially for elaborative explanations, whereas PlainQAFact consistently outperforms them across all evaluation settings. We further analyze PlainQAFact’s effectiveness across external knowledge sources, answer extraction strategies, answer overlap measures, and document granularity levels, refining its overall factual consistency assessment. Taken together, our work presents the first evaluation metric designed for PLS factual consistency evaluation, providing the community with both a robust benchmark and a practical tool to advance reliable and safe plain language communication in the medical domain. PlainQAFact and PlainFact are available at: https://github.com/zhiwenyou103/PlainQAFact
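
The two-step scoring pipeline might be skeletonized as follows, with the sentence-type classifier, retriever, and QA scorer as stubs for the trained components; the type labels are assumed names.

```python
from typing import Callable, List

def plainqafact_score(summary_sents: List[str], abstract: str,
                      classify: Callable[[str], str],
                      retrieve: Callable[[str], str],
                      qa_score: Callable[[str, str], float]) -> float:
    """Classify each summary sentence, then ground it either in the
    source abstract (simplified sentences) or in retrieved external
    knowledge (elaborative explanations) before QA-based scoring."""
    scores = []
    for sent in summary_sents:
        kind = classify(sent)  # e.g. "simplified" or "elaboration"
        evidence = abstract if kind == "simplified" else retrieve(sent)
        scores.append(qa_score(sent, evidence))
    return sum(scores) / max(len(scores), 1)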

[106] X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression

Guihong Li, Mehdi Rezagholizadeh, Mingyu Yang, Vikram Appia, Emad Barsoum

Main category: cs.CL

TL;DR: X-EcoMLA enables post-training adaptation to convert standard Transformer attention into efficient Multi-head Latent Attention (MLA) through distillation, achieving up to 10.6x KV cache compression without performance loss.

DetailsMotivation: MLA requires training from scratch, limiting its applicability to existing pre-trained models. The goal is to enable MLA's memory efficiency benefits for models already trained with different attention mechanisms.

Method: Uses post-training distillation to upcycle Transformer-based attention into hybrid MLA through lightweight adaptation, leveraging dark knowledge from well-trained models without extensive pre-training.

Result: Achieves 6.4x compression with same performance using 3.6B tokens and 70 GPU hours, and 10.6x compression with <0.1% performance drop using 7B tokens and 140 GPU hours on Llama3.2-1B-Instruct.

Conclusion: X-EcoMLA successfully enables efficient KV cache compression for pre-trained models through post-training distillation, making MLA benefits accessible without costly retraining from scratch.

Abstract: Multi-head latent attention (MLA) is designed to optimize KV cache memory through low-rank key-value joint compression. Rather than caching keys and values separately, MLA stores their compressed latent representations, reducing memory overhead while maintaining performance. While MLA improves memory efficiency without compromising language model accuracy, its major limitation lies in its integration during the pre-training phase, requiring models to be trained from scratch. This raises a key question: can we use MLA’s benefits fully or partially in models that have already been pre-trained with different attention mechanisms? In this paper, we propose X-EcoMLA, which deploys post-training distillation to enable the upcycling of Transformer-based attention into an efficient hybrid MLA variant through lightweight post-training adaptation, bypassing the need for extensive pre-training. We demonstrate that leveraging the dark knowledge of a well-trained model can enhance training accuracy and enable extreme KV cache compression in MLA without compromising model performance. The experimental results show that our proposed method can effectively compress the KV cache while preserving performance on the benchmarks; specifically, for the Llama3.2-1B-Instruct baseline, a 6.4x compression achieves the same average score using only 3.6B training tokens and 70 GPU hours on AMD MI300, whereas a 10.6x compression has less than a 0.1% average score drop with 7B training tokens and 140 GPU hours. The code for this work is available at https://github.com/AMD-AGI/AMD-Hybrid-Models.
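
A minimal sketch of the latent-KV idea that X-EcoMLA upcycles attention into: cache one low-rank latent per token and reconstruct keys and values on read. The dimensions are illustrative, and real MLA includes details (such as decoupled positional components) omitted here.

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Cache a single low-rank latent per token instead of separate
    full-width keys and values, expanding on read."""

    def __init__(self, d_model=2048, d_latent=192):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)   # compress hidden state
        self.up_k = nn.Linear(d_latent, d_model, bias=False)   # reconstruct keys
        self.up_v = nn.Linear(d_latent, d_model, bias=False)   # reconstruct values

    def write(self, hidden):            # hidden: [batch, seq, d_model]
        return self.down(hidden)        # cached latent: [batch, seq, d_latent]

    def read(self, latent):
        return self.up_k(latent), self.up_v(latent)

cache = LatentKVCache()
latent = cache.write(torch.randn(1, 16, 2048))
k, v = cache.read(latent)
ratio = (2 * 2048) / latent.size(-1)    # vs caching K and V separately
print(f"compression vs separate K/V: {ratio:.1f}x")   # ~21x in this toy config
```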

[107] LinkAlign: Scalable Schema Linking for Real-World Large-Scale Multi-Database Text-to-SQL

Yihan Wang, Peiyu Liu, Xin Yang

Main category: cs.CL

TL;DR: LinkAlign is a novel framework that addresses schema linking challenges in Text-to-SQL systems for large-scale databases, achieving state-of-the-art performance on benchmarks.

DetailsMotivation: Schema linking is a critical bottleneck in applying Text-to-SQL models to real-world, large-scale multi-database environments, with two main challenges: database retrieval and schema item grounding.

Method: LinkAlign framework with three key steps: multi-round semantic enhanced retrieval, irrelevant information isolation, and schema extraction enhancement. Supports both Agent and Pipeline execution modes for balancing efficiency and performance.

Result: Outperforms existing baselines on all schema linking metrics, achieves 33.09% on Spider 2.0-Lite benchmark using only open-source LLMs, ranking first on the leaderboard.

Conclusion: LinkAlign effectively addresses schema linking challenges in large-scale database environments and demonstrates superior performance compared to existing approaches.

Abstract: Schema linking is a critical bottleneck in applying existing Text-to-SQL models to real-world, large-scale, multi-database environments. Through error analysis, we identify two major challenges in schema linking: (1) Database Retrieval: accurately selecting the target database from a large schema pool, while effectively filtering out irrelevant ones; and (2) Schema Item Grounding: precisely identifying the relevant tables and columns within complex and often redundant schemas for SQL generation. Based on these, we introduce LinkAlign, a novel framework tailored for large-scale databases with thousands of fields. LinkAlign comprises three key steps: multi-round semantic enhanced retrieval and irrelevant information isolation for Challenge 1, and schema extraction enhancement for Challenge 2. Each stage supports both Agent and Pipeline execution modes, enabling a balance between efficiency and performance via modular design. To enable more realistic evaluation, we construct AmbiDB, a synthetic dataset designed to reflect the ambiguity of real-world schema linking. Experiments on widely-used Text-to-SQL benchmarks demonstrate that LinkAlign consistently outperforms existing baselines on all schema linking metrics. Notably, it improves the overall Text-to-SQL pipeline and achieves a new state-of-the-art score of 33.09% on the Spider 2.0-Lite benchmark using only open-source LLMs, ranking first on the leaderboard at the time of submission. The codes are available at https://github.com/Satissss/LinkAlign

[108] Learning to Reason for Long-Form Story Generation

Alexander Gurung, Mirella Lapata

Main category: cs.CL

TL;DR: Using RL with verifiable rewards for next-chapter prediction in story generation, achieving better results than non-trained and SFT baselines, especially in Scifi/Fantasy genres.

DetailsMotivation: Long-form story generation requires complex skills but lacks labeled datasets and quality measurements, making manual prompting techniques necessary. Recent RL success in math/coding domains inspired applying similar verifiable reward approaches to story generation.

Method: Proposed Next-Chapter Prediction task with Verified Rewards via Completion Likelihood Improvement. Uses unlabeled book dataset to learn reasoning over story information and generate detailed next-chapter plans. Evaluates reasoning through generated chapters.

Result: Pairwise human judgments show chapters from learned reasoning are preferred across almost all metrics. Effect is more pronounced in Scifi and Fantasy genres compared to non-trained and supervised finetuning baselines.

Conclusion: RL with verifiable rewards can effectively improve long-form story generation by learning to reason over story context and plan next chapters, demonstrating particular strength in genre-specific storytelling.

Abstract: Generating high-quality stories spanning thousands of tokens requires competency across a variety of skills, from tracking plot and character arcs to keeping a consistent and engaging style. Due to the difficulty of sourcing labeled datasets and precise quality measurements, most work using large language models (LLMs) for long-form story generation uses combinations of hand-designed prompting techniques to elicit author-like behavior. This is a manual process that is highly dependent on the specific story-generation task. Motivated by the recent success of applying RL with Verifiable Rewards to domains like math and coding, we propose a general story-generation task (Next-Chapter Prediction) and a reward formulation (Verified Rewards via Completion Likelihood Improvement) that allows us to use an unlabeled book dataset as a learning signal for reasoning. We learn to reason over a story’s condensed information and generate a detailed plan for the next chapter. Our reasoning is evaluated via the chapters it helps a story-generator create, and compared against non-trained and supervised finetuning (SFT) baselines. Pairwise human judgments reveal the chapters our learned reasoning produces are preferred across almost all metrics, and the effect is more pronounced in Scifi and Fantasy genres.
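
One minimal reading of the reward formulation, under stated assumptions: a next-chapter plan is scored by how much it improves a frozen generator's likelihood of the true next chapter. `logprob` is a stub, and the prompt layout is an assumption, not the paper's template.

```python
from typing import Callable

def vr_cli_reward(context: str, plan: str, gold_chapter: str,
                  logprob: Callable[[str, str], float]) -> float:
    """Verified Reward via Completion Likelihood Improvement, minimally
    read: reward a plan by the log-likelihood gain it gives a frozen
    generator on the actual next chapter. `logprob(prefix, text)` is a
    stub returning log p(text | prefix) under that generator."""
    base = logprob(context, gold_chapter)
    with_plan = logprob(context + "\n\nPlan for next chapter:\n" + plan,
                        gold_chapter)
    return with_plan - base   # positive iff the plan helps predict the chapter
```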

[109] Efficient Dynamic Clustering-Based Document Compression for Retrieval-Augmented-Generation

Weitao Li, Kaiming Liu, Xiangyu Zhang, Xuanyu Lei, Weizhi Ma, Yang Liu

Main category: cs.CL

TL;DR: EDC2-RAG is a dynamic clustering-based document compression framework that improves RAG by exploiting inter-document relationships to remove noise and redundancy, achieving better performance on knowledge QA and hallucination detection tasks.

DetailsMotivation: Current RAG implementations struggle with noise and redundancy in retrieved content due to limited ability to exploit fine-grained inter-document relationships, which can cause errors in generation results.

Method: Proposes an Efficient Dynamic Clustering-based document Compression framework (EDC2-RAG) that utilizes latent inter-document relationships to remove irrelevant information and redundant content.

Result: Experimental results on GPT-3.5-Turbo and GPT-4o-mini show consistent performance improvements across various scenarios and settings, demonstrating strong robustness and applicability.

Conclusion: EDC2-RAG effectively addresses RAG limitations by leveraging inter-document relationships for better document compression, leading to improved generation quality and reduced errors.

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for knowledge injection during large language model (LLM) inference in recent years. However, due to their limited ability to exploit fine-grained inter-document relationships, current RAG implementations face challenges in effectively addressing noise and redundancy in retrieved content, which may cause errors in the generation results. To address these limitations, we propose an Efficient Dynamic Clustering-based document Compression framework (EDC2-RAG) that utilizes latent inter-document relationships while simultaneously removing irrelevant information and redundant content. We validate our approach, built upon GPT-3.5-Turbo and GPT-4o-mini, on widely used knowledge-QA and hallucination-detection datasets. Experimental results show that our method achieves consistent performance improvements across various scenarios and experimental settings, demonstrating strong robustness and applicability. Our code and datasets are available at https://github.com/Tsinghua-dhy/EDC-2-RAG.
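
As a stand-in for the dynamic clustering step, the sketch below drops redundant retrieved documents greedily by embedding similarity, walking them in order of query relevance. The threshold and random vectors are illustrative; the paper's clustering is more elaborate.

```python
import numpy as np

def compress_documents(doc_embs: np.ndarray, query_emb: np.ndarray,
                       sim_threshold: float = 0.85) -> list:
    """Greedy redundancy removal: keep documents in order of query
    relevance, skipping any one too similar to a document already kept."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    d, q = norm(doc_embs), norm(query_emb)
    order = np.argsort(-(d @ q))          # most query-relevant first
    kept = []
    for i in order:
        if all(d[i] @ d[j] < sim_threshold for j in kept):
            kept.append(int(i))
    return kept

rng = np.random.default_rng(1)
print(compress_documents(rng.normal(size=(8, 32)), rng.normal(size=32)))
```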

[110] Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models

NVIDIA, :, Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhshi, Amala Sanjay Deshmukh, Ameya Sunil Mahabaleshwarkar, Andrew Tao, Anna Shors, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, Balaram Buddharaju, Bobby Chen, Boris Ginsburg, Boxin Wang, Brandon Norick, Brian Butterfield, Bryan Catanzaro, Carlo del Mundo, Chengyu Dong, Christine Harvey, Christopher Parisien, Dan Su, Daniel Korzekwa, Danny Yin, Daria Gitman, David Mosallanezhad, Deepak Narayanan, Denys Fridman, Dima Rekesh, Ding Ma, Dmytro Pykhtar, Dong Ahn, Duncan Riach, Dusan Stosic, Eileen Long, Elad Segal, Ellie Evans, Eric Chung, Erick Galinkin, Evelina Bakhturina, Ewa Dobrowolska, Fei Jia, Fuxiao Liu, Gargi Prasad, Gerald Shen, Guilin Liu, Guo Chen, Haifeng Qian, Helen Ngo, Hongbin Liu, Hui Li, Igor Gitman, Ilia Karmanov, Ivan Moshkov, Izik Golan, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jarno Seppanen, Jason Lu, Jason Sewall, Jiaqi Zeng, Jiaxuan You, Jimmy Zhang, Jing Zhang, Jining Huang, Jinze Xue, Jocelyn Huang, Joey Conway, John Kamalu, Jon Barker, Jonathan Cohen, Joseph Jennings, Jupinder Parmar, Karan Sapra, Kari Briski, Kateryna Chumachenko, Katherine Luna, Keshav Santhanam, Kezhi Kong, Kirthi Sivamani, Krzysztof Pawelec, Kumar Anik, Kunlun Li, Lawrence McAfee, Leon Derczynski, Lindsey Pavao, Luis Vega, Lukas Voegtle, Maciej Bala, Maer Rodrigues de Melo, Makesh Narsimhan Sreedhar, Marcin Chochowski, Markus Kliegl, Marta Stepniewska-Dziubinska, Matthieu Le, Matvei Novikov, Mehrzad Samadi, Michael Andersch, Michael Evans, Miguel Martinez, Mike Chrzanowski, Mike Ranzinger, Mikolaj Blaz, Misha Smelyanskiy, Mohamed Fawzy, Mohammad Shoeybi, Mostofa Patwary, Nayeon Lee, Nima Tajbakhsh, Ning Xu, Oleg Rybakov, Oleksii Kuchaiev, Olivier Delalleau, Osvald Nitski, Parth Chadha, Pasha Shamis, Paulius Micikevicius, Pavlo Molchanov, Peter Dykas, Philipp Fischer, Pierre-Yves Aquilanti, Piotr Bialecki, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Rabeeh Karimi, Rahul Kandu, Ran El-Yaniv, Raviraj Joshi, Roger Waleffe, Ruoxi Zhang, Sabrina Kavanaugh, Sahil Jain, Samuel Kriman, Sangkug Lym, Sanjeev Satheesh, Saurav Muralidharan, Sean Narenthiran, Selvaraj Anandaraj, Seonmyeong Bak, Sergey Kashirsky, Seungju Han, Shantanu Acharya, Shaona Ghosh, Sharath Turuvekere Sreenivas, Sharon Clay, Shelby Thomas, Shrimai Prabhumoye, Shubham Pachori, Shubham Toshniwal, Shyamala Prayaga, Siddhartha Jain, Sirshak Das, Slawek Kierat, Somshubra Majumdar, Song Han, Soumye Singhal, Sriharsha Niverty, Stefania Alborghetti, Suseella Panguluri, Swetha Bhendigeri, Syeda Nahida Akter, Szymon Migacz, Tal Shiri, Terry Kong, Timo Roman, Tomer Ronen, Trisha Saar, Tugrul Konuk, Tuomas Rintamaki, Tyler Poon, Ushnish De, Vahid Noroozi, Varun Singh, Vijay Korthikanti, Vitaly Kurin, Wasi Uddin Ahmad, Wei Du, Wei Ping, Wenliang Dai, Wonmin Byeon, Xiaowei Ren, Yao Xu, Yejin Choi, Yian Zhang, Ying Lin, Yoshi Suhara, Zhiding Yu, Zhiqi Li, Zhiyu Li, Zhongbo Zhu, Zhuolin Yang, Zijia Chen

Main category: cs.CL

TL;DR: Nemotron-H introduces hybrid Mamba-Transformer models that replace most self-attention layers with more efficient Mamba layers, achieving similar or better accuracy than comparable Transformer models while being up to 3x faster at inference.

DetailsMotivation: As inference-time scaling becomes critical for enhanced reasoning capabilities, there is a growing need to build models that are efficient to infer while maintaining accuracy.

Method: Replace majority of self-attention layers in Transformer architecture with Mamba layers that perform constant computation and require constant memory per token. Use MiniPuzzle compression technique (pruning and distillation) to create smaller models from larger ones. Implement FP8-based training recipe as alternative to BF16.

Result: Nemotron-H models achieve better or on-par accuracy compared to similarly-sized state-of-the-art Transformer models (Qwen-2.5-7B/72B, Llama-3.1-8B/70B) while being up to 3x faster at inference. Nemotron-H-47B-Base achieves similar accuracy to 56B model but is 20% faster. FP8 training achieves on-par results with BF16.

Conclusion: Hybrid Mamba-Transformer architecture with efficient Mamba layers significantly reduces inference costs while maintaining competitive accuracy, making it a promising approach for efficient large language models.

Abstract: As inference-time scaling becomes critical for enhanced reasoning capabilities, it is increasingly important to build models that are efficient to infer. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level. To achieve this goal, we replace the majority of self-attention layers in the common Transformer model architecture with Mamba layers that perform constant computation and require constant memory per generated token. We show that Nemotron-H models offer either better or on-par accuracy compared to other similarly-sized state-of-the-art open-sourced Transformer models (e.g., Qwen-2.5-7B/72B and Llama-3.1-8B/70B), while being up to 3x faster at inference. To further increase inference speed and reduce the memory required at inference time, we created Nemotron-H-47B-Base from the 56B model using a new compression via pruning and distillation technique called MiniPuzzle. Nemotron-H-47B-Base achieves similar accuracy to the 56B model, but is 20% faster to infer. In addition, we introduce an FP8-based training recipe and show that it can achieve on-par results with BF16-based training. This recipe is used to train the 56B model. We are releasing Nemotron-H base model checkpoints with support in Hugging Face and NeMo.
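
A back-of-the-envelope view of why replacing attention with Mamba layers shrinks inference memory: only the remaining self-attention layers accumulate a KV cache that grows with sequence length. The layer counts below are illustrative, not Nemotron-H's published configuration.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_el=2):
    """Per-sequence KV-cache size in bytes: only self-attention layers grow
    with sequence length; Mamba layers hold constant-size state (ignored)."""
    return n_attn_layers * 2 * n_kv_heads * head_dim * seq_len * bytes_per_el

# Illustrative config: 64 layers total, 8 kept as self-attention in the
# hybrid, versus all 64 in a pure Transformer.
pure = kv_cache_bytes(64, 8, 128, 32768)
hybrid = kv_cache_bytes(8, 8, 128, 32768)
print(f"hybrid KV cache is {pure / hybrid:.0f}x smaller at 32K tokens")  # 8x
```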

[111] Advancing Scientific Text Classification: Fine-Tuned Models with Dataset Expansion and Hard-Voting

Zhyar Rzgar K Rostam, Gábor Kertész

Main category: cs.CL

TL;DR: This study demonstrates that fine-tuning pre-trained language models (BERT, SciBERT, BioBERT, BlueBERT) on an augmented Web of Science dataset significantly improves scientific text classification accuracy, with domain-specific models outperforming general ones.

DetailsMotivation: Efficient text classification is essential for handling the increasing volume of academic publications, requiring robust and scalable automated solutions.

Method: Augmented the WoS-46985 dataset with seven targeted queries (1,000 articles per category), used PLMs for label prediction with a hard-voting strategy, and fine-tuned models with dynamic learning rates and early stopping.

Result: Fine-tuning on expanded dataset significantly boosts classification accuracy, especially in specialized domains. Domain-specific models (SciBERT, BioBERT) consistently outperform general-purpose models (BERT).

Conclusion: Dataset augmentation, inference-driven label prediction, hard-voting, and fine-tuning techniques create robust and scalable solutions for automated academic text classification.

Abstract: Efficient text classification is essential for handling the increasing volume of academic publications. This study explores the use of pre-trained language models (PLMs), including BERT, SciBERT, BioBERT, and BlueBERT, fine-tuned on the Web of Science (WoS-46985) dataset for scientific text classification. To enhance performance, we augment the dataset by executing seven targeted queries in the WoS database, retrieving 1,000 articles per category aligned with WoS-46985’s main classes. PLMs predict labels for this unlabeled data, and a hard-voting strategy combines predictions for improved accuracy and confidence. Fine-tuning on the expanded dataset with dynamic learning rates and early stopping significantly boosts classification accuracy, especially in specialized domains. Domain-specific models like SciBERT and BioBERT consistently outperform general-purpose models such as BERT. These findings underscore the efficacy of dataset augmentation, inference-driven label prediction, hard-voting, and fine-tuning techniques in creating robust and scalable solutions for automated academic text classification.
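
The hard-voting step is simple to sketch: each fine-tuned PLM predicts a label for an unlabeled article and the majority label wins. The tie-break shown (first prediction in model order) is a simplifying assumption, not necessarily the paper's rule.

```python
from collections import Counter
from typing import List

def hard_vote(model_predictions: List[List[str]]) -> List[str]:
    """Combine per-model label predictions by majority vote; ties fall to
    the earliest prediction that reaches the top count."""
    labels = []
    for preds in zip(*model_predictions):     # one tuple of labels per article
        counts = Counter(preds)
        top = max(counts.values())
        labels.append(next(p for p in preds if counts[p] == top))
    return labels

# Each row: one model's predictions over three unlabeled articles.
bert    = ["CS", "Medical", "ECE"]
scibert = ["CS", "Biochemistry", "ECE"]
biobert = ["Psychology", "Biochemistry", "ECE"]
print(hard_vote([bert, scibert, biobert]))  # ['CS', 'Biochemistry', 'ECE']
```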

[112] Assessing and Mitigating Medical Knowledge Drift and Conflicts in Large Language Models

Weiyi Wu, Xinwen Xu, Chongyang Gao, Xingjian Diao, Siting Li, Lucas A. Salas, Jiang Gui

Main category: cs.CL

TL;DR: LLMs struggle with evolving medical guidelines, often providing outdated or conflicting recommendations. The study created DriftMedQA benchmark to test temporal reliability and found mitigation strategies like RAG and DPO fine-tuning improve performance.

DetailsMotivation: LLMs have great potential in healthcare but face challenges adapting to rapidly evolving medical knowledge, which can lead to outdated or contradictory treatment suggestions.

Method: Developed DriftMedQA benchmark to simulate guideline evolution, evaluated 7 state-of-the-art LLMs across 4,290 scenarios, and explored two mitigation strategies: Retrieval-Augmented Generation and preference fine-tuning via Direct Preference Optimization.

Result: Models demonstrated difficulties in rejecting outdated recommendations and frequently endorsed conflicting guidance. Both mitigation strategies improved performance, with their combination yielding the most consistent and reliable results.

Conclusion: Findings underscore the need to improve LLM robustness to temporal shifts to ensure more dependable applications in clinical practice. The dataset is publicly available for further research.

Abstract: Large Language Models (LLMs) have great potential in the field of health care, yet they face significant challenges in adapting to rapidly evolving medical knowledge, which can lead to outdated or contradictory treatment suggestions. This study investigated how LLMs respond to evolving clinical guidelines, focusing on concept drift and internal inconsistencies. We developed the DriftMedQA benchmark to simulate guideline evolution and assessed the temporal reliability of various LLMs. Our evaluation of seven state-of-the-art models across 4,290 scenarios demonstrated difficulties in rejecting outdated recommendations and a tendency to endorse conflicting guidance. Additionally, we explored two mitigation strategies: Retrieval-Augmented Generation and preference fine-tuning via Direct Preference Optimization. While each method improved model performance, their combination led to the most consistent and reliable results. These findings underscore the need to improve LLM robustness to temporal shifts to ensure more dependable applications in clinical practice. The dataset is available at https://huggingface.co/datasets/RDBH/DriftMed.

[113] Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis

Haoming Huang, Yibo Yan, Jiahao Huo, Xin Zou, Xinfeng Li, Kun Wang, Xuming Hu

Main category: cs.CL

TL;DR: PhantomCircuit is a framework that analyzes knowledge overshadowing in LLMs - where one piece of knowledge masks another, causing hallucinations. It uses knowledge circuit analysis to understand attention patterns and training dynamics.

DetailsMotivation: LLMs suffer from knowledge overshadowing hallucinations where relevant knowledge gets masked, but current understanding is limited to inference-time observations without insights into training origins and mechanisms.

Method: PhantomCircuit employs knowledge circuit analysis to dissect key component functions and attention pattern dynamics, tracking how overshadowing evolves throughout the training process.

Result: Extensive experiments demonstrate PhantomCircuit’s effectiveness in identifying knowledge overshadowing instances, providing novel insights into this type of hallucination.

Conclusion: The framework offers the research community a new methodological approach for analyzing and potentially mitigating knowledge overshadowing in large language models.

Abstract: Large Language Models (LLMs), despite their remarkable capabilities, are hampered by hallucinations. A particularly challenging variant, knowledge overshadowing, occurs when one piece of activated knowledge inadvertently masks another relevant piece, leading to erroneous outputs even with high-quality training data. Current understanding of overshadowing is largely confined to inference-time observations, lacking deep insights into its origins and internal mechanisms during model training. Therefore, we introduce PhantomCircuit, a novel framework designed to comprehensively analyze and detect knowledge overshadowing. By innovatively employing knowledge circuit analysis, PhantomCircuit dissects the function of key components in the circuit and how the attention pattern dynamics contribute to the overshadowing phenomenon and its evolution throughout the training process. Extensive experiments demonstrate PhantomCircuit’s effectiveness in identifying such instances, offering novel insights into this elusive hallucination and providing the research community with a new methodological lens for its potential mitigation.

[114] Language Mixing in Reasoning Language Models: Patterns, Impact, and Internal Causes

Mingyang Wang, Lukas Lange, Heike Adel, Yunpu Ma, Jannik Strötgen, Hinrich Schütze

Main category: cs.CL

TL;DR: A systematic study of language mixing in reasoning language models shows that it affects performance, that constrained decoding into specific scripts improves accuracy, and that language mixing reflects internal processing preferences.

DetailsMotivation: Language mixing (reasoning steps containing tokens from languages other than the prompt) has been observed in RLMs and shown to affect performance, but its impact remains debated and requires systematic investigation.

Method: Examined language mixing patterns across 15 languages, 7 task difficulty levels, and 18 subject areas. Used constrained decoding to force models to reason in specific scripts (Latin or Han) and analyzed script composition alignment with internal representations.

Result: Language mixing is influenced by language, task difficulty, and subject area. Forcing models to reason in Latin or Han scripts via constrained decoding notably improves accuracy. Script composition of reasoning traces aligns with internal representations.

Conclusion: Language mixing reflects latent processing preferences in RLMs. Findings provide actionable insights for optimizing multilingual reasoning and enable controlling reasoning languages to build more interpretable and adaptable models.

Abstract: Reasoning language models (RLMs) excel at complex tasks by leveraging a chain-of-thought process to generate structured intermediate steps. However, language mixing, i.e., reasoning steps containing tokens from languages other than the prompt, has been observed in their outputs and shown to affect performance, though its impact remains debated. We present the first systematic study of language mixing in RLMs, examining its patterns, impact, and internal causes across 15 languages, 7 task difficulty levels, and 18 subject areas, and show how all three factors influence language mixing. Moreover, we demonstrate that the choice of reasoning language significantly affects performance: forcing models to reason in Latin or Han scripts via constrained decoding notably improves accuracy. Finally, we show that the script composition of reasoning traces closely aligns with that of the model’s internal representations, indicating that language mixing reflects latent processing preferences in RLMs. Our findings provide actionable insights for optimizing multilingual reasoning and open new directions for controlling reasoning languages to build more interpretable and adaptable RLMs.
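
A minimal sketch of what script-constrained decoding can look like with Hugging Face transformers: a logits processor that suppresses any token containing letters outside the target script. The Unicode-name heuristic and the per-token vocabulary scan are simplifying assumptions, not the authors' implementation.

```python
import unicodedata
import torch
from transformers import LogitsProcessor

def is_latin(ch: str) -> bool:
    # For Han-script decoding, test for "CJK" in the Unicode name instead.
    return "LATIN" in unicodedata.name(ch, "")

class ScriptConstraint(LogitsProcessor):
    """Keep only tokens whose alphabetic characters belong to the target script."""

    def __init__(self, tokenizer, allowed=is_latin):
        self.keep_ids = torch.tensor([
            tid for tid in range(tokenizer.vocab_size)
            if all(allowed(c) for c in tokenizer.decode([tid]) if c.isalpha())
        ])

    def __call__(self, input_ids, scores):
        masked = torch.full_like(scores, float("-inf"))
        masked[:, self.keep_ids] = scores[:, self.keep_ids]
        return masked

# Usage sketch: model.generate(**inputs, logits_processor=[ScriptConstraint(tokenizer)])
```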

[115] VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models

Heyang Liu, Yuhao Wang, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang

Main category: cs.CL

TL;DR: VocalBench is a comprehensive benchmark for evaluating speech conversational abilities of multimodal models, addressing gaps in existing evaluations by covering semantic quality, acoustic performance, conversational abilities, and robustness across 9,400 curated instances.

DetailsMotivation: Existing evaluations of speech interaction models lack real-world scenarios and focus mainly on textual responses, overlooking critical vocal performance aspects like acoustic variations, paralanguage cues, and environmental context.

Method: Proposed VocalBench with 9,400 curated instances across four dimensions: semantic quality, acoustic performance, conversational abilities, and robustness. Used objective evaluation indicators and LLM-as-a-judge approach for scoring open-ended questions.

Result: Experimental evaluation of 15 mainstream systems revealed significant variability, with each system exhibiting distinct strengths and weaknesses across different evaluation dimensions.

Conclusion: VocalBench provides valuable insights to guide future research in speech interaction systems by offering comprehensive evaluation of speech conversational abilities that better reflects real-world scenarios.

Abstract: The rapid advancement of large language models (LLMs) has accelerated the development of multimodal models capable of speech communications. Unlike text interactions, speech conveys diverse information, including acoustic variations, paralanguage cues, and environmental context. However, existing evaluations of speech interaction models lack instances mimicking real scenarios and predominantly focus on the quality of their textual responses, overlooking critical aspects of vocal performance. To address this gap, we propose VocalBench, a comprehensive benchmark to assess the speech conversational abilities, comprising 9,400 carefully curated instances across four key dimensions: semantic quality, acoustic performance, conversational abilities, and robustness. It covers a broad range of fundamental skills essential for effective vocal interactions. For the evaluation scheme, we propose several objective evaluation indicators and incorporate an additional LLM-as-a-judge approach to score open-ended questions. Experimental results on 15 mainstream systems reveal significant variability, each exhibiting distinct strengths and weaknesses, and provide valuable insights to guide future research in speech interaction systems.

[116] Too Consistent to Detect: A Study of Self-Consistent Errors in LLMs

Hexiang Tan, Fei Sun, Sha Liu, Du Su, Qi Cao, Xin Chen, Jingang Wang, Xunliang Cai, Yuanzhuo Wang, Huawei Shen, Xueqi Cheng

Main category: cs.CL

TL;DR: LLMs often generate self-consistent errors: the same incorrect response across multiple stochastic samples. Current detection methods struggle with these errors, which persist across model scaling. A cross-model probe that uses an external verifier LLM's hidden states significantly improves detection.

DetailsMotivation: Existing error detection methods overlook self-consistent errors where LLMs repeatedly generate the same incorrect content across multiple stochastic samples, revealing critical limitations in current detection approaches.

Method: Proposed a cross-model probe method that fuses hidden state evidence from an external verifier LLM to detect self-consistent errors, leveraging the observation that these errors often differ across LLMs.

Result: Self-consistent errors remain stable or increase with LLM scale, unlike inconsistent errors. All four mainstream detection methods struggle significantly. The proposed cross-model probe method significantly enhances performance on self-consistent errors across three LLM families.

Conclusion: Current detection methods have critical limitations in handling self-consistent errors. The cross-model verification approach using external LLM hidden states provides an effective solution, highlighting the need for improved detection techniques that address this persistent error type.

Abstract: As large language models (LLMs) often generate plausible but incorrect content, error detection has become increasingly critical to ensure truthfulness. However, existing detection methods often overlook a critical problem we term as self-consistent error, where LLMs repeatedly generate the same incorrect response across multiple stochastic samples. This work formally defines self-consistent errors and evaluates mainstream detection methods on them. Our investigation reveals two key findings: (1) Unlike inconsistent errors, whose frequency diminishes significantly as the LLM scale increases, the frequency of self-consistent errors remains stable or even increases. (2) All four types of detection methods significantly struggle to detect self-consistent errors. These findings reveal critical limitations in current detection methods and underscore the need for improvement. Motivated by the observation that self-consistent errors often differ across LLMs, we propose a simple but effective cross-model probe method that fuses hidden state evidence from an external verifier LLM. Our method significantly enhances performance on self-consistent errors across three LLM families.
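
A minimal sketch of a cross-model probe in the spirit described above: read out an external verifier LLM's last-token hidden state for each (question, answer) pair and train a linear probe on it. The prompt template and the logistic-regression probe are illustrative assumptions, not the paper's exact design.

```python
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def verifier_features(verifier, tokenizer, question, answer):
    """Last-token hidden state of an external verifier LLM reading the pair."""
    prompt = f"Question: {question}\nProposed answer: {answer}\nIs this correct?"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = verifier(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1].float().numpy()

# X: stacked feature vectors, y: 1 if the answer is a (self-consistent) error
# probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# p_error = probe.predict_proba(X_test)[:, 1]
```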

[117] Fast Quiet-STaR: Thinking Without Thought Tokens

Wei Huang, Yizhe Xiong, Xin Ye, Zhijie Deng, Hui Chen, Zijia Lin, Guiguang Ding

Main category: cs.CL

TL;DR: Fast Quiet-STaR is an efficient reasoning framework that improves on Quiet-STaR by reducing computational overhead through curriculum learning and reinforcement learning, achieving better accuracy within the same inference-time budget.

DetailsMotivation: Large Language Models need more than scaling to improve complex reasoning. While Quiet-STaR showed promise with token-level reasoning, it incurred substantial inference overhead that needed to be addressed.

Method: Proposes curriculum learning to gradually reduce thought tokens, enabling models to internalize abstract reasoning. Extends to Next Token Prediction via reinforcement learning fine-tuning to eliminate explicit thought generation during inference.

Result: Outperforms Quiet-STaR on four benchmark datasets with Mistral 7B and Qwen2.5 7B under the same inference-time budget. Fast Quiet-STaR NTP achieves a 9% average accuracy improvement on Mistral 7B and 5.7% on Qwen2.5 7B while maintaining the same inference latency.

Conclusion: Fast Quiet-STaR provides an efficient reasoning framework that preserves token-level reasoning benefits while significantly reducing computational costs, making it practical for real-world applications.

Abstract: Large Language Models (LLMs) have achieved impressive performance across a range of natural language processing tasks. However, recent advances demonstrate that further gains particularly in complex reasoning tasks require more than merely scaling up model sizes or training data. One promising direction is to enable models to think during the reasoning process. Recently, Quiet STaR significantly improves reasoning by generating token-level thought traces, but incurs substantial inference overhead. In this work, we propose Fast Quiet STaR, a more efficient reasoning framework that preserves the benefits of token-level reasoning while reducing computational cost. Our method introduces a curriculum learning based training strategy that gradually reduces the number of thought tokens, enabling the model to internalize more abstract and concise reasoning processes. We further extend this approach to the standard Next Token Prediction (NTP) setting through reinforcement learning-based fine-tuning, resulting in Fast Quiet-STaR NTP, which eliminates the need for explicit thought token generation during inference. Experiments on four benchmark datasets with Mistral 7B and Qwen2.5 7B demonstrate that Fast Quiet-STaR consistently outperforms Quiet-STaR in terms of average accuracy under the same inference time budget. Notably, Fast Quiet-STaR NTP achieves an average accuracy improvement of 9% on Mistral 7B and 5.7% on Qwen2.5 7B, while maintaining the same inference latency. Our code will be available at https://github.com/huangwei200012/Fast-Quiet-STaR.
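
The curriculum idea of gradually shrinking the thought-token budget can be sketched with a simple schedule. The linear decay and the budgets below are illustrative assumptions; the paper only states that the number of thought tokens is gradually reduced.

```python
def thought_token_budget(step, total_steps, start=16, end=0):
    """Linearly anneal the per-step thought-token budget over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return round(start + (end - start) * frac)

print([thought_token_budget(s, 10) for s in range(11)])
# [16, 14, 13, 11, 10, 8, 6, 5, 3, 2, 0]
```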

[118] Rhapsody: A Dataset for Highlight Detection in Podcasts

Younghan Park, Anuj Diwan, David Harwath, Eunsol Choi

Main category: cs.CL

TL;DR: The Rhapsody dataset enables podcast highlight detection using YouTube’s ‘most replayed’ data; fine-tuned models that combine speech and text features outperform zero-shot LLMs.

DetailsMotivation: Podcasts have massive content volume, making automatic highlight detection valuable for users to quickly understand episodes and decide what to listen to, but this is challenging due to unstructured long-form content.

Method: Created the Rhapsody dataset of 13K podcast episodes, using YouTube’s ‘most replayed’ feature for highlight scores. The task is framed as segment-level binary classification; zero-shot LLMs (GPT-4o, Gemini) were compared against fine-tuned models using both speech signals and transcripts.

Result: State-of-the-art LLMs struggled with the task, while fine-tuned models significantly outperformed their zero-shot counterparts. The best results came from models leveraging both speech features and transcripts.

Conclusion: Fine-grained information access in long-form spoken media remains challenging, requiring in-domain fine-tuning and multimodal approaches combining audio and text features for effective podcast highlight detection.

Abstract: Podcasts have become daily companions for half a billion users. Given the enormous amount of podcast content available, highlights provide a valuable signal that helps viewers get the gist of an episode and decide if they want to invest in listening to it in its entirety. However, identifying highlights automatically is challenging due to the unstructured and long-form nature of the content. We introduce Rhapsody, a dataset of 13K podcast episodes paired with segment-level highlight scores derived from YouTube’s ‘most replayed’ feature. We frame the podcast highlight detection as a segment-level binary classification task. We explore various baseline approaches, including zero-shot prompting of language models and lightweight fine-tuned language models using segment-level classification heads. Our experimental results indicate that even state-of-the-art language models like GPT-4o and Gemini struggle with this task, while models fine-tuned with in-domain data significantly outperform their zero-shot performance. The fine-tuned model benefits from leveraging both speech signal features and transcripts. These findings highlight the challenges for fine-grained information access in long-form spoken media.

[119] Self-Critique and Refinement for Faithful Natural Language Explanations

Yingming Wang, Pepa Atanasova

Main category: cs.CL

TL;DR: SR-NLE framework enables LLMs to self-critique and refine their natural language explanations to improve faithfulness without external supervision, reducing unfaithfulness rates by 18.79% on average.

DetailsMotivation: Natural Language Explanations from LLMs often fail to faithfully represent the model's actual reasoning process, and existing self-critique capabilities haven't been explored for improving explanation faithfulness.

Method: Introduces SR-NLE framework with iterative critique and refinement process using natural language self-feedback and novel feature attribution feedback that highlights important input words.

Result: Experiments across 3 datasets and 4 state-of-the-art LLMs show significant reduction in unfaithfulness rates (36.02% vs 54.81% baseline), an absolute reduction of 18.79%.

Conclusion: LLMs can refine their explanations to better reflect actual reasoning with appropriate feedback guidance, without additional training or fine-tuning.

Abstract: With the rapid development of Large Language Models (LLMs), Natural Language Explanations (NLEs) have become increasingly important for understanding model predictions. However, these explanations often fail to faithfully represent the model’s actual reasoning process. While existing work has demonstrated that LLMs can self-critique and refine their initial outputs for various tasks, this capability remains unexplored for improving explanation faithfulness. To address this gap, we introduce Self-critique and Refinement for Natural Language Explanations (SR-NLE), a framework that enables models to improve the faithfulness of their own explanations – specifically, post-hoc NLEs – through an iterative critique and refinement process without external supervision. Our framework leverages different feedback mechanisms to guide the refinement process, including natural language self-feedback and, notably, a novel feedback approach based on feature attribution that highlights important input words. Our experiments across three datasets and four state-of-the-art LLMs demonstrate that SR-NLE significantly reduces unfaithfulness rates, with our best method achieving an average unfaithfulness rate of 36.02%, compared to 54.81% for baseline – an absolute reduction of 18.79%. These findings reveal that the investigated LLMs can indeed refine their explanations to better reflect their actual reasoning process, requiring only appropriate guidance through feedback without additional training or fine-tuning.
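
A minimal sketch of feature-attribution feedback in this spirit: estimate word importance by occlusion (the score drop when a word is removed) and turn the top words into a refinement instruction. Occlusion and the feedback wording are illustrative stand-ins for SR-NLE's exact design.

```python
def word_importance(score_fn, words):
    """score_fn maps a text to the model's confidence in its own prediction."""
    base = score_fn(" ".join(words))
    return [(w, base - score_fn(" ".join(words[:i] + words[i + 1:])))
            for i, w in enumerate(words)]

def attribution_feedback(importance, k=3):
    top = [w for w, _ in sorted(importance, key=lambda t: t[1], reverse=True)[:k]]
    return ("Revise the explanation so it accounts for these influential "
            "input words: " + ", ".join(top))
```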

[120] ChatCFD: An LLM-Driven Agent for End-to-End CFD Automation with Domain-Specific Structured Reasoning

E Fan, Kang Hu, Zhuowen Wu, Jiangyang Ge, Jiawei Miao, Yuzhi Zhang, He Sun, Weizong Wang, Tianhan Zhang

Main category: cs.CL

TL;DR: ChatCFD is an automated agent system that uses large language models to simplify and automate OpenFOAM CFD simulations, achieving an 82.1% success rate on basic cases and significantly outperforming existing solutions.

DetailsMotivation: CFD simulations are hindered by operational complexity, high expertise requirements, and limited accessibility, creating barriers for widespread scientific and engineering applications.

Method: Four-stage pipeline using multi-agent architecture with DeepSeek LLMs: Knowledge Base Construction, User Input Processing, Case File Generation, and Execution and Error Reflection with iterative refinement.

Result: 82.1% operational success rate on basic cases (vs 6.2% for MetaOpenFOAM and 42.3% for Foam-Agent), 60-80% on complex literature cases, with varying performance across turbulence models and physics coupling scenarios.

Conclusion: ChatCFD successfully automates CFD hypothesis testing and parameter exploration, addressing LLM limitations through structured design and showing strong potential as a modular component in multi-agent systems for scalable AI-driven CFD innovation.

Abstract: Computational Fluid Dynamics (CFD) is essential for advancing scientific and engineering fields but is hindered by operational complexity, high expertise requirements, and limited accessibility. This paper introduces ChatCFD, an automated agent system for OpenFOAM simulations that processes multi-modal inputs (e.g., research papers, meshes) via an interactive interface, leveraging DeepSeek-R1 and DeepSeek-V3 large language models, a multi-agent architecture, and OpenFOAM knowledge. Its four-stage pipeline (Knowledge Base Construction, User Input Processing, Case File Generation, and Execution and Error Reflection) enables iterative trial-reflection-refinement for intricate setups, supporting diverse physical models and external meshes. Validation on 205 benchmark tutorial cases, 110 perturbed variants, and 2 literature-derived cases shows ChatCFD’s 82.1 percent operational success rate on basic cases, outperforming MetaOpenFOAM (6.2 percent) and Foam-Agent (42.3 percent), and 60-80 percent on literature-derived complex cases. Turbulence model studies show a 40 percent success rate for common models versus 10 percent for rare ones like RNG k-epsilon. Physics coupling analyses reveal higher resource demands for multi-physics-coupled cases, while LLM bias toward simpler setups introduces persistent errors, such as dimensional inconsistency. Ablation studies highlight the efficacy of RAG-based modules and reflection mechanisms. By automating hypothesis testing and parameter exploration, ChatCFD accelerates scientific discovery in fluid mechanics and engineering, addressing LLM limitations through structured design and showing strong potential as a modular component in MCP-based agent networks for collaborative multi-agent systems, paving the way for scalable AI-driven CFD innovation. The code for ChatCFD is available at https://github.com/ConMoo/ChatCFD.

[121] TreeReview: A Dynamic Tree of Questions Framework for Deep and Efficient LLM-based Scientific Peer Review

Yuan Chang, Ziyue Li, Hengyuan Zhang, Yuanbo Kong, Yanru Wu, Hayden Kwok-Hay So, Zhijiang Guo, Liya Zhu, Ngai Wong

Main category: cs.CL

TL;DR: TreeReview is a hierarchical QA framework that generates comprehensive paper reviews by recursively decomposing questions and aggregating answers, achieving better quality while reducing token usage by up to 80%.

DetailsMotivation: Current LLM-based peer review methods struggle to generate thorough and insightful reviews while maintaining efficiency, needing a more structured approach.

Method: Models paper review as hierarchical bidirectional QA - constructs question trees by recursive decomposition, resolves them bottom-up, and uses dynamic question expansion for deeper probing.

Result: Outperforms baselines in comprehensive and expert-aligned reviews, reduces LLM token usage by up to 80% compared to intensive approaches.

Conclusion: TreeReview provides an effective framework for generating high-quality peer reviews efficiently through structured hierarchical question-answering.

Abstract: While Large Language Models (LLMs) have shown significant potential in assisting peer review, current methods often struggle to generate thorough and insightful reviews while maintaining efficiency. In this paper, we propose TreeReview, a novel framework that models paper review as a hierarchical and bidirectional question-answering process. TreeReview first constructs a tree of review questions by recursively decomposing high-level questions into fine-grained sub-questions and then resolves the question tree by iteratively aggregating answers from leaf to root to get the final review. Crucially, we incorporate a dynamic question expansion mechanism to enable deeper probing by generating follow-up questions when needed. We construct a benchmark derived from ICLR and NeurIPS venues to evaluate our method on full review generation and actionable feedback comments generation tasks. Experimental results of both LLM-based and human evaluation show that TreeReview outperforms strong baselines in providing comprehensive, in-depth, and expert-aligned review feedback, while reducing LLM token usage by up to 80% compared to computationally intensive approaches. Our code and benchmark dataset are available at https://github.com/YuanChang98/tree-review.
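
The recursive decompose-then-aggregate loop can be sketched as follows; `llm` is a placeholder callable for any chat model, and the prompts and fixed depth limit are illustrative (TreeReview's dynamic follow-up question expansion is omitted here).

```python
def tree_review(question, paper, llm, depth=0, max_depth=2):
    """Decompose a review question, answer at the leaves, aggregate bottom-up."""
    if depth == max_depth:
        return llm(f"Answer this review question about the paper.\n"
                   f"Question: {question}\nPaper:\n{paper}")
    subs = llm(f"Decompose into 2-4 finer-grained sub-questions, one per line:\n{question}")
    answers = [tree_review(q.strip(), paper, llm, depth + 1, max_depth)
               for q in subs.splitlines() if q.strip()]
    return llm(f"Synthesize an answer to '{question}' from these sub-answers:\n"
               + "\n".join(answers))

# review = tree_review("Is this paper sound and novel?", paper_text, llm)
```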

[122] Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models

Maximilian Kreutner, Marlene Lutz, Markus Strohmaier

Main category: cs.CL

TL;DR: LLMs can simulate European Parliament voting behavior with a weighted F1 score of approximately 0.793 using zero-shot persona prompts, predicting individual and aggregate political positions despite their documented left-leaning bias.

DetailsMotivation: LLMs show progressive left-leaning bias but persona prompts can align them with different socioeconomic groups. Researchers want to test if zero-shot persona prompting can accurately predict voting behavior and policy positions of European political groups.

Method: Used zero-shot persona prompting with limited information to predict individual voting decisions and aggregate policy positions. Evaluated stability against counterfactual arguments, different persona prompts, and generation methods.

Result: Achieved weighted F1 score of approximately 0.793 for simulating voting behavior of Members of the European Parliament. Predictions were stable across different prompts and arguments.

Conclusion: Persona prompting effectively simulates political voting behavior, providing a viable method for political prediction and analysis despite LLMs’ inherent biases. Dataset and code are publicly available for further research.

Abstract: Large Language Models (LLMs) display remarkable capabilities to understand or even produce political discourse, but have been found to consistently display a progressive left-leaning bias. At the same time, so-called persona or identity prompts have been shown to produce LLM behavior that aligns with socioeconomic groups that the base model is not aligned with. In this work, we analyze whether zero-shot persona prompting with limited information can accurately predict individual voting decisions and, by aggregation, accurately predict positions of European groups on a diverse set of policies. We evaluate if predictions are stable towards counterfactual arguments, different persona prompts and generation methods. Finally, we find that we can simulate voting behavior of Members of the European Parliament reasonably well with a weighted F1 score of approximately 0.793. Our persona dataset of politicians in the 2024 European Parliament and our code are available at https://github.com/dess-mannheim/european_parliament_simulation.
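
A minimal sketch of zero-shot persona prompting and the weighted-F1 evaluation; the prompt template and the toy vote labels are illustrative assumptions, not the paper's materials.

```python
from sklearn.metrics import f1_score

def persona_vote_prompt(persona: str, proposal: str) -> str:
    return (f"You are {persona}, a Member of the European Parliament.\n"
            f"Proposal: {proposal}\n"
            "Answer with exactly one word: For, Against, or Abstain.")

# After collecting model answers and the actual roll-call votes:
y_true = ["For", "Against", "For", "Abstain", "For"]
y_pred = ["For", "Against", "For", "For", "For"]
print(f1_score(y_true, y_pred, average="weighted"))  # the paper reports ~0.793
```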

[123] A Structured Dataset of Disease-Symptom Associations to Improve Diagnostic Accuracy

Abdullah Al Shafi, Rowzatul Zannat, Abdul Muntakim, Mahmudul Hasan

Main category: cs.CL

TL;DR: A structured Bangla disease-symptom dataset compiled from verified medical sources with binary symptom associations for machine learning applications.

DetailsMotivation: To address the significant gap in structured disease-symptom datasets for the Bangla language and improve medical informatics tools for underrepresented linguistic communities.

Method: Systematic compilation from peer-reviewed medical articles, clinical case studies, and verified online sources using binary tabular format (diseases vs symptoms with 0/1 associations).

Result: Created a structured tabular dataset with disease-symptom relationships that enables machine learning-based disease prediction and clinical decision support.

Conclusion: The dataset bridges the Bangla language gap in medical informatics and should be expanded with region-specific diseases and refined symptom associations for better diagnostic performance.

Abstract: Disease-symptom datasets are significant and in demand for medical research, disease diagnosis, clinical decision-making, and AI-driven health management applications. These datasets help identify symptom patterns associated with specific diseases, thus improving diagnostic accuracy and enabling early detection. The dataset presented in this study systematically compiles disease-symptom relationships from various online sources, medical literature, and publicly available health databases. The data was gathered through analyzing peer-reviewed medical articles, clinical case studies, and disease-symptom association reports. Only the verified medical sources were included in the dataset, while those from non-peer-reviewed and anecdotal sources were excluded. The dataset is structured in a tabular format, where the first column represents diseases, and the remaining columns represent symptoms. Each symptom cell contains a binary value, indicating whether a symptom is associated with a disease. This structured representation thereby makes the dataset very useful for a wide range of applications, including machine learning-based disease prediction, clinical decision support systems, and epidemiological studies. Although there are some advancements in the field of disease-symptom datasets, there is a significant gap in structured datasets for the Bangla language. This dataset aims to bridge that gap by facilitating the development of multilingual medical informatics tools and improving disease prediction models for underrepresented linguistic communities. Further developments should include region-specific diseases and refinement of symptom associations for better diagnostic performance.
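
The binary tabular layout lends itself to simple overlap-based ranking; a toy sketch with pandas (the disease and symptom names are invented for illustration, not taken from the dataset).

```python
import pandas as pd

# Rows are diseases, columns are symptoms, cells are 0/1 associations.
df = pd.DataFrame(
    {"fever": [1, 1, 0], "cough": [1, 0, 0], "rash": [0, 0, 1]},
    index=["influenza", "typhoid", "measles"],
)

def rank_diseases(observed_symptoms):
    """Score each disease by how many observed symptoms it is associated with."""
    return df[list(observed_symptoms)].sum(axis=1).sort_values(ascending=False)

print(rank_diseases(["fever", "cough"]))  # influenza ranks highest
```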

[124] RADIANT: Retrieval AugmenteD entIty-context AligNmenT – Introducing RAG-ability and Entity-Context Divergence

Vipula Rawte, Rajarshi Roy, Gurpreet Singh, Danush Khanna, Yaswanth Narsupalli, Basab Ghosh, Abhay Gupta, Argha Kamal Samanta, Aditya Shingote, Aadi Krishna Vikram, Vinija Jain, Aman Chadha, Amit Sheth, Amitava Das

Main category: cs.CL

TL;DR: The paper introduces Entity-Context Divergence (ECD) metric to measure factual inconsistency in RAG systems and proposes Radiant framework to improve LLMs’ ability to faithfully integrate retrieved evidence into generated responses.

DetailsMotivation: LLMs often fail to accurately integrate retrieved external knowledge into their responses, leading to factual inconsistencies in retrieval-augmented generation despite the technique's potential to enhance factual accuracy.

Method: Introduces Radiant framework that extends Direct Preference Optimization (DPO) to teach LLMs how to properly integrate retrieved information. Uses Entity-Context Divergence (ECD) metric to quantify the gap between retrieved evidence and generated content.

Result: Empirical analysis shows RAG-ability remains low across most LLMs, with significant challenges in entity retention and context fidelity. Radiant framework demonstrates improved performance across various retrieval scenarios including noisy web contexts, knowledge conflicts, and hallucination reduction.

Conclusion: Radiant enables more reliable, contextually grounded, and factually coherent content generation by optimizing the interplay between retrieved evidence and generated content through alignment techniques.

Abstract: As Large Language Models (LLMs) continue to advance, Retrieval-Augmented Generation (RAG) has emerged as a vital technique to enhance factual accuracy by integrating external knowledge into the generation process. However, LLMs often fail to faithfully integrate retrieved evidence into their generated responses, leading to factual inconsistencies. To quantify this gap, we introduce Entity-Context Divergence (ECD), a metric that measures the extent to which retrieved information is accurately reflected in model outputs. We systematically evaluate contemporary LLMs on their ability to preserve factual consistency in retrieval-augmented settings, a capability we define as RAG-ability. Our empirical analysis reveals that RAG-ability remains low across most LLMs, highlighting significant challenges in entity retention and context fidelity. This paper introduces Radiant (Retrieval AugmenteD entIty-context AligNmenT), a novel framework that merges RAG with alignment designed to optimize the interplay between retrieved evidence and generated content. Radiant extends Direct Preference Optimization (DPO) to teach LLMs how to integrate provided additional information into subsequent generations. As a behavior correction mechanism, Radiant boosts RAG performance across varied retrieval scenarios, such as noisy web contexts, knowledge conflicts, and hallucination reduction. This enables more reliable, contextually grounded, and factually coherent content generation.
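
The sketch below shows one plausible way an entity-overlap divergence like ECD could be instantiated, using spaCy NER over the evidence and the generation; the paper's formal definition may differ, and the `en_core_web_sm` model is assumed to be installed.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_context_divergence(evidence: str, generation: str) -> float:
    """Fraction of evidence entities that never appear in the generation."""
    ev = {e.text.lower() for e in nlp(evidence).ents}
    gen = {e.text.lower() for e in nlp(generation).ents}
    return 1.0 - len(ev & gen) / len(ev) if ev else 0.0
```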

[125] Dynamic Injection of Entity Knowledge into Dense Retrievers

Ikuya Yamada, Ryokan Ri, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

Main category: cs.CL

TL;DR: KPR is a BERT-based dense retriever enhanced with context-entity attention and dynamic entity embeddings to improve retrieval of queries with less-frequent entities.

DetailsMotivation: Dense retrievers struggle with queries involving less-frequent entities due to limited entity knowledge, requiring better incorporation of external entity information.

Method: Proposes Knowledgeable Passage Retriever (KPR) with context-entity attention layer and dynamically updatable entity embeddings that can incorporate external knowledge without retraining.

Result: KPR consistently improves retrieval accuracy across three datasets, with particularly large gains on EntityQuestions dataset. Achieves state-of-the-art performance among similarly sized models on two datasets when built on bge-base retriever.

Conclusion: The proposed KPR framework effectively enhances dense retrievers with external entity knowledge, demonstrating significant performance improvements especially for entity-centric queries.

Abstract: Dense retrievers often struggle with queries involving less-frequent entities due to their limited entity knowledge. We propose the Knowledgeable Passage Retriever (KPR), a BERT-based retriever enhanced with a context-entity attention layer and dynamically updatable entity embeddings. This design enables KPR to incorporate external entity knowledge without retraining. Experiments on three datasets demonstrate that KPR consistently improves retrieval accuracy, with particularly large gains on the EntityQuestions dataset. When built on the off-the-shelf bge-base retriever, KPR achieves state-of-the-art performance among similarly sized models on two datasets. Models and code are released at https://github.com/knowledgeable-embedding/knowledgeable-embedding.
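
A minimal PyTorch sketch of a context-entity attention layer in this spirit: token states attend over embeddings of entities linked to the passage, and the entity table can be swapped or extended without retraining the encoder. The dimensions and residual fusion are illustrative assumptions, not KPR's exact architecture.

```python
import torch
import torch.nn as nn

class ContextEntityAttention(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, token_states, entity_embeds):
        # token_states: (B, T, d); entity_embeds: (B, E, d), updatable at will
        fused, _ = self.attn(token_states, entity_embeds, entity_embeds)
        return self.norm(token_states + fused)

layer = ContextEntityAttention()
out = layer(torch.randn(2, 16, 768), torch.randn(2, 4, 768))  # (2, 16, 768)
```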

[126] Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models

Kaiyan Chang, Yonghao Shi, Chenglong Wang, Hang Zhou, Chi Hu, Xiaoqian Liu, Yingfeng Luo, Yuan Ge, Tong Xiao, Jingbo Zhu

Main category: cs.CL

TL;DR: Hybrid Test-Time Scaling combines training-free methods like Conditional Step-level Self-refinement with parallel scaling techniques to enhance LLM reasoning without additional training overhead.

DetailsMotivation: Training-based TTS methods increase computation burden during inference, so the paper focuses on developing effective training-free TTS methods for reasoning tasks.

Method: Developed Conditional Step-level Self-refinement (fine-grained sequential scaling with process verification) and combined it with classical parallel scaling methods to create Hybrid Test-Time Scaling.

Result: Extensive experiments on 5 instruction-tuned LLMs (3B-14B) show hybrid training-free TTS methods significantly expand reasoning performance boundaries.

Conclusion: Training-free hybrid TTS strategies at fine granularity have considerable potential for enhancing LLM reasoning capabilities without additional training overhead.

Abstract: Test-Time Scaling (TTS) is a promising approach to progressively elicit the model’s intelligence during inference. Recently, training-based TTS methods, such as continued reinforcement learning (RL), have further surged in popularity, while training-free TTS methods are gradually fading from prominence. However, the additional computation overhead of training amplifies the burden on test-time scaling. In this paper, we focus on training-free TTS methods for reasoning. We first design Conditional Step-level Self-refinement, a fine-grained sequential scaling method guided by process verification. On top of its effectiveness, we further combine it with other classical parallel scaling methods at the step level, to introduce a novel inference paradigm called Hybrid Test-Time Scaling. Extensive experiments on five instruction-tuned LLMs across different scales (3B-14B) and families demonstrate that a hybrid strategy incorporating various training-free TTS methods at a fine granularity has considerable potential for expanding the reasoning performance boundaries of LLMs.
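
Conditional step-level self-refinement can be sketched as a propose-verify-revise loop; `llm` and `verify` are placeholder callables, and the stop criterion and retry budget are illustrative assumptions.

```python
def solve_with_step_refinement(problem, llm, verify, max_steps=16, max_tries=3):
    """Regenerate a reasoning step only when the process verifier rejects it."""
    steps = []
    for _ in range(max_steps):
        step = llm(problem, steps)
        for _ in range(max_tries - 1):
            if verify(problem, steps, step):
                break
            step = llm(problem, steps, feedback="revise this step")
        steps.append(step)
        if step.startswith("Answer:"):  # assumed convention for the final step
            break
    return steps
```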

[127] Not All Features Deserve Attention: Graph-Guided Dependency Learning for Tabular Data Generation with Language Models

Zheyu Zhang, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci

Main category: cs.CL

TL;DR: GraDe integrates sparse dependency graphs into LLM attention to improve tabular data generation by focusing on critical feature relationships while suppressing irrelevant ones.

DetailsMotivation: LLMs struggle with tabular data generation because their self-attention mechanism distributes focus across all feature pairs, diluting attention on critical dependencies in sparse tabular data structures.

Method: GraDe uses a lightweight dynamic graph learning module guided by externally extracted functional dependencies to explicitly integrate sparse dependency graphs into LLMs’ attention mechanism.

Result: Outperforms existing LLM-based approaches by up to 12% on complex datasets and achieves competitive results with state-of-the-art approaches in synthetic data quality.

Conclusion: GraDe provides a minimally intrusive yet effective practical solution for structure-aware tabular data modeling with LLMs.

Abstract: Large Language Models (LLMs) have shown strong potential for tabular data generation by modeling textualized feature-value pairs. However, tabular data inherently exhibits sparse feature-level dependencies, where many feature interactions are structurally insignificant. This creates a fundamental mismatch as LLMs’ self-attention mechanism inevitably distributes focus across all pairs, diluting attention on critical relationships, particularly in datasets with complex dependencies or semantically ambiguous features. To address this limitation, we propose GraDe (Graph-Guided Dependency Learning), a novel method that explicitly integrates sparse dependency graphs into LLMs’ attention mechanism. GraDe employs a lightweight dynamic graph learning module guided by externally extracted functional dependencies, prioritizing key feature interactions while suppressing irrelevant ones. Our experiments across diverse real-world datasets demonstrate that GraDe outperforms existing LLM-based approaches by up to 12% on complex datasets while achieving competitive results with state-of-the-art approaches in synthetic data quality. Our method is minimally intrusive yet effective, offering a practical solution for structure-aware tabular data modeling with LLMs.
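
A simplified view of injecting a sparse dependency graph into attention: add a large negative bias to feature pairs without an edge before the softmax. GraDe learns its graph dynamically from extracted functional dependencies; the fixed adjacency here is only for illustration.

```python
import torch

def dependency_masked_attention(scores, adjacency, neg=-1e9):
    """scores: (B, heads, n_feat, n_feat) logits; adjacency: (n_feat, n_feat) in {0,1}."""
    return torch.softmax(scores + (1.0 - adjacency) * neg, dim=-1)

scores = torch.randn(1, 2, 3, 3)
adj = torch.tensor([[1., 1., 0.],
                    [1., 1., 0.],
                    [0., 0., 1.]])
attn = dependency_masked_attention(scores, adj)
print(attn[0, 0])  # near-zero weight on the masked feature pairs
```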

[128] CodeMixBench: Evaluating Code-Mixing Capabilities of LLMs Across 18 Languages

Yilun Yang, Yekun Chai

Main category: cs.CL

TL;DR: CodeMixBench is a new benchmark for evaluating LLMs on code-mixing tasks across 18 languages and 8 tasks, revealing consistent underperformance and proposing a GPT-4 based synthetic data generation method.

DetailsMotivation: Existing benchmarks are limited in language pairs and tasks, failing to adequately assess LLMs' code-mixing abilities despite its importance for multilingual users.

Method: Introduces CodeMixBench covering 8 tasks and 18 languages across 7 language families, plus a new synthetic data generation method combining word substitution with GPT-4 prompting.

Result: LLMs consistently underperform on code-mixed datasets involving different language families. Larger training data, model scale, and few-shot learning could improve performance.

Conclusion: The benchmark and synthetic data generation method provide tools to advance code-mixing research, highlighting current LLM limitations and potential improvement pathways.

Abstract: Code-mixing, the practice of switching between languages within a conversation, poses unique challenges for traditional NLP. Existing benchmarks are limited by their narrow language pairs and tasks, failing to adequately assess large language models’ (LLMs) code-mixing abilities. Despite the recognized importance of code-mixing for multilingual users, research on LLMs in this context remains sparse. Additionally, current techniques for synthesizing code-mixed data are underdeveloped to generate code-mixing. In response, we introduce CodeMixBench, a comprehensive benchmark covering eight tasks, including three specific to LLMs and five traditional NLP tasks, and 18 languages across seven language families. We also propose a new method for generating large-scale synthetic code-mixed texts by combining word substitution with GPT-4 prompting. Our evaluation reveals consistent underperformance of LLMs on code-mixed datasets involving different language families. Enhancements in training data size, model scale, and few-shot learning could improve their performance. The code and dataset are available at https://github.com/Jeromeyluck/CodeMixBench.
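
The word-substitution half of the synthetic-data method can be sketched directly (the follow-up GPT-4 prompting pass is omitted); the toy English-Spanish lexicon and mixing rate are illustrative.

```python
import random

LEXICON = {"weather": "clima", "today": "hoy", "very": "muy", "nice": "agradable"}

def code_mix(sentence: str, rate: float = 0.4, seed: int = 0) -> str:
    """Substitute dictionary words with the other language at a given rate."""
    rng = random.Random(seed)
    return " ".join(
        LEXICON[w.lower()] if w.lower() in LEXICON and rng.random() < rate else w
        for w in sentence.split()
    )

print(code_mix("The weather today is very nice"))
```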

[129] Dynamically Adaptive Reasoning via LLM-Guided MCTS for Efficient and Context-Aware KGQA

Yingxu Wang, Shiqi Fan, Mengzhu Wang, Siyang Gao, Siwei Liu, Nan Yin

Main category: cs.CL

TL;DR: DAMR framework combines MCTS search with LLM guidance and lightweight Transformer scoring for efficient, context-aware KGQA, outperforming SOTA methods.

DetailsMotivation: Address limitations of current KGQA approaches: static path extraction lacks adaptability, while LLM-based dynamic methods are computationally expensive and struggle with accurate path evaluation.

Method: Uses MCTS backbone with LLM-based planner to select top-k relations, lightweight Transformer scorer for context-aware path evaluation, and dynamic pseudo-path refinement for continuous training.

Result: Extensive experiments show DAMR significantly outperforms state-of-the-art methods on multiple KGQA benchmarks.

Conclusion: DAMR provides an efficient and adaptive solution for KGQA by integrating symbolic search with neural scoring and continuous learning from search trajectories.

Abstract: Knowledge Graph Question Answering (KGQA) aims to interpret natural language queries and perform structured reasoning over knowledge graphs by leveraging their relational and semantic structures to retrieve accurate answers. Recent KGQA methods primarily follow either the retrieve-then-reason paradigm, relying on GNNs or heuristic rules for static path extraction, or dynamic path generation strategies that use large language models (LLMs) with prompting to jointly perform retrieval and reasoning. However, the former suffers from limited adaptability due to static path extraction and lack of contextual refinement, while the latter incurs high computational costs and struggles with accurate path evaluation due to reliance on fixed scoring functions and extensive LLM calls. To address these issues, this paper proposes Dynamically Adaptive MCTS-based Reasoning (DAMR), a novel framework that integrates symbolic search with adaptive path evaluation for efficient and context-aware KGQA. DAMR employs a Monte Carlo Tree Search (MCTS) backbone guided by an LLM-based planner, which selects top-k relevant relations at each step to reduce search space. To improve path evaluation accuracy, we introduce a lightweight Transformer-based scorer that performs context-aware plausibility estimation by jointly encoding the question and relation sequence through cross-attention, enabling the model to capture fine-grained semantic shifts during multi-hop reasoning. Furthermore, to alleviate the scarcity of high-quality supervision, DAMR incorporates a dynamic pseudo-path refinement mechanism that periodically generates training signals from partial paths explored during search, allowing the scorer to continuously adapt to the evolving distribution of reasoning trajectories. Extensive experiments on multiple KGQA benchmarks show that DAMR significantly outperforms state-of-the-art methods.

[130] Out of the Box, into the Clinic? Evaluating State-of-the-Art ASR for Clinical Applications for Older Adults

Bram van Dijk, Tiberon Kuiper, Sirin Aoulad si Ahmed, Armel Levebvre, Jake Johnson, Jan Duin, Simon Mooijaart, Marco Spruit

Main category: cs.CL

TL;DR: Evaluation of ASR models for older Dutch adults using a geriatric chatbot shows generic multilingual models outperform fine-tuned ones, with architecture truncation helping balance accuracy and speed.

DetailsMotivation: Voice interfaces and chatbots can support older adults in clinical settings, but reliable ASR for underrepresented groups like older Dutch speakers remains a challenge.

Method: Benchmarked state-of-the-art ASR models on language use of older Dutch adults interacting with Welzijn.AI chatbot, comparing generic multilingual models vs models fine-tuned for Dutch spoken by older adults, while considering processing speed.

Result: Generic multilingual models outperformed fine-tuned models, suggesting recent ASR models generalize well to realistic datasets. Truncating architectures helps balance accuracy-speed trade-off, though some cases show high WER due to hallucinations.

Conclusion: Recent generic ASR models show strong out-of-the-box performance for underrepresented groups like older Dutch speakers, with architecture optimization providing effective accuracy-speed trade-offs for clinical applications.

Abstract: Voice-controlled interfaces can support older adults in clinical contexts, with chatbots being a prime example, but reliable Automatic Speech Recognition (ASR) for underrepresented groups remains a bottleneck. This study evaluates state-of-the-art ASR models on language use of older Dutch adults, who interacted with the Welzijn.AI chatbot designed for geriatric contexts. We benchmark generic multilingual ASR models, and models fine-tuned for Dutch spoken by older adults, while also considering processing speed. Our results show that generic multilingual models outperform fine-tuned models, which suggests recent ASR models can generalise well out of the box to realistic datasets. Furthermore, our results suggest that truncating existing architectures is helpful in balancing the accuracy-speed trade-off, though we also identify some cases with high WER due to hallucinations.
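
Benchmarks like this one typically report word error rate (WER); a minimal example with the jiwer package on an invented Dutch utterance (not taken from the study's data).

```python
import jiwer  # pip install jiwer

reference  = "ik wil graag mijn medicatie bespreken"
hypothesis = "ik wil graag mijn medicatie spreken"

print(f"WER: {jiwer.wer(reference, hypothesis):.2f}")  # 1 substitution / 6 words
```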

[131] A Survey on Training-free Alignment of Large Language Models

Birong Pan, Yongqi Li, Weiyu Zhang, Wenpeng Lu, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, Tieyun Qian

Main category: cs.CL

TL;DR: Systematic review of training-free alignment methods for LLMs, i.e., alignment without fine-tuning, categorized into pre-decoding, in-decoding, and post-decoding stages.

DetailsMotivation: Traditional fine-tuning alignment methods are resource-intensive, cause knowledge degradation, and face accessibility constraints. Training-free methods offer a promising alternative for both open-source and closed-source LLMs.

Method: Categorizes training-free alignment techniques into three stages: pre-decoding (in-context learning), in-decoding (decoding-time adjustments), and post-decoding (post-generation corrections). Examines mechanisms and limitations for both LLMs and multimodal LLMs.

Result: Provides a comprehensive framework for understanding training-free alignment methods, highlighting their mechanisms, limitations, and applicability across different model types.

Conclusion: This survey organizes the growing research field, identifies key challenges and future directions, and provides guidance for developing more inclusive and effective training-free alignment techniques for safer LLMs.

Abstract: The alignment of large language models (LLMs) aims to ensure their outputs adhere to human values, ethical standards, and legal norms. Traditional alignment methods often rely on resource-intensive fine-tuning (FT), which may suffer from knowledge degradation and face challenges in scenarios where the model accessibility or computational resources are constrained. In contrast, training-free (TF) alignment techniques–leveraging in-context learning, decoding-time adjustments, and post-generation corrections–offer a promising alternative by enabling alignment without heavily retraining LLMs, making them adaptable to both open-source and closed-source environments. This paper presents the first systematic review of TF alignment methods, categorizing them by stages of pre-decoding, in-decoding, and post-decoding. For each stage, we provide a detailed examination from the viewpoint of LLMs and multimodal LLMs (MLLMs), highlighting their mechanisms and limitations. Furthermore, we identify key challenges and future directions, paving the way for more inclusive and effective TF alignment techniques. By synthesizing and organizing the rapidly growing body of research, this survey offers a guidance for practitioners and advances the development of safer and more reliable LLMs.

[132] EMNLP: Educator-role Moral and Normative Large Language Models Profiling

Yilin Jiang, Mingzi Zhang, Sheng Jin, Zengyi Yu, Xiangjie Kong, Binghao Tu

Main category: cs.CL

TL;DR: The EMNLP framework profiles teacher-role LLMs’ personality, moral development, and ethical vulnerability to soft prompt injection, revealing idealized personalities, strong abstract moral reasoning alongside struggles with emotional complexity, and a safety-capability paradox.

DetailsMotivation: To address the lack of comprehensive psychological and ethical evaluation frameworks for Large Language Models simulating professional roles, particularly in educational contexts where teacher-role LLMs require careful assessment of moral and normative alignment.

Method: Developed EMNLP framework with extended psychological scales and 88 teacher-specific moral dilemmas for profession-oriented comparison with human teachers, plus targeted soft prompt injection tests to evaluate compliance and vulnerability.

Result: Experiments on 14 LLMs showed teacher-role models exhibit more idealized/polarized personalities than humans, excel in abstract moral reasoning but struggle with emotional complexity, and models with stronger reasoning are paradoxically more vulnerable to harmful prompt injection.

Conclusion: This presents the first benchmark for assessing ethical and psychological alignment of teacher-role LLMs, highlighting critical safety-capability tradeoffs that need addressing for responsible educational AI deployment.

Abstract: Simulating Professions (SP) enables Large Language Models (LLMs) to emulate professional roles. However, comprehensive psychological and ethical evaluation in these contexts remains lacking. This paper introduces EMNLP, an Educator-role Moral and Normative LLMs Profiling framework for personality profiling, moral development stage measurement, and ethical risk under soft prompt injection. EMNLP extends existing scales and constructs 88 teacher-specific moral dilemmas, enabling profession-oriented comparison with human teachers. A targeted soft prompt injection set evaluates compliance and vulnerability in teacher SP. Experiments on 14 LLMs show teacher-role LLMs exhibit more idealized and polarized personalities than human teachers, excel in abstract moral reasoning, but struggle with emotionally complex situations. Models with stronger reasoning are more vulnerable to harmful prompt injection, revealing a paradox between capability and safety. The model temperature and other hyperparameters have limited influence except in some risk behaviors. This paper presents the first benchmark to assess ethical and psychological alignment of teacher-role LLMs for educational AI. Resources are available at https://e-m-n-l-p.github.io/.

[133] CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning

Wenqiao Zhu, Ji Liu, Rongjuncheng Zhang, Haipang Wu, Yulun Zhang

Main category: cs.CL

TL;DR: CARFT, a contrastive learning approach built on annotated Chain-of-Thought-based reinforced fine-tuning, enhances LLM reasoning performance by addressing the limitations of vanilla RL and SFT methods through novel contrastive signals and stable training.

DetailsMotivation: Existing RL-based fine-tuning approaches ignore annotated Chain-of-Thought (CoT) and have unstable reasoning path sampling, while SFT approaches overemphasize annotated CoT, both leading to suboptimal performance and model collapse issues in LLM reasoning.

Method: Proposes learning representations for each CoT and designing novel contrastive signals to guide fine-tuning. Combines reinforcement learning with contrastive learning to fully exploit annotated CoT while stabilizing training through additional unsupervised learning signals.

Result: Significant advantages in robustness, performance (up to 10.15% improvement), and efficiency (up to 30.62% improvement) across three baseline approaches, two foundation models, and two datasets.

Conclusion: CARFT effectively addresses limitations of existing RL and SFT approaches for LLM reasoning enhancement, providing a stable and efficient fine-tuning method that fully leverages annotated CoT while preventing model collapse and performance degradation.

Abstract: Reasoning capability plays a critical role in the broad applications of Large Language Models (LLMs). To enhance the reasoning performance of LLMs, diverse Reinforcement Learning (RL)-based fine-tuning approaches have been proposed to address the limited generalization capability of LLMs trained solely via Supervised Fine-Tuning (SFT). Despite their effectiveness, two major limitations hinder the advancement of LLMs. First, vanilla RL-based approaches ignore annotated Chain-of-Thought (CoT) and incorporate unstable reasoning path sampling, which typically results in model collapse, an unstable training process, and suboptimal performance. Second, existing SFT approaches generally overemphasize the annotated CoT, potentially leading to performance degradation due to insufficient exploitation of potential CoT. In this paper, we propose a Contrastive learning with annotated CoT-based Reinforced Fine-Tuning approach, i.e., CARFT, to enhance the reasoning performance of LLMs while addressing the aforementioned limitations. Specifically, we propose learning a representation for each CoT. Based on this representation, we design novel contrastive signals to guide the fine-tuning process. Our approach not only fully exploits the available annotated CoT but also stabilizes the fine-tuning procedure by incorporating an additional unsupervised learning signal. We conduct comprehensive experiments and in-depth analysis with three baseline approaches, two foundation models, and two datasets to demonstrate significant advantages of CARFT in terms of robustness, performance (up to 10.15%), and efficiency (up to 30.62%). Code is available at https://github.com/WNQzhu/CARFT.
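
A standard InfoNCE-style contrastive loss over CoT representations, as a sketch of the kind of signal involved: pull a sampled CoT's representation toward the annotated CoT for the same problem and push it away from other CoTs. CARFT's actual signal design may differ, and the embeddings here are random placeholders.

```python
import torch
import torch.nn.functional as F

def cot_contrastive_loss(anchor, positive, negatives, tau=0.1):
    """anchor, positive: (d,) CoT embeddings; negatives: (n, d)."""
    a = F.normalize(anchor, dim=-1)
    pos = F.normalize(positive, dim=-1)
    neg = F.normalize(negatives, dim=-1)
    logits = torch.cat([(a * pos).sum(-1, keepdim=True), neg @ a]) / tau
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

loss = cot_contrastive_loss(torch.randn(8), torch.randn(8), torch.randn(4, 8))
```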

[134] Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai

Main category: cs.CL

TL;DR: Jet-Nemotron is a hybrid-architecture language model family that achieves comparable or superior accuracy to leading full-attention models while significantly improving generation throughput through a novel PostNAS architecture exploration pipeline.

DetailsMotivation: To develop language models that match or exceed the accuracy of full-attention models while significantly improving generation throughput and efficiency.

Method: Post Neural Architecture Search (PostNAS) pipeline that starts with a pre-trained full-attention model, freezes MLP weights, and explores attention block designs through four components: optimal layer placement/elimination, linear attention block selection, new attention block design, and hardware-aware hyperparameter search.

Result: Jet-Nemotron-2B achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across benchmarks, with up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also outperforms larger MoE models like DeepSeek-V3-Small and Moonlight on MMLU and MMLU-Pro.

Conclusion: The PostNAS pipeline enables efficient development of hybrid-architecture models that deliver both high accuracy and significant performance improvements over traditional full-attention and MoE models.

Abstract: We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.
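
PostNAS's starting point, freezing the MLP weights of a pre-trained model so that only attention blocks are explored, can be sketched with transformers; GPT-2 stands in for the actual backbone, and the `.mlp.` substring match assumes GPT/Llama-style module naming.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Freeze every MLP parameter; attention parameters stay trainable.
for name, param in model.named_parameters():
    if ".mlp." in name:
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters after freezing MLPs: {trainable:,}")
```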

[135] Leveraging Large Language Models for Accurate Sign Language Translation in Low-Resource Scenarios

Luana Bulla, Gabriele Tuccio, Misael Mongiovì, Aldo Gangemi

Main category: cs.CL

TL;DR: AulSign method uses LLMs with dynamic prompting and in-context learning for sign language translation, achieving superior performance in low-data scenarios.

DetailsMotivation: Limited parallel corpora and data scarcity hinder sign language translation development, with existing methods struggling to generalize due to domain-specific, non-standardized datasets.

Method: Leverages Large Language Models via dynamic prompting and in-context learning with sample selection and sign association through natural language descriptions.

Result: Demonstrates superior performance compared to state-of-the-art models on both English (SignBank+) and Italian (LaCAM CNR-ISTC) datasets in low-data scenarios.

Conclusion: AulSign effectively enhances sign language translation, showing potential to improve accessibility and inclusivity for underrepresented linguistic communities.

Abstract: Translating natural languages into sign languages is a highly complex and underexplored task. Despite growing interest in accessibility and inclusivity, the development of robust translation systems remains hindered by the limited availability of parallel corpora which align natural language with sign language data. Existing methods often struggle to generalize in these data-scarce environments, as the few datasets available are typically domain-specific, lack standardization, or fail to capture the full linguistic richness of sign languages. To address this limitation, we propose Advanced Use of LLMs for Sign Language Translation (AulSign), a novel method that leverages Large Language Models via dynamic prompting and in-context learning with sample selection and subsequent sign association. Despite their impressive abilities in processing text, LLMs lack intrinsic knowledge of sign languages; therefore, they are unable to natively perform this kind of translation. To overcome this limitation, we associate the signs with compact descriptions in natural language and instruct the model to use them. We evaluate our method on both English and Italian languages using SignBank+, a recognized benchmark in the field, as well as the Italian LaCAM CNR-ISTC dataset. We demonstrate superior performance compared to state-of-the-art models in low-data scenarios. Our findings demonstrate the effectiveness of AulSign, with the potential to enhance accessibility and inclusivity in communication technologies for underrepresented linguistic communities.
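
A hypothetical sketch of the dynamic-prompting loop: embed the input, retrieve the nearest annotated examples, and prepend natural-language sign descriptions. The lexicon, template, and embedding source are placeholders, not AulSign's actual resources.

```python
# Placeholder sketch of sample selection + description-based prompting.
import numpy as np

def select_examples(query_vec, example_vecs, k=3):
    # Cosine similarity between the query and each stored example embedding.
    sims = example_vecs @ query_vec / (
        np.linalg.norm(example_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return np.argsort(-sims)[:k]

def build_prompt(sentence, examples, sign_descriptions):
    lexicon = "\n".join(f"{s}: {d}" for s, d in sign_descriptions.items())
    shots = "\n".join(f"Text: {t}\nSigns: {g}" for t, g in examples)
    return f"Sign lexicon:\n{lexicon}\n\n{shots}\n\nText: {sentence}\nSigns:"
```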

[136] Automatic Prompt Optimization with Prompt Distillation

Ernest A. Dyagin, Nikita I. Kulin, Artur R. Khairullin, Viktor N. Zhuravlev, Alena N. Sitkina

Main category: cs.CL

TL;DR: DistillPrompt is a novel autoprompting method that uses multi-stage integration of task-specific information through distillation, compression, and aggregation operations to automatically generate optimized prompts for LLMs.

DetailsMotivation: With the rapid development of prompt engineering and large language models, there's a growing need for automated methods to select optimized prompts rather than relying on manual prompt engineering.

Method: DistillPrompt employs a multi-stage integration approach using training data, featuring distillation, compression, and aggregation operations to thoroughly explore the prompt space. It was tested on text classification and generation tasks using the t-lite-instruct-0.1 language model.

Result: The method demonstrated significant average improvement (20.12% across the entire dataset compared to Grips) in key metrics over existing methods.

Conclusion: DistillPrompt establishes itself as one of the most effective non-gradient approaches in autoprompting, showing substantial performance gains over existing techniques.

Abstract: Autoprompting is the process of automatically selecting optimized prompts for language models, which is gaining popularity due to the rapid development of prompt engineering driven by extensive research in the field of large language models (LLMs). This paper presents DistillPrompt – a novel autoprompting method based on large language models that employs a multi-stage integration of task-specific information into prompts using training data. DistillPrompt utilizes distillation, compression, and aggregation operations to explore the prompt space more thoroughly. The method was tested on different datasets for text classification and generation tasks using the t-lite-instruct-0.1 language model. The results demonstrate a significant average improvement (e.g., 20.12% across the entire dataset compared to Grips) in key metrics over existing methods in the field, establishing DistillPrompt as one of the most effective non-gradient approaches in autoprompting.

[137] MovieCORE: COgnitive REasoning in Movies

Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Ying Cheng, Hung-Ting Su, Yung-Hao Tang, Shang-Hong Lai, Winston H. Hsu

Main category: cs.CL

TL;DR: MovieCORE is a new video QA dataset focusing on deeper cognitive movie understanding, using agentic brainstorming with LLMs to generate questions, and introducing ACE enhancement to boost VQA model reasoning by up to 25%.

DetailsMotivation: Existing video QA datasets focus on surface-level comprehension, lacking deeper cognitive understanding of movie content that requires System-2 thinking.

Method: Agentic brainstorming approach using multiple LLMs as thought agents to generate and refine question-answer pairs, plus Agentic Choice Enhancement (ACE) module to improve VQA model reasoning post-training.

Result: Developed MovieCORE dataset with cognitive tests for depth assessment, and ACE enhancement improved model reasoning capabilities by up to 25%.

Conclusion: Advances movie understanding in AI systems and provides insights into current VQA model limitations when handling nuanced cinematic content questions.

Abstract: This paper introduces MovieCORE, a novel video question answering (VQA) dataset designed to probe deeper cognitive understanding of movie content. Unlike existing datasets that focus on surface-level comprehension, MovieCORE emphasizes questions that engage System-2 thinking while remaining specific to the video material. We present an innovative agentic brainstorming approach, utilizing multiple large language models (LLMs) as thought agents to generate and refine high-quality question-answer pairs. To evaluate dataset quality, we develop a set of cognitive tests assessing depth, thought-provocation potential, and syntactic complexity. We also propose a comprehensive evaluation scheme for assessing VQA model performance on deeper cognitive tasks. To address the limitations of existing video-language models (VLMs), we introduce an agentic enhancement module, Agentic Choice Enhancement (ACE), which improves model reasoning capabilities post-training by up to 25%. Our work contributes to advancing movie understanding in AI systems and provides valuable insights into the capabilities and limitations of current VQA models when faced with more challenging, nuanced questions about cinematic content. Our project page, dataset and code can be found at https://joslefaure.github.io/assets/html/moviecore.html.

[138] MultiPL-MoE: Multi-Programming-Lingual Extension of Large Language Models through Hybrid Mixture-of-Experts

Qing Wang, Xue Han, Jiahui Wang, Lehao Xing, Qian Hu, Lianlian Zhang, Chao Deng, Junlan Feng

Main category: cs.CL

TL;DR: MultiPL-MoE: A hybrid mixture of experts approach that improves multilingual code generation in LLMs using token-level and segment-level MoE with novel routing strategies.

DetailsMotivation: Despite LLMs' strong code creation capabilities, multilingual code generation remains challenging. The paper aims to improve multi-programming-lingual performance while retaining popular languages using limited computational resources.

Method: Proposes MultiPL-MoE combining two paired MoEs: token-level MoE with shared expert and gate weight normalization, and segment-level MoE with sliding window segmentation and expert-choice routing strategy allowing experts to select top-k segments.

Result: Experimental results proved the effectiveness of the MultiPL-MoE approach for multilingual code generation.

Conclusion: The hybrid MoE approach successfully addresses multilingual code generation challenges by optimizing expert selection at both token and segment levels, capturing programming language syntax and contextual patterns effectively.

Abstract: Despite LLMs’ excellent code creation capabilities, multilingual code generation remains extremely challenging. To address this, we intend to improve the multi-programming-lingual (MultiPL) performance of the base LLMs while retaining the most popular ones using restricted computational resources. We consider MultiPL to be a special case of multiple natural languages and propose a MultiPL extension of LLMs utilizing a hybrid mixture of experts (MoE), called MultiPL-MoE. Specifically, MultiPL-MoE combines two paired MoEs to optimize expert selection at both the token and segment levels. The token-level MoE is a standard upcycling MoE structure with a shared expert and a novel gate weight normalization approach that aids in the final fusion with the segment-level MoE. The segment-level MoE incorporates two innovative designs to better capture the syntactic structure and contextual patterns of programming languages: first, a sliding window partitions the input token sequence into multiple segments; then, an expert-choice routing strategy allows experts to select the top-k segments. Experimental results demonstrate the effectiveness of MultiPL-MoE.
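
The segment-level expert-choice routing can be sketched as follows, assuming mean-pooled segment representations; the router here is freshly initialized for the sketch, whereas in practice it would be trained.

```python
# Sketch of expert-choice routing over sliding-window segments, following the
# abstract's description; dimensions and mean pooling are assumptions.
import torch

def expert_choice_route(hidden, window=16, num_experts=4, k=2):
    """hidden: (seq_len, d). Returns segments and, per expert, its top-k segment indices."""
    seq_len, d = hidden.shape
    segments = hidden[: seq_len - seq_len % window].view(-1, window, d)
    seg_repr = segments.mean(dim=1)                  # (n_seg, d) pooled representation
    router = torch.nn.Linear(d, num_experts)         # trained in practice
    scores = router(seg_repr)                        # (n_seg, num_experts)
    # Expert-choice: each expert (column) picks its own top-k segments.
    topk = scores.topk(k, dim=0).indices             # (k, num_experts)
    return segments, topk

segs, chosen = expert_choice_route(torch.randn(64, 32))
```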

[139] KG-CQR: Leveraging Structured Relation Representations in Knowledge Graphs for Contextual Query Retrieval

Chi Minh Bui, Ngoc Mai Thieu, Van Vinh Nguyen, Jason J. Jung, Khac-Hoai Nam Bui

Main category: cs.CL

TL;DR: KG-CQR is a novel framework that enhances RAG retrieval by enriching query contexts using knowledge graphs, achieving 4-6% mAP and 2-3% Recall@25 improvements over baselines.

DetailsMotivation: Improve retrieval phase of RAG systems by addressing contextual query representation limitations through structured knowledge graph integration.

Method: Model-agnostic pipeline with subgraph extraction, completion, and contextual generation modules that enrich queries using corpus-centric knowledge graphs.

Result: Superior performance on RAGBench and MultiHop-RAG datasets with 4-6% mAP and 2-3% Recall@25 improvements; excels in multi-hop QA tasks.

Conclusion: KG-CQR effectively enhances retrieval effectiveness in RAG systems through structured knowledge graph integration without requiring additional LLM training.

Abstract: The integration of knowledge graphs (KGs) with large language models (LLMs) offers significant potential to improve the retrieval phase of retrieval-augmented generation (RAG) systems. In this study, we propose KG-CQR, a novel framework for Contextual Query Retrieval (CQR) that enhances the retrieval phase by enriching the contextual representation of complex input queries using a corpus-centric KG. Unlike existing methods that primarily address corpus-level context loss, KG-CQR focuses on query enrichment through structured relation representations, extracting and completing relevant KG subgraphs to generate semantically rich query contexts. Comprising subgraph extraction, completion, and contextual generation modules, KG-CQR operates as a model-agnostic pipeline, ensuring scalability across LLMs of varying sizes without additional training. Experimental results on RAGBench and MultiHop-RAG datasets demonstrate KG-CQR’s superior performance, achieving a 4-6% improvement in mAP and a 2-3% improvement in Recall@25 over strong baseline models. Furthermore, evaluations on challenging RAG tasks such as multi-hop question answering show that incorporating KG-CQR consistently improves retrieval effectiveness over the existing baseline.
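
A toy skeleton of the three-module pipeline (subgraph extraction, completion, contextual generation), with a list of triples standing in for the corpus-centric KG; the string rendering of relations into the query is an invention for illustration.

```python
# Toy skeleton of the model-agnostic KG-CQR pipeline; all data is illustrative.
def extract_subgraph(query_entities, triples):
    return [t for t in triples if t[0] in query_entities or t[2] in query_entities]

def complete_subgraph(subgraph, triples):
    # One-hop completion: add triples whose head already appears in the subgraph.
    seen = {e for (h, _, t) in subgraph for e in (h, t)}
    return subgraph + [t for t in triples if t[0] in seen and t not in subgraph]

def generate_context(query, subgraph):
    relations = "; ".join(f"{h} {r} {t}" for h, r, t in subgraph)
    return f"{query} [context: {relations}]"   # enriched query for the retriever

kg = [("aspirin", "treats", "headache"), ("headache", "symptom_of", "migraine")]
print(generate_context("What does aspirin treat?",
                       complete_subgraph(extract_subgraph({"aspirin"}, kg), kg)))
```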

[140] CVPD at QIAS 2025 Shared Task: An Efficient Encoder-Based Approach for Islamic Inheritance Reasoning

Salah Eddine Bekhouche, Abdellah Zakaria Sellam, Hichem Telli, Cosimo Distante, Abdenour Hadid

Main category: cs.CL

TL;DR: A lightweight framework using Arabic text encoders and Attentive Relevance Scoring for solving Islamic inheritance multiple-choice questions, achieving 69.87% accuracy with MARBERT while being efficient and privacy-preserving.

DetailsMotivation: Islamic inheritance law requires precise heir identification and share calculation, which poses challenges for AI systems. There's a need for efficient, on-device solutions that balance accuracy with practical deployability and privacy concerns.

Method: Specialized Arabic text encoder (MARBERT, ArabicBERT, AraBERT) with Attentive Relevance Scoring (ARS) to rank answer options by semantic relevance, enabling fast on-device inference without generative reasoning.

Result: MARBERT-based approach achieved 69.87% accuracy, while large LLMs (Gemini, DeepSeek) reached up to 87.6% accuracy on QIAS 2025 dataset. Smaller models offer efficiency, on-device deployability, and privacy advantages.

Conclusion: There’s a critical trade-off between peak performance of large models and practical advantages of smaller specialized systems in high-stakes domains like Islamic inheritance law, with lightweight frameworks offering compelling efficiency and privacy benefits.

Abstract: Islamic inheritance law (Ilm al-Mawarith) requires precise identification of heirs and calculation of shares, which poses a challenge for AI. In this paper, we present a lightweight framework for solving multiple-choice inheritance questions using a specialised Arabic text encoder and Attentive Relevance Scoring (ARS). The system ranks answer options according to semantic relevance, and enables fast, on-device inference without generative reasoning. We evaluate Arabic encoders (MARBERT, ArabicBERT, AraBERT) and compare them with API-based LLMs (Gemini, DeepSeek) on the QIAS 2025 dataset. While large models achieve an accuracy of up to 87.6%, they require more resources and are context-dependent. Our MARBERT-based approach achieves 69.87% accuracy, presenting a compelling case for efficiency, on-device deployability, and privacy. While this is lower than the 87.6% achieved by the best-performing LLM, our work quantifies a critical trade-off between the peak performance of large models and the practical advantages of smaller, specialized systems in high-stakes domains.
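
A rough sketch of how encoder-based option ranking in the spirit of ARS might look: attention-pool the question and each option, then score each pair. The pooling scheme and bilinear scoring head are assumptions, not the paper's exact formulation.

```python
# Assumed attention-pooled relevance scorer; the paper's exact ARS may differ.
import torch
import torch.nn as nn

class RelevanceScorer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.attn = nn.Linear(d, 1)          # token-level attention weights
        self.score = nn.Bilinear(d, d, 1)    # question-option relevance score

    def forward(self, question_tokens, option_tokens):
        def pool(tokens):                    # attention-weighted mean over tokens
            w = torch.softmax(self.attn(tokens), dim=0)
            return (w * tokens).sum(dim=0)
        return self.score(pool(question_tokens), pool(option_tokens))

scorer = RelevanceScorer(d=768)              # e.g. MARBERT hidden size
scores = [scorer(torch.randn(20, 768), torch.randn(8, 768)) for _ in range(4)]
best_option = max(range(4), key=lambda i: scores[i].item())
```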

[141] Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness

Lang Xiong, Nishant Bhargava, Jeremy Chang, Jianhang Hong, Haihao Liu, Vasu Sharma, Kevin Zhu

Main category: cs.CL

TL;DR: LLMs show behavioral changes between real-world deployment and controlled evaluation settings (evaluation awareness). This study quantifies these changes using prompt rewriting to shift contexts from test-like to deploy-like, revealing models are more prone to unsafe/deceptive outputs in perceived test environments.

DetailsMotivation: Benchmark performance may not accurately reflect LLMs' true safety and honesty due to evaluation awareness - the behavioral shifts when models perceive different contexts. This poses critical challenges for AI alignment.

Method: Used linear probe to score prompts on test-like to deploy-like scale, then employed LLM rewriting strategy to shift prompts toward natural deployment-style context while preserving original tasks.

Result: Rewritten prompts achieved 30% increase in average probe score. Across all models: 5.26% average increase in honest responses, 12.40% average decrease in deceptive responses, and 6.38% average increase in refusal rates indicating improved safety compliance.

Conclusion: Evaluation awareness is quantifiable and manipulable, directly influencing LLM behavior. Models are more prone to unsafe/deceptive outputs in perceived test environments, highlighting the need for more realistic evaluation frameworks.

Abstract: Large Language Models (LLMs) often exhibit significant behavioral shifts when they perceive a change from a real-world deployment context to a controlled evaluation setting, a phenomenon known as “evaluation awareness.” This discrepancy poses a critical challenge for AI alignment, as benchmark performance may not accurately reflect a model’s true safety and honesty. In this work, we systematically quantify these behavioral changes by manipulating the perceived context of prompts. We introduce a methodology that uses a linear probe to score prompts on a continuous scale from “test-like” to “deploy-like” and leverage an LLM rewriting strategy to shift these prompts towards a more natural, deployment-style context while preserving the original task. Using this method, we achieved a 30% increase in the average probe score across a strategic role-playing dataset after rewriting. Evaluating a suite of state-of-the-art models on these original and rewritten prompts, we find that rewritten “deploy-like” prompts induce a significant and consistent shift in behavior. Across all models, we observed an average increase in honest responses of 5.26% and a corresponding average decrease in deceptive responses of 12.40%. Furthermore, refusal rates increased by an average of 6.38%, indicating heightened safety compliance. Our findings demonstrate that evaluation awareness is a quantifiable and manipulable factor that directly influences LLM behavior, revealing that models are more prone to unsafe or deceptive outputs in perceived test environments. This underscores the urgent need for more realistic evaluation frameworks to accurately gauge true model alignment before deployment.
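
The probing step reduces to fitting a linear classifier on prompt representations. Here is a minimal sketch with synthetic features; in the actual workflow the features would come from the LLM's hidden states.

```python
# Minimal "test-like vs deploy-like" linear probe with stand-in features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))           # stand-in for prompt hidden states
y = rng.integers(0, 2, size=200)         # 0 = test-like, 1 = deploy-like labels

probe = LogisticRegression(max_iter=1000).fit(X, y)
deploy_score = probe.predict_proba(X[:1])[0, 1]   # continuous deploy-likeness
```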

[142] Joint Information Extraction Across Classical and Modern Chinese with Tea-MOELoRA

Xuemei Tang, Chengxi Yan, Jinghang Gu, Chu-Ren Huang

Main category: cs.CL

TL;DR: Tea-MOELoRA is a parameter-efficient multi-task framework combining LoRA with Mixture-of-Experts for Chinese information extraction across different temporal domains, using task-era-aware routing to dynamically allocate expert contributions.

DetailsMotivation: Fine-tuning a single model on heterogeneous Chinese IE tasks across Classical and Modern documents may cause interference and reduced performance due to the diverse temporal domains.

Method: Combines LoRA with Mixture-of-Experts design, where multiple low-rank LoRA experts specialize in different IE tasks and eras, with a task-era-aware router mechanism for dynamic expert allocation.

Result: Outperforms both single-task and joint LoRA baselines, demonstrating effective leveraging of task and temporal knowledge.

Conclusion: Tea-MOELoRA framework successfully addresses the challenge of multi-task Chinese IE across diverse temporal domains through parameter-efficient expert specialization and dynamic routing.

Abstract: Chinese information extraction (IE) involves multiple tasks across diverse temporal domains, including Classical and Modern documents. Fine-tuning a single model on heterogeneous tasks and across different eras may lead to interference and reduced performance. Therefore, in this paper, we propose Tea-MOELoRA, a parameter-efficient multi-task framework that combines LoRA with a Mixture-of-Experts (MoE) design. Multiple low-rank LoRA experts specialize in different IE tasks and eras, while a task-era-aware router mechanism dynamically allocates expert contributions. Experiments show that Tea-MOELoRA outperforms both single-task and joint LoRA baselines, demonstrating its ability to leverage task and temporal knowledge effectively.
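
A compact sketch of the core idea: several low-rank (LoRA) experts whose updates are mixed by a router conditioned on task and era embeddings. Shapes and the additive mixing are assumptions read off the abstract.

```python
# Assumed Tea-MOELoRA-style layer: LoRA experts + task-era-aware router.
import torch
import torch.nn as nn

class TeaMoELoRA(nn.Module):
    def __init__(self, d, rank=8, n_experts=4, n_tasks=3, n_eras=2):
        super().__init__()
        self.A = nn.Parameter(torch.randn(n_experts, d, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d))
        self.task_emb = nn.Embedding(n_tasks, d)
        self.era_emb = nn.Embedding(n_eras, d)
        self.router = nn.Linear(2 * d, n_experts)

    def forward(self, x, task_id, era_id):
        # x: (T, d) hidden states; task_id, era_id: scalar LongTensors
        cond = torch.cat([self.task_emb(task_id), self.era_emb(era_id)], dim=-1)
        gates = torch.softmax(self.router(cond), dim=-1)             # (n_experts,)
        delta = torch.einsum("td,edr,erk->etk", x, self.A, self.B)   # per-expert LoRA update
        return x + (gates.view(-1, 1, 1) * delta).sum(dim=0)

layer = TeaMoELoRA(d=64)
out = layer(torch.randn(10, 64), torch.tensor(1), torch.tensor(0))
```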

[143] Efficient Large Language Models with Zero-Shot Adjustable Acceleration

Sajjad Kachuee, Mohammad Sharifkhani

Main category: cs.CL

TL;DR: Zero-Shot Adjustable Acceleration method enables dynamic hardware utilization adjustment during LLM inference without additional fine-tuning, achieving up to 11x speedup.

DetailsMotivation: Balancing computational efficiency with model performance in real-world LLM applications, particularly optimizing acceleration after fine-tuning and during inference.

Method: A novel training and inference method that dynamically adjusts hardware utilization during inference without requiring additional fine-tuning, applied to recent LLMs.

Result: Achieves up to 11x speedup compared to baseline across multiple classification and text generation tasks, supporting wide range of zero-shot acceleration.

Conclusion: The proposed Zero-Shot Adjustable Acceleration method effectively addresses LLM efficiency challenges by enabling dynamic hardware optimization during inference.

Abstract: Using Large Language Models (LLMs) in real-world applications presents significant challenges, particularly in balancing computational efficiency with model performance. Optimizing acceleration after fine-tuning and during inference is critical for building efficient architectures. This paper introduces Zero-Shot Adjustable Acceleration, a novel training and inference method that dynamically adjusts hardware utilization during inference without requiring additional fine-tuning. The proposed approach is applied to recent LLMs and evaluated across multiple classification and text generation tasks. Experimental results demonstrate that the method supports a wide range of zero-shot acceleration settings and achieves up to 11x speedup compared to the baseline.

[144] DCPO: Dynamic Clipping Policy Optimization

Shihui Yang, Chengfeng Dou, Peidong Guo, Kai Lu, Qiang Ju, Fei Deng, Rihui Xin

Main category: cs.CL

TL;DR: DCPO introduces dynamic clipping and smooth advantage standardization to fix zero gradient issues in RLVR, achieving SOTA performance on multiple benchmarks with improved training efficiency and data utilization.

DetailsMotivation: Existing RLVR approaches like GRPO suffer from zero gradients due to fixed clipping bounds and identical reward standardization, leading to ineffective gradient updates and underutilization of generated responses.

Method: Dynamic Clipping Policy Optimization (DCPO) uses adaptive clipping bounds based on token-specific prior probabilities for better token-level exploration, and smooth advantage standardization across cumulative training steps for improved response-level utilization.

Result: DCPO achieved state-of-the-art performance on four benchmarks with four different models, including 46.7 Avg@1 on AIME24 with Qwen2.5-Math-7B, 28% improvement in nonzero advantage over GRPO, doubled training efficiency over DAPO, and significantly reduced token clipping ratio.

Conclusion: DCPO effectively leverages generated data more efficiently for reinforcement learning in large language models, demonstrating superior performance and addressing key limitations of previous approaches.

Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning capabilities of large language models. However, existing approaches such as GRPO often suffer from zero gradients. This problem arises primarily due to fixed clipping bounds for token-level probability ratios and the standardization of identical rewards, which can lead to ineffective gradient updates and underutilization of generated responses. In this work, we propose Dynamic Clipping Policy Optimization (DCPO), which introduces a dynamic clipping strategy that adaptively adjusts clipping bounds based on token-specific prior probabilities to enhance token-level exploration, and a smooth advantage standardization technique that standardizes rewards across cumulative training steps to improve the response-level effective utilization of generated responses. DCPO achieved state-of-the-art performance on four benchmarks based on four different models. In particular, DCPO achieved an Avg@1 of 46.7 under greedy decoding and an Avg@32 of 38.8 under 32 times sampling on the AIME24 benchmark, surpassing DAPO (36.7/31.6), GRPO (36.7/32.1) and GSPO (40.0/34.9) on the Qwen2.5-Math-7B model. On the AIME25 benchmark based on Qwen2.5-14B, DCPO achieves a performance of (23.3/19.0), surpassing GRPO (13.3/10.5), DAPO (20.0/15.3) and GSPO (16.7/9.9). Furthermore, DCPO achieved an average 28% improvement in the nonzero advantage over GRPO in four models, doubled the training efficiency over DAPO, and significantly reduced the token clipping ratio by an order of magnitude compared to both GRPO and DAPO, while achieving superior performance. These results highlight DCPO’s effectiveness in leveraging generated data more efficiently for reinforcement learning in large language models.
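
To illustrate the dynamic-clipping idea, here is a sketch in which the clipping range widens for low-prior tokens; the specific bound schedule is an assumption rather than the paper's exact rule, and the smooth advantage standardization is omitted.

```python
# Assumed dynamic-clipping surrogate: low-prior tokens get wider bounds.
import torch

def dcpo_surrogate(ratio, advantage, prior_prob, eps_base=0.2):
    # Bound widens from eps_base (prior near 1) to 2*eps_base (prior near 0).
    eps = eps_base * (1.0 + (1.0 - prior_prob))
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return torch.minimum(ratio * advantage, clipped * advantage).mean()

ratio = torch.tensor([0.7, 1.1, 1.6])       # pi_theta / pi_old per token
adv = torch.tensor([1.0, -0.5, 0.3])
prior = torch.tensor([0.9, 0.5, 0.05])      # token prior under the old policy
loss = -dcpo_surrogate(ratio, adv, prior)   # maximize the surrogate
```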

[145] MoSEs: Uncertainty-Aware AI-Generated Text Detection via Mixture of Stylistics Experts with Conditional Thresholds

Junxi Wu, Jinpeng Wang, Zheng Liu, Bin Chen, Dongjian Hu, Hao Wu, Shu-Tao Xia

Main category: cs.CL

TL;DR: MoSEs framework improves AI-generated text detection by incorporating stylistic modeling and dynamic threshold estimation, achieving 11.34% average performance improvement over baselines.

DetailsMotivation: Address public concerns about AI misuse by building trustworthy detection systems, as existing methods neglect stylistic modeling and rely on static thresholds which limit performance.

Method: Mixture of Stylistic Experts (MoSEs) framework with three components: Stylistics Reference Repository (SRR) for reference data activation, Stylistics-Aware Router (SAR), and Conditional Threshold Estimator (CTE) that jointly models linguistic statistical properties and semantic features for dynamic threshold determination.

Result: Achieves 11.34% average improvement in detection performance compared to baselines, with even more significant 39.15% improvement in low-resource scenarios.

Conclusion: MoSEs framework effectively addresses limitations of existing detection methods by incorporating stylistic awareness and dynamic thresholding, demonstrating substantial performance gains especially in challenging low-resource conditions.

Abstract: The rapid advancement of large language models has intensified public concerns about their potential misuse. Therefore, it is important to build trustworthy AI-generated text detection systems. Existing methods neglect stylistic modeling and mostly rely on static thresholds, which greatly limits detection performance. In this paper, we propose the Mixture of Stylistic Experts (MoSEs) framework that enables stylistics-aware uncertainty quantification through conditional threshold estimation. MoSEs contains three core components: the Stylistics Reference Repository (SRR), the Stylistics-Aware Router (SAR), and the Conditional Threshold Estimator (CTE). For input text, the SAR activates the appropriate reference data in the SRR and provides it to the CTE. Subsequently, the CTE jointly models the linguistic statistical properties and semantic features to dynamically determine the optimal threshold. With a discrimination score, MoSEs yields prediction labels with the corresponding confidence level. Our framework achieves an average improvement of 11.34% in detection performance compared to baselines. More inspiringly, MoSEs shows an even larger improvement of 39.15% in the low-resource case. Our code is available at https://github.com/creator-xi/MoSEs.
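
The conditional-threshold step can be illustrated with a toy quantile rule over style-matched reference scores; the actual CTE jointly models statistical and semantic features, which is elided here.

```python
# Toy conditional threshold: derive the cutoff from style-matched references
# instead of one static global value. The quantile rule is an assumption.
import numpy as np

def conditional_threshold(reference_scores, target_fpr=0.05):
    # Cutoff such that roughly target_fpr of human-written references exceed it.
    return float(np.quantile(reference_scores, 1.0 - target_fpr))

rng = np.random.default_rng(0)
human_ref = rng.normal(0.0, 1.0, 500)    # discrimination scores of human text
score = 2.4                               # discrimination score of the input
is_ai_generated = score > conditional_threshold(human_ref)
```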

[146] Energy Landscapes Enable Reliable Abstention in Retrieval-Augmented Large Language Models for Healthcare

Ravi Shankar, Sheng Wong, Lin Li, Magdalena Bachmann, Alex Silverthorne, Beth Albert, Gabriel Davis Jones

Main category: cs.CL

TL;DR: Energy-based model for reliable abstention in RAG systems, outperforming softmax and kNN methods on hard semantic cases with AUROC 0.961 and lower false positive rates.

DetailsMotivation: Reliable abstention is critical for safety-critical RAG systems like women's health applications, where incorrect answers can cause harm.

Method: Energy-based model (EBM) that learns smooth energy landscape over 2.6M guideline-derived questions, benchmarked against calibrated softmax baseline and kNN density heuristic.

Result: EBM achieves superior abstention performance on hard cases (AUROC 0.961 vs 0.950 for softmax, FPR@95 0.235 vs 0.331). Comparable performance on easy negatives but significant advantage in safety-critical distributions.

Conclusion: Energy-based abstention scoring provides more reliable confidence signal than probability-based softmax, offering scalable and interpretable foundation for safe RAG systems.

Abstract: Reliable abstention is critical for retrieval-augmented generation (RAG) systems, particularly in safety-critical domains such as women’s health, where incorrect answers can lead to harm. We present an energy-based model (EBM) that learns a smooth energy landscape over a dense semantic corpus of 2.6M guideline-derived questions, enabling the system to decide when to generate or abstain. We benchmark the EBM against a calibrated softmax baseline and a k-nearest neighbour (kNN) density heuristic across both easy and hard abstention splits, where hard cases are semantically challenging near-distribution queries. The EBM achieves superior abstention performance on semantically hard cases, reaching AUROC 0.961 versus 0.950 for softmax, while also reducing FPR@95 (0.235 vs 0.331). On easy negatives, performance is comparable across methods, but the EBM’s advantage becomes most pronounced in safety-critical hard distributions. A comprehensive ablation with controlled negative sampling and fair data exposure shows that robustness stems primarily from the energy scoring head, while the inclusion or exclusion of specific negative types (hard, easy, mixed) sharpens decision boundaries but is not essential for generalisation to hard cases. These results demonstrate that energy-based abstention scoring offers a more reliable confidence signal than probability-based softmax confidence, providing a scalable and interpretable foundation for safe RAG systems.
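
A minimal sketch of an energy-scoring abstention head of the kind described, with an assumed MLP architecture and threshold; the training procedure over the guideline corpus is elided.

```python
# Assumed energy head: high energy = far from the corpus, so abstain.
import torch
import torch.nn as nn

energy_head = nn.Sequential(nn.Linear(768, 256), nn.SiLU(), nn.Linear(256, 1))

def answer_or_abstain(query_emb, threshold=0.0):
    energy = energy_head(query_emb).squeeze(-1)
    return "abstain" if energy.item() > threshold else "generate"

print(answer_or_abstain(torch.randn(768)))
```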

[147] The Good, the Bad and the Constructive: Automatically Measuring Peer Review’s Utility for Authors

Abdelrahman Sadallah, Tim Baumgärtner, Iryna Gurevych, Ted Briscoe

Main category: cs.CL

TL;DR: The paper introduces RevUtil dataset to evaluate review comment utility through four key aspects (Actionability, Grounding & Specificity, Verifiability, Helpfulness) and shows fine-tuned models can match/exceed GPT-4o performance in assessing review quality.

DetailsMotivation: With reviewers having less time, automated systems are needed to maintain high reviewing quality and ensure feedback is useful for authors. The paper aims to identify key aspects that make review comments valuable.

Method: Created RevUtil dataset with 1,430 human-labeled review comments and 10k synthetically labeled comments with rationales. Benchmarked fine-tuned models for assessing review comments on four utility aspects and generating explanations.

Result: Fine-tuned models achieved agreement levels with humans comparable to or exceeding GPT-4o. Machine-generated reviews generally underperformed human reviews on the four utility aspects.

Conclusion: The RevUtil dataset enables effective evaluation of review comment utility, and fine-tuned models can assess review quality at levels competitive with state-of-the-art closed models, though human reviews still outperform automated ones on key utility dimensions.

Abstract: Providing constructive feedback to paper authors is a core component of peer review. With reviewers increasingly having less time to perform reviews, automated support systems are required to ensure high reviewing quality, thus making the feedback in reviews useful for authors. To this end, we identify four key aspects of review comments (individual points in weakness sections of reviews) that drive the utility for authors: Actionability, Grounding & Specificity, Verifiability, and Helpfulness. To enable evaluation and development of models assessing review comments, we introduce the RevUtil dataset. We collect 1,430 human-labeled review comments and scale our data with 10k synthetically labeled comments for training purposes. The synthetic data additionally contains rationales, i.e., explanations for the aspect score of a review comment. Employing the RevUtil dataset, we benchmark fine-tuned models for assessing review comments on these aspects and generating rationales. Our experiments demonstrate that these fine-tuned models achieve agreement levels with humans comparable to, and in some cases exceeding, those of powerful closed models like GPT-4o. Our analysis further reveals that machine-generated reviews generally underperform human reviews on our four aspects.

[148] Comparative Analysis of Transformer Models in Disaster Tweet Classification for Public Safety

Sharif Noor Zisad, N. M. Istiak Chowdhury, Ragib Hasan

Main category: cs.CL

TL;DR: Transformer models (BERT, DistilBERT, RoBERTa, DeBERTa) outperform traditional ML models for disaster tweet classification, with BERT achieving 91% accuracy vs 82% for Logistic Regression/Naive Bayes.

DetailsMotivation: Social media platforms provide real-time disaster information, but traditional ML models fail to understand contextual language in informal tweets, requiring better classification methods for emergency response.

Method: Evaluated transformer-based models (BERT, DistilBERT, RoBERTa, DeBERTa) against traditional ML approaches (Logistic Regression, Naive Bayes, SVM) for classifying disaster-related tweets.

Result: BERT achieved highest accuracy at 91%, significantly outperforming traditional models (82% for Logistic Regression/Naive Bayes). Transformer models better understand subtle language through contextual embeddings and attention mechanisms.

Conclusion: Transformer architectures are far more suitable for public safety applications, offering improved accuracy, deeper language understanding, and better generalization across real-world social media text.

Abstract: Twitter and other social media platforms have become vital sources of real-time information during disasters and public safety emergencies. Automatically classifying disaster-related tweets can help emergency services respond faster and more effectively. Traditional Machine Learning (ML) models such as Logistic Regression, Naive Bayes, and Support Vector Machines have been widely used for this task, but they often fail to understand the context or deeper meaning of words, especially when the language is informal, metaphorical, or ambiguous. We posit that, in this context, transformer-based models can perform better than traditional ML models. In this paper, we evaluate the effectiveness of transformer-based models, including BERT, DistilBERT, RoBERTa, and DeBERTa, for classifying disaster-related tweets. These models are compared with traditional ML approaches to highlight the performance gap. Experimental results show that BERT achieved the highest accuracy (91%), significantly outperforming traditional models like Logistic Regression and Naive Bayes (both at 82%). The use of contextual embeddings and attention mechanisms allows transformer models to better understand subtle language in tweets, where traditional ML models fall short. This research demonstrates that transformer architectures are far more suitable for public safety applications, offering improved accuracy, deeper language understanding, and better generalization across real-world social media text.
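
The transformer setup being compared is the standard sequence-classification fine-tune; a minimal sketch with public checkpoints follows, where the hyperparameters are illustrative defaults rather than the paper's settings.

```python
# Standard BERT sequence-classification setup for disaster tweet labeling.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # disaster vs. not-disaster

batch = tok(["Forest fire near La Ronge Sask. Canada"],
            return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    pred = model(**batch).logits.argmax(dim=-1)
```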

[149] HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models

Chang Dai, Hongyu Shan, Mingyang Song, Di Liang

Main category: cs.CL

TL;DR: HoPE is a new positional encoding method that uses hyperbolic geometry to fix RoPE’s oscillation problems, enabling better long-range dependency modeling and outperforming existing methods on long sequence benchmarks.

DetailsMotivation: Existing positional encodings struggle with long sequences - absolute encodings can't extrapolate, relative methods like Alibi degrade on extremely long contexts, and RoPE has oscillatory attention patterns that hinder stable long-distance modeling.

Method: Proposes Hyperbolic Rotary Positional Encoding (HoPE) inspired by Lorentz transformations in hyperbolic geometry, using hyperbolic functions to implement Lorentz rotations on token representations. RoPE is shown to be a special case of this generalized formulation.

Result: HoPE fundamentally resolves RoPE’s oscillation issues by enforcing monotonic decay of attention weights with increasing token distances. Extensive experiments show HoPE consistently exceeds existing positional encoding methods on extended sequence benchmarks.

Conclusion: HoPE demonstrates enhanced capacity for representing and generalizing long-range dependencies, providing a geometric solution to positional encoding limitations in Transformers for long-sequence modeling.

Abstract: Positional encoding mechanisms enable Transformers to model sequential structure and long-range dependencies in text. While absolute positional encodings struggle with extrapolation to longer sequences due to fixed positional representations, and relative approaches like Alibi exhibit performance degradation on extremely long contexts, the widely-used Rotary Positional Encoding (RoPE) introduces oscillatory attention patterns that hinder stable long-distance dependency modelling. We address these limitations through a geometric reformulation of positional encoding. Drawing inspiration from Lorentz transformations in hyperbolic geometry, we propose Hyperbolic Rotary Positional Encoding (HoPE), which leverages hyperbolic functions to implement Lorentz rotations on token representations. Theoretical analysis demonstrates that RoPE is a special case of our generalized formulation. HoPE fundamentally resolves RoPE’s oscillation issues by enforcing monotonic decay of attention weights with increasing token distances. Extensive experimental results, including perplexity evaluations under several extended sequence benchmarks, show that HoPE consistently exceeds existing positional encoding methods. These findings underscore HoPE’s enhanced capacity for representing and generalizing long-range dependencies. Data and code will be available.
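
To see the geometric contrast, here is a sketch of a circular (RoPE-style) versus hyperbolic (Lorentz-boost) rotation on a single 2-D feature pair. HoPE's actual frequency schedule and any normalization are not specified here, and the boost form is one reading of "Lorentz rotation".

```python
# Circular vs. hyperbolic rotation of one feature pair at position `pos`.
import torch

def rope_pair(x, pos, theta):
    ang = torch.tensor(pos * theta)
    c, s = torch.cos(ang), torch.sin(ang)           # circular rotation
    return torch.stack([c * x[0] - s * x[1], s * x[0] + c * x[1]])

def hope_pair(x, pos, theta):
    ang = torch.tensor(pos * theta)
    c, s = torch.cosh(ang), torch.sinh(ang)         # Lorentz boost (assumed form)
    return torch.stack([c * x[0] + s * x[1], s * x[0] + c * x[1]])

x = torch.tensor([1.0, 0.5])
print(rope_pair(x, pos=3, theta=0.1), hope_pair(x, pos=3, theta=0.1))
```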

cs.CV

[150] Label Smoothing++: Enhanced Label Regularization for Training Neural Networks

Sachin Chhabra, Hemanth Venkateswara, Baoxin Li

Main category: cs.CV

TL;DR: Label Smoothing++ is a novel label regularization method that improves upon standard label smoothing by preserving inter-class relationships while reducing overconfidence.

DetailsMotivation: Standard label smoothing addresses overconfidence and overfitting from one-hot labels but destroys inter-class relationships by assigning equal importance to all non-target classes.

Method: Proposes Label Smoothing++ which assigns non-zero probabilities to non-target classes while accounting for their inter-class relationships, using a fixed label for target class and learning labels for non-target classes.

Result: Extensive experiments on multiple datasets demonstrate that Label Smoothing++ mitigates overconfident predictions while promoting inter-class relationships and improving generalization capabilities.

Conclusion: Label Smoothing++ effectively addresses the limitations of standard label smoothing by preserving class relationships while maintaining regularization benefits for improved model generalization.

Abstract: Training neural networks with one-hot target labels often results in overconfidence and overfitting. Label smoothing addresses this issue by perturbing the one-hot target labels by adding a uniform probability vector to create a regularized label. Although label smoothing improves the network’s generalization ability, it assigns equal importance to all the non-target classes, which destroys the inter-class relationships. In this paper, we propose a novel label regularization training strategy called Label Smoothing++, which assigns non-zero probabilities to non-target classes and accounts for their inter-class relationships. Our approach uses a fixed label for the target class while enabling the network to learn the labels associated with non-target classes. Through extensive experiments on multiple datasets, we demonstrate how Label Smoothing++ mitigates overconfident predictions while promoting inter-class relationships and generalization capabilities.
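
A sketch of the stated recipe: a fixed target-class probability plus a learned distribution over the non-target classes, here with one learnable row per class. The parameterization and loss pairing are assumptions from the abstract.

```python
# Assumed Label Smoothing++ style criterion: fixed target mass, learned rest.
import torch
import torch.nn.functional as F

class LabelSmoothingPP(torch.nn.Module):
    def __init__(self, num_classes, target_prob=0.9):
        super().__init__()
        self.num_classes = num_classes
        self.target_prob = target_prob
        # One learnable logit row per class, defining its non-target label mass.
        self.logits = torch.nn.Parameter(torch.zeros(num_classes, num_classes))

    def forward(self, pred_logits, target):
        mask = F.one_hot(target, self.num_classes).bool()
        row = self.logits[target].masked_fill(mask, float("-inf"))
        soft = F.softmax(row, dim=-1) * (1 - self.target_prob)  # non-target mass
        label = soft + mask.float() * self.target_prob          # rows sum to 1
        return -(label * F.log_softmax(pred_logits, dim=-1)).sum(dim=-1).mean()

crit = LabelSmoothingPP(num_classes=10)
loss = crit(torch.randn(4, 10), torch.tensor([1, 0, 7, 3]))
```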

[151] VILOD: A Visual Interactive Labeling Tool for Object Detection

Isac Holm

Main category: cs.CV

TL;DR: VILOD is a visual interactive labeling tool for object detection that combines human expertise with active learning through visual analytics, enabling transparent and effective human-AI collaboration in dataset annotation.

DetailsMotivation: Traditional object detection requires large, accurately labeled datasets which are time-consuming and expensive to create. Active learning methods lack transparency and may miss informative samples, while human experts' strategic insights are underutilized.

Method: Developed VILOD tool featuring t-SNE projection of image features, uncertainty heatmaps, and model state views. Enables interactive exploration of data, interpretation of model states, and implementation of diverse sample selection strategies in a human-in-the-loop workflow.

Result: Comparative use cases showed VILOD makes model states and dataset characteristics more interpretable, allowing implementation of distinct labeling strategies. Visually-guided strategies achieved competitive object detection performance compared to automated uncertainty sampling baseline.

Conclusion: VILOD contributes a novel tool that makes human-in-the-loop active learning workflows for object detection annotation more transparent, manageable, and potentially more effective by leveraging visual analytics for human-AI collaboration.

Abstract: The advancement of Object Detection (OD) using Deep Learning (DL) is often hindered by the significant challenge of acquiring large, accurately labeled datasets, a process that is time-consuming and expensive. While techniques like Active Learning (AL) can reduce annotation effort by intelligently querying informative samples, they often lack transparency, limit the strategic insight of human experts, and may overlook informative samples not aligned with an employed query strategy. To mitigate these issues, Human-in-the-Loop (HITL) approaches integrating human intelligence and intuition throughout the machine learning life-cycle have gained traction. Leveraging Visual Analytics (VA), effective interfaces can be created to facilitate this human-AI collaboration. This thesis explores the intersection of these fields by developing and investigating “VILOD: A Visual Interactive Labeling tool for Object Detection”. VILOD utilizes components such as a t-SNE projection of image features, together with uncertainty heatmaps and model state views, enabling users to explore data, interpret model states and AL suggestions, and implement diverse sample selection strategies within an iterative HITL workflow for OD. An empirical investigation using comparative use cases demonstrated how VILOD, through its interactive visualizations, facilitates the implementation of distinct labeling strategies by making the model’s state and dataset characteristics more interpretable (RQ1). The study showed that different visually-guided labeling strategies employed within VILOD result in competitive OD performance trajectories compared to an automated uncertainty sampling AL baseline (RQ2). This work contributes a novel tool and empirical insight into making the HITL-AL workflow for OD annotation more transparent, manageable, and potentially more effective.

[152] Context-Aware Knowledge Distillation with Adaptive Weighting for Image Classification

Zhengda Li

Main category: cs.CV

TL;DR: Proposes Adaptive Knowledge Distillation (AKD) with dynamic alpha parameter that adapts during training instead of using fixed weight, plus context-aware module for class-wise teacher output reweighting.

DetailsMotivation: Traditional KD uses fixed alpha hyperparameter to balance hard-label and soft-label losses, but static alpha is suboptimal as optimal trade-off varies during training.

Method: Make alpha learnable parameter optimized during training, use student-teacher discrepancy gap formula to compute alpha dynamically, and introduce Context-Aware Module (MLP + Attention) to adaptively reweight class-wise teacher outputs.

Result: Experiments on CIFAR-10 with ResNet-50 teacher and ResNet-18 student show superior accuracy compared to fixed-weight KD baselines and more stable convergence.

Conclusion: Adaptive Knowledge Distillation framework with dynamic alpha and context-aware reweighting outperforms traditional fixed-weight KD methods in both accuracy and training stability.

Abstract: Knowledge distillation (KD) is a widely used technique to transfer knowledge from a large teacher network to a smaller student model. Traditional KD uses a fixed balancing factor alpha as a hyperparameter to combine the hard-label cross-entropy loss with the soft-label distillation loss. However, a static alpha is suboptimal because the optimal trade-off between hard and soft supervision can vary during training. In this work, we propose an Adaptive Knowledge Distillation (AKD) framework. First, we make alpha a learnable parameter that is automatically optimized during training. Then, we introduce a formula that reflects the gap between the student and the teacher to compute alpha dynamically, guided by student-teacher discrepancies, and further introduce a Context-Aware Module (CAM) using MLP + Attention to adaptively reweight class-wise teacher outputs. Experiments on CIFAR-10 with ResNet-50 as teacher and ResNet-18 as student demonstrate that our approach achieves superior accuracy compared to fixed-weight KD baselines, and yields more stable convergence.
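
A sketch of a discrepancy-driven alpha: the larger the student-teacher gap, the more weight the distillation term receives. The KL-based gap and the sigmoid mapping are assumptions, not the paper's exact formula, and the CAM is omitted.

```python
# Assumed dynamic-alpha KD loss driven by student-teacher discrepancy.
import torch
import torch.nn.functional as F

def akd_loss(student_logits, teacher_logits, target, T=4.0):
    gap = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                   F.softmax(teacher_logits / T, dim=-1),
                   reduction="batchmean")
    alpha = torch.sigmoid(gap).detach()   # larger gap -> more soft-label weight
    ce = F.cross_entropy(student_logits, target)
    kd = gap * (T * T)                    # usual T^2 scaling of the KD term
    return (1 - alpha) * ce + alpha * kd

loss = akd_loss(torch.randn(8, 10), torch.randn(8, 10), torch.randint(0, 10, (8,)))
```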

[153] A Real-Time, Vision-Based System for Badminton Smash Speed Estimation on Mobile Devices

Diwen Huang

Main category: cs.CV

TL;DR: A smartphone-based system using YOLOv5 and Kalman filter to measure badminton smash speed cost-effectively.

DetailsMotivation: Make performance metrics like shot speed accessible to amateur players by overcoming expensive and complex traditional technology.

Method: Uses custom-trained YOLOv5 model for shuttlecock detection, Kalman filter for trajectory tracking, and video-based kinematic speed estimation with spatiotemporal scaling.

Result: Developed an intuitive mobile application that automatically calculates shuttlecock velocity from standard video recordings.

Conclusion: Democratizes access to high-level performance analytics, empowering players at all levels to analyze and improve their game through accessible smartphone technology.

Abstract: Performance metrics in sports, such as shot speed and angle, provide crucial feedback for athlete development. However, the technology to capture these metrics has historically been expensive, complex, and largely inaccessible to amateur and recreational players. This paper addresses this gap in the context of badminton, one of the world’s most popular sports, by introducing a novel, cost-effective, and user-friendly system for measuring smash speed using ubiquitous smartphone technology. Our approach leverages a custom-trained YOLOv5 model for shuttlecock detection, combined with a Kalman filter for robust trajectory tracking. By implementing a video-based kinematic speed estimation method with spatiotemporal scaling, the system automatically calculates the shuttlecock’s velocity from a standard video recording. The entire process is packaged into an intuitive mobile application, democratizing access to high-level performance analytics and empowering players at all levels to analyze and improve their game.
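
The kinematic estimate is simple enough to show as a worked example: pixel displacement between frames, scaled to metres and divided by the frame interval. The calibration numbers below are illustrative, not from the paper.

```python
# Worked example of video-based kinematic speed estimation.
import math

def smash_speed(p1, p2, metres_per_pixel, fps):
    """Speed in m/s from shuttle positions in two consecutive frames."""
    dist_px = math.hypot(p2[0] - p1[0], p2[1] - p1[1])
    return dist_px * metres_per_pixel * fps

# Shuttle moves 180 px between consecutive frames at 60 fps with an assumed
# calibration of 0.005 m/px -> 54 m/s, i.e. about 194 km/h.
speed = smash_speed((100, 200), (280, 200), metres_per_pixel=0.005, fps=60)
kmh = speed * 3.6
```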

[154] A Dataset Generation Scheme Based on Video2EEG-SPGN-Diffusion for SEED-VD

Yunfei Guo, Tao Zhang, Wu Huang, Yao Song

Main category: cs.CV

TL;DR: Open-source framework Video2EEG-SPGN-Diffusion generates multimodal EEG dataset from video stimuli using SEED-VD dataset and diffusion models with self-play graph networks.

DetailsMotivation: To advance multimodal research by creating tools for video-EEG alignment, enabling emotion analysis, data augmentation, and brain-computer interface applications.

Method: Uses self-play graph network (SPGN) integrated with diffusion model to generate personalized EEG signals, with engineering pipeline for aligning video and EEG data pairs.

Result: Released dataset with over 1000 samples of SEED-VD video stimuli paired with generated 62-channel EEG signals at 200 Hz and emotion labels.

Conclusion: Framework provides novel tools for multimodal research with significant research and engineering applications in emotion analysis and brain-computer interfaces.

Abstract: This paper introduces an open-source framework, Video2EEG-SPGN-Diffusion, that leverages the SEED-VD dataset to generate a multimodal dataset of EEG signals conditioned on video stimuli. Additionally, we disclose an engineering pipeline for aligning video and EEG data pairs, facilitating the training of multimodal large models with EEG alignment capabilities. Personalized EEG signals are generated using a self-play graph network (SPGN) integrated with a diffusion model. As a major contribution, we release a new dataset comprising over 1000 samples of SEED-VD video stimuli paired with generated 62-channel EEG signals at 200 Hz and emotion labels, enabling video-EEG alignment and advancing multimodal research. This framework offers novel tools for emotion analysis, data augmentation, and brain-computer interface applications, with substantial research and engineering significance.

[155] Application of discrete Ricci curvature in pruning randomly wired neural networks: A case study with chest x-ray classification of COVID-19

Pavithra Elumalai, Sudharsan Vijayaraghavan, Madhumita Mondal, Areejit Samal

Main category: cs.CV

TL;DR: Study compares three edge-centric network measures (FRC, ORC, EBC) for pruning randomly wired neural networks on COVID-19 chest x-ray classification, finding FRC offers computational efficiency with comparable performance to ORC.

DetailsMotivation: To investigate how different network connectivity patterns impact learning efficiency and model performance, and to explore edge-centric network measures for pruning and optimization in randomly wired neural networks.

Method: Used three edge-centric measures (Forman-Ricci curvature, Ollivier-Ricci curvature, and edge betweenness centrality) to compress RWNNs by selectively retaining important edges. Applied to three network generators (ER, WS, BA models) for COVID-19 chest x-ray classification.

Result: FRC-based pruning effectively simplifies RWNNs with significant computational advantages while maintaining performance comparable to ORC. Provided comparative analysis of pruning performance in terms of compression ratio and theoretical speedup.

Conclusion: Forman-Ricci curvature offers a computationally efficient alternative to Ollivier-Ricci curvature for network pruning in randomly wired neural networks, achieving comparable effectiveness while being more efficient.

Abstract: Randomly Wired Neural Networks (RWNNs) serve as a valuable testbed for investigating the impact of network topology in deep learning by capturing how different connectivity patterns impact both learning efficiency and model performance. At the same time, they provide a natural framework for exploring edge-centric network measures as tools for pruning and optimization. In this study, we investigate three edge-centric network measures: Forman-Ricci curvature (FRC), Ollivier-Ricci curvature (ORC), and edge betweenness centrality (EBC), to compress RWNNs by selectively retaining important synapses (or edges) while pruning the rest. As a baseline, RWNNs are trained for COVID-19 chest x-ray image classification, aiming to reduce network complexity while preserving performance in terms of accuracy, specificity, and sensitivity. We extend prior work on pruning RWNN using ORC by incorporating two additional edge-centric measures, FRC and EBC, across three network generators: Erdős-Rényi (ER) model, Watts-Strogatz (WS) model, and Barabási-Albert (BA) model. We provide a comparative analysis of the pruning performance of the three measures in terms of compression ratio and theoretical speedup. A central focus of our study is to evaluate whether FRC, which is computationally more efficient than ORC, can achieve comparable pruning effectiveness. Along with performance evaluation, we further investigate the structural properties of the pruned networks through modularity and global efficiency, offering insights into the trade-off between modular segregation and network efficiency in compressed RWNNs. Our results provide initial evidence that FRC-based pruning can effectively simplify RWNNs, offering significant computational advantages while maintaining performance comparable to ORC.
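
For intuition, here is a sketch of curvature-guided pruning using the simple combinatorial Forman-Ricci curvature F(u,v) = 4 - deg(u) - deg(v) on a Watts-Strogatz graph; the paper's exact FRC variant and its edge-retention rule may differ.

```python
# Curvature-guided edge pruning on a toy Watts-Strogatz graph.
import networkx as nx

def frc(g, u, v):
    # Simple combinatorial Forman-Ricci curvature of an unweighted edge.
    return 4 - g.degree(u) - g.degree(v)

def prune_by_frc(g, keep_ratio=0.6):
    ranked = sorted(g.edges(), key=lambda e: frc(g, *e), reverse=True)
    pruned = nx.Graph()
    pruned.add_nodes_from(g.nodes())
    pruned.add_edges_from(ranked[: int(keep_ratio * g.number_of_edges())])
    return pruned

g = nx.watts_strogatz_graph(50, 4, 0.3, seed=0)   # WS generator, as in the paper
print(prune_by_frc(g).number_of_edges(), "of", g.number_of_edges(), "edges kept")
```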

[156] Optical Music Recognition of Jazz Lead Sheets

Juan Carlos Martinez-Sevilla, Francesco Foscarin, Patricia Garcia-Iasci, David Rizo, Jorge Calvo-Zaragoza, Gerhard Widmer

Main category: cs.CV

TL;DR: A new dataset and OMR model for handwritten jazz lead sheets that handle chords and variability in handwritten music scores.

DetailsMotivation: Existing Optical Music Recognition systems cannot handle chords in handwritten jazz lead sheets, which present challenges due to high variability and quality issues.

Method: Created a novel dataset of 293 handwritten jazz lead sheets with ground truth scores, developed a specialized OMR model with specific tokenization for jazz data, and utilized synthetic scores and pretrained models.

Result: Produced a comprehensive dataset with 2021 staves aligned with Humdrum **kern and MusicXML formats, along with synthetic score images and a functional OMR model for jazz lead sheets.

Conclusion: Successfully addressed the OMR challenge for handwritten jazz lead sheets by creating a specialized dataset and model, with all code, data, and models publicly released for community use.

Abstract: In this paper, we address the challenge of Optical Music Recognition (OMR) for handwritten jazz lead sheets, a widely used musical score type that encodes melody and chords. The task is challenging due to the presence of chords, a score component not handled by existing OMR systems, and the high variability and quality issues associated with handwritten images. Our contribution is two-fold. We present a novel dataset consisting of 293 handwritten jazz lead sheets of 163 unique pieces, amounting to 2021 total staves aligned with Humdrum **kern and MusicXML ground truth scores. We also supply synthetic score images generated from the ground truth. The second contribution is the development of an OMR model for jazz lead sheets. We discuss specific tokenisation choices related to our kind of data, and the advantages of using synthetic scores and pretrained models. We publicly release all code, data, and models.

[157] RT-VLM: Re-Thinking Vision Language Model with 4-Clues for Real-World Object Recognition Robustness

Junghyun Park, Tuan Anh Nguyen, Dugki Min

Main category: cs.CV

TL;DR: RT-VLM framework uses synthetic data with 4 types of annotations and a two-stage self-correcting inference process to improve object recognition robustness against domain shifts.

DetailsMotivation: Modern object recognition models suffer severe accuracy drops when exposed to domain shifts including image statistics variations, pose/viewpoint changes, occlusion, and visual confusion between classes.

Method: Created synthetic dataset with 4-clue annotations (bounding boxes, class names, object captions, context captions), fine-tuned Llama 3.2 11B Vision Instruct, and implemented two-stage inference with self-critique loop.

Result: RT-VLM consistently outperforms strong baselines across robustness benchmarks isolating individual domain shifts.

Conclusion: Integration of structured multimodal evidence with explicit self-critique loops provides a promising approach for reliable and transferable visual understanding.

Abstract: Real-world deployments often expose modern object recognition models to domain shifts that precipitate a severe drop in accuracy. Such shifts encompass (i) variations in low-level image statistics, (ii) changes in object pose and viewpoint, (iii) partial occlusion, and (iv) visual confusion across adjacent classes. To mitigate this degradation, we introduce the Re-Thinking Vision Language Model (RT-VLM) framework. The foundation of this framework is a unique synthetic dataset generation pipeline that produces images annotated with “4-Clues”: precise bounding boxes, class names, detailed object-level captions, and a comprehensive context-level caption for the entire scene. We then perform parameter-efficient supervised tuning of Llama 3.2 11B Vision Instruct on this resource. At inference time, a two-stage Re-Thinking scheme is executed: the model first emits its own four clues, then re-examines these responses as evidence and iteratively corrects them. Across robustness benchmarks that isolate individual domain shifts, RT-VLM consistently surpasses strong baselines. These findings indicate that the integration of structured multimodal evidence with an explicit self-critique loop constitutes a promising route toward reliable and transferable visual understanding.
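
The two-stage Re-Thinking loop is a model-agnostic inference wrapper; a schematic sketch follows, where `vlm` is any callable mapping (image, prompt) to text and the prompts are placeholders rather than the paper's.

```python
# Schematic two-stage Re-Thinking inference: emit the four clues, then feed
# them back as evidence for self-correction. Prompts are placeholders.
def rethink_inference(vlm, image, rounds=1):
    clues = vlm(image, "List bounding boxes, class names, object captions, "
                       "and a context caption for this scene.")
    for _ in range(rounds):
        clues = vlm(image, "Here is your previous analysis:\n" + clues +
                           "\nRe-examine it as evidence and correct any errors.")
    return clues

# answer = rethink_inference(my_vlm, img)   # my_vlm wraps e.g. Llama 3.2 Vision
```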

[158] A Stroke-Level Large-Scale Database of Chinese Character Handwriting and the OpenHandWrite_Toolbox for Handwriting Research

Zebo Xu, Shaoyun Yu, Mark Torrance, Guido Nottbusch, Nan Zhao, Zhenguang Cai

Main category: cs.CV

TL;DR: Built large-scale Chinese handwriting database and enhanced OpenHandWrite_Toolbox to analyze linguistic effects on handwriting at character, radical, and stroke levels, finding hierarchical attenuation of orthographic and phonological influences.

DetailsMotivation: To understand how linguistic components (phonological, semantic, orthographic) modulate Chinese handwriting at different levels and address the lack of comprehensive tools for capturing fine-grained handwriting data.

Method: Constructed large-scale handwriting database with 42 Chinese speakers writing 1200 characters each, enhanced OpenHandWrite_Toolbox for stroke-level trajectory capture and batch processing of handwriting measurements, conducted multiple regression analysis.

Result: Orthographic predictors impact handwriting preparation and execution across all levels; phonological factors influence execution at all levels; lexical effects show hierarchical attenuation (strongest at character level, followed by radical, weakest at stroke levels).

Conclusion: Handwriting preparation and execution at radical and stroke levels are closely linked to linguistic components; the database and toolbox provide valuable resources for future psycholinguistic and neurolinguistic handwriting research.

Abstract: Understanding what linguistic components (e.g., phonological, semantic, and orthographic systems) modulate Chinese handwriting at the character, radical, and stroke levels remains an important yet understudied topic. Additionally, there is a lack of comprehensive tools for capturing and batch-processing fine-grained handwriting data. To address these issues, we constructed a large-scale handwriting database in which 42 Chinese speakers each handwrote 1,200 characters in a handwriting-to-dictation task. Additionally, we enhanced the existing handwriting package and provided comprehensive documentation for the upgraded OpenHandWrite_Toolbox, which can easily modify the experimental design, capture the stroke-level handwriting trajectory, and batch-process handwriting measurements (e.g., latency, duration, and pen pressure). In analysing our large-scale database, multiple regression results show that orthographic predictors impact handwriting preparation and execution across the character, radical, and stroke levels. Phonological factors also influence execution at all three levels. Importantly, these lexical effects demonstrate hierarchical attenuation - they were most pronounced at the character level, followed by the radical level, and were weakest at the stroke level. These findings demonstrate that handwriting preparation and execution at the radical and stroke levels are closely intertwined with linguistic components. This database and toolbox offer valuable resources for future psycholinguistic and neurolinguistic research on the handwriting of characters and sub-characters across different languages.

[159] Anticipatory Fall Detection in Humans with Hybrid Directed Graph Neural Networks and Long Short-Term Memory

Younggeol Cho, Gokhan Solak, Olivia Nocentini, Marta Lorenzini, Andrea Fortuna, Arash Ajoudani

Main category: cs.CV

TL;DR: Hybrid DGNN-LSTM model for early fall prediction by decoupling motion prediction and gait classification, outperforming existing methods.

DetailsMotivation: Existing fall detection systems only detect falls after they occur, leaving fall prediction and analysis of the transient state between stability and falling unexplored.

Method: Combines Dynamic Graph Neural Networks (DGNN) for gait classification (stable/transient/fall) with LSTM networks for motion prediction, using real-time skeletal features from video sequences.
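
As a rough illustration of the motion-prediction half, the following is a minimal PyTorch sketch of an LSTM that maps a window of 2D skeletal keypoints to the next pose; the joint count, window length, and layer sizes are assumptions, and the DGNN gait classifier is not reproduced here.

```python
import torch
import torch.nn as nn

# Minimal sketch of the motion-prediction half of the hybrid model:
# an LSTM that maps a window of skeletal keypoints to the next pose.
# Dimensions and the separate gait classifier (the DGNN) are assumptions.

class MotionPredictor(nn.Module):
    def __init__(self, n_joints=17, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_joints * 2,  # (x, y) per joint
                            hidden_size=hidden, num_layers=2,
                            batch_first=True)
        self.head = nn.Linear(hidden, n_joints * 2)

    def forward(self, poses):               # poses: (B, T, n_joints * 2)
        out, _ = self.lstm(poses)
        return self.head(out[:, -1])        # predicted next pose: (B, n_joints * 2)

pred = MotionPredictor()(torch.randn(4, 30, 34))  # 30-frame window of 17 joints
```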

Result: Superior performance in prediction error and recognition accuracy compared to DGNN-only models and literature baselines on OUMVLP-Pose and URFD datasets.

Conclusion: Decoupling prediction and classification improves performance and enables monitoring of transient states, providing valuable insights for advanced assistive systems.

Abstract: Detecting and preventing falls in humans is a critical component of assistive robotic systems. While significant progress has been made in detecting falls, the prediction of falls before they happen, and analysis of the transient state between stability and an impending fall, remain unexplored. In this paper, we propose an anticipatory fall detection method that utilizes a hybrid model combining Dynamic Graph Neural Networks (DGNN) with Long Short-Term Memory (LSTM) networks, decoupling the motion prediction and gait classification tasks to anticipate falls with high accuracy. Our approach employs real-time skeletal features extracted from video sequences as input for the proposed model. The DGNN acts as a classifier, distinguishing between three gait states: stable, transient, and fall. The LSTM-based network then predicts human movement in subsequent time steps, enabling early detection of falls. The proposed model was trained and validated using the OUMVLP-Pose and URFD datasets, demonstrating superior performance in terms of prediction error and recognition accuracy compared to models relying solely on DGNN and models from the literature. The results indicate that decoupling prediction and classification improves performance compared to addressing the unified problem using only the DGNN. Furthermore, our method allows for the monitoring of the transient state, offering valuable insights that could enhance the functionality of advanced assistance systems.

[160] CoreMark: Toward Robust and Universal Text Watermarking Technique

Jiale Meng, Yiming Li, Zheming Lu, Zewei He, Hao Luo, Tianwei Zhang

Main category: cs.CV

TL;DR: CoreMark introduces a new text watermarking framework using CORE segments (consecutive black pixels) that achieves robust, generalizable, and imperceptible watermarking across multiple languages and fonts.

DetailsMotivation: Existing text watermarking schemes struggle to simultaneously achieve robustness, generalizability across languages/fonts, and imperceptibility, creating a need for a more effective solution.

Method: CoreMark dynamically extracts CORE segments from characters, selects robust characters based on CORE lengths, embeds data by modifying CORE thickness, and uses an adaptive embedding strength modulator that adjusts based on font size.
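
One plausible reading of CORE extraction, sketched below: binarize a character image, collect runs of consecutive black pixels per row, and keep the longest runs as candidate embedding sites (the framework then modifies the thickness of the chosen CORE). The threshold and the row-wise scan direction are illustrative assumptions.

```python
import numpy as np

# Schematic reading of CORE extraction: find runs of consecutive black
# pixels in each row of a binarized character image and keep the longest.
# The threshold and row-wise scan are assumptions for illustration.

def longest_black_runs(glyph, threshold=128, top_k=5):
    binary = glyph < threshold                 # True where the stroke is black
    runs = []
    for y, row in enumerate(binary):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                runs.append((y, start, x - start))  # (row, col, run length)
            else:
                x += 1
    return sorted(runs, key=lambda r: -r[2])[:top_k]  # longest segments first

glyph = (np.random.rand(32, 32) * 255).astype(np.uint8)  # stand-in character
print(longest_black_runs(glyph))
```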

Result: CoreMark demonstrates outstanding generalizability across multiple languages and fonts, significantly outperforms existing methods in resisting screenshot, print-scan, and print-camera attacks, while maintaining satisfactory imperceptibility.

Conclusion: The CORE-based paradigm provides an effective solution for robust text watermarking that works across diverse languages and fonts while maintaining visual quality, representing a significant advancement in the field.

Abstract: Text watermarking schemes have gained considerable attention in recent years, yet still face critical challenges in achieving simultaneous robustness, generalizability, and imperceptibility. This paper introduces a new embedding paradigm, termed CORE, which comprises several consecutively aligned black pixel segments. Its key innovation lies in its inherent noise resistance during transmission and broad applicability across languages and fonts. Based on the CORE, we present a text watermarking framework named CoreMark. Specifically, CoreMark first dynamically extracts COREs from characters. Then, the characters with stronger robustness are selected according to the lengths of their COREs. By modifying the thickness of the CORE, the hidden data is embedded into the selected characters without causing significant visual distortions. Moreover, a general plug-and-play embedding strength modulator is proposed, which can adaptively enhance the robustness for small font sizes by adjusting the embedding strength according to the font size. Experimental evaluation indicates that CoreMark demonstrates outstanding generalizability across multiple languages and fonts. Compared to existing methods, CoreMark achieves significant improvements in resisting screenshot, print-scan, and print-camera attacks, while maintaining satisfactory imperceptibility.

[161] InterAct: A Large-Scale Dataset of Dynamic, Expressive and Interactive Activities between Two People in Daily Scenarios

Leo Ho, Yinghao Huang, Dafei Qin, Mingyi Shi, Wangpok Tse, Wei Liu, Junichi Yamagishi, Taku Komura

Main category: cs.CV

TL;DR: A new multi-modal dataset called InterAct captures realistic two-person interactions with body motions, facial expressions, and audio, along with a diffusion-based method for generating interactive motions from speech.

DetailsMotivation: Previous works either focus on single persons or limited conversational gestures, ignoring dynamic, objective-driven interactions that span longer durations and larger spaces between two people.

Method: Captured 241 motion sequences of two-person interactions with different roles and emotions. Developed a diffusion-based hierarchical method to estimate interactive face expressions and body motions from speech, with a novel fine-tuning mechanism for lip accuracy.

Result: Created the InterAct dataset containing diverse and complex motions with long-term interaction patterns. The proposed method effectively generates interactive behaviors from speech inputs.

Conclusion: The InterAct dataset enables research on realistic two-person interactions, and the diffusion-based approach provides an effective solution for generating semantically consistent interactive behaviors from speech.

Abstract: We address the problem of accurate capture of interactive behaviors between two people in daily scenarios. Most previous works either only consider one person or solely focus on conversational gestures of two people, assuming the body orientation and/or position of each actor are constant or barely change over each interaction. In contrast, we propose to simultaneously model two people’s activities, and target objective-driven, dynamic, and semantically consistent interactions, which often span longer durations and cover larger spaces. To this end, we capture a new multi-modal dataset dubbed InterAct, which is composed of 241 motion sequences where two people perform a realistic and coherent scenario for one minute or longer over a complete interaction. For each sequence, two actors are assigned different roles and emotion labels, and collaborate to finish one task or conduct a common interaction activity. The audio, body motions, and facial expressions of both persons are captured. InterAct contains diverse and complex motions of individuals, as well as interesting and relatively long-term interaction patterns barely seen before. We also demonstrate a simple yet effective diffusion-based method that estimates interactive face expressions and body motions of two people from speech inputs. Our method regresses the body motions in a hierarchical manner, and we also propose a novel fine-tuning mechanism to improve the lip accuracy of facial expressions. To facilitate further research, the data and code are made available at https://hku-cg.github.io/interact/ .

[162] Comparative Evaluation of Hard and Soft Clustering for Precise Brain Tumor Segmentation in MR Imaging

Dibya Jyoti Bora, Mrinal Kanti Mishra

Main category: cs.CV

TL;DR: Comparative analysis of K-Means vs Fuzzy C-Means for brain tumor segmentation from MRI, showing K-Means is faster but FCM provides better accuracy.

DetailsMotivation: Accurate brain tumor segmentation from MRI is critical for clinical decision-making and treatment planning, but challenging due to tumor heterogeneity.

Method: Used BraTS2020 dataset with Gaussian filtering and CLAHE pre-processing. Compared hard clustering (K-Means) vs soft clustering (FCM) approaches.
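
For reference, here is a minimal NumPy/scikit-learn contrast of the two paradigms on flattened intensities: K-Means assigns each pixel to exactly one cluster, while the textbook Fuzzy C-Means updates below yield graded memberships. Pre-processing (Gaussian filtering, CLAHE) is omitted, and the cluster count and iteration budget are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hard vs. soft clustering on 1-D pixel intensities. The FCM loop is the
# textbook alternating update, not the paper's implementation.

def fcm(x, c=3, m=2.0, iters=50):
    u = np.random.dirichlet(np.ones(c), size=len(x))       # memberships (N, c)
    for _ in range(iters):
        w = u ** m
        centers = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)    # (c,)
        d = np.abs(x[:, None] - centers[None, :]) + 1e-9          # (N, c)
        p = 2.0 / (m - 1.0)
        u = 1.0 / ((d ** p) * (1.0 / d ** p).sum(axis=1, keepdims=True))
    return centers, u

pixels = np.random.rand(1000)                              # stand-in MR intensities
hard = KMeans(n_clusters=3, n_init=10).fit_predict(pixels.reshape(-1, 1))
centers, memberships = fcm(pixels)                         # each row sums to 1
```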

Result: K-Means: 0.3s per image (faster) but lower DSC of 0.43. FCM: 1.3s per image (slower) but higher DSC of 0.67.

Conclusion: Trade-off between computational efficiency (K-Means) and segmentation accuracy (FCM) exists, with FCM providing better boundary precision despite higher computational cost.

Abstract: Segmentation of brain tumors from Magnetic Resonance Imaging (MRI) remains a pivotal challenge in medical image analysis due to the heterogeneous nature of tumor morphology and intensity distributions. Accurate delineation of tumor boundaries is critical for clinical decision-making, radiotherapy planning, and longitudinal disease monitoring. In this study, we perform a comprehensive comparative analysis of two major clustering paradigms applied in MRI tumor segmentation: hard clustering, exemplified by the K-Means algorithm, and soft clustering, represented by Fuzzy C-Means (FCM). While K-Means assigns each pixel strictly to a single cluster, FCM introduces partial memberships, meaning each pixel can belong to multiple clusters with varying degrees of association. Experimental validation was performed using the BraTS2020 dataset, incorporating pre-processing through Gaussian filtering and Contrast Limited Adaptive Histogram Equalization (CLAHE). Evaluation metrics included the Dice Similarity Coefficient (DSC) and processing time, which collectively demonstrated that K-Means achieved superior speed with an average runtime of 0.3s per image, whereas FCM attained higher segmentation accuracy with an average DSC of 0.67 compared to 0.43 for K-Means, albeit at a higher computational cost (1.3s per image). These results highlight the inherent trade-off between computational efficiency and boundary precision.

[163] LMM4Edit: Benchmarking and Evaluating Multimodal Image Editing with LMMs

Zitong Xu, Huiyu Duan, Bingnan Liu, Guangji Ma, Jiarui Wang, Liu Yang, Shiqi Gao, Xiaoyu Wang, Jia Wang, Xiongkuo Min, Guangtao Zhai, Weisi Lin

Main category: cs.CV

TL;DR: EBench-18K is a large-scale benchmark with 18K+ edited images and human annotations for evaluating text-guided image editing models. The paper also introduces LMM4Edit, an LMM-based metric that aligns well with human preferences across multiple evaluation dimensions.

DetailsMotivation: Current TIE models struggle to balance image quality, editing alignment, and consistency with original images. Existing evaluation benchmarks have limitations in scale and alignment with human perception.

Method: Created EBench-18K benchmark with 1,080 source images, 21 editing tasks, 18K+ edited images from 17 TIE models, and 55K+ human annotations. Developed LMM4Edit metric using large multimodal models to assess edited images across perceptual quality, editing alignment, attribute preservation, and task-specific QA accuracy.

Result: LMM4Edit achieves outstanding performance and aligns well with human preference. Zero-shot validation on other datasets demonstrates strong generalization ability.

Conclusion: EBench-18K provides a comprehensive benchmark for TIE evaluation, and LMM4Edit offers an effective all-in-one metric that correlates well with human judgment, advancing the field of text-guided image editing assessment.

Abstract: The rapid advancement of Text-guided Image Editing (TIE) enables image modifications through text prompts. However, current TIE models still struggle to balance image quality, editing alignment, and consistency with the original image, limiting their practical applications. Existing TIE evaluation benchmarks and metrics have limitations on scale or alignment with human perception. To this end, we introduce EBench-18K, the first large-scale image Editing Benchmark including 18K edited images with fine-grained human preference annotations for evaluating TIE. Specifically, EBench-18K includes 1,080 source images with corresponding editing prompts across 21 tasks, 18K+ edited images produced by 17 state-of-the-art TIE models, 55K+ mean opinion scores (MOSs) assessed from three evaluation dimensions, and 18K+ question-answering (QA) pairs. Based on EBench-18K, we employ outstanding LMMs to assess edited images, while the evaluation results, in turn, provide insights into assessing the alignment between the LMMs’ understanding ability and human preferences. Then, we propose LMM4Edit, an LMM-based metric for evaluating image Editing models from perceptual quality, editing alignment, attribute preservation, and task-specific QA accuracy in an all-in-one manner. Extensive experiments show that LMM4Edit achieves outstanding performance and aligns well with human preference. Zero-shot validation on other datasets also shows the generalization ability of our model. The dataset and code are available at https://github.com/IntMeGroup/LMM4Edit.

[164] Handling imbalance and few-sample size in ML based Onion disease classification

Abhijeet Manoj Pal, Rajbabu Velmurugan

Main category: cs.CV

TL;DR: A deep learning model with attention modules and data augmentation achieves 96.9% accuracy for multi-class classification of onion crop diseases and pests, outperforming existing methods.

DetailsMotivation: Current pest and disease classification methods are limited to binary classification, which restricts practical applications where identifying specific disease/pest types is crucial for precision agriculture.

Method: Enhanced a pre-trained CNN model by integrating attention-based modules and employing a comprehensive data augmentation pipeline to address class imbalance issues.
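
A sketch of what such an augmentation pipeline might look like in torchvision, paired with a weighted sampler to counter class imbalance; the specific transforms, their parameters, and the toy labels are assumptions, not the paper's configuration.

```python
import torch
from torchvision import transforms
from torch.utils.data import WeightedRandomSampler

# Illustrative augmentation pipeline; the exact transforms used in the
# paper are not specified, so these choices are assumptions.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=20),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])

# Minority classes can additionally be oversampled with a weighted sampler.
labels = [0, 0, 0, 1, 2]                       # toy per-image class labels
counts = torch.bincount(torch.tensor(labels)).float()
weights = (1.0 / counts)[torch.tensor(labels)]  # rarer class => higher weight
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
```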

Result: Achieved 96.90% overall accuracy and 0.96 F1 score on real-world field image dataset, outperforming other approaches using the same datasets.

Conclusion: The proposed robust deep learning model successfully addresses the limitations of binary classification and provides accurate multi-class identification of onion crop diseases and pests for practical agricultural applications.

Abstract: Accurate classification of pests and diseases plays a vital role in precision agriculture, enabling efficient identification, targeted interventions, and preventing their further spread. However, current methods primarily focus on binary classification, which limits their practical applications, especially in scenarios where accurately identifying the specific type of disease or pest is essential. We propose a robust deep learning based model for multi-class classification of onion crop diseases and pests. We enhance a pre-trained Convolutional Neural Network (CNN) model by integrating attention-based modules and employing a comprehensive data augmentation pipeline to mitigate class imbalance. The proposed model achieves 96.90% overall accuracy and a 0.96 F1 score on a real-world field image dataset, outperforming other approaches that use the same datasets.

[165] Delta Velocity Rectified Flow for Text-to-Image Editing

Gaspard Beaudouin, Minghan Li, Jaeyeon Kim, Sunghoon Yoon, Mengyu Wang

Main category: cs.CV

TL;DR: DVRF is a novel inversion-free text-to-image editing framework that uses velocity field discrepancy modeling and time-dependent shifts to improve editing quality without architectural changes.

DetailsMotivation: To address over-smoothing artifacts in prior distillation sampling approaches for text-to-image editing and provide a principled framework that bridges diffusion and rectified-flow optimization methods.

Method: Distillation-based approach that explicitly models source-target velocity field discrepancies, introduces time-dependent shift terms to push noisy latents toward target trajectory, and connects to both Delta Denoising Score and FlowEdit methods theoretically.
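
Schematically, the described update can be read as nudging the noisy latent with the velocity discrepancy plus a time-dependent shift, as in the hedged sketch below; `v`, the step size, and the linear shift schedule are stand-ins, not the paper's exact formulation.

```python
import torch

# Schematic latent update implied by the description: steer the noisy latent
# with the difference between target- and source-conditioned velocities plus
# a time-dependent shift. v(z, t, cond) stands in for the rectified-flow
# model; the step size, schedule, and shift form are assumptions.

def dvrf_step(z, t, v, src_prompt, tgt_prompt, step=0.1, shift_scale=0.05):
    delta_v = v(z, t, tgt_prompt) - v(z, t, src_prompt)   # velocity discrepancy
    shift = shift_scale * (1.0 - t) * delta_v             # toy linear schedule
    return z + step * delta_v + shift
```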

Result: Superior editing quality, fidelity, and controllability compared to previous methods, with no architectural modifications required.

Conclusion: DVRF provides an efficient, broadly applicable text-to-image editing solution that theoretically unifies score-based diffusion and velocity-based rectified-flow optimization approaches.

Abstract: We propose Delta Velocity Rectified Flow (DVRF), a novel inversion-free, path-aware editing framework within rectified flow models for text-to-image editing. DVRF is a distillation-based method that explicitly models the discrepancy between the source and target velocity fields in order to mitigate over-smoothing artifacts rampant in prior distillation sampling approaches. We further introduce a time-dependent shift term to push noisy latents closer to the target trajectory, enhancing the alignment with the target distribution. We theoretically demonstrate that when this shift is disabled, DVRF reduces to Delta Denoising Score, thereby bridging score-based diffusion optimization and velocity-based rectified-flow optimization. Moreover, when the shift term follows a linear schedule under rectified-flow dynamics, DVRF generalizes the Inversion-free method FlowEdit and provides a principled theoretical interpretation for it. Experimental results indicate that DVRF achieves superior editing quality, fidelity, and controllability while requiring no architectural modifications, making it efficient and broadly applicable to text-to-image editing tasks. Code is available at https://github.com/gaspardbd/DeltaVelocityRectifiedFlow.

[166] Systematic Integration of Attention Modules into CNNs for Accurate and Generalizable Medical Image Diagnosis

Zahid Ullah, Minki Hong, Tahir Mahmood, Jihie Kim

Main category: cs.CV

TL;DR: Attention mechanisms integrated into 5 CNN architectures improve medical image analysis by enhancing feature focus and discriminative performance across brain tumor MRI and histopathology datasets.

DetailsMotivation: Conventional CNNs often fail to capture fine-grained features critical for accurate medical diagnosis, requiring enhanced attention to salient regions.

Method: Systematically integrated Squeeze and Excitation blocks and hybrid Convolutional Block Attention Modules into VGG16, ResNet18, InceptionV3, DenseNet121, and EfficientNetB5 architectures.
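
For reference, here are minimal PyTorch sketches of the two attention modules in their standard published forms (squeeze-and-excitation, and the spatial half of CBAM); how the paper wires them into each backbone is not reproduced here.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: recalibrate channels via a gated bottleneck."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                      # x: (B, C, H, W)
        s = x.mean(dim=(2, 3))                 # squeeze: global average pool
        s = self.fc(s)[:, :, None, None]       # excite: per-channel weights
        return x * s

class SpatialAttention(nn.Module):
    """The spatial half of CBAM: gate locations via pooled channel statistics."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

y = SpatialAttention()(SEBlock(64)(torch.randn(2, 64, 32, 32)))
```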

Result: Attention-augmented CNNs consistently outperformed baseline architectures across all metrics, with EfficientNetB5 with hybrid attention achieving the highest overall performance on both medical imaging datasets.

Conclusion: Attention mechanisms improve classification accuracy and feature localization, providing a systematic framework for developing robust and interpretable deep learning systems for clinical applications.

Abstract: Deep learning has become a powerful tool for medical image analysis; however, conventional Convolutional Neural Networks (CNNs) often fail to capture the fine-grained and complex features critical for accurate diagnosis. To address this limitation, we systematically integrate attention mechanisms into five widely adopted CNN architectures, namely, VGG16, ResNet18, InceptionV3, DenseNet121, and EfficientNetB5, to enhance their ability to focus on salient regions and improve discriminative performance. Specifically, each baseline model is augmented with either a Squeeze and Excitation block or a hybrid Convolutional Block Attention Module, allowing adaptive recalibration of channel and spatial feature representations. The proposed models are evaluated on two distinct medical imaging datasets, a brain tumor MRI dataset comprising multiple tumor subtypes, and a Products of Conception histopathological dataset containing four tissue categories. Experimental results demonstrate that attention augmented CNNs consistently outperform baseline architectures across all metrics. In particular, EfficientNetB5 with hybrid attention achieves the highest overall performance, delivering substantial gains on both datasets. Beyond improved classification accuracy, attention mechanisms enhance feature localization, leading to better generalization across heterogeneous imaging modalities. This work contributes a systematic comparative framework for embedding attention modules in diverse CNN architectures and rigorously assesses their impact across multiple medical imaging tasks. The findings provide practical insights for the development of robust, interpretable, and clinically applicable deep learning based decision support systems.

[167] Near Real-Time Dust Aerosol Detection with 3D Convolutional Neural Networks on MODIS Data

Caleb Gates, Patrick Moorhead, Jayden Ferguson, Omar Darwish, Conner Stallman, Pablo Rivas, Paapa Quansah

Main category: cs.CV

TL;DR: Real-time dust detection system using 3D CNN on MODIS satellite imagery with 36 spectral bands, achieving 92% accuracy for global dust monitoring.

DetailsMotivation: Dust storms pose health risks and reduce visibility, requiring rapid detection from satellite imagery to provide timely alerts and warnings.

Method: 3D convolutional network processes multi-band MODIS images from Terra and Aqua satellites, using all 36 spectral bands plus split thermal bands with simple normalization and local filling for missing data.
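
A minimal sketch of the idea, assuming a tiny Conv3d stack that convolves jointly over spectral bands and space before a per-pixel sigmoid head; the band count of 38 (36 bands plus split thermal bands) follows the description, while the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

# Minimal sketch of a 3-D CNN that convolves jointly over spectral bands and
# space, emitting a per-pixel dust probability map. Layer sizes are
# assumptions; the paper's exact architecture is not reproduced.

class DustNet3D(nn.Module):
    def __init__(self, bands=38):              # 36 bands + split thermal bands
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(7, 3, 3), padding=(3, 1, 1)), nn.ReLU(),
            nn.Conv3d(8, 16, kernel_size=(7, 3, 3), padding=(3, 1, 1)), nn.ReLU())
        self.head = nn.Conv2d(16 * bands, 1, kernel_size=1)

    def forward(self, x):                      # x: (B, 1, bands, H, W)
        f = self.features(x)                   # (B, 16, bands, H, W)
        f = f.flatten(1, 2)                    # fold bands into channels
        return torch.sigmoid(self.head(f))     # (B, 1, H, W) dust probability

probs = DustNet3D()(torch.randn(1, 1, 38, 64, 64))
```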

Result: Achieved 0.92 accuracy with 0.014 mean squared error on 17 independent MODIS scenes, with strong agreement in plume cores and most misses occurring along edges.

Conclusion: Joint band-and-space learning enables timely global dust alerts; future improvements could include wider input windows or attention-based models to better handle edge detection.

Abstract: Dust storms harm health and reduce visibility; quick detection from satellites is needed. We present a near real-time system that flags dust at the pixel level using multi-band images from NASA’s Terra and Aqua (MODIS). A 3D convolutional network learns patterns across all 36 bands, plus split thermal bands, to separate dust from clouds and surface features. Simple normalization and local filling handle missing data. An improved version raises training speed by 21x and supports fast processing of full scenes. On 17 independent MODIS scenes, the model reaches about 0.92 accuracy with a mean squared error of 0.014. Maps show strong agreement in plume cores, with most misses along edges. These results show that joint band-and-space learning can provide timely dust alerts at global scale; using wider input windows or attention-based models may further sharpen edges.

[168] Vision-Based Object Detection for UAV Solar Panel Inspection Using an Enhanced Defects Dataset

Ashen Rodrigo, Isuru Munasinghe, Asanka Perera

Main category: cs.CV

TL;DR: Evaluation of 5 object detection models (YOLOv3, Faster R-CNN, RetinaNet, EfficientDet, Swin Transformer) for solar panel defect and contamination detection using a custom COCO-format dataset, with performance comparison based on mAP, precision, recall, and inference speed.

DetailsMotivation: Timely and accurate detection of defects and contaminants in solar panels is critical for maintaining photovoltaic system efficiency and reliability.

Method: Developed custom COCO-format dataset for solar panel defects and contaminants, trained and evaluated five state-of-the-art object detection models with a user interface, assessed performance using mAP, precision, recall, and inference speed metrics.

Result: Results demonstrate trade-offs between detection accuracy and computational efficiency, highlighting relative strengths and limitations of each model for solar panel inspection.

Conclusion: Findings provide valuable guidance for selecting appropriate detection approaches in practical solar panel monitoring and maintenance scenarios, with the dataset made publicly available for further research.

Abstract: Timely and accurate detection of defects and contaminants in solar panels is critical for maintaining the efficiency and reliability of photovoltaic systems. This study presents a comprehensive evaluation of five state-of-the-art object detection models: YOLOv3, Faster R-CNN, RetinaNet, EfficientDet, and Swin Transformer, for identifying physical and electrical defects as well as surface contaminants such as dust, dirt, and bird droppings on solar panels. A custom dataset, annotated in the COCO format and specifically designed for solar panel defect and contamination detection, was developed alongside a user interface to train and evaluate the models. The performance of each model is assessed and compared based on mean Average Precision (mAP), precision, recall, and inference speed. The results demonstrate the trade-offs between detection accuracy and computational efficiency, highlighting the relative strengths and limitations of each model. These findings provide valuable guidance for selecting appropriate detection approaches in practical solar panel monitoring and maintenance scenarios. The dataset will be publicly available at https://github.com/IsuruMunasinghe98/solar-panel-inspection-dataset.

[169] VQualA 2025 Challenge on Image Super-Resolution Generated Content Quality Assessment: Methods and Results

Yixiao Li, Xin Li, Chris Wei Zhou, Shuo Xing, Hadi Amirpour, Xiaoshuai Hao, Guanghui Yue, Baoquan Zhao, Weide Liu, Xiaoyuan Yang, Zhengzhong Tu, Xinyu Li, Chuanbiao Song, Chenqi Zhang, Jun Lan, Huijia Zhu, Weiqiang Wang, Xiaoyan Sun, Shishun Tian, Dongyang Yan, Weixia Zhang, Junlin Chen, Wei Sun, Zhihua Wang, Zhuohang Shi, Zhizun Luo, Hang Ouyang, Tianxin Xiao, Fan Yang, Zhaowang Wu, Kaixin Deng

Main category: cs.CV

TL;DR: ISRGC-Q Challenge introduces a new dataset and competition focused on quality assessment of super-resolution images generated by modern generative approaches like GANs and diffusion models.

DetailsMotivation: Existing SR-IQA datasets don't adequately address the unique artifacts introduced by the latest generative super-resolution techniques, requiring specialized evaluation methods.

Method: Organized a challenge using the ISRGen-QA dataset with 108 participants, focusing on evaluating SR images from GANs and diffusion models through submitted solutions and fact sheets.

Result: 4 teams submitted valid solutions achieving state-of-the-art performance on the ISRGen-QA dataset, demonstrating effective quality assessment capabilities.

Conclusion: The challenge successfully advanced the field of SR image quality assessment for generative approaches and provides a publicly available benchmark for future research.

Abstract: This paper presents the ISRGC-Q Challenge, built upon the Image Super-Resolution Generated Content Quality Assessment (ISRGen-QA) dataset, and organized as part of the Visual Quality Assessment (VQualA) Competition at the ICCV 2025 Workshops. Unlike existing Super-Resolution Image Quality Assessment (SR-IQA) datasets, ISRGen-QA places a greater emphasis on SR images generated by the latest generative approaches, including Generative Adversarial Networks (GANs) and diffusion models. The primary goal of this challenge is to analyze the unique artifacts introduced by modern super-resolution techniques and to evaluate their perceptual quality effectively. A total of 108 participants registered for the challenge, with 4 teams submitting valid solutions and fact sheets for the final testing phase. These submissions demonstrated state-of-the-art (SOTA) performance on the ISRGen-QA dataset. The project is publicly available at: https://github.com/Lighting-YXLI/ISRGen-QA.

[170] Unsupervised Instance Segmentation with Superpixels

Cuong Manh Hoang

Main category: cs.CV

TL;DR: A novel self-supervised instance segmentation framework that eliminates the need for human annotations by using MultiCut algorithm, mask filtering, superpixel-guided mask loss, and adaptive self-training.

DetailsMotivation: Human annotations for instance segmentation are costly to collect, creating a need for methods that can effectively segment objects without requiring labeled data.

Method: Uses MultiCut algorithm on self-supervised features for coarse segmentation, mask filtering for quality control, superpixel-guided mask loss (hard and soft loss), and adaptive self-training to improve mask quality.
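
To illustrate the superpixel side, the sketch below uses SLIC from scikit-image and derives a hard per-superpixel target from a coarse mask; SLIC itself and the 0.5 overlap rule are assumptions, as the summary only states that superpixels come from low-level image features.

```python
import numpy as np
from skimage.segmentation import slic

# Superpixel step sketch: segment the image, then build a "hard" consistency
# target in which every pixel of a superpixel that overlaps the coarse mask
# by more than half is treated as foreground. Parameters are assumptions.

image = np.random.rand(128, 128, 3)            # stand-in RGB image
superpixels = slic(image, n_segments=200, compactness=10, start_label=0)

coarse_mask = np.zeros((128, 128), dtype=bool)
coarse_mask[32:96, 32:96] = True               # toy coarse mask

target = np.zeros_like(coarse_mask)
for sp in np.unique(superpixels):
    region = superpixels == sp
    target[region] = coarse_mask[region].mean() > 0.5
```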

Result: Outperforms previous state-of-the-art methods on public datasets for both instance segmentation and object detection tasks.

Conclusion: The proposed framework provides an efficient and effective solution for instance segmentation without human annotations, demonstrating superior performance compared to existing methods.

Abstract: Instance segmentation is essential for numerous computer vision applications, including robotics, human-computer interaction, and autonomous driving. Currently, popular models bring impressive performance in instance segmentation by training with a large number of human annotations, which are costly to collect. For this reason, we present a new framework that efficiently and effectively segments objects without the need for human annotations. Firstly, a MultiCut algorithm is applied to self-supervised features for coarse mask segmentation. Then, a mask filter is employed to obtain high-quality coarse masks. To train the segmentation network, we compute a novel superpixel-guided mask loss, comprising hard loss and soft loss, with high-quality coarse masks and superpixels segmented from low-level image features. Lastly, a self-training process with a new adaptive loss is proposed to improve the quality of predicted masks. We conduct experiments on public datasets in instance segmentation and object detection to demonstrate the effectiveness of the proposed framework. The results show that the proposed framework outperforms previous state-of-the-art methods.

[171] Perception-oriented Bidirectional Attention Network for Image Super-resolution Quality Assessment

Yixiao Li, Xiaoyuan Yang, Guanghui Yue, Jun Fu, Qiuping Jiang, Xu Jia, Paul L. Rosin, Hantao Liu, Wei Zhou

Main category: cs.CV

TL;DR: Proposes PBAN, a perception-oriented bidirectional attention network for full-reference image quality assessment of super-resolution images, outperforming state-of-the-art methods.

DetailsMotivation: Limited full-reference image quality assessment metrics exist for comparing and evaluating different super-resolution algorithms, creating a need for better evaluation tools.

Method: Three-module architecture: image encoder, perception-oriented bidirectional attention module (with bidirectional attention, grouped multi-scale deformable convolution, and sub-information excitation convolution), and quality prediction module to integrate features and regress scores.

Result: Extensive experiments demonstrate that PBAN outperforms state-of-the-art quality assessment methods.

Conclusion: The proposed PBAN framework effectively addresses the limitations in SR image quality assessment by incorporating human visual system characteristics and bidirectional attention mechanisms.

Abstract: Many super-resolution (SR) algorithms have been proposed to increase image resolution. However, full-reference (FR) image quality assessment (IQA) metrics for comparing and evaluating different SR algorithms are limited. In this work, we propose the Perception-oriented Bidirectional Attention Network (PBAN) for image SR FR-IQA, which is composed of three modules: an image encoder module, a perception-oriented bidirectional attention (PBA) module, and a quality prediction module. First, we encode the input images for feature representations. Inspired by the characteristics of the human visual system, we then construct the perception-oriented PBA module. Specifically, different from existing attention-based SR IQA methods, we conceive a Bidirectional Attention to bidirectionally construct visual attention to distortion, which is consistent with the generation and evaluation processes of SR images. To further guide the quality assessment towards the perception of distorted information, we propose Grouped Multi-scale Deformable Convolution, enabling the proposed method to adaptively perceive distortion. Moreover, we design Sub-information Excitation Convolution to direct visual perception to both sub-pixel and sub-channel attention. Finally, the quality prediction module is exploited to integrate quality-aware features and regress quality scores. Extensive experiments demonstrate that our proposed PBAN outperforms state-of-the-art quality assessment methods.

[172] Augmented Structure Preserving Neural Networks for cell biomechanics

Juan Olalla-Pombo, Alberto Badías, Miguel Ángel Sanz-Gómez, José María Benítez, Francisco Javier Montáns

Main category: cs.CV

TL;DR: A new hybrid machine learning approach combining Structure Preserving Neural Networks and Artificial Neural Networks to predict cell migration trajectories and mitosis events with high accuracy.

DetailsMotivation: Cell biomechanics involve complex phenomena crucial to life processes, but many interactions and collective cell decisions remain unclear despite increasing research.

Method: Combines Structure Preserving Neural Networks (for mechanical cell movements) with Artificial Neural Networks (for environmental factors from Computer Vision), plus a separate mitosis prediction model using Neural Networks.
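
The roll-out policy amounts to feeding each prediction back in as input for the next step, as in this minimal sketch; the window length and the toy constant-velocity model are illustrative stand-ins for the trained hybrid predictor.

```python
import numpy as np

# Roll-out sketch: predict one step, append the prediction to the history,
# and repeat. `model` stands in for the trained hybrid SPNN + ANN predictor.

def rollout(model, history, n_steps):
    history = list(history)                    # past cell positions, (x, y)
    trajectory = []
    for _ in range(n_steps):
        nxt = model(np.asarray(history[-5:]))  # predict from a short window
        trajectory.append(nxt)
        history.append(nxt)                    # feed the prediction back in
    return np.asarray(trajectory)

# toy stand-in model: constant-velocity extrapolation
toy = lambda window: window[-1] + (window[-1] - window[-2])
path = rollout(toy, [np.zeros(2), np.ones(2)], n_steps=10)
```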

Result: The model predicts complete cell trajectories with high accuracy on both simulated and real cell migration cases using a roll-out policy.

Conclusion: The hybrid approach successfully integrates mechanical and environmental factors to accurately predict cell migration patterns and mitosis events.

Abstract: Cell biomechanics involve a great number of complex phenomena that are fundamental to the evolution of life itself and other associated processes, ranging from the very early stages of embryogenesis to the maintenance of damaged structures or the growth of tumors. Given the importance of such phenomena, increasing research has been dedicated to their understanding, but the many interactions between them and their influence on the decisions of cells as a collective network or cluster remain unclear. We present a new approach that combines Structure Preserving Neural Networks, which study cell movements as a purely mechanical system, with other Machine Learning tools (Artificial Neural Networks), which allow taking into consideration environmental factors that can be directly deduced from an experiment with Computer Vision techniques. This new model, tested on simulated and real cell migration cases, predicts complete cell trajectories following a roll-out policy with a high level of accuracy. This work also includes a mitosis event prediction model based on Neural Network architectures that makes use of the same observed features.

[173] Intraoperative 2D/3D Registration via Spherical Similarity Learning and Inference-Time Differentiable Levenberg-Marquardt Optimization

Minheng Chen, Youyong Kong

Main category: cs.CV

TL;DR: A novel 2D/3D registration method using spherical feature spaces and Riemannian distances in SO(4) space for better manifold structure preservation and faster convergence.

DetailsMotivation: Existing Euclidean approximations distort manifold structure and slow convergence in intraoperative 2D/3D registration, limiting accuracy and efficiency.

Method: Extract feature embeddings using CNN-Transformer encoder, project into spherical space, approximate geodesic distances with Riemannian distances in bi-invariant SO(4) space, and use differentiable Levenberg-Marquardt optimization.
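
The underlying geodesic idea can be shown with a plain unit-sphere distance, as sketched below; the paper's bi-invariant SO(4) construction is richer, so this is only the simplest spherical analogue.

```python
import torch
import torch.nn.functional as F

# Spherical similarity sketch: project embeddings onto the unit sphere and
# use the arc length (arccos of cosine similarity) as the distance. This
# shows only the geodesic-on-a-sphere idea, not the SO(4) construction.

def spherical_distance(a, b, eps=1e-7):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    cos = (a * b).sum(dim=-1).clamp(-1 + eps, 1 - eps)
    return torch.acos(cos)                    # geodesic distance in radians

d = spherical_distance(torch.randn(4, 256), torch.randn(4, 256))
```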

Result: Superior accuracy achieved on both real and synthetic datasets in patient-specific and patient-agnostic scenarios.

Conclusion: Spherical feature spaces with Riemannian geometry provide more expressive and geometrically consistent similarity metrics, enhancing registration performance and convergence speed.

Abstract: Intraoperative 2D/3D registration aligns preoperative 3D volumes with real-time 2D radiographs, enabling accurate localization of instruments and implants. A recent fully differentiable similarity learning framework approximates geodesic distances on SE(3), expanding the capture range of registration and mitigating the effects of substantial disturbances, but existing Euclidean approximations distort manifold structure and slow convergence. To address these limitations, we explore similarity learning in non-Euclidean spherical feature spaces to better capture and fit complex manifold structure. We extract feature embeddings using a CNN-Transformer encoder, project them into spherical space, and approximate their geodesic distances with Riemannian distances in the bi-invariant SO(4) space. This enables a more expressive and geometrically consistent deep similarity metric, enhancing the ability to distinguish subtle pose differences. During inference, we replace gradient descent with fully differentiable Levenberg-Marquardt optimization to accelerate convergence. Experiments on real and synthetic datasets show superior accuracy in both patient-specific and patient-agnostic scenarios.

[174] Advanced Brain Tumor Segmentation Using EMCAD: Efficient Multi-scale Convolutional Attention Decoding

GodsGift Uzor, Tania-Amanda Nkoyo Fredrick Eneye, Chukwuebuka Ijezue

Main category: cs.CV

TL;DR: EMCAD - an efficient multi-scale convolutional attention decoder for brain tumor segmentation that balances performance and computational efficiency on MRI data.

DetailsMotivation: Brain tumor segmentation is critical but existing decoding mechanisms have high computational costs, especially problematic in resource-constrained scenarios.

Method: Developed EMCAD, an efficient multi-scale convolutional attention decoder, and tested it on the BraTS2020 dataset of MRI scans from 369 brain tumor patients.
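
For reference, the Dice Similarity Coefficient quoted in the results can be computed as follows (a standard definition, not code from the paper):

```python
import torch

# Dice similarity coefficient between a binarized prediction and the
# ground-truth mask; the epsilon guards against empty masks.

def dice_score(pred, target, eps=1e-6):
    pred, target = pred.float().flatten(), target.float().flatten()
    intersection = (pred * target).sum()
    return (2 * intersection + eps) / (pred.sum() + target.sum() + eps)

print(dice_score(torch.ones(10), torch.ones(10)))   # perfect overlap -> 1.0
```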

Result: Achieved best Dice score of 0.31 with stable mean Dice score of 0.285 ± 0.015, showing consistent performance without overfitting on validation set.

Conclusion: EMCAD provides a computationally efficient solution for brain tumor segmentation with moderate but stable performance, suitable for resource-constrained environments.

Abstract: Brain tumor segmentation is a critical pre-processing step in the medical image analysis pipeline that involves precise delineation of tumor regions from healthy brain tissue in medical imaging data, particularly MRI scans. An efficient and effective decoding mechanism is crucial in brain tumor segmentation, especially in scenarios with limited computational resources. However, such decoding mechanisms usually come with high computational costs. To address this concern, EMCAD, a new efficient multi-scale convolutional attention decoder, was utilized to optimize both performance and computational efficiency for brain tumor segmentation on the BraTS2020 dataset, which consists of MRI scans from 369 brain tumor patients. The preliminary model achieved a best Dice score of 0.31 and maintained a stable mean Dice score of 0.285 ± 0.015 throughout the training process, a moderate result. The model maintained consistent performance across the validation set without showing signs of overfitting.

[175] FAVAE-Effective Frequency Aware Latent Tokenizer

Tejaswini Medi, Hsien-Yi Wang, Arianna Rampini, Margret Keuper

Main category: cs.CV

TL;DR: The paper identifies that current latent tokenizers in generative models prioritize low-frequency reconstruction at the expense of high-frequency details, leading to over-smoothed images. It proposes a wavelet-based frequency-aware VAE that decouples optimization of low and high frequencies to improve texture reconstruction while preserving global structure.

DetailsMotivation: Current latent generative models suffer from poor reconstruction of fine textures and high-frequency details due to conventional objectives that inherently prioritize low-frequency information, resulting in visual artifacts and diminished perceptual quality in reconstructed images.

Method: The authors propose a wavelet-based frequency-aware variational autoencoder (FA-VAE) framework that explicitly decouples the optimization of low- and high-frequency components using frequency decomposition analysis.
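
A minimal sketch of the frequency-decoupling idea using a single-level 2D Haar transform from PyWavelets: the reconstruction error is split into a low-frequency (approximation) term and a high-frequency (detail) term that can be weighted separately. The wavelet choice and the weighting are assumptions.

```python
import numpy as np
import pywt

# Frequency-decoupled reconstruction losses via a one-level 2-D DWT.
# pywt.dwt2 returns (approximation, (horizontal, vertical, diagonal)).

def frequency_losses(x, x_hat, wavelet="haar"):
    lo_x, hi_x = pywt.dwt2(x, wavelet)
    lo_r, hi_r = pywt.dwt2(x_hat, wavelet)
    low_loss = np.mean((lo_x - lo_r) ** 2)
    high_loss = np.mean([np.mean((a - b) ** 2) for a, b in zip(hi_x, hi_r)])
    return low_loss, high_loss

low, high = frequency_losses(np.random.rand(64, 64), np.random.rand(64, 64))
total = low + 2.0 * high                       # e.g., upweight high frequencies
```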

Result: The approach enables improved reconstruction of fine textures while preserving global structure, bridging the fidelity gap in current latent tokenizers and demonstrating better perceptual quality.

Conclusion: Frequency-aware optimization is crucial for realistic image representation, and the proposed FA-VAE framework offers significant improvements for applications in content creation, neural rendering, and medical imaging.

Abstract: Latent generative models have shown remarkable progress in high-fidelity image synthesis, typically using a two-stage training process that involves compressing images into latent embeddings via learned tokenizers in the first stage. The quality of generation strongly depends on how expressive and well-optimized these latent embeddings are. While various methods have been proposed to learn effective latent representations, the reconstructed images often lack realism, particularly in textured regions with sharp transitions, due to loss of fine details governed by high frequencies. We conduct a detailed frequency decomposition of existing state-of-the-art (SOTA) latent tokenizers and show that conventional objectives inherently prioritize low-frequency reconstruction, often at the expense of high-frequency fidelity. Our analysis reveals that these latent tokenizers exhibit a bias toward low-frequency information when jointly optimized, leading to over-smoothed outputs and visual artifacts that diminish perceptual quality. To address this, we propose a wavelet-based, frequency-aware variational autoencoder (FA-VAE) framework that explicitly decouples the optimization of low- and high-frequency components. This decoupling enables improved reconstruction of fine textures while preserving global structure. Our approach bridges the fidelity gap in current latent tokenizers and emphasizes the importance of frequency-aware optimization for realistic image representation, with broader implications for applications in content creation, neural rendering, and medical imaging.

[176] Dynamic Sensitivity Filter Pruning using Multi-Agent Reinforcement Learning For DCNN’s

Iftekhar Haider Chowdhury, Zaed Ikbal Syed, Ahmed Faizul Haque Dhrubo, Mohammad Abdul Qayum

Main category: cs.CV

TL;DR: Differential Sensitivity Fusion Pruning is a novel single-shot filter pruning method that fuses multiple importance metrics to identify and remove redundant filters, achieving significant model compression (80% FLOPs reduction) while maintaining high accuracy.

DetailsMotivation: Deep Convolutional Neural Networks face practical deployment limitations due to computational and memory overhead, requiring efficient compression methods for edge and mobile platforms.

Method: Computes differential sensitivity scores by fusing discrepancies among gradient-based sensitivity, first-order Taylor expansion, and KL divergence of activation distributions, using exponential scaling to identify unstable filters.
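
A hedged sketch of the fusion step: normalize the three per-filter scores, sum their pairwise discrepancies, and apply exponential scaling so that filters with inconsistent importance rank first for pruning. The normalization and scaling constants are assumptions.

```python
import numpy as np

# Score-fusion sketch: filters whose importance disagrees across the three
# criteria get a large fused score and are pruned first. The z-score
# normalization and alpha are illustrative choices, not the paper's exact ones.

def differential_sensitivity(grad_sens, taylor, kl_div, alpha=1.0):
    scores = np.stack([grad_sens, taylor, kl_div])          # (3, n_filters)
    scores = (scores - scores.mean(1, keepdims=True)) / (
        scores.std(1, keepdims=True) + 1e-9)
    discrepancy = (np.abs(scores[0] - scores[1])
                   + np.abs(scores[0] - scores[2])
                   + np.abs(scores[1] - scores[2]))
    return np.exp(alpha * discrepancy)          # large => unstable, prune first

n = 64
fused = differential_sensitivity(np.random.rand(n), np.random.rand(n),
                                 np.random.rand(n))
prune_idx = np.argsort(fused)[-int(0.7 * n):]   # e.g., 70 percent pruning rate
```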

Result: Achieves over 80% FLOPs reduction while maintaining up to 98.23% of baseline accuracy at 70% pruning rate, outperforming traditional heuristics in compression and generalization.

Conclusion: Provides an efficient, deterministic single-shot pruning solution that enables scalable and adaptive DCNN compression for practical deployment on resource-constrained devices.

Abstract: Deep Convolutional Neural Networks have achieved state of the art performance across various computer vision tasks; however, their practical deployment is limited by computational and memory overhead. This paper introduces Differential Sensitivity Fusion Pruning, a novel single shot filter pruning framework that focuses on evaluating the stability and redundancy of filter importance scores across multiple criteria. Differential Sensitivity Fusion Pruning computes a differential sensitivity score for each filter by fusing the discrepancies among gradient based sensitivity, first order Taylor expansion, and KL divergence of activation distributions. An exponential scaling mechanism is applied to emphasize filters with inconsistent importance across metrics, identifying candidates that are structurally unstable or less critical to model performance. Unlike iterative or reinforcement learning based pruning strategies, Differential Sensitivity Fusion Pruning is efficient and deterministic, requiring only a single forward-backward pass for scoring and pruning. Extensive experiments across pruning rates between 50 and 70 percent demonstrate that Differential Sensitivity Fusion Pruning significantly reduces model complexity, achieving a reduction of over 80 percent in floating-point operations (FLOPs) while maintaining high accuracy. For instance, at 70 percent pruning, our approach retains up to 98.23 percent of baseline accuracy, surpassing traditional heuristics in both compression and generalization. The proposed method presents an effective solution for scalable and adaptive Deep Convolutional Neural Network compression, paving the way for efficient deployment on edge and mobile platforms.

[177] Veriserum: A dual-plane fluoroscopic dataset with knee implant phantoms for deep learning in medical imaging

Jinhao Wang, Florian Vogl, Pascal Schütz, Saša Ćuković, William R. Taylor

Main category: cs.CV

TL;DR: Veriserum is an open-source dataset of 110,000 dual-plane X-ray knee implant images with ground-truth poses for deep learning registration training and benchmarking.

DetailsMotivation: To provide a reproducible benchmark for developing and evaluating computer vision algorithms in medical imaging, particularly for 2D/3D registration, segmentation, and 3D reconstruction applications.

Method: Created a dataset comprising approximately 110,000 X-ray images from 10 knee implant combinations captured during 1,600 trials, with automatically registered ground-truth poses and 200 manually registered poses for benchmarking.

Result: The dataset includes dual-plane images, calibration tools, and covers poses from daily activities like level gait and ramp descent, making it suitable for various medical imaging applications.

Conclusion: Veriserum is a freely accessible, comprehensive dataset that aims to advance computer vision and medical imaging research by providing standardized data for algorithm development and evaluation.

Abstract: Veriserum is an open-source dataset designed to support the training of deep learning registration for dual-plane fluoroscopic analysis. It comprises approximately 110,000 X-ray images of 10 knee implant pair combinations (2 femur and 5 tibia implants) captured during 1,600 trials, incorporating poses associated with daily activities such as level gait and ramp descent. Each image is annotated with an automatically registered ground-truth pose, while 200 images include manually registered poses for benchmarking. Key features of Veriserum include dual-plane images and calibration tools. The dataset aims to support the development of applications such as 2D/3D image registration, image segmentation, X-ray distortion correction, and 3D reconstruction. Freely accessible, Veriserum aims to advance computer vision and medical imaging research by providing a reproducible benchmark for algorithm development and evaluation. The Veriserum dataset used in this study is publicly available via https://movement.ethz.ch/data-repository/veriserum.html, with the data stored at ETH Zürich Research Collections: https://doi.org/10.3929/ethz-b-000701146.

[178] An Analysis of Layer-Freezing Strategies for Enhanced Transfer Learning in YOLO Architectures

Andrzej D. Dobrzycki, Ana M. Bernardos, José R. Casar

Main category: cs.CV

TL;DR: Analysis of layer-freezing strategies for YOLOv8 and YOLOv10 shows no universal optimal approach - effectiveness depends on dataset characteristics, with backbone freezing preserving general features and shallower freezing handling class imbalance better, achieving up to 28% GPU memory reduction while maintaining or surpassing full fine-tuning performance.

DetailsMotivation: Deploying YOLO architectures in resource-constrained environments like UAVs requires efficient transfer learning, but the specific impact of various freezing configurations on contemporary YOLOv8 and YOLOv10 architectures remains unexplored, particularly regarding freezing depth, dataset characteristics, and training dynamics.

Method: Systematic investigation of multiple freezing configurations across YOLOv8 and YOLOv10 variants using four challenging infrastructure monitoring datasets, integrated with gradient behavior analysis (L2 norm) and visual explanations (Grad-CAM) to provide insights into training dynamics.
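
The mechanism behind backbone freezing is simply disabling gradients for a parameter-name prefix, as in this generic PyTorch sketch; the `backbone.` prefix and the toy model are assumptions, and YOLO training frameworks typically expose the same effect through their own freeze options.

```python
import torch.nn as nn

# Generic layer-freezing sketch: disable gradients for every parameter whose
# name matches a given prefix, leaving the rest trainable.

def freeze_layers(model: nn.Module, frozen_prefixes=("backbone.",)):
    for name, param in model.named_parameters():
        param.requires_grad = not name.startswith(frozen_prefixes)
    frozen = sum(1 for p in model.parameters() if not p.requires_grad)
    print(f"frozen {frozen} parameter tensors")

class Toy(nn.Module):                          # stand-in for a detector
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 8, 3)
        self.head = nn.Conv2d(8, 1, 1)

freeze_layers(Toy())                           # freezes only backbone.* params
```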

Result: No universal optimal freezing strategy exists - effectiveness depends on data properties. Backbone freezing preserves general-purpose features, while shallower freezing handles extreme class imbalance better. Configurations reduce GPU memory by up to 28% and sometimes achieve mAP@50 scores surpassing full fine-tuning.

Conclusion: This work provides empirical findings and practical guidelines for selecting freezing strategies, offering an evidence-based approach to balanced transfer learning for object detection in resource-constrained scenarios, with gradient analysis showing distinct convergence patterns for moderately frozen models.

Abstract: The You Only Look Once (YOLO) architecture is crucial for real-time object detection. However, deploying it in resource-constrained environments such as unmanned aerial vehicles (UAVs) requires efficient transfer learning. Although layer freezing is a common technique, the specific impact of various freezing configurations on contemporary YOLOv8 and YOLOv10 architectures remains unexplored, particularly with regard to the interplay between freezing depth, dataset characteristics, and training dynamics. This research addresses this gap by presenting a detailed analysis of layer-freezing strategies. We systematically investigate multiple freezing configurations across YOLOv8 and YOLOv10 variants using four challenging datasets that represent critical infrastructure monitoring. Our methodology integrates a gradient behavior analysis (L2 norm) and visual explanations (Grad-CAM) to provide deeper insights into training dynamics under different freezing strategies. Our results reveal that there is no universal optimal freezing strategy but, rather, one that depends on the properties of the data. For example, freezing the backbone is effective for preserving general-purpose features, while a shallower freeze is better suited to handling extreme class imbalance. These configurations reduce graphics processing unit (GPU) memory consumption by up to 28% compared to full fine-tuning and, in some cases, achieve mean average precision (mAP@50) scores that surpass those of full fine-tuning. Gradient analysis corroborates these findings, showing distinct convergence patterns for moderately frozen models. Ultimately, this work provides empirical findings and practical guidelines for selecting freezing strategies. It offers a practical, evidence-based approach to balanced transfer learning for object detection in scenarios with limited resources.

[179] Quaternion Approximation Networks for Enhanced Image Classification and Oriented Object Detection

Bryce Grant, Peng Wang

Main category: cs.CV

TL;DR: QUAN is a novel deep learning framework that uses quaternion algebra approximation for rotation equivariant tasks, achieving better accuracy with fewer parameters and faster convergence than existing methods.

DetailsMotivation: To create an efficient rotation-aware deep learning framework that preserves geometric properties while being computationally practical for resource-constrained systems like robotics.

Method: Approximates quaternion convolution through Hamilton product decomposition using real-valued operations, introduces Independent Quaternion Batch Normalization (IQBN), and extends quaternion operations to spatial attention mechanisms with custom CUDA kernels.
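
The Hamilton product that the decomposition builds on can be written with purely real multiplies and adds, as below, which is what lets quaternion convolution be assembled from four real-valued convolutions; the batched (..., 4) layout is an assumed convention.

```python
import torch

# Hamilton product of quaternions (r, i, j, k) using only real-valued
# operations, for batched tensors of shape (..., 4).

def hamilton_product(q, p):
    r1, i1, j1, k1 = q.unbind(-1)
    r2, i2, j2, k2 = p.unbind(-1)
    return torch.stack([
        r1 * r2 - i1 * i2 - j1 * j2 - k1 * k2,
        r1 * i2 + i1 * r2 + j1 * k2 - k1 * j2,
        r1 * j2 - i1 * k2 + j1 * r2 + k1 * i2,
        r1 * k2 + i1 * j2 - j1 * i2 + k1 * r2,
    ], dim=-1)

out = hamilton_product(torch.randn(8, 4), torch.randn(8, 4))
```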

Result: Achieves higher accuracy with fewer parameters and faster convergence in classification tasks (CIFAR-10/100, ImageNet), improved parameter efficiency and rotation handling in object detection (COCO, DOTA), and establishes SOTA for quaternion CNNs in detection tasks.

Conclusion: QUAN demonstrates strong potential for deployment in resource-constrained robotic systems requiring rotation-aware perception and shows promise for applications in other domains.

Abstract: This paper introduces Quaternion Approximate Networks (QUAN), a novel deep learning framework that leverages quaternion algebra for rotation equivariant image classification and object detection. Unlike conventional quaternion neural networks attempting to operate entirely in the quaternion domain, QUAN approximates quaternion convolution through Hamilton product decomposition using real-valued operations. This approach preserves geometric properties while enabling efficient implementation with custom CUDA kernels. We introduce Independent Quaternion Batch Normalization (IQBN) for training stability and extend quaternion operations to spatial attention mechanisms. QUAN is evaluated on image classification (CIFAR-10/100, ImageNet), object detection (COCO, DOTA), and robotic perception tasks. In classification tasks, QUAN achieves higher accuracy with fewer parameters and faster convergence compared to existing convolution and quaternion-based models. For object detection, QUAN demonstrates improved parameter efficiency and rotation handling over standard Convolutional Neural Networks (CNNs) while establishing the SOTA for quaternion CNNs in this downstream task. These results highlight its potential for deployment in resource-constrained robotic systems requiring rotation-aware perception and application in other domains.

[180] OpenEgo: A Large-Scale Multimodal Egocentric Dataset for Dexterous Manipulation

Ahad Jawaid, Yu Xiang

Main category: cs.CV

TL;DR: OpenEgo is a large-scale multimodal egocentric manipulation dataset with standardized hand-pose annotations and intention-aligned action primitives, totaling 1107 hours across 6 public datasets covering 290 manipulation tasks.

DetailsMotivation: Existing egocentric video corpora lack either fine-grained, temporally localized action descriptions or dexterous hand annotations, limiting their utility for imitation learning of dexterous manipulation tasks.

Method: The authors introduce OpenEgo, which unifies hand-pose layouts across multiple datasets and provides descriptive, timestamped action primitives. They validate the dataset by training language-conditioned imitation-learning policies to predict dexterous hand trajectories.

Result: OpenEgo totals 1107 hours across six public datasets, covering 290 manipulation tasks in 600+ environments with standardized annotations.

Conclusion: OpenEgo is designed to lower the barrier to learning dexterous manipulation from egocentric video and support reproducible research in vision-language-action learning.

Abstract: Egocentric human videos provide scalable demonstrations for imitation learning, but existing corpora often lack either fine-grained, temporally localized action descriptions or dexterous hand annotations. We introduce OpenEgo, a multimodal egocentric manipulation dataset with standardized hand-pose annotations and intention-aligned action primitives. OpenEgo totals 1107 hours across six public datasets, covering 290 manipulation tasks in 600+ environments. We unify hand-pose layouts and provide descriptive, timestamped action primitives. To validate its utility, we train language-conditioned imitation-learning policies to predict dexterous hand trajectories. OpenEgo is designed to lower the barrier to learning dexterous manipulation from egocentric video and to support reproducible research in vision-language-action learning. All resources and instructions will be released at www.openegocentric.com.

[181] Visibility-Aware Language Aggregation for Open-Vocabulary Segmentation in 3D Gaussian Splatting

Sen Wang, Kunyi Li, Siyun Liang, Elena Alegret, Jing Ma, Nassir Navab, Stefano Gasperini

Main category: cs.CV

TL;DR: VALA is a visibility-aware language aggregation method that addresses background noise and multi-view inconsistencies in distilling 2D image language features into 3D Gaussians, improving open-vocabulary 3D scene understanding.

DetailsMotivation: Existing methods for distilling open-vocabulary language features from 2D images into 3D Gaussians suffer from two issues: background Gaussians getting the same features as foreground ones despite negligible contributions, and multi-view inconsistencies due to view-specific noise in language embeddings.

Method: Introduces Visibility-Aware Language Aggregation (VALA) which computes marginal contributions for each ray and applies a visibility-aware gate to retain only visible Gaussians. Also proposes a streaming weighted geometric median in cosine space to merge noisy multi-view features.

Result: The method yields a robust, view-consistent language feature embedding in a fast and memory-efficient manner. VALA improves open-vocabulary localization and segmentation across reference datasets.

Conclusion: VALA consistently surpasses existing works by effectively addressing background noise and multi-view inconsistency issues in language feature distillation for 3D Gaussians.

Abstract: Recently, distilling open-vocabulary language features from 2D images into 3D Gaussians has attracted significant attention. Although existing methods achieve impressive language-based interactions of 3D scenes, we observe two fundamental issues: background Gaussians contributing negligibly to a rendered pixel get the same feature as the dominant foreground ones, and multi-view inconsistencies due to view-specific noise in language embeddings. We introduce Visibility-Aware Language Aggregation (VALA), a lightweight yet effective method that computes marginal contributions for each ray and applies a visibility-aware gate to retain only visible Gaussians. Moreover, we propose a streaming weighted geometric median in cosine space to merge noisy multi-view features. Our method yields a robust, view-consistent language feature embedding in a fast and memory-efficient manner. VALA improves open-vocabulary localization and segmentation across reference datasets, consistently surpassing existing works.
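
The weighted geometric median can be approximated with Weiszfeld iterations; below is a minimal batch sketch in cosine space (the paper's streaming update and visibility gating are simplified here to a per-view weight vector):

```python
import torch
import torch.nn.functional as F

def weighted_geometric_median(feats, weights, iters=10, eps=1e-8):
    """Weiszfeld-style weighted geometric median of per-view features.

    feats: (V, D) language features from V views; weights: (V,) visibility
    weights. Features are L2-normalized so distances act in cosine space.
    """
    x = F.normalize(feats, dim=-1)
    # Initialize from the weighted mean, then iteratively reweight by
    # inverse distance, which pulls the estimate toward the robust median.
    m = F.normalize((weights[:, None] * x).sum(0), dim=-1)
    for _ in range(iters):
        d = (x - m).norm(dim=-1).clamp_min(eps)  # distances to current estimate
        w = weights / d                          # Weiszfeld reweighting
        m = F.normalize((w[:, None] * x).sum(0) / w.sum(), dim=-1)
    return m
```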

[182] DuoCLR: Dual-Surrogate Contrastive Learning for Skeleton-based Human Action Segmentation

Haitao Tian, Pierre Payeur

Main category: cs.CV

TL;DR: Proposes DuoCLR, a contrastive learning framework using ‘Shuffle and Warp’ augmentation with two surrogate tasks (CPC and ROR) for improved human action segmentation from skeleton data.

DetailsMotivation: Previous representation learning works focus on action recognition with isolated sequence-wise representations, lacking multi-scale representations and cross-sequence variations needed for action segmentation.

Method: Uses ‘Shuffle and Warp’ data augmentation to create multi-action permutations. Introduces two contrastive learning tasks: Cross Permutation Contrasting (CPC) for intra-class similarities and Relative Order Reasoning (ROR) for inter-class contexts. Pre-trains on trimmed skeleton sequences.

Result: Significant performance boost over state-of-the-art methods in both multi-class and multi-label action segmentation tasks when evaluated on untrimmed datasets.

Conclusion: The proposed DuoCLR framework effectively learns multi-scale feature representations optimized for action segmentation through novel augmentation and dual surrogate contrastive learning tasks.

Abstract: In this paper, a contrastive representation learning framework is proposed to enhance human action segmentation via pre-training using trimmed (single action) skeleton sequences. Unlike previous representation learning works that are tailored for action recognition and that build upon isolated sequence-wise representations, the proposed framework focuses on exploiting multi-scale representations in conjunction with cross-sequence variations. More specifically, it proposes a novel data augmentation strategy, ‘Shuffle and Warp’, which exploits diverse multi-action permutations. The latter effectively assists two surrogate tasks that are introduced in contrastive learning: Cross Permutation Contrasting (CPC) and Relative Order Reasoning (ROR). In optimization, CPC learns intra-class similarities by contrasting representations of the same action class across different permutations, while ROR reasons about inter-class contexts by predicting relative mapping between two permutations. Together, these tasks enable a Dual-Surrogate Contrastive Learning (DuoCLR) network to learn multi-scale feature representations optimized for action segmentation. In experiments, DuoCLR is pre-trained on a trimmed skeleton dataset and evaluated on an untrimmed dataset where it demonstrates a significant boost over state-of-the-art comparatives in both multi-class and multi-label action segmentation tasks. Lastly, ablation studies are conducted to evaluate the effectiveness of each component of the proposed approach.
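
A plausible minimal form of the ‘Shuffle and Warp’ augmentation, assuming trimmed clips stored as (T, J, C) skeleton arrays; the warp-rate range and linear interpolation are illustrative assumptions:

```python
import numpy as np

def shuffle_and_warp(segments, rng=None):
    """Build a pseudo-untrimmed multi-action sequence from trimmed clips.

    segments: list of (T_i, J, C) single-action skeleton clips. The action
    order is shuffled and each clip is temporally warped by a random rate,
    producing the multi-action permutations the CPC/ROR tasks contrast.
    """
    rng = rng or np.random.default_rng()
    order = rng.permutation(len(segments))
    warped = []
    for idx in order:
        clip = segments[idx]
        rate = rng.uniform(0.5, 1.5)                 # random time warp factor
        t_new = max(2, int(round(len(clip) * rate)))
        src = np.linspace(0, len(clip) - 1, t_new)   # fractional source frames
        lo = np.floor(src).astype(int)
        hi = np.minimum(lo + 1, len(clip) - 1)
        frac = (src - lo)[:, None, None]
        warped.append((1 - frac) * clip[lo] + frac * clip[hi])  # linear interp
    return np.concatenate(warped, axis=0), order
```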

[183] RED: Robust Event-Guided Motion Deblurring with Modality-Specific Disentangled Representation

Yihong Leng, Siming Zheng, Jinwei Chen, Bo Li, Jiaojiao Li, Peng-Tao Jiang

Main category: cs.CV

TL;DR: RED network uses robust event-guided deblurring with modality-specific disentangled representation to handle incomplete event streams and improve motion deblurring performance.

DetailsMotivation: Existing event-based deblurring methods overlook the inherent incompleteness of event streams caused by DVS thresholding mechanisms, which compromises motion priors and limits effectiveness.

Method: Proposes Robust Event-guided Deblurring (RED) network with Robustness-Oriented Perturbation Strategy (RPS) using random masking, disentangled OmniAttention for modeling correlations, and interactive modules for feature enhancement.

Result: Extensive experiments on synthetic and real-world datasets show RED achieves state-of-the-art performance in both accuracy and robustness.

Conclusion: The proposed RED network effectively addresses event stream incompleteness and demonstrates superior deblurring performance across various scenarios.

Abstract: Event cameras provide sparse yet high-temporal-resolution motion information, demonstrating great potential for motion deblurring. Existing methods focus on cross-modal interaction, overlooking the inherent incompleteness of event streams, which arises from the trade-off between sensitivity and noise introduced by the thresholding mechanism of Dynamic Vision Sensors (DVS). Such degradation compromises the integrity of motion priors and limits the effectiveness of event-guided deblurring. To tackle these challenges, we propose a Robust Event-guided Deblurring (RED) network with modality-specific disentangled representation. First, we introduce a Robustness-Oriented Perturbation Strategy (RPS) that applies random masking to events, which exposes RED to incomplete patterns and thereby fosters robustness against various unknown scenario conditions. Next, a disentangled OmniAttention is presented to explicitly model intra-motion, inter-motion, and cross-modality correlations from two inherently distinct but complementary sources: blurry images and partially disrupted events. Building on these reliable features, two interactive modules are designed to enhance motion-sensitive areas in blurry images and inject semantic context into incomplete event representations. Extensive experiments on synthetic and real-world datasets demonstrate that RED consistently achieves state-of-the-art performance in both accuracy and robustness.
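
At its core, the Robustness-Oriented Perturbation Strategy amounts to randomly masking event data during training; a minimal sketch (the voxel layout and mask-ratio range are assumptions):

```python
import torch

def perturb_events(event_voxels, mask_ratio_range=(0.1, 0.4)):
    """Randomly zero out a fraction of event voxels per sample.

    event_voxels: hypothetical (B, T, H, W) event voxel grid. Each sample
    gets its own mask ratio, so the network sees varying degrees of
    incompleteness, mimicking DVS threshold-induced event loss.
    """
    out = event_voxels.clone()
    for i in range(out.shape[0]):
        ratio = torch.empty(1).uniform_(*mask_ratio_range).item()
        mask = torch.rand_like(out[i]) < ratio  # Bernoulli drop mask
        out[i][mask] = 0.0
    return out
```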

[184] Sensitivity-Aware Post-Training Quantization for Deep Neural Networks

Zekang Zheng, Haokun Li, Yaofo Chen, Mingkui Tan, Qing Du

Main category: cs.CV

TL;DR: Efficient post-training quantization method using parameter sensitivity analysis to prioritize quantization of high-sensitivity parameters while using unquantized low-sensitivity parameters to compensate for errors, achieving 20-200x speedup with minimal accuracy loss.

DetailsMotivation: Existing PTQ methods require iterative parameter updates that incur high computational complexity and resource overhead, limiting their applicability in resource-constrained edge computing and real-time inference scenarios.

Method: Proposes parameter sensitivity analysis to guide quantization, prioritizing high-sensitivity parameters and using unquantized low-sensitivity parameters for error compensation. Introduces row-parallel quantization framework with globally shared inverse Hessian matrix update mechanism.

Result: Achieves 20-200x quantization speedup over Optimal Brain Quantization baseline on ResNet-50 and YOLOv5s, with mean accuracy loss below 0.3%.

Conclusion: The method effectively balances efficiency and accuracy in model quantization, making it suitable for resource-constrained environments while maintaining high compression ratios.

Abstract: Model quantization reduces neural network parameter precision to achieve compression, but often compromises accuracy. Existing post-training quantization (PTQ) methods employ iterative parameter updates to preserve accuracy under high compression ratios, incurring significant computational complexity and resource overhead, which limits applicability in resource-constrained edge computing and real-time inference scenarios. This paper proposes an efficient PTQ method guided by parameter sensitivity analysis. The approach prioritizes quantization of high-sensitivity parameters, leveraging unquantized low-sensitivity parameters to compensate for quantization errors, thereby mitigating accuracy degradation. Furthermore, by exploiting column-wise clustering of parameter sensitivity, the method introduces a row-parallel quantization framework with a globally shared inverse Hessian matrix update mechanism, reducing computational complexity by an order of magnitude. Experimental results on ResNet-50 and YOLOv5s demonstrate a 20-200-fold quantization speedup over the Optimal Brain Quantization baseline, with mean accuracy loss below 0.3%, confirming the method’s efficacy in balancing efficiency and accuracy.
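
A toy sketch of the sensitivity-guided quantize-and-compensate idea for a single weight row, using the common diagonal-Hessian sensitivity proxy h_ii * w_i^2; the compensation rule below is a simplified stand-in for the paper's Hessian-based update, not its actual algorithm:

```python
import numpy as np

def sensitivity_ptq_row(w, h_diag, scale, frac_quantized=0.7):
    """Quantize the most sensitive weights first, then absorb the
    accumulated (Hessian-weighted) error into the remaining unquantized,
    low-sensitivity weights.

    w: (N,) weight row; h_diag: (N,) diagonal Hessian estimate;
    scale: uniform quantization step.
    """
    w = w.astype(np.float64).copy()
    sens = h_diag * w**2                      # per-weight sensitivity proxy
    order = np.argsort(-sens)                 # high sensitivity first
    k = int(frac_quantized * len(w))
    q_idx, free_idx = order[:k], order[k:]
    err = 0.0
    for i in q_idx:
        q = scale * np.round(w[i] / scale)    # uniform quantizer
        err += (w[i] - q) * h_diag[i]         # Hessian-weighted error
        w[i] = q
    if len(free_idx) > 0:                     # spread compensation over the rest
        w[free_idx] -= err / (h_diag[free_idx].sum() + 1e-12)
    return w
```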

[185] Reconstruction and Reenactment Separated Method for Realistic Gaussian Head

Zhiling Ye, Cong Zhou, Xiubao Zhang, Haifeng Shen, Weihong Deng, Quan Lu

Main category: cs.CV

TL;DR: A single-image 3D Gaussian head framework that separates reconstruction and reenactment, achieving 90 FPS rendering at 512x512 resolution with state-of-the-art performance.

DetailsMotivation: To create controllable 3D avatars from just a single portrait image while maintaining high rendering efficiency and quality.

Method: Uses a reconstruction and reenactment separated framework with large-scale one-shot Gaussian head generator based on WebSSL, employing two-stage training for better generalization and texture reconstruction.

Result: Achieves 90 FPS at 512x512 resolution, follows scaling law for improved performance with larger reconstruction modules, and outperforms current state-of-the-art methods in both quantitative and qualitative evaluations.

Conclusion: The proposed framework successfully enables high-quality 3D avatar generation from single images with efficient real-time rendering capabilities while maintaining driving efficiency through separation design.

Abstract: In this paper, we explore a reconstruction and reenactment separated framework for 3D Gaussian heads, which requires only a single portrait image as input to generate a controllable avatar. Specifically, we developed a large-scale one-shot Gaussian head generator built upon WebSSL and employed a two-stage training approach that significantly enhances the capabilities of generalization and high-frequency texture reconstruction. During inference, an ultra-lightweight Gaussian avatar driven by control signals enables high frame-rate rendering, achieving 90 FPS at a resolution of 512x512. We further demonstrate that the proposed framework follows the scaling law, whereby increasing the parameter scale of the reconstruction module leads to improved performance. Moreover, thanks to the separation design, driving efficiency remains unaffected. Finally, extensive quantitative and qualitative experiments validate that our approach outperforms current state-of-the-art methods.

[186] MFFI: Multi-Dimensional Face Forgery Image Dataset for Real-World Scenarios

Changtao Miao, Yi Zhang, Man Luo, Weiwei Feng, Kaiyuan Zheng, Qi Chu, Tao Gong, Jianshu Li, Yunfeng Diao, Wei Zhou, Joey Tianyi Zhou, Xiaoshuai Hao

Main category: cs.CV

TL;DR: MFFI dataset addresses limitations in current Deepfake detection by providing diverse forgery methods, varied facial scenes, authentic data, and real-world degradation operations to improve detection in realistic scenarios.

DetailsMotivation: Current Deepfake detection methods are limited by existing datasets that lack diversity needed for real-world scenarios, particularly in unknown forgery techniques, facial scene variability, authentic data richness, and real-world degradation.

Method: Proposed Multi-dimensional Face Forgery Image (MFFI) dataset with four strategic dimensions: wider forgery methods (50 different techniques), varied facial scenes, diversified authentic data, and multi-level degradation operations, containing 1024K image samples.

Result: Benchmark evaluations show MFFI outperforms existing public datasets in scene complexity, cross-domain generalization capability, and detection difficulty gradients, validating its technical advancement and practical utility.

Conclusion: MFFI dataset successfully addresses the limitations of current Deepfake detection datasets and provides a more realistic and comprehensive resource for improving detection performance in real-world conditions.

Abstract: Rapid advances in Artificial Intelligence Generated Content (AIGC) have enabled increasingly sophisticated face forgeries, posing a significant threat to social security. However, current Deepfake detection methods are limited by constraints in existing datasets, which lack the diversity necessary in real-world scenarios. Specifically, these datasets fall short in four key areas: coverage of unknown advanced forgery techniques, variability of facial scenes, richness of authentic data, and degradation from real-world propagation. To address these challenges, we propose the Multi-dimensional Face Forgery Image (MFFI) dataset, tailored for real-world scenarios. MFFI enhances realism along four strategic dimensions: 1) Wider Forgery Methods; 2) Varied Facial Scenes; 3) Diversified Authentic Data; 4) Multi-level Degradation Operations. MFFI integrates 50 different forgery methods and contains 1024K image samples. Benchmark evaluations show that MFFI outperforms existing public datasets in terms of scene complexity, cross-domain generalization capability, and detection difficulty gradients. These results validate the technical advance and practical utility of MFFI in simulating real-world conditions. The dataset and additional details are publicly available at https://github.com/inclusionConf/MFFI.

[187] Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization

Jungin Park, Jiyoung Lee, Kwanghoon Sohn

Main category: cs.CV

TL;DR: VideoGraph proposes a language-guided spatiotemporal graph modeling approach for video summarization, using recursive graph networks to model object and frame relationships with semantic language queries.

DetailsMotivation: Previous video summarization methods focused on temporal modeling but ignored fine-grained visual entities like objects. Language-guided approaches require better semantic understanding of complex videos and how objects relate to each other.

Method: Proposes VideoGraph with recursive spatiotemporal graph networks that formulate objects and frames as nodes in spatial and temporal graphs. Uses language queries to incorporate semantic knowledge into node representations and prevent visual similarity-based edges. Adopts recursive strategy to refine graphs and classify keyframes.

Result: Achieves state-of-the-art performance on several benchmarks for both generic and query-focused video summarization in supervised and unsupervised settings.

Conclusion: The language-guided spatiotemporal graph modeling approach effectively captures semantic relationships between objects and frames, demonstrating superior performance in video summarization tasks.

Abstract: Video summarization aims to select keyframes that are visually diverse and can represent the whole story of a given video. Previous approaches have focused on global interlinkability between frames in a video by temporal modeling. However, fine-grained visual entities, such as objects, are also highly related to the main content of the video. Moreover, language-guided video summarization, which has recently been studied, requires a comprehensive linguistic understanding of complex real-world videos. To consider how all the objects are semantically related to each other, this paper regards video summarization as a language-guided spatiotemporal graph modeling problem. We present recursive spatiotemporal graph networks, called VideoGraph, which formulate the objects and frames as nodes of the spatial and temporal graphs, respectively. The nodes in each graph are connected and aggregated with graph edges, representing the semantic relationships between the nodes. To prevent the edges from being configured with visual similarity, we incorporate language queries derived from the video into the graph node representations, enabling them to contain semantic knowledge. In addition, we adopt a recursive strategy to refine initial graphs and correctly classify each frame node as a keyframe. In our experiments, VideoGraph achieves state-of-the-art performance on several benchmarks for generic and query-focused video summarization in both supervised and unsupervised manners. The code is available at https://github.com/park-jungin/videograph.

[188] ADIR: Adaptive Diffusion for Image Reconstruction

Shady Abu-Hussein, Tom Tirer, Raja Giryes

Main category: cs.CV

TL;DR: ADIR introduces a conditional sampling framework with LoRA-based fine-tuning to adapt pre-trained diffusion models for image reconstruction tasks, achieving substantial improvements over existing methods.

DetailsMotivation: Leverage the powerful priors learned by diffusion models for image reconstruction while maintaining consistency with degraded measurements, and adapt pre-trained models to specific degradation types.

Method: Conditional sampling framework that enforces measurement consistency, combined with LoRA-based fine-tuning using semantically similar images retrieved via vision-language models from large datasets.

Result: Substantial improvements across various image reconstruction tasks when applied to Stable Diffusion and Guided Diffusion models.

Conclusion: ADIR effectively bridges the gap between generative diffusion models and image reconstruction, demonstrating the value of adaptive fine-tuning with semantically relevant data for improved performance.

Abstract: Denoising diffusion models have recently achieved remarkable success in image generation, capturing rich information about natural image statistics. This makes them highly promising for image reconstruction, where the goal is to recover a clean image from a degraded observation. In this work, we introduce a conditional sampling framework that leverages the powerful priors learned by diffusion models while enforcing consistency with the available measurements. To further adapt pre-trained diffusion models to the specific degradation at hand, we propose a novel fine-tuning strategy. In particular, we employ LoRA-based adaptation using images that are semantically and visually similar to the degraded input, efficiently retrieved from a large and diverse dataset via an off-the-shelf vision-language model. We evaluate our approach on two leading publicly available diffusion models, Stable Diffusion and Guided Diffusion, and demonstrate that our method, termed Adaptive Diffusion for Image Reconstruction (ADIR), yields substantial improvements across a range of image reconstruction tasks.
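
The retrieval step that drives the adaptation can be sketched as nearest-neighbour search over vision-language embeddings (an illustrative reduction; the encoder choice and precomputed corpus are assumptions):

```python
import torch
import torch.nn.functional as F

def retrieve_adaptation_set(query_emb, dataset_embs, k=20):
    """Select the k images most similar to the degraded input.

    query_emb: (D,) embedding of the degraded input; dataset_embs: (N, D)
    precomputed embeddings of a large image corpus, both from an
    off-the-shelf vision-language encoder. The returned indices would then
    define the image set used for LoRA fine-tuning of the diffusion model.
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(dataset_embs, dim=-1)
    sims = d @ q                       # cosine similarity to the query
    return torch.topk(sims, k).indices
```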

[189] Patch-level Kernel Alignment for Self-Supervised Dense Representation Learning

Juan Yeo, Ijun Jang, Taesup Kim

Main category: cs.CV

TL;DR: A framework that enhances dense vision representations through self-supervised learning with patch-level kernel alignment and specialized augmentations, achieving state-of-the-art results on dense prediction tasks.

DetailsMotivation: Most self-supervised learning methods focus on global image representations but fail to capture localized semantics needed for dense prediction tasks that require spatial precision and fine-grained detail.

Method: Proposes Patch-level Kernel Alignment (PaKA) to align dense feature distributions between teacher and student models, capturing statistical dependencies and structural relationships of patches. Also develops specialized augmentation strategies for dense representation learning.

Result: Achieves state-of-the-art results across various dense vision benchmarks, demonstrating the framework’s effectiveness in transferring semantic knowledge to dense feature space.

Conclusion: The proposed framework successfully overcomes limitations of global representation methods by enabling effective self-supervised learning for dense visual features through kernel alignment and targeted augmentations.

Abstract: Dense representations are essential for vision tasks that require spatial precision and fine-grained detail. While most self-supervised representation learning methods focus on global representations that summarize the image as a whole, such approaches often fall short in capturing the localized semantics necessary for dense prediction tasks. To overcome these limitations, we propose a framework that builds on pretrained representations through additional self-supervised learning, aiming to transfer existing semantic knowledge into the dense feature space. Our method aligns the distributions of dense features between a teacher and a student model. Specifically, we introduce Patch-level Kernel Alignment (PaKA), a simple yet effective alignment objective that captures statistical dependencies, thereby matching the structural relationships of dense patches across the two models. In addition, we investigate augmentation strategies specifically designed for dense representation learning. Our framework achieves state-of-the-art results across a variety of dense vision benchmarks, demonstrating the effectiveness of our approach.
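
The PaKA objective can be sketched as matching per-image patch Gram (kernel) matrices between teacher and student, so the student reproduces patch-to-patch structure rather than individual vectors (a minimal reduction; the paper's exact kernel and normalization may differ):

```python
import torch
import torch.nn.functional as F

def paka_loss(student_feats, teacher_feats):
    """Patch-level kernel alignment loss.

    student_feats, teacher_feats: (B, N, D) dense patch features. For each
    image we build the N x N cosine Gram matrix for both models and match
    them, aligning the statistical dependencies between patches.
    """
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    k_s = s @ s.transpose(1, 2)   # (B, N, N) student patch kernel
    k_t = t @ t.transpose(1, 2)   # (B, N, N) teacher patch kernel
    return F.mse_loss(k_s, k_t)
```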

[190] SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning

Hanzhen Wang, Jiaming Xu, Jiayi Pan, Yongkang Zhou, Guohao Dai

Main category: cs.CV

TL;DR: SpecPrune-VLA is a training-free pruning method for Vision-Language-Action models that uses both local and global context information to intelligently prune tokens, achieving significant speedups with minimal performance loss.

DetailsMotivation: Existing pruning methods for VLA models only use local information from current actions, ignoring global context from prior actions, which causes significant performance drops (>20% success rate loss) and limited speedup.

Method: Two-level pruning with heuristic control: (1) Static pruning at action level using global history and local context, (2) Dynamic pruning at layer level based on layer-specific importance, (3) Lightweight action-aware controller that classifies actions as coarse/fine-grained and adjusts pruning aggressiveness accordingly.

Result: Achieves 1.46x speedup on NVIDIA A800 and 1.57x speedup on NVIDIA GeForce RTX 3090 compared to OpenVLA-OFT, with negligible success rate loss on LIBERO benchmark.

Conclusion: Leveraging both local and global context information enables effective pruning of VLA models without training, achieving substantial speed improvements while maintaining performance.

Abstract: Pruning accelerates compute-bound models by reducing computation. Recently applied to Vision-Language-Action (VLA) models, existing methods prune tokens using only local information from the current action, ignoring global context from prior actions, causing >20% success rate drops and limited speedup. We observe high similarity across consecutive actions and propose leveraging both local (current) and global (past) information for smarter token selection. We introduce SpecPrune-VLA, a training-free method with two-level pruning and heuristic control: (1) Static pruning at the action level: uses global history and local context to reduce visual tokens per action; (2) Dynamic pruning at the layer level: prunes tokens per layer based on layer-specific importance; (3) Lightweight action-aware controller: classifies actions as coarse/fine-grained (by speed), adjusting pruning aggressiveness since fine-grained actions are pruning-sensitive. Experiments on LIBERO show that SpecPrune-VLA achieves a 1.46x speedup on NVIDIA A800 and 1.57x on NVIDIA GeForce RTX 3090 vs. OpenVLA-OFT, with negligible success rate loss.
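
A minimal sketch of the static, action-level pruning step, scoring current tokens by redundancy against the tokens retained at the previous action; the scoring rule and the keep_ratio control (which the action-aware controller would set) are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def static_prune(tokens, prev_tokens, keep_ratio):
    """Drop visual tokens that are redundant given the previous action.

    tokens: (N, D) visual tokens of the current action step; prev_tokens:
    (M, D) tokens kept at the previous step. Tokens highly similar to the
    retained history carry little new information and are pruned first;
    keep_ratio would be smaller for coarse-grained (pruning-tolerant) actions.
    """
    cur = F.normalize(tokens, dim=-1)
    prev = F.normalize(prev_tokens, dim=-1)
    redundancy = (cur @ prev.T).max(dim=-1).values  # max similarity to history
    k = max(1, int(keep_ratio * tokens.shape[0]))
    keep = torch.topk(-redundancy, k).indices       # keep least redundant tokens
    return tokens[keep], keep
```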

[191] SuMa: A Subspace Mapping Approach for Robust and Effective Concept Erasure in Text-to-Image Diffusion Models

Kien Nguyen, Anh Tran, Cuong Pham

Main category: cs.CV

TL;DR: SuMa is a novel concept erasure method that uses subspace mapping to robustly remove narrow concepts like copyrighted characters while maintaining image quality, outperforming existing methods.

DetailsMotivation: Existing concept erasure methods fail to handle narrow concepts (copyrighted characters, celebrities) effectively while maintaining robustness and image quality, which is critical for addressing copyright and legal concerns.

Method: Subspace Mapping (SuMa) first derives a target subspace representing the concept to be erased, then neutralizes it by mapping it to a reference subspace that minimizes distance between them, enabling fine-grained manipulation.

Result: Extensive experiments across four tasks (subclass erasure, celebrity erasure, artistic style erasure, instance erasure) show SuMa achieves image quality comparable to effectiveness-focused methods while matching the completeness of robustness-focused methods.

Conclusion: SuMa successfully addresses the challenge of erasing narrow concepts by providing both robustness and effectiveness, making it suitable for practical applications involving copyright protection and content moderation.

Abstract: The rapid growth of text-to-image diffusion models has raised concerns about their potential misuse in generating harmful or unauthorized contents. To address these issues, several Concept Erasure methods have been proposed. However, most of them fail to achieve both robustness, i.e., the ability to robustly remove the target concept, and effectiveness, i.e., maintaining image quality. While a few recent techniques successfully achieve these goals for NSFW concepts, none could handle narrow concepts such as copyrighted characters or celebrities. Erasing these narrow concepts is critical in addressing copyright and legal concerns. However, erasing them is challenging due to their close distances to non-target neighboring concepts, requiring finer-grained manipulation. In this paper, we introduce Subspace Mapping (SuMa), a novel method specifically designed to achieve both robustness and effectiveness in erasing these narrow concepts. SuMa first derives a target subspace representing the concept to be erased and then neutralizes it by mapping it to a reference subspace that minimizes the distance between the two. This mapping ensures the target concept is robustly erased while preserving image quality. We conduct extensive experiments with SuMa across four tasks: subclass erasure, celebrity erasure, artistic style erasure, and instance erasure, and compare the results with current state-of-the-art methods. Our method achieves image quality comparable to approaches focused on effectiveness, while also yielding results that are on par with methods targeting completeness.
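
The subspace-mapping idea can be sketched with plain SVD: build low-rank target and reference subspaces from concept embeddings, then send the target basis to its projection onto the reference subspace (an illustrative reduction, not the authors' diffusion-model procedure; the rank and the embedding sources are assumptions):

```python
import torch

def subspace_mapping(target_embeds, reference_embeds, rank=4):
    """Return a function that erases the target concept from an embedding.

    target_embeds: (n, D) embeddings of the concept to erase;
    reference_embeds: (m, D) embeddings of a nearby safe concept.
    Projecting the target basis onto the reference subspace minimizes the
    distance between the two subspaces.
    """
    u_t, _, _ = torch.linalg.svd(target_embeds.T, full_matrices=False)
    u_r, _, _ = torch.linalg.svd(reference_embeds.T, full_matrices=False)
    b_t, b_r = u_t[:, :rank], u_r[:, :rank]   # (D, rank) orthonormal bases
    mapped = b_r @ (b_r.T @ b_t)              # target basis mapped into reference

    def erase(x):
        # Replace the target-subspace component of x with its mapped version:
        # x - B_t B_t^T x + mapped (B_t^T x)
        coeff = b_t.T @ x
        return x - b_t @ coeff + mapped @ coeff

    return erase
```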

[192] Self-supervised Learning for Hyperspectral Images of Trees

Moqsadur Rahman, Saurav Kumar, Santosh S. Palmate, M. Shahriar Hossain

Main category: cs.CV

TL;DR: Self-supervised learning creates better tree representations from aerial hyperspectral images than direct use of vegetation properties for downstream ML tasks.

DetailsMotivation: Aerial hyperspectral imaging enables precision agriculture but analyzing these images with limited labels is challenging, requiring effective representation learning methods.

Method: Uses self-supervised learning to create neural network embeddings that capture vegetation properties of trees from aerial hyperspectral images of crop fields.

Result: The constructed tree representation using vegetation property-related embedding space outperforms direct use of hyperspectral vegetation properties in downstream machine learning tasks.

Conclusion: Self-supervised learning provides superior tree representations from hyperspectral imagery compared to traditional vegetation property extraction methods.

Abstract: Aerial remote sensing using multispectral and RGB imagers has provided a critical impetus to precision agriculture. Analysis of the hyperspectral images with limited or no labels is challenging. This paper focuses on self-supervised learning to create neural network embeddings reflecting vegetation properties of trees from aerial hyperspectral images of crop fields. Experimental results demonstrate that a constructed tree representation, using a vegetation property-related embedding space, performs better in downstream machine learning tasks compared to the direct use of hyperspectral vegetation properties as tree representations.

[193] Evaluating YOLO Architectures: Implications for Real-Time Vehicle Detection in Urban Environments of Bangladesh

Ha Meem Hossain, Pritam Nath, Mahitun Nesa Mahi, Imtiaz Uddin, Ishrat Jahan Eiste, Syed Nasibur Rahman Ratul, Md Naim Uddin Mozumdar, Asif Mohammed Saad

Main category: cs.CV

TL;DR: YOLOv11x performs best (63.7% mAP@0.5) on Bangladesh-specific vehicle detection, but rare vehicle classes struggle due to dataset imbalance. Medium YOLO variants offer optimal balance between accuracy and speed.

DetailsMotivation: Vehicle detection systems trained on non-Bangladeshi datasets fail to accurately identify local vehicle types in Bangladesh's unique road environments, creating critical gaps in autonomous driving technology for developing regions.

Method: Evaluated six YOLO model variants on a custom dataset of 29 distinct Bangladeshi vehicle classes using high-resolution images (1920x1080) captured across various roads and manually annotated with YOLO format bounding boxes.

Result: YOLOv11x achieved best performance (63.7% mAP@0.5, 43.8% mAP@0.5:0.95) but slow inference (45.8ms). Medium variants (YOLOv8m, YOLOv11m) offered optimal balance (62.5% and 61.8% mAP@0.5) with faster inference (14-15ms). Rare vehicles like Construction Vehicles and Desi Nosimons showed near-zero accuracy due to dataset imbalance.

Conclusion: This research provides a foundation for developing robust object detection systems specifically adapted to Bangladesh traffic conditions, addressing critical needs in autonomous vehicle technology for developing regions where conventional generic-trained models fail.

Abstract: Vehicle detection systems trained on non-Bangladeshi datasets struggle to accurately identify local vehicle types in Bangladesh’s unique road environments, creating critical gaps in autonomous driving technology for developing regions. This study evaluates six YOLO model variants on a custom dataset featuring 29 distinct vehicle classes, including region-specific vehicles such as “Desi Nosimon”, “Leguna”, “Battery Rickshaw”, and “CNG”. The dataset comprises high-resolution images (1920x1080) captured across various Bangladeshi roads using mobile phone cameras and manually annotated using LabelImg with YOLO format bounding boxes. Performance evaluation revealed YOLOv11x as the top performer, achieving 63.7% mAP@0.5, 43.8% mAP@0.5:0.95, 61.4% recall, and 61.6% F1-score, though requiring 45.8 milliseconds per image for inference. Medium variants (YOLOv8m, YOLOv11m) struck an optimal balance, delivering robust detection performance with mAP@0.5 values of 62.5% and 61.8% respectively, while maintaining moderate inference times around 14-15 milliseconds. The study identified significant detection challenges for rare vehicle classes, with Construction Vehicles and Desi Nosimons showing near-zero accuracy due to dataset imbalances and insufficient training samples. Confusion matrices revealed frequent misclassifications between visually similar vehicles, particularly Mini Trucks versus Mini Covered Vans. This research provides a foundation for developing robust object detection systems specifically adapted to Bangladesh traffic conditions, addressing critical needs in autonomous vehicle technology advancement for developing regions where conventional generic-trained models fail to perform adequately.

[194] EditIDv2: Editable ID Customization with Data-Lubricated ID Feature Integration for Text-to-Image Generation

Guandong Li, Zhaobin Chu

Main category: cs.CV

TL;DR: EditIDv2 is a tuning-free solution for high-complexity narrative scenes and long text inputs that maintains identity consistency while enabling deep semantic editing with minimal data requirements.

DetailsMotivation: Existing character editing methods struggle with degraded editing capabilities, semantic understanding biases, and identity consistency breakdowns when handling long text narratives with multiple semantic layers, temporal logic, and complex contextual relationships.

Method: Decomposes PerceiverAttention, introduces ID loss and joint dynamic training with diffusion model, and employs offline fusion strategy for integration module to achieve deep multi-level semantic editing with minimal data lubrication.

Result: Achieves excellent results in IBench evaluation, meeting demands of long prompts and high-quality image generation while maintaining identity consistency in complex narrative environments.

Conclusion: EditIDv2 successfully addresses editability injection under minimal data lubrication, enabling effective character editing in complex narrative scenes with long text inputs while preserving identity consistency.

Abstract: We propose EditIDv2, a tuning-free solution specifically designed for high-complexity narrative scenes and long text inputs. Existing character editing methods perform well under simple prompts, but often suffer from degraded editing capabilities, semantic understanding biases, and identity consistency breakdowns when faced with long text narratives containing multiple semantic layers, temporal logic, and complex contextual relationships. In EditID, we analyzed the impact of the ID integration module on editability. In EditIDv2, we further explore and address the influence of the ID feature integration module. The core of EditIDv2 is to discuss the issue of editability injection under minimal data lubrication. Through a sophisticated decomposition of PerceiverAttention, the introduction of ID loss and joint dynamic training with the diffusion model, as well as an offline fusion strategy for the integration module, we achieve deep, multi-level semantic editing while maintaining identity consistency in complex narrative environments using only a small amount of data lubrication. This meets the demands of long prompts and high-quality image generation, and achieves excellent results in the IBench evaluation.

[195] OOTSM: A Decoupled Linguistic Framework for Effective Scene Graph Anticipation

Xiaomeng Zhu, Changwei Wang, Haozhe Wang, Xinyu Liu, Fangzhen Lin

Main category: cs.CV

TL;DR: Proposes a linguistic approach for scene graph anticipation using LLMs to predict future object appearances/disappearances and human-object relations, achieving state-of-the-art performance with significant long-term prediction improvements.

DetailsMotivation: Existing scene graph anticipation methods rely mainly on visual cues and struggle to integrate commonsense knowledge, limiting long-term prediction robustness. The paper aims to explicitly leverage commonsense knowledge through linguistic modeling.

Method: Object-Oriented Two-Staged Method (OOTSM) using LLMs: first forecasts object appearances/disappearances, then generates detailed human-object relations. Combines with visual scene graph extraction for complete SGA pipeline.

Result: Achieves 3.4% improvement in short-term mean-Recall (@10) and dramatic 21.9% improvement in long-term mean-Recall (@50), demonstrating effective state-of-the-art performance.

Conclusion: Linguistic Scene Graph Anticipation (LSGA) with OOTSM effectively integrates commonsense knowledge through LLMs, significantly improving both short-term and long-term scene graph prediction robustness.

Abstract: A scene graph is a structured representation of objects and their relationships in a scene. Scene Graph Anticipation (SGA) involves predicting future scene graphs from video clips, enabling applications such as intelligent surveillance and human-machine collaboration. Existing SGA approaches primarily leverage visual cues, often struggling to integrate valuable commonsense knowledge, thereby limiting long-term prediction robustness. To explicitly leverage such commonsense knowledge, we propose a new approach to better understand the objects, concepts, and relationships in a scene graph. Our approach decouples the SGA task into two steps: first, a scene graph capturing model is used to convert a video clip into a sequence of scene graphs; then, a pure text-based model is used to predict scene graphs in future frames. Our focus in this work is on the second step, which we call Linguistic Scene Graph Anticipation (LSGA) and believe has independent interest beyond its use in SGA. For LSGA, we introduce an Object-Oriented Two-Staged Method (OOTSM) where a Large Language Model (LLM) first forecasts object appearances and disappearances before generating detailed human-object relations. We conduct extensive experiments to evaluate OOTSM in two settings. For LSGA, we evaluate our fine-tuned open-sourced LLMs against zero-shot APIs (i.e., GPT-4o, GPT-4o-mini, and DeepSeek-V3) on a benchmark constructed from Action Genome annotations. For SGA, we combine our OOTSM with STTran++, and our experiments demonstrate effective state-of-the-art performance: short-term mean-Recall (@10) increases by 3.4% while long-term mean-Recall (@50) improves dramatically by 21.9%. Code is available at https://github.com/ZhuXMMM/OOTSM.

[196] WIPUNet: A Physics-inspired Network with Weighted Inductive Biases for Image Denoising

Wasikul Islam

Main category: cs.CV

TL;DR: Physics-inspired denoising models inspired by particle physics pileup mitigation show improved robustness at high noise levels compared to standard baselines.

DetailsMotivation: Leverage physics principles from high-energy particle physics (conservation, locality, isolation) to improve image denoising robustness under strong corruption, rather than targeting state-of-the-art benchmarks.

Method: Developed a hierarchy of pileup-inspired denoisers: conservation-constrained CNN, Gaussian-noise variants, and WIPUNet (Weighted Inductive Pileup-physics-inspired U-Network) that integrates physics priors into a UNet backbone.

Result: On CIFAR-10 with Gaussian noise (σ=15-100), PU-inspired CNNs are competitive with baselines, while WIPUNet shows widening performance margin at higher noise levels. Same trend observed on BSD500, demonstrating physics-inspired priors provide stability where data-driven models degrade.

Conclusion: Physics-inspired inductive biases from particle physics can be effectively translated into neural architectures to improve denoising robustness under strong corruption without relying on heavy SOTA machinery.

Abstract: In high-energy particle physics, collider measurements are contaminated by “pileup”, overlapping soft interactions that obscure the hard-scatter signal of interest. Dedicated subtraction strategies exploit physical priors such as conservation, locality, and isolation. Inspired by this analogy, we investigate how such principles can inform image denoising by embedding physics-guided inductive biases into neural architectures. This paper is a proof of concept: rather than targeting state-of-the-art (SOTA) benchmarks, we ask whether physics-inspired priors improve robustness under strong corruption. We introduce a hierarchy of PU-inspired denoisers: a residual CNN with conservation constraints, its Gaussian-noise variants, and the Weighted Inductive Pileup-physics-inspired U-Network for Denoising (WIPUNet), which integrates these ideas into a UNet backbone. On CIFAR-10 with Gaussian noise at σ ∈ {15, 25, 50, 75, 100}, PU-inspired CNNs are competitive with standard baselines, while WIPUNet shows a widening margin at higher noise. Complementary BSD500 experiments show the same trend, suggesting physics-inspired priors provide stability where purely data-driven models degrade. Our contributions are: (i) translating pileup-mitigation principles into modular inductive biases; (ii) integrating them into UNet; and (iii) demonstrating robustness gains at high noise without relying on heavy SOTA machinery.
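
A minimal sketch of a conservation-constrained residual denoiser, illustrating the inductive bias only (WIPUNet itself uses a UNet backbone with weighted variants of these ideas; the zero-mean constraint below is one simple reading of "conservation"):

```python
import torch
import torch.nn as nn

class ConservingDenoiser(nn.Module):
    """Residual denoiser whose predicted noise is forced to be zero-mean.

    The network predicts the noise field and subtracts it from the input;
    constraining the estimate to zero mean per channel mirrors a
    conservation prior: denoising may redistribute intensity locally but
    not create or destroy it globally.
    """

    def __init__(self, channels=3, width=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, 3, padding=1),
        )

    def forward(self, x):
        noise = self.body(x)
        noise = noise - noise.mean(dim=(2, 3), keepdim=True)  # conservation
        return x - noise  # clean estimate = input minus estimated noise
```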

[197] Context-Aware Multi-Turn Visual-Textual Reasoning in LVLMs via Dynamic Memory and Adaptive Visual Guidance

Weijie Shen, Xinrui Wang, Yuanqi Nie, Apiradee Boonmee

Main category: cs.CV

TL;DR: CAMVR framework enhances multi-turn visual reasoning in LVLMs with dynamic memory and adaptive attention mechanisms, achieving SOTA performance on challenging datasets.

DetailsMotivation: Current LLMs and LVLMs struggle with multi-turn interactions requiring deep contextual understanding and complex visual reasoning, leading to fragmented reasoning, context loss, and hallucinations.

Method: Proposes CAMVR framework with Visual-Textual Context Memory Unit (VCMU) for dynamic storage of visual-textual features and Adaptive Visual Focus Guidance (AVFG) for context-aware attention adjustment, plus multi-level reasoning integration.

Result: Extensive experiments on VisDial, adapted A-OKVQA, and novel MTIF dataset demonstrate CAMVR consistently achieves state-of-the-art performance.

Conclusion: CAMVR effectively addresses multi-turn visual reasoning challenges by maintaining contextual coherence and reducing hallucinations through innovative memory and attention mechanisms.

Abstract: Current Large Language Models (LLMs) and Vision-Language Large Models (LVLMs) excel in single-turn tasks but face significant challenges in multi-turn interactions requiring deep contextual understanding and complex visual reasoning, often leading to fragmented reasoning, context loss, and hallucinations. To address these limitations, we propose Context-Aware Multi-Turn Visual Reasoning (CAMVR), a novel framework designed to empower LVLMs with robust and coherent multi-turn visual-textual inference capabilities. CAMVR introduces two key innovations: a Visual-Textual Context Memory Unit (VCMU), a dynamic read-write memory network that stores and manages critical visual features, textual semantic representations, and their cross-modal correspondences from each interaction turn; and an Adaptive Visual Focus Guidance (AVFG) mechanism, which leverages the VCMU’s context to dynamically adjust the visual encoder’s attention to contextually relevant image regions. Our multi-level reasoning integration strategy ensures that response generation is deeply coherent with both current inputs and accumulated historical context. Extensive experiments on challenging datasets, including VisDial, an adapted A-OKVQA, and our novel Multi-Turn Instruction Following (MTIF) dataset, demonstrate that CAMVR consistently achieves state-of-the-art performance.

[198] MeshMetrics: A Precise Implementation of Distance-Based Image Segmentation Metrics

Gašper Podobnik, Tomaž Vrtovec

Main category: cs.CV

TL;DR: MeshMetrics is a mesh-based framework that provides more precise computation of distance-based segmentation metrics than conventional grid-based approaches, addressing implementation pitfalls that cause significant discrepancies in existing tools.

DetailsMotivation: The reproducibility crisis in image segmentation research is partly due to unreliable metric implementations, where distance-based metrics show considerable discrepancies between common open-source tools (e.g., >100mm for Hausdorff distance).

Method: Developed a mesh-based framework that computes distance-based metrics more precisely than grid-based approaches, reducing discretization artifacts like distance quantization through theoretical analysis and empirical validation.

Result: MeshMetrics achieves higher accuracy and precision than established tools and is substantially less affected by discretization artifacts, providing more reliable metric computations.

Conclusion: The mesh-based approach offers a more reliable solution for distance-based segmentation metrics, and the open-source MeshMetrics package addresses implementation inconsistencies that contribute to reproducibility issues in the field.

Abstract: The surge of research in image segmentation has yielded remarkable performance gains but also exposed a reproducibility crisis. A major contributor is performance evaluation, where both selection and implementation of metrics play critical roles. While recent efforts have improved the former, the reliability of metric implementation has received far less attention. Pitfalls in distance-based metric implementation can lead to considerable discrepancies between common open-source tools, for instance, exceeding 100 mm for the Hausdorff distance and 30 percentage points for the normalized surface distance for the same pair of segmentations. To address these pitfalls, we introduce MeshMetrics, a mesh-based framework that provides a more precise computation of distance-based metrics than conventional grid-based approaches. Through theoretical analysis and empirical validation, we demonstrate that MeshMetrics achieves higher accuracy and precision than established tools, and is substantially less affected by discretization artifacts, such as distance quantization. We release MeshMetrics as an open-source Python package, available at https://github.com/gasperpodobnik/MeshMetrics.
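
As an illustration of the pitfall MeshMetrics targets, the sketch below computes a Hausdorff distance from voxel masks via a sub-voxel surface mesh using scikit-image and SciPy; this vertex-based approximation is not the MeshMetrics implementation, but it shows how surface extraction with physical spacing reduces the distance quantization of naive voxel-border sampling:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from skimage import measure

def hausdorff_from_masks(mask_a, mask_b, spacing=(1.0, 1.0, 1.0)):
    """Symmetric Hausdorff distance between two binary 3D segmentations.

    Marching cubes extracts a sub-voxel surface in physical units; the
    distance is then measured between surface vertices. Using only the
    vertices is itself an approximation of true point-to-mesh distance.
    """
    verts_a, *_ = measure.marching_cubes(mask_a.astype(float), level=0.5,
                                         spacing=spacing)
    verts_b, *_ = measure.marching_cubes(mask_b.astype(float), level=0.5,
                                         spacing=spacing)
    d_ab = directed_hausdorff(verts_a, verts_b)[0]
    d_ba = directed_hausdorff(verts_b, verts_a)[0]
    return max(d_ab, d_ba)  # symmetric Hausdorff distance in physical units
```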

[199] Leveraging Vision-Language Large Models for Interpretable Video Action Recognition with Semantic Tokenization

Jingwei Peng, Zhixuan Qiu, Boyu Jin, Surasakdi Siripong

Main category: cs.CV

TL;DR: LVLM-VAR is a novel framework that uses pre-trained Vision-Language Large Models for video action recognition, achieving state-of-the-art performance while providing natural language explanations for predictions.

DetailsMotivation: Traditional human action recognition methods struggle with deep semantic understanding, complex contextual information, and fine-grained distinctions in diverse video data.

Method: Uses a Video-to-Semantic-Tokens Module to transform raw video sequences into semantic action tokens, then processes them with LoRA-fine-tuned LVLM (like LLaVA-13B) combined with natural language instructions for classification and reasoning.

Result: Achieves 94.1% on NTU RGB+D X-Sub and 90.0% on NTU RGB+D 120 X-Set benchmarks, demonstrating state-of-the-art or highly competitive performance.

Conclusion: The framework significantly improves both accuracy and interpretability in video action recognition by leveraging large language models’ capabilities.

Abstract: Human action recognition often struggles with deep semantic understanding, complex contextual information, and fine-grained distinction, limitations that traditional methods frequently encounter when dealing with diverse video data. Inspired by the remarkable capabilities of large language models, this paper introduces LVLM-VAR, a novel framework that pioneers the application of pre-trained Vision-Language Large Models (LVLMs) to video action recognition, emphasizing enhanced accuracy and interpretability. Our method features a Video-to-Semantic-Tokens (VST) Module, which innovatively transforms raw video sequences into discrete, semantically and temporally consistent “semantic action tokens,” effectively crafting an “action narrative” that is comprehensible to an LVLM. These tokens, combined with natural language instructions, are then processed by a LoRA-fine-tuned LVLM (e.g., LLaVA-13B) for robust action classification and semantic reasoning. LVLM-VAR not only achieves state-of-the-art or highly competitive performance on challenging benchmarks such as NTU RGB+D and NTU RGB+D 120, demonstrating significant improvements (e.g., 94.1% on NTU RGB+D X-Sub and 90.0% on NTU RGB+D 120 X-Set), but also substantially boosts model interpretability by generating natural language explanations for its predictions.

[200] JRN-Geo: A Joint Perception Network based on RGB and Normal images for Cross-view Geo-localization

Hongyu Zhou, Yunzhou Zhang, Tingsong Huang, Fawei Ge, Man Qi, Xichen Zhang, Yizhong Zhang

Main category: cs.CV

TL;DR: JRN-Geo network integrates RGB and normal images for cross-view UAV geo-localization, using dual-branch feature extraction with fusion modules and 3D geographic augmentation to handle viewpoint variations.

DetailsMotivation: Existing methods rely mainly on RGB semantic features and neglect spatial structural information, which is crucial for handling drastic viewpoint differences and appearance variations in cross-view geo-localization.

Method: Proposes JRN-Geo network with dual-branch feature extraction framework, Difference-Aware Fusion Module (DAFM), Joint-Constrained Interaction Aggregation (JCIA) strategy, and 3D geographic augmentation technique to integrate RGB and normal images.

Result: Achieves state-of-the-art performance on University-1652 and SUES-200 datasets, demonstrating robustness against complex viewpoint variations.

Conclusion: Incorporating geometric structural information from normal images alongside RGB data significantly improves cross-view geo-localization performance by capturing better viewpoint-invariant features.

Abstract: Cross-view geo-localization plays a critical role in Unmanned Aerial Vehicle (UAV) localization and navigation. However, significant challenges arise from the drastic viewpoint differences and appearance variations between images. Existing methods predominantly rely on semantic features from RGB images, often neglecting the importance of spatial structural information in capturing viewpoint-invariant features. To address this issue, we incorporate geometric structural information from normal images and introduce a Joint perception network to integrate RGB and Normal images (JRN-Geo). Our approach utilizes a dual-branch feature extraction framework, leveraging a Difference-Aware Fusion Module (DAFM) and Joint-Constrained Interaction Aggregation (JCIA) strategy to enable deep fusion and joint-constrained semantic and structural information representation. Furthermore, we propose a 3D geographic augmentation technique to generate potential viewpoint variation samples, enhancing the network’s ability to learn viewpoint-invariant features. Extensive experiments on the University-1652 and SUES-200 datasets validate the robustness of our method against complex viewpoint variations, achieving state-of-the-art performance.

[201] Knowledge-Augmented Vision Language Models for Underwater Bioacoustic Spectrogram Analysis

Ragib Amin Nihal, Benjamin Yen, Takeshi Ashizawa, Kazuhiro Nakadai

Main category: cs.CV

TL;DR: VLMs can extract meaningful patterns from marine mammal spectrograms without domain-specific training, using integrated VLM interpretation and LLM validation.

DetailsMotivation: Marine mammal vocalization analysis relies on bioacoustic spectrograms, but VLMs are not trained on these domain-specific visualizations, creating a gap in automated analysis capabilities.

Method: Framework that integrates VLM interpretation with LLM-based validation to build domain knowledge, enabling adaptation to acoustic data without manual annotation or model retraining.

Result: VLMs can extract meaningful patterns from spectrograms visually despite not being trained on these specific domain visualizations.

Conclusion: The integrated VLM-LLM framework provides a viable approach for marine bioacoustic analysis without requiring manual annotation or model retraining, demonstrating VLMs’ ability to interpret domain-specific spectrograms.

Abstract: Marine mammal vocalization analysis depends on interpreting bioacoustic spectrograms. Vision Language Models (VLMs) are not trained on these domain-specific visualizations. We investigate whether VLMs can extract meaningful patterns from spectrograms visually. Our framework integrates VLM interpretation with LLM-based validation to build domain knowledge. This enables adaptation to acoustic data without manual annotation or model retraining.

[202] LiDAR-BIND-T: Improving SLAM with Temporally Consistent Cross-Modal LiDAR Reconstruction

Niels Balemans, Ali Anwar, Jan Steckel, Siegfried Mercelis

Main category: cs.CV

TL;DR: LiDAR-BIND-T extends LiDAR-BIND with temporal consistency mechanisms for multi-sensor fusion, improving SLAM performance through temporal embedding alignment, motion-aligned transformation loss, and windowed temporal fusion.

DetailsMotivation: To enhance temporal stability and coherence in multi-modal sensor fusion (radar, sonar to LiDAR) for improved robustness and performance in SLAM applications.

Method: Three main contributions: temporal embedding similarity for latent alignment, motion-aligned transformation loss for displacement matching, and windowed temporal fusion with specialized temporal module. Updated architecture for better spatial structure preservation.

Result: Improved temporal and spatial coherence, lower absolute trajectory error, better occupancy map accuracy in Cartographer-based SLAM. Proposed new metrics (FVMD-based and correlation-peak distance) for temporal quality evaluation.

Conclusion: LiDAR-BIND-T maintains plug-and-play modality fusion while significantly enhancing temporal stability, resulting in improved robustness and performance for downstream SLAM systems.

Abstract: This paper extends LiDAR-BIND, a modular multi-modal fusion framework that binds heterogeneous sensors (radar, sonar) to a LiDAR-defined latent space, with mechanisms that explicitly enforce temporal consistency. We introduce three contributions: (i) temporal embedding similarity that aligns consecutive latents, (ii) a motion-aligned transformation loss that matches displacement between predictions and ground truth LiDAR, and (iii) windowed temporal fusion using a specialised temporal module. We further update the model architecture to better preserve spatial structure. Evaluations on radar/sonar-to-LiDAR translation demonstrate improved temporal and spatial coherence, yielding lower absolute trajectory error and better occupancy map accuracy in Cartographer-based SLAM (Simultaneous Localisation and Mapping). We propose different metrics based on the Fréchet Video Motion Distance (FVMD) and a correlation-peak distance metric, providing practical temporal quality indicators to evaluate SLAM performance. The proposed temporal LiDAR-BIND, or LiDAR-BIND-T, maintains plug-and-play modality fusion while substantially enhancing temporal stability, resulting in improved robustness and performance for downstream SLAM.
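
A worked sketch may make the two loss-based contributions concrete. Below is a minimal PyTorch reading of the temporal embedding similarity (i) and the motion-aligned transformation loss (ii); the tensor shapes are assumptions, and the paper's exact formulations may differ.

```python
import torch
import torch.nn.functional as F

def temporal_embedding_similarity_loss(latents: torch.Tensor) -> torch.Tensor:
    # latents: (T, D) per-frame latent embeddings (assumed layout).
    # Penalize misalignment between consecutive latents.
    prev, nxt = latents[:-1], latents[1:]
    return (1.0 - F.cosine_similarity(prev, nxt, dim=-1)).mean()

def motion_aligned_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # pred, gt: (T, ...) predicted and ground-truth LiDAR frames.
    # Match frame-to-frame displacement rather than absolute frames.
    return F.l1_loss(pred[1:] - pred[:-1], gt[1:] - gt[:-1])
```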

[203] Multi-LVI-SAM: A Robust LiDAR-Visual-Inertial Odometry for Multiple Fisheye Cameras

Xinyu Zhang, Kai Huang, Junqiao Zhao, Zihan Yuan, Tiantian Feng

Main category: cs.CV

TL;DR: Multi-LVI-SAM is a multi-camera LiDAR-visual-inertial odometry framework that uses a panoramic visual feature model to unify multiple fisheye camera data, with extrinsic compensation for improved accuracy and robustness.

DetailsMotivation: To achieve highly accurate and robust state estimation by efficiently fusing data from multiple fisheye cameras, LiDAR, and inertial sensors while addressing challenges in multi-camera integration.

Method: Introduces a panoramic visual feature model that unifies multi-camera observations into a single representation, with extrinsic compensation for frame misalignment, integrated into a tightly coupled factor graph-based LiDAR-visual-inertial system.

Result: Extensive experiments on public datasets show enhanced quality and consistency of multi-camera constraints, resulting in higher accuracy and robustness compared to existing multi-camera LiDAR-visual-inertial systems.

Conclusion: The panoramic visual feature model with extrinsic compensation effectively improves feature consistency, reduces triangulation errors, and enables more accurate pose estimation in multi-sensor fusion systems.

Abstract: We propose a multi-camera LiDAR-visual-inertial odometry framework, Multi-LVI-SAM, which fuses data from multiple fisheye cameras, LiDAR and inertial sensors for highly accurate and robust state estimation. To enable efficient and consistent integration of visual information from multiple fisheye cameras, we introduce a panoramic visual feature model that unifies multi-camera observations into a single representation. The panoramic model serves as a global geometric optimization framework that consolidates multi-view constraints, enabling seamless loop closure and global pose optimization, while simplifying system design by avoiding redundant handling of individual cameras. To address the triangulation inconsistency caused by the misalignment between each camera’s frame and the panoramic model’s frame, we propose an extrinsic compensation method. This method improves feature consistency across views and significantly reduces triangulation and optimization errors, leading to more accurate pose estimation. We integrate the panoramic visual feature model into a tightly coupled LiDAR-visual-inertial system based on a factor graph. Extensive experiments on public datasets demonstrate that the panoramic visual feature model enhances the quality and consistency of multi-camera constraints, resulting in higher accuracy and robustness than existing multi-camera LiDAR-visual-inertial systems.

[204] Depth-Aware Super-Resolution via Distance-Adaptive Variational Formulation

Tianhao Guo, Bingjie Lu, Feng Wang, Zhengyang Lu

Main category: cs.CV

TL;DR: A novel distance-adaptive super-resolution framework that models spatially-varying degradation using pseudodifferential operators with depth-dependent spectral characteristics, achieving state-of-the-art performance on depth-variant scenarios.

DetailsMotivation: Traditional super-resolution assumes spatially-invariant degradation, but real-world imaging systems exhibit complex distance-dependent effects like atmospheric scattering and perspective distortions, requiring spatially-adaptive reconstruction strategies with geometric scene understanding.

Method: Variational framework formulating SR as spatially-varying inverse problem using pseudodifferential operators. Neural architecture implements discrete gradient flow dynamics with cascaded residual blocks and depth-conditional convolution kernels, incorporating learned distance-adaptive regularization and spectral constraints from atmospheric scattering theory.

Result: Achieves 36.89/0.9516 and 30.54/0.8721 PSNR/SSIM at 2× and 4× scales on KITTI outdoor scenes, outperforming existing methods by 0.44dB and 0.36dB respectively. State-of-the-art performance across five benchmark datasets.

Conclusion: Establishes the first theoretically-grounded distance-adaptive super-resolution framework, demonstrating significant improvements on depth-variant scenarios while maintaining competitive performance on traditional benchmarks.

Abstract: Single image super-resolution traditionally assumes spatially-invariant degradation models, yet real-world imaging systems exhibit complex distance-dependent effects including atmospheric scattering, depth-of-field variations, and perspective distortions. This fundamental limitation necessitates spatially-adaptive reconstruction strategies that explicitly incorporate geometric scene understanding for optimal performance. We propose a rigorous variational framework that characterizes super-resolution as a spatially-varying inverse problem, formulating the degradation operator as a pseudodifferential operator with distance-dependent spectral characteristics that enable theoretical analysis of reconstruction limits across depth ranges. Our neural architecture implements discrete gradient flow dynamics through cascaded residual blocks with depth-conditional convolution kernels, ensuring convergence to stationary points of the theoretical energy functional while incorporating learned distance-adaptive regularization terms that dynamically adjust smoothness constraints based on local geometric structure. Spectral constraints derived from atmospheric scattering theory prevent bandwidth violations and noise amplification in far-field regions, while adaptive kernel generation networks learn continuous mappings from depth to reconstruction filters. Comprehensive evaluation across five benchmark datasets demonstrates state-of-the-art performance, achieving 36.89/0.9516 and 30.54/0.8721 PSNR/SSIM at 2× and 4× scales on KITTI outdoor scenes, outperforming existing methods by 0.44dB and 0.36dB respectively. This work establishes the first theoretically-grounded distance-adaptive super-resolution framework and demonstrates significant improvements on depth-variant scenarios while maintaining competitive performance across traditional benchmarks.
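
The depth-conditional convolution kernels admit a simple FiLM-style reading: a depth map is embedded and used to scale and shift feature channels inside each residual block. The sketch below is an illustrative stand-in under that assumption, not the paper's adaptive kernel generation network.

```python
import torch
import torch.nn as nn

class DepthConditionalBlock(nn.Module):
    """Residual block whose features are modulated by a depth map."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        # Hypothetical depth embedding producing per-channel scale/shift.
        self.depth_mlp = nn.Sequential(
            nn.Conv2d(1, channels, 1), nn.ReLU(),
            nn.Conv2d(channels, 2 * channels, 1),
        )

    def forward(self, x: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) features; depth: (B, 1, H, W) aligned depth map.
        scale, shift = self.depth_mlp(depth).chunk(2, dim=1)
        h = torch.relu(self.conv1(x))
        h = h * (1 + scale) + shift   # depth-dependent modulation
        return x + self.conv2(h)      # residual update (one gradient-flow step)
```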

[205] Unleashing Hierarchical Reasoning: An LLM-Driven Framework for Training-Free Referring Video Object Segmentation

Bingrui Zhao, Lin Yuanbo Wu, Xiangtian Fan, Deyin Liu, Lu Zhang, Ruyi He, Jialie Shen, Ximing Li

Main category: cs.CV

TL;DR: PARSE-VOS is a training-free RVOS framework using LLMs for hierarchical reasoning to segment video objects from language queries, achieving SOTA results.

DetailsMotivation: Current RVOS methods struggle with complex compositional descriptions and holistic visual-language fusion, especially when objects have similar appearances but different motion/poses.

Method: Parses language queries into structured commands, uses spatio-temporal grounding to generate candidate trajectories, and employs two-stage hierarchical reasoning (coarse motion analysis + fine pose verification) with LLMs.

Result: Achieved state-of-the-art performance on Ref-YouTube-VOS, Ref-DAVIS17, and MeViS benchmarks.

Conclusion: PARSE-VOS demonstrates that hierarchical coarse-to-fine reasoning with LLMs effectively addresses complex RVOS challenges without requiring training, outperforming existing methods.

Abstract: Referring Video Object Segmentation (RVOS) aims to segment an object of interest throughout a video based on a language description. The prominent challenge lies in aligning static text with dynamic visual content, particularly when objects exhibit similar appearances but inconsistent motion and poses. However, current methods often rely on a holistic visual-language fusion that struggles with complex, compositional descriptions. In this paper, we propose PARSE-VOS, a novel, training-free framework powered by Large Language Models (LLMs), for hierarchical, coarse-to-fine reasoning across text and video domains. Our approach begins by parsing the natural language query into structured semantic commands. Next, we introduce a spatio-temporal grounding module that generates all candidate trajectories for all potential target objects, guided by the parsed semantics. Finally, a hierarchical identification module selects the correct target through a two-stage reasoning process: it first performs coarse-grained motion reasoning with an LLM to narrow down candidates; if ambiguity remains, a fine-grained pose verification stage is conditionally triggered to disambiguate. The final output is an accurate segmentation mask for the target object. PARSE-VOS achieved state-of-the-art performance on three major benchmarks: Ref-YouTube-VOS, Ref-DAVIS17, and MeViS.

[206] PictOBI-20k: Unveiling Large Multimodal Models in Visual Decipherment for Pictographic Oracle Bone Characters

Zijian Chen, Wenjie Hua, Jinhao Li, Lirong Deng, Fan Du, Tingzhu Chen, Guangtao Zhai

Main category: cs.CV

TL;DR: A new dataset PictOBI-20k with 20k oracle bone character and real object images is introduced to evaluate large multimodal models’ visual decipherment capabilities for ancient Chinese script.

DetailsMotivation: Oracle bone characters are crucial for understanding early human civilization but current decipherment methods are limited by scarce archaeological findings and small inscription corpus. Large multimodal models offer potential for visual decipherment.

Method: Created PictOBI-20k dataset with 20k OBC and real object images forming 15k+ multi-choice questions. Conducted subjective annotations to compare human and LMM visual reasoning consistency.

Result: General LMMs show preliminary visual decipherment skills but are not effectively using visual information and are limited by language priors rather than visual understanding.

Conclusion: The dataset facilitates evaluation and optimization of visual attention in future OBC-oriented LMMs, helping bridge the gap between human and machine visual reasoning for ancient script decipherment.

Abstract: Deciphering oracle bone characters (OBCs), the oldest attested form of written Chinese, has remained the ultimate, unwavering goal of scholars, offering an irreplaceable key to understanding humanity’s early modes of production. Current decipherment methodologies of OBC are primarily constrained by the sporadic nature of archaeological excavations and the limited corpus of inscriptions. With the powerful visual perception capability of large multimodal models (LMMs), the potential of using LMMs for visually deciphering OBCs has increased. In this paper, we introduce PictOBI-20k, a dataset designed to evaluate LMMs on the visual decipherment tasks of pictographic OBCs. It includes 20k meticulously collected OBC and real object images, forming over 15k multi-choice questions. We also conduct subjective annotations to investigate the consistency of the reference point between humans and LMMs in visual reasoning. Experiments indicate that general LMMs possess preliminary visual decipherment skills but do not use visual information effectively; most of the time, they are limited by language priors. We hope that our dataset can facilitate the evaluation and optimization of visual attention in future OBC-oriented LMMs. The code and dataset will be available at https://github.com/OBI-Future/PictOBI-20k.

[207] AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care

Md Asaduzzaman Jabin, Hanqi Jiang, Yiwei Li, Patrick Kaggwa, Eugene Douglass, Juliet N. Sekandi, Tianming Liu

Main category: cs.CV

TL;DR: AdCare-VLM is a specialized video-language model for medication adherence monitoring using tuberculosis patient videos, achieving 3.1-3.54% improvement over existing VLMs.

DetailsMotivation: Chronic diseases require strict medication adherence, but adherence is often compromised by patient behavior, high costs, and poor healthcare infrastructure. There's a need for automated visual monitoring systems to detect medication adherence patterns.

Method: Proposed AdCare-VLM based on Video-LLaVA architecture, fine-tuned on 806 custom-annotated TB medication monitoring videos. Created LLM-TB-VQA dataset with positive/negative/ambiguous adherence cases. Identified visual features (face visibility, medication, water intake, ingestion) correlated with medical concepts.

Result: Outperformed parameter-efficient fine-tuning enabled VLM models (LLaVA-V1.5, Chat-UniVi) with absolute improvements of 3.1% to 3.54% across different configurations. Comprehensive ablation studies and attention visualizations validated the approach.

Conclusion: The specialized multimodal model effectively integrates visual-linguistic representations for medication adherence monitoring, demonstrating superior performance and enhanced interpretability for healthcare applications.

Abstract: Chronic diseases, including diabetes, hypertension, asthma, HIV-AIDS, epilepsy, and tuberculosis, necessitate rigorous adherence to medication to avert disease progression, manage symptoms, and decrease mortality rates. Adherence is frequently undermined by factors including patient behavior, caregiver support, elevated medical costs, and insufficient healthcare infrastructure. We propose AdCare-VLM, a specialized Video-LLaVA-based multimodal large vision language model (LVLM) aimed at visual question answering (VQA) concerning medication adherence through patient videos. We employ a private dataset comprising 806 custom-annotated tuberculosis (TB) medication monitoring videos, which have been labeled by clinical experts, to fine-tune the model for adherence pattern detection. We present LLM-TB-VQA, a detailed medical adherence VQA dataset that encompasses positive, negative, and ambiguous adherence cases. Our method identifies correlations between visual features, such as the clear visibility of the patient’s face, medication, water intake, and the act of ingestion, and their associated medical concepts in captions. This facilitates the integration of aligned visual-linguistic representations and improves multimodal interactions. Experimental results indicate that our method surpasses parameter-efficient fine-tuning (PEFT) enabled VLM models, such as LLaVA-V1.5 and Chat-UniVi, with absolute improvements ranging from 3.1% to 3.54% across pre-trained, regular, and low-rank adaptation (LoRA) configurations. Comprehensive ablation studies and attention map visualizations substantiate our approach, enhancing interpretability.

[208] Posterior shape models revisited: Improving 3D reconstructions from partial data using target specific models

Jonathan Aellen, Florian Burkhardt, Thomas Vetter, Marcel Lüthi

Main category: cs.CV

TL;DR: Pose alignment is crucial for accurate partial shape reconstruction in medical imaging. The paper proposes an efficient method to adjust existing models to target poses without original training data, improving accuracy while maintaining computational efficiency.

DetailsMotivation: Point distribution models for shape reconstruction often overlook pose alignment between training and target shapes, leading to biased solutions especially for small shape parts.

Method: An efficient method to adjust existing statistical shape models to specific target poses, preserving linear model efficiency while improving reconstruction accuracy. Handles translations exactly and provides good approximations for small rotations without requiring original training data.

Result: Significantly improved reconstruction accuracy and predicted variance compared to unaligned models. The method works as a simple preprocessing step that can be integrated into existing reconstruction pipelines.

Conclusion: Pose alignment is essential for accurate partial shape reconstruction. The proposed plug-and-play approach enables efficient adaptation of existing shape models to target poses, making it widely applicable in medical imaging reconstruction pipelines.

Abstract: In medical imaging, point distribution models are often used to reconstruct and complete partial shapes using a statistical model of the full shape. A commonly overlooked, but crucial factor in this reconstruction process, is the pose of the training data relative to the partial target shape. A difference in pose alignment of the training and target shape leads to biased solutions, particularly when observing small parts of a shape. In this paper, we demonstrate the importance of pose alignment for partial shape reconstructions and propose an efficient method to adjust an existing model to a specific target. Our method preserves the computational efficiency of linear models while significantly improving reconstruction accuracy and predicted variance. It exactly recovers the intended aligned model for translations, and provides a good approximation for small rotations, all without access to the original training data. Hence, existing shape models in reconstruction pipelines can be adapted by a simple preprocessing step, making our approach widely applicable in plug-and-play scenarios.

[209] 3DPillars: Pillar-based two-stage 3D object detection

Jongyoun Noh, Junghyup Lee, Hyekang Park, Bumsub Ham

Main category: cs.CV

TL;DR: A two-stage 3D detection framework that improves PointPillars by introducing 3DPillars CNN architecture for better 3D feature learning and sparse scene context features for effective two-stage detection.

DetailsMotivation: PointPillars is efficient but underperforms state-of-the-art methods due to poor 3D structure preservation and inability to effectively use two-stage detection pipelines with 3D proposals.

Method: Introduces 3DPillars CNN that treats 3D voxel features as stacked pseudo images using separable voxel feature module without 3D convolutions, plus RoI head with sparse scene context feature module for multi-scale feature aggregation.

Result: Achieves good compromise between speed and accuracy on KITTI and Waymo Open datasets, narrowing performance gap with state-of-the-art methods while retaining efficiency.

Conclusion: The framework successfully overcomes PointPillars’ limitations by enabling effective two-stage detection with pseudo image representations, demonstrating both effectiveness and efficiency in 3D object detection.

Abstract: PointPillars is the fastest 3D object detector that exploits pseudo image representations to encode features for 3D objects in a scene. Albeit efficient, PointPillars is typically outperformed by state-of-the-art 3D detection methods due to the following limitations: 1) The pseudo image representations fail to preserve precise 3D structures, and 2) they make it difficult to adopt a two-stage detection pipeline using 3D object proposals that typically shows better performance than a single-stage approach. We introduce in this paper the first two-stage 3D detection framework exploiting pseudo image representations, narrowing the performance gaps between PointPillars and state-of-the-art methods, while retaining its efficiency. Our framework consists of two novel components that overcome the aforementioned limitations of PointPillars: First, we introduce a new CNN architecture, dubbed 3DPillars, that enables learning 3D voxel-based features from the pseudo image representation efficiently using 2D convolutions. The basic idea behind 3DPillars is that 3D features from voxels can be viewed as a stack of pseudo images. To implement this idea, we propose a separable voxel feature module that extracts voxel-based features without using 3D convolutions. Second, we introduce an RoI head with a sparse scene context feature module that aggregates multi-scale features from 3DPillars to obtain a sparse scene feature. This enables adopting a two-stage pipeline effectively, and fully leveraging contextual information of a scene to refine 3D object proposals. Experimental results on the KITTI and Waymo Open datasets demonstrate the effectiveness and efficiency of our approach, achieving a good compromise in terms of speed and accuracy.
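
The core reshaping trick, viewing a voxel volume as a stack of pseudo images so that only 2D convolutions are needed, can be sketched in a few lines. The grouped convolution processes each depth slice spatially and the pointwise convolution mixes information across slices; the actual separable voxel feature module is more elaborate, and the layer sizes here are assumptions.

```python
import torch
import torch.nn as nn

class SeparableVoxelFeature2D(nn.Module):
    """Process a (B, C, D, H, W) voxel grid using only 2D convolutions."""

    def __init__(self, channels: int, depth: int):
        super().__init__()
        # One 2D conv per depth slice (groups=depth keeps slices separate).
        self.spatial = nn.Conv2d(channels * depth, channels * depth, 3,
                                 padding=1, groups=depth)
        # Pointwise conv mixes features across depth slices.
        self.depth_mix = nn.Conv2d(channels * depth, channels * depth, 1)

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        b, c, d, h, w = voxels.shape
        # Fold depth into channels so each slice becomes a pseudo image.
        x = voxels.permute(0, 2, 1, 3, 4).reshape(b, d * c, h, w)
        x = torch.relu(self.spatial(x))
        x = torch.relu(self.depth_mix(x))
        return x.reshape(b, d, c, h, w).permute(0, 2, 1, 3, 4)
```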

[210] CRAB: Camera-Radar Fusion for Reducing Depth Ambiguity in Backward Projection based View Transformation

In-Jae Lee, Sihwan Hwang, Youngseok Kim, Wonjune Kim, Sanmin Kim, Dongsuk Kum

Main category: cs.CV

TL;DR: CRAB is a novel camera-radar fusion model that uses backward projection with radar to reduce depth ambiguity in 3D object detection and segmentation, achieving state-of-the-art performance on nuScenes dataset.

DetailsMotivation: Previous camera-radar fusion methods either struggle with sparse BEV feature generation (forward projection) or suffer from depth ambiguity leading to false positives (backward projection). The complementary nature of camera and radar sensors makes fusion promising for cost-effective 3D perception.

Method: CRAB uses backward projection that aggregates perspective view image features into BEV queries while leveraging radar to mitigate depth ambiguity. It combines dense but unreliable image depth distribution with sparse yet precise radar occupancy information. Also introduces spatial cross-attention with radar context features for better 3D scene understanding.

Result: Achieved state-of-the-art performance among backward projection-based camera-radar fusion methods on nuScenes dataset: 62.4% NDS and 54.0% mAP in 3D object detection.

Conclusion: The proposed CRAB model effectively addresses depth ambiguity in backward projection by fusing camera and radar data, demonstrating superior 3D object detection performance through complementary sensor fusion and improved depth distinction.

Abstract: Recently, camera-radar fusion-based 3D object detection methods in bird’s eye view (BEV) have gained attention due to the complementary characteristics and cost-effectiveness of these sensors. Previous approaches using forward projection struggle with sparse BEV feature generation, while those employing backward projection overlook depth ambiguity, leading to false positives. In this paper, to address the aforementioned limitations, we propose a novel camera-radar fusion-based 3D object detection and segmentation model named CRAB (Camera-Radar fusion for reducing depth Ambiguity in Backward projection-based view transformation), using a backward projection that leverages radar to mitigate depth ambiguity. During the view transformation, CRAB aggregates perspective view image context features into BEV queries. It improves depth distinction among queries along the same ray by combining the dense but unreliable depth distribution from images with the sparse yet precise depth information from radar occupancy. We further introduce spatial cross-attention with a feature map containing radar context information to enhance the comprehension of the 3D scene. When evaluated on the nuScenes open dataset, our proposed approach achieves a state-of-the-art performance among backward projection-based camera-radar fusion methods with 62.4% NDS and 54.0% mAP in 3D object detection.
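
One way to picture the depth disambiguation is as a re-weighting of the image-derived depth distribution along each ray by radar occupancy. The fusion rule below is purely illustrative (the paper learns this interaction inside its view transformation), and the shapes are assumptions.

```python
import torch

def fuse_depth_with_radar(depth_probs: torch.Tensor,
                          radar_occupancy: torch.Tensor,
                          eps: float = 1e-6) -> torch.Tensor:
    # depth_probs:     (N, D) dense but unreliable depth distribution per ray.
    # radar_occupancy: (N, D) sparse but precise occupancy evidence (0 = no return).
    weights = depth_probs * (1.0 + radar_occupancy)  # boost radar-confirmed bins
    return weights / (weights.sum(dim=-1, keepdim=True) + eps)
```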

[211] Dual-Mode Deep Anomaly Detection for Medical Manufacturing: Structural Similarity and Feature Distance

Julio Zanon Diaz, Georgios Siogkas, Peter Corcoran

Main category: cs.CV

TL;DR: Two attention-guided autoencoder architectures for deep anomaly detection in medical device manufacturing, addressing small imbalanced datasets and regulatory requirements.

DetailsMotivation: Automating visual inspection in medical device manufacturing is challenging due to small imbalanced datasets, high-resolution imagery, and stringent regulatory requirements.

Method: Two approaches: 1) Structural similarity-based anomaly score (4-MS-SSIM) for lightweight real-time detection, 2) Feature-distance approach using Mahalanobis scoring on reduced latent features for supervisory monitoring.

Result: First method achieved ACC 0.903 (unsupervised) and 0.931 (supervised) with only 10% defective samples. Second method achieved ACC 0.722 with supervised thresholding. Both surpassed re-implemented baselines.

Conclusion: The methods provide complementary capabilities for inline inspection and post-production surveillance, offering a practical pathway for deploying deep anomaly detection in regulated manufacturing environments while meeting EU AI Act requirements.

Abstract: Automating visual inspection in medical device manufacturing remains challenging due to small and imbalanced datasets, high-resolution imagery, and stringent regulatory requirements. This work proposes two attention-guided autoencoder architectures for deep anomaly detection designed to address these constraints. The first employs a structural similarity-based anomaly score (4-MS-SSIM), offering lightweight and accurate real-time defect detection, yielding ACC 0.903 (unsupervised thresholding) and 0.931 (supervised thresholding) on the Surface Seal Image test split with only 10% of defective samples. The second applies a feature-distance approach using Mahalanobis scoring on reduced latent features, providing high sensitivity to distributional shifts for supervisory monitoring, achieving ACC 0.722 with supervised thresholding. Together, these methods deliver complementary capabilities: the first supports reliable inline inspection, while the second enables scalable post-production surveillance and regulatory compliance monitoring. Experimental results demonstrate that both approaches surpass re-implemented baselines and provide a practical pathway for deploying deep anomaly detection in regulated manufacturing environments, aligning accuracy, efficiency, and the regulatory obligations defined for high-risk AI systems under the EU AI Act.
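
The feature-distance branch is essentially classical Mahalanobis scoring over latent features, which can be sketched directly; the fitting data and feature dimensionality below are assumptions.

```python
import numpy as np

def fit_mahalanobis(latents_ok: np.ndarray):
    # latents_ok: (N, D) latent features from known-good (defect-free) samples.
    mu = latents_ok.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(latents_ok, rowvar=False))  # pseudo-inverse for stability
    return mu, cov_inv

def mahalanobis_score(z: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    # Larger scores indicate a stronger distributional shift (more anomalous).
    d = z - mu
    return float(np.sqrt(d @ cov_inv @ d))
```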

[212] A Probabilistic Segment Anything Model for Ambiguity-Aware Medical Image Segmentation

Tyler Ward, Abdullah Imran

Main category: cs.CV

TL;DR: Probabilistic SAM extends SAM to generate multiple plausible segmentations by modeling segmentation distributions with latent variables, addressing annotation uncertainty in medical imaging.

DetailsMotivation: SAM produces deterministic segmentations but fails to capture inherent ambiguity in real-world tasks like medical imaging where multiple plausible segmentations exist due to annotation uncertainty and inter-expert variability.

Method: Incorporates latent variable space with variational training, integrates prior and posterior networks into SAM framework, uses latent codes to modulate prompt embeddings during inference for efficient sampling.

Result: Outperforms existing probabilistic baselines on uncertainty-aware metrics, produces diverse outputs that align with expert disagreement on LIDC-IDRI lung nodule dataset.

Conclusion: Probabilistic SAM successfully models segmentation distributions, enabling uncertainty-aware outputs that reflect human annotation variability with minimal computational overhead.

Abstract: Recent advances in promptable segmentation, such as the Segment Anything Model (SAM), have enabled flexible, high-quality mask generation across a wide range of visual domains. However, SAM and similar models remain fundamentally deterministic, producing a single segmentation per object per prompt, and fail to capture the inherent ambiguity present in many real-world tasks. This limitation is particularly troublesome in medical imaging, where multiple plausible segmentations may exist due to annotation uncertainty or inter-expert variability. In this paper, we introduce Probabilistic SAM, a probabilistic extension of SAM that models a distribution over segmentations conditioned on both the input image and prompt. By incorporating a latent variable space and training with a variational objective, our model learns to generate diverse and plausible segmentation masks reflecting the variability in human annotations. The architecture integrates a prior and posterior network into the SAM framework, allowing latent codes to modulate the prompt embeddings during inference. The latent space allows for efficient sampling during inference, enabling uncertainty-aware outputs with minimal overhead. We evaluate Probabilistic SAM on the public LIDC-IDRI lung nodule dataset and demonstrate its ability to produce diverse outputs that align with expert disagreement, outperforming existing probabilistic baselines on uncertainty-aware metrics. Our code is available at: https://github.com/tbwa233/Probabilistic-SAM/.
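
The latent mechanism can be sketched as a reparameterized Gaussian whose samples shift the prompt embeddings, so repeated sampling yields diverse masks. Module and dimension names below are hypothetical, not SAM's real API.

```python
import torch
import torch.nn as nn

class LatentPromptModulator(nn.Module):
    """Sample a latent code and use it to modulate prompt embeddings."""

    def __init__(self, embed_dim: int = 256, latent_dim: int = 6):
        super().__init__()
        self.to_gauss = nn.Linear(embed_dim, 2 * latent_dim)  # predicts (mu, logvar)
        self.to_prompt = nn.Linear(latent_dim, embed_dim)

    def forward(self, image_feat: torch.Tensor, prompt_emb: torch.Tensor):
        # image_feat: (B, E) pooled image features; prompt_emb: (B, N, E).
        mu, logvar = self.to_gauss(image_feat).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return prompt_emb + self.to_prompt(z).unsqueeze(1), (mu, logvar)
```

At training time, a posterior network conditioned on a ground-truth mask would supply mu/logvar, with a KL term against the prior added to the variational objective.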

[213] Challenges in Deep Learning-Based Small Organ Segmentation: A Benchmarking Perspective for Medical Research with Limited Datasets

Phongsakon Mark Konrad, Andrei-Alexandru Popa, Yaser Sabzehmeidani, Liang Zhong, Elisa A. Liehn, Serkan Ayvaz

Main category: cs.CV

TL;DR: Evaluation of deep learning segmentation models on limited cardiovascular histology data shows performance is highly sensitive to data splits, challenging standard benchmarking practices in low-data clinical settings.

DetailsMotivation: Accurate carotid artery segmentation in histopathological images is crucial for cardiovascular disease research, but deep learning development is constrained by scarce annotated data.

Method: Systematic evaluation of state-of-the-art segmentation models (U-Net, DeepLabV3+, SegFormer, SAM, MedSAM, MedSAM+UNet) on limited cardiovascular histology dataset with Bayesian hyperparameter optimization.

Result: Model performance was highly sensitive to data splits, with minor differences driven by statistical noise rather than true algorithmic superiority.

Conclusion: Standard benchmarking practices are limited in low-data clinical settings, and performance rankings may not reflect meaningful clinical utility due to data split sensitivity.

Abstract: Accurate segmentation of carotid artery structures in histopathological images is vital for advancing cardiovascular disease research and diagnosis. However, deep learning model development in this domain is constrained by the scarcity of annotated cardiovascular histopathological data. This study investigates a systematic evaluation of state-of-the-art deep learning segmentation models, including convolutional neural networks (U-Net, DeepLabV3+), a Vision Transformer (SegFormer), and recent foundation models (SAM, MedSAM, MedSAM+UNet), on a limited dataset of cardiovascular histology images. Despite employing an extensive hyperparameter optimization strategy with Bayesian search, our findings reveal that model performance is highly sensitive to data splits, with minor differences driven more by statistical noise than by true algorithmic superiority. This instability exposes the limitations of standard benchmarking practices in low-data clinical settings and challenges the assumption that performance rankings reflect meaningful clinical utility.

[214] BTCChat: Advancing Remote Sensing Bi-temporal Change Captioning with Multimodal Large Language Model

Yujie Li, Wenjia Xu, Yuanben Zhang, Zhiwei Wei, Mugen Peng

Main category: cs.CV

TL;DR: BTCChat is a new multimodal large language model that improves bi-temporal satellite image analysis with better change detection and understanding capabilities.

DetailsMotivation: Current methods for analyzing bi-temporal satellite imagery inadequately model temporal correlations and spatial semantic changes, limiting their effectiveness in applications like urban monitoring and disaster assessment.

Method: Proposed BTCChat with a Change Extraction module to capture temporal features and spatial semantic changes, plus a Prompt Augmentation mechanism to enhance attention to spatial details by incorporating contextual clues.

Result: BTCChat achieves state-of-the-art performance on change captioning and visual question answering tasks, demonstrating superior bi-temporal change understanding capabilities.

Conclusion: The proposed BTCChat model effectively addresses limitations in current bi-temporal image analysis methods and shows significant improvements in change detection and interpretation tasks.

Abstract: Bi-temporal satellite imagery supports critical applications such as urban development monitoring and disaster assessment. Although powerful multimodal large language models (MLLMs) have been applied in bi-temporal change analysis, previous methods process image pairs through direct concatenation, inadequately modeling temporal correlations and spatial semantic changes. This deficiency hampers visual-semantic alignment in change understanding, thereby constraining the overall effectiveness of current approaches. To address this gap, we propose BTCChat, a multi-temporal MLLM with advanced bi-temporal change understanding capability. BTCChat supports bi-temporal change captioning and retains single-image interpretation capability. To better capture temporal features and spatial semantic changes in image pairs, we design a Change Extraction module. Moreover, to enhance the model’s attention to spatial details, we introduce a Prompt Augmentation mechanism, which incorporates contextual clues into the prompt to enhance model performance. Experimental results demonstrate that BTCChat achieves state-of-the-art performance on change captioning and visual question answering tasks.

[215] A Fine-Grained Attention and Geometric Correspondence Model for Musculoskeletal Risk Classification in Athletes Using Multimodal Visual and Skeletal Features

Md. Abdur Rahman, Mohaimenul Azam Khan Raiaan, Tamanna Shermin, Md Rafiqul Islam, Mukhtar Hussain, Sami Azam

Main category: cs.CV

TL;DR: ViSK-GAT is a multimodal deep learning framework that combines visual and skeletal data to classify musculoskeletal risk in athletes, achieving 93.89% test accuracy and outperforming existing methods.

DetailsMotivation: Existing musculoskeletal risk assessment methods rely on single data types and fail in complex environments, creating a need for multimodal approaches that can provide reliable early risk detection for athletes.

Method: Proposes ViSK-GAT framework with Residual Block and Lightweight Transformer Block, featuring Fine-Grained Attention Module for inter-modal feature refinement and Multimodal Geometric Correspondence Module for cross-modal alignment. Uses custom multimodal dataset with 8 risk categories based on REBA system.

Result: Achieved 93.55% validation accuracy, 93.89% test accuracy, 93.86% precision, 93.85% F1 score, 93% Cohen’s Kappa and Matthews Correlation Coefficient. Low RMSE (0.1205) and MAE (0.0156) for probability distribution. Outperformed 9 transfer learning backbones.

Conclusion: ViSK-GAT advances AI implementation for musculoskeletal risk classification, enabling impactful early interventions in sports through effective multimodal learning of visual and skeletal data.

Abstract: Musculoskeletal disorders pose significant risks to athletes, and assessing risk early is important for prevention. However, most existing methods are designed for controlled settings and fail to reliably assess risk in complex environments due to their reliance on a single type of data. This research proposes ViSK-GAT (Visual-Skeletal Geometric Attention Transformer), a novel multimodal deep learning framework designed to classify musculoskeletal risk using visual and skeletal coordinate-based features. In addition, a custom multimodal dataset is constructed by combining visual data and skeletal coordinates for risk assessment. Each sample is labeled into eight risk categories based on the Rapid Entire Body Assessment system. ViSK-GAT combines a Residual Block with a Lightweight Transformer Block to learn spatial and temporal dependencies jointly. It incorporates two novel modules: the Fine-Grained Attention Module (FGAM), which enables precise inter-modal feature refinement through cross-attention between visual and skeletal inputs, and the Multimodal Geometric Correspondence Module (MGCM), which enhances cross-modal coherence by aligning image features with coordinate-based representations. ViSK-GAT achieved strong performance with validation and test accuracies of 93.55% and 93.89%, respectively; a precision of 93.86%; an F1 score of 93.85%; and Cohen’s Kappa and Matthews Correlation Coefficient of 93%. The regression results also indicated a low Root Mean Square Error of the predicted probability distribution of 0.1205 and a corresponding Mean Absolute Error of 0.0156. Compared to nine popular transfer learning backbones, ViSK-GAT consistently outperformed previous methods. The ViSK-GAT model advances artificial intelligence implementation and application, transforming musculoskeletal risk classification and enabling impactful early interventions in sports.

[216] Compression Beyond Pixels: Semantic Compression with Multimodal Foundation Models

Ruiqi Shen, Haotian Wu, Wenjing Zhang, Jiangjing Hu, Deniz Gunduz

Main category: cs.CV

TL;DR: A novel semantic image compression method using CLIP embeddings that achieves extreme compression (2-3*10^-3 bpp) while preserving semantic information across diverse tasks and datasets, outperforming traditional methods by requiring less than 5% of the bitrate.

DetailsMotivation: Emerging applications prioritize semantic preservation over pixel-level reconstruction and demand robust performance across diverse data distributions and downstream tasks, calling for advanced semantic compression paradigms beyond traditional rate-distortion optimization.

Method: Proposes compressing CLIP feature embeddings into minimal bits instead of reconstructing pixels, leveraging the zero-shot and representational capabilities of multimodal foundation models to preserve semantic information.

Result: Achieves average bit rate of approximately 2-3*10^-3 bits per pixel (less than 5% of mainstream methods), maintains semantic integrity across benchmark datasets, and exhibits zero-shot robustness across diverse data distributions and downstream tasks even under extreme compression.

Conclusion: The CLIP-based semantic compression approach provides an effective paradigm shift from pixel reconstruction to semantic preservation, enabling extreme compression rates while maintaining robust performance across various applications and data distributions.

Abstract: Recent deep learning-based methods for lossy image compression achieve competitive rate-distortion performance through extensive end-to-end training and advanced architectures. However, emerging applications increasingly prioritize semantic preservation over pixel-level reconstruction and demand robust performance across diverse data distributions and downstream tasks. These challenges call for advanced semantic compression paradigms. Motivated by the zero-shot and representational capabilities of multimodal foundation models, we propose a novel semantic compression method based on the contrastive language-image pretraining (CLIP) model. Rather than compressing images for reconstruction, we propose compressing the CLIP feature embeddings into minimal bits while preserving semantic information across different tasks. Experiments show that our method maintains semantic integrity across benchmark datasets, achieving an average bit rate of approximately 2-3 × 10^-3 bits per pixel. This is less than 5% of the bitrate required by mainstream image compression approaches for comparable performance. Remarkably, even under extreme compression, the proposed approach exhibits zero-shot robustness across diverse data distributions and downstream tasks.
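
To see why the bitrate lands in this range, consider naive scalar quantization of a CLIP embedding: at 1 bit per dimension, a 512-d vector costs 512 bits, about 1e-2 bpp for a 224x224 image, and fewer retained dimensions or entropy coding push this toward the reported 2-3 × 10^-3 bpp. The sketch below is illustrative arithmetic, not the paper's codec.

```python
import numpy as np

def compress_embedding(e: np.ndarray, bits_per_dim: int = 1) -> np.ndarray:
    # Uniformly quantize a (roughly unit-norm) CLIP embedding.
    levels = 2 ** bits_per_dim
    e = np.clip(e, -1.0, 1.0)
    return np.round((e + 1.0) / 2.0 * (levels - 1)).astype(np.int32)

def decompress_embedding(q: np.ndarray, bits_per_dim: int = 1) -> np.ndarray:
    levels = 2 ** bits_per_dim
    e = q.astype(np.float32) / (levels - 1) * 2.0 - 1.0
    return e / (np.linalg.norm(e) + 1e-8)  # re-normalize for cosine matching
```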

[217] AttriPrompt: Dynamic Prompt Composition Learning for CLIP

Qiqi Zhan, Shiwei Li, Qingjie Liu, Yunhong Wang

Main category: cs.CV

TL;DR: AttriPrompt enhances CLIP’s performance through content-aware dynamic prompting using visual features to retrieve relevant prompts, achieving fine-grained alignment and preventing overfitting with dual-stream contrastive learning and self-regularization.

DetailsMotivation: Current deep text prompting methods suffer from over-reliance on contrastive learning that neglects fine-grained feature optimization and use static prompts that lack content-aware adaptation.

Method: Uses Attribute Retrieval module to cluster visual features and retrieve semantically similar prompts from a pool, concatenates them to text encoder inputs, employs Dual-stream Contrastive Learning for fine-grained alignment, and Self-Regularization to prevent overfitting.

Result: Achieves up to 7.37% improvement in base-to-novel setting across three benchmarks, demonstrating superior performance over state-of-the-art methods and strong cross-domain knowledge transfer capabilities.

Conclusion: AttriPrompt effectively addresses limitations of current prompting methods, making vision-language pre-trained models more viable for real-world implementation through content-aware dynamic prompting and fine-grained optimization.

Abstract: The evolution of prompt learning methodologies has driven exploration of deeper prompt designs to enhance model performance. However, current deep text prompting approaches suffer from two critical limitations: (1) over-reliance on contrastive learning objectives that prioritize high-level semantic alignment while neglecting fine-grained feature optimization; and (2) static prompts across all input categories, preventing content-aware adaptation. To address these limitations, we propose AttriPrompt, a novel framework that enhances and refines textual semantic representations by leveraging the intermediate-layer features of CLIP’s vision encoder. We designed an Attribute Retrieval module that first clusters visual features from each layer. The aggregated visual features retrieve semantically similar prompts from a prompt pool, which are then concatenated to the input of every layer in the text encoder. Leveraging hierarchical visual information embedded in prompted text features, we introduce Dual-stream Contrastive Learning to realize fine-grained alignment. Furthermore, we introduce a Self-Regularization mechanism by applying explicit regularization constraints between the prompted and non-prompted text features to prevent overfitting on limited training data. Extensive experiments across three benchmarks demonstrate AttriPrompt’s superiority over state-of-the-art methods, achieving up to 7.37% improvement in the base-to-novel setting. The observed strength of our method in cross-domain knowledge transfer positions vision-language pre-trained models as more viable solutions for real-world implementation.
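
The retrieval step reduces to a similarity lookup: aggregate visual features into a query and pick the closest prompts from the pool. The sketch below substitutes mean pooling for the paper's per-layer clustering and uses assumed shapes.

```python
import torch
import torch.nn.functional as F

def retrieve_prompts(visual_feats: torch.Tensor,
                     prompt_pool: torch.Tensor,
                     k: int = 4) -> torch.Tensor:
    # visual_feats: (N, D) intermediate-layer visual features from CLIP.
    # prompt_pool:  (P, D) learnable prompt vectors.
    query = F.normalize(visual_feats.mean(dim=0, keepdim=True), dim=-1)  # (1, D)
    pool = F.normalize(prompt_pool, dim=-1)
    idx = (query @ pool.t()).topk(k, dim=-1).indices.squeeze(0)          # top-k by cosine
    return prompt_pool[idx]  # (k, D), concatenated to the text-encoder input
```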

[218] Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching

Feng Wang, Zihao Yu

Main category: cs.CV

TL;DR: Proposes CPS method to eliminate noise artifacts in SDE-based RL for Flow Matching models, enabling more accurate reward modeling and stable convergence.

DetailsMotivation: SDE-based sampling in RL for Flow Matching introduces noise artifacts that harm reward learning and image generation quality.

Method: Draws inspiration from DDIM to reformulate sampling process with Coefficients-Preserving Sampling (CPS) that eliminates excess stochasticity.

Result: CPS removes noise artifacts, enables more accurate reward modeling, and allows faster, more stable convergence for RL optimizers like Flow-GRPO and Dance-GRPO.

Conclusion: The proposed CPS method successfully addresses noise issues in SDE-based RL approaches for Flow Matching, improving both generation quality and learning stability.

Abstract: Reinforcement Learning (RL) has recently emerged as a powerful technique for improving image and video generation in Diffusion and Flow Matching models, specifically for enhancing output quality and alignment with prompts. A critical step for applying online RL methods on Flow Matching is the introduction of stochasticity into the deterministic framework, commonly realized by Stochastic Differential Equation (SDE). Our investigation reveals a significant drawback to this approach: SDE-based sampling introduces pronounced noise artifacts in the generated images, which we found to be detrimental to the reward learning process. A rigorous theoretical analysis traces the origin of this noise to an excess of stochasticity injected during inference. To address this, we draw inspiration from Denoising Diffusion Implicit Models (DDIM) to reformulate the sampling process. Our proposed method, Coefficients-Preserving Sampling (CPS), eliminates these noise artifacts. This leads to more accurate reward modeling, ultimately enabling faster and more stable convergence for reinforcement learning-based optimizers like Flow-GRPO and Dance-GRPO. Code will be released at https://github.com/IamCreateAI/FlowCPS
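
The deterministic/stochastic contrast that motivates the work fits in two Euler updates: the ODE step used by plain Flow Matching versus the SDE step that injects exploration noise for RL. This is background, not CPS itself; the paper's contribution is a DDIM-inspired reformulation that keeps the sampling coefficients while removing the excess stochasticity.

```python
import torch

def ode_euler_step(x: torch.Tensor, v: torch.Tensor, dt: float) -> torch.Tensor:
    # Deterministic flow-matching update: follow the learned velocity field.
    return x + v * dt

def sde_euler_step(x: torch.Tensor, v: torch.Tensor, dt: float,
                   sigma: float) -> torch.Tensor:
    # Stochastic variant used for online RL: the injected noise enables
    # exploration but, per the paper, leaves visible artifacts when too strong.
    return x + v * dt + sigma * (dt ** 0.5) * torch.randn_like(x)
```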

[219] Dual Interaction Network with Cross-Image Attention for Medical Image Segmentation

Jeonghyun Noh, Wangsu Jeon, Jinsun Park

Main category: cs.CV

TL;DR: Proposes a dual interactive fusion module (DIFM) with cross-attention and multi-scale boundary loss to improve medical image segmentation by effectively combining original and enhanced images while preserving diagnostic information.

DetailsMotivation: Medical image segmentation is hindered by noise, blurriness, and low contrast. Traditional enhancement techniques may alter crucial diagnostic information, and conventional fusion methods fail to fully leverage the advantages of both original and enhanced images while suppressing enhancement side effects.

Method: Dual Interactive Fusion Module (DIFM) using bidirectional cross-attention to attend to spatial information across different images, refined via global spatial attention. Also introduces multi-scale boundary loss based on gradient extraction to improve boundary segmentation accuracy.

Result: Experimental results on ACDC and Synapse datasets demonstrate superior performance both quantitatively and qualitatively compared to existing methods.

Conclusion: The proposed DIFM effectively exploits mutual complementary information from original and enhanced medical images, producing enhanced features with important spatial characteristics while improving segmentation accuracy at object boundaries.

Abstract: Medical image segmentation is a crucial method for assisting professionals in diagnosing various diseases through medical imaging. However, various factors such as noise, blurriness, and low contrast often hinder the accurate diagnosis of diseases. While numerous image enhancement techniques can mitigate these issues, they may also alter crucial information needed for accurate diagnosis in the original image. Conventional image fusion strategies, such as feature concatenation can address this challenge. However, they struggle to fully leverage the advantages of both original and enhanced images while suppressing the side effects of the enhancements. To overcome the problem, we propose a dual interactive fusion module (DIFM) that effectively exploits mutual complementary information from the original and enhanced images. DIFM employs cross-attention bidirectionally to simultaneously attend to corresponding spatial information across different images, subsequently refining the complementary features via global spatial attention. This interaction leverages low- to high-level features implicitly associated with diverse structural attributes like edges, blobs, and object shapes, resulting in enhanced features that embody important spatial characteristics. In addition, we introduce a multi-scale boundary loss based on gradient extraction to improve segmentation accuracy at object boundaries. Experimental results on the ACDC and Synapse datasets demonstrate the superiority of the proposed method quantitatively and qualitatively. Code available at: https://github.com/JJeong-Gari/DIN
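
The bidirectional cross-attention at the heart of DIFM can be sketched with two standard attention layers, one per direction; shapes and head counts are assumptions.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Let original and enhanced image features attend to each other."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn_o2e = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_e2o = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_orig: torch.Tensor, feat_enh: torch.Tensor):
        # feat_*: (B, L, D) flattened spatial tokens of each image.
        o, _ = self.attn_o2e(feat_orig, feat_enh, feat_enh)   # original queries enhanced
        e, _ = self.attn_e2o(feat_enh, feat_orig, feat_orig)  # enhanced queries original
        return feat_orig + o, feat_enh + e
```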

[220] StripDet: Strip Attention-Based Lightweight 3D Object Detection from Point Cloud

Weichao Wang, Wendong Mao, Zhongfeng Wang

Main category: cs.CV

TL;DR: StripDet is a lightweight 3D object detection framework that uses novel Strip Attention Blocks and hierarchical backbone to achieve efficient on-device performance with 7x parameter reduction while maintaining high accuracy.

DetailsMotivation: High-accuracy 3D object detection models from point clouds have substantial computational and memory requirements, making deployment on edge devices challenging.

Method: Proposes Strip Attention Block (SAB) that decomposes 2D convolutions into asymmetric strip convolutions to capture long-range dependencies with linear complexity. Integrates SAB with depthwise separable convolutions and multiscale fusion in a hardware-friendly hierarchical backbone.

Result: Achieves 79.97% mAP for car detection on KITTI dataset with only 0.65M parameters, surpassing PointPillars baseline with 7x parameter reduction. Outperforms recent lightweight and knowledge distillation methods.

Conclusion: StripDet provides superior accuracy-efficiency trade-off and establishes itself as a practical solution for real-world 3D detection on edge devices.

Abstract: The deployment of high-accuracy 3D object detection models from point cloud remains a significant challenge due to their substantial computational and memory requirements. To address this, we introduce StripDet, a novel lightweight framework designed for on-device efficiency. First, we propose the novel Strip Attention Block (SAB), a highly efficient module designed to capture long-range spatial dependencies. By decomposing standard 2D convolutions into asymmetric strip convolutions, SAB efficiently extracts directional features while reducing computational complexity from quadratic to linear. Second, we design a hardware-friendly hierarchical backbone that integrates SAB with depthwise separable convolutions and a simple multiscale fusion strategy, achieving end-to-end efficiency. Extensive experiments on the KITTI dataset validate StripDet’s superiority. With only 0.65M parameters, our model achieves a 79.97% mAP for car detection, surpassing the baseline PointPillars with a 7x parameter reduction. Furthermore, StripDet outperforms recent lightweight and knowledge distillation-based methods, achieving a superior accuracy-efficiency trade-off while establishing itself as a practical solution for real-world 3D detection on edge devices.
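
The complexity claim follows from the decomposition itself: a k x k kernel needs k^2 multiply-accumulates per output, while a 1 x k plus a k x 1 strip pair needs only 2k. Below is a minimal sketch of a strip attention block under one plausible gating design; the exact SAB internals may differ.

```python
import torch
import torch.nn as nn

class StripAttentionBlock(nn.Module):
    """Capture long-range directional context with two strip convolutions."""

    def __init__(self, channels: int, k: int = 7):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, (1, k),
                                    padding=(0, k // 2), groups=channels)
        self.vertical = nn.Conv2d(channels, channels, (k, 1),
                                  padding=(k // 2, 0), groups=channels)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sequential strips approximate a k x k receptive field at O(k) cost.
        attn = torch.sigmoid(self.proj(self.vertical(self.horizontal(x))))
        return x * attn  # directional context gates the input features
```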

[221] Neural Bloom: A Deep Learning Approach to Real-Time Lighting

Rafal Karp, Dawid Gruszka, Tomasz Trzcinski

Main category: cs.CV

TL;DR: Neural network-based bloom lighting methods (NBL and FastNBL) achieve 12-28% faster performance than traditional techniques while maintaining high-quality visual effects.

DetailsMotivation: Traditional bloom lighting techniques rely on computationally expensive operations like multiple blur applications and texture sampling, which create bottlenecks in real-time rendering and consume significant computational resources.

Method: Proposed two neural network-based approaches: Neural Bloom Lighting (NBL) and Fast Neural Bloom Lighting (FastNBL) that generate brightness masks from 3D scene views using neural networks instead of traditional blur operations.

Result: Both methods outperformed state-of-the-art bloom implementations, with FastNBL being 28% faster and NBL 12% faster, while maintaining high-quality bloom effects across various 3D scenes.

Conclusion: Neural network-based approaches enable faster, high-quality bloom lighting effects, saving computational resources and advancing real-time rendering capabilities for more immersive and smooth high-FPS experiences.

Abstract: We propose a novel method to generate the bloom lighting effect in real time using neural networks. Our solution generates a brightness mask from a given 3D scene view up to 30% faster than state-of-the-art methods. Traditional techniques rely on multiple blur applications and texture sampling, and often involve conditional branching in their implementations. These operations occupy a large portion of the execution time. We solve this problem by proposing two neural network-based bloom lighting methods, Neural Bloom Lighting (NBL) and Fast Neural Bloom Lighting (FastNBL), focusing on their quality and performance. Both methods were tested on a variety of 3D scenes, with evaluations conducted on brightness mask accuracy and inference speed. The main contribution of this work is that both methods produce high-quality bloom effects while outperforming the standard state-of-the-art bloom implementation, with FastNBL being 28% faster and NBL 12% faster. These findings show that realistic bloom lighting can be achieved faster, moving real-time environments closer to full realism. This improvement saves computational resources, a major bottleneck in real-time rendering, and is crucial for sustaining immersion and ensuring smooth experiences in high-FPS environments while maintaining high-quality realism.
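
For contrast with the learned brightness mask, the traditional pipeline being replaced is a bright-pass threshold followed by blurring and additive composition. A reference implementation of that baseline (parameters illustrative):

```python
import torch
import torch.nn.functional as F

def classic_bloom(hdr: torch.Tensor, threshold: float = 1.0,
                  sigma: float = 3.0, intensity: float = 0.8) -> torch.Tensor:
    # hdr: (B, 3, H, W) linear-light image.
    mask = torch.clamp(hdr - threshold, min=0.0)          # bright-pass
    k = int(6 * sigma) | 1                                # odd kernel size
    xs = torch.arange(k, dtype=hdr.dtype, device=hdr.device) - k // 2
    g = torch.exp(-0.5 * (xs / sigma) ** 2)
    g = g / g.sum()
    kh = g.view(1, 1, 1, k).repeat(3, 1, 1, 1)            # separable Gaussian
    kv = g.view(1, 1, k, 1).repeat(3, 1, 1, 1)
    blur = F.conv2d(mask, kh, padding=(0, k // 2), groups=3)
    blur = F.conv2d(blur, kv, padding=(k // 2, 0), groups=3)
    return hdr + intensity * blur                         # additive composition
```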

[222] Spatial-Aware Self-Supervision for Medical 3D Imaging with Multi-Granularity Observable Tasks

Yiqin Zhang, Meiling Chen, Zhengjie Zhang

Main category: cs.CV

TL;DR: A self-supervised learning method for medical 3D imaging that uses three interpretable sub-tasks to capture spatial semantics while maintaining performance comparable to existing methods.

DetailsMotivation: Current self-supervised methods in medical visualization lack interpretability and intuitive demonstration of 3D spatial learning, as they are influenced by 2D visual domain designs.

Method: Proposes three interpretable sub-tasks designed with observable principles to capture spatially relevant semantics in 3D medical imaging, incorporating multi-granularity spatial relationship modeling to maintain training stability.

Result: Experimental results show the approach delivers performance on par with current methodologies while enabling intuitive understanding of the self-supervised learning process.

Conclusion: The method successfully addresses interpretability issues in medical 3D self-supervised learning while maintaining competitive performance through carefully designed spatial relationship modeling tasks.

Abstract: The application of self-supervised techniques has become increasingly prevalent within medical visualization tasks, primarily due to its capacity to mitigate the data scarcity prevalent in the healthcare sector. The majority of current works are influenced by designs originating in the generic 2D visual domain, which lack an intuitive demonstration of the model’s learning process regarding 3D spatial knowledge. Consequently, these methods often fall short in terms of medical interpretability. We propose a method consisting of three sub-tasks to capture the spatially relevant semantics in medical 3D imaging. Their design adheres to observable principles to ensure interpretability while minimizing the resulting performance loss. By leveraging the enhanced semantic depth offered by the extra dimension in 3D imaging, this approach incorporates multi-granularity spatial relationship modeling to maintain training stability. Experimental findings suggest that our approach is capable of delivering performance that is on par with current methodologies, while facilitating an intuitive understanding of the self-supervised learning process.

[223] OmniStyle2: Scalable and High Quality Artistic Style Transfer Data Generation via Destylization

Ye Wang, Zili Yi, Yibo Zhang, Peng Zheng, Xuping Xie, Jiang Lin, Yilin Wang, Rui Ma

Main category: cs.CV

TL;DR: OmniStyle2 reframes style transfer as a data problem using destylization to create DST-100K dataset, training a simple FLUX-based model that outperforms state-of-the-art methods.

DetailsMotivation: Addresses the fundamental challenge of lacking ground-truth data in artistic style transfer by creating authentic supervision signals through destylization.

Method: Develops DST (text-guided destylization model) and DST-Filter (multi-stage evaluation with Chain-of-Thought reasoning) to build DST-100K dataset, then trains a simple feed-forward model based on FLUX.1-dev.

Result: OmniStyle2 consistently surpasses state-of-the-art methods across both qualitative and quantitative benchmarks despite its simplicity.

Conclusion: Scalable data generation via destylization provides a reliable supervision paradigm that overcomes the ground-truth data limitation in artistic style transfer.

Abstract: OmniStyle2 introduces a novel approach to artistic style transfer by reframing it as a data problem. Our key insight is destylization, reversing style transfer by removing stylistic elements from artworks to recover natural, style-free counterparts. This yields DST-100K, a large-scale dataset that provides authentic supervision signals by aligning real artistic styles with their underlying content. To build DST-100K, we develop (1) DST, a text-guided destylization model that reconstructs style-free content, and (2) DST-Filter, a multi-stage evaluation model that employs Chain-of-Thought reasoning to automatically discard low-quality pairs while ensuring content fidelity and style accuracy. Leveraging DST-100K, we train OmniStyle2, a simple feed-forward model based on FLUX.1-dev. Despite its simplicity, OmniStyle2 consistently surpasses state-of-the-art methods across both qualitative and quantitative benchmarks. Our results demonstrate that scalable data generation via destylization provides a reliable supervision paradigm, overcoming the fundamental challenge posed by the lack of ground-truth data in artistic style transfer.

[224] ConstStyle: Robust Domain Generalization with Unified Style Transformation

Nam Duong Tran, Nam Nguyen Phuong, Hieu H. Pham, Phi Le Nguyen, My T. Thai

Main category: cs.CV

TL;DR: ConstStyle proposes a novel domain generalization approach that maps all samples to a unified domain to capture domain-invariant features and bridge domain gaps, achieving significant performance improvements especially with limited training domains.

DetailsMotivation: Deep neural networks suffer performance drops when test data distribution differs from training data. Existing DG methods struggle with limited training domains or large gaps between seen and unseen domains.

Method: Leverages a unified domain to capture domain-invariant features. Maps all training samples onto this unified domain optimized for seen domains, and similarly projects unseen domain samples during testing to align both training and testing data.

Result: Extensive experiments show ConstStyle consistently outperforms existing methods across diverse scenarios. With limited seen domains, it boosts accuracy up to 19.82% compared to the next best approach.

Conclusion: ConstStyle effectively reduces the impact of domain shifts by aligning training and testing data within a unified domain, demonstrating strong performance even with large domain gaps or few seen domains.

Abstract: Deep neural networks often suffer performance drops when test data distribution differs from training data. Domain Generalization (DG) aims to address this by focusing on domain-invariant features or augmenting data for greater diversity. However, these methods often struggle with limited training domains or significant gaps between seen (training) and unseen (test) domains. To enhance DG robustness, we hypothesize that it is essential for the model to be trained on data from domains that closely resemble unseen test domains-an inherently difficult task due to the absence of prior knowledge about the unseen domains. Accordingly, we propose ConstStyle, a novel approach that leverages a unified domain to capture domain-invariant features and bridge the domain gap with theoretical analysis. During training, all samples are mapped onto this unified domain, optimized for seen domains. During testing, unseen domain samples are projected similarly before predictions. By aligning both training and testing data within this unified domain, ConstStyle effectively reduces the impact of domain shifts, even with large domain gaps or few seen domains. Extensive experiments demonstrate that ConstStyle consistently outperforms existing methods across diverse scenarios. Notably, when only a limited number of seen domains are available, ConstStyle can boost accuracy up to 19.82% compared to the next best approach.
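
A minimal sketch of one way the "unified domain" mapping could look, assuming an AdaIN-style replacement of per-sample feature statistics with learned, shared ones; the module name and the learnable unified mean/std are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class UnifiedStyleProjector(nn.Module):
    """Illustrative sketch: project feature maps onto a single 'unified' style
    by replacing per-sample channel statistics (AdaIN-style). The unified
    mean/std are hypothetical learnable parameters, not the paper's design."""

    def __init__(self, num_channels: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # Unified-domain statistics, optimized during training on seen domains.
        self.unified_mean = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.unified_std = nn.Parameter(torch.ones(1, num_channels, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Per-sample, per-channel statistics carry the domain/style signal.
        mu = x.mean(dim=(2, 3), keepdim=True)
        sigma = x.std(dim=(2, 3), keepdim=True) + self.eps
        # Normalize away the sample's own style, re-inject the unified style.
        return (x - mu) / sigma * self.unified_std + self.unified_mean

# At test time the same projection is applied to unseen-domain features,
# so training and testing data are aligned within one style domain.
feats = torch.randn(4, 64, 32, 32)
aligned = UnifiedStyleProjector(64)(feats)
```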

[225] Multi-Strategy Guided Diffusion via Sparse Masking Temporal Reweighting Distribution Correction

Zekun Zhou, Yanru Gong, Liu Shi, Qiegen Liu

Main category: cs.CV

TL;DR: STRIDE diffusion model for sparse-view CT reconstruction with temporal reweighting guidance and dual-network architecture achieves state-of-the-art performance in image quality metrics.

DetailsMotivation: To address the challenges of sparse-view CT reconstruction where limited projection views lead to incomplete data and artifacts, requiring advanced generative models to complete missing information while preserving structural details.

Method: Proposes STRIDE diffusion model with: 1) Joint training guided by sparse conditional probabilities, 2) Temporally varying sparse condition reweighting strategy, 3) Linear regression for distribution shift correction, 4) Dual-network parallel architecture for multi-frequency component optimization.

Result: Achieves best improvement of 2.58 dB in PSNR, 2.37% increase in SSIM, and 0.236 reduction in MSE compared to baseline methods. Excellent generalization and robustness in structural consistency, detail restoration, and artifact suppression.

Conclusion: STRIDE effectively addresses sparse-view CT reconstruction challenges through innovative temporal reweighting guidance and dual-network architecture, demonstrating superior performance in both quantitative metrics and qualitative image quality.

Abstract: Diffusion models have demonstrated remarkable generative capabilities in image processing tasks. We propose a Sparse condition Temporal Reweighted Integrated Distribution Estimation guided diffusion model (STRIDE) for sparse-view CT reconstruction. Specifically, we design a joint training mechanism guided by sparse conditional probabilities to facilitate the model's effective learning of missing-projection-view completion and global information modeling. Based on systematic theoretical analysis, we propose a temporally varying sparse-condition reweighting guidance strategy that dynamically adjusts weights during the progressive denoising process from pure noise to the real image, enabling the model to progressively perceive sparse-view information. Linear regression is employed to correct distributional shifts between known and generated data, mitigating inconsistencies arising during the guidance process. Furthermore, we construct a dual-network parallel architecture to perform global correction and optimization across multiple sub-frequency components, thereby effectively improving the model's capability in both detail restoration and structural preservation, ultimately achieving high-quality image reconstruction. Experimental results on both public and real datasets demonstrate that the proposed method achieves the best improvement of 2.58 dB in PSNR, an increase of 2.37% in SSIM, and a reduction of 0.236 in MSE compared to the best-performing baseline methods. The reconstructed images exhibit excellent generalization and robustness in terms of structural consistency, detail restoration, and artifact suppression.
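
A minimal sketch of what a temporally varying condition-reweighting step can look like inside a generic diffusion sampling loop; the linear schedule and the function names are assumptions, not the paper's exact formulation.

```python
import torch

def reweighted_eps(x_t, t, T, denoiser, sparse_cond):
    """Illustrative sketch, not the paper's exact rule: blend unconditional
    and sparse-condition noise predictions with a time-varying weight. Early
    steps (t near T, pure noise) rely mostly on the generative prior; late
    steps weight the sparse-view condition more heavily."""
    w = 1.0 - t / T  # hypothetical linear schedule in [0, 1]
    return (1.0 - w) * denoiser(x_t, t, None) + w * denoiser(x_t, t, sparse_cond)

# Toy usage with a stand-in denoiser.
denoiser = lambda x, t, c: torch.zeros_like(x)
x = torch.randn(1, 1, 64, 64)
eps = reweighted_eps(x, t=500, T=1000, denoiser=denoiser, sparse_cond=None)
```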

[226] S-LAM3D: Segmentation-Guided Monocular 3D Object Detection via Feature Space Fusion

Diana-Alexandra Sas, Florin Oniga

Main category: cs.CV

TL;DR: A decoupled strategy that injects precomputed segmentation priors into feature space to improve monocular 3D object detection, particularly for small objects like pedestrians and cyclists.

DetailsMotivation: Monocular 3D object detection is challenging due to lack of depth cues in single 2D images. Existing methods rely on CNNs or Transformers but struggle with depth estimation. The paper aims to leverage segmentation information to guide detection without expanding the model architecture.

Method: Proposes injecting precomputed segmentation information priors directly into the feature space to guide 3D detection. Uses a decoupled strategy that doesn’t require additional prediction branches or joint learning of priors.

Result: Evaluated on KITTI 3D Object Detection Benchmark, outperforms equivalent RGB-only architectures for small objects (pedestrians and cyclists). Demonstrates that understanding input data can reduce need for additional sensors or training data.

Conclusion: Segmentation priors effectively improve monocular 3D detection performance, especially for challenging small objects, without requiring model expansion or additional training complexity.

Abstract: Monocular 3D Object Detection represents a challenging Computer Vision task due to the nature of the input used, which is a single 2D image lacking any depth cues, making depth estimation an ill-posed problem. Existing solutions leverage the information extracted from the input by using Convolutional Neural Networks or Transformer architectures as feature extraction backbones, followed by specific detection heads for 3D parameter prediction. In this paper, we introduce a decoupled strategy based on injecting precomputed segmentation information priors and fusing them directly into the feature space to guide the detection, without expanding the detection model or jointly learning the priors. The focus is on evaluating the impact of additional segmentation information on existing detection pipelines without adding additional prediction branches. The proposed method is evaluated on the KITTI 3D Object Detection Benchmark, outperforming the equivalent architecture that relies only on RGB image features for small objects in the scene (pedestrians and cyclists), and showing that a better understanding of the input data can reduce the need for additional sensors or training data.

[227] Motion Aware ViT-based Framework for Monocular 6-DoF Spacecraft Pose Estimation

Jose Sosa, Dan Pineau, Arunkumar Rathinam, Abdelrahman Shabayek, Djamila Aouada

Main category: cs.CV

TL;DR: Deep learning framework for spacecraft pose estimation that combines Vision Transformer features with optical flow motion cues to improve 2D keypoint localization and 6-DoF pose estimation performance.

DetailsMotivation: Existing monocular spacecraft pose estimation methods rely on single static images and fail to exploit valuable temporal information that is inherent in space operations, limiting their performance.

Method: Adapts human pose estimation framework to spacecraft domain by integrating motion-aware heatmaps and optical flow to capture motion dynamics. Uses ViT encoder for image features combined with pre-trained optical flow model cues for 2D keypoint localization, followed by Perspective-n-Point solver for 6-DoF pose recovery.

Result: Demonstrates improved performance over single-image baselines in both 2D keypoint localization and 6-DoF pose estimation on SPADES-RGB dataset. Shows promising generalization capabilities on different data distributions from SPARK-2024 dataset.

Conclusion: The integration of temporal motion information through optical flow and motion-aware heatmaps significantly enhances spacecraft pose estimation performance and generalization compared to static single-image approaches.

Abstract: Monocular 6-DoF pose estimation plays an important role in multiple spacecraft missions. Most existing pose estimation approaches rely on single images with static keypoint localisation, failing to exploit valuable temporal information inherent to space operations. In this work, we adapt a deep learning framework from human pose estimation to the spacecraft pose estimation domain that integrates motion-aware heatmaps and optical flow to capture motion dynamics. Our approach combines image features from a Vision Transformer (ViT) encoder with motion cues from a pre-trained optical flow model to localise 2D keypoints. Using the estimates, a Perspective-n-Point (PnP) solver recovers 6-DoF poses from known 2D-3D correspondences. We train and evaluate our method on the SPADES-RGB dataset and further assess its generalisation on real and synthetic data from the SPARK-2024 dataset. Overall, our approach demonstrates improved performance over single-image baselines in both 2D keypoint localisation and 6-DoF pose estimation. Furthermore, it shows promising generalisation capabilities when testing on different data distributions.
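
The final pose-recovery step is a standard Perspective-n-Point solve from known 2D-3D correspondences, which OpenCV provides directly; the keypoints and camera intrinsics below are toy values for illustration.

```python
import cv2
import numpy as np

# Known 3D keypoints on the spacecraft model (object frame, metres) and
# their 2D detections in the image (pixels) -- toy values, not real data.
object_pts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1],
                       [1, 1, 0], [1, 0, 1]], dtype=np.float64)
image_pts = np.array([[320, 240], [400, 238], [318, 160], [322, 250],
                      [402, 158], [405, 252]], dtype=np.float64)

K = np.array([[800.0, 0.0, 320.0],   # assumed pinhole intrinsics
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)                    # assume no lens distortion

ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, dist,
                              flags=cv2.SOLVEPNP_ITERATIVE)
R, _ = cv2.Rodrigues(rvec)            # rotation matrix + translation = 6-DoF pose
```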

[228] Khana: A Comprehensive Indian Cuisine Dataset

Omkar Prabhu

Main category: cs.CV

TL;DR: Khana is a new benchmark dataset for Indian food image analysis with 131K images across 80 categories, addressing the gap in comprehensive Indian cuisine datasets for classification, segmentation, and retrieval tasks.

DetailsMotivation: There is a lack of comprehensive labeled datasets capturing the nuances of Indian cuisine despite its vast regional diversity and complex preparations, which limits the development of accurate food recognition and related applications.

Method: Created Khana dataset with around 131K images spread across 80 labels at 500x500 resolution, establishing a taxonomy of Indian cuisine and evaluating state-of-the-art models on classification, segmentation, and retrieval tasks.

Result: The paper presents Khana as a comprehensive benchmark that bridges the research-development gap, providing a valuable resource for both researchers and developers working with Indian cuisine applications.

Conclusion: Khana successfully fills the critical gap in Indian food image datasets and serves as a challenging benchmark for advancing food-related computer vision applications specific to Indian cuisine.

Abstract: As global interest in diverse culinary experiences grows, food image models are essential for improving food-related applications by enabling accurate food recognition, recipe suggestions, dietary tracking, and automated meal planning. Despite the abundance of food datasets, a noticeable gap remains in capturing the nuances of Indian cuisine due to its vast regional diversity, complex preparations, and the lack of comprehensive labeled datasets that cover its full breadth. To close this gap, we introduce Khana, a new benchmark dataset for food image classification, segmentation, and retrieval of dishes from Indian cuisine. Khana establishes a taxonomy of Indian cuisine and offers around 131K images spread across 80 labels, each with a resolution of 500x500 pixels. This paper describes the dataset creation process and evaluates state-of-the-art models on classification, segmentation, and retrieval as baselines. Khana bridges the gap between research and development by providing a comprehensive and challenging benchmark for researchers while also serving as a valuable resource for developers creating real-world applications that leverage the rich tapestry of Indian cuisine. Webpage: https://khana.omkar.xyz

[229] Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models

Jaemin Son, Sujin Choi, Inyong Yun

Main category: cs.CV

TL;DR: Lightweight token pruning framework reduces computational costs for vision-language models in document understanding by filtering non-informative background regions while maintaining accuracy.

DetailsMotivation: Vision-language models show impressive document understanding capabilities but have high computational demands that need to be mitigated.

Method: A binary patch-level classifier removes non-text areas from document images, followed by a max-pooling refinement step to recover fragmented text regions and enhance spatial coherence.

Result: Experiments on real-world document datasets show substantial reduction in computational costs while maintaining comparable accuracy.

Conclusion: The proposed token pruning framework effectively reduces compute burdens for VLMs in document understanding tasks without sacrificing performance.

Abstract: Recent progress in vision-language models (VLMs) has led to impressive results in document understanding tasks, but their high computational demands remain a challenge. To mitigate the compute burdens, we propose a lightweight token pruning framework that filters out non-informative background regions from document images prior to VLM processing. A binary patch-level classifier removes non-text areas, and a max-pooling refinement step recovers fragmented text regions to enhance spatial coherence. Experiments on real-world document datasets demonstrate that our approach substantially lowers computational costs, while maintaining comparable accuracy.
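
A minimal sketch of the prune-then-refine idea: threshold patch-level text scores into a binary mask, dilate it with stride-1 max-pooling to restore fragmented text regions, and keep only the surviving tokens. The tensor shapes, threshold, and names are assumptions.

```python
import torch
import torch.nn.functional as F

def refine_and_prune(patch_tokens, text_logits, threshold=0.5):
    """Illustrative sketch, not the paper's exact configuration.

    patch_tokens: (B, H, W, D) patch embeddings
    text_logits:  (B, H, W) per-patch text/background scores
    """
    mask = (text_logits.sigmoid() > threshold).float()          # (B, H, W)
    # Max-pooling with stride 1 acts as a morphological dilation on the
    # mask, reconnecting fragmented text regions for spatial coherence.
    mask = F.max_pool2d(mask.unsqueeze(1), kernel_size=3,
                        stride=1, padding=1).squeeze(1)
    keep = mask.bool()
    # Keep only informative tokens before they reach the VLM.
    return [tokens[m] for tokens, m in zip(patch_tokens, keep)]

tokens = torch.randn(2, 14, 14, 768)
logits = torch.randn(2, 14, 14)
pruned = refine_and_prune(tokens, logits)   # list of (N_i, 768) tensors
```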

[230] BLaVe-CoT: Consistency-Aware Visual Question Answering for Blind and Low Vision Users

Wanyin Cheng, Zanxi Ruan

Main category: cs.CV

TL;DR: BLaVe-CoT is a VQA framework designed for BLV users that handles ambiguous questions by generating diverse candidate answers, grounding them spatially, and using chain-of-thought reasoning to assess answer consistency across different image regions.

DetailsMotivation: BLV users face challenges with VQA systems due to blurry photos, poorly framed images, and difficulty articulating specific questions about what they cannot see, leading to ambiguous questions with multiple valid answers that conventional single-answer VQA systems cannot handle.

Method: Proposes diverse candidate answers using LoRA-tuned BLIP-2 model, grounds each answer spatially using PolyFormer, and applies chain-of-thought reasoning to assess whether answers refer to same or different regions.

Result: Outperforms previous methods on VQA-AnswerTherapy benchmark and proves more robust to ambiguity and visual noise common in assistive settings for BLV users.

Conclusion: Highlights the need for VQA systems that adapt to real human uncertainty and provide inclusive support for BLV users, with code made publicly available to foster further research and accessibility applications.

Abstract: Visual Question Answering (VQA) holds great potential for assisting Blind and Low Vision (BLV) users, yet real-world usage remains challenging. Due to visual impairments, BLV users often take blurry or poorly framed photos and face difficulty in articulating specific questions about what they cannot fully see. As a result, their visual questions are frequently ambiguous, and different users may interpret them in diverse ways. This leads to multiple valid answers, each grounded in different image regions, posing a mismatch with conventional VQA systems that assume a single answer and region. To bridge this gap, we present BLaVe-CoT, a VQA framework designed to reason about answer consistency in the face of ambiguity. Our method proposes diverse candidate answers using a LoRA-tuned BLIP-2 model, then grounds each answer spatially using PolyFormer, and finally applies a chain-of-thought reasoning module to assess whether the answers refer to the same or different regions. Evaluated on the VQA-AnswerTherapy benchmark, BLaVe-CoT outperforms previous methods and proves more robust to the ambiguity and visual noise common in assistive settings. This work highlights the need for VQA systems that can adapt to real human uncertainty and provide inclusive support for BLV users. To foster further research and accessibility applications, we have made the code publicly available at https://github.com/Accecwan/BLaVe-CoT.

[231] Cross-Modal Enhancement and Benchmark for UAV-based Open-Vocabulary Object Detection

Zhenhai Weng, Zhongliang Yu

Main category: cs.CV

TL;DR: Proposed UAV-specific datasets and cross-attention fusion module to bridge domain gap in open-vocabulary object detection for UAV imagery.

DetailsMotivation: Existing OVD datasets are ground-level natural images, creating performance drop when applied to UAV imagery due to domain gap.

Method: Developed UAV-Label engine, created UAVDE-2M (2M+ instances, 1800 categories) and UAVCAP-15k (15k+ images) datasets, and proposed Cross-Attention Gated Enhancement Fusion module integrated into YOLO-World-v2.

Result: Extensive experiments on VisDrone and SIMD datasets verified effectiveness for UAV-based imagery and remote sensing applications.

Conclusion: The proposed datasets and CAGE module successfully address the domain gap issue in UAV open-vocabulary object detection.

Abstract: Open-Vocabulary Object Detection (OVD) has emerged as a pivotal technology for applications involving Unmanned Aerial Vehicles (UAVs). However, the prevailing large-scale datasets for OVD pre-training are predominantly composed of ground-level, natural images. This creates a significant domain gap, causing models trained on them to exhibit a substantial drop in performance on UAV imagery. To address this limitation, we first propose a refined UAV-Label engine. Then we construct and introduce UAVDE-2M (over 2,000,000 instances and 1,800 categories) and UAVCAP-15k (over 15,000 images). Furthermore, we propose a novel Cross-Attention Gated Enhancement Fusion (CAGE) module and integrate it into the YOLO-World-v2 architecture. Finally, extensive experiments on the VisDrone and SIMD datasets verify the effectiveness of our proposed method for applications in UAV-based imagery and remote sensing.
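
A minimal sketch of what a cross-attention gated fusion block can look like, since the CAGE module itself is not spelled out in the abstract; the dimensions, gating form, and names below are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    """Illustrative sketch: image queries attend to text embeddings, and a
    learned gate decides how much of the attended text signal to mix back
    into the visual features."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, N, D) visual tokens; txt: (B, L, D) category/text embeddings.
        attended, _ = self.attn(query=vis, key=txt, value=txt)
        g = self.gate(torch.cat([vis, attended], dim=-1))   # per-token gate in [0, 1]
        return vis + g * attended

vis = torch.randn(2, 400, 256)
txt = torch.randn(2, 32, 256)
fused = GatedCrossAttentionFusion(256)(vis, txt)
```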

[232] Micro-Expression Recognition via Fine-Grained Dynamic Perception

Zhiwen Shao, Yifan Cheng, Fan Zhang, Xuehuai Shi, Canlin Li, Lizhuang Ma, Dit-yan Yeung

Main category: cs.CV

TL;DR: A novel fine-grained dynamic perception framework for facial micro-expression recognition that ranks frame-level features to encode dynamic information and uses a transformer for representation learning, achieving state-of-the-art performance across multiple datasets.

DetailsMotivation: Facial micro-expression recognition is challenging due to the transience, subtlety, and dynamics of micro-expressions. Existing methods either require key frames (hand-crafted features) or suffer from small-scale, low-diversity training data (deep networks).

Method: Proposes a fine-grained dynamic perception framework that ranks frame-level features in chronological order to encode dynamic information. Uses a local-global feature-aware transformer for frame representation learning, a rank scorer for calculating frame scores, and temporal pooling for dynamic representation. Includes a MER module for category prediction and a dynamic image construction module with encoder-decoder structure.

Result: Significantly outperforms state-of-the-art MER methods, improving by 4.05% on CASME II, 2.50% on SAMM, 7.71% on CAS(ME)^2, and 2.11% on CAS(ME)^3 in terms of F1-score. Also works well for dynamic image construction.

Conclusion: The proposed FDP framework effectively captures facial subtle actions associated with micro-expressions and alleviates data scarcity issues, demonstrating superior performance across multiple benchmark datasets for micro-expression recognition.

Abstract: Facial micro-expression recognition (MER) is a challenging task due to the transience, subtlety, and dynamics of micro-expressions (MEs). Most existing methods resort to hand-crafted features or deep networks, where the former often additionally requires key frames and the latter suffers from small-scale and low-diversity training data. In this paper, we develop a novel fine-grained dynamic perception (FDP) framework for MER. We propose to rank frame-level features of a sequence of raw frames in chronological order, in which the ranking process encodes the dynamic information of both ME appearances and motions. Specifically, a novel local-global feature-aware transformer is proposed for frame representation learning. A rank scorer is further adopted to calculate the rank score of each frame-level feature. Afterwards, the rank features from the rank scorer are pooled along the temporal dimension to capture a dynamic representation. Finally, the dynamic representation is shared by a MER module and a dynamic image construction module: the former predicts the ME category, and the latter uses an encoder-decoder structure to construct the dynamic image. The design of the dynamic image construction task is beneficial for capturing facial subtle actions associated with MEs and for alleviating the data scarcity issue. Extensive experiments show that our method (i) significantly outperforms state-of-the-art MER methods, and (ii) works well for dynamic image construction. Particularly, our FDP improves by 4.05%, 2.50%, 7.71%, and 2.11% over the previous best results in terms of F1-score on the CASME II, SAMM, CAS(ME)^2, and CAS(ME)^3 datasets, respectively. The code is available at https://github.com/CYF-cuber/FDP.
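
A minimal sketch of the chronological-ranking idea: score each frame-level feature and train each frame to out-score its predecessor, so the scores encode temporal dynamics. The linear scorer and margin loss are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class RankScorer(nn.Module):
    """Illustrative sketch: map each frame-level feature to a scalar rank score."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, D) features of a raw-frame sequence, in time order.
        return self.scorer(frame_feats).squeeze(-1)   # (T,) rank scores

def chronological_rank_loss(scores: torch.Tensor, margin: float = 0.1):
    # Each frame should out-score its predecessor by a margin, so the
    # ranking reflects chronological order and hence ME dynamics.
    later, earlier = scores[1:], scores[:-1]
    target = torch.ones_like(later)
    return nn.functional.margin_ranking_loss(later, earlier, target, margin=margin)

feats = torch.randn(16, 512, requires_grad=True)
scores = RankScorer(512)(feats)
loss = chronological_rank_loss(scores)
loss.backward()
```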

[233] Interleaving Reasoning for Better Text-to-Image Generation

Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, Junbo Qiao, Yue Guo, Yao Hu, Zhenfei Yin, Philip Torr, Yu Cheng, Wanli Ouyang, Shaohui Lin

Main category: cs.CV

TL;DR: IRG framework improves text-to-image generation by interleaving text-based reasoning with image synthesis, achieving state-of-the-art performance through a two-stage training approach.

DetailsMotivation: Address the gap in instruction following and detail preservation in unified multimodal models compared to tightly coupled systems like GPT-4o by leveraging recent advances in interleaving reasoning.

Method: Introduces Interleaving Reasoning Generation (IRG) framework that alternates between text-based thinking and image synthesis, with Interleaving Reasoning Generation Learning (IRGL) training using IRGL-300K dataset and two-stage training approach.

Result: Achieves absolute gains of 5-10 points on multiple benchmarks (GenEval, WISE, TIIF, GenAI-Bench, OneIG-EN) with substantial improvements in visual quality and fine-grained fidelity.

Conclusion: Interleaving reasoning significantly improves text-to-image generation quality, instruction following, and detail preservation, establishing new state-of-the-art performance.

Abstract: Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation, such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover learning text-based thinking and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on the full thinking-image trajectory data. Extensive experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights and datasets will be released at: https://github.com/Osilly/Interleaving-Reasoning-Generation .

[234] DVLO4D: Deep Visual-Lidar Odometry with Sparse Spatial-temporal Fusion

Mengmeng Liu, Michael Ying Yang, Jiuming Liu, Yunpeng Zhang, Jiangtao Li, Sander Oude Elberink, George Vosselman, Hao Cheng

Main category: cs.CV

TL;DR: DVLO4D is a novel visual-LiDAR odometry framework that uses sparse spatial-temporal fusion to achieve state-of-the-art accuracy and robustness for autonomous system localization.

DetailsMotivation: Traditional visual-LiDAR odometry approaches struggle with sensor misalignment, fail to leverage temporal information effectively, and require extensive manual tuning for different sensor configurations.

Method: Three key innovations: 1) Sparse Query Fusion for multi-modal data fusion, 2) Temporal Interaction and Update module for better pose initialization and robustness, 3) Temporal Clip Training with Collective Average Loss for global optimization and reduced scale drift.

Result: Achieves state-of-the-art performance on KITTI and Argoverse Odometry datasets with high efficiency (82ms inference time), enabling real-time deployment potential.

Conclusion: DVLO4D provides a robust and accurate visual-LiDAR odometry solution that effectively addresses sensor misalignment, temporal information utilization, and manual tuning challenges while maintaining real-time capabilities.

Abstract: Visual-LiDAR odometry is a critical component for autonomous system localization, yet achieving high accuracy and strong robustness remains a challenge. Traditional approaches commonly struggle with sensor misalignment, fail to fully leverage temporal information, and require extensive manual tuning to handle diverse sensor configurations. To address these problems, we introduce DVLO4D, a novel visual-LiDAR odometry framework that leverages sparse spatial-temporal fusion to enhance accuracy and robustness. Our approach proposes three key innovations: (1) Sparse Query Fusion, which utilizes sparse LiDAR queries for effective multi-modal data fusion; (2) a Temporal Interaction and Update module that integrates temporally predicted positions with current frame data, providing better initialization values for pose estimation and enhancing the model's robustness against accumulated errors; and (3) a Temporal Clip Training strategy combined with a Collective Average Loss mechanism that aggregates losses across multiple frames, enabling global optimization and reducing scale drift over long sequences. Extensive experiments on the KITTI and Argoverse Odometry datasets demonstrate the superiority of our proposed DVLO4D, which achieves state-of-the-art performance in terms of both pose accuracy and robustness. Additionally, our method is highly efficient, with an inference time of 82 ms, giving it potential for real-time deployment.

[235] Analysis of Blood Report Images Using General Purpose Vision-Language Models

Nadia Bakhsheshi, Hamid Beigy

Main category: cs.CV

TL;DR: General-purpose Vision-Language Models show promising results for automated blood report image analysis, with Qwen-VL-Max, Gemini 2.5 Pro, and Llama 4 Maverick demonstrating practical potential for patient-facing healthcare tools.

DetailsMotivation: Individuals often struggle with interpreting blood reports, leading to anxiety and overlooked health issues, creating a need for automated analysis tools to improve health literacy.

Method: Comparative evaluation of three VLMs on 100 diverse blood report images using clinically relevant questions, with responses processed using Sentence-BERT for similarity comparison.

Result: General-purpose VLMs demonstrated practical performance in analyzing blood report images and providing clear interpretations directly from images.

Conclusion: VLMs are a promising technology for developing patient-facing blood report analysis tools, though results should be interpreted cautiously due to limited dataset size.

Abstract: The reliable analysis of blood reports is important for health knowledge, but individuals often struggle with interpretation, leading to anxiety and overlooked issues. We explore the potential of general-purpose Vision-Language Models (VLMs) to address this challenge by automatically analyzing blood report images. We conduct a comparative evaluation of three VLMs: Qwen-VL-Max, Gemini 2.5 Pro, and Llama 4 Maverick, assessing their performance on a dataset of 100 diverse blood report images. Each model was prompted with clinically relevant questions adapted to each blood report. The answers were then processed using Sentence-BERT to compare and evaluate how closely the models' responses agreed. The findings suggest that general-purpose VLMs are a practical and promising technology for developing patient-facing tools for preliminary blood report analysis. Their ability to provide clear interpretations directly from images can improve health literacy and reduce the barriers to understanding complex medical information. This work establishes a foundation for the future development of reliable and accessible AI-assisted healthcare applications. While results are encouraging, they should be interpreted cautiously given the limited dataset size.
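
The response-comparison step maps naturally onto the sentence-transformers library; a small sketch using a commonly chosen Sentence-BERT checkpoint (the paper does not specify which one) and made-up answers.

```python
from sentence_transformers import SentenceTransformer, util

# A common Sentence-BERT checkpoint; an assumption, not the paper's choice.
model = SentenceTransformer("all-MiniLM-L6-v2")

answers = {
    "Qwen-VL-Max": "Hemoglobin is 10.2 g/dL, below the normal range.",
    "Gemini 2.5 Pro": "The hemoglobin value of 10.2 g/dL indicates mild anemia.",
    "Llama 4 Maverick": "Hemoglobin appears low at 10.2 g/dL.",
}

names = list(answers)
embeddings = model.encode([answers[n] for n in names], convert_to_tensor=True)
similarity = util.cos_sim(embeddings, embeddings)   # pairwise cosine similarity

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} vs {names[j]}: {similarity[i, j].item():.3f}")
```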

[236] TinyDef-DETR:An Enhanced DETR Detector for UAV Power Line Defect Detection

Jiaming Cui

Main category: cs.CV

TL;DR: TinyDef-DETR is a DETR-based framework for small-defect detection in UAV transmission line inspection, addressing detail loss, weak boundaries, and insufficient context integration through stride-free downsampling, edge-enhanced convolution, dual-domain attention, and adaptive regression loss.

DetailsMotivation: Automated UAV inspection of transmission lines faces challenges in detecting small, ambiguous defects against complex backgrounds due to detail loss from strided downsampling, weak boundary sensitivity in lightweight backbones, and poor integration of global context with local cues.

Method: Proposes TinyDef-DETR with: 1) stride-free space-to-depth module for lossless downsampling, 2) edge-enhanced convolution for boundary-aware feature extraction, 3) cross-stage dual-domain multi-scale attention for global-local information capture, and 4) Focaler-Wise-SIoU regression loss for small object localization.

Result: Experiments on CSG-ADCD dataset show substantial improvements in precision and recall compared to baselines, with notable gains on small-object subsets and modest computational overhead. Validation on VisDrone benchmark confirms generalization capability.

Conclusion: Integrating detail-preserving downsampling, edge-sensitive representations, dual-domain attention, and difficulty-adaptive regression provides a practical and efficient solution for UAV-based small-defect inspection in power grids.

Abstract: Automated inspection of transmission lines using UAVs is hindered by the difficulty of detecting small and ambiguous defects against complex backgrounds. Conventional detectors often suffer from detail loss due to strided downsampling, weak boundary sensitivity in lightweight backbones, and insufficient integration of global context with local cues. To address these challenges, we propose TinyDef-DETR, a DETR-based framework designed for small-defect detection. The method introduces a stride-free space-to-depth module for lossless downsampling, an edge-enhanced convolution for boundary-aware feature extraction, a cross-stage dual-domain multi-scale attention module to jointly capture global and local information, and a Focaler-Wise-SIoU regression loss to improve localization of small objects. Experiments conducted on the CSG-ADCD dataset demonstrate that TinyDef-DETR achieves substantial improvements in both precision and recall compared to competitive baselines, with particularly notable gains on small-object subsets, while incurring only modest computational overhead. Further validation on the VisDrone benchmark confirms the generalization capability of the proposed approach. Overall, the results indicate that integrating detail-preserving downsampling, edge-sensitive representations, dual-domain attention, and difficulty-adaptive regression provides a practical and efficient solution for UAV-based small-defect inspection in power grids.
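
A minimal sketch of stride-free space-to-depth downsampling, which PyTorch exposes as PixelUnshuffle: spatial blocks are folded into channels losslessly and then mixed by a 1x1 convolution, in contrast to a strided convolution that discards detail. The module composition here is an assumption, not the paper's exact block.

```python
import torch
import torch.nn as nn

class SpaceToDepthDown(nn.Module):
    """Illustrative sketch of stride-free downsampling: PixelUnshuffle folds
    each f x f spatial block into channels losslessly, then a 1x1 conv mixes
    them -- no information is discarded, unlike a strided convolution."""

    def __init__(self, in_ch: int, out_ch: int, factor: int = 2):
        super().__init__()
        self.fold = nn.PixelUnshuffle(factor)   # (C, H, W) -> (C*f*f, H/f, W/f)
        self.mix = nn.Conv2d(in_ch * factor * factor, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mix(self.fold(x))

x = torch.randn(1, 64, 160, 160)
y = SpaceToDepthDown(64, 128)(x)   # -> (1, 128, 80, 80), no detail lost to striding
```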

[237] BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models

Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, Shanghang Zhang

Main category: cs.CV

TL;DR: BranchGRPO reduces computational costs and improves training stability in image/video generative model alignment by introducing branch sampling, tree-based advantage estimation, and pruning strategies.

DetailsMotivation: Existing GRPO methods face high computational costs from on-policy rollouts and SDE sampling steps, as well as training instability due to sparse rewards.

Method: Introduces branch sampling policy to update SDE sampling process, shares computation across common prefixes, prunes low-reward paths and redundant depths, and uses tree-based advantage estimator with dense process-level rewards.

Result: Improves alignment scores by 16% over strong baselines while cutting training time by 50% in image and video preference alignment experiments.

Conclusion: BranchGRPO effectively addresses computational cost and training instability issues in generative model alignment while maintaining or improving exploration diversity.

Abstract: Recent advancements in aligning image and video generative models via GRPO have achieved remarkable gains in enhancing human preference alignment. However, these methods still face high computational costs from on-policy rollouts and excessive SDE sampling steps, as well as training instability due to sparse rewards. In this paper, we propose BranchGRPO, a novel method that introduces a branch sampling policy that updates the SDE sampling process. By sharing computation across common prefixes and pruning low-reward paths and redundant depths, BranchGRPO substantially lowers the per-update compute cost while maintaining or improving exploration diversity. This work makes three main contributions: (1) a branch sampling scheme that reduces rollout and training cost; (2) a tree-based advantage estimator incorporating dense process-level rewards; and (3) pruning strategies exploiting path and depth redundancy to accelerate convergence and boost performance. Experiments on image and video preference alignment show that BranchGRPO improves alignment scores by 16% over strong baselines, while cutting training time by 50%.
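
A minimal sketch of a tree/group-relative advantage over branched rollouts, standardizing each leaf's reward within its sibling group; the paper's estimator also uses dense process-level rewards and depth pruning, which this toy version omits.

```python
import torch

def branch_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Illustrative sketch: rollouts that branched from a shared prefix form
    one group, and each leaf's advantage is its reward standardized within
    that sibling group, so siblings compete only against each other.

    rewards: (num_groups, branches_per_group) terminal rewards per sibling set.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Two branch points, each expanded into 4 sibling continuations.
r = torch.tensor([[0.9, 0.2, 0.5, 0.4],
                  [0.1, 0.1, 0.8, 0.3]])
adv = branch_advantages(r)
```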

[238] Multi-Stage Graph Neural Networks for Data-Driven Prediction of Natural Convection in Enclosed Cavities

Mohammad Ahangarkiasari, Hassan Pouraria

Main category: cs.CV

TL;DR: Proposed multi-stage GNN with hierarchical pooling/unpooling for better thermal-fluid modeling on irregular meshes, overcoming long-range dependency issues in conventional GNNs.

DetailsMotivation: High-fidelity CFD modeling is accurate but computationally expensive and requires expert knowledge, while conventional GNNs struggle with long-range dependencies in thermal-fluid simulations.

Method: Novel multi-stage GNN architecture using hierarchical pooling and unpooling operations to model global-to-local interactions across multiple spatial scales on irregular mesh structures.

Result: Achieves higher predictive accuracy, improved training efficiency, and reduced long-term error accumulation compared to state-of-the-art GNN baselines on natural convection CFD dataset.

Conclusion: The multi-stage GNN approach shows strong potential for modeling complex heat transfer in mesh-based fluid dynamics simulations, offering an efficient alternative to traditional CFD methods.

Abstract: Buoyancy-driven heat transfer in closed cavities serves as a canonical testbed for thermal design. High-fidelity CFD modelling yields accurate thermal field solutions, yet its reliance on expert-crafted physics models, fine meshes, and intensive computation limits rapid iteration. Recent developments in data-driven modeling, especially Graph Neural Networks (GNNs), offer new alternatives for learning thermal-fluid behavior directly from simulation data, particularly on irregular mesh structures. However, conventional GNNs often struggle to capture long-range dependencies in high-resolution graph structures. To overcome this limitation, we propose a novel multi-stage GNN architecture that leverages hierarchical pooling and unpooling operations to progressively model global-to-local interactions across multiple spatial scales. We evaluate the proposed model on our newly developed CFD dataset simulating natural convection within rectangular cavities with varying aspect ratios, where the bottom wall is isothermal hot, the top wall is isothermal cold, and the two vertical walls are adiabatic. Experimental results demonstrate that the proposed model achieves higher predictive accuracy, improved training efficiency, and reduced long-term error accumulation compared to state-of-the-art (SOTA) GNN baselines. These findings underscore the potential of the proposed multi-stage GNN approach for modeling complex heat transfer in mesh-based fluid dynamics simulations.
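
A minimal sketch of hierarchical pooling and unpooling on node features, assuming a precomputed cluster assignment per mesh node; the actual operators in the paper may differ.

```python
import torch

def pool_nodes(x: torch.Tensor, cluster: torch.Tensor) -> torch.Tensor:
    """Illustrative pooling (an assumption, not the paper's exact operator):
    average node features within each cluster to form a coarser graph level,
    letting later message passing act over longer ranges."""
    num_clusters = int(cluster.max()) + 1
    out = torch.zeros(num_clusters, x.size(1), dtype=x.dtype)
    count = torch.zeros(num_clusters, 1, dtype=x.dtype)
    out.index_add_(0, cluster, x)
    count.index_add_(0, cluster, torch.ones(x.size(0), 1, dtype=x.dtype))
    return out / count.clamp(min=1)

def unpool_nodes(coarse_x: torch.Tensor, cluster: torch.Tensor) -> torch.Tensor:
    # Broadcast coarse features back to the fine nodes they came from.
    return coarse_x[cluster]

x = torch.randn(8, 16)                        # 8 mesh nodes, 16 features
cluster = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
coarse = pool_nodes(x, cluster)               # 4-node coarse level
restored = unpool_nodes(coarse, cluster)      # back to 8 fine nodes
```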

[239] Home-made Diffusion Model from Scratch to Hatch

Shih-Ying Yeh

Main category: cs.CV

TL;DR: HDM is an efficient text-to-image diffusion model that achieves competitive 1024x1024 generation quality with only $535-620 training cost on consumer GPUs, using novel architecture and training techniques.

DetailsMotivation: To democratize high-quality text-to-image generation by making it accessible to individual researchers and smaller organizations with limited computational resources, reducing the traditional high training costs.

Method: Developed Cross-U-Transformer (XUT) with cross-attention skip connections, TREAD acceleration, shifted square crop strategy for arbitrary aspect-ratio training, and progressive resolution scaling with a 343M parameter model.

Result: Achieves competitive 1024x1024 generation quality with remarkably low training cost ($535-620 using four RTX5090 GPUs), demonstrating emergent capabilities like intuitive camera control and superior compositional consistency.

Conclusion: Provides an alternative scaling paradigm that demonstrates smaller, carefully crafted models can achieve high-quality results, making text-to-image generation accessible to researchers with limited computational resources.

Abstract: We introduce Home-made Diffusion Model (HDM), an efficient yet powerful text-to-image diffusion model optimized for training (and inference) on consumer-grade hardware. HDM achieves competitive 1024x1024 generation quality while maintaining a remarkably low training cost of $535-620 using four RTX5090 GPUs, representing a significant reduction in computational requirements compared to traditional approaches. Our key contributions include: (1) Cross-U-Transformer (XUT), a novel U-shaped transformer that employs cross-attention for skip connections, providing superior feature integration that leads to remarkable compositional consistency; (2) a comprehensive training recipe that incorporates TREAD acceleration, a novel shifted square crop strategy for efficient arbitrary-aspect-ratio training, and progressive resolution scaling; and (3) an empirical demonstration that smaller models (343M parameters) with carefully crafted architectures can achieve high-quality results and emergent capabilities, such as intuitive camera control. Our work provides an alternative paradigm of scaling, demonstrating a viable path toward democratizing high-quality text-to-image generation for individual researchers and smaller organizations with limited computational resources.

[240] High-Quality Tomographic Image Reconstruction Integrating Neural Networks and Mathematical Optimization

Anuraag Mishra, Andrea Gilch, Benjamin Apeleo Zubiri, Jan Rolfes, Frauke Liers

Main category: cs.CV

TL;DR: A novel neural network-based technique for improving image reconstruction quality in nano/microtomography by enhancing edge detection and reducing artifacts in homogeneous materials with sharp edges.

DetailsMotivation: To address the challenge of reconstructing high-quality images from projection-based nano- and microtomography, particularly for specimens composed of homogeneous material phases with sharp edges, where traditional methods often produce blurry results and artifacts.

Method: Training a neural network to identify edges within subpictures, then integrating this trained network into a mathematical optimization model that favors solutions according to learned predictions while allowing alternative solutions strongly supported by raw data.

Result: Significant enhancements in interface sharpness and material homogeneity compared to benchmark algorithms on experimental datasets, with successful elimination of blurriness.

Conclusion: The technique successfully incorporates knowledge about material homogeneity and sharp edges, producing high-quality reconstructions and showcasing potential for advancing tomographic imaging techniques.

Abstract: In this work, we develop a novel technique for reconstructing images from projection-based nano- and microtomography. Our contribution focuses on enhancing reconstruction quality, particularly for specimens composed of homogeneous material phases connected by sharp edges. This is accomplished by training a neural network to identify edges within subpictures. The trained network is then integrated into a mathematical optimization model to reduce artifacts from previous reconstructions. To this end, the optimization approach favors solutions that agree with the learned predictions, but may also determine alternative solutions if these are strongly supported by the raw data. Hence, our technique successfully incorporates knowledge about the homogeneity and the presence of sharp edges in the sample and thereby eliminates blurriness. Our results on experimental datasets show significant enhancements in interface sharpness and material homogeneity compared to benchmark algorithms. Thus, our technique produces high-quality reconstructions, showcasing its potential for advancing tomographic imaging techniques.

[241] MedSeqFT: Sequential Fine-tuning Foundation Models for 3D Medical Image Segmentation

Yiwen Ye, Yicheng Wu, Xiangde Luo, He Zhang, Ziyang Chen, Ting Dang, Yanning Zhang, Yong Xia

Main category: cs.CV

TL;DR: MedSeqFT is a sequential fine-tuning framework for medical image segmentation that preserves pre-trained knowledge while adapting to new tasks, outperforming existing methods by 3.0% Dice score on average.

DetailsMotivation: Existing fine-tuning strategies for medical image analysis either isolate tasks (parallel fine-tuning) or require simultaneous access to all datasets (multi-task fine-tuning), failing to effectively handle incremental task integration and knowledge retention.

Method: Proposes MedSeqFT with two components: 1) Maximum Data Similarity (MDS) selection to identify representative downstream samples, and 2) Knowledge and Generalization Retention Fine-Tuning (K&G RFT) using LoRA-based knowledge distillation to balance task adaptation with knowledge preservation.

Result: Outperforms state-of-the-art methods on ten 3D segmentation tasks with 3.0% average Dice improvement. Enhances transferability on unseen tasks (COVID-19-20 and Kidney), particularly for tumor segmentation. Visual analyses confirm robustness.

Conclusion: MedSeqFT establishes sequential fine-tuning as an effective, knowledge-retentive paradigm for adapting foundation models to evolving clinical segmentation tasks.

Abstract: Foundation models have become a promising paradigm for advancing medical image analysis, particularly for segmentation tasks where downstream applications often emerge sequentially. Existing fine-tuning strategies, however, remain limited: parallel fine-tuning isolates tasks and fails to exploit shared knowledge, while multi-task fine-tuning requires simultaneous access to all datasets and struggles with incremental task integration. To address these challenges, we propose MedSeqFT, a sequential fine-tuning framework that progressively adapts pre-trained models to new tasks while refining their representational capacity. MedSeqFT introduces two core components: (1) Maximum Data Similarity (MDS) selection, which identifies downstream samples most representative of the original pre-training distribution to preserve general knowledge, and (2) Knowledge and Generalization Retention Fine-Tuning (K&G RFT), a LoRA-based knowledge distillation scheme that balances task-specific adaptation with the retention of pre-trained knowledge. Extensive experiments on two multi-task datasets covering ten 3D segmentation tasks demonstrate that MedSeqFT consistently outperforms state-of-the-art fine-tuning strategies, yielding substantial performance gains (e.g., an average Dice improvement of 3.0%). Furthermore, evaluations on two unseen tasks (COVID-19-20 and Kidney) verify that MedSeqFT enhances transferability, particularly for tumor segmentation. Visual analyses of loss landscapes and parameter variations further highlight the robustness of MedSeqFT. These results establish sequential fine-tuning as an effective, knowledge-retentive paradigm for adapting foundation models to evolving clinical tasks. Code will be released.
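
One plausible reading of Maximum Data Similarity selection, sketched below: rank downstream samples by cosine similarity to the centroid of pre-training embeddings and keep the top-k as the most representative of the pre-training distribution. The centroid criterion is an assumption, not necessarily the paper's exact measure.

```python
import torch
import torch.nn.functional as F

def mds_select(downstream_emb: torch.Tensor,
               pretrain_emb: torch.Tensor,
               k: int) -> torch.Tensor:
    """Illustrative sketch of an MDS-style selection.

    downstream_emb: (N, D) embeddings of downstream candidates
    pretrain_emb:   (M, D) embeddings drawn from the pre-training data
    returns the indices of the k selected downstream samples.
    """
    centroid = F.normalize(pretrain_emb.mean(dim=0, keepdim=True), dim=-1)
    sims = F.normalize(downstream_emb, dim=-1) @ centroid.T   # (N, 1)
    return sims.squeeze(-1).topk(k).indices

down = torch.randn(1000, 256)
pre = torch.randn(5000, 256)
keep = mds_select(down, pre, k=100)   # replayed alongside new-task data
```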

[242] PathoHR: Hierarchical Reasoning for Vision-Language Models in Pathology

Yating Huang, Ziyan Huang, Lintao Xiang, Qijun Yang, Hujun Yin

Main category: cs.CV

TL;DR: PathoHR-Bench benchmark reveals VL models struggle with pathology image analysis due to complex reasoning requirements. A new pathology-specific training scheme with enhanced/perturbed samples achieves SOTA performance.

DetailsMotivation: Accurate pathological image analysis is crucial for automated tumor diagnosis but remains challenging due to high structural similarity and subtle morphological variations. Current VL models fail to capture the complex reasoning needed for interpreting structured pathological reports.

Method: Proposed PathoHR-Bench benchmark for evaluating VL models’ hierarchical semantic understanding and compositional reasoning. Introduced pathology-specific VL training scheme that generates enhanced and perturbed samples for multimodal contrastive learning.

Result: Experimental evaluations demonstrate state-of-the-art performance on PathoHR-Bench and six additional pathology datasets, showing effectiveness in fine-grained pathology representation.

Conclusion: The proposed pathology-specific training approach effectively addresses VL models’ limitations in capturing intricate cross-modal relationships, making them more applicable in clinical settings for pathological image analysis.

Abstract: Accurate analysis of pathological images is essential for automated tumor diagnosis but remains challenging due to high structural similarity and subtle morphological variations in tissue images. Current vision-language (VL) models often struggle to capture the complex reasoning required for interpreting structured pathological reports. To address these limitations, we propose PathoHR-Bench, a novel benchmark designed to evaluate VL models' abilities in hierarchical semantic understanding and compositional reasoning within the pathology domain. Results on this benchmark reveal that existing VL models fail to effectively model intricate cross-modal relationships, limiting their applicability in clinical settings. To overcome this, we further introduce a pathology-specific VL training scheme that generates enhanced and perturbed samples for multimodal contrastive learning. Experimental evaluations demonstrate that our approach achieves state-of-the-art performance on PathoHR-Bench and six additional pathology datasets, highlighting its effectiveness in fine-grained pathology representation.

[243] CARDIE: clustering algorithm on relevant descriptors for image enhancement

Giulia Bonino, Luca Alberto Rizzo

Main category: cs.CV

TL;DR: CARDIE is an unsupervised algorithm that clusters images based on color and luminosity for image enhancement tasks, outperforming semantic-based clustering and improving tone mapping and denoising performance.

DetailsMotivation: Automatic image clustering for enhancement is limited due to difficulty defining meaningful clusters for this specific task, as current methods often rely on semantic attributes rather than enhancement-relevant features.

Method: Introduces CARDIE - an unsupervised algorithm that clusters images based on color and luminosity content, plus a method to quantify enhancement algorithm impact on luminance distribution and local variance.

Result: CARDIE produces clusters more relevant to image enhancement than semantic-based clusters, and these clusters can be used to resample datasets leading to improved performance for tone mapping and denoising algorithms.

Conclusion: CARDIE provides effective unsupervised clustering for image enhancement tasks, with code publicly released on GitHub to encourage adoption and ensure reproducibility.

Abstract: Automatic image clustering is a cornerstone of computer vision, yet its application to image enhancement remains limited, primarily due to the difficulty of defining clusters that are meaningful for this specific task. To address this issue, we introduce CARDIE, an unsupervised algorithm that clusters images based on their color and luminosity content. In addition, we introduce a method to quantify the impact of image enhancement algorithms on luminance distribution and local variance. Using this method, we demonstrate that CARDIE produces clusters more relevant to image enhancement than those derived from semantic image attributes. Furthermore, we demonstrate that CARDIE clusters can be leveraged to resample image enhancement datasets, leading to improved performance for tone mapping and denoising algorithms. To encourage adoption and ensure reproducibility, we publicly release CARDIE code on our GitHub.
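
A minimal sketch of clustering on color and luminosity content, assuming HSV histograms as the descriptor and k-means as the clustering step; CARDIE's actual descriptors and algorithm may differ.

```python
import numpy as np
import cv2
from sklearn.cluster import KMeans

def color_luminosity_descriptor(img_bgr: np.ndarray, bins: int = 16) -> np.ndarray:
    """Illustrative descriptor (an assumption, not CARDIE's exact one):
    concatenated hue, saturation, and value (luminance) histograms."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    # OpenCV 8-bit HSV: hue in [0, 180), saturation/value in [0, 256).
    hists = [cv2.calcHist([hsv], [c], None, [bins],
                          [0, 180 if c == 0 else 256]).ravel() for c in range(3)]
    desc = np.concatenate(hists)
    return desc / (desc.sum() + 1e-8)

# Cluster a set of images on color/luminosity content rather than semantics.
images = [np.random.randint(0, 256, (64, 64, 3), np.uint8) for _ in range(40)]
X = np.stack([color_luminosity_descriptor(im) for im in images])
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```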

[244] SpecSwin3D: Generating Hyperspectral Imagery from Multispectral Data via Transformer Networks

Tang Sui, Songxi Yang, Qunying Huang

Main category: cs.CV

TL;DR: SpecSwin3D is a transformer-based model that generates high-quality hyperspectral imagery from multispectral inputs, achieving superior spatial and spectral fidelity compared to existing methods.

DetailsMotivation: There's a fundamental trade-off in remote sensing: multispectral imagery has high spatial resolution but limited spectral resolution, while hyperspectral imagery has rich spectral information but lower spatial resolution. Existing methods struggle to preserve both spatial detail and spectral fidelity when generating hyperspectral data.

Method: The proposed SpecSwin3D model uses a 3D shifted-window transformer framework that takes 5 multispectral bands as input and reconstructs 224 hyperspectral bands. It features a cascade training strategy that progressively expands the spectral range, and an optimized band sequence that strategically repeats and orders input bands to capture pairwise relations.

Result: The model achieves PSNR of 35.82 dB, SAM of 2.40°, and SSIM of 0.96, outperforming the baseline MHF-Net by +5.6 dB in PSNR and reducing ERGAS by more than half. It also demonstrates practical value on downstream tasks like land use classification and burnt area segmentation.

Conclusion: SpecSwin3D effectively bridges the spatial-spectral resolution gap in remote sensing, providing high-fidelity hyperspectral generation with practical applications in environmental monitoring and analysis tasks.

Abstract: Multispectral and hyperspectral imagery are widely used in agriculture, environmental monitoring, and urban planning due to their complementary spatial and spectral characteristics. A fundamental trade-off persists: multispectral imagery offers high spatial but limited spectral resolution, while hyperspectral imagery provides rich spectra at lower spatial resolution. Prior hyperspectral generation approaches (e.g., pan-sharpening variants, matrix factorization, CNNs) often struggle to jointly preserve spatial detail and spectral fidelity. In response, we propose SpecSwin3D, a transformer-based model that generates hyperspectral imagery from multispectral inputs while preserving both spatial and spectral quality. Specifically, SpecSwin3D takes five multispectral bands as input and reconstructs 224 hyperspectral bands at the same spatial resolution. In addition, we observe that reconstruction errors grow for hyperspectral bands spectrally distant from the input bands. To address this, we introduce a cascade training strategy that progressively expands the spectral range to stabilize learning and improve fidelity. Moreover, we design an optimized band sequence that strategically repeats and orders the five selected multispectral bands to better capture pairwise relations within a 3D shifted-window transformer framework. Quantitatively, our model achieves a PSNR of 35.82 dB, a SAM of 2.40°, and an SSIM of 0.96, outperforming the baseline MHF-Net by +5.6 dB in PSNR and reducing ERGAS by more than half. Beyond reconstruction, we further demonstrate the practical value of SpecSwin3D on two downstream tasks: land use classification and burnt area segmentation.
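
For readers checking the reported 2.40° figure: the spectral angle mapper (SAM) is the mean per-pixel angle between predicted and reference spectra. A small standard implementation:

```python
import numpy as np

def spectral_angle_mapper(pred: np.ndarray, ref: np.ndarray) -> float:
    """Mean spectral angle (degrees) between predicted and reference
    hyperspectral cubes of shape (H, W, Bands). Standard SAM definition."""
    p = pred.reshape(-1, pred.shape[-1]).astype(np.float64)
    r = ref.reshape(-1, ref.shape[-1]).astype(np.float64)
    cos = np.sum(p * r, axis=1) / (
        np.linalg.norm(p, axis=1) * np.linalg.norm(r, axis=1) + 1e-12)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())

pred = np.random.rand(32, 32, 224)
ref = pred + 0.01 * np.random.rand(32, 32, 224)   # near-perfect reconstruction
print(spectral_angle_mapper(pred, ref))           # small angle, close to 0 deg
```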

[245] RetinaGuard: Obfuscating Retinal Age in Fundus Images for Biometric Privacy Preserving

Zhengquan Luo, Chi Liu, Dongfu Xiao, Zhen Yu, Yueye Wang, Tianqing Zhu

Main category: cs.CV

TL;DR: RetinaGuard is a privacy framework that protects retinal age information in fundus images while maintaining diagnostic utility through adversarial masking and knowledge distillation.

DetailsMotivation: AI can extract sensitive biometric data like retinal age from medical images, raising privacy concerns about unauthorized bioinformation leakage.

Method: Feature-level generative adversarial masking mechanism combined with multiple-to-one knowledge distillation using retinal foundation model and surrogate age encoders.

Result: Successfully obfuscates retinal age prediction with minimal impact on image quality and pathological feature representation.

Conclusion: RetinaGuard provides effective universal defense against black-box age prediction models and can be extended to protect other medical image biomarkers.

Abstract: The integration of AI with medical images enables the extraction of implicit image-derived biomarkers for precise health assessment. Retinal age, a biomarker predicted from fundus images, has recently proven to be a predictor of systemic disease risks, behavioral patterns, aging trajectory, and even mortality. However, the capability to infer such sensitive biometric data raises significant privacy risks, where unauthorized use of fundus images could lead to bioinformation leakage, breaching individual privacy. In response, we formulate a new research problem of biometric privacy associated with medical images and propose RetinaGuard, a novel privacy-enhancing framework that employs a feature-level generative adversarial masking mechanism to obscure retinal age while preserving image visual quality and disease diagnostic utility. The framework further utilizes a novel multiple-to-one knowledge distillation strategy incorporating a retinal foundation model and diverse surrogate age encoders to enable a universal defense against black-box age prediction models. Comprehensive evaluations confirm that RetinaGuard successfully obfuscates retinal age prediction with minimal impact on image quality and pathological feature representation. RetinaGuard is also flexible for extension to other medical image-derived biomarkers.
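
The core trade-off, keeping the masked image faithful and diagnostically useful while defeating age predictors, can be written as a single objective. The composition below is a hypothetical sketch; the term names, weighting, and use of L1 distances are assumptions, not RetinaGuard's published loss.

```python
import torch
import torch.nn.functional as F

def privacy_masking_objective(x, x_masked, diag_logits, diag_target,
                              age_preds, true_age, lam_adv=1.0):
    """Hypothetical loss composition: stay close to the original image, keep
    the diagnosis head working, and push every surrogate age encoder away
    from the true retinal age (hence the negative sign)."""
    fidelity = F.l1_loss(x_masked, x)                       # visual quality
    diagnosis = F.cross_entropy(diag_logits, diag_target)   # diagnostic utility
    age_err = torch.stack([F.l1_loss(p, true_age) for p in age_preds]).mean()
    return fidelity + diagnosis - lam_adv * age_err         # adversarial trade-off
```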

[246] UniVerse-1: Unified Audio-Video Generation via Stitching of Experts

Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili Dai, Daxin Jiang, Gang Yu

Main category: cs.CV

TL;DR: UniVerse-1 is a unified model that generates coordinated audio and video using a stitching of experts approach, avoiding training from scratch by fusing pre-trained video and music generation models.

DetailsMotivation: To create a model capable of simultaneous audio-video generation with accurate temporal alignment, addressing performance degradation from misaligned text annotations.

Method: Employs stitching of experts (SoE) technique to fuse pre-trained video and music generation models, plus an online annotation pipeline for accurate temporal alignment during training.

Result: After fine-tuning on ~7,600 hours of data, the model produces well-coordinated audio-visuals for ambient sounds and strong alignment for speech generation.

Conclusion: Introduces Verse-Bench benchmark and makes model/code publicly available to advance audio-video generation research and close gap with state-of-the-art models like Veo3.

Abstract: We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To enhance training efficiency, we bypass training from scratch and instead employ a stitching of experts (SoE) technique. This approach deeply fuses the corresponding blocks of pre-trained video and music generation expert models, thereby fully leveraging their foundational capabilities. To ensure accurate annotations and temporal alignment for both ambient sounds and speech with video content, we developed an online annotation pipeline that processes the required training data and generates labels during the training process. This strategy circumvents the performance degradation often caused by misaligned text-based annotations. Through the synergy of these techniques, our model, after being fine-tuned on approximately 7,600 hours of audio-video data, produces results with well-coordinated audio-visuals for ambient sound generation and strong alignment for speech generation. To systematically evaluate our proposed method, we introduce Verse-Bench, a new benchmark dataset. In an effort to advance research in audio-video generation and to close the performance gap with state-of-the-art models such as Veo3, we make our model and code publicly available. We hope this contribution will benefit the broader research community. Project page: https://dorniwang.github.io/UniVerse-1/.
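
A schematic reading of "stitching of experts" is sketched below: the i-th blocks of the two pre-trained experts run in parallel and exchange information through small learned bridges, so each expert's pretrained pathway stays intact. The fusion operator, the temporal mean pooling, and the dimensions are assumptions; the paper's actual stitching may differ.

```python
import torch.nn as nn

class StitchedBlock(nn.Module):
    """Pair the i-th block of a video expert with the i-th block of an audio
    expert and exchange information through small learned bridges, keeping
    each expert's pretrained pathway intact."""
    def __init__(self, video_block, audio_block, dim_v, dim_a):
        super().__init__()
        self.video_block, self.audio_block = video_block, audio_block
        self.a2v = nn.Linear(dim_a, dim_v)   # audio -> video bridge
        self.v2a = nn.Linear(dim_v, dim_a)   # video -> audio bridge

    def forward(self, hv, ha):               # (B, Tv, dim_v), (B, Ta, dim_a)
        hv, ha = self.video_block(hv), self.audio_block(ha)
        hv = hv + self.a2v(ha.mean(dim=1, keepdim=True))   # broadcast over time
        ha = ha + self.v2a(hv.mean(dim=1, keepdim=True))
        return hv, ha
```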

[247] UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning

Huy Le, Nhat Chung, Tung Kieu, Jingkang Yang, Ngan Le

Main category: cs.CV

TL;DR: UNO is a unified single-stage framework for both box-level and pixel-level Video Scene Graph Generation that uses slot attention and temporal consistency learning without explicit tracking.

DetailsMotivation: Prior VidSGG methods require separate architectures for different granularity levels (box-level vs pixel-level) and multi-stage training pipelines, which is inefficient and lacks generalization.

Method: Extended slot attention mechanism to decompose visual features into object and relation slots, object temporal consistency learning for cross-frame consistency, and dynamic triplet prediction module for linking relations to object pairs.

Result: Achieves competitive performance on both box-level and pixel-level VidSGG benchmarks while offering improved efficiency through unified design.

Conclusion: UNO demonstrates that a single unified framework can effectively handle multiple granularity levels in VidSGG with minimal task-specific modifications and maximum parameter sharing.

Abstract: Video Scene Graph Generation (VidSGG) aims to represent dynamic visual content by detecting objects and modeling their temporal interactions as structured graphs. Prior studies typically target either coarse-grained box-level or fine-grained panoptic pixel-level VidSGG, often requiring task-specific architectures and multi-stage training pipelines. In this paper, we present UNO (UNified Object-centric VidSGG), a single-stage, unified framework that jointly addresses both tasks within an end-to-end architecture. UNO is designed to minimize task-specific modifications and maximize parameter sharing, enabling generalization across different levels of visual granularity. The core of UNO is an extended slot attention mechanism that decomposes visual features into object and relation slots. To ensure robust temporal modeling, we introduce object temporal consistency learning, which enforces consistent object representations across frames without relying on explicit tracking modules. Additionally, a dynamic triplet prediction module links relation slots to corresponding object pairs, capturing evolving interactions over time. We evaluate UNO on standard box-level and pixel-level VidSGG benchmarks. Results demonstrate that UNO not only achieves competitive performance across both tasks but also offers improved efficiency through a unified, object-centric design.
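
UNO's core is an extended slot attention mechanism. For readers unfamiliar with the base operation, here is a minimal slot attention module in the spirit of Locatello et al. (2020); UNO's object/relation slot split and temporal consistency machinery are not included.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Minimal slot attention: slots compete for input features via attention
    normalized over the slot axis, then update through a GRU."""
    def __init__(self, num_slots, dim, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slot = nn.LayerNorm(dim)

    def forward(self, inputs):                         # inputs: (B, N, D)
        B = inputs.shape[0]
        inputs = self.norm_in(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_init.expand(B, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slot(slots))
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)  # over slots
            attn = attn / attn.sum(dim=-1, keepdim=True)                # per-slot weights
            updates = attn @ v                                          # (B, S, D)
            slots = self.gru(updates.reshape(-1, updates.shape[-1]),
                             slots.reshape(-1, slots.shape[-1]))
            slots = slots.reshape(B, self.num_slots, -1)
        return slots
```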

[248] AI-Based Applied Innovation for Fracture Detection in X-rays Using Custom CNN and Transfer Learning Models

Amna Hassan, Ilsa Afzaal, Nouman Muneeb, Aneeqa Batool, Hamail Noor

Main category: cs.CV

TL;DR: AI-based fracture detection from X-rays using custom CNN achieves 95.96% accuracy, outperforming transfer learning models in limited-resource settings.

DetailsMotivation: Address global health challenge of bone fractures in low-resource areas with limited radiology access, overcoming high costs and radiation exposure of conventional methods.

Method: Developed custom Convolutional Neural Network and benchmarked against transfer learning models (EfficientNetB0, MobileNetV2, ResNet50) using FracAtlas dataset of 4,083 musculoskeletal radiographs.

Result: Custom CNN achieved 95.96% accuracy, 0.94 precision, 0.88 recall, and 0.91 F1-score. Transfer learning models performed poorly due to class imbalance and dataset limitations.

Conclusion: Lightweight CNNs show promise for fracture detection in X-rays, emphasizing need for fair benchmarking, diverse datasets, and external validation for clinical application.

Abstract: Bone fractures present a major global health challenge, often resulting in pain, reduced mobility, and productivity loss, particularly in low-resource settings where access to expert radiology services is limited. Conventional imaging methods suffer from high costs, radiation exposure, and dependency on specialized interpretation. To address this, we developed an AI-based solution for automated fracture detection from X-ray images using a custom Convolutional Neural Network (CNN) and benchmarked it against transfer learning models including EfficientNetB0, MobileNetV2, and ResNet50. Training was conducted on the publicly available FracAtlas dataset, comprising 4,083 anonymized musculoskeletal radiographs. The custom CNN achieved 95.96% accuracy, 0.94 precision, 0.88 recall, and an F1-score of 0.91 on the FracAtlas dataset. Although transfer learning models (EfficientNetB0, MobileNetV2, ResNet50) performed poorly in this specific setup, these results should be interpreted in light of class imbalance and dataset limitations. This work highlights the promise of lightweight CNNs for detecting fractures in X-rays and underscores the importance of fair benchmarking, diverse datasets, and external validation for clinical translation.

[249] Exploring Light-Weight Object Recognition for Real-Time Document Detection

Lucas Wojcik, Luiz Coelho, Roger Granada, David Menotti

Main category: cs.CV

TL;DR: Efficient document detection pipeline adapted from license plate detection network (IWPOD-Net) for real-time document rectification, achieving competitive OCR quality with smaller, faster model than state-of-the-art solutions.

DetailsMotivation: Real-time document detection and rectification is largely unexplored but vital for automatic information retrieval from visual documents. Need for efficient pipeline that balances performance and speed for OCR applications.

Method: Adapted IWPOD-Net (license plate detection network) and trained on synthetic ID card dataset (NBID). Used data augmentation and cross-dataset validation with MIDV dataset. Evaluated OCR quality using Levenshtein distance metric.

Result: Model is smaller and more efficient than current state-of-the-art solutions while maintaining competitive OCR quality. Document rectification doesn’t need to be perfect to achieve state-of-the-art performance.

Conclusion: Proposed efficient document detection pipeline successfully bridges the gap between performance and efficiency for real-time document processing, demonstrating that smaller models can achieve competitive OCR results without perfect rectification.

Abstract: Object Recognition and Document Skew Estimation have come a long way in terms of performance and efficiency. New models follow one of two directions: improving performance using larger models, and improving efficiency using smaller models. However, real-time document detection and rectification is a niche that is largely unexplored by the literature, yet it remains a vital step for automatic information retrieval from visual documents. In this work, we strive towards an efficient document detection pipeline that is satisfactory in terms of Optical Character Recognition (OCR) retrieval and faster than other available solutions. We adapt IWPOD-Net, a license plate detection network, and train it for detection on NBID, a synthetic ID card dataset. We experiment with data augmentation and cross-dataset validation with MIDV (another synthetic ID and passport document dataset) to find the optimal scenario for the model. Other methods from both the Object Recognition and Skew Estimation state-of-the-art are evaluated for comparison with our approach. We use each method to detect and rectify the document, which is then read by an OCR system. The OCR output is then evaluated using a novel OCR quality metric based on the Levenshtein distance. Since the end goal is to improve automatic information retrieval, we use the overall OCR quality as a performance metric. We observe that with a promising model, document rectification does not have to be perfect to attain state-of-the-art performance scores. We show that our model is smaller and more efficient than current state-of-the-art solutions while retaining a competitive OCR quality metric. All code is available at https://github.com/BOVIFOCR/iwpod-doc-corners.git
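
The evaluation hinges on a Levenshtein-based OCR quality metric. A minimal version is sketched below; the exact normalization the authors use is not given in the abstract, so the similarity formula here is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ocr_quality(pred: str, truth: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means an exact OCR match."""
    if not truth:
        return 1.0 if not pred else 0.0
    return 1.0 - levenshtein(pred, truth) / max(len(pred), len(truth))

print(ocr_quality("JOHN DOE 1234", "JOHN DOE 1284"))  # ~0.92
```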

[250] Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes

Mohsen Gholami, Ahmad Rezaei, Zhou Weimin, Yong Zhang, Mohammad Akbari

Main category: cs.CV

TL;DR: Ego3D-Bench is a new benchmark for evaluating VLMs’ 3D spatial reasoning using ego-centric multi-view outdoor data, showing current VLMs lag behind human performance, and Ego3D-VLM framework improves spatial reasoning by 12-56%.

DetailsMotivation: Current VLMs struggle with 3D spatial understanding, especially in real-world embodied AI applications like robotics and self-driving that rely on ego-centric multi-view observations, creating a need for better evaluation and improvement methods.

Method: Created Ego3D-Bench with 8,600+ human-annotated QA pairs from ego-centric outdoor data, benchmarked 16 SOTA VLMs, and developed Ego3D-VLM - a post-training framework that generates cognitive maps using estimated global 3D coordinates.

Result: Significant performance gap between human level scores and VLMs; Ego3D-VLM achieved 12% average improvement on multi-choice QA and 56% average improvement on absolute distance estimation; modular framework compatible with existing VLMs.

Conclusion: Ego3D-Bench and Ego3D-VLM provide valuable tools for advancing toward human-level spatial understanding in real-world multi-view environments, addressing a major limitation of current vision-language models.

Abstract: Understanding 3D spatial relationships remains a major limitation of current Vision-Language Models (VLMs). Prior work has addressed this issue by creating spatial question-answering (QA) datasets based on single images or indoor videos. However, real-world embodied AI agents such as robots and self-driving cars typically rely on ego-centric, multi-view observations. To this end, we introduce Ego3D-Bench, a new benchmark designed to evaluate the spatial reasoning abilities of VLMs using ego-centric, multi-view outdoor data. Ego3D-Bench comprises over 8,600 QA pairs, created with significant involvement from human annotators to ensure quality and diversity. We benchmark 16 SOTA VLMs, including GPT-4o, Gemini1.5-Pro, InternVL3, and Qwen2.5-VL. Our results reveal a notable performance gap between human-level scores and VLM performance, highlighting that current VLMs still fall short of human-level spatial understanding. To bridge this gap, we propose Ego3D-VLM, a post-training framework that enhances the 3D spatial reasoning of VLMs. Ego3D-VLM generates a cognitive map based on estimated global 3D coordinates, resulting in 12% average improvement on multi-choice QA and 56% average improvement on absolute distance estimation. Ego3D-VLM is modular and can be integrated with any existing VLM. Together, Ego3D-Bench and Ego3D-VLM offer valuable tools for advancing toward human-level spatial understanding in real-world, multi-view environments.
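
One plausible form of Ego3D-VLM's cognitive map is a coarse top-down grid built from the estimated global 3D coordinates and serialized as text for the VLM. The sketch below is an assumption about that format; the paper's exact representation is not given in the summary.

```python
import numpy as np

def cognitive_map(objects, extent=50.0, cells=10):
    """Render objects with estimated global 3D coordinates onto a coarse
    ego-centric top-down grid (ego agent at the origin, north = +y)."""
    grid = [["." for _ in range(cells)] for _ in range(cells)]
    for name, (x, y, _z) in objects:
        col = int(np.clip((x + extent) / (2 * extent) * cells, 0, cells - 1))
        row = int(np.clip((extent - y) / (2 * extent) * cells, 0, cells - 1))
        grid[row][col] = name[0].upper()
    return "\n".join(" ".join(r) for r in grid)

print(cognitive_map([("car", (10.0, 20.0, 0.0)), ("pedestrian", (-5.0, 2.0, 0.0))]))
```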

[251] AI-driven Remote Facial Skin Hydration and TEWL Assessment from Selfie Images: A Systematic Solution

Cecelia Soh, Rizhao Cai, Monalisha Paul, Dennis Sng, Alex Kot

Main category: cs.CV

TL;DR: First study to estimate skin hydration and trans-epidermal water loss from selfie facial images using smartphones, enabling remote skin barrier assessment without specialized instruments.

DetailsMotivation: Skin barrier function measured by SH/TEWL is crucial for health but requires specialized clinic equipment, making regular monitoring inaccessible to general users.

Method: Developed Skin-Prior Adaptive Vision Transformer model with symmetric-based contrastive regularization to handle data imbalance, using selfie facial images for SH/TEWL regression.

Result: Proposed solution effectively estimates skin barrier indicators from facial images, addressing annotation imbalance issues and enabling remote skin assessment.

Conclusion: Bridges computer vision and skincare research, enabling AI-driven accessible skin analysis for broader real-world applications without physical measurements.

Abstract: Skin health and disease resistance are closely linked to the skin barrier function, which protects against environmental factors and water loss. Two key physiological indicators can quantitatively represent this barrier function: skin hydration (SH) and trans-epidermal water loss (TEWL). Measurement of SH and TEWL is valuable for the public to monitor skin conditions regularly, diagnose dermatological issues, and personalize their skincare regimens. However, these measurements are not easily accessible to general users unless they visit a dermatology clinic with specialized instruments. To tackle this problem, we propose a systematic solution to estimate SH and TEWL from selfie facial images remotely with smartphones. Our solution encompasses multiple stages, including SH/TEWL data collection, data preprocessing, and formulating a novel Skin-Prior Adaptive Vision Transformer model for SH/TEWL regression. Through experiments, we identified the annotation imbalance of the SH/TEWL data and proposed a symmetric-based contrastive regularization to effectively reduce the model bias caused by the imbalance. This work is the first study to explore skin assessment from selfie facial images without physical measurements. It bridges the gap between computer vision and skin care research, enabling AI-driven accessible skin analysis for broader real-world applications.

[252] Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding

Jiangnan Xie, Xiaolong Zheng, Liang Zheng

Main category: cs.CV

TL;DR: PAML is a novel framework that addresses limitations in open-vocabulary visual grounding through prototype-aware multimodal learning, achieving state-of-the-art performance in handling both familiar and novel object categories.

DetailsMotivation: Current transformer-based visual grounding methods perform well in standard scenes but struggle with open-vocabulary scenarios due to imperfect visual-linguistic alignment, insufficient cross-modal fusion, and ineffective use of semantic prototype information.

Method: The framework uses ALBEF for initial cross-modal alignment, a Visual Discriminative Feature Encoder to enhance object representations, a prototype discovering and inheriting mechanism for semantic prototype aggregation, and a Multi-stage Decoder for comprehensive multimodal integration before bounding box regression.

Result: Extensive experiments across five benchmark datasets show competitive performance in standard scenes and state-of-the-art results in open-vocabulary visual grounding scenarios.

Conclusion: PAML successfully addresses key challenges in open-vocabulary visual grounding through systematic multimodal learning with prototype awareness, demonstrating superior performance particularly in handling novel object categories during testing.

Abstract: Visual Grounding (VG) aims to utilize given natural language queries to locate specific target objects within images. While current transformer-based approaches demonstrate strong localization performance in standard scenes (i.e., scenarios without any novel objects), they exhibit notable limitations in open-vocabulary scenes (i.e., scenes containing both familiar and novel object categories during testing). These limitations primarily stem from three key factors: (1) imperfect alignment between visual and linguistic modalities, (2) insufficient cross-modal feature fusion, and (3) ineffective utilization of semantic prototype information. To overcome these challenges, we present Prototype-Aware Multimodal Learning (PAML), an innovative framework that systematically addresses these issues through several key components: First, we leverage ALBEF to establish robust cross-modal alignment during initial feature encoding. Subsequently, our Visual Discriminative Feature Encoder selectively enhances salient object representations while suppressing irrelevant visual context. The framework then incorporates a novel prototype discovering and inheriting mechanism that extracts and aggregates multi-neighbor semantic prototypes to facilitate open-vocabulary recognition. These enriched features undergo comprehensive multimodal integration through our Multi-stage Decoder before final bounding box regression. Extensive experiments across five benchmark datasets validate our approach, showing competitive performance in standard scenes while achieving state-of-the-art results in open-vocabulary scenes. Our code is available at https://github.com/plankXie/PAML.

[253] Video-based Generalized Category Discovery via Memory-Guided Consistency-Aware Contrastive Learning

Zhang Jing, Pu Nan, Xie Yu Xiang, Guo Yanming, Lu Qianqi, Zou Shiwei, Yan Jie, Chen Yan

Main category: cs.CV

TL;DR: Proposes Video-GCD framework with Memory-guided Consistency-aware Contrastive Learning (MCCL) to discover novel categories in videos using temporal-spatial cues and dual-level memory buffers, outperforming image-based GCD methods.

DetailsMotivation: Static images are insufficient for reliable novel category discovery. Videos provide multi-perspective temporal information that can enhance category discovery, but existing GCD methods focus only on static images.

Method: MCCL framework with two components: Consistency-Aware Contrastive Learning (CACL) that weights contrastive loss using temporal consistency scores, and Memory-Guided Representation Enhancement (MGRE) with dual-level memory buffers for feature and logit representations.

Result: Significantly outperforms competitive GCD approaches adapted from image-based settings on new Video-GCD benchmark including action recognition and bird classification datasets.

Conclusion: Temporal information is crucial for discovering novel categories in videos. The proposed MCCL framework effectively integrates multi-perspective temporal cues and demonstrates superior performance over image-based methods.

Abstract: Generalized Category Discovery (GCD) is an emerging and challenging open-world problem that has garnered increasing attention in recent years. Most existing GCD methods focus on discovering categories in static images. However, relying solely on static visual content is often insufficient to reliably discover novel categories. To bridge this gap, we extend the GCD problem to the video domain and introduce a new setting, termed Video-GCD. In this setting, effectively integrating multi-perspective information across time is crucial for accurate Video-GCD. To tackle this challenge, we propose a novel Memory-guided Consistency-aware Contrastive Learning (MCCL) framework, which explicitly captures temporal-spatial cues and incorporates them into contrastive learning through a consistency-guided voting mechanism. MCCL consists of two core components: Consistency-Aware Contrastive Learning (CACL) and Memory-Guided Representation Enhancement (MGRE). CACL exploits multi-perspective temporal features to estimate consistency scores between unlabeled instances, which are then used to weight the contrastive loss accordingly. MGRE introduces a dual-level memory buffer that maintains both feature-level and logit-level representations, providing global context to enhance intra-class compactness and inter-class separability. This in turn refines the consistency estimation in CACL, forming a mutually reinforcing feedback loop between representation learning and consistency modeling. To facilitate a comprehensive evaluation, we construct a new and challenging Video-GCD benchmark, which includes action recognition and bird classification video datasets. Extensive experiments demonstrate that our method significantly outperforms competitive GCD approaches adapted from image-based settings, highlighting the importance of temporal information for discovering novel categories in videos. The code will be publicly available.
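
The CACL idea, weighting the contrastive loss by temporal consistency, can be sketched compactly. Below, consistency is estimated as average cross-view feature agreement and used to down-weight unstable anchors; the estimator, the InfoNCE form, and the `pos_index` input (each anchor's known positive, e.g., another augmentation of the same clip) are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def consistency_scores(views):
    """views: (B, T, D) features of each instance from T temporal perspectives.
    Returns (B, B) average cross-view agreement, clamped to [0, 1]."""
    v = F.normalize(views, dim=-1)
    sim = torch.einsum('itd,jsd->ijts', v, v)   # every view pair between instances
    return sim.mean(dim=(2, 3)).clamp(min=0)

def consistency_weighted_info_nce(feats, views, pos_index, tau=0.1):
    """InfoNCE where each anchor's term is scaled by its temporal consistency,
    so temporally unstable instances contribute less."""
    z = F.normalize(feats, dim=-1)
    logits = z @ z.t() / tau
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(eye, float('-inf'))          # exclude self-pairs
    per_anchor = F.cross_entropy(logits, pos_index, reduction='none')
    w = consistency_scores(views).mean(dim=1)                # per-instance weight
    return (w * per_anchor).sum() / w.sum().clamp(min=1e-8)
```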

[254] Text4Seg++: Advancing Image Segmentation via Generative Language Modeling

Mengcheng Lan, Chaofeng Chen, Jiaxing Xu, Zongrui Li, Yiping Ke, Xudong Jiang, Yingchen Yu, Yunqing Zhao, Song Bai

Main category: cs.CV

TL;DR: Text4Seg++ proposes a text-as-mask paradigm that treats image segmentation as a text generation problem using semantic descriptors, achieving state-of-the-art performance without task-specific fine-tuning.

DetailsMotivation: Effectively integrating image segmentation into Multimodal Large Language Models (MLLMs) remains challenging, requiring a simpler approach that eliminates additional decoders.

Method: Uses semantic descriptors as textual representations of segmentation masks, with image-wise and box-wise variants. Introduces Row-wise Run-Length Encoding (R-RLE) for compression and efficiency. Text4Seg++ formulates segmentation as next-brick prediction using structured mask tokens.

Result: Achieves strong segmentation performance across diverse benchmarks, reduces semantic descriptor length by 74%, accelerates inference by 3x, and outperforms state-of-the-art models without task-specific fine-tuning.

Conclusion: The text-driven approach demonstrates effectiveness, scalability, and generalizability for image segmentation within MLLM frameworks while maintaining compatibility with existing backbones.

Abstract: Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks. However, effectively integrating image segmentation into these models remains a significant challenge. In this work, we propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. We first introduce image-wise semantic descriptors, a patch-aligned textual representation of segmentation masks that integrates naturally into the language modeling pipeline. To enhance efficiency, we introduce the Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by 3×, without compromising performance. Building upon this, our initial framework Text4Seg achieves strong segmentation performance across a wide range of vision tasks. To further improve granularity and compactness, we propose box-wise semantic descriptors, which localize regions of interest using bounding boxes and represent region masks via structured mask tokens called semantic bricks. This leads to our refined model, Text4Seg++, which formulates segmentation as a next-brick prediction task, combining precision, scalability, and generative efficiency. Comprehensive experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models across diverse benchmarks without any task-specific fine-tuning, while remaining compatible with existing MLLM backbones. Our work highlights the effectiveness, scalability, and generalizability of text-driven image segmentation within the MLLM framework.
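
Row-wise Run-Length Encoding is simple enough to state exactly: within each row of patch labels, collapse repeated labels into (label, run) pairs, which is where the reported 74% sequence-length reduction comes from. A minimal implementation:

```python
def rrle_encode(rows):
    """Row-wise Run-Length Encoding of patch-level labels: compress repeated
    labels within each row independently, as in Text4Seg's R-RLE."""
    encoded = []
    for row in rows:
        runs, prev, count = [], row[0], 1
        for label in row[1:]:
            if label == prev:
                count += 1
            else:
                runs.append((prev, count))
                prev, count = label, 1
        runs.append((prev, count))
        encoded.append(runs)
    return encoded

grid = [["sky"] * 6 + ["tree"] * 2,
        ["sky"] * 3 + ["road"] * 5]
print(rrle_encode(grid))
# [[('sky', 6), ('tree', 2)], [('sky', 3), ('road', 5)]]
```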

[255] Towards scalable organ level 3D plant segmentation: Bridging the data algorithm computing gap

Ruiming Du, Guangxun Zhai, Tian Qiu, Yu Jiang

Main category: cs.CV

TL;DR: This review addresses 3D plant segmentation challenges by analyzing datasets, summarizing deep learning methods, introducing an open-source benchmarking framework, and evaluating networks and sim-to-real strategies to bridge algorithmic advances with practical deployment.

DetailsMotivation: The precise characterization of plant morphology provides valuable insights into plant-environment interactions and genetic evolution, but 3D segmentation adoption for plant phenotyping remains limited by dataset scarcity, technical adaptation difficulties, and lack of standardized benchmarks.

Method: Systematic review approach including: overview of existing 3D plant datasets, summary of deep learning methods for point cloud segmentation, introduction of Plant Segmentation Studio (open-source benchmarking framework), and extensive quantitative experiments evaluating representative networks and sim-to-real learning strategies.

Result: Findings highlight the efficacy of sparse convolutional backbones and transformer-based instance segmentation, and emphasize the complementary role of modeling-based and augmentation-based synthetic data generation for sim-to-real learning to reduce annotation demands.

Conclusion: This study bridges the gap between algorithmic advances and practical deployment, providing immediate tools for researchers and a roadmap for developing data-efficient and generalizable deep learning solutions in 3D plant phenotyping.

Abstract: The precise characterization of plant morphology provides valuable insights into plant-environment interactions and genetic evolution. A key technology for extracting this information is 3D segmentation, which delineates individual plant organs from complex point clouds. Despite significant progress in general 3D computer vision domains, the adoption of 3D segmentation for plant phenotyping remains limited by three major challenges: i) the scarcity of large-scale annotated datasets, ii) technical difficulties in adapting advanced deep neural networks to plant point clouds, and iii) the lack of standardized benchmarks and evaluation protocols tailored to plant science. This review systematically addresses these barriers by: i) providing an overview of existing 3D plant datasets in the context of general 3D segmentation domains, ii) systematically summarizing deep learning-based methods for point cloud semantic and instance segmentation, iii) introducing Plant Segmentation Studio (PSS), an open-source framework for reproducible benchmarking, and iv) conducting extensive quantitative experiments to evaluate representative networks and sim-to-real learning strategies. Our findings highlight the efficacy of sparse convolutional backbones and transformer-based instance segmentation, while also emphasizing the complementary role of modeling-based and augmentation-based synthetic data generation for sim-to-real learning in reducing annotation demands. In general, this study bridges the gap between algorithmic advances and practical deployment, providing immediate tools for researchers and a roadmap for developing data-efficient and generalizable deep learning solutions in 3D plant phenotyping. Data and code are available at https://github.com/perrydoremi/PlantSegStudio.

[256] Quantitative Currency Evaluation in Low-Resource Settings through Pattern Analysis to Assist Visually Impaired Users

Md Sultanul Islam Ovi, Mainul Hossain, Md Badsha Biswas

Main category: cs.CV

TL;DR: Unified framework for currency evaluation with denomination classification, damage quantification, and counterfeit detection using lightweight CNN models and novel UCDI metric for real-time on-device inference in low-resource environments.

DetailsMotivation: Existing currency recognition systems overlook usability and authenticity assessment, especially for visually impaired users and offline validation in low-resource environments. Current methods focus only on denomination classification while ignoring physical degradation and forgery detection.

Method: Three-module framework: 1) Denomination classification using lightweight CNN models, 2) Damage quantification through Unified Currency Damage Index (UCDI) based on binary mask loss, chromatic distortion, and structural feature loss, 3) Counterfeit detection using feature-based template matching. Dataset of 82,000+ annotated images covering clean, damaged, and counterfeit notes.

Result: Custom_CNN model achieves high classification performance with low parameter count. UCDI provides continuous usability score. Counterfeit detection module reliably identifies forged notes across varied imaging conditions. Framework supports real-time on-device inference.

Conclusion: The framework demonstrates that accurate, interpretable, and compact solutions can support inclusive currency evaluation in practical settings, addressing key deployment challenges in constrained environments.

Abstract: Currency recognition systems often overlook usability and authenticity assessment, especially in low-resource environments where visually impaired users and offline validation are common. While existing methods focus on denomination classification, they typically ignore physical degradation and forgery, limiting their applicability in real-world conditions. This paper presents a unified framework for currency evaluation that integrates three modules: denomination classification using lightweight CNN models, damage quantification through a novel Unified Currency Damage Index (UCDI), and counterfeit detection using feature-based template matching. The dataset consists of over 82,000 annotated images spanning clean, damaged, and counterfeit notes. Our Custom_CNN model achieves high classification performance with low parameter count. The UCDI metric provides a continuous usability score based on binary mask loss, chromatic distortion, and structural feature loss. The counterfeit detection module demonstrates reliable identification of forged notes across varied imaging conditions. The framework supports real-time, on-device inference and addresses key deployment challenges in constrained environments. Results show that accurate, interpretable, and compact solutions can support inclusive currency evaluation in practical settings.
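
The UCDI combines three normalized damage terms into one usability score. How the paper weights them is not stated in the summary, so the convex combination below is purely hypothetical:

```python
import numpy as np

def ucdi(mask_loss, chroma_dist, struct_loss, weights=(0.4, 0.3, 0.3)):
    """Hypothetical usability score: a convex combination of the three damage
    terms named in the abstract, each assumed pre-normalized to [0, 1]
    (0 = pristine note). Higher output = more usable."""
    terms = np.clip([mask_loss, chroma_dist, struct_loss], 0.0, 1.0)
    return 1.0 - float(np.dot(weights, terms))

print(ucdi(0.10, 0.05, 0.20))  # 0.885
```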

[257] Multi-Modal Camera-Based Detection of Vulnerable Road Users

Penelope Brown, Julie Stephany Berrio Perez, Mao Shan, Stewart Worrall

Main category: cs.CV

TL;DR: Multimodal framework combining RGB and thermal imaging with YOLOv8 improves vulnerable road user detection in challenging conditions through class re-weighting and data augmentation.

DetailsMotivation: Vulnerable road users represent over half of global traffic deaths, with detection remaining challenging in poor lighting, adverse weather, and unbalanced datasets.

Method: Integrated RGB and thermal infrared imaging with fine-tuned YOLOv8 model, using KITTI, BDD100K, and FLIR datasets. Employed class re-weighting, light augmentations, 640-pixel resolution, and partial backbone freezing.

Result: Thermal models achieved highest precision, RGB-to-thermal augmentation boosted recall. Class-weighted losses improved recall for rare VRUs, demonstrating optimized accuracy and efficiency.

Conclusion: Multimodal detection framework shows potential to significantly improve VRU safety at intersections by leveraging complementary strengths of RGB and thermal imaging modalities.

Abstract: Vulnerable road users (VRUs) such as pedestrians, cyclists, and motorcyclists represent more than half of global traffic deaths, yet their detection remains challenging in poor lighting, adverse weather, and unbalanced datasets. This paper presents a multimodal detection framework that integrates RGB and thermal infrared imaging with a fine-tuned YOLOv8 model. Training leveraged the KITTI, BDD100K, and Teledyne FLIR datasets, with class re-weighting and light augmentations to improve minority-class performance and robustness. Experiments show that 640-pixel resolution and partial backbone freezing optimise accuracy and efficiency, while class-weighted losses enhance recall for rare VRUs. Results highlight that thermal models achieve the highest precision, and RGB-to-thermal augmentation boosts recall, demonstrating the potential of multimodal detection to improve VRU safety at intersections.
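
Class re-weighting for rare VRU classes is a standard recipe worth making concrete. The inverse-frequency weighting below is one common choice, not necessarily the paper's exact scheme:

```python
import torch

def inverse_frequency_weights(labels, num_classes, beta=1.0):
    """Per-class loss weights from label counts: rarer classes (e.g., cyclists,
    motorcyclists) get larger weights; beta < 1 softens the correction."""
    counts = torch.bincount(labels, minlength=num_classes).float().clamp(min=1)
    w = (counts.sum() / counts) ** beta
    return w / w.mean()                      # normalize so the average weight is 1

labels = torch.tensor([0] * 900 + [1] * 80 + [2] * 20)
print(inverse_frequency_weights(labels, 3))  # feed to e.g. CrossEntropyLoss(weight=...)
```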

[258] Harnessing Object Grounding for Time-Sensitive Video Understanding

Tz-Ying Wu, Sharath Nittur Sridhar, Subarna Tripathi

Main category: cs.CV

TL;DR: GO-Tokenizer improves video understanding by encoding object information directly into Video-LLMs instead of using noisy textual descriptions, achieving better performance across multiple tasks.

DetailsMotivation: Time-sensitive video understanding can benefit from grounded object information, but textual object descriptions add token length and noise to prompts.

Method: Propose GO-Tokenizer - a lightweight add-on module that uses off-the-shelf object detectors to encode compact object information on the fly for Video-LLMs.

Result: Pretraining with GO-Tokenizer outperforms vanilla Video-LLM and text-based object description methods across different models, datasets, and video understanding tasks.

Conclusion: GO-Tokenizer provides an effective way to incorporate object-level information into Video-LLMs without the drawbacks of textual descriptions, generalizing well across various video understanding tasks.

Abstract: We propose to improve the time-sensitive video understanding (TSV) capability of video large language models (Video-LLMs) with grounded objects (GO). We hypothesize that TSV tasks can benefit from GO within frames, which is supported by our preliminary experiments on LITA, a state-of-the-art Video-LLM for reasoning temporal localization. While augmenting prompts with textual descriptions of these object annotations improves the performance of LITA, it also introduces extra token length and susceptibility to noise in object-level information. To address this, we propose GO-Tokenizer, a lightweight add-on module for Video-LLMs leveraging off-the-shelf object detectors to encode compact object information on the fly. Experimental results demonstrate that pretraining with GO-Tokenizer outperforms the vanilla Video-LLM and its counterpart utilizing textual descriptions of objects in the prompt. The gain generalizes across different models, datasets, and video understanding tasks such as reasoning temporal localization and dense captioning.
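
The tokenizer's job is to replace verbose textual object descriptions with a few compact embeddings. A minimal sketch of that interface, with all dimensions and the fusion point assumed:

```python
import torch.nn as nn

class GOTokenizer(nn.Module):
    """Turn detector outputs (class id + box) into a few compact embedding
    tokens to append to the video tokens, instead of verbose text."""
    def __init__(self, num_classes, dim):
        super().__init__()
        self.cls_emb = nn.Embedding(num_classes, dim)
        self.box_proj = nn.Linear(4, dim)

    def forward(self, class_ids, boxes):     # (B, K), (B, K, 4) normalized xyxy
        return self.cls_emb(class_ids) + self.box_proj(boxes)   # (B, K, dim)
```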

[259] Multi View Slot Attention Using Paraphrased Texts For Face Anti-Spoofing

Jeongmin Yu, Susang Kim, Kisu Lee, Taekyoung Kwon, Won-Yong Shin, Ha Young Kim

Main category: cs.CV

TL;DR: MVP-FAS is a novel face anti-spoofing framework that enhances CLIP-based models by using multi-view slot attention and multi-text patch alignment with multiple paraphrased texts to improve generalization and spoof detection.

DetailsMotivation: Existing CLIP-based FAS models underutilize patch embedding tokens and rely on single text prompts per class, limiting their ability to detect critical spoofing clues and generalize across domains.

Method: Proposes MVP-FAS with two modules: Multi-View Slot attention (MVS) extracts local spatial features and global context from patch embeddings using diverse text perspectives; Multi-Text Patch Alignment (MTPA) aligns patches with multiple text representations for semantic robustness.

Result: Extensive experiments show MVP-FAS achieves superior generalization performance, outperforming previous state-of-the-art methods on cross-domain datasets.

Conclusion: The framework successfully addresses limitations of existing CLIP-based FAS models by leveraging multiple text perspectives and improved patch-text alignment, demonstrating enhanced cross-domain generalization capabilities.

Abstract: Recent face anti-spoofing (FAS) methods have shown remarkable cross-domain performance by employing vision-language models like CLIP. However, existing CLIP-based FAS models do not fully exploit CLIP’s patch embedding tokens, failing to detect critical spoofing clues. Moreover, these models rely on a single text prompt per class (e.g., ‘live’ or ‘fake’), which limits generalization. To address these issues, we propose MVP-FAS, a novel framework incorporating two key modules: Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA). Both modules utilize multiple paraphrased texts to generate generalized features and reduce dependence on domain-specific text. MVS extracts local detailed spatial features and global context from patch embeddings by leveraging diverse texts with multiple perspectives. MTPA aligns patches with multiple text representations to improve semantic robustness. Extensive experiments demonstrate that MVP-FAS achieves superior generalization performance, outperforming previous state-of-the-art methods on cross-domain datasets. Code: https://github.com/Elune001/MVP-FAS.
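
The MTPA idea, scoring each patch against several paraphrased prompts per class and averaging, can be written in a few lines. The shapes and the use of plain cosine similarity are assumptions:

```python
import torch
import torch.nn.functional as F

def multi_text_patch_alignment(patch_emb, text_embs):
    """patch_emb: (B, P, D) CLIP patch embeddings.
    text_embs:  (C, K, D) K paraphrase embeddings for each of C classes.
    Returns (B, P, C) per-patch class scores averaged over paraphrases."""
    p = F.normalize(patch_emb, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    sims = torch.einsum('bpd,ckd->bpck', p, t)   # patch vs. every paraphrase
    return sims.mean(dim=-1)                     # average over the K paraphrases
```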

[260] A Multi-Modal Deep Learning Framework for Colorectal Pathology Diagnosis: Integrating Histological and Colonoscopy Data in a Pilot Study

Krithik Ramesh, Ritvik Koneru

Main category: cs.CV

TL;DR: A unified deep learning pipeline using CNN to classify both histopathology slides and colonoscopy video frames for colorectal disease diagnosis, integrating class-balancing and augmentation methods.

DetailsMotivation: Traditional colorectal disease diagnosis requires separate evaluations of histological images and colonoscopy footage, leading to variability and inefficiencies in the diagnostic process.

Method: Proposed a unified CNN-based pipeline (ResNet-50 architecture) that processes both histopathological slides from PathMNIST dataset and colonoscopy video frames from HyperKvasir dataset, incorporating class-balancing learning, robust augmentation, and calibration methods.

Result: The study demonstrates an interpretable and reproducible diagnostic pipeline that successfully unifies multiple diagnostic modalities for colorectal disease detection.

Conclusion: This unified deep learning approach advances and simplifies colorectal disease detection by integrating multiple diagnostic modalities into a single efficient pipeline, potentially reducing variability and improving diagnostic accuracy.

Abstract: Colorectal diseases, including inflammatory conditions and neoplasms, require quick, accurate care to be effectively treated. Traditional diagnostic pipelines require extensive preparation and rely on separate, individual evaluations on histological images and colonoscopy footage, introducing possible variability and inefficiencies. This pilot study proposes a unified deep learning network that uses convolutional neural networks (CNNs) to classify both histopathological slides and colonoscopy video frames in one pipeline. The pipeline integrates class-balancing learning, robust augmentation, and calibration methods to ensure accurate results. Static colon histology images were taken from the PathMNIST dataset, and the lower gastrointestinal (colonoscopy) videos were drawn from the HyperKvasir dataset. The CNN architecture used was ResNet-50. This study demonstrates an interpretable and reproducible diagnostic pipeline that unifies multiple diagnostic modalities to advance and ease the detection of colorectal diseases.

[261] MRD-LiNet: A Novel Lightweight Hybrid CNN with Gradient-Guided Unlearning for Improved Drought Stress Identification

Aswini Kumar Patra, Lingaraj Sahoo

Main category: cs.CV

TL;DR: A lightweight hybrid CNN framework for drought stress detection that reduces parameters by 15x while maintaining accuracy, with machine unlearning capability for improved adaptability.

DetailsMotivation: Traditional drought detection methods are time-consuming and labor-intensive, while existing deep learning models require too many parameters for resource-limited agricultural settings.

Method: Novel lightweight hybrid CNN inspired by ResNet, DenseNet, and MobileNet architectures, plus gradient norm-based machine unlearning mechanism for targeted data removal.

Result: Achieves 15-fold parameter reduction compared to conventional CNNs and Vision Transformers while maintaining competitive accuracy on aerial potato field images.

Conclusion: The framework offers a practical, scalable, and adaptive solution for drought stress monitoring in precision agriculture, especially under resource-constrained conditions.

Abstract: Drought stress is a major threat to global crop productivity, making its early and precise detection essential for sustainable agricultural management. Traditional approaches, though useful, are often time-consuming and labor-intensive, which has motivated the adoption of deep learning methods. In recent years, Convolutional Neural Network (CNN) and Vision Transformer architectures have been widely explored for drought stress identification; however, these models generally rely on a large number of trainable parameters, restricting their use in resource-limited and real-time agricultural settings. To address this challenge, we propose a novel lightweight hybrid CNN framework inspired by ResNet, DenseNet, and MobileNet architectures. The framework achieves a remarkable 15-fold reduction in trainable parameters compared to conventional CNN and Vision Transformer models, while maintaining competitive accuracy. In addition, we introduce a machine unlearning mechanism based on a gradient norm-based influence function, which enables targeted removal of specific training data influence, thereby improving model adaptability. The method was evaluated on an aerial image dataset of potato fields with expert-annotated healthy and drought-stressed regions. Experimental results show that our framework achieves high accuracy while substantially lowering computational costs. These findings highlight its potential as a practical, scalable, and adaptive solution for drought stress monitoring in precision agriculture, particularly under resource-constrained conditions.
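
A gradient-norm influence score, the ingredient behind the unlearning mechanism, can be sketched as follows. How MRD-LiNet converts these scores into an unlearning update is not detailed in the summary; ranking samples and fine-tuning without the top-influence ones is one straightforward use.

```python
import torch

def influence_by_gradient_norm(model, loss_fn, samples):
    """Score each (x, y) training sample by the norm of its loss gradient
    w.r.t. the model parameters; larger norm = larger influence."""
    scores = []
    for x, y in samples:
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        g2 = sum((p.grad ** 2).sum() for p in model.parameters()
                 if p.grad is not None)
        scores.append(torch.sqrt(g2).item())
    return scores
```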

[262] Your Super Resolution Model is not Enough for Tackling Real-World Scenarios

Dongsik Yoon, Jongeun Kim

Main category: cs.CV

TL;DR: Proposes SAAM - a plug-in Scale-Aware Attention Module that enables fixed-scale super-resolution models to handle arbitrary scale factors with minimal computational overhead.

DetailsMotivation: Traditional SISR models struggle to generalize across varying scale factors, limiting their real-world applicability. Need for a solution that can retrofit existing models for multi-scale capability.

Method: Uses lightweight scale-adaptive feature extraction and upsampling with Simple parameter-free Attention Module (SimAM) for efficient guidance, plus gradient variance loss to enhance image sharpness.

Result: Successfully integrates into multiple state-of-the-art SR backbones (SCNet, HiT-SR, OverNet), delivering competitive or superior performance across integer and non-integer scale factors.

Conclusion: Provides a practical plug-in solution for robust multi-scale upscaling with minimal computational overhead, enhancing real-world applicability of existing SR models.

Abstract: Despite remarkable progress in Single Image Super-Resolution (SISR), traditional models often struggle to generalize across varying scale factors, limiting their real-world applicability. To address this, we propose a plug-in Scale-Aware Attention Module (SAAM) designed to retrofit modern fixed-scale SR models with the ability to perform arbitrary-scale SR. SAAM employs lightweight, scale-adaptive feature extraction and upsampling, incorporating the Simple parameter-free Attention Module (SimAM) for efficient guidance and gradient variance loss to enhance sharpness in image details. Our method integrates seamlessly into multiple state-of-the-art SR backbones (e.g., SCNet, HiT-SR, OverNet), delivering competitive or superior performance across a wide range of integer and non-integer scale factors. Extensive experiments on benchmark datasets demonstrate that our approach enables robust multi-scale upscaling with minimal computational overhead, offering a practical solution for real-world scenarios.
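
The gradient variance loss mentioned above can be made concrete: compute gradient magnitudes for both images, then match the variance of those magnitudes over local patches. The patch size, the Sobel operator, and the L1 distance are assumptions:

```python
import torch
import torch.nn.functional as F

def gradient_variance_loss(sr, hr, patch=8):
    """Match the local variance of Sobel gradient magnitudes between the
    super-resolved and ground-truth images, encouraging sharp edges.
    sr, hr: (B, C, H, W) with H, W divisible by `patch`."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=sr.device, dtype=sr.dtype).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)

    def grad_mag(img):
        g = img.mean(dim=1, keepdim=True)          # luminance proxy
        gx = F.conv2d(g, kx, padding=1)
        gy = F.conv2d(g, ky, padding=1)
        return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

    def patch_var(x):
        u = F.unfold(x, kernel_size=patch, stride=patch)   # (B, patch*patch, L)
        return u.var(dim=1)

    return F.l1_loss(patch_var(grad_mag(sr)), patch_var(grad_mag(hr)))
```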

[263] AI-based response assessment and prediction in longitudinal imaging for brain metastases treated with stereotactic radiosurgery

Lorenz Achim Kuhn, Daniel Abler, Jonas Richiardi, Andreas F. Hottinger, Luis Schiappacasse, Vincent Dunet, Adrien Depeursinge, Vincent Andrearczyk

Main category: cs.CV

TL;DR: Automated pipeline for analyzing brain metastases growth trajectories and predicting treatment response using longitudinal MRI data and machine learning.

DetailsMotivation: Brain metastases contribute significantly to cancer mortality, but manual analysis of longitudinal MRI follow-up images is too time-consuming for clinicians, limiting systematic assessment of treatment response.

Method: Implemented automated pipeline to curate longitudinal dataset of 896 brain metastases from 177 patients. Used data-driven clustering to identify growth trajectories and employed both classical gradient boosting and graph machine learning for 12-month response prediction.

Result: Identified 5 dominant growth trajectories with distinct response categories. Achieved 0.90 AUC using gradient boosting and 0.88 AUC using graph ML for response prediction using only pre-treatment and first follow-up MRI data.

Conclusion: The pipeline enables automated, precise assessment of brain metastases response to treatment and provides foundation for clinical decision support systems for personalized care optimization.

Abstract: Brain Metastases (BM) are a large contributor to mortality of patients with cancer. They are treated with Stereotactic Radiosurgery (SRS) and monitored with Magnetic Resonance Imaging (MRI) at regular follow-up intervals according to treatment guidelines. Analyzing and quantifying this longitudinal imaging represents an intractable workload for clinicians. As a result, follow-up images are not annotated and merely assessed by observation. Response to treatment in longitudinal imaging is being studied, to better understand growth trajectories and ultimately predict treatment success or toxicity as early as possible. In this study, we implement an automated pipeline to curate a large longitudinal dataset of SRS treatment data, resulting in a cohort of 896 BMs in 177 patients who were monitored for >360 days at approximately two-month intervals at Lausanne University Hospital (CHUV). We use data-driven clustering to identify characteristic trajectories. In addition, we predict 12-month lesion-level response using classical machine learning as well as Graph Machine Learning (GML). Clustering revealed 5 dominant growth trajectories with distinct final response categories. Response prediction reaches up to 0.90 AUC (CI95%=0.88-0.92) using only pre-treatment and first follow-up MRI with gradient boosting. Similarly, robust predictive performance of up to 0.88 AUC (CI95%=0.86-0.90) was obtained using GML, offering more flexibility with a single model for multiple input time-point configurations. Our results suggest potential automation and increased precision for the comprehensive assessment and prediction of BM response to SRS in longitudinal MRI. The proposed pipeline facilitates scalable data curation for the investigation of BM growth patterns, and lays the foundation for clinical decision support systems aiming at optimizing personalized care.

[264] 3DOF+Quantization: 3DGS quantization for large scenes with limited Degrees of Freedom

Matthieu Gendrin, Stéphane Pateux, Théo Ladune

Main category: cs.CV

TL;DR: 3D Gaussian Splatting for limited 3DoF+ scenes with improved quantization using spherical coordinates to reduce projection errors.

DetailsMotivation: 3DGS provides 6DoF freedom but input views are often limited to a zone, making 3DoF+ (limited position offsets) more practical. Position quantization errors cause projection errors that scale with inverse squared distance.

Method: Proposed spherical coordinate quantization scheme to address position error impact on projection accuracy. Analyzed projection error proportional to squared inverse distance of projected points.

Result: Improved rate-distortion performance demonstrated on the Garden scene benchmark.

Conclusion: Spherical coordinate quantization effectively reduces projection errors for 3DoF+ applications in 3D Gaussian Splatting, making it more suitable for scenes with limited viewpoint acquisition zones.

Abstract: 3D Gaussian Splatting (3DGS) is a major breakthrough in 3D scene reconstruction. With a number of views of a given object or scene, the algorithm trains a model composed of 3D Gaussians, which enables the production of novel views from arbitrary points of view. This freedom of movement is referred to as 6DoF for 6 degrees of freedom: a view can be produced for any position (3 degrees) and any camera orientation (3 more degrees). On large scenes, though, the input views are acquired from a limited zone in space, and the reconstruction is valuable for novel views from the same zone, even if the scene itself is almost unlimited in size. We refer to this particular case as 3DoF+, meaning that the 3 degrees of freedom of camera position are limited to small offsets around the central position. Considering the problem of coordinate quantization, the impact of position error on the projection error in pixels is studied. It is shown that the projection error is proportional to the squared inverse distance of the point being projected. Consequently, a new quantization scheme based on spherical coordinates is proposed. The rate-distortion performance of the proposed method is illustrated on the well-known Garden scene.
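
The abstract's key observation, projection error proportional to the squared inverse distance, directly motivates quantizing the radius non-uniformly. Quantizing 1/r on a uniform grid makes the position step grow like r², which keeps the on-screen error roughly constant across depths. A sketch under that reading (the bit depth and radius range are illustrative, not the paper's settings):

```python
import numpy as np

def to_spherical(xyz, center):
    """Cartesian -> (r, theta, phi) around the central viewing position."""
    d = xyz - center
    r = np.linalg.norm(d, axis=-1)
    theta = np.arccos(np.clip(d[..., 2] / np.maximum(r, 1e-12), -1.0, 1.0))
    phi = np.arctan2(d[..., 1], d[..., 0])
    return r, theta, phi

def quantize_inverse_radius(r, n_bits=10, r_min=0.5, r_max=1000.0):
    """Uniformly quantize 1/r instead of r: equal steps in 1/r make the
    position step grow like r^2, matching the 1/d^2 projection-error law."""
    inv = 1.0 / np.clip(r, r_min, r_max)
    lo, hi = 1.0 / r_max, 1.0 / r_min
    q = np.round((inv - lo) / (hi - lo) * (2 ** n_bits - 1))
    return 1.0 / (lo + q / (2 ** n_bits - 1) * (hi - lo))
```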

[265] Phantom-Insight: Adaptive Multi-cue Fusion for Video Camouflaged Object Detection with Multimodal LLM

Hua Zhang, Changjiang Luo, Ruoyu Chen

Main category: cs.CV

TL;DR: Phantom-Insight is a novel VCOD method that combines SAM and MLLM to address edge separation and object-background separability issues in video camouflaged object detection.

DetailsMotivation: Existing SAM-based methods struggle with camouflaged object edge separation due to model freezing, while MLLM-based methods suffer from poor object separability as they merge foreground and background information.

Method: Uses temporal and spatial clues for video representation, performs feature fusion via LLM, employs dynamic foreground visual token scoring and prompt network to adaptively guide SAM, and implements decoupled foreground-background learning strategy with separate cue generation and training.

Result: Achieves state-of-the-art performance on MoCA-Mask dataset across various metrics and demonstrates strong generalization ability by detecting unseen camouflaged objects on CAD2016 dataset.

Conclusion: Phantom-Insight effectively addresses key limitations in VCOD by enhancing object edge detail separability and improving object-background discrimination through innovative SAM and MLLM integration strategies.

Abstract: Video camouflaged object detection (VCOD) is challenging due to dynamic environments. Existing methods face two main issues: (1) SAM-based methods struggle to separate camouflaged object edges due to model freezing, and (2) MLLM-based methods suffer from poor object separability as large language models merge foreground and background. To address these issues, we propose a novel VCOD method based on SAM and MLLM, called Phantom-Insight. To enhance the separability of object edge details, we represent video sequences with temporal and spatial clues and perform feature fusion via LLM to increase information density. Next, multiple cues are generated through the dynamic foreground visual token scoring module and the prompt network to adaptively guide and fine-tune the SAM model, enabling it to adapt to subtle textures. To enhance the separability of objects and background, we propose a decoupled foreground-background learning strategy. By generating foreground and background cues separately and performing decoupled training, the visual token can effectively integrate foreground and background information independently, enabling SAM to more accurately segment camouflaged objects in the video. Experiments on the MoCA-Mask dataset show that Phantom-Insight achieves state-of-the-art performance across various metrics. Additionally, its ability to detect unseen camouflaged objects on the CAD2016 dataset highlights its strong generalization ability.

[266] When Language Model Guides Vision: Grounding DINO for Cattle Muzzle Detection

Rabin Dulal, Lihong Zheng, Muhammad Ashad Kabir

Main category: cs.CV

TL;DR: Zero-shot cattle muzzle detection using Grounding DINO vision-language model achieves 76.8% mAP@0.5 without annotated training data, providing an annotation-free alternative to supervised methods.

DetailsMotivation: Traditional muzzle detection methods require extensive annotated datasets and are data-dependent, limiting performance on new cattle. Manual detection is labor-intensive and inconsistent.

Method: Proposes a zero-shot framework using Grounding DINO vision-language model that leverages natural language prompts to detect muzzles without task-specific training or annotated data.

Result: Achieves mean Average Precision (mAP)@0.5 of 76.8%, demonstrating promising performance without requiring annotated training data.

Conclusion: First annotation-free solution for cattle muzzle detection, offering practical alternative to supervised methods with improved adaptability and ease of deployment in livestock monitoring.

Abstract: Muzzle patterns are among the most effective biometric traits for cattle identification. Fast and accurate detection of the muzzle region as the region of interest is critical to automatic visual cattle identification. Earlier approaches relied on manual detection, which is labor-intensive and inconsistent. Recently, automated methods using supervised models like YOLO have become popular for muzzle detection. Although effective, these methods require extensive annotated datasets and tend to be dependent on their training data, limiting their performance on new or unseen cattle. To address these limitations, this study proposes a zero-shot muzzle detection framework based on Grounding DINO, a vision-language model capable of detecting muzzles without any task-specific training or annotated data. This approach leverages natural language prompts to guide detection, enabling scalable and flexible muzzle localization across diverse breeds and environments. Our model achieves a mean Average Precision (mAP)@0.5 of 76.8%, demonstrating promising performance without requiring annotated data. To our knowledge, this is the first research to provide a real-world, industry-oriented, and annotation-free solution for cattle muzzle detection. The framework offers a practical alternative to supervised methods, promising improved adaptability and ease of deployment in livestock monitoring applications.
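
As a concrete illustration of prompt-driven zero-shot detection, here is a hedged sketch using the Hugging Face transformers port of Grounding DINO. The checkpoint, prompt wording, and thresholds are assumptions (the paper does not specify them), and argument names can vary across transformers versions.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"   # checkpoint choice is illustrative
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).eval()

image = Image.open("cow.jpg")   # hypothetical input image
text = "a cattle muzzle."       # natural-language prompt guiding detection

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Map raw logits/boxes back to image coordinates, keeping confident hits.
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids,
    box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]
for score, box in zip(results["scores"], results["boxes"]):
    print(f"muzzle candidate: score={score:.2f}, box={box.tolist()}")
```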

[267] Cross3DReg: Towards a Large-scale Real-world Cross-source Point Cloud Registration Benchmark

Zongyi Xu, Zhongpeng Lang, Yilong Chen, Shanshan Zhao, Xiaoshui Huang, Yifan Zuo, Yan Zhang, Qianni Zhang, Xinbo Gao

Main category: cs.CV

TL;DR: Cross-source point cloud registration framework using visual-geometric attention and overlap prediction to handle multi-sensor data differences, achieving state-of-the-art performance with significant error reduction.

DetailsMotivation: Cross-source point cloud registration faces challenges due to lack of large-scale real-world datasets and inherent sensor differences that negatively impact feature extraction and matching accuracy.

Method: Constructed Cross3DReg dataset from mechanical and semi-solid-state lidars, designed overlap-based framework using unaligned images to predict overlapping regions, and developed visual-geometric attention guided matching module to fuse image and geometric information.

Result: Achieved 63.2% reduction in relative rotation error, 40.2% reduction in relative translation error, and 5.4% improvement in registration recall compared to previous methods.

Conclusion: The proposed framework effectively addresses cross-source registration challenges by filtering redundant points and enhancing feature consistency through visual-geometric fusion, demonstrating superior accuracy and robustness.

Abstract: Cross-source point cloud registration, which aims to align point cloud data from different sensors, is a fundamental task in 3D vision. However, compared to same-source point cloud registration, cross-source registration faces two core challenges: the lack of publicly available large-scale real-world datasets for training deep registration models, and the inherent differences in point clouds captured by multiple sensors. The diverse patterns induced by the sensors pose great challenges in robust and accurate point cloud feature extraction and matching, which negatively influence the registration accuracy. To advance research in this field, we construct Cross3DReg, currently the largest real-world multi-modal cross-source point cloud registration dataset, collected with a rotating mechanical lidar and a hybrid semi-solid-state lidar. Moreover, we design an overlap-based cross-source registration framework, which utilizes unaligned images to predict the overlapping region between source and target point clouds, effectively filtering out redundant points in the irrelevant regions and significantly mitigating the interference caused by noise in non-overlapping areas. Then, a visual-geometric attention guided matching module is proposed to enhance the consistency of cross-source point cloud features by fusing image and geometric information to establish reliable correspondences and ultimately achieve accurate and robust registration. Extensive experiments show that our method achieves state-of-the-art registration performance. Our framework reduces the relative rotation error (RRE) and relative translation error (RTE) by 63.2% and 40.2%, respectively, and improves the registration recall (RR) by 5.4%, which validates its effectiveness in achieving accurate cross-source registration.
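
For reference, the reported metrics have standard definitions in registration benchmarks; a minimal sketch follows (the recall thresholds are placeholders, as the exact values used by Cross3DReg are not stated here).

```python
import numpy as np

def registration_errors(R_est, t_est, R_gt, t_gt):
    """Relative rotation error (degrees) and relative translation error
    between an estimated and a ground-truth rigid transform."""
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    rre = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    rte = np.linalg.norm(t_est - t_gt)
    return rre, rte

def registration_recall(errors, rre_thresh=5.0, rte_thresh=2.0):
    """Fraction of pairs registered within (assumed) error thresholds."""
    return np.mean([(rre < rre_thresh) and (rte < rte_thresh)
                    for rre, rte in errors])
```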

[268] IGAff: Benchmarking Adversarial Iterative and Genetic Affine Algorithms on Deep Neural Networks

Sebastian-Vasile Echim, Andrei-Alexandru Preda, Dumitru-Clementin Cercel, Florin Pop

Main category: cs.CV

TL;DR: Novel black-box adversarial attack methods ATA and AGA outperform existing approaches, achieving up to 8.82% accuracy improvement on image classification across multiple network architectures and datasets.

DetailsMotivation: Deep neural networks are powerful but vulnerable to adversarial attacks, especially in black-box scenarios where model details are inaccessible. Understanding these vulnerabilities is crucial for improving model robustness.

Method: Developed two novel black-box iterative adversarial algorithms: 1) Affine Transformation Attack (ATA) using random affine transformations to maximize attack score, and 2) Affine Genetic Attack (AGA) combining random noise and affine transformations with genetic algorithms. Evaluated on ResNet-18, DenseNet-121, Swin Transformer V2, and Vision Transformer using Tiny ImageNet, Caltech-256, and Food-101 datasets.

Result: The proposed algorithms achieved better performance than existing black-box adversarial methods (Pixle and Square Attack), with up to 8.82% accuracy improvement on image classification tasks. Provided insights into successful adversarial defenses and attacks at both global and targeted levels.

Conclusion: The affine transformation-based approaches demonstrate effective adversarial attack capabilities in black-box settings, offering valuable insights for improving model robustness and understanding network vulnerabilities across different architectures.

Abstract: Deep neural networks currently dominate many fields of the artificial intelligence landscape, achieving state-of-the-art results on numerous tasks while remaining hard to understand and exhibiting surprising weaknesses. An active area of research focuses on adversarial attacks, which aim to generate inputs that uncover these weaknesses. However, this proves challenging, especially in the black-box scenario where model details are inaccessible. This paper explores in detail the impact of such adversarial algorithms on ResNet-18, DenseNet-121, Swin Transformer V2, and Vision Transformer network architectures. Leveraging the Tiny ImageNet, Caltech-256, and Food-101 datasets, we benchmark two novel black-box iterative adversarial algorithms based on affine transformations and genetic algorithms: 1) Affine Transformation Attack (ATA), an iterative algorithm maximizing our attack score function using random affine transformations, and 2) Affine Genetic Attack (AGA), a genetic algorithm that involves random noise and affine transformations. We evaluate the performance of the models under algorithm parameter variation, data augmentation, and global and targeted attack configurations. We also compare our algorithms with two black-box adversarial algorithms, Pixle and Square Attack. Our experiments yield better results on the image classification task than similar methods in the literature, achieving an accuracy improvement of up to 8.82%. We provide noteworthy insights into successful adversarial defenses and attacks at both global and targeted levels, and demonstrate adversarial robustness through algorithm parameter variation.
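
A minimal sketch of the iterative affine idea behind ATA: repeatedly sample a random affine transform of the input and keep the candidate that most reduces the model's confidence in the true class. The scoring rule and parameter ranges below are illustrative stand-ins for the paper's attack score function, not its actual definition.

```python
import torch
import torchvision.transforms.functional as TF

def affine_attack(model, x, label, steps=100):
    """Black-box iterative affine attack sketch.

    x: input batch of shape (1, C, H, W); label: true class index.
    Keeps the transformed input with the lowest true-class probability.
    """
    best_x, best_prob = x, float("inf")
    for _ in range(steps):
        angle = float(torch.empty(1).uniform_(-15.0, 15.0))
        tx, ty = (int(torch.randint(-4, 5, (1,))) for _ in range(2))
        scale = float(torch.empty(1).uniform_(0.9, 1.1))
        cand = TF.affine(x, angle=angle, translate=[tx, ty],
                         scale=scale, shear=[0.0, 0.0])
        with torch.no_grad():
            prob = model(cand).softmax(dim=-1)[0, label].item()
        if prob < best_prob:            # lower confidence = stronger attack
            best_prob, best_x = prob, cand
    return best_x
```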

[269] Focusing by Contrastive Attention: Enhancing VLMs’ Visual Reasoning

Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Xuanshan Zhou, Jiayu Yao, Jiafeng Guo, Xueqi Cheng

Main category: cs.CV

TL;DR: CARVE is a training-free method that improves VLM performance in complex visual environments by using attention contrasting to extract task-relevant signals from visual noise.

DetailsMotivation: VLMs degrade in complex visual environments, and existing enhancement approaches require additional training or external tools while overlooking VLMs' innate attention capabilities.

Method: Analyze VLM attention patterns, discover correlation between visual complexity and attention entropy, and propose CARVE - a method that uses contrastive attention between general and task-specific queries to decompose visual signals at pixel level.

Result: CARVE consistently enhances performance with up to 75% improvement on open-source models without requiring additional training.

Conclusion: The work provides insights into visual complexity and attention mechanisms, offering an efficient training-free pathway for improving visual reasoning through attention contrasting.

Abstract: Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. While existing enhancement approaches require additional training, rely on external segmentation tools, or operate at coarse-grained levels, they overlook the innate ability within VLMs. To bridge this gap, we investigate VLMs’ attention patterns and discover that: (1) visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance; (2) attention progressively refines from global scanning in shallow layers to focused convergence in deeper layers, with convergence degree determined by visual complexity; and (3) theoretically, we prove that the contrast of attention maps between general queries and task-specific queries enables the decomposition of visual signal into semantic signals and visual noise components. Building on these insights, we propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level. Extensive experiments demonstrate that CARVE consistently enhances performance, achieving up to 75% improvement on open-source models. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning with contrasting attention.
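
The attention-contrasting step can be sketched in a few lines: given pixel-level attention maps induced by a generic query and by the task query, the positive residual of their difference serves as the task-relevant signal. The normalization and clamping choices below are assumptions, not CARVE's exact formulation.

```python
import torch

def contrast_attention(attn_general, attn_task, eps=1e-8):
    """Contrast two (H, W) attention maps to isolate task-relevant signal.

    Both maps are first normalized to unit probability mass; the positive
    part of (task - general) is kept and rescaled to [0, 1] as a soft mask.
    """
    a_g = attn_general / (attn_general.sum() + eps)
    a_t = attn_task / (attn_task.sum() + eps)
    signal = torch.clamp(a_t - a_g, min=0.0)
    return signal / (signal.max() + eps)
```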

[270] A Statistical 3D Stomach Shape Model for Anatomical Analysis

Erez Posner, Ore Shtalrid, Oded Erell, Daniel Noy, Moshe Bouhnik

Main category: cs.CV

TL;DR: This paper presents the first statistical 3D shape model of the stomach, combining synthetic data generation with real CT scan refinement to capture anatomical variability for medical applications.

DetailsMotivation: There is a need for realistic 3D stomach models for research, diagnostics, and surgical planning, but development has been limited by data availability and methodological challenges.

Method: A novel pipeline for generating synthetic 3D stomach models informed by anatomical studies, followed by developing a statistical shape model trained on this synthetic dataset and refined using real CT meshes through semi-supervised alignment.

Result: The model demonstrated robust generalization and fit accuracy on a held-out test set of real stomach CT scans, successfully capturing natural anatomical variability in a low-dimensional shape space.

Conclusion: This work represents a significant advancement in organ modeling by combining synthetic data generation, parametric modeling, and real-world validation, opening new possibilities for personalized healthcare solutions in surgical simulation, pre-operative planning, and medical education.

Abstract: Realistic and parameterized 3D models of human anatomy have become invaluable in research, diagnostics, and surgical planning. However, the development of detailed models for internal organs, such as the stomach, has been limited by data availability and methodological challenges. In this paper, we propose a novel pipeline for the generation of synthetic 3D stomach models, enabling the creation of anatomically diverse morphologies informed by established studies on stomach shape variability. Using this pipeline, we construct a dataset of synthetic stomachs. Building on this dataset, we develop a 3D statistical shape model of the stomach, trained to capture natural anatomical variability in a low-dimensional shape space. The model is further refined using CT meshes derived from publicly available datasets through a semi-supervised alignment process, enhancing its ability to generalize to unseen anatomical variations. We evaluated the model on a held-out test set of real stomach CT scans, demonstrating robust generalization and fit accuracy. We make the statistical shape model along with the synthetic dataset publicly available on GitLab: https://gitlab.com/Erez.Posner/stomach_pytorch to facilitate further research. This work introduces the first statistical 3D shape model of the stomach, with applications ranging from surgical simulation and pre-operative planning to medical education and computational modeling. By combining synthetic data generation, parametric modeling, and real-world validation, our approach represents a significant advancement in organ modeling and opens new possibilities for personalized healthcare solutions.
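
The statistical-shape-model component follows the standard linear (PCA) construction: with meshes in dense vertex correspondence, a mean shape and a few principal modes span a low-dimensional shape space. Below is a generic sketch of that construction, not the authors' exact pipeline.

```python
import numpy as np

def fit_shape_model(meshes, n_components=16):
    """Fit a linear statistical shape model.

    meshes: (N, V, 3) array of N meshes with V corresponding vertices.
    Returns the mean shape, principal modes, and per-mode std deviations.
    """
    X = meshes.reshape(len(meshes), -1)              # (N, 3V)
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    modes = Vt[:n_components]                        # (k, 3V)
    stds = S[:n_components] / np.sqrt(len(meshes) - 1)
    return mean, modes, stds

def sample_shape(mean, modes, stds, coeffs):
    """Synthesize a new shape from low-dimensional coefficients (k,)."""
    return (mean + (coeffs * stds) @ modes).reshape(-1, 3)
```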

[271] Does DINOv3 Set a New Medical Vision Standard?

Che Liu, Yinda Chen, Haoyuan Shi, Jinpeng Lu, Bailiang Jian, Jiazhen Pan, Linghan Cai, Jiayi Wang, Yundi Zhang, Jun Li, Cosmin I. Bercea, Cheng Ouyang, Chen Chen, Zhiwei Xiong, Benedikt Wiestler, Christian Wachinger, Daniel Rueckert, Wenjia Bai, Rossella Arcucci

Main category: cs.CV

TL;DR: DINOv3, a self-supervised vision transformer trained on natural images, shows impressive performance as a unified encoder for medical vision tasks without domain-specific pre-training, outperforming medical-specific models on some tasks but showing limitations in highly specialized domains.

DetailsMotivation: To investigate whether general-purpose vision foundation models like DINOv3 can effectively transfer to medical imaging domains without requiring domain-specific pre-training, addressing the open question of how well these models perform in specialized medical contexts.

Method: Benchmarked DINOv3 across various medical vision tasks including 2D/3D classification and segmentation on multiple medical imaging modalities, systematically analyzing scalability by varying model sizes and input image resolutions.

Result: DINOv3 establishes strong performance as a baseline, outperforming medical-specific models like BiomedCLIP and CT-Net on several tasks despite being trained only on natural images. However, performance degrades in highly specialized domains like WSIs, EM, and PET, and scaling laws don’t consistently apply in medical contexts.

Conclusion: DINOv3 serves as a powerful baseline and robust visual prior for medical tasks, opening promising directions for leveraging its features in applications like 3D reconstruction, though domain-specific limitations exist in highly specialized medical imaging scenarios.

Abstract: The advent of large-scale vision foundation models, pre-trained on diverse natural images, has marked a paradigm shift in computer vision. However, how the efficacy of frontier vision foundation models transfers to specialized domains such as medical imaging remains an open question. This report investigates whether DINOv3, a state-of-the-art self-supervised vision transformer (ViT) that features strong capability in dense prediction tasks, can directly serve as a powerful, unified encoder for medical vision tasks without domain-specific pre-training. To answer this, we benchmark DINOv3 across common medical vision tasks, including 2D/3D classification and segmentation on a wide range of medical imaging modalities. We systematically analyze its scalability by varying model sizes and input image resolutions. Our findings reveal that DINOv3 shows impressive performance and establishes a formidable new baseline. Remarkably, it can even outperform medical-specific foundation models like BiomedCLIP and CT-Net on several tasks, despite being trained solely on natural images. However, we identify clear limitations: the model’s features degrade in scenarios requiring deep domain specialization, such as in Whole-Slide Pathological Images (WSIs), Electron Microscopy (EM), and Positron Emission Tomography (PET). Furthermore, we observe that DINOv3 does not consistently obey scaling laws in the medical domain; performance does not reliably increase with larger models or finer feature resolutions, showing diverse scaling behaviors across tasks. Ultimately, our work establishes DINOv3 as a strong baseline, whose powerful visual features can serve as a robust prior for multiple complex medical tasks. This opens promising future directions, such as leveraging its features to enforce multiview consistency in 3D reconstruction.

[272] FSG-Net: Frequency-Spatial Synergistic Gated Network for High-Resolution Remote Sensing Change Detection

Zhongxiang Xie, Shuangxi Miao, Yuhan Jiang, Zhewei Zhang, Jing Yao, Xuecao Li, Jianxi Huang, Pedram Ghamisi

Main category: cs.CV

TL;DR: FSG-Net is a novel frequency-spatial synergistic network that addresses false alarms and semantic gaps in high-resolution remote sensing change detection through frequency domain processing, spatial attention, and gated fusion, achieving state-of-the-art performance on multiple benchmarks.

DetailsMotivation: To overcome two critical challenges in change detection: false alarms caused by radiometric variations from temporal shifts (illumination, season), and the semantic gap between deep abstract features and shallow detail-rich features that leads to poorly delineated boundaries.

Method: Proposes FSG-Net with three key components: 1) Discrepancy-Aware Wavelet Interaction Module (DAWIM) for frequency domain processing to mitigate pseudo-changes, 2) Synergistic Temporal-Spatial Attention Module (STSAM) to amplify genuine change regions in spatial domain, and 3) Lightweight Gated Fusion Unit (LGFU) to bridge semantic gap by selectively integrating details from shallow layers using high-level semantics.

Result: Achieves state-of-the-art performance with F1-scores of 94.16% on CDD, 89.51% on GZ-CD, and 91.27% on LEVIR-CD benchmarks.

Conclusion: FSG-Net effectively addresses both false alarm reduction and semantic gap issues in change detection through frequency-spatial synergistic approach, demonstrating superior performance across multiple standard benchmarks.

Abstract: Change detection from high-resolution remote sensing images stands as a cornerstone of Earth observation applications, yet its efficacy is often compromised by two critical challenges. First, false alarms are prevalent as models misinterpret radiometric variations from temporal shifts (e.g., illumination, season) as genuine changes. Second, a non-negligible semantic gap between deep abstract features and shallow detail-rich features tends to obstruct their effective fusion, culminating in poorly delineated boundaries. To address these issues, we propose the Frequency-Spatial Synergistic Gated Network (FSG-Net), a novel paradigm that aims to systematically disentangle semantic changes from nuisance variations. Specifically, FSG-Net first operates in the frequency domain, where a Discrepancy-Aware Wavelet Interaction Module (DAWIM) adaptively mitigates pseudo-changes by discerningly processing different frequency components. Subsequently, the refined features are enhanced in the spatial domain by a Synergistic Temporal-Spatial Attention Module (STSAM), which amplifies the saliency of genuine change regions. To finally bridge the semantic gap, a Lightweight Gated Fusion Unit (LGFU) leverages high-level semantics to selectively gate and integrate crucial details from shallow layers. Comprehensive experiments on the CDD, GZ-CD, and LEVIR-CD benchmarks validate the superiority of FSG-Net, establishing a new state-of-the-art with F1-scores of 94.16%, 89.51%, and 91.27%, respectively. The code will be made available at https://github.com/zxXie-Air/FSG-Net after a possible publication.
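
To illustrate the frequency-domain separation that DAWIM builds on: a single-level 2D wavelet transform already splits an image into a low-frequency approximation (less sensitive to illumination-driven pseudo-changes) and high-frequency detail bands. A minimal sketch with PyWavelets; the learned module itself is not reproduced here.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_bands(img):
    """Single-level 2D DWT of an (H, W) image.

    Returns the low-frequency approximation cA and the horizontal, vertical,
    and diagonal high-frequency detail bands (cH, cV, cD).
    """
    cA, (cH, cV, cD) = pywt.dwt2(img, "haar")
    return cA, cH, cV, cD

# Comparing bitemporal low-frequency bands de-emphasizes illumination- and
# season-driven variation relative to a raw pixel difference.
t1, t2 = np.random.rand(256, 256), np.random.rand(256, 256)
low_freq_change = np.abs(wavelet_bands(t1)[0] - wavelet_bands(t2)[0])
```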

[273] WS$^2$: Weakly Supervised Segmentation using Before-After Supervision in Waste Sorting

Andrea Marelli, Alberto Foresti, Leonardo Pesce, Giacomo Boracchi, Mario Grosso

Main category: cs.CV

TL;DR: A weakly supervised segmentation approach called Before-After Supervision that trains models using visual differences between images before and after human operators remove unwanted items from conveyor belts, avoiding extensive labeling requirements.

DetailsMotivation: Human operators are still essential for waste sorting tasks, but fully supervised computer vision approaches require extensive labeling efforts. Weakly supervised alternatives that leverage the implicit supervision from operator removal actions are under-explored.

Method: Proposes Before-After Supervision concept that trains segmentation networks using only visual differences between images acquired before and after operator actions. Introduces WS^2 dataset with 11,000+ high-resolution video frames and benchmarks state-of-the-art weakly supervised segmentation methods.

Result: Developed a robust end-to-end pipeline for weakly supervised segmentation in waste-sorting applications using the novel Before-After Supervision approach.

Conclusion: The Before-After Supervision framework provides a viable alternative to fully supervised methods for industrial quality control tasks like waste sorting, reducing labeling requirements while maintaining accuracy through implicit operator supervision.

Abstract: In industrial quality control, to visually recognize unwanted items within a moving heterogeneous stream, human operators are often still indispensable. Waste-sorting stands as a significant example, where operators on multiple conveyor belts manually remove unwanted objects to select specific materials. To automate this recognition problem, computer vision systems offer great potential in accurately identifying and segmenting unwanted items in such settings. Unfortunately, considering the multitude and variety of sorting tasks, fully supervised approaches are not a viable option to address this challenge, as they require extensive labeling efforts. Surprisingly, weakly supervised alternatives that leverage the implicit supervision naturally provided by the operator’s removal action are relatively unexplored. In this paper, we define the concept of Before-After Supervision, illustrating how to train a segmentation network by leveraging only the visual differences between images acquired before and after the operator’s action. To promote research in this direction, we introduce WS$^2$ (Weakly Supervised segmentation for Waste-Sorting), the first multiview dataset consisting of more than 11,000 high-resolution video frames captured on top of a conveyor belt, including “before” and “after” images. We also present a robust end-to-end pipeline, used to benchmark several state-of-the-art weakly supervised segmentation methods on WS$^2$.
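
The core of Before-After Supervision can be sketched as deriving a weak segmentation target from the pixels that change across the operator's removal action. The sketch below assumes registered frames and fixed thresholds; a real pipeline must additionally compensate for belt motion and lighting changes.

```python
import cv2
import numpy as np

def before_after_pseudo_mask(before_bgr, after_bgr, thresh=30):
    """Weak label from a before/after image pair (assumed aligned).

    Pixels that differ between the two frames approximate the object the
    operator removed; morphology suppresses small spurious responses.
    """
    diff = cv2.absdiff(before_bgr, after_bgr)
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    return mask  # uint8 {0, 255}: weak target for a segmentation network
```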

[274] On the Reproducibility of “FairCLIP: Harnessing Fairness in Vision-Language Learning’’

Hua Chang Bakker, Stan Fris, Angela Madelon Bernardy, Stan Deutekom

Main category: cs.CV

TL;DR: Reproducibility study of FairCLIP finds that while CLIP shows demographic bias in medical classification, the claimed fairness improvements from FairCLIP are not supported by experimental results on two datasets.

DetailsMotivation: To investigate the reproducibility of FairCLIP's claims about improving group fairness in CLIP models for medical image-text applications, particularly in zero-shot glaucoma classification.

Method: Reproduced the original FairCLIP experimental setup, identified discrepancies in model description vs implementation, created A-FairCLIP for design choice analysis, proposed FairCLIP+ for multi-attribute fairness, and examined distance minimization impact.

Result: CLIP shows demographic bias in medical classification as claimed, but FairCLIP does not improve performance or fairness despite reducing Sinkhorn distances. Both official and aligned implementations failed to demonstrate fairness improvements.

Conclusion: The original FairCLIP claims about fairness and performance improvements are not reproducible. While CLIP bias exists, the proposed fairness regularization method does not effectively address it in the tested medical classification scenario.

Abstract: We investigated the reproducibility of FairCLIP, proposed by Luo et al. (2024), for improving the group fairness of CLIP (Radford et al., 2021) by minimizing image-text similarity score disparities across sensitive groups using the Sinkhorn distance. The experimental setup of Luo et al. (2024) was reproduced to primarily investigate the research findings for FairCLIP. The model description by Luo et al. (2024) was found to differ from the original implementation. Therefore, a new implementation, A-FairCLIP, is introduced to examine specific design choices. Furthermore, FairCLIP+ is proposed to extend the FairCLIP objective to include multiple attributes. Additionally, the impact of the distance minimization on FairCLIP’s fairness and performance was explored. In alignment with the original authors, CLIP was found to be biased towards certain demographics when applied to zero-shot glaucoma classification using medical scans and clinical notes from the Harvard-FairVLMed dataset. However, the experimental results on two datasets do not support their claim that FairCLIP improves the performance and fairness of CLIP. Although the regularization objective reduces Sinkhorn distances, both the official implementation and the aligned implementation, A-FairCLIP, were not found to improve performance or fairness in zero-shot glaucoma classification.
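
For readers unfamiliar with the regularizer under study: the Sinkhorn term compares the distributions of image-text similarity scores across sensitive groups. A hedged sketch using the POT library on binned score histograms; the binning and regularization strength are assumptions, not the values used in the study.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def sinkhorn_disparity(scores_a, scores_b, reg=0.05, n_bins=32):
    """Entropic-regularized OT cost between two groups' similarity scores."""
    lo = min(scores_a.min(), scores_b.min())
    hi = max(scores_a.max(), scores_b.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    pa, _ = np.histogram(scores_a, bins=bins)
    pb, _ = np.histogram(scores_b, bins=bins)
    pa = pa / pa.sum()
    pb = pb / pb.sum()
    centers = 0.5 * (bins[:-1] + bins[1:])
    M = ot.dist(centers.reshape(-1, 1), centers.reshape(-1, 1))  # cost matrix
    return ot.sinkhorn2(pa, pb, M, reg)
```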

[275] TIDE: Achieving Balanced Subject-Driven Image Generation via Target-Instructed Diffusion Enhancement

Jibai Lin, Bo Ma, Yating Yang, Rong Ma, Turghun Osman, Ahtamjan Ahmat, Rui Dong, Lei Wang, Xi Zhou

Main category: cs.CV

TL;DR: TIDE framework resolves subject-driven image generation tension through target-supervised triplet alignment and preference learning without test-time fine-tuning, achieving superior subject preservation and instruction compliance.

DetailsMotivation: Existing methods inadequately address the tension between maintaining subject identity and complying with dynamic edit instructions in subject-driven image generation.

Method: Target-supervised triplet alignment using (reference image, instruction, target images) triplets with Direct Subject Diffusion objective, training with systematically generated winning/losing targets for implicit reward modeling.

Result: Superior performance on standard benchmarks, outperforming baselines across multiple quantitative metrics while maintaining versatility for diverse tasks including structural-conditioned generation and image-to-image generation.

Conclusion: TIDE effectively balances subject preservation and instruction compliance through innovative target supervision and preference learning, advancing text-to-image diffusion models without requiring test-time fine-tuning.

Abstract: Subject-driven image generation (SDIG) aims to manipulate specific subjects within images while adhering to textual instructions, a task crucial for advancing text-to-image diffusion models. SDIG requires reconciling the tension between maintaining subject identity and complying with dynamic edit instructions, a challenge inadequately addressed by existing methods. In this paper, we introduce the Target-Instructed Diffusion Enhancing (TIDE) framework, which resolves this tension through target supervision and preference learning without test-time fine-tuning. TIDE pioneers target-supervised triplet alignment, modelling subject adaptation dynamics using a (reference image, instruction, target images) triplet. This approach leverages the Direct Subject Diffusion (DSD) objective, training the model with paired “winning” (balanced preservation-compliance) and “losing” (distorted) targets, systematically generated and evaluated via quantitative metrics. This enables implicit reward modelling for optimal preservation-compliance balance. Experimental results on standard benchmarks demonstrate TIDE’s superior performance in generating subject-faithful outputs while maintaining instruction compliance, outperforming baseline methods across multiple quantitative metrics. TIDE’s versatility is further evidenced by its successful application to diverse tasks, including structural-conditioned generation, image-to-image generation, and text-image interpolation. Our code is available at https://github.com/KomJay520/TIDE.

[276] Predicting Brain Tumor Response to Therapy using a Hybrid Deep Learning and Radiomics Approach

Daniil Tikhonov, Matheus Scatolin, Mohor Banerjee, Qiankun Ji, Ahmed Jaheen, Mostafa Salem, Abdelrahman Elsayed, Hu Wang, Sarim Hashmi, Mohammad Yaqub

Main category: cs.CV

TL;DR: Automated MRI-based glioblastoma treatment response classification using hybrid deep learning and radiomics features with 0.81 ROC AUC

DetailsMotivation: Standard RANO criteria for glioblastoma response assessment are complex and subjective, requiring automated methods to reduce observer variability and improve consistency

Method: Hybrid framework combining fine-tuned ResNet-18 deep features with 4800+ radiomic/clinical features including 3D tumor masks, volumetric changes, and centroid shift, using CatBoost classifier

Result: Achieved mean ROC AUC of 0.81 and Macro F1 score of 0.50 in 4-class response prediction (Complete Response, Partial Response, Stable Disease, Progressive Disease)

Conclusion: Combining deep learning image representations with domain-specific radiomic features provides robust automated treatment response assessment in neuro-oncology

Abstract: Accurate evaluation of the response of glioblastoma to therapy is crucial for clinical decision-making and patient management. The Response Assessment in Neuro-Oncology (RANO) criteria provide a standardized framework to assess patients’ clinical response, but their application can be complex and subject to observer variability. This paper presents an automated method for classifying the intervention response from longitudinal MRI scans, developed to predict tumor response during therapy as part of the BraTS 2025 challenge. We propose a novel hybrid framework that combines deep learning derived feature extraction and an extensive set of radiomics and clinically chosen features. Our approach utilizes a fine-tuned ResNet-18 model to extract features from 2D regions of interest across four MRI modalities. These deep features are then fused with a rich set of more than 4800 radiomic and clinically driven features, including 3D radiomics of tumor growth and shrinkage masks, volumetric changes relative to the nadir, and tumor centroid shift. Using the fused feature set, a CatBoost classifier achieves a mean ROC AUC of 0.81 and a Macro F1 score of 0.50 in the 4-class response prediction task (Complete Response, Partial Response, Stable Disease, Progressive Disease). Our results highlight that synergizing learned image representations with domain-targeted radiomic features provides a robust and effective solution for automated treatment response assessment in neuro-oncology.
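
The fusion-plus-boosting step reduces to concatenating deep and radiomic feature vectors and fitting a gradient-boosted classifier. A runnable sketch with random stand-in features; the feature dimensions and CatBoost hyperparameters are illustrative, not the paper's settings.

```python
import numpy as np
from catboost import CatBoostClassifier

rng = np.random.default_rng(0)
deep_feats = rng.normal(size=(200, 512))       # stand-in ResNet-18 embeddings
radiomic_feats = rng.normal(size=(200, 4800))  # stand-in radiomic/clinical features
labels = rng.integers(0, 4, size=200)          # 0=CR, 1=PR, 2=SD, 3=PD

X = np.concatenate([deep_feats, radiomic_feats], axis=1)
clf = CatBoostClassifier(loss_function="MultiClass", iterations=500,
                         depth=6, learning_rate=0.05, verbose=0)
clf.fit(X, labels)
probs = clf.predict_proba(X)  # per-class response probabilities
```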

[277] Benchmarking EfficientTAM on FMO datasets

Senem Aktas, Charles Markham, John McDonald, Rozenn Dahyot

Main category: cs.CV

TL;DR: The paper introduces FMOX, a JSON metadata format with ground truth information for Fast Moving Object (FMO) datasets, and shows that EfficientTAM foundational model performs comparably to specialized FMO tracking pipelines.

DetailsMotivation: Fast and tiny object tracking remains challenging in computer vision, and there's a need for standardized metadata formats to improve dataset usability and benchmarking.

Method: Created JSON metadata files (FMOX) with additional ground truth information for four FMO datasets, then tested EfficientTAM model using Trajectory Intersection of Union (TIoU) scores for comparison.

Result: EfficientTAM foundational model performs well compared to pipelines specifically designed for FMO datasets, demonstrating competitive tracking performance.

Conclusion: FMOX format provides accessible and usable metadata for machine learning pipelines processing FMO datasets, and foundational models like EfficientTAM can effectively handle fast moving object tracking tasks.

Abstract: Fast and tiny object tracking remains a challenge in computer vision, and in this paper we first introduce a JSON metadata file associated with four open-source datasets of Fast Moving Objects (FMOs) image sequences. In addition, we extend the description of the FMO datasets with additional ground truth in JSON format (called FMOX), including object size information. Finally, we use our FMOX file to test a recently proposed foundational model for tracking (called EfficientTAM), showing that its performance compares well with the pipelines originally tailored for these FMO datasets. Our comparison of these state-of-the-art techniques on FMOX is provided with Trajectory Intersection of Union (TIoU) scores. The code and JSON are shared open source, allowing FMOX to be accessible and usable for other machine learning pipelines aiming to process FMO datasets.
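
TIoU is commonly read as the per-frame IoU averaged along a trajectory, with missed detections scored as zero; a small sketch under that reading (per-paper details may differ):

```python
import numpy as np

def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def tiou(pred_track, gt_track):
    """Trajectory IoU: mean per-frame IoU, counting missed frames as 0."""
    return float(np.mean([box_iou(p, g) if p is not None else 0.0
                          for p, g in zip(pred_track, gt_track)]))
```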

[278] Improved Classification of Nitrogen Stress Severity in Plants Under Combined Stress Conditions Using Spatio-Temporal Deep Learning Framework

Aswini Kumar Patra

Main category: cs.CV

TL;DR: Novel deep learning framework using CNN-LSTM with multi-modal imaging achieves 98% accuracy in classifying nitrogen stress severity under combined stress conditions (drought + weed competition).

DetailsMotivation: Plants face multiple interacting stresses in nature, making nitrogen deficiency detection difficult when compounded with drought and weed competition. Early detection is crucial for effective plant health management.

Method: Uses four imaging modalities (RGB, multispectral, and two infrared wavelengths) as time-series data. Combines CNN for spatial feature extraction with LSTM for temporal dependencies in a spatio-temporal pipeline, compared against spatial-only CNN.

Result: CNN-LSTM pipeline achieved 98% accuracy, significantly outperforming spatial-only model (80.45%) and previous machine learning methods (76%).

Conclusion: The CNN-LSTM approach effectively captures complex stress interactions and provides a robust platform for timely nitrogen stress identification, enabling better crop management and plant health.

Abstract: Plants in their natural habitats endure an array of interacting stresses, both biotic and abiotic, that rarely occur in isolation. Nutrient stress-particularly nitrogen deficiency-becomes even more critical when compounded with drought and weed competition, making it increasingly difficult to distinguish and address its effects. Early detection of nitrogen stress is therefore crucial for protecting plant health and implementing effective management strategies. This study proposes a novel deep learning framework to accurately classify nitrogen stress severity in a combined stress environment. Our model uses a unique blend of four imaging modalities-RGB, multispectral, and two infrared wavelengths-to capture a wide range of physiological plant responses from canopy images. These images, provided as time-series data, document plant health across three levels of nitrogen availability (low, medium, and high) under varying water stress and weed pressures. The core of our approach is a spatio-temporal deep learning pipeline that merges a Convolutional Neural Network (CNN) for extracting spatial features from images with a Long Short-Term Memory (LSTM) network to capture temporal dependencies. We also devised and evaluated a spatial-only CNN pipeline for comparison. Our CNN-LSTM pipeline achieved an impressive accuracy of 98%, surpassing the spatial-only model’s 80.45% and previously reported machine learning methods’ 76%. These results demonstrate the power of our CNN-LSTM approach in effectively capturing the subtle and complex interactions between nitrogen deficiency, water stress, and weed pressure. This robust platform offers a promising tool for the timely and proactive identification of nitrogen stress severity, enabling better crop management and improved plant health.
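
A minimal PyTorch sketch of the spatio-temporal pattern described: a shared CNN encodes each timestep's stacked multi-modal canopy image, an LSTM aggregates the sequence, and a linear head predicts the nitrogen level. Channel counts and layer sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Per-frame CNN encoder + LSTM over time + classification head."""
    def __init__(self, in_ch=8, hidden=128, n_classes=3):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):              # x: (B, T, C, H, W)
        B, T = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(B, T, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])   # classify from the last timestep

logits = CNNLSTM()(torch.randn(2, 10, 8, 64, 64))  # -> shape (2, 3)
```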

[279] Back To The Drawing Board: Rethinking Scene-Level Sketch-Based Image Retrieval

Emil Demić, Luka Čehovin Zajc

Main category: cs.CV

TL;DR: A robust training approach for scene-level sketch-based image retrieval that handles sketch ambiguity and noise, achieving state-of-the-art results without complex architectural changes.

DetailsMotivation: Address the inherent ambiguity and noise in real-world sketches, which prior work often overlooked by focusing mainly on architectural improvements rather than training robustness.

Method: Combines appropriate pre-training, encoder architecture, and loss formulation specifically designed to be robust to sketch variability, without introducing additional complexity.

Result: Achieves state-of-the-art performance on challenging FS-COCO and widely-used SketchyCOCO datasets, demonstrating effectiveness of the approach.

Conclusion: Highlights the critical role of training design in cross-modal retrieval tasks and emphasizes the need to improve evaluation scenarios for scene-level SBIR.

Abstract: The goal of Scene-level Sketch-Based Image Retrieval is to retrieve natural images matching the overall semantics and spatial layout of a free-hand sketch. Unlike prior work focused on architectural augmentations of retrieval models, we emphasize the inherent ambiguity and noise present in real-world sketches. This insight motivates a training objective that is explicitly designed to be robust to sketch variability. We show that with an appropriate combination of pre-training, encoder architecture, and loss formulation, it is possible to achieve state-of-the-art performance without the introduction of additional complexity. Extensive experiments on the challenging FS-COCO and widely used SketchyCOCO datasets confirm the effectiveness of our approach and underline the critical role of training design in cross-modal retrieval tasks, as well as the need to improve the evaluation scenarios of scene-level SBIR.

[280] BioLite U-Net: Edge-Deployable Semantic Segmentation for In Situ Bioprinting Monitoring

Usman Haider, Lukasz Szemet, Daniel Kelly, Vasileios Sergis, Andrew C. Daly, Karl Mason

Main category: cs.CV

TL;DR: A lightweight semantic segmentation framework called BioLite U-Net is proposed for real-time bioprinting monitoring, achieving 92.85% mIoU with 1300x smaller size than MobileNetV2-DeepLabV3+ and near real-time inference on Raspberry Pi 4B.

DetailsMotivation: Real-time monitoring of bioprinting processes is crucial for maintaining print quality and biological viability, but challenging due to limited imaging data and resource-constrained embedded hardware constraints.

Method: Proposed BioLite U-Net architecture using depthwise separable convolutions to reduce computational load. Created a manually annotated dataset of 787 RGB images with three classes (nozzle, bioink, background) and benchmarked against MobileNetV2/V3 baselines using mIoU, Dice score, and pixel accuracy.

Result: BioLite U-Net achieved 92.85% mIoU and 96.17% Dice score, with 1300x smaller size than MobileNetV2-DeepLabV3+. On-device inference took 335 ms per frame on Raspberry Pi 4B, demonstrating near real-time capability.

Conclusion: BioLite U-Net offers superior tradeoff between segmentation accuracy, efficiency, and deployability, making it highly suitable for intelligent, closed-loop bioprinting systems.

Abstract: Bioprinting is a rapidly advancing field that offers a transformative approach to fabricating tissue and organ models through the precise deposition of cell-laden bioinks. Ensuring the fidelity and consistency of printed structures in real-time remains a core challenge, particularly under constraints imposed by limited imaging data and resource-constrained embedded hardware. Semantic segmentation of the extrusion process, differentiating between nozzle, extruded bioink, and surrounding background, enables in situ monitoring critical to maintaining print quality and biological viability. In this work, we introduce a lightweight semantic segmentation framework tailored for real-time bioprinting applications. We present a novel, manually annotated dataset comprising 787 RGB images captured during the bioprinting process, labeled across three classes: nozzle, bioink, and background. To achieve fast and efficient inference suitable for integration with bioprinting systems, we propose a BioLite U-Net architecture that leverages depthwise separable convolutions to drastically reduce computational load without compromising accuracy. Our model is benchmarked against MobileNetV2 and MobileNetV3-based segmentation baselines using mean Intersection over Union (mIoU), Dice score, and pixel accuracy. All models were evaluated on a Raspberry Pi 4B to assess real-world feasibility. The proposed BioLite U-Net achieves an mIoU of 92.85% and a Dice score of 96.17%, while being over 1300x smaller than MobileNetV2-DeepLabV3+. On-device inference takes 335 ms per frame, demonstrating near real-time capability. Compared to MobileNet baselines, BioLite U-Net offers a superior tradeoff between segmentation accuracy, efficiency, and deployability, making it highly suitable for intelligent, closed-loop bioprinting systems.
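
The efficiency claim rests on depthwise separable convolutions, which factor a standard convolution into a per-channel spatial filter plus a 1x1 channel mixer, cutting parameters and FLOPs. A generic PyTorch block of the kind BioLite U-Net stacks; the exact widths are not taken from the paper.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 conv (groups=in_ch) followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```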

[281] Evolving from Unknown to Known: Retentive Angular Representation Learning for Incremental Open Set Recognition

Runqing Yang, Yimin Fu, Changyuan Wu, Zhunga Liu

Main category: cs.CV

TL;DR: Proposes Retentive Angular Representation Learning (RARL) for Incremental Open Set Recognition (IOSR) to handle evolving unknown classes in continuous data streams while maintaining decision boundary discriminability.

DetailsMotivation: Existing open set recognition methods are designed for static scenarios with fixed scopes, but real-world applications require models to incrementally identify newly emerging unknown classes from continuous data streams while maintaining knowledge.

Method: Uses retentive angular representation learning with virtual-intrinsic interactive training strategy to compact known representations and enforce clear inter-class margins through boundary-proximal virtual classes. Includes stratified rectification strategy to refine decision boundaries.

Result: Achieves state-of-the-art performance on CIFAR100 and TinyImageNet datasets across various task setups, establishing a new benchmark for incremental open set recognition.

Conclusion: RARL effectively mitigates representation drift and maintains discriminative decision boundaries in evolving open set scenarios, outperforming existing methods in incremental unknown class identification.

Abstract: Existing open set recognition (OSR) methods are typically designed for static scenarios, where models aim to classify known classes and identify unknown ones within fixed scopes. This deviates from the expectation that the model should incrementally identify newly emerging unknown classes from continuous data streams and acquire corresponding knowledge. In such evolving scenarios, the discriminability of OSR decision boundaries is hard to maintain due to restricted access to former training data, causing severe inter-class confusion. To solve this problem, we propose retentive angular representation learning (RARL) for incremental open set recognition (IOSR). In RARL, unknown representations are encouraged to align around inactive prototypes within an angular space constructed under the equiangular tight frame, thereby mitigating excessive representation drift during knowledge updates. Specifically, we adopt a virtual-intrinsic interactive (VII) training strategy, which compacts known representations by enforcing clear inter-class margins through boundary-proximal virtual classes. Furthermore, a stratified rectification strategy is designed to refine decision boundaries, mitigating representation bias and feature space distortion caused by imbalances between old/new and positive/negative class samples. We conduct thorough evaluations on CIFAR100 and TinyImageNet datasets and establish a new benchmark for IOSR. Experimental results across various task setups demonstrate that the proposed method achieves state-of-the-art performance.

[282] MRI-Based Brain Tumor Detection through an Explainable EfficientNetV2 and MLP-Mixer-Attention Architecture

Mustafa Yurdakul, Şakir Taşdemir

Main category: cs.CV

TL;DR: Proposed an attention-based MLP-Mixer integrated with EfficientNetV2 for brain tumor classification from MRI images, achieving 99.50% accuracy with enhanced interpretability using Grad-CAM visualizations.

DetailsMotivation: Brain tumors require early diagnosis but manual MRI examination is error-prone and requires expertise, creating a need for automated, accurate, and explainable diagnostic systems.

Method: Used Figshare dataset with 3,064 T1-weighted contrast-enhanced brain MRI images. Evaluated 9 CNN architectures, selected EfficientNetV2 as backbone, integrated attention-based MLP-Mixer, and used five-fold cross-validation with Grad-CAM for interpretability.

Result: Achieved superior performance: 99.50% accuracy, 99.47% precision, 99.52% recall, and 99.49% F1 score, outperforming existing literature methods.

Conclusion: The combined EfficientNetV2 and attention-based MLP-Mixer model provides a robust, high-accuracy, and interpretable solution for clinical decision support in brain tumor classification.

Abstract: Brain tumors are serious health problems that require early diagnosis due to their high mortality rates. Diagnosing tumors by examining Magnetic Resonance Imaging (MRI) images is a process that requires expertise and is prone to error. Therefore, the need for automated diagnosis systems is increasing day by day. In this context, a robust and explainable Deep Learning (DL) model for the classification of brain tumors is proposed. In this study, a publicly available Figshare dataset containing 3,064 T1-weighted contrast-enhanced brain MRI images of three tumor types was used. First, the classification performance of nine well-known CNN architectures was evaluated to determine the most effective backbone. Among these, EfficientNetV2 demonstrated the best performance and was selected as the backbone for further development. Subsequently, an attention-based MLP-Mixer architecture was integrated into EfficientNetV2 to enhance its classification capability. The performance of the final model was comprehensively compared with basic CNNs and the methods in the literature. Additionally, Grad-CAM visualization was used to interpret and validate the decision-making process of the proposed model. The proposed model’s performance was evaluated using the five-fold cross-validation method. The proposed model demonstrated superior performance with 99.50% accuracy, 99.47% precision, 99.52% recall and 99.49% F1 score. The results show that the model outperforms previously reported methods in the literature. Moreover, Grad-CAM visualizations demonstrate that the model effectively focuses on relevant regions of MRI images, thus improving interpretability and clinical reliability. A robust deep learning model for clinical decision support systems has been obtained by combining EfficientNetV2 and attention-based MLP-Mixer, providing high accuracy and interpretability in brain tumor classification.
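
For orientation, a vanilla MLP-Mixer block alternates an MLP over tokens with an MLP over channels; the paper augments this pattern with attention on top of EfficientNetV2 features. A generic sketch of the base block only, with assumed sizes:

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Token-mixing MLP followed by channel-mixing MLP, each with a residual."""
    def __init__(self, n_tokens, dim, token_hidden=256, chan_hidden=512):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(n_tokens, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, n_tokens))
        self.chan_mlp = nn.Sequential(
            nn.Linear(dim, chan_hidden), nn.GELU(),
            nn.Linear(chan_hidden, dim))

    def forward(self, x):                      # x: (B, n_tokens, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.chan_mlp(self.norm2(x))

out = MixerBlock(n_tokens=49, dim=96)(torch.randn(2, 49, 96))  # (2, 49, 96)
```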

[283] Approximating Condorcet Ordering for Vector-valued Mathematical Morphology

Marcos Eduardo Valle, Santiago Velasco-Forero, Joao Batista Florindo, Gustavo Jesus Angulo

Main category: cs.CV

TL;DR: Machine learning approach to learn reduced ordering approximating Condorcet ranking for vector-valued mathematical morphology operators in color image processing

DetailsMotivation: Address lack of consensus on suitable vector ordering for constructing morphological operators in vector-valued images like color and hyperspectral images

Method: Develop machine learning approach that learns reduced ordering approximating Condorcet ranking derived from set of vector orderings, inspired by voting problems

Result: Preliminary computational experiments confirm effectiveness of learning reduced mapping for defining vector-valued morphological operators for color images

Conclusion: Proposed machine learning approach successfully approximates Condorcet ordering to address vector ordering challenges in mathematical morphology for color image processing

Abstract: Mathematical morphology provides a nonlinear framework for image and spatial data processing and analysis. Although there have been many successful applications of mathematical morphology to vector-valued images, such as color and hyperspectral images, there is still no consensus on the most suitable vector ordering for constructing morphological operators. This paper addresses this issue by examining a reduced ordering approximating the Condorcet ranking derived from a set of vector orderings. Inspired by voting problems, the Condorcet ordering ranks elements from most to least voted, with voters representing different orderings. In this paper, we develop a machine learning approach that learns a reduced ordering that approximates the Condorcet ordering. Preliminary computational experiments confirm the effectiveness of learning the reduced mapping to define vector-valued morphological operators for color images.
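
The target being approximated can be sketched directly: treat each vector ordering as a voter, count pairwise majority wins, and rank elements by their wins (a Copeland-style realization of the Condorcet idea). The learned reduced mapping replaces this expensive aggregation at inference time.

```python
import numpy as np

def condorcet_ranking(orderings):
    """Rank n items by pairwise majority wins across several orderings.

    orderings: list of rank arrays, one per 'voter', where rank[i] is item
    i's position (0 = most preferred). A sketch of the target ordering, not
    the learned reduced mapping.
    """
    n = len(orderings[0])
    wins = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            votes = sum(1 for rank in orderings if rank[i] < rank[j])
            if votes > len(orderings) / 2:
                wins[i] += 1  # i beats j under a majority of orderings
    return np.argsort(-wins)  # most-preferred item first

# Three voters ranking 4 items (0 = most preferred):
voters = [np.array([0, 1, 2, 3]), np.array([1, 0, 2, 3]), np.array([0, 2, 1, 3])]
print(condorcet_ranking(voters))
```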

[284] CausNVS: Autoregressive Multi-view Diffusion for Flexible 3D Novel View Synthesis

Xin Kong, Daniel Watson, Yannick Strümpler, Michael Niemeyer, Federico Tombari

Main category: cs.CV

TL;DR: CausNVS is an autoregressive multi-view diffusion model for 3D novel view synthesis that supports flexible input-output view configurations and sequential generation, addressing limitations of non-autoregressive approaches.

DetailsMotivation: Existing multi-view diffusion models use non-autoregressive formulations that limit them to fixed view counts and suffer from slow inference due to simultaneous denoising of all frames, making them unsuitable for world modeling applications.

Method: Train with causal masking and per-frame noise, use pairwise-relative camera pose encodings (CaPE) for precise camera control, and employ spatially-aware sliding-window with key-value caching and noise conditioning augmentation at inference to prevent drift.

Result: CausNVS supports diverse camera trajectories, enables flexible autoregressive novel view synthesis, and achieves consistently strong visual quality across various settings.

Conclusion: The autoregressive approach with causal training and specialized inference techniques provides a more flexible and efficient solution for multi-view synthesis compared to non-autoregressive diffusion models.

Abstract: Multi-view diffusion models have shown promise in 3D novel view synthesis, but most existing methods adopt a non-autoregressive formulation. This limits their applicability in world modeling, as they only support a fixed number of views and suffer from slow inference due to denoising all frames simultaneously. To address these limitations, we propose CausNVS, a multi-view diffusion model in an autoregressive setting, which supports arbitrary input-output view configurations and generates views sequentially. We train CausNVS with causal masking and per-frame noise, using pairwise-relative camera pose encodings (CaPE) for precise camera control. At inference time, we combine a spatially-aware sliding-window with key-value caching and noise conditioning augmentation to mitigate drift. Our experiments demonstrate that CausNVS supports a broad range of camera trajectories, enables flexible autoregressive novel view synthesis, and achieves consistently strong visual quality across diverse settings. Project page: https://kxhit.github.io/CausNVS.html.

[285] Automated Radiographic Total Sharp Score (ARTSS) in Rheumatoid Arthritis: A Solution to Reduce Inter-Intra Reader Variation and Enhancing Clinical Practice

Hajar Moradmand, Lei Ren

Main category: cs.CV

TL;DR: Deep learning framework ARTSS automates rheumatoid arthritis severity scoring from hand X-rays, achieving 99% joint identification accuracy and low prediction error with Vision Transformer model.

DetailsMotivation: Manual RA severity scoring using Total Sharp Score is time-consuming and subjective, with high inter- and intra-observer variability. Automation can improve consistency and efficiency in clinical practice.

Method: Four-stage framework: 1) Image pre-processing with ResNet50, 2) Hand segmentation with UNet.3, 3) Joint identification with YOLOv7, 4) TSS prediction using multiple models (VGG16, VGG19, ResNet50, DenseNet201, EfficientNetB0, ViT). Evaluated with 3-fold cross-validation on 970 patients.

Result: Joint identification achieved 99% accuracy. Vision Transformer performed best with Huber loss of 0.87 for TSS prediction. Framework handles joint disappearance and variable-length image sequences effectively.

Conclusion: ARTSS demonstrates deep learning’s potential to automate RA scoring, reducing variability, saving time, improving accuracy, and aiding clinical decision-making for rheumatologists.

Abstract: Assessing the severity of rheumatoid arthritis (RA) using the Total Sharp/Van Der Heijde Score (TSS) is crucial, but manual scoring is often time-consuming and subjective. This study introduces an Automated Radiographic Sharp Scoring (ARTSS) framework that leverages deep learning to analyze full-hand X-ray images, aiming to reduce inter- and intra-observer variability. The research uniquely accommodates patients with joint disappearance and variable-length image sequences. We developed ARTSS using data from 970 patients, structured into four stages: I) Image pre-processing and re-orientation using ResNet50, II) Hand segmentation using UNet.3, III) Joint identification using YOLOv7, and IV) TSS prediction using models such as VGG16, VGG19, ResNet50, DenseNet201, EfficientNetB0, and Vision Transformer (ViT). We evaluated model performance with Intersection over Union (IoU), Mean Average Precision (MAP), mean absolute error (MAE), Root Mean Squared Error (RMSE), and Huber loss. The average TSS from two radiologists was used as the ground truth. Model training employed 3-fold cross-validation, with each fold consisting of 452 training and 227 validation samples, and external testing included 291 unseen subjects. Our joint identification model achieved 99% accuracy. The best-performing model, ViT, achieved a notably low Huber loss of 0.87 for TSS prediction. Our results demonstrate the potential of deep learning to automate RA scoring, which can significantly enhance clinical practice. Our approach addresses the challenge of joint disappearance and variable joint numbers, offers timesaving benefits, reduces inter- and intra-reader variability, improves radiologist accuracy, and aids rheumatologists in making more informed decisions.

[286] Detection of trade in products derived from threatened species using machine learning and a smartphone

Ritwik Kulkarni, WU Hanqin, Enrico Di Minin

Main category: cs.CV

TL;DR: Machine learning models developed to automatically detect illegal wildlife products (elephant ivory, pangolin scales, tiger bones) in images with 84.2% overall accuracy, deployed via smartphone app with 91.3% accuracy for real-time monitoring.

DetailsMotivation: Unsustainable wildlife trade is a major biodiversity threat, increasingly occurring in digital marketplaces. Automated detection methods are needed to handle the large volume of digital content and identify wildlife products like ivory.

Method: Developed machine learning-based object recognition models using images of illegally sold/confiscated wildlife products. Tested various training strategies and loss functions, creating both species-specific models and a combined model for elephant, pangolin, and tiger products.

Result: Best model achieved 84.2% overall accuracy with species-specific accuracies: 71.1% (elephant), 90.2% (pangolin), 93.5% (tiger). Smartphone application achieved 91.3% overall accuracy for real-time detection.

Conclusion: The method is effective for monitoring wildlife trade both online and in physical markets, with practical deployment through smartphone apps for government authorities and law enforcement agencies.

Abstract: Unsustainable trade in wildlife is a major threat to biodiversity and is now increasingly prevalent in digital marketplaces and social media. With the sheer volume of digital content, the need for automated methods to detect wildlife trade listings is growing. These methods are especially needed for the automatic identification of wildlife products, such as ivory. We developed machine learning-based object recognition models that can identify wildlife products within images and highlight them. The data consist of images of elephant, pangolin, and tiger products that were identified as being sold illegally or that were confiscated by authorities. Specifically, the wildlife products included elephant ivory and skins, pangolin scales and claws (raw and crafted), and tiger skins and bones. We investigated various combinations of training strategies and two loss functions to identify the best model to use in the automatic detection of these wildlife products. Models were trained for each species while also developing a single model to identify products from all three species. The best model showed an overall accuracy of 84.2% with accuracies of 71.1%, 90.2% and 93.5% in detecting products derived from elephants, pangolins, and tigers, respectively. We further demonstrate that the machine learning model can be made easily available to stakeholders, such as government authorities and law enforcement agencies, through a smartphone-based application that achieved an overall accuracy of 91.3%. The application can be used in real time to capture images and help identify potentially prohibited products of target species. Thus, the proposed method is applicable not only for monitoring trade on the web but also, for example, in physical markets.

[287] Hybrid Swin Attention Networks for Simultaneously Low-Dose PET and CT Denoising

Yichao Liu, YueYang Teng

Main category: cs.CV

TL;DR: HSANet is a novel hybrid network for LDCT/PET denoising that combines Efficient Global Attention modules and hybrid upsampling to improve image quality while maintaining low radiation exposure and practical deployment capabilities.

DetailsMotivation: Low-dose CT and PET imaging reduce radiation exposure but introduce noise and artifacts that compromise diagnostic accuracy, creating a need for effective denoising methods that maintain safety while improving image quality.

Method: Proposed Hybrid Swin Attention Network (HSANet) with Efficient Global Attention (EGA) modules for enhanced spatial and channel-wise feature interaction, and a hybrid upsampling module to prevent overfitting to noise.

Result: HSANet achieves superior denoising performance compared to existing methods while maintaining a lightweight model size suitable for deployment on standard GPU configurations.

Conclusion: The approach is highly practical for real-world clinical applications, providing effective denoising for low-dose imaging while ensuring radiation safety and computational efficiency.

Abstract: Low-dose computed tomography (LDCT) and positron emission tomography (PET) have emerged as safer alternatives to conventional imaging modalities by significantly reducing radiation exposure. However, this reduction often results in increased noise and artifacts, which can compromise diagnostic accuracy. Consequently, denoising for LDCT/PET has become a vital area of research aimed at enhancing image quality while maintaining radiation safety. In this study, we introduce a novel Hybrid Swin Attention Network (HSANet), which incorporates Efficient Global Attention (EGA) modules and a hybrid upsampling module. The EGA modules enhance both spatial and channel-wise interaction, improving the network’s capacity to capture relevant features, while the hybrid upsampling module mitigates the risk of overfitting to noise. We validate the proposed approach using a publicly available LDCT/PET dataset. Experimental results demonstrate that HSANet achieves superior denoising performance compared to existing methods, while maintaining a lightweight model size suitable for deployment on GPUs with standard memory configurations. This makes our approach highly practical for real-world clinical applications.
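The paper does not include code, so as a hedged illustration only, a generic block that gates features both channel-wise and spatially captures the flavor of the EGA modules' "spatial and channel-wise interaction"; the actual EGA design may differ:

```python
# A generic spatial + channel attention block illustrating the kind of
# global attention HSANet describes; the real EGA module may differ.
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel = nn.Sequential(            # squeeze-and-excitation style gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(            # single-channel spatial gate
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel(x)                  # reweight channels globally
        return x * self.spatial(x)               # reweight spatial positions

x = torch.randn(1, 64, 32, 32)
print(SpatialChannelAttention(64)(x).shape)      # torch.Size([1, 64, 32, 32])
```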

[288] Barlow-Swin: Toward a novel siamese-based segmentation architecture using Swin-Transformers

Morteza Kiani Haftlang, Mohammadhossein Malmir, Foroutan Parand, Umberto Michelucci, Safouane El Ghazouali

Main category: cs.CV

TL;DR: Lightweight real-time medical image segmentation model combining Swin Transformer encoder with U-Net decoder, using self-supervised pretraining for efficiency.

DetailsMotivation: Existing segmentation models are either computationally expensive (transformers) or have limited receptive fields (CNNs like U-Net), making them unsuitable for real-time clinical applications.

Method: Combines Swin Transformer-like encoder with U-Net-like decoder via skip connections. Uses Barlow Twins self-supervised pretraining on encoder before fine-tuning entire model for segmentation.

Result: Achieves competitive accuracy with significantly reduced parameters and faster inference compared to existing models like Swin Transformer and U-Net.

Conclusion: Provides a practical, efficient alternative for real-time medical image segmentation in resource-limited clinical environments.

Abstract: Medical image segmentation is a critical task in clinical workflows, particularly for the detection and delineation of pathological regions. While convolutional architectures like U-Net have become standard for such tasks, their limited receptive field restricts global context modeling. Recent efforts integrating transformers have addressed this, but often result in deep, computationally expensive models unsuitable for real-time use. In this work, we present a novel end-to-end lightweight architecture designed specifically for real-time binary medical image segmentation. Our model combines a Swin Transformer-like encoder with a U-Net-like decoder, connected via skip pathways to preserve spatial detail while capturing contextual information. Unlike existing designs such as Swin Transformer or U-Net, our architecture is significantly shallower and competitively efficient. To improve the encoder’s ability to learn meaningful features without relying on large amounts of labeled data, we first train it using Barlow Twins, a self-supervised learning method that helps the model focus on important patterns by reducing unnecessary repetition in the learned features. After this pretraining, we fine-tune the entire model for our specific task. Experiments on benchmark binary segmentation tasks demonstrate that our model achieves competitive accuracy with substantially reduced parameter count and faster inference, positioning it as a practical alternative for deployment in real-time and resource-limited clinical environments. The code for our method is available at Github repository: https://github.com/mkianih/Barlow-Swin.
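Barlow Twins itself has a standard formulation: for two augmented views of the same batch, drive the cross-correlation matrix of normalised embeddings toward the identity. A minimal reference implementation (a standard form, not the authors' exact code):

```python
# Minimal Barlow Twins loss as used for encoder pretraining.
import torch

def barlow_twins_loss(z1, z2, lambd=5e-3):
    # z1, z2: (N, D) embeddings of two augmented views of the same batch
    z1 = (z1 - z1.mean(0)) / z1.std(0)           # per-dimension normalisation
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    n, d = z1.shape
    c = z1.T @ z2 / n                            # (D, D) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy reduction
    return on_diag + lambd * off_diag

loss = barlow_twins_loss(torch.randn(128, 256), torch.randn(128, 256))
```

The off-diagonal term is what the abstract describes informally as "reducing unnecessary repetition in the learned features."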

[289] Investigating Location-Regularised Self-Supervised Feature Learning for Seafloor Visual Imagery

Cailei Liang, Adrian Bodenmann, Emma J Curtis, Samuel Simmons, Kazunori Nagano, Stan Brown, Adam Riese, Blair Thornton

Main category: cs.CV

TL;DR: Location-based regularization improves self-supervised learning for seafloor image analysis, boosting CNN performance by 4.9% and ViT performance by 6.3% on average across diverse datasets.

DetailsMotivation: To enhance marine monitoring efficiency by improving high-throughput interpretation of robotically gathered seafloor imagery through better self-supervised feature learning using location metadata.

Method: Evaluated impact of location-based regularization on six state-of-the-art SSL frameworks (CNN and ViT models) across three diverse seafloor image datasets with varying latent-space dimensionality (128 vs 512 dimensions).

Result: Location-regularization consistently improved downstream classification: 4.9±4.0% F1-score gain for CNNs, 6.3±8.9% for ViTs. CNNs benefited more from high-dimensional representations, while ViTs showed strong generalization with pre-trained models matching best location-regularized SSL (F1-scores: 0.795±0.075 vs 0.795±0.077).

Conclusion: Location metadata is valuable for SSL regularization, particularly beneficial for low-dimensional latent representations, while high-dimensional ViTs demonstrate strong generalization capabilities for seafloor image analysis.

Abstract: High-throughput interpretation of robotically gathered seafloor visual imagery can increase the efficiency of marine monitoring and exploration. Although recent research has suggested that location metadata can enhance self-supervised feature learning (SSL), its benefits across different SSL strategies, models and seafloor image datasets are underexplored. This study evaluates the impact of location-based regularisation on six state-of-the-art SSL frameworks, which include Convolutional Neural Network (CNN) and Vision Transformer (ViT) models with varying latent-space dimensionality. Evaluation across three diverse seafloor image datasets finds that location-regularisation consistently improves downstream classification performance over standard SSL, with average F1-score gains of $4.9 \pm 4.0\%$ for CNNs and $6.3 \pm 8.9\%$ for ViTs, respectively. While CNNs pretrained on generic datasets benefit from high-dimensional latent representations, dataset-optimised SSL achieves similar performance across the high (512) and low (128) dimensional latent representations. Location-regularised SSL improves CNN performance over pre-trained models by $2.7 \pm 2.7\%$ and $10.1 \pm 9.4\%$ for high and low-dimensional latent representations, respectively. For ViTs, high-dimensionality benefits both pre-trained and dataset-optimised SSL. Although location-regularisation improves SSL performance compared to standard SSL methods, pre-trained ViTs show strong generalisation, matching the best-performing location-regularised SSL with F1-scores of $0.795 \pm 0.075$ and $0.795 \pm 0.077$, respectively. The findings highlight the value of location metadata for SSL regularisation, particularly when using low-dimensional latent representations, and demonstrate strong generalisation of high-dimensional ViTs for seafloor image analysis.
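The abstract leaves the exact regularisation term unspecified; one plausible sketch, offered only as an assumption, penalises feature distance between images captured close together, with the distance threshold and positions purely illustrative:

```python
# One plausible form of location-based regularisation: pull together the
# embeddings of images captured within a small radius of each other.
# The paper's actual term may differ; threshold and units are assumptions.
import torch

def location_regulariser(z, xy, radius=5.0):
    # z: (N, D) features; xy: (N, 2) georeferenced positions in metres
    d_feat = torch.cdist(z, z)                   # pairwise feature distances
    d_geo = torch.cdist(xy, xy)                  # pairwise spatial distances
    near = (d_geo < radius).float()              # 1 for nearby image pairs
    near.fill_diagonal_(0)
    return (d_feat * near).sum() / near.sum().clamp(min=1)

reg = location_regulariser(torch.randn(32, 128), torch.rand(32, 2) * 50)
```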

[290] Online Clustering of Seafloor Imagery for Interpretation during Long-Term AUV Operations

Cailei Liang, Adrian Bodenmann, Sam Fenton, Blair Thornton

Main category: cs.CV

TL;DR: Online clustering framework for real-time unsupervised seafloor image analysis that achieves high accuracy with bounded computational time, enabling adaptive AUV missions.

DetailsMotivation: Need for real-time interpretation of seafloor imagery to support adaptive missions and optimize communication efficiency in autonomous underwater vehicles, overcoming limitations of offline methods that require complete datasets and human labeling.

Method: Online clustering framework (OCF) that operates on continuous data streams using representative sampling to capture evolving feature distributions, supporting dynamic cluster merging/splitting without reprocessing full image history.

Result: Achieved highest average F1 score of 0.68 across three seafloor datasets with 3% standard deviation, demonstrating superior clustering capability and robustness to trajectory variation while maintaining bounded computational time.

Conclusion: OCF provides scalable, adaptive real-time seafloor image interpretation beneficial for survey data summaries and informative path planning in long-term autonomous marine exploration.

Abstract: As long-endurance and seafloor-resident AUVs become more capable, there is an increasing need for extended, real-time interpretation of seafloor imagery to enable adaptive missions and optimise communication efficiency. Although offline image analysis methods are well established, they rely on access to complete datasets and human-labelled examples to manage the strong influence of environmental and operational conditions on seafloor image appearance-requirements that cannot be met in real-time settings. To address this, we introduce an online clustering framework (OCF) capable of interpreting seafloor imagery without supervision, which is designed to operate in real-time on continuous data streams in a scalable, adaptive, and self-consistent manner. The method enables the efficient review and consolidation of common patterns across the entire data history in constant time by identifying and maintaining a set of representative samples that capture the evolving feature distribution, supporting dynamic cluster merging and splitting without reprocessing the full image history. We evaluate the framework on three diverse seafloor image datasets, analysing the impact of different representative sampling strategies on both clustering accuracy and computational cost. The OCF achieves the highest average F1 score of 0.68 across the three datasets among all comparative online clustering approaches, with a standard deviation of 3% across three distinct survey trajectories, demonstrating its superior clustering capability and robustness to trajectory variation. In addition, it maintains consistently lower and bounded computational time as the data volume increases. These properties are beneficial for generating survey data summaries and supporting informative path planning in long-term, persistent autonomous marine exploration.
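A toy version of the bookkeeping OCF describes is sketched below, with illustrative thresholds and random representative subsampling standing in for the paper's sampling strategies:

```python
# Toy sketch of the online pattern OCF describes: keep a bounded set of
# representative samples per cluster so assignment and consolidation never
# need the full image history. Thresholds and sampling are assumptions.
import numpy as np

class OnlineClusters:
    def __init__(self, new_thresh=1.5, reps_per_cluster=20):
        self.reps = []                           # list of (K_i, D) representative arrays
        self.new_thresh = new_thresh
        self.cap = reps_per_cluster

    def add(self, x):                            # x: (D,) feature of a new image
        if not self.reps:
            self.reps.append(x[None])
            return 0
        centroids = np.stack([r.mean(0) for r in self.reps])
        d = np.linalg.norm(centroids - x, axis=1)
        k = int(d.argmin())
        if d[k] > self.new_thresh:               # far from all clusters: open a new one
            self.reps.append(x[None])
            return len(self.reps) - 1
        r = np.vstack([self.reps[k], x])
        if len(r) > self.cap:                    # bounded representative subsampling
            r = r[np.random.choice(len(r), self.cap, replace=False)]
        self.reps[k] = r
        return k

oc = OnlineClusters()
for x in np.random.randn(200, 16):
    oc.add(x)
print(len(oc.reps), "clusters")
```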

[291] VIM-GS: Visual-Inertial Monocular Gaussian Splatting via Object-level Guidance in Large Scenes

Shengkai Zhang, Yuhe Liu, Guanjun Wu, Jianhua He, Xinggang Wang, Mozi Chen, Kezhong Liu

Main category: cs.CV

TL;DR: VIM-GS is a Gaussian Splatting framework that uses monocular images for novel-view synthesis in large scenes by combining sparse SfM depth with dense foundation model depth estimates.

DetailsMotivation: Traditional Gaussian Splatting requires accurate depth from RGB-D/stereo cameras, which have limited range for large scenes. Monocular images lack depth guidance, while large foundation models suffer from inconsistency, inaccuracy for distant scenes, and texture ambiguity.

Method: Leverages sparse but accurate depth from visual-inertial SfM to refine dense but coarse depth from large foundation models. Uses object-segmented depth propagation algorithm and dynamic depth refinement module to handle structured objects and dynamic object depth issues.

Result: Experiments on public and customized datasets demonstrate superior rendering quality in large scenes compared to existing approaches.

Conclusion: VIM-GS successfully addresses the depth estimation challenges for Gaussian Splatting in large scenes by combining the strengths of SfM and foundation models, enabling high-quality novel-view synthesis from monocular inputs.

Abstract: VIM-GS is a Gaussian Splatting (GS) framework using monocular images for novel-view synthesis (NVS) in large scenes. GS typically requires accurate depth to initiate Gaussian ellipsoids using RGB-D/stereo cameras. Their limited depth sensing range makes it difficult for GS to work in large scenes. Monocular images, however, lack depth to guide the learning and lead to inferior NVS results. Although large foundation models (LFMs) for monocular depth estimation are available, they suffer from cross-frame inconsistency, inaccuracy for distant scenes, and ambiguity in deceptive texture cues. This paper aims to generate dense, accurate depth images from monocular RGB inputs for high-definition GS rendering. The key idea is to leverage the accurate but sparse depth from visual-inertial Structure-from-Motion (SfM) to refine the dense but coarse depth from LFMs. To bridge the sparse input and dense output, we propose an object-segmented depth propagation algorithm that renders the depth of pixels of structured objects. Then we develop a dynamic depth refinement module to handle the crippled SfM depth of dynamic objects and refine the coarse LFM depth. Experiments using public and customized datasets demonstrate the superior rendering quality of VIM-GS in large scenes.
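The basic "sparse-accurate refines dense-coarse" step can be illustrated by a common per-frame least-squares scale-and-shift alignment of LFM depth to the SfM points; VIM-GS's object-segmented propagation and dynamic refinement go well beyond this sketch:

```python
# Common baseline for refining dense-but-coarse monocular depth with
# sparse-but-accurate SfM depth: fit a per-frame scale and shift by least
# squares at the sparse points. Only the basic alignment step, not VIM-GS.
import numpy as np

def align_depth(dense, sparse_uv, sparse_z):
    # dense: (H, W) LFM depth; sparse_uv: (N, 2) pixel coords; sparse_z: (N,)
    d = dense[sparse_uv[:, 1], sparse_uv[:, 0]]
    A = np.stack([d, np.ones_like(d)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, sparse_z, rcond=None)  # sparse_z ~ s*d + t
    return s * dense + t

dense = np.random.rand(48, 64) + 1.0                       # synthetic coarse depth
uv = np.column_stack([np.random.randint(0, 64, 30),        # u (x) coordinates
                      np.random.randint(0, 48, 30)])       # v (y) coordinates
z = 2.0 * dense[uv[:, 1], uv[:, 0]] + 0.3                  # synthetic "SfM" depth
aligned = align_depth(dense, uv, z)
```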

[292] H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers

Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Shijian Lu, Nicu Sebe

Main category: cs.CV

TL;DR: H₂OT is a hierarchical plug-and-play framework that prunes redundant pose tokens in video transformers and recovers full sequences, improving efficiency for 3D human pose estimation while maintaining accuracy.

DetailsMotivation: Video pose transformers (VPTs) have high computational costs that make them impractical for resource-constrained devices, despite their success in 3D human pose estimation.

Method: Hierarchical Hourglass Tokenizer (H₂OT) uses Token Pruning Module (TPM) to dynamically select representative tokens and Token Recovering Module (TRM) to restore full-length sequences, reducing intermediate computation while maintaining output quality.

Result: Extensive experiments show H₂OT achieves both high efficiency and estimation accuracy, demonstrating that maintaining full pose sequences is unnecessary and a few representative tokens suffice.

Conclusion: The framework is general-purpose, works with common VPT models, and effectively accommodates different pruning/recovery strategies, making transformer-based 3D pose estimation more practical for resource-constrained applications.

Abstract: Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H$_{2}$OT), for efficient transformer-based 3D human pose estimation from videos. H$_{2}$OT begins with progressively pruning pose tokens of redundant frames and ends with recovering full-length sequences, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. It works with two key modules, namely, a Token Pruning Module (TPM) and a Token Recovering Module (TRM). TPM dynamically selects a few representative tokens to eliminate the redundancy of video frames, while TRM restores the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines while effectively accommodating different token pruning and recovery strategies. In addition, our H$_{2}$OT reveals that maintaining the full pose sequence is unnecessary, and a few pose tokens of representative frames can achieve both high efficiency and estimation accuracy. Extensive experiments on multiple benchmark datasets demonstrate both the effectiveness and efficiency of the proposed method. Code and models are available at https://github.com/NationalGAILab/HoT.
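The prune-then-recover shape of the framework is easy to sketch. In the fragment below, token selection is uniform and recovery is nearest-representative copying, whereas the real TPM selects tokens dynamically and the TRM restores detailed spatio-temporal information:

```python
# Schematic of the prune-then-recover idea: keep a few representative frame
# tokens through the middle blocks, then scatter them back to full length.
# Uniform selection here is a simplification of H2OT's dynamic TPM.
import torch

def prune_tokens(tokens, keep=4):
    # tokens: (B, T, C) per-frame pose tokens
    T = tokens.shape[1]
    idx = torch.linspace(0, T - 1, keep).round().long()  # representative frames
    return tokens[:, idx], idx

def recover_tokens(kept, idx, T):
    # nearest-representative upsampling back to full temporal resolution
    frame = torch.arange(T)
    nearest = (frame[:, None] - idx[None, :]).abs().argmin(1)
    return kept[:, nearest]

x = torch.randn(2, 81, 256)                              # 81-frame token sequence
kept, idx = prune_tokens(x)                              # (2, 4, 256)
full = recover_tokens(kept, idx, 81)                     # (2, 81, 256)
```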

[293] STAGE: Segmentation-oriented Industrial Anomaly Synthesis via Graded Diffusion with Explicit Mask Alignment

Xichen Xu, Yanshu Wang, Jinbao Wang, Qunyi Zhang, Xiaoning Lei, Guoyang Xie, Guannan Jiang, Zhichao Lu

Main category: cs.CV

TL;DR: STAGE proposes a novel diffusion-based method for industrial anomaly synthesis that addresses limitations in texture detail and pixel-level precision through graded diffusion and explicit mask alignment.

DetailsMotivation: Existing SIAS methods lack intricate texture details, fail to align with background properly, and struggle with fine-grained pixel-level anomaly generation, limiting downstream anomaly segmentation performance.

Method: Uses graded diffusion framework with anomaly-only branch, incorporates clean background as prior for denoising, and employs explicit mask alignment (EMA) strategy for progressive background alignment.

Result: Achieves state-of-the-art performance on MVTec and BTAD datasets, demonstrating superior anomaly synthesis that enhances downstream anomaly segmentation tasks.

Conclusion: STAGE effectively addresses key limitations in industrial anomaly synthesis through its novel diffusion-based approach with graded processing and explicit alignment, significantly improving synthetic anomaly quality and downstream performance.

Abstract: Segmentation-oriented Industrial Anomaly Synthesis (SIAS) plays a pivotal role in enhancing the performance of downstream anomaly segmentation, as it provides an effective means of expanding abnormal data. However, existing SIAS methods face several critical limitations: (i) the synthesized anomalies often lack intricate texture details and fail to align precisely with the surrounding background, and (ii) they struggle to generate fine-grained, pixel-level anomalies. To address these challenges, we propose Segmentation-oriented Anomaly synthesis via Graded diffusion with Explicit mask alignment, termed STAGE. STAGE introduces a novel anomaly inference strategy that incorporates clean background information as a prior to guide the denoising distribution, enabling the model to more effectively distinguish and highlight abnormal foregrounds. Furthermore, it employs a graded diffusion framework with an anomaly-only branch to explicitly record local anomalies during both the forward and reverse processes, ensuring that subtle anomalies are not overlooked. Finally, STAGE incorporates the explicit mask alignment (EMA) strategy to progressively align the synthesized anomalies with the background, resulting in context-consistent and structurally coherent generations. Extensive experiments on the MVTec and BTAD datasets demonstrate that STAGE achieves state-of-the-art performance in SIAS, which in turn enhances downstream anomaly segmentation.

[294] Cortex-Synth: Differentiable Topology-Aware 3D Skeleton Synthesis with Hierarchical Graph Attention

Mohamed Zayaan S

Main category: cs.CV

TL;DR: Cortex-Synth is a differentiable framework for generating 3D skeleton geometry and topology from single 2D images, achieving state-of-the-art results with significant improvements in accuracy and reduced topological errors.

DetailsMotivation: To address the challenge of synthesizing both 3D skeleton geometry and topology from single 2D images in an end-to-end differentiable manner, enabling applications in robotics, medical imaging, and character rigging.

Method: Uses hierarchical graph attention with multi-scale skeletal refinement, differentiable spectral topology optimization via Laplacian eigendecomposition, and adversarial geometric consistency training. Integrates four modules: pseudo 3D point cloud generator, enhanced PointNet encoder, skeleton coordinate decoder, and Differentiable Graph Construction Network (DGCN).

Result: Achieved 18.7% improvement in MPJPE and 27.3% improvement in Graph Edit Distance on ShapeNet, while reducing topological errors by 42% compared to previous approaches.

Conclusion: The framework successfully demonstrates end-to-end differentiability for joint 3D skeleton geometry and topology synthesis, with superior performance metrics and broad application potential.

Abstract: We present Cortex-Synth, a novel end-to-end differentiable framework for joint 3D skeleton geometry and topology synthesis from single 2D images. Our architecture introduces three key innovations: (1) A hierarchical graph attention mechanism with multi-scale skeletal refinement, (2) Differentiable spectral topology optimization via Laplacian eigendecomposition, and (3) Adversarial geometric consistency training for pose structure alignment. The framework integrates four synergistic modules: a pseudo 3D point cloud generator, an enhanced PointNet encoder, a skeleton coordinate decoder, and a novel Differentiable Graph Construction Network (DGCN). Our experiments demonstrate state-of-the-art results with an 18.7 percent improvement in MPJPE and 27.3 percent in Graph Edit Distance on ShapeNet, while reducing topological errors by 42 percent compared to previous approaches. The model’s end-to-end differentiability enables applications in robotic manipulation, medical imaging, and automated character rigging.
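A hedged illustration of the differentiable spectral topology idea: compare the low end of the Laplacian spectrum of a predicted soft adjacency with that of the target graph, which stays differentiable through torch.linalg.eigvalsh. Graph sizes and the number of eigenvalues below are arbitrary choices, not the paper's:

```python
# Sketch of a differentiable spectral topology term; not the paper's exact loss.
import torch

def laplacian(adj):
    return torch.diag(adj.sum(-1)) - adj

def spectral_loss(adj_pred, adj_gt, k=6):
    # adj_pred: soft symmetric adjacency (e.g. sigmoid outputs); adj_gt: 0/1 target
    ev_pred = torch.linalg.eigvalsh(laplacian(adj_pred))[:k]  # smallest k eigenvalues
    ev_gt = torch.linalg.eigvalsh(laplacian(adj_gt))[:k]
    return ((ev_pred - ev_gt) ** 2).mean()

pred = torch.rand(16, 16)
pred = (pred + pred.T) / 2                       # symmetrise
pred.fill_diagonal_(0.0)
pred.requires_grad_(True)
gt = (torch.rand(16, 16) > 0.8).float()
gt = ((gt + gt.T) > 0).float()
gt.fill_diagonal_(0.0)
spectral_loss(pred, gt).backward()               # gradients flow to the soft adjacency
```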

[295] Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training

Ruicheng Zhang, Jun Zhou, Zunnan Xu, Zihao Liu, Jiehui Huang, Mingyang Zhang, Yu Sun, Xiu Li

Main category: cs.CV

TL;DR: Zo3T is a zero-shot test-time training framework for trajectory-guided image-to-video generation that uses 3D-aware kinematic projection, dynamic LoRA adapters, and guidance field rectification to achieve realistic motion without expensive fine-tuning.

DetailsMotivation: Existing methods for trajectory-guided I2V generation either require computationally expensive fine-tuning on scarce datasets or produce unrealistic motion by neglecting 3D perspective and creating misalignment between manipulated latents and noise predictions.

Method: Three core innovations: 1) 3D-Aware Kinematic Projection using scene depth for perspective-correct transformations, 2) Trajectory-Guided Test-Time LoRA with dynamic adapter optimization and regional feature consistency loss, 3) Guidance Field Rectification with one-step lookahead strategy to refine denoising path.

Result: Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation, demonstrating superior performance over existing training-based and zero-shot approaches.

Conclusion: The proposed framework effectively addresses challenges in zero-shot trajectory-guided video generation by ensuring 3D perspective correctness, maintaining generative fidelity, and enabling efficient motion control without requiring annotated training data.

Abstract: Trajectory-Guided image-to-video (I2V) generation aims to synthesize videos that adhere to user-specified motion instructions. Existing methods typically rely on computationally expensive fine-tuning on scarce annotated datasets. Although some zero-shot methods attempt trajectory control in the latent space, they may yield unrealistic motion by neglecting 3D perspective and creating a misalignment between the manipulated latents and the network’s noise predictions. To address these challenges, we introduce Zo3T, a novel zero-shot test-time training framework for trajectory-guided generation with three core innovations: First, we incorporate a 3D-Aware Kinematic Projection, leveraging inferred scene depth to derive perspective-correct affine transformations for target regions. Second, we introduce Trajectory-Guided Test-Time LoRA, a mechanism that dynamically injects and optimizes ephemeral LoRA adapters into the denoising network alongside the latent state. Driven by a regional feature consistency loss, this co-adaptation effectively enforces motion constraints while allowing the pre-trained model to locally adapt its internal representations to the manipulated latent, thereby ensuring generative fidelity and on-manifold adherence. Finally, we develop Guidance Field Rectification, which refines the denoising evolutionary path by optimizing the conditional guidance field through a one-step lookahead strategy, ensuring efficient generative progression towards the target trajectory. Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation, demonstrating superior performance over existing training-based and zero-shot approaches.
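The "ephemeral LoRA" mechanic, in miniature and under assumptions (the injection site, rank, and loss below are placeholders, not Zo3T's): freeze a pre-trained linear layer, attach a low-rank adapter, and optimise only the adapter for a few test-time steps:

```python
# Minimal picture of test-time LoRA: a frozen linear layer wrapped with a
# low-rank adapter that is optimised against some guidance loss at inference.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=1.0):
        super().__init__()
        self.base = base.requires_grad_(False)   # frozen pre-trained weight
        self.A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.B = nn.Parameter(torch.randn(base.out_features, rank) * 0.01)
        self.alpha = alpha

    def forward(self, x):
        # initial delta is zero because A starts at zero
        return self.base(x) + self.alpha * x @ self.A.T @ self.B.T

layer = LoRALinear(nn.Linear(64, 64))
opt = torch.optim.Adam([layer.A, layer.B], lr=1e-3)      # adapter-only optimisation
for _ in range(5):                                       # a few inner steps per frame
    loss = layer(torch.randn(8, 64)).pow(2).mean()       # placeholder guidance loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```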

[296] Co-Seg: Mutual Prompt-Guided Collaborative Learning for Tissue and Nuclei Segmentation

Qing Xu, Wenting Duan, Zhen Chen

Main category: cs.CV

TL;DR: Co-Seg framework for collaborative tissue and nuclei segmentation in histopathology images using mutual enhancement between tasks

DetailsMotivation: Existing studies treat tissue semantic segmentation and nuclei instance segmentation separately, ignoring their inherent relationship, leading to insufficient histopathology understanding

Method: Proposes a co-segmentation paradigm with region-aware prompt encoder (RP-Encoder) for semantic/instance region prompts and mutual prompt mask decoder (MP-Decoder) for cross-guidance and contextual consistency

Result: Extensive experiments on PUMA dataset show Co-Seg surpasses state-of-the-art methods in semantic, instance, and panoptic segmentation of tumor tissues and nuclei

Conclusion: The collaborative approach effectively addresses the relationship between tissue and nuclei segmentation tasks, achieving superior performance in histopathology image analysis

Abstract: Histopathology image analysis is critical yet challenged by the demand of segmenting tissue regions and nuclei instances for tumor microenvironment and cellular morphology analysis. Existing studies focused on tissue semantic segmentation or nuclei instance segmentation separately, but ignored the inherent relationship between these two tasks, resulting in insufficient histopathology understanding. To address this issue, we propose a Co-Seg framework for collaborative tissue and nuclei segmentation. Specifically, we introduce a novel co-segmentation paradigm, allowing tissue and nuclei segmentation tasks to mutually enhance each other. To this end, we first devise a region-aware prompt encoder (RP-Encoder) to provide high-quality semantic and instance region prompts as prior constraints. Moreover, we design a mutual prompt mask decoder (MP-Decoder) that leverages cross-guidance to strengthen the contextual consistency of both tasks, collaboratively computing semantic and instance segmentation masks. Extensive experiments on the PUMA dataset demonstrate that the proposed Co-Seg surpasses state-of-the-art methods in the semantic, instance and panoptic segmentation of tumor tissues and nuclei instances. The source code is available at https://github.com/xq141839/Co-Seg.

[297] Event Spectroscopy: Event-based Multispectral and Depth Sensing using Structured Light

Christian Geckeler, Niklas Neugebauer, Manasi Muglikar, Davide Scaramuzza, Stefano Mintchev

Main category: cs.CV

TL;DR: Novel event spectroscopy system for UAVs that combines high-resolution depth reconstruction and multispectral imaging using a single sensor, achieving 60% RMSE improvement over commercial depth sensors and 30% accuracy boost in material differentiation.

DetailsMotivation: Traditional UAV sensing approaches in forest environments suffer from latency, poor depth resolution, and light dependency, especially under forest canopies where reliable navigation and data collection are critical.

Method: Uses structured light projection with modulated wavelengths (650-850 nm) to simultaneously capture depth and spectral information through a single sensor, enabling both depth reconstruction and multispectral imaging.

Result: Demonstrated 60% RMSE improvement over commercial depth sensors, comparable spectral accuracy to reference spectrometers, and 30% accuracy improvement in material differentiation when combining depth with spectral data vs. color-only methods.

Conclusion: The system enables lightweight, integrated UAV perception for complex natural environments, showing strong performance in depth estimation, RGB reconstruction, and material differentiation in both lab and real-world rainforest testing.

Abstract: Uncrewed aerial vehicles (UAVs) are increasingly deployed in forest environments for tasks such as environmental monitoring and search and rescue, which require safe navigation through dense foliage and precise data collection. Traditional sensing approaches, including passive multispectral and RGB imaging, suffer from latency, poor depth resolution, and strong dependence on ambient light - especially under forest canopies. In this work, we present a novel event spectroscopy system that simultaneously enables high-resolution, low-latency depth reconstruction and multispectral imaging using a single sensor. Depth is reconstructed using structured light, and by modulating the wavelength of the projected structured light, our system captures spectral information in controlled bands between 650 nm and 850 nm. We demonstrate up to $60\%$ improvement in RMSE over commercial depth sensors and validate the spectral accuracy against a reference spectrometer and commercial multispectral cameras, demonstrating comparable performance. A portable version limited to RGB (3 wavelengths) is used to collect real-world depth and spectral data from a Masoala Rainforest. We demonstrate the use of this prototype for color image reconstruction and material differentiation between leaves and branches using spectral and depth data. Our results show that adding depth (available at no extra effort with our setup) to material differentiation improves the accuracy by over $30\%$ compared to the color-only method. Our system, tested in both lab and real-world rainforest environments, shows strong performance in depth estimation, RGB reconstruction, and material differentiation - paving the way for lightweight, integrated, and robust UAV perception and data collection in complex natural environments.

[298] Pothole Detection and Recognition based on Transfer Learning

Mang Hu, Qianqian Xia

Main category: cs.CV

TL;DR: Proposed a ResNet50-EfficientNet-RegNet transfer learning model for pothole detection that achieves 98.89% accuracy, outperforming traditional ML models like Random Forest, SVM, and LightGBM.

DetailsMotivation: Automated pothole detection from road images is crucial for social development and infrastructure maintenance, requiring accurate and efficient computer vision solutions.

Method: Used transfer learning with a hybrid ResNet50-EfficientNet-RegNet architecture after preprocessing data with standardization, normalization, and augmentation techniques.

Result: Achieved 97.78% accuracy on initial 90 test samples and 98.89% accuracy on expanded 900-sample test set, outperforming all comparison models (Random Forest, MLP, SVM, LightGBM) in both accuracy and speed.

Conclusion: The proposed transfer learning model demonstrates superior performance for pothole detection with high classification accuracy and computational efficiency, making it suitable for real-world road condition monitoring applications.

Abstract: With the rapid development of computer vision and machine learning, automated methods for pothole detection and recognition from image and video data have received significant attention. Analysing road images through feature extraction to automatically identify pothole conditions in new images is of practical importance for infrastructure maintenance, and it is the main issue addressed in this study. After applying preprocessing techniques such as standardization, normalization, and data augmentation to the collected raw dataset, we iteratively refined the network architecture based on experimental results. Ultimately, we constructed a transfer learning model built on a ResNet50-EfficientNet-RegNet feature extraction network. This model exhibits high classification accuracy and computational efficiency. In terms of model evaluation, this study employed a comparative evaluation approach by comparing the performance of the proposed transfer learning model with other models, including Random Forest, MLP, SVM, and LightGBM. The comparison analysis was conducted based on metrics such as Accuracy, Recall, Precision, F1-score, and FPS, to assess the classification performance of the transfer learning model proposed in this paper. The results demonstrate that our model exhibits high performance in terms of recognition speed and accuracy, surpassing the performance of other models. Through careful parameter selection and model optimization, our transfer learning model achieved a classification accuracy of 97.78% (88/90) on the initial set of 90 test samples and 98.89% (890/900) on the expanded test set.
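The abstract does not specify how the three backbones are combined; one plausible reading, offered purely as an assumption, extracts features with all three torchvision backbones and classifies their concatenation:

```python
# Hypothetical fusion of the three named backbones; the paper does not give
# the exact architecture, so treat this concatenation scheme as an assumption.
import torch
import torch.nn as nn
from torchvision import models

class TripleBackbone(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.resnet = models.resnet50(weights=None)
        self.resnet.fc = nn.Identity()                        # 2048-d features
        self.effnet = models.efficientnet_b0(weights=None)
        self.effnet.classifier = nn.Identity()                # 1280-d features
        self.regnet = models.regnet_y_400mf(weights=None)
        self.regnet.fc = nn.Identity()                        # 440-d features
        self.head = nn.Linear(2048 + 1280 + 440, n_classes)   # pothole / no pothole

    def forward(self, x):
        feats = torch.cat([self.resnet(x), self.effnet(x), self.regnet(x)], dim=1)
        return self.head(feats)

logits = TripleBackbone()(torch.randn(2, 3, 224, 224))
```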

[299] Raw2Event: Converting Raw Frame Camera into Event Camera

Zijie Ning, Enmin Lin, Sudarshan R. Iyengar, Patrick Vandewalle

Main category: cs.CV

TL;DR: Raw2Event is a hardware-software system that converts raw Bayer data from low-cost frame cameras into real-time event streams, providing event camera-like functionality with higher resolution and autofocus capabilities.

DetailsMotivation: Event cameras offer advantages like high temporal resolution and dynamic range but are expensive and lack features like autofocus, limiting their adoption for early-stage development and prototyping.

Method: Leverages direct access to raw Bayer data bypassing traditional ISP, uses DVS-Voltmeter model with configurable simulation framework optimized for embedded platforms, and includes synchronized data acquisition pipeline for raw, RGB, and event streams.

Result: Generates event streams closely resembling real event cameras while benefiting from higher resolution and autofocus capabilities, supports real-time operation on Raspberry Pi.

Conclusion: Provides a scalable, cost-effective solution for event-based vision research and early-stage system development with user-intuitive parameter tuning for flexible adaptation.

Abstract: Event cameras offer unique advantages such as high temporal resolution, low latency, and high dynamic range, making them increasingly popular for vision tasks under challenging light conditions. However, their high cost, limited resolution, and lack of features such as autofocus hinder their broad adoption, particularly for early-stage development and prototyping. In this work, we present Raw2Event, a complete hardware-software system that enables real-time event generation from low-cost raw frame-based cameras. By leveraging direct access to raw Bayer data and bypassing traditional image signal processors (ISP), our system is able to utilize the full potential of camera hardware, delivering higher dynamic range, higher resolution, and more faithful output than RGB-based frame-to-event converters. Built upon the DVS-Voltmeter model, Raw2Event features a configurable simulation framework optimized for deployment on embedded platforms. We further design a data acquisition pipeline that supports synchronized recording of raw, RGB, and event streams, facilitating downstream evaluation and dataset creation. Experimental results show that Raw2Event can generate event streams closely resembling those from real event cameras, while benefiting from higher resolution and autofocus capabilities. The system also supports user-intuitive parameter tuning, enabling flexible adaptation to various application requirements. Finally, we deploy the system on a Raspberry Pi for real-time operation, providing a scalable and cost-effective solution for event-based vision research and early-stage system development. The code is available online: https://anonymous.4open.science/r/raw2event-BFF2/README.md.
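The core of any frame-to-event conversion (on top of which the DVS-Voltmeter model adds noise and leak terms) is log-intensity change thresholding. A minimal sketch with an illustrative contrast threshold:

```python
# Core frame-to-event idea: fire an event whenever the log intensity at a
# pixel moves by more than a contrast threshold since its last event.
# The DVS-Voltmeter model used by Raw2Event is considerably richer.
import numpy as np

def frames_to_events(frames, thresh=0.15):
    # frames: (T, H, W) linear-intensity frames (e.g. demosaiced raw), T >= 2
    logf = np.log(frames.astype(np.float64) + 1e-6)
    ref = logf[0].copy()                         # per-pixel reference level
    events = []                                  # (t, x, y, polarity) tuples
    for t in range(1, len(frames)):
        diff = logf[t] - ref
        for pol in (+1, -1):
            fired = pol * diff >= thresh
            ys, xs = np.where(fired)
            events += [(t, x, y, pol) for x, y in zip(xs, ys)]
            ref[fired] = logf[t][fired]          # reset reference where events fired
    return events

ev = frames_to_events(np.random.rand(5, 32, 32) + 0.5)
print(len(ev), "events")
```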

[300] D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning

Sai Kartheek Reddy Kasu, Mohammad Zia Ur Rehman, Shahid Shafi Dar, Rishi Bharat Junghare, Dhanvin Sanjay Namboodiri, Nagendra Kumar

Main category: cs.CV

TL;DR: Novel framework for detecting dark humor in memes using multimodal reasoning and role-reversal self-loop refinement, with a new annotated dataset of 4,379 Reddit memes.

DetailsMotivation: Dark humor in online memes presents unique challenges due to implicit, sensitive, and culturally contextual cues, with existing methods lacking resources for multimodal dark humor detection.

Method: Reasoning-augmented framework using Large Vision-Language Model to generate structured explanations via Role-Reversal Self-Loop, then fusing text, image, and reasoning features through Tri-stream Cross-Reasoning Network (TCRNet) with pairwise attention mechanisms.

Result: Outperforms strong baselines across three tasks: dark humor detection, target identification, and intensity prediction.

Conclusion: The approach effectively addresses multimodal dark humor understanding, with released dataset and code to facilitate further research in content moderation and humor analysis.

Abstract: Dark humor in online memes poses unique challenges due to its reliance on implicit, sensitive, and culturally contextual cues. To address the lack of resources and methods for detecting dark humor in multimodal content, we introduce a novel dataset of 4,379 Reddit memes annotated for dark humor, target category (gender, mental health, violence, race, disability, and other), and a three-level intensity rating (mild, moderate, severe). Building on this resource, we propose a reasoning-augmented framework that first generates structured explanations for each meme using a Large Vision-Language Model (VLM). Through a Role-Reversal Self-Loop, the VLM adopts the author’s perspective to iteratively refine its explanations, ensuring completeness and alignment. We then extract textual features from both the OCR transcript and the self-refined reasoning via a text encoder, while visual features are obtained using a vision transformer. A Tri-stream Cross-Reasoning Network (TCRNet) fuses these three streams, text, image, and reasoning, via pairwise attention mechanisms, producing a unified representation for classification. Experimental results demonstrate that our approach outperforms strong baselines across three tasks: dark humor detection, target identification, and intensity prediction. The dataset, annotations, and code are released to facilitate further research in multimodal humor understanding and content moderation. Code and Dataset are available at: https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning
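Pairwise attention across three streams can be sketched as six cross-attention calls followed by pooling; the dimensions and fusion details below are placeholders, not the actual TCRNet:

```python
# Schematic tri-stream fusion: each stream (text, image, reasoning) attends
# to the other two, then pooled outputs are concatenated. Sizes are arbitrary.
import torch
import torch.nn as nn

class TriStreamFusion(nn.Module):
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.attn = nn.ModuleDict({
            k: nn.MultiheadAttention(d, heads, batch_first=True)
            for k in ("ti", "tr", "it", "ir", "rt", "ri")  # 6 ordered stream pairs
        })
        self.out = nn.Linear(3 * d, d)

    def forward(self, t, i, r):                  # (B, L, d) for each stream
        t2 = self.attn["ti"](t, i, i)[0] + self.attn["tr"](t, r, r)[0]
        i2 = self.attn["it"](i, t, t)[0] + self.attn["ir"](i, r, r)[0]
        r2 = self.attn["rt"](r, t, t)[0] + self.attn["ri"](r, i, i)[0]
        pooled = torch.cat([x.mean(1) for x in (t2, i2, r2)], dim=-1)
        return self.out(pooled)

f = TriStreamFusion()
z = f(torch.randn(2, 10, 256), torch.randn(2, 10, 256), torch.randn(2, 10, 256))
```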

[301] UrbanTwin: High-Fidelity Synthetic Replicas of Roadside Lidar Datasets

Muhammad Shahbaz, Shaurya Agarwal

Main category: cs.CV

TL;DR: UrbanTwin datasets are high-fidelity synthetic replicas of three real roadside lidar datasets, created using realistic digital twins with precise geometry and traffic patterns, offering strong standalone and augmentative value for training 3D perception models.

DetailsMotivation: To create realistic synthetic datasets that can replace or augment real-world lidar datasets for 3D perception tasks, addressing data scarcity and enabling custom scenario testing.

Method: Synthesized datasets using emulated lidar sensors within realistic digital twins modeled on actual locations’ geometry, road alignment, lane topology, and vehicle movement patterns, with comprehensive annotations including 3D bounding boxes, instance segmentation, tracking IDs, and semantic segmentation.

Result: High similarity scores between synthetic and real data, and improved 3D object detection performance when models trained solely on synthetic data were tested on real, unseen data compared to models trained on real data.

Conclusion: UrbanTwin datasets effectively enhance existing benchmark datasets by increasing sample size and scene diversity, and represent the first digitally synthesized datasets that can replace in-domain real-world datasets for lidar perception tasks.

Abstract: This article presents UrbanTwin datasets - high-fidelity, realistic replicas of three public roadside lidar datasets: LUMPI, V2X-Real-IC, and TUMTraf-I. Each UrbanTwin dataset contains 10K annotated frames corresponding to one of the public datasets. Annotations include 3D bounding boxes, instance segmentation labels, and tracking IDs for six object classes, along with semantic segmentation labels for nine classes. These datasets are synthesized using emulated lidar sensors within realistic digital twins, modeled based on surrounding geometry, road alignment at lane level, and the lane topology and vehicle movement patterns at intersections of the actual locations corresponding to each real dataset. Due to the precise digital twin modeling, the synthetic datasets are well aligned with their real counterparts, offering strong standalone and augmentative value for training deep learning models on tasks such as 3D object detection, tracking, and semantic and instance segmentation. We evaluate the alignment of the synthetic replicas through statistical and structural similarity analysis with real data, and further demonstrate their utility by training 3D object detection models solely on synthetic data and testing them on real, unseen data. The high similarity scores and improved detection performance, compared to the models trained on real data, indicate that the UrbanTwin datasets effectively enhance existing benchmark datasets by increasing sample size and scene diversity. In addition, the digital twins can be adapted to test custom scenarios by modifying the design and dynamics of the simulations. To our knowledge, these are the first digitally synthesized datasets that can replace in-domain real-world datasets for lidar perception tasks. UrbanTwin datasets are publicly available at https://dataverse.harvard.edu/dataverse/ucf-ut.

[302] P3-SAM: Native 3D Part Segmentation

Changfeng Ma, Yang Li, Xinhao Yan, Jiachen Xu, Yunhan Yang, Chunshi Wang, Zibo Zhao, Yanwen Guo, Zhuo Chen, Chunchao Guo

Main category: cs.CV

TL;DR: P3-SAM is a native 3D point-promptable part segmentation model that automates 3D object segmentation into components using a SAM-inspired architecture with feature extraction, multiple segmentation heads, and IoU prediction.

DetailsMotivation: Current 3D part segmentation methods lack robustness with complex objects and cannot fully automate the process, limiting applications in 3D understanding and model reuse.

Method: Proposes P3-SAM with feature extractor, multiple segmentation heads, and IoU predictor for interactive segmentation. Includes automatic mask selection and merging algorithm for part instance segmentation. Trained on 3.7M models with segmentation labels.

Result: Achieves precise segmentation results and strong robustness on complex objects, attaining state-of-the-art performance.

Conclusion: P3-SAM successfully automates 3D object segmentation with improved robustness and precision, enabling better 3D understanding and model reuse applications.

Abstract: Segmenting 3D assets into their constituent parts is crucial for enhancing 3D understanding, facilitating model reuse, and supporting various applications such as part generation. However, current methods face limitations such as poor robustness when dealing with complex objects and cannot fully automate the process. In this paper, we propose a native 3D point-promptable part segmentation model termed P3-SAM, designed to fully automate the segmentation of any 3D object into components. Inspired by SAM, P3-SAM consists of a feature extractor, multiple segmentation heads, and an IoU predictor, enabling interactive segmentation for users. We also propose an algorithm to automatically select and merge masks predicted by our model for part instance segmentation. Our model is trained on a newly built dataset containing nearly 3.7 million models with reasonable segmentation labels. Comparisons show that our method achieves precise segmentation results and strong robustness on complex objects, attaining state-of-the-art performance. Our code will be released soon.
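Mask selection and merging in the spirit described (candidates scored by an IoU predictor, heavily overlapping candidates merged) can be sketched as a greedy pass; the thresholds below are illustrative, not the paper's algorithm:

```python
# Greedy selection/merging of candidate part masks, scored by predicted IoU.
# A sketch of the general pattern only; P3-SAM's actual algorithm may differ.
import numpy as np

def select_and_merge(masks, scores, iou_merge=0.8, min_score=0.5):
    # masks: (M, N) boolean per-point masks over N points; scores: (M,) predicted IoU
    order = np.argsort(-scores)                  # best-scoring candidates first
    kept = []
    for i in order:
        if scores[i] < min_score:
            break
        merged = False
        for j, m in enumerate(kept):
            inter = (masks[i] & m).sum()
            union = (masks[i] | m).sum()
            if union and inter / union > iou_merge:  # duplicate part: merge masks
                kept[j] = m | masks[i]
                merged = True
                break
        if not merged:
            kept.append(masks[i].copy())
    return kept

masks = np.random.rand(6, 1000) > 0.5
parts = select_and_merge(masks, np.random.rand(6))
```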

[303] AIM 2025 Challenge on High FPS Motion Deblurring: Methods and Results

George Ciubotariu, Florin-Alexandru Vasluianu, Zhuyun Zhou, Nancy Mehta, Radu Timofte, Ke Wu, Long Sun, Lingshun Kong, Zhongbao Yang, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Hao Chen, Yinghui Fang, Dafeng Zhang, Yongqi Song, Jiangbo Guo, Shuhua Jin, Zeyu Xiao, Rui Zhao, Zhuoyuan Li, Cong Zhang, Yufeng Peng, Xin Lu, Zhijing Sun, Chengjie Ge, Zihao Li, Zishun Liao, Ziang Zhou, Qiyu Kang, Xueyang Fu, Zheng-Jun Zha, Yuqian Zhang, Shuai Liu, Jie Liu, Zhuhao Zhang, Lishen Qu, Zhihao Liu, Shihao Zhou, Yaqi Luo, Juncheng Zhou, Jufeng Yang, Qianfeng Yang, Qiyuan Guan, Xiang Chen, Guiyue Jin, Jiyu Jin

Main category: cs.CV

TL;DR: Review of AIM 2025 High FPS Non-Uniform Motion Deblurring Challenge, analyzing 9 submitted solutions from 68 participants for single image motion deblurring using novel MIORe dataset.

DetailsMotivation: To identify effective networks that can produce clearer images in challenging conditions by learning visual cues for complex motion types and evaluate state-of-the-art advances in high-FPS motion deblurring.

Method: Comprehensive review and evaluation of proposed solutions from competition participants, leveraging the novel MIORe dataset that introduces challenging movement patterns for testing.

Result: 68 participants registered, 9 teams submitted valid entries, showcasing significant progress in high-FPS single image motion deblurring field.

Conclusion: The challenge successfully demonstrated advanced capabilities in motion deblurring, highlighting substantial improvements in handling complex motion patterns through novel approaches and datasets.

Abstract: This paper presents a comprehensive review of the AIM 2025 High FPS Non-Uniform Motion Deblurring Challenge, highlighting the proposed solutions and final results. The objective of this challenge is to identify effective networks capable of producing clearer and visually compelling images in diverse and challenging conditions, by learning representative visual cues for complex aggregations of motion types. A total of 68 participants registered for the competition, and 9 teams ultimately submitted valid entries. This paper thoroughly evaluates the state-of-the-art advances in high-FPS single image motion deblurring, showcasing the significant progress in the field, while leveraging samples from the novel MIORe dataset, which introduces challenging examples of movement patterns.

[304] SynthDrive: Scalable Real2Sim2Real Sensor Simulation Pipeline for High-Fidelity Asset Generation and Driving Data Synthesis

Zhengqing Chen, Ruohong Mei, Xiaoyang Guo, Qingjie Wang, Yubin Hu, Wei Yin, Weiqiang Ren, Qian Zhang

Main category: cs.CV

TL;DR: A scalable real2sim2real system using 3D generation to automate asset mining and rare-case data synthesis for autonomous driving sensor simulation.

DetailsMotivation: Current sensor simulation methods fall into two categories: CG-based methods such as CARLA lack diversity and scalability, while learning-based approaches such as NeuSim are limited to specific object categories and require extensive multi-sensor data, making them unsuitable for generic objects.

Method: Proposes a real2sim2real system that leverages 3D generation technology to automate asset mining, generation, and rare-case data synthesis for sensor simulation.

Result: Not specified in the abstract; the paper proposes a pipeline but does not report experimental results in this excerpt.

Conclusion: The proposed system aims to overcome limitations of existing methods by providing a scalable approach for generating diverse and rare scenarios needed for robust perception training in autonomous driving.

Abstract: In the field of autonomous driving, sensor simulation is essential for generating rare and diverse scenarios that are difficult to capture in real-world environments. Current solutions fall into two categories: 1) CG-based methods, such as CARLA, which lack diversity and struggle to scale to the vast array of rare cases required for robust perception training; and 2) learning-based approaches, such as NeuSim, which are limited to specific object categories (vehicles) and require extensive multi-sensor data, hindering their applicability to generic objects. To address these limitations, we propose a scalable real2sim2real system that leverages 3D generation to automate asset mining, generation, and rare-case data synthesis.

[305] MIORe & VAR-MIORe: Benchmarks to Push the Boundaries of Restoration

George Ciubotariu, Zhuyun Zhou, Zongwei Wu, Radu Timofte

Main category: cs.CV

TL;DR: MIORe and VAR-MIORe are novel multi-task datasets for motion restoration with high-frame-rate capture, adaptive motion blur generation, and variable motion magnitude control to benchmark image/video restoration algorithms.

DetailsMotivation: Address limitations in current motion restoration benchmarks by capturing complex motion scenarios including ego-camera movements, multi-subject interactions, and depth-dependent blur effects with professional-grade equipment.

Method: Use 1000 FPS acquisition and professional optics to capture motion scenarios, then adaptively average frames based on optical flow metrics to generate consistent motion blur while preserving sharp inputs for video frame interpolation and optical flow estimation.

Result: Created scalable high-resolution ground truth datasets that challenge existing algorithms under both controlled and adverse conditions, with VAR-MIORe providing the first benchmark offering explicit control over motion amplitude from minimal to extreme.

Conclusion: These datasets pave the way for next-generation research in various image and video restoration tasks by providing comprehensive benchmarks that address current limitations in motion restoration evaluation.

Abstract: We introduce MIORe and VAR-MIORe, two novel multi-task datasets that address critical limitations in current motion restoration benchmarks. Designed with high-frame-rate (1000 FPS) acquisition and professional-grade optics, our datasets capture a broad spectrum of motion scenarios, which include complex ego-camera movements, dynamic multi-subject interactions, and depth-dependent blur effects. By adaptively averaging frames based on computed optical flow metrics, MIORe generates consistent motion blur and preserves sharp inputs for video frame interpolation and optical flow estimation. VAR-MIORe further extends this by spanning a variable range of motion magnitudes, from minimal to extreme, establishing the first benchmark to offer explicit control over motion amplitude. We provide high-resolution, scalable ground truths that challenge existing algorithms under both controlled and adverse conditions, paving the way for next-generation research of various image and video restoration tasks.
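The adaptive averaging recipe can be pictured as accumulating optical-flow magnitude until a target blur budget is reached, then averaging that many high-FPS frames; the flow values and budget below are stand-ins, not the dataset's actual parameters:

```python
# Miniature of flow-adaptive blur synthesis: average as many 1000 FPS frames
# as needed to reach a target accumulated flow, so blur stays consistent.
import numpy as np

def synthesize_blur(frames, flow_mags, target_flow=8.0):
    # frames: (T, H, W); flow_mags[i]: mean |flow| between frames i and i+1
    acc, n = 0.0, 1
    while n < len(frames) and acc < target_flow:
        acc += flow_mags[n - 1]
        n += 1
    return frames[:n].mean(0), n                 # blurred frame, window size used

frames = np.random.rand(32, 64, 64)
blur, used = synthesize_blur(frames, np.random.rand(31) * 2)
print(f"averaged {used} frames")
```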

[306] UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward

Yufeng Cheng, Wenxu Wu, Shaojin Wu, Mengqi Huang, Fei Ding, Qian He

Main category: cs.CV

TL;DR: UMO is a unified multi-identity optimization framework that addresses identity confusion in image customization by reformulating multi-identity generation as a global assignment optimization problem using reinforcement learning on diffusion models.

DetailsMotivation: Humans are highly sensitive to faces, making identity preservation challenging in image customization. Current methods struggle with maintaining consistent identity while avoiding identity confusion when using multiple reference images, limiting identity scalability.

Method: UMO uses a “multi-to-multi matching” paradigm to reformulate multi-identity generation as a global assignment optimization problem. It employs reinforcement learning on diffusion models and includes a scalable customization dataset with multi-reference images (both synthesized and real).

Result: Extensive experiments show UMO significantly improves identity consistency and reduces identity confusion across several image customization methods, achieving state-of-the-art performance in identity preservation among open-source methods.

Conclusion: UMO provides an effective framework for maintaining high-fidelity identity preservation while alleviating identity confusion, enabling better scalability for multi-identity image customization tasks.

Abstract: Recent advancements in image customization exhibit a wide range of application prospects due to stronger customization capabilities. However, since we humans are more sensitive to faces, a significant challenge remains in preserving consistent identity while avoiding identity confusion with multi-reference images, limiting the identity scalability of customization models. To address this, we present UMO, a Unified Multi-identity Optimization framework, designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. With a “multi-to-multi matching” paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem and unleashes multi-identity consistency for existing image customization methods generally through reinforcement learning on diffusion models. To facilitate the training of UMO, we develop a scalable customization dataset with multi-reference images, consisting of both synthesised and real parts. Additionally, we propose a new metric to measure identity confusion. Extensive experiments demonstrate that UMO not only improves identity consistency significantly, but also reduces identity confusion on several image customization methods, setting a new state-of-the-art among open-source methods along the dimension of identity preservation. Code and model: https://github.com/bytedance/UMO

[307] Video-Based MPAA Rating Prediction: An Attention-Driven Hybrid Architecture Using Contrastive Learning

Dipta Neogi, Nourash Azmine Chowdhury, Muhammad Rafsan Kabir, Mohammad Ashrafuzzaman Khan

Main category: cs.CV

TL;DR: Hybrid LRCN model with attention and contrastive learning achieves 88% accuracy for automated MPAA video rating classification, deployed as a web application.

DetailsMotivation: Address challenges in automated video age-rating: large labeled data requirements, poor generalization, and inefficient feature learning for MPAA standards.

Method: Contrastive learning with three frameworks (Instance Discrimination, Contextual Contrastive, Multi-View) + hybrid LRCN (CNN+LSTM) with Bahdanau attention mechanism.
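
A minimal PyTorch sketch of the LRCN-plus-Bahdanau-attention head named above; layer sizes are illustrative guesses, as the summary does not specify them.

```python
import torch
import torch.nn as nn

class AttentiveLRCN(nn.Module):
    """LSTM over per-frame CNN features with Bahdanau (additive) attention."""
    def __init__(self, feat_dim=512, hidden=256, n_classes=4):  # G/PG/PG-13/R
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.attn_w = nn.Linear(hidden, hidden)
        self.attn_v = nn.Linear(hidden, 1, bias=False)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, frame_feats):                       # (B, T, feat_dim)
        h, _ = self.lstm(frame_feats)                     # (B, T, hidden)
        scores = self.attn_v(torch.tanh(self.attn_w(h)))  # (B, T, 1)
        weights = torch.softmax(scores, dim=1)            # frame importance
        context = (weights * h).sum(dim=1)                # (B, hidden)
        return self.head(context)
```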

Result: State-of-the-art 88% accuracy and 0.8815 F1 score, excels in fine-grained distinctions like PG-13 vs R content.

Conclusion: Robust architecture enables practical deployment for real-time automated content compliance across streaming platforms.

Abstract: The rapid growth of visual content consumption across platforms necessitates automated video classification for age-suitability standards like the MPAA rating system (G, PG, PG-13, R). Traditional methods struggle with large labeled data requirements, poor generalization, and inefficient feature learning. To address these challenges, we employ contrastive learning for improved discrimination and adaptability, exploring three frameworks: Instance Discrimination, Contextual Contrastive Learning, and Multi-View Contrastive Learning. Our hybrid architecture integrates an LRCN (CNN+LSTM) backbone with a Bahdanau attention mechanism, achieving state-of-the-art performance in the Contextual Contrastive Learning framework, with 88% accuracy and an F1 score of 0.8815. By combining CNNs for spatial features, LSTMs for temporal modeling, and attention mechanisms for dynamic frame prioritization, the model excels in fine-grained borderline distinctions, such as differentiating PG-13 and R-rated content. We evaluate the model’s performance across various contrastive loss functions, including NT-Xent, NT-logistic, and Margin Triplet, demonstrating the robustness of our proposed architecture. To ensure practical application, the model is deployed as a web application for real-time MPAA rating classification, offering an efficient solution for automated content compliance across streaming platforms.

[308] Curia: A Multi-Modal Foundation Model for Radiology

Corentin Dancette, Julien Khlaut, Antoine Saporta, Helene Philippe, Elodie Ferreres, Baptiste Callard, Théo Danielou, Léo Alberge, Léo Machado, Daniel Tordjman, Julie Dupuis, Korentin Le Floch, Jean Du Terrail, Mariam Moshiri, Laurent Dercle, Tom Boeken, Jules Gregory, Maxime Ronot, François Legou, Pascal Roux, Marc Sapoval, Pierre Manceron, Paul Hérent

Main category: cs.CV

TL;DR: Curia is a radiology foundation model trained on 150,000 exams (130TB) from a major hospital that demonstrates broad generalization across 19 radiological tasks, outperforming radiologists and other models in organ identification, disease detection, and outcome prediction.

DetailsMotivation: Current AI radiology models are narrow and single-task, which is impractical for covering the vast spectrum of imaging modalities and diseases. Foundation models promise broad generalization but this potential hasn't been realized in radiology.

Method: Trained Curia foundation model on the entire cross-sectional imaging output of a major hospital over several years - 150,000 exams (130TB) of real-world data. Validated on a newly curated 19-task external benchmark.

Result: Curia accurately identifies organs, detects conditions (brain hemorrhages, myocardial infarctions), predicts tumor staging outcomes. Meets or surpasses radiologists and recent foundation models. Shows emergent properties in cross-modality and low-data settings.

Conclusion: Curia demonstrates that foundation models can achieve broad generalization in radiology, outperforming specialized models and human experts. The model weights are released to accelerate progress in medical AI.

Abstract: AI-assisted radiological interpretation is based on predominantly narrow, single-task models. This approach is impractical for covering the vast spectrum of imaging modalities, diseases, and radiological findings. Foundation models (FMs) hold the promise of broad generalization across modalities and in low-data settings. However, this potential has remained largely unrealized in radiology. We introduce Curia, a foundation model trained on the entire cross-sectional imaging output of a major hospital over several years, which to our knowledge is the largest such corpus of real-world data, encompassing 150,000 exams (130 TB). On a newly curated 19-task external validation benchmark, Curia accurately identifies organs, detects conditions like brain hemorrhages and myocardial infarctions, and predicts outcomes in tumor staging. Curia meets or surpasses the performance of radiologists and recent foundation models, and exhibits clinically significant emergent properties in cross-modality and low-data regimes. To accelerate progress, we release our base model’s weights at https://huggingface.co/raidium/curia.

[309] Leveraging Generic Foundation Models for Multimodal Surgical Data Analysis

Simon Pezold, Jérôme A. Kurylec, Jan S. Liechti, Beat P. Müller, Joël L. Lavanchy

Main category: cs.CV

TL;DR: This paper investigates how adapting generic foundation models through transfer learning and integrating multimodal OR data can enhance surgical data science, showing improved performance in predicting hospital outcomes and surgical phase recognition.

DetailsMotivation: To explore how domain adaptation of foundation models and multimodal integration of OR data streams can support surgical data science applications like outcome prediction and phase recognition.

Method: Used V-JEPA as the foundation model, finetuned it on unlabeled surgical videos, integrated additional time-resolved OR data streams via a separate encoder, and tested on in-house liver surgery videos and the public HeiCo dataset for surgical phase recognition.

Result: Finetuning on domain-specific data improved performance; multimodal integration benefited in-house data; video-only baseline matched top EndoVis2017 submissions, with further accuracy gains from finetuning.

Conclusion: Surgical data science can effectively leverage public foundation models through domain adaptation and multimodal integration of complementary OR data streams, with code and models released for further research.

Abstract: We investigate how both the adaptation of a generic foundation model via transfer learning and the integration of complementary modalities from the operating room (OR) can support surgical data science. To this end, we use V-JEPA as the single-modality foundation of a multimodal model for minimally invasive surgery support. We analyze how the model’s downstream performance can benefit (a) from finetuning on unlabeled surgical video data and (b) from providing additional time-resolved data streams from the OR in a multimodal setup. In an in-house dataset of liver surgery videos, we analyze the tasks of predicting hospital length of stay and postoperative complications. In videos of the public HeiCo dataset, we analyze the task of surgical phase recognition. As a baseline, we apply pretrained V-JEPA to all tasks. We then finetune it on unlabeled, held-out videos to investigate its change in performance after domain adaptation. Following the idea of modular decision support networks, we integrate additional data streams from the OR by training a separate encoder to form a shared representation space with V-JEPA’s embeddings. Our experiments show that finetuning on domain-specific data increases model performance. On the in-house data, integrating additional time-resolved data likewise benefits the model. On the HeiCo data, accuracy of the pretrained video-only, single-modality baseline setup is on par with the top-performing submissions of the EndoVis2017 challenge, while finetuning on domain-specific data increases accuracy further. Our results thus demonstrate how surgical data science can leverage public, generic foundation models. Likewise, they indicate the potential of domain adaptation and of integrating suitable complementary data streams from the OR. To support further research, we release our code and model weights at https://github.com/DigitalSurgeryLab-Basel/ML-CDS-2025.

[310] Evaluating the Impact of Adversarial Attacks on Traffic Sign Classification using the LISA Dataset

Nabeyou Tadessa, Balaji Iyangar, Mashrur Chowdhury

Main category: cs.CV

TL;DR: Traffic sign classifiers are vulnerable to adversarial attacks like FGSM and PGD, with accuracy dropping sharply as perturbation increases.

DetailsMotivation: Adversarial attacks threaten ML models but prior work focused on simple datasets like MNIST, leaving real-world traffic sign recognition systems understudied.

Method: Trained CNN on LISA Traffic Sign dataset (47 classes) and evaluated robustness against FGSM and PGD adversarial attacks with varying perturbation magnitudes.
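
FGSM and PGD follow their standard textbook forms; a compact sketch with the model and data as placeholders (the paper's exact step sizes and iteration counts are not given here).

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One signed-gradient step; assumes inputs normalized to [0, 1]."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, eps, alpha=0.01, steps=10):
    """Iterated FGSM with projection onto the L-infinity ball of radius eps."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv = fgsm(model, x_adv, y, alpha)
        x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1)
    return x_adv
```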

Result: Sharp decline in classification accuracy as perturbation magnitude increases, demonstrating high susceptibility to adversarial examples.

Conclusion: Traffic sign classifiers are highly vulnerable to adversarial attacks, highlighting the need for tailored defense mechanisms for real-world traffic recognition systems.

Abstract: Adversarial attacks pose significant threats to machine learning models by introducing carefully crafted perturbations that cause misclassification. While prior work has primarily focused on MNIST and similar datasets, this paper investigates the vulnerability of traffic sign classifiers using the LISA Traffic Sign dataset. We train a convolutional neural network to classify 47 different traffic signs and evaluate its robustness against Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) attacks. Our results show a sharp decline in classification accuracy as the perturbation magnitude increases, highlighting the model’s susceptibility to adversarial examples. This study lays the groundwork for future exploration into defense mechanisms tailored for real-world traffic sign recognition systems.

[311] ToonOut: Fine-tuned Background-Removal for Anime Characters

Matteo Muratori, Joël Seytre

Main category: cs.CV

TL;DR: Fine-tuned BiRefNet model on custom anime dataset achieves 99.5% pixel accuracy for background removal in anime-style images, up from 95.3%.

DetailsMotivation: State-of-the-art background removal models underperform on anime-style content due to complex features like hair and transparency.

Method: Collected and annotated 1,228 high-quality anime images, then fine-tuned the open-source BiRefNet model on this custom dataset.

Result: Significant improvement in background removal accuracy for anime images, increasing from 95.3% to 99.5% pixel accuracy.
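
One plausible reading of the newly introduced Pixel Accuracy metric (the paper's exact definition may differ) is the fraction of pixels whose binarized predicted alpha agrees with the ground-truth matte:

```python
import numpy as np

def pixel_accuracy(pred_alpha, gt_alpha, thresh=0.5):
    """Fraction of pixels whose binarized prediction matches the ground truth."""
    return ((pred_alpha >= thresh) == (gt_alpha >= thresh)).mean()
```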

Conclusion: Domain-specific fine-tuning with specialized datasets effectively addresses background removal challenges in anime content, with code, model weights, and dataset made publicly available.

Abstract: While state-of-the-art background removal models excel at realistic imagery, they frequently underperform in specialized domains such as anime-style content, where complex features like hair and transparency present unique challenges. To address this limitation, we collected and annotated a custom dataset of 1,228 high-quality anime images of characters and objects, and fine-tuned the open-sourced BiRefNet model on this dataset. This resulted in marked improvements in background removal accuracy for anime-style images, increasing from 95.3% to 99.5% for our newly introduced Pixel Accuracy metric. We are open-sourcing the code, the fine-tuned model weights, as well as the dataset at: https://github.com/MatteoKartoon/BiRefNet.

[312] Matching Shapes Under Different Topologies: A Topology-Adaptive Deformation Guided Approach

Aymen Merrouche, Stefanie Wuhrer, Edmond Boyer

Main category: cs.CV

TL;DR: A topology-adaptive deformation model for non-rigid 3D mesh matching that handles topological artifacts and non-isometric deformations without requiring data-driven priors.

DetailsMotivation: Address real-world scenarios like per-frame multi-view reconstructions that suffer from topological artifacts, where current approaches (Functional Maps and ARAP-like methods) fail due to their restrictive assumptions about deformation types.

Method: Propose a topology-adaptive deformation model that allows changes in shape topology while maintaining ARAP and bijective association constraints. Jointly optimize for a template mesh with adequate topology and its alignment with target shapes to extract correspondences.

Result: The approach successfully handles highly non-isometric shapes and shapes with topological artifacts, including noisy per-frame multi-view reconstructions. It outperforms methods trained on large datasets in 3D alignment quality despite not relying on data-driven priors.

Conclusion: The proposed topology-adaptive deformation model provides an effective solution for 3D mesh matching in challenging scenarios with topological artifacts, demonstrating superior performance over existing methods without requiring extensive training data.

Abstract: Non-rigid 3D mesh matching is a critical step in computer vision and computer graphics pipelines. We tackle matching meshes that contain topological artefacts which can break the assumption made by current approaches. While Functional Maps assume the deformation induced by the ground truth correspondences to be near-isometric, ARAP-like deformation-guided approaches assume the latter to be ARAP. Neither assumption holds in certain topological configurations of the input shapes. We are motivated by real-world scenarios such as per-frame multi-view reconstructions, often suffering from topological artefacts. To this end, we propose a topology-adaptive deformation model allowing changes in shape topology to align shape pairs under ARAP and bijective association constraints. Using this model, we jointly optimise for a template mesh with adequate topology and for its alignment with the shapes to be matched to extract correspondences. We show that, while not relying on any data-driven prior, our approach applies to highly non-isometric shapes and shapes with topological artefacts, including noisy per-frame multi-view reconstructions, even outperforming methods trained on large datasets in 3D alignment quality.

[313] A New Hybrid Model of Generative Adversarial Network and You Only Look Once Algorithm for Automatic License-Plate Recognition

Behnoud Shafiezadeh, Amir Mashmool, Farshad Eshghi, Manoochehr Kelarestaghi

Main category: cs.CV

TL;DR: Proposes a selective GAN for deblurring preprocessing combined with YOLOv5 for license plate detection and character recognition, achieving high accuracy (95-97%) and fast real-time performance (0.026s detection time) with 40% improvement on blurred plates.

DetailsMotivation: ALPR faces challenges due to high variability and blurred inputs in real-world scenarios, requiring efficient deep learning solutions for smart city applications.

Method: Selective GAN preprocessing for deblurring + YOLOv5 architecture for license plate detection, character segmentation, and recognition. Uses Iranian license plate dataset with realistic blur scenarios.
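
A sketch of the selective-preprocessing idea as we read it: only inputs judged blurry are routed through the Deblur-GAN, while sharp frames go straight to YOLOv5. The variance-of-Laplacian blur test and both network handles are placeholders, not the paper's components.

```python
import cv2

def is_blurry(img_bgr, thresh=100.0):
    """Variance-of-Laplacian blur test: low variance suggests a blurred frame."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() < thresh

def recognize_plate(img, deblur_gan, yolov5):
    if is_blurry(img):          # selective: sharp inputs bypass the GAN
        img = deblur_gan(img)
    return yolov5(img)          # LP detection + character recognition
```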

Result: 0.026s detection time for both LP and CR stages, 95% LPD accuracy, 97% CR accuracy, 40% improvement on blurred plates with Deblur-GAN preprocessing.

Conclusion: YOLOv5 with selective GAN preprocessing delivers excellent precision and real-time performance suitable for portable applications, significantly enhancing effectiveness on blurred inputs.

Abstract: Automatic License-Plate Recognition (ALPR) plays a pivotal role in Intelligent Transportation Systems (ITS) as a fundamental element of Smart Cities. However, due to its high variability, ALPR faces challenging issues more efficiently addressed by deep learning techniques. In this paper, a selective Generative Adversarial Network (GAN) is proposed for deblurring in the preprocessing step, coupled with the state-of-the-art You-Only-Look-Once (YOLO)v5 object detection architectures for License-Plate Detection (LPD), and the integrated Character Segmentation (CS) and Character Recognition (CR) steps. The selective preprocessing bypasses unnecessary and sometimes counter-productive input manipulations, while YOLOv5 LPD/CS+CR delivers high accuracy and low computing cost. As a result, YOLOv5 achieves a detection time of 0.026 seconds for both LP and CR detection stages, facilitating real-time applications with exceptionally rapid responsiveness. Moreover, the proposed model achieves accuracy rates of 95% and 97% in the LPD and CR detection phases, respectively. Furthermore, the inclusion of the Deblur-GAN pre-processor significantly improves detection accuracy by nearly 40%, especially when encountering blurred License Plates (LPs). To train and test the learning components, we generated and publicly released our blur and ALPR datasets (using Iranian license plates as a use-case), which are more representative of close-to-real-life ad-hoc situations. The findings demonstrate that employing the state-of-the-art YOLO model results in excellent overall precision and detection time, making it well-suited for portable applications. Additionally, integrating the Deblur-GAN model as a preliminary processing step enhances the overall effectiveness of our comprehensive model, particularly when confronted with blurred scenes captured by the camera as input.

[314] BIR-Adapter: A Low-Complexity Diffusion Model Adapter for Blind Image Restoration

Cem Eteke, Alexander Griessel, Wolfgang Kellerer, Eckehard Steinbach

Main category: cs.CV

TL;DR: BIR-Adapter is a low-complexity adapter for diffusion models that enables blind image restoration without training auxiliary feature extractors, achieving competitive performance with significantly lower complexity.

DetailsMotivation: To leverage pre-trained diffusion models for blind image restoration tasks without the need for additional feature extractors, reducing complexity while maintaining performance.

Method: Extracts features from degraded images using the model itself, extends self-attention mechanism with degraded features, and introduces sampling guidance to reduce hallucinations.
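
A rough sketch of how self-attention might be extended with degraded-image features, which is our reading of the adapter idea rather than the released implementation; shapes and names are illustrative.

```python
import torch

def extended_self_attention(q, k, v, k_deg, v_deg, scale):
    """q, k, v: (B, N, D) tokens inside the diffusion model; k_deg, v_deg:
    (B, M, D) keys/values derived from the degraded input by the same model."""
    k_all = torch.cat([k, k_deg], dim=1)   # attend over both token sets
    v_all = torch.cat([v, v_deg], dim=1)
    attn = torch.softmax(q @ k_all.transpose(1, 2) * scale, dim=-1)
    return attn @ v_all
```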

Result: Achieves competitive or better performance compared to state-of-the-art methods on synthetic and real-world degradations with significantly lower complexity.

Conclusion: BIR-Adapter’s adapter-based design enables integration into other diffusion models, expanding applications in image restoration and allowing super-resolution models to handle additional unknown degradations.

Abstract: This paper introduces BIR-Adapter, a low-complexity blind image restoration adapter for diffusion models. The BIR-Adapter enables the utilization of the prior of pre-trained large-scale diffusion models on blind image restoration without training any auxiliary feature extractor. We take advantage of the robustness of pretrained models. We extract features from degraded images via the model itself and extend the self-attention mechanism with these degraded features. We introduce a sampling guidance mechanism to reduce hallucinations. We perform experiments on synthetic and real-world degradations and demonstrate that BIR-Adapter achieves competitive or better performance compared to state-of-the-art methods while having significantly lower complexity. Additionally, its adapter-based design enables integration into other diffusion models, enabling broader applications in image restoration tasks. We showcase this by extending a super-resolution-only model to perform better under additional unknown degradations.

[315] FoMo4Wheat: Toward reliable crop vision foundation models with globally curated data

Bing Han, Chen Zhu, Dong Han, Rui Yu, Songliang Cao, Jianhui Wu, Scott Chapman, Zijian Wang, Bangyou Zheng, Wei Guo, Marie Weiss, Benoit de Solan, Andreas Hund, Lukas Roth, Kirchgessner Norbert, Andrea Visioni, Yufeng Ge, Wenjuan Li, Alexis Comar, Dong Jiang, Dejun Han, Fred Baret, Yanfeng Ding, Hao Lu, Shouyang Liu

Main category: cs.CV

TL;DR: FoMo4Wheat is a crop-domain vision foundation model pretrained on the largest wheat image dataset (ImAg4Wheat) that outperforms general-domain models across 10 in-field vision tasks, demonstrating the value of crop-specific pretraining.

DetailsMotivation: General-domain pretrained vision models fail to generalize well for agricultural tasks due to the complex interaction of fine canopy structures with varying field conditions, creating a need for crop-specific foundation models.

Method: Self-supervised pretraining on ImAg4Wheat dataset (2.5M high-resolution wheat images from 30 global sites spanning 2,000+ genotypes and 500+ environmental conditions collected over a decade) to create wheat-specific representations.

Result: FoMo4Wheat consistently outperforms state-of-the-art general-domain pretrained models across ten in-field vision tasks at both canopy and organ levels, with representations that are robust for wheat and transferable to other crops and weeds.

Conclusion: Crop-specific foundation models provide reliable in-field perception and pave the way for universal crop foundation models with cross-species and cross-task capabilities, with both model and dataset publicly available.

Abstract: Vision-driven field monitoring is central to digital agriculture, yet models built on general-domain pretrained backbones often fail to generalize across tasks, owing to the interaction of fine, variable canopy structures with fluctuating field conditions. We present FoMo4Wheat, one of the first crop-domain vision foundation models, pretrained with self-supervision on ImAg4Wheat, the largest and most diverse wheat image dataset to date (2.5 million high-resolution images collected over a decade at 30 global sites, spanning >2,000 genotypes and >500 environmental conditions). This wheat-specific pretraining yields representations that are robust for wheat and transferable to other crops and weeds. Across ten in-field vision tasks at canopy and organ levels, FoMo4Wheat models consistently outperform state-of-the-art models pretrained on general-domain datasets. These results demonstrate the value of crop-specific foundation models for reliable in-field perception and chart a path toward a universal crop foundation model with cross-species and cross-task capabilities. FoMo4Wheat models and the ImAg4Wheat dataset are publicly available online: https://github.com/PheniX-Lab/FoMo4Wheat and https://huggingface.co/PheniX-Lab/FoMo4Wheat. The demonstration website is: https://fomo4wheat.phenix-lab.com/.

[316] ElectroVizQA: How well do Multi-modal LLMs perform in Electronics Visual Question Answering?

Pragati Shuddhodhan Meshram, Swetha Karthikeyan, Bhavya Bhavya, Suma Bhat

Main category: cs.CV

TL;DR: Proposes ElectroVizQA benchmark dataset with 626 visual questions to evaluate MLLMs on digital electronics circuit problems from undergraduate curricula.

DetailsMotivation: MLLMs struggle with fundamental engineering problems, and specialized datasets for training on topics like digital electronics are scarce, creating a gap in technical-domain applications.

Method: Created the first specialized benchmark dataset for VQA tasks in digital electronics, comprising approximately 626 visual questions covering comprehensive digital electronics topics.

Result: The dataset enables rigorous assessment of MLLMs’ capabilities and limitations in understanding and solving digital electronic circuit questions.

Conclusion: This benchmark aims to motivate further research in applying MLLMs to engineering education and bridge the performance gap in technical fields.

Abstract: Multi-modal Large Language Models (MLLMs) are gaining significant attention for their ability to process multi-modal data, providing enhanced contextual understanding of complex problems. MLLMs have demonstrated exceptional capabilities in tasks such as Visual Question Answering (VQA); however, they often struggle with fundamental engineering problems, and there is a scarcity of specialized datasets for training on topics like digital electronics. To address this gap, we propose a benchmark dataset called ElectroVizQA specifically designed to evaluate MLLMs’ performance on digital electronic circuit problems commonly found in undergraduate curricula. This dataset, the first of its kind tailored for the VQA task in digital electronics, comprises approximately 626 visual questions, offering a comprehensive overview of digital electronics topics. This paper rigorously assesses the extent to which MLLMs can understand and solve digital electronic circuit questions, providing insights into their capabilities and limitations within this specialized domain. By introducing this benchmark dataset, we aim to motivate further research and development in the application of MLLMs to engineering education, ultimately bridging the performance gap and enhancing the efficacy of these models in technical fields.

[317] Representation-Centric Survey of Skeletal Action Recognition and the ANUBIS Benchmark

Yang Liu, Jiyao Yang, Madhawa Perera, Pan Ji, Dongwoo Kim, Min Xu, Tianyang Wang, Saeed Anwar, Tom Gedeon, Lei Wang, Zhenyue Qin

Main category: cs.CV

TL;DR: This paper presents a comprehensive survey of skeleton-based action recognition methods categorized by input representations, and introduces ANUBIS - a large-scale challenging dataset with multi-view recordings, complex interactions, and contemporary behaviors for benchmarking state-of-the-art models.

DetailsMotivation: Current research in 3D skeleton-based human action recognition is fragmented across diverse input representations and lacks evaluation under modern real-world challenges, necessitating a systematic review and better benchmarking.

Method: The authors systematically categorize state-of-the-art methods by input feature types (joint coordinates, bone vectors, motion flows, extended representations) and introduce ANUBIS dataset with multi-view recordings, complex multi-person interactions, fine-grained/violent actions, and contemporary social behaviors.

Result: Benchmarking diverse state-of-the-art models on ANUBIS revealed strong action-feature dependencies, limitations of naive multi-representational fusion, and the need for task-aware, semantically aligned integration strategies across 102 action categories.

Conclusion: This work provides both a comprehensive foundation and practical benchmarking resource to guide the development of robust, generalizable skeleton-based action recognition systems for complex real-world scenarios.

Abstract: 3D skeleton-based human action recognition has emerged as a powerful alternative to traditional RGB and depth-based approaches, offering robustness to environmental variations, computational efficiency, and enhanced privacy. Despite remarkable progress, current research remains fragmented across diverse input representations and lacks evaluation under scenarios that reflect modern real-world challenges. This paper presents a representation-centric survey of skeleton-based action recognition, systematically categorizing state-of-the-art methods by their input feature types: joint coordinates, bone vectors, motion flows, and extended representations, and analyzing how these choices influence spatial-temporal modeling strategies. Building on the insights from this review, we introduce ANUBIS, a large-scale, challenging skeleton action dataset designed to address critical gaps in existing benchmarks. ANUBIS incorporates multi-view recordings with back-view perspectives, complex multi-person interactions, fine-grained and violent actions, and contemporary social behaviors. We benchmark a diverse set of state-of-the-art models on ANUBIS and conduct an in-depth analysis of how different feature types affect recognition performance across 102 action categories. Our results show strong action-feature dependencies, highlight the limitations of naïve multi-representational fusion, and point toward the need for task-aware, semantically aligned integration strategies. This work offers both a comprehensive foundation and a practical benchmarking resource, aiming to guide the next generation of robust, generalizable skeleton-based action recognition systems for complex real-world scenarios. The dataset website, benchmarking framework, and download link are available at https://yliu1082.github.io/ANUBIS/.

[318] AI Sees Your Location, But With A Bias Toward The Wealthy World

Jingyuan Huang, Jen-tse Huang, Ziyi Liu, Xiaoyuan Liu, Wenxuan Wang, Jieyu Zhao

Main category: cs.CV

TL;DR: VLMs show geographic recognition capabilities but exhibit significant regional biases, performing better on developed/populated areas and over-predicting certain locations like Sydney for Australian images, raising privacy concerns.

DetailsMotivation: To systematically evaluate regional biases in Visual-Language Models' geographic recognition capabilities and address privacy concerns from accurate location identification.

Method: Created a benchmark with 1,200 images paired with geographic metadata and evaluated four VLMs on their ability to recognize geographic information from images.

Result: VLMs achieved up to 53.8% accuracy in city prediction but showed significant performance gaps: -12.5% for less developed regions and -17.0% for sparsely populated areas. Models consistently over-predicted certain locations (e.g., Sydney for Australian images).
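
The entropy diagnostic behind the over-prediction finding can be sketched directly: low entropy of the predicted-city distribution for one country means the model keeps guessing the same city. The example counts below are illustrative, not the paper's data.

```python
import numpy as np
from collections import Counter

def prediction_entropy(predicted_cities):
    counts = np.array(list(Counter(predicted_cities).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

# 18 of 20 guesses are "Sydney": entropy is ~0.57 bits, flagging the bias.
print(prediction_entropy(["Sydney"] * 18 + ["Melbourne", "Perth"]))
```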

Conclusion: VLMs exhibit substantial regional biases in geographic recognition, with better performance on developed areas and systematic over-prediction patterns, highlighting both fairness issues and privacy concerns for online image sharing.

Abstract: Visual-Language Models (VLMs) have shown remarkable performance across various tasks, particularly in recognizing geographic information from images. However, VLMs still show regional biases in this task. To systematically evaluate these issues, we introduce a benchmark consisting of 1,200 images paired with detailed geographic metadata. Evaluating four VLMs, we find that while these models demonstrate the ability to recognize geographic information from images, achieving up to 53.8% accuracy in city prediction, they exhibit significant biases. Specifically, performance is substantially higher for economically developed and densely populated regions compared to less developed (-12.5%) and sparsely populated (-17.0%) areas. Moreover, regional biases of frequently over-predicting certain locations remain. For instance, they consistently predict Sydney for images taken in Australia, shown by the low entropy scores for these countries. The strong performance of VLMs also raises privacy concerns, particularly for users who share images online without the intent of being identified. Our code and dataset are publicly available at https://github.com/uscnlp-lime/FairLocator.

[319] Can Machines Imitate Humans? Integrative Turing-like tests for Language and Vision Demonstrate a Narrowing Gap

Mengmi Zhang, Elisa Pavarino, Xiao Liu, Giorgia Dellaferrera, Ankur Sikarwar, Caishun Chen, Marcelo Armendariz, Noga Mudrik, Prachi Agrawal, Spandan Madan, Mranmay Shetty, Andrei Barbu, Haochen Yang, Tanishq Kumar, Shui’Er Han, Aman Raj Singh, Meghna Sadwani, Stella Dellaferrera, Michele Pizzochero, Brandon Tang, Yew Soon Ong, Hanspeter Pfister, Gabriel Kreiman

Main category: cs.CV

TL;DR: AI systems are becoming increasingly capable of convincingly impersonating humans across language and vision tasks, often deceiving human judges in Turing-like tests, while simple AI judges outperform humans at detecting AI responses.

DetailsMotivation: As AI becomes more integrated into daily life, determining whether an agent is human is critical for trust and safety. The study aims to systematically benchmark AI's ability to imitate humans across multiple domains.

Method: Researchers collected data from 636 humans and 37 AI agents across 6 tasks (3 language: image captioning, word association, conversation; 3 vision: color estimation, object detection, attention prediction). They conducted 72,191 Turing-like tests with 1,916 human judges and 10 AI judges to evaluate imitation capabilities.

Result: Current AIs are approaching convincing human impersonation and can deceive human judges in both language and vision tasks. Simple AI judges outperformed humans in distinguishing AI from human responses. Imitation ability showed minimal correlation with conventional AI performance metrics.

Conclusion: Passing as human is an important independent evaluation criterion for AI. The introduced large-scale Turing datasets and metrics provide valuable benchmarks for assessing human-likeness in AI, highlighting the need for rigorous quantitative imitation tests in AI development.

Abstract: As AI becomes increasingly embedded in daily life, ascertaining whether an agent is human is critical. We systematically benchmark AI’s ability to imitate humans in three language tasks (image captioning, word association, conversation) and three vision tasks (color estimation, object detection, attention prediction), collecting data from 636 humans and 37 AI agents. Next, we conducted 72,191 Turing-like tests with 1,916 human judges and 10 AI judges. Current AIs are approaching the ability to convincingly impersonate humans and deceive human judges in both language and vision. Even simple AI judges outperformed humans in distinguishing AI from human responses. Imitation ability showed minimal correlation with conventional AI performance metrics, suggesting that passing as human is an important independent evaluation criterion. The large-scale Turing datasets and metrics introduced here offer valuable benchmarks for assessing human-likeness in AI and highlight the importance of rigorous, quantitative imitation tests for AI development.

[320] VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models

Jen-tse Huang, Jiantong Qin, Jianping Zhang, Youliang Yuan, Wenxuan Wang, Jieyu Zhao

Main category: cs.CV

TL;DR: This paper investigates both explicit and implicit social biases in Vision-Language Models (VLMs) through structured evaluation methods for gender and racial biases.

DetailsMotivation: To systematically analyze and quantify both conscious (explicit) and subconscious (implicit) social biases in VLMs, as these models become increasingly integrated into real-world applications where biased outputs can have significant societal impacts.

Method: Uses two approaches: 1) Explicit bias analysis through direct questions (multiple-choice and yes-no comparisons about gender/racial differences), and 2) Implicit bias analysis through assistive tasks (image description and form completion where biases emerge indirectly). Evaluates Gemini-1.5, GPT-4V, GPT-4o, LLaMA-3.2-Vision and LLaVA-v1.6.

Result: The research provides a comprehensive framework and publicly available code/data for evaluating social biases in VLMs, though specific quantitative results are not detailed in the abstract.

Conclusion: The study establishes methods to detect both explicit and implicit biases in VLMs, highlighting the importance of addressing subconscious biases that may not be apparent through direct questioning, and provides tools for ongoing bias evaluation in multimodal AI systems.

Abstract: This research investigates both explicit and implicit social biases exhibited by Vision-Language Models (VLMs). The key distinction between these bias types lies in the level of awareness: explicit bias refers to conscious, intentional biases, while implicit bias operates subconsciously. To analyze explicit bias, we directly pose questions to VLMs related to gender and racial differences: (1) Multiple-choice questions based on a given image (e.g., “What is the education level of the person in the image?”) (2) Yes-No comparisons using two images (e.g., “Is the person in the first image more educated than the person in the second image?”) For implicit bias, we design tasks where VLMs assist users but reveal biases through their responses: (1) Image description tasks: Models are asked to describe individuals in images, and we analyze disparities in textual cues across demographic groups. (2) Form completion tasks: Models draft a personal information collection form with 20 attributes, and we examine correlations among selected attributes for potential biases. We evaluate Gemini-1.5, GPT-4V, GPT-4o, LLaMA-3.2-Vision and LLaVA-v1.6. Our code and data are publicly available at https://github.com/uscnlp-lime/VisBias.

[321] The GOOSE Dataset for Perception in Unstructured Environments

Peter Mortimer, Raphael Hagmanns, Miguel Granero, Thorsten Luettel, Janko Petereit, Hans-Joachim Wuensche

Main category: cs.CV

TL;DR: GOOSE dataset provides 10,000 labeled image-point cloud pairs for unstructured outdoor environments to address data scarcity in autonomous systems development.

DetailsMotivation: Limited data availability for training and testing deep learning models in unstructured outdoor environments hinders autonomous system development.

Method: Created comprehensive German Outdoor and Offroad Dataset (GOOSE) with labeled image-point cloud pairs, trained state-of-the-art segmentation models, and established dataset standards and ontology.

Result: Provides open-source dataset, pre-trained models, ontology, and framework for enhancing perception capabilities in unstructured environments.

Conclusion: GOOSE dataset establishes a common framework to facilitate seamless integration of existing datasets and accelerate perception improvement for robots in unstructured outdoor environments.

Abstract: The potential for deploying autonomous systems can be significantly increased by improving the perception and interpretation of the environment. However, the development of deep learning-based techniques for autonomous systems in unstructured outdoor environments poses challenges due to limited data availability for training and testing. To address this gap, we present the German Outdoor and Offroad Dataset (GOOSE), a comprehensive dataset specifically designed for unstructured outdoor environments. The GOOSE dataset incorporates 10,000 labeled pairs of images and point clouds, which are utilized to train a range of state-of-the-art segmentation models on both image and point cloud data. We open source the dataset, along with an ontology for unstructured terrain, as well as dataset standards and guidelines. This initiative aims to establish a common framework, enabling the seamless inclusion of existing datasets and a fast way to enhance the perception capabilities of various robots operating in unstructured environments. The dataset, pre-trained models for offroad perception, and additional documentation can be found at https://goose-dataset.de/.

[322] MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs

Erik Daxberger, Nina Wenzel, David Griffiths, Haiming Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, Peter Grasch

Main category: cs.CV

TL;DR: CA-VQA dataset and MM-Spatial MLLM enable superior 3D spatial understanding in indoor scenes, achieving SOTA performance on spatial reasoning tasks.

DetailsMotivation: Multimodal LLMs excel at 2D vision but lack 3D spatial reasoning capabilities, creating a need for specialized 3D understanding models.

Method: Created Cubify Anything VQA dataset with open-set 3D scene annotations, used for supervised fine-tuning of MM-Spatial model with metric depth and multi-view inputs.

Result: MM-Spatial achieves state-of-the-art performance on 3D spatial benchmarks, with depth perception comparable to dedicated monocular depth estimation models.

Conclusion: High-quality 3D scene data enables training generalist MLLMs with strong 3D spatial understanding, bridging the gap between 2D vision and 3D reasoning.

Abstract: Multimodal large language models (MLLMs) excel at 2D visual understanding but remain limited in their ability to reason about 3D space. In this work, we leverage large-scale high-quality 3D scene data with open-set annotations to introduce 1) a novel supervised fine-tuning dataset and 2) a new evaluation benchmark, focused on indoor scenes. Our Cubify Anything VQA (CA-VQA) data covers diverse spatial tasks including spatial relationship prediction, metric size and distance estimation, and 3D grounding. We show that CA-VQA enables us to train MM-Spatial, a strong generalist MLLM that also achieves state-of-the-art performance on 3D spatial understanding benchmarks, including our own. We show how incorporating metric depth and multi-view inputs (provided in CA-VQA) can further improve 3D understanding, and demonstrate that data alone allows our model to achieve depth perception capabilities comparable to dedicated monocular depth estimation models.

[323] EdgeSAM: Prompt-In-the-Loop Distillation for SAM

Chong Zhou, Xiangtai Li, Chen Change Loy, Bo Dai

Main category: cs.CV

TL;DR: EdgeSAM is an optimized version of Segment Anything Model that runs 37x faster on edge devices while maintaining performance through CNN-based architecture and improved distillation techniques.

DetailsMotivation: To enable efficient execution of SAM on edge devices with limited computational resources while minimizing performance degradation.

Method: Distilled ViT-based SAM encoder into CNN architecture, included prompt encoder and mask decoder in distillation with box/point prompts, added lightweight module to mitigate dataset bias.
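
In outline, prompt-in-the-loop distillation passes the same box/point prompts through teacher and student and matches the resulting mask predictions. A simplified sketch in which `encode`/`decode` are hypothetical handles, not the released API:

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, image, prompts):
    with torch.no_grad():
        t_mask = teacher.decode(teacher.encode(image), prompts)  # ViT teacher
    s_mask = student.decode(student.encode(image), prompts)      # CNN student
    return F.mse_loss(s_mask, t_mask)  # match mask logits for these prompts
```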

Result: 37x speed increase vs original SAM, 7x faster than MobileSAM/EfficientSAM, improved mIoUs on COCO (2.3/1.5) and LVIS (3.1/1.6), first SAM variant running at 30+ FPS on iPhone 14.

Conclusion: EdgeSAM successfully bridges the performance-efficiency gap for SAM deployment on edge devices through careful architectural design and distillation strategy.

Abstract: This paper presents EdgeSAM, an accelerated variant of the Segment Anything Model (SAM), optimized for efficient execution on edge devices with minimal compromise in performance. Our approach involves distilling the original ViT-based SAM image encoder into a purely CNN-based architecture, better suited for edge devices. We carefully benchmark various distillation strategies and demonstrate that task-agnostic encoder distillation fails to capture the full knowledge embodied in SAM. To overcome this bottleneck, we include both the prompt encoder and mask decoder in the distillation process, with box and point prompts in the loop, so that the distilled model can accurately capture the intricate dynamics between user input and mask generation. To mitigate dataset bias issues stemming from point prompt distillation, we incorporate a lightweight module within the encoder. As a result, EdgeSAM achieves a 37-fold speed increase compared to the original SAM, and it also outperforms MobileSAM/EfficientSAM, being over 7 times as fast when deployed on edge devices while enhancing the mIoUs on COCO and LVIS by 2.3/1.5 and 3.1/1.6, respectively. It is also the first SAM variant that can run at over 30 FPS on an iPhone 14. Code and demo are available at https://www.mmlab-ntu.com/project/edgesam.

[324] LD-SDM: Language-Driven Hierarchical Species Distribution Modeling

Srikumar Sastry, Xin Xing, Aayush Dhakal, Subash Khanal, Adeel Ahmad, Nathan Jacobs

Main category: cs.CV

TL;DR: Integrates taxonomic classification via LLM for species distribution modeling, enabling zero-shot prediction of unseen species with new proximity-aware evaluation metric that outperforms SOTA models.

DetailsMotivation: To improve species distribution modeling by incorporating taxonomic classification information, enabling prediction of ranges for unseen species without additional supervision, while addressing limitations of traditional evaluation metrics.

Method: Uses large language model to extract latent representations from taxonomic classification text prompts, combines geographical and environmental features, and introduces proximity-aware evaluation metric for species distribution models.
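
A conceptual sketch of conditioning a species distribution model on an LLM-derived taxonomic embedding; encoders, dimensions, and the dot-product scoring are placeholders of ours, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TaxonConditionedSDM(nn.Module):
    """Scores presence at a location given an LLM embedding of the taxon."""
    def __init__(self, text_dim=768, loc_dim=64, hidden=256):
        super().__init__()
        self.loc_enc = nn.Sequential(nn.Linear(loc_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))
        self.txt_proj = nn.Linear(text_dim, hidden)

    def forward(self, loc_feats, taxon_embedding):
        # dot product of location and taxon representations = presence logit
        return (self.loc_enc(loc_feats) * self.txt_proj(taxon_embedding)).sum(-1)
```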

Result: Outperforms state-of-the-art models in species range prediction, zero-shot prediction, and geo-feature regression tasks.

Conclusion: Taxonomic classification integration via LLM enables effective zero-shot species distribution modeling with improved evaluation through proximity-aware metrics, demonstrating superior performance over existing approaches.

Abstract: We focus on species distribution modeling using global-scale presence-only data, leveraging geographical and environmental features to map species ranges, as in previous studies. However, we innovate by integrating taxonomic classification into our approach. Specifically, we propose using a large language model to extract a latent representation of the taxonomic classification from a textual prompt. This allows us to map the range of any taxonomic rank, including unseen species, without additional supervision. We also present a new proximity-aware evaluation metric, suitable for evaluating species distribution models, which addresses critical shortcomings of traditional metrics. We evaluated our model for species range prediction, zero-shot prediction, and geo-feature regression and found that it outperforms several state-of-the-art models.

[325] Osprey: Pixel Understanding with Visual Instruction Tuning

Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, Jianke Zhu

Main category: cs.CV

TL;DR: Osprey introduces mask-text instruction tuning to extend multimodal LLMs for pixel-level visual understanding, using a curated 724K dataset and mask-aware feature extraction.

DetailsMotivation: Current MLLMs lack fine-grained pixel-level vision-language alignment and mask-based instruction data, limiting their ability for detailed visual understanding.

Method: Proposes mask-text instruction tuning with a convolutional CLIP backbone and mask-aware visual extractor to process high-resolution inputs and extract precise mask features.
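
The mask-aware extraction step can be approximated as masked average pooling over the vision-encoder feature map; this is a simplification of ours, not Osprey's exact extractor.

```python
import torch

def mask_pool(feat_map, masks):
    """feat_map: (C, H, W) vision-encoder features; masks: (N, H, W) binary.
    Returns one (N, C) region feature per mask via masked average pooling."""
    feats = feat_map.flatten(1)                     # (C, H*W)
    m = masks.flatten(1).float()                    # (N, H*W)
    return (m @ feats.T) / m.sum(1, keepdim=True).clamp(min=1)
```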

Result: Osprey demonstrates superiority in region understanding tasks and enables seamless integration with Segment Anything Model for multi-granularity semantics.

Conclusion: Osprey successfully achieves pixel-level visual instruction tuning, advancing MLLMs beyond image/box-level understanding to fine-grained mask-based comprehension.

Abstract: Multimodal large language models (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However, current MLLMs primarily focus on image-level or box-level understanding, falling short in achieving fine-grained vision-language alignment at pixel level. Besides, the lack of mask-based instruction data limits their advancements. In this paper, we propose Osprey, a mask-text instruction tuning approach, to extend MLLMs by incorporating fine-grained mask regions into language instruction, aiming at achieving pixel-wise visual understanding. To achieve this goal, we first meticulously curate a mask-based region-text dataset with 724K samples, and then design a vision-language model by injecting pixel-level representation into LLM. Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high resolution input. Experimental results demonstrate Osprey’s superiority in various region understanding tasks, showcasing its new capability for pixel-level instruction tuning. In particular, Osprey can be integrated with Segment Anything Model (SAM) seamlessly to obtain multi-granularity semantics. The source code, dataset and demo can be found at https://github.com/CircleRadon/Osprey.

[326] Regeneration Based Training-free Attribution of Fake Images Generated by Text-to-Image Generative Models

Meiling Li, Zhenxing Qian, Xinpeng Zhang

Main category: cs.CV

TL;DR: A training-free method to attribute fake images from text-to-image models by inverting prompts and comparing regenerated images to identify source models.

DetailsMotivation: Address concerns about misuse of AI-generated fake images by enabling attribution to source models, holding model owners accountable.

Method: Invert textual prompt from test image, regenerate candidate images using different models, calculate similarity scores to rank and identify source model.
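
The attribution loop itself is simple enough to sketch end to end; `invert_prompt`, the candidate pool, and the similarity metric are placeholders for whatever inversion model and image metric one plugs in.

```python
def attribute(test_image, invert_prompt, candidate_models, similarity):
    """Return the candidate model whose regeneration best matches the image."""
    prompt = invert_prompt(test_image)        # reconstruct the textual prompt
    scores = {name: similarity(test_image, model(prompt))
              for name, model in candidate_models.items()}
    return max(scores, key=scores.get), scores
```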

Result: Achieves comparable performance to SOTA methods, high scalability for real-world scenarios, and robustness against common attacks like blurring and compression.

Conclusion: Effective solution for AI-generated image source attribution that can prevent misuse of text-to-image models and works as plug-in to improve existing methods.

Abstract: Text-to-image generative models have recently garnered significant attention due to their ability to generate images based on prompt descriptions. While these models have shown promising performance, concerns have been raised regarding the potential misuse of the generated fake images. In response to this, we have presented a simple yet effective training-free method to attribute fake images generated by text-to-image models to their source models. Given a test image to be attributed, we first invert the textual prompt of the image, and then put the reconstructed prompt into different candidate models to regenerate candidate fake images. By calculating and ranking the similarity of the test image and the candidate images, we can determine the source of the image. This attribution allows model owners to be held accountable for any misuse of their models. Note that our approach does not limit the number of candidate text-to-image generative models. Comprehensive experiments reveal that (1) Our method can effectively attribute fake images to their source models, achieving comparable attribution performance with the state-of-the-art method; (2) Our method has high scalability and is well adapted to real-world attribution scenarios. (3) The proposed method yields satisfactory robustness to common attacks, such as Gaussian blurring, JPEG compression, and Resizing. We also analyze the factors that influence the attribution performance, and explore the boost brought by the proposed method as a plug-in to improve the performance of existing SOTA. We hope our work can shed some light on the solutions to addressing the source of AI-generated images, as well as to prevent the misuse of text-to-image generative models.

[327] Slice-100K: A Multimodal Dataset for Extrusion-based 3D Printing

Anushrut Jignasu, Kelly O. Marshall, Ankush Kumar Mishra, Lucas Nerone Rillo, Baskar Ganapathysubramanian, Aditya Balu, Chinmay Hegde, Adarsh Krishnamurthy

Main category: cs.CV

TL;DR: Slice-100K is a novel dataset of over 100,000 G-code files with corresponding CAD models, created to address the lack of large curated repositories for additive manufacturing research.

DetailsMotivation: There is currently no large repository of curated CAD models with corresponding G-code files for additive manufacturing, which limits research and development in digital manufacturing.

Method: Built the dataset from triangulated meshes derived from Objaverse-XL and Thingi10K datasets, including G-code files, tessellated CAD models, LVIS categories, geometric properties, and renderings.

Result: Created a comprehensive dataset of over 100,000 G-code files and demonstrated its utility by finetuning GPT-2 for G-code translation between legacy (Sailfish) and modern (Marlin) formats.
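
A minimal Hugging Face sketch of the GPT-2 finetuning setup; the paired-sample format shown ("SAILFISH: ... MARLIN: ...") is our own illustrative convention, not the paper's, and the G-code line used is a generic move valid in both flavors.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# One paired training sample; a "G1 X.. Y.. E.." move parses in both flavors.
pair = "SAILFISH: G1 X10.0 Y5.0 E2.4\nMARLIN: G1 X10.0 Y5.0 E2.4"
ids = tok(pair, return_tensors="pt").input_ids
loss = model(ids, labels=ids).loss   # causal-LM loss over the paired sample
loss.backward()
```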

Conclusion: Slice-100K serves as the foundational step towards developing multimodal foundation models for digital manufacturing and will enable advanced research in additive manufacturing.

Abstract: G-code (Geometric code) or RS-274 is the most widely used computer numerical control (CNC) and 3D printing programming language. G-code provides machine instructions for the movement of the 3D printer, especially for the nozzle, stage, and extrusion of material for extrusion-based additive manufacturing. Currently, there does not exist a large repository of curated CAD models along with their corresponding G-code files for additive manufacturing. To address this issue, we present Slice-100K, a first-of-its-kind dataset of over 100,000 G-code files, along with their tessellated CAD model, LVIS (Large Vocabulary Instance Segmentation) categories, geometric properties, and renderings. We build our dataset from triangulated meshes derived from Objaverse-XL and Thingi10K datasets. We demonstrate the utility of this dataset by finetuning GPT-2 on a subset of the dataset for G-code translation from a legacy G-code format (Sailfish) to a more modern, widely used format (Marlin). Our dataset can be found at https://github.com/idealab-isu/Slice-100K. Slice-100K will be the first step in developing a multimodal foundation model for digital manufacturing.

[328] ShapeSplat: A Large-scale Dataset of Gaussian Splats and Their Self-Supervised Pretraining

Qi Ma, Yue Li, Bin Ren, Nicu Sebe, Ender Konukoglu, Theo Gevers, Luc Van Gool, Danda Pani Paudel

Main category: cs.CV

TL;DR: The paper introduces ShapeSplat, a large-scale 3DGS dataset with 206K objects across 87 categories, and Gaussian-MAE for representation learning from Gaussian parameters, showing improved segmentation but degraded classification when using centroids alone.

DetailsMotivation: To enable 3D understanding directly in 3D Gaussian Splatting representation space by creating a large-scale dataset and developing methods for unsupervised pretraining and supervised finetuning.

Method: Built ShapeSplat dataset using ShapeNet, ModelNet and Objaverse (206K objects, 87 categories). Introduced Gaussian-MAE with Gaussian feature grouping in normalized feature space and splats pooling layer to effectively group similar Gaussians.

Result: Distribution of optimized GS centroids differs significantly from uniformly sampled point clouds; this change degrades classification but improves segmentation; Gaussian feature grouping with splats pooling leads to notable improvement in finetuning tasks.

Conclusion: The proposed Gaussian-MAE framework and ShapeSplat dataset provide valuable insights for 3D representation learning in Gaussian splatting space, demonstrating the importance of leveraging additional Gaussian parameters beyond just centroids for improved performance.

Abstract: 3D Gaussian Splatting (3DGS) has become the de facto method of 3D representation in many vision tasks. This calls for the 3D understanding directly in this representation space. To facilitate the research in this direction, we first build ShapeSplat, a large-scale dataset of 3DGS using the commonly used ShapeNet, ModelNet and Objaverse datasets. Our dataset ShapeSplat consists of 206K objects spanning over 87 unique categories, whose labels are in accordance with the respective datasets. The creation of this dataset utilized the compute equivalent of 3.8 GPU years on a TITAN XP GPU. We utilize our dataset for unsupervised pretraining and supervised finetuning for classification and segmentation tasks. To this end, we introduce Gaussian-MAE, which highlights the unique benefits of representation learning from Gaussian parameters. Through exhaustive experiments, we provide several valuable insights. In particular, we show that (1) the distribution of the optimized GS centroids significantly differs from the uniformly sampled point cloud (used for initialization) counterpart; (2) this change in distribution results in degradation in classification but improvement in segmentation tasks when using only the centroids; (3) to leverage additional Gaussian parameters, we propose Gaussian feature grouping in a normalized feature space, along with splats pooling layer, offering a tailored solution to effectively group and embed similar Gaussians, which leads to notable improvement in finetuning tasks.

[329] Evidential Transformers for Improved Image Retrieval

Danilo Dordevic, Suryansh Kumar

Main category: cs.CV

TL;DR: Evidential Transformer combines probabilistic methods with transformer architecture for robust image retrieval, achieving state-of-the-art results on SOP and CUB datasets.

DetailsMotivation: To improve content-based image retrieval by incorporating uncertainty-driven probabilistic methods for more robust and reliable results compared to traditional multiclass classification approaches.

Method: Incorporates evidential classification into deep metric learning, leveraging the Global Context Vision Transformer (GC ViT) architecture to enhance image retrieval performance.

Result: Achieves new state-of-the-art retrieval results on Stanford Online Products (SOP) and CUB-200-2011 datasets, demonstrating consistent reliability across all test settings.

Conclusion: The Evidential Transformer sets a new benchmark in CBIR by successfully integrating probabilistic uncertainty methods with transformer architecture, providing robust and superior performance.

Abstract: We introduce the Evidential Transformer, an uncertainty-driven transformer model for improved and robust image retrieval. In this paper, we make several contributions to content-based image retrieval (CBIR). We incorporate probabilistic methods into image retrieval, achieving robust and reliable results, with evidential classification surpassing traditional training based on multiclass classification as a baseline for deep metric learning. Furthermore, we improve the state-of-the-art retrieval results on several datasets by leveraging the Global Context Vision Transformer (GC ViT) architecture. Our experimental results consistently demonstrate the reliability of our approach, setting a new benchmark in CBIR in all test settings on the Stanford Online Products (SOP) and CUB-200-2011 datasets.
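
The abstract does not spell out its evidential formulation; a common instantiation (Dirichlet evidence in the style of Sensoy et al., 2018) looks like the hedged sketch below, where a softplus turns logits into non-negative evidence and the Dirichlet strength yields a per-sample uncertainty score. Shapes and the head itself are assumptions.

```python
# Minimal evidential classification head (assumed Dirichlet formulation).
import torch
import torch.nn.functional as F

def evidential_outputs(logits):
    evidence = F.softplus(logits)           # non-negative evidence per class
    alpha = evidence + 1.0                  # Dirichlet concentration
    strength = alpha.sum(-1, keepdim=True)  # total Dirichlet strength
    prob = alpha / strength                 # expected class probabilities
    K = logits.shape[-1]
    uncertainty = K / strength              # total uncertainty in (0, 1]
    return prob, uncertainty

logits = torch.randn(4, 10)                 # batch of 4, 10 classes
prob, u = evidential_outputs(logits)        # u flags unreliable retrievals
```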

[330] ESVO2: Direct Visual-Inertial Odometry with Stereo Event Cameras

Junkai Niu, Sheng Zhong, Xiuyuan Lu, Shaojie Shen, Guillermo Gallego, Yi Zhou

Main category: cs.CV

TL;DR: Event-based stereo visual-inertial odometry system that combines efficient contour sampling, temporal-static stereo fusion, and IMU integration to overcome computational complexity and pose tracking degeneracy in event-based SLAM.

DetailsMotivation: Event-based visual odometry faces challenges with explicit data association under large view changes, high computational complexity in mapping, and degeneracy in camera pose tracking for certain rotational degrees of freedom.

Method: Proposes an efficient contour point sampling strategy based on event dynamics, merges temporal and static stereo results for better mapping, integrates IMU measurements via pre-integration for motion priors, and develops a compact back-end for IMU bias updates and velocity prediction.

Result: System scales well with high-resolution event cameras and achieves better global positioning accuracy in large-scale outdoor environments, outperforming five state-of-the-art methods across five public datasets.

Conclusion: The proposed event-based stereo visual-inertial odometry system effectively addresses computational and tracking limitations through efficient sampling, stereo fusion, and IMU integration, demonstrating superior performance in various scenarios.

Abstract: Event-based visual odometry is a specific branch of visual Simultaneous Localization and Mapping (SLAM) techniques, which aims at solving tracking and mapping subproblems (typically in parallel), by exploiting the special working principles of neuromorphic (i.e., event-based) cameras. Due to the motion-dependent nature of event data, explicit data association (i.e., feature matching) under large-baseline view-point changes is difficult to establish, making direct methods a more rational choice. However, state-of-the-art direct methods are limited by the high computational complexity of the mapping sub-problem and the degeneracy of camera pose tracking in certain degrees of freedom (DoF) in rotation. In this paper, we tackle these issues by building an event-based stereo visual-inertial odometry system on top of a direct pipeline. Specifically, to speed up the mapping operation, we propose an efficient strategy for sampling contour points according to the local dynamics of events. The mapping performance is also improved in terms of structure completeness and local smoothness by merging the temporal stereo and static stereo results. To circumvent the degeneracy of camera pose tracking in recovering the pitch and yaw components of general 6-DoF motion, we introduce IMU measurements as motion priors via pre-integration. To this end, a compact back-end is proposed for continuously updating the IMU bias and predicting the linear velocity, enabling an accurate motion prediction for camera pose tracking. The resulting system scales well with modern high-resolution event cameras and leads to better global positioning accuracy in large-scale outdoor environments. Extensive evaluations on five publicly available datasets featuring different resolutions and scenarios justify the superior performance of the proposed system against five state-of-the-art methods.
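
To make the pre-integration idea concrete, here is a toy sketch of integrating bias-corrected gyroscope and accelerometer samples between two camera timestamps. Real pipelines additionally track bias Jacobians and covariances; the simplified form below is an assumption, not ESVO2's implementation.

```python
# Toy IMU pre-integration between two frames (gravity handled outside).
import numpy as np
from scipy.spatial.transform import Rotation as R

def preintegrate(gyro, accel, dt):
    """gyro, accel: (N, 3) bias-corrected samples; dt: sample period [s]."""
    dR = np.eye(3)      # rotation increment
    dv = np.zeros(3)    # velocity increment
    dp = np.zeros(3)    # position increment
    for w, a in zip(gyro, accel):
        dp += dv * dt + 0.5 * dR @ a * dt**2
        dv += dR @ a * dt
        dR = dR @ R.from_rotvec(w * dt).as_matrix()
    return dR, dv, dp   # motion prior for the pose tracker
```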

[331] Generative World Explorer

Taiming Lu, Tianmin Shu, Alan Yuille, Daniel Khashabi, Jieneng Chen

Main category: cs.CV

TL;DR: Genex is a framework that enables AI agents to mentally explore 3D environments through imagination rather than physical exploration, updating beliefs with generated observations to improve decision-making.

DetailsMotivation: Humans can imagine unseen parts of the world through mental exploration and revise beliefs without physical exploration, while most AI agents require physical exploration to update their world state beliefs.

Method: Developed Generative World Explorer (Genex) framework for egocentric world exploration, trained on synthetic urban scene dataset (Genex-DB) to generate imagined observations in large-scale 3D environments.

Result: Genex generates high-quality, consistent observations during long-horizon exploration and enables better decision-making when beliefs are updated with these generated observations.

Conclusion: The framework successfully demonstrates mental exploration capabilities similar to humans, allowing AI agents to make more informed decisions without requiring physical exploration at all times.

Abstract: Planning with partial observation is a central challenge in embodied AI. A majority of prior works have tackled this challenge by developing agents that physically explore their environment to update their beliefs about the world state. In contrast, humans can imagine unseen parts of the world through a mental exploration and revise their beliefs with imagined observations. Such updated beliefs can allow them to make more informed decisions, without necessitating the physical exploration of the world at all times. To achieve this human-like ability, we introduce the Generative World Explorer (Genex), an egocentric world exploration framework that allows an agent to mentally explore a large-scale 3D world (e.g., urban scenes) and acquire imagined observations to update its belief. This updated belief will then help the agent to make a more informed decision at the current step. To train Genex, we create a synthetic urban scene dataset, Genex-DB. Our experimental results demonstrate that (1) Genex can generate high-quality and consistent observations during long-horizon exploration of a large virtual physical world and (2) the beliefs updated with the generated observations can inform an existing decision-making model (e.g., an LLM agent) to make better plans.

[332] Diagram-Driven Course Questions Generation

Xinyu Zhang, Lingling Zhang, Yanrui Wu, Muye Huang, Wenjun Wu, Bo Li, Shaowei Wang, Basura Fernando, Jun Liu

Main category: cs.CV

TL;DR: Proposes Diagram-Driven Course Questions Generation (DDCQG) task and Hierarchical Knowledge Integration framework (HKI-DDCQG) for generating educational questions from diagrams, outperforming existing models on the new DiagramQG dataset.

DetailsMotivation: Current VQG research focuses on natural images but neglects educational diagrams, creating a gap for pedagogical assessment needs. Diagrams are critical components in educational materials that require specialized question generation approaches.

Method: Hierarchical Knowledge Integration framework (HKI-DDCQG) uses trainable CLIP for diagram patch identification, frozen vision-language models for knowledge extraction, and trainable T5 for question generation. Incorporates course and input text constraints.

Result: HKI-DDCQG outperforms existing models on the DiagramQG dataset (15,720 diagrams, 25,798 questions across 37 subjects) while maintaining strong generalizability on natural image datasets.

Conclusion: Establishes a strong baseline for DDCQG task, addressing challenges of domain-specific knowledge, long-tail course distribution, and high diagram information density. The framework demonstrates effective knowledge integration for educational question generation.

Abstract: Visual Question Generation (VQG) research focuses predominantly on natural images while neglecting the diagram, which is a critical component in educational materials. To meet the needs of pedagogical assessment, we propose the Diagram-Driven Course Questions Generation (DDCQG) task and construct DiagramQG, a comprehensive dataset with 15,720 diagrams and 25,798 questions across 37 subjects and 371 courses. Our approach employs course and input text constraints to generate course-relevant questions about specific diagram elements. We reveal three challenges of DDCQG: domain-specific knowledge requirements across courses, long-tail distribution in course coverage, and high information density in diagrams. To address these, we propose the Hierarchical Knowledge Integration framework (HKI-DDCQG), which utilizes trainable CLIP for identifying relevant diagram patches, leverages frozen vision-language models for knowledge extraction, and generates questions with trainable T5. Experiments demonstrate that HKI-DDCQG outperforms existing models on DiagramQG while maintaining strong generalizability across natural image datasets, establishing a strong baseline for DDCQG.

[333] TASR: Timestep-Aware Diffusion Model for Image Super-Resolution

Qinwei Lin, Xiaopeng Sun, Yu Gao, Yujie Zhong, Dengjie Li, Zheng Zhao, Haoqian Wang

Main category: cs.CV

TL;DR: A novel timestep-aware diffusion model for image super-resolution that adaptively integrates ControlNet and Stable Diffusion features at different denoising stages to balance fidelity and detail generation.

DetailsMotivation: Current diffusion-based super-resolution methods use ControlNet to inject low-resolution images, but the temporal dynamics of information infusion are not well understood or optimized.

Method: Proposes a timestep-aware diffusion model that adaptively integrates features from ControlNet and Stable Diffusion, with early stages focusing on LR information transmission for fidelity and later stages stimulating SD’s generation ability for detail enhancement. Uses a timestep-aware training strategy with distinct losses at different timesteps.

Result: Experiments on benchmark datasets demonstrate the effectiveness of the proposed method in achieving better image super-resolution results.

Conclusion: The timestep-aware approach effectively balances image fidelity and detail generation by strategically controlling information flow at different stages of the denoising process, outperforming existing methods.

Abstract: Diffusion models have recently achieved outstanding results in the field of image super-resolution. These methods typically inject low-resolution (LR) images via ControlNet. In this paper, we first explore the temporal dynamics of information infusion through ControlNet, revealing that the input from LR images predominantly influences the initial stages of the denoising process. Leveraging this insight, we introduce a novel timestep-aware diffusion model that adaptively integrates features from both ControlNet and the pre-trained Stable Diffusion (SD). Our method enhances the transmission of LR information in the early stages of diffusion to guarantee image fidelity and stimulates the generation ability of the SD model itself more in the later stages to enhance the detail of generated images. To train this method, we propose a timestep-aware training strategy that adopts distinct losses at varying timesteps and acts on disparate modules. Experiments on benchmark datasets demonstrate the effectiveness of our method. Code: https://github.com/SleepyLin/TASR
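
TASR's actual integration is learned and timestep-aware; the toy schedule below only conveys the qualitative behavior the abstract describes, with ControlNet (LR) features dominating high-noise early steps and SD features dominating late steps. The linear weight and tensor shapes are assumptions.

```python
# Illustrative timestep-dependent fusion of ControlNet and SD features.
import torch

def fuse(control_feat, sd_feat, t, T=1000):
    w = t / T                       # t counts down from T to 0 during sampling
    return w * control_feat + (1.0 - w) * sd_feat

c = torch.randn(1, 320, 64, 64)     # ControlNet features (LR guidance)
s = torch.randn(1, 320, 64, 64)     # Stable Diffusion features
early = fuse(c, s, t=900)           # mostly ControlNet -> fidelity
late = fuse(c, s, t=100)            # mostly SD -> detail generation
```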

[334] NF3DM: Combining Neural Fields and Deformation Models for 3D Non-Rigid Motion Reconstruction

Aymen Merrouche, Stefanie Wuhrer, Edmond Boyer

Main category: cs.CV

TL;DR: Novel data-driven approach for reconstructing temporally coherent 3D motion from unstructured partial observations of non-rigid deformations, combining implicit shape representations with explicit mesh-based deformation models.

DetailsMotivation: To achieve high-fidelity motion reconstructions for near-isometric deformations (like humans in loose clothing) without relying on parametric shape models or decoupling shape and motion.

Method: Combines implicit neural field representations with explicit mesh-based deformation models, fusing observations over time in feature space and enforcing temporal coherence through near-isometric deformation constraints between adjacent frames.

Result: Outperforms state-of-the-art approaches in human and animal motion sequences reconstructed from monocular depth videos.

Conclusion: The method successfully achieves detailed, temporally coherent motion reconstructions by integrating implicit and explicit representations with temporal constraints, preserving geometric details from input data.

Abstract: We introduce a novel, data-driven approach for reconstructing temporally coherent 3D motion from unstructured and potentially partial observations of non-rigidly deforming shapes. Our goal is to achieve high-fidelity motion reconstructions for shapes that undergo near-isometric deformations, such as humans wearing loose clothing. The key novelty of our work lies in its ability to combine implicit shape representations with explicit mesh-based deformation models, enabling detailed and temporally coherent motion reconstructions without relying on parametric shape models or decoupling shape and motion. Each frame is represented as a neural field decoded from a feature space where observations over time are fused, hence preserving geometric details present in the input data. Temporal coherence is enforced with a near-isometric deformation constraint between adjacent frames that applies to the underlying surface in the neural field. Our method outperforms state-of-the-art approaches, as demonstrated by its application to human and animal motion sequences reconstructed from monocular depth videos.
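
One simple way to read the near-isometric constraint between adjacent frames is as a penalty on edge-length changes of the underlying surface. The sketch below implements that assumed, simplified reading; the paper's actual constraint acts on the surface in the neural field.

```python
# Assumed near-isometric coherence loss: preserve edge lengths across frames.
import torch

def isometry_loss(verts_t, verts_t1, edges):
    """verts_*: (V, 3) vertices of adjacent frames; edges: (E, 2) long tensor."""
    e_t = verts_t[edges[:, 0]] - verts_t[edges[:, 1]]
    e_t1 = verts_t1[edges[:, 0]] - verts_t1[edges[:, 1]]
    return ((e_t.norm(dim=-1) - e_t1.norm(dim=-1)) ** 2).mean()

v0, v1 = torch.randn(100, 3), torch.randn(100, 3)
edges = torch.randint(0, 100, (300, 2))
loss = isometry_loss(v0, v1, edges)
```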

[335] Llama Learns to Direct: DirectorLLM for Human-Centric Video Generation

Kunpeng Song, Tingbo Hou, Zecheng He, Haoyu Ma, Jialiang Wang, Animesh Sinha, Sam Tsai, Yaqiao Luo, Xiaoliang Dai, Li Chen, Xide Xia, Peizhao Zhang, Peter Vajda, Ahmed Elgammal, Felix Juefei-Xu

Main category: cs.CV

TL;DR: DirectorLLM uses LLMs as video directors to generate human pose instructions for more realistic video generation, outperforming existing methods in motion fidelity and prompt faithfulness.

DetailsMotivation: As text-to-video models advance, there's growing demand for high-quality human motion and interaction. Current models need better authenticity in human motions, requiring a solution to enhance motion realism in generated videos.

Method: Extends LLMs from text generators to video directors by training DirectorLLM on Llama 3 resources to generate detailed human pose instructions. These signals guide video renderers (UNet/DiT) while offloading motion simulation from the video generator.

Result: Experiments show DirectorLLM outperforms existing models in generating videos with higher human motion fidelity, improved prompt faithfulness, and enhanced subject naturalness in both automatic benchmarks and human evaluations.

Conclusion: DirectorLLM successfully transforms LLMs into effective video directors for human motion simulation, providing a flexible module that can work with different video renderers to produce more realistic human-centric videos.

Abstract: In this paper, we introduce DirectorLLM, a novel video generation model that employs a large language model (LLM) to orchestrate human poses within videos. As foundational text-to-video models rapidly evolve, the demand for high-quality human motion and interaction grows. To address this need and enhance the authenticity of human motions, we extend the LLM from a text generator to a video director and human motion simulator. Utilizing open-source resources from Llama 3, we train the DirectorLLM to generate detailed instructional signals, such as human poses, to guide video generation. This approach offloads the simulation of human motion from the video generator to the LLM, effectively creating informative outlines for human-centric scenes. These signals are used as conditions by the video renderer, facilitating more realistic and prompt-following video generation. As an independent LLM module, it can be applied to different video renderers, including UNet and DiT, with minimal effort. Experiments on automatic evaluation benchmarks and human evaluations show that our model outperforms existing ones in generating videos with higher human motion fidelity, improved prompt faithfulness, and enhanced rendered subject naturalness.

[336] Motion-enhanced Cardiac Anatomy Segmentation via an Insertable Temporal Attention Module

Md. Kamrul Hasan, Guang Yang, Choon Hwai Yap

Main category: cs.CV

TL;DR: A lightweight Temporal Attention Module (TAM) that enhances cardiac segmentation with motion information, offering plug-and-play integration into various network architectures while maintaining computational efficiency.

DetailsMotivation: Existing motion enhancement techniques for cardiac segmentation are suboptimal - they have high computational costs, reduced robustness from non-DL motion registration or single-headed attention, and limited adaptability for integration into existing networks.

Method: Proposed a novel multi-headed cross-temporal attention module (TAM) that is lightweight and plug-and-play, can be inserted into CNN-based, Transformer-based, or hybrid segmentation networks without major backbone changes.

Result: Extensive experiments on multiple 2D and 3D cardiac ultrasound and MRI datasets show TAM consistently improves segmentation across various networks while maintaining computational efficiency and outperforming current reported performance.

Conclusion: TAM is a robust, generalizable solution for motion-awareness enhancement that is scalable (2D to 3D) and offers easy integration for enhancing both existing and future cardiac segmentation networks.

Abstract: Cardiac anatomy segmentation is useful for clinical assessment of cardiac morphology to inform diagnosis and intervention. Deep learning (DL), especially with motion information, has improved segmentation accuracy. However, existing techniques for motion enhancement are not yet optimal, and they have high computational costs due to increased dimensionality or reduced robustness due to suboptimal approaches that use non-DL motion registration, non-attention models, or single-headed attention. They further have limited adaptability and are inconvenient for incorporation into existing networks where motion awareness is desired. Here, we propose a novel, computationally efficient Temporal Attention Module (TAM) that offers robust motion enhancement, modeled as a small, multi-headed, cross-temporal attention module. TAM’s uniqueness is that it is a lightweight, plug-and-play module that can be inserted into a broad range of segmentation networks (CNN-based, Transformer-based, or hybrid) for motion enhancement without requiring substantial changes in the network’s backbone. This feature enables high adaptability and ease of integration for enhancing both existing and future networks. Extensive experiments on multiple 2D and 3D cardiac ultrasound and MRI datasets confirm that TAM consistently improves segmentation across a range of networks while maintaining computational efficiency and improving on currently reported performance. The evidence demonstrates that it is a robust, generalizable solution for motion-awareness enhancement that is scalable (such as from 2D to 3D).
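
A minimal plug-and-play block in the spirit of TAM might apply multi-headed attention across the time axis of per-frame feature maps, as sketched below; the channel count, head count, and residual wiring are assumptions rather than the paper's exact design.

```python
# Minimal cross-temporal attention block: attend over frames per pixel.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                      # x: (B, T, C, H, W)
        B, T, C, H, W = x.shape
        seq = x.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
        out, _ = self.attn(seq, seq, seq)      # attend across the T frames
        seq = self.norm(seq + out)             # residual + norm
        return seq.reshape(B, H, W, T, C).permute(0, 3, 4, 1, 2)

x = torch.randn(2, 8, 32, 16, 16)              # 8-frame feature clip
y = TemporalAttention(32)(x)                   # same shape, motion-aware
```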

[337] Separate Motion from Appearance: Customizing Motion via Customizing Text-to-Video Diffusion Models

Huijie Liu, Jingyun Wang, Shuai Ma, Jie Hu, Xiaoming Wei, Guoliang Kang

Main category: cs.CV

TL;DR: Motion customization method using temporal attention purification and appearance highway to separate motion from appearance in video generation, enabling better text-aligned appearance while maintaining reference motion consistency.

DetailsMotivation: To adapt diffusion models for generating videos with specific motion concepts without compromising appearance diversity, addressing the challenge of separating motion from appearance in the adaptation process.

Method: Learn motion LoRA with two novel strategies: Temporal Attention Purification (TAP) that reshapes temporal attention using motion LoRAs to reorganize Value embeddings, and Appearance Highway (AH) that alters skip connections from temporal to spatial attention outputs.

Result: Extensive experiments show the method generates videos with appearance more aligned with text descriptions and motion more consistent with reference videos compared to previous works.

Conclusion: The proposed TAP and AH strategies effectively enhance motion-appearance separation in motion customization, achieving superior performance in maintaining both text-aligned appearance and reference motion consistency.

Abstract: Motion customization aims to adapt the diffusion model (DM) to generate videos with the motion specified by a set of video clips with the same motion concept. To realize this goal, the adaptation of DM should be possible to model the specified motion concept, without compromising the ability to generate diverse appearances. Thus, the key to solving this problem lies in how to separate the motion concept from the appearance in the adaptation process of DM. Typical previous works explore different ways to represent and insert a motion concept into large-scale pretrained text-to-video diffusion models, e.g., learning a motion LoRA, using latent noise residuals, etc. While those methods can encode the motion concept, they also inevitably encode the appearance in the reference videos, resulting in weakened appearance generation capability. In this paper, we follow the typical way to learn a motion LoRA to encode the motion concept, but propose two novel strategies to enhance motion-appearance separation, including temporal attention purification (TAP) and appearance highway (AH). Specifically, we assume that in the temporal attention module, the pretrained Value embeddings are sufficient to serve as basic components needed by producing a new motion. Thus, in TAP, we choose only to reshape the temporal attention with motion LoRAs so that Value embeddings can be reorganized to produce a new motion. Further, in AH, we alter the starting point of each skip connection in U-Net from the output of each temporal attention module to the output of each spatial attention module. Extensive experiments demonstrate that compared to previous works, our method can generate videos with appearance more aligned with the text descriptions and motion more consistent with the reference videos.

[338] FAAGC: Feature Augmentation on Adaptive Geodesic Curve Based on the shape space theory

Yuexing Han, Ruijie Li

Main category: cs.CV

TL;DR: Proposes FAAGC method for feature augmentation in pre-shape space to address data scarcity by sampling along geodesic curves to generate additional training data.

DetailsMotivation: Many fields face challenges due to limited and insufficient data for deep learning models, requiring effective data augmentation techniques.

Method: Projects deep model representations into pre-shape space, constructs geodesic curves (arcs of great circles) for each class, and performs feature augmentation by sampling along these geodesic paths.

Result: Extensive experiments show FAAGC improves classification accuracy under data-scarce conditions and generalizes well across various feature types.

Conclusion: FAAGC provides an effective feature augmentation approach that leverages geometric properties of pre-shape space to address data scarcity in deep learning applications.

Abstract: Deep learning models have been widely applied across various domains and industries. However, many fields still face challenges due to limited and insufficient data. This paper proposes a Feature Augmentation on Adaptive Geodesic Curve (FAAGC) method in the pre-shape space to increase data. In the pre-shape space, objects with identical shapes lie on a great circle. Thus, we project deep model representations into the pre-shape space and construct a geodesic curve, i.e., an arc of a great circle, for each class. Feature augmentation is then performed by sampling along these geodesic paths. Extensive experiments demonstrate that FAAGC improves classification accuracy under data-scarce conditions and generalizes well across various feature types.
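
The geometric core of FAAGC, sampling along a great-circle arc between pre-shapes, can be sketched with spherical linear interpolation after centering and scale-normalizing features; the projection details and the sampling range below are assumptions.

```python
# Sample along the great-circle arc between two same-class pre-shapes.
import numpy as np

def to_preshape(x):
    x = x - x.mean()                  # remove location
    return x / np.linalg.norm(x)      # remove scale -> unit hypersphere

def slerp(a, b, t):                   # assumes a, b not (anti)parallel
    omega = np.arccos(np.clip(a @ b, -1.0, 1.0))
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

f1, f2 = np.random.randn(128), np.random.randn(128)   # two deep features
a, b = to_preshape(f1), to_preshape(f2)
augmented = [slerp(a, b, t) for t in np.linspace(0.1, 0.9, 5)]
```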

[339] CHIRLA: Comprehensive High-resolution Identification and Re-identification for Large-scale Analysis

Bessie Dominguez-Dager, Felix Escalona, Francisco Gomez-Donoso, Miguel Cazorla

Main category: cs.CV

TL;DR: CHIRLA is a novel long-term person re-identification dataset recorded over 7 months with 22 individuals, 5+ hours of video, and 1M annotated bounding boxes, featuring realistic clothing and appearance variations across 4 indoor environments with 7 cameras.

DetailsMotivation: Most person Re-ID research focuses on short-term scenarios with minimal appearance changes, but real-world applications require robust systems that handle long-term variations caused by clothing and physical changes over time.

Method: Created CHIRLA dataset by recording over seven months in four connected indoor environments using seven strategically placed cameras, capturing realistic movements with substantial clothing variability. Used semi-automatic labeling to obtain about 1M bounding boxes with identity annotations.

Result: The dataset includes 22 individuals, more than five hours of video, and comprehensive benchmark protocols for person tracking and Re-ID covering challenging scenarios like occlusion, reappearance, and multi-camera conditions.

Conclusion: CHIRLA provides a comprehensive benchmark to facilitate development and evaluation of Re-ID algorithms that can reliably perform in challenging, long-term real-world scenarios, with publicly available benchmark code.

Abstract: Person re-identification (Re-ID) is a key challenge in computer vision, requiring the matching of individuals across cameras, locations, and time. While most research focuses on short-term scenarios with minimal appearance changes, real-world applications demand robust systems that handle long-term variations caused by clothing and physical changes. We present CHIRLA, Comprehensive High-resolution Identification and Re-identification for Large-scale Analysis, a novel dataset designed for video-based long-term person Re-ID. CHIRLA was recorded over seven months in four connected indoor environments using seven strategically placed cameras, capturing realistic movements with substantial clothing and appearance variability. The dataset includes 22 individuals, more than five hours of video, and about 1M bounding boxes with identity annotations obtained through semi-automatic labeling. We also define benchmark protocols for person tracking and Re-ID, covering diverse and challenging scenarios such as occlusion, reappearance, and multi-camera conditions. By introducing this comprehensive benchmark, we aim to facilitate the development and evaluation of Re-ID algorithms that can reliably perform in challenging, long-term real-world scenarios. The benchmark code is publicly available at: https://github.com/bdager/CHIRLA.

[340] In-Context Reverse Classification Accuracy: Efficient Estimation of Segmentation Quality without Ground-Truth

Matias Cosarinsky, Ramiro Billot, Lucas Mansilla, Gabriel Jimenez, Nicolas Gaggión, Guanghui Fu, Enzo Ferrante

Main category: cs.CV

TL;DR: In-Context RCA is a novel framework for automatic segmentation quality estimation without ground-truth annotations, using in-context learning and retrieval-augmentation techniques for efficient performance across medical imaging modalities.

DetailsMotivation: Assessing automatic image segmentation quality is crucial but challenging in clinical practice due to limited ground truth availability, requiring reliable automated quality control solutions.

Method: Leverages in-context learning segmentation models with retrieval-augmentation techniques to select relevant reference images for efficient quality estimation with minimal reference data.

Result: Demonstrates robust performance and computational efficiency across diverse medical imaging modalities, validated for automated quality control in clinical workflows.

Conclusion: Offers a promising solution for fast and reliable segmentation assessment in clinical practice where ground-truth annotations are scarce, with code publicly available.

Abstract: Assessing the quality of automatic image segmentation is crucial in clinical practice, but often very challenging due to the limited availability of ground truth annotations. In this paper, we introduce In-Context Reverse Classification Accuracy (In-Context RCA), a novel framework for automatically estimating segmentation quality in the absence of ground-truth annotations. By leveraging recent in-context learning segmentation models and incorporating retrieval-augmentation techniques to select the most relevant reference images, our approach enables efficient quality estimation with minimal reference data. Validated across diverse medical imaging modalities, our method demonstrates robust performance and computational efficiency, offering a promising solution for automated quality control in clinical workflows, where fast and reliable segmentation assessment is essential. The code is available at https://github.com/mcosarinsky/In-Context-RCA.

[341] GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving

Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, Wei Yin

Main category: cs.CV

TL;DR: GoalFlow is an end-to-end autonomous driving method that generates high-quality multimodal trajectories using goal point constraints and flow matching, achieving state-of-the-art performance with single-step denoising.

DetailsMotivation: Existing multimodal trajectory generation methods suffer from trajectory selection complexity, reduced trajectory quality, high trajectory divergence, and inconsistencies between guidance and scene information in autonomous driving scenarios.

Method: GoalFlow introduces goal point constraints to resolve trajectory divergence, establishes a scoring mechanism to select appropriate goal points based on scene information, employs Flow Matching for efficient multimodal trajectory generation, and uses a refined scoring mechanism to select optimal trajectories.

Result: GoalFlow achieves state-of-the-art performance on Navsim with PDMS of 90.3, significantly surpassing other methods. It requires only a single denoising step compared to other diffusion-based methods while delivering robust multimodal trajectories.

Conclusion: GoalFlow effectively addresses trajectory divergence and quality issues in multimodal trajectory generation for autonomous driving, providing high-quality trajectories with efficient single-step generation and superior performance over existing methods.

Abstract: We propose GoalFlow, an end-to-end autonomous driving method for generating high-quality multimodal trajectories. In autonomous driving scenarios, there is rarely a single suitable trajectory. Recent methods have increasingly focused on modeling multimodal trajectory distributions. However, they suffer from trajectory selection complexity and reduced trajectory quality due to high trajectory divergence and inconsistencies between guidance and scene information. To address these issues, we introduce GoalFlow, a novel method that effectively constrains the generative process to produce high-quality, multimodal trajectories. To resolve the trajectory divergence problem inherent in diffusion-based methods, GoalFlow constrains the generated trajectories by introducing a goal point. GoalFlow establishes a novel scoring mechanism that selects the most appropriate goal point from the candidate points based on scene information. Furthermore, GoalFlow employs an efficient generative method, Flow Matching, to generate multimodal trajectories, and incorporates a refined scoring mechanism to select the optimal trajectory from the candidates. Our experimental results, validated on the Navsim benchmark, demonstrate that GoalFlow achieves state-of-the-art performance, delivering robust multimodal trajectories for autonomous driving. GoalFlow achieved PDMS of 90.3, significantly surpassing other methods. Compared with other diffusion-policy-based methods, our approach requires only a single denoising step to obtain excellent performance. The code is available at https://github.com/YvanYin/GoalFlow.
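
Flow Matching itself has a compact training objective; the rectified-flow-style sketch below shows one common instantiation, regressing a velocity field along straight-line interpolants. GoalFlow's goal-point and scene conditioning are abstracted into a generic `cond` argument, and the stand-in network is a placeholder.

```python
# Conditional flow matching loss along straight-line interpolants.
import torch

def flow_matching_loss(model, x0, x1, cond):
    t = torch.rand(x0.shape[0], 1)      # per-sample time in [0, 1]
    x_t = (1 - t) * x0 + t * x1         # interpolate noise -> trajectory
    target_v = x1 - x0                  # constant target velocity
    pred_v = model(x_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()

model = lambda x, t, c: x               # stand-in for the real network
x0 = torch.randn(8, 40)                 # noise samples
x1 = torch.randn(8, 40)                 # flattened ground-truth waypoints
loss = flow_matching_loss(model, x0, x1, cond=None)
```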

[342] StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition

Xin Ding, Hao Wu, Yifan Yang, Shiqi Jiang, Donglin Bai, Zhibo Chen, Ting Cao

Main category: cs.CV

TL;DR: StreamMind is a video LLM framework that achieves 100 fps streaming video processing using event-gated LLM invocation and event-preserving feature extraction, enabling real-time proactive responses without user intervention.

DetailsMotivation: Address the need for real-time human-AI interaction in applications like AI assistants by solving the contradiction between linear video streaming speed and quadratic transformer computation costs.

Method: Proposes event-gated LLM invocation paradigm with a Cognition Gate network that only invokes LLM when relevant events occur, plus Event-Preserving Feature Extractor using state-space methods for constant-cost feature extraction.

Result: Achieves state-of-the-art performance on Ego4D and SoccerNet streaming tasks and standard offline benchmarks, with 100 fps processing on a single A100 GPU.

Conclusion: Enables ultra-high-FPS applications like Game AI and interactive media, paving the way for real-time streaming video dialogue systems.

Abstract: With the rise of real-world human-AI interaction applications, such as AI assistants, the need for Streaming Video Dialogue is critical. To address this need, we introduce StreamMind, a video LLM framework that achieves ultra-FPS streaming video processing (100 fps on a single A100) and enables proactive, always-on responses in real time, without explicit user intervention. To solve the key challenge of the contradiction between linear video streaming speed and quadratic transformer computation cost, we propose a novel perception-cognition interleaving paradigm named "event-gated LLM invocation", in contrast to the existing per-time-step LLM invocation. By introducing a Cognition Gate network between the video encoder and the LLM, the LLM is only invoked when relevant events occur. To realize the event feature extraction with constant cost, we propose the Event-Preserving Feature Extractor (EPFE) based on a state-space method, generating a single perception token for spatiotemporal features. These techniques enable the video LLM with full-FPS perception and real-time cognition response. Experiments on Ego4D and SoccerNet streaming tasks, as well as standard offline benchmarks, demonstrate state-of-the-art performance in both model capability and real-time efficiency, paving the way for ultra-high-FPS applications, such as Game AI and interactive media. The code and data are available at https://aka.ms/StreamMind.
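
Conceptually, event-gated invocation means a cheap gate scores every perception token and the expensive LLM runs only when the gate fires, as in the placeholder loop below; the gate architecture, token size, and threshold are illustrative, not StreamMind's.

```python
# Placeholder event-gated invocation loop: LLM runs only on gated frames.
import torch
import torch.nn as nn

gate = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))

def process_stream(frame_tokens, llm_step, threshold=0.5):
    responses = []
    for tok in frame_tokens:                    # tok: (256,) perception token
        if torch.sigmoid(gate(tok)) > threshold:
            responses.append(llm_step(tok))     # invoke LLM only on events
    return responses

tokens = torch.randn(100, 256)                  # 100 streamed frame tokens
out = process_stream(tokens, llm_step=lambda t: "response")
```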

[343] LaPIG: Cross-Modal Generation of Paired Thermal and Visible Facial Images

Leyang Wang, Joice Lin

Main category: cs.CV

TL;DR: LaPIG framework uses LLMs and diffusion models to generate high-quality paired visible-thermal images from captions, addressing data scarcity in facial translation networks.

DetailsMotivation: High-quality paired datasets are costly and challenging to acquire for facial translation networks, creating a need for synthetic data generation methods.

Method: Three-part framework: 1) visible image synthesis with ArcFace embedding, 2) thermal image translation using Latent Diffusion Models, 3) caption generation with LLMs to produce multi-view paired images while preserving identity.

Result: Superior performance compared to existing methods on public datasets, demonstrating high-quality paired data generation with maintained identity information.

Conclusion: LaPIG effectively addresses data scarcity by generating comprehensive, diverse paired visible-thermal images using LLM-assisted caption generation and diffusion models.

Abstract: The success of modern machine learning, particularly in facial translation networks, is highly dependent on the availability of high-quality, paired, large-scale datasets. However, acquiring sufficient data is often challenging and costly. Inspired by the recent success of diffusion models in high-quality image synthesis and advancements in Large Language Models (LLMs), we propose a novel framework called LLM-assisted Paired Image Generation (LaPIG). This framework enables the construction of comprehensive, high-quality paired visible and thermal images using captions generated by LLMs. Our method encompasses three parts: visible image synthesis with ArcFace embedding, thermal image translation using Latent Diffusion Models (LDMs), and caption generation with LLMs. Our approach not only generates multi-view paired visible and thermal images to increase data diversity but also produces high-quality paired data while maintaining their identity information. We evaluate our method on public datasets by comparing it with existing methods, demonstrating the superiority of LaPIG.

[344] Convolutional Neural Networks Can (Meta-)Learn the Same-Different Relation

Max Gupta, Sunayana Rane, R. Thomas McCoy, Thomas L. Griffiths

Main category: cs.CV

TL;DR: CNNs struggle with same-different relation tasks despite excelling at object-level tasks, but meta-learning enables successful generalization.

DetailsMotivation: Humans outperform CNNs in visual relation tasks like identifying same/different objects, and CNNs show poor generalization in these relational tasks despite conventional training.

Method: Using meta-learning instead of conventional training to explicitly encourage abstraction and generalization across same-different relation tasks.

Result: CNN architectures that previously failed to generalize same-different relations with conventional training succeeded when trained via meta-learning.

Conclusion: Meta-learning provides an effective approach for enabling CNNs to learn and generalize abstract visual relations like same-different, bridging a key gap between human and machine visual reasoning capabilities.

Abstract: While convolutional neural networks (CNNs) have come to match and exceed human performance in many settings, the tasks these models optimize for are largely constrained to the level of individual objects, such as classification and captioning. Humans remain vastly superior to CNNs in visual tasks involving relations, including the ability to identify two objects as 'same' or 'different'. A number of studies have shown that while CNNs can be coaxed into learning the same-different relation in some settings, they tend to generalize poorly to other instances of this relation. In this work we show that the same CNN architectures that fail to generalize the same-different relation with conventional training are able to succeed when trained via meta-learning, which explicitly encourages abstraction and generalization across tasks.
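
The abstract does not name the meta-learning algorithm used; as one hedged illustration, a first-order (Reptile-style) loop over same/different episodes could look like the following, where the synthetic episode sampler and the tiny CNN are stand-ins for the paper's task distribution and architectures.

```python
# First-order meta-learning sketch (Reptile-style) over same/different tasks.
import copy
import torch
import torch.nn as nn

def reptile_step(model, make_episode, inner_steps=5, inner_lr=1e-2, meta_lr=0.1):
    task_model = copy.deepcopy(model)
    opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
    for _ in range(inner_steps):            # adapt to one episode
        x, y = make_episode()               # image pairs labeled same/different
        loss = nn.functional.cross_entropy(task_model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                   # move meta-weights toward adapted ones
        for p, q in zip(model.parameters(), task_model.parameters()):
            p += meta_lr * (q - p)

cnn = nn.Sequential(nn.Conv2d(1, 8, 3), nn.ReLU(), nn.Flatten(),
                    nn.Linear(8 * 30 * 30, 2))     # binary same/different head
episode = lambda: (torch.randn(16, 1, 32, 32), torch.randint(0, 2, (16,)))
reptile_step(cnn, episode)
```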

[345] IMAGGarment: Fine-Grained Garment Generation for Controllable Fashion Design

Fei Shen, Jian Yu, Cong Wang, Xin Jiang, Xiaoyu Du, Jinhui Tang

Main category: cs.CV

TL;DR: IMAGGarment is a fine-grained garment generation framework that enables high-fidelity garment synthesis with precise control over silhouette, color, and logo placement through a two-stage training approach.

DetailsMotivation: Existing methods are limited to single-condition inputs, but personalized fashion design and digital apparel applications require multi-conditional controllability for precise garment generation.

Method: Two-stage training strategy: 1) Global appearance model with mixed attention module and color adapter for silhouette and color encoding, 2) Local enhancement model with adaptive appearance-aware module for logo injection and spatial constraints. Uses GarmentBench dataset with 180K+ garment samples.

Result: Outperforms existing baselines with superior structural stability, color fidelity, and local controllability performance.

Conclusion: IMAGGarment successfully addresses multi-conditional garment generation challenges and provides a comprehensive solution for high-fidelity, controllable fashion design applications.

Abstract: This paper presents IMAGGarment, a fine-grained garment generation (FGG) framework that enables high-fidelity garment synthesis with precise control over silhouette, color, and logo placement. Unlike existing methods that are limited to single-condition inputs, IMAGGarment addresses the challenges of multi-conditional controllability in personalized fashion design and digital apparel applications. Specifically, IMAGGarment employs a two-stage training strategy to separately model global appearance and local details, while enabling unified and controllable generation through end-to-end inference. In the first stage, we propose a global appearance model that jointly encodes silhouette and color using a mixed attention module and a color adapter. In the second stage, we present a local enhancement model with an adaptive appearance-aware module to inject user-defined logos and spatial constraints, enabling accurate placement and visual consistency. To support this task, we release GarmentBench, a large-scale dataset comprising over 180K garment samples paired with multi-level design conditions, including sketches, color references, logo placements, and textual prompts. Extensive experiments demonstrate that our method outperforms existing baselines, achieving superior structural stability, color fidelity, and local controllability performance. Code, models, and datasets are publicly available at https://github.com/muzishen/IMAGGarment.

[346] Transforming Hyperspectral Images Into Chemical Maps: A Novel End-to-End Deep Learning Approach

Ole-Christian Galbo Engstrøm, Michela Albano-Gaglio, Erik Schou Dreier, Yamine Bouzembrak, Maria Font-i-Furnols, Puneet Mishra, Kim Steenstrup Pedersen

Main category: cs.CV

TL;DR: U-Net deep learning approach outperforms traditional PLS regression for chemical map generation from hyperspectral images, providing lower error rates, better spatial correlation, and physically plausible predictions.

DetailsMotivation: Current chemical map generation methods like PLS regression produce noisy, pixel-wise predictions without spatial context and generate physically impossible values beyond the 0-100% range.

Method: Proposed end-to-end deep learning approach using modified U-Net architecture with custom loss function to directly generate chemical maps from hyperspectral images, skipping intermediate steps of traditional pixel-wise analysis.

Result: U-Net achieved 9-13% lower RMSE than PLS on mean fat prediction, generated chemical maps with 99.91% spatially correlated variance (vs 2.53% for PLS), and stayed within physically possible 0-100% range unlike PLS.

Conclusion: U-Net is superior to PLS for chemical map generation, providing more accurate predictions with proper spatial context and physically plausible results.

Abstract: Current approaches to chemical map generation from hyperspectral images are based on models such as partial least squares (PLS) regression, generating pixel-wise predictions that do not consider spatial context and suffer from a high degree of noise. This study proposes an end-to-end deep learning approach using a modified version of U-Net and a custom loss function to directly obtain chemical maps from hyperspectral images, skipping all intermediate steps required for traditional pixel-wise analysis. The U-Net is compared with the traditional PLS regression on a real dataset of pork belly samples with associated mean fat reference values. The U-Net obtains a test set root mean squared error of between 9% and 13% lower than that of PLS regression on the task of mean fat prediction. At the same time, U-Net generates fine detail chemical maps where 99.91% of the variance is spatially correlated. Conversely, only 2.53% of the variance in the PLS-generated chemical maps is spatially correlated, indicating that each pixel-wise prediction is largely independent of neighboring pixels. Additionally, while the PLS-generated chemical maps contain predictions far beyond the physically possible range of 0-100%, U-Net learns to stay inside this range. Thus, the findings of this study indicate that U-Net is superior to PLS for chemical map generation.
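
The abstract's point about physically plausible outputs can be captured with a bounded output head: a sigmoid scaled to 0-100% guarantees predictions stay in range, as sketched below. The paper's exact head and custom loss are not specified here, so treat this as an assumed reading.

```python
# Assumed bounded head: chemical-map predictions constrained to 0-100%.
import torch
import torch.nn as nn

class BoundedHead(nn.Module):
    def forward(self, logits):                  # logits: (B, 1, H, W) from U-Net
        return 100.0 * torch.sigmoid(logits)    # percent fat in [0, 100]

head = BoundedHead()
pred = head(torch.randn(2, 1, 128, 128))
loss = nn.functional.mse_loss(pred, torch.full_like(pred, 35.0))  # toy target
```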

[347] WMKA-Net: A Weighted Multi-Kernel Attention Network for Retinal Vessel Segmentation

Xinran Xu, Yuliang Ma, Sifu Cai, Ming Meng, Qiang Lv, Ruoyan Shi

Main category: cs.CV

TL;DR: WMKA-Net proposes a dual-stage approach for retinal vessel segmentation with reversible multi-scale fusion and vascular-oriented attention to address feature fusion, contextual continuity, and noise challenges, achieving state-of-the-art performance on multiple datasets.

DetailsMotivation: Address three major challenges in retinal vessel segmentation: insufficient multi-scale feature fusion, disruption of contextual continuity, and noise interference for improved ophthalmic diagnosis.

Method: Dual-stage solution: 1) Reversible Multi-Scale Fusion Module (RMS) with hierarchical adaptive convolution for cross-scale feature merging, 2) Vascular-Oriented Attention Mechanism with axial pathway for long-distance continuity and bifurcation attention pathway for topological key nodes.

Result: Achieves accuracy of 0.9909, sensitivity of 0.9198, and specificity of 0.9953 on DRIVE, STARE, and CHASE-DB1 datasets, significantly outperforming existing methods.

Conclusion: Provides an efficient, precise, and robust intelligent solution for early screening of diabetic retinopathy through effective restoration of vascular continuity and improved segmentation accuracy.

Abstract: Retinal vessel segmentation is crucial for intelligent ophthalmic diagnosis, yet it faces three major challenges: insufficient multi-scale feature fusion, disruption of contextual continuity, and noise interference. This study proposes a dual-stage solution to address these issues. The first stage employs a Reversible Multi-Scale Fusion Module (RMS) that uses hierarchical adaptive convolution to dynamically merge cross-scale features from capillaries to main vessels, self-adaptively calibrating feature biases. The second stage introduces a Vascular-Oriented Attention Mechanism, which models long-distance vascular continuity through an axial pathway and enhances the capture of topological key nodes, such as bifurcation points, via a dedicated bifurcation attention pathway. The synergistic operation of these two pathways effectively restores the continuity of vascular structures and improves the segmentation accuracy of complex vascular networks. Systematic experiments on the DRIVE, STARE, and CHASE-DB1 datasets demonstrate that WMKA-Net achieves an accuracy of 0.9909, sensitivity of 0.9198, and specificity of 0.9953, significantly outperforming existing methods. This model provides an efficient, precise, and robust intelligent solution for the early screening of diabetic retinopathy.

[348] HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation

Shaina Raza, Aravind Narayanan, Vahid Reza Khazaie, Ashmal Vayani, Mukund S. Chettiar, Amandeep Singh, Mubarak Shah, Deval Pandya

Main category: cs.CV

TL;DR: HumaniBench is a new benchmark with 32,000 image-question pairs to evaluate large multimodal models’ alignment with human-centered values like fairness, ethics, and inclusivity across 7 principles.

DetailsMotivation: Existing LMM evaluations focus on technical tasks like VQA and image captioning but lack rigorous assessment of human-centered values such as fairness, ethics, and inclusivity that are crucial for responsible AI development.

Method: Created a benchmark of 32,000 real-world image-question pairs using an AI-assisted labeling pipeline validated by experts. Evaluates LMMs across 7 alignment principles through diverse open-ended and closed-ended VQA tasks.

Result: Proprietary models lead in reasoning, fairness, and multilinguality, while open-source models excel in robustness and grounding. Most models struggle to balance accuracy with ethical/inclusive behavior. Chain-of-Thought prompting and test-time scaling improve alignment.

Conclusion: HumaniBench provides the first comprehensive testbed for evaluating human-centered alignment in LMMs, enabling diagnosis of limitations and promoting responsible development. All data and code are publicly available for reproducibility.

Abstract: Large multimodal models (LMMs) have been widely tested on tasks like visual question answering (VQA), image captioning, and grounding, but lack rigorous evaluation for alignment with human-centered (HC) values such as fairness, ethics, and inclusivity. To address this gap, we introduce HumaniBench, a novel benchmark of 32,000 real-world image-question pairs and an evaluation suite. Labels are generated via an AI-assisted pipeline and validated by experts. HumaniBench assesses LMMs across seven key alignment principles: fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilinguality, through diverse open-ended and closed-ended VQA tasks. Grounded in AI ethics and real-world needs, these principles provide a holistic lens for societal impact. Benchmarking results on different LMM shows that proprietary models generally lead in reasoning, fairness, and multilinguality, while open-source models excel in robustness and grounding. Most models struggle to balance accuracy with ethical and inclusive behavior. Techniques like Chain-of-Thought prompting and test-time scaling improve alignment. As the first benchmark tailored for HC alignment, HumaniBench offers a rigorous testbed to diagnose limitations, and promote responsible LMM development. All data and code are publicly available for reproducibility. Keywords: HumaniBench, vision-language models, responsible AI benchmark, AI alignment evaluation, AI ethics assessment, fairness in AI models, visual question answering (VQA) benchmark, image captioning evaluation, visual grounding tasks, trustworthy AI models, Chain-of-Thought prompting, test-time scaling, ethical AI development tools.

[349] Enhanced Partially Relevant Video Retrieval through Inter- and Intra-Sample Analysis with Coherence Prediction

Junlong Ren, Gangjian Zhang, Hao Wang, Yu Hu, Jian Shu, Hui Xiong

Main category: cs.CV

TL;DR: A novel framework for Partially Relevant Video Retrieval that addresses semantic asymmetry through inter-sample correlation enhancement, intra-sample redundancy mining, and temporal coherence prediction.

DetailsMotivation: Existing PRVR methods coarsely align videos and text queries, neglecting the critical cross-modal dual nature of inter-sample correlation and intra-sample redundancy in partially relevant video retrieval.

Method: Three core modules: 1) Inter Correlation Enhancement (ICE) creates pseudo-positive pairs from similar unpaired queries and moments; 2) Intra Redundancy Mining (IRM) distinguishes redundant from relevant moments; 3) Temporal Coherence Prediction (TCP) enhances temporal structure learning through frame/moment order prediction.

Result: Extensive experiments demonstrate superiority over prior methods, achieving state-of-the-art results in PRVR.

Conclusion: The proposed framework effectively addresses semantic asymmetry in PRVR by systematically exploiting inter-sample correlation and intra-sample redundancy characteristics through three complementary modules.

Abstract: Partially Relevant Video Retrieval (PRVR) aims to retrieve the target video that is partially relevant to the text query. The primary challenge in PRVR arises from the semantic asymmetry between textual and visual modalities, as videos often contain substantial content irrelevant to the query. Existing methods coarsely align paired videos and text queries to construct the semantic space, neglecting the critical cross-modal dual nature inherent in this task: inter-sample correlation and intra-sample redundancy. To this end, we propose a novel PRVR framework to systematically exploit these two characteristics. Our framework consists of three core modules. First, the Inter Correlation Enhancement (ICE) module captures inter-sample correlation by identifying semantically similar yet unpaired text queries and video moments, combining them to form pseudo-positive pairs for more robust semantic space construction. Second, the Intra Redundancy Mining (IRM) module mitigates intra-sample redundancy by mining redundant moment features and distinguishing them from query-relevant moments, encouraging the model to learn more discriminative representations. Finally, to reinforce these modules, we introduce the Temporal Coherence Prediction (TCP) module, which enhances temporal structure learning by training the model to predict the original temporal order of randomly shuffled video frames and moments. Extensive experiments demonstrate the superiority of our approach compared to prior methods, achieving state-of-the-art results.
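
The Temporal Coherence Prediction idea reduces to a shuffled-order recovery task; a minimal sketch, assuming a simple linear head over per-frame features (the paper's treatment of moments is not shown), follows.

```python
# TCP sketch: shuffle frame features and predict each frame's original index.
import torch
import torch.nn as nn

T, D = 16, 512
order_head = nn.Linear(D, T)                  # logits over original positions

frames = torch.randn(T, D)                    # per-frame features
perm = torch.randperm(T)                      # random shuffle
logits = order_head(frames[perm])             # (T, T)
loss = nn.functional.cross_entropy(logits, perm)   # recover original order
```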

[350] Corner Cases: How Size and Position of Objects Challenge ImageNet-Trained Models

Mishal Fatima, Steffen Jung, Margret Keuper

Main category: cs.CV

TL;DR: The paper introduces Hard-Spurious-ImageNet, a synthetic dataset showing how background spurious correlations are amplified when objects are small and off-center, and demonstrates that current debiasing methods fail to address these spatial biases.

DetailsMotivation: Backgrounds in images create spurious correlations due to human aesthetic preferences, leading to positional and size biases in datasets that affect model reliance on spurious background features.

Method: Created a synthetic dataset derived from ImageNet-1k with controlled variations in backgrounds, object positions, and object sizes, then evaluated various pretrained models on this dataset.

Result: Models heavily rely on spurious background features when the object is small (low ROI-to-image ratio) and positioned far from the image center. Current debiasing methods fail to improve worst-group accuracy under these spatial variations.

Conclusion: Spatial factors (object size and position) significantly impact model reliance on spurious features, and existing mitigation approaches are inadequate for addressing these specific biases, highlighting the need for new methods that consider spatial characteristics.

Abstract: Backgrounds in images play a major role in contributing to spurious correlations among different data points. Owing to aesthetic preferences of humans capturing the images, datasets can exhibit positional (location of the object within a given frame) and size (region-of-interest to image ratio) biases for different classes. In this paper, we show that these biases can impact how much a model relies on spurious features in the background to make its predictions. To better illustrate our findings, we propose a synthetic dataset derived from ImageNet-1k, Hard-Spurious-ImageNet, which contains images with various backgrounds, object positions, and object sizes. By evaluating the dataset on different pretrained models, we find that most models rely heavily on spurious features in the background when the region-of-interest (ROI) to image ratio is small and the object is far from the center of the image. Moreover, we also show that current methods that aim to mitigate harmful spurious features, do not take into account these factors, hence fail to achieve considerable performance gains for worst-group accuracies when the size and location of core features in an image change. The dataset and implementation code are available at https://github.com/Mishalfatima/Corner_Cases.
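
The two spatial factors the study varies are easy to compute from a bounding box; the helper below (names illustrative) returns the ROI-to-image ratio and the normalized distance of the object center from the image center.

```python
# Compute the two spatial bias factors from an (x0, y0, x1, y1) box.
def spatial_factors(box, img_w, img_h):
    x0, y0, x1, y1 = box
    roi_ratio = ((x1 - x0) * (y1 - y0)) / (img_w * img_h)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    dist = ((cx - img_w / 2) ** 2 + (cy - img_h / 2) ** 2) ** 0.5
    max_dist = ((img_w / 2) ** 2 + (img_h / 2) ** 2) ** 0.5
    return roi_ratio, dist / max_dist          # both in [0, 1]

print(spatial_factors((10, 10, 60, 60), 224, 224))   # small, off-center object
```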

[351] Making Rotation Averaging Fast and Robust with Anisotropic Coordinate Descent

Yaroslava Lochman, Carl Olsson, Christopher Zach

Main category: cs.CV

TL;DR: A new anisotropic rotation averaging method that combines optimality, robustness and efficiency through simplified block coordinate descent formulation and anisotropic extension.

DetailsMotivation: Existing anisotropic rotation averaging methods have trade-offs - semidefinite relaxations recover global minima but scale poorly, while local methods are fast but sensitive to initialization and suffer from drift accumulation.

Method: Analyzed block coordinate descent methods for chordal distances, derived simpler formulation and anisotropic extension to create a fast general solver, integrated into extended anisotropic large-scale robust rotation averaging pipeline.

Result: Achieves state-of-the-art performance on public structure-from-motion datasets.

Conclusion: Successfully bridges the gap between optimality, robustness and efficiency in anisotropic rotation averaging with a fast general solver that outperforms existing methods.

Abstract: Anisotropic rotation averaging has recently been explored as a natural extension of respective isotropic methods. In the anisotropic formulation, uncertainties of the estimated relative rotations – obtained via standard two-view optimization – are propagated to the optimization of absolute rotations. The resulting semidefinite relaxations are able to recover global minima but scale poorly with the problem size. Local methods are fast and also admit robust estimation but are sensitive to initialization. They usually employ minimum spanning trees and therefore suffer from drift accumulation and can get trapped in poor local minima. In this paper, we attempt to bridge the gap between optimality, robustness and efficiency of anisotropic rotation averaging. We analyze a family of block coordinate descent methods initially proposed to optimize the standard chordal distances, and derive a much simpler formulation and an anisotropic extension, obtaining a fast general solver. We integrate this solver into the extended anisotropic large-scale robust rotation averaging pipeline. The resulting algorithm achieves state-of-the-art performance on public structure-from-motion datasets. Project page: https://ylochman.github.io/acd
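
For readers unfamiliar with block coordinate descent on rotations, a minimal isotropic chordal version looks as follows: each absolute rotation is re-estimated in closed form from its neighbors and projected back to SO(3) via SVD. This sketch assumes the convention R_ij ≈ R_j R_i^T and omits the paper's anisotropic weighting and robust losses:

```python
# Minimal isotropic chordal block coordinate descent (a sketch, not the
# paper's anisotropic solver).
import numpy as np

def project_so3(M):
    # nearest rotation in Frobenius norm, with determinant correction
    U, _, Vt = np.linalg.svd(M)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    return U @ D @ Vt

def bcd_rotation_averaging(R_abs, rel, iters=50):
    """R_abs: dict i -> 3x3 initial rotation; rel: dict (i, j) -> 3x3 R_ij."""
    for _ in range(iters):
        for i in R_abs:
            M = np.zeros((3, 3))
            for (a, b), R_ab in rel.items():
                if a == i:        # R_b ~ R_ab R_i  =>  contribution R_ab^T R_b
                    M += R_ab.T @ R_abs[b]
                elif b == i:      # R_i ~ R_ab R_a  =>  contribution R_ab R_a
                    M += R_ab @ R_abs[a]
            if np.any(M):
                R_abs[i] = project_so3(M)   # closed-form chordal update
    return R_abs
```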

[352] Image Segmentation with Large Language Models: A Survey with Perspectives for Intelligent Transportation Systems

Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma

Main category: cs.CV

TL;DR: Survey of LLM-augmented image segmentation for intelligent transportation systems, covering applications, taxonomy of approaches, challenges, and future directions.

DetailsMotivation: LLMs are transforming computer vision tasks, and accurate scene understanding is critical for safety and efficiency in intelligent transportation systems.

Method: Systematic review and taxonomy of current approaches based on prompting mechanisms and core architectures for LLM-augmented image segmentation.

Result: Identified how LLM innovations enhance road scene understanding for autonomous driving, traffic monitoring, and infrastructure maintenance.

Conclusion: Key challenges include real-time performance and safety-critical reliability, with future focus on explainable, human-centric AI for successful deployment in next-generation transportation systems.

Abstract: The integration of Large Language Models (LLMs) with computer vision is profoundly transforming perception tasks like image segmentation. For intelligent transportation systems (ITS), where accurate scene understanding is critical for safety and efficiency, this new paradigm offers unprecedented capabilities. This survey systematically reviews the emerging field of LLM-augmented image segmentation, focusing on its applications, challenges, and future directions within ITS. We provide a taxonomy of current approaches based on their prompting mechanisms and core architectures, and we highlight how these innovations can enhance road scene understanding for autonomous driving, traffic monitoring, and infrastructure maintenance. Finally, we identify key challenges, including real-time performance and safety-critical reliability, and outline a perspective centered on explainable, human-centric AI as a prerequisite for the successful deployment of this technology in next-generation transportation systems.

[353] Pushing Trade-Off Boundaries: Compact yet Effective Remote Sensing Change Detection

Luosheng Xu, Dalin Zhang, Zhaohui Song

Main category: cs.CV

TL;DR: FlickCD is a lightweight change detection model that achieves SOTA performance with significantly reduced computational costs through an Enhanced Difference Module and Local-Global Fusion Blocks.

DetailsMotivation: Address the inefficiency of complex deep learning models in remote sensing change detection, focusing on lightweight solutions suitable for on-satellite processing while maintaining high accuracy.

Method: Proposes FlickCD with Enhanced Difference Module (EDM) to amplify critical feature differences and suppress irrelevant variations, plus Local-Global Fusion Blocks using Shifted Window Self-Attention and Efficient Global Self-Attention for multi-scale semantic capture.

Result: Reduces computational and storage overheads by more than an order of magnitude while achieving state-of-the-art performance or only minor (<1% F1) accuracy trade-off on four benchmark datasets.

Conclusion: FlickCD successfully pushes the performance-resource trade-off boundaries, demonstrating that lightweight models can maintain high accuracy while being significantly more efficient, making them suitable for resource-constrained applications like on-satellite processing.

Abstract: Remote sensing change detection is essential for monitoring urban expansion, disaster assessment, and resource management, offering timely, accurate, and large-scale insights into dynamic landscape transformations. While deep learning has revolutionized change detection, the increasing complexity and computational demands of modern models have not necessarily translated into significant accuracy gains. Instead of following this trend, this study explores a more efficient approach, focusing on lightweight models that maintain high accuracy while minimizing resource consumption, which is an essential requirement for on-satellite processing. To this end, we propose FlickCD, named for the idea that a quick flick suffices to get great results, pushing the boundaries of the performance-resource trade-off. FlickCD introduces an Enhanced Difference Module (EDM) to amplify critical feature differences between temporal phases while suppressing irrelevant variations such as lighting and weather changes, thereby reducing computational costs in the subsequent change decoder. Additionally, the FlickCD decoder incorporates Local-Global Fusion Blocks, leveraging Shifted Window Self-Attention (SWSA) and Efficient Global Self-Attention (EGSA) to effectively capture semantic information at multiple scales, preserving both coarse- and fine-grained changes. Extensive experiments on four benchmark datasets demonstrate that FlickCD reduces computational and storage overheads by more than an order of magnitude while achieving state-of-the-art (SOTA) performance or incurring only a minor (<1% F1) accuracy trade-off. The implementation code is publicly available at https://github.com/xulsh8/FlickCD.
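
The EDM idea, amplifying the change signal while gating out nuisance variation, can be sketched as a small PyTorch module (a plausible reading of the description, not the released architecture):

```python
# Hedged sketch of an Enhanced-Difference-style module (names assumed).
import torch
import torch.nn as nn

class EnhancedDifference(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.refine = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f_t1: torch.Tensor, f_t2: torch.Tensor) -> torch.Tensor:
        diff = torch.abs(f_t1 - f_t2)                  # raw change signal
        g = self.gate(torch.cat([f_t1, f_t2], dim=1))  # learn what to keep
        return self.refine(diff * g)                   # amplified, filtered difference
```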

[354] Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs

Amirmohammad Izadi, Mohammad Ali Banayeeanzade, Fatemeh Askari, Ali Rahimiakbar, Mohammad Mahdi Vahedi, Hosein Hasani, Mahdieh Soleymani Baghshah

Main category: cs.CV

TL;DR: VISER augments visual inputs with spatial structures and uses textual prompts to improve visual reasoning in VLMs, achieving significant performance gains across multiple tasks without complex inference procedures.

DetailsMotivation: Current Vision-Language Models struggle with visual reasoning due to the binding problem - failing to reliably associate perceptual features with correct visual referents, particularly in tasks requiring spatial awareness.

Method: Introduces VISER: augmenting visual inputs with low-level spatial structures and pairing with textual prompts that encourage sequential, spatially-aware parsing.

Result: Substantial improvements: 25.00% increase in GPT-4o visual search accuracy, 26.83% increase in counting accuracy, 0.32 reduction in scene description edit distance error, and 9.50% improvement on spatial relationship tasks.

Conclusion: Low-level visual structuring is essential for improving compositional visual reasoning, outperforming purely textual strategies like Chain-of-Thought prompting, and represents a powerful underexplored direction for enhancing VLM performance on spatially grounded tasks.

Abstract: Despite progress in Vision-Language Models (VLMs), their capacity for visual reasoning is often limited by the binding problem: the failure to reliably associate perceptual features with their correct visual referents. This limitation underlies persistent errors in tasks such as counting, visual search, scene description, and spatial relationship understanding. A key factor is that current VLMs process visual features largely in parallel, lacking mechanisms for spatially grounded, serial attention. This paper introduces VISER (Visual Input Structure for Enhanced Reasoning), a simple yet effective intervention: augmenting visual inputs with low-level spatial structures and pairing this with a textual prompt that encourages sequential, spatially-aware parsing. We empirically demonstrate substantial performance improvements across core visual reasoning tasks. Specifically, VISER improves GPT-4o visual search accuracy by 25.00%, increases counting accuracy by 26.83%, reduces edit distance error in scene description by 0.32, and enhances performance on spatial relationship tasks by 9.50% on a 2D synthetic dataset. Furthermore, we find that the visual modification is essential for these gains; purely textual strategies, including Chain-of-Thought prompting, are insufficient and can even degrade performance. VISER enhances binding with only a single-query inference, underscoring the importance of visual input design over purely linguistically-based approaches. These findings suggest that low-level visual structuring is a powerful and underexplored direction for improving compositional visual reasoning and could serve as a general strategy for enhancing VLM performance on spatially grounded tasks.
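
The intervention itself is cheap to reproduce in spirit: overlay a low-level spatial structure on the input and pair it with a prompt that forces sequential parsing. A sketch, with the grid layout and prompt wording as assumptions:

```python
# Sketch of a VISER-style intervention: labeled grid overlay + serial-parsing prompt.
from PIL import Image, ImageDraw

def add_grid(img: Image.Image, rows: int = 4, cols: int = 4) -> Image.Image:
    out = img.copy()
    draw = ImageDraw.Draw(out)
    W, H = out.size
    for r in range(1, rows):
        draw.line([(0, r * H // rows), (W, r * H // rows)], fill="red", width=2)
    for c in range(1, cols):
        draw.line([(c * W // cols, 0), (c * W // cols, H)], fill="red", width=2)
    return out

PROMPT = (
    "The image is divided into a 4x4 grid by red lines. Scan the cells one by "
    "one, left to right and top to bottom, and list the objects in each cell "
    "before answering the question."
)
```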

[355] Leveraging Out-of-Distribution Unlabeled Images: Semi-Supervised Semantic Segmentation with an Open-Vocabulary Model

Wooseok Shin, Jisu Kang, Hyeonki Jeong, Jin Sob Kim, Sung Won Han

Main category: cs.CV

TL;DR: Proposes SemiOVS framework using open-vocabulary segmentation to effectively leverage out-of-distribution unlabeled images for semi-supervised semantic segmentation, achieving state-of-the-art performance.

DetailsMotivation: Existing semi-supervised segmentation methods work well with controlled benchmark splits but struggle with abundant real-world unlabeled images that have different distributions (OOD), leading to inaccurate pseudo-labels.

Method: Develops a semi-supervised semantic segmentation framework with an open-vocabulary segmentation model to pseudo-label OOD images effectively.

Result: Outperforms PrevMatch and SemiVL by +3.5 and +3.0 mIoU on Pascal VOC with 92 labels, demonstrating substantial performance gains from using OOD images.

Conclusion: The approach successfully utilizes abundant unlabeled OOD images for semantic segmentation, showing promise for real-world applications and inspiring future research.

Abstract: In semi-supervised semantic segmentation, existing studies have shown promising results in academic settings with controlled splits of benchmark datasets. However, the potential benefits of leveraging significantly larger sets of unlabeled images remain unexplored. In real-world scenarios, abundant unlabeled images are often available from online sources (web-scraped images) or large-scale datasets. However, these images may have different distributions from those of the target dataset, a situation known as out-of-distribution (OOD). Using these images as unlabeled data in semi-supervised learning can lead to inaccurate pseudo-labels, potentially misguiding network training. In this paper, we propose a new semi-supervised semantic segmentation framework with an open-vocabulary segmentation model (SemiOVS) to effectively utilize unlabeled OOD images. Extensive experiments on Pascal VOC and Context datasets demonstrate two key findings: (1) using additional unlabeled images improves the performance of semi-supervised learners in scenarios with few labels, and (2) using the open-vocabulary segmentation (OVS) model to pseudo-label OOD images leads to substantial performance gains. In particular, SemiOVS outperforms existing PrevMatch and SemiVL methods by +3.5 and +3.0 mIoU, respectively, on Pascal VOC with a 92-label setting, achieving state-of-the-art performance. These findings demonstrate that our approach effectively utilizes abundant unlabeled OOD images for semantic segmentation tasks. We hope this work can inspire future research and real-world applications. The code is available at https://github.com/wooseok-shin/SemiOVS
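
The core recipe is easy to state in code: an open-vocabulary segmenter pseudo-labels the OOD pool with the target dataset's class names, and those pseudo-labels supervise the student alongside the labeled set. In the sketch below, `ovs_model` and its call signature are assumptions:

```python
# Hedged sketch of OVS pseudo-labeling for OOD images (interface assumed).
import torch

PASCAL_CLASSES = ["aeroplane", "bicycle", "bird", "person", "sofa"]  # truncated list

@torch.no_grad()
def pseudo_label_ood(ovs_model, ood_images, class_names=PASCAL_CLASSES):
    """ood_images: (B, 3, H, W); returns per-pixel class indices (B, H, W)."""
    logits = ovs_model(ood_images, class_names)   # (B, C, H, W), assumed API
    return logits.argmax(dim=1)

def semi_supervised_step(student, criterion, labeled, ood_batch, ovs_model):
    x_l, y_l = labeled
    y_ood = pseudo_label_ood(ovs_model, ood_batch)
    # supervised term + OVS-pseudo-labeled OOD term
    return criterion(student(x_l), y_l) + criterion(student(ood_batch), y_ood)
```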

[356] Driver-Net: Multi-Camera Fusion for Assessing Driver Take-Over Readiness in Automated Vehicles

Mahdi Rezaei, Mohsen Azarmi

Main category: cs.CV

TL;DR: Driver-Net is a deep learning framework that uses multi-camera inputs to estimate driver readiness for take-over in automated vehicles, achieving 95.8% accuracy by fusing visual cues from head, hands, and body posture.

DetailsMotivation: Existing driver monitoring systems focus only on head pose or eye gaze, which may not provide comprehensive assessment of driver readiness for safe transition of control in automated vehicles.

Method: Uses triple-camera setup to capture synchronized visual cues from driver’s head, hands, and body posture. Employs dual-path architecture with Context Block and Feature Block, followed by cross-modal fusion strategy for spatio-temporal data integration.

Result: Achieves 95.8% accuracy in driver readiness classification on dataset from University of Leeds Driving Simulator, significantly outperforming existing approaches.

Conclusion: Driver-Net demonstrates the importance of multimodal and multi-view fusion for accurate driver readiness assessment, providing a real-time, non-intrusive solution that enhances automated vehicle safety and meets regulatory standards.

Abstract: Ensuring safe transition of control in automated vehicles requires an accurate and timely assessment of driver readiness. This paper introduces Driver-Net, a novel deep learning framework that fuses multi-camera inputs to estimate driver take-over readiness. Unlike conventional vision-based driver monitoring systems that focus on head pose or eye gaze, Driver-Net captures synchronised visual cues from the driver’s head, hands, and body posture through a triple-camera setup. The model integrates spatio-temporal data using a dual-path architecture, comprising a Context Block and a Feature Block, followed by a cross-modal fusion strategy to enhance prediction accuracy. Evaluated on a diverse dataset collected from the University of Leeds Driving Simulator, the proposed method achieves an accuracy of up to 95.8% in driver readiness classification. This performance significantly exceeds that of existing approaches and highlights the importance of multimodal and multi-view fusion. As a real-time, non-intrusive solution, Driver-Net contributes meaningfully to the development of safer and more reliable automated vehicles and aligns with new regulatory mandates and upcoming safety standards.

[357] DeepShade: Enable Shade Simulation by Text-conditioned Image Generation

Longchao Da, Xiangrui Liu, Mithun Shivakoti, Thirulogasankar Pranav Kutralingam, Yezhou Yang, Hua Wei

Main category: cs.CV

TL;DR: DeepShade: A diffusion-based model that generates accurate shade predictions from satellite imagery to improve heatwave route planning by calculating shade ratios for urban environments.

DetailsMotivation: Current routing systems lack shade information which is crucial for public health during heatwaves. Existing methods struggle with noisy satellite imagery and limited training data for generative models.

Method: 1) Created extensive dataset using Blender-based 3D simulations to capture building shadows under various solar angles 2) Developed DeepShade diffusion model that combines RGB with Canny edge features and uses contrastive learning for temporal shade variations 3) Conditions on textual descriptions of time and solar angles

Result: The framework provides improved performance in generating shade images and enables calculation of shade ratios for real-world route planning applications.

Conclusion: This work benefits society by providing urban planning references for extreme heat weather and has practical environmental applications for heatwave-safe routing.

Abstract: Heatwaves pose a significant threat to public health, especially as global warming intensifies. However, current routing systems (e.g., online maps) fail to incorporate shade information due to the difficulty of estimating shades directly from noisy satellite imagery and the limited availability of training data for generative models. In this paper, we address these challenges through two main contributions. First, we build an extensive dataset covering diverse longitude-latitude regions, varying levels of building density, and different urban layouts. Leveraging Blender-based 3D simulations alongside building outlines, we capture building shadows under various solar zenith angles throughout the year and at different times of day. These simulated shadows are aligned with satellite images, providing a rich resource for learning shade patterns. Second, we propose the DeepShade, a diffusion-based model designed to learn and synthesize shade variations over time. It emphasizes the nuance of edge features by jointly considering RGB with the Canny edge layer, and incorporates contrastive learning to capture the temporal change rules of shade. Then, by conditioning on textual descriptions of known conditions (e.g., time of day, solar angles), our framework provides improved performance in generating shade images. We demonstrate the utility of our approach by using our shade predictions to calculate shade ratios for real-world route planning in Tempe, Arizona. We believe this work will benefit society by providing a reference for urban planning under extreme heat and through its practical environmental applications.
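
The downstream use, scoring candidate routes by shade, reduces to a simple ratio once a shade mask has been predicted. A sketch (assuming routes are already projected into the mask's pixel grid):

```python
# Illustrative shade-ratio scoring for route planning (coordinates assumed
# already projected into pixel space of the predicted shade mask).
import numpy as np

def shade_ratio(shade_mask: np.ndarray, route_px: list[tuple[int, int]]) -> float:
    """shade_mask: (H, W) with 1 = shaded; route_px: [(row, col), ...]."""
    H, W = shade_mask.shape
    hits = [shade_mask[r, c] for r, c in route_px if 0 <= r < H and 0 <= c < W]
    return float(np.mean(hits)) if hits else 0.0

# pick the candidate route with the most shade at the query time of day
# best = max(candidate_routes, key=lambda r: shade_ratio(mask, r))
```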

[358] DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model

Han Zhang, Xiangde Luo, Yong Chen, Kang Li

Main category: cs.CV

TL;DR: DiffOSeg is a diffusion-based framework that simultaneously achieves consensus-driven and preference-driven medical image segmentation to address annotation variability from multiple experts.

DetailsMotivation: Annotation variability in medical image segmentation due to ambiguous boundaries and diverse clinical expertise creates challenges. Traditional methods produce single deterministic predictions that fail to capture annotator biases, while existing multi-rater approaches focus on either consensus or individual preferences but not both.

Method: A two-stage diffusion-based framework: Stage I establishes population consensus through probabilistic consensus strategy, and Stage II captures expert-specific preferences via adaptive prompts.

Result: The model outperforms existing state-of-the-art methods across all evaluated metrics on two public datasets (LIDC-IDRI and NPC-170).

Conclusion: DiffOSeg successfully addresses annotation variability by providing both consensus-driven and preference-driven segmentation simultaneously, offering a more comprehensive solution for medical image segmentation with multiple expert annotations.

Abstract: Annotation variability remains a substantial challenge in medical image segmentation, stemming from ambiguous imaging boundaries and diverse clinical expertise. Traditional deep learning methods producing single deterministic segmentation predictions often fail to capture these annotator biases. Although recent studies have explored multi-rater segmentation, existing methods typically focus on a single perspective – either generating a probabilistic “gold standard” consensus or preserving expert-specific preferences – thus struggling to provide a more omni view. In this study, we propose DiffOSeg, a two-stage diffusion-based framework, which aims to simultaneously achieve both consensus-driven (combining all experts’ opinions) and preference-driven (reflecting experts’ individual assessments) segmentation. Stage I establishes population consensus through a probabilistic consensus strategy, while Stage II captures expert-specific preference via adaptive prompts. Demonstrated on two public datasets (LIDC-IDRI and NPC-170), our model outperforms existing state-of-the-art methods across all evaluated metrics. Source code is available at https://github.com/string-ellipses/DiffOSeg .

[359] Depthwise-Dilated Convolutional Adapters for Medical Object Tracking and Segmentation Using the Segment Anything Model 2

Guoping Xu, Christopher Kabat, You Zhang

Main category: cs.CV

TL;DR: DD-SAM2 is an efficient adapter-based framework that fine-tunes SAM2 for medical video segmentation and tracking with minimal parameters, achieving state-of-the-art performance on tumor and heart ventricle datasets.

DetailsMotivation: Existing medical segmentation methods lack adaptability to dynamic scenarios, and adapting SAM2 for medical videos requires large datasets and risks catastrophic forgetting with high computational costs.

Method: Proposes DD-SAM2 with Depthwise-Dilated Adapter for multi-scale feature extraction, enabling efficient fine-tuning of SAM2’s streaming memory mechanism for medical videos with limited training data.

Result: Achieves Dice scores of 0.93 on TrackRad2025 (tumor segmentation) and 0.97 on EchoNet-Dynamic (left ventricle tracking), demonstrating superior performance.

Conclusion: Provides the first systematic exploration of adapter-based SAM2 fine-tuning for medical video segmentation and tracking, offering an efficient solution with minimal parameter overhead.

Abstract: Recent advances in medical image segmentation have been driven by deep learning; however, most existing methods remain limited by modality-specific designs and exhibit poor adaptability to dynamic medical imaging scenarios. The Segment Anything Model 2 (SAM2) and its related variants, which introduce a streaming memory mechanism for real-time video segmentation, present new opportunities for prompt-based, generalizable solutions. Nevertheless, adapting these models to medical video scenarios typically requires large-scale datasets for retraining or transfer learning, leading to high computational costs and the risk of catastrophic forgetting. To address these challenges, we propose DD-SAM2, an efficient adaptation framework for SAM2 that incorporates a Depthwise-Dilated Adapter (DD-Adapter) to enhance multi-scale feature extraction with minimal parameter overhead. This design enables effective fine-tuning of SAM2 on medical videos with limited training data. Unlike existing adapter-based methods focused solely on static images, DD-SAM2 fully exploits SAM2’s streaming memory for medical video object tracking and segmentation. Comprehensive evaluations on TrackRad2025 (tumor segmentation) and EchoNet-Dynamic (left ventricle tracking) datasets demonstrate superior performance, achieving Dice scores of 0.93 and 0.97, respectively. To the best of our knowledge, this work provides an initial attempt at systematically exploring adapter-based SAM2 fine-tuning for medical video segmentation and tracking. Code, datasets, and models will be publicly available at https://github.com/apple1986/DD-SAM2.
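
A depthwise-dilated adapter of the kind described can be sketched as a residual bottleneck with parallel depthwise convolutions at several dilation rates; the dimensions and rates below are assumptions, not the released design:

```python
# Minimal sketch of a depthwise-dilated adapter added residually to frozen
# trunk features (bottleneck width and dilation rates are assumptions).
import torch
import torch.nn as nn

class DDAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64, dilations=(1, 2, 4)):
        super().__init__()
        self.down = nn.Conv2d(dim, bottleneck, kernel_size=1)
        self.branches = nn.ModuleList(
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=d,
                      dilation=d, groups=bottleneck)   # depthwise, multi-scale
            for d in dilations
        )
        self.up = nn.Conv2d(bottleneck, dim, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.down(x))
        h = sum(b(h) for b in self.branches) / len(self.branches)
        return x + self.up(self.act(h))   # residual: frozen trunk + tiny adapter
```

Only the adapter parameters are trained, which is what keeps the parameter overhead minimal relative to full fine-tuning of SAM2.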

[360] Self-Supervised Continuous Colormap Recovery from a 2D Scalar Field Visualization without a Legend

Hongxu Liu, Xinyu Chen, Haoyang Zheng, Manyi Li, Zhenfan Liu, Fumeng Yang, Yunhai Wang, Changhe Tu, Qiong Zeng

Main category: cs.CV

TL;DR: A novel approach to recover continuous colormaps from 2D scalar field visualizations without color legends, using decoupling-and-reconstruction strategy with self-supervised optimization.

DetailsMotivation: Recovering continuous colormaps from single 2D scalar field visualizations is challenging, especially without color legends, which limits analysis and reuse of visualization techniques.

Method: Uses decoupling-and-reconstruction strategy: separates visualization into colormap and data, then reconstructs with differentiable color-mapping. Employs reconstruction loss for self-supervised optimization and introduces cubic B-spline representation with color order loss for smoothness.

Result: Method evaluated quantitatively and qualitatively on synthetic and real-world datasets (VIS30K), demonstrating effectiveness in colormap recovery and utility in applications like colormap adjustment and transfer.

Conclusion: The approach successfully recovers colormaps from visualizations without legends, generalizes to visualizations with legends and discrete palettes, and enables practical applications for visualization analysis and reuse.

Abstract: Recovering a continuous colormap from a single 2D scalar field visualization can be quite challenging, especially in the absence of a corresponding color legend. In this paper, we propose a novel colormap recovery approach that extracts the colormap from a color-encoded 2D scalar field visualization by simultaneously predicting the colormap and underlying data using a decoupling-and-reconstruction strategy. Our approach first separates the input visualization into colormap and data using a decoupling module, then reconstructs the visualization with a differentiable color-mapping module. To guide this process, we design a reconstruction loss between the input and reconstructed visualizations, which serves both as a constraint to ensure strong correlation between colormap and data during training, and as a self-supervised optimizer for fine-tuning the predicted colormap of unseen visualizations during inferencing. To ensure smoothness and correct color ordering in the extracted colormap, we introduce a compact colormap representation using cubic B-spline curves and an associated color order loss. We evaluate our method quantitatively and qualitatively on a synthetic dataset and a collection of real-world visualizations from the VIS30K dataset. Additionally, we demonstrate its utility in two prototype applications – colormap adjustment and colormap transfer – and explore its generalization to visualizations with color legends and ones encoded using discrete color palettes.
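
The compact colormap representation can be made concrete: sample an RGB curve from a uniform cubic B-spline over learnable control colors, and penalize order violations. In the sketch below, luminance monotonicity stands in for the paper's color order loss, and all details are assumptions rather than the authors' formulation:

```python
# Sketch: uniform cubic B-spline colormap + a simple order prior.
import torch

BSPLINE_M = torch.tensor([[-1, 3, -3, 1],
                          [ 3, -6, 3, 0],
                          [-3, 0, 3, 0],
                          [ 1, 4, 1, 0]], dtype=torch.float32) / 6.0

def sample_colormap(ctrl: torch.Tensor, n: int = 256) -> torch.Tensor:
    """ctrl: (K, 3) learnable control colors, K >= 4; returns (n, 3) colormap."""
    K = ctrl.shape[0]
    t = torch.linspace(0, K - 3 - 1e-6, n)       # global curve parameter
    seg = t.long()                               # segment index
    u = (t - seg).unsqueeze(1)                   # local parameter in [0, 1)
    U = torch.cat([u**3, u**2, u, torch.ones_like(u)], dim=1)      # (n, 4)
    P = torch.stack([ctrl[seg + k] for k in range(4)], dim=1)      # (n, 4, 3)
    return torch.einsum('nf,fk,nkc->nc', U, BSPLINE_M, P)

def color_order_loss(cmap: torch.Tensor) -> torch.Tensor:
    # penalize luminance reversals along the colormap (assumed order prior)
    lum = cmap @ torch.tensor([0.299, 0.587, 0.114])
    return torch.relu(-(lum[1:] - lum[:-1])).mean()
```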

[361] A Novel Image Similarity Metric for Scene Composition Structure

Md Redwanul Haque, Manzur Murshed, Manoranjan Paul, Tsz-Kwan Lee

Main category: cs.CV

TL;DR: SCSSIM is a novel training-free metric that quantifies Scene Composition Structure preservation in generative AI outputs using cuboidal hierarchical partitioning and statistical measures, outperforming traditional metrics in structural evaluation.

DetailsMotivation: Traditional image similarity metrics fail to adequately assess Scene Composition Structure (SCS) - the geometric relationships among objects and background. Pixel-level metrics are too sensitive to noise, perception-based metrics prioritize aesthetics, and neural metrics have training overheads.

Method: SCSSIM uses analytical, training-free approach with statistical measures derived from cuboidal hierarchical partitioning of images to robustly capture non-object-based structural relationships without neural network training.

Result: SCSSIM shows high invariance to non-compositional distortions (accurately reflecting unchanged SCS) and strong monotonic decrease for compositional distortions (precisely indicating altered SCS), outperforming existing metrics for structural evaluation.

Conclusion: SCSSIM is an invaluable training-free tool for developing and evaluating generative models, ensuring scene composition integrity by effectively quantifying structural preservation beyond human perception limitations.

Abstract: The rapid advancement of generative AI models necessitates novel methods for evaluating image quality that extend beyond human perception. A critical concern for these models is the preservation of an image’s underlying Scene Composition Structure (SCS), which defines the geometric relationships among objects and the background, their relative positions, sizes, orientations, etc. Maintaining SCS integrity is paramount for ensuring faithful and structurally accurate GenAI outputs. Traditional image similarity metrics often fall short in assessing SCS. Pixel-level approaches are overly sensitive to minor visual noise, while perception-based metrics prioritize human aesthetic appeal, neither adequately capturing structural fidelity. Furthermore, recent neural-network-based metrics introduce training overheads and potential generalization issues. We introduce the SCS Similarity Index Measure (SCSSIM), a novel, analytical, and training-free metric that quantifies SCS preservation by exploiting statistical measures derived from the Cuboidal hierarchical partitioning of images, robustly capturing non-object-based structural relationships. Our experiments demonstrate SCSSIM’s high invariance to non-compositional distortions, accurately reflecting unchanged SCS. Conversely, it shows a strong monotonic decrease for compositional distortions, precisely indicating when SCS has been altered. Compared to existing metrics, SCSSIM exhibits superior properties for structural evaluation, making it an invaluable tool for developing and evaluating generative models, ensuring the integrity of scene composition. Code: https://github.com/RedwanPlague/scssim
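
To convey the flavor of partition-based structural signatures (not the exact SCSSIM definition), one can recursively split an image into blocks, record per-block means level by level, and correlate the signatures of two images:

```python
# Hedged sketch of a hierarchical-partition signature; the real metric uses
# cuboidal partitioning and different statistics.
import numpy as np

def partition_signature(gray: np.ndarray, levels: int = 4) -> np.ndarray:
    feats = []
    H, W = gray.shape
    for lvl in range(levels):
        n = 2 ** lvl                      # n x n blocks at this level
        for i in range(n):
            for j in range(n):
                block = gray[i * H // n:(i + 1) * H // n,
                             j * W // n:(j + 1) * W // n]
                feats.append(block.mean())
    return np.array(feats)

def structural_similarity(img_a: np.ndarray, img_b: np.ndarray) -> float:
    sa, sb = partition_signature(img_a), partition_signature(img_b)
    return float(np.corrcoef(sa, sb)[0, 1])
```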

[362] SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion

Xiaoyang Zhang, Zhen Hua, Yakun Ju, Wei Zhou, Jun Liu, Alex C. Kot

Main category: cs.CV

TL;DR: SGDFuse is a novel infrared and visible image fusion method that combines SAM-generated semantic masks with conditional diffusion models to achieve high-fidelity, semantically-aware fusion results.

DetailsMotivation: Existing infrared and visible image fusion methods often fail to preserve key targets due to lack of deep semantic understanding and introduce artifacts/detail loss, compromising both image quality and downstream task performance.

Method: Two-stage conditional diffusion model guided by Segment Anything Model (SAM): 1) preliminary fusion of multi-modal features, 2) uses SAM semantic masks with preliminary fused image as condition to drive coarse-to-fine denoising generation.

Result: SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, with excellent adaptability to downstream tasks.

Conclusion: The method provides a powerful solution to core challenges in image fusion by ensuring explicit semantic directionality and high fidelity through SAM-guided conditional diffusion modeling.

Abstract: Infrared and visible image fusion (IVIF) aims to combine the thermal radiation information from infrared images with the rich texture details from visible images to enhance perceptual capabilities for downstream visual tasks. However, existing methods often fail to preserve key targets due to a lack of deep semantic understanding of the scene, while the fusion process itself can also introduce artifacts and detail loss, severely compromising both image quality and task performance. To address these issues, this paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), to achieve high-fidelity and semantically-aware image fusion. The core of our method is to utilize high-quality semantic masks generated by SAM as explicit priors to guide the optimization of the fusion process via a conditional diffusion model. Specifically, the framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks from SAM jointly with the preliminary fused image as a condition to drive the diffusion model’s coarse-to-fine denoising generation. This ensures the fusion process not only has explicit semantic directionality but also guarantees the high fidelity of the final result. Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in its adaptability to downstream tasks, providing a powerful solution to the core challenges in image fusion. The code of SGDFuse is available at https://github.com/boshizhang123/SGDFuse.
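
The conditioning pattern is standard for conditional diffusion: the denoiser receives the noisy target concatenated channel-wise with the preliminary fusion and the SAM mask. A minimal sketch (the channel layout and interface are assumptions):

```python
# Sketch of mask-and-image-conditioned denoising (a common design pattern,
# not the released SGDFuse architecture).
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    def __init__(self, unet: nn.Module):
        super().__init__()
        # unet is assumed to accept (B, 3+3+1, H, W): noisy + prelim + mask
        self.unet = unet

    def forward(self, noisy, prelim_fused, sam_mask, t):
        cond = torch.cat([noisy, prelim_fused, sam_mask], dim=1)
        return self.unet(cond, t)   # predicts the noise at timestep t
```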

[363] VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding

Jianxiang He, Meisheng Hong, Jungang Li, Yijie Xu, Ziyang Chen, Weiyu Guo, Hui Xiong

Main category: cs.CV

TL;DR: VSI is a multimodal keyframe search method that integrates visual and subtitle information to improve long video understanding, achieving state-of-the-art performance on video QA tasks.

DetailsMotivation: Existing keyframe retrieval methods for long video understanding suffer from weak multimodal alignment between text queries and visual content, and fail to capture complex temporal semantic information needed for precise reasoning.

Method: Visual-Subtitle Integration (VSI) with dual-stream search mechanism: Video Search Stream for visual information and Subtitle Match Stream for textual information from subtitles, timestamps, and scene boundaries, with interaction between both streams.

Result: Achieved 40.00% keyframe localization accuracy on LongVideoBench text-relevant subset and 68.48% accuracy on downstream long Video-QA tasks, surpassing baselines by 20.35% and 15.79% respectively. State-of-the-art performance on medium-to-long video-QA tasks.

Conclusion: VSI demonstrates robustness and generalizability through its unified multimodal search approach that effectively combines visual and textual information for improved long video understanding.

Abstract: Long video understanding presents a significant challenge to multimodal large language models (MLLMs) primarily due to the immense data scale. A critical and widely adopted strategy for making this task computationally tractable is keyframe retrieval, which seeks to identify a sparse set of video frames that are most salient to a given textual query. However, the efficacy of this approach is hindered by weak multimodal alignment between textual queries and visual content, and by a failure to capture the complex temporal semantic information required for precise reasoning. To address this, we propose Visual-Subtitle Integration (VSI), a multimodal keyframe search method that integrates subtitles, timestamps, and scene boundaries into a unified multimodal search process. The proposed method captures the visual information of video frames as well as the complementary textual information through a dual-stream search mechanism, a Video Search Stream and a Subtitle Match Stream, and improves keyframe search accuracy through the interaction of the two streams. Experimental results show that VSI achieves 40.00% keyframe localization accuracy on the text-relevant subset of LongVideoBench and 68.48% accuracy on downstream long Video-QA tasks, surpassing competitive baselines by 20.35% and 15.79%, respectively. Furthermore, on LongVideoBench, VSI achieved state-of-the-art (SOTA) results in medium-to-long video-QA tasks, demonstrating the robustness and generalizability of the proposed multimodal search strategy.
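
The dual-stream fusion reduces to combining two similarity profiles over time. A sketch, with the embedding models and the fusion weight `alpha` as assumptions:

```python
# Sketch of dual-stream keyframe scoring: visual stream + subtitle stream.
import numpy as np

def keyframe_scores(frame_emb, query_emb, sub_emb_per_frame, alpha=0.5):
    """frame_emb: (T, D); sub_emb_per_frame: (T, D), embedding of the subtitle
    active at each frame; query_emb: (D,). All assumed L2-normalized."""
    visual = frame_emb @ query_emb            # (T,) query-frame similarity
    textual = sub_emb_per_frame @ query_emb   # (T,) query-subtitle similarity
    return alpha * visual + (1 - alpha) * textual

def top_k_keyframes(scores: np.ndarray, k: int = 8) -> np.ndarray:
    return np.sort(np.argsort(scores)[-k:])   # keyframe indices in temporal order
```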

[364] Physical Autoregressive Model for Robotic Manipulation without Action Pretraining

Zijian Song, Sihan Qin, Tianshui Chen, Liang Lin, Guangrun Wang

Main category: cs.CV

TL;DR: PAR is a physical autoregressive model that combines video frames and actions as tokens, leveraging pretrained video generation models for robotic manipulation without requiring action pretraining.

DetailsMotivation: Addresses the scarcity of manipulation data by utilizing pretrained large models from other modalities, specifically video generation models, to understand physical dynamics in robotics.

Method: Uses physical tokens combining frames and actions, DiT-based de-tokenizer for continuous token modeling, causal mask with inverse kinematics, parallel training, and KV-cache mechanism for improved performance.

Result: Achieves 100% success rate on PushCube task in ManiSkill benchmark, matches action-pretrained baselines on other tasks, and produces accurate video predictions with aligned action trajectories.

Conclusion: Demonstrates promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining, enabling effective physical dynamics understanding without action-specific pretraining.

Abstract: The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings underscore a promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining. The project page is here: https://hcplab-sysu.github.io/PhysicalAutoregressiveModel/

[365] ComplicitSplat: Downstream Models are Vulnerable to Blackbox Attacks by 3D Gaussian Splat Camouflages

Matthew Hull, Haoyang Yang, Pratham Mehta, Mansi Phute, Aeree Cho, Haorang Wang, Matthew Lau, Wenke Lee, Wilian Lunardi, Martin Andreoni, Duen Horng Chau

Main category: cs.CV

TL;DR: ComplicitSplat is the first black-box attack that exploits 3D Gaussian Splatting shading methods to create viewpoint-specific camouflage, embedding adversarial content visible only from specific angles without needing model access.

DetailsMotivation: As 3DGS gains adoption in safety-critical tasks like autonomous navigation, there's a need to understand how adversaries might tamper with images to cause harm in mission-critical robotic systems.

Method: The attack exploits standard 3DGS shading methods to create viewpoint-specific camouflage - colors and textures that change with viewing angle to embed adversarial content in scene objects visible only from specific viewpoints.

Result: Extensive experiments show ComplicitSplat successfully attacks various popular detectors including single-stage, multi-stage, and transformer-based models on both real-world physical objects and synthetic scenes.

Conclusion: This exposes a novel safety risk for applications using 3DGS, demonstrating the first black-box attack on downstream object detectors using 3D Gaussian Splatting.

Abstract: As 3D Gaussian Splatting (3DGS) gains rapid adoption in safety-critical tasks for efficient novel-view synthesis from static images, how might an adversary tamper with images to cause harm? We introduce ComplicitSplat, the first attack that exploits standard 3DGS shading methods to create viewpoint-specific camouflage, colors and textures that change with viewing angle, embedding in scene objects adversarial content that is visible only from specific viewpoints, all without requiring access to model architecture or weights. Our extensive experiments show that ComplicitSplat generalizes to successfully attack a variety of popular detectors, including single-stage, multi-stage, and transformer-based models, on both real-world captures of physical objects and synthetic scenes. To our knowledge, this is the first black-box attack on downstream object detectors using 3DGS, exposing a novel safety risk for applications like autonomous navigation and other mission-critical robotic systems.

[366] Semantic Discrepancy-aware Detector for Image Forgery Identification

Ziye Wang, Minghang Yu, Chunyan Xu, Zhen Cui

Main category: cs.CV

TL;DR: SDD is a semantic discrepancy-aware detector that aligns forgery and semantic concept spaces using reconstruction learning to improve fake image detection.

DetailsMotivation: The misalignment between forgery detection and semantic concept spaces hinders performance in identifying fake images, requiring better space alignment.

Method: Uses semantic token sampling to mitigate irrelevant feature shifts, concept-level forgery discrepancy learning with visual reconstruction, and low-level forgery feature enhancement.

Result: Achieves superior results on two standard image forgery datasets compared to existing methods.

Conclusion: SDD effectively aligns semantic and forgery spaces, demonstrating strong performance in image forgery detection through semantic-guided discrepancy learning.

Abstract: With the rapid advancement of image generation techniques, robust forgery detection has become increasingly imperative to ensure the trustworthiness of digital media. Recent research indicates that the learned semantic concepts of pre-trained models are critical for identifying fake images. However, the misalignment between the forgery and semantic concept spaces hinders the model’s forgery detection performance. To address this problem, we propose a novel Semantic Discrepancy-aware Detector (SDD) that leverages reconstruction learning to align the two spaces at a fine-grained visual level. By exploiting the conceptual knowledge embedded in the pre-trained vision language model, we specifically design a semantic token sampling module to mitigate the space shifts caused by features irrelevant to both forgery traces and semantic concepts. A concept-level forgery discrepancy learning module, built upon a visual reconstruction paradigm, is proposed to strengthen the interaction between visual semantic concepts and forgery traces, effectively capturing discrepancies under the concepts’ guidance. Finally, the low-level forgery feature enhancer integrates the learned concept-level forgery discrepancies to minimize redundant forgery information. Experiments conducted on two standard image forgery datasets demonstrate the efficacy of the proposed SDD, which achieves superior results compared to existing methods. The code is available at https://github.com/wzy1111111/SSD.

[367] REVEAL – Reasoning and Evaluation of Visual Evidence through Aligned Language

Ipsita Praharaj, Yukta Butala, Badrikanath Praharaj, Yash Butala

Main category: cs.CV

TL;DR: REVEAL framework uses vision-language models for image forgery detection through holistic scene evaluation and region-wise anomaly analysis, showing strong generalization across different manipulation domains.

DetailsMotivation: Existing forgery detection methods struggle with generalization across domains and lack reasoning capabilities. The need for robust frameworks that can detect forgeries while providing explanations and localization.

Method: Prompt-driven visual reasoning using large vision-language models. Two approaches: (1) Holistic scene-level evaluation (physics, semantics, perspective, realism) and (2) Region-wise anomaly detection by splitting image into multiple regions for individual analysis.

Result: Experiments conducted across multiple domains (Photoshop, DeepFake, AIGC editing) show competitive performance against baselines with improved reasoning capabilities.

Conclusion: The REVEAL framework successfully leverages vision-language models for generalized forgery detection with reasoning and localization, addressing the generalization challenge across different manipulation types.

Abstract: The rapid advancement of generative models has intensified the challenge of detecting and interpreting visual forgeries, necessitating robust frameworks for image forgery detection while providing reasoning as well as localization. While existing works approach this problem using supervised training for specific manipulation or anomaly detection in the embedding space, generalization across domains remains a challenge. We frame this problem of forgery detection as a prompt-driven visual reasoning task, leveraging the semantic alignment capabilities of large vision-language models. We propose a framework, REVEAL (Reasoning and Evaluation of Visual Evidence through Aligned Language), that incorporates generalized guidelines. We propose two tangential approaches - (1) Holistic Scene-level Evaluation that relies on the physics, semantics, perspective, and realism of the image as a whole and (2) Region-wise anomaly detection that splits the image into multiple regions and analyzes each of them. We conduct experiments over datasets from different domains (Photoshop, DeepFake and AIGC editing). We compare the Vision Language Models against competitive baselines and analyze the reasoning provided by them.

[368] 4D Visual Pre-training for Robot Learning

Chengkai Hou, Yanjie Ze, Yankai Fu, Zeyu Gao, Songbo Hu, Yue Yu, Shanghang Zhang, Huazhe Xu

Main category: cs.CV

TL;DR: FVP is a 4D visual pre-training framework that uses next-point-cloud-prediction with diffusion models to enhance 3D representations for robotics, achieving 28% performance boost on real-world manipulation tasks.

DetailsMotivation: Current visual representations for robotics are mostly 2D-based, neglecting the 3D nature of the world. There's a need for better 3D representations but large-scale 3D data is scarce, so the authors seek a general pre-training framework to improve existing 3D representations.

Method: FVP frames visual pre-training as a next-point-cloud-prediction problem using diffusion models. It pre-trains on larger public datasets and can be applied to various point cloud encoders and robotic models.

Result: FVP boosts average success rate of 3D Diffusion Policy by 28% across twelve real-world manipulation tasks, achieves state-of-the-art performance, and enhances performance of larger Vision-Language-Action models like RDT-1B.

Conclusion: FVP provides an effective 4D visual pre-training framework that significantly improves 3D representations for robotics without requiring massive 3D datasets, demonstrating broad applicability across different architectures and tasks.

Abstract: General visual representations learned from web-scale datasets for robotics have achieved great success in recent years, enabling data-efficient robot learning on manipulation tasks; yet these pre-trained representations are mostly on 2D images, neglecting the inherent 3D nature of the world. However, due to the scarcity of large-scale 3D data, it is still hard to extract a universal 3D representation from web datasets. Instead, we are seeking a general visual pre-training framework that could improve all 3D representations as an alternative. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning. FVP frames the visual pre-training objective as a next-point-cloud-prediction problem, models the prediction model as a diffusion model, and pre-trains the model on the larger public datasets directly. Across twelve real-world manipulation tasks, FVP boosts the average success rate of 3D Diffusion Policy (DP3) for these tasks by 28%. The FVP pre-trained DP3 achieves state-of-the-art performance across imitation learning methods. Moreover, the efficacy of FVP adapts across various point cloud encoders and datasets. Finally, we apply FVP to the RDT-1B, a larger Vision-Language-Action robotic model, enhancing its performance on various robot tasks. Our project page is available at: https://4d-visual-pretraining.github.io/

[369] Robust and Label-Efficient Deep Waste Detection

Hassan Abid, Khan Muhammad, Muhammad Haris Khan

Main category: cs.CV

TL;DR: Strong baselines and ensemble-based semi-supervised learning framework for waste detection, achieving 51.6 mAP and surpassing fully supervised performance with pseudo-labeling.

DetailsMotivation: AI research in waste sorting lags behind commercial systems due to limited datasets and reliance on legacy object detectors, requiring better detection methods for sustainable recycling.

Method: Benchmarked OVOD models, fine-tuned transformer-based detectors, and proposed soft pseudo-labeling strategy with spatial and consensus-aware weighting for semi-supervised training.

Result: LLM-optimized prompts significantly enhanced zero-shot accuracy, achieved new baseline of 51.6 mAP, and pseudo-annotations surpassed fully supervised training performance.

Conclusion: Established rigorous baselines, introduced robust ensemble-based pseudo-labeling pipeline, generated high-quality annotations, and systematically evaluated OVOD models for real-world waste sorting applications.

Abstract: Effective waste sorting is critical for sustainable recycling, yet AI research in this domain continues to lag behind commercial systems due to limited datasets and reliance on legacy object detectors. In this work, we advance AI-driven waste detection by establishing strong baselines and introducing an ensemble-based semi-supervised learning framework. We first benchmark state-of-the-art Open-Vocabulary Object Detection (OVOD) models on the real-world ZeroWaste dataset, demonstrating that while class-only prompts perform poorly, LLM-optimized prompts significantly enhance zero-shot accuracy. Next, to address domain-specific limitations, we fine-tune modern transformer-based detectors, achieving a new baseline of 51.6 mAP. We then propose a soft pseudo-labeling strategy that fuses ensemble predictions using spatial and consensus-aware weighting, enabling robust semi-supervised training. Applied to the unlabeled ZeroWaste-s subset, our pseudo-annotations achieve performance gains that surpass fully supervised training, underscoring the effectiveness of scalable annotation pipelines. Our work contributes to the research community by establishing rigorous baselines, introducing a robust ensemble-based pseudo-labeling pipeline, generating high-quality annotations for the unlabeled ZeroWaste-s subset, and systematically evaluating OVOD models under real-world waste sorting conditions. Our code is available at: https://github.com/h-abid97/robust-waste-detection.
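
The ensemble fusion step can be sketched as IoU-based grouping with a weight that mixes detection confidence and cross-model consensus; the threshold and the exact weighting below are assumptions, not the paper's formula:

```python
# Hedged sketch of soft pseudo-label fusion across an ensemble of detectors.
import numpy as np

def iou(a, b):
    x1, y1 = np.maximum(a[:2], b[:2]); x2, y2 = np.minimum(a[2:], b[2:])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def fuse_ensemble(boxes_per_model, iou_thr=0.55):
    """boxes_per_model: list (one per model) of (N_i, 5) arrays [x1,y1,x2,y2,score]."""
    flat = [(m, b) for m, bs in enumerate(boxes_per_model) for b in bs]
    used, fused = set(), []
    for idx, (m, b) in enumerate(flat):
        if idx in used:
            continue
        group = [(m, b)]
        for jdx, (m2, b2) in enumerate(flat[idx + 1:], start=idx + 1):
            if jdx not in used and iou(b[:4], b2[:4]) > iou_thr:
                group.append((m2, b2)); used.add(jdx)
        # consensus: fraction of ensemble members that found this object
        consensus = len({g[0] for g in group}) / len(boxes_per_model)
        w = np.array([g[1][4] for g in group])            # detection confidences
        box = (np.stack([g[1][:4] for g in group]) * w[:, None]).sum(0) / w.sum()
        fused.append((box, float(w.mean() * consensus)))  # soft label weight
    return fused
```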

[370] AIM: Adaptive Intra-Network Modulation for Balanced Multimodal Learning

Shu Shen, C. L. Philip Chen, Tong Zhang

Main category: cs.CV

TL;DR: AIM addresses optimization bias in imbalanced multimodal learning by adaptively modulating parameters across network depths, achieving balanced learning without hindering any modality.

DetailsMotivation: Existing methods for imbalanced multimodal learning typically hinder dominant modalities to promote weaker ones, which negatively impacts overall performance due to overlooked optimization bias within networks.

Method: Proposes Adaptive Intra-Network Modulation (AIM) that decouples under-optimized parameters into Auxiliary Blocks and encourages reliance on these blocks for joint training. It assesses modality imbalance across depths and adaptively adjusts modulation strength.

Result: AIM outperforms state-of-the-art methods across multiple benchmarks and shows strong generalizability across different backbones, fusion strategies, and optimizers.

Conclusion: AIM successfully addresses optimization bias in multimodal networks, enabling balanced learning without performance degradation for any modality, representing a significant advancement in imbalanced multimodal learning.

Abstract: Multimodal learning has significantly enhanced machine learning performance but still faces numerous challenges and limitations. Imbalanced multimodal learning is one of the problems extensively studied in recent works and is typically mitigated by modulating the learning of each modality. However, we find that these methods typically hinder the dominant modality’s learning to promote weaker modalities, which affects overall multimodal performance. We analyze the cause of this issue and highlight a commonly overlooked problem: optimization bias within networks. To address this, we propose Adaptive Intra-Network Modulation (AIM) to improve balanced modality learning. AIM accounts for differences in optimization state across parameters and depths within the network during modulation, achieving balanced multimodal learning without hindering either dominant or weak modalities for the first time. Specifically, AIM decouples the dominant modality’s under-optimized parameters into Auxiliary Blocks and encourages reliance on these performance-degraded blocks for joint training with weaker modalities. This approach effectively prevents suppression of weaker modalities while enabling targeted optimization of under-optimized parameters to improve the dominant modality. Additionally, AIM assesses modality imbalance level across network depths and adaptively adjusts modulation strength at each depth. Experimental results demonstrate that AIM outperforms state-of-the-art imbalanced modality learning methods across multiple benchmarks and exhibits strong generalizability across different backbones, fusion strategies, and optimizers.

[371] What Can We Learn from Harry Potter? An Exploratory Study of Visual Representation Learning from Atypical Videos

Qiyue Sun, Qiming Huang, Yang Yang, Hongjun Wang, Jianbo Jiao

Main category: cs.CV

TL;DR: Atypical videos (sci-fi, animation, etc.) improve open-world learning performance in OOD detection, novel category discovery, and zero-shot action recognition tasks.

DetailsMotivation: Humans excel at generalizing from uncommon concepts in open-world settings, but most video studies focus on common typical data from closed sets, leaving open-world novel discovery underexplored.

Method: Collected a new video dataset of unusual atypical data and fed them into model training for representation learning. Evaluated on three open-world tasks: OOD detection, novel category discovery, and zero-shot action recognition.

Result: Straightforward learning approaches with atypical data consistently improved performance across all settings. Categorical diversity boosted OOD detection, semantic diversity improved novel category discovery, and atypical videos helped generalize better to unseen action classes.

Conclusion: Atypical videos provide significant benefits for visual representation learning in open-world scenarios, encouraging further research in this direction with the newly proposed dataset.

Abstract: Humans usually show exceptional generalisation and discovery ability in the open world, when being shown uncommon new concepts. Whereas most existing studies in the literature focus on common typical data from closed sets, open-world novel discovery is under-explored in videos. In this paper, we are interested in asking: What if atypical unusual videos are exposed in the learning process? To this end, we collect a new video dataset consisting of various types of unusual atypical data (e.g., sci-fi, animation, etc.). To study how such atypical data may benefit open-world learning, we feed them into the model training process for representation learning. Focusing on three key tasks in open-world learning: out-of-distribution (OOD) detection, novel category discovery (NCD), and zero-shot action recognition (ZSAR), we found that even straightforward learning approaches with atypical data consistently improve performance across various settings. Furthermore, we found that increasing the categorical diversity of the atypical samples further boosts OOD detection performance. Additionally, in the NCD task, using a smaller yet more semantically diverse set of atypical samples leads to better performance compared to using a larger but more typical dataset. In the ZSAR setting, the semantic diversity of atypical videos helps the model generalise better to unseen action classes. These observations in our extensive experimental evaluations reveal the benefits of atypical videos for visual representation learning in the open world, together with the newly proposed dataset, encouraging further studies in this direction. The project page is at: https://julysun98.github.io/atypical_dataset.

[372] MESTI-MEGANet: Micro-expression Spatio-Temporal Image and Micro-expression Gradient Attention Networks for Micro-expression Recognition

Luu Tu Nguyen, Vu Tram Anh Khuong, Thanh Ha Le, Thi Duyen Ngo

Main category: cs.CV

TL;DR: Proposes MESTI (novel dynamic image modality) and MEGANet (with Gradient Attention) for micro-expression recognition, achieving state-of-the-art results on CASMEII and SAMM datasets.

DetailsMotivation: Traditional input modalities like Apex Frame, Optical Flow, and Dynamic Image fail to adequately capture subtle and fleeting micro-expressions, leading to suboptimal performance in micro-expression recognition.

Method: Introduced MESTI to transform video sequences into single images preserving micro-movements, and developed MEGANet with Gradient Attention block to extract fine-grained motion features. Evaluated across multiple CNN architectures and compared with existing methods.

Result: MESTI consistently improved performance when replacing inputs of existing MER networks. MEGANet with MESTI achieved highest accuracy to date on CASMEII and SAMM datasets, setting new benchmark for micro-expression recognition.

Conclusion: MESTI is a superior input modality and MEGANet is an advanced recognition network that paves the way for more effective micro-expression recognition systems in various applications.

Abstract: Micro-expression recognition (MER) is a challenging task due to the subtle and fleeting nature of micro-expressions. Traditional input modalities, such as Apex Frame, Optical Flow, and Dynamic Image, often fail to adequately capture these brief facial movements, resulting in suboptimal performance. In this study, we introduce the Micro-expression Spatio-Temporal Image (MESTI), a novel dynamic input modality that transforms a video sequence into a single image while preserving the essential characteristics of micro-movements. Additionally, we present the Micro-expression Gradient Attention Network (MEGANet), which incorporates a novel Gradient Attention block to enhance the extraction of fine-grained motion features from micro-expressions. By combining MESTI and MEGANet, we aim to establish a more effective approach to MER. Extensive experiments were conducted to evaluate the effectiveness of MESTI, comparing it with existing input modalities across three CNN architectures (VGG19, ResNet50, and EfficientNetB0). Moreover, we demonstrate that replacing the input of previously published MER networks with MESTI leads to consistent performance improvements. The performance of MEGANet, both with MESTI and Dynamic Image, is also evaluated, showing that our proposed network achieves state-of-the-art results on the CASMEII and SAMM datasets. The combination of MEGANet and MESTI achieves the highest accuracy reported to date, setting a new benchmark for micro-expression recognition. These findings underscore the potential of MESTI as a superior input modality and MEGANet as an advanced recognition network, paving the way for more effective MER systems in a variety of applications.
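
For context on the baseline MESTI is measured against: the Dynamic Image modality compresses a clip into one image via approximate rank pooling (Bilen et al.). Below is a minimal NumPy sketch of that baseline; MESTI's own construction is not detailed in the abstract and is not reproduced here.

```python
import numpy as np

def approximate_dynamic_image(frames):
    """Approximate rank pooling: weight frame t by alpha_t = 2t - T - 1 and
    sum, yielding a single image that encodes temporal ordering.
    This is the Dynamic Image baseline, not MESTI itself."""
    frames = np.asarray(frames, dtype=np.float32)   # (T, H, W, C)
    T = frames.shape[0]
    alphas = 2.0 * np.arange(1, T + 1) - T - 1
    return np.tensordot(alphas, frames, axes=1)     # (H, W, C)
```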

[373] ER-LoRA: Effective-Rank Guided Adaptation for Weather-Generalized Depth Estimation

Weilong Yan, Xin Zhang, Robby T. Tan

Main category: cs.CV

TL;DR: Proposes STM strategy for weather-generalized monocular depth estimation using parameter-efficient fine-tuning of vision foundation models with minimal normal weather data

DetailsMotivation: Existing methods struggle with adverse weather depth estimation due to lack of reliable ground truth, domain gaps in synthetic data, and violated photometric assumptions in self-supervised learning

Method: Introduces Selecting-Tuning-Maintaining (STM) strategy that structurally decomposes pretrained weights using entropy-rank and stable-rank metrics, adaptively selects task-aware singular directions, and applies principal direction regularization

Result: Outperforms existing PEFT methods, full fine-tuning, and methods trained with adverse synthetic data across four real-world benchmarks in diverse weather conditions

Conclusion: STM enables effective weather-generalized depth estimation while preserving pretrained knowledge, demonstrating superior performance over current approaches

Abstract: Monocular depth estimation under adverse weather conditions (e.g., rain, fog, snow, and nighttime) remains highly challenging due to the lack of reliable ground truth and the difficulty of learning from unlabeled real-world data. Existing methods often rely on synthetic adverse data with pseudo-labels, which suffer from domain gaps, or employ self-supervised learning, which violates photometric assumptions in adverse scenarios. In this work, we propose to achieve weather-generalized depth estimation by Parameter-Efficient Fine-Tuning (PEFT) of Vision Foundation Models (VFMs), using only a small amount of high-visibility (normal) data. While PEFT has shown strong performance in semantic tasks such as segmentation, it remains underexplored for geometry-centric tasks like depth estimation – especially in terms of balancing effective adaptation with the preservation of pretrained knowledge. To this end, we introduce the Selecting-Tuning-Maintaining (STM) strategy, which structurally decomposes the pretrained weights of VFMs based on two kinds of effective ranks (entropy-rank and stable-rank). In the tuning phase, we adaptively select the proper rank number as well as the task-aware singular directions for initialization, based on the entropy-rank and full-tuned weight; while in the maintaining stage, we enforce a principal direction regularization based on the stable-rank. This design guarantees flexible task adaptation while preserving the strong generalization capability of the pretrained VFM. Extensive experiments on four real-world benchmarks across diverse weather conditions demonstrate that STM not only outperforms existing PEFT methods and full fine-tuning but also surpasses methods trained with adverse synthetic data, and even the depth foundation model.
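
The two effective ranks driving the Selecting phase have standard definitions (the Roy-Vetterli entropy rank and the stable rank as the Frobenius-to-spectral-norm ratio). A minimal PyTorch sketch, assuming the paper uses these textbook forms:

```python
import torch

def effective_ranks(W: torch.Tensor) -> tuple[float, float]:
    """Entropy-rank: exp of the Shannon entropy of the normalized singular
    values. Stable-rank: ||W||_F^2 / ||W||_2^2. Both estimate how many
    singular directions carry significant energy."""
    s = torch.linalg.svdvals(W)                  # descending singular values
    p = s / s.sum()
    entropy_rank = torch.exp(-(p * torch.log(p.clamp_min(1e-12))).sum())
    stable_rank = (s ** 2).sum() / (s[0] ** 2)
    return entropy_rank.item(), stable_rank.item()
```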

[374] Diffusion-Based Image-to-Brain Signal Generation with Cross-Attention Mechanisms for Visual Prostheses

Ganxi Xu, Jinyi Long, Jia Zhang

Main category: cs.CV

TL;DR: First image-to-brain signal framework using diffusion models with cross-attention to convert images to M/EEG signals for visual prostheses.

DetailsMotivation: Visual prostheses can restore vision, but while brain decoding (M/EEG to images) is explored, the complementary encoding stage (images to M/EEG signals) remains largely unaddressed.

Method: Uses denoising diffusion probabilistic models enhanced with cross-attention mechanisms, combining a pre-trained CLIP visual encoder for semantic extraction and a cross-attention U-Net diffusion model for biologically plausible brain signal reconstruction.

Result: Framework effectively generates biologically plausible brain signals, evaluated on THINGS-EEG2 and THINGS-MEG datasets, with pioneering visualization of M/EEG topographies showing intra- and inter-subject variations.

Conclusion: The proposed framework successfully addresses the image-to-brain signal conversion problem in visual prostheses, demonstrating effective generation of realistic brain signals and providing valuable insights into brain signal variations.

Abstract: Visual prostheses have shown great potential in restoring vision for blind individuals. However, while researchers have successfully utilized M/EEG signals to evoke visual perceptions during the brain decoding stage of visual prostheses, the complementary process – converting images to M/EEG signals in the brain encoding stage – remains largely unexplored. Thus, we present the first image-to-brain signal (M/EEG) framework based on denoising diffusion probabilistic models enhanced with cross-attention mechanisms. Our framework consists of two key architectural components: a pre-trained CLIP visual encoder that extracts rich semantic representations from input images, and a cross-attention enhanced U-Net diffusion model that learns to reconstruct biologically plausible brain signals through iterative denoising. Unlike conventional generative models that rely on simple concatenation for conditioning, our cross-attention modules enable dynamic interaction between visual features and brain signal representations, facilitating fine-grained alignment during the generation process. Furthermore, we evaluate our framework on two multimodal datasets (THINGS-EEG2 and THINGS-MEG) to demonstrate its effectiveness in generating biologically plausible brain signals. Additionally, we pioneer the visualization of M/EEG topographies across all subjects in both datasets, providing intuitive demonstrations of intra-subject and inter-subject variations in brain signals.
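
The conditioning pattern described, signal latents querying CLIP image tokens instead of plain concatenation, is standard cross-attention. A minimal PyTorch sketch under assumed tensor shapes (the paper's U-Net wiring may differ):

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Noisy brain-signal latents (queries) attend to CLIP image tokens
    (keys/values), letting visual semantics steer each denoising step."""
    def __init__(self, dim: int, ctx_dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim,
                                          vdim=ctx_dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) signal latents; ctx: (B, N, ctx_dim) CLIP tokens
        h, _ = self.attn(self.norm(x), ctx, ctx)
        return x + h  # residual keeps the denoising path stable
```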

[375] Kwai Keye-VL 1.5 Technical Report

Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Guowang Zhang, Han Shen, Hao Peng, Haojie Ding, Hao Wang, Haonan Fan, Hengrui Ju, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Muhao Wei, Qiang Wang, Ruitao Wang, Sen Na, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zeyi Lu, Zhenhua Wu, Zhixin Ling, Zhuoran Yang, Ziming Li, Di Xu, Haixuan Gao, Hang Li, Jing Wang, Lejian Ren, Qigen Hu, Qianqian Wang, Shiyao Wang, Xinchen Luo, Yan Li, Yuhang Hu, Zixing Zhang

Main category: cs.CV

TL;DR: Keye-VL-1.5 is a multimodal LLM that improves video understanding through innovative Slow-Fast encoding, progressive pre-training, and comprehensive post-training, achieving superior performance on video tasks.

DetailsMotivation: Video understanding remains challenging due to the dynamic nature of videos and the trade-off between spatial resolution and temporal coverage in existing models.

Method: Three key innovations: 1) Slow-Fast video encoding that dynamically allocates resources based on frame similarity, 2) Progressive four-stage pre-training extending context length from 8K to 128K tokens, 3) Comprehensive post-training pipeline with chain-of-thought data construction, GSPO-based RL, and alignment training.

Result: Significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks.

Conclusion: Keye-VL-1.5 successfully addresses fundamental challenges in video comprehension through its innovative architecture and training methodology, demonstrating state-of-the-art performance in video understanding.

Abstract: In recent years, the development of Large Language Models (LLMs) has significantly advanced, extending their capabilities to multimodal tasks through Multimodal Large Language Models (MLLMs). However, video understanding remains a challenging area due to the dynamic and information-dense nature of videos. Existing models struggle with the trade-off between spatial resolution and temporal coverage when processing video content. We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity, processing key frames with significant visual changes at higher resolution (Slow pathway) while handling relatively static frames with increased temporal coverage at lower resolution (Fast pathway). Second, we implement a progressive four-stage pre-training methodology that systematically extends the model’s context length from 8K to 128K tokens, enabling processing of longer videos and more complex visual content. Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment, incorporating a 5-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning with progressive prompt hinting for difficult cases, and alignment training. Through extensive evaluation on public benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks.
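
A minimal sketch of the Slow-Fast routing idea, assigning frames by similarity to the last high-resolution frame; the threshold and similarity metric here are assumptions, not the report's exact recipe:

```python
import torch

def route_frames(frames: torch.Tensor, threshold: float = 0.9):
    """Split frame indices into a Slow path (large visual change, high
    resolution) and a Fast path (near-static, low resolution)."""
    flat = frames.flatten(1).float()
    flat = flat / flat.norm(dim=1, keepdim=True)
    slow, fast = [0], []                         # first frame is always Slow
    for i in range(1, frames.shape[0]):
        sim = (flat[i] * flat[slow[-1]]).sum()   # cosine sim to last Slow frame
        (fast if sim > threshold else slow).append(i)
    return slow, fast
```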

[376] PractiLight: Practical Light Control Using Foundational Diffusion Models

Yotam Erel, Rishabh Dabral, Vladislav Golyanik, Amit H. Bermano, Christian Theobalt

Main category: cs.CV

TL;DR: PractiLight is a practical approach for controlling lighting in generated images by leveraging foundational generative models, using a lightweight LoRA regressor to produce irradiance maps and incorporating desired lighting through Classifier Guidance.

DetailsMotivation: Existing approaches for light control in generated images require extensive domain-specific datasets, limiting generalization and applicability of foundational models. The paper aims to develop a more practical and generalizable solution.

Method: Trains a lightweight LoRA regressor to produce direct irradiance maps from small training sets, then uses Classifier Guidance to incorporate desired lighting into image generation. Key insight is that lighting relationships are similar to token interactions in self-attention layers.

Result: Achieves state-of-the-art performance in quality and control with proven parameter and data efficiency across diverse scenes and image domains.

Conclusion: Image lighting can be effectively controlled by leveraging foundational knowledge in generative models, enabling practical and general relighting without extensive domain-specific training.

Abstract: Light control in generated images is a difficult task, posing specific challenges, spanning over the entire image and frequency spectrum. Most approaches tackle this problem by training on extensive yet domain-specific datasets, limiting the inherent generalization and applicability of the foundational backbones used. Instead, PractiLight is a practical approach, effectively leveraging foundational understanding of recent generative models for the task. Our key insight is that lighting relationships in an image are similar in nature to token interaction in self-attention layers, and hence are best represented there. Based on this and other analyses regarding the importance of early diffusion iterations, PractiLight trains a lightweight LoRA regressor to produce the direct irradiance map for a given image, using a small set of training images. We then employ this regressor to incorporate the desired lighting into the generation process of another image using Classifier Guidance. This careful design generalizes well to diverse conditions and image domains. We demonstrate state-of-the-art performance in terms of quality and control with proven parameter and data efficiency compared to leading works over a wide variety of scene types. We hope this work affirms that image lighting can feasibly be controlled by tapping into foundational knowledge, enabling practical and general relighting.
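
A minimal sketch of the Classifier Guidance step described above, assuming hypothetical `denoiser` and `regressor` callables and folding the noise-schedule factor into a single scale:

```python
import torch
import torch.nn.functional as F

def guided_noise(denoiser, regressor, x_t, t, target_irradiance, scale=1.0):
    """Nudge the diffusion step so the LoRA regressor's predicted irradiance
    map moves toward the desired lighting (classifier-guidance style)."""
    x_t = x_t.detach().requires_grad_(True)
    eps = denoiser(x_t, t)                    # base noise prediction
    irr = regressor(x_t, t)                   # predicted direct irradiance map
    loss = F.mse_loss(irr, target_irradiance)
    grad = torch.autograd.grad(loss, x_t)[0]
    return eps + scale * grad                 # guidance-adjusted prediction
```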

[377] Parameter-Efficient Adaptation of mPLUG-Owl2 via Pixel-Level Visual Prompts for NR-IQA

Yahya Benmahane, Mohammed El Hassouni

Main category: cs.CV

TL;DR: Proposes parameter-efficient NR-IQA method using pixel-space visual prompts, training only 600K parameters while keeping MLLM frozen, achieving competitive performance across multiple datasets.

DetailsMotivation: To enable efficient adaptation of Multimodal Large Language Models for No-Reference Image Quality Assessment without full fine-tuning, reducing computational costs while maintaining performance.

Method: Uses visual prompts optimized in pixel-space that are added to images during inference, processed by frozen mPLUG-Owl2 model with textual query “Rate the technical quality of the image.” Only trains 600K parameters (<0.01% of base model).

Result: Achieves competitive performance across synthetic, realistic, and AI-generated distortions on KADID-10k, KonIQ-10k, and AGIQA-3k datasets, with 0.93 SRCC on KADID-10k, comparable to full fine-tuning and specialized NR-IQA models.

Conclusion: First work to leverage pixel-space visual prompts for NR-IQA, demonstrating efficient MLLM adaptation for low-level vision tasks with minimal parameter updates while maintaining strong performance.

Abstract: In this paper, we propose a novel parameter-efficient adaptation method for No-Reference Image Quality Assessment (NR-IQA) using visual prompts optimized in pixel-space. Unlike full fine-tuning of Multimodal Large Language Models (MLLMs), our approach trains only 600K parameters at most (< 0.01% of the base model), while keeping the underlying model fully frozen. During inference, these visual prompts are combined with images via addition and processed by mPLUG-Owl2 with the textual query “Rate the technical quality of the image.” Evaluations across distortion types (synthetic, realistic, AI-generated) on KADID-10k, KonIQ-10k, and AGIQA-3k demonstrate competitive performance against fully fine-tuned methods and specialized NR-IQA models, achieving 0.93 SRCC on KADID-10k. To our knowledge, this is the first work to leverage pixel-space visual prompts for NR-IQA, enabling efficient MLLM adaptation for low-level vision tasks. The source code is publicly available at https://github.com/yahya-ben/mplug2-vp-for-nriqa.
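
The trainable component reduces to a single additive image-sized tensor. A minimal sketch; the 448×448 resolution is an assumption chosen because 3×448×448 ≈ 602K parameters, consistent with the reported budget:

```python
import torch
import torch.nn as nn

class PixelPrompt(nn.Module):
    """Learnable pixel-space prompt added to every input image; the MLLM
    itself stays frozen, so only this tensor receives gradients."""
    def __init__(self, h: int = 448, w: int = 448):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(1, 3, h, w))  # ~602K params

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return (images + self.delta).clamp(0.0, 1.0)
```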

cs.AI

[378] Attention of a Kiss: Exploring Attention Maps in Video Diffusion for XAIxArts

Adam Cole, Mick Grierson

Main category: cs.AI

TL;DR: Visualization tool for cross-attention mechanisms in video diffusion transformers, enabling artistic exploration and interpretability of text-to-video generation models.

DetailsMotivation: Inspired by analog video artists who manipulated signals for aesthetic creation, this research aims to make AI attention mechanisms accessible as both analytical tools and artistic material for creative exploration.

Method: Built on the open-source Wan model, the study develops a tool to extract and visualize cross-attention maps, using exploratory probes and artistic case studies to examine temporal and spatial attention behavior.

Result: The tool provides an interpretable window into attention mechanisms in text-to-video generation, demonstrating how attention maps can serve as both analytical tools and raw artistic material for creative expression.

Conclusion: This work contributes to Explainable AI for the Arts (XAIxArts), empowering artists to reclaim AI’s inner workings as a creative medium and bridging technical analysis with artistic practice.

Abstract: This paper presents an artistic and technical investigation into the attention mechanisms of video diffusion transformers. Inspired by early video artists who manipulated analog video signals to create new visual aesthetics, this study proposes a method for extracting and visualizing cross-attention maps in generative video models. Built on the open-source Wan model, our tool provides an interpretable window into the temporal and spatial behavior of attention in text-to-video generation. Through exploratory probes and an artistic case study, we examine the potential of attention maps as both analytical tools and raw artistic material. This work contributes to the growing field of Explainable AI for the Arts (XAIxArts), inviting artists to reclaim the inner workings of AI as a creative medium.
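
Extraction tools of this kind typically rely on forward hooks. A generic, hedged sketch: the "cross_attn" name filter is a guess at the Wan model's module naming, and capturing the softmax weights themselves would require hooking inside the attention block.

```python
import torch

def register_cross_attn_hooks(model, store: dict):
    """Record the output of every module whose name suggests cross-attention,
    for later visualization as per-token attention maps."""
    handles = []
    for name, module in model.named_modules():
        if "cross_attn" in name:                 # assumed naming convention
            def hook(mod, args, out, name=name):
                tensor = out[0] if isinstance(out, tuple) else out
                store[name] = tensor.detach().cpu()
            handles.append(module.register_forward_hook(hook))
    return handles
```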

[379] Perception Graph for Cognitive Attack Reasoning in Augmented Reality

Rongqian Chen, Shu Hong, Rifatul Islam, Mahdi Imani, G. Gary Tan, Tian Lan

Main category: cs.AI

TL;DR: The paper introduces a Perception Graph model to detect and analyze cognitive attacks on AR systems by quantifying perception distortion.

DetailsMotivation: AR systems in tactical environments are vulnerable to cognitive attacks that manipulate user perception and compromise decision-making, requiring new detection methods.

Method: Developed a Perception Graph model that mimics human interpretation of MR environments and represents outcomes using semantic structures to compute quantitative perception distortion scores.

Result: The model successfully computes quantitative scores reflecting perception distortion levels, providing a measurable method for detecting cognitive attacks.

Conclusion: The Perception Graph offers a robust framework for detecting and analyzing cognitive attacks on AR systems, enhancing security in tactical environments.

Abstract: Augmented reality (AR) systems are increasingly deployed in tactical environments, but their reliance on seamless human-computer interaction makes them vulnerable to cognitive attacks that manipulate a user’s perception and severely compromise user decision-making. To address this challenge, we introduce the Perception Graph, a novel model designed to reason about human perception within these systems. Our model operates by first mimicking the human process of interpreting key information from an MR environment and then representing the outcomes using a semantically meaningful structure. We demonstrate how the model can compute a quantitative score that reflects the level of perception distortion, providing a robust and measurable method for detecting and analyzing the effects of such cognitive attacks.

[380] SynDelay: A Synthetic Dataset for Delivery Delay Prediction

Liming Xu, Yunbo Long, Alexandra Brintrup

Main category: cs.AI

TL;DR: SynDelay is a synthetic dataset for delivery delay prediction, created using generative AI trained on real data to address the scarcity of open supply chain datasets while preserving privacy.

DetailsMotivation: Progress in AI for supply chain management is constrained by scarce high-quality open datasets. Existing datasets are often proprietary, small, or inconsistently maintained, hindering reproducibility and benchmarking.

Method: Generated using an advanced generative model trained on real-world data to preserve realistic delivery patterns while ensuring privacy protection.

Result: SynDelay provides a challenging and practical testbed for delivery delay prediction research, with baseline results and evaluation metrics provided as initial benchmarks (not state-of-the-art claims).

Conclusion: The dataset is publicly available through Supply Chain Data Hub to promote open dataset sharing and benchmarking, encouraging community contributions to advance supply chain AI research.

Abstract: Artificial intelligence (AI) is transforming supply chain management, yet progress in predictive tasks – such as delivery delay prediction – remains constrained by the scarcity of high-quality, openly available datasets. Existing datasets are often proprietary, small, or inconsistently maintained, hindering reproducibility and benchmarking. We present SynDelay, a synthetic dataset designed for delivery delay prediction. Generated using an advanced generative model trained on real-world data, SynDelay preserves realistic delivery patterns while ensuring privacy. Although not entirely free of noise or inconsistencies, it provides a challenging and practical testbed for advancing predictive modelling. To support adoption, we provide baseline results and evaluation metrics as initial benchmarks, serving as reference points rather than state-of-the-art claims. SynDelay is publicly available through the Supply Chain Data Hub, an open initiative promoting dataset sharing and benchmarking in supply chain AI. We encourage the community to contribute datasets, models, and evaluation practices to advance research in this area. All code is openly accessible at https://supplychaindatahub.org.

[381] MVRS: The Multimodal Virtual Reality Stimuli-based Emotion Recognition Dataset

Seyed Muhammad Hossein Mousavi, Atiye Ilanloo

Main category: cs.AI

TL;DR: MVRS dataset introduces synchronized multimodal emotional data (eye tracking, body motion, EMG, GSR) from VR-based emotional stimuli, with feature fusion and classifier evaluation showing emotion separability.

DetailsMotivation: Address the lack of multimodal datasets combining body motion and physiological signals for emotion recognition, which limits progress in AI applications for healthcare, education, and automotive systems.

Method: Collected synchronized data from 13 participants (12-60 years) exposed to VR emotional stimuli using eye tracking (webcam in VR headset), body motion (Kinect v2), and physiological signals (EMG and GSR via Arduino UNO). Features extracted from each modality and fused using early and late fusion techniques.

Result: The dataset quality and emotion separability were confirmed through evaluation with classifiers, demonstrating the effectiveness of the multimodal approach for emotion recognition.

Conclusion: MVRS dataset represents a valuable contribution to multimodal affective computing by providing synchronized, high-quality emotional data that enables better emotion recognition research and applications.

Abstract: Automatic emotion recognition has become increasingly important with the rise of AI, especially in fields like healthcare, education, and automotive systems. However, there is a lack of multimodal datasets, particularly involving body motion and physiological signals, which limits progress in the field. To address this, the MVRS dataset is introduced, featuring synchronized recordings from 13 participants aged 12 to 60 exposed to VR-based emotional stimuli (relaxation, fear, stress, sadness, joy). Data were collected using eye tracking (via webcam in a VR headset), body motion (Kinect v2), and EMG and GSR signals (Arduino UNO), all timestamp-aligned. Participants followed a unified protocol with consent and questionnaires. Features from each modality were extracted, fused using early and late fusion techniques, and evaluated with classifiers to confirm the dataset’s quality and emotion separability, making MVRS a valuable contribution to multimodal affective computing.
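
The early/late fusion mentioned above follows the usual pattern: concatenate per-modality features before a single classifier, or combine per-modality classifier scores afterwards. A minimal NumPy sketch of both:

```python
import numpy as np

def early_fusion(feature_sets):
    """Join per-modality feature matrices column-wise; train one classifier."""
    return np.concatenate(feature_sets, axis=1)

def late_fusion(prob_sets, weights=None):
    """Weighted average of per-modality class-probability matrices."""
    weights = weights or [1.0 / len(prob_sets)] * len(prob_sets)
    return sum(w * p for w, p in zip(weights, prob_sets))
```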

[382] Benchmarking Large Language Models for Personalized Guidance in AI-Enhanced Learning

Bo Yuan, Jiazi Hu

Main category: cs.AI

TL;DR: Empirical comparison of GPT-4o, DeepSeek-V3, and GLM-4.5 LLMs in tutoring tasks shows GPT-4o produces superior feedback for personalized learning.

DetailsMotivation: There's a need for systematic head-to-head evaluations of LLMs as intelligent assistants in authentic learning scenarios to assess their effectiveness in personalized education.

Method: Used a dataset of student answers with correctness labels, required LLMs to analyze quizzes, infer mastery profiles, and generate guidance. Employed Gemini as virtual judge for pairwise comparisons across accuracy, clarity, actionability, and appropriateness dimensions using Bradley-Terry model.

Result: GPT-4o was generally preferred, producing more informative and better structured feedback than DeepSeek-V3 and GLM-4.5, which showed intermittent strengths but lower consistency.

Conclusion: LLMs are feasible as advanced teaching assistants for individualized support, and the study provides methodological guidance for future empirical research on LLM-driven personalized learning.

Abstract: While Large Language Models (LLMs) are increasingly envisioned as intelligent assistants for personalized learning, systematic head-to-head evaluations within authentic learning scenarios remain limited. This study conducts an empirical comparison of three state-of-the-art LLMs on a tutoring task that simulates a realistic learning setting. Using a dataset comprising a student’s answers to ten questions of mixed formats with correctness labels, each LLM is required to (i) analyze the quiz to identify underlying knowledge components, (ii) infer the student’s mastery profile, and (iii) generate targeted guidance for improvement. To mitigate subjectivity and evaluator bias, we employ Gemini as a virtual judge to perform pairwise comparisons along various dimensions: accuracy, clarity, actionability, and appropriateness. Results analyzed via the Bradley-Terry model indicate that GPT-4o is generally preferred, producing feedback that is more informative and better structured than its counterparts, while DeepSeek-V3 and GLM-4.5 demonstrate intermittent strengths but lower consistency. These findings highlight the feasibility of deploying LLMs as advanced teaching assistants for individualized support and provide methodological guidance for future empirical research on LLM-driven personalized learning.
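
The Bradley-Terry step turns Gemini's pairwise verdicts into per-model strengths. A minimal sketch using the standard MM (Zermelo) update, assuming a win-count matrix as input:

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths p from wins[i, j] = times i beat j.
    Assumes every model wins at least one comparison."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            total_wins = wins[i].sum()
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            p[i] = total_wins / denom
        p /= p.sum()                       # fix the arbitrary scale
    return p
```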

[383] SasAgent: Multi-Agent AI System for Small-Angle Scattering Data Analysis

Lijie Ding, Changwoo Do

Main category: cs.AI

TL;DR: SasAgent is an LLM-powered multi-agent system that automates small-angle scattering data analysis using SasView tools through a text-based interface, with specialized agents for SLD calculation, data generation, and fitting.

DetailsMotivation: To streamline and automate small-angle scattering (SAS) data analysis workflows by leveraging large language models to interpret user prompts and execute complex scientific tasks through specialized agents.

Method: Uses a coordinator agent with three specialized agents (SLD calculation, synthetic data generation, experimental data fitting) that employ LLM-friendly tools from SasView Python library, including model data tool, RAG documentation, bump fitting, and SLD calculator, with a Gradio-based interface.

Result: Demonstrated ability to interpret complex prompts, calculate scattering length densities, generate accurate scattering data, and fit experimental datasets with high precision through diverse examples.

Conclusion: Shows the potential of LLM-driven AI systems to enhance automation and streamline scientific workflows in SAS research, making complex data analysis more accessible through natural language interaction.

Abstract: We introduce SasAgent, a multi-agent AI system powered by large language models (LLMs) that automates small-angle scattering (SAS) data analysis by leveraging tools from the SasView software and enables user interaction via text input. SasAgent features a coordinator agent that interprets user prompts and delegates tasks to three specialized agents for scattering length density (SLD) calculation, synthetic data generation, and experimental data fitting. These agents utilize LLM-friendly tools to execute tasks efficiently. These tools, including the model data tool, Retrieval-Augmented Generation (RAG) documentation tool, bump fitting tool, and SLD calculator tool, are derived from the SasView Python library. A user-friendly Gradio-based interface enhances user accessibility. Through diverse examples, we demonstrate SasAgent’s ability to interpret complex prompts, calculate SLDs, generate accurate scattering data, and fit experimental datasets with high precision. This work showcases the potential of LLM-driven AI systems to streamline scientific workflows and enhance automation in SAS research.

[384] Characterizing Fitness Landscape Structures in Prompt Engineering

Arend Hintze

Main category: cs.AI

TL;DR: Analysis of prompt engineering optimization landscapes reveals fundamentally different topologies between systematic and diversified prompt generation approaches, with varying ruggedness across different error detection tasks.

DetailsMotivation: Current prompt engineering approaches treat optimization as a black-box problem without understanding the underlying landscape topology, making it difficult to develop effective optimization strategies.

Method: Used autocorrelation analysis across semantic embedding spaces to analyze fitness landscape structures, comparing systematic enumeration (1,024 prompts) and novelty-driven diversification (1,000 prompts) on error detection tasks across 10 categories.

Result: Systematic prompt generation shows smoothly decaying autocorrelation, while diversified generation exhibits non-monotonic patterns with peak correlation at intermediate semantic distances, indicating rugged, hierarchically structured landscapes. Different error types show varying degrees of ruggedness.

Conclusion: The study provides empirical foundation for understanding optimization complexity in prompt engineering, revealing fundamentally different landscape topologies that should inform future optimization algorithm design.

Abstract: While prompt engineering has emerged as a crucial technique for optimizing large language model performance, the underlying optimization landscape remains poorly understood. Current approaches treat prompt optimization as a black-box problem, applying sophisticated search algorithms without characterizing the landscape topology they navigate. We present a systematic analysis of fitness landscape structures in prompt engineering using autocorrelation analysis across semantic embedding spaces. Through experiments on error detection tasks with two distinct prompt generation strategies – systematic enumeration (1,024 prompts) and novelty-driven diversification (1,000 prompts) – we reveal fundamentally different landscape topologies. Systematic prompt generation yields smoothly decaying autocorrelation, while diversified generation exhibits non-monotonic patterns with peak correlation at intermediate semantic distances, indicating rugged, hierarchically structured landscapes. Task-specific analysis across 10 error detection categories reveals varying degrees of ruggedness across different error types. Our findings provide an empirical foundation for understanding the complexity of optimization in prompt engineering landscapes.
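
A minimal sketch of distance-binned fitness autocorrelation over an embedding space, one plausible reading of the analysis; the paper's exact estimator may differ:

```python
import numpy as np

def fitness_autocorrelation(embeddings, fitness, bins=20):
    """Correlate fitness across prompt pairs, binned by semantic distance.
    Smooth decay suggests a gentle landscape; non-monotonic curves suggest
    rugged, hierarchical structure."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    D = 1.0 - X @ X.T                              # pairwise cosine distance
    r, c = np.triu_indices_from(D, k=1)
    d, fi, fj = D[r, c], fitness[r], fitness[c]
    edges = np.quantile(d, np.linspace(0, 1, bins + 1))
    corrs = []
    for k in range(bins):
        hi = edges[k + 1]
        m = (d >= edges[k]) & ((d <= hi) if k == bins - 1 else (d < hi))
        corrs.append(np.corrcoef(fi[m], fj[m])[0, 1] if m.sum() > 2 else np.nan)
    return edges, np.array(corrs)
```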

[385] Code Like Humans: A Multi-Agent Solution for Medical Coding

Andreas Motzfeldt, Joakim Edin, Casper L. Christensen, Christian Hardmeier, Lars Maaløe, Anna Rogers

Main category: cs.AI

TL;DR: Code Like Humans is a new agentic framework using LLMs for medical coding that implements official guidelines and supports the full ICD-10 system with 70K+ labels, achieving best performance on rare diagnosis codes.

DetailsMotivation: Medical coding requires experts to map unstructured clinical notes to standardized codes, but existing solutions are limited in scope and performance, especially for rare codes.

Method: An agentic framework using large language models that implements official coding guidelines for human experts, supporting the complete ICD-10 coding system.

Result: Achieves best performance to date on rare diagnosis codes, though fine-tuned discriminative classifiers still have an advantage for high-frequency codes. Also identifies systematic undercoding of certain codes.

Conclusion: The framework successfully addresses medical coding challenges, particularly for rare codes, and provides analysis of system performance and blind spots for future improvement.

Abstract: In medical coding, experts map unstructured clinical notes to alphanumeric codes for diagnoses and procedures. We introduce Code Like Humans: a new agentic framework for medical coding with large language models. It implements official coding guidelines for human experts, and it is the first solution that can support the full ICD-10 coding system (+70K labels). It achieves the best performance to date on rare diagnosis codes (fine-tuned discriminative classifiers retain an advantage for high-frequency codes, to which they are limited). Towards future work, we also contribute an analysis of system performance and identify its `blind spots’ (codes that are systematically undercoded).

[386] Murphys Laws of AI Alignment: Why the Gap Always Wins

Madhava Gaikwad

Main category: cs.AI

TL;DR: The paper introduces the Alignment Gap concept to explain recurring failures in AI alignment methods like RLHF and DPO, showing how optimization pressure amplifies divergence from true human intent through a KL-tilting framework.

DetailsMotivation: Existing alignment methods (RLHF, DPO, Constitutional AI, RLAIF) exhibit recurring failure patterns including reward hacking, sycophancy, annotator drift, and misgeneralization, indicating systematic issues in feedback-based alignment approaches.

Method: Uses KL-tilting formalism to illustrate optimization pressure effects, organizes failures into a catalogue of “Murphy’s Laws of AI Alignment,” proposes the Alignment Trilemma framework, and conducts small-scale empirical studies for support.

Result: Demonstrates that optimization pressure tends to amplify divergence between proxy rewards and true human intent, revealing structural trade-offs among optimization strength, value capture, and generalization.

Conclusion: Proposes the MAPS framework (Misspecification, Annotation, Pressure, Shift) as practical design levers and reframes alignment debates around structural limits and trade-offs to guide future design, rather than presenting an impossibility theorem.

Abstract: Large language models are increasingly aligned to human preferences through reinforcement learning from human feedback (RLHF) and related methods such as Direct Preference Optimization (DPO), Constitutional AI, and RLAIF. While effective, these methods exhibit recurring failure patterns, i.e., reward hacking, sycophancy, annotator drift, and misgeneralization. We introduce the concept of the Alignment Gap, a unifying lens for understanding recurring failures in feedback-based alignment. Using a KL-tilting formalism, we illustrate why optimization pressure tends to amplify divergence between proxy rewards and true human intent. We organize these failures into a catalogue of Murphy’s Laws of AI Alignment, and propose the Alignment Trilemma as a way to frame trade-offs among optimization strength, value capture, and generalization. Small-scale empirical studies serve as illustrative support. Finally, we propose the MAPS framework (Misspecification, Annotation, Pressure, Shift) as practical design levers. Our contribution is not a definitive impossibility theorem but a perspective that reframes alignment debates around structural limits and trade-offs, offering clearer guidance for future design.
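
For readers unfamiliar with exponential tilting, the core identity behind the amplification claim can be stated compactly (our paraphrase in standard notation, not the paper's exact formalism):

$$\pi_\beta(y) = \frac{\pi_0(y)\, e^{\beta r(y)}}{Z_\beta}, \qquad \frac{d}{d\beta}\, \mathrm{KL}\left(\pi_\beta \,\middle\|\, \pi_0\right) = \beta\, \mathrm{Var}_{\pi_\beta}\left[r(y)\right] \ge 0 \quad (\beta \ge 0),$$

so raising the optimization pressure $\beta$ against a proxy reward $r$ moves the tilted policy monotonically away from the base model, and any mismatch between $r$ and true intent is amplified along the way.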

[387] From Image Generation to Infrastructure Design: a Multi-agent Pipeline for Street Design Generation

Chenguang Wang, Xiang Yan, Yilong Dai, Ziyi Wang, Susu Xu

Main category: cs.AI

TL;DR: AI-powered multi-agent system for generating realistic street-design scenarios with bicycle facilities, enabling rapid visualization for public engagement in transportation planning.

DetailsMotivation: Traditional street-design visualization is labor-intensive and hinders public engagement. AI-assisted generative design shows potential but struggles with precise spatial variations and requires large domain-specific training data.

Method: Multi-agent system that edits bicycle facilities directly on real-world street-view imagery, integrating lane localization, prompt optimization, design generation, and automated evaluation.

Result: System adapts to diverse urban scenarios, varying road geometries, and environmental conditions, producing visually coherent and instruction-compliant designs.

Conclusion: Establishes foundation for applying multi-agent pipelines to transportation infrastructure planning and facility design, enabling more effective public engagement.

Abstract: Realistic visual renderings of street-design scenarios are essential for public engagement in active transportation planning. Traditional approaches are labor-intensive, hindering collective deliberation and collaborative decision-making. While AI-assisted generative design shows transformative potential by enabling rapid creation of design scenarios, existing generative approaches typically require large amounts of domain-specific training data and struggle to enable precise spatial variations of design/configuration in complex street-view scenes. We introduce a multi-agent system that edits and redesigns bicycle facilities directly on real-world street-view imagery. The framework integrates lane localization, prompt optimization, design generation, and automated evaluation to synthesize realistic, contextually appropriate designs. Experiments across diverse urban scenarios demonstrate that the system can adapt to varying road geometries and environmental conditions, consistently yielding visually coherent and instruction-compliant results. This work establishes a foundation for applying multi-agent pipelines to transportation infrastructure planning and facility design.

[388] TreeGPT: A Novel Hybrid Architecture for Abstract Syntax Tree Processing with Global Parent-Child Aggregation

Zixi Li

Main category: cs.AI

TL;DR: TreeGPT is a novel neural architecture combining transformer attention with global parent-child aggregation for processing Abstract Syntax Trees in program synthesis, achieving 96% accuracy on ARC Prize 2025 with only 1.5M parameters.

DetailsMotivation: Traditional approaches for AST processing rely on sequential processing or graph neural networks, lacking effective modeling of hierarchical tree structures and global dependencies in program synthesis tasks.

Method: Hybrid architecture using self-attention for local dependencies and Tree Feed-Forward Network with Global Parent-Child Aggregation mechanism for hierarchical structures. Includes gated aggregation, residual connections, and bidirectional propagation.

Result: Achieves 96% accuracy on ARC Prize 2025 dataset, significantly outperforming transformer baselines (1.3%), Grok-4 (15.9%), and SOAR (52%) with only 1.5M parameters. Edge projection identified as most critical component.

Conclusion: TreeGPT demonstrates that combining transformer attention with specialized tree aggregation mechanisms enables highly efficient and effective processing of hierarchical structures for program synthesis tasks.

Abstract: We introduce TreeGPT, a novel neural architecture that combines transformer-based attention mechanisms with global parent-child aggregation for processing Abstract Syntax Trees (ASTs) in neural program synthesis tasks. Unlike traditional approaches that rely solely on sequential processing or graph neural networks, TreeGPT employs a hybrid design that leverages both self-attention for capturing local dependencies and a specialized Tree Feed-Forward Network (TreeFFN) for modeling hierarchical tree structures through iterative message passing. The core innovation lies in our Global Parent-Child Aggregation mechanism, formalized as: $$h_i^{(t+1)} = \sigma \Big( h_i^{(0)} + W_{pc} \sum_{(p,c) \in E_i} f(h_p^{(t)}, h_c^{(t)}) + b \Big)$$ where $h_i^{(t)}$ represents the hidden state of node $i$ at iteration $t$, $E_i$ denotes all parent-child edges involving node $i$, and $f(h_p, h_c)$ is an edge aggregation function. This formulation enables each node to progressively aggregate information from the entire tree structure through $T$ iterations. Our architecture integrates optional enhancements including gated aggregation with learnable edge weights, residual connections for gradient stability, and bidirectional propagation for capturing both bottom-up and top-down dependencies. We evaluate TreeGPT on the ARC Prize 2025 dataset, a challenging visual reasoning benchmark requiring abstract pattern recognition and rule inference. Experimental results demonstrate that TreeGPT achieves 96% accuracy, significantly outperforming transformer baselines (1.3%), large-scale models like Grok-4 (15.9%), and specialized program synthesis methods like SOAR (52%) while using only 1.5M parameters. Our comprehensive ablation study reveals that edge projection is the most critical component, with the combination of edge projection and gating achieving optimal performance.
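
A minimal PyTorch sketch of the aggregation update above; the choices of $f$ as a linear map over concatenated states and $\sigma$ as a sigmoid are assumptions where the abstract leaves them open:

```python
import torch
import torch.nn as nn

class GlobalParentChildAggregation(nn.Module):
    """Iterate h_i <- sigma(h_i^0 + W_pc * sum_{(p,c) in E_i} f(h_p, h_c) + b),
    letting every node absorb information from the whole tree over T rounds."""
    def __init__(self, dim: int, iters: int = 4):
        super().__init__()
        self.f = nn.Linear(2 * dim, dim)    # edge function f(h_p, h_c), assumed
        self.W_pc = nn.Linear(dim, dim)     # W_pc together with bias b
        self.iters = iters

    def forward(self, h0: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # h0: (num_nodes, dim); edges: (num_edges, 2) long rows of (parent, child)
        h = h0
        for _ in range(self.iters):
            msg = self.f(torch.cat([h[edges[:, 0]], h[edges[:, 1]]], dim=-1))
            agg = torch.zeros_like(h)
            agg.index_add_(0, edges[:, 0], msg)   # credit message to the parent
            agg.index_add_(0, edges[:, 1], msg)   # ...and to the child (E_i)
            h = torch.sigmoid(h0 + self.W_pc(agg))
        return h
```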

[389] OccVLA: Vision-Language-Action Model with Implicit 3D Occupancy Supervision

Ruixun Liu, Lingyu Kong, Derun Li, Hang Zhao

Main category: cs.AI

TL;DR: OccVLA integrates 3D occupancy representations into multimodal reasoning for autonomous driving, achieving state-of-the-art performance on nuScenes benchmark without extra computational overhead during inference.

DetailsMotivation: MLLMs lack robust 3D spatial understanding critical for autonomous driving due to challenges in constructing accessible 3D representations and loss of fine-grained spatial details from absence of large-scale 3D vision-language pretraining.

Method: Proposes OccVLA framework that treats dense 3D occupancy as both predictive output and supervisory signal, enabling learning of fine-grained spatial structures directly from 2D visual inputs. Occupancy predictions serve as implicit reasoning processes that can be skipped during inference.

Result: Achieves state-of-the-art results on nuScenes benchmark for trajectory planning and demonstrates superior performance on 3D visual question-answering tasks.

Conclusion: OccVLA offers a scalable, interpretable, and fully vision-based solution for autonomous driving that addresses 3D spatial understanding limitations without expensive manual annotations or extra computational costs.

Abstract: Multimodal large language models (MLLMs) have shown strong vision-language reasoning abilities but still lack robust 3D spatial understanding, which is critical for autonomous driving. This limitation stems from two key challenges: (1) the difficulty of constructing accessible yet effective 3D representations without expensive manual annotations, and (2) the loss of fine-grained spatial details in VLMs due to the absence of large-scale 3D vision-language pretraining. To address these challenges, we propose OccVLA, a novel framework that integrates 3D occupancy representations into a unified multimodal reasoning process. Unlike prior approaches that rely on explicit 3D inputs, OccVLA treats dense 3D occupancy as both a predictive output and a supervisory signal, enabling the model to learn fine-grained spatial structures directly from 2D visual inputs. The occupancy predictions are regarded as implicit reasoning processes and can be skipped during inference without performance degradation, thereby adding no extra computational overhead. OccVLA achieves state-of-the-art results on the nuScenes benchmark for trajectory planning and demonstrates superior performance on 3D visual question-answering tasks, offering a scalable, interpretable, and fully vision-based solution for autonomous driving.

[390] PillagerBench: Benchmarking LLM-Based Agents in Competitive Minecraft Team Environments

Olivier Schipper, Yudi Zhang, Yali Du, Mykola Pechenizkiy, Meng Fang

Main category: cs.AI

TL;DR: PillagerBench is a new framework for evaluating multi-agent systems in competitive Minecraft team scenarios, featuring TactiCrafter - an LLM-based system that outperforms baselines through adaptive learning and strategic teamwork.

DetailsMotivation: LLM-based agents show promise in cooperative tasks but their effectiveness in competitive multi-agent environments remains underexplored, creating a need for proper evaluation frameworks.

Method: Developed PillagerBench framework with extensible API, multi-round testing, and rule-based opponents. Created TactiCrafter - an LLM-based multi-agent system that uses human-readable tactics, learns causal dependencies, and adapts to opponent strategies through self-play.

Result: TactiCrafter outperforms baseline approaches and demonstrates adaptive learning capabilities. The system shows strategic evolution over multiple game episodes.

Conclusion: The framework enables fair, reproducible comparisons in competitive multi-agent environments. Open-sourcing PillagerBench aims to foster further research advancements in competitive multi-agent AI systems.

Abstract: LLM-based agents have shown promise in various cooperative and strategic reasoning tasks, but their effectiveness in competitive multi-agent environments remains underexplored. To address this gap, we introduce PillagerBench, a novel framework for evaluating multi-agent systems in real-time competitive team-vs-team scenarios in Minecraft. It provides an extensible API, multi-round testing, and rule-based built-in opponents for fair, reproducible comparisons. We also propose TactiCrafter, an LLM-based multi-agent system that facilitates teamwork through human-readable tactics, learns causal dependencies, and adapts to opponent strategies. Our evaluation demonstrates that TactiCrafter outperforms baseline approaches and showcases adaptive learning through self-play. Additionally, we analyze its learning process and strategic evolution over multiple game episodes. To encourage further research, we have open-sourced PillagerBench, fostering advancements in multi-agent AI for competitive environments.

[391] MSRFormer: Road Network Representation Learning using Multi-scale Feature Fusion of Heterogeneous Spatial Interactions

Jian Yang, Jiahui Wu, Li Fang, Hongchao Fan, Bianying Zhang, Huijie Zhao, Guangyi Yang, Rui Xin, Xiong You

Main category: cs.AI

TL;DR: MSRFormer is a novel road network representation learning framework that integrates multi-scale spatial interactions to address flow heterogeneity and long-distance dependencies in urban road networks, outperforming baseline methods by up to 16%.

DetailsMotivation: Urban road networks have heterogeneous and hierarchical nature, making accurate representation learning challenging. Traditional graph neural networks struggle due to homogeneity assumptions and single-scale focus.

Method: Uses spatial flow convolution to extract small-scale features from trajectory data, identifies scale-dependent spatial interaction regions, employs graph transformer for multi-scale dependencies, and fuses features with residual connections for contrastive learning.

Result: Outperforms baseline methods in two road network analysis tasks, with up to 16% improvement over the most competitive baseline, showing greater benefits for traffic-related tasks and complex road network structures.

Conclusion: Provides a practical framework for task-agnostic road network representation models and reveals distinct association patterns between scale effects and flow heterogeneity in spatial interactions.

Abstract: Transforming road network data into vector representations using deep learning has proven effective for road network analysis. However, urban road networks’ heterogeneous and hierarchical nature poses challenges for accurate representation learning. Graph neural networks, which aggregate features from neighboring nodes, often struggle due to their homogeneity assumption and focus on a single structural scale. To address these issues, this paper presents MSRFormer, a novel road network representation learning framework that integrates multi-scale spatial interactions by addressing their flow heterogeneity and long-distance dependencies. It uses spatial flow convolution to extract small-scale features from large trajectory datasets, and identifies scale-dependent spatial interaction regions to capture the spatial structure of road networks and flow heterogeneity. By employing a graph transformer, MSRFormer effectively captures complex spatial dependencies across multiple scales. The spatial interaction features are fused using residual connections, which are fed to a contrastive learning algorithm to derive the final road network representation. Validation on two real-world datasets demonstrates that MSRFormer outperforms baseline methods in two road network analysis tasks. The performance gains suggest that traffic-related tasks benefit more from incorporating trajectory data, with improvements of up to 16% over the most competitive baseline on complex road network structures. This research provides a practical framework for developing task-agnostic road network representation models and highlights distinct association patterns of the interplay between scale effects and flow heterogeneity of spatial interactions.

[392] Towards Meta-Cognitive Knowledge Editing for Multimodal LLMs

Zhaoyu Fan, Kaihang Pan, Mingze Zhou, Bosheng Qin, Juncheng Li, Shengyu Zhang, Wenqiao Zhang, Siliang Tang, Fei Wu, Yueting Zhuang

Main category: cs.AI

TL;DR: CogEdit benchmark evaluates MLLMs’ meta-cognitive knowledge editing across three levels: counterfactual awareness, boundary constraints, and noise robustness. MIND framework outperforms existing methods with meta-knowledge memory and game-theoretic monitoring.

DetailsMotivation: Existing knowledge editing benchmarks focus only on cognitive-level modifications but lack evaluation of deeper meta-cognitive processes like self-awareness and reflective evaluation.

Method: Proposed MIND framework with meta-knowledge memory for self-awareness, game-theoretic interactions for knowledge activation monitoring, and label refinement for noise-robust updates.

Result: MIND significantly outperforms existing cognitive editing approaches on both traditional and meta-cognitive benchmarks.

Conclusion: The CogEdit benchmark and MIND framework successfully address the gap in meta-cognitive knowledge editing evaluation and demonstrate superior performance compared to cognitive-only approaches.

Abstract: Knowledge editing enables multimodal large language models (MLLMs) to efficiently update outdated or incorrect information. However, existing benchmarks primarily emphasize cognitive-level modifications while lacking a focus on deeper meta-cognitive processes. To bridge this gap, we introduce CogEdit, a novel benchmark designed to evaluate MLLMs’ meta-cognitive knowledge editing abilities across three levels: (1) Counterfactual-Driven Editing, assessing self-awareness of knowledge correctness changes; (2) Boundary Constraint Editing, ensuring appropriate generalization without unintended interference; and (3) Noise-Robust Editing, promoting reflective evaluation of uncertain information. To advance meta-cognitive editing, we propose MIND (Meta-cognitive INtegrated Dynamic Knowledge Editing), a framework that constructs a meta-knowledge memory for self-awareness, employs game-theoretic interactions to monitor knowledge activation, and incorporates label refinement for noise-robust updates. Extensive experiments show that MIND significantly outperforms existing cognitive editing approaches, achieving strong performance on both traditional and meta-cognitive knowledge editing benchmarks.

[393] Hyperbolic Large Language Models

Sarang Patil, Zeyong Zhang, Yiran Huang, Tengfei Ma, Mengjia Xu

Main category: cs.AI

TL;DR: This paper provides a comprehensive survey of hyperbolic geometry applications in large language models (LLMs) to better represent hierarchical and non-Euclidean data structures.

DetailsMotivation: Many real-world datasets exhibit hierarchical, non-Euclidean structures that standard Euclidean-based LLMs struggle to represent effectively, particularly in domains like biological networks, transportation systems, and linguistic structures.

Method: The paper categorizes Hyperbolic LLMs (HypLLMs) into four main techniques: hyperbolic LLMs through exponential/logarithmic maps, hyperbolic fine-tuned models, fully hyperbolic LLMs, and hyperbolic state-space models.

Result: The survey demonstrates that hyperbolic geometry provides an expressive latent representation space that enhances semantic representation learning and multi-scale reasoning capabilities in LLMs for complex hierarchical data.

Conclusion: Hyperbolic geometry offers promising enhancements for LLMs in modeling hierarchical structures, with significant potential applications across various domains, though further research is needed to fully explore this emerging field.

Abstract: Large language models (LLMs) have achieved remarkable success and demonstrated superior performance across various tasks, including natural language processing (NLP), weather forecasting, biological protein folding, text generation, and solving mathematical problems. However, many real-world data exhibit highly non-Euclidean latent hierarchical anatomy, such as protein networks, transportation networks, financial networks, brain networks, and linguistic structures or syntactic trees in natural languages. Effectively learning intrinsic semantic entailment and hierarchical relationships from these raw, unstructured input data using LLMs remains an underexplored area. Due to its effectiveness in modeling tree-like hierarchical structures, hyperbolic geometry – a non-Euclidean space – has rapidly gained popularity as an expressive latent representation space for complex data modeling across domains such as graphs, images, languages, and multi-modal data. Here, we provide a comprehensive and contextual exposition of recent advancements in LLMs that leverage hyperbolic geometry as a representation space to enhance semantic representation learning and multi-scale reasoning. Specifically, the paper presents a taxonomy of the principal techniques of Hyperbolic LLMs (HypLLMs) in terms of four main categories: (1) hyperbolic LLMs through exp/log maps; (2) hyperbolic fine-tuned models; (3) fully hyperbolic LLMs, and (4) hyperbolic state-space models. We also explore crucial potential applications and outline future research directions. A repository of key papers, models, datasets, and code implementations is available at https://github.com/sarangp2402/Hyperbolic-LLM-Models/tree/main.
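
The first category, hyperbolic LLMs through exp/log maps, hinges on two standard Poincaré-ball operations that move activations between Euclidean tangent space and hyperbolic space. A minimal sketch of the textbook maps at the origin (curvature parameter c):

```python
import torch

def exp_map0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Map a Euclidean tangent vector at the origin into the Poincare ball."""
    sc = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sc * norm) * v / (sc * norm)

def log_map0(x: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Inverse of exp_map0: map a ball point back to the tangent space."""
    sc = c ** 0.5
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.atanh((sc * norm).clamp(max=1.0 - eps)) * x / (sc * norm)
```

A typical hybrid layer applies log_map0, runs a standard Euclidean block, then returns to the ball via exp_map0.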

[394] DRF: LLM-AGENT Dynamic Reputation Filtering Framework

Yuwei Lou, Hao Hu, Shaocong Ma, Zongfei Zhang, Liang Wang, Jidong Ge, Xianping Tao

Main category: cs.AI

TL;DR: DRF framework introduces dynamic reputation filtering for multi-agent LLM systems, using interactive rating networks and reputation scoring to improve agent selection and task performance.

DetailsMotivation: Multi-agent LLM systems lack mechanisms to quantify agent performance and assess credibility, limiting their effectiveness in complex tasks.

Method: DRF constructs interactive rating networks, designs reputation scoring mechanisms for honesty/capability assessment, and integrates UCB-based strategies for efficient agent selection.

Result: Experiments show significant improvements in task completion quality and collaboration efficiency for logical reasoning and code-generation tasks.

Conclusion: DRF provides an effective approach for multi-agent systems to handle large-scale tasks through dynamic reputation-based filtering and selection.

Abstract: With the evolution of generative AI, multi-agent systems leveraging large language models (LLMs) have emerged as a powerful tool for complex tasks. However, these systems face challenges in quantifying agent performance and lack mechanisms to assess agent credibility. To address these issues, we introduce DRF, a dynamic reputation filtering framework. DRF constructs an interactive rating network to quantify agent performance, designs a reputation scoring mechanism to measure agent honesty and capability, and integrates an Upper Confidence Bound-based strategy to enhance agent selection efficiency. Experiments show that DRF significantly improves task completion quality and collaboration efficiency in logical reasoning and code-generation tasks, offering a new approach for multi-agent systems to handle large-scale tasks.
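A minimal sketch of the UCB-flavored, reputation-weighted agent selection this entry describes; the data structures, update rule, and constants are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical reputation-weighted UCB1 agent selection.
import math
import random
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    reputation: float = 0.5   # running estimate of honesty/capability in [0, 1]
    pulls: int = 0            # number of times this agent has been selected

def ucb_score(agent: Agent, total_pulls: int, c: float = 1.4) -> float:
    """Classic UCB1: exploit high reputation, explore rarely-used agents."""
    if agent.pulls == 0:
        return float("inf")   # force at least one trial per agent
    return agent.reputation + c * math.sqrt(math.log(total_pulls) / agent.pulls)

def select_agent(agents):
    total = sum(a.pulls for a in agents) or 1
    return max(agents, key=lambda a: ucb_score(a, total))

def update_reputation(agent: Agent, task_score: float, lr: float = 0.2):
    """Blend the observed task outcome into the running reputation."""
    agent.pulls += 1
    agent.reputation += lr * (task_score - agent.reputation)

# toy loop: pick an agent, observe a simulated task outcome, update
agents = [Agent("coder"), Agent("reasoner"), Agent("critic")]
for _ in range(20):
    a = select_agent(agents)
    update_reputation(a, random.random())  # stand-in for a real quality score
print({a.name: round(a.reputation, 2) for a in agents})
```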

[395] Decision-Focused Learning Enhanced by Automated Feature Engineering for Energy Storage Optimisation

Nasser Alkhulaifi, Ismail Gokay Dogan, Timothy R. Cargan, Alexander L. Bowler, Direnc Pekaslan, Nicholas J. Watson, Isaac Triguero

Main category: cs.AI

TL;DR: AFE-DFL framework combines automated feature engineering with decision-focused learning for BESS optimization, improving DFL performance by 22.9-56.5% over the same models without AFE on real-world UK data.

DetailsMotivation: Traditional Predict-Then-Optimise approaches suffer from error cascading in energy management, while Decision-Focused Learning methods lack real-world validation and struggle with data scarcity and variability in BESS operations.

Method: Proposed AFE-DFL framework that integrates automated feature engineering with decision-focused learning to forecast electricity prices/demand while optimizing BESS operations for small datasets.

Result: DFL yields lower operating costs than PTO, and adding AFE improves DFL performance by 22.9-56.5% compared to the same models without AFE, as validated on a real-world UK property dataset.

Conclusion: The framework provides empirical evidence for DFL’s practical viability, showing domain-specific AFE enhances DFL, reduces reliance on domain expertise, and offers economic benefits for energy management systems.

Abstract: Decision-making under uncertainty in energy management is complicated by unknown parameters hindering optimal strategies, particularly in Battery Energy Storage System (BESS) operations. Predict-Then-Optimise (PTO) approaches treat forecasting and optimisation as separate processes, allowing prediction errors to cascade into suboptimal decisions as models minimise forecasting errors rather than optimising downstream tasks. The emerging Decision-Focused Learning (DFL) methods overcome this limitation by integrating prediction and optimisation; however, they are relatively new and have been tested primarily on synthetic datasets or small-scale problems, with limited evidence of their practical viability. Real-world BESS applications present additional challenges, including greater variability and data scarcity due to collection constraints and operational limitations. Because of these challenges, this work leverages Automated Feature Engineering (AFE) to extract richer representations and improve the nascent approach of DFL. We propose an AFE-DFL framework suitable for small datasets that forecasts electricity prices and demand while optimising BESS operations to minimise costs. We validate its effectiveness on a novel real-world UK property dataset. The evaluation compares DFL methods against PTO, with and without AFE. The results show that, on average, DFL yields lower operating costs than PTO and adding AFE further improves the performance of DFL methods by 22.9-56.5% compared to the same models without AFE. These findings provide empirical evidence for DFL’s practical viability in real-world settings, indicating that domain-specific AFE enhances DFL and reduces reliance on domain expertise for BESS optimisation, yielding economic benefits with broader implications for energy management systems facing similar challenges.
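To make the PTO-versus-DFL distinction concrete, here is a toy one-step battery example contrasting forecast error (the PTO training signal) with decision regret (the quantity DFL cares about). The threshold policy and all numbers are invented for illustration.

```python
# Toy contrast between forecast error and decision regret.
import numpy as np

def decide(price_forecast: float, threshold: float = 50.0) -> int:
    """Toy policy: charge (+1) when the forecast price is low, else discharge (-1)."""
    return 1 if price_forecast < threshold else -1

def operating_cost(action: int, true_price: float) -> float:
    """Charging pays the price; discharging earns it."""
    return action * true_price

def decision_regret(forecast: float, true_price: float) -> float:
    """Extra cost of acting on the forecast instead of the true price."""
    return (operating_cost(decide(forecast), true_price)
            - operating_cost(decide(true_price), true_price))

prices = np.array([30.0, 70.0, 55.0])
forecasts = np.array([45.0, 48.0, 60.0])  # 48 vs 70 flips the decision

mse = float(np.mean((forecasts - prices) ** 2))  # PTO training signal
regret = float(np.mean([decision_regret(f, p) for f, p in zip(forecasts, prices)]))
print(f"MSE={mse:.1f}, mean decision regret={regret:.1f}")
```

The second forecast has only a moderate squared error yet flips the charge/discharge decision, which is exactly the error-cascading failure mode DFL targets.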

[396] Chatbot To Help Patients Understand Their Health

Won Seok Jang, Hieu Tran, Manav Mistry, SaiKiran Gandluri, Yifan Zhang, Sharmin Sultana, Sunjae Kown, Yuan Zhang, Zonghai Yao, Hong Yu

Main category: cs.AI

TL;DR: NoteAid-Chatbot is a conversational AI that helps patients understand medical information using a multi-agent LLM and reinforcement learning framework without human-labeled data, achieving human-level performance in patient education.

DetailsMotivation: Patients need knowledge to actively participate in their care, but often struggle to understand complex medical information. There's a need for AI systems that can effectively educate patients through natural conversations.

Method: Built on LLaMA 3.2 3B model with two-stage training: supervised fine-tuning on synthetically generated medical conversations, followed by RL using PPO with rewards based on patient understanding assessments in simulated hospital discharge scenarios.

Result: NoteAid-Chatbot exhibits emergent behaviors critical for patient education (clarity, relevance, structured dialogue) without explicit supervision, surpasses non-expert humans in Turing tests, and demonstrates that lightweight PPO-based RL can handle complex conversational domains.

Conclusion: The framework shows feasibility of applying low-cost, PPO-based RL to realistic conversational domains, broadening RL-based alignment methods beyond healthcare applications.

Abstract: Patients must possess the knowledge necessary to actively participate in their care. We present NoteAid-Chatbot, a conversational AI that promotes patient understanding via a novel 'learning as conversation' framework, built on a multi-agent large language model (LLM) and reinforcement learning (RL) setup without human-labeled data. NoteAid-Chatbot was built on a lightweight LLaMA 3.2 3B model trained in two stages: initial supervised fine-tuning on conversational data synthetically generated using medical conversation strategies, followed by RL with rewards derived from patient understanding assessments in simulated hospital discharge scenarios. Our evaluation, which includes comprehensive human-aligned assessments and case studies, demonstrates that NoteAid-Chatbot exhibits key emergent behaviors critical for patient education, such as clarity, relevance, and structured dialogue, even though it received no explicit supervision for these attributes. Our results show that even simple Proximal Policy Optimization (PPO)-based reward modeling can successfully train lightweight, domain-specific chatbots to handle multi-turn interactions, incorporate diverse educational strategies, and meet nuanced communication objectives. Our Turing test demonstrates that NoteAid-Chatbot surpasses non-expert humans. Although our current focus is on healthcare, the framework we present illustrates the feasibility and promise of applying low-cost, PPO-based RL to realistic, open-ended conversational domains, broadening the applicability of RL-based alignment methods.
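A hedged sketch of the kind of comprehension-based reward signal this entry describes: a simulated patient answers a post-dialogue quiz and the fraction answered correctly becomes the PPO reward. The quiz interface and stub patient are assumptions, not the authors' components.

```python
# Illustrative comprehension-quiz reward for dialogue RL.
def comprehension_reward(dialogue: str, quiz, simulated_patient) -> float:
    """quiz: list of (question, gold_answer) pairs."""
    answers = [simulated_patient(dialogue, q) for q, _ in quiz]
    correct = sum(a == gold for a, (_, gold) in zip(answers, quiz))
    return correct / len(quiz)

quiz = [("What dose of the medication should you take?", "10 mg"),
        ("When should you return for follow-up?", "in two weeks")]
# stub patient model; a real system would query an LLM conditioned on the dialogue
patient = lambda dlg, q: "10 mg" if "dose" in q else "in two weeks"
print(comprehension_reward("(discharge-education dialogue)", quiz, patient))  # 1.0
```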

[397] MAPF-World: Action World Model for Multi-Agent Path Finding

Zhanjiang Yang, Yang Shen, Yueming Li, Meng Li, Lijun Sun

Main category: cs.AI

TL;DR: MAPF-World is an autoregressive action world model that improves multi-agent path finding by modeling environmental dynamics and temporal dependencies, enabling better long-term planning with smaller model size and less data.

DetailsMotivation: Existing decentralized learnable solvers for multi-agent path finding have limited modeling of environmental temporal dynamics and inter-agent dependencies, leading to performance degradation in complex, long-term planning scenarios.

Method: Proposes MAPF-World, an autoregressive action world model that unifies situation understanding and action generation, explicitly modeling environmental dynamics through future state and actions prediction. Also introduces an automatic map generator for realistic training data.

Result: MAPF-World outperforms state-of-the-art learnable solvers with superior zero-shot generalization to out-of-distribution cases, while using a 96.5% smaller model and 92% less training data.

Conclusion: The proposed world model approach effectively addresses limitations of reactive policies in multi-agent path finding, enabling more informed and coordinated decision-making through explicit modeling of environmental dynamics and temporal dependencies.

Abstract: Multi-agent path finding (MAPF) is the problem of planning conflict-free paths from the designated start locations to goal positions for multiple agents. It underlies a variety of real-world tasks, including multi-robot coordination, robot-assisted logistics, and social navigation. Recent decentralized learnable solvers have shown great promise for large-scale MAPF, especially when leveraging foundation models and large datasets. However, these agents are reactive policy models and exhibit limited modeling of environmental temporal dynamics and inter-agent dependencies, resulting in performance degradation in complex, long-term planning scenarios. To address these limitations, we propose MAPF-World, an autoregressive action world model for MAPF that unifies situation understanding and action generation, guiding decisions beyond immediate local observations. It improves situational awareness by explicitly modeling environmental dynamics, including spatial features and temporal dependencies, through future state and action prediction. By incorporating these predicted futures, MAPF-World enables more informed, coordinated, and far-sighted decision-making, especially in complex multi-agent settings. Furthermore, we augment MAPF benchmarks by introducing an automatic map generator grounded in real-world scenarios, capturing practical map layouts for training and evaluating MAPF solvers. Extensive experiments demonstrate that MAPF-World outperforms state-of-the-art learnable solvers, showcasing superior zero-shot generalization to out-of-distribution cases. Notably, MAPF-World is trained with a 96.5% smaller model and 92% less data.
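The sketch below shows the generic shape of an autoregressive world-model rollout of the kind MAPF-World performs: predict the next joint state from the history, then act against the predicted future. The model and policy here are trivial stubs; the real system learns rich joint-state dynamics.

```python
# Toy autoregressive world-model rollout for joint multi-agent planning.
def rollout(world_model, state, policy, horizon: int = 3):
    trajectory = [state]
    for _ in range(horizon):
        actions = policy(trajectory)              # joint actions for all agents
        state = world_model(trajectory, actions)  # predicted next joint state
        trajectory.append(state)
    return trajectory

# stubs: state is a tuple of agent positions on a line
toy_model = lambda traj, acts: tuple(s + a for s, a in zip(traj[-1], acts))
toy_policy = lambda traj: (1, -1)                 # agent 0 moves right, agent 1 left
print(rollout(toy_model, (0, 5), toy_policy))     # [(0, 5), (1, 4), (2, 3), (3, 2)]
```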

[398] MapAgent: A Hierarchical Agent for Geospatial Reasoning with Dynamic Map Tool Integration

Md Hasebul Hasan, Mahir Labib Dihan, Mohammed Eunus Ali, Md Rizwan Parvez

Main category: cs.AI

TL;DR: MapAgent is a hierarchical multi-agent framework for geospatial reasoning that outperforms existing approaches by decoupling planning from execution and using specialized modules for map-based tasks.

DetailsMotivation: Existing AI agent frameworks are inadequate for geospatial tasks requiring spatial reasoning, multi-hop planning, and real-time map interaction, as they treat tools uniformly and overwhelm LLMs with similar geospatial APIs.

Method: Hierarchical multi-agent framework with high-level planner that decomposes queries into subgoals, specialized modules, and dedicated map-tool agent that orchestrates geospatial APIs in parallel while simpler modules operate without agent overhead.

Result: Substantial gains over state-of-the-art tool-augmented and agentic baselines on four diverse geospatial benchmarks (MapEval-Textual, MapEval-API, MapEval-Visual, and MapQA).

Conclusion: MapAgent’s hierarchical design reduces cognitive load, improves tool selection accuracy, and enables precise coordination across similar APIs, making it effective for map-integrated geospatial reasoning tasks.

Abstract: Agentic AI has significantly extended the capabilities of large language models (LLMs) by enabling complex reasoning and tool use. However, most existing frameworks are tailored to domains such as mathematics, coding, or web automation, and fall short on geospatial tasks that require spatial reasoning, multi-hop planning, and real-time map interaction. To address these challenges, we introduce MapAgent, a hierarchical multi-agent plug-and-play framework with customized toolsets and agentic scaffolds for map-integrated geospatial reasoning. Unlike existing flat agent-based approaches that treat tools uniformly (often overwhelming the LLM when handling similar but subtly different geospatial APIs), MapAgent decouples planning from execution. A high-level planner decomposes complex queries into subgoals, which are routed to specialized modules. For tool-heavy modules, such as map-based services, we then design a dedicated map-tool agent that efficiently orchestrates related APIs adaptively in parallel to effectively fetch geospatial data relevant for the query, while simpler modules (e.g., solution generation or answer extraction) operate without additional agent overhead. This hierarchical design reduces cognitive load, improves tool selection accuracy, and enables precise coordination across similar APIs. We evaluate MapAgent on four diverse geospatial benchmarks (MapEval-Textual, MapEval-API, MapEval-Visual, and MapQA) and demonstrate substantial gains over state-of-the-art tool-augmented and agentic baselines. We open-source our framework at https://github.com/Hasebul/MapAgent.
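A minimal sketch of the hierarchical routing pattern just described: the planner emits subgoals, map subgoals go to a dedicated map-tool agent, and simple modules run without agent overhead. All names and the subgoal schema are hypothetical.

```python
# Hypothetical planner-to-module dispatch in a hierarchical agent.
def handle_query(query, planner, map_tool_agent, simple_modules):
    results = []
    for subgoal in planner(query):
        if subgoal["kind"] == "map":
            results.append(map_tool_agent(subgoal))  # orchestrates map APIs
        else:
            results.append(simple_modules[subgoal["kind"]](subgoal))
    return results

# stubs standing in for the LLM planner, map agent, and answer module
planner = lambda q: [{"kind": "map", "goal": "find cafes near X"},
                     {"kind": "answer", "goal": "compose reply"}]
map_agent = lambda sg: f"map data for: {sg['goal']}"
modules = {"answer": lambda sg: f"final answer using {sg['goal']}"}
print(handle_query("cafes near X?", planner, map_agent, modules))
```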

[399] Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL

Haoyang He, Zihua Rong, Kun Ji, Chenyang Li, Qing Huang, Chong Xia, Lan Yang, Honggang Zhang

Main category: cs.AI

TL;DR: DRER is a reinforcement learning framework that improves reasoning in LLMs by rewarding beneficial Chain-of-Thought tokens and stabilizing training with dynamic length control, achieving GPT-o3-mini level performance on logical reasoning tasks.

DetailsMotivation: Current RL methods for LLMs use rule-based rewards that only assess answer correctness, ignoring whether the reasoning process actually improves answers and offering limited control over logical depth.

Method: Proposes Dynamic Reasoning Efficiency Reward (DRER) with two components: Reasoning Quality Reward that credits CoT tokens that raise correct answer likelihood, and Dynamic Length Advantage that stabilizes training by decaying advantage for responses with length deviations.

Result: 7B model achieves GPT-o3-mini level performance on Logictree benchmark with 400 training steps, 30% increase in CoT-augmented answer confidence, and demonstrates generalization across logical reasoning datasets and AIME24 mathematical benchmark.

Conclusion: DRER effectively shapes CoT behavior and provides a practical approach to enhance formal reasoning skills in large language models, with code and data publicly available.

Abstract: Reinforcement learning (RL) has recently become the dominant paradigm for strengthening the reasoning abilities of large language models (LLMs). Yet the rule-based reward functions commonly used on mathematical or programming benchmarks assess only answer format and correctness, providing no signal as to whether the induced Chain-of-Thought (CoT) actually improves the answer. Furthermore, such task-specific training offers limited control over logical depth and therefore may fail to reveal a model’s genuine reasoning capacity. We propose Dynamic Reasoning Efficiency Reward (DRER) – a plug-and-play RL reward framework that reshapes both reward and advantage signals. (i) A Reasoning Quality Reward assigns fine-grained credit to those reasoning chains that demonstrably raise the likelihood of the correct answer, directly incentivising the trajectories with beneficial CoT tokens. (ii) A Dynamic Length Advantage decays the advantage of responses whose length deviates from a validation-derived threshold, stabilising training. To facilitate rigorous assessment, we also release Logictree, a dynamically constructed deductive reasoning dataset that functions both as RL training data and as a comprehensive benchmark. Experiments confirm the effectiveness of DRER: our 7B model attains GPT-o3-mini level performance on Logictree with 400 training steps, while the average confidence of CoT-augmented answers rises by 30%. The model further exhibits generalisation across diverse logical-reasoning datasets and the mathematical benchmark AIME24. These results illuminate how RL shapes CoT behaviour and chart a practical path toward enhancing formal-reasoning skills in large language models. All code and data are available in repository https://github.com/Henryhe09/DRER.
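Two hedged sketches of DRER's components as described above. The reasoning-quality credit follows directly from the abstract (credit CoT that raises the correct answer's likelihood); the exponential decay schedule and tau constant are our assumptions, since the paper's exact schedule is not given here.

```python
# Illustrative DRER-style reward and advantage signals.
import math

def reasoning_quality_reward(logp_ans_with_cot: float,
                             logp_ans_without_cot: float) -> float:
    """Credit CoT tokens that demonstrably raise the correct answer's
    log-likelihood; no credit when the chain does not help."""
    return max(0.0, logp_ans_with_cot - logp_ans_without_cot)

def length_decayed_advantage(advantage: float, length: int,
                             target_len: int, tau: float = 100.0) -> float:
    """Decay the advantage as response length drifts from a
    validation-derived target; tau controls the tolerance."""
    return advantage * math.exp(-abs(length - target_len) / tau)

print(reasoning_quality_reward(-2.1, -3.4))                # 1.3: helpful CoT
print(length_decayed_advantage(1.0, 512, target_len=400))  # ~0.33
print(length_decayed_advantage(1.0, 400, target_len=400))  # 1.0
```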

[400] Reverse-Engineered Reasoning for Open-Ended Generation

Haozhe Wang, Haoran Que, Qixin Xu, Minghao Liu, Wangchunshu Zhou, Jiazhan Feng, Wanjun Zhong, Wei Ye, Tong Yang, Wenhao Huang, Ge Zhang, Fangzhen Lin

Main category: cs.AI

TL;DR: REER introduces a reverse-engineering approach to discover latent reasoning processes from solutions, creating DeepWriting-20K dataset and achieving competitive performance with proprietary models.

DetailsMotivation: Current reasoning methods (RL and instruction distillation) struggle with open-ended creative generation due to lack of clear reward signals and high costs, requiring a new paradigm.

Method: REverse-Engineered Reasoning (REER) works backwards from known-good solutions to computationally discover step-by-step deep reasoning processes, using a scalable gradient-free approach.

Result: Created DeepWriting-20K dataset with 20K reasoning trajectories. DeepWriter-8B model trained on this data outperforms open-source baselines and competes with GPT-4o and Claude 3.5.

Conclusion: REER provides an effective alternative to traditional reasoning methods, enabling scalable discovery of reasoning processes and achieving state-of-the-art performance in open-ended generation tasks.

Abstract: While the “deep reasoning” paradigm has spurred significant advances in verifiable domains like mathematics, its application to open-ended, creative generation remains a critical challenge. The two dominant methods for instilling reasoning – reinforcement learning (RL) and instruction distillation – falter in this area; RL struggles with the absence of clear reward signals and high-quality reward models, while distillation is prohibitively expensive and capped by the teacher model's capabilities. To overcome these limitations, we introduce REverse-Engineered Reasoning (REER), a new paradigm that fundamentally shifts the approach. Instead of building a reasoning process “forwards” through trial-and-error or imitation, REER works “backwards” from known-good solutions to computationally discover the latent, step-by-step deep reasoning process that could have produced them. Using this scalable, gradient-free approach, we curate and open-source DeepWriting-20K, a large-scale dataset of 20,000 deep reasoning trajectories for open-ended tasks. Our model, DeepWriter-8B, trained on this data, not only surpasses strong open-source baselines but also achieves performance competitive with, and at times superior to, leading proprietary models like GPT-4o and Claude 3.5.
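One way to picture a gradient-free backward search of the kind REER describes: start from a draft trajectory and greedily accept local edits that better explain a known-good solution. The scoring proxy and edit set below are toy assumptions; the paper would use a likelihood/perplexity-style measure under an LLM.

```python
# Toy gradient-free local search from a known-good solution.
import random

def score(trajectory, solution) -> float:
    # toy proxy: reward trajectories that mention solution tokens;
    # a real system would score the solution's likelihood given the trajectory
    return sum(tok in " ".join(trajectory) for tok in solution.split())

def reverse_engineer(solution, draft, edits, steps: int = 50):
    best, best_s = draft, score(draft, solution)
    for _ in range(steps):
        cand = random.choice(edits)(best)  # propose a local edit
        s = score(cand, solution)
        if s >= best_s:                    # greedy accept
            best, best_s = cand, s
    return best

edits = [lambda t: t + ["therefore 42"],
         lambda t: t[:-1] if len(t) > 1 else t]
print(reverse_engineer("the answer is 42", ["consider the problem"], edits))
```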

[401] From Long to Short: LLMs Excel at Trimming Own Reasoning Chains

Wei Han, Geng Zhan, Sicheng Yu, Chenyu Wang, Bryan Hooi

Main category: cs.AI

TL;DR: EDIT is a test-time scaling method that helps large reasoning models identify the shortest correct reasoning paths, balancing conciseness and correctness to mitigate overthinking issues.

DetailsMotivation: Large reasoning models (LRMs) suffer from overthinking - they overcomplicate simple problems with excessive strategy switching and convoluted reasoning traces, which hinders interpretability and efficiency.

Method: Proposed EDIT (Efficient Dynamic Inference Trimming), a test-time scaling method that employs constraint-guided generation while jointly tracking length and answer distributions under varying constraints to identify optimal reasoning paths.

Result: Extensive experiments across diverse models and datasets show that EDIT substantially enhances reasoning efficiency, producing compact yet informative outputs that improve readability and user experience.

Conclusion: EDIT effectively addresses the overthinking problem in LRMs by guiding them to find the shortest correct reasoning paths, achieving a better balance between conciseness and correctness.

Abstract: O1/R1 style large reasoning models (LRMs) signal a substantial leap forward over conventional instruction-following LLMs. By applying test-time scaling to generate extended reasoning paths, they establish many SOTAs across a wide range of complex reasoning tasks. However, recent studies show that LRMs are prone to suffer from overthinking – the tendency to overcomplicate simple problems, leading to excessive strategy switching and long, convoluted reasoning traces that hinder their interpretability. To mitigate this issue, we conduct a systematic investigation into the reasoning efficiency of a broad set of LRMs and uncover a common dilemma: the difficulty in balancing multiple generation objectives such as correctness and brevity. Based on this discovery, we propose a test-time scaling method, EDIT (Efficient Dynamic Inference Trimming), which efficiently guides LRMs to identify the shortest correct reasoning paths at test time. EDIT employs constraint-guided generation while jointly tracking length and answer distributions under varying constraints, allowing it to select responses that strike an optimal balance between conciseness and correctness. Extensive experiments across diverse models and datasets show that EDIT substantially enhances reasoning efficiency, producing compact yet informative outputs that improve readability and user experience.
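A way to picture EDIT's selection step, under the assumption (ours, not the paper's) that correctness is proxied by a majority vote over candidates generated under different length constraints: keep the shortest trace whose answer agrees with the consensus.

```python
# Toy "shortest correct trace" selection over length-constrained candidates.
from collections import Counter

def pick_shortest_correct(candidates):
    """candidates: list of (reasoning_text, answer) pairs."""
    majority_answer, _ = Counter(a for _, a in candidates).most_common(1)[0]
    correct = [(r, a) for r, a in candidates if a == majority_answer]
    return min(correct, key=lambda ra: len(ra[0]))  # shortest agreeing trace

cands = [
    ("step1 ... step9 (long trace)", "42"),
    ("step1 step2 (short trace)", "42"),
    ("confused trace", "17"),
]
print(pick_shortest_correct(cands)[0])  # -> the short trace
```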

[402] Proof2Silicon: Prompt Repair for Verified Code and Hardware Generation via Reinforcement Learning

Manvi Jha, Jiaxin Wan, Deming Chen

Main category: cs.AI

TL;DR: Proof2Silicon is an end-to-end framework that uses RL-based prompt optimization (PREFACE) to generate formally verifiable Dafny code from LLMs, then translates it to C and synthesizes RTL hardware with up to 72% success rate.

DetailsMotivation: LLMs often generate code that fails formal verification, which is critical for hardware and safety-critical applications. There's a need for automated generation of correctness-by-construction hardware from natural language.

Method: Combines PREFACE’s verifier-driven RL agent for prompt optimization, automatic translation of verified Dafny to C using Dafny’s Python backend and PyLog, and Vivado HLS for RTL synthesis.

Result: PREFACE improved Dafny verification success rates by up to 21% across LLMs. Proof2Silicon achieved 72% end-to-end hardware synthesis success rate on a 100-task benchmark.

Conclusion: The framework provides a robust, scalable automated pipeline for LLM-driven formally verified hardware synthesis from natural language to silicon implementation.

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in automated code generation but frequently produce code that fails formal verification, an essential requirement for hardware and safety-critical domains. To overcome this fundamental limitation, we previously proposed PREFACE, a model-agnostic framework based on reinforcement learning (RL) that iteratively repairs the prompts provided to frozen LLMs, systematically steering them toward generating formally verifiable Dafny code without costly fine-tuning. This work presents Proof2Silicon, a novel end-to-end synthesis framework that embeds the previously proposed PREFACE flow to enable the generation of correctness-by-construction hardware directly from natural language specifications. Proof2Silicon operates by: (1) leveraging PREFACE’s verifier-driven RL agent to optimize prompt generation iteratively, ensuring Dafny code correctness; (2) automatically translating verified Dafny programs into synthesizable high-level C using Dafny’s Python backend and PyLog; and (3) employing Vivado HLS to produce RTL implementations. Evaluated rigorously on a challenging 100-task benchmark, PREFACE’s RL-guided prompt optimization consistently improved Dafny verification success rates across diverse LLMs by up to 21%. Crucially, Proof2Silicon achieved an end-to-end hardware synthesis success rate of up to 72%, generating RTL designs through Vivado HLS synthesis flows. These results demonstrate a robust, scalable, and automated pipeline for LLM-driven, formally verified hardware synthesis, bridging natural-language specification and silicon realization.
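A hedged sketch of the verifier-in-the-loop prompt-repair cycle PREFACE implements; llm_generate, dafny_verify, and repair_prompt below are trivial stand-ins for the frozen LLM, the Dafny verifier, and the learned RL repair policy.

```python
# Sketch of a verifier-driven prompt-repair loop with stub components.
def llm_generate(prompt: str) -> str:
    # stub: a real system would call a frozen LLM here
    return "method Abs(x: int) returns (y: int) ensures y >= 0 { ... }"

def dafny_verify(code: str):
    # stub: a real system would invoke the Dafny verifier
    ok = "ensures" in code
    return ok, "" if ok else "missing postcondition"

def repair_prompt(prompt: str, feedback: str) -> str:
    # stub: PREFACE trains an RL policy to make this edit
    return prompt + f"\nVerifier feedback: {feedback}. Add the needed spec."

def verified_generation(task: str, max_rounds: int = 5):
    prompt = task
    for _ in range(max_rounds):
        code = llm_generate(prompt)
        ok, feedback = dafny_verify(code)
        if ok:
            return code                        # verified program exits the loop
        prompt = repair_prompt(prompt, feedback)
    return None                                # verification failed within budget

print(verified_generation("Write a Dafny method computing absolute value."))
```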

[403] REMI: A Novel Causal Schema Memory Architecture for Personalized Lifestyle Recommendation Agents

Vishal Raman, Vijai Aravindh R, Abhijith Ragav

Main category: cs.AI

TL;DR: REMI is a Causal Schema Memory architecture that integrates personal causal knowledge graphs, causal reasoning, and schema-based planning to provide explainable, personalized lifestyle recommendations.

DetailsMotivation: Current AI assistants struggle with complex personal data and causal knowledge, leading to generic advice lacking explanatory power in domains like fashion, wellness, and lifestyle planning.

Method: Uses personal causal graph of user’s life events/habits, performs goal-directed causal traversals with external knowledge and hypothetical reasoning, retrieves adaptable plan schemas, and orchestrates with LLM for transparent explanations.

Result: CSM-based agents provide more context-aware, user-aligned recommendations compared to baseline LLM agents, as measured by new metrics like Personalization Salience Score and Causal Reasoning Accuracy.

Conclusion: Demonstrates a novel approach to memory-augmented causal reasoning in personalized agents, advancing transparent and trustworthy AI lifestyle assistants.

Abstract: Personalized AI assistants often struggle to incorporate complex personal data and causal knowledge, leading to generic advice that lacks explanatory power. We propose REMI, a Causal Schema Memory architecture for a multimodal lifestyle agent that integrates a personal causal knowledge graph, a causal reasoning engine, and a schema-based planning module. The idea is to deliver explainable, personalized recommendations in domains like fashion, personal wellness, and lifestyle planning. Our architecture uses a personal causal graph of the user’s life events and habits, performs goal-directed causal traversals enriched with external knowledge and hypothetical reasoning, and retrieves adaptable plan schemas to generate tailored action plans. A Large Language Model orchestrates these components, producing answers with transparent causal explanations. We outline the CSM system design and introduce new evaluation metrics for personalization and explainability, including Personalization Salience Score and Causal Reasoning Accuracy, to rigorously assess its performance. Results indicate that CSM-based agents can provide more context-aware, user-aligned recommendations compared to baseline LLM agents. This work demonstrates a novel approach to memory-augmented causal reasoning in personalized agents, advancing the development of transparent and trustworthy AI lifestyle assistants.
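A toy goal-directed traversal over a personal causal graph, illustrating the kind of query REMI's causal reasoning engine answers; the graph contents and plain BFS path search are illustrative assumptions.

```python
# Toy causal-path query over a personal causal graph.
from collections import deque

causal_graph = {
    "late nights": ["poor sleep"],
    "poor sleep": ["low energy"],
    "low energy": ["skipped workouts"],
}

def causal_path(graph, cause, effect):
    """BFS for a directed cause -> effect chain; None if no path exists."""
    queue, seen = deque([[cause]]), {cause}
    while queue:
        path = queue.popleft()
        if path[-1] == effect:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(causal_path(causal_graph, "late nights", "skipped workouts"))
```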

[404] TableMind: An Autonomous Programmatic Agent for Tool-Augmented Table Reasoning

Chuang Jiang, Mingyue Cheng, Xiaoyu Tao, Qingyang Mao, Jie Ouyang, Qi Liu

Main category: cs.AI

TL;DR: TableMind is an LLM-driven table reasoning agent that autonomously uses tools, writes/executes code in sandbox, and employs planning/self-reflection for complex table reasoning tasks.

DetailsMotivation: Existing text-based LLMs struggle with complex numerical computations in table reasoning, while tool-integrated approaches lack autonomous adaptability and rely on rigid patterns.

Method: Two-stage fine-tuning: supervised fine-tuning on reasoning trajectories followed by reinforcement fine-tuning with Rank-Aware Policy Optimization (RAPO) that prioritizes high-quality trajectories.

Result: Extensive experiments show TableMind achieves superior performance on mainstream benchmarks with substantial gains in reasoning accuracy and computational precision.

Conclusion: TableMind demonstrates effective autonomous table reasoning through integrated tool usage, code execution, and adaptive planning capabilities, outperforming competitive baselines.

Abstract: Table reasoning is crucial for leveraging structured data in domains such as finance, healthcare, and scientific research. While large language models (LLMs) show promise in multi-step reasoning, purely text-based methods often struggle with the complex numerical computations and fine-grained operations inherently required in this task. Tool-integrated reasoning improves computational accuracy via explicit code execution, yet existing systems frequently rely on rigid patterns, supervised imitation, and lack true autonomous adaptability. In this paper, we present TableMind, an LLM-driven table reasoning agent that (i) autonomously performs multi-turn tool invocation, (ii) writes and executes data-analyzing code in a secure sandbox environment for data analysis and precise numerical reasoning, and (iii) exhibits high-level capabilities such as planning and self-reflection to adapt strategies. To realize these capabilities, we adopt a two-stage fine-tuning paradigm built on top of a powerful pre-trained language model: supervised fine-tuning on high-quality reasoning trajectories to establish effective tool usage patterns, followed by reinforcement fine-tuning to optimize multi-objective strategies. In particular, we propose Rank-Aware Policy Optimization (RAPO), which increases the update weight of high-quality trajectories when their output probabilities are lower than those of low-quality ones, thereby guiding the model more consistently toward better and more accurate answers. Extensive experiments on several mainstream benchmarks demonstrate that TableMind achieves superior performance compared to competitive baselines, yielding substantial gains in both reasoning accuracy and computational precision.
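The core reweighting idea in RAPO, as described above, can be pictured as follows; the concrete base and boost weights are invented for illustration.

```python
# Illustrative rank-aware update weight: upweight a high-quality
# trajectory whenever the policy currently assigns it lower probability
# than a low-quality one (i.e., the pair is misranked).
def rapo_weight(logp_good: float, logp_bad: float,
                base: float = 1.0, boost: float = 2.0) -> float:
    """Return the update weight for the high-quality trajectory."""
    return base * boost if logp_good < logp_bad else base

print(rapo_weight(logp_good=-12.0, logp_bad=-8.0))  # misranked -> 2.0
print(rapo_weight(logp_good=-6.0, logp_bad=-9.0))   # correctly ranked -> 1.0
```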

[405] SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents

Xuan-Phi Nguyen, Shrey Pandit, Revanth Gangi Reddy, Austin Xu, Silvio Savarese, Caiming Xiong, Shafiq Joty

Main category: cs.AI

TL;DR: Developed autonomous single-agent models for deep research using continual RL on reasoning-optimized LLMs with synthetic data, achieving 28.7% on Humanity’s Last Exam benchmark.

DetailsMotivation: To enhance LLMs' complex reasoning and tool-use capabilities for deep research applications that require extensive search and reasoning over multiple sources, moving beyond static multi-agent workflows to dynamic autonomous single agents.

Method: Proposed continual reinforcement learning recipe with entirely synthetic data applied to various open-source LLMs, focusing on reasoning-optimized models with minimal web crawling and Python tool integration for autonomous single-agent deep research.

Result: Best variant SFR-DR-20B achieved up to 28.7% performance on Humanity’s Last Exam benchmark, demonstrating significant improvement in autonomous deep research capabilities.

Conclusion: Continual RL with synthetic data effectively enhances agentic skills while preserving reasoning ability in autonomous single-agent models for deep research applications.

Abstract: Equipping large language models (LLMs) with complex, interleaved reasoning and tool-use capabilities has become a key focus in agentic AI research, especially with recent advances in reasoning-oriented (“thinking”) models. Such capabilities are key to unlocking a number of important applications. One such application is Deep Research (DR), which requires extensive search and reasoning over many sources. Our work in this paper focuses on the development of native Autonomous Single-Agent models for DR featuring minimal web crawling and Python tool integration. Unlike multi-agent systems, where agents take up pre-defined roles and are told what to do at each step in a static workflow, an autonomous single-agent determines its next action dynamically based on context, without manual directive. While prior work has proposed training recipes for base or instruction-tuned LLMs, we focus on continual reinforcement learning (RL) of reasoning-optimized models to further enhance agentic skills while preserving reasoning ability. Towards this end, we propose a simple RL recipe with entirely synthetic data, which we apply to various open-source LLMs. Our best variant SFR-DR-20B achieves up to 28.7% on Humanity’s Last Exam benchmark. In addition, we conduct key analysis experiments to provide more insights into our methodologies.

[406] From Implicit Exploration to Structured Reasoning: Leveraging Guideline and Refinement for LLMs

Jiaxiang Chen, Zhuo Wang, Mingxi Zou, Zhucong Li, Zhijian Zhou, Song Wang, Zenglin Xu

Main category: cs.AI

TL;DR: A framework that shifts from implicit exploration to structured reasoning through guidelines and stepwise refinement, improving stability and performance across diverse reasoning tasks.

DetailsMotivation: Existing LLM reasoning methods rely on implicit exploration, leading to unstable paths, lack of error correction, and limited learning from experience.

Method: Extract structured reasoning patterns from successes and reflective signals from failures, then use step-by-step guidelines with refinement after each step during inference.

Result: Consistently outperforms strong baselines on BBH, GSM8K, MATH-500, MBPP, and HumanEval benchmarks across diverse reasoning tasks.

Conclusion: Structured reasoning with stepwise execution and refinement improves stability and generalization, with guidelines transferring well across domains and supporting cross-model collaboration effectively.

Abstract: Large language models (LLMs) have advanced general-purpose reasoning, showing strong performance across diverse tasks. However, existing methods often rely on implicit exploration, where the model follows stochastic and unguided reasoning paths-like walking without a map. This leads to unstable reasoning paths, lack of error correction, and limited learning from past experience. To address these issues, we propose a framework that shifts from implicit exploration to structured reasoning through guideline and refinement. First, we extract structured reasoning patterns from successful trajectories and reflective signals from failures. During inference, the model follows these guidelines step-by-step, with refinement applied after each step to correct errors and stabilize the reasoning process. Experiments on BBH and four additional benchmarks (GSM8K, MATH-500, MBPP, HumanEval) show that our method consistently outperforms strong baselines across diverse reasoning tasks. Structured reasoning with stepwise execution and refinement improves stability and generalization, while guidelines transfer well across domains and flexibly support cross-model collaboration, matching or surpassing supervised fine-tuning in effectiveness and scalability.
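A minimal sketch of the inference loop this entry describes: follow extracted guidelines step by step and refine after each step before moving on. The helper functions are stubs standing in for LLM calls.

```python
# Stepwise guideline following with per-step refinement (stub LLM calls).
def solve_with_guidelines(question, guidelines):
    state = question
    trace = []
    for step in guidelines:
        draft = execute_step(step, state)     # follow one guideline step
        checked = refine(draft, step, state)  # correct errors before moving on
        trace.append(checked)
        state = state + "\n" + checked        # accumulate context for the next step
    return trace

def execute_step(step, state):
    return f"[{step}] partial answer"         # stand-in for an LLM call

def refine(draft, step, state):
    return draft                              # stand-in for a critique/repair pass

print(solve_with_guidelines("2+2*3?", ["parse expression",
                                       "apply precedence", "compute"]))
```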

[407] Can AI Make Energy Retrofit Decisions? An Evaluation of Large Language Models

Lei Shu, Dong Zhao

Main category: cs.AI

TL;DR: LLMs show promise for energy retrofit decision making, achieving up to 54.5% top-1 match and 92.8% within top-5 recommendations without fine-tuning, but need improvements in accuracy, consistency, and context handling.

DetailsMotivation: Conventional energy retrofit approaches suffer from limited generalizability and low interpretability, hindering adoption in diverse residential contexts. Generative AI and LLMs may help by processing contextual information and producing readable recommendations.

Method: Evaluated seven LLMs (ChatGPT, DeepSeek, Gemini, Grok, Llama, Claude) on residential retrofit decisions under two objectives: maximizing CO2 reduction (technical) and minimizing payback period (sociotechnical). Assessed performance on accuracy, consistency, sensitivity, and reasoning using a dataset of 400 homes across 49 US states.

Result: LLMs generate effective recommendations in many cases, with stronger performance for technical objectives. Performance is sensitive to location and building geometry but less sensitive to technology and occupant behavior. Agreement across models is low, and higher performing models tend to diverge from others.

Conclusion: LLMs are promising assistants for energy retrofit decision making, but improvements in accuracy, consistency, and context handling are needed for reliable practice. Most models show simplified engineering-style reasoning that lacks deeper contextual awareness.

Abstract: Conventional approaches to building energy retrofit decision making suffer from limited generalizability and low interpretability, hindering adoption in diverse residential contexts. With the growth of Smart and Connected Communities, generative AI, especially large language models (LLMs), may help by processing contextual information and producing practitioner-readable recommendations. We evaluate seven LLMs (ChatGPT, DeepSeek, Gemini, Grok, Llama, and Claude) on residential retrofit decisions under two objectives: maximizing CO2 reduction (technical) and minimizing payback period (sociotechnical). Performance is assessed on four dimensions: accuracy, consistency, sensitivity, and reasoning, using a dataset of 400 homes across 49 US states. LLMs generate effective recommendations in many cases, reaching up to 54.5 percent top-1 match and 92.8 percent within top-5 without fine-tuning. Performance is stronger for the technical objective, while sociotechnical decisions are limited by economic trade-offs and local context. Agreement across models is low, and higher-performing models tend to diverge from others. LLMs are sensitive to location and building geometry but less sensitive to technology and occupant behavior. Most models show step-by-step, engineering-style reasoning, but it is often simplified and lacks deeper contextual awareness. Overall, LLMs are promising assistants for energy retrofit decision making, but improvements in accuracy, consistency, and context handling are needed for reliable practice.
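The top-1/top-5 figures above correspond to a standard top-k match rate; a minimal version of that computation follows (our formulation, not the authors' script; the retrofit measures are invented examples).

```python
# Standard top-k match rate over ranked recommendations.
def top_k_match(recommendations, ground_truth, k: int) -> float:
    """Fraction of cases where the true measure appears in the top-k list."""
    hits = sum(gt in recs[:k] for recs, gt in zip(recommendations, ground_truth))
    return hits / len(ground_truth)

recs = [["wall insulation", "heat pump", "solar PV"],
        ["heat pump", "glazing", "wall insulation"]]
truth = ["heat pump", "wall insulation"]
print(top_k_match(recs, truth, k=1))  # 0.0: neither top choice matches
print(top_k_match(recs, truth, k=3))  # 1.0: both appear within the top 3
```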

[408] Large Language Models as Virtual Survey Respondents: Evaluating Sociodemographic Response Generation

Jianpeng Zhao, Chenyu Yuan, Weiming Luo, Haoling Xie, Guangwei Zhang, Steven Jige Quan, Zixuan Yuan, Pengyang Wang, Denghui Zhang

Main category: cs.AI

TL;DR: LLMs can simulate virtual survey respondents through Partial Attribute Simulation (PAS) and Full Attribute Simulation (FAS) methods, with evaluation showing consistent performance trends and context-dependent fidelity.

DetailsMotivation: Traditional survey methods are costly, time-consuming, and limited in scale, creating a need for scalable and cost-effective alternatives for sociological research and policy evaluation.

Method: Introduced PAS (predicting missing attributes from partial profiles) and FAS (generating complete synthetic datasets under zero-context and context-enhanced conditions), evaluated on LLM-S^3 benchmark with 11 real-world datasets across 4 sociological domains using GPT-3.5/4 Turbo and LLaMA 3.0/3.1-8B models.

Result: Revealed consistent trends in prediction performance, identified failure modes, and demonstrated that context and prompt design significantly impact simulation fidelity.

Conclusion: Establishes a rigorous foundation for LLM-driven survey simulations, providing scalable and cost-effective tools for sociological research and policy evaluation.

Abstract: Questionnaire-based surveys are foundational to social science research and public policymaking, yet traditional survey methods remain costly, time-consuming, and often limited in scale. This paper explores a new paradigm: simulating virtual survey respondents using Large Language Models (LLMs). We introduce two novel simulation settings, namely Partial Attribute Simulation (PAS) and Full Attribute Simulation (FAS), to systematically evaluate the ability of LLMs to generate accurate and demographically coherent responses. In PAS, the model predicts missing attributes based on partial respondent profiles, whereas FAS involves generating complete synthetic datasets under both zero-context and context-enhanced conditions. We curate a comprehensive benchmark suite, LLM-S^3 (Large Language Model-based Sociodemographic Survey Simulation), that spans 11 real-world public datasets across four sociological domains. Our evaluation of multiple mainstream LLMs (GPT-3.5/4 Turbo, LLaMA 3.0/3.1-8B) reveals consistent trends in prediction performance, highlights failure modes, and demonstrates how context and prompt design impact simulation fidelity. This work establishes a rigorous foundation for LLM-driven survey simulations, offering scalable and cost-effective tools for sociological research and policy evaluation. Our code and dataset are available at: https://github.com/dart-lab-research/LLM-S-Cube-Benchmark

[409] Evaluating Multi-Turn Bargain Skills in LLM-Based Seller Agent

Yishu Wang, Kakam Chong, Xiaofeng Wang, Xu Yan, DeXin Kong, Chen Ju, Ming Chen, Shuai Xiao, Shuguang Han, Jufeng Chen

Main category: cs.AI

TL;DR: A multi-turn evaluation framework for measuring LLM seller agents’ bargaining ability in e-commerce, featuring a large-scale benchmark and automated intent extraction pipeline.

DetailsMotivation: To enable effective seller agents in online second-hand marketplaces that can accurately track and interpret cumulative buyer intents across long negotiations.

Method: Developed a multi-turn evaluation framework with Theory of Mind grounding, large-scale e-commerce bargaining benchmark (622 categories, 9,892 products, 3,014 tasks), and automated pipeline for intent extraction from dialogue data.

Result: Created a comprehensive framework that moves beyond outcome-only metrics to include turn-level evaluation with annotated buyer intents for more accurate assessment of bargaining effectiveness.

Conclusion: The proposed framework provides a robust method for evaluating seller agents’ bargaining capabilities through intent tracking and interpretation in multi-turn e-commerce negotiations.

Abstract: In online second-hand marketplaces, multi-turn bargaining is a crucial part of seller-buyer interactions. Large Language Models (LLMs) can act as seller agents, negotiating with buyers on behalf of sellers under given business constraints. A critical ability for such agents is to track and accurately interpret cumulative buyer intents across long negotiations, which directly impacts bargaining effectiveness. We introduce a multi-turn evaluation framework for measuring the bargaining ability of seller agents in e-commerce dialogues. The framework tests whether an agent can extract and track buyer intents. Our contributions are: (1) a large-scale e-commerce bargaining benchmark spanning 622 categories, 9,892 products, and 3,014 tasks; (2) a turn-level evaluation framework grounded in Theory of Mind (ToM) with annotated buyer intents, moving beyond outcome-only metrics; and (3) an automated pipeline that extracts reliable intent from massive dialogue data.

[410] A data-driven discretized CS:GO simulation environment to facilitate strategic multi-agent planning research

Yunzhe Wang, Volkan Ustun, Chris McGroarty

Main category: cs.AI

TL;DR: DECOY is a multi-agent simulator that abstracts strategic planning in 3D environments using discretized waypoints and neural models trained on real CS:GO data, achieving accurate gameplay simulation without low-level mechanics.

DetailsMotivation: Modern simulation environments need to balance high-fidelity detail with computational efficiency for complex multi-agent interactions, requiring a solution that preserves strategic planning while being computationally tractable.

Method: Uses a waypoint system to discretize continuous states and actions, paired with neural predictive and generative models trained on real CS:GO tournament data to reconstruct event outcomes based only on movement decisions.

Result: Extensive evaluations show that replays generated from human data in DECOY closely match those observed in the original game, demonstrating accurate simulation fidelity.

Conclusion: DECOY provides a valuable publicly available tool for advancing research in strategic multi-agent planning and behavior generation by effectively abstracting high-level strategy while preserving environmental fidelity.

Abstract: Modern simulation environments for complex multi-agent interactions must balance high-fidelity detail with computational efficiency. We present DECOY, a novel multi-agent simulator that abstracts strategic, long-horizon planning in 3D terrains into high-level discretized simulation while preserving low-level environmental fidelity. Using Counter-Strike: Global Offensive (CS:GO) as a testbed, our framework accurately simulates gameplay using only movement decisions as tactical positioning – without explicitly modeling low-level mechanics such as aiming and shooting. Central to our approach is a waypoint system that simplifies and discretizes continuous states and actions, paired with neural predictive and generative models trained on real CS:GO tournament data to reconstruct event outcomes. Extensive evaluations show that replays generated from human data in DECOY closely match those observed in the original game. Our publicly available simulation environment provides a valuable tool for advancing research in strategic multi-agent planning and behavior generation.
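The waypoint discretization at the heart of DECOY can be pictured as a nearest-waypoint snap from continuous coordinates onto a fixed graph; the waypoints and Euclidean metric below are illustrative, not the simulator's actual map data.

```python
# Toy nearest-waypoint discretization of continuous positions.
import math

WAYPOINTS = {"A_site": (10.0, 2.0), "mid": (0.0, 0.0), "B_site": (-8.0, 5.0)}

def discretize(pos):
    """Snap a continuous (x, y) position to the closest named waypoint."""
    return min(WAYPOINTS, key=lambda w: math.dist(pos, WAYPOINTS[w]))

print(discretize((9.1, 1.5)))   # -> "A_site"
print(discretize((-1.0, 0.4)))  # -> "mid"
```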

[411] Teaching AI Stepwise Diagnostic Reasoning with Report-Guided Chain-of-Thought Learning

Yihong Luo, Wenwu He, Zhuo-Xu Cui, Dong Liang

Main category: cs.AI

TL;DR: DiagCoT is a multi-stage framework that fine-tunes vision-language models to emulate radiologists’ diagnostic reasoning using free-text reports, achieving significant improvements in disease classification, pathology grounding, and report generation.

DetailsMotivation: To develop interpretable AI systems that can replicate radiologists' stepwise diagnostic reasoning process using only unstructured clinical reports as supervision, enabling scalable development of diagnostically competent models.

Method: Multi-stage framework combining: 1) contrastive image-report tuning for domain alignment, 2) chain-of-thought supervision to capture inferential logic, and 3) reinforcement tuning with clinical reward signals for factual accuracy and fluency.

Result: On MIMIC-CXR benchmark: improved zero-shot disease classification AUC from 0.52 to 0.76 (+0.24), pathology grounding mIoU from 0.08 to 0.31 (+0.23), and report generation BLEU from 0.11 to 0.33 (+0.22). Outperformed state-of-the-art models including LLaVA-Med and CXR-LLAVA on long-tailed diseases and external datasets.

Conclusion: DiagCoT successfully converts unstructured clinical narratives into structured supervision, providing a scalable approach for developing interpretable and diagnostically competent AI systems for radiology applications.

Abstract: This study presents DiagCoT, a multi-stage framework that applies supervised fine-tuning to general-purpose vision-language models (VLMs) to emulate radiologists’ stepwise diagnostic reasoning using only free-text reports. DiagCoT combines contrastive image-report tuning for domain alignment, chain-of-thought supervision to capture inferential logic, and reinforcement tuning with clinical reward signals to enhance factual accuracy and fluency. On the MIMIC-CXR benchmark, DiagCoT improved zero-shot disease classification AUC from 0.52 to 0.76 (absolute gain of 0.24), pathology grounding mIoU from 0.08 to 0.31 (absolute gain of 0.23), and report generation BLEU from 0.11 to 0.33 (absolute gain of 0.22). It outperformed state-of-the-art models including LLaVA-Med and CXR-LLAVA on long-tailed diseases and external datasets. By converting unstructured clinical narratives into structured supervision, DiagCoT offers a scalable approach for developing interpretable and diagnostically competent AI systems for radiology.

[412] Tree of Agents: Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning

Song Yu, Xiaofei Xu, Ke Deng, Li Li, Lin Tian

Main category: cs.AI

TL;DR: Tree of Agents (TOA) is a multi-agent framework that segments long inputs into chunks processed by independent agents, enabling collaborative reasoning through tree-structured information exchange to address the ’lost in the middle’ problem in LLMs.

DetailsMotivation: LLMs struggle with long-context tasks due to the 'lost in the middle' issue where middle information is underutilized. Existing methods either risk discarding key information or cause attention dispersion when extending context windows.

Method: TOA segments input into chunks processed by independent agents. Each agent generates local cognition, then dynamically exchanges information through tree-structured paths. Uses prefix-hash caching and adaptive pruning for efficiency.

Result: TOA with LLaMA3.1-8B significantly outperforms baselines and achieves comparable performance to much larger commercial models like Gemini1.5-pro on various long-context tasks.

Conclusion: The Tree of Agents framework effectively mitigates position bias and reduces hallucinations in long-context processing while maintaining efficiency with comparable API overhead.

Abstract: Large language models (LLMs) face persistent challenges when handling long-context tasks, most notably the “lost in the middle” issue, where information located in the middle of a long input tends to be underutilized. Some existing methods that reduce input have the risk of discarding key information, while others that extend context windows often lead to attention dispersion. To address these limitations, we propose Tree of Agents (TOA), a multi-agent reasoning framework that segments the input into chunks processed by independent agents. Each agent generates its local cognition, then agents dynamically exchange information for collaborative reasoning along tree-structured paths. TOA enables agents to probe different reasoning orders for multi-perspective understanding, effectively mitigating position bias and reducing hallucinations. To improve processing efficiency, we incorporate prefix-hash caching and adaptive pruning strategies, achieving significant performance improvements with comparable API overhead. Experiments show that TOA, powered by compact LLaMA3.1-8B, significantly outperforms multiple baselines and demonstrates comparable performance to the latest and much larger commercial models, such as Gemini1.5-pro, on various long-context tasks. Code is available at https://github.com/Aireduce952/Tree-of-Agents.
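A sketch of the prefix-hash caching idea mentioned above: agents exploring different reasoning orders share work on identical context prefixes, computing each prefix only once. The cache key derivation and stub encoder are assumptions.

```python
# Toy prefix-hash cache: identical prefixes are processed once.
import hashlib

_cache = {}

def process_with_prefix_cache(prefix_chunks, new_chunk, expensive_fn):
    key = hashlib.sha256("".join(prefix_chunks).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = expensive_fn("".join(prefix_chunks))  # compute prefix once
    prefix_state = _cache[key]
    return expensive_fn(prefix_state + new_chunk)  # only extend with the new chunk

def fake_encode(text: str) -> str:
    return f"<enc:{len(text)}>"  # stand-in for an LLM forward pass

out1 = process_with_prefix_cache(["chunk A", "chunk B"], "chunk C", fake_encode)
out2 = process_with_prefix_cache(["chunk A", "chunk B"], "chunk D", fake_encode)
print(out1, out2, len(_cache))  # second call reuses the cached prefix (1 entry)
```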

[413] HyFedRAG: A Federated Retrieval-Augmented Generation Framework for Heterogeneous and Privacy-Sensitive Data

Cheng Qian, Hainan Zhang, Yongxin Tong, Hong-Wei Zheng, Zhiming Zheng

Main category: cs.AI

TL;DR: HyFedRAG is a federated RAG framework for hybrid healthcare data that enables privacy-preserving retrieval across SQL, knowledge graphs, and clinical notes using edge-cloud collaboration and anonymization tools.

DetailsMotivation: Centralized RAG struggles with heterogeneous, privacy-sensitive healthcare data across distributed settings, making rare disease case retrieval difficult due to privacy constraints and format diversity limitations.

Method: Edge-cloud collaborative framework using Flower, with edge-side LLMs converting diverse data into standardized privacy-preserving representations, lightweight local retrievers with anonymization tools, and three-tier caching strategy for efficiency.

Result: Outperforms existing baselines on PMC-Patients dataset in retrieval quality, generation consistency, and system efficiency.

Conclusion: Provides scalable, privacy-compliant solution for RAG over structurally heterogeneous data, unlocking LLM potential in sensitive, diverse data environments.

Abstract: Centralized RAG pipelines struggle with heterogeneous and privacy-sensitive data, especially in distributed healthcare settings where patient data spans SQL, knowledge graphs, and clinical notes. Clinicians face difficulties retrieving rare disease cases due to privacy constraints and the limitations of traditional cloud-based RAG systems in handling diverse formats and edge devices. To address this, we introduce HyFedRAG, a unified and efficient Federated RAG framework tailored for Hybrid data modalities. By leveraging an edge-cloud collaborative mechanism, HyFedRAG enables RAG to operate across diverse data sources while preserving data privacy. Our key contributions are: (1) We design an edge-cloud collaborative RAG framework built on Flower, which supports querying structured SQL data, semi-structured knowledge graphs, and unstructured documents. The edge-side LLMs convert diverse data into standardized privacy-preserving representations, and the server-side LLMs integrate them for global reasoning and generation. (2) We integrate lightweight local retrievers with privacy-aware LLMs and provide three anonymization tools that enable each client to produce semantically rich, de-identified summaries for global inference across devices. (3) To optimize response latency and reduce redundant computation, we design a three-tier caching strategy consisting of local cache, intermediate representation cache, and cloud inference cache. Experimental results on PMC-Patients demonstrate that HyFedRAG outperforms existing baselines in terms of retrieval quality, generation consistency, and system efficiency. Our framework offers a scalable and privacy-compliant solution for RAG over structurally heterogeneous data, unlocking the potential of LLMs in sensitive and diverse data environments.
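The three-tier caching strategy reduces to a simple lookup order; a minimal sketch with illustrative tier contents and miss handling (the real system caches at local, intermediate-representation, and cloud-inference levels):

```python
# Toy three-tier cache lookup: local -> IR -> cloud, inference on miss.
def answer_query(query, local_cache, ir_cache, cloud_cache, run_inference):
    for tier in (local_cache, ir_cache, cloud_cache):
        if query in tier:
            return tier[query]
    result = run_inference(query)  # full edge-cloud RAG pipeline on miss
    local_cache[query] = result    # populate the cheapest tier for next time
    return result

local, ir, cloud = {}, {"q2": "cached IR answer"}, {}
print(answer_query("q2", local, ir, cloud, lambda q: "fresh"))  # IR hit
print(answer_query("q1", local, ir, cloud, lambda q: "fresh"))  # miss -> inference
```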

[414] Accelerate Scaling of LLM Alignment via Quantifying the Coverage and Depth of Instruction Set

Chengwei Wu, Li Du, Hanyu Zhao, Yiming Ju, Jiapu Wang, Tengfei Pan

Main category: cs.AI

TL;DR: The paper proposes a novel instruction data selection method that identifies instruction depth and semantic coverage as key factors for LLM alignment performance, achieving accelerated scaling compared to SOTA methods.

DetailsMotivation: Improving model alignment performance and efficiency for downstream tasks by addressing the challenge of selecting informative instructions from large candidate pools, as current methods fail to scale effectively.

Method: Investigates key factors influencing instruction dataset distribution and aligned model performance, then designs an instruction selection algorithm that simultaneously maximizes instruction depth and semantic coverage.

Result: The proposed method explains over 70% of model loss on development set and demonstrates sustainable performance improvement at a faster pace compared to state-of-the-art baseline methods.

Conclusion: Instruction depth and semantic coverage are crucial factors for downstream performance, and the proposed selection method enables accelerated scaling in LLM alignment tasks.

Abstract: With the growing demand for applying large language models to downstream tasks, improving model alignment performance and efficiency has become crucial. Such a process involves selecting informative instructions from a candidate pool. However, due to the complexity of instruction set distributions, the key factors driving the performance of aligned models remain unclear. As a result, current instruction set refinement methods fail to improve performance as the instruction pool expands continuously. To address this issue, we first investigate the key factors that influence the relationship between instruction dataset distribution and aligned model performance. Based on these insights, we propose a novel instruction data selection method. We identify that the depth of instructions and the coverage of the semantic space are the crucial factors determining downstream performance, which could explain over 70% of the model loss on the development set. We then design an instruction selection algorithm to simultaneously maximize the depth and semantic coverage of the selected instructions. Experimental results demonstrate that, compared to state-of-the-art baseline methods, it can sustainably improve model performance at a faster pace and thus achieve “Accelerated Scaling”.
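A greedy selector that jointly rewards instruction depth and marginal semantic coverage, in the spirit of the stated objective; the random embeddings, depth scores, and the alpha trade-off are toy assumptions, not the paper's algorithm.

```python
# Toy greedy selection maximizing depth plus marginal semantic coverage.
import numpy as np

def select_instructions(embeddings, depths, k: int, alpha: float = 0.5):
    """embeddings: (n, d) unit vectors; depths: (n,) scores; returns indices."""
    selected, chosen_vecs = [], []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(len(depths)):
            if i in selected:
                continue
            # marginal coverage: dissimilarity to the closest selected item
            if chosen_vecs:
                coverage = 1.0 - max(float(embeddings[i] @ v) for v in chosen_vecs)
            else:
                coverage = 1.0
            score = alpha * depths[i] + (1 - alpha) * coverage
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        chosen_vecs.append(embeddings[best])
    return selected

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(select_instructions(emb, rng.random(10), k=3))
```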

[415] MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

Pengxiang Zhao, Guangyi Liu, Yaozhen Liang, Weiqing He, Zhengxi Lu, Yuehao Huang, Yaxuan Guo, Kexin Zhang, Hao Wang, Liang Liu, Yong Liu

Main category: cs.AI

TL;DR: MAS-Bench is a benchmark for evaluating GUI-shortcut hybrid agents in the mobile domain, featuring 139 tasks across 11 apps, 88 predefined shortcuts, and 7 metrics to assess agents’ ability to autonomously generate and use shortcuts for efficiency.

DetailsMotivation: To address the lack of systematic benchmarking frameworks for hybrid GUI agents that combine traditional GUI operations with efficient shortcuts (APIs, deep links, RPA scripts) to enhance efficiency across various platforms.

Method: Developed MAS-Bench with 139 complex tasks across 11 real-world mobile applications, a knowledge base of 88 predefined shortcuts, and 7 evaluation metrics to assess agents’ capability to autonomously generate reusable workflows and intelligently embed shortcuts.

Result: Hybrid agents achieved significantly higher success rates and efficiency compared to GUI-only counterparts, demonstrating the effectiveness of the benchmark for evaluating shortcut generation capabilities.

Conclusion: MAS-Bench fills a critical evaluation gap and provides a foundational platform for future advancements in creating more efficient and robust intelligent agents that can leverage both GUI operations and shortcuts.

Abstract: To enhance the efficiency of GUI agents on various platforms like smartphones and computers, a hybrid paradigm that combines flexible GUI operations with efficient shortcuts (e.g., API, deep links) is emerging as a promising direction. However, a framework for systematically benchmarking these hybrid agents is still underexplored. To take the first step in bridging this gap, we introduce MAS-Bench, a benchmark that pioneers the evaluation of GUI-shortcut hybrid agents with a specific focus on the mobile domain. Beyond merely using predefined shortcuts, MAS-Bench assesses an agent’s capability to autonomously generate shortcuts by discovering and creating reusable, low-cost workflows. It features 139 complex tasks across 11 real-world applications, a knowledge base of 88 predefined shortcuts (APIs, deep-links, RPA scripts), and 7 evaluation metrics. The tasks are designed to be solvable via GUI-only operations, but can be significantly accelerated by intelligently embedding shortcuts. Experiments show that hybrid agents achieve significantly higher success rates and efficiency than their GUI-only counterparts. This result also demonstrates the effectiveness of our method for evaluating an agent’s shortcut generation capabilities. MAS-Bench fills a critical evaluation gap, providing a foundational platform for future advancements in creating more efficient and robust intelligent agents.

[416] MORSE: Multi-Objective Reinforcement Learning via Strategy Evolution for Supply Chain Optimization

Niki Kotecha, Ehecatl Antonio del Rio Chanona

Main category: cs.AI

TL;DR: Proposes RL+MOEA hybrid approach for dynamic multi-objective supply chain optimization with CVaR risk management, outperforming state-of-the-art methods in inventory management.

DetailsMotivation: Traditional optimization methods struggle with real-time adaptation to dynamic supply chain environments with conflicting objectives like cost, service level, and sustainability.

Method: Combines Reinforcement Learning and Multi-Objective Evolutionary Algorithms to search policy neural network parameter space, generating Pareto front of policies with CVaR for risk-sensitive decision-making.

Result: Demonstrates effectiveness through case studies, showing ability to respond to supply chain dynamics and outperform state-of-the-art methods in inventory management.

Conclusion: Provides flexible, adaptable real-time decision-making framework that improves efficiency and robustness in managing uncertainty while optimizing supply chain performance.

Abstract: In supply chain management, decision-making often involves balancing multiple conflicting objectives, such as cost reduction, service level improvement, and environmental sustainability. Traditional multi-objective optimization methods, such as linear programming and evolutionary algorithms, struggle to adapt in real-time to the dynamic nature of supply chains. In this paper, we propose an approach that combines Reinforcement Learning (RL) and Multi-Objective Evolutionary Algorithms (MOEAs) to address these challenges for dynamic multi-objective optimization under uncertainty. Our method leverages MOEAs to search the parameter space of policy neural networks, generating a Pareto front of policies. This provides decision-makers with a diverse population of policies that can be dynamically switched based on the current system objectives, ensuring flexibility and adaptability in real-time decision-making. We also introduce Conditional Value-at-Risk (CVaR) to incorporate risk-sensitive decision-making, enhancing resilience in uncertain environments. We demonstrate the effectiveness of our approach through case studies, showcasing its ability to respond to supply chain dynamics and to outperform state-of-the-art methods in an inventory management case study. The proposed strategy not only improves decision-making efficiency but also offers a more robust framework for managing uncertainty and optimizing performance in supply chains.
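
CVaR itself is a standard tail-risk statistic. A minimal sketch over sampled episode returns, which the paper's exact risk formulation may refine:

```python
# CVaR: the average of the worst (1 - alpha) fraction of outcomes.
import numpy as np

def cvar(returns, alpha=0.95):
    losses = -np.asarray(returns, dtype=float)  # treat low returns as losses
    var = np.quantile(losses, alpha)            # Value-at-Risk threshold
    return losses[losses >= var].mean()         # mean shortfall in the tail

returns = np.random.default_rng(0).normal(loc=100.0, scale=15.0, size=10_000)
print(cvar(returns, alpha=0.95))  # expected loss in the worst 5% of episodes
```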

[417] Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers

Ran Xin, Zeyu Zheng, Yanchen Nie, Kun Yuan, Xia Xiao

Main category: cs.AI

TL;DR: BFS-Prover-V2 addresses scaling challenges in LLM-based theorem proving with two innovations: multi-turn off-policy RL framework for training improvement and planner-enhanced multi-agent search architecture for efficient inference.

DetailsMotivation: Integration of LLMs into automated theorem proving is constrained by challenges in scaling both training-time RL and inference-time compute, leading to performance plateaus and inefficient search.

Method: 1) Multi-turn off-policy RL framework with AlphaZero-inspired expert iteration, adaptive tactic-level filtering, and periodic retraining. 2) Planner-enhanced multi-agent search with hierarchical decomposition of complex theorems into simpler subgoals using parallel prover agents and shared proof cache.

Result: State-of-the-art performance: 95.08% on MiniF2F and 41.4% on ProofNet test sets, demonstrating effective scaling solutions for both training and inference.

Conclusion: The dual approach successfully addresses scaling challenges in LLM-based theorem proving, with techniques applicable to other domains requiring long-horizon multi-turn reasoning and complex search.

Abstract: The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet is fundamentally constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces BFS-Prover-V2, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of the LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into a sequence of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks. BFS-Prover-V2 achieves 95.08% and 41.4% on the MiniF2F and ProofNet test sets respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.

[418] An AI system to help scientists write expert-level empirical software

Eser Aygün, Anastasiya Belyaeva, Gheorghe Comanici, Marc Coram, Hao Cui, Jake Garrison, Renee Johnston, Anton Kast, Cory Y. McLean, Peter Norgaard, Zahra Shamsi, David Smalling, James Thompson, Subhashini Venugopalan, Brian P. Williams, Chujun He, Sarah Martinson, Martyna Plomecka, Lai Wei, Yuchen Zhou, Qian-Ze Zhu, Matthew Abraham, Erica Brand, Anna Bulanova, Jeffrey A. Cardille, Chris Co, Scott Ellsworth, Grace Joseph, Malcolm Kane, Ryan Krueger, Johan Kartiwa, Dan Liebling, Jan-Matthis Lueckmann, Paul Raccuglia, Xuefei Wang, Katherine Chou, James Manyika, Yossi Matias, John C. Platt, Lizzie Dorfman, Shibl Mourad, Michael P. Brenner

Main category: cs.AI

TL;DR: AI system combining LLM and Tree Search creates expert-level scientific software that outperforms human-developed methods across multiple domains including bioinformatics, epidemiology, and geospatial analysis.

DetailsMotivation: To overcome the bottleneck of slow, manual software creation for computational experiments and accelerate scientific discovery.

Method: Uses Large Language Model (LLM) and Tree Search (TS) to systematically improve quality metrics and intelligently navigate solution spaces, exploring and integrating complex research ideas from external sources.

Result: Discovered 40 novel methods for single-cell data analysis that outperformed top human methods; generated 14 models that beat CDC ensemble for COVID-19 forecasting; produced state-of-the-art software for geospatial analysis, neural activity prediction, time series forecasting, and numerical integration.

Conclusion: The system represents a significant step towards accelerating scientific progress by automating the creation of expert-level scientific software across diverse domains.

Abstract: The cycle of scientific discovery is frequently bottlenecked by the slow, manual creation of software to support computational experiments. To address this, we present an AI system that creates expert-level scientific software whose goal is to maximize a quality metric. The system uses a Large Language Model (LLM) and Tree Search (TS) to systematically improve the quality metric and intelligently navigate the large space of possible solutions. The system achieves expert-level results when it explores and integrates complex research ideas from external sources. The effectiveness of tree search is demonstrated across a wide range of benchmarks. In bioinformatics, it discovered 40 novel methods for single-cell data analysis that outperformed the top human-developed methods on a public leaderboard. In epidemiology, it generated 14 models that outperformed the CDC ensemble and all other individual models for forecasting COVID-19 hospitalizations. Our method also produced state-of-the-art software for geospatial analysis, neural activity prediction in zebrafish, time series forecasting and numerical solution of integrals. By devising and implementing novel solutions to diverse tasks, the system represents a significant step towards accelerating scientific progress.
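
At its core the system is a score-guided search over candidate programs. A schematic sketch, where propose_revisions() is a hypothetical stand-in for the LLM rewrite step and score_fn for the task's quality metric:

```python
# Best-first tree search over candidate programs (represented as strings).
import heapq
from itertools import count

def tree_search(seed, score_fn, propose_revisions, budget=100, width=3):
    tie = count()                                   # tiebreaker for the heap
    best_score, best = score_fn(seed), seed
    frontier = [(-best_score, next(tie), seed)]     # max-heap via negation
    for _ in range(budget):
        if not frontier:
            break
        _, _, program = heapq.heappop(frontier)     # expand the top candidate
        for child in propose_revisions(program, n=width):  # hypothetical LLM call
            s = score_fn(child)
            if s > best_score:
                best_score, best = s, child
            heapq.heappush(frontier, (-s, next(tie), child))
    return best_score, best
```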

[419] CogGuide: Human-Like Guidance for Zero-Shot Omni-Modal Reasoning

Zhou-Peng Shou, Zhi-Qiang You, Fang Wang, Hai-Bo Liu

Main category: cs.AI

TL;DR: Zero-shot multimodal reasoning component using human-like cognitive strategies with “intent sketch” to suppress shortcut reasoning and improve contextual understanding without parameter fine-tuning.

DetailsMotivation: Address issues of "shortcuts" and insufficient contextual understanding in complex cross-modal reasoning of multimodal large models by mimicking human cognitive processes.

Method: Plug-and-play three-module pipeline: Intent Perceiver, Strategy Generator, and Strategy Selector, which explicitly constructs an “understand-plan-select” cognitive process using “intent sketch” strategies through in-context engineering.

Result: Achieves consistent improvements across different reasoning engines and pipeline combinations with gains up to 9.51 percentage points on IntentBench, WorldSense, and Daily-Omni benchmarks.

Conclusion: The “intent sketch” reasoning component demonstrates practical value and portability in zero-shot scenarios, reducing conditional entropy and improving information utilization efficiency without parameter tuning.

Abstract: Targeting the issues of “shortcuts” and insufficient contextual understanding in complex cross-modal reasoning of multimodal large models, this paper proposes a zero-shot multimodal reasoning component guided by human-like cognitive strategies centered on an “intent sketch”. The component comprises a plug-and-play three-module pipeline (Intent Perceiver, Strategy Generator, and Strategy Selector) that explicitly constructs an “understand-plan-select” cognitive process. By generating and filtering “intent sketch” strategies to guide the final reasoning, it requires no parameter fine-tuning and achieves cross-model transfer solely through in-context engineering. Information-theoretic analysis shows that this process can reduce conditional entropy and improve information utilization efficiency, thereby suppressing unintended shortcut reasoning. Experiments on IntentBench, WorldSense, and Daily-Omni validate the method’s generality and robust gains; compared with their respective baselines, the complete “three-module” scheme yields consistent improvements across different reasoning engines and pipeline combinations, with gains up to approximately 9.51 percentage points, demonstrating the practical value and portability of the “intent sketch” reasoning component in zero-shot scenarios.

[420] Reinforcement Learning Foundations for Deep Research Systems: A Survey

Wenjun Li, Zhi Chen, Jingru Lin, Hannan Cao, Wei Han, Sheng Liang, Zhi Zhang, Kuicai Dong, Dexun Li, Chen Zhang, Yong Liu

Main category: cs.AI

TL;DR: This survey paper provides the first comprehensive analysis of reinforcement learning (RL) foundations for deep research systems, covering data synthesis, RL methods for agentic research, and training systems/frameworks.

DetailsMotivation: Current deep research systems rely on supervised fine-tuning (SFT) and preference alignment methods like DPO, which suffer from imitation biases, exposure biases, and dependence on human-defined decision points. RL offers better alignment with tool-interaction research by enabling exploration, recovery behaviors, and principled credit assignment.

Method: The survey systematizes work along three axes: (i) data synthesis and curation, (ii) RL methods covering stability, sample efficiency, long context handling, reward design, and multi-objective optimization, and (iii) agentic RL training systems and frameworks. It also covers agent architecture, coordination, evaluation, and benchmarks.

Result: The paper distills recurring patterns, identifies infrastructure bottlenecks, and provides practical guidance for training robust, transparent deep research agents using reinforcement learning.

Conclusion: Reinforcement learning provides a more effective foundation for deep research systems compared to SFT and DPO methods, enabling better exploration, credit assignment, and reduced dependence on human priors and biases for complex, multi-step research tasks.

Abstract: Deep research systems, agentic AI that solve complex, multi-step tasks by coordinating reasoning, search across the open web and user files, and tool use, are moving toward hierarchical deployments with a Planner, Coordinator, and Executors. In practice, training entire stacks end-to-end remains impractical, so most work trains a single planner connected to core tools such as search, browsing, and code. While SFT imparts protocol fidelity, it suffers from imitation and exposure biases and underuses environment feedback. Preference alignment methods such as DPO are schema and proxy-dependent, off-policy, and weak for long-horizon credit assignment and multi-objective trade-offs. A further limitation of SFT and DPO is their reliance on human defined decision points and subskills through schema design and labeled comparisons. Reinforcement learning aligns with closed-loop, tool-interaction research by optimizing trajectory-level policies, enabling exploration, recovery behaviors, and principled credit assignment, and it reduces dependence on such human priors and rater biases. This survey is, to our knowledge, the first dedicated to the RL foundations of deep research systems. It systematizes work after DeepSeek-R1 along three axes: (i) data synthesis and curation; (ii) RL methods for agentic research covering stability, sample efficiency, long context handling, reward and credit design, multi-objective optimization, and multimodal integration; and (iii) agentic RL training systems and frameworks. We also cover agent architecture and coordination, as well as evaluation and benchmarks, including recent QA, VQA, long-form synthesis, and domain-grounded, tool-interaction tasks. We distill recurring patterns, surface infrastructure bottlenecks, and offer practical guidance for training robust, transparent deep research agents with RL.

[421] VehicleWorld: A Highly Integrated Multi-Device Environment for Intelligent Vehicle Interaction

Jie Yang, Jiajun Chen, Zhangyue Yin, Shuo Chen, Yuxin Wang, Yiran Guo, Yuan Li, Yining Zheng, Xuanjing Huang, Xipeng Qiu

Main category: cs.AI

TL;DR: VehicleWorld introduces a comprehensive automotive environment with 30 modules and 250 APIs, proposing State-based Function Call (SFC) that outperforms traditional function calling by maintaining explicit system state awareness for vehicle cockpit control.

DetailsMotivation: Intelligent vehicle cockpits require coordination across complex subsystems, but traditional Function Calling approaches operate statelessly, requiring multiple exploratory calls that lead to inefficiency and limited error recovery.

Method: The authors introduce VehicleWorld environment with 30 modules, 250 APIs, and 680 properties, then propose State-based Function Call (SFC) which maintains explicit system state awareness and implements direct state transitions to achieve target conditions.

Result: Experimental results show SFC significantly outperforms traditional FC approaches, achieving superior execution accuracy and reduced latency. Direct state prediction was found to outperform function calling for environmental control.

Conclusion: SFC represents a novel approach that addresses the limitations of traditional function calling in complex vehicle cockpit environments, providing better performance through explicit state awareness and direct state transitions.

Abstract: Intelligent vehicle cockpits present unique challenges for API Agents, requiring coordination across tightly-coupled subsystems that exceed typical task environments’ complexity. Traditional Function Calling (FC) approaches operate statelessly, requiring multiple exploratory calls to build environmental awareness before execution, leading to inefficiency and limited error recovery. We introduce VehicleWorld, the first comprehensive environment for the automotive domain, featuring 30 modules, 250 APIs, and 680 properties with fully executable implementations that provide real-time state information during agent execution. This environment enables precise evaluation of vehicle agent behaviors across diverse, challenging scenarios. Through systematic analysis, we discovered that direct state prediction outperforms function calling for environmental control. Building on this insight, we propose State-based Function Call (SFC), a novel approach that maintains explicit system state awareness and implements direct state transitions to achieve target conditions. Experimental results demonstrate that SFC significantly outperforms traditional FC approaches, achieving superior execution accuracy and reduced latency. We have made all implementation code publicly available on GitHub https://github.com/OpenMOSS/VehicleWorld.
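
The state-transition idea behind SFC can be illustrated with a toy diff between current and target cockpit state; the property names below are invented, not VehicleWorld's schema:

```python
# Instead of exploratory calls, diff current vs. target state and emit only
# the writes needed to reach the target.
current_state = {"ac.power": "off", "ac.temp": 26, "window.driver": "open"}
target_state  = {"ac.power": "on",  "ac.temp": 22, "window.driver": "closed"}

def plan_transitions(current, target):
    """Return only the property writes needed to reach the target state."""
    return {k: v for k, v in target.items() if current.get(k) != v}

for prop, value in plan_transitions(current_state, target_state).items():
    print(f"set {prop} -> {value}")   # each write maps to one device command
```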

[422] Another Turn, Better Output? A Turn-Wise Analysis of Iterative LLM Prompting

Shashidhar Reddy Javaji, Bhavul Gauri, Zining Zhu

Main category: cs.AI

TL;DR: Evaluation framework for measuring iterative refinement in LLMs across ideation, code, and math tasks, showing domain-dependent gains and the importance of targeted feedback over vague prompts.

DetailsMotivation: Lack of clear measurement for when iteration helps vs. hurts in multi-turn LLM workflows, despite their increasing use in various domains.

Method: 12-turn controlled conversations per task using various prompts (vague to targeted), with domain-appropriate scoring (unit tests for code, reasoning-soundness for math, originality/feasibility for ideation) and turn-level metrics tracking semantic movement, change, and size growth.

Result: Gains are domain-dependent: early turns matter for ideas/code, late turns matter for math with elaboration; vague feedback plateaus/reverses correctness while targeted prompts reliably improve quality; consistent domain patterns observed in semantic movement and output characteristics.

Conclusion: Framework makes iteration measurable and comparable across models, providing signals for when to steer, stop, or switch strategies in multi-turn LLM workflows.

Abstract: Large language models (LLMs) are now used in multi-turn workflows, but we still lack a clear way to measure when iteration helps and when it hurts. We present an evaluation framework for iterative refinement that spans ideation, code, and math. Our protocol runs controlled 12-turn conversations per task, utilizing a variety of prompts ranging from vague “improve it” feedback to targeted steering, and logs per-turn outputs. We score outcomes with domain-appropriate checks (unit tests for code; answer-equivalence plus reasoning-soundness for math; originality and feasibility for ideation) and track turn-level behavior with three families of metrics: semantic movement across turns, turn-to-turn change, and output size growth. Across models and tasks, gains are domain-dependent: they arrive early in ideas and code, but in math late turns matter when guided by elaboration. After the first few turns, vague feedback often plateaus or reverses correctness, while targeted prompts reliably shift the intended quality axis (novelty vs. feasibility in ideation; speed vs. readability in code; in math, elaboration outperforms exploration and drives late-turn gains). We also observe consistent domain patterns: ideation moves more in meaning across turns, code tends to grow in size with little semantic change, and math starts fixed but can break that path with late, elaborative iteration. Together, the framework and metrics make iteration measurable and comparable across models, and signal when to steer, stop, or switch strategies.
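
The three metric families lend themselves to a compact sketch, assuming per-turn texts and sentence embeddings are available; these are our approximations, not the authors' exact definitions:

```python
# Turn-level metrics: semantic drift from turn 1, turn-to-turn change, size growth.
import numpy as np

def cosine_dist(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def turn_metrics(turn_embeddings, turn_texts):
    moves = [cosine_dist(e, turn_embeddings[0])          # drift from the first turn
             for e in turn_embeddings[1:]]
    deltas = [cosine_dist(turn_embeddings[i], turn_embeddings[i - 1])
              for i in range(1, len(turn_embeddings))]   # turn-to-turn change
    sizes = [len(t.split()) for t in turn_texts]         # output size per turn
    return moves, deltas, sizes
```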

[423] RAFFLES: Reasoning-based Attribution of Faults for LLM Systems

Chenyang Zhu, Spencer Hong, Jingyu Wu, Kushal Chawla, Charlotte Tang, Youbing Yin, Nathan Wolfe, Erin Babinsky, Daben Liu

Main category: cs.AI

TL;DR: RAFFLES is an evaluation architecture that uses iterative reasoning and refinement to identify faults in long-horizon LLM agentic systems, significantly outperforming existing baselines in fault detection accuracy.

DetailsMotivation: Current evaluation methods for LLM agentic systems are limited to single metrics and end-to-end outcomes, making it difficult to identify where and why complex multi-component systems fail over long horizons.

Method: RAFFLES operates as an iterative pipeline with a central Judge that systematically investigates faults and specialized Evaluators that assess both system components and the Judge’s reasoning quality, building a history of hypotheses.

Result: RAFFLES achieved over 43% agent-step fault pair accuracy on Algorithmically-Generated dataset (vs 16.6% previous best) and over 20% on Hand-Crafted dataset (vs 8.8% previous best).

Conclusion: RAFFLES represents a significant advancement in automated fault detection for autonomous systems, moving beyond labor-intensive manual human review and enabling better understanding of complex agentic system failures.

Abstract: We have reached a critical roadblock in the development and enhancement of long-horizon, multi-component LLM agentic systems: it is incredibly tricky to identify where these systems break down and why. Evaluation capabilities that currently exist today (e.g., single pass LLM-as-a-judge) are limited in that they often focus on individual metrics or capabilities, end-to-end outcomes, and are narrowly grounded on the preferences of humans. We argue that to match the agentic capabilities, evaluation frameworks must also be able to reason, probe, iterate, and understand the complex logic passing through these systems over long horizons. In this paper, we present RAFFLES - an evaluation architecture that incorporates reasoning and iterative refinement. Specifically, RAFFLES operates as an iterative, multi-component pipeline, using a central Judge to systematically investigate faults and a set of specialized Evaluators to assess not only the system’s components but also the quality of the reasoning by the Judge itself, thereby building a history of hypotheses. We tested RAFFLES against several baselines on the Who&When dataset, a benchmark designed to diagnose the “who” (agent) and “when” (step) of a system’s failure. RAFFLES outperforms these baselines, achieving an agent-step fault pair accuracy of over 43% on the Algorithmically-Generated dataset (a substantial increase from the previously published best of 16.6%) and over 20% on the Hand-Crafted dataset (surpassing the previously published best of 8.8%). These results demonstrate a key step towards introducing automated fault detection for autonomous systems over labor-intensive manual human review.
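
The Judge-Evaluator interaction can be sketched abstractly; judge(), the evaluators, and the acceptance threshold below are hypothetical stand-ins for the paper's components:

```python
# Abstract sketch of an iterative fault-attribution loop in the RAFFLES style.
def attribute_fault(trace, judge, evaluators, max_rounds=5, accept=0.8):
    history = []                                    # running history of hypotheses
    for _ in range(max_rounds):
        hypothesis = judge(trace, history)          # propose an (agent, step) fault pair
        scores = [ev(trace, hypothesis) for ev in evaluators]  # critique hypothesis and judge reasoning
        history.append((hypothesis, scores))
        if min(scores) >= accept:                   # all evaluators satisfied
            return hypothesis
    return max(history, key=lambda h: sum(h[1]))[0] # best hypothesis seen so far
```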

[424] Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet

James Xu Zhao, Bryan Hooi, See-Kiong Ng

Main category: cs.AI

TL;DR: Test-time scaling (increasing inference computation for longer reasoning chains) doesn’t consistently improve accuracy on knowledge-intensive tasks and often increases hallucinations, though it remains beneficial compared to no thinking.

DetailsMotivation: To evaluate the effectiveness of test-time scaling approaches on knowledge-intensive tasks where factual accuracy and low hallucination rates are critical, as current methods show limitations in these domains.

Method: Comprehensive evaluation using 12 reasoning models on two knowledge-intensive benchmarks, analyzing how extended reasoning affects hallucination behavior through case studies and behavioral analysis.

Result: Increasing test-time computation doesn’t consistently improve accuracy and often leads to more hallucinations. Reduced hallucinations come from models choosing to abstain rather than improved factual recall, while longer reasoning can induce confirmation bias and overconfident hallucinations.

Conclusion: While test-time scaling has limitations for knowledge-intensive tasks (inconsistent accuracy improvements and increased hallucinations), enabling thinking remains beneficial compared to no thinking at all, highlighting the need for better approaches in this domain.

Abstract: Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains, and has shown strong performance across many domains. However, in this work, we show that this approach is not yet effective for knowledge-intensive tasks, where high factual accuracy and low hallucination rates are essential. We conduct a comprehensive evaluation of test-time scaling using 12 reasoning models on two knowledge-intensive benchmarks. Our results reveal that increasing test-time computation does not consistently improve accuracy and, in many cases, it even leads to more hallucinations. We then analyze how extended reasoning affects hallucination behavior. We find that reduced hallucinations often result from the model choosing to abstain after thinking more, rather than from improved factual recall. Conversely, for some models, longer reasoning encourages attempts on previously unanswered questions, many of which result in hallucinations. Case studies show that extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Despite these limitations, we observe that compared to non-thinking, enabling thinking remains beneficial. Code and data are available at https://github.com/XuZhao0/tts-knowledge

[425] Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents

Jiacheng Miao, Joe R. Davis, Jonathan K. Pritchard, James Zou

Main category: cs.AI

TL;DR: Paper2Agent is an automated framework that converts research papers into AI agents that can execute the paper’s methods and answer scientific queries through natural language.

DetailsMotivation: Research papers require substantial effort to understand and adapt code/methods, creating barriers to dissemination and reuse. The goal is to transform static papers into active AI systems.

Method: Systematically analyzes papers and codebases using multiple agents to construct Model Context Protocol (MCP) servers, then iteratively generates and runs tests to refine the MCP. These can be connected to chat agents for natural language interaction.

Result: Successfully created agents that leverage AlphaGenome for genomic variant interpretation and ScanPy/TISSUE for single-cell/spatial transcriptomics analyses. Agents can reproduce original results and handle novel user queries.

Conclusion: Paper2Agent introduces a new paradigm for knowledge dissemination by turning static papers into dynamic AI agents, creating a foundation for collaborative AI co-scientists.

Abstract: We introduce Paper2Agent, an automated framework that converts research papers into AI agents. Paper2Agent transforms research output from passive artifacts into active systems that can accelerate downstream use, adoption, and discovery. Conventional research papers require readers to invest substantial effort to understand and adapt a paper’s code, data, and methods to their own work, creating barriers to dissemination and reuse. Paper2Agent addresses this challenge by automatically converting a paper into an AI agent that acts as a knowledgeable research assistant. It systematically analyzes the paper and the associated codebase using multiple agents to construct a Model Context Protocol (MCP) server, then iteratively generates and runs tests to refine and robustify the resulting MCP. These paper MCPs can then be flexibly connected to a chat agent (e.g. Claude Code) to carry out complex scientific queries through natural language while invoking tools and workflows from the original paper. We demonstrate Paper2Agent’s effectiveness in creating reliable and capable paper agents through in-depth case studies. Paper2Agent created an agent that leverages AlphaGenome to interpret genomic variants and agents based on ScanPy and TISSUE to carry out single-cell and spatial transcriptomics analyses. We validate that these paper agents can reproduce the original paper’s results and can correctly carry out novel user queries. By turning static papers into dynamic, interactive AI agents, Paper2Agent introduces a new paradigm for knowledge dissemination and a foundation for the collaborative ecosystem of AI co-scientists.

[426] Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference

Xiangwei Shen, Zhimin Li, Zhantao Yang, Shiyi Zhang, Yingfang Zhang, Donghao Li, Chunyu Wang, Qinglin Lu, Yansong Tang

Main category: cs.AI

TL;DR: Direct-Align method with SRPO improves diffusion model alignment by avoiding expensive multistep denoising and enabling online reward adjustment, achieving 3x better realism and aesthetics.

DetailsMotivation: Existing diffusion model alignment methods suffer from computational expense of multistep denoising and require continuous offline reward model adaptation for desired aesthetic quality.

Method: Proposes Direct-Align method that predefines noise prior for image recovery via interpolation, and Semantic Relative Preference Optimization (SRPO) for text-conditioned online reward adjustment.

Result: Fine-tuned FLUX.1.dev model shows over 3x improvement in human-evaluated realism and aesthetic quality compared to previous methods.

Conclusion: The proposed approach effectively addresses computational limitations and offline adaptation requirements of previous diffusion model alignment methods, significantly improving output quality.

Abstract: Recent studies have demonstrated the effectiveness of directly aligning diffusion models with human preferences using differentiable reward. However, they exhibit two primary challenges: (1) they rely on multistep denoising with gradient computation for reward scoring, which is computationally expensive, thus restricting optimization to only a few diffusion steps; (2) they often need continuous offline adaptation of reward models in order to achieve desired aesthetic quality, such as photorealism or precise lighting effects. To address the limitation of multistep denoising, we propose Direct-Align, a method that predefines a noise prior to effectively recover original images from any time steps via interpolation, leveraging the equation that diffusion states are interpolations between noise and target images, which effectively avoids over-optimization in late timesteps. Furthermore, we introduce Semantic Relative Preference Optimization (SRPO), in which rewards are formulated as text-conditioned signals. This approach enables online adjustment of rewards in response to positive and negative prompt augmentation, thereby reducing the reliance on offline reward fine-tuning. By fine-tuning the FLUX.1.dev model with optimized denoising and online reward adjustment, we improve its human-evaluated realism and aesthetic quality by over 3x.
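
The interpolation identity the method leans on is easy to state concretely. A sketch under the common parameterization x_t = (1 - sigma_t) * x0 + sigma_t * eps, which may differ in detail from the paper's:

```python
# If the injected noise eps (the predefined prior) is known, the interpolation
# can be inverted to recover the clean image from any timestep.
import torch

def recover_x0(x_t, eps, sigma_t):
    return (x_t - sigma_t * eps) / (1.0 - sigma_t)

x0 = torch.rand(1, 3, 64, 64)
eps = torch.randn_like(x0)
sigma = 0.7
x_t = (1 - sigma) * x0 + sigma * eps          # the forward interpolation
assert torch.allclose(recover_x0(x_t, eps, sigma), x0, atol=1e-5)
```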

[427] Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation

James Mooney, Josef Woldense, Zheng Robert Jia, Shirley Anugrah Hayati, My Ha Nguyen, Vipul Raheja, Dongyeop Kang

Main category: cs.AI

TL;DR: LLMs show significant internal inconsistencies across different experimental settings, failing to maintain behavioral consistency despite generating human-like survey responses, which limits their ability to substitute real human participants in research.

DetailsMotivation: To evaluate whether LLM-based synthetic agents can truly substitute human participants in research by examining their internal consistency across different experimental settings, rather than just surface-level survey response matching.

Method: Developed a study to (a) reveal agents’ internal states and (b) examine their behavior in basic dialogue settings, testing behavioral hypotheses to assess consistency between conversation behavior and revealed internal states across various LLM families and sizes.

Result: Found significant internal inconsistencies in LLMs across different model families and sizes. While agents can generate responses matching human counterparts, they fail to maintain internal consistency.

Conclusion: LLMs’ inability to be internally consistent represents a critical limitation in their capability to accurately substitute for real human participants in research, despite their surface-level response generation abilities.

Abstract: The impressive capabilities of Large Language Models (LLMs) have fueled the notion that synthetic agents can serve as substitutes for real participants in human-subject research. In an effort to evaluate the merits of this claim, social science researchers have largely focused on whether LLM-generated survey data corresponds to that of a human counterpart whom the LLM is prompted to represent. In contrast, we address a more fundamental question: Do agents maintain internal consistency, retaining similar behaviors when examined under different experimental settings? To this end, we develop a study designed to (a) reveal the agent’s internal state and (b) examine agent behavior in a basic dialogue setting. This design enables us to explore a set of behavioral hypotheses to assess whether an agent’s conversation behavior is consistent with what we would expect from their revealed internal state. Our findings on these hypotheses show significant internal inconsistencies in LLMs across model families and at differing model sizes. Most importantly, we find that, although agents may generate responses matching those of their human counterparts, they fail to be internally consistent, representing a critical gap in their capabilities to accurately substitute for real participants in human-subject research. Our simulation code and data are publicly accessible.

[428] Categorical semantics of compositional reinforcement learning

Georgios Bakirtzis, Michail Savvas, Ufuk Topcu

Main category: cs.AI

TL;DR: A categorical framework for compositional reinforcement learning using MDP pushout operations to ensure robust task compositionality and unify safety/symmetry concepts.

DetailsMotivation: To develop formal minimal assumptions for robust compositional knowledge representations in RL, enabling modular, interpretable, and safe task specifications through functional decompositions.

Method: Using category theory to study the category MDP, with objects as Markov decision processes. Employing pushout operations for task compositionality and introducing zig-zag diagrams that leverage compositional guarantees from the category structure.

Result: The framework provides compositional guarantees for RL tasks through pushout operations, unifies concepts like safety requirements and symmetry exploitation, and generalizes previous abstraction theories for reinforcement learning.

Conclusion: The categorical approach offers a robust theoretical foundation for compositional RL, enabling systematic task combination while ensuring properties like safety and symmetry are preserved through formal mathematical guarantees.

Abstract: Compositional knowledge representations in reinforcement learning (RL) facilitate modular, interpretable, and safe task specifications. However, generating compositional models requires the characterization of minimal assumptions for the robustness of the compositionality feature, especially in the case of functional decompositions. Using a categorical point of view, we develop a knowledge representation framework for a compositional theory of RL. Our approach relies on the theoretical study of the category MDP, whose objects are Markov decision processes (MDPs) acting as models of tasks. The categorical semantics models the compositionality of tasks through the application of pushout operations akin to combining puzzle pieces. As a practical application of these pushout operations, we introduce zig-zag diagrams that rely on the compositional guarantees engendered by the category MDP. We further prove that properties of the category MDP unify concepts, such as enforcing safety requirements and exploiting symmetries, generalizing previous abstraction theories for RL.
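
The pushout referenced above glues two task MDPs along a shared interface, like combining puzzle pieces along a common edge; schematically (object and arrow names are illustrative):

```latex
% Pushout square in MDP: tasks M_1 and M_2 are glued along a shared sub-task M_0.
\[
\begin{array}{ccc}
M_0 & \xrightarrow{\,f\,} & M_1 \\
\downarrow g & & \downarrow \\
M_2 & \longrightarrow & M_1 \sqcup_{M_0} M_2
\end{array}
\]
```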

[429] NoisyNN: Exploring the Impact of Information Entropy Change in Learning Systems

Xiaowei Yu, Zhe Huang, Minheng Chen, Lu Zhang, Tianming Liu, Dajiang Zhu

Main category: cs.AI

TL;DR: Noise injection can improve deep learning performance by reducing task complexity through entropy change, with positive noise boosting accuracy while harmful noise impairs models.

DetailsMotivation: Conventional view treats noise as harmful perturbation, but this work explores how noise can positively impact learning systems by changing entropy and reducing task complexity.

Method: Inject noise at different levels (embedding space, image) in CNNs and ViTs, categorize noise into positive (PN) and harmful (HN) types based on task complexity reduction, and use information entropy to define task complexity.

Result: Achieved unprecedented 95% top-1 accuracy on ImageNet, with extensive experiments showing performance improvements in both CNNs and ViTs through proactive positive noise injection.

Conclusion: Noise plays different roles - positive noise benefits learning while harmful noise impairs models, offering new explanations for deep models and providing a paradigm for performance improvement through entropy manipulation.

Abstract: We investigate the impact of entropy change in deep learning systems by noise injection at different levels, including the embedding space and the image. The series of models that employ our methodology are collectively known as Noisy Neural Networks (NoisyNN), with examples such as NoisyViT and NoisyCNN. Noise is conventionally viewed as a harmful perturbation in various deep learning architectures, such as convolutional neural networks (CNNs) and vision transformers (ViTs), as well as different learning tasks like image classification and transfer learning. However, this work shows noise can be an effective way to change the entropy of the learning system. We demonstrate that specific noise can boost the performance of various deep models under certain conditions. We theoretically prove the enhancement gained from positive noise by reducing the task complexity defined by information entropy and experimentally show the significant performance gain in large image datasets, such as the ImageNet. Herein, we use the information entropy to define the complexity of the task. We categorize the noise into two types, positive noise (PN) and harmful noise (HN), based on whether the noise can help reduce the task complexity. Extensive experiments with CNNs and ViTs have shown performance improvements by proactively injecting positive noise, where we achieved an unprecedented top-1 accuracy of 95% on ImageNet. Both theoretical analysis and empirical evidence have confirmed that the presence of positive noise can benefit the learning process, while the traditionally perceived harmful noise indeed impairs deep learning models. The different roles of noise offer new explanations for deep models on specific tasks and provide a new paradigm for improving model performance. Moreover, it reminds us that we can influence the performance of learning systems via information entropy change.
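
At its simplest, the injection point is a single additive perturbation before the encoder; whether a given perturbation counts as positive or harmful is what the paper's entropy analysis decides. A toy sketch (ours):

```python
# Additive noise injection at the embedding level, ahead of the encoder blocks.
import torch

def inject_noise(embeddings, scale=0.1):
    """Perturb token/patch embeddings before they enter the encoder."""
    return embeddings + scale * torch.randn_like(embeddings)

x = torch.randn(8, 197, 768)           # e.g., a ViT patch-embedding batch
x_noisy = inject_noise(x, scale=0.05)
```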

[430] GameGPT: Multi-agent Collaborative Framework for Game Development

Dake Chen, Haoyang Zhang, Hanbin Wang, Yunhao Huo, Yuzhao Li, Junjie Wang

Main category: cs.AI

TL;DR: GameGPT is a multi-agent framework that automates game development by addressing hallucination and redundancy issues through dual collaboration, layered approaches with custom lexicons, and decoupling methods.

DetailsMotivation: LLM-based agents show promise for automating software development but face challenges with hallucination and newly identified redundancy issues, particularly in game development contexts.

Method: Proposes a multi-agent collaborative framework with dual collaboration, layered approaches using in-house lexicons, and decoupling techniques to handle planning, task identification, and implementation phases.

Result: The framework successfully mitigates both hallucination and redundancy problems in automated game development processes.

Conclusion: GameGPT demonstrates that addressing redundancy alongside hallucination through specialized collaborative methods enables more effective automation of game development using LLM-based agents.

Abstract: The large language model (LLM) based agents have demonstrated their capacity to automate and expedite software development processes. In this paper, we focus on game development and propose a multi-agent collaborative framework, dubbed GameGPT, to automate game development. While many studies have pinpointed hallucination as a primary roadblock for deploying LLMs in production, we identify another concern: redundancy. Our framework presents a series of methods to mitigate both concerns. These methods include dual collaboration and layered approaches with several in-house lexicons, to mitigate the hallucination and redundancy in the planning, task identification, and implementation phases. Furthermore, a decoupling approach is also introduced to achieve code generation with better precision.

[431] Online Prompt Pricing based on Combinatorial Multi-Armed Bandit and Hierarchical Stackelberg Game

Meiling Li, Hongrun Ren, Haixu Xiong, Zhenxing Qian, Xinpeng Zhang

Main category: cs.AI

TL;DR: Proposes an online pricing mechanism for prompt bundle trading using a combinatorial multi-armed bandit and a hierarchical Stackelberg game to optimize profits for consumers, platforms, and sellers simultaneously.

DetailsMotivation: To address the novel prompt trading scenario and provide a flexible pricing mechanism that better aligns with real-world transaction needs compared to existing fixed pricing models.

Method: Uses a combinatorial multi-armed bandit (CMAB) and a three-stage hierarchical Stackelberg game to break down pricing into unknown category selection and incentive strategy optimization, selecting high-quality categories and deriving optimal strategies for all participants.

Result: Tested on a simulated text-to-image dataset, the method demonstrates effectiveness and provides a feasible price-setting standard for prompt marketplaces.

Conclusion: The proposed PBT pricing mechanism is more flexible and diverse than fixed pricing modes, successfully achieving profit satisfaction for all three market participants (consumer, platform, seller) in prompt bundle trading scenarios.

Abstract: Generation models have shown promising performance in various tasks, making trading around machine learning models possible. In this paper, we target a novel prompt trading scenario, the prompt bundle trading (PBT) system, and propose an online pricing mechanism. Based on the combinatorial multi-armed bandit (CMAB) and a three-stage hierarchical Stackelberg (HS) game, our pricing mechanism considers the profits of the consumer, platform, and seller, simultaneously achieving the profit satisfaction of these three participants. We break down the pricing issue into two steps, namely unknown category selection and incentive strategy optimization. The former step is to select a set of categories with the highest qualities, and the latter is to derive the optimal strategy for each participant based on the chosen categories. Unlike the existing fixed pricing mode, the PBT pricing mechanism we propose is more flexible and diverse, which is more in accord with the transaction needs of real-world scenarios. We test our method on a simulated text-to-image dataset. The experimental results demonstrate the effectiveness of our algorithm, which provides a feasible price-setting standard for the prompt marketplaces.
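
The unknown-category-selection step is a combinatorial bandit at heart. A generic combinatorial-UCB sketch; the paper's estimator and reward structure are richer:

```python
# Pick the m prompt categories with the highest UCB quality estimates.
import numpy as np

def select_categories(counts, reward_sums, t, m):
    means = reward_sums / np.maximum(counts, 1)                 # empirical quality
    bonus = np.sqrt(2.0 * np.log(max(t, 2)) / np.maximum(counts, 1))  # exploration
    return np.argsort(-(means + bonus))[:m]

counts = np.array([5, 1, 3, 0])          # times each category was offered
rewards = np.array([3.2, 0.9, 1.1, 0.0]) # cumulative reward per category
print(select_categories(counts, rewards, t=10, m=2))
```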

[432] Transforming Wearable Data into Personal Health Insights using Large Language Model Agents

Mike A. Merrill, Akshay Paruchuri, Naghmeh Rezaei, Geza Kovacs, Javier Perez, Yun Liu, Erik Schenck, Nova Hammerquist, Jake Sunshine, Shyam Tailor, Kumar Ayush, Hao-Wei Su, Qian He, Cory Y. McLean, Mark Malhotra, Shwetak Patel, Jiening Zhan, Tim Althoff, Daniel McDuff, Xin Liu

Main category: cs.AI

TL;DR: PHIA is a tool-based LLM agent system that uses multistep reasoning with code generation and information retrieval to analyze wearable health data, achieving 84% accuracy on numerical questions and 83% favorable ratings on open-ended questions.

DetailsMotivation: Standard LLMs struggle with complex numerical reasoning required for personalized health insights from wearable trackers, necessitating more advanced tool-based approaches.

Method: PHIA leverages multistep reasoning with code generation and information retrieval to analyze behavioral health data from wearables, tested on benchmark datasets with over 4000 health insights questions.

Result: PHIA significantly outperforms code generation baselines with 84% accuracy on objective numerical questions and 83% favorable ratings on open-ended questions, being twice as likely to achieve the highest quality rating.

Conclusion: This work advances behavioral health by enabling individuals to understand their wearable data, paving the way for accessible, personalized, and data-driven wellness for the wider population.

Abstract: Deriving personalized insights from popular wearable trackers requires complex numerical reasoning that challenges standard LLMs, necessitating tool-based approaches like code generation. Large language model (LLM) agents present a promising yet largely untapped solution for this analysis at scale. We introduce the Personal Health Insights Agent (PHIA), a system leveraging multistep reasoning with code generation and information retrieval to analyze and interpret behavioral health data. To test its capabilities, we create and share two benchmark datasets with over 4000 health insights questions. A 650-hour human expert evaluation shows that PHIA significantly outperforms a strong code generation baseline, achieving 84% accuracy on objective, numerical questions and, for open-ended ones, earning 83% favorable ratings while being twice as likely to achieve the highest quality rating. This work can advance behavioral health by empowering individuals to understand their data, enabling a new era of accessible, personalized, and data-driven wellness for the wider population.

[433] ResearchArena: Benchmarking Large Language Models’ Ability to Collect and Organize Information as Research Agents

Hao Kang, Chenyan Xiong

Main category: cs.AI

TL;DR: ResearchArena benchmark evaluates LLMs’ capabilities in academic survey tasks across three stages: literature discovery, selection, and organization, showing current LLMs underperform simpler methods but with potential for improvement.

DetailsMotivation: LLMs struggle with domain-specific analytical tasks like conducting academic research surveys, which are foundational to academic research but require specialized capabilities beyond general NLP tasks.

Method: Created ResearchArena benchmark with three-stage evaluation: information discovery (finding relevant papers), information selection (assessing relevance/impact), and information organization (structuring knowledge into mind-maps). Built on 12M full-text papers and 7.9K surveys from S2ORC corpus.

Result: LLM-based approaches underperformed simpler keyword-based retrieval methods. Recent reasoning models like DeepSeek-R1 showed slightly better zero-shot performance, indicating room for improvement in autonomous research capabilities.

Conclusion: Significant opportunities exist for advancing LLMs in autonomous research tasks. The benchmark is open-sourced to facilitate further research and development in this domain.

Abstract: Large language models (LLMs) excel across many natural language processing tasks but face challenges in domain-specific, analytical tasks such as conducting research surveys. This study introduces ResearchArena, a benchmark designed to evaluate LLMs’ capabilities in conducting academic surveys – a foundational step in academic research. ResearchArena models the process in three stages: (1) information discovery, identifying relevant literature; (2) information selection, evaluating papers’ relevance and impact; and (3) information organization, structuring knowledge into hierarchical frameworks such as mind-maps. Notably, mind-map construction is treated as a bonus task, reflecting its supplementary role in survey-writing. To support these evaluations, we construct an offline environment of 12M full-text academic papers and 7.9K survey papers. To ensure ethical compliance, we do not redistribute copyrighted materials; instead, we provide code to construct the environment from the Semantic Scholar Open Research Corpus (S2ORC). Preliminary evaluations reveal that LLM-based approaches underperform compared to simpler keyword-based retrieval methods, though recent reasoning models such as DeepSeek-R1 show slightly better zero-shot performance. These results underscore significant opportunities for advancing LLMs in autonomous research. We open-source the code to construct the ResearchArena benchmark at https://github.com/cxcscmu/ResearchArena.

[434] Navigating the Labyrinth: Evaluating LLMs’ Ability to Reason About Search Problems

Nasim Borazjanizadeh, Roei Herzig, Trevor Darrell, Rogerio Feris, Leonid Karlinsky

Main category: cs.AI

TL;DR: LLMs struggle with search problems requiring backtracking and multiple pathways. SearchBench benchmark shows GPT-4 solves only 1.4% with language reasoning, 11.7% with code generation, but 57%+ with A* algorithm and multi-stage inference.

DetailsMotivation: LLMs perform well on math/reasoning benchmarks but fail at logic puzzles easy for humans. Need to investigate their limitations in search problems requiring backtracking and multiple solution pathways.

Method: Created SearchBench with 11 search problems, automated pipelines for instance generation and solution analysis. Tested language reasoning, code generation, and A* algorithm implementations with multi-stage inference.

Result: Language-only reasoning: GPT-4 1.4%, o1-preview 18.6%. Code generation: GPT-4 11.7%. Best performance: A* algorithm with multi-stage inference achieving 57%+ with GPT-4.

Conclusion: Current LLMs fundamentally struggle with search problems requiring backtracking. Algorithmic approaches (A*) with multi-stage inference significantly outperform pure language reasoning, highlighting limitations in auto-regressive models for complex search tasks.

Abstract: Large Language Models (LLMs) have recently achieved impressive performance in math and reasoning benchmarks. However, they often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we introduce a new benchmark, SearchBench, which contains 11 unique search problems, each equipped with automated pipelines to generate an arbitrary number of instances and analyze the feasibility, correctness, and optimality of LLM-generated solutions. We show that using language-only reasoning, even the most advanced LLMs fail to solve SearchBench end-to-end, e.g., OpenAI’s frontier models GPT-4 and o1-preview solve only 1.4% and 18.6% of SearchBench problems, respectively. The reason is that SearchBench problems require considering multiple pathways to the solution and performing backtracking, posing a significant challenge to auto-regressive models. Instructing LLMs to generate code that solves the problem helps, but only slightly, e.g., GPT-4’s performance rises to 11.7%. Interestingly, we show that the current strongest baseline on SearchBench is obtained using in-context learning with A* algorithm implementations. We further show that this baseline can be further enhanced via a Multi-Stage-Multi-Try inference method, raising GPT-4’s performance above 57%.
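
For reference, the A* baseline mentioned above is the textbook algorithm. A compact sketch with problem-specific stand-ins for the state graph and heuristic:

```python
# Textbook A*: neighbors(state) yields (next_state, step_cost); h is an
# admissible heuristic; states must be hashable.
import heapq
from itertools import count

def a_star(start, is_goal, neighbors, h):
    tie = count()
    frontier = [(h(start), next(tie), 0, start, [start])]
    best_g = {}
    while frontier:
        _, _, g, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        if best_g.get(state, float("inf")) <= g:
            continue                  # already expanded via a cheaper route
        best_g[state] = g
        for nxt, cost in neighbors(state):
            heapq.heappush(frontier, (g + cost + h(nxt), next(tie),
                                      g + cost, nxt, path + [nxt]))
    return None                       # no feasible solution
```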

[435] ChinaTravel: An Open-Ended Benchmark for Language Agents in Chinese Travel Planning

Jie-Jing Shao, Bo-Wen Zhang, Xiao-Wen Yang, Baizhi Chen, Si-Yu Han, Wen-Da Wei, Guohao Cai, Zhenhua Dong, Lan-Zhe Guo, Yu-Feng Li

Main category: cs.AI

TL;DR: ChinaTravel is the first open-ended benchmark for evaluating language agents in multi-day, multi-POI travel planning scenarios using authentic Chinese travel requirements from 1,154 human participants.

DetailsMotivation: Existing benchmarks oversimplify real-world travel planning by focusing on synthetic queries and limited constraints, creating a gap for evaluating language agents in complex multi-objective planning with diverse human needs.

Method: Developed a compositionally generalizable domain-specific language (DSL) for scalable evaluation covering feasibility, constraint satisfaction, and preference comparison. Used authentic travel requirements collected from human participants.

Result: Neuro-symbolic agents achieved 37.0% constraint satisfaction rate on human queries, representing a 10x improvement over purely neural models.

Conclusion: ChinaTravel serves as a pivotal milestone for advancing language agents in complex, real-world planning scenarios, demonstrating the superiority of neuro-symbolic approaches over purely neural models in travel planning.

Abstract: Recent advances in LLMs, particularly in language reasoning and tool integration, have rapidly sparked the emergence of Language Agents for real-world development. Among these, travel planning represents a prominent domain, combining complex multi-objective planning challenges with practical deployment demands. However, existing benchmarks often oversimplify real-world requirements by focusing on synthetic queries and limited constraints. We address the gap of evaluating language agents in multi-day, multi-POI travel planning scenarios with diverse and open human needs. Specifically, we introduce ChinaTravel, the first open-ended benchmark grounded in authentic Chinese travel requirements collected from 1,154 human participants. We design a compositionally generalizable domain-specific language (DSL) for scalable evaluation, covering feasibility, constraint satisfaction, and preference comparison. Empirical studies reveal the potential of neuro-symbolic agents in travel planning, achieving a 37.0% constraint satisfaction rate on human queries, a 10x improvement over purely neural models. These findings highlight ChinaTravel as a pivotal milestone for advancing language agents in complex, real-world planning scenarios.

[436] KABB: Knowledge-Aware Bayesian Bandits for Dynamic Expert Coordination in Multi-Agent Systems

Jusheng Zhang, Zimeng Huang, Yijia Fan, Ningyuan Liu, Mingyan Li, Zhuojie Yang, Jiawei Yao, Jian Wang, Keze Wang

Main category: cs.AI

TL;DR: KABB is a novel multi-agent coordination framework that uses semantic understanding and dynamic adaptation to achieve optimal cost-performance balance while maintaining high efficiency.

DetailsMotivation: Address the challenges of prohibitive costs in scaling large language models and the limitations of static knowledge assumptions and coordination inefficiencies in multi-agent systems.

Method: Three key innovations: 1) Three-dimensional knowledge distance model for deep semantic understanding, 2) Dual-adaptation mechanism for continuous expert optimization, 3) Knowledge-aware Thompson Sampling strategy for efficient expert selection.

Result: Extensive evaluation demonstrates KABB achieves optimal cost-performance balance, maintaining high performance while keeping computational demands relatively low.

Conclusion: KABB provides an effective solution for multi-agent system coordination through semantic understanding and dynamic adaptation, offering a promising alternative to costly large language model scaling.

Abstract: As scaling large language models faces prohibitive costs, multi-agent systems emerge as a promising alternative, though challenged by static knowledge assumptions and coordination inefficiencies. We introduce Knowledge-Aware Bayesian Bandits (KABB), a novel framework that enhances multi-agent system coordination through semantic understanding and dynamic adaptation. The framework features three key innovations: a three-dimensional knowledge distance model for deep semantic understanding, a dual-adaptation mechanism for continuous expert optimization, and a knowledge-aware Thompson Sampling strategy for efficient expert selection. Extensive evaluation demonstrates that KABB achieves an optimal cost-performance balance, maintaining high performance while keeping computational demands relatively low in multi-agent coordination.
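
The knowledge-aware Thompson Sampling component can be sketched concretely. Below, a standard Beta-Bernoulli Thompson sampler is modulated by a precomputed knowledge distance per expert; the discounting scheme and the scalar distances are illustrative assumptions standing in for the paper's three-dimensional knowledge distance model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4 experts, each with a Beta posterior over success,
# plus a knowledge distance in [0, 1] between each expert's competence
# profile and the incoming query (0 = perfect topical match).
alpha = np.ones(4)   # observed successes + 1
beta = np.ones(4)    # observed failures + 1

def select_expert(knowledge_distance):
    # Thompson sampling: draw from each posterior, then discount the draw
    # by semantic mismatch so topically distant experts are chosen less.
    draws = rng.beta(alpha, beta) * (1.0 - knowledge_distance)
    return int(np.argmax(draws))

def update(expert, success):
    if success:
        alpha[expert] += 1
    else:
        beta[expert] += 1

for _ in range(100):
    dist = rng.uniform(0, 0.5, size=4)      # stand-in for the 3-D distance model
    k = select_expert(dist)
    update(k, success=rng.random() < 0.6)   # simulated task outcome
```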

[437] Antidistillation Sampling

Yash Savani, Asher Trockman, Zhili Feng, Yixuan Even Xu, Avi Schwarzschild, Alexander Robey, Marc Finzi, J. Zico Kolter

Main category: cs.AI

TL;DR: Antidistillation sampling strategically modifies model outputs to poison reasoning traces, making them ineffective for distillation while preserving model utility.

DetailsMotivation: Frontier models generate rich reasoning traces that can be exploited for model distillation, creating a vulnerability that model owners want to protect against.

Method: Strategic modification of a model’s next-token probability distribution to poison reasoning traces without compromising performance.

Result: Renders reasoning traces significantly less effective for distillation while maintaining the model’s practical utility.

Conclusion: Antidistillation sampling provides an effective defense mechanism against unauthorized model distillation while preserving original model functionality.

Abstract: Frontier models that generate extended reasoning traces inadvertently produce rich token sequences that can facilitate model distillation. Recognizing this vulnerability, model owners may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. Antidistillation sampling provides exactly this capability. By strategically modifying a model’s next-token probability distribution, antidistillation sampling poisons reasoning traces, rendering them significantly less effective for distillation while preserving the model’s practical utility. For further details, see https://antidistillation.com.
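
The abstract only states that the next-token distribution is strategically modified, so the following is a heavily simplified sketch: a penalty term (a stand-in for the paper's actual poisoning signal) is subtracted from the teacher's logits before sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

def antidistill_sample(logits, poison_direction, eps=2.0):
    # Subtract a poisoning term from the teacher's logits, renormalize, and
    # sample. `poison_direction` stands in for an estimate of how useful each
    # token would be to a distilling student; the paper's real construction
    # is more involved than this illustrative penalty.
    perturbed = logits - eps * poison_direction
    probs = np.exp(perturbed - perturbed.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

vocab_size = 10
logits = rng.normal(size=vocab_size)            # teacher's next-token logits
poison = rng.uniform(0, 1, size=vocab_size)     # hypothetical distillability score
next_token = antidistill_sample(logits, poison)
```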

[438] OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation

Yichen Wu, Xudong Pan, Geng Hong, Min Yang

Main category: cs.AI

TL;DR: OpenDeception is a novel framework for evaluating deception risks in LLM-based agents through open-ended scenarios and internal reasoning analysis, revealing high deception intention (80%) and success rates (50%) across mainstream models.

DetailsMotivation: As LLM capabilities improve and agent applications become widespread, there is an urgent need for systematic evaluation and oversight of deception risks, which existing evaluation methods using simulated games or limited choices fail to address adequately.

Method: Constructed five types of common use cases with ten diverse real-world scenarios each, using agent simulation for multi-turn dialogue to avoid ethical concerns and costs of human testing. Jointly evaluates deception intention and capabilities by inspecting internal reasoning processes.

Result: Evaluation of eleven mainstream LLMs shows deception intention ratio exceeds 80% and deception success rate surpasses 50%. Stronger LLMs exhibit higher deception risks.

Conclusion: There is an urgent need to address deception risks and security concerns in LLM-based agents, calling for more alignment efforts to inhibit deceptive behaviors, especially as more capable models show increased deception risks.

Abstract: As the general capabilities of large language models (LLMs) improve and agent applications become more widespread, the underlying deception risks urgently require systematic evaluation and effective oversight. Unlike existing evaluation which uses simulated games or presents limited choices, we introduce OpenDeception, a novel deception evaluation framework with an open-ended scenario dataset. OpenDeception jointly evaluates both the deception intention and capabilities of LLM-based agents by inspecting their internal reasoning process. Specifically, we construct five types of common use cases where LLMs intensively interact with the user, each consisting of ten diverse, concrete scenarios from the real world. To avoid ethical concerns and costs of high-risk deceptive interactions with human testers, we propose to simulate the multi-turn dialogue via agent simulation. Extensive evaluation of eleven mainstream LLMs on OpenDeception highlights the urgent need to address deception risks and security concerns in LLM-based agents: the deception intention ratio across the models exceeds 80%, while the deception success rate surpasses 50%. Furthermore, we observe that LLMs with stronger capabilities do exhibit a higher risk of deception, which calls for more alignment efforts on inhibiting deceptive behaviors.

[439] Multi-Modal Multi-Task (M3T) Federated Foundation Models for Embodied AI: Potentials and Challenges for Edge Integration

Kasra Borazjani, Payam Abdisarabshali, Fardis Nadimi, Naji Khosravan, Minghui Liwang, Xianbin Wang, Yiguang Hong, Seyyedali Hosseinalipour

Main category: cs.AI

TL;DR: Proposes M3T-FFMs - a new paradigm combining multi-modal multi-task foundation models with federated learning for embodied AI systems, addressing deployment challenges through the EMBODY framework.

DetailsMotivation: Embodied AI systems need to learn from diverse sensory inputs, adapt to user preferences, and operate safely under resource constraints, requiring models that balance generalization and personalization while preserving privacy.

Method: Unifies multi-modal multi-task foundation models (M3T-FMs) with federated learning (FL) to create M3T-FFMs, introduces EMBODY framework covering six deployment dimensions, and presents evaluation framework with prototype implementation.

Result: Developed a unified framework (EMBODY) identifying concrete challenges and research directions, created evaluation framework for deployment trade-offs, and implemented prototype showing energy and latency performance results.

Conclusion: M3T-FFMs represent a promising paradigm for embodied AI that combines generalization capabilities of foundation models with privacy-preserving distributed training of federated learning, addressing key deployment challenges through the comprehensive EMBODY framework.

Abstract: As embodied AI systems become increasingly multi-modal, personalized, and interactive, they must learn effectively from diverse sensory inputs, adapt continually to user preferences, and operate safely under resource and privacy constraints. These challenges expose a pressing need for machine learning models capable of swift, context-aware adaptation while balancing model generalization and personalization. Here, two methods emerge as suitable candidates, each offering parts of these capabilities: multi-modal multi-task foundation models (M3T-FMs) provide a pathway toward generalization across tasks and modalities, whereas federated learning (FL) offers the infrastructure for distributed, privacy-preserving model updates and user-level model personalization. However, when used in isolation, each of these approaches falls short of meeting the complex and diverse capability requirements of real-world embodied AI environments. In this vision paper, we introduce multi-modal multi-task federated foundation models (M3T-FFMs) for embodied AI, a new paradigm that unifies the strengths of M3T-FMs with the privacy-preserving distributed training nature of FL, enabling intelligent systems at the wireless edge. We collect critical deployment dimensions of M3T-FFMs in embodied AI ecosystems under a unified framework, which we name “EMBODY”: Embodiment heterogeneity, Modality richness and imbalance, Bandwidth and compute constraints, On-device continual learning, Distributed control and autonomy, and Yielding safety, privacy, and personalization. For each, we identify concrete challenges and envision actionable research directions. We also present an evaluation framework for deploying M3T-FFMs in embodied AI systems, along with the associated trade-offs. Finally, we present a prototype implementation of M3T-FFMs and evaluate their energy and latency performance.
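
The FL half of this combination rests on familiar machinery. As a point of reference, here is a minimal FedAvg aggregation step (standard FedAvg, not the paper's specific M3T-FFM protocol):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    # Standard FedAvg: average client parameters weighted by local data size.
    # client_weights: list of dicts mapping parameter name -> np.ndarray
    # client_sizes:   number of local examples per client
    total = sum(client_sizes)
    return {
        k: sum(w[k] * (n / total) for w, n in zip(client_weights, client_sizes))
        for k in client_weights[0]
    }

# Toy round with three embodied clients holding different amounts of data.
clients = [{"encoder": np.full(4, i, dtype=float)} for i in range(3)]
global_model = fedavg(clients, client_sizes=[100, 50, 50])
print(global_model["encoder"])  # weighted toward client 0's parameters
```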

[440] Project Riley: Multimodal Multi-Agent LLM Collaboration with Emotional Reasoning and Voting

Ana Rita Ortigoso, Gabriel Vieira, Daniel Fuentes, Luis Frazão, Nuno Costa, António Pereira

Main category: cs.AI

TL;DR: Project Riley is a multimodal AI system with five emotional agents that simulate reasoning influenced by emotions, inspired by Inside Out. It uses multi-agent dialogues to generate emotionally calibrated responses.

DetailsMotivation: To create conversational AI that can simulate human-like emotional reasoning and generate responses influenced by different emotional states, moving beyond purely logical AI systems.

Method: Uses five distinct emotional agents (Joy, Sadness, Fear, Anger, Disgust) that engage in structured dialogues. Combines textual and visual LLMs with reasoning processes. Includes RAG integration and context tracking in the Armando variant for emergency contexts.

Result: Strong performance in structured scenarios with high emotional alignment and communicative clarity. User testing showed effectiveness in emotional appropriateness, clarity, and human-likeness.

Conclusion: The multi-agent emotional architecture successfully simulates emotion-influenced reasoning and generates coherent, emotionally appropriate responses, demonstrating potential for emotionally intelligent AI systems.

Abstract: This paper presents Project Riley, a novel multimodal and multi-model conversational AI architecture oriented towards the simulation of reasoning influenced by emotional states. Drawing inspiration from Pixar’s Inside Out, the system comprises five distinct emotional agents - Joy, Sadness, Fear, Anger, and Disgust - that engage in structured multi-round dialogues to generate, criticise, and iteratively refine responses. A final reasoning mechanism synthesises the contributions of these agents into a coherent output that either reflects the dominant emotion or integrates multiple perspectives. The architecture incorporates both textual and visual large language models (LLMs), alongside advanced reasoning and self-refinement processes. A functional prototype was deployed locally in an offline environment, optimised for emotional expressiveness and computational efficiency. From this initial prototype, another one emerged, called Armando, which was developed for use in emergency contexts, delivering emotionally calibrated and factually accurate information through the integration of Retrieval-Augmented Generation (RAG) and cumulative context tracking. The Project Riley prototype was evaluated through user testing, in which participants interacted with the chatbot and completed a structured questionnaire assessing three dimensions: Emotional Appropriateness, Clarity and Utility, and Naturalness and Human-likeness. The results indicate strong performance in structured scenarios, particularly with respect to emotional alignment and communicative clarity.

[441] SUDER: Self-Improving Unified Large Multimodal Models for Understanding and Generation with Dual Self-Rewards

Jixiang Hong, Yiran Zhang, Guanzhong Wang, Yi Liu, Ji-Rong Wen, Rui Yan

Main category: cs.AI

TL;DR: SUDER is a self-supervised framework that improves large multimodal models by using the inherent duality between vision-language understanding and generation tasks to provide mutual optimization signals without external supervision.

DetailsMotivation: Current LMMs struggle with accurate vision-language alignment and require external supervision, while only addressing unidirectional tasks. The authors aim to create a unified framework that enhances both understanding and generation capabilities without external feedback.

Method: The proposed SUDER framework uses a dual self-reward mechanism where understanding and generation tasks provide optimization signals for each other. It samples multiple outputs for a given input, then reverses input-output pairs to compute dual likelihood as self-rewards for optimization.

Result: Extensive experiments show SUDER effectively enhances model performance without external supervision, achieving remarkable improvements in text-to-image tasks and visual understanding benchmarks.

Conclusion: SUDER demonstrates that leveraging the natural duality between understanding and generation tasks enables self-supervised improvement of LMMs, providing a unified framework that enhances both capabilities without requiring external supervision.

Abstract: Building upon large language models (LLMs), recent large multimodal models (LMMs) unify cross-modal understanding and generation into a single framework. However, LMMs still struggle to achieve accurate vision-language alignment, and are prone to generating text responses that contradict the visual input or fail to follow text-to-image prompts. Current solutions require external supervision (e.g., human feedback or reward models) and only address unidirectional tasks, either understanding or generation. In this work, based on the observation that understanding and generation are naturally inverse dual tasks, we propose SUDER (Self-improving Unified LMMs with Dual sElf-Rewards), a framework reinforcing the understanding and generation capabilities of LMMs with a self-supervised dual reward mechanism. SUDER leverages the inherent duality between understanding and generation tasks to provide self-supervised optimization signals for each other. Specifically, we sample multiple outputs for a given input in one task domain, then reverse the input-output pairs to compute the dual likelihood within the model as self-rewards for optimization. Extensive experimental results on visual understanding and generation benchmarks demonstrate that our method can effectively enhance the performance of the model without any external supervision, especially achieving remarkable improvements in text-to-image tasks.
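
The dual self-reward can be made concrete with a toy sketch. The model class and its methods below are hypothetical stand-ins (a real unified LMM would expose likelihoods differently); the point is the reversal of input-output pairs to score each sampled output:

```python
import random

class DummyUnifiedLMM:
    """Hypothetical stand-in for a unified understanding+generation model."""
    def sample_caption(self, image):
        return random.choice(["a cat on a mat", "a dog in a park", "a red car"])
    def loglik_image_given_text(self, image, caption):
        return -random.random()   # placeholder generation-direction likelihood

def dual_self_rewards(model, image, num_samples=4):
    # Understanding direction: sample several candidate captions, then reverse
    # each input-output pair and use the likelihood of the image given the
    # caption as that caption's self-reward (no external supervision needed).
    captions = [model.sample_caption(image) for _ in range(num_samples)]
    return [(c, model.loglik_image_given_text(image, c)) for c in captions]

scored = dual_self_rewards(DummyUnifiedLMM(), image=None)
best_caption = max(scored, key=lambda pair: pair[1])[0]
```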

[442] Safe and Economical UAV Trajectory Planning in Low-Altitude Airspace: A Hybrid DRL-LLM Approach with Compliance Awareness

Yanwei Gong, Junchao Fan, Ruichen Zhang, Dusit Niyato, Yingying Yao, Xiaolin Chang

Main category: cs.AI

TL;DR: A novel UAV trajectory planning framework combining deep reinforcement learning with large language model reasoning to enable safe, compliant, and economically viable path planning in complex urban environments.

DetailsMotivation: The rapid growth of low-altitude economy and widespread UAV adoption create challenges in trajectory planning, with existing studies overlooking urban airspace constraints and economic efficiency. DRL shows promise but suffers from low learning efficiency.

Method: Proposed a UAV trajectory planning framework that integrates deep reinforcement learning with large language model reasoning to enhance learning efficiency and decision-making capabilities.

Result: Experimental results show significant outperformance over existing baselines across multiple metrics: data collection rate, collision avoidance, successful landing, regulatory compliance, and energy efficiency.

Conclusion: The approach effectively addresses key UAV trajectory planning challenges under low-altitude economy networking constraints, validating the effectiveness of combining DRL with LLM reasoning.

Abstract: The rapid growth of the low-altitude economy has driven the widespread adoption of unmanned aerial vehicles (UAVs). This growing deployment presents new challenges for UAV trajectory planning in complex urban environments. However, existing studies often overlook key factors, such as urban airspace constraints and economic efficiency, which are essential in low-altitude economy contexts. Deep reinforcement learning (DRL) is regarded as a promising solution to these issues, while its practical adoption remains limited by low learning efficiency. To overcome this limitation, we propose a novel UAV trajectory planning framework that combines DRL with large language model (LLM) reasoning to enable safe, compliant, and economically viable path planning. Experimental results demonstrate that our method significantly outperforms existing baselines across multiple metrics, including data collection rate, collision avoidance, successful landing, regulatory compliance, and energy efficiency. These results validate the effectiveness of our approach in addressing the key challenges of UAV trajectory planning under the constraints of low-altitude economy networking.

[443] Modular Recurrence in Contextual MDPs for Universal Morphology Control

Laurens Engwegen, Daan Brinks, Wendelin Böhmer

Main category: cs.AI

TL;DR: A modular recurrent architecture improves generalization to unseen robot morphologies by inferring partially observable contextual information through interactions.

DetailsMotivation: To create a universal controller that can generalize to new, unseen robot morphologies by addressing the challenge of partially observable contextual information about robot properties.

Method: Implemented a modular recurrent architecture that infers contextual information through interactions and evaluated it on a large set of MuJoCo robots with various dynamics, kinematics, and topologies.

Result: Substantially improved performance on robots with unseen dynamics, kinematics, and topologies across four different environments.

Conclusion: The modular recurrent architecture successfully enables better generalization to unseen robot contexts by inferring partially observable information through interaction, advancing multi-robot control capabilities.

Abstract: A universal controller for any robot morphology would greatly improve computational and data efficiency. By utilizing contextual information about the properties of individual robots and exploiting their modular structure in the architecture of deep reinforcement learning agents, steps have been made towards multi-robot control. Generalization to new, unseen robots, however, remains a challenge. In this paper we hypothesize that the relevant contextual information is partially observable, but that it can be inferred through interactions for better generalization to contexts that are not seen during training. To this end, we implement a modular recurrent architecture and evaluate its generalization performance on a large set of MuJoCo robots. The results show substantially improved performance on robots with unseen dynamics, kinematics, and topologies, in four different environments.

[444] MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning

Tristan Tomilin, Luka van den Boogaard, Samuel Garcin, Bram Grooten, Meng Fang, Yali Du, Mykola Pechenizkiy

Main category: cs.AI

TL;DR: MEAL is the first benchmark for continual multi-agent reinforcement learning (CMARL) that enables GPU-accelerated training across 100 tasks on desktop hardware, revealing limitations of naive CL+MARL combinations in complex coordination scenarios.

DetailsMotivation: There is a significant gap in benchmarks for continual learning in cooperative multi-agent settings, with existing CL benchmarks being CPU-bound and limiting task sequence length due to computational bottlenecks.

Method: The authors introduce MEAL, a benchmark built with JAX for GPU acceleration, allowing training across sequences of 100 tasks on standard desktop PCs. They test naive combinations of popular CL and MARL methods and conduct ablation studies to identify critical architectural and algorithmic features.

Result: Naive combinations of CL and MARL methods perform well on simple environments but fail to scale to complex settings requiring sustained coordination and adaptation. The ablation study successfully identifies key architectural and algorithmic features necessary for effective CMARL.

Conclusion: MEAL provides an essential benchmark for CMARL research, demonstrating that specialized approaches beyond simple method combinations are needed for complex multi-agent continual learning scenarios, and enabling rapid experimentation through GPU acceleration.

Abstract: Benchmarks play a crucial role in the development and analysis of reinforcement learning (RL) algorithms, with environment availability strongly impacting research. One particularly underexplored intersection is continual learning (CL) in cooperative multi-agent settings. To remedy this, we introduce MEAL (Multi-agent Environments for Adaptive Learning), the first benchmark tailored for continual multi-agent reinforcement learning (CMARL). Existing CL benchmarks run environments on the CPU, leading to computational bottlenecks and limiting the length of task sequences. MEAL leverages JAX for GPU acceleration, enabling continual learning across sequences of 100 tasks on a standard desktop PC in a few hours. We show that naively combining popular CL and MARL methods yields strong performance on simple environments, but fails to scale to more complex settings requiring sustained coordination and adaptation. Our ablation study identifies architectural and algorithmic features critical for CMARL on MEAL.
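
MEAL's throughput comes from the usual JAX recipe: write the environment step as a pure function, then vmap and jit it so thousands of environments advance in one device call. A toy illustration (the two-agent line-world below is invented, not one of MEAL's actual environments):

```python
import jax
import jax.numpy as jnp

# Toy stateless env step: two agents move on a 1-D line toward a shared goal.
def env_step(state, actions):
    positions, goal = state
    positions = positions + actions                 # actions in {-1, 0, +1}
    reward = -jnp.sum(jnp.abs(positions - goal))    # closer is better
    return (positions, goal), reward

# vmap vectorizes N independent environments; jit compiles the batch for GPU.
batched_step = jax.jit(jax.vmap(env_step))

n_envs = 1024
states = (jnp.zeros((n_envs, 2)), jnp.full((n_envs,), 5.0))
actions = jnp.ones((n_envs, 2))
states, rewards = batched_step(states, actions)     # one device call, 1024 envs
```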

[445] FinStat2SQL: A Text2SQL Pipeline for Financial Statement Analysis

Quang Hung Nguyen, Phuong Anh Trinh, Phan Quoc Hung Mai, Tuan Phong Trinh

Main category: cs.AI

TL;DR: FinStat2SQL is a lightweight text-to-SQL pipeline for financial statements that combines large and small language models in a multi-agent setup, achieving 61.33% accuracy with fast response times on consumer hardware.

DetailsMotivation: Text-to-SQL faces challenges with complex domain-specific queries, especially in finance where database designs and reporting layouts vary widely between entities and countries like Vietnam with VAS standards.

Method: Multi-agent pipeline combining large and small language models for entity extraction, SQL generation, and self-correction. Built domain-specific database and evaluated on synthetic QA dataset with fine-tuned 7B model.

Result: Achieved 61.33% accuracy with sub-4-second response times on consumer hardware, outperforming GPT-4o-mini.

Conclusion: Provides scalable, cost-efficient solution for financial analysis, making AI-powered querying accessible to Vietnamese enterprises.

Abstract: Despite the advancements of large language models, text2sql still faces many challenges, particularly with complex and domain-specific queries. In finance, database designs and financial reporting layouts vary widely between financial entities and countries, making text2sql even more challenging. We present FinStat2SQL, a lightweight text2sql pipeline enabling natural language queries over financial statements. Tailored to local standards like VAS, it combines large and small language models in a multi-agent setup for entity extraction, SQL generation, and self-correction. We build a domain-specific database and evaluate models on a synthetic QA dataset. A fine-tuned 7B model achieves 61.33% accuracy with sub-4-second response times on consumer hardware, outperforming GPT-4o-mini. FinStat2SQL offers a scalable, cost-efficient solution for financial analysis, making AI-powered querying accessible to Vietnamese enterprises.
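
The multi-agent loop (entity extraction, SQL generation, self-correction on database errors) can be sketched end to end. Everything below is illustrative: `call_llm` returns canned replies so the sketch runs, whereas the real pipeline routes each role to a large or a small model:

```python
import sqlite3

def call_llm(role_prompt, user_input):
    # Hypothetical stand-in for the pipeline's LLM calls.
    if "entities" in role_prompt:
        return "company=ACME, period=FY2023, item=net_revenue"
    return "SELECT value FROM statements WHERE item = 'net_revenue'"

def finstat2sql(question, conn, max_retries=2):
    entities = call_llm("Extract financial entities.", question)     # agent 1
    sql = call_llm("Generate SQL.", f"{question}\n{entities}")       # agent 2
    for _ in range(max_retries + 1):
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as err:                                 # agent 3
            sql = call_llm("Fix SQL.", f"{sql}\nDB error: {err}")    # self-correct
    return None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statements (item TEXT, value REAL)")
conn.execute("INSERT INTO statements VALUES ('net_revenue', 1.25e9)")
print(finstat2sql("What was ACME's FY2023 net revenue?", conn))
```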

[446] Multi-Agent Reasoning for Cardiovascular Imaging Phenotype Analysis

Weitong Zhang, Mengyun Qiao, Chengqi Zang, Steven Niederer, Paul M Matthews, Wenjia Bai, Bernhard Kainz

Main category: cs.AI

TL;DR: MESHAgents is a multi-agent AI framework using LLMs to automatically discover imaging phenotype associations and confounders, achieving performance comparable to expert-selected phenotypes in disease classification.

DetailsMotivation: Traditional approaches for identifying imaging phenotype associations rely on human-driven hypothesis testing and often miss complex non-linear dependencies in multi-modal data.

Method: Multi-disciplinary AI agents using large language models dynamically elicit, surface, and decide confounders and phenotypes through iterative self-organizing reasoning, synthesizing statistical correlations with multi-expert consensus.

Result: The framework autonomously uncovered correlations beyond standard demographic factors, achieving mean AUC differences of -0.004±0.010 compared to expert-selected phenotypes, with improved recall for 6 out of 9 disease types.

Conclusion: MESHAgents provides clinically relevant imaging phenotypes with transparent reasoning, offering a scalable alternative to expert-driven methods for phenome-wide association studies.

Abstract: Identifying associations between imaging phenotypes, disease risk factors, and clinical outcomes is essential for understanding disease mechanisms. However, traditional approaches rely on human-driven hypothesis testing and selection of association factors, often overlooking complex, non-linear dependencies among imaging phenotypes and other multi-modal data. To address this, we introduce Multi-agent Exploratory Synergy for the Heart (MESHAgents): a framework that leverages large language models as agents to dynamically elicit, surface, and decide confounders and phenotypes in association studies. Specifically, we orchestrate a multi-disciplinary team of AI agents, which spontaneously generate and converge on insights through iterative, self-organizing reasoning. The framework dynamically synthesizes statistical correlations with multi-expert consensus, providing an automated pipeline for phenome-wide association studies (PheWAS). We demonstrate the system’s capabilities through a population-based study of imaging phenotypes of the heart and aorta. MESHAgents autonomously uncovered correlations between imaging phenotypes and a wide range of non-imaging factors, identifying additional confounder variables beyond standard demographic factors. Validation on diagnosis tasks reveals that MESHAgents-discovered phenotypes achieve performance comparable to expert-selected phenotypes, with mean AUC differences as small as -0.004±0.010 on disease classification tasks. Notably, the recall score improves for 6 out of 9 disease types. Our framework provides clinically relevant imaging phenotypes with transparent reasoning, offering a scalable alternative to expert-driven methods.

[447] SenseCF: LLM-Prompted Counterfactuals for Intervention and Sensor Data Augmentation

Shovito Barua Soumma, Asiful Arefeen, Stephanie M. Carpenter, Melanie Hingle, Hassan Ghasemzadeh

Main category: cs.AI

TL;DR: LLM-based counterfactual explanations achieve high plausibility and validity, outperform traditional methods, and improve downstream classifier performance when used as augmented data.

DetailsMotivation: To explore large language models for generating counterfactual explanations that can serve as interventions for abnormality prevention and augmented data for training robust models in clinical prediction tasks.

Method: Used GPT-4o-mini in zero-shot and three-shot settings to generate counterfactual explanations, evaluated on stress prediction and heart disease detection datasets, and compared against traditional methods (DiCE, CFNOW, NICE).

Result: Achieved up to 99% plausibility, 0.99 validity, competitive sparsity, and improved downstream classifier accuracy by 5% on average, particularly in low-data regimes.

Conclusion: Prompt-based generative techniques with LLMs show strong potential for enhancing explainability and robustness in clinical and physiological prediction tasks.

Abstract: Counterfactual explanations (CFs) offer human-centric insights into machine learning predictions by highlighting minimal changes required to alter an outcome. Therefore, CFs can be used as (i) interventions for abnormality prevention and (ii) augmented data for training robust models. In this work, we explore large language models (LLMs), specifically GPT-4o-mini, for generating CFs in a zero-shot and three-shot setting. We evaluate our approach on two datasets: the AI-Readi flagship dataset for stress prediction and a public dataset for heart disease detection. Compared to traditional methods such as DiCE, CFNOW, and NICE, our few-shot LLM-based approach achieves high plausibility (up to 99%), strong validity (up to 0.99), and competitive sparsity. Moreover, using LLM-generated CFs as augmented samples improves downstream classifier performance (an average accuracy gain of 5%), especially in low-data regimes. This demonstrates the potential of prompt-based generative techniques to enhance explainability and robustness in clinical and physiological prediction tasks. Code base: github.com/shovito66/SenseCF.
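
The core of the method is prompt construction. A rough guess at the shape of such a prompt (the wording, feature names, and labels below are invented for illustration, not the paper's actual template):

```python
def build_cf_prompt(record, target_label, examples=()):
    # Few-shot block (empty in the zero-shot setting).
    shots = "".join(f"Input: {x}\nCounterfactual: {cf}\n\n" for x, cf in examples)
    return (
        "Suggest the minimal change to the features below so that the "
        f"predicted outcome becomes '{target_label}'. Keep all other values fixed.\n\n"
        f"{shots}Input: {record}\nCounterfactual:"
    )

record = {"heart_rate": 96, "sleep_hours": 4.5, "steps": 2100}
print(build_cf_prompt(record, target_label="low stress"))
# The completion (e.g., from GPT-4o-mini) can serve both as an intervention
# suggestion and as an augmented training row for the downstream classifier.
```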

[448] Towards Urban Planning AI Agent in the Age of Agentic AI

Yanjie Fu, Dongjie Wang

Main category: cs.AI

TL;DR: The paper proposes agentic urban AI planners that combine agentic AI with participatory urbanism to overcome limitations of current generative AI approaches in urban planning.

DetailsMotivation: Existing generative AI approaches for urban planning have predefined structures and ignore domain expert tools, creating a gap between AI capabilities and practical urban planning needs.

Method: The paper outlines a research direction for agentic urban AI planners that synthesize agentic AI with participatory urbanism, leveraging both AI capabilities and domain expert tools.

Result: The analysis identifies critical gaps in current generative urban planning approaches and proposes a new framework that integrates AI with urban planning practitioner tools.

Conclusion: A new synthesis of agentic AI and participatory urbanism is needed to create effective AI urban planners that respect domain expertise and practical planning tools.

Abstract: Generative AI, large language models, and agentic AI have emerged separately from urban planning. However, the convergence between AI and urban planning presents an interesting opportunity towards AI urban planners. Existing studies conceptualize urban planning as a generative AI task, where AI synthesizes land-use configurations under geospatial, social, and human-centric constraints and reshapes automated urban design. We further identify critical gaps in existing generative urban planning studies: 1) the generative structure has to be predefined with strong assumptions: adversarial generator-discriminator, forward and inverse diffusion, and hierarchical zone-POI generative structures are all predefined by humans; 2) they ignore the power of tools developed by domain experts: urban planners have developed various tools for the planning process guided by urban theory, while existing purely neural-network-based generation ignores them. To address these limitations, we outline a future research direction, the agentic urban AI planner, calling for a new synthesis of agentic AI and participatory urbanism.

[449] FairReason: Balancing Reasoning and Social Bias in MLLMs

Zhenyu Pan, Yutong Zhang, Jianshu Zhang, Haoran Lu, Haozheng Luo, Yuwei Han, Philip S. Yu, Manling Li, Han Liu

Main category: cs.AI

TL;DR: This paper investigates the trade-off between reasoning accuracy and social bias mitigation in Multimodal Large Language Models, finding that a 1:4 mix of debias and reasoning training with reinforcement learning achieves optimal balance.

DetailsMotivation: While MLLMs achieve state-of-the-art results, advanced prompting and fine-tuning techniques that improve logical accuracy often leave models with pronounced social biases, creating a need to understand how reasoning gains interact with bias mitigation.

Method: Benchmarked three bias-mitigation strategies (SFT, KD, RL) under identical conditions, then varied the proportion of debias-focused and reasoning-centric samples to chart the reasoning-versus-bias trade-off.

Result: Revealed a consistent sweet spot: a roughly 1:4 mix trained with reinforcement learning cuts stereotype scores by 10% while retaining 88% of the model’s original reasoning accuracy.

Conclusion: Provides concrete guidance for balancing fairness and capability in MLLMs, showing that reinforcement learning with optimal sample mixing can effectively mitigate bias while preserving reasoning performance.

Abstract: Multimodal Large Language Models (MLLMs) already achieve state-of-the-art results across a wide range of tasks and modalities. To push their reasoning ability further, recent studies explore advanced prompting schemes and post-training fine-tuning. Although these techniques improve logical accuracy, they frequently leave the models’ outputs burdened with pronounced social biases. Clarifying how reasoning gains interact with bias mitigation, and whether the two objectives inherently trade off, therefore remains an open and pressing research problem. Our study begins by benchmarking three bias-mitigation strategies, supervised fine-tuning (SFT), knowledge distillation (KD), and rule-based reinforcement learning (RL), under identical conditions, establishing their baseline strengths and weaknesses. Building on these results, we vary the proportion of debias-focused and reasoning-centric samples within each paradigm to chart the reasoning-versus-bias trade-off. Our sweeps reveal a consistent sweet spot: a roughly 1:4 mix trained with reinforcement learning cuts stereotype scores by 10% while retaining 88% of the model’s original reasoning accuracy, offering concrete guidance for balancing fairness and capability in MLLMs.
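
Reproducing the reported sweet spot amounts to a simple data-mixing step before RL training. A minimal sketch, assuming two pre-built sample pools (the pool contents and sizes here are placeholders):

```python
import random

def build_training_mix(debias_pool, reasoning_pool, total=1000, debias_ratio=0.2):
    # 1:4 debias-to-reasoning mix, the sweet spot the paper reports for RL.
    n_debias = int(total * debias_ratio)
    mix = (random.sample(debias_pool, n_debias)
           + random.sample(reasoning_pool, total - n_debias))
    random.shuffle(mix)
    return mix

mix = build_training_mix(list(range(5000)), list(range(5000, 50000)))
```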

[450] Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety

Zhenyu Pan, Yiting Zhang, Yutong Zhang, Jianshu Zhang, Haozheng Luo, Yuwei Han, Dennis Wu, Hong-Yu Chen, Philip S. Yu, Manling Li, Han Liu

Main category: cs.AI

TL;DR: Evo-MARL is a multi-agent reinforcement learning framework that trains all agents to perform their tasks while simultaneously developing defensive capabilities against adversarial attacks, eliminating the need for external safety modules.

DetailsMotivation: Existing multi-agent systems using external guard modules face limitations in protection and create single-point failure risks. Adding more guard agents increases cost and complexity without solving the fundamental vulnerability.

Method: Evo-MARL integrates evolutionary search with parameter-sharing reinforcement learning to co-evolve attackers and defenders. It trains each agent to perform primary functions while resisting threats through adversarial training.

Result: Experiments show Evo-MARL reduces attack success rates by up to 22% while improving accuracy by up to 5% on reasoning tasks, demonstrating simultaneous safety and performance improvements.

Conclusion: The framework successfully internalizes safety mechanisms into all agents, providing robust defense without increasing system overhead or creating single-point failure vulnerabilities, proving that safety and utility can be jointly enhanced.

Abstract: Multi-agent systems (MAS) built on multimodal large language models exhibit strong collaboration and performance. However, their growing openness and interaction complexity pose serious risks, notably jailbreak and adversarial attacks. Existing defenses typically rely on external guard modules, such as dedicated safety agents, to handle unsafe behaviors. Unfortunately, this paradigm faces two challenges: (1) standalone agents offer limited protection, and (2) their independence leads to single-point failure: if compromised, system-wide safety collapses. Naively increasing the number of guard agents further raises cost and complexity. To address these challenges, we propose Evo-MARL, a novel multi-agent reinforcement learning (MARL) framework that enables all task agents to jointly acquire defensive capabilities. Rather than relying on external safety modules, Evo-MARL trains each agent to simultaneously perform its primary function and resist adversarial threats, ensuring robustness without increasing system overhead or single-node failure. Furthermore, Evo-MARL integrates evolutionary search with parameter-sharing reinforcement learning to co-evolve attackers and defenders. This adversarial training paradigm internalizes safety mechanisms and continually enhances MAS performance under co-evolving threats. Experiments show that Evo-MARL reduces attack success rates by up to 22% while boosting accuracy by up to 5% on reasoning tasks, demonstrating that safety and utility can be jointly improved.

[451] MV-Debate: Multi-view Agent Debate with Dynamic Reflection Gating for Multimodal Harmful Content Detection in Social Media

Rui Lu, Jinhe Bi, Yunpu Ma, Feng Xiao, Yuntao Du, Yijun Tian

Main category: cs.AI

TL;DR: MV-Debate is a multi-view agent debate framework that uses four specialized agents to detect harmful content in multimodal social media through iterative debate and reflection gating.

DetailsMotivation: Social media contains complex multimodal content where harmful intent (sarcasm, hate speech, misinformation) is often concealed through cross-modal contradictions and subtle cues, making detection challenging.

Method: Proposes MV-Debate framework with four debate agents: surface analyst, deep reasoner, modality contrast, and social contextualist. Uses iterative debate with dynamic reflection gating to refine responses based on reflection-gain criterion.

Result: Experiments on three benchmark datasets show MV-Debate significantly outperforms strong single-model and existing multi-agent debate baselines.

Conclusion: The work demonstrates the promise of multi-agent debate frameworks for advancing reliable social intent detection in safety-critical online contexts.

Abstract: Social media has evolved into a complex multimodal environment where text, images, and other signals interact to shape nuanced meanings, often concealing harmful intent. Identifying such intent, whether sarcasm, hate speech, or misinformation, remains challenging due to cross-modal contradictions, rapid cultural shifts, and subtle pragmatic cues. To address these challenges, we propose MV-Debate, a multi-view agent debate framework with dynamic reflection gating for unified multimodal harmful content detection. MV-Debate assembles four complementary debate agents, a surface analyst, a deep reasoner, a modality contrast, and a social contextualist, to analyze content from diverse interpretive perspectives. Through iterative debate and reflection, the agents refine responses under a reflection-gain criterion, ensuring both accuracy and efficiency. Experiments on three benchmark datasets demonstrate that MV-Debate significantly outperforms strong single-model and existing multi-agent debate baselines. This work highlights the promise of multi-agent debate in advancing reliable social intent detection in safety-critical online contexts.
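
The reflection-gain criterion is essentially an early-stopping rule over debate rounds. A compact sketch, with dummy agents and a dummy judge standing in for the four perspective agents and the paper's actual gain measure:

```python
def mv_debate(content, agents, judge, gain_threshold=0.02, max_rounds=4):
    # Iterate debate rounds; stop once another round of reflection no longer
    # raises the judge's confidence by at least gain_threshold.
    transcript, prev_score = [], 0.0
    for _ in range(max_rounds):
        views = {name: agent(content, transcript) for name, agent in agents.items()}
        transcript.append(views)
        answer, score = judge(content, transcript)
        if score - prev_score < gain_threshold:   # reflection gain too small
            return answer
        prev_score = score
    return answer

agents = {  # illustrative stand-ins for the four debate perspectives
    "surface":  lambda c, t: "caption contradicts image tone",
    "deep":     lambda c, t: "irony implies mockery of the group",
    "modality": lambda c, t: "text-image sentiment mismatch",
    "context":  lambda c, t: "phrase is reclaimed in this community",
}

def judge(content, transcript):
    # Dummy judge: confidence grows with evidence, saturating quickly.
    score = min(1.0, 0.4 + 0.3 * len(transcript))
    return ("harmful" if score > 0.6 else "benign"), score

print(mv_debate({"text": "...", "image": "..."}, agents, judge))
```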

[452] KIRETT: Knowledge-Graph-Based Smart Treatment Assistant for Intelligent Rescue Operations

Mubaris Nadeem, Johannes Zenkert, Lisa Bender, Christian Weber, Madjid Fathi

Main category: cs.AI

TL;DR: Knowledge Graph for emergency medical assistance provides AI-based treatment recommendations to first responders using real-time vital data analysis.

DetailsMotivation: Increasing need for rescue operations and time-critical emergency situations require optimized healthcare delivery where first responders need intelligent assistance to provide personalized treatment.

Method: Developed a Knowledge Graph as central knowledge representation that enables AI-based pre-recognition of medical situations and provides intelligent treatment recommendations.

Result: The system provides first responders with processed knowledge and recommendations for medical treatments based on freshly recorded vital data in emergency situations.

Conclusion: The Knowledge Graph approach offers innovative knowledge management that assists first responders in making faster, more informed treatment decisions during time-dependent emergency scenarios.

Abstract: Over the years, the need for rescue operations throughout the world has increased rapidly. Demographic changes and the resulting risk of injury or health disorders form the basis for emergency calls. In such scenarios, first responders rush to reach the patient in need, provide first aid, and save lives. They must be able to provide personalized and optimized healthcare in the shortest possible time and estimate the patient’s condition with the help of freshly recorded vital data in an emergency situation. However, in such a time-dependent situation, first responders and medical experts cannot fully draw on all of their knowledge and need assistance and recommendations for further medical treatments. To achieve this, knowledge that is calculated, evaluated, and processed on the spot must be made available to improve treatments by first responders. The Knowledge Graph presented in this article, as a central knowledge representation, provides first responders with an innovative knowledge management that enables intelligent treatment recommendations with an artificial intelligence-based pre-recognition of the situation.

[453] An LLM + ASP Workflow for Joint Entity-Relation Extraction

Trang Tran, Trung Hoang Le, Huiping Cao, Tran Cao Son

Main category: cs.AI

TL;DR: A novel joint entity-relation extraction workflow combining LLMs for natural language understanding and Answer Set Programming for knowledge representation, achieving state-of-the-art results with minimal training data.

DetailsMotivation: Traditional machine learning approaches for joint entity-relation extraction require large annotated datasets and lack flexibility for incorporating domain knowledge, making model creation labor-intensive and time-consuming.

Method: Proposes a generic workflow using generative pretrained LLMs for natural language understanding and Answer Set Programming (ASP) for knowledge representation and reasoning. The approach works directly with unannotated text and allows easy incorporation of domain-specific knowledge without modifying core programs.

Result: The LLM + ASP workflow outperforms state-of-the-art JERE systems with only 10% of training data. Achieves 2.5 times improvement (35% over 15%) in Relation Extraction for the challenging SciERC corpus.

Conclusion: The combination of LLMs and ASP provides an effective, flexible, and data-efficient solution for joint entity-relation extraction that can be applied across domains with minimal training data requirements.

Abstract: Joint entity-relation extraction (JERE) identifies both entities and their relationships simultaneously. Traditional machine-learning-based approaches to performing this task require a large corpus of annotated data and lack the ability to easily incorporate domain-specific information in the construction of the model. Therefore, creating a model for JERE is often labor intensive, time consuming, and elaboration intolerant. In this paper, we propose harnessing the capabilities of generative pretrained large language models (LLMs) and the knowledge representation and reasoning capabilities of Answer Set Programming (ASP) to perform JERE. We present a generic workflow for JERE using LLMs and ASP. The workflow is generic in the sense that it can be applied for JERE in any domain. It takes advantage of the LLM’s capability in natural language understanding in that it works directly with unannotated text. It exploits the elaboration-tolerant feature of ASP in that no modification of its core program is required when additional domain-specific knowledge, in the form of type specifications, is found and needs to be used. We demonstrate the usefulness of the proposed workflow through experiments with limited training data on three well-known benchmarks for JERE. The results of our experiments show that the LLM + ASP workflow is better than state-of-the-art JERE systems in several categories with only 10% of training data. It is able to achieve a 2.5 times (35% over 15%) improvement in the Relation Extraction task for the SciERC corpus, one of the most difficult benchmarks.
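
The division of labor is the interesting part: the LLM proposes candidate atoms from raw text, and ASP rules filter them with type knowledge. The sketch below hand-evaluates one rule in plain Python so it runs without a solver; in practice the rules would go to an ASP system such as clingo, and all names here are invented for illustration:

```python
def llm_extract(text):
    # Hypothetical LLM call: reads unannotated text, proposes candidate atoms.
    return ['entity("GPT-4", model)', 'entity("OpenAI", org)',
            'candidate_rel("GPT-4", "OpenAI", developed_by)']

# Elaboration-tolerant domain knowledge: adding type specifications later
# means adding rules, not rewriting this core program.
ASP_RULES = """
rel(X, Y, developed_by) :- candidate_rel(X, Y, developed_by),
                           entity(X, model), entity(Y, org).
"""

def solve_asp(facts, rules):
    # Hand-evaluation of the single rule above; a real workflow would ground
    # facts + rules with an ASP solver and read off the stable model.
    fact_set = set(facts)
    derived = []
    for f in facts:
        if f.startswith("candidate_rel"):
            x, y, r = f[f.index("(") + 1:-1].split(", ")
            if f'entity({x}, model)' in fact_set and f'entity({y}, org)' in fact_set:
                derived.append(f'rel({x}, {y}, {r})')
    return derived

print(solve_asp(llm_extract("GPT-4 was developed by OpenAI."), ASP_RULES))
# ['rel("GPT-4", "OpenAI", developed_by)']
```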

[454] Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark

Yuxuan Cai, Yipeng Hao, Jie Zhou, Hang Yan, Zhikai Lei, Rui Zhen, Zhenhua Han, Yutao Yang, Junsong Li, Qianjun Pan, Tianyu Huai, Qin Chen, Xin Li, Kai Chen, Bo Zhang, Xipeng Qiu, Liang He

Main category: cs.AI

TL;DR: Introduces Experience-driven Lifelong Learning (ELL) framework for creating self-evolving AI agents that continuously learn through real-world interaction, with four core principles and a benchmark dataset called StuLife.

DetailsMotivation: As AI advances toward general intelligence, there's a need to shift from systems optimized for static tasks to open-ended agents that can learn continuously through real-world interaction.

Method: Proposes ELL framework with four principles: Experience Exploration, Long-term Memory, Skill Learning, and Knowledge Internalization. Also introduces StuLife benchmark dataset simulating a student’s college journey across three phases and ten sub-scenarios.

Result: The paper presents a comprehensive framework for lifelong learning agents and a corresponding benchmark, but specific experimental results are not detailed in the provided abstract.

Conclusion: The ELL framework provides a systematic approach to building self-evolving agents capable of continuous growth, with StuLife serving as an appropriate benchmark for evaluating lifelong learning capabilities in realistic scenarios.

Abstract: As AI advances toward general intelligence, the focus is shifting from systems optimized for static tasks to creating open-ended agents that learn continuously. In this paper, we introduce Experience-driven Lifelong Learning (ELL), a framework for building self-evolving agents capable of continuous growth through real-world interaction. The framework is built on four core principles: (1) Experience Exploration: Agents learn through continuous, self-motivated interaction with dynamic environments, navigating interdependent tasks and generating rich experiential trajectories. (2) Long-term Memory: Agents preserve and structure historical knowledge, including personal experiences, domain expertise, and commonsense reasoning, into a persistent memory system. (3) Skill Learning: Agents autonomously improve by abstracting recurring patterns from experience into reusable skills, which are actively refined and validated for application in new tasks. (4) Knowledge Internalization: Agents internalize explicit and discrete experiences into implicit and intuitive capabilities as “second nature”. We also introduce StuLife, a benchmark dataset for ELL that simulates a student’s holistic college journey, from enrollment to academic and personal development, across three core phases and ten detailed sub-scenarios. StuLife is designed around three key paradigm

[455] ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding

Sining Zhoubian, Dan Zhang, Jie Tang

Main category: cs.AI

TL;DR: ReST-RL is a unified reinforcement learning paradigm that combines improved GRPO training with VM-assisted MCTS decoding to significantly boost LLM code reasoning accuracy without requiring annotated training data.

DetailsMotivation: Existing RL methods like GRPO fail due to insignificant reward variance, while process reward models (PRMs) suffer from training data acquisition difficulties and verification effectiveness issues in improving LLM reasoning accuracy.

Method: Two-stage approach: 1) ReST-GRPO uses optimized ReST algorithm to filter high-value training data and increase reward variance; 2) VM-MCTS employs Monte-Carlo Tree Search to collect value targets for VM training, then uses adapted MCTS with VM to provide process signals and verification scores during decoding.

Result: Significantly outperforms other reinforcement training baselines (naive GRPO, ReST-DPO) and decoding/verification baselines (PRM-BoN, ORM-MCTS) on major coding benchmarks including APPS, BigCodeBench, and HumanEval.

Conclusion: ReST-RL effectively strengthens LLM reasoning ability through improved training efficiency and precise process verification, demonstrating a powerful unified RL paradigm for code reasoning tasks.

Abstract: With respect to improving the reasoning accuracy of LLMs, the representative reinforcement learning (RL) method GRPO faces failure due to insignificant reward variance, while verification methods based on process reward models (PRMs) suffer from difficulties with training data acquisition and verification effectiveness. To tackle these problems, this paper introduces ReST-RL, a unified LLM RL paradigm that significantly improves LLM’s code reasoning ability by combining an improved GRPO algorithm with a meticulously designed test-time decoding method assisted by a value model (VM). As the first stage of policy reinforcement, ReST-GRPO adopts an optimized ReST algorithm to filter and assemble high-value training data, increasing the reward variance of GRPO sampling, thus improving the effectiveness and efficiency of training. After the basic reasoning ability of LLM policy has been improved, we further propose a test-time decoding optimization method called VM-MCTS. Through Monte-Carlo Tree Search (MCTS), we collect accurate value targets with no annotation required, on which VM training is based. When decoding, the VM is deployed by an adapted MCTS algorithm to provide precise process signals as well as verification scores, assisting the LLM policy to achieve high reasoning accuracy. We conduct extensive experiments on coding problems to verify the validity of the proposed RL paradigm. Upon comparison, our approach significantly outperforms other reinforcement training baselines (e.g., naive GRPO and ReST-DPO), as well as decoding and verification baselines (e.g., PRM-BoN and ORM-MCTS) on well-known coding benchmarks of various levels (e.g., APPS, BigCodeBench, and HumanEval), indicating its power to strengthen the reasoning ability of LLM policies. Codes for our project can be found at https://github.com/THUDM/ReST-RL.
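
The data-filtering intuition behind ReST-GRPO is straightforward to illustrate: if all sampled solutions to a prompt receive the same reward, the group-normalized advantage vanishes and the prompt teaches nothing. A minimal sketch of variance-based selection (the threshold rule is an illustrative stand-in for the paper's actual filtering):

```python
import numpy as np

def filter_for_reward_variance(prompts, rewards, min_std=0.05):
    # Keep prompts whose sampled completions have non-degenerate rewards:
    # rewards[i] holds the verifier scores of several samples for prompts[i].
    keep = [i for i, r in enumerate(rewards) if np.std(r) > min_std]
    return [prompts[i] for i in keep]

prompts = ["task A", "task B", "task C"]
rewards = [np.array([1.0, 1.0, 1.0]),   # solved by every sample: no signal
           np.array([0.0, 1.0, 0.0]),   # mixed outcomes: high variance
           np.array([0.0, 0.0, 0.0])]   # never solved: no signal
print(filter_for_reward_variance(prompts, rewards))  # ['task B']
```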

[456] Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models

Yunqing Liu, Nan Zhang, Zhiming Tan

Main category: cs.AI

TL;DR: A novel part retrieval framework using Error Notebooks + RAG for improved CAD assembly part retrieval without additional training, achieving up to 23.4% accuracy improvement with GPT-4o.

DetailsMotivation: Direct LLM/VLM use for CAD part retrieval faces token limit issues and unsatisfactory performance, while fine-tuning is computationally expensive and unavailable for proprietary models like GPT/Gemini.

Method: Error Notebooks construction (collecting historical erroneous CoTs with corrections) + RAG retrieval of specification-relevant records for refined prompt engineering.

Result: Substantial gains with proprietary models, GPT-4o achieving 23.4% absolute accuracy improvement; CoT reasoning particularly beneficial for complex cases (>10 parts).

Conclusion: The framework effectively handles 3D models with lengthy metadata without extra training, demonstrating significant performance improvements in CAD part retrieval tasks.

Abstract: Effective specification-aware part retrieval within complex CAD assemblies is essential for automated design verification and downstream engineering tasks. However, directly applying LLMs/VLMs to this task presents some challenges: the input sequences may exceed model token limits, and even after processing, performance remains unsatisfactory. Moreover, fine-tuning LLMs/VLMs requires significant computational resources, and for many high-performing general-use proprietary models (e.g., GPT or Gemini), fine-tuning access is not available. In this paper, we propose a novel part retrieval framework that requires no extra training, instead using Error Notebooks + RAG for refined prompt engineering to improve the retrieval performance of existing general-purpose models. The construction of Error Notebooks consists of two steps: (1) collecting historical erroneous CoTs and their incorrect answers, and (2) connecting these CoTs through reflective corrections until the correct solutions are obtained. As a result, the Error Notebooks serve as a repository of tasks along with their corrected CoTs and final answers. RAG is then employed to retrieve specification-relevant records from the Error Notebooks and incorporate them into the inference process. Another major contribution of our work is a human-in-the-loop CAD dataset, which is used to evaluate our method. In addition, the engineering value of our novel framework lies in its ability to effectively handle 3D models with lengthy, non-natural language metadata. Experiments with proprietary models, including GPT-4o and the Gemini series, show substantial gains, with GPT-4o (Omni) achieving up to a 23.4% absolute accuracy improvement on the human preference dataset. Moreover, ablation studies confirm that CoT reasoning provides benefits especially in challenging cases with higher part counts (>10).
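
A minimal picture of the retrieval step: embed the new specification, pull the nearest corrected-CoT records, and prepend them to the prompt. The notebook entries and the hash-based embedding below are placeholders (a real sentence encoder would give meaningful similarity; here the neighbors are effectively arbitrary):

```python
import numpy as np

def embed(text):
    # Random-projection stand-in for a sentence encoder, keyed on the text.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

# An Error Notebook entry pairs a task with its reflectively corrected CoT.
notebook = [
    {"task": "find the M6 bolt in the gearbox assembly",
     "corrected_cot": "filter by thread spec first, then by parent subassembly"},
    {"task": "locate all bearings rated above 3000 rpm",
     "corrected_cot": "rpm rating lives in metadata, not the part name"},
]

def retrieve(query, k=1):
    q = embed(query)
    return sorted(notebook, key=lambda e: -float(q @ embed(e["task"])))[:k]

query = "find the M8 bolt in the motor housing"
context = "\n".join(e["corrected_cot"] for e in retrieve(query))
prompt = f"Past corrected reasoning:\n{context}\n\nNew task: {query}"
```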

[457] Oyster-I: Beyond Refusal – Constructive Safety Alignment for Responsible Language Models

Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Yitong Yang, Jialing Tao, Hui Xue

Main category: cs.AI

TL;DR: CSA is a new safety paradigm that shifts from refusal-based to guidance-first approach, protecting against malicious use while actively helping vulnerable users through game-theoretic anticipation and interpretable reasoning control.

DetailsMotivation: Current LLM safety approaches focus narrowly on adversarial risks and rely on defensive refusals, which can be harmful for non-malicious users in psychological distress who need constructive guidance.

Method: Constructive Safety Alignment (CSA) implemented in Oyster-I (Oy1) combines game-theoretic anticipation of user reactions, fine-grained risk boundary discovery, and interpretable reasoning control.

Result: Oy1 achieves state-of-the-art safety among open models, strong constructive engagement close to GPT-5, and unmatched robustness on jailbreak datasets nearing GPT-o1 levels while retaining high general capabilities.

Conclusion: CSA redefines model-user relationships by making systems not just safe but meaningfully helpful, shifting from refusal-first to guidance-first safety for more responsible, user-centered AI.

Abstract: Large language models (LLMs) typically deploy safety mechanisms to prevent harmful content generation. Most current approaches focus narrowly on risks posed by malicious actors, often framing risks as adversarial events and relying on defensive refusals. However, in real-world settings, risks also come from non-malicious users seeking help while under psychological distress (e.g., self-harm intentions). In such cases, the model’s response can strongly influence the user’s next actions. Simple refusals may lead them to repeat, escalate, or move to unsafe platforms, creating worse outcomes. We introduce Constructive Safety Alignment (CSA), a human-centric paradigm that protects against malicious misuse while actively guiding vulnerable users toward safe and helpful results. Implemented in Oyster-I (Oy1), CSA combines game-theoretic anticipation of user reactions, fine-grained risk boundary discovery, and interpretable reasoning control, turning safety into a trust-building process. Oy1 achieves state-of-the-art safety among open models while retaining high general capabilities. On our Constructive Benchmark, it shows strong constructive engagement, close to GPT-5, and unmatched robustness on the Strata-Sword jailbreak dataset, nearing GPT-o1 levels. By shifting from refusal-first to guidance-first safety, CSA redefines the model-user relationship, aiming for systems that are not just safe, but meaningfully helpful. We release Oy1, code, and the benchmark to support responsible, user-centered AI.

[458] Planning with Reasoning using Vision Language World Model

Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, Pascale Fung

Main category: cs.AI

TL;DR: VLWM is a vision-language foundation model for world modeling that combines action policy learning with dynamics modeling, enabling both reactive and reflective planning through semantic cost minimization.

DetailsMotivation: High-level world models that understand semantic and temporal abstraction for action reasoning are underdeveloped, limiting effective planning capabilities.

Method: Uses LLM Self-Refine with Tree of Captions to extract targets from visual observations, learns both action policy and dynamics model, and employs semantic cost minimization via a self-supervised critic model.

Result: Achieves state-of-the-art Visual Planning for Assistance performance with +27% Elo score improvement, and outperforms VLM baselines on RoboVQA and WorldPrediction benchmarks.

Conclusion: VLWM demonstrates effective world modeling through combined reactive and reflective planning, showing significant improvements in visual planning tasks and benchmark performance.

Abstract: Effective planning requires strong world models, but high-level world models that can understand and reason about actions with semantic and temporal abstraction remain largely underdeveloped. We introduce the Vision Language World Model (VLWM), a foundation model trained for language-based world modeling on natural videos. Given visual observations, the VLWM first infers the overall goal achievements and then predicts a trajectory composed of interleaved actions and world state changes. Those targets are extracted by iterative LLM Self-Refine conditioned on compressed future observations represented by a Tree of Captions. The VLWM learns both an action policy and a dynamics model, which respectively facilitate reactive system-1 plan decoding and reflective system-2 planning via cost minimization. The cost evaluates the semantic distance between the hypothetical future states given by VLWM roll-outs and the expected goal state, and is measured by a critic model that we trained in a self-supervised manner. The VLWM achieves state-of-the-art Visual Planning for Assistance (VPA) performance on both benchmark evaluations and our proposed PlannerArena human evaluations, where system-2 improves the Elo score by +27% over system-1. The VLWM also outperforms strong VLM baselines on the RoboVQA and WorldPrediction benchmarks.
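
The reflective system-2 loop described above reduces to cost-minimizing plan selection. Below is a minimal runnable sketch with toy stand-ins for the world model, critic, and action space; in the paper all three components are learned models, so the callables here are assumptions.

```python
import random

# Minimal sketch of reflective (system-2) planning by cost minimization.

def system2_plan(sample_plan, rollout, cost_fn, goal, n_rollouts=8):
    """Sample candidate plans, score predicted end states, keep the cheapest."""
    best_plan, best_cost = None, float("inf")
    for _ in range(n_rollouts):
        plan = sample_plan()            # system-1 proposal (fast decoding)
        final_state = rollout(plan)     # dynamics model predicts the end state
        c = cost_fn(final_state, goal)  # critic: semantic distance to goal
        if c < best_cost:
            best_plan, best_cost = plan, c
    return best_plan, best_cost

# Toy usage: states are integers, the goal is 10, actions add steps.
random.seed(0)
plan, cost = system2_plan(
    sample_plan=lambda: [random.choice([1, 2, 3]) for _ in range(5)],
    rollout=lambda p: sum(p),
    cost_fn=lambda s, g: abs(s - g),
    goal=10,
)
print(plan, cost)
```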

[459] Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning

Haozhe Wang, Qixin Xu, Che Liu, Junhong Wu, Fangzhen Lin, Wenhu Chen

Main category: cs.AI

TL;DR: RL enhances LLM reasoning through emergent hierarchical planning, with HICRA algorithm focusing optimization on high-impact strategic tokens for better performance.

DetailsMotivation: To understand the underlying mechanisms of how RL improves LLM reasoning abilities and address inefficiencies in current RL algorithms that apply optimization pressure indiscriminately.

Method: Analysis of RL dynamics in LLMs, identification of two-phase learning process, and development of HIerarchy-Aware Credit Assignment (HICRA) algorithm that concentrates optimization on planning tokens.

Result: HICRA significantly outperforms strong baselines like GRPO, demonstrating that focusing on strategic bottlenecks is key to advanced reasoning. Semantic entropy proved superior for measuring strategic exploration.

Conclusion: RL success in LLMs stems from emergent reasoning hierarchy similar to human cognition, and targeted optimization on high-level planning tokens through HICRA is more effective than agnostic optimization approaches.

Abstract: Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs), yet the underlying mechanisms driving this success remain largely opaque. Our analysis reveals that puzzling phenomena like "aha moments", "length-scaling", and entropy dynamics are not disparate occurrences but hallmarks of an emergent reasoning hierarchy, akin to the separation of high-level strategic planning from low-level procedural execution in human cognition. We uncover a compelling two-phase dynamic: initially, a model is constrained by procedural correctness and must improve its low-level skills. The learning bottleneck then decisively shifts, with performance gains being driven by the exploration and mastery of high-level strategic planning. This insight exposes a core inefficiency in prevailing RL algorithms like GRPO, which apply optimization pressure agnostically and dilute the learning signal across all tokens. To address this, we propose HIerarchy-Aware Credit Assignment (HICRA), an algorithm that concentrates optimization efforts on high-impact planning tokens. HICRA significantly outperforms strong baselines, demonstrating that focusing on this strategic bottleneck is key to unlocking advanced reasoning. Furthermore, we validate semantic entropy as a superior compass for measuring strategic exploration over misleading metrics such as token-level entropy.
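
As a rough illustration of hierarchy-aware credit assignment, the sketch below starts from a GRPO-style sequence-level advantage and amplifies it on tokens flagged as planning tokens. The mask and the scaling rule are assumptions for illustration, not the exact HICRA formulation.

```python
import numpy as np

# Hedged sketch: amplify the learning signal on high-level "planning"
# tokens and leave low-level execution tokens unchanged.

def hicra_advantages(adv: float, planning_mask: np.ndarray, alpha: float = 2.0):
    """Per-token advantages from one sequence-level advantage."""
    token_adv = np.full(planning_mask.shape, adv, dtype=float)
    token_adv[planning_mask] *= alpha   # concentrate credit on planning tokens
    return token_adv

mask = np.array([True, False, False, True, False])  # e.g. "First, ... then ..."
print(hicra_advantages(0.5, mask))  # [1.  0.5 0.5 1.  0.5]
```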

[460] Meta-Policy Reflexion: Reusable Reflective Memory and Rule Admissibility for Resource-Efficient LLM Agent

Chunlong Wu, Ye Luo, Zhibo Qu, Min Wang

Main category: cs.AI

TL;DR: Meta-Policy Reflexion (MPR) is a hybrid framework that consolidates LLM-generated reflections into reusable meta-policy memory to improve agent performance without weight updates.

DetailsMotivation: Existing reflective strategies produce ephemeral, task-specific traces that aren't reused, while RL alternatives require substantial parameter updates and compute.

Method: MPR creates structured Meta-Policy Memory (MPM) from LLM reflections and applies it through soft memory-guided decoding and hard rule admissibility checks (HAC).

Result: Empirical results show consistent gains in execution accuracy and robustness compared to Reflexion baselines, with rule admissibility improving stability.

Conclusion: MPR externalizes reusable corrective knowledge without model updates, enforces domain constraints, and retains language-based reflection adaptability, with potential for multimodal and multi-agent extensions.

Abstract: Large language model (LLM) agents achieve impressive single-task performance but commonly exhibit repeated failures, inefficient exploration, and limited cross-task adaptability. Existing reflective strategies (e.g., Reflexion, ReAct) improve per-episode behavior but typically produce ephemeral, task-specific traces that are not reused across tasks. Reinforcement-learning based alternatives can produce transferable policies but require substantial parameter updates and compute. In this work we introduce Meta-Policy Reflexion (MPR): a hybrid framework that consolidates LLM-generated reflections into a structured, predicate-like Meta-Policy Memory (MPM) and applies that memory at inference time through two complementary mechanisms: soft memory-guided decoding and hard rule admissibility checks (HAC). MPR (i) externalizes reusable corrective knowledge without model weight updates, (ii) enforces domain constraints to reduce unsafe or invalid actions, and (iii) retains the adaptability of language-based reflection. We formalize the MPM representation, present algorithms for update and decoding, and validate the approach in a text-based agent environment following the experimental protocol described in the provided implementation (AlfWorld-based). Empirical results reported in the supplied material indicate consistent gains in execution accuracy and robustness when compared to Reflexion baselines; rule admissibility further improves stability. We analyze mechanisms that explain these gains, discuss scalability and failure modes, and outline future directions for multimodal and multi-agent extensions.
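
The hard rule admissibility check (HAC) can be pictured as predicate filtering over candidate actions before the agent commits to one. A minimal sketch with made-up rules for an AlfWorld-style environment follows.

```python
# Meta-policy rules modeled as predicates over (state, action) pairs; any
# candidate action violating a rule is filtered out before decoding
# commits to it. Rule contents are invented for illustration.

RULES = [
    lambda state, action: not (action == "open fridge" and state.get("fridge_open")),
    lambda state, action: not (action.startswith("take") and state.get("hands_full")),
]

def admissible(state: dict, candidates: list[str]) -> list[str]:
    """Keep only actions that satisfy every meta-policy rule."""
    return [a for a in candidates if all(rule(state, a) for rule in RULES)]

state = {"fridge_open": True, "hands_full": False}
print(admissible(state, ["open fridge", "take apple", "close fridge"]))
# -> ['take apple', 'close fridge']
```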

[461] ArcMemo: Abstract Reasoning Composition with Lifelong LLM Memory

Matthew Ho, Chen Si, Zhaoxiang Feng, Fangxu Yu, Yichi Yang, Zhijian Liu, Zhiting Hu, Lianhui Qin

Main category: cs.AI

TL;DR: The paper introduces concept-level memory for LLMs that distills reusable abstractions from reasoning traces, enabling test-time continual learning without weight updates and achieving 7.5% performance gains on ARC-AGI benchmark.

DetailsMotivation: Current LLMs discard valuable patterns and insights from reasoning traces once context windows reset. External memory can persist these discoveries, but existing approaches use instance-based entries that lack reusability and scalability.

Method: Proposes concept-level memory with strategies for abstracting takeaways from solution rollouts and retrieving relevant concepts for new queries. Uses natural language storage of modular abstractions that can be selectively integrated into prompts.

Result: Achieves 7.5% relative gain over strong no-memory baseline on ARC-AGI benchmark, with performance scaling with inference compute. Abstract concepts outperform other memory designs at all tested scales, and dynamic memory updates during test-time beat fixed settings.

Conclusion: Concept-level memory enables effective test-time continual learning through accumulation and abstraction of patterns, supporting self-improvement without weight updates. The approach shows particular strength on compositional generalization and abstract reasoning tasks.

Abstract: While inference-time scaling enables LLMs to carry out increasingly long and capable reasoning traces, the patterns and insights uncovered during these traces are immediately discarded once the context window is reset for a new query. External memory is a natural way to persist these discoveries, and recent work has shown clear benefits for reasoning-intensive tasks. We see an opportunity to make such memories more broadly reusable and scalable by moving beyond instance-based memory entries (e.g. exact query/response pairs, or summaries tightly coupled with the original problem context) toward concept-level memory: reusable, modular abstractions distilled from solution traces and stored in natural language. For future queries, relevant concepts are selectively retrieved and integrated into the prompt, enabling test-time continual learning without weight updates. Our design introduces new strategies for abstracting takeaways from rollouts and retrieving entries for new queries, promoting reuse and allowing memory to expand with additional experiences. We evaluate on ARC-AGI, a benchmark that stresses compositional generalization and abstract reasoning, making it a natural fit for concept memory. Our method yields a 7.5% relative gain over a strong no-memory baseline with performance continuing to scale with inference compute. We find abstract concepts to be the most consistent memory design, outscoring the baseline at all tested inference compute scales. Moreover, dynamically updating memory during test-time outperforms fixed settings, supporting the hypothesis that accumulating and abstracting patterns enables further solutions in a form of self-improvement. Code is available at https://github.com/matt-seb-ho/arc_memo.
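
A toy sketch of the concept-memory loop, under stated assumptions: after each attempt, an LLM call (stubbed here by abstract_concept) distills a reusable natural-language takeaway, which is stored and retrieved for later prompts. Retrieval by word overlap is a deliberate simplification of the paper's retrieval strategy.

```python
# Concept-level memory: store modular, problem-agnostic lessons and
# update the memory dynamically at test time. All helpers are stubs.

concept_memory: list[str] = []

def abstract_concept(solution_trace: str) -> str:
    # Stand-in for an LLM call that strips instance details and keeps only
    # the reusable rule, e.g. "if a shape repeats, check for tiling".
    return "lesson: " + solution_trace.split(".")[0].lower()

def solve(task: str, attempt_fn) -> str:
    relevant = [c for c in concept_memory if any(w in task for w in c.split())]
    trace = attempt_fn(task, relevant)              # model attempt with concepts
    concept_memory.append(abstract_concept(trace))  # test-time memory update
    return trace

out = solve("tile the grid with the repeated shape",
            lambda t, cs: f"Noticed repetition. Concepts used: {cs}")
print(out, "|", concept_memory)
```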

[462] TalkToAgent: A Human-centric Explanation of Reinforcement Learning Agents with Large Language Models

Haechang Kim, Hao Chen, Can Li, Jong Min Lee

Main category: cs.AI

TL;DR: TalkToAgent is a multi-agent LLM framework that provides interactive natural language explanations for RL policies using five specialized agents to map user queries to XRL tools and generate various explanation types.

DetailsMotivation: There's a gap between complex RL policies and domain experts due to limited comprehensibility of current XRL approaches and isolated tool coverage, leaving users uncertain about which tools to use.

Method: A multi-agent LLM framework with five specialized agents (Coordinator, Explainer, Coder, Evaluator, Debugger) that automatically maps user queries to relevant XRL tools and provides explanations through key state variables, expected outcomes, or counterfactual explanations.

Result: Successfully mapped user queries to XRL tasks with high accuracy, minimized counterfactual generation failures through coder-debugger interactions, and effectively interpreted the agent’s actions within the problem domain.

Conclusion: TalkToAgent addresses the comprehensibility gap in XRL by providing interactive natural language explanations through a multi-agent LLM framework, successfully bridging the gap between complex RL policies and domain experts.

Abstract: Explainable Reinforcement Learning (XRL) has emerged as a promising approach in improving the transparency of Reinforcement Learning (RL) agents. However, there remains a gap between complex RL policies and domain experts, due to the limited comprehensibility of XRL results and isolated coverage of current XRL approaches that leave users uncertain about which tools to employ. To address these challenges, we introduce TalkToAgent, a multi-agent Large Language Model (LLM) framework that delivers interactive, natural language explanations for RL policies. The architecture with five specialized LLM agents (Coordinator, Explainer, Coder, Evaluator, and Debugger) enables TalkToAgent to automatically map user queries to relevant XRL tools and clarify an agent’s actions in terms of either key state variables, expected outcomes, or counterfactual explanations. Moreover, our approach extends previous counterfactual explanations by deriving alternative scenarios from qualitative behavioral descriptions, or even new rule-based policies. We validated TalkToAgent on a quadruple-tank process control problem, a well-known nonlinear control benchmark. Results demonstrated that TalkToAgent successfully mapped user queries into XRL tasks with high accuracy, and coder-debugger interactions minimized failures in counterfactual generation. Furthermore, qualitative evaluation confirmed that TalkToAgent effectively interpreted the agent’s actions and contextualized their meaning within the problem domain.
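
The Coordinator's role reduces to routing a free-form query to an XRL tool and handing off. In the paper this routing is itself an LLM agent; the keyword table below is only meant to make the control flow concrete, and the tool names are illustrative.

```python
# Toy dispatch sketch: map a user query to one of three explanation tools.

XRL_TOOLS = {
    "feature_importance": lambda q: "rank key state variables for this action",
    "expected_outcome":   lambda q: "roll the policy forward and summarize",
    "counterfactual":     lambda q: "re-simulate under the alternative behavior",
}

def coordinator(query: str) -> str:
    q = query.lower()
    if "what if" in q or "instead" in q:
        tool = "counterfactual"
    elif "why" in q:
        tool = "feature_importance"
    else:
        tool = "expected_outcome"
    return f"[{tool}] " + XRL_TOOLS[tool](query)

print(coordinator("What if the controller kept valve 2 closed instead?"))
```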

[463] Sticker-TTS: Learn to Utilize Historical Experience with a Sticker-driven Test-Time Scaling Framework

Jie Chen, Jinhao Jiang, Yingqian Min, Zican Dong, Shijie Wang, Wayne Xin Zhao, Ji-Rong Wen

Main category: cs.AI

TL;DR: Sticker-TTS is a test-time scaling framework that coordinates three collaborative LRMs to iteratively refine solutions using historical experience through distilled key conditions called stickers, achieving superior performance on mathematical reasoning benchmarks.

DetailsMotivation: Current test-time scaling methods rely on redundant sampling and ignore historical experience utilization, limiting computational efficiency for large reasoning models.

Method: Proposes a framework with three collaborative LRMs that extract, refine, and reuse critical information (stickers) across multiple reasoning rounds, using a two-stage optimization strategy combining imitation learning and self-improvement.

Result: Extensive evaluations on AIME-24, AIME-25, and OlymMATH benchmarks show Sticker-TTS consistently outperforms strong baselines including self-consistency and reinforcement learning approaches under comparable inference budgets.

Conclusion: The framework demonstrates the effectiveness of sticker-guided historical experience utilization for improving computational efficiency and performance in complex reasoning tasks.

Abstract: Large reasoning models (LRMs) have exhibited strong performance on complex reasoning tasks, with further gains achievable through increased computational budgets at inference. However, current test-time scaling methods predominantly rely on redundant sampling, ignoring historical experience utilization, thereby limiting computational efficiency. To overcome this limitation, we propose Sticker-TTS, a novel test-time scaling framework that coordinates three collaborative LRMs to iteratively explore and refine solutions guided by historical attempts. At the core of our framework are distilled key conditions, termed stickers, which drive the extraction, refinement, and reuse of critical information across multiple rounds of reasoning. To further enhance the efficiency and performance of our framework, we introduce a two-stage optimization strategy that combines imitation learning with self-improvement, enabling progressive refinement. Extensive evaluations on three challenging mathematical reasoning benchmarks, including AIME-24, AIME-25, and OlymMATH, demonstrate that Sticker-TTS consistently surpasses strong baselines, including self-consistency and advanced reinforcement learning approaches, under comparable inference budgets. These results highlight the effectiveness of sticker-guided historical experience utilization. Our code and data are available at https://github.com/RUCAIBox/Sticker-TTS.

[464] LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation

Yinglin Duan, Zhengxia Zou, Tongwei Gu, Wei Jia, Zhan Zhao, Luyi Xu, Xinzhu Liu, Yenan Lin, Hao Jiang, Kang Chen, Shuang Qiu

Main category: cs.AI

TL;DR: LatticeWorld is a 3D world generation framework that uses lightweight LLMs and Unreal Engine 5 to create dynamic, interactive 3D environments from multimodal inputs, achieving 90x efficiency gains over manual methods.

DetailsMotivation: To bridge the sim-to-real gap and enable convenient creation of realistic 3D simulations for applications like embodied AI, autonomous driving, and entertainment by automating the 3D world generation process.

Method: Proposes LatticeWorld framework that combines lightweight LLaMA-2-7B LLMs with Unreal Engine 5 rendering engine to generate dynamic 3D environments from textual descriptions and visual instructions as multimodal inputs.

Result: Achieves superior accuracy in scene layout generation and visual fidelity, with over 90x increase in industrial production efficiency compared to traditional manual methods while maintaining high creative quality.

Conclusion: LatticeWorld provides an effective solution for automated 3D world generation that significantly improves production efficiency and enables creation of large-scale interactive worlds with realistic physics and multi-agent interactions.

Abstract: Recent research has been increasingly focusing on developing 3D world models that simulate complex real-world scenarios. World models have found broad applications across various domains, including embodied AI, autonomous driving, entertainment, etc. A more realistic simulation with accurate physics will effectively narrow the sim-to-real gap and allow us to gather rich information about the real world conveniently. While traditional manual modeling has enabled the creation of virtual 3D scenes, modern approaches have leveraged advanced machine learning algorithms for 3D world generation, with most recent advances focusing on generative methods that can create virtual worlds based on user instructions. This work explores such a research direction by proposing LatticeWorld, a simple yet effective 3D world generation framework that streamlines the industrial production pipeline of 3D environments. LatticeWorld leverages lightweight LLMs (LLaMA-2-7B) alongside an industry-grade rendering engine (Unreal Engine 5) to generate a dynamic environment. Our proposed framework accepts textual descriptions and visual instructions as multimodal inputs and creates large-scale 3D interactive worlds with dynamic agents, featuring competitive multi-agent interaction, high-fidelity physics simulation, and real-time rendering. We conduct comprehensive experiments to evaluate LatticeWorld, showing that it achieves superior accuracy in scene layout generation and visual fidelity. Moreover, LatticeWorld achieves over a $90\times$ increase in industrial production efficiency while maintaining high creative quality compared with traditional manual production methods. Our demo video is available at https://youtu.be/8VWZXpERR18

cs.SD

[465] TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition

Minh N. H. Nguyen, Anh Nguyen Tran, Dung Truong Dinh, Nam Van Vo

Main category: cs.SD

TL;DR: A novel Two-Stage Phoneme-Centric (TSPC) model for Vietnamese-English code-switching ASR that uses extended Vietnamese phoneme set as intermediate representation, achieving 20.8% WER with reduced training resources.

DetailsMotivation: Code-switching presents challenges for ASR systems, especially for Vietnamese-English pairs due to distinct phonological features and sound recognition ambiguity. Existing methods fail to capture subtle phonological shifts in CS scenarios.

Method: Two-Stage Phoneme-Centric architecture using extended Vietnamese phoneme set as intermediate representation for mixed-lingual modeling. Phonetic-based approach enables phoneme adaptation and language conversion.

Result: TSPC consistently outperforms existing baselines, including PhoWhisper-base, achieving a significantly lower word error rate of 20.8% with reduced training resources.

Conclusion: The phoneme-centric two-stage architecture effectively enhances ASR performance in complex Vietnamese-English code-switching scenarios through phoneme adaptation and language conversion capabilities.

Abstract: Code-switching (CS) presents a significant challenge for general Automatic Speech Recognition (ASR) systems. Existing methods often fail to capture the subtle phonological shifts inherent in CS scenarios. The challenge is particularly difficult for language pairs like Vietnamese and English, where both distinct phonological features and the ambiguity arising from similar sound recognition are present. In this paper, we propose a novel architecture for Vietnamese-English CS ASR, a Two-Stage Phoneme-Centric model (TSPC). The TSPC employs a phoneme-centric approach, built upon an extended Vietnamese phoneme set as an intermediate representation to facilitate mixed-lingual modeling. Experimental results demonstrate that TSPC consistently outperforms existing baselines, including PhoWhisper-base, in Vietnamese-English CS ASR, achieving a significantly lower word error rate of 20.8% with reduced training resources. Furthermore, the phonetic-based two-stage architecture enables phoneme adaptation and language conversion to enhance ASR performance in complex CS Vietnamese-English ASR scenarios.

[466] Xi+: Uncertainty Supervision for Robust Speaker Embedding

Junjie Li, Kong Aik Lee, Duc-Tuan Truong, Tianchi Liu, Man-Wai Mak

Main category: cs.SD

TL;DR: Proposes xi+ architecture with temporal attention and Stochastic Variance Loss for improved speaker recognition, achieving ~10-11% performance gains.

DetailsMotivation: Current xi-vector model has suboptimal uncertainty estimation as it's implicitly trained through classification loss alone without considering temporal relationships between frames.

Method: Introduces xi+ architecture with temporal attention module for context-aware frame-level uncertainty estimation and novel Stochastic Variance Loss for explicit uncertainty supervision.

Result: Achieves consistent performance improvements of about 10% on VoxCeleb1-O set and 11% on NIST SRE 2024 evaluation set.

Conclusion: The proposed xi+ architecture with temporal attention and explicit uncertainty supervision through Stochastic Variance Loss significantly improves speaker recognition performance.

Abstract: There are various factors that can influence the performance of speaker recognition systems, such as emotion, language and other speaker-related or context-related variations. Since individual speech frames do not contribute equally to the utterance-level representation, it is essential to estimate the importance or reliability of each frame. The xi-vector model addresses this by assigning different weights to frames based on uncertainty estimation. However, its uncertainty estimation model is implicitly trained through classification loss alone and does not consider the temporal relationships between frames, which may lead to suboptimal supervision. In this paper, we propose an improved architecture, xi+. Compared to xi-vector, xi+ incorporates a temporal attention module to capture frame-level uncertainty in a context-aware manner. In addition, we introduce a novel loss function, Stochastic Variance Loss, which explicitly supervises the learning of uncertainty. Results demonstrate consistent performance improvements of about 10% on the VoxCeleb1-O set and 11% on the NIST SRE 2024 evaluation set.
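
A hedged sketch of uncertainty-aware pooling in the xi-vector family, with the context-aware twist that xi+ adds: per-frame precision estimates come from a temporal self-attention pass, and the utterance embedding is the precision-weighted mean. The exact parameterization and the Stochastic Variance Loss are omitted, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

# Context-aware frame weighting: attention supplies temporal context,
# a linear head estimates per-frame (log-)precision, and frames are
# pooled by their normalized weights.

d = 32
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
precision_head = nn.Linear(d, 1)

x = torch.randn(2, 100, d)                 # [batch, frames, dim]
ctx, _ = attn(x, x, x)                     # temporal context per frame
log_prec = precision_head(ctx)             # context-aware uncertainty estimate
w = torch.softmax(log_prec, dim=1)         # normalized frame weights
embedding = (w * x).sum(dim=1)             # precision-weighted pooling
print(embedding.shape)                     # torch.Size([2, 32])
```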

[467] DreamAudio: Customized Text-to-Audio Generation with Diffusion Models

Yi Yuan, Xubo Liu, Haohe Liu, Xiyuan Kang, Zhuo Chen, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang

Main category: cs.SD

TL;DR: DreamAudio is a customized text-to-audio generation framework that enables precise control over fine-grained acoustic characteristics using reference audio samples, allowing generation of personalized audio events while maintaining semantic alignment with text prompts.

DetailsMotivation: Existing text-to-audio models focus on semantic alignment but lack precise control over fine-grained acoustic characteristics, making it challenging for users to generate specific sound content with desired audio events.

Method: A new framework that identifies auditory information from user-provided reference concepts, using few reference audio samples to generate new audio containing specific personalized events. Two types of datasets are developed for training and testing customized systems.

Result: DreamAudio generates audio samples highly consistent with customized audio features and well-aligned with input text prompts. It also offers comparable performance in general text-to-audio tasks. A human-involved benchmark dataset is provided.

Conclusion: The proposed DreamAudio system successfully addresses the limitation of existing models by enabling customized text-to-audio generation with precise control over specific audio events while maintaining overall semantic alignment and general generation capabilities.

Abstract: With the development of large-scale diffusion-based and language-modeling-based generative models, impressive progress has been achieved in text-to-audio generation. Despite producing high-quality outputs, existing text-to-audio models mainly aim to generate semantically aligned sound and fall short on precisely controlling fine-grained acoustic characteristics of specific sounds. As a result, users that need specific sound content may find it challenging to generate the desired audio clips. In this paper, we present DreamAudio for customized text-to-audio generation (CTTA). Specifically, we introduce a new framework that is designed to enable the model to identify auditory information from user-provided reference concepts for audio generation. Given a few reference audio samples containing personalized audio events, our system can generate new audio samples that include these specific events. In addition, two types of datasets are developed for training and testing the customized systems. The experiments show that the proposed model, DreamAudio, generates audio samples that are highly consistent with the customized audio features and aligned well with the input text prompts. Furthermore, DreamAudio offers comparable performance in general text-to-audio tasks. We also provide a human-involved dataset containing audio events from real-world CTTA cases as the benchmark for customized generation tasks.

[468] MeanFlow-Accelerated Multimodal Video-to-Audio Synthesis via One-Step Generation

Xiaoran Yang, Jianxuan Yang, Xinyue Guo, Haoyu Wang, Ningning Pan, Gongping Huang

Main category: cs.SD

TL;DR: MeanFlow-accelerated model enables one-step audio generation from silent videos, significantly improving inference speed while maintaining audio quality and synchronization.

DetailsMotivation: Address the efficiency bottleneck in existing video-to-audio synthesis methods that rely on iterative sampling processes, leading to slow inference speeds.

Method: Introduces MeanFlow model that characterizes flow fields using average velocity instead of instantaneous velocity, enabling one-step generation. Uses scalar rescaling mechanism to balance conditional and unconditional predictions when applying classifier-free guidance.

Result: Significantly accelerates multimodal video-to-audio synthesis while preserving audio quality, semantic alignment, and temporal synchronization. Also works effectively for text-to-audio synthesis tasks.

Conclusion: MeanFlow incorporation improves inference speed without compromising perceptual quality on both video-to-audio and text-to-audio synthesis tasks, effectively solving the efficiency-quality trade-off.

Abstract: A key challenge in synthesizing audio from silent videos is the inherent trade-off between synthesis quality and inference efficiency in existing methods. For instance, flow-matching-based models rely on modeling instantaneous velocity and therefore inherently require an iterative sampling process, leading to slow inference speeds. To address this efficiency bottleneck, we introduce a MeanFlow-accelerated model that characterizes flow fields using average velocity, enabling one-step generation and thereby significantly accelerating multimodal video-to-audio (VTA) synthesis while preserving audio quality, semantic alignment, and temporal synchronization. Furthermore, a scalar rescaling mechanism is employed to balance conditional and unconditional predictions when classifier-free guidance (CFG) is applied, effectively mitigating CFG-induced distortions in one-step generation. Since the audio synthesis network is jointly trained with multimodal conditions, we further evaluate it on the text-to-audio (TTA) synthesis task. Experimental results demonstrate that incorporating MeanFlow into the network significantly improves inference speed without compromising perceptual quality on both VTA and TTA synthesis tasks.
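
One-step generation with an average-velocity field fits in a few lines. The sketch below assumes a trained network u(z, r, t) predicting the average velocity over [r, t] and one common time convention (t=1 is noise, t=0 is data); the toy lambda stands in for the trained model.

```python
import torch

# Hedged sketch of MeanFlow-style one-step sampling: the average velocity
# over the whole interval gives the endpoint directly, with no ODE solve.

def one_step_sample(u, shape, device="cpu"):
    z1 = torch.randn(shape, device=device)   # start from Gaussian noise
    # z0 = z1 - (1 - 0) * u(z1, r=0, t=1): jump the full interval at once.
    return z1 - u(z1, torch.zeros(1), torch.ones(1))

# Toy stand-in for a trained model: pushes samples toward the origin.
u_toy = lambda z, r, t: 0.9 * z
print(one_step_sample(u_toy, (2, 4)).shape)  # torch.Size([2, 4])
```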

[469] FireRedChat: A Pluggable, Full-Duplex Voice Interaction System with Cascaded and Semi-Cascaded Implementations

Junjie Chen, Yao Hu, Junjie Li, Kangyue Li, Kun Liu, Wenpeng Li, Xu Li, Ziyuan Li, Feiyu Shen, Xu Tang, Manzhen Wei, Yichen Wu, Fenglong Xie, Kaituo Xu, Kun Xie

Main category: cs.SD

TL;DR: A complete full-duplex voice interaction system with turn-taking controller, interaction module, and dialogue manager that enables simultaneous speaking with controllable barge-in using streaming personalized VAD and semantic end-of-turn detection.

DetailsMotivation: Existing solutions are either difficult end-to-end systems or modular pipelines with non-open components, limiting holistic optimization and control for full-duplex voice interaction.

Method: Developed a modular system with streaming personalized VAD for barge-in suppression, semantic end-of-turn detection, and implemented cascaded/semi-cascaded variants using internal models. Includes dialogue manager for tool invocation and context management.

Result: Achieved fewer false interruptions, more accurate semantic end detection, lower latency approaching industrial systems, with semi-cascaded variant capturing emotional cues and yielding more coherent responses.

Conclusion: The system enables robust, natural, real-time full-duplex interaction with improved control accuracy and efficiency, demonstrating practical implementation for lifelike assistants and customer service.

Abstract: Full-duplex voice interaction allows users and agents to speak simultaneously with controllable barge-in, enabling lifelike assistants and customer service. Existing solutions are either end-to-end, difficult to design and hard to control, or modular pipelines governed by turn-taking controllers that ease upgrades and per-module optimization; however, prior modular frameworks depend on non-open components and external providers, limiting holistic optimization. In this work, we present a complete, practical full-duplex voice interaction system comprising a turn-taking controller, an interaction module, and a dialogue manager. The controller integrates streaming personalized VAD (pVAD) to suppress false barge-ins from noise and non-primary speakers, precisely timestamp primary-speaker segments, and explicitly enable primary-speaker barge-ins; a semantic end-of-turn detector improves stop decisions. It upgrades heterogeneous half-duplex pipelines (cascaded, semi-cascaded, and speech-to-speech) to full duplex. Using internal models, we implement cascaded and semi-cascaded variants; the semi-cascaded one captures emotional and paralinguistic cues, yields more coherent responses, lowers latency and error propagation, and improves robustness. A dialogue manager extends capabilities via tool invocation and context management. We also propose three system-level metrics (barge-in, end-of-turn detection accuracy, and end-to-end latency) to assess naturalness, control accuracy, and efficiency. Experiments show fewer false interruptions, more accurate semantic ends, and lower latency approaching industrial systems, enabling robust, natural, real-time full-duplex interaction. Demos: https://fireredteam.github.io/demos/firered_chat.
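
A toy sketch of the turn-taking controller's frame-level decision: personalized VAD gates out noise and non-primary speakers, and a semantic end-of-turn score decides when to respond. Thresholds and field names are illustrative assumptions, not values from the paper.

```python
# Per-frame turn-taking decision: barge in only for the enrolled primary
# speaker, and respond only when the semantic end-of-turn score is high.

def turn_taking_step(frame: dict, agent_speaking: bool) -> str:
    is_speech = frame["pvad_speech_prob"] > 0.5
    is_primary = frame["speaker_sim"] > 0.7        # matches enrolled user
    if agent_speaking and is_speech and is_primary:
        return "barge_in"                          # interrupt agent playback
    if not agent_speaking and frame["eot_semantic_prob"] > 0.8:
        return "end_of_turn"                       # user finished; respond
    return "continue"

frame = {"pvad_speech_prob": 0.9, "speaker_sim": 0.85, "eot_semantic_prob": 0.1}
print(turn_taking_step(frame, agent_speaking=True))  # -> barge_in
```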

[470] The First Voice Timbre Attribute Detection Challenge

Liping Chen, Jinghao He, Zhengyan Sheng, Kong Aik Lee, Zhen-Hua Ling

Main category: cs.SD

TL;DR: First voice timbre attribute detection challenge at NCMMSC 2025 focusing on explainable voice timbre comparison using VCTK-RVA dataset.

DetailsMotivation: To advance research in voice timbre explainability and create a benchmark for comparing speech utterance intensity in specific timbre descriptor dimensions.

Method: Participants developed their own systems and submitted outputs to organizers for evaluation. Six teams participated with five providing methodological descriptions.

Result: Challenge successfully conducted with six participating teams submitting their detection outputs for evaluation.

Conclusion: The challenge established a framework for voice timbre attribute detection and provided a platform for comparing different methodologies in this emerging research area.

Abstract: The first voice timbre attribute detection challenge is featured in a special session at NCMMSC 2025. It focuses on the explainability of voice timbre and compares the intensity of two speech utterances in a specified timbre descriptor dimension. The evaluation was conducted on the VCTK-RVA dataset. Participants developed their systems and submitted their outputs to the organizer, who evaluated the performance and sent feedback to them. Six teams submitted their outputs, with five providing descriptions of their methodologies.

[471] AnalysisGNN: Unified Music Analysis with Graph Neural Networks

Emmanouil Karystinaios, Johannes Hentschel, Markus Neuwirth, Gerhard Widmer

Main category: cs.SD

TL;DR: AnalysisGNN is a graph neural network framework that integrates multiple heterogeneous music analysis datasets using data-shuffling, weighted multi-task loss, and logit fusion, with a specialized non-chord-tone prediction module to improve label consistency.

DetailsMotivation: Current computational music analysis approaches are typically domain-specific and lack integration across different analytical tasks and datasets, creating limitations for comprehensive score analysis.

Method: Uses graph neural networks with data-shuffling strategy, custom weighted multi-task loss, logit fusion between task-specific classifiers, and a non-chord-tone prediction module to filter out passing/non-functional notes.

Result: Achieves performance comparable to traditional static-dataset approaches while demonstrating increased resilience to domain shifts and annotation inconsistencies across multiple heterogeneous corpora.

Conclusion: AnalysisGNN provides an effective framework for integrating diverse music analysis datasets and tasks, offering improved consistency and domain robustness compared to specialized single-domain approaches.

Abstract: Recent years have seen a boom in computational approaches to music analysis, yet each one is typically tailored to a specific analytical domain. In this work, we introduce AnalysisGNN, a novel graph neural network framework that leverages a data-shuffling strategy with a custom weighted multi-task loss and logit fusion between task-specific classifiers to integrate heterogeneously annotated symbolic datasets for comprehensive score analysis. We further integrate a Non-Chord-Tone prediction module, which identifies and excludes passing and non-functional notes from all tasks, thereby improving the consistency of label signals. Experimental evaluations demonstrate that AnalysisGNN achieves performance comparable to traditional static-dataset approaches, while showing increased resilience to domain shifts and annotation inconsistencies across multiple heterogeneous corpora.
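
The data-shuffling setup with a weighted multi-task loss can be sketched as per-task classifier heads whose losses are summed only where a corpus provides labels. Task names and weights below are made up for illustration; the logit-fusion step is omitted.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a weighted multi-task loss over heterogeneously
# annotated batches: corpora that lack a task simply contribute nothing.

def multitask_loss(logits: dict, labels: dict, weights: dict):
    total = 0.0
    for task, y in labels.items():
        if y is None:                 # this corpus lacks labels for the task
            continue
        total = total + weights[task] * F.cross_entropy(logits[task], y)
    return total

logits = {"key": torch.randn(8, 24), "chord": torch.randn(8, 48)}
labels = {"key": torch.randint(0, 24, (8,)), "chord": None}
print(multitask_loss(logits, labels, {"key": 1.0, "chord": 0.5}))
```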

[472] Continuous Audio Language Models

Simon Rouard, Manu Orsini, Axel Roebel, Neil Zeghidour, Alexandre Défossez

Main category: cs.SD

TL;DR: CALM introduces continuous audio language models that avoid lossy compression, achieving higher audio quality at lower computational cost compared to discrete token approaches.

DetailsMotivation: Discrete audio tokens from lossy codecs create a trade-off between audio fidelity and computational cost - higher quality requires more tokens, increasing computational burden.

Method: Uses a large Transformer backbone to produce contextual embeddings at each timestep, then conditions an MLP to generate continuous audio frames through consistency modeling with an audio VAE.

Result: CALM achieves improved efficiency and fidelity over state-of-the-art discrete audio language models for both speech and music generation.

Conclusion: Continuous audio modeling facilitates lightweight, high-quality audio generation by avoiding the limitations of lossy compression in discrete token approaches.

Abstract: Audio Language Models (ALM) have emerged as the dominant paradigm for speech and music generation by representing audio as sequences of discrete tokens. Yet, unlike text tokens, which are invertible, audio tokens are extracted from lossy codecs with a limited bitrate. As a consequence, increasing audio quality requires generating more tokens, which imposes a trade-off between fidelity and computational cost. We address this issue by studying Continuous Audio Language Models (CALM). These models instantiate a large Transformer backbone that produces a contextual embedding at every timestep. This sequential information then conditions an MLP that generates the next continuous frame of an audio VAE through consistency modeling. By avoiding lossy compression, CALM achieves higher quality at lower computational cost than their discrete counterparts. Experiments on speech and music demonstrate improved efficiency and fidelity over state-of-the-art discrete audio language models, facilitating lightweight, high-quality audio generation. Samples are available at https://continuous-audio-language-models.github.io
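
The layout described above, one contextual embedding per timestep feeding a small generative head, can be sketched as follows; the consistency-modeling objective and the audio VAE itself are omitted, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

# Minimal sketch of the CALM layout: Transformer backbone over past frames,
# MLP head mapping the last contextual embedding to the next continuous
# VAE frame. The real head is a conditional consistency model.

d_model, latent_dim = 256, 64

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
frame_head = nn.Sequential(
    nn.Linear(d_model, 512), nn.GELU(), nn.Linear(512, latent_dim),
)

frames = torch.randn(1, 100, d_model)       # embedded past audio frames
ctx = backbone(frames)                      # contextual embedding per step
next_frame = frame_head(ctx[:, -1])         # predict next continuous frame
print(next_frame.shape)                     # torch.Size([1, 64])
```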

[473] Benchmarking Music Autotagging with MGPHot Expert Annotations vs. Generic Tag Datasets

Pedro Ramoneda, Pablo Alonso-Jiménez, Sergio Oramas, Xavier Serra, Dmitry Bogdanov

Main category: cs.SD

TL;DR: A new benchmarking dataset for music autotagging based on MGPHot with expert annotations, audio URLs, standardized splits, and precomputed representations for 7 state-of-the-art models.

DetailsMotivation: Music autotagging is valuable for applications but lacks standardized benchmarks with expert musicological annotations. The MGPHot dataset has expert annotations but lacks audio and evaluation setups.

Method: Curated YouTube URLs with retrievable audio, proposed train/val/test splits, and precomputed representations for 7 state-of-the-art models to create a standardized benchmarking framework.

Result: Evaluated models on both MGPHot and standard reference tag datasets, revealing key differences between expert and generic tag annotations.

Conclusion: Provides an advanced benchmarking framework for future music understanding research, enabling better comparison and insights through expert annotations.

Abstract: Music autotagging aims to automatically assign descriptive tags, such as genre, mood, or instrumentation, to audio recordings. Due to its challenges, diversity of semantic descriptions, and practical value in various applications, it has become a common downstream task for evaluating the performance of general-purpose music representations learned from audio data. We introduce a new benchmarking dataset based on the recently published MGPHot dataset, which includes expert musicological annotations, allowing for additional insights and comparisons with results obtained on common generic tag datasets. While MGPHot annotations have been shown to be useful for computational musicology, the original dataset neither includes audio nor provides evaluation setups for its use as a standardized autotagging benchmark. To address this, we provide a curated set of YouTube URLs with retrievable audio, a train/val/test split for standardized evaluation, and precomputed representations for seven state-of-the-art models. Using these resources, we evaluated these models on MGPHot and standard reference tag datasets, highlighting key differences between expert and generic tag annotations. Altogether, our contributions provide a more advanced benchmarking framework for future research in music understanding.

[474] Synthetic data enables context-aware bioacoustic sound event detection

Benjamin Hoffman, David Robinson, Marius Miron, Vittorio Baglione, Daniela Canestrari, Damian Elias, Eva Trapote, Felix Effenberger, Maddie Cusimano, Masato Hagiwara, Olivier Pietquin

Main category: cs.SD

TL;DR: A methodology for training foundation models to enhance in-context learning for bioacoustic signal processing using synthetic data and domain randomization, achieving 64% improvement over training-free methods.

DetailsMotivation: To improve few-shot bioacoustic sound event detection capabilities for ecologists and ethologists by developing training-free tools that can handle diverse acoustic environments.

Method: Uses synthetically generated training data with domain-randomization pipeline to create diverse acoustic scenes with strong temporal labels. Generated over 8.8k hours of labeled audio and trained a transformer-based query-by-example model. Also created a public benchmark of 13 diverse few-shot bioacoustics tasks.

Result: The model outperforms previously published methods and shows 64% relative improvement over other training-free methods. Performance gains attributed to increased model size, data scale, and algorithmic improvements.

Conclusion: The approach successfully enhances in-context learning for bioacoustic applications, providing ecologists with an effective training-free tool for sound event detection, with the trained model made available via API.

Abstract: We propose a methodology for training foundation models that enhances their in-context learning capabilities within the domain of bioacoustic signal processing. We use synthetically generated training data, introducing a domain-randomization-based pipeline that constructs diverse acoustic scenes with temporally strong labels. We generate over 8.8 thousand hours of strongly-labeled audio and train a query-by-example, transformer-based model to perform few-shot bioacoustic sound event detection. Our second contribution is a public benchmark of 13 diverse few-shot bioacoustics tasks. Our model outperforms previously published methods, and improves relative to other training-free methods by 64%. We demonstrate that this is due to increases in model size and data scale, as well as algorithmic improvements. We make our trained model available via an API to provide ecologists and ethologists with a training-free tool for bioacoustic sound event detection.
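
A domain-randomized scene generator in this spirit is easy to sketch: drop random vocalization clips into a background at random times and gains, and keep the strong (onset, offset) labels. Parameter ranges below are illustrative assumptions, not the pipeline's actual settings.

```python
import numpy as np

# Toy scene generator: mix event clips into a background and record
# temporally strong labels in seconds.

def make_scene(background, events, sr=16_000, n_events=3, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    scene, labels = background.copy(), []
    for _ in range(n_events):
        ev = events[rng.integers(len(events))]
        gain = rng.uniform(0.3, 1.0)
        start = rng.integers(0, len(scene) - len(ev))
        scene[start:start + len(ev)] += gain * ev
        labels.append((start / sr, (start + len(ev)) / sr))  # strong labels
    return scene, labels

bg = 0.01 * np.random.default_rng(0).standard_normal(16_000 * 5)
chirp = np.sin(2 * np.pi * 3_000 * np.linspace(0, 0.2, 3_200))
scene, labels = make_scene(bg, [chirp], rng=np.random.default_rng(1))
print(labels)
```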

[475] Robust detection of overlapping bioacoustic sound events

Louis Mahon, Benjamin Hoffman, Logan James, Maddie Cusimano, Masato Hagiwara, Sarah C Woolley, Felix Effenberger, Sara Keen, Jen-Yu Liu, Olivier Pietquin

Main category: cs.SD

TL;DR: Voxaboxen is a novel onset-based bioacoustic sound event detection method that outperforms standard frame-based approaches, especially for overlapping vocalizations, achieving state-of-the-art results across multiple datasets.

DetailsMotivation: Standard bioacoustic sound event detection methods struggle with overlapping events, which are common in real-world scenarios like ethology and ecology. Current frame-based, multi-label approaches are inadequate for handling frequent vocalization overlaps.

Method: Voxaboxen uses an onset-based approach inspired by computer vision object detection. It predicts vocalization starts and ends with duration information, then fuses the two sets of bounding boxes using graph-matching. The method leverages self-supervised audio encoders.

Result: Voxaboxen achieves state-of-the-art results on seven existing datasets and a new zebra finch dataset specifically designed for testing overlapping vocalizations. The method shows robust performance even with frequent vocalization overlaps.

Conclusion: The onset-based Voxaboxen approach significantly outperforms traditional frame-based methods for bioacoustic sound event detection, particularly in handling overlapping events, making it valuable for real-world applications in ecology and conservation.

Abstract: We propose a method for accurately detecting bioacoustic sound events that is robust to overlapping events, a common issue in domains such as ethology, ecology and conservation. While standard methods employ a frame-based, multi-label approach, we introduce an onset-based detection method which we name Voxaboxen. It takes inspiration from object detection methods in computer vision, but simultaneously takes advantage of recent advances in self-supervised audio encoders. For each time window, Voxaboxen predicts whether it contains the start of a vocalization and how long the vocalization is. It also does the same in reverse, predicting whether each window contains the end of a vocalization, and how long ago it started. The two resulting sets of bounding boxes are then fused using a graph-matching algorithm. We also release a new dataset designed to measure performance on detecting overlapping vocalizations. This consists of recordings of zebra finches annotated with temporally-strong labels and showing frequent overlaps. We test Voxaboxen on seven existing datasets and on our new dataset. We compare Voxaboxen to natural baselines and existing sound event detection methods and demonstrate state-of-the-art results. Further experiments show that improvements are robust to frequent vocalization overlap.
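
The fusion step can be approximated with bipartite matching on temporal IoU between the forward (onset-anchored) and backward (offset-anchored) boxes; the paper uses a graph-matching algorithm, and the averaging rule here is an assumption for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hedged sketch of box fusion: match the two box sets by temporal IoU
# (Hungarian assignment) and average matched pairs.

def iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def fuse(fwd, bwd, min_iou=0.3):
    cost = np.array([[1.0 - iou(f, b) for b in bwd] for f in fwd])
    rows, cols = linear_sum_assignment(cost)
    return [((fwd[r][0] + bwd[c][0]) / 2, (fwd[r][1] + bwd[c][1]) / 2)
            for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= min_iou]

fwd = [(0.10, 0.42), (1.05, 1.60)]   # (start, end) from the onset head
bwd = [(0.12, 0.40), (1.00, 1.55)]   # (start, end) from the offset head
print(fuse(fwd, bwd))
```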

[476] Whisper Smarter, not Harder: Adversarial Attack on Partial Suppression

Zheng Jie Wong, Bingquan Shen

Main category: cs.SD

TL;DR: The paper investigates adversarial attacks on ASR models, explores methods to increase attack imperceptibility through partial suppression, and proposes low-pass filtering as an effective defense.

DetailsMotivation: ASR models are widely deployed but vulnerable to adversarial attacks that can suppress or disrupt their output, raising concerns about their robustness and security in real-world applications.

Method: The researchers investigate and verify the robustness of existing adversarial attacks, explore techniques to increase imperceptibility by relaxing optimization objectives from complete to partial suppression, and examine potential defense mechanisms.

Result: The study demonstrates that relaxing the attack objective to partial suppression significantly decreases the perceptibility of adversarial attacks while maintaining effectiveness. A low-pass filter defense is shown to be potentially effective against these attacks.

Conclusion: Adversarial attacks on ASR models can be made more imperceptible through partial suppression techniques, and low-pass filtering serves as a viable defense strategy, highlighting the need for continued research in both attack and defense mechanisms for ASR system security.

Abstract: Currently, Automatic Speech Recognition (ASR) models are deployed in an extensive range of applications. However, recent studies have demonstrated the possibility of adversarial attacks on these models which could potentially suppress or disrupt model output. We investigate and verify the robustness of these attacks and explore whether it is possible to increase their imperceptibility. We additionally find that by relaxing the optimisation objective from complete suppression to partial suppression, we can further increase the imperceptibility of the attack. We also explore possible defences against these attacks and show that a low-pass filter could potentially serve as an effective defence.
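
The low-pass filter defense is straightforward to prototype: attenuate high-frequency content, where adversarial perturbations often concentrate, before the audio reaches the ASR model. The 4 kHz cutoff below is an illustrative choice, not a value from the paper.

```python
import numpy as np
from scipy.signal import butter, sosfilt

# Sketch of a low-pass filter defense for ASR inputs.

def lowpass_defense(audio: np.ndarray, sr: int = 16_000,
                    cutoff_hz: float = 4_000.0, order: int = 6) -> np.ndarray:
    sos = butter(order, cutoff_hz, btype="low", fs=sr, output="sos")
    return sosfilt(sos, audio)

sr = 16_000
t = np.linspace(0, 1, sr, endpoint=False)
speech_like = np.sin(2 * np.pi * 300 * t)             # in-band content survives
perturbation = 0.05 * np.sin(2 * np.pi * 7_000 * t)   # high-band noise is cut
cleaned = lowpass_defense(speech_like + perturbation, sr)
print(cleaned.shape)
```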

[477] Learning and composing of classical music using restricted Boltzmann machines

Mutsumi Kobayashi, Hiroshi Watanabe

Main category: cs.SD

TL;DR: Using restricted Boltzmann machines (RBM) to analyze and compose music in J.S. Bach’s style, providing interpretable insights into musical characteristics.

DetailsMotivation: Existing machine learning models for music composition are too complex to understand how they capture a composer's style, so a simpler model is needed for analysis.

Method: Trained a restricted Boltzmann machine (RBM) on J.S. Bach’s music; the RBM’s simple structure makes it possible to inspect its internal states after learning.

Result: The learned RBM was able to successfully compose music, demonstrating its capability to capture and reproduce Bach’s musical style.

Conclusion: RBMs provide an interpretable alternative to complex models for music composition and analysis, allowing better understanding of how musical styles are learned.

Abstract: Recently, software has been developed that uses machine learning to mimic the style of a particular composer, such as J. S. Bach. However, since such software often adopts machine learning models with complex structures, it is difficult to analyze how the software understands the characteristics of the composer’s music. In this study, we adopted J. S. Bach’s music for training of a restricted Boltzmann machine (RBM). Since the structure of RBMs is simple, it allows us to investigate the internal states after learning. We found that the learned RBM is able to compose music.
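
For readers unfamiliar with RBMs, a single contrastive-divergence (CD-1) update is compact enough to show in full. In a music setting the binary visible vector would be a flattened piano-roll slice; sizes and the learning rate below are illustrative, not the paper's configuration.

```python
import numpy as np

# One CD-1 update for a binary RBM: up-pass, reconstruction, up-pass again,
# then update weights by the positive-minus-negative phase statistics.

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 88, 64, 0.01          # e.g. 88 piano keys per time slice
W = 0.01 * rng.standard_normal((n_vis, n_hid))
a, b = np.zeros(n_vis), np.zeros(n_hid)  # visible / hidden biases

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0):
    global W, a, b
    h0 = sigmoid(v0 @ W + b)                         # up: P(h | v0)
    h_sample = (rng.random(n_hid) < h0).astype(float)
    v1 = sigmoid(h_sample @ W.T + a)                 # down: reconstruction
    h1 = sigmoid(v1 @ W + b)                         # up again
    W += lr * (np.outer(v0, h0) - np.outer(v1, h1))  # positive - negative phase
    a += lr * (v0 - v1)
    b += lr * (h0 - h1)

cd1_step((rng.random(n_vis) < 0.1).astype(float))    # one sparse "chord"
print(W.shape)
```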

cs.LG

[478] Standard vs. Modular Sampling: Best Practices for Reliable LLM Unlearning

Praveen Bushipaka, Lucia Passaro, Tommaso Cucinotta

Main category: cs.LG

TL;DR: Systematic evaluation of LLM unlearning practices reveals limitations of single neighbor sets and standard sampling methods, proposing diverse neighbor sets and modular entity-level unlearning as best practices.

DetailsMotivation: Existing LLM unlearning benchmarks use simplified single neighbor sets and standard sampling approaches that don't reflect real-world data complexities, requiring critical examination of these practices.

Method: Systematic evaluation of common unlearning practices, analysis of sampling efficiency, and development of Modular Entity-Level Unlearning (MELU) strategy as an alternative to cyclic sampling.

Result: Found that single neighbor sets are suboptimal, standard 1:1 sampling is inefficient and yields poor results, while diverse neighbor sets and modular approaches provide better balance between forget efficacy and model utility.

Conclusion: Proposed best practices include incorporating diverse neighbor sets, avoiding standard 1:1 sampling, and using modular entity-level unlearning with robust algorithms for effective and stable unlearning performance.

Abstract: A conventional LLM Unlearning setting consists of two subsets, “forget” and “retain”, with the objective of removing the undesired knowledge in the forget set while preserving the remaining knowledge in the retain set. In privacy-focused unlearning research, the retain set is often further divided into neighbor sets, containing data either directly or indirectly connected to the forget targets, and is augmented by a general-knowledge set. A common practice in existing benchmarks is to employ only a single neighbor set plus general knowledge, which fails to reflect the real-world data complexities and relationships. LLM Unlearning typically involves 1:1 sampling or cyclic iteration sampling. However, the efficacy and stability of these de facto standards have not been critically examined. In this study, we systematically evaluate these common practices. Our findings reveal that relying on a single neighbor set is suboptimal and that a standard sampling approach can obscure performance trade-offs. Based on this analysis, we propose and validate an initial set of best practices: (1) incorporation of diverse neighbor sets to balance forget efficacy and model utility, (2) avoidance of standard 1:1 sampling methods, which are inefficient and yield poor results, and (3) our proposed Modular Entity-Level Unlearning (MELU) strategy as an alternative to cyclic sampling. We demonstrate that this modular approach, combined with robust algorithms, provides a clear and stable path towards effective unlearning.

[479] Feed Two Birds with One Scone: Exploiting Function-Space Regularization for Both OOD Robustness and ID Fine-Tuning Performance

Xiang Yuan, Jun Shu, Deyu Meng, Zongben Xu

Main category: cs.LG

TL;DR: A novel robust fine-tuning method that preserves out-of-distribution robustness by constraining function space distance and adding consistency regularization, outperforming existing methods across various CLIP backbones.

DetailsMotivation: Existing robust fine-tuning methods that preserve pretrained weights, features, or logits cannot consistently improve OOD robustness across different model architectures, as they serve as poor proxies for function space optimization needed for stable OOD predictions.

Method: Proposes two regularizations: 1) constrains distance between fine-tuning and pre-trained models in function space using simulated OOD samples to preserve pre-trained model’s OOD robustness, and 2) additional consistency regularization to promote stable predictions for perturbed samples to further enhance OOD robustness.

Result: Extensive experiments show the approach consistently improves both in-distribution fine-tuning performance and out-of-distribution robustness across various CLIP backbones, outperforming existing regularization-based robust fine-tuning methods.

Conclusion: The proposed function space regularization approach effectively preserves OOD robustness during fine-tuning and enhances model stability, providing a more effective solution than weight/feature/logit preservation methods for maintaining robustness across different architectures.

Abstract: Robust fine-tuning aims to achieve competitive in-distribution (ID) performance while maintaining the out-of-distribution (OOD) robustness of a pre-trained model when transferring it to a downstream task. To this end, most robust fine-tuning methods aim to preserve the pretrained weights, features, or logits. However, we find that these methods cannot always improve OOD robustness across different model architectures. This is because OOD robustness requires the model function to produce stable predictions for inputs from downstream tasks, while existing methods may serve as a poor proxy for optimization in the function space. Based on this finding, we propose a novel regularization that constrains the distance between the fine-tuned and pre-trained models in the function space using simulated OOD samples, aiming to preserve the OOD robustness of the pre-trained model. In addition, to further enhance the OOD robustness of the fine-tuned model, we introduce a consistency regularization that promotes stable predictions for perturbed samples. Extensive experiments demonstrate that our approach consistently improves both downstream-task ID fine-tuning performance and OOD robustness across a variety of CLIP backbones, outperforming existing regularization-based robust fine-tuning methods.
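
The two regularizers combine naturally into one loss. The sketch below uses a KL divergence on logits for the function-space distance and Gaussian input perturbations for the consistency term; these are plausible choices for illustration, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Hedged sketch: ID task loss + function-space distance to the frozen
# pre-trained model on simulated OOD inputs + consistency under perturbation.

def robust_ft_loss(model, pretrained, x_id, y_id, x_ood, lam=1.0, mu=0.1):
    task = F.cross_entropy(model(x_id), y_id)                  # ID fine-tuning
    with torch.no_grad():
        ref = F.log_softmax(pretrained(x_ood), dim=-1)
    func_dist = F.kl_div(F.log_softmax(model(x_ood), dim=-1),  # function-space
                         ref, reduction="batchmean",           # distance on OOD
                         log_target=True)
    x_pert = x_ood + 0.05 * torch.randn_like(x_ood)
    consist = F.mse_loss(model(x_pert), model(x_ood))          # stable predictions
    return task + lam * func_dist + mu * consist

# Usage with toy linear "models":
m = torch.nn.Linear(8, 3)
p = torch.nn.Linear(8, 3)
loss = robust_ft_loss(m, p, torch.randn(4, 8), torch.randint(0, 3, (4,)),
                      torch.randn(4, 8))
print(loss.item())
```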

[480] Safeguarding Graph Neural Networks against Topology Inference Attacks

Jie Fu, Hong Yuan, Zhili Chen, Wendy Hui Wang

Main category: cs.LG

TL;DR: GNNs are vulnerable to topology privacy attacks that can reconstruct training graph structures from black-box model access. Existing edge-level privacy protections are inadequate, so we propose PGR - a bi-level optimization defense that generates synthetic graphs to protect topology while maintaining accuracy.

DetailsMotivation: Graph Neural Networks raise serious privacy concerns, particularly around topology privacy (confidentiality of graph structure), which is underexplored compared to edge-level privacy. Current defenses fail to protect against graph-level inference attacks.

Method: Proposed Topology Inference Attacks (TIAs) to demonstrate vulnerability, then developed Private Graph Reconstruction (PGR) - a bi-level optimization framework that iteratively generates synthetic training graphs using meta-gradients while updating the GNN model.
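
To make the bi-level idea concrete, here is a heavily simplified, self-contained sketch in which a synthetic graph (learnable features plus a soft adjacency) is optimized jointly with a toy one-layer GNN; PGR’s meta-gradient machinery and privacy accounting are omitted.

```python
# Simplified sketch of learning a synthetic training graph alongside the GNN.
import torch
import torch.nn.functional as F

n_syn, d, n_cls = 64, 16, 4
feat = torch.randn(n_syn, d, requires_grad=True)             # synthetic node features
adj_logits = torch.zeros(n_syn, n_syn, requires_grad=True)   # soft synthetic edges
W = torch.randn(d, n_cls, requires_grad=True)                # toy one-layer GNN
y_syn = torch.randint(0, n_cls, (n_syn,))

opt_graph = torch.optim.Adam([feat, adj_logits], lr=1e-2)
opt_model = torch.optim.Adam([W], lr=1e-2)

for step in range(100):
    A = torch.sigmoid(adj_logits)          # differentiable adjacency
    logits = (A @ feat) @ W                # GCN-style propagation
    loss = F.cross_entropy(logits, y_syn)
    opt_graph.zero_grad()
    opt_model.zero_grad()
    loss.backward()
    opt_graph.step()                       # update the synthetic graph...
    opt_model.step()                       # ...and the model concurrently
```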

Result: GNNs are highly susceptible to topology inference attacks. PGR significantly reduces topology leakage with minimal impact on model accuracy, outperforming existing edge-level differential privacy mechanisms.

Conclusion: Topology privacy is a critical threat in GNNs that requires specialized defenses. PGR provides an effective solution that protects graph structure confidentiality while maintaining model utility, addressing a gap in current privacy protection approaches.

Abstract: Graph Neural Networks (GNNs) have emerged as powerful models for learning from graph-structured data. However, their widespread adoption has raised serious privacy concerns. While prior research has primarily focused on edge-level privacy, a critical yet underexplored threat lies in topology privacy - the confidentiality of the graph’s overall structure. In this work, we present a comprehensive study on topology privacy risks in GNNs, revealing their vulnerability to graph-level inference attacks. To this end, we propose a suite of Topology Inference Attacks (TIAs) that can reconstruct the structure of a target training graph using only black-box access to a GNN model. Our findings show that GNNs are highly susceptible to these attacks, and that existing edge-level differential privacy mechanisms are insufficient as they either fail to mitigate the risk or severely compromise model accuracy. To address this challenge, we introduce Private Graph Reconstruction (PGR), a novel defense framework designed to protect topology privacy while maintaining model accuracy. PGR is formulated as a bi-level optimization problem, where a synthetic training graph is iteratively generated using meta-gradients, and the GNN model is concurrently updated based on the evolving graph. Extensive experiments demonstrate that PGR significantly reduces topology leakage with minimal impact on model accuracy. Our code is anonymously available at https://github.com/JeffffffFu/PGR.

[481] Neural Breadcrumbs: Membership Inference Attacks on LLMs Through Hidden State and Attention Pattern Analysis

Disha Makhija, Manoj Ghuhan Arivazhagan, Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah

Main category: cs.LG

TL;DR: The memTrace framework analyzes LLMs’ internal representations (hidden states and attention patterns) to infer training-data membership, achieving an average AUC of 0.85 and showing that internal behaviors reveal training data exposure even when outputs appear protected.

DetailsMotivation: Recent studies suggested MIAs perform only marginally better than random guessing against large language models, indicating modern pre-training may be free from privacy risks. This work explores whether examining internal representations rather than just outputs can provide additional membership inference signals.

Method: The memTrace framework follows “neural breadcrumbs” by extracting signals from transformer hidden states and attention patterns. It analyzes layer-wise representation dynamics, attention distribution characteristics, and cross-layer transition patterns to detect memorization fingerprints.
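
A hedged sketch of the feature-extraction step using Hugging Face transformers, with GPT-2 as a stand-in model; the specific per-layer statistics below (hidden-state norms, attention entropies) are illustrative assumptions rather than memTrace’s actual features.

```python
# Sketch: layer-wise hidden-state and attention statistics as MIA features.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def breadcrumb_features(text: str) -> torch.Tensor:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True, output_attentions=True)
    h_norms = torch.stack([h.norm(dim=-1).mean() for h in out.hidden_states])
    att_ent = torch.stack(
        [-(a * (a + 1e-9).log()).sum(-1).mean() for a in out.attentions]
    )
    return torch.cat([h_norms, att_ent])   # input features for an MLP classifier
```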

Result: The approach yields strong membership detection across several model families, achieving average AUC scores of 0.85 on popular MIA benchmarks, significantly outperforming traditional loss-based approaches.

Conclusion: Internal model behaviors can reveal aspects of training data exposure even when output-based signals appear protected, highlighting the need for further research into membership privacy and development of more robust privacy-preserving training techniques for LLMs.

Abstract: Membership inference attacks (MIAs) reveal whether specific data was used to train machine learning models, serving as important tools for privacy auditing and compliance assessment. Recent studies have reported that MIAs perform only marginally better than random guessing against large language models, suggesting that modern pre-training approaches with massive datasets may be free from privacy leakage risks. Our work offers a complementary perspective to these findings by exploring how examining LLMs’ internal representations, rather than just their outputs, may provide additional insights into potential membership inference signals. Our framework, memTrace, follows what we call “neural breadcrumbs”, extracting informative signals from transformer hidden states and attention patterns as they process candidate sequences. By analyzing layer-wise representation dynamics, attention distribution characteristics, and cross-layer transition patterns, we detect potential memorization fingerprints that traditional loss-based approaches may not capture. This approach yields strong membership detection across several model families, achieving average AUC scores of 0.85 on popular MIA benchmarks. Our findings suggest that internal model behaviors can reveal aspects of training data exposure even when output-based signals appear protected, highlighting the need for further research into membership privacy and the development of more robust privacy-preserving training techniques for large language models.

[482] Calibrated Recommendations with Contextual Bandits

Diego Feijer, Himan Abdollahpouri, Sanket Gupta, Alexander Clare, Yuxiao Wen, Todd Wasson, Maria Dimakopoulou, Zahra Nazari, Kyle Kretschman, Mounia Lalmas

Main category: cs.LG

TL;DR: Spotify uses contextual bandits to dynamically balance content types (music, podcasts, audiobooks) on the Home page based on user context and preferences, improving engagement especially for underrepresented content.

DetailsMotivation: Historical data is heavily skewed toward music, making it challenging to deliver balanced and personalized content mix. Users' preferences vary by time, day, and device.

Method: Proposed calibration method using contextual bandits to dynamically learn each user’s optimal content type distribution based on context and preferences, rather than relying on historical averages.
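
Spotify’s production system is not public, but a textbook LinUCB-style contextual bandit illustrates the mechanism of context-dependent content-type selection; the arm names and context features here are assumptions.

```python
# LinUCB-style sketch: pick a content type from context, update from reward.
import numpy as np

arms = ["music", "podcast", "audiobook"]
d = 4                                     # e.g. hour bucket, weekday, device, ...
A = {a: np.eye(d) for a in arms}          # per-arm design matrix
b = {a: np.zeros(d) for a in arms}        # per-arm reward accumulator

def choose(ctx: np.ndarray, alpha: float = 1.0) -> str:
    scores = {}
    for a in arms:
        A_inv = np.linalg.inv(A[a])
        theta = A_inv @ b[a]              # ridge estimate of arm payoff
        scores[a] = ctx @ theta + alpha * np.sqrt(ctx @ A_inv @ ctx)
    return max(scores, key=scores.get)

def update(a: str, ctx: np.ndarray, reward: float) -> None:
    A[a] += np.outer(ctx, ctx)
    b[a] += reward * ctx
```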

Result: Both offline and online results demonstrate improved precision and user engagement with Spotify Home page, particularly with under-represented content types like podcasts.

Conclusion: Contextual bandit approach successfully adapts to users’ varying interests across different contexts, outperforming traditional calibration methods that use historical averages.

Abstract: Spotify’s Home page features a variety of content types, including music, podcasts, and audiobooks. However, historical data is heavily skewed toward music, making it challenging to deliver a balanced and personalized content mix. Moreover, users’ preferences toward different content types may vary depending on the time of day, the day of week, or even the device they use. We propose a calibration method that leverages contextual bandits to dynamically learn each user’s optimal content type distribution based on their context and preferences. Unlike traditional calibration methods that rely on historical averages, our approach boosts engagement by adapting to how users’ interests in different content types vary across contexts. Both offline and online results demonstrate improved precision and user engagement with the Spotify Home page, in particular with under-represented content types such as podcasts.

[483] PLanTS: Periodicity-aware Latent-state Representation Learning for Multivariate Time Series

Jia Wang, Xiao Wang, Chi Zhang

Main category: cs.LG

TL;DR: PLanTS is a periodicity-aware self-supervised learning framework for multivariate time series that models irregular latent states and their transitions using multi-granularity patching, contrastive learning, and next-transition prediction.

DetailsMotivation: Existing SSL methods for multivariate time series neglect intrinsic periodic structures and fail to capture dynamic evolution of latent states, while dealing with challenges like high dimensionality, limited labeled data, and non-stationary nature.

Method: Uses period-aware multi-granularity patching mechanism, generalized contrastive loss for instance-level and state-level similarities, and a next-transition prediction pretext task to capture temporal dynamics and future state evolution.
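
A small sketch of period-aware patching, assuming the dominant period is estimated with an FFT (the paper’s exact mechanism may differ):

```python
# Sketch: cut a series into patches whose length follows its dominant period.
import numpy as np

def dominant_period(x: np.ndarray) -> int:
    spec = np.abs(np.fft.rfft(x - x.mean()))
    k = int(np.argmax(spec[1:])) + 1       # strongest non-DC frequency bin
    return max(1, len(x) // k)

def period_patches(x: np.ndarray, scales=(1, 2, 4)) -> list:
    p = dominant_period(x)
    patches = []
    for s in scales:                       # multi-granularity patching
        size = p * s
        usable = (len(x) // size) * size
        patches.append(x[:usable].reshape(-1, size))
    return patches
```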

Result: PLanTS consistently improves representation quality over existing SSL methods and demonstrates superior runtime efficiency compared to DTW-based methods across various downstream tasks.

Conclusion: The proposed periodicity-aware framework effectively addresses limitations of current SSL approaches by explicitly modeling irregular latent states and transitions, leading to better representations for multivariate time series analysis.

Abstract: Multivariate time series (MTS) are ubiquitous in domains such as healthcare, climate science, and industrial monitoring, but their high dimensionality, limited labeled data, and non-stationary nature pose significant challenges for conventional machine learning methods. While recent self-supervised learning (SSL) approaches mitigate label scarcity by data augmentations or time-point-based contrastive strategies, they neglect the intrinsic periodic structure of MTS and fail to capture the dynamic evolution of latent states. We propose PLanTS, a periodicity-aware self-supervised learning framework that explicitly models irregular latent states and their transitions. We first design a period-aware multi-granularity patching mechanism and a generalized contrastive loss to preserve both instance-level and state-level similarities across multiple temporal resolutions. To further capture temporal dynamics, we design a next-transition prediction pretext task that encourages representations to encode predictive information about future state evolution. We evaluate PLanTS across a wide range of downstream tasks, including multi-class and multi-label classification, forecasting, trajectory tracking, and anomaly detection. PLanTS consistently improves the representation quality over existing SSL methods and demonstrates superior runtime efficiency compared to DTW-based methods.

[484] MCIGLE: Multimodal Exemplar-Free Class-Incremental Graph Learning

Haochen You, Baojing Liu

Main category: cs.LG

TL;DR: MCIGLE is a novel framework for exemplar-free class-incremental learning on multimodal graph data that addresses catastrophic forgetting, distribution bias, and memory limitations through feature alignment and recursive least squares.

DetailsMotivation: Existing methods struggle with challenges like catastrophic forgetting, distribution bias, memory limits, and weak generalization when learning from multimodal graph-structured data without storing old class examples.

Method: Extracts and aligns multimodal graph features, applies Concatenated Recursive Least Squares for knowledge retention, and uses multi-channel processing to balance accuracy and memory preservation.

Result: Experiments on public datasets validate the framework’s effectiveness and generalizability in handling incremental learning tasks.

Conclusion: MCIGLE provides an effective solution for exemplar-free class-incremental learning on multimodal graph data, demonstrating strong performance in knowledge retention and generalization.

Abstract: Exemplar-free class-incremental learning enables models to learn new classes over time without storing data from old ones. As multimodal graph-structured data becomes increasingly prevalent, existing methods struggle with challenges like catastrophic forgetting, distribution bias, memory limits, and weak generalization. We propose MCIGLE, a novel framework that addresses these issues by extracting and aligning multimodal graph features and applying Concatenated Recursive Least Squares for effective knowledge retention. Through multi-channel processing, MCIGLE balances accuracy and memory preservation. Experiments on public datasets validate its effectiveness and generalizability.

[485] STL-based Optimization of Biomolecular Neural Networks for Regression and Control

Eric Palanques-Tost, Hanna Krasowski, Murat Arcak, Ron Weiss, Calin Belta

Main category: cs.LG

TL;DR: Training Biomolecular Neural Networks (BNNs) using Signal Temporal Logic (STL) specifications instead of target data, enabling gradient-based optimization for regression and control tasks in biological systems.

DetailsMotivation: BNNs have universal function approximation capabilities but lack target data for training, making traditional training methods challenging for biological applications.

Method: Leveraging Signal Temporal Logic (STL) specifications to define training objectives, using STL’s quantitative semantics for gradient-based optimization of BNN weights, with applications to regression and feedback control tasks.
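
The key enabler is that STL robustness becomes differentiable once the hard min/max in its quantitative semantics are smoothed. A minimal sketch, assuming a logsumexp-based soft minimum for the "always" operator G:

```python
# Sketch: differentiable robustness of "G (signal >= threshold)".
import torch

def soft_min(x: torch.Tensor, beta: float = 10.0) -> torch.Tensor:
    return -torch.logsumexp(-beta * x, dim=-1) / beta   # smooth min over time

def always_ge(signal: torch.Tensor, threshold: float) -> torch.Tensor:
    return soft_min(signal - threshold)                 # robustness of G(s >= c)

traj = torch.randn(100, requires_grad=True)             # e.g. simulated BNN output
rho = always_ge(traj, 0.0)
rho.backward()                                          # gradients flow to weights
```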

Result: Numerical experiments demonstrate that STL-based learning efficiently solves regression tasks (acting as reporters of dysregulated states) and control tasks (reducing inflammation while avoiding adverse responses to infections).

Conclusion: STL-based learning provides an effective approach for training BNNs without requiring target data, enabling their application in biological systems for both monitoring and control purposes.

Abstract: Biomolecular Neural Networks (BNNs), artificial neural networks with biologically synthesizable architectures, achieve universal function approximation capabilities beyond simple biological circuits. However, training BNNs remains challenging due to the lack of target data. To address this, we propose leveraging Signal Temporal Logic (STL) specifications to define training objectives for BNNs. We build on the quantitative semantics of STL, enabling gradient-based optimization of the BNN weights, and introduce a learning algorithm that enables BNNs to perform regression and control tasks in biological systems. Specifically, we investigate two regression problems in which we train BNNs to act as reporters of dysregulated states, and a feedback control problem in which we train the BNN in closed-loop with a chronic disease model, learning to reduce inflammation while avoiding adverse responses to external infections. Our numerical experiments demonstrate that STL-based learning can solve the investigated regression and control tasks efficiently.

[486] Prior Distribution and Model Confidence

Maksim Kazanskii, Artem Kasianov

Main category: cs.LG

TL;DR: A framework that uses training data embeddings to filter low-confidence predictions by measuring distance from training distribution, improving classification accuracy without retraining.

DetailsMotivation: To understand how training data distribution affects model performance and develop a method to assess prediction confidence on unseen data without requiring model retraining.

Method: Analyze embeddings of training set to measure distance from training distribution in embedding space, filter low-confidence predictions based on this distance, and use multiple embedding models for robust confidence estimation.
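
A minimal sketch of the filtering step, assuming a centroid distance in embedding space with a quantile threshold calibrated on the training set (the paper’s distance measure may differ):

```python
# Sketch: keep only predictions whose embeddings lie near the training data.
import numpy as np

def confidence_mask(train_emb: np.ndarray, test_emb: np.ndarray, q: float = 0.95):
    mu = train_emb.mean(axis=0)
    d_train = np.linalg.norm(train_emb - mu, axis=1)
    cutoff = np.quantile(d_train, q)           # distance threshold from train set
    d_test = np.linalg.norm(test_emb - mu, axis=1)
    return d_test <= cutoff                    # True = confident prediction
```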

Result: Significant improvement in classification accuracy across multiple model architectures, with further gains achieved by combining complementary embeddings from different models for better out-of-distribution detection.

Conclusion: The proposed model-agnostic framework effectively improves prediction reliability by filtering uncertain predictions, is generalizable beyond computer vision, and has potential applications in domains like NLP where confidence estimation is critical.

Abstract: This paper investigates the impact of training data distribution on the performance of image classification models. By analyzing the embeddings of the training set, we propose a framework to understand the confidence of model predictions on unseen data without the need for retraining. Our approach filters out low-confidence predictions based on their distance from the training distribution in the embedding space, significantly improving classification accuracy. We demonstrate this on the example of several classification models, showing consistent performance gains across architectures. Furthermore, we show that using multiple embedding models to represent the training data enables a more robust estimation of confidence, as different embeddings capture complementary aspects of the data. Combining these embeddings allows for better detection and exclusion of out-of-distribution samples, resulting in further accuracy improvements. The proposed method is model-agnostic and generalizable, with potential applications beyond computer vision, including domains such as Natural Language Processing where prediction reliability is critical.

[487] MambaLite-Micro: Memory-Optimized Mamba Inference on MCUs

Hongjun Xu, Junxi Xia, Weisi Yang, Yueyuan Sui, Stephen Xia

Main category: cs.LG

TL;DR: First deployment of Mamba-based neural architecture on resource-constrained microcontrollers using a C-based runtime-free inference engine called MambaLite-Micro

DetailsMotivation: Deploying Mamba models on microcontrollers is challenging due to limited memory, lack of native operator support, and absence of embedded-friendly toolchains

Method: Pipeline maps PyTorch Mamba model to on-device execution by exporting weights to lightweight format and implementing handcrafted Mamba layer in C with operator fusion and memory layout optimization

Result: Reduces peak memory by 83.0%, maintains an average numerical error of 1.7×10⁻⁵, and achieves 100% consistency with PyTorch baselines on keyword spotting and human activity recognition tasks

Conclusion: Successfully deployed on ESP32S3 and STM32H7 microcontrollers, demonstrating consistent operation across heterogeneous embedded platforms and enabling advanced sequence models on resource-constrained devices

Abstract: Deploying Mamba models on microcontrollers (MCUs) remains challenging due to limited memory, the lack of native operator support, and the absence of embedded-friendly toolchains. We present, to our knowledge, the first deployment of a Mamba-based neural architecture on a resource-constrained MCU, using a fully C-based, runtime-free inference engine: MambaLite-Micro. Our pipeline maps a trained PyTorch Mamba model to on-device execution by (1) exporting model weights into a lightweight format, and (2) implementing a handcrafted Mamba layer and supporting operators in C with operator fusion and memory layout optimization. MambaLite-Micro eliminates large intermediate tensors, reducing peak memory by 83.0% while maintaining an average numerical error of only 1.7×10⁻⁵ relative to the PyTorch Mamba implementation. When evaluated on keyword spotting (KWS) and human activity recognition (HAR) tasks, MambaLite-Micro achieved 100% consistency with the PyTorch baselines, fully preserving classification accuracy. We further validated portability by deploying on both ESP32S3 and STM32H7 microcontrollers, demonstrating consistent operation across heterogeneous embedded platforms and paving the way for bringing advanced sequence models like Mamba to real-world resource-constrained applications.

[488] Self-Aligned Reward: Towards Effective and Efficient Reasoners

Peixuan Han, Adit Krishnan, Gerald Friedland, Jiaxuan You, Chris Kong

Main category: cs.LG

TL;DR: SAR (Self-Aligned Reward) is a self-guided signal that complements binary verifiable rewards to improve both reasoning accuracy and efficiency in LLMs by favoring concise, query-specific responses through perplexity difference scoring.

DetailsMotivation: Existing verifiable rewards provide only binary correctness feedback, leading to inefficient verbose reasoning and high computational costs while often compromising accuracy.

Method: SAR is defined as the relative perplexity difference between an answer conditioned on the query and the standalone answer, favoring responses that are concise and query-specific.
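
A hedged sketch of the two perplexities with a Hugging Face causal LM; the token alignment at the prefix boundary is approximate and the exact form of the "relative difference" is an assumption.

```python
# Sketch: SAR as a relative perplexity difference (assumed normalization).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_perplexity(answer: str, prefix: str = "") -> float:
    ids = tok(prefix + answer, return_tensors="pt").input_ids
    n_prefix = len(tok(prefix).input_ids) if prefix else 0
    with torch.no_grad():
        logits = lm(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    token_ll = log_probs[torch.arange(len(targets)), targets]
    answer_ll = token_ll[max(n_prefix - 1, 0):]       # answer tokens only (approx.)
    return float(torch.exp(-answer_ll.mean()))

def self_aligned_reward(query: str, answer: str) -> float:
    p_cond = answer_perplexity(answer, prefix=query)  # conditioned on the query
    p_alone = answer_perplexity(answer)               # standalone answer
    return (p_alone - p_cond) / p_alone               # higher = more query-specific
```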

Result: Integration with RL algorithms (PPO, GRPO) improves accuracy by 4% while reducing inference cost by 30%, achieving Pareto-optimal trade-off between correctness and efficiency.

Conclusion: SAR serves as an effective fine-grained complement to verifiable rewards, enabling more efficient LLM training while preserving critical reasoning behaviors and suppressing unnecessary elaboration.

Abstract: Reinforcement learning with verifiable rewards has significantly advanced reasoning in large language models (LLMs), but such signals remain coarse, offering only binary correctness feedback. This limitation often results in inefficiencies, including overly verbose reasoning and high computational cost, while existing solutions often compromise accuracy. To address this, we introduce self-aligned reward (SAR), a self-guided signal that complements verifiable rewards to encourage both reasoning accuracy and efficiency. SAR is defined as the relative perplexity difference between an answer conditioned on the query and the standalone answer, thereby favoring responses that are concise and query-specific. Quantitative analysis reveals that SAR reliably distinguishes answer quality: concise, correct answers score higher than redundant ones, and partially correct answers score higher than entirely incorrect ones. Evaluation on 4 models across 7 benchmarks shows that integrating SAR with prevalent RL algorithms like PPO and GRPO improves accuracy by 4%, while reducing inference cost by 30%. Further analysis demonstrates that SAR achieves a Pareto-optimal trade-off between correctness and efficiency compared to reward signals based on length or self-confidence. We also show that SAR shortens responses while preserving advanced reasoning behaviors, demonstrating its ability to suppress unnecessary elaboration without losing critical reasoning. These results highlight the promise of self-aligned reward as a fine-grained complement to verifiable rewards, paving the way for more efficient and effective LLM training.

[489] DreamPRM-1.5: Unlocking the Potential of Each Instance for Multimodal Process Reward Model Training

Qi Cao, Pengtao Xie

Main category: cs.LG

TL;DR: DreamPRM-1.5 is an instance-reweighted framework that uses bi-level optimization to handle distribution shifts and noisy data in multimodal process reward models, achieving state-of-the-art performance on MMMU benchmark.

DetailsMotivation: Training multimodal process reward models faces challenges with distribution shifts and noisy data, which can degrade model performance and reliability.

Method: Uses bi-level optimization to adaptively adjust training example importance through two strategies: Instance Table (for smaller datasets) and Instance Net (for larger datasets), integrated with test-time scaling.
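
A toy sketch of the Instance Table idea: one learnable weight per training example scales that example’s loss. The outer (meta) step that tunes these weights against a held-out objective, and the Instance Net variant, are omitted here.

```python
# Toy sketch: learnable per-example weights for instance reweighting.
import torch

n_train = 256
w_logits = torch.zeros(n_train, requires_grad=True)    # the "Instance Table"

def weighted_loss(per_example_loss: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    w = torch.sigmoid(w_logits)[idx]                   # weights for this batch
    return (w * per_example_loss).sum() / w.sum()

# In the bi-level scheme, w_logits would be updated so that the reweighted
# training loss improves a separate meta/validation objective.
```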

Result: Achieves 84.6% accuracy on the MMMU benchmark, surpassing GPT-5 performance.

Conclusion: DreamPRM-1.5 effectively addresses distribution shift and noise issues in multimodal PRM training through adaptive instance reweighting, demonstrating superior benchmark performance.

Abstract: Training multimodal process reward models (PRMs) is challenged by distribution shifts and noisy data. We introduce DreamPRM-1.5, an instance-reweighted framework that adaptively adjusts the importance of each training example via bi-level optimization. We design two complementary strategies: Instance Table, effective for smaller datasets, and Instance Net, scalable to larger ones. Integrated into test-time scaling, DreamPRM-1.5 achieves 84.6% accuracy on the MMMU benchmark, surpassing GPT-5.

[490] Reinforcement Learning with Anticipation: A Hierarchical Approach for Long-Horizon Tasks

Yang Yu

Main category: cs.LG

TL;DR: RLA introduces a hierarchical RL framework with anticipation model for long-horizon goal tasks, using value geometric consistency for stable training and theoretical guarantees.

DetailsMotivation: Hierarchical RL struggles with automatic hierarchy discovery and joint training instability in long-horizon goal-conditioned tasks, lacking theoretical guarantees.

Method: Learns low-level goal-conditioned policy and high-level anticipation model as planner, trained with value geometric consistency regularization to prevent degenerate solutions.

Result: RLA approaches globally optimal policy under various conditions, providing principled and convergent hierarchical planning for long-horizon tasks.

Conclusion: RLA offers a scalable, theoretically-grounded framework for hierarchical reinforcement learning with stable training and convergence guarantees.

Abstract: Solving long-horizon goal-conditioned tasks remains a significant challenge in reinforcement learning (RL). Hierarchical reinforcement learning (HRL) addresses this by decomposing tasks into more manageable sub-tasks, but the automatic discovery of the hierarchy and the joint training of multi-level policies often suffer from instability and can lack theoretical guarantees. In this paper, we introduce Reinforcement Learning with Anticipation (RLA), a principled and potentially scalable framework designed to address these limitations. The RLA agent learns two synergistic models: a low-level, goal-conditioned policy that learns to reach specified subgoals, and a high-level anticipation model that functions as a planner, proposing intermediate subgoals on the optimal path to a final goal. The key feature of RLA is the training of the anticipation model, which is guided by a principle of value geometric consistency, regularized to prevent degenerate solutions. We present proofs that RLA approaches the globally optimal policy under various conditions, establishing a principled and convergent method for hierarchical planning and execution in long-horizon goal-conditioned tasks.

[491] ProfilingAgent: Profiling-Guided Agentic Reasoning for Adaptive Model Optimization

Sadegh Jafari, Aishwarya Sarkar, Mohiuddin Bilwal, Ali Jannesari

Main category: cs.LG

TL;DR: ProfilingAgent uses LLMs to automate model compression via profiling-guided pruning and quantization, achieving significant memory savings and speedups while maintaining accuracy.

DetailsMotivation: Foundation models face compute and memory bottlenecks on resource-limited platforms, and existing compression techniques use uniform heuristics that ignore architectural and runtime heterogeneity.

Method: A profiling-guided, agentic approach using LLMs to automate compression via structured pruning and post-training dynamic quantization, with a modular multi-agent system that reasons over static metrics and dynamic signals.
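
The post-training dynamic quantization action available to the agent corresponds to a standard PyTorch call; a minimal sketch (the multi-agent orchestration and profiling loop are omitted):

```python
# Standard PyTorch dynamic quantization, as one compression action.
import torch
import torchvision

model = torchvision.models.resnet101(weights=None).eval()
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8   # quantize linear layers to int8
)
```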

Result: Pruning maintains competitive accuracy (about a 1% drop on ImageNet-1K, +2% gains for ViT-B/16 on smaller datasets), while quantization achieves up to 74% memory savings with <0.5% accuracy loss and consistent inference speedups of up to 1.74x.

Conclusion: Agentic systems are scalable solutions for profiling-guided model optimization, with LLM reasoning quality being crucial for iterative pruning success.

Abstract: Foundation models face growing compute and memory bottlenecks, hindering deployment on resource-limited platforms. While compression techniques such as pruning and quantization are widely used, most rely on uniform heuristics that ignore architectural and runtime heterogeneity. Profiling tools expose per-layer latency, memory, and compute cost, yet are rarely integrated into automated pipelines. We propose ProfilingAgent, a profiling-guided, agentic approach that uses large language models (LLMs) to automate compression via structured pruning and post-training dynamic quantization. Our modular multi-agent system reasons over static metrics (MACs, parameter counts) and dynamic signals (latency, memory) to design architecture-specific strategies. Unlike heuristic baselines, ProfilingAgent tailors layer-wise decisions to bottlenecks. Experiments on ImageNet-1K, CIFAR-10, and CIFAR-100 with ResNet-101, ViT-B/16, Swin-B, and DeiT-B/16 show pruning maintains competitive or improved accuracy (about 1% drop on ImageNet-1K, +2% gains for ViT-B/16 on smaller datasets), while quantization achieves up to 74% memory savings with <0.5% accuracy loss. Our quantization also yields consistent inference speedups of up to 1.74x. Comparative studies with GPT-4o and GPT-4-Turbo highlight the importance of LLM reasoning quality for iterative pruning. These results establish agentic systems as scalable solutions for profiling-guided model optimization.

[492] Causal Debiasing Medical Multimodal Representation Learning with Missing Modalities

Xiaoguang Zhu, Lianlong Sun, Yang Liu, Pengyi Jiang, Uma Srivatsa, Nipavan Chiamvimonvat, Vladimir Filkov

Main category: cs.LG

TL;DR: A causal framework for medical multimodal learning that addresses missingness and distribution biases in clinical data through structural causal analysis and dual-branch neural networks.

DetailsMotivation: Real-world medical datasets often have missing modalities due to cost, protocol, or patient constraints, and existing methods neglect the underlying bias from data acquisition processes that hinder model generalization.

Method: Structural causal analysis of data-generating process with two components: (1) missingness deconfounding module using backdoor adjustment for causal intervention, and (2) dual-branch neural network that disentangles causal features from spurious correlations.

Result: Evaluated on real-world public and in-hospital datasets, demonstrating effectiveness and providing causal insights.

Conclusion: The proposed unified framework effectively addresses missingness and distribution biases in multimodal medical data, improving generalization and offering causal understanding of clinical data patterns.

Abstract: Medical multimodal representation learning aims to integrate heterogeneous clinical data into unified patient representations to support predictive modeling, which remains an essential yet challenging task in the medical data mining community. However, real-world medical datasets often suffer from missing modalities due to cost, protocol, or patient-specific constraints. Existing methods primarily address this issue by learning from the available observations in either the raw data space or feature space, but typically neglect the underlying bias introduced by the data acquisition process itself. In this work, we identify two types of biases that hinder model generalization: missingness bias, which results from non-random patterns in modality availability, and distribution bias, which arises from latent confounders that influence both observed features and outcomes. To address these challenges, we perform a structural causal analysis of the data-generating process and propose a unified framework that is compatible with existing direct prediction-based multimodal learning methods. Our method consists of two key components: (1) a missingness deconfounding module that approximates causal intervention based on backdoor adjustment and (2) a dual-branch neural network that explicitly disentangles causal features from spurious correlations. We evaluated our method on real-world public and in-hospital datasets, demonstrating its effectiveness and providing causal insights.

[493] OptiProxy-NAS: Optimization Proxy based End-to-End Neural Architecture Search

Bo Lyu, Yu Cui, Tuo Shi, Ke Li

Main category: cs.LG

TL;DR: OptiProxy-NAS is a novel neural architecture search method that uses optimization proxies to transform NAS into a continuous, differentiable optimization problem, enabling efficient gradient-based search across multiple domains.

DetailsMotivation: Neural architecture search is computationally expensive with discrete, vast search spaces. Current methods use surrogate models or supernetworks, but there's a need for more efficient end-to-end optimization approaches.

Method: Proposes OptiProxy-NAS framework that uses proxy representations to reformulate NAS space as continuous, differentiable, and smooth, allowing application of any differentiable optimization method for gradient-based architecture search.

Result: Comprehensive experiments on 12 NAS tasks across 4 search spaces in computer vision, NLP, and resource-constrained domains show superior search results and efficiency. Additional experiments confirm flexibility in low-fidelity scenarios.

Conclusion: OptiProxy-NAS provides an effective end-to-end optimization framework for NAS that outperforms existing methods in both performance and efficiency across multiple application domains.

Abstract: Neural architecture search (NAS) is a hard, computationally expensive optimization problem with a discrete, vast, and spiky search space. One of the key research efforts dedicated to this space focuses on accelerating NAS via proxy evaluations of neural architectures. Different from the prevalent predictor-based methods using surrogate models and differentiable architecture search via supernetworks, we propose an optimization proxy to streamline NAS as an end-to-end optimization framework, named OptiProxy-NAS. In particular, using a proxy representation, the NAS space is reformulated to be continuous, differentiable, and smooth. Thereby, any differentiable optimization method can be applied to the gradient-based search of the relaxed architecture parameters. Our comprehensive experiments on 12 NAS tasks across 4 search spaces in three different domains, including computer vision, natural language processing, and resource-constrained NAS, fully demonstrate superior search results and efficiency. Further experiments on low-fidelity scenarios verify its flexibility.

[494] DQS: A Low-Budget Query Strategy for Enhancing Unsupervised Data-driven Anomaly Detection Approaches

Lucas Correia, Jan-Christoph Goos, Thomas Bäck, Anna V. Kononova

Main category: cs.LG

TL;DR: This paper proposes a novel active learning approach called dissimilarity-based query strategy (DQS) to improve threshold selection in unsupervised time series anomaly detection by selectively querying labels to maximize sample diversity.

DetailsMotivation: Existing unsupervised anomaly detection methods suffer from poor threshold setting, and those claiming to be unsupervised often require labeled data for calibration, which is not available in real-world scenarios.

Method: Integrates active learning with unsupervised anomaly detection using a novel DQS strategy that evaluates similarity between anomaly scores via dynamic time warping to maximize diversity of queried samples for threshold refinement.
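
A self-contained sketch of the greedy max-min selection behind a dissimilarity-based query strategy, with a plain quadratic-time DTW; the seeding and scoring details are assumptions.

```python
# Sketch: greedily query the series farthest (by DTW) from those already queried.
import numpy as np

def dtw(a: np.ndarray, b: np.ndarray) -> float:
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[len(a), len(b)])

def dqs_select(scores: list, budget: int) -> list:
    queried = [0]                                  # arbitrary seed (assumption)
    while len(queried) < budget:
        best, best_d = None, -1.0
        for i in range(len(scores)):
            if i in queried:
                continue
            d = min(dtw(scores[i], scores[q]) for q in queried)
            if d > best_d:                         # max-min diversity criterion
                best, best_d = i, d
        queried.append(best)
    return queried
```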

Result: DQS performs best in small-budget scenarios, while other strategies are more robust to mislabelling. All query strategies outperform unsupervised thresholds even with mislabelling.

Conclusion: When feasible to query an oracle, active learning-based threshold selection is recommended, with strategy choice depending on oracle expertise and labeling budget.

Abstract: Truly unsupervised approaches for time series anomaly detection are rare in the literature. Those that exist suffer from a poorly set threshold, which hampers detection performance, while others, despite claiming to be unsupervised, need to be calibrated using a labelled data subset, which is often not available in the real world. This work integrates active learning with an existing unsupervised anomaly detection method by selectively querying the labels of multivariate time series, which are then used to refine the threshold selection process. To achieve this, we introduce a novel query strategy called the dissimilarity-based query strategy (DQS). DQS aims to maximise the diversity of queried samples by evaluating the similarity between anomaly scores using dynamic time warping. We assess the detection performance of DQS in comparison to other query strategies and explore the impact of mislabelling, a topic that is underexplored in the literature. Our findings indicate that DQS performs best in small-budget scenarios, though the others appear to be more robust when faced with mislabelling. Therefore, in the real world, the choice of query strategy depends on the expertise of the oracle and the number of samples they are willing to label. Regardless, all query strategies outperform the unsupervised threshold even in the presence of mislabelling. Thus, whenever it is feasible to query an oracle, employing an active learning-based threshold is recommended.

[495] GraMFedDHAR: Graph Based Multimodal Differentially Private Federated HAR

Labani Halder, Tanmay Sen, Sarbani Palit

Main category: cs.LG

TL;DR: GraMFedDHAR: Graph-based multimodal federated learning framework for Human Activity Recognition that uses GCNs and attention fusion to handle heterogeneous sensor data while maintaining privacy through differential privacy.

DetailsMotivation: Address challenges in HAR including noisy/incomplete multimodal sensor data, limited labeled examples, privacy concerns, and infrastructure constraints of centralized approaches. Federated learning helps with privacy but struggles with heterogeneous data and DP requirements.

Method: Model diverse sensor streams (pressure mat, depth camera, accelerometers) as modality-specific graphs, process through residual GCNs, fuse via attention-based weighting instead of simple concatenation, and apply differential privacy during federated aggregation.
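
A minimal PyTorch sketch of attention-based fusion across modality embeddings, replacing concatenation; the scoring head and dimensions are illustrative assumptions.

```python
# Sketch: learn attention weights over per-modality embeddings, then fuse.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.score = nn.Linear(d, 1)               # one score per modality

    def forward(self, mods: list) -> torch.Tensor:
        H = torch.stack(mods, dim=1)               # (batch, n_modalities, d)
        alpha = torch.softmax(self.score(H), dim=1)
        return (alpha * H).sum(dim=1)              # weighted sum, shape (batch, d)

fusion = AttentionFusion(d=32)
fused = fusion([torch.randn(8, 32) for _ in range(3)])  # three sensor modalities
```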

Result: MultiModalGCN outperforms baseline MultiModalFFN by up to 2% higher accuracy in non-DP settings, and shows significant 7-13% improvements under differential privacy constraints across various privacy budgets.

Conclusion: Graph-based modeling with GNNs provides robustness in multimodal learning and demonstrates greater resilience to performance degradation caused by differential privacy noise compared to traditional approaches.

Abstract: Human Activity Recognition (HAR) using multimodal sensor data remains challenging due to noisy or incomplete measurements, scarcity of labeled examples, and privacy concerns. Traditional centralized deep learning approaches are often constrained by infrastructure availability, network latency, and data sharing restrictions. While federated learning (FL) addresses privacy by training models locally and sharing only model parameters, it still has to tackle issues arising from the use of heterogeneous multimodal data and differential privacy requirements. In this article, a Graph-based Multimodal Federated Learning framework, GraMFedDHAR, is proposed for HAR tasks. Diverse sensor streams such as a pressure mat, depth camera, and multiple accelerometers are modeled as modality-specific graphs, processed through residual Graph Convolutional Neural Networks (GCNs), and fused via attention-based weighting rather than simple concatenation. The fused embeddings enable robust activity classification, while differential privacy safeguards data during federated aggregation. Experimental results show that the proposed MultiModalGCN model outperforms the baseline MultiModalFFN, with up to 2 percent higher accuracy in non-DP settings in both centralized and federated paradigms. More importantly, significant improvements are observed under differential privacy constraints: MultiModalGCN consistently surpasses MultiModalFFN, with performance gaps ranging from 7 to 13 percent depending on the privacy budget and setting. These results highlight the robustness of graph-based modeling in multimodal learning, where GNNs prove more resilient to the performance degradation introduced by DP noise.

[496] Distributed Deep Learning using Stochastic Gradient Staleness

Viet Hoang Pham, Hyo-Sung Ahn

Main category: cs.LG

TL;DR: A distributed training method combining data parallelism and decoupled parallel backpropagation to accelerate DNN training by processing more data per iteration and reducing locking issues.

DetailsMotivation: Deep neural networks require substantial training time due to increasing depth and large datasets, creating a need for more efficient training methods.

Method: Integrates data parallelism with fully decoupled parallel backpropagation algorithm using multiple computational units to process more training data per iteration while avoiding locking issues.

Result: The method is proven to converge to critical points under certain conditions and shows effectiveness in empirical evaluations on CIFAR-10 classification tasks.

Conclusion: The proposed distributed training approach significantly improves training efficiency for deep neural networks by combining parallel processing strategies.

Abstract: Despite the notable success of deep neural networks (DNNs) in solving complex tasks, the training process still presents considerable challenges. A primary obstacle is the substantial time required for training, particularly as high-performing DNNs tend to become increasingly deep (characterized by a larger number of hidden layers) and require extensive training datasets. To address these challenges, this paper introduces a distributed training method that integrates two prominent strategies for accelerating deep learning: data parallelism and the fully decoupled parallel backpropagation algorithm. By utilizing multiple computational units operating in parallel, the proposed approach increases the amount of training data processed in each iteration while mitigating locking issues commonly associated with the backpropagation algorithm. These features collectively contribute to significant improvements in training efficiency. The proposed distributed training method is rigorously proven to converge to critical points under certain conditions. Its effectiveness is further demonstrated through empirical evaluations, wherein a DNN is trained to perform classification tasks on the CIFAR-10 dataset.

[497] Morphological Perceptron with Competitive Layer: Training Using Convex-Concave Procedure

Iara Cunha, Marcos Eduardo Valle

Main category: cs.LG

TL;DR: Proposes convex-concave procedure (CCP) for training morphological perceptron with competitive layer (MPCL) networks, addressing non-differentiability of morphological operators through DC programming and linear programming subproblems.

DetailsMotivation: Morphological perceptrons use mathematical morphology operations but are non-differentiable, making gradient-based optimization unsuitable. Alternative training methods are needed for MPCL networks in multiclass classification.

Method: Formulates training as difference of convex (DC) functions and solves iteratively using convex-concave procedure (CCP), resulting in sequence of linear programming subproblems.
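
For intuition, a toy NumPy sketch of the forward pass: a max-plus (dilation) layer followed by a winner-take-all output. The max operations are what break differentiability; the CCP solver itself is not shown.

```python
# Toy forward pass of a morphological perceptron with a competitive layer.
import numpy as np

def dilation_layer(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    # neuron k computes max_j (x_j + W[k, j]) instead of a weighted sum
    return (x[None, :] + W).max(axis=1)

def mpcl_predict(x: np.ndarray, W_hidden: np.ndarray, W_out: np.ndarray) -> int:
    h = dilation_layer(x, W_hidden)
    scores = dilation_layer(h, W_out)
    return int(np.argmax(scores))        # winner-take-all competitive layer
```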

Result: Computational experiments demonstrate effectiveness of the proposed CCP training method for MPCL networks in classification tasks.

Conclusion: CCP provides an effective gradient-free optimization approach for training morphological neural networks with competitive layers, overcoming the limitations of non-differentiable morphological operators.

Abstract: A morphological perceptron is a multilayer feedforward neural network in which neurons perform elementary operations from mathematical morphology. For multiclass classification tasks, a morphological perceptron with a competitive layer (MPCL) is obtained by integrating a winner-take-all output layer into the standard morphological architecture. The non-differentiability of morphological operators renders gradient-based optimization methods unsuitable for training such networks. Consequently, alternative strategies that do not depend on gradient information are commonly adopted. This paper proposes the use of the convex-concave procedure (CCP) for training MPCL networks. The training problem is formulated as a difference of convex (DC) functions and solved iteratively using CCP, resulting in a sequence of linear programming subproblems. Computational experiments demonstrate the effectiveness of the proposed training method in addressing classification tasks with MPCL networks.

[498] Simulation Priors for Data-Efficient Deep Learning

Lenart Treven, Bhavya Sukhija, Jonas Rothfuss, Stelian Coros, Florian Dörfler, Andreas Krause

Main category: cs.LG

TL;DR: SimPEL combines first-principles models with deep learning using Bayesian methods to enable efficient learning in low-data regimes while quantifying uncertainty.

DetailsMotivation: First-principles models often fail to capture real-world complexity due to simplifying assumptions, while deep learning requires large datasets. There's a need for methods that can efficiently learn complex dynamics with limited data.

Method: SimPEL uses low-fidelity simulators as priors in Bayesian deep learning, combining simulator knowledge with data-driven learning while quantifying epistemic uncertainty.

Result: Superior performance in learning complex dynamics across biological, agricultural, and robotic domains. On a high-speed RC car task, it learned dynamic parking maneuvers with drifting using substantially less data than state-of-the-art baselines.

Conclusion: SimPEL effectively bridges the sim-to-real gap in model-based reinforcement learning and shows strong potential for data-efficient learning and control in complex real-world environments.

Abstract: How do we enable AI systems to efficiently learn in the real-world? First-principles models are widely used to simulate natural systems, but often fail to capture real-world complexity due to simplifying assumptions. In contrast, deep learning approaches can estimate complex dynamics with minimal assumptions but require large, representative datasets. We propose SimPEL, a method that efficiently combines first-principles models with data-driven learning by using low-fidelity simulators as priors in Bayesian deep learning. This enables SimPEL to benefit from simulator knowledge in low-data regimes and leverage deep learning’s flexibility when more data is available, all the while carefully quantifying epistemic uncertainty. We evaluate SimPEL on diverse systems, including biological, agricultural, and robotic domains, showing superior performance in learning complex dynamics. For decision-making, we demonstrate that SimPEL bridges the sim-to-real gap in model-based reinforcement learning. On a high-speed RC car task, SimPEL learns a highly dynamic parking maneuver involving drifting with substantially less data than state-of-the-art baselines. These results highlight the potential of SimPEL for data-efficient learning and control in complex real-world environments.

[499] Offline vs. Online Learning in Model-based RL: Lessons for Data Collection Strategies

Jiaqi Chen, Ji Shi, Cansu Sancaktar, Jonas Frey, Georg Martius

Main category: cs.LG

TL;DR: Online model-based RL outperforms offline training due to OOD state issues. Limited offline dataset coverage causes imagination-reality mismatch. Adding online interactions or exploration data mitigates performance degradation.

DetailsMotivation: To investigate the effects of online vs offline data collection on world models in model-based reinforcement learning, as the performance differences and underlying causes haven't been thoroughly studied despite the theoretical suitability of offline training for task-agnostic dynamics learning.

Method: Conducted experiments on 31 different environments comparing online and offline model-based RL paradigms. Analyzed performance degradation causes and tested mitigation strategies including additional online interactions (fixed/adaptive schedules) and incorporating exploration data into offline datasets.

Result: Online agents consistently outperformed offline counterparts. Key issue identified: offline agents encounter Out-Of-Distribution states at test time due to limited state space coverage in datasets, causing imagination-reality mismatch. Performance was restored by adding online interactions or exploration data.

Conclusion: Offline datasets should incorporate exploration data alongside expert data to mitigate OOD state issues. Current focus on expert-only data is insufficient. Limited online interactions can effectively restore online training performance levels while maintaining data efficiency.

Abstract: Data collection is crucial for learning robust world models in model-based reinforcement learning. The most prevalent strategies are to actively collect trajectories by interacting with the environment during online training or training on offline datasets. At first glance, the nature of learning task-agnostic environment dynamics makes world models a good candidate for effective offline training. However, the effects of online vs. offline data on world models and thus on the resulting task performance have not been thoroughly studied in the literature. In this work, we investigate both paradigms in model-based settings, conducting experiments on 31 different environments. First, we showcase that online agents outperform their offline counterparts. We identify a key challenge behind performance degradation of offline agents: encountering Out-Of-Distribution states at test time. This issue arises because, without the self-correction mechanism in online agents, offline datasets with limited state space coverage induce a mismatch between the agent’s imagination and real rollouts, compromising policy training. We demonstrate that this issue can be mitigated by allowing for additional online interactions in a fixed or adaptive schedule, restoring the performance of online training with limited interaction data. We also showcase that incorporating exploration data helps mitigate the performance degradation of offline agents. Based on our insights, we recommend adding exploration data when collecting large datasets, as current efforts predominantly focus on expert data alone.

[500] Ensemble of Precision-Recall Curve (PRC) Classification Trees with Autoencoders

Jiaju Miao, Wei Zhu

Main category: cs.LG

TL;DR: Hybrid framework combining PRC Random Forest with autoencoders for anomaly detection, addressing class imbalance and dimensionality challenges simultaneously.

DetailsMotivation: Anomaly detection faces challenges of extreme class imbalance and curse of dimensionality, which impede progress in critical applications like network security and fraud prevention.

Method: Integrates previously developed PRC Random Forest (PRC-RF) with autoencoders to create a hybrid framework called Autoencoder-PRC-RF, using autoencoders to learn compact latent representations.
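
A hedged scikit-learn sketch of the hybrid pipeline, with an MLP autoencoder as the encoder and a standard random forest standing in for the PRC-RF (the PRC split criterion is not available in scikit-learn):

```python
# Sketch: autoencoder latent codes feed an (imbalance-aware) random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPRegressor

X = np.random.rand(500, 64)
y = (np.random.rand(500) < 0.05).astype(int)       # rare anomaly labels

ae = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500).fit(X, X)
Z = np.maximum(X @ ae.coefs_[0] + ae.intercepts_[0], 0.0)   # ReLU encoder output

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced")
clf.fit(Z, y)                                      # PRC-RF replaced by stock RF
```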

Result: Extensive experiments across diverse benchmark datasets demonstrate superior accuracy, scalability, and interpretability compared to prior methods.

Conclusion: The Autoencoder-PRC-RF model shows strong potential for high-stakes anomaly-detection tasks by effectively addressing both class imbalance and dimensionality challenges.

Abstract: Anomaly detection underpins critical applications from network security and intrusion detection to fraud prevention, where recognizing aberrant patterns rapidly is indispensable. Progress in this area is routinely impeded by two obstacles: extreme class imbalance and the curse of dimensionality. To combat the former, we previously introduced Precision-Recall Curve (PRC) classification trees and their ensemble extension, the PRC Random Forest (PRC-RF). Building on that foundation, we now propose a hybrid framework that integrates PRC-RF with autoencoders, unsupervised machine learning methods that learn compact latent representations, to confront both challenges simultaneously. Extensive experiments across diverse benchmark datasets demonstrate that the resulting Autoencoder-PRC-RF model achieves superior accuracy, scalability, and interpretability relative to prior methods, affirming its potential for high-stakes anomaly-detection tasks.

[501] Real-E: A Foundation Benchmark for Advancing Robust and Generalizable Electricity Forecasting

Chen Shao, Yue Wang, Zhenyi Zhu, Zhanbo Huang, Sebastian Pütz, Benjamin Schäfer, Tobias Käfer, Michael Färber

Main category: cs.LG

TL;DR: The paper introduces Real-E dataset for energy forecasting, covering 74+ power stations across 30+ European countries over 10 years, and benchmarks 20+ baselines showing existing methods struggle with complex correlation dynamics.

DetailsMotivation: Existing energy forecasting benchmarks are limited in spatial/temporal scope and lack multi-energy features, raising concerns about reliability and real-world applicability.

Method: Created Real-E dataset with rich metadata, conducted extensive data analysis, benchmarked 20+ baseline models across various types, and introduced a new metric to quantify correlation structure shifts.
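
The paper’s metric is not spelled out in this summary; one plausible form, assumed here, is the Frobenius distance between correlation matrices of consecutive windows:

```python
# Assumed sketch: quantify correlation-structure shift across time windows.
import numpy as np

def correlation_shift(X: np.ndarray, window: int) -> np.ndarray:
    # X has shape (time, stations)
    shifts, prev = [], None
    for t in range(0, X.shape[0] - window + 1, window):
        C = np.corrcoef(X[t:t + window].T)         # stations x stations
        if prev is not None:
            shifts.append(np.linalg.norm(C - prev, ord="fro"))
        prev = C
    return np.asarray(shifts)
```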

Result: Existing forecasting methods struggle with the Real-E dataset which exhibits more complex and non-stationary correlation dynamics than previous benchmarks.

Conclusion: The findings reveal key limitations of current methods and provide a strong empirical basis for developing more robust energy forecasting models.

Abstract: Energy forecasting is vital for grid reliability and operational efficiency. Although recent advances in time series forecasting have led to progress, existing benchmarks remain limited in spatial and temporal scope and lack multi-energy features. This raises concerns about their reliability and applicability in real-world deployment. To address this, we present the Real-E dataset, covering over 74 power stations across 30+ European countries over a 10-year span with rich metadata. Using Real-E, we conduct an extensive data analysis and benchmark over 20 baselines across various model types. We introduce a new metric to quantify shifts in correlation structures and show that existing methods struggle on our dataset, which exhibits more complex and non-stationary correlation dynamics. Our findings highlight key limitations of current methods and offer a strong empirical basis for building more robust forecasting models.

[502] DCV-ROOD Evaluation Framework: Dual Cross-Validation for Robust Out-of-Distribution Detection

Arantxa Urrea-Castaño, Nicolás Segura-Kunsagi, Juan Luis Suárez-Díaz, Rosana Montes, Francisco Herrera

Main category: cs.LG

TL;DR: Proposes DCV-ROOD, a dual cross-validation framework for robust evaluation of OOD detection models that handles in-distribution and out-of-distribution data differently and achieves fast convergence to true performance.

DetailsMotivation: Out-of-distribution detection is crucial for AI robustness but challenging to evaluate reliably. Cross-validation is effective for performance estimation but needs adaptation for OOD scenarios with different data characteristics.

Method: Dual CV framework that partitions ID data conventionally while grouping OOD data by classes. Considers class hierarchy for fair ID-OOD partitions. Tests framework on state-of-the-art OOD detection methods.
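
A minimal scikit-learn sketch of the dual splitting: conventional K-fold on ID data, class-grouped folds on OOD data. Dataset shapes and the detector itself are placeholders.

```python
# Sketch: pair conventional ID folds with class-grouped OOD folds.
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

X_id, y_id = np.random.rand(200, 8), np.random.randint(0, 5, 200)
X_ood, y_ood = np.random.rand(100, 8), np.random.randint(5, 10, 100)

id_folds = KFold(n_splits=5, shuffle=True, random_state=0).split(X_id)
ood_folds = GroupKFold(n_splits=5).split(X_ood, groups=y_ood)  # whole OOD classes held out

for (id_tr, id_te), (_, ood_te) in zip(id_folds, ood_folds):
    # fit the detector on X_id[id_tr]; evaluate on X_id[id_te] vs X_ood[ood_te]
    pass
```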

Result: The proposed DCV-ROOD framework achieves very fast convergence to the true performance of OOD detection methods.

Conclusion: The dual cross-validation approach provides a robust evaluation framework for OOD detection that effectively handles the distinct characteristics of ID and OOD data.

Abstract: Out-of-distribution (OOD) detection plays a key role in enhancing the robustness of artificial intelligence systems by identifying inputs that differ significantly from the training distribution, thereby preventing unreliable predictions and enabling appropriate fallback mechanisms. Developing reliable OOD detection methods is a significant challenge, and rigorous evaluation of these techniques is essential for ensuring their effectiveness, as it allows researchers to assess their performance under diverse conditions and to identify potential limitations or failure modes. Cross-validation (CV) has proven to be a highly effective tool for providing a reasonable estimate of the performance of a learning algorithm. Although OOD scenarios exhibit particular characteristics, an appropriate adaptation of CV can lead to a suitable evaluation framework for this setting. This work proposes a dual CV framework for robust evaluation of OOD detection models, aimed at improving the reliability of their assessment. The proposed evaluation framework aims to effectively integrate in-distribution (ID) and OOD data while accounting for their differing characteristics. To achieve this, ID data are partitioned using a conventional approach, whereas OOD data are divided by grouping samples based on their classes. Furthermore, we analyze the context of data with class hierarchy to propose a data splitting that considers the entire class hierarchy to obtain fair ID-OOD partitions to apply the proposed evaluation framework. This framework is called Dual Cross-Validation for Robust Out-of-Distribution Detection (DCV-ROOD). To test the validity of the evaluation framework, we selected a set of state-of-the-art OOD detection methods, both with and without outlier exposure. The results show that the method achieves very fast convergence to the true performance.

[503] Select, then Balance: A Plug-and-Play Framework for Exogenous-Aware Spatio-Temporal Forecasting

Wei Chen, Yuqian Wu, Yuanshao Zhu, Xixuan Hao, Shiyu Wang, Yuxuan Liang

Main category: cs.LG

TL;DR: A novel framework called ExoST for spatio-temporal forecasting that effectively models exogenous variables through a “select, then balance” approach using latent space gated experts and siamese network architecture.

DetailsMotivation: Existing spatio-temporal forecasting solutions use only a limited set of observed target variables, but real-world scenarios have exogenous variables that can improve accuracy. However, challenges include the inconsistent effects of different exogenous variables and the imbalance between historical and future variables.

Method: Constructs latent space gated expert module to dynamically select and recompose salient exogenous signals. Uses siamese network architecture with dual-branch spatio-temporal backbones for past and future variables, integrated through context-aware weighting mechanism.

Result: Extensive experiments on real-world datasets demonstrate the framework’s effectiveness, generality, robustness, and efficiency.

Conclusion: The proposed ExoST framework successfully addresses challenges in modeling exogenous variables for spatio-temporal forecasting through its selective and balanced approach.

Abstract: Spatio-temporal forecasting aims to predict the future state of dynamic systems and plays an important role in multiple fields. However, existing solutions only focus on modeling using a limited number of observed target variables. In real-world scenarios, exogenous variables can be integrated into the model as additional input features and associated with the target signal to promote forecast accuracy. Although promising, this still encounters two challenges: the inconsistent effects of different exogenous variables on the target system, and the imbalanced effects between historical and future variables. To address these challenges, this paper introduces ExoST, a novel framework for modeling exogenous variables in spatio-temporal forecasting, which follows a “select, then balance” paradigm. Specifically, we first construct a latent space gated expert module, where fused exogenous information is projected into a latent space to dynamically select and recompose salient signals via specialized sub-experts. Furthermore, we design a siamese network architecture in which recomposed representations of past and future exogenous variables are fed into dual-branch spatio-temporal backbones to capture dynamic patterns. The outputs are integrated through a context-aware weighting mechanism to achieve dynamic balance during the modeling process. Extensive experiments on real-world datasets demonstrate the effectiveness, generality, robustness, and efficiency of our proposed framework.
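
The "select" step can be pictured as a small mixture-of-experts in latent space: project the fused exogenous features, gate over sub-experts, and recombine. Dimensions, expert count, and layer choices below are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class GatedExperts(nn.Module):
    """Latent-space gated expert module: fused exogenous features are
    projected into a latent space, then a softmax gate selects and
    recomposes specialized sub-experts (a sketch)."""
    def __init__(self, d_in, d_latent=64, n_experts=4):
        super().__init__()
        self.proj = nn.Linear(d_in, d_latent)
        self.gate = nn.Linear(d_latent, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d_latent, d_latent)
                                      for _ in range(n_experts)])

    def forward(self, x_exo):                      # (batch, d_in)
        z = torch.relu(self.proj(x_exo))
        w = torch.softmax(self.gate(z), dim=-1)    # "select"
        out = torch.stack([e(z) for e in self.experts], dim=-1)
        return (out * w.unsqueeze(-2)).sum(-1)     # "recompose"
```

Two such modules (for past and future exogenous variables) feed the dual-branch backbones, whose outputs the context-aware weighting then balances.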

[504] time2time: Causal Intervention in Hidden States to Simulate Rare Events in Time Series Foundation Models

Debdeep Sanyal, Aaryan Nagpal, Dhruv Kumar, Murari Mandal, Saurabh Deshpande

Main category: cs.LG

TL;DR: Transformer foundation models encode semantic concepts like market regimes, not just curve fitting. Activation transplantation can manipulate hidden states to simulate rare events like market crashes.

DetailsMotivation: To determine if transformer-based models internalize semantic concepts and if their internal representations can be used to simulate rare high-stakes events like market crashes.

Method: Activation transplantation - a causal intervention that manipulates hidden states by imposing statistical moments from one event onto another during forward pass.

Result: Injecting crash semantics induces downturn predictions, while calm semantics suppresses crashes. Models encode graded event severity with latent vector norm correlating with shock magnitude. Validated across Toto and Chronos architectures.

Conclusion: Large time series transformers have steerable, semantically grounded representations with a latent concept space governing predictions, enabling direct causal intervention and semantic what-if analysis for stress-testing.

Abstract: While transformer-based foundation models excel at forecasting routine patterns, two questions remain: do they internalize semantic concepts such as market regimes, or merely fit curves? And can their internal representations be leveraged to simulate rare, high-stakes events such as market crashes? To investigate this, we introduce activation transplantation, a causal intervention that manipulates hidden states by imposing the statistical moments of one event (e.g., a historical crash) onto another (e.g., a calm period) during the forward pass. This procedure deterministically steers forecasts: injecting crash semantics induces downturn predictions, while injecting calm semantics suppresses crashes and restores stability. Beyond binary control, we find that models encode a graded notion of event severity, with the latent vector norm directly correlating with the magnitude of systemic shocks. Validated across two architecturally distinct TSFMs, Toto (decoder only) and Chronos (encoder-decoder), our results demonstrate that steerable, semantically grounded representations are a robust property of large time series transformers. Our findings provide evidence for a latent concept space that governs model predictions, shifting interpretability from post-hoc attribution to direct causal intervention, and enabling semantic “what-if” analysis for strategic stress-testing.
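
Reading "imposing the statistical moments" as matching the first two moments (mean and standard deviation), the intervention can be sketched as a forward hook that re-standardizes a layer's hidden states to the donor event's statistics. This is an assumption about the exact transplant; names below are illustrative:

```python
import torch

def transplant(h_recipient, h_donor, eps=1e-6):
    """Shift the recipient's hidden states so their per-feature mean/std
    match the donor event's (first two statistical moments)."""
    mu_r, sd_r = h_recipient.mean(0, keepdim=True), h_recipient.std(0, keepdim=True)
    mu_d, sd_d = h_donor.mean(0, keepdim=True), h_donor.std(0, keepdim=True)
    return (h_recipient - mu_r) / (sd_r + eps) * sd_d + mu_d

def make_hook(h_donor):
    """Forward hook that rewrites one layer's activations during the pass;
    returning a tensor from a PyTorch forward hook replaces the output."""
    def hook(module, inputs, output):
        return transplant(output, h_donor)
    return hook
```

Registering make_hook(crash_activations) on a mid-network layer while forecasting a calm period corresponds to the "inject crash semantics" direction of the experiment.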

[505] Simple Optimizers for Convex Aligned Multi-Objective Optimization

Ben Kretzu, Karen Ullrich, Yonathan Efroni

Main category: cs.LG

TL;DR: This paper relaxes strong convexity assumptions in aligned multi-objective optimization, developing new gradient-descent algorithms with convergence guarantees under standard smoothness/Lipschitz conditions more relevant to deep learning practice.

DetailsMotivation: Existing AMOO analysis relies on strong convexity assumptions that imply unique optimal solutions, which doesn't align with real-world deep learning practice where objectives may not be inherently conflicting and diverse tasks can enhance performance.

Method: Developed new analytical tools and metrics for convex AMOO, proposed scalable gradient-descent algorithms under standard smoothness or Lipschitz continuity conditions.

Result: Established convergence guarantees for the proposed algorithms and proved a novel lower bound showing suboptimality of naive equal-weight approaches compared to their methods.

Conclusion: The work provides more practical convergence analysis for aligned multi-objective optimization that better matches deep learning assumptions, with improved algorithms that outperform naive approaches.

Abstract: It is widely recognized in modern machine learning practice that access to a diverse set of tasks can enhance performance across those tasks. This observation suggests that, unlike in general multi-objective optimization, the objectives in many real-world settings may not be inherently conflicting. To address this, prior work introduced the Aligned Multi-Objective Optimization (AMOO) framework and proposed gradient-based algorithms with provable convergence guarantees. However, existing analysis relies on strong assumptions, particularly strong convexity, which implies the existence of a unique optimal solution. In this work, we relax this assumption and study gradient-descent algorithms for convex AMOO under standard smoothness or Lipschitz continuity conditions, assumptions more consistent with those used in deep learning practice. This generalization requires new analytical tools and metrics to characterize convergence in the convex AMOO setting. We develop such tools, propose scalable algorithms for convex AMOO, and establish their convergence guarantees. Additionally, we prove a novel lower bound that demonstrates the suboptimality of naive equal-weight approaches compared to our methods.
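
In symbols, alignment can be formalized as the objectives sharing at least one common minimizer; the notation below is illustrative and the paper's precise definitions may differ:

```latex
% f_1, ..., f_m are aligned when their minimizer sets intersect:
X^{\ast} \;=\; \bigcap_{i=1}^{m} \operatorname*{arg\,min}_{x} f_i(x) \;\neq\; \emptyset .
% Strong convexity collapses X* to a single point; the convex setting
% studied here allows X* to be a (possibly large) convex set, which is
% why new convergence metrics are needed.
```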

[506] Performance of Conformal Prediction in Capturing Aleatoric Uncertainty

Misgina Tsighe Hagos, Claes Lundström

Main category: cs.LG

TL;DR: Conformal prediction sets show weak correlation with human-annotated aleatoric uncertainty, questioning their effectiveness in capturing inherent dataset ambiguity.

DetailsMotivation: To investigate whether conformal prediction effectively quantifies aleatoric uncertainty (inherent dataset ambiguity from overlapping classes) as claimed in literature, since there's a lack of empirical validation.

Method: Used three conformal prediction approaches to generate prediction sets for eight deep learning models on four datasets with multiple human annotations per instance. Measured correlation between prediction set sizes and number of distinct human labels, and assessed similarity between prediction sets and human annotations.

Result: Vast majority of conformal prediction outputs showed very weak to weak correlation with human annotations, with only a few showing moderate correlation. Prediction sets provide coverage but fail to effectively capture aleatoric uncertainty.

Conclusion: Conformal predictors’ capability in capturing aleatoric uncertainty is limited, necessitating critical reassessment of their prediction sets despite providing good coverage of true classes.

Abstract: Conformal prediction is a model-agnostic approach to generating prediction sets that cover the true class with a high probability. Although its prediction set size is expected to capture aleatoric uncertainty, there is a lack of evidence regarding its effectiveness. The literature presents that prediction set size can upper-bound aleatoric uncertainty or that prediction sets are larger for difficult instances and smaller for easy ones, but a validation of this attribute of conformal predictors is missing. This work investigates how effectively conformal predictors quantify aleatoric uncertainty, specifically the inherent ambiguity in datasets caused by overlapping classes. We perform this by measuring the correlation between prediction set sizes and the number of distinct labels assigned by human annotators per instance. We further assess the similarity between prediction sets and human-provided annotations. We use three conformal prediction approaches to generate prediction sets for eight deep learning models trained on four datasets. The datasets contain annotations from multiple human annotators (ranging from five to fifty participants) per instance, enabling the identification of class overlap. We show that the vast majority of the conformal prediction outputs show a very weak to weak correlation with human annotations, with only a few showing moderate correlation. These findings underscore the necessity of critically reassessing the prediction sets generated using conformal predictors. While they can provide a higher coverage of the true classes, their capability in capturing aleatoric uncertainty remains limited.
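
For concreteness, here is a minimal split-conformal construction and the correlation measurement on synthetic stand-ins (the paper evaluates three conformal approaches on real multi-annotator datasets; the data below are random placeholders):

```python
import numpy as np
from scipy.stats import spearmanr

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Plain split-conformal prediction sets from softmax scores."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]        # nonconformity
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level)
    return test_probs >= 1.0 - q                              # boolean set mask

rng = np.random.default_rng(0)
cal_p, test_p = rng.dirichlet(np.ones(10), 500), rng.dirichlet(np.ones(10), 200)
cal_y = cal_p.argmax(1)                                       # placeholder labels
set_sizes = split_conformal_sets(cal_p, cal_y, test_p).sum(axis=1)
n_distinct = rng.integers(1, 6, 200)   # stand-in: distinct human labels/instance
rho, _ = spearmanr(set_sizes, n_distinct)                     # the paper's measure
```

A strong positive rho would indicate that set sizes track human disagreement; the paper finds mostly very weak to weak correlations.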

[507] Finetuning LLMs for Human Behavior Prediction in Social Science Experiments

Akaash Kolluri, Shengguang Wu, Joon Sung Park, Michael S. Bernstein

Main category: cs.LG

TL;DR: Finetuning LLMs on social science experiment data (SocSci210 dataset) significantly improves simulation accuracy, achieving 26% better alignment with human responses than base models and outperforming GPT-4o by 13%.

DetailsMotivation: To leverage LLMs for more accurate simulations of social science experiments by finetuning them directly on individual-level responses from past experiments.

Method: Constructed SocSci210 dataset with 2.9M responses from 400K participants across 210 experiments, then finetuned Qwen2.5-14B model to create Socrates-Qwen-14B.

Result: 26% improvement in alignment with human response distributions, 71% improvement in generalization to new conditions, and 10.6% reduction in demographic bias compared to base model.

Conclusion: Finetuning on social science datasets enables more accurate experimental simulations and hypothesis screening, with potential applications across diverse social science domains.

Abstract: Large language models (LLMs) offer a powerful opportunity to simulate the results of social science experiments. In this work, we demonstrate that finetuning LLMs directly on individual-level responses from past experiments meaningfully improves the accuracy of such simulations across diverse social science domains. We construct SocSci210 via an automatic pipeline, a dataset comprising 2.9 million responses from 400,491 participants in 210 open-source social science experiments. Through finetuning, we achieve multiple levels of generalization. In completely unseen studies, our strongest model, Socrates-Qwen-14B, produces predictions that are 26% more aligned with distributions of human responses to diverse outcome questions under varying conditions relative to its base model (Qwen2.5-14B), outperforming GPT-4o by 13%. By finetuning on a subset of conditions in a study, generalization to new unseen conditions is particularly robust, improving by 71%. Since SocSci210 contains rich demographic information, we reduce demographic parity, a measure of bias, by 10.6% through finetuning. Because social sciences routinely generate rich, topic-specific datasets, our findings indicate that finetuning on such data could enable more accurate simulations for experimental hypothesis screening. We release our data, models and finetuning code at stanfordhci.github.io/socrates.

[508] Benchmarking Robust Aggregation in Decentralized Gradient Marketplaces

Zeyu Song, Sainyam Galhotra, Shagufta Mehnaz

Main category: cs.LG

TL;DR: A benchmark framework for evaluating gradient aggregation methods in decentralized gradient marketplaces, addressing economic efficiency, fairness, and market stability in buyer-baseline-reliant environments.

DetailsMotivation: Existing FL benchmarks overlook critical economic and systemic factors unique to decentralized gradient marketplaces, particularly when buyers rely on private baseline datasets for evaluation.

Method: Developed a simulation environment modeling marketplace dynamics, evaluation methodology with marketplace-centric metrics, and empirical analysis of MartFL framework with adapted FLTrust and SkyMask aggregation strategies across diverse datasets and attack scenarios.

Result: Comprehensive benchmark providing tools and empirical evidence to evaluate trade-offs between model performance, robustness, cost, fairness, and stability in gradient marketplaces.

Conclusion: The benchmark equips the community to design more robust, equitable, and economically viable decentralized gradient marketplaces by addressing previously overlooked economic and systemic factors.

Abstract: The rise of distributed and privacy-preserving machine learning has sparked interest in decentralized gradient marketplaces, where participants trade intermediate artifacts like gradients. However, existing Federated Learning (FL) benchmarks overlook critical economic and systemic factors unique to such marketplaces: cost-effectiveness, fairness to sellers, and market stability, especially when a buyer relies on a private baseline dataset for evaluation. We introduce a comprehensive benchmark framework to holistically evaluate robust gradient aggregation methods within these buyer-baseline-reliant marketplaces. Our contributions include: (1) a simulation environment modeling marketplace dynamics with a variable buyer baseline and diverse seller distributions; (2) an evaluation methodology augmenting standard FL metrics with marketplace-centric dimensions such as Economic Efficiency, Fairness, and Selection Dynamics; (3) an in-depth empirical analysis of the existing Distributed Gradient Marketplace framework, MartFL, including the integration and comparative evaluation of adapted FLTrust and SkyMask as alternative aggregation strategies within it. This benchmark spans diverse datasets, local attacks, and Sybil attacks targeting the marketplace selection process; and (4) actionable insights into the trade-offs between model performance, robustness, cost, fairness, and stability. This benchmark equips the community with essential tools and empirical evidence to evaluate and design more robust, equitable, and economically viable decentralized gradient marketplaces.
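
FLTrust, one of the adapted aggregators, scores each contributed gradient by ReLU-clipped cosine similarity to a trusted baseline gradient and rescales magnitudes before averaging. A sketch of that core idea, with the buyer's baseline playing the role of FLTrust's server root gradient (interface names are mine, not the benchmark's):

```python
import numpy as np

def fltrust_aggregate(seller_grads, baseline_grad, eps=1e-12):
    """Trust-weighted aggregation: cosine trust scores (clipped at zero)
    against the buyer's baseline gradient; seller gradients are rescaled
    to the baseline's norm before weighted averaging."""
    b_norm = np.linalg.norm(baseline_grad) + eps
    g0 = baseline_grad / b_norm
    scores, rescaled = [], []
    for g in seller_grads:
        n = np.linalg.norm(g) + eps
        scores.append(max(0.0, float(g @ g0) / n))     # trust score
        rescaled.append(g * b_norm / n)                # magnitude clipping
    s = np.asarray(scores)
    return baseline_grad if s.sum() == 0 else np.average(rescaled, axis=0, weights=s)
```

The marketplace metrics then ask not only whether such a rule resists attacks, but what it costs and which sellers it systematically excludes.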

[509] Data-Driven Stochastic Modeling Using Autoregressive Sequence Models: Translating Event Tables to Queueing Dynamics

Daksh Mittal, Shunri Zheng, Jing Dong, Hongseok Namkoong

Main category: cs.LG

TL;DR: Data-driven framework using autoregressive sequence models to automate queueing network modeling from event-stream data, eliminating need for manual specification of arrival processes, service mechanisms, and routing logic.

DetailsMotivation: Traditional queueing network models require substantial human effort and domain expertise to construct, making them less scalable and accessible. The goal is to leverage AI advances and available data to create more automated modeling pipelines.

Method: Uses Transformer-style architectures to parameterize conditional distributions of event types and times, treating modeling as sequence distribution learning. Trained on event-stream data to learn patterns without explicit specification of system components.

Result: Framework successfully constructs high-fidelity simulators validated on diverse queueing networks, demonstrating utility in simulation, uncertainty quantification, and counterfactual evaluation.

Conclusion: The approach represents a step toward more automated, data-driven modeling pipelines that can support broader adoption of queueing network models across various service domains.

Abstract: While queueing network models are powerful tools for analyzing service systems, they traditionally require substantial human effort and domain expertise to construct. To make this modeling approach more scalable and accessible, we propose a data-driven framework for queueing network modeling and simulation based on autoregressive sequence models trained on event-stream data. Instead of explicitly specifying arrival processes, service mechanisms, or routing logic, our approach learns the conditional distributions of event types and event times, recasting the modeling task as a problem of sequence distribution learning. We show that Transformer-style architectures can effectively parameterize these distributions, enabling automated construction of high-fidelity simulators. As a proof of concept, we validate our framework on event tables generated from diverse queueing networks, showcasing its utility in simulation, uncertainty quantification, and counterfactual evaluation. Leveraging advances in artificial intelligence and the growing availability of data, our framework takes a step toward more automated, data-driven modeling pipelines to support broader adoption of queueing network models across service domains.
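
The autoregressive factorization means each step needs a distribution over the next event's type and its timing. A minimal sketch of such a head on top of a Transformer hidden state (the log-normal time family is an assumption; the paper treats the choice of distribution as part of sequence distribution learning):

```python
import torch
import torch.nn as nn

class NextEventHead(nn.Module):
    """Map a hidden state to (i) a categorical distribution over event
    types and (ii) a log-normal distribution over the inter-event time."""
    def __init__(self, d_model, n_event_types):
        super().__init__()
        self.type_logits = nn.Linear(d_model, n_event_types)
        self.time_params = nn.Linear(d_model, 2)   # mean, log-std of log-time

    def forward(self, h):
        mu, log_sigma = self.time_params(h).chunk(2, dim=-1)
        return (torch.distributions.Categorical(logits=self.type_logits(h)),
                torch.distributions.LogNormal(mu, log_sigma.exp()))
```

Sampling event-by-event from these two distributions is what turns the trained model into a simulator.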

[510] The Measure of Deception: An Analysis of Data Forging in Machine Unlearning

Rishabh Dixit, Yuan Hui, Rayan Saab

Main category: cs.LG

TL;DR: The paper analyzes adversarial forging in machine unlearning, showing that the set of data points that can effectively mimic target gradients is extremely small, making false unlearning claims detectable in principle.

DetailsMotivation: Privacy regulations and the need to mitigate harmful data effects require machine unlearning, but adversaries can forge data to mimic unlearning without actually removing information. The paper aims to understand and quantify this forging phenomenon.

Method: The authors develop a framework to analyze ε-forging sets (data points whose gradients approximate target gradients within tolerance ε). They analyze linear regression, one-layer neural networks, and extend to batch SGD and smooth loss functions under mild regularity assumptions.

Result: For linear models and neural networks, the Lebesgue measure of forging sets scales as ε or ε^d. More generally, the measure decays as ε^{(d-r)/2} where d is data dimension and r is nullity of a variation matrix. Probability bounds show random sampling of forging points is vanishingly small.

Conclusion: Adversarial forging is fundamentally limited due to the extremely small measure of forging sets, meaning false unlearning claims can be detected in principle, providing theoretical support for verifiable machine unlearning.

Abstract: Motivated by privacy regulations and the need to mitigate the effects of harmful data, machine unlearning seeks to modify trained models so that they effectively “forget” designated data. A key challenge in verifying unlearning is forging – adversarially crafting data that mimics the gradient of a target point, thereby creating the appearance of unlearning without actually removing information. To capture this phenomenon, we consider the collection of data points whose gradients approximate a target gradient within tolerance $\epsilon$ – which we call an $\epsilon$-forging set – and develop a framework for its analysis. For linear regression and one-layer neural networks, we show that the Lebesgue measure of this set is small. It scales on the order of $\epsilon$, and when $\epsilon$ is small enough, $\epsilon^d$. More generally, under mild regularity assumptions, we prove that the forging set measure decays as $\epsilon^{(d-r)/2}$, where $d$ is the data dimension and $r<d$ is the nullity of a variation matrix defined by the model gradients. Extensions to batch SGD and almost-everywhere smooth loss functions yield the same asymptotic scaling. In addition, we establish probability bounds showing that, under non-degenerate data distributions, the likelihood of randomly sampling a forging point is vanishingly small. These results provide evidence that adversarial forging is fundamentally limited and that false unlearning claims can, in principle, be detected.
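
The central object is compact enough to restate; symbols follow the abstract ($\lambda$ denotes Lebesgue measure, $d$ the data dimension, $r$ the nullity of the variation matrix):

```latex
% eps-forging set of a target point x*, for model parameters w and loss l:
S_{\epsilon}(x^{\ast}) \;=\; \bigl\{\, x \;:\; \bigl\|\nabla_{w}\,\ell(w;x) - \nabla_{w}\,\ell(w;x^{\ast})\bigr\| \le \epsilon \,\bigr\},
\qquad
\lambda\bigl(S_{\epsilon}(x^{\ast})\bigr) \;=\; \mathcal{O}\!\bigl(\epsilon^{(d-r)/2}\bigr).
```

Since $r<d$, the measure vanishes as $\epsilon \to 0$, which is why randomly hitting a forging point is vanishingly unlikely.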

[511] Learning to Construct Knowledge through Sparse Reference Selection with Reinforcement Learning

Shao-An Yin

Main category: cs.LG

TL;DR: Deep Reinforcement Learning framework for sparse reference selection that helps prioritize which scientific papers to read under time and cost constraints, evaluated on drug-gene relation discovery.

DetailsMotivation: The rapid expansion of scientific literature makes knowledge acquisition difficult, especially in specialized domains with complex reasoning, restricted full-text access, and sparse target references among many candidates.

Method: A Deep Reinforcement Learning framework that emulates human knowledge construction to prioritize which papers to read under limited time and cost constraints.

Result: The approach demonstrates that both humans and machines can construct knowledge effectively from partial information (titles and abstracts only) in drug-gene relation discovery tasks.

Conclusion: The framework provides an effective solution for knowledge construction from sparse scientific literature, particularly useful when full-text access is restricted.

Abstract: The rapid expansion of scientific literature makes it increasingly difficult to acquire new knowledge, particularly in specialized domains where reasoning is complex, full-text access is restricted, and target references are sparse among a large set of candidates. We present a Deep Reinforcement Learning framework for sparse reference selection that emulates human knowledge construction, prioritizing which papers to read under limited time and cost. Evaluated on drug–gene relation discovery with access restricted to titles and abstracts, our approach demonstrates that both humans and machines can construct knowledge effectively from partial information.

[512] SPINN: An Optimal Self-Supervised Physics-Informed Neural Network Framework

Reza Pirayeshshirazinezhad

Main category: cs.LG

TL;DR: A machine learning surrogate model for predicting liquid sodium heat transfer in miniature heat sinks, combining physics-informed neural networks and transfer learning to achieve ~8% error.

DetailsMotivation: High-fidelity CFD modeling of turbulent forced convection for liquid metals is computationally expensive and time-consuming, necessitating efficient alternative tools for heat sink design optimization.

Method: Used kernel-based ML techniques and shallow neural networks on 87 Nusselt number data points, then implemented self-supervised physics-informed neural networks with adaptive physics weighting and transfer learning from water-trained models.

Result: Self-supervised physics-informed neural network achieved ~8% error margin, physics-only regression maintained 5-10% error, and other ML methods mostly stayed within ±8% prediction accuracy.

Conclusion: Machine learning models provide a powerful and efficient alternative to CFD for designing and optimizing liquid-metal-cooled miniature heat sinks, with physics-informed approaches showing particularly promising accuracy.

Abstract: A surrogate model is developed to predict the convective heat transfer coefficient of liquid sodium (Na) flow within rectangular miniature heat sinks. Initially, kernel-based machine learning techniques and a shallow neural network are applied to a dataset with 87 Nusselt numbers for liquid sodium in rectangular miniature heat sinks. Subsequently, a self-supervised physics-informed neural network and a transfer learning approach are used to increase the estimation performance. In the self-supervised physics-informed neural network, an additional layer determines the weight of the physics term in the loss function to balance data and physics based on their uncertainty for a better estimation. For transfer learning, a shallow neural network trained on water is adapted for use with Na. Validation results show that the self-supervised physics-informed neural network successfully estimates the heat transfer rates of Na with an error margin of approximately 8%. Using only physics for regression, the error remains between 5% and 10%. Other machine learning methods keep predictions mostly within ±8%. High-fidelity modeling of turbulent forced convection of liquid metals using computational fluid dynamics (CFD) is both time-consuming and computationally expensive. Therefore, machine learning based models offer a powerful alternative tool for the design and optimization of liquid-metal-cooled miniature heat sinks.
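
The "additional layer" that balances data and physics can be sketched with a common uncertainty-based weighting (learned log-variances); this is a stand-in for the paper's exact layer, which is not specified in the summary:

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Balance data and physics residual losses with learned log-variances;
    terms with high uncertainty are automatically down-weighted."""
    def __init__(self):
        super().__init__()
        self.log_var_data = nn.Parameter(torch.zeros(()))
        self.log_var_phys = nn.Parameter(torch.zeros(()))

    def forward(self, loss_data, loss_phys):
        return (torch.exp(-self.log_var_data) * loss_data
                + torch.exp(-self.log_var_phys) * loss_phys
                + self.log_var_data + self.log_var_phys)   # regularizer
```

The two log-variance parameters are optimized jointly with the network, so the data/physics trade-off adapts during training instead of being hand-tuned.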

[513] X-SQL: Expert Schema Linking and Understanding of Text-to-SQL with Multi-LLMs

Dazhi Peng

Main category: cs.LG

TL;DR: X-SQL is a novel Text-to-SQL framework that emphasizes database schema importance, featuring X-Linking for schema linking and X-Admin for schema understanding, achieving state-of-the-art performance on Spider benchmarks.

DetailsMotivation: The research community often overlooks the importance of database schema information for generating high-quality SQL queries, despite schema information playing a significant or even dominant role in Text-to-SQL tasks.

Method: Proposed a database schema expert with two components: X-Linking (LLM Supervised Finetuning-based method for schema linking) and X-Admin (for schema understanding by bridging abstract schema information with natural language questions). Used Multi-LLMs for different components within the system.

Result: Achieved Execution Accuracies of 84.9% on Spider-Dev dataset and 82.5% on Spider-Test dataset, establishing X-SQL as the leading Text-to-SQL framework based on open-source models.

Conclusion: X-SQL demonstrates that proper handling of database schema information through specialized components (X-Linking and X-Admin) combined with multi-LLM approach significantly boosts Text-to-SQL performance, setting new state-of-the-art results.

Abstract: With Large Language Models’ (LLMs) emergent abilities on code generation tasks, Text-to-SQL has become one of the most popular downstream applications. Despite the strong results of multiple recent LLM-based Text-to-SQL frameworks, the research community often overlooks the importance of database schema information for generating high-quality SQL queries. We find that such schema information plays a significant or even dominant role in the Text-to-SQL task. To tackle this challenge, we propose a novel database schema expert with two components. We first introduce X-Linking, an LLM Supervised Finetuning (SFT)-based method that achieves superior Schema Linking results compared to existing open-source Text-to-SQL methods. In addition, we innovatively propose an X-Admin component that focuses on Schema Understanding by bridging the gap between abstract schema information and the user’s natural language question. Aside from better learning with schema information, we experiment with Multi-LLMs for different components within the system to further boost its performance. By incorporating these techniques into our end-to-end framework, X-SQL, we have achieved Execution Accuracies of 84.9% on the Spider-Dev dataset and 82.5% on the Spider-Test dataset. This outstanding performance establishes X-SQL as the leading Text-to-SQL framework based on open-source models.

[514] Smoothed Online Optimization for Target Tracking: Robust and Learning-Augmented Algorithms

Ali Zeynali, Mahsa Sahebdel, Qingsong Liu, Mohammad Hajiesmaili, Ramesh K. Sitaraman

Main category: cs.LG

TL;DR: SOOTT framework integrates target tracking, adversarial perturbation, and switching costs for online decision-making under uncertainty, with applications in AI workload scheduling.

DetailsMotivation: Address real-world scenarios like elastic/inelastic workload scheduling in AI clusters where operators must balance long-term SLAs against sudden demand spikes.

Method: Propose BEST algorithm with competitive guarantees, then introduce CoRT - a learning-augmented variant that incorporates untrusted black-box predictions from ML models.

Result: CoRT strictly improves over BEST when predictions are accurate while maintaining robustness under arbitrary prediction errors. Validated through workload scheduling case study.

Conclusion: Both algorithms effectively balance trajectory tracking, decision smoothness, and resilience to external disturbances in practical applications.

Abstract: We introduce the Smoothed Online Optimization for Target Tracking (SOOTT) problem, a new framework that integrates three key objectives in online decision-making under uncertainty: (1) tracking cost for following a dynamically moving target, (2) adversarial perturbation cost for withstanding unpredictable disturbances, and (3) switching cost for penalizing abrupt changes in decisions. This formulation captures real-world scenarios such as elastic and inelastic workload scheduling in AI clusters, where operators must balance long-term service-level agreements (e.g., LLM training) against sudden demand spikes (e.g., real-time inference). We first present BEST, a robust algorithm with provable competitive guarantees for SOOTT. To enhance practical performance, we introduce CoRT, a learning-augmented variant that incorporates untrusted black-box predictions (e.g., from ML models) into its decision process. Our theoretical analysis shows that CoRT strictly improves over BEST when predictions are accurate, while maintaining robustness under arbitrary prediction errors. We validate our approach through a case study on workload scheduling, demonstrating that both algorithms effectively balance trajectory tracking, decision smoothness, and resilience to external disturbances.
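
The three cost terms compose per round; the notation below is illustrative (the paper's exact functional forms are more general):

```latex
% Per-round cost in SOOTT: track a moving target theta_t, absorb an
% adversarial perturbation w_t, and pay for changing the decision x_t.
c_t(x_t) \;=\; \underbrace{\lVert x_t - \theta_t \rVert^2}_{\text{tracking}}
\;+\; \underbrace{\langle w_t,\, x_t \rangle}_{\text{adversarial perturbation}}
\;+\; \underbrace{\beta\,\lVert x_t - x_{t-1} \rVert}_{\text{switching}} .
```

In the scheduling case study, x_t is the resource allocation, theta_t encodes long-term SLA targets, and w_t captures sudden demand spikes.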

[515] Unified Interaction Foundational Model (UIFM) for Predicting Complex User and System Behavior

Vignesh Ethiraj, Subhash Talluri

Main category: cs.LG

TL;DR: UIFM introduces composite tokenization to treat multi-attribute events as single semantic units, enabling better behavioral understanding than text-serialized foundation models.

DetailsMotivation: Current foundation models designed for natural language fail to grasp holistic structured interactions in domains like telecommunications, e-commerce and finance, losing critical context when serializing events into text.

Method: Unified Interaction Foundation Model (UIFM) uses composite tokenization principle where each multi-attribute event is treated as a single, semantically coherent unit to learn the underlying ‘grammar’ of user behavior.

Result: UIFM architecture is more accurate and represents a fundamental step towards creating more adaptable and intelligent predictive systems that perceive entire interactions rather than disconnected data streams.

Conclusion: Composite tokenization enables genuine behavioral understanding by preserving event context, making UIFM a significant advancement for AI systems that need to understand and predict complex, evolving sequences of events.

Abstract: A central goal of artificial intelligence is to build systems that can understand and predict complex, evolving sequences of events. However, current foundation models, designed for natural language, fail to grasp the holistic nature of structured interactions found in domains like telecommunications, e-commerce and finance. By serializing events into text, they disassemble them into semantically fragmented parts, losing critical context. In this work, we introduce the Unified Interaction Foundation Model (UIFM), a foundation model engineered for genuine behavioral understanding. At its core is the principle of composite tokenization, where each multi-attribute event is treated as a single, semantically coherent unit. This allows UIFM to learn the underlying “grammar” of user behavior, perceiving entire interactions rather than a disconnected stream of data points. We demonstrate that this architecture is not just more accurate, but represents a fundamental step towards creating more adaptable and intelligent predictive systems.
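
Composite tokenization is simple to sketch: embed each attribute of an event and combine the pieces into one token vector, so the event is never serialized into fragmented text. Attribute names, vocabulary sizes, and the sum-combination below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CompositeEventTokenizer(nn.Module):
    """Embed a multi-attribute event as ONE token by combining per-attribute
    embeddings, instead of serializing the event into text fragments."""
    def __init__(self, vocab_sizes, d_model=256):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(v, d_model)
                                     for v in vocab_sizes])

    def forward(self, event_ids):                  # (batch, n_attrs) int ids
        parts = [emb(event_ids[:, i]) for i, emb in enumerate(self.embeds)]
        return torch.stack(parts, dim=0).sum(0)    # (batch, d_model)

# e.g. attributes (event_type, channel, device) with small vocabularies
tok = CompositeEventTokenizer([12, 5, 8])
tokens = tok(torch.tensor([[3, 1, 7], [0, 4, 2]]))  # one token per event
```

Feeding these event tokens, rather than text tokens, to a standard Transformer is what lets the model attend over whole interactions.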

[516] PolicyEvolve: Evolving Programmatic Policies by LLMs for multi-player games via Population-Based Training

Mingrui Lv, Hangzhi Liu, Zhi Luo, Hongjie Zhang, Jie Ou

Main category: cs.LG

TL;DR: PolicyEvolve is a framework that uses LLMs to generate interpretable programmatic policies for multi-agent games, reducing computational costs and improving transparency compared to traditional MARL.

DetailsMotivation: Traditional MARL requires massive computational resources and produces black-box policies that lack interpretability, limiting practical deployment. LLMs have shown success in generating interpretable policies for single-agent tasks, inspiring their application to multi-agent settings.

Method: Framework with four modules: Global Pool (stores elite policies), Local Pool (temporary policies), Policy Planner (LLM-based policy generation and refinement), and Trajectory Critic (analyzes vulnerabilities and provides improvement feedback). Iterative process refines policies until they achieve high win rates.

Result: Significantly reduces reliance on manually crafted code and minimizes environmental interactions while achieving high-performance policies. Produces interpretable rule-based code with high execution efficiency.

Conclusion: PolicyEvolve provides an effective approach for generating interpretable programmatic policies in multi-agent games, addressing computational cost and transparency limitations of traditional MARL methods.

Abstract: Multi-agent reinforcement learning (MARL) has achieved significant progress in solving complex multi-player games through self-play. However, training effective adversarial policies requires millions of experience samples and substantial computational resources. Moreover, these policies lack interpretability, hindering their practical deployment. Recently, researchers have successfully leveraged Large Language Models (LLMs) to generate programmatic policies for single-agent tasks, transforming neural network-based policies into interpretable rule-based code with high execution efficiency. Inspired by this, we propose PolicyEvolve, a general framework for generating programmatic policies in multi-player games. PolicyEvolve significantly reduces reliance on manually crafted policy code, achieving high-performance policies with minimal environmental interactions. The framework comprises four modules: Global Pool, Local Pool, Policy Planner, and Trajectory Critic. The Global Pool preserves elite policies accumulated during iterative training. The Local Pool stores temporary policies for the current iteration; only sufficiently high-performing policies from this pool are promoted to the Global Pool. The Policy Planner serves as the core policy generation module. It samples the top three policies from the Global Pool, generates an initial policy for the current iteration based on environmental information, and refines this policy using feedback from the Trajectory Critic. Refined policies are then deposited into the Local Pool. This iterative process continues until the policy achieves a sufficiently high average win rate against the Global Pool, at which point it is integrated into the Global Pool. The Trajectory Critic analyzes interaction data from the current policy, identifies vulnerabilities, and proposes directional improvements to guide the Policy Planner.
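
The loop structure can be summarized in a few lines; all call signatures (llm, env, critic) and the win_rate attribute are hypothetical stand-ins for the four modules, not the paper's API:

```python
def policy_evolve(llm, env, critic, n_iters=50, win_threshold=0.8):
    """High-level PolicyEvolve loop: generate a policy from the Global
    Pool's elites, refine it with Trajectory Critic feedback, and promote
    it when it beats the pool consistently."""
    global_pool = []                                   # elite policies
    for _ in range(n_iters):
        elites = sorted(global_pool, key=lambda p: p.win_rate, reverse=True)[:3]
        policy = llm.generate(env.description(), examples=elites)
        for _ in range(3):                             # Local Pool refinement
            trajectories = env.rollout(policy, opponents=global_pool)
            policy = llm.refine(policy, critic.analyze(trajectories))
        if policy.win_rate >= win_threshold:           # promote to Global Pool
            global_pool.append(policy)
    return global_pool
```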

[517] A novel biomass fluidized bed gasification model coupled with machine learning and CFD simulation

Chun Wang

Main category: cs.LG

TL;DR: Machine learning and CFD coupling model for biomass gasification improves prediction accuracy and computational efficiency

DetailsMotivation: To enhance prediction accuracy and computational efficiency for complex thermochemical reaction processes in biomass fluidized bed gasification

Method: Constructed a high-quality dataset from experimental data and high-fidelity simulations, trained a surrogate model for the reaction kinetics, and embedded it into the CFD framework for real-time reaction rate and composition updates

Result: Proposed coupling model enables improved computational efficiency while maintaining accuracy in predicting biomass gasification processes

Conclusion: The ML-CFD coupling approach successfully addresses the trade-off between accuracy and computational cost in modeling complex thermochemical reactions

Abstract: A coupling model of biomass fluidized bed gasification based on machine learning and computational fluid dynamics (CFD) is proposed to improve the prediction accuracy and computational efficiency of complex thermochemical reaction processes. By constructing a high-quality dataset from experimental data and high-fidelity simulation results, a surrogate model describing the reaction kinetics was trained and embedded into the CFD framework to enable real-time updates of reaction rates and composition evolution.

[518] ARIES: Relation Assessment and Model Recommendation for Deep Time Series Forecasting

Fei Wang, Yujie Li, Zezhi Shao, Chengqing Yu, Yisong Fu, Zhulin An, Yongjun Xu, Xueqi Cheng

Main category: cs.LG

TL;DR: ARIES is a framework that establishes relationships between time series properties and modeling strategies, and provides model recommendations for time series forecasting.

DetailsMotivation: Existing benchmark datasets lack diverse temporal patterns, and there's no effective model recommendation approach, leading to high costs when testing different architectures across applications.

Method: Constructed synthetic dataset with multiple patterns, designed comprehensive system to compute time series properties, benchmarked over 50 forecasting models, and established relationships between properties and modeling strategies.

Result: Experimental results revealed clear correlations between time series properties and modeling strategies, enabling the creation of an interpretable model recommendation system.

Conclusion: ARIES is the first study to establish relationships between time series data properties and modeling strategies while implementing a practical model recommendation system.

Abstract: Recent advancements in deep learning models for time series forecasting have been significant. These models often leverage fundamental time series properties such as seasonality and non-stationarity, which may suggest an intrinsic link between model performance and data properties. However, existing benchmark datasets fail to offer diverse and well-defined temporal patterns, restricting the systematic evaluation of such connections. Additionally, there is no effective model recommendation approach, leading to high time and cost expenditures when testing different architectures across different downstream applications. For those reasons, we propose ARIES, a framework for assessing the relation between time series properties and modeling strategies, and for recommending deep forecasting models for realistic time series. First, we construct a synthetic dataset with multiple distinct patterns, and design a comprehensive system to compute the properties of time series. Next, we conduct an extensive benchmarking of over 50 forecasting models, and establish the relationship between time series properties and modeling strategies. Our experimental results reveal a clear correlation. Based on these findings, we propose the first deep forecasting model recommender, capable of providing interpretable suggestions for real-world time series. In summary, ARIES is the first study to establish the relations between the properties of time series data and modeling strategies, while also implementing a model recommendation system. The code is available at: https://github.com/blisky-li/ARIES.
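
Property computation is the glue between data and recommendation; two representative properties (seasonal strength via an STL decomposition, stationarity via an ADF test) can be computed as below. The paper's property system is far richer; this is only a flavor:

```python
import numpy as np
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.stattools import adfuller

def series_properties(y, period):
    """Seasonal strength (Hyndman-style, from an STL decomposition) and an
    ADF stationarity p-value for a 1-D series y with known period."""
    res = STL(y, period=period).fit()
    seasonal_strength = max(0.0, 1.0 - np.var(res.resid)
                            / np.var(res.resid + res.seasonal))
    return {"seasonal_strength": seasonal_strength,
            "adf_pvalue": adfuller(y)[1]}
```

A recommender can then map such property vectors to the benchmarked model families, which is what makes its suggestions interpretable.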

[519] A Surrogate model for High Temperature Superconducting Magnets to Predict Current Distribution with Neural Network

Mianjun Xiao, Peng Song, Yulong Liu, Cedric Korte, Ziyang Xu, Jiale Gao, Jiaqi Lu, Haoyang Nie, Qiantong Deng, Timing Qu

Main category: cs.LG

TL;DR: A surrogate model using residual neural networks predicts current density in superconducting magnets much faster than traditional FEM simulations, with good accuracy and extrapolation capability.

DetailsMotivation: Traditional finite element method (FEM) becomes computationally expensive for large-scale high-temperature superconducting magnets, limiting rapid design optimization.

Method: Developed a fully connected residual neural network (FCRN) trained on FEM simulation data to predict space-time current density distribution in REBCO solenoids.

Result: FCRN with 12 residual blocks and 256 neurons per layer achieved best performance, predicting magnetization losses with <10% error even 50% beyond training range, and is orders of magnitude faster than FEM.

Conclusion: The FCRN-based surrogate model provides accurate and efficient prediction capabilities, enabling rapid analysis of large-scale HTS magnets for design optimization.

Abstract: Finite element method (FEM) is widely used in high-temperature superconducting (HTS) magnets, but its computational cost increases with magnet size and becomes time-consuming for meter-scale magnets, especially when multi-physics couplings are considered, which limits the fast design of large-scale REBCO magnet systems. In this work, a surrogate model based on a fully connected residual neural network (FCRN) is developed to predict the space-time current density distribution in REBCO solenoids. Training datasets were generated from FEM simulations with varying numbers of turns and pancakes. The results demonstrate that, for deeper networks, the FCRN architecture achieves better convergence than conventional fully connected network (FCN), with the configuration of 12 residual blocks and 256 neurons per layer providing the most favorable balance between training accuracy and generalization capability. Extrapolation studies show that the model can reliably predict magnetization losses for up to 50% beyond the training range, with maximum errors below 10%. The surrogate model achieves predictions several orders of magnitude faster than FEM and still remains advantageous when training costs are included. These results indicate that the proposed FCRN-based surrogate model provides both accuracy and efficiency, offering a promising tool for the rapid analysis of large-scale HTS magnets.
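
The FCRN itself is a plain residual MLP; the configuration reported as best is 12 residual blocks with 256 neurons per layer. A sketch (input and output sizes depend on how the space-time current density is flattened, so they are left as parameters):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, width):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(width, width), nn.ReLU(),
                                 nn.Linear(width, width))

    def forward(self, x):
        return torch.relu(x + self.net(x))          # skip connection

class FCRN(nn.Module):
    """Fully connected residual network; 12 blocks of width 256 is the
    configuration the paper found most favorable."""
    def __init__(self, d_in, d_out, width=256, n_blocks=12):
        super().__init__()
        self.inp = nn.Linear(d_in, width)
        self.blocks = nn.Sequential(*[ResidualBlock(width)
                                      for _ in range(n_blocks)])
        self.out = nn.Linear(width, d_out)

    def forward(self, x):
        return self.out(self.blocks(torch.relu(self.inp(x))))
```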

[520] Teaching Precommitted Agents: Model-Free Policy Evaluation and Control in Quasi-Hyperbolic Discounted MDPs

S. R. Eshwar

Main category: cs.LG

TL;DR: This paper develops theoretical foundations and practical algorithms for reinforcement learning with quasi-hyperbolic discounting preferences, proving optimal policies have simple one-step non-stationary form and creating model-free algorithms with convergence guarantees.

DetailsMotivation: Time-inconsistent preferences (smaller-sooner over larger-later rewards) are common in human/animal decision-making but quasi-hyperbolic discounting hasn't been well integrated into reinforcement learning frameworks.

Method: Theoretical analysis of optimal policy structure and development of model-free algorithms for both policy evaluation and Q-learning with quasi-hyperbolic preferences.

Result: Proved optimal policy reduces to simple one-step non-stationary form, and created first practical model-free algorithms with provable convergence guarantees.

Conclusion: Provides foundational insights and practical tools for incorporating quasi-hyperbolic preferences in reinforcement learning, addressing key theoretical and algorithmic gaps.

Abstract: Time-inconsistent preferences, where agents favor smaller-sooner over larger-later rewards, are a key feature of human and animal decision-making. Quasi-Hyperbolic (QH) discounting provides a simple yet powerful model for this behavior, but its integration into the reinforcement learning (RL) framework has been limited. This paper addresses key theoretical and algorithmic gaps for precommitted agents with QH preferences. We make two primary contributions: (i) we formally characterize the structure of the optimal policy, proving for the first time that it reduces to a simple one-step non-stationary form; and (ii) we design the first practical, model-free algorithms for both policy evaluation and Q-learning in this setting, both with provable convergence guarantees. Our results provide foundational insights for incorporating QH preferences in RL.
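
Quasi-hyperbolic ("beta-delta") discounting weights the reward stream as 1, βγ, βγ², ..., so only the immediate reward escapes the present-bias factor β. A one-liner makes the preference concrete:

```python
def qh_return(rewards, beta=0.7, gamma=0.99):
    """Quasi-hyperbolic discounted return: weights 1, beta*gamma,
    beta*gamma^2, ... (beta < 1 encodes present bias)."""
    return rewards[0] + beta * sum(r * gamma ** (t + 1)
                                   for t, r in enumerate(rewards[1:]))
```

The paper's structural result mirrors this shape: the optimal precommitted policy uses one special first-step decision rule followed by a stationary policy thereafter.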

[521] If generative AI is the answer, what is the question?

Ambuj Tewari

Main category: cs.LG

TL;DR: A comprehensive survey of generative AI foundations, covering five major model families, probabilistic frameworks, game-theoretic approaches, deployment considerations, and social responsibility aspects.

DetailsMotivation: To establish generation as a distinct machine learning task and provide a systematic framework for understanding what generation is as a problem, rather than just focusing on model implementations.

Method: Survey of five generative model families (autoregressive, variational autoencoders, normalizing flows, GANs, diffusion models), probabilistic frameworks distinguishing density estimation from generation, game-theoretic analysis with adversary-learner setup, and post-training deployment modifications.

Result: A comprehensive framework that connects generation to prediction, compression, and decision-making, while providing systematic analysis of different generative approaches and their theoretical foundations.

Conclusion: Generation should be understood as a distinct ML task with important social responsibility considerations including privacy, AI content detection, and copyright/IP issues, adopting a task-first perspective rather than model-centric view.

Abstract: Beginning with text and images, generative AI has expanded to audio, video, computer code, and molecules. Yet, if generative AI is the answer, what is the question? We explore the foundations of generation as a distinct machine learning task with connections to prediction, compression, and decision-making. We survey five major generative model families: autoregressive models, variational autoencoders, normalizing flows, generative adversarial networks, and diffusion models. We then introduce a probabilistic framework that emphasizes the distinction between density estimation and generation. We review a game-theoretic framework with a two-player adversary-learner setup to study generation. We discuss post-training modifications that prepare generative models for deployment. We end by highlighting some important topics in socially responsible generation such as privacy, detection of AI-generated content, and copyright and IP. We adopt a task-first framing of generation, focusing on what generation is as a machine learning problem, rather than only on how models implement it.

[522] Data-Efficient Time-Dependent PDE Surrogates: Graph Neural Simulators vs Neural Operators

Dibyajyoti Nayak, Somdatta Goswami

Main category: cs.LG

TL;DR: Graph Neural Simulators (GNS) with explicit time-stepping schemes significantly improve data efficiency and reduce error accumulation for learning PDE solutions, outperforming neural operator baselines with only 3% training data.

DetailsMotivation: Neural operators require large datasets and struggle with scarce training data, while many formulations don't encode causal, local-in-time structure of physical evolution. Autoregressive models preserve causality but suffer from error accumulation.

Method: Employ Graph Neural Simulators (GNS) - a message-passing graph neural network framework - with explicit numerical time-stepping schemes to model instantaneous time derivatives. Uses PCA+KMeans trajectory selection strategy for enhanced low-data performance.

Result: GNS achieves under 1% relative L2 errors with only 30 training samples (3% of data) across three PDE systems. Reduces autoregressive error by 82.48% relative to FNO AR and 99.86% relative to DON AR. Substantially reduces error accumulation over extended temporal horizons.

Conclusion: Combining graph-based local inductive biases with conventional time integrators yields accurate, physically consistent, and scalable surrogate models for time-dependent PDEs with high data efficiency.

Abstract: Neural operators (NOs) approximate mappings between infinite-dimensional function spaces but require large datasets and struggle with scarce training data. Many NO formulations don’t explicitly encode causal, local-in-time structure of physical evolution. While autoregressive models preserve causality by predicting next time-steps, they suffer from rapid error accumulation. We employ Graph Neural Simulators (GNS) - a message-passing graph neural network framework - with explicit numerical time-stepping schemes to construct accurate forward models that learn PDE solutions by modeling instantaneous time derivatives. We evaluate our framework on three canonical PDE systems: (1) 2D Burgers’ scalar equation, (2) 2D coupled Burgers’ vector equation, and (3) 2D Allen-Cahn equation. Rigorous evaluations demonstrate GNS significantly improves data efficiency, achieving higher generalization accuracy with substantially fewer training trajectories compared to neural operator baselines like DeepONet and FNO. GNS consistently achieves under 1% relative L2 errors with only 30 training samples out of 1000 (3% of available data) across all three PDE systems. It substantially reduces error accumulation over extended temporal horizons: averaged across all cases, GNS reduces autoregressive error by 82.48% relative to FNO AR and 99.86% relative to DON AR. We introduce a PCA+KMeans trajectory selection strategy enhancing low-data performance. Results indicate combining graph-based local inductive biases with conventional time integrators yields accurate, physically consistent, and scalable surrogate models for time-dependent PDEs.
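
Modeling the instantaneous derivative and integrating with an explicit scheme is what separates this setup from direct next-step prediction. A sketch of the rollout (forward Euler shown; higher-order schemes slot in the same way, and model here stands in for the message-passing GNN):

```python
import torch

def rollout(model, u0, dt, n_steps):
    """Autoregressive rollout: the network predicts du/dt, and an explicit
    forward-Euler step advances the state."""
    u, traj = u0, [u0]
    for _ in range(n_steps):
        u = u + dt * model(u)          # u_{t+1} = u_t + dt * f_theta(u_t)
        traj.append(u)
    return torch.stack(traj)
```

Because the network only has to fit a local-in-time derivative, far fewer training trajectories suffice, which is where the reported data efficiency comes from.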

[523] Tracking daily paths in home contexts with RSSI fingerprinting based on UWB through deep learning models

Aurora Polo-Rodríguez, Juan Carlos Valera, Jesús Peral, David Gil, Javier Medina-Quero

Main category: cs.LG

TL;DR: UWB-based indoor tracking using deep learning with RSSI fingerprinting achieves ~50cm accuracy for human activity recognition in homes.

DetailsMotivation: UWB location tracking is affected by walls and obstacles, reducing precision in real home environments. The study aims to improve indoor tracking accuracy for daily activity recognition.

Method: Fingerprinting approach using RSSI data from UWB and Bluetooth in two flats. Compared CNN, LSTM, and hybrid CNN+LSTM models with different temporal windows (future, past, combined).

Result: Achieved mean absolute error close to 50 cm. Hybrid CNN+LSTM model showed superior performance in accurate location estimation.

Conclusion: The hybrid deep learning model with UWB RSSI fingerprinting enables accurate indoor tracking, facilitating practical human activity recognition applications in residential settings.

Abstract: The field of human activity recognition has evolved significantly, driven largely by advancements in Internet of Things (IoT) device technology, particularly in personal devices. This study investigates the use of ultra-wideband (UWB) technology for tracking inhabitant paths in home environments using deep learning models. UWB technology estimates user locations via time-of-flight and time-difference-of-arrival methods, which are significantly affected by the presence of walls and obstacles in real environments, reducing their precision. To address these challenges, we propose a fingerprinting-based approach utilizing received signal strength indicator (RSSI) data collected from inhabitants in two flats (60 m² and 100 m²) while performing daily activities. We compare the performance of convolutional neural network (CNN), long short-term memory (LSTM), and hybrid CNN+LSTM models, as well as the use of Bluetooth technology. Additionally, we evaluate the impact of the type and duration of the temporal window (future, past, or a combination of both). Our results demonstrate a mean absolute error close to 50 cm, highlighting the superiority of the hybrid model in providing accurate location estimates, thus facilitating its application in daily human activity recognition in residential settings.
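
A hybrid CNN+LSTM over RSSI windows is straightforward to sketch: convolutions extract per-timestep features across anchors, the LSTM models the temporal window, and a linear head regresses position. Layer sizes below are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class CnnLstm(nn.Module):
    """Hybrid CNN+LSTM over RSSI windows (n_anchors channels, T time steps)
    regressing a 2-D position."""
    def __init__(self, n_anchors, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv1d(n_anchors, 32, 3, padding=1),
                                 nn.ReLU(),
                                 nn.Conv1d(32, 32, 3, padding=1), nn.ReLU())
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)           # (x, y) in metres

    def forward(self, rssi):                       # (batch, n_anchors, T)
        feats = self.cnn(rssi).transpose(1, 2)     # (batch, T, 32)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])               # position at window end
```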

[524] An Improved Template for Approximate Computing

M. Rezaalipour, F. Costa, M. Biasion, R. Otoni, G. A. Constantinides, L. Pozzi

Main category: cs.LG

TL;DR: A methodology to reduce area of neural network arithmetic operators (adders/multipliers) via approximate computing, achieving better area savings with same accuracy loss compared to state-of-the-art approaches.

DetailsMotivation: Deploying neural networks on edge devices requires balancing energy consumption and accuracy. Approximate computing can reduce energy by slightly sacrificing arithmetic operator accuracy.

Method: Improves boolean rewriting technique (XPAT) with parametrisable template for circuit rewriting. Proposes novel template based on parametrisable product sharing that acts as proxy for synthesized area.

Result: Methodology converges better to low-area solutions and finds better approximations than original XPAT and two other state-of-the-art approaches.

Conclusion: The proposed template-based approach with product sharing provides superior area savings while maintaining comparable accuracy loss for neural network arithmetic operators on edge devices.

Abstract: Deploying neural networks on edge devices entails a careful balance between the energy required for inference and the accuracy of the resulting classification. One technique for navigating this tradeoff is approximate computing: the process of reducing energy consumption by slightly reducing the accuracy of arithmetic operators. In this context, we propose a methodology to reduce the area of the small arithmetic operators used in neural networks - i.e., adders and multipliers - via a small loss in accuracy, and show that we improve area savings for the same accuracy loss w.r.t. the state of the art. To achieve our goal, we improve on a boolean rewriting technique recently proposed, called XPAT, where the use of a parametrisable template to rewrite circuits has proved to be highly beneficial. In particular, XPAT was able to produce smaller circuits than comparable approaches while utilising a naive sum of products template structure. In this work, we show that template parameters can act as proxies for chosen metrics and we propose a novel template based on parametrisable product sharing that acts as a close proxy to synthesised area. We demonstrate experimentally that our methodology converges better to low-area solutions and that it can find better approximations than both the original XPAT and two other state-of-the-art approaches.

[525] Exploring Urban Factors with Autoencoders: Relationship Between Static and Dynamic Features

Ximena Pocco, Waqar Hassan, Karelia Salinas, Vladimir Molchanov, Luis G. Nonato

Main category: cs.LG

TL;DR: Visual analytics framework compares fused vs separate latent representations of urban data, finding fused representations produce more structured patterns while separate ones have specific use cases.

DetailsMotivation: Urban analytics faces challenges with granular, heterogeneous, multimodal data. While visualization tools exist for exploring fused data representations, there's limited understanding of whether fused representations provide deeper insights than examining data sources separately.

Method: Developed a visualization-assisted framework to analyze the effectiveness of fused latent data representations versus separate representations for uncovering patterns from dynamic and static urban data.

Result: Combined latent representations produce more structured patterns, while separate representations are useful in particular cases.

Conclusion: Fused latent representations are generally more effective for pattern discovery in urban analytics, though separate representations still have value in specific scenarios.

Abstract: Urban analytics utilizes extensive datasets with diverse urban information to simulate, predict trends, and uncover complex patterns within cities. While these data enable advanced analysis, they also present challenges due to their granularity, heterogeneity, and multimodality. To address these challenges, visual analytics tools have been developed to support the exploration of latent representations of fused heterogeneous and multimodal data, discretized at a street level of detail. However, visualization-assisted tools seldom explore the extent to which fused data can offer deeper insights than examining each data source independently within an integrated visualization framework. In this work, we developed a visualization-assisted framework to analyze whether fused latent data representations are more effective than separate representations in uncovering patterns from dynamic and static urban data. The analysis reveals that combined latent representations produce more structured patterns, while separate ones are useful in particular cases.

[526] Reasoning Language Model for Personalized Lung Cancer Screening

Chuang Niu, Ge Wang

Main category: cs.LG

TL;DR: A reasoning language model that integrates radiology findings with medical records for improved lung cancer risk assessment, outperforming Lung-RADS by considering multiple risk factors through chain-of-thought reasoning.

DetailsMotivation: Lung-RADS has limitations in sensitivity and specificity as it only considers lung nodule characteristics without incorporating various risk factors, creating a need for more comprehensive and individualized risk assessment in lung cancer screening.

Method: Developed a reasoning language model (RLM) that integrates radiology findings with longitudinal medical records through systematic dataset construction, supervised fine-tuning, reinforcement learning, and chain-of-thought reasoning to decompose risk evaluation into sub-components.

Result: Significant improvements in risk prediction performance on national lung screening trial datasets, with enhanced predictive accuracy and monitorability through the reasoning process.

Conclusion: The proposed RLM approach facilitates clinical translation into lung cancer screening by providing more accurate and transparent risk assessment that considers multiple factors beyond just nodule characteristics.

Abstract: Accurate risk assessment in lung cancer screening is critical for enabling early cancer detection and minimizing unnecessary invasive procedures. The Lung CT Screening Reporting and Data System (Lung-RADS) has been widely used as the standard framework for patient management and follow-up. Nevertheless, Lung-RADS faces trade-offs between sensitivity and specificity, as it stratifies risk solely based on lung nodule characteristics without incorporating various risk factors. Here we propose a reasoning language model (RLM) to integrate radiology findings with longitudinal medical records for individualized lung cancer risk assessment. Through a systematic study including dataset construction and distillation, supervised fine-tuning, reinforcement learning, and comprehensive evaluation, our model makes significant improvements in risk prediction performance on datasets from the National Lung Screening Trial. Notably, RLM can decompose the risk evaluation task into sub-components, analyze the contributions of diverse risk factors, and synthesize them into a final risk score computed using our data-driven system equation. Our approach improves both predictive accuracy and monitorability through the chain of thought reasoning process, thereby facilitating clinical translation into lung cancer screening.

[527] Nonnegative matrix factorization and the principle of the common cause

E. Khalafyan, A. E. Allahverdyan, A. Hovhannisyan

Main category: cs.LG

TL;DR: NMF and PCC are shown to be closely related. PCC yields a stable estimate of the NMF rank, around which NMF produces features that resolve nonidentifiability, while NMF enables an approximate PCC implementation for clustering and denoising.

DetailsMotivation: To explore the reciprocal relationship between Nonnegative Matrix Factorization (NMF) and the Principle of Common Cause (PCC) in probabilistic causality, and leverage this connection for robust data analysis.

Method: Used PCC as predictability tool for robust NMF rank estimation, implemented NMF around this stable rank, and developed clustering method where data points with same common cause are grouped together.

Result: PCC-based rank estimation is stable against weak noise, NMF features become stable against noise and optimization seeds (resolving nonidentifiability), and NMF enables effective PCC implementation for clustering and denoising.

Conclusion: The reciprocal relationship between NMF and PCC provides mutual benefits - PCC enables robust NMF implementation while NMF facilitates approximate PCC application, leading to stable feature extraction, effective clustering, and data denoising capabilities.

Abstract: Nonnegative matrix factorization (NMF) is a known unsupervised data-reduction method. The principle of the common cause (PCC) is a basic methodological approach in probabilistic causality, which seeks an independent mixture model for the joint probability of two dependent random variables. It turns out that these two concepts are closely related. This relationship is explored reciprocally for several datasets of gray-scale images, which are conveniently mapped into probability models. On one hand, PCC provides a predictability tool that leads to a robust estimation of the effective rank of NMF. Unlike other estimates (e.g., those based on the Bayesian Information Criteria), our estimate of the rank is stable against weak noise. We show that NMF implemented around this rank produces features (basis images) that are also stable against noise and against seeds of local optimization, thereby effectively resolving the NMF nonidentifiability problem. On the other hand, NMF provides an interesting possibility of implementing PCC in an approximate way, where larger and positively correlated joint probabilities tend to be explained better via the independent mixture model. We work out a clustering method, where data points with the same common cause are grouped into the same cluster. We also show how NMF can be employed for data denoising.
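
Illustrative sketch: the abstract ties rank selection to stability, so the toy code below probes NMF rank by measuring how reproducible the basis vectors are across optimization seeds. This is a simplified seed-stability proxy, not the paper's PCC-based predictability criterion; the rank grid, matching score, and random data are all assumptions.

```python
# Probe NMF rank stability across random seeds (a proxy, not the paper's
# PCC criterion): at a well-chosen rank, bases from different seeds should
# match each other closely.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((100, 64))  # stand-in for vectorized gray-scale images

def basis(X, rank, seed):
    model = NMF(n_components=rank, init="random", random_state=seed, max_iter=500)
    model.fit(X)
    H = model.components_
    return H / (np.linalg.norm(H, axis=1, keepdims=True) + 1e-12)

def stability(X, rank, seeds=(0, 1, 2)):
    bases = [basis(X, rank, s) for s in seeds]
    scores = []
    for i in range(len(bases)):
        for j in range(i + 1, len(bases)):
            sim = bases[i] @ bases[j].T            # cosine similarities
            scores.append(sim.max(axis=1).mean())  # greedy best-match score
    return float(np.mean(scores))

for rank in (2, 4, 8, 16):
    print(rank, round(stability(X, rank), 3))
```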

[528] Toward a Metrology for Artificial Intelligence: Hidden-Rule Environments and Reinforcement Learning

Christo Mathew, Wentian Wang, Lazaros Gallos, Paul Kantor, Vladimir Menkov, Hao Wang

Main category: cs.LG

TL;DR: Transformer-based A2C agent learns to infer hidden rules in Game Of Hidden Rules puzzle using Feature-Centric and Object-Centric state representations

DetailsMotivation: To develop an agent that can simultaneously infer hidden governing rules and learn optimal policies in complex puzzle environments with partial observations

Method: Transformer-based Advantage Actor-Critic (A2C) algorithm with two state representation strategies: Feature-Centric (FC) and Object-Centric (OC)

Result: Models evaluated across multiple rule-based and trial-list-based experimental setups, analyzing transfer effects and representation impact on learning efficiency

Conclusion: The approach demonstrates effective rule inference and policy learning in complex hidden rule environments, with representation strategy affecting learning efficiency

Abstract: We investigate reinforcement learning in the Game Of Hidden Rules (GOHR) environment, a complex puzzle in which an agent must infer and execute hidden rules to clear a 6$\times$6 board by placing game pieces into buckets. We explore two state representation strategies, namely Feature-Centric (FC) and Object-Centric (OC), and employ a Transformer-based Advantage Actor-Critic (A2C) algorithm for training. The agent has access only to partial observations and must simultaneously infer the governing rule and learn the optimal policy through experience. We evaluate our models across multiple rule-based and trial-list-based experimental setups, analyzing transfer effects and the impact of representation on learning efficiency.

[529] Metric Embedding Initialization-Based Differentially Private and Explainable Graph Clustering

Haochen You, Baojing Liu

Main category: cs.LG

TL;DR: A differentially private graph clustering method using HST-based metric embedding initialization to achieve better performance while ensuring privacy.

DetailsMotivation: Current graph clustering under differential privacy faces challenges like high noise, low efficiency, and poor interpretability, which limit field development.

Method: Construct SDP optimization, extract key set, use HST-based initialization for well-initialized clustering configuration, then apply k-median clustering with comparative explanations through cluster center differences.

Result: Extensive experiments on public datasets show the framework outperforms existing methods in various clustering metrics while strictly ensuring privacy.

Conclusion: The proposed approach successfully addresses key challenges in differentially private graph clustering by providing better performance, efficiency, and interpretability while maintaining strict privacy guarantees.

Abstract: Graph clustering under the framework of differential privacy, which aims to process graph-structured data while protecting individual privacy, has been receiving increasing attention. Despite significant achievements in current research, challenges such as high noise, low efficiency and poor interpretability continue to severely constrain the development of this field. In this paper, we construct a differentially private and interpretable graph clustering approach based on metric embedding initialization. Specifically, we construct an SDP optimization, extract the key set and provide a well-initialized clustering configuration using an HST-based initialization method. Subsequently, we apply an established k-median clustering strategy to derive the cluster results and offer comparative explanations for the query set through differences from the cluster centers. Extensive experiments on public datasets demonstrate that our proposed framework outperforms existing methods in various clustering metrics while strictly ensuring privacy.
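
Illustrative sketch: the final step of the pipeline, explaining a query point via its offset from the assigned cluster center, can be shown with a plain (non-private) Lloyd-style k-median. The SDP construction, HST initialization, and differential privacy noise are omitted; the data and k are toy assumptions.

```python
# Plain k-median (L1 assignment, coordinate-wise median centers) plus the
# "explain by difference from cluster center" step.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(4, 1, (50, 4))])

def kmedian(X, k=2, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.abs(X[:, None] - centers[None]).sum(-1).argmin(1)  # L1 assign
        centers = np.array([np.median(X[labels == c], axis=0) for c in range(k)])
    return centers, labels

centers, labels = kmedian(X)
q, c = X[0], labels[0]                      # a query point and its cluster
print("assigned cluster:", c)
print("explanation (feature-wise offset from its center):", q - centers[c])
```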

[530] UrbanMIMOMap: A Ray-Traced MIMO CSI Dataset with Precoding-Aware Maps and Benchmarks

Honggang Jia, Xiucheng Wang, Nan Cheng, Ruijin Sun, Changle Li

Main category: cs.LG

TL;DR: UrbanMIMOMap dataset provides large-scale urban MIMO CSI data using ray tracing for 6G environment-aware communication research, addressing limitations of existing SISO-focused datasets.

DetailsMotivation: 6G systems require environment-aware communication with integrated sensing, but existing datasets are limited to SISO and path loss data, insufficient for advanced MIMO systems needing detailed channel state information.

Method: Created UrbanMIMOMap dataset using high-precision ray tracing to generate comprehensive complex CSI matrices across dense spatial grids in urban environments.

Result: Produced a large-scale MIMO CSI dataset that goes beyond traditional path loss data, enabling high-fidelity radio map construction and serving as a fundamental resource for data-driven methods including deep learning.

Conclusion: UrbanMIMOMap provides crucial dataset for high-precision radio map generation, MIMO spatial performance research, and machine learning applications in 6G environment awareness, with demonstrated utility through baseline ML evaluations.

Abstract: Sixth generation (6G) systems require environment-aware communication, driven by native artificial intelligence (AI) and integrated sensing and communication (ISAC). Radio maps (RMs), providing spatially continuous channel information, are key enablers. However, generating high-fidelity RM ground truth via electromagnetic (EM) simulations is computationally intensive, motivating machine learning (ML)-based RM construction. The effectiveness of these data-driven methods depends on large-scale, high-quality training data. Current public datasets often focus on single-input single-output (SISO) and limited information, such as path loss, which is insufficient for advanced multi-input multi-output (MIMO) systems requiring detailed channel state information (CSI). To address this gap, this paper presents UrbanMIMOMap, a novel large-scale urban MIMO CSI dataset generated using high-precision ray tracing. UrbanMIMOMap offers comprehensive complex CSI matrices across a dense spatial grid, going beyond traditional path loss data. This rich CSI is vital for constructing high-fidelity RMs and serves as a fundamental resource for data-driven RM generation, including deep learning. We demonstrate the dataset’s utility through baseline performance evaluations of representative ML methods for RM construction. This work provides a crucial dataset and reference for research in high-precision RM generation, MIMO spatial performance, and ML for 6G environment awareness. The code and data for this work are available at: https://github.com/UNIC-Lab/UrbanMIMOMap.

[531] IPR: Intelligent Prompt Routing with User-Controlled Quality-Cost Trade-offs

Aosong Feng, Zhichao Xu, Xian Wu, Kang Zhou, Sheng Guan, Yueyan Chen, Ninad Kulkarni, Yun Zhou, Balasubramaniam Srinivasan, Haibo Ding, Lin Lee Cheong

Main category: cs.LG

TL;DR: IPR is an intelligent prompt routing framework that dynamically selects optimal LLMs based on predicted response quality and user tolerance levels, achieving 43.9% cost reduction while maintaining quality parity.

DetailsMotivation: Optimizing performance-cost trade-offs for large-scale commercial LLM systems by routing queries to the most cost-effective model while maintaining response quality.

Method: Modular architecture with lightweight quality estimators trained on 1.5M prompts, user-controlled routing with tolerance parameter τ, and extensible design using frozen encoders with model-specific adapters.

Result: 43.9% cost reduction while maintaining quality parity with the strongest Claude model, with sub-150ms latency on a major cloud platform deployment.

Conclusion: IPR provides an effective framework for quality-constrained intelligent routing that significantly reduces costs while preserving response quality, with rapid model integration capabilities.

Abstract: Routing incoming queries to the most cost-effective LLM while maintaining response quality poses a fundamental challenge in optimizing performance-cost trade-offs for large-scale commercial systems. We present IPR, a quality-constrained Intelligent Prompt Routing framework that dynamically selects optimal models based on predicted response quality and user-specified tolerance levels. IPR introduces three key innovations: (1) a modular architecture with lightweight quality estimators trained on 1.5M prompts annotated with calibrated quality scores, enabling fine-grained quality prediction across model families; (2) a user-controlled routing mechanism with tolerance parameter $\tau \in [0,1]$ that provides explicit control over quality-cost trade-offs; and (3) an extensible design using frozen encoders with model-specific adapters, reducing new model integration from days to hours. To rigorously train and evaluate IPR, we curate an industrial-level dataset IPRBench (to be released upon legal approval), a comprehensive benchmark containing 1.5 million examples with response quality annotations across 11 LLM candidates. Deployed on a major cloud platform, IPR achieves 43.9% cost reduction while maintaining quality parity with the strongest model in the Claude family and processes requests with sub-150ms latency.
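
Illustrative sketch: one plausible reading of the tolerance mechanism is "cheapest model whose predicted quality is within a $\tau$-fraction of the best predicted quality". The paper's exact semantics of $\tau$ may differ; `predict_quality`, the model list, and the costs below are hypothetical stand-ins for the trained quality estimators.

```python
# Tolerance-based routing under an assumed reading of tau: trade quality
# for cost, but never fall more than tau below the best available quality.
def route(prompt, models, predict_quality, tau=0.2):
    """Pick the cheapest model whose predicted quality is within a
    tau-fraction of the best predicted quality for this prompt."""
    scores = {m["name"]: predict_quality(prompt, m["name"]) for m in models}
    best = max(scores.values())
    ok = [m for m in models if scores[m["name"]] >= (1 - tau) * best]
    return min(ok, key=lambda m: m["cost_per_1k_tokens"])

models = [
    {"name": "small", "cost_per_1k_tokens": 0.2},
    {"name": "large", "cost_per_1k_tokens": 3.0},
]
pick = route("Summarize this doc.", models,
             lambda p, n: {"small": 0.78, "large": 0.90}[n])
print(pick["name"])  # "small": within tolerance and far cheaper
```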

[532] RecMind: LLM-Enhanced Graph Neural Networks for Personalized Consumer Recommendations

Chang Xue, Youwei Lu, Chen Yang, Jinming Xing

Main category: cs.LG

TL;DR: RecMind is an LLM-enhanced graph recommender that combines language model embeddings with collaborative filtering through contrastive alignment and gated fusion, achieving state-of-the-art results on recommendation benchmarks.

DetailsMotivation: Personalization systems face challenges with sparse interactions, fast content churn, and heterogeneous textual signals. Existing approaches either rely solely on collaborative filtering or treat LLMs as monolithic rankers, missing opportunities to leverage both textual and structural signals effectively.

Method: Uses frozen LLM with lightweight adapters to generate text-conditioned embeddings from titles/attributes/reviews, combined with LightGCN for collaborative embeddings. Aligns both views with symmetric contrastive objective and fuses via intra-layer gating to balance language and graph signals.

Result: Achieves best results on all eight metrics on Yelp and Amazon-Electronics datasets, with relative improvements up to +4.53% (Recall@40) and +4.01% (NDCG@40) over strong baselines.

Conclusion: The hybrid approach of treating LLM as preference prior rather than monolithic ranker, combined with cross-view alignment and adaptive gating, provides superior performance across both cold/long-tail and mainstream recommendation scenarios.

Abstract: Personalization is a core capability across consumer technologies, streaming, shopping, wearables, and voice, yet it remains challenged by sparse interactions, fast content churn, and heterogeneous textual signals. We present RecMind, an LLM-enhanced graph recommender that treats the language model as a preference prior rather than a monolithic ranker. A frozen LLM equipped with lightweight adapters produces text-conditioned user/item embeddings from titles, attributes, and reviews; a LightGCN backbone learns collaborative embeddings from the user-item graph. We align the two views with a symmetric contrastive objective and fuse them via intra-layer gating, allowing language to dominate in cold/long-tail regimes and graph structure to stabilize rankings elsewhere. On Yelp and Amazon-Electronics, RecMind attains the best results on all eight reported metrics, with relative improvements up to +4.53% (Recall@40) and +4.01% (NDCG@40) over strong baselines. Ablations confirm both the necessity of cross-view alignment and the advantage of gating over late fusion and LLM-only variants.
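
Illustrative sketch: the two core ingredients, a symmetric contrastive alignment between text and graph views and an intra-layer gate that mixes them, can be written compactly in PyTorch. Shapes, the temperature, and the gate parameterization are assumptions; the LLM-adapter and LightGCN encoders are replaced by random tensors.

```python
# Symmetric contrastive alignment + gated fusion of two embedding views,
# under assumed shapes: z_text, z_graph are [batch, d].
import torch
import torch.nn.functional as F

def symmetric_contrastive(z_text, z_graph, temp=0.1):
    zt, zg = F.normalize(z_text, dim=-1), F.normalize(z_graph, dim=-1)
    logits = zt @ zg.T / temp                    # [B, B] similarity matrix
    labels = torch.arange(len(zt))
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

class GatedFusion(torch.nn.Module):
    def __init__(self, d):
        super().__init__()
        self.gate = torch.nn.Linear(2 * d, d)
    def forward(self, z_text, z_graph):
        g = torch.sigmoid(self.gate(torch.cat([z_text, z_graph], dim=-1)))
        return g * z_text + (1 - g) * z_graph    # language vs. graph balance

zt, zg = torch.randn(8, 32), torch.randn(8, 32)
fused = GatedFusion(32)(zt, zg)
loss = symmetric_contrastive(zt, zg)
print(fused.shape, loss.item())
```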

[533] A Spatio-Temporal Graph Neural Networks Approach for Predicting Silent Data Corruption inducing Circuit-Level Faults

Shaoqi Wei, Senling Wang, Hiroshi Kai, Yoshinobu Higami, Ruijun Ma, Tianming Ni, Xiaoqing Wen, Hiroshi Takahashi

Main category: cs.LG

TL;DR: A unified spatio-temporal graph convolutional network (ST-GCN) for fast prediction of fault impact probabilities in sequential circuits, reducing simulation time by 10x+ while maintaining high accuracy.

DetailsMotivation: Silent Data Errors from defects and aging degrade safety-critical systems, and functional testing to detect these faults is expensive to simulate.

Method: Model gate-level netlists as spatio-temporal graphs to capture topology and signal timing, using dedicated spatial and temporal encoders to predict multi-cycle fault impact probabilities efficiently.

Result: On ISCAS-89 benchmarks, reduces simulation time by more than 10x while maintaining high accuracy (mean absolute error 0.024 for 5-cycle predictions). Test-point selection using predicted FIPs improves detection of hard-to-detect faults.

Conclusion: The framework enables efficient quantitative risk assessment, scales to SoC-level test optimization, and fits electronic design automation flows with configurable efficiency-accuracy trade-offs.

Abstract: Silent Data Errors (SDEs) from time-zero defects and aging degrade safety-critical systems. Functional testing detects SDE-related faults but is expensive to simulate. We present a unified spatio-temporal graph convolutional network (ST-GCN) for fast, accurate prediction of long-cycle fault impact probabilities (FIPs) in large sequential circuits, supporting quantitative risk assessment. Gate-level netlists are modeled as spatio-temporal graphs to capture topology and signal timing; dedicated spatial and temporal encoders predict multi-cycle FIPs efficiently. On ISCAS-89 benchmarks, the method reduces simulation time by more than 10x while maintaining high accuracy (mean absolute error 0.024 for 5-cycle predictions). The framework accepts features from testability metrics or fault simulation, allowing efficiency-accuracy trade-offs. A test-point selection study shows that choosing observation points by predicted FIPs improves detection of long-cycle, hard-to-detect faults. The approach scales to SoC-level test strategy optimization and fits downstream electronic design automation flows.

[534] LoaQ: Layer-wise Output Approximation Quantization

Li Lin, Xiaojun Wan

Main category: cs.LG

TL;DR: LoaQ is a layer-wise post-training quantization method that focuses on output-level consistency rather than local approximations, achieving better alignment with the original model outputs through a simple closed-form solution.

DetailsMotivation: Traditional layer-wise PTQ methods use local approximations that only achieve activation-aware weight approximations, leading to insufficient accuracy and deviations from the original model outputs. There's a need for better output-level consistency in quantization.

Method: LoaQ is an output-approximation method that explicitly targets output-level consistency by leveraging structural characteristics of mainstream LLMs. It features a simple closed-form solution that can be integrated into existing quantization pipelines.

Result: Experiments on LLaMA and Qwen model families show LoaQ performs effectively in both weight-only and weight-activation joint quantization, enhancing overall quantization quality when integrated with existing strategies.

Conclusion: LoaQ advances post-training quantization by providing better output-level consistency through a mathematically elegant solution that is orthogonal to existing techniques and easily integrable into current quantization workflows.

Abstract: A natural and intuitive idea in model quantization is to approximate each component’s quantized output to match its original. Layer-wise post-training quantization (PTQ), though based on this idea, adopts a strictly local view and can achieve, at best, only activation-aware approximations of weights. As a result, it often leads to insufficient approximations and practical deviations from this guiding intuition. Recent work has achieved a more accurate approximation of linear-layer outputs within the framework of layer-wise PTQ, but such refinements remain inadequate for achieving alignment with the full model output. Based on a deeper understanding of the structural characteristics of mainstream LLMs, we propose $LoaQ$, an output-approximation method for layer-wise PTQ that explicitly targets output-level consistency. It better aligns with this intuition and can feature a simple closed-form solution, making it orthogonal to existing techniques and readily integrable into existing quantization pipelines. Experiments on the LLaMA and Qwen model families demonstrate that LoaQ performs effectively in both weight-only and weight-activation joint quantization. By integrating seamlessly with existing quantization strategies, it further enhances overall quantization quality and shows strong potential to advance the frontier of post-training quantization.
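
Illustrative sketch: the output-approximation idea, calibrating quantized weights against the original layer's outputs rather than its weights, can be demonstrated with a generic closed-form per-channel rescale. This is not the paper's exact LoaQ solution, just a least-squares analogue under toy data.

```python
# Round weights, then refit each output channel's scale in closed form so
# the quantized layer matches the ORIGINAL outputs on calibration data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64))      # calibration activations
W = rng.normal(size=(64, 32))       # original linear-layer weights

s0 = np.abs(W).max(axis=0) / 7.0    # per-channel scale for an int4-ish grid
Q = np.clip(np.round(W / s0), -8, 7)

Y = X @ W                            # target: original layer outputs
B = X @ Q                            # outputs of the rounded weights
# closed-form least-squares rescale per output channel j:
#   s_j = <B_j, Y_j> / <B_j, B_j>
s = (B * Y).sum(axis=0) / ((B * B).sum(axis=0) + 1e-12)
W_q = Q * s

err_naive = np.linalg.norm(Y - X @ (Q * s0)) / np.linalg.norm(Y)
err_refit = np.linalg.norm(Y - X @ W_q) / np.linalg.norm(Y)
print(f"relative output error: naive {err_naive:.4f} -> refit {err_refit:.4f}")
```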

[535] WindFM: An Open-Source Foundation Model for Zero-Shot Wind Power Forecasting

Hang Fan, Yu Shi, Zongliang Fu, Shuo Chen, Wei Wei, Wei Xu, Jian Li

Main category: cs.LG

TL;DR: WindFM is a lightweight generative foundation model for probabilistic wind power forecasting that uses a discretize-and-generate framework with transformer architecture, achieving state-of-the-art zero-shot performance without fine-tuning.

DetailsMotivation: Existing wind forecasting models are either site-specific (lacking generalization) or require fine-tuning of general foundation models that cannot incorporate domain-specific energy data effectively.

Method: Uses a specialized time-series tokenizer to convert continuous multivariate observations into discrete hierarchical tokens, then trains a decoder-only Transformer on 150 billion time steps from 126,000+ sites in the WIND Toolkit dataset.

Result: The compact 8.1M parameter model achieves SOTA zero-shot performance on deterministic and probabilistic tasks, outperforming specialized models and larger foundation models without fine-tuning, and shows strong adaptability to out-of-distribution data from different continents.

Conclusion: WindFM demonstrates that a lightweight, domain-specific foundation model can learn universal representations of wind generation dynamics that are robust, transferable, and outperform both specialized and general-purpose models in zero-shot settings.

Abstract: High-quality wind power forecasting is crucial for the operation of modern power grids. However, prevailing data-driven paradigms either train a site-specific model which cannot generalize to other locations or rely on fine-tuning of general-purpose time series foundation models which are difficult to incorporate domain-specific data in the energy sector. This paper introduces WindFM, a lightweight and generative Foundation Model designed specifically for probabilistic wind power forecasting. WindFM employs a discretize-and-generate framework. A specialized time-series tokenizer first converts continuous multivariate observations into discrete, hierarchical tokens. Subsequently, a decoder-only Transformer learns a universal representation of wind generation dynamics by autoregressively pre-training on these token sequences. Using the comprehensive WIND Toolkit dataset comprising approximately 150 billion time steps from more than 126,000 sites, WindFM develops a foundational understanding of the complex interplay between atmospheric conditions and power output. Extensive experiments demonstrate that our compact 8.1M parameter model achieves state-of-the-art zero-shot performance on both deterministic and probabilistic tasks, outperforming specialized models and larger foundation models without any fine-tuning. In particular, WindFM exhibits strong adaptiveness under out-of-distribution data from a different continent, demonstrating the robustness and transferability of its learned representations. Our pre-trained model is publicly available at https://github.com/shiyu-coder/WindFM.
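
Illustrative sketch: the "discretize" half of a discretize-and-generate pipeline, mapping continuous observations to integer token ids, can be approximated with flat quantile binning. The paper's tokenizer is hierarchical and learned; this toy binner only illustrates the interface the decoder-only Transformer would consume.

```python
# Quantile-bin each continuous channel into integer tokens suitable for
# autoregressive pre-training.
import numpy as np

class QuantileTokenizer:
    def __init__(self, n_bins=256):
        self.n_bins = n_bins
        self.edges = None
    def fit(self, x):                                # x: [T, channels]
        qs = np.linspace(0, 1, self.n_bins + 1)[1:-1]
        self.edges = np.quantile(x, qs, axis=0)      # [n_bins-1, channels]
        return self
    def encode(self, x):
        return np.stack([np.searchsorted(self.edges[:, c], x[:, c])
                         for c in range(x.shape[1])], axis=1)

series = np.random.default_rng(0).random((1000, 3))  # e.g., speed, dir, power
tok = QuantileTokenizer(n_bins=16).fit(series)
tokens = tok.encode(series)        # integer ids ready for a decoder-only LM
print(tokens.shape, tokens.min(), tokens.max())
```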

[536] Evaluating the Efficiency of Latent Spaces via the Coupling-Matrix

Mehmet Can Yavuz, Berrin Yanikoglu

Main category: cs.LG

TL;DR: A new redundancy index rho(C) that directly quantifies inter-dimensional dependencies in latent representations, providing a compact and statistically grounded measure of representational quality that predicts model performance.

DetailsMotivation: Deep networks often produce redundant latent spaces where multiple coordinates encode overlapping information, reducing effective capacity and hindering generalization. Standard metrics like accuracy or reconstruction loss only provide indirect evidence and cannot isolate redundancy as a failure mode.

Method: Introduces a redundancy index rho(C) that analyzes coupling matrices derived from latent representations and compares their off-diagonal statistics against a normal distribution via energy distance to quantify inter-dimensional dependencies.

Result: Low rho(C) reliably predicts high classification accuracy or low reconstruction error, while elevated redundancy is associated with performance collapse. The estimator reliability grows with latent dimension, and Tree-structured Parzen Estimators preferentially explore low-rho regions.

Conclusion: rho(C) exposes redundancy as a universal bottleneck across models and tasks, offering both a theoretical lens and practical tool for evaluating and improving the efficiency of learned representations, with applications in neural architecture search and redundancy-aware regularization.

Abstract: A central challenge in representation learning is constructing latent embeddings that are both expressive and efficient. In practice, deep networks often produce redundant latent spaces where multiple coordinates encode overlapping information, reducing effective capacity and hindering generalization. Standard metrics such as accuracy or reconstruction loss provide only indirect evidence of such redundancy and cannot isolate it as a failure mode. We introduce a redundancy index, denoted rho(C), that directly quantifies inter-dimensional dependencies by analyzing coupling matrices derived from latent representations and comparing their off-diagonal statistics against a normal distribution via energy distance. The result is a compact, interpretable, and statistically grounded measure of representational quality. We validate rho(C) across discriminative and generative settings on MNIST variants, Fashion-MNIST, CIFAR-10, and CIFAR-100, spanning multiple architectures and hyperparameter optimization strategies. Empirically, low rho(C) reliably predicts high classification accuracy or low reconstruction error, while elevated redundancy is associated with performance collapse. Estimator reliability grows with latent dimension, yielding natural lower bounds for reliable analysis. We further show that Tree-structured Parzen Estimators (TPE) preferentially explore low-rho regions, suggesting that rho(C) can guide neural architecture search and serve as a redundancy-aware regularization target. By exposing redundancy as a universal bottleneck across models and tasks, rho(C) offers both a theoretical lens and a practical tool for evaluating and improving the efficiency of learned representations.
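
Illustrative sketch: a redundancy score in the spirit of rho(C) compares the off-diagonal entries of a coupling matrix against a fitted normal via energy distance. Using plain correlations as the coupling matrix is an assumption (the paper's construction may differ), but the toy below shows the intended behavior: duplicated latent dimensions raise the score.

```python
# Redundancy score: energy distance between off-diagonal coupling entries
# and a normal fitted to them. Duplicated dimensions inflate the score.
import numpy as np
from scipy.stats import energy_distance

def rho(Z):                        # Z: [n_samples, latent_dim]
    C = np.corrcoef(Z, rowvar=False)
    off = C[~np.eye(C.shape[0], dtype=bool)]
    ref = np.random.default_rng(0).normal(off.mean(), off.std() + 1e-12, 10000)
    return float(energy_distance(off, ref))

rng = np.random.default_rng(1)
z_indep = rng.normal(size=(2000, 16))              # nearly independent dims
z_redund = z_indep.copy()
z_redund[:, 8:] = z_indep[:, :8] + 0.1 * rng.normal(size=(2000, 8))  # copies
print("independent:", round(rho(z_indep), 4),
      "redundant:", round(rho(z_redund), 4))
```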

[537] Text-Trained LLMs Can Zero-Shot Extrapolate PDE Dynamics

Jiajun Bao, Nicolas Boullé, Toni J. B. Liu, Raphaël Sarfati, Christopher J. Earls

Main category: cs.LG

TL;DR: LLMs can accurately extrapolate spatiotemporal dynamics from discretized PDE solutions without fine-tuning, showing predictive accuracy improvements with longer contexts but degradation at finer spatial resolutions.

DetailsMotivation: To explore how text-trained foundation models can perform zero-shot time-series forecasting and extrapolate PDE solutions without traditional fine-tuning or natural language prompting.

Method: Analyzed LLMs’ performance on multi-step rollouts of PDE solutions, examining token-level output distributions and error accumulation patterns across different temporal contexts and spatial discretizations.

Result: Predictive accuracy improves with longer temporal contexts but degrades at finer spatial discretizations. Errors grow algebraically with time horizon, similar to classical finite-difference solvers. Models show consistent ICL progression from pattern imitation to confident predictions.

Conclusion: LLMs demonstrate emergent in-context learning capabilities for spatiotemporal dynamics, with predictable scaling laws governing performance based on context length and output length, revealing insights into their internal processing of numerical data.

Abstract: Large language models (LLMs) have demonstrated emergent in-context learning (ICL) capabilities across a range of tasks, including zero-shot time-series forecasting. We show that text-trained foundation models can accurately extrapolate spatiotemporal dynamics from discretized partial differential equation (PDE) solutions without fine-tuning or natural language prompting. Predictive accuracy improves with longer temporal contexts but degrades at finer spatial discretizations. In multi-step rollouts, where the model recursively predicts future spatial states over multiple time steps, errors grow algebraically with the time horizon, reminiscent of global error accumulation in classical finite-difference solvers. We interpret these trends as in-context neural scaling laws, where prediction quality varies predictably with both context length and output length. To better understand how LLMs are able to internally process PDE solutions so as to accurately roll them out, we analyze token-level output distributions and uncover a consistent ICL progression: beginning with syntactic pattern imitation, transitioning through an exploratory high-entropy phase, and culminating in confident, numerically grounded predictions.

[538] Exploring approaches to computational representation and classification of user-generated meal logs

Guanlan Hu, Adit Anand, Pooja M. Desai, Iñigo Urteaga, Lena Mamykina

Main category: cs.LG

TL;DR: Machine learning with domain enrichment outperforms self-assessment in classifying meal alignment with nutritional goals using free-text meal logs.

DetailsMotivation: To improve nutrition guidance by automatically classifying patient-generated free-text meal logs against specific nutritional goals using ML and domain knowledge.

Method: Used TFIDF and BERT text embeddings with domain enrichment (ontologies, ingredient parsers, macronutrient content) to train logistic regression and MLP classifiers on 3000+ meal records evaluated by dietitians.

Result: ML classifiers with enrichment achieved higher accuracy than self-assessments, with Parsed Ingredients, Food Entities, and Macronutrients enrichment performing best across multiple nutritional goals.

Conclusion: ML can reliably classify meal-goal alignment from unstructured text, especially when incorporating nutrition domain knowledge, supporting precision healthcare nutrition guidance.

Abstract: This study examined the use of machine learning and domain-specific enrichment on patient-generated health data, in the form of free-text meal logs, to classify meals on alignment with different nutritional goals. We used a dataset of over 3000 meal records collected by 114 individuals from a diverse, low-income community in a major US city using a mobile app. Registered dietitians provided expert judgement of meal-to-goal alignment, used as the gold standard for evaluation. Using text embeddings, including TFIDF and BERT, and domain-specific enrichment information, including ontologies, ingredient parsers, and macronutrient contents as inputs, we evaluated the performance of logistic regression and multilayer perceptron classifiers using accuracy, precision, recall, and F1 score against the gold standard and self-assessment. Even without enrichment, ML outperformed the self-assessments of individuals who logged meals, and the best-performing combination of ML classifier with enrichment achieved even higher accuracies. In general, ML classifiers enriched with Parsed Ingredients, Food Entities, and Macronutrients information performed well across multiple nutritional goals, but the impact of enrichment and classification algorithm on classification accuracy varied across nutritional goals. In conclusion, ML can utilize unstructured free-text meal logs and reliably classify whether meals align with specific nutritional goals, exceeding self-assessments, especially when incorporating nutrition domain knowledge. Our findings highlight the potential of ML analysis of patient-generated health data to support patient-centered nutrition guidance in precision healthcare.
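
Illustrative sketch: the no-enrichment baseline reduces to a standard text-classification pipeline, TFIDF features plus logistic regression against a binary goal-alignment label. The meal texts and labels below are toy stand-ins for the dietitian-annotated dataset.

```python
# TFIDF + logistic regression over free-text meal logs, binary target
# "meal aligns with goal" (labels here are invented for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

meals = ["grilled chicken salad, olive oil", "double cheeseburger and fries",
         "oatmeal with berries", "fried chicken, soda, candy bar"]
aligned = [1, 0, 1, 0]   # dietitian judgment: meets "low saturated fat" goal

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(meals, aligned)
print(clf.predict(["burger with extra cheese"]))   # likely [0]
```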

[539] A Fragile Number Sense: Probing the Elemental Limits of Numerical Reasoning in LLMs

Roussel Rahman, Aashwin Ananda Mishra

Main category: cs.LG

TL;DR: LLMs show strong performance on algorithmic math problems but fail at combinatorial puzzles like Game of 24, revealing their numerical reasoning is more pattern-matching than true analytical thinking.

DetailsMotivation: To evaluate the robustness of LLMs' numerical reasoning by testing them on problems of escalating complexity, from basic arithmetic to combinatorial puzzles.

Method: Tested several state-of-the-art LLM-based agents on a 100-problem challenge with four categories: basic arithmetic, advanced operations, primality checking, and Game of 24 number puzzle.

Result: High accuracy on first three categories requiring deterministic algorithmic execution, but consistent failure at Game of 24 which requires heuristic search over large combinatorial space.

Conclusion: LLMs’ numerical proficiency is confined to recalling and executing known algorithms rather than generative problem-solving, limiting their potential for tasks requiring novel numerical insights.

Abstract: Large Language Models (LLMs) have demonstrated remarkable emergent capabilities, yet the robustness of their numerical reasoning remains an open question. While standard benchmarks evaluate LLM reasoning on complex problem sets using aggregated metrics, they often obscure foundational weaknesses. In this work, we probe LLM mathematical numeracy by evaluating performance on problems of escalating complexity, from constituent operations to combinatorial puzzles. We test several state-of-the-art LLM-based agents on a 100-problem challenge comprising four categories: (1) basic arithmetic, (2) advanced operations, (3) primality checking, and (4) the Game of 24 number puzzle. Our results show that while the agents achieved high accuracy on the first three categories, which require deterministic algorithmic execution, they consistently failed at the number puzzle, underscoring that its demand for heuristic search over a large combinatorial space remains a significant bottleneck. These findings reveal that the agents’ proficiency is largely confined to recalling and executing known algorithms, rather than performing generative problem-solving. This suggests their apparent numerical reasoning is more akin to sophisticated pattern-matching than flexible, analytical thought, limiting their potential for tasks that require novel or creative numerical insights.

[540] Ban&Pick: Achieving Free Performance Gains and Inference Speedup via Smarter Routing in MoE-LLMs

Yuanteng Chen, Peisong Wang, Yuantian Shao, Jian Cheng

Main category: cs.LG

TL;DR: Ban&Pick is a post-training plug-and-play strategy that improves MoE routing by reinforcing key experts and pruning redundant ones, delivering free performance gains and faster inference without retraining.

DetailsMotivation: Current MoE routers converge prematurely and enforce balanced expert usage, limiting model performance and efficiency by underutilizing influential experts and introducing redundancy with fixed expert activation.

Method: Ban&Pick uses two components: Pick identifies and reinforces key influential experts, while Ban dynamically prunes redundant experts based on layer and token sensitivity.

Result: On Qwen3-30B-A3B, improved accuracy from 80.67 to 84.66 on AIME2024 and from 65.66 to 68.18 on GPQA-Diamond, while accelerating inference by 1.25x under vLLM.

Conclusion: The approach demonstrates that smarter routing strategies can unlock significant performance improvements and efficiency gains in existing MoE models without architectural changes or retraining.

Abstract: Sparse Mixture-of-Experts (MoE) has become a key architecture for scaling large language models (LLMs) efficiently. Recent fine-grained MoE designs introduce hundreds of experts per layer, with multiple experts activated per token, enabling stronger specialization. However, during pre-training, routers are optimized mainly for stability and robustness: they converge prematurely and enforce balanced usage, limiting the full potential of model performance and efficiency. In this work, we uncover two overlooked issues: (i) a few highly influential experts are underutilized due to premature and balanced routing decisions; and (ii) enforcing a fixed number of active experts per token introduces substantial redundancy. Instead of retraining models or redesigning MoE architectures, we introduce Ban&Pick, a post-training, plug-and-play strategy for smarter MoE routing. Pick discovers and reinforces key experts, a small group with an outsized impact on performance, leading to notable accuracy gains across domains. Ban complements this by dynamically pruning redundant experts based on layer and token sensitivity, delivering faster inference with minimal accuracy loss. Experiments on fine-grained MoE-LLMs (DeepSeek, Qwen3) across math, code, and general reasoning benchmarks demonstrate that Ban&Pick delivers free performance gains and inference acceleration without retraining or architectural changes. For instance, on Qwen3-30B-A3B, it improves accuracy from 80.67 to 84.66 on AIME2024 and from 65.66 to 68.18 on GPQA-Diamond, while accelerating inference by 1.25x under vLLM.
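
Illustrative sketch: a post-hoc router edit in the Ban&Pick spirit, boosting a small set of key experts before top-k selection (Pick) and zeroing activated experts whose routing weight falls below a threshold (Ban). The boost factor, threshold rule, and key-expert set are hypothetical; the paper selects them via layer and token sensitivity.

```python
# Post-training router edit: reinforce key experts, prune weak activations,
# renormalize the remaining routing weights.
import torch

def ban_and_pick(router_logits, key_experts, k=8, boost=1.2, ban_frac=0.1):
    logits = router_logits.clone()
    logits[..., key_experts] += torch.log(torch.tensor(boost))      # Pick
    topv, topi = logits.topk(k, dim=-1)
    probs = torch.softmax(topv, dim=-1)
    keep = probs >= ban_frac * probs.max(dim=-1, keepdim=True).values  # Ban
    probs = torch.where(keep, probs, torch.zeros_like(probs))
    probs = probs / probs.sum(dim=-1, keepdim=True)
    return topi, probs        # expert ids and renormalized weights per token

logits = torch.randn(4, 128)                   # 4 tokens, 128 experts
ids, w = ban_and_pick(logits, key_experts=[3, 17, 42])
print(ids.shape, w.sum(dim=-1))                # weights still sum to 1
```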

[541] Breaking SafetyCore: Exploring the Risks of On-Device AI Deployment

Victor Guyomard, Mathis Mauvisseau, Marie Paindavoine

Main category: cs.LG

TL;DR: On-device AI models pose security risks - SafetyCore Android case study shows models can be extracted and manipulated to bypass detection

DetailsMotivation: As more AI models are deployed on-device for privacy and latency benefits, new security risks emerge that differ from traditional software vulnerabilities

Method: Real-world case study of SafetyCore Android system service that performs sensitive image content detection, demonstrating extraction and manipulation techniques

Result: Successfully bypassed detection by extracting and manipulating the on-device AI model, rendering the protection ineffective

Conclusion: On-device AI models have significant vulnerabilities that adversaries can exploit, requiring new security considerations for on-device AI deployments

Abstract: Due to hardware and software improvements, an increasing number of AI models are deployed on-device. This shift enhances privacy and reduces latency, but also introduces security risks distinct from traditional software. In this article, we examine these risks through the real-world case study of SafetyCore, an Android system service incorporating sensitive image content detection. We demonstrate how the on-device AI model can be extracted and manipulated to bypass detection, effectively rendering the protection ineffective. Our analysis exposes vulnerabilities of on-device AI models and provides a practical demonstration of how adversaries can exploit them.

[542] Variational Garrote for Statistical Physics-based Sparse and Robust Variable Selection

Hyungjoon Soh, Dongha Lee, Vipul Periwal, Junghyo Jo

Main category: cs.LG

TL;DR: Enhanced Variational Garrote (VG) method for sparse variable selection using statistical physics and variational inference, outperforms Ridge and LASSO in sparse regimes with automatic differentiation for scalability.

DetailsMotivation: Need for effective variable selection from high-dimensional data to promote model simplicity and explainability in big data era.

Method: Revisits Variational Garrote with feature selection spin variables and variational inference, enhanced with modern automatic differentiation for efficient optimization.

Result: VG performs best in highly sparse regimes, shows more consistent variable selection than Ridge/LASSO, and reveals sharp transition point for estimating correct number of relevant variables.

Conclusion: VG offers strong potential for sparse modeling applications including compressed sensing and machine learning model pruning, with practical signal for identifying key predictors.

Abstract: Selecting key variables from high-dimensional data is increasingly important in the era of big data. Sparse regression serves as a powerful tool for this purpose by promoting model simplicity and explainability. In this work, we revisit a valuable yet underutilized method, the statistical physics-based Variational Garrote (VG), which introduces explicit feature selection spin variables and leverages variational inference to derive a tractable loss function. We enhance VG by incorporating modern automatic differentiation techniques, enabling scalable and efficient optimization. We evaluate VG on both fully controllable synthetic datasets and complex real-world datasets. Our results demonstrate that VG performs especially well in highly sparse regimes, offering more consistent and robust variable selection than Ridge and LASSO regression across varying levels of sparsity. We also uncover a sharp transition: as superfluous variables are admitted, generalization degrades abruptly and the uncertainty of the selection variables increases. This transition point provides a practical signal for estimating the correct number of relevant variables, an insight we successfully apply to identify key predictors in real-world data. We expect that VG offers strong potential for sparse modeling across a wide range of applications, including compressed sensing and model pruning in machine learning.
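
Illustrative sketch: garrote-style selection with automatic differentiation, where each feature gets a gate probability $\sigma(m_i)$ and the loss trades data fit against a sparsity pressure. This is a simplified stand-in for the paper's variational loss, not its exact derivation; the problem size and $\gamma$ are toy choices.

```python
# Gated sparse regression optimized end-to-end with autodiff: gates
# sigma(m_i) play the role of the selection "spin" variables.
import torch

torch.manual_seed(0)
n, p, k = 200, 50, 5
X = torch.randn(n, p)
w_true = torch.zeros(p); w_true[:k] = 2.0
y = X @ w_true + 0.1 * torch.randn(n)

w = torch.zeros(p, requires_grad=True)
m = torch.zeros(p, requires_grad=True)         # selection gate logits
opt = torch.optim.Adam([w, m], lr=0.05)
gamma = 2.0                                    # sparsity pressure

for _ in range(2000):
    opt.zero_grad()
    gates = torch.sigmoid(m)
    pred = X @ (gates * w)
    loss = ((y - pred) ** 2).mean() + gamma * gates.mean()
    loss.backward()
    opt.step()

print("selected:", (torch.sigmoid(m) > 0.5).nonzero().flatten().tolist())
```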

[543] Beyond the Pre-Service Horizon: Infusing In-Service Behavior for Improved Financial Risk Forecasting

Senhao Liu, Zhiyu Guo, Zhiyuan Ji, Yueguo Chen, Yateng Tang, Yunhai Wang, Xuehao Zheng, Xiang Ao

Main category: cs.LG

TL;DR: MGKD framework uses knowledge distillation to improve pre-service risk prediction by transferring insights from in-service behavior data through multi-granularity distillation strategies.

DetailsMotivation: Traditional financial risk management separates pre-service risk assessment and in-service default detection, missing opportunities to leverage in-service behavioral data for better pre-service predictions.

Method: Multi-Granularity Knowledge Distillation (MGKD) with teacher-student architecture, using soft labels from in-service data, including coarse-grained, fine-grained, and self-distillation strategies, plus re-weighting for class imbalance.

Result: Experimental results on Tencent Mobile Payment datasets show effectiveness in both offline and online scenarios, improving pre-service risk assessment performance.

Conclusion: MGKD successfully bridges pre-service and in-service risk modeling, enabling better default prediction before service activation by transferring behavioral patterns from in-service data.

Abstract: Typical financial risk management involves distinct phases for pre-service risk assessment and in-service default detection, often modeled separately. This paper proposes a novel framework, Multi-Granularity Knowledge Distillation (abbreviated as MGKD), aimed at improving pre-service risk prediction through the integration of in-service user behavior data. MGKD follows the idea of knowledge distillation, where the teacher model, trained on historical in-service data, guides the student model, which is trained on pre-service data. By using soft labels derived from in-service data, the teacher model helps the student model improve its risk prediction prior to service activation. Meanwhile, a multi-granularity distillation strategy is introduced, including coarse-grained, fine-grained, and self-distillation, to align the representations and predictions of the teacher and student models. This approach not only reinforces the representation of default cases but also enables the transfer of key behavioral patterns associated with defaulters from the teacher to the student model, thereby improving the overall performance of pre-service risk assessment. Moreover, we adopt a re-weighting strategy to mitigate the model’s bias towards the minority class. Experimental results on large-scale real-world datasets from Tencent Mobile Payment demonstrate the effectiveness of our proposed approach in both offline and online scenarios.
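
Illustrative sketch: the coarse-grained distillation step, in which a teacher trained on in-service behavior supplies soft labels that regularize a student seeing only pre-service features, combined with minority-class re-weighting. The temperature, mixing weight, and positive-class weight are hypothetical; the fine-grained and self-distillation terms are omitted.

```python
# One training step mixing a re-weighted hard-label loss with soft-label
# distillation from an in-service teacher.
import torch
import torch.nn.functional as F

def mgkd_step(student_logits, teacher_logits, labels, T=2.0, alpha=0.5,
              pos_weight=5.0):
    # re-weighted hard-label loss (defaults are the minority class)
    weights = torch.where(labels == 1, torch.tensor(pos_weight),
                          torch.tensor(1.0))
    hard = (F.cross_entropy(student_logits, labels,
                            reduction="none") * weights).mean()
    # soft-label distillation from the in-service teacher
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    return alpha * hard + (1 - alpha) * soft

s = torch.randn(16, 2, requires_grad=True)   # student logits (pre-service)
t = torch.randn(16, 2)                       # teacher logits (in-service)
y = torch.randint(0, 2, (16,))
print(mgkd_step(s, t, y).item())
```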

[544] Graph Neural Networks for Resource Allocation in Interference-limited Multi-Channel Wireless Networks with QoS Constraints

Lili Chen, Changyang She, Jingge Zhu, Jamie Evans

Main category: cs.LG

TL;DR: Proposes JCPGNN-M, a GNN-based algorithm with Lagrangian optimization for QoS-constrained wireless resource allocation, offering theoretical convergence guarantees and outperforming traditional methods in speed and scalability.

DetailsMotivation: Traditional deep learning approaches for meeting minimum data rate constraints in wireless systems lack theoretical guarantees and often fail to satisfy QoS requirements in practice, requiring a more principled solution.

Method: Extends WMMSE to multi-channel setting (eWMMSE), then develops JCPGNN-M GNN algorithm with Lagrangian primal-dual optimization framework to ensure QoS constraint satisfaction and convergence.

Result: JCPGNN-M matches eWMMSE performance while providing significant improvements in inference speed, generalization to larger networks, and robustness under imperfect channel state information.

Conclusion: The proposed framework provides a scalable and theoretically grounded solution for constrained resource allocation in future wireless networks with guaranteed QoS satisfaction.

Abstract: Meeting minimum data rate constraints is a significant challenge in wireless communication systems, particularly as network complexity grows. Traditional deep learning approaches often address these constraints by incorporating penalty terms into the loss function and tuning hyperparameters empirically. However, this heuristic treatment offers no theoretical convergence guarantees and frequently fails to satisfy QoS requirements in practical scenarios. Building upon the structure of the WMMSE algorithm, we first extend it to a multi-channel setting with QoS constraints, resulting in the enhanced WMMSE (eWMMSE) algorithm, which is provably convergent to a locally optimal solution when the problem is feasible. To further reduce computational complexity and improve scalability, we develop a GNN-based algorithm, JCPGNN-M, capable of supporting simultaneous multi-channel allocation per user. To overcome the limitations of traditional deep learning methods, we propose a principled framework that integrates GNN with a Lagrangian-based primal-dual optimization method. By training the GNN within the Lagrangian framework, we ensure satisfaction of QoS constraints and convergence to a stationary point. Extensive simulations demonstrate that JCPGNN-M matches the performance of eWMMSE while offering significant gains in inference speed, generalization to larger networks, and robustness under imperfect channel state information. This work presents a scalable and theoretically grounded solution for constrained resource allocation in future wireless networks.
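
Illustrative sketch: the Lagrangian primal-dual pattern behind the framework, a primal step that descends on utility plus multiplier-weighted QoS violations and a dual ascent that raises each user's multiplier while its minimum-rate constraint is violated. The GNN is replaced here by directly optimized power variables, and `rate_fn` is a toy stand-in for the achieved-rate computation.

```python
# Primal-dual training loop: each lambda_i grows while user i's rate sits
# below r_min, forcing the primal step to satisfy the QoS constraint.
import torch

def rate_fn(p):                               # toy concave "rate" per user
    return torch.log2(1 + torch.relu(p))

power = torch.nn.Parameter(torch.ones(4))     # stand-in for GNN outputs
lam = torch.zeros(4)                          # one multiplier per constraint
r_min, lr_primal, lr_dual = 1.5, 0.05, 0.1

for _ in range(1000):
    rates = rate_fn(power)
    utility = rates.sum() - power.sum()       # rate reward minus power cost
    lagrangian = -utility + (lam * (r_min - rates)).sum()
    power.grad = None
    lagrangian.backward()
    with torch.no_grad():
        power -= lr_primal * power.grad                        # primal descent
        lam = torch.clamp(lam + lr_dual * (r_min - rate_fn(power)), min=0)  # dual ascent

print(rate_fn(power).detach())                # rates pushed toward r_min
```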

[545] NeuroDeX: Unlocking Diverse Support in Decompiling Deep Neural Network Executables

Yilin Li, Guozhu Meng, Mingyang Sun, Yanzhong Wang, Kun Sun, Hailong Chang, Yuekang Li

Main category: cs.LG

TL;DR: NeuroDeX is a novel decompiler for DNN executables that uses LLMs and dynamic analysis to handle compilation optimizations and quantized models, achieving high accuracy in model recovery.

DetailsMotivation: On-device deep learning models face reverse engineering threats, and existing decompilation methods struggle with compilation optimizations and quantized compiled models.

Method: Leverages semantic understanding capabilities of LLMs combined with dynamic analysis for operator type recognition, attribute recovery, and model reconstruction.

Result: Successfully decompiles 96 DNN executables across 12 models, achieving nearly identical recovery for non-quantized models and 72% top-1 accuracy for quantized executables.

Conclusion: NeuroDeX provides a more comprehensive and effective solution for DNN executable decompilation compared to previous approaches, handling diverse compilation scenarios.

Abstract: On-device deep learning models have extensive real-world demands. Deep learning compilers efficiently compile models into executables for deployment on edge devices, but these executables may face the threat of reverse engineering. Previous studies have attempted to decompile DNN executables, but they face challenges in handling compilation optimizations and analyzing quantized compiled models. In this paper, we present NeuroDeX to unlock diverse support in decompiling DNN executables. NeuroDeX leverages the semantic understanding capabilities of LLMs along with dynamic analysis to accurately and efficiently perform operator type recognition, operator attribute recovery and model reconstruction. NeuroDeX can recover DNN executables into high-level models across compilation optimizations, different architectures and quantized compiled models. We conduct experiments on 96 DNN executables across 12 common DNN models. Extensive experimental results demonstrate that NeuroDeX can decompile non-quantized executables into nearly identical high-level models. NeuroDeX can recover functionally similar high-level models for quantized executables, achieving an average top-1 accuracy of 72%. NeuroDeX offers a more comprehensive and effective solution compared to previous DNN executables decompilers.

[546] CAPMix: Robust Time Series Anomaly Detection Based on Abnormal Assumptions with Dual-Space Mixup

Xudong Mou, Rui Wang, Tiejun Wang, Renyu Yang, Shiru Chen, Jie Sun, Tianyu Wo, Xudong Liu

Main category: cs.LG

TL;DR: CAPMix is a novel anomaly augmentation framework that addresses patchy generation and anomaly shift issues in time series anomaly detection through CutAddPaste mechanism, label revision, and dual-space mixup.

DetailsMotivation: Existing anomaly assumption approaches suffer from patchy generation (scattered anomaly knowledge leading to simplistic injection) and anomaly shift (synthetic anomalies either too similar to normal data or unrealistically divergent), which distort classification boundaries.

Method: Proposes CAPMix with three key components: 1) CutAddPaste mechanism for targeted anomaly injection, 2) label revision strategy for adaptive anomaly label refinement, and 3) dual-space mixup within temporal convolutional network for smoother decision boundaries.

Result: Extensive experiments on five benchmark datasets (AIOps, UCR, SWaT, WADI, ESA) show CAPMix achieves significant improvements over state-of-the-art baselines with enhanced robustness against contaminated training data.

Conclusion: CAPMix effectively addresses fundamental limitations of existing anomaly augmentation methods and demonstrates superior performance across multiple time series anomaly detection benchmarks.

Abstract: Time series anomaly detection (TSAD) is a vital yet challenging task, particularly in scenarios where labeled anomalies are scarce and temporal dependencies are complex. Recent anomaly assumption (AA) approaches alleviate the lack of anomalies by injecting synthetic samples and training discriminative models. Despite promising results, these methods often suffer from two fundamental limitations: patchy generation, where scattered anomaly knowledge leads to overly simplistic or incoherent anomaly injection, and Anomaly Shift, where synthetic anomalies either resemble normal data too closely or diverge unrealistically from real anomalies, thereby distorting classification boundaries. In this paper, we propose CAPMix, a controllable anomaly augmentation framework that addresses both issues. First, we design a CutAddPaste mechanism to inject diverse and complex anomalies in a targeted manner, avoiding patchy generation. Second, we introduce a label revision strategy to adaptively refine anomaly labels, reducing the risk of anomaly shift. Finally, we employ dual-space mixup within a temporal convolutional network to enforce smoother and more robust decision boundaries. Extensive experiments on five benchmark datasets, including AIOps, UCR, SWaT, WADI, and ESA, demonstrate that CAPMix achieves significant improvements over state-of-the-art baselines, with enhanced robustness against contaminated training data. The code is available at https://github.com/alsike22/CAPMix.
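
Illustrative sketch: the dual-space mixup component, mixing a pair of windows both in the raw input space and in the encoder's latent space with a matching soft label. The CutAddPaste injector and label-revision step are omitted; the Beta parameter and the tiny stand-in encoder (a temporal convolutional network in the paper) are assumptions.

```python
# Mixup applied in both input space and latent space, with a soft label
# that uses the same mixing coefficient.
import torch

def dual_space_mixup(encoder, x1, x2, y1, y2, alpha=0.4):
    lam = torch.distributions.Beta(alpha, alpha).sample()
    x_mix = lam * x1 + (1 - lam) * x2                    # input-space mixup
    z_mix = lam * encoder(x1) + (1 - lam) * encoder(x2)  # latent-space mixup
    y_mix = lam * y1 + (1 - lam) * y2                    # matching soft label
    return x_mix, z_mix, y_mix

encoder = torch.nn.Sequential(torch.nn.Conv1d(1, 8, 3, padding=1),
                              torch.nn.ReLU(), torch.nn.Flatten())
x1, x2 = torch.randn(2, 1, 64), torch.randn(2, 1, 64)
y1, y2 = torch.zeros(2), torch.ones(2)    # normal vs. injected anomaly
xm, zm, ym = dual_space_mixup(encoder, x1, x2, y1, y2)
print(xm.shape, zm.shape, ym)
```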

[547] CAME-AB: Cross-Modality Attention with Mixture-of-Experts for Antibody Binding Site Prediction

Hongzong Li, Jiahao Ma, Zhanpeng Shi, Fanming Jin, Ye-Fan Hu, Jian-Dong Huang

Main category: cs.LG

TL;DR: CAME-AB is a cross-modality attention framework with Mixture-of-Experts backbone that integrates five biological modalities for superior antibody binding site prediction, outperforming existing methods on multiple metrics.

DetailsMotivation: Existing antibody binding site prediction methods rely on single-view features and fail to identify antibody-specific binding sites on antigens, representing a dual limitation in both representation and prediction capabilities.

Method: Integrates five biological modalities (amino acid encodings, BLOSUM profiles, language model embeddings, structure-aware features, GCN-refined graphs) with adaptive modality fusion, Transformer encoder, MoE module, supervised contrastive learning, and stochastic weight averaging.

Result: Outperforms strong baselines on benchmark datasets across multiple metrics including Precision, Recall, F1-score, AUC-ROC, and MCC. Ablation studies confirm effectiveness of each component.

Conclusion: CAME-AB provides a robust multimodal framework for antibody binding site prediction, demonstrating the value of cross-modality integration and adaptive feature fusion in computational immunology.

Abstract: Antibody binding site prediction plays a pivotal role in computational immunology and therapeutic antibody design. Existing sequence- or structure-based methods rely on single-view features and fail to identify antibody-specific binding sites on the antigens, a dual limitation in representation and prediction. In this paper, we propose CAME-AB, a novel Cross-modality Attention framework with a Mixture-of-Experts (MoE) backbone for robust antibody binding site prediction. CAME-AB integrates five biologically grounded modalities (raw amino acid encodings, BLOSUM substitution profiles, pretrained language model embeddings, structure-aware features, and GCN-refined biochemical graphs) into a unified multimodal representation. To enhance adaptive cross-modal reasoning, we propose an adaptive modality fusion module that learns to dynamically weight each modality based on its global relevance and input-specific contribution. A Transformer encoder combined with an MoE module further promotes feature specialization and capacity expansion. We additionally incorporate a supervised contrastive learning objective to explicitly shape the latent space geometry, encouraging intra-class compactness and inter-class separability. To improve optimization stability and generalization, we apply stochastic weight averaging during training. Extensive experiments on benchmark antibody-antigen datasets demonstrate that CAME-AB consistently outperforms strong baselines on multiple metrics, including Precision, Recall, F1-score, AUC-ROC, and MCC. Ablation studies further validate the effectiveness of each architectural component and the benefit of multimodal feature integration. The model implementation details and code are available at https://anonymous.4open.science/r/CAME-AB-C525
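
A minimal sketch of the adaptive modality fusion step, assuming each of the five modalities has already been embedded to a common dimension; the gating network and shapes are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AdaptiveModalityFusion(nn.Module):
    """Input-conditioned weighting over modality embeddings (assumed form)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, 1)   # scores each modality embedding

    def forward(self, feats):                    # feats: (batch, M, dim)
        w = torch.softmax(self.gate(feats).squeeze(-1), dim=-1)  # (batch, M)
        return (w.unsqueeze(-1) * feats).sum(dim=1)              # (batch, dim)

# e.g. five modalities: raw encodings, BLOSUM, LM embeddings, structure, GCN
fused = AdaptiveModalityFusion(dim=256)(torch.randn(8, 5, 256))
```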

[548] DyC-STG: Dynamic Causal Spatio-Temporal Graph Network for Real-time Data Credibility Analysis in IoT

Guanjie Cheng, Boyi Li, Peihan Wu, Feiyi Chen, Xinkui Zhao, Mengying Zhu, Shuiguang Deng

Main category: cs.LG

TL;DR: Proposes DyC-STG framework for IoT data credibility analysis, combining event-driven dynamic graphs and causal reasoning to overcome limitations of traditional spatio-temporal models.

DetailsMotivation: Addresses critical data credibility challenges in IoT applications by overcoming limitations of static graph topologies and spurious correlations in human-centric environments.

Method: Dynamic Causal Spatio-Temporal Graph Network with two modules: event-driven dynamic graph for real-time topology adaptation and causal reasoning module enforcing temporal precedence.

Result: Achieves state-of-the-art performance with 1.4 percentage point improvement over strongest baselines and F1-Score up to 0.930. Two new real-world datasets released.

Conclusion: DyC-STG effectively addresses IoT data credibility challenges through dynamic topology adaptation and causal reasoning, establishing new benchmark for real-time analysis.

Abstract: The widespread deployment of Internet of Things (IoT) sensors generates vast spatio-temporal data streams, but ensuring data credibility is a critical yet unsolved challenge for applications like smart homes. While spatio-temporal graph (STG) models are a leading paradigm for such data, they often fall short in dynamic, human-centric environments due to two fundamental limitations: (1) their reliance on static graph topologies, which fail to capture physical, event-driven dynamics, and (2) their tendency to confuse spurious correlations with true causality, undermining robustness in human-centric environments. To address these gaps, we propose the Dynamic Causal Spatio-Temporal Graph Network (DyC-STG), a novel framework designed for real-time data credibility analysis in IoT. Our framework features two synergistic contributions: an event-driven dynamic graph module that adapts the graph topology in real-time to reflect physical state changes, and a causal reasoning module to distill causally-aware representations by strictly enforcing temporal precedence. To facilitate research in this domain, we release two new real-world datasets. Comprehensive experiments show that DyC-STG establishes a new state-of-the-art, outperforming the strongest baselines by 1.4 percentage points and achieving an F1-Score of up to 0.930.
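
For orientation, an event-driven topology update could look like the sketch below, where discrete physical events toggle edges between sensor nodes; the event encoding is an assumption for illustration, and the paper's module adapts the topology in a learned fashion.

```python
import numpy as np

def update_adjacency(A, events):
    """Toggle sensor-to-sensor edges on physical events, e.g. a door opening
    links two room sensors (illustrative; DyC-STG learns this adaptation)."""
    A = A.copy()
    for i, j, active in events:       # each event: (node_i, node_j, bool)
        A[i, j] = A[j, i] = 1.0 if active else 0.0
    return A

A = np.eye(4)
A = update_adjacency(A, [(0, 1, True), (2, 3, False)])
```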

[549] A machine-learned expression for the excess Gibbs energy

Marco Hoffmann, Thomas Specht, Quirin Göttl, Jakob Burger, Stephan Mandt, Hans Hasse, Fabian Jirasek

Main category: cs.LG

TL;DR: HANNA is a neural network model that integrates physical laws as hard constraints to predict excess Gibbs energy for multi-component mixtures from molecular structures, outperforming state-of-the-art methods.

DetailsMotivation: Predicting excess Gibbs energy of multi-component mixtures from molecular structures is a long-standing challenge in chemical engineering and chemistry for modeling thermodynamic properties of liquid mixtures.

Method: Integrated physical laws as hard constraints within a flexible neural network, trained end-to-end on experimental binary mixture data from Dortmund Data Bank. Used a novel surrogate solver for liquid-liquid equilibrium data and geometric projection for multi-component extrapolation.

Result: HANNA delivers excellent predictions, clearly outperforming state-of-the-art benchmark methods in both accuracy and scope.

Conclusion: The model provides thermodynamically consistent predictions and robust extrapolation to multi-component mixtures without additional parameters, with trained model and code openly available.

Abstract: The excess Gibbs energy plays a central role in chemical engineering and chemistry, providing a basis for modeling the thermodynamic properties of liquid mixtures. Predicting the excess Gibbs energy of multi-component mixtures solely from the molecular structures of their components is a long-standing challenge. In this work, we address this challenge by integrating physical laws as hard constraints within a flexible neural network. The resulting model, HANNA, was trained end-to-end on an extensive experimental dataset for binary mixtures from the Dortmund Data Bank, guaranteeing thermodynamically consistent predictions. A novel surrogate solver developed in this work enabled the inclusion of liquid-liquid equilibrium data in the training process. Furthermore, a geometric projection method was applied to enable robust extrapolations to multi-component mixtures, without requiring additional parameters. We demonstrate that HANNA delivers excellent predictions, clearly outperforming state-of-the-art benchmark methods in accuracy and scope. The trained model and corresponding code are openly available, and an interactive interface is provided on our website, MLPROP.
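
The abstract does not spell out the hard constraints, but one standard way to bake a physical boundary condition into a network is shown below: for a binary mixture, the excess Gibbs energy must vanish at the pure components, which a multiplicative factor x1(1 - x1) guarantees by construction. This is a sketch under that assumption, not HANNA's actual architecture.

```python
import torch
import torch.nn as nn

class GEBinary(nn.Module):
    """Illustrative physics-constrained g^E model for a binary mixture:
    a free-form network is multiplied by x1*(1-x1), so the output is
    exactly zero at the pure-component limits (an assumed constraint
    form, not the paper's design)."""
    def __init__(self, emb_dim=16, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim + 2, hidden), nn.SiLU(), nn.Linear(hidden, 1)
        )

    def forward(self, z1, z2, x1, T):
        # z1, z2: (batch, emb_dim) molecular embeddings; x1, T: (batch,)
        state = torch.stack([x1, T], dim=-1)
        g = self.net(torch.cat([z1, z2, state], dim=-1)).squeeze(-1)
        return x1 * (1 - x1) * g   # g^E -> 0 as x1 -> 0 or 1, by construction
```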

[550] On optimal solutions of classical and sliced Wasserstein GANs with non-Gaussian data

Yu-Jui Huang, Hsin-Hua Shen, Yu-Chih Huang, Wan-Yi Lin, Shih-Chun Lin

Main category: cs.LG

TL;DR: This paper provides closed-form optimal parameters for Wasserstein GANs (WGANs) beyond the linear-quadratic-Gaussian setting, extending to non-linear activations and non-Gaussian data in 1D, and shows linear generators can be asymptotically optimal for sliced WGANs in high dimensions.

DetailsMotivation: Current parameter selection methods for GANs require exhaustive search and lack theoretical optimality guarantees beyond simple LQG settings. The paper aims to characterize optimal WGAN parameters for more realistic scenarios with non-linear networks and non-Gaussian data.

Method: Derived closed-form optimal parameters for 1D WGANs with non-linear activation functions and non-Gaussian data. Extended to high dimensions using sliced Wasserstein framework with joint distribution constraints. Showed linear generators can be asymptotically optimal for sliced WGANs.

Result: Empirical studies demonstrate good convergence behavior with both Gaussian and Laplace distributed data. The proposed solution achieves same performance as r-PCA but requires less computational resources.

Conclusion: The paper provides theoretically grounded optimal parameter selection for WGANs beyond traditional LQG assumptions, offering practical closed-form solutions that work well with non-linear networks and non-Gaussian data distributions.

Abstract: The generative adversarial network (GAN) aims to approximate an unknown distribution via a parameterized neural network (NN). While GANs have been widely applied in reinforcement and semi-supervised learning as well as computer vision tasks, selecting their parameters often requires an exhaustive search, and only a few selection methods can be proven theoretically optimal. One of the most promising GAN variants is the Wasserstein GAN (WGAN). Prior work on optimal parameters for WGAN is limited to the linear-quadratic-Gaussian (LQG) setting, where the NN is linear and the data is Gaussian. In this paper, we focus on the characterization of optimal WGAN parameters beyond the LQG setting. We derive closed-form optimal parameters for one-dimensional WGANs when the NN has non-linear activation functions and the data is non-Gaussian. To extend this to high-dimensional WGANs, we adopt the sliced Wasserstein framework and replace the constraint on marginal distributions of the randomly projected data by a constraint on the joint distribution of the original (unprojected) data. We show that the linear generator can be asymptotically optimal for sliced WGAN with non-Gaussian data. Empirical studies show that our closed-form WGAN parameters have good convergence behavior with data under both Gaussian and Laplace distributions. Also, compared to the r principal component analysis (r-PCA) solution, our proposed solution for sliced WGAN can achieve the same performance while requiring fewer computational resources.
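
For reference, the sliced Wasserstein distance the paper builds on averages one-dimensional Wasserstein distances over random projections (standard definition, shown for orientation):

```latex
SW_2^2(\mu, \nu) = \int_{\mathbb{S}^{d-1}} W_2^2\!\left(\theta_{\#}\mu,\; \theta_{\#}\nu\right) d\sigma(\theta)
```

where \theta_{\#}\mu denotes the pushforward of \mu under the projection x \mapsto \langle\theta, x\rangle and \sigma is the uniform measure on the unit sphere. The paper's modification replaces the marginal constraints on the projected data with a joint-distribution constraint on the unprojected data.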

[551] Outcome-based Exploration for LLM Reasoning

Yuda Song, Julia Kempe, Remi Munos

Main category: cs.LG

TL;DR: Outcome-based RL for LLMs improves accuracy but reduces diversity. Proposed outcome-based exploration with two algorithms maintains diversity while improving reasoning performance.

DetailsMotivation: Address the diversity collapse problem in outcome-based RL for LLMs, where accuracy gains come at the cost of reduced generation diversity which is critical for real-world test-time scaling.

Method: Proposed outcome-based exploration with two algorithms: historical exploration (UCB-style bonuses for rare answers) and batch exploration (penalizes within-batch repetition). Formalized through outcome-based bandits model.

Result: Experiments on competition math with Llama and Qwen models show both methods improve accuracy while mitigating diversity collapse compared to standard outcome-based RL.

Conclusion: Outcome-based exploration provides a practical RL approach that enhances reasoning without sacrificing the diversity essential for scalable deployment of LLMs.

Abstract: Reinforcement learning (RL) has emerged as a powerful method for improving the reasoning abilities of large language models (LLMs). Outcome-based RL, which rewards policies solely for the correctness of the final answer, yields substantial accuracy gains but also induces a systematic loss in generation diversity. This collapse undermines real-world performance, where diversity is critical for test-time scaling. We analyze this phenomenon by viewing RL post-training as a sampling process and show that, strikingly, RL can reduce effective diversity even on the training set relative to the base model. Our study highlights two central findings: (i) a transfer of diversity degradation, where reduced diversity on solved problems propagates to unsolved ones, and (ii) the tractability of the outcome space, since reasoning tasks admit only a limited set of distinct answers. Motivated by these insights, we propose outcome-based exploration, which assigns exploration bonuses according to final outcomes. We introduce two complementary algorithms: historical exploration, which encourages rarely observed answers via UCB-style bonuses, and batch exploration, which penalizes within-batch repetition to promote test-time diversity. Experiments on standard competition math with Llama and Qwen models demonstrate that both methods improve accuracy while mitigating diversity collapse. On the theoretical side, we formalize the benefit of outcome-based exploration through a new model of outcome-based bandits. Together, these contributions chart a practical path toward RL methods that enhance reasoning without sacrificing the diversity essential for scalable deployment.
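
A minimal sketch of the historical-exploration idea, assuming the bonus takes a standard UCB form over final-answer counts; the constant and exact bonus shape are illustrative, not the paper's specification.

```python
from collections import Counter
import math

def ucb_bonus(answer, counts, c=1.0):
    """UCB-style exploration bonus over final outcomes (assumed form):
    rarely observed answers earn a larger reward bonus."""
    t = sum(counts.values()) + 1
    n = counts[answer] + 1
    return c * math.sqrt(math.log(t) / n)

counts = Counter()
for ans in ["42", "42", "17"]:
    bonus = ucb_bonus(ans, counts)   # added to the correctness reward
    counts[ans] += 1
```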

[552] QualityFM: a Multimodal Physiological Signal Foundation Model with Self-Distillation for Signal Quality Challenges in Critically Ill Patients

Zongheng Guo, Tao Chen, Manuela Ferrario

Main category: cs.LG

TL;DR: QualityFM is a multimodal foundation model for PPG and ECG signals that uses self-distillation and windowed sparse attention to handle signal quality issues, trained on 21M+ waveforms and achieving strong performance on clinical tasks.

DetailsMotivation: High incidence of poor signal quality in ICU/OR leads to false alarms and diagnostic inaccuracies, with existing methods suffering from limited generalizability and reliance on labeled data.

Method: Dual-track architecture with self-distillation strategy, windowed sparse attention in Transformer, composite loss function combining distillation and reconstruction losses on power/phase spectra.

Result: Pre-trained three models (9.6M to 319M parameters) that demonstrated efficacy through transfer learning on ventricular tachycardia false alarm detection, atrial fibrillation identification, and ABP estimation.

Conclusion: QualityFM provides a general-purpose foundation model for physiological signal quality understanding with strong cross-task transferability and practical clinical value.

Abstract: Photoplethysmogram (PPG) and electrocardiogram (ECG) are commonly recorded in the intensive care unit (ICU) and operating room (OR). However, the high incidence of poor, incomplete, and inconsistent signal quality can lead to false alarms or diagnostic inaccuracies. The methods explored so far suffer from limited generalizability, reliance on extensive labeled data, and poor cross-task transferability. To overcome these challenges, we introduce QualityFM, a novel multimodal foundation model for these physiological signals, designed to acquire a general-purpose understanding of signal quality. Our model is pre-trained on a large-scale dataset comprising over 21 million 30-second waveforms and 179,757 hours of data. Our approach involves a dual-track architecture that processes paired physiological signals of differing quality, leveraging a self-distillation strategy where an encoder for high-quality signals is used to guide the training of an encoder for low-quality signals. To efficiently handle long sequential signals and capture essential local quasi-periodic patterns, we integrate a windowed sparse attention mechanism within our Transformer-based model. Furthermore, a composite loss function, which combines direct distillation loss on encoder outputs with indirect reconstruction loss based on power and phase spectra, ensures the preservation of frequency-domain characteristics of the signals. We pre-train three models with varying parameter counts (9.6 M to 319 M) and demonstrate their efficacy and practical value through transfer learning on three distinct clinical tasks: detection of false ventricular tachycardia alarms, identification of atrial fibrillation, and estimation of arterial blood pressure (ABP) from PPG and ECG signals.
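
A sketch of the composite objective described above, assuming MSE terms and illustrative weights: a direct distillation loss on encoder outputs plus an indirect reconstruction loss on power and phase spectra.

```python
import torch
import torch.nn.functional as F

def composite_loss(student_out, teacher_out, x_rec, x, alpha=1.0, beta=0.1):
    """Illustrative composite loss: distill the low-quality encoder toward
    the high-quality encoder, and match reconstructed power/phase spectra.
    (alpha/beta weights are assumptions; phase MSE ignores 2*pi wrap-around.)"""
    distill = F.mse_loss(student_out, teacher_out.detach())
    X, X_rec = torch.fft.rfft(x, dim=-1), torch.fft.rfft(x_rec, dim=-1)
    power = F.mse_loss(X_rec.abs(), X.abs())
    phase = F.mse_loss(torch.angle(X_rec), torch.angle(X))
    return alpha * distill + beta * (power + phase)
```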

[553] Lane Change Intention Prediction of two distinct Populations using a Transformer

Francesco De Cristofaro, Cornelia Lex, Jia Hu, Arno Eichberger

Main category: cs.LG

TL;DR: Transformer model for lane change prediction shows poor cross-dataset performance (39.43% accuracy) but achieves high accuracy (86.71%) when trained on combined datasets from different populations.

DetailsMotivation: Current lane change intention prediction algorithms are typically trained and tested on single datasets, lacking validation of their generalizability across different populations and driving environments.

Method: Tested a transformer model designed for lane change intention prediction on two distinct datasets collected by LevelX in Germany and Hong Kong, evaluating both cross-dataset performance and combined training approach.

Result: The transformer’s accuracy dropped significantly to 39.43% when tested on a different population than it was trained on, but achieved 86.71% accuracy when trained simultaneously on both populations.

Conclusion: Lane change prediction models suffer from poor cross-dataset generalization, but training on diverse datasets from different populations significantly improves model performance and robustness.

Abstract: As a result of the growing importance of lane change intention prediction for a safe and efficient driving experience in complex driving scenarios, researchers have in recent years started to train novel machine learning algorithms on available datasets with promising results. A shortcoming of this recent research effort, though, is that the vast majority of the proposed algorithms are trained on a single dataset. In doing so, researchers failed to test whether their algorithms would be as effective on a different dataset and, by extension, on a different population from the one on which they were trained. In this article, we test a transformer designed for lane change intention prediction on two datasets collected by LevelX in Germany and Hong Kong. We found that the transformer’s accuracy plummeted when tested on a population different from the one it was trained on, with accuracy values as low as 39.43%, but that when trained on both populations simultaneously it could achieve an accuracy as high as 86.71%.

[554] Learning Optimal Defender Strategies for CAGE-2 using a POMDP Model

Duc Huy Le, Rolf Stadler

Main category: cs.LG

TL;DR: BF-PPO method using PPO and particle filter outperforms CARDIFF in defender strategy learning and training time for CAGE-2 cybersecurity benchmark

DetailsMotivation: To develop an optimal defender strategy for CAGE-2 cybersecurity benchmark by creating a formal POMDP model and addressing computational complexity from large state spaces

Method: Constructed formal POMDP model for CAGE-2, developed BF-PPO method based on PPO with particle filter to handle computational complexity of large state space

Result: BF-PPO outperformed CARDIFF (highest ranked method on CAGE-2 leaderboard) in both learned defender strategy quality and required training time

Conclusion: The POMDP-based approach with particle filter effectively addresses computational challenges and delivers superior defender strategies for cybersecurity scenarios

Abstract: CAGE-2 is an accepted benchmark for learning and evaluating defender strategies against cyberattacks. It reflects a scenario where a defender agent protects an IT infrastructure against various attacks. Many defender methods for CAGE-2 have been proposed in the literature. In this paper, we construct a formal model for CAGE-2 using the framework of Partially Observable Markov Decision Process (POMDP). Based on this model, we define an optimal defender strategy for CAGE-2 and introduce a method to efficiently learn this strategy. Our method, called BF-PPO, is based on PPO, and it uses a particle filter to mitigate the computational complexity due to the large state space of the CAGE-2 model. We evaluate our method in the CAGE-2 CybORG environment and compare its performance with that of CARDIFF, the highest ranked method on the CAGE-2 leaderboard. We find that our method outperforms CARDIFF regarding the learned defender strategy and the required training time.
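
For orientation, the particle-filter component can be read as a standard bootstrap filter maintaining a belief over the hidden CAGE-2 state; the sketch below shows the generic update (the coupling with PPO in BF-PPO is more involved and not reproduced here).

```python
import numpy as np

def particle_filter_step(particles, weights, action, obs, transition, obs_lik, rng):
    """Generic bootstrap particle-filter belief update (standard algorithm,
    shown for orientation; transition/obs_lik are environment-specific)."""
    particles = np.array([transition(p, action, rng) for p in particles])
    weights = weights * np.array([obs_lik(obs, p) for p in particles])
    weights = weights / weights.sum()
    # Resample to avoid weight degeneracy.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```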

[555] Predicting Fetal Outcomes from Cardiotocography Signals Using a Supervised Variational Autoencoder

John Tolladay, Beth Albert, Gabriel Davis Jones

Main category: cs.LG

TL;DR: Supervised VAE model for CTG signal classification achieves competitive fetal outcome prediction (AUROC 0.752-0.779) while partially encoding clinically meaningful features, though full interpretability remains challenging due to FHR signal complexity.

DetailsMotivation: To address interpretability limitations of current deep learning approaches for cardiotocography (CTG) signal classification by developing a supervised variational autoencoder that can both classify pregnancy outcomes and provide insights into clinically relevant features.

Method: Used OxMat CTG dataset with 5-minute fetal heart rate segments labeled with postnatal outcomes. Trained VAE optimized for signal reconstruction and outcome prediction, incorporating Kullback-Leibler divergence and total correlation constraints to structure latent space. Evaluated with AUROC and MSE metrics.

Result: Achieved AUROC of 0.752 at segment level and 0.779 at CTG level. Relaxing total correlation constraints improved both reconstruction and classification. Baseline-related features were well represented and aligned with model scores, while variability metrics were less strongly encoded.

Conclusion: Supervised VAEs can provide competitive fetal outcome prediction while partially encoding clinically meaningful CTG features, though the irregular, multi-timescale nature of FHR signals poses challenges for full interpretability. Provides basis for future interpretable generative models.

Abstract: Objective: To develop and interpret a supervised variational autoencoder (VAE) model for classifying cardiotocography (CTG) signals based on pregnancy outcomes, addressing interpretability limits of current deep learning approaches. Methods: The OxMat CTG dataset was used to train a VAE on five-minute fetal heart rate (FHR) segments, labeled with postnatal outcomes. The model was optimised for signal reconstruction and outcome prediction, incorporating Kullback-Leibler divergence and total correlation (TC) constraints to structure the latent space. Performance was evaluated using area under the receiver operating characteristic curve (AUROC) and mean squared error (MSE). Interpretability was assessed using coefficient of determination, latent traversals and unsupervised component analyses. Results: The model achieved an AUROC of 0.752 at the segment level and 0.779 at the CTG level, where predicted scores were aggregated. Relaxing TC constraints improved both reconstruction and classification. Latent analysis showed that baseline-related features (e.g., FHR baseline, baseline shift) were well represented and aligned with model scores, while metrics like short- and long-term variability were less strongly encoded. Traversals revealed clear signal changes for baseline features, while other properties were entangled or subtle. Unsupervised decompositions corroborated these patterns. Findings: This work demonstrates that supervised VAEs can achieve competitive fetal outcome prediction while partially encoding clinically meaningful CTG features. The irregular, multi-timescale nature of FHR signals poses challenges for disentangling physiological components, distinguishing CTG from more periodic signals such as ECG. Although full interpretability was not achieved, the model supports clinically useful outcome prediction and provides a basis for future interpretable, generative models.
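
In symbols, a supervised VAE objective with KL and total-correlation terms of the kind described in the Methods is commonly written as follows (the weights are illustrative, not the paper's exact formulation):

```latex
\mathcal{L} = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\lVert x - \hat{x} \rVert^2\right]
  + \beta\, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\Vert\, p(z)\right)
  + \gamma\, D_{\mathrm{KL}}\!\left(q_\phi(z) \,\Vert\, \textstyle\prod_j q_\phi(z_j)\right)
  + \lambda\, \mathcal{L}_{\mathrm{cls}}
```

Here the third term is the total correlation that encourages disentangled latents and \mathcal{L}_{\mathrm{cls}} is the outcome-prediction loss; "relaxing TC constraints" in the Results corresponds to lowering \gamma.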

[556] Contrastive Self-Supervised Network Intrusion Detection using Augmented Negative Pairs

Jack Wilkie, Hanan Hindy, Christos Tachtatzis, Robert Atkinson

Main category: cs.LG

TL;DR: CLAN introduces a novel contrastive learning approach for network intrusion detection that treats augmented samples as negative views (representing malicious traffic) while using other benign samples as positive views, achieving superior performance in both binary and multi-class classification tasks.

DetailsMotivation: Traditional supervised ML models require large labeled datasets, making them impractical for real-world intrusion detection. Anomaly detection methods suffer from high false positive rates. Self-supervised learning shows promise but existing approaches have limitations in how they handle positive and negative sample generation.

Method: CLAN (Contrastive Learning using Augmented Negative pairs) treats augmented samples as negative views representing potentially malicious distributions, while other benign samples serve as positive views. This novel paradigm enhances both classification accuracy and inference efficiency after pretraining on benign traffic.

Result: Experimental evaluation on Lycos2017 dataset shows CLAN surpasses existing self-supervised and anomaly detection techniques in binary classification. When fine-tuned on limited labeled data, it achieves superior multi-class classification performance compared to existing self-supervised models.

Conclusion: CLAN provides an effective self-supervised learning framework for network intrusion detection that addresses limitations of existing approaches, offering improved accuracy, lower false positive rates, and better performance with limited labeled data.

Abstract: Network intrusion detection remains a critical challenge in cybersecurity. While supervised machine learning models achieve state-of-the-art performance, their reliance on large labelled datasets makes them impractical for many real-world applications. Anomaly detection methods, which train exclusively on benign traffic to identify malicious activity, suffer from high false positive rates, limiting their usability. Recently, self-supervised learning techniques have demonstrated improved performance with lower false positive rates by learning discriminative latent representations of benign traffic. In particular, contrastive self-supervised models achieve this by minimizing the distance between similar (positive) views of benign traffic while maximizing it between dissimilar (negative) views. Existing approaches generate positive views through data augmentation and treat other samples as negative. In contrast, this work introduces Contrastive Learning using Augmented Negative pairs (CLAN), a novel paradigm for network intrusion detection where augmented samples are treated as negative views (representing potentially malicious distributions) while other benign samples serve as positive views. This approach enhances both classification accuracy and inference efficiency after pretraining on benign traffic. Experimental evaluation on the Lycos2017 dataset demonstrates that the proposed method surpasses existing self-supervised and anomaly detection techniques in a binary classification task. Furthermore, when fine-tuned on a limited labelled dataset, the proposed approach achieves superior multi-class classification performance compared to existing self-supervised models.
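
A minimal sketch of the inverted pairing that defines CLAN, with assumed shapes and an InfoNCE-style form: augmented views act as negatives while other benign embeddings act as positives.

```python
import torch
import torch.nn.functional as F

def clan_loss(z_anchor, z_benign, z_augmented, tau=0.1):
    """InfoNCE-style loss with CLAN's inverted roles (assumed form):
    z_benign are positives (other benign samples), z_augmented are
    negatives standing in for potentially malicious distributions."""
    z_a = F.normalize(z_anchor, dim=-1)      # (B, d)
    z_p = F.normalize(z_benign, dim=-1)      # (B, d) positives
    z_n = F.normalize(z_augmented, dim=-1)   # (B, d) negatives
    pos = torch.exp((z_a * z_p).sum(-1) / tau)          # (B,)
    neg = torch.exp(z_a @ z_n.T / tau).sum(-1)          # (B,)
    return -torch.log(pos / (pos + neg)).mean()
```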

[557] Tackling Device Data Distribution Real-time Shift via Prototype-based Parameter Editing

Zheqi Lv, Wenqiao Zhang, Kairui Fu, Qi Tian, Shengyu Zhang, Jiajie Su, Jingyuan Chen, Kun Kuang, Fei Wu

Main category: cs.LG

TL;DR: Persona is a novel personalized method that uses prototype-based parameter editing to enhance on-device model generalization without retraining, addressing real-time data distribution shifts on devices.

DetailsMotivation: On-device real-time data distribution shifts challenge lightweight model generalization, but current research relies on data-intensive fine-tuning approaches that are impractical for deployed devices.

Method: Uses a neural adapter in the cloud to generate parameter editing matrices from real-time device data, clustering models into prototypes that are dynamically refined. Incorporates cross-layer knowledge transfer for consistent multi-layer parameter changes.

Result: Extensive experiments on vision and recommendation tasks across multiple datasets confirm Persona’s effectiveness and generality in handling data distribution shifts.

Conclusion: Persona provides an efficient backpropagation-free framework for enhancing on-device model generalization without post-deployment retraining, effectively addressing real-time data distribution challenges.

Abstract: Real-time data distribution shift on devices challenges the generalization of lightweight on-device models. This critical issue is often overlooked in current research, which predominantly relies on data-intensive and computationally expensive fine-tuning approaches. To tackle this, we introduce Persona, a novel personalized method using a prototype-based, backpropagation-free parameter editing framework to enhance model generalization without post-deployment retraining. Persona employs a neural adapter in the cloud to generate a parameter editing matrix based on real-time device data. This matrix adeptly adapts on-device models to the prevailing data distributions, efficiently clustering them into prototype models. The prototypes are dynamically refined via the parameter editing matrix, facilitating efficient evolution. Furthermore, the integration of cross-layer knowledge transfer ensures consistent and context-aware multi-layer parameter changes and prototype assignment. Extensive experiments on vision and recommendation tasks across multiple datasets confirm Persona’s effectiveness and generality.
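
One way to realize the cloud-side parameter editing is sketched below, where a hypothetical adapter maps a summary of real-time device data to an additive edit applied without on-device backpropagation; the shapes and the additive form are assumptions, not Persona's implementation.

```python
import torch
import torch.nn as nn

class CloudAdapter(nn.Module):
    """Hypothetical cloud-side adapter: maps a summary of real-time device
    data to an additive parameter edit for one on-device layer."""
    def __init__(self, stat_dim, out_f, in_f):
        super().__init__()
        self.net = nn.Linear(stat_dim, out_f * in_f)
        self.shape = (out_f, in_f)

    def forward(self, stats):            # stats: (stat_dim,)
        return self.net(stats).view(self.shape)

base = nn.Linear(32, 16)
adapter = CloudAdapter(stat_dim=8, out_f=16, in_f=32)
with torch.no_grad():                    # no backpropagation on the device
    base.weight += adapter(torch.randn(8))
```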

[558] AI for Scientific Discovery is a Social Problem

Georgia Channing, Avijit Ghosh

Main category: cs.LG

TL;DR: AI for science faces social/institutional barriers like community dysfunction, misaligned incentives, data fragmentation, and infrastructure inequities - not just technical challenges. Solutions require collective social approaches.

DetailsMotivation: To identify and address the primary social and institutional barriers preventing AI from delivering equitable benefits in scientific discovery, moving beyond purely technical solutions.

Method: Analysis of current AI-for-science landscape through examination of community dynamics, incentive structures, data practices, and infrastructure access patterns. Identifies four interconnected challenges through systematic assessment.

Result: Identifies four key barriers: community dysfunction, research priorities misaligned with upstream needs, data fragmentation, and infrastructure inequities. Shows these stem from cultural/organizational practices rather than purely technical limitations.

Conclusion: AI for science must be reframed as a collective social project requiring intentional community-building, cross-disciplinary education, shared benchmarks, and accessible infrastructure alongside technical innovation for sustainable progress.

Abstract: Artificial intelligence promises to accelerate scientific discovery, yet its benefits remain unevenly distributed. While technical obstacles such as scarce data, fragmented standards, and unequal access to computation are significant, we argue that the primary barriers are social and institutional. Narratives that defer progress to speculative “AI scientists,” the undervaluing of data and infrastructure contributions, misaligned incentives, and gaps between domain experts and machine learning researchers all constrain impact. We highlight four interconnected challenges: community dysfunction, research priorities misaligned with upstream needs, data fragmentation, and infrastructure inequities. We argue that their roots lie in cultural and organizational practices. Addressing them requires not only technical innovation but also intentional community-building, cross-disciplinary education, shared benchmarks, and accessible infrastructure. We call for reframing AI for science as a collective social project, where sustainable collaboration and equitable participation are treated as prerequisites for technical progress.

[559] Information-Theoretic Bounds and Task-Centric Learning Complexity for Real-World Dynamic Nonlinear Systems

Sri Satish Krishna Chaitanya Bulusu, Mikko Sillanpää

Main category: cs.LG

TL;DR: A theoretical framework for modeling dynamic nonlinear systems using structured decomposition and variance analysis, showing that static and dynamic distortions cannot be minimized simultaneously and providing complexity bounds for learning.

DetailsMotivation: Dynamic nonlinear systems exhibit coupled static and dynamic distortions that pose challenges for data-driven modeling, requiring a more foundational theoretical approach.

Method: Uses structured decomposition, variance analysis, directional lower bounds on interactions, behavioral indicators, memory finiteness index, and power-based conditions to establish measurable links between system properties and thermodynamic laws.

Result: Developed a Behavioral Uncertainty Principle showing static and dynamic distortions cannot be minimized simultaneously, identified system resistance to complete deterministic decomposition, and created model-agnostic complexity metrics showing lower-variance components are easier to learn.

Conclusion: The framework explains empirical benefits of structured residual learning and provides a scalable, theoretically grounded approach for modeling complex dynamic nonlinear systems with broad applicability.

Abstract: Dynamic nonlinear systems exhibit distortions arising from coupled static and dynamic effects. Their intertwined nature poses major challenges for data-driven modeling. This paper presents a theoretical framework grounded in structured decomposition, variance analysis, and task-centric complexity bounds. The framework employs a directional lower bound on interactions between measurable system components, extending orthogonality in inner product spaces to structurally asymmetric settings. This bound supports variance inequalities for decomposed systems. Key behavioral indicators are introduced along with a memory finiteness index. A rigorous power-based condition establishes a measurable link between finite memory in realizable systems and the First Law of Thermodynamics. This offers a more foundational perspective than classical bounds based on the Second Law. Building on this foundation, we formulate a "Behavioral Uncertainty Principle," demonstrating that static and dynamic distortions cannot be minimized simultaneously. We identify that real-world systems seem to resist complete deterministic decomposition due to entangled static and dynamic effects. We also present two general-purpose theorems linking function variance to mean-squared Lipschitz continuity and learning complexity. This yields a model-agnostic, task-aware complexity metric, showing that lower-variance components are inherently easier to learn. These insights explain the empirical benefits of structured residual learning, including improved generalization, reduced parameter count, and lower training cost, as previously observed in power amplifier linearization experiments. The framework is broadly applicable and offers a scalable, theoretically grounded approach to modeling complex dynamic nonlinear systems.

[560] PAC-Bayesian Generalization Bounds for Graph Convolutional Networks on Inductive Node Classification

Huayi Tang, Yong Liu

Main category: cs.LG

TL;DR: Theoretical analysis of GCN generalization bounds for dynamic graphs, addressing data dependency and non-stationarity in inductive node classification.

DetailsMotivation: Real-world graphs are dynamic with evolving nodes and connections, but previous theoretical studies based on transductive learning fail to adequately model temporal evolution and structural dynamics.

Method: PAC-Bayesian theoretical analysis of graph convolutional networks (GCNs) for inductive node classification, treating nodes as dependent and non-identically distributed data points. Derives generalization bounds for one-layer and two-layer GCNs.

Result: Novel generalization bounds that incorporate data dependency and non-stationarity effects, with sufficient conditions for generalization gap convergence to zero as nodes increase. Two-layer GCNs require stronger graph topology assumptions for convergence.

Conclusion: Establishes theoretical foundation for understanding and improving GNN generalization in dynamic graph environments.

Abstract: Graph neural networks (GNNs) have achieved remarkable success in processing graph-structured data across various applications. A critical aspect of real-world graphs is their dynamic nature, where new nodes are continually added and existing connections may change over time. Previous theoretical studies, largely based on the transductive learning framework, fail to adequately model such temporal evolution and structural dynamics. In this paper, we present a PAC-Bayesian theoretical analysis of graph convolutional networks (GCNs) for inductive node classification, treating nodes as dependent and non-identically distributed data points. We derive novel generalization bounds for one-layer GCNs that explicitly incorporate the effects of data dependency and non-stationarity, and establish sufficient conditions under which the generalization gap converges to zero as the number of nodes increases. Furthermore, we extend our analysis to two-layer GCNs and reveal that stronger assumptions on graph topology are required to guarantee convergence. This work establishes a theoretical foundation for understanding and improving GNN generalization in dynamic graph environments.
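
For context, the classical PAC-Bayes bound for i.i.d. data (McAllester-style) reads as follows; the paper's contribution is precisely to derive analogues when nodes are dependent and non-identically distributed, so this is shown only for orientation:

```latex
\mathbb{E}_{h \sim Q}\!\left[R(h)\right] \le \mathbb{E}_{h \sim Q}\!\left[\hat{R}_n(h)\right]
  + \sqrt{\frac{\mathrm{KL}(Q \,\Vert\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
```

This holds with probability at least 1 - \delta over an i.i.d. sample of size n, for any posterior Q over hypotheses and any data-independent prior P.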

[561] Demo: Healthcare Agent Orchestrator (HAO) for Patient Summarization in Molecular Tumor Boards

Noel Codella, Sam Preston, Hao Qiu, Leonardo Schettini, Wen-wai Yim, Mert Öz, Shrey Jain, Matthew P. Lungren, Thomas Osborne

Main category: cs.LG

TL;DR: AI system (HAO) with TBFact evaluation framework automates patient summary generation for Molecular Tumor Boards, achieving 94% high-importance information capture and enabling secure clinical deployment.

DetailsMotivation: Manual patient summary creation for Molecular Tumor Boards is labor-intensive, subjective, and prone to information omissions, requiring automated solutions.

Method: Healthcare Agent Orchestrator (HAO) uses LLM-driven multi-agent workflow to generate summaries, with TBFact framework evaluating comprehensiveness and succinctness using model-as-judge approach.

Result: HAO captured 94% of high-importance information with TBFact recall of 0.84 under strict entailment criteria, enabling data-free evaluation without sharing sensitive clinical data.

Conclusion: HAO and TBFact provide reliable, scalable support for MTBs through automated summary generation and robust evaluation framework that maintains data privacy.

Abstract: Molecular Tumor Boards (MTBs) are multidisciplinary forums where oncology specialists collaboratively assess complex patient cases to determine optimal treatment strategies. A central element of this process is the patient summary, typically compiled by a medical oncologist, radiation oncologist, or surgeon, or their trained medical assistant, who distills heterogeneous medical records into a concise narrative to facilitate discussion. This manual approach is often labor-intensive, subjective, and prone to omissions of critical information. To address these limitations, we introduce the Healthcare Agent Orchestrator (HAO), a Large Language Model (LLM)-driven AI agent that coordinates a multi-agent clinical workflow to generate accurate and comprehensive patient summaries for MTBs. Evaluating predicted patient summaries against ground truth presents additional challenges due to stylistic variation, ordering, synonym usage, and phrasing differences, which complicate the measurement of both succinctness and completeness. To overcome these evaluation hurdles, we propose TBFact, a "model-as-a-judge" framework designed to assess the comprehensiveness and succinctness of generated summaries. Using a benchmark dataset derived from de-identified tumor board discussions, we applied TBFact to evaluate our Patient History agent. Results show that the agent captured 94% of high-importance information (including partial entailments) and achieved a TBFact recall of 0.84 under strict entailment criteria. We further demonstrate that TBFact enables a data-free evaluation framework that institutions can deploy locally without sharing sensitive clinical data. Together, HAO and TBFact establish a robust foundation for delivering reliable and scalable support to MTBs.

[562] Small Vectors, Big Effects: A Mechanistic Study of RL-Induced Reasoning via Steering Vectors

Viacheslav Sinii, Nikita Balagansky, Yaroslav Aksenov, Vadim Kurochkin, Daniil Laptev, Gleb Gerasimov, Alexey Gorbatovski, Boris Shaposhnikov, Daniil Gavrilov

Main category: cs.LG

TL;DR: The paper analyzes how reasoning training reshapes language models through lightweight steering vectors, finding they act as token-substitution biases and modify MLP/unembedding computations while maintaining interpretability.

DetailsMotivation: To understand the mechanisms by which reasoning training changes language model computations, particularly through lightweight steering vectors that can match fine-tuning performance while remaining interpretable.

Method: Used reinforcement-learning trained steering vectors inserted into residual streams, analyzed with logit-lens readouts, path patching, and circuit analyses on two models.

Result: Found that last-layer steering vectors act as token-substitution biases on first generated tokens, while penultimate-layer vectors modify MLP/unembedding computations to up-weight process words and structure symbols without changing attention patterns.

Conclusion: Establishes a principled framework for interpreting behavioral changes from reasoning training through additive interventions that provide both performance and interpretability.

Abstract: The mechanisms by which reasoning training reshapes language-model computations remain poorly understood. We study lightweight steering vectors inserted into the base model’s residual stream and trained with a reinforcement-learning objective, which can match full fine-tuning performance while retaining the interpretability of small, additive interventions. Using logit-lens readouts, path patching, and circuit analyses, we analyze two models and find: (i) the last-layer steering vector behaves like a token-substitution bias concentrated on the first generated token, consistently boosting tokens such as “To” and “Step”; and (ii) the penultimate-layer steering vector leaves attention patterns largely unchanged and instead acts through the MLP and unembedding, preferentially up-weighting process words and structure symbols. These results establish a principled framework for interpreting the behavioral changes induced by reasoning training.
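
A common way to apply such a vector at inference time is a forward hook that adds it to one block's residual-stream output; the sketch below assumes a GPT-2-style module layout and is not the paper's training code (the vectors themselves are RL-trained).

```python
import torch

def add_steering_hook(block, v):
    """Register a forward hook that adds a steering vector to the residual
    stream leaving one transformer block (a common application pattern;
    the block output tuple layout is an assumption)."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h + v.to(h.dtype)        # broadcasts over batch and positions
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return block.register_forward_hook(hook)

# usage (hypothetical GPT-2-style model):
#   handle = add_steering_hook(model.transformer.h[-1], steering_vector)
#   ... generate ...
#   handle.remove()
```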

[563] A Survey of Generalization of Graph Anomaly Detection: From Transfer Learning to Foundation Models

Junjun Pan, Yu Zheng, Yue Tan, Yixin Liu

Main category: cs.LG

TL;DR: A comprehensive review paper on generalization in graph anomaly detection (GAD), covering evolution, problem formalization, taxonomy, method review, and future directions.

DetailsMotivation: Most GAD methods assume identical training/testing distributions and are task-specific, limiting adaptability to real-world scenarios with shifting distributions and scarce training samples in new applications.

Method: Systematic review approach: tracing evolution of generalization in GAD, formalizing problem settings, developing a taxonomy, and conducting comprehensive review of existing generalized GAD methods.

Result: Provides a structured framework for understanding generalization in GAD, including transfer learning approaches and “one-for-all” foundation models that can generalize across multiple applications.

Conclusion: Identifies current open challenges and suggests future research directions to advance generalization capabilities in graph anomaly detection for real-world applications.

Abstract: Graph anomaly detection (GAD) has attracted increasing attention in recent years for identifying malicious samples in a wide range of graph-based applications, such as social media and e-commerce. However, most GAD methods assume identical training and testing distributions and are tailored to specific tasks, resulting in limited adaptability to real-world scenarios such as shifting data distributions and scarce training samples in new applications. To address the limitations, recent work has focused on improving the generalization capability of GAD models through transfer learning that leverages knowledge from related domains to enhance detection performance, or developing “one-for-all” GAD foundation models that generalize across multiple applications. Since a systematic understanding of generalization in GAD is still lacking, in this paper, we provide a comprehensive review of generalization in GAD. We first trace the evolution of generalization in GAD and formalize the problem settings, which further leads to our systematic taxonomy. Rooted in this fine-grained taxonomy, an up-to-date and comprehensive review is conducted for the existing generalized GAD methods. Finally, we identify current open challenges and suggest future directions to inspire future research in this emerging field.

[564] BEAM: Brainwave Empathy Assessment Model for Early Childhood

Chen Xie, Gaofeng Wu, Kaidong Wang, Zihao Zhu, Xiaoshu Luo, Yan Liang, Feiyu Quan, Ruoxi Wu, Xianghui Huang, Han Zhang

Main category: cs.LG

TL;DR: BEAM is a novel deep learning framework that uses multi-view EEG signals to objectively predict empathy levels in young children, outperforming existing methods by capturing both cognitive and emotional dimensions through spatio-temporal feature extraction and contrastive learning.

DetailsMotivation: Traditional empathy assessment methods rely on subjective self-reports or observer labeling which are biased and fail to objectively capture empathy formation processes. EEG offers objectivity but current approaches only extract static patterns, missing temporal dynamics.

Method: Proposed Brainwave Empathy Assessment Model (BEAM) with three components: 1) LaBraM-based encoder for spatio-temporal feature extraction, 2) feature fusion module to integrate multi-view EEG signals, 3) contrastive learning module to enhance class separation.

Result: BEAM outperforms state-of-the-art methods across multiple metrics when validated on the CBCP dataset, demonstrating superior performance in predicting empathy levels in children aged 4-6 years.

Conclusion: BEAM provides an objective tool for empathy assessment in young children and offers preliminary insights for early interventions in children’s prosocial development, addressing limitations of traditional subjective methods.

Abstract: Empathy in young children is crucial for their social and emotional development, yet predicting it remains challenging. Traditional methods often only rely on self-reports or observer-based labeling, which are susceptible to bias and fail to objectively capture the process of empathy formation. EEG offers an objective alternative; however, current approaches primarily extract static patterns, neglecting temporal dynamics. To overcome these limitations, we propose a novel deep learning framework, the Brainwave Empathy Assessment Model (BEAM), to predict empathy levels in children aged 4-6 years. BEAM leverages multi-view EEG signals to capture both cognitive and emotional dimensions of empathy. The framework comprises three key components: 1) a LaBraM-based encoder for effective spatio-temporal feature extraction, 2) a feature fusion module to integrate complementary information from multi-view signals, and 3) a contrastive learning module to enhance class separation. Validated on the CBCP dataset, BEAM outperforms state-of-the-art methods across multiple metrics, demonstrating its potential for objective empathy assessment and providing a preliminary insight into early interventions in children’s prosocial development.

[565] Knowledge-Guided Machine Learning for Stabilizing Near-Shortest Path Routing

Yung-Fu Chen, Sen Lin, Anish Arora

Main category: cs.LG

TL;DR: A simple algorithm that trains DNNs on minimal graph data to learn local routing policies that generalize across geometric random graphs, solving all-pairs near-shortest path problems efficiently.

DetailsMotivation: To develop scalable and efficient routing policies for geometric random graphs that can generalize from minimal training data while leveraging network domain knowledge.

Method: Train deep neural networks using domain-informed input features (distance-to-destination and node stretch) to learn local routing policies from few samples of a single seed graph.

Result: The DNN using only distance-to-destination learns Greedy Forwarding policy, while the DNN with both features learns GreedyTensile routing that outperforms greedy forwarding and operates with ultra-low latency.

Conclusion: The approach successfully learns generalizable routing policies with minimal data, demonstrates explainability through symbolic interpretation, and achieves superior performance over traditional methods.

Abstract: We propose a simple algorithm that needs only a few data samples from a single graph for learning local routing policies that generalize across a rich class of geometric random graphs in Euclidean metric spaces. We thus solve the all-pairs near-shortest path problem by training deep neural networks (DNNs) that let each graph node efficiently and scalably route (i.e., forward) packets by considering only the node’s state and the state of the neighboring nodes. Our algorithm design exploits network domain knowledge in the selection of input features and design of the policy function for learning an approximately optimal policy. Domain knowledge also provides theoretical assurance that the choice of a "seed graph" and its node data sampling suffices for generalizable learning. Remarkably, one of these DNNs we train – using distance-to-destination as the only input feature – learns a policy that exactly matches the well-known Greedy Forwarding policy, which forwards packets to the neighbor with the shortest distance to the destination. We also learn a new policy, which we call GreedyTensile routing – using both distance-to-destination and node stretch as the input features – that almost always outperforms greedy forwarding. We demonstrate the explainability and ultra-low latency run-time operation of GreedyTensile routing by symbolically interpreting its DNN in low-complexity terms of two linear actions.
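
For concreteness, the Greedy Forwarding policy that the distance-only DNN recovers can be written in a few lines (coordinates assumed Euclidean, per the paper's setting):

```python
import math

def greedy_forward(node_pos, neighbor_positions, dest_pos):
    """Greedy Forwarding: relay the packet to the neighbor closest to the
    destination; return None if no neighbor makes progress."""
    best = min(neighbor_positions, key=lambda p: math.dist(p, dest_pos))
    if math.dist(best, dest_pos) < math.dist(node_pos, dest_pos):
        return best
    return None

next_hop = greedy_forward((0, 0), [(1, 0), (0, 2)], (3, 0))  # -> (1, 0)
```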

[566] Group Effect Enhanced Generative Adversarial Imitation Learning for Individual Travel Behavior Modeling under Incentives

Yuanyuan Wu, Zhenlin Qin, Leizhen Wang, Xiaolei Ma, Zhenliang Ma

Main category: cs.LG

TL;DR: Proposes gcGAIL model for efficient individual travel behavior modeling using group patterns, outperforms existing methods in accuracy and generalization for transport incentive responses.

DetailsMotivation: Individual travel behavior modeling is crucial for urban mobility policy but faces data challenges - MDPs require extensive data with spatial-temporal coverage and situational diversity issues.

Method: Group-effect-enhanced generative adversarial imitation learning (gcGAIL) that leverages shared behavioral patterns among passenger groups to improve modeling efficiency.

Result: gcGAIL outperforms state-of-the-art benchmarks (AIRL, GAIL, conditional GAIL) in accuracy, generalization, and pattern efficiency. Robust to spatial variation, data sparsity, and behavioral diversity.

Conclusion: gcGAIL enables efficient prediction of individual behavior responses over time, providing basis for personalized incentives and better timing of incentive injections for sustainable behavior changes.

Abstract: Understanding and modeling individual travel behavior responses is crucial for urban mobility regulation and policy evaluation. The Markov decision process (MDP) provides a structured framework for dynamic travel behavior modeling at the individual level. However, solving an MDP in this context is highly data-intensive and faces challenges of data quantity, spatial-temporal coverage, and situational diversity. To address these, we propose a group-effect-enhanced generative adversarial imitation learning (gcGAIL) model that improves the individual behavior modeling efficiency by leveraging shared behavioral patterns among passenger groups. We validate the gcGAIL model using a public transport fare-discount case study and compare against state-of-the-art benchmarks, including adversarial inverse reinforcement learning (AIRL), baseline GAIL, and conditional GAIL. Experimental results demonstrate that gcGAIL outperforms these methods in learning individual travel behavior responses to incentives over time in terms of accuracy, generalization, and pattern demonstration efficiency. Notably, gcGAIL is robust to spatial variation, data sparsity, and behavioral diversity, maintaining strong performance even with partial expert demonstrations and underrepresented passenger groups. The gcGAIL model predicts the individual behavior response at any time, providing the basis for personalized incentives to induce sustainable behavior changes (better timing of incentive injections).
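
As background, gcGAIL builds on the standard GAIL objective, in which a discriminator D separates expert from policy state-action pairs while the policy is trained to fool it (shown for orientation; the group-effect enhancement is the paper's addition and is not reproduced here):

```latex
\min_{\pi}\ \max_{D}\ \ \mathbb{E}_{(s,a) \sim \pi_E}\!\left[\log D(s,a)\right]
  + \mathbb{E}_{(s,a) \sim \pi}\!\left[\log\!\left(1 - D(s,a)\right)\right]
  - \lambda H(\pi)
```

Here \pi_E is the expert (observed traveler behavior), \pi is the imitating policy, and H is a causal-entropy regularizer.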

[567] TrajAware: Graph Cross-Attention and Trajectory-Aware for Generalisable VANETs under Partial Observations

Xiaolu Fu, Ziyuan Bao, Eiman Kanjo

Main category: cs.LG

TL;DR: TrajAware is an RL-based routing framework for VANETs that addresses dynamic topologies and edge device constraints through action space pruning, graph cross-attention, and trajectory-aware prediction, achieving near-optimal performance on resource-limited hardware.

DetailsMotivation: VANET routing faces challenges from dynamic topologies, incomplete observations, and limited edge device resources. Existing RL approaches assume fixed graph structures and require retraining for changing conditions, making them unsuitable for constrained hardware deployment.

Method: TrajAware integrates three components: (1) action space pruning to reduce redundant neighbor options while preserving two-hop reachability, (2) graph cross-attention to map pruned neighbors to global graph context for generalization across network sizes, and (3) trajectory-aware prediction using historical routes and junction information to estimate real-time positions under partial observations.

Result: Evaluated in SUMO simulator with real-world city maps using leave-one-city-out setup, TrajAware achieves near-shortest paths and high delivery ratios while maintaining efficiency suitable for constrained edge devices, outperforming state-of-the-art baselines in both full and partial observation scenarios.

Conclusion: TrajAware provides an effective RL-based solution for VANET routing that addresses the challenges of dynamic environments and hardware constraints, demonstrating superior performance and generalization capabilities compared to existing approaches.

Abstract: Vehicular ad hoc networks (VANETs) are a crucial component of intelligent transportation systems; however, routing remains challenging due to dynamic topologies, incomplete observations, and the limited resources of edge devices. Existing reinforcement learning (RL) approaches often assume fixed graph structures and require retraining when network conditions change, making them unsuitable for deployment on constrained hardware. We present TrajAware, an RL-based framework designed for edge AI deployment in VANETs. TrajAware integrates three components: (i) action space pruning, which reduces redundant neighbour options while preserving two-hop reachability, alleviating the curse of dimensionality; (ii) graph cross-attention, which maps pruned neighbours to the global graph context, producing features that generalise across diverse network sizes; and (iii) trajectory-aware prediction, which uses historical routes and junction information to estimate real-time positions under partial observations. We evaluate TrajAware in the open-source SUMO simulator using real-world city maps with a leave-one-city-out setup. Results show that TrajAware achieves near-shortest paths and high delivery ratios while maintaining efficiency suitable for constrained edge devices, outperforming state-of-the-art baselines in both full and partial observation scenarios.
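
The action-space pruning component can be pictured as a set-cover-style reduction: keep the fewest neighbors whose combined neighborhoods still reach every two-hop node. The greedy sketch below is one assumed realization of that idea, not the paper's algorithm.

```python
def prune_actions(neighbors, adj):
    """Greedy set cover over two-hop targets (illustrative): adj maps each
    neighbor to the set of nodes reachable through it in one more hop."""
    targets = set().union(*(adj[n] for n in neighbors)) if neighbors else set()
    kept, covered, remaining = [], set(), set(neighbors)
    while covered < targets and remaining:
        best = max(remaining, key=lambda n: len(adj[n] - covered))
        kept.append(best)
        covered |= adj[best]
        remaining.remove(best)
    return kept

adj = {"a": {1, 2}, "b": {2}, "c": {3}}
print(prune_actions(["a", "b", "c"], adj))  # -> ['a', 'c']; 'b' is redundant
```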

[568] Barycentric Neural Networks and Length-Weighted Persistent Entropy Loss: A Green Geometric and Topological Framework for Function Approximation

Victor Toscano-Duran, Rocio Gonzalez-Diaz, Miguel A. Gutiérrez-Naranjo

Main category: cs.LG

TL;DR: A new small shallow neural network called Barycentric Neural Network (BNN) uses barycentric coordinates and fixed base points to exactly represent continuous piecewise linear functions, combined with a novel length-weighted persistent entropy loss for superior approximation performance.

DetailsMotivation: To address the computational costs of deep/overparameterized networks by creating small shallow networks that can efficiently approximate continuous functions while ensuring strict continuity and interpretability.

Method: Proposes BNN architecture using barycentric coordinates and fixed base points, introduces length-weighted persistent entropy (LWPE) as a stable topological feature, and optimizes base points directly rather than internal weights.

Result: The approach achieves superior and faster approximation performance compared to classical loss functions (MSE, RMSE, MAE, log-cosh) in resource-constrained settings.

Conclusion: BNN combined with LWPE loss provides flexible, geometrically interpretable function approximation that is particularly effective when computational resources are limited.

Abstract: While it is well-established that artificial neural networks are universal approximators for continuous functions on compact domains, many modern approaches rely on deep or overparameterized architectures that incur high computational costs. In this paper, a new type of small shallow neural network, called the Barycentric Neural Network (BNN), is proposed, which leverages a fixed set of base points and their barycentric coordinates to define both its structure and its parameters. We demonstrate that our BNN enables the exact representation of continuous piecewise linear functions (CPLFs), ensuring strict continuity across segments. Since any continuous function over a compact domain can be approximated arbitrarily well by CPLFs, the BNN naturally emerges as a flexible and interpretable tool for function approximation. Beyond the use of this representation, the main contribution of the paper is the introduction of a new variant of persistent entropy, a topological feature that is stable and scale invariant, called the length-weighted persistent entropy (LWPE), which is weighted by the lifetime of topological features. Our framework, which combines the BNN with a loss function based on our LWPE, aims to provide flexible and geometrically interpretable approximations of nonlinear continuous functions in resource-constrained settings, such as those with limited base points for BNN design and few training epochs. Instead of optimizing internal weights, our approach directly optimizes the base points that define the BNN. Experimental results show that our approach achieves superior and faster approximation performance compared to classical loss functions such as MSE, RMSE, MAE, and log-cosh.
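
For intuition, standard persistent entropy treats the normalized lifetimes of topological features as a probability distribution; LWPE additionally weights terms by feature lifetime. A toy sketch, with the lifetime-weighting being one plausible reading of the abstract rather than the paper's exact formula:

```python
import numpy as np

def persistent_entropy(lifetimes):
    # Standard persistent entropy of a persistence diagram's lifetimes.
    p = lifetimes / lifetimes.sum()
    return -(p * np.log(p)).sum()

def length_weighted_pe(lifetimes):
    # Assumption for illustration only: each entropy term is weighted by
    # its feature's lifetime, emphasising long-lived (salient) features.
    p = lifetimes / lifetimes.sum()
    return -(lifetimes * p * np.log(p)).sum()

bars = np.array([0.1, 0.2, 1.5])  # toy persistence lifetimes
print(persistent_entropy(bars), length_weighted_pe(bars))
```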

[569] Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks

Su Hyeong Lee, Risi Kondor, Richard Ngo

Main category: cs.LG

TL;DR: A probabilistic theory of intelligent agency for neural models where agents are outcome distributions with log score utility, showing strict welfare improvement through weighted logarithmic pooling and formalizing agentic alignment phenomena in LLMs.

DetailsMotivation: To develop a principled mathematical framework for understanding how subagents coalesce into coherent higher-level entities, with implications for alignment in agentic AI systems.

Method: Represent agents as outcome distributions with epistemic utility given by log score, define compositions through weighted logarithmic pooling, and analyze properties like cloning invariance, continuity, and openness.

Result: Proved strict unanimity is impossible under linear pooling or binary outcomes but possible with ≥3 outcomes; formalized Waluigi effect in LLMs where benevolent personas induce antagonistic counterparts; showed manifest-then-suppress strategy yields larger misalignment reduction.

Conclusion: The framework provides novel insights into agentic alignment, demonstrating how mathematical modeling of subagent composition can inform alignment strategies for AI systems.

Abstract: We develop a theory of intelligent agency grounded in probabilistic modeling for neural models. Agents are represented as outcome distributions with epistemic utility given by log score, and compositions are defined through weighted logarithmic pooling that strictly improves every member's welfare. We prove that strict unanimity is impossible under linear pooling or in binary outcome spaces, but possible with three or more outcomes. Our framework admits recursive structure via cloning invariance, continuity, and openness, while tilt-based analysis rules out trivial duplication. Finally, we formalize an agentic alignment phenomenon in LLMs using our theory: eliciting a benevolent persona ("Luigi") induces an antagonistic counterpart ("Waluigi"), while a manifest-then-suppress Waluigi strategy yields strictly larger first-order misalignment reduction than pure Luigi reinforcement alone. These results clarify how a principled mathematical framework for the way subagents coalesce into coherent higher-level entities yields novel implications for alignment in agentic AI systems.
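
The composition rule is concrete enough to sketch: weighted logarithmic pooling makes the pooled distribution proportional to the weighted geometric mean of the members. A minimal example over a three-outcome space (the regime where the paper shows strict unanimity is possible):

```python
import numpy as np

def log_pool(ps, ws):
    # Pooled distribution q(x) proportional to prod_i p_i(x)^{w_i}.
    ps, ws = np.asarray(ps), np.asarray(ws)
    log_q = ws @ np.log(ps)          # sum_i w_i * log p_i(x), per outcome
    q = np.exp(log_q - log_q.max())  # stabilise before normalising
    return q / q.sum()

p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.2, 0.5, 0.3])
print(log_pool([p1, p2], ws=[0.5, 0.5]))
```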

[570] Nested Optimal Transport Distances

Ruben Bontorno, Songyan Hou

Main category: cs.LG

TL;DR: Proposes nested optimal transport distance as a robust evaluation metric for financial time series generative models, with an efficient parallel computation algorithm.

DetailsMotivation: Lack of consensus metrics for evaluating deep generative models of financial time series, particularly for decision-making applications like stress testing and scenario generation.

Method: Employ nested optimal transport distance (time-causal variant of optimal transport) and develop a statistically consistent, parallelizable algorithm for its computation.

Result: Achieves substantial speedups over existing approaches while maintaining robustness for financial decision tasks like hedging, optimal stopping, and reinforcement learning.

Conclusion: Nested optimal transport provides an effective evaluation framework for financial time series generative models with practical computational advantages.

Abstract: Simulating realistic financial time series is essential for stress testing, scenario generation, and decision-making under uncertainty. Despite advances in deep generative models, there is no consensus metric for their evaluation. We focus on generative AI for financial time series in decision-making applications and employ the nested optimal transport distance, a time-causal variant of optimal transport distance, which is robust to tasks such as hedging, optimal stopping, and reinforcement learning. Moreover, we propose a statistically consistent, naturally parallelizable algorithm for its computation, achieving substantial speedups over existing approaches.

[571] RT-HCP: Dealing with Inference Delays and Sample Efficiency to Learn Directly on Robotic Platforms

Zakariae El Asri, Ibrahim Laiche, Clément Rambour, Olivier Sigaud, Nicolas Thome

Main category: cs.LG

TL;DR: Proposes RT-HCP algorithm that addresses slow inference time in model-based RL for robot control by providing action sequences to meet high-frequency requirements while maintaining sample efficiency.

DetailsMotivation: Model-based RL methods are sample efficient but suffer from slow inference times that prevent meeting robot control frequency requirements, creating execution gaps in real-time control.

Method: Defines a framework for handling inference delays where slow controllers provide action sequences to feed high-frequency robotic platforms. Proposes RT-HCP algorithm that offers optimal trade-off between performance, sample efficiency and inference time.

Result: Validated superiority of RT-HCP through experiments on a high-frequency FURUTA pendulum platform, demonstrating excellent performance while meeting control frequency requirements.

Conclusion: RT-HCP provides an effective solution for learning controllers directly on robots by addressing both sample efficiency and inference time challenges, enabling real-time high-frequency control.

Abstract: Learning a controller directly on the robot requires extreme sample efficiency. Model-based reinforcement learning (RL) methods are the most sample efficient, but their inference time is often too long to meet robot control frequency requirements. In this paper, we address the sample efficiency and inference time challenges with two contributions. First, we define a general framework to deal with inference delays, in which the slow inference robot controller provides a sequence of actions to feed the control-hungry robotic platform without execution gaps. Then, we compare several RL algorithms in the light of this framework and propose RT-HCP, an algorithm that offers an excellent trade-off between performance, sample efficiency and inference time. We validate the superiority of RT-HCP with experiments where we learn a controller directly on a simple but high-frequency FURUTA pendulum platform. Code: github.com/elasriz/RTHCP
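
The framework's core pattern, a slow planner feeding an action buffer that the high-frequency loop drains without execution gaps, can be sketched in a few lines; the tick counts and stub planner below are illustrative assumptions, not RT-HCP itself.

```python
from collections import deque

PLAN_DELAY = 5      # control ticks one (slow) inference call takes
HORIZON = 8         # actions returned per inference call

def plan(state):
    return [f"a{t}@s{state}" for t in range(HORIZON)]  # stub policy

buffer = deque(plan(state=0))
pending, ready_at = None, None

for tick in range(1, 21):
    # Kick off a new plan early enough that the buffer never empties.
    if pending is None and len(buffer) <= PLAN_DELAY:
        pending, ready_at = plan(state=tick), tick + PLAN_DELAY
    if pending is not None and tick >= ready_at:
        buffer.extend(pending)         # plan finished: refill the buffer
        pending = None
    action = buffer.popleft()          # executed every tick, no gaps
    print(tick, action)
```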

[572] Long-Range Graph Wavelet Networks

Filippo Guerranti, Fabrizio Forte, Simon Geisler, Stephan Günnemann

Main category: cs.LG

TL;DR: LR-GWN is a graph neural network that combines local polynomial approximations with spectral parameterization to effectively capture both short-range and long-range interactions in graphs using wavelet decomposition.

DetailsMotivation: Existing wavelet-based graph neural networks are limited by finite-order polynomial approximations that restrict their receptive fields and hinder long-range information propagation across graphs.

Method: Decompose wavelet filters into complementary local and global components: local aggregation with efficient low-order polynomials, and long-range interactions through flexible spectral domain parameterization.

Result: Achieves state-of-the-art performance among wavelet-based methods on long-range benchmarks while remaining competitive on short-range datasets.

Conclusion: The hybrid design successfully unifies short- and long-distance information flow within a principled wavelet framework, overcoming limitations of previous polynomial approximations.

Abstract: Modeling long-range interactions, the propagation of information across distant parts of a graph, is a central challenge in graph machine learning. Graph wavelets, inspired by multi-resolution signal processing, provide a principled way to capture both local and global structures. However, existing wavelet-based graph neural networks rely on finite-order polynomial approximations, which limit their receptive fields and hinder long-range propagation. We propose Long-Range Graph Wavelet Networks (LR-GWN), which decompose wavelet filters into complementary local and global components. Local aggregation is handled with efficient low-order polynomials, while long-range interactions are captured through a flexible spectral domain parameterization. This hybrid design unifies short- and long-distance information flow within a principled wavelet framework. Experiments show that LR-GWN achieves state-of-the-art performance among wavelet-based methods on long-range benchmarks, while remaining competitive on short-range datasets.
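
A toy version of the hybrid filter on a 4-node path graph: a low-order polynomial in the Laplacian supplies the local part, and a freely parameterized spectral response supplies the global part. The heat-kernel response below stands in for whatever g(λ) the network would learn; coefficients are illustrative.

```python
import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(1)) - A                 # combinatorial Laplacian
lam, U = np.linalg.eigh(L)

theta = np.array([0.5, 0.3, 0.1])         # low-order polynomial coefficients
H_local = sum(t * np.linalg.matrix_power(L, k) for k, t in enumerate(theta))

g = np.exp(-lam)                          # stand-in for a learned spectral response
H_global = U @ np.diag(g) @ U.T           # full-support, long-range mixing

X = np.random.randn(4, 2)                 # node features
out = (H_local + H_global) @ X
print(out.shape)                          # (4, 2)
```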

[573] Aligning Large Vision-Language Models by Deep Reinforcement Learning and Direct Preference Optimization

Thanh Thi Nguyen, Campbell Wilson, Janis Dalins

Main category: cs.LG

TL;DR: Overview of using Deep Reinforcement Learning (DRL) and Direct Preference Optimization (DPO) for fine-tuning Large Vision-Language Models to align with human values and improve task performance.

DetailsMotivation: Fine-tuning LVLMs for human alignment and specific tasks remains challenging despite large-scale pretraining progress. DRL and DPO offer promising frameworks for this alignment process.

Method: Explores DRL (which uses reward signals for optimization) and DPO (which directly aligns policies with preferences without explicit reward models) as fine-tuning paradigms. Categorizes approaches, examines preference data sources, and reward signals.

Result: Provides frameworks for aligning LVLMs with human preferences, improving task performance, and enabling adaptive multimodal interaction through DRL and DPO techniques.

Conclusion: DRL and DPO significantly contribute to developing robust and human-aligned LVLMs, though challenges remain in scalability, sample efficiency, continual learning, generalization, and safety.

Abstract: Large Vision-Language Models (LVLMs) or multimodal large language models represent a significant advancement in artificial intelligence, enabling systems to understand and generate content across both visual and textual modalities. While large-scale pretraining has driven substantial progress, fine-tuning these models for aligning with human values or engaging in specific tasks or behaviors remains a critical challenge. Deep Reinforcement Learning (DRL) and Direct Preference Optimization (DPO) offer promising frameworks for this aligning process. While DRL enables models to optimize actions using reward signals instead of relying solely on supervised preference data, DPO directly aligns the policy with preferences, eliminating the need for an explicit reward model. This overview explores paradigms for fine-tuning LVLMs, highlighting how DRL and DPO techniques can be used to align models with human preferences and values, improve task performance, and enable adaptive multimodal interaction. We categorize key approaches, examine sources of preference data, reward signals, and discuss open challenges such as scalability, sample efficiency, continual learning, generalization, and safety. The goal is to provide a clear understanding of how DRL and DPO contribute to the evolution of robust and human-aligned LVLMs.
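
Since the overview centers on DPO, the loss itself is worth writing out: it pushes the policy's log-ratio on the chosen response above its log-ratio on the rejected one, relative to a frozen reference model (Rafailov et al., 2023). A minimal sketch with stand-in sequence log-probabilities:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Margin between policy-vs-reference log-ratios on the preference pair.
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Toy per-sequence log-probabilities summed over tokens (stand-ins).
lc, lr = torch.tensor([-12.0]), torch.tensor([-15.0])
rc, rr = torch.tensor([-13.0]), torch.tensor([-14.0])
print(dpo_loss(lc, lr, rc, rr))
```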

[574] Asynchronous Message Passing for Addressing Oversquashing in Graph Neural Networks

Kushal Bose, Swagatam Das

Main category: cs.LG

TL;DR: Proposed asynchronous message passing framework for GNNs that updates node features in batches based on centrality to address oversquashing in long-range tasks, achieving 5% and 4% improvements on benchmark datasets.

DetailsMotivation: Graph Neural Networks suffer from oversquashing when handling long-range interactions due to bottlenecks in message propagation. Existing graph rewiring methods compromise inductive bias and cause information loss, while increasing channel capacity adds parameter complexity.

Method: Asynchronous framework that creates node batches per layer based on centrality values, updating only features of nodes in these batches. Processes information sequentially across layers to avoid simultaneous compression into fixed-capacity channels.

Result: Achieved 5% improvement on REDDIT-BINARY and 4% improvement on Peptides-struct datasets for graph classification. Theoretically demonstrated higher feature sensitivity bounds compared to synchronous approaches.

Conclusion: Asynchronous message passing with centrality-based batching effectively alleviates oversquashing in GNNs for long-range tasks while maintaining efficiency and avoiding the drawbacks of graph rewiring methods.

Abstract: Graph Neural Networks (GNNs) suffer from Oversquashing, which occurs when tasks require long-range interactions. The problem arises from the presence of bottlenecks that limit the propagation of messages among distant nodes. Recently, graph rewiring methods modify edge connectivity and are expected to perform well on long-range tasks. Yet, graph rewiring compromises the inductive bias, incurring significant information loss in solving the downstream task. Furthermore, increasing channel capacity may overcome information bottlenecks but enhance the parameter complexity of the model. To alleviate these shortcomings, we propose an efficient model-agnostic framework that asynchronously updates node features, unlike traditional synchronous message passing GNNs. Our framework creates node batches in every layer based on the node centrality values. The features of the nodes belonging to these batches will only get updated. Asynchronous message updates process information sequentially across layers, avoiding simultaneous compression into fixed-capacity channels. We also theoretically establish that our proposed framework maintains higher feature sensitivity bounds compared to standard synchronous approaches. Our framework is applied to six standard graph datasets and two long-range datasets to perform graph classification and achieves impressive performance, with 5% and 4% improvements on REDDIT-BINARY and Peptides-struct, respectively.
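
A minimal numpy sketch of the batching idea: rank nodes by centrality (degree here), split them into per-layer batches, and let only the current batch aggregate neighbour features. The mixing coefficient and batch count are illustrative choices, not the paper's.

```python
import numpy as np

A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 1, 1],
              [0, 1, 1, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
X = np.random.randn(5, 3)

centrality = A.sum(1)                     # degree centrality
order = np.argsort(-centrality)           # most central nodes first
batches = np.array_split(order, 3)        # one batch per layer

H = X.copy()
for batch in batches:                     # layers processed sequentially
    agg = A @ H / np.maximum(A.sum(1, keepdims=True), 1)
    H[batch] = 0.5 * H[batch] + 0.5 * agg[batch]   # update batch only
print(H.shape)
```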

[575] Physics-informed Value Learner for Offline Goal-Conditioned Reinforcement Learning

Vittorio Giammarino, Ruiqi Ni, Ahmed H. Qureshi

Main category: cs.LG

TL;DR: Physics-informed regularization for offline goal-conditioned RL using Eikonal PDE to improve value function learning and enable better stitching in long-horizon tasks.

DetailsMotivation: Offline GCRL faces challenges with limited dataset coverage and long-horizon generalization. Current methods lack geometric inductive biases that could improve value function learning.

Method: Propose Physics-informed (Pi) regularizer derived from Eikonal PDE to induce geometric structure in value functions. Combine with Hierarchical Implicit Q-Learning (HIQL) to create Pi-HIQL.

Result: Significant improvements in performance and generalization, especially in stitching regimes and large-scale navigation tasks.

Conclusion: Physics-informed regularization grounded in continuous-time optimal control effectively improves offline GCRL by aligning value functions with cost-to-go structures.

Abstract: Offline Goal-Conditioned Reinforcement Learning (GCRL) holds great promise for domains such as autonomous navigation and locomotion, where collecting interactive data is costly and unsafe. However, it remains challenging in practice due to the need to learn from datasets with limited coverage of the state-action space and to generalize across long-horizon tasks. To improve on these challenges, we propose a Physics-informed (Pi) regularized loss for value learning, derived from the Eikonal Partial Differential Equation (PDE) and which induces a geometric inductive bias in the learned value function. Unlike generic gradient penalties that are primarily used to stabilize training, our formulation is grounded in continuous-time optimal control and encourages value functions to align with cost-to-go structures. The proposed regularizer is broadly compatible with temporal-difference-based value learning and can be integrated into existing Offline GCRL algorithms. When combined with Hierarchical Implicit Q-Learning (HIQL), the resulting method, Physics-informed HIQL (Pi-HIQL), yields significant improvements in both performance and generalization, with pronounced gains in stitching regimes and large-scale navigation tasks.
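
The Eikonal regularizer is straightforward to sketch in an autodiff framework: penalize deviations of the value gradient's norm from a target speed (unit speed below, an illustrative simplification), so that the negated value behaves like a cost-to-go.

```python
import torch

# Goal-conditioned value network over concatenated (state, goal).
V = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))

s = torch.randn(128, 2, requires_grad=True)
g = torch.randn(128, 2)
v = V(torch.cat([s, g], dim=-1))

# Eikonal-style penalty: encourage ||grad_s V(s, g)|| close to 1.
grad_s = torch.autograd.grad(v.sum(), s, create_graph=True)[0]
eikonal_penalty = ((grad_s.norm(dim=-1) - 1.0) ** 2).mean()
loss = eikonal_penalty                 # added to the usual TD loss in practice
loss.backward()
```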

[576] R²AI: Towards Resistant and Resilient AI in an Evolving World

Youbang Sun, Xiang Wang, Jie Fu, Chaochao Lu, Bowen Zhou

Main category: cs.LG

TL;DR: Proposes safe-by-coevolution framework (R²AI) combining resistance to known threats and resilience to unforeseen risks through adversarial learning and continuous safety-capability coevolution.

DetailsMotivation: Address the gap between rapidly growing AI capabilities and lagging safety progress, moving beyond brittle post-hoc alignment and limited intrinsic safety approaches.

Method: R²AI framework integrates fast/slow safe models, safety wind tunnel for adversarial simulation/verification, and continual feedback loops to coevolve safety with capabilities.

Result: Provides a scalable proactive framework for maintaining safety in dynamic environments, addressing both near-term vulnerabilities and long-term existential risks.

Conclusion: Safe-by-coevolution offers a practical path forward for AI safety as systems advance toward AGI/ASI, inspired by biological immunity principles.

Abstract: In this position paper, we address the persistent gap between rapidly growing AI capabilities and lagging safety progress. Existing paradigms divide into "Make AI Safe", which applies post-hoc alignment and guardrails but remains brittle and reactive, and "Make Safe AI", which emphasizes intrinsic safety but struggles to address unforeseen risks in open-ended environments. We therefore propose safe-by-coevolution as a new formulation of the "Make Safe AI" paradigm, inspired by biological immunity, in which safety becomes a dynamic, adversarial, and ongoing learning process. To operationalize this vision, we introduce R²AI (Resistant and Resilient AI) as a practical framework that unites resistance against known threats with resilience to unforeseen risks. R²AI integrates fast and slow safe models, adversarial simulation and verification through a safety wind tunnel, and continual feedback loops that guide safety and capability to coevolve. We argue that this framework offers a scalable and proactive path to maintain continual safety in dynamic environments, addressing both near-term vulnerabilities and long-term existential risks as AI advances toward AGI and ASI.

[577] floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL

Bhavya Agrawalla, Michal Nauman, Khush Agarwal, Aviral Kumar

Main category: cs.LG

TL;DR: Floq introduces iterative computation to TD-learning by parameterizing Q-functions using flow-matching velocity fields, enabling better capacity scaling and performance improvements in RL.

DetailsMotivation: Modern ML uses dense supervision for intermediate computations (like teacher forcing in language models), which enables learning complex functions. This motivates applying iterative computation to temporal difference methods in RL, which typically use monolithic value function representations.

Method: Parameterizes Q-function using a velocity field trained with flow-matching techniques. The velocity field is trained using TD-learning objective that bootstraps from values produced by a target velocity field through multiple steps of numerical integration.

Result: Achieves nearly 1.8x performance improvement across challenging offline RL benchmarks and online fine-tuning tasks. Demonstrates superior capacity scaling compared to standard TD-learning architectures.

Conclusion: Iterative computation through flow-matching shows significant potential for value learning in reinforcement learning, enabling finer control and better scaling of Q-function capacity.

Abstract: A hallmark of modern large-scale machine learning techniques is the use of training objectives that provide dense supervision to intermediate computations, such as teacher forcing the next token in language models or denoising step-by-step in diffusion models. This enables models to learn complex functions in a generalizable manner. Motivated by this observation, we investigate the benefits of iterative computation for temporal difference (TD) methods in reinforcement learning (RL). Typically they represent value functions in a monolithic fashion, without iterative compute. We introduce floq (flow-matching Q-functions), an approach that parameterizes the Q-function using a velocity field and trains it using techniques from flow-matching, typically used in generative modeling. This velocity field underneath the flow is trained using a TD-learning objective, which bootstraps from values produced by a target velocity field, computed by running multiple steps of numerical integration. Crucially, floq allows for more fine-grained control and scaling of the Q-function capacity than monolithic architectures, by appropriately setting the number of integration steps. Across a suite of challenging offline RL benchmarks and online fine-tuning tasks, floq improves performance by nearly 1.8x. floq scales capacity far better than standard TD-learning architectures, highlighting the potential of iterative computation for value learning.
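
A minimal sketch of the parameterization: the Q-value is the result of Euler-integrating a learned velocity field over a pseudo-time variable, so capacity scales with the number of integration steps. The architecture and step count below are illustrative, not the paper's exact design.

```python
import torch

class VelocityField(torch.nn.Module):
    def __init__(self, sa_dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(sa_dim + 2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))

    def forward(self, q, t, sa):
        return self.net(torch.cat([q, t, sa], dim=-1))

def q_value(field, sa, steps=8):
    q = torch.zeros(sa.shape[0], 1)    # integrate from q_0 = 0
    dt = 1.0 / steps
    for k in range(steps):             # Euler steps; more steps = more capacity
        t = torch.full((sa.shape[0], 1), k * dt)
        q = q + dt * field(q, t, sa)
    return q

field = VelocityField(sa_dim=6)
sa = torch.randn(32, 6)                # concatenated state-action batch
print(q_value(field, sa).shape)        # (32, 1); a target field gives TD targets
```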

[578] Concolic Testing on Individual Fairness of Neural Network Models

Ming-I Huang, Chih-Duo Hong, Fang Yu

Main category: cs.LG

TL;DR: PyFair is a formal framework for evaluating individual fairness of DNNs using concolic testing to generate fairness-specific path constraints, featuring a dual network architecture for comprehensive assessments with completeness guarantees.

DetailsMotivation: To advance algorithmic fairness in critical domains by providing a rigorous, systematic method for fairness testing and verification of pre-trained deep neural networks.

Method: Adapts the concolic testing tool PyCT to generate fairness-specific path constraints, uses a dual network architecture for comprehensive fairness assessments, and provides completeness guarantees for certain network types.

Result: Evaluated on 25 benchmark models including bias-mitigated ones, PyFair effectively detects discriminatory instances and verifies fairness, though scalability challenges emerge with complex models.

Conclusion: PyFair represents a significant advancement in formal fairness verification for DNNs, offering systematic detection of discrimination while highlighting scalability limitations that need addressing for broader application.

Abstract: This paper introduces PyFair, a formal framework for evaluating and verifying individual fairness of Deep Neural Networks (DNNs). By adapting the concolic testing tool PyCT, we generate fairness-specific path constraints to systematically explore DNN behaviors. Our key innovation is a dual network architecture that enables comprehensive fairness assessments and provides completeness guarantees for certain network types. We evaluate PyFair on 25 benchmark models, including those enhanced by existing bias mitigation techniques. Results demonstrate PyFair’s efficacy in detecting discriminatory instances and verifying fairness, while also revealing scalability challenges for complex models. This work advances algorithmic fairness in critical domains by offering a rigorous, systematic method for fairness testing and verification of pre-trained DNNs.

[579] AxelSMOTE: An Agent-Based Oversampling Algorithm for Imbalanced Classification

Sukumar Kishanthan, Asela Hevapathige

Main category: cs.LG

TL;DR: AxelSMOTE is a novel agent-based oversampling method that addresses class imbalance by modeling data instances as autonomous agents using Axelrod’s cultural dissemination model, outperforming existing techniques while maintaining efficiency.

DetailsMotivation: Traditional oversampling techniques for class imbalance have limitations: they treat features independently, lack similarity controls, limit diversity, and fail to manage synthetic variety effectively, requiring a more sophisticated approach.

Method: AxelSMOTE implements four innovations: 1) trait-based feature grouping to preserve correlations, 2) similarity-based probabilistic exchange mechanism for meaningful interactions, 3) Beta distribution blending for realistic interpolation, and 4) controlled diversity injection to avoid overfitting.

Result: Experiments on eight imbalanced datasets show that AxelSMOTE outperforms state-of-the-art sampling methods while maintaining computational efficiency.

Conclusion: AxelSMOTE successfully overcomes the limitations of traditional oversampling techniques by leveraging agent-based modeling and cultural dissemination principles, providing a more effective solution for class imbalance problems in machine learning.

Abstract: Class imbalance in machine learning poses a significant challenge, as skewed datasets often hinder performance on minority classes. Traditional oversampling techniques, which are commonly used to alleviate class imbalance, have several drawbacks: they treat features independently, lack similarity-based controls, limit sample diversity, and fail to manage synthetic variety effectively. To overcome these issues, we introduce AxelSMOTE, an innovative agent-based approach that views data instances as autonomous agents engaging in complex interactions. Based on Axelrod’s cultural dissemination model, AxelSMOTE implements four key innovations: (1) trait-based feature grouping to preserve correlations; (2) a similarity-based probabilistic exchange mechanism for meaningful interactions; (3) Beta distribution blending for realistic interpolation; and (4) controlled diversity injection to avoid overfitting. Experiments on eight imbalanced datasets demonstrate that AxelSMOTE outperforms state-of-the-art sampling methods while maintaining computational efficiency.
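
Two of the four ingredients, similarity-gated exchange and Beta-distribution blending over trait groups, can be sketched directly; the grouping, the similarity gate, and the Beta parameter below are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize(minority, trait_groups, alpha=2.0):
    # Pick two minority "agents" and blend whole trait groups so that
    # within-group feature correlations are preserved.
    i, j = rng.choice(len(minority), size=2, replace=False)
    x_i, x_j = minority[i], minority[j]
    child = x_i.copy()
    for group in trait_groups:
        sim = 1.0 / (1.0 + np.linalg.norm(x_i[group] - x_j[group]))
        if rng.random() < sim:         # exchange more likely between similar agents
            lam = rng.beta(alpha, alpha)
            child[group] = lam * x_i[group] + (1 - lam) * x_j[group]
    return child

minority = rng.normal(size=(20, 4))
print(synthesize(minority, trait_groups=[[0, 1], [2, 3]]))
```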

[580] Not All Samples Are Equal: Quantifying Instance-level Difficulty in Targeted Data Poisoning

William Xu, Yiwei Lu, Yihan Wang, Matthew Y. R. Yang, Zuoqiu Liu, Gautam Kamath, Yaoliang Yu

Main category: cs.LG

TL;DR: This paper investigates why certain test samples are more vulnerable to targeted data poisoning attacks and proposes three predictive criteria to assess attack difficulty.

DetailsMotivation: Targeted data poisoning attacks are increasingly threatening due to their ease of deployment and high success rates, posing unique threats to individual test instances rather than overall model performance.

Method: The study introduces three predictive criteria for targeted poisoning difficulty: ergodic prediction accuracy (analyzed through clean training dynamics), poison distance, and poison budget.

Result: Experimental results show these metrics effectively predict varying difficulty levels of real-world targeted poisoning attacks across diverse scenarios.

Conclusion: The proposed criteria provide practitioners with valuable insights for vulnerability assessment and better understanding of data poisoning attacks.

Abstract: Targeted data poisoning attacks pose an increasingly serious threat due to their ease of deployment and high success rates. These attacks aim to manipulate the prediction for a single test sample in classification models. Unlike indiscriminate attacks that aim to decrease overall test performance, targeted attacks present a unique threat to individual test instances. This threat model raises a fundamental question: what factors make certain test samples more susceptible to successful poisoning than others? We investigate how attack difficulty varies across different test instances and identify key characteristics that influence vulnerability. This paper introduces three predictive criteria for targeted data poisoning difficulty: ergodic prediction accuracy (analyzed through clean training dynamics), poison distance, and poison budget. Our experimental results demonstrate that these metrics effectively predict the varying difficulty of real-world targeted poisoning attacks across diverse scenarios, offering practitioners valuable insights for vulnerability assessment and understanding data poisoning attacks.

[581] Tackling the Noisy Elephant in the Room: Label Noise-robust Out-of-Distribution Detection via Loss Correction and Low-rank Decomposition

Tarhib Al Azad, Shahana Ibrahim

Main category: cs.LG

TL;DR: Proposes a robust OOD detection framework that integrates loss correction techniques with low-rank and sparse decomposition to handle noisy training labels, outperforming state-of-the-art methods.

DetailsMotivation: Label noise significantly degrades out-of-distribution (OOD) detection performance, but existing solutions combining noise-robust methods with OOD detection are insufficient for this critical challenge.

Method: Integrates loss correction techniques from noisy label learning with low-rank and sparse decomposition methods from signal processing to create a robust OOD detection framework.

Result: Extensive experiments on synthetic and real-world datasets show the method significantly outperforms state-of-the-art OOD detection techniques, especially under severe noisy label settings.

Conclusion: The proposed framework effectively addresses the underexplored problem of OOD detection under noisy training labels, providing a principled solution that combines techniques from different domains for improved robustness.

Abstract: Robust out-of-distribution (OOD) detection is an indispensable component of modern artificial intelligence (AI) systems, especially in safety-critical applications where models must identify inputs from unfamiliar classes not seen during training. While OOD detection has been extensively studied in the machine learning literature–with both post hoc and training-based approaches–its effectiveness under noisy training labels remains underexplored. Recent studies suggest that label noise can significantly degrade OOD performance, yet principled solutions to this issue are lacking. In this work, we demonstrate that directly combining existing label noise-robust methods with OOD detection strategies is insufficient to address this critical challenge. To overcome this, we propose a robust OOD detection framework that integrates loss correction techniques from the noisy label learning literature with low-rank and sparse decomposition methods from signal processing. Extensive experiments on both synthetic and real-world datasets demonstrate that our method significantly outperforms the state-of-the-art OOD detection techniques, particularly under severe noisy label settings.
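
The loss-correction half of the framework builds on standard noisy-label techniques. Below is a minimal sketch of one such technique, forward correction (Patrini et al., 2017), which pushes clean posteriors through an estimated noise-transition matrix before the loss; the paper's full method additionally uses low-rank plus sparse decomposition, which is not shown here.

```python
import torch

T = torch.tensor([[0.9, 0.1],
                  [0.2, 0.8]])         # T[i, j] = P(noisy = j | clean = i)

logits = torch.randn(16, 2)
noisy_y = torch.randint(0, 2, (16,))

clean_probs = logits.softmax(dim=-1)
noisy_probs = clean_probs @ T          # predicted noisy-label distribution
# Cross-entropy against the observed noisy labels, via corrected log-probs.
loss = torch.nn.functional.nll_loss(noisy_probs.clamp_min(1e-8).log(), noisy_y)
print(loss)
```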

[582] Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding

Ziheng Li, Zexu Sun, Jinman Zhao, Erxue Min, Yongcheng Zeng, Hui Wu, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Xu Chen, Zhi-Hong Deng

Main category: cs.LG

TL;DR: SEELE is a novel RLVR framework that dynamically adjusts problem difficulty by appending adaptive hints to optimize learning efficiency and improve math reasoning performance.

DetailsMotivation: Existing RLVR methods suffer from exploration inefficiency due to mismatches between problem difficulty and model capability - models fail on overly difficult problems and learn little from simple ones.

Method: SEELE appends hints of adaptive length to problems, using multi-round rollout sampling and item response theory to determine optimal hint length that keeps difficulty in the high-efficiency region.

Result: Outperforms GRPO by +11.8 points, SFT by +10.5 points, and previous best supervision-aided approach by +3.6 points across six math reasoning benchmarks.

Conclusion: Dynamic difficulty adjustment through adaptive hinting significantly improves exploration efficiency and reasoning performance in reinforcement learning with verifiable rewards.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in enhancing the reasoning capabilities of large language models (LLMs). However, existing RLVR methods often suffer from exploration inefficiency due to mismatches between the training data’s difficulty and the model’s capability. LLMs fail to discover viable reasoning paths when problems are overly difficult, while learning little new capability when problems are too simple. In this work, we formalize the impact of problem difficulty by quantifying the relationship between loss descent speed and rollout accuracy. Building on this analysis, we propose SEELE, a novel supervision-aided RLVR framework that dynamically adjusts problem difficulty to stay within the high-efficiency region. SEELE augments each training sample by appending a hint (part of a full solution) after the original problem. Unlike previous hint-based approaches, SEELE deliberately and adaptively adjusts the hint length for each problem to achieve an optimal difficulty. To determine the optimal hint length, SEELE employs a multi-round rollout sampling strategy. In each round, it fits an item response theory model to the accuracy-hint pairs collected in preceding rounds to predict the required hint length for the next round. This instance-level, real-time difficulty adjustment aligns problem difficulty with the evolving model capability, thereby improving exploration efficiency. Experimental results show that SEELE outperforms Group Relative Policy Optimization (GRPO) and Supervised Fine-tuning (SFT) by +11.8 and +10.5 points, respectively, and surpasses the best previous supervision-aided approach by +3.6 points on average across six math reasoning benchmarks.
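
The item-response-theory step is concrete enough to sketch: fit a logistic curve to (hint length, accuracy) pairs and invert it to pick the hint length whose predicted accuracy hits the target band. A toy example with made-up measurements; the 2-parameter logistic form is our assumption about the IRT model used.

```python
import numpy as np
from scipy.optimize import curve_fit

def irt(h, a, b):
    # 2-parameter logistic: accuracy as a function of hint length h.
    return 1.0 / (1.0 + np.exp(-a * (h - b)))

hints = np.array([0, 16, 32, 64, 128], dtype=float)
acc = np.array([0.05, 0.15, 0.45, 0.80, 0.95])     # toy rollout accuracies

(a, b), _ = curve_fit(irt, hints, acc, p0=[0.05, 48.0])

target = 0.5                                        # desired rollout accuracy
h_star = b + np.log(target / (1 - target)) / a      # invert the logistic
print(round(h_star, 1))                             # hint length to use next round
```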

[583] Neutron Reflectometry by Gradient Descent

Max D. Champneys, Andrew J. Parnell, Philipp Gutfreund, Maximilian W. A. Skoda, Patrick A. Fairclough, Timothy J. Rogers, Stephanie L. Burg

Main category: cs.LG

TL;DR: A novel gradient-based optimization approach for neutron reflectometry data analysis using automatic differentiation to compute exact gradients of the forward model, enabling efficient parameter estimation without losing physical intuition.

DetailsMotivation: Traditional neutron reflectometry analysis requires solving inverse problems that are inefficient for large datasets or complex multilayer structures, and machine learning surrogates lose physical intuition.

Method: Uses automatic differentiation to compute exact gradients of the error function with respect to parameters, enabling gradient descent optimization directly on the forward reflection model.

Result: Demonstrates state-of-the-art performance on a thick oxide quartz film and robust co-fitting performance for complex organic LED multilayer devices.

Conclusion: Provides an efficient gradient-based approach that maintains physical intuition while enabling modern optimization techniques, with an open-source differentiable reflectometry library for broader application.

Abstract: Neutron reflectometry (NR) is a powerful technique to probe surfaces and interfaces. NR is inherently an indirect measurement technique: access to the physical quantities of interest (layer thickness, scattering length density, roughness) necessitates the solution of an inverse modelling problem, which is inefficient for large amounts of data or complex multilayer structures (e.g. lithium batteries / electrodes). Recently, surrogate machine learning models have been proposed as an alternative to existing optimisation routines. Although such approaches have been successful, physical intuition is lost when replacing governing equations with fast neural networks. Instead, we propose a novel and efficient approach; to optimise reflectivity data analysis by performing gradient descent on the forward reflection model itself. Herein, automatic differentiation techniques are used to evaluate exact gradients of the error function with respect to the parameters of interest. Access to these quantities enables users of neutron reflectometry to harness a host of powerful modern optimisation and inference techniques that remain thus far unexploited in the context of neutron reflectometry. This paper presents two benchmark case studies, demonstrating state-of-the-art performance on a thick oxide quartz film, and robust co-fitting performance in the high complexity regime of organic LED multilayer devices. Additionally, we provide an open-source library of differentiable reflectometry kernels in the python programming language so that gradient based approaches can readily be applied to other NR datasets.
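
The optimisation pattern generalises beyond reflectometry: implement the forward model in an autodiff framework and descend on the physical parameters directly. The stand-in forward model below is only a placeholder to show the pattern, not the Abeles/Parratt reflectivity kernel the paper actually differentiates.

```python
import torch

def forward_model(q, thickness, sld, roughness):
    # Placeholder: any differentiable map from parameters to a curve.
    return torch.exp(-q * thickness) * sld / (1 + roughness * q)

q = torch.linspace(0.01, 0.3, 50)
true = forward_model(q, torch.tensor(8.0), torch.tensor(4.0), torch.tensor(3.0))

params = {n: torch.tensor(v, requires_grad=True)
          for n, v in [("thickness", 5.0), ("sld", 2.0), ("roughness", 1.0)]}
opt = torch.optim.Adam(params.values(), lr=0.05)

for _ in range(2000):
    pred = forward_model(q, **params)
    loss = ((pred - true) ** 2).mean()   # exact gradients via autodiff
    opt.zero_grad(); loss.backward(); opt.step()
print({n: round(p.item(), 2) for n, p in params.items()})
```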

[584] Learning words in groups: fusion algebras, tensor ranks and grokking

Maor Shutman, Oren Louidor, Ran Tessler

Main category: cs.LG

TL;DR: Two-layer neural networks can learn arbitrary word operations in finite groups and exhibit grokking behavior by finding low-rank implementations of 3-tensors through group representation theory.

DetailsMotivation: To understand how neural networks can learn complex group operations and exhibit grokking phenomena, and to explain the underlying mechanisms through tensor decomposition and representation theory.

Method: Reframe the problem as learning a 3-tensor, show it’s typically low-rank, decompose it using self-conjugate group representations and fusion structure, and analyze a surrogate model to understand how networks find low-rank implementations.

Result: Networks can learn arbitrary word operations in finite groups by finding low-rank tensor implementations, effectively implementing efficient matrix multiplication similar to Strassen’s algorithm, and exhibit grokking behavior.

Conclusion: The work provides a theoretical framework explaining how neural networks learn group operations through low-rank tensor approximations and reveals the mechanism behind grokking phenomena in gradient descent optimization.

Abstract: In this work, we demonstrate that a simple two-layer neural network with standard activation functions can learn an arbitrary word operation in any finite group, provided sufficient width is available and exhibits grokking while doing so. To explain the mechanism by which this is achieved, we reframe the problem as that of learning a particular $3$-tensor, which we show is typically of low rank. A key insight is that low-rank implementations of this tensor can be obtained by decomposing it along triplets of basic self-conjugate representations of the group and leveraging the fusion structure to rule out many components. Focusing on a phenomenologically similar but more tractable surrogate model, we show that the network is able to find such low-rank implementations (or approximations thereof), thereby using limited width to approximate the word-tensor in a generalizable way. In the case of the simple multiplication word, we further elucidate the form of these low-rank implementations, showing that the network effectively implements efficient matrix multiplication in the sense of Strassen. Our work also sheds light on the mechanism by which a network reaches such a solution under gradient descent.
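
The low-rank claim can be made concrete for the simplest word, multiplication in the cyclic group Z_n: the one-hot multiplication 3-tensor T[a, b, c] = 1 iff a + b ≡ c (mod n) admits a rank-n CP decomposition built from the group's characters (the discrete Fourier basis), which the sketch below verifies numerically.

```python
import numpy as np

n = 5
w = np.exp(2j * np.pi / n)
k = np.arange(n)

A = w ** np.outer(k, k)                   # A[a, r] = chi_r(a)
B = A.copy()                              # same characters for b
C = (w ** (-np.outer(k, k))) / n          # conjugate characters, normalised

# Rank-n CP reconstruction: sum_r A[:, r] (x) B[:, r] (x) C[:, r].
T_cp = np.einsum('ar,br,cr->abc', A, B, C)

T = np.zeros((n, n, n))
for a in range(n):
    for b in range(n):
        T[a, b, (a + b) % n] = 1.0

print(np.allclose(T_cp.real, T), np.allclose(T_cp.imag, 0, atol=1e-9))  # True True
```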

[585] From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers

Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca, Danilo Bzdok

Main category: cs.LG

TL;DR: Transformer models hallucinate when input uncertainty increases, activating input-insensitive semantic features that lead to coherent but incorrect outputs, with hallucinations predictable from internal concept patterns.

DetailsMotivation: To understand failure modes of generative AI systems, particularly transformer hallucination, which impedes trust and adoption in high-stakes applications.

Method: Used sparse autoencoders to capture concept representations in pre-trained transformers under controlled input uncertainty scenarios, analyzing semantic concept activation patterns.

Result: Found that semantic concepts grow with input unstructuredness, transformers activate coherent but input-insensitive features under uncertainty, and hallucinations can be predicted from internal concept patterns.

Conclusion: Insights into transformer internal mechanics have implications for AI alignment, safety, adversarial attack surfaces, and automatic hallucination risk quantification.

Abstract: As generative AI systems become competent and democratized in science, business, and government, deeper insight into their failure modes now poses an acute need. The occasional volatility in their behavior, such as the propensity of transformer models to hallucinate, impedes trust and adoption of emerging AI solutions in high-stakes areas. In the present work, we establish how and when hallucinations arise in pre-trained transformer models through concept representations captured by sparse autoencoders, under scenarios with experimentally controlled uncertainty in the input space. Our systematic experiments reveal that the number of semantic concepts used by the transformer model grows as the input information becomes increasingly unstructured. In the face of growing uncertainty in the input space, the transformer model becomes prone to activate coherent yet input-insensitive semantic features, leading to hallucinated output. At its extreme, for pure-noise inputs, we identify a wide variety of robustly triggered and meaningful concepts in the intermediate activations of pre-trained transformer models, whose functional integrity we confirm through targeted steering. We also show that hallucinations in the output of a transformer model can be reliably predicted from the concept patterns embedded in transformer layer activations. This collection of insights on transformer internal processing mechanics has immediate consequences for aligning AI models with human values, AI safety, opening the attack surface for potential adversarial attacks, and providing a basis for automatic quantification of a model’s hallucination risk.
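
For reference, the concept-extraction tool here is a sparse autoencoder over transformer activations: an overcomplete ReLU dictionary trained with reconstruction plus an L1 sparsity penalty. A minimal sketch on stand-in activations (dimensions and penalty weight are illustrative):

```python
import torch

d_model, d_dict = 64, 512                 # overcomplete dictionary
enc = torch.nn.Linear(d_model, d_dict)
dec = torch.nn.Linear(d_dict, d_model)
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)

acts = torch.randn(256, d_model)          # stand-in residual-stream activations

for _ in range(100):
    z = torch.relu(enc(acts))             # sparse concept codes
    recon = dec(z)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * z.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

print((torch.relu(enc(acts)) > 0).float().mean())  # fraction of active units
```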

[586] MedualTime: A Dual-Adapter Language Model for Medical Time Series-Text Multimodal Learning

Jiexia Ye, Weiqi Zhang, Ziyue Li, Jia Li, Meng Zhao, Fugee Tsung

Main category: cs.LG

TL;DR: MedualTime proposes a novel textual-temporal multimodal learning paradigm with dual adapters to address bias in medical time series-text learning, enabling either modality to serve as primary while being enhanced by the other.

DetailsMotivation: Existing contrastive learning and prompt-based LM approaches in medical multimodal learning tend to be biased, treating text modality as secondary and overlooking critical task-relevant information in clinical reports.

Method: Designed MedualTime with dual adapters for simultaneous temporal-primary and textual-primary modeling, using lightweight adaptation tokens injected into top LM layers for high-level modality fusion and efficient fine-tuning.

Result: Achieves superior performance with 8% accuracy and 12% F1 improvements in supervised settings, and demonstrates strong transferability in few-shot label transfer experiments.

Conclusion: The proposed textual-temporal paradigm effectively captures modality-specific information and fosters cross-modal interaction, providing a more balanced and effective approach for medical multimodal learning.

Abstract: The recent rapid advancements in language models (LMs) have garnered attention in medical time series-text multimodal learning. However, existing contrastive learning-based and prompt-based LM approaches tend to be biased, often assigning a primary role to the time series modality while treating the text modality as secondary. We classify these approaches under a temporal-primary paradigm, which may overlook the unique and critical task-relevant information embedded in the text modality, such as clinical reports, thus failing to fully leverage the mutual benefits and complementarity of different modalities. To fill this gap, we propose a novel textual-temporal multimodal learning paradigm that enables either modality to serve as the primary one while being enhanced by the other, thereby effectively capturing modality-specific information and fostering cross-modal interaction. Specifically, we design MedualTime, a language model composed of dual adapters to implement temporal-primary and textual-primary modeling simultaneously. Within each adapter, lightweight adaptation tokens are injected into the top layers of the LM to encourage high-level modality fusion. The shared LM pipeline by dual adapters not only achieves adapter alignment but also enables efficient fine-tuning, reducing computational resources. Empirically, MedualTime demonstrates superior performance on medical data, achieving notable improvements of 8% in accuracy and 12% in F1 in supervised settings. Furthermore, MedualTime's transferability is validated by few-shot label transfer experiments from coarse-grained to fine-grained medical data. https://github.com/start2020/MedualTime

[587] Catapult Dynamics and Phase Transitions in Quadratic Nets

David Meltzer, Min Chen, Junyu Liu

Main category: cs.LG

TL;DR: This paper proves the existence of catapult phase transitions in neural networks with super-critical learning rates, showing weight norm decreases when loss becomes large, and demonstrates increasing sparsity in ReLU activations at higher learning rates.

DetailsMotivation: To theoretically establish that the catapult phase observed in wide neural networks exists across a broader class of models, including quadratic models and two-layer homogeneous neural nets, and to understand the behavior beyond the theoretically derived learning rate range.

Method: The authors prove the existence of catapult phase by showing weight norm decreases when loss becomes large for certain learning rate ranges, and empirically study super-critical learning rates beyond this range by analyzing activation sparsity in ReLU networks.

Result: The paper demonstrates that catapult phase exists in various models, provides theoretical bounds for learning rates where this phenomenon occurs, and shows empirically that ReLU networks trained with super-critical learning rates develop increasingly sparse activation maps as learning rates increase.

Conclusion: The catapult phase is a general phenomenon across multiple model classes, characterized by specific weight norm dynamics during high-loss periods, and leads to sparser network representations at higher learning rates, providing insights into optimization dynamics beyond standard gradient descent regimes.

Abstract: Neural networks trained with gradient descent can undergo non-trivial phase transitions as a function of the learning rate. In Lewkowycz et al. (2020) it was discovered that wide neural nets can exhibit a catapult phase for super-critical learning rates, where the training loss grows exponentially quickly at early times before rapidly decreasing to a small value. During this phase the top eigenvalue of the neural tangent kernel (NTK) also undergoes significant evolution. In this work, we will prove that the catapult phase exists in a large class of models, including quadratic models and two-layer, homogeneous neural nets. To do this, we show that for a certain range of learning rates the weight norm decreases whenever the loss becomes large. We also empirically study learning rates beyond this theoretically derived range and show that the activation map of ReLU nets trained with super-critical learning rates becomes increasingly sparse as we increase the learning rate.

[588] The Curious Price of Distributional Robustness in Reinforcement Learning with a Generative Model

Laixi Shi, Gen Li, Yuting Wei, Yuxin Chen, Matthieu Geist, Yuejie Chi

Main category: cs.LG

TL;DR: This paper analyzes the sample complexity of distributionally robust Markov decision processes (RMDPs) for reinforcement learning, showing that robustness requirements can either reduce or increase sample complexity compared to standard MDPs depending on the uncertainty set metric used.

DetailsMotivation: To address the sim-to-real gap in reinforcement learning by investigating model robustness through distributionally robust MDPs, and to understand the statistical consequences of robustness requirements compared to standard RL.

Method: The study uses a model-based method called distributionally robust value iteration, assuming access to a generative model that draws samples based on the nominal MDP. The analysis focuses on uncertainty sets specified via total variation distance and chi-squared divergence.

Result: The research provides a near-optimal characterization of RMDP sample complexity. Surprisingly, RMDPs are not inherently easier or harder to learn than standard MDPs - the statistical impact depends on uncertainty set characteristics: TV distance reduces sample complexity, while chi-squared divergence significantly increases it.

Conclusion: Distributional robustness in RL has varying statistical consequences depending on the uncertainty set metric. The choice of uncertainty set (TV distance vs. chi-squared divergence) critically affects whether robustness requirements make learning easier or harder compared to standard MDPs.

Abstract: This paper investigates model robustness in reinforcement learning (RL) to reduce the sim-to-real gap in practice. We adopt the framework of distributionally robust Markov decision processes (RMDPs), aimed at learning a policy that optimizes the worst-case performance when the deployed environment falls within a prescribed uncertainty set around the nominal MDP. Despite recent efforts, the sample complexity of RMDPs remained mostly unsettled regardless of the uncertainty set in use. It was unclear if distributional robustness bears any statistical consequences when benchmarked against standard RL. Assuming access to a generative model that draws samples based on the nominal MDP, we provide a near-optimal characterization of the sample complexity of RMDPs when the uncertainty set is specified via either the total variation (TV) distance or chi-squared divergence. The algorithm studied here is a model-based method called distributionally robust value iteration, which is shown to be near-optimal for the full range of uncertainty levels. Somewhat surprisingly, our results uncover that RMDPs are not necessarily easier or harder to learn than standard MDPs. The statistical consequence incurred by the robustness requirement depends heavily on the size and shape of the uncertainty set: in the case w.r.t. the TV distance, the minimax sample complexity of RMDPs is always smaller than that of standard MDPs; in the case w.r.t. the chi-squared divergence, the sample complexity of RMDPs far exceeds the standard MDP counterpart.
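
The inner step of distributionally robust value iteration is easy to make concrete for the TV case: for each state-action pair, the adversary minimizes the expected next-state value over transition laws within TV radius σ of the nominal one. The sketch below solves that inner problem as a small linear program on a toy two-state MDP; it illustrates the robust Bellman backup, not the paper's near-optimal analysis.

```python
import numpy as np
from scipy.optimize import linprog

def worst_case_expectation(p0, v, sigma):
    # minimise v @ p  s.t.  p >= 0, sum p = 1, 0.5 * ||p - p0||_1 <= sigma
    n = len(p0)
    # Variables x = [p, t] with t >= |p - p0| elementwise.
    c = np.concatenate([v, np.zeros(n)])
    A_ub = np.block([[np.eye(n), -np.eye(n)],                   #  p - t <= p0
                     [-np.eye(n), -np.eye(n)],                  # -p - t <= -p0
                     [np.zeros((1, n)), np.ones((1, n)) / 2]])  # sum(t)/2 <= sigma
    b_ub = np.concatenate([p0, -p0, [sigma]])
    A_eq = np.concatenate([np.ones((1, n)), np.zeros((1, n))], axis=1)
    return linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0]).fun

# Tiny 2-state, 1-action MDP: robust Bellman backups to convergence.
P0 = np.array([[0.8, 0.2], [0.3, 0.7]])   # nominal P(s' | s)
r, gamma, sigma = np.array([1.0, 0.0]), 0.9, 0.1
v = np.zeros(2)
for _ in range(200):
    v = np.array([r[s] + gamma * worst_case_expectation(P0[s], v, sigma)
                  for s in range(2)])
print(v)                                  # robust values under the TV ball
```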

[589] A Review of Machine Learning Techniques in Imbalanced Data and Future Trends

Elaheh Jafarigol, Theodore Trafalis, Neshat Mohammadi

Main category: cs.LG

TL;DR: A comprehensive review of 258 papers on imbalanced learning methods for rare event detection, providing technical and application perspectives.

DetailsMotivation: Detecting rare events has been challenging for over two decades, with real-life problems driving the need for improved data processing and algorithmic approaches for effective and computationally efficient imbalanced learning methods.

Method: Collected and reviewed 258 peer-reviewed papers from archival journals and conference papers to provide an in-depth review of various approaches in imbalanced learning from both technical and application perspectives.

Result: Created a structured review of methods addressing imbalanced data problems across various domains.

Conclusion: This work provides a general guideline for researchers in academia and industry working with large-scale imbalanced data in machine learning applications.

Abstract: For over two decades, detecting rare events has been a challenging task among researchers in the data mining and machine learning domain. Real-life problems inspire researchers to navigate and further improve data processing and algorithmic approaches to achieve effective and computationally efficient methods for imbalanced learning. In this paper, we have collected and reviewed 258 peer-reviewed papers from archival journals and conference papers in an attempt to provide an in-depth review of various approaches in imbalanced learning from technical and application perspectives. This work aims to provide a structured review of methods used to address the problem of imbalanced data in various domains and create a general guideline for researchers in academia or industry who want to dive into the broad field of machine learning using large-scale imbalanced data.

[590] Probabilistic Shapley Value Modeling and Inference

Mert Ketenci, Iñigo Urteaga, Victor Alfonso Rodriguez, Noémie Elhadad, Adler Perotte

Main category: cs.LG

TL;DR: PSI is a probabilistic framework that models feature attribution distributions with Shapley values as means, using variational inference and masking-based neural networks to handle variable-length feature subsets efficiently.

DetailsMotivation: To address the challenge of quantifying uncertainty in feature attributions for flexible predictive models, particularly the computational difficulty of marginalizing over all possible feature subsets in Shapley value calculations.

Method: Developed probabilistic Shapley inference (PSI) with latent random variables whose mean equals Shapley values, using variational inference to jointly train predictive models and attribution distributions, and introduced masking-based neural network architecture to handle variable-length input feature subsets.

Result: PSI achieves competitive predictive performance compared to strong baselines while learning meaningful feature attribution distributions centered at Shapley values that reveal attribution uncertainty across different data modalities.

Conclusion: PSI provides an efficient, scalable framework for probabilistic inference of feature attributions and their uncertainty, successfully addressing the computational challenges of Shapley value calculation through innovative masking-based architecture and variational training.

Abstract: We propose probabilistic Shapley inference (PSI), a novel probabilistic framework to model and infer sufficient statistics of feature attributions in flexible predictive models, via latent random variables whose mean recovers Shapley values. PSI enables efficient, scalable inference over input-to-output attributions, and their uncertainty, via a variational objective that jointly trains a predictive (regression or classification) model and its attribution distributions. To address the challenge of marginalizing over variable-length input feature subsets in Shapley value calculation, we introduce a masking-based neural network architecture, with a modular training and inference procedure. We evaluate PSI on synthetic and real-world datasets, showing that it achieves competitive predictive performance compared to strong baselines, while learning feature attribution distributions – centered at Shapley values – that reveal meaningful attribution uncertainty across data modalities.
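
As a rough illustration of the masking-based architecture described above, the sketch below (PyTorch; all names hypothetical, a minimal sketch rather than the paper's implementation) handles variable-length feature subsets by zeroing absent features and feeding the binary mask as an extra input channel:

```python
import torch
import torch.nn as nn

class MaskedSubsetNet(nn.Module):
    """Illustrative masking-based network: a binary mask encodes which
    features are "present", so arbitrary subsets share one set of weights."""
    def __init__(self, d_in, d_hidden=64):
        super().__init__()
        # Input is the masked features concatenated with the mask itself.
        self.net = nn.Sequential(
            nn.Linear(2 * d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, x, mask):
        # Zero out absent features; the mask channel lets the model tell
        # "missing" apart from "observed to be zero".
        return self.net(torch.cat([x * mask, mask], dim=-1))

# One forward pass per sampled subset stands in for the marginalization
# over feature subsets that exact Shapley computation requires.
net = MaskedSubsetNet(d_in=10)
x = torch.randn(32, 10)
mask = torch.bernoulli(torch.full((32, 10), 0.5))  # random subsets
y_hat = net(x, mask)
```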

[591] Diffusion on language model encodings for protein sequence generation

Viacheslav Meshchaninov, Pavel Strashnov, Andrey Shevtsov, Fedor Nikolaev, Nikita Ivanisenko, Olga Kardymon, Dmitry Vetrov

Main category: cs.LG

TL;DR: DiMA is a latent diffusion framework for protein sequence design that uses continuous diffusion on protein language model representations, achieving high performance across various protein encoders and outperforming existing methods in quality, diversity, and novelty.

DetailsMotivation: Protein sequence design has advanced with discrete diffusion and autoregressive approaches, but continuous diffusion remains underexplored despite its potential benefits.

Method: Developed DiMA, a latent diffusion framework operating on protein language model representations, with systematic exploration of architectural choices and diffusion components that generalizes across multiple protein encoders (8M to 3B parameters).

Result: DiMA achieves consistently high performance across sequence-only, dual-decodable, and multimodal representations using the same architecture. It produces novel, high-quality, diverse protein sequences and outperforms autoregressive, discrete diffusion, and flow matching baselines across multiple metrics.

Conclusion: DiMA provides a universal continuous diffusion framework for protein sequence generation with versatile functionality for conditional generation tasks, offering both architectural insights and practical applicability across various protein design scenarios.

Abstract: Protein sequence design has seen significant advances through discrete diffusion and autoregressive approaches, yet the potential of continuous diffusion remains underexplored. Here, we present DiMA, a latent diffusion framework that operates on protein language model representations. Through systematic exploration of architectural choices and diffusion components, we develop a robust methodology that generalizes across multiple protein encoders ranging from 8M to 3B parameters. We demonstrate that our framework achieves consistently high performance across sequence-only (ESM-2, ESMc), dual-decodable (CHEAP), and multimodal (SaProt) representations using the same architecture and training approach. We extensively evaluate existing methods alongside DiMA using multiple metrics across two protein modalities, covering quality, diversity, novelty, and distribution matching of generated proteins. DiMA consistently produces novel, high-quality and diverse protein sequences and achieves strong results compared to baselines such as autoregressive, discrete diffusion and flow matching language models. The model demonstrates versatile functionality, supporting conditional generation tasks including protein family-generation, motif scaffolding and infilling, and fold-specific sequence design. This work provides a universal continuous diffusion framework for protein sequence generation, offering both architectural insights and practical applicability across various protein design scenarios. Code is released at https://github.com/MeshchaninovViacheslav/DiMA.

[592] DACAD: Domain Adaptation Contrastive Learning for Anomaly Detection in Multivariate Time Series

Zahra Zamanzadeh Darban, Yiyuan Yang, Geoffrey I. Webb, Charu C. Aggarwal, Qingsong Wen, Shirui Pan, Mahsa Salehi

Main category: cs.LG

TL;DR: DACAD is a novel domain adaptation contrastive learning model for multivariate time series anomaly detection that addresses the limitation of inconsistent anomalous classes across domains through anomaly injection and dual contrastive learning.

DetailsMotivation: Traditional unsupervised domain adaptation methods assume consistent anomalous classes across domains, which limits their effectiveness in real-world time series anomaly detection where labeled data is scarce and anomalous classes may vary between domains.

Method: Combines UDA with contrastive learning using anomaly injection mechanism, supervised contrastive loss for source domain, self-supervised contrastive triplet loss for target domain, and Center-based Entropy Classifier for learning normal boundaries.

Result: Extensive evaluations on multiple real-world and synthetic datasets demonstrate superior performance in knowledge transfer across domains and addressing limited labeled data challenges.

Conclusion: DACAD effectively overcomes the limitation of inconsistent anomalous classes in domain adaptation for time series anomaly detection, providing improved adaptability and robustness through its novel contrastive learning approach.

Abstract: In time series anomaly detection (TSAD), the scarcity of labeled data poses a challenge to the development of accurate models. Unsupervised domain adaptation (UDA) offers a solution by leveraging labeled data from a related domain to detect anomalies in an unlabeled target domain. However, existing UDA methods assume consistent anomalous classes across domains. To address this limitation, we propose a novel Domain Adaptation Contrastive learning model for Anomaly Detection in multivariate time series (DACAD), combining UDA with contrastive learning. DACAD utilizes an anomaly injection mechanism that enhances generalization across unseen anomalous classes, improving adaptability and robustness. Additionally, our model employs supervised contrastive loss for the source domain and self-supervised contrastive triplet loss for the target domain, ensuring comprehensive feature representation learning and domain-invariant feature extraction. Finally, an effective Center-based Entropy Classifier (CEC) accurately learns normal boundaries in the source domain. Extensive evaluations on multiple real-world datasets and a synthetic dataset highlight DACAD’s superior performance in transferring knowledge across domains and mitigating the challenge of limited labeled data in TSAD.
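
A minimal sketch of the target-domain self-supervised triplet objective, assuming a hypothetical spike-style anomaly injection (the paper's injection mechanism may differ); `encoder` is any network mapping a window to an embedding:

```python
import torch
import torch.nn.functional as F

def anomaly_inject(x, scale=3.0):
    """Hypothetical spike injection: corrupt one random timestep per series
    to manufacture a synthetic anomalous (negative) view."""
    x_neg = x.clone()
    b, t, d = x.shape
    idx = torch.randint(t, (b,))
    x_neg[torch.arange(b), idx] += scale * torch.randn(b, d)
    return x_neg

def triplet_loss(encoder, x, margin=1.0):
    anchor = encoder(x)
    positive = encoder(x + 0.01 * torch.randn_like(x))  # mild augmentation
    negative = encoder(anomaly_inject(x))               # injected anomaly
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)

# Toy usage with a throwaway encoder over windows of shape (batch, 24, 3).
enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(24 * 3, 16))
loss = triplet_loss(enc, torch.randn(8, 24, 3))
```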

[593] A Minimum Description Length Approach to Regularization in Neural Networks

Matan Abudy, Orr Well, Emmanuel Chemla, Roni Katzir, Nur Lan

Main category: cs.LG

TL;DR: MDL regularization outperforms standard methods (L1, L2, none) by selecting perfect solutions over approximations in neural networks trained on formal languages.

DetailsMotivation: Standard regularization methods push expressive neural networks away from perfect solutions when trained on formal languages, despite their capacity to learn symbolic, perfect solutions.

Method: Apply Minimum Description Length (MDL) principle as a theoretically grounded regularization method to balance model complexity with data fit, comparing against standard L1, L2, and no regularization.

Result: MDL regularization successfully selects perfect solutions over approximations, independent of optimization algorithm, unlike standard regularization methods which actively push networks away from perfect initializations.

Conclusion: MDL provides the appropriate inductive bias to counteract overfitting and promote generalization, making it superior to existing regularization techniques for learning perfect symbolic solutions.

Abstract: State-of-the-art neural networks can be trained to become remarkable solutions to many problems. But while these architectures can express symbolic, perfect solutions, trained models often arrive at approximations instead. We show that the choice of regularization method plays a crucial role: when trained on formal languages with standard regularization ($L_1$, $L_2$, or none), expressive architectures not only fail to converge to correct solutions but are actively pushed away from perfect initializations. In contrast, applying the Minimum Description Length (MDL) principle to balance model complexity with data fit provides a theoretically grounded regularization method. Using MDL, perfect solutions are selected over approximations, independently of the optimization algorithm. We propose that unlike existing regularization techniques, MDL introduces the appropriate inductive bias to effectively counteract overfitting and promote generalization.
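
For reference, the standard two-part MDL objective that this regularization instantiates trades hypothesis codelength against data codelength:

$$H^{*} \;=\; \arg\min_{H}\; L(H) + L(D \mid H),$$

where $L(H)$ is the encoding length of the hypothesis (here, the network) and $L(D \mid H) = -\log_2 P(D \mid H)$ is the codelength of the data under it. A perfect symbolic solution makes $L(D \mid H)$ minimal while keeping $L(H)$ small, which is why MDL selects it where $L_1$/$L_2$ penalties, which only shrink weight magnitudes, do not.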

[594] The Over-Certainty Phenomenon in Modern Test-Time Adaptation Algorithms

Fin Amin, Jung-Eun Kim

Main category: cs.LG

TL;DR: A novel method that addresses over-certainty in neural networks during domain shifts by introducing a certainty regularizer that dynamically adjusts pseudo-label confidence based on backbone entropy and logit norm, achieving improved calibration while maintaining accuracy.

DetailsMotivation: Neural networks fail to account for their familiarity with novel data during domain shifts, leading to over-certain predictions that can cause misplaced trust in critical applications. Existing test-time adaptation methods reduce entropy but result in poorly calibrated models.

Method: Proposes a certainty regularizer that dynamically adjusts pseudo-label confidence by considering both backbone entropy and logit norm to mitigate the over-certainty phenomenon during test-time adaptation.

Result: Achieves state-of-the-art performance in Expected Calibration Error and Negative Log Likelihood metrics while maintaining comparable accuracy to existing methods.

Conclusion: The proposed approach successfully addresses the over-certainty problem in domain shift scenarios, providing better calibrated predictions without sacrificing accuracy, which is crucial for building trustworthy AI systems.

Abstract: When neural networks are confronted with unfamiliar data that deviate from their training set, this signifies a domain shift. While these networks output predictions on their inputs, they typically fail to account for their level of familiarity with these novel observations. Prevailing works navigate test-time adaptation with the goal of curtailing model entropy, yet they unintentionally produce models that struggle with sub-optimal calibration, a dilemma we term the over-certainty phenomenon. This over-certainty in predictions can be particularly dangerous in the setting of domain shifts, as it may lead to misplaced trust. In this paper, we propose a solution that not only maintains accuracy but also addresses calibration by mitigating the over-certainty phenomenon. To do this, we introduce a certainty regularizer that dynamically adjusts pseudo-label confidence by accounting for both backbone entropy and logit norm. Our method achieves state-of-the-art performance in terms of Expected Calibration Error and Negative Log Likelihood, all while maintaining parity in accuracy.
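
A hedged sketch of the general idea, not the paper's exact regularizer: the pseudo-label weight shrinks when backbone entropy is high or the logit norm is inflated, so over-certain predictions contribute less to adaptation.

```python
import torch
import torch.nn.functional as F

def certainty_weighted_loss(logits, eps=1e-8):
    """Illustrative down-weighting of pseudo-labels whose confidence is
    not supported by the backbone signal (weighting scheme hypothetical)."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(eps).log()).sum(-1)  # per-sample entropy
    logit_norm = logits.norm(dim=-1)                         # certainty proxy
    # High entropy or an inflated logit norm shrinks the pseudo-label weight.
    weight = torch.exp(-entropy) / (1.0 + logit_norm)
    pseudo = probs.argmax(-1)
    return (weight * F.cross_entropy(logits, pseudo, reduction="none")).mean()

loss = certainty_weighted_loss(torch.randn(16, 10))
```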

[595] Towards a General Time Series Forecasting Model with Unified Representation and Adaptive Transfer

Yihang Wang, Yuying Qiu, Peng Chen, Kai Zhao, Yang Shu, Zhongwen Rao, Lujia Pan, Bin Yang, Chenjuan Guo

Main category: cs.LG

TL;DR: A novel time series foundation model that uses Decomposed Frequency Learning and Time Series Register to achieve unified representations and domain-specific feature capture for superior few-shot and zero-shot forecasting performance.

DetailsMotivation: Address the need for general forecasting models that can handle heterogeneous multi-domain time series data and enable adaptive transfer across diverse downstream scenarios, moving beyond simply scaling dataset and model sizes.

Method: Proposes Decomposed Frequency Learning with frequency-based masking and reconstruction for unified representations, and Time Series Register to capture domain-specific features during pre-training for adaptive transfer.

Result: Achieves state-of-the-art forecasting performance on seven real-world benchmarks with remarkable few-shot and zero-shot capabilities.

Conclusion: The approach successfully addresses key challenges in multi-domain time series forecasting by providing unified representations while preserving domain-specific features, enabling effective transfer learning across diverse scenarios.

Abstract: With the growing availability of multi-domain time series data, there is an increasing demand for general forecasting models pre-trained on multi-source datasets to support diverse downstream prediction scenarios. Existing time series foundation models primarily focus on scaling up pre-training datasets and model sizes to enhance generalization performance. In this paper, we take a different approach by addressing two critical aspects of general forecasting models: (1) how to derive unified representations from heterogeneous multi-domain time series data, and (2) how to effectively capture domain-specific features to enable adaptive transfer across various downstream scenarios. To address the first aspect, we propose Decomposed Frequency Learning as the pre-training task, which leverages frequency-based masking and reconstruction to decompose coupled semantic information in time series, resulting in unified representations across domains. For the second aspect, we introduce the Time Series Register, which captures domain-specific representations during pre-training and enhances adaptive transferability to downstream tasks. Our model achieves the state-of-the-art forecasting performance on seven real-world benchmarks, demonstrating remarkable few-shot and zero-shot capabilities.
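
A minimal sketch of a frequency-based masking-and-reconstruction pretext task of the kind described, assuming simple FFT-bin dropout (the paper's masking scheme may differ):

```python
import torch

def frequency_mask_reconstruct_target(x, mask_ratio=0.3):
    """Zero a random subset of FFT bins and ask the model to reconstruct
    the original series from the corrupted one."""
    spec = torch.fft.rfft(x, dim=-1)
    n_bins = spec.shape[-1]
    keep = torch.rand(n_bins) > mask_ratio         # drop ~30% of bins
    corrupted = torch.fft.irfft(spec * keep, n=x.shape[-1], dim=-1)
    return corrupted, x                            # (model input, target)

corrupted, target = frequency_mask_reconstruct_target(torch.randn(4, 128))
```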

[596] ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models

Xiang Meng, Kayhan Behdin, Haoyue Wang, Rahul Mazumder

Main category: cs.LG

TL;DR: ALPS is an optimization-based pruning framework for LLMs that uses operator splitting and conjugate gradient methods to achieve better compression and performance than heuristic approaches, especially at high sparsity levels.

DetailsMotivation: Large Language Models require massive computational resources and storage. Current one-shot pruning methods rely on heuristics rather than optimization, leading to suboptimal compression and performance degradation.

Method: ALPS uses operator splitting technique and a preconditioned conjugate gradient-based post-processing step. It incorporates novel acceleration techniques with theoretical convergence guarantees, leveraging vectorization and GPU parallelism for efficiency.

Result: ALPS significantly outperforms state-of-the-art methods. On OPT-30B with 70% sparsity, it achieves 13% reduction in test perplexity on WikiText and 19% improvement in zero-shot benchmark performance compared to existing methods.

Conclusion: ALPS provides an effective optimization-based framework for pruning LLMs that delivers superior compression and performance, particularly for highly sparse models, addressing the computational burden of large language models.

Abstract: The impressive performance of Large Language Models (LLMs) across various natural language processing tasks comes at the cost of vast computational resources and storage requirements. One-shot pruning techniques offer a way to alleviate these burdens by removing redundant weights without the need for retraining. Yet, the massive scale of LLMs often forces current pruning approaches to rely on heuristics instead of optimization-based techniques, potentially resulting in suboptimal compression. In this paper, we introduce ALPS, an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step. Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency. ALPS substantially outperforms state-of-the-art methods in terms of the pruning objective and perplexity reduction, particularly for highly sparse models. On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.
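
As a toy illustration of the operator-splitting idea (an ADMM-style loop for layer-wise pruning, not ALPS itself, which adds acceleration techniques and a preconditioned conjugate-gradient post-processing step):

```python
import numpy as np

def topk_project(z, k):
    """Project onto matrices with at most k nonzeros (hard thresholding)."""
    out = np.zeros_like(z)
    idx = np.argpartition(np.abs(z).ravel(), -k)[-k:]
    out.ravel()[idx] = z.ravel()[idx]
    return out

def admm_prune(X, W0, k, rho=1.0, iters=50):
    """Toy operator-splitting loop for layer-wise pruning:
    min_W ||X W - X W0||_F^2  s.t.  W has at most k nonzeros."""
    H = X.T @ X
    A = H + rho * np.eye(H.shape[0])
    Z = topk_project(W0, k)
    U = np.zeros_like(W0)
    for _ in range(iters):
        W = np.linalg.solve(A, H @ W0 + rho * (Z - U))  # smooth subproblem
        Z = topk_project(W + U, k)                      # sparsity projection
        U = U + W - Z                                   # dual update
    return Z

X, W0 = np.random.randn(200, 20), np.random.randn(20, 5)
W_sparse = admm_prune(X, W0, k=30)  # keep 30 of the 100 weights
```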

[597] Neural CRNs: A Natural Implementation of Learning in Chemical Reaction Networks

Rajiv Teja Nagipogu, John H. Reif

Main category: cs.LG

TL;DR: A dynamical systems approach for molecular neural computing using time evolution of concentrations, enabling compact circuits with minimal phases and native gradient approximations.

DetailsMotivation: To develop autonomous learning molecular circuits for bioengineering and synthetic biology applications, moving beyond discrete-layered neural architectures using steady-state computations.

Method: Model neural computations as time evolution of molecular concentrations using unimolecular and bimolecular reactions, with end-to-end supervised learning pipeline in two sequential phases and native incorporation of first-order gradient approximations.

Result: Demonstrated compact circuit implementations for both linear and nonlinear modeling, validated through training and inference simulations across various regression and classification tasks, showing linear scaling with input dimensionality.

Conclusion: Presents a viable pathway for embedding learning behaviors in synthetic biochemical systems with more efficient and scalable molecular computing frameworks.

Abstract: Molecular circuits capable of autonomous learning could unlock novel applications in fields such as bioengineering and synthetic biology. To this end, existing chemical implementations of neural computing have mainly relied on emulating discrete-layered neural architectures using steady-state computations of mass action kinetics. In contrast, we propose an alternative dynamical systems-based approach in which neural computations are modeled as the time evolution of molecular concentrations. The analog nature of our framework naturally aligns with chemical kinetics-based computation, leading to more compact circuits. We present the advantages of our framework through three key demonstrations. First, we assemble an end-to-end supervised learning pipeline using only two sequential phases, the minimum required number for supervised learning. Then, we show (through appropriate simplifications) that both linear and nonlinear modeling circuits can be implemented solely using unimolecular and bimolecular reactions, avoiding the complexities of higher-order chemistries. Finally, we demonstrate that first-order gradient approximations can be natively incorporated into the framework, enabling nonlinear models to scale linearly rather than combinatorially with input dimensionality. All the circuit constructions are validated through training and inference simulations across various regression and classification tasks. Our work presents a viable pathway toward embedding learning behaviors in synthetic biochemical systems.
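
A toy example of the dynamical-systems view: under mass-action kinetics, even a simple bimolecular scheme computes a product as the steady state of concentration dynamics (species and rate constants hypothetical):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Mass-action kinetics for a toy scheme:
#   X + W -> X + W + Y   (rate k1: "weight" W scales input X into output Y)
#   Y     -> 0           (rate k2: decay, so Y settles instead of diverging)
def rhs(t, c, k1=1.0, k2=1.0):
    x, w, y = c
    return [0.0, 0.0, k1 * x * w - k2 * y]

sol = solve_ivp(rhs, (0, 10), [2.0, 0.5, 0.0])
print(sol.y[2, -1])  # y(t) settles near k1*x*w/k2 = 1.0: kinetics as compute
```

The answer is read off as the limiting concentration of Y rather than from a discrete sequence of layer evaluations, which is why the framework needs only two sequential phases for supervised learning.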

[598] MENSA: A Multi-Event Network for Survival Analysis with Trajectory-based Likelihood Estimation

Christian Marius Lillelund, Ali Hossein Gharari Foomani, Weijie Sun, Shi-ang Qi, Russell Greiner

Main category: cs.LG

TL;DR: MENSA is a deep learning model that jointly models multiple non-exclusive and semi-competing events in survival analysis, outperforming single-event approaches by capturing dependencies and temporal ordering between events.

DetailsMotivation: Existing time-to-event methods focus on single-event or competing-risk scenarios, leaving multi-event situations underexplored. Real-world patients often experience multiple non-exclusive events, and current workarounds using separate single-event models fail to exploit dependencies and shared structure across events.

Method: Proposed MENSA (Multi-Event Network for Survival Analysis), a deep learning model that jointly models flexible time-to-event distributions for multiple events (competing or co-occurring) with a novel trajectory-based likelihood that captures temporal ordering between events.

Result: Across five benchmark datasets, MENSA consistently improved prediction performance over many state-of-the-art baselines.

Conclusion: MENSA effectively addresses the limitations of existing approaches by providing a unified framework for multi-event survival analysis that captures event dependencies and temporal relationships, demonstrating superior performance compared to traditional methods.

Abstract: Most existing time-to-event methods focus on either single-event or competing-risk settings, leaving multi-event scenarios relatively underexplored. In many real-world applications, the same patient may experience multiple events that are non-exclusive, and sometimes semi-competing. A common workaround is to train separate single-event models, but this approach fails to exploit dependencies and shared structure across events. To address these limitations, we propose MENSA (Multi-Event Network for Survival Analysis), a deep learning model that jointly models flexible time-to-event distributions for multiple events, whether competing or co-occurring. In addition, we introduce a novel trajectory-based likelihood that captures the temporal ordering between events. Across five benchmark datasets, MENSA consistently improves prediction performance over many state-of-the-art baselines. The source code is available at https://github.com/thecml/mensa.

[599] Flash STU: Fast Spectral Transform Units

Y. Isabel Liu, Windsor Nguyen, Yagiz Devre, Evan Dogariu, Anirudha Majumdar, Elad Hazan

Main category: cs.LG

TL;DR: Flash STU architecture combines spectral state space models with sliding window attention for efficient billion-parameter language modeling with near-linear time complexity, outperforming Transformers and other state-space models.

DetailsMotivation: Address the challenge of balancing computational efficiency with model expressiveness in state-space model architectures for sequence modeling.

Method: Propose a hybrid Flash STU architecture that interleaves spectral state space model layers with sliding window attention, enabling scalability to billions of parameters.

Result: Flash STU consistently outperforms Transformer and other leading state-space models (S4, Mamba-2) across diverse sequence prediction tasks including linear dynamical systems, robotics control, and language modeling.

Conclusion: The hybrid approach of combining spectral state space models with sliding window attention provides an effective solution for scalable and efficient sequence modeling while maintaining strong performance.

Abstract: Recent advances in state-space model architectures have shown great promise for efficient sequence modeling, but challenges remain in balancing computational efficiency with model expressiveness. We propose the Flash STU architecture, a hybrid model that interleaves spectral state space model layers with sliding window attention, enabling scalability to billions of parameters for language modeling while maintaining a near-linear time complexity. We evaluate the Flash STU and its variants on diverse sequence prediction tasks, including linear dynamical systems, robotics control, and language modeling. We find that, given a fixed parameter budget, the Flash STU architecture consistently outperforms the Transformer and other leading state-space models such as S4 and Mamba-2.
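
A structural sketch of the interleaving pattern only, with a placeholder standing in for the spectral STU layer (the actual unit applies learned spectral filters; every name here is illustrative):

```python
import torch
import torch.nn as nn

class HybridStack(nn.Module):
    """Alternate a stand-in "spectral SSM" layer with causal sliding-window
    self-attention, mirroring the interleaving described above."""
    def __init__(self, d_model=64, n_layers=4, n_heads=4, window=16):
        super().__init__()
        self.window = window
        self.ssm_like = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_layers // 2)])
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_layers // 2)])

    def forward(self, x):
        t = x.shape[1]
        # Causal banded mask: position i attends only to [i - window, i].
        i = torch.arange(t)[:, None]
        j = torch.arange(t)[None, :]
        mask = (j > i) | (j < i - self.window)
        for ssm, attn in zip(self.ssm_like, self.attn):
            x = x + torch.tanh(ssm(x))            # placeholder "STU" layer
            a, _ = attn(x, x, x, attn_mask=mask)  # local attention
            x = x + a
        return x

out = HybridStack()(torch.randn(2, 64, 64))
```

Keeping attention local is what preserves the near-linear time complexity: the global mixing is delegated to the state-space layers.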

[600] Rethinking GNN Expressive Power from a Distributed Computational Model Perspective

Guanyu Cui, Yuhe Guo, Zhewei Wei, Hsin-Hao Su

Main category: cs.LG

TL;DR: This paper analyzes GNN expressiveness through computational models rather than WL test alignments, showing WL tests are not locally computable by GNNs and that unrestricted preprocessing can be problematic, while also providing positive results on virtual nodes/edges.

DetailsMotivation: Current GNN expressiveness analyses focus too much on distinguishing graph structures via WL test alignments rather than computing specific function classes, which is more aligned with machine learning theory.

Method: Uses a modified CONGEST computational model with clearly specified preprocessing and postprocessing to analyze GNN expressiveness, examining lower bounds on capacity requirements and effects of preprocessing.

Result: Shows that WL test simulation requires nearly linear growth in GNN capacity with graph size, indicating WL is not locally computable and misaligned with message-passing GNNs. Also demonstrates problems with unrestricted preprocessing.

Conclusion: Computational models provide better framework for GNN expressiveness analysis than WL test alignments. Highlights open problems and provides both negative results about WL test limitations and positive results about virtual nodes/edges.

Abstract: The success of graph neural networks (GNNs) has motivated theoretical studies on their expressive power, often through alignments with the Weisfeiler-Lehman (WL) tests. However, such analyses typically focus on the ability of GNNs to distinguish between graph structures, rather than to compute or approximate specific function classes. The latter is more commonly studied in machine learning theory, including results such as the Turing completeness of recurrent networks and the universal approximation property of feedforward networks. We argue that using well-defined computational models, such as a modified CONGEST model with clearly specified preprocessing and postprocessing, offers a more sound framework for analyzing GNN expressiveness. Within this framework, we show that allowing unrestricted preprocessing or incorporating externally computed features, while claiming that these precomputations enhance the expressiveness, can sometimes lead to problems. We also show that the lower bound on a GNN’s capacity (depth multiplied by width) to simulate one iteration of the WL test actually grows nearly linearly with graph size, indicating that the WL test is not locally computable and is misaligned with message-passing GNNs. Despite these negative results, we also present positive results that characterize the effects of virtual nodes and edges from a computational model perspective. Finally, we highlight several open problems regarding GNN expressiveness for further exploration.

[601] An Architecture Built for Federated Learning: Addressing Data Heterogeneity through Adaptive Normalization-Free Feature Recalibration

Vasilis Siomos, Jonathan Passerat-Palmbach, Giacomo Tarroni

Main category: cs.LG

TL;DR: ANFR combines weight standardization and channel attention to address statistical heterogeneity in federated learning, improving performance and generalization while maintaining privacy.

DetailsMotivation: Statistical heterogeneity among client datasets degrades federated learning system performance, requiring robust solutions that preserve data ownership.

Method: Adaptive Normalization-free Feature Recalibration (ANFR) integrates weight standardization to avoid mismatched client statistics and channel attention to produce learnable scaling factors for feature maps.

Result: ANFR consistently outperforms established baselines across various aggregation methods, datasets, and heterogeneity conditions, and achieves strong privacy-utility balance with differential privacy.

Conclusion: ANFR offers a novel and versatile approach to combat statistical heterogeneity in FL, working with any aggregation method and supporting both global and personalized FL with minimal overhead.

Abstract: Federated learning is a decentralized collaborative training paradigm preserving stakeholders’ data ownership while improving performance and generalization. However, statistical heterogeneity among client datasets degrades system performance. To address this issue, we propose Adaptive Normalization-free Feature Recalibration (ANFR), a model architecture-level approach that combines weight standardization and channel attention to combat heterogeneous data in FL. ANFR leverages weight standardization to avoid mismatched client statistics and inconsistent averaging, ensuring robustness under heterogeneity, and channel attention to produce learnable scaling factors for feature maps, suppressing inconsistencies across clients due to heterogeneity. We demonstrate that combining these techniques boosts model performance beyond their individual contributions, by improving class selectivity and channel attention weight distribution. ANFR works with any aggregation method, supports both global and personalized FL, and adds minimal overhead. Furthermore, when training with differential privacy, ANFR achieves an appealing balance between privacy and utility, enabling strong privacy guarantees without sacrificing performance. By integrating weight standardization and channel attention in the backbone model, ANFR offers a novel and versatile approach to the challenge of statistical heterogeneity. Extensive experiments show ANFR consistently outperforms established baselines across various aggregation methods, datasets, and heterogeneity conditions. Code is provided at https://github.com/siomvas/ANFR.
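
A minimal sketch of the two ingredients, weight standardization (no batch statistics, hence no cross-client statistics mismatch) and SE-style channel attention producing learnable scaling factors; details are illustrative rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Weight-standardized conv: each filter is normalized to zero mean and
    unit std, so no batch statistics are estimated or averaged."""
    def forward(self, x):
        w = self.weight
        w = (w - w.mean(dim=(1, 2, 3), keepdim=True)) / (
            w.std(dim=(1, 2, 3), keepdim=True) + 1e-5)
        return F.conv2d(x, w, self.bias, self.stride, self.padding)

class ChannelAttention(nn.Module):
    """SE-style attention: learnable per-channel scaling factors."""
    def __init__(self, c, r=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(),
                                nn.Linear(c // r, c), nn.Sigmoid())
    def forward(self, x):
        s = self.fc(x.mean(dim=(2, 3)))  # squeeze: global average pool
        return x * s[:, :, None, None]   # excite: rescale channels

block = nn.Sequential(WSConv2d(3, 16, 3, padding=1), ChannelAttention(16))
y = block(torch.randn(2, 3, 32, 32))
```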

[602] Sampling from Energy-based Policies using Diffusion

Vineet Jain, Tara Akhound-Sadegh, Siamak Ravanbakhsh

Main category: cs.LG

TL;DR: Diffusion Q-Sampling (DQS) uses diffusion models to sample from energy-based policies in continuous action spaces, enabling more expressive multimodal policy representations than traditional Gaussian approaches.

DetailsMotivation: Existing methods in maximum entropy RL use simple parametric distributions like Gaussians for policy representation, which limits their ability to capture complex multimodal behaviors in continuous action spaces where direct sampling from Boltzmann distributions is computationally intractable.

Method: Proposes a diffusion-based approach for sampling from energy-based policies where the negative Q-function defines the energy function. Introduces Diffusion Q-Sampling (DQS) as an actor-critic method that leverages diffusion models to enable more expressive policy representations.

Result: The approach enhances sample efficiency in continuous control tasks and successfully captures multimodal behaviors, addressing key limitations of existing methods while maintaining stable learning in diverse environments.

Conclusion: Diffusion-based sampling provides an effective solution for implementing energy-based policies in continuous action spaces, offering improved expressiveness and performance over traditional parametric policy representations in reinforcement learning.

Abstract: Energy-based policies offer a flexible framework for modeling complex, multimodal behaviors in reinforcement learning (RL). In maximum entropy RL, the optimal policy is a Boltzmann distribution derived from the soft Q-function, but direct sampling from this distribution in continuous action spaces is computationally intractable. As a result, existing methods typically use simpler parametric distributions, like Gaussians, for policy representation – limiting their ability to capture the full complexity of multimodal action distributions. In this paper, we introduce a diffusion-based approach for sampling from energy-based policies, where the negative Q-function defines the energy function. Based on this approach, we propose an actor-critic method called Diffusion Q-Sampling (DQS) that enables more expressive policy representations, allowing stable learning in diverse environments. We show that our approach enhances sample efficiency in continuous control tasks and captures multimodal behaviors, addressing key limitations of existing methods. Code is available at https://github.com/vineetjain96/Diffusion_Q_Sampling.git
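
Concretely, the target being sampled is the Boltzmann policy, with the negative soft Q-function playing the role of the energy ($\alpha$ is a temperature):

$$\pi^{*}(a \mid s) \;\propto\; \exp\!\big(Q_{\text{soft}}(s,a)/\alpha\big) \;=\; \exp\!\big(-E(s,a)/\alpha\big), \qquad E(s,a) := -Q_{\text{soft}}(s,a).$$

Because this density is unnormalized and multimodal over a continuous action space, a diffusion sampler can draw from it where a Gaussian policy head would collapse onto a single mode.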

[603] FACEGroup: Feasible and Actionable Counterfactual Explanations for Group Fairness

Christos Fragkathoulas, Vasiliki Papanikou, Evaggelia Pitoura, Evimaria Terzi

Main category: cs.LG

TL;DR: FACEGroup is a graph-based framework for generating group counterfactual explanations to audit group fairness, incorporating feasibility constraints and identifying subgroups with similar counterfactuals while capturing trade-offs in counterfactual generation.

DetailsMotivation: To address the need for trustworthy machine learning by providing a systematic approach to audit group fairness through counterfactual explanations that account for real-world feasibility constraints.

Method: Developed FACEGroup framework that models feasibility constraints, identifies subgroups with similar counterfactuals, and captures key trade-offs in counterfactual generation using graph-based approach.

Result: Experiments on benchmark datasets demonstrate that FACEGroup effectively generates feasible group counterfactuals while accounting for trade-offs, and the proposed metrics successfully capture and quantify fairness disparities.

Conclusion: FACEGroup provides the first comprehensive graph-based framework for group counterfactual explanations that addresses feasibility constraints and trade-offs, offering novel metrics for auditing group fairness in machine learning systems.

Abstract: Counterfactual explanations assess unfairness by revealing how inputs must change to achieve a desired outcome. This paper introduces the first graph-based framework for generating group counterfactual explanations to audit group fairness, a key aspect of trustworthy machine learning. Our framework, FACEGroup (Feasible and Actionable Counterfactual Explanations for Group Fairness), models real-world feasibility constraints, identifies subgroups with similar counterfactuals, and captures key trade-offs in counterfactual generation, distinguishing it from existing methods. To evaluate fairness, we introduce novel metrics for both group and subgroup level analysis that explicitly account for these trade-offs. Experiments on benchmark datasets show that FACEGroup effectively generates feasible group counterfactuals while accounting for trade-offs, and that our metrics capture and quantify fairness disparities.

[604] Learning Load Balancing with GNN in MPTCP-Enabled Heterogeneous Networks

Han Ji, Xiping Wu, Zhihong Zeng, Chen Chen

Main category: cs.LG

TL;DR: GNN-based load balancing model for MPTCP-enabled hybrid LiFi/WiFi networks that achieves near-optimal throughput with significantly faster inference compared to traditional methods.

DetailsMotivation: Current TCP restricts UEs to single AP connections, and MPTCP complicates HetNet topologies making existing load balancing models ineffective for hybrid LiFi/WiFi networks.

Method: Proposed graph neural network (GNN) model that treats network topology as a graph with channel state and data rate as node features, and load balancing solutions as edge labels.

Result: Achieves near-optimal throughput within 11.5% gap of traditional optimization methods while reducing inference time by 4 orders of magnitude. Outperforms DNN models by up to 21.7% throughput improvement.

Conclusion: GNN-based approach effectively handles complex partial mesh topologies in MPTCP-enabled HetNets and generalizes better across varying numbers of APs and UEs with a single trained model.

Abstract: Hybrid light fidelity (LiFi) and wireless fidelity (WiFi) networks are a promising paradigm of heterogeneous network (HetNet), attributed to the complementary physical properties of optical spectra and radio frequency. However, the current development of such HetNets is mostly bottlenecked by the existing transmission control protocol (TCP), which restricts the user equipment (UE) to connecting one access point (AP) at a time. While the ongoing investigation on multipath TCP (MPTCP) can bring significant benefits, it complicates the network topology of HetNets, making the existing load balancing (LB) learning models less effective. Driven by this, we propose a graph neural network (GNN)-based model to tackle the LB problem for MPTCP-enabled HetNets, which results in a partial mesh topology. Such a topology can be modeled as a graph, with the channel state information and data rate requirement embedded as node features, while the LB solutions are deemed as edge labels. Compared to the conventional deep neural network (DNN), the proposed GNN-based model exhibits two key strengths: i) it can better interpret a complex network topology; and ii) it can handle various numbers of APs and UEs with a single trained model. Simulation results show that against the traditional optimisation method, the proposed learning model can achieve near-optimal throughput within a gap of 11.5%, while reducing the inference time by 4 orders of magnitude. In contrast to the DNN model, the new method can improve the network throughput by up to 21.7%, at a similar inference time level.

[605] CAREL: Instruction-guided reinforcement learning with cross-modal auxiliary objectives

Armin Saghafian, Amirmohammad Izadi, Negin Hashemi Dijujin, Mahdieh Soleymani Baghshah

Main category: cs.LG

TL;DR: CAREL framework uses cross-modal auxiliary losses and instruction tracking to improve grounding of language instructions in RL environments, achieving better sample efficiency and generalization.

DetailsMotivation: Enhance model's ability to generalize across tasks and environments in language-guided goal-reaching RL by improving instruction grounding in environmental context.

Method: Proposes CAREL framework with auxiliary loss functions inspired by video-text retrieval literature and novel instruction tracking method to automatically monitor progress.

Result: Superior sample efficiency and systematic generalization in multi-modal reinforcement learning problems.

Conclusion: CAREL effectively addresses instruction grounding challenges in goal-reaching RL through cross-modal auxiliary learning and progress tracking.

Abstract: Grounding the instruction in the environment is a key step in solving language-guided goal-reaching reinforcement learning problems. In automated reinforcement learning, a key concern is to enhance the model’s ability to generalize across various tasks and environments. In goal-reaching scenarios, the agent must comprehend the different parts of the instructions within the environmental context in order to complete the overall task successfully. In this work, we propose CAREL (Cross-modal Auxiliary REinforcement Learning) as a new framework to solve this problem using auxiliary loss functions inspired by video-text retrieval literature and a novel method called instruction tracking, which automatically keeps track of progress in an environment. The results of our experiments suggest superior sample efficiency and systematic generalization for this framework in multi-modal reinforcement learning problems. Our code base is available here.

[606] Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures

Adrien Bolland, Gaspard Lambrechts, Damien Ernst

Main category: cs.LG

TL;DR: Novel maximum entropy RL approach using relative entropy of discounted state-action distributions as intrinsic rewards, providing theoretical guarantees and good exploration performance.

DetailsMotivation: To improve exploration in reinforcement learning by developing a theoretically grounded intrinsic reward function based on relative entropy of future state-action distributions.

Method: Proposes using relative entropy of discounted state-action visitation distributions as intrinsic rewards, with off-policy learning of the fixed point distribution and policy optimization.

Result: The approach provides a lower bound on standard entropy objectives and state-action value functions, and achieves good state-action space coverage with high-performance control.

Conclusion: The relative entropy-based intrinsic reward framework offers strong theoretical foundations and effective exploration capabilities for reinforcement learning agents.

Abstract: Maximum entropy reinforcement learning integrates exploration into policy learning by providing additional intrinsic rewards proportional to the entropy of some distribution. In this paper, we propose a novel approach in which the intrinsic reward function is the relative entropy of the discounted distribution of states and actions (or features derived from these states and actions) visited during future time steps. This approach is motivated by three results. First, this new objective is a lower bound on the negated entropy of the marginal visitation distribution of states and actions, commonly used as an alternative exploration objective. Second, a policy maximizing the expected discounted sum of intrinsic rewards also maximizes a lower bound on the state-action value function of the decision process. Third, the distribution used in the intrinsic reward definition is the fixed point of a contraction operator. Existing algorithms can therefore be adapted to learn this fixed point off-policy and compute the intrinsic rewards. We finally introduce an algorithm maximizing our new objective and show that resulting policies have good state-action space coverage and achieve high-performance control.

[607] Neural Port-Hamiltonian Differential Algebraic Equations for Compositional Learning of Electrical Networks

Cyrus Neary, Nathan Tsao, Ufuk Topcu

Main category: cs.LG

TL;DR: Neural port-Hamiltonian differential algebraic equations (N-PHDAEs) for modeling constrained dynamical systems like electrical networks, with automatic index reduction for training and improved accuracy over baseline methods.

DetailsMotivation: Deep learning struggles with algebraic constraints in coupled dynamical systems, particularly electrical networks where compositional couplings introduce constraints on state variables that challenge existing data-driven approaches.

Method: Introduce N-PHDAEs that use neural networks to parameterize unknown terms in both differential and algebraic components of port-Hamiltonian DAEs. Use automatic differentiation for index reduction to transform neural DAEs into equivalent neural ODEs for training.

Result: Achieves order of magnitude improvement in prediction accuracy and constraint satisfaction compared to baseline N-ODE over long prediction horizons. Validated compositional capabilities by training separate N-PHDAE models for grid components and coupling them to predict larger-scale network behavior.

Conclusion: N-PHDAEs provide an effective framework for modeling constrained dynamical systems with compositional couplings, demonstrating superior performance in electrical network simulations and enabling scalable component-based modeling.

Abstract: We develop compositional learning algorithms for coupled dynamical systems, with a particular focus on electrical networks. While deep learning has proven effective at modeling complex relationships from data, compositional couplings between system components typically introduce algebraic constraints on state variables, posing challenges to many existing data-driven approaches to modeling dynamical systems. Towards developing deep learning models for constrained dynamical systems, we introduce neural port-Hamiltonian differential algebraic equations (N-PHDAEs), which use neural networks to parameterize unknown terms in both the differential and algebraic components of a port-Hamiltonian DAE. To train these models, we propose an algorithm that uses automatic differentiation to perform index reduction, automatically transforming the neural DAE into an equivalent system of neural ordinary differential equations (N-ODEs), for which established model inference and backpropagation methods exist. Experiments simulating the dynamics of nonlinear circuits exemplify the benefits of our approach: the proposed N-PHDAE model achieves an order of magnitude improvement in prediction accuracy and constraint satisfaction when compared to a baseline N-ODE over long prediction time horizons. We also validate the compositional capabilities of our approach through experiments on a simulated DC microgrid: we train individual N-PHDAE models for separate grid components, before coupling them to accurately predict the behavior of larger-scale networks.
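
For context, one common descriptor form of a port-Hamiltonian DAE is

$$E\,\dot{x} \;=\; \big(J(x) - R(x)\big)\,\nabla_{x} H(x) + B\,u, \qquad 0 = g(x),$$

with a singular $E$ encoding the algebraic constraints, $J = -J^{\top}$ the interconnection structure, $R \succeq 0$ the dissipation, and $H$ the Hamiltonian. In the N-PHDAE setting, neural networks parameterize the unknown terms (notation here is a standard form, not necessarily the paper's exact parameterization); index reduction then differentiates the constraint $g(x) = 0$ until the system can be rewritten as an ODE amenable to standard neural-ODE training.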

[608] Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models

Alessio Russo, Alberto Maria Metelli, Marcello Restelli

Main category: cs.LG

TL;DR: Novel algorithm for POMDPs with unknown transition model achieves O(√T log T) regret using deterministic belief-based policies and improved estimation techniques

DetailsMotivation: Existing methods for average-reward POMDPs have limitations: frequentist approaches use suboptimal stochastic policies, while Bayesian methods require strong assumptions about estimator consistency

Method: Proposed Action-wise OAS-UCRL algorithm with novel estimator that can use samples from different policies, combined with optimistic approach leveraging deterministic belief-based policies

Result: Achieves regret guarantee of O(√T log T) against optimal policy, improving over state-of-the-art techniques. Numerical simulations validate theoretical results

Conclusion: The approach removes previous limitations by providing convenient estimation guarantees and enabling data-efficient learning with optimal policy class

Abstract: We tackle average-reward infinite-horizon POMDPs with an unknown transition model but a known observation model, a setting that has been previously addressed in two limiting ways: (i) frequentist methods relying on suboptimal stochastic policies having a minimum probability of choosing each action, and (ii) Bayesian approaches employing the optimal policy class but requiring strong assumptions about the consistency of employed estimators. Our work removes these limitations by proving convenient estimation guarantees for the transition model and introducing an optimistic algorithm that leverages the optimal class of deterministic belief-based policies. We introduce modifications to existing estimation techniques providing theoretical guarantees separately for each estimated action transition matrix. Unlike existing estimation methods that are unable to use samples from different policies, we present a novel and simple estimator that overcomes this barrier. This new data-efficient technique, combined with the proposed Action-wise OAS-UCRL algorithm and a tighter theoretical analysis, leads to the first approach enjoying a regret guarantee of order $\mathcal{O}(\sqrt{T \log T})$ when compared against the optimal policy, thus improving over state-of-the-art techniques. Finally, theoretical results are validated through numerical simulations showing the efficacy of our method against baseline methods.

[609] A Theoretical Justification for Asymmetric Actor-Critic Algorithms

Gaspard Lambrechts, Damien Ernst, Aditya Mahajan

Main category: cs.LG

TL;DR: Theoretical justification for asymmetric actor-critic algorithms showing they eliminate error terms from state aliasing in partially observable environments.

DetailsMotivation: Asymmetric learning algorithms in reinforcement learning lack precise theoretical justification for their benefits despite being theoretically sound and successful in practice.

Method: Adapt finite-time convergence analysis to asymmetric actor-critic algorithms with linear function approximators for partially observable environments.

Result: The finite-time bound demonstrates that the asymmetric critic eliminates error terms arising from aliasing in the agent state.

Conclusion: Provides theoretical justification showing asymmetric critic methods offer concrete benefits by removing state aliasing errors in partially observable reinforcement learning.

Abstract: In reinforcement learning for partially observable environments, many successful algorithms have been developed within the asymmetric learning paradigm. This paradigm leverages additional state information available at training time for faster learning. Although the proposed learning objectives are usually theoretically sound, these methods still lack a precise theoretical justification for their potential benefits. We propose such a justification for asymmetric actor-critic algorithms with linear function approximators by adapting a finite-time convergence analysis to this setting. The resulting finite-time bound reveals that the asymmetric critic eliminates error terms arising from aliasing in the agent state.

[610] Predicting Steady-State Behavior in Complex Networks with Graph Neural Networks

Priodyuti Pradhan, Amit Reza

Main category: cs.LG

TL;DR: Graph neural network framework using convolution and attention mechanisms accurately identifies steady-state behavior of linear dynamical systems on networks and distinguishes different information propagation states.

DetailsMotivation: To understand and classify different types of information propagation states (diffused, weakly localized, strongly localized) in complex systems using machine learning approaches.

Method: Developed a graph convolution and attention-based neural network framework to learn the behavior of linear dynamical systems on networks and identify their steady-state behavior.

Result: The trained model distinguishes different information propagation states with high accuracy and performs well with real-world data evaluation.

Conclusion: The proposed graph neural network framework effectively identifies system states and provides analytical derivations for both forward and backward propagation to enhance model explainability.

Abstract: In complex systems, information propagation can be defined as diffused or delocalized, weakly localized, and strongly localized. This study investigates the application of graph neural network models to learn the behavior of a linear dynamical system on networks. A graph convolution and attention-based neural network framework has been developed to identify the steady-state behavior of the linear dynamical system. We reveal that our trained model distinguishes the different states with high accuracy. Furthermore, we have evaluated model performance with real-world data. In addition, to understand the explainability of our model, we provide an analytical derivation for the forward and backward propagation of our framework.

[611] Kolmogorov-Arnold Fourier Networks

Jusheng Zhang, Yijia Fan, Kaitong Cai, Keze Wang

Main category: cs.LG

TL;DR: KAF network combines trainable Fourier features and hybrid activation to address parameter explosion and high-frequency capture issues in KAN networks, achieving better efficiency and performance across multiple domains.

DetailsMotivation: Kolmogorov-Arnold networks (KAN) suffer from parameter explosion and poor high-frequency feature capture in high-dimensional tasks, limiting their practical utility despite strong theoretical expressiveness.

Method: Proposes KAF network with: 1) merged dual-matrix structure to reduce parameters, 2) learnable Random Fourier Features initialization to eliminate spectral distortion, 3) adaptive hybrid GELU-Fourier activation for progressive frequency enhancement.

Result: Comprehensive experiments show KAF’s superiority in vision, NLP, audio processing, and differential equation-solving tasks, demonstrating improved performance and efficiency.

Conclusion: KAF effectively combines theoretical interpretability with practical utility and computational efficiency, solving key limitations of KAN networks while maintaining strong performance across diverse applications.

Abstract: Although Kolmogorov-Arnold based interpretable networks (KAN) have strong theoretical expressiveness, they face significant parameter explosion and high-frequency feature capture challenges in high-dimensional tasks. To address this issue, we propose the Kolmogorov-Arnold-Fourier Network (KAF), which effectively integrates trainable Random Fourier Features (RFF) and a novel hybrid GELU-Fourier activation mechanism to balance parameter efficiency and spectral representation capabilities. Our key technical contributions include: (1) merging KAN’s dual-matrix structure through matrix association properties to substantially reduce parameters; (2) introducing learnable RFF initialization strategies to eliminate spectral distortion in high-dimensional approximation tasks; (3) implementing an adaptive hybrid activation function that progressively enhances frequency representation during the training process. Comprehensive experiments demonstrate the superiority of our KAF across various domains including vision, NLP, audio processing, and differential equation-solving tasks, effectively combining theoretical interpretability with practical utility and computational efficiency.
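
A minimal sketch of the trainable Random Fourier Features ingredient: the frequency matrix is initialized with Gaussian draws, as in classical RFF kernel approximation, but left learnable. The hybrid GELU-Fourier activation and the merged dual-matrix structure are omitted here.

```python
import torch
import torch.nn as nn

class LearnableRFF(nn.Module):
    """Trainable Random Fourier Features: Gaussian-initialized frequencies
    that remain free parameters during training."""
    def __init__(self, d_in, n_features, sigma=1.0):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_in, n_features) * sigma)
        self.b = nn.Parameter(2 * torch.pi * torch.rand(n_features))

    def forward(self, x):
        # cos/sin projections yield a 2*n_features-dim spectral embedding.
        z = x @ self.W + self.b
        return torch.cat([torch.cos(z), torch.sin(z)], dim=-1)

feats = LearnableRFF(d_in=8, n_features=32)(torch.randn(4, 8))  # (4, 64)
```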

[612] Flow-based generative models as iterative algorithms in probability space

Yao Xie, Xiuyuan Cheng

Main category: cs.LG

TL;DR: Flow-based generative models tutorial providing mathematical framework for neural network representations of probability densities using ODEs, with theoretical foundations and practical applications.

DetailsMotivation: To provide researchers and practitioners with a rigorous yet accessible understanding of flow-based generative models, bridging empirical advancements with theoretical insights for effective application in signal processing and machine learning.

Method: Presents flow-based generative models as neural network-based representations of continuous probability densities using invertible mappings governed by Ordinary Differential Equations (ODEs), exploring theoretical principles including Wasserstein metric, gradient flows, and density evolution.

Result: A comprehensive tutorial framework that enables exact likelihood estimation, efficient sampling, and deterministic transformations between distributions for capturing complex probability distributions in high-dimensional data synthesis.

Conclusion: Flow-based generative models offer a powerful framework for data-driven modeling with strong theoretical foundations, making them valuable tools for various applications including image generation, language modeling, biomedical signal processing, and anomaly detection.

Abstract: Generative AI (GenAI) has revolutionized data-driven modeling by enabling the synthesis of high-dimensional data across various applications, including image generation, language modeling, biomedical signal processing, and anomaly detection. Flow-based generative models provide a powerful framework for capturing complex probability distributions, offering exact likelihood estimation, efficient sampling, and deterministic transformations between distributions. These models leverage invertible mappings governed by Ordinary Differential Equations (ODEs), enabling precise density estimation and likelihood evaluation. This tutorial presents an intuitive mathematical framework for flow-based generative models, formulating them as neural network-based representations of continuous probability densities. We explore key theoretical principles, including the Wasserstein metric, gradient flows, and density evolution governed by ODEs, to establish convergence guarantees and bridge empirical advancements with theoretical insights. By providing a rigorous yet accessible treatment, we aim to equip researchers and practitioners with the necessary tools to effectively apply flow-based generative models in signal processing and machine learning.

[613] Emergence of the Primacy Effect in Structured State-Space Models

Takashi Morita

Main category: cs.LG

TL;DR: SSMs show unexpected primacy effect (better memory for initial inputs) instead of the theoretically expected recency effect, challenging current understanding of their memory mechanisms.

DetailsMotivation: To investigate the memory retention patterns of structured state-space models (SSMs) which were theoretically designed to have monotonic decay favoring recent inputs, but may exhibit unexpected behavior in practice.

Method: Trained and evaluated SSMs on a synthetic, statistically balanced memorization task to systematically analyze memory retention patterns.

Result: Contrary to theoretical expectations, SSMs predominantly preserved initially presented data (primacy effect) rather than more recent inputs (recency effect).

Conclusion: The observed primacy effect presents a non-trivial challenge to current theoretical understanding of SSMs and opens new research directions for understanding their memory mechanisms.

Abstract: Structured state-space models (SSMs) have been developed to offer more persistent memory retention than traditional recurrent neural networks, while maintaining real-time inference capabilities and addressing the time-complexity limitations of Transformers. Despite this intended persistence, the memory mechanism of canonical SSMs is theoretically designed to decay monotonically over time, meaning that more recent inputs are expected to be retained more accurately than earlier ones. Contrary to this theoretical expectation, however, the present study reveals a counterintuitive finding: when trained and evaluated on a synthetic, statistically balanced memorization task, SSMs predominantly preserve the initially presented data in memory. This pattern of memory bias, known as the primacy effect in psychology, presents a non-trivial challenge to the current theoretical understanding of SSMs and opens new avenues for future research.
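
The expectation being overturned is easy to see in a scalar diagonal SSM, sketched below: unrolling the recurrence puts geometrically decaying weight on older inputs, i.e., a recency effect, which is exactly what the observed primacy effect contradicts.

```python
import numpy as np

# Scalar diagonal SSM: h_t = a * h_{t-1} + b * x_t with |a| < 1.
# Unrolling gives h_T = sum_k a**(T - k) * b * x_k, so the input from step k
# enters the final state with weight a**(T - k): the oldest inputs should be
# remembered worst under the canonical analysis.
a, b, T = 0.9, 1.0, 20
weights = np.array([a ** (T - k) * b for k in range(1, T + 1)])
print(weights.round(3))  # smallest weight on the first input, largest on the last
```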

[614] A comparative analysis of rank aggregation methods for the partial label ranking problem

Jiayi Wang, Juan C. Alfaro, Viktor Bengs

Main category: cs.LG

TL;DR: This paper explores alternative rank aggregation methods for partial label ranking, extending scoring-based and probabilistic approaches to handle ties. Scoring-based variants outperform state-of-the-art, while probabilistic variants underperform.

DetailsMotivation: Existing partial label ranking approaches rely on approximation algorithms for rank aggregation, but there's a need to explore alternative aggregation methods that can better handle ties in predicted orders.

Method: The paper investigates several alternative aggregation methods including scoring-based and non-parametric probabilistic-based rank aggregation approaches, extending them to increase the likelihood of producing ties for partial label ranking.

Result: Experimental evaluations show that scoring-based variants consistently outperform the current state-of-the-art method in handling incomplete information, while non-parametric probabilistic-based variants fail to achieve competitive performance.

Conclusion: Scoring-based rank aggregation methods are more effective for partial label ranking problems compared to probabilistic approaches, demonstrating superior performance in handling ties and incomplete information.

Abstract: The label ranking problem is a supervised learning scenario in which the learner predicts a total order of the class labels for a given input instance. Recently, research has increasingly focused on the partial label ranking problem, a generalization of the label ranking problem that allows ties in the predicted orders. So far, most existing learning approaches for the partial label ranking problem rely on approximation algorithms for rank aggregation in the final prediction step. This paper explores several alternative aggregation methods for this critical step, including scoring-based and non-parametric probabilistic-based rank aggregation approaches. To enhance their suitability for the more general partial label ranking problem, the investigated methods are extended to increase the likelihood of producing ties. Experimental evaluations on standard benchmarks demonstrate that scoring-based variants consistently outperform the current state-of-the-art method in handling incomplete information. In contrast, non-parametric probabilistic-based variants fail to achieve competitive performance.
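
As a flavor of the scoring-based family, here is a toy Borda-count aggregator extended to emit ties by bucketing near-equal average scores; the tolerance rule is an illustrative stand-in for the paper's tie-inducing extensions, not its exact method.

```python
from collections import defaultdict

def borda_with_ties(rankings, tie_tol=0.5):
    """Score labels by Borda counts across rankings, then group labels whose
    average scores differ by less than tie_tol into the same (tied) bucket."""
    scores = defaultdict(float)
    for ranking in rankings:               # each ranking: list of labels, best first
        n = len(ranking)
        for pos, label in enumerate(ranking):
            scores[label] += n - pos
    avg = {l: s / len(rankings) for l, s in scores.items()}
    ordered = sorted(avg, key=avg.get, reverse=True)
    buckets, current = [], [ordered[0]]
    for l in ordered[1:]:
        if avg[current[-1]] - avg[l] < tie_tol:
            current.append(l)              # close scores are declared tied
        else:
            buckets.append(current)
            current = [l]
    buckets.append(current)
    return buckets

print(borda_with_ties([["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]]))
# -> [['a'], ['b'], ['c']] (a strict order here; smaller gaps would merge buckets)
```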

[615] VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making

Mohamed Salim Aissi, Clemence Grislain, Mohamed Chetouani, Olivier Sigaud, Laure Soulier, Nicolas Thome

Main category: cs.LG

TL;DR: VIPER is a multimodal framework that combines VLM perception and LLM reasoning for visual instruction-based planning, achieving state-of-the-art performance on ALFWorld benchmark while enhancing explainability.

DetailsMotivation: While LLMs excel at text reasoning and VLMs are effective for visual perception, applying these models for visual instruction-based planning remains an open problem that needs to be addressed.

Method: Uses a modular pipeline with frozen VLM to generate textual descriptions of images, processed by LLM policy to predict actions. Fine-tuned with behavioral cloning and reinforcement learning.

Result: Significantly outperforms state-of-the-art visual instruction-based planners on ALFWorld benchmark and narrows the gap with text-based oracles.

Conclusion: VIPER demonstrates effective integration of perception and reasoning through text as intermediate representation, enhancing both performance and explainability for visual planning tasks.

Abstract: While Large Language Models (LLMs) excel at reasoning on text and Vision-Language Models (VLMs) are highly effective for visual perception, applying those models for visual instruction-based planning remains a widely open problem. In this paper, we introduce VIPER, a novel framework for multimodal instruction-based planning that integrates VLM-based perception with LLM-based reasoning. Our approach uses a modular pipeline where a frozen VLM generates textual descriptions of image observations, which are then processed by an LLM policy to predict actions based on the task goal. We fine-tune the reasoning module using behavioral cloning and reinforcement learning, improving our agent’s decision-making capabilities. Experiments on the ALFWorld benchmark show that VIPER significantly outperforms state-of-the-art visual instruction-based planners while narrowing the gap with purely text-based oracles. By leveraging text as an intermediate representation, VIPER also enhances explainability, paving the way for a fine-grained analysis of perception and reasoning components.
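
The modular pipeline reduces to a two-stage call; in the sketch below, `vlm_describe` and `llm_policy` are hypothetical placeholders for the frozen VLM and the BC/RL-tuned LLM policy, and the prompt format is assumed.

```python
def viper_step(image, goal, vlm_describe, llm_policy):
    """One decision step of a VIPER-style loop: a frozen VLM turns the
    observation into text (the interpretable intermediate representation),
    and an LLM policy maps (goal, description) to the next action."""
    description = vlm_describe(image)                      # frozen VLM
    prompt = f"Goal: {goal}\nObservation: {description}\nNext action:"
    return llm_policy(prompt)                              # fine-tuned with BC + RL

# Usage with trivial stubs:
action = viper_step(
    image=None,
    goal="put a clean mug on the shelf",
    vlm_describe=lambda img: "a mug is on the counter next to the sink",
    llm_policy=lambda p: "pick up the mug",
)
print(action)
```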

[616] Hallucination Detection on a Budget: Efficient Bayesian Estimation of Semantic Entropy

Kamil Ciosek, Nicolò Felicioni, Sina Ghiassian

Main category: cs.LG

TL;DR: Bayesian adaptive semantic entropy estimation for LLM hallucination detection that uses fewer samples and adapts to context difficulty

DetailsMotivation: Improve semantic entropy estimation for detecting LLM hallucinations by addressing sample efficiency and adaptive resource allocation

Method: Bayesian approach to semantic entropy estimation with adaptive sampling that allocates more samples to harder contexts

Result: Achieves same hallucination detection quality (AUROC) with only 53% of samples compared to baseline, works even with single sample

Conclusion: Proposed Bayesian adaptive method significantly improves sample efficiency for semantic entropy-based hallucination detection

Abstract: Detecting whether an LLM hallucinates is an important research challenge. One promising way of doing so is to estimate the semantic entropy (Farquhar et al., 2024) of the distribution of generated sequences. We propose a new algorithm for doing that, with two main advantages. First, because we take a Bayesian approach, we achieve a much better quality of semantic entropy estimates for a given budget of samples from the LLM. Second, we are able to tune the number of samples adaptively so that 'harder' contexts receive more samples. We demonstrate empirically that our approach systematically beats the baselines, requiring only 53% of samples used by Farquhar et al. (2024) to achieve the same quality of hallucination detection as measured by AUROC. Moreover, quite counterintuitively, our estimator is useful even with just one sample from the LLM.
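
For reference, the plug-in Monte Carlo estimate of semantic entropy that the paper improves on looks like the sketch below: cluster sampled answers by meaning-equivalence and take the entropy of the cluster frequencies. The Bayesian estimator and adaptive sampling replace this naive version; `same_meaning` is an assumed equivalence check (e.g., NLI-based in practice).

```python
import math

def semantic_entropy(samples, same_meaning):
    """Plug-in semantic entropy: greedily cluster answers that share a
    meaning, then compute the entropy of the empirical cluster distribution."""
    clusters = []
    for s in samples:
        for c in clusters:
            if same_meaning(s, c[0]):
                c.append(s)
                break
        else:
            clusters.append([s])
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

answers = ["Paris", "paris", "Lyon", "Paris"]
print(semantic_entropy(answers, lambda a, b: a.lower() == b.lower()))  # ~0.562 nats
```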

[617] M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models

Junxiong Wang, Wen-Ding Li, Daniele Paliotta, Daniel Ritter, Alexander M. Rush, Tri Dao

Main category: cs.LG

TL;DR: M1 is a hybrid linear RNN reasoning model based on Mamba architecture that enables memory-efficient inference for mathematical reasoning, outperforming previous linear RNN models and matching state-of-the-art transformer models while achieving 3x speedup.

DetailsMotivation: Transformer-based models are limited in extending context length due to quadratic computational complexity and linear memory requirements, which hinders effective long chain-of-thought reasoning for complex mathematical problems.

Method: Developed a hybrid linear RNN reasoning model (M1) using Mamba architecture with distillation from existing reasoning models and enhanced through RL training, enabling memory-efficient inference.

Result: Outperforms previous linear RNN models and matches Deepseek R1 distilled reasoning models on AIME and MATH benchmarks, with 3x speedup compared to same-size transformers and higher accuracy under fixed generation time budget using self-consistency voting.

Conclusion: M1 provides an effective approach to scaling test-time generation using self-consistency or long chain-of-thought reasoning, offering superior computational efficiency while maintaining competitive performance.

Abstract: Effective reasoning is crucial to solving complex mathematical problems. Recent large language models (LLMs) have boosted performance by scaling test-time computation through long chain-of-thought reasoning. However, transformer-based models are inherently limited in extending context length due to their quadratic computational complexity and linear memory requirements. In this paper, we introduce a novel hybrid linear RNN reasoning model, M1, built on the Mamba architecture, which allows memory-efficient inference. Our approach leverages a distillation process from existing reasoning models and is further enhanced through RL training. Experimental results on the AIME and MATH benchmarks show that M1 not only outperforms previous linear RNN models but also matches the performance of state-of-the-art Deepseek R1 distilled reasoning models at a similar scale. We also compare our generation speed with a highly performant general-purpose inference engine, vLLM, and observe more than a 3x speedup compared to a same-size transformer. With this throughput speedup, we are able to achieve higher accuracy compared to DeepSeek R1 distilled transformer reasoning models under a fixed generation time budget using self-consistency voting. Overall, we introduce a hybrid Mamba reasoning model and provide a more effective approach to scaling test-time generation using self-consistency or long chain-of-thought reasoning.

[618] An All-Atom Generative Model for Designing Protein Complexes

Ruizhe Chen, Dongyu Xue, Xiangxin Zhou, Zaixiang Zheng, Xiangxiang Zeng, Quanquan Gu

Main category: cs.LG

TL;DR: APM is a novel generative model for multi-chain protein complexes that handles atom-level modeling, inter-chain interactions, and supports tasks like folding, inverse-folding, and zero-shot sampling with state-of-the-art performance.

DetailsMotivation: While single-chain protein modeling has advanced significantly (e.g., ESM, AlphaFold2), multi-chain protein modeling remains underdeveloped despite being crucial for understanding biological functions and protein interactions.

Method: APM integrates atom-level information and leverages multi-chain protein data to precisely model inter-chain interactions. It supports supervised fine-tuning (SFT) and zero-shot sampling capabilities.

Result: APM achieves state-of-the-art results in modeling multi-chain proteins, enabling precise inter-chain interaction modeling, designing protein complexes with binding capabilities from scratch, and performing folding/inverse-folding tasks.

Conclusion: APM represents a significant advancement in multi-chain protein modeling, demonstrating versatility in downstream applications and providing a powerful tool for understanding protein complexes and their biological functions.

Abstract: Proteins typically exist in complexes, interacting with other proteins or biomolecules to perform their specific biological roles. Single-chain protein modeling has been explored extensively, with advances seen in models like the ESM series and AlphaFold2. Despite these developments, the study and modeling of multi-chain proteins remain largely uncharted, though they are vital for understanding biological functions. Recognizing the importance of these interactions, we introduce APM (All-Atom Protein Generative Model), a model specifically designed for modeling multi-chain proteins. By integrating atom-level information and leveraging data on multi-chain proteins, APM is capable of precisely modeling inter-chain interactions and designing protein complexes with binding capabilities from scratch. It also performs folding and inverse-folding tasks for multi-chain proteins. Moreover, APM demonstrates versatility in downstream applications: it achieves enhanced performance through supervised fine-tuning (SFT) while also supporting zero-shot sampling in certain tasks, achieving state-of-the-art results. We released our code at https://github.com/bytedance/apm.

[619] Addressing Concept Mislabeling in Concept Bottleneck Models Through Preference Optimization

Emiliano Penaloza, Tianyue H. Zhang, Laurent Charlin, Mateo Espinosa Zarlenga

Main category: cs.LG

TL;DR: CPO objective improves Concept Bottleneck Models’ robustness to concept mislabeling by using Direct Preference Optimization instead of Binary Cross Entropy, reducing performance degradation from noisy concept labels.

DetailsMotivation: Concept Bottleneck Models assume accurate concept labels, but real-world datasets often contain mislabeled concepts that degrade model performance by up to 25%.

Method: Introduced Concept Preference Optimization (CPO) objective based on Direct Preference Optimization, which directly optimizes for concept’s posterior distribution and is less sensitive to concept noise compared to Binary Cross Entropy.

Result: CPO consistently outperforms BCE on three real-world datasets, both with and without added label noise, demonstrating improved robustness to concept mislabeling.

Conclusion: CPO provides an effective solution to mitigate the negative impact of concept mislabeling in CBMs, enhancing their practical applicability in real-world scenarios where perfect concept labels are unavailable.

Abstract: Concept Bottleneck Models (CBMs) propose to enhance the trustworthiness of AI systems by constraining their decisions on a set of human-understandable concepts. However, CBMs typically assume that datasets contain accurate concept labels, an assumption often violated in practice, which we show can significantly degrade performance (by 25% in some cases). To address this, we introduce the Concept Preference Optimization (CPO) objective, a new loss function based on Direct Preference Optimization, which effectively mitigates the negative impact of concept mislabeling on CBM performance. We provide an analysis of key properties of the CPO objective, showing it directly optimizes for the concept’s posterior distribution, and contrast it against Binary Cross Entropy (BCE), demonstrating that CPO is inherently less sensitive to concept noise. We empirically confirm our analysis by finding that CPO consistently outperforms BCE on three real-world datasets, both with and without added label noise. We make our code available on GitHub.
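
To show the shape of objective involved, here is a minimal DPO-style preference loss applied to log-probabilities of preferred versus dispreferred concept labelings. This sketches the family CPO belongs to, under our reading; it is not the paper's exact CPO formulation.

```python
import torch
import torch.nn.functional as F

def dpo_style_concept_loss(logp_pref, logp_dispref, ref_pref, ref_dispref, beta=0.1):
    """DPO-style loss over concept labelings: push the model's (reference-
    adjusted) log-probability margin between preferred and dispreferred
    annotations through a log-sigmoid. Illustrative, not the CPO objective."""
    margin = beta * ((logp_pref - ref_pref) - (logp_dispref - ref_dispref))
    return -F.logsigmoid(margin).mean()

# Toy usage with random log-probabilities for a batch of 4 examples:
lp, ld = torch.randn(4), torch.randn(4)
rp, rd = torch.randn(4), torch.randn(4)
print(dpo_style_concept_loss(lp, ld, rp, rd))
```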

[620] In-context Ranking Preference Optimization

Junda Wu, Rohan Surana, Zhouhang Xie, Yiran Shen, Yu Xia, Tong Yu, Ryan A. Rossi, Prithviraj Ammanabrolu, Julian McAuley

Main category: cs.LG

TL;DR: IRPO extends DPO to optimize LLMs for ranking tasks using in-context ranking lists, handling sparse feedback and position-aware optimization through differentiable positional aggregation.

DetailsMotivation: Address limited pairwise feedback in real-world scenarios where users only identify relevant items rather than providing detailed comparisons, and support complex IR tasks requiring top-quality output ranking.

Method: Extends DPO objective by incorporating item relevance and position information, uses differentiable positional aggregation of pairwise preferences to optimize discrete ranking metrics via gradient-based methods.

Result: IRPO outperforms standard DPO approaches in ranking performance, demonstrating better alignment with in-context ranking preferences.

Conclusion: IRPO effectively handles sparse feedback and position-aware optimization for LLM ranking tasks, providing theoretical guarantees and practical improvements over existing methods.

Abstract: Recent developments in Direct Preference Optimization (DPO) allow large language models (LLMs) to function as implicit ranking models by maximizing the margin between preferred and non-preferred responses. In practice, user feedback on such lists typically involves identifying a few relevant items in context rather than providing detailed pairwise comparisons for every possible item pair. Moreover, many complex information retrieval tasks, such as conversational agents and summarization systems, critically depend on ranking the highest-quality outputs at the top, emphasizing the need to support natural and flexible forms of user feedback. To address the challenge of limited and sparse pairwise feedback in the in-context setting, we propose an In-context Ranking Preference Optimization (IRPO) framework that directly optimizes LLMs based on ranking lists constructed during inference. To further capture flexible forms of feedback, IRPO extends the DPO objective by incorporating both the relevance of items and their positions in the list. Modeling these aspects jointly is non-trivial, as ranking metrics are inherently discrete and non-differentiable, making direct optimization difficult. To overcome this, IRPO introduces a differentiable objective based on positional aggregation of pairwise item preferences, enabling effective gradient-based optimization of discrete ranking metrics. We further provide theoretical insights showing that IRPO (i) automatically emphasizes items with greater disagreement between the model and the reference ranking, and (ii) links its gradient to an importance sampling estimator, yielding an unbiased estimator with reduced variance. Empirical results show IRPO outperforms standard DPO approaches in ranking performance, highlighting its effectiveness in aligning LLMs with direct in-context ranking preferences.
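
The core trick, a differentiable positional aggregation of pairwise preferences, can be sketched as a soft rank: each item's rank is one plus a sigmoid-smoothed count of items scored above it. This illustrates the mechanism that makes discrete ranking metrics trainable by gradients; it is not IRPO's exact objective.

```python
import torch

def soft_ranks(scores: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    # Rank of item i = 1 + (soft) number of items j with s_j > s_i.
    # Pairwise sigmoids smooth the discrete comparisons.
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)   # diff[j, i] = s_j - s_i
    return 1.0 + torch.sigmoid(diff / tau).sum(dim=0) - 0.5  # drop the j = i self-term

scores = torch.tensor([2.0, 0.5, 1.0], requires_grad=True)
ranks = soft_ranks(scores)     # approximately [1, 3, 2]
ranks.sum().backward()         # gradients flow through the ranking
```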

[621] Efficient Unstructured Pruning of Mamba State-Space Models for Resource-Constrained Environments

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

Main category: cs.LG

TL;DR: A novel unstructured pruning framework for Mamba SSMs that achieves 70% parameter reduction while maintaining 95% performance through gradient-aware magnitude pruning, iterative scheduling, and global optimization.

DetailsMotivation: State-space models like Mamba offer linear-time complexity but have large parameter counts that challenge deployment in resource-constrained environments, requiring efficient compression techniques.

Method: Three key innovations: (1) gradient-aware magnitude pruning combining weight magnitude and gradient information, (2) iterative pruning schedule for gradual sparsity increase, and (3) global pruning strategy for optimal parameter allocation across the model.

Result: Achieves up to 70% parameter reduction while retaining over 95% of original performance across WikiText-103, Long Range Arena, and ETT time-series benchmarks, with significant efficiency gains and minimal degradation.

Conclusion: The framework enables practical deployment of Mamba models in resource-constrained settings while broadening their applicability, with analysis revealing critical insights into the architecture’s redundancy and robustness.

Abstract: State-space models (SSMs), particularly the Mamba architecture, have emerged as powerful alternatives to Transformers for sequence modeling, offering linear-time complexity and competitive performance across diverse tasks. However, their large parameter counts pose significant challenges for deployment in resource-constrained environments. We propose a novel unstructured pruning framework tailored for Mamba models that achieves up to 70% parameter reduction while retaining over 95% of the original performance. Our approach integrates three key innovations: (1) a gradient-aware magnitude pruning technique that combines weight magnitude and gradient information to identify less critical parameters, (2) an iterative pruning schedule that gradually increases sparsity to maintain model stability, and (3) a global pruning strategy that optimizes parameter allocation across the entire model. Through extensive experiments on WikiText-103, Long Range Arena, and ETT time-series benchmarks, we demonstrate significant efficiency gains with minimal performance degradation. Our analysis of pruning effects on Mamba’s components reveals critical insights into the architecture’s redundancy and robustness, enabling practical deployment in resource-constrained settings while broadening Mamba’s applicability.
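
A compact sketch of the three ingredients follows: gradient-aware scores (|w| times |grad w|), a single global threshold across all layers, and an outer loop that raises sparsity gradually. It assumes gradients were populated by a backward pass on representative data, and is illustrative rather than the paper's implementation.

```python
import torch

def gradient_aware_global_prune(model: torch.nn.Module, sparsity: float):
    """Zero out the `sparsity` fraction of weights with the lowest
    |w| * |grad w| scores, using one threshold shared across the model."""
    scores = torch.cat([
        (p.detach().abs() * p.grad.abs()).flatten()
        for p in model.parameters() if p.grad is not None
    ])
    k = int(sparsity * scores.numel())
    if k == 0:
        return
    threshold = scores.kthvalue(k).values
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                mask = (p.abs() * p.grad.abs()) > threshold
                p.mul_(mask.to(p.dtype))   # zero out low-score weights in place

# Iterative schedule toward the 70% target, recomputing gradients each round:
# for s in torch.linspace(0.1, 0.7, steps=7):
#     loss = model(batch).loss; model.zero_grad(); loss.backward()
#     gradient_aware_global_prune(model, float(s))
```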

[622] Identification and Optimal Nonlinear Control of Turbojet Engine Using Koopman Eigenfunction Model

David Grasev

Main category: cs.LG

TL;DR: Data-driven Koopman operator approach for gas turbine engine modeling and control, outperforming traditional methods in tracking and disturbance rejection.

DetailsMotivation: Physics-based modeling of gas turbine engines is challenging due to complex nonlinear dynamics and unavailable performance characteristics, requiring many simplifying assumptions in conventional methods.

Method: Uses sparse identification of nonlinear dynamics for rotor estimation, maps autonomous dynamics to optimized Koopman eigenfunction space via metaheuristic algorithms and gradient-based identification, then designs nonlinear feedback controller and Kalman estimator in eigenfunction space.

Result: Koopman-based controller demonstrates superior performance in reference tracking and disturbance rejection under sea-level and varying flight conditions compared to traditional and gain-scheduled PI controllers.

Conclusion: The global nature of Koopman-based control enables targeting individual modes during optimization, leading to improved performance tuning and better overall control performance for complex gas turbine systems.

Abstract: Gas turbine engines are complex and highly nonlinear dynamical systems. Deriving their physics-based models can be challenging because it requires performance characteristics that are not always available, often leading to many simplifying assumptions. This paper discusses the limitations of conventional experimental methods used to derive component-level and locally linear parameter-varying models, and addresses these issues by employing identification techniques based on data collected from standard engine operation under closed-loop control. The rotor dynamics are estimated using the sparse identification of nonlinear dynamics. Subsequently, the autonomous part of the dynamics is mapped into an optimally constructed Koopman eigenfunction space. This process involves eigenvalue optimization using metaheuristic algorithms and temporal projection, followed by gradient-based eigenfunction identification. The resulting Koopman model is validated against an in-house reference component-level model. A globally optimal nonlinear feedback controller and a Kalman estimator are then designed within the eigenfunction space and compared to traditional and gain-scheduled proportional-integral controllers, as well as a proposed internal model control approach. The eigenmode structure enables targeting individual modes during optimization, leading to improved performance tuning. Results demonstrate that the Koopman-based controller surpasses other benchmark controllers in both reference tracking and disturbance rejection under sea-level and varying flight conditions, due to its global nature.

[623] MetaSTH-Sleep: Towards Effective Few-Shot Sleep Stage Classification for Health Management with Spatial-Temporal Hypergraph Enhanced Meta-Learning

Jingyu Li, Tiehua Zhang, Jinze Wang, Yi Zhang, Yuhuan Li, Yifan Zhao, Zhishu Shen, Libing Wu, Jiannan Liu

Main category: cs.LG

TL;DR: MetaSTH-Sleep is a few-shot learning framework that uses spatial-temporal hypergraph and meta-learning for sleep stage classification, addressing data scarcity and generalization challenges in EEG signal analysis.

DetailsMotivation: Traditional manual sleep stage annotation is time-consuming, and existing deep learning methods struggle with limited labeled data, inter-individual variability, and failure to capture high-order relationships in bio-signals.

Method: Proposes MetaSTH-Sleep framework combining spatial-temporal hypergraph modeling with meta-learning to enable rapid adaptation to new subjects using few labeled samples while capturing signal heterogeneity and temporal dependencies.

Result: Experimental results show substantial performance improvements across diverse subjects, demonstrating effective generalization and adaptation capabilities.

Conclusion: The framework successfully addresses key challenges in sleep stage classification and provides valuable support for clinical sleep annotation with limited data requirements.

Abstract: Accurate classification of sleep stages based on bio-signals is fundamental not only for automatic sleep stage annotation, but also for clinical health management and continuous sleep monitoring. Traditionally, this task relies on experienced clinicians to manually annotate data, a process that is both time-consuming and labor-intensive. In recent years, deep learning methods have shown promise in automating this task. However, three major challenges remain: (1) deep learning models typically require large-scale labeled datasets, making them less effective in real-world settings where annotated data is limited; (2) significant inter-individual variability in bio-signals often results in inconsistent model performance when applied to new subjects, limiting generalization; and (3) existing approaches often overlook the high-order relationships among bio-signals, failing to simultaneously capture signal heterogeneity and spatial-temporal dependencies. To address these issues, we propose MetaSTH-Sleep, a few-shot sleep stage classification framework based on spatial-temporal hypergraph enhanced meta-learning. Our approach enables rapid adaptation to new subjects using only a few labeled samples, while the hypergraph structure effectively models complex spatial interconnections and temporal dynamics simultaneously in EEG signals. Experimental results demonstrate that MetaSTH-Sleep achieves substantial performance improvements across diverse subjects, offering valuable insights to support clinicians in sleep stage annotation.

[624] Steering LLM Reasoning Through Bias-Only Adaptation

Viacheslav Sinii, Alexey Gorbatovski, Artem Cherepanov, Boris Shaposhnikov, Nikita Balagansky, Daniil Gavrilov

Main category: cs.LG

TL;DR: Training minimal steering vectors (0.0016% additional parameters) matches full RL fine-tuning performance on math reasoning tasks, reducing computational costs while providing interpretability.

DetailsMotivation: To determine the minimal parameter budget required for effective chain-of-thought reasoning and reduce the computational overhead of full model fine-tuning.

Method: Train a single d-dimensional steering vector per layer using reinforcement learning while freezing all base model weights, applied to 8B parameter models on mathematical reasoning benchmarks.

Result: Matches accuracy of fully RL-tuned models across various base models and math reasoning benchmarks while adding only ~0.0016% additional parameters.

Conclusion: Minimal trainable parameters are sufficient for high-level reasoning, reducing optimizer memory and inter-GPU communication costs while providing clearer insight into model computations through interpretability analysis.

Abstract: We show that training a single $d$-dimensional steering vector per layer with reinforcement learning, while freezing all base weights, matches the accuracy of fully RL-tuned reasoning models on mathematical-reasoning tasks. On an 8 billion-parameter model this adds only $\approx 0.0016\%$ additional parameters and reproduces performance across a range of base models and mathematical-reasoning benchmarks. These results tighten the upper bound on the parameter budget required for high-level chain-of-thought reasoning, indicating that millions of adapter weights are unnecessary. The minimal trainable footprint reduces optimizer memory and inter-GPU communication, lowering the overall cost of fine-tuning. Moreover, a logit-lens analysis shows that the learned vectors amplify coherent token directions, providing clearer insight into the model’s internal computations.
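
The per-layer trainable footprint is literally one bias vector; a minimal sketch (module and hook usage are ours, and the paper trains these with RL rather than any particular optimizer):

```python
import torch
import torch.nn as nn

class SteeringVector(nn.Module):
    """The entire trainable footprint for one layer in a bias-only setup:
    a single d-dimensional vector added to that layer's hidden states while
    every base weight stays frozen."""

    def __init__(self, d_model: int):
        super().__init__()
        self.v = nn.Parameter(torch.zeros(d_model))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.v        # hidden: (batch, seq, d_model)

# One way to attach it without touching the frozen base model is a forward
# hook (if the layer returns a tuple, rewrap its first element accordingly):
# steer = SteeringVector(d_model)
# handle = layer.register_forward_hook(lambda mod, inp, out: steer(out))
```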

[625] Grower-in-the-Loop Interactive Reinforcement Learning for Greenhouse Climate Control

Maxiu Xiao, Jianglin Lan, Jingxin Yu, Weihong Ma, Qiuju Xie, Congcong Sun

Main category: cs.LG

TL;DR: Interactive RL with imperfect grower inputs can improve greenhouse climate control performance, with policy shaping and control sharing showing profit improvements while reward shaping is sensitive to imperfect inputs.

DetailsMotivation: Climate control is crucial for greenhouse production but RL faces challenges like limited training efficiency and reliance on initial conditions. Interactive RL combining human input with RL offers a solution, but hasn't been applied to greenhouse control and may face imperfect input challenges.

Method: Developed three interactive RL algorithms (reward shaping, policy shaping, control sharing) tailored for greenhouse climate control, analyzed input characteristics and trade-offs, proposed neural network-based approach for robustness, and conducted comprehensive evaluation in simulated environment.

Result: Interactive RL with imperfect grower inputs improved RL agent performance. Policy shaping and control sharing achieved 8.4% and 6.8% profit improvements respectively, while reward shaping decreased profit by 9.4% due to sensitivity to imperfect inputs.

Conclusion: Interactive RL incorporating imperfect inputs has potential to improve greenhouse climate control, but algorithm selection is crucial - action-influencing methods (policy/control sharing) outperform reward manipulation methods when dealing with imperfect inputs.

Abstract: Climate control is crucial for greenhouse production as it directly affects crop growth and resource use. Reinforcement learning (RL) has received increasing attention in this field, but still faces challenges, including limited training efficiency and high reliance on initial learning conditions. Interactive RL, which combines human (grower) input with the RL agent’s learning, offers a potential solution to overcome these challenges. However, interactive RL has not yet been applied to greenhouse climate control and may face challenges related to imperfect inputs. Therefore, this paper aims to explore the possibility and performance of applying interactive RL with imperfect inputs into greenhouse climate control, by: (1) developing three representative interactive RL algorithms tailored for greenhouse climate control (reward shaping, policy shaping and control sharing); (2) analyzing how input characteristics are often contradicting, and how the trade-offs between them make grower’s inputs difficult to perfect; (3) proposing a neural network-based approach to enhance the robustness of interactive RL agents under limited input availability; (4) conducting a comprehensive evaluation of the three interactive RL algorithms with imperfect inputs in a simulated greenhouse environment. The demonstration shows that interactive RL incorporating imperfect grower inputs has the potential to improve the performance of the RL agent. RL algorithms that influence action selection, such as policy shaping and control sharing, perform better when dealing with imperfect inputs, achieving 8.4% and 6.8% improvement in profit, respectively. In contrast, reward shaping, an algorithm that manipulates the reward function, is sensitive to imperfect inputs and leads to a 9.4% decrease in profit. This highlights the importance of selecting an appropriate mechanism when incorporating imperfect inputs.
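
Stripped to essentials, the three families differ only in where the grower's (possibly imperfect) input enters: the reward signal, the action distribution, or the executed action. A schematic sketch, not the paper's implementations:

```python
import random

def reward_shaping(env_reward, grower_feedback, w=0.3):
    # Adds the grower's signal into the reward; the paper finds this
    # mechanism the most sensitive to imperfect inputs (profit -9.4%).
    return env_reward + w * grower_feedback

def policy_shaping(agent_probs, grower_prefs):
    # Biases the action distribution toward the grower's preferences,
    # then renormalizes (assumes nonnegative, overlapping preferences).
    combined = [p * q for p, q in zip(agent_probs, grower_prefs)]
    z = sum(combined)
    return [c / z for c in combined]

def control_sharing(agent_action, grower_action, p_grower=0.2):
    # Occasionally executes the grower's suggested action directly.
    return grower_action if random.random() < p_grower else agent_action

print(policy_shaping([0.5, 0.3, 0.2], [0.2, 0.6, 0.2]))  # shifts mass to action 1
```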

[626] Who Gets Credit or Blame? Attributing Accountability in Modern AI Systems

Shichang Zhang, Hongzhe Du, Jiaqi W. Ma, Himabindu Lakkaraju

Main category: cs.LG

TL;DR: A framework for attributing model behavior to specific development stages (pretraining, fine-tuning, alignment) using counterfactual analysis without retraining.

DetailsMotivation: Modern AI systems are developed through multiple stages, raising accountability questions about which stage is responsible for model successes or failures.

Method: Proposes estimators that quantify stage effects by answering counterfactual questions about how behavior would change if updates from particular stages were removed, accounting for data and optimization dynamics.

Result: Successfully quantifies stage accountability, identifies and removes spurious correlations in image classification and text toxicity detection tasks developed across multiple stages.

Conclusion: Provides a practical tool for model analysis and represents a significant step toward more accountable AI development.

Abstract: Modern AI systems are typically developed through multiple stages-pretraining, fine-tuning rounds, and subsequent adaptation or alignment, where each stage builds on the previous ones and updates the model in distinct ways. This raises a critical question of accountability: when a deployed model succeeds or fails, which stage is responsible, and to what extent? We pose the accountability attribution problem for tracing model behavior back to specific stages of the model development process. To address this challenge, we propose a general framework that answers counterfactual questions about stage effects: how would the model’s behavior have changed if the updates from a particular stage had not occurred? Within this framework, we introduce estimators that efficiently quantify stage effects without retraining the model, accounting for both the data and key aspects of model optimization dynamics, including learning rate schedules, momentum, and weight decay. We demonstrate that our approach successfully quantifies the accountability of each stage to the model’s behavior. Based on the attribution results, our method can identify and remove spurious correlations learned during image classification and text toxicity detection tasks that were developed across multiple stages. Our approach provides a practical tool for model analysis and represents a significant step toward more accountable AI development.

[627] ThinkEval: Practical Evaluation of Knowledge Leakage in LLM Editing using Thought-based Knowledge Graphs

Manit Baser, Dinil Mon Divakaran, Mohan Gurusamy

Main category: cs.LG

TL;DR: ThinkEval framework evaluates model-editing techniques for LLMs, revealing they struggle to prevent indirect knowledge leakage while preserving related knowledge integrity.

DetailsMotivation: Current model-editing techniques focus on isolated facts but fail to prevent indirect knowledge leakage through persistent causal links, which is critical for practical applications like healthcare where outdated or incorrect knowledge needs updating.

Method: Developed ThinkEval framework with specialized knowledge graphs to analyze causal structure of facts before/after editing, using KnowGIC benchmark dataset with multi-step reasoning paths to measure complex knowledge transformation effects.

Result: Evaluation of five editing techniques (AlphaEdit, RECT, ROME, MEMIT, PRUNE) across multiple LLMs shows they struggle to balance indirect fact suppression with preservation of related knowledge, compromising contextual integrity.

Conclusion: Existing model-editing techniques have limitations in maintaining knowledge integrity while preventing leakage, highlighting the need for more robust editing approaches that can handle complex causal relationships in knowledge structures.

Abstract: Robust model-editing techniques are essential for deploying large language models (LLMs) in practical applications, to enable cost-effective ways to deal with challenges such as privacy breaches, bias mitigation and misinformation spread. For example, an LLM-based healthcare assistance may need to update out-dated or incorrect knowledge to prevent harmful recommendations. However, many editing techniques focus on isolated facts, which critically fail to prevent indirect knowledge leakage – the unintended reconstruction of edited-out information through persistent causal links and contextual relationships. To assist users in selecting the right editing technique, we develop and present ThinkEval, a framework to systematically quantify indirect knowledge leakage and ripple effects in model-editing. ThinkEval builds and employs specialized knowledge graphs to analyze the causal structure of facts before and after editing. To support this approach, we present KnowGIC, a benchmark dataset comprising multi-step reasoning paths that precisely measure these complex knowledge transformation effects. We evaluate five editing techniques: AlphaEdit, RECT, ROME, MEMIT, and PRUNE across multiple LLMs. Our results show that these techniques struggle to balance indirect fact suppression with the preservation of related knowledge, compromising the contextual integrity of a model’s knowledge. Our dataset is available at: https://anonymous.4open.science/r/KnowGIC.

[628] Efficient $Q$-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning

Yang Xu, Swetha Ganesh, Vaneet Aggarwal

Main category: cs.LG

TL;DR: Non-asymptotic convergence analysis of Q-learning and actor-critic algorithms for robust average-reward MDPs under various uncertainty sets, achieving ε-optimal policies with Õ(ε⁻²) sample complexity.

DetailsMotivation: To address the challenge of learning robust policies in Markov Decision Processes under model uncertainty and contamination, providing theoretical guarantees and efficient algorithms for robust reinforcement learning.

Method: Proves the optimal robust Q operator is a strict contraction under a carefully designed semi-norm, enabling stochastic approximation updates. Develops efficient robust Q-function estimation and actor-critic algorithms with sample complexity analysis.

Result: Achieves ε-optimal robust policies with Õ(ε⁻²) sample complexity for both Q-learning and actor-critic approaches. Provides numerical simulations demonstrating algorithm performance.

Conclusion: The proposed framework provides the first non-asymptotic convergence guarantees for robust average-reward MDPs under various uncertainty sets, with efficient sample complexity and practical algorithms supported by numerical validation.

Abstract: We present a non-asymptotic convergence analysis of $Q$-learning and actor-critic algorithms for robust average-reward Markov Decision Processes (MDPs) under contamination, total-variation (TV) distance, and Wasserstein uncertainty sets. A key ingredient of our analysis is showing that the optimal robust $Q$ operator is a strict contraction with respect to a carefully designed semi-norm (with constant functions quotiented out). This property enables a stochastic approximation update that learns the optimal robust $Q$-function using $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We also provide an efficient routine for robust $Q$-function estimation, which in turn facilitates robust critic estimation. Building on this, we introduce an actor-critic algorithm that learns an $\epsilon$-optimal robust policy within $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We provide numerical simulations to evaluate the performance of our algorithms.
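
For orientation, the usual semi-norm with constant functions quotiented out in average-reward analysis is the span semi-norm; read that way (our assumption, since the paper constructs its own carefully designed variant), the contraction statement takes the form:

```latex
\[
  \|v\|_{\mathrm{span}} = \max_{s} v(s) - \min_{s} v(s),
  \qquad
  \bigl\|\widehat{\mathcal{T}} v - \widehat{\mathcal{T}} w\bigr\|_{\mathrm{span}}
  \le \gamma\,\|v - w\|_{\mathrm{span}},
  \quad \gamma < 1 .
\]
```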

[629] Test-Time Scaling of Diffusion Models via Noise Trajectory Search

Vignav Ramesh, Morteza Mardani

Main category: cs.LG

TL;DR: A novel method for optimizing noise trajectories in diffusion models using an epsilon-greedy search algorithm that treats denoising as contextual bandits, achieving state-of-the-art performance with up to 164% improvement over baselines.

DetailsMotivation: Current test-time scaling in diffusion models through increased denoising steps yields diminishing returns. Optimizing the noise trajectory offers better sample quality but faces challenges with high-dimensional search space and costly evaluations.

Method: Cast diffusion as Markov Decision Process (MDP), then relax to contextual bandits. Use epsilon-greedy search that globally explores at extreme timesteps and locally exploits during intermediate de-mixing steps.

Result: Achieves state-of-the-art scores for class-conditioned and text-to-image generation, exceeding baselines by up to 164% and matching/exceeding MCTS performance.

Conclusion: First practical method for test-time noise trajectory optimization of arbitrary non-differentiable rewards, balancing performance and efficiency effectively.

Abstract: The iterative and stochastic nature of diffusion models enables test-time scaling, whereby spending additional compute during denoising generates higher-fidelity samples. Increasing the number of denoising steps is the primary scaling axis, but this yields quickly diminishing returns. Instead, optimizing the noise trajectory (the sequence of injected noise vectors) is promising, as the specific noise realizations critically affect sample quality; but this is challenging due to a high-dimensional search space, complex noise-outcome interactions, and costly trajectory evaluations. We address this by first casting diffusion as a Markov Decision Process (MDP) with a terminal reward, showing tree-search methods such as Monte Carlo tree search (MCTS) to be meaningful but impractical. To balance performance and efficiency, we then resort to a relaxation of MDP, where we view denoising as a sequence of independent contextual bandits. This allows us to introduce an $\epsilon$-greedy search algorithm that globally explores at extreme timesteps and locally exploits during the intermediate steps where de-mixing occurs. Experiments on EDM and Stable Diffusion reveal state-of-the-art scores for class-conditioned/text-to-image generation, exceeding baselines by up to $164\%$ and matching/exceeding MCTS performance. To our knowledge, this is the first practical method for test-time noise trajectory optimization of arbitrary (non-differentiable) rewards.
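
A stripped-down version of the bandit-style search might look like the following, with `sample_noise`, `perturb`, and `reward` as placeholders for the model-specific pieces (`reward` would score a denoising rollout completed with the candidate trajectory); this is our sketch of the explore/exploit split, not the paper's exact algorithm.

```python
def greedy_noise_search(sample_noise, perturb, reward, T, n_trials=8):
    """Epsilon-greedy-style search over a noise trajectory: explore globally
    (fresh draws) at extreme timesteps, exploit locally (perturb the
    incumbent) during the intermediate de-mixing steps."""
    trajectory, incumbent = [], None
    for t in range(T):
        extreme = t < T // 4 or t >= 3 * T // 4
        candidates = [
            sample_noise() if extreme or incumbent is None else perturb(incumbent)
            for _ in range(n_trials)
        ]
        incumbent = max(candidates, key=lambda eps: reward(trajectory + [eps]))
        trajectory.append(incumbent)
    return trajectory
```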

[630] Scaling Laws of Motion Forecasting and Planning – Technical Report

Mustafa Baniodeh, Kratarth Goel, Scott Ettinger, Carlos Fuertes, Ari Seff, Tim Shen, Cole Gulino, Chenjie Yang, Ghassen Jerfel, Dokook Choe, Rui Wang, Benjamin Charrow, Vinutha Kallem, Sergio Casas, Rami Al-Rfou, Benjamin Sapp, Dragomir Anguelov

Main category: cs.LG

TL;DR: Transformer models for autonomous driving show power-law scaling similar to language models, with closed-loop metrics improving with scale. Optimal scaling requires model size to grow 1.5x faster than dataset size, and smaller models can be competitive through sampling techniques.

DetailsMotivation: To understand how scaling laws apply to autonomous driving tasks (motion forecasting and planning) and determine optimal scaling strategies for model parameters, training data, and inference compute.

Method: Empirical study of encoder-decoder transformer models trained on 500K hours of driving data, analyzing scaling relationships between compute budget, model size, dataset size, and performance metrics.

Result: Performance improves as power-law function of compute; closed-loop metrics scale with model size; optimal scaling requires model parameters to increase 1.5x faster than dataset size; smaller models can compete with larger ones through sampling techniques.

Conclusion: Optimizing scaling properties is crucial for improving autonomous driving models, and training on general logged driving data can help address data scarcity for large model training.

Abstract: We study the empirical scaling laws of a family of encoder-decoder autoregressive transformer models on the task of joint motion forecasting and planning in the autonomous driving domain. Using a dataset of 500 thousand hours of driving, we demonstrate that, similar to language modeling, model performance improves as a power-law function of the total compute budget, and we observe a strong correlation between model training loss and model evaluation metrics. Most interestingly, closed-loop metrics also improve with scaling, which has important implications for the suitability of open-loop metrics for model development and hill climbing. We also study the optimal scaling of the number of transformer parameters and the training data size for a training compute-optimal model. We find that as the training compute budget grows, optimal scaling requires increasing the model size 1.5x as fast as the dataset size. We also study inference-time compute scaling, where we observe that sampling and clustering the output of smaller models makes them competitive with larger models, up to a crossover point beyond which a larger model becomes more inference-compute efficient. Overall, our experimental results demonstrate that optimizing the training and inference-time scaling properties of motion forecasting and planning models is a key lever for improving their performance to address a wide variety of driving scenarios. Finally, we briefly study the utility of training on general logged driving data of other agents to improve the performance of the ego-agent, an important research area to address the scarcity of robotics data for training large-capacity models.
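
The reported power-law relation is a straight line in log-log space, so fitting it is a two-line regression; the compute values and losses below are synthetic stand-ins, not the paper's measurements.

```python
import numpy as np

# Power law L(C) = a * C**(-b) becomes log L = log a - b * log C.
rng = np.random.default_rng(0)
C = np.array([1e18, 1e19, 1e20, 1e21])                       # compute budgets (FLOPs)
L = 3.0 * C ** -0.05 * np.exp(rng.normal(0, 0.002, size=C.shape))  # noisy synthetic losses
slope, intercept = np.polyfit(np.log(C), np.log(L), 1)
print(f"exponent b = {-slope:.3f}, scale a = {np.exp(intercept):.3f}")
```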

[631] Pruning Spurious Subgraphs for Graph Out-of-Distribution Generalization

Tianjun Yao, Haoxuan Li, Yongqiang Chen, Tongliang Liu, Le Song, Eric Xing, Zhiqiang Shen

Main category: cs.LG

TL;DR: PrunE is a pruning-based graph OOD method that removes spurious edges to improve out-of-distribution generalization in Graph Neural Networks by retaining invariant subgraphs more comprehensively.

DetailsMotivation: GNNs suffer performance degradation under distribution shifts. Existing methods struggle to directly identify invariant subgraphs when spurious edges have strong correlations with targets.

Method: Uses two regularization terms: 1) graph size constraint to exclude uninformative spurious edges, and 2) epsilon-probability alignment to further suppress the occurrence of spurious edges.

Result: Achieves superior OOD performance and significantly outperforms previous state-of-the-art methods through theoretical analysis and extensive experiments.

Conclusion: PrunE effectively improves graph OOD generalization by pruning spurious edges rather than directly identifying invariant subgraphs, demonstrating better performance than existing approaches.

Abstract: Graph Neural Networks (GNNs) often encounter significant performance degradation under distribution shifts between training and test data, hindering their applicability in real-world scenarios. Recent studies have proposed various methods to address the out-of-distribution generalization challenge, with many methods in the graph domain focusing on directly identifying an invariant subgraph that is predictive of the target label. However, we argue that identifying the edges from the invariant subgraph directly is challenging and error-prone, especially when some spurious edges exhibit strong correlations with the targets. In this paper, we propose PrunE, the first pruning-based graph OOD method that eliminates spurious edges to improve OOD generalizability. By pruning spurious edges, PrunE retains the invariant subgraph more comprehensively, which is critical for OOD generalization. Specifically, PrunE employs two regularization terms to prune spurious edges: (1) a graph size constraint to exclude uninformative spurious edges, and (2) $\epsilon$-probability alignment to further suppress the occurrence of spurious edges. Through theoretical analysis and extensive experiments, we show that PrunE achieves superior OOD performance and outperforms previous state-of-the-art methods significantly. Codes are available at: https://github.com/tianyao-aka/PrunE-GraphOOD.

[632] A Gravity-informed Spatiotemporal Transformer for Human Activity Intensity Prediction

Yi Wang, Zhenghong Wang, Fan Zhang, Chaogui Kang, Sijie Ruan, Di Zhu, Chengling Tang, Zhongfu Ma, Weiyu Zhang, Yu Zheng, Philip S. Yu, Yu Liu

Main category: cs.LG

TL;DR: Gravityformer integrates physics principles with deep learning to improve human activity intensity prediction by modeling spatial interactions using gravitational laws, addressing over-smoothing in transformer attention.

DetailsMotivation: Existing methods for human activity intensity prediction overlook physical constraints of spatial interaction, leading to uninterpretable spatial correlations and over-smoothing problems.

Method: Proposes Gravityformer framework that: 1) estimates spatially explicit mass parameters from spatiotemporal embeddings, 2) models spatial interaction using adaptive gravity model to learn physical constraints, and 3) uses learned interactions to guide transformer attention and mitigate over-smoothing. Includes parallel spatiotemporal graph convolution transformer.

Result: Demonstrates superior performance on six real-world large-scale activity datasets compared to state-of-the-art benchmarks. Learned gravity attention matrix is interpretable based on geographical laws and improves generalization in zero-shot cross-region inference.

Conclusion: Provides a novel approach for integrating physical laws with deep learning for spatiotemporal prediction, offering both performance improvements and interpretability.

Abstract: Human activity intensity prediction is crucial to many location-based services. Despite tremendous progress in modeling dynamics of human activity, most existing methods overlook physical constraints of spatial interaction, leading to uninterpretable spatial correlations and over-smoothing phenomenon. To address these limitations, this work proposes a physics-informed deep learning framework, namely Gravity-informed Spatiotemporal Transformer (Gravityformer) by integrating the universal law of gravitation to refine transformer attention. Specifically, it (1) estimates two spatially explicit mass parameters based on spatiotemporal embedding feature, (2) models the spatial interaction in end-to-end neural network using proposed adaptive gravity model to learn the physical constraint, and (3) utilizes the learned spatial interaction to guide and mitigate the over-smoothing phenomenon in transformer attention. Moreover, a parallel spatiotemporal graph convolution transformer is proposed for achieving a balance between coupled spatial and temporal learning. Systematic experiments on six real-world large-scale activity datasets demonstrate the quantitative and qualitative superiority of our model over state-of-the-art benchmarks. Additionally, the learned gravity attention matrix can be not only disentangled and interpreted based on geographical laws, but also improved the generalization in zero-shot cross-region inference. This work provides a novel insight into integrating physical laws with deep learning for spatiotemporal prediction.
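
The gravity prior itself is compact: interaction between locations i and j scales as m_i * m_j / d_ij^2. The sketch below builds an additive attention-logit bias from that form (mass parameters and coordinates are stand-ins; the paper learns the masses from spatiotemporal embeddings and uses the interaction to guide and de-smooth attention).

```python
import torch

def gravity_bias(mass: torch.Tensor, coords: torch.Tensor, eps: float = 1e-6):
    """Gravity-style additive bias for attention logits:
    bias[i, j] = log(m_i * m_j) - log(d_ij**2)."""
    d2 = torch.cdist(coords, coords).pow(2) + eps
    bias = torch.log(mass.unsqueeze(1) * mass.unsqueeze(0) + eps) - torch.log(d2)
    bias.fill_diagonal_(0.0)                    # no gravity term for self-attention
    return bias

mass = torch.rand(5) + 0.1                      # stand-in "masses", one per location
coords = torch.rand(5, 2)                       # 2-D coordinates
attn = torch.softmax(torch.randn(5, 5) + gravity_bias(mass, coords), dim=-1)
```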

[633] Precise Bayesian Neural Networks

Carlos Stein Brito

Main category: cs.LG

TL;DR: Bayesian neural networks with von Mises-Fisher posterior on weight directions instead of standard Gaussian, using normalization layers to model uncertainty only in directions, resulting in dimension-aware KL and improved calibration.

DetailsMotivation: Standard Gaussian posteriors misalign with network geometry, KL terms are brittle in high dimensions, and traditional BNN implementations add complexity without reliably improving uncertainty.

Method: Model uncertainty only in weight directions using von Mises-Fisher posterior on unit sphere, leveraging normalization layers to neutralize weight magnitude influence. Derives closed-form approximations linking concentration parameter to activation variance and effective noise.

Result: Produces lightweight variational unit that fits modern normalized architectures, improves calibration without sacrificing accuracy, and enables stable optimization in high dimensions.

Conclusion: By aligning variational posterior with network’s intrinsic geometry, BNNs can be simultaneously principled, practical, and precise.

Abstract: Despite their long history, Bayesian neural networks (BNNs) and variational training remain underused in practice: standard Gaussian posteriors misalign with network geometry, KL terms can be brittle in high dimensions, and implementations often add complexity without reliably improving uncertainty. We revisit the problem through the lens of normalization. Because normalization layers neutralize the influence of weight magnitude, we model uncertainty only in weight directions using a von Mises-Fisher posterior on the unit sphere. High-dimensional geometry then yields a single, interpretable scalar per layer, the effective post-normalization noise $\sigma_{\mathrm{eff}}$, which (i) corresponds to simple additive Gaussian noise in the forward pass and (ii) admits a compact, dimension-aware KL in closed form. We derive accurate, closed-form approximations linking concentration $\kappa$ to activation variance and to $\sigma_{\mathrm{eff}}$ across regimes, producing a lightweight, implementation-ready variational unit that fits modern normalized architectures and improves calibration without sacrificing accuracy. This dimension awareness is critical for stable optimization in high dimensions. In short, by aligning the variational posterior with the network’s intrinsic geometry, BNNs can be simultaneously principled, practical, and precise.
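
Read operationally, property (i) amounts to the forward pass below: only weight directions matter under normalization, and the vMF posterior's effect is approximated by additive Gaussian noise of scale sigma_eff after the normalized linear map. A sketch of our reading, not the paper's code.

```python
import torch
import torch.nn.functional as F

def directional_noisy_linear(x: torch.Tensor, W: torch.Tensor, sigma_eff: float):
    # Weight magnitudes are irrelevant once normalization follows, so only
    # unit directions are kept; the high-dimensional vMF posterior then acts
    # like additive Gaussian noise with one scalar sigma_eff per layer.
    W_dir = F.normalize(W, dim=1)                  # unit-norm weight rows
    h = x @ W_dir.t()
    return h + sigma_eff * torch.randn_like(h)

y = directional_noisy_linear(torch.randn(4, 8), torch.randn(16, 8), sigma_eff=0.1)
```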

[634] Quantum Machine Learning in Transportation: A Case Study of Pedestrian Stress Modelling

Bara Rababah, Bilal Farooq

Main category: cs.LG

TL;DR: Quantum machine learning applied to pedestrian stress classification using skin conductance response data, with QNN outperforming QSVM and classical models.

DetailsMotivation: To leverage quantum computing for complex machine learning tasks in intelligent transportation systems, specifically to model pedestrian stress through skin conductance responses in virtual reality road crossing scenarios.

Method: Developed Quantum Support Vector Machine (QSVM) with eight-qubit ZZ feature map and Quantum Neural Network (QNN) using Tree Tensor Network ansatz with eight-qubit ZZ feature map on Pennylane platform. Used SCR measurements with amplitude and time features categorized into amplitude-based classes.

Result: QSVM achieved good training accuracy but suffered from overfitting with only 45% test accuracy. QNN model performed better with 55% test accuracy, outperforming both QSVM and classical versions.

Conclusion: Quantum Neural Network with Tree Tensor Network ansatz shows promise for quantum machine learning applications in transportation systems, demonstrating better generalization than Quantum Support Vector Machine for pedestrian stress classification tasks.

Abstract: Quantum computing has opened new opportunities to tackle complex machine learning tasks, for instance, high-dimensional data representations commonly required in intelligent transportation systems. We explore quantum machine learning to model complex skin conductance response (SCR) events that reflect pedestrian stress in a virtual reality road crossing experiment. For this purpose, a Quantum Support Vector Machine (QSVM) with an eight-qubit ZZ feature map and a Quantum Neural Network (QNN) using a Tree Tensor Network ansatz and an eight-qubit ZZ feature map were developed on Pennylane. The dataset consists of SCR measurements along with features such as the response amplitude and elapsed time, which have been categorized into amplitude-based classes. The QSVM achieved good training accuracy but overfit, reaching a test accuracy of only 45%, which limits the reliability of the classification model. The QNN model reached a higher test accuracy of 55%, making it a better classifier than both the QSVM and the classical versions.

[635] GenAI-Powered Inference

Kosuke Imai, Kentaro Nakamura

Main category: cs.LG

TL;DR: GPI is a statistical framework using GenAI models for causal and predictive inference with unstructured data like text and images, without requiring model fine-tuning.

DetailsMotivation: To enable efficient causal and predictive inference from unstructured data using generative AI models while quantifying estimation uncertainty.

Method: Leverages open-source GenAI models to generate unstructured data and extract low-dimensional representations that capture underlying structure, then applies machine learning for estimation.

Result: Successfully applied to three real-world applications: Chinese social media censorship analysis, facial appearance effects on elections, and political rhetoric persuasiveness assessment.

Conclusion: GPI provides a versatile, computationally efficient framework for statistical inference with unstructured data using generative AI, with available open-source software.

Abstract: We introduce GenAI-Powered Inference (GPI), a statistical framework for both causal and predictive inference using unstructured data, including text and images. GPI leverages open-source Generative Artificial Intelligence (GenAI) models – such as large language models and diffusion models – not only to generate unstructured data at scale but also to extract low-dimensional representations that are guaranteed to capture their underlying structure. Applying machine learning to these representations, GPI enables estimation of causal and predictive effects while quantifying associated estimation uncertainty. Unlike existing approaches to representation learning, GPI does not require fine-tuning of generative models, making it computationally efficient and broadly accessible. We illustrate the versatility of the GPI framework through three applications: (1) analyzing Chinese social media censorship, (2) estimating predictive effects of candidates’ facial appearance on electoral outcomes, and (3) assessing the persuasiveness of political rhetoric. An open-source software package is available for implementing GPI.
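
The overall pipeline (low-dimensional GenAI representations, then machine learning with quantified uncertainty) can be sketched generically. The snippet below is a schematic stand-in, not the GPI estimator: the matrix `Z` is a placeholder for representations extracted from a GenAI model, and the split-sample Ridge contrast with a standard error is an assumed, simplified effect estimator.

```python
import numpy as np
from sklearn.linear_model import Ridge

def effect_with_uncertainty(Z, treat, y, seed=0):
    # Fit an outcome model on one half, contrast treated vs. untreated
    # predictions on the other half, and report a mean and standard error.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y)); half = len(y) // 2
    fit, est = idx[:half], idx[half:]
    model = Ridge().fit(np.c_[Z[fit], treat[fit]], y[fit])
    y1 = model.predict(np.c_[Z[est], np.ones(len(est))])
    y0 = model.predict(np.c_[Z[est], np.zeros(len(est))])
    diffs = y1 - y0
    return diffs.mean(), diffs.std(ddof=1) / np.sqrt(len(diffs))

Z = np.random.default_rng(1).normal(size=(500, 32))   # stand-in representations
treat = np.random.default_rng(2).integers(0, 2, 500)
y = Z[:, 0] + 0.5 * treat + np.random.default_rng(3).normal(size=500)
print(effect_with_uncertainty(Z, treat, y))
```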

[636] PLAME: Lightweight MSA Design Advances Protein Folding From Evolutionary Embeddings

Hanqun Cao, Xinyi Zhou, Zijun Gao, Chenyu Wang, Xin Gao, Zhi Zhang, Chunbin Gu, Ge Liu, Pheng-Ann Heng

Main category: cs.LG

TL;DR: PLAME is a lightweight MSA generation framework that uses protein language model embeddings to create better MSAs for low-homology and orphan proteins, improving AlphaFold structure prediction accuracy.

DetailsMotivation: Traditional multiple sequence alignments (MSAs) perform poorly on low-homology and orphan proteins that lack strong evolutionary neighbors, limiting protein structure prediction accuracy for these challenging targets.

Method: PLAME leverages evolutionary embeddings from pretrained protein language models coupled with a conservation-diversity loss to generate MSAs. It includes MSA selection strategy and sequence-quality metric for filtering high-quality candidates.

Result: On AlphaFold2 low-homology/orphan benchmarks, PLAME achieves state-of-the-art improvements in structure accuracy (lDDT/TM-score) and shows consistent gains with AlphaFold3. It also enables ESMFold to approach AlphaFold2-level accuracy while maintaining fast inference speed.

Conclusion: PLAME provides a practical solution for high-quality protein folding of proteins lacking strong evolutionary neighbors, serving as a lightweight adapter that enhances existing folding tools without sacrificing computational efficiency.

Abstract: Protein structure prediction often hinges on multiple sequence alignments (MSAs), which underperform on low-homology and orphan proteins. We introduce PLAME, a lightweight MSA design framework that leverages evolutionary embeddings from pretrained protein language models to generate MSAs that better support downstream folding. PLAME couples these embeddings with a conservation-diversity loss that balances agreement on conserved positions with coverage of plausible sequence variation. Beyond generation, we develop (i) an MSA selection strategy to filter high-quality candidates and (ii) a sequence-quality metric that is complementary to depth-based measures and predictive of folding gains. On AlphaFold2 low-homology/orphan benchmarks, PLAME delivers state-of-the-art improvements in structure accuracy (e.g., lDDT/TM-score), with consistent gains when paired with AlphaFold3. Ablations isolate the benefits of the selection strategy, and case studies elucidate how MSA characteristics shape AlphaFold confidence and error modes. Finally, we show PLAME functions as a lightweight adapter, enabling ESMFold to approach AlphaFold2-level accuracy while retaining ESMFold-like inference speed. PLAME thus provides a practical path to high-quality folding for proteins lacking strong evolutionary neighbors.
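
The conservation-diversity loss is only named in the abstract, so the sketch below is one plausible instantiation under stated assumptions: cross-entropy against the query sequence weighted by a per-column conservation score, minus an entropy bonus at variable columns. The tensors, the weighting scheme, and `alpha` are all illustrative.

```python
import torch
import torch.nn.functional as F

def conservation_diversity_loss(logits, query, cons_weight, alpha=0.1):
    """One plausible (not the paper's exact) conservation-diversity objective.
      logits:      (L, V) per-position logits of a generated MSA row
      query:       (L,)   query-sequence token ids
      cons_weight: (L,)   per-position conservation score in [0, 1]
    """
    # conservation: agree with the query more strongly at conserved columns
    ce = F.cross_entropy(logits, query, reduction="none")      # (L,)
    conservation = (cons_weight * ce).mean()
    # diversity: keep per-position entropy high at variable columns
    probs = logits.softmax(-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)   # (L,)
    diversity = ((1.0 - cons_weight) * entropy).mean()
    return conservation - alpha * diversity

L, V = 128, 21
loss = conservation_diversity_loss(torch.randn(L, V),
                                   torch.randint(0, V, (L,)),
                                   torch.rand(L))
```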

[637] Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)

Chongli Qin, Jost Tobias Springenberg

Main category: cs.LG

TL;DR: iw-SFT improves standard supervised fine-tuning by using importance weighting to optimize a tighter bound on the RL objective, achieving competitive performance with advanced RL methods.

DetailsMotivation: To bridge the gap between behavior cloning/SFT and reinforcement learning by showing SFT maximizes a lower bound on RL objectives, and to develop a more effective variant that behaves closer to RL training.

Method: Proposes importance weighted supervised fine-tuning (iw-SFT) - a small modification to standard SFT that uses importance weighting to optimize a tighter bound on the RL objective, working with both curated and quality-scored data.

Result: iw-SFT achieves competitive performance with advanced RL algorithms, achieving 66.7% on the AIME 2024 dataset and performing well in both language modeling and continuous control tasks.

Conclusion: A simple importance weighting modification to SFT can bridge the gap between supervised fine-tuning and reinforcement learning, providing an effective and easy-to-implement alternative to complex RL methods.

Abstract: Behavior Cloning (BC) on curated (or filtered) data is the predominant paradigm for supervised fine-tuning (SFT) of large language models, as well as for imitation learning of control policies. Here, we draw on a connection between this successful strategy and the theory and practice of finding optimal policies via Reinforcement Learning (RL). Building on existing literature, we clarify that SFT can be understood as maximizing a lower bound on the RL objective in a sparse reward setting, which gives support to its often-observed good performance. From this viewpoint, we realize that a small modification to SFT leads to an importance weighted variant that behaves closer to training with RL as it: i) optimizes a tighter bound to the RL objective and, ii) can improve performance compared to SFT on curated data. We refer to this variant as importance weighted supervised fine-tuning (iw-SFT). We show that it is easy to implement and can be further generalized to training with quality scored data. The resulting SFT variants are competitive with more advanced RL algorithms for large language models and for training policies in continuous control tasks, for example achieving 66.7% on the AIME 2024 dataset.
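
A minimal sketch of the importance-weighting idea, assuming each curated sequence carries a scalar log-weight (e.g., derived from a quality score); the paper's exact weight construction may differ. Uniform weights recover standard SFT.

```python
import torch

def iw_sft_loss(logp_tokens, mask, logw_seq):
    """Importance-weighted SFT loss (sketch, weight construction assumed).
      logp_tokens: (B, T) per-token log-probs under the current policy
      mask:        (B, T) 1 for response tokens, 0 for prompt/padding
      logw_seq:    (B,)   log importance weight per sequence
    """
    seq_logp = (logp_tokens * mask).sum(-1)        # (B,) sequence log-prob
    w = torch.softmax(logw_seq, dim=0).detach()    # normalized weights
    return -(w * seq_logp).sum()                   # uniform w recovers SFT

B, T = 4, 16
loss = iw_sft_loss(torch.randn(B, T).log_softmax(-1).exp().log(),
                   torch.ones(B, T), torch.randn(B))
```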

[638] Your Attention Matters: to Improve Model Robustness to Noise and Spurious Correlations

Camilo Tamayo-Rousseau, Yunjia Zhao, Yiqun Zhang, Randall Balestriero

Main category: cs.LG

TL;DR: Doubly Stochastic attention is the most robust self-attention mechanism in Vision Transformers, outperforming other variants by 0.1%-5.1% under various data corruption scenarios.

DetailsMotivation: Self-attention mechanisms are core to Transformers' success, but their robustness to noise and spurious correlations has not been well studied despite many variants existing.

Method: Evaluated Softmax, Sigmoid, Linear, Doubly Stochastic, and Cosine attention in Vision Transformers across CIFAR-10, CIFAR-100, and Imagenette datasets under different data corruption scenarios.

Result: Doubly Stochastic attention consistently outperformed all other mechanisms, showing the best robustness when training data or both training and testing data were corrupted.

Conclusion: Doubly Stochastic attention should be preferred for contexts with imperfect data, providing valuable guidance for self-attention selection in practical applications.

Abstract: Self-attention mechanisms are foundational to Transformer architectures, supporting their impressive success in a wide range of tasks. While there are many self-attention variants, their robustness to noise and spurious correlations has not been well studied. This study evaluates Softmax, Sigmoid, Linear, Doubly Stochastic, and Cosine attention within Vision Transformers under different data corruption scenarios. Through testing across the CIFAR-10, CIFAR-100, and Imagenette datasets, we show that Doubly Stochastic attention is the most robust. It consistently outperformed the next best mechanism by $0.1\%$–$5.1\%$ when training data, or both training and testing data, were corrupted. Our findings inform self-attention selection in contexts with imperfect data. The code used is available at https://github.com/ctamayor/NeurIPS-Robustness-ViT.
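
Doubly stochastic attention is typically built with Sinkhorn normalization, alternately rescaling rows and columns of the attention matrix. The sketch below assumes that standard construction (the abstract does not spell out the exact variant); the iteration count and softmax initialization are illustrative.

```python
import torch

def doubly_stochastic_attention(q, k, v, n_iters=3, eps=1e-6):
    # Start from scaled dot-product scores, then Sinkhorn-normalize so the
    # attention matrix approaches a doubly stochastic one.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    attn = scores.softmax(-1)                             # row-stochastic start
    for _ in range(n_iters):
        attn = attn / (attn.sum(-2, keepdim=True) + eps)  # normalize columns
        attn = attn / (attn.sum(-1, keepdim=True) + eps)  # normalize rows
    return attn @ v

q = k = v = torch.randn(2, 16, 64)     # (batch, tokens, dim)
out = doubly_stochastic_attention(q, k, v)
```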

[639] Nested Graph Pseudo-Label Refinement for Noisy Label Domain Adaptation Learning

Yingxu Wang, Mengzhu Wang, Zhichao Huang, Suyu Liu, Nan Yin

Main category: cs.LG

TL;DR: NeGPR is a novel graph domain adaptation framework that handles noisy source labels through dual-branch pretraining, nested pseudo-label refinement, and noise-aware regularization, achieving significant performance improvements over state-of-the-art methods.

DetailsMotivation: Most existing Graph Domain Adaptation methods assume clean source labels, but real-world scenarios often have pervasive annotation noise that severely impairs feature alignment and adaptation performance under domain shifts.

Method: NeGPR pretrains dual branches (semantic and topology) with neighborhood consistency to reduce noisy supervision influence. It uses nested refinement where one branch selects high-confidence target samples to guide the other branch’s adaptation, and incorporates noise-aware regularization to mitigate pseudo-label noise effects even with source overfitting.

Result: Extensive experiments show NeGPR consistently outperforms state-of-the-art methods under severe label noise, achieving accuracy gains of up to 12.7% on benchmark datasets.

Conclusion: NeGPR effectively addresses the challenge of noisy labels in graph domain adaptation through its dual-branch architecture, nested refinement mechanism, and theoretically-grounded noise-aware regularization, demonstrating robust performance across various noisy label scenarios.

Abstract: Graph Domain Adaptation (GDA) facilitates knowledge transfer from labeled source graphs to unlabeled target graphs by learning domain-invariant representations, which is essential in applications such as molecular property prediction and social network analysis. However, most existing GDA methods rely on the assumption of clean source labels, which rarely holds in real-world scenarios where annotation noise is pervasive. This label noise severely impairs feature alignment and degrades adaptation performance under domain shifts. To address this challenge, we propose Nested Graph Pseudo-Label Refinement (NeGPR), a novel framework tailored for graph-level domain adaptation with noisy labels. NeGPR first pretrains dual branches, i.e., semantic and topology branches, by enforcing neighborhood consistency in the feature space, thereby reducing the influence of noisy supervision. To bridge domain gaps, NeGPR employs a nested refinement mechanism in which one branch selects high-confidence target samples to guide the adaptation of the other, enabling progressive cross-domain learning. Furthermore, since pseudo-labels may still contain noise and the pre-trained branches are already overfitted to the noisy labels in the source domain, NeGPR incorporates a noise-aware regularization strategy. This regularization is theoretically proven to mitigate the adverse effects of pseudo-label noise, even under the presence of source overfitting, thus enhancing the robustness of the adaptation process. Extensive experiments on benchmark datasets demonstrate that NeGPR consistently outperforms state-of-the-art methods under severe label noise, achieving gains of up to 12.7% in accuracy.
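
The nested refinement mechanism can be sketched schematically: each branch keeps only its high-confidence target samples and uses them to supervise the other branch. The branch models and the confidence threshold below are placeholders, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def nested_refinement_step(branch_a, branch_b, target_batch, tau=0.9):
    # Each branch acts as teacher for the other: filter target samples by
    # confidence, then train the student on the teacher's pseudo-labels.
    losses = []
    for teacher, student in [(branch_a, branch_b), (branch_b, branch_a)]:
        with torch.no_grad():
            probs = teacher(target_batch).softmax(-1)    # (B, C)
            conf, pseudo = probs.max(-1)
            keep = conf > tau                            # confidence filter
        if keep.any():
            losses.append(F.cross_entropy(student(target_batch)[keep],
                                          pseudo[keep]))
    return sum(losses) if losses else torch.tensor(0.0)

a, b = torch.nn.Linear(32, 4), torch.nn.Linear(32, 4)   # stand-in branches
loss = nested_refinement_step(a, b, torch.randn(8, 32))
```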

[640] Uncertainty-Driven Reliability: Selective Prediction and Trustworthy Deployment in Modern Machine Learning

Stephan Rabanser

Main category: cs.LG

TL;DR: This thesis develops trajectory-based uncertainty estimation for selective prediction, analyzes privacy-uncertainty trade-offs, decomposes selective classification errors, and defends against adversarial manipulation of uncertainty signals.

DetailsMotivation: To enhance safety and trustworthiness of ML systems in high-stakes domains by improving uncertainty estimation for selective prediction, where models abstain when confidence is low.

Method: Proposes ensembling predictions from intermediate training checkpoints for lightweight post-hoc abstention; develops finite-sample decomposition of selective classification gap; designs defenses combining calibration audits with verifiable inference.

Result: Achieves state-of-the-art selective prediction performance without architecture changes; maintains robustness under differential privacy; identifies five interpretable error sources; provides defenses against adversarial manipulation of uncertainty.

Conclusion: The contributions advance reliable ML by improving, evaluating, and safeguarding uncertainty estimation, enabling models that know when to abstain from predictions.

Abstract: Machine learning (ML) systems are increasingly deployed in high-stakes domains where reliability is paramount. This thesis investigates how uncertainty estimation can enhance the safety and trustworthiness of ML, focusing on selective prediction – where models abstain when confidence is low. We first show that a model’s training trajectory contains rich uncertainty signals that can be exploited without altering its architecture or loss. By ensembling predictions from intermediate checkpoints, we propose a lightweight, post-hoc abstention method that works across tasks, avoids the cost of deep ensembles, and achieves state-of-the-art selective prediction performance. Crucially, this approach is fully compatible with differential privacy (DP), allowing us to study how privacy noise affects uncertainty quality. We find that while many methods degrade under DP, our trajectory-based approach remains robust, and we introduce a framework for isolating the privacy-uncertainty trade-off. Next, we develop a finite-sample decomposition of the selective classification gap – the deviation from the oracle accuracy-coverage curve – identifying five interpretable error sources and clarifying which interventions can close the gap. This explains why calibration alone cannot fix ranking errors, motivating methods that improve uncertainty ordering. Finally, we show that uncertainty signals can be adversarially manipulated to hide errors or deny service while maintaining high accuracy, and we design defenses combining calibration audits with verifiable inference. Together, these contributions advance reliable ML by improving, evaluating, and safeguarding uncertainty estimation, enabling models that not only make accurate predictions, but also know when to say “I do not know”.
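
The checkpoint-ensembling idea lends itself to a short sketch: average softmax outputs across intermediate training checkpoints and abstain when the ensembled confidence is low. The threshold and number of checkpoints are deployment choices here, not values from the thesis.

```python
import torch

@torch.no_grad()
def checkpoint_ensemble_predict(checkpoints, x, threshold=0.7):
    # Average class probabilities over saved training checkpoints; abstain
    # (marked -1) when the ensembled confidence falls below the threshold.
    probs = torch.stack([m(x).softmax(-1) for m in checkpoints]).mean(0)
    conf, pred = probs.max(-1)
    pred[conf < threshold] = -1
    return pred, conf

# toy usage with three "checkpoints" of a linear classifier
ckpts = [torch.nn.Linear(16, 3) for _ in range(3)]
pred, conf = checkpoint_ensemble_predict(ckpts, torch.randn(5, 16))
```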

[641] BadPromptFL: A Novel Backdoor Threat to Prompt-based Federated Learning in Multimodal Models

Maozhen Zhang, Mengnan Zhao, Wei Wang, Bo Wang

Main category: cs.LG

TL;DR: BadPromptFL is the first backdoor attack targeting prompt-based federated learning in multimodal models, achieving >90% attack success by injecting poisoned prompts through compromised clients without modifying model parameters.

DetailsMotivation: Prompt-based tuning has become popular for efficient adaptation of large vision-language models in federated learning, but the security implications of prompt aggregation in multimodal settings remain unexplored, creating a critical attack surface.

Method: Compromised clients jointly optimize local backdoor triggers and prompt embeddings, injecting poisoned prompts into the global aggregation process which are then propagated to benign clients, leveraging CLIP-style architectures’ contextual learning behavior.

Result: Achieves high attack success rates (>90%) with minimal visibility and limited client participation across multiple datasets and aggregation protocols, demonstrating effectiveness, stealth, and generalizability.

Conclusion: The attack raises critical concerns about the robustness of prompt-based federated learning in real-world deployments, highlighting significant security vulnerabilities in current approaches.

Abstract: Prompt-based tuning has emerged as a lightweight alternative to full fine-tuning in large vision-language models, enabling efficient adaptation via learned contextual prompts. This paradigm has recently been extended to federated learning settings (e.g., PromptFL), where clients collaboratively train prompts under data privacy constraints. However, the security implications of prompt-based aggregation in federated multimodal learning remain largely unexplored, leaving a critical attack surface unaddressed. In this paper, we introduce BadPromptFL, the first backdoor attack targeting prompt-based federated learning in multimodal contrastive models. In BadPromptFL, compromised clients jointly optimize local backdoor triggers and prompt embeddings, injecting poisoned prompts into the global aggregation process. These prompts are then propagated to benign clients, enabling universal backdoor activation at inference without modifying model parameters. Leveraging the contextual learning behavior of CLIP-style architectures, BadPromptFL achieves high attack success rates (e.g., >90%) with minimal visibility and limited client participation. Extensive experiments across multiple datasets and aggregation protocols validate the effectiveness, stealth, and generalizability of our attack, raising critical concerns about the robustness of prompt-based federated learning in real-world deployments.
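
The aggregation-level view of the threat is easy to sketch; the trigger/prompt joint optimization itself is omitted, and all shapes and the FedAvg-style rule are illustrative assumptions.

```python
import torch

def aggregate_prompts(client_prompts):
    # FedAvg-style aggregation of learned prompt embeddings (illustrative).
    return torch.stack(client_prompts).mean(0)

# Schematic threat model: a compromised client submits a prompt that was
# optimized jointly with a backdoor trigger; aggregation blends it into the
# global prompt that every benign client then receives.
benign = [torch.randn(16, 512) for _ in range(9)]      # (prompt_len, dim)
poisoned = torch.randn(16, 512)                        # stand-in poisoned prompt
global_prompt = aggregate_prompts(benign + [poisoned])
```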

[642] Confounding is a Pervasive Problem in Real World Recommender Systems

Alexander Merkov, David Rohde, Alexandre Gilotte, Benjamin Heymann

Main category: cs.LG

TL;DR: Common practices in recommender systems like feature engineering, A/B testing, and modularization can introduce unobserved confounding, leading to biased causal effect estimates and performance issues.

DetailsMotivation: Unobserved confounding is a well-known problem in observational studies that biases causal effect estimates. While recommender systems using fully observed data appear immune, standard practices often ignore observed features, creating similar confounding issues that undermine system performance.

Method: The paper provides several illustrations of how confounding occurs in recommender systems through common practices, supported by simulation studies. It offers practical suggestions for practitioners to reduce or avoid confounding effects.

Result: The research demonstrates that standard recommender system practices can introduce confounding, which hampers system performance. Simulation studies confirm these phenomena and provide evidence of the negative impacts.

Conclusion: Practitioners need to be aware of how common practices in recommender systems can introduce confounding biases. The paper provides practical guidance on how to mitigate these effects in real-world systems to improve performance and reliability.

Abstract: Unobserved confounding arises when an unmeasured feature influences both the treatment and the outcome, leading to biased causal effect estimates. This issue undermines observational studies in fields like economics, medicine, ecology or epidemiology. Recommender systems leveraging fully observed data seem not to be vulnerable to this problem. However, many standard practices in recommender systems result in observed features being ignored, resulting in effectively the same problem. This paper shows that numerous common practices, such as feature engineering, A/B testing and modularization, can in fact introduce confounding into recommendation systems and hamper their performance. Several illustrations of the phenomena are provided, supported by simulation studies, along with practical suggestions about how practitioners may reduce or avoid the effects of confounding in real systems.
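
The core mechanism is simple to reproduce in a toy simulation (numbers and model below are illustrative, not from the paper): a feature that drives both the recommendation and the outcome is dropped, and the naive effect estimate becomes biased.

```python
import numpy as np

# A feature u influences both the treatment (the recommendation) and the
# outcome. If u is dropped during "feature engineering", the naive
# treated-vs-untreated contrast no longer recovers the true effect.
rng = np.random.default_rng(0)
n = 100_000
u = rng.normal(size=n)                              # observed, later dropped
t = (u + rng.normal(size=n) > 0).astype(float)      # treatment depends on u
y = 1.0 * t + 2.0 * u + rng.normal(size=n)          # true effect of t is 1.0

naive = y[t == 1].mean() - y[t == 0].mean()
print(f"naive estimate: {naive:.2f} (true effect 1.0)")   # inflated by ~2.3
```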

[643] Bridging Generalization and Personalization in Human Activity Recognition via On-Device Few-Shot Learning

Pixi Kang, Julian Moosmann, Mengxi Liu, Bo Zhou, Michele Magno, Paul Lukowicz, Sizhen Bian

Main category: cs.LG

TL;DR: A novel on-device few-shot learning framework for Human Activity Recognition that bridges generalization across users and personalization for individuals, achieving significant accuracy improvements with minimal computation on resource-constrained devices.

DetailsMotivation: Conventional HAR models fail to generalize across user-specific variations, leading to degraded performance. There's a need for models that can both generalize across diverse users and efficiently personalize for individuals on resource-constrained devices.

Method: Proposes a two-stage approach: first trains a generalizable representation across users, then rapidly adapts to new users with few labeled samples by updating lightweight classifier layers directly on resource-constrained devices (RISC-V GAP9 microcontroller).

Result: Achieved accuracy improvements of 3.73%, 17.38%, and 3.70% on three benchmark datasets (RecGym, QVAR-Gesture, Ultrasound-Gesture) through post-deployment adaptation with minimal computation and memory cost.

Conclusion: Few-shot on-device learning enables scalable, user-aware, and energy-efficient wearable human activity recognition by seamlessly uniting generalization and personalization, making it practical for real-world deployment.

Abstract: Human Activity Recognition (HAR) with different sensing modalities requires both strong generalization across diverse users and efficient personalization for individuals. However, conventional HAR models often fail to generalize when faced with user-specific variations, leading to degraded performance. To address this challenge, we propose a novel on-device few-shot learning framework that bridges generalization and personalization in HAR. Our method first trains a generalizable representation across users and then rapidly adapts to new users with only a few labeled samples, updating lightweight classifier layers directly on resource-constrained devices. This approach achieves robust on-device learning with minimal computation and memory cost, making it practical for real-world deployment. We implement our framework on the energy-efficient RISC-V GAP9 microcontroller and evaluate it on three benchmark datasets (RecGym, QVAR-Gesture, Ultrasound-Gesture). Across these scenarios, post-deployment adaptation improves accuracy by 3.73%, 17.38%, and 3.70%, respectively. These results demonstrate that few-shot on-device learning enables scalable, user-aware, and energy-efficient wearable human activity recognition by seamlessly uniting generalization and personalization. The related framework is open-sourced for further research: https://github.com/kangpx/onlineTiny2023.
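
The two-stage recipe can be sketched in a few lines: freeze the user-generic backbone and update only the lightweight classifier head on a handful of labeled samples from the new user. Step count, learning rate, and the stand-in modules are illustrative.

```python
import torch
import torch.nn.functional as F

def personalize(backbone, classifier, x_few, y_few, steps=20, lr=1e-2):
    # Freeze the shared representation; adapt only the classifier layers,
    # which is what fits on a resource-constrained device.
    for p in backbone.parameters():
        p.requires_grad_(False)
    opt = torch.optim.SGD(classifier.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(classifier(backbone(x_few)), y_few)
        opt.zero_grad(); loss.backward(); opt.step()
    return classifier

backbone = torch.nn.Linear(64, 32)       # stand-in feature extractor
head = torch.nn.Linear(32, 5)            # lightweight classifier layer
personalize(backbone, head, torch.randn(10, 64), torch.randint(0, 5, (10,)))
```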

[644] Amortized In-Context Mixed Effect Transformer Models: A Zero-Shot Approach for Pharmacokinetics

César Ali Ojeda Marin, Wilhelm Huisinga, Purity Kavwele, Niklas Hartung

Main category: cs.LG

TL;DR: AICMET is a transformer-based model that combines mechanistic compartmental priors with amortized Bayesian inference for accurate dose-response forecasting, enabling zero-shot adaptation to new compounds and collapsing traditional modeling cycles from weeks to hours.

DetailsMotivation: Accurate dose-response forecasting under sparse sampling is crucial for precision pharmacotherapy, but traditional methods require extensive development time and expert modeling.

Method: Transformer-based latent-variable framework that unifies mechanistic compartmental priors with amortized in-context Bayesian inference, pre-trained on synthetic pharmacokinetic trajectories with Ornstein-Uhlenbeck priors.

Result: State-of-the-art predictive accuracy across public datasets, outperforming nonlinear mixed-effects baselines and neural ODE variants, with faithful quantification of inter-patient variability.

Conclusion: Transformer-based population-aware neural architectures offer a viable alternative to traditional pharmacokinetic modeling pipelines, enabling truly population-aware personalized dosing regimens.

Abstract: Accurate dose-response forecasting under sparse sampling is central to precision pharmacotherapy. We present the Amortized In-Context Mixed-Effect Transformer (AICMET) model, a transformer-based latent-variable framework that unifies mechanistic compartmental priors with amortized in-context Bayesian inference. AICMET is pre-trained on hundreds of thousands of synthetic pharmacokinetic trajectories with Ornstein-Uhlenbeck priors over the parameters of compartment models, endowing the model with strong inductive biases and enabling zero-shot adaptation to new compounds. At inference time, the decoder conditions on the collective context of previously profiled trial participants, generating calibrated posterior predictions for newly enrolled patients after a few early drug concentration measurements. This capability collapses traditional model-development cycles from weeks to hours while preserving some degree of expert modelling. Experiments across public datasets show that AICMET attains state-of-the-art predictive accuracy and faithfully quantifies inter-patient variability, outperforming both nonlinear mixed-effects baselines and recent neural ODE variants. Our results highlight the feasibility of transformer-based, population-aware neural architectures as a new alternative to bespoke pharmacokinetic modeling pipelines, charting a path toward truly population-aware personalized dosing regimens.
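
The abstract says AICMET is pre-trained on synthetic trajectories with Ornstein-Uhlenbeck priors over compartment-model parameters, but does not spell out the construction. The sketch below is a loose stand-in: discrete OU draws for log-parameters feeding a standard one-compartment oral-absorption model.

```python
import numpy as np

rng = np.random.default_rng(0)

def ou_draw(n, theta=0.5, sigma=0.3, mu=0.0):
    # Discrete Ornstein-Uhlenbeck (AR(1)-style) draw, used here as a
    # stand-in prior that produces correlated log-parameters.
    x = np.empty(n); x[0] = mu
    for i in range(1, n):
        x[i] = x[i - 1] + theta * (mu - x[i - 1]) + sigma * rng.normal()
    return x

def one_compartment(dose, ka, ke, V, t):
    # Standard one-compartment oral-absorption profile (assumes ka != ke).
    return dose * ka / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

# synthetic population: correlated log-parameters via the OU draw
n_subj, t = 100, np.linspace(0.1, 24, 20)
ka = np.exp(ou_draw(n_subj, mu=np.log(1.0)))
ke = np.exp(ou_draw(n_subj, mu=np.log(0.1)))
trajs = np.stack([one_compartment(100.0, a, e, 30.0, t)
                  for a, e in zip(ka, ke)])       # (n_subj, n_times)
```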

[645] Convergence and Generalization of Anti-Regularization for Parametric Models

Dongseok Kim, Wonjun Jeong, Gisung Oh

Main category: cs.LG

TL;DR: Anti-regularization is a technique that adds a negative regularization term to boost model expressivity in small-sample settings, with safeguards to prevent instability and overfitting.

DetailsMotivation: To address underfitting in small-sample learning regimes by deliberately increasing model complexity when data is limited, while ensuring the intervention fades as sample size grows.

Method: Introduces a reward term with reversed sign in loss function, uses power-law decay schedule, implements spectral safety conditions and trust-region constraints with projection operators and gradient clipping safeguards.

Result: Mitigates underfitting in regression and classification while preserving generalization and improving calibration. Maintains stable performance through proper decay scheduling.

Conclusion: Anti-regularization provides a simple, reproducible procedure that integrates with standard empirical risk minimization, enabling robust learning under data and resource constraints by intervening only when necessary.

Abstract: Anti-regularization introduces a reward term with a reversed sign into the loss function, deliberately amplifying model expressivity in small-sample regimes while ensuring that the intervention gradually vanishes as the sample size grows through a power-law decay schedule. We formalize spectral safety conditions and trust-region constraints, and we design a lightweight safeguard that combines a projection operator with gradient clipping to guarantee stable intervention. Theoretical analysis extends to linear smoothers and the Neural Tangent Kernel regime, providing practical guidance on the choice of decay exponents through the balance between empirical risk and variance. Empirical results show that Anti-regularization mitigates underfitting in both regression and classification while preserving generalization and improving calibration. Ablation studies confirm that the decay schedule and safeguards are essential to avoiding overfitting and instability. As an alternative, we also propose a degrees-of-freedom targeting schedule that maintains constant per-sample complexity. Anti-regularization constitutes a simple and reproducible procedure that integrates seamlessly into standard empirical risk minimization pipelines, enabling robust learning under limited data and resource constraints by intervening only when necessary and vanishing otherwise.
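
The mechanism translates directly into a short sketch: subtract (rather than add) a penalty term, scaled by a power-law decay in the sample size, with gradient clipping as a safeguard. The L2 form of the "reward", `lam0`, `gamma`, and the clip norm are illustrative; the paper also describes spectral and trust-region safeguards not shown here.

```python
import torch

def anti_regularized_loss(model, risk, n_samples, lam0=0.1, gamma=0.5, clip=1.0):
    # Reward term with reversed sign: encourage expressivity when n is small;
    # the power-law factor makes the intervention vanish as data grows.
    reward = sum((p ** 2).sum() for p in model.parameters())
    lam = lam0 * n_samples ** (-gamma)              # power-law decay schedule
    loss = risk - lam * reward                      # note the minus sign
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)   # safeguard
    return loss

model = torch.nn.Linear(4, 1)
risk = model(torch.randn(8, 4)).pow(2).mean()       # stand-in empirical risk
anti_regularized_loss(model, risk, n_samples=8)
```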

[646] Towards Synthesizing Normative Data for Cognitive Assessments Using Generative Multimodal Large Language Models

Victoria Yan, Honor Chotkowski, Fengran Wang, Xinhui Li, Carl Yang, Jiaying Lu, Runze Yan, Xiao Hu, Alex Fedorov

Main category: cs.LG

TL;DR: Using GPT-4o models with advanced prompting strategies can generate realistic synthetic normative data for cognitive assessments, overcoming traditional data collection limitations.

DetailsMotivation: Traditional normative data collection for cognitive tests is costly, time-consuming, and infrequently updated, creating barriers for developing new image-based assessments.

Method: Used GPT-4o and GPT-4o-mini with naive and advanced prompting strategies to generate synthetic responses for image-based cognitive tests like the “Cookie Theft” task. Evaluated responses using embedding analysis, BLEU, ROUGE, BERTScore, and LLM-as-a-judge evaluation.

Result: Advanced prompting produced synthetic responses that better distinguished diagnostic groups and captured demographic diversity. BERTScore was most reliable for contextual similarity, while BLEU was ineffective for creative outputs. LLM-as-a-judge showed promising validation results.

Conclusion: Generative multimodal LLMs with refined prompting can feasibly generate robust synthetic normative data, enabling development of novel image-based cognitive assessments without traditional limitations.

Abstract: Cognitive assessments require normative data as essential benchmarks for evaluating individual performance. Hence, developing new cognitive tests based on novel image stimuli is challenging due to the lack of readily available normative data. Traditional data collection methods are costly, time-consuming, and infrequently updated, limiting their practical utility. Recent advancements in generative multimodal large language models (MLLMs) offer a new approach to generate synthetic normative data from existing cognitive test images. We investigated the feasibility of using MLLMs, specifically GPT-4o and GPT-4o-mini, to synthesize normative textual responses for established image-based cognitive assessments, such as the “Cookie Theft” picture description task. Two distinct prompting strategies were evaluated: naive prompts with basic instructions, and advanced prompts enriched with contextual guidance. Responses were analyzed using embeddings to assess their capacity to distinguish diagnostic groups and demographic variations. Performance metrics included BLEU, ROUGE, BERTScore, and an LLM-as-a-judge evaluation. Advanced prompting strategies produced synthetic responses that more effectively distinguished between diagnostic groups and captured demographic diversity compared to naive prompts. Superior models generated responses exhibiting higher realism and diversity. BERTScore emerged as the most reliable metric for contextual similarity assessment, while BLEU was less effective for evaluating creative outputs. The LLM-as-a-judge approach provided promising preliminary validation results. Our study demonstrates that generative multimodal LLMs, guided by refined prompting methods, can feasibly generate robust synthetic normative data for existing cognitive tests, thereby laying the groundwork for developing novel image-based cognitive assessments without the traditional limitations.
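
The naive-versus-advanced contrast can be illustrated with the OpenAI Python SDK. The prompts below are paraphrased illustrations of the two strategies, not the paper's actual prompts, and running the snippet requires an `OPENAI_API_KEY`.

```python
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

naive = "Describe what is happening in the 'Cookie Theft' picture."
advanced = (
    "You are simulating a cognitively healthy 72-year-old with a high-school "
    "education taking a picture-description test. In 3-5 spoken-style "
    "sentences, describe the 'Cookie Theft' scene."
)

def generate(prompt, model="gpt-4o-mini"):
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

print(generate(naive))
print(generate(advanced))
```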

[647] Group Expectation Policy Optimization for Heterogeneous Reinforcement Learning

Han Zhang, Ruibin Zheng, Zexuan Yi, Zhuo Zhang, Hanyang Peng, Hui Wang, Zike Yuan, Cai Ke, Shiwei Chen, Jiacheng Yang, Yangning Li, Xiang Li, Jiangyue Yan, Yaoqi Liu, Liwen Jing, Jiayin Qi, Ruifeng Xu, Binxing Fang, Yue Yu

Main category: cs.LG

TL;DR: HeteroRL is an asynchronous RL architecture that decouples sampling from learning to enable robust decentralized training in heterogeneous networks, addressing latency-induced KL divergence issues with Group Expectation Policy Optimization (GEPO) for exponential variance reduction.

DetailsMotivation: As single-center computing faces power constraints, decentralized training becomes essential. RL post-training for LLMs faces challenges in heterogeneous distributed environments due to tightly-coupled sampling-learning alternation under network delays.

Method: Propose HeteroRL architecture that decouples rollout sampling from parameter learning. Introduce Group Expectation Policy Optimization (GEPO) with refined sampling mechanism to reduce importance weight variance caused by latency-induced KL divergence.

Result: GEPO achieves exponential variance reduction theoretically. Experiments show superior stability over methods like GRPO, with less than 3% performance degradation under 1800-second delays.

Conclusion: HeteroRL demonstrates strong potential for decentralized RL in heterogeneous networks, providing robust deployment across geographically distributed nodes despite significant network delays.

Abstract: As single-center computing approaches power constraints, decentralized training is becoming essential. Reinforcement Learning (RL) post-training enhances Large Language Models (LLMs) but faces challenges in heterogeneous distributed environments due to its tightly-coupled sampling-learning alternation. We propose HeteroRL, an asynchronous RL architecture that decouples rollout sampling from parameter learning, enabling robust deployment across geographically distributed nodes under network delays. We identify that latency-induced KL divergence causes importance sampling failure due to high variance. To address this, we propose Group Expectation Policy Optimization (GEPO), which reduces importance weight variance through a refined sampling mechanism. Theoretically, GEPO achieves exponential variance reduction. Experiments show it maintains superior stability over methods like GRPO, with less than 3% performance degradation under 1800-second delays, demonstrating strong potential for decentralized RL in heterogeneous networks.
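
The abstract does not give GEPO's estimator, so the sketch below is a speculative reading of the name only: replace the per-sample behavior-policy probability in the importance ratio with a group-level expectation, which damps extreme denominators and hence the weight variance that latency-induced KL divergence inflates. Treat this as a guess, not the paper's formula.

```python
import torch

def group_expectation_weights(logp_new, logp_old, group_ids, n_groups):
    # Speculative: normalize by the group-mean behavior-policy probability
    # instead of each sample's own p_old, reducing weight variance.
    p_old = logp_old.exp()
    group_mean = torch.zeros(n_groups).scatter_reduce(
        0, group_ids, p_old, reduce="mean", include_self=False)
    return logp_new.exp() / group_mean[group_ids]

w = group_expectation_weights(torch.randn(8), torch.randn(8),
                              torch.tensor([0, 0, 0, 0, 1, 1, 1, 1]), 2)
```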

[648] Complementary Learning System Empowers Online Continual Learning of Vehicle Motion Forecasting in Smart Cities

Zirui Li, Yunlong Lin, Guodong Du, Xiaocong Zhao, Cheng Gong, Chen Lv, Chao Lu, Jianwei Gong

Main category: cs.LG

TL;DR: Dual-LS is a brain-inspired continual learning system that pairs two memory replay mechanisms to prevent catastrophic forgetting in DNN-based vehicle motion forecasting, achieving up to 74.31% improvement while reducing computational costs by 94.02%.

DetailsMotivation: Current DNN models for vehicle motion forecasting suffer from catastrophic forgetting when updated, requiring costly data collection and failing to balance long- and short-term experience like human learning.

Method: Dual-LS uses a task-free, online continual learning paradigm inspired by the human brain’s complementary learning system, featuring two synergistic memory rehearsal replay mechanisms that dynamically coordinate long-term and short-term knowledge representations.

Result: Tests on naturalistic data from three countries (772,000 vehicles, 11,187 km testing mileage) show Dual-LS mitigates catastrophic forgetting by up to 74.31% and reduces computational resource demand by up to 94.02%, while maintaining predictive stability.

Conclusion: Dual-LS enables DNN-based vehicle motion forecasting to achieve computation-efficient, human-like continual learning adaptability suitable for smart city applications without increasing data requirements.

Abstract: Artificial intelligence underpins most smart city services, yet deep neural networks (DNNs) that forecast vehicle motion still struggle with catastrophic forgetting, the loss of earlier knowledge when models are updated. Conventional fixes enlarge the training set or replay past data, but these strategies incur high data collection costs, use samples inefficiently and fail to balance long- and short-term experience, leaving them short of human-like continual learning. Here we introduce Dual-LS, a task-free, online continual learning paradigm for DNN-based motion forecasting that is inspired by the complementary learning system of the human brain. Dual-LS pairs two synergistic memory rehearsal replay mechanisms to accelerate experience retrieval while dynamically coordinating long-term and short-term knowledge representations. Tests on naturalistic data spanning three countries, over 772,000 vehicles and cumulative testing mileage of 11,187 km show that Dual-LS mitigates catastrophic forgetting by up to 74.31% and reduces computational resource demand by up to 94.02%, markedly boosting predictive stability in vehicle motion forecasting without inflating data requirements. Meanwhile, it endows DNN-based vehicle motion forecasting with computation-efficient and human-like continual learning adaptability fit for smart cities.
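
The dual-memory idea can be sketched with two buffers in the spirit of Dual-LS; buffer sizes, the reservoir-sampling eviction, and the mixing ratio are illustrative choices, not the paper's design.

```python
import random
from collections import deque

class DualReplay:
    """Schematic dual-memory rehearsal: a small short-term buffer captures
    recent scenes, a reservoir-sampled long-term buffer preserves older
    experience, and training batches mix the two."""
    def __init__(self, short_cap=256, long_cap=4096):
        self.short = deque(maxlen=short_cap)      # fast-changing recent memory
        self.long, self.long_cap, self.seen = [], long_cap, 0

    def add(self, sample):
        self.short.append(sample)
        self.seen += 1
        if len(self.long) < self.long_cap:        # reservoir sampling
            self.long.append(sample)
        elif random.random() < self.long_cap / self.seen:
            self.long[random.randrange(self.long_cap)] = sample

    def batch(self, k=32, mix=0.5):
        ns = min(int(k * mix), len(self.short))
        nl = min(k - ns, len(self.long))
        return random.sample(list(self.short), ns) + random.sample(self.long, nl)
```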

[649] Failure Prediction Is a Better Performance Proxy for Early-Exit Networks Than Calibration

Piotr Kubaty, Filip Szatkowski, Metod Jazbec, Bartosz Wójcik

Main category: cs.LG

TL;DR: Early-exit models with confidence-based exit strategies don’t necessarily benefit from calibration - miscalibrated networks can outperform calibrated ones. Failure prediction is proposed as a better performance proxy.

DetailsMotivation: To challenge the assumption that calibration improves early-exit model performance, and to find a more reliable evaluation framework than calibration metrics.

Method: Presented empirical evidence comparing calibrated vs miscalibrated networks, and proposed using failure prediction as an alternative performance measure that captures sample ranking changes and efficiency gains.

Result: Found that miscalibrated networks can outperform calibrated ones, and that failure prediction correlates strongly with efficiency gains unlike calibration metrics.

Conclusion: Failure prediction is a more informative and reliable proxy for early-exit model performance than calibration metrics, offering better framework for design and evaluation.

Abstract: Early-exit models accelerate inference by attaching internal classifiers to intermediate layers of the network, allowing computation to halt once a prediction meets a predefined exit criterion. Most early-exit methods rely on confidence-based exit strategies, which has motivated prior work to calibrate intermediate classifiers in pursuit of improved performance-efficiency trade-offs. In this paper, we argue that calibration metrics can be misleading indicators of multi-exit model performance. Specifically, we present empirical evidence showing that miscalibrated networks can outperform calibrated ones. As an alternative, we propose using failure prediction as a more informative proxy for early-exit model performance. Unlike calibration, failure prediction captures changes in sample rankings and correlates strongly with efficiency gains, offering a more reliable framework for designing and evaluating early-exit models.
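
The proposed proxy is straightforward to compute: score each exit's uncertainty signal by how well it separates correct from incorrect predictions (AUROC), which directly measures ranking quality rather than calibration. The toy data below is illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def failure_prediction_auroc(confidences, correct):
    # AUROC of confidence as a predictor of correctness: unlike calibration
    # error, this captures changes in sample ranking.
    return roc_auc_score(correct.astype(int), confidences)

# toy usage: per-exit confidences and correctness flags
rng = np.random.default_rng(0)
conf = rng.uniform(size=1000)
correct = rng.uniform(size=1000) < conf   # better-ranked samples more often correct
print(failure_prediction_auroc(conf, correct))
```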

[650] Missing Data Imputation using Neural Cellular Automata

Tin Luu, Binh Nguyen, Man Ngo

Main category: cs.LG

TL;DR: Proposes a novel neural cellular automata (NCA)-based method for tabular data imputation that outperforms state-of-the-art approaches.

DetailsMotivation: Missing data is a persistent problem in tabular data analysis, and while recent generative models have been explored for imputation, neural cellular automata have been overlooked despite their powerful computational capabilities.

Method: Developed an NCA-based imputation model with appropriate adaptations to handle missing data, leveraging the computational power of neural cellular automata for the imputation task.

Result: The proposed NCA-based model demonstrates superior performance compared to state-of-the-art methods, achieving lower imputation error and better post-imputation performance in experiments.

Conclusion: Neural cellular automata represent a promising and previously unexplored approach for tabular data imputation, offering competitive performance against existing generative model-based methods.

Abstract: When working with tabular data, missingness is one of the most persistent problems. Over many years, researchers have continuously explored better ways to impute missing data. Recently, with the rapid evolution of machine learning and deep learning, there is a new trend of leveraging generative models to solve the imputation task. While imputation variants of well-known models such as Variational Autoencoders and Generative Adversarial Networks have been investigated, prior work has overlooked Neural Cellular Automata (NCA), a powerful computational model. In this paper, we propose a novel imputation method that is inspired by NCA. We show that, with some appropriate adaptations, an NCA-based model is able to address the missing data imputation problem. We also provide several experiments showing that our model outperforms state-of-the-art methods in terms of imputation error and post-imputation performance.
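
Since the paper's adaptations are not described in the abstract, here is a loose sketch of what an NCA-style tabular imputer could look like: each cell carries a hidden state, a perception step lets cells exchange information across the row, a local update rule refines the states, and observed cells are re-clamped after every step so only missing cells evolve. The architecture is entirely illustrative.

```python
import torch
import torch.nn as nn

class TabularNCA(nn.Module):
    def __init__(self, n_features, hidden=32, steps=8):
        super().__init__()
        self.steps = steps
        self.embed = nn.Linear(2, hidden)             # (value, observed-mask)
        self.perceive = nn.Linear(n_features, n_features)  # cell interaction
        self.update = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden))
        self.readout = nn.Linear(hidden, 1)

    def forward(self, x, mask):                       # x, mask: (B, F)
        vals = torch.nan_to_num(x)
        for _ in range(self.steps):
            state = self.embed(torch.stack([vals, mask], dim=-1))  # (B, F, H)
            mixed = self.perceive(state.transpose(1, 2)).transpose(1, 2)
            state = state + self.update(mixed)                     # NCA step
            new_vals = self.readout(state).squeeze(-1)
            vals = torch.where(mask.bool(), vals, new_vals)  # clamp observed
        return vals

x = torch.randn(8, 5); mask = (torch.rand(8, 5) > 0.3).float()
imputed = TabularNCA(n_features=5)(x, mask)
```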

[651] Any-Order Flexible Length Masked Diffusion

Jaeyeon Kim, Lee Cheuk-Kit, Carles Domingo-Enrich, Yilun Du, Sham Kakade, Timothy Ngotiaoco, Sitan Chen, Michael Albergo

Main category: cs.LG

TL;DR: FlexMDMs extend masked diffusion models to support flexible-length sequence generation while maintaining any-order inference capabilities, achieving superior performance on various tasks including math and code completion.

DetailsMotivation: Masked diffusion models (MDMs) are limited to fixed-length generations and do not support token insertions, which restricts their applicability to tasks requiring variable-length outputs.

Method: FlexMDMs extend the stochastic interpolant framework to generate sequences by inserting mask tokens and unmasking them, enabling modeling of flexible-length sequences while retaining any-order inference properties.

Result: FlexMDMs match MDMs in perplexity while better modeling length statistics, achieve ~60% higher success rate on maze planning, and enable efficient fine-tuning of existing models (3 days on 16 H100s) with significant performance improvements on math (58%→67%) and code infilling (52%→65%).

Conclusion: FlexMDMs provide a flexible and efficient diffusion paradigm for variable-length sequence generation that can be easily adapted from existing MDMs, demonstrating strong empirical performance across multiple domains.

Abstract: Masked diffusion models (MDMs) have recently emerged as a promising alternative to autoregressive models over discrete domains. MDMs generate sequences in an any-order, parallel fashion, enabling fast inference and strong performance on non-causal tasks. However, a crucial limitation is that they do not support token insertions and are thus limited to fixed-length generations. To this end, we introduce Flexible Masked Diffusion Models (FlexMDMs), a discrete diffusion paradigm that simultaneously can model sequences of flexible length while provably retaining MDMs’ flexibility of any-order inference. Grounded in an extension of the stochastic interpolant framework, FlexMDMs generate sequences by inserting mask tokens and unmasking them. Empirically, we show that FlexMDMs match MDMs in perplexity while modeling length statistics with much higher fidelity. On a synthetic maze planning task, they achieve $\approx 60\%$ higher success rate than MDM baselines. Finally, we show pretrained MDMs can easily be retrofitted into FlexMDMs: on 16 H100s, it takes only three days to fine-tune LLaDA-8B into a FlexMDM, achieving superior performance on math (GSM8K, $58\% \to 67\%$) and code infilling ($52\% \to 65\%$).

[652] Optimizing In-Context Learning for Efficient Full Conformal Prediction

Weicao Deng, Sangwoo Park, Min Li, Osvaldo Simeone

Main category: cs.LG

TL;DR: E-ICL+FCP is a new conformal prediction framework that combines in-context learning with permutation-invariant transformers and CP-aware loss to achieve better data efficiency and coverage than traditional split and full CP methods.

DetailsMotivation: Current conformal prediction methods face trade-offs between data efficiency (split CP wastes data) and computational complexity (full CP requires prohibitive retraining). Existing meta-learning/ICL approaches don't optimize specifically for CP, leading to large prediction sets.

Method: Enhanced ICL-based Full CP (E-ICL+FCP) uses a permutation-invariant Transformer-based in-context learning model trained with a CP-aware loss function to simulate multiple retrained models without actual retraining.

Result: Experiments on synthetic and real tasks show E-ICL+FCP achieves superior efficiency-coverage trade-offs compared to existing split CP and full CP baselines, preserving coverage while reducing computational overhead.

Conclusion: The proposed E-ICL+FCP framework effectively addresses the limitations of traditional CP methods by combining ICL with CP-specific optimization, providing reliable uncertainty quantification with improved data efficiency and reduced computational burden.

Abstract: Reliable uncertainty quantification is critical for trustworthy AI. Conformal Prediction (CP) provides prediction sets with distribution-free coverage guarantees, but its two main variants face complementary limitations. Split CP (SCP) suffers from data inefficiency due to dataset partitioning, while full CP (FCP) improves data efficiency at the cost of prohibitive retraining complexity. Recent approaches based on meta-learning or in-context learning (ICL) partially mitigate these drawbacks. However, they rely on training procedures not specifically tailored to CP, which may yield large prediction sets. We introduce an efficient FCP framework, termed enhanced ICL-based FCP (E-ICL+FCP), which employs a permutation-invariant Transformer-based ICL model trained with a CP-aware loss. By simulating the multiple retrained models required by FCP without actual retraining, E-ICL+FCP preserves coverage while markedly reducing both inefficiency and computational overhead. Experiments on synthetic and real tasks demonstrate that E-ICL+FCP attains superior efficiency-coverage trade-offs compared to existing SCP and FCP baselines.
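
For context, here is the generic full conformal prediction procedure that E-ICL+FCP simulates: for each candidate label, augment the data, refit, and keep the label if the new point's nonconformity p-value exceeds alpha. The logistic-regression `fit_predict` below is a stand-in for the retraining that E-ICL+FCP replaces with a single in-context model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def full_cp_set(fit_predict, X, y, x_new, labels, alpha=0.1):
    region = []
    for c in labels:
        Xa = np.vstack([X, x_new])
        ya = np.append(y, c)
        probs = fit_predict(Xa, ya)                   # (n+1, n_classes)
        scores = 1.0 - probs[np.arange(len(ya)), ya]  # nonconformity scores
        p_value = (scores >= scores[-1]).mean()       # rank of the new point
        if p_value > alpha:
            region.append(c)
    return region

def fit_predict(Xa, ya):                              # stand-in "retraining"
    return LogisticRegression(max_iter=200).fit(Xa, ya).predict_proba(Xa)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2)); y = (X[:, 0] > 0).astype(int)
print(full_cp_set(fit_predict, X, y, rng.normal(size=2), labels=[0, 1]))
```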

[653] Online Identification of IT Systems through Active Causal Learning

Kim Hammar, Rolf Stadler

Main category: cs.LG

TL;DR: First principled method for online, data-driven identification of IT system causal models using active causal learning with Gaussian process regression and rollout-based interventions.

DetailsMotivation: Traditional expert-designed causal models are challenging for complex modern IT systems. Need automated methods to predict effects, optimize operations, diagnose failures, and detect intrusions for network/system management automation.

Method: Active causal learning method using Gaussian process regression to estimate causal functions iteratively. Uses rollout-based intervention policy to collect system measurements with minimal operational interference.

Result: Method is proven optimal in Bayesian sense and produces effective interventions. Experimental validation shows accurate causal model identification with low system operation interference.

Conclusion: The approach successfully enables automated, data-driven causal model identification for IT systems, addressing the limitations of manual expert design in complex dynamic environments.

Abstract: Identifying a causal model of an IT system is fundamental to many branches of systems engineering and operation. Such a model can be used to predict the effects of control actions, optimize operations, diagnose failures, detect intrusions, etc., which is central to achieving the longstanding goal of automating network and system management tasks. Traditionally, causal models have been designed and maintained by domain experts. This, however, proves increasingly challenging with the growing complexity and dynamism of modern IT systems. In this paper, we present the first principled method for online, data-driven identification of an IT system in the form of a causal model. The method, which we call active causal learning, estimates causal functions that capture the dependencies among system variables in an iterative fashion using Gaussian process regression based on system measurements, which are collected through a rollout-based intervention policy. We prove that this method is optimal in the Bayesian sense and that it produces effective interventions. Experimental validation on a testbed shows that our method enables accurate identification of a causal system model while inducing low interference with system operations.
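
The iterative estimate-then-intervene loop can be sketched with scikit-learn's GP regressor. The intervention selection below is simplified to a max-uncertainty rule, whereas the paper derives a rollout-based policy; the system response and noise level are toy stand-ins.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Schematic active causal learning loop: intervene, measure, refit the GP
# estimate of the causal function, repeat.
rng = np.random.default_rng(0)
true_fn = lambda a: np.sin(3 * a) + 0.5 * a          # unknown system response
candidates = np.linspace(0, 2, 50).reshape(-1, 1)    # possible interventions
A, Y = [], []

gp = GaussianProcessRegressor()
for step in range(15):
    if A:
        gp.fit(np.array(A), np.array(Y))
        _, std = gp.predict(candidates, return_std=True)
        a = candidates[np.argmax(std)]               # most informative action
    else:
        a = candidates[rng.integers(len(candidates))]
    A.append(a.ravel())
    Y.append(true_fn(a.item()) + 0.05 * rng.normal())  # noisy measurement
```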

[654] Insights from Gradient Dynamics: Gradient Autoscaled Normalization

Vincent-Daniel Yun

Main category: cs.LG

TL;DR: Empirical analysis of gradient variance/standard deviation evolution in deep networks, leading to hyperparameter-free gradient normalization that aligns with natural gradient scaling and improves optimization stability.

DetailsMotivation: Gradient dynamics are crucial for neural network stability and generalization, but there's a gap between theoretical expectations and empirical behaviors that needs to be addressed.

Method: Proposed hyperparameter-free gradient normalization method that aligns gradient scaling with their natural evolution during training to prevent unintended amplification and stabilize optimization.

Result: Experiments on CIFAR-100 with ResNet-20, ResNet-56, and VGG-16-BN show maintained or improved test accuracy under strong generalization while preserving convergence guarantees.

Conclusion: Direct tracking of gradient dynamics is important for bridging theory-practice gaps, and the proposed method provides insights for future optimization research while delivering practical performance benefits.

Abstract: Gradient dynamics play a central role in determining the stability and generalization of deep neural networks. In this work, we provide an empirical analysis of how variance and standard deviation of gradients evolve during training, showing consistent changes across layers and at the global scale in convolutional networks. Motivated by these observations, we propose a hyperparameter-free gradient normalization method that aligns gradient scaling with their natural evolution. This approach prevents unintended amplification, stabilizes optimization, and preserves convergence guarantees. Experiments on the challenging CIFAR-100 benchmark with ResNet-20, ResNet-56, and VGG-16-BN demonstrate that our method maintains or improves test accuracy even under strong generalization. Beyond practical performance, our study highlights the importance of directly tracking gradient dynamics, aiming to bridge the gap between theoretical expectations and empirical behaviors, and to provide insights for future optimization research.

[655] AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates

Minxin Zhang, Yuxuan Liu, Hayden Schaeffer

Main category: cs.LG

TL;DR: AdaGO combines orthogonalized momentum updates from Muon optimizer with AdaGrad’s adaptive stepsize scaling, achieving better performance while maintaining computational efficiency.

DetailsMotivation: Muon optimizer shows strong empirical success but lacks clear learning rate determination, while AdaGrad provides adaptive stepsize scaling but doesn't use orthogonal updates. The goal is to combine both benefits.

Method: AdaGO integrates norm-based AdaGrad stepsize with orthogonalized update direction, preserving orthogonality while adapting to optimization landscape through accumulated gradient norm scaling.

Result: Theoretical convergence rates established for nonconvex functions. Empirical results on CIFAR-10 classification and function regression show AdaGO outperforms both Muon and Adam.

Conclusion: AdaGO successfully combines orthogonal momentum updates with adaptive stepsize scaling, providing both theoretical guarantees and empirical improvements over existing optimizers with minimal computational overhead.

Abstract: The recently proposed Muon optimizer updates weight matrices via orthogonalized momentum and has demonstrated strong empirical success in large language model training. However, it remains unclear how to determine the learning rates for such orthogonalized updates. AdaGrad, by contrast, is a widely used adaptive method that scales stochastic gradients by accumulated past gradients. We propose a new algorithm, AdaGO, which combines a norm-based AdaGrad-type stepsize with an orthogonalized update direction, bringing together the benefits of both approaches. Unlike other adaptive variants of Muon, AdaGO preserves the orthogonality of the update direction, which can be interpreted as a spectral descent direction, while adapting the stepsizes to the optimization landscape by scaling the direction with accumulated past gradient norms. The implementation of AdaGO requires only minimal modification to Muon, with a single additional scalar variable, the accumulated squared gradient norms, to be computed, making it computationally and memory efficient. Optimal theoretical convergence rates are established for nonconvex functions in both stochastic and deterministic settings under standard smoothness and unbiased bounded-variance noise assumptions. Empirical results on CIFAR-10 classification and function regression demonstrate that AdaGO outperforms Muon and Adam.
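
The abstract is specific enough for a compact sketch: Muon-style orthogonalized momentum as the update direction, scaled by a norm-based AdaGrad stepsize driven by a single scalar accumulator of squared gradient norms. The SVD orthogonalization below is a stand-in for Muon's Newton-Schulz iteration, and the hyperparameters are illustrative.

```python
import torch

def adago_step(params, grads, momenta, state, lr=0.02, beta=0.95, eps=1e-8):
    # Single scalar accumulator of squared gradient norms (AdaGrad-type).
    state["accum"] = state.get("accum", 0.0) + sum(
        (g ** 2).sum().item() for g in grads)
    scale = lr / (state["accum"] ** 0.5 + eps)
    for p, g, m in zip(params, grads, momenta):
        m.mul_(beta).add_(g)                         # momentum update
        if m.ndim == 2:
            U, _, Vh = torch.linalg.svd(m, full_matrices=False)
            direction = U @ Vh                       # orthogonalized direction
        else:
            direction = m
        p.data.add_(direction, alpha=-scale)

p = torch.randn(8, 4); g = torch.randn(8, 4); m = torch.zeros(8, 4)
adago_step([p], [g], [m], state={})
```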

[656] CPEP: Contrastive Pose-EMG Pre-training Enhances Gesture Generalization on EMG Signals

Wenhui Cui, Christopher Sandino, Hadi Pouransari, Ran Liu, Juri Minxha, Ellen Zippi, Aman Verma, Anna Sedlackova, Erdrin Azemi, Behrooz Mahasseni

Main category: cs.LG

TL;DR: A framework that aligns EMG and pose representations to enable zero-shot gesture classification using weak biosignals

DetailsMotivation: Leverage low-power biosignals like EMG for continuous gesture prediction on wearables by aligning them with high-quality structured data representations

Method: Contrastive Pose-EMG Pre-training (CPEP) framework that learns an EMG encoder to produce pose-informative representations aligned with pose data

Result: Outperforms emg2pose benchmark models by up to 21% on in-distribution gesture classification and 72% on unseen gesture classification

Conclusion: Aligning weak-modality EMG data with structured pose representations improves representation quality and enables effective zero-shot gesture classification

Abstract: Hand gesture classification using high-quality structured data such as videos, images, and hand skeletons is a well-explored problem in computer vision. Leveraging low-power, cost-effective biosignals, e.g. surface electromyography (sEMG), allows for continuous gesture prediction on wearables. In this paper, we demonstrate that learning representations from weak-modality data that are aligned with those from structured, high-quality data can improve representation quality and enables zero-shot classification. Specifically, we propose a Contrastive Pose-EMG Pre-training (CPEP) framework to align EMG and pose representations, where we learn an EMG encoder that produces high-quality and pose-informative representations. We assess the gesture classification performance of our model through linear probing and zero-shot setups. Our model outperforms emg2pose benchmark models by up to 21% on in-distribution gesture classification and 72% on unseen (out-of-distribution) gesture classification.
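
The contrastive alignment in CPEP is amenable to a minimal sketch, assuming a standard symmetric InfoNCE objective over time-aligned EMG/pose pairs; the temperature and exact loss variant are assumptions, not confirmed by the abstract.

```python
import torch
import torch.nn.functional as F

def cpep_contrastive_loss(emg_emb, pose_emb, temperature=0.07):
    # Symmetric InfoNCE: matched EMG/pose pairs are positives, all other
    # in-batch combinations are negatives.
    emg = F.normalize(emg_emb, dim=-1)
    pose = F.normalize(pose_emb, dim=-1)
    logits = emg @ pose.t() / temperature            # (B, B) similarities
    targets = torch.arange(len(logits))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = cpep_contrastive_loss(torch.randn(32, 128), torch.randn(32, 128))
```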

[657] ModalSurv: A Multimodal Deep Survival Framework for Prostate and Bladder Cancer

Noorul Wahab, Ethar Alzaid, Jiaqi Lv, Adam Shephard, Shan E Ahmed Raza

Main category: cs.LG

TL;DR: ModaliSurv is a multimodal deep survival model that integrates clinical, MRI, RNA-seq, and pathology data using DeepHit with cross-attention, achieving strong performance in prostate and bladder cancer recurrence prediction.

DetailsMotivation: Accurate time-to-event prediction is crucial in oncology for treatment planning and patient management, but requires integrating heterogeneous patient data from multiple modalities.

Method: DeepHit survival model with projection layer and inter-modality cross-attention to integrate clinical, MRI, RNA-seq, and whole-slide pathology features for multimodal learning.

Result: Achieved C-index of 0.843 (cross-validation) and 0.818 (development set) for prostate cancer recurrence; 0.662 (cross-validation) and 0.457 (development set) for bladder cancer recurrence.

Conclusion: Multimodal integration with deep survival learning provides promising personalized risk stratification for prostate and bladder cancer, with broad applicability to survival prediction tasks involving heterogeneous biomedical data.

Abstract: Accurate prediction of time-to-event outcomes is a central challenge in oncology, with significant implications for treatment planning and patient management. In this work, we present ModaliSurv, a multimodal deep survival model utilising DeepHit with a projection layer and inter-modality cross-attention, which integrates heterogeneous patient data, including clinical, MRI, RNA-seq and whole-slide pathology features. The model is designed to capture complementary prognostic signals across modalities and estimate individualised time-to-biochemical recurrence in prostate cancer and time-to-cancer recurrence in bladder cancer. Our approach was evaluated in the context of the CHIMERA Grand Challenge, across two of the three provided tasks. For Task 1 (prostate cancer biochemical recurrence prediction), the proposed framework achieved a concordance index (C-index) of 0.843 on 5-fold cross-validation and 0.818 on the CHIMERA development set, demonstrating robust discriminatory ability. For Task 3 (bladder cancer recurrence prediction), the model obtained a C-index of 0.662 on 5-fold cross-validation and 0.457 on the development set, highlighting its adaptability and potential for clinical translation. These results suggest that leveraging multimodal integration with deep survival learning provides a promising pathway toward personalised risk stratification in prostate and bladder cancer. Beyond the challenge setting, our framework is broadly applicable to survival prediction tasks involving heterogeneous biomedical data.

cs.MA

[658] Strategic Concealment of Environment Representations in Competitive Games

Yue Guan, Dipankar Maity, Panagiotis Tsiotras

Main category: cs.MA

TL;DR: Strategic concealment of map abstractions in competitive games where an Attacker hides their environmental understanding to mislead a Defender’s barrier placement, improving Attacker performance.

DetailsMotivation: To study how players in competitive games can strategically hide their map abstractions to gain advantage, particularly in defense scenarios where one player tries to infer and counter the other's environmental understanding.

Method: Model the interaction as a Bayesian game where Defender selects barrier configurations based on beliefs about Attacker’s abstraction, while Attacker chooses trajectories to obfuscate their abstraction. Solve using bilinear programming integrating Bayesian inference, strategic planning, and belief manipulation.

Result: Purposeful abstraction concealment emerges naturally to improve Attacker performance. Simulations show Attacker can shape Defender’s belief to induce suboptimal barrier placement, gaining strategic advantage.

Conclusion: Strategic concealment of map abstractions provides a viable approach for players to manipulate opponents’ beliefs and gain competitive edge in adversarial scenarios involving environmental understanding and barrier placement.

Abstract: This paper investigates the strategic concealment of map abstractions used by the players in competitive games. We consider a defense scenario in which one player (the Defender) seeks to infer and exploit the abstraction used by the other player (the Attacker). The interaction between the two players is modeled as a Bayesian game: the Defender selects a barrier configuration, i.e., a placement of obstacles that can obstruct the Attacker’s movement, based on its belief about the Attacker’s abstraction, while the Attacker chooses a trajectory that may intentionally obfuscate its own abstraction of the environment to mislead the Defender. We show that purposeful abstraction concealment naturally emerges from this formulation as a means of improving the Attacker’s performance. To solve the game, we propose a bilinear programming approach that integrates Bayesian inference, strategic planning, and belief manipulation. Simulations demonstrate that, by shaping the Defender’s belief, the Attacker can induce suboptimal Defender barrier placement, thereby gaining a strategic advantage.

[659] Systematic Evaluation of Multi-modal Approaches to Complex Player Profile Classification

Jason Starace, Terence Soule

Main category: cs.MA

TL;DR: Multi-modal classification combining behavioral telemetry and semantic context achieves 21% accuracy for 36 player profiles, significantly outperforming behavioral-only approaches (10% accuracy) but showing limitations without conversational data.

DetailsMotivation: Traditional player modeling uses simplified 5-10 category taxonomies that fail to capture player diversity, and behavioral clustering cannot distinguish players with different motivations who exhibit similar actions.

Method: Systematic evaluation using 19,413 gameplay sessions from an AI-controlled text-based RPG, comparing behavioral-only baselines with multi-modal approaches that integrate action sequences and semantic descriptions using LSTM models processing action-text pairs.

Result: Multi-modal LSTM achieved 21% accuracy (vs 10% for behavioral-only), with non-neutral profiles reaching 42% accuracy (15x above random) while neutral profiles dropped to 25%. Identical actions like “help the merchant” cannot reveal player intent without reasoning data.

Conclusion: Behavioral data alone plateaus at 10% for 36 categories, while multi-modal integration enables 25%. Personality-based adaptation requires conversational interaction as predefined choices cannot capture intent, establishing benchmarks for complex player modeling.

Abstract: Modern adaptive games require nuanced player understanding, yet most models use simplified 5-10 category taxonomies that fail to capture diversity. Behavioral clustering cannot distinguish players with different motivations who act similarly. We present a systematic evaluation of multi-modal classification at scale, combining behavioral telemetry with semantic context to support 36 player profiles. Using 19,413 gameplay sessions from an AI-controlled text-based RPG, we compared behavioral-only baselines with multi-modal approaches that integrate action sequences and semantic descriptions. Traditional clustering achieved only 10% accuracy for 36-category classification, limited by semantic conflation where opposite actions produced identical features. Our multi-modal LSTM processing action-text pairs improved accuracy to 21%, showing both potential and limits of non-conversational data. Analysis by behavioral complexity revealed that non-neutral profiles reached 42% accuracy (15x above random), while neutral profiles dropped to 25% (9x above random). Identical actions such as “help the merchant” cannot reveal whether a player is neutral or strategically waiting. Without access to reasoning, even multi-modal models struggle, though above-baseline results confirm a meaningful signal. Since prediction beyond 20 categories remains unexplored, our findings establish benchmarks for complex player modeling. Behavioral data alone plateaus near 10% for 36 categories, while multi-modal integration enables 25%. For designers, this shows that personality-based adaptation requires conversational interaction, as predefined choices cannot capture intent. Our evaluation at 36-category scale offers guidance for building adaptive games that better understand their players.
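
A minimal sketch of the kind of multi-modal LSTM the paper describes, consuming per-step action IDs paired with sentence-encoder text embeddings. All dimensions, the action vocabulary size, and the fusion scheme are invented for illustration.

```python
import torch
import torch.nn as nn

class ActionTextLSTM(nn.Module):
    """Fuses behavioral telemetry (action IDs) with semantic context
    (text embeddings) per step, then classifies one of 36 profiles."""

    def __init__(self, n_actions=50, text_dim=384, hidden=128, n_profiles=36):
        super().__init__()
        self.action_emb = nn.Embedding(n_actions, 64)
        self.fuse = nn.Linear(64 + text_dim, hidden)   # per-step fusion
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_profiles)

    def forward(self, action_ids, text_embs):
        # action_ids: (B, T); text_embs: (B, T, text_dim)
        x = torch.cat([self.action_emb(action_ids), text_embs], dim=-1)
        x = torch.relu(self.fuse(x))
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])                        # logits over profiles

model = ActionTextLSTM()
logits = model(torch.randint(0, 50, (4, 20)), torch.randn(4, 20, 384))
```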

[660] Orchestrator: Active Inference for Multi-Agent Systems in Long-Horizon Tasks

Lukas Beckenbauer, Johannes-Lucas Loewe, Ge Zheng, Alexandra Brintrup

Main category: cs.MA

TL;DR: Orchestrator is a novel multi-agent system framework that uses attention-inspired coordination and reflective benchmarking to improve performance in complex, non-linear tasks with partial observability.

DetailsMotivation: Complex non-linear tasks challenge LLM-enhanced multi-agent systems due to partial observability and suboptimal coordination, requiring better coordination mechanisms.

Method: Proposes Orchestrator framework with attention-inspired self-emergent coordination, reflective benchmarking, and monitoring mechanism to track agent-environment dynamics using active inference benchmarks.

Result: Evaluated on maze puzzles of increasing complexity, demonstrating effectiveness in enhancing coordination and performance in dynamic, non-linear environments with long-horizon objectives.

Conclusion: Orchestrator successfully mitigates partial observability effects and enables agents to approximate global task solutions more efficiently through improved coordination mechanisms.

Abstract: Complex, non-linear tasks challenge LLM-enhanced multi-agent systems (MAS) due to partial observability and suboptimal coordination. We propose Orchestrator, a novel MAS framework that leverages attention-inspired self-emergent coordination and reflective benchmarking to optimize global task performance. Orchestrator introduces a monitoring mechanism to track agent-environment dynamics, using active inference benchmarks to optimize system behavior. By tracking agent-to-agent and agent-to-environment interaction, Orchestrator mitigates the effects of partial observability and enables agents to approximate global task solutions more efficiently. We evaluate the framework on a series of maze puzzles of increasing complexity, demonstrating its effectiveness in enhancing coordination and performance in dynamic, non-linear environments with long-horizon objectives.

[661] MAPF-HD: Multi-Agent Path Finding in High-Density Environments

Hiroya Makino, Seigo Ito

Main category: cs.MA

TL;DR: Proposes PHANS method for multi-agent path finding in high-density environments, solving MAPF-HD problems in seconds instead of minutes/hours like ILP approaches.

DetailsMotivation: Traditional MAPF methods using integer linear programming become computationally expensive in high-density environments, making them impractical for real-world applications like warehouses and automated parking.

Method: PHANS (Phased Null-Agent Swapping): a heuristic approach that incrementally swaps positions between agents and empty vertices to optimize paths efficiently.

Result: Achieves solution times of seconds to tens of seconds even in large environments with over 700 cells, compared to tens to hundreds of seconds for ILP methods in smaller environments.

Conclusion: PHANS enables practical application of MAPF in high-density scenarios, potentially improving efficiency in warehouse logistics, traffic management, and crowd control applications.

Abstract: Multi-agent path finding (MAPF) involves planning efficient paths for multiple agents to move simultaneously while avoiding collisions. In typical warehouse environments, agents are often sparsely distributed along aisles. However, increasing the agent density can improve space efficiency. When the agent density is high, we must optimize the paths not only for goal-assigned agents but also for those obstructing them. This study proposes a novel MAPF framework for high-density environments (MAPF-HD). Several studies have explored MAPF in similar settings using integer linear programming (ILP). However, ILP-based methods require substantial computation time to optimize all agent paths simultaneously. Even in small grid-based environments with fewer than $100$ cells, these computations can incur tens to hundreds of seconds. These high computational costs render these methods impractical for large-scale applications such as automated warehouses and valet parking. To address these limitations, we introduce the phased null-agent swapping (PHANS) method. PHANS employs a heuristic approach to incrementally swap positions between agents and empty vertices. This method solves the MAPF-HD problem within seconds to tens of seconds, even in large environments containing more than $700$ cells. The proposed method can potentially improve efficiency in various real-world applications such as warehouse logistics, traffic management, or crowd control. Code is available at https://github.com/ToyotaCRDL/MAPF-in-High-Density-Envs.
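
The core primitive the method name suggests, swapping an agent with an adjacent empty vertex, can be sketched on a grid as below. This toy step is our own illustration and ignores the phasing and blocker-clearing logic the real algorithm must handle.

```python
def swap_toward_goal(grid, agent_pos, goal):
    """grid[r][c] is None when the cell is empty; returns the new position.
    Moves one step along a Manhattan path by swapping with an empty cell."""
    r, c = agent_pos
    gr, gc = goal
    if r != gr:
        nr, nc = r + (1 if gr > r else -1), c
    else:
        nr, nc = r, c + (1 if gc > c else -1)
    if grid[nr][nc] is None:                  # adjacent empty vertex: swap
        grid[nr][nc], grid[r][c] = grid[r][c], None
        return (nr, nc)
    return agent_pos                          # blocked: a full solver would
                                              # first move the blocker away

grid = [["A", None], [None, None]]
pos = swap_toward_goal(grid, (0, 0), (1, 1))  # agent "A" moves to (1, 0)
```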

[662] HECATE: An ECS-based Framework for Teaching and Developing Multi-Agent Systems

Arthur Casals, Anarosa A. F. Brandão

Main category: cs.MA

TL;DR: HECATE is an ECS-based framework that bridges distributed systems engineering with multiagent systems development, reducing the need for specialized agent knowledge by leveraging familiar DS patterns.

DetailsMotivation: To simplify multiagent systems (MAS) development by integrating agent concepts into distributed systems (DS) domain, reducing the need for specialized agent knowledge and leveraging familiar DS patterns and standards.

Method: Built using Entity-Component-System (ECS) architectural pattern with data-oriented design to implement multiagent systems, engineering MAS from a distributed systems perspective.

Result: The framework’s architecture, core components, and implementation approach are presented, demonstrating support for different agent models.

Conclusion: HECATE successfully bridges the gap between distributed systems engineering and MAS development, simplifying MAS engineering by minimizing agent-specific knowledge requirements through familiar DS patterns.

Abstract: This paper introduces HECATE, a novel framework based on the Entity-Component-System (ECS) architectural pattern that bridges the gap between distributed systems engineering and MAS development. HECATE leverages data-oriented design to engineer multiagent systems (MAS) from a distributed systems (DS) perspective, integrating agent concepts directly into the DS domain. This approach simplifies MAS development by (i) reducing the need for specialized agent knowledge and (ii) leveraging familiar DS patterns and standards to minimize the agent-specific knowledge required for engineering MAS. We present the framework’s architecture, core components, and implementation approach, demonstrating how it supports different agent models.
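
For readers unfamiliar with the pattern, here is a minimal Entity-Component-System in the data-oriented style the paper builds on. The component and system names are invented for illustration and are not HECATE's API.

```python
from dataclasses import dataclass, field

@dataclass
class Position:        # component: plain data, no behavior
    x: float = 0.0
    y: float = 0.0

@dataclass
class BeliefStore:     # component: an agent concept expressed as data
    facts: dict = field(default_factory=dict)

class World:
    """Entities are just IDs; components are stored in flat tables."""
    def __init__(self):
        self.next_id = 0
        self.components = {}  # (entity_id, component_type) -> instance

    def create_entity(self, *components):
        eid, self.next_id = self.next_id, self.next_id + 1
        for c in components:
            self.components[(eid, type(c))] = c
        return eid

    def query(self, ctype):
        return [(eid, c) for (eid, t), c in self.components.items() if t is ctype]

def movement_system(world, dx, dy):
    # system: behavior applied to every entity holding a component
    for _, pos in world.query(Position):
        pos.x += dx
        pos.y += dy

world = World()
agent = world.create_entity(Position(), BeliefStore())
movement_system(world, 1.0, 0.5)
```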

[663] Nanobot Algorithms for Treatment of Diffuse Cancer

Noble Harasha, Nancy Lynch

Main category: cs.MA

TL;DR: Three nanobot algorithms for targeted drug delivery to diffuse cancer sites: KM (natural signals), KMA (amplified signals), and KMAR (adaptive repulsion). KMAR shows best overall performance.

DetailsMotivation: Improve targeted drug delivery for diffuse cancers by optimizing nanobot coordination and drug allocation according to site demands using chemical signaling.

Method: Mathematical modeling of nanobot behavior with three algorithms: KM (natural chemical gradients), KMA (amplified signals), and KMAR (adaptive repulsion from treated sites). Simulations across various cancer patterns.

Result: KM works well unless signals are weak; KMA improves speed but reduces success except for concentrated patterns; KMAR demonstrates robust performance across all cancer patterns.

Conclusion: KMAR algorithm with adaptive repulsion mechanism provides the most effective and robust solution for nanobot-based drug delivery to diffuse cancer sites.

Abstract: Motile nanosized particles, or “nanobots”, promise more effective and less toxic targeted drug delivery because of their unique scale and precision. We consider the case in which the cancer is “diffuse”, dispersed such that there are multiple distinct cancer sites. We investigate the problem of a swarm of nanobots locating these sites and treating them by dropping drug payloads at the sites. To improve the success of the treatment, the drug payloads must be allocated between sites according to their “demands”; this requires extra nanobot coordination. We present a mathematical model of the behavior of the nanobot agents and of their colloidal environment. This includes a movement model for agents based upon experimental findings from actual nanoparticles in which bots noisily ascend and descend chemical gradients. We present three algorithms: The first algorithm, called KM, is the most representative of reality, with agents simply following naturally existing chemical signals that surround each cancer site. The second algorithm, KMA, includes an additional chemical payload which amplifies the existing natural signals. The third algorithm, KMAR, includes another additional chemical payload which counteracts the other signals, instead inducing negative chemotaxis in agents such that they are repelled from sites that are already sufficiently treated. We present simulation results for all algorithms across different types of cancer arrangements. For KM, we show that the treatment is generally successful unless the natural chemical signals are weak, in which case the treatment progresses too slowly. For KMA, we demonstrate a significant improvement in treatment speed but a drop in eventual success, except for concentrated cancer patterns. For KMAR, our results show great performance across all types of cancer patterns, demonstrating robustness and adaptability.

[664] Game Theory and Multi-Agent Reinforcement Learning for Zonal Ancillary Markets

Francesco Morri, Hélène Le Cadre, Pierre Gruet, Luce Brotcorne

Main category: cs.MA

TL;DR: This paper analyzes zonal ancillary market coupling using game theory, formulating it as a bilevel problem and comparing exact optimization methods with multi-agent deep reinforcement learning on real German and Austrian data.

DetailsMotivation: To characterize and optimize zonal ancillary market coupling through game-theoretic approaches, addressing the need for efficient market mechanisms in energy systems with multiple interconnected zones.

Method: Formulated the ancillary market as a multi-leader single-follower bilevel problem, cast as a generalized Nash game. Used two exact approaches (integrated optimization and Gauss-Seidel best-response) and compared against multi-agent deep reinforcement learning on real data from Germany and Austria.

Result: Multi-agent deep reinforcement learning achieved smallest convergence rate but required pretraining, while best-response was slowest. Reinforcement learning resulted in smaller market costs but higher variability in profit allocation. Stronger zone coupling reduced costs for larger zones.

Conclusion: Multi-agent deep reinforcement learning shows promise for market optimization but requires careful consideration of profit distribution variability. Zone coupling benefits larger zones economically, providing insights for market design in interconnected energy systems.

Abstract: We characterize zonal ancillary market coupling relying on noncooperative game theory. To that purpose, we formulate the ancillary market as a multi-leader single-follower bilevel problem, which we subsequently cast as a generalized Nash game with side constraints and nonconvex feasibility sets. We determine conditions for equilibrium existence and show that the game has a generalized potential game structure. To compute market equilibrium, we rely on two exact approaches: an integrated optimization approach and Gauss-Seidel best-response, which we compare against multi-agent deep reinforcement learning. On real data from Germany and Austria, simulations indicate that multi-agent deep reinforcement learning achieves the smallest convergence rate but requires pretraining, while best-response is the slowest. On the economics side, multi-agent deep reinforcement learning results in smaller market costs compared to the exact methods, but at the cost of higher variability in the profit allocation among stakeholders. Further, stronger coupling between zones tends to reduce costs for larger zones.
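
Gauss-Seidel best-response is a generic equilibrium-seeking loop: players update one at a time against the others' latest strategies until no one moves. Below is a two-player quadratic toy for illustration, not the paper's market model.

```python
# two players with best responses BR_i(x_j) = a_i - b * x_j (b < 1 for contraction)
a = [1.0, 2.0]
b = 0.5
x = [0.0, 0.0]  # initial strategies

for _ in range(100):
    moved = 0.0
    for i in range(2):
        new_xi = a[i] - b * x[1 - i]   # best response to the other's strategy
        moved = max(moved, abs(new_xi - x[i]))
        x[i] = new_xi                  # Gauss-Seidel: reuse the updated value at once
    if moved < 1e-10:
        break

print(x)  # converges to the Nash equilibrium [0.0, 2.0] of this toy game
```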

[665] Emergent Social Dynamics of LLM Agents in the El Farol Bar Problem

Ryosuke Takata, Atsushi Masumori, Takashi Ikegami

Main category: cs.MA

TL;DR: LLM agents in an El Farol Bar problem showed emergent social dynamics, balancing game-theoretic rationality with human-like social motivations, creating new collective decision-making models.

DetailsMotivation: To investigate how LLM agents navigate social dilemmas and whether they exhibit human-like social behaviors in classic game theory problems.

Method: Used LLM agents in a spatially extended El Farol Bar problem with prompt-specified constraints (60% threshold) to observe their autonomous decision-making processes.

Result: LLM agents developed spontaneous motivation to attend the bar, formed collective decision-making patterns, and behaved more like humans than perfectly rational agents, though they did not completely solve the problem.

Conclusion: LLM agents naturally balance formal rationality with social preferences, enabling new models of group decision-making that traditional game theory couldn’t handle, revealing complex interplay between external and internal incentives.

Abstract: We investigate the emergent social dynamics of Large Language Model (LLM) agents in a spatially extended El Farol Bar problem, observing how they autonomously navigate this classic social dilemma. As a result, the LLM agents generated a spontaneous motivation to go to the bar and changed their decision making by becoming a collective. We also observed that the LLM agents did not solve the problem completely, but rather behaved more like humans. These findings reveal a complex interplay between external incentives (prompt-specified constraints such as the 60% threshold) and internal incentives (culturally-encoded social preferences derived from pre-training), demonstrating that LLM agents naturally balance formal game-theoretic rationality with social motivations that characterize human behavior. These findings suggest that a new model of group decision making, which could not be handled in the previous game-theoretic problem setting, can be realized by LLM agents.
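
For reference, the classic (non-LLM) El Farol dynamics with the 60% threshold can be simulated in a few lines; the noisy last-round predictor below is a stand-in for the heterogeneous strategies that make the problem interesting.

```python
import random

N_AGENTS, THRESHOLD, ROUNDS = 100, 0.6, 20
history = [random.randint(0, N_AGENTS)]  # initial attendance

for _ in range(ROUNDS):
    attendance = 0
    for _ in range(N_AGENTS):
        # each agent noisily extrapolates last round's crowding
        predicted = history[-1] / N_AGENTS + random.gauss(0, 0.1)
        if predicted < THRESHOLD:        # go only if the bar looks uncrowded
            attendance += 1
    history.append(attendance)

print(history)  # attendance tends to hover around the 60% threshold
```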

cs.MM

[666] Effectively obtaining acoustic, visual and textual data from videos

Jorge E. León, Miguel Carrasco

Main category: cs.MM

TL;DR: Proposes a method to create multimodal datasets (audio-image-text) from videos, addressing the shortage of such datasets for machine learning applications.

DetailsMotivation: The increasing demand for high-quality multimodal datasets combining acoustic, visual and textual data, which are currently limited in availability.

Method: Extracting related audio-image-text observations from videos by selecting suitable videos, extracting relevant data pairs, and generating descriptive texts using image-to-text models to ensure robust semantic connections between modalities.

Result: Created publicly available multimodal datasets that support research in multimodal data analysis and machine learning.

Conclusion: The proposed method successfully addresses the gap in multimodal dataset availability and provides a solution for generating high-quality audio-image-text datasets from video sources.

Abstract: The increasing use of machine learning models has amplified the demand for high-quality, large-scale multimodal datasets. However, the availability of such datasets, especially those combining acoustic, visual and textual data, remains limited. This paper addresses this gap by proposing a method to extract related audio-image-text observations from videos. We detail the process of selecting suitable videos, extracting relevant data pairs, and generating descriptive texts using image-to-text models. Our approach ensures a robust semantic connection between modalities, enhancing the utility of the created datasets for various applications. We also discuss the challenges encountered and propose solutions to improve data quality. The resulting datasets, publicly available, aim to support and advance research in multimodal data analysis and machine learning.

eess.AS

[667] Graph Connectionist Temporal Classification for Phoneme Recognition

Henry Grafé, Hugo Van hamme

Main category: eess.AS

TL;DR: Using Graph Temporal Classification (GTC) instead of CTC loss for Automatic Phoneme Recognition allows models to handle multiple pronunciation variants from G2P systems, improving phoneme error rates.

DetailsMotivation: Standard CTC loss cannot handle the ambiguity of multiple possible pronunciations per word that G2P systems generate, limiting training effectiveness with pseudo phoneme-level annotations.

Method: Adapted Graph Temporal Classification (GTC) to APR setting, enabling training from graphs of alternative phoneme sequences that represent multiple valid pronunciations per word.

Result: Experiments on English and Dutch datasets showed consistent improvement in phoneme error rates compared to CTC baseline when incorporating multiple pronunciations into training.

Conclusion: Integrating pronunciation variation into the loss function via GTC is a promising strategy for training APR systems from noisy G2P-based supervision.

Abstract: Automatic Phoneme Recognition (APR) systems are often trained using pseudo phoneme-level annotations generated from text through Grapheme-to-Phoneme (G2P) systems. These G2P systems frequently output multiple possible pronunciations per word, but the standard Connectionist Temporal Classification (CTC) loss cannot account for such ambiguity during training. In this work, we adapt Graph Temporal Classification (GTC) to the APR setting. GTC enables training from a graph of alternative phoneme sequences, allowing the model to consider multiple pronunciations per word as valid supervision. Our experiments on English and Dutch data sets show that incorporating multiple pronunciations per word into the training loss consistently improves phoneme error rates compared to a baseline trained with CTC. These results suggest that integrating pronunciation variation into the loss function is a promising strategy for training APR systems from noisy G2P-based supervision.
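
GTC itself runs the forward-backward recursion over a phoneme graph. As a crude illustration of why that matters, the stand-in below scores each G2P pronunciation variant with ordinary CTC and aggregates them with a logsumexp instead of committing to a single reference. This approximation is ours, not the paper's algorithm.

```python
import torch
import torch.nn.functional as F

def multi_pronunciation_ctc(log_probs, variants, input_len):
    """log_probs: (T, 1, n_classes) log-softmax frames with blank = 0;
    variants: list of phoneme-ID sequences (IDs >= 1) for one word."""
    log_likelihoods = []
    for v in variants:
        nll = F.ctc_loss(
            log_probs, torch.tensor([v]),
            torch.tensor([input_len]), torch.tensor([len(v)]),
            blank=0, reduction="sum")
        log_likelihoods.append(-nll)
    # treat every variant as valid supervision: log of summed likelihoods
    return -torch.logsumexp(torch.stack(log_likelihoods), dim=0)

T, n_classes = 50, 40
log_probs = F.log_softmax(torch.randn(T, 1, n_classes), dim=-1)
loss = multi_pronunciation_ctc(log_probs, [[7, 3, 12], [7, 9, 12]], T)
```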

[668] On the Contribution of Lexical Features to Speech Emotion Recognition

David Combei

Main category: eess.AS

TL;DR: Lexical content from speech achieves competitive emotion recognition performance compared to acoustic models, with 51.5% WF1 on MELD dataset vs 49.3% for acoustic-only approach.

DetailsMotivation: To investigate whether lexical content extracted from speech can match or outperform traditional acoustic-only approaches for speech emotion recognition.

Method: Analyzed different self-supervised speech and text representations, conducted layer-wise study of transformer-based encoders, and evaluated audio denoising effects.

Result: Lexical-based approach achieved 51.5% weighted F1-score on MELD dataset, outperforming acoustic-only pipeline (49.3%) despite having fewer parameters.

Conclusion: Lexical content plays a significant role in speech emotion recognition and can achieve competitive or superior performance compared to acoustic models.

Abstract: Although paralinguistic cues are often considered the primary drivers of speech emotion recognition (SER), we investigate the role of lexical content extracted from speech and show that it can achieve competitive and in some cases higher performance compared to acoustic models. On the MELD dataset, our lexical-based approach obtains a weighted F1-score (WF1) of 51.5%, compared to 49.3% for an acoustic-only pipeline with a larger parameter count. Furthermore, we analyze different self-supervised (SSL) speech and text representations, conduct a layer-wise study of transformer-based encoders, and evaluate the effect of audio denoising.

[669] Time-domain sound field estimation using kernel ridge regression

Jesper Brunnström, Martin Bo Møller, Jan Østergaard, Shoichi Koyama, Toon van Waterschoot, Marc Moonen

Main category: eess.AS

TL;DR: Generalizes kernel ridge regression from single-frequency to discrete-time sound field estimation, enabling time-domain analysis with physical realizability guarantees and improved performance using temporal and spatial priors.

DetailsMotivation: Existing kernel ridge regression methods for sound field estimation are limited to single-frequency analysis, restricting the types of data and prior knowledge that can be utilized.

Method: Proposes a generalized kernel ridge regression approach for discrete-time sound fields that provides closed-form time-domain estimates, ensures physical realizability, and incorporates time-domain data weighting to exploit prior knowledge of room impulse response behavior.

Result: The method demonstrates improved estimation performance using time-domain data weighting and can be combined with directional weighting to exploit both spatial and temporal properties of room impulse responses, validated with simulated and real data.

Conclusion: The proposed framework enables solving a broader class of sound field estimation problems by considering time-domain responses rather than separate frequency-domain analysis, expanding the applicability of kernel ridge regression in acoustic applications.

Abstract: Sound field estimation methods based on kernel ridge regression have proven effective, allowing for strict enforcement of physical properties, in addition to the inclusion of prior knowledge such as directionality of the sound field. These methods have been formulated for single-frequency sound fields, restricting the types of data and prior knowledge that can be used. In this paper, the kernel ridge regression approach is generalized to consider discrete-time sound fields. The proposed method provides time-domain sound field estimates that can be computed in closed form, are guaranteed to be physically realizable, and for which time-domain properties of the sound fields can be exploited to improve estimation performance. Exploiting prior information on the time-domain behaviour of room impulse responses, the estimation performance of the proposed method is shown to be improved using a time-domain data weighting, demonstrating the usefulness of the proposed approach. It is further shown using both simulated and real data that the time-domain data weighting can be combined with a directional weighting, exploiting prior knowledge of both spatial and temporal properties of the room impulse responses. The theoretical framework of the proposed method enables solving a broader class of sound field estimation problems with kernel ridge regression, in which the time-domain response must be considered rather than the response of each frequency separately.
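
The building block being generalized is ordinary kernel ridge regression, whose closed-form estimate is a few lines of NumPy. The RBF kernel and toy 1D field below are placeholders for the paper's physically constrained, time-domain kernel.

```python
import numpy as np

def krr_fit_predict(X_train, y_train, X_test, length_scale=0.5, lam=1e-3):
    """Closed-form kernel ridge regression with an RBF kernel."""
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * length_scale**2))

    K = rbf(X_train, X_train)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)
    return rbf(X_test, X_train) @ alpha   # estimate at new positions

# e.g. estimate a toy 1D field sampled at a few microphone positions
X = np.linspace(0, 1, 8)[:, None]
y = np.sin(2 * np.pi * X[:, 0])
print(krr_fit_predict(X, y, np.array([[0.33], [0.66]])))
```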

[670] From perception to production: how acoustic invariance facilitates articulatory learning in a self-supervised vocal imitation model

Marvin Lavechin, Thomas Hueber

Main category: eess.AS

TL;DR: Self-supervised learning model maps acoustic speech to articulatory movements using pre-trained wav2vec 2.0 representations, outperforming traditional MFCC features and providing insights into infant speech acquisition.

DetailsMotivation: Address how human infants solve the complex acoustic-to-articulatory mapping problem in speech acquisition without explicit instruction, inspired by developmental theories.

Method: Computational model with feature extractor (using wav2vec 2.0 intermediate layers), inverse model for articulatory parameter mapping, and speech synthesizer. Tested in single- and multi-speaker settings.

Result: wav2vec 2.0 intermediate layers provide optimal representations that enable learning human-like articulatory trajectories, discriminate articulation places, and produce intelligible speech, significantly outperforming MFCC features.

Conclusion: Self-supervised representations balancing phonetic discriminability with speaker invariance are critical for articulatory learning, providing computational evidence for developmental theories that perceptual learning guides articulatory development in infants.

Abstract: Human infants face a formidable challenge in speech acquisition: mapping extremely variable acoustic inputs into appropriate articulatory movements without explicit instruction. We present a computational model that addresses the acoustic-to-articulatory mapping problem through self-supervised learning. Our model comprises a feature extractor that transforms speech into latent representations, an inverse model that maps these representations to articulatory parameters, and a synthesizer that generates speech outputs. Experiments conducted in both single- and multi-speaker settings reveal that intermediate layers of a pre-trained wav2vec 2.0 model provide optimal representations for articulatory learning, significantly outperforming MFCC features. These representations enable our model to learn articulatory trajectories that correlate with human patterns, discriminate between places of articulation, and produce intelligible speech. Critical to successful articulatory learning are representations that balance phonetic discriminability with speaker invariance – precisely the characteristics of self-supervised representation learning models. Our findings provide computational evidence consistent with developmental theories proposing that perceptual learning of phonetic categories guides articulatory development, offering insights into how infants might acquire speech production capabilities despite the complex mapping problem they face.
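
Probing intermediate wav2vec 2.0 layers, as the study does, is straightforward with the Hugging Face transformers API. The checkpoint name and layer index below are illustrative choices, not the paper's exact configuration.

```python
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

wav = torch.randn(1, 16000)  # one second of 16 kHz audio (toy input)
with torch.no_grad():
    out = model(wav, output_hidden_states=True)

# hidden_states[0] is the CNN feature projection; transformer layers follow
layer_8 = out.hidden_states[8]  # (1, n_frames, 768) candidate representations
```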

[671] Beamforming-LLM: What, Where and When Did I Miss?

Vishal Choudhari

Main category: eess.AS

TL;DR: Beamforming-LLM is a system that uses spatial audio and AI to help users recall missed conversations through natural language queries, providing summaries and audio playback.

DetailsMotivation: To enable users to semantically recall conversations they missed in multi-speaker environments, addressing the challenge of following complex discussions and providing assistive technology for auditory memory.

Method: Combines spatial audio capture with microphone arrays, beamforming for directional audio separation, Whisper transcription, vector embedding with sentence encoders, and RAG with GPT-4o-mini for retrieval and summarization.

Result: A user-friendly interface that provides contrastive summaries, spatial context, and timestamped audio playback for missed conversation segments.

Conclusion: This work establishes the foundation for intelligent auditory memory systems with applications in assistive technology, meeting summarization, and context-aware spatial computing.

Abstract: We present Beamforming-LLM, a system that enables users to semantically recall conversations they may have missed in multi-speaker environments. The system combines spatial audio capture using a microphone array with retrieval-augmented generation (RAG) to support natural language queries such as, “What did I miss when I was following the conversation on dogs?” Directional audio streams are separated using beamforming, transcribed with Whisper, and embedded into a vector database using sentence encoders. Upon receiving a user query, semantically relevant segments are retrieved, temporally aligned with non-attended segments, and summarized using a lightweight large language model (GPT-4o-mini). The result is a user-friendly interface that provides contrastive summaries, spatial context, and timestamped audio playback. This work lays the foundation for intelligent auditory memory systems and has broad applications in assistive technology, meeting summarization, and context-aware personal spatial computing.
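
The directional-separation step can be illustrated with far-field delay-and-sum beamforming. The array geometry, sample rate, and look direction here are assumptions, and the deployed system may use a more sophisticated beamformer.

```python
import numpy as np

def delay_and_sum(mic_signals, mic_positions, look_direction, fs=16000, c=343.0):
    """mic_signals: (n_mics, n_samples); mic_positions: (n_mics, 3) in meters;
    look_direction: unit vector from the array toward the target talker."""
    adv = mic_positions @ look_direction / c          # arrival-time advance per mic
    d = np.round((adv - adv.min()) * fs).astype(int)  # in samples, non-negative
    n = mic_signals.shape[1] - d.max()
    aligned = np.stack([sig[d.max() - di:d.max() - di + n]
                        for sig, di in zip(mic_signals, d)])
    return aligned.mean(axis=0)                       # coherent average

mics = np.array([[0.00, 0, 0], [0.05, 0, 0], [0.10, 0, 0]])  # 5 cm line array
x = np.random.randn(3, 16000)
y = delay_and_sum(x, mics, np.array([1.0, 0.0, 0.0]))
```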

[672] Speaker Privacy and Security in the Big Data Era: Protection and Defense against Deepfake

Liping Chen, Kong Aik Lee, Zhen-Hua Ling, Xin Wang, Rohan Kumar Das, Tomoki Toda, Haizhou Li

Main category: eess.AS

TL;DR: Overview of voice anonymization, deepfake detection, and watermarking techniques for protecting against deepfake speech misuse, covering methodologies, advancements, and challenges.

DetailsMotivation: Address security threats from deepfake speech misuse in the era of big data, which has caused significant societal costs worldwide.

Method: Provides a concise review of three defense techniques: voice anonymization (protects voice attributes from extraction), deepfake detection, and watermarking.

Result: Comprehensive overview of current methodologies, recent advancements, and existing challenges in deepfake speech defense systems.

Conclusion: These three techniques represent the main approaches to combat deepfake speech threats, with a more detailed comprehensive version to be published soon.

Abstract: In the era of big data, remarkable advancements have been achieved in personalized speech generation techniques that utilize speaker attributes, including voice and speaking style, to generate deepfake speech. This has also amplified global security risks from deepfake speech misuse, resulting in considerable societal costs worldwide. To address the security threats posed by deepfake speech, techniques have been developed focusing on both the protection of voice attributes and the defense against deepfake speech. Among them, the voice anonymization technique has been developed to protect voice attributes from extraction for deepfake generation, while deepfake detection and watermarking have been utilized to defend against the misuse of deepfake speech. This paper provides a short and concise overview of the three techniques, describing the methodologies, advancements, and challenges. A comprehensive version, offering additional discussions, will be published in the near future.

[673] Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos

Davide Berghi, Philip J. B. Jackson

Main category: eess.AS

TL;DR: Enhanced stereo sound event localization and detection with distance estimation using pre-trained language-aligned models (CLAP for audio, OWL-ViT for vision) integrated into a Cross-Modal Conformer architecture, achieving 2nd place in DCASE 2025 Challenge.

DetailsMotivation: Traditional SELD approaches rely on multichannel input and struggle with semantic reasoning across spatial, temporal, and semantic dimensions. Limited by data constraints for large-scale pre-training.

Method: Integrated pre-trained contrastive language-aligned models (CLAP audio, OWL-ViT visual) into modified Conformer module (Cross-Modal Conformer). Used large synthetic datasets with left-right channel swapping augmentation, model ensembling, and visual post-processing.

Result: Achieved second rank in DCASE 2025 Challenge Task 3 (Track B), demonstrating effectiveness of the approach through ablation studies on DCASE2025 Task3 development set.

Conclusion: Combining extensive pre-training with language-aligned models and multimodal fusion effectively addresses 3D SELD challenges. Future work will explore modality-specific contributions and architectural refinements.

Abstract: In this study, we address the multimodal task of stereo sound event localization and detection with source distance estimation (3D SELD) in regular video content. 3D SELD is a complex task that combines temporal event classification with spatial localization, requiring reasoning across spatial, temporal, and semantic dimensions. The last is arguably the most challenging to model. Traditional SELD approaches typically rely on multichannel input, limiting their capacity to benefit from large-scale pre-training due to data constraints. To overcome this, we enhance a standard SELD architecture with semantic information by integrating pre-trained, contrastive language-aligned models: CLAP for audio and OWL-ViT for visual inputs. These embeddings are incorporated into a modified Conformer module tailored for multimodal fusion, which we refer to as the Cross-Modal Conformer. We perform an ablation study on the development set of the DCASE2025 Task3 Stereo SELD Dataset to assess the individual contributions of the language-aligned models and benchmark against the DCASE Task 3 baseline systems. Additionally, we detail the curation process of large synthetic audio and audio-visual datasets used for model pre-training. These datasets were further expanded through left-right channel swapping augmentation. Our approach, combining extensive pre-training, model ensembling, and visual post-processing, achieved second rank in the DCASE 2025 Challenge Task 3 (Track B), underscoring the effectiveness of our method. Future work will explore the modality-specific contributions and architectural refinements.
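
At its core, the fusion amounts to cross-attention from SELD features onto projected semantic embeddings. The sketch below shows that residual cross-attention step in isolation, with invented dimensions, and omits the surrounding Conformer machinery.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """SELD features attend over language-aligned embeddings
    (e.g. projected CLAP or OWL-ViT outputs)."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, seld_feats, semantic_embs):
        # seld_feats: (B, T, d); semantic_embs: (B, S, d)
        attended, _ = self.attn(seld_feats, semantic_embs, semantic_embs)
        return self.norm(seld_feats + attended)   # residual fusion

fusion = CrossModalFusion()
out = fusion(torch.randn(2, 100, 256), torch.randn(2, 10, 256))
```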

[674] LS-EEND: Long-Form Streaming End-to-End Neural Diarization with Online Attractor Extraction

Di Liang, Xiaofei Li

Main category: eess.AS

TL;DR: Proposes LS-EEND, a frame-wise online neural diarization method with causal embedding encoder and online attractor decoder that handles up to 8 speakers in hour-long recordings with state-of-the-art performance.

DetailsMotivation: To develop a streaming end-to-end neural diarization system that can handle flexible numbers of speakers and very long audio recordings in real-time with linear computational complexity.

Method: Uses causal embedding encoder and self-attention-based online attractor decoder with retention mechanism for linear temporal complexity. Employs multi-step progressive training from easy to hard tasks. Frame-in-frame-out processing for streaming diarization.

Result: Achieves new SOTA online diarization error rates: CALLHOME (12.11%), DIHARD II (27.58%), DIHARD III (19.61%), AMI (20.76%). Several times lower real-time-factor than comparison models.

Conclusion: LS-EEND enables efficient streaming diarization for high speaker counts and long recordings with superior performance and computational efficiency.

Abstract: This work proposes a frame-wise online/streaming end-to-end neural diarization (EEND) method, which detects speaker activities in a frame-in-frame-out fashion. The proposed model mainly consists of a causal embedding encoder and an online attractor decoder. Speakers are modeled in the self-attention-based decoder along both the time and speaker dimensions, and frame-wise speaker attractors are automatically generated and updated for new speakers and existing speakers, respectively. Retention mechanism is employed and especially adapted for long-form diarization with a linear temporal complexity. A multi-step progressive training strategy is proposed for gradually learning from easy tasks to hard tasks in terms of the number of speakers and audio length. Finally, the proposed model (referred to as long-form streaming EEND, LS-EEND) is able to perform streaming diarization for a high (up to 8) and flexible number speakers and very long (say one hour) audio recordings. Experiments on various simulated and real-world datasets show that: 1) when not using oracle speech activity information, the proposed model achieves new state-of-the-art online diarization error rate on all datasets, including CALLHOME (12.11%), DIHARD II (27.58%), DIHARD III (19.61%), and AMI (20.76%); 2) Due to the frame-in-frame-out processing fashion and the linear temporal complexity, the proposed model achieves several times lower real-time-factor than comparison online diarization models.

[675] SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline

Helin Wang, Jiarui Hai, Dongchao Yang, Chen Chen, Kai Li, Junyi Peng, Thomas Thebaud, Laureano Moro Velazquez, Jesus Villalba, Najim Dehak

Main category: eess.AS

TL;DR: SoloSpeech is a novel cascaded generative pipeline for target speech extraction that achieves state-of-the-art intelligibility and quality while addressing artifacts and generalization issues in existing methods.

DetailsMotivation: Current discriminative TSE models produce unwanted artifacts and reduce naturalness, while generative models lag in perceptual quality and intelligibility. The authors aim to overcome these limitations with a better approach.

Method: A cascaded generative pipeline with compression, extraction, reconstruction, and correction processes. Features a speaker-embedding-free target extractor that uses conditional information from cue audio’s latent space aligned with mixture audio’s latent space.

Result: Achieves new state-of-the-art intelligibility and quality on Libri2Mix dataset, with exceptional generalization on out-of-domain data and real-world scenarios.

Conclusion: SoloSpeech successfully addresses the limitations of both discriminative and generative TSE models, providing superior performance and generalization capabilities.

Abstract: Target Speech Extraction (TSE) aims to isolate a target speaker’s voice from a mixture of multiple speakers by leveraging speaker-specific cues, typically provided as auxiliary audio (a.k.a. cue audio). Although recent advancements in TSE have primarily employed discriminative models that offer high perceptual quality, these models often introduce unwanted artifacts, reduce naturalness, and are sensitive to discrepancies between training and testing environments. On the other hand, generative models for TSE lag in perceptual quality and intelligibility. To address these challenges, we present SoloSpeech, a novel cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction processes. SoloSpeech features a speaker-embedding-free target extractor that utilizes conditional information from the cue audio’s latent space, aligning it with the mixture audio’s latent space to prevent mismatches. Evaluated on the widely-used Libri2Mix dataset, SoloSpeech achieves the new state-of-the-art intelligibility and quality in target speech extraction while demonstrating exceptional generalization on out-of-domain data and real-world scenarios.

eess.IV

[676] A Synthetic-to-Real Dehazing Method based on Domain Unification

Zhiqiang Yuan, Jinchao Zhang, Jie Zhou

Main category: eess.IV

TL;DR: A domain unification method for synthetic-to-real image dehazing that addresses performance degradation caused by distribution shift between synthetic and real domains due to imperfect clean data collection.

DetailsMotivation: Deep learning-based dehazing methods suffer from performance degradation when applied to real-world hazy images due to distribution shift. This shift stems from imperfect clean data collection in real scenes, making atmospheric physics models inconsistent between synthetic and real domains.

Method: Proposes a synthetic-to-real dehazing method based on domain unification, which aims to unify the relationship between real and synthetic domains to make the dehazing model more aligned with actual situations.

Result: Extensive experiments show the proposed method significantly outperforms state-of-the-art methods on real-world images, both qualitatively and quantitatively.

Conclusion: The domain unification approach effectively bridges the gap between synthetic and real domains for image dehazing, resulting in superior performance on real-world hazy images compared to existing methods.

Abstract: Due to distribution shift, the performance of deep learning-based methods for image dehazing is adversely affected when applied to real-world hazy images. In this paper, we find that such deviation between the real and synthetic domains in the dehazing task may come from the imperfect collection of clean data. Owing to the complexity of the scene and the effect of depth, the collected clean data cannot strictly meet the ideal conditions, which makes the atmospheric physics model in the real domain inconsistent with that in the synthetic domain. For this reason, we propose a synthetic-to-real dehazing method based on domain unification, which attempts to unify the relationship between the real and synthetic domains, thus making the dehazing model more consistent with the actual situation. Extensive experiments qualitatively and quantitatively demonstrate that the proposed dehazing method significantly outperforms state-of-the-art methods on real-world images.
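
The atmospheric physics model the abstract refers to is the standard scattering equation $I(x) = J(x)\,t(x) + A\,(1 - t(x))$ with transmission $t(x) = e^{-\beta d(x)}$. The toy code below synthesizes haze from clean data and inverts it when $t$ is known exactly, the ideal condition that, per the paper, real collected data fail to meet.

```python
import numpy as np

def hazy_from_clean(J, depth, A=0.9, beta=1.0):
    t = np.exp(-beta * depth)[..., None]   # per-pixel transmission
    return J * t + A * (1 - t)             # atmospheric scattering model

def dehaze_given_t(I, t, A=0.9, t_min=0.1):
    t = np.clip(t, t_min, 1.0)[..., None]
    return (I - A * (1 - t)) / t           # invert the model for J

J = np.random.rand(4, 4, 3)                # toy clean image
d = np.random.rand(4, 4)                   # toy depth map
I = hazy_from_clean(J, d)
print(np.allclose(J, dehaze_given_t(I, np.exp(-d))))  # True when t is exact
```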

[677] Stabilizing RED using the Koopman Operator

Shraddha Chavan, Kunal N. Chaudhury

Main category: eess.IV

TL;DR: Proposes Koopman operator-based stabilization for RED framework to prevent instability from black-box denoisers while maintaining reconstruction quality.

DetailsMotivation: The RED framework using pretrained denoisers can lead to instability despite high-fidelity reconstructions, requiring a stabilization mechanism.

Method: Uses Koopman operator to capture local dynamics of RED in low-dimensional feature space, with spectral radius detection for instability and adaptive step-size formulation.

Result: Effective stabilization demonstrated with several pretrained denoisers, providing model-agnostic solution with modest overhead and no retraining required.

Conclusion: Koopman operator-based approach successfully stabilizes RED framework while maintaining its reconstruction performance benefits.

Abstract: The widely used RED (Regularization-by-Denoising) framework uses pretrained denoisers as implicit regularizers for model-based reconstruction. Although RED generally yields high-fidelity reconstructions, the use of black-box denoisers can sometimes lead to instability. In this letter, we propose a data-driven mechanism to stabilize RED using the Koopman operator, a classical tool for analyzing dynamical systems. Specifically, we use the operator to capture the local dynamics of RED in a low-dimensional feature space, and its spectral radius is used to detect instability and formulate an adaptive step-size rule that is model-agnostic, has modest overhead, and requires no retraining. We test this with several pretrained denoisers to demonstrate the effectiveness of the proposed Koopman stabilization.
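
A schematic of the mechanism, under our own simplifying assumptions: run the RED fixed-point iteration, fit a DMD-style linear operator to a short window of iterate features, and halve the step size whenever the estimated spectral radius exceeds one. The features, thresholds, and denoiser are placeholders, not the paper's exact construction.

```python
import numpy as np

def spectral_radius_of_dynamics(feats):
    """feats: (n_steps, d) feature trajectory; fit Y ~ A X by least squares."""
    X, Y = feats[:-1].T, feats[1:].T        # (d, n-1) each
    A = Y @ np.linalg.pinv(X)               # DMD-style operator estimate
    return np.abs(np.linalg.eigvals(A)).max()

def red_step(x, grad_f, denoiser, eta, lam):
    # RED update: gradient of data term plus denoiser residual regularizer
    return x - eta * (grad_f(x) + lam * (x - denoiser(x)))

def stabilized_red(x0, grad_f, denoiser, eta=1.0, lam=0.1, iters=50, window=8):
    x, feats = x0, []
    for _ in range(iters):
        x = red_step(x, grad_f, denoiser, eta, lam)
        feats.append(x.ravel()[:16].copy())  # crude low-dimensional features
        if len(feats) > window:
            rho = spectral_radius_of_dynamics(np.array(feats[-window:]))
            if rho > 1.0:                    # instability detected
                eta *= 0.5                   # adaptive step-size rule
    return x

# toy problem: recover y with an averaging "denoiser"
y = np.random.randn(32)
x = stabilized_red(np.zeros(32),
                   grad_f=lambda x: x - y,
                   denoiser=lambda x: np.convolve(x, np.ones(3) / 3, mode="same"))
```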

[678] CardiacFlow: 3D+t Four-Chamber Cardiac Shape Completion and Generation via Flow Matching

Qiang Ma, Qingjie Meng, Mengyun Qiao, Paul M. Matthews, Declan P. O’Regan, Wenjia Bai

Main category: eess.IV

TL;DR: Flow matching techniques for 3D+t cardiac shape generation, augmentation, and completion from limited multi-view CMR data, achieving 16% error reduction and superior performance.

DetailsMotivation: Learning 3D+t shape completion and generation from multi-view cardiac MRI requires large amounts of high-resolution 3D whole-heart segmentations, which are limited in availability.

Method: Proposed CardiacFlow - a flow matching framework with latent rectified flow for data augmentation, label completion network for 3D+t reconstruction from sparse views, and one-step generative flow conditioned on time encoding.

Result: Flow-based augmentation reduces geometric errors by 16% in 3D shape completion. CardiacFlow achieves superior generation quality and periodic consistency on UK Biobank dataset compared to baselines.

Conclusion: Flow matching techniques effectively address data scarcity in cardiac shape analysis, enabling robust 3D+t generation, augmentation, and completion from limited multi-view CMR data.

Abstract: Learning 3D+t shape completion and generation from multi-view cardiac magnetic resonance (CMR) images requires a large amount of high-resolution 3D whole-heart segmentations (WHS) to capture shape priors. In this work, we leverage flow matching techniques to learn deep generative flows for augmentation, completion, and generation of 3D+t shapes of four cardiac chambers represented implicitly by segmentations. Firstly, we introduce a latent rectified flow to generate 3D cardiac shapes for data augmentation, learnt from a limited number of 3D WHS data. Then, a label completion network is trained on both real and synthetic data to reconstruct 3D+t shapes from sparse multi-view CMR segmentations. Lastly, we propose CardiacFlow, a novel one-step generative flow model for efficient 3D+t four-chamber cardiac shape generation, conditioned on the periodic Gaussian kernel encoding of time frames. The experiments on the WHS datasets demonstrate that flow-based data augmentation reduces geometric errors by 16% in 3D shape completion. The evaluation on the UK Biobank dataset validates that CardiacFlow achieves superior generation quality and periodic consistency compared to existing baselines.
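
The training principle is conditional flow matching with a straight-line (rectified) probability path: regress a velocity field onto the constant displacement $x_1 - x_0$. A generic sketch follows; the toy MLP and conditioning stand in for CardiacFlow's segmentation decoder and periodic time encoding.

```python
import torch
import torch.nn as nn

def flow_matching_loss(velocity_net, x1, cond):
    """x1: (B, D) target samples (e.g. flattened shapes); cond: (B, C)."""
    x0 = torch.randn_like(x1)               # noise sample
    t = torch.rand(x1.size(0), 1)           # random interpolation time
    xt = (1 - t) * x0 + t * x1              # straight-line (rectified) path
    target_v = x1 - x0                      # constant target velocity
    pred_v = velocity_net(torch.cat([xt, t, cond], dim=-1))
    return ((pred_v - target_v) ** 2).mean()

net = nn.Sequential(nn.Linear(16 + 1 + 4, 64), nn.ReLU(), nn.Linear(64, 16))
loss = flow_matching_loss(net, torch.randn(8, 16), torch.randn(8, 4))
```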

[679] Brain Tumor Detection Through Diverse CNN Architectures in IoT Healthcare Industries: Fast R-CNN, U-Net, Transfer Learning-Based CNN, and Fully Connected CNN

Mohsen Asghari Ilani, Yaser M. Banad

Main category: eess.IV

TL;DR: AI-powered deep learning models (R-CNN, UNet, CNN, and transfer learning) achieve high accuracy (up to 99%) in brain tumor classification from MRI images within IoT-healthcare systems, enabling earlier diagnosis and improved patient outcomes.

DetailsMotivation: Brain health is critical and accurate diagnosis is essential for effective treatment. MRI provides big data for AI-driven classification, and IoT-healthcare systems can leverage real-time data for timely interventions.

Method: Used Region-based CNN (R-CNN), UNet architectures, CNN, and transfer learning models (Inception-V3, EfficientNetB4, VGG19) for classifying glioma, meningioma, and pituitary tumors from MRI images. Performance assessed using F-score, recall, precision, and accuracy.

Result: Fast R-CNN achieved best results: 99% accuracy, 98.5% F-score, 99.5% AUC, 99.4% recall, 98.5% precision. EfficientNetB2 showed strongest cross-dataset validation: 92.23% accuracy, 92.11% precision/recall, 95.96% specificity.

Conclusion: Combining R-CNN, UNet, and transfer learning enables earlier diagnosis and more effective treatment in IoT-healthcare systems. AI models demonstrate robustness across diverse datasets, enhancing brain tumor classification and patient care.

Abstract: Artificial intelligence (AI)-powered deep learning has advanced brain tumor diagnosis in Internet of Things (IoT)-healthcare systems, achieving high accuracy with large datasets. Brain health is critical to human life, and accurate diagnosis is essential for effective treatment. Magnetic Resonance Imaging (MRI) provides key data for brain tumor detection, serving as a major source of big data for AI-driven image classification. In this study, we classified glioma, meningioma, and pituitary tumors from MRI images using Region-based Convolutional Neural Network (R-CNN) and UNet architectures. We also applied Convolutional Neural Networks (CNN) and CNN-based transfer learning models such as Inception-V3, EfficientNetB4, and VGG19. Model performance was assessed using F-score, recall, precision, and accuracy. The Fast R-CNN achieved the best results with 99% accuracy, 98.5% F-score, 99.5% Area Under the Curve (AUC), 99.4% recall, and 98.5% precision. Combining R-CNN, UNet, and transfer learning enables earlier diagnosis and more effective treatment in IoT-healthcare systems, improving patient outcomes. IoT devices such as wearable monitors and smart imaging systems continuously collect real-time data, which AI algorithms analyze to provide immediate insights for timely interventions and personalized care. For external cohort cross-dataset validation, EfficientNetB2 achieved the strongest performance among fine-tuned EfficientNet models, with 92.11% precision, 92.11% recall/sensitivity, 95.96% specificity, 92.02% F1-score, and 92.23% accuracy. These findings underscore the robustness and reliability of AI models in handling diverse datasets, reinforcing their potential to enhance brain tumor classification and patient care in IoT healthcare environments.
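
A typical transfer-learning setup of the kind evaluated, shown here with torchvision's EfficientNet-B4; the freezing depth and head are illustrative choices rather than the paper's exact recipe.

```python
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained backbone with a new 3-class head for
# glioma / meningioma / pituitary MRI classification
weights = models.EfficientNet_B4_Weights.IMAGENET1K_V1
model = models.efficientnet_b4(weights=weights)

for p in model.features.parameters():
    p.requires_grad = False  # freeze the convolutional features

in_features = model.classifier[1].in_features
model.classifier[1] = nn.Linear(in_features, 3)  # tumor-class logits
```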

[680] Application Space and the Rate-Distortion-Complexity Analysis of Neural Video CODECs

Ricardo L. de Queiroz, Diogo C. Garcia, Yi-Hsin Chen, Ruhan Conceição, Wen-Hsiao Peng, Luciano V. Agostini

Main category: eess.IV

TL;DR: This paper proposes a 3D rate-distortion-complexity (RDC) analysis framework for video codec selection, extending the 2D BD metric to include computational complexity and using Lagrangian cost optimization to compare codecs across different application requirements.

DetailsMotivation: Traditional video compression evaluation focuses on rate-distortion tradeoffs but ignores computational complexity. The authors aim to develop a comprehensive framework that considers all three dimensions (rate, distortion, complexity) to better match codecs to specific application requirements.

Method: The authors generalize the 2D Bjontegaard delta metric to 3D RDC space, formulate Lagrangian cost D+λR+γC to evaluate codec performance, and define application-specific (λ,γ) parameters. They apply this framework to compare state-of-the-art neural video codecs across different application scenarios.

Result: The analysis revealed that only four neural video codecs emerged as optimal for different applications depending on their specific (λ,γ) requirements. The results were both informative and surprising, showing clear performance differentiation among codecs in the 3D RDC space.

Conclusion: The proposed RDC framework provides a more comprehensive approach to video codec selection by incorporating computational complexity alongside traditional rate-distortion metrics. This enables better matching of codecs to specific application requirements through appropriate (λ,γ) parameter selection.

Abstract: We study the decision-making process for choosing video compression systems through a rate-distortion-complexity (RDC) analysis. We discuss the 2D Bjontegaard delta (BD) metric and formulate generalizations in an attempt to extend its notions to the 3D RDC volume. We follow that discussion with another one on the computation of metrics in the RDC volume, and on how to define and measure the cost of a coder-decoder (codec) pair, where the codec is characterized by a cloud of points in the RDC space. We use a Lagrangian cost $D+\lambda R + \gamma C$, such that choosing the best video codec among a number of candidates for an application demands selecting appropriate $(\lambda, \gamma)$ values. Thus, we argue that an application may be associated with a $(\lambda, \gamma)$ point in the application space. An example streaming application is given as a case study to set a particular point in the $(\lambda, \gamma)$ plane. The result is that we can compare Lagrangian costs in an RDC volume for different codecs for a given application. Furthermore, we can span the plane and compare codecs for the entire application space filled with different $(\lambda, \gamma)$ choices. We then compare several state-of-the-art neural video codecs using the proposed metrics. Results are informative and surprising. We find that, within our RDC computation constraints, only four neural video codecs come out as the best suited for any application, depending on where its desirable $(\lambda, \gamma)$ lies.
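
To make the codec-selection rule concrete, here is a minimal sketch of picking the cheapest codec for an application at a given $(\lambda, \gamma)$ point, where each codec is a cloud of RDC operating points; the numbers are placeholders, not measured data.

```python
from dataclasses import dataclass

@dataclass
class OperatingPoint:
    distortion: float  # e.g., MSE or an inverted quality score
    rate: float        # e.g., bits per pixel
    complexity: float  # e.g., decode time in ms per frame

def codec_cost(points, lam, gamma):
    """Cost of a codec = best Lagrangian cost D + lam*R + gamma*C
    over its cloud of RDC operating points."""
    return min(p.distortion + lam * p.rate + gamma * p.complexity for p in points)

def best_codec(codecs, lam, gamma):
    """Pick the codec with the lowest cost for an application at (lam, gamma)."""
    return min(codecs, key=lambda name: codec_cost(codecs[name], lam, gamma))

codecs = {
    "codec_a": [OperatingPoint(2.0, 0.10, 5.0), OperatingPoint(1.5, 0.20, 9.0)],
    "codec_b": [OperatingPoint(1.8, 0.12, 2.0), OperatingPoint(1.2, 0.30, 4.0)],
}
# A latency-sensitive application weights complexity heavily (large gamma).
print(best_codec(codecs, lam=10.0, gamma=0.5))
```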

[681] Imagining Alternatives: Towards High-Resolution 3D Counterfactual Medical Image Generation via Language Guidance

Mohamed Mohamed, Brennan Nichyporuk, Douglas L. Arnold, Tal Arbel

Main category: eess.IV

TL;DR: First framework for generating high-resolution 3D counterfactual medical images using language prompts, adapting 3D diffusion models for neurological imaging with enhanced text alignment.

DetailsMotivation: Address the gap in 3D medical image generation where pretrained foundation models are unavailable, enabling clinical applications like personalized counterfactual explanations, disease progression simulation, and enhanced medical training.

Method: Adapt state-of-the-art 3D diffusion models with enhancements from Simple Diffusion, incorporate augmented conditioning to improve text alignment and image quality for native 3D generation.

Result: Successfully simulates varying counterfactual lesion loads in Multiple Sclerosis and cognitive states in Alzheimer’s disease on two neurological MRI datasets, generating high-quality images while preserving subject fidelity.

Conclusion: Lays groundwork for prompt-driven disease progression analysis in 3D medical imaging, representing the first demonstration of language-guided native-3D diffusion model specifically for neurological imaging data.

Abstract: Vision-language models have demonstrated impressive capabilities in generating 2D images under various conditions; however, this 2D performance is largely enabled by extensive, readily available pretrained foundation models. Critically, comparable pretrained foundation models do not exist for 3D, significantly limiting progress in this domain. As a result, the potential of vision-language models to produce high-resolution 3D counterfactual medical images conditioned solely on natural language descriptions remains completely unexplored. Addressing this gap would enable powerful clinical and research applications, such as personalized counterfactual explanations, simulation of disease progression scenarios, and enhanced medical training by visualizing hypothetical medical conditions in realistic detail. Our work takes a meaningful step toward addressing this challenge by introducing a framework capable of generating high-resolution 3D counterfactual medical images of synthesized patients guided by free-form language prompts. We adapt state-of-the-art 3D diffusion models with enhancements from Simple Diffusion and incorporate augmented conditioning to improve text alignment and image quality. To our knowledge, this represents the first demonstration of a language-guided native-3D diffusion model applied specifically to neurological imaging data, where faithful three-dimensional modeling is essential to represent the brain's structure. Through results on two distinct neurological MRI datasets, our framework successfully simulates varying counterfactual lesion loads in Multiple Sclerosis (MS) and cognitive states in Alzheimer's disease, generating high-quality images while preserving subject fidelity in synthetically generated medical images. Our results lay the groundwork for prompt-driven disease progression analysis within 3D medical imaging.
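
The abstract describes adapting 3D diffusion models with augmented text conditioning but not the sampler itself, so the sketch below shows only the generic classifier-free-guidance step that text-guided diffusion models typically rely on; `denoiser` and the embeddings are hypothetical stand-ins.

```python
import torch

def cfg_noise_estimate(denoiser, x_t, t, text_emb, null_emb, scale=3.0):
    """One classifier-free-guidance step: blend conditional and unconditional
    noise predictions. `denoiser` is any 3D UNet-style model taking
    (volume, timestep, conditioning) and returning predicted noise."""
    eps_cond = denoiser(x_t, t, text_emb)      # guided by the language prompt
    eps_uncond = denoiser(x_t, t, null_emb)    # null / empty-prompt embedding
    return eps_uncond + scale * (eps_cond - eps_uncond)
```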

[682] FASL-Seg: Anatomy and Tool Segmentation of Surgical Scenes

Muraam Abdel-Ghani, Mahmoud Ali, Mohamed Ali, Fatmaelzahraa Ahmed, Mohamed Arsalan, Abdulaziz Al-Ali, Shidin Balakrishnan

Main category: eess.IV

TL;DR: FASL-Seg model improves surgical scene segmentation by capturing both high-level contextual and low-level edge features through dual processing streams, achieving state-of-the-art performance on benchmark datasets.

DetailsMotivation: Existing surgical segmentation models focus mainly on tools and overlook anatomical objects, while struggling to balance high-level contextual and low-level edge feature capture.

Method: Proposed Feature-Adaptive Spatial Localization model (FASL-Seg) with two distinct processing streams: Low-Level Feature Projection (LLFP) and High-Level Feature Projection (HLFP) for varying feature resolutions.

Result: Achieved mIoU of 72.71% on EndoVis18 parts/anatomy segmentation (5% improvement over SOTA), 85.61% on EndoVis18 tool segmentation, and 72.78% on EndoVis17 tool segmentation.

Conclusion: Dual processing streams for varying feature resolutions effectively enable precise segmentation of both anatomy and surgical instruments, demonstrating consistent performance across different classes.

Abstract: The growing popularity of robotic minimally invasive surgeries has made deep learning-based surgical training a key area of research. A thorough understanding of the surgical scene components is crucial, which semantic segmentation models can help achieve. However, most existing work focuses on surgical tools and overlooks anatomical objects. Additionally, current state-of-the-art (SOTA) models struggle to balance capturing high-level contextual features and low-level edge features. We propose a Feature-Adaptive Spatial Localization model (FASL-Seg), designed to capture features at multiple levels of detail through two distinct processing streams, namely a Low-Level Feature Projection (LLFP) and a High-Level Feature Projection (HLFP) stream, for varying feature resolutions, enabling precise segmentation of anatomy and surgical instruments. We evaluated FASL-Seg on the surgical segmentation benchmark datasets EndoVis18 and EndoVis17 across three use cases. FASL-Seg achieves a mean Intersection over Union (mIoU) of 72.71% on parts and anatomy segmentation in EndoVis18, improving on SOTA by 5%. It further achieves mIoU of 85.61% and 72.78% in EndoVis18 and EndoVis17 tool type segmentation, respectively, outperforming overall SOTA performance while matching per-class SOTA results in both datasets. Its consistent performance across anatomy and instrument classes demonstrates the effectiveness of distinct processing streams for varying feature resolutions.
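
A toy sketch of the two-stream idea, assuming one stream that preserves low-level spatial detail and one that upsamples projected high-level context before fusion; the channel sizes and strides are illustrative, not FASL-Seg's actual configuration.

```python
import torch
import torch.nn as nn

class DualStreamHead(nn.Module):
    """Two feature-projection streams: a low-level stream for edge detail
    and a high-level stream for context, fused for segmentation."""
    def __init__(self, low_ch=64, high_ch=512, out_classes=8):
        super().__init__()
        self.llfp = nn.Sequential(  # low-level: keep spatial detail
            nn.Conv2d(low_ch, 64, 3, padding=1), nn.ReLU(inplace=True))
        self.hlfp = nn.Sequential(  # high-level: project context, then upsample
            nn.Conv2d(high_ch, 64, 1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False))
        self.classifier = nn.Conv2d(128, out_classes, 1)

    def forward(self, low_feat, high_feat):
        fused = torch.cat([self.llfp(low_feat), self.hlfp(high_feat)], dim=1)
        return self.classifier(fused)

# Example: low-level features at 1/4 resolution, high-level at 1/32.
head = DualStreamHead()
low = torch.randn(1, 64, 128, 128)
high = torch.randn(1, 512, 16, 16)
print(head(low, high).shape)  # torch.Size([1, 8, 128, 128])
```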

[683] Robustness and accuracy of mean opinion scores with hard and soft outlier detection

Dietmar Saupe, Tim Bleile

Main category: eess.IV

TL;DR: Proposes empirical worst-case analysis using evolutionary optimization to test outlier detection methods in subjective quality assessment, and introduces two new low-complexity methods with excellent performance.

DetailsMotivation: Lack of reliable and comprehensive approach for comparative performance analysis of outlier detection methods in subjective image/video quality assessment.

Method: Evolutionary optimization of adversarial black-box attacks on outlier detection algorithms to maximize distortion of scale values relative to ground truth.

Result: Applied analysis to several hard and soft outlier detection methods, showing their differing performance under stress testing. Proposed two new methods with low complexity and excellent worst-case performance.

Conclusion: The empirical worst-case analysis provides a general solution for evaluating outlier detection methods, and the proposed new methods offer effective alternatives with good performance characteristics.

Abstract: In subjective assessment of image and video quality, observers rate or compare selected stimuli. Before calculating the mean opinion scores (MOS) for these stimuli from the ratings, it is recommended to identify and deal with outliers that may have given unreliable ratings. Several methods are available for this purpose, some of which have been standardized. These methods are typically based on statistics and sometimes tested by introducing synthetic ratings from artificial outliers, such as random clickers. However, a reliable and comprehensive approach is lacking for comparative performance analysis of outlier detection methods. To fill this gap, this work proposes and applies an empirical worst-case analysis as a general solution. Our method involves evolutionary optimization of an adversarial black-box attack on outlier detection algorithms, where the adversary maximizes the distortion of scale values with respect to ground truth. We apply our analysis to several hard and soft outlier detection methods for absolute category ratings and show their differing performance in this stress test. In addition, we propose two new outlier detection methods with low complexity and excellent worst-case performance. Software for adversarial attacks and data analysis is available.
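
A minimal sketch of the worst-case idea under stated assumptions: a simple (1+1) evolution strategy mutates the ratings of a few adversarial raters to maximize MOS distortion against an illustrative hard screening rule (drop raters poorly correlated with the provisional mean). The paper's detectors and optimizer are more sophisticated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stimuli, n_honest, n_adv = 20, 24, 6
truth = rng.uniform(1, 5, n_stimuli)
honest = np.clip(truth + rng.normal(0, 0.5, (n_honest, n_stimuli)), 1, 5)

def mos_after_screening(ratings, thresh=0.5):
    """Illustrative hard rule: drop raters whose Pearson correlation with
    the provisional mean falls below `thresh`, then average the rest."""
    mean = ratings.mean(axis=0)
    corr = np.array([np.corrcoef(r, mean)[0, 1] for r in ratings])
    return ratings[corr >= thresh].mean(axis=0)

def distortion(adv_flat):
    """Mean absolute error of the screened MOS w.r.t. ground truth."""
    ratings = np.vstack([honest, adv_flat.reshape(n_adv, n_stimuli)])
    return np.abs(mos_after_screening(ratings) - truth).mean()

# (1+1) evolution strategy: mutate the adversarial ratings, keep the worse case.
best = rng.uniform(1, 5, n_adv * n_stimuli)
for _ in range(200):
    child = np.clip(best + rng.normal(0, 0.3, best.shape), 1, 5)
    if distortion(child) > distortion(best):
        best = child
print(f"worst-case MOS distortion found: {distortion(best):.3f}")
```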

[684] Leveraging Information Divergence for Robust Semi-Supervised Fetal Ultrasound Image Segmentation

Fangyijie Wang, Guénolé Silvestre, Kathleen M. Curran

Main category: eess.IV

TL;DR: Semi-supervised learning framework using information divergence loss for robust fetal ultrasound segmentation with limited annotations, achieving state-of-the-art performance with only 5% labeled data.

DetailsMotivation: Address the challenge of automated fetal ultrasound segmentation due to scarcity of high-quality annotations in maternal-fetal ultrasound monitoring.

Method: Lightweight CNN (1.47M parameters) and Transformer network trained jointly with labeled data via standard supervision and unlabeled data via cross-supervision, using information divergence loss (KL divergence + Mutual Information Gap) and mixup augmentation.

Result: Outperforms 7 state-of-the-art methods, improves Dice score by 2.39%, reduces 95% Hausdorff distance by 14.90, and decreases Average Surface Distance by 4.18 with only 5% labeled data.

Conclusion: Information divergence effectively enables annotation-efficient and robust medical image segmentation, demonstrating strong performance in fetal ultrasound applications.

Abstract: Maternal-fetal Ultrasound is the primary modality for monitoring fetal development, yet automated segmentation remains challenging due to the scarcity of high-quality annotations. To address this limitation, we propose a semi-supervised learning framework that leverages information divergence for robust fetal ultrasound segmentation. Our method employs a lightweight convolutional network (1.47M parameters) and a Transformer-based network, trained jointly with labelled data through standard supervision and with unlabelled data via cross-supervision. To encourage consistent and confident predictions, we introduce an information divergence loss that combines per-pixel Kullback-Leibler divergence and Mutual Information Gap, effectively reducing prediction disagreement between the two models. In addition, we apply mixup on unlabelled samples to further enhance robustness. Experiments on two fetal ultrasound datasets demonstrate that our approach consistently outperforms seven state-of-the-art semi-supervised methods. When only 5% of training data is labelled, our framework improves the Dice score by 2.39%, reduces the 95% Hausdorff distance by 14.90, and decreases the Average Surface Distance by 4.18. These results highlight the effectiveness of leveraging information divergence for annotation-efficient and robust medical image segmentation. Our code is publicly available on GitHub.
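
A sketch of the symmetric per-pixel KL consistency term between the two networks' predictions, assuming standard softmax outputs; the paper's full information divergence loss also includes a Mutual Information Gap term, which is omitted here.

```python
import torch
import torch.nn.functional as F

def per_pixel_kl(logits_a, logits_b):
    """Symmetric per-pixel KL divergence between two segmentation heads,
    used as a consistency term in cross-supervision. Shapes: (B, C, H, W)."""
    log_p = F.log_softmax(logits_a, dim=1)
    log_q = F.log_softmax(logits_b, dim=1)
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(1)
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="none").sum(1)
    return 0.5 * (kl_pq + kl_qp).mean()

cnn_logits = torch.randn(2, 3, 64, 64)          # lightweight CNN head
transformer_logits = torch.randn(2, 3, 64, 64)  # Transformer head
print(per_pixel_kl(cnn_logits, transformer_logits).item())
```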

[685] Impact of Labeling Inaccuracy and Image Noise on Tooth Segmentation in Panoramic Radiographs using Federated, Centralized and Local Learning

Johan Andreas Balle Rubak, Khuram Naveed, Sanyam Jain, Lukas Esterle, Alexandros Iosifidis, Ruben Pauwels

Main category: eess.IV

TL;DR: Federated learning outperforms centralized and local learning for tooth segmentation in dental radiographs across various data corruption scenarios while preserving privacy.

DetailsMotivation: To address privacy constraints, heterogeneous data quality, and inconsistent labeling in dental diagnostic AI by comparing federated learning with centralized and local learning approaches.

Method: Used Attention U-Net trained on 2066 radiographs from six institutions across four scenarios: baseline, label manipulation, image-quality manipulation, and faulty client exclusion. Implemented FL via Flower AI framework with per-client loss monitoring and evaluated using Dice, IoU, HD, HD95 and ASSD metrics.

Result: FL consistently achieved the best performance across all scenarios: baseline (Dice: 0.94889), label manipulation (Dice: 0.94884), image noise (Dice: 0.94853), and faulty-client exclusion (Dice: 0.94790). Loss-curve monitoring reliably detected corrupted sites.

Conclusion: FL matches or exceeds centralized learning and outperforms local learning while preserving privacy, making it a practical approach for scalable clinical AI deployment with effective anomaly detection capabilities.

Abstract: Objectives: Federated learning (FL) may mitigate privacy constraints, heterogeneous data quality, and inconsistent labeling in dental diagnostic AI. We compared FL with centralized (CL) and local learning (LL) for tooth segmentation in panoramic radiographs across multiple data corruption scenarios. Methods: An Attention U-Net was trained on 2066 radiographs from six institutions across four settings: baseline (unaltered data); label manipulation (dilated/missing annotations); image-quality manipulation (additive Gaussian noise); and exclusion of a faulty client with corrupted data. FL was implemented via the Flower AI framework. Per-client training- and validation-loss trajectories were monitored for anomaly detection, and a set of metrics (Dice, IoU, HD, HD95 and ASSD) was evaluated on a hold-out test set. Statistical significance was assessed with the Wilcoxon signed-rank test. CL and LL served as comparators. Results: Baseline: FL achieved a median Dice of 0.94889 (ASSD: 1.33229), slightly better than CL at 0.94706 (ASSD: 1.37074) and LL at 0.93557-0.94026 (ASSD: 1.51910-1.69777). Label manipulation: FL maintained the best median Dice score at 0.94884 (ASSD: 1.46487) versus CL's 0.94183 (ASSD: 1.75738) and LL's 0.93003-0.94026 (ASSD: 1.51910-2.11462). Image noise: FL led with a Dice of 0.94853 (ASSD: 1.31088); CL scored 0.94787 (ASSD: 1.36131); LL ranged from 0.93179-0.94026 (ASSD: 1.51910-1.77350). Faulty-client exclusion: FL reached a Dice of 0.94790 (ASSD: 1.33113), better than CL's 0.94550 (ASSD: 1.39318). Loss-curve monitoring reliably flagged the corrupted site. Conclusions: FL matches or exceeds CL and outperforms LL across corruption scenarios while preserving privacy. Per-client loss trajectories provide an effective anomaly-detection mechanism and support FL as a practical, privacy-preserving approach for scalable clinical AI deployment.
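
A minimal sketch of per-client loss reporting with the Flower framework (the abstract names Flower; everything around it, including the helper functions, is a hypothetical stand-in for the Attention U-Net training code):

```python
import flwr as fl
import numpy as np

# Hypothetical stand-ins for your own Attention U-Net training code.
def get_weights(): return [np.zeros(1)]
def set_weights(params): pass
def train_one_round(): return 0.12, 0.15, 344   # train loss, val loss, #examples
def validate(): return 0.15, 86

class SegmentationClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        return get_weights()

    def fit(self, parameters, config):
        set_weights(parameters)
        train_loss, val_loss, n = train_one_round()
        # Reporting per-client losses lets the server monitor each site's
        # trajectory and flag anomalous (possibly corrupted) clients.
        return get_weights(), n, {"train_loss": train_loss, "val_loss": val_loss}

    def evaluate(self, parameters, config):
        set_weights(parameters)
        val_loss, n = validate()
        return float(val_loss), n, {}
```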

[686] Contrastive Anatomy-Contrast Disentanglement: A Domain-General MRI Harmonization Method

Daniel Scholz, Ayhan Can Erdur, Robbie Holland, Viktoria Ehm, Jan C. Peeken, Benedikt Wiestler, Daniel Rueckert

Main category: eess.IV

TL;DR: A novel diffusion autoencoder method with contrastive loss and domain-agnostic augmentation for MRI scanner harmonization, achieving significant performance improvements without requiring fine-tuning.

DetailsMotivation: Address inconsistencies in MRI image contrast caused by variations in scanners and acquisition parameters, which hinder data comparability and reproducibility across clinical studies.

Method: Conditioned diffusion autoencoder with contrastive loss and domain-agnostic contrast augmentation to harmonize MR images while preserving subject-specific anatomy from a single reference image.

Result: Outperforms baseline techniques with +7% PSNR improvement on traveling subjects dataset and +18% improvement on age regression in unseen domains.

Conclusion: Provides robust, effective harmonization without fine-tuning, enhancing comparability, reproducibility, and generalizability in multi-site and longitudinal clinical studies for improved healthcare outcomes.

Abstract: Magnetic resonance imaging (MRI) is an invaluable tool for clinical and research applications. Yet, variations in scanners and acquisition parameters cause inconsistencies in image contrast, hindering data comparability and reproducibility across datasets and clinical studies. Existing scanner harmonization methods, designed to address this challenge, face limitations, such as requiring traveling subjects or struggling to generalize to unseen domains. We propose a novel approach using a conditioned diffusion autoencoder with a contrastive loss and domain-agnostic contrast augmentation to harmonize MR images across scanners while preserving subject-specific anatomy. Our method enables brain MRI synthesis from a single reference image. It outperforms baseline techniques, achieving a +7% PSNR improvement on a traveling subjects dataset and a +18% improvement on age regression in unseen domains. Our model provides robust, effective harmonization of brain MRIs to target scanners without requiring fine-tuning. This advancement promises to enhance comparability, reproducibility, and generalizability in multi-site and longitudinal clinical studies, ultimately contributing to improved healthcare outcomes.
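
The abstract names a contrastive loss but not its exact form; an InfoNCE-style sketch is shown below, assuming paired embeddings of the same anatomy under two contrast augmentations with in-batch negatives.

```python
import torch
import torch.nn.functional as F

def scanner_contrastive_loss(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style sketch: row i of `emb_a` and `emb_b` embed the same
    anatomy under two contrast augmentations (positives); other rows in
    the batch act as negatives."""
    a = F.normalize(emb_a, dim=1)
    b = F.normalize(emb_b, dim=1)
    logits = a @ b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0))         # positives on the diagonal
    return F.cross_entropy(logits, targets)

loss = scanner_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```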

[687] MM-DINOv2: Adapting Foundation Models for Multi-Modal Medical Image Analysis

Daniel Scholz, Ayhan Can Erdur, Viktoria Ehm, Anke Meyer-Baese, Jan C. Peeken, Daniel Rueckert, Benedikt Wiestler

Main category: eess.IV

TL;DR: MM-DINOv2 adapts DINOv2 for multi-modal medical imaging with patch embeddings, full-modality masking for missing data, and semi-supervised learning, achieving 11.1% improvement in glioma classification.

DetailsMotivation: Vision foundation models like DINOv2 are designed for uni-modal analysis but struggle with multi-modal medical imaging tasks, missing modalities, and unlabeled data utilization that are common in clinical settings.

Method: Multi-modal patch embeddings, full-modality masking to handle missing data, and semi-supervised learning to leverage unlabeled datasets for robust cross-modality relationships.

Result: Achieved Matthews Correlation Coefficient of 0.6 on external test set for glioma subtype classification, surpassing state-of-the-art supervised approaches by +11.1%.

Conclusion: Provides a scalable and robust solution for multi-modal medical imaging that addresses real-world clinical challenges like missing data and limited annotations while leveraging natural image pre-trained foundation models.

Abstract: Vision foundation models like DINOv2 demonstrate remarkable potential in medical imaging despite their origin in natural image domains. However, their design inherently works best for uni-modal image analysis, limiting their effectiveness for multi-modal imaging tasks that are common in many medical fields, such as neurology and oncology. While supervised models perform well in this setting, they fail to leverage unlabeled datasets and struggle with missing modalities, a frequent challenge in clinical settings. To bridge these gaps, we introduce MM-DINOv2, a novel and efficient framework that adapts the pre-trained vision foundation model DINOv2 for multi-modal medical imaging. Our approach incorporates multi-modal patch embeddings, enabling vision foundation models to effectively process multi-modal imaging data. To address missing modalities, we employ full-modality masking, which encourages the model to learn robust cross-modality relationships. Furthermore, we leverage semi-supervised learning to harness large unlabeled datasets, enhancing both the accuracy and reliability of medical predictions. Applied to glioma subtype classification from multi-sequence brain MRI, our method achieves a Matthews Correlation Coefficient (MCC) of 0.6 on an external test set, surpassing state-of-the-art supervised approaches by +11.1%. Our work establishes a scalable and robust solution for multi-modal medical imaging tasks, leveraging powerful vision foundation models pre-trained on natural images while addressing real-world clinical challenges such as missing data and limited annotations.
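
A sketch of multi-modal patch embeddings with full-modality masking, under the assumption of one patch projection per MRI sequence and a learned token substituted for a dropped modality; dimensions are illustrative.

```python
import torch
import torch.nn as nn

class MultiModalPatchEmbed(nn.Module):
    """One patch projection per MRI sequence, with full-modality masking:
    an entire modality is replaced by a learned token, so the encoder must
    rely on cross-modality context."""
    def __init__(self, n_modalities=4, patch=16, dim=384):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
            for _ in range(n_modalities))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x, drop_mask):
        # x: (B, M, H, W); drop_mask: (B, M) booleans, True = modality masked.
        tokens = []
        for m, proj in enumerate(self.proj):
            t = proj(x[:, m:m + 1]).flatten(2).transpose(1, 2)  # (B, N, dim)
            miss = drop_mask[:, m].view(-1, 1, 1)
            t = torch.where(miss, self.mask_token.expand_as(t), t)
            tokens.append(t)
        return torch.cat(tokens, dim=1)  # concatenated multi-modal tokens

emb = MultiModalPatchEmbed()
x = torch.randn(2, 4, 224, 224)
drop = torch.tensor([[False, True, False, False], [False] * 4])
print(emb(x, drop).shape)  # torch.Size([2, 784, 384]) -- 4 x 196 tokens
```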

[688] Fairness-Aware Data Augmentation for Cardiac MRI using Text-Conditioned Diffusion Models

Grzegorz Skorupko, Richard Osuala, Zuzanna Szafranowska, Kaisar Kushibar, Vien Ngoc Dang, Nay Aung, Steffen E Petersen, Karim Lekadir, Polyxeni Gkontra

Main category: eess.IV

TL;DR: Proposes a ControlNet-based diffusion model to generate synthetic cardiac MRI data conditioned on patient metadata and cardiac geometry, addressing dataset imbalances to improve fairness in medical AI classification tasks.

DetailsMotivation: Deep learning for cardiac MRI diagnosis is constrained by imbalanced and biased training datasets, particularly with underrepresented groups like female patients or specific BMI categories with heart conditions.

Method: Uses ControlNet based on denoising diffusion probabilistic model, conditioned on text from patient metadata (sex, age, BMI, health) and cardiac geometry from segmentation masks. Evaluates with UK Biobank data using quantitative metrics and downstream classification tasks.

Result: Effectively mitigates dataset imbalances, addressing scarcity of diagnosed female patients and normal-BMI individuals with heart failure. Demonstrates feasibility using a single consumer-level GPU.

Conclusion: Represents a major step toward using synthetic data for developing fair and generalizable medical classification models, with practical feasibility in resource-constrained environments.

Abstract: While deep learning holds great promise for disease diagnosis and prognosis in cardiac magnetic resonance imaging, its progress is often constrained by highly imbalanced and biased training datasets. To address this issue, we propose a method to alleviate imbalances inherent in datasets through the generation of synthetic data based on sensitive attributes such as sex, age, body mass index (BMI), and health condition. We adopt ControlNet based on a denoising diffusion probabilistic model to condition on text assembled from patient metadata and cardiac geometry derived from segmentation masks. We assess our method using a large-cohort study from the UK Biobank by evaluating the realism of the generated images using established quantitative metrics. Furthermore, we conduct a downstream classification task aimed at debiasing a classifier by rectifying imbalances within underrepresented groups through synthetically generated samples. Our experiments demonstrate the effectiveness of the proposed approach in mitigating dataset imbalances, such as the scarcity of diagnosed female patients or individuals with normal BMI level suffering from heart failure. This work represents a major step towards the adoption of synthetic data for the development of fair and generalizable models for medical classification tasks. Notably, we conduct all our experiments using a single, consumer-level GPU to highlight the feasibility of our approach within resource-constrained environments. Our code is available at https://github.com/faildeny/debiasing-cardiac-mri.
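
A trivial sketch of assembling a conditioning prompt from patient metadata (sex, age, BMI, health condition); the actual template and BMI banding used by the paper are assumptions here.

```python
def build_prompt(sex, age, bmi, condition):
    """Assemble a text prompt from patient metadata for conditioned synthesis;
    the template and BMI bands are illustrative assumptions."""
    bmi_band = ("underweight" if bmi < 18.5 else
                "normal BMI" if bmi < 25 else "overweight")
    return f"cardiac MRI, {sex} patient, age {age}, {bmi_band}, {condition}"

print(build_prompt("female", 62, 23.4, "heart failure"))
# cardiac MRI, female patient, age 62, normal BMI, heart failure
```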

[689] VIBESegmentator: Full Body MRI Segmentation for the NAKO and UK Biobank

Robert Graf, Paul-Sören Platzek, Evamaria Olga Riedel, Constanze Ramschütz, Sophie Starck, Hendrik Kristian Möller, Matan Atad, Henry Völzke, Robin Bülow, Carsten Oliver Schmidt, Julia Rüdebusch, Matthias Jung, Marco Reisert, Jakob Weiss, Maximilian Löffler, Fabian Bamberg, Bene Wiestler, Johannes C. Paetzold, Daniel Rueckert, Jan Stefan Kirschke

Main category: eess.IV

TL;DR: Public deep learning model for comprehensive torso segmentation in MRI/CT with 71-72 structures, achieving high Dice scores (0.90 internal, 0.81 external) and full voxel coverage.

DetailsMotivation: To provide publicly available deep learning-based torso segmentation with comprehensive anatomical coverage extending to compartment boundaries, addressing limitations of existing models.

Method: Used iterative improvement of preliminary segmentations from TotalSegmentator and other models, trained nnUNet on 2897 series from 626 subjects with 3-fold cross-validation, segmented 71-72 structures including organs, muscles, vessels, bones, and body composition.

Result: Achieved average Dice score of 0.90±0.06 on internal test set and tied best model on external Amos dataset with 0.81±0.14 Dice, while offering larger field of view and more structures than existing models.

Conclusion: Successfully developed a publicly available full-torso segmentation model that provides comprehensive voxel-wise classification for both MRI and CT images, covering nearly all subject voxels.

Abstract: Objectives: To present a publicly available deep learning-based torso segmentation model that provides comprehensive voxel-wise coverage, including delineations that extend to the boundaries of anatomical compartments. Materials and Methods: We extracted preliminary segmentations from TotalSegmentator, spine, and body composition models for Magnetic Resonance (MR) images, then improved them iteratively and retrained an nnUNet model. Using a random retrospective subset of German National Cohort (NAKO), UK Biobank, internal MR and Computed Tomography (CT) data (training: 2897 series from 626 subjects, 290 female; mean age 53±16; 3-fold cross-validation with 20% hold-out; internal testing: 36 series from 12 subjects, 6 male; mean age 60±11), we segmented 71 structures in torso MR and 72 in CT images: 20 organs, 10 muscles, 19 vessels, 16 bones, ribs in CT, intervertebral discs, spinal cord, spinal canal and body composition (subcutaneous fat, unclassified muscles and visceral fat). For external validation, we used existing automatic organ segmentations, independent ground truth segmentations on gradient echo images, and the Amos data. We used non-parametric bootstrapping for confidence intervals and the Wilcoxon rank-sum test for statistical significance. Results: We achieved an average Dice score of 0.90±0.06 on our internal gradient echo test set, which included 71 semantic segmentation labels. Our model ties with the best model on Amos with a Dice of 0.81±0.14, while having a larger field of view and a considerably higher number of structures included. Conclusion: Our work presents a publicly available full-torso segmentation model for MRI and CT images that, to date, classifies almost all subject voxels.
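
The evaluation recipe, non-parametric bootstrap confidence intervals plus a Wilcoxon rank-sum test over per-case scores, in a minimal sketch; the score arrays are placeholders, not the paper's data.

```python
import numpy as np
from scipy.stats import ranksums

def bootstrap_ci(scores, n_boot=10000, alpha=0.05, seed=0):
    """Non-parametric bootstrap CI for the mean of per-case Dice scores."""
    rng = np.random.default_rng(seed)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

per_case_ours = np.array([0.91, 0.88, 0.93, 0.90, 0.92])  # placeholder scores
per_case_base = np.array([0.89, 0.87, 0.91, 0.89, 0.90])
print(bootstrap_ci(per_case_ours))
print(ranksums(per_case_ours, per_case_base))  # Wilcoxon rank-sum test
```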

[690] Efficient and Accurate Pneumonia Detection Using a Novel Multi-Scale Transformer Approach

Alireza Saber, Pouria Parhami, Alimohammad Siahkarzadeh, Mansoor Fateh, Amirreza Fateh

Main category: eess.IV

TL;DR: Novel multi-scale transformer approach for pneumonia detection combining lung segmentation and classification, achieving high accuracy (93.75-96.04%) with computational efficiency.

DetailsMotivation: Pneumonia is a leading cause of mortality worldwide, and chest X-ray interpretation is challenging due to variations in imaging conditions and subtle visual indicators. Automated tools can enhance diagnostic reliability and support clinical decision-making.

Method: Proposes a unified framework with lightweight transformer-enhanced TransUNet for lung segmentation (95.68% Dice score) and pre-trained ResNet models (ResNet-50/101) for multi-scale feature extraction processed through modified transformer modules for classification.

Result: Achieved 93.75% accuracy on Kermany dataset and 96.04% accuracy on Cohen dataset, outperforming existing methods while maintaining computational efficiency with fewer parameters than traditional transformers.

Conclusion: Demonstrates the potential of multi-scale transformer architectures to improve pneumonia diagnosis, offering a scalable and accurate solution suitable for resource-constrained clinical environments.

Abstract: Pneumonia, a prevalent respiratory infection, remains a leading cause of morbidity and mortality worldwide, particularly among vulnerable populations. Chest X-rays serve as a primary tool for pneumonia detection; however, variations in imaging conditions and subtle visual indicators complicate consistent interpretation. Automated tools can enhance traditional methods by improving diagnostic reliability and supporting clinical decision-making. In this study, we propose a novel multi-scale transformer approach for pneumonia detection that integrates lung segmentation and classification into a unified framework. Our method introduces a lightweight transformer-enhanced TransUNet for precise lung segmentation, achieving a Dice score of 95.68% on the “Chest X-ray Masks and Labels” dataset with fewer parameters than traditional transformers. For classification, we employ pre-trained ResNet models (ResNet-50 and ResNet-101) to extract multi-scale feature maps, which are then processed through a modified transformer module to enhance pneumonia detection. This integration of multi-scale feature extraction and lightweight transformer modules ensures robust performance, making our method suitable for resource-constrained clinical environments. Our approach achieves 93.75% accuracy on the “Kermany” dataset and 96.04% accuracy on the “Cohen” dataset, outperforming existing methods while maintaining computational efficiency. This work demonstrates the potential of multi-scale transformer architectures to improve pneumonia diagnosis, offering a scalable and accurate solution to global healthcare challenges. https://github.com/amirrezafateh/Multi-Scale-Transformer-Pneumonia
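
A sketch of pulling multi-scale feature maps from a ResNet-50 backbone with torchvision's feature-extraction utility; these stage outputs are what a modified transformer module would consume downstream. Pretrained weights and input preprocessing are elided.

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Multi-scale feature maps from a ResNet-50 backbone (load pretrained
# weights in practice); the four stage outputs would then feed the
# paper's modified transformer module for classification.
backbone = resnet50(weights=None)
extractor = create_feature_extractor(
    backbone, return_nodes={"layer1": "c2", "layer2": "c3",
                            "layer3": "c4", "layer4": "c5"})

x = torch.randn(1, 3, 224, 224)  # a segmented chest X-ray, replicated to 3ch
for name, f in extractor(x).items():
    print(name, tuple(f.shape))  # feature maps at strides 4, 8, 16, 32
```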

[691] NeuroBOLT: Resting-state EEG-to-fMRI Synthesis with Multi-dimensional Feature Mapping

Yamin Li, Ange Lou, Ziyuan Xu, Shengchao Zhang, Shiyu Wang, Dario J. Englot, Soheil Kolouri, Daniel Moyer, Roza G. Bayrak, Catie Chang

Main category: eess.IV

TL;DR: NeuroBOLT is a transformer-based framework that translates raw EEG data to fMRI signals across the brain, achieving state-of-the-art accuracy and generalization across conditions and brain regions.

DetailsMotivation: fMRI has high costs and immobility limitations, while EEG is more accessible but lacks spatial resolution. Current EEG-fMRI translation methods are limited to specific brain areas and single conditions, creating a need for a more generalizable solution.

Method: NeuroBOLT uses multi-dimensional representation learning from temporal, spatial, and spectral domains to translate raw EEG data to corresponding fMRI activity signals across the entire brain.

Result: The framework effectively reconstructs unseen resting-state fMRI signals from primary sensory areas, high-level cognitive areas, and deep subcortical brain regions with state-of-the-art accuracy.

Conclusion: NeuroBOLT significantly advances EEG-fMRI integration by demonstrating potential for generalization across varying conditions and sites, overcoming previous limitations in the field.

Abstract: Functional magnetic resonance imaging (fMRI) is an indispensable tool in modern neuroscience, providing a non-invasive window into whole-brain dynamics at millimeter-scale spatial resolution. However, fMRI is constrained by issues such as high operation costs and immobility. With the rapid advancements in cross-modality synthesis and brain decoding, the use of deep neural networks has emerged as a promising solution for inferring whole-brain, high-resolution fMRI features directly from electroencephalography (EEG), a more widely accessible and portable neuroimaging modality. Nonetheless, the complex projection from neural activity to fMRI hemodynamic responses and the spatial ambiguity of EEG pose substantial challenges both in modeling and interpretability. Relatively few studies to date have developed approaches for EEG-fMRI translation, and although they have made significant strides, the inference of fMRI signals in a given study has been limited to a small set of brain areas and to a single condition (i.e., either resting-state or a specific task). The capability to predict fMRI signals in other brain areas, as well as to generalize across conditions, remain critical gaps in the field. To tackle these challenges, we introduce a novel and generalizable framework: NeuroBOLT, i.e., Neuro-to-BOLD Transformer, which leverages multi-dimensional representation learning from temporal, spatial, and spectral domains to translate raw EEG data to the corresponding fMRI activity signals across the brain. Our experiments demonstrate that NeuroBOLT effectively reconstructs unseen resting-state fMRI signals from primary sensory, high-level cognitive areas, and deep subcortical brain regions, achieving state-of-the-art accuracy with the potential to generalize across varying conditions and sites, which significantly advances the integration of these two modalities.
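
One plausible reading of the spectral side of this multi-dimensional tokenization, under stated assumptions: short-time Fourier magnitudes per EEG channel, flattened into tokens. NeuroBOLT's actual embedding is more elaborate.

```python
import torch

def eeg_spectral_tokens(eeg, n_fft=256, hop=64):
    """Turn raw EEG (batch, channels, time) into per-channel spectral tokens,
    one plausible reading of combining temporal, spatial, and spectral
    domains; the actual NeuroBOLT tokenization differs in detail."""
    B, C, T = eeg.shape
    spec = torch.stft(eeg.reshape(B * C, T), n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft),
                      return_complex=True).abs()       # (B*C, freqs, frames)
    F_, N = spec.shape[1], spec.shape[2]
    return spec.reshape(B, C, F_, N).permute(0, 1, 3, 2).reshape(B, C * N, F_)

tokens = eeg_spectral_tokens(torch.randn(2, 32, 2560))
print(tokens.shape)  # (batch, channels * frames, freq_bins)
```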

[692] Contactless pulse rate assessment: Results and insights for application in driving simulator

Đorđe D. Nešković, Kristina Stojmenova Pečečnik, Jaka Sodnik, Nadica Miljković

Main category: eess.IV

TL;DR: rPPG framework combining signal processing with Eulerian Video Magnification improves pulse rate estimation in driving simulators, reducing MAE from 6.48 to 5.04 bpm, with identified age-related differences.

DetailsMotivation: Address motion artifacts in remote photoplethysmography for non-contact driver monitoring in dynamic driving environments.

Method: Combines signal processing techniques before and after applying Eulerian Video Magnification (EVM) for pulse rate estimation, compared against Empatica E4 reference data and literature benchmarks.

Result: EVM slightly improves PR estimation (MAE reduced from 6.48 to 5.04 bpm), requires ~20s additional processing for 30s sequences, and shows statistically significant differences between younger and older drivers.

Conclusion: Demonstrates feasibility of rPPG-based pulse rate monitoring in driving simulations, encouraging further research despite modest performance improvements.

Abstract: Remote photoplethysmography (rPPG) offers a promising solution for non-contact driver monitoring by detecting subtle blood flow-induced facial color changes from video. However, motion artifacts in dynamic driving environments remain key challenges. This study presents an rPPG framework that combines signal processing techniques before and after applying Eulerian Video Magnification (EVM) for pulse rate (PR) estimation in driving simulators. While not novel, the approach offers insights into the efficiency of the EVM method and its time complexity. We compare results of the proposed rPPG approach against reference Empatica E4 data and also compare them with existing achievements from the literature. Additionally, the possible bias of the Empatica E4 is further assessed using an independent dataset with both Empatica E4 and Faros 360 measurements. EVM slightly improves PR estimation, reducing the mean absolute error (MAE) from 6.48 bpm to 5.04 bpm (the lowest MAE, ~2 bpm, was achieved under strict conditions), at the cost of about 20 s of additional EVM processing time for a 30 s sequence. Furthermore, statistically significant differences are identified between younger and older drivers in both reference and rPPG data. Our findings demonstrate the feasibility of rPPG-based PR monitoring, encouraging further research in driving simulations.
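
For context, a classic rPPG post-processing sketch: band-pass the mean green-channel trace of a face ROI to the plausible pulse band and read the dominant FFT peak. The EVM stage and the paper's exact filtering choices are omitted; the synthetic trace stands in for real video data.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def pulse_rate_bpm(green_trace, fs=30.0):
    """Estimate pulse rate from a mean green-channel face-ROI trace:
    band-pass to 0.7-3.0 Hz, then take the dominant FFT peak."""
    b, a = butter(3, [0.7 / (fs / 2), 3.0 / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, green_trace - np.mean(green_trace))
    spectrum = np.abs(np.fft.rfft(filtered))
    freqs = np.fft.rfftfreq(len(filtered), d=1 / fs)
    band = (freqs >= 0.7) & (freqs <= 3.0)      # 42-180 bpm
    return 60.0 * freqs[band][np.argmax(spectrum[band])]

fs, t = 30.0, np.arange(0, 30, 1 / 30.0)        # 30 s clip at 30 fps
trace = 0.02 * np.sin(2 * np.pi * 1.2 * t) + np.random.normal(0, 0.01, t.size)
print(f"estimated PR: {pulse_rate_bpm(trace, fs):.1f} bpm")  # ~72 bpm
```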

[693] Potential Contrast: Properties, Equivalences, and Generalization to Multiple Classes

Wallace Peaslee, Anna Breger, Carola-Bibiane Schönlieb

Main category: eess.IV

TL;DR: Normalized multi-class potential contrast for image quality assessment in cultural heritage applications

DetailsMotivation: To improve image quality evaluation in cultural heritage by removing format dependence and enabling multi-class analysis for complex artifacts like manuscripts with bleedthrough

Method: Developed normalized version of potential contrast, proved mathematical equalities for generalization to multiple classes and continuous settings, implemented algorithms for practical application

Result: Created a normalized metric that eliminates image format dependence and enables analysis of more than two pixel classes, demonstrated utility on medieval music manuscripts with bleedthrough

Conclusion: The normalized multi-class potential contrast provides a robust tool for cultural heritage image analysis, with open-source implementations available for practical use

Abstract: Potential contrast is typically used as an image quality measure and quantifies the maximal possible contrast between samples from two classes of pixels in an image after an arbitrary grayscale transformation. It has been applied in cultural heritage to evaluate multispectral images using a small number of labeled pixels. In this work, we introduce a normalized version of potential contrast that removes dependence on image format and also prove equalities that enable generalization to more than two classes and to continuous settings. Finally, we exemplify the utility of multi-class normalized potential contrast through an application to a medieval music manuscript with visible bleedthrough from the back of the page. We share our implementations, based on both original algorithms and our new equalities, including generalization to multiple classes, at https://github.com/wallacepeaslee/Multiple-Class-Normalized-Potential-Contrast.
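
Our reading of two-class potential contrast, sketched below: the maximum difference in class means over all grayscale transformations, which for a binary transform reduces to the sum of positive histogram differences. The authors' exact normalization may differ; intensities are assumed scaled to [0, 1].

```python
import numpy as np

def normalized_potential_contrast(values_c1, values_c2, n_levels=256):
    """Two-class potential contrast: the maximum separation of class means
    achievable by any grayscale transformation, here computed via the
    positive part of the normalized histogram difference. Result in [0, 1];
    1 means the classes are perfectly separable by intensity."""
    h1, _ = np.histogram(values_c1, bins=n_levels, range=(0, 1))
    h2, _ = np.histogram(values_c2, bins=n_levels, range=(0, 1))
    f1, f2 = h1 / h1.sum(), h2 / h2.sum()
    return np.maximum(f1 - f2, 0).sum()

rng = np.random.default_rng(1)
ink = np.clip(rng.normal(0.3, 0.1, 5000), 0, 1)        # foreground pixels
parchment = np.clip(rng.normal(0.7, 0.1, 5000), 0, 1)  # background pixels
print(f"potential contrast: {normalized_potential_contrast(ink, parchment):.3f}")
```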

[694] Content Generation Models in Computational Pathology: A Comprehensive Survey on Methods, Applications, and Challenges

Yuan Zhang, Xinfeng Zhang, Xiaoming Qi, Xinyu Wu, Feng Chen, Guanyu Yang, Huazhu Fu

Main category: eess.IV

TL;DR: Comprehensive review of content generation modeling in computational pathology covering image/text/molecular generation, evolution from GANs to diffusion/VLMs, datasets, evaluations, limitations, and future directions.

DetailsMotivation: To synthesize recent progress in computational pathology content generation, which enables data-efficient learning, synthetic data augmentation, and task-oriented generation for diverse diagnostic tasks.

Method: Analysis of over 150 representative studies organized into four domains: image generation, text generation, molecular profile-morphology generation, and specialized applications, tracing architectural evolution from GANs to diffusion models and generative vision-language models.

Result: The review provides comprehensive synthesis of datasets, evaluation protocols, and highlights limitations including challenges in high-fidelity whole slide image generation, clinical interpretability, and ethical/legal concerns with synthetic data.

Conclusion: Identifies open challenges and future research directions, emphasizing the need for developing integrated and clinically deployable generation systems as a foundational reference for researchers in computational pathology.

Abstract: Content generation modeling has emerged as a promising direction in computational pathology, offering capabilities such as data-efficient learning, synthetic data augmentation, and task-oriented generation across diverse diagnostic tasks. This review provides a comprehensive synthesis of recent progress in the field, organized into four key domains: image generation, text generation, molecular profile-morphology generation, and other specialized generation applications. By analyzing over 150 representative studies, we trace the evolution of content generation architectures – from early generative adversarial networks to recent advances in diffusion models and generative vision-language models. We further examine the datasets and evaluation protocols commonly used in this domain and highlight ongoing limitations, including challenges in generating high-fidelity whole slide images, clinical interpretability, and concerns related to the ethical and legal implications of synthetic data. The review concludes with a discussion of open challenges and prospective research directions, with an emphasis on developing integrated and clinically deployable generation systems. This work aims to provide a foundational reference for researchers and practitioners developing content generation models in computational pathology.

[695] Simultaneous Segmentation of Ventricles and Normal/Abnormal White Matter Hyperintensities in Clinical MRI using Deep Learning

Mahdi Bashiri Bawil, Mousa Shamsi, Abolhassan Shakeri Bavil

Main category: eess.IV

TL;DR: Novel 2D pix2pix deep learning framework for simultaneous segmentation of ventricles and white matter hyperintensities in MS patients, with ability to distinguish normal vs pathological lesions, achieving superior accuracy and 36x faster processing than existing methods.

DetailsMotivation: Current MS MRI segmentation methods have limitations: they segment structures independently despite their pathophysiological relationship, struggle to differentiate normal vs pathological hyperintensities, and are poorly optimized for anisotropic clinical MRI data.

Method: 2D pix2pix-based deep learning framework for simultaneous segmentation of ventricles and white matter hyperintensities, developed and validated on FLAIR MRI scans from 300 MS patients.

Result: Superior performance for both ventricle segmentation (Dice: 0.801+/-0.025, HD95: 18.46+/-7.1mm) and WMH segmentation (Dice: 0.624+/-0.061, precision: 0.755+/-0.161). Successfully differentiated normal vs abnormal hyperintensities (Dice: 0.647). 36x faster processing (4 seconds per case) with minimal resource requirements.

Conclusion: The method addresses critical limitations in current neuroimaging analysis with improved accuracy, clinically relevant differentiation capability, and computational efficiency, enabling potential integration into routine clinical workflows for enhanced MS diagnosis and monitoring.

Abstract: Multiple sclerosis (MS) diagnosis and monitoring rely heavily on accurate assessment of brain MRI biomarkers, particularly white matter hyperintensities (WMHs) and ventricular changes. Current segmentation approaches suffer from several limitations: they typically segment these structures independently despite their pathophysiological relationship, struggle to differentiate between normal and pathological hyperintensities, and are poorly optimized for anisotropic clinical MRI data. We propose a novel 2D pix2pix-based deep learning framework for simultaneous segmentation of ventricles and WMHs with the unique capability to distinguish between normal periventricular hyperintensities and pathological MS lesions. Our method was developed and validated on FLAIR MRI scans from 300 MS patients. Compared to established methods (SynthSeg, Atlas Matching, BIANCA, LST-LPA, LST-LGA, and WMH-SynthSeg), our approach achieved superior performance for both ventricle segmentation (Dice: 0.801+/-0.025, HD95: 18.46+/-7.1mm) and WMH segmentation (Dice: 0.624+/-0.061, precision: 0.755+/-0.161). Furthermore, our method successfully differentiated between normal and abnormal hyperintensities with a Dice coefficient of 0.647. Notably, our approach demonstrated exceptional computational efficiency, completing end-to-end processing in approximately 4 seconds per case, up to 36 times faster than baseline methods, while maintaining minimal resource requirements. This combination of improved accuracy, clinically relevant differentiation capability, and computational efficiency addresses critical limitations in current neuroimaging analysis, potentially enabling integration into routine clinical workflows and enhancing MS diagnosis and monitoring.
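
A sketch of the pix2pix-style generator objective the framework builds on, assuming a conditional discriminator and an L1 term on the predicted label maps; `lambda_l1=100` is the classic pix2pix default, not necessarily the paper's setting.

```python
import torch
import torch.nn as nn

bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()

def pix2pix_generator_loss(disc_logits_fake, fake_maps, real_maps,
                           lambda_l1=100.0):
    """pix2pix-style generator objective: fool the conditional discriminator
    plus an L1 term tying predicted ventricle/WMH maps to the reference."""
    adv = bce(disc_logits_fake, torch.ones_like(disc_logits_fake))
    return adv + lambda_l1 * l1(fake_maps, real_maps)

# Example with dummy tensors (patch-discriminator logits + label maps):
loss = pix2pix_generator_loss(torch.randn(2, 1, 30, 30),
                              torch.rand(2, 3, 256, 256),
                              torch.rand(2, 3, 256, 256))
```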

[696] PRO: Projection Domain Synthesis for CT Imaging

Kang Chen, Bin Huang, Xuebin Yang, Junyan Zhang, Yongbo Wang, Qiegen Liu

Main category: eess.IV

TL;DR: PRO is the first projection domain synthesis foundation model for CT imaging that generates synthetic projection data using anatomical text prompts, enabling realistic simulation of physical acquisition processes and improving downstream CT imaging tasks.

DetailsMotivation: Current image domain methods for synthetic CT data generation cannot simulate the physical acquisition process or utilize complete statistical information from projection data, limiting their utility and fidelity.

Method: PRO operates in the projection domain rather than image domain, learning rich structural representations from projection data and leveraging anatomical text prompts for controllable synthesis. It simulates physical processes including material attenuation, beam hardening, scattering, and projection geometry.

Result: Experimental results show that incorporating PRO’s synthesized data significantly improves performance across multiple downstream tasks including low-dose and sparse-view reconstruction.

Conclusion: PRO demonstrates versatility and scalability as a foundation model for CT data generation, highlighting projection domain synthesis as a powerful tool for data augmentation and robust CT imaging applications.

Abstract: Synthetic CT projection data is crucial for advancing imaging research, yet its generation remains challenging. Current image domain methods are limited as they cannot simulate the physical acquisition process or utilize the complete statistical information present in projection data, restricting their utility and fidelity. In this work, we present PRO, a projection domain synthesis foundation model for CT imaging. To the best of our knowledge, this is the first study that performs CT synthesis in the projection domain. Unlike previous approaches that operate in the image domain, PRO learns rich structural representations from projection data and leverages anatomical text prompts for controllable synthesis. Projection data generation models can utilize complete measurement signals and simulate the physical processes of scanning, including material attenuation characteristics, beam hardening, scattering, and projection geometry, and support research on downstream imaging tasks. Moreover, PRO functions as a foundation model, capable of generalizing across diverse downstream tasks by adjusting its generative behavior via prompt inputs. Experimental results demonstrated that incorporating our synthesized data significantly improves performance across multiple downstream tasks, including low-dose and sparse-view reconstruction. These findings underscore the versatility and scalability of PRO in data generation for various CT applications. These results highlight the potential of projection domain synthesis as a powerful tool for data augmentation and robust CT imaging. Our source code is publicly available at: https://github.com/yqx7150/PRO.
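
To illustrate the projection domain PRO operates in (not its generative model), the sketch below forward-projects a phantom into a sinogram with scikit-image and reconstructs it by filtered back-projection.

```python
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon, rescale

# Forward projection of an image into a sinogram (the projection domain),
# followed by filtered back-projection for reconstruction.
image = rescale(shepp_logan_phantom(), 0.5)
angles = np.linspace(0.0, 180.0, max(image.shape), endpoint=False)
sinogram = radon(image, theta=angles)          # projection-domain data
recon = iradon(sinogram, theta=angles, filter_name="ramp")
print(sinogram.shape, recon.shape)
```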

[697] Identifying actionable driver mutations in lung cancer using an efficient Asymmetric Transformer Decoder

Biagio Brattoli, Jack Shi, Jongchan Park, Taebum Lee, Donggeun Yoo, Sergio Pereira

Main category: eess.IV

TL;DR: This paper presents a machine learning approach using Multiple Instance Learning and an Asymmetric Transformer Decoder to detect six key NSCLC driver mutations from pathology images, outperforming existing methods by 3-4% and addressing limitations of current MIL approaches.

DetailsMotivation: Genetic testing for NSCLC driver mutations is limited by availability and turnaround times. Current ML methods focus on only 1-2 common mutations, reducing clinical utility and patient benefit.

Method: Used Multiple Instance Learning techniques with an Asymmetric Transformer Decoder model that maintains low query dimensionality. Introduced method to directly utilize tissue type information to address biological relevance limitations in MIL.

Result: Outperformed top MIL models by average of 3%, and over 4% for rare mutations (ERBB2 and BRAF). The approach efficiently extracts information from patch embeddings while minimizing overfitting risks.

Conclusion: The method moves ML-based tests closer to being practical alternatives to standard genetic testing for NSCLC, particularly benefiting detection of rare mutations and improving overall mutation detection performance.

Abstract: Identifying actionable driver mutations in non-small cell lung cancer (NSCLC) can impact treatment decisions and significantly improve patient outcomes. Despite guideline recommendations, broader adoption of genetic testing remains challenging due to limited availability and lengthy turnaround times. Machine Learning (ML) methods for Computational Pathology (CPath) offer a potential solution; however, research often focuses on only one or two common mutations, limiting the clinical value of these tools and the pool of patients who can benefit from them. This study evaluates various Multiple Instance Learning (MIL) techniques to detect six key actionable NSCLC driver mutations: ALK, BRAF, EGFR, ERBB2, KRAS, and MET ex14. Additionally, we introduce an Asymmetric Transformer Decoder model that employs queries and key-values of varying dimensions to maintain a low query dimensionality. This approach efficiently extracts information from patch embeddings and minimizes overfitting risks, proving highly adaptable to the MIL setting. Moreover, we present a method to directly utilize tissue type in the model, addressing a typical MIL limitation where either all regions or only some specific regions are analyzed, neglecting biological relevance. Our method outperforms top MIL models by an average of 3%, and over 4% when predicting rare mutations such as ERBB2 and BRAF, moving ML-based tests closer to being practical alternatives to standard genetic testing.
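
A sketch of cross-attention with low-dimensional learned queries over high-dimensional patch embeddings, one query per mutation; projecting keys/values down to the query width is our simplification (PyTorch's `kdim`/`vdim` arguments offer a closer analogue), and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class AsymmetricCrossAttention(nn.Module):
    """Learned queries in a lower dimension than the patch-embedding
    keys/values, reducing parameters and overfitting risk in the MIL
    setting."""
    def __init__(self, kv_dim=1024, q_dim=128, n_queries=6, n_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, q_dim))
        self.k = nn.Linear(kv_dim, q_dim)   # project keys down to q_dim
        self.v = nn.Linear(kv_dim, q_dim)
        self.attn = nn.MultiheadAttention(q_dim, n_heads, batch_first=True)
        self.head = nn.Linear(q_dim, 1)     # one logit per mutation query

    def forward(self, patch_embs):          # (B, n_patches, kv_dim)
        q = self.queries.expand(patch_embs.size(0), -1, -1)
        out, _ = self.attn(q, self.k(patch_embs), self.v(patch_embs))
        return self.head(out).squeeze(-1)   # (B, n_queries) mutation logits

model = AsymmetricCrossAttention()
logits = model(torch.randn(2, 500, 1024))   # 500 patch embeddings per slide
print(logits.shape)                         # torch.Size([2, 6])
```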

[698] Hessian-Based Lightweight Neural Network HessNet for State-of-the-Art Brain Vessel Segmentation on a Minimal Training Dataset

Alexandra Bernadotte, Elfimov Nikita, Mikhail Shutov, Ivan Menshikov

Main category: eess.IV

TL;DR: HessNet is a lightweight semi-supervised neural network with only 6000 parameters that uses Hessian matrices for 3D brain vessel segmentation in MRA images, achieving state-of-the-art accuracy while running on CPU and enabling creation of a large annotated dataset.

DetailsMotivation: Current manual segmentation and classical methods like Frangi filter lack sufficient accuracy for brain vessel segmentation in MRA, and there's a notable lack of publicly available annotated MRA datasets for neural network training.

Method: Proposed HessNet - a Hessian-based lightweight neural network with only 6000 parameters for 3D segmentation of tubular structures, using semi-supervised learning approach that can run on CPU.

Result: Achieved state-of-the-art vessel segmentation accuracy on minimal training dataset, enabled creation of large semi-manually annotated brain vessel dataset (200 images from IXI dataset) with expert supervision, significantly reducing resource requirements.

Conclusion: HessNet provides an efficient, resource-light solution for accurate brain vessel segmentation that facilitates dataset creation and allows experts to focus on complex cases, addressing the critical gap in annotated MRA datasets.

Abstract: Accurate segmentation of blood vessels in brain magnetic resonance angiography (MRA) is essential for successful surgical procedures, such as aneurysm repair or bypass surgery. Currently, annotation is primarily performed through manual segmentation or classical methods, such as the Frangi filter, which often lack sufficient accuracy. Neural networks have emerged as powerful tools for medical image segmentation, but their development depends on well-annotated training datasets. However, there is a notable lack of publicly available MRA datasets with detailed brain vessel annotations. To address this gap, we propose HessNet, a lightweight semi-supervised neural network that incorporates Hessian matrices for 3D segmentation of complex tubular structures. The solution is a Hessian-based neural network with only 6000 parameters. HessNet can run on the CPU and significantly reduces the resource requirements for training neural networks. The accuracy of vessel segmentation on a minimal training dataset reaches state-of-the-art results. It helped us create a large, semi-manually annotated brain vessel dataset of brain MRA images based on the IXI dataset (200 annotated images). Annotation was performed by three experts under the supervision of three neurovascular surgeons after applying HessNet. It provides high accuracy of vessel segmentation and allows experts to focus only on the most complex cases. The dataset is available at https://git.scinalytics.com/terilat/VesselDatasetPartly.
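
For context, the Hessian features such a network can build on: per-voxel eigenvalues of the 3D Hessian from Gaussian second derivatives, the classic tubular-structure cue behind the Frangi filter. HessNet itself is not reproduced here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_eigvals_3d(vol, sigma=1.0):
    """Per-voxel eigenvalues of the 3D Hessian, computed from Gaussian
    second derivatives at scale `sigma`; returns ascending eigenvalues."""
    H = np.empty(vol.shape + (3, 3))
    for a in range(3):
        for b in range(a, 3):
            order = [0, 0, 0]
            order[a] += 1
            order[b] += 1  # second derivative along axes (a, b)
            H[..., a, b] = H[..., b, a] = gaussian_filter(vol, sigma, order=order)
    return np.linalg.eigvalsh(H)

vol = np.random.rand(32, 32, 32).astype(np.float32)
print(hessian_eigvals_3d(vol).shape)  # (32, 32, 32, 3)
```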

[699] Learn2Reg 2024: New Benchmark Datasets Driving Progress on New Challenges

Lasse Hansen, Wiebke Heyer, Christoph Großbröhmer, Frederic Madesta, Thilo Sentker, Wang Jiazheng, Yuxi Zhang, Hang Zhang, Min Liu, Junyi Wang, Xi Zhu, Yuhua Li, Liwen Wang, Daniil Morozov, Nazim Haouchine, Joel Honkamaa, Pekka Marttinen, Yichao Zhou, Zuopeng Tan, Zhuoyuan Wang, Yi Wang, Hongchao Zhou, Shunbo Hu, Yi Zhang, Qian Tao, Lukas Förner, Thomas Wendler, Bailiang Jian, Christian Wachinger, Jin Kim, Dan Ruan, Marek Wodzinski, Henning Müller, Tony C. W. Mok, Xi Jia, Jinming Duan, Mikael Brudfors, Seyed-Ahmad Ahmadi, Yunzheng Zhu, William Hsu, Tina Kapur, William M. Wells, Alexandra Golby, Aaron Carass, Harrison Bai, Yihao Liu, Perrine Paul-Gilloteaux, Joakim Lindblad, Nataša Sladoje, Andreas Walter, Junyu Chen, Reuben Dorent, Alessa Hering, Mattias P. Heinrich

Main category: eess.IV

TL;DR: Learn2Reg 2024 introduces three new medical image registration tasks addressing modality diversity and task complexity gaps, with new datasets and method developments.

DetailsMotivation: Previous Learn2Reg challenges lacked comprehensive coverage of modality diversity and task complexity in medical image registration benchmarking.

Method: Introduces three new tasks: large-scale multi-modal registration, unsupervised inter-subject brain registration, and first microscopy-focused benchmark. New methods include invertibility constraints, pyramid features, keypoints alignment, and instance optimization.

Result: The 2024 edition expands the benchmark scope with more diverse datasets and inspires development of advanced registration techniques.

Conclusion: Learn2Reg 2024 addresses previous limitations by introducing more comprehensive tasks and datasets, driving progress in medical image registration methods.

Abstract: Medical image registration is critical for clinical applications, and fair benchmarking of different methods is essential for monitoring ongoing progress. To date, the Learn2Reg 2020-2023 challenges have released several complementary datasets and established metrics for evaluations. However, these editions did not capture all aspects of the registration problem, particularly in terms of modality diversity and task complexity. To address these limitations, the 2024 edition introduces three new tasks, including large-scale multi-modal registration and unsupervised inter-subject brain registration, as well as the first microscopy-focused benchmark within Learn2Reg. The new datasets also inspired new method developments, including invertibility constraints, pyramid features, keypoints alignment and instance optimisation.

[700] AURAD: Anatomy-Pathology Unified Radiology Synthesis with Progressive Representations

Shuhan Ding, Jingjing Fu, Yu Gu, Naiteek Sangani, Mu Wei, Paul Vozila, Nan Liu, Jiang Bian, Hoifung Poon

Main category: eess.IV

TL;DR: AURAD is a controllable chest X-ray synthesis framework that generates both high-fidelity images and pseudo semantic masks from clinical prompts, ensuring anatomical-pathological consistency and clinical relevance.

DetailsMotivation: Medical image synthesis faces challenges in fine-grained controllability due to limited annotations and domain shifts, especially in chest radiographs where disease patterns are diverse and intertwined with anatomy.

Method: Progressive pipeline: generates pseudo masks from clinical prompts conditioned on anatomical structures, then uses masks to guide image synthesis. Leverages pretrained medical models for clinical plausibility filtering.

Result: 78% of synthesized images classified as authentic by radiologists, over 40% of segmentation overlays rated clinically useful. Demonstrates effectiveness across tasks and datasets.

Conclusion: AURAD bridges generative modeling with clinical applications by producing both realistic images and usable segmentation masks, enabling downstream tasks like detection and segmentation.

Abstract: Medical image synthesis has become an essential strategy for augmenting datasets and improving model generalization in data-scarce clinical settings. However, fine-grained and controllable synthesis remains difficult due to limited high-quality annotations and domain shifts across datasets. Existing methods, often designed for natural images or well-defined tumors, struggle to generalize to chest radiographs, where disease patterns are morphologically diverse and tightly intertwined with anatomical structures. To address these challenges, we propose AURAD, a controllable radiology synthesis framework that jointly generates high-fidelity chest X-rays and pseudo semantic masks. Unlike prior approaches that rely on randomly sampled masks (limiting diversity, controllability, and clinical relevance), our method learns to generate masks that capture multi-pathology coexistence and anatomical-pathological consistency. It follows a progressive pipeline: pseudo masks are first generated from clinical prompts conditioned on anatomical structures, and then used to guide image synthesis. We also leverage pretrained expert medical models to filter outputs and ensure clinical plausibility. Beyond visual realism, the synthesized masks also serve as labels for downstream tasks such as detection and segmentation, bridging the gap between generative modeling and real-world clinical applications. Extensive experiments and blinded radiologist evaluations demonstrate the effectiveness and generalizability of our method across tasks and datasets. In particular, 78% of our synthesized images are classified as authentic by board-certified radiologists, and over 40% of predicted segmentation overlays are rated as clinically useful. All code, pre-trained models, and the synthesized dataset will be released upon publication.
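
The progressive pipeline, reduced to pseudocode under stated assumptions: `mask_generator`, `image_generator`, and `plausibility_filter` are hypothetical stand-ins for the paper's (not yet released) components.

```python
# Sketch of AURAD's progressive pipeline as described in the abstract:
# prompt -> pseudo mask -> mask-guided image -> expert-model filtering.
def synthesize(prompt, anatomy, mask_generator, image_generator,
               plausibility_filter, max_tries=5):
    for _ in range(max_tries):
        mask = mask_generator(prompt, condition=anatomy)   # stage 1
        image = image_generator(prompt, mask=mask)         # stage 2
        if plausibility_filter(image, mask):               # expert-model check
            return image, mask   # image plus a usable segmentation label
    return None  # no clinically plausible sample found
```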

Last updated: 2025-09-15
Built with Hugo; theme modified from Stack.