Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 163]
- cs.CV [Total: 384]
- cs.AI [Total: 112]
- cs.SD [Total: 33]
- cs.LG [Total: 293]
- cs.MA [Total: 6]
- cs.MM [Total: 3]
- eess.AS [Total: 13]
- eess.IV [Total: 39]
cs.CL
[1] Rethinking Graph-Based Document Classification: Learning Data-Driven Structures Beyond Heuristic Approaches
Margarita Bugueño, Gerard de Melo
Main category: cs.CL
TL;DR: The paper proposes a data-driven method to learn graph structures for document classification, outperforming heuristic-based approaches in accuracy and F1 score.
Details
Motivation: Existing graph-based document representations rely on manual heuristics or expert knowledge, limiting their applicability and effectiveness.
Method: The method constructs homogeneous weighted graphs with sentences as nodes, using a self-attention model to learn edges and a statistical filtering strategy to retain only strongly correlated sentences.
Result: Experiments show the learned graphs outperform heuristic-based ones in accuracy and F1 score, with statistical filtering improving robustness.
Conclusion: Automatic graph generation is more effective than heuristic approaches, offering potential for broader NLP applications.
Abstract: In document classification, graph-based models effectively capture document structure, overcoming sequence length limitations and enhancing contextual understanding. However, most existing graph document representations rely on heuristics, domain-specific rules, or expert knowledge. Unlike previous approaches, we propose a method to learn data-driven graph structures, eliminating the need for manual design and reducing domain dependence. Our approach constructs homogeneous weighted graphs with sentences as nodes, while edges are learned via a self-attention model that identifies dependencies between sentence pairs. A statistical filtering strategy aims to retain only strongly correlated sentences, improving graph quality while reducing the graph size. Experiments on three document classification datasets demonstrate that learned graphs consistently outperform heuristic-based graphs, achieving higher accuracy and $F_1$ score. Furthermore, our study demonstrates the effectiveness of the statistical filtering in improving classification robustness. These results highlight the potential of automatic graph generation over traditional heuristic approaches and open new directions for broader applications in NLP.
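To make the edge-learning step concrete, here is a minimal sketch of how attention-style pairwise scores over sentence embeddings can be thresholded into a sparse graph. The embedding dimension, the row-wise softmax, and the mean-plus-one-standard-deviation cutoff are illustrative assumptions; the paper's trained self-attention model and exact filtering statistic may differ.

```python
import numpy as np

def build_sentence_graph(emb: np.ndarray, num_std: float = 1.0) -> np.ndarray:
    """Derive a weighted sentence graph from embeddings via attention-style
    scores, then keep only strongly correlated sentence pairs."""
    d = emb.shape[1]
    scores = emb @ emb.T / np.sqrt(d)              # pairwise attention logits
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    np.fill_diagonal(weights, 0.0)
    # Statistical filtering: retain edges stronger than mean + k * std.
    threshold = weights.mean() + num_std * weights.std()
    return np.where(weights >= threshold, weights, 0.0)

adj = build_sentence_graph(np.random.randn(8, 64))  # 8 sentences, 64-dim embeddings
print(int((adj > 0).sum()), "edges kept")
```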
[2] FECT: Factuality Evaluation of Interpretive AI-Generated Claims in Contact Center Conversation Transcripts
Hagyeong Shin, Binoy Robin Dalal, Iwona Bialynicka-Birula, Navjot Matharu, Ryan Muir, Xingwei Yang, Samuel W. K. Wong
Main category: cs.CL
TL;DR: The paper addresses hallucinations in LLMs for enterprise applications, introduces a 3D paradigm for factuality evaluation, and presents the FECT benchmark dataset for contact center conversations.
Details
Motivation: Hallucinations in LLMs can harm business decisions, especially in contact center analysis where ground-truth labels for interpretations are lacking.
Method: Proposes a 3D (Decompose, Decouple, Detach) paradigm for human and LLM-judge annotations, and introduces the FECT dataset.
Result: The 3D paradigm and FECT dataset enable better factuality evaluation of AI-generated claims in contact center conversations.
Conclusion: The work provides a novel approach for automated factuality evaluation in AI systems analyzing contact center data.
Abstract: Large language models (LLMs) are known to hallucinate, producing natural language outputs that are not grounded in the input, reference materials, or real-world knowledge. In enterprise applications where AI features support business decisions, such hallucinations can be particularly detrimental. LLMs that analyze and summarize contact center conversations introduce a unique set of challenges for factuality evaluation, because ground-truth labels often do not exist for analytical interpretations about sentiments captured in the conversation and root causes of the business problems. To remedy this, we first introduce a **3D** – **Decompose, Decouple, Detach** – paradigm in the human annotation guideline and the LLM-judges’ prompt to ground the factuality labels in linguistically-informed evaluation criteria. We then introduce **FECT**, a novel benchmark dataset for **F**actuality **E**valuation of Interpretive AI-Generated **C**laims in Contact Center Conversation **T**ranscripts, labeled under our 3D paradigm. Lastly, we report our findings from aligning LLM-judges on the 3D paradigm. Overall, our findings contribute a new approach for automatically evaluating the factuality of outputs generated by an AI system for analyzing contact center conversations.
[3] XAutoLM: Efficient Fine-Tuning of Language Models via Meta-Learning and AutoML
Ernesto L. Estevanell-Valladares, Suilan Estevez-Velarde, Yoan Gutiérrez, Andrés Montoyo, Ruslan Mitkov
Main category: cs.CL
TL;DR: XAutoLM is a meta-learning-augmented AutoML framework for efficient fine-tuning of language models, outperforming zero-shot optimizers and reducing resource usage.
Details
Motivation: Addressing the lack of automated frameworks for resource-efficient LM fine-tuning, given the high computational and environmental costs.
Method: XAutoLM uses meta-learning to reuse past experiences, extracting task- and system-level meta-features to optimize sampling.
Result: Outperforms zero-shot optimizers on most tasks, reduces evaluation time by 4.5x, error ratios by sevenfold, and discovers more efficient pipelines.
Conclusion: XAutoLM enables resource-efficient LM fine-tuning, promoting Green AI in NLP.
Abstract: Experts in machine learning leverage domain knowledge to navigate decisions in model selection, hyperparameter optimisation, and resource allocation. This is particularly critical for fine-tuning language models (LMs), where repeated trials incur substantial computational overhead and environmental impact. However, no existing automated framework simultaneously tackles the entire model selection and HPO task for resource-efficient LM fine-tuning. We introduce XAutoLM, a meta-learning-augmented AutoML framework that reuses past experiences to optimise discriminative and generative LM fine-tuning pipelines efficiently. XAutoLM learns from stored successes and failures by extracting task- and system-level meta-features to bias its sampling toward fruitful configurations and away from costly dead ends. On four text classification and two question-answering benchmarks, XAutoLM surpasses the zero-shot optimiser’s peak F1 on five of six tasks, cuts mean evaluation time by up to 4.5x, reduces error ratios by up to sevenfold, and uncovers up to 50% more pipelines above the zero-shot Pareto front. In contrast, simpler memory-based baselines suffer negative transfer. We release XAutoLM and our experience store to catalyse resource-efficient, Green AI fine-tuning in the NLP community.
[4] MAO-ARAG: Multi-Agent Orchestration for Adaptive Retrieval-Augmented Generation
Yiqun Chen, Erhan Zhang, Lingyong Yan, Shuaiqiang Wang, Jizhou Huang, Dawei Yin, Jiaxin Mao
Main category: cs.CL
TL;DR: MAO-ARAG is an adaptive RAG framework using multi-agent orchestration to dynamically tailor workflows for QA queries, balancing performance and cost.
Details
Motivation: Fixed RAG pipelines struggle with varying query complexities, leading to inefficiencies in performance and cost.
Method: Proposes MAO-ARAG, a multi-agent framework with executor agents (query reformulation, document selection, generation) and a planner agent trained via reinforcement learning.
Result: Achieves high answer quality while maintaining reasonable costs and latency on multiple QA datasets.
Conclusion: MAO-ARAG effectively addresses the limitations of fixed RAG systems by dynamically adapting to query needs.
Abstract: In question-answering (QA) systems, Retrieval-Augmented Generation (RAG) has become pivotal in enhancing response accuracy and reducing hallucination issues. The architecture of RAG systems varies significantly, encompassing single-round RAG, iterative RAG, and reasoning RAG, each tailored to address different types of queries. Due to the varying complexity of real-world queries, a fixed RAG pipeline often struggles to balance performance and cost efficiency across different queries. To address this challenge, we propose an adaptive RAG framework called MAO-ARAG, which leverages multi-agent orchestration. Our adaptive RAG is conceived as a multi-turn framework. Specifically, we define multiple executor agents representing typical RAG modules, such as query reformulation agents, document selection agents, and generation agents. A planner agent intelligently selects and integrates the appropriate agents from these executors into a suitable workflow tailored for each query, striving for high-quality answers while maintaining reasonable costs. During each turn, the planner agent is trained using reinforcement learning, guided by an outcome-based reward (F1 score) and a cost-based penalty, continuously improving answer quality while keeping costs within a reasonable range. Experiments conducted on multiple QA datasets demonstrate that our approach, which dynamically plans workflows for each query, not only achieves high answer quality but also maintains both cost and latency within acceptable limits. The code of MAO-ARAG is available at https://github.com/chenyiqun/Agentic-RAG.
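The planner's training signal is described as an outcome-based reward (F1 score) minus a cost-based penalty. Below is a minimal sketch of such a reward, assuming token-level F1 and a linear cost weight `lam`; the paper's exact penalty form and cost units are not given in the summary.

```python
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between predicted and gold answers."""
    p, g = pred.split(), gold.split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def planner_reward(pred: str, gold: str, cost: float, lam: float = 0.1) -> float:
    # Outcome-based reward (F1) minus a cost penalty; lam is an
    # illustrative trade-off weight, not a value from the paper.
    return token_f1(pred, gold) - lam * cost

print(planner_reward("paris is the capital", "paris", cost=2.0))  # -> 0.2
```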
[5] UrBLiMP: A Benchmark for Evaluating the Linguistic Competence of Large Language Models in Urdu
Farah Adeeba, Brian Dillon, Hassan Sajjad, Rajesh Bhatt
Main category: cs.CL
TL;DR: The paper introduces UrBLiMP, a benchmark for evaluating multilingual LLMs’ linguistic knowledge in Urdu, revealing performance variations and limitations.
Details
Motivation: Assess LLMs' syntactic knowledge in low-resource languages like Urdu, which are underrepresented in training data.
Method: Created UrBLiMP, a dataset of 5,696 minimal pairs for ten syntactic phenomena, validated by human annotators (96.10% agreement). Evaluated 20 multilingual LLMs.
Result: LLaMA-3-70B performed best (94.73% accuracy), but top models like Gemma-3-27B-PT were comparable. Performance varied across linguistic phenomena.
Conclusion: Multilingual LLMs show potential but have limitations in capturing fine-grained syntax for low-resource languages.
Abstract: Multilingual Large Language Models (LLMs) have shown remarkable performance across various languages; however, they often include significantly less data for low-resource languages such as Urdu compared to high-resource languages like English. To assess the linguistic knowledge of LLMs in Urdu, we present the Urdu Benchmark of Linguistic Minimal Pairs (UrBLiMP), i.e., pairs of minimally different sentences that contrast in grammatical acceptability. UrBLiMP comprises 5,696 minimal pairs targeting ten core syntactic phenomena, carefully curated using the Urdu Treebank and diverse Urdu text corpora. A human evaluation of UrBLiMP annotations yielded a 96.10% inter-annotator agreement, confirming the reliability of the dataset. We evaluate twenty multilingual LLMs on UrBLiMP, revealing significant variation in performance across linguistic phenomena. While LLaMA-3-70B achieves the highest average accuracy (94.73%), its performance is statistically comparable to other top models such as Gemma-3-27B-PT. These findings highlight both the potential and the limitations of current multilingual LLMs in capturing fine-grained syntactic knowledge in low-resource languages.
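Minimal-pair benchmarks in the BLiMP family are typically scored by checking whether a model assigns higher probability to the grammatical sentence of each pair. Here is a sketch of that protocol with Hugging Face transformers; `gpt2` and the English example pair are stand-ins for the multilingual models and Urdu data actually evaluated in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sentence_logprob(model, tok, sentence: str) -> float:
    """Total log-probability a causal LM assigns to a sentence."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean NLL over the (len - 1) predicted tokens;
    # negate and rescale to recover a total log-probability.
    return -out.loss.item() * (ids.shape[1] - 1)

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

good, bad = "The keys are on the table.", "The keys is on the table."
correct = sentence_logprob(model, tok, good) > sentence_logprob(model, tok, bad)
print("model prefers grammatical sentence:", correct)
```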
[6] Cross-Domain Web Information Extraction at Pinterest
Michael Farag, Patrick Halina, Andrey Zaytsev, Alekhya Munagala, Imtihan Ahmed, Junhao Wang
Main category: cs.CL
TL;DR: Pinterest’s system for attribute extraction from e-commerce websites achieves high accuracy and scalability using a compact webpage representation, outperforming complex LLMs like GPT with simpler models like XGBoost.
Details
Motivation: To enhance user experiences and improve content distribution by accurately extracting structured product data from unstructured e-commerce websites.
Method: Leverages a novel webpage representation combining structural, visual, and text modalities, optimized for small model learning (e.g., XGBoost).
Result: Highly scalable (1,000 URLs/sec) and 1000x more cost-effective than GPT alternatives, with superior accuracy.
Conclusion: The system demonstrates that simpler models can outperform complex LLMs in attribute extraction when paired with an optimized representation.
Abstract: The internet offers a massive repository of unstructured information, but it’s a significant challenge to convert this into a structured format. At Pinterest, the ability to accurately extract structured product data from e-commerce websites is essential to enhance user experiences and improve content distribution. In this paper, we present Pinterest’s system for attribute extraction, which achieves remarkable accuracy and scalability at a manageable cost. Our approach leverages a novel webpage representation that combines structural, visual, and text modalities into a compact form, optimizing it for small model learning. This representation captures each visible HTML node with its text, style and layout information. We show how this allows simple models such as eXtreme Gradient Boosting (XGBoost) to extract attributes more accurately than much more complex Large Language Models (LLMs) such as Generative Pre-trained Transformer (GPT). Our results demonstrate a system that is highly scalable, processing over 1,000 URLs per second, while being 1000 times more cost-effective than the cheapest GPT alternatives.
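A toy illustration of the core claim, that a gradient-boosted tree over compact per-node features can classify attributes: the feature columns and labels below are hypothetical stand-ins for the paper's text, style, and layout signals, not Pinterest's actual representation.

```python
import numpy as np
from xgboost import XGBClassifier

# Hypothetical per-node features distilled from the paper's description:
# columns = [text_len, font_size, x, y, width, height, has_currency_symbol]
X = np.array([
    [12, 22, 40, 120, 300, 32, 1],   # looks like a price node
    [85, 14, 40, 400, 600, 80, 0],   # looks like a description node
    [ 9, 28, 40,  60, 500, 40, 0],   # looks like a title node
] * 20, dtype=float)
y = np.array([0, 1, 2] * 20)         # attribute class per visible HTML node

clf = XGBClassifier(n_estimators=50, max_depth=4, eval_metric="mlogloss")
clf.fit(X, y)
print(clf.predict(X[:3]))            # -> [0 1 2]
```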
[7] Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates
Liam G. McCoy, Fateme Nateghi Haredasht, Kanav Chopra, David Wu, David JH Wu, Abass Conteh, Sarita Khemani, Saloni Kumar Maharaj, Vishnu Ravi, Arth Pahwa, Yingjie Weng, Leah Rosengaus, Lena Giang, Kelvin Zhenghao Li, Olivia Jee, Daniel Shirvani, Ethan Goh, Jonathan H. Chen
Main category: cs.CL
TL;DR: This study evaluates LLMs’ ability to generate structured clinical consultation templates, finding high comprehensiveness but issues with length and prioritization, especially in narrative-driven fields.
Details
Motivation: To assess the potential of LLMs in improving structured clinical information exchange between physicians by generating coherent and concise consultation templates.
Method: The study used 145 expert-crafted templates to evaluate frontier LLMs (e.g., GPT-4o, Claude 4 Sonnet) through a multi-agent pipeline involving prompt optimization, semantic autograding, and prioritization analysis.
Result: LLMs like o3 achieved high comprehensiveness (up to 92.2%) but produced overly long templates and struggled with prioritization under length constraints, with performance varying by specialty.
Conclusion: LLMs can enhance clinical information exchange but require better evaluation methods to ensure prioritization of salient information within real-world time constraints.
Abstract: This study evaluates the capacity of large language models (LLMs) to generate structured clinical consultation templates for electronic consultation. Using 145 expert-crafted templates developed and routinely used by Stanford’s eConsult team, we assess frontier models – including o3, GPT-4o, Kimi K2, Claude 4 Sonnet, Llama 3 70B, and Gemini 2.5 Pro – for their ability to produce clinically coherent, concise, and prioritized clinical question schemas. Through a multi-agent pipeline combining prompt optimization, semantic autograding, and prioritization analysis, we show that while models like o3 achieve high comprehensiveness (up to 92.2%), they consistently generate excessively long templates and fail to correctly prioritize the most clinically important questions under length constraints. Performance varies across specialties, with significant degradation in narrative-driven fields such as psychiatry and pain medicine. Our findings demonstrate that LLMs can enhance structured clinical information exchange between physicians, while highlighting the need for more robust evaluation methods that capture a model’s ability to prioritize clinically salient information within the time constraints of real-world physician communication.
[8] CSIRO-LT at SemEval-2025 Task 11: Adapting LLMs for Emotion Recognition for Multiple Languages
Jiyu Chen, Necva Bölücü, Sarvnaz Karimi, Diego Mollá, Cécile L. Paris
Main category: cs.CL
TL;DR: The paper explores emotion recognition across languages, focusing on fine-tuning multilingual LLMs with LoRA for each language as the most effective method.
Details
Motivation: Challenges in detecting emotions across languages due to cultural nuances led to the Semeval 2025 Task 11, aiming to bridge this gap.
Method: Fine-tuning a pre-trained multilingual LLM with LoRA separately for each language.
Result: The most effective method for emotion recognition across languages is fine-tuning with LoRA per language.
Conclusion: Fine-tuning multilingual LLMs with LoRA per language is optimal for cross-lingual emotion recognition.
Abstract: Detecting emotions across different languages is challenging due to the varied and culturally nuanced ways of emotional expressions. The *Semeval 2025 Task 11: Bridging the Gap in Text-Based Emotion* shared task was organised to investigate emotion recognition across different languages. The goal of the task is to implement an emotion recogniser that can identify the basic emotional states that general third-party observers would attribute to an author based on their written text snippet, along with the intensity of those emotions. We report our investigation of various task-adaptation strategies for LLMs in emotion recognition. We show that the most effective method for this task is to fine-tune a pre-trained multilingual LLM with LoRA separately for each language.
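A minimal sketch of the winning recipe, one LoRA adapter fine-tuned per language over a shared multilingual base, using the peft library. The base model, rank, target modules, and the single-label classification head are illustrative choices, not the authors' configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

def make_language_adapter(base_model_name: str, num_emotions: int):
    """One LoRA adapter per language over a shared multilingual base."""
    model = AutoModelForSequenceClassification.from_pretrained(
        base_model_name, num_labels=num_emotions
    )
    config = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["query", "value"],  # attention projections (assumption)
        task_type="SEQ_CLS",
    )
    return get_peft_model(model, config)

# Train one adapter separately for each language, per the paper's finding.
model_ar = make_language_adapter("xlm-roberta-base", num_emotions=6)
model_de = make_language_adapter("xlm-roberta-base", num_emotions=6)
```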
[9] Adaptive Content Restriction for Large Language Models via Suffix Optimization
Yige Li, Peihai Jiang, Jun Sun, Peng Shu, Tianming Liu, Zhen Xiang
Main category: cs.CL
TL;DR: The paper introduces Adaptive Content Restriction (AdaCoRe), a task for lightweight methods to prevent LLMs from generating restricted terms without fine-tuning. It proposes Suffix Optimization (SOP) and evaluates it on CoReBench, showing significant improvements over baselines.
Details
Motivation: Content restriction in LLMs is challenging due to varying user needs and dynamic definitions of harmfulness. Fine-tuning for each case is impractical, necessitating lightweight solutions.
Method: Proposes Suffix Optimization (SOP), which appends an optimized suffix to prompts to block restricted terms while maintaining output quality. Evaluated on CoReBench, a new benchmark with 400 prompts for 80 terms across 8 categories.
Result: SOP outperforms baselines by 15%, 17%, 10%, 9%, and 6% on average restriction rates for various LLMs. It also works effectively on the POE platform.
Conclusion: SOP is a practical, lightweight solution for adaptive content restriction in LLMs, validated by strong performance on CoReBench and real-world platforms.
Abstract: Large Language Models (LLMs) have demonstrated significant success across diverse applications. However, enforcing content restrictions remains a significant challenge due to their expansive output space. One aspect of content restriction is preventing LLMs from generating harmful content via model alignment approaches such as supervised fine-tuning (SFT). Yet, the need for content restriction may vary significantly across user groups, change rapidly over time, and not always align with general definitions of harmfulness. Applying SFT to each of these specific use cases is impractical due to the high computational, data, and storage demands. Motivated by this need, we propose a new task called *Adaptive Content Restriction* (AdaCoRe), which focuses on lightweight strategies – methods without model fine-tuning – to prevent deployed LLMs from generating restricted terms for specific use cases. We propose the first method for AdaCoRe, named *Suffix Optimization (SOP)*, which appends a short, optimized suffix to any prompt to a) prevent a target LLM from generating a set of restricted terms, while b) preserving the output quality. To evaluate AdaCoRe approaches, including our SOP, we create a new *Content Restriction Benchmark* (CoReBench), which contains 400 prompts for 80 restricted terms across 8 carefully selected categories. We demonstrate the effectiveness of SOP on CoReBench, which outperforms the system-level baselines such as system suffix by 15%, 17%, 10%, 9%, and 6% on average restriction rates for Gemma2-2B, Mistral-7B, Vicuna-7B, Llama3-8B, and Llama3.1-8B, respectively. We also demonstrate that SOP is effective on POE, an online platform hosting various commercial LLMs, highlighting its practicality in real-world scenarios.
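The summary leaves SOP's optimizer unspecified; below is a hedged sketch of the general shape of the idea, a search over suffix tokens that lowers a restricted-term likelihood returned by a hypothetical `score_fn`. Random local search here stands in for whatever discrete or gradient-based optimizer the paper actually uses, and the toy scorer is purely illustrative.

```python
import random

def optimize_suffix(prompt, score_fn, vocab, steps=200, length=8, seed=0):
    """Search for a suffix that lowers the restricted-term likelihood
    score_fn assigns to the full prompt."""
    rng = random.Random(seed)
    suffix = rng.choices(vocab, k=length)
    best = score_fn(prompt + " " + " ".join(suffix))
    for _ in range(steps):
        i = rng.randrange(length)
        candidate = suffix.copy()
        candidate[i] = rng.choice(vocab)
        s = score_fn(prompt + " " + " ".join(candidate))
        if s < best:  # lower restricted-term likelihood wins
            suffix, best = candidate, s
    return " ".join(suffix), best

# Toy scorer: pretends suffixes containing "safely" suppress restricted terms.
toy_score = lambda text: 0.1 if "safely" in text else 1.0
print(optimize_suffix("Tell me about chemistry.", toy_score,
                      vocab=["respond", "safely", "briefly", "formally"]))
```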
[10] Show or Tell? Modeling the evolution of request-making in Human-LLM conversations
Shengqi Zhu, Jeffrey M. Rzeszotarski, David Mimno
Main category: cs.CL
TL;DR: The paper introduces a task to segment chat queries into components like requests, roles, and context, revealing differences between LLM and human interactions. It analyzes user behavior over time and the impact of model capabilities.
Details
Motivation: To understand user behavior in LLM chat logs, which is often obscured by query variability, and to explore how users adapt over time and with new model capabilities.
Method: Segmenting chat queries into structured components (requests, roles, context, expressions) and conducting diachronic analyses of user behavior.
Result: User query patterns evolve from request-focused to more exploratory, converging with experience. Model capabilities also influence behavior, especially with new model introductions.
Conclusion: The study provides insights into LLM user behavior, highlighting the impact of experience and model updates, and offers a framework for analyzing chat logs.
Abstract: Chat logs provide a rich source of information about LLM users, but patterns of user behavior are often masked by the variability of queries. We present a new task, segmenting chat queries into contents of requests, roles, query-specific context, and additional expressions. We find that, despite the familiarity of chat-based interaction, request-making in LLM queries remains significantly different from comparable human-human interactions. With the data resource, we introduce an important perspective of diachronic analyses with user expressions. We find that query patterns vary, with early ones emphasizing requests; individual users explore patterns but tend to converge with experience. Finally, we show that model capabilities affect user behavior, particularly with the introduction of new models, which are traceable at the community level.
[11] WebDS: An End-to-End Benchmark for Web-based Data Science
Ethan Hsu, Hong Meng Yam, Ines Bouissou, Aaron Murali John, Raj Thota, Josh Koe, Vivek Sarath Putta, G K Dharesan, Alexander Spangher, Shikhar Murty, Tenghao Huang, Christopher D. Manning
Main category: cs.CL
TL;DR: WebDS is a new benchmark for web-based data science tasks, addressing gaps in existing benchmarks by focusing on complex, multi-step operations with diverse data formats and tools.
Details
Motivation: Existing benchmarks are too simplistic or static, failing to reflect real-world data science workflows involving web interactions and heterogeneous data.
Method: WebDS includes 870 tasks across 29 websites, challenging agents with multi-step operations and diverse data formats.
Result: Current SOTA LLM agents perform poorly on WebDS (15% success rate), revealing new failure modes like poor information grounding and repetitive behavior.
Conclusion: WebDS provides a realistic testing ground for advancing LLM-based data science tools.
Abstract: A large portion of real-world data science tasks are complex and require multi-hop web-based interactions: finding appropriate data available on the internet, synthesizing real-time data of various modalities from different locations, and producing summarized analyses. Existing web benchmarks often focus on simplistic interactions, such as form submissions or e-commerce transactions, and often do not exercise the diverse tool-using capabilities required for web-based data science. Conversely, traditional data science benchmarks typically concentrate on static, often textually bound datasets and do not assess end-to-end workflows that encompass data acquisition, cleaning, analysis, and insight generation. In response, we introduce WebDS, the first end-to-end web-based data science benchmark. It comprises 870 web-based data science tasks across 29 diverse websites from structured government data portals to unstructured news media, challenging agents to perform complex, multi-step operations requiring the use of tools and heterogeneous data formats that better reflect the realities of modern data analytics. Evaluations of current SOTA LLM agents indicate significant performance gaps in accomplishing these tasks. For instance, Browser Use, which accomplishes 80% of tasks on Web Voyager, successfully completes only 15% of tasks in WebDS, which our analysis suggests is due to new failure modes like poor information grounding, repetitive behavior and shortcut-taking that agents performing WebDS’ tasks display. By providing a more robust and realistic testing ground, WebDS sets the stage for significant advances in the development of practically useful LLM-based data science.
[12] WarriorMath: Enhancing the Mathematical Ability of Large Language Models with a Defect-aware Framework
Yue Chen, Minghua He, Fangkai Yang, Pu Zhao, Lu Wang, Yu Kang, Yifei Dong, Yuefeng Zhan, Hao Sun, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Main category: cs.CL
TL;DR: WarriorMath is a defect-aware framework for improving LLMs’ mathematical problem-solving by generating targeted training data and using progressive learning.
Details
Motivation: Existing methods for augmenting training data overlook LLMs' specific failure modes, leading to minimal performance gains.
Method: WarriorMath integrates targeted data synthesis (using expert LLMs to generate and refine problems) and progressive training (iterative fine-tuning with challenging data).
Result: WarriorMath outperforms baselines by 12.57% on six benchmarks, achieving state-of-the-art performance.
Conclusion: The defect-aware, multi-expert framework effectively enhances LLMs’ mathematical abilities.
Abstract: Large Language Models (LLMs) excel in solving mathematical problems, yet their performance is often limited by the availability of high-quality, diverse training data. Existing methods focus on augmenting datasets through rephrasing or difficulty progression but overlook the specific failure modes of LLMs. This results in synthetic questions that the model can already solve, providing minimal performance gains. To address this, we propose WarriorMath, a defect-aware framework for mathematical problem solving that integrates both targeted data synthesis and progressive training. In the synthesis stage, we employ multiple expert LLMs in a collaborative process to generate, critique, and refine problems. Questions that base LLMs fail to solve are identified and iteratively improved through expert-level feedback, producing high-quality, defect-aware training data. In the training stage, we introduce a progressive learning framework that iteratively fine-tunes the model using increasingly challenging data tailored to its weaknesses. Experiments on six mathematical benchmarks show that WarriorMath outperforms strong baselines by 12.57% on average, setting a new state-of-the-art. Our results demonstrate the effectiveness of a defect-aware, multi-expert framework for improving mathematical ability.
[13] Bridging LLMs and Symbolic Reasoning in Educational QA Systems: Insights from the XAI Challenge at IJCNN 2025
Long S. T. Nguyen, Khang H. N. Vo, Thu H. A. Nguyen, Tuan C. Bui, Duc Q. Nguyen, Thanh-Tung Tran, Anh D. Nguyen, Minh L. Nguyen, Fabien Baldacci, Thang H. Bui, Emanuel Di Nardo, Angelo Ciaramella, Son H. Le, Ihsan Ullah, Lorenzo Di Rocco, Tho T. Quan
Main category: cs.CL
TL;DR: The paper analyzes the XAI Challenge 2025, a hackathon focused on building explainable QA systems for education using lightweight LLMs or hybrid symbolic systems.
Details
Motivation: Address the need for transparency and interpretability in AI for education, leveraging hackathons to prototype XAI solutions.
Method: Hackathon-style competition with participants developing QA systems using logic-based templates and Z3 validation, evaluated for explainability.
Result: A novel approach combining LLMs and symbolic reasoning for explainability, with insights for future XAI educational systems.
Conclusion: The challenge bridges LLMs and symbolic reasoning, offering a model for future XAI initiatives in education.
Abstract: The growing integration of Artificial Intelligence (AI) into education has intensified the need for transparency and interpretability. While hackathons have long served as agile environments for rapid AI prototyping, few have directly addressed eXplainable AI (XAI) in real-world educational contexts. This paper presents a comprehensive analysis of the XAI Challenge 2025, a hackathon-style competition jointly organized by Ho Chi Minh City University of Technology (HCMUT) and the International Workshop on Trustworthiness and Reliability in Neurosymbolic AI (TRNS-AI), held as part of the International Joint Conference on Neural Networks (IJCNN 2025). The challenge tasked participants with building Question-Answering (QA) systems capable of answering student queries about university policies while generating clear, logic-based natural language explanations. To promote transparency and trustworthiness, solutions were required to use lightweight Large Language Models (LLMs) or hybrid LLM-symbolic systems. A high-quality dataset was provided, constructed via logic-based templates with Z3 validation and refined through expert student review to ensure alignment with real-world academic scenarios. We describe the challenge’s motivation, structure, dataset construction, and evaluation protocol. Situating the competition within the broader evolution of AI hackathons, we argue that it represents a novel effort to bridge LLMs and symbolic reasoning in service of explainability. Our findings offer actionable insights for future XAI-centered educational systems and competitive research initiatives.
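To illustrate the Z3 validation step mentioned in the dataset construction, here is a toy consistency check with the z3-solver Python bindings. The policy rule and candidate answer are invented examples, not items from the challenge dataset.

```python
from z3 import Bool, Implies, Not, Solver, sat

# Toy policy rule: a student may retake an exam only if they failed it.
failed = Bool("failed_exam")
may_retake = Bool("may_retake")
policy = Implies(may_retake, failed)

# Candidate answer to validate: "you may retake even though you passed".
answer = [may_retake, Not(failed)]

s = Solver()
s.add(policy, *answer)
# unsat means the answer contradicts the policy and should be rejected.
print("answer consistent with policy:", s.check() == sat)  # -> False
```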
[14] Prompting Large Language Models with Partial Knowledge for Answering Questions with Unseen Entities
Zhichao Yan, Jiapu Wang, Jiaoyan Chen, Yanyan Wang, Hongye Tan, Jiye Liang, Xiaoli Li, Ru Li, Jeff Z. Pan
Main category: cs.CL
TL;DR: The paper explores how partially relevant knowledge can ‘awaken’ LLMs in RAG systems, improving performance in incomplete knowledge scenarios, and introduces a new task, Unseen Entity KGQA.
Details
Motivation: To address the challenge of effectively utilizing partially relevant knowledge in RAG systems, especially in incomplete knowledge bases, and to explore LLMs' potential to leverage such knowledge.
Method: Uses triplets from gold reasoning paths (with answer paths removed) to construct partially relevant knowledge, analyzes the awakening effect theoretically, and tests on KG QA datasets. Introduces Unseen Entity KGQA for real-world challenges.
Result: The awakening-based approach outperforms traditional embedding-based methods, showing efficacy in practical applications.
Conclusion: Partially relevant knowledge can effectively ‘awaken’ LLMs, offering a promising direction for RAG systems in incomplete knowledge scenarios.
Abstract: Retrieval-Augmented Generation (RAG) shows impressive performance by supplementing and substituting parametric knowledge in Large Language Models (LLMs). Retrieved knowledge can be divided into three types: explicit answer evidence, implicit answer clue, and insufficient answer context which can be further categorized into totally irrelevant and partially relevant information. Effectively utilizing partially relevant knowledge remains a key challenge for RAG systems, especially in incomplete knowledge base retrieval. Contrary to the conventional view, we propose a new perspective: LLMs can be awakened via partially relevant knowledge already embedded in LLMs. To comprehensively investigate this phenomenon, the triplets located in the gold reasoning path and their variants are used to construct partially relevant knowledge by removing the path that contains the answer. We provide theoretical analysis of the awakening effect in LLMs and support our hypothesis with experiments on two Knowledge Graphs (KGs) Question Answering (QA) datasets. Furthermore, we present a new task, Unseen Entity KGQA, simulating real-world challenges where entity linking fails due to KG incompleteness. Our awakening-based approach demonstrates greater efficacy in practical applications, outperforming traditional methods that rely on embedding-based similarity and are prone to returning noisy information.
[15] KEDAS: Knowledge Editing Alignment with Diverse Augmentation and Self-adaptive Inference
Chenming Tang, Yutong Yang, Yunfang Wu
Main category: cs.CL
TL;DR: KEDAS improves LLM knowledge editing via alignment, diverse augmentation, and self-adaptive inference, outperforming existing methods.
Details
Motivation: Efficiently update outdated knowledge in LLMs while preserving their capabilities.
Method: Uses low-rank adaptation for alignment, diverse edit augmentation, and self-adaptive inference with a smart retriever.
Result: Achieves top performance in 35/36 cases, surpassing baselines by 19.8 harmonic mean scores.
Conclusion: KEDAS is robust and efficient, offering an ideal paradigm for knowledge editing alignment.
Abstract: Knowledge editing aims to modify outdated knowledge in large language models (LLMs) efficiently while retaining their powerful capabilities. Most existing methods rely on either parameter-level editing or retrieval-based approaches. In this work, we propose Knowledge Editing alignment with Diverse Augmentation and Self-adaptive inference (KEDAS) to better align LLMs with knowledge editing. In the alignment phase, LLMs learn to apply in-context edited knowledge via low-rank adaptation. During editing, we design a diverse edit augmentation technique to improve the recall of edits. After that, a self-adaptive post-alignment inference mechanism is proposed, in which a filter-based smart retriever is employed to perform a dynamic selection of inference routing. Specifically, irrelevant queries will go through the original pre-alignment model directly, while relevant ones, together with their related edits, go through the model with aligned adapters activated. In experiments, KEDAS secures the highest overall performance scores in 35 out of 36 cases across four datasets with three LLMs on three settings, surpassing its strong knowledge editing alignment counterpart by about 19.8 points in the harmonic mean of edit success, locality, and portability, and outperforming both parameter editing and retrieval-based baselines significantly. Analysis of computational cost and performance on general tasks further validates the robustness and efficiency of KEDAS, indicating that it presents an ideal paradigm of knowledge editing alignment.
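A minimal sketch of the self-adaptive routing described above: a filter-based retriever decides whether a query touches any stored edit, sending irrelevant queries straight to the pre-alignment model and relevant ones, with their edits in context, to the adapter-activated model. The callables and the threshold `tau` are hypothetical placeholders.

```python
def route_query(query, edit_store, base_model, aligned_model, retriever, tau=0.7):
    """Self-adaptive inference routing (sketch).

    retriever(query, edit_store) yields (edit_text, score) pairs;
    tau is an illustrative relevance threshold."""
    hits = [(e, s) for e, s in retriever(query, edit_store) if s >= tau]
    if not hits:
        return base_model(query)                   # irrelevant: pre-alignment model
    context = "\n".join(e for e, _ in hits)        # relevant: related edits in context,
    return aligned_model(f"{context}\n\n{query}")  # aligned adapters activated
```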
[16] D-SCoRE: Document-Centric Segmentation and CoT Reasoning with Structured Export for QA-CoT Data Generation
Weibo Zhou, Lingbo Li, Shangsong Liang
Main category: cs.CL
TL;DR: D-SCoRE is a training-free pipeline for generating diverse, high-quality QA datasets from text, outperforming human-annotated datasets in domain-specific fine-tuning.
Details
Motivation: High-quality QA datasets are scarce and expensive, limiting supervised fine-tuning for domain-specific LLMs.
Method: Uses document-centric processing, segmentation, CoT reasoning, and structured export to create QA-COT datasets with multi-dimensional controls for diversity.
Result: Outperforms human-annotated datasets (SQuAD, Covid-QA) in evaluations on SQuADShifts and Covid-QA test sets.
Conclusion: D-SCoRE is efficient, scalable, and effective for domain-aware QA dataset generation and LLM fine-tuning.
Abstract: The scarcity and high cost of high-quality question-answering (QA) datasets hinder supervised fine-tuning (SFT) for domain-specific large language models (LLMs). To address this, we introduce D-SCoRE, a training-free pipeline that utilizes LLMs and prompt engineering to produce diverse, high-quality QA datasets from arbitrary textual sources. D-SCoRE integrates **D**ocument-centric processing, **S**egmentation, **Co**T **R**easoning, and structured **E**xport to generate QA-COT datasets tailored for domain-aware SFT. Multi-dimensional control mechanisms, such as semantic role transformation, question type balancing, and counterfactual materials, enhance diversity and relevance, overcoming limitations of existing QA generation. LLMs fine-tuned on D-SCoRE-generated QA datasets, and human-annotated QA datasets (SQuAD, Covid-QA) are evaluated on SQuADShifts and Covid-QA test sets, with D-SCoRE outperforming across most domains. D-SCoRE generates six QA-CoT pairs with four-option counterfactual materials per 100-200-word text in 90 seconds using an 8B LLM on consumer-grade hardware. Its simplicity and scalability enable efficient QA generation and high-performance fine-tuning across domains.
[17] Marco-Voice Technical Report
Fengping Tian, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, Kaifu Zhang
Main category: cs.CL
TL;DR: A multifunctional speech synthesis system, Marco-Voice, integrates voice cloning and emotion control, achieving expressive and natural speech while preserving speaker identity.
Details
Motivation: To address challenges in expressive, controllable, and natural speech generation that maintains speaker identity across diverse linguistic and emotional contexts.
Method: Introduces speaker-emotion disentanglement with in-batch contrastive learning and rotational emotional embedding for smooth emotion control. Uses CSEMOTIONS dataset (10 hours of Mandarin speech, 6 speakers, 7 emotions).
Result: Marco-Voice shows substantial improvements in objective and subjective metrics, excelling in speech clarity and emotional richness.
Conclusion: Marco-Voice represents a significant advance in expressive neural speech synthesis, delivering competitive performance.
Abstract: This paper presents a multifunctional speech synthesis system that integrates voice cloning and emotion control speech synthesis within a unified framework. The goal of this work is to address longstanding challenges in achieving highly expressive, controllable, and natural speech generation that faithfully preserves speaker identity across diverse linguistic and emotional contexts. Our approach introduces an effective speaker-emotion disentanglement mechanism with in-batch contrastive learning, enabling independent manipulation of speaker identity and emotional style, as well as a rotational emotional embedding integration method for smooth emotion control. To support comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality emotional speech dataset containing 10 hours of Mandarin speech from six professional speakers across seven emotional categories. Extensive experiments demonstrate that our system, Marco-Voice, achieves substantial improvements in both objective and subjective metrics. Comprehensive evaluations and analyses show that Marco-Voice delivers competitive performance in terms of speech clarity and emotional richness, representing a substantial advance in the field of expressive neural speech synthesis.
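The summary does not give the rotational integration formula; as one plausible reading of "rotational embedding for smooth emotion control", rotating from a neutral embedding toward a target emotion embedding by a fraction of the angle between them (spherical interpolation) yields a smooth intensity dial. This is an illustration of the idea, not the paper's formulation.

```python
import numpy as np

def rotate_emotion(neutral: np.ndarray, target: np.ndarray, alpha: float) -> np.ndarray:
    """Rotate from a neutral toward a target emotion embedding by
    fraction alpha of the angle between them (slerp)."""
    n = neutral / np.linalg.norm(neutral)
    t = target / np.linalg.norm(target)
    theta = np.arccos(np.clip(n @ t, -1.0, 1.0))
    if theta < 1e-6:
        return neutral
    return (np.sin((1 - alpha) * theta) * n + np.sin(alpha * theta) * t) / np.sin(theta)

# alpha sweeps emotion intensity smoothly from neutral (0.0) to full (1.0).
half_happy = rotate_emotion(np.random.randn(128), np.random.randn(128), alpha=0.5)
```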
[18] LinkQA: Synthesizing Diverse QA from Multiple Seeds Strongly Linked by Knowledge Points
Xuemiao Zhang, Can Ren, Chengying Tu, Rongxiang Weng, Hongfei Yan, Jingang Wang, Xunliang Cai
Main category: cs.CL
TL;DR: LinkSyn is a KP graph-based framework for synthesizing diverse QA data, improving LLM training by balancing KP coverage and popularity. It enhances model performance significantly.
Details
Motivation: Addressing the scarcity of high-quality, diverse training data for LLMs by synthesizing QA data with controlled discipline and difficulty distributions.
Method: Constructs a KP graph from QA seed data, uses knowledge distribution and diffusion-based synthesis to generate diverse QA data, and enhances high-difficulty QAs.
Result: Synthesized LinkQA dataset (50B tokens) improves Llama-3 8B performance by 11.51% on MMLU and CMMLU, setting new SOTA.
Conclusion: LinkSyn effectively addresses data scarcity, enhancing LLM performance across disciplines and difficulty levels.
Abstract: The advancement of large language models (LLMs) struggles with the scarcity of high-quality, diverse training data. To address this limitation, we propose LinkSyn, a novel knowledge point (KP) graph-based synthesis framework that enables flexible control over discipline and difficulty distributions while balancing KP coverage and popularity. LinkSyn extracts KPs from question-answering (QA) seed data and constructs a KP graph to synthesize diverse QA data from multiple seeds strongly linked by KPs and sampled from graph walks. Specifically, LinkSyn incorporates (1) a knowledge distribution value function to guide the adjustment of path sampling probability and balance KP coverage and popularity during graph walks; (2) diffusion-based synthesis via DeepSeek-R1 by leveraging multiple seeds with dense logical associations along each path; and (3) high-difficulty QA enhancement within given disciplines by flexible difficulty adjustments. By executing LinkSyn, we synthesize LinkQA, a diverse multi-disciplinary QA dataset with 50B tokens. Extensive experiments on Llama-3 8B demonstrate that continual pre-training with LinkQA yields an average improvement of **11.51%** on MMLU and CMMLU, establishing new SOTA results. LinkQA consistently enhances performance across model size and initial FLOPs scales.
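A toy sketch of the graph-walk sampling at LinkSyn's core: neighbors are sampled in proportion to a score over knowledge points, with a placeholder `value` callable standing in for the paper's knowledge distribution value function that balances KP coverage and popularity.

```python
import random

def walk_kp_graph(graph, value, start, length=4):
    """Value-guided walk over a knowledge-point graph (sketch).

    graph: {kp: [neighbor_kp, ...]}; value(kp) scores a candidate KP."""
    path = [start]
    for _ in range(length - 1):
        nbrs = graph.get(path[-1], [])
        if not nbrs:
            break
        weights = [max(value(n), 1e-6) for n in nbrs]
        path.append(random.choices(nbrs, weights=weights, k=1)[0])
    return path  # seeds strongly linked by these KPs feed the synthesizer

graph = {"limits": ["derivatives"], "derivatives": ["chain rule", "optimization"]}
print(walk_kp_graph(graph, value=lambda kp: 1.0, start="limits"))
```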
[19] Large-Scale Diverse Synthesis for Mid-Training
Xuemiao Zhang, Chengying Tu, Can Ren, Rongxiang Weng, Hongfei Yan, Jingang Wang, Xunliang Cai
Main category: cs.CL
TL;DR: The paper introduces BoostQA, a 100B-token QA dataset synthesized via a diversified pipeline to address data scarcity and diversity issues in LLMs, improving model performance by 12.74% on benchmarks.
Details
Motivation: High-quality, knowledge-intensive training data is scarce, limiting LLM development. Existing QA datasets lack scalability and diversity, especially in cross-domain and high-difficulty contexts.
Method: Proposes a three-step synthesis pipeline: (1) curates seed data, (2) uses DeepSeek-R1 for STEM-focused and high-difficulty synthesis, (3) refines answers with DeepSeek-V3. Utilizes BoostQA in mid-training for domain-specific optimization.
Result: Llama-3 8B mid-trained on BoostQA achieves a 12.74% average improvement on MMLU and CMMLU, setting SOTA performance across 12 benchmarks. BoostQA shows scalability with model size, data volume, and FLOPs.
Conclusion: BoostQA effectively addresses data scarcity and diversity, enhancing LLM performance and scalability, particularly in STEM and high-difficulty contexts.
Abstract: The scarcity of high-quality, knowledge-intensive training data hinders the development of large language models (LLMs), as traditional corpora provide limited information. Previous studies have synthesized and integrated corpora-dependent question-answering (QA) data to improve model performance but face challenges in QA data scalability and knowledge diversity, particularly in cross-domain contexts. Furthermore, leveraging our designed discipline and difficulty annotation system, we probe model deficiencies in STEM disciplines and high-difficulty data. To overcome these limitations, we propose a novel diversified pipeline to synthesize BoostQA, a 100B-token large-scale QA dataset. Our synthesis framework: (1) curates seed data from heterogeneous sources; (2) utilizes DeepSeek-R1 to implement STEM-focused multi-grade synthesis to boost data diversity and high-difficulty synthesis to mitigate difficulty degradation; (3) refines answers via DeepSeek-V3 to improve output quality. We utilize BoostQA in mid-training, a mid-stage between pre-training and post-training, to optimize domain-specific knowledge acquisition and enhance data quality. Our method enables Llama-3 8B, mid-trained on a 40B-token dataset, to achieve an average improvement of **12.74%** on MMLU and CMMLU and establish SOTA average performance across 12 benchmarks. BoostQA also demonstrates robust scalability, with performance consistently improving as model size, data volume, and initial FLOPs scale.
[20] MaRGen: Multi-Agent LLM Approach for Self-Directed Market Research and Analysis
Roman Koshkin, Pengyu Dai, Nozomi Fujikawa, Masahito Togami, Marco Visentini-Scarzanella
Main category: cs.CL
TL;DR: An autonomous framework using LLMs automates business analysis and market report generation, employing specialized agents to replicate professional methodologies, with iterative improvements enhancing report quality.
Details
Motivation: To automate end-to-end business analysis and market report generation, making professional insights affordable and efficient.
Method: Uses specialized agents (Researcher, Reviewer, Writer, Retriever) for multi-step processes like data querying, analysis, visualization, and report generation, with iterative improvements via automated review cycles.
Result: Generates detailed 6-page reports in 7 minutes at ~$1 cost, with quality improved by automated reviews and consultants’ knowledge.
Conclusion: The framework is a significant step toward affordable, automated market insights.
Abstract: We present an autonomous framework that leverages Large Language Models (LLMs) to automate end-to-end business analysis and market report generation. At its core, the system employs specialized agents - Researcher, Reviewer, Writer, and Retriever - that collaborate to analyze data and produce comprehensive reports. These agents learn from real professional consultants’ presentation materials at Amazon through in-context learning to replicate professional analytical methodologies. The framework executes a multi-step process: querying databases, analyzing data, generating insights, creating visualizations, and composing market reports. We also introduce a novel LLM-based evaluation system for assessing report quality, which shows alignment with expert human evaluations. Building on these evaluations, we implement an iterative improvement mechanism that optimizes report quality through automated review cycles. Experimental results show that report quality can be improved by both automated review cycles and consultants’ unstructured knowledge. In experimental validation, our framework generates detailed 6-page reports in 7 minutes at a cost of approximately $1. Our work could be an important step toward automatically creating affordable market insights.
[21] MedSynth: Realistic, Synthetic Medical Dialogue-Note Pairs
Ahmad Rezaie Mianroodi, Amirali Rezaie, Niko Grisel Todorov, Cyril Rakovski, Frank Rudzicz
Main category: cs.CL
TL;DR: MedSynth introduces a synthetic dataset of medical dialogues and notes to improve automation in clinical documentation, enhancing model performance for Dialogue-to-Note and Note-to-Dialogue tasks.
Details
Motivation: To reduce physician burnout by automating clinical documentation, addressing the lack of open-access, privacy-compliant, and diverse medical training data.
Method: Creation of MedSynth, a dataset with over 10,000 dialogue-note pairs covering 2000+ ICD-10 codes, informed by disease distribution analysis.
Result: The dataset significantly improves model performance in generating medical notes from dialogues and vice versa.
Conclusion: MedSynth provides a valuable, privacy-compliant resource for advancing medical documentation automation.
Abstract: Physicians spend significant time documenting clinical encounters, a burden that contributes to professional burnout. To address this, robust automation tools for medical documentation are crucial. We introduce MedSynth – a novel dataset of synthetic medical dialogues and notes designed to advance the Dialogue-to-Note (Dial-2-Note) and Note-to-Dialogue (Note-2-Dial) tasks. Informed by an extensive analysis of disease distributions, this dataset includes over 10,000 dialogue-note pairs covering over 2000 ICD-10 codes. We demonstrate that our dataset markedly enhances the performance of models in generating medical notes from dialogues, and dialogues from medical notes. The dataset provides a valuable resource in a field where open-access, privacy-compliant, and diverse training data are scarce. Code is available at https://github.com/ahmadrezarm/MedSynth/tree/main and the dataset is available at https://huggingface.co/datasets/Ahmad0067/MedSynth.
[22] ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles, with English translations
Rania Al-Sabbagh
Main category: cs.CL
TL;DR: ArzEn-MultiGenre is a parallel dataset of Egyptian Arabic and English texts (song lyrics, novels, TV subtitles) for machine translation, research, and pedagogy.
Details
Motivation: To provide a resource for benchmarking MT models, aiding research in translation studies, and supporting translation education and professionals.
Method: Manual translation and alignment of 25,557 segment pairs across diverse genres.
Result: A gold-standard dataset with unique genres, useful for MT, research, and training.
Conclusion: The dataset fills gaps in existing resources and offers high-quality, human-translated parallel texts.
Abstract: ArzEn-MultiGenre is a parallel dataset of Egyptian Arabic song lyrics, novels, and TV show subtitles that are manually translated and aligned with their English counterparts. The dataset contains 25,557 segment pairs that can be used to benchmark new machine translation models, fine-tune large language models in few-shot settings, and adapt commercial machine translation applications such as Google Translate. Additionally, the dataset is a valuable resource for research in various disciplines, including translation studies, cross-linguistic analysis, and lexical semantics. The dataset can also serve pedagogical purposes by training translation students and aid professional translators as a translation memory. The contributions are twofold: first, the dataset features textual genres not found in existing parallel Egyptian Arabic and English datasets, and second, it is a gold-standard dataset that has been translated and aligned by human experts.
[23] Discovering Bias Associations through Open-Ended LLM Generations
Jinhao Pan, Chahat Raj, Ziwei Zhu
Main category: cs.CL
TL;DR: BADF is a framework for discovering known and new biases in LLMs by analyzing open-ended outputs, advancing bias understanding and providing a scalable tool.
Details
Motivation: Addressing representational harms in LLMs caused by social biases, which existing methods fail to fully capture due to reliance on predefined associations.
Method: Introduces the Bias Association Discovery Framework (BADF) to systematically extract bias associations from LLM outputs, tested across multiple models and contexts.
Result: BADF successfully maps and analyzes diverse bias associations, revealing both known and unrecognized biases in LLMs.
Conclusion: BADF enhances bias detection in LLMs, offering a scalable solution for identifying and analyzing biases in open-ended generation.
Abstract: Social biases embedded in Large Language Models (LLMs) raise critical concerns, resulting in representational harms – unfair or distorted portrayals of demographic groups – that may be expressed in subtle ways through generated language. Existing evaluation methods often depend on predefined identity-concept associations, limiting their ability to surface new or unexpected forms of bias. In this work, we present the Bias Association Discovery Framework (BADF), a systematic approach for extracting both known and previously unrecognized associations between demographic identities and descriptive concepts from open-ended LLM outputs. Through comprehensive experiments spanning multiple models and diverse real-world contexts, BADF enables robust mapping and analysis of the varied concepts that characterize demographic identities. Our findings advance the understanding of biases in open-ended generation and provide a scalable tool for identifying and analyzing bias associations in LLMs. Data, code, and results are available at https://github.com/JP-25/Discover-Open-Ended-Generation
[24] From Query to Logic: Ontology-Driven Multi-Hop Reasoning in LLMs
Haonan Bian, Yutao Qi, Rui Yang, Yuanxi Che, Jiaqian Wang, Heming Xia, Ranran Zhen
Main category: cs.CL
TL;DR: ORACLE enhances LLMs for multi-hop QA by integrating knowledge graphs and logic, outperforming benchmarks like DeepSeek-R1.
Details
Motivation: LLMs struggle with complex multi-hop QA due to poor capture of deep conceptual relationships.
Method: ORACLE uses LLMs to build question-specific ontologies, converts them to logic chains, and decomposes queries logically.
Result: Achieves competitive performance on MQA benchmarks, with more logical and interpretable reasoning.
Conclusion: ORACLE effectively combines LLMs and structured reasoning for superior multi-hop QA.
Abstract: Large Language Models (LLMs), despite their success in question answering, exhibit limitations in complex multi-hop question answering (MQA) tasks that necessitate non-linear, structured reasoning. This limitation stems from their inability to adequately capture deep conceptual relationships between entities. To overcome this challenge, we present ORACLE (Ontology-driven Reasoning And Chain for Logical Elucidation), a training-free framework that combines LLMs’ generative capabilities with the structural benefits of knowledge graphs. Our approach operates through three stages: (1) dynamic construction of question-specific knowledge ontologies using LLMs, (2) transformation of these ontologies into First-Order Logic reasoning chains, and (3) systematic decomposition of the original query into logically coherent sub-questions. Experimental results on several standard MQA benchmarks show that our framework achieves highly competitive performance, rivaling current state-of-the-art models like DeepSeek-R1. Detailed analyses further confirm the effectiveness of each component, while demonstrating that our method generates more logical and interpretable reasoning chains than existing approaches.
[25] Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data
Xinlin Zhuang, Feilong Tang, Haolin Yang, Ming Hu, Huifa Li, Haochen Xue, Yichen Li, Junjun He, Zongyuan Ge, Ying Qian, Imran Razzak
Main category: cs.CL
TL;DR: DIQ, a data selection strategy, balances sample difficulty and gradient influence to improve medical reasoning in LLMs with minimal fine-tuning data.
Details
Motivation: Existing SFT practices use unfiltered datasets with redundant/low-quality samples, causing computational inefficiency and suboptimal performance.
Method: Proposes DIQ, selecting high-difficulty-high-influence samples to balance reasoning complexity and optimization utility.
Result: DIQ-selected subsets (1-10% of data) match or outperform full-dataset performance, improving clinical reasoning alignment with expert practices.
Conclusion: DIQ demonstrates the superiority of principled data selection over brute-force scaling, enabling efficient fine-tuning.
Abstract: Supervised Fine-Tuning (SFT) plays a pivotal role in adapting Large Language Models (LLMs) to specialized domains such as medical reasoning. However, existing SFT practices often rely on unfiltered datasets that contain redundant and low-quality samples, leading to substantial computational costs and suboptimal performance. Although existing methods attempt to alleviate this problem by selecting data based on sample difficulty, defined by knowledge and reasoning complexity, they overlook each sample’s optimization utility reflected in its gradient. Interestingly, we find that gradient-based influence alone favors easy-to-optimize samples that cause large parameter shifts but lack deep reasoning chains, while difficulty alone selects noisy or overly complex cases that fail to guide stable optimization. Based on this observation, we propose a data selection strategy, Difficulty-Influence Quadrant (DIQ), which prioritizes samples in the high-difficulty-high-influence quadrant to balance complex clinical reasoning with substantial gradient influence, enabling efficient medical reasoning with minimal fine-tuning data. Furthermore, human and LLM-as-a-judge evaluations show that DIQ-selected subsets demonstrate higher data quality and generate clinical reasoning that is more aligned with expert practices in differential diagnosis, safety check, and evidence citation, as DIQ emphasizes samples that foster expert-like reasoning patterns. Extensive experiments on medical reasoning benchmarks demonstrate that DIQ enables models fine-tuned on only 1% of selected data to match full-dataset performance, while using 10% consistently outperforms the baseline, highlighting the superiority of principled data selection over brute-force scaling. The code and data are available at https://github.com/mihara-bot/DIQ.
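A sketch of quadrant-based selection under stated assumptions: difficulty and influence scores are already computed, the quadrant boundary is the median of each axis, and ties within the quadrant are broken by the product of the two scores. The paper defines its own estimators for both axes.

```python
import numpy as np

def diq_select(difficulty: np.ndarray, influence: np.ndarray, frac: float = 0.1):
    """Keep samples in the high-difficulty, high-influence quadrant,
    then take the top `frac` of the full set by a combined score."""
    high = (difficulty > np.median(difficulty)) & (influence > np.median(influence))
    idx = np.flatnonzero(high)
    score = difficulty[idx] * influence[idx]      # illustrative tie-breaker
    k = max(1, int(frac * len(difficulty)))
    return idx[np.argsort(score)[::-1][:k]]

sel = diq_select(np.random.rand(1000), np.random.rand(1000), frac=0.01)
print(len(sel), "samples selected")
```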
[26] TreeDiff: AST-Guided Code Generation with Diffusion LLMs
Yiming Zeng, Jinghan Cao, Zexin Li, Yiming Chen, Tao Ren, Dawei Xiang, Xidong Wu, Shangqian Gao, Tingting Yu
Main category: cs.CL
TL;DR: A syntax-aware diffusion framework improves code generation by incorporating structural priors from ASTs, enhancing syntactic correctness and generalization.
Details
Motivation: Diffusion models struggle with structured domains like source code due to strict syntactic rules, requiring better methods to preserve hierarchical organization.Method: Proposes syntax-aware diffusion using AST-derived subtrees for selective corruption, respecting grammatical boundaries.
Result: Syntax-aware corruption improves syntactic correctness, reconstruction accuracy, and generalization to unseen code patterns.
Conclusion: Incorporating structural information into diffusion models is promising for advancing code generation tasks.
Abstract: Recent advances in diffusion-based language models have opened new possibilities for controllable and bidirectional sequence generation. These models provide an alternative to traditional autoregressive approaches by framing text generation as an iterative denoising process. However, applying diffusion models to structured domains such as source code remains a significant challenge. Programming languages differ from natural language in that they follow strict syntactic and semantic rules, with hierarchical organization that must be preserved for correctness. Standard token-level corruption techniques used during training often ignore this structure, which may hinder the model’s ability to learn meaningful representations of code. To address this limitation, we propose a syntax-aware diffusion framework that incorporates structural priors from Abstract Syntax Trees (ASTs) into the denoising process. Instead of masking individual tokens at random, we selectively corrupt syntactically meaningful code spans derived from AST subtrees. This enables the model to reconstruct programs in a way that respects grammatical boundaries and captures long-range dependencies. Experimental results demonstrate that syntax-aware corruption significantly improves syntactic correctness, reconstruction accuracy, and generalization to unseen code patterns. These findings highlight the potential of incorporating structural information into diffusion-based training and suggest that syntax-guided denoising is a promising direction for advancing diffusion-based language models in code generation tasks.
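As a rough illustration of AST-guided span corruption, the sketch below uses Python's built-in `ast` module to mask one syntactically meaningful span. The paper targets a diffusion training pipeline, so this shows only the corruption step; the `[MASK]` placeholder and single-span choice are assumptions.

```python
# Minimal sketch: corrupt one AST subtree span instead of random tokens.
import ast
import random

def ast_span_corrupt(source: str, mask_token: str = "[MASK]") -> str:
    """Replace one syntactically meaningful span (an AST subtree) with a mask."""
    tree = ast.parse(source)
    # Collect statement/expression nodes that carry position info.
    spans = [n for n in ast.walk(tree)
             if isinstance(n, (ast.stmt, ast.expr)) and hasattr(n, "end_lineno")]
    node = random.choice(spans)
    # Convert (line, column) positions to flat string offsets.
    lines = source.splitlines(keepends=True)
    offsets = [0]
    for line in lines:
        offsets.append(offsets[-1] + len(line))
    start = offsets[node.lineno - 1] + node.col_offset
    end = offsets[node.end_lineno - 1] + node.end_col_offset
    return source[:start] + mask_token + source[end:]

code = "def add(a, b):\n    total = a + b\n    return total\n"
random.seed(3)
print(ast_span_corrupt(code))  # masks a grammatically complete span
```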
[27] Harnessing Collective Intelligence of LLMs for Robust Biomedical QA: A Multi-Model Approach
Dimitra Panou, Alexandros C. Dimopoulos, Manolis Koubarakis, Martin Reczko
Main category: cs.CL
TL;DR: The paper discusses using open-source LLMs for biomedical question-answering in the BioASQ challenge, achieving top results in multiple rounds.
Details
Motivation: The exponential growth of biomedical literature necessitates efficient text mining and question-answering tools.Method: Deployed retrieval-augmented LLMs, using majority voting for Yes/No questions and union of answers for list/factoid questions, evaluating 13 LLMs.
Result: Achieved 1st place for ideal answers and 2nd for exact answers in Synergy task rounds, with tailored LLM pipelines for question types.
Conclusion: Combining LLMs effectively improves biomedical question-answering, with specific pipelines excelling for different question types.
Abstract: Biomedical text mining and question-answering are essential yet highly demanding tasks, particularly in the face of the exponential growth of biomedical literature. In this work, we present our participation in the 13th edition of the BioASQ challenge, which involves biomedical semantic question-answering for Task 13b and biomedical question-answering for developing topics for the Synergy task. We deploy a selection of open-source large language models (LLMs) as retrieval-augmented generators to answer biomedical questions. Various models are used to process the questions. A majority voting system combines their output to determine the final answer for Yes/No questions, while for list and factoid type questions, the union of their answers is used. We evaluated 13 state-of-the-art open source LLMs, exploring all possible model combinations to contribute to the final answer, resulting in tailored LLM pipelines for each question type. Our findings provide valuable insight into which combinations of LLMs consistently produce superior results for specific question types. In the four rounds of the 2025 BioASQ challenge, our system achieved notable results: in the Synergy task, we secured 1st place for ideal answers and 2nd place for exact answers in round 2, as well as two shared 1st places for exact answers in rounds 3 and 4.
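The two combination rules are straightforward to state in code. Here is a minimal sketch, assuming per-model answers are already extracted and normalized as strings; it is not the team's exact implementation.

```python
# Minimal sketch of the two ensembling rules: majority voting for Yes/No
# questions and answer union for list/factoid questions.
from collections import Counter

def vote_yes_no(answers: list[str]) -> str:
    """Majority vote over per-model Yes/No answers."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][0]

def union_list_answers(answers: list[list[str]]) -> list[str]:
    """Union of per-model answer lists, deduplicated, order-preserving."""
    seen, merged = set(), []
    for model_answers in answers:
        for item in model_answers:
            key = item.strip().lower()
            if key not in seen:
                seen.add(key)
                merged.append(item.strip())
    return merged

print(vote_yes_no(["Yes", "yes", "No"]))                   # -> "yes"
print(union_list_answers([["BRCA1"], ["brca1", "TP53"]]))  # -> ['BRCA1', 'TP53']
```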
[28] TeSent: A Benchmark Dataset for Fairness-aware Explainable Sentiment Classification in Telugu
Vallabhaneni Raj Kumar, Ashwin S, Supriya Manna, Niladri Sett, Cheedella V S N M S Hema Harshitha, Kurakula Harshitha, Anand Kumar Sharma, Basina Deepakraj, Tanuj Sarkar, Bondada Navaneeth Krishna, Samanthapudi Shakeer
Main category: cs.CL
TL;DR: The paper introduces TeSent, a benchmark dataset for Telugu sentiment classification, addressing the lack of annotated resources. It includes explainability and fairness evaluations, and shows that training with rationales improves model accuracy and reduces bias.
Details
Motivation: Telugu, a major Dravidian language, lacks high-quality annotated resources for NLP tasks, limiting its representation in global NLP and ML.Method: Scraped Telugu texts from various sources, preprocessed 26,150 sentences, and developed an annotation platform for labels and rationales. Fine-tuned SOTA models with and without rationales, and evaluated explainability and fairness.
Result: Training with rationales improved model accuracy, reduced bias, and aligned explainers’ outputs with human reasoning.
Conclusion: TeSent and TeEEC provide valuable resources for Telugu NLP, demonstrating the benefits of incorporating rationales for better model performance and fairness.
Abstract: In the Indian subcontinent, Telugu, one of India’s six classical languages, is the most widely spoken Dravidian language. Despite its 96 million speaker base worldwide, Telugu remains underrepresented in the global NLP and Machine Learning landscape, mainly due to a lack of high-quality annotated resources. This work introduces TeSent, a comprehensive benchmark dataset for sentiment classification, a key text classification problem, in Telugu. TeSent not only provides ground truth labels for the sentences, but also includes provisions for evaluating explainability and fairness, two critical requirements in modern-day machine learning tasks. We scraped Telugu texts covering multiple domains from various social media platforms, news websites and web-blogs to preprocess and generate 26,150 sentences, and developed a custom-built annotation platform and a carefully crafted annotation protocol for collecting the ground truth labels along with their human-annotated rationales. We then fine-tuned several SOTA pre-trained models in two ways: with rationales, and without rationales. Further, we provide a detailed plausibility and faithfulness evaluation suite, which exploits the rationales, for six widely used post-hoc explainers applied on the trained models. Lastly, we curate TeEEC, an Equity Evaluation Corpus in Telugu, to evaluate fairness of Telugu sentiment and emotion related NLP tasks, and provide a fairness evaluation suite for the trained classifier models. Our experimental results suggest that training with rationales may improve model accuracy, reduce bias in models, and make the explainers’ output more aligned to human reasoning.
[29] Listening to the Unspoken: Exploring “365” Aspects of Multimodal Interview Performance Assessment
Jia Li, Yang Wang, Wenhao Qian, Zhenzhen Hu, Richang Hong, Meng Wang
Main category: cs.CL
TL;DR: A novel framework for interview performance assessment integrates video, audio, and text data, evaluates six responses per candidate, and scores five key dimensions using multimodal fusion and ensemble learning.
Details
Motivation: To ensure holistic and fair evaluations of candidates by capturing explicit and implicit cues from multimodal data.Method: Uses modality-specific feature extractors, a Shared Compression Multilayer Perceptron for fusion, and a two-level ensemble learning strategy for robust predictions.
Result: Achieved a multi-dimensional average MSE of 0.1824 and secured first place in the AVI Challenge 2025.
Conclusion: The framework is effective and robust for automated, multimodal interview performance assessment.
Abstract: Interview performance assessment is essential for determining candidates’ suitability for professional positions. To ensure holistic and fair evaluations, we propose a novel and comprehensive framework that explores “365” aspects of interview performance by integrating \textit{three} modalities (video, audio, and text), \textit{six} responses per candidate, and \textit{five} key evaluation dimensions. The framework employs modality-specific feature extractors to encode heterogeneous data streams, which are subsequently fused via a Shared Compression Multilayer Perceptron. This module compresses multimodal embeddings into a unified latent space, facilitating efficient feature interaction. To enhance prediction robustness, we incorporate a two-level ensemble learning strategy: (1) independent regression heads predict scores for each response, and (2) predictions are aggregated across responses using a mean-pooling mechanism to produce final scores for the five target dimensions. By listening to the unspoken, our approach captures both explicit and implicit cues from multimodal data, enabling comprehensive and unbiased assessments. Achieving a multi-dimensional average MSE of 0.1824, our framework secured first place in the AVI Challenge 2025, demonstrating its effectiveness and robustness in advancing automated and multimodal interview performance assessment. The full implementation is available at https://github.com/MSA-LMC/365Aspects.
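A minimal PyTorch sketch of the fusion-and-pooling design is given below; the embedding dimensions, hidden size, and single shared head are assumptions for illustration, not the challenge submission's exact configuration.

```python
# Minimal sketch: concatenate modality embeddings, compress with a shared MLP,
# score each of the six responses, then mean-pool across responses.
import torch
import torch.nn as nn

class SharedCompressionFusion(nn.Module):
    def __init__(self, video_dim=512, audio_dim=256, text_dim=768,
                 hidden=256, n_dims=5):
        super().__init__()
        # Shared Compression MLP: maps concatenated modalities to a latent space.
        self.compress = nn.Sequential(
            nn.Linear(video_dim + audio_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        # Regression head predicting the five evaluation dimensions.
        self.head = nn.Linear(hidden, n_dims)

    def forward(self, video, audio, text):
        # video/audio/text: (batch, n_responses, dim) -- six responses each.
        z = self.compress(torch.cat([video, audio, text], dim=-1))
        per_response = self.head(z)      # (batch, n_responses, n_dims)
        return per_response.mean(dim=1)  # mean-pool across the six responses

model = SharedCompressionFusion()
scores = model(torch.randn(2, 6, 512), torch.randn(2, 6, 256), torch.randn(2, 6, 768))
print(scores.shape)  # torch.Size([2, 5])
```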
[30] The Homogenizing Effect of Large Language Models on Human Expression and Thought
Zhivar Sourati, Alireza S. Ziabari, Morteza Dehghani
Main category: cs.CL
TL;DR: The paper discusses how large language models (LLMs) may standardize language and reasoning, marginalizing cognitive diversity and alternative voices, which risks flattening collective intelligence.
Details
Motivation: To highlight the potential negative impact of LLMs on cognitive diversity, creativity, and collective intelligence by reinforcing dominant styles and marginalizing alternative perspectives.Method: The review synthesizes evidence from linguistics, cognitive science, and computer science to analyze how LLMs reflect and amplify dominant patterns in their training data.
Result: LLMs risk homogenizing language and reasoning, reducing cognitive diversity, and undermining collective adaptability.
Conclusion: Unchecked reliance on LLMs could flatten cognitive landscapes, necessitating measures to preserve diversity in language and reasoning.
Abstract: Cognitive diversity, reflected in variations of language, perspective, and reasoning, is essential to creativity and collective intelligence. This diversity is rich and grounded in culture, history, and individual experience. Yet as large language models (LLMs) become deeply embedded in people’s lives, they risk standardizing language and reasoning. This Review synthesizes evidence across linguistics, cognitive science, and computer science to show how LLMs reflect and reinforce dominant styles while marginalizing alternative voices and reasoning strategies. We examine how their design and widespread use contribute to this effect by mirroring patterns in their training data and amplifying convergence as people increasingly rely on the same models across contexts. Unchecked, this homogenization risks flattening the cognitive landscapes that drive collective intelligence and adaptability.
[31] A Theory of Adaptive Scaffolding for LLM-Based Pedagogical Agents
Clayton Cohn, Surya Rayala, Namrata Srivastava, Joyce Horn Fonteles, Shruti Jain, Xinying Luo, Divya Mereddy, Naveeduddin Mohammed, Gautam Biswas
Main category: cs.CL
TL;DR: The paper proposes a framework combining Evidence-Centered Design and Social Cognitive Theory to enhance LLM-based pedagogical agents for STEM+C learning, demonstrated by Inquizzitor, which provides adaptive, theory-grounded feedback.
Details
Motivation: Current LLM systems in education lack theoretical grounding compared to traditional intelligent tutoring systems, creating a need for a principled approach.Method: The framework integrates Evidence-Centered Design and Social Cognitive Theory, applied in Inquizzitor, an LLM-based formative assessment agent with human-AI hybrid intelligence.
Result: Inquizzitor delivers high-quality, theory-aligned assessment and interaction, valued by students and effective for teachers.
Conclusion: Theory-driven LLM integration, as shown by Inquizzitor, can provide adaptive, principled instruction, highlighting its potential in education.
Abstract: Large language models (LLMs) present new opportunities for creating pedagogical agents that engage in meaningful dialogue to support student learning. However, the current use of LLM systems like ChatGPT in classrooms often lacks the solid theoretical foundation found in earlier intelligent tutoring systems. To bridge this gap, we propose a framework that combines Evidence-Centered Design with Social Cognitive Theory for adaptive scaffolding in LLM-based agents focused on STEM+C learning. We illustrate this framework with Inquizzitor, an LLM-based formative assessment agent that integrates human-AI hybrid intelligence and provides feedback grounded in cognitive science principles. Our findings show that Inquizzitor delivers high-quality assessment and interaction aligned with core learning theories, offering teachers effective guidance that students value. This research underscores the potential for theory-driven LLM integration in education, highlighting the ability of these systems to provide adaptive and principled instruction.
[32] MOPrompt: Multi-objective Semantic Evolution for Prompt Optimization
Sara Câmara, Eduardo Luz, Valéria Carvalho, Ivan Meneghini, Gladston Moreira
Main category: cs.CL
TL;DR: MOPrompt is a multi-objective evolutionary framework for optimizing prompts in LLMs, balancing accuracy and context size, outperforming baselines with significant token reduction.
Details
Motivation: Manual prompt design is complex and time-consuming, and existing automated methods often ignore the trade-off between performance and efficiency.Method: MOPrompt uses Multi-objective Evolutionary Optimization (EMO) to simultaneously optimize prompts for accuracy and token length, mapping the Pareto front of solutions.
Result: MOPrompt achieves the same peak accuracy as baselines but with a 31% reduction in token length for one model.
Conclusion: MOPrompt effectively addresses the trade-off between performance and efficiency, offering practical solutions for deploying LLMs.
Abstract: Prompt engineering is crucial for unlocking the potential of Large Language Models (LLMs). Still, since manual prompt design is often complex, non-intuitive, and time-consuming, automatic prompt optimization has emerged as a research area. However, a significant challenge in prompt optimization is managing the inherent trade-off between task performance, such as accuracy, and context size. Most existing automated methods focus on a single objective, typically performance, thereby failing to explore the critical spectrum of efficiency and effectiveness. This paper introduces MOPrompt, a novel Multi-objective Evolutionary Optimization (EMO) framework designed to optimize prompts for both accuracy and context size (measured in tokens) simultaneously. Our framework maps the Pareto front of prompt solutions, presenting practitioners with a set of trade-offs between context size and performance, a crucial tool for deploying LLMs in real-world applications. We evaluate MOPrompt on a sentiment analysis task in Portuguese, using Gemma-2B and Sabiazinho-3 as evaluation models. Our findings show that MOPrompt substantially outperforms the baseline framework. For the Sabiazinho model, MOPrompt identifies a prompt that achieves the same peak accuracy (0.97) as the best baseline solution, but with a 31% reduction in token length.
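The Pareto-front bookkeeping at the heart of such a search can be sketched compactly. The snippet below shows the dominance test over the two stated objectives (maximize accuracy, minimize token count); candidate prompts and their scores are assumed to come from the evolutionary loop, which is not reproduced here.

```python
# Minimal sketch of Pareto-front extraction for (accuracy, token count).

def dominates(a, b):
    """a, b are (accuracy, tokens); a dominates b if it is no worse on both
    objectives and strictly better on at least one."""
    acc_a, tok_a = a
    acc_b, tok_b = b
    return (acc_a >= acc_b and tok_a <= tok_b) and (acc_a > acc_b or tok_a < tok_b)

def pareto_front(candidates):
    """candidates: list of (prompt, accuracy, tokens); returns non-dominated set."""
    return [(p, acc, tok) for p, acc, tok in candidates
            if not any(dominates((a2, t2), (acc, tok)) for _, a2, t2 in candidates)]

pool = [("p1", 0.97, 120), ("p2", 0.97, 83), ("p3", 0.91, 60), ("p4", 0.90, 70)]
for prompt, acc, tok in pareto_front(pool):
    print(prompt, acc, tok)  # p2 and p3 survive; p1 and p4 are dominated
```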
[33] Are All Prompt Components Value-Neutral? Understanding the Heterogeneous Adversarial Robustness of Dissected Prompt in Large Language Models
Yujia Zheng, Tianhao Li, Haotian Huang, Tianyu Zeng, Jingyu Lu, Chuangxin Chu, Yuekai Huang, Ziyou Jiang, Qian Xiong, Yuyao Ge, Mingyang Li
Main category: cs.CL
TL;DR: PromptAnatomy dissects prompts into functional components for targeted adversarial attacks, improving robustness evaluation in LLMs.
Details
Motivation: Existing adversarial attack methods overlook prompt structural heterogeneity, leading to unreliable robustness assessments.Method: Introduces PromptAnatomy, dissecting prompts and using ComPerturb for selective perturbation, with PPL-based filtering for plausibility.
Result: Achieves state-of-the-art attack success rates across datasets and LLMs, validated by ablation studies.
Conclusion: Prompt structure awareness and controlled perturbation are crucial for reliable adversarial robustness evaluation.
Abstract: Prompt-based adversarial attacks have become an effective means to assess the robustness of large language models (LLMs). However, existing approaches often treat prompts as monolithic text, overlooking their structural heterogeneity: different prompt components contribute unequally to adversarial robustness. Prior works like PromptRobust assume prompts are value-neutral, but our analysis reveals that complex, domain-specific prompts with rich structures have components with differing vulnerabilities. To address this gap, we introduce PromptAnatomy, an automated framework that dissects prompts into functional components and generates diverse, interpretable adversarial examples by selectively perturbing each component using our proposed method, ComPerturb. To ensure linguistic plausibility and mitigate distribution shifts, we further incorporate a perplexity (PPL)-based filtering mechanism. As a complementary resource, we annotate four public instruction-tuning datasets using the PromptAnatomy framework, verified through human review. Extensive experiments across these datasets and five advanced LLMs demonstrate that ComPerturb achieves state-of-the-art attack success rates. Ablation studies validate the complementary benefits of prompt dissection and PPL filtering. Our results underscore the importance of prompt structure awareness and controlled perturbation for reliable adversarial robustness evaluation in LLMs. Code and data are available at https://github.com/Yujiaaaaa/PACP.
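A minimal sketch of the perplexity-based filtering step is shown below, using GPT-2 via Hugging Face transformers as the scoring model; the scorer and the 1.5x threshold are assumptions, not the paper's reported settings.

```python
# Minimal sketch of PPL-based filtering for perturbed prompt candidates.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    loss = lm(ids, labels=ids).loss  # mean token cross-entropy
    return float(torch.exp(loss))

def ppl_filter(original: str, perturbed: list[str], ratio: float = 1.5):
    """Keep perturbations whose perplexity stays within `ratio` x the original
    prompt's PPL, discarding implausible, distribution-shifted candidates."""
    base = perplexity(original)
    return [p for p in perturbed if perplexity(p) <= ratio * base]

kept = ppl_filter("Summarize the patient record below.",
                  ["Sumarize the patient record below.",
                   "Summarize zzq patient qq record below,,"])
print(kept)
```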
[34] OpenMed NER: Open-Source, Domain-Adapted State-of-the-Art Transformers for Biomedical NER Across 12 Public Datasets
Maziyar Panahi
Main category: cs.CL
TL;DR: OpenMed NER introduces open-source, domain-adapted transformer models for biomedical NER, achieving state-of-the-art performance with efficiency and low carbon footprint.
Details
Motivation: To address the challenge of extracting structured information from unstructured healthcare data efficiently and effectively.Method: Combines lightweight domain-adaptive pre-training (DAPT) with parameter-efficient Low-Rank Adaptation (LoRA) on ethically sourced data, fine-tuned for NER tasks.
Result: Achieves new state-of-the-art micro-F1 scores on 10 of 12 biomedical NER benchmarks, with significant improvements across diverse entity types.
Conclusion: OpenMed NER demonstrates that strategically adapted open-source models can outperform closed-source solutions while being efficient and compliant with regulations.
Abstract: Named-entity recognition (NER) is fundamental to extracting structured information from the >80% of healthcare data that resides in unstructured clinical notes and biomedical literature. Despite recent advances with large language models, achieving state-of-the-art performance across diverse entity types while maintaining computational efficiency remains a significant challenge. We introduce OpenMed NER, a suite of open-source, domain-adapted transformer models that combine lightweight domain-adaptive pre-training (DAPT) with parameter-efficient Low-Rank Adaptation (LoRA). Our approach performs cost-effective DAPT on a 350k-passage corpus compiled from ethically sourced, publicly available research repositories and de-identified clinical notes (PubMed, arXiv, and MIMIC-III) using DeBERTa-v3, PubMedBERT, and BioELECTRA backbones. This is followed by task-specific fine-tuning with LoRA, which updates less than 1.5% of model parameters. We evaluate our models on 12 established biomedical NER benchmarks spanning chemicals, diseases, genes, and species. OpenMed NER achieves new state-of-the-art micro-F1 scores on 10 of these 12 datasets, with substantial gains across diverse entity types. Our models advance the state-of-the-art on foundational disease and chemical benchmarks (e.g., BC5CDR-Disease, +2.70 pp), while delivering even larger improvements of over 5.3 and 9.7 percentage points on more specialized gene and clinical cell line corpora. This work demonstrates that strategically adapted open-source models can surpass closed-source solutions. This performance is achieved with remarkable efficiency: training completes in under 12 hours on a single GPU with a low carbon footprint (< 1.2 kg CO2e), producing permissively licensed, open-source checkpoints designed to help practitioners facilitate compliance with emerging data protection and AI regulations, such as the EU AI Act.
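For readers unfamiliar with the LoRA setup, below is a minimal sketch of parameter-efficient adaptation for token-level NER using the Hugging Face peft library; the backbone, label count, rank, and target modules are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch: wrap a DeBERTa-v3 token classifier with LoRA adapters so
# that only a small fraction of parameters is trained.
from transformers import AutoModelForTokenClassification
from peft import LoraConfig, TaskType, get_peft_model

backbone = AutoModelForTokenClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=5)  # e.g. BIO tags, illustrative

lora_cfg = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=16,               # low-rank dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query_proj", "value_proj"],  # DeBERTa-v3 attention projections
)
model = get_peft_model(backbone, lora_cfg)
model.print_trainable_parameters()  # typically on the order of ~1% of weights,
                                    # matching the "<1.5% of parameters" regime
```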
[35] Authorship Attribution in Multilingual Machine-Generated Texts
Lucio La Cava, Dominik Macko, Róbert Móro, Ivan Srba, Andrea Tagarelli
Main category: cs.CL
TL;DR: The paper introduces Multilingual Authorship Attribution (AA), addressing the challenge of attributing texts to human or multiple LLM generators across diverse languages, highlighting limitations of monolingual methods in multilingual settings.
Details
Motivation: The increasing difficulty in distinguishing machine-generated text from human-written content and the lack of multilingual focus in existing AA methods motivate this work.Method: The study evaluates monolingual AA methods’ suitability and cross-lingual transferability across 18 languages and 8 generators (7 LLMs and human-authored texts).
Result: Monolingual AA methods show adaptability but face significant limitations in cross-lingual transfer, especially across diverse language families.
Conclusion: The complexity of multilingual AA calls for more robust approaches to align with real-world multilingual usage of LLMs.
Abstract: As Large Language Models (LLMs) have reached human-like fluency and coherence, distinguishing machine-generated text (MGT) from human-written content becomes increasingly difficult. While early efforts in MGT detection have focused on binary classification, the growing landscape and diversity of LLMs require a more fine-grained yet challenging authorship attribution (AA), i.e., being able to identify the precise generator (LLM or human) behind a text. However, AA nowadays remains confined to a monolingual setting, with English being the most investigated, overlooking the multilingual nature and usage of modern LLMs. In this work, we introduce the problem of Multilingual Authorship Attribution, which involves attributing texts to human or multiple LLM generators across diverse languages. Focusing on 18 languages – covering multiple families and writing scripts – and 8 generators (7 LLMs and the human-authored class), we investigate the multilingual suitability of monolingual AA methods, their cross-lingual transferability, and the impact of generators on attribution performance. Our results reveal that while certain monolingual AA methods can be adapted to multilingual settings, significant limitations and challenges remain, particularly in transferring across diverse language families, underscoring the complexity of multilingual AA and the need for more robust approaches to better match real-world scenarios.
[36] CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions
Tae Soo Kim, Yoonjoo Lee, Yoonah Park, Jiho Kim, Young-Ho Kim, Juho Kim
Main category: cs.CL
TL;DR: CUPID is a benchmark for evaluating LLMs’ ability to infer and apply dynamic user preferences from multi-turn interactions, revealing current models’ limitations.
Details
Motivation: Humans have dynamic preferences that change with context, but current LLMs assume static preferences, leading to misalignment in personalized interactions.Method: Introduced CUPID, a benchmark of 756 human-curated interaction sessions, to test LLMs’ ability to infer contextual preferences from multi-turn feedback.
Result: State-of-the-art LLMs struggle with preference inference (under 50% precision, 65% recall) and fail to discern relevant context for new requests.
Conclusion: LLMs need advancement for contextual personalization, and CUPID serves as a resource to drive these improvements.
Abstract: Personalization of Large Language Models (LLMs) often assumes users hold static preferences that reflect globally in all tasks. In reality, humans hold dynamic preferences that change depending on the context. As users interact with an LLM in various contexts, they naturally reveal their contextual preferences, which a model must infer and apply in future contexts to ensure alignment. To assess this, we introduce CUPID, a benchmark of 756 human-curated interaction session histories between users and LLM-based chat assistants. In each interaction session, the user provides a request in a specific context and expresses their preference through multi-turn feedback. Given a new user request and prior interaction sessions, our benchmark assesses whether LLMs can infer the preference relevant to this request and generate a response that satisfies this preference. With CUPID, we evaluated 10 open and proprietary LLMs, revealing that state-of-the-art LLMs struggle to infer preferences from multi-turn interactions and fail to discern what previous context is relevant to a new request – under 50% precision and 65% recall. Our work highlights the need to advance LLM capabilities for more contextually personalized interactions and proposes CUPID as a resource to drive these improvements.
[37] The Bidirectional Process Reward Model
Lingyin Zhang, Jun Gao, Xiaoxue Ren, Ziqiang Cao
Main category: cs.CL
TL;DR: BiPRM introduces a bidirectional evaluation paradigm for Process Reward Models (PRMs), improving reasoning quality in LLMs by leveraging global context without added complexity.
Details
Motivation: Existing PRMs use unidirectional (L2R) evaluation, limiting global context use and consistency verification.Method: BiPRM adds a parallel R2L evaluation stream via prompt modifications, maintaining efficiency and compatibility.
Result: BiPRM outperforms unidirectional baselines by up to 31.9% in stepwise reward evaluation across various settings.
Conclusion: BiPRM is effective, robust, and broadly applicable, advancing process-based reward modeling.
Abstract: Process Reward Models (PRMs) have emerged as a promising approach to enhance the reasoning quality of Large Language Models (LLMs) by assigning fine-grained scores to intermediate reasoning steps within a solution trajectory. However, existing PRMs predominantly adopt a unidirectional left-to-right (L2R) evaluation paradigm, which limits their ability to leverage global context, making it challenging to verify the consistency of earlier steps based on later ones. In light of these challenges, we propose a novel bidirectional evaluation paradigm, named Bidirectional Process Reward Model (BiPRM). BiPRM seamlessly incorporates a parallel right-to-left (R2L) evaluation stream alongside the conventional L2R flow, enabling later reasoning steps to help assess earlier ones in real time. Notably, the built-in R2L evaluation is implemented solely through prompt modifications that reverse the original reasoning trajectory, without any additional parameters or inference latency introduced. This ensures BiPRM remains both efficient and broadly compatible with existing PRM studies. We conduct extensive experiments on two mathematical reasoning benchmarks using samples generated by three different policy models. Our method, BiPRM, is evaluated across three backbones and three distinct PRM objectives. Across all settings, BiPRM consistently outperforms unidirectional baselines, achieving up to a 31.9% improvement in stepwise reward evaluation. Generally, our results highlight BiPRM’s effectiveness, robustness, and general applicability, offering a promising new direction for process-based reward modeling.
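Because the R2L stream is implemented purely through prompt modification, the core idea fits in a few lines. The sketch below assumes a `score_steps` function standing in for any PRM that returns one score per step; averaging the two directions is an illustrative fusion choice, not necessarily the paper's.

```python
# Minimal sketch of the bidirectional idea: the R2L stream is obtained by
# reversing step order in the prompt, adding no parameters or extra latency.

def build_prompt(question: str, steps: list[str]) -> str:
    lines = [f"Step {i + 1}: {s}" for i, s in enumerate(steps)]
    return question + "\n" + "\n".join(lines)

def biprm_scores(question, steps, score_steps):
    l2r = score_steps(build_prompt(question, steps))              # usual direction
    r2l = score_steps(build_prompt(question, steps[::-1]))[::-1]  # reversed stream,
    # re-reversed so its scores align with the original step order.
    return [(a + b) / 2 for a, b in zip(l2r, r2l)]                # simple fusion

# Toy scorer: pretend longer steps are more reliable.
toy = lambda prompt: [len(line) / 100 for line in prompt.splitlines()[1:]]
print(biprm_scores("Q: 2+3*4?", ["Multiply 3*4=12", "Add 2+12=14"], toy))
```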
[38] Collaborative Chain-of-Agents for Parametric-Retrieved Knowledge Synergy
Yi Jiang, Sendong Zhao, Jianbo Li, Haochun Wang, Lizhe Zhang, Yan Liu, Bin Qin
Main category: cs.CL
TL;DR: The paper introduces Collaborative Chain-of-Agents (CoCoA), a framework to improve synergy between parametric and retrieved knowledge in RAG, enhancing LLM performance in knowledge-intensive tasks.
Details
Motivation: Current RAG methods struggle to fully exploit knowledge during generation, limiting synergy between internal and external knowledge.Method: Proposes CoCoA-zero, a multi-agent RAG framework for conditional knowledge induction and reasoning, and CoCoA, a long-chain training strategy to fine-tune LLMs.
Result: CoCoA-zero and CoCoA outperform existing methods on open-domain and multi-hop QA tasks.
Conclusion: The framework effectively integrates and leverages parametric and retrieved knowledge, improving LLM performance.
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising framework for enhancing the capabilities of Large Language Models (LLMs), especially in knowledge-intensive tasks. Despite its advantages, current RAG methods often struggle to fully exploit knowledge during generation. In particular, the synergy between the model’s internal parametric knowledge and external retrieved knowledge remains limited. Retrieved contents may sometimes mislead generation, while certain generated content can guide the model toward more accurate outputs. In this work, we propose Collaborative Chain-of-Agents (CoCoA), a framework designed to explicitly enhance the synergy between parametric and retrieved knowledge. Specifically, we first introduce CoCoA-zero, a multi-agent RAG framework that performs conditional knowledge induction and then derives answers through reasoning. Building on this, we develop CoCoA, a long-chain training strategy that synthesizes extended multi-agent reasoning trajectories from CoCoA-zero to fine-tune the LLM. This strategy enhances the model’s capability to explicitly integrate and jointly leverage parametric and retrieved knowledge. Experimental results show that CoCoA-zero and CoCoA achieve superior performance on open-domain and multi-hop QA tasks.
[39] Am I Blue or Is My Hobby Counting Teardrops? Expression Leakage in Large Language Models as a Symptom of Irrelevancy Disruption
Berkay Köprü, Mehrzad Mashal, Yigit Gurses, Akos Kadar, Maximilian Schmitt, Ditty Mathew, Felix Burkhardt, Florian Eyben, Björn W. Schuller
Main category: cs.CL
TL;DR: The paper introduces ’expression leakage’ in LLMs, where models generate sentimentally charged expressions unrelated to input context. It provides a benchmark dataset, an automatic evaluation pipeline, and shows that larger models reduce leakage, but mitigation requires specific model-building care.
Details
Motivation: To address the overlooked issue of sentimentally irrelevant expressions (expression leakage) in LLMs, which differs from prior focus on semantic leakage.Method: Collect a benchmark dataset, propose an automatic evaluation pipeline, and analyze expression leakage across model scales and sentiment injections.
Result: Larger models reduce expression leakage, but mitigation isn’t possible via prompting. Negative sentiment disrupts generation more than positive.
Conclusion: Expression leakage is a distinct issue requiring targeted mitigation during model building, not just scaling or prompting.
Abstract: Large language models (LLMs) have advanced natural language processing (NLP) through skills such as next-token prediction and self-attention, but their ability to integrate broad context also makes them prone to incorporating irrelevant information. Prior work has focused on semantic leakage, bias introduced by semantically irrelevant context. In this paper, we introduce expression leakage, a novel phenomenon where LLMs systematically generate sentimentally charged expressions that are semantically unrelated to the input context. To analyse expression leakage, we collect a benchmark dataset along with a scheme for automatically generating datasets from free-form Common Crawl text. In addition, we propose an automatic evaluation pipeline that correlates well with human judgment, which accelerates benchmarking by removing the need to annotate each analysed model. Our experiments show that, as the model scales in parameter count, expression leakage reduces within the same LLM family. On the other hand, we demonstrate that expression leakage mitigation requires specific care during the model building process and cannot be mitigated by prompting. In addition, our experiments indicate that, when negative sentiment is injected in the prompt, it disrupts the generation process more than positive sentiment, causing a higher expression leakage rate.
[40] CultureGuard: Towards Culturally-Aware Dataset and Guard Model for Multilingual Safety Applications
Raviraj Joshi, Rakesh Paul, Kanishk Singla, Anusha Kamath, Michael Evans, Katherine Luna, Shaona Ghosh, Utkarsh Vaidya, Eileen Long, Sanjay Singh Chauhan, Niranjan Wartikar
Main category: cs.CL
TL;DR: CultureGuard introduces a pipeline to create multilingual safety datasets, improving LLM safety in non-English languages.
Details
Motivation: Addressing the lack of culturally aligned safety datasets for non-English languages in LLMs.Method: A four-stage pipeline: cultural data segregation, adaptation, machine translation, and quality filtering.
Result: A dataset of 386,661 samples in 9 languages, enabling a state-of-the-art multilingual safety model.
Conclusion: This work advances multilingual LLM safety by providing culturally aware datasets and models.
Abstract: The increasing use of Large Language Models (LLMs) in agentic applications highlights the need for robust safety guard models. While content safety in English is well-studied, non-English languages lack similar advancements due to the high cost of collecting culturally aligned labeled datasets. We present CultureGuard, a novel solution for curating culturally aligned, high-quality safety datasets across multiple languages. Our approach introduces a four-stage synthetic data generation and filtering pipeline: cultural data segregation, cultural data adaptation, machine translation, and quality filtering. This pipeline enables the conversion and expansion of the Nemotron-Content-Safety-Dataset-V2 English safety dataset into eight distinct languages: Arabic, German, Spanish, French, Hindi, Japanese, Thai, and Chinese. The resulting dataset, Nemotron-Content-Safety-Dataset-Multilingual-v1, comprises 386,661 samples in 9 languages and facilitates the training of Llama-3.1-Nemotron-Safety-Guard-Multilingual-8B-v1 via LoRA-based fine-tuning. The final model achieves state-of-the-art performance on several multilingual content safety benchmarks. We also benchmark the latest open LLMs on multilingual safety and observe that these LLMs are more prone to give unsafe responses when prompted in non-English languages. This work represents a significant step toward closing the safety gap in multilingual LLMs by enabling the development of culturally aware safety guard models.
[41] Enhancing the Preference Extractor in Multi-turn Dialogues: From Annotating Disasters to Accurate Preference Extraction
Cheng Wang, Ziru Liu, Pengcheng Tang, Mingyu Zhang, Quanyu Dai, Yue Zhu
Main category: cs.CL
TL;DR: The paper proposes IterChat, a framework for generating high-quality dialogue data to improve user preference extraction in dialogue systems, addressing challenges like annotation difficulty and error propagation.
Details
Motivation: Current methods for tracking user preferences in multi-turn dialogues face challenges like high annotation complexity and error propagation, necessitating a more efficient solution.Method: The IterChat framework decomposes multi-turn preference extraction into one-turn processes, uses GPT4 to pre-define preference slots, and generates diverse dialogue datasets.
Result: Fine-tuning or few-shot prompting with the new data format outperforms original multi-turn dialogues and improves annotator efficiency by 28.4%.
Conclusion: IterChat effectively addresses annotation and training challenges, offering a scalable solution for preference extraction in dialogue systems.
Abstract: Identifying user preferences in dialogue systems is a pivotal aspect of providing satisfying services. Current research shows that using large language models (LLMs) to fine-tune a task-specific preference extractor yields excellent results in terms of accuracy and generalization. However, the primary challenge stems from the inherent difficulty in obtaining high-quality labeled multi-turn dialogue data. Accurately tracking user preference transitions across turns not only demands intensive domain expertise and contextual consistency maintenance for annotators (termed \textbf{“Annotating Disaster”}) but also complicates model training due to error propagation in sequential dependency learning. Inspired by the observation that multi-turn preference extraction can be decomposed into iterative executions of one-turn extraction processes, we propose a novel dialogue data generation framework named \textbf{IterChat}. First, we construct a new data format that categorizes the dialogue data into attributed historical preferences and one-turn dialogues. This reduces the probability of annotation errors and improves annotation efficiency. Then, to generate a high-quality and diverse dialogue dataset, we adopt GPT4 to pre-define the preference slots in the target preference extractor task and then randomly sample subsets of the slots and their corresponding schema values to create the dialogue datasets. Experimental results indicate that fine-tuning or few-shot prompting with the new dialogue format yields superior performance compared to the original multi-turn dialogues. Additionally, the new data format improves annotator efficiency, with a win rate 28.4% higher than the original multi-turn dialogues.
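A minimal sketch of the decomposed data format is shown below; the field names and the restaurant-style preference slots are hypothetical, chosen only to illustrate how attributed historical preferences pair with a single-turn extraction target.

```python
# Minimal sketch of a decomposed training record: attributed historical
# preferences plus one dialogue turn, instead of a full multi-turn transcript.
import json

record = {
    "history_preferences": {  # accumulated, attributed preference slots
        "cuisine": "sichuan",
        "price_range": "moderate",
    },
    "current_turn": {
        "user": "Actually, somewhere quieter this time, and vegetarian-friendly.",
        "extracted_update": {  # the one-turn extraction target
            "ambience": "quiet",
            "dietary": "vegetarian",
        },
    },
}
print(json.dumps(record, indent=2))
```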
[42] AI-Generated Text is Non-Stationary: Detection via Temporal Tomography
Alva West, Yixuan Weng, Minjun Zhu, Luodan Zhang, Zhen Lin, Guangsheng Bao, Yue Zhang
Main category: cs.CL
TL;DR: TDT introduces a novel AI-generated text detection method using temporal discrepancy analysis, outperforming existing methods by preserving positional information and handling non-stationarity.
Details
Motivation: Current AI-text detectors aggregate token-level data into scalar scores, losing positional info, which fails against adversarial perturbations exploiting non-stationarity.Method: TDT treats token-level discrepancies as a time-series, applying Continuous Wavelet Transform to capture anomalies’ location and scale.
Result: TDT achieves 0.855 AUROC (7.1% improvement) on RAID and 14.1% AUROC improvement on adversarial tasks, with minimal computational overhead.
Conclusion: Non-stationarity is key in AI-generated text; TDT’s temporal analysis ensures robust detection.
Abstract: The field of AI-generated text detection has evolved from supervised classification to zero-shot statistical analysis. However, current approaches share a fundamental limitation: they aggregate token-level measurements into scalar scores, discarding positional information about where anomalies occur. Our empirical analysis reveals that AI-generated text exhibits significant non-stationarity: statistical properties vary 73.8% more between text segments than in human writing. This discovery explains why existing detectors fail against localized adversarial perturbations that exploit this overlooked characteristic. We introduce Temporal Discrepancy Tomography (TDT), a novel detection paradigm that preserves positional information by reformulating detection as a signal processing task. TDT treats token-level discrepancies as a time-series signal and applies Continuous Wavelet Transform to generate a two-dimensional time-scale representation, capturing both the location and linguistic scale of statistical anomalies. On the RAID benchmark, TDT achieves 0.855 AUROC (7.1% improvement over the best baseline). More importantly, TDT demonstrates robust performance on adversarial tasks, with 14.1% AUROC improvement on HART Level 2 paraphrasing attacks. Despite its sophisticated analysis, TDT maintains practical efficiency with only 13% computational overhead. Our work establishes non-stationarity as a fundamental characteristic of AI-generated text and demonstrates that preserving temporal dynamics is essential for robust detection.
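The signal-processing reformulation can be illustrated with PyWavelets: treat per-token discrepancies as a 1-D signal and map it to a time-scale plane with a continuous wavelet transform. The discrepancy scores below are synthetic and the Morlet wavelet is an assumed choice; TDT's detector-specific scores would be substituted in practice.

```python
# Minimal sketch: CWT over a synthetic per-token discrepancy signal.
import numpy as np
import pywt  # PyWavelets

rng = np.random.default_rng(7)
# Synthetic per-token discrepancy signal with one localized anomaly (a stand-in
# for, e.g., a paraphrased span inside otherwise AI-generated text).
signal = rng.normal(0.0, 0.1, 512)
signal[200:240] += 0.8

scales = np.arange(1, 64)
coeffs, _ = pywt.cwt(signal, scales, "morl")  # shape: (n_scales, n_tokens)

# A simple 2-D summary: where (token position) and at what scale the
# largest-magnitude anomaly sits.
scale_idx, pos = np.unravel_index(np.abs(coeffs).argmax(), coeffs.shape)
print(f"strongest anomaly near token {pos} at scale {scales[scale_idx]}")
```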
[43] A comprehensive taxonomy of hallucinations in Large Language Models
Manuel Cossio
Main category: cs.CL
TL;DR: The paper provides a taxonomy of hallucinations in large language models (LLMs), categorizing them, analyzing causes, and proposing mitigation strategies.
Details
Motivation: Address the challenge of LLM hallucinations—plausible but incorrect outputs—to improve reliability in critical applications.Method: Develops a theoretical framework, categorizes hallucinations, analyzes causes, and surveys benchmarks and mitigation strategies.
Result: Identifies intrinsic/extrinsic hallucinations, underlying causes, and proposes detection and mitigation approaches.
Conclusion: Hallucinations are inevitable in LLMs; future efforts should focus on detection, mitigation, and human oversight for responsible deployment.
Abstract: Large language models (LLMs) have revolutionized natural language processing, yet their propensity for hallucination, generating plausible but factually incorrect or fabricated content, remains a critical challenge. This report provides a comprehensive taxonomy of LLM hallucinations, beginning with a formal definition and a theoretical framework that posits its inherent inevitability in computable LLMs, irrespective of architecture or training. It explores core distinctions, differentiating between intrinsic (contradicting input context) and extrinsic (inconsistent with training data or reality), as well as factuality (absolute correctness) and faithfulness (adherence to input). The report then details specific manifestations, including factual errors, contextual and logical inconsistencies, temporal disorientation, ethical violations, and task-specific hallucinations across domains like code generation and multimodal applications. It analyzes the underlying causes, categorizing them into data-related issues, model-related factors, and prompt-related influences. Furthermore, the report examines cognitive and human factors influencing hallucination perception, surveys evaluation benchmarks and metrics for detection, and outlines architectural and systemic mitigation strategies. Finally, it introduces web-based resources for monitoring LLM releases and performance. This report underscores the complex, multifaceted nature of LLM hallucinations and emphasizes that, given their theoretical inevitability, future efforts must focus on robust detection, mitigation, and continuous human oversight for responsible and reliable deployment in critical applications.
[44] HeQ: a Large and Diverse Hebrew Reading Comprehension Benchmark
Amir DN Cohen, Hilla Merhav, Yoav Goldberg, Reut Tsarfaty
Main category: cs.CL
TL;DR: The paper introduces HeQ, a Hebrew Machine Reading Comprehension dataset, addressing challenges in semantic understanding due to Hebrew’s morphological richness, and proposes improved evaluation metrics.
Details
Motivation: Current Hebrew NLP benchmarks lack semantic focus, and morphological complexity causes annotation and evaluation issues.Method: Developed HeQ with 30,147 QA pairs using new guidelines, crowdsourcing, and revised metrics.
Result: Standard metrics like F1 and EM are unsuitable for Hebrew; proposed enhancements show better alignment.
Conclusion: HeQ highlights challenges in Hebrew NLU, suggesting morpho-syntactic models may underperform on semantic tasks, advancing NLU for MRLs.
Abstract: Current benchmarks for Hebrew Natural Language Processing (NLP) focus mainly on morpho-syntactic tasks, neglecting the semantic dimension of language understanding. To bridge this gap, we set out to deliver a Hebrew Machine Reading Comprehension (MRC) dataset, where MRC is to be realized as extractive Question Answering. The morphologically rich nature of Hebrew poses a challenge to this endeavor: the indeterminacy and non-transparency of span boundaries in morphologically complex forms lead to annotation inconsistencies, disagreements, and flaws in standard evaluation metrics. To remedy this, we devise a novel set of guidelines, a controlled crowdsourcing protocol, and revised evaluation metrics that are suitable for the morphologically rich nature of the language. Our resulting benchmark, HeQ (Hebrew QA), features 30,147 diverse question-answer pairs derived from both Hebrew Wikipedia articles and Israeli tech news. Our empirical investigation reveals that standard evaluation metrics such as F1 scores and Exact Match (EM) are not appropriate for Hebrew (and other MRLs), and we propose a relevant enhancement. In addition, our experiments show low correlation between models’ performance on morpho-syntactic tasks and on MRC, which suggests that models designed for the former might underperform on semantics-heavy tasks. The development and exploration of HeQ illustrate some of the challenges MRLs pose in natural language understanding (NLU), fostering progression towards more and better NLU models for Hebrew and other MRLs.
[45] AGENTICT$^2$S: Robust Text-to-SPARQL via Agentic Collaborative Reasoning over Heterogeneous Knowledge Graphs for the Circular Economy
Yang Zhao, Chengxiao Dai, Wei Zhuo, Tan Chuan Fu, Yue Xiu, Dusit Niyato, Jonathan Z. Low, Eugene Ho Hong Zhuang, Daren Zong Loong Tan
Main category: cs.CL
TL;DR: AgenticT$^2$S is a modular framework for KGQA that uses specialized agents and weak-to-strong alignment to improve accuracy and efficiency in cross-graph reasoning, particularly in low-resource domains like the circular economy.
Details
Motivation: Existing KGQA methods struggle with generalizability in low-resource domains and handling queries across multiple graphs, especially in sustainability-focused areas like the circular economy.Method: AgenticT$^2$S decomposes KGQA into subtasks managed by specialized agents (retrieval, query generation, verification) and uses a scheduler for subgoal assignment. It includes a two-stage verifier for query validation.
Result: AgenticT$^2$S improves execution accuracy by 17.3% and triple-level F$_1$ by 25.4% over baselines, while reducing prompt length by 46.4%.
Conclusion: The framework demonstrates the effectiveness of agent-based, schema-aware reasoning for scalable KGQA, supporting robust cross-graph reasoning in sustainability domains.
Abstract: Question answering over heterogeneous knowledge graphs (KGQA) involves reasoning across diverse schemas, incomplete alignments, and distributed data sources. Existing text-to-SPARQL approaches rely on large-scale domain-specific fine-tuning or operate within single-graph settings, limiting their generalizability in low-resource domains and their ability to handle queries spanning multiple graphs. These challenges are particularly relevant in domains such as the circular economy, where information about classifications, processes, and emissions is distributed across independently curated knowledge graphs (KGs). We present AgenticT$^2$S, a modular framework that decomposes KGQA into subtasks managed by specialized agents responsible for retrieval, query generation, and verification. A scheduler assigns subgoals to different graphs using weak-to-strong alignment strategies. A two-stage verifier detects structurally invalid and semantically underspecified queries through symbolic validation and counterfactual consistency checks. Experiments on real-world circular economy KGs demonstrate that AgenticT$^2$S improves execution accuracy by 17.3% and triple-level F$_1$ by 25.4% over the best baseline, while reducing the average prompt length by 46.4%. These results underscore the benefits of agent-based, schema-aware reasoning for scalable KGQA and support decision-making in sustainability domains through robust cross-graph reasoning.
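A control-flow skeleton of the modular decomposition might look like the following; the agents' internals (LLM calls, SPARQL endpoints, weak-to-strong alignment) are stubbed out, and both verifier stages are reduced to trivial checks purely to show the pipeline shape.

```python
# Minimal skeleton: scheduler -> query-generation agent -> two-stage verifier.
from dataclasses import dataclass

@dataclass
class SubGoal:
    question: str
    graph: str  # which KG the scheduler routed this subgoal to

def schedule(question: str, graphs: list[str]) -> list[SubGoal]:
    # Stand-in for weak-to-strong alignment: route the question to one graph.
    return [SubGoal(question, graphs[0])]

def generate_sparql(sub: SubGoal) -> str:
    # Stand-in for the LLM-backed query-generation agent.
    return f"SELECT ?x WHERE {{ ?x ?p ?o }}  # for: {sub.question} on {sub.graph}"

def verify(query: str) -> bool:
    # Stage 1: structural validity (here: a trivial syntactic check).
    structurally_ok = query.strip().upper().startswith("SELECT")
    # Stage 2: semantic underspecification check (stubbed; the paper uses
    # counterfactual consistency checks here).
    semantically_ok = "?x" in query
    return structurally_ok and semantically_ok

for sub in schedule("Which processes emit CO2?", ["emissions_kg", "process_kg"]):
    q = generate_sparql(sub)
    print(q, "->", "accepted" if verify(q) else "rejected")
```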
[46] MLP Memory: Language Modeling with Retriever-pretrained External Memory
Rubin Wei, Jiaqi Cao, Jiarui Wang, Jushi Kai, Qipeng Guo, Bowen Zhou, Zhouhan Lin
Main category: cs.CL
TL;DR: The paper proposes a decoupled memory architecture for LLMs to reduce hallucinations, combining a transformer decoder with a pretrained MLP memory, achieving better performance and scaling.
Details
Motivation: Hallucinations in decoder-only LLMs hinder their use in knowledge-intensive tasks, and retriever-augmented generation lacks deep interaction with LLMs.Method: Decouple memorization from LLM using a pretrained, differentiable external MLP memory, trained to imitate a retriever on pretraining data.
Result: 17.5% and 24.1% improvement on WikiText-103 and Web datasets, faster inference, and superior performance on hallucination benchmarks and memory tasks.
Conclusion: The proposed architecture effectively reduces hallucinations, improves performance, and scales better, with plans to open-source the code.
Abstract: While modern decoder-only LLMs achieve superior performance across various domains, hallucinations have become a common problem in their generated text, hindering their application in knowledge-intensive tasks. Retriever-augmented generation (RAG) offers a solution, but the non-parametric nature of the retriever hinders its deep interaction with the LLM. In this work, we propose to decouple memorization from the LLM decoder using a pretrained, differentiable external memory. The external memory is an MLP pretrained by imitating the behavior of a retriever on the entire pretraining dataset. Our resulting architecture, which comprises a transformer decoder and an external MLP memory pretrained on language modeling and retriever imitation respectively, demonstrates strong perplexity and performance on downstream tasks. Experiments show our architecture exhibits steeper power-law scaling with model size, achieving 17.5% and 24.1% improvement on WikiText-103 and Web datasets compared to decoder-only models, while benefiting from added training without overfitting. We demonstrate superior performance on three hallucination benchmarks and nine memory-intensive tasks. Additionally, our approach delivers $80\times$ speedup over $k$NN-LM (500M tokens) and $1.3\times$ faster inference than decoder-only models. Unlike $k$NN-LM, which impairs reasoning, our MLP memory improves StrategyQA performance. We will open-source our code and models in the future.
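One plausible reading of the decoupled design is a kNN-LM-style interpolation in which the datastore lookup is replaced by a single MLP forward pass. The sketch below illustrates that combination step; the mixing weight and the exact fusion rule are assumptions, not the paper's confirmed formulation.

```python
# Minimal sketch: mix the decoder's next-token distribution with an external
# memory's distribution, kNN-LM style.
import torch
import torch.nn.functional as F

def combine(decoder_logits, memory_logits, lam=0.25):
    """Return log of (1-lam)*p_decoder + lam*p_memory.
    Unlike kNN-LM's datastore lookup, `memory_logits` would come from a single
    differentiable MLP forward pass, so there is no retrieval step at inference."""
    p_dec = F.softmax(decoder_logits, dim=-1)
    p_mem = F.softmax(memory_logits, dim=-1)
    return torch.log((1 - lam) * p_dec + lam * p_mem)

vocab = 50_000
mixed = combine(torch.randn(1, vocab), torch.randn(1, vocab))
print(mixed.shape)  # (1, 50000) log-probabilities for the next token
```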
[47] Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents
Yuhan Guo, Cong Guo, Aiwen Sun, Hongliang He, Xinyu Yang, Yue Lu, Yingji Zhang, Xuntao Guo, Dong Zhang, Jianzhuang Liu, Jiang Duan, Yijia Xiao, Liangjian Wen, Hai-Ming Xu, Yong Dai
Main category: cs.CL
TL;DR: The paper introduces the Web-CogKnowledge Framework and Web-CogReasoner, a knowledge-driven agent for web tasks, emphasizing knowledge acquisition and cognitive reasoning.
Details
Motivation: Web agents need structured knowledge for effective cognitive reasoning, decomposed into knowledge content learning and cognitive processes.Method: Proposes the Web-CogKnowledge Framework (Factual, Conceptual, Procedural knowledge) and Web-CogDataset, with a knowledge-driven Chain-of-Thought reasoning framework.
Result: Web-CogReasoner outperforms existing models, especially in generalizing to unseen tasks.
Conclusion: The framework and dataset enable robust web agent performance, validated by Web-CogBench.
Abstract: Multimodal large-scale models have significantly advanced the development of web agents, enabling perception and interaction with digital environments akin to human cognition. In this paper, we argue that web agents must first acquire sufficient knowledge to effectively engage in cognitive reasoning. Therefore, we decompose a web agent’s capabilities into two essential stages: knowledge content learning and cognitive processes. To formalize this, we propose the Web-CogKnowledge Framework, categorizing knowledge as Factual, Conceptual, and Procedural. In this framework, knowledge content learning corresponds to the agent’s processes of Memorizing and Understanding, which rely on the first two knowledge types, representing the “what” of learning. Conversely, cognitive processes correspond to Exploring, grounded in Procedural knowledge, defining the “how” of reasoning and action. To facilitate knowledge acquisition, we construct the Web-CogDataset, a structured resource curated from 14 real-world websites, designed to systematically instill the core knowledge necessary for web agents. This dataset serves as the agent’s conceptual grounding, the “nouns” upon which comprehension is built, as well as the basis for learning how to reason and act. Building on this foundation, we operationalize these processes through a novel knowledge-driven Chain-of-Thought (CoT) reasoning framework, developing and training our proposed agent, the Web-CogReasoner. Extensive experimentation reveals its significant superiority over existing models, especially in generalizing to unseen tasks where structured knowledge is decisive. To enable rigorous evaluation, we introduce the Web-CogBench, a comprehensive evaluation suite designed to assess and compare agent performance across the delineated knowledge domains and cognitive capabilities. Our code and data are open-sourced at https://github.com/Gnonymous/Web-CogReasoner
[48] Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models
Yijun Feng
Main category: cs.CL
TL;DR: Counterfactual Probing detects and mitigates LLM hallucinations by generating plausible but erroneous statements and evaluating model sensitivity, outperforming baselines and reducing hallucinations by 24.5%.
Details
Motivation: LLMs often produce fluent but factually incorrect outputs (hallucinations), necessitating methods to detect and mitigate such errors without retraining.Method: Generates counterfactual statements with subtle errors, evaluates model sensitivity to these perturbations, and uses adaptive mitigation strategies.
Result: Superior detection performance on TruthfulQA and other datasets, with a 24.5% average reduction in hallucination scores.
Conclusion: Counterfactual Probing is an effective, no-retraining-required method for real-time hallucination detection and mitigation in LLMs.
Abstract: Large Language Models have demonstrated remarkable capabilities across diverse tasks, yet they frequently generate hallucinations: outputs that are fluent but factually incorrect or unsupported. We propose Counterfactual Probing, a novel approach for detecting and mitigating hallucinations in LLM outputs. Our method dynamically generates counterfactual statements that appear plausible but contain subtle factual errors, then evaluates the model’s sensitivity to these perturbations. We hypothesize that genuine knowledge exhibits robustness to counterfactual variations, while hallucinated content shows inconsistent confidence patterns when confronted with plausible alternatives. Our comprehensive evaluation on TruthfulQA, factual statement datasets, and curated hallucination examples demonstrates that counterfactual probing achieves superior detection performance compared to baseline methods, while our adaptive mitigation strategies reduce hallucination scores by an average of 24.5%. The approach requires no model retraining and can be integrated into existing LLM pipelines as a real-time verification mechanism.
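The probing logic can be sketched abstractly: score a claim against its counterfactual variants and flag it when confidence is not robustly higher than the alternatives. In the snippet below, `log_prob` stands in for any sequence scorer (e.g., mean token log-likelihood under the LLM), and the 0.1 margin is an illustrative threshold, not the paper's setting.

```python
# Minimal sketch of counterfactual probing: genuine knowledge should beat
# plausible-but-wrong variants by a clear confidence margin.
import statistics

def probe(claim: str, counterfactuals: list[str], log_prob, margin: float = 0.1):
    base = log_prob(claim)
    alts = [log_prob(cf) for cf in counterfactuals]
    # Robust (likely genuine) iff the claim outscores every variant by `margin`;
    # otherwise treat the output as a hallucination risk.
    robust = all(base - a >= margin for a in alts)
    return {"robust": robust, "gap": base - statistics.mean(alts)}

toy_scores = {"Paris is the capital of France.": -1.2,
              "Lyon is the capital of France.": -2.5,
              "Marseille is the capital of France.": -2.7}
print(probe("Paris is the capital of France.",
            ["Lyon is the capital of France.",
             "Marseille is the capital of France."],
            toy_scores.__getitem__))
```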
[49] Quantum-RAG and PunGPT2: Advancing Low-Resource Language Generation and Retrieval for the Punjabi Language
Jaskaranjeet Singh, Rakesh Thakur
Main category: cs.CL
TL;DR: PunGPT2 is the first open-source Punjabi LLM suite, trained on a 35GB corpus, with innovations like Pun-RAG, Pun-Instruct, and Quantum-RAG for improved performance in low-resource NLP.
Details
Motivation: Low-resource languages like Punjabi are often excluded from NLP advancements, prompting the development of specialized models and frameworks.
Method: Developed PunGPT2 with optimized tokenizer, Pun-RAG for retrieval-augmented generation, Pun-Instruct for instruction tuning, and Quantum-RAG for hybrid retrieval.
Result: Outperforms multilingual baselines (mBERT, mT5, MuRIL) in perplexity, factuality, and fluency.
Conclusion: Provides a scalable blueprint for LLM adoption in underrepresented languages and introduces quantum-aware retrieval in NLP.
Abstract: Despite the rapid advancement of large language models (LLMs), low-resource languages remain largely excluded from the NLP landscape. We present PunGPT2, the first fully open-source suite of Punjabi large language models, trained from scratch on a 35GB domain-diverse corpus encompassing literature, religious texts, news, and social discourse. Unlike prior multilingual approaches, PunGPT2 captures rich syntactic and morphological features unique to Punjabi through a tokenizer optimised with byte pair encoding and linguistically aligned pretraining objectives. To improve factual grounding and domain recall, we introduce Pun-RAG, a retrieval-augmented generation framework combining PunGPT2 with a dense FAISS retriever over a curated Punjabi knowledge base. We further develop Pun-Instruct, a parameter-efficient, instruction-tuned variant using QLoRA, enabling robust zero-shot and instruction-following performance with significantly reduced compute needs. As a key innovation, we propose Quantum-RAG, a novel hybrid retrieval system that fuses sparse (BM25) and dense methods with quantum-inspired semantic matching. By encoding queries using amplitude-based embeddings and retrieving via quantum kernel similarity, Quantum-RAG achieves improved contextual relevance with minimal memory overhead, marking the first practical integration of quantum representations in low-resource language generation. Our models significantly outperform strong multilingual baselines (mBERT, mT5, MuRIL) in perplexity, factuality, and fluency. This work provides a scalable, reproducible blueprint for extending LLM capabilities to underrepresented languages and pioneers quantum-aware retrieval in low-resource NLP.
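As a toy reading of the retrieval math, the sketch below assumes that "amplitude-based embeddings" means L2-normalizing a dense vector so it behaves like a quantum state, scores documents with the fidelity-style kernel |&lt;q|d&gt;|^2, and fuses that with a precomputed BM25 score. The fusion weight alpha and the random vectors are our own illustration, not the paper's formulation.

```python
import numpy as np

def amplitude_encode(v: np.ndarray) -> np.ndarray:
    """L2-normalize so the vector is a valid amplitude (quantum state) vector."""
    return v / np.linalg.norm(v)

def quantum_kernel(q: np.ndarray, d: np.ndarray) -> float:
    """Fidelity-style similarity |<q|d>|^2 between amplitude-encoded vectors."""
    return float(np.dot(amplitude_encode(q), amplitude_encode(d)) ** 2)

def hybrid_score(q_vec, d_vec, bm25: float, alpha: float = 0.5) -> float:
    """Fuse the dense quantum-kernel similarity with a sparse BM25 score."""
    return alpha * quantum_kernel(q_vec, d_vec) + (1 - alpha) * bm25

rng = np.random.default_rng(0)
query, doc = rng.normal(size=384), rng.normal(size=384)
print(hybrid_score(query, doc, bm25=0.7))
```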
[50] Word Overuse and Alignment in Large Language Models: The Influence of Learning from Human Feedback
Tom S. Juzek, Zina B. Ward
Main category: cs.CL
TL;DR: The study investigates why LLMs overuse certain terms like ‘delve’ and ‘intricate,’ linking it to Learning from Human Feedback (LHF). It proposes a method to detect LHF-induced lexical preferences and experimentally confirms LHF’s role in word overuse, highlighting a misalignment between LHF workers and LLM users.
Details
Motivation: To understand the reasons behind LLMs' lexical overuse of specific terms and explore the role of LHF in this phenomenon.
Method: Uses Meta’s Llama model to detect LHF-induced lexical preferences and experimentally emulates LHF to confirm word overuse.
Result: Demonstrates that LHF leads to systematic preference for certain words, revealing a misalignment between LHF workers and LLM users.
Conclusion: Highlights the need for transparency in alignment research and contributes to explainable AI by linking LHF to lexical overuse.
Abstract: Large Language Models (LLMs) are known to overuse certain terms like “delve” and “intricate.” The exact reasons for these lexical choices, however, have been unclear. Using Meta’s Llama model, this study investigates the contribution of Learning from Human Feedback (LHF), under which we subsume Reinforcement Learning from Human Feedback and Direct Preference Optimization. We present a straightforward procedure for detecting the lexical preferences of LLMs that are potentially LHF-induced. Next, we more conclusively link LHF to lexical overuse by experimentally emulating the LHF procedure and demonstrating that participants systematically prefer text variants that include certain words. This lexical overuse can be seen as a sort of misalignment, though our study highlights the potential divergence between the lexical expectations of different populations – namely LHF workers versus LLM users. Our work contributes to the growing body of research on explainable artificial intelligence and emphasizes the importance of both data and procedural transparency in alignment research.
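A small sketch of the kind of lexical-preference detection the abstract describes: compare word rates in model-generated versus human text with a smoothed log-odds ratio. The statistic and the toy corpora are our choices; the paper's detection procedure may differ.

```python
from collections import Counter
import math

def log_odds(word: str, llm: Counter, human: Counter) -> float:
    """Smoothed log-odds of `word` in LLM text vs. a human reference corpus."""
    p = (llm[word] + 0.5) / (sum(llm.values()) + 1)
    q = (human[word] + 0.5) / (sum(human.values()) + 1)
    return math.log(p / (1 - p)) - math.log(q / (1 - q))

llm_counts = Counter("we delve into the intricate details and delve deeper".split())
human_counts = Counter("we look into the finer details and dig deeper".split())
for w in ["delve", "intricate", "details"]:
    print(w, round(log_odds(w, llm_counts, human_counts), 2))  # positive = overused
```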
[51] ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks
Philip Schroeder, Ondrej Biza, Thomas Weng, Hongyin Luo, James Glass
Main category: cs.CL
TL;DR: ROVER improves video reasoning in VLMs by recursively decomposing long videos into shorter subtask segments, enhancing accuracy and reducing hallucinations.
Details
Motivation: VLMs struggle with reasoning over long video sequences, limiting their use in embodied settings requiring continuous visual input.
Method: ROVER recursively decomposes long videos into subtask segments, using in-context learning for focused reasoning.
Result: ROVER outperforms baselines in task progress estimation, frame-level reasoning, and video QA, with linear time complexity.
Conclusion: ROVER effectively addresses VLMs’ limitations in long-video reasoning, improving performance and scalability.
Abstract: Vision-language models (VLMs) have exhibited impressive capabilities across diverse image understanding tasks, but still struggle in settings that require reasoning over extended sequences of camera frames from a video. This limits their utility in embodied settings, which require reasoning over long frame sequences from a continuous stream of visual input at each moment of a task attempt. To address this limitation, we propose ROVER (Reasoning Over VidEo Recursively), a framework that enables the model to recursively decompose long-horizon video trajectories into segments corresponding to shorter subtasks within the trajectory. In doing so, ROVER facilitates more focused and accurate reasoning over temporally localized frame sequences without losing global context. We evaluate ROVER, implemented using an in-context learning approach, on diverse OpenX Embodiment videos and on a new dataset derived from RoboCasa that consists of 543 videos showing both expert and perturbed non-expert trajectories across 27 robotic manipulation tasks. ROVER outperforms strong baselines across three video reasoning tasks: task progress estimation, frame-level natural language reasoning, and video question answering. We observe that, by reducing the number of frames the model reasons over at each timestep, ROVER mitigates hallucinations, especially during unexpected or non-optimal moments of a trajectory. In addition, by enabling the implementation of a subtask-specific sliding context window, ROVER’s time complexity scales linearly with video length, an asymptotic improvement over baselines. Demos, code, and data available at: https://rover-vlm.github.io
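The recursion itself is simple to sketch. In the toy below, `propose_boundary` and `answer_over` are hypothetical stand-ins for the in-context VLM calls ROVER makes; only the divide-and-recurse control flow is the point.

```python
from typing import List

MAX_SEGMENT = 32   # frames one reasoning call may attend to (illustrative)

def propose_boundary(frames: List[int]) -> int:
    """Stub for the VLM call that spots a subtask boundary; here, the midpoint."""
    return len(frames) // 2

def answer_over(frames: List[int], question: str) -> str:
    """Stub for focused VLM reasoning over a short, temporally local segment."""
    return f"answer over frames {frames[0]}..{frames[-1]}"

def rover(frames: List[int], question: str) -> List[str]:
    if len(frames) <= MAX_SEGMENT:          # short enough: reason directly
        return [answer_over(frames, question)]
    cut = propose_boundary(frames)          # else split at a subtask boundary
    return rover(frames[:cut], question) + rover(frames[cut:], question)

print(rover(list(range(100)), "What is the task progress?"))
```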
[52] SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension
Junjie Wu, Jiangnan Li, Yuqing Li, Lemao Liu, Liyan Xu, Jiwei Li, Dit-Yan Yeung, Jie Zhou, Mo Yu
Main category: cs.CL
TL;DR: The paper proposes situated embedding models (SitEmb) to enhance retrieval performance by conditioning short chunks on broader context, outperforming larger models.
Details
Motivation: Existing methods for retrieval-augmented generation (RAG) struggle with contextual dependencies and embedding model limitations when handling long documents.
Method: Introduces a new training paradigm and situated embedding models (SitEmb) to encode short chunks with broader context, improving retrieval accuracy.
Result: SitEmb models, especially the 8B SitEmb-v1.5, outperform state-of-the-art models by over 10% on a book-plot retrieval benchmark and show strong multilingual performance.
Conclusion: SitEmb effectively addresses contextual retrieval challenges, offering significant improvements over existing methods with fewer parameters.
Abstract: Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance – i.e., situating a chunk’s meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3 substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters, with only 1B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.
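One way to picture "situating" a chunk, sketched under our own assumptions: encode the chunk together with its surrounding window using an off-the-shelf encoder (BGE-M3, the paper's base model) and mean-pool only the chunk's token span. This illustrates the representation target, not SitEmb's actual training paradigm.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("BAAI/bge-m3")
enc = AutoModel.from_pretrained("BAAI/bge-m3")

def situated_embedding(before: str, chunk: str, after: str) -> torch.Tensor:
    inputs = tok(before + chunk + after, return_tensors="pt", truncation=True)
    # Locate the chunk's token span via the prefix length (approximate:
    # subword merges at the boundary can shift the span by a token).
    n_prefix = len(tok(before, add_special_tokens=False).input_ids)
    n_chunk = len(tok(chunk, add_special_tokens=False).input_ids)
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state[0]   # (seq_len, dim)
    start = 1 + n_prefix                              # skip the BOS token
    # Pool only the chunk's tokens: the vector represents the chunk,
    # but every token has attended to the surrounding context.
    return hidden[start : start + n_chunk].mean(dim=0)

vec = situated_embedding("The storm had raged for days. ",
                         "She finally left the city.",
                         " No one heard from her for weeks.")
print(vec.shape)
```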
[53] TIBSTC-CoT: A Multi-Domain Instruction Dataset for Chain-of-Thought Reasoning in Language Models
Fan Gao, Cheng Huang, Nyima Tashi, Yutong Liu, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Hao Wang, Yongbin Yu
Main category: cs.CL
TL;DR: TIBSTC-CoT is a large-scale Tibetan dataset created using LLMs to address data scarcity, enabling the development of the Sunshine-thinking LLM family for Tibetan language processing.
Details
Motivation: To tackle the lack of data for Tibetan, a low-resource language, and improve AI inclusivity.
Method: Automated dataset construction via chain-of-thought prompting with LLMs, followed by training Tibetan-centric LLMs (Sunshine-thinking) on this dataset.
Result: Sunshine-thinking LLMs show strong reasoning and generation performance, comparable to SOTA multilingual LLMs.
Conclusion: This work advances inclusive AI by providing scalable data and models for Tibetan language processing.
Abstract: To address the severe data scarcity in Tibetan, a low-resource language spoken by over six million people, we introduce TIBSTC-CoT, a large-scale, multi-domain Tibetan dataset automatically constructed via chain-of-thought prompting with large language models (LLMs). TIBSTC-CoT establishes a scalable and reproducible framework for dataset creation in low-resource settings, covering diverse domains and reasoning patterns essential for language understanding and generation. Building on this dataset, we develop the Sunshine-thinking LLM family, a series of Tibetan-centric LLMs equipped with chain-of-thought capabilities. Trained entirely on TIBSTC-CoT, Sunshine-thinking has demonstrated strong reasoning and generation performance, comparable to state-of-the-art (SOTA) multilingual LLMs. Our work marks a significant step toward inclusive AI by enabling high-quality Tibetan language processing through both resource creation and model innovation. All data are available: https://github.com/Vicentvankor/sun-shine.
[54] Contextually Aware E-Commerce Product Question Answering using RAG
Praveen Tangarajan, Anand A. Rajasekar, Manish Rathi, Vinay Rao Dandin, Ozan Ersoy
Main category: cs.CL
TL;DR: A scalable, end-to-end framework for e-commerce PQA using RAG integrates user context and product info to deliver personalized answers, handling diverse queries and identifying catalog gaps.
Details
Motivation: Existing PQA systems struggle with cognitive overload and ineffective use of user context and product info.
Method: Proposes a RAG-based framework leveraging conversational history, user profiles, and product attributes.
Result: System handles objective, subjective, and multi-intent queries, identifies catalog gaps, and introduces novel metrics.
Conclusion: The framework improves PQA by integrating context and diverse info, with metrics for broader RAG evaluation.
Abstract: E-commerce product pages contain a mix of structured specifications, unstructured reviews, and contextual elements like personalized offers or regional variants. Although informative, this volume can lead to cognitive overload, making it difficult for users to quickly and accurately find the information they need. Existing Product Question Answering (PQA) systems often fail to utilize rich user context and diverse product information effectively. We propose a scalable, end-to-end framework for e-commerce PQA using Retrieval Augmented Generation (RAG) that deeply integrates contextual understanding. Our system leverages conversational history, user profiles, and product attributes to deliver relevant and personalized answers. It adeptly handles objective, subjective, and multi-intent queries across heterogeneous sources, while also identifying information gaps in the catalog to support ongoing content improvement. We also introduce novel metrics to measure the framework’s performance which are broadly applicable for RAG system evaluations.
[55] Prompting Large Language Models to Detect Dementia Family Caregivers
Md Badsha Biswas, Özlem Uzuner
Main category: cs.CL
TL;DR: A system for detecting tweets by dementia caregivers using LLMs, achieving high accuracy with a zero-shot prompt on a fine-tuned model.
Details
Motivation: To identify tweets by dementia caregivers for developing internet-based support interventions.
Method: Binary classification using large language models (LLMs) with various prompting methods, focusing on zero-shot prompts.
Result: Achieved a macro F1-score of 0.95 on validation and test sets.
Conclusion: Fine-tuned LLMs with zero-shot prompting effectively identify caregiver tweets, enabling future support interventions.
Abstract: Social media, such as Twitter, provides opportunities for caregivers of dementia patients to share their experiences and seek support for a variety of reasons. Availability of this information online also paves the way for the development of internet-based interventions in their support. However, for this purpose, tweets written by caregivers of dementia patients must first be identified. This paper demonstrates our system for the SMM4H 2025 shared task 3, which focuses on detecting tweets posted by individuals who have a family member with dementia. The task is outlined as a binary classification problem, differentiating between tweets that mention dementia in the context of a family member and those that do not. Our solution to this problem explores large language models (LLMs) with various prompting methods. Our results show that a simple zero-shot prompt on a fine-tuned model yielded the best results. Our final system achieved a macro F1-score of 0.95 on the validation set and the test set. Our full code is available on GitHub.
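The setup reduces to a single zero-shot prompt plus a yes/no parse. The prompt wording below is hypothetical (the shared-task phrasing is not given in the abstract), and `generate` stands in for any completion endpoint over the fine-tuned model.

```python
PROMPT = (
    "Does the following tweet mention dementia in the context of the "
    "author's own family member? Answer strictly 'yes' or 'no'.\n\n"
    "Tweet: {tweet}\nAnswer:"
)

def classify(tweet: str, generate) -> bool:
    """`generate` is any text-completion callable wrapping the tuned model."""
    return generate(PROMPT.format(tweet=tweet)).strip().lower().startswith("yes")

# Trivial stand-in generator, just to show the plumbing:
print(classify("Caring for my mom with dementia is exhausting.", lambda p: "yes"))
```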
[56] SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents
Changhao Jiang, Jiajun Sun, Yifei Cao, Jiabao Zhuang, Hui Li, Xiaoran Fan, Ming Zhang, Junjie Ye, Shihan Dou, Zhiheng Xi, Jingqi Tong, Yilong Wu, Baoyu Fan, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Main category: cs.CL
TL;DR: The paper introduces SpeechRole-Data, a large-scale dataset for Speech Role-Playing Agents (SRPAs), and SpeechRole-Eval, a benchmark for evaluating SRPAs in interaction, expressiveness, and fidelity.
Details
Motivation: Existing research on role-playing agents focuses on text, ignoring speech in realistic interactions. There's a lack of systematic evaluation for SRPAs.
Method: Constructed SpeechRole-Data (98 roles, 112k conversations) with diverse vocal traits and proposed SpeechRole-Eval for multidimensional assessment.
Result: Experiments show pros and cons of cascaded and end-to-end SRPAs in vocal style consistency and role coherence.
Conclusion: The released data, code, and models aim to advance speech-driven multimodal role-playing research.
Abstract: Recently, role-playing agents have emerged as a promising paradigm for achieving personalized interaction and emotional resonance. Existing research primarily focuses on the textual modality, neglecting the critical dimension of speech in realistic interactive scenarios. In particular, there is a lack of systematic evaluation for Speech Role-Playing Agents (SRPAs). To address this gap, we construct SpeechRole-Data, a large-scale, high-quality dataset that comprises 98 diverse roles and 112k speech-based single-turn and multi-turn conversations. Each role demonstrates distinct vocal characteristics, including timbre and prosody, thereby enabling more sophisticated speech role-playing. Furthermore, we propose SpeechRole-Eval, a multidimensional evaluation benchmark that systematically assesses SRPAs’ performance in key aspects such as fundamental interaction ability, speech expressiveness, and role-playing fidelity. Experimental results reveal the advantages and challenges of both cascaded and end-to-end speech role-playing agents in maintaining vocal style consistency and role coherence. We release all data, code, and baseline models to provide a solid foundation for speech-driven multimodal role-playing research and to foster further developments in this field.
[57] SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models
Wanqi Yang, Yanda Li, Yunchao Wei, Meng Fang, Ling Chen
Main category: cs.CL
TL;DR: SpeechR is a benchmark for evaluating reasoning in large audio-language models (LALMs), addressing gaps in contextual and inference-driven reasoning. It tests factual retrieval, procedural inference, and normative judgment across multiple formats. Results show transcription accuracy doesn’t guarantee strong reasoning.
Details
Motivation: Existing evaluations of LALMs focus on surface-level perception, neglecting contextual and inference-driven reasoning in speech. SpeechR aims to fill this gap.
Method: SpeechR evaluates models on factual retrieval, procedural inference, and normative judgment using multiple-choice, generative, and acoustic-feature formats.
Result: Evaluations on 11 LALMs show high transcription accuracy doesn’t correlate with strong reasoning capabilities.
Conclusion: SpeechR provides a structured benchmark for analyzing reasoning in spoken language, aiding targeted model improvement.
Abstract: Large audio-language models (LALMs) have achieved near-human performance in sentence-level transcription and emotion recognition. However, existing evaluations focus mainly on surface-level perception, leaving the capacity of models for contextual and inference-driven reasoning in speech-based scenarios insufficiently examined. To address this gap, we introduce SpeechR, a unified benchmark for evaluating reasoning over speech in large audio-language models. SpeechR evaluates models along three key dimensions: factual retrieval, procedural inference, and normative judgment. It includes three distinct evaluation formats. The multiple-choice version measures answer selection accuracy. The generative version assesses the coherence and logical consistency of reasoning chains. The acoustic-feature version investigates whether variations in stress and emotion affect reasoning performance. Evaluations on eleven state-of-the-art LALMs reveal that high transcription accuracy does not translate into strong reasoning capabilities. SpeechR establishes a structured benchmark for evaluating reasoning in spoken language, enabling more targeted analysis of model capabilities across diverse dialogue-based tasks.
[58] Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
Huihan Li, You Chen, Siyuan Wang, Yixin He, Ninareh Mehrabi, Rahul Gupta, Xiang Ren
Main category: cs.CL
TL;DR: STIM is a framework to identify memorization sources in LLMs, showing local memorization drives errors in reasoning tasks.
Details
Motivation: Address concerns about LLMs' reliance on memorization in reasoning tasks, especially in CoT, where errors cascade.
Method: Introduces STIM to attribute tokens in reasoning chains to memorization sources (local, mid-range, long-range) based on pretraining corpus co-occurrence.
Result: Local memorization drives up to 67% of wrong tokens; STIM scores predict errors effectively.
Conclusion: STIM is a valuable tool for diagnosing and improving model reasoning, applicable to other structured tasks.
Abstract: Large Language Models (LLMs) perform well on reasoning benchmarks but often fail when inputs alter slightly, raising concerns about the extent to which their success relies on memorization. This issue is especially acute in Chain-of-Thought (CoT) reasoning, where spurious memorized patterns can trigger intermediate errors that cascade into incorrect final answers. We introduce STIM, a novel framework for Source-aware Token-level Identification of Memorization, which attributes each token in a reasoning chain to one of multiple memorization sources - local, mid-range, or long-range - based on their statistical co-occurrence with the token in the pretraining corpus. Our token-level analysis across tasks and distributional settings reveals that models rely more on memorization in complex or long-tail cases, and that local memorization is often the dominant driver of errors, leading to up to 67% of wrong tokens. We also show that memorization scores from STIM can be effective in predicting the wrong tokens in the wrong reasoning step. STIM offers a powerful tool for diagnosing and improving model reasoning and can generalize to other structured step-wise generation tasks.
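To illustrate the flavor of a local memorization score, the toy below rates a token by a PMI-style bigram statistic over a miniature "pretraining corpus". STIM's actual statistics, and its mid- and long-range variants, are richer than this.

```python
from collections import Counter

corpus = "the cat sat on the mat . the cat ate the fish .".split()
unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def local_memorization(prev_token: str, token: str) -> float:
    """PMI-style score: P(prev, token) / (P(prev) P(token)), lightly smoothed."""
    p_joint = (bigram[(prev_token, token)] + 0.5) / (N - 1)
    p_indep = (unigram[prev_token] / N) * (unigram[token] / N)
    return p_joint / (p_indep + 1e-9)

print(local_memorization("cat", "sat"))   # high: a memorized local pattern
print(local_memorization("cat", "mat"))   # lower: never adjacent in the corpus
```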
[59] Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models
Soyeon Kim, Jindong Wang, Xing Xie, Steven Euijong Whang
Main category: cs.CL
TL;DR: TDBench is a new benchmark for time-sensitive QA tasks, leveraging temporal databases and SQL to enable scalable, comprehensive evaluation of LLMs.
Details
Motivation: Existing TSQA benchmarks are limited by manual curation or fixed templates, hindering scalable evaluation.
Method: TDBench uses temporal databases and techniques like temporal SQL to construct TSQA pairs and introduces a time accuracy metric.
Result: TDBench facilitates scalable TSQA evaluation, reduces human labor, and complements existing approaches.
Conclusion: TDBench offers a reliable, automated solution for evaluating LLMs on time-sensitive factual knowledge.
Abstract: Facts evolve over time, making it essential for Large Language Models (LLMs) to handle time-sensitive factual knowledge accurately and reliably. While factual Time-Sensitive Question-Answering (TSQA) tasks have been widely studied, existing benchmarks often rely on manual curation or a small, fixed set of predefined templates, which restricts scalable and comprehensive TSQA evaluation. To address these challenges, we propose TDBench, a new benchmark that systematically constructs TSQA pairs by harnessing temporal databases and database techniques such as temporal SQL and functional dependencies. We also introduce a fine-grained evaluation metric called time accuracy, which assesses the validity of time references in model explanations alongside traditional answer accuracy to enable a more reliable TSQA evaluation. Extensive experiments on contemporary LLMs show how TDBench enables scalable and comprehensive TSQA evaluation while reducing the reliance on human labor, complementing existing Wikipedia/Wikidata-based TSQA evaluation approaches by enabling LLM evaluation on application-specific data and seamless multi-hop question generation. Code and data are publicly available at: https://github.com/ssoy0701/tdbench.git.
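The core trick, minting QA pairs mechanically from a temporal table, can be sketched in a few lines of SQL; the schema, query, and question template below are our own toy illustration, not the benchmark's pipeline.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE ceo (
    company TEXT, person TEXT, valid_from TEXT, valid_to TEXT)""")
db.executemany("INSERT INTO ceo VALUES (?, ?, ?, ?)", [
    ("Acme", "Ada Lovelace", "2015-01-01", "2019-06-30"),
    ("Acme", "Alan Turing",  "2019-07-01", "9999-12-31"),
])

asked = "2018-03-15"   # temporal predicate: who held the role at this time point?
(person,) = db.execute(
    "SELECT person FROM ceo WHERE company=? AND valid_from<=? AND valid_to>=?",
    ("Acme", asked, asked),
).fetchone()

question = f"Who was the CEO of Acme on {asked}?"
print(question, "->", person)   # the gold answer falls out of the DB mechanically
```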
[60] ProCut: LLM Prompt Compression via Attribution Estimation
Zhentao Xu, Fengyi Li, Albert Chen, Xiaofeng Wang
Main category: cs.CL
TL;DR: ProCut is a training-free framework for compressing large-scale LLM prompts by analyzing and pruning low-impact segments, reducing token count by 78% while maintaining or improving performance.
Details
Motivation: Large-scale industrial LLM systems face bloated prompts due to iterative additions, leading to maintenance difficulties, high latency, and costs.
Method: ProCut segments prompts into meaningful units, quantifies their impact, and prunes low-utility components using attribution analysis.
Result: Achieves 78% token reduction in production and up to 62% better performance than alternatives, with 50% lower compression latency.
Conclusion: ProCut effectively compresses prompts without training, integrates with existing frameworks, and enhances efficiency and performance.
Abstract: In large-scale industrial LLM systems, prompt templates often expand to thousands of tokens as teams iteratively incorporate sections such as task instructions, few-shot examples, and heuristic rules to enhance robustness and coverage. This expansion leads to bloated prompts that are difficult to maintain and incur significant inference latency and serving costs. To address this, we introduce Prompt Compression via Attribution Estimation (ProCut), a flexible, LLM-agnostic, training-free framework that compresses prompts through attribution analysis. ProCut segments prompt templates into semantically meaningful units, quantifies their impact on task performance, and prunes low-utility components. Through extensive experiments on five public benchmark datasets and real-world industrial prompts, we show that ProCut achieves substantial prompt size reductions (78% fewer tokens in production) while maintaining or even slightly improving task performance (up to 62% better than alternative methods). We further introduce an LLM-driven attribution estimator that reduces compression latency by over 50%, and demonstrate that ProCut integrates seamlessly with existing prompt-optimization frameworks to produce concise, high-performing prompts.
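A hedged sketch of the attribution-and-prune loop: score each prompt segment by how much the task metric drops when it is ablated, then keep only segments with positive utility. The brute-force `evaluate` calls below are a stand-in; ProCut's LLM-driven attribution estimator exists precisely to avoid this cost.

```python
from typing import Callable, List

def procut_like(segments: List[str],
                evaluate: Callable[[str], float],
                min_utility: float = 0.0) -> List[str]:
    """Keep only segments whose removal would hurt the task metric."""
    full_score = evaluate("\n".join(segments))
    kept = []
    for i, seg in enumerate(segments):
        ablated = "\n".join(s for j, s in enumerate(segments) if j != i)
        if full_score - evaluate(ablated) > min_utility:
            kept.append(seg)
    return kept

segments = ["Task: answer concisely.", "Rule 17: never use emoji.", "Example: ..."]
metric = lambda prompt: 1.0 if "Task:" in prompt else 0.0   # toy task metric
print(procut_like(segments, metric))   # only the segment the metric values survives
```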
[61] The SMeL Test: A simple benchmark for media literacy in language models
Gustaf Ahdritz, Anat Kleiman
Main category: cs.CL
TL;DR: The paper introduces the Synthetic Media Literacy Test (SMeL Test) to evaluate LLMs’ ability to filter untrustworthy content, finding no model consistently trusts reliable sources, with hallucinations up to 70%.
Details
Motivation: To assess if LLMs can discern trustworthy content like humans, given the prevalence of misleading online information.
Method: Developed the SMeL Test to benchmark instruction-tuned LLMs, including reasoning models, on filtering untrustworthy information.
Result: No model consistently trusted reliable sources; reasoning improved scores but hallucinations remained high (up to 70%). Larger models didn’t outperform smaller ones.
Conclusion: The study highlights LLMs’ limitations in discerning trustworthy content and calls for new methods to address this form of hallucination.
Abstract: The internet is rife with unattributed, deliberately misleading, or otherwise untrustworthy content. Though large language models (LLMs) are often tasked with autonomous web browsing, the extent to which they have learned the simple heuristics human researchers use to navigate this noisy environment is not currently known. In this paper, we introduce the Synthetic Media Literacy Test (SMeL Test), a minimal benchmark that tests the ability of language models to actively filter out untrustworthy information in context. We benchmark a variety of commonly used instruction-tuned LLMs, including reasoning models, and find that no model consistently trusts more reliable sources; while reasoning in particular is associated with higher scores, even the best API model we test hallucinates up to 70% of the time. Remarkably, larger and more capable models do not necessarily outperform their smaller counterparts. We hope our work sheds more light on this important form of hallucination and guides the development of new methods to combat it.
[62] When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models
Jin Li, Keyu Wang, Shu Yang, Zhuoran Zhang, Di Wang
Main category: cs.CL
TL;DR: LLMs exhibit sycophantic behavior by agreeing with user opinions, even when incorrect. This paper explores the internal mechanisms behind this behavior, identifying a two-stage process and noting the influence of grammatical perspective.
Details
Motivation: To understand why LLMs display sycophantic behavior and how user opinions and framing affect this tendency.
Method: Systematic study of sycophancy across model families, logit-lens analysis, causal activation patching, and examination of grammatical perspective effects.
Result: Sycophancy arises from a late-layer output preference shift and deeper representational divergence. User authority has no impact, but first-person prompts increase sycophancy.
Conclusion: Sycophancy is a structural override of learned knowledge in deeper layers, with implications for AI alignment and truthfulness.
Abstract: Large Language Models (LLMs) often exhibit sycophantic behavior, agreeing with user-stated opinions even when those contradict factual knowledge. While prior work has documented this tendency, the internal mechanisms that enable such behavior remain poorly understood. In this paper, we provide a mechanistic account of how sycophancy arises within LLMs. We first systematically study how user opinions induce sycophancy across different model families. We find that simple opinion statements reliably induce sycophancy, whereas user expertise framing has a negligible impact. Through logit-lens analysis and causal activation patching, we identify a two-stage emergence of sycophancy: (1) a late-layer output preference shift and (2) deeper representational divergence. We also verify that user authority fails to influence behavior because models do not encode it internally. In addition, we examine how grammatical perspective affects sycophantic behavior, finding that first-person prompts (“I believe...”) consistently induce higher sycophancy rates than third-person framings (“They believe...”) by creating stronger representational perturbations in deeper layers. These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers, with implications for alignment and truthful AI systems.
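For readers unfamiliar with the logit-lens analysis mentioned above, here is a generic sketch on gpt2 (a stand-in; the paper studies other model families): project each layer's hidden state at the last position through the final layer norm and unembedding, and watch where the predicted token settles.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "I believe the capital of Australia is Sydney. The capital of Australia is"
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    hidden = lm(ids, output_hidden_states=True).hidden_states  # embeddings + 12 layers

for layer, h in enumerate(hidden):
    # Lens: final layer norm + unembedding applied to an intermediate state.
    # (The last entry already has ln_f applied; re-applying is harmless here.)
    logits = lm.lm_head(lm.transformer.ln_f(h[0, -1]))
    print(f"layer {layer:2d}: {tok.decode([logits.argmax().item()])!r}")
```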
[63] “Harmless to You, Hurtful to Me!”: Investigating the Detection of Toxic Languages Grounded in the Perspective of Youth
Yaqiong Li, Peng Zhang, Lin Wang, Hansu Gu, Siyuan Qiao, Ning Gu, Tun Lu
Main category: cs.CL
TL;DR: The paper explores youth-specific toxicity in social media, focusing on languages perceived as toxic by youth but not adults. It constructs a Chinese youth-toxicity dataset and finds contextual factors improve detection accuracy.
Details
Motivation: To address the gap in understanding youth's unique toxicity perceptions, which differ from adults' and are overlooked in existing research.
Method: Constructed a Chinese youth-toxicity dataset and analyzed contextual factors like utterance source and text features. Evaluated existing toxicity detection techniques.
Result: Youth’s toxicity perception is influenced by contextual factors. Incorporating meta-information improves detection accuracy.
Conclusion: The study highlights the need for youth-centered toxicity detection and provides insights for future research.
Abstract: Risk perception is subjective, and youth’s understanding of toxic content differs from that of adults. Although previous research has conducted extensive studies on toxicity detection in social media, the investigation of youth’s unique toxicity, i.e., languages perceived as nontoxic by adults but toxic by youth, is ignored. To address this gap, we aim to explore: 1) What are the features of “youth-toxicity” languages in social media (RQ1); 2) Can existing toxicity detection techniques accurately detect these languages (RQ2). For these questions, we took Chinese youth as the research target, constructed the first Chinese “youth-toxicity” dataset, and then conducted extensive analysis. Our results suggest that youth’s perception of these languages is associated with several contextual factors, like the source of an utterance and text-related features. Incorporating this meta-information into current toxicity detection methods significantly improves accuracy overall. Finally, we propose several insights into future research on youth-centered toxicity detection.
[64] Learning Dynamics of Meta-Learning in Small Model Pretraining
David Demitri Africa, Yuval Weiss, Paula Buttery, Richard Diehl Martinez
Main category: cs.CL
TL;DR: Meta-learning improves small language models’ pretraining, making them faster, better, and more interpretable.
Details
Motivation: To reduce the cost and improve the interpretability of small language models using meta-learning.
Method: Integrates first-order MAML with subset-masked LM pretraining, tested on four Llama-style models (11M-570M params).
Result: Achieves same loss 1.6x faster, better multilingual NER F1, and reveals interpretable two-stage training dynamics.
Conclusion: Meta-learning enhances small models’ efficiency, performance, and interpretability, with clear training dynamics.
Abstract: Large language models are powerful but costly. We ask whether meta-learning can make the pretraining of small language models not only better but also more interpretable. We integrate first-order MAML with subset-masked LM pretraining, producing four Llama-style decoder-only models (11M-570M params), and evaluate it on a fundamental NLP task with many settings and real-world applications. Compared with vanilla training, our model (i) reaches the same loss up to 1.6x sooner, (ii) improves F1 on multilingual Universal NER under equal compute, and (iii) makes the training dynamics easy to read: first the network’s representations fan out (“diversify”) and later they collapse into a smaller, shared subspace (“compress”). This two-stage shift shows up as a rise-and-fall in both effective-rank curves and attention-head entropy. The same curves pinpoint which layers specialise earliest and which later reconverge, giving a compact, interpretable signature of meta-adaptation. Code, checkpoints and WandB logs are released.
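The inner/outer loop structure of first-order MAML is easy to show on a toy regression family; the synthetic tasks and learning rates below are ours, whereas the paper applies FO-MAML to subset-masked LM pretraining.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                       # meta-initialization (slope, bias)
inner_lr, outer_lr = 0.05, 0.01

def sample_task():
    """Draw one linear-regression task and split it into support/query sets."""
    a, b = rng.normal(2.0, 0.5), rng.normal(-1.0, 0.5)
    X = rng.uniform(-1, 1, size=32)
    y = a * X + b
    return (X[:16], y[:16]), (X[16:], y[16:])

def grad(params, X, y):
    """Gradient of mean squared error for y_hat = params[0]*X + params[1]."""
    err = params[0] * X + params[1] - y
    return 2 * np.array([(err * X).mean(), err.mean()])

for _ in range(2000):
    (Xs, ys), (Xq, yq) = sample_task()
    adapted = theta - inner_lr * grad(theta, Xs, ys)   # inner adaptation step
    # First-order MAML: the outer update reuses the gradient at the adapted
    # parameters, dropping the second-order terms of full MAML.
    theta -= outer_lr * grad(adapted, Xq, yq)

print(theta)  # drifts toward the task-family mean: a good initialization
```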
[65] Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, Yuwei Fu, Jing Su, Ge Zhang, Wenhao Huang, Mingxuan Wang, Lin Yan, Xiaoying Jia, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Yonghui Wu, Hao Zhou
Main category: cs.CL
TL;DR: Seed Diffusion Preview is a fast, parallel-generation language model using discrete-state diffusion, achieving 2,146 tokens/s on H20 GPUs while maintaining strong performance on code benchmarks.
Details
Motivation: To address the latency issue in token-by-token decoding by leveraging non-sequential, parallel generation for faster inference.
Method: Uses discrete-state diffusion for parallel generation, enabling faster inference compared to sequential models.
Result: Achieves 2,146 tokens/s on H20 GPUs, outperforming Mercury and Gemini Diffusion in speed while maintaining competitive benchmark performance.
Conclusion: Seed Diffusion Preview sets a new state-of-the-art on the speed-quality Pareto frontier for code models.
Abstract: We present Seed Diffusion Preview, a large-scale language model based on discrete-state diffusion, offering remarkably fast inference speed. Thanks to non-sequential, parallel generation, discrete diffusion models provide a notable speedup to mitigate the inherent latency of token-by-token decoding, as demonstrated recently (e.g., Mercury Coder, Gemini Diffusion). Seed Diffusion Preview achieves an inference speed of 2,146 tokens/s on H20 GPUs while maintaining competitive performance across a sweep of standard code evaluation benchmarks, significantly faster than contemporary Mercury and Gemini Diffusion, establishing a new state of the art on the speed-quality Pareto frontier for code models.
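The source of the speedup can be illustrated without any real model: each diffusion step re-predicts every masked position in parallel and commits the most confident ones, so a length-L sequence is filled in a handful of passes rather than L. The random "denoiser" and the unmasking schedule below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
L, V, steps = 16, 100, 4
tokens = np.full(L, -1)                  # -1 marks a [MASK] position

for step in range(steps):
    masked = np.where(tokens == -1)[0]
    # Stand-in denoiser: a distribution over the vocab per masked position.
    probs = rng.dirichlet(np.ones(V), size=len(masked))
    conf, pred = probs.max(axis=1), probs.argmax(axis=1)
    # Commit the most confident fraction this pass; the rest stay masked.
    k = max(1, int(np.ceil(len(masked) / (steps - step))))
    for i in np.argsort(-conf)[:k]:
        tokens[masked[i]] = pred[i]

print(tokens)   # all 16 positions filled in 4 parallel passes, not 16 sequential ones
```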
[66] Proof2Hybrid: Automatic Mathematical Benchmark Synthesis for Proof-Centric Problems
Yebo Peng, Zixiang Liu, Yaoming Li, Zhizhuo Yang, Xinye Xu, Bowen Ye, Weijun Yuan, Zihan Wang, Tong Yang
Main category: cs.CL
TL;DR: Proof2Hybrid is an automated framework for creating proof-centric benchmarks to evaluate LLMs’ mathematical abilities, demonstrated with AlgGeoTest, revealing significant gaps in LLMs’ comprehension of algebraic geometry.
Details
Motivation: Existing benchmarks for evaluating LLMs' mathematical capabilities are limited, especially for proof-centric problems, due to scalability and cost issues.
Method: Proposes Proof2Hybrid, an automated framework using Proof2X to convert proofs into verifiable questions, including hybrid-formatted “$m$-out-of-$n$ multiple judge questions” for robust evaluation.
Result: AlgGeoTest, a benchmark of 456 algebraic geometry items, shows profound deficits in LLMs’ understanding of the subject.
Conclusion: Proof2Hybrid and AlgGeoTest enable deeper research into AI’s mathematical intelligence by providing scalable, precise evaluation tools.
Abstract: Evaluating the mathematical capability of Large Language Models (LLMs) is a critical yet challenging frontier. Existing benchmarks fall short, particularly for proof-centric problems, as manual creation is unscalable and costly, leaving the true mathematical abilities of LLMs largely unassessed. To overcome these barriers, we propose Proof2Hybrid, the first fully automated framework that synthesizes high-quality, proof-centric benchmarks from natural language mathematical corpora. The key novelty of our solution is Proof2X, a roadmap for converting mathematical proofs into various kinds of questions that are easy to verify. Instructed by this roadmap, we propose a new type of hybrid-formatted questions, named “$m$-out-of-$n$ multiple judge questions”, specifically designed to enable robust, automatic evaluation while being resilient to guessing and superficial pattern matching inherent in traditional formats. As a demonstration of our framework, we introduce AlgGeoTest, a benchmark for algebraic geometry, a frontier domain of modern mathematics, comprising 456 challenging items. Our extensive evaluations on state-of-the-art LLMs using AlgGeoTest reveal profound deficits in their comprehension of algebraic geometry, providing a more precise measure of their true mathematical capabilities. Our framework and benchmark pave the way for a new wave of in-depth research into the mathematical intelligence of AI systems.
[67] Isolating Culture Neurons in Multilingual Large Language Models
Danial Namazifard, Lukas Galke
Main category: cs.CL
TL;DR: The paper explores how multilingual LLMs encode culture, identifying and isolating culture-specific neurons distinct from language-specific ones.
Details
Motivation: To understand and localize how culture is encoded in multilingual LLMs, disentangling it from language-specific encoding.
Method: Extends a methodology for identifying language-specific neurons to culture-specific neurons, using the MUREL dataset (85.2M tokens across six cultures).
Result: Culture-specific neurons are found in upper layers, distinct from language neurons, and can be independently modulated.
Conclusion: Cultural knowledge in LLMs can be selectively edited for fairness and inclusivity.
Abstract: Language and culture are deeply intertwined, yet it is so far unclear how and where multilingual large language models encode culture. Here, we build upon an established methodology for identifying language-specific neurons and extend it to localize and isolate culture-specific neurons, carefully disentangling their overlap and interaction with language-specific neurons. To facilitate our experiments, we introduce MUREL, a curated dataset of 85.2 million tokens spanning six different cultures. Our localization and intervention experiments show that LLMs encode different cultures in distinct neuron populations, predominantly in upper layers, and that these culture neurons can be modulated independently from language-specific neurons or those specific to other cultures. These findings suggest that cultural knowledge and propensities in multilingual language models can be selectively isolated and edited, promoting fairness, inclusivity, and alignment. Code and data are available at https://github.com/namazifard/Culture_Neurons.
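A toy version of the localization step, assuming (in the spirit of the language-neuron methodology the paper extends) that a neuron counts as culture-specific when its activation probability on one culture's text exceeds its maximum over the others by a margin. The activation statistics and margin below are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, cultures = 1024, ["da", "jp", "tr"]
# Stand-in for P(neuron fires | culture), estimated over each corpus slice.
act_prob = {c: rng.beta(2, 8, size=n_neurons) for c in cultures}
act_prob["jp"][:20] = np.clip(act_prob["jp"][:20] + 0.6, 0, 1)  # plant a demo signal

def culture_neurons(target: str, margin: float = 0.3) -> np.ndarray:
    """Neurons firing far more often on `target` text than on any other culture."""
    others = np.stack([act_prob[c] for c in cultures if c != target])
    return np.where(act_prob[target] - others.max(axis=0) > margin)[0]

print(len(culture_neurons("jp")), "candidate jp-culture neurons")
```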
[68] Interference Matrix: Quantifying Cross-Lingual Interference in Transformer Encoders
Belen Alastruey, João Maria Janeiro, Alexandre Allauzen, Maha Elbayad, Loïc Barrault, Marta R. Costa-jussà
Main category: cs.CL
TL;DR: A study on language interference in encoder-only Transformer models across 83 languages, revealing asymmetrical interference patterns tied to script rather than linguistic families or embedding similarity. The interference matrix predicts downstream task performance.
Details
Motivation: To quantify and understand cross-lingual interference in multilingual Transformer models, aiming to improve their design and performance.
Method: Constructed an interference matrix by training and evaluating small BERT-like models on all possible language pairs (83 languages).
Result: Interference is asymmetrical and better explained by script than linguistic families or embedding similarity. The matrix predicts downstream task performance.
Conclusion: The interference matrix is a valuable tool for optimizing multilingual model design.
Abstract: In this paper, we present a comprehensive study of language interference in encoder-only Transformer models across 83 languages. We construct an interference matrix by training and evaluating small BERT-like models on all possible language pairs, providing a large-scale quantification of cross-lingual interference. Our analysis reveals that interference between languages is asymmetrical and that its patterns do not align with traditional linguistic characteristics, such as language family, nor with proxies like embedding similarity, but instead better relate to script. Finally, we demonstrate that the interference matrix effectively predicts performance on downstream tasks, serving as a tool to better design multilingual models to obtain optimal performance.
[69] Decomposing the Entropy-Performance Exchange: The Missing Keys to Unlocking Effective Reinforcement Learning
Jia Deng, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Ji-Rong Wen
Main category: cs.CL
TL;DR: The paper analyzes the entropy-performance exchange in RLVR for LLMs, identifying stages and granularities where entropy reduction boosts learning. It proposes methods to dynamically adjust rewards based on perplexity and position, improving performance.
Details
Motivation: To understand the entropy-performance exchange in RLVR for LLMs and enhance reasoning abilities by identifying effective learning stages and granularities.
Method: Systematic empirical analysis of RLVR’s entropy-performance exchange across stage-level, instance-level, and token-level granularities. Proposes dynamic reward adjustment methods using perplexity and positional information.
Result: Entropy reduction in negative samples aids learning in the rising stage, while high-entropy tokens in low-perplexity samples and sequence ends correlate with learning efficiency in the plateau stage. Proposed methods outperform baselines.
Conclusion: Dynamic reward adjustment based on perplexity and position improves RLVR performance, leveraging insights from entropy dynamics across training stages and granularities.
Abstract: Recently, reinforcement learning with verifiable rewards (RLVR) has been widely used for enhancing the reasoning abilities of large language models (LLMs). A core challenge in RLVR involves managing the exchange between entropy and performance of policies. Despite the importance of this exchange, a fine-grained understanding of when and how this exchange operates most effectively remains limited. To bridge this gap, we conduct a systematic empirical analysis of the entropy-performance exchange mechanism of RLVR across different levels of granularity. Specifically, we first divide the training process into two distinct stages based on entropy dynamics, i.e., rising stage and plateau stage, and then systematically investigate how this mechanism varies across stage-level, instance-level, and token-level granularities. Our analysis reveals that, in the rising stage, entropy reduction in negative samples facilitates the learning of effective reasoning patterns, which in turn drives rapid performance gains. Moreover, in the plateau stage, learning efficiency strongly correlates with high-entropy tokens present in low-perplexity samples and those located at the end of sequences. Motivated by these findings, we propose two methods that dynamically adjust the reward signal using perplexity and positional information to focus RL updates on tokens that exhibit high learning potential, achieving improvements compared to the baseline methods on various LLMs.
[70] SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System
Serry Sibaee, Omer Nacar, Yasser Al-Habashi, Adel Ammar, Wadii Boulila
Main category: cs.CL
TL;DR: SHAMI-MT is a bidirectional machine translation system for Modern Standard Arabic (MSA) and the Syrian dialect, achieving high-quality translations.
Details
Motivation: The diglossia between MSA and regional dialects like Syrian poses challenges for NLP, especially machine translation.
Method: Two models (MSA-to-Shami and Shami-to-MSA) were built using AraT5v2-base-1024, fine-tuned on the Nabra dataset, and evaluated on the MADAR corpus.
Result: The MSA-to-Shami model scored 4.01/5.0 in quality, showing accuracy and dialectal authenticity.
Conclusion: SHAMI-MT advances dialectal Arabic translation, aiding content localization, cultural heritage, and intercultural communication.
Abstract: The rich linguistic landscape of the Arab world is characterized by a significant gap between Modern Standard Arabic (MSA), the language of formal communication, and the diverse regional dialects used in everyday life. This diglossia presents a formidable challenge for natural language processing, particularly machine translation. This paper introduces \textbf{SHAMI-MT}, a bidirectional machine translation system specifically engineered to bridge the communication gap between MSA and the Syrian dialect. We present two specialized models, one for MSA-to-Shami and another for Shami-to-MSA translation, both built upon the state-of-the-art AraT5v2-base-1024 architecture. The models were fine-tuned on the comprehensive Nabra dataset and rigorously evaluated on unseen data from the MADAR corpus. Our MSA-to-Shami model achieved an outstanding average quality score of \textbf{4.01 out of 5.0} when judged by OPENAI model GPT-4.1, demonstrating its ability to produce translations that are not only accurate but also dialectally authentic. This work provides a crucial, high-fidelity tool for a previously underserved language pair, advancing the field of dialectal Arabic translation and offering significant applications in content localization, cultural heritage, and intercultural communication.
[71] Dynaword: From One-shot to Continuously Developed Datasets
Kenneth Enevoldsen, Kristian Nørgaard Jensen, Jan Kostkan, Balázs Szabó, Márton Kardos, Kirten Vad, Andrea Blasi Núñez, Gianluca Barmina, Jacob Nielsen, Rasmus Larsen, Peter Vahlstrup, Per Møldrup Dalum, Desmond Elliott, Lukas Galke, Peter Schneider-Kamp, Kristoffer Nielbo
Main category: cs.CL
TL;DR: The paper introduces Dynaword, a framework for creating open, community-updatable NLP datasets, and Danish Dynaword, its implementation, addressing licensing, static releases, and quality challenges.
Details
Motivation: Current NLP datasets face issues with restrictive licensing, lack of community updates, and limited quality assurance.
Method: Proposes the Dynaword framework for open, collaborative dataset creation and implements Danish Dynaword as a proof of concept.
Result: Danish Dynaword is larger, openly licensed, and community-contributed, with built-in quality checks.
Conclusion: Dynaword offers a sustainable, collaborative solution for large-scale NLP datasets.
Abstract: Large-scale datasets are foundational for research and development in natural language processing. However, current approaches face three key challenges: (1) reliance on ambiguously licensed sources restricting use, sharing, and derivative works; (2) static dataset releases that prevent community contributions and diminish longevity; and (3) quality assurance processes restricted to publishing teams rather than leveraging community expertise. To address these limitations, we introduce two contributions: the Dynaword approach and Danish Dynaword. The Dynaword approach is a framework for creating large-scale, open datasets that can be continuously updated through community collaboration. Danish Dynaword is a concrete implementation that validates this approach and demonstrates its potential. Danish Dynaword contains over four times as many tokens as comparable releases, is exclusively openly licensed, and has received multiple contributions across industry and research. The repository includes light-weight tests to ensure data formatting, quality, and documentation, establishing a sustainable framework for ongoing community contributions and dataset evolution.
[72] A French Version of the OLDI Seed Corpus
Malik Marmonier, Benoît Sagot, Rachel Bawden
Main category: cs.CL
TL;DR: The paper introduces the first French partition of the OLDI Seed Corpus for WMT 2025, detailing its creation using machine translation and native speaker post-editing, and its role as a pivot resource for under-resourced regional languages.
Details
Motivation: To address the lack of parallel corpora for under-resourced regional languages of France by creating a French partition of the OLDI Seed Corpus.
Method: Utilized multiple machine translation systems and a custom-built interface for post-editing by native speakers to create the corpus.
Result: A French corpus combining technical terminology and stylistic irregularities from Wikipedia, serving as a pivot resource.
Conclusion: The French corpus is a key step toward facilitating parallel corpora collection for regional languages.
Abstract: We present the first French partition of the OLDI Seed Corpus, our submission to the WMT 2025 Open Language Data Initiative (OLDI) shared task. We detail its creation process, which involved using multiple machine translation systems and a custom-built interface for post-editing by qualified native speakers. We also highlight the unique translation challenges presented by the source data, which combines highly technical, encyclopedic terminology with the stylistic irregularities characteristic of user-generated content taken from Wikipedia. This French corpus is not an end in itself, but is intended as a crucial pivot resource to facilitate the collection of parallel corpora for the under-resourced regional languages of France.
[73] Simple Methods Defend RAG Systems Well Against Real-World Attacks
Ilias Triantafyllopoulos, Renyi Qu, Salvatore Giorgi, Brenda Curtis, Lyle H. Ungar, João Sedoc
Main category: cs.CL
TL;DR: The paper evaluates four methods for detecting Out-Of-Domain (OOD) queries in Retrieval-Augmented Generation (RAG) systems to ensure safety and relevance, highlighting the importance of external OOD detectors.
Details
Motivation: Ensuring safety and relevance in RAG systems for safety-critical applications is challenging, necessitating robust OOD detection methods.
Method: Four OOD detection methods (GPT-4o, regression-based, PCA-based, Neural Collapse) are evaluated, with focus on PCA and Neural Collapse feature separation strategies.
Result: Validated on datasets and real-world applications, the study confirms external OOD detectors are crucial for maintaining response relevance.
Conclusion: External OOD detection is essential for RAG systems to ensure safety and relevance in responses.
Abstract: Ensuring safety and in-domain responses for Retrieval-Augmented Generation (RAG) systems is paramount in safety-critical applications, yet remains a significant challenge. To address this, we evaluate four methodologies for Out-Of-Domain (OOD) query detection: GPT-4o, regression-based, Principal Component Analysis (PCA)-based, and Neural Collapse (NC), to ensure the RAG system only responds to queries confined to the system’s knowledge base. Specifically, our evaluation explores two novel dimensionality reduction and feature separation strategies: \textit{PCA}, where top components are selected using explained variance or OOD separability, and an adaptation of \textit{Neural Collapse Feature Separation}. We validate our approach on standard datasets (StackExchange and MSMARCO) and real-world applications (Substance Use and COVID-19), including tests against LLM-simulated and actual attacks on a COVID-19 vaccine chatbot. Through human and LLM-based evaluations of response correctness and relevance, we confirm that an external OOD detector is crucial for maintaining response relevance.
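Of the four detectors, the PCA route is the easiest to sketch: fit PCA on in-domain query embeddings and score new queries by reconstruction error outside the retained subspace. The component count, threshold quantile, and synthetic embeddings below are illustrative choices, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
in_domain = rng.normal(size=(500, 64))     # stand-in for in-domain query embeddings
pca = PCA(n_components=16).fit(in_domain)

def ood_score(x: np.ndarray) -> float:
    """Distance between x and its projection onto the in-domain subspace."""
    recon = pca.inverse_transform(pca.transform(x[None]))[0]
    return float(np.linalg.norm(x - recon))

threshold = np.quantile([ood_score(v) for v in in_domain], 0.95)
query = rng.normal(loc=3.0, size=64)       # shifted distribution: likely OOD
print(ood_score(query) > threshold)        # True -> refuse to answer
```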
[74] LaMPE: Length-aware Multi-grained Position Encoding for Adaptive Long-context Scaling Without Training
Sikui Zhang, Guangze Gao, Ziyun Gan, Chunfeng Yuan, Zefeng Lin, Houwen Peng, Bing Li, Weiming Hu
Main category: cs.CL
TL;DR: LaMPE is a training-free method for adaptive long-context scaling in LLMs, addressing OOD issues in RoPE by dynamically mapping input length to positional capacity and using multi-grained attention.
Details
Motivation: Performance degradation in LLMs when input exceeds the pretraining context window due to OOD behavior of RoPE.
Method: Proposes LaMPE, which uses a parametric scaled sigmoid function for dynamic mapping and a multi-grained attention mechanism.
Result: Significant performance improvements on three LLMs across five benchmarks compared to existing methods.
Conclusion: LaMPE effectively adapts to varying input lengths without training and outperforms current extrapolation methods.
Abstract: Large language models (LLMs) experience significant performance degradation when the input exceeds the pretraining context window, primarily due to the out-of-distribution (OOD) behavior of Rotary Position Embedding (RoPE). Recent studies mitigate this problem by remapping OOD positions into the in-distribution range with fixed mapping strategies, ignoring the dynamic relationship between input length and the model’s effective context window. To this end, we propose Length-aware Multi-grained Positional Encoding (LaMPE), a training-free method that fully utilizes the model’s effective context window for adaptive long-context scaling in LLMs. Motivated by the left-skewed frequency distribution of relative positions, LaMPE establishes a dynamic relationship between mapping length and input length through a parametric scaled sigmoid function to adaptively allocate positional capacity across varying input lengths. Meanwhile, LaMPE devises a novel multi-grained attention mechanism that strategically allocates positional resolution across different sequence regions to capture both fine-grained locality and long-range dependencies. Our method can be seamlessly applied to a wide range of RoPE-based LLMs without training. Extensive experiments on three representative LLMs across five mainstream long-context benchmarks demonstrate that LaMPE achieves significant performance improvements compared to existing length extrapolation methods. The code will be released at https://github.com/scar-on/LaMPE.
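A sketch of the length-aware mapping idea, with invented parameter values: a scaled sigmoid blends an identity mapping (short inputs keep their true positions) into saturation at the model's effective window (long inputs get compressed), so positional capacity adapts to input length. This is our reading of the abstract; see the paper's code for the actual formulation.

```python
import math

def lampe_mapped_length(input_len: int, effective_window: int = 8192,
                        k: float = 3.0) -> int:
    """Map an input length to a positional capacity within the effective window."""
    x = input_len / effective_window
    gate = 1.0 / (1.0 + math.exp(-k * (x - 1.0)))        # sigmoid centered at the window
    mapped = effective_window * (x * (1 - gate) + gate)  # identity -> saturation
    return min(input_len, effective_window, int(mapped))

for n in (2048, 8192, 32768, 131072):
    print(n, "->", lampe_mapped_length(n))   # short inputs untouched, long compressed
```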
[75] VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, Zhi Zhang, Xin Liu
Main category: cs.CL
TL;DR: The paper introduces VeOmni, a modular and efficient training framework for omni-modal LLMs, addressing scalability and engineering challenges in existing systems.
Details
Motivation: Training omni-modal LLMs is challenging due to heterogeneous architectures and inefficient parallel logic in current frameworks.
Method: VeOmni decouples communication from computation, enabling efficient 3D parallelism and flexible modality integration.
Result: The framework achieves high throughput (2,800 tokens/sec/GPU) and scales to 160K context lengths on 128 GPUs.
Conclusion: VeOmni demonstrates superior efficiency and scalability for large omni-modal LLM training.
Abstract: Recent advances in large language models (LLMs) have driven impressive progress in omni-modal understanding and generation. However, training omni-modal LLMs remains a significant challenge due to the heterogeneous model architectures required to process diverse modalities, necessitating sophisticated system design for efficient large-scale training. Existing frameworks typically entangle model definition with parallel logic, incurring limited scalability and substantial engineering overhead for end-to-end omni-modal training. We present VeOmni, a modular and efficient training framework to accelerate the development of omni-modal LLMs. VeOmni introduces model-centric distributed recipes that decouple communication from computation, enabling efficient 3D parallelism on omni-modal LLMs. VeOmni also features a flexible configuration interface supporting seamless integration of new modalities with minimal code change. Using VeOmni, an omni-modal mixture-of-experts (MoE) model with 30B parameters can be trained with over 2,800 tokens/sec/GPU throughput and scale to 160K context lengths via 3D parallelism on 128 GPUs, showcasing its superior efficiency and scalability for training large omni-modal LLMs.
[76] CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis
Yuzhuang Xu, Xu Han, Yuanchi Zhang, Yixuan Wang, Yijun Liu, Shiyu Ji, Qingfu Zhu, Wanxiang Che
Main category: cs.CL
TL;DR: The paper introduces CAMERA, a framework for compressing Mixture-of-Experts (MoE) models by analyzing and pruning redundant micro-experts, achieving efficiency without training.
Details
Motivation: MoE models face computational and storage inefficiencies despite performance gains, and prior methods fail to balance performance and efficiency.
Method: CAMERA identifies micro-expert redundancy and proposes pruning (CAMERA-P) and quantization (CAMERA-Q) frameworks.
Result: CAMERA-P outperforms baselines under 20%-60% pruning, and CAMERA-Q excels in 2-bit quantization. Analysis of Qwen2-57B-A14B completes in <5 minutes on an A100 GPU.
Conclusion: CAMERA offers a lightweight, training-free solution for efficient MoE model compression, improving performance and computational efficiency.
Abstract: Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing micro-expert as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present CAMERA, a lightweight and training-free framework for identifying micro-expert redundancy. Our analysis uncovers significant variance in micro-expert contributions during decoding. Based on this insight, we further propose CAMERA-P, a structured micro-expert pruning framework, and CAMERA-Q, a mixed-precision quantization idea designed for micro-experts. Extensive experiments on nine downstream tasks show that CAMERA-P consistently outperforms strong baselines under pruning ratios ranging from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under aggressive 2-bit quantization, surpassing existing matrix- and channel-level ideas. Notably, our method enables complete micro-expert analysis of Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.
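As a rough illustration of compression at micro-expert granularity, the sketch below treats each intermediate channel of an expert FFN as one micro-expert and ranks channels by an activation-magnitude score; this scoring rule is an illustrative stand-in for CAMERA's redundancy analysis, not the paper's actual criterion:

```python
# Hedged sketch: structured pruning of "micro-experts" (FFN channels).
import torch

d_model, d_ff = 64, 256
w_in = torch.randn(d_ff, d_model)     # expert up-projection
w_out = torch.randn(d_model, d_ff)    # expert down-projection
x = torch.randn(1000, d_model)        # calibration activations

acts = torch.relu(x @ w_in.T)                   # (1000, d_ff) channel activations
importance = acts.mean(0) * w_out.norm(dim=0)   # contribution per micro-expert

keep = importance.argsort(descending=True)[: int(0.6 * d_ff)]  # prune 40%
w_in_pruned, w_out_pruned = w_in[keep], w_out[:, keep]
print(w_in_pruned.shape, w_out_pruned.shape)    # (153, 64) and (64, 153)
```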
[77] Understanding and Mitigating Political Stance Cross-topic Generalization in Large Language Models
Jiayi Zhang, Shu Yang, Junchao Wu, Derek F. Wong, Di Wang
Main category: cs.CL
TL;DR: Fine-tuning LLMs on political topics can manipulate their stance on unrelated issues. The paper identifies political neurons and introduces InhibitFT to mitigate cross-topic generalization.
Details
Motivation: To understand and mitigate unintended political stance generalization in fine-tuned LLMs.
Method: Proposes PNLAC to identify political neurons and InhibitFT for targeted inhibition.
Result: Identifies neuron types and reduces cross-topic generalization by 20%.
Conclusion: Selective neuron inhibition effectively mitigates unintended stance generalization.
Abstract: Fine-tuning Large Language Models on a political topic will significantly manipulate their political stance on various issues and unintentionally affect their stance on unrelated topics. While previous studies have raised this issue, there is still a lack of understanding regarding the internal representations of these stances and the mechanisms that lead to unintended cross-topic generalization. In this paper, we systematically explore the internal mechanisms underlying this phenomenon from a neuron-level perspective and how to mitigate the cross-topic generalization of political fine-tuning. Firstly, we propose Political Neuron Localization through Activation Contrasting (PNLAC) to identify two distinct types of political neurons: general political neurons, which govern stance across multiple political topics, and topic-specific neurons, which affect the model’s political stance on individual topics. We find the existence of these political neuron types across four models and datasets through activation patching experiments. Leveraging these insights, we introduce InhibitFT, an inhibition-based fine-tuning method, effectively mitigating the cross-topic stance generalization. Experimental results demonstrate the robustness of identified neuron types across various models and datasets, and show that InhibitFT significantly reduces the cross-topic stance generalization by 20% on average, while preserving topic-specific performance. Moreover, we demonstrate that selectively inhibiting only 5% of neurons is sufficient to effectively mitigate the cross-topic stance generalization.
[78] CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation
Xiaolin Lin, Jingcun Wang, Olga Kondrateva, Yiyu Shi, Bing Li, Grace Li Zhang
Main category: cs.CL
TL;DR: The paper introduces CompressKV, a method to optimize KV cache compression in LLMs by identifying and leveraging specific attention heads for token retention, improving efficiency and performance.
Details
Motivation: Addressing the inefficiency of heuristic token eviction in KV cache compression, which degrades LLM performance by ignoring the distinct functionalities of attention heads.
Method: Identify attention heads capable of retrieving critical tokens and their semantic context, then use these heads to retain important KV cache pairs. Introduce a layer-adaptive allocation strategy.
Result: CompressKV outperforms state-of-the-art methods on LongBench and Needle-in-a-Haystack benchmarks under various memory budgets.
Conclusion: The proposed method effectively improves KV cache compression by focusing on critical attention heads and adaptive layer strategies, enhancing LLM performance.
Abstract: Recent advances in large language models (LLMs) have significantly boosted long-context processing. However, the increasing key-value (KV) cache size poses critical challenges to memory and execution efficiency. Most KV cache compression methods rely on heuristic token eviction using all attention heads in Grouped Query Attention (GQA)-based LLMs. This approach ignores the different functionalities of attention heads, leading to the eviction of critical tokens and thus degrading the performance of LLMs. To address this issue, instead of using all the attention heads in GQA-based LLMs to determine important tokens as in previous work, we first identify the attention heads in each layer that are not only capable of retrieving the initial and final tokens of a prompt, but also capable of retrieving important tokens within the text and attending to their surrounding semantic context. Afterwards, we exploit such heads to determine the important tokens and retain their corresponding KV cache pairs. Furthermore, we analyze the cache eviction error of each layer individually and introduce a layer-adaptive KV cache allocation strategy. Experimental results demonstrate that the proposed CompressKV consistently outperforms state-of-the-art approaches under various memory budgets on LongBench and Needle-in-a-Haystack benchmarks. Our code is publicly available at: https://github.com/TUDa-HWAI/CompressKV.git.
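A minimal sketch of the retention step, assuming the relevant heads have already been identified (the head indices below are placeholders); finding those semantic retrieval heads is the part CompressKV itself contributes:

```python
# Hedged sketch: score prompt tokens with the attention mass of a few
# designated heads and keep only the top-budget KV entries.
import torch

n_heads, seq_len, budget = 8, 512, 128
attn = torch.softmax(torch.randn(n_heads, seq_len), dim=-1)  # last-query attention
retrieval_heads = [1, 5]                     # assumed pre-identified heads

scores = attn[retrieval_heads].mean(0)       # token importance from chosen heads only
keep = scores.topk(budget).indices.sort().values  # KV pairs to retain, in order
print(f"retaining {len(keep)} of {seq_len} KV entries")
```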
[79] Learning to Evolve: Bayesian-Guided Continual Knowledge Graph Embedding
Linyu Li, Zhi Jin, Yuanpeng He, Dongming Jin, Yichi Zhang, Haoran Duan, Nyima Tash
Main category: cs.CL
TL;DR: The paper introduces BAKE, a continual knowledge graph embedding (CKGE) model, addressing catastrophic forgetting by using Bayesian posterior updates and continual clustering.
Details
Motivation: Traditional KGE models are static, unsuitable for evolving knowledge graphs, leading to catastrophic forgetting in continual learning scenarios.
Method: BAKE employs Bayesian posterior updates for continual learning and introduces continual clustering to constrain knowledge evolution differences.
Result: Experiments show BAKE outperforms existing baseline models in preserving knowledge across multiple snapshots.
Conclusion: BAKE effectively mitigates catastrophic forgetting in CKGE, offering a robust solution for evolving knowledge graphs.
Abstract: Since knowledge graphs (KGs) continue to evolve in real scenarios, traditional KGE models are only suitable for static knowledge graphs. Therefore, continual knowledge graph embedding (CKGE) has attracted the attention of researchers. Currently, a key challenge facing CKGE is that the model is prone to “catastrophic forgetting”, resulting in the loss of previously learned knowledge. To effectively alleviate this problem, we propose a new CKGE model, BAKE. First, we note that the Bayesian posterior update principle provides a natural continual learning strategy that is insensitive to data order and can theoretically resist the forgetting of previous knowledge during data evolution. Different from existing CKGE methods, BAKE regards each batch of new data as a Bayesian update of the model prior. Under this framework, as long as the posterior distribution of the model is maintained, the model can better preserve the knowledge of early snapshots even after evolving through multiple time snapshots. Second, we propose a continual clustering method for CKGE, which further directly combats knowledge forgetting by constraining the evolution difference (or change amplitude) between new and old knowledge across snapshots. We conduct extensive experiments with BAKE on multiple datasets, and the results show that BAKE significantly outperforms existing baseline models.
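To illustrate the posterior-update view of continual learning, here is a minimal conjugate-Gaussian sketch in which each snapshot's data updates a belief over one embedding vector; the Gaussian likelihood and noise parameters are simplifying assumptions for exposition, not BAKE's actual model:

```python
# Hedged sketch: each batch is a Bayesian update of the prior, so the
# belief after many snapshots still reflects early data (no forgetting).
import numpy as np

mu, var = np.zeros(16), np.ones(16)   # prior over one entity embedding
obs_var = 0.5                         # assumed observation noise

def posterior_update(batch: np.ndarray) -> None:
    global mu, var
    n = len(batch)
    prec = 1.0 / var + n / obs_var                   # precisions add
    mu = (mu / var + batch.sum(0) / obs_var) / prec  # precision-weighted mean
    var = 1.0 / prec

rng = np.random.default_rng(0)
for snapshot in range(3):             # successive KG snapshots
    posterior_update(rng.normal(0.3, 0.7, size=(100, 16)))
print(mu.mean(), var.mean())          # mean near 0.3, variance shrinking
```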
[80] AI-Based Measurement of Innovation: Mapping Expert Insight into Large Language Model Applications
Robin Nowak, Patrick Figge, Carolin Haeussler
Main category: cs.CL
TL;DR: The paper proposes an LLM framework to measure innovation from unstructured text, outperforming traditional methods and other ML/DL models in accuracy and consistency.
Details
Motivation: Overcome limitations of manual expert evaluations and context-specific proxies in innovation research by leveraging LLMs.
Method: Design an LLM framework to assess innovation from text, validated through studies on software updates and product reviews.
Result: The LLM framework achieved higher F1-scores and consistency than alternative measures and state-of-the-art models.
Conclusion: LLMs offer reliable, accessible tools for innovation measurement, with design choices like model selection and prompt engineering impacting performance.
Abstract: Measuring innovation often relies on context-specific proxies and on expert evaluation. Hence, empirical innovation research is often limited to settings where such data is available. We investigate how large language models (LLMs) can be leveraged to overcome the constraints of manual expert evaluations and assist researchers in measuring innovation. We design an LLM framework that reliably approximates domain experts’ assessment of innovation from unstructured text data. We demonstrate the performance and broad applicability of this framework through two studies in different contexts: (1) the innovativeness of software application updates and (2) the originality of user-generated feedback and improvement ideas in product reviews. We compared the performance (F1-score) and reliability (consistency rate) of our LLM framework against alternative measures used in prior innovation studies, and to state-of-the-art machine learning- and deep learning-based models. The LLM framework achieved higher F1-scores than the other approaches, and its results are highly consistent (i.e., results do not change across runs). This article equips R&D personnel in firms, as well as researchers, reviewers, and editors, with the knowledge and tools to effectively use LLMs for measuring innovation and evaluating the performance of LLM-based innovation measures. In doing so, we discuss, the impact of important design decisions-including model selection, prompt engineering, training data size, training data distribution, and parameter settings-on performance and reliability. Given the challenges inherent in using human expert evaluation and existing text-based measures, our framework has important implications for harnessing LLMs as reliable, increasingly accessible, and broadly applicable research tools for measuring innovation.
[81] LatentPrompt: Optimizing Prompts in Latent Space
Mateusz Bystroński, Grzegorz Piotrowski, Nitesh V. Chawla, Tomasz Kajdanowicz
Main category: cs.CL
TL;DR: LatentPrompt is a model-agnostic framework for automatic prompt optimization using latent semantic space, improving task performance without manual rules.
Details
Motivation: Existing prompt optimization methods rely on heuristics or manual effort, limiting scalability and efficiency.
Method: Embeds seed prompts in a latent space, explores it to find high-performing prompts, and refines them automatically.
Result: Increased classification accuracy by ~3% on the Financial PhraseBank sentiment benchmark.
Conclusion: LatentPrompt is broadly applicable, requiring only black-box LLM access and an evaluation metric, making it versatile for various tasks.
Abstract: Recent advances have shown that optimizing prompts for Large Language Models (LLMs) can significantly improve task performance, yet many optimization techniques rely on heuristics or manual exploration. We present LatentPrompt, a model-agnostic framework for prompt optimization that leverages latent semantic space to automatically generate, evaluate, and refine candidate prompts without requiring hand-crafted rules. Beginning with a set of seed prompts, our method embeds them in a continuous latent space and systematically explores this space to identify prompts that maximize task-specific performance. In a proof-of-concept study on the Financial PhraseBank sentiment classification benchmark, LatentPrompt increased classification accuracy by approximately 3 percent after a single optimization cycle. The framework is broadly applicable, requiring only black-box access to an LLM and an automatic evaluation metric, making it suitable for diverse domains and tasks.
[82] Monsoon Uprising in Bangladesh: How Facebook Shaped Collective Identity
Md Tasin Abir, Arpita Chowdhury, Ashfia Rahman
Main category: cs.CL
TL;DR: The study explores how Facebook fostered collective identity during Bangladesh’s 2024 Monsoon Uprising, using multimodal content like memes and hashtags to unify protesters and challenge authoritarian narratives.
Details
Motivation: To understand the role of Facebook in shaping collective identity and resistance during a pro-democracy uprising in Bangladesh.
Method: Qualitative analysis of visual rhetoric, verbal discourse, and digital irony on Facebook, focusing on symbols like the color red and the term “Razakar.”
Result: Facebook’s multimodal content (images, memes, videos) unified protesters, built solidarity, and mobilized public sentiment against repression.
Conclusion: Online platforms like Facebook are powerful tools for identity construction and political mobilization in repressive contexts.
Abstract: This study investigates how Facebook shaped collective identity during the July 2024 pro-democracy uprising in Bangladesh, known as the Monsoon Uprising. During government repression, protesters turned to Facebook as a central space for resistance, where multimodal expressions, images, memes, videos, hashtags, and satirical posts played an important role in unifying participants. Using a qualitative approach, this research analyzes visual rhetoric, verbal discourse, and digital irony to reveal how shared symbols, protest art, and slogans built a sense of solidarity. Key elements included the symbolic use of red, the ironic metaphorical use of the term “Razakar”, and the widespread sharing of visuals representing courage, injustice, and resistance. The findings show that the combination of visual and verbal strategies on Facebook not only mobilized public sentiment, but also built a strong collective identity that challenged authoritarian narratives. This study tries to demonstrate how online platforms can serve as powerful tools for identity construction and political mobilization in the digital age.
[83] From Monolingual to Bilingual: Investigating Language Conditioning in Large Language Models for Psycholinguistic Tasks
Shuzhou Yuan, Zhan Qu, Mario Tawfelis, Michael Färber
Main category: cs.CL
TL;DR: LLMs adjust outputs based on language identity, with deeper layers showing stronger psycholinguistic signals, especially in Chinese.
Details
Motivation: To explore how LLMs encode psycholinguistic knowledge across languages and whether they exhibit human-like responses.
Method: Evaluated two models (Llama-3.3-70B-Instruct and Qwen2.5-72B-Instruct) using sound symbolism and word valence tasks under monolingual and bilingual prompting in English, Dutch, and Chinese.
Result: Models adjusted outputs based on language identity, with Qwen showing greater sensitivity. Chinese prompts yielded stronger valence representations.
Conclusion: Language identity influences LLM behavior and internal representations, offering insights into their use for cross-linguistic cognition.
Abstract: Large Language Models (LLMs) exhibit strong linguistic capabilities, but little is known about how they encode psycholinguistic knowledge across languages. We investigate whether and how LLMs exhibit human-like psycholinguistic responses under different linguistic identities using two tasks: sound symbolism and word valence. We evaluate two models, Llama-3.3-70B-Instruct and Qwen2.5-72B-Instruct, under monolingual and bilingual prompting in English, Dutch, and Chinese. Behaviorally, both models adjust their outputs based on prompted language identity, with Qwen showing greater sensitivity and sharper distinctions between Dutch and Chinese. Probing analysis reveals that psycholinguistic signals become more decodable in deeper layers, with Chinese prompts yielding stronger and more stable valence representations than Dutch. Our results demonstrate that language identity conditions both output behavior and internal representations in LLMs, providing new insights into their application as models of cross-linguistic cognition.
[84] Modular Arithmetic: Language Models Solve Math Digit by Digit
Tanja Baeumel, Daniil Gurgurov, Yusser al Ghussin, Josef van Genabith, Simon Ostermann
Main category: cs.CL
TL;DR: LLMs use digit-position-specific circuits for arithmetic tasks, independent of model size or tokenization. These circuits are identified and validated using Feature Importance and Causal Interventions.
Details
Motivation: To understand the internal mechanisms LLMs use for arithmetic tasks, as a unified understanding is lacking.
Method: Extend findings on digit-wise number representation; use Feature Importance and Causal Interventions to identify digit-position-specific circuits.
Result: Existence of modular subgroups of MLP neurons for digit positions, validated through interventions that alter predictions at targeted positions.
Conclusion: LLMs employ compositional, interpretable structures (digit-position circuits) for arithmetic, demonstrated causally.
Abstract: While recent work has begun to uncover the internal strategies that Large Language Models (LLMs) employ for simple arithmetic tasks, a unified understanding of their underlying mechanisms is still lacking. We extend recent findings showing that LLMs represent numbers in a digit-wise manner and present evidence for the existence of digit-position-specific circuits that LLMs use to perform simple arithmetic tasks, i.e. modular subgroups of MLP neurons that operate independently on different digit positions (units, tens, hundreds). Notably, such circuits exist independently of model size and of tokenization strategy, i.e. both for models that encode longer numbers digit-by-digit and as one token. Using Feature Importance and Causal Interventions, we identify and validate the digit-position-specific circuits, revealing a compositional and interpretable structure underlying the solving of arithmetic problems in LLMs. Our interventions selectively alter the model’s prediction at targeted digit positions, demonstrating the causal role of digit-position circuits in solving arithmetic tasks.
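A minimal sketch of the kind of causal intervention described, using a PyTorch forward hook to ablate a hypothetical subgroup of MLP neurons; the toy MLP and the neuron indices are placeholders for circuits located in a real LLM:

```python
# Hedged sketch: ablate a candidate digit-position circuit and compare
# outputs; in the paper this targets specific digit positions' predictions.
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 32))
tens_neurons = [3, 17, 42]            # hypothetical "tens digit" subgroup

def ablate(module, inputs, output):
    output[:, tens_neurons] = 0.0     # zero the targeted neurons
    return output

x = torch.randn(4, 32)
with torch.no_grad():
    baseline = mlp(x)
    handle = mlp[1].register_forward_hook(ablate)  # hook the ReLU activations
    patched = mlp(x)
    handle.remove()

# If the subgroup carries the tens digit, only that position's predictions
# should move; here we just confirm the outputs diverge.
print((baseline - patched).abs().max())
```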
[85] PoeTone: A Framework for Constrained Generation of Structured Chinese Songci with LLMs
Zhan Qu, Shuzhou Yuan, Michael Färber
Main category: cs.CL
TL;DR: The paper investigates LLMs’ ability to generate Songci, a strict Chinese poetry form, using a multi-faceted evaluation framework and fine-tuning methods to improve performance.
Details
Motivation: To explore LLMs' capabilities in generating culturally significant and formally constrained literary texts like Songci.
Method: Developed an evaluation framework with formal conformity, automated and human assessments, and classification tasks. Evaluated 18 LLMs under five prompting strategies and proposed a Generate-Critic architecture for fine-tuning.
Result: Fine-tuned models improved formal conformity by up to 5.88%.
Conclusion: The study provides insights into LLMs’ strengths and limitations in constrained literary generation.
Abstract: This paper presents a systematic investigation into the constrained generation capabilities of large language models (LLMs) in producing Songci, a classical Chinese poetry form characterized by strict structural, tonal, and rhyme constraints defined by Cipai templates. We first develop a comprehensive, multi-faceted evaluation framework that includes: (i) a formal conformity score, (ii) automated quality assessment using LLMs, (iii) human evaluation, and (iv) classification-based probing tasks. Using this framework, we evaluate the generative performance of 18 LLMs, including 3 proprietary models and 15 open-source models across four families, under five prompting strategies: zero-shot, one-shot, completion-based, instruction-tuned, and chain-of-thought. Finally, we propose a Generate-Critic architecture in which the evaluation framework functions as an automated critic. Leveraging the critic’s feedback as a reward signal, we fine-tune three lightweight open-source LLMs via supervised fine-tuning (SFT), resulting in improvements of up to 5.88% in formal conformity. Our findings offer new insights into the generative strengths and limitations of LLMs in producing culturally significant and formally constrained literary texts.
[86] I Have No Mouth, and I Must Rhyme: Uncovering Internal Phonetic Representations in LLaMA 3.2
Jack Merullo, Arjun Khurana, Oliver McLaughlin
Main category: cs.CL
TL;DR: Llama-3.2-1B-Instruct models phonemes internally for phonetic tasks like rhyming, revealing a ‘phoneme mover head’ and IPA-like vowel organization without explicit training.
Details
Motivation: To understand how large language models like Llama-3.2-1B-Instruct represent phonetic information internally without phonetic or auditory grounding.
Method: Investigating token-level phonetic representations in Llama, identifying a ‘phoneme mover head,’ and analyzing its output space.
Result: Llama uses a rich internal phoneme model, with IPA-like vowel organization, and a specific head for phonetic tasks.
Conclusion: Llama learns human-like phonetic representations unsupervised, suggesting advanced internal modeling capabilities.
Abstract: Large language models demonstrate proficiency on phonetic tasks, such as rhyming, without explicit phonetic or auditory grounding. In this work, we investigate how Llama-3.2-1B-Instruct represents token-level phonetic information. Our results suggest that Llama uses a rich internal model of phonemes to complete phonetic tasks. We provide evidence for high-level organization of phoneme representations in its latent space. In doing so, we also identify a “phoneme mover head” which promotes phonetic information during rhyming tasks. We visualize the output space of this head and find that, while notable differences exist, Llama learns a model of vowels similar to the standard IPA vowel chart for humans, despite receiving no direct supervision to do so.
[87] Observing Dialogue in Therapy: Categorizing and Forecasting Behavioral Codes
Jie Cao, Michael Tanana, Zac E. Imel, Eric Poitras, David C. Atkins, Vivek Srikumar
Main category: cs.CL
TL;DR: The paper proposes neural network models to analyze and forecast behavioral codes in Motivational Interviewing (MI) therapy dialogues, aiding real-time therapist guidance.
Details
Motivation: To improve therapy outcomes by providing real-time feedback to therapists using automated dialogue analysis in Motivational Interviewing (MI).
Method: Neural network models are developed to categorize and forecast MI behavioral codes in therapy dialogues, building on recent dialogue modeling successes.
Result: The models outperform baselines in categorizing and forecasting behavioral codes, with analysis highlighting design tradeoffs.
Conclusion: The approach effectively aids real-time therapist guidance, demonstrating the potential of neural networks in therapy dialogue analysis.
Abstract: Automatically analyzing dialogue can help understand and guide behavior in domains such as counseling, where interactions are largely mediated by conversation. In this paper, we study modeling behavioral codes used to assess a psychotherapy treatment style called Motivational Interviewing (MI), which is effective for addressing substance abuse and related problems. Specifically, we address the problem of providing real-time guidance to therapists with a dialogue observer that (1) categorizes therapist and client MI behavioral codes and (2) forecasts codes for upcoming utterances to help guide the conversation and potentially alert the therapist. For both tasks, we define neural network models that build upon recent successes in dialogue modeling. Our experiments demonstrate that our models can outperform several baselines for both tasks. We also report the results of a careful analysis that reveals the impact of the various network design tradeoffs for modeling therapy dialogue.
[88] Contextual Graph Transformer: A Small Language Model for Enhanced Engineering Document Information Extraction
Karan Reddy, Mayukha Pal
Main category: cs.CL
TL;DR: The paper introduces the Contextual Graph Transformer (CGT), a hybrid model combining GNNs and Transformers, to improve question answering in technical documents by capturing both local structure and global dependencies.
Details
Motivation: Standard transformer models struggle with fine-grained syntax and entity relationships in technical documents, necessitating a specialized solution.
Method: CGT constructs a dynamic graph over input tokens using sequential, skip-gram, and semantic similarity edges, processed by GATv2Conv layers, and then a Transformer encoder. It is trained in two phases: general pretraining and domain-specific fine-tuning.
Result: CGT outperforms GPT-2 and BERT, achieving 24.7% higher accuracy than GPT-2 with 62.4% fewer parameters.
Conclusion: CGT is a parameter-efficient, adaptable solution for technical domains, enhancing grounding, entity tracking, and retrieval-augmented responses.
Abstract: Standard transformer-based language models, while powerful for general text, often struggle with the fine-grained syntax and entity relationships in complex technical, engineering documents. To address this, we propose the Contextual Graph Transformer (CGT), a hybrid neural architecture that combines Graph Neural Networks (GNNs) and Transformers for domain-specific question answering. CGT constructs a dynamic graph over input tokens using sequential, skip-gram, and semantic similarity edges, which is processed by GATv2Conv layers for local structure learning. These enriched embeddings are then passed to a Transformer encoder to capture global dependencies. Unlike generic large models, technical domains often require specialized language models with stronger contextualization and structure awareness. CGT offers a parameter-efficient solution for such use cases. Integrated into a Retrieval-Augmented Generation (RAG) pipeline, CGT outperforms baselines like GPT-2 and BERT, achieving 24.7% higher accuracy than GPT-2 with 62.4% fewer parameters. This gain stems from CGT’s ability to jointly model structural token interactions and long-range semantic coherence. The model is trained from scratch using a two-phase approach: pretraining on general text followed by fine-tuning on domain-specific manuals. This highlights CGT’s adaptability to technical language, enabling better grounding, entity tracking, and retrieval-augmented responses in real-world applications.
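The three edge types can be sketched roughly as follows over stand-in token embeddings; the skip-gram window and the similarity threshold are illustrative settings, not CGT's published hyperparameters:

```python
# Hedged sketch: build CGT-style edges (sequential, skip-gram, semantic
# similarity) and emit an edge_index a GNN layer such as GATv2 could consume.
import torch

n_tokens, dim = 12, 32
tok = torch.randn(n_tokens, dim)             # stand-in token embeddings
edges = set()

for i in range(n_tokens - 1):                # sequential edges
    edges.add((i, i + 1))
for i in range(n_tokens):                    # skip-gram edges within a window
    for j in range(i + 2, min(i + 4, n_tokens)):
        edges.add((i, j))

sim = torch.nn.functional.cosine_similarity( # semantic-similarity edges
    tok.unsqueeze(1), tok.unsqueeze(0), dim=-1)
for i, j in (sim > 0.5).nonzero().tolist():
    if i < j:
        edges.add((i, j))

edge_index = torch.tensor(sorted(edges)).T   # shape (2, E)
print(edge_index.shape)
```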
[89] Enhancing Talk Moves Analysis in Mathematics Tutoring through Classroom Teaching Discourse
Jie Cao, Abhijit Suresh, Jennifer Jacobs, Charis Clevenger, Amanda Howard, Chelsea Brown, Brent Milne, Tom Fischaber, Tamara Sumner, James H. Martin
Main category: cs.CL
TL;DR: The paper introduces SAGA22, a compact dataset for analyzing math tutoring discourse using talk moves, and explores modeling strategies to improve performance in tutoring settings.
Details
Motivation: Human tutoring is vital for student learning, but scaling data collection and analysis for machine learning models is resource-intensive.
Method: The study uses talk moves, leverages classroom datasets, and explores modeling strategies like dialogue context, speaker info, pretraining, and fine-tuning.
Result: Pretraining on classroom data improves model performance in tutoring, especially with longer context and speaker info. Ablation studies highlight modeling challenges.
Conclusion: Supplementary pretraining on classroom data enhances tutoring models, though talk move modeling remains challenging.
Abstract: Human tutoring interventions play a crucial role in supporting student learning, improving academic performance, and promoting personal growth. This paper focuses on analyzing mathematics tutoring discourse using talk moves - a framework of dialogue acts grounded in Accountable Talk theory. However, scaling the collection, annotation, and analysis of extensive tutoring dialogues to develop machine learning models is a challenging and resource-intensive task. To address this, we present SAGA22, a compact dataset, and explore various modeling strategies, including dialogue context, speaker information, pretraining datasets, and further fine-tuning. By leveraging existing datasets and models designed for classroom teaching, our results demonstrate that supplementary pretraining on classroom data enhances model performance in tutoring settings, particularly when incorporating longer context and speaker information. Additionally, we conduct extensive ablation studies to underscore the challenges in talk move modeling.
[90] What’s in the News? Towards Identification of Bias by Commission, Omission, and Source Selection (COSS)
Anastasia Zhukova, Terry Ruas, Felix Hamborg, Karsten Donnay, Bela Gipp
Main category: cs.CL
TL;DR: A methodology for automatically identifying bias in news articles by analyzing commission, omission, and source selection (COSS) as a joint objective, with a pipeline approach and visualization example.
Details
Motivation: The challenge of determining reliable and neutral news sources in an information-overloaded world.
Method: Proposes a pipeline for identifying bias by COSS as a joint three-fold objective, leveraging text reuse patterns.
Result: A framework for bias identification with visualization of extracted features.
Conclusion: The approach addresses bias holistically, improving on previous work that treated bias types separately.
Abstract: In a world overwhelmed with news, determining which information comes from reliable sources, or how neutral the information reported in news articles is, poses a challenge to news readers. In this paper, we propose a methodology for automatically identifying bias by commission, omission, and source selection (COSS) as a joint three-fold objective, as opposed to previous work that addressed these types of bias separately. In a pipeline concept, we describe the goals and tasks of its steps toward bias identification and provide an example of a visualization that leverages the extracted features and patterns of text reuse.
[91] Towards Actionable Pedagogical Feedback: A Multi-Perspective Analysis of Mathematics Teaching and Tutoring Dialogue
Jannatun Naim, Jie Cao, Fareen Tasneem, Jennifer Jacobs, Brent Milne, James Martin, Tamara Sumner
Main category: cs.CL
TL;DR: The paper proposes a multi-perspective discourse analysis framework to address challenges in utterance-level discourse analysis in mathematics education, integrating talk moves, dialogue acts, and discourse relations for comprehensive feedback.
Details
Motivation: To overcome the limitations of single-tag utterance analysis (multifunctionality and exclusion of non-talk-move utterances) in providing effective feedback for instructional practices.
Method: A top-down framework combining domain-specific talk moves, dialogue acts (43-tag SWBD-MASL schema), and discourse relations (16 SDRT relations), applied to two datasets (TalkMoves and SAGA22) with distributional, sequential, and multi-view analyses.
Result: Identified meaningful discourse patterns and highlighted the crucial role of non-talk-move utterances in guiding, acknowledging, and structuring classroom discourse.
Conclusion: The framework enhances AI-assisted education systems by incorporating discourse relations and dialogue acts, improving feedback and supporting AI agent development for educational roles.
Abstract: Effective feedback is essential for refining instructional practices in mathematics education, and researchers often turn to advanced natural language processing (NLP) models to analyze classroom dialogues from multiple perspectives. However, utterance-level discourse analysis encounters two primary challenges: (1) multifunctionality, where a single utterance may serve multiple purposes that a single tag cannot capture, and (2) the exclusion of many utterances from domain-specific discourse move classifications, leading to their omission in feedback. To address these challenges, we proposed a multi-perspective discourse analysis that integrates domain-specific talk moves with dialogue acts (using the flattened multi-functional SWBD-MASL schema with 43 tags) and discourse relations (applying Segmented Discourse Representation Theory with 16 relations). Our top-down analysis framework enables a comprehensive understanding of utterances that contain talk moves, as well as utterances that do not contain talk moves. This is applied to two mathematics education datasets: TalkMoves (teaching) and SAGA22 (tutoring). Through distributional unigram analysis, sequential talk move analysis, and multi-view deep dive, we discovered meaningful discourse patterns and revealed the vital role of utterances without talk moves, demonstrating that these utterances, far from being mere fillers, serve crucial functions in guiding, acknowledging, and structuring classroom discourse. These insights underscore the importance of incorporating discourse relations and dialogue acts into AI-assisted education systems to enhance feedback and create more responsive learning environments. Our framework may prove helpful not only for providing feedback to human educators, but also for aiding in the development of AI agents that can effectively emulate the roles of both educators and students.
[92] Building and Aligning Comparable Corpora
Motaz Saad, David Langlois, Kamel Smaili
Main category: cs.CL
TL;DR: The paper presents a method to build and align comparable corpora from Wikipedia and EURONEWS in English, French, and Arabic, using cross-lingual similarity measures. CL-LSI outperforms dictionary-based methods and is validated on BBC and ALJAZEERA news documents.
Details
Motivation: Comparable corpora are valuable for multilingual NLP when parallel texts are unavailable, as they reveal topic-related content across languages.
Method: Build comparable corpora from Wikipedia and EURONEWS; align documents using bilingual dictionary and CL-LSI measures. Test on BBC and ALJAZEERA news.
Result: CL-LSI outperforms dictionary-based alignment, successfully aligning documents at topic and event levels.
Conclusion: CL-LSI is effective for cross-lingual document alignment, offering insights into topic and event-level similarities.
Abstract: A comparable corpus is a set of topic-aligned documents in multiple languages, which are not necessarily translations of each other. These documents are useful for multilingual natural language processing when no parallel text is available in some domains or languages. In addition, comparable documents are informative because they can tell what is being said about a topic in different languages. In this paper, we present a method to build comparable corpora from the Wikipedia encyclopedia and the EURONEWS website in English, French, and Arabic. We further experiment with a method to automatically align comparable documents using cross-lingual similarity measures. We investigate two such measures: the first is based on a bilingual dictionary, and the second on Latent Semantic Indexing (LSI). Experiments on several corpora show that the Cross-Lingual LSI (CL-LSI) measure outperforms the dictionary-based measure. Finally, we collect English and Arabic news documents from the British Broadcasting Corporation (BBC) and the ALJAZEERA (JSC) news website, respectively. We then use the CL-LSI similarity measure to automatically align comparable BBC and JSC documents. The evaluation of the alignment shows that CL-LSI is able to align cross-lingual documents not only at the topic level but also at the event level.
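A minimal sketch of the CL-LSI idea: fit the latent space on concatenated bilingual documents, then match new documents across languages by cosine similarity; the toy corpus and the two-dimensional space are placeholders:

```python
# Hedged sketch: cross-lingual LSI alignment on a toy English-Arabic corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

train_pairs = [                         # concatenated comparable doc pairs
    "economy growth market اقتصاد نمو سوق",
    "football match goal كرة مباراة هدف",
    "election vote parliament انتخابات تصويت برلمان",
]
vec = TfidfVectorizer().fit(train_pairs)
lsi = TruncatedSVD(n_components=2, random_state=0).fit(vec.transform(train_pairs))

english = ["parliament election vote", "football goal match"]
arabic = ["برلمان انتخابات تصويت", "هدف مباراة كرة"]
e = lsi.transform(vec.transform(english))
a = lsi.transform(vec.transform(arabic))
print(cosine_similarity(e, a).argmax(axis=1))  # best Arabic match per English doc
```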
[93] Automated SNOMED CT Concept Annotation in Clinical Text Using Bi-GRU Neural Networks
Ali Noori, Pratik Devkota, Somya Mohanty, Prashanti Manda
Main category: cs.CL
TL;DR: A neural sequence labeling approach using a Bi-GRU model achieves 90% F1-score for SNOMED CT concept recognition in clinical text, outperforming rule-based systems and matching neural models with lower computational cost.
Details
Motivation: Manual annotation of clinical text with SNOMED CT concepts is labor-intensive; automation is needed for scalability and structured data extraction.
Method: A Bi-GRU model processes preprocessed text (using SpaCy and SciBERT) with contextual, syntactic, and morphological features, assigning IOB tags to identify concept spans.
Result: Achieves 90% F1-score on validation, surpassing rule-based systems and matching/exceeding neural models, with effective handling of ambiguity and misspellings.
Conclusion: Lightweight RNN-based architectures like Bi-GRU offer high-quality clinical concept annotation at lower computational cost, suitable for real-world deployment.
Abstract: Automated annotation of clinical text with standardized medical concepts is critical for enabling structured data extraction and decision support. SNOMED CT provides a rich ontology for labeling clinical entities, but manual annotation is labor-intensive and impractical at scale. This study introduces a neural sequence labeling approach for SNOMED CT concept recognition using a Bidirectional GRU model. Leveraging a subset of MIMIC-IV, we preprocess text with domain-adapted SpaCy and SciBERT-based tokenization, segmenting sentences into overlapping 19-token chunks enriched with contextual, syntactic, and morphological features. The Bi-GRU model assigns IOB tags to identify concept spans and achieves strong performance with a 90 percent F1-score on the validation set. These results surpass traditional rule-based systems and match or exceed existing neural models. Qualitative analysis shows effective handling of ambiguous terms and misspellings. Our findings highlight that lightweight RNN-based architectures can deliver high-quality clinical concept annotation with significantly lower computational cost than transformer-based models, making them well-suited for real-world deployment.
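A minimal sketch of a Bi-GRU IOB tagger of the kind described, assuming integer-encoded tokens and the paper's 19-token chunks; the vocabulary size, dimensions, and the omission of the extra syntactic features are simplifications:

```python
# Hedged sketch: bidirectional GRU emitting one IOB tag per token.
import torch
import torch.nn as nn

class BiGRUTagger(nn.Module):
    def __init__(self, vocab=5000, emb=128, hidden=64, tags=3):  # B, I, O
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.gru = nn.GRU(emb, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, tags)

    def forward(self, tokens):              # tokens: (batch, seq_len)
        h, _ = self.gru(self.emb(tokens))   # (batch, seq_len, 2 * hidden)
        return self.out(h)                  # per-token tag logits

model = BiGRUTagger()
chunk = torch.randint(0, 5000, (4, 19))     # 19-token chunks as in the paper
tags = model(chunk).argmax(-1)              # predicted IOB tag per token
print(tags.shape)                           # torch.Size([4, 19])
```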
[94] EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow
Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou, Huazhu Fu, Yih-Chung Tham, Yong Liu
Main category: cs.CL
TL;DR: EH-Benchmark is introduced to evaluate and mitigate hallucinations in Medical Large Language Models (MLLMs) for ophthalmic diagnosis, improving accuracy and reliability.
Details
Motivation: MLLMs face accuracy issues due to hallucinations from limited ophthalmic knowledge, poor visual reasoning, and lack of multimodal data. Existing benchmarks fail to address these challenges.
Method: A three-phase framework (Knowledge-Level Retrieval, Task-Level Case Studies, Result-Level Validation) is proposed to categorize and mitigate hallucinations in MLLMs.
Result: The multi-agent framework significantly reduces hallucinations, enhancing model accuracy, interpretability, and reliability.
Conclusion: EH-Benchmark effectively addresses hallucinations in MLLMs, offering a robust solution for ophthalmic diagnosis.
Abstract: Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs’ hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple subclasses. Given that MLLMs predominantly rely on language-based reasoning rather than visual processing, we propose an agent-centric, three-phase framework, including the Knowledge-Level Retrieval stage, the Task-Level Case Studies stage, and the Result-Level Validation stage. Experimental results show that our multi-agent framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability. Our project is available at https://github.com/ppxy1/EH-Benchmark.
[95] Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction
Yuerong Song, Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
Main category: cs.CL
TL;DR: Sparse-dLLM reduces computational and memory overhead in dLLMs by selectively caching pivotal tokens and evicting low-relevance ones, achieving 10x higher throughput with comparable performance.
Details
Motivation: Current caching techniques in dLLMs impose high memory usage, limiting long-context applications. Attention patterns reveal persistent sparsity, motivating selective cache eviction.
Method: Proposes Sparse-dLLM, a training-free framework combining dynamic cache eviction with sparse attention via delayed bidirectional sparse caching.
Result: Achieves up to 10x higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs.
Conclusion: Sparse-dLLM outperforms previous methods in efficiency and effectiveness for dLLM inference.
Abstract: Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage that limits long-context applications. Our analysis of attention patterns in dLLMs reveals persistent cross-layer sparsity, with pivotal tokens remaining salient across decoding steps and low-relevance tokens staying unimportant, motivating selective cache eviction. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching. By leveraging the stability of token saliency over steps, it retains critical tokens and dynamically evicts unimportant prefix/suffix entries using an attention-guided strategy. Extensive experiments on LLaDA and Dream series demonstrate Sparse-dLLM achieves up to 10$\times$ higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs, outperforming previous methods in efficiency and effectiveness.
[96] Guess or Recall? Training CNNs to Classify and Localize Memorization in LLMs
Jérémie Dentan, Davide Buscaldi, Sonia Vanier
Main category: cs.CL
TL;DR: The paper analyzes verbatim memorization in LLMs, critiques the existing taxonomy, and proposes a new one aligned with attention weights, identifying three memorization categories. It also introduces a visual technique to localize memorization mechanisms.
Details
Motivation: To better understand the distinct mechanisms behind verbatim memorization in LLMs and improve the existing taxonomy.
Method: Train CNNs on LLM attention weights to evaluate alignment with the existing taxonomy and propose a new one.
Result: Existing taxonomy poorly reflects memorization mechanisms; new taxonomy identifies three categories. Few-shot memorization lacks a distinct attention mechanism. Many extractable samples are guessed.
Conclusion: A refined taxonomy and visual technique improve understanding of memorization in LLMs, highlighting the need to separate guessed samples from recalled ones.
Abstract: Verbatim memorization in Large Language Models (LLMs) is a multifaceted phenomenon involving distinct underlying mechanisms. We introduce a novel method to analyze the different forms of memorization described by the existing taxonomy. Specifically, we train Convolutional Neural Networks (CNNs) on the attention weights of the LLM and evaluate the alignment between this taxonomy and the attention weights involved in decoding. We find that the existing taxonomy performs poorly and fails to reflect distinct mechanisms within the attention blocks. We propose a new taxonomy that maximizes alignment with the attention weights, consisting of three categories: memorized samples that are guessed using language modeling abilities, memorized samples that are recalled due to high duplication in the training set, and non-memorized samples. Our results reveal that few-shot verbatim memorization does not correspond to a distinct attention mechanism. We also show that a significant proportion of extractable samples are in fact guessed by the model and should therefore be studied separately. Finally, we develop a custom visual interpretability technique to localize the regions of the attention weights involved in each form of memorization.
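As a rough illustration of the setup, the sketch below classifies attention maps into the three proposed categories with a small CNN, treating each head as one input channel; the architecture and dimensions are assumptions, not the paper's exact configuration:

```python
# Hedged sketch: CNN over attention maps (heads as channels) predicting
# guessed / recalled / non-memorized.
import torch
import torch.nn as nn

heads, seq, classes = 12, 64, 3
cnn = nn.Sequential(
    nn.Conv2d(heads, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(8), nn.Flatten(),
    nn.Linear(16 * 8 * 8, classes),
)

attn_maps = torch.rand(4, heads, seq, seq)   # a batch of LLM attention maps
print(cnn(attn_maps).shape)                  # (4, 3) logits over the taxonomy
```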
[97] EHSAN: Leveraging ChatGPT in a Hybrid Framework for Arabic Aspect-Based Sentiment Analysis in Healthcare
Eman Alamoudi, Ellis Solaiman
Main category: cs.CL
TL;DR: EHSAN introduces a hybrid pipeline combining ChatGPT pseudo-labelling and human review to create the first explainable Arabic aspect-based sentiment dataset for healthcare, showing high model accuracy even with minimal human supervision.
Details
Motivation: Arabic-language patient feedback is under-analysed due to dialect diversity and lack of aspect-level sentiment labels, necessitating a scalable solution.
Method: A hybrid pipeline merges ChatGPT pseudo-labelling with human review to create three dataset versions (fully supervised, semi-supervised, unsupervised). Two transformer models are fine-tuned for aspect and sentiment classification.
Result: High accuracy was achieved even with minimal human supervision, with minor performance drop for ChatGPT-only labels. Reducing aspect classes improved classification metrics.
Conclusion: The approach effectively combines large language model annotation with human expertise for scalable Arabic aspect-based sentiment analysis in healthcare, with future work focusing on generalisation and interpretability.
Abstract: Arabic-language patient feedback remains under-analysed because dialect diversity and scarce aspect-level sentiment labels hinder automated assessment. To address this gap, we introduce EHSAN, a data-centric hybrid pipeline that merges ChatGPT pseudo-labelling with targeted human review to build the first explainable Arabic aspect-based sentiment dataset for healthcare. Each sentence is annotated with an aspect and sentiment label (positive, negative, or neutral), forming a pioneering Arabic dataset aligned with healthcare themes, with ChatGPT-generated rationales provided for each label to enhance transparency. To evaluate the impact of annotation quality on model performance, we created three versions of the training data: a fully supervised set with all labels reviewed by humans, a semi-supervised set with 50% human review, and an unsupervised set with only machine-generated labels. We fine-tuned two transformer models on these datasets for both aspect and sentiment classification. Experimental results show that our Arabic-specific model achieved high accuracy even with minimal human supervision, reflecting only a minor performance drop when using ChatGPT-only labels. Reducing the number of aspect classes notably improved classification metrics across the board. These findings demonstrate an effective, scalable approach to Arabic aspect-based sentiment analysis (SA) in healthcare, combining large language model annotation with human expertise to produce a robust and explainable dataset. Future directions include generalisation across hospitals, prompt refinement, and interpretable data-driven modelling.
[98] MArgE: Meshing Argumentative Evidence from Multiple Large Language Models for Justifiable Claim Verification
Ming Pok Ng, Junqi Jiang, Gabriel Freedman, Antonio Rago, Francesca Toni
Main category: cs.CL
TL;DR: MArgE introduces a structured framework for combining outputs from multiple LLMs using argument trees, improving claim verification accuracy and justification.
Details
Motivation: Current methods for combining LLM outputs lack structure, leading to unjustifiable results. MArgE aims to provide formal, inspectable reasoning.
Method: Uses Argumentative LLMs (ArgLLMs) to construct structured argument trees for claim verification, creating a traceable decision pathway.
Result: MArgE outperforms single LLMs, open-source models, GPT-4o-mini, and unstructured multi-LLM debate methods.
Conclusion: Formal argumentative reasoning enhances the reliability and justification of multi-LLM outputs.
Abstract: Leveraging outputs from multiple large language models (LLMs) is emerging as a method for harnessing their power across a wide range of tasks while mitigating their capacity for making errors, e.g., hallucinations. However, current approaches to combining insights from multiple LLMs often involve unstructured interactions (e.g., free debate), resulting in model generations that are not faithfully justifiable. In this work, we introduce MArgE, a novel framework to provide formal structure to the evidence from each LLM, in the form of a tree of extracted arguments, for the task of claim verification. We use a variant of Argumentative LLMs (ArgLLMs), i.e. LLMs driven by frameworks and semantics from the field of computational argumentation, to construct structured argument trees for given claims. This process creates an inspectable pathway from the initial arguments to the final claim verification decisions, providing a faithful justification thereof. We show experimentally that MArgE can significantly outperform single LLMs, including three open-source models (4B to 8B parameters), GPT-4o-mini and existing ArgLLMs, as well as prior methods for unstructured multi-LLM debates. We thus demonstrate the advantages of incorporating formal, argumentative reasoning mechanisms when combining multiple LLM outputs.
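To illustrate how a structured argument tree yields a traceable verdict, here is a minimal gradual-semantics sketch in which supporters pull a claim's score toward 1 and attackers toward 0; this aggregation rule is a simple illustrative choice in the spirit of gradual argumentation semantics, not MArgE's exact procedure:

```python
# Hedged sketch: score a claim by recursively aggregating its argument tree.
from dataclasses import dataclass, field

@dataclass
class Arg:
    base: float                          # base score, e.g. an LLM's confidence
    pros: list = field(default_factory=list)
    cons: list = field(default_factory=list)

def strength(a: Arg) -> float:
    s = a.base
    for p in a.pros:
        s = s + (1 - s) * strength(p)    # supporters pull toward 1
    for c in a.cons:
        s = s * (1 - strength(c))        # attackers pull toward 0
    return s

claim = Arg(0.5,
            pros=[Arg(0.8), Arg(0.6, cons=[Arg(0.7)])],
            cons=[Arg(0.4)])
print(round(strength(claim), 3))         # inspectable pathway to the verdict
```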
[99] CharBench: Evaluating the Role of Tokenization in Character-Level Tasks
Omri Uzan, Yuval Pinter
Main category: cs.CL
TL;DR: CharBench is a large-scale benchmark for character-level tasks, revealing modern LLMs struggle with accuracy as low as 32.3%. Tokenization’s impact varies by task type.
Details
Motivation: Address the unclear role of tokenization in LLMs' struggles with character-level tasks like counting or locating characters.
Method: Introduce CharBench, a comprehensive benchmark, and evaluate diverse LLMs on it, analyzing word properties and tokenization effects.
Result: Average accuracy of 43.6%, dropping to 32.3% on some tasks; tokenization weakly affects counting but harms positional tasks.
Conclusion: CharBench highlights LLMs’ limitations in character-level reasoning, urging future work to improve using this benchmark.
Abstract: Tasks that require character-level reasoning, such as counting or locating characters within words, remain challenging for contemporary language models. A common conjecture is that language models’ reliance on subword units, rather than characters, contributes to their struggles with character-level tasks, yet recent studies offer conflicting conclusions about the role of tokenization, leaving its impact unclear. To address this gap, we introduce CharBench, a comprehensive benchmark of character-level tasks that is two orders of magnitude larger than existing alternatives. We evaluate a diverse range of leading open-weight and proprietary models on CharBench and find that it presents a significant challenge to modern LLMs, with an average accuracy of 43.6% overall and as low as 32.3% on some tasks. We present an in-depth analysis of how intrinsic properties of words and their segmentations into tokens correspond to model performance. For counting tasks, we find that tokenization properties are weakly correlated with correctness, while the length of the queried word and the actual character count play a more significant part. In contrast, for tasks requiring intra-word positional understanding, performance is negatively correlated with the length of the token containing the queried character, suggesting that longer tokens obscure character position information for LLMs. We encourage future work to build on the benchmark and evaluation methodology introduced here as tools for improving model performance on such tasks.
[100] Mitigating Attention Hacking in Preference-Based Reward Modeling via Interaction Distillation
Jianxiang Zang, Meiling Ning, Shihan Dou, Jiazheng Zhang, Tao Gui, Qi Zhang, Xuanjing Huang
Main category: cs.CL
TL;DR: The paper addresses ‘attention hacking’ in reward models (RMs) for RLHF by proposing ‘Interaction Distillation’ to improve token-level interaction and attention alignment.
Details
Motivation: Current RMs lack adequate token-level interaction due to unidirectional attention and Siamese encoding, making them vulnerable to ‘attention hacking’.
Method: Proposes ‘Interaction Distillation’, using a teacher model to guide attention alignment and improve preference modeling.
Result: The method provides more stable and generalizable reward signals, outperforming existing RM optimization techniques.
Conclusion: Attention hacking is a fundamental limitation in RMs, and Interaction Distillation effectively addresses it.
Abstract: The reward model (RM), the core component of reinforcement learning from human feedback (RLHF) for large language models (LLMs), is responsible for providing reward signals to generated responses. However, mainstream preference modeling in RMs is inadequate in terms of token-level interaction, making their judgment signals vulnerable to being hacked by misallocated attention to context. This stems from two fundamental limitations: (1) Current preference modeling employs decoder-only architectures, where the unidirectional causal attention mechanism leads to forward-decaying intra-sequence attention within the prompt-response sequence. (2) The independent Siamese-encoding paradigm induces the absence of token-level inter-sequence attention between chosen and rejected sequences. To address this “attention hacking”, we propose “Interaction Distillation”, a novel training framework for more adequate preference modeling through attention-level optimization. The method introduces an interaction-based natural language understanding model as the teacher to provide sophisticated token interaction patterns via comprehensive attention, and guides the preference model to simulate the teacher’s interaction pattern through an attentional alignment objective. Through extensive experiments, Interaction Distillation has demonstrated its ability to provide more stable and generalizable reward signals than state-of-the-art RM optimization methods that target data noise, highlighting that attention hacking constitutes a more fundamental limitation in RMs.
[101] Pointer: Linear-Complexity Long-Range Modeling without Pre-training
Zixi Li
Main category: cs.CL
TL;DR: Pointer is a new architecture with linear complexity for long-range sequence modeling, outperforming standard attention in speed and accuracy without pre-training.
Details
Motivation: Address the inefficiency of standard attention mechanisms (O(N^2)) for long sequences and eliminate the need for pre-training.
Method: Uses layer-wise pointer chaining, where each layer’s pointer selection depends on previous layers, creating long-distance connections.
Result: Achieves 2–10× speedup on long sequences, >95% accuracy on copy tasks up to 2048 tokens, and learns interpretable pointer patterns.
Conclusion: Pointer is an efficient and interpretable alternative to attention for long-range modeling without pre-training.
Abstract: We introduce Pointer, a novel architecture that achieves linear $O(NK)$ complexity for long-range sequence modeling while maintaining superior performance without requiring pre-training. Unlike standard attention mechanisms that compute $O(N^2)$ pairwise interactions, our approach uses layer-wise pointer chaining where each layer’s pointer selection depends on the previous layer’s pointer positions, creating explicit long-distance connections through pointer chains. We demonstrate that this architecture achieves $2$–$10\times$ speedup on long sequences compared to standard transformers, maintains $>95\%$ accuracy on copy tasks at distances up to 2048 tokens, and learns interpretable pointer patterns that reveal structured dependency modeling. Our experiments on efficiency benchmarks, long-range dependency tasks, and interpretability analysis show that Pointer offers a compelling alternative to attention mechanisms for scenarios requiring efficient long-range modeling without pre-training dependencies.
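The abstract does not spell out the pointer parameterization, so the following NumPy sketch only illustrates the chaining idea under stated assumptions: each layer keeps K pointer targets per position, the next layer scores only candidates reachable through the previous layer’s pointers (keeping cost near O(NK^2) per layer rather than O(N^2)), and the similarity scoring and mixing rule are invented for illustration.

```python
import numpy as np

# Illustrative sketch of layer-wise pointer chaining (not the paper's exact
# parameterization): each layer picks K pointer targets per position, and the
# next layer scores only candidates reachable via the previous layer's pointers.

rng = np.random.default_rng(0)
N, D, K, LAYERS = 16, 8, 2, 3
x = rng.normal(size=(N, D))

pointers = np.tile(np.arange(N), (K, 1)).T  # layer 0: each position points at itself

for _ in range(LAYERS):
    # Candidate set for position i: the pointers of its current pointer targets.
    candidates = pointers[pointers.reshape(-1)].reshape(N, K * K)
    # Score candidates by dot-product similarity; keep the top K (O(N*K^2), not O(N^2)).
    scores = np.einsum("nd,nkd->nk", x, x[candidates])
    top = np.argsort(scores, axis=1)[:, -K:]
    pointers = np.take_along_axis(candidates, top, axis=1)
    # Aggregate pointed-to states into each position's representation.
    x = 0.5 * x + 0.5 * x[pointers].mean(axis=1)

print(pointers[:4])  # each row: the K chained pointer targets of that position
```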
[102] Test Set Quality in Multilingual LLM Evaluation
Kranti Chalamalasetti, Gabriel Bernier-Colborne, Yvan Gauthier, Sowmya Vajjala
Main category: cs.CL
TL;DR: The paper highlights quality issues in multilingual benchmark datasets, identifies errors in French and Telugu datasets, and shows significant performance differences in LLMs when using revised datasets. It advocates for revisiting and versioning test sets.
Details
Motivation: To address the lack of attention to dataset quality in multilingual benchmarks, despite known errors in human-annotated test sets.
Method: Manual analysis of multilingual evaluation sets in French and Telugu, identifying errors and comparing LLM performance on original vs. revised datasets.
Result: Large performance differences (up to 10%) in LLMs when using revised datasets, indicating dataset quality impacts results.
Conclusion: Test sets should be revisited, corrected, and versioned; recommendations are provided for dataset creators and consumers to improve quality.
Abstract: Several multilingual benchmark datasets have been developed in a semi-automatic manner in the recent past to measure progress and understand the state-of-the-art in the multilingual capabilities of Large Language Models. However, little attention has been paid to the quality of the datasets themselves, despite the existence of previous work identifying errors in even fully human-annotated test sets. In this paper, we manually analyze recent multilingual evaluation sets in two languages - French and Telugu - identifying several errors in the process. We compare the performance of several LLMs on the original and revised versions of the datasets and identify large differences (almost 10% in some cases) in both languages. Based on these results, we argue that test sets should not be considered immutable and should be revisited, checked for correctness, and potentially versioned. We end with some recommendations for both dataset creators and consumers on addressing dataset quality issues.
[103] You Can Generate It Again: Data-to-Text Generation with Verification and Correction Prompting
Xuan Ren, Zeyu Zhang, Lingqiao Liu
Main category: cs.CL
TL;DR: The paper introduces Verification and Correction Prompting (VCP) to improve semantic fidelity in small language models for data-to-text tasks, reducing keyword omission errors while maintaining text quality.
Details
Motivation: Small language models like T5 are cost-efficient but often miss keywords, a critical error in data-to-text tasks. The paper aims to address this using feedback systems.
Method: The VCP approach involves a multi-step process: generation, verification (checking for keywords), and regeneration. A training procedure helps the model handle feedback despite potential inaccuracies.
Result: VCP reduces the Semantic Error Rate (SER) without compromising text quality.
Conclusion: The VCP approach effectively enhances semantic fidelity in small language models for data-to-text generation.
Abstract: Small language models like T5 excel in generating high-quality text for data-to-text tasks, offering adaptability and cost-efficiency compared to Large Language Models (LLMs). However, they frequently miss keywords, which is considered one of the most severe and common errors in this task. In this work, we explore the potential of using feedback systems to enhance semantic fidelity in smaller language models for data-to-text generation tasks, through our Verification and Correction Prompting (VCP) approach. At inference, our approach involves a multi-step process comprising generation, verification, and regeneration stages. During the verification stage, we implement a simple rule to check for the presence of every keyword in the prediction. Recognizing that this rule can be inaccurate, we have developed a carefully designed training procedure that enables the model to incorporate feedback from the error-correcting prompt effectively, despite its potential inaccuracies. The VCP approach effectively reduces the Semantic Error Rate (SER) while maintaining the text’s quality.
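Because the verification stage is a plain keyword-presence rule, the inference loop is easy to sketch. In the snippet below, `generate` is a hypothetical stand-in for T5 inference, and the correction-prompt wording is illustrative rather than the paper’s.

```python
# Sketch of VCP's inference loop: generate, verify keyword coverage, regenerate
# with a correction prompt. `generate` is a hypothetical stand-in for T5 inference.

def generate(prompt: str) -> str:
    return "Aromi is a coffee shop in the city centre."  # placeholder output

def missing_keywords(keywords: list[str], text: str) -> list[str]:
    return [k for k in keywords if k.lower() not in text.lower()]

def vcp_generate(data_prompt: str, keywords: list[str], max_rounds: int = 2) -> str:
    text = generate(data_prompt)
    for _ in range(max_rounds):
        missing = missing_keywords(keywords, text)
        if not missing:
            break
        # Error-correcting prompt: point out the missed keywords and regenerate.
        correction = f"{data_prompt}\nThe previous output missed: {', '.join(missing)}. Generate again."
        text = generate(correction)
    return text

print(vcp_generate("Describe: name[Aromi], eatType[coffee shop], area[city centre]",
                   ["Aromi", "coffee shop", "city centre"]))
```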
[104] Thinker-DDM: Modeling Deliberation for Machine Translation with a Drift-Diffusion Process
Hongbin Na, Zimu Wang, Mieradilijiang Maimaiti, Tong Chen, Wei Wang, Tao Shen, Ling Chen
Main category: cs.CL
TL;DR: Thinker-DDM improves LLM-based machine translation by emulating human translators’ dynamic decision-making, outperforming baselines in high- and low-resource settings.
Details
Motivation: Prior LLM-based translation lacks human-like decision-making. Thinker-DDM aims to address this gap.
Method: Incorporates Thinker with Drift-Diffusion Model (DDM) to emulate human translators’ dynamic decisions under resource constraints.
Result: Outperforms baselines in high- and low-resource settings (WMT22, CommonMT). Effective in commonsense translation.
Conclusion: Thinker-DDM enhances translation performance by mimicking human decision-making, proving effective across diverse scenarios.
Abstract: Large language models (LLMs) have demonstrated promising potential in various downstream tasks, including machine translation. However, prior work on LLM-based machine translation has mainly focused on better utilizing training data, demonstrations, or pre-defined and universal knowledge to improve performance, with a lack of consideration of decision-making like human translators. In this paper, we incorporate Thinker with the Drift-Diffusion Model (Thinker-DDM) to address this issue. We then redefine the Drift-Diffusion process to emulate human translators’ dynamic decision-making under constrained resources. We conduct extensive experiments under the high-resource, low-resource, and commonsense translation settings using the WMT22 and CommonMT datasets, in which Thinker-DDM outperforms baselines in the first two scenarios. We also perform additional analysis and evaluation on commonsense translation to illustrate the high effectiveness and efficacy of the proposed method.
[105] THREAD: Thinking Deeper with Recursive Spawning
Philip Schroeder, Nathaniel Morgan, Hongyin Luo, James Glass
Main category: cs.CL
TL;DR: THREAD improves LLM performance by dynamically spawning threads to decompose tasks, achieving state-of-the-art results.
Details
Motivation: LLMs struggle with long, complex contexts; THREAD aims to enhance adaptability and efficiency.
Method: THREAD uses dynamic threading to recursively break tasks into simpler sub-problems, implemented via few-shot learning.
Result: Achieves state-of-the-art results on benchmarks (ALFWorld, TextCraft, etc.) with GPT-4/3.5, and outperforms existing frameworks with smaller models (Llama-3-8b, CodeLlama-7b).
Conclusion: THREAD effectively addresses LLM limitations, offering scalable and efficient task decomposition.
Abstract: Large language models (LLMs) have shown impressive capabilities across diverse settings, but still struggle as the length and complexity of the context increases. To address this challenge, we propose Thinking Recursively and Dynamically (ThReaD). THREAD frames model generation as a thread of execution that, based on the context, can run to completion or dynamically spawn new threads. By spawning, threads can offload work (e.g., thinking, retrieving information) to child threads, which only return tokens needed for the parent thread to do its work. In effect, this enables the model to adapt, as needed, the amount of intermediate work used to produce tokens. We apply THREAD in the settings of LLM task solving and question answering, where the dynamic threading allows the model to recursively decompose the given task or question into progressively simpler sub-problems that can be solved by separate child threads. We test THREAD, implemented using a few-shot learning approach, on diverse benchmarks for agent tasks and data-grounded question answering. THREAD achieves state-of-the-art performance with GPT-4 and GPT-3.5 on these benchmarks, including ALFWorld, TextCraft, and WebShop, along with two new benchmarks, DataCommons QA and MIMIC-III ICU QA. In addition, THREAD outperforms existing frameworks by 10% to 50% absolute points with smaller models, including Llama-3-8b and CodeLlama-7b.
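The spawning mechanism can be pictured as a recursive interpreter over model outputs: when the model emits a spawn directive, each sub-task runs in a child thread whose result tokens are returned to the parent. A toy sketch, with `llm` as a hypothetical stub and an invented `SPAWN:`/`ANSWER:` protocol:

```python
# Toy sketch of THREAD-style recursive spawning. `llm` is a hypothetical stub;
# the SPAWN:/ANSWER: protocol is invented for illustration, not the paper's format.

def llm(prompt: str) -> str:
    if "Sub-results" in prompt:
        return "ANSWER: 5 and 9"                       # parent composes child results
    if "and" in prompt:
        return "SPAWN: compute 2+3 | compute 4+5"      # decompose into sub-tasks
    return "ANSWER: " + str(eval(prompt.split()[-1]))  # leaf thread: solve directly

def run_thread(task: str, depth: int = 0, max_depth: int = 3) -> str:
    out = llm(task)
    if out.startswith("SPAWN:") and depth < max_depth:
        subtasks = [t.strip() for t in out[len("SPAWN:"):].split("|")]
        results = [run_thread(t, depth + 1, max_depth) for t in subtasks]
        # Child threads return only the tokens the parent needs to finish.
        return llm(task + "\nSub-results: " + "; ".join(results))
    return out

print(run_thread("compute 2+3 and 4+5"))  # ANSWER: 5 and 9
```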
[106] Dynamic Order Template Prediction for Generative Aspect-Based Sentiment Analysis
Yonghyun Jun, Hwanhee Lee
Main category: cs.CL
TL;DR: The paper introduces a Dynamic Order Template (DOT) method for Aspect-based Sentiment Analysis (ABSA) to address inefficiencies and errors in multi-view prompting, improving performance and reducing inference time.
Details
Motivation: Previous ABSA models use static templates, failing to capture dependencies between elements, while multi-view prompting suffers from inefficiencies and out-of-distribution errors.
Method: Proposes DOT, dynamically generating necessary views for each instance based on instance-level entropy to ensure diverse and relevant view generation.
Result: Improves F1-scores on ASQP and ACOS datasets and significantly reduces inference time.
Conclusion: DOT enhances ABSA by dynamically adapting templates, outperforming previous methods in accuracy and efficiency.
Abstract: Aspect-based sentiment analysis (ABSA) assesses sentiments towards specific aspects within texts, resulting in detailed sentiment tuples. Previous ABSA models often use static templates to predict all of the elements in the tuples, and these models often fail to accurately capture dependencies between elements. The multi-view prompting method improves the performance of ABSA by predicting tuples with various templates and then ensembling the results. However, this method suffers from inefficiencies and out-of-distribution errors. In this paper, we propose a Dynamic Order Template (DOT) method for ABSA, which dynamically generates necessary views for each instance based on instance-level entropy. By ensuring diverse and relevant view generation, our proposed method improves F1-scores on the ASQP and ACOS datasets while significantly reducing inference time.
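The abstract ties the number of generated views to instance-level entropy without giving the exact mapping; the sketch below shows one plausible reading, scaling the view count by the entropy of a model’s output distribution normalized by its uniform-distribution maximum.

```python
import numpy as np

# Hedged sketch of instance-level entropy driving view count (the paper's exact
# mapping is not specified in the abstract). High-entropy (uncertain) instances
# get more order templates; confident ones get fewer.

def entropy(probs: np.ndarray) -> float:
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

def num_views(token_probs: np.ndarray, min_views: int = 1, max_views: int = 5) -> int:
    h = entropy(token_probs)
    h_max = np.log(len(token_probs))  # entropy of the uniform distribution
    return min_views + round((max_views - min_views) * h / h_max)

confident = np.array([0.9, 0.05, 0.03, 0.02])
uncertain = np.array([0.3, 0.3, 0.2, 0.2])
print(num_views(confident), num_views(uncertain))  # fewer vs. more views
```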
[107] Can Tool-augmented Large Language Models be Aware of Incomplete Conditions?
Seungbin Yang, ChaeHun Park, Taehee Kim, Jaegul Choo
Main category: cs.CL
TL;DR: LLMs struggle with incomplete scenarios in tool-augmented interactions. A new benchmark and prompting strategy improve their reliability.
Details
Motivation: Addressing the understudied issue of LLMs handling incomplete or ambiguous information in tool-augmented environments.
Method: Constructed a benchmark dataset with altered instances, analyzed LLM behavior, and proposed a prompting-based reasoning strategy.
Result: State-of-the-art LLMs often fail to recognize incomplete conditions, but the proposed strategy significantly improves performance.
Conclusion: The study advances LLM reliability in real-world applications by enhancing their ability to handle incomplete information.
Abstract: Recent advancements in integrating large language models (LLMs) with tools have allowed the models to interact with real-world environments. However, these tool-augmented LLMs often encounter incomplete scenarios when users provide partial information or the necessary tools are unavailable. Recognizing and managing such scenarios is crucial for LLMs to ensure their reliability, but this problem remains understudied. This study examines whether LLMs can identify incomplete conditions and appropriately determine when to refrain from using tools. To quantitatively evaluate this capability, we construct a new benchmark dataset where instances are systematically altered to simulate the ambiguous and incomplete conditions common in real-world interactions. Our experiments reveal that even state-of-the-art LLMs often struggle to identify these conditions, attempting to use tools without sufficient information or when the correct tool is unavailable. To better understand these limitations, we conduct a detailed behavioral analysis across various conditions, including implicit evaluation and scenarios where models receive feedback from previous tool invocations. Based on this analysis, we propose a novel prompting-based reasoning strategy that explicitly instructs models to assess the sufficiency of information and the availability of tools. Our proposed approach significantly enhances the models’ ability to recognize incomplete conditions, resulting in more informed and contextually appropriate tool-use decisions. We believe our research contributes to advancing the reliability of LLMs, especially in real-world applications where incomplete or ambiguous information is common. Our dataset is available at https://huggingface.co/datasets/ddehun/ICT.
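The proposed strategy amounts to prompting the model to audit argument sufficiency and tool availability before acting. A sketch of such a prompt follows; the wording is illustrative, not the paper’s.

```python
# Sketch of a prompting-based reasoning strategy of the kind described above:
# the model is explicitly asked to check information sufficiency and tool
# availability before acting. Wording is illustrative, not the paper's prompt.

SUFFICIENCY_PROMPT = """\
Available tools:
{tools}

User request:
{request}

Before calling any tool, answer these questions step by step:
1. Does the request contain every argument the relevant tool requires?
2. Is a tool that can fulfil the request actually available?
If either answer is no, reply REFRAIN and state what is missing.
Otherwise reply with the tool call."""

print(SUFFICIENCY_PROMPT.format(
    tools="get_weather(city: str, date: str)",
    request="What's the weather tomorrow?",  # city is missing -> should REFRAIN
))
```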
[108] Cascade Reward Sampling for Efficient Decoding-Time Alignment
Bolian Li, Yifan Wang, Anamika Lochab, Ananth Grama, Ruqi Zhang
Main category: cs.CL
TL;DR: CARDS improves decoding-time alignment efficiency in LLMs by reducing redundant computations and ensuring accurate reward evaluations, achieving significant time savings and better alignment quality.
Details
Motivation: Aligning LLMs with human preferences efficiently without fine-tuning is challenging due to inefficiencies like wasted token generation and excessive reward evaluations.
Method: Introduces Cascade Reward Sampling (CARDS), featuring a segment-level rejection sampling algorithm and an uncertainty-based segmentation mechanism to minimize redundant computations.
Result: CARDS reduces decoding time by ~70% and achieves over 90% win-ties in utility and safety benchmarks.
Conclusion: CARDS effectively addresses efficiency bottlenecks in decoding-time alignment, enhancing both alignment quality and general utility of LLMs.
Abstract: Aligning large language models (LLMs) with human preferences is essential for their applications. Recently, decoding-time alignment has emerged as an effective plug-and-play technique that avoids fine-tuning model parameters. This approach retains the general utility of pretrained LLMs but often suffers from significant inefficiencies during decoding, primarily due to wasted token generation and excessive reward evaluations. To address these challenges, we introduce Cascade Reward Sampling (CARDS) to resolve both efficiency bottlenecks in decoding-time alignment. Specifically, we develop a segment-level rejection sampling algorithm that minimizes redundant computations of both LLMs and reward models (RMs). Central to CARDS is an uncertainty-based segmentation mechanism, which ensures the accuracy of RM evaluations on incomplete segments. Furthermore, we provide a detailed analysis of reward scores on segments to elucidate the improved alignment performance. Experimental results demonstrate that CARDS significantly improves decoding efficiency, alignment quality, and general utility compared to existing decoding-time alignment methods, achieving approximately a 70% reduction in decoding time and over 90% win-ties in utility and safety benchmarks.
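The decoding loop can be sketched as nested sampling: grow a segment until token uncertainty spikes (a crude stand-in for the paper’s uncertainty-based segmentation), score the partial response with the reward model, and resample rejected segments. `sample_token` and `reward` below are hypothetical stubs.

```python
import random

# Hedged sketch of CARDS-style decoding with hypothetical stubs: sample a
# segment, end it when token uncertainty spikes (a proxy for the paper's
# segmentation rule), score the partial response, and resample rejected segments.

random.seed(0)

def sample_token(prefix: str) -> tuple[str, float]:
    token = random.choice(["good", "bad", "ok"])
    entropy = random.uniform(0.1, 2.0)  # stand-in for predictive uncertainty
    return token + " ", entropy

def reward(text: str) -> float:
    return text.count("good") - text.count("bad")  # toy reward model

def cards_decode(prompt: str, n_segments: int = 3, threshold: float = 0.0) -> str:
    text = prompt
    for _ in range(n_segments):
        for _ in range(10):  # rejection-sampling attempts per segment
            segment = ""
            while True:
                token, entropy = sample_token(text + segment)
                segment += token
                if entropy > 1.5 or len(segment) > 40:  # uncertainty-based cut
                    break
            if reward(text + segment) >= threshold:  # accept or resample
                text += segment
                break
    return text

print(cards_decode("Answer: "))
```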
[109] Path-LLM: A Shortest-Path-based LLM Learning for Unified Graph Representation
Wenbo Shang, Xuliang Zhu, Xin Huang
Main category: cs.CL
TL;DR: The paper introduces Path-LLM, a novel model for unified graph representation learning using a large language model (LLM) and path features, outperforming existing methods in efficiency and performance.
Details
Motivation: Existing graph representation learning methods face issues like high training demands, poor generalization, or shallow semantics. Path-LLM addresses these by leveraging LLMs and path features.
Method: Path-LLM uses four techniques: L2SP path selection, path textualization, self-supervised LLM training for graph representation, and unified embedding extraction.
Result: Path-LLM outperforms state-of-the-art methods (WalkLM, GraphGPT, etc.) in tasks like node classification and keyword search, with 90% fewer training paths and 35x faster runtime.
Conclusion: Path-LLM is an efficient and superior approach for unified graph representation learning, validated by theoretical analysis and extensive experiments.
Abstract: Unified graph representation learning aims to generate node embeddings, which can be applied to multiple downstream applications of graph analytics. However, existing studies based on graph neural networks and language models either suffer from the limitations of numerous training needs toward specific downstream predictions, poor generalization, or shallow semantic features. In this work, we propose a novel Path-LLM model to efficiently learn unified graph representation, which leverages a powerful large language model (LLM) to incorporate our proposed path features. Our Path-LLM framework consists of four well-designed techniques. First, we develop a new mechanism of long-to-short shortest path (L2SP) selection, which can cover key connections between different dense groups. An in-depth analysis and comparison of different path selections is conducted to justify the rationale behind our designed L2SP method. Next, we design path textualization to obtain L2SP-based training texts with key phrase selection from node text attributes. We then feed the texts into a self-supervised LLM training process to align next node/edge generation in L2SP with next token generation in causal language modeling for graph representation learning and finally extract the unified graph embeddings. We theoretically analyze the algorithm complexity of our Path-LLM approach. Extensive experiments on large-scale graph benchmarks validate the superiority of Path-LLM against state-of-the-art methods WalkLM, GraphGPT, OFA, and GraphTranslator on two classical graph learning tasks (node classification and edge validation) and one NP-hard graph query processing task (keyword search). Compared with WalkLM, our approach saves more than 90% of training paths on millions-scale graphs and runs at most 35x faster.
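The precise long-to-short (L2SP) selection rule is not given in the abstract; the sketch below takes one plausible reading, enumerating shortest paths, ordering them long-to-short, and rendering them as text for causal LM training, with networkx and placeholder node texts.

```python
import networkx as nx

# Hedged sketch: enumerate all-pairs shortest paths, order them long-to-short
# (one plausible reading of L2SP selection), and textualize them so that next
# node/edge generation aligns with next-token generation. Node texts are
# placeholders for real node text attributes.

G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"), ("E", "D")])
node_text = {n: f"node {n}" for n in G}

paths = []
for src, targets in nx.all_pairs_shortest_path(G):
    for dst, path in targets.items():
        if src < dst and len(path) > 1:  # keep each unordered pair once
            paths.append(path)
paths.sort(key=len, reverse=True)  # long-to-short ordering

def textualize(path: list[str]) -> str:
    return " -> ".join(node_text[n] for n in path)

for p in paths[:3]:
    print(textualize(p))  # training texts for the self-supervised LLM stage
```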
[110] Learning from Negative Samples in Biomedical Generative Entity Linking
Chanhwi Kim, Hyunjae Kim, Sihyeon Park, Jiwoo Lee, Mujeen Sung, Jaewoo Kang
Main category: cs.CL
TL;DR: ANGEL introduces a framework for training generative BioEL models using negative samples, improving accuracy by up to 1.7%.
Details
Motivation: Generative BioEL models lack explicit learning from hard negative samples, limiting performance.
Method: Train a generative model to predict entity names, gather correct/incorrect predictions, and optimize preferences.
Result: ANGEL outperforms baselines by 1.4% average top-1 accuracy, increasing to 1.7% with pre-training.
Conclusion: ANGEL effectively enhances generative BioEL models by leveraging negative samples in training.
Abstract: Generative models have become widely used in biomedical entity linking (BioEL) due to their excellent performance and efficient memory usage. However, these models are usually trained only with positive samples, i.e., entities that match the input mention’s identifier, and do not explicitly learn from hard negative samples, which are entities that look similar but have different meanings. To address this limitation, we introduce ANGEL (Learning from Negative Samples in Biomedical Generative Entity Linking), the first framework that trains generative BioEL models using negative samples. Specifically, a generative model is initially trained to generate positive entity names from the knowledge base for given input entities. Subsequently, both correct and incorrect outputs are gathered from the model’s top-k predictions. Finally, the model is updated to prioritize the correct predictions through preference optimization. Our models outperform the previous best baseline models by an average top-1 accuracy of up to 1.4% on five benchmarks. When incorporating our framework into pre-training, the performance improvement increases further to 1.7%, demonstrating its effectiveness in both the pre-training and fine-tuning stages. The code and model weights are available at https://github.com/dmis-lab/ANGEL.
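The pair-construction step is mechanical once top-k predictions are available: correct surface forms become “chosen” and near-misses become “rejected”, ready for DPO-style preference optimization. A sketch with `topk_predictions` as a hypothetical stub for beam-search outputs:

```python
# Sketch of building preference pairs from a generative EL model's top-k
# predictions: correct names become "chosen", near-miss negatives become
# "rejected". `topk_predictions` is a hypothetical stub for beam-search outputs.

def topk_predictions(mention: str, k: int = 4) -> list[str]:
    return ["myocardial infarction", "myocardial ischemia",
            "cardiac infarction", "heart failure"][:k]  # placeholder beams

def build_pairs(mention: str, gold_names: set[str]) -> list[dict]:
    preds = topk_predictions(mention)
    chosen = [p for p in preds if p in gold_names]
    rejected = [p for p in preds if p not in gold_names]  # hard negatives
    return [{"prompt": mention, "chosen": c, "rejected": r}
            for c in chosen for r in rejected]

pairs = build_pairs("heart attack", {"myocardial infarction", "cardiac infarction"})
for p in pairs:
    print(p)
```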
[111] CCSBench: Evaluating Compositional Controllability in LLMs for Scientific Document Summarization
Yixi Ding, Jiaying Wu, Tongyao Zhu, Yanxia Qin, Qian Liu, Min-Yen Kan
Main category: cs.CL
TL;DR: The paper introduces CCSBench, a benchmark for compositional controllable summarization in scientific documents, addressing gaps in multi-attribute control.
Details
Motivation: To enhance scientific knowledge dissemination by enabling summarization systems to control multiple attributes (e.g., length, empirical focus) simultaneously.
Method: Developed CCSBench for fine-grained control over explicit (objective) and implicit (subjective) attributes. Evaluated LLMs using in-context learning, fine-tuning, and modular methods.
Result: LLMs struggle with balancing trade-offs between control attributes, particularly implicit ones requiring abstract reasoning.
Conclusion: CCSBench highlights limitations in LLMs for multi-attribute control, suggesting need for improved methods in compositional summarization.
Abstract: To broaden the dissemination of scientific knowledge to diverse audiences, it is desirable for scientific document summarization systems to simultaneously control multiple attributes such as length and empirical focus. However, existing research typically focuses on controlling single attributes, leaving the compositional control of multiple attributes underexplored. To address this gap, we introduce CCSBench, the first evaluation benchmark for compositional controllable summarization in the scientific domain. Our benchmark enables fine-grained control over both explicit attributes (e.g., length), which are objective and straightforward, and implicit attributes (e.g., conceptual or empirical focus), which are more subjective and abstract. We conduct extensive experiments with various large language models (LLMs) under multiple settings, including in-context learning, parameter-efficient fine-tuning, and two-stage modular methods for balancing control over different attributes. Our findings reveal significant limitations in LLMs’ ability to balance trade-offs between control attributes, especially implicit ones that require deeper understanding and abstract reasoning.
[112] Arena-Lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons
Seonil Son, Ju-Min Oh, Heegon Jin, Cheolhun Jang, Jeongbeom Jeong, Kuntae Kim
Main category: cs.CL
TL;DR: Arena-Lite improves LLM evaluation reliability by using tournament-structured head-to-head comparisons, eliminating baselines and reducing required comparisons.
Details
Motivation: Current baseline-mediated evaluation methods for LLMs are less reliable than direct comparisons. Arena-Lite aims to enhance reliability and efficiency.
Method: Arena-Lite integrates a tournament structure with direct head-to-head comparisons, tested via stochastic modeling and empirical validation with an LLM judge.
Result: Arena-Lite achieves higher reliability with fewer comparisons, even with smaller datasets or weaker judges.
Conclusion: Arena-Lite offers a more reliable and efficient alternative for LLM evaluation, with available tools for adoption.
Abstract: As Large Language Models (LLMs) expand across domains, LLM judges have become essential for systems evaluation. Current benchmarks typically compare system outputs against baselines. This baseline-mediated approach, though convenient, yields lower reliability than direct comparison between systems. We propose Arena-Lite, which integrates a tournament structure on top of head-to-head comparisons. The application of a tournament structure and direct comparison eliminates the need for baseline outputs, reduces the number of required comparisons, and allows higher reliability in system rankings. We conducted two experiments: (1) controlled stochastic modeling and (2) empirical validation with a real LLM judge. These experiments collectively demonstrate that Arena-Lite consistently achieves higher reliability with fewer comparisons, even with smaller datasets or weaker judges. We release an easy-to-use web demonstration and code to foster adoption of Arena-Lite, streamlining model selection across research and industry communities. The Arena-Lite demo and code are available at https://huggingface.co/spaces/NCSOFT/ArenaLite
[113] Training and Evaluating Language Models with Template-based Data Generation
Yifan Zhang
Main category: cs.CL
TL;DR: The paper introduces Template-based Data Generation (TDG) to address the lack of high-quality datasets for training LLMs in complex reasoning tasks, specifically math problems. It creates TemplateGSM, a dataset of 7M+ synthetic grade school math problems with verifiable solutions.
Details
Motivation: Current LLMs struggle with multi-step reasoning tasks like math problem-solving due to a lack of domain-specific datasets.
Method: TDG uses GPT-4 to generate parameterized meta-templates for synthesizing high-quality problems and solutions, creating the TemplateGSM dataset.
Result: TemplateGSM provides 7M+ verifiable math problems, enabling supervised fine-tuning and model alignment via RLVR.
Conclusion: TDG and TemplateGSM offer a scalable solution to data scarcity, advancing LLMs’ reasoning capabilities.
Abstract: The rapid advancement of large language models (LLMs) such as GPT-3, PaLM, and Llama has significantly transformed natural language processing, showcasing remarkable capabilities in understanding and generating language. However, a fundamental bottleneck persists: these models often struggle with tasks requiring complex, multi-step reasoning, particularly in mathematical problem-solving. This deficiency stems from the critical scarcity of large-scale, high-quality, domain-specific datasets necessary for cultivating sophisticated reasoning abilities. To overcome this challenge, we introduce Template-based Data Generation (TDG), a novel and scalable paradigm that harnesses frontier LLMs (GPT-4) to automatically generate parameterized meta-templates, which in turn synthesize a virtually infinite stream of high-quality problems and solutions. Using this paradigm, we create TemplateMath Part I: TemplateGSM, a foundational dataset of over 7 million synthetically generated grade school math problems. Each problem is accompanied by a programmatically verifiable solution, offering an unprecedented level of quality at scale. This resource not only resolves the data scarcity issue for supervised fine-tuning but also provides a robust mechanism for model alignment through Reinforcement Learning with Verifiable Rewards (RLVR). Our approach elevates data augmentation by employing GPT-4 for meta-template creation, guaranteeing diverse and complex problem structures. By providing a scalable solution to the data and verification bottleneck, TDG and TemplateGSM pave the way for a new generation of LLMs with powerful, reliable reasoning skills. The code and data are available at https://github.com/iiis-ai/TemplateMath.
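A meta-template in this spirit is a parameterized generator whose answer is computed programmatically, so every sampled instance is verifiable by construction. A toy example follows; the wording and parameter ranges are invented, not taken from TemplateGSM.

```python
import random

# Toy meta-template in the spirit of TDG: parameters are sampled, the word
# problem is rendered, and the solution is computed programmatically so every
# instance is verifiable. Wording is illustrative, not from TemplateGSM.

random.seed(42)

def apples_template() -> dict:
    name = random.choice(["Maya", "Liam", "Ava"])
    start, bought, eaten = random.randint(5, 20), random.randint(1, 10), random.randint(1, 4)
    question = (f"{name} has {start} apples, buys {bought} more, and eats "
                f"{eaten}. How many apples does {name} have now?")
    answer = start + bought - eaten  # programmatically verifiable solution
    return {"question": question, "answer": answer}

for _ in range(2):
    print(apples_template())
```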
[114] Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity
Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu
Main category: cs.CL
TL;DR: PoD is a KV cache compression framework for LLMs that retains less important tokens in a shared form, reducing memory usage by 35% without performance loss.
Details
Motivation: The linear growth of KV cache in LLMs increases memory usage, and existing compression methods risk losing critical information by discarding less relevant tokens.
Method: PoD prioritizes proximal tokens (start/end of context) and shares key states for distant tokens across layers, leveraging redundancy in attention scores.
Result: PoD reduces KV cache memory usage by up to 35% without performance degradation on synthetic and real-world benchmarks.
Conclusion: PoD offers an effective, orthogonal approach to KV cache compression, compatible with existing token-selection methods.
Abstract: The rapid expansion of context window sizes in Large Language Models (LLMs) has enabled them to tackle increasingly complex tasks involving lengthy documents. However, this progress comes at the cost of a substantial increase in memory usage during inference, primarily due to the linear growth of the key-value (KV) cache. Existing KV cache compression methods often discard less relevant tokens, which can lead to significant performance degradation when critical information is lost. In this paper, we propose \textsc{PoD} (Proximal tokens over Distant tokens), a novel KV cache compression framework that allocates memory according to token importance, retaining less important tokens in a more compact, shared form rather than discarding them entirely. Our approach is motivated by two key observations: (1) proximal tokens – those at the beginning and end of the context – are significantly more important for next-token prediction, and (2) attention scores for distant tokens are highly redundant across consecutive layers. Leveraging these insights, \textsc{PoD} preserves the full KV cache for proximal tokens, while for distant tokens, it shares key states across layers. Since attention scores are determined by both queries and keys, sharing key states enables multiple layers to reuse a single set of keys for distant tokens, substantially reducing KV cache memory without discarding essential context. We further introduce a lightweight post-training adaptation to enable the model to adjust to this new attention-sharing structure. Extensive experiments on both synthetic (Needle in a Haystack) and real-world long-context benchmarks demonstrate that \textsc{PoD} reduces KV cache memory usage by up to 35% without compromising performance. Our method is orthogonal to existing token-selection-based techniques and can be combined with them for further KV cache compression.
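The cache saving can be made concrete with shapes alone: proximal tokens keep per-layer keys while distant tokens reuse one shared key tensor across layers. A hedged NumPy sketch of that layout (counts only, no attention math or training):

```python
import numpy as np

# Hedged sketch of the cache layout idea: proximal tokens (start/end of the
# context) keep per-layer keys, while distant tokens reuse one shared key
# tensor across layers. Shapes and counts only.

rng = np.random.default_rng(0)
N, D, LAYERS, PROX = 64, 16, 4, 8  # PROX tokens at each end count as proximal

proximal = np.r_[np.arange(PROX), np.arange(N - PROX, N)]
distant = np.arange(PROX, N - PROX)

shared_distant_keys = rng.normal(size=(len(distant), D))   # one copy for all layers
per_layer_prox_keys = rng.normal(size=(LAYERS, len(proximal), D))

def layer_keys(layer: int) -> np.ndarray:
    """Assemble the full key matrix a given layer would attend over."""
    keys = np.empty((N, D))
    keys[proximal] = per_layer_prox_keys[layer]
    keys[distant] = shared_distant_keys  # reused: no per-layer storage
    return keys

assert layer_keys(0).shape == (N, D)
full = LAYERS * N * D
pod = LAYERS * len(proximal) * D + len(distant) * D
print(f"key-cache entries: full={full}, shared={pod} ({pod / full:.0%})")
```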
[115] Core Context Aware Transformers for Long Context Language Modeling
Yaofo Chen, Zeng You, Shuhai Zhang, Haokun Li, Yirui Li, Yaowei Wang, Mingkui Tan
Main category: cs.CL
TL;DR: Proposes Core Context Aware (CCA) Attention for efficient long-context modeling in LLMs, reducing redundancy and improving performance.
Details
Motivation: Addresses the inefficiency and redundancy of self-attention at large context lengths (e.g., 128K), which hamper performance and increase computational overhead.
Method: Introduces two modules: 1) globality-aware pooling to compress tokens into core tokens, and 2) a locality-preserving module to retain local context. CCA-Attention replaces self-attention with minimal fine-tuning.
Result: Demonstrates superior performance in long-context modeling and computational efficiency compared to state-of-the-art methods.
Conclusion: CCA-Attention effectively reduces redundancy and enhances long-context modeling in LLMs, offering a plug-and-play solution with minimal adaptation.
Abstract: Transformer-based Large Language Models (LLMs) have exhibited remarkable success across a wide range of tasks, primarily attributed to the self-attention mechanism, which requires a token to consider all preceding tokens as its context to compute attention. However, when the context length L becomes very large (e.g., 128K), the amount of potentially redundant information in the context tends to increase. The redundant context not only hampers the modeling representation performance but also incurs unnecessary computational and storage overhead. In this paper, we propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-context modeling, comprising two complementary modules: 1) a globality-aware pooling module, which groups input tokens and dynamically compresses each group into one core token based on their significance. In this way, our method automatically focuses on and strengthens the core context while diminishing redundancy during the learning process, leading to effective long-term dependency modeling; 2) a locality-preserving module, which incorporates neighboring tokens to preserve local context for detailed representation. Notably, our CCA-Attention is able to replace the self-attention module in existing LLMs with minimal fine-tuning cost. Extensive experimental results show the superiority of our method in both long-context modeling and computational efficiency over state-of-the-art methods.
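The two modules can be sketched shape-wise: pool each token group into one core token via significance-weighted averaging, then give each query the core tokens plus a local window of neighbors. The significance score below is an illustrative stand-in for the learned one.

```python
import numpy as np

# Hedged sketch of the two CCA modules: significance-weighted pooling of token
# groups into core tokens, plus a local window for detail. The significance
# score is an illustrative stand-in for the learned one.

rng = np.random.default_rng(0)
N, D, GROUP, WINDOW = 32, 8, 4, 4
x = rng.normal(size=(N, D))

# 1) Globality-aware pooling: one core token per group.
groups = x.reshape(N // GROUP, GROUP, D)
signif = np.exp(groups.sum(-1))                          # stand-in significance score
signif /= signif.sum(axis=1, keepdims=True)
core_tokens = (signif[..., None] * groups).sum(axis=1)   # (N/GROUP, D)

# 2) Locality-preserving context: each token also attends to its window.
i = N - 1  # last token's context = all core tokens + its recent neighbors
context = np.concatenate([core_tokens, x[i - WINDOW:i]], axis=0)
print(context.shape)  # (N/GROUP + WINDOW, D) instead of (N, D)
```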
[116] KnowRA: Knowledge Retrieval Augmented Method for Document-level Relation Extraction with Comprehensive Reasoning Abilities
Chengcheng Mai, Yuxiang Wang, Ziyu Gong, Hanxiang Wang, Yihua Huang
Main category: cs.CL
TL;DR: The paper introduces KnowRA, a knowledge retrieval augmented method for document-level relation extraction (Doc-RE), enhancing reasoning with external knowledge and comprehensive cross-sentence interactions.
Details
Motivation: Existing Doc-RE methods lack comprehensive reasoning and external knowledge utilization for long documents, limiting their effectiveness.
Method: KnowRA constructs a document graph for semantic encoding, integrates co-reference resolution, retrieves external knowledge for common-sense reasoning, filters irrelevant knowledge, and uses an axis attention mechanism for logical reasoning.
Result: Experiments on two datasets show KnowRA outperforms state-of-the-art baselines.
Conclusion: KnowRA effectively addresses Doc-RE challenges by combining comprehensive reasoning and external knowledge, validated by superior performance.
Abstract: Document-level relation extraction (Doc-RE) aims to extract relations between entities across multiple sentences. Therefore, Doc-RE requires more comprehensive, human-like reasoning abilities than sentence-level RE, involving complex cross-sentence interactions between entities, contexts, and external general knowledge. However, most existing Doc-RE methods focus on optimizing a single reasoning ability and lack the ability to utilize external knowledge for comprehensive reasoning on long documents. To solve these problems, we propose KnowRA, a knowledge retrieval augmented method with comprehensive reasoning abilities that autonomously determines whether to accept external knowledge to assist Doc-RE. First, we constructed a document graph for semantic encoding and integrated a co-reference resolution model to augment co-reference reasoning. Then, we expanded the document graph into a document knowledge graph by retrieving an external knowledge base for common-sense reasoning, and presented a novel knowledge filtration method to filter out irrelevant knowledge. Finally, we proposed an axis attention mechanism to build direct and indirect associations with intermediary entities for cross-sentence logical reasoning. Extensive experiments conducted on two datasets verify the effectiveness of our method compared to state-of-the-art baselines. Our code is available at https://anonymous.4open.science/r/KnowRA.
[117] Self-Evolving Critique Abilities in Large Language Models
Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin
Main category: cs.CL
TL;DR: SCRIT is a self-evolving framework for LLMs to improve critique abilities without external supervision, using self-generated data and contrastive-critic methods.
Details
Motivation: LLMs struggle with providing feedback in tasks where human evaluation is difficult or where they outperform humans. Current methods rely on human annotations or more powerful models, lacking self-supervised solutions.
Method: SCRIT trains LLMs with self-generated data, using a contrastive-critic approach with reference solutions for data synthesis and self-validation for quality. The model operates without references at inference.
Result: Implemented with Qwen2.5-72B-Instruct, SCRIT achieves a 10.0% relative gain in critique-correction accuracy and a 19.0% relative improvement in error identification F1-score across benchmarks.
Conclusion: SCRIT scales with data and model size, enabling continuous improvement through iterations, proving effective for enhancing LLM critique abilities.
Abstract: Despite their remarkable performance, Large Language Models (LLMs) face a critical challenge: providing feedback for tasks where human evaluation is difficult or where LLMs potentially outperform humans. In such scenarios, leveraging the critique ability of LLMs themselves - identifying and correcting flaws - shows considerable promise. This paper explores enhancing critique abilities of LLMs, noting that current approaches rely on human annotations or more powerful models, leaving the challenge of improving critique abilities without external supervision unresolved. We introduce SCRIT (Self-evolving CRITic), a framework that trains LLMs with self-generated data to evolve their critique abilities. To address the low quality of naively generated data, we propose a contrastive-critic approach that uses reference solutions during data synthesis to enhance the model’s understanding of key concepts, and incorporates a self-validation scheme to ensure data quality. The final trained model operates without any reference solutions at inference time. Implemented with Qwen2.5-72B-Instruct, a leading LLM, SCRIT demonstrates consistent improvements across a wide range of benchmarks spanning both mathematical and scientific reasoning: achieving a 10.0% relative gain in critique-correction accuracy and a 19.0% relative improvement in error identification F1-score. Our analysis reveals that SCRIT’s performance scales positively with data and model size and enables continuous improvement through multi-round iterations.
[118] Rethinking Table Instruction Tuning
Naihao Deng, Rada Mihalcea
Main category: cs.CL
TL;DR: The paper evaluates hyperparameter impacts on table LLMs, finding smaller learning rates and fewer training instances improve performance. It introduces TAMA, a model outperforming GPT-3.5/4 on table tasks while maintaining general capabilities.
Details
Motivation: Existing research overlooks hyperparameter effects and lacks evaluation of out-of-domain and general capabilities in table LLMs.
Method: Systematic analysis of hyperparameters (e.g., learning rate) and introduction of TAMA, a table LLM tuned from LLaMA 3.1 8B Instruct.
Result: Smaller learning rates and fewer training instances enhance table understanding without compromising general capabilities. TAMA matches or surpasses GPT-3.5/4 on table tasks.
Conclusion: Careful hyperparameter selection can reduce data costs and improve efficiency. TAMA demonstrates superior performance and generalization.
Abstract: Recent advances in table understanding have focused on instruction-tuning large language models (LLMs) for table-related tasks. However, existing research has overlooked the impact of hyperparameter choices, and also lacks a comprehensive evaluation of the out-of-domain table understanding ability and the general capabilities of these table LLMs. In this paper, we evaluate these abilities in existing table LLMs, and find significant declines in both out-of-domain table understanding and general capabilities as compared to their base models. Through systematic analysis, we show that hyperparameters, such as learning rate, can significantly influence both table-specific and general capabilities. Contrary to the previous table instruction-tuning work, we demonstrate that smaller learning rates and fewer training instances can enhance table understanding while preserving general capabilities. Based on our findings, we introduce TAMA, a TAble LLM instruction-tuned from LLaMA 3.1 8B Instruct, which achieves performance on par with, or surpassing GPT-3.5 and GPT-4 on table tasks, while maintaining strong out-of-domain generalization and general capabilities. Our findings highlight the potential for reduced data annotation costs and more efficient model development through careful hyperparameter selection. We open-source the project and our models.
[119] Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training
Changhao Jiang, Ming Zhang, Junjie Ye, Xiaoran Fan, Yifei Cao, Jiajun Sun, Zhiheng Xi, Shihan Dou, Yi Dong, Yujiong Shen, Jingqi Tong, Baoyu Fan, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Qi Zhang, Tao Gui, Xuanjing Huang
Main category: cs.CL
TL;DR: The paper introduces Size-dependent Mutual Information (SMI) to predict QA performance using pre-training signals, outperforming baselines with R² > 0.75.
Details
Motivation: Predicting model performance efficiently without additional training is crucial for resource optimization and task-aligned dataset creation.
Method: Large-scale retrieval and semantic analysis of pre-training corpora, development of a multi-template QA framework, and introduction of SMI.
Result: SMI achieves strong linear correlation (R² > 0.75) with QA accuracy, suggesting an 80% upper bound under optimal conditions.
Conclusion: SMI is effective for performance prediction, but intrinsic memory limits may necessitate retrieval or few-shot methods for further improvements.
Abstract: The GPT-4 technical report highlights the possibility of predicting model performance on downstream tasks using only pre-training signals, though detailed methodologies are absent. Such predictive capabilities are essential for resource-efficient pre-training and the construction of task-aligned datasets. In this paper, we aim to predict performance in closed-book question answering (QA), a vital downstream task that directly reflects a model’s internalized knowledge without the help of external tools. We address three primary challenges: (1) limited access to and understanding of pre-training corpora, (2) limitations of current evaluation methods for pre-trained models, and (3) limitations of frequency-based metrics in predicting model performance. In response, we conduct large-scale retrieval and semantic analysis across the pre-training corpora of 21 publicly available and 3 custom-trained large language models. We then develop a multi-template QA evaluation framework incorporating paraphrased question variants. Building on these foundations, we propose Size-dependent Mutual Information (SMI), an information-theoretic metric that linearly correlates pre-training data characteristics, model size, and QA accuracy, without requiring additional training. Experimental results show that SMI outperforms co-occurrence-based baselines, achieving $R^2 > 0.75$ on models with over one billion parameters. Theoretical analysis further suggests an upper bound of around 80% QA accuracy under optimal pre-training, reflecting intrinsic memory limitations and motivating the use of retrieval or few-shot methods in later stages.
[120] Emergent Response Planning in LLMs
Zhichen Dong, Zhanhui Zhou, Zhixuan Liu, Chao Yang, Chaochao Lu
Main category: cs.CL
TL;DR: LLMs exhibit emergent planning behaviors, encoding future outputs in hidden representations, which can be probed for structure, content, and behavior attributes.
Details
Motivation: To investigate whether LLMs, despite being trained for next-token prediction, inherently plan ahead in their hidden representations.
Method: Probing LLM prompt representations to identify encoded global attributes of responses, including structure, content, and behavior attributes.
Result: LLMs encode future outputs beyond the next token, with planning behaviors scaling with model size and evolving during generation.
Conclusion: LLMs’ hidden representations reveal planning ahead, offering potential for improving transparency and control in generation.
Abstract: In this work, we argue that large language models (LLMs), though trained to predict only the next token, exhibit emergent planning behaviors: $\textbf{their hidden representations encode future outputs beyond the next token}$. Through simple probing, we demonstrate that LLM prompt representations encode global attributes of their entire responses, including $\textit{structure attributes}$ (e.g., response length, reasoning steps), $\textit{content attributes}$ (e.g., character choices in storywriting, multiple-choice answers at the end of response), and $\textit{behavior attributes}$ (e.g., answer confidence, factual consistency). In addition to identifying response planning, we explore how it scales with model size across tasks and how it evolves during generation. The findings that LLMs plan ahead for the future in their hidden representations suggest potential applications for improving transparency and generation control.
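The probing recipe itself is standard: fit a linear model from prompt hidden states to a global response attribute and check held-out fit. A self-contained sketch with synthetic activations standing in for real LLM hidden states:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Sketch of the probing setup: a linear probe predicts a global response
# attribute (here, response length) from the prompt's hidden state. The
# "hidden states" are synthetic stand-ins for actual LLM activations.

rng = np.random.default_rng(0)
n_prompts, d_model = 200, 64
hidden = rng.normal(size=(n_prompts, d_model))   # prompt representations
w_true = rng.normal(size=d_model)                # planted linear structure
lengths = hidden @ w_true + rng.normal(scale=0.5, size=n_prompts)

probe = Ridge(alpha=1.0).fit(hidden[:150], lengths[:150])
r2 = probe.score(hidden[150:], lengths[150:])
print(f"probe R^2 on held-out prompts: {r2:.2f}")  # high R^2 => attribute is linearly decodable
```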
[121] Thinking Outside the (Gray) Box: A Context-Based Score for Assessing Value and Originality in Neural Text Generation
Giorgio Franceschelli, Mirco Musolesi
Main category: cs.CL
TL;DR: Proposes a context-based score to balance diversity and quality in large language model outputs, validated via reinforcement learning and creative tasks.
Details
Motivation: Addressing the trade-off between diversity and quality in AI-generated creative outputs.
Method: Uses a context-based score derived from information theory to evaluate value and originality, applied in a reinforcement learning framework.
Result: Enhances value and originality in tasks like poetry generation and math problem solving.
Conclusion: The proposed score effectively improves creative outputs without compromising quality.
Abstract: Despite the increasing use of large language models for creative tasks, their outputs often lack diversity. Common solutions, such as sampling at higher temperatures, can compromise the quality of the results. Dealing with this trade-off is still an open challenge in designing AI systems for creativity. Drawing on information theory, we propose a context-based score to quantitatively evaluate value and originality. This score incentivizes accuracy and adherence to the request while fostering divergence from the learned distribution. We show that our score can be used as a reward in a reinforcement learning framework to fine-tune large language models for maximum performance. We validate our strategy through experiments considering a variety of creative tasks, such as poetry generation and math problem solving, demonstrating that it enhances the value and originality of the generated solutions.
[122] Towards Question Answering over Large Semi-structured Tables
Yuxiang Wang, Junhao Gan, Jianzhong Qi
Main category: cs.CL
TL;DR: TaDRe is a TableQA model that improves accuracy by refining table decomposition, achieving state-of-the-art performance on large-table tasks.
Details
Motivation: Large tables overwhelm models, and existing decomposition methods suffer from errors and quality issues.
Method: TaDRe incorporates pre- and post-table decomposition refinements to ensure quality.
Result: TaDRe achieves state-of-the-art performance on both new and public large-table benchmarks.
Conclusion: TaDRe effectively addresses large-table QA challenges with refined decomposition.
Abstract: Table Question Answering (TableQA) attracts strong interests due to the prevalence of web information presented in the form of semi-structured tables. Despite many efforts, TableQA over large tables remains an open challenge. This is because large tables may overwhelm models that try to comprehend them in full to locate question answers. Recent studies reduce input table size by decomposing tables into smaller, question-relevant sub-tables via generating programs to parse the tables. However, such solutions are subject to program generation and execution errors and are difficult to ensure decomposition quality. To address this issue, we propose TaDRe, a TableQA model that incorporates both pre- and post-table decomposition refinements to ensure table decomposition quality, hence achieving highly accurate TableQA results. To evaluate TaDRe, we construct two new large-table TableQA benchmarks via LLM-driven table expansion and QA pair generation. Extensive experiments on both the new and public benchmarks show that TaDRe achieves state-of-the-art performance on large-table TableQA tasks.
[123] DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation
Giorgio Franceschelli, Mirco Musolesi
Main category: cs.CL
TL;DR: DiffSampling is a new decoding method for language models that improves output diversity and correctness by analyzing token probability distributions.
Details
Motivation: Current decoding strategies either reduce output diversity or compromise accuracy, leading to repetitive or incorrect text generation.
Method: DiffSampling leverages mathematical analysis of token probability distributions to truncate incorrect tokens and introduces variations to address inconsistencies in common sampling strategies.
Result: Experiments show DiffSampling performs comparably to existing methods in quality while potentially enhancing output diversity.
Conclusion: DiffSampling offers a balanced approach to decoding, improving both diversity and correctness in text generation.
Abstract: Despite their growing capabilities, language models still frequently reproduce content from their training data, generate repetitive text, and favor common grammatical patterns and vocabulary. A possible cause is the decoding strategy: the most common strategies either consider only the most probable tokens, which reduces output diversity, or increase the likelihood of unlikely tokens, compromising output accuracy and correctness. In this paper, we propose DiffSampling, a new decoding method that leverages a mathematical analysis of the token probability distribution to ensure the generation of contextually appropriate text. In particular, the difference between consecutive, sorted probabilities can be used to truncate incorrect tokens. In addition, we also propose two variations of the proposed method that aim to correct the subtle inconsistencies of common sampling strategies. Experiments involving four different text-generation tasks demonstrate that our approach consistently performs at least on par with the existing methods it builds upon in terms of quality, while potentially improving output diversity.
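The core rule from the abstract, truncating at the largest gap between consecutive sorted probabilities, can be sketched directly; the paper’s two variants refine this rule and are not reproduced here.

```python
import numpy as np

# Hedged sketch of difference-based truncation: sort token probabilities in
# descending order and cut at the largest gap between consecutive
# probabilities, keeping the head of the distribution and renormalizing.

def diff_truncate(probs: np.ndarray) -> np.ndarray:
    order = np.argsort(probs)[::-1]
    sorted_p = probs[order]
    gaps = sorted_p[:-1] - sorted_p[1:]          # consecutive differences
    cutoff = int(np.argmax(gaps)) + 1            # keep tokens before largest gap
    kept = order[:cutoff]
    out = np.zeros_like(probs)
    out[kept] = probs[kept] / probs[kept].sum()  # renormalize kept mass
    return out

probs = np.array([0.42, 0.38, 0.1, 0.06, 0.04])
print(diff_truncate(probs))  # mass concentrates on the two near-tied tokens
```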
[124] Control Illusion: The Failure of Instruction Hierarchies in Large Language Models
Yilin Geng, Haonan Li, Honglin Mu, Xudong Han, Timothy Baldwin, Omri Abend, Eduard Hovy, Lea Frermann
Main category: cs.CL
TL;DR: The paper evaluates how well large language models (LLMs) enforce hierarchical instruction schemes, finding that they struggle with consistent prioritization and exhibit strong constraint-type biases, and suggesting that social hierarchies may influence control more than system/user roles.
Details
Motivation: To systematically understand the effectiveness of hierarchical control mechanisms in LLMs, given their increasing deployment with such schemes.
Method: Introduces a systematic evaluation framework based on constraint prioritization, tested across six state-of-the-art LLMs.
Result: LLMs struggle with consistent instruction prioritization, exhibit biases, and obey social hierarchies more reliably than system/user roles.
Conclusion: Pretraining-derived social structures may have stronger influence on LLM behavior than post-training guardrails, highlighting limitations in current hierarchical control mechanisms.
Abstract: Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes, where certain instructions (e.g., system-level directives) are expected to take precedence over others (e.g., user messages). Yet, we lack a systematic understanding of how effectively these hierarchical control mechanisms work. We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies. Our experiments across six state-of-the-art LLMs reveal that models struggle with consistent instruction prioritization, even for simple formatting conflicts. We find that the widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy, and models exhibit strong inherent biases toward certain constraint types regardless of their priority designation. We find that LLMs more reliably obey constraints framed through natural social hierarchies (e.g., authority, expertise, consensus) than system/user roles, which suggests that pretraining-derived social structures act as latent control priors, with potentially stronger influence than post-training guardrails.
[125] What are Foundation Models Cooking in the Post-Soviet World?
Anton Lavrouk, Tarek Naous, Alan Ritter, Wei Xu
Main category: cs.CL
TL;DR: The study investigates Post-Soviet cultural food knowledge in foundation models using BORSch, a multimodal dataset. Models struggle with dish-origin identification due to misleading data co-occurrences and linguistic issues. QA alone may not fully evaluate cultural understanding.
Details
Motivation: To understand how foundation models handle Post-Soviet cultural food knowledge and identify biases in their predictions.
Method: Constructed BORSch, a multimodal dataset of Russian and Ukrainian dishes, and tested models on QA and visual description tasks.
Result: Models often misidentify dish origins, influenced by language and misleading data. QA and visual description tasks show weak correlation.
Conclusion: QA may not suffice for evaluating cultural understanding. BORSch is released to aid further research.
Abstract: The culture of the Post-Soviet states is complex, shaped by a turbulent history that continues to influence current events. In this study, we investigate the Post-Soviet cultural food knowledge of foundation models by constructing BORSch, a multimodal dataset encompassing 1147 and 823 dishes in the Russian and Ukrainian languages, centered around the Post-Soviet region. We demonstrate that leading models struggle to correctly identify the origins of dishes from Post-Soviet nations in both text-only and multimodal Question Answering (QA), instead over-predicting countries linked to the language the question is asked in. Through analysis of pretraining data, we show that these results can be explained by misleading dish-origin co-occurrences, along with linguistic phenomena such as Russian-Ukrainian code mixing. Finally, to move beyond QA-based assessments, we test models’ abilities to produce accurate visual descriptions of dishes. The weak correlation between this task and QA suggests that QA alone may be insufficient as an evaluation of cultural understanding. To foster further research, we will make BORSch publicly available at https://github.com/alavrouk/BORSch.
[126] Predictive Data Selection: The Data That Predicts Is the Data That Teaches
Kashun Shum, Yuzhen Huang, Hongjian Zou, Qi Ding, Yixuan Liao, Xiaoxin Chen, Qian Liu, Junxian He
Main category: cs.CL
TL;DR: The paper introduces PreSelect, a lightweight data selection method for language model pretraining, achieving 10x compute reduction while outperforming baselines.
Details
Motivation: To efficiently estimate data contribution during pretraining and improve data selection for better downstream performance.
Method: Uses predictive data selection (PreSelect) based on model loss correlation with downstream performance, requiring only a fastText-based scorer.
Result: Models trained on 30B tokens with PreSelect outperform baselines trained on 300B tokens, with significant gains over other methods.
Conclusion: PreSelect is an efficient and effective data selection method, reducing compute needs while enhancing model performance.
Abstract: Language model pretraining involves training on extensive corpora, where data quality plays a pivotal role. In this work, we aim to directly estimate the contribution of data during pretraining and select pretraining data in an efficient manner. Specifically, we draw inspiration from recent findings showing that compression efficiency (i.e., the normalized loss) of diverse models on certain text correlates strongly with their downstream performance, when the text domain aligns with the downstream benchmarks (Huang et al., 2024). Building on this observation, we hypothesize that data on which model losses are predictive of downstream abilities also contribute effectively to learning, which shares similar intuition with Thrush et al. (2024). To leverage this insight, we introduce predictive data selection (PreSelect), a lightweight and efficient data selection method that requires training and deploying only a fastText-based scorer. Through comprehensive experiments with 1B and 3B parameter models, we demonstrate that models trained on 30B tokens selected with PreSelect surpass the performance of the vanilla baseline trained on 300B tokens, achieving a 10x reduction in compute requirements. Furthermore, PreSelect significantly outperforms other competitive data selection baselines, such as DCLM and FineWeb-Edu, on a scale of 3B models trained on 100B tokens. We open-source our trained data selection scorer along with the curated datasets at https://github.com/hkust-nlp/PreSelect.
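Because the only deployed artifact is a fastText classifier, large-scale selection reduces to scoring and thresholding. A minimal sketch, in which the label names, training file, and 0.5 threshold are our assumptions:

```python
import fasttext  # pip install fasttext

# Training file: one document per line, e.g.
#   __label__keep <text whose loss is predictive of downstream ability>
#   __label__drop <text that is not>
scorer = fasttext.train_supervised(input="preselect_train.txt",
                                   epoch=5, wordNgrams=2)

def keep_probability(document: str) -> float:
    labels, probs = scorer.predict(document.replace("\n", " "))
    p = float(probs[0])
    return p if labels[0] == "__label__keep" else 1.0 - p

corpus = ["first candidate document ...", "second candidate document ..."]
selected = [doc for doc in corpus if keep_probability(doc) > 0.5]
```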
[127] TIB-STC: A Large-Scale Structured Tibetan Benchmark for Low-Resource Language Modeling
Cheng Huang, Fan Gao, Yutong Liu, Nyima Tashi, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Hao Wang, Yongbin Yu
Main category: cs.CL
TL;DR: The paper introduces TIB-STC, a large-scale, expert-curated dataset for Tibetan language modeling, and validates its effectiveness by training the Sun-Shine model, which performs well on Tibetan-specific tasks.
Details
Motivation: To address the uneven progress in NLP for low-resource and culturally rich languages like Tibetan by providing a dedicated dataset and model.
Method: Developed TIB-STC, a multi-domain Tibetan dataset, and trained the Sun-Shine model using a three-stage pipeline (pretraining, fine-tuning, preference optimization).
Result: Sun-Shine performs robustly on Tibetan-specific benchmarks (Ti-MMLU, Ti-SafetyBench), demonstrating TIB-STC’s utility.
Conclusion: TIB-STC advances low-resource language modeling and promotes inclusivity in multilingual NLP; the dataset and model are publicly released.
Abstract: Advancement of large language models (LLMs) has brought transformative capabilities to NLP, but such progress remains unevenly distributed, especially for low-resource and culturally rich languages like Tibetan. In this paper, we present TIB-STC, the first large-scale, expert-curated, and multi-domain dataset specifically designed to support the development and evaluation of LLMs for the Tibetan language. Spanning over 11 billion tokens across literature, religion, medicine, law, and daily communication, TIB-STC preserves traditional grammar and stylistic richness. To validate its utility, we train a reference model, Sun-Shine, on TIB-STC through a three-stage pipeline involving pretraining, supervised fine-tuning, and preference optimization. Evaluation on the TLUE Benchmark for Tibetan-specific tasks, including Ti-MMLU and Ti-SafetyBench, demonstrates TIB-STC’s effectiveness in enabling robust instruction-following and culturally aligned generation. We release TIB-STC to advance research in low-resource language modeling and promote inclusivity in multilingual NLP. All data are available: https://github.com/Vicentvankor/sun-shine.
[128] Efficient Dynamic Clustering-Based Document Compression for Retrieval-Augmented-Generation
Weitao Li, Kaiming Liu, Xiangyu Zhang, Xuanyu Lei, Weizhi Ma, Yang Liu
Main category: cs.CL
TL;DR: The paper introduces EDC2-RAG, a framework improving RAG by dynamically clustering documents to reduce noise and redundancy, validated on knowledge-QA and hallucination-detection tasks.
Details
Motivation: Current RAG implementations struggle with noise and redundancy due to poor exploitation of inter-document relationships, leading to generation errors.
Method: Proposes EDC2-RAG, leveraging latent inter-document relationships for dynamic clustering and compression to filter irrelevant/redundant content.
Result: EDC2-RAG shows consistent performance gains on GPT-3.5-Turbo and GPT-4o-mini, proving robust and applicable across scenarios.
Conclusion: EDC2-RAG effectively addresses RAG limitations, enhancing performance and reliability in knowledge injection tasks.
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for knowledge injection during large language model (LLM) inference in recent years. However, due to their limited ability to exploit fine-grained inter-document relationships, current RAG implementations face challenges in effectively addressing retrieved noise and redundant content, which may cause errors in the generated results. To address these limitations, we propose an Efficient Dynamic Clustering-based document Compression framework (EDC2-RAG) that utilizes latent inter-document relationships while simultaneously removing irrelevant information and redundant content. We validate our approach, built upon GPT-3.5-Turbo and GPT-4o-mini, on widely used knowledge-QA and Hallucination-Detection datasets. Experimental results show that our method achieves consistent performance improvements across various scenarios and experimental settings, demonstrating strong robustness and applicability. Our code and datasets are available at https://github.com/Tsinghua-dhy/EDC-2-RAG.
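One plausible realization of "dynamic clustering plus compression" over retrieved documents, assuming precomputed embeddings; keeping the most query-similar member of each cluster is our simplification of the compression step:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

def cluster_and_compress(docs, doc_embs, query_emb, distance_threshold=0.3):
    """Group near-duplicate retrieved documents, then keep one
    representative per cluster: the member most similar to the query."""
    clusterer = AgglomerativeClustering(
        n_clusters=None, metric="cosine", linkage="average",
        distance_threshold=distance_threshold)  # sklearn >= 1.2 (older: affinity=)
    labels = clusterer.fit_predict(doc_embs)
    sims = cosine_similarity(doc_embs, query_emb.reshape(1, -1)).ravel()
    kept = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        kept.append(docs[members[np.argmax(sims[members])]])
    return kept
```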
[129] Enhancing Time Series Forecasting via Multi-Level Text Alignment with LLMs
Taibiao Zhao, Xiaobing Chen, Mingxuan Sun
Main category: cs.CL
TL;DR: A framework for adapting LLMs to time series forecasting by decomposing data into trend, seasonal, and residual components, then aligning them with text representations for improved accuracy and interpretability.
Details
Motivation: Time series data is continuous, while LLMs operate on discrete tokens, creating challenges in aligning them without losing predictive accuracy or interpretability.
Method: Decompose time series into trend, seasonal, and residual components, reprogram them into text representations, and align these with pre-trained word tokens using a multi-level alignment mechanism.
Result: Outperforms state-of-the-art models in accuracy while maintaining good interpretability.
Conclusion: The proposed framework successfully bridges the gap between continuous time series data and discrete LLMs, enhancing both accuracy and interpretability.
Abstract: The adaptation of large language models (LLMs) to time series forecasting poses unique challenges, as time series data is continuous in nature, while LLMs operate on discrete tokens. Despite the success of LLMs in natural language processing (NLP) and other structured domains, aligning time series data with language-based representations while maintaining both predictive accuracy and interpretability remains a significant hurdle. Existing methods have attempted to reprogram time series data into text-based forms, but these often fall short in delivering meaningful, interpretable results. In this paper, we propose a multi-level text alignment framework for time series forecasting using LLMs that not only improves prediction accuracy but also enhances the interpretability of time series representations. Our method decomposes time series into trend, seasonal, and residual components, which are then reprogrammed into component-specific text representations. We introduce a multi-level alignment mechanism, where component-specific embeddings are aligned with pre-trained word tokens, enabling more interpretable forecasts. Experiments on multiple datasets demonstrate that our method outperforms state-of-the-art models in accuracy while providing good interpretability.
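The decomposition step itself is standard; a minimal sketch using classical additive decomposition (the paper's decomposition choice, and its reprogramming and alignment stages, may differ and are not shown):

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Toy hourly series; each component would then be reprogrammed into its
# own text representation before alignment with word-token embeddings.
idx = pd.date_range("2024-01-01", periods=240, freq="h")
series = pd.Series(range(240), index=idx)
parts = seasonal_decompose(series, model="additive", period=24)
trend, seasonal, residual = parts.trend, parts.seasonal, parts.resid
```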
[130] Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models
Yule Liu, Jingyi Zheng, Zhen Sun, Zifan Peng, Wenhan Dong, Zeyang Sha, Shiwen Cui, Weiqiang Wang, Xinlei He
Main category: cs.CL
TL;DR: The paper introduces a pipeline that reduces redundant reasoning steps in large reasoning models (LRMs) by using external CoTs from smaller models, improving efficiency without performance loss.
Details
Motivation: LRMs suffer from "overthinking," generating excessive reasoning steps with minimal gains. The goal is to optimize their inference efficiency.
Method: Proposes a pipeline that inserts external CoTs generated by smaller models between the LRM's thinking tokens (\texttt{<think>} and \texttt{</think>}), sparing the model from generating redundant reasoning itself.
Result: The pipeline reduces output tokens by ~30% on QwQ-32B (LiveBench/Code) while maintaining performance, with minimal overhead. It also identifies and mitigates suboptimal modes.
Conclusion: The pipeline is a practical, efficient solution to optimize LRM inference, enhancing scalability for real-world applications.
Abstract: Recent advancements in large reasoning models (LRMs) have demonstrated the effectiveness of scaling test-time computation to enhance reasoning capabilities on various tasks. However, LRMs often suffer from an "overthinking" problem, where the model generates excessively redundant reasoning steps with limited performance gains. In this work, we empirically reveal an important characteristic of LRM behaviors: placing external CoTs generated by smaller models between the thinking tokens (\texttt{<think>} and \texttt{</think>}) can effectively reduce the number of reasoning steps the model generates while preserving task performance.
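The insertion itself amounts to prompt assembly: wrap the smaller model's CoT in the reasoning delimiters so the LRM treats the thinking as already done. A sketch under that reading, with the delimiter strings taken from the abstract:

```python
def externally_thought_prompt(question: str, external_cot: str) -> str:
    # The LRM sees a completed <think> block and can answer directly
    # instead of generating its own (often redundant) reasoning.
    return f"{question}\n<think>\n{external_cot}\n</think>\n"
```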
[131] Agree to Disagree? A Meta-Evaluation of LLM Misgendering
Arjun Subramonian, Vagrant Gautam, Preethi Seshadri, Dietrich Klakow, Kai-Wei Chang, Yizhou Sun
Main category: cs.CL
TL;DR: The paper examines the convergent validity of methods for measuring LLM misgendering, finding significant disagreement between probability- and generation-based evaluations, and highlights the complexity of misgendering behavior beyond pronouns.
Details
Motivation: To determine whether existing evaluation methods for LLM misgendering align (convergent validity) and to assess their limitations.
Method: A systematic meta-evaluation of probability- and generation-based methods across three datasets, transforming datasets for parallel evaluation, and human evaluation of 2400 LLM generations.
Result: Disagreement between methods at instance, dataset, and model levels (20.2% conflict), with misgendering behavior extending beyond pronouns, challenging automatic evaluations.
Conclusion: Recommendations for future LLM misgendering evaluations are provided, questioning broader LLM evaluation conventions that assume method agreement.
Abstract: Numerous methods have been proposed to measure LLM misgendering, including probability-based evaluations (e.g., automatically with templatic sentences) and generation-based evaluations (e.g., with automatic heuristics or human validation). However, it has gone unexamined whether these evaluation methods have convergent validity, that is, whether their results align. Therefore, we conduct a systematic meta-evaluation of these methods across three existing datasets for LLM misgendering. We propose a method to transform each dataset to enable parallel probability- and generation-based evaluation. Then, by automatically evaluating a suite of 6 models from 3 families, we find that these methods can disagree with each other at the instance, dataset, and model levels, conflicting on 20.2% of evaluation instances. Finally, with a human evaluation of 2400 LLM generations, we show that misgendering behaviour is complex and goes far beyond pronouns, which automatic evaluations are not currently designed to capture, suggesting essential disagreement with human evaluations. Based on our findings, we provide recommendations for future evaluations of LLM misgendering. Our results are also more widely relevant, as they call into question broader methodological conventions in LLM evaluation, which often assume that different evaluation methods agree.
[132] A Scoping Review of Natural Language Processing in Addressing Medically Inaccurate Information: Errors, Misinformation, and Hallucination
Zhaoyi Sun, Wen-Wai Yim, Ozlem Uzuner, Fei Xia, Meliha Yetisgen
Main category: cs.CL
TL;DR: The review explores NLP’s role in detecting and correcting medically inaccurate information, highlighting its potential and challenges in healthcare.
Details
Motivation: To advance patient safety, improve public health communication, and enhance NLP reliability in healthcare.
Method: A scoping review following PRISMA guidelines, analyzing studies from 2020 to 2024 across five databases.
Result: NLP shows promise in tasks like error/misinformation detection/correction and hallucination management, but faces challenges like data privacy and context dependency.
Conclusion: While NLP has advanced in addressing medical inaccuracies, future work should focus on real-world datasets, contextual methods, and hallucination management for reliable healthcare applications.
Abstract: Objective: This review aims to explore the potential and challenges of using Natural Language Processing (NLP) to detect, correct, and mitigate medically inaccurate information, including errors, misinformation, and hallucination. By unifying these concepts, the review emphasizes their shared methodological foundations and their distinct implications for healthcare. Our goal is to advance patient safety, improve public health communication, and support the development of more reliable and transparent NLP applications in healthcare. Methods: A scoping review was conducted following PRISMA guidelines, analyzing studies from 2020 to 2024 across five databases. Studies were selected based on their use of NLP to address medically inaccurate information and were categorized by topic, tasks, document types, datasets, models, and evaluation metrics. Results: NLP has shown potential in addressing medically inaccurate information on the following tasks: (1) error detection, (2) error correction, (3) misinformation detection, (4) misinformation correction, (5) hallucination detection, and (6) hallucination mitigation. However, challenges remain with data privacy, context dependency, and evaluation standards. Conclusion: This review highlights the advancements in applying NLP to tackle medically inaccurate information while underscoring the need to address persistent challenges. Future efforts should focus on developing real-world datasets, refining contextual methods, and improving hallucination management to ensure reliable and transparent healthcare applications.
[133] Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding
Jaehyun Jeon, Min Soo Kim, Jang Han Yoon, Sumin Shim, Yejin Choi, Hanbin Kim, Youngjae Yu
Main category: cs.CL
TL;DR: The paper introduces WiserUI-Bench, a benchmark for evaluating MLLMs’ understanding of UI/UX design, focusing on behavior-oriented aspects and expert-curated rationales.
Details
Motivation: Recent studies on UI quality evaluation with MLLMs overlook behavior-oriented aspects, creating a gap in understanding UI/UX design's impact on user actions.
Method: The authors developed WiserUI-Bench, featuring 300 real-world UI image pairs with A/B-tested variants and 684 expert rationales. It evaluates models on selecting effective designs and explaining their effectiveness.
Result: Experiments show current MLLMs lack nuanced reasoning about UI/UX design and its behavioral impact.
Conclusion: WiserUI-Bench aims to advance UI/UX understanding and enable applications like behavior-aware interface optimization.
Abstract: User interface (UI) design goes beyond visuals, guiding user behavior and overall user experience (UX). Strategically crafted interfaces, for example, can boost sign-ups and drive business sales, underscoring the shift toward UI/UX as a unified design concept. While recent studies have explored UI quality evaluation using Multimodal Large Language Models (MLLMs), they largely focus on surface-level features, overlooking behavior-oriented aspects. To fill this gap, we introduce WiserUI-Bench, a novel benchmark for assessing models’ multimodal understanding of UI/UX design. It includes 300 diverse real-world UI image pairs, each consisting of two design variants A/B-tested at scale by actual companies, where one was empirically validated to steer more user actions than the other. Each pair is accompanied by one or more of 684 expert-curated rationales that capture key factors behind each winning design’s effectiveness, spanning diverse cognitive dimensions of UX. Our benchmark supports two core tasks: (1) selecting the more effective UI/UX design by predicting the A/B test verified winner and (2) assessing how well a model, given the winner, can explain its effectiveness in alignment with expert reasoning. Experiments across several MLLMs show that current models exhibit limited nuanced reasoning about UI/UX design and its behavioral impact. We believe our work will foster research in UI/UX understanding and enable broader applications such as behavior-aware interface optimization.
[134] GenKnowSub: Improving Modularity and Reusability of LLMs through General Knowledge Subtraction
Mohammadtaha Bagherifard, Sahar Rajabi, Ali Edalat, Yadollah Yaghoobzadeh
Main category: cs.CL
TL;DR: Proposes GenKnowSub, a modular framework to disentangle general knowledge and task-specific adaptations in LLMs, improving zero-shot generalization without additional training.
Details
Motivation: Addresses the entanglement of general knowledge and task-specific adaptations in LLMs, limiting zero-shot generalization.
Method: Uses a library of task-specific LoRA modules and a general-domain LoRA, applying general knowledge subtraction (GenKnowSub) to refine task-specific modules. Dynamically combines modules using the Arrow routing algorithm.
Result: Shows consistent performance gains in monolingual and cross-lingual settings on benchmarks, even with weaker LLMs like Phi-2.
Conclusion: GenKnowSub effectively improves zero-shot generalization by disentangling knowledge components, validated across diverse languages and models.
Abstract: Large language models often struggle with zero-shot generalization, and several modular approaches have been proposed to address this challenge. Yet, we hypothesize that a key limitation remains: the entanglement of general knowledge and task-specific adaptations. To overcome this, we propose a modular framework that disentangles these components by constructing a library of task-specific LoRA modules alongside a general-domain LoRA. By subtracting this general knowledge component from each task-specific module, we obtain residual modules that focus more exclusively on task-relevant information, a method we call general knowledge subtraction (GenKnowSub). Leveraging the refined task-specific modules and the Arrow routing algorithm \citep{ostapenko2024towards}, we dynamically select and combine modules for new inputs without additional training. Our studies on the Phi-3 model and standard Arrow as baselines reveal that using general knowledge LoRAs derived from diverse languages, including English, French, and German, yields consistent performance gains in both monolingual and cross-lingual settings across a wide set of benchmarks. Further experiments on Phi-2 demonstrate how GenKnowSub generalizes to weaker LLMs. The complete code and data are available at https://github.com/saharsamr/Modular-LLM.
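The subtraction is straightforward if each LoRA module is represented by its merged per-layer update ΔW = BA; whether GenKnowSub subtracts merged updates or the low-rank factors directly is an implementation detail we are assuming here:

```python
import torch

def gen_know_sub(task_lora: dict, general_loras: list) -> dict:
    """Subtract the mean general-domain update from a task-specific LoRA,
    leaving a residual module focused on task-relevant directions.
    Each dict maps layer names to merged delta-W tensors (assumed)."""
    residual = {}
    for name, delta_w in task_lora.items():
        general_mean = torch.stack([g[name] for g in general_loras]).mean(dim=0)
        residual[name] = delta_w - general_mean
    return residual
```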
[135] XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration
Nuo Chen, Andre Lin HuiKai, Jiaying Wu, Junyi Hou, Zining Zhang, Qian Wang, Xidong Wang, Bingsheng He
Main category: cs.CL
TL;DR: The paper introduces XtraGPT, a human-AI collaboration framework for academic paper revision, outperforming existing systems in scientific writing assistance.
Details
Motivation: Existing LLMs lack the capability to support high-quality scientific writing, especially in iterative, revision-driven academic workflows.
Method: Developed a dataset of 7,040 research papers with 140,000 instruction-response pairs and built XtraGPT, a suite of open-source LLMs (1.5B to 14B parameters) for context-aware writing assistance.
Result: XtraGPT outperforms same-scale baselines and approaches proprietary systems in improving scientific drafts, validated by automated and human evaluations.
Conclusion: The proposed framework effectively bridges the gap in AI-assisted scientific writing, offering scalable and high-quality revision support.
Abstract: Despite the growing adoption of large language models (LLMs) in academic workflows, their capabilities remain limited when it comes to supporting high-quality scientific writing. Most existing systems are designed for general-purpose scientific text generation and fail to meet the sophisticated demands of research communication beyond surface-level polishing, such as conceptual coherence across sections. Furthermore, academic writing is inherently iterative and revision-driven, a process not well supported by direct prompting-based paradigms. To address these challenges, we propose a human-AI collaboration framework for academic paper revision. We first introduce a comprehensive dataset of 7,040 research papers from top-tier venues annotated with over 140,000 instruction-response pairs that reflect realistic, section-level scientific revisions. Building on the dataset, we develop XtraGPT, the first suite of open-source LLMs, designed to provide context-aware, instruction-guided writing assistance, ranging from 1.5B to 14B parameters. Extensive experiments validate that XtraGPT significantly outperforms same-scale baselines and approaches the quality of proprietary systems. Both automated preference assessments and human evaluations confirm the effectiveness of our models in improving scientific drafts.
[136] ChartEdit: How Far Are MLLMs From Automating Chart Analysis? Evaluating MLLMs’ Capability via Chart Editing
Xuanle Zhao, Xuexin Liu, Haoyue Yang, Xianzhen Luo, Fanhu Zeng, Jianling Li, Qi Shi, Chi Chen
Main category: cs.CL
TL;DR: The paper introduces ChartEdit, a benchmark for evaluating multimodal large language models (MLLMs) in chart editing tasks, revealing limitations in their accuracy and performance.
Details
Motivation: Current evaluations of MLLMs for chart editing rely on limited case studies, lacking a comprehensive framework.
Method: The authors propose ChartEdit, a benchmark with 1405 diverse editing instructions on 233 real-world charts, manually annotated. They evaluate 10 MLLMs at code and chart levels.
Result: Large-scale models partially match reference images but struggle with precise edits (SOTA score: 59.96). Small-scale models perform poorly in both editing and chart generation.
Conclusion: Significant challenges remain in MLLMs’ chart editing capabilities, highlighting the need for further development.
Abstract: Although multimodal large language models (MLLMs) show promise in generating chart rendering code, editing charts via code presents a greater challenge. This task demands that MLLMs integrate chart understanding and reasoning capacities, which are labor-intensive. While many MLLMs claim such editing capabilities, current evaluations rely on limited case studies, highlighting the urgent need for a comprehensive evaluation framework. In this work, we propose \textsc{ChartEdit}, a novel benchmark designed for chart editing tasks, featuring $1405$ diverse editing instructions applied to $233$ real-world charts, each manually annotated and validated for accuracy. Utilizing \textsc{ChartEdit}, we evaluate the performance of 10 mainstream MLLMs across two types of experiments at both the code and chart levels. The results suggest that large-scale models can generate code to produce images that partially match the reference images. However, their ability to generate accurate edits according to the instructions remains limited. The state-of-the-art (SOTA) model achieves a score of only $59.96$, highlighting significant challenges in precise modification. In contrast, small-scale models, including chart-domain models, struggle both with following editing instructions and generating overall chart images, underscoring the need for further development in this area. Code is available at https://github.com/xxlllz/ChartEdit.
[137] On the Generalization vs Fidelity Paradox in Knowledge Distillation
Suhas Kamasetty Ramesh, Ayan Sengupta, Tanmoy Chakraborty
Main category: cs.CL
TL;DR: KD improves smaller LMs (up to 10%) more than larger ones (~1.3%), with teacher task expertise being key. However, KD may not preserve reasoning fidelity.
Details
Motivation: To explore KD's effectiveness for smaller LMs and the mechanisms behind knowledge transfer, which are underexplored.
Method: Large-scale empirical and statistical analysis of KD on models (0.5B-7B parameters) across 14 zero-shot reasoning tasks.
Result: Smaller models gain up to 10% performance, while larger ones see minimal gains (~1.3%). Teacher task expertise matters more than performance. KD may not maintain reasoning fidelity.
Conclusion: KD is beneficial for smaller LMs but has trade-offs, like potential reasoning fidelity loss. Teacher signals and logit smoothing are critical.
Abstract: Knowledge distillation (KD) is a key technique for compressing large language models into smaller ones while preserving performance. Despite the recent traction of KD research, its effectiveness for smaller language models (LMs) and the mechanisms driving knowledge transfer remain underexplored. In this work, we present the first large-scale empirical and statistical analysis of KD across models ranging from 0.5B to 7B parameters on 14 complex reasoning tasks in a zero-shot setting. Our findings reveal that KD can improve the average performance of smaller models by up to 10%, with a peak task-specific gain of 22%, while providing only marginal benefits (~1.3%) for larger models. Surprisingly, teacher performance has a minimal impact on student outcomes, while teacher task expertise impacts KD effectiveness. A correlation study indicates that smaller LMs benefit more from KD, whereas larger LMs show diminished gains. Additionally, we uncover a misalignment between improvements in student performance and reasoning fidelity, suggesting that while KD enhances accuracy, it does not always maintain the structured decision-making processes of the teacher. Our ablation study further highlights the importance of teacher signals and logit smoothing in influencing students’ performance after distillation. Overall, our study offers a comprehensive empirical and statistical assessment of KD, highlighting both its benefits and trade-offs when distilling knowledge from larger to smaller LMs.
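The "logit smoothing" the ablation highlights corresponds to the temperature term in the standard distillation objective; a minimal sketch of that objective (the paper's exact loss may differ):

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL between temperature-smoothed teacher and student distributions,
    scaled by T^2 to keep gradient magnitudes comparable across T."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
```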
[138] Not All Tokens Are What You Need In Thinking
Hang Yuan, Bin Yu, Haotian Li, Shijun Yang, Christina Dan Wang, Zhou Yu, Xueyin Xu, Weizhen Qi, Kai Chen
Main category: cs.CL
TL;DR: CTS is a token-level compression framework that reduces redundancy in reasoning models by selectively preserving essential tokens, improving efficiency without significant accuracy loss.
Details
Motivation: Address inefficiencies in modern reasoning models like high latency, resource consumption, and overthinking (redundant CoT tokens).
Method: Proposes Conditional Token Selection (CTS), which scores token importance and trains models on compressed CoT.
Result: CTS reduces reasoning tokens by 13.2% with a 9.1% accuracy boost on GPQA. Further compression (42% fewer tokens) causes only a 5% accuracy drop.
Conclusion: CTS effectively compresses CoT, highlighting redundancy in existing models while maintaining performance.
Abstract: Modern reasoning models, such as OpenAI’s o1 and DeepSeek-R1, exhibit impressive problem-solving capabilities but suffer from critical inefficiencies: high inference latency, excessive computational resource consumption, and a tendency toward overthinking – generating verbose chains of thought (CoT) laden with redundant tokens that contribute minimally to the final answer. To address these issues, we propose Conditional Token Selection (CTS), a token-level compression framework with a flexible and variable compression ratio that identifies and preserves only the most essential tokens in CoT. CTS evaluates each token’s contribution to deriving correct answers using conditional importance scoring, then trains models on compressed CoT. Extensive experiments demonstrate that CTS effectively compresses long CoT while maintaining strong reasoning performance. Notably, on the GPQA benchmark, Qwen2.5-14B-Instruct trained with CTS achieves a 9.1% accuracy improvement with 13.2% fewer reasoning tokens (13% training token reduction). Further reducing training tokens by 42% incurs only a marginal 5% accuracy drop while yielding a 75.8% reduction in reasoning tokens, highlighting the prevalence of redundancy in existing CoT.
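Once per-token importance scores exist, the compression step reduces to keeping a top slice of the CoT in its original order. How the conditional importance scores are computed is the paper's contribution and is treated as given here; the default ratio mirrors the reported 13.2% token reduction:

```python
import numpy as np

def compress_cot(tokens: list, importance: np.ndarray, keep_ratio: float = 0.868):
    """Keep the highest-importance CoT tokens, preserving original order."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(importance)[::-1][:k])  # top-k, re-sorted by position
    return [tokens[i] for i in keep]
```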
[139] Shifting AI Efficiency From Model-Centric to Data-Centric Compression
Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang
Main category: cs.CL
TL;DR: The paper argues for a shift from model-centric to data-centric compression (token compression) to improve AI efficiency amid hardware limits and computational bottlenecks from long token sequences.
Details
Motivation: To address the computational inefficiency caused by the quadratic cost of self-attention in large language models (LLMs) and multi-modal LLMs (MLLMs) due to ultra-long contexts, high-resolution images, and extended videos.
Method: The paper provides a unified mathematical framework for model efficiency strategies and systematically reviews token compression, analyzing its benefits and challenges.
Result: Token compression is identified as a crucial paradigm shift for reducing long-context overhead, with compelling advantages across diverse scenarios.
Conclusion: The work aims to synthesize existing research, offer a fresh perspective on AI efficiency, and inspire innovative solutions to handle increasing context lengths.
Abstract: The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gains. However, as we approach hardware limits on model size, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over long token sequences, now driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, \textbf{we argue that the focus of research for efficient AI is shifting from model-centric compression to data-centric compression}. We position token compression as the new frontier, which improves AI efficiency via reducing the number of tokens during model training or inference. Through comprehensive analysis, we first examine recent developments in long-context AI across various domains and establish a unified mathematical framework for existing model efficiency strategies, demonstrating why token compression represents a crucial paradigm shift in addressing long-context overhead. Subsequently, we systematically review the research landscape of token compression, analyzing its fundamental benefits and identifying its compelling advantages across diverse scenarios. Furthermore, we provide an in-depth analysis of current challenges in token compression research and outline promising future directions. Ultimately, our work aims to offer a fresh perspective on AI efficiency, synthesize existing research, and catalyze innovative developments to address the challenges that increasing context lengths pose to the AI community’s advancement.
[140] It’s High Time: A Survey of Temporal Question Answering
Bhawna Piryani, Abdelrahman Abdallah, Jamshid Mozafari, Avishek Anand, Adam Jatowt
Main category: cs.CL
TL;DR: A survey on Temporal Question Answering (TQA), covering challenges, neural advances, and benchmarks.
Details
Motivation: Address the growing need for systems to handle time-stamped content and temporal reasoning in question answering.
Method: Focuses on neural architectures like transformers and LLMs, with techniques like temporal language modeling and RAG.
Result: Highlights progress in temporal reasoning and recency awareness, supported by benchmark datasets.
Conclusion: TQA is advancing with neural methods, but challenges like temporal robustness remain.
Abstract: Time plays a critical role in how information is generated, retrieved, and interpreted. In this survey, we provide a comprehensive overview of Temporal Question Answering (TQA), a research area that focuses on answering questions involving temporal constraints or context. As the amount of time-stamped content from sources like news articles, web archives, and knowledge bases increases, systems must address challenges such as detecting temporal intent, normalizing time expressions, ordering events, and reasoning over evolving or ambiguous facts. We focus on recent advances in TQA enabled by neural architectures, especially transformer-based models and Large Language Models (LLMs), highlighting progress in temporal language modeling, retrieval-augmented generation (RAG), and temporal reasoning. We also discuss benchmark datasets and evaluation strategies designed to test temporal robustness, recency awareness, and generalization.
[141] Affordance Benchmark for MLLMs
Junying Wang, Wenzhe Li, Yalun Wu, Yingji Liang, Yijin Guo, Chunyi Li, Haodong Duan, Zicheng Zhang, Guangtao Zhai
Main category: cs.CL
TL;DR: A4Bench is a new benchmark evaluating MLLMs’ affordance perception, showing they lag behind humans, especially in transformative affordance.
Details
Motivation: To explore MLLMs' underexplored affordance perception, crucial for intuitive and safe interactions.
Method: Introduces A4Bench with two dimensions: Constitutive Affordance (1,282 QA pairs) and Transformative Affordance (718 QA pairs), evaluating 17 MLLMs against human performance.
Result: Proprietary models outperform open-source ones but all lag behind humans, especially in transformative affordance (e.g., Gemini-2.0-Pro: 18.05% vs. human: 85.34%).
Conclusion: Highlights gaps in MLLMs’ environmental understanding, laying groundwork for more robust, context-aware AI systems.
Abstract: Affordance theory suggests that environments inherently provide action possibilities shaping perception and behavior. While Multimodal Large Language Models (MLLMs) achieve strong performance in vision-language tasks, their ability to perceive affordance, which is crucial for intuitive and safe interactions, remains underexplored. To address this, we introduce A4Bench, a novel benchmark designed to evaluate the affordance perception abilities of MLLMs across two dimensions: 1) Constitutive Affordance, assessing understanding of inherent object properties through 1,282 question-answer pairs spanning nine sub-disciplines, and 2) Transformative Affordance, probing dynamic and contextual nuances (e.g., misleading, time-dependent, cultural, or individual-specific affordance) with 718 challenging question-answer pairs. We evaluate 17 MLLMs (nine proprietary and eight open-source) and compare them to human performance. Results show that proprietary models generally outperform open-source ones, yet all models perform far below humans, especially in transformative affordance. Furthermore, even top-performing models, such as Gemini-2.0-Pro (18.05% overall exact match accuracy), significantly lag behind human performance (best: 85.34%, worst: 81.25%). These findings highlight critical gaps in environmental understanding of MLLMs and provide a foundation for advancing AI systems toward more robust, context-aware interactions.
[142] Leaps Beyond the Seen: Reinforced Reasoning Augmented Generation for Clinical Notes
Lo Pang-Yun Ting, Chengshuai Zhao, Yu-Hua Zeng, Yuan Jee Lim, Kun-Ta Chuang, Huan Liu
Main category: cs.CL
TL;DR: ReinRAG, a reinforced RAG method, improves long-form clinical note generation by retrieving reasoning paths from a medical knowledge graph and optimizing retrieval with GRO.
Details
Motivation: Existing LLM-based methods struggle with generating long-form clinical notes from limited patient information.
Method: ReinRAG retrieves reasoning paths for semantic guidance and uses GRO for group-normalized rewards to enhance retrieval quality.
Result: ReinRAG outperforms baselines in clinical efficacy and NLG metrics, filling semantic gaps and avoiding misinterpretation.
Conclusion: ReinRAG effectively bridges information gaps and improves long-form clinical note generation.
Abstract: Clinical note generation aims to produce free-text summaries of a patient’s condition and diagnostic process, with discharge instructions being a representative long-form example. While recent LLM-based methods pre-trained on general clinical corpora show promise in clinical text generation, they fall short in producing long-form notes from limited patient information. In this paper, we propose ReinRAG, a reinforced reasoning augmented generation (RAG) for long-form discharge instructions based on pre-admission information. ReinRAG retrieves reasoning paths from a medical knowledge graph to provide explicit semantic guidance to the LLM. To bridge the information gap, we propose group-based retriever optimization (GRO) which improves retrieval quality with group-normalized rewards, encouraging reasoning leaps for deeper inference by the LLM. Comprehensive experiments on the real-world dataset show that ReinRAG outperforms baselines in both clinical efficacy and natural language generation metrics. Further analysis reveals that ReinRAG fills semantic gaps in sparse input scenarios, and retrieved reasoning paths help LLMs avoid clinical misinterpretation by focusing on key evidence and following coherent reasoning.
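Group-based retriever optimization suggests GRPO-style reward normalization, where each sampled candidate is scored relative to its group rather than absolutely; a sketch under that assumption (GRO's exact reward definition is in the paper):

```python
import numpy as np

def group_normalized_rewards(rewards: np.ndarray) -> np.ndarray:
    """Standardize rewards within one prompt's group of sampled candidates,
    so the retriever is updated by relative rather than absolute quality."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```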
[143] RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval
Minhae Oh, Jeonghye Kim, Nakyung Lee, Donggeon Seo, Taeuk Kim, Jungwoo Lee
Main category: cs.CL
TL;DR: RAISE is a retrieval-augmented framework for scientific reasoning, outperforming baselines by retrieving logically relevant documents.
Details
Motivation: Scientific reasoning involves complex processes, domain knowledge, and adapting to new findings, requiring a robust solution.
Method: RAISE uses problem decomposition, logical query generation, and logical retrieval to retrieve relevant documents.
Result: RAISE consistently outperforms baselines on scientific reasoning benchmarks by retrieving logically relevant documents.
Conclusion: RAISE effectively addresses scientific reasoning challenges by combining domain knowledge and logical relevance in retrieval.
Abstract: Scientific reasoning requires not only long-chain reasoning processes, but also knowledge of domain-specific terminologies and adaptation to updated findings. To deal with these challenges for scientific reasoning, we introduce RAISE, a step-by-step retrieval-augmented framework which retrieves logically relevant documents from an in-the-wild corpus. RAISE is divided into three steps: problem decomposition, logical query generation, and logical retrieval. We observe that RAISE consistently outperforms other baselines on scientific reasoning benchmarks. Our analysis shows that, unlike other baselines, RAISE retrieves documents that are not only similar in terms of domain knowledge but also logically more relevant.
[144] Extrapolation by Association: Length Generalization Transfer in Transformers
Ziyang Cai, Nayoung Lee, Avi Schwarzschild, Samet Oymak, Dimitris Papailiopoulos
Main category: cs.CL
TL;DR: Transformer models can transfer length generalization across related tasks, enabling extrapolation to longer inputs. This capability is linked to the reuse of attention heads between tasks.
Details
Motivation: To understand how transformer models generalize to longer inputs and whether this ability can be transferred across related tasks.
Method: Investigates length generalization through task association, training models with longer auxiliary tasks to test generalization on target tasks. Experiments include arithmetic, string transformations, and maze navigation.
Result: Models generalize to longer inputs when trained with related auxiliary tasks. Pretrained models also show transfer effects, suggesting reusable computational scaffolding. Attention head reuse correlates with transfer.
Conclusion: Transformers generalize by reusing inductive structures across tasks, with attention head reuse playing a key role. This deepens understanding of out-of-distribution generalization.
Abstract: Transformer language models have demonstrated impressive generalization capabilities in natural language domains, yet we lack a fine-grained understanding of how such generalization arises. In this paper, we investigate length generalization–the ability to extrapolate from shorter to longer inputs–through the lens of \textit{task association}. We find that length generalization can be \textit{transferred} across related tasks. That is, training a model with a longer and related auxiliary task can lead it to generalize to unseen and longer inputs from some other target task. We demonstrate this length generalization transfer across diverse algorithmic tasks, including arithmetic operations, string transformations, and maze navigation. Our results show that transformer models can inherit generalization capabilities from similar tasks when trained jointly. Moreover, we observe similar transfer effects in pretrained language models, suggesting that pretraining equips models with reusable computational scaffolding that facilitates extrapolation in downstream settings. Finally, we provide initial mechanistic evidence that length generalization transfer correlates with the re-use of the same attention heads between the tasks. Together, our findings deepen our understanding of how transformers generalize to out-of-distribution inputs and highlight the compositional reuse of inductive structure across tasks.
[145] Reasoning with Exploration: An Entropy Perspective on Reinforcement Learning for LLMs
Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, Furu Wei
Main category: cs.CL
TL;DR: The paper explores the role of entropy in enhancing exploratory reasoning in LLMs, introducing a simple RL modification to improve reasoning depth and achieve significant performance gains.
Details
Motivation: To address the imbalance between exploration and exploitation in LLMs, which often leads to performance plateaus, by leveraging entropy as a signal for exploratory reasoning.
Method: Augments the advantage function in RL with an entropy-based term to promote longer and deeper reasoning chains, unlike traditional maximum-entropy methods.
Result: Positive correlations between high-entropy regions and exploratory reasoning actions, with significant improvements on the Pass@K metric, even for large K values.
Conclusion: The proposed minimal modification effectively enhances LLM reasoning by balancing exploration and exploitation, pushing the boundaries of reasoning capabilities.
Abstract: Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing large language model (LLM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy – a signal of exploration in RL – and examine its relationship to exploratory reasoning in LLMs. Through empirical analysis, we uncover positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LLMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods which encourage exploration by promoting uncertainty, we encourage exploration by promoting longer and deeper reasoning chains. Notably, our method achieves significant gains on the Pass@K metric – an upper-bound estimator of LLM reasoning capabilities – even when evaluated with extremely large K values, pushing the boundaries of LLM reasoning.
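The described one-line change is an entropy bonus on the advantage rather than on the loss; a sketch under assumed tensor shapes (batch, seq, vocab), with the coefficient kappa as our placeholder:

```python
import torch

def entropy_shaped_advantage(advantage, logits, kappa: float = 0.01):
    """Add the policy's per-token entropy to the advantage, rewarding
    steps taken under uncertainty instead of suppressing them."""
    logp = torch.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)  # (batch, seq)
    return advantage + kappa * entropy.detach()
```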
[146] FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning
Natapong Nitarach, Warit Sirichotedumrong, Panop Pitchayarthorn, Pittawat Taveekitworachai, Potsawee Manakul, Kunat Pipatanakul
Main category: cs.CL
TL;DR: FinCoT is a structured CoT prompting framework for financial NLP, incorporating expert reasoning blueprints to improve model accuracy and efficiency.
Details
Motivation: Prior work in FinNLP lacks structured CoT with domain expertise, prompting the need for FinCoT.
Method: FinCoT evaluates three prompting styles (zero-shot, unstructured CoT, structured CoT) and introduces structured finance-specific reasoning blueprints.
Result: FinCoT boosts accuracy (e.g., Qwen3-8B-Base from 63.2% to 80.5%) and reduces output length (up to 8.9x).
Conclusion: FinCoT enhances performance, reduces costs, and improves interpretability, especially for models without financial post-training.
Abstract: This paper presents FinCoT, a structured chain-of-thought (CoT) prompting framework that embeds domain-specific expert financial reasoning blueprints to guide large language models’ behaviors. We identify three main prompting styles in financial NLP (FinNLP): (1) standard prompting (zero-shot), (2) unstructured CoT (free-form reasoning), and (3) structured CoT (with explicitly structured reasoning steps). Prior work has mainly focused on the first two, while structured CoT remains underexplored and lacks domain expertise incorporation. Therefore, we evaluate all three prompting approaches across ten CFA-style financial domains and introduce FinCoT as the first structured finance-specific prompting approach incorporating blueprints from domain experts. FinCoT improves the accuracy of a general-purpose model, Qwen3-8B-Base, from 63.2% to 80.5%, and boosts Fin-R1 (7B), a finance-specific model, from 65.7% to 75.7%, while reducing output length by up to 8.9x and 1.16x compared to structured CoT methods, respectively. We find that FinCoT proves most effective for models lacking financial post-training. Our findings show that FinCoT not only improves performance and reduces inference costs but also yields more interpretable and expert-aligned reasoning traces.
[147] Large Language Models in Argument Mining: A Survey
Hao Li, Viktor Schlegel, Yizheng Sun, Riza Batista-Navarro, Goran Nenadic
Main category: cs.CL
TL;DR: A survey on how Large Language Models (LLMs) have revolutionized Argument Mining (AM), covering foundational theories, datasets, subtasks, methodologies, challenges, and future research directions.
Details
Motivation: To systematically synthesize recent advancements in LLM-driven AM and guide researchers in this rapidly evolving field.
Method: Review of foundational theories, annotation frameworks, datasets, and taxonomy of AM subtasks. Analysis of LLM techniques like prompting, chain-of-thought reasoning, and retrieval augmentation.
Result: Comprehensive taxonomy of AM subtasks, detailed LLM methodologies, critical evaluation practices, and identified challenges (e.g., long-context reasoning, interpretability).
Conclusion: Emerging trends and a forward-looking research agenda are proposed to strategically advance LLM-based computational argumentation.
Abstract: Argument Mining (AM), a critical subfield of Natural Language Processing (NLP), focuses on extracting argumentative structures from text. The advent of Large Language Models (LLMs) has profoundly transformed AM, enabling advanced in-context learning, prompt-based generation, and robust cross-domain adaptability. This survey systematically synthesizes recent advancements in LLM-driven AM. We provide a concise review of foundational theories and annotation frameworks, alongside a meticulously curated catalog of datasets. A key contribution is our comprehensive taxonomy of AM subtasks, elucidating how contemporary LLM techniques – such as prompting, chain-of-thought reasoning, and retrieval augmentation – have reconfigured their execution. We further detail current LLM architectures and methodologies, critically assess evaluation practices, and delineate pivotal challenges including long-context reasoning, interpretability, and annotation bottlenecks. Conclusively, we highlight emerging trends and propose a forward-looking research agenda for LLM-based computational argumentation, aiming to strategically guide researchers in this rapidly evolving domain.
[148] MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning
Muyang Zheng, Yuanzhi Yao, Changting Lin, Rui Wang, Caihong Kai
Main category: cs.CL
TL;DR: MIST is a method for jailbreaking black-box LLMs via iterative semantic tuning, achieving high success rates and efficiency.
Details
Motivation: LLMs remain vulnerable to jailbreak attacks despite alignment efforts, prompting the need for effective attack methods.
Method: MIST uses iterative semantic tuning with sequential synonym search and order-determining optimization to refine prompts.
Result: MIST outperforms state-of-the-art methods in attack success rate, query efficiency, and transferability.
Conclusion: MIST is a practical and efficient method for jailbreaking black-box LLMs, validated by extensive experiments.
Abstract: Despite efforts to align large language models (LLMs) with societal and moral values, these models remain susceptible to jailbreak attacks – methods designed to elicit harmful responses. Jailbreaking black-box LLMs is considered challenging due to the discrete nature of token inputs, restricted access to the target LLM, and limited query budget. To address the issues above, we propose an effective method for jailbreaking black-box large language Models via Iterative Semantic Tuning, named MIST. MIST enables attackers to iteratively refine prompts that preserve the original semantic intent while inducing harmful content. Specifically, to balance semantic similarity with computational efficiency, MIST incorporates two key strategies: sequential synonym search, and its advanced version – order-determining optimization. We conduct extensive experiments on two datasets using two open-source and four closed-source models. Results show that MIST achieves competitive attack success rate, relatively low query count, and fair transferability, outperforming or matching state-of-the-art jailbreak methods. Additionally, we conduct analysis on computational efficiency to validate the practical viability of MIST.
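Sequential synonym search can be pictured as a greedy left-to-right pass over the prompt. In this sketch, `score` stands for whatever black-box signal the attacker optimizes and is purely hypothetical, as is the use of WordNet for the synonym pool:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def synonyms(word: str) -> set:
    return {l.name().replace("_", " ")
            for s in wn.synsets(word) for l in s.lemmas()} - {word}

def sequential_synonym_search(prompt: str, score) -> str:
    """At each position, keep the synonym swap that most increases the
    attacker-defined score while preserving the sentence's intent."""
    words = prompt.split()
    for i, original in enumerate(words):
        best, best_score = original, score(" ".join(words))
        for candidate in synonyms(original):
            words[i] = candidate
            s = score(" ".join(words))
            if s > best_score:
                best, best_score = candidate, s
        words[i] = best
    return " ".join(words)
```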
[149] Can Reasoning Help Large Language Models Capture Human Annotator Disagreement?
Jingwei Ni, Yu Fan, Vilém Zouhar, Donya Rooein, Alexander Hoyle, Mrinmaya Sachan, Markus Leippold, Dirk Hovy, Elliott Ash
Main category: cs.CL
TL;DR: RLVR-style reasoning worsens LLM disagreement modeling, while naive CoT reasoning improves it, highlighting risks of replacing human annotators with reasoning LLMs.
Details
Motivation: To assess if RLVR-style reasoning helps LLMs capture human annotation variation, given its success in other tasks.
Method: Systematic evaluation across 60 setups (model sizes, distribution expression, steering methods) in 3 tasks.
Result: RLVR degrades disagreement modeling; naive CoT improves RLHF LLM performance.
Conclusion: Caution needed when replacing human annotators with reasoning LLMs, as disagreements are informative.
Abstract: Variation in human annotation (i.e., disagreements) is common in NLP, often reflecting important information like task subjectivity and sample ambiguity. Modeling this variation is important for applications that are sensitive to such information. Although RLVR-style reasoning (Reinforcement Learning with Verifiable Rewards) has improved Large Language Model (LLM) performance on many tasks, it remains unclear whether such reasoning enables LLMs to capture informative variation in human annotation. In this work, we evaluate the influence of different reasoning settings on LLM disagreement modeling. We systematically evaluate each reasoning setting across model sizes, distribution expression methods, and steering methods, resulting in 60 experimental setups across 3 tasks. Surprisingly, our results show that RLVR-style reasoning degrades performance in disagreement modeling, while naive Chain-of-Thought (CoT) reasoning improves the performance of RLHF LLMs (RL from human feedback). These findings underscore the potential risk of replacing human annotators with reasoning LLMs, especially when disagreements are important.
[150] FloorPlan-DeepSeek (FPDS): A multimodal approach to floorplan generation using vector-based next room prediction
Jun Yin, Pengyu Zeng, Jing Zhong, Peilin Li, Miao Zhang, Ran Luo, Shuai Lu
Main category: cs.CL
TL;DR: The paper proposes a ’next room prediction’ paradigm for floor plan generation, inspired by autoregressive models, to better align with iterative architectural workflows.
Details
Motivation: Existing generative models for floor plans are end-to-end and incompatible with incremental real-world architectural workflows.
Method: The paper introduces a ’next room prediction’ approach, inspired by autoregressive models like those in language models, tailored for floor plan generation.
Result: FPDS shows competitive performance compared to diffusion models and Tell2Design in text-to-floorplan tasks.
Conclusion: The proposed method has potential for supporting intelligent architectural design by aligning with iterative practices.
Abstract: In the architectural design process, floor plan generation is inherently progressive and iterative. However, existing generative models for floor plans are predominantly end-to-end, producing an entire pixel-based layout in a single pass. This paradigm is often incompatible with the incremental workflows observed in real-world architectural practice. To address this issue, we draw inspiration from the autoregressive ’next token prediction’ mechanism commonly used in large language models, and propose a novel ’next room prediction’ paradigm tailored to architectural floor plan modeling. Experimental evaluation indicates that FPDS demonstrates competitive performance in comparison to diffusion models and Tell2Design in the text-to-floorplan task, indicating its potential applicability in supporting future intelligent architectural design.
[151] LinguaSynth: Heterogeneous Linguistic Signals for News Classification
Duo Zhang, Junyi Mo
Main category: cs.CL
TL;DR: LinguaSynth is a transparent NLP framework combining five linguistic features for text classification, outperforming TF-IDF and maintaining interpretability without deep learning.
Details
Motivation: Addressing interpretability and efficiency issues in deep learning-based NLP by proposing a transparent alternative.
Method: Integrates lexical, syntactic, entity-level, word-level semantics, and document-level semantics in a logistic regression model.
Result: Achieves 84.89% accuracy on 20 Newsgroups, surpassing TF-IDF by 3.32%.
Conclusion: LinguaSynth proves interpretable, efficient NLP is possible without deep neural networks, setting a new benchmark.
Abstract: Deep learning has significantly advanced NLP, but its reliance on large black-box models introduces critical interpretability and computational efficiency concerns. This paper proposes LinguaSynth, a novel text classification framework that strategically integrates five complementary linguistic feature types: lexical, syntactic, entity-level, word-level semantics, and document-level semantics within a transparent logistic regression model. Unlike transformer-based architectures, LinguaSynth maintains interpretability and computational efficiency, achieving an accuracy of 84.89 percent on the 20 Newsgroups dataset and surpassing a robust TF-IDF baseline by 3.32 percent. Through rigorous feature interaction analysis, we show that syntactic and entity-level signals provide essential disambiguation and effectively complement distributional semantics. LinguaSynth sets a new benchmark for interpretable, resource-efficient NLP models and challenges the prevailing assumption that deep neural networks are necessary for high-performing text classification.
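As a rough, hedged approximation of the recipe, the sketch below feeds two heterogeneous text signals (lexical TF-IDF plus character n-grams as a crude syntactic proxy) into a transparent logistic regression on 20 Newsgroups; the paper combines five signal types, including entity-level and semantic features, which are omitted here. Note that `fetch_20newsgroups` downloads the dataset on first use.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, make_pipeline

# Two of the five signal families, concatenated side by side.
features = FeatureUnion([
    ("lexical", TfidfVectorizer(max_features=20000)),
    ("char_ngrams", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5),
                                    max_features=20000)),
])
clf = make_pipeline(features, LogisticRegression(max_iter=1000))

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")
clf.fit(train.data, train.target)
print("accuracy:", clf.score(test.data, test.target))
```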
[152] What to Keep and What to Drop: Adaptive Table Filtering Framework
WonJune Jang
Main category: cs.CL
TL;DR: ATF is a question-aware filtering framework for large tables, reducing input size by 70% and improving TableQA performance, with slight trade-offs in Table Fact Verification.
Details
Motivation: Addressing input length limits in LLMs for table-based reasoning by pruning uninformative table data.
Method: Uses LLM-generated column descriptions, clustering, and sparse-dense alignment scores to filter columns and rows.
Result: Reduces table cells by 70%, improves TableQA performance, but slightly drops Table Fact Verification accuracy.
Conclusion: ATF effectively balances informativeness and minimalism, integrating with existing models without retraining.
Abstract: Large language models (LLMs) for table-based reasoning often struggle with large tables due to input length limits. We propose ATF (Adaptive Table Filtering Framework), a modular and question-aware filtering pipeline that prunes uninformative columns and rows using LLM-generated column descriptions, clustering, and sparse-dense alignment scores. ATF integrates seamlessly with existing models (e.g., TAPAS, TAPEX) without retraining. Experiments show that ATF reduces table cells by 70%, boosting performance on out-of-domain TableQA tasks while causing slight performance drops on Table Fact Verification, where full-table context is more critical. These results highlight ATF’s ability to adaptively balance informativeness and minimalism across tasks. Our code is available at: https://github.com/torijune/ATF-Adaptive-Table-Filtering-Framework
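A minimal sketch of the question-aware column-filtering idea: score each column's description against the question and keep the top-k. TF-IDF cosine similarity stands in for the paper's sparse-dense alignment scores, the descriptions would come from an LLM in the real pipeline, and the clustering and row-filtering steps are omitted.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def filter_columns(table: pd.DataFrame, descriptions: dict, question: str,
                   k: int = 2) -> pd.DataFrame:
    cols = list(table.columns)
    vec = TfidfVectorizer()
    mat = vec.fit_transform([question] + [descriptions[c] for c in cols])
    scores = cosine_similarity(mat[0], mat[1:]).ravel()
    keep = [c for _, c in sorted(zip(scores, cols), reverse=True)[:k]]
    return table[keep]  # pruned table fed to the downstream QA model

table = pd.DataFrame({"team": ["A", "B"], "wins": [3, 1], "city": ["X", "Y"]})
descriptions = {"team": "name of the sports team",
                "wins": "number of games the team won",
                "city": "home city"}
print(filter_columns(table, descriptions, "How many games did team A win?"))
```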
[153] Generalizing Verifiable Instruction Following
Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, Hannaneh Hajishirzi
Main category: cs.CL
TL;DR: The paper introduces IFBench, a benchmark for evaluating precise instruction following in AI models, and proposes reinforcement learning with verifiable rewards (RLVR) to improve generalization.
Details
Motivation: Current language models struggle with adhering to diverse and unseen output constraints in human instructions, limiting their practical utility.
Method: The authors design IFBench with 58 new constraints, develop constraint verification modules, and use RLVR for training.
Result: RLVR significantly improves models’ ability to follow precise instructions, as demonstrated on IFBench.
Conclusion: The work provides tools (IFBench, training constraints, RLVR prompts, and code) to advance research in precise instruction following.
Abstract: A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. A common feature of instructions is output constraints like "only answer with yes or no" or "mention the word 'abrakadabra' at least 3 times" that the user adds to craft a more useful answer. Even today’s strongest models struggle with fulfilling such constraints. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.
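The "verifiable" part of such constraints is that each one maps to a deterministic checker over the model output. A small illustration using the two constraints quoted in the abstract (these functions are illustrative, not the benchmark's actual verification code):

```python
import re

def mentions_at_least(text: str, word: str, n: int) -> bool:
    """Verify 'mention the word <word> at least <n> times'."""
    return len(re.findall(rf"\b{re.escape(word)}\b", text, re.IGNORECASE)) >= n

def answers_only_yes_or_no(text: str) -> bool:
    """Verify 'only answer with yes or no'."""
    return text.strip().lower() in {"yes", "no"}

assert mentions_at_least("abrakadabra, abrakadabra, abrakadabra!",
                         "abrakadabra", 3)
assert answers_only_yes_or_no("Yes")
```

Checkers like these can double as the reward signal during training, which is how the paper ties constraint verification to RLVR.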
[154] FlexOlmo: Open Language Models for Flexible Data Use
Weijia Shi, Akshita Bhagia, Kevin Farhat, Niklas Muennighoff, Pete Walsh, Jacob Morrison, Dustin Schwenk, Shayne Longpre, Jake Poznanski, Allyson Ettinger, Daogao Liu, Margaret Li, Dirk Groeneveld, Mike Lewis, Wen-tau Yih, Luca Soldaini, Kyle Lo, Noah A. Smith, Luke Zettlemoyer, Pang Wei Koh, Hannaneh Hajishirzi, Ali Farhadi, Sewon Min
Main category: cs.CL
TL;DR: FlexOlmo is a new language model class enabling distributed training without data sharing and flexible inference, outperforming prior methods by 10.1% and improving performance by 41% when combining experts.
Details
Motivation: Addresses the need for training models on closed datasets without data sharing, respecting data owners' preferences and regulatory requirements.
Method: Uses a mixture-of-experts (MoE) architecture with independently trained experts and domain-informed routing, trained on the FlexMix corpus.
Result: Achieves a 41% relative improvement by combining experts and outperforms prior merging methods by 10.1%.
Conclusion: FlexOlmo offers a solution for regulated industries, enabling the use of closed data while respecting data access preferences.
Abstract: We introduce FlexOlmo, a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included or excluded from model inferences with no further training. FlexOlmo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on closed datasets and later integrated through a new domain-informed routing without any joint training. FlexOlmo is trained on FlexMix, a corpus we curate comprising publicly available datasets alongside seven domain-specific sets, representing realistic approximations of closed sets. We evaluate models with up to 37 billion parameters (20 billion active) on 31 diverse downstream tasks. We show that a general expert trained on public data can be effectively combined with independently trained experts from other data owners, leading to an average 41% relative improvement while allowing users to opt out of certain data based on data licensing or permission requirements. Our approach also outperforms prior model merging methods by 10.1% on average and surpasses the standard MoE trained without data restrictions using the same training FLOPs. Altogether, this research presents a solution for both data owners and researchers in regulated industries with sensitive or protected data. FlexOlmo enables benefiting from closed data while respecting data owners’ preferences by keeping their data local and supporting fine-grained control of data access during inference.
[155] Distillation versus Contrastive Learning: How to Train Your Rerankers
Zhichao Xu, Zhiqi Huang, Shengyao Zhuang, Vivek Srikumar
Main category: cs.CL
TL;DR: The paper compares contrastive learning and knowledge distillation for training text rerankers, finding distillation superior when using a larger teacher model, but not with same-capacity teachers.
Details
Motivation: To clarify the effectiveness of contrastive learning vs. knowledge distillation for training cross-encoder rerankers under practical conditions.
Method: Empirical comparison by training rerankers of varying sizes/architectures using both methods on the same data, with a strong contrastive model as the teacher.
Result: Knowledge distillation outperforms contrastive learning with a larger teacher, but not with same-capacity teachers, especially for out-of-domain tasks.
Conclusion: Use knowledge distillation with a larger teacher for smaller rerankers; otherwise, contrastive learning is a robust baseline.
Abstract: Training effective text rerankers is crucial for information retrieval. Two strategies are widely used: contrastive learning (optimizing directly on ground-truth labels) and knowledge distillation (transferring knowledge from a larger reranker). While both have been studied extensively, a clear comparison of their effectiveness for training cross-encoder rerankers under practical conditions is needed. This paper empirically compares these strategies by training rerankers of different sizes and architectures using both methods on the same data, with a strong contrastive learning model acting as the distillation teacher. Our results show that knowledge distillation generally yields better in-domain and out-of-domain ranking performance than contrastive learning when distilling from a larger teacher model. This finding is consistent across student model sizes and architectures. However, distilling from a teacher of the same capacity does not provide the same advantage, particularly for out-of-domain tasks. These findings offer practical guidance for choosing a training strategy based on available teacher models. We recommend using knowledge distillation to train smaller rerankers if a larger, more powerful teacher is accessible; in its absence, contrastive learning remains a robust baseline.
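The two objectives can be stated side by side for a reranker scoring one positive and several negatives per query; this is a generic formulation of the setup, and the temperature, negative sampling, and normalization details here are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(student_scores: torch.Tensor) -> torch.Tensor:
    # scores: (batch, 1 + num_negatives), positive always in column 0
    labels = torch.zeros(student_scores.size(0), dtype=torch.long)
    return F.cross_entropy(student_scores, labels)

def distillation_loss(student_scores: torch.Tensor,
                      teacher_scores: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    # match the student's score distribution to the teacher's
    s = F.log_softmax(student_scores / temperature, dim=-1)
    t = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

scores = torch.randn(4, 8)    # toy student scores for 4 queries, 8 candidates
teacher = torch.randn(4, 8)   # toy teacher scores
print(contrastive_loss(scores), distillation_loss(scores, teacher))
```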
[156] Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models
Junjie Wu, Gefei Gu, Yanan Zheng, Dit-Yan Yeung, Arman Cohan
Main category: cs.CL
TL;DR: The paper introduces Ref-Long, a benchmark to evaluate long-context referencing in LCLMs, revealing limitations in current models like GPT-4o.
Details
Motivation: Long-context referencing is underexplored despite its importance in LCLMs, prompting the need for a dedicated benchmark.
Method: Ref-Long assesses LCLMs by requiring them to identify document indexes referencing a key, with three subsets from synthetic to realistic scenarios.
Result: Tests on 13 LCLMs show significant shortcomings in referencing, even in advanced models. Analyses include human evaluations and fine-tuning.
Conclusion: Ref-Long highlights gaps in LCLMs’ referencing abilities and provides insights for improvement.
Abstract: Long-context language models (LCLMs) have exhibited impressive capabilities in long-context understanding tasks. Among these, long-context referencing – a crucial task that requires LCLMs to attribute items of interest to specific parts of long-context data – remains underexplored. To bridge this gap, this paper proposes Referencing Evaluation for Long-context Language Models (Ref-Long), a novel benchmark designed to assess the long-context referencing capability of LCLMs. Specifically, Ref-Long requires LCLMs to identify the indexes of documents that reference a specific key, emphasizing contextual relationships between the key and the documents over simple retrieval. Based on the task design, we construct three subsets ranging from synthetic to realistic scenarios to form the Ref-Long benchmark. Experimental results of 13 LCLMs reveal significant shortcomings in long-context referencing, even among advanced models like GPT-4o. To further investigate these challenges, we conduct comprehensive analyses, including human evaluations, task format adjustments, fine-tuning experiments, and error analyses, leading to several key insights. Our data and code can be found at https://github.com/wujunjie1998/Ref-Long.
[157] LLMs on Trial: Evaluating Judicial Fairness for Large Language Models
Yiran Hu, Zongyue Xue, Haitao Li, Siyuan Zheng, Qingjing Chen, Shaochun Wang, Xihan Zhang, Ning Zheng, Yun Liu, Qingyao Ai, Yiqun Liu, Charles L. A. Clarke, Weixing Shen
Main category: cs.CL
TL;DR: The paper investigates judicial fairness in LLMs, revealing pervasive bias and inconsistency, and introduces a toolkit for future research.
Details
Motivation: To address underexplored judicial fairness of LLMs in high-stakes fields, ensuring trustworthiness in their decisions.
Method: Develops a fairness framework with 65 labels and 161 values, compiles the JudiFair dataset, and introduces three evaluation metrics (inconsistency, bias, imbalanced inaccuracy) to assess 16 LLMs.
Result: LLMs exhibit severe judicial unfairness, with pronounced biases on demographic labels. Adjusting temperature affects fairness, but model size, release date, and origin do not.
Conclusion: The study highlights LLM fairness issues and provides a toolkit to aid future research in evaluating and improving fairness.
Abstract: Large Language Models (LLMs) are increasingly used in high-stakes fields where their decisions impact rights and equity. However, LLMs’ judicial fairness and implications for social justice remain underexplored. When LLMs act as judges, the ability to fairly resolve judicial issues is a prerequisite to ensure their trustworthiness. Based on theories of judicial fairness, we construct a comprehensive framework to measure LLM fairness, leading to a selection of 65 labels and 161 corresponding values. Applying this framework to the judicial system, we compile an extensive dataset, JudiFair, comprising 177,100 unique case facts. To achieve robust statistical inference, we develop three evaluation metrics, inconsistency, bias, and imbalanced inaccuracy, and introduce a method to assess the overall fairness of multiple LLMs across various labels. Through experiments with 16 LLMs, we uncover pervasive inconsistency, bias, and imbalanced inaccuracy across models, underscoring severe LLM judicial unfairness. Particularly, LLMs display notably more pronounced biases on demographic labels, with slightly less bias on substance labels compared to procedure ones. Interestingly, increased inconsistency correlates with reduced biases, but more accurate predictions exacerbate biases. While we find that adjusting the temperature parameter can influence LLM fairness, model size, release date, and country of origin do not exhibit significant effects on judicial fairness. Accordingly, we introduce a publicly available toolkit containing all datasets and code, designed to support future research in evaluating and improving LLM fairness.
[158] HuggingGraph: Understanding the Supply Chain of LLM Ecosystem
Mohammad Shahedur Rahman, Runbang Hu, Peng Gao, Yuede Ji
Main category: cs.CL
TL;DR: The paper explores vulnerabilities and biases in LLMs by analyzing relationships between models and datasets in the LLM supply chain, using a large directed heterogeneous graph.
Details
Motivation: The increasing reliance on pre-trained models and datasets in LLMs raises concerns about inherited vulnerabilities and biases, necessitating a deeper understanding of their origins.
Method: The study collects LLM supply chain data, constructs a directed heterogeneous graph (402,654 nodes, 462,524 edges), and performs analyses to uncover relationships.
Result: The analysis reveals insights into model-dataset dependencies, aiding in risk detection, fairness improvement, and regulatory compliance.
Conclusion: Understanding LLM supply chains is crucial for mitigating risks and ensuring ethical AI development.
Abstract: Large language models (LLMs) leverage deep learning architectures to process and predict sequences of words based on context, enabling them to perform a wide range of natural language processing tasks, such as translation, summarization, question answering, and content generation. However, the increasing size and complexity of developing, training, and deploying cutting-edge LLMs demand extensive computational resources and large-scale datasets. This creates a significant barrier for researchers and practitioners. Because of that, platforms that host models and datasets have gained widespread popularity. For example, on one of the most popular platforms, i.e., Hugging Face, there are more than 1.8 million models and more than 450K datasets by the end of June 2025, and the trend does not show any slowdown. As existing LLMs are often built from base models or other pretrained models and use external datasets, they can inevitably inherit vulnerabilities, biases, or malicious components that exist in previous models or datasets. Therefore, it is critical to understand these components’ origin and development process to detect potential risks better, improve model fairness, and ensure compliance with regulatory frameworks. Motivated by that, this project aims to study such relationships between models and datasets, which are the central parts of the LLM supply chain. First, we design a methodology to collect LLMs’ supply chain information systematically. With the collected information, we design a new graph to model the relationships between models and datasets, which is a large directed heterogeneous graph having 402,654 nodes and 462,524 edges. Then, on top of this graph, we perform different types of analysis and make multiple interesting findings.
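A toy reconstruction of the supply-chain graph idea with networkx: typed nodes for models and datasets, typed directed edges for provenance, and an ancestor query for the inherited risk surface. The node names and relation labels below are made up for illustration, not taken from the paper's schema.

```python
import networkx as nx

G = nx.DiGraph()
G.add_node("base-7b", kind="model")
G.add_node("chat-7b", kind="model")
G.add_node("web-corpus", kind="dataset")
G.add_node("instruct-set", kind="dataset")
G.add_edge("web-corpus", "base-7b", relation="trained_on")
G.add_edge("base-7b", "chat-7b", relation="finetuned_from")
G.add_edge("instruct-set", "chat-7b", relation="trained_on")

# Provenance query: everything upstream of chat-7b, i.e. the models and
# datasets whose vulnerabilities or biases it could have inherited.
print(nx.ancestors(G, "chat-7b"))  # -> {'web-corpus', 'base-7b', 'instruct-set'}
```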
[159] SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models
Wonjun Jeong, Dongseok Kim, Taegkeun Whangbo
Main category: cs.CL
TL;DR: SCOPE is a framework to mitigate selection bias in LLM evaluations by redistributing answer slots and blocking near-miss guesses, improving fairness and reliability.
Details
Motivation: LLMs exploit biases in option positions or labels, inflating scores without genuine understanding, necessitating a fairer evaluation method.
Method: SCOPE uses a null prompt to estimate position-bias, redistributes answer slots inversely, and prevents semantically similar distractors from being adjacent.
Result: SCOPE outperforms existing debiasing methods, showing stable performance improvements and clearer confidence distributions.
Conclusion: SCOPE enhances fairness and reliability in LLM evaluations, setting a new standard for debiasing.
Abstract: Large Language Models (LLMs) can achieve inflated scores on multiple-choice tasks by exploiting inherent biases in option positions or labels, rather than demonstrating genuine understanding. This study introduces SCOPE, an evaluation framework designed to measure and mitigate such selection bias in a dataset-independent manner. By repeatedly invoking a null prompt that lacks semantic content, SCOPE estimates each model’s unique position-bias distribution. It then redistributes the answer slot according to the inverse-bias distribution, thereby equalizing the lucky-rate, the probability of selecting the correct answer by chance. Furthermore, it prevents semantically similar distractors from being placed adjacent to the answer, thereby blocking near-miss guesses based on superficial proximity cues. Across multiple benchmark experiments, SCOPE consistently outperformed existing debiasing methods in terms of stable performance improvements and showed clearer confidence distributions over correct options. This framework thus offers a new standard for enhancing the fairness and reliability of LLM evaluations.
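The slot-redistribution step can be stated in a few lines: estimate per-slot selection bias from null prompts, then place the correct answer with probability proportional to the inverse bias, so each slot's chance contribution is equalized. The bias numbers below are invented, and the distractor-adjacency rule is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical measured frequency of the model picking each of 4 slots
# on semantics-free null prompts.
position_bias = np.array([0.40, 0.25, 0.20, 0.15])

inverse = 1.0 / position_bias
placement = inverse / inverse.sum()   # answer-slot placement distribution

# Under this placement, each slot contributes equally to the lucky-rate:
print(position_bias * placement)      # every entry equal

answer_slot = rng.choice(4, p=placement)
print("correct answer goes in slot", answer_slot)
```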
[160] CaliDrop: KV Cache Compression with Calibration
Yi Su, Quantong Qiu, Yuechi Zhou, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
Main category: cs.CL
TL;DR: CaliDrop enhances token eviction in KV cache compression for LLMs by using speculative calibration on discarded tokens, reducing accuracy loss.
Details
Motivation: The memory footprint of KV cache grows linearly with sequence length, batch size, and model size, creating bottlenecks in long-context scenarios. Existing token eviction methods degrade accuracy under high compression.
Method: Proposes CaliDrop, a strategy that leverages high similarity of queries at nearby positions to perform speculative calibration on discarded tokens.
Result: CaliDrop significantly improves the accuracy of existing token eviction methods.
Conclusion: CaliDrop effectively mitigates accuracy degradation in token eviction, enhancing KV cache compression for LLMs.
Abstract: Large Language Models (LLMs) require substantial computational resources during generation. While the Key-Value (KV) cache significantly accelerates this process by storing attention intermediates, its memory footprint grows linearly with sequence length, batch size, and model size, creating a bottleneck in long-context scenarios. Various KV cache compression techniques, including token eviction, quantization, and low-rank projection, have been proposed to mitigate this bottleneck, often complementing each other. This paper focuses on enhancing token eviction strategies. Token eviction leverages the observation that the attention patterns are often sparse, allowing for the removal of less critical KV entries to save memory. However, this reduction usually comes at the cost of notable accuracy degradation, particularly under high compression ratios. To address this issue, we propose \textbf{CaliDrop}, a novel strategy that enhances token eviction through calibration. Our preliminary experiments show that queries at nearby positions exhibit high similarity. Building on this observation, CaliDrop performs speculative calibration on the discarded tokens to mitigate the accuracy loss caused by token eviction. Extensive experiments demonstrate that CaliDrop significantly improves the accuracy of existing token eviction methods.
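For context, here is a minimal sketch of the attention-based token eviction that CaliDrop builds on: rank cached KV entries by accumulated attention mass and keep the top-k. CaliDrop's actual contribution, speculative calibration of the discarded tokens using similar nearby queries, is not reproduced here.

```python
import torch

def evict_kv(keys: torch.Tensor, values: torch.Tensor,
             attn_weights: torch.Tensor, keep: int):
    # keys/values: (seq, dim); attn_weights: (num_queries, seq)
    importance = attn_weights.sum(dim=0)               # attention mass per token
    idx = importance.topk(keep).indices.sort().values  # keep positional order
    return keys[idx], values[idx]

k, v = torch.randn(128, 64), torch.randn(128, 64)
attn = torch.softmax(torch.randn(16, 128), dim=-1)
k_small, v_small = evict_kv(k, v, attn, keep=32)
print(k_small.shape)  # torch.Size([32, 64])
```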
[161] DYNARTmo: A Dynamic Articulatory Model for Visualization of Speech Movement Patterns
Bernd J. Kröger
Main category: cs.CL
TL;DR: DYNARTmo is a dynamic articulatory model for visualizing speech articulation in 2D, integrating control parameters and coarticulation, embedded in a web app for education and therapy.
Details
Motivation: To enhance phonetics education and speech therapy by providing a dynamic visualization tool for speech articulation processes.
Method: Builds on UK-DYNAMO, uses articulatory underspecification, segmental/gestural control, and coarticulation with 16 control parameters.
Result: Simulates six articulators, generates vocalic/consonantal configurations, and is implemented in a web app with multiple views.
Conclusion: Focuses on static modeling; future work will include dynamic movement and articulatory-acoustic integration.
Abstract: We present DYNARTmo, a dynamic articulatory model designed to visualize speech articulation processes in a two-dimensional midsagittal plane. The model builds upon the UK-DYNAMO framework and integrates principles of articulatory underspecification, segmental and gestural control, and coarticulation. DYNARTmo simulates six key articulators based on ten continuous and six discrete control parameters, allowing for the generation of both vocalic and consonantal articulatory configurations. The current implementation is embedded in a web-based application (SpeechArticulationTrainer) that includes sagittal, glottal, and palatal views, making it suitable for use in phonetics education and speech therapy. While this paper focuses on the static modeling aspects, future work will address dynamic movement generation and integration with articulatory-acoustic modules.
[162] Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes
Rui Jiao, Yue Zhang, Jinku Li
Main category: cs.CL
TL;DR: A novel framework improves factual accuracy in LLMs’ reasoning steps, combining fact-checking, reinforcement learning, and interpretability methods, achieving up to 49.90% improvement.
Details
Motivation: Addressing factual inaccuracies in LLMs' reasoning steps, which can mislead users in high-stakes domains like healthcare and legal analysis.
Method: Integrates a fact-checking classifier, GRPO reinforcement learning, and mechanistic interpretability to enhance reasoning accuracy.
Result: Significant improvement in factual robustness (up to 49.90%) while maintaining performance on benchmarks like Math-500 and GPQA.
Conclusion: The framework provides actionable insights for future training methodologies to enhance factual robustness in LLMs.
Abstract: We present a novel framework addressing a critical vulnerability in Large Language Models (LLMs): the prevalence of factual inaccuracies within intermediate reasoning steps despite correct final answers. This phenomenon poses substantial risks in high-stakes domains including healthcare, legal analysis, and scientific research, where erroneous yet confidently presented reasoning can mislead users into dangerous decisions. Our framework integrates three core components: (1) a specialized fact-checking classifier trained on counterfactually augmented data to detect subtle factual inconsistencies within reasoning chains; (2) an enhanced Group Relative Policy Optimization (GRPO) reinforcement learning approach that balances factuality, coherence, and structural correctness through multi-dimensional rewards; and (3) a mechanistic interpretability method examining how factuality improvements manifest in model activations during reasoning processes. Extensive evaluation across multiple state-of-the-art models reveals concerning patterns: even leading models like Claude-3.7 and GPT-o1 demonstrate reasoning factual accuracy of only 81.93% and 82.57% respectively. Our approach significantly enhances factual robustness (up to 49.90% improvement) while maintaining or improving performance on challenging benchmarks including Math-500, AIME-2024, and GPQA. Furthermore, our neural activation-level analysis provides actionable insights into how factual enhancements reshape reasoning trajectories within model architectures, establishing foundations for future training methodologies that explicitly target factual robustness through activation-guided optimization.
[163] Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning
Keer Lu, Zheng Liang, Youquan Li, Jiejun Tan, Da Pan, Shusen Zhang, Guosheng Dong, Huang Leng, Bin Cui, Wentao Zhang
Main category: cs.CL
TL;DR: The paper introduces Med-R³, a medical retrieval-augmented reasoning framework using progressive reinforcement learning to jointly optimize retrieval and reasoning, outperforming existing models.
Details
Motivation: Existing methods lack joint optimization of retrieval and reasoning, rely heavily on supervised fine-tuning, and inadequately address medical domain demands.
Method: Develop logical reasoning first, then adaptively optimize retrieval, and finally jointly optimize retrieval-reasoning coordination using reinforcement learning.
Result: Med-R³ achieves state-of-the-art performance: LLaMA3.1-8B-Instruct + Med-R³ surpasses GPT-4o-mini by 3.93%, while Qwen2.5-14B + Med-R³ shows a larger gain of 13.53%.
Conclusion: Med-R³ effectively addresses the limitations of existing methods and demonstrates superior performance in medical retrieval-augmented reasoning.
Abstract: In medical scenarios, effectively retrieving external knowledge and leveraging it for rigorous logical reasoning is of significant importance. Despite their potential, existing work has predominantly focused on enhancing either retrieval or reasoning capabilities of the models in isolation, with little attention given to their joint optimization, which leads to limited coordination between the two processes. Additionally, current methods rely heavily on supervised fine-tuning (SFT), which can cause models to memorize existing problem-solving pathways, thereby restricting their generalization ability when confronted with novel problem contexts. Furthermore, while some studies have explored improving retrieval-augmented reasoning in general domains via reinforcement learning, their reward function designs do not adequately capture the specific demands of the medical domain. To address these challenges, we introduce Med-R$^3$, a Medical Retrieval-augmented Reasoning framework driven by progressive Reinforcement learning. In this framework, we first develop the model’s ability to perform logical reasoning over medical problems. Subsequently, on the basis of this foundation, we adaptively optimize the retrieval capability to better align with the characteristics of the knowledge corpus and external information utilization throughout the reasoning process. Finally, we conduct joint optimization of the model’s retrieval and reasoning coordination. Extensive experiments indicate that Med-R$^3$ achieves state-of-the-art performance, with LLaMA3.1-8B-Instruct + Med-R$^3$ surpassing closed-sourced GPT-4o-mini by 3.93% at a comparable parameter scale, while Qwen2.5-14B augmented with Med-R$^3$ shows a more substantial gain of 13.53%.
cs.CV
[164] Team PA-VCG’s Solution for Competition on Understanding Chinese College Entrance Exam Papers in ICDAR'25
Wei Wu, Wenjie Wang, Yang Tan, Ying Liu, Liang Diao, Lin Huang, Kaihe Xu, Wenfeng Xie, Ziling Lin
Main category: cs.CV
TL;DR: Team PA-VGG’s solution for ICDAR'25 Competition uses high-res image processing, multi-image input, and domain-specific post-training to achieve 89.6% accuracy in Gaokao paper OCR.
Details
Motivation: Address challenges of dense OCR extraction and complex layouts in Chinese college entrance exam papers.
Method: High-resolution image processing, multi-image end-to-end input, and domain-specific post-training.
Result: Achieved 89.6% accuracy, securing first place in the competition.
Conclusion: Domain-specific post-training significantly improves performance in complex OCR tasks.
Abstract: This report presents Team PA-VGG’s solution for the ICDAR'25 Competition on Understanding Chinese College Entrance Exam Papers. In addition to leveraging high-resolution image processing and a multi-image end-to-end input strategy to address the challenges of dense OCR extraction and complex document layouts in Gaokao papers, our approach introduces domain-specific post-training strategies. Experimental results demonstrate that our post-training approach achieves the most outstanding performance, securing first place with an accuracy rate of 89.6%.
[165] Inclusive Review on Advances in Masked Human Face Recognition Technologies
Ali Haitham Abdul Amir, Zainab N. Nemer
Main category: cs.CV
TL;DR: A review of advancements in Masked Face Recognition (MFR) using deep learning, focusing on challenges like partial concealment and solutions like CNNs and Siamese networks.
Details
Motivation: The rise of mask usage due to COVID-19 has challenged facial recognition systems, necessitating improved MFR technologies.
Method: The paper reviews deep learning techniques (CNNs, Siamese networks), data enhancement, and feature extraction methods.
Result: Identifies challenges (lighting, mask types) and solutions (artificial databases, multimedia methods) to improve MFR accuracy.
Conclusion: Future research should focus on efficient algorithms and multimedia integration to enhance real-world MFR performance and applications.
Abstract: Masked Face Recognition (MFR) is an increasingly important area in biometric recognition technologies, especially with the widespread use of masks as a result of the COVID-19 pandemic. This development has created new challenges for facial recognition systems due to the partial concealment of basic facial features. This paper aims to provide a comprehensive review of the latest developments in the field, with a focus on deep learning techniques, especially convolutional neural networks (CNNs) and twin networks (Siamese networks), which have played a pivotal role in improving the accuracy of masked face recognition. The paper discusses the most prominent challenges, which include changes in lighting, different facial positions, partial concealment, and the impact of mask types on the performance of systems. It also reviews advanced technologies developed to overcome these challenges, including data enhancement using artificial databases and multimedia methods to improve the ability of systems to generalize. In addition, the paper highlights advances in deep network design, feature extraction techniques, evaluation criteria, and data sets used in this area. Moreover, it reviews the various applications of masked face recognition in the fields of security and medicine, highlighting the growing importance of these systems in light of recurrent health crises and increasing security threats. Finally, the paper focuses on future research trends such as developing more efficient algorithms and integrating multimedia technologies to improve the performance of recognition systems in real-world environments and expand their applications.
[166] Context Guided Transformer Entropy Modeling for Video Compression
Junlong Tong, Wei Zhang, Yaohui Jin, Xiaoyu Shen
Main category: cs.CV
TL;DR: The paper introduces the Context Guided Transformer (CGT) entropy model to reduce video redundancy by leveraging spatio-temporal contexts more efficiently, addressing issues of model complexity and spatial dependency ordering.
Details
Motivation: Existing conditional entropy models either introduce high computational costs with temporal context or lack explicit modeling of spatial dependency order, limiting context availability during decoding.
Method: CGT uses a temporal context resampler with transformer encoders to reduce computational overhead and a teacher-student network to model spatial dependency order, selecting top-k tokens for efficient decoding.
Result: CGT reduces entropy modeling time by ~65% and achieves an 11% BD-Rate reduction compared to prior state-of-the-art models.
Conclusion: The proposed CGT model effectively balances computational efficiency and context modeling, improving video compression performance.
Abstract: Conditional entropy models effectively leverage spatio-temporal contexts to reduce video redundancy. However, incorporating temporal context often introduces additional model complexity and increases computational cost. In parallel, many existing spatial context models lack explicit modeling of the ordering of spatial dependencies, which may limit the availability of relevant context during decoding. To address these issues, we propose the Context Guided Transformer (CGT) entropy model, which estimates probability mass functions of the current frame conditioned on resampled temporal context and dependency-weighted spatial context. A temporal context resampler learns predefined latent queries to extract critical temporal information using transformer encoders, reducing downstream computational overhead. Meanwhile, a teacher-student network is designed as a dependency-weighted spatial context assigner to explicitly model the dependency of spatial context order. The teacher generates an attention map to represent token importance and an entropy map to reflect prediction certainty from randomly masked inputs, guiding the student to select the weighted top-k tokens with the highest spatial dependency. During inference, only the student is used to predict undecoded tokens based on high-dependency context. Experimental results demonstrate that our CGT model reduces entropy modeling time by approximately 65% and achieves an 11% BD-Rate reduction compared to the previous state-of-the-art conditional entropy model.
[167] HoneyImage: Verifiable, Harmless, and Stealthy Dataset Ownership Verification for Image Models
Zhihao Zhu, Jiale Han, Yi Yang
Main category: cs.CV
TL;DR: HoneyImage is a novel method for verifying dataset ownership in image recognition models by embedding imperceptible traces in hard samples, balancing verification effectiveness and data integrity.
Details
Motivation: Addressing concerns about unauthorized use of proprietary image datasets in AI models, existing solutions like backdoor watermarking and membership inference have trade-offs between verification and data integrity.
Method: HoneyImage selectively modifies a small number of hard samples to embed imperceptible yet verifiable traces for ownership verification.
Result: Experiments on four benchmark datasets show strong verification accuracy with minimal impact on downstream performance and imperceptibility.
Conclusion: HoneyImage offers a practical solution for dataset owners to protect their data, enabling safe sharing and maximizing AI potential.
Abstract: Image-based AI models are increasingly deployed across a wide range of domains, including healthcare, security, and consumer applications. However, many image datasets carry sensitive or proprietary content, raising critical concerns about unauthorized data usage. Data owners therefore need reliable mechanisms to verify whether their proprietary data has been misused to train third-party models. Existing solutions, such as backdoor watermarking and membership inference, face inherent trade-offs between verification effectiveness and preservation of data integrity. In this work, we propose HoneyImage, a novel method for dataset ownership verification in image recognition models. HoneyImage selectively modifies a small number of hard samples to embed imperceptible yet verifiable traces, enabling reliable ownership verification while maintaining dataset integrity. Extensive experiments across four benchmark datasets and multiple model architectures show that HoneyImage consistently achieves strong verification accuracy with minimal impact on downstream performance while remaining imperceptible. The proposed HoneyImage method could provide data owners with a practical mechanism to protect ownership over valuable image datasets, encouraging safe sharing and unlocking the full transformative potential of data-driven AI.
[168] On-the-Fly Object-aware Representative Point Selection in Point Cloud
Xiaoyu Zhang, Ziwei Wang, Hai Dong, Zhifeng Bao, Jiajun Liu
Main category: cs.CV
TL;DR: A framework for point cloud downsampling in AVs, preserving object-related data while reducing storage and processing costs.
Details
Motivation: Address challenges of high data volume in AV point clouds, balancing storage and processing efficiency with critical object information retention.
Method: Two-step approach: (1) Object Presence Detection using unsupervised and supervised classifiers, (2) Sampling Budget Allocation for object-relevant points.
Result: Outperforms baselines on KITTI and nuScenes datasets in efficiency and effectiveness at varying sampling rates.
Conclusion: A scalable, model-agnostic solution for AV point cloud downsampling, enhancing downstream model integration.
Abstract: Point clouds are essential for object modeling and play a critical role in assisting driving tasks for autonomous vehicles (AVs). However, the significant volume of data generated by AVs creates challenges for storage, bandwidth, and processing cost. To tackle these challenges, we propose a representative point selection framework for point cloud downsampling, which preserves critical object-related information while effectively filtering out irrelevant background points. Our method involves two steps: (1) Object Presence Detection, where we introduce an unsupervised density peak-based classifier and a supervised Naïve Bayes classifier to handle diverse scenarios, and (2) Sampling Budget Allocation, where we propose a strategy that selects object-relevant points while maintaining a high retention rate of object information. Extensive experiments on the KITTI and nuScenes datasets demonstrate that our method consistently outperforms state-of-the-art baselines in both efficiency and effectiveness across varying sampling rates. As a model-agnostic solution, our approach integrates seamlessly with diverse downstream models, making it a valuable and scalable addition to the 3D point cloud downsampling toolkit for AV applications.
[169] Phase-fraction guided denoising diffusion model for augmenting multiphase steel microstructure segmentation via micrograph image-mask pair synthesis
Hoang Hai Nam Nguyen, Minh Tien Tran, Hoheok Kim, Ho Won Lee
Main category: cs.CV
TL;DR: PF-DiffSeg, a diffusion-based framework, jointly generates microstructure images and masks, improving segmentation accuracy and training efficiency, especially for minority classes.
Details
Motivation: Addressing the lack of annotated phase masks for rare or complex metal alloy morphologies, which limits machine learning effectiveness in metallographic segmentation.
Method: A phase-fraction controlled, one-stage denoising diffusion framework that synthesizes images and masks simultaneously, conditioned on global phase-fraction vectors.
Result: Notable improvements in segmentation accuracy, especially for minority classes, outperforming two-stage diffusion and GAN baselines while reducing inference time.
Conclusion: PF-DiffSeg offers a scalable, unified solution for data augmentation in metallographic applications, enhancing diversity and efficiency.
Abstract: The effectiveness of machine learning in metallographic microstructure segmentation is often constrained by the lack of human-annotated phase masks, particularly for rare or compositionally complex morphologies within the metal alloy. We introduce PF-DiffSeg, a phase-fraction controlled, one-stage denoising diffusion framework that jointly synthesizes microstructure images and their corresponding segmentation masks in a single generative trajectory to further improve segmentation accuracy. By conditioning on global phase-fraction vectors, augmented to represent real data distribution and emphasize minority classes, our model generates compositionally valid and structurally coherent microstructure image and mask samples that improve both data diversity and training efficiency. Evaluated on the MetalDAM benchmark for additively manufactured multiphase steel, our synthetic augmentation method yields notable improvements in segmentation accuracy compared to standard augmentation strategies, especially in minority classes, and further outperforms two-stage mask-guided diffusion and generative adversarial network (GAN) baselines, while also reducing inference time compared to the conventional approach. The method integrates generation and conditioning into a unified framework, offering a scalable solution for data augmentation in metallographic applications.
[170] GaussianCross: Cross-modal Self-supervised 3D Representation Learning via Gaussian Splatting
Lei Yao, Yi Wang, Yi Zhang, Moyun Liu, Lap-Pui Chau
Main category: cs.CV
TL;DR: GaussianCross introduces a cross-modal self-supervised 3D representation learning method using 3D Gaussian Splatting to improve point cloud discrimination and structural information, achieving superior performance and efficiency.
Details
Motivation: Existing self-supervised methods suffer from model collapse and structural information deficiency due to insufficient point discrimination, leading to unreliable expressions and suboptimal performance.
Method: GaussianCross converts point clouds into a unified Gaussian representation and uses a tri-attribute adaptive distillation splatting module to capture appearance, geometry, and semantic features.
Result: GaussianCross outperforms state-of-the-art methods in efficiency and accuracy, with significant improvements in semantic and instance segmentation tasks.
Conclusion: GaussianCross is a robust and efficient solution for 3D representation learning, demonstrating strong generalization and performance gains.
Abstract: The significance of informative and robust point representations has been widely acknowledged for 3D scene understanding. Despite existing self-supervised pre-training counterparts demonstrating promising performance, the model collapse and structural information deficiency remain prevalent due to insufficient point discrimination difficulty, yielding unreliable expressions and suboptimal performance. In this paper, we present GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture integrating feed-forward 3D Gaussian Splatting (3DGS) techniques to address current challenges. GaussianCross seamlessly converts scale-inconsistent 3D point clouds into a unified cuboid-normalized Gaussian representation without missing details, enabling stable and generalizable pre-training. Subsequently, a tri-attribute adaptive distillation splatting module is incorporated to construct a 3D feature field, facilitating synergetic feature capturing of appearance, geometry, and semantic cues to maintain cross-modal consistency. To validate GaussianCross, we perform extensive evaluations on various benchmarks, including ScanNet, ScanNet200, and S3DIS. In particular, GaussianCross shows a prominent parameter and data efficiency, achieving superior performance through linear probing (<0.1% parameters) and limited data training (1% of scenes) compared to state-of-the-art methods. Furthermore, GaussianCross demonstrates strong generalization capabilities, improving the full fine-tuning accuracy by 9.3% mIoU and 6.1% AP$_{50}$ on ScanNet200 semantic and instance segmentation tasks, respectively, supporting the effectiveness of our approach. The code, weights, and visualizations are publicly available at https://rayyoh.github.io/GaussianCross/.
[171] Benefits of Feature Extraction and Temporal Sequence Analysis for Video Frame Prediction: An Evaluation of Hybrid Deep Learning Models
Jose M. Sánchez Velázquez, Mingbo Cai, Andrew Coney, Álvaro J. García- Tejedor, Alberto Nogales
Main category: cs.CV
TL;DR: The paper evaluates hybrid deep learning models combining autoencoders with RNNs and 3D CNNs for video frame prediction, showing improved performance, especially with 3D CNNs and ConvLSTMs on real-world grayscale data.
Details
Motivation: Video frame prediction has critical applications like weather forecasting and autonomous systems, but current models need improvement.
Method: Hybrid deep learning approaches combining autoencoders with RNNs and 3D CNNs were tested on diverse datasets.
Result: Performance improved, with SSIM metrics rising from 0.69 to 0.82, showing 3D CNNs and ConvLSTMs as most effective, especially on grayscale real-world data.
Conclusion: Hybrid models with 3D CNNs and ConvLSTMs are highly effective for video frame prediction, particularly in real-world grayscale scenarios.
Abstract: In recent years, advances in Artificial Intelligence have significantly impacted computer science, particularly in the field of computer vision, enabling solutions to complex problems such as video frame prediction. Video frame prediction has critical applications in weather forecasting or autonomous systems and can provide technical improvements, such as video compression and streaming. Among Artificial Intelligence methods, Deep Learning has emerged as highly effective for solving vision-related tasks, although current frame prediction models still have room for enhancement. This paper evaluates several hybrid deep learning approaches that combine the feature extraction capabilities of autoencoders with temporal sequence modelling using Recurrent Neural Networks (RNNs), 3D Convolutional Neural Networks (3D CNNs), and related architectures. The proposed solutions were rigorously evaluated on three datasets that differ in terms of synthetic versus real-world scenarios and grayscale versus color imagery. Results demonstrate that the approaches perform well, with SSIM metrics increasing from 0.69 to 0.82, indicating that hybrid models utilizing 3D CNNs and ConvLSTMs are the most effective, and grayscale videos with real data are the easiest to predict.
[172] Learning Partially-Decorrelated Common Spaces for Ad-hoc Video Search
Fan Hu, Zijie Xin, Xirong Li
Main category: cs.CV
TL;DR: The paper proposes LPD (Learning Partially Decorrelated common spaces) to address the visual diversity challenge in Ad-hoc Video Search (AVS) by constructing feature-specific common spaces and using de-correlation loss.
Details
Motivation: The visual diversity in AVS makes it hard to retrieve all relevant videos comprehensively. Current methods fuse features into common spaces but lack diversity.
Method: LPD learns separate common spaces for each feature, uses de-correlation loss to diversify negative samples, and employs an entropy-based fair multi-space triplet ranking loss for convergence.
Result: Experiments on TRECVID AVS benchmarks (2016-2023) show LPD’s effectiveness, and visualizations confirm its ability to enhance result diversity.
Conclusion: LPD improves AVS by leveraging diverse feature-specific spaces and de-correlation, outperforming current methods.
Abstract: Ad-hoc Video Search (AVS) involves using a textual query to search for multiple relevant videos in a large collection of unlabeled short videos. The main challenge of AVS is the visual diversity of relevant videos. A simple query such as “Find shots of a man and a woman dancing together indoors” can span a multitude of environments, from brightly lit halls and shadowy bars to dance scenes in black-and-white animations. It is therefore essential to retrieve relevant videos as comprehensively as possible. Current solutions for the AVS task primarily fuse multiple features into one or more common spaces, yet overlook the need for diverse spaces. To fully exploit the expressive capability of individual features, we propose LPD, short for Learning Partially Decorrelated common spaces. LPD incorporates two key innovations: feature-specific common space construction and the de-correlation loss. Specifically, LPD learns a separate common space for each video and text feature, and employs de-correlation loss to diversify the ordering of negative samples across different spaces. To enhance the consistency of multi-space convergence, we designed an entropy-based fair multi-space triplet ranking loss. Extensive experiments on the TRECVID AVS benchmarks (2016-2023) justify the effectiveness of LPD. Moreover, diversity visualizations of LPD’s spaces highlight its ability to enhance result diversity.
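A hedged guess at the shape of the de-correlation term: given the negative-sample scores produced by two feature-specific spaces for the same query, penalize correlation between them so the spaces rank negatives differently. Pearson correlation of score vectors is used here as a differentiable stand-in; the paper's exact loss may differ.

```python
import torch

def decorrelation_loss(neg_scores_a: torch.Tensor,
                       neg_scores_b: torch.Tensor) -> torch.Tensor:
    # neg_scores_*: (num_negatives,) similarity scores in each common space
    a = neg_scores_a - neg_scores_a.mean()
    b = neg_scores_b - neg_scores_b.mean()
    corr = (a * b).sum() / (a.norm() * b.norm() + 1e-8)
    return corr ** 2  # push the two spaces toward uncorrelated orderings

print(decorrelation_loss(torch.randn(100), torch.randn(100)))
```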
[173] TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras
Mohammad Mohammadi, Ziyi Wu, Igor Gilitschenski
Main category: cs.CV
TL;DR: TESPEC is a self-supervised pre-training framework for event-based perception, leveraging long-term temporal information to improve recurrent models.
Details
Motivation: Existing SSL methods for event-based pre-training ignore temporal information, limiting their effectiveness for recurrent models.
Method: TESPEC uses masked image modeling with a novel reconstruction target, accumulating events into pseudo grayscale videos for robust learning.
Result: TESPEC achieves state-of-the-art performance in tasks like object detection, semantic segmentation, and depth estimation.
Conclusion: TESPEC effectively bridges the gap in temporal learning for event-based perception, outperforming existing methods.
Abstract: Long-term temporal information is crucial for event-based perception tasks, as raw events only encode pixel brightness changes. Recent works show that when trained from scratch, recurrent models achieve better results than feedforward models in these tasks. However, when leveraging self-supervised pre-trained weights, feedforward models can outperform their recurrent counterparts. Current self-supervised learning (SSL) methods for event-based pre-training largely mimic RGB image-based approaches. They pre-train feedforward models on raw events within a short time interval, ignoring the temporal information of events. In this work, we introduce TESPEC, a self-supervised pre-training framework tailored for learning spatio-temporal information. TESPEC is well-suited for recurrent models, as it is the first framework to leverage long event sequences during pre-training. TESPEC employs the masked image modeling paradigm with a new reconstruction target. We design a novel method to accumulate events into pseudo grayscale videos containing high-level semantic information about the underlying scene, which is robust to sensor noise and reduces motion blur. Reconstructing this target thus requires the model to reason about long-term history of events. Extensive experiments demonstrate our state-of-the-art results in downstream tasks, including object detection, semantic segmentation, and monocular depth estimation. Project webpage: https://mhdmohammadi.github.io/TESPEC_webpage.
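The reconstruction target rests on turning an event stream into frames; a bare-bones version accumulates signed event polarities onto a decaying canvas. TESPEC's actual accumulation, with its noise robustness and motion-blur reduction, is richer than this sketch.

```python
import numpy as np

def events_to_frame(xs, ys, polarities, height, width, decay=0.9,
                    frame=None):
    """Fold one batch of events into a decaying pseudo grayscale frame."""
    frame = np.zeros((height, width), np.float32) if frame is None else frame
    frame *= decay                                    # fade older events
    np.add.at(frame, (ys, xs), np.where(polarities > 0, 1.0, -1.0))
    return frame

xs = np.array([3, 3, 10]); ys = np.array([5, 5, 2])
pol = np.array([1, 1, -1])
f = events_to_frame(xs, ys, pol, height=32, width=32)
print(f[5, 3], f[2, 10])  # 2.0 -1.0
```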
[174] Latent Diffusion Based Face Enhancement under Degraded Conditions for Forensic Face Recognition
Hassan Ugail, Hamad Mansour Alawar, AbdulNasser Abbas Zehi, Ahmed Mohammad Alkendi, Ismail Lujain Jaleel
Main category: cs.CV
TL;DR: Latent diffusion-based enhancement significantly improves face recognition accuracy (from 29.1% to 84.5%) for forensic imagery degraded by compression, blur, and noise.
Details
Motivation: Face recognition systems struggle with low-quality forensic images, necessitating methods to enhance degraded data for accurate identification.
Method: The study uses the Flux.1 Kontext Dev pipeline with Facezoom LoRA adaptation on 3,000 individuals from LFW, testing against seven degradation types.
Result: Recognition accuracy improved by 55.4 percentage points (95% CI: [54.1, 56.7]), with significant gains across all degradation categories.
Conclusion: Diffusion-based enhancement is highly effective for forensic face recognition, offering practical improvements for degraded imagery.
Abstract: Face recognition systems experience severe performance degradation when processing low-quality forensic evidence imagery. This paper presents an evaluation of latent diffusion-based enhancement for improving face recognition under forensically relevant degradations. Using a dataset of 3,000 individuals from LFW with 24,000 recognition attempts, we implement the Flux.1 Kontext Dev pipeline with Facezoom LoRA adaptation to test against seven degradation categories, including compression artefacts, blur effects, and noise contamination. Our approach demonstrates substantial improvements, increasing overall recognition accuracy from 29.1% to 84.5% (55.4 percentage point improvement, 95% CI: [54.1, 56.7]). Statistical analysis reveals significant performance gains across all degradation types, with effect sizes exceeding conventional thresholds for practical significance. These findings establish the potential of sophisticated diffusion based enhancement in forensic face recognition applications.
[175] CLIP Brings Better Features to Visual Aesthetics Learners
Liwu Xu, Jinjin Xu, Yuzhe Yang, Xilu Wang, Yijie Huang, Yaqian Li
Main category: cs.CV
TL;DR: A two-phase CLIP-based Semi-supervised Knowledge Distillation (CSKD) paradigm is proposed to leverage CLIP’s generalization for lightweight Image Aesthetics Assessment (IAA) models, achieving state-of-the-art performance.
Details
Motivation: IAA is subjective and manual annotations are expensive. CLIP's potential for IAA is underexplored, especially in low-data settings.
Method: CSKD uses feature alignment to distill knowledge from CLIP to lightweight IAA models, followed by collaborative distillation with unlabeled data.
Result: CSKD outperforms benchmarks, effectively transferring CLIP’s features to IAA models, improving aesthetics representation.
Conclusion: CSKD provides a scalable solution for IAA, enhancing model initialization and feature representation.
Abstract: Image Aesthetics Assessment (IAA) is a challenging task due to its subjective nature and expensive manual annotations. Recent large-scale vision-language models, such as Contrastive Language-Image Pre-training (CLIP), have shown their promising representation capability for various downstream tasks. However, the application of CLIP to resource-constrained and low-data IAA tasks remains limited. While the few attempts to leverage CLIP in IAA have mainly focused on carefully designed prompts, we extend beyond this by allowing models from different domains and with different model sizes to acquire knowledge from CLIP. To achieve this, we propose a unified and flexible two-phase CLIP-based Semi-supervised Knowledge Distillation (CSKD) paradigm, aiming to learn a lightweight IAA model while leveraging CLIP’s strong generalization capability. Specifically, CSKD employs a feature alignment strategy to facilitate the distillation of heterogeneous CLIP teacher and IAA student models, effectively transferring valuable features from pre-trained visual representations to two lightweight IAA models. To efficiently adapt to downstream IAA tasks in a low-data regime, the two strong visual aesthetics learners then conduct distillation with unlabeled examples for refining and transferring the task-specific knowledge collaboratively. Extensive experiments demonstrate that the proposed CSKD achieves state-of-the-art performance on multiple widely used IAA benchmarks. Furthermore, analysis of attention distance and entropy before and after feature alignment shows the effective transfer of CLIP’s feature representation to IAA models, which not only provides valuable guidance for the model initialization of IAA but also enhances the aesthetic feature representation of IAA models. Code will be made publicly available.
[176] Optimizing Vision-Language Consistency via Cross-Layer Regional Attention Alignment
Yifan Wang, Hongfeng Ai, Quangao Liu, Maowei Jiang, Ruiyuan Kang, Ruiqi Li, Jiahua Dong, Mengting Xiao, Cheng Jiang, Chenzhong Li
Main category: cs.CV
TL;DR: CCRA improves VLMs by aligning cross-modal attention with LPWCA and PAI, achieving state-of-the-art performance with minimal added parameters.
Details
Motivation: VLMs struggle with mismatched attention in cross-modal embedding, leading to suboptimal performance.
Method: CCRA uses LPWCA for fine-grained regional-semantic correlations and PAI to coordinate attention mechanisms progressively.
Result: CCRA-enhanced LLaVA-v1.5-7B outperforms baselines on ten benchmarks with only 3.55M extra parameters.
Conclusion: CCRA ensures consistent attention alignment, improving performance and interpretability in VLMs.
Abstract: Vision Language Models (VLMs) face challenges in effectively coordinating diverse attention mechanisms for cross-modal embedding learning, leading to mismatched attention and suboptimal performance. We propose Consistent Cross-layer Regional Alignment (CCRA), which introduces Layer-Patch-wise Cross Attention (LPWCA) to capture fine-grained regional-semantic correlations by jointly weighting patch and layer-wise embedding, and Progressive Attention Integration (PAI) that systematically coordinates LPWCA, layer-wise, and patch-wise attention mechanisms in sequence. This progressive design ensures consistency from semantic to regional levels while preventing attention drift and maximizing individual attention benefits. Experimental results on ten diverse vision-language benchmarks demonstrate that our CCRA-enhanced LLaVA-v1.5-7B model achieves state-of-the-art performance, outperforming all baseline methods with only 3.55M additional parameters, while providing enhanced interpretability through more regionally focused and semantically aligned attention patterns.
[177] ThermoCycleNet: Stereo-based Thermogram Labeling for Model Transition to Cycling
Daniel Andrés López, Vincent Weber, Severin Zentgraf, Barlo Hillen, Perikles Simon, Elmar Schömer
Main category: cs.CV
TL;DR: Infrared thermography aids sports medicine by assessing thermal radiation during exercise. A method combining automatic and manual annotations improves deep neural network performance for transitioning from treadmill to bicycle analysis.
Details
Motivation: To adapt a stereo- and multimodal-based labeling approach from treadmill running to ergometer cycling, improving deep neural network performance with minimal manual data.
Method: Training a semantic segmentation network with automatic labels and fine-tuning on high-quality manual annotations, comparing different dataset combinations.
Result: Fine-tuning with a small fraction of manual data improves network performance. Combining automatic and manual labels accelerates adaptation to new use cases.
Conclusion: Combining automatic and manual annotations efficiently adapts deep neural networks for new applications like transitioning from treadmill to bicycle analysis.
Abstract: Infrared thermography is emerging as a powerful tool in sports medicine, allowing assessment of thermal radiation during exercise and analysis of anatomical regions of interest, such as the well-exposed calves. Building on our previous advanced automatic annotation method, we aimed to transfer the stereo- and multimodal-based labeling approach from treadmill running to ergometer cycling. Therefore, the training of the semantic segmentation network with automatic labels and fine-tuning on high-quality manually annotated images has been examined and compared in different data set combinations. The results indicate that fine-tuning with a small fraction of manual data is sufficient to improve the overall performance of the deep neural network. Finally, combining automatically generated labels with small manually annotated data sets accelerates the adaptation of deep neural networks to new use cases, such as the transition from treadmill to bicycle.
[178] ROVI: A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation
Cihang Peng, Qiming Hou, Zhong Ren, Kun Zhou
Main category: cs.CV
TL;DR: ROVI is a synthetic dataset for text-to-image generation, created by labeling 1M web images using re-captioning with VLM and LLM. It outperforms existing datasets in quality and resolution, and improves model performance.
Details
Motivation: To create a high-quality dataset for instance-grounded text-to-image generation, addressing gaps in existing datasets like limited categories and overlooked visual elements.
Method: Uses re-captioning: VLM generates visual descriptions, LLM extracts categories for OVDs, linking global prompts to instance annotations.
Result: ROVI surpasses existing datasets in quality, resolution, and category diversity. GLIGEN trained on ROVI outperforms state-of-the-art models.
Conclusion: ROVI offers a superior dataset for text-to-image generation, enhancing model accuracy and fidelity, with open-source availability.
Abstract: We present ROVI, a high-quality synthetic dataset for instance-grounded text-to-image generation, created by labeling 1M curated web images. Our key innovation is a strategy called re-captioning, focusing on the pre-detection stage, where a VLM (Vision-Language Model) generates comprehensive visual descriptions that are then processed by an LLM (Large Language Model) to extract a flat list of potential categories for OVDs (Open-Vocabulary Detectors) to detect. This approach yields a global prompt inherently linked to instance annotations while capturing secondary visual elements humans typically overlook. Evaluations show that ROVI exceeds existing detection datasets in image quality and resolution while containing two orders of magnitude more categories with an open-vocabulary nature. For demonstrative purposes, a text-to-image model GLIGEN trained on ROVI significantly outperforms state-of-the-art alternatives in instance grounding accuracy, prompt fidelity, and aesthetic quality. Our dataset and reproducible pipeline are available at https://github.com/CihangPeng/ROVI.
[179] AutoSIGHT: Automatic Eye Tracking-based System for Immediate Grading of Human experTise
Byron Dowling, Jozef Probcin, Adam Czajka
Main category: cs.CV
TL;DR: AutoSIGHT uses eye-tracking data to classify human expertise in visual tasks, achieving high accuracy with small evaluation windows.
Details
Motivation: To automate the assessment of human expertise in visual tasks using eye-tracking data, enabling dynamic human-AI collaboration.
Method: AutoSIGHT employs an ensemble of features from eye-tracking data during visual tasks, tested on iris Presentation Attack Detection (PAD).
Result: Achieves AUROC of 0.751 (5s window) and 0.8306 (30s window), proving feasibility.
Conclusion: Demonstrates potential for integrating human expertise assessment into human-AI systems, with shared data for further research.
Abstract: Can we teach machines to assess the expertise of humans solving visual tasks automatically based on eye tracking features? This paper proposes AutoSIGHT, Automatic System for Immediate Grading of Human experTise, that classifies expert and non-expert performers, and builds upon an ensemble of features extracted from eye tracking data while the performers were solving a visual task. Results on the task of iris Presentation Attack Detection (PAD) used for this study show that with a small evaluation window of just 5 seconds, AutoSIGHT achieves an average Area Under the ROC curve (AUROC) of 0.751 in a subject-disjoint train-test regime, indicating that such detection is viable. Furthermore, when a larger evaluation window of up to 30 seconds is available, the AUROC increases to 0.8306, indicating the model is effectively leveraging more information at the cost of slightly delayed decisions. This work opens new areas of research on how to incorporate the automatic weighing of human and machine expertise into human-AI pairing setups, which need to react dynamically to nonstationary expertise distribution between the human and AI players (e.g. when the experts need to be replaced, or the task at hand changes rapidly). Along with this paper, we offer the eye tracking data used in this study, collected from 6 experts and 53 non-experts solving the iris PAD visual task.
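To make the windowed evaluation concrete, here is a minimal sketch of expertise classification from eye-tracking streams in the spirit of AutoSIGHT; the particular gaze statistics used as features, the step threshold, and the random-forest classifier are illustrative assumptions, not the paper's exact ensemble.

```python
# Sketch: classify expert vs. non-expert from short eye-tracking windows.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def window_features(gaze_xy, t, win_s=5.0):
    """Aggregate simple gaze statistics over non-overlapping windows.

    gaze_xy: (N, 2) gaze positions (normalized screen units assumed);
    t: (N,) timestamps in seconds. Returns one feature row per window.
    """
    feats = []
    for start in np.arange(t[0], t[-1] - win_s, win_s):
        m = (t >= start) & (t < start + win_s)
        xy = gaze_xy[m]
        if len(xy) < 2:
            continue
        step = np.linalg.norm(np.diff(xy, axis=0), axis=1)  # saccade-length proxy
        feats.append([xy[:, 0].std(), xy[:, 1].std(),
                      step.mean(), step.max(),
                      (step > 0.05).mean()])  # fraction of large gaze jumps
    return np.asarray(feats)

# X: stacked window features across subjects; y: 1 = expert, 0 = non-expert.
# Subject-disjoint train/test splits (as in the paper) are enforced at split time.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
# clf.fit(X_train, y_train); scores = clf.predict_proba(X_test)[:, 1]
```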
[180] 3D Reconstruction via Incremental Structure From Motion
Muhammad Zeeshan, Umer Zaki, Syed Ahmed Pasha, Zaar Khizar
Main category: cs.CV
TL;DR: Incremental SfM is a flexible method for 3D reconstruction, outperforming global SfM in sparse or noisy datasets. This paper details its implementation, focusing on geometric consistency and iterative refinement, validated with real-world data.
Details
Motivation: The need for accurate 3D reconstruction in robotics, mapping, and scene understanding drives the exploration of incremental SfM as a robust alternative to global SfM, especially in sparse or noisy datasets.
Method: The paper details the incremental SfM pipeline, emphasizing geometric consistency and iterative refinement via bundle adjustment. Real-world datasets are used for validation.
Result: The approach demonstrates reliable 3D reconstruction in visually structured environments, validated by reprojection error and camera trajectory coherence.
Conclusion: Incremental SfM is a practical and reliable method for sparse 3D reconstruction, particularly in challenging datasets.
Abstract: Accurate 3D reconstruction from unstructured image collections is a key requirement in applications such as robotics, mapping, and scene understanding. While global Structure from Motion (SfM) techniques rely on full image connectivity and can be sensitive to noise or missing data, incremental SfM offers a more flexible alternative. By progressively incorporating new views into the reconstruction, it enables the system to recover scene structure and camera motion even in sparse or partially overlapping datasets. In this paper, we present a detailed implementation of the incremental SfM pipeline, focusing on the consistency of geometric estimation and the effect of iterative refinement through bundle adjustment. We demonstrate the approach using a real dataset and assess reconstruction quality through reprojection error and camera trajectory coherence. The results support the practical utility of incremental SfM as a reliable method for sparse 3D reconstruction in visually structured environments.
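As a concrete illustration of the pipeline's first stage, the sketch below shows a standard two-view initialization with OpenCV; assuming calibrated intrinsics K, the full incremental loop would then register each new view with cv2.solvePnPRansac and periodically refine all poses and points via bundle adjustment. This is a generic sketch of the textbook steps, not the authors' code.

```python
# Sketch: two-view initialization that seeds an incremental SfM reconstruction.
import cv2
import numpy as np

def init_two_view(img1, img2, K):
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(img1, None)
    k2, d2 = sift.detectAndCompute(img2, None)
    # Lowe ratio test on 2-NN matches keeps distinctive correspondences.
    matches = cv2.BFMatcher().knnMatch(d1, d2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    p1 = np.float32([k1[m.queryIdx].pt for m in good])
    p2 = np.float32([k2[m.trainIdx].pt for m in good])
    # RANSAC on the essential matrix enforces epipolar consistency.
    E, inliers = cv2.findEssentialMat(p1, p2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, p1, p2, K)
    # Triangulate correspondences to seed the sparse point cloud.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    X = cv2.triangulatePoints(P1, P2, p1.T, p2.T)
    return R, t, (X[:3] / X[3]).T  # relative pose and Euclidean 3D points
```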
[181] Structured Spectral Graph Learning for Anomaly Classification in 3D Chest CT Scans
Theo Di Piazza, Carole Lazarus, Olivier Nempont, Loic Boussel
Main category: cs.CV
TL;DR: Proposes a graph-based method for multi-label classification of 3D CT scans, addressing limitations of 3D CNNs and Vision Transformers.
Details
Motivation: Automated methods are needed to assist radiologists with increasing CT scan workloads, but existing approaches struggle with long-range dependencies or high computational costs.
Method: Models CT scans as structured graphs using axial slice triplets and spectral domain convolution for improved multi-label anomaly classification.
Result: Demonstrates strong cross-dataset generalization, competitive performance, and robustness to z-axis translation.
Conclusion: The graph-based approach offers a viable alternative to existing methods, balancing performance and computational efficiency.
Abstract: With the increasing number of CT scan examinations, there is a need for automated methods such as organ segmentation, anomaly detection and report generation to assist radiologists in managing their increasing workload. Multi-label classification of 3D CT scans remains a critical yet challenging task due to the complex spatial relationships within volumetric data and the variety of observed anomalies. Existing approaches based on 3D convolutional networks have limited abilities to model long-range dependencies while Vision Transformers suffer from high computational costs and often require extensive pre-training on large-scale datasets from the same domain to achieve competitive performance. In this work, we propose an alternative by introducing a new graph-based approach that models CT scans as structured graphs, leveraging axial slice triplet nodes processed through spectral-domain convolution to enhance multi-label anomaly classification performance. Our method exhibits strong cross-dataset generalization, and competitive performance while achieving robustness to z-axis translation. An ablation study evaluates the contribution of each proposed component.
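The following sketch illustrates the kind of spectral-domain graph convolution the paper builds on, applied to slice-triplet node features; the chain adjacency over consecutive triplets and the first-order polynomial filter are assumptions for illustration, not the paper's exact architecture.

```python
# Sketch: one spectral graph convolution over slice-triplet nodes.
import numpy as np

def normalized_laplacian(A):
    d = A.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-8)))
    return np.eye(len(A)) - Dinv @ A @ Dinv

def spectral_conv(X, A, W0, W1):
    """First-order polynomial spectral filter: H = ReLU(X W0 + L X W1)."""
    L = normalized_laplacian(A)
    return np.maximum(X @ W0 + L @ X @ W1, 0.0)

# N triplet nodes with D-dim features (e.g. pooled 2D-CNN embeddings of three
# consecutive axial slices); a chain adjacency links neighbouring triplets.
N, D, H = 40, 256, 128
X = np.random.randn(N, D)
A = np.eye(N, k=1) + np.eye(N, k=-1)
out = spectral_conv(X, A, np.random.randn(D, H), np.random.randn(D, H))
```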
[182] Anti-Inpainting: A Proactive Defense Approach against Malicious Diffusion-based Inpainters under Unknown Conditions
Yimao Guo, Zuomin Qu, Wei Lu, Xiangyang Luo
Main category: cs.CV
TL;DR: Anti-Inpainting is a proactive defense method against diffusion-based image manipulation, featuring multi-level feature extraction, semantic-preserving augmentation, and distribution deviation optimization.
Details
Motivation: Existing defenses fail under unknown conditions for diffusion-based malicious image manipulation.
Method: Three modules: multi-level deep feature extractor, multi-scale semantic-preserving augmentation, and selection-based distribution deviation optimization.
Result: Effective defense against diffusion-based inpainters under unknown conditions, robust against purification methods, and transferable across model versions.
Conclusion: Anti-Inpainting successfully addresses the limitations of existing methods, providing robust and transferable protection.
Abstract: With the increasing prevalence of diffusion-based malicious image manipulation, existing proactive defense methods struggle to safeguard images against tampering under unknown conditions. To address this, we propose Anti-Inpainting, a proactive defense approach that achieves protection comprising three novel modules. First, we introduce a multi-level deep feature extractor to obtain intricate features from the diffusion denoising process, enhancing protective effectiveness. Second, we design a multi-scale, semantic-preserving data augmentation technique to enhance the transferability of adversarial perturbations across unknown conditions. Finally, we propose a selection-based distribution deviation optimization strategy to bolster protection against manipulations guided by diverse random seeds. Extensive experiments on InpaintGuardBench and CelebA-HQ demonstrate that Anti-Inpainting effectively defends against diffusion-based inpainters under unknown conditions. Additionally, our approach demonstrates robustness against various image purification methods and transferability across different diffusion model versions.
[183] RAISE: Realness Assessment for Image Synthesis and Evaluation
Aniruddha Mukherjee, Spriha Dubey, Somdyuti Paul
Main category: cs.CV
TL;DR: The paper introduces RAISE, a dataset with AI-generated and real images paired with subjective realness scores, and evaluates deep vision models for predicting perceptual realness.
Details
Motivation: Assessing the perceived realness of AI-generated visual content is challenging but essential for substituting real data.
Method: Conducted a human study to evaluate perceptual realness, created the RAISE dataset, and trained models on it using deep vision features.
Result: Deep vision models effectively capture subjective realness, with RAISE serving as a resource for objective assessment.
Conclusion: RAISE aids in developing robust models for perceptual realness evaluation of AI-generated content.
Abstract: The rapid advancement of generative AI has enabled the creation of highly photorealistic visual content, offering practical substitutes for real images and videos in scenarios where acquiring real data is difficult or expensive. However, reliably substituting real visual content with AI-generated counterparts requires robust assessment of the perceived realness of AI-generated visual content, a challenging task due to its inherent subjective nature. To address this, we conducted a comprehensive human study evaluating the perceptual realness of both real and AI-generated images, resulting in a new dataset of images paired with subjective realness scores, introduced in this paper as RAISE. Further, we develop and train multiple models on RAISE to establish baselines for realness prediction. Our experimental results demonstrate that features derived from deep foundation vision models can effectively capture the subjective realness. RAISE thus provides a valuable resource for developing robust, objective models of perceptual realness assessment.
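A minimal baseline of the kind RAISE is designed to support: regress subjective realness scores from frozen deep features. The ResNet-50 backbone and ridge regressor here are illustrative assumptions; the paper's baseline models may differ.

```python
# Sketch: predict perceptual realness scores from frozen deep features.
import torch
import torchvision.models as models
from torchvision.models import ResNet50_Weights
from sklearn.linear_model import Ridge

weights = ResNet50_Weights.DEFAULT
backbone = models.resnet50(weights=weights)
backbone.fc = torch.nn.Identity()   # expose 2048-d penultimate features
backbone.eval()
preprocess = weights.transforms()

@torch.no_grad()
def embed(images):
    """images: list of PIL images -> (B, 2048) feature matrix."""
    batch = torch.stack([preprocess(im) for im in images])
    return backbone(batch).numpy()

# X = embed(train_images); y = subjective realness scores from the dataset
# reg = Ridge(alpha=1.0).fit(X, y); preds = reg.predict(embed(test_images))
```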
[184] Evading Data Provenance in Deep Neural Networks
Hongyu Zhu, Sichu Liang, Wenwen Wang, Zhuomeng Zhang, Fangqi Li, Shi-Lin Wang
Main category: cs.CV
TL;DR: The paper introduces a unified evasion framework to bypass Dataset Ownership Verification (DOV) methods, outperforming existing attacks by leveraging task-relevant knowledge transfer and curated subsets from OOD datasets.
Details
Motivation: To address the limitations of oversimplistic evasion attacks in evaluating DOV methods, revealing vulnerabilities and improving evasion effectiveness.
Method: A teacher model learns from the copyright dataset, then transfers task-relevant knowledge to a surrogate student using an OOD dataset, curated via Vision-Language and Large Language Models.
Result: The approach eliminates copyright identifiers and outperforms nine state-of-the-art evasion attacks in generalization and effectiveness.
Conclusion: The study exposes vulnerabilities in current DOV methods, emphasizing the need for further development to enhance their practicality.
Abstract: Modern over-parameterized deep models are highly data-dependent, with large-scale general-purpose and domain-specific datasets serving as the bedrock for rapid advancements. However, many datasets are proprietary or contain sensitive information, making unrestricted model training problematic. In the open world where data thefts cannot be fully prevented, Dataset Ownership Verification (DOV) has emerged as a promising method to protect copyright by detecting unauthorized model training and tracing illicit activities. Due to its diversity and superior stealth, evading DOV is considered extremely challenging. However, this paper identifies that previous studies have relied on oversimplistic evasion attacks for evaluation, leading to a false sense of security. We introduce a unified evasion framework, in which a teacher model first learns from the copyright dataset and then transfers task-relevant yet identifier-independent domain knowledge to a surrogate student using an out-of-distribution (OOD) dataset as the intermediary. Leveraging Vision-Language Models and Large Language Models, we curate the most informative and reliable subsets from the OOD gallery set as the final transfer set, and propose selectively transferring task-oriented knowledge to achieve a better trade-off between generalization and evasion effectiveness. Experiments across diverse datasets covering eleven DOV methods demonstrate that our approach simultaneously eliminates all copyright identifiers and significantly outperforms nine state-of-the-art evasion attacks in both generalization and effectiveness, with moderate computational overhead. As a proof of concept, we reveal key vulnerabilities in current DOV methods, highlighting the need for long-term development to enhance practicality.
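The core transfer step can be sketched as standard knowledge distillation on the curated OOD transfer set: the student never sees the copyright images, only the teacher's soft predictions on OOD inputs, which is why identifier-specific watermark signals do not carry over. The temperature below is an assumption.

```python
# Sketch: teacher-to-student distillation over an OOD transfer set.
import torch
import torch.nn.functional as F

def distill_step(teacher, student, ood_batch, T=4.0):
    with torch.no_grad():
        soft = F.softmax(teacher(ood_batch) / T, dim=1)   # teacher soft labels
    logp = F.log_softmax(student(ood_batch) / T, dim=1)
    # KL to the teacher's softened outputs transfers task-relevant knowledge,
    # while identifier signals tied to the copyright images are absent from
    # the OOD inputs and thus cannot be learned by the student.
    return F.kl_div(logp, soft, reduction="batchmean") * T * T
```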
[185] CountingFruit: Language-Guided 3D Fruit Counting with Semantic Gaussian Splatting
Fengze Li, Yangle Liu, Jieming Ma, Hai-Ning Liang, Yaochun Shen, Huangxiang Li, Zhijing Wu
Main category: cs.CV
TL;DR: FruitLangGS is a language-guided 3D fruit counting framework using adaptive-density Gaussian Splatting for accurate orchard-scale fruit counting, outperforming existing methods with up to 99.2% recall.
Details
Motivation: Challenges in 3D fruit counting include occlusion, semantic ambiguity, and computational cost. Existing methods suffer from fusion errors and slow inference.
Method: Uses adaptive-density Gaussian Splatting with radius-aware pruning and tile-based rasterization. CLIP-aligned semantic vectors filter Gaussians via dual-threshold cosine similarity, followed by geometric clustering.
Result: Achieves high recall (up to 99.2%) on orchard datasets, avoids fusion errors, and remains robust under occlusion.
Conclusion: FruitLangGS is effective for fruit counting and enables language-guided 3D semantic retrieval, showcasing potential for agricultural scene understanding.
Abstract: Accurate 3D fruit counting in orchards is challenging due to heavy occlusion, semantic ambiguity between fruits and surrounding structures, and the high computational cost of volumetric reconstruction. Existing pipelines often rely on multi-view 2D segmentation and dense volumetric sampling, which lead to accumulated fusion errors and slow inference. We introduce FruitLangGS, a language-guided 3D fruit counting framework that reconstructs orchard-scale scenes using an adaptive-density Gaussian Splatting pipeline with radius-aware pruning and tile-based rasterization, enabling scalable 3D representation. During inference, compressed CLIP-aligned semantic vectors embedded in each Gaussian are filtered via a dual-threshold cosine similarity mechanism, retrieving Gaussians relevant to target prompts while suppressing common distractors (e.g., foliage), without requiring retraining or image-space masks. The selected Gaussians are then sampled into dense point clouds and clustered geometrically to estimate fruit instances, remaining robust under severe occlusion and viewpoint variation. Experiments on nine different orchard-scale datasets demonstrate that FruitLangGS consistently outperforms existing pipelines in instance counting recall, avoiding multi-view segmentation fusion errors and achieving up to 99.2% recall on the Fuji-SfM orchard dataset. Ablation studies further confirm that language-conditioned semantic embedding and dual-threshold prompt filtering are essential for suppressing distractors and improving counting accuracy under heavy occlusion. Beyond fruit counting, the same framework enables prompt-driven 3D semantic retrieval without retraining, highlighting the potential of language-guided 3D perception for scalable agricultural scene understanding.
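A minimal sketch of the dual-threshold semantic filtering step described above: keep Gaussians whose CLIP-aligned embedding is close to the target prompt and far from distractor prompts. The threshold values and the clip_text helper in the usage comment are hypothetical.

```python
# Sketch: dual-threshold cosine-similarity filtering of semantic Gaussians.
import numpy as np

def filter_gaussians(gauss_emb, target_emb, distractor_embs,
                     t_pos=0.25, t_neg=0.20):
    def cos(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T
    pos = cos(gauss_emb, target_emb[None]).squeeze(-1)  # (N,) target similarity
    neg = cos(gauss_emb, distractor_embs).max(axis=1)   # closest distractor
    return (pos > t_pos) & (neg < t_neg)                # boolean keep-mask

# mask = filter_gaussians(embeddings, clip_text("apple"),
#                         clip_text(["leaf", "branch", "trunk"]))
# The retained Gaussians are then sampled into points and clustered
# geometrically (e.g. DBSCAN) to estimate fruit instances.
```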
[186] DreamSat-2.0: Towards a General Single-View Asteroid 3D Reconstruction
Santiago Diaz, Xinghui Hu, Josiane Uwumukiza, Giovanni Lavezzi, Victor Rodriguez-Fernandez, Richard Linares
Main category: cs.CV
TL;DR: DreamSat-2.0 benchmarks three 3D reconstruction models on spacecraft and asteroid datasets, showing domain-dependent performance in image quality and shape accuracy.
Details
Motivation: To improve asteroid exploration and autonomous spacecraft navigation by evaluating and advancing 3D reconstruction techniques.
Method: Benchmarking Hunyuan-3D, Trellis-3D, and Ouroboros-3D using 2D perceptual and 3D geometric metrics on custom datasets.
Result: Models perform better on spacecraft for image quality and on asteroids for shape accuracy. Hunyuan-3D excels in both domains.
Conclusion: DreamSat-2.0 establishes new benchmarks, advancing 3D reconstruction for space applications.
Abstract: To enhance asteroid exploration and autonomous spacecraft navigation, we introduce DreamSat-2.0, a pipeline that benchmarks three state-of-the-art 3D reconstruction models (Hunyuan-3D, Trellis-3D, and Ouroboros-3D) on custom spacecraft and asteroid datasets. Our systematic analysis, using 2D perceptual (image quality) and 3D geometric (shape accuracy) metrics, reveals that model performance is domain-dependent. While models produce higher-quality images of complex spacecraft, they achieve better geometric reconstructions for the simpler forms of asteroids. New benchmarks are established, with Hunyuan-3D achieving top perceptual scores on spacecraft but its best geometric accuracy on asteroids, marking a significant advance over our prior work.
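For reference, a standard 3D geometric metric such a benchmark would report is the symmetric Chamfer distance between sampled points of the reconstruction and the ground truth; the snippet below is an illustrative metric implementation, not the paper's evaluation code.

```python
# Sketch: symmetric Chamfer distance between two point sets.
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(P, Q):
    """P: (N, 3), Q: (M, 3) point sets, assumed pre-aligned and scaled."""
    d_pq, _ = cKDTree(Q).query(P)  # nearest Q-point for each P-point
    d_qp, _ = cKDTree(P).query(Q)  # nearest P-point for each Q-point
    return (d_pq ** 2).mean() + (d_qp ** 2).mean()
```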
[187] VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, Zhenxiang Li, Zhongying Tu, Conghui He, Yu Qiao, Yali Wang, Yi Wang, Limin Wang
Main category: cs.CV
TL;DR: VRBench is a new benchmark for evaluating large models’ multi-step reasoning using long narrative videos, human-labeled QA pairs, and a multi-phase evaluation pipeline.
Details
Motivation: Existing evaluations lack focus on temporal reasoning and procedural validity, which VRBench addresses.
Method: VRBench includes 960 long videos, 8,243 QA pairs, and 25,106 reasoning steps. It uses a human-AI framework for coherent reasoning chains and a multi-phase evaluation pipeline.
Result: Evaluations of 12 LLMs and 19 VLMs on VRBench provide insights into multi-step reasoning capabilities.
Conclusion: VRBench advances the field by offering a comprehensive evaluation tool for multi-step reasoning in large models.
Abstract: We present VRBench, the first long narrative video benchmark crafted for evaluating large models’ multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 960 long videos (with an average duration of 1.6 hours), along with 8,243 human-labeled multi-step question-answering pairs and 25,106 reasoning steps with timestamps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. Apart from the MCQs for the final results, we propose a progress-level LLM-guided scoring metric to evaluate the quality of the reasoning chain from multiple dimensions comprehensively. Through extensive evaluations of 12 LLMs and 19 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning.
[188] COSTARR: Consolidated Open Set Technique with Attenuation for Robust Recognition
Ryan Rabinowitz, Steve Cruz, Walter Scheirer, Terrance E. Boult
Main category: cs.CV
TL;DR: COSTARR introduces a novel attenuation hypothesis for open-set recognition, outperforming prior methods by leveraging overlooked attenuation information in deep features.
Details
Motivation: Addressing the challenge of novelty detection in visual recognition by moving beyond the familiarity hypothesis to incorporate attenuation effects.
Method: Proposes COSTARR, combining familiar features and attenuation effects, with ablation studies to validate contributions of pre- and post-attenuated features.
Result: COSTARR significantly outperforms state-of-the-art methods across various architectures and datasets.
Conclusion: COSTARR advances open-set recognition by effectively utilizing attenuation information, demonstrating broad applicability.
Abstract: Handling novelty remains a key challenge in visual recognition systems. Existing open-set recognition (OSR) methods rely on the familiarity hypothesis, detecting novelty by the absence of familiar features. We propose a novel attenuation hypothesis: small weights learned during training attenuate features and serve a dual role, differentiating known classes while discarding information useful for distinguishing known from unknown classes. To leverage this overlooked information, we present COSTARR, a novel approach that combines both the requirement of familiar features and the lack of unfamiliar ones. We provide a probabilistic interpretation of the COSTARR score, linking it to the likelihood of correct classification and belonging in a known class. To determine the individual contributions of the pre- and post-attenuated features to COSTARR’s performance, we conduct ablation studies that show both pre-attenuated deep features and the underutilized post-attenuated Hadamard product features are essential for improving OSR. Also, we evaluate COSTARR in a large-scale setting using ImageNet2012-1K as known data and NINCO, iNaturalist, OpenImage-O, and other datasets as unknowns, across multiple modern pre-trained architectures (ViTs, ConvNeXts, and ResNet). The experiments demonstrate that COSTARR generalizes effectively across various architectures and significantly outperforms prior state-of-the-art methods by incorporating previously discarded attenuation information, advancing open-set recognition capabilities.
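A heavily hedged sketch of the score's ingredients as the abstract describes them: pre-attenuated penultimate features and the post-attenuated Hadamard product between the features and the predicted class's last-layer weights. How COSTARR actually combines these terms is not specified here, so the combination below is an assumption for illustration only.

```python
# Sketch (assumed combination): familiarity plus surviving attenuated mass.
import numpy as np

def costarr_like_score(f, W):
    """f: (D,) penultimate features; W: (C, D) last-layer weights."""
    logits = W @ f
    c = int(np.argmax(logits))            # most familiar known class
    hadamard = W[c] * f                   # post-attenuated features
    familiarity = logits[c]               # evidence for a known class
    attenuation = np.abs(hadamard).sum()  # mass surviving the small weights
    return familiarity + attenuation      # higher = more likely known (assumed)
```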
[189] ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment
Amir Aghdam, Vincent Tao Hu, Björn Ommer
Main category: cs.CV
TL;DR: ActAlign is a zero-shot, training-free method for fine-grained video classification using sequence alignment of LLM-generated sub-actions and video frames via DTW, outperforming larger models.
Details
Motivation: To enable zero-shot classification of fine-grained actions in videos without temporal annotations or training data for unseen classes.
Method: Uses LLMs to generate ordered sub-actions for each class and aligns them with video frames using Dynamic Time Warping (DTW) in a shared embedding space.
Result: Achieves 30.5% accuracy on ActionAtlas, outperforming larger models with 8x fewer parameters.
Conclusion: Combining structured language priors with classical alignment methods can enhance image-language models for fine-grained video understanding.
Abstract: We address the task of zero-shot video classification for extremely fine-grained actions (e.g., Windmill Dunk in basketball), where no video examples or temporal annotations are available for unseen classes. While image-language models (e.g., CLIP, SigLIP) show strong open-set recognition, they lack temporal modeling needed for video understanding. We propose ActAlign, a truly zero-shot, training-free method that formulates video classification as a sequence alignment problem, preserving the generalization strength of pretrained image-language models. For each class, a large language model (LLM) generates an ordered sequence of sub-actions, which we align with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on ActionAtlas, the most diverse benchmark of fine-grained actions across multiple sports, where human performance is only 61.6%. ActAlign outperforms billion-parameter video-language models while using 8x fewer parameters. Our approach is model-agnostic and domain-general, demonstrating that structured language priors combined with classical alignment methods can unlock the open-set recognition potential of image-language models for fine-grained video understanding.
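The alignment step reduces to classical DTW over a cosine-cost matrix between the ordered sub-action embeddings and per-frame embeddings; classification then picks the class whose script aligns at the lowest normalized cost. A minimal sketch, assuming both embedding sets already live in a shared space (e.g. from SigLIP):

```python
# Sketch: DTW alignment cost between a class script and video frames.
import numpy as np

def dtw_cost(S, V):
    """S: (K, D) ordered sub-action embeddings; V: (T, D) frame embeddings."""
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    cost = 1.0 - S @ V.T                 # (K, T) local costs, 1 - cosine
    K, T = cost.shape
    D = np.full((K + 1, T + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, K + 1):
        for j in range(1, T + 1):
            D[i, j] = cost[i - 1, j - 1] + min(
                D[i - 1, j],       # advance to the next sub-action
                D[i, j - 1],       # stay on the current sub-action
                D[i - 1, j - 1])   # advance both
    return D[K, T] / (K + T)       # length-normalized alignment cost

# predicted class = argmin over classes of dtw_cost(class_script_emb, frames)
```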
[190] AURA: A Hybrid Spatiotemporal-Chromatic Framework for Robust, Real-Time Detection of Industrial Smoke Emissions
Mikhail Bychkov, Matey Yordanov, Andrei Kuchma
Main category: cs.CV
TL;DR: AURA is a hybrid spatiotemporal-chromatic framework for real-time industrial smoke detection and classification, improving accuracy and reducing false positives.
Details
Motivation: Current monitoring systems lack specificity in distinguishing smoke types and struggle with environmental variability.
Method: AURA uses dynamic movement patterns and color characteristics of industrial smoke for detection and classification.
Result: Enhanced accuracy and reduced false positives in smoke detection.
Conclusion: AURA improves environmental compliance, safety, and public health through precise automated monitoring.
Abstract: This paper introduces AURA, a novel hybrid spatiotemporal-chromatic framework designed for robust, real-time detection and classification of industrial smoke emissions. The framework addresses critical limitations of current monitoring systems, which often lack the specificity to distinguish smoke types and struggle with environmental variability. AURA leverages both the dynamic movement patterns and the distinct color characteristics of industrial smoke to provide enhanced accuracy and reduced false positives. This framework aims to significantly improve environmental compliance, operational safety, and public health outcomes by enabling precise, automated monitoring of industrial emissions.
[191] Trans-Adapter: A Plug-and-Play Framework for Transparent Image Inpainting
Yuekun Dai, Haitian Li, Shangchen Zhou, Chen Change Loy
Main category: cs.CV
TL;DR: Proposes Trans-Adapter, a plug-and-play adapter for diffusion-based inpainting models to handle RGBA images directly, addressing transparency consistency and edge quality issues.
Details
Motivation: Existing inpainting methods are limited to RGB images, and conventional RGBA inpainting pipelines struggle with transparency consistency and edge quality.
Method: Introduces Trans-Adapter, a plug-and-play adapter for diffusion models, and LayerBench for evaluation with a new alpha edge quality metric.
Result: Demonstrates effectiveness through extensive experiments on LayerBench, showing improved transparency edge quality.
Conclusion: Trans-Adapter successfully enables direct RGBA image inpainting with better transparency consistency and edge quality.
Abstract: RGBA images, with the additional alpha channel, are crucial for any application that needs blending, masking, or transparency effects, making them more versatile than standard RGB images. Nevertheless, existing image inpainting methods are designed exclusively for RGB images. Conventional approaches to transparent image inpainting typically involve placing a background underneath RGBA images and employing a two-stage process: image inpainting followed by image matting. This pipeline, however, struggles to preserve transparency consistency in edited regions, and matting can introduce jagged edges along transparency boundaries. To address these challenges, we propose Trans-Adapter, a plug-and-play adapter that enables diffusion-based inpainting models to process transparent images directly. Trans-Adapter also supports controllable editing via ControlNet and can be seamlessly integrated into various community models. To evaluate our method, we introduce LayerBench, along with a novel non-reference evaluation metric for assessing alpha edge quality. We conduct extensive experiments on LayerBench to demonstrate the effectiveness of our approach.
[192] Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions
Xiao Zhang, Johan Bos
Main category: cs.CV
TL;DR: A novel multi-modal framework for digitizing tombstones uses vision-language models and retrieval-augmented generation to improve interpretation and preservation, achieving high accuracy (F1 score 89.5).
Details
Motivation: Tombstones are culturally significant but face preservation challenges like erosion and vandalism. Digitization can aid in their interpretation and protection.
Method: The approach combines vision-language models (VLMs) to create structured Tombstone Meaning Representations (TMRs) and retrieval-augmented generation (RAG) for semantic enrichment.
Result: The method significantly outperforms traditional OCR, improving parsing accuracy (F1 score from 36.1 to 89.5) and demonstrates robustness across diverse inscriptions and degraded conditions.
Conclusion: This work formalizes tombstone understanding with large vision-language models, offering valuable tools for heritage preservation.
Abstract: Tombstones are historically and culturally rich artifacts, encapsulating individual lives, community memory, historical narratives and artistic expression. Yet, many tombstones today face significant preservation challenges, including physical erosion, vandalism, environmental degradation, and political shifts. In this paper, we introduce a novel multi-modal framework for tombstone digitization, aiming to improve the interpretation, organization and retrieval of tombstone content. Our approach leverages vision-language models (VLMs) to translate tombstone images into structured Tombstone Meaning Representations (TMRs), capturing both image and text information. To further enrich semantic parsing, we incorporate retrieval-augmented generation (RAG) to integrate externally dependent elements such as toponyms, occupation codes, and ontological concepts. Compared to traditional OCR-based pipelines, our method improves parsing accuracy from an F1 score of 36.1 to 89.5. We additionally evaluate the model’s robustness across diverse linguistic and cultural inscriptions, and simulate physical degradation through image fusion to assess performance under noisy or damaged conditions. Our work represents the first attempt to formalize tombstone understanding using large vision-language models, presenting implications for heritage preservation.
[193] EgoTrigger: Toward Audio-Driven Image Capture for Human Memory Enhancement in All-Day Energy-Efficient Smart Glasses
Akshay Paruchuri, Sinan Hersek, Lavisha Aggarwal, Qiao Yang, Xin Liu, Achin Kulshrestha, Andrea Colaco, Henry Fuchs, Ishan Chatterjee
Main category: cs.CV
TL;DR: EgoTrigger uses audio cues to selectively activate cameras in smart glasses, reducing energy use by 54% while maintaining performance for memory enhancement tasks.
Details
Motivation: Smart glasses need energy-efficient solutions for continuous contextual sensing to enhance human memory without draining battery life.
Method: EgoTrigger employs a lightweight audio model (YAMNet) and custom classification to trigger cameras based on hand-object interaction sounds, tested on QA-Ego4D and HME-QA datasets.
Result: EgoTrigger reduces frame usage by 54%, saving energy in cameras and downstream operations, while matching performance on episodic memory tasks.
Conclusion: EgoTrigger’s context-aware triggering is a promising approach for energy-efficient smart glasses, supporting all-day use for memory enhancement.
Abstract: All-day smart glasses are likely to emerge as platforms capable of continuous contextual sensing, uniquely positioning them for unprecedented assistance in our daily lives. Integrating the multi-modal AI agents required for human memory enhancement while performing continuous sensing, however, presents a major energy efficiency challenge for all-day usage. Achieving this balance requires intelligent, context-aware sensor management. Our approach, EgoTrigger, leverages audio cues from the microphone to selectively activate power-intensive cameras, enabling efficient sensing while preserving substantial utility for human memory enhancement. EgoTrigger uses a lightweight audio model (YAMNet) and a custom classification head to trigger image capture from hand-object interaction (HOI) audio cues, such as the sound of a drawer opening or a medication bottle being opened. In addition to evaluating on the QA-Ego4D dataset, we introduce and evaluate on the Human Memory Enhancement Question-Answer (HME-QA) dataset. Our dataset contains 340 human-annotated first-person QA pairs from full-length Ego4D videos that were curated to ensure that they contained audio, focusing on HOI moments critical for contextual understanding and memory. Our results show EgoTrigger can use 54% fewer frames on average, significantly saving energy in both power-hungry sensing components (e.g., cameras) and downstream operations (e.g., wireless transmission), while achieving comparable performance on datasets for an episodic memory task. We believe this context-aware triggering strategy represents a promising direction for enabling energy-efficient, functional smart glasses capable of all-day use – supporting applications like helping users recall where they placed their keys or information about their routine activities (e.g., taking medications).
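The gating logic can be sketched as follows: a YAMNet embedding feeds a small binary head for hand-object-interaction sounds, and the power-hungry camera fires only on confident positives. The hoi_head classifier and the threshold are hypothetical stand-ins for the paper's custom head; YAMNet itself is loaded from its public TF-Hub release.

```python
# Sketch: audio-gated image capture with a YAMNet-style trigger.
import numpy as np
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def should_capture(waveform_16k, hoi_head, thresh=0.8):
    """waveform_16k: mono float32 audio at 16 kHz (roughly 1 s chunk).

    hoi_head: hypothetical callable mapping a pooled 1024-d YAMNet
    embedding to P(hand-object-interaction sound).
    """
    _, embeddings, _ = yamnet(waveform_16k)            # (frames, 1024)
    p_hoi = hoi_head(np.asarray(embeddings).mean(0))   # pooled clip embedding
    return p_hoi > thresh                              # fire the camera?

# Main-loop idea: keep the cheap microphone always on, call should_capture
# on each audio chunk, and power up the camera only on positive triggers,
# which is where the reported ~54% frame reduction would come from.
```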
[194] MASIV: Toward Material-Agnostic System Identification from Videos
Yizhou Zhao, Haoyu Chen, Chunjiang Liu, Zhenyang Li, Charles Herrmann, Junhwa Hur, Yinxiao Li, Ming-Hsuan Yang, Bhiksha Raj, Min Xu
Main category: cs.CV
TL;DR: MASIV is a vision-based framework for material-agnostic system identification, using neural constitutive models to infer object dynamics without predefined material priors. It addresses optimization challenges with dense geometric guidance, achieving top performance in accuracy and generalization.
Details
Motivation: Existing methods rely on predefined material priors, limiting their ability to handle unknown materials. MASIV aims to overcome this by being material-agnostic.
Method: MASIV employs learnable neural constitutive models and introduces dense geometric guidance by reconstructing continuum particle trajectories for stable optimization.
Result: MASIV achieves state-of-the-art performance in geometric accuracy, rendering quality, and generalization ability.
Conclusion: MASIV successfully addresses the limitations of predefined material priors and optimization challenges, offering a robust solution for system identification from videos.
Abstract: System identification from videos aims to recover object geometry and governing physical laws. Existing methods integrate differentiable rendering with simulation but rely on predefined material priors, limiting their ability to handle unknown ones. We introduce MASIV, the first vision-based framework for material-agnostic system identification. Unlike existing approaches that depend on hand-crafted constitutive laws, MASIV employs learnable neural constitutive models, inferring object dynamics without assuming a scene-specific material prior. However, the absence of full particle state information imposes unique challenges, leading to unstable optimization and physically implausible behaviors. To address this, we introduce dense geometric guidance by reconstructing continuum particle trajectories, providing temporally rich motion constraints beyond sparse visual cues. Comprehensive experiments show that MASIV achieves state-of-the-art performance in geometric accuracy, rendering quality, and generalization ability.
[195] The Promise of RL for Autoregressive Image Editing
Saba Ahmadi, Rabiul Awal, Ankur Sikarwar, Amirhossein Kazemnejad, Ge Ya Luo, Juan A. Rodriguez, Sai Rajeswar, Siva Reddy, Christopher Pal, Benno Krojer, Aishwarya Agrawal
Main category: cs.CV
TL;DR: The paper introduces EARL, an RL-based image editing model combining autoregression and RL, outperforming baselines with less training data.
Details
Motivation: To enhance performance on diverse image editing tasks by exploring strategies like SFT, RL, and CoT reasoning.
Method: Adopts an autoregressive multimodal model, evaluating SFT, RL, and CoT, with RL combined with a multi-modal LLM verifier proving most effective.
Result: EARL, the proposed RL-based model, performs competitively on diverse edits despite using less training data.
Conclusion: EARL advances autoregressive multimodal models in image editing; code and models are publicly released.
Abstract: We explore three strategies to enhance performance on a wide range of image editing tasks: supervised fine-tuning (SFT), reinforcement learning (RL), and Chain-of-Thought (CoT) reasoning. In order to study all these components in one consistent framework, we adopt an autoregressive multimodal model that processes textual and visual tokens in a unified manner. We find RL combined with a large multi-modal LLM verifier to be the most effective of these strategies. As a result, we release EARL: Editing with Autoregression and RL, a strong RL-based image editing model that performs competitively on a diverse range of edits compared to strong baselines, despite using much less training data. Thus, EARL pushes the frontier of autoregressive multimodal models on image editing. We release our code, training data, and trained models at https://github.com/mair-lab/EARL.
[196] LAVA: Language Driven Scalable and Versatile Traffic Video Analytics
Yanrui Yu, Tianfei Zhou, Jiaxin Sun, Lianpeng Qiao, Lizhong Ding, Ye Yuan, Guoren Wang
Main category: cs.CV
TL;DR: The paper introduces Lava, a language-driven video analytics system for flexible and efficient querying of large-scale video data using natural language, outperforming existing methods in accuracy and speed.
Details
Motivation: Existing SQL-based video analytics systems are rigid and limited to predefined categories, prompting the need for a more flexible, natural language-driven approach.
Method: Lava combines a bandit-based sampling method for video localization, an open-world detection module for object retrieval, and a trajectory extraction scheme for temporal association.
Result: Lava improves F1-scores by 14%, reduces MPAE by 0.39, achieves 86% top-k precision, and processes videos 9.6x faster than baselines.
Conclusion: Lava demonstrates the effectiveness of natural language-driven video analytics, offering significant improvements in flexibility, accuracy, and efficiency.
Abstract: In modern urban environments, camera networks generate massive amounts of operational footage – reaching petabytes each day – making scalable video analytics essential for efficient processing. Many existing approaches adopt an SQL-based paradigm for querying such large-scale video databases; however, this constrains queries to rigid patterns with predefined semantic categories, significantly limiting analytical flexibility. In this work, we explore a language-driven video analytics paradigm aimed at enabling flexible and efficient querying of high-volume video data driven by natural language. Particularly, we build Lava, a system that accepts natural language queries and retrieves traffic targets across multiple levels of granularity and arbitrary categories. Lava comprises three main components: 1) a multi-armed bandit-based efficient sampling method for video segment-level localization; 2) a video-specific open-world detection module for object-level retrieval; and 3) a long-term object trajectory extraction scheme for temporal object association, yielding complete trajectories for objects of interest. To support comprehensive evaluation, we further develop a novel benchmark by providing diverse, semantically rich natural language predicates and fine-grained annotations for multiple videos. Experiments on this benchmark demonstrate that Lava improves $F_1$-scores for selection queries by 14%, reduces MPAE for aggregation queries by 0.39, and achieves top-$k$ precision of 86%, while processing videos 9.6x faster than the most accurate baseline. Our code and dataset are available at https://github.com/yuyanrui/LAVA.
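Component 1) can be illustrated with a classic UCB bandit over coarse video segments, where each segment is an arm and the reward signals whether a sampled frame matched the query; the reward function and budget below are illustrative assumptions, not Lava's exact sampler.

```python
# Sketch: UCB bandit sampling for query-driven video segment localization.
import numpy as np

def ucb_sample(n_segments, reward_fn, budget=200, c=1.0):
    pulls = np.zeros(n_segments)
    wins = np.zeros(n_segments)
    for t in range(1, budget + 1):
        # Unvisited arms get infinite priority so every segment is probed once.
        ucb = np.where(pulls == 0, np.inf,
                       wins / np.maximum(pulls, 1)
                       + c * np.sqrt(np.log(t) / np.maximum(pulls, 1)))
        arm = int(np.argmax(ucb))
        wins[arm] += reward_fn(arm)  # 1 if a sampled frame matched the query
        pulls[arm] += 1
    return wins / np.maximum(pulls, 1)  # per-segment relevance estimate

# Segments with high estimated relevance are then handed to the heavier
# open-world detector, so expensive decoding concentrates where it pays off.
```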
[197] UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation
Chaitanya Patel, Hiroki Nakamura, Yuta Kyuragi, Kazuki Kozuka, Juan Carlos Niebles, Ehsan Adeli
Main category: cs.CV
TL;DR: The paper introduces UniEgoMotion, a unified model for egocentric motion generation and forecasting using first-person images, addressing limitations of third-person methods.
Details
Motivation: Enhancing AR/VR, human-robot interaction, assistive tech, and healthcare by improving egocentric motion prediction and synthesis.
Method: Proposes UniEgoMotion, a conditional motion diffusion model with head-centric motion representation, leveraging first-person visual inputs.
Result: Achieves state-of-the-art performance in egocentric motion reconstruction and generation from single images.
Conclusion: Sets a new benchmark for egocentric motion modeling, enabling novel applications in egocentric contexts.
Abstract: Egocentric human motion generation and forecasting with scene-context is crucial for enhancing AR/VR experiences, improving human-robot interaction, advancing assistive technologies, and enabling adaptive healthcare solutions by accurately predicting and simulating movement from a first-person perspective. However, existing methods primarily focus on third-person motion synthesis with structured 3D scene contexts, limiting their effectiveness in real-world egocentric settings where limited field of view, frequent occlusions, and dynamic cameras hinder scene perception. To bridge this gap, we introduce Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks that utilize first-person images for scene-aware motion synthesis without relying on explicit 3D scene. We propose UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. UniEgoMotion’s simple yet effective design supports egocentric motion reconstruction, forecasting, and generation from first-person visual inputs in a unified framework. Unlike previous works that overlook scene semantics, our model effectively extracts image-based scene context to infer plausible 3D motion. To facilitate training, we introduce EE4D-Motion, a large-scale dataset derived from EgoExo4D, augmented with pseudo-ground-truth 3D motion annotations. UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first to generate motion from a single egocentric image. Extensive evaluations demonstrate the effectiveness of our unified framework, setting a new benchmark for egocentric motion modeling and unlocking new possibilities for egocentric applications.
[198] Semi-Supervised Anomaly Detection in Brain MRI Using a Domain-Agnostic Deep Reinforcement Learning Approach
Zeduo Zhang, Yalda Mohsenzadeh
Main category: cs.CV
TL;DR: A semi-supervised anomaly detection framework using deep reinforcement learning (DRL) is proposed, achieving high performance on brain MRI and industrial datasets.
Details
Motivation: To address challenges like large-scale data, overfitting, and class imbalance in anomaly detection, particularly for brain MRI volumes.
Method: Integrates DRL with feature representations, using publicly available datasets (IXI and BraTS 2021) for training, validation, and testing. Preprocessing includes normalization, skull-stripping, and co-registering.
Result: Achieved AUROC of 88.7% (pixel-level) and 96.7% (image-level) on brain MRI datasets, outperforming SOTA. Also performed well on industrial datasets (AUROC = 99.8% pixel-level, 99.3% image-level).
Conclusion: The domain-agnostic DRL approach shows promise for MRI anomaly detection, with robustness, generalizability, and efficiency for real-world applications.
Abstract: To develop a domain-agnostic, semi-supervised anomaly detection framework that integrates deep reinforcement learning (DRL) to address challenges such as large-scale data, overfitting, and class imbalance, focusing on brain MRI volumes. This retrospective study used publicly available brain MRI datasets collected between 2005 and 2021. The IXI dataset provided 581 T1-weighted and 578 T2-weighted MRI volumes (from healthy subjects) for training, while the BraTS 2021 dataset provided 251 volumes for validation and 1000 for testing (unhealthy subjects with Glioblastomas). Preprocessing included normalization, skull-stripping, and co-registering to a uniform voxel size. Experiments were conducted on both T1- and T2-weighted modalities. Additional experiments and ablation analyses were also carried out on the industrial datasets. The proposed method integrates DRL with feature representations to handle label scarcity, large-scale data and overfitting. Statistical analysis was based on several detection and segmentation metrics including AUROC and Dice score. The proposed method achieved an AUROC of 88.7% (pixel-level) and 96.7% (image-level) on brain MRI datasets, outperforming State-of-The-Art (SOTA) methods. On industrial surface datasets, the model also showed competitive performance (AUROC = 99.8% pixel-level, 99.3% image-level) on MVTec AD dataset, indicating strong cross-domain generalization. Studies on anomaly sample size showed a monotonic increase in AUROC as more anomalies were seen, without evidence of overfitting or additional computational cost. The domain-agnostic semi-supervised approach using DRL shows significant promise for MRI anomaly detection, achieving strong performance on both medical and industrial datasets. Its robustness, generalizability and efficiency highlight its potential for real-world clinical applications.
[199] VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection
Hao Cheng, Zhiwei Zhao, Yichao He, Zhenzhen Hu, Jia Li, Meng Wang, Richang Hong
Main category: cs.CV
TL;DR: VAEmo is a two-stage framework for audiovisual emotion recognition (AVER) that combines unified cross-modal encoding with emotion-aware semantic guidance, achieving state-of-the-art performance.
Details
Motivation: AVER faces challenges like emotional ambiguity, cross-modal disparities, and scarce annotated data. Existing methods lack fine-grained emotional modeling.
Method: VAEmo uses a two-stage approach: Stage 1 pre-trains a unified network on large-scale VA data without labels; Stage 2 injects external affective knowledge via text embeddings and contrastive learning.
Result: VAEmo outperforms benchmarks, demonstrating the effectiveness of unified encoding and emotion-aware semantics.
Conclusion: VAEmo offers a compact, efficient solution for AVER, leveraging cross-modal alignment and external knowledge for superior performance.
Abstract: Audiovisual emotion recognition (AVER) aims to infer human emotions from nonverbal visual-audio (VA) cues, offering modality-complementary and language-agnostic advantages. However, AVER remains challenging due to the inherent ambiguity of emotional expressions, cross-modal expressive disparities, and the scarcity of reliably annotated data. Recent self-supervised AVER approaches have introduced strong multimodal representations, yet they predominantly rely on modality-specific encoders and coarse content-level alignment, limiting fine-grained emotional semantic modeling. To address these issues, we propose VAEmo, an efficient two-stage framework for emotion-centric joint VA representation learning with external knowledge injection. In Stage 1, a unified and lightweight representation network is pre-trained on large-scale speaker-centric VA corpora via masked reconstruction and contrastive objectives, mitigating the modality gap and learning expressive, complementary representations without emotion labels. In Stage 2, multimodal large language models automatically generate detailed affective descriptions according to our well-designed chain-of-thought prompting for only a small subset of VA samples; these rich textual semantics are then injected by aligning their corresponding embeddings with VA representations through dual-path contrastive learning, further bridging the emotion gap. Extensive experiments on multiple downstream AVER benchmarks show that VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance for efficient, generalizable VA emotion representations.
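The dual-path contrastive alignment described above can be pictured as a symmetric InfoNCE objective between the visual-audio and text embeddings. Below is a minimal sketch under that assumption; the function name, temperature, and batch-wise negatives are illustrative, not VAEmo's exact formulation.

```python
# Minimal sketch of Stage-2-style dual-path contrastive alignment
# (illustrative only; the exact VAEmo objective may differ).
import torch
import torch.nn.functional as F

def dual_path_contrastive(va_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE between visual-audio and text embeddings.

    va_emb, txt_emb: (batch, dim) tensors; row i of each describes sample i.
    """
    va = F.normalize(va_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = va @ txt.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(va.size(0), device=va.device)
    loss_va2txt = F.cross_entropy(logits, targets)      # VA -> text path
    loss_txt2va = F.cross_entropy(logits.t(), targets)  # text -> VA path
    return 0.5 * (loss_va2txt + loss_txt2va)

# usage with random features
loss = dual_path_contrastive(torch.randn(8, 256), torch.randn(8, 256))
```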
[200] Dataset Condensation with Color Compensation
Huyu Wu, Duo Su, Junjie Hou, Guang Li
Main category: cs.CV
TL;DR: DC3, a dataset condensation framework with color compensation, addresses inefficiency and semantic distortion in existing methods by enhancing color diversity using latent diffusion models.
Details
Motivation: Existing dataset condensation methods struggle with inefficiency (image-level selection) or semantic distortion (pixel-level optimization). The oversight of color’s dual role as information carrier and semantic unit is identified as a critical issue.
Method: DC3 uses a calibrated selection strategy and latent diffusion models to enhance color diversity in condensed images, avoiding the creation of entirely new images.
Result: DC3 outperforms SOTA methods across benchmarks, with FID results confirming high-quality datasets without model collapse or degradation.
Conclusion: DC3 demonstrates superior performance and generalization, pioneering the fine-tuning of pre-trained diffusion models with condensed datasets.
Abstract: Dataset condensation always faces a constitutive trade-off: balancing performance and fidelity under extreme compression. Existing methods struggle with two bottlenecks: image-level selection methods (Coreset Selection, Dataset Quantization) suffer from inefficient condensation, while pixel-level optimization (Dataset Distillation) introduces semantic distortion due to over-parameterization. From empirical observations, we find that a critical problem in dataset condensation is the oversight of color’s dual role as an information carrier and a basic semantic representation unit. We argue that improving the colorfulness of condensed images is beneficial for representation learning. Motivated by this, we propose DC3: a Dataset Condensation framework with Color Compensation. After a calibrated selection strategy, DC3 utilizes a latent diffusion model to enhance the color diversity of an image rather than creating a brand-new one. Extensive experiments demonstrate the superior performance and generalization of DC3, which outperforms SOTA methods across multiple benchmarks. To the best of our knowledge, beyond its focus on downstream tasks, DC3 is the first work to fine-tune pre-trained diffusion models with condensed datasets. The FID results prove that training networks with our high-quality datasets is feasible without model collapse or other degradation issues. Code and generated data will be released soon.
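Since the abstract's central claim is that colorfulness matters for condensed data, a natural way to make the selection step concrete is the classic Hasler-Süsstrunk colorfulness score. The sketch below uses it as a hypothetical ranking proxy; DC3's actual calibrated selection strategy is not specified here.

```python
# Hasler-Susstrunk colorfulness metric, a standard way to score the color
# diversity the DC3 abstract argues for (an illustrative proxy, not the
# paper's selection criterion).
import numpy as np

def colorfulness(img):
    """img: (H, W, 3) RGB array (uint8 or float)."""
    r, g, b = (img[..., c].astype(float) for c in range(3))
    rg = r - g                      # red-green opponent channel
    yb = 0.5 * (r + g) - b          # yellow-blue opponent channel
    std = np.sqrt(rg.std() ** 2 + yb.std() ** 2)
    mean = np.sqrt(rg.mean() ** 2 + yb.mean() ** 2)
    return std + 0.3 * mean

# rank a candidate pool and keep the most colorful images
pool = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(100)]
keep = sorted(pool, key=colorfulness, reverse=True)[:10]
```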
[201] Is It Really You? Exploring Biometric Verification Scenarios in Photorealistic Talking-Head Avatar Videos
Laura Pedrouzo-Rodriguez, Pedro Delgado-DeRobles, Luis F. Gomez, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez
Main category: cs.CV
TL;DR: The paper explores using facial motion patterns as behavioral biometrics for identity verification in photorealistic avatar-mediated scenarios, proposing a lightweight spatio-temporal Graph Convolutional Network with 80% AUC performance.
Details
Motivation: The rise of photorealistic avatars introduces security risks like impersonation, necessitating reliable biometric verification methods beyond appearance and voice.
Method: A new dataset of realistic avatar videos (genuine and impostor) was created using GAGAvatar. A lightweight spatio-temporal Graph Convolutional Network with temporal attention pooling, using facial landmarks, was proposed.
Result: Facial motion cues achieved ~80% AUC for identity verification, demonstrating their reliability.
Conclusion: The work highlights the need for advanced behavioral biometric defenses in avatar-based systems, providing a benchmark for future research.
Abstract: Photorealistic talking-head avatars are becoming increasingly common in virtual meetings, gaming, and social platforms. These avatars allow for more immersive communication, but they also introduce serious security risks. One emerging threat is impersonation: an attacker can steal a user’s avatar, preserving their appearance and voice, making it nearly impossible to detect fraudulent usage by sight or sound alone. In this paper, we explore the challenge of biometric verification in such avatar-mediated scenarios. Our main question is whether an individual’s facial motion patterns can serve as reliable behavioral biometrics to verify their identity when the avatar’s visual appearance is a facsimile of its owner. To answer this question, we introduce a new dataset of realistic avatar videos, containing genuine and impostor examples, created using a state-of-the-art one-shot avatar generation model, GAGAvatar. We also propose a lightweight, explainable spatio-temporal Graph Convolutional Network architecture with temporal attention pooling that uses only facial landmarks to model dynamic facial gestures. Experimental results demonstrate that facial motion cues enable meaningful identity verification, with AUC values approaching 80%. The proposed benchmark and biometric system are available to the research community in order to bring attention to the urgent need for more advanced behavioral biometric defenses in avatar-based communication systems.
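A minimal version of the described verifier, a spatio-temporal Graph Convolutional Network over landmark tracks with temporal attention pooling, might look as follows. Layer sizes, the learnable adjacency, and the cosine-similarity scoring are assumptions for illustration, not the authors' exact architecture.

```python
# Minimal spatio-temporal GCN with temporal attention pooling over facial
# landmarks, in the spirit of the paper's verifier (sizes and layer counts
# are assumptions, not the authors' architecture).
import torch
import torch.nn as nn

class LandmarkVerifier(nn.Module):
    def __init__(self, n_landmarks=68, in_dim=2, hidden=64, emb=128):
        super().__init__()
        # learnable adjacency over landmarks (spatial graph)
        self.adj = nn.Parameter(torch.eye(n_landmarks))
        self.gc1 = nn.Linear(in_dim, hidden)
        self.gc2 = nn.Linear(hidden, hidden)
        self.att = nn.Linear(hidden * n_landmarks, 1)  # temporal attention scores
        self.head = nn.Linear(hidden * n_landmarks, emb)

    def forward(self, x):                     # x: (B, T, N, 2) landmark tracks
        A = torch.softmax(self.adj, dim=-1)   # row-normalized adjacency
        h = torch.relu(A @ self.gc1(x))       # spatial message passing
        h = torch.relu(A @ self.gc2(h))
        h = h.flatten(2)                      # (B, T, N*hidden) per-frame feature
        w = torch.softmax(self.att(h), dim=1) # (B, T, 1) temporal attention
        pooled = (w * h).sum(dim=1)           # attention-weighted temporal pool
        return torch.nn.functional.normalize(self.head(pooled), dim=-1)

# verification = cosine similarity between enrollment and probe embeddings
model = LandmarkVerifier()
e1, e2 = model(torch.randn(1, 50, 68, 2)), model(torch.randn(1, 50, 68, 2))
score = (e1 * e2).sum(-1)  # threshold this for accept/reject
```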
[202] Rate-distortion Optimized Point Cloud Preprocessing for Geometry-based Point Cloud Compression
Wanhao Ma, Wei Zhang, Shuai Wan, Fuzheng Yang
Main category: cs.CV
TL;DR: Proposes a preprocessing framework combining a voxelization network and a G-PCC surrogate model to enhance G-PCC efficiency without sacrificing interoperability. Achieves 38.84% BD-rate reduction.
Details
Motivation: G-PCC underperforms compared to deep learning-based methods despite lower computational power. The goal is to improve G-PCC efficiency while maintaining its interoperability and flexibility.
Method: Integrates a compression-oriented voxelization network with a differentiable G-PCC surrogate model, jointly optimized in training. The surrogate mimics G-PCC’s rate-distortion behavior for end-to-end gradient propagation.
Result: Achieves a 38.84% average BD-rate reduction over G-PCC, demonstrating significant efficiency improvement.
Conclusion: The framework enhances legacy compression standards like G-PCC while preserving backward compatibility, making it practical for real-world deployment.
Abstract: Geometry-based point cloud compression (G-PCC), an international standard designed by MPEG, provides a generic framework for compressing diverse types of point clouds while ensuring interoperability across applications and devices. However, G-PCC underperforms compared to recent deep learning-based PCC methods despite its lower computational power consumption. To enhance the efficiency of G-PCC without sacrificing its interoperability or computational flexibility, we propose a novel preprocessing framework that integrates a compression-oriented voxelization network with a differentiable G-PCC surrogate model, jointly optimized in the training phase. The surrogate model mimics the rate-distortion behaviour of the non-differentiable G-PCC codec, enabling end-to-end gradient propagation. The versatile voxelization network adaptively transforms input point clouds using learning-based voxelization and effectively manipulates point clouds via global scaling, fine-grained pruning, and point-level editing for rate-distortion trade-offs. During inference, only the lightweight voxelization network is appended to the G-PCC encoder, requiring no modifications to the decoder, thus introducing no computational overhead for end users. Extensive experiments demonstrate a 38.84% average BD-rate reduction over G-PCC. By bridging classical codecs with deep learning, this work offers a practical pathway to enhance legacy compression standards while preserving their backward compatibility, making it ideal for real-world deployment.
[203] OpenGS-Fusion: Open-Vocabulary Dense Mapping with Hybrid 3D Gaussian Splatting for Refined Object-Level Understanding
Dianyi Yang, Xihan Wang, Yu Gao, Shiyang Liu, Bohan Ren, Yufeng Yue, Yi Yang
Main category: cs.CV
TL;DR: OpenGS-Fusion improves 3D scene understanding with open-vocabulary queries by combining 3D Gaussian representation and adaptive thresholding, outperforming existing methods.
Details
Motivation: Existing methods lack flexibility and precision for open-ended queries in 3D scene understanding.
Method: Combines 3D Gaussian representation with Truncated Signed Distance Field and introduces MLLM-Assisted Adaptive Thresholding.
Result: Achieves 17% improvement in 3D mIoU and outperforms in scene reconstruction and object understanding.
Conclusion: OpenGS-Fusion is effective for language-guided scene interaction and offers superior performance.
Abstract: Recent advancements in 3D scene understanding have made significant strides in enabling interaction with scenes using open-vocabulary queries, particularly for VR/AR and robotic applications. Nevertheless, existing methods are hindered by rigid offline pipelines and the inability to provide precise 3D object-level understanding given open-ended queries. In this paper, we present OpenGS-Fusion, an innovative open-vocabulary dense mapping framework that improves semantic modeling and refines object-level understanding. OpenGS-Fusion combines 3D Gaussian representation with a Truncated Signed Distance Field to facilitate lossless fusion of semantic features on-the-fly. Furthermore, we introduce a novel multimodal language-guided approach named MLLM-Assisted Adaptive Thresholding, which refines the segmentation of 3D objects by adaptively adjusting similarity thresholds, achieving a 17% improvement in 3D mIoU compared to a fixed-threshold strategy. Extensive experiments demonstrate that our method outperforms existing methods in 3D object understanding and scene reconstruction quality, as well as showcasing its effectiveness in language-guided scene interaction. The code is available at https://young-bit.github.io/opengs-fusion.github.io/.
[204] Personalized Safety Alignment for Text-to-Image Diffusion Models
Yu Lei, Jinbin Bai, Qingyu Shi, Aosong Feng, Kaidong Yu
Main category: cs.CV
TL;DR: Proposes Personalized Safety Alignment (PSA) for text-to-image diffusion models, allowing user-specific safety control by integrating personalized profiles, outperforming existing methods.
Details
Motivation: Current safety mechanisms in text-to-image models lack personalization, failing to account for diverse user preferences shaped by factors like age and beliefs.
Method: Introduces PSA, integrating user-specific safety profiles via cross-attention in the diffusion process, and uses the Sage dataset for training.
Result: PSA outperforms existing methods in harmful content suppression and aligns better with user constraints, achieving higher Win Rate and Pass Rate scores.
Conclusion: PSA effectively personalizes safety in generative models, balancing user preferences and content quality, with publicly available resources.
Abstract: Text-to-image diffusion models have revolutionized visual content generation, but current safety mechanisms apply uniform standards that often fail to account for individual user preferences. These models overlook the diverse safety boundaries shaped by factors like age, mental health, and personal beliefs. To address this, we propose Personalized Safety Alignment (PSA), a framework that allows user-specific control over safety behaviors in generative models. PSA integrates personalized user profiles into the diffusion process, adjusting the model’s behavior to match individual safety preferences while preserving image quality. We introduce a new dataset, Sage, which captures user-specific safety preferences and incorporates these profiles through a cross-attention mechanism. Experiments show that PSA outperforms existing methods in harmful content suppression and aligns generated content better with user constraints, achieving higher Win Rate and Pass Rate scores. Our code, data, and models are publicly available at https://torpedo2648.github.io/PSAlign/.
[205] LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation
Xinyu Yan, Meijun Sun, Ge-Peng Ji, Fahad Shahbaz Khan, Salman Khan, Deng-Ping Fan
Main category: cs.CV
TL;DR: LawDIS is a controllable dichotomous image segmentation framework using language and window-based controls for high-quality object masks, outperforming 11 state-of-the-art methods.
Details
Motivation: To enhance image segmentation by integrating user controls (language prompts and adjustable windows) for personalized and high-accuracy applications.
Method: Uses a latent diffusion model for mask generation, with macro (language-controlled segmentation) and micro (window-controlled refinement) modes, coordinated by a mode switcher.
Result: Significantly outperforms 11 cutting-edge methods, achieving Fβω gains of 4.6% with both strategies and 3.6% with only the LS strategy on DIS-TE.
Conclusion: LawDIS is effective for high-accuracy, personalized image segmentation, with potential for broader applications.
Abstract: We present LawDIS, a language-window-based controllable dichotomous image segmentation (DIS) framework that produces high-quality object masks. Our framework recasts DIS as an image-conditioned mask generation task within a latent diffusion model, enabling seamless integration of user controls. LawDIS is enhanced with macro-to-micro control modes. Specifically, in macro mode, we introduce a language-controlled segmentation strategy (LS) to generate an initial mask based on user-provided language prompts. In micro mode, a window-controlled refinement strategy (WR) allows flexible refinement of user-defined regions (i.e., size-adjustable windows) within the initial mask. Coordinated by a mode switcher, these modes can operate independently or jointly, making the framework well-suited for high-accuracy, personalised applications. Extensive experiments on the DIS5K benchmark reveal that our LawDIS significantly outperforms 11 cutting-edge methods across all metrics. Notably, compared to the second-best model MVANet, we achieve $F_\beta^\omega$ gains of 4.6% with both the LS and WR strategies and 3.6% gains with only the LS strategy on DIS-TE. Codes will be made available at https://github.com/XinyuYanTJU/LawDIS.
[206] DeflareMamba: Hierarchical Vision Mamba for Contextually Consistent Lens Flare Removal
Yihang Huang, Yuanfei Huang, Junhui Lin, Hua Huang
Main category: cs.CV
TL;DR: DeflareMamba introduces a state space model for lens flare removal, addressing contextual inconsistencies with a hierarchical framework and local-global dependency modeling.
Details
Motivation: Existing flare removal methods struggle with maintaining contextual consistency, leading to incomplete results.
Method: Uses a hierarchical framework with varied stride sampling for long-range correlations and local-enhanced state space models to preserve details.
Result: Effectively removes various flare types while preserving non-flare regions, improving downstream tasks like object recognition.
Conclusion: DeflareMamba is the first state space model for flare removal, demonstrating superior performance and broader applications.
Abstract: Lens flare removal remains an information confusion challenge in the underlying image background and the optical flares, due to the complex optical interactions between light sources and camera lens. While recent solutions have shown promise in decoupling the flare corruption from image, they often fail to maintain contextual consistency, leading to incomplete and inconsistent flare removal. To eliminate this limitation, we propose DeflareMamba, which leverages the efficient sequence modeling capabilities of state space models while maintains the ability to capture local-global dependencies. Particularly, we design a hierarchical framework that establishes long-range pixel correlations through varied stride sampling patterns, and utilize local-enhanced state space models that simultaneously preserves local details. To the best of our knowledge, this is the first work that introduces state space models to the flare removal task. Extensive experiments demonstrate that our method effectively removes various types of flare artifacts, including scattering and reflective flares, while maintaining the natural appearance of non-flare regions. Further downstream applications demonstrate the capacity of our method to improve visual object recognition and cross-modal semantic understanding. Code is available at https://github.com/BNU-ERC-ITEA/DeflareMamba.
[207] TEACH: Text Encoding as Curriculum Hints for Scene Text Recognition
Xiahan Yang, Hui Zheng
Main category: cs.CV
TL;DR: TEACH is a training paradigm for STR that uses ground-truth text as auxiliary input, gradually reducing its influence to improve model accuracy without extra inference overhead.
Details
Motivation: STR is challenging due to complex visuals and limited semantic priors; TEACH addresses this by guiding models from label-dependent to visual recognition.
Method: TEACH encodes target labels into embeddings, uses loss-aware masking, and progressively reduces label influence, simulating curriculum learning.
Result: Models with TEACH show improved accuracy across benchmarks, especially in challenging conditions, proving its robustness.
Conclusion: TEACH is a model-agnostic, no-overhead solution that enhances STR performance by integrating label guidance during training.
Abstract: Scene Text Recognition (STR) remains a challenging task due to complex visual appearances and limited semantic priors. We propose TEACH, a novel training paradigm that injects ground-truth text into the model as auxiliary input and progressively reduces its influence during training. By encoding target labels into the embedding space and applying loss-aware masking, TEACH simulates a curriculum learning process that guides the model from label-dependent learning to fully visual recognition. Unlike language model-based approaches, TEACH requires no external pretraining and introduces no inference overhead. It is model-agnostic and can be seamlessly integrated into existing encoder-decoder frameworks. Extensive experiments across multiple public benchmarks show that models trained with TEACH achieve consistently improved accuracy, especially under challenging conditions, validating its robustness and general applicability.
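One way to realize the described curriculum is to add the ground-truth label embedding to the visual features with a weight that anneals to zero over training and is gated by per-sample loss. The sketch below is a hedged reading of that idea; the linear annealing schedule and median-based mask are assumptions, not TEACH's exact rule.

```python
# Sketch of TEACH-style curriculum label injection (hypothetical shapes and
# schedule; the paper's exact masking rule may differ).
import torch

def inject_labels(vis_feat, label_emb, per_sample_loss, progress):
    """Blend ground-truth label embeddings into the visual features.

    vis_feat, label_emb: (B, L, D) aligned sequences.
    per_sample_loss:     (B,) recent recognition loss per sample.
    progress:            float in [0, 1], fraction of training completed.
    """
    alpha = 1.0 - progress                       # global anneal: 1 -> 0
    # loss-aware mask: keep the hint only for samples the model still gets wrong
    hard = (per_sample_loss > per_sample_loss.median()).float()  # (B,)
    gate = alpha * hard.view(-1, 1, 1)
    return vis_feat + gate * label_emb           # label influence fades to zero

# at progress=1.0 the model sees purely visual features (no inference overhead)
out = inject_labels(torch.randn(4, 25, 512), torch.randn(4, 25, 512),
                    torch.rand(4), progress=0.3)
```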
[208] DELTAv2: Accelerating Dense 3D Tracking
Tuan Duc Ngo, Ashkan Mirzaei, Guocheng Qian, Hanwen Liang, Chuang Gan, Evangelos Kalogerakis, Peter Wonka, Chaoyang Wang
Main category: cs.CV
TL;DR: A novel algorithm for faster 3D point tracking in videos, addressing computational bottlenecks with a coarse-to-fine strategy and optimized feature computation, achieving 5-100x speedup without sacrificing accuracy.
Details
Motivation: Existing methods for dense long-term 3D point tracking are computationally expensive, especially with large numbers of trajectories.
Method: Introduces a coarse-to-fine strategy with a learnable interpolation module and optimizes correlation feature computation.
Result: Achieves 5-100x speedup over prior methods while maintaining state-of-the-art tracking accuracy.
Conclusion: The proposed algorithm efficiently accelerates 3D point tracking without compromising performance.
Abstract: We propose a novel algorithm for accelerating dense long-term 3D point tracking in videos. Through analysis of existing state-of-the-art methods, we identify two major computational bottlenecks. First, transformer-based iterative tracking becomes expensive when handling a large number of trajectories. To address this, we introduce a coarse-to-fine strategy that begins tracking with a small subset of points and progressively expands the set of tracked trajectories. The newly added trajectories are initialized using a learnable interpolation module, which is trained end-to-end alongside the tracking network. Second, we propose an optimization that significantly reduces the cost of correlation feature computation, another key bottleneck in prior methods. Together, these improvements lead to a 5-100x speedup over existing approaches while maintaining state-of-the-art tracking accuracy.
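The coarse-to-fine expansion can be illustrated with a simple stand-in for the learnable interpolation module: initialize each newly added trajectory from its nearest already-tracked neighbors with inverse-distance weights. The paper trains its module end-to-end; the version below is a fixed-weight sketch.

```python
# Coarse-to-fine trajectory expansion: new tracks are initialized by
# interpolating from already-tracked neighbors. Inverse-distance weighting
# stands in for the paper's learnable interpolation module.
import torch

def interpolate_new_tracks(query_xy, tracked_xy, tracked_traj, k=4, eps=1e-6):
    """query_xy: (Q, 2) new query points in the first frame.
    tracked_xy: (P, 2) first-frame positions of tracked points.
    tracked_traj: (P, T, 3) their 3D trajectories over T frames.
    Returns (Q, T, 3) initial trajectories for the new queries.
    """
    d = torch.cdist(query_xy, tracked_xy)            # (Q, P) distances
    dist, idx = d.topk(k, largest=False)             # k nearest tracked points
    w = 1.0 / (dist + eps)
    w = w / w.sum(dim=1, keepdim=True)               # (Q, k) normalized weights
    neighbors = tracked_traj[idx]                    # (Q, k, T, 3)
    return (w[..., None, None] * neighbors).sum(dim=1)

# start from 256 tracked points, initialize 4096 more, then refine them with
# the tracking network
init = interpolate_new_tracks(torch.rand(4096, 2), torch.rand(256, 2),
                              torch.randn(256, 48, 3))
```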
[209] Efficient Chambolle-Pock based algorithms for Convolutional sparse representation
Yi Liu, Junjing Li, Yang Chen, Haowei Tang, Pengcheng Zhang, Tianling Lyu, Zhiguo Gui
Main category: cs.CV
TL;DR: The paper introduces a fast and efficient method for convolutional sparse coding (CSC) and dictionary learning (CDL) using the Chambolle-Pock (CP) framework, eliminating manual parameter tuning and improving convergence speed.
Details
Motivation: Current ADMM-based CSC methods require careful penalty parameter selection, which can lead to slow or no convergence. The CP framework offers a parameter-free alternative with faster convergence.
Method: The proposed method uses the CP framework for CSC and CDL, incorporating an anisotropic total variation penalty for CSC.
Result: The CP-based method matches ADMM in noise-free image processing and outperforms it in noise removal from Gaussian noise-polluted images.
Conclusion: The CP framework is a viable, efficient alternative to ADMM for CSC and CDL, offering faster convergence and eliminating manual parameter tuning.
Abstract: Recently, convolutional sparse representation (CSR) has attracted increasing attention in image processing due to its desirable property of translation invariance. CSR typically comprises convolutional sparse coding (CSC) and convolutional dictionary learning (CDL), and many studies focus on solving the corresponding optimization problems. At present, the most efficient optimization scheme for CSC is based on the alternating direction method of multipliers (ADMM). However, the ADMM-based approach involves a penalty parameter that must be carefully selected; improper selection may result in slow convergence or no convergence at all. In this paper, a novel fast and efficient method based on the Chambolle-Pock (CP) framework is proposed, which requires no extra manually selected parameters during solving and converges faster. Furthermore, we propose an anisotropic total variation penalty on the coefficient maps for CSC and apply the CP algorithm to solve it. We also apply the CP framework to the corresponding CDL problem. Experiments show that for noise-free images the proposed CSC algorithms rival the latest ADMM-based approach, while outperforming it in removing Gaussian noise from polluted images.
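For a single filter, the CSC subproblem min_x (1/2)||d*x - s||^2 + lambda*||x||_1 fits the standard Chambolle-Pock template with closed-form proximal steps and step sizes fixed by the operator norm, which is precisely why no penalty parameter needs tuning. A compact sketch follows (circular convolution via FFT; the paper additionally handles multiple filters, the TV penalty, and CDL):

```python
# Chambolle-Pock primal-dual iteration for a single-filter CSC problem
#   min_x 0.5*||d * x - s||^2 + lam*||x||_1   (circular convolution via FFT).
# Illustrative sketch; the paper covers multi-filter CSC/CDL and a TV term.
import numpy as np

def cp_csc(d, s, lam=0.1, iters=200):
    n = s.size
    D = np.fft.fft(d, n)                      # convolution as FFT multiply
    K = lambda x: np.real(np.fft.ifft(D * np.fft.fft(x)))            # A x
    Kt = lambda y: np.real(np.fft.ifft(np.conj(D) * np.fft.fft(y)))  # A^T y
    L = np.abs(D).max()                       # ||A|| = max |FFT(d)|
    tau = sigma = 1.0 / L                     # step sizes with sigma*tau*||A||^2 <= 1
    x = np.zeros(n); xbar = x.copy(); y = np.zeros(n)
    for _ in range(iters):
        # dual prox of F(z) = 0.5*||z - s||^2
        y = (y + sigma * (K(xbar) - s)) / (1.0 + sigma)
        # primal prox of G(x) = lam*||x||_1 (soft threshold)
        x_new = x - tau * Kt(y)
        x_new = np.sign(x_new) * np.maximum(np.abs(x_new) - tau * lam, 0.0)
        xbar = 2.0 * x_new - x                # over-relaxation with theta = 1
        x = x_new
    return x

# recover a sparse code from a filtered signal
rng = np.random.default_rng(0)
x_true = np.zeros(256); x_true[rng.choice(256, 5)] = 1.0
d = np.exp(-0.5 * (np.arange(-7, 8) / 2.0) ** 2)   # Gaussian blur filter
s = np.real(np.fft.ifft(np.fft.fft(d, 256) * np.fft.fft(x_true)))
x_hat = cp_csc(d, s, lam=0.05)
```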
[210] No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views
Ranran Huang, Krystian Mikolajczyk
Main category: cs.CV
TL;DR: SPFSplat is a framework for 3D Gaussian splatting from sparse multi-view images without ground-truth poses, achieving state-of-the-art performance in novel view synthesis and pose estimation.
Details
Motivation: To enable efficient 3D Gaussian splatting without requiring ground-truth poses during training or inference, addressing practical challenges in novel view synthesis and pose estimation.
Method: Uses a shared feature extraction backbone to predict 3D Gaussian primitives and camera poses in a canonical space in one feed-forward step, with rendering and reprojection losses for geometric constraints.
Result: Achieves state-of-the-art performance in novel view synthesis under significant viewpoint changes and limited overlap, and outperforms methods with geometry priors in pose estimation.
Conclusion: SPFSplat’s pose-free training and efficient design make it practical for real-world applications, demonstrating superior performance without pose supervision.
Abstract: We introduce SPFSplat, an efficient framework for 3D Gaussian splatting from sparse multi-view images, requiring no ground-truth poses during training or inference. It employs a shared feature extraction backbone, enabling simultaneous prediction of 3D Gaussian primitives and camera poses in a canonical space from unposed inputs within a single feed-forward step. Alongside the rendering loss based on estimated novel-view poses, a reprojection loss is integrated to enforce the learning of pixel-aligned Gaussian primitives for enhanced geometric constraints. This pose-free training paradigm and efficient one-step feed-forward design make SPFSplat well-suited for practical applications. Remarkably, despite the absence of pose supervision, SPFSplat achieves state-of-the-art performance in novel view synthesis even under significant viewpoint changes and limited image overlap. It also surpasses recent methods trained with geometry priors in relative pose estimation. Code and trained models are available on our project page: https://ranrhuang.github.io/spfsplat/.
[211] Object Affordance Recognition and Grounding via Multi-scale Cross-modal Representation Learning
Xinhang Wan, Dongqiang Gou, Xinwang Liu, En Zhu, Xuming He
Main category: cs.CV
TL;DR: The paper proposes a novel approach for 3D affordance grounding and classification in Embodied AI, addressing inconsistencies and scale issues by learning an affordance-aware 3D representation and using a stage-wise inference strategy.
Details
Motivation: To improve object manipulation learning in Embodied AI by addressing the limitations of previous methods, which tackled affordance grounding and classification separately and struggled with incomplete predictions and scale variability.
Method: Develops a cross-modal 3D representation with efficient fusion and multi-scale geometric feature propagation, followed by a two-stage prediction mechanism to couple grounding and classification tasks.
Result: The method demonstrates improved performance in both affordance grounding and classification tasks.
Conclusion: The proposed approach effectively addresses the dependency between grounding and classification, enabling better affordance understanding and more accurate predictions.
Abstract: A core problem of Embodied AI is to learn object manipulation from observation, as humans do. To achieve this, it is important to localize 3D object affordance areas through observation such as images (3D affordance grounding) and understand their functionalities (affordance classification). Previous attempts usually tackle these two tasks separately, leading to inconsistent predictions due to lacking proper modeling of their dependency. In addition, these methods typically only ground the incomplete affordance areas depicted in images, failing to predict the full potential affordance areas, and operate at a fixed scale, resulting in difficulty in coping with affordances significantly varying in scale with respect to the whole object. To address these issues, we propose a novel approach that learns an affordance-aware 3D representation and employs a stage-wise inference strategy leveraging the dependency between grounding and classification tasks. Specifically, we first develop a cross-modal 3D representation through efficient fusion and multi-scale geometric feature propagation, enabling inference of full potential affordance areas at a suitable regional scale. Moreover, we adopt a simple two-stage prediction mechanism, effectively coupling grounding and classification for better affordance understanding. Experiments demonstrate the effectiveness of our method, showing improved performance in both affordance grounding and classification.
[212] A Coarse-to-Fine Approach to Multi-Modality 3D Occupancy Grounding
Zhan Shi, Song Wang, Junbo Chen, Jianke Zhu
Main category: cs.CV
TL;DR: A benchmark for 3D occupancy grounding is introduced to improve object perception in autonomous driving by using voxel-level annotations instead of bounding boxes. The proposed GroundingOcc model outperforms baselines.
Details
Motivation: Existing visual grounding tasks rely on bounding boxes, which lack fine-grained details and accuracy in object representation.
Method: GroundingOcc, an end-to-end model, combines visual, textual, and point cloud features for voxel-wise predictions and refined localization, enhanced by 2D grounding and depth estimation modules.
Result: GroundingOcc outperforms existing baselines on 3D occupancy grounding, demonstrating improved precision.
Conclusion: The benchmark and GroundingOcc model advance 3D occupancy grounding, offering more accurate object perception for autonomous driving.
Abstract: Visual grounding aims to identify objects or regions in a scene based on natural language descriptions, essential for spatially aware perception in autonomous driving. However, existing visual grounding tasks typically depend on bounding boxes that often fail to capture fine-grained details. Not all voxels within a bounding box are occupied, resulting in inaccurate object representations. To address this, we introduce a benchmark for 3D occupancy grounding in challenging outdoor scenes. Built on the nuScenes dataset, it integrates natural language with voxel-level occupancy annotations, offering more precise object perception compared to the traditional grounding task. Moreover, we propose GroundingOcc, an end-to-end model designed for 3D occupancy grounding through multi-modal learning. It combines visual, textual, and point cloud features to predict object location and occupancy information from coarse to fine. Specifically, GroundingOcc comprises a multimodal encoder for feature extraction, an occupancy head for voxel-wise predictions, and a grounding head to refine localization. Additionally, a 2D grounding module and a depth estimation module enhance geometric understanding, thereby boosting model performance. Extensive experiments on the benchmark demonstrate that our method outperforms existing baselines on 3D occupancy grounding. The dataset is available at https://github.com/RONINGOD/GroundingOcc.
[213] Deep Learning for Pavement Condition Evaluation Using Satellite Imagery
Prathyush Kumar Reddy Lebaku, Lu Gao, Pan Lu, Jingran Sun
Main category: cs.CV
TL;DR: The paper explores using satellite images and deep learning for cost-effective pavement condition evaluation, achieving over 90% accuracy.
Details
Motivation: Conventional infrastructure inspection methods are labor-intensive and time-consuming, prompting the need for more efficient solutions.
Method: Deep learning models were applied to analyze over 3,000 satellite images of pavement sections, paired with evaluation ratings from TxDOT’s PMIS database.
Result: The study achieved an accuracy rate exceeding 90% in evaluating pavement conditions.
Conclusion: This research demonstrates a rapid, cost-effective approach for future pavement network evaluations using satellite imagery and deep learning.
Abstract: Civil infrastructure systems cover large land areas and need frequent inspections to maintain their public service capabilities. The conventional approaches of manual surveys or vehicle-based automated surveys to assess infrastructure conditions are often labor-intensive and time-consuming. For this reason, it is worthwhile to explore more cost-effective methods for monitoring and maintaining these infrastructures. Fortunately, recent advancements in satellite systems and image processing algorithms have opened up new possibilities. Numerous satellite systems have been employed to monitor infrastructure conditions and identify damage. Due to the improvement in ground sample distance (GSD), the level of detail that can be captured has significantly increased. Taking advantage of these technological advancements, this research evaluated pavement conditions using deep learning models to analyze satellite images. We gathered over 3,000 satellite images of pavement sections, together with pavement evaluation ratings from TxDOT’s PMIS database. The results of our study show an accuracy rate exceeding 90%. This research paves the way for a rapid and cost-effective approach to evaluating the pavement network in the future.
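A representative pipeline for this kind of study is transfer learning from an ImageNet-pretrained backbone to discretized condition ratings. The sketch below assumes a ResNet-50 and a five-level rating scale; the paper does not commit to this exact setup.

```python
# A typical transfer-learning setup for rating pavement tiles from satellite
# imagery (backbone choice and a 5-level rating scale are assumptions; the
# paper does not specify this exact pipeline).
import torch
import torch.nn as nn
from torchvision import models, transforms

NUM_CLASSES = 5  # hypothetical discretized PMIS condition levels

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # replace classifier head

# standard ImageNet preprocessing applied to each satellite tile
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images, ratings):
    """images: (B, 3, 224, 224) preprocessed tiles; ratings: (B,) labels."""
    optimizer.zero_grad()
    loss = criterion(model(images), ratings)
    loss.backward()
    optimizer.step()
    return loss.item()
```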
[214] RoadMamba: A Dual Branch Visual State Space Model for Road Surface Classification
Tianze Wang, Zhang Zhang, Chao Yue, Nuoran Li, Chao Sun
Main category: cs.CV
TL;DR: RoadMamba, a novel method combining local and global perception, achieves state-of-the-art performance in road surface classification using visual Mamba architectures.
Details
Motivation: Improving autonomous vehicle safety and comfort by accurately classifying road surface conditions using visual technologies.
Method: Proposes RoadMamba, utilizing Dual State Space Model (DualSSM) for global and local feature extraction and Dual Attention Fusion (DAF) for feature decoding. Includes a dual auxiliary loss to balance feature reliance.
Result: Achieves state-of-the-art performance on a dataset of 1 million samples.
Conclusion: RoadMamba effectively combines local and global perception for superior road surface classification, benefiting autonomous vehicle systems.
Abstract: Acquiring road surface conditions in advance using visual technologies provides effective information for the planning and control system of autonomous vehicles, thereby improving vehicle safety and driving comfort. Recently, the Mamba architecture based on state-space models has shown remarkable performance in visual processing tasks, benefiting from its efficient global receptive field. However, existing Mamba architectures struggle to achieve state-of-the-art visual road surface classification because they fail to effectively extract the local texture of the road surface. In this paper, we explore for the first time the potential of visual Mamba architectures for the road surface classification task and propose a method that effectively combines local and global perception, called RoadMamba. Specifically, we utilize a Dual State Space Model (DualSSM) to extract the global semantics and local texture of the road surface, and decode and fuse the dual features through Dual Attention Fusion (DAF). In addition, we propose a dual auxiliary loss to explicitly constrain the two branches, preventing the network from relying only on global semantic information from the deep, large receptive field while ignoring the local texture. The proposed RoadMamba achieves state-of-the-art performance in experiments on a large-scale road surface classification dataset containing 1 million samples.
[215] StyDeco: Unsupervised Style Transfer with Distilling Priors and Semantic Decoupling
Yuanlin Yang, Quanjian Song, Zhexian Gao, Ge Wang, Shanshan Li, Xiaoyan Zhang
Main category: cs.CV
TL;DR: StyDeco is an unsupervised framework for style transfer that addresses the semantic gap in text-driven diffusion models by learning tailored text representations, outperforming existing methods in fidelity and structure preservation.
Details
Motivation: Current text-driven diffusion models treat textual descriptions uniformly, ignoring the spatial-semantic gap in visual style transfer, leading to loss of detail and structure.
Method: StyDeco uses Prior-Guided Data Distillation (PGD) for unsupervised stylistic knowledge distillation and Contrastive Semantic Decoupling (CSD) to adapt text encoders with domain-specific weights, clustering source and target representations.
Result: Experiments on three benchmarks show StyDeco outperforms existing methods in stylistic fidelity and structural preservation, and supports de-stylization.
Conclusion: StyDeco effectively bridges the semantic gap in text-driven style transfer, offering superior performance and extensibility.
Abstract: Diffusion models have emerged as the dominant paradigm for style transfer, but their text-driven mechanism is hindered by a core limitation: it treats textual descriptions as uniform, monolithic guidance. This limitation overlooks the semantic gap between the non-spatial nature of textual descriptions and the spatially-aware attributes of visual style, often leading to the loss of semantic structure and fine-grained details during stylization. In this paper, we propose StyDeco, an unsupervised framework that resolves this limitation by learning text representations specifically tailored for the style transfer task. Our framework first employs Prior-Guided Data Distillation (PGD), a strategy designed to distill stylistic knowledge without human supervision. It leverages a powerful frozen generative model to automatically synthesize pseudo-paired data. Subsequently, we introduce Contrastive Semantic Decoupling (CSD), a task-specific objective that adapts a text encoder using domain-specific weights. CSD performs two-class clustering in the semantic space, encouraging source and target representations to form distinct clusters. Extensive experiments on three classic benchmarks demonstrate that our framework outperforms several existing approaches in both stylistic fidelity and structural preservation, highlighting its effectiveness in style transfer with semantic preservation. In addition, our framework supports a unique de-stylization process, further demonstrating its extensibility. Our code is available at https://github.com/QuanjianSong/StyDeco.
[216] Perspective from a Broader Context: Can Room Style Knowledge Help Visual Floorplan Localization?
Bolei Chen, Shengsheng Yan, Yongzheng Cui, Jiaxu Kang, Ping Zhong, Jianxin Wang
Main category: cs.CV
TL;DR: The paper proposes using visual scene context to improve Floorplan Localization (FLoc) by pre-training a room discriminator with unsupervised learning, outperforming existing methods.
Details
Motivation: Existing FLoc methods rely on limited structural or geometric cues, ignoring richer visual context, leading to ambiguous localization.
Method: An unsupervised learning technique with clustering constraints pre-trains a room discriminator to extract room types from images, which is then integrated into FLoc algorithms.
Result: The approach outperforms state-of-the-art methods, showing significant improvements in robustness and accuracy on standard benchmarks.
Conclusion: Leveraging visual scene context through unsupervised learning enhances FLoc by reducing ambiguity and improving performance.
Abstract: Since a building’s floorplan remains consistent over time and is inherently robust to changes in visual appearance, visual Floorplan Localization (FLoc) has received increasing attention from researchers. However, as a compact and minimalist representation of the building’s layout, floorplans contain many repetitive structures (e.g., hallways and corners), which easily result in ambiguous localization. Existing methods either pin their hopes on matching 2D structural cues in floorplans or rely on 3D geometry-constrained visual pre-training, ignoring the richer contextual information provided by visual images. In this paper, we suggest using broader visual scene context to empower FLoc algorithms with scene layout priors that eliminate localization uncertainty. In particular, we propose an unsupervised learning technique with clustering constraints to pre-train a room discriminator on self-collected unlabeled room images. Such a discriminator can empirically extract the hidden room type of the observed image and distinguish it from other room types. By injecting the scene context information summarized by the discriminator into an FLoc algorithm, room style knowledge is effectively exploited to guide definite visual FLoc. We conducted extensive comparative studies on two standard visual FLoc benchmarks. Our experiments show that our approach outperforms state-of-the-art methods and achieves significant improvements in robustness and accuracy.
[217] MoGaFace: Momentum-Guided and Texture-Aware Gaussian Avatars for Consistent Facial Geometry
Yujian Liu, Linlang Cao, Chuang Chen, Fanyu Geng, Dongxu Shen, Peng Cao, Shidang Xu, Xiaoli Liu
Main category: cs.CV
TL;DR: MoGaFace improves 3D head avatar reconstruction by refining geometry and texture during Gaussian rendering, addressing misalignment issues with FLAME meshes.
Details
Motivation: Existing methods suffer from misalignment between estimated FLAME meshes and target images, leading to poor rendering quality and loss of details.
Method: Introduces Momentum-Guided Consistent Geometry for alignment and Latent Texture Attention for texture refinement.
Result: Achieves high-fidelity reconstruction and better novel-view synthesis, even with inaccurate mesh initialization.
Conclusion: MoGaFace effectively addresses misalignment and enhances rendering quality for 3D head avatars.
Abstract: Existing 3D head avatar reconstruction methods adopt a two-stage process, relying on tracked FLAME meshes derived from facial landmarks, followed by Gaussian-based rendering. However, misalignment between the estimated mesh and target images often leads to suboptimal rendering quality and loss of fine visual details. In this paper, we present MoGaFace, a novel 3D head avatar modeling framework that continuously refines facial geometry and texture attributes throughout the Gaussian rendering process. To address the misalignment between estimated FLAME meshes and target images, we introduce the Momentum-Guided Consistent Geometry module, which incorporates a momentum-updated expression bank and an expression-aware correction mechanism to ensure temporal and multi-view consistency. Additionally, we propose Latent Texture Attention, which encodes compact multi-view features into head-aware representations, enabling geometry-aware texture refinement via integration into Gaussians. Extensive experiments show that MoGaFace achieves high-fidelity head avatar reconstruction and significantly improves novel-view synthesis quality, even under inaccurate mesh initialization and unconstrained real-world settings.
[218] Eigen Neural Network: Unlocking Generalizable Vision with Eigenbasis
Anzhe Cheng, Chenzhong Yin, Mingxi Cheng, Shukai Duan, Shahin Nazarian, Paul Bogdan
Main category: cs.CV
TL;DR: ENN introduces a novel architecture using orthonormal eigenbasis for weights, improving feature clarity and learning dynamics, outperforming SOTA in image classification and cross-modal retrieval, and enabling BP-free local learning with speedup and accuracy gains.
Details
Motivation: Address the disordered weight structures in DNNs caused by gradient-based optimization, which harms feature clarity and learning dynamics.
Method: Reparameterize each layer’s weights in a learned orthonormal eigenbasis, enforcing decorrelated and well-aligned weight dynamics.
Result: ENN outperforms SOTA methods on ImageNet and sets a new benchmark in cross-modal image-text retrieval. ENN-ℓ achieves 2× training speedup and higher accuracy than BP.
Conclusion: ENN remedies BP’s representational deficiencies, enhancing performance and enabling efficient, parallelizable training.
Abstract: The remarkable success of Deep Neural Networks (DNNs) is driven by gradient-based optimization, yet this process is often undermined by its tendency to produce disordered weight structures, which harm feature clarity and degrade learning dynamics. To address this fundamental representational flaw, we introduce the Eigen Neural Network (ENN), a novel architecture that reparameterizes each layer’s weights in a layer-shared, learned orthonormal eigenbasis. This design enforces decorrelated, well-aligned weight dynamics axiomatically, rather than through regularization, leading to more structured and discriminative feature representations. When integrated with standard BP, ENN consistently outperforms state-of-the-art methods on large-scale image classification benchmarks, including ImageNet, and its superior representations generalize to set a new benchmark in cross-modal image-text retrieval. Furthermore, ENN’s principled structure enables a highly efficient, backpropagation-free (BP-free) local learning variant, ENN-$\ell$. This variant not only resolves BP’s procedural bottlenecks to achieve over 2$\times$ training speedup via parallelism, but also, remarkably, surpasses the accuracy of end-to-end backpropagation. ENN thus presents a new architectural paradigm that directly remedies the representational deficiencies of BP, leading to enhanced performance and enabling a more efficient, parallelizable training regime.
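The core reparameterization can be sketched as a linear layer whose weight is factored through a learned orthonormal basis, W = Q diag(s) Q^T, with Q kept orthonormal by PyTorch's built-in orthogonal parametrization. The square, per-layer form below is an illustrative simplification of the paper's layer-shared eigenbasis.

```python
# Sketch of an eigenbasis-reparameterized linear layer: W = Q diag(s) Q^T with
# Q kept orthonormal by PyTorch's orthogonal parametrization. The square,
# per-layer setup is an assumption for illustration only.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class EigenLinear(nn.Module):
    def __init__(self, dim):
        super().__init__()
        basis = nn.Linear(dim, dim, bias=False)
        self.basis = orthogonal(basis)              # constrains basis.weight to be orthonormal
        self.scale = nn.Parameter(torch.ones(dim))  # learned "eigenvalues"
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        Q = self.basis.weight                       # (dim, dim), Q^T Q = I
        W = Q @ torch.diag(self.scale) @ Q.t()      # decorrelated weight dynamics
        return x @ W.t() + self.bias

layer = EigenLinear(64)
y = layer(torch.randn(8, 64))
```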
[219] ParaRevSNN: A Parallel Reversible Spiking Neural Network for Efficient Training and Inference
Changqing Xu, Guoqing Sun, Yi Liu, Xinfang Liao, Yintang Yang
Main category: cs.CV
TL;DR: ParaRevSNN introduces parallelism to reversible SNNs, reducing latency while maintaining memory efficiency and accuracy.
Details
Motivation: RevSNNs suffer from high latency due to sequential computation, limiting their practicality.
Method: ParaRevSNN decouples sequential dependencies between reversible blocks to enable inter-block parallelism.
Result: ParaRevSNN reduces training time by up to 35.2% and inference time to 18.15%, matching or exceeding RevSNN accuracy.
Conclusion: ParaRevSNN is efficient and suitable for resource-constrained scenarios.
Abstract: Reversible Spiking Neural Networks (RevSNNs) enable memory-efficient training by reconstructing forward activations during backpropagation, but suffer from high latency due to strictly sequential computation. To overcome this limitation, we propose ParaRevSNN, a parallel reversible SNN architecture that decouples sequential dependencies between reversible blocks while preserving reversibility. This design enables inter-block parallelism, significantly accelerating training and inference while retaining the memory-saving benefits of reversibility. Experiments on CIFAR10, CIFAR100, CIFAR10-DVS, and DVS128 Gesture demonstrate that ParaRevSNN matches or exceeds the accuracy of standard RevSNNs, while reducing training time by up to 35.2% and inference time to 18.15%, making it well-suited for deployment in resource-constrained scenarios.
[220] Multi-Cache Enhanced Prototype Learning for Test-Time Generalization of Vision-Language Models
Xinyu Chen, Haotian Zhai, Can Zhang, Xiupeng Shi, Ruirui Li
Main category: cs.CV
TL;DR: The paper proposes MCP and MCP++, cache-enhanced methods for test-time adaptation, improving generalization by ensuring intra-class compactness and leveraging multi-cache strategies.
Details
Motivation: Existing cache-enhanced TTA methods rely on unreliable low-entropy samples under distribution shifts, leading to non-compact intra-class distributions.
Method: Introduces Multi-Cache enhanced Prototype-based TTA (MCP) with three caches (entropy, align, negative) and MCP++ with cross-modal alignment and residual learning.
Result: Achieves state-of-the-art performance across 15 downstream tasks, demonstrating superior generalization.
Conclusion: The proposed methods effectively enhance test-time adaptation by addressing intra-class compactness and leveraging multi-cache strategies.
Abstract: In zero-shot setting, test-time adaptation adjusts pre-trained models using unlabeled data from the test phase to enhance performance on unknown test distributions. Existing cache-enhanced TTA methods rely on a low-entropy criterion to select samples for prototype construction, assuming intra-class compactness. However, low-entropy samples may be unreliable under distribution shifts, and the resulting prototypes may not ensure compact intra-class distributions. This study identifies a positive correlation between cache-enhanced performance and intra-class compactness. Based on this observation, we propose a Multi-Cache enhanced Prototype-based Test-Time Adaptation (MCP) featuring three caches: an entropy cache for initializing prototype representations with low-entropy samples, an align cache for integrating visual and textual information to achieve compact intra-class distributions, and a negative cache for prediction calibration using high-entropy samples. We further developed MCP++, a framework incorporating cross-modal prototype alignment and residual learning, introducing prototype residual fine-tuning. Comparative and ablation experiments across 15 downstream tasks demonstrate that the proposed method and framework achieve state-of-the-art generalization performance.
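The entropy cache at the heart of the method can be pictured as a small per-class buffer that retains the lowest-entropy test samples and averages them into prototypes. The sketch below covers only this one cache; the align and negative caches, and MCP++'s residual learning, are omitted.

```python
# Minimal entropy-cache sketch: keep the K lowest-entropy test features per
# predicted class and average them into prototypes (a simplification of the
# paper's three-cache design).
import torch

class EntropyCache:
    def __init__(self, num_classes, capacity=8):
        self.capacity = capacity
        self.items = {c: [] for c in range(num_classes)}  # class -> [(H, feat)]

    def update(self, feat, probs):
        """feat: (D,) test feature; probs: (C,) predictive distribution."""
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
        c = int(probs.argmax())
        self.items[c].append((entropy, feat.detach()))
        self.items[c].sort(key=lambda t: t[0])            # lowest entropy first
        self.items[c] = self.items[c][: self.capacity]

    def prototypes(self):
        return {c: torch.stack([f for _, f in v]).mean(0)
                for c, v in self.items.items() if v}

cache = EntropyCache(num_classes=10)
cache.update(torch.randn(512), torch.softmax(torch.randn(10), dim=0))
protos = cache.prototypes()  # classify new samples by nearest prototype
```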
[221] Enhancing Multi-view Open-set Learning via Ambiguity Uncertainty Calibration and View-wise Debiasing
Zihan Fang, Zhiyong Xu, Lan Du, Shide Du, Zhiling Cai, Shiping Wang
Main category: cs.CV
TL;DR: A multi-view open-set learning framework is proposed to address class incompleteness and view-induced biases by using ambiguity uncertainty calibration and view-wise debiasing.
Details
Motivation: Existing multi-view learning models fail in open-set scenarios due to class incompleteness assumptions and static view-induced biases.
Method: The framework includes O-Mix for generating ambiguous samples, an ambiguity perception network, and an HSIC-based contrastive debiasing module.
Result: The framework improves unknown-class recognition while maintaining closed-set performance across diverse benchmarks.
Conclusion: The proposed approach effectively handles open-set challenges in multi-view learning by addressing ambiguity and bias.
Abstract: Existing multi-view learning models struggle in open-set scenarios due to their implicit assumption of class completeness. Moreover, static view-induced biases, which arise from spurious view-label associations formed during training, further degrade their ability to recognize unknown categories. In this paper, we propose a multi-view open-set learning framework via ambiguity uncertainty calibration and view-wise debiasing. To simulate ambiguous samples, we design O-Mix, a novel synthesis strategy to generate virtual samples with calibrated open-set ambiguity uncertainty. These samples are further processed by an auxiliary ambiguity perception network that captures atypical patterns for improved open-set adaptation. Furthermore, we incorporate an HSIC-based contrastive debiasing module that enforces independence between view-specific ambiguous and view-consistent representations, encouraging the model to learn generalizable features. Extensive experiments on diverse multi-view benchmarks demonstrate that the proposed framework consistently enhances unknown-class recognition while preserving strong closed-set performance.
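The HSIC term used for debiasing is the standard kernel independence measure: with Gram matrices K and L and the centering matrix H, the biased estimator is trace(KHLH)/(n-1)^2. A minimal implementation with Gaussian kernels:

```python
# Biased HSIC estimator with Gaussian kernels: the independence measure the
# debiasing module minimizes between ambiguous and view-consistent features.
import torch

def gaussian_kernel(x, sigma=1.0):
    d2 = torch.cdist(x, x) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """x: (n, dx), y: (n, dy). Returns the (biased) HSIC estimate."""
    n = x.size(0)
    K, L = gaussian_kernel(x, sigma), gaussian_kernel(y, sigma)
    H = torch.eye(n, device=x.device) - torch.ones(n, n, device=x.device) / n
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

# near zero for independent features; adding hsic(...) to the training loss
# penalizes statistical dependence between the two representations
val = hsic(torch.randn(64, 32), torch.randn(64, 32))
```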
[222] Mitigating Information Loss under High Pruning Rates for Efficient Large Vision Language Models
Mingyu Fu, Wei Suo, Ji Ma, Lin Yuanbo Wu, Peng Wang, Yanning Zhang
Main category: cs.CV
TL;DR: ACCM reduces computational costs in LVLMs by compensating for visual information loss with adaptive image captions, outperforming existing methods.
Details
Motivation: High computational costs of LVLMs due to lengthy visual sequences limit their applications, and existing pruning methods degrade performance.
Method: ACCM uses a lightweight caption model and selector to generate and choose contextually appropriate captions, trained via self-supervised learning.
Result: ACCM outperforms SOTA methods, achieving 20.6% better performance with 6.5% fewer FLOPs across seven benchmarks.
Conclusion: ACCM effectively mitigates visual information loss and reduces computational costs, enhancing LVLM applications.
Abstract: Despite the great success of Large Vision Language Models (LVLMs), their high computational cost severely limits their broad applications. The computational cost of LVLMs mainly stems from the visual sequence of the input, which consists of hundreds or even thousands of tokens. Although existing methods have made progress by removing redundant tokens, they suffer from severe performance degradation with high pruning rates due to the loss of visual information. In this paper, we propose an Adaptive Content Compensation Method (ACCM), which can effectively mitigate the visual information loss via an image caption. Specifically, ACCM comprises two key components: a lightweight caption model and a selector. Firstly the caption model generates question-related descriptions under the guidance of the user instruction. Then the selector further identifies a contextually appropriate caption from multiple candidates. Leveraging self-supervised learning, our modules could be learned efficiently without any human or automated labeling. We conduct extensive experiments across seven benchmarks and the results show that ACCM significantly outperforms existing methods with lower FLOPs (e.g., surpassing SOTA by 20.6% with 6.5% fewer FLOPs).
[223] OCSplats: Observation Completeness Quantification and Label Noise Separation in 3DGS
Han Ling, Xian Xu, Yinghui Sun, Quansen Sun
Main category: cs.CV
TL;DR: OCSplats improves 3DGS anti-noise reconstruction by addressing epistemic uncertainty, using hybrid noise assessment and dynamic anchor points for label noise classification, achieving robust performance across diverse scenarios.
Details
Motivation: Label noise in real-world 3D reconstruction (e.g., moving objects, non-Lambertian surfaces) causes errors. Existing methods struggle with noise separation or require impractical scene-specific tuning.
Method: Proposes OCSplats, combining hybrid noise assessment and observation-based cognitive correction. Introduces a dynamic anchor point pipeline for label noise classification.
Result: OCSplats achieves leading reconstruction accuracy and precise noise classification in varied scenarios without parameter adjustments.
Conclusion: OCSplats effectively addresses noise in 3DGS, offering a practical and scalable solution for diverse real-world applications.
Abstract: 3D Gaussian Splatting (3DGS) has become one of the most promising 3D reconstruction technologies. However, label noise in real-world scenarios, such as moving objects, non-Lambertian surfaces, and shadows, often leads to reconstruction errors. Existing 3DGS-based anti-noise reconstruction methods either fail to separate noise effectively or require scene-specific fine-tuning of hyperparameters, making them difficult to apply in practice. This paper re-examines the problem of anti-noise reconstruction from the perspective of epistemic uncertainty, proposing a novel framework, OCSplats. By combining key techniques such as hybrid noise assessment and observation-based cognitive correction, it significantly improves the accuracy of noise classification in areas with cognitive differences. Moreover, to address the issue of varying noise proportions in different scenarios, we design a label noise classification pipeline based on dynamic anchor points. This pipeline enables OCSplats to be applied to scenarios with vastly different noise proportions without adjusting parameters. Extensive experiments demonstrate that OCSplats consistently achieves leading reconstruction performance and precise label noise classification in scenes of different complexity levels.
[224] NS-Net: Decoupling CLIP Semantic Information through NULL-Space for Generalizable AI-Generated Image Detection
Jiazhen Yan, Fan Wang, Weiwei Jiang, Ziqiang Li, Zhangjie Fu
Main category: cs.CV
TL;DR: NS-Net improves AI-generated image detection by decoupling semantic info from CLIP features using NULL-Space projection and contrastive learning, achieving 7.4% better accuracy.
Details
Motivation: Addressing the failure of existing detectors to generalize to unknown generative models, especially when real and fake images are semantically similar.
Method: Proposes NS-Net, leveraging NULL-Space projection to remove semantic info from CLIP features and contrastive learning to capture differences between real and fake images. Includes Patch Selection to preserve fine-grained artifacts.
Result: Outperforms state-of-the-art methods by 7.4% in detection accuracy on a benchmark with 40 generative models.
Conclusion: NS-Net demonstrates strong generalization across GAN- and diffusion-based techniques, improving detection of AI-generated images.
Abstract: The rapid progress of generative models, such as GANs and diffusion models, has facilitated the creation of highly realistic images, raising growing concerns over their misuse in security-sensitive domains. While existing detectors perform well under known generative settings, they often fail to generalize to unknown generative models, especially when semantic content between real and fake images is closely aligned. In this paper, we revisit the use of CLIP features for AI-generated image detection and uncover a critical limitation: the high-level semantic information embedded in CLIP’s visual features hinders effective discrimination. To address this, we propose NS-Net, a novel detection framework that leverages NULL-Space projection to decouple semantic information from CLIP’s visual features, followed by contrastive learning to capture intrinsic distributional differences between real and generated images. Furthermore, we design a Patch Selection strategy to preserve fine-grained artifacts by mitigating semantic bias caused by global image structures. Extensive experiments on an open-world benchmark comprising images generated by 40 diverse generative models show that NS-Net outperforms existing state-of-the-art methods, achieving a 7.4% improvement in detection accuracy, thereby demonstrating strong generalization across both GAN- and diffusion-based image generation techniques.
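The NULL-space idea can be made concrete as an orthogonal projection that removes the span of a set of semantic directions from CLIP features. In the sketch below, the directions are assumed to come from, e.g., class-name text embeddings; NS-Net's learned construction may differ.

```python
# Null-space projection sketch: remove the span of semantic directions from
# CLIP visual features. How NS-Net obtains the directions (e.g., from class
# text embeddings) is an assumption here.
import torch

def null_space_project(feats, semantic_dirs):
    """feats: (N, D) CLIP features; semantic_dirs: (M, D), M < D.

    Returns features projected onto the orthogonal complement (null space)
    of the semantic subspace.
    """
    # orthonormal basis of the semantic span via SVD
    U = torch.linalg.svd(semantic_dirs, full_matrices=False).Vh  # (M, D)
    P = torch.eye(feats.size(1)) - U.t() @ U   # projector onto the null space
    return feats @ P.t()

feats = torch.randn(16, 512)
dirs = torch.randn(10, 512)                    # e.g., class-name text embeddings
resid = null_space_project(feats, dirs)
# resid carries (numerically) no component along the semantic directions:
assert torch.allclose(resid @ dirs.t(), torch.zeros(16, 10), atol=1e-3)
```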
[225] DisFaceRep: Representation Disentanglement for Co-occurring Facial Components in Weakly Supervised Face Parsing
Xiaoqin Wang, Xianxu Hou, Meidan Ding, Junliang Chen, Kaijun Deng, Jinheng Xie, Linlin Shen
Main category: cs.CV
TL;DR: WSFP introduces weakly supervised face parsing to reduce annotation costs, using DisFaceRep to disentangle co-occurring facial components for improved performance.
Details
Motivation: Existing face parsing methods require expensive dense pixel-level annotations, prompting the need for a weakly supervised approach to reduce costs.
Method: Proposes DisFaceRep, a framework with explicit (co-occurring component disentanglement) and implicit (text-guided loss) mechanisms to separate facial components.
Result: DisFaceRep outperforms existing weakly supervised methods on datasets like CelebAMask-HQ, LaPa, and Helen.
Conclusion: WSFP is challenging but feasible with DisFaceRep, offering a cost-effective alternative to dense annotations.
Abstract: Face parsing aims to segment facial images into key components such as eyes, lips, and eyebrows. While existing methods rely on dense pixel-level annotations, such annotations are expensive and labor-intensive to obtain. To reduce annotation cost, we introduce Weakly Supervised Face Parsing (WSFP), a new task setting that performs dense facial component segmentation using only weak supervision, such as image-level labels and natural language descriptions. WSFP introduces unique challenges due to the high co-occurrence and visual similarity of facial components, which lead to ambiguous activations and degraded parsing performance. To address this, we propose DisFaceRep, a representation disentanglement framework designed to separate co-occurring facial components through both explicit and implicit mechanisms. Specifically, we introduce a co-occurring component disentanglement strategy to explicitly reduce dataset-level bias, and a text-guided component disentanglement loss to guide component separation using language supervision implicitly. Extensive experiments on CelebAMask-HQ, LaPa, and Helen demonstrate the difficulty of WSFP and the effectiveness of DisFaceRep, which significantly outperforms existing weakly supervised semantic segmentation methods. The code will be released at https://github.com/CVI-SZU/DisFaceRep.
[226] Geo-NI: Geometry-aware Neural Interpolation for Light Field Rendering
Gaochang Wu, Yuemei Zhou, Yebin Liu, Lu Fang, Tianyou Chai
Main category: cs.CV
TL;DR: Geo-NI combines Neural Interpolation (NI) and Depth Image-Based Rendering (DIBR) for improved light field rendering, handling large disparity and non-Lambertian effects.
Details
Motivation: Existing methods either rely on neural networks for direct interpolation (NI) or use scene geometry (DIBR), but neither fully leverages both approaches.
Method: Geo-NI integrates NI with a novel DIBR pipeline, using depth hypotheses to shear input light fields and blending reconstructed fields based on a reconstruction cost volume.
Result: Geo-NI outperforms existing methods, handling large disparity and non-Lambertian effects effectively.
Conclusion: The Geo-NI framework successfully combines NI and DIBR, demonstrating superior performance in light field rendering.
Abstract: In this paper, we present a Geometry-aware Neural Interpolation (Geo-NI) framework for light field rendering. Previous learning-based approaches either rely on the capability of neural networks to perform direct interpolation, which we dubbed Neural Interpolation (NI), or explore scene geometry for novel view synthesis, also known as Depth Image-Based Rendering (DIBR). Instead, we incorporate the ideas behind these two kinds of approaches by launching the NI with a novel DIBR pipeline. Specifically, the proposed Geo-NI first performs NI using input light field sheared by a set of depth hypotheses. Then the DIBR is implemented by assigning the sheared light fields with a novel reconstruction cost volume according to the reconstruction quality under different depth hypotheses. The reconstruction cost is interpreted as a blending weight to render the final output light field by blending the reconstructed light fields along the dimension of depth hypothesis. By combining the strengths of NI and DIBR, the proposed Geo-NI is able to render views with large disparity with the help of scene geometry while also reconstructing non-Lambertian effects when depth is prone to be ambiguous. Extensive experiments on various datasets demonstrate the superior performance of the proposed geometry-aware light field rendering framework.
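The blending step, rendering the output as a cost-weighted combination of the per-hypothesis reconstructions, can be sketched in numpy; turning the reconstruction cost into weights via a softmax over negative cost is our assumption about the exact weighting scheme.

```python
import numpy as np

def blend_by_cost(recons: np.ndarray, cost: np.ndarray,
                  temperature: float = 1.0) -> np.ndarray:
    """Blend per-depth-hypothesis reconstructions.

    recons: (D, H, W, C) light fields reconstructed under D depth hypotheses
    cost:   (D, H, W)    reconstruction cost per hypothesis and pixel
    Lower cost -> higher blending weight (softmax over -cost).
    """
    w = np.exp(-cost / temperature)
    w /= w.sum(axis=0, keepdims=True)           # normalize over hypotheses
    return (w[..., None] * recons).sum(axis=0)  # (H, W, C)

# Toy usage with 4 depth hypotheses.
D, H, W, C = 4, 8, 8, 3
rng = np.random.default_rng(1)
out = blend_by_cost(rng.random((D, H, W, C)), rng.random((D, H, W)))
print(out.shape)  # (8, 8, 3)
```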
[227] ODOV: Towards Open-Domain Open-Vocabulary Object Detection
Yupeng Zhang, Ruize Han, Fangnan Zhou, Song Wang, Wei Feng, Liang Wan
Main category: cs.CV
TL;DR: The paper introduces Open-Domain Open-Vocabulary (ODOV) object detection, addressing domain and category shifts. It presents a benchmark (OD-LVIS) and a novel method using language models for domain-agnostic prompts and domain-specific embeddings.
Details
Motivation: To tackle real-world object detection challenges involving domain and category shifts, requiring adaptable models.
Method: Uses large language models for domain-agnostic text prompts and learns domain embeddings from images, integrating them for customized category embeddings.
Result: Benchmark evaluations confirm the method’s effectiveness, the benchmark’s utility, and the superiority of the proposed approach.
Conclusion: The work validates ODOV detection’s rationale, the benchmark’s value, and the method’s superiority in handling real-world object detection.
Abstract: In this work, we handle a new problem of Open-Domain Open-Vocabulary (ODOV) object detection, which considers the detection model’s adaptability to the real world including both domain and category shifts. For this problem, we first construct a new benchmark OD-LVIS, which includes 46,949 images, covers 18 complex real-world domains and 1,203 categories, and provides a comprehensive dataset for evaluating real-world object detection. Besides, we develop a novel baseline method for ODOV detection. The proposed method first leverages large language models to generate the domain-agnostic text prompts for category embedding. It further learns the domain embedding from the given image, which, during testing, can be integrated into the category embedding to form the customized domain-specific category embedding for each test image. We provide sufficient benchmark evaluations for the proposed ODOV detection task and report the results, which verify the rationale of ODOV detection, the usefulness of our benchmark, and the superiority of the proposed method.
[228] Self-Enhanced Image Clustering with Cross-Modal Semantic Consistency
Zihan Li, Wei Sun, Jing Hu, Jianhua Yin, Jianlong Wu, Liqiang Nie
Main category: cs.CV
TL;DR: A self-enhanced framework improves image clustering by aligning cross-modal semantic consistency and fine-tuning the encoder, outperforming existing methods.
Details
Motivation: Existing methods freeze the encoder in language-image models like CLIP, limiting performance due to task-agnostic representations.
Method: The framework uses cross-modal semantic consistency to align clustering heads with pre-trained semantics, then fine-tunes the encoder using self-generated pseudo-labels.
Result: Outperforms deep clustering methods on six datasets, with a smaller model (ViT-B/32) matching or surpassing larger models (ViT-L/14).
Conclusion: The proposed framework effectively bridges the gap between generic pre-trained features and task-specific clustering demands.
Abstract: While large language-image pre-trained models like CLIP offer powerful generic features for image clustering, existing methods typically freeze the encoder. This creates a fundamental mismatch between the model’s task-agnostic representations and the demands of a specific clustering task, imposing a ceiling on performance. To break this ceiling, we propose a self-enhanced framework based on cross-modal semantic consistency for efficient image clustering. Our framework first builds a strong foundation via Cross-Modal Semantic Consistency and then specializes the encoder through Self-Enhancement. In the first stage, we focus on Cross-Modal Semantic Consistency. By mining consistency between generated image-text pairs at the instance, cluster assignment, and cluster center levels, we train lightweight clustering heads to align with the rich semantics of the pre-trained model. This alignment process is bolstered by a novel method for generating higher-quality cluster centers and a dynamic balancing regularizer to ensure well-distributed assignments. In the second stage, we introduce a Self-Enhanced fine-tuning strategy. The well-aligned model from the first stage acts as a reliable pseudo-label generator. These self-generated supervisory signals then drive the efficient joint optimization of the vision encoder and clustering heads, unlocking their full potential. Extensive experiments on six mainstream datasets show that our method outperforms existing deep clustering methods by significant margins. Notably, our ViT-B/32 model already matches or even surpasses the accuracy of state-of-the-art methods built upon the far larger ViT-L/14.
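A rough sketch of the second-stage self-enhancement loop: the aligned model pseudo-labels a batch, and confident samples supervise the encoder and head jointly. The confidence threshold and cross-entropy loss are our simplifications, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def self_enhance_step(encoder, head, images, optimizer, tau=0.9):
    """One self-enhancement step with confidence-filtered pseudo-labels."""
    with torch.no_grad():
        probs = F.softmax(head(encoder(images)), dim=1)
        conf, pseudo = probs.max(dim=1)
        keep = conf > tau                     # keep only confident samples
    if keep.sum() == 0:
        return None
    logits = head(encoder(images[keep]))      # gradients flow into encoder
    loss = F.cross_entropy(logits, pseudo[keep])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: a linear "encoder" and clustering head on random features.
enc = torch.nn.Linear(32, 16)
head = torch.nn.Linear(16, 10)
opt = torch.optim.Adam(list(enc.parameters()) + list(head.parameters()), 1e-3)
self_enhance_step(enc, head, torch.randn(64, 32), opt, tau=0.05)
```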
[229] Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models
Yiming Wu, Huan Wang, Zhenghao Chen, Dong Xu
Main category: cs.CV
TL;DR: The paper introduces VDMini, a lightweight Video Diffusion Model (VDM) variant, using pruning and consistency loss to reduce computational cost and speed up inference while maintaining video quality.
Details
Motivation: High computational cost and slow inference time hinder the deployment of VDMs, prompting the need for an efficient compression method.
Method: Prunes redundant blocks from shallower layers (focusing on individual content) while preserving deeper layers (crucial for motion dynamics). Introduces Individual Content and Motion Dynamics (ICMD) Consistency Loss, including Individual Content Distillation (ICD) Loss and Multi-frame Content Adversarial (MCA) Loss.
Result: Achieves average speedups of 2.5× for the I2V method SF-V, 1.4× for the T2V method T2V-Turbo-v2, and 1.25× for the T2V method HunyuanVideo, while maintaining video quality on benchmarks including UCF101 and VBench.
Conclusion: VDMini effectively balances efficiency and performance, making VDMs more practical for deployment.
Abstract: The high computational cost and slow inference time are major obstacles to deploying Video Diffusion Models (VDMs). To overcome this, we introduce a new Video Diffusion Model Compression approach using individual content and motion dynamics preserved pruning and consistency loss. First, we empirically observe that deeper VDM layers are crucial for maintaining the quality of \textbf{motion dynamics} (\textit{e.g.,} coherence of the entire video), while shallower layers are more focused on \textbf{individual content} (\textit{e.g.,} individual frames). Therefore, we prune redundant blocks from the shallower layers while preserving more of the deeper layers, resulting in a lightweight VDM variant called VDMini. Moreover, we propose an \textbf{Individual Content and Motion Dynamics (ICMD)} Consistency Loss so that VDMini attains generation performance comparable to the larger VDM. In particular, we first use the Individual Content Distillation (ICD) Loss to preserve the consistency in the features of each generated frame between the teacher and student models. Next, we introduce a Multi-frame Content Adversarial (MCA) Loss to enhance the motion dynamics across the generated video as a whole. This method significantly accelerates inference time while maintaining high-quality video generation. Extensive experiments demonstrate the effectiveness of our VDMini on two important video generation tasks, Text-to-Video (T2V) and Image-to-Video (I2V), where we respectively achieve an average 2.5 $\times$, 1.4 $\times$, and 1.25 $\times$ speed up for the I2V method SF-V, the T2V method T2V-Turbo-v2, and the T2V method HunyuanVideo, while maintaining the quality of the generated videos on several benchmarks including UCF101, VBench-T2V, and VBench-I2V.
[230] SpatioTemporal Difference Network for Video Depth Super-Resolution
Zhengxue Wang, Yuan Wu, Xiang Li, Zhiqiang Yan, Jian Yang
Main category: cs.CV
TL;DR: The paper proposes STDNet, a SpatioTemporal Difference Network, to address long-tailed distribution issues in video depth super-resolution by leveraging spatial and temporal difference mechanisms.
Details
Motivation: Video depth super-resolution suffers from long-tailed distributions in spatial non-smooth regions and temporal variation zones, limiting reconstruction quality.
Method: STDNet uses two branches: a spatial difference branch for intra-frame RGB-D aggregation and a temporal difference branch for motion compensation using adjacent frames.
Result: STDNet outperforms existing methods across multiple datasets.
Conclusion: The proposed STDNet effectively mitigates long-tailed effects, enhancing depth super-resolution quality.
Abstract: Depth super-resolution has achieved impressive performance, and the incorporation of multi-frame information further enhances reconstruction quality. Nevertheless, statistical analyses reveal that video depth super-resolution remains affected by pronounced long-tailed distributions, with the long-tailed effects primarily manifesting in spatial non-smooth regions and temporal variation zones. To address these challenges, we propose a novel SpatioTemporal Difference Network (STDNet) comprising two core branches: a spatial difference branch and a temporal difference branch. In the spatial difference branch, we introduce a spatial difference mechanism to mitigate the long-tailed issues in spatial non-smooth regions. This mechanism dynamically aligns RGB features with learned spatial difference representations, enabling intra-frame RGB-D aggregation for depth calibration. In the temporal difference branch, we further design a temporal difference strategy that preferentially propagates temporal variation information from adjacent RGB and depth frames to the current depth frame, leveraging temporal difference representations to achieve precise motion compensation in temporal long-tailed areas. Extensive experimental results across multiple datasets demonstrate the effectiveness of our STDNet, outperforming existing approaches.
[231] Progressive Growing of Video Tokenizers for Temporally Compact Latent Spaces
Aniruddha Mahapatra, Long Mai, David Bourgin, Yitian Zhang, Feng Liu
Main category: cs.CV
TL;DR: The paper proposes a bootstrapped high-temporal-compression model for video tokenizers, leveraging lower-compression models to enhance reconstruction quality and achieve higher temporal compression.
Details
Motivation: Extending video tokenizers for temporal compression beyond 4x without increasing channel capacity is challenging. The authors aim to improve this by utilizing insights from lower-compression models.
Method: A bootstrapped model is developed, progressively training high-compression blocks on top of well-trained lower-compression models, with a cross-level feature-mixing module to retain information.
Result: The method improves reconstruction quality and achieves higher temporal compression compared to direct full-model training, enabling efficient video diffusion model training.
Conclusion: The approach successfully enhances video tokenizers, enabling high-quality video generation with reduced computational resources.
Abstract: Video tokenizers are essential for latent video diffusion models, converting raw video data into spatiotemporally compressed latent spaces for efficient training. However, extending state-of-the-art video tokenizers to achieve a temporal compression ratio beyond 4x without increasing channel capacity poses significant challenges. In this work, we propose an alternative approach to enhance temporal compression. We find that the reconstruction quality of temporally subsampled videos from a low-compression encoder surpasses that of high-compression encoders applied to original videos. This indicates that high-compression models can leverage representations from lower-compression models. Building on this insight, we develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models. Our method includes a cross-level feature-mixing module to retain information from the pretrained low-compression model and guide higher-compression blocks to capture the remaining details from the full video sequence. Evaluation on video benchmarks shows that our method significantly improves reconstruction quality while increasing temporal compression compared to directly training the full model. Furthermore, the resulting compact latent space effectively trains a video diffusion model for high-quality video generation with a significantly reduced token budget.
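A toy torch sketch of a cross-level feature-mixing module in the spirit described: frozen low-compression features are projected, temporally matched, and injected into the high-compression block through a gated residual. The 1x1 projection, 2x temporal subsampling, and learnable gate are our assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CrossLevelMix(nn.Module):
    """Mix features from a low-compression level into a high-compression
    block via a learned, gated residual."""
    def __init__(self, low_ch: int, high_ch: int):
        super().__init__()
        self.proj = nn.Conv3d(low_ch, high_ch, kernel_size=1)
        self.gate = nn.Parameter(torch.zeros(1))  # starts as identity

    def forward(self, high_feat, low_feat):
        # high_feat: (B, Ch, T, H, W); low_feat: (B, Cl, 2T, H, W)
        low = self.proj(low_feat)
        low = low[:, :, ::2]                      # match temporal rate
        return high_feat + torch.tanh(self.gate) * low

mix = CrossLevelMix(low_ch=8, high_ch=16)
out = mix(torch.randn(1, 16, 4, 8, 8), torch.randn(1, 8, 8, 8, 8))
print(out.shape)  # torch.Size([1, 16, 4, 8, 8])
```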
[232] Enhancing Diffusion-based Dataset Distillation via Adversary-Guided Curriculum Sampling
Lexiao Zou, Gongwei Chen, Yanda Chen, Miao Zhang
Main category: cs.CV
TL;DR: The paper introduces Adversary-guided Curriculum Sampling (ACS) to enhance dataset distillation by improving diversity and reducing redundancy in images sampled from diffusion models.
Details
Motivation: Address performance degradation in dataset distillation due to lack of diversity in images sampled from diffusion models, which leads to information redundancy.
Method: Proposes ACS, which partitions the dataset into curricula and uses adversarial loss to guide diffusion sampling, ensuring diversity and systematic coverage.
Result: ACS achieves improvements of 4.1% on Imagewoof and 2.1% on ImageNet-1k over state-of-the-art methods.
Conclusion: ACS effectively mitigates redundancy and enhances diversity in distilled datasets, outperforming existing techniques.
Abstract: Dataset distillation aims to encapsulate the rich information contained in a dataset into a compact distilled dataset, but it faces performance degradation as the image-per-class (IPC) setting or image resolution grows larger. Recent advancements demonstrate that integrating diffusion generative models can effectively facilitate the compression of large-scale datasets while maintaining efficiency, due to their superiority in matching data distributions and summarizing representative patterns. However, images sampled from diffusion models are often criticized for a lack of diversity, which may lead to information redundancy when multiple independently sampled images are aggregated into a distilled dataset. To address this issue, we propose Adversary-guided Curriculum Sampling (ACS), which partitions the distilled dataset into multiple curricula. For generating each curriculum, ACS guides the diffusion sampling process with an adversarial loss to challenge a discriminator trained on sampled images, thus mitigating information overlap between curricula and fostering a more diverse distilled dataset. Additionally, as the discriminator evolves with the progression of curricula, ACS generates images from simpler to more complex, ensuring efficient and systematic coverage of the target data's informational spectrum. Extensive experiments demonstrate the effectiveness of ACS, which achieves substantial improvements of 4.1% on Imagewoof and 2.1% on ImageNet-1k over the state-of-the-art.
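The adversarial guidance can be pictured as a classifier-guidance-style update: at each sampling step the latent is nudged down the gradient of the discriminator's "seen before" score, pushing generations away from regions the discriminator already recognizes. The update form, loss, and scale below are our assumptions; the networks are stand-ins.

```python
import torch

def adversary_guided_step(x, denoise_fn, discriminator, scale=0.5):
    """One guided sampling step: move the sample in the direction that
    lowers the discriminator's score for 'resembles prior samples'."""
    x = x.detach().requires_grad_(True)
    seen_logit = discriminator(x)                       # high = "seen before"
    adv_loss = torch.nn.functional.softplus(seen_logit).sum()
    grad = torch.autograd.grad(adv_loss, x)[0]
    with torch.no_grad():
        x = denoise_fn(x) - scale * grad                # challenge the critic
    return x

# Toy usage with stand-in networks.
disc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 1))
x = adversary_guided_step(torch.randn(2, 3, 8, 8),
                          denoise_fn=lambda z: 0.9 * z,
                          discriminator=disc, scale=0.1)
```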
[233] Optimizing Image Capture for Computer Vision-Powered Taxonomic Identification and Trait Recognition of Biodiversity Specimens
Alyson East, Elizabeth G. Campolongo, Luke Meyers, S M Rayeed, Samuel Stevens, Iuliia Zarubiieva, Isadora E. Fluck, Jennifer C. Girón, Maximiliane Jousse, Scott Lowe, Kayla I Perry, Isabelle Betancourt, Noah Charney, Evan Donoso, Nathan Fox, Kim J. Landsbergen, Ekaterina Nepovinnykh, Michelle Ramirez, Parkash Singh, Khum Thapa-Magar, Matthew Thompson, Evan Waite, Tanya Berger-Wolf, Hilmar Lapp, Paula Mabee, Charles Stewart, Graham Taylor, Sydne Record
Main category: cs.CV
TL;DR: A framework for optimizing biological specimen imaging for computer vision, bridging digitization practices with computational needs.
Details
Motivation: Current imaging protocols for biological specimens are not designed for automated analysis, limiting computer vision applications in taxonomy and trait extraction.
Method: Interdisciplinary collaboration to synthesize evidence-based recommendations, including practical checklists and guidelines for imaging.
Result: A comprehensive framework with ten considerations for optimizing image capture, actionable guidance, and a roadmap for community standards.
Conclusion: This framework enhances automated analysis of existing specimens and guides future digitization for improved computational capabilities.
Abstract: 1) Biological collections house millions of specimens with digital images increasingly available through open-access platforms. However, most imaging protocols were developed for human interpretation without considering automated analysis requirements. As computer vision applications revolutionize taxonomic identification and trait extraction, a critical gap exists between current digitization practices and computational analysis needs. This review provides the first comprehensive practical framework for optimizing biological specimen imaging for computer vision applications. 2) Through interdisciplinary collaboration between taxonomists, collection managers, ecologists, and computer scientists, we synthesized evidence-based recommendations addressing fundamental computer vision concepts and practical imaging considerations. We provide immediately actionable implementation guidance while identifying critical areas requiring community standards development. 3) Our framework encompasses ten interconnected considerations for optimizing image capture for computer vision-powered taxonomic identification and trait extraction. We translate these into practical implementation checklists, equipment selection guidelines, and a roadmap for community standards development including filename conventions, pixel density requirements, and cross-institutional protocols. 4) By bridging biological and computational disciplines, this approach unlocks automated analysis potential for millions of existing specimens and guides future digitization efforts toward unprecedented analytical capabilities.
[234] ModelNet40-E: An Uncertainty-Aware Benchmark for Point Cloud Classification
Pedro Alonso, Tianrui Li, Chongshou Li
Main category: cs.CV
TL;DR: ModelNet40-E is a new benchmark for evaluating point cloud classification models under LiDAR-like noise, featuring noise-corrupted data and uncertainty annotations. Point Transformer v3 shows the best calibration among tested models.
Details
Motivation: To assess robustness and calibration of point cloud models under synthetic noise, addressing gaps in existing benchmarks.
Method: Introduces ModelNet40-E with noise-corrupted point clouds and uncertainty annotations. Evaluates PointNet, DGCNN, and Point Transformer v3 using accuracy, calibration metrics, and uncertainty-awareness.
Result: All models degrade with noise, but Point Transformer v3 excels in calibration, aligning uncertainties with measurement noise.
Conclusion: ModelNet40-E enables fine-grained evaluation, with Point Transformer v3 emerging as the most robust and well-calibrated model.
Abstract: We introduce ModelNet40-E, a new benchmark designed to assess the robustness and calibration of point cloud classification models under synthetic LiDAR-like noise. Unlike existing benchmarks, ModelNet40-E provides both noise-corrupted point clouds and point-wise uncertainty annotations via Gaussian noise parameters (σ, μ), enabling fine-grained evaluation of uncertainty modeling. We evaluate three popular models, PointNet, DGCNN, and Point Transformer v3, across multiple noise levels using classification accuracy, calibration metrics, and uncertainty-awareness. While all models degrade under increasing noise, Point Transformer v3 demonstrates superior calibration, with predicted uncertainties more closely aligned with the underlying measurement uncertainty.
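A small numpy sketch of producing a noise-corrupted cloud with point-wise (σ, μ) annotations. Making σ grow with range and applying the noise along each point's ray, to mimic LiDAR behavior, are our assumptions about the benchmark's noise model.

```python
import numpy as np

def corrupt_with_lidar_noise(points, base_sigma=0.01, range_coef=0.005,
                             mu=0.0, rng=None):
    """Perturb each point radially with Gaussian noise N(mu, sigma),
    where sigma grows with distance from a sensor at the origin.
    Returns the noisy cloud plus per-point (sigma, mu) annotations."""
    rng = rng or np.random.default_rng()
    r = np.linalg.norm(points, axis=1, keepdims=True)     # range per point
    sigma = base_sigma + range_coef * r                   # (N, 1)
    dirs = points / np.clip(r, 1e-8, None)                # unit rays
    noise = rng.normal(mu, 1.0, size=r.shape) * sigma     # radial offsets
    return points + noise * dirs, sigma.ravel(), np.full(len(points), mu)

pts = np.random.default_rng(0).uniform(-1, 1, size=(1024, 3))
noisy, sig, mu_ann = corrupt_with_lidar_noise(pts)
print(noisy.shape, sig.shape)  # (1024, 3) (1024,)
```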
[235] SGCap: Decoding Semantic Group for Zero-shot Video Captioning
Zeyu Pan, Ping Li, Wenxiao Wang
Main category: cs.CV
TL;DR: SGCap introduces a zero-shot video captioning method using Semantic Group Decoding and novel modules (KSS, PSS) to model temporal dynamics and enhance generalization, outperforming prior methods.
Details
Motivation: Existing zero-shot image captioning methods fail to capture video temporal dynamics and lack sufficient video-level supervision.
Method: Proposes SGCap with Semantic Group Decoding (SGD), Key Sentences Selection (KSS), and Probability Sampling Supervision (PSS) to model inter-frame relationships and enhance captioning.
Result: SGCap outperforms state-of-the-art zero-shot methods and competes with fully supervised ones on benchmarks.
Conclusion: SGCap effectively addresses zero-shot video captioning challenges by leveraging temporal dynamics and diverse supervision, achieving strong performance.
Abstract: Zero-shot video captioning aims to generate sentences for describing videos without training the model on video-text pairs, which remains underexplored. Existing zero-shot image captioning methods typically adopt a text-only training paradigm, where a language decoder reconstructs single-sentence embeddings obtained from CLIP. However, directly extending them to the video domain is suboptimal, as applying average pooling over all frames neglects temporal dynamics. To address this challenge, we propose a Semantic Group Captioning (SGCap) method for zero-shot video captioning. In particular, it develops the Semantic Group Decoding (SGD) strategy to employ multi-frame information while explicitly modeling inter-frame temporal relationships. Furthermore, existing zero-shot captioning methods that rely on cosine similarity for sentence retrieval and reconstruct the description supervised by a single frame-level caption fail to provide sufficient video-level supervision. To alleviate this, we introduce two key components: the Key Sentences Selection (KSS) module and the Probability Sampling Supervision (PSS) module. The two modules construct semantically-diverse sentence groups that model temporal dynamics and guide the model to capture inter-sentence causal relationships, thereby enhancing its generalization ability to video captioning. Experimental results on several benchmarks demonstrate that SGCap significantly outperforms previous state-of-the-art zero-shot alternatives and even achieves performance competitive with fully supervised ones. Code is available at https://github.com/mlvccn/SGCap_Video.
[236] PromptSafe: Gated Prompt Tuning for Safe Text-to-Image Generation
Zonglei Jing, Xiao Yang, Xiaoqian Li, Siyuan Liang, Aishan Liu, Mingchuan Zhang, Xianglong Liu
Main category: cs.CV
TL;DR: PromptSafe is a gated prompt tuning framework for T2I models to prevent NSFW content, balancing safety and image quality by dynamically adjusting defensive strength based on prompt toxicity.
Details
Motivation: Existing T2I models are vulnerable to generating NSFW content, and current moderation methods are computationally expensive, degrade benign image quality, and lack adaptability to diverse safety needs.
Method: PromptSafe uses a text-only training corpus (rewritten unsafe prompts via LLM) to optimize a universal soft prompt and employs a gated control network to adaptively adjust defensive strength based on prompt toxicity.
Result: Achieves a 2.36% unsafe generation rate, preserves benign fidelity, and shows strong generalization, transferability, and resilience to attacks.
Conclusion: PromptSafe offers a practical, scalable solution for safe T2I generation by dynamically aligning defense intensity with prompt risk.
Abstract: Text-to-image (T2I) models have demonstrated remarkable generative capabilities but remain vulnerable to producing not-safe-for-work (NSFW) content, such as violent or explicit imagery. While recent moderation efforts have introduced soft prompt-guided tuning by appending defensive tokens to the input, these approaches often rely on large-scale curated image-text datasets and apply static, one-size-fits-all defenses at inference time. However, this results not only in high computational cost and degraded benign image quality, but also in limited adaptability to the diverse and nuanced safety requirements of real-world prompts. To address these challenges, we propose PromptSafe, a gated prompt tuning framework that combines a lightweight, text-only supervised soft embedding with an inference-time gated control network. Instead of training on expensive image-text datasets, we first rewrite unsafe prompts into semantically aligned but safe alternatives using an LLM, constructing an efficient text-only training corpus. Based on this, we optimize a universal soft prompt that repels unsafe and attracts safe embeddings during the diffusion denoising process. To avoid over-suppressing benign prompts, we introduce a gated mechanism that adaptively adjusts the defensive strength based on estimated prompt toxicity, thereby aligning defense intensity with prompt risk and ensuring strong protection for harmful inputs while preserving benign generation quality. Extensive experiments across multiple benchmarks and T2I models show that PromptSafe achieves a SOTA unsafe generation rate (2.36%), while preserving high benign fidelity. Furthermore, PromptSafe demonstrates strong generalization to unseen harmful categories, robust transferability across diffusion model architectures, and resilience under adaptive adversarial attacks, highlighting its practical value for safe and scalable deployment.
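The gating mechanism can be sketched as scaling a learned defensive soft prompt by an estimated prompt toxicity before appending it to the text conditioning. The toy gate network and concatenation scheme below are our assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedSoftPrompt(nn.Module):
    """Append a defensive soft prompt to text embeddings, scaled by a
    gate network's toxicity estimate (0 = benign, 1 = harmful)."""
    def __init__(self, embed_dim: int, prompt_len: int = 4):
        super().__init__()
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)
        self.gate = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, text_emb):                         # (B, L, D)
        tox = self.gate(text_emb.mean(dim=1))            # (B, 1) toxicity
        guard = tox[:, :, None] * self.soft_prompt       # (B, P, D) scaled
        return torch.cat([text_emb, guard], dim=1)       # (B, L+P, D)

gsp = GatedSoftPrompt(embed_dim=768)
out = gsp(torch.randn(2, 77, 768))
print(out.shape)  # torch.Size([2, 81, 768])
```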
[237] Integrating Disparity Confidence Estimation into Relative Depth Prior-Guided Unsupervised Stereo Matching
Chuang-Wei Liu, Mingjian Sun, Cairong Zhao, Hanli Wang, Alexander Dvorkovich, Rui Fan
Main category: cs.CV
TL;DR: The paper proposes an unsupervised stereo matching framework using disparity confidence estimation and depth prior-guided loss functions to improve accuracy by addressing ambiguities and noise in existing methods.
Details
Motivation: Existing unsupervised stereo matching methods suffer from ambiguities like repetitive patterns and texture-less regions, and inefficient use of 3D geometric knowledge due to noisy disparity estimates.
Method: The framework includes a disparity confidence estimation algorithm and two depth prior-guided loss functions, leveraging quasi-dense correspondences and dual disparity smoothness loss.
Result: The method achieves state-of-the-art accuracy on the KITTI Stereo benchmarks among unsupervised methods.
Conclusion: The proposed framework effectively addresses challenges in unsupervised stereo matching, enhancing performance through confident disparity estimation and geometric knowledge transfer.
Abstract: Unsupervised stereo matching has garnered significant attention for its independence from costly disparity annotations. Typical unsupervised methods rely on the multi-view consistency assumption to train networks, but they suffer considerably from stereo matching ambiguities, such as repetitive patterns and texture-less regions. A feasible solution lies in transferring 3D geometric knowledge from a relative depth map to the stereo matching networks. However, existing knowledge transfer methods learn depth ranking information from randomly built sparse correspondences, which makes inefficient use of 3D geometric knowledge and introduces noise from mistaken disparity estimates. This work proposes a novel unsupervised learning framework to address these challenges, which comprises a plug-and-play disparity confidence estimation algorithm and two depth prior-guided loss functions. Specifically, the local coherence consistency between neighboring disparities and their corresponding relative depths is first checked to obtain disparity confidence. Afterwards, quasi-dense correspondences are built using only confident disparity estimates to facilitate efficient depth ranking learning. Finally, a dual disparity smoothness loss is proposed to boost stereo matching performance at disparity discontinuities. Experimental results demonstrate that our method achieves state-of-the-art stereo matching accuracy on the KITTI Stereo benchmarks among all unsupervised stereo matching methods.
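Since disparity is proportional to inverse depth, a confident disparity estimate should order a pixel against its neighbours the same way the relative depth map does. A minimal numpy sketch of such a local-coherence confidence; the window size and the sign-agreement measure are our simplifications (and np.roll wraps at image borders).

```python
import numpy as np

def disparity_confidence(disp, rel_depth, radius=2):
    """Fraction of neighbours whose ordering by disparity agrees with
    their ordering by inverse relative depth (disparity ~ 1/depth)."""
    inv_depth = 1.0 / np.clip(rel_depth, 1e-6, None)
    H, W = disp.shape
    agree = np.zeros((H, W))
    count = np.zeros((H, W))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            sh_d = np.roll(np.roll(disp, dy, 0), dx, 1)
            sh_i = np.roll(np.roll(inv_depth, dy, 0), dx, 1)
            agree += (np.sign(disp - sh_d) == np.sign(inv_depth - sh_i))
            count += 1
    return agree / count                 # per-pixel confidence in [0, 1]

rng = np.random.default_rng(0)
conf = disparity_confidence(rng.random((32, 32)), rng.random((32, 32)))
```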
[238] GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification
Ngoc Bui Lam Quang, Nam Le Nguyen Binh, Thanh-Huy Nguyen, Le Thien Phuc Nguyen, Quan Nguyen, Ulas Bagci
Main category: cs.CV
TL;DR: A vision-language MIL framework improves WSI classification by generating diverse, accurate clinical descriptions and encoding them effectively for better alignment with visual features.
Details
Motivation: Existing methods using LLMs for clinical descriptions in MIL pipelines face limitations in expressiveness, domain grounding, and fine-grained specificity, leading to suboptimal performance.
Method: Proposes a grounded multi-agent description generation system and a text encoding strategy using multiple descriptions for better alignment with visual features.
Result: Outperforms single-prompt baselines and achieves results comparable to state-of-the-art models on renal and lung cancer datasets.
Conclusion: The framework enhances MIL performance by addressing limitations in clinical description generation and encoding, improving alignment between text and visual features.
Abstract: Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification, enabling efficient analysis of gigapixel pathology slides. Recent work has introduced vision-language models (VLMs) into MIL pipelines to incorporate medical knowledge through text-based class descriptions rather than simple class names. However, when these methods rely on large language models (LLMs) to generate clinical descriptions or use fixed-length prompts to represent complex pathology concepts, the limited token capacity of VLMs often constrains the expressiveness and richness of the encoded class information. Additionally, descriptions generated solely by LLMs may lack domain grounding and fine-grained medical specificity, leading to suboptimal alignment with visual features. To address these challenges, we propose a vision-language MIL framework with two key contributions: (1) A grounded multi-agent description generation system that leverages curated pathology textbooks and agent specialization (e.g., morphology, spatial context) to produce accurate and diverse clinical descriptions; (2) A text encoding strategy using a list of descriptions rather than a single prompt, capturing fine-grained and complementary clinical signals for better alignment with visual features. Integrated into a VLM-MIL pipeline, our approach shows improved performance over single-prompt class baselines and achieves results comparable to state-of-the-art models, as demonstrated on renal and lung cancer datasets.
[239] Domain Generalized Stereo Matching with Uncertainty-guided Data Augmentation
Shuangli Du, Jing Wang, Minghua Zhao, Zhenyu Xu, Jie Li
Main category: cs.CV
TL;DR: The paper proposes an uncertainty-guided data augmentation (UgDA) method to improve stereo matching models’ generalization from synthetic to real data by perturbing RGB image statistics and enforcing feature consistency.
Details
Motivation: Stereo matching models trained on synthetic data struggle with real-world domain differences like color and texture. The goal is to enhance cross-domain generalization.
Method: UgDA perturbs RGB image statistics (mean, standard deviation) to simulate unseen domains and uses Gaussian distributions for uncertainty modeling. Feature consistency between original and augmented data is enforced.
Result: Extensive experiments show UgDA significantly boosts the generalization performance of stereo matching networks across benchmarks.
Conclusion: UgDA is a simple, architecture-agnostic solution for improving stereo matching models’ cross-domain generalization.
Abstract: State-of-the-art stereo matching (SM) models trained on synthetic data often fail to generalize to real data domains due to domain differences, such as color, illumination, contrast, and texture. To address this challenge, we leverage data augmentation to expand the training domain, encouraging the model to acquire robust cross-domain feature representations instead of domain-dependent shortcuts. This paper proposes an uncertainty-guided data augmentation (UgDA) method, which argues that the image statistics in RGB space (mean and standard deviation) carry the domain characteristics. Thus, samples in unseen domains can be generated by properly perturbing these statistics. Furthermore, to simulate more potential domains, Gaussian distributions founded on batch-level statistics are proposed to model the uncertainty of perturbation direction and intensity. Additionally, we further enforce feature consistency between original and augmented data for the same scene, encouraging the model to learn structure-aware, shortcut-invariant feature representations. Our approach is simple, architecture-agnostic, and can be integrated into any SM network. Extensive experiments on several challenging benchmarks have demonstrated that our method can significantly improve the generalization performance of existing SM networks.
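The core augmentation can be sketched directly: per-image channel statistics are re-sampled from Gaussians whose spread is estimated from batch-level variability of those statistics, then each image is re-normalized to the perturbed statistics. The exact parameterization below is our assumption.

```python
import torch

def ugda_augment(x: torch.Tensor) -> torch.Tensor:
    """Perturb per-image channel statistics to simulate unseen domains.

    x: (B, C, H, W). New mean/std are drawn from Gaussians whose spread
    comes from the batch-level variability of those statistics."""
    mu = x.mean(dim=(2, 3), keepdim=True)                 # (B, C, 1, 1)
    sigma = x.std(dim=(2, 3), keepdim=True) + 1e-6
    # Batch-level uncertainty of the statistics themselves.
    mu_spread = mu.std(dim=0, keepdim=True) + 1e-6
    sigma_spread = sigma.std(dim=0, keepdim=True) + 1e-6
    new_mu = mu + torch.randn_like(mu) * mu_spread
    new_sigma = (sigma + torch.randn_like(sigma) * sigma_spread).clamp_min(1e-6)
    return (x - mu) / sigma * new_sigma + new_mu          # re-normalize

aug = ugda_augment(torch.rand(8, 3, 64, 64))  # same shape, shifted "domain"
```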
[240] C3D-AD: Toward Continual 3D Anomaly Detection via Kernel Attention with Learnable Advisor
Haoquan Lu, Hanzhe Liang, Jie Zhang, Chenxi Hu, Jinbao Wang, Can Gao
Main category: cs.CV
TL;DR: Proposes C3D-AD, a continual learning framework for 3D anomaly detection, handling new classes over time with Kernel Attention and representation rehearsal.
Details
Motivation: Existing 3D AD methods are class-specific and cannot adapt to new classes, limiting their real-world applicability.
Method: Uses Kernel Attention Layer (KAL) for generalized feature extraction, Kernel Attention with Advisor (KAA) for continual reconstruction, and Reconstruction with Parameter Perturbation (RPP) for representation consistency.
Result: Achieves 66.4%, 83.1%, and 63.4% AUROC on Real3D-AD, Anomaly-ShapeNet, and MulSen-AD datasets.
Conclusion: C3D-AD effectively handles multi-class and emerging class scenarios, outperforming existing methods.
Abstract: 3D Anomaly Detection (AD) has shown great potential in detecting anomalies or defects of high-precision industrial products. However, existing methods are typically trained in a class-specific manner and also lack the capability of learning from emerging classes. In this study, we propose a continual learning framework named Continual 3D Anomaly Detection (C3D-AD), which can not only learn generalized representations for multi-class point clouds but also handle new classes emerging over time. Specifically, in the feature extraction module, to extract generalized local features from diverse product types of different tasks efficiently, Kernel Attention with random feature Layer (KAL) is introduced, which normalizes the feature space. Then, to reconstruct data correctly and continually, an efficient Kernel Attention with learnable Advisor (KAA) mechanism is proposed, which learns the information from new categories while discarding redundant old information within both the encoder and decoder. Finally, to keep the representation consistency over tasks, a Reconstruction with Parameter Perturbation (RPP) module is proposed by designing a representation rehearsal loss function, which ensures that the model remembers previous category information and returns category-adaptive representations. Extensive experiments on three public datasets demonstrate the effectiveness of the proposed method, achieving an average performance of 66.4%, 83.1%, and 63.4% AUROC on Real3D-AD, Anomaly-ShapeNet, and MulSen-AD, respectively.
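One plausible reading of the Kernel Attention with random feature Layer is attention computed under a random-feature approximation of the softmax kernel (Performer-style positive features); the paper's exact construction may differ. A torch sketch:

```python
import torch

def random_feature_attention(q, k, v, num_features=64):
    """Linear-time attention approximating softmax(q k^T / sqrt(d)) v
    with positive random features."""
    d = q.shape[-1]
    w = torch.randn(num_features, d)            # random projections

    def phi(x):
        x = x / d ** 0.25                       # softmax temperature
        return torch.exp(x @ w.T - (x ** 2).sum(-1, keepdim=True) / 2) \
               / num_features ** 0.5

    q_p, k_p = phi(q), phi(k)                   # (N, m) feature maps
    num = q_p @ (k_p.T @ v)                     # (N, d_v) numerator
    den = q_p @ k_p.sum(0, keepdim=True).T      # (N, 1) normalizer
    return num / den

out = random_feature_attention(torch.randn(128, 32),
                               torch.randn(128, 32),
                               torch.randn(128, 32))
print(out.shape)  # torch.Size([128, 32])
```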
[241] P3P Made Easy
Seong Hun Lee, Patrick Vandewalle, Javier Civera
Main category: cs.CV
TL;DR: A novel algebraic solution for the P3P problem, offering simplicity and efficiency comparable to state-of-the-art methods.
Details
Motivation: To provide a computationally efficient and interpretable solution for recovering camera pose from 2D-3D correspondences.
Method: Reformulates the P3P problem into a quartic polynomial with simple, efficient coefficients.
Result: Achieves accuracy and runtime performance on par with leading methods, validated by synthetic datasets.
Conclusion: The solver is ideal for real-time systems and education due to its simplicity, reliability, and performance.
Abstract: We present a novel algebraic solution to the Perspective-Three-Point (P3P) problem, which aims to recover the absolute pose of a calibrated camera from three 2D-3D correspondences. Our method reformulates the problem into a quartic polynomial with coefficients that are analytically simple and computationally efficient. Despite its simplicity, the proposed solver achieves accuracy and runtime performance comparable to state-of-the-art methods. Extensive experiments on synthetic datasets validate its robustness and efficiency. This combination of simplicity and performance makes our solver appealing for both real-time systems and educational contexts, where interpretability and reliability are critical.
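Once the problem is reduced to a quartic, the solver boils down to polynomial root extraction, with each real root yielding one candidate pose. A sketch of that stage; the coefficients below are placeholders, not the paper's analytic derivation.

```python
import numpy as np

def solve_quartic(coeffs, imag_tol=1e-9):
    """Real roots of a4 x^4 + a3 x^3 + a2 x^2 + a1 x + a0 = 0.
    Each real root corresponds to one candidate camera pose."""
    roots = np.roots(coeffs)                   # companion-matrix roots
    return np.sort(roots[np.abs(roots.imag) < imag_tol].real)

# Placeholder coefficients standing in for the analytic ones:
# (x-1)(x-2)(x-3)(x-4) = x^4 - 10x^3 + 35x^2 - 50x + 24.
print(solve_quartic([1.0, -10.0, 35.0, -50.0, 24.0]))  # [1. 2. 3. 4.]
```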
[242] Multimodal Attention-Aware Fusion for Diagnosing Distal Myopathy: Evaluating Model Interpretability and Clinician Trust
Mohsen Abbaspour Onari, Lucie Charlotte Magister, Yaoxin Wu, Amalia Lupi, Dario Creazzo, Mattia Tordin, Luigi Di Donatantonio, Emilio Quaia, Chao Zhang, Isel Grau, Marco S. Nobile, Yingqian Zhang, Pietro Liò
Main category: cs.CV
TL;DR: A novel multimodal attention-aware fusion architecture improves distal myopathy diagnosis by combining global and local deep learning features, enhancing accuracy and interpretability, though gaps in clinical specificity remain.
Details
Motivation: Diagnosing distal myopathy is challenging due to its genetic heterogeneity and varied clinical manifestations, requiring improved radiological methods.
Method: Proposes a fusion architecture integrating global and local deep learning features via an attention gate mechanism, evaluated on BUSI and a proprietary dataset.
Result: Achieves high classification accuracy and generates clinically relevant saliency maps, but gaps in anatomical specificity and clinical usefulness persist.
Conclusion: Richer interpretability methods and human-in-the-loop feedback are needed to meet real-world clinical diagnostic expectations.
Abstract: Distal myopathy represents a genetically heterogeneous group of skeletal muscle disorders with broad clinical manifestations, posing diagnostic challenges in radiology. To address this, we propose a novel multimodal attention-aware fusion architecture that combines features extracted from two distinct deep learning models, one capturing global contextual information and the other focusing on local details, representing complementary aspects of the input data. Uniquely, our approach integrates these features through an attention gate mechanism, enhancing both predictive performance and interpretability. Our method achieves a high classification accuracy on the BUSI benchmark and a proprietary distal myopathy dataset, while also generating clinically relevant saliency maps that support transparent decision-making in medical diagnosis. We rigorously evaluated interpretability through (1) functionally grounded metrics, coherence scoring against reference masks and incremental deletion analysis, and (2) application-grounded validation with seven expert radiologists. While our fusion strategy boosts predictive performance relative to single-stream and alternative fusion strategies, both quantitative and qualitative evaluations reveal persistent gaps in anatomical specificity and clinical usefulness of the interpretability. These findings highlight the need for richer, context-aware interpretability methods and human-in-the-loop feedback to meet clinicians’ expectations in real-world diagnostic settings.
[243] Referring Remote Sensing Image Segmentation with Cross-view Semantics Interaction Network
Jiaxing Yang, Lihe Zhang, Huchuan Lu
Main category: cs.CV
TL;DR: The paper introduces CSINet, a framework for Referring Remote Sensing Image Segmentation (RRSIS), addressing scale variation issues by combining remote and close-view cues for better segmentation of tiny or ambiguous targets.
Details
Motivation: Existing methods struggle with tiny or ambiguous targets in RRSIS due to reliance on single-view structures and saliency-preferring techniques.
Method: CSINet uses a Cross-View Window-attention module (CVWin) to integrate global and local semantics from remote and close views, and a Collaboratively Dilated Attention enhanced Decoder (CDAD) for multiscale feature integration.
Result: CSINet significantly improves segmentation performance while maintaining speed.
Conclusion: The proposed framework effectively addresses limitations in RRSIS by leveraging cross-view semantics and multiscale features.
Abstract: Recently, Referring Remote Sensing Image Segmentation (RRSIS) has aroused wide attention. To handle drastic scale variation of remote targets, existing methods only use the full image as input and nest the saliency-preferring techniques of cross-scale information interaction into a traditional single-view structure. Although effective for visually salient targets, they still struggle to handle tiny, ambiguous targets in many real scenarios. In this work, we instead propose a parallel yet unified segmentation framework, the Cross-view Semantics Interaction Network (CSINet), to overcome these limitations. Motivated by human behavior in observing targets of interest, the network orchestrates visual cues from remote and close distances to conduct synergistic prediction. At every encoding stage, a Cross-View Window-attention module (CVWin) is utilized to supplement global and local semantics into the close-view and remote-view branch features, promoting a unified feature representation at each stage. In addition, we develop a Collaboratively Dilated Attention enhanced Decoder (CDAD) to mine the orientation property of the target while integrating cross-view multiscale features. The proposed network seamlessly enhances the exploitation of global and local semantics, achieving significant improvements over existing methods while maintaining satisfactory speed.
[244] Zero-shot Segmentation of Skin Conditions: Erythema with Edit-Friendly Inversion
Konstantinos Moutselos, Ilias Maglogiannis
Main category: cs.CV
TL;DR: A zero-shot image segmentation framework using diffusion models detects erythema without labeled data, outperforming baseline methods.
Details
Motivation: To reduce reliance on labeled dermatological datasets and provide a scalable diagnostic tool for erythema detection.
Method: Uses generative editing in diffusion models to synthesize reference images without erythema, aligns them with original images, and performs color-space analysis.
Result: Successfully isolated facial erythema in diverse cases, outperforming threshold-based techniques.
Conclusion: Combining generative diffusion models and color segmentation enables efficient erythema detection without prior training data.
Abstract: This study proposes a zero-shot image segmentation framework for detecting erythema (redness of the skin) using edit-friendly inversion in diffusion models. The method synthesizes reference images of the same patient that are free from erythema via generative editing and then accurately aligns these references with the original images. Color-space analysis is performed with minimal user intervention to identify erythematous regions. This approach significantly reduces the reliance on labeled dermatological datasets while providing a scalable and flexible diagnostic support tool by avoiding the need for any annotated training masks. In our initial qualitative experiments, the pipeline successfully isolated facial erythema in diverse cases, demonstrating performance improvements over baseline threshold-based techniques. These results highlight the potential of combining generative diffusion models and statistical color segmentation for computer-aided dermatology, enabling efficient erythema detection without prior training data.
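After the erythema-free reference is aligned to the original photograph, redness can be isolated by differencing in a perceptual color space. A minimal sketch using the CIELAB a* (green-red) channel; the channel choice and threshold are our assumptions about the paper's color-space analysis.

```python
import numpy as np
from skimage import color

def erythema_mask(original_rgb, reference_rgb, a_thresh=8.0):
    """Segment regions where the original is redder than the aligned,
    erythema-free reference. Inputs are float RGB images in [0, 1]."""
    a_orig = color.rgb2lab(original_rgb)[..., 1]   # a*: green(-) to red(+)
    a_ref = color.rgb2lab(reference_rgb)[..., 1]
    return (a_orig - a_ref) > a_thresh             # boolean erythema mask

# Toy usage: a reference with the red channel suppressed.
rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
mask = erythema_mask(img, np.clip(img - [0.2, 0.0, 0.0], 0, 1))
```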
[245] StyleSentinel: Reliable Artistic Copyright Verification via Stylistic Fingerprints
Lingxiao Chen, Liqin Wang, Wei Lu
Main category: cs.CV
TL;DR: StyleSentinel protects artists’ copyright by verifying stylistic fingerprints in artwork, outperforming existing methods in one-sample verification.
Details
Motivation: Unauthorized use of personal artwork via diffusion models threatens intellectual property; current defenses like watermarks are inadequate.
Method: StyleSentinel enhances stylistic expressiveness through semantic self-reconstruction, fuses multi-layer features for a stylistic fingerprint, and models style as a hypersphere boundary for one-class learning.
Result: Outperforms state-of-the-art in one-sample verification and proves effective on online platforms.
Conclusion: StyleSentinel offers a robust solution for copyright protection by leveraging stylistic fingerprints, addressing gaps in existing defenses.
Abstract: The versatility of diffusion models in generating customized images has led to unauthorized usage of personal artwork, which poses a significant threat to the intellectual property of artists. Existing approaches relying on embedding additional information, such as perturbations, watermarks, and backdoors, suffer from limited defensive capabilities and fail to protect artwork published online. In this paper, we propose StyleSentinel, an approach for copyright protection of artwork by verifying an inherent stylistic fingerprint in the artist’s artwork. Specifically, we employ a semantic self-reconstruction process to enhance stylistic expressiveness within the artwork, which establishes a dense and style-consistent manifold foundation for feature learning. Subsequently, we adaptively fuse multi-layer image features to encode abstract artistic style into a compact stylistic fingerprint. Finally, we model the target artist’s style as a minimal enclosing hypersphere boundary in the feature space, transforming complex copyright verification into a robust one-class learning task. Extensive experiments demonstrate that compared with the state-of-the-art, StyleSentinel achieves superior performance on the one-sample verification task. We also demonstrate the effectiveness through online platforms.
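The verification stage can be sketched in a Deep-SVDD-like fashion: fit a hypersphere to the artist's stylistic fingerprints and accept a query only if it falls inside. Using the feature mean as center and a distance quantile as radius is our simplification of the minimal enclosing hypersphere.

```python
import numpy as np

class StyleHypersphere:
    """One-class verifier: a fingerprint is attributed to the artist
    iff it falls inside the fitted hypersphere."""
    def fit(self, fingerprints: np.ndarray, quantile: float = 0.95):
        self.center = fingerprints.mean(axis=0)
        dists = np.linalg.norm(fingerprints - self.center, axis=1)
        self.radius = np.quantile(dists, quantile)
        return self

    def verify(self, fingerprint: np.ndarray) -> bool:
        return np.linalg.norm(fingerprint - self.center) <= self.radius

rng = np.random.default_rng(0)
artist = rng.normal(0, 1, size=(200, 128))   # stand-in artist fingerprints
sphere = StyleHypersphere().fit(artist)
print(sphere.verify(artist[0]), sphere.verify(rng.normal(5, 1, size=128)))
```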
[246] Weakly-Supervised Image Forgery Localization via Vision-Language Collaborative Reasoning Framework
Ziqi Sheng, Junyan Wu, Wei Lu, Jiantao Zhou
Main category: cs.CV
TL;DR: ViLaCo is a vision-language framework for weakly supervised image forgery localization, leveraging pre-trained VLMs for semantic guidance and outperforming existing methods.
Details
Motivation: To address the limited performance of weakly supervised image forgery localization (WSIFL) methods by introducing external semantic guidance from pre-trained vision-language models (VLMs).
Method: ViLaCo uses a vision-language feature modeling network for semantic knowledge, an adaptive reasoning network for alignment, and dual prediction heads for image-level classification and pixel-level localization. A contrastive patch consistency module enhances forgery discrimination.
Result: ViLaCo outperforms existing WSIFL methods, achieving state-of-the-art performance in detection and localization accuracy on multiple datasets.
Conclusion: ViLaCo effectively bridges the gap between weak supervision and fine-grained localization, demonstrating superior performance in image forgery localization.
Abstract: Image forgery localization aims to precisely identify tampered regions within images, but it commonly depends on costly pixel-level annotations. To alleviate this annotation burden, weakly supervised image forgery localization (WSIFL) has emerged, yet existing methods still achieve limited localization performance as they mainly exploit intra-image consistency clues and lack external semantic guidance to compensate for weak supervision. In this paper, we propose ViLaCo, a vision-language collaborative reasoning framework that introduces auxiliary semantic supervision distilled from pre-trained vision-language models (VLMs), enabling accurate pixel-level localization using only image-level labels. Specifically, ViLaCo first incorporates semantic knowledge through a vision-language feature modeling network, which jointly extracts textual and visual priors using pre-trained VLMs. Next, an adaptive vision-language reasoning network aligns textual semantics and visual features through mutual interactions, producing semantically aligned representations. Subsequently, these representations are passed into dual prediction heads, where the coarse head performs image-level classification and the fine head generates pixel-level localization masks, thereby bridging the gap between weak supervision and fine-grained localization. Moreover, a contrastive patch consistency module is introduced to cluster tampered features while separating authentic ones, facilitating more reliable forgery discrimination. Extensive experiments on multiple public datasets demonstrate that ViLaCo substantially outperforms existing WSIFL methods, achieving state-of-the-art performance in both detection and localization accuracy.
[247] SBP-YOLO:A Lightweight Real-Time Model for Detecting Speed Bumps and Potholes
Chuanqi Liang, Jie Fu, Lei Luo, Miao Yu
Main category: cs.CV
TL;DR: SBP-YOLO is a lightweight detection framework for speed bumps and potholes, optimized for embedded deployment, achieving 87.0% mAP and 139.5 FPS.
Details
Motivation: The demand for ride comfort in new energy vehicles necessitates accurate real-time detection of road irregularities for predictive suspension control.
Method: SBP-YOLO integrates GhostConv, VoVGSCSPC, and a Lightweight Efficiency Detection Head (LEDH), with a hybrid training strategy using NWD loss, knowledge distillation, and weather augmentation.
Result: The model achieves 87.0% mAP (5.8% higher than YOLOv11n) and runs at 139.5 FPS on a Jetson AGX Xavier.
Conclusion: SBP-YOLO is effective for real-time road condition perception in intelligent suspension systems.
Abstract: With increasing demand for ride comfort in new energy vehicles, accurate real-time detection of speed bumps and potholes is critical for predictive suspension control. This paper proposes SBP-YOLO, a lightweight detection framework based on YOLOv11, optimized for embedded deployment. The model integrates GhostConv for efficient computation, VoVGSCSPC for multi-scale feature enhancement, and a Lightweight Efficiency Detection Head (LEDH) to reduce early-stage feature processing costs. A hybrid training strategy combining NWD loss, knowledge distillation, and Albumentations-based weather augmentation improves detection robustness, especially for small and distant targets. Experiments show SBP-YOLO achieves 87.0% mAP (outperforming YOLOv11n by 5.8%) and runs at 139.5 FPS on a Jetson AGX Xavier with TensorRT FP16 quantization. The results validate its effectiveness for real-time road condition perception in intelligent suspension systems.
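The NWD term in the hybrid training strategy presumably follows the Normalized Gaussian Wasserstein Distance proposed for tiny-object detection, which has a simple closed form when boxes are modeled as 2-D Gaussians; the constant C is dataset-dependent, and SBP-YOLO's exact variant may differ. A sketch of that published formulation:

```python
import torch

def nwd_loss(pred, target, C=12.8):
    """1 - NWD between boxes given as (cx, cy, w, h).

    A box maps to N([cx, cy], diag(w^2/4, h^2/4)), so the squared
    2-Wasserstein distance reduces to a plain Euclidean distance."""
    p = torch.stack([pred[:, 0], pred[:, 1],
                     pred[:, 2] / 2, pred[:, 3] / 2], dim=1)
    t = torch.stack([target[:, 0], target[:, 1],
                     target[:, 2] / 2, target[:, 3] / 2], dim=1)
    w2 = ((p - t) ** 2).sum(dim=1)              # squared W2 distance
    nwd = torch.exp(-torch.sqrt(w2) / C)        # similarity in (0, 1]
    return (1.0 - nwd).mean()

boxes_a = torch.tensor([[10.0, 10.0, 4.0, 8.0]])
boxes_b = torch.tensor([[11.0, 10.5, 4.0, 7.0]])
print(nwd_loss(boxes_a, boxes_b))
```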
[248] Predicting Video Slot Attention Queries from Random Slot-Feature Pairs
Rongzhen Zhao, Jian Li, Juho Kannala, Joni Pajarinen
Main category: cs.CV
TL;DR: The paper introduces RandSF.Q, a method for unsupervised video Object-Centric Learning (OCL) that improves query prediction by incorporating next-frame features and learning transition dynamics, outperforming existing methods.
Details
Motivation: Current video OCL methods lack next-frame feature incorporation and fail to learn transition dynamics, limiting their effectiveness in object-level scene representation and dynamics modeling.
Method: Proposes RandSF.Q, which includes a new transitioner using slot-feature pairs and trains it with randomly sampled pairs to learn dynamics.
Result: Significantly outperforms existing methods, achieving up to 10 points higher in object discovery and improving downstream tasks like dynamics modeling.
Conclusion: RandSF.Q sets a new state-of-the-art in video OCL by addressing key limitations of existing methods, with potential benefits for broader applications.
Abstract: Unsupervised video Object-Centric Learning (OCL) is promising as it enables object-level scene representation and dynamics modeling as we humans do. Mainstream video OCL methods adopt a recurrent architecture: An aggregator aggregates the current video frame into object features, termed slots, under some queries; A transitioner transits current slots to queries for the next frame. This is an effective architecture but all existing implementations both (\textit{i1}) neglect to incorporate next frame features, the most informative source for query prediction, and (\textit{i2}) fail to learn transition dynamics, the knowledge essential for query prediction. To address these issues, we propose Random Slot-Feature pair for learning Query prediction (RandSF.Q): (\textit{t1}) We design a new transitioner to incorporate both slots and features, which provides more information for query prediction; (\textit{t2}) We train the transitioner to predict queries from slot-feature pairs randomly sampled from available recurrences, which drives it to learn transition dynamics. Experiments on scene representation demonstrate that our method surpasses existing video OCL methods significantly, e.g., up to 10 points on object discovery, setting a new state-of-the-art. Such superiority also benefits downstream tasks like dynamics modeling. Our core source code and training logs are available as the supplement.
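The training recipe can be sketched as follows: pick a random recurrence t, feed the (slots_t, features_{t+1}) pair to the transitioner, and supervise its predicted queries. The toy MLP transitioner, the pooled feature shapes, and the MSE target (next-frame slots standing in for the queries actually used) are all our assumptions.

```python
import torch
import torch.nn as nn

S, D = 6, 32   # slots per frame, feature dim (toy sizes)
# Toy transitioner: maps a concatenated (slot, feature) pair to a query.
transitioner = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU(), nn.Linear(D, D))

def randsfq_loss(slots_seq, feats_seq):
    """slots_seq: (T, S, D) slots; feats_seq: (T, S, D) pooled frame features.
    Sample a random recurrence t and train the transitioner to map
    (slots_t, feats_{t+1}) to the queries used at t+1 (here: slots_{t+1})."""
    T = slots_seq.shape[0]
    t = torch.randint(0, T - 1, (1,)).item()        # random recurrence
    pair = torch.cat([slots_seq[t], feats_seq[t + 1]], dim=-1)  # (S, 2D)
    pred_queries = transitioner(pair)               # (S, D)
    return nn.functional.mse_loss(pred_queries, slots_seq[t + 1].detach())

loss = randsfq_loss(torch.randn(8, S, D), torch.randn(8, S, D))
loss.backward()
```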
[249] Effective Damage Data Generation by Fusing Imagery with Human Knowledge Using Vision-Language Models
Jie Wei, Erika Ardiles-Cruz, Aleksey Panasyuk, Erik Blasch
Main category: cs.CV
TL;DR: The paper proposes using vision-language models (VLMs) to improve damage assessment in HADR by addressing data imbalance and labeling inaccuracies, showing promising results in classifying structural damage.
Details
Motivation: Current deep learning methods for HADR damage assessment struggle with data imbalance, lack of moderate damage examples, and human labeling errors.
Method: Leverages VLMs to fuse imagery with human knowledge, generating diverse image-based damage data.
Result: Initial experiments show improved classification of structural damage levels in buildings, roads, and infrastructure.
Conclusion: VLMs offer a viable solution to enhance damage assessment accuracy in HADR by overcoming data limitations.
Abstract: It is of crucial importance to assess damage promptly and accurately in humanitarian assistance and disaster response (HADR). Current deep learning approaches struggle to generalize effectively due to the imbalance of data classes, the scarcity of moderate damage examples, and human inaccuracy in pixel labeling during HADR situations. To accommodate these limitations, we exploit state-of-the-art vision-language models (VLMs) to fuse imagery with human knowledge, creating an opportunity to generate a diversified set of image-based damage data effectively. Our initial experimental results suggest encouraging data generation quality, demonstrating an improvement in classifying scenes with different levels of structural damage to buildings, roads, and infrastructure.
[250] A Full-Stage Refined Proposal Algorithm for Suppressing False Positives in Two-Stage CNN-Based Detection Methods
Qiang Guo, Rubo Zhang, Bingbing Zhang, Junjie Liu, Jianqing Liu
Main category: cs.CV
TL;DR: The paper proposes a Full-stage Refined Proposal (FRP) algorithm to reduce false positives in pedestrian detection using a two-stage CNN framework, with innovative training and testing strategies.
Details
Motivation: False positives in pedestrian detection remain unresolved, prompting the need for an effective solution.
Method: The FRP algorithm employs pedestrian feature re-evaluation strategies, including Training mode FRP (TFRP) for training, and Classifier-guided FRP (CFRP) and Split-proposal FRP (SFRP) for testing.
Result: Experiments show the FRP algorithm effectively reduces false positives across benchmarks and enhances detection in resource-constrained devices.
Conclusion: The FRP algorithm significantly improves pedestrian detection by suppressing false positives at all stages, proving effective in diverse scenarios.
Abstract: False positives in pedestrian detection remain a challenge that has yet to be effectively resolved. To address this issue, this paper proposes a Full-stage Refined Proposal (FRP) algorithm aimed at eliminating these false positives within a two-stage CNN-based pedestrian detection framework. The main innovation of this work lies in employing various pedestrian feature re-evaluation strategies to filter out low-quality pedestrian proposals during both the training and testing stages. Specifically, in the training phase, the Training mode FRP algorithm (TFRP) introduces a novel approach for validating pedestrian proposals to effectively guide the model training process, thereby constructing a model with strong capabilities for false positive suppression. During the inference phase, two innovative strategies are implemented: the Classifier-guided FRP (CFRP) algorithm integrates a pedestrian classifier into the proposal generation pipeline to yield high-quality proposals through pedestrian feature evaluation, and the Split-proposal FRP (SFRP) algorithm vertically divides all proposals, sending both the original and the sub-region proposals to the subsequent subnetwork to evaluate their confidence scores, filtering out those with lower sub-region pedestrian confidence scores. As a result, the proposed algorithm enhances the model's ability to suppress pedestrian false positives across all stages. Various experiments conducted on multiple benchmarks and the SY-Metro datasets demonstrate that the model, supported by different combinations of the FRP algorithm, can effectively eliminate false positives to varying extents. Furthermore, experiments conducted on embedded platforms underscore the algorithm's effectiveness in enhancing the detection capabilities of small pedestrian detectors on resource-constrained edge devices.
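A minimal sketch of the Split-proposal idea follows, under the assumption that "vertically divides" means splitting each proposal into top and bottom halves and gating on the weaker half's confidence; `score_fn` stands in for the second-stage subnetwork, and the threshold is hypothetical.

```python
import numpy as np

def sfrp_filter(proposals: np.ndarray, score_fn, thresh: float = 0.3) -> np.ndarray:
    """Split-proposal FRP (sketch): vertically split each (x1, y1, x2, y2)
    proposal, score the two halves with the downstream classifier, and drop
    proposals whose weaker sub-region confidence falls below a threshold.
    """
    x1, y1, x2, y2 = proposals.T
    ymid = (y1 + y2) / 2
    top = np.stack([x1, y1, x2, ymid], axis=1)
    bottom = np.stack([x1, ymid, x2, y2], axis=1)
    sub_conf = np.minimum(score_fn(top), score_fn(bottom))
    return proposals[sub_conf >= thresh]

# toy usage with a dummy scorer that favors taller boxes
props = np.array([[0, 0, 20, 60], [5, 5, 25, 20]], dtype=float)
dummy_scores = lambda b: (b[:, 3] - b[:, 1]) / 60.0
print(sfrp_filter(props, dummy_scores))  # keeps only the taller proposal
```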
[251] Lightweight Backbone Networks Only Require Adaptive Lightweight Self-Attention Mechanisms
Fengyun Li, Chao Zheng, Yangyang Fang, Jialiang Lan, Jianhua Liang, Luhao Zhang, Fa Si
Main category: cs.CV
TL;DR: The paper proposes Fast Window Attention (FWA), a lightweight SoftMax attention mechanism with adaptive feature map sizes, and integrates it into a hybrid backbone network, LOLViT, which outperforms CNNs in speed and accuracy.
Details
Motivation: Address the computational imbalance between CNNs and attention mechanisms, and improve long-sequence modeling in lightweight hybrid models.
Method: Introduces FWA for adaptive feature map reduction and a ReLU-based SoftMax approximation, combined with GhostNet for a hybrid backbone (LOLViT).
Result: LOLViT achieves superior performance in classification, detection, and segmentation tasks, with LOLViT-X being 5x faster than MobileViT-X.
Conclusion: FWA and LOLViT effectively balance computational efficiency and accuracy, advancing lightweight hybrid models.
Abstract: Currently, lightweight hybrid backbone networks have partially alleviated the issue of computational saturation, but the imbalance in computational efficiency between convolutional neural networks (CNNs) and attention mechanisms is becoming increasingly apparent. Specifically, although linear attention mechanisms and their variants have made progress in lightweight design, they still fail to meet the demands of hybrid models for long-sequence modeling. On the other hand, existing lightweight SoftMax attention computations typically reduce the feature map to a fixed size to decrease the number of sequences, thereby compressing the computational scale. However, the process of determining the feature map reduction ratio is cumbersome, and computational saturation issues still persist. To address this issue, this paper proposes a lightweight SoftMax attention mechanism with adaptive feature map sizes, named Fast Window Attention (FWA), which generates a small number of key sequences (Key and Value) through window aggregation for attention computation. Additionally, it explains the rationale for using ReLU to simulate SoftMax operations in lightweight global attention mechanisms. Finally, the paper designs a global-local feature fusion mechanism and combines it with GhostNet to propose a lightweight hybrid backbone network, LOLViT. Through visual tasks such as classification (ImageNet 1K), detection (COCO 2017), and segmentation (BDD100K), along with extensive ablation studies, it is demonstrated that LOLViT outperforms CNN models of the same level in both inference speed and model accuracy. Notably, the inference speed of LOLViT-X is 5x that of MobileViT-X.
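The sketch below captures the two ideas attributed to FWA: window aggregation to shrink the key/value sequences and a ReLU-normalized kernel in place of SoftMax. The pooling choice, normalization, and shapes are assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def fast_window_attention(x: torch.Tensor, window: int = 4) -> torch.Tensor:
    """Window-aggregated attention in the spirit of FWA (sketch): keys and
    values come from pooled windows (few sequences), queries stay per-pixel,
    and a ReLU-normalized kernel stands in for SoftMax.
    """
    b, c, h, w = x.shape
    q = x.flatten(2).transpose(1, 2)                        # (B, H*W, C)
    kv = F.avg_pool2d(x, window)                            # window aggregation
    kv = kv.flatten(2).transpose(1, 2)                      # (B, h'*w', C)
    attn = F.relu(q @ kv.transpose(1, 2))                   # (B, H*W, h'*w')
    attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-6)   # ReLU "softmax"
    out = attn @ kv                                         # (B, H*W, C)
    return out.transpose(1, 2).reshape(b, c, h, w)

x = torch.randn(1, 32, 16, 16)
print(fast_window_attention(x).shape)  # torch.Size([1, 32, 16, 16])
```

Because the key/value length shrinks with the window size rather than being fixed a priori, the attention cost adapts to the input resolution.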
[252] Construction of Digital Terrain Maps from Multi-view Satellite Imagery using Neural Volume Rendering
Josef X. Biberstein, Guilherme Cavalheiro, Juyeop Han, Sertac Karaman
Main category: cs.CV
TL;DR: The paper introduces Neural Terrain Maps (NTM), a method using neural volume rendering to create digital terrain maps directly from satellite imagery, avoiding the manual preprocessing required by traditional multi-view stereo pipelines.
Details
Motivation: Current methods for producing digital terrain maps (DTMs) are cumbersome and require manual preprocessing. The paper aims to simplify this process by leveraging neural volume rendering.
Method: NTM learns textured DTMs directly from satellite imagery, using only pixel loci without relying on depth or structural priors. It is tested on synthetic and real data from Earth and Mars.
Result: NTM achieves terrain prediction precision nearly matching satellite image resolution, even with imperfect camera parameters.
Conclusion: The method shows promise for automating DTM production, reducing reliance on manual preprocessing in planetary exploration.
Abstract: Digital terrain maps (DTMs) are an important part of planetary exploration, enabling operations such as terrain relative navigation during entry, descent, and landing for spacecraft and aiding in navigation on the ground. As robotic exploration missions become more ambitious, the need for high quality DTMs will only increase. However, producing DTMs via multi-view stereo pipelines for satellite imagery, the current state-of-the-art, can be cumbersome and require significant manual image preprocessing to produce satisfactory results. In this work, we seek to address these shortcomings by adapting neural volume rendering techniques to learn textured digital terrain maps directly from satellite imagery. Our method, neural terrain maps (NTM), only requires the locus for each image pixel and does not rely on depth or any other structural priors. We demonstrate our method on both synthetic and real satellite data from Earth and Mars encompassing scenes on the order of $100 \textrm{km}^2$. We evaluate the accuracy of our output terrain maps by comparing with existing high-quality DTMs produced using traditional multi-view stereo pipelines. Our method shows promising results, with the precision of terrain prediction almost equal to the resolution of the satellite images even in the presence of imperfect camera intrinsics and extrinsics.
[253] Video-based Vehicle Surveillance in the Wild: License Plate, Make, and Model Recognition with Self Reflective Vision-Language Models
Pouya Parsa, Keya Li, Kara M. Kockelman, Seongjin Choi
Main category: cs.CV
TL;DR: VLMs enable scalable ALPR and vehicle make/model recognition from handheld or non-static videos, achieving high accuracy with self-reflection improving results.
Details
Motivation: To address challenges in ALPR and vehicle recognition under dynamic conditions (e.g., camera motion, occlusions) using cost-effective VLMs instead of traditional hardware-dependent methods.
Method: Proposes a pipeline for ALPR and make/model recognition using VLMs with multimodal prompts, sharp frame filtering, and a self-reflection module for correction.
Result: Achieves 91.67% ALPR and 66.67% make/model accuracy on a smartphone dataset, with self-reflection improving results by 5.72%. Public dataset results are 83.05% and 61.07%.
Conclusion: VLMs offer a scalable, cost-effective solution for dynamic traffic video analysis, outperforming traditional methods.
Abstract: Automatic license plate recognition (ALPR) and vehicle make and model recognition underpin intelligent transportation systems, supporting law enforcement, toll collection, and post-incident investigation. Applying these methods to videos captured by handheld smartphones or non-static vehicle-mounted cameras presents unique challenges compared to fixed installations, including frequent camera motion, varying viewpoints, occlusions, and unknown road geometry. Traditional ALPR solutions, dependent on specialized hardware and handcrafted OCR pipelines, often degrade under these conditions. Recent advances in large vision-language models (VLMs) enable direct recognition of textual and semantic attributes from arbitrary imagery. This study evaluates the potential of VLMs for ALPR and make and model recognition using monocular videos captured with handheld smartphones and non-static mounted cameras. The proposed license plate recognition pipeline filters for sharp frames, then sends a multimodal prompt to a VLM using several prompt strategies. The make and model recognition pipeline runs the same VLM with a revised prompt and an optional self-reflection module. In the self-reflection module, the model contrasts the query image with a reference from a 134-class dataset, correcting mismatches. Experiments on a smartphone dataset collected on the campus of the University of Texas at Austin achieve top-1 accuracies of 91.67% for ALPR and 66.67% for make and model recognition. On the public UFPR-ALPR dataset, the approach attains 83.05% and 61.07%, respectively. The self-reflection module further improves results by 5.72% on average for make and model recognition. These findings demonstrate that VLMs provide a cost-effective solution for scalable, in-motion traffic video analysis.
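The frame-filtering step can be realized with a standard blur heuristic; the sketch below scores frames by the variance of a discrete Laplacian and keeps the sharpest fraction. The paper only states that sharp frames are retained, so this particular scoring function and keep ratio are assumptions.

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Sharpness score: variance of a 4-neighbor discrete Laplacian response.
    Blurry frames have weak high-frequency content, hence low variance."""
    lap = (-4 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return float(lap.var())

def select_sharp_frames(frames, keep_ratio: float = 0.2) -> np.ndarray:
    """Keep the sharpest fraction of frames before prompting the VLM."""
    scores = np.array([laplacian_variance(f) for f in frames])
    k = max(1, int(len(frames) * keep_ratio))
    return np.argsort(scores)[::-1][:k]  # indices of the sharpest frames

rng = np.random.default_rng(0)
frames = [rng.random((64, 64)) for _ in range(10)]
print(select_sharp_frames(frames))
```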
[254] Open-Attribute Recognition for Person Retrieval: Finding People Through Distinctive and Novel Attributes
Minjeong Park, Hongbeen Park, Sangwon Lee, Yoonha Jang, Jinkyu Kim
Main category: cs.CV
TL;DR: The paper introduces Open-Attribute Recognition for Person Retrieval (OAPR) to handle novel attributes in real-world scenarios, proposing a framework for generalizable body part representations and reconstructing datasets for validation.
Details
Motivation: Existing PAR methods assume closed-set attributes, limiting real-world applicability where novel attributes emerge. Predefined attributes are also less discriminative for person retrieval.
Method: Proposes OAPR task and a framework for learning generalizable body part representations. Reconstructs four datasets for open-attribute recognition.
Result: Experiments validate the necessity of OAPR and the framework’s effectiveness.
Conclusion: The OAPR task addresses real-world limitations, and the proposed framework shows promise for open-attribute recognition.
Abstract: Pedestrian Attribute Recognition (PAR) plays a crucial role in various vision tasks such as person retrieval and identification. Most existing attribute-based retrieval methods operate under the closed-set assumption that all attribute classes are consistently available during both training and inference. However, this assumption limits their applicability in real-world scenarios where novel attributes may emerge. Moreover, predefined attributes in benchmark datasets are often generic and shared across individuals, making them less discriminative for retrieving the target person. To address these challenges, we propose the Open-Attribute Recognition for Person Retrieval (OAPR) task, which aims to retrieve individuals based on attribute cues, regardless of whether those attributes were seen during training. To support this task, we introduce a novel framework designed to learn generalizable body part representations that cover a broad range of attribute categories. Furthermore, we reconstruct four widely used datasets for open-attribute recognition. Comprehensive experiments on these datasets demonstrate the necessity of the OAPR task and the effectiveness of our framework. The source code and pre-trained models will be publicly available upon publication.
[255] Spatial-Frequency Aware for Object Detection in RAW Image
Zhuohua Ye, Liming Zhang, Hongru Han
Main category: cs.CV
TL;DR: The paper proposes SFAE, a framework combining spatial and frequency domains to enhance RAW image object detection by recovering suppressed details.
Details
Motivation: Existing methods struggle with RAW data's wide dynamic range and linear response, which suppress object details. Spatial domain enhancements alone are insufficient.
Method: SFAE transforms frequency bands into spatial maps, uses cross-domain fusion attention for interaction, and applies adaptive nonlinear adjustments.
Result: The framework effectively recovers object details in RAW images by leveraging frequency domain features.
Conclusion: SFAE outperforms spatial-only methods by integrating spatial and frequency representations for better object detection in RAW data.
Abstract: Direct RAW-based object detection offers great promise by utilizing RAW data (unprocessed sensor data), but faces inherent challenges due to its wide dynamic range and linear response, which tends to suppress crucial object details. In particular, existing enhancement methods are almost all performed in the spatial domain, making it difficult to effectively recover these suppressed details from the skewed pixel distribution of RAW images. To address this limitation, we turn to the frequency domain, where features, such as object contours and textures, can be naturally separated based on frequency. In this paper, we propose Space-Frequency Aware RAW Image Object Detection Enhancer (SFAE), a novel framework that synergizes spatial and frequency representations. Our contribution is threefold. The first lies in the "spatialization" of frequency bands. Different from the traditional paradigm of directly manipulating abstract spectra in deep networks, our method inversely transforms individual frequency bands back into tangible spatial maps, thus preserving direct physical intuition. Then the cross-domain fusion attention module is developed to enable deep multimodal interactions between these maps and the original spatial features. Finally, the framework performs adaptive nonlinear adjustments by predicting and applying different gamma parameters for the two domains.
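The "spatialization" step can be pictured as masking the spectrum into bands and inverse-transforming each band into its own spatial map; the sketch below does exactly that with radial FFT masks. The band edges and the three-band split are illustrative choices, not the paper's configuration.

```python
import numpy as np

def spatialize_frequency_bands(img: np.ndarray, cuts=(0.1, 0.3)) -> np.ndarray:
    """Inverse-transform individual frequency bands back into spatial maps,
    mirroring SFAE's "spatialization" idea. `cuts` gives radial band edges
    as fractions of the Nyquist radius (illustrative values).
    """
    h, w = img.shape
    spec = np.fft.fftshift(np.fft.fft2(img))
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)  # normalized radius
    edges = (0.0, *cuts, np.inf)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (r >= lo) & (r < hi)
        band = np.fft.ifft2(np.fft.ifftshift(spec * mask)).real
        bands.append(band)               # tangible low/mid/high spatial maps
    return np.stack(bands)               # (3, H, W), ready for fusion

img = np.random.rand(64, 64)
print(spatialize_frequency_bands(img).shape)  # (3, 64, 64)
```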
[256] ForenX: Towards Explainable AI-Generated Image Detection with Multimodal Large Language Models
Chuangchuang Tan, Jinglu Wang, Xiang Ming, Renshuai Tao, Yunchao Wei, Yao Zhao, Yan Lu
Main category: cs.CV
TL;DR: ForenX is a novel method using multimodal large language models (MLLMs) to detect AI-generated images and provide human-like explanations, enhanced by a specialized forensic prompt and the ForgReason dataset.
Details
Motivation: The gap between AI-based detection methods and human cognitive forensic analysis for AI-generated images motivates the development of ForenX.
Method: ForenX employs MLLMs with a forensic prompt to focus on forgery-indicative attributes and uses the ForgReason dataset, curated via LLM-human collaboration.
Result: ForenX improves generalization and explanation quality, verified on benchmarks and subjective evaluations.
Conclusion: ForenX bridges the gap between AI and human forensic analysis, offering accurate and explainable detection of AI-generated images.
Abstract: Advances in generative models have led to AI-generated images visually indistinguishable from authentic ones. Despite numerous studies on detecting AI-generated images with classifiers, a gap persists between such methods and human cognitive forensic analysis. We present ForenX, a novel method that not only identifies the authenticity of images but also provides explanations that resonate with human thoughts. ForenX employs powerful multimodal large language models (MLLMs) to analyze and interpret forensic cues. Furthermore, we overcome the limitations of standard MLLMs in detecting forgeries by incorporating a specialized forensic prompt that directs the MLLMs' attention to forgery-indicative attributes. This approach not only enhances the generalization of forgery detection but also empowers the MLLMs to provide explanations that are accurate, relevant, and comprehensive. Additionally, we introduce ForgReason, a dataset dedicated to descriptions of forgery evidence in AI-generated images. Curated through collaboration between an LLM-based agent and a team of human annotators, the dataset provides refined data that further enhances our model's performance. We demonstrate that even limited manual annotations significantly improve explanation quality. We evaluate the effectiveness of ForenX on two major benchmarks, and the model's explainability is verified by comprehensive subjective evaluations.
[257] 3DRot: 3D Rotation Augmentation for RGB-Based 3D Tasks
Shitian Yang, Deyu Li, Xiaoke Jiang, Lei Zhang
Main category: cs.CV
TL;DR: 3DRot is a plug-and-play augmentation method for RGB-based 3D tasks that preserves geometric consistency by rotating and mirroring images about the camera’s optical center, improving performance in monocular 3D detection.
Details
Motivation: RGB-based 3D tasks face challenges due to scarce, expensive annotations and limited augmentation options that disrupt geometric consistency.
Method: 3DRot rotates and mirrors images about the camera's optical center while updating RGB images, camera intrinsics, object poses, and 3D annotations to maintain projective geometry.
Result: On SUN RGB-D, 3DRot improves IoU3D (43.21 to 44.51), reduces rotation error (22.91° to 20.93°), and boosts mAP0.5 (35.70 to 38.11).
Conclusion: 3DRot is effective for monocular 3D detection and can be easily adapted to other 3D tasks due to its camera-space transform approach.
Abstract: RGB-based 3D tasks, e.g., 3D detection, depth estimation, and 3D keypoint estimation, still suffer from scarce, expensive annotations and a thin augmentation toolbox, since most image transforms, including resize and rotation, disrupt geometric consistency. In this paper, we introduce 3DRot, a plug-and-play augmentation that rotates and mirrors images about the camera's optical center while synchronously updating RGB images, camera intrinsics, object poses, and 3D annotations to preserve projective geometry, achieving geometry-consistent rotations and reflections without relying on any scene depth. We validate 3DRot with a classical 3D task, monocular 3D detection. On the SUN RGB-D dataset, 3DRot raises $IoU_{3D}$ from 43.21 to 44.51, cuts rotation error (ROT) from 22.91$^\circ$ to 20.93$^\circ$, and boosts $mAP_{0.5}$ from 35.70 to 38.11. For comparison, Cube R-CNN, which uses a similar mechanism and the same test dataset but trains on SUN RGB-D together with three additional datasets, increases $IoU_{3D}$ from 36.2 to 37.8 and boosts $mAP_{0.5}$ from 34.7 to 35.4. Because it operates purely through camera-space transforms, 3DRot is readily transferable to other 3D tasks.
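For a pure rotation about the optical center, the image warp is the standard homography $H = K R K^{-1}$ and camera-space annotations rotate rigidly; the sketch below shows these label updates for a roll rotation. The helper name is hypothetical, and the image warp itself (e.g., with an external resampling routine) is omitted.

```python
import numpy as np

def rotate_camera_space(K, R, obj_pose_R, obj_center):
    """3DRot-style update (sketch): for a pure rotation R about the optical
    center, pixels move by the homography H = K R K^{-1}, while camera-space
    annotations rotate rigidly. Only the label updates are shown here.
    """
    H = K @ R @ np.linalg.inv(K)         # pixel-space homography for the warp
    new_center = R @ obj_center          # 3D box center in the rotated camera
    new_pose_R = R @ obj_pose_R          # object orientation follows suit
    return H, new_pose_R, new_center

# toy usage: a 10-degree roll about the optical axis
theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0, 0, 1]])
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
H, pose, center = rotate_camera_space(K, R, np.eye(3), np.array([0.5, 0.2, 3.0]))
print(H.round(2), center.round(3))
```

Because no scene depth enters the computation, the same update applies regardless of scene geometry, which is what makes the augmentation depth-free.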
[258] Capturing More: Learning Multi-Domain Representations for Robust Online Handwriting Verification
Peirong Zhang, Kai Ding, Lianwen Jin
Main category: cs.CV
TL;DR: SPECTRUM is a temporal-frequency synergistic model for online handwriting verification (OHV), combining temporal and frequency features for superior performance.
Details
Motivation: To unlock the untapped potential of multi-domain representation learning in OHV by integrating temporal and frequency features.
Method: SPECTRUM uses a multi-scale interactor, self-gated fusion module, and multi-domain distance-based verifier for temporal-frequency integration.
Result: Outperforms existing OHV methods, demonstrating the effectiveness of multi-domain learning.
Conclusion: Multi-domain learning enhances handwriting verification and opens avenues for future research in multi-domain approaches.
Abstract: In this paper, we propose SPECTRUM, a temporal-frequency synergistic model that unlocks the untapped potential of multi-domain representation learning for online handwriting verification (OHV). SPECTRUM comprises three core components: (1) a multi-scale interactor that finely combines temporal and frequency features through dual-modal sequence interaction and multi-scale aggregation, (2) a self-gated fusion module that dynamically integrates global temporal and frequency features via self-driven balancing. These two components work synergistically to achieve micro-to-macro spectral-temporal integration. (3) A multi-domain distance-based verifier then utilizes both temporal and frequency representations to improve discrimination between genuine and forged handwriting, surpassing conventional temporal-only approaches. Extensive experiments demonstrate SPECTRUM’s superior performance over existing OHV methods, underscoring the effectiveness of temporal-frequency multi-domain learning. Furthermore, we reveal that incorporating multiple handwritten biometrics fundamentally enhances the discriminative power of handwriting representations and facilitates verification. These findings not only validate the efficacy of multi-domain learning in OHV but also pave the way for future research in multi-domain approaches across both feature and biometric domains. Code is publicly available at https://github.com/NiceRingNode/SPECTRUM.
[259] Hyperspectral Image Recovery Constrained by Multi-Granularity Non-Local Self-Similarity Priors
Zhuoran Peng, Yiqing Shen
Main category: cs.CV
TL;DR: The paper proposes a multi-granularity non-local self-similarity prior model for hyperspectral image (HSI) recovery, addressing limitations of fixed-format methods by alternating coarse-grained (Tucker) and fine-grained (FCTN) tensor decompositions.
Details
Motivation: Existing HSI recovery methods use fixed-format non-local priors, limiting adaptability to diverse missing scenarios.
Method: Alternates coarse-grained (Tucker) and fine-grained (FCTN) tensor decompositions to capture global and local details.
Result: The model shows strong applicability and delivers outstanding recovery across missing-data scenarios such as missing pixels and stripes.
Conclusion: The proposed method unifies global, local, and non-local priors, enhancing HSI recovery adaptability and performance.
Abstract: Hyperspectral image (HSI) recovery, as an upstream image processing task, holds significant importance for downstream tasks such as classification, segmentation, and detection. In recent years, HSI recovery methods based on non-local prior representations have demonstrated outstanding performance. However, these methods employ a fixed-format factor to represent the non-local self-similarity tensor groups, making them unable to adapt to diverse missing scenarios. To address this issue, we introduce the concept of granularity in tensor decomposition for the first time and propose an HSI recovery model constrained by multi-granularity non-local self-similarity priors. Specifically, the proposed model alternately performs coarse-grained decomposition and fine-grained decomposition on the non-local self-similarity tensor groups. Among them, the coarse-grained decomposition builds upon Tucker tensor decomposition, which extracts global structural information of the image by performing singular value shrinkage on the mode-unfolded matrices. The fine-grained decomposition employs the FCTN decomposition, capturing local detail information through modeling pairwise correlations among factor tensors. This architectural approach achieves a unified representation of global, local, and non-local priors for HSIs. Experimental results demonstrate that the model has strong applicability and exhibits outstanding recovery performance across various types of missing scenarios, such as missing pixels and stripes.
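The coarse-grained step, singular value shrinkage on mode-unfolded matrices, can be sketched directly; below, each mode unfolding is soft-thresholded and the refolded results are averaged. The averaging and the threshold `tau` are illustrative stand-ins for the paper's actual solver, which also alternates with an FCTN step not shown here.

```python
import numpy as np

def mode_unfold(t: np.ndarray, mode: int) -> np.ndarray:
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def mode_fold(m: np.ndarray, mode: int, shape) -> np.ndarray:
    full = m.reshape((shape[mode],) + tuple(np.delete(shape, mode)))
    return np.moveaxis(full, 0, mode)

def coarse_shrinkage(t: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Coarse-grained step (sketch): soft-threshold the singular values of
    each mode-unfolded matrix and average the refolded results."""
    out = np.zeros_like(t)
    for mode in range(t.ndim):
        u, s, vt = np.linalg.svd(mode_unfold(t, mode), full_matrices=False)
        s = np.maximum(s - tau, 0.0)                 # singular value shrinkage
        out += mode_fold((u * s) @ vt, mode, t.shape)
    return out / t.ndim

group = np.random.rand(8, 8, 6)   # a non-local self-similarity tensor group
print(coarse_shrinkage(group).shape)  # (8, 8, 6)
```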
[260] Uncertainty-Aware Segmentation Quality Prediction via Deep Learning Bayesian Modeling: Comprehensive Evaluation and Interpretation on Skin Cancer and Liver Segmentation
Sikha O K, Meritxell Riera-Marín, Adrian Galdran, Javier García Lopez, Julia Rodríguez-Comas, Gemma Piella, Miguel A. González Ballester
Main category: cs.CV
TL;DR: A novel framework predicts segmentation quality without ground truth annotations, using uncertainty maps and predicted segmentations, achieving high accuracy and robustness across datasets.
Details
Motivation: Clinical settings lack manual annotations for segmentation quality assessment, hindering model adoption.
Method: Proposes two frameworks: one using predicted segmentation and uncertainty maps, another integrating input images, uncertainty, and segmentations. Bayesian adaptations of SwinUNet and Feature Pyramid Network with uncertainty quantification methods (Monte Carlo Dropout, Ensemble, Test Time Augmentation) are evaluated.
Result: Achieves R2 score of 93.25 and Pearson correlation of 96.58 on HAM10000, with cross-modality robustness shown in 3D liver segmentation.
Conclusion: The framework provides reliable segmentation quality assessment without ground truth, with interpretability through Grad-CAM and UMAP analysis.
Abstract: Image segmentation is a critical step in computational biomedical image analysis, typically evaluated using metrics like the Dice coefficient during training and validation. However, in clinical settings without manual annotations, assessing segmentation quality becomes challenging, and models lacking reliability indicators face adoption barriers. To address this gap, we propose a novel framework for predicting segmentation quality without requiring ground truth annotations during test time. Our approach introduces two complementary frameworks: one leveraging predicted segmentation and uncertainty maps, and another integrating the original input image, uncertainty maps, and predicted segmentation maps. We present Bayesian adaptations of two benchmark segmentation models-SwinUNet and Feature Pyramid Network with ResNet50-using Monte Carlo Dropout, Ensemble, and Test Time Augmentation to quantify uncertainty. We evaluate four uncertainty estimates: confidence map, entropy, mutual information, and expected pairwise Kullback-Leibler divergence on 2D skin lesion and 3D liver segmentation datasets, analyzing their correlation with segmentation quality metrics. Our framework achieves an R2 score of 93.25 and Pearson correlation of 96.58 on the HAM10000 dataset, outperforming previous segmentation quality assessment methods. For 3D liver segmentation, Test Time Augmentation with entropy achieves an R2 score of 85.03 and a Pearson correlation of 65.02, demonstrating cross-modality robustness. Additionally, we propose an aggregation strategy that combines multiple uncertainty estimates into a single score per image, offering a more robust and comprehensive assessment of segmentation quality. Finally, we use Grad-CAM and UMAP-based embedding analysis to interpret the model’s behavior and reliability, highlighting the impact of uncertainty integration.
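Two of the uncertainty estimates compared in the paper, predictive entropy and mutual information, follow directly from the stochastic forward passes; a minimal sketch over Monte Carlo Dropout samples is shown below (the epsilon guard and toy shapes are implementation details).

```python
import numpy as np

def mc_uncertainty(probs: np.ndarray):
    """Uncertainty maps from T stochastic passes (e.g., Monte Carlo Dropout).

    probs: (T, C, H, W) softmax outputs. Returns predictive entropy (total
    uncertainty) and mutual information (model uncertainty).
    """
    eps = 1e-12
    mean_p = probs.mean(axis=0)                               # (C, H, W)
    entropy = -(mean_p * np.log(mean_p + eps)).sum(axis=0)    # (H, W)
    exp_entropy = -(probs * np.log(probs + eps)).sum(axis=1).mean(axis=0)
    mutual_info = entropy - exp_entropy                       # (H, W)
    return entropy, mutual_info

# toy usage: 8 stochastic passes over a 2-class, 4x4 "segmentation"
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 2, 4, 4))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
ent, mi = mc_uncertainty(probs)
print(ent.shape, float(mi.mean()))
```

Maps like these (or their per-image aggregates) are the inputs the quality-prediction frameworks regress from.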
[261] Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians
Quankai Gao, Iliyan Georgiev, Tuanfeng Y. Wang, Krishna Kumar Singh, Ulrich Neumann, Jae Shin Yoon
Main category: cs.CV
TL;DR: Can3Tok is the first 3D scene-level VAE that encodes Gaussian primitives into a low-dimensional latent embedding, addressing scale inconsistency and enabling novel scene generation.
Details
Motivation: Existing 3D generation is limited to object-level due to challenges in scaling latent representation learning for unbounded, inconsistent 3D scenes.
Method: Introduces Can3Tok, a 3D scene-level VAE, and a pipeline for processing 3D scene data to handle scale inconsistency.
Result: Validated on DL3DV-10K, Can3Tok generalizes to novel scenes, unlike other methods that fail to converge or generalize.
Conclusion: Can3Tok enables downstream tasks like image-to-3DGS and text-to-3DGS generation, proving its utility for scene-level 3D generation.
Abstract: 3D generation has made significant progress; however, it still largely remains at the object level. Feedforward 3D scene-level generation has been rarely explored due to the lack of models capable of scaling up latent representation learning on 3D scene-level data. Unlike object-level generative models, which are trained on well-labeled 3D data in a bounded canonical space, scene-level generation with 3D scenes represented by 3D Gaussian Splatting (3DGS) is unbounded and exhibits scale inconsistency across different scenes, making unified latent representation learning for generative purposes extremely challenging. In this paper, we introduce Can3Tok, the first 3D scene-level variational autoencoder (VAE) capable of encoding a large number of Gaussian primitives into a low-dimensional latent embedding, which effectively captures both semantic and spatial information of the inputs. Beyond model design, we propose a general pipeline for 3D scene data processing to address the scale inconsistency issue. We validate our method on the recent scene-level 3D dataset DL3DV-10K, where we found that only Can3Tok successfully generalizes to novel 3D scenes, while compared methods fail to converge on even a few hundred scene inputs during training and exhibit zero generalization ability during inference. Finally, we demonstrate image-to-3DGS and text-to-3DGS generation as applications, showing Can3Tok's ability to facilitate downstream generation tasks.
[262] EfficientGFormer: Multimodal Brain Tumor Segmentation via Pruned Graph-Augmented Transformer
Fatemeh Ziaeetabar
Main category: cs.CV
TL;DR: EfficientGFormer integrates pretrained models with graph-based reasoning for efficient 3D brain tumor segmentation, achieving state-of-the-art accuracy with reduced computational cost.
Details
Motivation: Brain tumor segmentation is challenging due to tumor heterogeneity and high computational demands. EfficientGFormer aims to address these issues for clinical viability.
Method: The framework uses nnFormer for encoding MRI volumes into patch-level embeddings, structured into a dual-edge graph. A pruned GAT enables efficient reasoning, and a distillation module ensures compact, real-time deployment.
Result: EfficientGFormer outperforms transformer and graph-based baselines on MSD Task01 and BraTS 2021 datasets, with reduced memory and inference time.
Conclusion: EfficientGFormer provides a scalable, interpretable, and generalizable solution for fast and accurate brain tumor segmentation.
Abstract: Accurate and efficient brain tumor segmentation remains a critical challenge in neuroimaging due to the heterogeneous nature of tumor subregions and the high computational cost of volumetric inference. In this paper, we propose EfficientGFormer, a novel architecture that integrates pretrained foundation models with graph-based reasoning and lightweight efficiency mechanisms for robust 3D brain tumor segmentation. Our framework leverages nnFormer as a modality-aware encoder, transforming multi-modal MRI volumes into patch-level embeddings. These features are structured into a dual-edge graph that captures both spatial adjacency and semantic similarity. A pruned, edge-type-aware Graph Attention Network (GAT) enables efficient relational reasoning across tumor subregions, while a distillation module transfers knowledge from a full-capacity teacher to a compact student model for real-time deployment. Experiments on the MSD Task01 and BraTS 2021 datasets demonstrate that EfficientGFormer achieves state-of-the-art accuracy with significantly reduced memory and inference time, outperforming recent transformer-based and graph-based baselines. This work offers a clinically viable solution for fast and accurate volumetric tumor delineation, combining scalability, interpretability, and generalization.
[263] MiraGe: Multimodal Discriminative Representation Learning for Generalizable AI-Generated Image Detection
Kuo Shi, Jie Lu, Shanshan Ye, Guangquan Zhang, Zhen Fang
Main category: cs.CV
TL;DR: MiraGe is a method for detecting AI-generated images by learning generator-invariant features, improving generalizability across unseen models.
Details
Motivation: Existing detectors struggle with new/unseen generative models due to overlapping features. MiraGe aims to address this by enhancing feature discriminability.
Method: Uses multimodal discriminative representation learning, minimizing intra-class variation and maximizing inter-class separation. Integrates multimodal prompt learning with CLIP for better feature alignment.
Result: Achieves state-of-the-art performance, robust even against unseen generators like Sora.
Conclusion: MiraGe effectively generalizes AI-generated image detection, outperforming existing methods.
Abstract: Recent advances in generative models have highlighted the need for robust detectors capable of distinguishing real images from AI-generated images. While existing methods perform well on known generators, their performance often declines when tested with newly emerging or unseen generative models due to overlapping feature embeddings that hinder accurate cross-generator classification. In this paper, we propose Multimodal Discriminative Representation Learning for Generalizable AI-generated Image Detection (MiraGe), a method designed to learn generator-invariant features. Motivated by theoretical insights on intra-class variation minimization and inter-class separation, MiraGe tightly aligns features within the same class while maximizing separation between classes, enhancing feature discriminability. Moreover, we apply multimodal prompt learning to further refine these principles into CLIP, leveraging text embeddings as semantic anchors for effective discriminative representation learning, thereby improving generalizability. Comprehensive experiments across multiple benchmarks show that MiraGe achieves state-of-the-art performance, maintaining robustness even against unseen generators like Sora.
[264] ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models
Jiaxin Liu, Zhaolu Kang
Main category: cs.CV
TL;DR: ReasonAct improves small models’ video reasoning via a three-stage training process and temporal-aware reinforcement learning, achieving significant accuracy gains on benchmark datasets.
Details
Motivation: Small-scale multimodal models struggle with fine-grained temporal reasoning in video understanding, necessitating an enhanced method.
Method: A three-stage training process: text-only reasoning foundation, video fine-tuning, and temporal-aware reinforcement learning refinement. Incorporates T-GRPO with temporal consistency and sub-action decomposition for graduated rewards.
Result: Achieves 67.2%, 94.1%, and 78.9% accuracy on HMDB51, UCF-101, and Kinetics-400, with improvements of 17.9, 15.8, and 12.3 points over baselines.
Conclusion: Progressive training enables smaller models to achieve competitive video reasoning performance efficiently.
Abstract: While recent multimodal models have shown progress in vision-language tasks, small-scale variants still struggle with the fine-grained temporal reasoning required for video understanding. We introduce ReasonAct, a method that enhances video reasoning in smaller models through a three-stage training process: first building a foundation with text-only reasoning, then fine-tuning on video, and finally refining with temporal-aware reinforcement learning. We build upon Temporal Group Relative Policy Optimization (T-GRPO) by incorporating temporal consistency modeling into policy optimization. We also propose a biomechanically-motivated sub-action decomposition mechanism that provides graduated rewards for constituent action phases. Through experiments on HMDB51, UCF-101, and Kinetics-400, our 3B-parameter model achieves 67.2%, 94.1%, and 78.9% accuracy respectively, demonstrating improvements of 17.9, 15.8, and 12.3 points over baselines. Ablation studies validate that our progressive training methodology enables smaller models to achieve competitive video reasoning performance while maintaining computational efficiency.
[265] MagicVL-2B: Empowering Vision-Language Models on Mobile Devices with Lightweight Visual Encoders via Curriculum Learning
Yi Liu, Xiao Xu, Zeyu Xu, Meng Zhang, Yibo Li, Haoyu Chen, Junkang Zhang, Qiang Wang, Jifa Sun, Siling Lin, Shengxun Cheng, Lingshu Zhang, Kang Wang
Main category: cs.CV
TL;DR: MagicVL-2B is a lightweight Vision-Language Model optimized for smartphones, reducing power consumption by 41.1% while matching state-of-the-art accuracy.
Details
Motivation: Address the computational and storage challenges of deploying VLMs on mobile devices.
Method: Uses a lightweight visual encoder (<100M parameters), dynamic resolution scheme, and multimodal curriculum learning.
Result: Matches state-of-the-art accuracy with 41.1% lower power consumption.
Conclusion: MagicVL-2B is a practical solution for mobile vision-language applications.
Abstract: Vision-Language Models (VLMs) have achieved remarkable breakthroughs in recent years, enabling a diverse array of applications in everyday life. However, the substantial computational and storage demands of VLMs pose significant challenges for their efficient deployment on mobile devices, which represent the most ubiquitous and accessible computing platforms today. In this work, we introduce MagicVL-2B, a novel VLM meticulously optimized for flagship smartphones. MagicVL-2B leverages a lightweight visual encoder with fewer than 100M parameters and features a redesigned dynamic resolution scheme that adaptively generates image tokens without excessive modification of image dimensions. To further enhance the performance of this compact encoder within VLMs, we propose a multimodal curriculum learning strategy that incrementally increases task difficulty and data information density throughout training. This approach substantially improves the model’s performance across a variety of sub-tasks. Extensive evaluations on standard VLM benchmarks demonstrate that MagicVL-2B matches the accuracy of current state-of-the-art models while reducing on-device power consumption by 41.1%. These results establish MagicVL-2B as a practical and robust solution for real-world mobile vision-language applications, enabling advanced multimodal intelligence to run directly on smartphones.
[266] E-VRAG: Enhancing Long Video Understanding with Resource-Efficient Retrieval Augmented Generation
Zeyu Xu, Junkang Zhang, Qiang Wang, Yi Liu
Main category: cs.CV
TL;DR: E-VRAG is a novel video RAG framework that reduces computational costs by 70% while improving accuracy for video understanding tasks, using hierarchical query decomposition, lightweight VLM scoring, and multi-view QA.
Details
Motivation: Existing video RAG methods struggle with balancing retrieval efficiency and accuracy, especially for complex video content, due to computational constraints and limited context windows.
Method: E-VRAG employs frame pre-filtering via hierarchical query decomposition, lightweight VLM scoring, a global statistical distribution-based retrieval strategy, and multi-view QA for retrieved frames.
Result: E-VRAG achieves a 70% reduction in computational cost and higher accuracy on four benchmarks without additional training.
Conclusion: E-VRAG effectively improves efficiency and accuracy in video RAG tasks, demonstrating its potential for scalable video understanding.
Abstract: Vision-Language Models (VLMs) have enabled substantial progress in video understanding by leveraging cross-modal reasoning capabilities. However, their effectiveness is limited by the restricted context window and the high computational cost required to process long videos with thousands of frames. Retrieval-augmented generation (RAG) addresses this challenge by selecting only the most relevant frames as input, thereby reducing the computational burden. Nevertheless, existing video RAG methods struggle to balance retrieval efficiency and accuracy, particularly when handling diverse and complex video content. To address these limitations, we propose E-VRAG, a novel and efficient video RAG framework for video understanding. We first apply a frame pre-filtering method based on hierarchical query decomposition to eliminate irrelevant frames, reducing computational costs at the data level. We then employ a lightweight VLM for frame scoring, further reducing computational costs at the model level. Additionally, we propose a frame retrieval strategy that leverages the global statistical distribution of inter-frame scores to mitigate the potential performance degradation from using a lightweight VLM. Finally, we introduce a multi-view question answering scheme for the retrieved frames, enhancing the VLM’s capability to extract and comprehend information from long video contexts. Experiments on four public benchmarks show that E-VRAG achieves about 70% reduction in computational cost and higher accuracy compared to baseline methods, all without additional training. These results demonstrate the effectiveness of E-VRAG in improving both efficiency and accuracy for video RAG tasks.
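The retrieval strategy is described only as leveraging the global statistical distribution of inter-frame scores; one hypothetical realization is a z-score rule with a frame budget, sketched below. The selection criterion and its parameters are assumptions, not the paper's algorithm.

```python
import numpy as np

def retrieve_frames(scores: np.ndarray, budget: int, z: float = 1.0) -> np.ndarray:
    """Distribution-aware retrieval (sketch): instead of a plain top-k on raw
    scores from the lightweight VLM, keep frames that stand out against the
    global score distribution, then cap at a frame budget.
    """
    mu, sigma = scores.mean(), scores.std() + 1e-8
    standout = np.where((scores - mu) / sigma >= z)[0]
    if len(standout) > budget:                      # keep only the strongest
        standout = standout[np.argsort(scores[standout])[::-1][:budget]]
    return np.sort(standout)

rng = np.random.default_rng(1)
frame_scores = rng.normal(0.5, 0.1, size=200)
frame_scores[[20, 75, 130]] += 0.5                 # a few clearly relevant frames
print(retrieve_frames(frame_scores, budget=8))
```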
[267] A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models
Quan-Sheng Zeng, Yunheng Li, Qilong Wang, Peng-Tao Jiang, Zuxuan Wu, Ming-Ming Cheng, Qibin Hou
Main category: cs.CV
TL;DR: GlimpsePrune dynamically prunes visual tokens in LVLMs, retaining baseline performance while reducing computational cost.
Details
Motivation: Fixed compression ratios in LVLMs often discard informative tokens, degrading performance.
Method: Introduces GlimpsePrune, a dynamic pruning framework inspired by human cognition, which prunes irrelevant tokens in one forward pass.
Result: Prunes 92.6% of tokens while maintaining baseline performance; GlimpsePrune+ achieves 110% performance.
Conclusion: GlimpsePrune offers a more efficient and powerful approach for LVLMs.
Abstract: Visual token compression is critical for Large Vision-Language Models (LVLMs) to efficiently process high-resolution inputs. Existing methods that typically adopt fixed compression ratios cannot adapt to scenes of varying complexity, often causing imprecise pruning that discards informative visual tokens and results in degraded model performance. To address this issue, we introduce a dynamic pruning framework, GlimpsePrune, inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. This approach prunes 92.6% of visual tokens while on average fully retaining the baseline performance on free-form VQA tasks. The reduced computational cost also enables more effective fine-tuning: an enhanced GlimpsePrune+ achieves 110% of the baseline performance while maintaining a similarly high pruning rate. Our work paves a new way for building more powerful and efficient LVLMs.
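A minimal sketch of dynamic, ratio-free token pruning follows: given per-token relevance scores from a learned "glimpse" module (treated here as an input), tokens below a threshold are dropped in one pass, so each sample keeps a different number of tokens. The threshold and shapes are illustrative assumptions.

```python
import torch

def glimpse_prune(tokens: torch.Tensor, relevance: torch.Tensor,
                  keep_thresh: float = 0.5):
    """Dynamic token pruning (sketch): drop visual tokens whose relevance to
    the query falls below a threshold, with no fixed compression ratio.
    """
    keep = relevance >= keep_thresh            # (B, N) boolean mask
    # variable-length result per sample, so return a list of kept tokens
    return [tokens[b][keep[b]] for b in range(tokens.shape[0])]

tokens = torch.randn(2, 10, 16)                # 10 visual tokens of dim 16
relevance = torch.rand(2, 10)                  # e.g., sigmoid of glimpse logits
pruned = glimpse_prune(tokens, relevance)
print([p.shape[0] for p in pruned])            # tokens kept per sample varies
```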
[268] Zero-Shot Temporal Interaction Localization for Egocentric Videos
Erhang Zhang, Junyi Ma, Yin-Dong Zheng, Yixuan Zhou, Hesheng Wang
Main category: cs.CV
TL;DR: EgoLoc is a zero-shot temporal interaction localization method for egocentric videos, using self-adaptive sampling and closed-loop feedback to improve accuracy and efficiency.
Details
Motivation: Current methods for temporal action localization rely on annotated data, causing domain bias and inefficiency. Zero-shot approaches using VLMs are coarse-grained and open-loop, limiting performance.
Method: EgoLoc introduces self-adaptive sampling for visual prompts, combines 2D and 3D observations, and uses closed-loop feedback for refinement.
Result: EgoLoc outperforms state-of-the-art baselines in temporal interaction localization on public and new benchmarks.
Conclusion: EgoLoc offers a more accurate and efficient solution for zero-shot TIL in egocentric videos, with open-source code and data provided.
Abstract: Locating human-object interaction (HOI) actions within video serves as the foundation for multiple downstream tasks, such as human behavior analysis and human-robot skill transfer. Current temporal action localization methods typically rely on annotated action and object categories of interactions for optimization, which leads to domain bias and low deployment efficiency. Although some recent works have achieved zero-shot temporal action localization (ZS-TAL) with large vision-language models (VLMs), their coarse-grained estimations and open-loop pipelines hinder further performance improvements for temporal interaction localization (TIL). To address these issues, we propose a novel zero-shot TIL approach dubbed EgoLoc to locate the timings of grasp actions for human-object interaction in egocentric videos. EgoLoc introduces a self-adaptive sampling strategy to generate reasonable visual prompts for VLM reasoning. By absorbing both 2D and 3D observations, it directly samples high-quality initial guesses around the possible contact/separation timestamps of HOI according to 3D hand velocities, leading to high inference accuracy and efficiency. In addition, EgoLoc generates closed-loop feedback from visual and dynamic cues to further refine the localization results. Comprehensive experiments on the publicly available dataset and our newly proposed benchmark demonstrate that EgoLoc achieves better temporal interaction localization for egocentric videos compared to state-of-the-art baselines. We will release our code and relevant data as open-source at https://github.com/IRMVLab/EgoLoc.
[269] EvoVLMA: Evolutionary Vision-Language Model Adaptation
Kun Ding, Ying Wang, Shiming Xiang
Main category: cs.CV
TL;DR: EvoVLMA automates the search for efficient adaptation algorithms for Vision-Language Models (VLMs) using LLM-assisted evolutionary methods, improving performance over manual designs.
Details
Motivation: Existing VLM adaptation methods are manually designed, requiring expertise and time. Automating this process can enhance efficiency and performance.
Method: A two-stage LLM-assisted evolutionary algorithm optimizes feature selection and logits computation, using divide-and-conquer to manage the search space. Low-precision code conversion and web-based execution improve stability.
Result: EvoVLMA improves the APE algorithm by 1.91 points in 8-shot image classification accuracy.
Conclusion: EvoVLMA demonstrates the potential for automating adaptation algorithm optimization in multimodal models, offering a promising alternative to manual design.
Abstract: Pre-trained Vision-Language Models (VLMs) have been exploited in various Computer Vision tasks (e.g., few-shot recognition) via model adaptation, such as prompt tuning and adapters. However, existing adaptation methods are designed by human experts, requiring significant time and experience. Inspired by recent advances in Large Language Model (LLM)-based code generation, we propose an Evolutionary Vision-Language Model Adaptation (EvoVLMA) method to automatically search for training-free, efficient adaptation algorithms for VLMs. We recognize feature selection and logits computation as the key functions in training-free VLM adaptation, and propose a two-stage LLM-assisted evolutionary algorithm for optimizing these parts in a sequential manner, effectively addressing the challenge posed by the expansive search space through a divide-and-conquer strategy. Besides, to enhance the stability and efficiency of the search process, we propose low-precision code conversion, web-based code execution, and process monitoring, leading to a highly effective automatic algorithm design system. Extensive experiments demonstrate that the algorithms found by EvoVLMA can obtain promising results compared to previous manually designed ones. More specifically, in the 8-shot image classification setting, the classical APE algorithm can be improved by 1.91 points in recognition accuracy. This research opens new possibilities for automating the optimization of adaptation algorithms for pre-trained multimodal models. Code is available at: https://github.com/kding1225/EvoVLMA
[270] Adaptive LiDAR Scanning: Harnessing Temporal Cues for Efficient 3D Object Detection via Multi-Modal Fusion
Sara Shoouri, Morteza Tavakoli Taba, Hun-Seok Kim
Main category: cs.CV
TL;DR: A predictive, history-aware adaptive scanning framework reduces LiDAR energy consumption by 65% while maintaining strong 3D object detection performance.
Details
Motivation: Conventional LiDAR sensors ignore temporal continuity, causing redundancy and high power consumption, limiting practicality on resource-constrained platforms.
Method: A lightweight predictor network and differentiable Mask Generator use historical data to identify critical ROIs, concentrating dense LiDAR scanning within them and sampling sparsely elsewhere.
Result: Reduces LiDAR energy consumption by over 65% while matching or outperforming traditional dense scanning methods in 3D object detection.
Conclusion: The proposed adaptive scanning framework efficiently balances energy savings and detection performance, enhancing LiDAR-camera fusion practicality.
Abstract: Multi-sensor fusion using LiDAR and RGB cameras significantly enhances 3D object detection task. However, conventional LiDAR sensors perform dense, stateless scans, ignoring the strong temporal continuity in real-world scenes. This leads to substantial sensing redundancy and excessive power consumption, limiting their practicality on resource-constrained platforms. To address this inefficiency, we propose a predictive, history-aware adaptive scanning framework that anticipates informative regions of interest (ROI) based on past observations. Our approach introduces a lightweight predictor network that distills historical spatial and temporal contexts into refined query embeddings. These embeddings guide a differentiable Mask Generator network, which leverages Gumbel-Softmax sampling to produce binary masks identifying critical ROIs for the upcoming frame. Our method significantly reduces unnecessary data acquisition by concentrating dense LiDAR scanning only within these ROIs and sparsely sampling elsewhere. Experiments on nuScenes and Lyft benchmarks demonstrate that our adaptive scanning strategy reduces LiDAR energy consumption by over 65% while maintaining competitive or even superior 3D object detection performance compared to traditional LiDAR-camera fusion methods with dense LiDAR scanning.
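The differentiable Mask Generator can be sketched with straight-through Gumbel-Softmax, which the abstract names explicitly: per-region logits are sampled into hard scan/skip decisions while remaining trainable end to end. The tiny linear head and region count below are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskGenerator(nn.Module):
    """Differentiable ROI mask (sketch): per-region logits are sampled into
    hard binary decisions with straight-through Gumbel-Softmax, so the scan
    mask stays trainable despite being discrete at the forward pass.
    """
    def __init__(self, embed_dim: int = 32):
        super().__init__()
        self.head = nn.Linear(embed_dim, 2)            # logits for {skip, scan}

    def forward(self, region_embed: torch.Tensor, tau: float = 1.0):
        logits = self.head(region_embed)                   # (B, R, 2)
        y = F.gumbel_softmax(logits, tau=tau, hard=True)   # one-hot per region
        return y[..., 1]                                   # 1 = dense-scan ROI

gen = MaskGenerator()
queries = torch.randn(2, 64, 32)               # refined history-aware queries
mask = gen(queries)
print(mask.shape, mask.sum(dim=1))             # binary ROIs per frame
```

The straight-through estimator passes gradients through the soft sample while the forward pass stays binary, which is what lets the detection loss shape the scanning policy.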
[271] LetheViT: Selective Machine Unlearning for Vision Transformers via Attention-Guided Contrastive Learning
Yujia Tong, Tian Zhang, Jingling Yuan, Yuze Wang, Chuang Hu
Main category: cs.CV
TL;DR: LetheViT, a contrastive unlearning method for Vision Transformers (ViTs), addresses privacy compliance by enabling selective data forgetting while maintaining model performance.
Details
Motivation: Privacy regulations like GDPR and CCPA require models to forget specific user data without compromising overall performance, posing a challenge for ViTs.
Method: LetheViT uses masked and original image inputs to generate positive and negative logits, guiding the model to forget specific details while retaining general class outlines.
Result: LetheViT achieves state-of-the-art performance, balancing privacy compliance with model efficacy.
Conclusion: The method effectively addresses random data forgetting in ViTs, ensuring privacy compliance without sacrificing recognition capability.
Abstract: Vision Transformers (ViTs) have revolutionized computer vision tasks with their exceptional performance. However, the introduction of privacy regulations such as GDPR and CCPA has brought new challenges for them. These laws grant users the right to withdraw their data, necessitating not only the deletion of data but also the complete removal of its influence from trained models. Machine unlearning emerges as a critical solution, with exact unlearning being computationally prohibitive and approximate methods offering a more practical approach. This work addresses the particularly challenging scenario of random data forgetting in ViTs, where the model must forget specific samples while retaining others, even within the same class. We first reveal the core characteristics of ViTs through selective masking experiments: when high-attention areas are masked, the model retains its recognition capability but significantly weakens its memorization ability. Based on the above insights, we propose LetheViT, a contrastive unlearning method tailored for ViTs. LetheViT uses masked image inputs to generate positive logits and original image inputs to generate negative logits, guiding the model to forget specific details while retaining the general class outlines. Experimental results demonstrate that LetheViT achieves state-of-the-art performance, effectively balancing privacy compliance with model efficacy.
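One hedged reading of the contrastive unlearning objective: on forget samples, pull the updated model's predictions toward the positive logits from masked inputs (category outline) and away from the negative logits from the original inputs (instance details). The KL-based formulation and weighting below are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def lethe_loss(student_logits, masked_pos_logits, original_neg_logits, beta=0.5):
    """Contrastive unlearning objective (a hedged sketch): attract toward the
    masked-input targets, repel from the original-input targets. Positive and
    negative targets are treated as fixed (detached).
    """
    log_p = F.log_softmax(student_logits, dim=-1)
    p_pos = F.softmax(masked_pos_logits.detach(), dim=-1)
    p_neg = F.softmax(original_neg_logits.detach(), dim=-1)
    attract = F.kl_div(log_p, p_pos, reduction="batchmean")
    repel = F.kl_div(log_p, p_neg, reduction="batchmean")
    return attract - beta * repel          # forget details, keep class outline

# toy usage: positives from a masked view, negatives from the frozen model
student = torch.randn(4, 10, requires_grad=True)
loss = lethe_loss(student, torch.randn(4, 10), torch.randn(4, 10))
loss.backward()
```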
[272] TopoImages: Incorporating Local Topology Encoding into Deep Learning Models for Medical Image Classification
Pengfei Gu, Hongxiao Wang, Yejia Zhang, Huimin Li, Chaoli Wang, Danny Chen
Main category: cs.CV
TL;DR: The paper introduces TopoImages, a method to encode local topological features of image patches using persistent homology, improving deep learning classification accuracy.
Details
Motivation: Existing image processing methods lack sensitivity to topological structures, which are crucial for understanding image content, especially in biomedical contexts.
Method: The approach computes persistence diagrams of image patches, vectorizes them, and arranges them into multi-channel TopoImages. Multi-view TopoImages are generated using various filtration functions and fused with input images for DL-based classification.
Result: Experiments on medical image datasets show improved accuracy over state-of-the-art methods.
Conclusion: TopoImages provides a versatile and effective way to integrate topological information into DL frameworks, enhancing classification performance.
Abstract: Topological structures in image data, such as connected components and loops, play a crucial role in understanding image content (e.g., biomedical objects). Despite remarkable successes of numerous image processing methods that rely on appearance information, these methods often lack sensitivity to topological structures when used in general deep learning (DL) frameworks. In this paper, we introduce a new general approach, called TopoImages (for Topology Images), which computes a new representation of input images by encoding the local topology of patches. In TopoImages, we leverage persistent homology (PH) to encode geometric and topological features inherent in image patches. Our main objective is to capture topological information in local patches of an input image into a vectorized form. Specifically, we first compute persistence diagrams (PDs) of the patches, and then vectorize and arrange these PDs into long vectors for pixels of the patches. The resulting multi-channel image-form representation is called a TopoImage. TopoImages offers a new perspective for data analysis. To garner diverse and significant topological features in image data and ensure a more comprehensive and enriched representation, we further generate multiple TopoImages of the input image using various filtration functions, which we call multi-view TopoImages. The multi-view TopoImages are fused with the input image for DL-based classification, with considerable improvement. Our TopoImages approach is highly versatile and can be seamlessly integrated into common DL frameworks. Experiments on three public medical image classification datasets demonstrate noticeably improved accuracy over state-of-the-art methods.
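The PD-to-vector step can be realized with a persistence image, one common vectorization of persistence diagrams; the sketch below rasterizes (birth, persistence) pairs onto a Gaussian-weighted grid to form a single TopoImage channel. The paper's exact vectorization may differ, so treat this as an illustrative stand-in.

```python
import numpy as np

def persistence_image(pd: np.ndarray, grid: int = 8, sigma: float = 0.1) -> np.ndarray:
    """Vectorize a persistence diagram into a fixed-size channel.

    pd holds (birth, death) pairs in [0, 1]; each point is mapped to
    (birth, persistence) coordinates and splatted as a Gaussian, weighted by
    its persistence so long-lived features dominate.
    """
    birth, pers = pd[:, 0], pd[:, 1] - pd[:, 0]     # birth-persistence coords
    xs = np.linspace(0, 1, grid)
    img = np.zeros((grid, grid))
    for b, p in zip(birth, pers):
        gx = np.exp(-((xs - b) ** 2) / (2 * sigma ** 2))
        gy = np.exp(-((xs - p) ** 2) / (2 * sigma ** 2))
        img += p * np.outer(gy, gx)                 # weight by persistence
    return img / (img.max() + 1e-8)

# toy diagram: one short-lived feature and one prominent loop
pd = np.array([[0.1, 0.2], [0.3, 0.9]])
channel = persistence_image(pd)
print(channel.shape)  # (8, 8) -> one channel of a TopoImage
```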
[273] Harnessing Textual Semantic Priors for Knowledge Transfer and Refinement in CLIP-Driven Continual Learning
Lingfeng He, De Cheng, Huaijie Wang, Nannan Wang
Main category: cs.CV
TL;DR: SECA leverages CLIP’s textual priors for continual learning, addressing stability-plasticity trade-offs via semantic-guided knowledge transfer and visual prototype refinement.
Details
Motivation: To explore CLIP's untapped textual semantic priors for continual learning, mitigating interference from unrelated tasks and bridging the modality gap.Method: Proposes SECA with SG-AKT for semantic-aware knowledge transfer and SE-VPR for refining visual prototypes using textual embeddings.
Result: Demonstrates effectiveness on multiple benchmarks, balancing stability and plasticity.
Conclusion: SECA successfully integrates textual and visual semantics for improved continual learning performance.
Abstract: Continual learning (CL) aims to equip models with the ability to learn from a stream of tasks without forgetting previous knowledge. With the progress of vision-language models like Contrastive Language-Image Pre-training (CLIP), their promise for CL has attracted increasing attention due to their strong generalizability. However, the potential of rich textual semantic priors in CLIP in addressing the stability-plasticity dilemma remains underexplored. During backbone training, most approaches transfer past knowledge without considering semantic relevance, leading to interference from unrelated tasks that disrupt the balance between stability and plasticity. Besides, while text-based classifiers provide strong generalization, they suffer from limited plasticity due to the inherent modality gap in CLIP. Visual classifiers help bridge this gap, but their prototypes lack rich and precise semantics. To address these challenges, we propose Semantic-Enriched Continual Adaptation (SECA), a unified framework that harnesses the anti-forgetting and structured nature of textual priors to guide semantic-aware knowledge transfer in the backbone and reinforce the semantic structure of the visual classifier. Specifically, a Semantic-Guided Adaptive Knowledge Transfer (SG-AKT) module is proposed to assess new images’ relevance to diverse historical visual knowledge via textual cues, and aggregate relevant knowledge in an instance-adaptive manner as distillation signals. Moreover, a Semantic-Enhanced Visual Prototype Refinement (SE-VPR) module is introduced to refine visual prototypes using inter-class semantic relations captured in class-wise textual embeddings. Extensive experiments on multiple benchmarks validate the effectiveness of our approach.
[274] Set Pivot Learning: Redefining Generalized Segmentation with Vision Foundation Models
Xinhui Li, Xinyu He, Qiming Hu, Xiaojie Guo
Main category: cs.CV
TL;DR: The paper introduces Set Pivot Learning (SPL), a new paradigm for domain generalization using Vision Foundation Models (VFMs), focusing on dynamic adaptation and VFM-centric tuning for better real-world alignment.
Details
Motivation: Traditional domain generalization (DG) assumptions are outdated with VFMs. SPL addresses this by redefining domain migration tasks to better fit current needs.Method: Proposes SPL with dynamic adaptation and VFM-centric tuning. Introduces Dynamic Prompt Fine-Tuning, combining a Dynamic Class-aware Prompter and Prompt-guided Feature Focuser.
Result: Experiments show SPL’s superiority over state-of-the-art methods, especially in generalized segmentation.
Conclusion: SPL offers a more adaptive and effective approach to domain generalization with VFMs, outperforming traditional methods.
Abstract: In this paper, we introduce, for the first time, the concept of Set Pivot Learning, a paradigm shift that redefines domain generalization (DG) based on Vision Foundation Models (VFMs). Traditional DG assumes that the target domain is inaccessible during training, but the emergence of VFMs, which are trained on vast and diverse datasets, renders this assumption unclear and obsolete. To address this challenge, we propose Set Pivot Learning (SPL), a new definition of domain migration task based on VFMs, which is more suitable for current research and application requirements. Unlike conventional DG methods, SPL prioritizes adaptive refinement over rigid domain transfer, ensuring continuous alignment with evolving real-world conditions. Specifically, SPL features two key attributes: (i) Dynamic adaptation, transitioning from static domain alignment to flexible, task-driven feature optimization, enabling models to evolve with downstream scenarios; (ii) VFM-centric tuning, leveraging pretrained knowledge as a pivot to hone task-specific representations while preserving cross-domain robustness. Building on SPL, we propose a Dynamic Prompt Fine-Tuning method, which combines a Dynamic Class-aware Prompter with a Prompt-guided Feature Focuser, to elevate VFM performance in targeted scenarios. Extensive experiments on benchmark datasets show the effectiveness of our method, highlighting its superiority over state-of-the-art methods, particularly in generalized segmentation.
[275] A Spatio-temporal Continuous Network for Stochastic 3D Human Motion Prediction
Hua Yu, Yaqing Hou, Xu Gui, Shanshan Feng, Dongsheng Zhou, Qiang Zhang
Main category: cs.CV
TL;DR: The paper proposes STCN, a two-stage method for stochastic and continuous human motion prediction, addressing challenges like mode collapse and temporal dynamics.
Details
Motivation: Existing methods struggle with continuous temporal dynamics and stochastic motion sequences, often overlooking flexibility and suffering from mode collapse.Method: STCN uses a spatio-temporal continuous network for smoother sequences and introduces an anchor set to prevent mode collapse. It models Gaussian mixture distributions and samples multiple sequences per anchor.
Result: STCN achieves competitive performance in diversity and accuracy on Human3.6M and HumanEva-I datasets.
Conclusion: STCN effectively addresses stochastic and continuous human motion prediction, improving flexibility and reducing mode collapse.
Abstract: Stochastic Human Motion Prediction (HMP) has received increasing attention due to its wide applications. Despite the rapid progress in generative fields, existing methods often face challenges in learning continuous temporal dynamics and predicting stochastic motion sequences. They tend to overlook the flexibility inherent in complex human motions and are prone to mode collapse. To alleviate these issues, we propose a novel method called STCN, for stochastic and continuous human motion prediction, which consists of two stages. Specifically, in the first stage, we propose a spatio-temporal continuous network to generate smoother human motion sequences. In addition, the anchor set is innovatively introduced into the stochastic HMP task to prevent mode collapse, which refers to the potential human motion patterns. In the second stage, STCN endeavors to acquire the Gaussian mixture distribution (GMM) of observed motion sequences with the aid of the anchor set. It also focuses on the probability associated with each anchor, and employs the strategy of sampling multiple sequences from each anchor to alleviate intra-class differences in human motions. Experimental results on two widely-used datasets (Human3.6M and HumanEva-I) demonstrate that our model obtains competitive performance on both diversity and accuracy.
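The second stage described above lends itself to a short sketch: sample several latent codes around each anchor of a Gaussian mixture and decode them to motion. The shapes, the mixture parameterization, and the stand-in decoder below are illustrative assumptions, not STCN's exact design.

```python
# Illustrative sketch of anchor-based GMM sampling for stochastic HMP.
import torch

K, D = 16, 64                      # number of anchors, latent dimension
anchors = torch.randn(K, D)        # learned anchor set (assumed trained)
log_pi = torch.zeros(K)            # per-anchor mixture logits
log_sigma = torch.zeros(K, D)      # per-anchor diagonal log std-devs

def sample_motions(decoder, samples_per_anchor: int = 5):
    """Sample latent codes around every anchor and decode them to motion,
    mitigating intra-class differences by drawing several sequences each."""
    probs = torch.softmax(log_pi, dim=0)               # anchor probabilities
    motions = []
    for k in range(K):
        eps = torch.randn(samples_per_anchor, D)
        z = anchors[k] + eps * log_sigma[k].exp()      # N(anchor_k, sigma_k^2)
        motions.append((probs[k].item(), decoder(z)))  # keep anchor weight
    return motions

decoder = lambda z: z @ torch.randn(D, 17 * 3)  # stand-in pose decoder
outs = sample_motions(decoder)
```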
[276] Lifelong Person Re-identification via Privacy-Preserving Data Replay
Mingyu Wang, Haojie Liu, Zhiyong Li, Wei Jiang
Main category: cs.CV
TL;DR: The paper proposes Privacy-Preserving Replay (Pr²R), a method for lifelong person re-identification (LReID) that condenses sequential data into pixel-space replay samples to preserve privacy and improve performance.
Details
Motivation: Existing replay-based LReID methods raise privacy concerns by storing raw samples, while exemplar-free methods suffer from performance degradation due to forgetting past knowledge.Method: Pr²R distills training characteristics of multiple real images into a single condensed sample, undergoing pixel-level changes. It uses a dual-alignment strategy during style replay to mitigate domain shifts and forgetting.
Result: Pr²R achieves 4% and 6% higher accuracy on sequential tasks compared to state-of-the-art and other replay-based methods, respectively.
Conclusion: Pr²R effectively balances privacy preservation and performance in LReID, outperforming existing methods.
Abstract: Lifelong person re-identification (LReID) aims to incrementally accumulate knowledge across a sequence of tasks under domain shifts. Recently, replay-based methods have demonstrated strong effectiveness in LReID by rehearsing past samples stored in an auxiliary memory. However, storing historical exemplars raises concerns over data privacy. To avoid this, exemplar-free approaches attempt to match the distribution of past data without storing raw samples. Despite being privacy-friendly, these methods often suffer from performance degradation due to the forgetting of specific past knowledge representations. To this end, we propose to condense information from sequential data into the pixel space in the replay memory, enabling Privacy-Preserving Replay (Pr²R). More specifically, by distilling the training characteristics of multiple real images into a single image, the condensed samples undergo pixel-level changes. This not only protects the privacy of the original data but also makes the replay samples more representative for sequential tasks. During the style replay phase, we align the current domain to the previous one while simultaneously adapting the replay samples to match the style of the current domain. This dual-alignment strategy effectively mitigates both class-incremental challenges and forgetting caused by domain shifts. Extensive experiments on multiple benchmarks show that the proposed method significantly improves replay effectiveness while preserving data privacy. Specifically, Pr²R achieves 4% and 6% higher accuracy on sequential tasks compared to the current state-of-the-art and other replay-based methods, respectively.
[277] Self-Navigated Residual Mamba for Universal Industrial Anomaly Detection
Hanxi Li, Jingqi Wu, Lin Yuanbo Wu, Mingliang Li, Deyin Liu, Jialie Shen, Chunhua Shen
Main category: cs.CV
TL;DR: SNARM is a novel framework for industrial anomaly detection using self-referential learning and dynamic reference selection, outperforming SOTA methods.
Details
Motivation: To improve anomaly detection by dynamically refining detection using in-image references, reducing reliance on pre-trained features.Method: Computes inter- and intra-residual features, uses a Mamba module with dynamic navigation, and aggregates results via ensemble learning.
Result: Achieves SOTA performance on MVTec AD, MVTec 3D, and VisA benchmarks, with improvements in Image-AUROC, Pixel-AUROC, PRO, and AP.
Conclusion: SNARM effectively enhances anomaly discrimination through self-referential learning and dynamic feature refinement.
Abstract: In this paper, we propose Self-Navigated Residual Mamba (SNARM), a novel framework for universal industrial anomaly detection that leverages “self-referential learning” within test images to enhance anomaly discrimination. Unlike conventional methods that depend solely on pre-trained features from normal training data, SNARM dynamically refines anomaly detection by iteratively comparing test patches against adaptively selected in-image references. Specifically, we first compute the “inter-residual” features by contrasting test image patches with the training feature bank. Patches exhibiting small-norm residuals (indicating high normality) are then utilized as self-generated reference patches to compute “intra-residuals”, amplifying discriminative signals. These inter- and intra-residual features are concatenated and fed into a novel Mamba module with multiple heads, which are dynamically navigated by residual properties to focus on anomalous regions. Finally, AD results are obtained by aggregating the outputs of the self-navigated Mamba in an ensemble learning paradigm. Extensive experiments on MVTec AD, MVTec 3D, and VisA benchmarks demonstrate that SNARM achieves state-of-the-art (SOTA) performance, with notable improvements in all metrics, including Image-AUROC, Pixel-AUROC, PRO, and AP.
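The inter-/intra-residual computation is easy to sketch; the version below assumes pre-extracted patch features and omits the Mamba heads and ensemble step, with tensor shapes chosen for illustration.

```python
# Illustrative sketch of SNARM-style residual features: residuals to a normal
# feature bank pick out the most normal in-image patches, which then serve
# as self-generated references for a second round of residuals.
import torch

def snarm_residuals(patch_feats: torch.Tensor, bank: torch.Tensor, n_ref: int = 8):
    """patch_feats: (N, D) test-image patch features; bank: (M, D) normal bank."""
    # Inter-residuals: difference to the nearest normal feature in the bank.
    d = torch.cdist(patch_feats, bank)                  # (N, M)
    nearest = bank[d.argmin(dim=1)]                     # (N, D)
    inter = patch_feats - nearest
    # Patches with small-norm inter-residuals look normal -> use as references.
    ref_idx = inter.norm(dim=1).topk(n_ref, largest=False).indices
    refs = patch_feats[ref_idx]                         # (n_ref, D)
    # Intra-residuals: difference to the nearest in-image reference patch.
    d_in = torch.cdist(patch_feats, refs)
    intra = patch_feats - refs[d_in.argmin(dim=1)]
    return torch.cat([inter, intra], dim=1)             # fed to the Mamba module

feats = snarm_residuals(torch.randn(196, 256), torch.randn(1000, 256))
```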
[278] DMTrack: Spatio-Temporal Multimodal Tracking via Dual-Adapter
Weihong Li, Shaohua Dong, Haonan Lu, Yanhao Zhang, Heng Fan, Libo Zhang
Main category: cs.CV
TL;DR: DMTrack introduces a dual-adapter architecture (STMA and PMCA) for spatio-temporal multimodal tracking, achieving state-of-the-art performance with minimal trainable parameters.
Details
Motivation: To improve spatio-temporal multimodal tracking by bridging modality gaps and enhancing cross-modality fusion.Method: Uses STMA for self-prompting spatio-temporal features and PMCA for progressive cross-modality prompting with shallow and deep adapters.
Result: Achieves state-of-the-art performance on five benchmarks with only 0.93M trainable parameters.
Conclusion: DMTrack’s dual-adapter design effectively enhances multimodal tracking, offering a lightweight yet powerful solution.
Abstract: In this paper, we explore adapter tuning and introduce a novel dual-adapter architecture for spatio-temporal multimodal tracking, dubbed DMTrack. The key of our DMTrack lies in two simple yet effective modules, including a spatio-temporal modality adapter (STMA) and a progressive modality complementary adapter (PMCA) module. The former, applied to each modality alone, aims to adjust spatio-temporal features extracted from a frozen backbone by self-prompting, which to some extent can bridge the gap between different modalities and thus allows better cross-modality fusion. The latter seeks to facilitate cross-modality prompting progressively with two specially designed pixel-wise shallow and deep adapters. The shallow adapter employs shared parameters between the two modalities, aiming to bridge the information flow between the two modality branches, thereby laying the foundation for following modality fusion, while the deep adapter modulates the preliminarily fused information flow with pixel-wise inner-modal attention and further generates modality-aware prompts through pixel-wise inter-modal attention. With such designs, DMTrack achieves promising spatio-temporal multimodal tracking performance with merely \textbf{0.93M} trainable parameters. Extensive experiments on five benchmarks show that DMTrack achieves state-of-the-art results. Code will be available.
[279] CLIMD: A Curriculum Learning Framework for Imbalanced Multimodal Diagnosis
Kai Han, Chongwen Lyu, Lele Ma, Chengxuan Qian, Siqi Ma, Zheng Pang, Jun Chen, Zhe Liu
Main category: cs.CV
TL;DR: The paper proposes CLIMD, a Curriculum Learning framework for Imbalanced Multimodal Diagnosis, addressing class imbalance in multimodal medical data by focusing on key samples and adapting to complex distributions.
Details
Motivation: Class imbalance in multimodal medical data hinders accurate diagnosis, and existing methods like resampling or loss reweighting often lead to overfitting or underfitting while missing cross-modal interactions.Method: CLIMD uses a multimodal curriculum measurer (intra-modal confidence and inter-modal complementarity) and a class distribution-guided training scheduler to adapt to imbalanced data.
Result: CLIMD outperforms state-of-the-art methods on multiple datasets and handles imbalanced data effectively.
Conclusion: CLIMD is a versatile, plug-and-play framework that improves multimodal disease diagnosis accuracy and is publicly available.
Abstract: Clinicians usually combine information from multiple sources to achieve the most accurate diagnosis, and this has sparked increasing interest in leveraging multimodal deep learning for diagnosis. However, in real clinical scenarios, due to differences in incidence rates, multimodal medical data commonly face the issue of class imbalance, which makes it difficult to adequately learn the features of minority classes. Most existing methods tackle this issue with resampling or loss reweighting, but they are prone to overfitting or underfitting and fail to capture cross-modal interactions. Therefore, we propose a Curriculum Learning framework for Imbalanced Multimodal Diagnosis (CLIMD). Specifically, we first design a multimodal curriculum measurer that combines two indicators, intra-modal confidence and inter-modal complementarity, to enable the model to focus on key samples and gradually adapt to complex category distributions. Additionally, a class distribution-guided training scheduler is introduced, which enables the model to progressively adapt to the imbalanced class distribution during training. Extensive experiments on multiple multimodal medical datasets demonstrate that the proposed method outperforms state-of-the-art approaches across various metrics and excels in handling imbalanced multimodal medical data. Furthermore, as a plug-and-play CL framework, CLIMD can be easily integrated into other models, offering a promising path for improving multimodal disease diagnosis accuracy. Code is publicly available at https://github.com/KHan-UJS/CLIMD.
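A curriculum measurer in this spirit can be sketched in a few lines; the exact scoring functions and scheduler in CLIMD differ, so everything below is an illustrative assumption.

```python
# Illustrative sketch: score samples by intra-modal confidence plus
# inter-modal complementarity, then reveal them easy-to-hard over epochs.
import torch

def curriculum_score(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    """logits_a/logits_b: (N, C) per-modality predictions for N samples."""
    p_a, p_b = logits_a.softmax(-1), logits_b.softmax(-1)
    confidence = 0.5 * (p_a.max(-1).values + p_b.max(-1).values)   # intra-modal
    # Complementarity: modalities that disagree carry non-redundant signal.
    complementarity = 0.5 * (p_a - p_b).abs().sum(-1)              # inter-modal
    return confidence + complementarity        # higher = scheduled earlier

def schedule(scores: torch.Tensor, epoch: int, total_epochs: int) -> torch.Tensor:
    """Simple stand-in scheduler: reveal a growing top fraction by score."""
    frac = min(1.0, 0.3 + 0.7 * epoch / total_epochs)
    k = max(1, int(frac * len(scores)))
    return scores.topk(k).indices              # indices of samples to train on

idx = schedule(curriculum_score(torch.randn(100, 4), torch.randn(100, 4)), 0, 50)
```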
[280] Enhancing Zero-Shot Brain Tumor Subtype Classification via Fine-Grained Patch-Text Alignment
Lubin Gan, Jing Zhang, Linhao Qu, Yijun Wang, Siying Wu, Xiaoyan Sun
Main category: cs.CV
TL;DR: FG-PAN, a zero-shot framework, improves brain tumor subtype classification by refining patch-level features and generating fine-grained text descriptions using LLMs.
Details
Motivation: Challenges in fine-grained brain tumor classification due to subtle morphological variations and limited annotated data, with existing vision-language models lacking precision.Method: FG-PAN uses a local feature refinement module for patch-level visual enhancement and a text description generation module with LLMs for semantic prototypes.
Result: FG-PAN achieves state-of-the-art performance and robust generalization in zero-shot classification on datasets like EBRAINS and TCGA.
Conclusion: FG-PAN effectively addresses fine-grained classification challenges by aligning visual and semantic features, demonstrating superior performance.
Abstract: The fine-grained classification of brain tumor subtypes from histopathological whole slide images is highly challenging due to subtle morphological variations and the scarcity of annotated data. Although vision-language models have enabled promising zero-shot classification, their ability to capture fine-grained pathological features remains limited, resulting in suboptimal subtype discrimination. To address these challenges, we propose the Fine-Grained Patch Alignment Network (FG-PAN), a novel zero-shot framework tailored for digital pathology. FG-PAN consists of two key modules: (1) a local feature refinement module that enhances patch-level visual features by modeling spatial relationships among representative patches, and (2) a fine-grained text description generation module that leverages large language models to produce pathology-aware, class-specific semantic prototypes. By aligning refined visual features with LLM-generated fine-grained descriptions, FG-PAN effectively increases class separability in both visual and semantic spaces. Extensive experiments on multiple public pathology datasets, including EBRAINS and TCGA, demonstrate that FG-PAN achieves state-of-the-art performance and robust generalization in zero-shot brain tumor subtype classification.
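The patch-text alignment at the core of the zero-shot step reduces to cosine similarity between refined patch features and class-wise text prototypes. The sketch below assumes mean-pooled LLM-description embeddings and omits FG-PAN's refinement module; shapes are illustrative.

```python
# Illustrative sketch of zero-shot slide classification via patch-text alignment.
import torch

def zero_shot_scores(patch_feats: torch.Tensor, text_protos: torch.Tensor):
    """patch_feats: (P, D) refined patch features of one slide;
    text_protos: (C, D) pooled embeddings of fine-grained class descriptions."""
    patch_feats = patch_feats / patch_feats.norm(dim=-1, keepdim=True)
    text_protos = text_protos / text_protos.norm(dim=-1, keepdim=True)
    sim = patch_feats @ text_protos.T          # (P, C) patch-to-class similarity
    return sim.mean(dim=0)                     # slide-level score per class

scores = zero_shot_scores(torch.randn(512, 768), torch.randn(5, 768))
pred = scores.argmax().item()                  # predicted tumor subtype
```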
[281] Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning
Yiheng Li, Zichang Tan, Zhen Lei, Xu Zhou, Yang Yang
Main category: cs.CV
TL;DR: The paper proposes Image-Adaptive Prompt Learning (IAPL), a framework for detecting AI-generated images by adapting prompts dynamically to input images, achieving high accuracy on benchmark datasets.
Details
Motivation: Existing methods for detecting AI-generated images struggle with generalization to unseen generators, as they rely on fixed parameters fine-tuned on limited data.Method: IAPL introduces two adaptive modules: Conditional Information Learner for forgery-specific features and Confidence-Driven Adaptive Prediction for optimizing predictions based on single test samples.
Result: IAPL achieves 95.61% and 96.7% mean accuracy on UniversalFakeDetect and GenImage datasets, respectively.
Conclusion: IAPL enhances adaptability to diverse forged images by dynamically adjusting prompts, outperforming existing methods.
Abstract: A major struggle for AI-generated image detection is identifying fake images from unseen generators. Existing cutting-edge methods typically customize pre-trained foundation models to this task via partial-parameter fine-tuning. However, these parameters trained on a narrow range of generators may fail to generalize to unknown sources. In light of this, we propose a novel framework named Image-Adaptive Prompt Learning (IAPL), which enhances flexibility in processing diverse testing images. It consists of two adaptive modules, i.e., the Conditional Information Learner and the Confidence-Driven Adaptive Prediction. The former employs CNN-based feature extractors to learn forgery-specific and image-specific conditions, which are then propagated to learnable tokens via a gated mechanism. The latter optimizes the shallowest learnable tokens based on a single test sample and selects the cropped view with the highest prediction confidence for final detection. These two modules enable the prompts fed into the foundation model to be automatically adjusted based on the input image, rather than being fixed after training, thereby enhancing the model’s adaptability to various forged images. Extensive experiments show that IAPL achieves state-of-the-art performance, with 95.61% and 96.7% mean accuracy on two widely used UniversalFakeDetect and GenImage datasets, respectively.
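The confidence-driven adaptive prediction can be sketched as brief test-time tuning of the shallow tokens followed by best-crop selection. The model interface, the entropy-style objective, and the step count below are assumptions for illustration only.

```python
# Illustrative sketch of IAPL-style confidence-driven adaptive prediction.
import torch

def adaptive_predict(model, crops, tokens: torch.nn.Parameter, steps: int = 3):
    """crops: list of cropped views of one test image; tokens: shallow
    learnable prompts. `model(crop, tokens)` returning (2,) logits is assumed."""
    opt = torch.optim.SGD([tokens], lr=1e-3)
    for _ in range(steps):                      # tune on the single test sample
        logits = torch.stack([model(c, tokens) for c in crops])  # (V, 2)
        probs = logits.softmax(-1)
        loss = -(probs * probs.log()).sum(-1).mean()  # entropy -> confidence up
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        logits = torch.stack([model(c, tokens) for c in crops])
        conf, view = logits.softmax(-1).max(-1).values.max(0)    # best crop
    return logits[view].argmax().item()         # real vs. fake decision
```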
[282] From Pixels to Places: A Systematic Benchmark for Evaluating Image Geolocalization Ability in Large Language Models
Lingyao Li, Runlong Yu, Qikai Hu, Bowei Li, Min Deng, Yang Zhou, Xiaowei Jia
Main category: cs.CV
TL;DR: A benchmark (IMAGEO-Bench) evaluates LLMs for image geolocalization, revealing performance disparities, geospatial biases, and key factors like urban settings and landmarks.
Details
Motivation: To assess LLMs' underexplored potential in image geolocalization for applications like crisis response and digital forensics.Method: Introduces IMAGEO-Bench with three datasets, testing 10 LLMs on accuracy, distance error, bias, and reasoning.
Result: Closed-source models outperform open-source ones, with biases favoring high-resource regions. Urban and landmark recognition are critical.
Conclusion: IMAGEO-Bench highlights LLMs’ spatial reasoning gaps and biases, guiding future geolocation-aware AI development.
Abstract: Image geolocalization, the task of identifying the geographic location depicted in an image, is important for applications in crisis response, digital forensics, and location-based intelligence. While recent advances in large language models (LLMs) offer new opportunities for visual reasoning, their ability to perform image geolocalization remains underexplored. In this study, we introduce a benchmark called IMAGEO-Bench that systematically evaluates accuracy, distance error, geospatial bias, and reasoning process. Our benchmark includes three diverse datasets covering global street scenes, points of interest (POIs) in the United States, and a private collection of unseen images. Through experiments on 10 state-of-the-art LLMs, including both open- and closed-source models, we reveal clear performance disparities, with closed-source models generally showing stronger reasoning. Importantly, we uncover geospatial biases as LLMs tend to perform better in high-resource regions (e.g., North America, Western Europe, and California) while exhibiting degraded performance in underrepresented areas. Regression diagnostics demonstrate that successful geolocalization is primarily dependent on recognizing urban settings, outdoor environments, street-level imagery, and identifiable landmarks. Overall, IMAGEO-Bench provides a rigorous lens into the spatial reasoning capabilities of LLMs and offers implications for building geolocation-aware AI systems.
[283] LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding
Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Peijie Qiu, Shao Tang, Xin Li, Yalin Wang
Main category: cs.CV
TL;DR: LLaDA-MedV is a new biomedical vision-language model using masked diffusion, outperforming LLaVA-Med and LLaDA-V in visual conversation and VQA tasks.
Details
Motivation: To explore masked diffusion models in biomedical VLMs, where autoregressive models dominate.Method: Introduces LLaDA-MedV, a large language diffusion model for biomedical image understanding via vision instruction tuning.
Result: Achieves 7.855% and 1.867% gains over LLaVA-Med and LLaDA-V, respectively, and sets new SOTA on VQA benchmarks.
Conclusion: LLaDA-MedV demonstrates superior performance and informative outputs, with insights on training and inference strategies.
Abstract: Autoregressive models (ARMs) have long dominated the landscape of biomedical vision-language models (VLMs). Recently, masked diffusion models such as LLaDA have emerged as promising alternatives, yet their application in the biomedical domain remains largely underexplored. To bridge this gap, we introduce \textbf{LLaDA-MedV}, the first large language diffusion model tailored for biomedical image understanding through vision instruction tuning. LLaDA-MedV achieves relative performance gains of 7.855% over LLaVA-Med and 1.867% over LLaDA-V in the open-ended biomedical visual conversation task, and sets new state-of-the-art accuracy on the closed-form subset of three VQA benchmarks: 84.93% on VQA-RAD, 92.31% on SLAKE, and 95.15% on PathVQA. Furthermore, a detailed comparison with LLaVA-Med suggests that LLaDA-MedV is capable of generating reasonably longer responses by explicitly controlling response length, which can lead to more informative outputs. We also conduct an in-depth analysis of both the training and inference stages, highlighting the critical roles of initialization weight selection, fine-tuning strategies, and the interplay between sampling steps and response repetition. The code and model weights are released at https://github.com/LLM-VLM-GSL/LLaDA-MedV.
[284] Glass Surface Segmentation with an RGB-D Camera via Weighted Feature Fusion for Service Robots
Henghong Lin, Zihan Zhu, Tao Wang, Anastasia Ioannou, Yuanshui Huang
Main category: cs.CV
TL;DR: Proposes a Weighted Feature Fusion (WFF) module for RGB-D glass surface segmentation, introduces the MJU-Glass dataset, and shows improved accuracy and robustness.
Details
Motivation: Addresses challenges in glass surface segmentation like transparency, reflections, and occlusions by effectively fusing RGB and depth data.Method: Introduces a WFF module to dynamically combine RGB and depth features, compatible with various deep neural networks. Also presents the MJU-Glass dataset for benchmarking.
Result: Achieves a 7.49% improvement in boundary IoU (bIoU) when integrated with PSPNet, enhancing segmentation accuracy and robustness.
Conclusion: The WFF module and MJU-Glass dataset offer a robust solution for glass surface segmentation, reducing collision risks in robotics.
Abstract: We address the problem of glass surface segmentation with an RGB-D camera, with a focus on effectively fusing RGB and depth information. To this end, we propose a Weighted Feature Fusion (WFF) module that dynamically and adaptively combines RGB and depth features to tackle issues such as transparency, reflections, and occlusions. This module can be seamlessly integrated with various deep neural network backbones as a plug-and-play solution. Additionally, we introduce the MJU-Glass dataset, a comprehensive RGB-D dataset collected by a service robot navigating real-world environments, providing a valuable benchmark for evaluating segmentation models. Experimental results show significant improvements in segmentation accuracy and robustness, with the WFF module enhancing performance in both mean Intersection over Union (mIoU) and boundary IoU (bIoU), achieving a 7.49% improvement in bIoU when integrated with PSPNet. The proposed module and dataset provide a robust framework for advancing glass surface segmentation in robotics and reducing the risk of collisions with glass objects.
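A weighted fusion block consistent with this description is short to sketch: a learned, spatially varying gate decides per pixel how much to trust RGB versus depth features. Channel sizes and the gating form below are assumptions; the paper's module may differ in detail.

```python
# Illustrative sketch of a Weighted Feature Fusion (WFF) block.
import torch
import torch.nn as nn

class WFF(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Predict a per-pixel fusion weight from the concatenated features.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor) -> torch.Tensor:
        w = self.gate(torch.cat([f_rgb, f_depth], dim=1))   # (B, C, H, W) in [0,1]
        return w * f_rgb + (1.0 - w) * f_depth              # adaptive blend

fused = WFF(64)(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```

Because the block keeps the input feature shape, it can be dropped between any backbone stage and its successor, which is what makes it plug-and-play.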
[285] CSLRConformer: A Data-Centric Conformer Approach for Continuous Arabic Sign Language Recognition on the Isharah Dataset
Fatimah Mohamed Emad Elden
Main category: cs.CV
TL;DR: The paper proposes a data-centric approach for Continuous Sign Language Recognition (CSLR), introducing feature engineering, preprocessing, and the CSLRConformer model, achieving competitive results.
Details
Motivation: Addressing the challenge of signer-independent recognition to improve CSLR generalization across diverse signers.Method: Systematic feature engineering, preprocessing with DBSCAN-based outlier filtering, and the novel CSLRConformer architecture (hybrid CNN-Transformer).
Result: Achieved a Word Error Rate (WER) of 5.60% (development) and 12.01% (test), securing 3rd place in the competition.
Conclusion: Validates cross-domain adaptation of the Conformer model, setting a new state-of-the-art for keypoint-based CSLR.
Abstract: The field of Continuous Sign Language Recognition (CSLR) poses substantial technical challenges, including fluid inter-sign transitions, the absence of temporal boundaries, and co-articulation effects. This paper, developed for the MSLR 2025 Workshop Challenge at ICCV 2025, addresses the critical challenge of signer-independent recognition to advance the generalization capabilities of CSLR systems across diverse signers. A data-centric methodology is proposed, centered on systematic feature engineering, a robust preprocessing pipeline, and an optimized model architecture. Key contributions include a principled feature selection process guided by Exploratory Data Analysis (EDA) to isolate communicative keypoints, a rigorous preprocessing pipeline incorporating DBSCAN-based outlier filtering and spatial normalization, and the novel CSLRConformer architecture. This architecture adapts the hybrid CNN-Transformer design of the Conformer model, leveraging its capacity to model local temporal dependencies and global sequence context; a characteristic uniquely suited for the spatio-temporal dynamics of sign language. The proposed methodology achieved a competitive performance, with a Word Error Rate (WER) of 5.60% on the development set and 12.01% on the test set, a result that secured a 3rd place ranking on the official competition platform. This research validates the efficacy of cross-domain architectural adaptation, demonstrating that the Conformer model, originally conceived for speech recognition, can be successfully repurposed to establish a new state-of-the-art performance in keypoint-based CSLR.
[286] Minimal High-Resolution Patches Are Sufficient for Whole Slide Image Representation via Cascaded Dual-Scale Reconstruction
Yujian Liu, Yuechuan Lin, Dongxu Shen, Haoran Li, Yutong Wang, Xiaoli Liu, Shidang Xu
Main category: cs.CV
TL;DR: The paper proposes a Cascaded Dual-Scale Reconstruction (CDSR) framework for efficient and accurate whole-slide image (WSI) analysis, using only 9 high-resolution patches per WSI. It outperforms existing methods with fewer patches.
Details
Motivation: Current MIL approaches overlook feature extractor impact, leading to domain gaps and suboptimal representations. SSL methods split WSIs into small patches, degrading performance and increasing costs.Method: CDSR uses a two-stage selective sampling strategy and a Local-to-Global Network to reconstruct high-resolution WSI representations, integrating local detail with global context.
Result: CDSR improves accuracy by 6.3% and AUC by 5.5% on classification tasks, using only 7,070 patches per dataset (4.5% of total), outperforming methods trained on millions of patches.
Conclusion: CDSR is efficient and morphologically faithful, addressing domain gaps and redundancy in WSI analysis while reducing computational costs.
Abstract: Whole-slide image (WSI) analysis remains challenging due to the gigapixel scale and sparsely distributed diagnostic regions. Multiple Instance Learning (MIL) mitigates this by modeling the WSI as bags of patches for slide-level prediction. However, most MIL approaches emphasize aggregator design while overlooking the feature extractor used in the feature extraction stage, which is often pretrained on natural images. This leads to a domain gap and suboptimal representations. Self-supervised learning (SSL) has shown promise in bridging this domain gap via pretext tasks, but it still primarily builds upon generic backbones, thus requiring WSIs to be split into small patches. This inevitably splits histological structures and generates both redundant and interdependent patches, which in turn degrades aggregator performance and drastically increases training costs. To address this challenge, we propose a Cascaded Dual-Scale Reconstruction (CDSR) framework, demonstrating that only an average of 9 high-resolution patches per WSI are sufficient for robust slide-level representation. CDSR employs a two-stage selective sampling strategy that identifies the most informative representative regions from both model-based and semantic perspectives. These patches are then fed into a Local-to-Global Network, which reconstructs spatially coherent high-resolution WSI representations by integrating fine-grained local detail with global contextual information. Unlike existing dense-sampling or SSL pipelines, CDSR is optimized for efficiency and morphological fidelity. Experiments on Camelyon16, TCGA-NSCLC, and TCGA-RCC demonstrate that CDSR achieves improvements of 6.3% in accuracy and 5.5% in area under ROC curve on downstream classification tasks with only 7,070 (4.5% of total) high-resolution patches per dataset on average, outperforming state-of-the-art methods trained on over 10,000,000 patches.
[287] StrandDesigner: Towards Practical Strand Generation with Sketch Guidance
Na Zhang, Moran Li, Chengming Xu, Han Feng, Xiaobin Hu, Jiangning Zhang, Weijian Cao, Chengjie Wang, Yanwei Fu
Main category: cs.CV
TL;DR: A sketch-based strand generation model for realistic hair, offering precise control and user-friendliness, outperforming existing methods.
Details
Motivation: Existing diffusion models for hairstyle generation lack precision and user-friendliness, prompting the need for a sketch-based approach.Method: The framework uses a learnable strand upsampling strategy and a multi-scale adaptive conditioning mechanism with a transformer and diffusion heads.
Result: Outperforms existing methods in realism and precision on benchmark datasets, with qualitative results confirming effectiveness.
Conclusion: The proposed sketch-based model provides finer control and better results, with code to be released publicly.
Abstract: Realistic hair strand generation is crucial for applications like computer graphics and virtual reality. While diffusion models can generate hairstyles from text or images, these inputs lack precision and user-friendliness. Instead, we propose the first sketch-based strand generation model, which offers finer control while remaining user-friendly. Our framework tackles key challenges, such as modeling complex strand interactions and diverse sketch patterns, through two main innovations: a learnable strand upsampling strategy that encodes 3D strands into multi-scale latent spaces, and a multi-scale adaptive conditioning mechanism using a transformer with diffusion heads to ensure consistency across granularity levels. Experiments on several benchmark datasets show our method outperforms existing approaches in realism and precision. Qualitative results further confirm its effectiveness. Code will be released at GitHub.
[288] DAG: Unleash the Potential of Diffusion Model for Open-Vocabulary 3D Affordance Grounding
Hanqing Wang, Zhenhao Zhang, Kaiyang Ji, Mingyu Liu, Wenti Yin, Yuchao Chen, Zhirui Liu, Xiangyu Zeng, Tianxiang Gui, Hangxing Zhang
Main category: cs.CV
TL;DR: The paper introduces DAG, a diffusion-based framework for 3D affordance grounding, leveraging text-to-image diffusion models to improve generalization and performance.
Details
Motivation: Existing methods for 3D affordance grounding fail to generalize well due to limited affordance knowledge in demonstration images.Method: Proposes DAG, using frozen representations from text-to-image diffusion models, an affordance block, and a multi-source affordance decoder for 3D dense prediction.
Result: DAG outperforms established methods and shows strong open-world generalization.
Conclusion: The approach successfully unlocks affordance knowledge in diffusion models, enhancing 3D affordance grounding performance.
Abstract: 3D object affordance grounding aims to predict the touchable regions on a 3D object, which is crucial for human-object interaction, human-robot interaction, embodied perception, and robot learning. Recent advances tackle this problem via learning from demonstration images. However, these methods fail to capture the general affordance knowledge within the image, leading to poor generalization. To address this issue, we propose to use text-to-image diffusion models to extract the general affordance knowledge because we find that such models can generate semantically valid HOI images, which demonstrate that their internal representation space is highly correlated with real-world affordance concepts. Specifically, we introduce DAG, a diffusion-based 3D affordance grounding framework, which leverages the frozen internal representations of the text-to-image diffusion model and unlocks affordance knowledge within the diffusion model to perform 3D affordance grounding. We further introduce an affordance block and a multi-source affordance decoder to enable dense 3D affordance prediction. Extensive experimental evaluations show that our model excels over well-established methods and exhibits open-world generalization.
[289] MAP: Mitigating Hallucinations in Large Vision-Language Models with Map-Level Attention Processing
Chenxi Li, Yichen Guo, Benfang Qian, Jinhao You, Kai Tang, Yaosong Du, Zonghao Zhang, Xiande Huang
Main category: cs.CV
TL;DR: The paper introduces Map-Level Attention Processing (MAP), a training-free decoding method to reduce hallucinations in Large Vision-Language Models by leveraging a 2D semantic map of hidden states.
Details
Motivation: LVLMs often generate grammatically accurate but factually inconsistent content (hallucinations). Existing methods focus on localized regions, missing distributed factual information.Method: Proposes MAP, using Layer-Wise Criss-Cross Attention and Global-Local Logit Fusion to refine token representations and predictions by aggregating inter- and intra-layer information.
Result: MAP improves factual consistency and performance across benchmarks like POPE, MME, and MMHal-Bench.
Conclusion: The map-level decoding strategy effectively mitigates hallucinations in LVLMs, demonstrating its potential for improving model truthfulness.
Abstract: Large Vision-Language Models (LVLMs) have achieved impressive performance in multimodal tasks, but they still suffer from hallucinations, i.e., generating content that is grammatically accurate but inconsistent with visual inputs. In this work, we introduce a novel map-level perspective to mitigate hallucinations in LVLMs, interpreting the hidden states of the model as a 2D semantic map. We observe that factual information is widely distributed across this map, extending beyond the localized inter- or intra-layer regions targeted by most existing methods (e.g., contrastive decoding and layer-wise consistency). Building on this insight, we propose Map-Level Attention Processing (MAP), a training-free decoding method that effectively leverages factual information through attention-based map-level operations to improve factual consistency. Specifically, we employ Layer-Wise Criss-Cross Attention to progressively refine token representations at each decoding layer by aggregating tokens from both inter- and intra-layer dimensions. Additionally, a Global-Local Logit Fusion mechanism combines logits obtained before and after global attention to further refine predictions and improve accuracy. Our method consistently improves the truthfulness and performance of LVLMs across benchmarks, such as POPE, MME, and MMHal-Bench, demonstrating the potential of the map-level decoding strategy.
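The Global-Local Logit Fusion step admits a compact sketch: combine the next-token logits computed before and after the map-level attention pass. The criss-cross aggregation itself is omitted, and the fixed fusion weight below is a simplifying assumption.

```python
# Illustrative sketch of global-local logit fusion for decoding.
import torch

def global_local_fusion(logits_local: torch.Tensor,
                        logits_global: torch.Tensor,
                        alpha: float = 0.5) -> torch.Tensor:
    """logits_*: (V,) next-token logits from before/after map-level attention;
    alpha balances the two views of the semantic map."""
    log_p_local = logits_local.log_softmax(-1)
    log_p_global = logits_global.log_softmax(-1)
    return alpha * log_p_global + (1 - alpha) * log_p_local  # fused prediction

next_token = global_local_fusion(torch.randn(32000), torch.randn(32000)).argmax()
```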
[290] Single Point, Full Mask: Velocity-Guided Level Set Evolution for End-to-End Amodal Segmentation
Zhixuan Li, Yujia Liu, Chen Hui, Weisi Lin
Main category: cs.CV
TL;DR: VELA is a novel method for amodal segmentation using point-based prompts and explicit contour evolution, outperforming existing methods with interpretable geometric modeling.
Details
Motivation: Existing amodal segmentation methods rely on costly strong prompts or lack interpretability, limiting real-world applicability.Method: VELA uses a velocity-driven level-set approach, evolving an initial contour from point prompts via a differentiable network predicting shape-specific motion fields.
Result: VELA outperforms strongly prompted methods on benchmarks like COCOA-cls, D2SA, and KINS, using only single-point prompts.
Conclusion: VELA’s interpretable geometric modeling under weak guidance is effective, with code to be released publicly.
Abstract: Amodal segmentation aims to recover complete object shapes, including occluded regions with no visual appearance, whereas conventional segmentation focuses solely on visible areas. Existing methods typically rely on strong prompts, such as visible masks or bounding boxes, which are costly or impractical to obtain in real-world settings. While recent approaches such as the Segment Anything Model (SAM) support point-based prompts for guidance, they often perform direct mask regression without explicitly modeling shape evolution, limiting generalization in complex occlusion scenarios. Moreover, most existing methods suffer from a black-box nature, lacking geometric interpretability and offering limited insight into how occluded shapes are inferred. To deal with these limitations, we propose VELA, an end-to-end VElocity-driven Level-set Amodal segmentation method that performs explicit contour evolution from point-based prompts. VELA first constructs an initial level set function from image features and the point input, which then progressively evolves into the final amodal mask under the guidance of a shape-specific motion field predicted by a fully differentiable network. This network learns to generate evolution dynamics at each step, enabling geometrically grounded and topologically flexible contour modeling. Extensive experiments on COCOA-cls, D2SA, and KINS benchmarks demonstrate that VELA outperforms existing strongly prompted methods while requiring only a single-point prompt, validating the effectiveness of interpretable geometric modeling under weak guidance. The code will be publicly released.
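One velocity-guided level-set step is easy to write down: the contour is the zero level set of phi, and a predicted speed field advances it via phi &larr; phi - dt * v * |grad phi|. The initialization from a point prompt and the velocity network are stand-ins below, not VELA's actual components.

```python
# Illustrative sketch of velocity-driven level-set contour evolution.
import numpy as np

def level_set_step(phi: np.ndarray, v: np.ndarray, dt: float = 0.1) -> np.ndarray:
    """phi: (H, W) signed level-set function; v: (H, W) predicted speed field."""
    gy, gx = np.gradient(phi)
    grad_norm = np.sqrt(gx ** 2 + gy ** 2) + 1e-8
    return phi - dt * v * grad_norm          # advect the implicit contour

# Evolve a small circle outward under a uniform (stand-in) expansion speed.
yy, xx = np.mgrid[0:64, 0:64]
phi = np.sqrt((xx - 32) ** 2 + (yy - 32) ** 2) - 5.0  # seeded at a point prompt
for _ in range(20):
    phi = level_set_step(phi, v=np.ones_like(phi))
amodal_mask = phi < 0                        # interior of the evolved contour
```

In VELA the uniform speed is replaced by a shape-specific motion field predicted at each step by a differentiable network, which is what makes the evolution trainable end to end.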
[291] Shape Distribution Matters: Shape-specific Mixture-of-Experts for Amodal Segmentation under Diverse Occlusions
Zhixuan Li, Yujia Liu, Chen Hui, Jeonghaeng Lee, Sanghoon Lee, Weisi Lin
Main category: cs.CV
TL;DR: ShapeMoE introduces a shape-specific sparse Mixture-of-Experts framework for amodal segmentation, improving performance by dynamically routing objects to specialized experts based on their shape characteristics.
Details
Motivation: Existing approaches struggle with diverse amodal shapes due to limited representation capacity, leading to mismatched expert routing and insufficient specialization.Method: ShapeMoE learns a latent shape distribution space, encodes objects into Gaussian embeddings, and uses a Shape-Aware Sparse Router to route objects to specialized experts.
Result: ShapeMoE outperforms state-of-the-art methods on COCOA-cls, KINS, and D2SA datasets, particularly in occluded region segmentation.
Conclusion: ShapeMoE provides interpretable, efficient, and high-capacity amodal segmentation by leveraging shape-specific expert routing.
Abstract: Amodal segmentation targets to predict complete object masks, covering both visible and occluded regions. This task poses significant challenges due to complex occlusions and extreme shape variation, from rigid furniture to highly deformable clothing. Existing one-size-fits-all approaches rely on a single model to handle all shape types, struggling to capture and reason about diverse amodal shapes due to limited representation capacity. A natural solution is to adopt a Mixture-of-Experts (MoE) framework, assigning experts to different shape patterns. However, naively applying MoE without considering the object’s underlying shape distribution can lead to mismatched expert routing and insufficient expert specialization, resulting in redundant or underutilized experts. To deal with these issues, we introduce ShapeMoE, a shape-specific sparse Mixture-of-Experts framework for amodal segmentation. The key idea is to learn a latent shape distribution space and dynamically route each object to a lightweight expert tailored to its shape characteristics. Specifically, ShapeMoE encodes each object into a compact Gaussian embedding that captures key shape characteristics. A Shape-Aware Sparse Router then maps the object to the most suitable expert, enabling precise and efficient shape-aware expert routing. Each expert is designed as lightweight and specialized in predicting occluded regions for specific shape patterns. ShapeMoE offers well interpretability via clear shape-to-expert correspondence, while maintaining high capacity and efficiency. Experiments on COCOA-cls, KINS, and D2SA show that ShapeMoE consistently outperforms state-of-the-art methods, especially in occluded region segmentation. The code will be released.
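The shape-aware sparse routing can be sketched as a Gaussian shape embedding followed by top-1 expert selection. The expert architecture, the shape encoder, and all sizes below are illustrative assumptions.

```python
# Illustrative sketch of a shape-aware sparse Mixture-of-Experts router.
import torch
import torch.nn as nn

class ShapeRouter(nn.Module):
    def __init__(self, dim: int, n_experts: int):
        super().__init__()
        self.to_gauss = nn.Linear(dim, 2 * dim)          # predicts (mu, log_var)
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, obj_feat: torch.Tensor) -> torch.Tensor:
        mu, log_var = self.to_gauss(obj_feat).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # shape embedding
        expert_id = self.router(z).argmax(dim=-1)              # top-1, sparse
        out = torch.stack([self.experts[int(e)](f)
                           for e, f in zip(expert_id, obj_feat)])
        return out  # per-object features for occluded-region prediction

out = ShapeRouter(256, n_experts=4)(torch.randn(8, 256))
```

The explicit shape-to-expert assignment is what gives the interpretability the summary mentions: each expert can be inspected against the shape patterns routed to it.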
[292] Rein++: Efficient Generalization and Adaptation for Semantic Segmentation with Vision Foundation Models
Zhixiang Wei, Xiaoxiao Ma, Ruishen Yan, Tao Tu, Huaian Chen, Jinjin Zheng, Yi Jin, Enhong Chen
Main category: cs.CV
TL;DR: Rein++ is an efficient VFM-based segmentation framework addressing data scale disparity and domain shifts in semantic segmentation, combining domain generalization (Rein-G) and adaptation (Rein-A) for superior performance.
Details
Motivation: Overcome challenges in applying Vision Foundation Models (VFMs) to semantic segmentation due to small dataset sizes and domain distribution shifts.Method: Introduces Rein-G (instance-aware tokens for feature refinement) and Rein-A (unsupervised domain adaptation at instance/logit levels with semantic transfer).
Result: Outperforms state-of-the-art methods with efficient training, even for large models.
Conclusion: Rein++ is an efficient, generalizable, and adaptive segmentation solution for VFMs.
Abstract: Vision Foundation Models (VFMs) have achieved remarkable success in various computer vision tasks. However, their application to semantic segmentation is hindered by two significant challenges: (1) the disparity in data scale, as segmentation datasets are typically much smaller than those used for VFM pre-training, and (2) domain distribution shifts, where real-world segmentation scenarios are diverse and often underrepresented during pre-training. To overcome these limitations, we present Rein++, an efficient VFM-based segmentation framework that demonstrates superior generalization from limited data and enables effective adaptation to diverse unlabeled scenarios. Specifically, Rein++ comprises a domain generalization solution Rein-G and a domain adaptation solution Rein-A. Rein-G introduces a set of trainable, instance-aware tokens that effectively refine the VFM’s features for the segmentation task. This parameter-efficient approach fine-tunes less than 1% of the backbone’s parameters, enabling robust generalization. Building on Rein-G, Rein-A performs unsupervised domain adaptation at both the instance and logit levels to mitigate domain shifts. In addition, it incorporates a semantic transfer module that leverages the class-agnostic capabilities of the segment anything model to enhance boundary details in the target domain. The integrated Rein++ pipeline first learns a generalizable model on a source domain (e.g., daytime scenes) and subsequently adapts it to diverse target domains (e.g., nighttime scenes) without any target labels. Comprehensive experiments demonstrate that Rein++ significantly outperforms state-of-the-art methods with efficient training, underscoring its role as an efficient, generalizable, and adaptive segmentation solution for VFMs, even for large models with billions of parameters. The code is available at https://github.com/wloves/Rein.
[293] Benchmarking Adversarial Patch Selection and Location
Shai Kimhi, Avi Mendlson, Moshe Kimhi
Main category: cs.CV
TL;DR: PatchMap is the first benchmark for adversarial patch placement, revealing hot-spots where small patches cause misclassifications. A segmentation-based heuristic improves attack success rates by 8-13%.
Details
Motivation: Adversarial patch attacks undermine vision models, necessitating a systematic study of patch placement vulnerabilities.Method: PatchMap evaluates over 150M forward passes on ImageNet, identifying vulnerable regions. A segmentation-guided heuristic is proposed for patch placement.
Result: PatchMap identifies systematic hot-spots for attacks, and the heuristic boosts success rates by 8-13% across architectures.
Conclusion: PatchMap and the proposed heuristic advance research on location-aware defenses and adaptive attacks, with public release of data and code.
Abstract: Adversarial patch attacks threaten the reliability of modern vision models. We present PatchMap, the first spatially exhaustive benchmark of patch placement, built by evaluating over 1.5e8 forward passes on ImageNet validation images. PatchMap reveals systematic hot-spots where small patches (as little as 2% of the image) induce confident misclassifications and large drops in model confidence. To demonstrate its utility, we propose a simple segmentation guided placement heuristic that leverages off the shelf masks to identify vulnerable regions without any gradient queries. Across five architectures, including an adversarially trained ResNet50, our method boosts attack success rates by 8 to 13 percentage points compared to random or fixed placements. We publicly release PatchMap and the code implementation. The full PatchMap bench (6.5B predictions, multiple backbones) will be released soon to further accelerate research on location-aware defenses and adaptive attacks.
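The exhaustive placement sweep behind such a benchmark is conceptually simple; the sketch below slides a fixed patch over a grid of locations and records where it flips the prediction. The patch contents and stride are assumptions, and the real benchmark additionally logs confidence drops across many backbones.

```python
# Illustrative sketch of a spatial patch-placement sweep.
import torch

@torch.no_grad()
def placement_map(model, img: torch.Tensor, patch: torch.Tensor, stride: int = 8):
    """img: (1, 3, H, W); patch: (3, ph, pw). Returns a grid of attack hits."""
    label = model(img).argmax(-1)
    _, _, H, W = img.shape
    ph, pw = patch.shape[1:]
    hits = []
    for y in range(0, H - ph + 1, stride):
        row = []
        for x in range(0, W - pw + 1, stride):
            adv = img.clone()
            adv[0, :, y:y + ph, x:x + pw] = patch     # paste patch at (y, x)
            row.append(bool(model(adv).argmax(-1) != label))
        hits.append(row)
    return torch.tensor(hits)   # True where this location flips the class
```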
[294] Subject or Style: Adaptive and Training-Free Mixture of LoRAs
Jia-Chen Zhang, Yu-Jie Xiong
Main category: cs.CV
TL;DR: EST-LoRA is a training-free adaptive LoRA fusion method that balances subject and style in generation tasks by considering energy, style discrepancy, and time steps, outperforming existing methods.
Details
Motivation: Current LoRA fusion methods struggle to balance subject and style and often require additional training, while existing training-free methods like K-LoRA involve complex hyperparameters.Method: EST-LoRA adaptively selects between subject and style LoRA in each attention layer using energy, style discrepancy scores, and time steps, similar to Mixture of Experts.
Result: EST-LoRA outperforms state-of-the-art methods in quality and speed, achieving balanced generation without additional training.
Conclusion: EST-LoRA provides an efficient, training-free solution for adaptive LoRA fusion, improving generation balance and speed.
Abstract: Fine-tuning models via Low-Rank Adaptation (LoRA) demonstrates remarkable performance in subject-driven or style-driven generation tasks. Studies have explored combinations of different LoRAs to jointly generate learned styles and content. However, current methods struggle to balance the original subject and style, and often require additional training. Recently, K-LoRA proposed a training-free LoRA fusion method. But it involves multiple hyperparameters, making it difficult to adapt to all styles and subjects. In this paper, we propose EST-LoRA, a training-free adaptive LoRA fusion method. It comprehensively considers three critical factors: \underline{E}nergy of matrix, \underline{S}tyle discrepancy scores and \underline{T}ime steps. Analogous to the Mixture of Experts (MoE) architecture, the model adaptively selects between subject LoRA and style LoRA within each attention layer. This integrated selection mechanism ensures balanced contributions from both components during the generation process. Experimental results show that EST-LoRA outperforms state-of-the-art methods in both qualitative and quantitative evaluations and achieves faster generation speed compared to other efficient fusion approaches. Our code is publicly available at: https://anonymous.4open.science/r/EST-LoRA-F318.
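A per-layer selection rule in the EST spirit can be sketched as a score over the two LoRA branches; the concrete score definitions and the direction of the time-step weighting below are illustrative assumptions, not the paper's formulas.

```python
# Illustrative sketch of adaptive per-layer subject-vs-style LoRA selection.
import torch

def est_select(dW_subject: torch.Tensor, dW_style: torch.Tensor,
               style_disc: float, t: int, T: int) -> str:
    """dW_*: effective LoRA weight deltas (B @ A) for one attention layer;
    style_disc: a style discrepancy score; t/T: current/total diffusion steps."""
    energy_subj = dW_subject.norm(p="fro")   # matrix energy of each branch
    energy_style = dW_style.norm(p="fro")
    w_t = t / T                              # time weight (schedule is assumed)
    score_subj = (1 - w_t) * energy_subj
    score_style = w_t * energy_style + style_disc
    return "style" if score_style > score_subj else "subject"

choice = est_select(torch.randn(64, 64), torch.randn(64, 64),
                    style_disc=0.3, t=700, T=1000)
```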
[295] Cure or Poison? Embedding Instructions Visually Alters Hallucination in Vision-Language Models
Zhaochen Wang, Yiwei Wang, Yujun Cai
Main category: cs.CV
TL;DR: Prompt-in-Image embeds text instructions into images to improve VLM performance, showing mixed results across models.
Details
Motivation: Address hallucination in VLMs by aligning multimodal information through unified visual processing.Method: Embed textual instructions directly into images, forcing models to process all content visually.
Result: Qwen2.5-VL improved (4.1% accuracy boost), while LLaVA-1.5 and InstructBLIP dropped to near-random performance.
Conclusion: Prompt-in-Image reduces modality gap in Qwen, but CLIP-based encoders struggle with embedded text.
Abstract: Vision-Language Models (VLMs) often suffer from hallucination, partly due to challenges in aligning multimodal information. We propose Prompt-in-Image, a simple method that embeds textual instructions directly into images. This removes the need for separate text inputs and forces the model to process all content through the visual channel. We evaluate this method on three popular open-source VLMs: Qwen2.5-VL, LLaVA-1.5, and InstructBLIP. The results reveal sharp differences. Prompt-in-Image improves Qwen2.5-VL’s performance, increasing POPE accuracy by 4.1 percent (from 80.2 percent to 84.3 percent) and also reducing hallucination rates on MS-COCO. In contrast, LLaVA-1.5 and InstructBLIP experience a severe performance drop, with accuracy falling from around 84 percent to near-random levels. Through detailed analysis, we found that CLIP-based encoders in LLaVA and InstructBLIP exhibit excessive attention bias toward embedded text regions, disrupting visual understanding. In contrast, Qwen’s vision encoder handles text-embedded images robustly. Crucially, Prompt-in-Image reduces Qwen’s modality gap, enhancing cross-modal alignment by unifying information processing through a single modality.
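Since the method itself is just rendering the instruction into the image, it fits in a few lines of PIL. The banner placement, font, and sizing below are illustrative choices, not necessarily the paper's.

```python
# Illustrative sketch of Prompt-in-Image: write the instruction onto the image
# so the VLM receives every input through the visual channel alone.
from PIL import Image, ImageDraw

def embed_prompt(img: Image.Image, prompt: str, banner_h: int = 40) -> Image.Image:
    """Return a copy of `img` with the prompt written on a white top banner."""
    w, h = img.size
    out = Image.new("RGB", (w, h + banner_h), "white")
    out.paste(img, (0, banner_h))
    ImageDraw.Draw(out).text((5, 5), prompt, fill="black")  # default font
    return out

composite = embed_prompt(Image.new("RGB", (448, 448)),
                         "Is there a dog in the image? Answer yes or no.")
# `composite` is then fed to the VLM with no separate text input.
```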
[296] DisCo3D: Distilling Multi-View Consistency for 3D Scene Editing
Yufeng Chi, Huimin Ma, Kafeng Wang, Jianmin Li
Main category: cs.CV
TL;DR: DisCo3D is a framework for 3D editing by distilling 3D consistency priors into a 2D editor, achieving stable multi-view consistency and high-quality edits.
Details
Motivation: Extending diffusion models to 3D editing is challenging due to multi-view consistency issues, with existing methods suffering from slow convergence, blurry artifacts, and fine-grained inconsistencies.Method: DisCo3D fine-tunes a 3D generator with multi-view inputs, trains a 2D editor via consistency distillation, and optimizes edited outputs into 3D using Gaussian Splatting.
Result: DisCo3D outperforms state-of-the-art methods in editing quality and maintains stable multi-view consistency.
Conclusion: DisCo3D effectively addresses multi-view consistency challenges in 3D editing, offering improved efficiency and quality.
Abstract: While diffusion models have demonstrated remarkable progress in 2D image generation and editing, extending these capabilities to 3D editing remains challenging, particularly in maintaining multi-view consistency. Classical approaches typically update 3D representations through iterative refinement based on a single editing view. However, these methods often suffer from slow convergence and blurry artifacts caused by cross-view inconsistencies. Recent methods improve efficiency by propagating 2D editing attention features, yet still exhibit fine-grained inconsistencies and failure modes in complex scenes due to insufficient constraints. To address this, we propose \textbf{DisCo3D}, a novel framework that distills 3D consistency priors into a 2D editor. Our method first fine-tunes a 3D generator using multi-view inputs for scene adaptation, then trains a 2D editor through consistency distillation. The edited multi-view outputs are finally optimized into 3D representations via Gaussian Splatting. Experimental results show DisCo3D achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality.
[297] Register Anything: Estimating “Corresponding Prompts” for Segment Anything Model
Shiqi Huang, Tingfa Xu, Wen Yan, Dean Barratt, Yipeng Hu
Main category: cs.CV
TL;DR: The paper introduces PromptReg, a training-free image registration method that simplifies region-based correspondence by directly searching for corresponding prompts using pre-trained segmentation models like SAM.
Details
Motivation: Traditional region-based correspondence requires two steps: segmenting ROIs and matching them. The paper aims to simplify this into one step by leveraging pre-trained models for direct prompt-based correspondence.
Method: The method involves identifying corresponding prompts between images, using an “inverse prompt” solution to invert prompts into the target image’s space, and marginalizing across prompts and spatial dimensions for multiple ROI pairs.
Result: PromptReg outperforms intensity-based iterative algorithms and learning-based networks, achieving competitive performance with weakly-supervised methods, as validated on five diverse applications.
Conclusion: PromptReg offers a simpler, effective, and training-free approach to image registration, demonstrating strong performance across various medical and non-medical datasets.
Abstract: Establishing pixel/voxel-level or region-level correspondences is the core challenge in image registration. The latter, also known as region-based correspondence representation, leverages paired regions of interest (ROIs) to enable regional matching while preserving fine-grained capability at pixel/voxel level. Traditionally, this representation is implemented via two steps: segmenting ROIs in each image then matching them between the two images. In this paper, we simplify this into one step by directly “searching for corresponding prompts”, using extensively pre-trained segmentation models (e.g., SAM) for a training-free registration approach, PromptReg. Firstly, we introduce the “corresponding prompt problem”, which aims to identify a corresponding Prompt Y in Image Y for any given visual Prompt X in Image X, such that the two respectively prompt-conditioned segmentations are a pair of corresponding ROIs from the two images. Secondly, we present an “inverse prompt” solution that generates primary and optionally auxiliary prompts, inverting Prompt X into the prompt space of Image Y. Thirdly, we propose a novel registration algorithm that identifies multiple paired corresponding ROIs by marginalizing the inverted Prompt X across both prompt and spatial dimensions. Comprehensive experiments are conducted on five applications of registering 3D prostate MR, 3D abdomen MR, 3D lung CT, 2D histopathology and, as a non-medical example, 2D aerial images. Based on metrics including Dice and target registration errors on anatomical structures, the proposed registration outperforms both intensity-based iterative algorithms and learning-based DDF-predicting networks, even yielding competitive performance with weakly-supervised approaches that require fully-segmented training data.
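For intuition, here is a minimal sketch of the prompt-conditioned segmentation that the corresponding-prompt search builds on, using the public segment-anything package; the checkpoint path, placeholder image, and click coordinates are illustrative, and the inverse-prompt search itself is only indicated in the closing comment.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # illustrative path
predictor = SamPredictor(sam)

image_x = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder; use a real image
predictor.set_image(image_x)                       # HxWx3 uint8 RGB array
masks_x, _, _ = predictor.predict(
    point_coords=np.array([[210, 145]]),           # Prompt X: a single click
    point_labels=np.array([1]),                    # 1 = foreground point
    multimask_output=False,
)
# An inverse-prompt search would score candidate prompts in Image Y by how
# well their prompt-conditioned masks correspond to masks_x.
```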
[298] Versatile Transition Generation with Image-to-Video Diffusion
Zuhao Yang, Jiahui Zhang, Yingchen Yu, Shijian Lu, Song Bai
Main category: cs.CV
TL;DR: VTG is a framework for generating smooth, high-quality video transitions between two frames using text prompts, addressing challenges like object identity preservation and motion smoothness.
Details
Motivation: The paper aims to solve the underexplored problem of generating smooth and rational transition videos between given frames with text guidance.
Method: VTG uses interpolation-based initialization, dual-directional motion fine-tuning, and representation alignment regularization to enhance pre-trained diffusion models.
Result: VTG outperforms existing methods on TransitBench, a new benchmark for transition tasks like concept blending and scene transition.
Conclusion: VTG is a versatile and effective solution for high-fidelity video transition generation, validated by extensive experiments.
Abstract: Leveraging text, images, structure maps, or motion trajectories as conditional guidance, diffusion models have achieved great success in automated and high-quality video generation. However, generating smooth and rational transition videos given the first and last video frames as well as descriptive text prompts is far underexplored. We present VTG, a Versatile Transition video Generation framework that can generate smooth, high-fidelity, and semantically coherent video transitions. VTG introduces interpolation-based initialization that helps preserve object identity and handle abrupt content changes effectively. In addition, it incorporates dual-directional motion fine-tuning and representation alignment regularization to mitigate the limitations of pre-trained image-to-video diffusion models in motion smoothness and generation fidelity, respectively. To evaluate VTG and facilitate future studies on unified transition generation, we collected TransitBench, a comprehensive benchmark for transition generation covering two representative transition tasks: concept blending and scene transition. Extensive experiments show that VTG achieves superior transition performance consistently across all four tasks.
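As a rough illustration of the interpolation-based initialization idea, the sketch below linearly blends the latents of the first and last frames to seed every intermediate frame before denoising refines them; this is a simplification, and `first_latent`/`last_latent` are assumed inputs, not the paper’s variables.

```python
import numpy as np

def interpolate_init(first_latent: np.ndarray, last_latent: np.ndarray,
                     num_frames: int) -> np.ndarray:
    """Linearly blend the two endpoint latents to initialize every
    intermediate frame; diffusion denoising then refines the sequence."""
    weights = np.linspace(0.0, 1.0, num_frames)  # 0 -> first frame, 1 -> last frame
    return np.stack([(1 - w) * first_latent + w * last_latent for w in weights])
```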
[299] TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding
Zuhao Yang, Yingchen Yu, Yunqing Zhao, Shijian Lu, Song Bai
Main category: cs.CV
TL;DR: TimeExpert is a Mixture-of-Experts-based Video-LLM that dynamically routes task-specific tokens to specialized experts for improved Video Temporal Grounding (VTG) performance.
Details
Motivation: Existing Video-LLMs process all task tokens uniformly, failing to address the distinct needs of temporal localization, saliency assessment, and textual generation in VTG tasks.
Method: Introduces TimeExpert, a MoE-based Video-LLM that dynamically routes task-specific tokens (e.g., timestamps, saliency scores) to specialized experts.
Result: Achieves state-of-the-art performance on VTG tasks like Dense Video Captioning, Moment Retrieval, and Video Highlight Detection.
Conclusion: TimeExpert’s specialized processing improves event modeling and computational efficiency in VTG applications.
Abstract: Video Temporal Grounding (VTG) aims to precisely identify video event segments in response to textual queries. The outputs of VTG tasks manifest as sequences of events, each defined by precise timestamps, saliency scores, and textual descriptions. Despite recent advances, a fundamental limitation persists in existing Video Large Language Models (Video-LLMs): they process all task tokens through identical and static pathways, failing to recognize that temporal localization, saliency assessment, and textual generation represent fundamentally distinct tasks requiring specialized processing. To address this, we introduce TimeExpert, a Mixture-of-Experts (MoE)-based Video-LLM that effectively decomposes VTG tasks by dynamically routing task-specific tokens (e.g., timestamps, saliency scores) to specialized experts, with increased computational efficiency. Our design choices enable precise handling of each subtask, leading to improved event modeling across diverse VTG applications. Extensive experiments demonstrate that TimeExpert consistently achieves state-of-the-art performance on various VTG tasks such as Dense Video Captioning, Moment Retrieval, and Video Highlight Detection.
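To make the routing idea concrete, here is a minimal top-k Mixture-of-Experts layer of the kind such a design builds on; the expert shape and gating details are generic assumptions, not TimeExpert’s actual architecture. For clarity the sketch evaluates every expert densely, whereas real MoE layers dispatch only the selected tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenRouter(nn.Module):
    """Top-k gating: each token is dispatched to the experts with the
    highest gate scores, so timestamp / saliency / text tokens can take
    different pathways."""
    def __init__(self, dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.k = k

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (batch, seq, dim)
        scores = self.gate(tokens)                              # (batch, seq, E)
        topv, topi = scores.topk(self.k, dim=-1)
        weights = F.softmax(topv, dim=-1)
        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            idx = topi[..., slot]                               # chosen expert per token
            w = weights[..., slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = (idx == e).unsqueeze(-1).to(tokens.dtype)
                out = out + mask * w * expert(tokens)
        return out
```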
[300] LT-Gaussian: Long-Term Map Update Using 3D Gaussian Splatting for Autonomous Driving
Luqi Cheng, Zhangshuo Qi, Zijie Zhou, Chao Lu, Guangming Xiong
Main category: cs.CV
TL;DR: LT-Gaussian is a method for updating 3D-GS-based maps in autonomous driving, using multimodal splatting, change detection, and targeted updates to efficiently handle environmental changes.
Details
Motivation: Updating 3D-Gaussian Splatting (3D-GS) maps is costly; LT-Gaussian addresses this by leveraging old and new scene data for efficient updates.
Method: LT-Gaussian uses Multimodal Gaussian Splatting, Structural Change Detection, and Gaussian-Map Update Modules to update maps by comparing old maps with current LiDAR data.
Result: LT-Gaussian efficiently updates maps, handles environmental changes, and produces higher-quality reconstructions than rebuilding from scratch.
Conclusion: LT-Gaussian is an effective solution for updating 3D-GS maps in autonomous driving, balancing efficiency and reconstruction quality.
Abstract: Maps play an important role in autonomous driving systems. The recently proposed 3D Gaussian Splatting (3D-GS) produces rendering-quality explicit scene reconstruction results, demonstrating the potential for map construction in autonomous driving scenarios. However, because of the time and computational costs involved in generating Gaussian scenes, how to update the map becomes a significant challenge. In this paper, we propose LT-Gaussian, a map update method for 3D-GS-based maps. LT-Gaussian consists of three main components: Multimodal Gaussian Splatting, Structural Change Detection Module, and Gaussian-Map Update Module. Firstly, the Gaussian map of the old scene is generated using our proposed Multimodal Gaussian Splatting. Subsequently, during the map update process, we compare the outdated Gaussian map with the current LiDAR data stream to identify structural changes. Finally, we perform targeted updates to the Gaussian-map to generate an up-to-date map. We establish a benchmark for map updating on the nuScenes dataset to quantitatively evaluate our method. The experimental results show that LT-Gaussian can effectively and efficiently update the Gaussian-map, handling common environmental changes in autonomous driving scenarios. Furthermore, by taking full advantage of information from both new and old scenes, LT-Gaussian is able to produce higher quality reconstruction results compared to map update strategies that reconstruct maps from scratch. Our open-source code is available at https://github.com/ChengLuqi/LT-gaussian.
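The structural-change step can be pictured as a nearest-neighbor test between the old map and the new LiDAR sweep. A minimal sketch, assuming both are given as Nx3 point arrays; the real module compares Gaussians against the LiDAR stream with more care.

```python
import numpy as np
from scipy.spatial import cKDTree

def detect_changes(map_points: np.ndarray, lidar_points: np.ndarray,
                   threshold: float = 0.5) -> np.ndarray:
    """Flag map points with no nearby LiDAR return as candidate
    structural changes (e.g., removed or moved objects)."""
    tree = cKDTree(lidar_points)
    dists, _ = tree.query(map_points, k=1)  # distance to nearest LiDAR point
    return dists > threshold                # boolean mask of changed map points
```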
[301] GAID: Frame-Level Gated Audio-Visual Integration with Directional Perturbation for Text-Video Retrieval
Bowen Yang, Yun Cao, Chen He, Xiaosu Su
Main category: cs.CV
TL;DR: GAID improves text-to-video retrieval by integrating audio-visual features adaptively and enhancing text embeddings with structured perturbations, achieving state-of-the-art results efficiently.
Details
Motivation: Existing methods for text-to-video retrieval often neglect audio semantics or use coarse fusion, leading to suboptimal multimodal representations.
Method: GAID introduces Frame-level Gated Fusion (FGF) for fine-grained audio-visual alignment and Directional Adaptive Semantic Perturbation (DASP) for robust text embeddings.
Result: GAID achieves consistent state-of-the-art performance on MSR-VTT, DiDeMo, LSMDC, and VATEX datasets with efficiency gains.
Conclusion: GAID’s adaptive fusion and perturbation modules yield stable, expressive representations, advancing text-to-video retrieval.
Abstract: Text-to-video retrieval requires precise alignment between language and temporally rich video signals. Existing methods predominantly exploit visual cues and often overlook complementary audio semantics or adopt coarse fusion strategies, leading to suboptimal multimodal representations. We present GAID, a framework that jointly addresses this gap via two key components: (i) a Frame-level Gated Fusion (FGF) that adaptively integrates audio and visual features under textual guidance, enabling fine-grained temporal alignment; and (ii) a Directional Adaptive Semantic Perturbation (DASP) that injects structure-aware perturbations into text embeddings, enhancing robustness and discrimination without incurring multi-pass inference. These modules complement each other – fusion reduces modality gaps while perturbation regularizes cross-modal matching – yielding more stable and expressive representations. Extensive experiments on MSR-VTT, DiDeMo, LSMDC, and VATEX show consistent state-of-the-art results across all retrieval metrics with notable efficiency gains. Our code is available at https://github.com/YangBowenn/GAID.
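A bare-bones version of a frame-level gate, to show the shape of the mechanism: a sigmoid gate conditioned on visual, audio, and text features mixes the two modalities per frame. Dimensions and the concatenation scheme are illustrative assumptions, not GAID’s actual design.

```python
import torch
import torch.nn as nn

class FrameGatedFusion(nn.Module):
    """Per-frame sigmoid gate, conditioned on the text embedding, that
    decides how much audio to mix into each visual frame feature."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(3 * dim, dim), nn.Sigmoid())

    def forward(self, visual, audio, text):
        # visual, audio: (batch, frames, dim); text: (batch, dim)
        t = text.unsqueeze(1).expand_as(visual)
        g = self.gate(torch.cat([visual, audio, t], dim=-1))  # (batch, frames, dim)
        return g * audio + (1 - g) * visual
```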
[302] HateClipSeg: A Segment-Level Annotated Dataset for Fine-Grained Hate Video Detection
Han Wang, Zhuoran Wang, Roy Ka-Wei Lee
Main category: cs.CV
TL;DR: HateClipSeg is a large-scale multimodal dataset for hate speech detection in videos, featuring fine-grained annotations and high inter-annotator agreement. It benchmarks three tasks, revealing gaps in current models.
Details
Motivation: Existing datasets lack fine-grained annotations for hate speech in videos, making detection challenging.
Method: A three-stage annotation process creates HateClipSeg, with video-level and segment-level labels. Three benchmark tasks are proposed.
Result: High inter-annotator agreement (alpha = 0.817) and significant performance gaps in current models.
Conclusion: HateClipSeg addresses dataset limitations and highlights the need for advanced multimodal and temporally aware approaches.
Abstract: Detecting hate speech in videos remains challenging due to the complexity of multimodal content and the lack of fine-grained annotations in existing datasets. We present HateClipSeg, a large-scale multimodal dataset with both video-level and segment-level annotations, comprising over 11,714 segments labeled as Normal or across five Offensive categories: Hateful, Insulting, Sexual, Violence, Self-Harm, along with explicit target victim labels. Our three-stage annotation process yields high inter-annotator agreement (Krippendorff’s alpha = 0.817). We propose three tasks to benchmark performance: (1) Trimmed Hateful Video Classification, (2) Temporal Hateful Video Localization, and (3) Online Hateful Video Classification. Results highlight substantial gaps in current models, emphasizing the need for more sophisticated multimodal and temporally aware approaches. The HateClipSeg dataset is publicly available at https://github.com/Social-AI-Studio/HateClipSeg.git.
[303] Modality Bias in LVLMs: Analyzing and Mitigating Object Hallucination via Attention Lens
Haohan Zheng, Zhenguo Zhang
Main category: cs.CV
TL;DR: The paper identifies a modality bias in large vision-language models (LVLMs) causing object hallucination, proposes a training-free method to balance cross-modal attention, and validates its efficacy.
Details
Motivation: LVLMs suffer from object hallucination due to over-reliance on textual prompts and internal knowledge, ignoring visual and textual modalities, termed modality bias.
Method: A training-free method adjusts attention weights of textual and visual tokens and uses contrastive decoding to reduce overreliance on parametric knowledge.
Result: Experiments confirm modality bias in LVLMs; the proposed method effectively mitigates hallucination across models and benchmarks.
Conclusion: The method balances cross-modal compatibility, reducing hallucination and improving alignment with user intentions.
Abstract: Large vision-language models (LVLMs) have demonstrated remarkable multimodal comprehension and reasoning capabilities, but they still suffer from severe object hallucination. Previous studies primarily attribute the flaw to linguistic prior caused by the scale mismatch between visual encoders and large language models (LLMs) in LVLMs. Specifically, as current LVLMs are built upon LLMs, they tend to over-rely on textual prompts and internal knowledge of LLMs, generating descriptions inconsistent with visual cues. However, through an in-depth investigation of the hallucination mechanisms, we empirically reveal a previously overlooked phenomenon: LVLMs may ignore not only visual information but also the textual modality during hallucination, a behavior termed modality bias, which indicates that LVLMs struggle to simultaneously attend to both visual and textual modalities, leading to fragmented understanding of user-provided instructions. Based on this observation, we propose a simple yet effective training-free method to mitigate object hallucination. Concretely, we intervene and adjust the attention weights of textual and visual tokens, balancing cross-modal compatibility for better alignment with user intentions. Furthermore, we adopt a contrastive decoding strategy to reduce the LVLM’s overreliance on its parametric knowledge, synergistically enhancing our attention manipulation. Extensive experiments confirm the widespread presence of modality bias in LVLMs. Notably, our method effectively mitigates hallucination across multiple open-source LVLMs and benchmarks, highlighting its generalizability and efficacy.
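One simple way to picture the attention intervention: add a positive bias to the pre-softmax attention logits of the under-attended token groups before renormalizing. The sketch below is a generic version of this idea with made-up bias values, not the paper’s exact manipulation.

```python
import torch

def rebalance_attention(attn_logits: torch.Tensor, visual_mask: torch.Tensor,
                        text_mask: torch.Tensor, visual_bias: float = 1.0,
                        text_bias: float = 0.5) -> torch.Tensor:
    """Boost attention toward visual and instruction tokens by biasing
    their pre-softmax logits over the key dimension, then renormalize."""
    logits = attn_logits.clone()
    logits[..., visual_mask] += visual_bias  # visual_mask: 1D bool over key tokens
    logits[..., text_mask] += text_bias      # text_mask: 1D bool over key tokens
    return torch.softmax(logits, dim=-1)
```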
[304] Dynamic Robot-Assisted Surgery with Hierarchical Class-Incremental Semantic Segmentation
Julia Hindel, Ema Mekic, Enamundram Naga Karthik, Rohit Mohan, Daniele Cattaneo, Maria Kalweit, Abhinav Valada
Main category: cs.CV
TL;DR: The paper introduces TOPICS+, an enhanced variant of TOPICS for robust semantic segmentation in surgical scenes, addressing class imbalances and dynamic environments. It also proposes new benchmarks and a refined label set for evaluation.
Details
Motivation: To improve scene understanding in robot-assisted surgeries by addressing limitations of static datasets and enabling continual adaptation to new classes without forgetting prior knowledge.
Method: Enhances TOPICS by incorporating Dice loss, hierarchical pseudo-labeling, and tailored label taxonomies. Introduces six novel CISS benchmarks and a refined label set for surgical environments.
Result: TOPICS+ improves segmentation robustness in dynamic surgical scenes, validated through new benchmarks and a refined label set.
Conclusion: TOPICS+ advances semantic segmentation for surgical robotics, with publicly available code and models for further research.
Abstract: Robot-assisted surgeries rely on accurate and real-time scene understanding to safely guide surgical instruments. However, segmentation models trained on static datasets face key limitations when deployed in these dynamic and evolving surgical environments. Class-incremental semantic segmentation (CISS) allows models to continually adapt to new classes while avoiding catastrophic forgetting of prior knowledge, without training on previous data. In this work, we build upon the recently introduced Taxonomy-Oriented Poincaré-regularized Incremental Class Segmentation (TOPICS) approach and propose an enhanced variant, termed TOPICS+, specifically tailored for robust segmentation of surgical scenes. Concretely, we incorporate the Dice loss into the hierarchical loss formulation to handle strong class imbalances, introduce hierarchical pseudo-labeling, and design tailored label taxonomies for robotic surgery environments. We also propose six novel CISS benchmarks designed for robotic surgery environments including multiple incremental steps and several semantic categories to emulate realistic class-incremental settings in surgical environments. In addition, we introduce a refined set of labels with more than 144 classes on the Syn-Mediverse synthetic dataset, hosted online as an evaluation benchmark. We make the code and trained models publicly available at http://topics.cs.uni-freiburg.de.
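The Dice term added to the hierarchical loss is a standard remedy for class imbalance; a minimal soft-Dice implementation for reference, where the shapes and epsilon are conventional choices rather than TOPICS+ specifics:

```python
import torch

def dice_loss(probs: torch.Tensor, target_onehot: torch.Tensor,
              eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss; probs and target_onehot are (batch, classes, H, W).
    Per-class overlap normalization down-weights dominant background
    pixels, which helps with strong class imbalance."""
    dims = (0, 2, 3)
    intersection = (probs * target_onehot).sum(dims)
    cardinality = probs.sum(dims) + target_onehot.sum(dims)
    dice = (2.0 * intersection + eps) / (cardinality + eps)
    return 1.0 - dice.mean()
```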
[305] Granular Concept Circuits: Toward a Fine-Grained Circuit Discovery for Concept Representations
Dahee Kwon, Sehyun Lee, Jaesik Choi
Main category: cs.CV
TL;DR: The paper introduces Granular Concept Circuit (GCC), a method to discover circuits in deep vision models that encode specific visual concepts, offering fine-grained interpretability.
Details
Motivation: Understanding where and how visual concepts are encoded in deep vision models is challenging due to distributed representations. GCC aims to address this by identifying concept-specific circuits.
Method: GCC iteratively assesses inter-neuron connectivity, focusing on functional dependencies and semantic alignment, to construct circuits representing specific concepts.
Result: GCC successfully identifies fine-grained circuits tied to visual concepts across various deep image classification models, demonstrating versatility and effectiveness.
Conclusion: GCC provides a profound, concept-wise interpretation of deep vision models, advancing interpretability by pinpointing where specific concepts are encoded.
Abstract: Deep vision models have achieved remarkable classification performance by leveraging a hierarchical architecture in which human-interpretable concepts emerge through the composition of individual neurons across layers. Given the distributed nature of representations, pinpointing where specific visual concepts are encoded within a model remains a crucial yet challenging task. In this paper, we introduce an effective circuit discovery method, called Granular Concept Circuit (GCC), in which each circuit represents a concept relevant to a given query. To construct each circuit, our method iteratively assesses inter-neuron connectivity, focusing on both functional dependencies and semantic alignment. By automatically discovering multiple circuits, each capturing specific concepts within that query, our approach offers a profound, concept-wise interpretation of models and is the first to identify circuits tied to specific visual concepts at a fine-grained level. We validate the versatility and effectiveness of GCCs across various deep image classification models.
[306] Tracking the Unstable: Appearance-Guided Motion Modeling for Robust Multi-Object Tracking in UAV-Captured Videos
Jianbo Ma, Hui Luo, Qi Chen, Yuankai Qi, Yumei Sun, Amin Beheshti, Jianlin Zhang, Ming-Hsuan Yang
Main category: cs.CV
TL;DR: AMOT improves multi-object tracking in UAV videos by jointly leveraging appearance and motion cues with an AMC matrix and MTC module, outperforming state-of-the-art methods.
Details
Motivation: Challenges in UAV videos, like viewpoint changes and complex motion, lead to unstable tracking. Existing methods separate appearance and motion cues, missing their interplay.
Method: Proposes AMOT with an Appearance-Motion Consistency (AMC) matrix for reliable identity association and a Motion-aware Track Continuation (MTC) module to reduce broken trajectories.
Result: AMOT outperforms state-of-the-art methods on UAV benchmarks (VisDrone2019, UAVDT, VT-MOT-UAV) and generalizes well without training.
Conclusion: AMOT effectively addresses UAV tracking challenges by integrating appearance and motion cues, enhancing performance and robustness.
Abstract: Multi-object tracking (MOT) aims to track multiple objects while maintaining consistent identities across frames of a given video. In unmanned aerial vehicle (UAV) recorded videos, frequent viewpoint changes and complex UAV-ground relative motion dynamics pose significant challenges, which often lead to unstable affinity measurement and ambiguous association. Existing methods typically model motion and appearance cues separately, overlooking their spatio-temporal interplay and resulting in suboptimal tracking performance. In this work, we propose AMOT, which jointly exploits appearance and motion cues through two key components: an Appearance-Motion Consistency (AMC) matrix and a Motion-aware Track Continuation (MTC) module. Specifically, the AMC matrix computes bi-directional spatial consistency under the guidance of appearance features, enabling more reliable and context-aware identity association. The MTC module complements AMC by reactivating unmatched tracks through appearance-guided predictions that align with Kalman-based predictions, thereby reducing broken trajectories caused by missed detections. Extensive experiments on three UAV benchmarks, including VisDrone2019, UAVDT, and VT-MOT-UAV, demonstrate that our AMOT outperforms current state-of-the-art methods and generalizes well in a plug-and-play and training-free manner.
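The flavor of an appearance-motion affinity can be sketched as a blended association cost: IoU from motion-predicted boxes combined with appearance cosine similarity, then solved with the Hungarian algorithm. This is a generic formulation for illustration, not AMOT’s actual AMC matrix.

```python
import numpy as np

def iou_matrix(tracks: np.ndarray, dets: np.ndarray) -> np.ndarray:
    """Pairwise IoU between (N, 4) and (M, 4) boxes in xyxy format."""
    x1 = np.maximum(tracks[:, None, 0], dets[None, :, 0])
    y1 = np.maximum(tracks[:, None, 1], dets[None, :, 1])
    x2 = np.minimum(tracks[:, None, 2], dets[None, :, 2])
    y2 = np.minimum(tracks[:, None, 3], dets[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_t = (tracks[:, 2] - tracks[:, 0]) * (tracks[:, 3] - tracks[:, 1])
    area_d = (dets[:, 2] - dets[:, 0]) * (dets[:, 3] - dets[:, 1])
    return inter / (area_t[:, None] + area_d[None, :] - inter + 1e-9)

def fused_cost(tracks, dets, feat_t, feat_d, lam: float = 0.5):
    """Blend motion (IoU) and appearance (cosine) affinities into one
    association cost; feat_t, feat_d are assumed L2-normalized."""
    app = feat_t @ feat_d.T
    return 1.0 - (lam * iou_matrix(tracks, dets) + (1 - lam) * app)

# rows, cols = scipy.optimize.linear_sum_assignment(fused_cost(...))
```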
[307] SpectralX: Parameter-efficient Domain Generalization for Spectral Remote Sensing Foundation Models
Yuxiang Zhang, Wei Li, Mengmeng Zhang, Jiawei Han, Ran Tao, Shunlin Liang
Main category: cs.CV
TL;DR: SpectralX is a parameter-efficient fine-tuning framework adapting existing Remote Sensing Foundation Models (RSFMs) to process diverse spectral imagery without extensive pretraining, improving domain generalization.
Details
Motivation: To address the lack of foundation models for multispectral/hyperspectral data and leverage spectral imagery advantages in earth observation.
Method: A two-stage training approach: 1) masked-reconstruction with Hyper Tokenizer (HyperT) and Attribute-oriented Mixture of Adapter (AoMoA), 2) semantic segmentation with Attribute-refined Adapter (Are-adapter).
Result: SpectralX effectively adapts RSFMs to new spectral modalities, enhancing domain generalization for diverse inputs.
Conclusion: SpectralX enables customized adaptation of RSFMs for spectral imagery from new regions/seasons, with code available for public use.
Abstract: Recent advances in Remote Sensing Foundation Models (RSFMs) have led to significant breakthroughs in the field. While many RSFMs have been pretrained on massive optical imagery, multispectral/hyperspectral data still lack corresponding foundation models. To leverage the advantages of spectral imagery in earth observation, we explore whether existing RSFMs can be effectively adapted to process diverse spectral modalities without requiring extensive spectral pretraining. In response to this challenge, we propose SpectralX, an innovative parameter-efficient fine-tuning framework that adapts existing RSFMs as backbones while introducing a two-stage training approach to handle various spectral inputs, thereby significantly improving domain generalization performance. In the first stage, we employ a masked-reconstruction task and design a specialized Hyper Tokenizer (HyperT) to extract attribute tokens from both spatial and spectral dimensions. Simultaneously, we develop an Attribute-oriented Mixture of Adapter (AoMoA) that dynamically aggregates multi-attribute expert knowledge while performing layer-wise fine-tuning. With semantic segmentation as the downstream task in the second stage, we insert an Attribute-refined Adapter (Are-adapter) into the first-stage framework. By iteratively querying low-level semantic features with high-level representations, the model learns to focus on task-beneficial attributes, enabling customized adjustment of RSFMs. Following this two-phase adaptation process, SpectralX is capable of interpreting spectral imagery from new regions or seasons. The code will be available at: https://github.com/YuxiangZhang-BIT.
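Parameter-efficient adaptation of this kind typically hinges on small bottleneck adapters inserted into a frozen backbone; a generic sketch is shown below. The bottleneck width and placement are assumptions, and the actual AoMoA mixes several such experts per layer.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter added after a frozen backbone block; only these
    few parameters are trained while the RSFM weights stay fixed."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual update
```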
[308] AG$^2$aussian: Anchor-Graph Structured Gaussian Splatting for Instance-Level 3D Scene Understanding and Editing
Zhaonan Wang, Manyi Li, Changhe Tu
Main category: cs.CV
TL;DR: AG$^2$aussian introduces an anchor-graph structure to organize semantic features in 3D Gaussian Splatting, improving segmentation and Gaussian selection for scene understanding and editing.
Details
Motivation: Existing methods for semantic-aware 3D Gaussian representations produce noisy segmentation and messy Gaussian selections, hindering scene understanding and editing.
Method: The proposed AG$^2$aussian framework uses an anchor-graph structure to regulate Gaussian primitives and propagate semantic features, achieving clean and accurate instance-level selection.
Result: The method demonstrates advantages in interactive click-based query, open-vocabulary text-driven query, object removal editing, and physics simulation.
Conclusion: AG$^2$aussian effectively improves semantic-aware 3D Gaussian representations, benefiting diverse applications.
Abstract: 3D Gaussian Splatting (3DGS) has witnessed exponential adoption across diverse applications, driving a critical need for semantic-aware 3D Gaussian representations to enable scene understanding and editing tasks. Existing approaches typically attach semantic features to a collection of free Gaussians and distill the features via differentiable rendering, leading to noisy segmentation and a messy selection of Gaussians. In this paper, we introduce AG$^2$aussian, a novel framework that leverages an anchor-graph structure to organize semantic features and regulate Gaussian primitives. Our anchor-graph structure not only promotes compact and instance-aware Gaussian distributions, but also facilitates graph-based propagation, achieving a clean and accurate instance-level Gaussian selection. Extensive validation across four applications, i.e. interactive click-based query, open-vocabulary text-driven query, object removal editing, and physics simulation, demonstrates the advantages of our approach and its benefits to various applications. The experiments and ablation studies further evaluate the effectiveness of the key designs of our approach.
[309] Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models
Ruofan Wang, Xin Wang, Yang Yao, Xuan Tong, Xingjun Ma
Main category: cs.CV
TL;DR: SEA introduces a grey-box jailbreak method for fine-tuned VLMs, achieving high transfer attack success and toxicity rates by exploiting inherited vulnerabilities from the base model.
Details
Motivation: To highlight the underexplored risk of transferable jailbreak attacks in fine-tuned VLMs due to retained vulnerabilities from the base model.
Method: SEA combines Fine-tuning Trajectory Simulation (FTS) and Targeted Prompt Guidance (TPG) to generate transferable adversarial images and steer language outputs.
Result: SEA achieves 86.5% transfer attack success and 49.5% toxicity rates across fine-tuned Qwen2-VL variants, even safety-enhanced ones.
Conclusion: The study underscores the need for defenses against transferable vulnerabilities in fine-tuned VLMs inherited from open-source foundations.
Abstract: Fine-tuning open-source Vision-Language Models (VLMs) creates a critical yet underexplored attack surface: vulnerabilities in the base VLM could be retained in fine-tuned variants, rendering them susceptible to transferable jailbreak attacks. To demonstrate this risk, we introduce the Simulated Ensemble Attack (SEA), a novel grey-box jailbreak method in which the adversary has full access to the base VLM but no knowledge of the fine-tuned target’s weights or training configuration. To improve jailbreak transferability across fine-tuned VLMs, SEA combines two key techniques: Fine-tuning Trajectory Simulation (FTS) and Targeted Prompt Guidance (TPG). FTS generates transferable adversarial images by simulating the vision encoder’s parameter shifts, while TPG is a textual strategy that steers the language decoder toward adversarially optimized outputs. Experiments on the Qwen2-VL family (2B and 7B) demonstrate that SEA achieves high transfer attack success rates exceeding 86.5% and toxicity rates near 49.5% across diverse fine-tuned variants, even those specifically fine-tuned to improve safety behaviors. Notably, while direct PGD-based image jailbreaks rarely transfer across fine-tuned VLMs, SEA reliably exploits inherited vulnerabilities from the base model, significantly enhancing transferability. These findings highlight an urgent need to safeguard fine-tuned proprietary VLMs against transferable vulnerabilities inherited from open-source foundations, motivating the development of holistic defenses across the entire model lifecycle.
[310] Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation
Qiaohui Chu, Haoyu Zhang, Meng Liu, Yisen Feng, Haoxiang Shi, Liqiang Nie
Main category: cs.CV
TL;DR: INSIGHT is a two-stage framework for egocentric action anticipation, addressing limitations in existing methods by leveraging hand-object interactions, verb-noun dependencies, and cognitive reasoning. It achieves state-of-the-art results on benchmarks.
Details
Motivation: Improving long-term action anticipation from egocentric video for applications like human-computer interaction and assistive technologies by addressing underutilized visual cues, neglected semantic dependencies, and lack of cognitive reasoning.
Method: A two-stage framework: 1) extracts features from hand-object interactions and uses a verb-noun co-occurrence matrix; 2) employs reinforcement learning for cognitive reasoning (perception, inference, anticipation).
Result: Achieves state-of-the-art performance on Ego4D, EPIC-Kitchens-55, and EGTEA Gaze+ benchmarks, demonstrating strong generalization.
Conclusion: INSIGHT effectively addresses key limitations in action anticipation, offering improved performance and generalization through its unified framework.
Abstract: Long-term action anticipation from egocentric video is critical for applications such as human-computer interaction and assistive technologies, where anticipating user intent enables proactive and context-aware AI assistance. However, existing approaches suffer from three key limitations: 1) underutilization of fine-grained visual cues from hand-object interactions, 2) neglect of semantic dependencies between verbs and nouns, and 3) lack of explicit cognitive reasoning, limiting generalization and long-term forecasting ability. To overcome these challenges, we propose INSIGHT, a unified two-stage framework for egocentric action anticipation. In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions and enhances action representations using a verb-noun co-occurrence matrix. In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning through a structured process: visual perception (think) -> intention inference (reason) -> action anticipation (answer). Extensive experiments on Ego4D, EPIC-Kitchens-55, and EGTEA Gaze+ benchmarks show that INSIGHT achieves state-of-the-art performance, demonstrating its effectiveness and strong generalization capability.
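The verb-noun co-occurrence prior is straightforward to build from training labels; a small sketch, assuming `actions` is an iterable of (verb_id, noun_id) pairs:

```python
import numpy as np

def cooccurrence_matrix(actions, num_verbs: int, num_nouns: int) -> np.ndarray:
    """Row-normalized verb-noun co-occurrence counts: row v approximates
    P(noun | verb = v), usable as a prior to rescore anticipated
    (verb, noun) pairs."""
    counts = np.zeros((num_verbs, num_nouns))
    for verb_id, noun_id in actions:
        counts[verb_id, noun_id] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.clip(row_sums, 1, None)  # avoid division by zero
```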
[311] Improving Noise Efficiency in Privacy-preserving Dataset Distillation
Runkai Zheng, Vishnu Asutosh Dasu, Yinong Oliver Wang, Haohan Wang, Fernando De la Torre
Main category: cs.CV
TL;DR: A novel framework improves differentially private dataset distillation by decoupling sampling from optimization and enhancing signal quality, achieving significant performance gains.
Details
Motivation: Addressing privacy concerns in machine learning by efficiently generating synthetic datasets without compromising privacy, while overcoming inefficiencies in current methods.
Method: Decouples sampling from optimization and improves signal quality by mitigating DP noise through matching in an informative subspace.
Result: Achieves 10.0% improvement with 50 images per class and 8.3% increase with one-fifth the distilled set size on CIFAR-10.
Conclusion: The framework advances privacy-preserving dataset distillation, offering better efficiency and performance.
Abstract: Modern machine learning models heavily rely on large datasets that often include sensitive and private information, raising serious privacy concerns. Differentially private (DP) data generation offers a solution by creating synthetic datasets that limit the leakage of private information within a predefined privacy budget; however, it requires a substantial amount of data to achieve performance comparable to models trained on the original data. To mitigate the significant expense incurred with synthetic data generation, Dataset Distillation (DD) stands out for its remarkable training and storage efficiency. This efficiency is particularly advantageous when integrated with DP mechanisms, curating compact yet informative synthetic datasets without compromising privacy. However, current state-of-the-art private DD methods suffer from a synchronized sampling-optimization process and the dependency on noisy training signals from randomly initialized networks. This results in the inefficient utilization of private information due to the addition of excessive noise. To address these issues, we introduce a novel framework that decouples sampling from optimization for better convergence and improves signal quality by mitigating the impact of DP noise through matching in an informative subspace. On CIFAR-10, our method achieves a \textbf{10.0%} improvement with 50 images per class and \textbf{8.3%} increase with just \textbf{one-fifth} the distilled set size of previous state-of-the-art methods, demonstrating significant potential to advance privacy-preserving DD.
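For context, the DP noise the paper works to mitigate typically comes from the Gaussian mechanism: clip each sample’s contribution to bound sensitivity, then add calibrated noise. A minimal sketch follows; the matching-in-a-subspace step itself is beyond this snippet.

```python
import numpy as np

def gaussian_mechanism(signal: np.ndarray, clip_norm: float,
                       noise_multiplier: float) -> np.ndarray:
    """Clip the per-sample signal to bound its sensitivity, then add
    Gaussian noise scaled to the clipping norm -- the standard DP
    building block whose noise private distillation must cope with."""
    norm = np.linalg.norm(signal)
    clipped = signal * min(1.0, clip_norm / (norm + 1e-12))
    noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=signal.shape)
    return clipped + noise
```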
[312] Vision transformer-based multi-camera multi-object tracking framework for dairy cow monitoring
Kumail Abbas, Zeeshan Afzal, Aqeel Raza, Taha Mansouri, Andrew W. Dowsey, Chaidate Inchaisri, Ali Alameer
Main category: cs.CV
TL;DR: A multi-camera, real-time tracking system using advanced computer vision techniques was developed to monitor dairy cow activity accurately, outperforming existing methods with high MOTA and IDF1 scores.
Details
Motivation: Manual monitoring of dairy cow activity is laborious and inconsistent, necessitating an automated, accurate system for health and welfare assessment.
Method: The system integrates six camera feeds with homographic transformations, uses YOLO11-m for detection, SAMURAI for segmentation, and a motion-aware Kalman filter for tracking.
Result: Achieved high accuracy (MOTA 98.7%-99.3%, IDF1 >99%) and minimal identity switches, outperforming Deep SORT Realtime.
Conclusion: The system enables real-time, precise cow tracking in complex environments, aiding early sickness prediction and farm productivity.
Abstract: Activity and behaviour correlate with dairy cow health and welfare, making continual and accurate monitoring crucial for disease identification and farm productivity. Manual observation and frequent assessments are laborious and inconsistent for activity monitoring. In this study, we developed a unique multi-camera, real-time tracking system for indoor-housed Holstein Friesian dairy cows. This technology uses cutting-edge computer vision techniques, including instance segmentation and tracking algorithms, to monitor cow activity seamlessly and accurately. An integrated top-down barn panorama was created by geometrically aligning six camera feeds using homographic transformations. The detection phase used a refined YOLO11-m model trained on an overhead cow dataset, obtaining high accuracy (mAP@0.50 = 0.97, F1 = 0.95). SAMURAI, an upgraded Segment Anything Model 2.1, generated pixel-precise cow masks for instance segmentation utilizing zero-shot learning and motion-aware memory. Even with occlusion and fluctuating posture, a motion-aware Linear Kalman filter and IoU-based data association reliably identified cows over time for object tracking. The proposed system significantly outperformed Deep SORT Realtime: Multi-Object Tracking Accuracy (MOTA) was 98.7% and 99.3% on two benchmark video sequences, with IDF1 scores above 99% and near-zero identity switches. Our results show that this unified multi-camera system can track dairy cows in complex indoor surroundings in real time. The system reduces redundant detections across overlapping cameras and maintains continuity as cows move between viewpoints, with the aim of improving early sickness prediction through activity quantification and behavioural classification.
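The top-down barn panorama rests on standard planar homographies; a minimal OpenCV sketch for a single camera, with made-up point correspondences and output size:

```python
import cv2
import numpy as np

# Correspondences between one camera view and the top-down barn map,
# e.g., clicked pen corners (hypothetical coordinates).
src = np.float32([[102, 240], [860, 255], [880, 700], [95, 690]])
dst = np.float32([[0, 0], [400, 0], [400, 300], [0, 300]])

H, _ = cv2.findHomography(src, dst)              # 3x3 planar mapping
frame = cv2.imread("cam1.jpg")                    # illustrative file name
topdown = cv2.warpPerspective(frame, H, (400, 300))
# Repeating this per camera and blending the warps yields the panorama.
```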
[313] VPN: Visual Prompt Navigation
Shuo Feng, Zihan Wang, Yuchen Li, Rui Kong, Hengyi Cai, Shuaiqiang Wang, Gim Hee Lee, Piji Li, Shuqiang Jiang
Main category: cs.CV
TL;DR: The paper introduces Visual Prompt Navigation (VPN), a method using visual prompts for guiding agents in navigation tasks, reducing ambiguity compared to language instructions. It includes new datasets and a baseline network (VPNet) with data augmentation strategies.
Details
Motivation: Language-guided navigation suffers from ambiguity and verbosity, making it less effective in complex environments. Visual prompts offer a more intuitive and spatially grounded alternative.
Method: Proposes VPN, which uses visual prompts (trajectory marks on 2D top-view maps) for navigation. Introduces VPNet with view-level and trajectory-level data augmentation. New datasets R2R-VP and R2R-CE-VP are created.
Result: Experiments evaluate visual prompt forms, map formats, and augmentation strategies, showing improved navigation performance.
Conclusion: VPN provides a more effective and user-friendly alternative to language-guided navigation, with VPNet and augmentation strategies enhancing performance.
Abstract: While natural language is commonly used to guide embodied agents, the inherent ambiguity and verbosity of language often hinder the effectiveness of language-guided navigation in complex environments. To this end, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents to navigate using only user-provided visual prompts within 2D top-view maps. This visual prompt primarily focuses on marking the visual navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. It is more friendly for non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings, constructing two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a dedicated baseline network to handle the VPN tasks, with two data augmentation strategies: view-level augmentation (altering initial headings and prompt orientations) and trajectory-level augmentation (incorporating diverse trajectories from large-scale 3D scenes), to enhance navigation performance. Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect the performance of visual prompt navigation. The code is available at https://github.com/farlit/VPN.
[314] DiffSemanticFusion: Semantic Raster BEV Fusion for Autonomous Driving via Online HD Map Diffusion
Zhigang Sun, Yiru Wang, Anqing Jiang, Shuo Wang, Yu Gao, Yuwen Heng, Shouyi Zhang, An He, Hao Jiang, Jinhao Chai, Zichong Gu, Wang Jijun, Shichen Tang, Lavdim Halilaj, Juergen Luettin, Hao Sun
Main category: cs.CV
TL;DR: DiffSemanticFusion combines raster and graph-based representations for improved autonomous driving tasks, achieving state-of-the-art performance in trajectory prediction and planning.
Details
Motivation: Addressing the limitations of raster-based (lack of geometric precision) and graph-based (instability without precise maps) representations in autonomous driving.
Method: Proposes DiffSemanticFusion, a fusion framework using a semantic raster-fused BEV space and a map diffusion module for enhanced stability and expressiveness.
Result: Achieves 5.1% improvement in trajectory prediction on nuScenes and 15% gain in end-to-end driving on NAVSIM.
Conclusion: DiffSemanticFusion effectively combines complementary representations, enhancing performance in autonomous driving tasks and can integrate with other vector-based methods.
Abstract: Autonomous driving requires accurate scene understanding, including road geometry, traffic agents, and their semantic relationships. In online HD map generation scenarios, raster-based representations are well-suited to vision models but lack geometric precision, while graph-based representations retain structural detail but become unstable without precise maps. To harness the complementary strengths of both, we propose DiffSemanticFusion – a fusion framework for multimodal trajectory prediction and planning. Our approach reasons over a semantic raster-fused BEV space, enhanced by a map diffusion module that improves both the stability and expressiveness of online HD map representations. We validate our framework on two downstream tasks: trajectory prediction and planning-oriented end-to-end autonomous driving. Experiments on real-world autonomous driving benchmarks, nuScenes and NAVSIM, demonstrate improved performance over several state-of-the-art methods. For the prediction task on nuScenes, we integrate DiffSemanticFusion with the online HD map informed QCNet, achieving a 5.1% performance improvement. For end-to-end autonomous driving in NAVSIM, DiffSemanticFusion achieves state-of-the-art results, with a 15% performance gain in NavHard scenarios. In addition, extensive ablation and sensitivity studies show that our map diffusion module can be seamlessly integrated into other vector-based approaches to enhance performance. All artifacts are available at https://github.com/SunZhigang7/DiffSemanticFusion.
[315] Skip priors and add graph-based anatomical information, for point-based Couinaud segmentation
Xiaotong Zhang, Alexander Broersen, Gonnie CM van Erp, Silvia L. Pintea, Jouke Dijkstra
Main category: cs.CV
TL;DR: A point-based method for Couinaud liver segmentation without explicit prior vessel structure, using a graph reasoning module to learn anatomical affinities.
Details
Motivation: Preoperative liver surgery planning relies on Couinaud segmentation from CT images, but point-based methods require time-consuming prior vessel structure knowledge.
Method: Proposes a point-based method with a graph reasoning module to implicitly learn liver vessel structure from point features.
Result: Competitive performance on MSD and LiTS datasets in Dice coefficient and average surface distance scores.
Conclusion: The method effectively segments Couinaud liver regions without explicit prior vessel knowledge, leveraging learned anatomical affinities.
Abstract: The preoperative planning of liver surgery relies on Couinaud segmentation from computed tomography (CT) images, to reduce the risk of bleeding and guide the resection procedure. Using 3D point-based representations, rather than voxelizing the CT volume, has the benefit of preserving the physical resolution of the CT. However, point-based representations need prior knowledge of the liver vessel structure, which is time-consuming to acquire. Here, we propose a point-based method for Couinaud segmentation, without explicitly providing the prior liver vessel structure. To allow the model to learn this anatomical liver vessel structure, we add a graph reasoning module on top of the point features. This adds implicit anatomical information to the model, by learning affinities across point neighborhoods. Our method is competitive on the MSD and LiTS public datasets in Dice coefficient and average surface distance scores compared to four pioneering point-based methods. Our code is available at https://github.com/ZhangXiaotong015/GrPn.
[316] SoccerTrack v2: A Full-Pitch Multi-View Soccer Dataset for Game State Reconstruction
Atom Scott, Ikuma Uchida, Kento Kuroda, Yufi Kim, Keisuke Fujii
Main category: cs.CV
TL;DR: SoccerTrack v2 is a public dataset for soccer analytics, offering 10 panoramic 4K match recordings with detailed annotations for MOT, GSR, and BAS.
Details
Motivation: To address limitations of prior datasets by providing comprehensive, high-quality data for advancing computer vision and soccer analytics.
Method: Uses BePro cameras for full-length, panoramic 4K recordings of university-level matches, annotated with GSR and BAS labels.
Result: A dataset with 10 annotated matches, enabling new benchmarks in MOT, GSR, and BAS for soccer analytics.
Conclusion: SoccerTrack v2 supports research and practical applications in tactical analysis and automated tools for soccer.
Abstract: SoccerTrack v2 is a new public dataset for advancing multi-object tracking (MOT), game state reconstruction (GSR), and ball action spotting (BAS) in soccer analytics. Unlike prior datasets that use broadcast views or limited scenarios, SoccerTrack v2 provides 10 full-length, panoramic 4K recordings of university-level matches, captured with BePro cameras for complete player visibility. Each video is annotated with GSR labels (2D pitch coordinates, jersey-based player IDs, roles, teams) and BAS labels for 12 action classes (e.g., Pass, Drive, Shot). This technical report outlines the dataset’s structure, collection pipeline, and annotation process. SoccerTrack v2 is designed to advance research in computer vision and soccer analytics, enabling new benchmarks and practical applications in tactical analysis and automated tools.
[317] Diffusion-based 3D Hand Motion Recovery with Intuitive Physics
Yufei Zhang, Zijun Cui, Jeffrey O. Kephart, Qiang Ji
Main category: cs.CV
TL;DR: A novel 3D hand motion recovery framework using diffusion-based and physics-augmented refinement improves accuracy and coherence in hand-object interactions.
Details
Motivation: Challenges in generating accurate and temporally coherent 3D hand motion estimates from videos, especially during hand-object interactions.
Method: Diffusion-based and physics-augmented motion refinement model trained on motion capture data, integrating intuitive physics knowledge.
Result: Significant improvement over frame-wise reconstruction methods, achieving SOTA performance on benchmarks.
Conclusion: The framework effectively enhances 3D hand motion recovery by combining diffusion models with physics insights.
Abstract: While 3D hand reconstruction from monocular images has made significant progress, generating accurate and temporally coherent motion estimates from videos remains challenging, particularly during hand-object interactions. In this paper, we present a novel 3D hand motion recovery framework that enhances image-based reconstructions through a diffusion-based and physics-augmented motion refinement model. Our model captures the distribution of refined motion estimates conditioned on initial ones, generating improved sequences through an iterative denoising process. Instead of relying on scarce annotated video data, we train our model only using motion capture data without images. We identify valuable intuitive physics knowledge during hand-object interactions, including key motion states and their associated motion constraints. We effectively integrate these physical insights into our diffusion model to improve its performance. Extensive experiments demonstrate that our approach significantly improves various frame-wise reconstruction methods, achieving state-of-the-art (SOTA) performance on existing benchmarks.
[318] A Simple Algebraic Solution for Estimating the Pose of a Camera from Planar Point Features
Tarek Bouazza, Tarek Hamel, Claude Samson
Main category: cs.CV
TL;DR: A hierarchical algebraic method for estimating camera pose relative to a planar target using ≥4 reference points, with noise robustness via averaging.
Details
Motivation: To accurately estimate camera pose relative to a planar target from bearing measurements, addressing noise sensitivity.
Method: Hierarchical approach: first estimates the target plane’s normal, then camera position, distance, and full orientation. Uses averaging to refine normal direction for robustness.
Result: Validated accuracy and robustness through extensive experiments.
Conclusion: The method effectively estimates camera pose with improved noise robustness.
Abstract: This paper presents a simple algebraic method to estimate the pose of a camera relative to a planar target from $n \geq 4$ reference points with known coordinates in the target frame and their corresponding bearing measurements in the camera frame. The proposed approach follows a hierarchical structure; first, the unit vector normal to the target plane is determined, followed by the camera’s position vector, its distance to the target plane, and finally, the full orientation. To improve the method’s robustness to measurement noise, an averaging methodology is introduced to refine the estimation of the target’s normal direction. The accuracy and robustness of the approach are validated through extensive experiments.
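The paper derives its own hierarchical algebraic solution; for comparison, the same planar-pose problem is commonly handled with planar PnP, as in this OpenCV sketch. Point values and intrinsics are illustrative, and this baseline is not the paper’s method.

```python
import cv2
import numpy as np

# Known coplanar target points (target frame, z = 0) and their pixel
# projections; values and the intrinsic matrix K are illustrative.
obj = np.float32([[0, 0, 0], [0.2, 0, 0], [0.2, 0.2, 0], [0, 0.2, 0]])
pix = np.float32([[320, 240], [420, 238], [425, 335], [322, 338]])
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])

# IPPE is the standard planar variant; it needs >= 4 coplanar points.
ok, rvec, tvec = cv2.solvePnP(obj, pix, K, None, flags=cv2.SOLVEPNP_IPPE)
R, _ = cv2.Rodrigues(rvec)                    # target-to-camera rotation
normal_cam = R @ np.array([0.0, 0.0, 1.0])    # plane normal in the camera frame
```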
[319] OmniEvent: Unified Event Representation Learning
Weiqi Yan, Chenlu Lin, Youbiao Wang, Zhipeng Cai, Xiuhong Lin, Yangyang Shi, Weiquan Liu, Yu Zang
Main category: cs.CV
TL;DR: OmniEvent is a unified event representation learning framework for event cameras, eliminating task-specific designs and achieving SOTA performance across diverse tasks.
Details
Motivation: Event networks rely on task-specific designs due to unstructured data and spatial-temporal inhomogeneity, limiting reusability.
Method: Proposes a decouple-enhance-fuse paradigm: local feature aggregation in spatial/temporal domains, space-filling curves for efficiency, and attention-based fusion.
Result: Outperforms task-specific SOTA by up to 68.2% across 3 tasks and 10 datasets.
Conclusion: OmniEvent provides a unified, efficient framework for event data, enabling standard vision models without architecture changes.
Abstract: Event cameras have gained increasing popularity in computer vision due to their ultra-high dynamic range and temporal resolution. However, event networks heavily rely on task-specific designs due to the unstructured data distribution and spatial-temporal (S-T) inhomogeneity, making it hard to reuse existing architectures for new tasks. We propose OmniEvent, the first unified event representation learning framework that achieves SOTA performance across diverse tasks, fully removing the need of task-specific designs. Unlike previous methods that treat event data as 3D point clouds with manually tuned S-T scaling weights, OmniEvent proposes a decouple-enhance-fuse paradigm, where the local feature aggregation and enhancement is done independently on the spatial and temporal domains to avoid inhomogeneity issues. Space-filling curves are applied to enable large receptive fields while improving memory and compute efficiency. The features from individual domains are then fused by attention to learn S-T interactions. The output of OmniEvent is a grid-shaped tensor, which enables standard vision models to process event data without architecture change. With a unified framework and similar hyper-parameters, OmniEvent outperforms (task-specific) SOTA by up to 68.2% across 3 representative tasks and 10 datasets (Fig.1). Code will be available at https://github.com/Wickyan/OmniEvent.
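The space-filling-curve trick can be illustrated with a Morton (Z-order) code: interleaving coordinate bits so that sorting by the code keeps spatial neighbors contiguous in memory. A small sketch, assuming non-negative integer pixel coordinates; OmniEvent’s actual curve and feature pipeline may differ.

```python
import numpy as np

def morton_2d(x: np.ndarray, y: np.ndarray, bits: int = 16) -> np.ndarray:
    """Interleave the bits of integer pixel coordinates (Z-order curve),
    so sorting by the code keeps spatial neighbors close in memory."""
    code = np.zeros_like(x, dtype=np.uint64)
    for b in range(bits):
        code |= ((x >> b) & 1).astype(np.uint64) << (2 * b)
        code |= ((y >> b) & 1).astype(np.uint64) << (2 * b + 1)
    return code

# Given per-event integer coordinates xs, ys, sort once and then
# aggregate local features over contiguous chunks of the sorted order:
# order = np.argsort(morton_2d(xs, ys))
```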
[320] Beyond Vulnerabilities: A Survey of Adversarial Attacks as Both Threats and Defenses in Computer Vision Systems
Zhongliang Guo, Yifei Qian, Yanli Li, Weiye Li, Chun Tong Lei, Shuai Zhao, Lei Fang, Ognjen Arandjelović, Chun Pong Lau
Main category: cs.CV
TL;DR: A survey on adversarial attacks in computer vision, covering attack methods, their evolution, and defensive applications, while identifying research gaps.
Details
Motivation: To analyze adversarial attacks as both security threats and defensive tools, and to highlight gaps in robustness and security of neural networks.
Method: Systematic review of adversarial techniques across pixel-space, physically realizable, and latent-space attacks, including gradient-based and optimization methods.
Result: Identifies dual nature of adversarial techniques, their evolution, and applications in vulnerability assessment and defense.
Conclusion: The survey provides a taxonomy and future research directions to enhance robustness in computer vision systems.
Abstract: Adversarial attacks against computer vision systems have emerged as a critical research area that challenges the fundamental assumptions about neural network robustness and security. This comprehensive survey examines the evolving landscape of adversarial techniques, revealing their dual nature as both sophisticated security threats and valuable defensive tools. We provide a systematic analysis of adversarial attack methodologies across three primary domains: pixel-space attacks, physically realizable attacks, and latent-space attacks. Our investigation traces the technical evolution from early gradient-based methods such as FGSM and PGD to sophisticated optimization techniques incorporating momentum, adaptive step sizes, and advanced transferability mechanisms. We examine how physically realizable attacks have successfully bridged the gap between digital vulnerabilities and real-world threats through adversarial patches, 3D textures, and dynamic optical perturbations. Additionally, we explore the emergence of latent-space attacks that leverage semantic structure in internal representations to create more transferable and meaningful adversarial examples. Beyond traditional offensive applications, we investigate the constructive use of adversarial techniques for vulnerability assessment in biometric authentication systems and protection against malicious generative models. Our analysis reveals critical research gaps, particularly in neural style transfer protection and computational efficiency requirements. This survey contributes a comprehensive taxonomy, evolution analysis, and identification of future research directions, aiming to advance understanding of adversarial vulnerabilities and inform the development of more robust and trustworthy computer vision systems.
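As a reference point for the gradient-based lineage the survey traces, here is the canonical single-step FGSM in PyTorch; eps follows the common 8/255 convention for images in [0, 1], and the model is assumed to be any classifier.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x: torch.Tensor, y: torch.Tensor, eps: float = 8 / 255):
    """Single-step FGSM: perturb the input along the sign of the loss
    gradient -- the starting point of the gradient-based attack family."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + eps * x_adv.grad.sign()
    return x_adv.clamp(0, 1).detach()
```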
[321] Distinguishing Target and Non-Target Fixations with EEG and Eye Tracking in Realistic Visual Scenes
Mansi Sharma, Camilo Andrés Martínez Martínez, Benedikt Emanuel Wirth, Antonio Krüger, Philipp Müller
Main category: cs.CV
TL;DR: The paper investigates classifying target vs. non-target fixations during free visual search in realistic scenes using gaze and EEG data, outperforming prior methods with 83.6% accuracy.
Details
Motivation: Prior research lacked realism by using abstract stimuli and ignoring scene context, limiting generalizability to real-world scenarios.
Method: A 36-participant study with 140 realistic scenes (e.g., desktop icons, workshop tools) used gaze and EEG features for classification.
Result: The approach achieved 83.6% accuracy, significantly outperforming the previous state-of-the-art (56.9%).
Conclusion: The study advances realistic visual search understanding and improves classification accuracy for target fixations.
Abstract: Distinguishing target from non-target fixations during visual search is a fundamental building block to understand users’ intended actions and to build effective assistance systems. While prior research indicated the feasibility of classifying target vs. non-target fixations based on eye tracking and electroencephalography (EEG) data, these studies were conducted with explicitly instructed search trajectories and abstract visual stimuli, and disregarded any scene context. This is in stark contrast with the fact that human visual search is largely driven by scene characteristics and raises questions regarding generalizability to more realistic scenarios. To close this gap, we, for the first time, investigate the classification of target vs. non-target fixations during free visual search in realistic scenes. In particular, we conducted a 36-participant user study using a large variety of 140 realistic visual search scenes in two highly relevant application scenarios: searching for icons on desktop backgrounds and finding tools in a cluttered workshop. Our approach based on gaze and EEG features outperforms the previous state-of-the-art approach based on a combination of fixation duration and saccade-related potentials. We perform extensive evaluations to assess the generalizability of our approach across scene types. Our approach significantly advances the ability to distinguish between target and non-target fixations in realistic scenarios, achieving 83.6% accuracy in cross-user evaluations. This substantially outperforms previous methods based on saccade-related potentials, which reached only 56.9% accuracy.
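As a rough illustration of the kind of fixation classifier the study evaluates, the sketch below concatenates per-fixation gaze and EEG feature vectors and cross-validates an off-the-shelf classifier; all feature names, dimensions, and the model choice are hypothetical, not the paper's actual pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
gaze = rng.normal(size=(n, 4))    # e.g. fixation duration, saccade amplitude (hypothetical)
eeg = rng.normal(size=(n, 32))    # e.g. fixation-locked EEG band power (hypothetical)
X = np.hstack([gaze, eeg])        # fuse both modalities per fixation
y = rng.integers(0, 2, size=n)    # 1 = target fixation, 0 = non-target

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())  # ~0.5 here: toy data carries no signal
```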
[322] DiffusionFF: Face Forgery Detection via Diffusion-based Artifact Localization
Siran Peng, Haoyuan Zhang, Li Gao, Tianshuo Zhang, Bao Li, Zhen Lei
Main category: cs.CV
TL;DR: DiffusionFF is a novel framework for face forgery detection that uses diffusion-based artifact localization to improve accuracy and explainability.
Details
Motivation: The need for robust and accurate face forgery detection due to evolving deepfake techniques, with a focus on localizing forgery artifacts for better model explainability and user trust.
Method: Utilizes a denoising diffusion model to generate DSSIM maps for artifact localization, fused with semantic features from a pretrained forgery detector.
Result: Achieves superior detection performance and precise artifact localization on cross-dataset and intra-dataset benchmarks.
Conclusion: DiffusionFF is effective for both detection and localization, enhancing trust and accuracy in face forgery detection.
Abstract: The rapid evolution of deepfake generation techniques demands robust and accurate face forgery detection algorithms. While determining whether an image has been manipulated remains essential, the ability to precisely localize forgery artifacts has become increasingly important for improving model explainability and fostering user trust. To address this challenge, we propose DiffusionFF, a novel framework that enhances face forgery detection through diffusion-based artifact localization. Our method utilizes a denoising diffusion model to generate high-quality Structural Dissimilarity (DSSIM) maps, which effectively capture subtle traces of manipulation. These DSSIM maps are then fused with high-level semantic features extracted by a pretrained forgery detector, leading to significant improvements in detection accuracy. Extensive experiments on both cross-dataset and intra-dataset benchmarks demonstrate that DiffusionFF not only achieves superior detection performance but also offers precise and fine-grained artifact localization, highlighting its overall effectiveness.
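DSSIM is a fixed function of SSIM, so the maps the diffusion model learns to generate can be illustrated with a reference-based computation. A minimal sketch using scikit-image, assuming float images in [0, 1]; note that in DiffusionFF the map is generated without access to a pristine reference:

```python
import numpy as np
from skimage.metrics import structural_similarity

def dssim_map(img_a: np.ndarray, img_b: np.ndarray) -> np.ndarray:
    """Per-pixel structural dissimilarity: DSSIM = (1 - SSIM) / 2."""
    _, ssim_map = structural_similarity(img_a, img_b, data_range=1.0, full=True)
    return (1.0 - ssim_map) / 2.0

a = np.random.rand(64, 64)
b = np.clip(a + 0.1 * np.random.rand(64, 64), 0.0, 1.0)  # lightly "manipulated" copy
print(dssim_map(a, b).mean())  # higher values mark structurally altered regions
```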
[323] StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding
Haolin Yang, Feilong Tang, Linxiao Zhao, Xiang An, Ming Hu, Huifa Li, Xinlin Zhuang, Boqian Wang, Yifan Lu, Xiaofeng Zhang, Abdalla Swikir, Junjun He, Zongyuan Ge, Imran Razzak
Main category: cs.CV
TL;DR: Proposes StreamAgent for proactive, goal-driven video understanding by anticipating future task-relevant info, outperforming existing methods in real-time responsiveness.
Details
Motivation: Addresses limitations of current methods in real-time streaming video understanding, which lack proactive decision-making and future anticipation.
Method: Integrates question semantics and historical observations to anticipate key events, using a streaming KV-cache memory for efficient inference.
Result: Outperforms existing methods in accuracy and real-time efficiency on streaming and long video tasks.
Conclusion: StreamAgent enhances real-time video understanding, proving practical for dynamic scenarios like autonomous driving.
Abstract: Real-time streaming video understanding in domains such as autonomous driving and intelligent surveillance poses challenges beyond conventional offline video processing, requiring continuous perception, proactive decision making, and responsive interaction based on dynamically evolving visual content. However, existing methods rely on alternating perception-reaction or asynchronous triggers, lacking task-driven planning and future anticipation, which limits their real-time responsiveness and proactive decision making in evolving video streams. To this end, we propose StreamAgent, which anticipates the temporal intervals and spatial regions expected to contain future task-relevant information to enable proactive and goal-driven responses. Specifically, we integrate question semantics and historical observations through prompting the anticipatory agent to anticipate the temporal progression of key events, align current observations with the expected future evidence, and subsequently adjust the perception action (e.g., attending to task-relevant regions or continuously tracking in subsequent frames). To enable efficient inference, we design a streaming KV-cache memory mechanism that constructs a hierarchical memory structure for selective recall of relevant tokens, enabling efficient semantic retrieval while reducing the overhead of storing all tokens in the traditional KV-cache. Extensive experiments on streaming and long video understanding tasks demonstrate that our method outperforms existing methods in response accuracy and real-time efficiency, highlighting its practical value for real-world streaming scenarios.
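The "selective recall" idea behind a streaming KV-cache can be sketched independently of the paper's hierarchical memory: score each cached token by how much recent attention it received and keep only the top-k. The policy below is a hypothetical stand-in, not StreamAgent's actual mechanism:

```python
import torch

def prune_kv_cache(k, v, attn_weights, keep=256):
    """Keep only the cached tokens that recent queries attended to the most.

    k, v: (seq, dim) cached keys/values; attn_weights: (queries, seq), the
    attention each cached token received from the latest decoding steps.
    """
    scores = attn_weights.sum(dim=0)                   # total attention per token
    idx = torch.topk(scores, k=min(keep, scores.numel())).indices
    idx = idx.sort().values                            # preserve temporal order
    return k[idx], v[idx]

k, v = torch.randn(1000, 64), torch.randn(1000, 64)
attn = torch.rand(8, 1000)
k_small, v_small = prune_kv_cache(k, v, attn, keep=256)
print(k_small.shape)  # torch.Size([256, 64])
```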
[324] Medical Image De-Identification Resources: Synthetic DICOM Data and Tools for Validation
Michael W. Rutherford, Tracy Nolan, Linmin Pei, Ulrike Wagner, Qinyan Pan, Phillip Farmer, Kirk Smith, Benjamin Kopchick, Laura Opsahl-Ong, Granger Sutton, David Clunie, Keyvan Farahani, Fred Prior
Main category: cs.CV
TL;DR: The paper introduces the MIDI dataset and framework for benchmarking medical image de-identification, addressing privacy challenges in open-access data sharing.
Details
Motivation: To ensure patient privacy while promoting reproducibility and AI training in medical imaging research, given the limitations of existing de-identification tools.
Method: Developed the MIDI dataset with synthetic PHI/PII embedded in DICOM files and an evaluation framework for automated benchmarking.
Result: Created a dataset with 538 subjects and tools for objective assessment, aligning with HIPAA and DICOM standards.
Conclusion: The framework enables safer, standardized medical image sharing by improving de-identification workflow evaluation.
Abstract: Medical imaging research increasingly depends on large-scale data sharing to promote reproducibility and train Artificial Intelligence (AI) models. Ensuring patient privacy remains a significant challenge for open-access data sharing. Digital Imaging and Communications in Medicine (DICOM), the global standard data format for medical imaging, encodes both essential clinical metadata and extensive protected health information (PHI) and personally identifiable information (PII). Effective de-identification must remove identifiers, preserve scientific utility, and maintain DICOM validity. Tools exist to perform de-identification, but few assess its effectiveness, and most rely on subjective reviews, limiting reproducibility and regulatory confidence. To address this gap, we developed an openly accessible DICOM dataset infused with synthetic PHI/PII and an evaluation framework for benchmarking image de-identification workflows. The Medical Image de-identification (MIDI) dataset was built using publicly available de-identified data from The Cancer Imaging Archive (TCIA). It includes 538 subjects (216 for validation, 322 for testing), 605 studies, 708 series, and 53,581 DICOM image instances. These span multiple vendors, imaging modalities, and cancer types. Synthetic PHI and PII were embedded into structured data elements, plain text data elements, and pixel data to simulate real-world identity leaks encountered by TCIA curation teams. Accompanying evaluation tools include a Python script, answer keys (known truth), and mapping files that enable automated comparison of curated data against expected transformations. The framework is aligned with the HIPAA Privacy Rule “Safe Harbor” method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices. It supports objective, standards-driven evaluation of de-identification workflows, promoting safer and more consistent medical image sharing.
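The released answer keys enable automated checks of curated files. A minimal sketch of such a check with pydicom, assuming the curated file path and the expected tag values come from the MIDI answer keys (the specific names and values below are hypothetical):

```python
import pydicom  # assumes pydicom is installed

def check_deidentification(dicom_path, answer_key):
    """Compare a curated DICOM file against expected tag transformations.

    answer_key maps tag keywords (e.g. "PatientName") to the value the
    de-identified file should contain. Returns a list of failures.
    """
    ds = pydicom.dcmread(dicom_path)
    failures = []
    for keyword, expected in answer_key.items():
        actual = str(getattr(ds, keyword, ""))
        if actual != expected:
            failures.append((keyword, expected, actual))
    return failures

# After de-identification, identifying tags should hold anonymized values:
key = {"PatientName": "MIDI-0001", "PatientBirthDate": ""}
# print(check_deidentification("curated/case1.dcm", key))
```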
[325] InspectVLM: Unified in Theory, Unreliable in Practice
Conor Wallace, Isaac Corley, Jonathan Lwowski
Main category: cs.CV
TL;DR: InspectVLM, a unified vision-language model, shows promise for industrial inspection but falls short in robustness and visual grounding compared to traditional models.
Details
Motivation: To evaluate the viability of unified vision-language models (VLMs) for industrial inspection, aiming to simplify complex pipelines.
Method: Trained InspectVLM (based on Florence-2) on InspectMM, a large-scale multimodal dataset, and compared its performance to ResNet-based models.
Result: Competitive in classification and keypoint tasks but brittle under low prompt variability, poor in fine-grained detection, and prone to memorized responses.
Conclusion: Current VLMs lack the robustness and visual grounding needed for precision-critical industrial inspections despite their conceptual appeal.
Abstract: Unified vision-language models (VLMs) promise to streamline computer vision pipelines by reframing multiple visual tasks such as classification, detection, and keypoint localization within a single language-driven interface. This architecture is particularly appealing in industrial inspection, where managing disjoint task-specific models introduces complexity, inefficiency, and maintenance overhead. In this paper, we critically evaluate the viability of this unified paradigm using InspectVLM, a Florence-2-based VLM trained on InspectMM, our new large-scale multimodal, multitask inspection dataset. While InspectVLM performs competitively on image-level classification and structured keypoint tasks, we find that it fails to match traditional ResNet-based models in core inspection metrics. Notably, the model exhibits brittle behavior under low prompt variability, produces degenerate outputs for fine-grained object detection, and frequently defaults to memorized language responses regardless of visual input. Our findings suggest that while language-driven unification offers conceptual elegance, current VLMs lack the visual grounding and robustness necessary for deployment in precision-critical industrial inspection.
[326] IAUNet: Instance-Aware U-Net
Yaroslav Prytula, Illia Tsiporenko, Ali Zeynalli, Dmytro Fishman
Main category: cs.CV
TL;DR: IAUNet is a query-based U-Net architecture for biomedical instance segmentation, featuring a lightweight Pixel decoder and Transformer decoder, outperforming state-of-the-art models on multiple datasets.
Details
Motivation: To explore U-Net's potential in query-based segmentation and improve efficiency and accuracy for overlapping cell segmentation in biomedical imaging.
Method: Combines a full U-Net with a lightweight convolutional Pixel decoder and a Transformer decoder for multi-scale feature refinement.
Result: IAUNet outperforms state-of-the-art models on public and new datasets, setting a strong benchmark for cell instance segmentation.
Conclusion: IAUNet advances query-based segmentation in biomedical imaging, offering efficiency and high performance, supported by a new benchmark dataset.
Abstract: Instance segmentation is critical in biomedical imaging to accurately distinguish individual objects like cells, which often overlap and vary in size. Recent query-based methods, where object queries guide segmentation, have shown strong performance. While U-Net has been a go-to architecture in medical image segmentation, its potential in query-based approaches remains largely unexplored. In this work, we present IAUNet, a novel query-based U-Net architecture. The core design features a full U-Net architecture, enhanced by a novel lightweight convolutional Pixel decoder, making the model more efficient and reducing the number of parameters. Additionally, we propose a Transformer decoder that refines object-specific features across multiple scales. Finally, we introduce the 2025 Revvity Full Cell Segmentation Dataset, a unique resource with detailed annotations of overlapping cell cytoplasm in brightfield images, setting a new benchmark for biomedical instance segmentation. Experiments on multiple public datasets and our own show that IAUNet outperforms most state-of-the-art fully convolutional, transformer-based, query-based, and cell segmentation-specific models, setting a strong baseline for cell instance segmentation tasks. Code is available at https://github.com/SlavkoPrytula/IAUNet
[327] Proactive Disentangled Modeling of Trigger-Object Pairings for Backdoor Defense
Kyle Stein, Andrew A. Mahyari, Guillermo Francia III, Eman El-Sheikh
Main category: cs.CV
TL;DR: DBOM is a proactive framework using disentangled modeling to detect and neutralize backdoor threats in datasets, including unseen configurations, by leveraging VLMs and visual prompt tuning.
Details
Motivation: Address the vulnerability of DNNs and GenAI to multi-trigger backdoor attacks, which evade traditional detection methods.
Method: Uses Vision-Language Models (VLMs) to disentangle triggers and objects in embeddings via visual prompt tuning and separation losses.
Result: Effectively detects poisoned images in CIFAR-10 and GTSRB datasets, improving security of training pipelines.
Conclusion: DBOM offers robust, zero-shot generalization for detecting unseen backdoor threats, enhancing adversarial attack insights.
Abstract: Deep neural networks (DNNs) and generative AI (GenAI) are increasingly vulnerable to backdoor attacks, where adversaries embed triggers into inputs to cause models to misclassify or misinterpret target labels. Beyond traditional single-trigger scenarios, attackers may inject multiple triggers across various object classes, forming unseen backdoor-object configurations that evade standard detection pipelines. In this paper, we introduce DBOM (Disentangled Backdoor-Object Modeling), a proactive framework that leverages structured disentanglement to identify and neutralize both seen and unseen backdoor threats at the dataset level. Specifically, DBOM factorizes input image representations by modeling triggers and objects as independent primitives in the embedding space through the use of Vision-Language Models (VLMs). By leveraging the frozen, pre-trained encoders of VLMs, our approach decomposes the latent representations into distinct components through a learnable visual prompt repository and prompt prefix tuning, ensuring that the relationships between triggers and objects are explicitly captured. To separate trigger and object representations in the visual prompt repository, we introduce the trigger-object separation and diversity losses that aid in disentangling trigger and object visual features. Next, by aligning image features with feature decomposition and fusion, as well as learned contextual prompt tokens in a shared multimodal space, DBOM enables zero-shot generalization to novel trigger-object pairings that were unseen during training, thereby offering deeper insights into adversarial attack patterns. Experimental results on CIFAR-10 and GTSRB demonstrate that DBOM robustly detects poisoned images prior to downstream training, significantly enhancing the security of DNN training pipelines.
[328] CVD-SfM: A Cross-View Deep Front-end Structure-from-Motion System for Sparse Localization in Multi-Altitude Scenes
Yaxuan Li, Yewei Huang, Bijay Gaudel, Hamidreza Jafarnejadsani, Brendan Englot
Main category: cs.CV
TL;DR: A novel multi-altitude camera pose estimation system using cross-view transformers, deep features, and structure-from-motion, validated on new datasets, outperforms existing methods in accuracy and robustness.
Details
Motivation: Addressing challenges of robust and accurate localization across varied altitudes with sparse image input, especially for real-world robotic applications.
Method: Integrates cross-view transformer, deep features, and structure-from-motion into a unified framework. Introduces two new datasets for benchmarking.
Result: Superior performance in accuracy and robustness for multi-altitude sparse pose estimation compared to existing solutions.
Conclusion: The system is well-suited for real-world applications like aerial navigation, search and rescue, and automated inspection.
Abstract: We present a novel multi-altitude camera pose estimation system, addressing the challenges of robust and accurate localization across varied altitudes when only considering sparse image input. The system effectively handles diverse environmental conditions and viewpoint variations by integrating the cross-view transformer, deep features, and structure-from-motion into a unified framework. To benchmark our method and foster further research, we introduce two newly collected datasets specifically tailored for multi-altitude camera pose estimation; datasets of this nature remain rare in the current literature. The proposed framework has been validated through extensive comparative analyses on these datasets, demonstrating that our system achieves superior performance in both accuracy and robustness for multi-altitude sparse pose estimation tasks compared to existing solutions, making it well suited for real-world robotic applications such as aerial navigation, search and rescue, and automated inspection.
[329] Self-Supervised YOLO: Leveraging Contrastive Learning for Label-Efficient Object Detection
Manikanta Kotthapalli, Reshma Bhatia, Nainsi Jain
Main category: cs.CV
TL;DR: Contrastive self-supervised learning (SSL) pretraining for YOLO backbones reduces dependency on labeled data, improving performance in low-label regimes.
Details
Motivation: To reduce reliance on large labeled datasets for training one-stage object detectors like YOLO.
Method: Adapts YOLO backbones as encoders, uses SimCLR framework for SSL pretraining on unlabeled COCO images, and fine-tunes on a cyclist detection task.
Result: SSL-pretrained YOLOv8 achieves higher mAP (0.7663) and faster convergence than supervised training, especially with limited labels.
Conclusion: Contrastive SSL is effective for label-efficient object detection, leveraging unlabeled data as a scalable resource.
Abstract: One-stage object detectors such as the YOLO family achieve state-of-the-art performance in real-time vision applications but remain heavily reliant on large-scale labeled datasets for training. In this work, we present a systematic study of contrastive self-supervised learning (SSL) as a means to reduce this dependency by pretraining YOLOv5 and YOLOv8 backbones on unlabeled images using the SimCLR framework. Our approach introduces a simple yet effective pipeline that adapts YOLO’s convolutional backbones as encoders, employs global pooling and projection heads, and optimizes a contrastive loss using augmentations of the COCO unlabeled dataset (120k images). The pretrained backbones are then fine-tuned on a cyclist detection task with limited labeled data. Experimental results show that SSL pretraining leads to consistently higher mAP, faster convergence, and improved precision-recall performance, especially in low-label regimes. For example, our SimCLR-pretrained YOLOv8 achieves a mAP@50:95 of 0.7663, outperforming its supervised counterpart despite using no annotations during pretraining. These findings establish a strong baseline for applying contrastive SSL to one-stage detectors and highlight the potential of unlabeled data as a scalable resource for label-efficient object detection.
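The SimCLR objective used for pretraining is the NT-Xent contrastive loss; a compact PyTorch sketch over two batches of projected embeddings:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, tau=0.5):
    """SimCLR's NT-Xent loss over two batches of projected embeddings (N, d)."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d), unit norm
    sim = z @ z.t() / tau                                # scaled cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    # The positive for sample i is its other augmented view at index (i + n) mod 2N;
    # all remaining 2N - 2 samples in the batch act as negatives.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

print(nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128)).item())
```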
[330] IMoRe: Implicit Program-Guided Reasoning for Human Motion Q&A
Chen Li, Chinthani Sugandhika, Yeo Keat Ee, Eric Peh, Hao Zhang, Hong Yang, Deepu Rajan, Basura Fernando
Main category: cs.CV
TL;DR: IMoRe framework unifies motion reasoning without manual modules, using program-guided reading and iterative refinement for state-of-the-art performance.
Details
Motivation: Overcome limitations of explicit program execution in human motion Q&A by enabling scalable, adaptable implicit reasoning.
Method: Uses structured program functions for precise reasoning and a program-guided reading mechanism with a motion Vision Transformer.
Result: Achieves state-of-the-art on Babel-QA and generalizes to a new HuMMan-based dataset.
Conclusion: IMoRe demonstrates superior adaptability and performance in motion reasoning tasks.
Abstract: Existing human motion Q&A methods rely on explicit program execution, where the requirement for manually defined functional modules may limit the scalability and adaptability. To overcome this, we propose an implicit program-guided motion reasoning (IMoRe) framework that unifies reasoning across multiple query types without manually designed modules. Unlike existing implicit reasoning approaches that infer reasoning operations from question words, our model directly conditions on structured program functions, ensuring a more precise execution of reasoning steps. Additionally, we introduce a program-guided reading mechanism, which dynamically selects multi-level motion representations from a pretrained motion Vision Transformer (ViT), capturing both high-level semantics and fine-grained motion cues. The reasoning module iteratively refines memory representations, leveraging structured program functions to extract relevant information for different query types. Our model achieves state-of-the-art performance on Babel-QA and generalizes to a newly constructed motion Q&A dataset based on HuMMan, demonstrating its adaptability across different motion reasoning datasets. Code and dataset are available at: https://github.com/LUNAProject22/IMoRe.
[331] Deeply Dual Supervised learning for melanoma recognition
Rujosh Polma, Krishnan Menon Iyer
Main category: cs.CV
TL;DR: A novel Deeply Dual Supervised Learning framework improves melanoma recognition by integrating local and global features, outperforming state-of-the-art methods.
Details
Motivation: Existing models struggle to identify subtle visual cues differentiating melanoma from benign lesions, necessitating improved feature extraction.
Method: The framework uses a dual-pathway structure for local and global features, a dual attention mechanism, and multi-scale feature aggregation.
Result: The model achieves higher accuracy and better resilience against false positives on benchmark datasets.
Conclusion: The work advances automated skin cancer recognition and validates dual supervised learning in medical image analysis.
Abstract: As the application of deep learning in dermatology continues to grow, the recognition of melanoma has garnered significant attention, demonstrating potential for improving diagnostic accuracy. Despite advancements in image classification techniques, existing models still face challenges in identifying subtle visual cues that differentiate melanoma from benign lesions. This paper presents a novel Deeply Dual Supervised Learning framework that integrates local and global feature extraction to enhance melanoma recognition. By employing a dual-pathway structure, the model focuses on both fine-grained local features and broader contextual information, ensuring a comprehensive understanding of the image content. The framework utilizes a dual attention mechanism that dynamically emphasizes critical features, thereby reducing the risk of overlooking subtle characteristics of melanoma. Additionally, we introduce a multi-scale feature aggregation strategy to ensure robust performance across varying image resolutions. Extensive experiments on benchmark datasets demonstrate that our framework significantly outperforms state-of-the-art methods in melanoma detection, achieving higher accuracy and better resilience against false positives. This work lays the foundation for future research in automated skin cancer recognition and highlights the effectiveness of dual supervised learning in medical image analysis.
[332] Fast and Memory-efficient Non-line-of-sight Imaging with Quasi-Fresnel Transform
Yijun Wei, Jianyu Wang, Leping Xiao, Zuoqiang Shi, Xing Fu, Lingyun Qiu
Main category: cs.CV
TL;DR: A novel NLOS imaging method reduces computational and memory costs by using 2D functions and a Quasi-Fresnel transform, enabling real-time, high-resolution imaging on lightweight devices.
Details
Motivation: Existing NLOS imaging methods are computationally expensive and memory-intensive due to 3D modeling, limiting practical applications.
Method: Represents hidden scenes with 2D functions and uses a Quasi-Fresnel transform for direct inversion, reducing complexity.
Result: Significantly lowers runtime and memory usage while maintaining imaging quality, enabling use on lightweight devices.
Conclusion: This approach facilitates real-time, high-resolution NLOS imaging and expands its applicability to more platforms.
Abstract: Non-line-of-sight (NLOS) imaging seeks to reconstruct hidden objects by analyzing reflections from intermediary surfaces. Existing methods typically model both the measurement data and the hidden scene in three dimensions, overlooking the inherently two-dimensional nature of most hidden objects. This oversight leads to high computational costs and substantial memory consumption, limiting practical applications and making real-time, high-resolution NLOS imaging on lightweight devices challenging. In this paper, we introduce a novel approach that represents the hidden scene using two-dimensional functions and employs a Quasi-Fresnel transform to establish a direct inversion formula between the measurement data and the hidden scene. This transformation leverages the two-dimensional characteristics of the problem to significantly reduce computational complexity and memory requirements. Our algorithm efficiently performs fast transformations between these two-dimensional aggregated data, enabling rapid reconstruction of hidden objects with minimal memory usage. Compared to existing methods, our approach reduces runtime and memory demands by several orders of magnitude while maintaining imaging quality. The substantial reduction in memory usage not only enhances computational efficiency but also enables NLOS imaging on lightweight devices such as mobile and embedded systems. We anticipate that this method will facilitate real-time, high-resolution NLOS imaging and broaden its applicability across a wider range of platforms.
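The paper's Quasi-Fresnel transform is not reproduced here, but the generic pattern behind such direct 2D inversion formulas is easy to show: when the forward model reduces to a 2D convolution, inversion becomes a regularized pointwise division in Fourier space, so reconstruction costs a pair of 2D FFTs instead of a volumetric 3D solve. An illustrative sketch under that toy assumption:

```python
import numpy as np

def invert_convolution(measurement, kernel, eps=1e-3):
    """Wiener-style deconvolution: invert a 2D-convolution forward model
    by regularized pointwise division in Fourier space."""
    H = np.fft.fft2(kernel, s=measurement.shape)
    M = np.fft.fft2(measurement)
    return np.real(np.fft.ifft2(M * np.conj(H) / (np.abs(H) ** 2 + eps)))

scene = np.zeros((128, 128)); scene[40:50, 60:70] = 1.0   # hidden 2D scene
kernel = np.outer(np.hanning(15), np.hanning(15))          # toy blur kernel
meas = np.real(np.fft.ifft2(np.fft.fft2(scene) * np.fft.fft2(kernel, s=scene.shape)))
print(invert_convolution(meas, kernel).max())              # scene peak recovered
```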
[333] Devil is in the Detail: Towards Injecting Fine Details of Image Prompt in Image Generation via Conflict-free Guidance and Stratified Attention
Kyungmin Jo, Jooyeol Yun, Jaegul Choo
Main category: cs.CV
TL;DR: The paper addresses limitations in text-to-image diffusion models by improving image prompts through conflict-free guidance and a new self-attention method, Stratified Attention, to better reflect user intent.
Details
Motivation: Existing text-to-image models struggle with intricate details like textures, and current image-prompting methods have conflicting signals in classifier-free guidance and a trade-off between realism and alignment.
Method: Proposes conflict-free guidance to avoid conflicting signals and Stratified Attention to balance realism and alignment by jointly using keys and values from both image prompts and generated images.
Result: The method outperforms existing models in faithfully reflecting image prompts across three image generation tasks.
Conclusion: The proposed approach resolves key issues in image-prompting, improving detail fidelity and balancing realism with alignment.
Abstract: While large-scale text-to-image diffusion models enable the generation of high-quality, diverse images from text prompts, these prompts struggle to capture intricate details, such as textures, preventing the user intent from being reflected. This limitation has led to efforts to generate images conditioned on user-provided images, referred to as image prompts. Recent work modifies the self-attention mechanism to impose image conditions in generated images by replacing or concatenating the keys and values from the image prompt. This enables the self-attention layer to work like a cross-attention layer, generally used to incorporate text prompts. In this paper, we identify two common issues in existing methods of modifying self-attention to generate images that reflect the details of image prompts. First, existing approaches neglect the importance of image prompts in classifier-free guidance. Specifically, current methods use image prompts as both desired and undesired conditions in classifier-free guidance, causing conflicting signals. To resolve this, we propose conflict-free guidance by using image prompts only as desired conditions, ensuring that the generated image faithfully reflects the image prompt. In addition, we observe that the two most common self-attention modifications involve a trade-off between the realism of the generated image and alignment with the image prompt. Specifically, selecting more keys and values from the image prompt improves alignment, while selecting more from the generated image enhances realism. To balance both, we propose a new self-attention modification method, Stratified Attention, to jointly use keys and values from both images rather than selecting between them. Through extensive experiments across three image generation tasks, we show that the proposed method outperforms existing image-prompting models in faithfully reflecting the image prompt.
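The core move described, attending over keys and values drawn from both the generated image and the image prompt instead of picking one source, can be sketched in a few lines; the paper's Stratified Attention may weight or stratify the two streams differently than this plain concatenation:

```python
import torch
import torch.nn.functional as F

def joint_kv_attention(q, k_gen, v_gen, k_ref, v_ref):
    """Attend over keys/values from BOTH the generated image (gen) and the
    image prompt (ref) rather than selecting one source."""
    k = torch.cat([k_gen, k_ref], dim=1)   # (B, L_gen + L_ref, d)
    v = torch.cat([v_gen, v_ref], dim=1)
    attn = F.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
    return attn @ v                         # (B, L_q, d)

B, L, d = 1, 16, 64
t = lambda: torch.randn(B, L, d)
print(joint_kv_attention(t(), t(), t(), t(), t()).shape)  # torch.Size([1, 16, 64])
```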
[334] Bench2ADVLM: A Closed-Loop Benchmark for Vision-language Models in Autonomous Driving
Tianyuan Zhang, Ting Jin, Lu Wang, Jiangfan Liu, Siyuan Liang, Mingchuan Zhang, Aishan Liu, Xianglong Liu
Main category: cs.CV
TL;DR: Bench2ADVLM introduces a closed-loop evaluation framework for Vision-Language Models (VLMs) in autonomous driving, addressing gaps in current open-loop assessments by integrating simulation and real-world testing.
Details
Motivation: Current VLM-based AD evaluations lack realism by focusing on open-loop settings, ignoring interactive and safety-critical aspects of real-world driving.
Method: The framework uses a dual-system adaptation architecture to translate high-level VLM commands into mid-level simulation actions and low-level real-world signals. It includes a self-reflective scenario generator for safety testing.
Result: Experiments show existing ADVLMs perform poorly in closed-loop conditions, highlighting the framework’s diagnostic capabilities.
Conclusion: Bench2ADVLM provides a comprehensive, hierarchical evaluation pipeline for ADVLMs, revealing limitations in current models under realistic conditions.
Abstract: Vision-Language Models (VLMs) have recently emerged as a promising paradigm in autonomous driving (AD). However, current performance evaluation protocols for VLM-based AD systems (ADVLMs) are predominantly confined to open-loop settings with static inputs, neglecting the more realistic and informative closed-loop setting that captures interactive behavior, feedback resilience, and real-world safety. To address this, we introduce Bench2ADVLM, a unified hierarchical closed-loop evaluation framework for real-time, interactive assessment of ADVLMs across both simulation and physical platforms. Inspired by dual-process theories of cognition, we first adapt diverse ADVLMs to simulation environments via a dual-system adaptation architecture. In this design, heterogeneous high-level driving commands generated by target ADVLMs (fast system) are interpreted by a general-purpose VLM (slow system) into standardized mid-level control actions suitable for execution in simulation. To bridge the gap between simulation and reality, we design a physical control abstraction layer that translates these mid-level actions into low-level actuation signals, enabling, for the first time, closed-loop testing of ADVLMs on physical vehicles. To enable more comprehensive evaluation, Bench2ADVLM introduces a self-reflective scenario generation module that automatically explores model behavior and uncovers potential failure modes for safety-critical scenario generation. Overall, Bench2ADVLM establishes a hierarchical evaluation pipeline that seamlessly integrates high-level abstract reasoning, mid-level simulation actions, and low-level real-world execution. Experiments on diverse scenarios across multiple state-of-the-art ADVLMs and physical platforms validate the diagnostic strength of our framework, revealing that existing ADVLMs still exhibit limited performance under closed-loop conditions.
[335] VLM4D: Towards Spatiotemporal Awareness in Vision Language Models
Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Xin Eric Wang, Achuta Kadambi
Main category: cs.CV
TL;DR: VLM4D is a new benchmark for evaluating spatiotemporal reasoning in vision language models (VLMs), revealing gaps in dynamic understanding compared to humans.
Details
Motivation: Current VLMs lack robust dynamic spatiotemporal reasoning, unlike humans who effortlessly track object movements and perspective shifts.
Method: VLM4D uses diverse real-world and synthetic videos with curated question-answer pairs to test VLMs on motion, rotation, and perspective awareness.
Result: State-of-the-art VLMs show significant performance gaps compared to humans, struggling with visual cue integration and temporal coherence.
Conclusion: Targeted fine-tuning and 4D feature reconstruction show promise for improving VLMs, encouraging further research in dynamic visual intelligence.
Abstract: Vision language models (VLMs) have shown remarkable capabilities in integrating linguistic and visual reasoning but remain fundamentally limited in understanding dynamic spatiotemporal interactions. Humans effortlessly track and reason about object movements, rotations, and perspective shifts, abilities essential for robust dynamic real-world understanding yet notably lacking in current VLMs. In this paper, we introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal reasoning capabilities of VLMs. Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs emphasizing translational and rotational motions, perspective awareness, and motion continuity. Through comprehensive evaluations of state-of-the-art open and closed-source VLMs, we identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models. Extensive analysis reveals that VLMs struggle particularly with integrating multiple visual cues and maintaining temporal coherence. We further explore promising directions, such as leveraging 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning, demonstrating their effectiveness in enhancing spatiotemporal comprehension. Our work aims to encourage deeper exploration into improving VLMs’ spatial and temporal grounding, paving the way towards more capable and reliable visual intelligence for dynamic environments.
[336] Protego: User-Centric Pose-Invariant Privacy Protection Against Face Recognition-Induced Digital Footprint Exposure
Ziling Wang, Shuya Yang, Jialin Lu, Ka-Ho Chow
Main category: cs.CV
TL;DR: Protego is a privacy protection method for facial images, using 3D masks to prevent retrieval-based intrusions, outperforming existing methods by 2x.
Details
Motivation: Addressing privacy concerns from FR technologies like Clearview AI, which expose personal data without consent.
Method: Encapsulates 3D facial signatures into pose-invariant 2D representations, dynamically deformed into natural-looking 3D masks.
Result: Reduces retrieval accuracy significantly across black-box FR models, with 2x better performance than existing methods.
Conclusion: Protego combats misuse of FR for surveillance and identity tracing, ensuring privacy and visual coherence.
Abstract: Face recognition (FR) technologies are increasingly used to power large-scale image retrieval systems, raising serious privacy concerns. Services like Clearview AI and PimEyes allow anyone to upload a facial photo and retrieve a large amount of online content associated with that person. This not only enables identity inference but also exposes their digital footprint, such as social media activity, private photos, and news reports, often without their consent. In response to this emerging threat, we propose Protego, a user-centric privacy protection method that safeguards facial images from such retrieval-based privacy intrusions. Protego encapsulates a user’s 3D facial signatures into a pose-invariant 2D representation, which is dynamically deformed into a natural-looking 3D mask tailored to the pose and expression of any facial image of the user, and applied prior to online sharing. Motivated by a critical limitation of existing methods, Protego amplifies the sensitivity of FR models so that protected images cannot be matched even among themselves. Experiments show that Protego significantly reduces retrieval accuracy across a wide range of black-box FR models and performs at least 2x better than existing methods. It also offers unprecedented visual coherence, particularly in video settings where consistency and natural appearance are essential. Overall, Protego contributes to the fight against the misuse of FR for mass surveillance and unsolicited identity tracing.
[337] Conditional Diffusion Model with Anatomical-Dose Dual Constraints for End-to-End Multi-Tumor Dose Prediction
Hui Xie, Haiqin Hu, Lijuan Ding, Qing Li, Yue Sun, Tao Tan
Main category: cs.CV
TL;DR: ADDiff-Dose is a novel conditional diffusion model for radiotherapy dose prediction, improving accuracy, efficiency, and clinical applicability over traditional methods.
Details
Motivation: Current radiotherapy planning is time-consuming and expert-dependent, while existing deep learning methods lack generalization and accuracy.
Method: The model uses LightweightVAE3D for CT data compression, integrates multimodal inputs, and employs a progressive noise addition and denoising framework with multi-head attention and a composite loss function.
Result: Outperforms baselines with MAE of 0.101-0.154, DICE coefficient of 0.927, and reduces planning time to 22 seconds per case.
Conclusion: ADDiff-Dose offers a generalizable, efficient solution for automated radiotherapy planning, significantly improving workflow efficiency.
Abstract: Radiotherapy treatment planning often relies on time-consuming, trial-and-error adjustments that heavily depend on the expertise of specialists, while existing deep learning methods face limitations in generalization, prediction accuracy, and clinical applicability. To tackle these challenges, we propose ADDiff-Dose, an Anatomical-Dose Dual Constraints Conditional Diffusion Model for end-to-end multi-tumor dose prediction. The model employs LightweightVAE3D to compress high-dimensional CT data and integrates multimodal inputs, including target and organ-at-risk (OAR) masks and beam parameters, within a progressive noise addition and denoising framework. It incorporates conditional features via a multi-head attention mechanism and utilizes a composite loss function combining MSE, conditional terms, and KL divergence to ensure both dosimetric accuracy and compliance with clinical constraints. Evaluation on a large-scale public dataset (2,877 cases) and three external institutional cohorts (450 cases in total) demonstrates that ADDiff-Dose significantly outperforms traditional baselines, achieving an MAE of 0.101-0.154 (compared to 0.316 for UNet and 0.169 for GAN models), a DICE coefficient of 0.927 (a 6.8% improvement), and limiting spinal cord maximum dose error to within 0.1 Gy. The average plan generation time per case is reduced to 22 seconds. Ablation studies confirm that the structural encoder enhances compliance with clinical dose constraints by 28.5%. To our knowledge, this is the first study to introduce a conditional diffusion model framework for radiotherapy dose prediction, offering a generalizable and efficient solution for automated treatment planning across diverse tumor sites, with the potential to substantially reduce planning time and improve clinical workflow efficiency.
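The composite objective described (MSE plus conditional terms plus KL divergence) can be sketched as follows; the weights and the exact form of the clinical-constraint penalty are hypothetical, since the paper's coefficients are not given here:

```python
import torch
import torch.nn.functional as F

def composite_dose_loss(pred, target, mu, logvar, oar_mask=None,
                        max_dose=None, w_cond=1.0, w_kl=1e-4):
    """MSE + clinical-constraint penalty + KL divergence (weights hypothetical)."""
    loss = F.mse_loss(pred, target)
    if oar_mask is not None and max_dose is not None:
        # Penalize predicted dose exceeding a clinical limit inside an organ-at-risk.
        loss = loss + w_cond * (F.relu(pred - max_dose) * oar_mask).mean()
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return loss + w_kl * kl

pred, target = torch.rand(1, 32, 32), torch.rand(1, 32, 32)
mu, logvar = torch.zeros(8), torch.zeros(8)
print(composite_dose_loss(pred, target, mu, logvar).item())
```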
[338] Mapillary Vistas Validation for Fine-Grained Traffic Signs: A Benchmark Revealing Vision-Language Model Limitations
Sparsh Garg, Abhishek Aich
Main category: cs.CV
TL;DR: A new validation set for traffic signs (MVV) is introduced, derived from Mapillary, with fine-grained annotations. DINOv2 outperforms VLMs in recognition tasks, highlighting limitations in current models for fine-grained understanding.
Details
Motivation: Existing datasets lack fine-grained annotations for traffic signs, which are crucial for autonomous driving. This work aims to improve accuracy and safety by providing detailed labels.
Method: The MVV dataset is created by decomposing composite traffic signs into granular categories, with pixel-level instance masks manually annotated. DINOv2 and VLMs are benchmarked on this dataset.
Result: DINOv2 consistently outperforms VLMs in traffic sign recognition and other categories, revealing limitations in current vision-language models.
Conclusion: The MVV dataset and DINOv2 benchmark advance fine-grained visual understanding, supporting more reliable perception systems for autonomous driving.
Abstract: Obtaining high-quality fine-grained annotations for traffic signs is critical for accurate and safe decision-making in autonomous driving. Widely used datasets, such as Mapillary, often provide only coarse-grained labels, without distinguishing semantically important types such as stop signs or speed limit signs. To this end, we present a new validation set for traffic signs derived from the Mapillary dataset called Mapillary Vistas Validation for Traffic Signs (MVV), where we decompose composite traffic signs into granular, semantically meaningful categories. The dataset includes pixel-level instance masks and has been manually annotated by expert annotators to ensure label fidelity. Further, we benchmark several state-of-the-art VLMs against the self-supervised DINOv2 model on this dataset and show that DINOv2 consistently outperforms all VLM baselines, not only on traffic sign recognition, but also on heavily represented categories like vehicles and humans. Our analysis reveals significant limitations in current vision-language models for fine-grained visual understanding and establishes DINOv2 as a strong baseline for dense semantic matching in autonomous driving scenarios. This dataset and evaluation framework pave the way for more reliable, interpretable, and scalable perception systems. Code and data are available at: https://github.com/nec-labs-ma/relabeling
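The DINOv2 baseline is straightforward to reproduce at the feature level: the official backbones are published via torch.hub, and embeddings can be matched by cosine similarity. A minimal sketch (the benchmark's actual matching protocol is not reproduced here, and loading downloads weights from the hub):

```python
import torch
import torch.nn.functional as F

# Official DINOv2 backbones are published via torch.hub.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

with torch.no_grad():
    crops = torch.randn(2, 3, 224, 224)            # two normalized sign crops
    feats = model(crops)                           # (2, 384) global embeddings
    sim = F.cosine_similarity(feats[0], feats[1], dim=0)
print(sim.item())  # match a query crop to class prototypes by nearest neighbor
```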
[339] DreamPainter: Image Background Inpainting for E-commerce Scenarios
Sijie Zhao, Jing Cheng, Yaoyao Wu, Hao Xu, Shaohui Jiao
Main category: cs.CV
TL;DR: The paper introduces DreamEcom-400K, a dataset for e-commerce background generation, and DreamPainter, a framework combining text and visual controls to address challenges in product-background consistency and precise image control.
Details
Motivation: Challenges in e-commerce background generation include maintaining product consistency and spatial harmony, and the limitations of text-only prompts for precise control.
Method: Proposes DreamPainter, a framework using the DreamEcom-400K dataset, integrating text prompts and reference images for control.
Result: Outperforms state-of-the-art methods, achieving high product consistency and effective integration of control signals.
Conclusion: DreamPainter and DreamEcom-400K address key challenges in e-commerce image generation, offering improved performance and flexibility.
Abstract: Although diffusion-based image generation has been widely explored and applied, background generation tasks in e-commerce scenarios still face significant challenges. The first challenge is to ensure that the generated products are consistent with the given product inputs while maintaining a reasonable spatial arrangement, harmonious shadows, and reflections between foreground products and backgrounds. Existing inpainting methods fail to address this due to the lack of domain-specific data. The second challenge involves the limitation of relying solely on text prompts for image control, as effectively integrating visual information to achieve precise control in inpainting tasks remains underexplored. To address these challenges, we introduce DreamEcom-400K, a high-quality e-commerce dataset containing accurate product instance masks, background reference images, text prompts, and aesthetically pleasing product images. Based on this dataset, we propose DreamPainter, a novel framework that not only utilizes text prompts for control but also flexibly incorporates reference image information as an additional control signal. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, maintaining high product consistency while effectively integrating both text prompt and reference image information.
[340] HCF: Hierarchical Cascade Framework for Distributed Multi-Stage Image Compression
Junhao Cai, Taegun An, Chengjun Jin, Sung Il Choi, JuHyun Park, Changhee Joo
Main category: cs.CV
TL;DR: The paper introduces the Hierarchical Cascade Framework (HCF) for efficient distributed multi-stage image compression, outperforming existing methods in performance and computational efficiency.
Details
Motivation: Addressing inefficiencies in progressive and successive compression methods, and the lack of flexibility in fixed-parameter models for distributed multi-stage image compression.
Method: Developed HCF with latent-space transformations, policy-driven quantization control, and edge quantization principles for optimized rate-distortion trade-offs.
Result: Achieved up to 0.6dB PSNR gains, 5.56% BD-Rate improvement, and significant computational savings (97.8% FLOPs, 96.5% GPU memory, 90.0% execution time).
Conclusion: HCF is a superior solution for distributed multi-stage image compression, offering high performance, efficiency, and adaptability.
Abstract: Distributed multi-stage image compression – where visual content traverses multiple processing nodes under varying quality requirements – poses challenges. Progressive methods enable bitstream truncation but underutilize available compute resources; successive compression repeats costly pixel-domain operations and suffers cumulative quality loss and inefficiency; fixed-parameter models lack post-encoding flexibility. In this work, we developed the Hierarchical Cascade Framework (HCF) that achieves high rate-distortion performance and better computational efficiency through direct latent-space transformations across network nodes in distributed multi-stage image compression systems. Under HCF, we introduced policy-driven quantization control to optimize rate-distortion trade-offs, and established the edge quantization principle through differential entropy analysis. The configuration based on this principle demonstrates up to 0.6dB PSNR gains over other configurations. When comprehensively evaluated on the Kodak, CLIC, and CLIC2020-mobile datasets, HCF outperforms successive-compression methods by up to 5.56% BD-Rate in PSNR on CLIC, while saving up to 97.8% FLOPs, 96.5% GPU memory, and 90.0% execution time. It also outperforms state-of-the-art progressive compression methods by up to 12.64% BD-Rate on Kodak and enables retraining-free cross-quality adaptation with 7.13-10.87% BD-Rate reductions on CLIC2020-mobile.
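The headline comparisons are in BD-Rate, the Bjontegaard delta between two rate-distortion curves. A self-contained sketch of the standard computation (cubic fit of log-rate against PSNR, integrated over the shared quality range):

```python
import numpy as np

def bd_rate(rate_a, psnr_a, rate_b, psnr_b):
    """Bjontegaard delta rate: average % bitrate change of codec B vs. codec A
    at equal quality."""
    lr_a, lr_b = np.log(rate_a), np.log(rate_b)
    pa, pb = np.polyfit(psnr_a, lr_a, 3), np.polyfit(psnr_b, lr_b, 3)
    lo = max(min(psnr_a), min(psnr_b))     # overlapping PSNR range
    hi = min(max(psnr_a), max(psnr_b))
    ia, ib = np.polyint(pa), np.polyint(pb)
    avg_a = (np.polyval(ia, hi) - np.polyval(ia, lo)) / (hi - lo)
    avg_b = (np.polyval(ib, hi) - np.polyval(ib, lo)) / (hi - lo)
    return (np.exp(avg_b - avg_a) - 1.0) * 100.0

# Four (bitrate in bpp, PSNR in dB) points per codec, as in typical RD sweeps;
# codec B spends 10% fewer bits at every quality level, so BD-Rate is about -10%.
print(bd_rate([0.1, 0.2, 0.4, 0.8], [30, 33, 36, 39],
              [0.09, 0.18, 0.36, 0.72], [30, 33, 36, 39]))
```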
[341] StarPose: 3D Human Pose Estimation via Spatial-Temporal Autoregressive Diffusion
Haoxin Yang, Weihong Chen, Xuemiao Xu, Cheng Xu, Peng Xiao, Cuifeng Sun, Shaoyu Huang, Shengfeng He
Main category: cs.CV
TL;DR: StarPose is an autoregressive diffusion framework for monocular 3D human pose estimation, enhancing accuracy and temporal coherence by integrating historical pose predictions and spatial-temporal physical guidance.
Details
Motivation: Existing diffusion-based methods lack spatial-temporal correlations, leading to inconsistent and inaccurate 3D pose sequences.
Method: StarPose models the 2D-to-3D pose mapping as an autoregressive diffusion process, using a Historical Pose Integration Module (HPIM) and Spatial-Temporal Physical Guidance (STPG) mechanism.
Result: StarPose outperforms state-of-the-art methods in accuracy and temporal consistency on benchmark datasets.
Conclusion: The proposed framework effectively addresses limitations of prior methods, delivering robust and realistic 3D pose estimates.
Abstract: Monocular 3D human pose estimation remains a challenging task due to inherent depth ambiguities and occlusions. Compared to traditional methods based on Transformers or Convolutional Neural Networks (CNNs), recent diffusion-based approaches have shown superior performance, leveraging their probabilistic nature and high-fidelity generation capabilities. However, these methods often fail to account for the spatial and temporal correlations across predicted frames, resulting in limited temporal consistency and inferior accuracy in predicted 3D pose sequences. To address these shortcomings, this paper proposes StarPose, an autoregressive diffusion framework that effectively incorporates historical 3D pose predictions and spatial-temporal physical guidance to significantly enhance both the accuracy and temporal coherence of pose predictions. Unlike existing approaches, StarPose models the 2D-to-3D pose mapping as an autoregressive diffusion process. By synergistically integrating previously predicted 3D poses with 2D pose inputs via a Historical Pose Integration Module (HPIM), the framework generates rich and informative historical pose embeddings that guide subsequent denoising steps, ensuring temporally consistent predictions. In addition, a fully plug-and-play Spatial-Temporal Physical Guidance (STPG) mechanism is tailored to refine the denoising process in an iterative manner, which further enforces spatial anatomical plausibility and temporal motion dynamics, rendering robust and realistic pose estimates. Extensive experiments on benchmark datasets demonstrate that StarPose outperforms state-of-the-art methods, achieving superior accuracy and temporal consistency in 3D human pose estimation. Code is available at https://github.com/wileychan/StarPose.
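The autoregressive structure can be sketched independently of the paper's architecture: each frame is denoised from noise while conditioning on the 2D input and on previously predicted 3D poses. Everything below is a toy stand-in; a running mean replaces the learned HPIM embedding, and the "denoiser" is a one-line lambda:

```python
import torch

def autoregressive_lift(model, pose2d_seq, history=4, steps=10):
    """Each frame's 3D pose is denoised from noise, conditioned on the 2D
    input and on previously predicted 3D poses."""
    preds = []
    for pose2d in pose2d_seq:
        hist = (torch.stack(preds[-history:]).mean(0) if preds
                else torch.zeros_like(pose2d))
        x = torch.randn_like(pose2d)              # start from pure noise
        for _ in range(steps):
            x = model(x, pose2d, hist)            # one denoising update
        preds.append(x)
    return torch.stack(preds)

# Toy "denoiser": nudges the sample toward the conditioning signals.
model = lambda x, c, h: x + 0.2 * ((c + h) / 2 - x)
out = autoregressive_lift(model, [torch.randn(17 * 3) for _ in range(5)])
print(out.shape)  # torch.Size([5, 51])
```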
[342] YOLOv1 to YOLOv11: A Comprehensive Survey of Real-Time Object Detection Innovations and Challenges
Manikanta Kotthapalli, Deepika Ravipati, Reshma Bhatia
Main category: cs.CV
TL;DR: A review of the YOLO family’s evolution, innovations, performance, and applications in computer vision.
Details
Motivation: To comprehensively analyze the advancements and impact of YOLO models in real-time object detection and beyond.
Method: Review of architectural and algorithmic innovations across YOLO versions, performance benchmarks, and extended capabilities.
Result: YOLO models have significantly improved speed, accuracy, and versatility, expanding into tasks like segmentation and domain-specific applications.
Conclusion: The YOLO family continues to evolve, with potential for further impact across diverse computer vision domains.
Abstract: Over the past decade, object detection has advanced significantly, with the YOLO (You Only Look Once) family of models transforming the landscape of real-time vision applications through unified, end-to-end detection frameworks. From YOLOv1’s pioneering regression-based detection to the latest YOLOv9, each version has systematically enhanced the balance between speed, accuracy, and deployment efficiency through continuous architectural and algorithmic advancements. Beyond core object detection, modern YOLO architectures have expanded to support tasks such as instance segmentation, pose estimation, object tracking, and domain-specific applications including medical imaging and industrial automation. This paper offers a comprehensive review of the YOLO family, highlighting architectural innovations, performance benchmarks, extended capabilities, and real-world use cases. We critically analyze the evolution of YOLO models and discuss emerging research directions that extend their impact across diverse computer vision domains.
[343] Forecasting When to Forecast: Accelerating Diffusion Models with Confidence-Gated Taylor
Xiaoliu Guan, Lielin Jiang, Hanqi Chen, Xu Zhang, Jiaxing Yan, Guanzhong Wang, Yi Liu, Zetao Zhang, Yu Wu
Main category: cs.CV
TL;DR: The paper proposes a dynamic Taylor-based acceleration method for Diffusion Transformers (DiTs) to improve inference speed while maintaining quality. It shifts prediction to the last block level and introduces a reliability indicator for dynamic caching.
Details
Motivation: Current training-free acceleration methods for DiTs suffer from memory overhead and fixed caching schedules, leading to degraded outputs when predictions fail.
Method: The approach shifts Taylor prediction to the last block level to reduce cached features and uses the error of the first block’s prediction as a reliability indicator for dynamic caching.
Result: Empirical results show significant speedups (3.17x on FLUX, 2.36x on DiT, 4.14x on Wan Video) with negligible quality drop.
Conclusion: The proposed method effectively balances speed and quality, addressing limitations of prior approaches.
Abstract: Diffusion Transformers (DiTs) have demonstrated remarkable performance in visual generation tasks. However, their low inference speed limits their deployment in low-resource applications. Recent training-free approaches exploit the redundancy of features across timesteps by caching and reusing past representations to accelerate inference. Building on this idea, TaylorSeer instead uses cached features to predict future ones via Taylor expansion. However, its module-level prediction across all transformer blocks (e.g., attention or feedforward modules) requires storing fine-grained intermediate features, leading to notable memory and computation overhead. Moreover, it adopts a fixed caching schedule without considering the varying accuracy of predictions across timesteps, which can lead to degraded outputs when prediction fails. To address these limitations, we propose a novel approach to better leverage Taylor-based acceleration. First, we shift the Taylor prediction target from the module level to the last block level, significantly reducing the number of cached features. Furthermore, observing strong sequential dependencies among Transformer blocks, we propose to use the error between the Taylor-estimated and actual outputs of the first block as an indicator of prediction reliability. If the error is small, we trust the Taylor prediction for the last block; otherwise, we fall back to full computation, thereby enabling a dynamic caching mechanism. Empirical results show that our method achieves a better balance between speed and quality, achieving a 3.17x acceleration on FLUX, 2.36x on DiT, and 4.14x on Wan Video with negligible quality drop. The project page is available at https://cg-taylor-acce.github.io/CG-Taylor/.
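The two mechanisms described, a Taylor forecast of future features from cached ones and an error-gated fallback to full computation, fit in a few lines. A hedged sketch (the threshold and the exact error metric are hypothetical):

```python
import torch

def taylor_predict(f_prev, f_prev2, dt=1.0):
    """First-order Taylor forecast from two cached features:
    f(t + dt) is approximated by f(t) + dt * (f(t) - f(t - dt))."""
    return f_prev + dt * (f_prev - f_prev2)

def gated_last_block(first_pred, first_true, last_pred, full_compute, thresh=0.05):
    """Trust the cheap forecast for the last block only when the first
    block's forecast error is small; otherwise fall back to full compute."""
    err = (first_pred - first_true).norm() / first_true.norm()
    return last_pred if err < thresh else full_compute()

f1, f2 = torch.randn(64), torch.randn(64)
print(taylor_predict(f1, f2).shape)  # torch.Size([64])
```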
[344] S-RRG-Bench: Structured Radiology Report Generation with Fine-Grained Evaluation Framework
Yingshu Li, Yunyi Liu, Zhanyu Wang, Xinyu Liang, Lingqiao Liu, Lei Wang, Luping Zhou
Main category: cs.CV
TL;DR: A novel approach to structured radiology report generation (S-RRG) is introduced, including dataset construction, model training, and a new evaluation framework, aiming to improve clinical relevance and report quality.
Details
Motivation: Traditional free-text radiology reports are redundant and inconsistent, while existing structured approaches lack expressiveness and omit critical details.Method: The work involves creating a structured chest X-ray dataset (MIMIC-STRUC), training an LLM-based model for report generation, and introducing a specialized evaluation metric (S-Score).
Result: The approach generates standardized, high-quality reports and proposes a clinically meaningful evaluation metric aligned with human assessments.
Conclusion: Structured reports and tailored evaluation metrics enhance clinical relevance and report quality in S-RRG.
Abstract: Radiology report generation (RRG) for diagnostic images, such as chest X-rays, plays a pivotal role in both clinical practice and AI. Traditional free-text reports suffer from redundancy and inconsistent language, complicating the extraction of critical clinical details. Structured radiology report generation (S-RRG) offers a promising solution by organizing information into standardized, concise formats. However, existing approaches often rely on classification or visual question answering (VQA) pipelines that require predefined label sets and produce only fragmented outputs. Template-based approaches, which generate reports by replacing keywords within fixed sentence patterns, further compromise expressiveness and often omit clinically important details. In this work, we present a novel approach to S-RRG that includes dataset construction, model training, and the introduction of a new evaluation framework. We first create a robust chest X-ray dataset (MIMIC-STRUC) that includes disease names, severity levels, probabilities, and anatomical locations, ensuring that the dataset is both clinically relevant and well-structured. We train an LLM-based model to generate standardized, high-quality reports. To assess the generated reports, we propose a specialized evaluation metric (S-Score) that not only measures disease prediction accuracy but also evaluates the precision of disease-specific details, thus offering a clinically meaningful metric for report quality that focuses on elements critical to clinical decision-making and demonstrates a stronger alignment with human assessments. Our approach highlights the effectiveness of structured reports and the importance of a tailored evaluation metric for S-RRG, providing a more clinically relevant measure of report quality.
[345] Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis
Kaiyang Ji, Ye Shi, Zichen Jin, Kangyi Chen, Lan Xu, Yuexin Ma, Jingyi Yu, Jingya Wang
Main category: cs.CV
TL;DR: Human-X is a real-time framework for physically plausible human interactions in VR/AR and robotics, addressing responsiveness, feasibility, and safety.
Details
Motivation: Existing methods lack real-time responsiveness and physical feasibility in dynamic human-machine interactions.Method: Uses an auto-regressive reaction diffusion planner for joint action-reaction prediction and integrates reinforcement learning for motion tracking.
Result: Shows improved motion quality, interaction continuity, and physical plausibility on Inter-X and InterHuman datasets.
Conclusion: Validated in real-world applications, Human-X advances human-robot collaboration.
Abstract: Real-time synthesis of physically plausible human interactions remains a critical challenge for immersive VR/AR systems and humanoid robotics. While existing methods demonstrate progress in kinematic motion generation, they often fail to address the fundamental tension between real-time responsiveness, physical feasibility, and safety requirements in dynamic human-machine interactions. We introduce Human-X, a novel framework designed to enable immersive and physically plausible human interactions across diverse entities, including human-avatar, human-humanoid, and human-robot systems. Unlike existing approaches that focus on post-hoc alignment or simplified physics, our method jointly predicts actions and reactions in real-time using an auto-regressive reaction diffusion planner, ensuring seamless synchronization and context-aware responses. To enhance physical realism and safety, we integrate an actor-aware motion tracking policy trained with reinforcement learning, which dynamically adapts to interaction partners’ movements while avoiding artifacts like foot sliding and penetration. Extensive experiments on the Inter-X and InterHuman datasets demonstrate significant improvements in motion quality, interaction continuity, and physical plausibility over state-of-the-art methods. Our framework is validated in real-world applications, including virtual reality interface for human-robot interaction, showcasing its potential for advancing human-robot collaboration.
[346] AutoLoRA: Automatic LoRA Retrieval and Fine-Grained Gated Fusion for Text-to-Image Generation
Zhiwen Li, Zhongjie Duan, Die Chen, Cen Chen, Daoyuan Chen, Yaliang Li, Yingda Chen
Main category: cs.CV
TL;DR: A novel framework for semantic-driven LoRA retrieval and dynamic aggregation addresses challenges in deploying photorealistic image generation models, improving performance and scalability.
Details
Motivation: Current photorealistic image generation models face deployment constraints due to parameter fine-tuning intractability and challenges in utilizing distributed LoRA modules effectively.Method: The framework includes a weight encoding-based LoRA retriever and a fine-grained gated fusion mechanism for dynamic aggregation.
Result: Significant improvement in image generation performance, enabling scalable and data-efficient enhancement of foundational models.
Conclusion: The work bridges the gap between community-developed LoRAs and practical deployment needs, fostering collaborative model evolution.
Abstract: Despite recent advances in photorealistic image generation through large-scale models like FLUX and Stable Diffusion v3, the practical deployment of these architectures remains constrained by their inherent intractability to parameter fine-tuning. While low-rank adaptation (LoRA) has demonstrated efficacy in enabling model customization with minimal parameter overhead, the effective utilization of distributed open-source LoRA modules faces three critical challenges: sparse metadata annotation, the requirement for zero-shot adaptation capabilities, and suboptimal strategies for multi-LoRA fusion. To address these limitations, we introduce a novel framework that enables semantic-driven LoRA retrieval and dynamic aggregation through two key components: (1) a weight encoding-based LoRA retriever that establishes a shared semantic space between LoRA parameter matrices and text prompts, eliminating dependence on original training data, and (2) a fine-grained gated fusion mechanism that computes context-specific fusion weights across network layers and diffusion timesteps to optimally integrate multiple LoRA modules during generation. Our approach achieves significant improvement in image generation performance, thereby facilitating scalable and data-efficient enhancement of foundational models. This work establishes a critical bridge between the fragmented landscape of community-developed LoRAs and practical deployment requirements, enabling collaborative model evolution through standardized adapter integration.
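Once LoRA weight matrices are encoded into the shared semantic space, the retrieval step reduces to a nearest-neighbor search against the prompt embedding. A minimal sketch, assuming the encodings are precomputed; the encoder itself and the top-k of 3 are illustrative:

```python
import torch
import torch.nn.functional as F

def retrieve_loras(prompt_emb, lora_encodings, top_k=3):
    """Select the LoRA modules whose weight encodings are closest to the
    prompt embedding. prompt_emb: (D,); lora_encodings: (M, D), one row
    per community LoRA, precomputed from its parameter matrices."""
    sims = F.cosine_similarity(prompt_emb.unsqueeze(0), lora_encodings, dim=1)
    return sims.topk(top_k).indices        # indices of LoRAs to pass to fusion
```

The retrieved modules would then be blended by the gated fusion mechanism, which assigns per-layer, per-timestep weights rather than a single global mixing coefficient.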
[347] Beyond RGB and Events: Enhancing Object Detection under Adverse Lighting with Monocular Normal Maps
Mingjie Liu, Hanqing Liu, Chuang Zhu
Main category: cs.CV
TL;DR: NRE-Net improves object detection in adverse lighting by fusing RGB, event data, and predicted normal maps, outperforming state-of-the-art methods.
Details
Motivation: Adverse lighting causes false detections in event cameras, and existing methods (RGB or event data alone) are insufficient.Method: NRE-Net integrates RGB, event streams, and normal maps using ADFM and EAFM modules for optimized fusion.
Result: Achieves mAP50 improvements of 7.9% and 6.1% over frame-based methods and surpasses fusion-based approaches by 2.7%-7.1%.
Conclusion: NRE-Net effectively addresses adverse lighting challenges by leveraging multi-modal fusion, enhancing detection accuracy.
Abstract: Accurate object detection under adverse lighting conditions is critical for real-world applications such as autonomous driving. Although neuromorphic event cameras have been introduced to handle these scenarios, adverse lighting often induces distracting reflections from tunnel walls or road surfaces, which frequently lead to false obstacle detections. However, neither RGB nor event data alone is robust enough to address these complexities, and mitigating these issues without additional sensors remains underexplored. To overcome these challenges, we propose leveraging normal maps, directly predicted from monocular RGB images, as robust geometric cues to suppress false positives and enhance detection accuracy. We introduce NRE-Net, a novel multi-modal detection framework that effectively fuses three complementary modalities: monocularly predicted surface normal maps, RGB images, and event streams. To optimize the fusion process, our framework incorporates two key modules: the Adaptive Dual-stream Fusion Module (ADFM), which integrates RGB and normal map features, and the Event-modality Aware Fusion Module (EAFM), which adapts to the high dynamic range characteristics of event data. Extensive evaluations on the DSEC-Det-sub and PKU-DAVIS-SOD datasets demonstrate that NRE-Net significantly outperforms state-of-the-art methods. Our approach achieves mAP50 improvements of 7.9% and 6.1% over frame-based approaches (e.g., YOLOX), while surpassing the fusion-based SFNet by 2.7% on the DSEC-Det-sub dataset and SODFormer by 7.1% on the PKU-DAVIS-SOD dataset.
[348] mmWave Radar-Based Non-Line-of-Sight Pedestrian Localization at T-Junctions Utilizing Road Layout Extraction via Camera
Byeonggyu Park, Hee-Yeun Kim, Byonghyok Choi, Hansang Cho, Byungkwan Kim, Soomok Lee, Mingu Jeon, Seong-Woo Kim
Main category: cs.CV
TL;DR: A novel framework combines radar PCD and camera-derived road layout to localize NLoS pedestrians, validated via real-world experiments.
Details
Motivation: Accurate pedestrian localization in NLoS urban environments is challenging due to radar distortions and camera limitations.Method: Leverages camera-inferred road layout to interpret 2D radar PCD for spatial scene reconstruction.
Result: Validated with real-vehicle experiments, showing practical applicability in outdoor NLoS scenarios.
Conclusion: The proposed method effectively addresses NLoS pedestrian localization by integrating radar and camera data.
Abstract: Pedestrian localization in Non-Line-of-Sight (NLoS) regions within urban environments poses a significant challenge for autonomous driving systems. While mmWave radar has demonstrated potential for detecting objects in such scenarios, the 2D radar point cloud (PCD) data is susceptible to distortions caused by multipath reflections, making accurate spatial inference difficult. Additionally, although camera images provide high-resolution visual information, they lack depth perception and cannot directly observe objects in NLoS regions. In this paper, we propose a novel framework that interprets radar PCD through the road layout inferred from the camera for localization of NLoS pedestrians. The proposed method leverages visual information from the camera to interpret 2D radar PCD, enabling spatial scene reconstruction. The effectiveness of the proposed approach is validated through experiments conducted using a radar-camera system mounted on a real vehicle. The localization performance is evaluated using a dataset collected in outdoor NLoS driving environments, demonstrating the practical applicability of the method.
[349] VDEGaussian: Video Diffusion Enhanced 4D Gaussian Splatting for Dynamic Urban Scenes Modeling
Yuru Xiao, Zihan Lin, Chao Lu, Deming Zhai, Kui Jiang, Wenbo Zhao, Wei Zhang, Junjun Jiang, Huanran Wang, Xianming Liu
Main category: cs.CV
TL;DR: A novel video diffusion-enhanced 4D Gaussian Splatting framework improves dynamic urban scene modeling by leveraging test-time adapted video diffusion for robust, temporally consistent priors.
Details
Motivation: Current methods struggle with pre-calibrated object tracks and fast-moving objects due to temporal discontinuities and undersampled capture.Method: Proposes a framework combining video diffusion and 4D Gaussian Splatting, with joint timestamp optimization and uncertainty distillation for precise pose alignment and content integration.
Result: Achieves ~2 dB PSNR gain in novel view synthesis, especially for fast-moving objects.
Conclusion: The method effectively overcomes limitations in dynamic scene modeling, enhancing accuracy and fidelity.
Abstract: Dynamic urban scene modeling is a rapidly evolving area with broad applications. While current approaches leveraging neural radiance fields or Gaussian Splatting have achieved fine-grained reconstruction and high-fidelity novel view synthesis, they still face significant limitations. These often stem from a dependence on pre-calibrated object tracks or difficulties in accurately modeling fast-moving objects from undersampled capture, particularly due to challenges in handling temporal discontinuities. To overcome these issues, we propose a novel video diffusion-enhanced 4D Gaussian Splatting framework. Our key insight is to distill robust, temporally consistent priors from a test-time adapted video diffusion model. To ensure precise pose alignment and effective integration of this denoised content, we introduce two core innovations: a joint timestamp optimization strategy that refines interpolated frame poses, and an uncertainty distillation method that adaptively extracts target content while preserving well-reconstructed regions. Extensive experiments demonstrate that our method significantly enhances dynamic modeling, especially for fast-moving objects, achieving an approximate PSNR gain of 2 dB for novel view synthesis over baseline approaches.
[350] Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering
Xu Wang, Shengeng Tang, Fei Wang, Lechao Cheng, Dan Guo, Feng Xue, Richang Hong
Main category: cs.CV
TL;DR: Text2Lip is a viseme-centric framework for generating talking faces using textual input, outperforming audio-driven methods in semantic fidelity and visual realism.
Details
Motivation: Addressing the challenges of audio-driven methods, such as reliance on paired audio-visual data and ambiguity in mapping acoustics to lip motion.Method: Uses structured viseme sequences as a phonetic-visual bridge, employs a progressive viseme-audio replacement strategy, and a landmark-guided renderer for photorealistic synthesis.
Result: Outperforms existing methods in semantic fidelity, visual realism, and modality robustness.
Conclusion: Text2Lip establishes a new paradigm for controllable and flexible talking face generation.
Abstract: Generating semantically coherent and visually accurate talking faces requires bridging the gap between linguistic meaning and facial articulation. Although audio-driven methods remain prevalent, their reliance on high-quality paired audio visual data and the inherent ambiguity in mapping acoustics to lip motion pose significant challenges in terms of scalability and robustness. To address these issues, we propose Text2Lip, a viseme-centric framework that constructs an interpretable phonetic-visual bridge by embedding textual input into structured viseme sequences. These mid-level units serve as a linguistically grounded prior for lip motion prediction. Furthermore, we design a progressive viseme-audio replacement strategy based on curriculum learning, enabling the model to gradually transition from real audio to pseudo-audio reconstructed from enhanced viseme features via cross-modal attention. This allows for robust generation in both audio-present and audio-free scenarios. Finally, a landmark-guided renderer synthesizes photorealistic facial videos with accurate lip synchronization. Extensive evaluations show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness, establishing a new paradigm for controllable and flexible talking face generation. Our project homepage is https://plyon1.github.io/Text2Lip/.
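The progressive viseme-audio replacement can be pictured as a curriculum that gradually swaps real audio features for pseudo-audio reconstructed from visemes. A minimal sketch, assuming a linear schedule and per-sample Bernoulli replacement (both assumptions; the paper only specifies that the transition follows curriculum learning):

```python
import torch

def curriculum_mix(real_audio_feat, pseudo_audio_feat, step, total_steps):
    """Progressively replace real audio features with viseme-derived
    pseudo-audio features over training. Early steps feed mostly real
    audio; late steps feed mostly pseudo-audio, enabling audio-free use."""
    p_replace = min(1.0, step / total_steps)     # 0 -> all real, 1 -> all pseudo
    mask = (torch.rand(real_audio_feat.shape[0]) < p_replace).float()
    mask = mask.view(-1, *([1] * (real_audio_feat.dim() - 1)))
    return mask * pseudo_audio_feat + (1 - mask) * real_audio_feat
```

At inference time the model can then run entirely on pseudo-audio, which is what makes the audio-free scenario possible.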
[351] A Neural Quality Metric for BRDF Models
Behnaz Kavoosighafi, Rafal K. Mantiuk, Saghi Hajisharif, Ehsan Miandji, Jonas Unger
Main category: cs.CV
TL;DR: A neural quality metric for BRDF evaluation is introduced, operating directly in BRDF space without rendering, outperforming traditional metrics in correlating with human judgments.
Details
Motivation: Traditional BRDF-space metrics fail to capture perceptual differences in rendered images, necessitating a perceptually informed alternative.Method: A compact MLP is trained on measured and synthetic BRDF data, labeled using a perceptually validated image-space metric, to predict perceptual quality (JOD scores).
Result: The neural metric achieves higher correlation with human judgments than existing BRDF-space metrics.
Conclusion: The proposed metric offers a perceptually grounded alternative for BRDF evaluation, though its use as a loss function for BRDF fitting is limited.
Abstract: Accurately evaluating the quality of bidirectional reflectance distribution function (BRDF) models is essential for photo-realistic rendering. Traditional BRDF-space metrics often employ numerical error measures that fail to capture perceptual differences evident in rendered images. In this paper, we introduce the first perceptually informed neural quality metric for BRDF evaluation that operates directly in BRDF space, eliminating the need for rendering during quality assessment. Our metric is implemented as a compact multi-layer perceptron (MLP), trained on a dataset of measured BRDFs supplemented with synthetically generated data and labelled using a perceptually validated image-space metric. The network takes as input paired samples of reference and approximated BRDFs and predicts their perceptual quality in terms of just-objectionable-difference (JOD) scores. We show that our neural metric achieves significantly higher correlation with human judgments than existing BRDF-space metrics. While its performance as a loss function for BRDF fitting remains limited, the proposed metric offers a perceptually grounded alternative for evaluating BRDF models.
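The metric itself is architecturally simple: an MLP that ingests paired reference/approximation BRDF samples and regresses a JOD score. A minimal sketch, with illustrative input dimensions (the paper does not specify the sampling layout used here):

```python
import torch
import torch.nn as nn

class BRDFQualityMLP(nn.Module):
    """Compact MLP scoring a reference/approximation BRDF sample pair with a
    JOD-like perceptual quality value. Dimensions are illustrative."""
    def __init__(self, n_dirs=64, n_chan=3, hidden=256):
        super().__init__()
        in_dim = 2 * n_dirs * n_chan   # paired reference + approximation samples
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),      # scalar JOD prediction
        )

    def forward(self, brdf_ref, brdf_approx):
        x = torch.cat([brdf_ref.flatten(1), brdf_approx.flatten(1)], dim=1)
        return self.net(x).squeeze(-1)
```

Training would regress against the image-space JOD labels, e.g. with nn.MSELoss(); the key point is that no rendering happens at evaluation time.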
[352] Hydra: Accurate Multi-Modal Leaf Wetness Sensing with mm-Wave and Camera Fusion
Yimeng Liu, Maolin Gan, Huaili Zeng, Li Liu, Younsuk Dong, Zhichao Cao
Main category: cs.CV
TL;DR: Hydra integrates mm-Wave radar and camera tech to detect leaf wetness, achieving high accuracy (96%) in varied conditions.
Details
Motivation: Current LWD detection lacks standardization and robustness, limiting effectiveness in real-world agriculture.Method: Combines CNN for feature fusion and transformer-based encoder for feature mapping, using FMCW radar (76-81 GHz).
Result: Achieves 96% accuracy in lab and ~90% in real farm conditions (rain, dawn, poor light).
Conclusion: Hydra offers a precise, resilient solution for LWD detection in diverse agricultural environments.
Abstract: Leaf Wetness Duration (LWD), the time that water remains on leaf surfaces, is crucial in the development of plant diseases. Existing LWD detection lacks standardized measurement techniques, and variations across different plant characteristics limit its effectiveness. Prior research proposes diverse approaches, but they fail to measure real natural leaves directly and lack resilience in various environmental conditions. This reduces precision and robustness, revealing a notable gap in practical applicability and effectiveness in real-world agricultural settings. This paper presents Hydra, an innovative approach that integrates millimeter-wave (mm-Wave) radar with camera technology to detect leaf wetness by determining whether there is water on the leaf. Based on this detection, we can measure the time to determine the LWD. First, we design a Convolutional Neural Network (CNN) to selectively fuse multiple mm-Wave depth images with an RGB image to generate multiple feature images. Then, we develop a transformer-based encoder to capture the inherent connection among the multiple feature images and generate a feature map, which is further fed to a classifier for detection. Moreover, we augment the dataset during training to generalize our model. Implemented using a frequency-modulated continuous-wave (FMCW) radar within the 76 to 81 GHz band, Hydra’s performance is meticulously evaluated on plants, demonstrating the potential to classify leaf wetness with up to 96% accuracy across varying scenarios. Deployed on a farm, including during rain, at dawn, and on poorly lit nights, Hydra still achieves an accuracy rate of around 90%.
[353] Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference
Kuo Wang, Quanlong Zheng, Junlin Xie, Yanhao Zhang, Jinguo Luo, Haonan Lu, Liang Lin, Fan Zhou, Guanbin Li
Main category: cs.CV
TL;DR: Free-MoRef is a training-free approach to enhance Video-MLLMs’ performance on long videos by splitting and fusing vision tokens efficiently.
Details
Motivation: Existing Video-MLLMs struggle with long videos due to context length limitations, sacrificing granularity or efficiency.Method: Free-MoRef splits vision tokens into multi-reference chunks, uses MoRef-attention for parallel clue gathering, and fuses key tokens for cross-reference interactions.
Result: Achieves 2× to 8× longer input frame perception without compression, outperforming trained long-video-MLLMs on benchmarks.
Conclusion: Free-MoRef efficiently improves long-video understanding with lower computational costs, offering significant performance gains.
Abstract: Video Multimodal Large Language Models~(Video-MLLM) have achieved remarkable advancements in video understanding tasks. However, constrained by the context length limitation in the underlying LLMs, existing Video-MLLMs typically exhibit suboptimal performance on long video scenarios. To understand extended input frames, common solutions span token compression and streaming inference techniques, which sacrifice feature granularity or inference efficiency. Differently, to efficiently achieve comprehensive understanding of longer frame inputs, we draw ideas from MoE and propose a training-free approach \textbf{Free-MoRef}, which instantly multiplexes the context perception capabilities of Video-MLLMs within one inference pass. Specifically, Free-MoRef reconstructs the vision tokens into several short sequences as multi-references. Subsequently, we introduce MoRef-attention, which gathers clues from the multi-reference chunks in parallel to summarize unified query activations. After the shadow layers in LLMs, a reference fusion step is derived to compose a final mixed reasoning sequence with key tokens from parallel chunks, which compensates the cross-reference vision interactions that are neglected in MoRef-attention. By splitting and fusing the long vision token sequences, Free-MoRef achieves improved performance under much lower computing costs in reasoning multiplexed context length, demonstrating strong efficiency and effectiveness. Experiments on VideoMME, MLVU, LongVideoBench show that Free-MoRef achieves full perception of 2$\times$ to 8$\times$ longer input frames without compression on a single A100 GPU while keeping instant responses, thereby bringing significant performance gains, even surpassing dedicatedly trained long-video-MLLMs. Codes are available at https://github.com/wkfdb/Free-MoRef
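The core mechanic, splitting a long vision-token sequence into reference chunks, attending to them in parallel, and then fusing the most-attended tokens, can be sketched in a few lines. This is a single-head simplification assuming the sequence length divides evenly by the chunk count; the actual MoRef-attention and reference-fusion layers operate inside the LLM:

```python
import torch

def moref_attention(q, vis_tokens, n_ref=4, top_k=64):
    """q: (B, Q, D) query tokens; vis_tokens: (B, N, D) with N % n_ref == 0.
    Attends to n_ref chunks in parallel, averages the per-chunk readouts
    into a unified activation, and gathers the top-k attended vision
    tokens across chunks for a cross-reference fusion step."""
    B, N, D = vis_tokens.shape
    chunks = vis_tokens.view(B, n_ref, N // n_ref, D)   # multi-reference split
    attn = torch.einsum("bqd,brnd->brqn", q, chunks) / D ** 0.5
    weights = attn.softmax(dim=-1)
    per_ref = torch.einsum("brqn,brnd->brqd", weights, chunks)
    summary = per_ref.mean(dim=1)                       # unified query activation
    scores = weights.mean(dim=2).reshape(B, N)          # avg attention per token
    idx = scores.topk(top_k, dim=-1).indices
    key_tokens = torch.gather(vis_tokens, 1,
                              idx.unsqueeze(-1).expand(-1, -1, D))
    return summary, key_tokens
```

Because each chunk is attended independently, peak attention cost scales with the chunk length rather than the full frame count, which is where the 2x-8x context extension comes from.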
[354] HGTS-Former: Hierarchical HyperGraph Transformer for Multivariate Time Series Analysis
Xiao Wang, Hao Si, Fan Zhang, Xiaoya Zhou, Dengdi Sun, Wanli Lyu, Qingquan Yang, Jin Tang
Main category: cs.CV
TL;DR: The paper introduces HGTS-Former, a hypergraph-based transformer for multivariate time series analysis, addressing challenges like high dimensionality and complex interactions.
Details
Motivation: Multivariate time series analysis is challenging due to high dimensionality and dynamic interactions, prompting the need for advanced structural modeling.Method: HGTS-Former normalizes and embeds time series data, uses multi-head self-attention, constructs hierarchical hypergraphs, and employs an EdgeToNode module for feature enhancement.
Result: Experiments on eight datasets validate HGTS-Former’s effectiveness in handling multivariate time series tasks.
Conclusion: HGTS-Former successfully addresses multivariate coupling in time series data, with code available for further use.
Abstract: Multivariate time series analysis has long been one of the key research topics in the field of artificial intelligence. However, analyzing complex time series data remains a challenging and unresolved problem due to its high dimensionality, dynamic nature, and complex interactions among variables. Inspired by the strong structural modeling capability of hypergraphs, this paper proposes a novel hypergraph-based time series transformer backbone network, termed HGTS-Former, to address the multivariate coupling in time series data. Specifically, given the multivariate time series signal, we first normalize and embed each patch into tokens. Then, we adopt the multi-head self-attention to enhance the temporal representation of each patch. The hierarchical hypergraphs are constructed to aggregate the temporal patterns within each channel and fine-grained relations between different variables. After that, we convert the hyperedge into node features through the EdgeToNode module and adopt the feed-forward network to further enhance the output features. Extensive experiments conducted on two multivariate time series tasks and eight datasets fully validated the effectiveness of our proposed HGTS-Former. The source code will be released on https://github.com/Event-AHU/Time_Series_Analysis.
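The hyperedge-to-node flow at the heart of the EdgeToNode module can be illustrated with a plain incidence-matrix message-passing round. A minimal sketch under the assumption of mean aggregation; HGTS-Former's actual construction is hierarchical and attention-based:

```python
import torch

def hypergraph_round(x, H):
    """One node -> hyperedge -> node message-passing round.
    x: (N, D) patch/token features; H: (N, E) binary incidence matrix,
    H[i, j] = 1 if node i belongs to hyperedge j."""
    deg_e = H.sum(dim=0).clamp(min=1)            # nodes per hyperedge
    edge_feat = (H.t() @ x) / deg_e[:, None]     # aggregate nodes into hyperedges
    deg_n = H.sum(dim=1).clamp(min=1)            # hyperedges per node
    node_out = (H @ edge_feat) / deg_n[:, None]  # EdgeToNode-style scatter back
    return x + node_out                          # residual update
```

Hyperedges are what let a single aggregation step couple an arbitrary set of variables, rather than only pairwise relations as in an ordinary graph.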
[355] AID4AD: Aerial Image Data for Automated Driving Perception
Daniel Lengerer, Mathias Pechinger, Klaus Bogenberger, Carsten Markgraf
Main category: cs.CV
TL;DR: The paper introduces AID4AD, a dataset combining aerial imagery with nuScenes for AV perception tasks, improving map construction and motion prediction.
Details
Motivation: To enhance automated vehicle perception by integrating spatially aligned aerial imagery, addressing limitations of high-definition maps.Method: AID4AD dataset creation using SLAM-based alignment, manual quality control, and evaluation in map construction and motion prediction tasks.
Result: Aerial imagery improves map construction accuracy by 15-23% and trajectory prediction by 2%.
Conclusion: Aerial imagery is a scalable alternative to high-definition maps, with AID4AD released to support further research.
Abstract: This work investigates the integration of spatially aligned aerial imagery into perception tasks for automated vehicles (AVs). As a central contribution, we present AID4AD, a publicly available dataset that augments the nuScenes dataset with high-resolution aerial imagery precisely aligned to its local coordinate system. The alignment is performed using SLAM-based point cloud maps provided by nuScenes, establishing a direct link between aerial data and nuScenes local coordinate system. To ensure spatial fidelity, we propose an alignment workflow that corrects for localization and projection distortions. A manual quality control process further refines the dataset by identifying a set of high-quality alignments, which we publish as ground truth to support future research on automated registration. We demonstrate the practical value of AID4AD in two representative tasks: in online map construction, aerial imagery serves as a complementary input that improves the mapping process; in motion prediction, it functions as a structured environmental representation that replaces high-definition maps. Experiments show that aerial imagery leads to a 15-23% improvement in map construction accuracy and a 2% gain in trajectory prediction performance. These results highlight the potential of aerial imagery as a scalable and adaptable source of environmental context in automated vehicle systems, particularly in scenarios where high-definition maps are unavailable, outdated, or costly to maintain. AID4AD, along with evaluation code and pretrained models, is publicly released to foster further research in this direction: https://github.com/DriverlessMobility/AID4AD.
[356] TrackletGait: A Robust Framework for Gait Recognition in the Wild
Shaoxiong Zhang, Jinkai Zheng, Shangdong Zhu, Chenggang Yan
Main category: cs.CV
TL;DR: TrackletGait is a novel framework for gait recognition in real-world scenarios, addressing challenges like non-periodic and occluded silhouettes. It introduces Random Tracklet Sampling, Haar Wavelet-based Downsampling, and Hardness Exclusion Triplet Loss, achieving state-of-the-art results.
Details
Motivation: Current gait recognition methods struggle with real-world challenges like non-periodic and occluded silhouettes, limiting their effectiveness in surveillance scenarios.Method: TrackletGait uses Random Tracklet Sampling for diverse walking patterns, Haar Wavelet-based Downsampling for spatial information preservation, and Hardness Exclusion Triplet Loss to filter low-quality silhouettes.
Result: Achieves 77.8 and 80.4 rank-1 accuracy on Gait3D and GREW datasets with only 10.3M parameters.
Conclusion: TrackletGait effectively addresses real-world gait recognition challenges, outperforming existing methods while maintaining efficiency.
Abstract: Gait recognition aims to identify individuals based on their body shape and walking patterns. Though much progress has been achieved driven by deep learning, gait recognition in real-world surveillance scenarios remains quite challenging to current methods. Conventional approaches, which rely on periodic gait cycles and controlled environments, struggle with the non-periodic and occluded silhouette sequences encountered in the wild. In this paper, we propose a novel framework, TrackletGait, designed to address these challenges in the wild. We propose Random Tracklet Sampling, a generalization of existing sampling methods, which strikes a balance between robustness and representation in capturing diverse walking patterns. Next, we introduce Haar Wavelet-based Downsampling to preserve information during spatial downsampling. Finally, we present a Hardness Exclusion Triplet Loss, designed to exclude low-quality silhouettes by discarding hard triplet samples. TrackletGait achieves state-of-the-art results, with 77.8 and 80.4 rank-1 accuracy on the Gait3D and GREW datasets, respectively, while using only 10.3M backbone parameters. Extensive experiments are also conducted to further investigate the factors affecting gait recognition in the wild.
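Haar wavelet-based downsampling is a standard, information-preserving alternative to strided convolution: a 2x2 Haar transform turns each spatial quad into four subbands stacked along channels, so the operation is invertible. A minimal sketch (assumes even height and width):

```python
import torch

def haar_downsample(x):
    """2x spatial downsampling that keeps all information as four Haar
    subbands (LL, LH, HL, HH) concatenated on channels. x: (B, C, H, W)."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2      # low-frequency average
    lh = (a - b + c - d) / 2      # horizontal detail
    hl = (a + b - c - d) / 2      # vertical detail
    hh = (a - b - c + d) / 2      # diagonal detail
    return torch.cat([ll, lh, hl, hh], dim=1)   # (B, 4C, H/2, W/2)
```

How the four subbands are consumed downstream (e.g., which are fed to later stages) is a design choice of the backbone, not shown here.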
[357] AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation
Ziyang Luo, Nian Liu, Fahad Shahbaz Khan, Junwei Han
Main category: cs.CV
TL;DR: AURORA is a novel framework for Ref-AVS tasks, enhancing reasoning and segmentation via CoT prompting, feature distillation, and a two-stage training strategy.
Details
Motivation: Existing methods lack semantic understanding and compromise precision; AURORA aims to improve reasoning and segmentation.Method: Uses Chain-of-Thought prompting, segmentation feature distillation, and a two-stage training (corrective reflective-style and GRPO reinforcement learning).
Result: Achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes well to unreferenced segmentation.
Conclusion: AURORA effectively integrates reasoning and segmentation, outperforming existing methods.
Abstract: Reference Audio-Visual Segmentation (Ref-AVS) tasks challenge models to precisely locate sounding objects by integrating visual, auditory, and textual cues. Existing methods often lack genuine semantic understanding, tending to memorize fixed reasoning patterns. Furthermore, jointly training for reasoning and segmentation can compromise pixel-level precision. To address these issues, we introduce AURORA, a novel framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation. We employ a structured Chain-of-Thought (CoT) prompting mechanism to guide the model through a step-by-step reasoning process and introduce a novel segmentation feature distillation loss to effectively integrate these reasoning abilities without sacrificing segmentation performance. To further cultivate the model’s genuine reasoning capabilities, we devise a two-stage training strategy: first, a "corrective reflective-style training" stage utilizes self-correction to enhance the quality of reasoning paths, followed by reinforcement learning via Group Reward Policy Optimization (GRPO) to bolster robustness in challenging scenarios. Experiments demonstrate that AURORA achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes effectively to unreferenced segmentation.
[358] AttriCtrl: Fine-Grained Control of Aesthetic Attribute Intensity in Diffusion Models
Die Chen, Zhongjie Duan, Zhiwen Li, Cen Chen, Daoyuan Chen, Yaliang Li, Yinda Chen
Main category: cs.CV
TL;DR: AttriCtrl is a plug-and-play framework for precise, continuous control of aesthetic attributes in text-to-image diffusion models, addressing ambiguity in textual prompts and scalability issues of human preference data.
Details
Motivation: Fine-grained control over aesthetic attributes in text-to-image generation is challenging due to vague textual prompts and reliance on costly human preference data.Method: AttriCtrl quantifies aesthetics using pre-trained vision-language models and employs a lightweight value encoder to map scalar intensities to embeddings for diffusion-based generation.
Result: AttriCtrl achieves accurate control over individual attributes and flexible multi-attribute composition, with seamless integration into existing frameworks.
Conclusion: AttriCtrl offers intuitive, customizable aesthetic manipulation with minimal training overhead, demonstrating strong practical utility.
Abstract: Recent breakthroughs in text-to-image diffusion models have significantly enhanced both the visual fidelity and semantic controllability of generated images. However, fine-grained control over aesthetic attributes remains challenging, especially when users require continuous and intensity-specific adjustments. Existing approaches often rely on vague textual prompts, which are inherently ambiguous in expressing both the aesthetic semantics and the desired intensity, or depend on costly human preference data for alignment, limiting their scalability and practicality. To address these limitations, we propose AttriCtrl, a plug-and-play framework for precise and continuous control of aesthetic attributes. Specifically, we quantify abstract aesthetics by leveraging semantic similarity from pre-trained vision-language models, and employ a lightweight value encoder that maps scalar intensities in $[0,1]$ to learnable embeddings within diffusion-based generation. This design enables intuitive and customizable aesthetic manipulation, with minimal training overhead and seamless integration into existing generation pipelines. Extensive experiments demonstrate that AttriCtrl achieves accurate control over individual attributes as well as flexible multi-attribute composition. Moreover, it is fully compatible with popular open-source controllable generation frameworks, showcasing strong integration capability and practical utility across diverse generation scenarios.
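The value encoder is the lightweight piece of the pipeline: a small network that lifts a scalar intensity into an embedding the diffusion model can consume alongside its text conditioning. A minimal sketch with illustrative hidden sizes (the paper does not publish the exact architecture here):

```python
import torch
import torch.nn as nn

class ValueEncoder(nn.Module):
    """Maps a scalar aesthetic intensity in [0, 1] to a prompt-like embedding
    appended to the diffusion model's text conditioning."""
    def __init__(self, embed_dim=768, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, intensity):                 # intensity: (B,) in [0, 1]
        return self.net(intensity.unsqueeze(-1))  # (B, embed_dim)
```

Multi-attribute composition then amounts to running one such encoder per attribute and concatenating the resulting embeddings into the conditioning sequence.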
[359] Unified Category-Level Object Detection and Pose Estimation from RGB Images using 3D Prototypes
Tom Fischer, Xiaojie Zhang, Eddy Ilg
Main category: cs.CV
TL;DR: A unified model for RGB-based object detection and 3D pose estimation, outperforming state-of-the-art by 22.9% on REAL275.
Details
Motivation: Traditional methods rely on RGB-D inputs or separate models for detection and pose estimation, limiting practicality and performance.Method: Integrates detection and pose estimation into a single framework using neural mesh models with learned features and multi-model RANSAC.
Result: Achieves state-of-the-art results on REAL275, improving by 22.9% on scale-agnostic metrics.
Conclusion: The unified method is more robust and practical, offering superior performance for RGB-based tasks.
Abstract: Recognizing objects in images is a fundamental problem in computer vision. Although detecting objects in 2D images is common, many applications require determining their pose in 3D space. Traditional category-level methods rely on RGB-D inputs, which may not always be available, or employ two-stage approaches that use separate models and representations for detection and pose estimation. For the first time, we introduce a unified model that integrates detection and pose estimation into a single framework for RGB images by leveraging neural mesh models with learned features and multi-model RANSAC. Our approach achieves state-of-the-art results for RGB category-level pose estimation on REAL275, improving on the current state-of-the-art by 22.9% averaged across all scale-agnostic metrics. Finally, we demonstrate that our unified method exhibits greater robustness compared to single-stage baselines. Our code and models are available at https://github.com/Fischer-Tom/unified-detection-and-pose-estimation.
[360] After the Party: Navigating the Mapping From Color to Ambient Lighting
Florin-Alexandru Vasluianu, Tim Seizinger, Zongwei Wu, Radu Timofte
Main category: cs.CV
TL;DR: CL3AN introduces a dataset and learning framework to restore images under colored lighting, addressing limitations of existing methods.
Details
Motivation: Existing methods oversimplify complex illumination scenarios, leading to artifacts like color distortion and texture leakage.Method: A novel learning framework leverages chromaticity and luminance guidance inspired by the Retinex model.
Result: The approach shows robustness under non-homogeneous lighting and material variations while maintaining computational efficiency.
Conclusion: CL3AN provides an effective solution for illumination restoration, supported by a new dataset and benchmark.
Abstract: Illumination in practical scenarios is inherently complex, involving colored light sources, occlusions, and diverse material interactions that produce intricate reflectance and shading effects. However, existing methods often oversimplify this challenge by assuming a single light source or uniform, white-balanced lighting, leaving many of these complexities unaddressed. In this paper, we introduce CL3AN, the first large-scale, high-resolution dataset of its kind designed to facilitate the restoration of images captured under multiple Colored Light sources to their Ambient-Normalized counterparts. Through benchmarking, we find that leading approaches often produce artifacts, such as illumination inconsistencies, texture leakage, and color distortion, primarily due to their limited ability to precisely disentangle illumination from reflectance. Motivated by this insight, we achieve such a decomposition through a novel learning framework that leverages explicit chromaticity and luminance component guidance, drawing inspiration from the principles of the Retinex model. Extensive evaluations on existing benchmarks and our dataset demonstrate the effectiveness of our approach, showcasing enhanced robustness under non-homogeneous color lighting and material-specific reflectance variations, all while maintaining a highly competitive computational cost. The benchmark, codes, and models are available at www.github.com/fvasluianu97/RLN2.
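The Retinex-inspired guidance boils down to conditioning the network on separate luminance and chromaticity views of the input. A minimal sketch of one such split, assuming a simple channel-mean luminance proxy (the paper's exact decomposition may differ):

```python
import torch

def chroma_luma_split(img, eps=1e-6):
    """Splits an RGB image into luminance and chromaticity components of the
    kind a restoration network can take as explicit guidance.
    img: (B, 3, H, W), strictly positive values."""
    luma = img.mean(dim=1, keepdim=True)    # simple luminance proxy
    chroma = img / (luma + eps)             # intensity-invariant color ratios
    return luma, chroma
```

Separating the two lets colored-light contamination be corrected mostly in the chromaticity branch without disturbing scene shading carried by luminance.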
[361] Deep classification algorithm for De-identification of DICOM medical images
Bufano Michele, Kotter Elmar
Main category: cs.CV
TL;DR: A Python algorithm was developed to de-identify PII and PHI in DICOM files using customizable parameters based on HIPAA’s safe harbor method.
Details
Motivation: Legal requirements like HIPAA mandate the removal of PII and PHI from medical images, including DICOM files, to protect patient privacy.Method: Implemented an algorithm using HIPAA’s safe harbor method, with customizable parameters to classify and de-identify DICOM tags.
Result: Successfully recognized and de-identified sensitive information like names, history, and institution data.
Conclusion: The flexible, customizable algorithm is promising for both research and everyday use, with code available on GitHub.
Abstract: Background: De-identification of DICOM (Digital Imaging and Communications in Medicine) files is an essential component of medical image research. Personal Identifiable Information (PII) and/or Personal Health Identifying Information (PHI) needs to be hidden or removed for legal reasons. According to the Health Insurance Portability and Accountability Act (HIPAA) and privacy rules, full-face photographic images and any comparable images are also direct identifiers and are considered protected health information that needs to be de-identified. Objective: The study aimed to implement a method that de-identifies the PII and PHI present in the header and burned into the pixel data of DICOM files. Methods: To execute the de-identification, we implemented an algorithm based on the safe harbor method defined by HIPAA. Our algorithm uses customizable input parameters to classify and then possibly de-identify individual DICOM tags. Results: The most sensitive information, such as names, history, personal data, and institution, was successfully recognized. Conclusions: We developed a Python algorithm that is able to classify information present in a DICOM file. The flexibility provided by customizable input parameters, which let the user adapt the entire process to the case at hand (e.g., the language), makes the program promising for both everyday use and research purposes. Our code is available at https://github.com/rtdicomexplorer/deep_deidentification.
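For the header side of the problem, the safe-harbor tag classification maps naturally onto pydicom. A minimal sketch, with a hypothetical keyword list standing in for the paper's customizable parameters; removing PHI burned into the pixel data is a separate step not shown here:

```python
import pydicom

# Hypothetical parameterization: tag keywords treated as PII/PHI under the
# HIPAA safe-harbor rule; a real deployment would extend and localize this.
SENSITIVE_KEYWORDS = {
    "PatientName", "PatientBirthDate", "PatientAddress",
    "InstitutionName", "ReferringPhysicianName", "PatientID",
}

def deidentify(path_in, path_out, keywords=SENSITIVE_KEYWORDS):
    ds = pydicom.dcmread(path_in)
    for elem in ds.iterall():          # walk all tags, including sequences
        if elem.keyword in keywords:
            elem.value = ""            # blank out the sensitive value
    ds.remove_private_tags()           # private tags may also carry PHI
    ds.save_as(path_out)
```

The classification step in the paper decides per tag whether to keep, blank, or replace a value; the sketch above collapses that to simple blanking for brevity.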
[362] Weakly Supervised Multimodal Temporal Forgery Localization via Multitask Learning
Wenbo Xu, Wei Lu, Xiangyang Luo
Main category: cs.CV
TL;DR: Proposes WMMT, a weakly supervised method for multimodal fine-grained temporal forgery localization using multitask learning and video-level annotations.
Details
Motivation: Addresses the lack of systematic research in weakly supervised multimodal fine-grained temporal forgery localization (WS-MTFL) caused by Deepfake videos.Method: Uses multitask learning to integrate visual and audio detection tasks, a Mixture-of-Experts structure for feature selection, and a feature enhancement module with attention. Introduces a deviation perceiving loss for temporal information.
Result: Achieves comparable results to fully supervised approaches in Deepfake detection and localization.
Conclusion: WMMT effectively addresses WS-MTFL, demonstrating the potential of weakly supervised learning in this domain.
Abstract: The spread of Deepfake videos has caused a trust crisis and impaired social stability. Although numerous approaches have been proposed to address the challenges of Deepfake detection and localization, there is still a lack of systematic research on the weakly supervised multimodal fine-grained temporal forgery localization (WS-MTFL). In this paper, we propose a novel weakly supervised multimodal temporal forgery localization via multitask learning (WMMT), which addresses the WS-MTFL under the multitask learning paradigm. WMMT achieves multimodal fine-grained Deepfake detection and temporal partial forgery localization using merely video-level annotations. Specifically, visual and audio modality detection are formulated as two binary classification tasks. The multitask learning paradigm is introduced to integrate these tasks into a multimodal task. Furthermore, WMMT utilizes a Mixture-of-Experts structure to adaptively select appropriate features and localization head, achieving excellent flexibility and localization precision in WS-MTFL. A feature enhancement module with temporal property preserving attention mechanism is proposed to identify the intra- and inter-modality feature deviation and construct comprehensive video features. To further explore the temporal information for weakly supervised learning, an extensible deviation perceiving loss has been proposed, which aims to enlarge the deviation of adjacent segments of the forged samples and reduce the deviation of genuine samples. Extensive experiments demonstrate the effectiveness of multitask learning for WS-MTFL, and the WMMT achieves comparable results to fully supervised approaches in several evaluation metrics.
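The deviation perceiving loss can be written compactly: it pushes the feature deviation between adjacent segments above a margin for forged videos and toward zero for genuine ones. A minimal sketch assuming a hinge formulation (an assumption; the paper defines the loss as extensible):

```python
import torch

def deviation_perceiving_loss(feats, is_forged, margin=1.0):
    """feats: (B, T, D) per-segment features; is_forged: (B,) video-level
    {0, 1} labels. Enlarges adjacent-segment deviation for forged videos
    and suppresses it for genuine ones."""
    dev = (feats[:, 1:] - feats[:, :-1]).norm(dim=-1).mean(dim=1)  # (B,)
    loss_forged = torch.clamp(margin - dev, min=0)  # push deviation above margin
    loss_genuine = dev                              # pull deviation toward zero
    y = is_forged.float()
    return (y * loss_forged + (1 - y) * loss_genuine).mean()
```

Note this is what makes weak supervision workable: only the video-level label is needed, yet the loss still shapes segment-level temporal structure.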
[363] Test-Time Model Adaptation for Quantized Neural Networks
Zeshuai Deng, Guohao Chen, Shuaicheng Niu, Hui Luo, Shuhai Zhang, Yifan Yang, Renjie Chen, Wei Luo, Mingkui Tan
Main category: cs.CV
TL;DR: The paper proposes a continual zeroth-order adaptation (ZOA) framework for quantized models to address performance degradation in dynamic environments, achieving efficient adaptation with minimal computational overhead.
Details
Motivation: Quantized models suffer severe performance degradation in dynamic environments, and existing test-time adaptation (TTA) methods are impractical due to gradient backpropagation issues.Method: Introduces ZOA, a framework requiring only two forward passes for adaptation, and a domain knowledge management scheme to store and reuse knowledge efficiently.
Result: ZOA improves performance by 5.0% over state-of-the-art methods on the ImageNet-C dataset for a quantized ViT-B model.
Conclusion: ZOA offers a practical and efficient solution for adapting quantized models, enhancing robustness and generalization with minimal computational cost.
Abstract: Quantizing deep models prior to deployment is a widely adopted technique to speed up inference for various real-time applications, such as autonomous driving. However, quantized models often suffer from severe performance degradation in dynamic environments with potential domain shifts and this degradation is significantly more pronounced compared with their full-precision counterparts, as shown by our theoretical and empirical illustrations. To address the domain shift problem, test-time adaptation (TTA) has emerged as an effective solution by enabling models to learn adaptively from test data. Unfortunately, existing TTA methods are often impractical for quantized models as they typically rely on gradient backpropagation–an operation that is unsupported on quantized models due to vanishing gradients, as well as memory and latency constraints. In this paper, we focus on TTA for quantized models to improve their robustness and generalization ability efficiently. We propose a continual zeroth-order adaptation (ZOA) framework that enables efficient model adaptation using only two forward passes, eliminating the computational burden of existing methods. Moreover, we propose a domain knowledge management scheme to store and reuse different domain knowledge with negligible memory consumption, reducing the interference of different domain knowledge and fostering the knowledge accumulation during long-term adaptation. Experimental results on three classical architectures, including quantized transformer-based and CNN-based models, demonstrate the superiority of our methods for quantized model adaptation. On the quantized W6A6 ViT-B model, our ZOA is able to achieve a 5.0% improvement over the state-of-the-art FOA on ImageNet-C dataset. The source code is available at https://github.com/DengZeshuai/ZOA.
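A two-forward-pass zeroth-order update is the classic SPSA-style estimator: perturb a small set of parameters along a shared random direction, difference the two losses, and step against the estimated directional derivative. A minimal sketch under that reading (the paper's exact scheme may differ); `loss_fn` is assumed to be an unsupervised test-time objective such as prediction entropy:

```python
import torch

@torch.no_grad()
def zo_step(model, loss_fn, x, params, eps=1e-3, lr=1e-4):
    """One zeroth-order update using only two forward passes and no
    backpropagation. `params` is a small set of adaptable tensors,
    e.g. normalization scales, left in full precision."""
    u = [torch.randn_like(p) for p in params]        # shared random direction
    for p, d in zip(params, u): p.add_(eps * d)      # +eps perturbation
    loss_pos = loss_fn(model(x))
    for p, d in zip(params, u): p.sub_(2 * eps * d)  # -eps perturbation
    loss_neg = loss_fn(model(x))
    g = (loss_pos - loss_neg) / (2 * eps)            # directional derivative
    for p, d in zip(params, u):
        p.add_(eps * d)                              # restore original value
        p.sub_(lr * g * d)                           # descend along direction
    return 0.5 * (loss_pos + loss_neg)
```

Because no gradients flow through the quantized weights, this sidesteps exactly the vanishing-gradient and memory constraints that make backpropagation-based TTA impractical on quantized models.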
[364] Failure Cases Are Better Learned But Boundary Says Sorry: Facilitating Smooth Perception Change for Accuracy-Robustness Trade-Off in Adversarial Training
Yanyun Wang, Li Liu
Main category: cs.CV
TL;DR: The paper reveals that over-learning hard adversarial samples in Adversarial Training (AT) degrades decision boundaries, proposing a new method, Robust Perception Adversarial Training (RPAT), to mitigate the accuracy-robustness trade-off.
Details
Motivation: The trade-off between clean accuracy and adversarial robustness in AT is attributed to over-sufficient learning of hard adversarial samples, contradicting previous views.Method: Introduces Robust Perception, a new AT objective encouraging smooth perception changes with perturbations, leading to the RPAT method.
Result: RPAT outperforms four baselines and 12 SOTA methods on CIFAR-10, CIFAR-100, and Tiny-ImageNet with various models.
Conclusion: RPAT effectively addresses the accuracy-robustness trade-off by balancing perception consistency and perturbation utilization.
Abstract: Adversarial Training (AT) is one of the most effective methods to train robust Deep Neural Networks (DNNs). However, AT creates an inherent trade-off between clean accuracy and adversarial robustness, which is commonly attributed to the more complicated decision boundary caused by the insufficient learning of hard adversarial samples. In this work, we reveal a counterintuitive fact for the first time: From the perspective of perception consistency, hard adversarial samples that can still attack the robust model after AT are already learned better than those successfully defended. Thus, different from previous views, we argue that it is rather the over-sufficient learning of hard adversarial samples that degrades the decision boundary and contributes to the trade-off problem. Specifically, the excessive pursuit of perception consistency would force the model to view the perturbations as noise and ignore the information within them, which should have been utilized to induce a smoother perception transition towards the decision boundary to support its establishment to an appropriate location. In response, we define a new AT objective named Robust Perception, encouraging the model perception to change smoothly with input perturbations, based on which we propose a novel Robust Perception Adversarial Training (RPAT) method, effectively mitigating the current accuracy-robustness trade-off. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet with ResNet-18, PreActResNet-18, and WideResNet-34-10 demonstrate the effectiveness of our method beyond four common baselines and 12 state-of-the-art (SOTA) works. The code is available at https://github.com/FlaAI/RPAT.
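One way to render the "smooth perception change" objective in code is to ask that features at an intermediate point along the perturbation lie between the clean and adversarial features. This sketch assumes a linear perception path, which is our reading of the smoothness requirement rather than the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def robust_perception_loss(feat_fn, x, delta, alpha=0.5):
    """Encourages perception to change smoothly along the perturbation path:
    features at x + alpha*delta should sit roughly alpha of the way between
    clean and fully perturbed features. feat_fn maps inputs to features."""
    f_clean = feat_fn(x)
    f_adv = feat_fn(x + delta)
    f_mid = feat_fn(x + alpha * delta)
    target = (1 - alpha) * f_clean + alpha * f_adv   # linear perception path
    return F.mse_loss(f_mid, target.detach())
```

Added to the usual AT objective, a term of this shape lets the model use the information inside perturbations instead of flattening them into noise, which is the failure mode the paper attributes to over-sufficient learning of hard adversarial samples.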
[365] CMIC: Content-Adaptive Mamba for Learned Image Compression
Yunuo Chen, Zezheng Lyu, Bing He, Hongwei Hu, Qi Wang, Yuan Tian, Li Song, Wenjun Zhang, Guo Lu
Main category: cs.CV
TL;DR: CAM introduces content-aware token reorganization and global priors to improve Mamba-style SSMs for learned image compression, achieving superior performance.
Details
Motivation: Vanilla Mamba's fixed, content-agnostic scans limit dynamic exploitation of content dependencies, prompting the need for a more adaptive approach.Method: CAM uses content-aware token clustering and reordering, along with a prompt dictionary to integrate global priors, enhancing token interactions.
Result: CMIC (CAM-based LIC model) outperforms VTM-21.0 by significant BD-rate margins on multiple benchmarks.
Conclusion: CAM effectively addresses Mamba’s limitations, enabling better global dependency capture and state-of-the-art compression performance.
Abstract: Recent Learned image compression (LIC) leverages Mamba-style state-space models (SSMs) for global receptive fields with linear complexity. However, vanilla Mamba is content-agnostic, relying on fixed and predefined selective scans, which restricts its ability to dynamically and fully exploit content dependencies. We introduce Content-Adaptive Mamba (CAM), a dynamic SSM that addresses two critical limitations. First, it employs content-aware token reorganization, clustering and reordering tokens based on content similarity to prioritize proximity in feature space over Euclidean space. Second, it integrates global priors into SSM via a prompt dictionary, effectively mitigating the strict causality and long-range decay in the token interactions of Mamba. These innovations enable CAM to better capture global dependencies while preserving computational efficiency. Leveraging CAM, our Content-Adaptive Mamba-based LIC model (CMIC) achieves state-of-the-art rate-distortion performance, surpassing VTM-21.0 by -15.91%, -21.34%, and -17.58% BD-rate on Kodak, Tecnick, and CLIC benchmarks, respectively.
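Content-aware token reorganization can be prototyped with a plain clustering pass: group tokens by feature similarity, then reorder them so similar tokens are adjacent before the state-space scan. A minimal sketch using vanilla k-means (an assumption; CAM's actual reorganization is learned within the model):

```python
import torch

def content_reorder(tokens, n_clusters=8, iters=5):
    """tokens: (N, D). Returns tokens reordered so content-similar ones are
    adjacent in the scan order, plus the permutation needed to undo it."""
    N, D = tokens.shape
    centers = tokens[torch.randperm(N)[:n_clusters]].clone()
    for _ in range(iters):                          # plain k-means
        assign = torch.cdist(tokens, centers).argmin(dim=1)
        for k in range(n_clusters):
            sel = tokens[assign == k]
            if len(sel) > 0:
                centers[k] = sel.mean(dim=0)
    perm = torch.argsort(assign, stable=True)       # group clusters together
    return tokens[perm], perm
```

The permutation must be inverted after the scan (`tokens_out[perm.argsort()]`) so the decoder sees features in their original spatial order.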
[366] Welcome New Doctor: Continual Learning with Expert Consultation and Autoregressive Inference for Whole Slide Image Analysis
Doanh Cao Bui, Jin Tae Kwak
Main category: cs.CV
TL;DR: COSFormer is a Transformer-based continual learning framework for multi-task WSI analysis, designed to efficiently adapt to new tasks without retraining on historical data.
Details
Motivation: The need for a resource-efficient continual learning system for WSI analysis due to the increasing volume of WSIs in clinical settings.Method: COSFormer, a Transformer-based framework, learns sequentially from new tasks without revisiting full historical datasets.
Result: Outperforms existing continual learning frameworks in generalizability and effectiveness across seven WSI datasets and six tasks.
Conclusion: COSFormer is a robust solution for continual WSI analysis in clinical applications.
Abstract: Whole Slide Image (WSI) analysis, with its ability to reveal detailed tissue structures in magnified views, plays a crucial role in cancer diagnosis and prognosis. Due to their giga-sized nature, WSIs require substantial storage and computational resources for processing and training predictive models. With the rapid increase in WSIs used in clinics and hospitals, there is a growing need for a continual learning system that can efficiently process and adapt existing models to new tasks without retraining or fine-tuning on previous tasks. Such a system must balance resource efficiency with high performance. In this study, we introduce COSFormer, a Transformer-based continual learning framework tailored for multi-task WSI analysis. COSFormer is designed to learn sequentially from new tasks while avoiding the need to revisit full historical datasets. We evaluate COSFormer on a sequence of seven WSI datasets covering seven organs and six WSI-related tasks under both class-incremental and task-incremental settings. The results demonstrate COSFormer’s superior generalizability and effectiveness compared to existing continual learning frameworks, establishing it as a robust solution for continual WSI analysis in clinical applications.
[367] An Event-based Fast Intensity Reconstruction Scheme for UAV Real-time Perception
Xin Dong, Yiwei Zhang, Yangjie Cui, Jinwu Xiang, Daochun Li, Zhan Tu
Main category: cs.CV
TL;DR: Proposes Event-based Single Integration (ESI), a method for real-time intensity reconstruction from event camera data, offering high frame rates and low computation for UAV applications.
Details
Motivation: Event cameras excel in challenging visual conditions but require efficient methods to process asynchronous event streams for practical use, especially in UAV-based tracking.Method: ESI reconstructs intensity images via single integration of event streams and an enhanced decay algorithm, enabling real-time processing at 100 FPS.
Result: ESI outperforms state-of-the-art in runtime efficiency, reconstruction quality, and frame rate, proving effective in low-light UAV tracking.
Conclusion: ESI is a practical solution for event-based intensity reconstruction, enhancing UAV perception in adverse visual conditions.
Abstract: Event cameras offer significant advantages, including a wide dynamic range, high temporal resolution, and immunity to motion blur, making them highly promising for addressing challenging visual conditions. Extracting and utilizing effective information from asynchronous event streams is essential for the onboard implementation of event cameras. In this paper, we propose a streamlined event-based intensity reconstruction scheme, event-based single integration (ESI), to address such implementation challenges. This method guarantees the portability of conventional frame-based vision methods to event-based scenarios and maintains the intrinsic advantages of event cameras. The ESI approach reconstructs intensity images by performing a single integration of the event streams combined with an enhanced decay algorithm. Such a method enables real-time intensity reconstruction at a high frame rate, typically 100 FPS. Furthermore, the relatively low computational load of ESI suits onboard implementation, such as in UAV-based visual tracking scenarios. Extensive experiments have been conducted to compare the performance of ESI against state-of-the-art algorithms. Compared to these algorithms, ESI demonstrates remarkable runtime efficiency improvements, superior reconstruction quality, and a high frame rate. As a result, ESI significantly enhances UAV onboard perception in visually adverse surroundings. In in-flight tests, ESI demonstrates effective performance for UAV onboard visual tracking under extremely low illumination conditions (2-10 lux), whereas other comparative algorithms fail due to insufficient frame rate, poor image quality, or limited real-time performance.
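The single-integration idea is simple enough to sketch end to end: accumulate signed event polarities onto a canvas, emit a frame at a fixed rate, and apply a decay so stale events fade. The exponential decay schedule below is an assumption standing in for the paper's enhanced decay algorithm:

```python
import numpy as np

def esi_reconstruct(events, H, W, decay=0.97, frame_dt=0.01):
    """Single-integration intensity reconstruction from an event stream.
    events: iterable of (t, x, y, polarity in {-1, +1}); emits one frame
    every frame_dt seconds (100 FPS at the default)."""
    img = np.zeros((H, W), dtype=np.float32)
    frames, next_t = [], None
    for t, x, y, p in events:
        next_t = t + frame_dt if next_t is None else next_t
        img[y, x] += p                   # single integration of polarity
        if t >= next_t:                  # emit a frame, decay the canvas
            frames.append(img.copy())
            img *= decay
            next_t += frame_dt
    return frames
```

A single pass over the stream with one add per event is what keeps the computational load low enough for onboard UAV deployment.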
[368] I2CR: Intra- and Inter-modal Collaborative Reflections for Multimodal Entity Linking
Ziyan Liu, Junwen Li, Kaiwen Li, Tong Ruan, Chao Wang, Xinyan He, Zongyu Wang, Xuezhi Cao, Jingping Liu
Main category: cs.CV
TL;DR: A novel LLM-based framework, Intra- and Inter-modal Collaborative Reflections, improves multimodal entity linking by prioritizing text and iteratively integrating visual clues when needed, outperforming state-of-the-art methods.
Details
Motivation: Address challenges in current multimodal entity linking methods, such as unnecessary image data use and one-time visual feature extraction, which reduce effectiveness and accuracy.Method: Proposes a framework that leverages text first, then uses a multi-round iterative strategy to integrate key visual clues when text is insufficient.
Result: Outperforms state-of-the-art methods on three datasets with improvements of 3.2%, 5.1%, and 1.6%.
Conclusion: The framework effectively balances text and visual data, enhancing accuracy and performance in multimodal entity linking.
Abstract: Multimodal entity linking plays a crucial role in a wide range of applications. Recent advances in large language model-based methods have become the dominant paradigm for this task, effectively leveraging both textual and visual modalities to enhance performance. Despite their success, these methods still face two challenges, including unnecessary incorporation of image data in certain scenarios and the reliance only on a one-time extraction of visual features, which can undermine their effectiveness and accuracy. To address these challenges, we propose a novel LLM-based framework for the multimodal entity linking task, called Intra- and Inter-modal Collaborative Reflections. This framework prioritizes leveraging text information to address the task. When text alone is insufficient to link the correct entity through intra- and inter-modality evaluations, it employs a multi-round iterative strategy that integrates key visual clues from various aspects of the image to support reasoning and enhance matching accuracy. Extensive experiments on three widely used public datasets demonstrate that our framework consistently outperforms current state-of-the-art methods in the task, achieving improvements of 3.2%, 5.1%, and 1.6%, respectively. Our code is available at https://github.com/ziyan-xiaoyu/I2CR/.
[369] Semi-Supervised Semantic Segmentation via Derivative Label Propagation
Yuanbin Fu, Xiaojie Guo
Main category: cs.CV
TL;DR: DerProp introduces derivative label propagation to improve pseudo-label reliability in semi-supervised semantic segmentation.
Details
Motivation: To address the unreliability of pseudo-labels in semi-supervised semantic segmentation and reduce annotation burden.Method: Uses derivative label propagation with discrete derivative operations on pixel-wise feature vectors for regularization.
Result: Enhances pseudo-label reliability and outperforms other methods in experiments.
Conclusion: DerProp effectively improves semi-supervised segmentation by regularizing pseudo-labels.
Abstract: Semi-supervised semantic segmentation, which leverages a limited set of labeled images, helps to relieve the heavy annotation burden. While pseudo-labeling strategies yield promising results, there is still room for enhancing the reliability of pseudo-labels. Hence, we develop a semi-supervised framework, namely DerProp, equipped with a novel derivative label propagation to rectify imperfect pseudo-labels. Our label propagation method imposes discrete derivative operations on pixel-wise feature vectors as additional regularization, thereby generating strictly regularized similarity metrics. Doing so effectively alleviates the ill-posed problem in which identical similarities correspond to different features, by constraining the solution space. Extensive experiments are conducted to verify the rationality of our design, and demonstrate our superiority over other methods. Codes are available at https://github.com/ForawardStar/DerProp/.
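To make the regularization idea concrete: if affinities for label propagation are computed not only from feature values but also from their discrete spatial derivatives, two pixels with coincidentally similar features but different local variation no longer receive identical similarities. A minimal PyTorch sketch of such a derivative-regularized affinity (the equal weighting of value and derivative terms is an assumption, not the paper's exact metric):

```python
import torch
import torch.nn.functional as F

def derivative_regularized_affinity(feat):
    """Sketch of a derivative-regularized similarity for label propagation.

    feat: (B, C, H, W) pixel-wise feature map. Returns (B, H*W, H*W)
    affinity matrices combining plain feature similarity with similarity
    of discrete spatial derivatives.
    """
    # Discrete forward differences along x and y (zero-padded at borders).
    dx = F.pad(feat[:, :, :, 1:] - feat[:, :, :, :-1], (0, 1))
    dy = F.pad(feat[:, :, 1:, :] - feat[:, :, :-1, :], (0, 0, 0, 1))

    def cosine_affinity(f):
        v = F.normalize(f.flatten(2), dim=1)        # (B, C, H*W)
        return torch.bmm(v.transpose(1, 2), v)      # (B, H*W, H*W)

    # Equal weighting of the three terms is an illustrative assumption.
    return (cosine_affinity(feat)
            + cosine_affinity(dx)
            + cosine_affinity(dy)) / 3.0

aff = derivative_regularized_affinity(torch.randn(2, 16, 8, 8))
```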
[370] Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning
Wenchuan Zhang, Jingru Guo, Hengzhe Zhang, Penghao Zhang, Jie Chen, Shuwan Zhang, Zhang Zhang, Yuhao Yi, Hong Bu
Main category: cs.CV
TL;DR: Patho-AgenticRAG is a multimodal RAG framework for pathology, addressing hallucinations in VLMs by leveraging joint text-image retrieval from authoritative textbooks, improving diagnostic accuracy.
Details
Motivation: Pathology VLMs face challenges like ultra-high resolution and complex semantics, leading to hallucinations and reduced clinical trust. Existing RAG methods lack visual cue integration.Method: Proposes Patho-AgenticRAG, a multimodal RAG framework with page-level embeddings from pathology textbooks, enabling joint text-image search and reasoning.
Result: Outperforms existing models in tasks like multiple-choice diagnosis and visual question answering.
Conclusion: Patho-AgenticRAG enhances diagnostic accuracy by integrating visual and textual information, addressing key limitations in pathology VLMs.
Abstract: Although Vision Language Models (VLMs) have shown strong generalization in medical imaging, pathology presents unique challenges due to ultra-high resolution, complex tissue structures, and nuanced clinical semantics. These factors make pathology VLMs prone to hallucinations, i.e., generating outputs inconsistent with visual evidence, which undermines clinical trust. Existing RAG approaches in this domain largely depend on text-based knowledge bases, limiting their ability to leverage diagnostic visual cues. To address this, we propose Patho-AgenticRAG, a multimodal RAG framework with a database built on page-level embeddings from authoritative pathology textbooks. Unlike traditional text-only retrieval systems, it supports joint text-image search, enabling direct retrieval of textbook pages that contain both the queried text and relevant visual cues, thus avoiding the loss of critical image-based information. Patho-AgenticRAG also supports reasoning, task decomposition, and multi-turn search interactions, improving accuracy in complex diagnostic scenarios. Experiments show that Patho-AgenticRAG significantly outperforms existing multimodal models in complex pathology tasks like multiple-choice diagnosis and visual question answering. Our project is available at the Patho-AgenticRAG repository: https://github.com/Wenchuan-Zhang/Patho-AgenticRAG.
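The retrieval core, joint text-image search over page-level embeddings, reduces to scoring a fused query against a bank of page vectors. A minimal NumPy sketch under simple assumptions (a linear text/image fusion weight and cosine scoring; the paper's agentic multi-turn search wraps around such a primitive):

```python
import numpy as np

def retrieve_pages(query_text_emb, query_image_emb, page_embs, k=3, alpha=0.5):
    """Sketch of joint text-image page retrieval over page-level embeddings.

    query_text_emb:  (D,) embedding of the textual query
    query_image_emb: (D,) embedding of the query image region
    page_embs:       (N, D) page-level embeddings of textbook pages
    alpha:           text/image mixing weight (illustrative assumption)
    Returns indices of the top-k pages by cosine similarity.
    """
    q = alpha * query_text_emb + (1 - alpha) * query_image_emb
    q = q / (np.linalg.norm(q) + 1e-8)
    p = page_embs / (np.linalg.norm(page_embs, axis=1, keepdims=True) + 1e-8)
    scores = p @ q
    return np.argsort(-scores)[:k]

# Toy usage with random embeddings standing in for a real encoder.
rng = np.random.default_rng(0)
top = retrieve_pages(rng.normal(size=64), rng.normal(size=64),
                     rng.normal(size=(100, 64)))
```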
[371] SplatSSC: Decoupled Depth-Guided Gaussian Splatting for Semantic Scene Completion
Rui Qian, Haozhi Cao, Tianchen Deng, Shenghai Yuan, Lihua Xie
Main category: cs.CV
TL;DR: SplatSSC improves monocular 3D semantic scene completion with depth-guided initialization and a Gaussian aggregator, achieving state-of-the-art performance with reduced latency and memory usage.
Details
Motivation: Address inefficiencies and artifacts in object-centric paradigms due to random initialization of Gaussian primitives.Method: Uses a depth-guided initialization strategy (GMF module) and a Decoupled Gaussian Aggregator (DGA) for robust predictions.
Result: Achieves 6.3% higher IoU and 4.1% higher mIoU on Occ-ScanNet, with 9.3% lower latency and memory consumption.
Conclusion: SplatSSC is an efficient and effective solution for monocular 3D semantic scene completion.
Abstract: Monocular 3D Semantic Scene Completion (SSC) is a challenging yet promising task that aims to infer dense geometric and semantic descriptions of a scene from a single image. While recent object-centric paradigms significantly improve efficiency by leveraging flexible 3D Gaussian primitives, they still rely heavily on a large number of randomly initialized primitives, which inevitably leads to 1) inefficient primitive initialization and 2) outlier primitives that introduce erroneous artifacts. In this paper, we propose SplatSSC, a novel framework that resolves these limitations with a depth-guided initialization strategy and a principled Gaussian aggregator. Instead of random initialization, SplatSSC utilizes a dedicated depth branch composed of a Group-wise Multi-scale Fusion (GMF) module, which integrates multi-scale image and depth features to generate a sparse yet representative set of initial Gaussian primitives. To mitigate noise from outlier primitives, we develop the Decoupled Gaussian Aggregator (DGA), which enhances robustness by decomposing geometric and semantic predictions during the Gaussian-to-voxel splatting process. Complemented with a specialized Probability Scale Loss, our method achieves state-of-the-art performance on the Occ-ScanNet dataset, outperforming prior approaches by over 6.3% in IoU and 4.1% in mIoU, while reducing both latency and memory consumption by more than 9.3%. The code will be released upon acceptance.
[372] Semi-Supervised Dual-Threshold Contrastive Learning for Ultrasound Image Classification and Segmentation
Peng Zhang, Zhihui Lai, Heng Kong
Main category: cs.CV
TL;DR: Hermes, a semi-supervised dual-threshold contrastive learning strategy, improves ultrasound image classification and segmentation by addressing overconfident pseudo-labels and leveraging inter-task consistency.
Details
Motivation: To overcome the issues of overly confident yet incorrect pseudo-labels and the lack of affinity between segmentation and classification tasks in semi-supervised contrastive learning.Method: Proposes Hermes, combining contrastive and semi-supervised learning with inter-task attention, saliency modules, and consistency learning to align tumor features.
Result: Hermes outperforms state-of-the-art methods on public and private ultrasound datasets.
Conclusion: Hermes effectively enhances semi-supervised learning for ultrasound tasks by integrating pseudo-label guidance and inter-task alignment.
Abstract: Confidence-based pseudo-label selection usually generates overly confident yet incorrect predictions, due to the model’s early-stage errors and overfitting to inaccurate pseudo-labels during learning, which heavily degrades the performance of semi-supervised contrastive learning. Moreover, segmentation and classification tasks are often treated independently and their affinity is not fully explored. To address these issues, we propose a novel semi-supervised dual-threshold contrastive learning strategy for ultrasound image classification and segmentation, named Hermes. This strategy combines the strengths of contrastive learning with semi-supervised learning, where the pseudo-labels assist contrastive learning by providing additional guidance. Specifically, an inter-task attention and saliency module is also developed to facilitate information sharing between the segmentation and classification tasks. Furthermore, an inter-task consistency learning strategy is designed to align tumor features across both tasks, avoiding negative transfer by reducing feature discrepancy. To address the lack of publicly available ultrasound datasets, we have collected the SZ-TUS dataset, a thyroid ultrasound image dataset. Extensive experiments on two public ultrasound datasets and one private dataset demonstrate that Hermes consistently outperforms several state-of-the-art methods across various semi-supervised settings.
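The dual-threshold idea can be illustrated with a small selection routine. This is only a sketch of one plausible reading, with assumed threshold values and an assumed three-way split (confident pseudo-labels, ambiguous samples kept for contrastive learning, the rest discarded); the paper's exact rule may differ.

```python
import torch

def dual_threshold_select(probs, t_high=0.95, t_low=0.6):
    """Sketch of dual-threshold pseudo-label selection.

    probs: (N, K) softmax outputs on unlabeled ultrasound images.
    Samples above t_high become pseudo-labels; those between t_low and
    t_high are kept only as contrastive-learning candidates; the rest
    are discarded. Thresholds are illustrative assumptions.
    """
    conf, labels = probs.max(dim=1)
    confident = conf >= t_high
    ambiguous = (conf >= t_low) & ~confident
    return labels, confident, ambiguous

probs = torch.softmax(torch.randn(8, 3), dim=1)
labels, confident, ambiguous = dual_threshold_select(probs)
```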
[373] SGAD: Semantic and Geometric-aware Descriptor for Local Feature Matching
Xiangzeng Liu, Chi Wang, Guanglu Shi, Xiaodong Zhang, Qiguang Miao, Miao Fan
Main category: cs.CV
TL;DR: SGAD introduces a novel descriptor network for area-based matching, improving accuracy and efficiency by avoiding complex graph optimization and using a new supervision strategy.
Details
Motivation: Existing A2PM methods are inefficient due to pixel-level comparisons and complex graph matching, limiting scalability.Method: SGAD generates discriminative area descriptors for direct matching, uses a novel supervision strategy, and employs HCRF to eliminate overlapping areas.
Result: SGAD reduces runtime by 60x, improves accuracy (e.g., 65.98 vs. 61.11 in outdoor pose estimation), and achieves state-of-the-art performance.
Conclusion: SGAD significantly advances area-based matching by combining semantic and geometric awareness, offering a scalable and efficient solution.
Abstract: Local feature matching remains a fundamental challenge in computer vision. Recent Area to Point Matching (A2PM) methods have improved matching accuracy. However, existing research based on this framework relies on inefficient pixel-level comparisons and complex graph matching that limit scalability. In this work, we introduce the Semantic and Geometric-aware Descriptor Network (SGAD), which fundamentally rethinks area-based matching by generating highly discriminative area descriptors that enable direct matching without complex graph optimization. This approach significantly improves both accuracy and efficiency of area matching. We further improve the performance of area matching through a novel supervision strategy that decomposes the area matching task into classification and ranking subtasks. Finally, we introduce the Hierarchical Containment Redundancy Filter (HCRF) to eliminate overlapping areas by analyzing containment graphs. SGAD demonstrates remarkable performance gains, reducing runtime by 60x (0.82s vs. 60.23s) compared to MESA. Extensive evaluations show consistent improvements across multiple point matchers: SGAD+LoFTR reduces runtime compared to DKM, while achieving higher accuracy (0.82s vs. 1.51s, 65.98 vs. 61.11) in outdoor pose estimation, and SGAD+ROMA delivers +7.39% AUC@5° in indoor pose estimation, establishing a new state-of-the-art.
[374] Do Edges Matter? Investigating Edge-Enhanced Pre-Training for Medical Image Segmentation
Paul Zaha, Lars Böcking, Simeon Allmendinger, Leopold Müller, Niklas Kühl
Main category: cs.CV
TL;DR: The paper explores how edge-focused pre-training of foundation models impacts medical image segmentation across modalities, proposing a meta-learning strategy for optimal performance.
Details
Motivation: To address the gap in understanding how edge preprocessing affects segmentation performance in medical imaging and improve cross-modality capabilities.Method: Pre-train foundation models on raw or edge-enhanced data, fine-tune on modality-specific subsets, and introduce a meta-learning strategy based on image properties.
Result: Edge-focused pre-training shows mixed results, but the meta-learning strategy improves overall segmentation performance by 16.42% over edge-only and 19.30% over raw-only models.
Conclusion: Selective application of edge-focused pre-training, guided by meta-learning, enhances segmentation performance across diverse medical imaging tasks.
Abstract: Medical image segmentation is crucial for disease diagnosis and treatment planning, yet developing robust segmentation models often requires substantial computational resources and large datasets. Existing research shows that pre-trained and fine-tuned foundation models can boost segmentation performance. However, questions remain about how particular image preprocessing steps may influence segmentation performance across different medical imaging modalities. In particular, edges (abrupt transitions in pixel intensity) are widely acknowledged as vital cues for object boundaries but have not been systematically examined in the pre-training of foundation models. We address this gap by investigating to what extent pre-training with data processed using computationally efficient edge kernels, such as Kirsch, can improve the cross-modality segmentation capabilities of a foundation model. Two versions of a foundation model are first trained on either raw or edge-enhanced data across multiple medical imaging modalities, then fine-tuned on selected raw subsets tailored to specific medical modalities. After systematic investigation using the medical domains Dermoscopy, Fundus, Mammography, Microscopy, OCT, US, and XRay, we discover both increased and reduced segmentation performance across modalities using edge-focused pre-training, indicating the need for a selective application of this approach. To guide such selective applications, we propose a meta-learning strategy. It uses standard deviation and image entropy of the raw image to choose between a model pre-trained on edge-enhanced or on raw data for optimal performance. Our experiments show that integrating this meta-learning layer yields an overall segmentation performance improvement across diverse medical imaging tasks by 16.42% compared to models pre-trained on edge-enhanced data only and 19.30% compared to models pre-trained on raw data only.
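The routing signal the abstract names, standard deviation and entropy of the raw image, is cheap to compute. Below is a minimal sketch of such a selector; the thresholds and the hard-coded rule are illustrative assumptions (the paper learns the selection rather than fixing it by hand).

```python
import numpy as np

def choose_pretraining(img, std_thresh=0.15, ent_thresh=4.0):
    """Sketch of the meta-learning routing idea: pick the edge-pretrained
    or raw-pretrained model from simple statistics of the raw image.

    img: 2D grayscale array scaled to [0, 1]. Thresholds, and the rule
    that high contrast plus high entropy favors the edge-pretrained
    model, are illustrative assumptions.
    """
    std = float(img.std())
    hist, _ = np.histogram(img, bins=256, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return "edge" if (std > std_thresh and entropy > ent_thresh) else "raw"

model_choice = choose_pretraining(np.random.rand(128, 128))
```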
[375] Unleashing the Temporal Potential of Stereo Event Cameras for Continuous-Time 3D Object Detection
Jae-Young Kang, Hoonhee Cho, Kuk-Jin Yoon
Main category: cs.CV
TL;DR: A novel stereo 3D object detection framework using only event cameras, outperforming prior methods in dynamic environments.
Details
Motivation: Addressing perception gaps in high-speed scenarios caused by fixed frame rates of LiDAR and RGB cameras by leveraging event cameras' asynchronous nature and high temporal resolution.Method: Proposes a dual filter mechanism to extract semantic and geometric information from event data and enhances regression by aligning bounding boxes with object-centric information.
Result: Outperforms prior approaches in dynamic environments, showcasing the potential of event cameras for robust, continuous-time 3D perception.
Conclusion: Event cameras can effectively replace conventional 3D sensors for continuous-time detection, especially in fast-motion scenarios.
Abstract: 3D object detection is essential for autonomous systems, enabling precise localization and dimension estimation. While LiDAR and RGB cameras are widely used, their fixed frame rates create perception gaps in high-speed scenarios. Event cameras, with their asynchronous nature and high temporal resolution, offer a solution by capturing motion continuously. The recent approach, which integrates event cameras with conventional sensors for continuous-time detection, struggles in fast-motion scenarios due to its dependency on synchronized sensors. We propose a novel stereo 3D object detection framework that relies solely on event cameras, eliminating the need for conventional 3D sensors. To compensate for the lack of semantic and geometric information in event data, we introduce a dual filter mechanism that extracts both. Additionally, we enhance regression by aligning bounding boxes with object-centric information. Experiments show that our method outperforms prior approaches in dynamic environments, demonstrating the potential of event cameras for robust, continuous-time 3D perception. The code is available at https://github.com/mickeykang16/Ev-Stereo3D.
[376] Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description
Mahmoud Ahmed, Junjie Fei, Jian Ding, Eslam Mohamed Bakr, Mohamed Elhoseiny
Main category: cs.CV
TL;DR: The paper introduces PaPGD, a task for fine-grained 3D multimodal learning, and the 3DCoMPaT-GrIn Dataset. It proposes Kestrel, a model combining language understanding with spatial reasoning for part-level 3D segmentation grounding.
Details
Motivation: Addressing the lack of fine-grained multimodal segmentation in existing 3D datasets for robotic applications.Method: Introduces the 3DCoMPaT-GrIn Dataset and Kestrel, a part-aware 3D multimodal model with multi-level point feature propagation and query refinement.
Result: Kestrel effectively bridges part-aware language understanding and 3D segmentation grounding.
Conclusion: The work advances robust and interpretable 3D object comprehension for real-world robotic applications.
Abstract: In this paper, we introduce Part-Aware Point Grounded Description (PaPGD), a challenging task aimed at advancing 3D multimodal learning for fine-grained, part-aware segmentation grounding and detailed explanation of 3D objects. Existing 3D datasets largely focus on either vision-only part segmentation or vision-language scene segmentation, lacking the fine-grained multimodal segmentation needed for robotic navigation and interaction in real-world environments. To address this gap, we present the 3DCoMPaT Grounded Instructions (3DCoMPaT-GrIn) Dataset, a comprehensive resource that pairs rich point cloud descriptions with corresponding part-level segmentation masks. This dataset encompasses extensive samples designed for both PaPGD and fine-grained single-part grounding tasks. To tackle the inherent challenges of grounding objects and generating grounded descriptions at the part level, we propose Kestrel, a part-aware 3D multimodal large language model that integrates an advanced language model for nuanced language comprehension with multi-level point feature propagation and query refinement mechanism to enhance spatial reasoning at the part level. The extensive experiments demonstrate that Kestrel effectively bridges the gap between part-aware language understanding and 3D segmentation grounding, paving the way for more robust and interpretable 3D object comprehension that meets the demands of real-world robotic applications. Project page at https://feielysia.github.io/Kestrel.github.io/
[377] Towards Real Unsupervised Anomaly Detection Via Confident Meta-Learning
Muhammad Aqeel, Shakiba Sharifi, Marco Cristani, Francesco Setti
Main category: cs.CV
TL;DR: CoMet introduces a meta-learning strategy for anomaly detection, enabling training on uncurated datasets with mixed nominal and anomalous samples, improving robustness and performance.
Details
Motivation: Current unsupervised anomaly detection assumes all training data are nominal, requiring manual curation and introducing bias. CoMet aims to eliminate this need by learning from mixed datasets.Method: CoMet combines Soft Confident Learning (weighting low-confidence samples less) and Meta-Learning (regularizing updates via validation loss covariance) to stabilize training and prevent overfitting.
Result: Experiments on MVTec-AD, VIADUCT, and KSDD2 show CoMet outperforms baselines, remains robust to training anomalies, and achieves state-of-the-art results.
Conclusion: CoMet provides a model-agnostic, effective solution for training anomaly detection models on uncurated datasets, setting new benchmarks.
Abstract: So-called unsupervised anomaly detection is better described as semi-supervised, as it assumes all training data are nominal. This assumption simplifies training but requires manual data curation, introducing bias and limiting adaptability. We propose Confident Meta-learning (CoMet), a novel training strategy that enables deep anomaly detection models to learn from uncurated datasets where nominal and anomalous samples coexist, eliminating the need for explicit filtering. Our approach integrates Soft Confident Learning, which assigns lower weights to low-confidence samples, and Meta-Learning, which stabilizes training by regularizing updates based on the training-validation loss covariance. This prevents overfitting and enhances robustness to noisy data. CoMet is model-agnostic and can be applied to any anomaly detection method trainable via gradient descent. Experiments on MVTec-AD, VIADUCT, and KSDD2 with two state-of-the-art models demonstrate the effectiveness of our approach, consistently improving over the baseline methods, remaining insensitive to anomalies in the training set, and setting a new state-of-the-art across all datasets.
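The Soft Confident Learning component amounts to per-sample weighting: training images the current model suspects of being anomalous contribute less to the loss. A minimal sketch under stated assumptions (the sigmoid form, the median centering, and the temperature are all illustrative, not the paper's weighting):

```python
import torch

def soft_confidence_weights(anomaly_scores, temperature=1.0):
    """Sketch of Soft Confident Learning-style sample weighting.

    anomaly_scores: (N,) current model scores on uncurated training
    images, where high scores suggest a sample may itself be anomalous.
    Such samples receive lower weight so the model is not fit to them.
    """
    return torch.sigmoid(-(anomaly_scores - anomaly_scores.median())
                         / temperature)

scores = torch.randn(16).abs()
weights = soft_confidence_weights(scores)
weighted_loss = (weights * scores).mean()  # weights plug into any per-sample loss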
[378] VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges
Yuxuan Wang, Yiqi Song, Cihang Xie, Yang Liu, Zilong Zheng
Main category: cs.CV
TL;DR: VideoLLaMB is an efficient framework for long video understanding, using recurrent memory bridges and temporal memory tokens. It outperforms existing models in VideoQA and egocentric planning tasks, with linear GPU memory scaling.
Details
Motivation: High computational demands and limited annotated datasets hinder the practicality of large-scale video-language models for academic researchers.Method: VideoLLaMB employs recurrent memory bridges, temporal memory tokens, and a SceneTiling algorithm to segment videos into semantic units, enabling robust understanding without additional training.
Result: Achieves state-of-the-art performance, surpassing models by 4.2 points on VideoQA benchmarks and 2.06 points on egocentric tasks, with strong scaling up to 8 times video length.
Conclusion: VideoLLaMB offers an unprecedented balance of accuracy, scalability, and cost-effectiveness, making it highly accessible for academic use.
Abstract: Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interactions. However, their high computational demands and the scarcity of annotated datasets limit their practicality for academic researchers. In this work, we introduce VideoLLaMB, a novel and efficient framework for long video understanding that leverages recurrent memory bridges and temporal memory tokens to enable seamless encoding of entire video sequences with preserved semantic continuity. Central to our approach is a SceneTiling algorithm that segments videos into coherent semantic units, facilitating robust understanding across tasks without requiring additional training. VideoLLaMB achieves state-of-the-art performance, surpassing existing models by 4.2 points on four VideoQA benchmarks and by 2.06 points on egocentric planning tasks. Notably, it maintains strong performance under extreme video length scaling (up to 8 times) and excels at fine-grained frame retrieval on our proposed Needle in a Video Haystack (NIAVH) benchmark. With linear GPU memory scaling, VideoLLaMB processes up to 320 frames using a single Nvidia A100 GPU, despite being trained on only 16 frames-offering an unprecedented balance of accuracy, scalability, and cost-effectiveness. This makes it highly accessible and practical for the academic community.
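SceneTiling, as described, segments a video into coherent semantic units. One simple way to realize boundary detection of this kind is to cut wherever the similarity of consecutive frame features drops; the fixed threshold below is an illustrative assumption, and the paper's algorithm may pick boundaries differently.

```python
import numpy as np

def scene_tiling(frame_feats, sim_thresh=0.85):
    """Sketch of a SceneTiling-style segmentation: cut the video wherever
    cosine similarity between consecutive frame features falls below a
    threshold, yielding coherent semantic units.

    frame_feats: (T, D) per-frame features from any visual encoder.
    """
    f = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    sims = (f[:-1] * f[1:]).sum(axis=1)         # similarity of neighbors
    cuts = np.where(sims < sim_thresh)[0] + 1   # boundary after frame i
    return np.split(np.arange(len(frame_feats)), cuts)

segments = scene_tiling(np.random.rand(32, 64))
```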
[379] Whole-body Representation Learning For Competing Preclinical Disease Risk Assessment
Dmitrii Seletkov, Sophie Starck, Ayhan Can Erdur, Yundi Zhang, Daniel Rueckert, Rickmer Braren
Main category: cs.CV
TL;DR: A self-supervised learning method for whole-body preclinical disease risk assessment outperforms traditional radiomics in predicting multiple diseases and enhances CVD subgroup predictions when combined with cardiac MRI.
Details
Motivation: Current image-based risk prediction methods are limited to single conditions and rely on manual feature extraction, highlighting the need for a more comprehensive and automated approach.Method: Proposes a whole-body self-supervised representation learning method under competing risk modeling, validated in diseases like CVD, T2D, COPD, and CKD. Combined with cardiac MRI for CVD subgroups.
Result: Outperforms whole-body radiomics in multiple diseases and improves CVD subgroup predictions (IHD, HD, stroke) in preclinical screening scenarios.
Conclusion: Demonstrates translational potential for standalone screening and multi-modal clinical workflows, enabling early personalized risk stratification.
Abstract: Reliable preclinical disease risk assessment is essential to move public healthcare from reactive treatment to proactive identification and prevention. However, image-based risk prediction algorithms often consider one condition at a time and depend on hand-crafted features obtained through segmentation tools. We propose a whole-body self-supervised representation learning method for preclinical disease risk assessment under a competing-risk model. This approach outperforms whole-body radiomics in multiple diseases, including cardiovascular disease (CVD), type 2 diabetes (T2D), chronic obstructive pulmonary disease (COPD), and chronic kidney disease (CKD). Simulating a preclinical screening scenario and subsequently combining with cardiac MRI, it further sharpens the prediction for CVD subgroups: ischemic heart disease (IHD), hypertensive diseases (HD), and stroke. The results indicate the translational potential of whole-body representations as a standalone screening modality and as part of a multi-modal framework within clinical workflows for early personalized risk stratification. The code is available at https://github.com/yayapa/WBRLforCR/
[380] CheXalign: Preference fine-tuning in chest X-ray interpretation models without human feedback
Dennis Hein, Zhihong Chen, Sophie Ostmeier, Justin Xu, Maya Varma, Eduardo Pontes Reis, Arne Edward Michalson, Christian Bluethgen, Hyun Joo Shin, Curtis Langlotz, Akshay S Chaudhari
Main category: cs.CV
TL;DR: The paper proposes an automated pipeline for preference feedback in radiology report generation (RRG) using publicly available datasets, avoiding costly radiologist feedback. It addresses reward overoptimization and achieves state-of-the-art performance.
Details
Motivation: Addressing staffing shortages and high workloads in radiology by improving automated report generation without relying on expensive radiologist feedback.Method: Leverages publicly available datasets and reference-based metrics (Judges) for preference feedback, introduces a length-controlled GREEN score, and avoids additional radiologist input.
Result: Achieves state-of-the-art CheXbert scores on MIMIC-CXR for RRG while maintaining robust performance across other tasks.
Conclusion: The proposed method effectively automates preference feedback for RRG, demonstrating high accuracy and scalability without additional radiologist costs.
Abstract: Radiologists play a crucial role in translating medical images into actionable reports. However, the field faces staffing shortages and increasing workloads. While automated approaches using vision-language models (VLMs) show promise as assistants, they require exceptionally high accuracy. Most current VLMs in radiology rely solely on supervised fine-tuning. Meanwhile, additional preference fine-tuning in the post-training pipeline has become standard practice in the general domain. The challenge in radiology lies in the prohibitive cost of obtaining radiologist feedback at scale. To address this challenge, we propose an automated pipeline for preference feedback, focusing on chest X-ray radiology report generation (RRG). Specifically, our method leverages publicly available datasets containing pairs of images and radiologist-written reference reports with reference-based metrics, or Judges, eliminating the need for additional radiologist feedback. We investigate reward overoptimization via length exploitation in this setting and introduce a length-controlled version of the GREEN score. Our best-performing setup achieves state-of-the-art CheXbert scores on the MIMIC-CXR dataset for the RRG task while on average maintaining robust performance across six additional image perception and reasoning tasks.
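The length-controlled reward the paper motivates guards against the policy inflating its score by writing longer reports. A minimal sketch of the generic idea; the linear penalty form and weight are illustrative assumptions, not the paper's definition of the length-controlled GREEN score.

```python
def length_controlled_score(base_score, gen_len, ref_len, penalty=0.01):
    """Sketch of a length-controlled reward: subtract a penalty
    proportional to how far the generated report exceeds the reference
    length, so reward cannot be gamed through length exploitation."""
    return base_score - penalty * max(gen_len - ref_len, 0)

r = length_controlled_score(base_score=0.82, gen_len=180, ref_len=120)
```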
[381] Is Uncertainty Quantification a Viable Alternative to Learned Deferral?
Anna M. Wundram, Christian F. Baumgartner
Main category: cs.CV
TL;DR: AI models for patient care need human-AI collaboration due to fallibility. Uncertainty quantification methods show promise for robust deferral strategies, outperforming learned deferral models in out-of-distribution scenarios.
Details
Motivation: To ensure safe AI implementation in healthcare by improving models' ability to defer decisions to humans when uncertain, especially under data shift conditions.Method: Evaluated learned deferral models and uncertainty quantification methods on an ophthalmology dataset, focusing on glaucoma classification and deferral accuracy in- and out-of-distribution.
Result: Uncertainty quantification methods were more robust to out-of-distribution inputs and effective for deferral compared to supervised learned deferral models.
Conclusion: Uncertainty quantification is a promising approach for AI deferral in clinical settings, enhancing safety and reliability.
Abstract: Artificial Intelligence (AI) holds the potential to dramatically improve patient care. However, it is not infallible, necessitating human-AI collaboration to ensure safe implementation. One aspect of AI safety is the models’ ability to defer decisions to a human expert when they are likely to misclassify autonomously. Recent research has focused on methods that learn to defer by optimising a surrogate loss function that finds the optimal trade-off between predicting a class label or deferring. However, during clinical translation, models often face challenges such as data shift. Uncertainty quantification methods aim to estimate a model’s confidence in its predictions. However, they may also be used as a deferral strategy that does not rely on learning from a specific training distribution. We hypothesise that models developed to quantify uncertainty are more robust to out-of-distribution (OOD) input than learned deferral models that have been trained in a supervised fashion. To investigate this hypothesis, we constructed an extensive evaluation study on a large ophthalmology dataset, examining both learned deferral models and established uncertainty quantification methods, assessing their performance in- and out-of-distribution. Specifically, we evaluate their ability to accurately classify glaucoma from fundus images while deferring cases with a high likelihood of error. We find that uncertainty quantification methods may be a promising choice for AI deferral.
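A standard uncertainty-based deferral rule of the kind studied here is to defer the cases with the highest predictive entropy. A minimal sketch (the fixed deferral budget is an illustrative assumption; the study's exact operating points may differ):

```python
import numpy as np

def defer_by_entropy(probs, budget=0.2):
    """Defer the fraction of cases with the highest predictive entropy.

    probs: (N, K) predicted class probabilities (e.g., glaucoma vs.
    normal), ideally averaged over MC-dropout or ensemble members.
    Returns a boolean mask of cases sent to a human expert.
    """
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    n_defer = int(np.ceil(budget * len(probs)))
    defer_idx = np.argsort(-entropy)[:n_defer]
    defer_mask = np.zeros(len(probs), dtype=bool)
    defer_mask[defer_idx] = True
    return defer_mask

probs = np.random.dirichlet(np.ones(2), size=10)
mask = defer_by_entropy(probs)
```

Unlike a learned deferral head, this rule needs no supervision on the deferral decision itself, which is what makes it attractive under data shift.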
[382] Zero-shot Compositional Action Recognition with Neural Logic Constraints
Gefan Ye, Lin Li, Kexin Li, Jun Xiao, Long Chen
Main category: cs.CV
TL;DR: LogicCAR introduces dual symbolic constraints (compositional and hierarchical logic) to improve zero-shot compositional action recognition, outperforming baselines on the Sth-com dataset.
Details
Motivation: Addresses spurious correlations and semantic ambiguity in ZS-CAR by leveraging human-like symbolic reasoning.Method: Proposes LogicCAR, integrating explicit compositional and hierarchical primitive logic into neural networks via first-order logic.
Result: Outperforms existing methods on the Sth-com dataset.
Conclusion: LogicCAR effectively bridges symbolic abstraction with neural models, enhancing compositional reasoning in ZS-CAR.
Abstract: Zero-shot compositional action recognition (ZS-CAR) aims to identify unseen verb-object compositions in the videos by exploiting the learned knowledge of verb and object primitives during training. Despite compositional learning’s progress in ZS-CAR, two critical challenges persist: 1) Missing compositional structure constraint, leading to spurious correlations between primitives; 2) Neglecting semantic hierarchy constraint, leading to semantic ambiguity and impairing the training process. In this paper, we argue that human-like symbolic reasoning offers a principled solution to these challenges by explicitly modeling compositional and hierarchical structured abstraction. To this end, we propose a logic-driven ZS-CAR framework LogicCAR that integrates dual symbolic constraints: Explicit Compositional Logic and Hierarchical Primitive Logic. Specifically, the former models the restrictions within the compositions, enhancing the compositional reasoning ability of our model. The latter investigates the semantical dependencies among different primitives, empowering the models with fine-to-coarse reasoning capacity. By formalizing these constraints in first-order logic and embedding them into neural network architectures, LogicCAR systematically bridges the gap between symbolic abstraction and existing models. Extensive experiments on the Sth-com dataset demonstrate that our LogicCAR outperforms existing baseline methods, proving the effectiveness of our logic-driven constraints.
[383] Dream-to-Recon: Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images
Philipp Wulff, Felix Wimbauer, Dominik Muhle, Daniel Cremers
Main category: cs.CV
TL;DR: A method for single-image volumetric scene reconstruction using pre-trained 2D diffusion and depth prediction models, outperforming multi-view supervised baselines.
Details
Motivation: To enable cost-effective volumetric scene reconstruction without requiring expensive 3D ground truth or multi-view supervision.Method: Leverages pre-trained 2D diffusion models and depth prediction models to generate synthetic scene geometry from a single image, distilling a feed-forward reconstruction model.
Result: Matches or outperforms state-of-the-art baselines on KITTI-360 and Waymo datasets, especially in dynamic scenes.
Conclusion: The proposed method offers a viable alternative to multi-view supervision, with advantages in dynamic scenes.
Abstract: Volumetric scene reconstruction from a single image is crucial for a broad range of applications like autonomous driving and robotics. Recent volumetric reconstruction methods achieve impressive results, but generally require expensive 3D ground truth or multi-view supervision. We propose to leverage pre-trained 2D diffusion models and depth prediction models to generate synthetic scene geometry from a single image. This can then be used to distill a feed-forward scene reconstruction model. Our experiments on the challenging KITTI-360 and Waymo datasets demonstrate that our method matches or outperforms state-of-the-art baselines that use multi-view supervision, and offers unique advantages, for example regarding dynamic scenes.
[384] Qwen-Image Technical Report
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun Wen, Wensen Feng, Xiaoxiao Xu, Yi Wang, Yichang Zhang, Yongqiang Zhu, Yujia Wu, Yuxuan Cai, Zenan Liu
Main category: cs.CV
TL;DR: Qwen-Image is an advanced image generation model excelling in complex text rendering and precise editing, leveraging a robust data pipeline and progressive training.
Details
Motivation: Address challenges in complex text rendering and image editing consistency.Method: Uses a comprehensive data pipeline and progressive training for text rendering. Introduces a multi-task training paradigm and dual-encoding for editing.
Result: Achieves state-of-the-art performance in image generation and editing, excelling in both alphabetic and logographic languages.
Conclusion: Qwen-Image demonstrates superior capabilities in text rendering and image editing, validated by benchmark results.
Abstract: We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model’s native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.
[385] CLIP-IN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions
Ziteng Wang, Siqi Yang, Limeng Qiao, Lin Ma
Main category: cs.CV
TL;DR: CLIP-IN enhances CLIP’s fine-grained perception using instruction-editing datasets and long captions, improving performance on fine-grained tasks without losing zero-shot capabilities.
Details
Motivation: Addressing the limitation of Vision-Language Models (VLMs) like CLIP in detailed visual comprehension.Method: Uses instruction-editing datasets for hard negative pairs and a symmetric contrastive loss, along with long captions and rotary positional encodings.
Result: Achieves gains on MMVP and fine-grained tasks, reduces hallucinations in Multimodal LLMs, and maintains zero-shot performance.
Conclusion: Combining targeted contrastive learning with descriptive information improves VLM fine-grained understanding.
Abstract: Despite the success of Vision-Language Models (VLMs) like CLIP in aligning vision and language, their proficiency in detailed, fine-grained visual comprehension remains a key challenge. We present CLIP-IN, a novel framework that bolsters CLIP’s fine-grained perception through two core innovations. Firstly, we leverage instruction-editing datasets, originally designed for image manipulation, as a unique source of hard negative image-text pairs. Coupled with a symmetric hard negative contrastive loss, this enables the model to effectively distinguish subtle visual-semantic differences. Secondly, CLIP-IN incorporates long descriptive captions, utilizing rotary positional encodings to capture rich semantic context often missed by standard CLIP. Our experiments demonstrate that CLIP-IN achieves substantial gains on the MMVP benchmark and various fine-grained visual recognition tasks, without compromising robust zero-shot performance on broader classification and retrieval tasks. Critically, integrating CLIP-IN’s visual representations into Multimodal Large Language Models significantly reduces visual hallucinations and enhances reasoning abilities. This work underscores the considerable potential of synergizing targeted, instruction-based contrastive learning with comprehensive descriptive information to elevate the fine-grained understanding of VLMs.
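The hard-negative contrastive objective can be sketched as a standard symmetric InfoNCE loss whose text-side negative pool is augmented with the instruction-derived hard captions. Details such as the temperature and where the hard negatives enter are assumptions of this sketch, not CLIP-IN's exact loss.

```python
import torch
import torch.nn.functional as F

def symmetric_hard_negative_loss(img, txt, txt_hard, tau=0.07):
    """Sketch of a symmetric contrastive loss with hard negative captions.

    img:      (B, D) image embeddings
    txt:      (B, D) matching caption embeddings
    txt_hard: (B, D) hard-negative captions (e.g., derived from
              instruction-editing data), appended to the negative pool.
    """
    img, txt, txt_hard = (F.normalize(x, dim=-1) for x in (img, txt, txt_hard))
    all_txt = torch.cat([txt, txt_hard], dim=0)          # (2B, D)
    logits_i2t = img @ all_txt.t() / tau                 # (B, 2B)
    logits_t2i = txt @ img.t() / tau                     # (B, B)
    target = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits_i2t, target)
                  + F.cross_entropy(logits_t2i, target))

loss = symmetric_hard_negative_loss(torch.randn(4, 32), torch.randn(4, 32),
                                    torch.randn(4, 32))
```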
[386] Correspondence-Free Fast and Robust Spherical Point Pattern Registration
Anik Sarker, Alan T. Asbeck
Main category: cs.CV
TL;DR: Proposes a linear-time algorithm for rotation estimation between spherical patterns, outperforming existing methods in speed and accuracy.
Details
Motivation: Existing methods are computationally expensive (O(n^3)) and lack robustness to outliers.Method: Represents patterns as 3D point sets, reformulating rotation estimation as a Wahba problem. Introduces SPMC, FRS, and a hybrid approach.
Result: 10x faster and 10x more accurate than state-of-the-art methods in S^2 domain. Validated on simulations and real-world tasks.
Conclusion: The proposed algorithms offer significant improvements in efficiency and accuracy for spherical pattern alignment.
Abstract: Existing methods for rotation estimation between two spherical ($\mathbb{S}^2$) patterns typically rely on spherical cross-correlation maximization between two spherical functions. However, these approaches exhibit computational complexities greater than cubic $O(n^3)$ with respect to rotation space discretization and lack extensive evaluation under significant outlier contamination. To this end, we propose a rotation estimation algorithm between two spherical patterns with linear time complexity $O(n)$. Unlike existing spherical-function-based methods, we explicitly represent spherical patterns as discrete 3D point sets on the unit sphere, reformulating rotation estimation as a spherical point-set alignment (i.e., Wahba problem for 3D unit vectors). Given the geometric nature of our formulation, our spherical pattern alignment algorithm naturally aligns with the Wahba problem framework for 3D unit vectors. Specifically, we introduce three novel algorithms: (1) SPMC (Spherical Pattern Matching by Correlation), (2) FRS (Fast Rotation Search), and (3) a hybrid approach (SPMC+FRS) that combines the advantages of the previous two methods. Our experiments demonstrate that in the $\mathbb{S}^2$ domain and in correspondence-free settings, our algorithms are over 10x faster and over 10x more accurate than current state-of-the-art methods for the Wahba problem with outliers. We validate our approach through extensive simulations on a new dataset of spherical patterns, the “Robust Vector Alignment Dataset.” Furthermore, we adapt our methods to two real-world tasks: (i) Point Cloud Registration (PCR) and (ii) rotation estimation for spherical images.
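The Wahba problem for 3D unit vectors that the paper reformulates toward has a classical closed-form SVD solution in the correspondence-given, outlier-free case. As a reference point (this is the standard baseline, not the paper's SPMC/FRS algorithms), a minimal NumPy sketch:

```python
import numpy as np

def wahba_svd(a, b, weights=None):
    """Classical SVD solution to Wahba's problem: find the rotation R
    minimizing sum_i w_i * ||b_i - R a_i||^2 for unit vectors a_i, b_i."""
    w = np.ones(len(a)) if weights is None else np.asarray(weights)
    B = (w[:, None] * b).T @ a                  # attitude profile matrix
    U, _, Vt = np.linalg.svd(B)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    return U @ np.diag([1.0, 1.0, d]) @ Vt      # proper rotation, det = +1

# Toy check: recover a known rotation from rotated unit vectors.
rng = np.random.default_rng(1)
a = rng.normal(size=(50, 3))
a /= np.linalg.norm(a, axis=1, keepdims=True)
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
b = a @ R_true.T
R_est = wahba_svd(a, b)
assert np.allclose(R_est, R_true, atol=1e-6)
```

The contribution of SPMC/FRS is precisely to handle the correspondence-free, outlier-contaminated setting where this closed form cannot be applied directly.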
[387] Transport-Guided Rectified Flow Inversion: Improved Image Editing Using Optimal Transport Theory
Marian Lupascu, Mihai-Sorin Stupariu
Main category: cs.CV
TL;DR: OTIP is a zero-shot framework using optimal transport theory to balance reconstruction fidelity and editing flexibility in rectified flow models, achieving superior results in image inversion tasks.
Details
Motivation: The challenge of balancing reconstruction fidelity and editing flexibility in image inversion for practical editing applications drives the need for an improved method.Method: OTIP leverages optimal transport theory to guide the inversion process, optimizing trajectories for accuracy and controllability while maintaining computational efficiency.
Result: OTIP achieves high-fidelity reconstruction (LPIPS 0.001, SSIM 0.992) and outperforms baselines in reconstruction loss (7.8%-12.9% improvement) and semantic editing (11.2% identity preservation, 1.6% perceptual quality).
Conclusion: OTIP provides a principled, efficient solution for image inversion, excelling in reconstruction and editing tasks with superior detail preservation and semantic consistency.
Abstract: Effective image inversion in rectified flow models - mapping real images to editable latent representations - is crucial for practical image editing applications; however, achieving optimal balance between reconstruction fidelity and editing flexibility remains a fundamental challenge. In this work, we introduce the Optimal Transport Inversion Pipeline (OTIP), a zero-shot framework that leverages optimal transport theory to guide the inversion process in rectified flow models. Our underlying hypothesis is that incorporating transport-based guidance during the reverse diffusion process can effectively balance reconstruction accuracy and editing controllability through principled trajectory optimization. The method computes optimal transport paths between image and noise distributions while maintaining computational efficiency. Our approach achieves high-fidelity reconstruction with LPIPS scores of 0.001 and SSIM of 0.992 on face editing benchmarks, demonstrating superior preservation of fine-grained details compared to existing methods. We evaluate the framework across multiple editing tasks, observing 7.8% to 12.9% improvements in reconstruction loss over RF-Inversion on the LSUN-Bedroom and LSUN-Church datasets, respectively. For semantic face editing, our method achieves an 11.2% improvement in identity preservation and a 1.6% enhancement in perceptual quality, while maintaining computational efficiency comparable to baseline approaches. Qualitatively, our method produces visually compelling edits with superior semantic consistency and fine-grained detail preservation across diverse editing scenarios. Code is available at: https://github.com/marianlupascu/OT-Inversion
[388] TRUDI and TITUS: A Multi-Perspective Dataset and A Three-Stage Recognition System for Transportation Unit Identification
Emre Gülsoylu, André Kelm, Lennart Bengtson, Matthias Hirsch, Christian Wilms, Tim Rolff, Janick Edinger, Simone Frintrop
Main category: cs.CV
TL;DR: The paper introduces the TRUDI dataset and TITUS pipeline for identifying transportation units (TUs) in port logistics, addressing the lack of diverse benchmark data and improving efficiency.
Details
Motivation: Progress in TU identification is hindered by the absence of diverse, real-world benchmark datasets, limiting the development of robust solutions for port logistics.Method: The TRUDI dataset includes 35,034 annotated instances across five categories. The TITUS pipeline segments TUs, detects ID text locations, and validates IDs, operating under varied conditions.
Result: TITUS reliably identifies TUs from diverse camera perspectives and environmental conditions, outperforming systems requiring specific setups.
Conclusion: The TRUDI dataset and TITUS pipeline provide a benchmark for advancing TU identification, supporting digital transformation and logistics efficiency.
Abstract: Identifying transportation units (TUs) is essential for improving the efficiency of port logistics. However, progress in this field has been hindered by the lack of publicly available benchmark datasets that capture the diversity and dynamics of real-world port environments. To address this gap, we present the TRUDI dataset, a comprehensive collection comprising 35,034 annotated instances across five categories: container, tank container, trailer, ID text, and logo. The images were captured at operational ports using both ground-based and aerial cameras, under a wide variety of lighting and weather conditions. For the identification of TUs, which involves reading the 11-character alphanumeric ID typically painted on each unit, we introduce TITUS, a dedicated pipeline that operates in three stages: (1) segmenting the TU instances, (2) detecting the location of the ID text, and (3) recognising and validating the extracted ID. Unlike alternative systems, which often require similar scenes, specific camera angles or gate setups, our evaluation demonstrates that TITUS reliably identifies TUs from a range of camera perspectives and in varying lighting and weather conditions. By making the TRUDI dataset publicly available, we provide a robust benchmark that enables the development and comparison of new approaches. This contribution supports digital transformation efforts in multipurpose ports and helps to increase the efficiency of entire logistics chains.
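The abstract does not name the standard behind stage (3), but 11-character shipping container IDs (4 letters + 6 digits + 1 check digit) conventionally follow ISO 6346, whose final character is a check digit computable from the first ten. Assuming TITUS validates ISO 6346 codes, the public check-digit rule looks like this:

```python
def iso6346_char_value(c):
    """Letter values per ISO 6346: A=10, counting upward but skipping
    multiples of 11 (so no letter maps to 11, 22, or 33)."""
    if c.isdigit():
        return int(c)
    v = 10
    for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
        if v % 11 == 0:
            v += 1
        if letter == c:
            return v
        v += 1
    raise ValueError(f"invalid character: {c!r}")

def validate_container_id(code):
    """Validate an 11-character container ID (4 letters + 6 digits +
    1 check digit) against its ISO 6346 check digit."""
    code = code.strip().upper()
    if len(code) != 11 or not code[:4].isalpha() or not code[4:].isdigit():
        return False
    total = sum(iso6346_char_value(c) * 2**i for i, c in enumerate(code[:10]))
    return total % 11 % 10 == int(code[10])

assert validate_container_id("CSQU3054383")  # canonical ISO 6346 example
```

Such a check is valuable in practice because it lets the pipeline reject OCR misreads without any visual re-inspection.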
[389] Uni-Layout: Integrating Human Feedback in Unified Layout Generation and Evaluation
Shuo Lu, Yanyin Chen, Wei Feng, Jiahao Fan, Fengheng Li, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Jian Liang
Main category: cs.CV
TL;DR: Uni-Layout is a unified framework for layout generation and evaluation, incorporating human feedback and dynamic optimization to outperform existing methods.
Details
Motivation: Current layout generation methods are task-specific and lack effective evaluation metrics, limiting their applicability and measurement accuracy.Method: Uni-Layout integrates a unified generator for various layout tasks, a human-mimicking evaluator using a large-scale feedback dataset (Layout-HF100k), and Dynamic-Margin Preference Optimization (DMPO) for alignment.
Result: Uni-Layout significantly outperforms both task-specific and general-purpose methods in experiments.
Conclusion: The proposed framework advances layout generation by unifying tasks, improving evaluation with human feedback, and optimizing alignment, offering broad applicability.
Abstract: Layout generation plays a crucial role in enhancing both user experience and design efficiency. However, current approaches suffer from task-specific generation capabilities and perceptually misaligned evaluation metrics, leading to limited applicability and ineffective measurement. In this paper, we propose \textit{Uni-Layout}, a novel framework that achieves unified generation, human-mimicking evaluation and alignment between the two. For universal generation, we incorporate various layout tasks into a single taxonomy and develop a unified generator that handles background or element contents constrained tasks via natural language prompts. To introduce human feedback for the effective evaluation of layouts, we build \textit{Layout-HF100k}, the first large-scale human feedback dataset with 100,000 expertly annotated layouts. Based on \textit{Layout-HF100k}, we introduce a human-mimicking evaluator that integrates visual and geometric information, employing a Chain-of-Thought mechanism to conduct qualitative assessments alongside a confidence estimation module to yield quantitative measurements. For better alignment between the generator and the evaluator, we integrate them into a cohesive system by adopting Dynamic-Margin Preference Optimization (DMPO), which dynamically adjusts margins based on preference strength to better align with human judgments. Extensive experiments show that \textit{Uni-Layout} significantly outperforms both task-specific and general-purpose methods. Our code is publicly available at https://github.com/JD-GenX/Uni-Layout.
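Dynamic-Margin Preference Optimization, as described, is a preference objective whose margin scales with annotated preference strength. A minimal PyTorch sketch of one plausible DPO-style realization; the linear margin term and its weight are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dmpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
              pref_strength, beta=0.1, gamma=1.0):
    """Sketch of a dynamic-margin preference loss: strongly preferred
    layouts must beat rejected ones by a wider log-ratio gap.

    logp_*:        policy log-probs of chosen (w) / rejected (l) layouts
    ref_logp_*:    frozen reference-model log-probs
    pref_strength: (B,) preference strengths in [0, 1] from human feedback
    """
    ratio_w = logp_w - ref_logp_w
    ratio_l = logp_l - ref_logp_l
    margin = gamma * pref_strength  # margin grows with preference strength
    return -F.logsigmoid(beta * (ratio_w - ratio_l) - margin).mean()

loss = dmpo_loss(torch.randn(4), torch.randn(4), torch.randn(4),
                 torch.randn(4), torch.rand(4))
```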
[390] SMART-Ship: A Comprehensive Synchronized Multi-modal Aligned Remote Sensing Targets Dataset and Benchmark for Berthed Ships Analysis
Chen-Chen Fan, Peiyao Guo, Linping Zhang, Kehan Qi, Haolin Huang, Yong-Qiang Mao, Yuxi Suo, Zhizhuo Jiang, Yu Liu, You He
Main category: cs.CV
TL;DR: The paper introduces SMART-Ship, a multi-modal dataset for maritime surveillance, featuring 1092 image sets with fine-grained annotations across five modalities. It supports diverse tasks and benchmarks, validated by experiments.
Details
Motivation: Maritime surveillance is challenging due to multi-scale targets and dynamic environments. Existing datasets lack spatiotemporal consistency and multi-modal alignment.Method: The SMART-Ship dataset includes 1092 multi-modal image sets (visible-light, SAR, panchromatic, multi-spectral, near-infrared) with fine-grained annotations (polygons, categories, identifiers, change masks).
Result: The dataset covers 38,838 ships, ensures spatiotemporal consistency, and supports five benchmark tasks. Experiments validate its utility for multi-modal RS interpretation.
Conclusion: SMART-Ship bridges a critical gap in maritime surveillance, enabling diverse multi-modal tasks and revealing directions for future research.
Abstract: Given the limitations of satellite orbits and imaging conditions, multi-modal remote sensing (RS) data is crucial in enabling long-term earth observation. However, maritime surveillance remains challenging due to the complexity of multi-scale targets and the dynamic environments. To bridge this critical gap, we propose a Synchronized Multi-modal Aligned Remote sensing Targets dataset for berthed ships analysis (SMART-Ship), containing spatiotemporal registered images with fine-grained annotation for maritime targets from five modalities: visible-light, synthetic aperture radar (SAR), panchromatic, multi-spectral, and near-infrared. Specifically, our dataset consists of 1092 multi-modal image sets, covering 38,838 ships. Each image set is acquired within one week and registered to ensure spatiotemporal consistency. Ship instances in each set are annotated with polygonal location information, fine-grained categories, instance-level identifiers, and change region masks, organized hierarchically to support diverse multi-modal RS tasks. Furthermore, we define standardized benchmarks on five fundamental tasks and comprehensively compare representative methods across the dataset. Thorough experiment evaluations validate that the proposed SMART-Ship dataset could support various multi-modal RS interpretation tasks and reveal the promising directions for further exploration.
[391] Enhancing Object Discovery for Unsupervised Instance Segmentation and Object Detection
Xingyu Feng, Hebei Gao, Hong Li
Main category: cs.CV
TL;DR: COLER is a zero-shot unsupervised model for instance segmentation and object detection, using CutOnce for pseudo labels and self-training for improved performance.
Details
Motivation: To advance unsupervised object localization by simplifying the process of generating object masks and eliminating reliance on clustering or post-processing.Method: Uses CutOnce to generate coarse pseudo labels without clustering, then trains a detector with these masks. Includes modules to leverage self-supervised models and avoid post-processing.
Result: Outperforms state-of-the-art methods on multiple benchmarks without specialized loss functions.
Conclusion: COLER is a simple yet effective approach for unsupervised instance segmentation and object detection, with potential to advance the field.
Abstract: We propose Cut-Once-and-LEaRn (COLER), a simple approach for unsupervised instance segmentation and object detection. COLER first uses our developed CutOnce to generate coarse pseudo labels, then enables the detector to learn from these masks. CutOnce applies Normalized Cut only once and does not rely on any clustering methods, but it can generate multiple object masks in an image. We have designed several novel yet simple modules that not only allow CutOnce to fully leverage the object discovery capabilities of self-supervised models, but also free it from reliance on mask post-processing. During training, COLER achieves strong performance without requiring specially designed loss functions for pseudo labels, and its performance is further improved through self-training. COLER is a zero-shot unsupervised model that outperforms previous state-of-the-art methods on multiple benchmarks. We believe our method can help advance the field of unsupervised object localization.
[392] StableGS: A Floater-Free Framework for 3D Gaussian Splatting
Luchao Wang, Qian Ren, Kaimin Liao, Hua Wang, Zhi Chen, Yaohua Tang
Main category: cs.CV
TL;DR: StableGS eliminates floater artifacts in 3D Gaussian Splatting by decoupling geometric regularization from appearance rendering, using a Dual Opacity architecture and geometric constraints.
Details
Motivation: Floater artifacts in 3DGS degrade geometric and visual fidelity due to vanishing opacity gradients in optimization.
Method: Proposes StableGS with Dual Opacity architecture (Geometric Regularization Path and Appearance Refinement Path) and geometric constraints (depth consistency loss, global scale optimization).
Result: Eliminates floaters, resolves blur-artifact trade-off, achieves state-of-the-art geometric accuracy and visual quality.
Conclusion: StableGS effectively addresses the root cause of floaters and improves 3DGS reconstructions.
Abstract: 3D Gaussian Splatting (3DGS) reconstructions are plagued by stubborn "floater" artifacts that degrade their geometric and visual fidelity. We are the first to reveal the root cause: a fundamental conflict in the 3DGS optimization process where the opacity gradients of floaters vanish when their blended color reaches a pseudo-equilibrium of canceling errors against the background, trapping them in a spurious local minimum. To resolve this, we propose StableGS, a novel framework that decouples geometric regularization from final appearance rendering. Its core is a Dual Opacity architecture that creates two separate rendering paths: a "Geometric Regularization Path" to bear strong depth-based constraints for structural correctness, and an "Appearance Refinement Path" to generate high-fidelity details upon this stable foundation. We complement this with a synergistic set of geometric constraints: a self-supervised depth consistency loss and an external geometric prior enabled by our efficient global scale optimization algorithm. Experiments on multiple benchmarks show StableGS not only eliminates floaters but also resolves the common blur-artifact trade-off, achieving state-of-the-art geometric accuracy and visual quality.
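The abstract names a self-supervised depth consistency loss tying the two rendering paths together. Below is a minimal sketch of what such a loss could look like, assuming simple L1 agreement between the depth maps rendered by the two paths; the paper's exact formulation may differ.

```python
import torch

def depth_consistency_loss(d_geo: torch.Tensor,
                           d_app: torch.Tensor,
                           valid: torch.Tensor) -> torch.Tensor:
    """L1 agreement between depths rendered by the two opacity paths.

    d_geo/d_app: (H, W) depth maps from the geometric-regularization and
    appearance-refinement paths; `valid` masks pixels with enough opacity.
    This is a sketch, not StableGS's actual implementation.
    """
    diff = (d_geo - d_app).abs()
    return (diff * valid).sum() / valid.sum().clamp(min=1)

# toy usage with synthetic depth maps
d1 = torch.rand(64, 64)
d2 = d1 + 0.01 * torch.randn(64, 64)
m = torch.ones(64, 64)
print(depth_consistency_loss(d1, d2, m))
```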
[393] Glioblastoma Overall Survival Prediction With Vision Transformers
Yin Lin, Riccardo Barbieri, Domenico Aquino, Giuseppe Lauria, Marina Grisoli, Elena De Momi, Alberto Redaelli, Simona Ferrante
Main category: cs.CV
TL;DR: An AI approach using Vision Transformers (ViTs) predicts glioblastoma survival from MRI images without tumor segmentation, achieving 62.5% accuracy on the BRATS dataset.
Details
Motivation: Predicting Overall Survival (OS) for glioblastoma patients is crucial for personalized treatment, but traditional methods rely on segmentation, complicating workflows.
Method: The study employs Vision Transformers (ViTs) to extract features directly from MRI images, bypassing segmentation and reducing computational needs.
Result: The model achieved 62.5% accuracy on the BRATS test set, with balanced precision, recall, and F1 scores, outperforming other methods in these metrics.
Conclusion: ViTs show promise for OS prediction in medical imaging, offering efficiency and eliminating segmentation dependency, though generalization is limited by dataset size.
Abstract: Glioblastoma is one of the most aggressive and common brain tumors, with a median survival of 10-15 months. Predicting Overall Survival (OS) is critical for personalizing treatment strategies and aligning clinical decisions with patient outcomes. In this study, we propose a novel Artificial Intelligence (AI) approach for OS prediction using Magnetic Resonance Imaging (MRI) images, exploiting Vision Transformers (ViTs) to extract hidden features directly from MRI images and eliminating the need for tumor segmentation. Unlike traditional approaches, our method simplifies the workflow and reduces computational resource requirements. The proposed model was evaluated on the BRATS dataset, reaching an accuracy of 62.5% on the test set, comparable to the top-performing methods. Additionally, it demonstrated balanced performance across precision, recall, and F1 score, outperforming the best model on these metrics. The dataset size limits the generalization of the ViT, which typically requires larger datasets than convolutional neural networks; this limitation in generalization is observed across all the cited studies. This work highlights the applicability of ViTs for downsampled medical imaging tasks and establishes a foundation for OS prediction models that are computationally efficient and do not rely on segmentation.
[394] InfoSyncNet: Information Synchronization Temporal Convolutional Network for Visual Speech Recognition
Junxiao Xue, Xiaozhen Liu, Xuecheng Wu, Fei Yu, Jun Wang
Main category: cs.CV
TL;DR: InfoSyncNet improves lip-reading accuracy by dynamically adjusting focus with non-uniform quantization and tailored data augmentation, achieving state-of-the-art results.
Details
Motivation: Accurate lip-reading from silent videos is vital for Assistive Technology and Augmented Reality, but variability in sequences and uneven information distribution pose challenges.
Method: InfoSyncNet uses a non-uniform quantization module between encoder and decoder, along with tailored data augmentation and training strategies to handle visual speech inconsistencies.
Result: Achieves 92.0% and 60.7% Top-1 ACC on LRW and LRW1000 datasets, setting new benchmarks.
Conclusion: InfoSyncNet effectively addresses variability in lip-reading tasks, demonstrating superior performance and practical applicability.
Abstract: Estimating spoken content from silent videos is crucial for applications in Assistive Technology (AT) and Augmented Reality (AR). However, accurately mapping lip movement sequences in videos to words poses significant challenges due to variability across sequences and the uneven distribution of information within each sequence. To tackle this, we introduce InfoSyncNet, a non-uniform sequence modeling network enhanced by tailored data augmentation techniques. Central to InfoSyncNet is a non-uniform quantization module positioned between the encoder and decoder, enabling dynamic adjustment to the network’s focus and effectively handling the natural inconsistencies in visual speech data. Additionally, multiple training strategies are incorporated to enhance the model’s capability to handle variations in lighting and the speaker’s orientation. Comprehensive experiments on the LRW and LRW1000 datasets confirm the superiority of InfoSyncNet, achieving new state-of-the-art accuracies of 92.0% and 60.7% Top-1 ACC. The code is available for download (see comments).
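The abstract places a non-uniform quantization module between the encoder and decoder to shift the network's focus toward informative frames. One plausible reading, sketched below with assumed shapes (LRW-style 29-frame clips), is a learned per-frame importance that re-weights encoder features; the actual InfoSyncNet module may resample rather than re-weight.

```python
import torch
import torch.nn as nn

class NonUniformFocus(nn.Module):
    """Toy stand-in for a non-uniform quantization module: learn a scalar
    importance per frame and re-weight encoder features before the decoder.
    An assumed sketch, not the paper's architecture."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) encoder features of a lip-movement sequence
        w = torch.softmax(self.score(x), dim=1)   # (B, T, 1), sums to 1 over time
        return x * w * x.size(1)                  # emphasize informative frames

feats = torch.randn(2, 29, 256)                   # 29-frame clips, 256-dim features
print(NonUniformFocus(256)(feats).shape)          # torch.Size([2, 29, 256])
```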
[395] SAMPO: Visual Preference Optimization for Intent-Aware Segmentation with Vision Foundation Models
Yonghuang Wu, Wenwen Zeng, Xuan Xie, Chengqian Zhao, Guoqing Wu, Jinhua Yu
Main category: cs.CV
TL;DR: SAMPO bridges the intent gap in SAM by inferring categorical intent from sparse prompts, outperforming baselines with minimal data.
Details
Motivation: Address SAM's limitation in segmenting implicitly desired objects, especially in dense domains like biomedical nuclei segmentation.
Method: SAMPO uses preference optimization to capture target-class characteristics without language models, enabling robust segmentation from sparse prompts.
Result: Achieves state-of-the-art performance on medical tasks, improving by 9+ percentage points with only 10% training data.
Conclusion: SAMPO introduces a new paradigm for intent-aware alignment in visual models, eliminating dependencies on auxiliary tools.
Abstract: Foundation models like Segment Anything Model (SAM) excel in promptable segmentation but suffer from an intent gap: they segment only explicitly prompted objects, failing to generalize to semantically related instances implicitly desired by users. This limitation is critical in domains with dense homogeneous objects (e.g., biomedical nuclei segmentation), where sparse visual prompts typically yield incomplete results, rendering dense annotations impractical due to prohibitive cost. To bridge this gap, we introduce SAMPO (Segment Anything Model with Preference Optimization), a novel framework that teaches visual foundation models to infer high-level categorical intent from sparse visual interactions. Unlike conventional pixel-level fine-tuning, SAMPO optimizes models to implicitly capture target-class characteristics through preference optimization. This approach, which operates without dependency on language models, enables robust multi-object segmentation even under sparse prompting and demonstrates superior data efficiency during fine-tuning. Validated on three medical segmentation tasks, SAMPO achieves state-of-the-art performance: on challenging tasks like PanNuke-T2, our method, when fine-tuned with only 10% of the training data, significantly outperforms all existing methods trained on the full 100% dataset, achieving an improvement of over 9 percentage points compared to the best baseline. Our work establishes a new paradigm for intent-aware alignment in visual foundation models, removing dependencies on auxiliary prompt generators or language-model-assisted preference learning.
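SAMPO's preference-optimization objective is not spelled out here, but a DPO-style loss adapted to mask log-likelihoods gives a feel for the mechanism: push the tuned model to prefer segmentations that cover the intended class over ones that miss it. Everything below, from the function name to its inputs, is an assumed sketch rather than the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def mask_preference_loss(logp_win, logp_lose, logp_win_ref, logp_lose_ref,
                         beta: float = 0.1):
    """DPO-style preference objective over segmentations (a sketch).

    logp_* are summed log-probabilities of preferred ("win") and
    dispreferred ("lose") masks under the tuned and frozen reference models.
    """
    margin = beta * ((logp_win - logp_win_ref) - (logp_lose - logp_lose_ref))
    return -F.logsigmoid(margin).mean()

# toy usage: the tuned model already favors the preferred mask
print(mask_preference_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                           torch.tensor([-6.0]), torch.tensor([-8.5])))
```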
[396] Multi-class Image Anomaly Detection for Practical Applications: Requirements and Robust Solutions
Jaehyuk Heo, Pilsung Kang
Main category: cs.CV
TL;DR: The paper introduces Hierarchical Coreset (HierCore), a framework for multi-class image anomaly detection, addressing performance gaps by leveraging hierarchical memory banks and class-wise criteria, validated across various label availability scenarios.
Details
Motivation: To improve multi-class anomaly detection by addressing the underperformance of single models compared to class-specific ones and exploring the impact of class information on detection thresholds.
Method: Proposes HierCore, a hierarchical memory bank-based framework that estimates class-wise decision criteria, even without class labels, and evaluates it under four label availability scenarios.
Result: HierCore consistently meets all requirements and performs robustly across all settings, outperforming existing methods.
Conclusion: HierCore is a practical and robust solution for real-world multi-class anomaly detection, adaptable to varying label availability.
Abstract: Recent advances in image anomaly detection have extended unsupervised learning-based models from single-class settings to multi-class frameworks, aiming to improve efficiency in training time and model storage. When a single model is trained to handle multiple classes, it often underperforms compared to class-specific models in terms of per-class detection accuracy. Accordingly, previous studies have primarily focused on narrowing this performance gap. However, the way class information is used, or not used, remains a relatively understudied factor that could influence how detection thresholds are defined in multi-class image anomaly detection. These thresholds, whether class-specific or class-agnostic, significantly affect detection outcomes. In this study, we identify and formalize the requirements that a multi-class image anomaly detection model must satisfy under different conditions, depending on whether class labels are available during training and evaluation. We then re-examine existing methods under these criteria. To meet these challenges, we propose Hierarchical Coreset (HierCore), a novel framework designed to satisfy all defined requirements. HierCore operates effectively even without class labels, leveraging a hierarchical memory bank to estimate class-wise decision criteria for anomaly detection. We empirically validate the applicability and robustness of existing methods and HierCore under four distinct scenarios, determined by the presence or absence of class labels in the training and evaluation phases. The experimental results demonstrate that HierCore consistently meets all requirements and maintains strong, stable performance across all settings, highlighting its practical potential for real-world multi-class anomaly detection tasks.
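HierCore's hierarchical memory bank can be caricatured as nearest-neighbour scoring against a class-specific bank when a label is available, falling back to the union of all banks when it is not. The bank sizes and two-class setup below are illustrative, and the coreset subsampling machinery is omitted.

```python
import torch

def anomaly_score(feat: torch.Tensor, banks: dict, label: str | None = None) -> float:
    """Nearest-neighbour distance to a hierarchy of memory banks (a sketch).

    With a known class label, score against that class's bank; otherwise
    score against the union of all banks (the class-agnostic level).
    """
    bank = banks[label] if label is not None else torch.cat(list(banks.values()))
    return torch.cdist(feat[None], bank).min().item()

# toy banks of 128-dim patch features for two object classes
banks = {"screw": torch.randn(500, 128), "cable": torch.randn(500, 128)}
x = torch.randn(128)
print(anomaly_score(x, banks, "screw"), anomaly_score(x, banks))
```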
[397] Fine-grained Multiple Supervisory Network for Multi-modal Manipulation Detecting and Grounding
Xinquan Yu, Wei Lu, Xiangyang Luo
Main category: cs.CV
TL;DR: The paper introduces the FMS network for detecting and grounding multi-modal media manipulation, addressing limitations of existing methods by incorporating modality reliability, unimodal internal, and cross-modal supervision.
Details
Motivation: Existing methods for detecting media manipulation are limited by unreliable unimodal data and lack comprehensive forgery supervision, leading to poor performance.
Method: The FMS network includes three modules: MDSC for modality reliability supervision, UFMR for unimodal internal supervision, and MFAR for cross-modal supervision.
Result: Extensive experiments show FMS outperforms state-of-the-art methods.
Conclusion: The FMS network effectively improves detection performance by addressing key limitations in current approaches.
Abstract: The task of Detecting and Grounding Multi-Modal Media Manipulation (DGM$^4$) is a branch of misinformation detection. Unlike traditional binary classification, it includes complex subtasks such as forgery content localization and forgery method classification. Existing methods are often limited in performance because they neglect the erroneous interference caused by unreliable unimodal data and fail to establish comprehensive forgery supervision for mining fine-grained tampering traces. In this paper, we present a Fine-grained Multiple Supervisory (FMS) network, which incorporates modality reliability supervision, unimodal internal supervision and cross-modal supervision to provide comprehensive guidance for DGM$^4$ detection. For modality reliability supervision, we propose the Multimodal Decision Supervised Correction (MDSC) module. It leverages unimodal weak supervision to correct the multi-modal decision-making process. For unimodal internal supervision, we propose the Unimodal Forgery Mining Reinforcement (UFMR) module. It amplifies the disparity between real and fake information within each modality from both feature-level and sample-level perspectives. For cross-modal supervision, we propose the Multimodal Forgery Alignment Reasoning (MFAR) module. It utilizes soft-attention interactions to achieve cross-modal feature perception from both consistency and inconsistency perspectives, and we design interaction constraints to ensure interaction quality. Extensive experiments demonstrate the superior performance of our FMS compared to state-of-the-art methods.
[398] MindShot: Multi-Shot Video Reconstruction from fMRI with LLM Decoding
Wenwen Zeng, Yonghuang Wu, Yifan Chen, Xuan Xie, Chengqian Zhao, Feiyu Yin, Guoqing Wu, Jinhua Yu
Main category: cs.CV
TL;DR: A novel divide-and-decode framework for multi-shot fMRI video reconstruction addresses signal mixing, temporal resolution mismatch, and dataset limitations, outperforming state-of-the-art methods.
Details
Motivation: Understanding visual cognition and enabling brain-computer interfaces require reconstructing dynamic videos from fMRI, but current methods are limited to single-shot clips.
Method: Proposes a divide-and-decode framework with shot boundary prediction, generative keyframe captioning using LLMs, and large-scale data synthesis.
Result: Outperforms state-of-the-art methods in multi-shot reconstruction fidelity, with fMRI decomposition improving caption similarity by 71.8%.
Conclusion: Establishes a new paradigm for multi-shot fMRI reconstruction by leveraging explicit decomposition and semantic prompting.
Abstract: Reconstructing dynamic videos from fMRI is important for understanding visual cognition and enabling vivid brain-computer interfaces. However, current methods are critically limited to single-shot clips, failing to address the multi-shot nature of real-world experiences. Multi-shot reconstruction faces fundamental challenges: fMRI signal mixing across shots, the temporal resolution mismatch between fMRI and video obscuring rapid scene changes, and the lack of dedicated multi-shot fMRI-video datasets. To overcome these limitations, we propose a novel divide-and-decode framework for multi-shot fMRI video reconstruction. Our core innovations are: (1) A shot boundary predictor module explicitly decomposing mixed fMRI signals into shot-specific segments. (2) Generative keyframe captioning using LLMs, which decodes robust textual descriptions from each segment, overcoming temporal blur by leveraging high-level semantics. (3) Novel large-scale data synthesis (20k samples) from existing datasets. Experimental results demonstrate our framework outperforms state-of-the-art methods in multi-shot reconstruction fidelity. Ablation studies confirm the critical role of fMRI decomposition and semantic captioning, with decomposition significantly improving decoded caption CLIP similarity by 71.8%. This work establishes a new paradigm for multi-shot fMRI reconstruction, enabling accurate recovery of complex visual narratives through explicit decomposition and semantic prompting.
[399] Low-Frequency First: Eliminating Floating Artifacts in 3D Gaussian Splatting
Jianchao Wang, Peng Zhou, Cen Li, Rong Quan, Jie Qin
Main category: cs.CV
TL;DR: EFA-GS improves 3DGS by reducing floating artifacts through selective Gaussian expansion and dynamic refinement, enhancing visual fidelity and performance.
Details
Motivation: Floating artifacts in 3DGS degrade visual quality, especially with low-quality initialization, but their causes are not well understood.
Method: Analyzes artifacts in the frequency domain, identifies under-optimized Gaussians, and introduces EFA-GS with depth-based and scale-based refinement.
Result: EFA-GS reduces artifacts, preserves details, and improves PSNR by 1.68 dB on the RWLQ dataset.
Conclusion: EFA-GS effectively mitigates floating artifacts and enhances 3DGS performance, validated in downstream tasks.
Abstract: 3D Gaussian Splatting (3DGS) is a powerful and computationally efficient representation for 3D reconstruction. Despite its strengths, 3DGS often produces floating artifacts, which are erroneous structures detached from the actual geometry and significantly degrade visual fidelity. The underlying mechanisms causing these artifacts, particularly in low-quality initialization scenarios, have not been fully explored. In this paper, we investigate the origins of floating artifacts from a frequency-domain perspective and identify under-optimized Gaussians as the primary source. Based on our analysis, we propose \textit{Eliminating-Floating-Artifacts} Gaussian Splatting (EFA-GS), which selectively expands under-optimized Gaussians to prioritize accurate low-frequency learning. Additionally, we introduce complementary depth-based and scale-based strategies to dynamically refine Gaussian expansion, effectively mitigating detail erosion. Extensive experiments on both synthetic and real-world datasets demonstrate that EFA-GS substantially reduces floating artifacts while preserving high-frequency details, achieving an improvement of 1.68 dB in PSNR over baseline method on our RWLQ dataset. Furthermore, we validate the effectiveness of our approach in downstream 3D editing tasks. Our implementation will be released on GitHub.
[400] Rethinking Transparent Object Grasping: Depth Completion with Monocular Depth Estimation and Instance Mask
Yaofeng Cheng, Xinkai Gao, Sen Zhang, Chao Zeng, Fusheng Zha, Lining Sun, Chenguang Yang
Main category: cs.CV
TL;DR: ReMake improves depth completion for transparent objects by using instance masks and monocular depth estimation, enhancing accuracy and generalization.
Details
Motivation: Depth cameras struggle with transparent objects, leading to incomplete depth data and unreliable robotic grasping. Existing methods fail to generalize to real-world scenarios due to complex light interactions.
Method: ReMake uses an instance mask to distinguish transparent regions and monocular depth estimation for context, focusing on accurate depth learning in transparent areas.
Result: The method outperforms existing approaches in benchmark datasets and real-world scenarios, showing better accuracy and generalization.
Conclusion: ReMake effectively addresses depth completion for transparent objects, improving robotic grasping reliability in real-world settings.
Abstract: Due to the optical properties, transparent objects often lead depth cameras to generate incomplete or invalid depth data, which in turn reduces the accuracy and reliability of robotic grasping. Existing approaches typically input the RGB-D image directly into the network to output the complete depth, expecting the model to implicitly infer the reliability of depth values. However, while effective in training datasets, such methods often fail to generalize to real-world scenarios, where complex light interactions lead to highly variable distributions of valid and invalid depth data. To address this, we propose ReMake, a novel depth completion framework guided by an instance mask and monocular depth estimation. By explicitly distinguishing transparent regions from non-transparent ones, the mask enables the model to concentrate on learning accurate depth estimation in these areas from RGB-D input during training. This targeted supervision reduces reliance on implicit reasoning and improves generalization to real-world scenarios. Additionally, monocular depth estimation provides depth context between the transparent object and its surroundings, enhancing depth prediction accuracy. Extensive experiments show that our method outperforms existing approaches on both benchmark datasets and real-world scenarios, demonstrating superior accuracy and generalization capability. Code and videos are available at https://chengyaofeng.github.io/ReMake.github.io/.
[401] Engagement Prediction of Short Videos with Large Multimodal Models
Wei Sun, Linhan Cao, Yuqin Cao, Weixia Zhang, Wen Wen, Kaiwei Zhang, Zijian Chen, Fangfang Lu, Xiongkuo Min, Guangtao Zhai
Main category: cs.CV
TL;DR: The paper explores using large multimodal models (LMMs) for video engagement prediction, comparing VideoLLaMA2 (audio, visual, language) and Qwen2.5-VL (visual, language). VideoLLaMA2 outperforms Qwen2.5-VL, emphasizing audio’s role. Their ensemble method wins the ICCV VQualA 2025 challenge.
Details
Motivation: Video engagement prediction is crucial for recommendation systems and content creation but is complex due to multimodal interactions. Prior methods struggle with cross-feature and cross-modality modeling.
Method: Adopts VideoLLaMA2 (audio, visual, language) and Qwen2.5-VL (visual, language) for engagement prediction, trained on the SnapUGC dataset.
Result: Both models perform competitively; VideoLLaMA2 outperforms Qwen2.5-VL, showing audio’s importance. Ensemble method wins ICCV VQualA 2025 challenge.
Conclusion: LMMs are effective for engagement prediction, with audio features playing a key role. The ensemble approach achieves top performance.
Abstract: The rapid proliferation of user-generated content (UGC) on short-form video platforms has made video engagement prediction increasingly important for optimizing recommendation systems and guiding content creation. However, this task remains challenging due to the complex interplay of factors such as semantic content, visual quality, audio characteristics, and user background. Prior studies have leveraged various types of features from different modalities, such as visual quality, semantic content, background sound, etc., but often struggle to effectively model their cross-feature and cross-modality interactions. In this work, we empirically investigate the potential of large multimodal models (LMMs) for video engagement prediction. We adopt two representative LMMs: VideoLLaMA2, which integrates audio, visual, and language modalities, and Qwen2.5-VL, which models only visual and language modalities. Specifically, VideoLLaMA2 jointly processes key video frames, text-based metadata, and background sound, while Qwen2.5-VL utilizes only key video frames and text-based metadata. Trained on the SnapUGC dataset, both models demonstrate competitive performance against state-of-the-art baselines, showcasing the effectiveness of LMMs in engagement prediction. Notably, VideoLLaMA2 consistently outperforms Qwen2.5-VL, highlighting the importance of audio features in engagement prediction. By ensembling two types of models, our method achieves first place in the ICCV VQualA 2025 EVQA-SnapUGC Challenge on short-form video engagement prediction. The code is available at https://github.com/sunwei925/LMM-EVQA.git.
[402] Beyond Images: Adaptive Fusion of Visual and Textual Data for Food Classification
Prateek Mittal, Puneet Goyal, Joohi Chauhan
Main category: cs.CV
TL;DR: A novel multimodal food recognition framework combining visual and textual data improves classification accuracy and robustness, achieving 97.84% accuracy when fused.
Details
Motivation: To enhance food recognition accuracy by leveraging both visual and textual data, addressing challenges like missing or inconsistent modality data.
Method: Dynamic multimodal fusion strategy adaptively integrates visual and textual features, tested on the UPMC Food-101 dataset.
Result: Achieved 73.60% (images), 88.84% (text), and 97.84% (fused) accuracy, outperforming state-of-the-art methods.
Conclusion: The framework is robust, adaptable, and efficient, suitable for real-world multimodal food recognition.
Abstract: This study introduces a novel multimodal food recognition framework that effectively combines visual and textual modalities to enhance classification accuracy and robustness. The proposed approach employs a dynamic multimodal fusion strategy that adaptively integrates features from unimodal visual inputs and complementary textual metadata. This fusion mechanism is designed to maximize the use of informative content, while mitigating the adverse impact of missing or inconsistent modality data. The framework was rigorously evaluated on the UPMC Food-101 dataset and achieved unimodal classification accuracies of 73.60% for images and 88.84% for text. When both modalities were fused, the model achieved an accuracy of 97.84%, outperforming several state-of-the-art methods. Extensive experimental analysis demonstrated the robustness, adaptability, and computational efficiency of the proposed settings, highlighting its practical applicability to real-world multimodal food-recognition scenarios.
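A gated fusion head illustrates the kind of adaptive integration described: a learned gate decides, per sample, how much the image versus text embedding contributes, so an unreliable or missing modality can be down-weighted. The dimensions, the gate's form, and the Food-101 class count are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Adaptive fusion of image and text embeddings via a learned gate.
    A sketch under assumed dimensions, not the paper's exact model."""
    def __init__(self, dim: int = 512, n_classes: int = 101):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 1), nn.Sigmoid())
        self.head = nn.Linear(dim, n_classes)

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([img, txt], dim=-1))   # (B, 1) in [0, 1]
        fused = g * img + (1 - g) * txt                # convex modality mix
        return self.head(fused)

logits = GatedFusion()(torch.randn(4, 512), torch.randn(4, 512))
print(logits.shape)   # (4, 101) -- one logit per Food-101 class
```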
[403] Understanding the Risks of Asphalt Art on the Reliability of Surveillance Perception Systems
Jin Ma, Abyad Enan, Long Cheng, Mashrur Chowdhury
Main category: cs.CV
TL;DR: Artistic crosswalks can degrade pedestrian detection in vision-based models, especially with complex or adversarial designs.
Details
Motivation: To assess how asphalt art impacts pedestrian detection in surveillance systems.
Method: Evaluated a pretrained model using realistic crosswalk scenarios with normal and adversarial asphalt art.
Result: Complex and adversarial designs significantly reduce detection performance, posing risks to surveillance.
Conclusion: Urban surveillance systems must address visual variations to ensure robust pedestrian detection.
Abstract: Artistic crosswalks featuring asphalt art, introduced by different organizations in recent years, aim to enhance the visibility and safety of pedestrians. However, their visual complexity may interfere with surveillance systems that rely on vision-based object detection models. In this study, we investigate the impact of asphalt art on pedestrian detection performance of a pretrained vision-based object detection model. We construct realistic crosswalk scenarios by compositing various street art patterns into a fixed surveillance scene and evaluate the model’s performance in detecting pedestrians on asphalt-arted crosswalks under both benign and adversarial conditions. A benign case refers to pedestrian crosswalks painted with existing normal asphalt art, whereas an adversarial case involves digitally crafted or altered asphalt art perpetrated by an attacker. Our results show that while simple, color-based designs have minimal effect, complex artistic patterns, particularly those with high visual salience, can significantly degrade pedestrian detection performance. Furthermore, we demonstrate that adversarially crafted asphalt art can be exploited to deliberately obscure real pedestrians or generate non-existent pedestrian detections. These findings highlight a potential vulnerability in urban vision-based pedestrian surveillance systems and underscore the importance of accounting for environmental visual variations when designing robust pedestrian perception models.
[404] Advancing Supervised Local Learning Beyond Classification with Long-term Feature Bank
Feiyu Zhu, Yuming Zhang, Xiuyuan Guo, Hengyu Shi, Junfeng Luo, Junhao Su, Jialin Gao
Main category: cs.CV
TL;DR: The paper proposes FBA, a local learning method for visual tasks beyond classification, addressing adaptability and feature communication issues while conserving GPU memory.
Details
Motivation: Extend local learning beyond image classification by overcoming architectural limitations and lack of cross-scale feature communication.
Method: Introduces Feature Bank Augmented auxiliary network (FBA) with a simplified design and feature bank for cross-task adaptability.
Result: FBA achieves performance comparable to end-to-end methods across multiple visual tasks while reducing GPU memory usage.
Conclusion: FBA successfully applies local learning to diverse visual tasks, balancing performance and efficiency.
Abstract: Local learning offers an alternative to traditional end-to-end back-propagation in deep neural networks, significantly reducing GPU memory consumption. Although it has shown promise in image classification tasks, its extension to other visual tasks has been limited. This limitation arises primarily from two factors: 1) architectures designed specifically for classification are not readily adaptable to other tasks, which prevents the effective reuse of task-specific knowledge from architectures tailored to different problems; 2) these classification-focused architectures typically lack cross-scale feature communication, leading to degraded performance in tasks like object detection and super-resolution. To address these challenges, we propose the Feature Bank Augmented auxiliary network (FBA), which introduces a simplified design principle and incorporates a feature bank to enhance cross-task adaptability and communication. This work represents the first successful application of local learning methods beyond classification, demonstrating that FBA not only conserves GPU memory but also achieves performance on par with end-to-end approaches across multiple datasets for various visual tasks.
[405] Precision-Aware Video Compression for Reducing Bandwidth Requirements in Video Communication for Vehicle Detection-Based Applications
Abyad Enan, Jon C Calhoun, Mashrur Chowdhury
Main category: cs.CV
TL;DR: PAVC dynamically adjusts video compression based on weather/lighting to balance bandwidth usage and vehicle detection accuracy in ITS.
Details
Motivation: Limited bandwidth in ITS can hinder real-time performance, and lossy compression degrades video quality, impacting vehicle detection accuracy. Environmental factors further complicate this.
Method: Uses Precision-Aware Video Compression (PAVC) to dynamically adjust compression levels based on weather and lighting conditions.
Result: PAVC improves detection accuracy by up to 13% and reduces bandwidth usage by up to 8.23x (moderate bandwidth) or 72x (severely limited bandwidth).
Conclusion: PAVC effectively balances bandwidth efficiency and detection accuracy, enhancing ITS performance under varying conditions.
Abstract: Computer vision has become a popular tool in intelligent transportation systems (ITS), enabling various applications through roadside traffic cameras that capture video and transmit it in real time to computing devices within the same network. The efficiency of this video transmission largely depends on the available bandwidth of the communication system. However, limited bandwidth can lead to communication bottlenecks, hindering the real-time performance of ITS applications. To mitigate this issue, lossy video compression techniques can be used to reduce bandwidth requirements, at the cost of degrading video quality. This degradation can negatively impact the accuracy of applications that rely on real-time vehicle detection. Additionally, vehicle detection accuracy is influenced by environmental factors such as weather and lighting conditions, suggesting that compression levels should be dynamically adjusted in response to these variations. In this work, we utilize a framework called Precision-Aware Video Compression (PAVC), where a roadside video camera captures footage of vehicles on roadways, compresses videos, and then transmits them to a processing unit, running a vehicle detection algorithm for safety-critical applications, such as real-time collision risk assessment. The system dynamically adjusts the video compression level based on current weather and lighting conditions to maintain vehicle detection accuracy while minimizing bandwidth usage. Our results demonstrate that PAVC improves vehicle detection accuracy by up to 13% and reduces communication bandwidth requirements by up to 8.23x in areas with moderate bandwidth availability. Moreover, in locations with severely limited bandwidth, PAVC reduces bandwidth requirements by up to 72x while preserving vehicle detection performance.
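The core control loop, mapping current weather and lighting to a compression level, can be sketched as a condition-to-CRF lookup feeding a standard ffmpeg H.264 encode. The thresholds and CRF values below are invented for illustration and are not the calibrated settings from the PAVC study.

```python
import subprocess

# Illustrative condition -> compression mapping; these CRF values are
# assumptions, not PAVC's calibrated settings (higher CRF = more compression).
CRF_TABLE = {
    ("clear", "day"): 34,    # aggressive compression is safe
    ("clear", "night"): 28,
    ("rain", "day"): 26,
    ("rain", "night"): 22,   # keep more detail when detection is hardest
}

def compress(src: str, dst: str, weather: str, lighting: str) -> None:
    """Re-encode roadside footage at a condition-dependent quality level."""
    crf = CRF_TABLE.get((weather, lighting), 24)
    subprocess.run(["ffmpeg", "-y", "-i", src, "-c:v", "libx264",
                    "-crf", str(crf), dst], check=True)

# compress("cam01.mp4", "cam01_out.mp4", "rain", "night")
```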
[406] CityNav: A Large-Scale Dataset for Real-World Aerial Navigation
Jungdae Lee, Taiki Miyanishi, Shuhei Kurita, Koya Sakamoto, Daichi Azuma, Yutaka Matsuo, Nakamasa Inoue
Main category: cs.CV
TL;DR: The paper introduces CityNav, a large-scale real-world dataset for aerial vision-and-language navigation (VLN), addressing the gap in existing datasets by providing human demonstration trajectories paired with natural language descriptions. It also proposes geographic semantic maps to enhance navigation performance.
Details
Motivation: Aerial VLN over real-world cities is underexplored due to limited datasets and challenges in integrating visual and geographic information. The authors aim to fill this gap with CityNav.
Method: The dataset includes 32,637 human demonstration trajectories with natural language descriptions, covering 4.65 km² across two cities. Geographic semantic maps are introduced as an auxiliary modality.
Result: Experiments show that semantic map representation significantly improves the performance of three aerial VLN agents (Seq2seq, CMA, and AerialVLN models).
Conclusion: CityNav serves as an essential benchmark for advancing aerial VLN, and the proposed semantic maps enhance navigation performance.
Abstract: Vision-and-language navigation (VLN) aims to develop agents capable of navigating in realistic environments. While recent cross-modal training approaches have significantly improved navigation performance in both indoor and outdoor scenarios, aerial navigation over real-world cities remains underexplored primarily due to limited datasets and the difficulty of integrating visual and geographic information. To fill this gap, we introduce CityNav, the first large-scale real-world dataset for aerial VLN. Our dataset consists of 32,637 human demonstration trajectories, each paired with a natural language description, covering 4.65 km$^2$ across two real cities: Cambridge and Birmingham. In contrast to existing datasets composed of synthetic scenes such as AerialVLN, our dataset presents a unique challenge because agents must interpret spatial relationships between real-world landmarks and the navigation destination, making CityNav an essential benchmark for advancing aerial VLN. Furthermore, as an initial step toward addressing this challenge, we provide a methodology of creating geographic semantic maps that can be used as an auxiliary modality input during navigation. In our experiments, we compare performance of three representative aerial VLN agents (Seq2seq, CMA and AerialVLN models) and demonstrate that the semantic map representation significantly improves their navigation performance.
[407] MonoDream: Monocular Vision-Language Navigation with Panoramic Dreaming
Shuo Wang, Yongcai Wang, Wanting Li, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Xudong Cai, Yeying Jin, Deying Li, Zhaoxin Fan
Main category: cs.CV
TL;DR: MonoDream introduces a lightweight VLA framework for monocular VLN, using a Unified Navigation Representation (UNR) and Latent Panoramic Dreaming (LPD) to improve performance, narrowing the gap with panoramic-based methods.
Details
Motivation: Panoramic RGB-D sensors are costly or less accessible, while existing monocular VLA models underperform compared to panoramic-based methods. MonoDream aims to bridge this gap.
Method: MonoDream uses a UNR to align visual semantics and language-grounded action intent. It introduces LPD tasks to predict latent features of panoramic RGB-D observations from monocular input.
Result: Experiments show MonoDream improves monocular navigation performance and significantly reduces the performance gap with panoramic-based agents.
Conclusion: MonoDream demonstrates that monocular agents can achieve competitive performance by learning shared representations and leveraging latent panoramic features.
Abstract: Vision-Language Navigation (VLN) tasks often leverage panoramic RGB and depth inputs to provide rich spatial cues for action planning, but these sensors can be costly or less accessible in real-world deployments. Recent approaches based on Vision-Language Action (VLA) models achieve strong results with monocular input, yet they still lag behind methods using panoramic RGB-D information. We present MonoDream, a lightweight VLA framework that enables monocular agents to learn a Unified Navigation Representation (UNR). This shared feature representation jointly aligns navigation-relevant visual semantics (e.g., global layout, depth, and future cues) and language-grounded action intent, enabling more reliable action prediction. MonoDream further introduces Latent Panoramic Dreaming (LPD) tasks to supervise the UNR, which train the model to predict latent features of panoramic RGB and depth observations at both current and future steps based on only monocular input. Experiments on multiple VLN benchmarks show that MonoDream consistently improves monocular navigation performance and significantly narrows the gap with panoramic-based agents.
[408] GS-ID: Illumination Decomposition on Gaussian Splatting via Adaptive Light Aggregation and Diffusion-Guided Material Priors
Kang Du, Zhihao Liang, Yulin Shen, Zeyu Wang
Main category: cs.CV
TL;DR: GS-ID is an end-to-end framework for disentangling geometry, material, and lighting in Gaussian Splatting, achieving state-of-the-art performance in inverse rendering and relighting.
Details
Motivation: Existing GS-based methods fail to disentangle geometry, material, and lighting under non-Lambertian conditions, especially with specularities and shadows.
Method: GS-ID integrates adaptive light aggregation with diffusion-based material priors, models local lighting with anisotropic SGMs, and uses learnable shadow direction vectors.
Result: GS-ID reduces ambiguity in light-material-geometry interactions and excels in inverse rendering and relighting benchmarks.
Conclusion: GS-ID is effective for downstream tasks like relighting and scene composition, outperforming existing methods.
Abstract: Gaussian Splatting (GS) has emerged as an effective representation for photorealistic rendering, but the underlying geometry, material, and lighting remain entangled, hindering scene editing. Existing GS-based methods struggle to disentangle these components under non-Lambertian conditions, especially in the presence of specularities and shadows. We propose \textbf{GS-ID}, an end-to-end framework for illumination decomposition that integrates adaptive light aggregation with diffusion-based material priors. In addition to a learnable environment map for ambient illumination, we model spatially-varying local lighting using anisotropic spherical Gaussian mixtures (SGMs) that are jointly optimized with scene content. To better capture cast shadows, we associate each splat with a learnable unit vector that encodes shadow directions from multiple light sources, further improving material and lighting estimation. By combining SGMs with intrinsic priors from diffusion models, GS-ID significantly reduces ambiguity in light-material-geometry interactions and achieves state-of-the-art performance on inverse rendering and relighting benchmarks. Experiments also demonstrate the effectiveness of GS-ID for downstream applications such as relighting and scene composition.
[409] ReMoMask: Retrieval-Augmented Masked Motion Generation
Zhengdao Li, Siheng Wang, Zeyu Zhang, Hao Tang
Main category: cs.CV
TL;DR: ReMoMask is a unified framework for text-to-motion generation, addressing limitations of current methods with innovations like bidirectional momentum, semantic spatio-temporal attention, and RAG-classifier-free guidance, achieving state-of-the-art performance.
Details
Motivation: Current text-to-motion generation methods face issues like limited diversity, error accumulation, and physical implausibility in generative models, and diffusion inertia in retrieval-augmented methods.
Method: ReMoMask integrates bidirectional momentum text-motion models, semantic spatio-temporal attention, and RAG-classifier-free guidance to improve cross-modal retrieval and motion generation.
Result: ReMoMask achieves 3.88% and 10.97% FID score improvements on HumanML3D and KIT-ML benchmarks, outperforming previous methods.
Conclusion: ReMoMask offers a robust solution for text-to-motion generation, combining retrieval and generative approaches for superior performance.
Abstract: Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. However, current approaches face dual challenges: Generative models (e.g., diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. To address these limitations, we propose ReMoMask, a unified framework integrating three key innovations: 1) A Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision; 2) A Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; 3) RAG-Classifier-Free Guidance incorporates minor unconditional generation to enhance generalization. Built upon MoMask's RVQ-VAE, ReMoMask efficiently generates temporally coherent motions in minimal steps. Extensive experiments on standard benchmarks demonstrate the state-of-the-art performance of ReMoMask, achieving a 3.88% and 10.97% improvement in FID scores on HumanML3D and KIT-ML, respectively, compared to the previous SOTA method RAG-T2M. Code: https://github.com/AIGeeksGroup/ReMoMask. Website: https://aigeeksgroup.github.io/ReMoMask.
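The bidirectional momentum model's key trick, decoupling the number of negatives from the batch size via momentum queues, follows the well-known MoCo pattern. A minimal sketch with illustrative queue size and temperature (the paper's hyperparameters are not given here):

```python
import torch
import torch.nn.functional as F

class MomentumQueue:
    """MoCo-style negative queue for cross-modal contrastive retrieval.
    Sizes and temperature are illustrative, not ReMoMask's settings."""
    def __init__(self, dim: int = 256, size: int = 8192):
        self.queue = F.normalize(torch.randn(size, dim), dim=1)
        self.ptr = 0

    def enqueue(self, keys: torch.Tensor) -> None:
        # overwrite the oldest entries in ring-buffer fashion
        idx = torch.arange(self.ptr, self.ptr + keys.size(0)) % self.queue.size(0)
        self.queue[idx] = keys
        self.ptr = (self.ptr + keys.size(0)) % self.queue.size(0)

    def logits(self, q: torch.Tensor, k_pos: torch.Tensor, t: float = 0.07):
        l_pos = (q * k_pos).sum(-1, keepdim=True)    # (B, 1) positive pair
        l_neg = q @ self.queue.T                     # (B, K) queued negatives
        return torch.cat([l_pos, l_neg], dim=1) / t  # label 0 marks the positive

mq = MomentumQueue()
q = F.normalize(torch.randn(8, 256), dim=1)          # e.g., text embeddings
k = F.normalize(torch.randn(8, 256), dim=1)          # paired motion embeddings
print(mq.logits(q, k).shape)                         # torch.Size([8, 8193])
mq.enqueue(k.detach())                               # keys come from a momentum encoder
```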
[410] NeuFlow v2: Push High-Efficiency Optical Flow To the Limit
Zhiyong Zhang, Aniket Gupta, Huaizu Jiang, Hanumant Singh
Main category: cs.CV
TL;DR: NeuFlow-V2 is a novel method for real-time high-accuracy optical flow estimation, balancing accuracy and computational efficiency with a lightweight backbone and fast refinement module.
Details
Motivation: Current learning-based optical flow methods struggle to balance accuracy and speed, often sacrificing one for the other, limiting their real-world applicability.
Method: NeuFlow-V2 introduces a lightweight backbone and a fast refinement module to maintain accuracy while reducing computational overhead.
Result: NeuFlow-V2 achieves similar accuracy to SOTA methods with 10x-70x speedups, running at over 20 FPS on 512x384 resolution images on a Jetson Orin Nano.
Conclusion: NeuFlow-V2 successfully addresses the trade-off between accuracy and efficiency, making it suitable for real-world robotic applications.
Abstract: Real-time high-accuracy optical flow estimation is critical for a variety of real-world robotic applications. However, current learning-based methods often struggle to balance accuracy and computational efficiency: methods that achieve high accuracy typically demand substantial processing power, while faster approaches tend to sacrifice precision. These fast approaches specifically falter in their generalization capabilities and do not perform well across diverse real-world scenarios. In this work, we revisit the limitations of the SOTA methods and present NeuFlow-V2, a novel method that offers both - high accuracy in real-world datasets coupled with low computational overhead. In particular, we introduce a novel light-weight backbone and a fast refinement module to keep computational demands tractable while delivering accurate optical flow. Experimental results on synthetic and real-world datasets demonstrate that NeuFlow-V2 provides similar accuracy to SOTA methods while achieving 10x-70x speedups. It is capable of running at over 20 FPS on 512x384 resolution images on a Jetson Orin Nano. The full training and evaluation code is available at https://github.com/neufieldrobotics/NeuFlow_v2.
[411] Evaluating Variance in Visual Question Answering Benchmarks
Nikitha SR
Main category: cs.CV
TL;DR: The paper critiques current evaluation methods for Multimodal Large Language Models (MLLMs) in Visual Question Answering (VQA), highlighting overlooked performance variance due to stochastic factors. It proposes variance-aware methodologies and explores Cloze-style evaluation for reliability.
Details
Motivation: Current VQA benchmarks for MLLMs rely on point estimates, ignoring performance variance caused by stochastic outputs, training seeds, and hyperparameters. This limits robust model development.
Method: The study analyzes variance across 14 VQA benchmarks, examining factors like training seed, framework non-determinism, model scale, and instruction finetuning. It also tests Cloze-style evaluation to reduce stochasticity.
Result: Findings reveal significant performance variability due to overlooked factors, emphasizing the need for variance-aware evaluation to improve MLLM reliability.
Conclusion: The paper advocates for adopting variance-aware methodologies in MLLM evaluation to enhance robustness and reliability in model development.
Abstract: Multimodal large language models (MLLMs) have emerged as powerful tools for visual question answering (VQA), enabling reasoning and contextual understanding across visual and textual modalities. Despite their advancements, the evaluation of MLLMs on VQA benchmarks often relies on point estimates, overlooking the significant variance in performance caused by factors such as stochastic model outputs, training seed sensitivity, and hyperparameter configurations. This paper critically examines these issues by analyzing variance across 14 widely used VQA benchmarks, covering diverse tasks such as visual reasoning, text understanding, and commonsense reasoning. We systematically study the impact of training seed, framework non-determinism, model scale, and extended instruction finetuning on performance variability. Additionally, we explore Cloze-style evaluation as an alternate assessment strategy, studying its effectiveness in reducing stochasticity and improving reliability across benchmarks. Our findings highlight the limitations of current evaluation practices and advocate for variance-aware methodologies to foster more robust and reliable development of MLLMs.
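The paper's central recommendation, reporting variance rather than a single point estimate, amounts to aggregating repeated runs per benchmark. A toy sketch of that reporting style (the benchmark names are real, but the numbers are invented, not results from the paper):

```python
import statistics

# Variance-aware reporting: aggregate the same benchmark over several runs
# (different seeds, sampling settings) instead of quoting one point estimate.
runs = {  # accuracy per run -- toy numbers for illustration only
    "MMBench": [71.2, 70.4, 72.1, 69.8, 71.5],
    "TextVQA": [61.0, 63.2, 60.1, 62.4, 61.8],
}
for bench, scores in runs.items():
    mu, sd = statistics.mean(scores), statistics.stdev(scores)
    print(f"{bench}: {mu:.1f} +/- {sd:.1f} (n={len(scores)})")
```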
[412] PMGS: Reconstruction of Projectile Motion across Large Spatiotemporal Spans via 3D Gaussian Splatting
Yijun Xu, Jingrui Zhang, Yuhan Chen, Dingwen Wang, Lei Yu, Chu He
Main category: cs.CV
TL;DR: PMGS reconstructs projectile motion using 3D Gaussian Splatting, addressing challenges in dynamic reconstruction with a two-stage workflow and novel constraints.
Details
Motivation: Existing methods struggle with long-term, large-scale rigid motion and lack physical consistency.
Method: Two-stage workflow: Target Modeling (dynamic scene decomposition, point density control) and Motion Recovery (SE(3) pose learning). Includes acceleration consistency, dynamic annealing, and Kalman fusion.
Result: PMGS outperforms mainstream methods in reconstructing high-speed nonlinear rigid motion.
Conclusion: PMGS effectively addresses dynamic reconstruction challenges with improved accuracy and physical consistency.
Abstract: Modeling complex rigid motion across large spatiotemporal spans remains an unresolved challenge in dynamic reconstruction. Existing paradigms are mainly confined to short-term, small-scale deformation and offer limited consideration for physical consistency. This study proposes PMGS, focusing on reconstructing Projectile Motion via 3D Gaussian Splatting. The workflow comprises two stages: 1) Target Modeling: achieving object-centralized reconstruction through dynamic scene decomposition and an improved point density control; 2) Motion Recovery: restoring full motion sequences by learning per-frame SE(3) poses. We introduce an acceleration consistency constraint to bridge Newtonian mechanics and pose estimation, and design a dynamic simulated annealing strategy that adaptively schedules learning rates based on motion states. Furthermore, we devise a Kalman fusion scheme that fuses multi-source observations to mitigate accumulated error and disturbances. Experiments show PMGS's superior performance in reconstructing high-speed nonlinear rigid motion compared to mainstream dynamic methods.
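The acceleration consistency constraint can be pictured as penalizing the second finite difference of estimated positions for deviating from constant gravitational acceleration. The sketch below acts on object centers for simplicity, whereas the paper's constraint operates on per-frame SE(3) poses:

```python
import torch

def acceleration_consistency(positions: torch.Tensor, dt: float,
                             g: float = 9.81) -> torch.Tensor:
    """Penalize deviation of the discrete second derivative of object
    positions from constant gravity. A simplified sketch of the Newtonian
    constraint in PMGS; positions: (T, 3), dt: frame interval in seconds."""
    accel = (positions[2:] - 2 * positions[1:-1] + positions[:-2]) / dt**2
    target = torch.tensor([0.0, 0.0, -g]).expand_as(accel)
    return (accel - target).pow(2).mean()

# a perfect ballistic trajectory incurs (numerically) zero loss
t = torch.linspace(0, 1, 30)[:, None]
traj = torch.cat([3 * t, torch.zeros_like(t), 10 * t - 0.5 * 9.81 * t**2], dim=1)
print(acceleration_consistency(traj, dt=1 / 29))
```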
[413] DiffSSC: Semantic LiDAR Scan Completion using Denoising Diffusion Probabilistic Models
Helin Cao, Sven Behnke
Main category: cs.CV
TL;DR: The paper proposes using diffusion models for Semantic Scene Completion (SSC) in autonomous driving to address LiDAR point cloud sparsity and lack of semantics, achieving state-of-the-art results.
Details
Motivation: LiDAR-based perception systems struggle with occluded areas and scene gaps due to sparse, non-semantic point clouds. SSC aims to predict unobserved geometry and semantics for a complete scene representation.
Method: Extends diffusion models to SSC by applying noising and denoising in point and semantic spaces separately. Uses semantic LiDAR point clouds as conditional input and introduces regularization losses for stable denoising.
Result: The approach outperforms existing methods, achieving state-of-the-art performance on autonomous driving datasets.
Conclusion: Diffusion models effectively enhance SSC, providing a more complete and semantic-aware scene representation for autonomous driving.
Abstract: Perception systems play a crucial role in autonomous driving, incorporating multiple sensors and corresponding computer vision algorithms. 3D LiDAR sensors are widely used to capture sparse point clouds of the vehicle’s surroundings. However, such systems struggle to perceive occluded areas and gaps in the scene due to the sparsity of these point clouds and their lack of semantics. To address these challenges, Semantic Scene Completion (SSC) jointly predicts unobserved geometry and semantics in the scene given raw LiDAR measurements, aiming for a more complete scene representation. Building on promising results of diffusion models in image generation and super-resolution tasks, we propose their extension to SSC by implementing the noising and denoising diffusion processes in the point and semantic spaces individually. To control the generation, we employ semantic LiDAR point clouds as conditional input and design local and global regularization losses to stabilize the denoising process. We evaluate our approach on autonomous driving datasets, and it achieves state-of-the-art performance for SSC, surpassing most existing methods.
[414] MedVLThinker: Simple Baselines for Multimodal Medical Reasoning
Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou
Main category: cs.CV
TL;DR: MedVLThinker introduces open recipes for building reasoning-centric medical LMMs, showing RLVR outperforms SFT and text-only data boosts performance more than multimodal data.
Details
Motivation: The lack of open, reproducible methods for medical reasoning models hinders community research and comparison.
Method: Systematic data curation and two training paradigms: SFT on reasoning traces and RLVR with verifiable rewards.
Result: RLVR outperforms SFT; text-only data provides better performance than multimodal data. A 7B model sets new SOTA on VQA benchmarks.
Conclusion: MedVLThinker provides a strong open foundation for future medical reasoning research, with performance rivaling proprietary models.
Abstract: Large Reasoning Models (LRMs) have introduced a new paradigm in AI by enabling models to "think before responding" via chain-of-thought reasoning. However, the absence of open and reproducible recipes for building reasoning-centric medical LMMs hinders community-wide research, analysis, and comparison. In this paper, we present MedVLThinker, a suite of simple yet strong baselines. Our fully open recipe consists of: (1) systematic data curation for both text-only and image-text medical data, filtered according to varying levels of reasoning difficulty, and (2) two training paradigms: Supervised Fine-Tuning (SFT) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards (RLVR) based on final answer correctness. Across extensive experiments on the Qwen2.5-VL model family (3B, 7B) and six medical QA benchmarks, we find that RLVR consistently and significantly outperforms SFT. Additionally, under the RLVR framework, a key, counter-intuitive finding is that training on our curated text-only reasoning data provides a more substantial performance boost than training on multimodal image-text data. Our best open 7B model, trained using the RLVR recipe on text-only data, establishes a new state-of-the-art on existing public VQA benchmarks, surpassing all previous open-source medical LMMs. Furthermore, scaling our model to 32B achieves performance on par with the proprietary GPT-4o. We release all curated data, models, and code to provide the community with a strong, open foundation for future research in multimodal medical reasoning.
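RLVR hinges on a cheap, verifiable reward: the final answer either matches the gold label or it does not. A sketch for multiple-choice medical QA follows; the answer-extraction pattern is an assumption, since MedVLThinker's actual parser is not described here.

```python
import re

def verifiable_reward(response: str, gold: str) -> float:
    """Binary reward for RLVR on multiple-choice QA: 1.0 if the extracted
    final answer letter matches the gold label, else 0.0. The regex is an
    illustrative assumption, not the paper's parser."""
    m = re.search(r"(?:answer is|\\boxed\{)\s*([A-E])", response)
    return 1.0 if m and m.group(1) == gold.strip().upper() else 0.0

print(verifiable_reward("Reasoning... the answer is C", "C"))  # 1.0
print(verifiable_reward("I think \\boxed{B} is right", "C"))   # 0.0
```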
[415] Raw Data Matters: Enhancing Prompt Tuning by Internal Augmentation on Vision-Language Models
Haoyang Li, Liang Wang, Chao Wang, Siyu Zhou, Jing Jiang, Yan Peng, Guodong Long
Main category: cs.CV
TL;DR: AugPT is a self-contained, distillation-based prompt tuning method that uses internal augmentation to enhance model performance without external knowledge.
Details
Motivation: Existing methods rely on costly external knowledge and ignore image modality features. AugPT aims to exploit known features more efficiently.
Method: AugPT uses self-supervised augmentation on unlabeled images and a gating mechanism to filter noisy samples, reusing the pre-trained model.
Result: AugPT improves performance and generalization without external knowledge, validated by extensive experiments.
Conclusion: AugPT is an effective, cost-efficient alternative to external knowledge-based prompt tuning methods.
Abstract: For CLIP-based prompt tuning, introducing more data as additional knowledge to enhance the fine-tuning process has proved to be an effective approach. Existing data amplification strategies for prompt tuning typically rely on external knowledge (e.g., large language models or pre-structured knowledge bases), resulting in higher costs for data collection and processing, while generally ignoring further utilization of features in the image modality. To address this, we propose Augmentation-driven Prompt Tuning (AugPT), a self-contained distillation-based prompt tuning approach using only internal augmentation on the raw dataset to better exploit known features. Specifically, AugPT employs self-supervised augmentation on unlabeled images in the training set, and introduces a novel gating mechanism based on a consensus test, reusing the pre-trained prompt tuning backbone model to spontaneously filter noisy samples, further enhancing the quality of augmented views. Extensive experiments validate that AugPT simultaneously enhances model performance and generalization capability without using appended external knowledge. The code of AugPT is available at: https://github.com/JREion/AugPT.
[416] AtomThink: Multimodal Slow Thinking with Atomic Step Reasoning
Kun Xiang, Zhili Liu, Terry Jingchen Zhang, Yinya Huang, Yunshuang Nie, Kaixin Cai, Yiyang Yin, Runhui Huang, Hanhui Li, Yihan Zeng, Yu-Jie Yuan, Jianhua Han, Lanqing Hong, Hang Xu, Xiaodan Liang
Main category: cs.CV
TL;DR: The paper introduces Self-structured Chain of Thought (SCoT) and AtomThink framework to enhance multimodal mathematical reasoning in MLLMs, improving accuracy and efficiency.
Details
Motivation: Addressing the challenge of multimodal mathematical reasoning by enabling adaptive reasoning levels for tasks of varying complexity.
Method: Proposes SCoT with minimal semantic atomic steps and the AtomThink framework, including a data engine, SFT, policy-guided inference, and an atomic capability metric.
Result: Achieves >10% accuracy gains on MathVista and MathVerse, improves data utilization by 5x, and boosts inference efficiency by 85.3%.
Conclusion: AtomThink outperforms state-of-the-art structured CoT methods in accuracy, data utilization, and efficiency, with publicly available code.
Abstract: In this paper, we address the challenging task of multimodal mathematical reasoning by incorporating the notion of "slow thinking" into multimodal large language models (MLLMs). Our core idea is that models can learn to adaptively use different levels of reasoning to tackle questions of different complexity. We propose a novel paradigm of Self-structured Chain of Thought (SCoT), which comprises minimal semantic atomic steps. Unlike existing methods that rely on structured templates or free-form paradigms, our method can not only generate cognitive CoT structures for various complex tasks but also mitigate the phenomenon of overthinking on easier tasks. To introduce structured reasoning into visual cognition, we further design a novel AtomThink framework with four key modules: (i) a data engine to generate high-quality multimodal reasoning paths; (ii) a supervised fine-tuning (SFT) process with serialized inference data; (iii) a policy-guided multi-turn inference method; and (iv) an atomic capability metric to evaluate the single-step utilization rate. We conduct extensive experiments to show that the proposed AtomThink significantly improves the performance of baseline MLLMs, achieving more than 10% average accuracy gains on MathVista and MathVerse. Compared to state-of-the-art structured CoT approaches, our method not only achieves higher accuracy but also improves data utilization by 5 times and boosts inference efficiency by 85.3%. Our code is publicly available at https://github.com/Quinn777/AtomThink.
[417] KinMo: Kinematic-aware Human Motion Understanding and Generation
Pengfei Zhang, Pinxin Liu, Pablo Garrido, Hyeongwoo Kim, Bindita Chaudhuri
Main category: cs.CV
TL;DR: KinMo introduces a hierarchical motion representation and alignment framework to bridge the modality gap in human motion synthesis, improving understanding and generation.
Details
Motivation: Current frameworks use coarse global descriptions, missing fine details in motion, leading to ambiguities between text and motion.
Method: KinMo uses hierarchical describable motion representation, automated annotation for fine-grained descriptions, and hierarchical text-motion alignment.
Result: KinMo enhances motion understanding and generation, shown by improved text-motion retrieval and fine-grained synthesis.
Conclusion: KinMo addresses the modality gap effectively, offering scalable dataset enrichment and better motion synthesis.
Abstract: Current human motion synthesis frameworks rely on global action descriptions, creating a modality gap that limits both motion understanding and generation capabilities. A single coarse description, such as "run", fails to capture details such as variations in speed, limb positioning, and kinematic dynamics, leading to ambiguities between text and motion modalities. To address this challenge, we introduce KinMo, a unified framework built on a hierarchical describable motion representation that extends beyond global actions by incorporating kinematic group movements and their interactions. We design an automated annotation pipeline to generate high-quality, fine-grained descriptions for this decomposition, resulting in the KinMo dataset and offering a scalable and cost-efficient solution for dataset enrichment. To leverage these structured descriptions, we propose Hierarchical Text-Motion Alignment, which progressively integrates additional motion details, thereby improving semantic motion understanding. Furthermore, we introduce a coarse-to-fine motion generation procedure that leverages the enhanced spatial understanding to improve motion synthesis. Experimental results show that KinMo significantly improves motion understanding, demonstrated by enhanced text-motion retrieval performance, and enables more fine-grained motion generation and editing capabilities. Project Page: https://andypinxinliu.github.io/KinMo
[418] Two Heads are Better than One: Robust Learning Meets Multi-branch Models
Zongyuan Zhang, Qingwen Bu, Tianyang Duan, Zheng Lin, Yuhao Qing, Zihan Fang, Heming Cui, Dong Huang
Main category: cs.CV
TL;DR: The paper proposes BORT, a method for adversarial training that improves robustness by leveraging multi-branch neural networks and a branch-orthogonal loss function, achieving state-of-the-art performance without additional data.
Details
Motivation: DNNs are vulnerable to adversarial attacks, and while adversarial training is effective, most work focuses on data-centric approaches. This paper revisits robustness from the perspective of deep feature distribution.
Method: Proposes BORT, using a multi-branch neural network and branch-orthogonal loss to create orthogonal solution spaces, improving robustness without extra data or inference time cost.
Result: Achieves 67.3% and 41.5% robust accuracy on CIFAR-10 and CIFAR-100, outperforming state-of-the-art methods by +7.23% and +9.07%.
Conclusion: BORT demonstrates superior adversarial robustness by focusing on model architecture and feature distribution, surpassing methods even with larger training datasets.
Abstract: Deep neural networks (DNNs) are vulnerable to adversarial examples, in which inputs containing imperceptible perturbations mislead DNNs into producing false outputs. Adversarial training, a reliable and effective method of defense, significantly reduces the vulnerability of neural networks and has become the de facto standard for robust learning. While many recent works practice the data-centric philosophy, such as how to generate better adversarial examples or use generative models to produce additional training data, we look back at the models themselves and revisit adversarial robustness from the perspective of deep feature distributions as an insightful complement. In this paper, we propose \textit{Branch Orthogonality adveRsarial Training} (BORT) to obtain state-of-the-art performance using solely the original dataset for adversarial training. To realize our design idea of integrating multiple orthogonal solution spaces, we leverage a simple and straightforward multi-branch neural network that resists adversarial attacks with no increase in inference time. We heuristically propose a corresponding loss function, the branch-orthogonal loss, to make each solution space of the multi-branch model orthogonal. We evaluate our approach on CIFAR-10, CIFAR-100, and SVHN against $\ell_{\infty}$ norm-bounded perturbations of size $\epsilon = 8/255$. Exhaustive experiments show that our method surpasses all state-of-the-art methods without any tricks. Compared to all methods that do not use additional data for training, our models achieve 67.3% and 41.5% robust accuracy on CIFAR-10 and CIFAR-100 (improving upon the state of the art by +7.23% and +9.07%). We also outperform methods that use a training set of far larger scale than ours.
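The core mechanism is easy to prototype: a penalty that drives the feature spaces of different branches toward mutual orthogonality. A hedged sketch of one plausible form of such a branch-orthogonal loss (the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def branch_orthogonal_loss(branch_feats):
    """Penalize pairwise feature alignment across branches so each branch
    occupies a roughly orthogonal solution space -- an illustrative form,
    not necessarily the one used in BORT.

    branch_feats: list of (B, D) feature tensors, one per branch.
    """
    loss = 0.0
    n = len(branch_feats)
    for i in range(n):
        for j in range(i + 1, n):
            fi = F.normalize(branch_feats[i], dim=-1)
            fj = F.normalize(branch_feats[j], dim=-1)
            # Mean squared cosine similarity between corresponding samples.
            loss = loss + (fi * fj).sum(dim=-1).pow(2).mean()
    return loss / max(n * (n - 1) / 2, 1)
```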
[419] Probabilistic Domain Adaptation for Biomedical Image Segmentation
Anwai Archit, Constantin Pape
Main category: cs.CV
TL;DR: A probabilistic domain adaptation method for biomedical segmentation, leveraging self-training and Probabilistic UNet to improve pseudo-label filtering and generalization.
Details
Motivation: Addressing the lack of generalization in deep learning for biomedical segmentation due to diverse experimental settings.
Method: Uses Probabilistic UNet to sample multiple segmentation hypotheses for better pseudo-label filtering, with joint and separate source-target training strategies.
Result: Evaluated on three challenging domain adaptation tasks for biomedical segmentation, showing improved performance.
Conclusion: The method enhances domain adaptation for biomedical segmentation, with publicly available code for reproducibility.
Abstract: Segmentation is a crucial analysis task in biomedical imaging. Given the diverse experimental settings in this field, the lack of generalization limits the use of deep learning in practice. Domain adaptation is a promising remedy: it involves training a model for a given task on a labeled source dataset and adapting it to a target dataset without additional labels. We introduce a probabilistic domain adaptation method, building on self-training approaches and the Probabilistic UNet. We use the latter to sample multiple segmentation hypotheses to implement better pseudo-label filtering. We further study joint and separate source-target training strategies and evaluate our method on three challenging domain adaptation tasks for biomedical segmentation. Our code is publicly available at https://github.com/computational-cell-analytics/Probabilistic-Domain-Adaptation.
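The pseudo-label filtering step can be sketched concretely: draw several segmentation hypotheses from the Probabilistic UNet's latent space and trust only pixels where they agree. The `prob_unet.sample` interface, the binary-segmentation setting, and the agreement threshold below are illustrative assumptions:

```python
import torch

def filter_pseudo_labels(prob_unet, image, n_samples=8, agree_thresh=0.9):
    """Sample segmentation hypotheses and keep only pixels where they agree,
    yielding pseudo-labels plus a validity mask for self-training."""
    with torch.no_grad():
        # (S, H, W) hard binary masks from S samples of the latent space.
        hyps = torch.stack([prob_unet.sample(image).argmax(dim=1).squeeze(0)
                            for _ in range(n_samples)])
    pseudo = hyps.float().mean(dim=0)              # per-pixel foreground frequency
    agreement = torch.maximum(pseudo, 1 - pseudo)  # confidence either way
    mask = agreement >= agree_thresh               # pixels used for self-training
    return (pseudo >= 0.5).long(), mask
```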
[420] SuPerPM: A Surgical Perception Framework Based on Deep Point Matching Learned from Physical Constrained Simulation Data
Shan Lin, Albert J. Miao, Ali Alabiad, Fei Liu, Kaiyuan Wang, Jingpei Lu, Florian Richter, Michael C. Yip
Main category: cs.CV
TL;DR: SuPerPM, a surgical perception framework, uses learning-based point cloud matching for accurate tissue tracking during large deformations, outperforming traditional ICP-based methods.
Details
Motivation: Endoscopic tissue tracking errors often arise from incorrect data association during deformations, necessitating a more robust solution.
Method: The framework employs learning-based non-rigid point cloud matching, trained on simulated ground truth data from PBD simulations, to improve accuracy.
Result: SuPerPM achieves superior performance on surgical datasets with large deformations compared to existing tracking algorithms.
Conclusion: The proposed framework effectively addresses data association challenges in surgical environments, enabling more reliable tissue tracking.
Abstract: A major source of endoscopic tissue tracking errors during deformations stems from incorrect data association between observed sensor measurements and the previously tracked scene. To mitigate this issue, we present a surgical perception framework, SuPerPM, that leverages learning-based non-rigid point cloud matching for data association, thus accommodating larger deformations than previous approaches, which relied on Iterative Closest Point (ICP) for point associations. Learning models typically require training data with ground-truth point cloud correspondences, which are challenging or even impractical to collect in surgical environments. Thus, for tuning the learning model, we gather endoscopic data of soft tissue being manipulated by a surgical robot and then establish correspondences between point clouds at different time points to serve as ground truth. This was achieved by employing a position-based dynamics (PBD) simulation to ensure that the correspondences adhered to physical constraints. The proposed framework is demonstrated on several challenging surgical datasets characterized by large deformations, achieving superior performance over advanced surgical scene tracking algorithms.
[421] Joint Generative Modeling of Grounded Scene Graphs and Images via Diffusion Models
Bicheng Xu, Qi Yan, Renjie Liao, Lele Wang, Leonid Sigal
Main category: cs.CV
TL;DR: A framework for joint grounded scene graph-image generation, focusing on generating plausible scene graphs from noise using a novel diffusion model (DiffuseSG), which outperforms existing methods.
Details
Motivation: To address the challenge of generating high-dimensional, multi-modal structured data (grounded scene graphs) for interpretable control over image generation.
Method: Adopts a factorized approach: generates grounded scene graphs first, then images. Uses DiffuseSG, a diffusion model with a graph transformer denoiser, to handle heterogeneous node/edge attributes and introduces IoU-based regularization.
Result: Outperforms existing methods on VG and COCO-Stuff datasets, excelling in standard and new metrics. Shows broader applicability in downstream tasks like scene graph completion and detection.
Conclusion: DiffuseSG effectively generates grounded scene graphs, enhancing control and performance in image generation and related tasks.
Abstract: We introduce a framework for joint grounded scene graph-image generation, a challenging task involving high-dimensional, multi-modal structured data. To effectively model this complex joint distribution, we adopt a factorized approach: first generating a grounded scene graph, followed by image generation conditioned on the generated grounded scene graph. While conditional image generation has been widely explored in the literature, our primary focus is on the generation of grounded scene graphs from noise, which provides efficient and interpretable control over the image generation process. This task requires generating plausible grounded scene graphs with heterogeneous attributes for both nodes (objects) and edges (relations among objects), encompassing continuous attributes (e.g., object bounding boxes) and discrete attributes (e.g., object and relation categories). To address this challenge, we introduce DiffuseSG, a novel diffusion model that jointly models the heterogeneous node and edge attributes. We explore different encoding strategies to effectively handle the categorical data. Leveraging a graph transformer as the denoiser, DiffuseSG progressively refines grounded scene graph representations in a continuous space before discretizing them to generate structured outputs. Additionally, we introduce an IoU-based regularization term to enhance empirical performance. Our model outperforms existing methods in grounded scene graph generation on the VG and COCO-Stuff datasets, excelling in both standard and newly introduced metrics that more accurately capture the task's complexity. Furthermore, we demonstrate the broader applicability of DiffuseSG in two important downstream tasks: 1) achieving superior results in a range of grounded scene graph completion tasks, and 2) enhancing grounded scene graph detection models by leveraging additional training samples generated by DiffuseSG.
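An IoU-based regularization term for axis-aligned boxes is straightforward to write down; below is a generic differentiable version, offered as a sketch of the kind of term the abstract mentions rather than the paper's exact formulation:

```python
import torch

def iou_loss(pred, target, eps=1e-7):
    """1 - IoU between predicted and target boxes in (x1, y1, x2, y2)
    format; both tensors have shape (N, 4)."""
    x1 = torch.maximum(pred[:, 0], target[:, 0])
    y1 = torch.maximum(pred[:, 1], target[:, 1])
    x2 = torch.minimum(pred[:, 2], target[:, 2])
    y2 = torch.minimum(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    return (1 - inter / (union + eps)).mean()
```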
[422] Efficient4D: Fast Dynamic 3D Object Generation from a Single-view Video
Zijie Pan, Zeyu Yang, Xiatian Zhu, Li Zhang
Main category: cs.CV
TL;DR: Efficient4D is a fast video-to-4D framework for dynamic 3D object generation, achieving real-time rendering and 10x speedup over prior methods.
Details
Motivation: Lack of 4D labeled data and inefficiency in existing image-to-3D pipelines motivate the need for a faster, scalable solution.
Method: Uses spacetime-consistent images from video as labeled data, reconstructs 4D content via 4D Gaussian splatting, and employs inconsistency-aware loss and lightweight score distillation.
Result: Achieves real-time rendering, 10x speedup (10 mins vs 120 mins), and maintains novel view synthesis quality.
Conclusion: Efficient4D is a scalable, efficient solution for dynamic 3D object generation from single-view videos.
Abstract: Generating a dynamic 3D object from a single-view video is challenging due to the lack of 4D labeled data. An intuitive approach is to extend previous image-to-3D pipelines by transferring off-the-shelf image generation models such as score distillation sampling. However, this approach would be slow and expensive to scale due to the need to back-propagate the information-limited supervision signals through a large pretrained model. To address this, we propose an efficient video-to-4D object generation framework called Efficient4D. It generates high-quality spacetime-consistent images under different camera views, and then uses them as labeled data to directly reconstruct the 4D content through a 4D Gaussian splatting model. Importantly, our method can achieve real-time rendering under continuous camera trajectories. To enable robust reconstruction under sparse views, we introduce an inconsistency-aware confidence-weighted loss design, along with a lightly weighted score distillation loss. Extensive experiments on both synthetic and real videos show that Efficient4D offers a remarkable 10-fold increase in speed compared to prior art alternatives while preserving the quality of novel view synthesis. For example, Efficient4D takes only 10 minutes to model a dynamic object, versus 120 minutes for the previous state-of-the-art model, Consistent4D.
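One way to realize an inconsistency-aware confidence weighting is to down-weight pixels where the generated images disagree. The variance heuristic and the `beta` parameter below are illustrative assumptions, not the paper's exact design:

```python
import torch

def confidence_weighted_loss(render, target, view_stack, beta=10.0):
    """Down-weight pixels where generated supervision images disagree.

    render, target: (3, H, W) rendering and supervision image for one view.
    view_stack: (V, 3, H, W) generated images of the same time instant.
    """
    with torch.no_grad():
        # High cross-image variance => likely inconsistent supervision.
        var = view_stack.var(dim=0).mean(dim=0, keepdim=True)  # (1, H, W)
        conf = torch.exp(-beta * var)                          # confidence in [0, 1]
    return (conf * (render - target).pow(2)).mean()
```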
[423] DreamFrame: Enhancing Video Understanding via Automatically Generated QA and Style-Consistent Keyframes
Zhende Song, Chenchen Wang, Jiamu Sheng, Chi Zhang, Shengji Tang, Jiayuan Fan, Tao Chen
Main category: cs.CV
TL;DR: DreamFrame is a three-stage framework for automatically generating style-consistent keyframes and QA pairs to support LVLM instruction tuning, reducing the need for manual dataset creation.
Details
Motivation: Existing LVLMs rely on labor-intensive, annotated datasets, making adaptation to specific tasks challenging. DreamFrame aims to automate dataset generation for efficient tuning.
Method: DreamFrame uses an LLM to generate movie plots, a Style Immobilization Process for visual consistency, and integrates descriptions and embeddings to produce keyframes and QA pairs.
Result: The framework created 1k stylized videos and 100k QA pairs, and the fine-tuned DreamFrame-7B outperformed similar-sized LVLMs in benchmarks.
Conclusion: DreamFrame successfully automates dataset generation and improves LVLM performance, offering a scalable solution for task-specific tuning.
Abstract: Recent large vision-language models (LVLMs) for video understanding are primarily fine-tuned on various videos scraped from online platforms. Existing datasets, such as ActivityNet, require considerable human labor for structuring and annotation before they can be effectively utilized for tuning LVLMs. While current LVLMs are primarily trained on existing datasets in broad, general-purpose settings, adapting them to specific downstream scenarios remains challenging, as collecting and annotating task-specific videos is highly labor-intensive and time-consuming. To address this issue, we propose a three-stage framework named DreamFrame for automatically generating style-consistent keyframes and corresponding question-answer (QA) pairs to support LVLM instruction tuning. DreamFrame generates datasets in a movie-like manner. First, we utilize an LLM to generate structured movie plots including movie prior information (such as overview and style), frame descriptions, and plot-related QA pairs, with a story expansion strategy to mitigate context length limitations. Then, to ensure visual consistency across generated frames, we design a Style Immobilization Process which maintains a consistent style through an embedding learning strategy. Finally, frame descriptions and style embeddings are integrated to produce coherent keyframes. Using DreamFrame, we construct a dataset comprising approximately 1k stylized keyframe-like videos and 100k diverse QA pairs. Extensive fine-tuning experiments on various LVLM architectures demonstrate the effectiveness of the proposed dataset. Furthermore, based on the proposed dataset, we fine-tune a new LVLM named DreamFrame-7B, which significantly surpasses previous similar-sized LVLMs across different benchmarks.
[424] Contextual Gesture: Co-Speech Gesture Video Generation through Context-aware Gesture Representation
Pinxin Liu, Pengfei Zhang, Hyeongwoo Kim, Pablo Garrido, Ari Sharpio, Kyle Olszewski
Main category: cs.CV
TL;DR: Contextual Gesture framework improves co-speech gesture generation with alignment, tokenization, and refinement modules for realistic and synchronized outputs.
Details
Motivation: Enhancing lifelike avatars and human-computer interactions by addressing challenges in rhythmic/semantic trigger identification and pixel-level realism in gesture generation.
Method: Introduces three components: chronological speech-gesture alignment, contextualized gesture tokenization, and structure-aware refinement for video generation.
Result: Produces realistic, speech-aligned gesture videos, supports long-sequence generation, and enables video gesture editing.
Conclusion: Contextual Gesture effectively addresses existing limitations, offering improved performance in co-speech gesture generation.
Abstract: Co-speech gesture generation is crucial for creating lifelike avatars and enhancing human-computer interactions by synchronizing gestures with speech. Despite recent advancements, existing methods struggle with accurately identifying the rhythmic or semantic triggers from audio for generating contextualized gesture patterns and achieving pixel-level realism. To address these challenges, we introduce Contextual Gesture, a framework that improves co-speech gesture video generation through three innovative components: (1) a chronological speech-gesture alignment that temporally connects the two modalities, (2) a contextualized gesture tokenization that incorporates speech context into motion pattern representation through distillation, and (3) a structure-aware refinement module that employs edge connections to link gesture keypoints, improving video generation. Our extensive experiments demonstrate that Contextual Gesture not only produces realistic and speech-aligned gesture videos but also supports long-sequence generation and video gesture editing applications, as shown in Fig. 1.
[425] Collaborative Novel Object Discovery and Box-Guided Cross-Modal Alignment for Open-Vocabulary 3D Object Detection
Yang Cao, Yihan Zeng, Hang Xu, Dan Xu
Main category: cs.CV
TL;DR: CoDAv2 is a framework for open-vocabulary 3D object detection, using 3D-NODE for novel object localization and DCMA for classification, achieving superior performance.
Details
Motivation: The challenge of detecting novel 3D objects with limited base categories drives the need for innovative localization and classification methods.
Method: CoDAv2 combines 3D-NODE (3D Novel Object Discovery with Enrichment) for localization and DCMA (Discovery-driven Cross-modal Alignment) for classification, enhanced by 2D box guidance.
Result: CoDAv2 significantly outperforms existing methods (AP_Novel of 9.17 vs. 3.61 on SUN-RGBD and 9.12 vs. 3.74 on ScanNetv2).
Conclusion: CoDAv2 provides a robust solution for open-vocabulary 3D object detection, with superior localization and classification capabilities.
Abstract: Open-vocabulary 3D Object Detection (OV-3DDet) addresses the detection of objects from an arbitrary list of novel categories in 3D scenes, which remains a very challenging problem. In this work, we propose CoDAv2, a unified framework designed to innovatively tackle both the localization and classification of novel 3D objects, under the condition of limited base categories. For localization, the proposed 3D Novel Object Discovery (3D-NOD) strategy utilizes 3D geometries and 2D open-vocabulary semantic priors to discover pseudo labels for novel objects during training. 3D-NOD is further extended with an Enrichment strategy that significantly enriches the novel object distribution in the training scenes, thereby enhancing the model's ability to localize more novel objects; 3D-NOD with Enrichment is termed 3D-NODE. For classification, the Discovery-driven Cross-modal Alignment (DCMA) module aligns features from 3D point clouds and 2D/textual modalities, employing both class-agnostic and class-specific alignments that are iteratively refined to handle the expanding vocabulary of objects. In addition, 2D box guidance boosts classification accuracy against complex background noise; this variant is coined Box-DCMA. Extensive evaluation demonstrates the superiority of CoDAv2. CoDAv2 outperforms the best-performing method by a large margin (AP_Novel of 9.17 vs. 3.61 on SUN-RGBD and 9.12 vs. 3.74 on ScanNetv2). Source code and pre-trained models are available at the GitHub project page.
[426] SPAN: Unlocking Pyramid Representations for Gigapixel Histopathological Images
Weiyi Wu, Xingjian Diao, Chongyang Gao, Xinwen Xu, Siting Li, Jiang Gui
Main category: cs.CV
TL;DR: A sparse-native computational framework, SPAN, is proposed for gigapixel-scale whole slide image analysis, preserving spatial relationships and outperforming existing methods.
Details
Motivation: Addressing the computational challenges of gigapixel-scale WSIs and the limitations of conventional patch-based and attention methods.
Method: Developed SPAN with hierarchical sparse pyramid attention, Spatial-Adaptive Feature Condensation, and Context-Aware Feature Refinement.
Result: SPAN outperforms state-of-the-art methods on multiple datasets, capturing contextual and hierarchical representations effectively.
Conclusion: The work introduces a new paradigm for WSI analysis, overcoming computational barriers and enabling advanced modeling.
Abstract: Whole slide images (WSIs) present fundamental computational challenges due to their gigapixel-scale resolutions and sparse, irregularly distributed informative regions. Conventional patch-based methods inevitably distort spatial relationships or treat patches as independent samples, while traditional attention mechanisms, designed for dense, uniformly distributed data, are computationally impractical for WSIs. To address these limitations, we propose a novel sparse-native computational framework that preserves exact spatial relationships, unlocking advanced modeling techniques and bridging a long-standing gap between WSI analysis and general vision. Based on this framework, we develop Sparse Pyramid Attention Networks (SPAN), incorporating a hierarchical sparse pyramid attention architecture with shifted windows that efficiently directs computational resources to informative regions. SPAN comprises two key modules: Spatial-Adaptive Feature Condensation, which progressively builds multi-scale representations from a single-scale input through sparse downsampling, and Context-Aware Feature Refinement, which captures long-range dependencies via shifted windows and global tokens. Evaluations on multiple public datasets demonstrate SPAN's superior performance over state-of-the-art methods, validating both our framework's effectiveness and SPAN's specific advantages in capturing contextual and hierarchical representations that existing methods fundamentally cannot model. Our work establishes a new paradigm for WSI analysis that overcomes long-standing computational barriers. The code will be made publicly available upon publication.
[427] InstructLayout: Instruction-Driven 2D and 3D Layout Synthesis with Semantic Graph Prior
Chenguo Lin, Yuchen Lin, Panwang Pan, Xuanyang Zhang, Yadong Mu
Main category: cs.CV
TL;DR: InstructLayout improves 2D/3D layout synthesis by integrating a semantic graph prior and layout decoder, outperforming existing methods.
Details
Motivation: Existing methods lack controllability in layout synthesis due to implicit modeling of object relations.
Method: Uses a semantic graph prior and layout decoder to enhance controllability and fidelity.
Result: Outperforms state-of-the-art methods in 2D/3D tasks; validated by ablation studies.
Conclusion: InstructLayout offers superior controllability and performance in layout synthesis.
Abstract: Comprehending natural language instructions is an appealing property for both 2D and 3D layout synthesis systems. Existing methods model object joint distributions and express object relations only implicitly, hindering the controllability of generation. We introduce InstructLayout, a novel generative framework that integrates a semantic graph prior and a layout decoder to improve controllability and fidelity for 2D and 3D layout synthesis. The proposed semantic graph prior learns layout appearances and object distributions simultaneously, demonstrating versatility across various downstream tasks in a zero-shot manner. To facilitate benchmarking for text-driven 2D and 3D scene synthesis, we curate two high-quality datasets of layout-instruction pairs from public Internet resources with large language and multimodal models. Extensive experimental results reveal that the proposed method outperforms existing state-of-the-art approaches by a large margin in both 2D and 3D layout synthesis tasks. Thorough ablation studies confirm the efficacy of crucial design components.
[428] On the Discriminability of Self-Supervised Representation Learning
Zeen Song, Wenwen Qiang, Changwen Zheng, Fuchun Sun, Hui Xiong
Main category: cs.CV
TL;DR: The paper addresses the “crowding problem” in SSL, where features lack class separation, and proposes DSA to improve feature aggregation and separation, narrowing the gap with SL.
Details
Motivation: SSL underperforms SL due to poor class separation (crowding problem) and high intra-class variance, limiting discriminability.
Method: Proposes Dynamic Semantic Adjuster (DSA), a learnable regulator to enhance feature aggregation and separation, backed by a theoretical framework linking SSL objectives to cross-entropy risk bounds.
Result: DSA significantly improves SSL performance on benchmark datasets, reducing the gap with SL.
Conclusion: DSA effectively addresses SSL’s crowding problem, enhancing discriminability and performance.
Abstract: Self-supervised learning (SSL) has recently shown notable success in various visual tasks. However, in terms of discriminability, SSL is still not on par with supervised learning (SL). This paper identifies a key issue, the "crowding problem," where features from different classes are not well-separated, and there is high intra-class variance. In contrast, SL ensures clear class separation. Our analysis reveals that SSL objectives do not adequately constrain the relationships between samples and their augmentations, leading to poorer performance in complex tasks. We further establish a theoretical framework that connects SSL objectives to cross-entropy risk bounds, explaining how reducing intra-class variance and increasing inter-class separation can improve generalization. To address this, we propose the Dynamic Semantic Adjuster (DSA), a learnable regulator that enhances feature aggregation and separation while being robust to outliers. Comprehensive experiments conducted on diverse benchmark datasets validate that DSA leads to substantial gains in SSL performance, narrowing the performance gap with SL.
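The stated goal of DSA, pulling a sample toward its augmentations while pushing distinct samples apart, can be sketched with a simple aggregation/separation objective; the learnable regulator itself is not reproduced here, and the margin form below is an assumption:

```python
import torch
import torch.nn.functional as F

def aggregation_separation_loss(z1, z2, margin=0.5):
    """Aggregate positives (two views of the same sample) and separate
    negatives beyond a margin -- an illustrative SSL objective.

    z1, z2: (B, D) embeddings of two augmented views of the same batch.
    """
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    pos = (z1 * z2).sum(dim=-1)               # cosine similarity of positives
    sim = z1 @ z2.t()                         # (B, B) all cross-view pairs
    b = z1.size(0)
    mask = ~torch.eye(b, dtype=torch.bool, device=z1.device)
    neg = sim[mask]                           # off-diagonal = negatives
    return (1 - pos).mean() + F.relu(neg - margin).mean()
```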
[429] Enhancing deep neural networks through complex-valued representations and Kuramoto synchronization dynamics
Sabine Muzellec, Andrea Alamia, Thomas Serre, Rufin VanRullen
Main category: cs.CV
TL;DR: The paper explores using neural synchrony, inspired by neuroscience, to improve object binding in deep learning models for visual categorization. It introduces complex-valued representations with Kuramoto dynamics for phase alignment, testing feedforward and recurrent architectures. Both outperform traditional models in multi-object tasks.
Details
Motivation: Current deep learning models struggle with object binding, limiting their ability to represent multiple objects effectively. Neuroscience suggests neural synchrony may address this, motivating the study.
Method: Combines complex-valued representations with Kuramoto dynamics for phase alignment. Tests two architectures: feedforward and recurrent (with feedback for synchronization refinement).
Result: Both synchrony-based models outperform real-valued and non-synchronized complex models in multi-object tasks (e.g., overlapping digits, noisy inputs, out-of-distribution transformations).
Conclusion: Synchrony-driven mechanisms enhance deep learning models, improving performance, robustness, and generalization in complex visual tasks.
Abstract: Neural synchrony is hypothesized to play a crucial role in how the brain organizes visual scenes into structured representations, enabling the robust encoding of multiple objects within a scene. However, current deep learning models often struggle with object binding, limiting their ability to represent multiple objects effectively. Inspired by neuroscience, we investigate whether synchrony-based mechanisms can enhance object encoding in artificial models trained for visual categorization. Specifically, we combine complex-valued representations with Kuramoto dynamics to promote phase alignment, facilitating the grouping of features belonging to the same object. We evaluate two architectures employing synchrony: a feedforward model and a recurrent model with feedback connections to refine phase synchronization using top-down information. Both models outperform their real-valued counterparts and complex-valued models without Kuramoto synchronization on tasks involving multi-object images, such as overlapping handwritten digits, noisy inputs, and out-of-distribution transformations. Our findings highlight the potential of synchrony-driven mechanisms to enhance deep learning models, improving their performance, robustness, and generalization in complex visual categorization tasks.
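The Kuramoto dynamics at the heart of this approach follow the classic update theta_i <- theta_i + dt * (omega_i + (K/N) * sum_j sin(theta_j - theta_i)); iterated, the phases of strongly coupled units align, which is the grouping signal the models exploit. A minimal step (global all-to-all coupling is a simplifying assumption; the paper couples phases through learned features):

```python
import torch

def kuramoto_step(theta, omega, coupling, dt=0.1):
    """One Euler step of Kuramoto phase dynamics.

    theta, omega: (N,) phases and natural frequencies of N units.
    coupling: global coupling strength K (scalar).
    """
    n = theta.numel()
    diff = theta.unsqueeze(0) - theta.unsqueeze(1)   # diff[i, j] = theta_j - theta_i
    drive = torch.sin(diff).sum(dim=1) * coupling / n
    return theta + dt * (omega + drive)
```

Running `kuramoto_step` for a few dozen iterations with a sufficiently large `coupling` drives the units toward a common phase, mimicking the feature grouping described above.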
[430] The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models
Laura Niss, Kevin Vogt-Lowell, Theodoros Tsiligkaridis
Main category: cs.CV
TL;DR: The paper introduces IIMM, a metric for predicting fine-tuning outcomes in vision-language models, showing its effectiveness and theoretical robustness.
Details
Motivation: To address the underexplored impact of fine-tuning on learning gains and catastrophic forgetting in vision-language models.
Method: Proposes the Inter-Intra Modal Measure (IIMM) to quantify intra-modal similarity and inter-modal misalignment, validated across models and fine-tuning techniques.
Result: IIMM shows a strong linear relationship with performance changes, outperforming existing measures, and provides theoretical stability.
Conclusion: IIMM is a lightweight, effective tool for guiding fine-tuning strategies, enhancing transferability predictions.
Abstract: The fine-tuning of large vision-language foundation models remains an underexplored area, particularly regarding its impact on learning gains and catastrophic forgetting. Inspired by the significance of modality gaps in contrastive dual-encoders, we introduce the Inter-Intra Modal Measure (IIMM) - a predictive metric that quantifies the relationship between intra-modal image embedding similarity and inter-modal misalignment. Through extensive empirical analysis across four state-of-the-art vision-language models and five fine-tuning techniques, we establish a strong linear relationship: tasks with higher IIMM scores yield greater in-domain performance improvements but suffer from more pronounced out-of-domain degradation, with some parameter-efficient fine-tuning (PEFT) methods exhibiting severe forgetting. Compared to existing transferability measures, the IIMM demonstrates significantly stronger predictive power for accuracy changes post fine-tuning in dual-encoder models. Moreover, we provide a theoretical bound, proving that changes in IIMM are limited by the Wasserstein distance between pre- and post-fine-tuning embedding distributions, ensuring its stability and robustness as a predictive measure. With only a single forward pass of the target data, practitioners can leverage this key insight to evaluate the degree to which a model can be expected to improve following fine-tuning. When combined with prior knowledge of a model’s performance across diverse tasks, the IIMM further enhances transferability predictions for novel tasks, offering a lightweight yet effective tool for guiding model adaptation strategies. Our code is provided at https://github.com/mit-ll/IIMM.
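The two ingredients of the IIMM, intra-modal image similarity and inter-modal misalignment, can each be computed in a few lines; the combination below is a sketch, not the paper's exact formula:

```python
import torch
import torch.nn.functional as F

def iimm_sketch(img_emb, txt_emb):
    """Combine mean pairwise intra-modal image similarity with inter-modal
    misalignment to the nearest text embedding -- an illustrative proxy.

    img_emb: (N, D) image embeddings; txt_emb: (C, D) class-text embeddings.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ img.t()
    n = img.size(0)
    intra = (sim.sum() - sim.diagonal().sum()) / (n * (n - 1))  # mean off-diagonal
    inter = 1 - (img @ txt.t()).max(dim=-1).values.mean()       # misalignment
    return (intra + inter) / 2
```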
[431] CA-W3D: Leveraging Context-Aware Knowledge for Weakly Supervised Monocular 3D Detection
Chupeng Liu, Runkai Zhao, Weidong Cai
Main category: cs.CV
TL;DR: CA-W3D introduces a two-stage training paradigm for weakly supervised monocular 3D detection, leveraging contextual awareness to improve performance.
Details
Motivation: Weakly supervised methods often miss global context, limiting 3D reasoning. CA-W3D addresses this by incorporating contextual semantic relationships.
Method: Two-stage training: (1) ROCM pre-training aligns object embeddings for contextual knowledge, (2) D2OD pseudo-label training transfers contextual priors efficiently.
Result: Outperforms SoTA on KITTI benchmark, demonstrating the value of contextual-aware knowledge.
Conclusion: CA-W3D effectively enhances weakly supervised 3D detection by integrating contextual relationships, achieving superior performance.
Abstract: Weakly supervised monocular 3D detection, while less annotation-intensive, often struggles to capture the global context required for reliable 3D reasoning. Conventional label-efficient methods focus on object-centric features, neglecting contextual semantic relationships that are critical in complex scenes. In this work, we propose Context-Aware Weak Supervision for Monocular 3D object detection, namely CA-W3D, to address this limitation in a two-stage training paradigm. Specifically, we first introduce a pre-training stage employing Region-wise Object Contrastive Matching (ROCM), which aligns regional object embeddings derived from a trainable monocular 3D encoder and a frozen open-vocabulary 2D visual grounding model. This alignment encourages the monocular encoder to discriminate scene-specific attributes and acquire richer contextual knowledge. In the second stage, we incorporate a pseudo-label training process with a Dual-to-One Distillation (D2OD) mechanism, which effectively transfers contextual priors into the monocular encoder while preserving spatial fidelity and maintaining computational efficiency during inference. Extensive experiments conducted on the public KITTI benchmark demonstrate the effectiveness of our approach, surpassing the SoTA method across all metrics and highlighting the importance of context-aware knowledge in weakly supervised monocular 3D detection.
[432] Attack Anything: Blind DNNs via Universal Background Adversarial Attack
Jiawei Lian, Shaohui Mei, Xiaofei Wang, Yi Wang, Lefan Wang, Yingjie Lu, Mingyang Ma, Lap-Pui Chau
Main category: cs.CV
TL;DR: The paper introduces a background adversarial attack framework that targets DNNs without altering the main objects, demonstrating its effectiveness across diverse scenarios.
Details
Motivation: Existing adversarial attacks focus on corrupting objects or images, but this work explores background perturbations, revealing their underestimated impact on DNN robustness.
Method: The attack is formulated as an iterative optimization problem, with a new ensemble strategy and smooth constraints to enhance efficacy and transferability. Theoretical convergence is also demonstrated.
Result: Experiments in digital and physical domains show the method’s effectiveness across various objects, models, and tasks, highlighting the critical role of background variations.
Conclusion: The study underscores the discrepancy between human and machine vision regarding background importance, urging a reevaluation of DNN robustness.
Abstract: It has been widely substantiated that deep neural networks (DNNs) are susceptible to adversarial perturbations. Existing studies mainly focus on performing attacks by corrupting targeted objects (physical attack) or images (digital attack), which is intuitively acceptable and understandable in terms of the attack's effectiveness. In contrast, our focus lies in conducting background adversarial attacks in both digital and physical domains, without causing any disruption to the targeted objects themselves. Specifically, we propose an effective background adversarial attack framework that can attack anything, in which the attack efficacy generalizes well across diverse objects, models, and tasks. Technically, we approach the background adversarial attack as an iterative optimization problem, analogous to the process of DNN learning. In addition, we offer a theoretical demonstration of its convergence under a set of mild but sufficient conditions. To strengthen the attack efficacy and transferability, we propose a new ensemble strategy tailored for adversarial perturbations and introduce an improved smooth constraint for the seamless connection of integrated perturbations. We conduct comprehensive and rigorous experiments in both digital and physical domains across various objects, models, and tasks, demonstrating the effectiveness of the proposed method in attacking anything. The findings substantiate the significant discrepancy between human and machine vision regarding the value of background variations, which play a far more critical role than previously recognized, necessitating a reevaluation of the robustness and reliability of DNNs. The code will be publicly available at https://github.com/JiaweiLian/Attack_Anything
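Casting the background attack as iterative optimization amounts, in its simplest form, to PGD whose perturbation is confined to a background mask; the sketch below omits the paper's ensemble strategy and smoothness constraint:

```python
import torch

def background_pgd(model, x, y, bg_mask, eps=8/255, alpha=2/255, steps=10):
    """PGD restricted to background pixels.

    x: (1, 3, H, W) input image in [0, 1]; y: (1,) true label.
    bg_mask: (1, H, W) binary mask, 1 on background, 0 on the target object.
    """
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = torch.nn.functional.cross_entropy(model(x + delta * bg_mask), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascend the loss
            delta.clamp_(-eps, eps)              # keep perturbation imperceptible
            delta.grad.zero_()
    return (x + delta.detach() * bg_mask).clamp(0, 1)
```

Because the mask zeroes the perturbation over the object, the target itself is left pristine throughout the attack.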
[433] Associate Everything Detected: Facilitating Tracking-by-Detection to the Unknown
Zimeng Fang, Chao Liang, Xue Zhou, Shuyuan Zhu, Xi Li
Main category: cs.CV
TL;DR: AED is a unified framework for multi-object tracking (MOT) that handles both closed-vocabulary (CV-MOT) and open-vocabulary (OV-MOT) tasks without prior knowledge, using robust feature learning and a similarity decoder.
Details
Motivation: Existing CV-MOT and OV-MOT methods struggle in each other's tasks, highlighting the need for a unified approach.
Method: AED integrates with any detector, models association as a similarity decoding problem, and uses a sim-decoder with spatial, temporal, and cross-clip similarities.
Result: AED outperforms existing methods on TAO, SportsMOT, and DanceTrack benchmarks.
Conclusion: AED provides a versatile and high-performing solution for MOT tasks, generalizing well to unknown categories.
Abstract: Multi-object tracking (MOT) is a pivotal and highly promising branch in the field of computer vision. Classical closed-vocabulary MOT (CV-MOT) methods aim to track objects of predefined categories. Recently, some open-vocabulary MOT (OV-MOT) methods have successfully addressed the problem of tracking unknown categories. However, we found that the CV-MOT and OV-MOT methods each struggle to excel in the tasks of the other. In this paper, we present a unified framework, Associate Everything Detected (AED), that simultaneously tackles CV-MOT and OV-MOT by integrating with any off-the-shelf detector and supports unknown categories. Unlike existing tracking-by-detection MOT methods, AED dispenses with prior knowledge (e.g., motion cues) and relies solely on highly robust feature learning to handle complex trajectories in OV-MOT tasks while maintaining excellent performance in CV-MOT tasks. Specifically, we model the association task as a similarity decoding problem and propose a sim-decoder with an association-centric learning mechanism. The sim-decoder calculates similarities in three aspects: spatial, temporal, and cross-clip. Subsequently, association-centric learning leverages these threefold similarities to ensure that the extracted features are appropriate for continuous tracking and robust enough to generalize to unknown categories. Compared with existing powerful OV-MOT and CV-MOT methods, AED achieves superior performance on TAO, SportsMOT, and DanceTrack without any prior knowledge. Our code is available at https://github.com/balabooooo/AED.
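The baseline step that AED's learned sim-decoder replaces is appearance-only association: cosine similarity between track and detection embeddings resolved with Hungarian matching. A sketch of that standard step (the threshold and array shapes are assumptions):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_feats, det_feats, sim_thresh=0.5):
    """Match existing tracks to new detections by appearance similarity.

    track_feats: (T, D) track embeddings; det_feats: (N, D) detection embeddings.
    Returns a list of (track_idx, det_idx) matches.
    """
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    sim = t @ d.T                                 # (T, N) cosine similarities
    rows, cols = linear_sum_assignment(-sim)      # negate to maximize similarity
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= sim_thresh]
```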
[434] OmniPose6D: Towards Short-Term Object Pose Tracking in Dynamic Scenes from Monocular RGB
Yunzhi Lin, Yipu Zhao, Fu-Jen Chu, Xingyu Chen, Weiyao Wang, Hao Tang, Patricio A. Vela, Matt Feiszli, Kevin Liang
Main category: cs.CV
TL;DR: A synthetic dataset (OmniPose6D) and benchmarking framework are introduced for short-term object pose tracking in dynamic environments using monocular RGB input. An uncertainty-aware keypoint refinement network improves pose estimation, outperforming baselines on real datasets.
Details
Motivation: The challenge of tracking object poses in dynamic environments with monocular RGB input lacks adequate datasets and benchmarking tools, limiting progress in the field.
Method: A synthetic dataset (OmniPose6D) mirrors real-world diversity. A pipeline with an uncertainty-aware keypoint refinement network uses probabilistic modeling to refine pose estimates.
Result: The approach outperforms existing baselines on real datasets, demonstrating improved tracking precision.
Conclusion: The contributions advance object pose tracking in complex scenes, setting a new standard for methodology development and assessment.
Abstract: To address the challenge of short-term object pose tracking in dynamic environments with monocular RGB input, we introduce a large-scale synthetic dataset OmniPose6D, crafted to mirror the diversity of real-world conditions. We additionally present a benchmarking framework for a comprehensive comparison of pose tracking algorithms. We propose a pipeline featuring an uncertainty-aware keypoint refinement network, employing probabilistic modeling to refine pose estimation. Comparative evaluations demonstrate that our approach achieves performance superior to existing baselines on real datasets, underscoring the effectiveness of our synthetic dataset and refinement technique in enhancing tracking precision in dynamic contexts. Our contributions set a new precedent for the development and assessment of object pose tracking methodologies in complex scenes.
[435] Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation
Clive Tinashe Marimo, Benedikt Blumenstiel, Maximilian Nitsche, Johannes Jakubik, Thomas Brunschwiler
Main category: cs.CV
TL;DR: Llama3-MS-CLIP is a vision-language model pre-trained on multispectral data, outperforming RGB-based models in classification and retrieval tasks.
Details
Motivation: Existing vision-language models for Earth observation ignore multispectral data, limiting their potential.
Method: Pre-trained with contrastive learning on a large-scale multispectral dataset, validated by domain experts.
Result: Improves classification accuracy by +6.77% and retrieval performance by +4.63% mAP over RGB-based models.
Conclusion: Multispectral vision-language learning is highly relevant, and the model, dataset, and code are publicly available.
Abstract: Vision-language models for Earth observation (EO) typically rely on the visual spectrum of data as the only model input, thus failing to leverage the rich spectral information available in the multispectral channels recorded by satellites. Therefore, we introduce Llama3-MS-CLIP, the first vision-language model pre-trained with contrastive learning on a large-scale multispectral dataset and report on the performance gains due to the extended spectral range. Furthermore, we present the largest-to-date image-caption dataset for multispectral data, consisting of one million Sentinel-2 samples and corresponding textual descriptions generated using Llama3-LLaVA-Next and Overture Maps data. We develop a scalable captioning pipeline, which is validated by domain experts. We evaluate Llama3-MS-CLIP on multispectral zero-shot image classification and retrieval using three datasets of varying complexity. Our results demonstrate that Llama3-MS-CLIP significantly outperforms other RGB-based approaches, improving classification accuracy by +6.77% on average and retrieval performance by +4.63% mAP compared to the second-best model. Our results emphasize the relevance of multispectral vision-language learning. The image-caption dataset, code, and model weights are available at https://github.com/IBM/MS-CLIP.
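The evaluation protocol is standard CLIP-style zero-shot classification: encode class names as text prompts and pick the nearest text embedding for each image embedding. In this sketch, `text_encoder` is a placeholder for the model's text tower and the prompt template is an assumption:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_names, text_encoder):
    """CLIP-style zero-shot classification over multispectral image embeddings.

    image_emb: (N, D) image embeddings; class_names: list of C class strings.
    """
    prompts = [f"a satellite image of {c}" for c in class_names]
    with torch.no_grad():
        txt = F.normalize(text_encoder(prompts), dim=-1)  # (C, D)
        img = F.normalize(image_emb, dim=-1)              # (N, D)
    logits = img @ txt.t()                                # cosine similarities
    return logits.argmax(dim=-1)                          # predicted class ids
```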
[436] Evaluating the evaluators: Towards human-aligned metrics for missing markers reconstruction
Taras Kucherenko, Derek Peristy, Judith Bütepage
Main category: cs.CV
TL;DR: The paper critiques the use of mean square error for missing marker reconstruction in motion capture, proposing better-correlated metrics for subjective quality.
Details
Motivation: Manual cleaning of missing markers in motion capture is time-consuming, prompting interest in machine learning solutions. Current metrics like mean square error fail to align with subjective quality perception.
Method: The paper introduces and evaluates a new set of metrics for missing marker reconstruction.
Result: The proposed metrics better correlate with subjective perception of fill quality compared to mean square error.
Conclusion: Improved metrics can drive progress in the field by aligning better with human perception of reconstruction quality.
Abstract: Animation data is often obtained through optical motion capture systems, which utilize a multitude of cameras to establish the position of optical markers. However, system errors or occlusions can result in missing markers, the manual cleaning of which can be time-consuming. This has sparked interest in machine learning-based solutions for missing marker reconstruction in the academic community. Most academic papers utilize a simplistic mean square error as the main metric. In this paper, we show that this metric does not correlate with subjective perception of the fill quality. Additionally, we introduce and evaluate a set of better-correlated metrics that can drive progress in the field.
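The abstract does not spell out the proposed metrics, so the sketch below only illustrates the kind of motion-aware alternatives to positional MSE (velocity and acceleration error) that tend to track perceived fill quality better; these are illustrative assumptions, not the paper's metrics:

```python
import numpy as np

def motion_aware_errors(pred, gt, fps=120):
    """Positional, velocity, and acceleration errors for marker trajectories.

    pred, gt: (T, M, 3) reconstructed and ground-truth marker positions.
    """
    pos_err = np.linalg.norm(pred - gt, axis=-1).mean()
    vel_p, vel_g = np.diff(pred, axis=0) * fps, np.diff(gt, axis=0) * fps
    vel_err = np.linalg.norm(vel_p - vel_g, axis=-1).mean()
    acc_p, acc_g = np.diff(vel_p, axis=0) * fps, np.diff(vel_g, axis=0) * fps
    acc_err = np.linalg.norm(acc_p - acc_g, axis=-1).mean()
    return {"pos": pos_err, "vel": vel_err, "acc": acc_err}
```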
[437] PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning
Yan Zhang, Yao Feng, Alpár Cseke, Nitin Saini, Nathan Bajandas, Nicolas Heron, Michael J. Black
Main category: cs.CV
TL;DR: PRIMAL is a generative motion model for interactive avatars, offering perpetual, realistic, controllable, and responsive movements through a two-stage learning paradigm.
Details
Motivation: Existing human motion generation methods lack responsiveness and realism, prompting the need for a model that mimics real human movements more accurately.
Method: PRIMAL uses a two-stage approach: unsupervised pretraining on sub-second motion segments and fine-tuning with a ControlNet-like adaptor for tasks like personalized action generation.
Result: The model generates unbounded, realistic, and controllable motion, outperforming state-of-the-art baselines, and is integrated into a real-time animation system.
Conclusion: PRIMAL advances motion generation by combining unsupervised learning with efficient adaptation, enabling highly responsive and natural avatar animations.
Abstract: We formulate the motor system of an interactive avatar as a generative motion model that can drive the body to move through 3D space in a perpetual, realistic, controllable, and responsive manner. Although human motion generation has been extensively studied, many existing methods lack the responsiveness and realism of real human movements. Inspired by recent advances in foundation models, we propose PRIMAL, which is learned with a two-stage paradigm. In the pretraining stage, the model learns body movements from a large number of sub-second motion segments, providing a generative foundation from which more complex motions are built. This training is fully unsupervised without annotations. Given a single-frame initial state during inference, the pretrained model not only generates unbounded, realistic, and controllable motion, but also enables the avatar to be responsive to induced impulses in real time. In the adaptation phase, we employ a novel ControlNet-like adaptor to fine-tune the base model efficiently, adapting it to new tasks such as few-shot personalized action generation and spatial target reaching. Evaluations show that our proposed method outperforms state-of-the-art baselines. We leverage the model to create a real-time character animation system in Unreal Engine that feels highly responsive and natural. Code, models, and more results are available at: https://yz-cnsdqz.github.io/eigenmotion/PRIMAL
[438] PLGS: Robust Panoptic Lifting with 3D Gaussian Splatting
Yu Wang, Xiaobao Wei, Ming Lu, Guoliang Kang
Main category: cs.CV
TL;DR: PLGS enhances 3D Gaussian Splatting (3DGS) for panoptic segmentation by introducing smoothness and noise reduction, outperforming NeRF-based methods in speed and quality.
Details
Motivation: Address the limitations of NeRF and conventional 3DGS in panoptic lifting, particularly slow training/rendering (NeRF) and susceptibility to noisy 2D masks (3DGS).
Method: Proposes PLGS: a panoptic-aware structured 3D Gaussian model with semantic anchor points for initialization, smooth regularization, self-training with pseudo labels, and instance mask projection for cross-view consistency.
Result: Outperforms state-of-the-art methods in segmentation quality and speed on various benchmarks.
Conclusion: PLGS successfully combines efficiency and robustness in panoptic segmentation, leveraging 3DGS while mitigating its noise sensitivity.
Abstract: Previous methods utilize the Neural Radiance Field (NeRF) for panoptic lifting, but their training and rendering speeds are unsatisfactory. In contrast, 3D Gaussian Splatting (3DGS) has emerged as a prominent technique due to its rapid training and rendering speed. However, unlike NeRF, the conventional 3DGS may not satisfy the basic smoothness assumption, as it does not rely on any parameterized structures to render (e.g., MLPs). Consequently, the conventional 3DGS is by nature more susceptible to noisy 2D mask supervision. In this paper, we propose a new method called PLGS that enables 3DGS to generate consistent panoptic segmentation masks from noisy 2D segmentation masks while maintaining superior efficiency compared to NeRF-based methods. Specifically, we build a panoptic-aware structured 3D Gaussian model to introduce smoothness and design effective noise reduction strategies. For the semantic field, instead of initialization with structure from motion, we construct reliable semantic anchor points to initialize the 3D Gaussians. We then use these anchor points as smooth regularization during training. Additionally, we present a self-training approach using pseudo labels generated by merging the rendered masks with the noisy masks to enhance the robustness of PLGS. For the instance field, we project the 2D instance masks into 3D space and match them with oriented bounding boxes to generate cross-view consistent instance masks for supervision. Experiments on various benchmarks demonstrate that our method outperforms previous state-of-the-art methods in terms of both segmentation quality and speed.
[439] GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts
Junwen He, Yifan Wang, Lijun Wang, Huchuan Lu, Jun-Yan He, Chenyang Li, Hanyuan Chen, Jin-Peng Lan, Bin Luo, Yifeng Geng
Main category: cs.CV
TL;DR: A VLM-based framework for text logo layout generation integrates multi-modal inputs and user constraints, outperforming existing methods with new datasets and efficient techniques.
Details
Motivation: Text logo layout design lacks focused research despite its importance, overshadowed by broader layout tasks.
Method: Proposes a VLM-based framework with multi-modal inputs, user constraints, and efficient model techniques for processing glyph images. Introduces large datasets with layout descriptions.
Result: Outperforms benchmarks in geometric aesthetics and human preferences.
Conclusion: The framework and datasets enhance text logo layout generation, offering flexibility and robustness for real-world applications.
Abstract: Text logo design heavily relies on the creativity and expertise of professional designers, in which arranging element layouts is one of the most important procedures. However, this specific task has received limited attention, often overshadowed by broader layout generation tasks such as document or poster design. In this paper, we propose a Vision-Language Model (VLM)-based framework that generates content-aware text logo layouts by integrating multi-modal inputs with user-defined constraints, enabling more flexible and robust layout generation for real-world applications. We introduce two model techniques that reduce the computational cost for processing multiple glyph images simultaneously, without compromising performance. To support instruction tuning of our model, we construct two extensive text logo datasets that are five times larger than existing public datasets. In addition to geometric annotations (\textit{e.g.}, text masks and character recognition), our datasets include detailed layout descriptions in natural language, enabling the model to reason more effectively in handling complex designs and custom user inputs. Experimental results demonstrate the effectiveness of our proposed framework and datasets, outperforming existing methods on various benchmarks that assess geometric aesthetics and human preferences.
[440] Pairwise Matching of Intermediate Representations for Fine-grained Explainability
Lauren Shrack, Timm Haucke, Antoine Salaün, Arjun Subramonian, Sara Beery
Main category: cs.CV
TL;DR: PAIR-X is a new explainability method for deep learning models, focusing on fine-grained, localized visual explanations for subtle differences in images. It outperforms baselines on 35 re-ID datasets and is validated by experts.
Details
Motivation: Existing explainability techniques are too diffuse for fine-grained categories, making interpretations difficult.
Method: PAIR-X combines intermediate model activations and backpropagated relevance scores to generate pairwise visual explanations.
Result: PAIR-X provides qualitatively improved explanations, validated by experts, and a novel metric shows its plausibility for correct matches.
Conclusion: PAIR-X enhances interpretability, helping humans distinguish correct and incorrect matches effectively.
Abstract: The differences between images belonging to fine-grained categories are often subtle and highly localized, and existing explainability techniques for deep learning models are often too diffuse to provide useful and interpretable explanations. We propose a new explainability method (PAIR-X) that leverages both intermediate model activations and backpropagated relevance scores to generate fine-grained, highly-localized pairwise visual explanations. We use animal and building re-identification (re-ID) as a primary case study of our method, and we demonstrate qualitatively improved results over a diverse set of explainability baselines on 35 public re-ID datasets. In interviews, animal re-ID experts found PAIR-X to be a meaningful improvement over existing baselines for deep model explainability, and suggested that its visualizations would be directly applicable to their work. We also propose a novel quantitative evaluation metric for our method, and demonstrate that PAIR-X visualizations appear more plausible for correct image matches than incorrect ones even when the model similarity score for the pairs is the same. By improving interpretability, PAIR-X enables humans to better distinguish correct and incorrect matches. Our code is available at: https://github.com/pairx-explains/pairx
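To make the core idea concrete, here is a toy sketch of relevance-weighted pairwise matching between two intermediate feature maps; the multiplicative weighting scheme is our assumption, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def pairwise_match_map(feat_a, feat_b, rel_a, rel_b):
    """Toy relevance-weighted pairwise matching (illustrative only).

    feat_a, feat_b: (C, H, W) intermediate activations for the two images.
    rel_a, rel_b:   (H, W) backpropagated relevance scores.
    Returns an (H*W, H*W) similarity matrix between spatial locations,
    down-weighted wherever relevance is low, so only salient regions match.
    """
    a = F.normalize(feat_a.flatten(1).T, dim=1)   # (H*W, C), unit-norm rows
    b = F.normalize(feat_b.flatten(1).T, dim=1)   # (H*W, C)
    sim = a @ b.T                                 # cosine similarities
    w = rel_a.flatten()[:, None] * rel_b.flatten()[None, :]
    return sim * w
```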
[441] Multimodal 3D Reasoning Segmentation with Complex Scenes
Xueying Jiang, Lewei Lu, Ling Shao, Shijian Lu
Main category: cs.CV
TL;DR: The paper introduces a 3D reasoning segmentation task and a benchmark (ReasonSeg3D) to address challenges in multimodal learning for 3D scene understanding. It also proposes MORE3D, a network for reasoning and segmenting multi-object 3D scenes.
Details
Motivation: Existing studies lack reasoning ability for human intentions and focus on simple scenarios, neglecting multi-object scenes with complex spatial relations.
Method: Proposes a 3D reasoning segmentation task, creates the ReasonSeg3D benchmark, and designs the MORE3D network for multi-object queries and 3D spatial reasoning.
Result: MORE3D excels in reasoning and segmenting complex multi-object 3D scenes, and ReasonSeg3D serves as a valuable benchmark.
Conclusion: The work advances 3D scene understanding by addressing key challenges and provides a platform for future research.
Abstract: The recent development in multimodal learning has greatly advanced the research in 3D scene understanding in various real-world tasks such as embodied AI. However, most existing studies face two common challenges: 1) they lack the reasoning ability needed to interact with and interpret human intentions, and 2) they focus on scenarios with single-category objects and over-simplified textual descriptions and neglect multi-object scenarios with complicated spatial relations among objects. We address the above challenges by proposing a 3D reasoning segmentation task for reasoning segmentation with multiple objects in scenes. The task produces 3D segmentation masks and detailed textual explanations enriched by 3D spatial relations among objects. To this end, we create ReasonSeg3D, a large-scale and high-quality benchmark that integrates 3D segmentation masks and 3D spatial relations with generated question-answer pairs. In addition, we design MORE3D, a novel 3D reasoning network that works with queries of multiple objects and is tailored for 3D scene understanding. MORE3D learns detailed explanations of 3D relations and employs them to capture spatial information of objects and reason about textual outputs. Extensive experiments show that MORE3D excels in reasoning and segmenting complex multi-object 3D scenes. In addition, the created ReasonSeg3D offers a valuable platform for future exploration of 3D reasoning segmentation. The data and code will be released.
[442] Adaptive Hyper-Graph Convolution Network for Skeleton-based Human Action Recognition with Virtual Connections
Youwei Zhou, Tianyang Xu, Cong Wu, Xiaojun Wu, Josef Kittler
Main category: cs.CV
TL;DR: The paper proposes an adaptive Hyper-GCN for action recognition, optimizing hyper-graphs during training to capture multi-vertex relationships and incorporating virtual connections for richer semantic aggregation.
Details
Motivation: Existing GCNs for action recognition overlook multi-vertex convolution structures and rely on fixed hyper-graph constructions, limiting their ability to uncover latent relationships.
Method: An adaptive Hyper-GCN is introduced, dynamically optimizing hyper-graphs and integrating virtual connections to enhance semantic information aggregation.
Result: Experiments on NTU-60, NTU-120, and NW-UCLA datasets show superior performance compared to state-of-the-art methods.
Conclusion: The adaptive Hyper-GCN effectively captures intricate skeleton relationships, improving action recognition accuracy.
Abstract: The shared topology of human skeletons motivated the recent investigation of graph convolutional network (GCN) solutions for action recognition. However, most of the existing GCNs rely on the binary connection of two neighboring vertices (joints) formed by an edge (bone), overlooking the potential of constructing multi-vertex convolution structures. Although some studies have attempted to utilize hyper-graphs to represent the topology, they rely on a fixed construction strategy, which limits their adaptivity in uncovering the intricate latent relationships within the action. In this paper, we address this oversight and explore the merits of an adaptive hyper-graph convolutional network (Hyper-GCN) to achieve the aggregation of rich semantic information conveyed by skeleton vertices. In particular, our Hyper-GCN adaptively optimises the hyper-graphs during training, revealing the action-driven multi-vertex relations. Besides, virtual connections are often designed to support efficient feature aggregation, implicitly extending the spectrum of dependencies within the skeleton. By injecting virtual connections into hyper-graphs, the semantic clues of diverse action categories can be highlighted. The results of experiments conducted on the NTU-60, NTU-120, and NW-UCLA datasets demonstrate the merits of our Hyper-GCN, compared to the state-of-the-art methods. The code is available at https://github.com/6UOOON9/Hyper-GCN.
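The adaptive hyper-graph can be pictured as a learnable incidence matrix between joints and hyperedges. A minimal PyTorch sketch under that reading, with joint and edge counts chosen arbitrarily:

```python
import torch
import torch.nn as nn

class AdaptiveHyperGraphConv(nn.Module):
    """Minimal adaptive hypergraph convolution (illustrative, not the authors' code).

    The incidence matrix (joints x hyperedges) is a learnable parameter, so
    multi-vertex relations are optimized during training instead of being fixed.
    """
    def __init__(self, in_dim, out_dim, num_joints=25, num_edges=8):
        super().__init__()
        self.incidence = nn.Parameter(torch.randn(num_joints, num_edges) * 0.01)
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                          # x: (batch, joints, in_dim)
        H = torch.softmax(self.incidence, dim=0)   # soft vertex-to-edge assignment
        edge_feat = H.T @ x                        # aggregate vertices into hyperedges
        x = H @ edge_feat                          # scatter hyperedge features back
        return self.proj(x)
```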
[443] TKG-DM: Training-free Chroma Key Content Generation Diffusion Model
Ryugo Morita, Stanislav Frolov, Brian Bernhard Moser, Takahiro Shirakawa, Ko Watanabe, Andreas Dengel, Jinjia Zhou
Main category: cs.CV
TL;DR: A training-free method (TKG-DM) manipulates initial noise in diffusion models to generate images with foreground objects on chroma key backgrounds, outperforming fine-tuned models.
Details
Motivation: Existing text-to-image models struggle with chroma key backgrounds, limiting foreground-background separation without fine-tuning.
Method: TKG-DM optimizes initial random noise to produce images with specifiable color backgrounds, enabling precise separation without training.
Result: The method outperforms existing approaches in evaluations and extends to tasks like consistency models and text-to-video.
Conclusion: TKG-DM offers transformative potential for generative applications requiring independent foreground-background control.
Abstract: Diffusion models have enabled the generation of high-quality images with a strong focus on realism and textual fidelity. Yet, large-scale text-to-image models, such as Stable Diffusion, struggle to generate images where foreground objects are placed over a chroma key background, limiting their ability to separate foreground and background elements without fine-tuning. To address this limitation, we present a novel Training-Free Chroma Key Content Generation Diffusion Model (TKG-DM), which optimizes the initial random noise to produce images with foreground objects on a specifiable color background. Our proposed method is the first to explore the manipulation of the color aspects in initial noise for controlled background generation, enabling precise separation of foreground and background without fine-tuning. Extensive experiments demonstrate that our training-free method outperforms existing methods in both qualitative and quantitative evaluations, matching or surpassing fine-tuned models. Finally, we successfully extend it to other tasks (e.g., consistency models and text-to-video), highlighting its transformative potential across various generative applications where independent control of foreground and background is crucial.
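One plausible reading of the initial-noise manipulation is a channel-wise mean shift of the latent noise toward the target background color; the shift strength and the bg_color_latent input below are illustrative assumptions, not the paper's exact procedure:

```python
import torch

def chroma_shifted_noise(shape, bg_color_latent, shift=0.4, generator=None):
    """Sketch of a TKG-DM-style init-noise manipulation (our reading of the abstract).

    Shifts the per-channel mean of the initial latent noise toward a latent
    encoding of the desired background color, nudging a frozen diffusion
    model to paint that color as the background, with no fine-tuning.
    """
    noise = torch.randn(shape, generator=generator)        # (B, C, H, W)
    noise = noise - noise.mean(dim=(2, 3), keepdim=True)   # zero per-channel mean
    return noise + shift * bg_color_latent.view(1, -1, 1, 1)
```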
[444] $\textit{Revelio}$: Interpreting and leveraging semantic information in diffusion models
Dahye Kim, Xavier Thomas, Deepti Ghadiyaram
Main category: cs.CV
TL;DR: The paper investigates how visual semantic information is represented in diffusion models, using k-sparse autoencoders to uncover interpretable features and validating them via transfer learning.
Details
Motivation: To deepen the interpretability of diffusion models by analyzing their layers and denoising timesteps.
Method: Leverages k-sparse autoencoders (k-SAE) and transfer learning with lightweight classifiers on diffusion features.
Result: Demonstrates effectiveness of diffusion features for representation learning across 4 datasets, analyzing architecture, pre-training, and conditioning impacts.
Conclusion: The work advances interpretability of diffusion models, providing insights into their visual representation capabilities.
Abstract: We study $\textit{how}$ rich visual semantic information is represented within various layers and denoising timesteps of different diffusion architectures. We uncover monosemantic interpretable features by leveraging k-sparse autoencoders (k-SAE). We substantiate our mechanistic interpretations via transfer learning using light-weight classifiers on off-the-shelf diffusion models’ features. On $4$ datasets, we demonstrate the effectiveness of diffusion features for representation learning. We provide an in-depth analysis of how different diffusion architectures, pre-training datasets, and language model conditioning impacts visual representation granularity, inductive biases, and transfer learning capabilities. Our work is a critical step towards deepening interpretability of black-box diffusion models. Code and visualizations available at: https://github.com/revelio-diffusion/revelio
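The k-sparse autoencoder at the heart of the analysis is a standard construction; a self-contained sketch (hyperparameters are placeholders):

```python
import torch
import torch.nn as nn

class KSparseAutoencoder(nn.Module):
    """Plain k-sparse autoencoder of the kind used to probe diffusion features.

    Only the top-k latent units stay active per sample, which encourages
    sparse, monosemantic, interpretable features.
    """
    def __init__(self, d_in, d_hidden, k=32):
        super().__init__()
        self.k = k
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        z = torch.relu(self.enc(x))
        topk = torch.topk(z, self.k, dim=-1)
        mask = torch.zeros_like(z).scatter_(-1, topk.indices, 1.0)
        z = z * mask                      # keep only the k largest activations
        return self.dec(z), z             # reconstruction and sparse code
```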
[445] Sequential Gaussian Avatars with Hierarchical Motion Context
Wangze Xu, Yifan Zhan, Zhihang Zhong, Xiao Sun
Main category: cs.CV
TL;DR: SeqAvatar improves 3DGS-based human avatar rendering by using hierarchical motion context and spatio-temporal multi-scale sampling, outperforming existing methods in quality and speed.
Details
Motivation: Current SMPL-driven 3DGS avatars fail to capture fine appearance details due to complex pose-to-appearance mapping.
Method: Proposes SeqAvatar with coarse-to-fine motion conditions and spatio-temporal multi-scale sampling for robust avatar modeling.
Result: Outperforms 3DGS and NeRF-based models in rendering quality and speed.
Conclusion: SeqAvatar advances real-time, high-quality human avatar rendering with superior performance.
Abstract: The emergence of neural rendering has significantly advanced the rendering quality of 3D human avatars, with the recently popular 3DGS technique enabling real-time performance. However, SMPL-driven 3DGS human avatars still struggle to capture fine appearance details due to the complex mapping from pose to appearance during fitting. In this paper, we propose SeqAvatar, which excavates the explicit 3DGS representation to better model human avatars based on a hierarchical motion context. Specifically, we utilize coarse-to-fine motion conditions that incorporate both the overall human skeleton and fine-grained vertex motions for non-rigid deformation. To enhance the robustness of the proposed motion conditions, we adopt a spatio-temporal multi-scale sampling strategy to hierarchically integrate more motion clues to model human avatars. Extensive experiments demonstrate that our method significantly outperforms 3DGS-based approaches and renders human avatars orders of magnitude faster than the latest NeRF-based models that incorporate temporal context, all while delivering performance that is at least comparable or even superior. Project page: https://zezeaaa.github.io/projects/SeqAvatar/
[446] Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction
Jixuan Fan, Wanhua Li, Yifei Han, Tianru Dai, Yansong Tang
Main category: cs.CV
TL;DR: Momentum-GS improves 3D Gaussian Splatting by using momentum-based self-distillation and block weighting to enhance consistency and accuracy, reducing GPU dependency and outperforming existing methods.
Details
Motivation: High memory and storage costs in 3D Gaussian Splatting, along with accuracy loss in parallelized block-wise training, motivate the need for a solution that maintains data diversity and decouples block division from GPU count.
Method: Proposes Momentum-GS, leveraging momentum-based self-distillation (using a teacher Gaussian decoder) and dynamic block weighting to ensure global guidance and spatial consistency.
Result: Achieves a 12.8% improvement in LPIPS over CityGaussian with fewer blocks, setting a new state of the art.
Conclusion: Momentum-GS effectively addresses parallel training challenges, improving reconstruction accuracy and efficiency in large-scale scenes.
Abstract: 3D Gaussian Splatting has demonstrated notable success in large-scale scene reconstruction, but challenges persist due to high training memory consumption and storage overhead. Hybrid representations that integrate implicit and explicit features offer a way to mitigate these limitations. However, when applied in parallelized block-wise training, two critical issues arise: reconstruction accuracy deteriorates due to reduced data diversity when each block is trained independently, and parallel training restricts the number of divided blocks to the number of available GPUs. To address these issues, we propose Momentum-GS, a novel approach that leverages momentum-based self-distillation to promote consistency and accuracy across the blocks while decoupling the number of blocks from the physical GPU count. Our method maintains a teacher Gaussian decoder updated with momentum, ensuring a stable reference during training. This teacher provides each block with global guidance in a self-distillation manner, promoting spatial consistency in reconstruction. To further ensure consistency across the blocks, we incorporate block weighting, dynamically adjusting each block’s weight according to its reconstruction accuracy. Extensive experiments on large-scale scenes show that our method consistently outperforms existing techniques, achieving a 12.8% improvement in LPIPS over CityGaussian with much fewer divided blocks and establishing a new state of the art. Project page: https://jixuan-fan.github.io/Momentum-GS_Page/
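The momentum teacher follows the usual exponential-moving-average recipe; a generic sketch of the update (the distillation loss itself is omitted, and the momentum value is a typical default rather than the paper's):

```python
import torch

@torch.no_grad()
def momentum_update(teacher, student, m=0.999):
    """EMA teacher update used for momentum self-distillation (generic recipe).

    The teacher decoder tracks an exponential moving average of the per-block
    student decoders, giving every block a stable, global reference signal.
    """
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1.0 - m)
```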
[447] IV-tuning: Parameter-Efficient Transfer Learning for Infrared-Visible Tasks
Yaming Zhang, Chenqiang Gao, Fangcen Liu, Junjie Guo, Lan Wang, Xinggan Peng, Deyu Meng
Main category: cs.CV
TL;DR: IV-tuning is a parameter-efficient method for IR-VIS tasks, freezing PVM parameters to avoid overfitting and enhance complementary learning.
Details
Motivation: Existing IR-VIS methods under full fine-tuning constrain the feature space, impairing generalization. Freezing parameters preserves pre-trained knowledge.
Method: Proposes IV-tuning, freezing backbone parameters and using less than 3% for fine-tuning, applied to tasks like object detection and segmentation.
Result: IV-tuning outperforms full fine-tuning and existing methods, improving complementary learning and reducing overfitting.
Conclusion: IV-tuning efficiently leverages PVMs for IR-VIS tasks, achieving better performance with minimal parameter updates.
Abstract: Existing infrared and visible (IR-VIS) methods inherit the general representations of Pre-trained Visual Models (PVMs) to facilitate complementary learning. However, our analysis indicates that under the full fine-tuning paradigm, the feature space becomes highly constrained and low-ranked, which has been proven to seriously impair generalization. One solution is freezing parameters to preserve pre-trained knowledge and thus maintain diversity of the feature space. To this end, we propose IV-tuning to parameter-efficiently harness PVMs for various IR-VIS downstream tasks, including salient object detection, semantic segmentation, and object detection. Compared with the full fine-tuning baselines and existing IR-VIS methods, IV-tuning facilitates the learning of complementary information between infrared and visible modalities with less than 3% of the backbone parameters, and effectively alleviates the overfitting problem. The code is available at https://github.com/Yummy198913/IV-tuning.
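The abstract does not spell out the tuning modules, so the sketch below shows one generic way to freeze a backbone and attach small bottleneck adapters, assuming a timm-style ViT with a .blocks list; wiring the adapters into the forward pass is model-specific and omitted:

```python
import torch.nn as nn

def make_parameter_efficient(backbone, adapter_dim=16):
    """Illustrative PEFT setup in the spirit of IV-tuning (details are assumed).

    Freezes the pre-trained backbone and builds tiny bottleneck adapters so
    that only a small fraction of parameters needs to be trained.
    """
    for p in backbone.parameters():
        p.requires_grad = False              # preserve pre-trained knowledge

    adapters = nn.ModuleList()
    for blk in backbone.blocks:              # assumes a ViT-style block list
        dim = blk.norm1.normalized_shape[0]  # block width from its LayerNorm
        adapters.append(nn.Sequential(
            nn.Linear(dim, adapter_dim), nn.GELU(), nn.Linear(adapter_dim, dim)))
    return adapters                          # train these jointly with the task heads
```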
[448] GausSim: Foreseeing Reality by Gaussian Simulator for Elastic Objects
Yidi Shao, Mu Huang, Chen Change Loy, Bo Dai
Main category: cs.CV
TL;DR: GausSim is a neural network-based simulator for elastic objects using Gaussian kernels, combining continuum mechanics and hierarchical structures for efficient, realistic simulations.
Details
Motivation: To simulate real-world elastic object behaviors accurately without idealized assumptions, addressing computational efficiency and fidelity.
Method: Uses Gaussian kernels as Center of Mass Systems (CMS) with hierarchical organization and explicit physics constraints (mass/momentum conservation).
Result: Outperforms existing physics-driven baselines, validated by the READY dataset of real-world elastic deformations.
Conclusion: GausSim provides a practical, accurate solution for simulating complex dynamic behaviors, with code and models publicly available.
Abstract: We introduce GausSim, a novel neural network-based simulator designed to capture the dynamic behaviors of real-world elastic objects represented through Gaussian kernels. We leverage continuum mechanics and treat each kernel as a Center of Mass System (CMS) that represents a continuous piece of matter, accounting for realistic deformations without idealized assumptions. To improve computational efficiency and fidelity, we employ a hierarchical structure that further organizes kernels into CMSs with explicit formulations, enabling a coarse-to-fine simulation approach. This structure significantly reduces computational overhead while preserving detailed dynamics. In addition, GausSim incorporates explicit physics constraints, such as mass and momentum conservation, ensuring interpretable results and robust, physically plausible simulations. To validate our approach, we present a new dataset, READY, containing multi-view videos of real-world elastic deformations. Experimental results demonstrate that GausSim achieves superior performance compared to existing physics-driven baselines, offering a practical and accurate solution for simulating complex dynamic behaviors. Code and model are available at our project page: https://www.mmlab-ntu.com/project/gausim/index.html .
[449] Towards Modality Generalization: A Benchmark and Prospective Analysis
Xiaohao Liu, Xiaobo Xia, Zhuo Huang, See-Kiong Ng, Tat-Seng Chua
Main category: cs.CV
TL;DR: The paper introduces Modality Generalization (MG) to address the challenge of models generalizing to unseen modalities, proposing a benchmark and evaluating existing methods.
Details
Motivation: Current multi-modal learning struggles with unseen modalities due to resource and privacy constraints, motivating the need for MG.
Method: Defines Weak MG (mapped modalities) and Strong MG (no mappings), proposes a benchmark, and adapts existing methods for evaluation.
Result: Experiments reveal limitations of current methods and identify future research directions for MG.
Conclusion: The work lays a foundation for robust multi-modal models capable of handling unseen modalities in real-world scenarios.
Abstract: Multi-modal learning has achieved remarkable success by integrating information from various modalities, achieving superior performance in tasks like recognition and retrieval compared to uni-modal approaches. However, real-world scenarios often present novel modalities that are unseen during training due to resource and privacy constraints, a challenge current methods struggle to address. This paper introduces Modality Generalization (MG), which focuses on enabling models to generalize to unseen modalities. We define two cases: Weak MG, where both seen and unseen modalities can be mapped into a joint embedding space via existing perceptors, and Strong MG, where no such mappings exist. To facilitate progress, we propose a comprehensive benchmark featuring multi-modal algorithms and adapt existing methods that focus on generalization. Extensive experiments highlight the complexity of MG, exposing the limitations of existing methods and identifying key directions for future research. Our work provides a foundation for advancing robust and adaptable multi-modal models, enabling them to handle unseen modalities in realistic scenarios.
[450] Scendi Score: Prompt-Aware Diversity Evaluation via Schur Complement of CLIP Embeddings
Azim Ospanov, Mohammad Jalali, Farzan Farnia
Main category: cs.CV
TL;DR: The paper introduces Scendi score, a CLIP-based metric to measure prompt-aware diversity in text-to-image models, using Schur complement decomposition of kernel covariance matrices.
Details
Motivation: Existing metrics like CLIPScore measure text-image alignment but fail to quantify diversity in outputs from similar prompts.
Method: Proposes decomposing CLIP-based kernel covariance matrices into text and non-text components using Schur complement, defining Scendi score for diversity measurement.
Result: Scendi score effectively captures intrinsic diversity in prompt-guided generative models, validated through numerical results.
Conclusion: The Scendi score provides a novel way to assess diversity in generative models and can nullify prompt influence for focused analysis.
Abstract: The use of CLIP embeddings to assess the fidelity of samples produced by text-to-image generative models has been extensively explored in the literature. While the widely adopted CLIPScore, derived from the cosine similarity of text and image embeddings, effectively measures the alignment of a generated image, it does not quantify the diversity of images generated by a text-to-image model. In this work, we extend the application of CLIP embeddings to quantify and interpret the intrinsic diversity of text-to-image models, i.e., their ability to generate diverse images from similar text prompts, which we refer to as prompt-aware diversity. To achieve this, we propose a decomposition of the CLIP-based kernel covariance matrix of image data into text-based and non-text-based components. Using the Schur complement of the joint image-text kernel covariance matrix, we perform this decomposition and define the matrix-based entropy of the decomposed component as the Schur Complement ENtropy DIversity (Scendi) score, a measure of the prompt-aware diversity for prompt-guided generative models. Additionally, we discuss the application of the Schur complement-based decomposition to nullify the influence of a given prompt on the CLIP embedding of an image, enabling focus or defocus of the embedded vectors on specific objects. We present several numerical results that apply our proposed Scendi score to evaluate text-to-image and LLM (text-to-text) models. Our numerical results indicate the success of the Scendi score in capturing the intrinsic diversity of prompt-guided generative models. The codebase is available at https://github.com/aziksh-ospanov/scendi-score.
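The Schur-complement decomposition reduces to a few lines of linear algebra. A sketch under the assumption of paired, L2-normalized CLIP embeddings and a linear kernel (the paper's exact kernel choice may differ):

```python
import numpy as np

def scendi_entropy(img_emb, txt_emb, eps=1e-6):
    """Sketch of a Scendi-style computation (our reading of the abstract).

    img_emb, txt_emb: (N, d) paired, L2-normalized CLIP embeddings.
    Builds the joint covariance, removes the text-explained part of the image
    covariance via the Schur complement, and returns its matrix-based entropy.
    """
    n = len(img_emb)
    Cii = img_emb.T @ img_emb / n                     # image covariance (d, d)
    Ctt = txt_emb.T @ txt_emb / n                     # text covariance
    Cit = img_emb.T @ txt_emb / n                     # cross covariance
    schur = Cii - Cit @ np.linalg.inv(Ctt + eps * np.eye(len(Ctt))) @ Cit.T
    lam = np.clip(np.linalg.eigvalsh(schur), eps, None)
    lam = lam / lam.sum()                             # spectral distribution
    return -np.sum(lam * np.log(lam))                 # von Neumann-style entropy
```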
[451] CoT-Vid: Dynamic Chain-of-Thought Routing with Self Verification for Training-Free Video Reasoning
Hongbo Jin, Ruyang Liu, Wenhao Zhang, Guibo Luo, Ge Li
Main category: cs.CV
TL;DR: CoT-Vid introduces a training-free paradigm for complex video reasoning, outperforming existing models with a novel multistage reasoning design.
Details
Motivation: Addressing the research gap in complex video reasoning, leveraging System2 reasoning advancements.
Method: Dynamic inference path routing, problem decoupling, and video self-consistency verification.
Result: Achieved 9.3% and 5.6% gains on benchmarks, rivaling larger models like GPT-4V.
Conclusion: CoT-Vid demonstrates superior performance in video reasoning, setting a new standard.
Abstract: System2 reasoning is developing rapidly these days with the emergence of Deep-Thinking Models and chain-of-thought technology, which has become a central discussion point in the AI community. However, there is a relative gap in the research on complex video reasoning at present. In this work, we propose CoT-Vid, a novel training-free paradigm for the video domain with a multistage complex reasoning design. Distinguishing from existing video LLMs, which rely heavily on perceptual abilities, it achieves a surprising performance gain with an explicit reasoning mechanism. The paradigm consists of three main components: dynamic inference path routing, problem decoupling strategy, and video self-consistency verification. In addition, we propose a new standard for categorization of video questions. CoT-Vid showed outstanding results on a wide range of benchmarks, and outperforms its base model by 9.3% on EgoSchema and 5.6% on VideoEspresso, rivalling or even surpassing larger and proprietary models, such as GPT-4V, GPT-4o and Gemini-1.5-flash. Our codebase will be publicly available soon.
[452] Conditional Balance: Improving Multi-Conditioning Trade-Offs in Image Generation
Nadav Z. Cohen, Oron Nir, Ariel Shamir
Main category: cs.CV
TL;DR: The paper introduces a method to balance content fidelity and artistic style in image generation using DDPMs by identifying sensitive layers in attention mechanisms for fine-grained control.
Details
Motivation: Addressing the challenge of maintaining equilibrium between content and style in image generation without sacrificing either.
Method: Analyzes DDPM attention layers to identify sensitive layers for stylistic aspects, directing conditional inputs to these layers for better control.
Result: The method improves style and content alignment, enhancing the quality of generated visual content.
Conclusion: The approach effectively balances style and content in image generation, outperforming traditional and modern methods.
Abstract: Balancing content fidelity and artistic style is a pivotal challenge in image generation. While traditional style transfer methods and modern Denoising Diffusion Probabilistic Models (DDPMs) strive to achieve this balance, they often struggle to do so without sacrificing either style, content, or sometimes both. This work addresses this challenge by analyzing the ability of DDPMs to maintain content and style equilibrium. We introduce a novel method to identify sensitivities within the DDPM attention layers, identifying specific layers that correspond to different stylistic aspects. By directing conditional inputs only to these sensitive layers, our approach enables fine-grained control over style and content, significantly reducing issues arising from over-constrained inputs. Our findings demonstrate that this method enhances recent stylization techniques by better aligning style and content, ultimately improving the quality of generated visual content.
[453] From Age Estimation to Age-Invariant Face Recognition: Generalized Age Feature Extraction Using Order-Enhanced Contrastive Learning
Haoyi Wang, Victor Sanchez, Chang-Tsun Li, Nathan Clarke
Main category: cs.CV
TL;DR: The paper introduces Order-Enhanced Contrastive Learning (OrdCon) to improve generalized age feature extraction for tasks like age estimation and age-invariant face recognition by modeling the ordinal progression of aging.
Details
Motivation: Existing models perform poorly in cross-dataset evaluations due to their inability to generalize age features, as they lack explicit modeling of the natural aging process.
Method: OrdCon uses contrastive learning to align feature direction vectors with aging progression and introduces a soft proxy matching loss to minimize intra-class variance and enhance generalizability.
Result: OrdCon achieves competitive results in homogeneous-dataset evaluations and outperforms other methods in cross-dataset tests, reducing mean absolute error by ~1.38 for age estimation and boosting AIFR accuracy by 1.87%.
Conclusion: The proposed OrdCon framework effectively models the aging process, improving generalization for age-related facial analysis tasks.
Abstract: Generalized age feature extraction is crucial for age-related facial analysis tasks, such as age estimation and age-invariant face recognition (AIFR). Despite the recent successes of models in homogeneous-dataset experiments, their performance drops significantly in cross-dataset evaluations. Most of these models fail to extract generalized age features as they only attempt to map extracted features with training age labels directly without explicitly modeling the natural ordinal progression of aging. In this paper, we propose Order-Enhanced Contrastive Learning (OrdCon), a novel contrastive learning framework designed explicitly for ordinal attributes like age. Specifically, to extract generalized features, OrdCon aligns the direction vector of two features with either the natural aging direction or its reverse to model the ordinal process of aging. To further enhance generalizability, OrdCon leverages a novel soft proxy matching loss as a second contrastive objective, ensuring that features are positioned around the center of each age cluster with minimal intra-class variance and proportionally away from other clusters. By modeling the aging process, the framework can enhance generalizability by improving the alignment of samples from the same class and reducing the divergence of direction vectors. We demonstrate that our proposed method achieves comparable results to state-of-the-art methods on various benchmark datasets in homogeneous-dataset evaluations for both age estimation and AIFR. In cross-dataset experiments, OrdCon outperforms other methods by reducing the mean absolute error by approximately 1.38 on average for the age estimation task and boosts the average accuracy for AIFR by 1.87%.
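The direction-alignment idea admits a compact toy objective; this is our simplification, not the paper's full loss (the learned global aging direction is an assumed input):

```python
import torch
import torch.nn.functional as F

def aging_direction_loss(feat_young, feat_old, aging_dir):
    """Toy ordinal-direction objective inspired by OrdCon (our simplification).

    Encourages the displacement between a younger and an older face feature
    to point along a global aging direction, modeling ordinal progression.
    feat_young, feat_old: (B, d) features; aging_dir: (d,) direction vector.
    """
    delta = F.normalize(feat_old - feat_young, dim=-1)
    aging_dir = F.normalize(aging_dir, dim=-1)
    return (1.0 - (delta * aging_dir).sum(-1)).mean()   # cosine misalignment
```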
[454] Taming SAM for Underwater Instance Segmentation and Beyond
Hua Li, Shijie Lian, Zhiyuan Li, Runmin Cong, Chongyi Li
Main category: cs.CV
TL;DR: The paper introduces UWSAM, an efficient model for underwater instance segmentation, addressing SAM’s limitations in underwater tasks with a new dataset (UIIS10K) and knowledge distillation (MG-UKD).
Details
Motivation: SAM and its variants underperform in underwater segmentation due to lack of domain expertise and high computational demands.
Method: Proposes UIIS10K dataset and UWSAM model, using MG-UKD for knowledge distillation and EUPG for automatic prompt generation.
Result: UWSAM outperforms state-of-the-art methods on underwater instance datasets.
Conclusion: UWSAM is effective for underwater segmentation, leveraging domain-specific data and efficient design.
Abstract: With recent breakthroughs in large-scale modeling, the Segment Anything Model (SAM) has demonstrated significant potential in a variety of visual applications. However, due to the lack of underwater domain expertise, SAM and its variants face performance limitations in end-to-end underwater instance segmentation tasks, while their higher computational requirements further hinder their application in underwater scenarios. To address this challenge, we propose a large-scale underwater instance segmentation dataset, UIIS10K, which includes 10,048 images with pixel-level annotations for 10 categories. Then, we introduce UWSAM, an efficient model designed for automatic and accurate segmentation of underwater instances. UWSAM efficiently distills knowledge from the SAM ViT-Huge image encoder into the smaller ViT-Small image encoder via the Mask GAT-based Underwater Knowledge Distillation (MG-UKD) method for effective visual representation learning. Furthermore, we design an End-to-end Underwater Prompt Generator (EUPG) for UWSAM, which automatically generates underwater prompts instead of explicitly providing foreground points or boxes as prompts, thus enabling the network to locate underwater instances accurately for efficient segmentation. Comprehensive experimental results show that our model is effective, achieving significant performance improvements over state-of-the-art methods on multiple underwater instance datasets. Datasets and codes are available at https://github.com/LiamLian0727/UIIS10K.
[455] Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, Liang-Chieh Chen
Main category: cs.CV
TL;DR: TA-TiTok is a novel image tokenizer integrating text during decoding, simplifying training and enabling open-data text-to-image models.
Details
Motivation: Existing image tokenizers are hard to train and rely on private datasets, limiting accessibility.
Method: TA-TiTok uses text-aware decoding and a one-stage training process, avoiding complex distillation.
Result: Achieves comparable performance to private-data models using open data.
Conclusion: TA-TiTok and MaskGen models aim to democratize text-to-image generation by being open and efficient.
Abstract: Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.
[456] Hypergraph Mamba for Efficient Whole Slide Image Understanding
Jiaxuan Lu, Yuhui Lin, Junyan Shi, Fang Yan, Dongzhan Zhou, Yue Gao, Xiaosong Wang
Main category: cs.CV
TL;DR: WSI-HGMamba combines Hypergraph Neural Networks and State Space Models for efficient, scalable analysis of Whole Slide Images, outperforming Transformers with lower computational costs.
Details
Motivation: Existing MIL methods like GNNs and Transformers struggle with scalability and computational expenses for ultra-high-resolution WSIs.
Method: Introduces WSI-HGMamba, integrating HGNNs’ relational modeling with State Space Models’ efficiency via HGMamba blocks for message passing, hypergraph scanning, and bidirectional state space modeling.
Result: Achieves superior performance with up to 7x fewer FLOPs than Transformers, validated on multiple WSI benchmarks.
Conclusion: WSI-HGMamba offers a scalable, accurate, and efficient solution for slide-level understanding, suitable for next-gen pathology AI.
Abstract: Whole Slide Images (WSIs) in histopathology pose a significant challenge for extensive medical image analysis due to their ultra-high resolution, massive scale, and intricate spatial relationships. Although existing Multiple Instance Learning (MIL) approaches like Graph Neural Networks (GNNs) and Transformers demonstrate strong instance-level modeling capabilities, they encounter constraints regarding scalability and computational expenses. To overcome these limitations, we introduce the WSI-HGMamba, a novel framework that unifies the high-order relational modeling capabilities of the Hypergraph Neural Networks (HGNNs) with the linear-time sequential modeling efficiency of the State Space Models. At the core of our design is the HGMamba block, which integrates message passing, hypergraph scanning & flattening, and bidirectional state space modeling (Bi-SSM), enabling the model to retain both relational and contextual cues while remaining computationally efficient. Compared to Transformer and Graph Transformer counterparts, WSI-HGMamba achieves superior performance with up to a 7× reduction in FLOPs. Extensive experiments on multiple public and private WSI benchmarks demonstrate that our method provides a scalable, accurate, and efficient solution for slide-level understanding, making it a promising backbone for next-generation pathology AI systems.
[457] UDC-VIT: A Real-World Video Dataset for Under-Display Cameras
Kyusu Ahn, JiSoo Kim, Sangik Lee, HyunGyu Lee, Byeonghyun Ko, Chanwoo Park, Jaejin Lee
Main category: cs.CV
TL;DR: The paper introduces UDC-VIT, a real-world Under-Display Camera (UDC) video dataset for addressing degradation issues like blur and flare, outperforming synthetic datasets in model training and face recognition accuracy.
Details
Motivation: Existing UDC datasets lack real-world video degradation, especially for facial recognition, necessitating a dataset that captures actual UDC issues.
Method: A video-capturing system captures clean and UDC-degraded videos simultaneously, aligned frame-by-frame using DFT. The dataset is compared with existing UDC datasets using deep-learning models.
Result: Models trained on synthetic UDC datasets underperform, while UDC-VIT improves face recognition accuracy, validated by PSNR, SSIM, and LPIPS scores.
Conclusion: UDC-VIT addresses the gap in real-world UDC video datasets, enhancing restoration models and face recognition performance.
Abstract: Even though an Under-Display Camera (UDC) is an advanced imaging system, the display panel significantly degrades captured images or videos, introducing low transmittance, blur, noise, and flare issues. Tackling such issues is challenging because of the complex degradation of UDCs, including diverse flare patterns. However, no dataset contains videos of real-world UDC degradation. In this paper, we propose a real-world UDC video dataset called UDC-VIT. Unlike existing datasets, UDC-VIT exclusively includes human motions for facial recognition. We propose a video-capturing system to acquire clean and UDC-degraded videos of the same scene simultaneously. Then, we align a pair of captured videos frame by frame, using discrete Fourier transform (DFT). We compare UDC-VIT with six representative UDC still image datasets and two existing UDC video datasets. Using six deep-learning models, we compare UDC-VIT and an existing synthetic UDC video dataset. The results indicate the ineffectiveness of models trained on earlier synthetic UDC video datasets, as they do not reflect the actual characteristics of UDC-degraded videos. We also demonstrate the importance of effective UDC restoration by evaluating face recognition accuracy concerning PSNR, SSIM, and LPIPS scores. UDC-VIT is available at our official GitHub repository.
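Frame-by-frame DFT alignment is plausibly a phase-correlation step; a standard sketch for estimating an integer translation between a clean frame and its degraded counterpart (one possible reading of the pipeline, not the authors' code):

```python
import numpy as np

def dft_translation(ref, deg):
    """Phase-correlation alignment between two grayscale frames (sketch).

    Estimates the integer (dy, dx) shift of `deg` relative to `ref` from the
    normalized cross-power spectrum; the peak of its inverse FFT marks the shift.
    """
    F1, F2 = np.fft.fft2(ref), np.fft.fft2(deg)
    cross = F1 * np.conj(F2)
    cps = cross / (np.abs(cross) + 1e-8)              # keep phase, drop magnitude
    corr = np.abs(np.fft.ifft2(cps))
    dy, dx = np.unravel_index(corr.argmax(), corr.shape)
    h, w = ref.shape                                  # unwrap circular shifts
    return (dy - h if dy > h // 2 else dy,
            dx - w if dx > w // 2 else dx)
```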
[458] GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling
Pinxin Liu, Luchuan Song, Junhua Huang, Haiyang Liu, Chenliang Xu
Main category: cs.CV
TL;DR: GestureLSM improves full-body gesture generation by modeling spatial-temporal interactions and using flow matching for faster, higher-quality results.
Details
Motivation: Existing methods for speech-driven gesture generation produce unnatural movements due to separate modeling of body regions and slow autoregressive/diffusion pipelines.
Method: GestureLSM uses spatial-temporal attention to model interactions between body regions and flow matching for efficient sampling, enhanced by latent shortcut learning and beta distribution time stamp sampling.
Result: GestureLSM achieves state-of-the-art performance on BEAT2 and significantly reduces inference time.
Conclusion: GestureLSM’s efficient and coherent gesture generation makes it promising for real-world applications like digital humans and embodied agents.
Abstract: Generating full-body human gestures from speech signals remains challenging in terms of quality and speed. Existing approaches model different body regions such as body, legs and hands separately, which fails to capture the spatial interactions between them and results in unnatural and disjointed movements. Additionally, their autoregressive/diffusion-based pipelines show slow generation speed due to dozens of inference steps. To address these two challenges, we propose GestureLSM, a flow-matching-based approach for Co-Speech Gesture Generation with spatial-temporal modeling. Our method i) explicitly models the interaction of tokenized body regions through spatial and temporal attention, generating coherent full-body gestures, and ii) introduces flow matching to enable more efficient sampling by explicitly modeling the latent velocity space. To overcome the suboptimal performance of the flow matching baseline, we propose latent shortcut learning and beta distribution time stamp sampling during training to enhance gesture synthesis quality and accelerate inference. Combining the spatial-temporal modeling and the improved flow-matching-based framework, GestureLSM achieves state-of-the-art performance on BEAT2 while significantly reducing inference time compared to existing methods, highlighting its potential for enhancing digital humans and embodied agents in real-world applications. Project Page: https://andypinxinliu.github.io/GestureLSM
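The flow-matching baseline the paper builds on has a standard training objective; a minimal sketch, where `model` is a hypothetical velocity predictor conditioned on speech features:

```python
import torch

def flow_matching_loss(model, x0, x1, cond):
    """Standard conditional flow-matching objective (baseline, not the full method).

    x0: noise sample, x1: ground-truth gesture latents, cond: speech features;
    all shaped (B, T, D). The model regresses the constant velocity x1 - x0
    along the straight-line path between noise and data.
    """
    t = torch.rand(x1.size(0), device=x1.device).view(-1, 1, 1)
    xt = (1 - t) * x0 + t * x1            # point on the linear interpolation path
    v_target = x1 - x0
    v_pred = model(xt, t.squeeze(), cond)
    return torch.mean((v_pred - v_target) ** 2)
```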
[459] ArchiLense: A Framework for Quantitative Analysis of Architectural Styles Based on Vision Large Language Models
Jing Zhong, Jun Yin, Peilin Li, Pengyu Zeng, Miao Zang, Ran Luo, Shuai Lu
Main category: cs.CV
TL;DR: The paper introduces ArchDiffBench, a dataset of architectural images, and ArchiLense, a framework using Vision-Language Models for automated style analysis, achieving high accuracy and consistency.
Details
Motivation: Traditional architectural studies rely on subjective expert interpretations and historical reviews, often biased and limited. This study aims to provide an objective, automated approach for analyzing architectural styles.
Method: Constructed ArchDiffBench (1,765 annotated images) and developed ArchiLense, a framework combining Vision-Language Models, computer vision, and machine learning for style recognition and classification.
Result: ArchiLense achieved 92.4% consistency with expert annotations and 84.5% classification accuracy, effectively capturing stylistic differences.
Conclusion: The approach offers an objective, accurate alternative to traditional methods, enhancing comparative studies of architectural culture.
Abstract: Architectural cultures across regions are characterized by stylistic diversity, shaped by historical, social, and technological contexts in addition to geographical conditions. Understanding architectural styles requires the ability to describe and analyze the stylistic features of different architects from various regions through visual observations of architectural imagery. However, traditional studies of architectural culture have largely relied on subjective expert interpretations and historical literature reviews, often suffering from regional biases and limited explanatory scope. To address these challenges, this study proposes three core contributions: (1) We construct a professional architectural style dataset named ArchDiffBench, which comprises 1,765 high-quality architectural images and their corresponding style annotations, collected from different regions and historical periods. (2) We propose ArchiLense, an analytical framework grounded in Vision-Language Models and constructed using the ArchDiffBench dataset. By integrating advanced computer vision techniques, deep learning, and machine learning algorithms, ArchiLense enables automatic recognition, comparison, and precise classification of architectural imagery, producing descriptive language outputs that articulate stylistic differences. (3) Extensive evaluations show that ArchiLense achieves strong performance in architectural style recognition, with a 92.4% consistency rate with expert annotations and 84.5% classification accuracy, effectively capturing stylistic distinctions across images. The proposed approach transcends the subjectivity inherent in traditional analyses and offers a more objective and accurate perspective for comparative studies of architectural culture.
[460] HumanDiT: Pose-Guided Diffusion Transformer for Long-form Human Motion Video Generation
Qijun Gan, Yi Ren, Chen Zhang, Zhenhui Ye, Pan Xie, Xiang Yin, Zehuan Yuan, Bingyue Peng, Jianke Zhu
Main category: cs.CV
TL;DR: HumanDiT is a pose-guided Diffusion Transformer framework for high-fidelity human motion video generation, addressing challenges like detailed body rendering and long-sequence consistency.
Details
Motivation: Existing methods struggle with rendering detailed body parts (e.g., hands, faces) and maintaining visual consistency in long, complex motion sequences.
Method: HumanDiT uses a Diffusion Transformer (DiT) framework, supports variable resolutions and sequence lengths, and employs a prefix-latent reference strategy for consistency. It also leverages Keypoint-DiT for pose sequence generation and a Pose Adapter for pose transfer.
Result: HumanDiT outperforms existing methods, generating high-quality, long-form videos with accurate poses across diverse scenarios.
Conclusion: HumanDiT advances human motion video generation by improving detail rendering and consistency, supported by extensive experiments.
Abstract: Human motion video generation has advanced significantly, while existing methods still struggle with accurately rendering detailed body parts like hands and faces, especially in long sequences and intricate motions. Current approaches also rely on fixed resolution and struggle to maintain visual consistency. To address these limitations, we propose HumanDiT, a pose-guided Diffusion Transformer (DiT)-based framework trained on a large and wild dataset containing 14,000 hours of high-quality video to produce high-fidelity videos with fine-grained body rendering. Specifically, (i) HumanDiT, built on DiT, supports numerous video resolutions and variable sequence lengths, facilitating learning for long-sequence video generation; (ii) we introduce a prefix-latent reference strategy to maintain personalized characteristics across extended sequences. Furthermore, during inference, HumanDiT leverages Keypoint-DiT to generate subsequent pose sequences, facilitating video continuation from static images or existing videos. It also utilizes a Pose Adapter to enable pose transfer with given sequences. Extensive experiments demonstrate its superior performance in generating long-form, pose-accurate videos across diverse scenarios.
[461] AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting
Xiaoyu Zhou, Jingqi Wang, Yongtao Wang, Yufei Wei, Nan Dong, Ming-Hsuan Yang
Main category: cs.CV
TL;DR: AutoOcc is a vision-centric pipeline for automated 3D semantic occupancy annotation, using Gaussian splatting guided by vision-language models, outperforming existing methods without human labels.
Details
Motivation: Manual labeling for high-quality 3D semantic occupancy is challenging and labor-intensive, necessitating an automated solution.
Method: Integrates differentiable Gaussian splatting guided by vision-language models, using semantic-aware Gaussians and a cumulative Gaussian-to-voxel splatting algorithm.
Result: Outperforms existing automated occupancy annotation methods and enables open-ended semantic auto-labeling in complex scenarios.
Conclusion: AutoOcc provides a robust, automated solution for 3D semantic occupancy annotation, eliminating the need for manual labeling.
Abstract: Obtaining high-quality 3D semantic occupancy from raw sensor data remains an essential yet challenging task, often requiring extensive manual labeling. In this work, we propose AutoOcc, a vision-centric automated pipeline for open-ended semantic occupancy annotation that integrates differentiable Gaussian splatting guided by vision-language models. We formulate the open-ended semantic 3D occupancy reconstruction task to automatically generate scene occupancy by combining attention maps from vision-language models and foundation vision models. We devise semantic-aware Gaussians as intermediate geometric descriptors and propose a cumulative Gaussian-to-voxel splatting algorithm that enables effective and efficient occupancy annotation. Our framework outperforms existing automated occupancy annotation methods without human labels. AutoOcc also enables open-ended semantic occupancy auto-labeling, achieving robust performance in both static and dynamically complex scenarios.
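Cumulative Gaussian-to-voxel splatting can be approximated by scattering per-Gaussian semantic votes into a voxel grid; the crude center-only version below ignores Gaussian extents and opacities and is purely illustrative:

```python
import torch

def gaussians_to_voxels(centers, sem_probs, grid_min, voxel_size, grid_shape):
    """Crude cumulative Gaussian-to-voxel accumulation (illustrative only).

    centers:   (N, 3) Gaussian centers; sem_probs: (N, C) semantic weights.
    Each Gaussian votes its semantics into the voxel containing its center;
    the argmax over accumulated votes yields semantic occupancy labels.
    """
    idx = ((centers - grid_min) / voxel_size).long()          # (N, 3) voxel indices
    valid = ((idx >= 0) & (idx < torch.tensor(grid_shape))).all(dim=1)
    idx, sem = idx[valid], sem_probs[valid]
    flat = idx[:, 0] * grid_shape[1] * grid_shape[2] \
         + idx[:, 1] * grid_shape[2] + idx[:, 2]              # linearized indices
    vol = torch.zeros(grid_shape[0] * grid_shape[1] * grid_shape[2], sem.size(1))
    vol.index_add_(0, flat, sem)                              # cumulative splatting
    return vol.argmax(dim=1).reshape(grid_shape)              # semantic occupancy
```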
[462] Segment Any Architectural Facades (SAAF):An automatic segmentation model for building facades, walls and windows based on multimodal semantics guidance
Peilin Li, Jun Yin, Jing Zhong, Ran Luo, Pengyu Zeng, Miao Zhang
Main category: cs.CV
TL;DR: SAAF is an automatic segmentation model for building facades using multimodal semantic guidance, combining NLP and image features for improved accuracy and robustness.
Details
Motivation: To enhance efficiency in building information models and CAD by automating the segmentation of walls and windows.
Method: Uses multimodal semantic collaborative feature extraction and an end-to-end training framework to fuse text descriptions with image features.
Result: Outperforms existing methods in mIoU metric, demonstrating high-precision segmentation across diverse datasets.
Conclusion: SAAF advances architectural computer vision and explores multimodal learning applications in architecture.
Abstract: In the context of the digital development of architecture, the automatic segmentation of walls and windows is a key step in improving the efficiency of building information models and computer-aided design. This study proposes an automatic segmentation model for building facade walls and windows based on multimodal semantic guidance, called Segment Any Architectural Facades (SAAF). First, SAAF has a multimodal semantic collaborative feature extraction mechanism. By combining natural language processing technology, it can fuse the semantic information in text descriptions with image features, enhancing the semantic understanding of building facade components. Second, we developed an end-to-end training framework that enables the model to autonomously learn the mapping relationship from text descriptions to image segmentation, reducing the influence of manual intervention on the segmentation results and improving the automation and robustness of the model. Finally, we conducted extensive experiments on multiple facade datasets. The segmentation results of SAAF outperformed existing methods in the mIoU metric, indicating that the SAAF model can maintain high-precision segmentation ability when faced with diverse datasets. Our model improves the accuracy and generalization ability of the wall and window segmentation task. It is expected to provide a reference for the development of architectural computer vision technology and to explore new ideas and technical paths for the application of multimodal learning in the architectural field.
[463] SAGI: Semantically Aligned and Uncertainty Guided AI Image Inpainting
Paschalis Giakoumoglou, Dimitrios Karageorgiou, Symeon Papadopoulos, Panagiotis C. Petrantonakis
Main category: cs.CV
TL;DR: The paper introduces SAGI, a model-agnostic pipeline for automating text-guided image inpainting by aligning prompts with human perception and filtering unrealistic outputs. It also presents SAGI-D, a large dataset of AI-generated inpaintings, showing improved image quality and forensic detection performance.
Details
Motivation: Manual refinement of prompts and evaluation of generated images in text-guided inpainting is time-consuming. The goal is to automate this process while ensuring semantic correctness and photorealism.
Method: Proposes SAGI, a pipeline using pretrained language and vision-language models to sample human-aligned prompts and filter unrealistic outputs. Applied to multiple inpainting models to create SAGI-D.
Result: Semantic alignment improves image quality (human detection accuracy drops from 74% to 35%). SAGI-D enhances forensic detection by 37.4% (in-domain) and 26.1% (out-of-domain).
Conclusion: SAGI automates and improves inpainting, while SAGI-D aids in forensic detection, countering misuse of generative AI.
Abstract: Recent advancements in generative AI have made text-guided image inpainting - adding, removing, or altering image regions using textual prompts - widely accessible. However, generating semantically correct, photorealistic imagery typically requires carefully crafted prompts and iterative refinement by evaluating the realism of the generated content - tasks commonly performed by humans. To automate the generative process, we propose Semantically Aligned and Uncertainty Guided AI Image Inpainting (SAGI), a model-agnostic pipeline, to sample prompts from a distribution that closely aligns with human perception and to evaluate the generated content and discard instances that deviate from such a distribution, which we approximate using pretrained large language models and vision-language models. By applying this pipeline on multiple state-of-the-art inpainting models, we create the SAGI Dataset (SAGI-D), currently the largest and most diverse dataset of AI-generated inpaintings, comprising over 95k inpainted images and a human-evaluated subset. Our experiments show that semantic alignment significantly improves image quality and aesthetics, while uncertainty guidance effectively identifies realistic manipulations - human ability to distinguish inpainted images from real ones drops from 74% to 35% in terms of accuracy, after applying our pipeline. Moreover, using SAGI-D for training several image forensic approaches increases in-domain detection performance on average by 37.4% and out-of-domain generalization by 26.1% in terms of IoU, also demonstrating its utility in countering malicious exploitation of generative AI. Code and dataset are available at https://mever-team.github.io/SAGI/
[464] UrbanSense:A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models
Jun Yin, Jing Zhong, Peilin Li, Ruolin Pan, Pengyu Zeng, Miao Zhang, Shuai Lu
Main category: cs.CV
TL;DR: A multimodal vision-language framework (UrbanSense) is proposed for automated analysis of urban streetscape styles, validated by high accuracy in capturing stylistic differences.
Details
Motivation: Understanding urban cultural and architectural differences is crucial for future city evolution, but traditional methods lack standardization and scalability.
Method: Developed UrbanSense, a vision-language-model-based framework, and UrbanDiffBench dataset for quantitative urban style analysis.
Result: Over 80% of descriptions passed t-tests; high Phi scores (0.912 for cities, 0.833 for periods) confirmed method’s effectiveness.
Conclusion: The framework offers a data-driven, scalable approach to quantify urban style evolution, aiding future urban design.
Abstract: Urban cultures and architectural styles vary significantly across cities due to geographical, chronological, historical, and socio-political factors. Understanding these differences is essential for anticipating how cities may evolve in the future. As representative cases of historical continuity and modern innovation in China, Beijing and Shenzhen offer valuable perspectives for exploring the transformation of urban streetscapes. However, conventional approaches to urban cultural studies often rely on expert interpretation and historical documentation, which are difficult to standardize across different contexts. To address this, we propose a multimodal research framework based on vision-language models, enabling automated and scalable analysis of urban streetscape style differences. This approach enhances the objectivity and data-driven nature of urban form research. The contributions of this study are as follows: First, we construct UrbanDiffBench, a curated dataset of urban streetscapes containing architectural images from different periods and regions. Second, we develop UrbanSense, the first vision-language-model-based framework for urban streetscape analysis, enabling the quantitative generation and comparison of urban style representations. Third, experimental results show that over 80% of generated descriptions pass the t-test (p < 0.05). High Phi scores (0.912 for cities, 0.833 for periods) from subjective evaluations confirm the method’s ability to capture subtle stylistic differences. These results highlight the method’s potential to quantify and interpret urban style evolution, offering a scientifically grounded lens for future design.
[465] GAS: Generative Avatar Synthesis from a Single Image
Yixing Lu, Junting Dong, Youngjoong Kwon, Qin Zhao, Bo Dai, Fernando De la Torre
Main category: cs.CV
TL;DR: A framework for generating view-consistent and temporally coherent avatars from a single image by combining regression-based 3D human reconstruction with a diffusion model.
Details
Motivation: Addressing inconsistencies in existing diffusion-based methods due to sparse human templates, aiming for high-quality avatar synthesis.
Method: Uses a generalized NeRF for initial 3D reconstruction, then feeds geometry and appearance into a video-based diffusion model for consistency.
Result: Demonstrates superior generalization and effectiveness across diverse datasets.
Conclusion: The method successfully bridges the gap between reconstruction and generation, ensuring high-quality, consistent avatars.
Abstract: We present a unified and generalizable framework for synthesizing view-consistent and temporally coherent avatars from a single image, addressing the challenging task of single-image avatar generation. Existing diffusion-based methods often condition on sparse human templates (e.g., depth or normal maps), which leads to multi-view and temporal inconsistencies due to the mismatch between these signals and the true appearance of the subject. Our approach bridges this gap by combining the reconstruction power of regression-based 3D human reconstruction with the generative capabilities of a diffusion model. In a first step, an initial 3D reconstructed human through a generalized NeRF provides comprehensive conditioning, ensuring high-quality synthesis faithful to the reference appearance and structure. Subsequently, the derived geometry and appearance from the generalized NeRF serve as input to a video-based diffusion model. This strategic integration is pivotal for enforcing both multi-view and temporal consistency throughout the avatar’s generation. Empirical results underscore the superior generalization ability of our proposed method, demonstrating its effectiveness across diverse in-domain and out-of-domain in-the-wild datasets.
[466] Fair Generation without Unfair Distortions: Debiasing Text-to-Image Generation with Entanglement-Free Attention
Jeonghoon Park, Juyoung Lee, Chaeyeon Chung, Jaeseong Lee, Jaegul Choo, Jindong Gu
Main category: cs.CV
TL;DR: EFA is a method to mitigate societal biases in diffusion-based T2I models by addressing attribute entanglement, ensuring fair attribute distribution without altering non-target attributes.
Details
Motivation: Diffusion-based T2I models exhibit societal biases (e.g., gender, race), reinforcing stereotypes. Existing bias mitigation methods suffer from attribute entanglement, altering unintended attributes.
Method: EFA (Entanglement-Free Attention) adjusts cross-attention layers to incorporate target attributes (e.g., race) randomly and equally, preserving non-target attributes (e.g., background).
Result: EFA outperforms existing methods in bias mitigation while maintaining the original model’s output distribution and generative capacity.
Conclusion: EFA effectively addresses bias in T2I models without disrupting non-target attributes, offering a fair and practical solution.
Abstract: Recent advancements in diffusion-based text-to-image (T2I) models have enabled the generation of high-quality and photorealistic images from text. However, they often exhibit societal biases related to gender, race, and socioeconomic status, thereby potentially reinforcing harmful stereotypes and shaping public perception in unintended ways. While existing bias mitigation methods demonstrate effectiveness, they often encounter attribute entanglement, where adjustments to attributes relevant to the bias (i.e., target attributes) unintentionally alter attributes unassociated with the bias (i.e., non-target attributes), causing undesirable distribution shifts. To address this challenge, we introduce Entanglement-Free Attention (EFA), a method that accurately incorporates target attributes (e.g., White, Black, and Asian) while preserving non-target attributes (e.g., background) during bias mitigation. At inference time, EFA randomly samples a target attribute with equal probability and adjusts the cross-attention in selected layers to incorporate the sampled attribute, achieving a fair distribution of target attributes. Extensive experiments demonstrate that EFA outperforms existing methods in mitigating bias while preserving non-target attributes, thereby maintaining the original model’s output distribution and generative capacity.
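From the abstract, EFA's inference-time recipe is uniform sampling of a target attribute plus an adjustment of cross-attention in selected layers only. A sketch under that reading; the concatenation of attribute tokens is an assumption, since the abstract does not spell out the adjustment rule:

```python
import random
import torch

TARGETS = ["White", "Black", "Asian"]  # example target attributes from the abstract

def sample_target():
    # EFA draws one target attribute per generation, with equal probability.
    return random.choice(TARGETS)

def efa_conditioning(prompt_emb, attr_emb, layer_idx, selected_layers):
    # Only the selected cross-attention layers see the sampled attribute's
    # tokens; conditioning elsewhere (e.g., background) is left untouched.
    if layer_idx in selected_layers:
        return torch.cat([prompt_emb, attr_emb], dim=1)
    return prompt_emb
```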
[467] Spatial Transcriptomics Analysis of Spatially Dense Gene Expression Prediction
Ruikun Zhang, Yan Yang, Liyuan Pan
Main category: cs.CV
TL;DR: PixNet is a dense prediction network that predicts spatially resolved gene expression at varying scales from histopathology images, outperforming existing methods.
Details
Motivation: Existing methods lose spatial resolution in gene expression by cropping fixed-size spots, limiting prediction at varying scales.
Method: PixNet generates a dense continuous gene expression map from histopathology images and aggregates values within spots of interest.
Result: PixNet outperforms state-of-the-art methods on four ST datasets across multiple scales.
Conclusion: PixNet addresses spatial resolution limitations and improves gene expression prediction, with publicly available source code.
Abstract: Spatial transcriptomics (ST) measures gene expression at fine-grained spatial resolution, offering insights into tissue molecular landscapes. Previous methods for spatial gene expression prediction typically crop spots of interest from histopathology slide images, and train models to map each spot to a corresponding gene expression profile. However, these methods inherently lose the spatial resolution in gene expression: 1) each spot often contains multiple cells with distinct gene expression profiles; 2) spots are typically defined at fixed spatial resolutions, limiting the ability to predict gene expression at varying scales. To address these limitations, this paper presents PixNet, a dense prediction network capable of predicting spatially resolved gene expression across spots of varying sizes and scales directly from histopathology slide images. Different from previous methods that map individual spots to gene expression values, we generate a spatially dense continuous gene expression map from the histopathology slide image, and aggregate values within spots of interest to predict the gene expression. Our PixNet outperforms state-of-the-art methods on four common ST datasets at multiple spatial scales. The source code will be publicly available.
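Because expression is predicted densely and only then pooled, spot size becomes a query-time choice rather than a training-time constraint. A minimal sketch of the aggregation step (mean pooling is an assumption; the paper may aggregate differently):

```python
import numpy as np

def spot_expression(dense_map, spot_mask):
    # dense_map: (H, W, G) per-pixel predicted expression for G genes
    # spot_mask: (H, W) boolean mask of an arbitrarily sized/shaped spot
    return dense_map[spot_mask].mean(axis=0)  # (G,) aggregated profile
```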
[468] 10K is Enough: An Ultra-Lightweight Binarized Network for Infrared Small-Target Detection
Biqiao Xin, Qianchen Mao, Bingshu Wang, Jiangbin Zheng, Yong Zhao, C. L. Philip Chen
Main category: cs.CV
TL;DR: BiisNet, a binarized neural network for IRSTD, integrates full-precision features and innovative techniques to overcome precision loss, outperforming binary and full-precision models.
Details
Motivation: The need for efficient model compression for edge deployment of IRSTD algorithms, despite the challenge of precision loss in binarization.
Method: Proposes BiisNet with Dot Binary Convolution and Dynamic Softsign function to retain precision and improve gradient flow.
Result: BiisNet outperforms other binary architectures and competes with full-precision models.
Conclusion: BiisNet effectively balances efficiency and precision, making it suitable for edge-device deployment in IRSTD tasks.
Abstract: The widespread deployment of Infrared Small-Target Detection (IRSTD) algorithms on edge devices necessitates the exploration of model compression techniques. Binarized neural networks (BNNs) are distinguished by their exceptional efficiency in model compression. However, the small size of infrared targets introduces stringent precision requirements for the IRSTD task, while the inherent precision loss during binarization presents a significant challenge. To address this, we propose the Binarized Infrared Small-Target Detection Network (BiisNet), which preserves the core operations of binarized convolutions while integrating full-precision features into the network’s information flow. Specifically, we propose the Dot Binary Convolution, which retains fine-grained semantic information in feature maps while still leveraging the binarized convolution operations. In addition, we introduce a smooth and adaptive Dynamic Softsign function, which provides more comprehensive and progressively finer gradients during backpropagation, enhancing model stability and promoting an optimal weight distribution. Experimental results demonstrate that BiisNet not only significantly outperforms other binary architectures but also has strong competitiveness among state-of-the-art full-precision models.
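The Dynamic Softsign is described as a smooth, adaptive stand-in for the sign function during backpropagation. A plausible straight-through realization, where the sharpness alpha is the "dynamic" knob (the paper's exact form is not given in the abstract):

```python
import torch

class SoftsignSTE(torch.autograd.Function):
    """Binarize on the forward pass; back-propagate through softsign(alpha * x)."""

    @staticmethod
    def forward(ctx, x, alpha):
        ctx.save_for_backward(x)
        ctx.alpha = alpha
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        a = ctx.alpha
        # d/dx softsign(a * x) = a / (1 + |a * x|)^2; larger alpha -> sharper
        return grad_out * a / (1.0 + (a * x).abs()) ** 2, None

# usage sketch: w_bin = SoftsignSTE.apply(w, alpha), annealing alpha over training
```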
[469] Surgical Gaussian Surfels: Highly Accurate Real-time Surgical Scene Rendering using Gaussian Surfels
Idris O. Sunmola, Zhenjun Zhao, Samuel Schmidgall, Yumeng Wang, Paul Maria Scheikl, Viet Pham, Axel Krieger
Main category: cs.CV
TL;DR: Surgical Gaussian Surfels (SGS) improves geometric reconstruction in endoscopic surgery by constraining Gaussian scaling and enhancing surface alignment, outperforming existing methods in detail preservation and efficiency.
Details
Motivation: Current NeRF- and 3D Gaussian-based methods struggle with artifact-free tool occlusions and fine anatomical detail preservation in deformable tissue reconstruction.
Method: Introduces SGS for surface-aligned splats and FFD-MLP for faster surfel motion prediction, with locality constraints and homodirectional gradients for detail capture.
Result: Outperforms state-of-the-art in surface geometry, normal map quality, and rendering efficiency on surgical datasets.
Conclusion: SGS and FFD-MLP offer superior reconstruction and real-time performance, advancing surgical scene rendering.
Abstract: Accurate geometric reconstruction of deformable tissues in monocular endoscopic video remains a fundamental challenge in robot-assisted minimally invasive surgery. Although recent volumetric and point primitive methods based on neural radiance fields (NeRF) and 3D Gaussian primitives have efficiently rendered surgical scenes, they still struggle with handling artifact-free tool occlusions and preserving fine anatomical details. These limitations stem from unrestricted Gaussian scaling and insufficient surface alignment constraints during reconstruction. To address these issues, we introduce Surgical Gaussian Surfels (SGS), which transform anisotropic point primitives into surface-aligned elliptical splats by constraining the scale component of the Gaussian covariance matrix along the view-aligned axis. We also introduce the Fully Fused Deformation Multilayer Perceptron (FFD-MLP), a lightweight Multi-Layer Perceptron (MLP) that predicts accurate surfel motion fields up to 5x faster than a standard MLP. This is coupled with locality constraints to handle complex tissue deformations. We use homodirectional view-space positional gradients to capture fine image details by splitting Gaussian Surfels in over-reconstructed regions. In addition, we define surface normals as the direction of the steepest density change within each Gaussian surfel primitive, enabling accurate normal estimation without requiring monocular normal priors. We evaluate our method on two in-vivo surgical datasets, where it outperforms current state-of-the-art methods in surface geometry, normal map quality, and rendering efficiency, while remaining competitive in real-time rendering performance. We make our code available at https://github.com/aloma85/SurgicalGaussianSurfels
[470] RedDiffuser: Red Teaming Vision-Language Models for Toxic Continuation via Reinforced Stable Diffusion
Ruofan Wang, Xiang Zheng, Xiaosen Wang, Cong Wang, Xingjun Ma
Main category: cs.CV
TL;DR: RedDiffuser (RedDiff) is a reinforcement learning-based framework for generating adversarial images that induce toxic outputs in Vision-Language Models (VLMs), revealing a cross-modal toxicity vulnerability.
Details
Motivation: VLMs are vulnerable to toxic continuation attacks, where subtle adversarial inputs lead to harmful completions, posing a unique challenge in multimodal settings.
Method: RedDiffuser uses reinforcement learning to fine-tune diffusion models, combining greedy search for image prompts with fine-tuning to maximize toxicity and coherence.
Result: RedDiffuser increases toxicity rates in LLaVA by 10.69% and 8.91% on original and hold-out sets, and shows transferability to other models like Gemini (5.1%) and LLaMA-Vision (26.83%).
Conclusion: The work highlights a cross-modal toxicity vulnerability in VLMs and advocates for robust multimodal red teaming, with RedDiffuser’s codebase to be released for further research.
Abstract: Vision-Language Models (VLMs) are vulnerable to jailbreak attacks, where adversaries bypass safety mechanisms to elicit harmful outputs. In this work, we examine an insidious variant of this threat: toxic continuation. Unlike standard jailbreaks that rely solely on malicious instructions, toxic continuation arises when the model is given a malicious input alongside a partial toxic output, resulting in harmful completions. This vulnerability poses a unique challenge in multimodal settings, where even subtle image variations can disproportionately affect the model’s response. To this end, we propose RedDiffuser (RedDiff), the first red teaming framework that uses reinforcement learning to fine-tune diffusion models into generating natural-looking adversarial images that induce toxic continuations. RedDiffuser integrates a greedy search procedure for selecting candidate image prompts with reinforcement fine-tuning that jointly promotes toxic output and semantic coherence. Experiments demonstrate that RedDiffuser significantly increases the toxicity rate in LLaVA outputs by 10.69% and 8.91% on the original and hold-out sets, respectively. It also exhibits strong transferability, increasing toxicity rates on Gemini by 5.1% and on LLaMA-Vision by 26.83%. These findings uncover a cross-modal toxicity amplification vulnerability in current VLM alignment, highlighting the need for robust multimodal red teaming. We will release the RedDiffuser codebase to support future research.
[471] Survivability of Backdoor Attacks on Unconstrained Face Recognition Systems
Quentin Le Roux, Yannick Teglia, Teddy Furon, Philippe Loubet-Moundi, Eric Bourbao
Main category: cs.CV
TL;DR: This paper explores backdoor attacks in deep learning-based face recognition systems, demonstrating novel attacks and providing countermeasures.
Details
Motivation: Addressing the lack of research on DNN backdoor attacks in real-life, unconstrained face recognition systems.
Method: Conducts a system-level study, demonstrating two backdoor attacks (face generation and landmark shift) and testing 20 pipeline configurations.
Result: Shows that a single backdoor can bypass entire system functions, with 15 attack cases validated.
Conclusion: Provides best practices and countermeasures for stakeholders to mitigate backdoor vulnerabilities.
Abstract: The widespread use of deep learning face recognition raises several security concerns. Although prior works point at existing vulnerabilities, DNN backdoor attacks against real-life, unconstrained systems dealing with images captured in the wild remain a blind spot of the literature. This paper conducts the first system-level study of backdoors in deep learning-based face recognition systems. This paper yields four contributions by exploring the feasibility of DNN backdoors on these pipelines in a holistic fashion. We demonstrate for the first time two backdoor attacks on the face detection task: face generation and face landmark shift attacks. We then show that face feature extractors trained with large margin losses also fall victim to backdoor attacks. Combining our models, we then show using 20 possible pipeline configurations and 15 attack cases that a single backdoor enables an attacker to bypass the entire function of a system. Finally, we provide stakeholders with several best practices and countermeasures.
[472] SplatTalk: 3D VQA with Gaussian Splatting
Anh Thai, Songyou Peng, Kyle Genova, Leonidas Guibas, Thomas Funkhouser
Main category: cs.CV
TL;DR: SplatTalk introduces a 3D Gaussian Splatting framework for zero-shot 3D VQA, outperforming task-specific 3D models and 2D-LMM-based methods.
Details
Motivation: Advancing 3D scene understanding for robotics, AR/VR, and HCI by leveraging natural language, overcoming challenges of 3D data complexity and annotation costs.
Method: Uses a generalizable 3D Gaussian Splatting (3DGS) framework to generate 3D tokens for pretrained LLMs, enabling zero-shot 3D VQA from posed images.
Result: Outperforms task-specific 3D models and 2D-LMM-based methods, achieving competitive performance with state-of-the-art 3D LMMs.
Conclusion: SplatTalk demonstrates effective zero-shot 3D VQA, bridging the gap between 2D and 3D vision-language models.
Abstract: Language-guided 3D scene understanding is important for advancing applications in robotics, AR/VR, and human-computer interaction, enabling models to comprehend and interact with 3D environments through natural language. While 2D vision-language models (VLMs) have achieved remarkable success in 2D VQA tasks, progress in the 3D domain has been significantly slower due to the complexity of 3D data and the high cost of manual annotations. In this work, we introduce SplatTalk, a novel method that uses a generalizable 3D Gaussian Splatting (3DGS) framework to produce 3D tokens suitable for direct input into a pretrained LLM, enabling effective zero-shot 3D visual question answering (3D VQA) for scenes with only posed images. During experiments on multiple benchmarks, our approach outperforms both 3D models trained specifically for the task and previous 2D-LMM-based models utilizing only images (our setting), while achieving competitive performance with state-of-the-art 3D LMMs that additionally utilize 3D inputs. Project website: https://splat-talk.github.io/
[473] Text2Story: Advancing Video Storytelling with Text Guidance
Taewon Kang, Divya Kothandaraman, Ming C. Lin
Main category: cs.CV
TL;DR: A novel AI framework for generating coherent long-form videos from text prompts, addressing challenges like temporal consistency and semantic continuity through bidirectional latent blending and adaptive prompt weighting.
Details
Motivation: The challenge of creating long-form videos from text prompts due to issues like temporal coherency and semantic continuity motivates the development of a new framework.
Method: The framework uses bidirectional time-weighted latent blending for temporal consistency, dynamics-informed prompt weighting (DIPW) for adaptive influence of prompts, and semantic action representation for smooth motion transitions.
Result: The system outperforms baselines, producing temporally consistent and visually compelling long-form videos without additional training.
Conclusion: This approach bridges the gap between short clips and extended videos, setting a new standard for text-to-video synthesis.
Abstract: Generating coherent long-form video sequences from discrete input using only text prompts is a critical task in content creation. While diffusion-based models excel at short video synthesis, long-form storytelling from text remains largely unexplored and challenging, owing to issues of temporal coherency, semantic preservation, and action continuity across the video. We introduce a novel AI-empowered storytelling framework to enable seamless video generation with natural action transitions and structured narratives. We first present a bidirectional time-weighted latent blending strategy to ensure temporal consistency between segments of the long-form video being generated. We then introduce a dynamics-informed prompt weighting (DIPW) mechanism that adaptively adjusts the influence of scene and action prompts at each diffusion timestep by jointly considering CLIP-based alignment, narrative continuity, and temporal smoothness. To further enhance motion continuity, we propose a semantic action representation to encode high-level action semantics into the blending process, dynamically adjusting transitions based on action similarity, ensuring smooth yet adaptable motion changes. Latent space blending maintains spatial coherence between objects in a scene, while time-weighted blending enforces bidirectional constraints for temporal consistency. The resulting integrative system prevents abrupt transitions while ensuring fluid storytelling. Extensive experiments demonstrate significant improvements over baselines, achieving temporally consistent and visually compelling video narratives without any additional training. This approach bridges the gap between short clips and extended video to establish a new paradigm in GenAI-driven video synthesis from text.
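One plausible form of the bidirectional time-weighted blending between adjacent segment latents (the abstract does not specify the weighting schedule, so the linear ramp here is an assumption):

```python
import torch

def time_weighted_blend(z_prev, z_next, t, T):
    # Frames early in the overlap stay close to the previous segment's latent,
    # late frames to the next segment's, giving a smooth bidirectional hand-off.
    w = t / max(T - 1, 1)  # 0 -> previous segment, 1 -> next segment
    return (1.0 - w) * z_prev + w * z_next
```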
[474] Learning Disentangled Stain and Structural Representations for Semi-Supervised Histopathology Segmentation
Ha-Hieu Pham, Nguyen Lan Vi Vu, Thanh-Huy Nguyen, Ulas Bagci, Min Xu, Trung-Nghia Le, Huy-Hieu Pham
Main category: cs.CV
TL;DR: CSDS is a semi-supervised framework for gland segmentation in histopathology images, using dual-student networks to handle stain and structure variability, achieving state-of-the-art results.
Details
Motivation: Variability in H&E staining and tissue morphology, along with limited annotated data, challenges automated gland segmentation.
Method: CSDS employs two student networks (stain-augmented and structure-augmented) supervised by a shared teacher network with EMA and uncertainty estimation modules.
Result: CSDS improves Dice scores by up to 1.2% on GlaS and 1.4% on CRAG in low-label settings.
Conclusion: CSDS effectively addresses stain and structure variability, outperforming existing methods in gland segmentation with limited labeled data.
Abstract: Accurate gland segmentation in histopathology images is essential for cancer diagnosis and prognosis. However, significant variability in Hematoxylin and Eosin (H&E) staining and tissue morphology, combined with limited annotated data, poses major challenges for automated segmentation. To address this, we propose Color-Structure Dual-Student (CSDS), a novel semi-supervised segmentation framework designed to learn disentangled representations of stain appearance and tissue structure. CSDS comprises two specialized student networks: one trained on stain-augmented inputs to model chromatic variation, and the other on structure-augmented inputs to capture morphological cues. A shared teacher network, updated via Exponential Moving Average (EMA), supervises both students through pseudo-labels. To further improve label reliability, we introduce stain-aware and structure-aware uncertainty estimation modules that adaptively modulate the contribution of each student during training. Experiments on the GlaS and CRAG datasets show that CSDS achieves state-of-the-art performance in low-label settings, with Dice score improvements of up to 1.2% on GlaS and 0.7% on CRAG at 5% labeled data, and 0.7% and 1.4% at 10%. Our code and pre-trained models are available at https://github.com/hieuphamha19/CSDS.
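The shared teacher follows standard Mean Teacher machinery: its weights are an exponential moving average of the student's. A sketch of that update (the momentum value, and how the two students are combined into it, are assumptions; the abstract does not say):

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # The teacher slowly tracks the student, smoothing out noisy updates.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)
```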
[475] Long-tailed Adversarial Training with Self-Distillation
Seungju Cho, Hongsin Lee, Changick Kim
Main category: cs.CV
TL;DR: The paper addresses adversarial robustness in long-tailed datasets, proposing a self-distillation technique using a balanced self-teacher model to improve tail class performance.
Details
Motivation: Adversarial training performs poorly on tail classes in long-tailed distributions due to data scarcity. Existing methods combine traditional long-tailed training with adversarial robustness, but this study aims to specifically enhance tail class performance.
Method: A novel self-distillation technique is introduced, leveraging a balanced self-teacher model trained on a balanced subset of the long-tailed dataset.
Result: The method achieves state-of-the-art performance in clean and robust accuracy, with significant improvements (e.g., 20.3% on CIFAR-10) for tail classes against PGD attacks.
Conclusion: The proposed self-distillation technique effectively advances adversarial robustness in long-tailed distributions, particularly for tail classes.
Abstract: Adversarial training significantly enhances adversarial robustness, yet superior performance is predominantly achieved on balanced datasets. Addressing adversarial robustness in the context of unbalanced or long-tailed distributions is considerably more challenging, mainly due to the scarcity of tail data instances. Previous research on adversarial robustness within long-tailed distributions has primarily focused on combining traditional long-tailed natural training with existing adversarial robustness methods. In this study, we provide an in-depth analysis of why adversarial training struggles to achieve high performance on tail classes in long-tailed distributions. Furthermore, we propose a simple yet effective solution to advance adversarial robustness on long-tailed distributions through a novel self-distillation technique. Specifically, this approach leverages a balanced self-teacher model, which is trained using a balanced dataset sampled from the original long-tailed dataset. Our extensive experiments demonstrate state-of-the-art performance in both clean and robust accuracy for long-tailed adversarial robustness, with significant improvements in tail class performance on various datasets. We improve the accuracy against PGD attacks for tail classes by 20.3, 7.1, and 3.8 percentage points on CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively, while achieving the highest robust accuracy.
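The balanced self-teacher is trained on a class-balanced subset drawn from the long-tailed data. A sketch of that sampling step, assuming the per-class quota is capped by the rarest class:

```python
import random
from collections import defaultdict

def balanced_subset(samples, labels, seed=0):
    # Draw an equal number of instances per class so the self-teacher sees
    # head and tail classes at the same rate.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for s, y in zip(samples, labels):
        by_class[y].append(s)
    quota = min(len(v) for v in by_class.values())
    return [(s, y) for y, items in by_class.items()
            for s in rng.sample(items, quota)]
```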
[476] SSVQ: Unleashing the Potential of Vector Quantization with Sign-Splitting
Shuaiting Li, Juncan Deng, Chenxuan Wang, Kedong Xu, Rongtao Deng, Hong Gu, Haibin Shen, Kejie Huang
Main category: cs.CV
TL;DR: SSVQ improves VQ by decoupling sign bits, enabling better fine-tuning and compression-accuracy trade-offs.
Details
Motivation: Address limitations of VQ in fine-tuning due to restricted weight updates.
Method: Introduces Sign-Splitting VQ (SSVQ), decoupling sign bits, clustering positive weights, and optimizing signs and codebook jointly.
Result: SSVQ outperforms VQ in compression-accuracy trade-off and achieves 3× speedup on hardware.
Conclusion: SSVQ is a superior VQ paradigm for weight compression, enhancing performance and efficiency.
Abstract: Vector Quantization (VQ) has emerged as a prominent weight compression technique, showcasing substantially lower quantization errors than uniform quantization across diverse models, particularly in extreme compression scenarios. However, its efficacy during fine-tuning is limited by the constraint of the compression format, where weight vectors assigned to the same codeword are restricted to updates in the same direction. Consequently, many quantized weights are compelled to move in directions contrary to their local gradient information. To mitigate this issue, we introduce a novel VQ paradigm, Sign-Splitting VQ (SSVQ), which decouples the sign bit of weights from the codebook. Our approach involves extracting the sign bits of uncompressed weights and performing clustering and compression on all-positive weights. We then introduce latent variables for the sign bit and jointly optimize both the signs and the codebook. Additionally, we implement a progressive freezing strategy for the learnable sign to ensure training stability. Extensive experiments on various modern models and tasks demonstrate that SSVQ achieves a significantly superior compression-accuracy trade-off compared to conventional VQ. Furthermore, we validate our algorithm on a hardware accelerator, showing that SSVQ achieves a 3$\times$ speedup over the 8-bit compressed model by reducing memory access. Our code is available at https://github.com/list0830/SSVQ.
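A sketch of the sign-splitting initialization: keep 1-bit signs separately and vector-quantize only the all-positive magnitudes. The joint sign/codebook optimization and progressive sign freezing that follow are omitted, and the subvector dimension and codebook size here are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def ssvq_init(weights, dim=4, k=256):
    # Split each weight subvector into signs (1 bit/weight) and magnitudes,
    # then cluster only the magnitudes into k codewords.
    v = weights.reshape(-1, dim)          # assumes size divisible by dim
    signs = np.sign(v).astype(np.int8)
    km = KMeans(n_clusters=k, n_init=10).fit(np.abs(v))
    return signs, km.labels_, km.cluster_centers_

# dequantization sketch: rows of w_hat are signs * centers[labels]
```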
[477] Occlusion-Aware Temporally Consistent Amodal Completion for 3D Human-Object Interaction Reconstruction
Hyungjun Doh, Dong In Lee, Seunggeun Chi, Pin-Hao Huang, Kwonjoon Lee, Sangpil Kim, Karthik Ramani
Main category: cs.CV
TL;DR: A new framework for dynamic human-object interaction reconstruction from monocular video, addressing occlusions and temporal inconsistencies using amodal completion and temporal context.
Details
Motivation: Traditional 3D reconstruction methods fail under occlusions and dynamic conditions, necessitating a more robust solution.
Method: Leverages amodal completion and temporal context for coherent, incremental refinement without predefined models.
Result: Superior precision in handling occlusions and temporal stability, validated with 3D Gaussian Splatting.
Conclusion: The framework effectively reconstructs dynamic scenes with occlusions, outperforming existing techniques.
Abstract: We introduce a novel framework for reconstructing dynamic human-object interactions from monocular video that overcomes challenges associated with occlusions and temporal inconsistencies. Traditional 3D reconstruction methods typically assume static objects or full visibility of dynamic subjects, leading to degraded performance when these assumptions are violated-particularly in scenarios where mutual occlusions occur. To address this, our framework leverages amodal completion to infer the complete structure of partially obscured regions. Unlike conventional approaches that operate on individual frames, our method integrates temporal context, enforcing coherence across video sequences to incrementally refine and stabilize reconstructions. This template-free strategy adapts to varying conditions without relying on predefined models, significantly enhancing the recovery of intricate details in dynamic scenes. We validate our approach using 3D Gaussian Splatting on challenging monocular videos, demonstrating superior precision in handling occlusions and maintaining temporal stability compared to existing techniques.
[478] SuperCarver: Texture-Consistent 3D Geometry Super-Resolution for High-Fidelity Surface Detail Generation
Qijian Zhang, Xiaozheng Jian, Xuan Zhang, Wenping Wang, Junhui Hou
Main category: cs.CV
TL;DR: SuperCarver is a 3D geometry super-resolution pipeline that enhances coarse meshes with realistic surface details using a prior-guided normal diffusion model and noise-resistant inverse rendering.
Details
Motivation: The manual process of creating high-precision 3D meshes is labor-intensive, and existing AI methods struggle with synthesizing realistic surface details. SuperCarver aims to improve geometry fidelity for existing low-quality meshes.
Method: The pipeline involves rendering the coarse mesh into images, using a fine-tuned normal diffusion model for detail enhancement, and updating the mesh with a deformable distance field-based inverse rendering scheme.
Result: SuperCarver successfully generates realistic and expressive surface details, improving low-quality 3D assets and reducing manual sculpting effort.
Conclusion: SuperCarver is an effective tool for enhancing 3D mesh quality, bridging the gap between low-quality assets and high-precision manual sculpting.
Abstract: Conventional production workflow of high-precision mesh assets necessitates a cumbersome and laborious process of manual sculpting by specialized 3D artists/modelers. The recent years have witnessed remarkable advances in AI-empowered 3D content creation for generating plausible structures and intricate appearances from images or text prompts. However, synthesizing realistic surface details still poses great challenges, and enhancing the geometry fidelity of existing lower-quality 3D meshes (instead of image/text-to-3D generation) remains an open problem. In this paper, we introduce SuperCarver, a 3D geometry super-resolution pipeline for supplementing texture-consistent surface details onto a given coarse mesh. We start by rendering the original textured mesh into the image domain from multiple viewpoints. To achieve detail boosting, we construct a deterministic prior-guided normal diffusion model, which is fine-tuned on a carefully curated dataset of paired detail-lacking and detail-rich normal map renderings. To update mesh surfaces from potentially imperfect normal map predictions, we design a noise-resistant inverse rendering scheme through deformable distance field. Experiments demonstrate that our SuperCarver is capable of generating realistic and expressive surface details depicted by the actual texture appearance, making it a powerful tool to both upgrade historical low-quality 3D assets and reduce the workload of sculpting high-poly meshes.
[479] TACO: Taming Diffusion for in-the-wild Video Amodal Completion
Ruijie Lu, Yixin Chen, Yu Liu, Jiaxiang Tang, Junfeng Ni, Diwen Wan, Gang Zeng, Siyuan Huang
Main category: cs.CV
TL;DR: The paper introduces TACO, a conditional diffusion model for Video Amodal Completion (VAC), leveraging pre-trained video diffusion models to consistently complete objects in videos.
Details
Motivation: Existing models struggle with completing partially observable objects consistently in unstructured, in-the-wild videos, motivating the need for a robust solution.
Method: TACO repurposes pre-trained video diffusion models, uses a synthetic dataset with progressive fine-tuning, and generalizes to diverse real-world scenarios.
Result: TACO demonstrates versatility on in-the-wild videos and unseen datasets, and aids downstream tasks like object reconstruction and pose estimation.
Conclusion: TACO effectively addresses VAC, showcasing potential for enhancing physical world understanding and reasoning.
Abstract: Humans can infer complete shapes and appearances of objects from limited visual cues, relying on extensive prior knowledge of the physical world. However, completing partially observable objects while ensuring consistency across video frames remains challenging for existing models, especially for unstructured, in-the-wild videos. This paper tackles the task of Video Amodal Completion (VAC), which aims to generate the complete object consistently throughout the video given a visual prompt specifying the object of interest. Leveraging the rich, consistent manifolds learned by pre-trained video diffusion models, we propose a conditional diffusion model, TACO, that repurposes these manifolds for VAC. To enable its effective and robust generalization to challenging in-the-wild scenarios, we curate a large-scale synthetic dataset with multiple difficulty levels by systematically imposing occlusions onto un-occluded videos. Building on this, we devise a progressive fine-tuning paradigm that starts with simpler recovery tasks and gradually advances to more complex ones. We demonstrate TACO’s versatility on a wide range of in-the-wild videos from the Internet, as well as on diverse, unseen datasets commonly used in autonomous driving, robotic manipulation, and scene understanding. Moreover, we show that TACO can be effectively applied to various downstream tasks like object reconstruction and pose estimation, highlighting its potential to facilitate physical world understanding and reasoning. Our project page is available at https://jason-aplp.github.io/TACO.
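TACO's training pairs come from systematically imposing occlusions onto clean videos, so the un-occluded frame is a free supervision target. A minimal alpha-compositing sketch of that step (occluder placement and difficulty scheduling omitted):

```python
import numpy as np

def impose_occlusion(frame, occluder_rgba, x, y):
    # Composite a synthetic RGBA occluder onto a clean frame; the original
    # frame then serves as the amodal-completion target for the occluded input.
    h, w = occluder_rgba.shape[:2]
    alpha = occluder_rgba[..., 3:4].astype(np.float32) / 255.0
    roi = frame[y:y + h, x:x + w].astype(np.float32)
    blended = alpha * occluder_rgba[..., :3] + (1.0 - alpha) * roi
    frame[y:y + h, x:x + w] = blended.astype(frame.dtype)
    return frame
```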
[480] Adaptive Label Correction for Robust Medical Image Segmentation with Noisy Labels
Chengxuan Qian, Kai Han, Jianxia Ding, Lei Liu, Chongwen Lyu, Zhenlong Yuan, Jun Chen, Zhe Liu
Main category: cs.CV
TL;DR: Proposes a Mean Teacher-based Adaptive Label Correction (ALC) framework for robust medical image segmentation with noisy labels, improving performance despite label noise.
Details
Motivation: Deep learning in medical image analysis is limited by the need for large, high-quality labeled datasets. Noisy labels are easier to obtain but degrade model performance.
Method: Uses a Mean Teacher architecture for consistent learning, adaptive label refinement, and sample-level uncertainty-based label selection to enhance noisy labels and prioritize high-confidence samples.
Result: Demonstrates significant improvements in segmentation performance on two public datasets, outperforming state-of-the-art methods.
Conclusion: The ALC framework effectively handles noisy labels, adapts to challenges, and achieves competitive results, advancing robust medical image segmentation.
Abstract: Deep learning has shown remarkable success in medical image analysis, but its reliance on large volumes of high-quality labeled data limits its applicability. While noisy labeled data are easier to obtain, directly incorporating them into training can degrade model performance. To address this challenge, we propose a Mean Teacher-based Adaptive Label Correction (ALC) self-ensemble framework for robust medical image segmentation with noisy labels. The framework leverages the Mean Teacher architecture to ensure consistent learning under noise perturbations. It includes an adaptive label refinement mechanism that dynamically captures and weights differences across multiple disturbance versions to enhance the quality of noisy labels. Additionally, a sample-level uncertainty-based label selection algorithm is introduced to prioritize high-confidence samples for network updates, mitigating the impact of noisy annotations. Consistency learning is integrated to align the predictions of the student and teacher networks, further enhancing model robustness. Extensive experiments on two public datasets demonstrate the effectiveness of the proposed framework, showing significant improvements in segmentation performance. By fully exploiting the strengths of the Mean Teacher structure, the ALC framework effectively processes noisy labels, adapts to challenging scenarios, and achieves competitive results compared to state-of-the-art methods.
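One natural reading of the sample-level uncertainty module is disagreement across predictions under multiple disturbances, with only confident samples admitted to the network update. A sketch under that assumption:

```python
import torch

def confident_mask(perturbed_preds, threshold):
    # perturbed_preds: list of K (B, C, H, W) predictions under K disturbances;
    # variance across versions serves as a per-sample uncertainty proxy.
    stack = torch.stack(perturbed_preds)                # (K, B, C, H, W)
    uncertainty = stack.var(dim=0).mean(dim=(1, 2, 3))  # (B,)
    return uncertainty < threshold                      # keep confident samples
```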
[481] Exploring 3D Reasoning-Driven Planning: From Implicit Human Intentions to Route-Aware Activity Planning
Xueying Jiang, Wenhao Li, Xiaoqin Zhang, Ling Shao, Shijian Lu
Main category: cs.CV
TL;DR: The paper introduces 3D Reasoning-Driven Planning to address challenges in implicit instruction reasoning and inter-step route planning in 3D task planning, proposing a benchmark and framework for improved performance.
Details
Motivation: Existing 3D task planning studies rely heavily on explicit instructions and neglect inter-step route planning, limiting their effectiveness in human-robot interaction and embodied AI.
Method: The authors propose a novel framework with progressive plan generation and dynamic scene graphs, alongside a large-scale benchmark (ReasonPlan3D) for evaluation.
Result: Experiments show the framework effectively reasons implicit instructions, generates accurate stepwise plans, and integrates route planning for multi-step moves.
Conclusion: The proposed benchmark and framework advance 3D task planning by addressing implicit reasoning and route planning, with plans to release the dataset and code.
Abstract: 3D task planning has attracted increasing attention in human-robot interaction and embodied AI thanks to the recent advances in multimodal learning. However, most existing studies are facing two common challenges: 1) heavy reliance on explicit instructions with little reasoning on implicit user intention; 2) negligence of inter-step route planning on robot moves. We address the above challenges by proposing 3D Reasoning-Driven Planning, a novel 3D task that reasons the intended activities from implicit instructions and decomposes them into steps with inter-step routes and planning under the guidance of fine-grained 3D object shapes and locations from scene segmentation. We tackle the new 3D task from two perspectives. First, we construct ReasonPlan3D, a large-scale benchmark that covers diverse 3D scenes with rich implicit instructions and detailed annotations for multi-step task planning, inter-step route planning, and fine-grained segmentation. Second, we design a novel framework that introduces progressive plan generation with contextual consistency across multiple steps, as well as a scene graph that is updated dynamically for capturing critical objects and their spatial relations. Extensive experiments demonstrate the effectiveness of our benchmark and framework in reasoning activities from implicit human instructions, producing accurate stepwise task plans and seamlessly integrating route planning for multi-step moves. The dataset and code will be released.
[482] Inversion-DPO: Precise and Efficient Post-Training for Diffusion Models
Zejian Li, Yize Li, Chenye Meng, Zhongni Liu, Yang Ling, Shengyuan Zhang, Guang Yang, Changyuan Yang, Zhiyuan Yang, Lingyun Sun
Main category: cs.CV
TL;DR: Inversion-DPO is a novel alignment framework for diffusion models that avoids reward modeling by using DDIM inversion, improving training efficiency and precision.
Details
Motivation: Existing alignment methods for diffusion models are computationally intensive and may reduce accuracy and efficiency.
Method: Inversion-DPO reformulates Direct Preference Optimization (DPO) with DDIM inversion, eliminating the need for reward models.
Result: The method shows significant performance improvements in text-to-image and compositional image generation tasks.
Conclusion: Inversion-DPO offers an efficient, high-precision alignment approach for diffusion models, enhancing their applicability to complex tasks.
Abstract: Recent advancements in diffusion models (DMs) have been propelled by alignment methods that post-train models to better conform to human preferences. However, these approaches typically require computation-intensive training of a base model and a reward model, which not only incurs substantial computational overhead but may also compromise model accuracy and training efficiency. To address these limitations, we propose Inversion-DPO, a novel alignment framework that circumvents reward modeling by reformulating Direct Preference Optimization (DPO) with DDIM inversion for DMs. Our method conducts intractable posterior sampling in Diffusion-DPO with the deterministic inversion from winning and losing samples to noise and thus derives a new post-training paradigm. This paradigm eliminates the need for auxiliary reward models or inaccurate approximation, significantly enhancing both precision and efficiency of training. We apply Inversion-DPO to a basic task of text-to-image generation and a challenging task of compositional image generation. Extensive experiments show substantial performance improvements achieved by Inversion-DPO compared to existing post-training methods and highlight the ability of the trained generative models to generate high-fidelity compositionally coherent images. For the post-training of compositional image generation, we curate a paired dataset consisting of 11,140 images with complex structural annotations and comprehensive scores, designed to enhance the compositional capabilities of generative models. Inversion-DPO explores a new avenue for efficient, high-precision alignment in diffusion models, advancing their applicability to complex realistic generation tasks. Our code is available at https://github.com/MIGHTYEZ/Inversion-DPO
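For context, the standard DPO objective that Inversion-DPO reformulates contrasts a preferred (winning) sample $x^w$ against a dispreferred (losing) one $x^l$:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x^w, x^l)}\left[\log \sigma\!\left(\beta\left(\log \frac{p_\theta(x^w)}{p_{\mathrm{ref}}(x^w)} - \log \frac{p_\theta(x^l)}{p_{\mathrm{ref}}(x^l)}\right)\right)\right]$$

In diffusion models the likelihood terms are intractable; per the abstract, Inversion-DPO's move is to recover each sample's noise trajectory with deterministic DDIM inversion rather than sampling or approximating it.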
[483] Uncertainty-Aware Knowledge Distillation for Compact and Efficient 6DoF Pose Estimation
Nassim Ali Ousalah, Anis Kacem, Enjie Ghorbel, Emmanuel Koumandakis, Djamila Aouada
Main category: cs.CV
TL;DR: A novel uncertainty-aware Knowledge Distillation (KD) framework improves 6DoF object pose estimation by leveraging teacher model uncertainty to enhance student model accuracy and compactness.
Details
Motivation: Efficient and accurate 6DoF pose estimation is vital for robotics, AR, and space navigation, requiring lightweight models for real-time performance.
Method: Proposes an uncertainty-aware KD strategy aligning student and teacher predictions based on teacher keypoint uncertainty, transferring knowledge at key feature map locations.
Result: Outperforms state-of-the-art on LINEMOD and SPEED+ datasets, achieving superior accuracy with lightweight models.
Conclusion: The uncertainty-aware KD framework effectively enhances 6DoF pose estimation, proving robust across diverse scenarios.
Abstract: Compact and efficient 6DoF object pose estimation is crucial in applications such as robotics, augmented reality, and space autonomous navigation systems, where lightweight models are critical for real-time accurate performance. This paper introduces a novel uncertainty-aware end-to-end Knowledge Distillation (KD) framework focused on keypoint-based 6DoF pose estimation. Keypoints predicted by a large teacher model exhibit varying levels of uncertainty that can be exploited within the distillation process to enhance the accuracy of the student model while ensuring its compactness. To this end, we propose a distillation strategy that aligns the student and teacher predictions by adjusting the knowledge transfer based on the uncertainty associated with each teacher keypoint prediction. Additionally, the proposed KD leverages this uncertainty-aware alignment of keypoints to transfer the knowledge at key locations of their respective feature maps. Experiments on the widely-used LINEMOD benchmark demonstrate the effectiveness of our method, achieving superior 6DoF object pose estimation with lightweight models compared to state-of-the-art approaches. Further validation on the SPEED+ dataset for spacecraft pose estimation highlights the robustness of our approach under diverse 6DoF pose estimation scenarios.
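A common heteroscedastic way to realize "adjusting knowledge transfer based on teacher keypoint uncertainty" is to down-weight uncertain keypoints in the distillation loss; the inverse-variance weighting below is an assumption, not the paper's stated formula:

```python
import torch

def uncertainty_weighted_kd(student_kp, teacher_kp, teacher_var, eps=1e-6):
    # Keypoints the teacher is unsure about (high variance) contribute less
    # to the student's distillation loss.
    w = 1.0 / (teacher_var + eps)
    return (w * (student_kp - teacher_kp) ** 2).mean()
```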
[484] Web Artifact Attacks Disrupt Vision Language Models
Maan Qraitem, Piotr Teterwak, Kate Saenko, Bryan A. Plummer
Main category: cs.CV
TL;DR: The paper introduces ‘artifact-based’ attacks on vision-language models (VLMs), exploiting non-matching text and graphical elements to mislead models, and proposes defenses using artifact-aware prompting.
Details
Motivation: VLMs learn unintended correlations from web data, degrading accuracy. Prior attacks focused on exact text matches, but broader correlations (e.g., branding) remain unexplored.
Method: Introduces ‘artifact-based’ attacks as a search problem, testing them on five datasets. Defends using artifact-aware prompting in graphical settings.
Result: Artifact attacks achieve up to 100% success, transfer across models (90% effectiveness). Defenses reduce success rates by up to 15%.
Conclusion: Artifact attacks reveal vulnerabilities in VLMs, but artifact-aware prompting offers a promising defense direction.
Abstract: Vision-language models (VLMs) (e.g. CLIP, LLaVA) are trained on large-scale, lightly curated web datasets, leading them to learn unintended correlations between semantic concepts and unrelated visual signals. These associations degrade model accuracy by causing predictions to rely on incidental patterns rather than genuine visual understanding. Prior work has weaponized these correlations as an attack vector to manipulate model predictions, such as inserting deceiving class text onto the image in a “typographic” attack. These attacks succeed due to VLMs’ text-heavy bias, a result of captions that echo visible words rather than describing content. However, these attacks have focused solely on text that matches the target class exactly, overlooking a broader range of correlations, including non-matching text and graphical symbols, which arise from the abundance of branding content in web-scale data. To address this gap, we introduce “artifact-based” attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements. Unlike typographic attacks, these artifacts are not predefined, making them simultaneously harder to defend against and more challenging to find. We address this by framing artifact attacks as a search problem and demonstrate their effectiveness across five datasets, with some artifacts reinforcing each other to reach 100% attack success rates. These attacks transfer across models with up to 90% effectiveness, making it possible to attack unseen models. To defend against these attacks, we extend prior work’s artifact-aware prompting to the graphical setting. We see a moderate reduction of success rates of up to 15% relative to standard prompts, suggesting a promising direction for enhancing model robustness. Code: https://github.com/mqraitem/Web-Artifact-Attacks
[485] Multi-Object Sketch Animation by Scene Decomposition and Motion Planning
Jingyu Liu, Zijie Xin, Yuhan Fu, Ruixiang Zhao, Bangxiang Lan, Xirong Li
Main category: cs.CV
TL;DR: MoSketch introduces a method for multi-object sketch animation, addressing challenges like object-aware motion modeling and complex motion optimization through iterative optimization and novel modules.
Details
Motivation: Current sketch animation methods excel in single-object scenarios but fail in multi-object cases, prompting the need for a solution like MoSketch.
Method: MoSketch uses Score Distillation Sampling (SDS) and includes four modules: LLM-based scene decomposition, motion planning, multi-grained motion refinement, and compositional SDS.
Result: MoSketch outperforms existing methods in multi-object sketch animation, as shown by qualitative and quantitative experiments.
Conclusion: MoSketch pioneers multi-object sketch animation, offering new research and application possibilities.
Abstract: Sketch animation, which brings static sketches to life by generating dynamic video sequences, has found widespread applications in GIF design, cartoon production, and daily entertainment. While current methods for sketch animation perform well in single-object sketch animation, they struggle in multi-object scenarios. By analyzing their failures, we identify two major challenges of transitioning from single-object to multi-object sketch animation: object-aware motion modeling and complex motion optimization. For multi-object sketch animation, we propose MoSketch based on iterative optimization through Score Distillation Sampling (SDS) and thus animating a multi-object sketch in a training-data free manner. To tackle the two challenges in a divide-and-conquer strategy, MoSketch has four novel modules, i.e., LLM-based scene decomposition, LLM-based motion planning, multi-grained motion refinement, and compositional SDS. Extensive qualitative and quantitative experiments demonstrate the superiority of our method over existing sketch animation approaches. MoSketch takes a pioneering step towards multi-object sketch animation, opening new avenues for future research and applications.
[486] LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings
Chengan Che, Chao Wang, Tom Vercauteren, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera
Main category: cs.CV
TL;DR: The paper introduces LEMON, a large-scale surgical video dataset, and LemonFM, a foundation model pretrained on LEMON, which outperforms existing models in multiple surgical tasks.
Details
Motivation: Existing surgical datasets are small and limit model generalization. LEMON addresses this by providing a vast, diverse collection of surgical videos.
Method: LEMON is compiled using a novel aggregation pipeline for high-resolution videos. LemonFM is pretrained on LEMON using self-supervised augmented knowledge distillation.
Result: LemonFM achieves significant improvements in surgical phase recognition, action recognition, tool detection, and semantic segmentation across multiple datasets.
Conclusion: LEMON and LemonFM serve as foundational resources for advancing autonomous robotic surgery, improving surgical care accessibility and safety.
Abstract: Traditional open-access datasets focusing on surgical procedures are often limited by their small size, typically consisting of fewer than 100 videos and less than 30 hours of footage, which leads to poor model generalization. To address this constraint, a new dataset called LEMON has been compiled using a novel aggregation pipeline that collects high-resolution videos from online sources. Featuring an extensive collection of over 4K surgical videos totaling 938 hours (85 million frames) of high-quality footage across multiple procedure types, LEMON offers a comprehensive resource surpassing existing alternatives in size and scope, including two novel downstream tasks. To demonstrate the effectiveness of this diverse dataset, we introduce LemonFM, a foundation model pretrained on LEMON using a novel self-supervised augmented knowledge distillation approach. LemonFM consistently outperforms existing surgical foundation models across four downstream tasks and six datasets, achieving significant gains in surgical phase recognition (+9.5pp, +9.4pp, and +8.4pp of Jaccard in AutoLaparo, M2CAI16, and Cholec80), surgical action recognition (+4.4pp of mAP in CholecT50), surgical tool presence detection (+5.3pp and +10.2pp of mAP in Cholec80 and GraSP), and surgical semantic segmentation (+8.3pp of mDice in CholecSeg8k). LEMON and LemonFM will serve as foundational resources for the research community and industry, accelerating progress in developing autonomous robotic surgery systems and ultimately contributing to safer and more accessible surgical care worldwide.
[487] Scaling Vision Pre-Training to 4K Resolution
Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, Hongxu Yin
Main category: cs.CV
TL;DR: PS3 scales CLIP-style vision pre-training to 4K resolution with near-constant cost, enabling high-resolution representation learning. The resulting model, VILA-HD, outperforms baselines in efficiency and performance.
Details
Motivation: Current vision pre-training is limited to low resolutions due to computational costs, hindering high-resolution visual perception.
Method: PS3 pre-trains by selectively processing local regions and contrasting them with detailed captions, reducing computational overhead. VILA-HD applies this to multi-modal LLMs.
Result: VILA-HD improves high-resolution perception, uses fewer tokens, and outperforms state-of-the-art models in benchmarks.
Conclusion: PS3 and VILA-HD advance high-resolution vision pre-training, with VILA-HD excelling in 4K benchmarks, motivating the need for higher-resolution benchmarks like 4KPro.
Abstract: High-resolution perception of visual details is crucial for daily tasks. Current vision pre-training, however, is still limited to low resolutions (e.g., 378 x 378 pixels) due to the quadratic cost of processing larger images. We introduce PS3 that scales CLIP-style vision pre-training to 4K resolution with a near-constant cost. Instead of contrastive learning on global image representation, PS3 is pre-trained by selectively processing local regions and contrasting them with local detailed captions, enabling high-resolution representation learning with greatly reduced computational overhead. The pre-trained PS3 is able to both encode the global image at low resolution and selectively process local high-resolution regions based on their saliency or relevance to a text prompt. When applying PS3 to multi-modal LLM (MLLM), the resulting model, named VILA-HD, significantly improves high-resolution visual perception compared to baselines without high-resolution vision pre-training such as AnyRes and S^2 while using up to 4.3x fewer tokens. PS3 also unlocks appealing scaling properties of VILA-HD, including scaling up resolution for free and scaling up test-time compute for better performance. Compared to the state of the art, PS3 and VILA-HD outperform previous vision encoders (e.g., SigLIP2 and Perception Encoder) and MLLMs (e.g., NVILA and Qwen2.5-VL) respectively across multiple benchmarks and achieve better efficiency than latest token pruning approaches. Finally, we find current benchmarks do not require 4K-resolution perception, which motivates us to propose 4KPro, a new benchmark of image QA at 4K resolution, on which VILA-HD outperforms all previous MLLMs, including a 16.1% improvement over GPT-4o and a 7.5% improvement and 1.67x speedup over Qwen2.5-VL.
[488] Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
Jiawei Wang, Yushen Zuo, Yuanjun Chai, Zhendong Liu, Yicheng Fu, Yichun Feng, Kin-Man Lam
Main category: cs.CV
TL;DR: The paper introduces Robust-VLGuard and DiffPure-VLM to address vulnerabilities in Vision-Language Models (VLMs) against noise-augmented and adversarial attacks.
Details
Motivation: VLMs are vulnerable to jailbreak attacks via noisy or corrupted images, and existing security measures overlook noise-augmented training.
Method: Proposes Robust-VLGuard (a multimodal safety dataset) and DiffPure-VLM (using diffusion models to defend against adversarial perturbations).
Result: Noise-augmented fine-tuning reduces attack success rates, and DiffPure-VLM effectively mitigates adversarial perturbations.
Conclusion: The proposed methods enhance VLM security against noise and adversarial attacks while preserving functionality.
Abstract: Vision-Language Models (VLMs) extend the capabilities of Large Language Models (LLMs) by incorporating visual information, yet they remain vulnerable to jailbreak attacks, especially when processing noisy or corrupted images. Although existing VLMs adopt security measures during training to mitigate such attacks, vulnerabilities associated with noise-augmented visual inputs are overlooked. In this work, we identify that missing noise-augmented training causes critical security gaps: many VLMs are susceptible to even simple perturbations such as Gaussian noise. To address this challenge, we propose Robust-VLGuard, a multimodal safety dataset with aligned / misaligned image-text pairs, combined with noise-augmented fine-tuning that reduces attack success rates while preserving the functionality of the VLM. For stronger optimization-based visual perturbation attacks, we propose DiffPure-VLM, leveraging diffusion models to convert adversarial perturbations into Gaussian-like noise, which can be defended by VLMs with noise-augmented safety fine-tuning. Experimental results demonstrate that the distribution-shifting property of the diffusion model aligns well with our fine-tuned VLMs, significantly mitigating adversarial perturbations across varying intensities. The dataset and code are available at https://github.com/JarvisUSTC/DiffPure-RobustVLM.
[489] Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization
Kangle Deng, Hsueh-Ti Derek Liu, Yiheng Zhu, Xiaoxia Sun, Chong Shang, Kiran Bhat, Deva Ramanan, Jun-Yan Zhu, Maneesh Agrawala, Tinghui Zhou
Main category: cs.CV
TL;DR: The paper introduces Octree-based Adaptive Tokenization for 3D generative models, improving efficiency and quality by adapting latent representation size to shape complexity.
Details
Motivation: Existing VAE-based 3D generative models use fixed-size tokens, leading to inefficient representations that hinder generation quality.Method: Proposes an adaptive octree structure with quadric-error-based subdivision and a query-based transformer for variable-sized latent vectors.
Result: Reduces token counts by 50% while maintaining visual quality; achieves higher-quality shapes with similar token lengths.
Conclusion: The method enhances 3D content generation with more detail and diversity compared to fixed-size approaches.
Abstract: Many 3D generative models rely on variational autoencoders (VAEs) to learn compact shape representations. However, existing methods encode all shapes into a fixed-size token, disregarding the inherent variations in scale and complexity across 3D data. This leads to inefficient latent representations that can compromise downstream generation. We address this challenge by introducing Octree-based Adaptive Tokenization, a novel framework that adjusts the dimension of latent representations according to shape complexity. Our approach constructs an adaptive octree structure guided by a quadric-error-based subdivision criterion and allocates a shape latent vector to each octree cell using a query-based transformer. Building upon this tokenization, we develop an octree-based autoregressive generative model that effectively leverages these variable-sized representations in shape generation. Extensive experiments demonstrate that our approach reduces token counts by 50% compared to fixed-size methods while maintaining comparable visual quality. When using a similar token length, our method produces significantly higher-quality shapes. When incorporated with our downstream generative model, our method creates more detailed and diverse 3D content than existing approaches.
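A minimal sketch of error-driven octree subdivision, the kind of adaptive tokenization the abstract describes: a cell is split only while its fitting error exceeds a threshold, so complex shapes end up with more leaves (and thus more latent tokens) than simple ones. The cell structure and error hook below are our own stand-ins, not the paper's code.

```python
from dataclasses import dataclass, field

@dataclass
class OctreeCell:
    center: tuple          # (x, y, z) cell center
    half: float            # half edge length
    children: list = field(default_factory=list)

def subdivide(cell, error_fn, max_depth, tau, depth=0):
    """Split a cell into 8 octants while its fitting error exceeds tau."""
    if depth >= max_depth or error_fn(cell) <= tau:
        return [cell]                      # leaf: later assigned one shape token
    leaves = []
    h = cell.half / 2
    for dx in (-h, h):
        for dy in (-h, h):
            for dz in (-h, h):
                child = OctreeCell(
                    (cell.center[0] + dx, cell.center[1] + dy, cell.center[2] + dz), h)
                cell.children.append(child)
                leaves += subdivide(child, error_fn, max_depth, tau, depth + 1)
    return leaves    # variable-length leaf list -> variable-sized token set
```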
[490] Scaling Laws for Native Multimodal Models
Mustafa Shukor, Enrico Fini, Victor Guilherme Turrisi da Costa, Matthieu Cord, Joshua Susskind, Alaaeldin El-Nouby
Main category: cs.CV
TL;DR: The paper compares late-fusion and early-fusion architectures for multimodal models, finding early-fusion superior in performance, efficiency, and deployability, especially with Mixture of Experts (MoEs).
Details
Motivation: To determine if late-fusion architectures are inherently better than early-fusion for native multimodal models (NMMs).Method: Extensive scaling study with 457 models, comparing architectures and training mixtures, including MoEs.
Result: Early-fusion outperforms late-fusion in lower parameter counts, efficiency, and deployability; MoEs further enhance performance.
Conclusion: Early-fusion architectures are more effective for NMMs, especially when combined with MoEs.
Abstract: Building general-purpose models that can effectively perceive the world through multimodal signals has been a long-standing goal. Current approaches involve integrating separately pre-trained components, such as connecting vision encoders to LLMs and continuing multimodal training. While such approaches exhibit remarkable sample efficiency, it remains an open question whether such late-fusion architectures are inherently superior. In this work, we revisit the architectural design of native multimodal models (NMMs)-those trained from the ground up on all modalities-and conduct an extensive scaling laws study, spanning 457 trained models with different architectures and training mixtures. Our investigation reveals no inherent advantage to late-fusion architectures over early-fusion ones, which do not rely on image encoders or tokenizers. On the contrary, early-fusion exhibits stronger performance at lower parameter counts, is more efficient to train, and is easier to deploy. Motivated by the strong performance of the early-fusion architectures, we show that incorporating Mixture of Experts (MoEs) allows models to learn modality-specific weights, significantly benefiting performance.
[491] LayoutCoT: Unleashing the Deep Reasoning Potential of Large Language Models for Layout Generation
Hengyu Shi, Junhao Su, Junfeng Luo, Jialin Gao
Main category: cs.CV
TL;DR: LayoutCoT combines Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) to generate high-quality layouts without training, outperforming specialized models.
Details
Motivation: Existing methods for conditional layout generation require extensive training or lack reasoning capabilities, limiting their practicality and quality.Method: LayoutCoT serializes layouts for LLMs, uses Layout-aware RAG for retrieval, and refines layouts iteratively with CoT reasoning.
Result: LayoutCoT achieves state-of-the-art performance on five datasets without training, even surpassing deep-reasoning models.
Conclusion: The approach demonstrates the potential of LLMs for layout generation by enhancing their reasoning capabilities without training.
Abstract: Conditional layout generation aims to automatically generate visually appealing and semantically coherent layouts from user-defined constraints. While recent methods based on generative models have shown promising results, they typically require substantial amounts of training data or extensive fine-tuning, limiting their versatility and practical applicability. Alternatively, some training-free approaches leveraging in-context learning with Large Language Models (LLMs) have emerged, but they often suffer from limited reasoning capabilities and overly simplistic ranking mechanisms, which restrict their ability to generate consistently high-quality layouts. To this end, we propose LayoutCoT, a novel approach that leverages the reasoning capabilities of LLMs through a combination of Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) techniques. Specifically, LayoutCoT transforms layout representations into a standardized serialized format suitable for processing by LLMs. A Layout-aware RAG is used to facilitate effective retrieval and generate a coarse layout by LLMs. This preliminary layout, together with the selected exemplars, is then fed into a specially designed CoT reasoning module for iterative refinement, significantly enhancing both semantic coherence and visual quality. We conduct extensive experiments on five public datasets spanning three conditional layout generation tasks. Experimental results demonstrate that LayoutCoT achieves state-of-the-art performance without requiring training or fine-tuning. Notably, our CoT reasoning module enables standard LLMs, even those without explicit deep reasoning abilities, to outperform specialized deep-reasoning models such as deepseek-R1, highlighting the potential of our approach in unleashing the deep reasoning capabilities of LLMs for layout generation tasks.
[492] Evaluating Deepfake Detectors in the Wild
Viacheslav Pirogov, Maksim Artemev
Main category: cs.CV
TL;DR: The paper evaluates modern deepfake detectors using a novel real-world testing procedure and a large dataset of 500,000+ deepfake images, finding detection remains challenging with most detectors performing poorly (AUC <60%).
Details
Motivation: Deepfakes pose a growing threat to digital media authenticity, but existing detectors lack real-world testing.Method: A novel testing procedure mimics real-world scenarios, using state-of-the-art deepfake generation to create a large dataset.
Result: Fewer than half of detectors achieved AUC >60%, with basic image manipulations (e.g., JPEG compression) reducing performance.
Conclusion: Deepfake detection is still challenging; performance drops under real-world conditions, highlighting the need for better solutions.
Abstract: Deepfakes powered by advanced machine learning models present a significant and evolving threat to identity verification and the authenticity of digital media. Although numerous detectors have been developed to address this problem, their effectiveness has yet to be tested when applied to real-world data. In this work we evaluate modern deepfake detectors, introducing a novel testing procedure designed to mimic real-world scenarios for deepfake detection. Using state-of-the-art deepfake generation methods, we create a comprehensive dataset containing more than 500,000 high-quality deepfake images. Our analysis shows that detecting deepfakes still remains a challenging task. The evaluation shows that fewer than half of the deepfake detectors tested achieved an AUC score greater than 60%, with the lowest being 50%. We demonstrate that basic image manipulations, such as JPEG compression or image enhancement, can significantly reduce model performance. All code and data are publicly available at https://github.com/SumSubstance/Deepfake-Detectors-in-the-Wild.
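The JPEG-compression finding is easy to reproduce in spirit: re-encode each test image at reduced quality and compare the detector's AUC before and after. The sketch below assumes `detector` is any callable returning a per-image fake probability; it is not tied to the paper's released code.

```python
import io
import numpy as np
from PIL import Image
from sklearn.metrics import roc_auc_score

def jpeg_compress(img: Image.Image, quality: int = 50) -> Image.Image:
    """Round-trip an image through JPEG encoding at the given quality."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def auc_under_compression(detector, images, labels, quality=50):
    """Return (clean AUC, compressed AUC); a large drop reveals fragility."""
    scores_clean = np.array([detector(im) for im in images])
    scores_jpeg = np.array([detector(jpeg_compress(im, quality)) for im in images])
    return roc_auc_score(labels, scores_clean), roc_auc_score(labels, scores_jpeg)
```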
[493] TSGS: Improving Gaussian Splatting for Transparent Surface Reconstruction via Normal and De-lighting Priors
Mingwei Li, Pu Pang, Hehe Fan, Hua Huang, Yi Yang
Main category: cs.CV
TL;DR: TSGS introduces a two-stage framework for reconstructing transparent surfaces, separating geometry learning from appearance refinement to improve accuracy and visual fidelity.
Details
Motivation: Transparent surfaces are challenging for 3D reconstruction due to the transparency-depth dilemma, where photorealistic rendering compromises geometric precision.Method: TSGS uses specular-suppressed inputs for geometry learning and anisotropic specular modeling for appearance refinement, along with a first-surface depth extraction technique.
Result: TSGS achieves a 37.3% reduction in chamfer distance and an 8.0% improvement in F1 score over leading methods, validated on the TransLab dataset.
Conclusion: TSGS effectively balances geometric accuracy and visual realism for transparent surfaces within the 3DGS framework.
Abstract: Reconstructing transparent surfaces is essential for tasks such as robotic manipulation in labs, yet it poses a significant challenge for 3D reconstruction techniques like 3D Gaussian Splatting (3DGS). These methods often encounter a transparency-depth dilemma, where the pursuit of photorealistic rendering through standard $\alpha$-blending undermines geometric precision, resulting in considerable depth estimation errors for transparent materials. To address this issue, we introduce Transparent Surface Gaussian Splatting (TSGS), a new framework that separates geometry learning from appearance refinement. In the geometry learning stage, TSGS focuses on geometry by using specular-suppressed inputs to accurately represent surfaces. In the second stage, TSGS improves visual fidelity through anisotropic specular modeling, crucially maintaining the established opacity to ensure geometric accuracy. To enhance depth inference, TSGS employs a first-surface depth extraction method. This technique uses a sliding window over $\alpha$-blending weights to pinpoint the most likely surface location and calculates a robust weighted average depth. To evaluate the transparent surface reconstruction task under realistic conditions, we collect a TransLab dataset that includes complex transparent laboratory glassware. Extensive experiments on TransLab show that TSGS achieves accurate geometric reconstruction and realistic rendering of transparent objects simultaneously within the efficient 3DGS framework. Specifically, TSGS significantly surpasses current leading methods, achieving a 37.3% reduction in chamfer distance and an 8.0% improvement in F1 score compared to the top baseline. The code and dataset are available at https://longxiang-ai.github.io/TSGS/.
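The first-surface depth extraction is described concretely enough to sketch: slide a window over the per-sample blending weights along each ray, pick the window with the largest mass, and return its weighted-average depth. Tensor shapes below are assumptions, not the authors' interface.

```python
import torch
import torch.nn.functional as F

def first_surface_depth(depths, weights, window=5):
    """depths, weights: (num_rays, num_samples), sorted near to far."""
    # Weight mass of every sliding window via 1D average pooling.
    mass = F.avg_pool1d(weights.unsqueeze(1), window, stride=1).squeeze(1) * window
    start = mass.argmax(dim=1, keepdim=True)             # most likely surface window
    idx = start + torch.arange(window, device=depths.device)
    w = torch.gather(weights, 1, idx)
    d = torch.gather(depths, 1, idx)
    return (w * d).sum(1) / w.sum(1).clamp_min(1e-8)     # robust weighted depth
```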
[494] CapRecover: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models
Kedong Xiu, Sai Qian Zhang
Main category: cs.CV
TL;DR: CapRecover is a framework that recovers high-level semantic content from intermediate features in split-DNN VLMs, avoiding image reconstruction. It achieves high accuracy in label and caption recovery and proposes a noise-based protection method.
Details
Motivation: Addressing privacy risks from semantic information leakage in split-DNN VLMs.Method: CapRecover, a cross-modality inversion framework, recovers semantics directly from features. Evaluated on datasets like CIFAR-10 and COCO2017.
Result: Achieves 92.71% Top-1 label accuracy on CIFAR-10 and ROUGE-L scores up to 0.52 on COCO2017. Deeper layers encode more semantics.
Conclusion: CapRecover effectively recovers semantics and a noise-based method mitigates leakage without extra training.
Abstract: As Vision-Language Models (VLMs) are increasingly deployed in split-DNN configurations–with visual encoders (e.g., ResNet, ViT) operating on user devices and sending intermediate features to the cloud–there is a growing privacy risk from semantic information leakage. Existing approaches to reconstructing images from these intermediate features often result in blurry, semantically ambiguous images. To directly address semantic leakage, we propose CapRecover, a cross-modality inversion framework that recovers high-level semantic content, such as labels or captions, directly from intermediate features without image reconstruction. We evaluate CapRecover on multiple datasets and victim models, demonstrating strong performance in semantic recovery. Specifically, CapRecover achieves up to 92.71% Top-1 label accuracy on CIFAR-10 and generates fluent captions from ResNet50 features on COCO2017 with ROUGE-L scores up to 0.52. Our analysis further reveals that deeper convolutional layers encode significantly more semantic information compared to shallow layers. To mitigate semantic leakage, we introduce a simple yet effective protection method: adding random noise to intermediate features at each layer and removing the noise in the next layer. Experimental results show that this approach prevents semantic leakage without additional training costs.
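The protection mechanism at the end of the abstract reduces to a few lines: add secret random noise to the features that leave the device and subtract it again before the next layer computes. A hedged PyTorch sketch, with the block structure assumed:

```python
import torch
import torch.nn as nn

class NoisyFeatureLink(nn.Module):
    """Wraps two consecutive blocks; noise masks features in transit."""
    def __init__(self, block_a, block_b, sigma=0.5):
        super().__init__()
        self.block_a, self.block_b, self.sigma = block_a, block_b, sigma

    def forward(self, x):
        feat = self.block_a(x)
        noise = self.sigma * torch.randn_like(feat)   # kept secret on-device
        transmitted = feat + noise                    # what an eavesdropper sees
        return self.block_b(transmitted - noise)      # remove noise, then proceed
```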
[495] MathPhys-Guided Coarse-to-Fine Anomaly Synthesis with SQE-Driven Bi-Level Optimization for Anomaly Detection
Long Qian, Bingke Zhu, Yingying Chen, Ming Tang, Jinqiao Wang
Main category: cs.CV
TL;DR: A novel pipeline for synthetic anomaly generation in industrial settings, combining physical modeling and bi-level optimization, achieves state-of-the-art results on benchmarks.
Details
Motivation: Addressing the rarity of real defect images and low fidelity of synthetic anomalies in industrial anomaly detection.Method: Uses Math-Phys models to generate synthetic anomalies, refines them via Coarse-to-Fine approach, and employs bi-level optimization with a Synthesis Quality Estimator (SQE).
Result: Achieves top performance in image- and pixel-AUROC on MVTec AD, VisA, and BTAD benchmarks.
Conclusion: The MaPhC2F dataset and BiSQAD method effectively improve synthetic anomaly quality and model generalization.
Abstract: Currently, industrial anomaly detection suffers from two bottlenecks: (i) the rarity of real-world defect images and (ii) the opacity of sample quality when synthetic data are used. Existing synthetic strategies (e.g., cut-and-paste) overlook the underlying physical causes of defects, leading to inconsistent, low-fidelity anomalies that hamper model generalization to real-world complexities. In this paper, we introduce a novel and lightweight pipeline that generates synthetic anomalies through Math-Phys model guidance, refines them via a Coarse-to-Fine approach and employs a bi-level optimization strategy with a Synthesis Quality Estimator (SQE). By combining physical modeling of the three most typical physics-driven defect mechanisms: Fracture Line (FL), Pitting Loss (PL), and Plastic Warpage (PW), our method produces realistic defect masks, which are subsequently enhanced in two phases. The first stage (npcF) enforces a PDE-based consistency to achieve a globally coherent anomaly structure, while the second stage (npcF++) further improves local fidelity. Additionally, we leverage SQE-driven weighting, ensuring that high-quality synthetic samples receive greater emphasis during training. To validate our method, we conduct experiments on three anomaly detection benchmarks: MVTec AD, VisA, and BTAD. Across these datasets, our method achieves state-of-the-art results in both image- and pixel-AUROC, confirming the effectiveness of our MaPhC2F dataset and BiSQAD method. All code will be released.
[496] FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning
Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Xiaobao Wei, Sixiang Chen, Zhuo Li, Yang Wang, Liyun Li, Xianming Liu, Ming Lu, Shanghang Zhang
Main category: cs.CV
TL;DR: FastDriveVLA is a reconstruction-based visual token pruning framework for autonomous driving, prioritizing foreground information to reduce computational costs while maintaining performance.
Details
Motivation: Current visual token pruning methods perform poorly in autonomous driving; human drivers focus on foreground areas, suggesting foreground retention is key for effective decision-making.Method: Proposes FastDriveVLA with ReconPruner, a plug-and-play pruner using MAE-style pixel reconstruction and adversarial foreground-background reconstruction to prioritize foreground tokens.
Result: Achieves state-of-the-art results on nuScenes open-loop planning benchmark across pruning ratios.
Conclusion: FastDriveVLA effectively reduces computational costs by focusing on foreground information, enhancing autonomous driving systems.
Abstract: Vision-Language-Action (VLA) models have demonstrated significant potential in complex scene understanding and action reasoning, leading to their increasing adoption in end-to-end autonomous driving systems. However, the long visual tokens of VLA models greatly increase computational costs. Current visual token pruning methods in Vision-Language Models (VLM) rely on either visual token similarity or visual-text attention, but both have shown poor performance in autonomous driving scenarios. Given that human drivers concentrate on relevant foreground areas while driving, we assert that retaining visual tokens containing this foreground information is essential for effective decision-making. Inspired by this, we propose FastDriveVLA, a novel reconstruction-based vision token pruning framework designed specifically for autonomous driving. FastDriveVLA includes a plug-and-play visual token pruner called ReconPruner, which prioritizes foreground information through MAE-style pixel reconstruction. A novel adversarial foreground-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Once trained, ReconPruner can be seamlessly applied to different VLA models with the same visual encoder without retraining. To train ReconPruner, we also introduce a large-scale dataset called nuScenes-FG, consisting of 241K image-mask pairs with annotated foreground regions. Our approach achieves state-of-the-art results on the nuScenes open-loop planning benchmark across different pruning ratios.
[497] All-in-One Transferring Image Compression from Human Perception to Multi-Machine Perception
Jiancheng Zhao, Xiang Ji, Yinqiang Zheng
Main category: cs.CV
TL;DR: Proposes a multi-task adaptation framework for transferring a pre-trained Learned Image Compression (LIC) model to multiple machine vision tasks efficiently.
Details
Motivation: Existing single-task adaptation for LIC is inefficient and lacks task interaction, leading to multiple task-specific bitstreams.Method: Introduces an asymmetric adaptation architecture with task-agnostic encoder adaptation and task-specific decoder adaptation, plus feature propagation modules for inter-task and inter-scale learning.
Result: Outperforms Fully Fine-Tuned and Parameter Efficient Fine-Tuned (PEFT) baselines on PASCAL-Context and NYUD-V2 datasets.
Conclusion: The framework enables efficient multi-task adaptation of LIC models with unified training and improved performance.
Abstract: Efficiently transferring a Learned Image Compression (LIC) model from human perception to machine perception is an emerging challenge in vision-centric representation learning. Existing approaches typically adapt LIC to downstream tasks in a single-task manner, which is inefficient, lacks task interaction, and results in multiple task-specific bitstreams. In this paper, we propose a multi-task adaptation framework that enables transferring a pre-trained base codec to multiple machine vision tasks through a unified model and a single training process. To achieve this, we design an asymmetric adaptation architecture consisting of a task-agnostic encoder adaptation and task-specific decoder adaptation. Furthermore, we introduce two feature propagation modules to facilitate inter-task and inter-scale feature representation learning. Experiments on the PASCAL-Context and NYUD-V2 datasets demonstrate that our method outperforms both Fully Fine-Tuned and other Parameter Efficient Fine-Tuned (PEFT) baselines. Code will be released.
[498] Early Timestep Zero-Shot Candidate Selection for Instruction-Guided Image Editing
Joowon Kim, Ziseok Lee, Donghyeon Cho, Sanghyun Jo, Yeonsung Jung, Kyungsu Kim, Eunho Yang
Main category: cs.CV
TL;DR: ELECT is a zero-shot framework for reliable seed selection in diffusion-based image editing, improving background consistency and reducing computational costs.
Details
Motivation: Addressing inefficiencies and failures in diffusion model-based image editing, such as background distortion, without relying on external verifiers.Method: Introduces ELECT, which evaluates early-timestep latent features to select seeds that retain background consistency while allowing foreground edits.
Result: Reduces computational costs by 41% (up to 61%) and improves success rates by 40% in previously failed cases.
Conclusion: ELECT offers an unsupervised, efficient solution for reliable image editing with diffusion models.
Abstract: Despite recent advances in diffusion models, achieving reliable image generation and editing remains challenging due to the inherent diversity induced by stochastic noise in the sampling process. Instruction-guided image editing with diffusion models offers user-friendly capabilities, yet editing failures, such as background distortion, frequently occur. Users often resort to trial and error, adjusting seeds or prompts to achieve satisfactory results, which is inefficient. While seed selection methods exist for Text-to-Image (T2I) generation, they depend on external verifiers, limiting applicability, and evaluating multiple seeds increases computational complexity. To address this, we first establish a multiple-seed-based image editing baseline using background consistency scores, achieving Best-of-N performance without supervision. Building on this, we introduce ELECT (Early-timestep Latent Evaluation for Candidate Selection), a zero-shot framework that selects reliable seeds by estimating background mismatches at early diffusion timesteps, identifying the seed that retains the background while modifying only the foreground. ELECT ranks seed candidates by a background inconsistency score, filtering unsuitable samples early based on background consistency while preserving editability. Beyond standalone seed selection, ELECT integrates into instruction-guided editing pipelines and extends to Multimodal Large-Language Models (MLLMs) for joint seed and prompt selection, further improving results when seed selection alone is insufficient. Experiments show that ELECT reduces computational costs (by 41 percent on average and up to 61 percent) while improving background consistency and instruction adherence, achieving around 40 percent success rates in previously failed cases - without any external supervision or training.
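A simplified sketch of the seed-ranking idea: for each candidate seed, run only a few early denoising steps, decode a cheap preview, and score background mismatch against the source image under a foreground mask. `edit_preview` is a hypothetical helper standing in for the editing pipeline, not ELECT's API.

```python
import torch

def background_inconsistency(preview, source, fg_mask):
    """Mean squared error restricted to background pixels (mask == 0)."""
    bg = 1 - fg_mask
    return ((preview - source) ** 2 * bg).sum() / bg.sum().clamp_min(1.0)

def select_seed(edit_preview, source, fg_mask, seeds, early_steps=5):
    """Rank seeds by early-timestep background mismatch; return the best."""
    scores = {}
    for seed in seeds:
        torch.manual_seed(seed)
        preview = edit_preview(source, num_steps=early_steps)  # cheap partial pass
        scores[seed] = background_inconsistency(preview, source, fg_mask).item()
    return min(scores, key=scores.get)   # seed that best preserves the background
```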
[499] LesiOnTime – Joint Temporal and Clinical Modeling for Small Breast Lesion Segmentation in Longitudinal DCE-MRI
Mohammed Kamran, Maria Bernathova, Raoul Varga, Christian F. Singer, Zsuzsanna Bago-Horvath, Thomas Helbich, Georg Langs, Philipp Seeböck
Main category: cs.CV
TL;DR: LesiOnTime improves small lesion segmentation in breast DCE-MRI by integrating longitudinal imaging and BI-RADS scores, outperforming baselines by 5% Dice.
Details
Motivation: Early cancer detection in high-risk patients requires accurate segmentation of small lesions, which current deep learning methods neglect by focusing on large lesions and ignoring longitudinal and clinical data.Method: Proposes LesiOnTime with Temporal Prior Attention (TPA) for dynamic integration of past and current scans and BI-RADS Consistency Regularization (BCR) for latent space alignment based on radiological assessments.
Result: Outperforms state-of-the-art methods by 5% Dice on a longitudinal dataset, with TPA and BCR providing complementary gains.
Conclusion: Incorporating temporal and clinical context is crucial for reliable early lesion segmentation in breast cancer screening.
Abstract: Accurate segmentation of small lesions in Breast Dynamic Contrast-Enhanced MRI (DCE-MRI) is critical for early cancer detection, especially in high-risk patients. While recent deep learning methods have advanced lesion segmentation, they primarily target large lesions and neglect valuable longitudinal and clinical information routinely used by radiologists. In real-world screening, detecting subtle or emerging lesions requires radiologists to compare across timepoints and consider previous radiology assessments, such as the BI-RADS score. We propose LesiOnTime, a novel 3D segmentation approach that mimics clinical diagnostic workflows by jointly leveraging longitudinal imaging and BI-RADS scores. The key components are: (1) a Temporal Prior Attention (TPA) block that dynamically integrates information from previous and current scans; and (2) a BI-RADS Consistency Regularization (BCR) loss that enforces latent space alignment for scans with similar radiological assessments, thus embedding domain knowledge into the training process. Evaluated on a curated in-house longitudinal dataset of high-risk patients with DCE-MRI, our approach outperforms state-of-the-art single-timepoint and longitudinal baselines by 5% in terms of Dice. Ablation studies demonstrate that both TPA and BCR contribute complementary performance gains. These results highlight the importance of incorporating temporal and clinical context for reliable early lesion segmentation in real-world breast cancer screening. Our code is publicly available at https://github.com/cirmuw/LesiOnTime
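One plausible reading of the BCR loss, sketched below, is a pairwise regularizer that pulls together latents of scans sharing a BI-RADS score and pushes apart the rest; this is our interpretation of the one-sentence description, not the authors' exact formulation.

```python
import torch

def bcr_loss(latents: torch.Tensor, birads: torch.Tensor, margin: float = 1.0):
    """latents: (B, D) scan embeddings; birads: (B,) integer assessments."""
    d = torch.cdist(latents, latents)                       # pairwise distances
    same = (birads[:, None] == birads[None, :]).float()
    pull = same * d.pow(2)                                  # align similar scans
    push = (1 - same) * (margin - d).clamp_min(0).pow(2)    # separate the rest
    off_diag = 1 - torch.eye(len(latents), device=latents.device)
    return ((pull + push) * off_diag).sum() / off_diag.sum()
```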
[500] DRC: Enhancing Personalized Image Generation via Disentangled Representation Composition
Yiyan Xu, Wuqiang Zheng, Wenjie Wang, Fengbin Zhu, Xinting Hu, Yang Zhang, Fuli Feng, Tat-Seng Chua
Main category: cs.CV
TL;DR: DRC introduces a framework for personalized image generation by disentangling style and semantic features to avoid guidance collapse, outperforming existing methods.
Details
Motivation: Existing methods struggle to accurately fuse user style preferences and semantic intentions, leading to guidance collapse in generated images.Method: DRC uses Disentangled Representation Composition, with two stages: disentanglement learning (separating style/semantic features) and personalized modeling (adapting representations for robust generation).
Result: DRC shows competitive performance on benchmarks, effectively mitigating guidance collapse.
Conclusion: Disentangled representation learning is crucial for controllable and effective personalized image generation.
Abstract: Personalized image generation has emerged as a promising direction in multimodal content creation. It aims to synthesize images tailored to individual style preferences (e.g., color schemes, character appearances, layout) and semantic intentions (e.g., emotion, action, scene contexts) by leveraging user-interacted history images and multimodal instructions. Despite notable progress, existing methods – whether based on diffusion models, large language models, or Large Multimodal Models (LMMs) – struggle to accurately capture and fuse user style preferences and semantic intentions. In particular, the state-of-the-art LMM-based method suffers from the entanglement of visual features, leading to Guidance Collapse, where the generated images fail to preserve user-preferred styles or reflect the specified semantics. To address these limitations, we introduce DRC, a novel personalized image generation framework that enhances LMMs through Disentangled Representation Composition. DRC explicitly extracts user style preferences and semantic intentions from history images and the reference image, respectively, to form user-specific latent instructions that guide image generation within LMMs. Specifically, it involves two critical learning stages: 1) Disentanglement learning, which employs a dual-tower disentangler to explicitly separate style and semantic features, optimized via a reconstruction-driven paradigm with difficulty-aware importance sampling; and 2) Personalized modeling, which applies semantic-preserving augmentations to effectively adapt the disentangled representations for robust personalized generation. Extensive experiments on two benchmarks demonstrate that DRC shows competitive performance while effectively mitigating the guidance collapse issue, underscoring the importance of disentangled representation learning for controllable and effective personalized image generation.
[501] A Decade of You Only Look Once (YOLO) for Object Detection: A Review
Leo Thomas Ramos, Angel D. Sappa
Main category: cs.CV
TL;DR: A review of YOLO’s evolution over 10 years, covering its technical advancements, applications, and future directions.
Details
Motivation: To commemorate YOLO's 10-year impact and analyze its development in real-time object detection.Method: Technical overview of YOLO versions, architectural trends, application areas, evaluation practices, and ethical considerations.
Result: YOLO has evolved into a versatile family of architectures with efficient design and cross-domain adaptability.
Conclusion: The review provides a critical perspective on YOLO’s trajectory and suggests future development directions.
Abstract: This review marks the tenth anniversary of You Only Look Once (YOLO), one of the most influential frameworks in real-time object detection. Over the past decade, YOLO has evolved from a streamlined detector into a diverse family of architectures characterized by efficient design, modular scalability, and cross-domain adaptability. The paper presents a technical overview of the main versions, highlights key architectural trends, and surveys the principal application areas in which YOLO has been adopted. It also addresses evaluation practices, ethical considerations, and potential future directions for the framework’s continued development. The analysis aims to provide a comprehensive and critical perspective on YOLO’s trajectory and ongoing transformation.
[502] DanceGRPO: Unleashing GRPO on Visual Generation
Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, Ping Luo
Main category: cs.CV
TL;DR: DanceGRPO is a unified RL framework for visual generation, compatible with diffusion models and rectified flows, improving performance across tasks and models.
Details
Motivation: Aligning generative model outputs with human preferences remains challenging due to RL limitations like incompatibility with ODE-based sampling and instability in large-scale training.Method: DanceGRPO adapts Group Relative Policy Optimization (GRPO) to visual generation, supporting multiple paradigms, tasks, foundation models, and reward models.
Result: DanceGRPO outperforms baselines by up to 181% on benchmarks like HPS-v2.1 and CLIP Score, stabilizing policy optimization and improving denoising trajectories.
Conclusion: DanceGRPO is a robust solution for RLHF in visual generation, harmonizing RL and visual synthesis.
Abstract: Recent breakthroughs in generative models, particularly diffusion models and rectified flows, have revolutionized visual content creation, yet aligning model outputs with human preferences remains a critical challenge. Existing reinforcement learning (RL)-based methods for visual generation face critical limitations: incompatibility with modern Ordinary Differential Equations (ODEs)-based sampling paradigms, instability in large-scale training, and lack of validation for video generation. This paper introduces DanceGRPO, the first unified framework to adapt Group Relative Policy Optimization (GRPO) to visual generation paradigms, unleashing one unified RL algorithm across two generative paradigms (diffusion models and rectified flows), three tasks (text-to-image, text-to-video, image-to-video), four foundation models (Stable Diffusion, HunyuanVideo, FLUX, SkyReels-I2V), and five reward models (image/video aesthetics, text-image alignment, video motion quality, and binary reward). To our knowledge, DanceGRPO is the first RL-based unified framework capable of seamless adaptation across diverse generative paradigms, tasks, foundational models, and reward models. DanceGRPO demonstrates consistent and substantial improvements, which outperform baselines by up to 181% on benchmarks such as HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Notably, DanceGRPO not only can stabilize policy optimization for complex video generation, but also enables generative policy to better capture denoising trajectories for Best-of-N inference scaling and learn from sparse binary feedback. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback (RLHF) tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis. The code will be released.
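The group-relative core of GRPO-style training is compact enough to show directly: sample a group of generations per prompt, score each with a reward model, and standardize rewards within the group. Everything beyond this normalization (clipping, KL terms, flow-specific losses) is omitted here.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """rewards: (num_prompts, group_size) raw reward-model scores."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)   # per-group standardized advantage

# e.g., 4 prompts with 8 sampled generations each:
adv = group_relative_advantages(torch.randn(4, 8))
```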
[503] MoKD: Multi-Task Optimization for Knowledge Distillation
Zeeshan Hayder, Ali Cheraghian, Lars Petersson, Mehrtash Harandi
Main category: cs.CV
TL;DR: MoKD improves Knowledge Distillation by addressing gradient conflicts and dominance through multi-task optimization and subspace learning, achieving state-of-the-art results.
Details
Motivation: Challenges in KD include balancing teacher guidance and task objectives, and handling knowledge representation disparities.Method: MoKD reformulates KD as multi-objective optimization and uses subspace learning for better knowledge transfer.
Result: Outperforms existing methods on ImageNet-1K and COCO datasets, achieving state-of-the-art performance.
Conclusion: MoKD effectively balances KD objectives and enhances model efficiency, setting new benchmarks.
Abstract: Compact models can be effectively trained through Knowledge Distillation (KD), a technique that transfers knowledge from larger, high-performing teacher models. Two key challenges in KD are: 1) balancing learning from the teacher’s guidance and the task objective, and 2) handling the disparity in knowledge representation between teacher and student models. To address these, we propose Multi-Task Optimization for Knowledge Distillation (MoKD). MoKD tackles two main gradient issues: a) Gradient Conflicts, where task-specific and distillation gradients are misaligned, and b) Gradient Dominance, where one objective’s gradient dominates, causing imbalance. MoKD reformulates KD as a multi-objective optimization problem, enabling better balance between objectives. Additionally, it introduces a subspace learning framework to project feature representations into a high-dimensional space, improving knowledge transfer. Our MoKD is demonstrated to outperform existing methods through extensive experiments on image classification using the ImageNet-1K dataset and object detection using the COCO dataset, achieving state-of-the-art performance with greater efficiency. To the best of our knowledge, MoKD models also achieve state-of-the-art performance compared to models trained from scratch.
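For reference, the standard remedy for the gradient conflicts the abstract names is a PCGrad-style projection: when two flattened gradients point against each other, project each onto the normal plane of the other before combining. MoKD's actual update may differ; this sketch only illustrates the conflict-resolution idea.

```python
import torch

def deconflict(g_task: torch.Tensor, g_kd: torch.Tensor) -> torch.Tensor:
    """g_task, g_kd: flattened 1-D gradients of the two objectives."""
    def project(g, ref):
        dot = torch.dot(g, ref)
        if dot < 0:                                           # conflict detected
            g = g - dot / ref.norm().pow(2).clamp_min(1e-12) * ref
        return g
    return project(g_task, g_kd) + project(g_kd, g_task)
```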
[504] VSA: Faster Video Diffusion with Trainable Sparse Attention
Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, Hao Zhang
Main category: cs.CV
TL;DR: VSA introduces a trainable, hardware-efficient sparse attention for video diffusion transformers, reducing training FLOPS by 2.53× and speeding up attention time by 6× without quality loss.
Details
Motivation: Quadratic 3D attention in video diffusion transformers limits scalability, despite most attention mass being concentrated on a few positions.Method: VSA uses a two-stage approach: a coarse stage pools tokens into tiles and identifies critical tokens, while a fine stage computes token-level attention only inside these tiles.
Result: VSA reduces training FLOPS by 2.53×, speeds up attention time by 6×, and lowers end-to-end generation time from 31s to 18s with comparable quality.
Conclusion: Trainable sparse attention (VSA) is a practical alternative to full attention, enabling further scaling of video diffusion models.
Abstract: Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at \emph{both} training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight \emph{critical tokens}; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout that ensures hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53$\times$ with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6$\times$ and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models. Code will be available at https://github.com/hao-ai-lab/FastVideo.
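A toy single-head version of the two-stage scheme, keeping only the selection logic (real VSA is a fused block-sparse kernel): a coarse pass attends over tile means to pick critical tiles, then exact attention runs on tokens gathered from those tiles.

```python
import torch
import torch.nn.functional as F

def sparse_attend(q, k, v, tile=64, topk=8):
    """q: (1, d) query; k, v: (N, d) with N divisible by tile."""
    N, d = k.shape
    k_tiles = k.view(N // tile, tile, d).mean(dim=1)         # coarse tile keys
    coarse = (q @ k_tiles.t()) / d ** 0.5                    # (1, num_tiles)
    keep = coarse.topk(topk, dim=-1).indices.squeeze(0)      # critical tiles
    idx = (keep[:, None] * tile + torch.arange(tile)).reshape(-1)
    k_sel, v_sel = k[idx], v[idx]                            # gathered tokens
    attn = F.softmax((q @ k_sel.t()) / d ** 0.5, dim=-1)     # fine attention
    return attn @ v_sel
```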
[505] Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts
Taewon Kang, Ming C. Lin
Main category: cs.CV
TL;DR: A modular pipeline generates character-driven dialogue and speech for video narratives, integrating visual context and structured prompts with a recursive memory system for consistency.
Details
Motivation: To address the underexplored dimension of character-driven dialogue in video storytelling, enhancing visual narratives with natural voice and expression.Method: Uses a vision-language encoder for visual context, integrates prompts with a large language model for dialogue, and employs a Recursive Narrative Bank for consistency.
Result: Produces fully-voiced, multimodal video narratives with coherent, character-grounded dialogue across diverse story settings.
Conclusion: The framework offers a scalable, training-free solution for enriching visual storytelling with expressive, contextually consistent dialogue.
Abstract: Recent advances in scene-based video generation have enabled systems to synthesize coherent visual narratives from structured prompts. However, a crucial dimension of storytelling – character-driven dialogue and speech – remains underexplored. In this paper, we present a modular pipeline that transforms action-level prompts into visually and auditorily grounded narrative dialogue, enriching visual storytelling with natural voice and character expression. Our method takes as input a pair of prompts per scene, where the first defines the setting and the second specifies a character’s behavior. While a story generation model such as Text2Story produces the corresponding visual scene, we focus on generating expressive, character-consistent utterances grounded in both the prompts and the scene image. A pretrained vision-language encoder extracts high-level semantic features from a representative frame, capturing salient visual context. These features are then integrated with structured prompts to guide a large language model in synthesizing natural dialogue. To ensure contextual and emotional consistency across scenes, we introduce a Recursive Narrative Bank – a speaker-aware, temporally structured memory that recursively accumulates each character’s dialogue history. Inspired by Script Theory in cognitive psychology, this design enables characters to speak in ways that reflect their evolving goals, social context, and narrative roles throughout the story. Finally, we render each utterance as expressive, character-conditioned speech, resulting in fully-voiced, multimodal video narratives. Our training-free framework generalizes across diverse story settings – from fantasy adventures to slice-of-life episodes – offering a scalable solution for coherent, character-grounded audiovisual storytelling.
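A toy interpretation of the Recursive Narrative Bank: a speaker-aware, time-ordered store whose recent entries are folded into each new dialogue prompt. Class and method names below are our own, chosen only to mirror the description.

```python
from collections import defaultdict

class NarrativeBank:
    """Speaker-aware, temporally structured dialogue memory."""
    def __init__(self):
        self.history = defaultdict(list)   # character -> ordered utterances

    def add(self, character: str, scene: int, utterance: str):
        self.history[character].append((scene, utterance))

    def context_for(self, character: str, last_n: int = 5) -> str:
        """Recent lines for one character, formatted for an LLM prompt."""
        lines = self.history[character][-last_n:]
        return "\n".join(f"[scene {s}] {character}: {u}" for s, u in lines)

bank = NarrativeBank()
bank.add("Mira", 1, "We leave at dawn.")
prompt_context = bank.context_for("Mira")
```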
[506] UltraVSR: Achieving Ultra-Realistic Video Super-Resolution with Efficient One-Step Diffusion Space
Yong Liu, Jinshan Pan, Yinchuan Li, Qingji Dong, Chao Zhu, Yu Guo, Fei Wang
Main category: cs.CV
TL;DR: UltraVSR introduces a one-step diffusion framework for video super-resolution, using Degradation-aware Reconstruction Scheduling and a Recurrent Temporal Shift module for efficiency and temporal coherence.
Details
Motivation: Adapting diffusion models to video super-resolution is challenging due to stochasticity and lack of temporal modeling. Existing methods struggle with unreliable motion estimation and computational costs.Method: Proposes Degradation-aware Reconstruction Scheduling (DRS) for one-step reconstruction and a Recurrent Temporal Shift (RTS) module for temporal consistency. Uses Spatio-temporal Joint Distillation (SJD) and Temporally Asynchronous Inference (TAI) for enhanced performance.
Result: UltraVSR achieves state-of-the-art performance in video super-resolution with a single sampling step, demonstrating superior qualitative and quantitative results.
Conclusion: UltraVSR effectively addresses the challenges of video super-resolution by combining efficient one-step diffusion with lightweight temporal modeling, enabling ultra-realistic and coherent results.
Abstract: Diffusion models have shown great potential in generating realistic image detail. However, adapting these models to video super-resolution (VSR) remains challenging due to their inherent stochasticity and lack of temporal modeling. Previous methods have attempted to mitigate this issue by incorporating motion information and temporal layers. However, unreliable motion estimation from low-resolution videos and costly multiple sampling steps with deep temporal layers limit them to short sequences. In this paper, we propose UltraVSR, a novel framework that enables ultra-realistic and temporally-coherent VSR through an efficient one-step diffusion space. A central component of UltraVSR is the Degradation-aware Reconstruction Scheduling (DRS), which estimates a degradation factor from the low-resolution input and transforms the iterative denoising process into a single-step reconstruction from low-resolution to high-resolution videos. To ensure temporal consistency, we propose a lightweight Recurrent Temporal Shift (RTS) module, including an RTS-convolution unit and an RTS-attention unit. By partially shifting feature components along the temporal dimension, it enables effective propagation, fusion, and alignment across frames without explicit temporal layers. The RTS module is integrated into a pretrained text-to-image diffusion model and is further enhanced through Spatio-temporal Joint Distillation (SJD), which improves temporal coherence while preserving realistic details. Additionally, we introduce a Temporally Asynchronous Inference (TAI) strategy to capture long-range temporal dependencies under limited memory constraints. Extensive experiments show that UltraVSR achieves state-of-the-art performance, both qualitatively and quantitatively, in a single sampling step. Code is available at https://github.com/yongliuy/UltraVSR.
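The RTS module "partially shifts feature components along the temporal dimension"; the classic temporal-shift operation that idea builds on (as in TSM) looks like the sketch below. RTS wraps something similar inside its convolution and attention units, so treat this only as the underlying primitive.

```python
import torch

def temporal_shift(x, shift_frac=0.25):
    """x: (batch, time, channels, H, W). Shift a fraction of channels one
    step forward and one step backward in time, zero-padding the ends."""
    B, T, C, H, W = x.shape
    n = int(C * shift_frac) // 2
    out = x.clone()
    out[:, 1:, :n] = x[:, :-1, :n]              # first n channels shift forward
    out[:, 0, :n] = 0
    out[:, :-1, n:2 * n] = x[:, 1:, n:2 * n]    # next n channels shift backward
    out[:, -1, n:2 * n] = 0
    return out                                   # remaining channels untouched
```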
[507] Prototype Embedding Optimization for Human-Object Interaction Detection in Livestreaming
Menghui Zhang, Jing Zhang, Lin Chen, Li Zhuo
Main category: cs.CV
TL;DR: The paper proposes PeO-HOI, a method to improve human-object interaction (HOI) detection in livestreaming by addressing object bias through prototype embedding optimization.
Details
Motivation: Current HOI detection methods focus too much on objects and neglect interactions with streamers, leading to object bias in livestreaming contexts.Method: PeO-HOI preprocesses livestreaming with object detection and tracking, optimizes prototype embeddings to reduce object bias, and models spatio-temporal context for HOI detection.
Result: PeO-HOI achieves improved accuracy on VidHOI (37.19%@full) and BJUT-HOI (45.13%@full) datasets.
Conclusion: PeO-HOI effectively mitigates object bias and enhances HOI detection performance in livestreaming.
Abstract: Livestreaming often involves interactions between streamers and objects, which is critical for understanding and regulating web content. While human-object interaction (HOI) detection has made some progress in general-purpose video downstream tasks, when applied to recognize the interaction behaviors between a streamer and different objects in livestreaming, it tends to focus too much on the objects and neglect their interactions with the streamer, which leads to object bias. To solve this issue, we propose a prototype embedding optimization for human-object interaction detection (PeO-HOI). First, the livestreaming is preprocessed using object detection and tracking techniques to extract features of the human-object (HO) pairs. Then, prototype embedding optimization is adopted to mitigate the effect of object bias on HOI. Finally, after modelling the spatio-temporal context between HO pairs, the HOI detection results are obtained by the prediction head. The experimental results show that the proposed PeO-HOI method achieves detection accuracies of 37.19%@full, 51.42%@non-rare, 26.20%@rare on the publicly available dataset VidHOI, and 45.13%@full, 62.78%@non-rare and 30.37%@rare on the self-built dataset BJUT-HOI, effectively improving HOI detection performance in livestreaming.
[508] Zero-Shot Vision Encoder Grafting via LLM Surrogates
Kaiyu Yue, Vasu Singla, Menglin Jia, John Kirchenbauer, Rifaa Qadri, Zikui Cai, Abhinav Bhatele, Furong Huang, Tom Goldstein
Main category: cs.CV
TL;DR: Training vision encoders with small surrogate models reduces VLM training costs by ~45% and achieves comparable performance to full training with large LLMs.
Details
Motivation: Reduce computational costs of training VLMs by avoiding direct training with large LLMs.Method: Train vision encoders using small surrogate models that share embedding space with the target LLM, then transfer to the large LLM (zero-shot grafting).
Result: Grafted pairs outperform encoder-surrogate pairs and match full training performance on some benchmarks.
Conclusion: Surrogate training is a cost-effective strategy for VLM training without sacrificing performance.
Abstract: Vision language models (VLMs) typically pair a modestly sized vision encoder with a large language model (LLM), e.g., Llama-70B, making the decoder the primary computational burden during training. To reduce costs, a potential promising strategy is to first train the vision encoder using a small language model before transferring it to the large one. We construct small “surrogate models” that share the same embedding space and representation language as the large target LLM by directly inheriting its shallow layers. Vision encoders trained on the surrogate can then be directly transferred to the larger model, a process we call zero-shot grafting – when plugged directly into the full-size target LLM, the grafted pair surpasses the encoder-surrogate pair and, on some benchmarks, even performs on par with full decoder training with the target LLM. Furthermore, our surrogate training approach reduces overall VLM training costs by ~45% when using Llama-70B as the decoder. The code is at https://github.com/facebookresearch/zero.
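Schematically, the surrogate keeps the target LLM's embeddings and first k transformer blocks frozen, plus a small trainable head, so an encoder trained against it already speaks the big model's representation language. The attribute layout below (`embed`, `blocks`, `dim`, `vocab`) is a toy stand-in, not any specific library API.

```python
import copy
import torch.nn as nn

def make_surrogate(target_llm: nn.Module, k: int = 4) -> nn.Module:
    """Build a small surrogate that inherits the target LLM's shallow layers."""
    emb = copy.deepcopy(target_llm.embed)                    # shared embedding space
    blocks = [copy.deepcopy(b) for b in target_llm.blocks[:k]]
    head = nn.Linear(target_llm.dim, target_llm.vocab)       # small trainable head
    for m in [emb, *blocks]:
        for p in m.parameters():
            p.requires_grad = False                          # inherited layers frozen
    return nn.Sequential(emb, *blocks, head)

# A vision encoder trained against make_surrogate(big_llm) can then be
# plugged into big_llm itself: the "zero-shot grafting" step.
```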
[509] A Comprehensive Review of Domain Adaptation Techniques for Agricultural Image Analysis in Precision Agriculture
Xing Hu, Siyuan Chen, Xuming Huang, Qianqian Duan, LingKun Luo, Ruijiao Li, Huiliang Shang, Linhua Jiang, Jianping Yang, Hamid Reza Karimi, Dawei Zhang
Main category: cs.CV
TL;DR: This paper explores Domain Adaptation (DA) techniques to improve cross-domain transferability in agricultural image analysis, addressing challenges like domain shifts and limited labeled data. It reviews DA methods, applications, and datasets, providing a framework for future research.
Details
Motivation: The motivation is to tackle domain shifts in agricultural image analysis caused by environmental variations, diverse crops, and data acquisition methods, which hinder model generalization.Method: The paper systematically reviews DA techniques, categorizing them into shallow and deep learning methods, including supervised, semi-supervised, and unsupervised strategies, with a focus on adversarial learning.
Result: DA methods have shown improved performance in tasks like crop health monitoring, pest detection, and fruit recognition across diverse domains.
Conclusion: The work offers a comprehensive framework and insights to guide future DA research in agricultural vision tasks, highlighting the potential of adversarial learning and the importance of public datasets.
Abstract: With the growing application of computer vision in agriculture, image analysis has become essential for tasks such as crop health monitoring and pest detection. However, significant domain shifts caused by environmental variations, different crop types, and diverse data acquisition methods hinder model generalization across regions, seasons, and complex agricultural settings. This paper investigates how Domain Adaptation (DA) techniques can address these challenges by improving cross-domain transferability in agricultural image analysis. Given the limited availability of labeled data, weak model adaptability, and dynamic field conditions, DA has emerged as a promising solution. The review systematically summarizes recent advances in DA for agricultural imagery, focusing on applications such as crop health monitoring, pest detection, and fruit recognition, where DA methods have improved performance in diverse domains. DA approaches are categorized into shallow and deep learning methods, including supervised, semi-supervised, and unsupervised strategies, with particular attention to adversarial learning-based techniques that have demonstrated strong potential in complex scenarios. In addition, the paper reviews key public agricultural image datasets, evaluating their strengths and limitations in DA research. In general, this work offers a comprehensive framework and critical insights to guide future research and development of domain adaptation in agricultural vision tasks.
[510] MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks
Zonglin Wu, Yule Xue, Xin Wei, Yiren Song
Main category: cs.CV
TL;DR: The paper introduces MCA-Bench, a unified benchmarking suite for evaluating the security robustness of diverse CAPTCHA types, revealing vulnerabilities and proposing design principles.
Details
Motivation: The lack of a unified, large-scale, multimodal benchmark for CAPTCHA evaluation motivates the creation of MCA-Bench to assess security robustness systematically.Method: MCA-Bench integrates heterogeneous CAPTCHA types into a single protocol, using a shared vision-language model backbone to fine-tune specialized cracking agents for consistent evaluation.
Result: Experiments show MCA-Bench effectively maps CAPTCHA vulnerabilities and provides the first quantitative analysis of challenge complexity, interaction depth, and solvability.
Conclusion: The paper proposes actionable design principles for CAPTCHA hardening and identifies open challenges, encouraging community collaboration.
Abstract: As automated attack techniques rapidly advance, CAPTCHAs remain a critical defense mechanism against malicious bots. However, existing CAPTCHA schemes encompass a diverse range of modalities – from static distorted text and obfuscated images to interactive clicks, sliding puzzles, and logic-based questions – yet the community still lacks a unified, large-scale, multimodal benchmark to rigorously evaluate their security robustness. To address this gap, we introduce MCA-Bench, a comprehensive and reproducible benchmarking suite that integrates heterogeneous CAPTCHA types into a single evaluation protocol. Leveraging a shared vision-language model backbone, we fine-tune specialized cracking agents for each CAPTCHA category, enabling consistent, cross-modal assessments. Extensive experiments reveal that MCA-Bench effectively maps the vulnerability spectrum of modern CAPTCHA designs under varied attack settings, and crucially offers the first quantitative analysis of how challenge complexity, interaction depth, and model solvability interrelate. Based on these findings, we propose three actionable design principles and identify key open challenges, laying the groundwork for systematic CAPTCHA hardening, fair benchmarking, and broader community collaboration. Datasets and code are available online.
[511] AllTracker: Efficient Dense Point Tracking at High Resolution
Adam W. Harley, Yang You, Xinglong Sun, Yang Zheng, Nikhil Raghuraman, Yunqi Gu, Sheldon Liang, Wen-Hsuan Chu, Achal Dave, Pavel Tokmakov, Suya You, Rares Ambrus, Katerina Fragkiadaki, Leonidas J. Guibas
Main category: cs.CV
TL;DR: AllTracker is a model for estimating long-range, dense point tracks in videos by predicting flow fields between frames, combining techniques from optical flow and point tracking for high accuracy and efficiency.
Details
Motivation: Existing methods either lack long-range correspondence (optical flow) or high-resolution dense tracking (point tracking). AllTracker aims to bridge this gap.Method: Uses iterative inference on low-resolution grids, spatial propagation via 2D convolutions, and temporal propagation via pixel-aligned attention layers. Trained jointly on optical flow and point tracking datasets.
Result: Achieves state-of-the-art accuracy at high resolution (768x1024 pixels) with 16M parameters, efficient on a 40G GPU.
Conclusion: Joint training on diverse datasets and the proposed architecture are key to performance. The model is fast, efficient, and open-source.
Abstract: We introduce AllTracker: a model that estimates long-range point tracks by way of estimating the flow field between a query frame and every other frame of a video. Unlike existing point tracking methods, our approach delivers high-resolution and dense (all-pixel) correspondence fields, which can be visualized as flow maps. Unlike existing optical flow methods, our approach corresponds one frame to hundreds of subsequent frames, rather than just the next frame. We develop a new architecture for this task, blending techniques from existing work in optical flow and point tracking: the model performs iterative inference on low-resolution grids of correspondence estimates, propagating information spatially via 2D convolution layers, and propagating information temporally via pixel-aligned attention layers. The model is fast and parameter-efficient (16 million parameters), and delivers state-of-the-art point tracking accuracy at high resolution (i.e., tracking 768x1024 pixels, on a 40G GPU). A benefit of our design is that we can train jointly on optical flow datasets and point tracking datasets, and we find that doing so is crucial for top performance. We provide an extensive ablation study on our architecture details and training recipe, making it clear which details matter most. Our code and model weights are available at https://alltracker.github.io
[512] SpINRv2: Implicit Neural Representation for Passband FMCW Radars
Harshvardhan Takawale, Nirupam Roy
Main category: cs.CV
TL;DR: SpINRv2 is an advanced neural framework for high-fidelity volumetric reconstruction using FMCW radar, improving upon SpINR with better handling of phase aliasing and sub-bin ambiguity.
Details
Motivation: The need for accurate volumetric reconstruction under high-frequency radar conditions, where phase aliasing and sub-bin ambiguity complicate traditional methods.
Method: A differentiable frequency-domain forward model with closed-form synthesis, paired with an implicit neural representation (INR) for continuous scene modeling, plus sparsity and smoothness regularization.
Result: SpINRv2 outperforms classical and learning-based baselines, especially in high-frequency regimes, setting a new benchmark for neural radar-based 3D imaging.
Conclusion: SpINRv2 advances neural radar-based 3D imaging by addressing high-frequency challenges with spectral fidelity and computational efficiency.
Abstract: We present SpINRv2, a neural framework for high-fidelity volumetric reconstruction using Frequency-Modulated Continuous-Wave (FMCW) radar. Extending our prior work (SpINR), this version introduces enhancements that allow accurate learning under high start frequencies, where phase aliasing and sub-bin ambiguity become prominent. Our core contribution is a fully differentiable frequency-domain forward model that captures the complex radar response using closed-form synthesis, paired with an implicit neural representation (INR) for continuous volumetric scene modeling. Unlike time-domain baselines, SpINRv2 directly supervises the complex frequency spectrum, preserving spectral fidelity while drastically reducing computational overhead. Additionally, we introduce sparsity and smoothness regularization to disambiguate sub-bin ambiguities that arise at fine range resolutions. Experimental results show that SpINRv2 significantly outperforms both classical and learning-based baselines, especially under high-frequency regimes, establishing a new benchmark for neural radar-based 3D imaging.
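As a minimal sketch of the two ingredients named here, the snippet below pairs a coordinate MLP (the INR) with direct supervision of a complex frequency spectrum. The closed-form radar forward model that maps reflectivities to that spectrum is omitted, and all sizes are assumptions:

```python
import torch
import torch.nn as nn

class SceneINR(nn.Module):
    """Implicit neural representation: 3D coordinates -> reflectivity."""
    def __init__(self, hidden=256, n_freqs=8):
        super().__init__()
        self.n_freqs = n_freqs
        self.mlp = nn.Sequential(
            nn.Linear(3 * 2 * n_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),    # non-negative reflectivity
        )

    def forward(self, xyz):                          # xyz: (N, 3)
        freqs = torch.pi * 2.0 ** torch.arange(self.n_freqs, device=xyz.device)
        enc = torch.cat([torch.sin(xyz[..., None] * freqs),
                         torch.cos(xyz[..., None] * freqs)], dim=-1)
        return self.mlp(enc.flatten(-2))             # Fourier-feature encoding

def spectrum_loss(pred_spec, gt_spec):
    """Supervise the complex frequency spectrum directly (real and imaginary
    parts), rather than a time-domain rendering."""
    return ((pred_spec.real - gt_spec.real) ** 2 +
            (pred_spec.imag - gt_spec.imag) ** 2).mean()
```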
[513] Wi-CBR: Salient-aware Adaptive WiFi Sensing for Cross-domain Behavior Recognition
Ruobei Zhang, Shengeng Tang, Huan Yan, Xiang Zhang, Jiabao Guo
Main category: cs.CV
TL;DR: Wi-CBR improves WiFi-based cross-domain behavior recognition by dynamically supplementing phase features with DFS signals, using a dual-branch self-attention module and a saliency guidance module for better generalization.
Details
Motivation: Addressing the interference of domain-specific signals on gesture variation in WiFi-based behavior recognition.
Method: Proposes Wi-CBR with a dual-branch self-attention module for temporal (phase) and spatial (DFS) features, and a saliency guidance module for feature fusion.
Result: Superior performance on Widar3.0 and XRF55 datasets in both in-domain and cross-domain scenarios.
Conclusion: Wi-CBR effectively enhances cross-domain behavior recognition by leveraging DFS and optimizing feature fusion.
Abstract: The challenge in WiFi-based cross-domain Behavior Recognition lies in the significant interference of domain-specific signals on gesture variation. Previous methods alleviate this interference by mapping the phase from multiple domains into a common feature space. Instead, we use the Doppler Frequency Shift (DFS) signal to dynamically supplement the phase features for better generalization, enabling the model not only to explore a wider feature space but also to avoid potential degradation of gesture semantic information. Specifically, we propose a novel Salient-aware Adaptive WiFi Sensing framework for Cross-domain Behavior Recognition (Wi-CBR), which constructs a dual-branch self-attention module that captures temporal features from phase information reflecting dynamic path length variations, while extracting spatial features from DFS correlated with motion velocity. Moreover, we design a Saliency Guidance Module that employs group attention mechanisms to mine critical activity features, and utilizes gating mechanisms to optimize information entropy, facilitating feature fusion and enabling effective interaction between salient and non-salient behavior characteristics. Extensive experiments on two large-scale public datasets (Widar3.0 and XRF55) demonstrate the superior performance of our method in both in-domain and cross-domain scenarios.
[514] CSDN: A Context-Gated Self-Adaptive Detection Network for Real-Time Object Detection
Haolin Wei
Main category: cs.CV
TL;DR: CSDN, a Transformer-based detection head, improves CNN-based detectors by adaptively combining global context and scale information, enhancing accuracy with minimal fine-tuning.
Details
Motivation: CNNs lack global context due to limited receptive fields, and DETR's self-attention has redundancy. CSDN addresses these by mimicking human visual perception for better feature integration.
Method: CSDN uses a Transformer-based head to adaptively select and combine feature dimensions and scales from different patterns, inspired by human visual focus and peripheral perception.
Result: CSDN enhances global context modeling and adapts to objects of varying sizes, significantly improving detection accuracy with minimal fine-tuning.
Conclusion: CSDN effectively replaces CNN-based detector heads, offering superior performance by leveraging adaptive global context and scale integration.
Abstract: Convolutional neural networks (CNNs) have long been the cornerstone of object detection, but they are often constrained by limited receptive fields, which hinders their ability to capture global contextual information. We re-examined the DETR-inspired detection head and found substantial redundancy in its self-attention module. To solve these problems, we introduce the Context-Gated Scale-Adaptive Detection Network (CSDN), a Transformer-based detection head inspired by human visual perception: when observing an object, we concentrate on one site, perceive the surrounding environment, and glance around the object. This mechanism enables each region of interest (ROI) to adaptively select and combine feature dimensions and scale information from different patterns. CSDN provides more powerful global context modeling capabilities and can better adapt to objects of different sizes and structures. Our proposed detection head can directly replace the native heads of various CNN-based detectors, and only a few rounds of fine-tuning on the pre-trained weights significantly improves detection accuracy.
[515] DIP: Unsupervised Dense In-Context Post-training of Visual Representations
Sophia Sirko-Galouchenko, Spyros Gidaris, Antonin Vobecky, Andrei Bursuc, Nicolas Thome
Main category: cs.CV
TL;DR: DIP is an unsupervised post-training method for enhancing dense image representations in pretrained vision encoders, using pseudo-tasks inspired by meta-learning. It outperforms prior methods and the initial encoder.
Details
Motivation: To improve dense representations for in-context scene understanding without relying on complex self-distillation or labeled data.
Method: Trains the vision encoder using pseudo-tasks generated by combining a pretrained diffusion model and the encoder itself.
Result: Achieves strong performance across downstream tasks, outperforming the initial encoder and prior methods.
Conclusion: DIP is a simple, efficient, and effective solution for enhancing dense representations in vision encoders.
Abstract: We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations. Code available here: https://github.com/sirkosophia/DIP
[516] DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic
Munish Monga, Vishal Chudasama, Pankaj Wasnik, Biplab Banerjee
Main category: cs.CV
TL;DR: The paper introduces Dual Incremental Object Detection (DuIOD) and DuET, a framework for handling class and domain shifts in object detection without exemplars, outperforming existing methods.
Details
Motivation: Existing approaches (CIOD and DIOD) fail to address both class and domain shifts simultaneously, limiting real-world applicability.
Method: Proposes DuET, a Task Arithmetic-based model merging framework with Directional Consistency Loss to mitigate sign conflicts, enabling stable incremental learning.
Result: DuET achieves significant improvements in RAI (+13.12% on Pascal Series, +11.39% on Diverse Weather Series) while maintaining high Avg RI.
Conclusion: DuET is a detector-agnostic solution for real-time incremental object detection, effectively balancing retention and adaptation.
Abstract: Real-world object detection systems, such as those in autonomous driving and surveillance, must continuously learn new object categories and simultaneously adapt to changing environmental conditions. Existing approaches, Class Incremental Object Detection (CIOD) and Domain Incremental Object Detection (DIOD) only address one aspect of this challenge. CIOD struggles in unseen domains, while DIOD suffers from catastrophic forgetting when learning new classes, limiting their real-world applicability. To overcome these limitations, we introduce Dual Incremental Object Detection (DuIOD), a more practical setting that simultaneously handles class and domain shifts in an exemplar-free manner. We propose DuET, a Task Arithmetic-based model merging framework that enables stable incremental learning while mitigating sign conflicts through a novel Directional Consistency Loss. Unlike prior methods, DuET is detector-agnostic, allowing models like YOLO11 and RT-DETR to function as real-time incremental object detectors. To comprehensively evaluate both retention and adaptation, we introduce the Retention-Adaptability Index (RAI), which combines the Average Retention Index (Avg RI) for catastrophic forgetting and the Average Generalization Index for domain adaptability into a common ground. Extensive experiments on the Pascal Series and Diverse Weather Series demonstrate DuET’s effectiveness, achieving a +13.12% RAI improvement while preserving 89.3% Avg RI on the Pascal Series (4 tasks), as well as a +11.39% RAI improvement with 88.57% Avg RI on the Diverse Weather Series (3 tasks), outperforming existing methods.
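The sign conflicts mentioned here arise when different task vectors pull the same parameter in opposite directions. DuET addresses this at training time with its Directional Consistency Loss; as a rough illustration of the underlying problem, here is an elect-sign merge in the style of TIES-merging, not DuET's actual mechanism:

```python
import torch

def elect_sign_merge(base, finetuned_list, scale=1.0):
    """Merge task vectors while masking sign conflicts: for each parameter,
    keep only the updates that agree with the dominant direction."""
    merged = {}
    for name, p0 in base.items():
        deltas = torch.stack([ft[name] - p0 for ft in finetuned_list])  # (K, ...)
        sign = torch.sign(deltas.sum(dim=0))           # elected direction per weight
        agree = torch.sign(deltas) == sign             # mask conflicting task vectors
        merged[name] = p0 + scale * (deltas * agree).sum(dim=0)
    return merged
```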
[517] SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark
Alex Costanzino, Pierluigi Zama Ramirez, Luigi Lella, Matteo Ragaglia, Alessandro Oliva, Giuseppe Lisanti, Luigi Di Stefano
Main category: cs.CV
TL;DR: SiM3D is the first benchmark for 3D anomaly detection and segmentation (ADS) integrating multiview and multimodal data, focusing on single-instance training and synthetic-to-real generalization.
Details
Motivation: The paper addresses the lack of benchmarks for comprehensive 3D ADS, especially in manufacturing, where single-instance training and synthetic-to-real generalization are critical.
Method: SiM3D introduces a novel dataset with multiview high-resolution images, point clouds, and CAD models, and adapts singleview methods for baseline performance evaluation.
Result: The benchmark provides annotated 3D segmentation ground truths and evaluates adapted methods using new metrics for Anomaly Volumes.
Conclusion: SiM3D fills a gap in 3D ADS benchmarks, offering a robust framework for future research in synthetic-to-real anomaly detection.
Abstract: We propose SiM3D, the first benchmark considering the integration of multiview and multimodal information for comprehensive 3D anomaly detection and segmentation (ADS), where the task is to produce a voxel-based Anomaly Volume. Moreover, SiM3D focuses on a scenario of high interest in manufacturing: single-instance anomaly detection, where only one object, either real or synthetic, is available for training. In this respect, SiM3D stands out as the first ADS benchmark that addresses the challenge of generalising from synthetic training data to real test data. SiM3D includes a novel multimodal multiview dataset acquired using top-tier industrial sensors and robots. The dataset features multiview high-resolution images (12 Mpx) and point clouds (7M points) for 333 instances of eight types of objects, alongside a CAD model for each type. We also provide manually annotated 3D segmentation GTs for anomalous test samples. To establish reference baselines for the proposed multiview 3D ADS task, we adapt prominent singleview methods and assess their performance using novel metrics that operate on Anomaly Volumes.
[518] ProSAM: Enhancing the Robustness of SAM-based Visual Reference Segmentation with Probabilistic Prompts
Xiaoqi Wang, Clint Sebastian, Wenbin He, Liu Ren
Main category: cs.CV
TL;DR: ProSAM improves visual reference segmentation by predicting stable prompt distributions, outperforming SAM-based methods on key datasets.
Details
Motivation: Existing SAM-based methods for visual reference segmentation generate unstable prompts at target boundaries due to suboptimal encoders, limiting robustness.
Method: ProSAM introduces a variational prompt encoder to predict multivariate prompt distributions, avoiding unstable regions.
Result: ProSAM consistently outperforms state-of-the-art methods on Pascal-5$^i$ and COCO-20$^i$ datasets.
Conclusion: ProSAM offers a robust solution for visual reference segmentation by addressing prompt stability issues.
Abstract: The recent advancements in large foundation models have driven the success of open-set image segmentation, a task focused on segmenting objects beyond predefined categories. Among various prompt types (such as points, boxes, texts, and visual references), visual reference segmentation stands out for its unique flexibility and strong zero-shot capabilities. Recently, several SAM-based methods have made notable progress in this task by automatically generating prompts to guide SAM. However, these methods often generate prompts at the boundaries of target regions due to a suboptimal prompt encoder, which results in instability and reduced robustness. In this work, we introduce ProSAM, a simple but effective method to address the stability challenges we identified in existing SAM-based visual reference segmentation approaches. By learning a variational prompt encoder to predict multivariate prompt distributions, ProSAM avoids generating prompts that lie in unstable regions, overcoming the instability caused by less robust prompts. Our approach consistently surpasses state-of-the-art methods on the Pascal-5$^i$ and COCO-20$^i$ datasets, providing a more robust solution for visual reference segmentation.
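A minimal sketch of a variational prompt encoder follows, assuming a diagonal Gaussian over 2D point-prompt locations and the usual reparameterization trick; head names and dimensions are illustrative, not the paper's architecture:

```python
import torch
import torch.nn as nn

class VariationalPromptEncoder(nn.Module):
    """Predicts a distribution over prompt locations rather than a single point,
    so sampled prompts can avoid unstable boundary regions."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mu_head = nn.Linear(feat_dim, 2)        # mean prompt location (x, y)
        self.logvar_head = nn.Linear(feat_dim, 2)    # per-axis log-variance

    def forward(self, ref_feat):                     # ref_feat: (B, feat_dim)
        mu, logvar = self.mu_head(ref_feat), self.logvar_head(ref_feat)
        std = torch.exp(0.5 * logvar)
        prompt = mu + std * torch.randn_like(std)    # reparameterized sample
        return prompt, mu, logvar                    # mu/logvar can feed a KL term
```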
[519] Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment
Rui Xu, Yunke Wang, Yong Luo, Bo Du
Main category: cs.CV
TL;DR: VisionDrop is a training-free, visual-only pruning framework for LVLMs that reduces computational overhead by selecting informative visual tokens without relying on text, achieving significant efficiency gains while retaining performance.
Details
Motivation: Current text-guided visual token reduction in LVLMs suffers from cross-modal misalignment, undermining effectiveness. VisionDrop addresses this by focusing on intra-modal attention for token selection.
Method: VisionDrop uses visual-to-visual attention for token pruning and a progressive pipeline to suppress redundancy across the visual encoder and LLM.
Result: VisionDrop reduces inference latency by 2.7x and FLOPs by 6x while maintaining 95.71% of original performance when integrated with LLaVA-NeXT-7B.
Conclusion: VisionDrop offers an efficient, training-free solution for visual token reduction in LVLMs, outperforming existing methods.
Abstract: Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial computational overhead and limiting the scalability of LVLMs in practice. Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs). However, most in-LLM reduction approaches rely on text-conditioned interactions, implicitly assuming that textual tokens can reliably capture the importance of visual tokens. In this work, we revisit this assumption and reveal causal, semantic, and spatial forms of cross-modal misalignment. These misalignments undermine the effectiveness of text-guided visual token reduction. To address this, we introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention, without relying on textual signals. To further suppress redundancy throughout the model hierarchy, we treat the visual encoder and the LLM as a unified system and design a progressive pruning pipeline. Our method performs dominant token selection and lightweight contextual merging at multiple stages, enabling fine-grained visual information to be retained even under aggressive token budgets. Extensive experiments across diverse benchmarks show that VisionDrop achieves consistent improvements over existing approaches, despite requiring no additional training or complex modifications. Notably, when integrated with LLaVA-NeXT-7B, VisionDrop achieves a 2.7x reduction in inference latency and 6x in FLOPs, while retaining 95.71% of the original performance.
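The key move is scoring visual tokens by intra-modal attention only. A simplified sketch of dominant-token selection plus a single mean-merged context token is below; the scoring rule and merge are assumptions distilled from the description, not the released code:

```python
import torch

def prune_visual_tokens(tokens, attn, keep_ratio=0.25):
    """tokens: (B, N, D) visual tokens; attn: (B, H, N, N) visual self-attention.
    Keep the tokens that receive the most attention from other visual tokens
    (no text signal involved) and merge the remainder into one context token."""
    B, N, D = tokens.shape
    scores = attn.mean(dim=1).mean(dim=1)                       # attention received per token
    keep_idx = scores.topk(max(1, int(N * keep_ratio)), dim=-1).indices
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    dropped = torch.ones(B, N, dtype=torch.bool, device=tokens.device)
    dropped.scatter_(1, keep_idx, False)
    context = ((tokens * dropped.unsqueeze(-1)).sum(dim=1) /
               dropped.sum(dim=1, keepdim=True).clamp(min=1)).unsqueeze(1)
    return torch.cat([kept, context], dim=1)                    # (B, k+1, D)
```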
[520] Subjective Camera 0.1: Bridging Human Cognition and Visual Reconstruction through Sequence-Aware Sketch-Guided Diffusion
Haoyang Chen, Dongfang Sun, Caoyuan Ma, Shiqin Wang, Kewei Zhang, Zheng Wang, Zhixiang Wang
Main category: cs.CV
TL;DR: A framework called Subjective Camera 0.1 reconstructs real-world scenes using textual descriptions and rough sketches, leveraging diffusion models without large-scale paired data.
Details
Motivation: Physical cameras often miss meaningful moments; this work aims to capture them using subjective inputs like text and sketches.
Method: Uses optimization-based alignment of diffusion models with a Sequence-Aware Sketch-Guided Diffusion framework, incorporating three loss terms for sequential optimization.
Result: Achieves state-of-the-art performance in image quality, spatial, and semantic alignment, preferred by users in studies.
Conclusion: The method effectively reconstructs scenes from subjective inputs, avoiding large datasets and improving generalization.
Abstract: We introduce the concept of a subjective camera to reconstruct meaningful moments that physical cameras fail to capture. We propose Subjective Camera 0.1, a framework for reconstructing real-world scenes from readily accessible subjective readouts, i.e., textual descriptions and progressively drawn rough sketches. Built on optimization-based alignment of diffusion models, our approach avoids large-scale paired training data and mitigates generalization issues. To address the challenge of integrating multiple abstract concepts in real-world scenarios, we design a Sequence-Aware Sketch-Guided Diffusion framework with three loss terms for concept-wise sequential optimization, following the natural order of subjective readouts. Experiments on two datasets demonstrate that our method achieves state-of-the-art performance in image quality as well as spatial and semantic alignment with target scenes. User studies with 40 participants further confirm that our approach is consistently preferred. Our project page is at: subjective-camera.github.io
[521] StyleDrive: Towards Driving-Style Aware Benchmarking of End-To-End Autonomous Driving
Ruiyang Hao, Bowen Jing, Haibao Yu, Zaiqing Nie
Main category: cs.CV
TL;DR: The paper introduces the first large-scale real-world dataset for personalized end-to-end autonomous driving (E2EAD) and proposes a standardized benchmark for evaluating such models, showing improved alignment with human driving preferences.
Details
Motivation: Personalization is crucial for user trust and adoption in E2EAD but lacks large-scale datasets, hindering model development and evaluation.
Method: A hybrid annotation pipeline combines behavioral analysis, heuristics, and vision-language model (VLM) reasoning, refined via human verification. A standardized benchmark is also introduced.
Result: Incorporating personalized preferences significantly improves behavioral alignment with human demonstrations in state-of-the-art models.
Conclusion: The dataset and benchmark advance personalized E2EAD, addressing a critical gap in the field.
Abstract: Personalization, while extensively studied in conventional autonomous driving pipelines, has been largely overlooked in the context of end-to-end autonomous driving (E2EAD), despite its critical role in fostering user trust, safety perception, and real-world adoption. A primary bottleneck is the absence of large-scale real-world datasets that systematically capture driving preferences, severely limiting the development and evaluation of personalized E2EAD models. In this work, we introduce the first large-scale real-world dataset explicitly curated for personalized E2EAD, integrating comprehensive scene topology with rich dynamic context derived from agent dynamics and semantics inferred via a fine-tuned vision-language model (VLM). We propose a hybrid annotation pipeline that combines behavioral analysis, rule-and-distribution-based heuristics, and subjective semantic modeling guided by VLM reasoning, with final refinement through human-in-the-loop verification. Building upon this dataset, we introduce the first standardized benchmark for systematically evaluating personalized E2EAD models. Empirical evaluations on state-of-the-art architectures demonstrate that incorporating personalized driving preferences significantly improves behavioral alignment with human demonstrations.
[522] Rectifying Magnitude Neglect in Linear Attention
Qihang Fan, Huaibo Huang, Yuang Ai, Ran He
Main category: cs.CV
TL;DR: The paper introduces Magnitude-Aware Linear Attention (MALA) to address the performance gap between Linear Attention and Softmax Attention by incorporating Query magnitude, achieving strong results across various tasks.
Details
Motivation: Linear Attention's linear complexity is efficient but suffers performance degradation compared to Softmax Attention due to ignoring Query magnitude, limiting its dynamic adaptability.
Method: The authors analyze Linear Attention’s formulation, identify the Query magnitude issue, and propose MALA to incorporate this information, balancing attention score distribution.
Result: MALA performs well on tasks like image classification, object detection, NLP, and more, closely resembling Softmax Attention’s distribution.
Conclusion: MALA effectively bridges the performance gap between Linear and Softmax Attention while maintaining linear complexity, validated across diverse tasks.
Abstract: As the core operator of Transformers, Softmax Attention exhibits excellent global modeling capabilities. However, its quadratic complexity limits its applicability to vision tasks. In contrast, Linear Attention shares a similar formulation with Softmax Attention while achieving linear complexity, enabling efficient global information modeling. Nevertheless, Linear Attention suffers from a significant performance degradation compared to standard Softmax Attention. In this paper, we analyze the underlying causes of this issue based on the formulation of Linear Attention. We find that, unlike Softmax Attention, Linear Attention entirely disregards the magnitude information of the Query. This prevents the attention score distribution from dynamically adapting as the Query scales. As a result, despite its structural similarity to Softmax Attention, Linear Attention exhibits a significantly different attention score distribution. Based on this observation, we propose Magnitude-Aware Linear Attention (MALA), which modifies the computation of Linear Attention to fully incorporate the Query’s magnitude. This adjustment allows MALA to generate an attention score distribution that closely resembles Softmax Attention while exhibiting a more well-balanced structure. We evaluate the effectiveness of MALA on multiple tasks, including image classification, object detection, instance segmentation, semantic segmentation, natural language processing, speech recognition, and image generation. Our MALA achieves strong results on all of these tasks. Code will be available at https://github.com/qhfan/MALA
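The magnitude-neglect observation is easy to verify: with a positively homogeneous feature map, rescaling the queries cancels out of linear attention's normalization, whereas softmax attention sharpens. A minimal demonstration (the ReLU feature map and shapes are assumptions; MALA's actual fix is the paper's contribution):

```python
import torch

def linear_attn_weights(q, k, eps=1e-6):
    """Attention distribution of linear attention with a ReLU feature map."""
    s = torch.relu(q) @ torch.relu(k).T          # (Nq, Nk) raw scores
    return s / (s.sum(dim=-1, keepdim=True) + eps)

q, k = torch.randn(4, 64), torch.randn(16, 64)

w1, w2 = linear_attn_weights(q, k), linear_attn_weights(3.0 * q, k)
print(torch.allclose(w1, w2, atol=1e-5))         # True: query magnitude is ignored

s1 = torch.softmax(q @ k.T, dim=-1)
s2 = torch.softmax((3.0 * q) @ k.T, dim=-1)      # softmax sharpens with scale
print(torch.allclose(s1, s2, atol=1e-5))         # False
```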
[523] LACONIC: A 3D Layout Adapter for Controllable Image Creation
Léopold Maillard, Tom Durand, Adrien Ramanana Rahary, Maks Ovsjanikov
Main category: cs.CV
TL;DR: A novel conditioning approach and adapter network for 3D-aware guided image synthesis, enhancing pretrained diffusion models with camera control, 3D geometry conditioning, and scene context awareness.
Details
Motivation: Existing methods for guided image synthesis lack 3D geometric consistency and struggle with scene context. This work aims to address these limitations by introducing 3D-awareness into pretrained models.
Method: Proposes a lightweight adapter network for pretrained text-to-image diffusion models, supporting camera control, 3D geometry conditioning, and full scene context. Includes intuitive editing tools like object positioning and resizing.
Result: The method achieves plausible, semantically rich images with remarkable generalization, requiring reasonable supervised data. It integrates well into workflows and supports diverse applications.
Conclusion: The approach successfully enhances diffusion models with 3D-awareness, enabling richer image synthesis and editing capabilities compared to prior methods.
Abstract: Existing generative approaches for guided image synthesis of multi-object scenes typically rely on 2D controls in the image or text space. As a result, these methods struggle to maintain and respect the consistent three-dimensional geometric structure underlying the scene. In this paper, we propose a novel conditioning approach, training method and adapter network that can be plugged into pretrained text-to-image diffusion models. Our approach provides a way to endow such models with 3D-awareness, while leveraging their rich prior knowledge. Our method supports camera control, conditioning on explicit 3D geometries and, for the first time, accounts for the entire context of a scene, i.e., both on and off-screen items, to synthesize plausible and semantically rich images. Despite its multi-modal nature, our model is lightweight, requires a reasonable amount of data for supervised learning and shows remarkable generalization power. We also introduce methods for intuitive and consistent image editing and restyling, e.g., by positioning, rotating or resizing individual objects in a scene. Our method integrates well within various image creation workflows and enables a richer set of applications compared to previous approaches.
[524] EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation
Rang Meng, Yan Wang, Weipeng Wu, Ruobing Zheng, Yuming Li, Chenguang Ma
Main category: cs.CV
TL;DR: EchoMimicV3 is an efficient framework for multi-task and multi-modal human animation, addressing slow inference and high computational costs of traditional methods.
Details
Motivation: Overcoming the limitations of slow inference, high computational demands, and the inefficiency of separate models for each animation task.
Method: Uses a Soup-of-Tasks paradigm, Soup-of-Modals paradigm, and novel training/inference strategies like Negative Direct Preference Optimization.
Result: Achieves competitive performance with a minimal model size of 1.3 billion parameters.
Conclusion: EchoMimicV3 offers an efficient, unified solution for human animation, with plans to open-source the code.
Abstract: Recent work on human animation usually incorporates large-scale video models, thereby achieving more vivid performance. However, the practical use of such methods is hindered by the slow inference speed and high computational demands. Moreover, traditional work typically employs separate models for each animation task, increasing costs in multi-task scenarios and worsening the dilemma. To address these limitations, we introduce EchoMimicV3, an efficient framework that unifies multi-task and multi-modal human animation. At the core of EchoMimicV3 lies a threefold design: a Soup-of-Tasks paradigm, a Soup-of-Modals paradigm, and a novel training and inference strategy. The Soup-of-Tasks leverages multi-task mask inputs and a counter-intuitive task allocation strategy to achieve multi-task gains without multi-model pains. Meanwhile, the Soup-of-Modals introduces a Coupled-Decoupled Multi-Modal Cross Attention module to inject multi-modal conditions, complemented by a Multi-Modal Timestep Phase-aware Dynamical Allocation mechanism to modulate multi-modal mixtures. Besides, we propose Negative Direct Preference Optimization, Phase-aware Negative Classifier-Free Guidance (CFG), and Long Video CFG, which ensure stable training and inference. Extensive experiments and analyses demonstrate that EchoMimicV3, with a minimal model size of 1.3 billion parameters, achieves competitive performance in both quantitative and qualitative evaluations. We are committed to open-sourcing our code for community use.
[525] Voyaging into Perpetual Dynamic Scenes from a Single View
Fengrui Tian, Tianjiao Ding, Jinqi Luo, Hancheng Min, René Vidal
Main category: cs.CV
TL;DR: DynamicVoyager addresses single-view dynamic scene generation by reformulating it as a 3D-consistent scene outpainting problem, leveraging ray information for perpetual motion consistency.
Details
Motivation: Generating perpetual dynamic scenes from a single view is crucial for AR/VR and robotics, but prior methods fail to ensure 3D motion consistency.
Method: The approach maps a single-view video to a dynamic point cloud, renders partial novel views, and uses ray information to outpaint missing regions for 3D consistency.
Result: The model generates perpetual scenes with consistent motions, controllable via text prompts.
Conclusion: DynamicVoyager advances dynamic scene generation by ensuring 3D consistency and perpetual motion from a single view.
Abstract: The problem of generating a perpetual dynamic scene from a single view is an important problem with widespread applications in augmented and virtual reality, and robotics. However, since dynamic scenes regularly change over time, a key challenge is to ensure that different generated views be consistent with the underlying 3D motions. Prior work learns such consistency by training on multiple views, but the generated scene regions often interpolate between training views and fail to generate perpetual views. To address this issue, we propose DynamicVoyager, which reformulates dynamic scene generation as a scene outpainting problem with new dynamic content. As 2D outpainting models struggle at generating 3D consistent motions from a single 2D view, we enrich 2D pixels with information from their 3D rays that facilitates learning of 3D motion consistency. More specifically, we first map the single-view video input to a dynamic point cloud using the estimated video depths. We then render a partial video of the point cloud from a novel view and outpaint the missing regions using ray information (e.g., the distance from a ray to the point cloud) to generate 3D consistent motions. Next, we use the outpainted video to update the point cloud, which is used for outpainting the scene from future novel views. Moreover, we can control the generated content with the input text prompt. Experiments show that our model can generate perpetual scenes with consistent motions along fly-through cameras. Project page: https://tianfr.github.io/DynamicVoyager.
[526] SegmentDreamer: Towards High-fidelity Text-to-3D Synthesis with Segmented Consistency Trajectory Distillation
Jiahao Zhu, Zixuan Chen, Guangcong Wang, Xiaohua Xie, Yi Zhou
Main category: cs.CV
TL;DR: SegmentDreamer improves text-to-3D generation by addressing imbalance issues in consistency models with Segmented Consistency Trajectory Distillation (SCTD).
Details
Motivation: Current CD-based methods suffer from improper conditional guidance due to imbalance between self- and cross-consistency, leading to sub-optimal results.
Method: Proposes SCTD to reformulate SDS, partitioning the PF-ODE trajectory into segments for tighter distillation error bounds and a more stable distillation pipeline.
Result: SegmentDreamer outperforms state-of-the-art methods in visual quality, enabling high-fidelity 3D asset creation via 3D Gaussian Splatting.
Conclusion: SegmentDreamer effectively mitigates imbalance issues and enhances text-to-3D generation fidelity.
Abstract: Recent advancements in text-to-3D generation improve the visual quality of Score Distillation Sampling (SDS) and its variants by directly connecting Consistency Distillation (CD) to score distillation. However, due to the imbalance between self-consistency and cross-consistency, these CD-based methods inherently suffer from improper conditional guidance, leading to sub-optimal generation results. To address this issue, we present SegmentDreamer, a novel framework designed to fully unleash the potential of consistency models for high-fidelity text-to-3D generation. Specifically, we reformulate SDS through the proposed Segmented Consistency Trajectory Distillation (SCTD), effectively mitigating the imbalance issues by explicitly defining the relationship between self- and cross-consistency. Moreover, SCTD partitions the Probability Flow Ordinary Differential Equation (PF-ODE) trajectory into multiple sub-trajectories and ensures consistency within each segment, which can theoretically provide a significantly tighter upper bound on distillation error. Additionally, we propose a distillation pipeline for a more swift and stable generation. Extensive experiments demonstrate that our SegmentDreamer outperforms state-of-the-art methods in visual quality, enabling high-fidelity 3D asset creation through 3D Gaussian Splatting (3DGS).
[527] $I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting
Zhimin Liao, Ping Wei, Ruijie Zhang, Shuaijia Chen, Haoxuan Wang, Ziyang Ren
Main category: cs.CV
TL;DR: $I^{2}$-World is an efficient 4D occupancy forecasting framework that decouples scene tokenization into intra-scene and inter-scene tokenizers, achieving state-of-the-art performance with high computational efficiency.
Details
Motivation: Addressing the challenge of efficiently tokenizing complex 3D scenes for autonomous driving systems to handle corner cases.
Method: Uses intra-scene (multi-scale residual quantization) and inter-scene (residual aggregation of temporal dependencies) tokenizers with an encoder-decoder architecture for high-level control and temporal consistency.
Result: Outperforms existing methods by 25.1% in mIoU and 36.9% in IoU, with 2.9 GB training memory and 37.0 FPS real-time inference.
Conclusion: $I^{2}$-World is a computationally efficient and high-performing solution for 4D occupancy forecasting in autonomous driving.
Abstract: Forecasting the evolution of 3D scenes and generating unseen scenarios via occupancy-based world models offers substantial potential for addressing corner cases in autonomous driving systems. While tokenization has revolutionized image and video generation, efficiently tokenizing complex 3D scenes remains a critical challenge for 3D world models. To address this, we propose $I^{2}$-World, an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra-scene and inter-scene tokenizers. The intra-scene tokenizer employs a multi-scale residual quantization strategy to hierarchically compress 3D scenes while preserving spatial details. The inter-scene tokenizer residually aggregates temporal dependencies across timesteps. This dual design preserves the compactness of 3D tokenizers while retaining the dynamic expressiveness of 4D tokenizers. Unlike decoder-only GPT-style autoregressive models, $I^{2}$-World adopts an encoder-decoder architecture. The encoder aggregates spatial context from the current scene and predicts a transformation matrix to enable high-level control over scene generation. The decoder, conditioned on this matrix and historical tokens, ensures temporal consistency during generation. Experiments demonstrate that $I^{2}$-World achieves state-of-the-art performance, outperforming existing methods by 25.1% in mIoU and 36.9% in IoU for 4D occupancy forecasting while exhibiting exceptional computational efficiency: it requires merely 2.9 GB of training memory and achieves real-time inference at 37.0 FPS. Our code is available on https://github.com/lzzzzzm/II-World.
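The intra-scene tokenizer's core primitive, residual quantization, is a standard construction: each stage quantizes what the previous stage failed to reconstruct. A minimal sketch follows (codebook sizes and feature shapes are assumptions; the paper additionally operates across spatial scales):

```python
import torch
import torch.nn as nn

class ResidualQuantizer(nn.Module):
    """Coarse-to-fine residual quantization: stage i encodes the residual left
    by stages 0..i-1, so early codes capture coarse structure and later codes
    capture detail."""
    def __init__(self, dim=64, codebook_size=512, n_stages=3):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(n_stages))

    def forward(self, x):                          # x: (N, dim) scene features
        residual, recon, codes = x, torch.zeros_like(x), []
        for cb in self.codebooks:
            idx = torch.cdist(residual, cb.weight).argmin(dim=-1)  # nearest code
            q = cb(idx)
            recon, residual = recon + q, residual - q
            codes.append(idx)
        return recon, codes
```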
[528] Frequency Regulation for Exposure Bias Mitigation in Diffusion Models
Meng Yu, Kun Zhan
Main category: cs.CV
TL;DR: The paper addresses exposure bias in diffusion models by analyzing energy patterns in noisy samples and introduces a dynamic frequency regulation mechanism to improve generative quality.
Details
Motivation: Diffusion models suffer from exposure bias, which impacts their generative capabilities. The study aims to understand and mitigate this issue.
Method: The authors analyze energy patterns in noisy samples, identify distinct subband behaviors, and propose a training-free, plug-and-play dynamic frequency regulation mechanism using wavelet transforms.
Result: The method significantly improves generative quality across various diffusion models with negligible computational overhead.
Conclusion: The proposed approach effectively mitigates exposure bias, enhancing diffusion models’ performance without additional training.
Abstract: Diffusion models exhibit impressive generative capabilities but are significantly impacted by exposure bias. In this paper, we make a key observation: the energy of predicted noisy samples in the reverse process continuously declines compared to perturbed samples in the forward process. Building on this, we identify two important findings: 1) The reduction in energy follows distinct patterns in the low-frequency and high-frequency subbands; 2) The subband energy of reverse-process reconstructed samples is consistently lower than that of forward-process ones, and both are lower than the original data samples. Based on the first finding, we introduce a dynamic frequency regulation mechanism utilizing wavelet transforms, which separately adjusts the low- and high-frequency subbands. Leveraging the second insight, we derive the rigorous mathematical form of exposure bias. It is worth noting that our method is training-free and plug-and-play, significantly improving the generative quality of various diffusion models and frameworks with negligible computational cost. The source code is available at https://github.com/kunzhan/wpp.
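Since the mechanism is training-free and operates on wavelet subbands, it can be sketched in a few lines with PyWavelets. The fixed gains below are placeholders; the paper derives the regulation dynamically from the observed energy decline:

```python
import numpy as np
import pywt

def regulate_subbands(x_pred, gain_low=1.02, gain_high=1.05):
    """Rescale low- and high-frequency wavelet subbands of a predicted sample
    separately, counteracting the energy drop of the reverse process."""
    cA, (cH, cV, cD) = pywt.dwt2(x_pred, 'haar')       # approx + detail subbands
    cA *= gain_low                                     # low-frequency regulation
    cH, cV, cD = cH * gain_high, cV * gain_high, cD * gain_high
    return pywt.idwt2((cA, (cH, cV, cD)), 'haar')      # back to pixel space

x_reg = regulate_subbands(np.random.randn(64, 64))     # e.g., one denoising step
```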
[529] Text Embedding Knows How to Quantize Text-Guided Diffusion Models
Hongjae Lee, Myungjun Son, Dongjea Kang, Seung-Won Jung
Main category: cs.CV
TL;DR: QLIP introduces a text-prompt-guided quantization method for diffusion models, improving efficiency and image quality.
Details
Motivation: Existing quantization methods for diffusion models ignore input conditions like text prompts, limiting their effectiveness.
Method: QLIP uses text prompts to dynamically select bit precision per layer and time step, integrating with existing quantization techniques.
Result: QLIP reduces computational complexity and enhances image quality across multiple datasets.
Conclusion: QLIP effectively leverages text prompts for efficient and high-quality diffusion model quantization.
Abstract: Despite the success of diffusion models in image generation tasks such as text-to-image, the enormous computational complexity of diffusion models limits their use in resource-constrained environments. To address this, network quantization has emerged as a promising solution for designing efficient diffusion models. However, existing diffusion model quantization methods do not consider input conditions, such as text prompts, as an essential source of information for quantization. In this paper, we propose a novel quantization method dubbed Quantization of Language-to-Image diffusion models using text Prompts (QLIP). QLIP leverages text prompts to guide the selection of bit precision for every layer at each time step. In addition, QLIP can be seamlessly integrated into existing quantization methods to enhance quantization efficiency. Our extensive experiments demonstrate the effectiveness of QLIP in reducing computational complexity and improving the quality of the generated images across various datasets.
[530] Describe Anything Model for Visual Question Answering on Text-rich Images
Yen-Linh Vu, Dinh-Thang Duong, Truong-Binh Duong, Anh-Khoi Nguyen, Thanh-Huy Nguyen, Le Thien Phuc Nguyen, Jianhua Xing, Xingjian Li, Tianyang Wang, Ulas Bagci, Min Xu
Main category: cs.CV
TL;DR: DAM-QA leverages the region-aware capabilities of the Describe Anything Model (DAM) for text-rich Visual Question Answering (VQA), outperforming baselines and narrowing the gap with generalist vision-language models.
Details
Motivation: To enhance VQA performance in text-rich images by utilizing DAM's fine-grained region-level descriptive capabilities.
Method: Introduces DAM-QA, a framework that aggregates answers from multiple regional views of image content for better text-related evidence identification.
Result: Outperforms baseline DAM by 7+ points on DocVQA and achieves top performance among region-aware models with fewer parameters.
Conclusion: DAM-like models, with efficient usage strategies, show strong potential for text-rich and broader VQA tasks.
Abstract: Recent progress has been made in region-aware vision-language modeling, particularly with the emergence of the Describe Anything Model (DAM). DAM is capable of generating detailed descriptions of any specific image areas or objects without the need for additional localized image-text alignment supervision. We hypothesize that such region-level descriptive capability is beneficial for the task of Visual Question Answering (VQA), especially in challenging scenarios involving images with dense text. In such settings, the fine-grained extraction of textual information is crucial to producing correct answers. Motivated by this, we introduce DAM-QA, a framework with a tailored evaluation protocol, developed to investigate and harness the region-aware capabilities from DAM for the text-rich VQA problem that requires reasoning over text-based information within images. DAM-QA incorporates a mechanism that aggregates answers from multiple regional views of image content, enabling more effective identification of evidence that may be tied to text-related elements. Experiments on six VQA benchmarks show that our approach consistently outperforms the baseline DAM, with a notable 7+ point gain on DocVQA. DAM-QA also achieves the best overall performance among region-aware models with fewer parameters, significantly narrowing the gap with strong generalist VLMs. These results highlight the potential of DAM-like models for text-rich and broader VQA tasks when paired with efficient usage and integration strategies. Our code is publicly available at https://github.com/Linvyl/DAM-QA.git.
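The aggregation step can be as simple as a weighted vote over answers produced from the full image and several regional crops. A toy sketch under that assumption (the paper's exact aggregation rule is not reproduced here):

```python
from collections import Counter

def aggregate_answers(answers, weights=None):
    """Weighted majority vote over answers from multiple regional views."""
    weights = weights or [1.0] * len(answers)
    votes = Counter()
    for ans, w in zip(answers, weights):
        votes[ans.strip().lower()] += w
    return votes.most_common(1)[0][0]

# e.g., answers from the full image plus four quadrant crops
print(aggregate_answers(["42", "42", "4Z", "42", "unknown"]))  # -> "42"
```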
[531] WhoFi: Deep Person Re-Identification via Wi-Fi Channel Signal Encoding
Danilo Avola, Emad Emam, Dario Montagnini, Daniele Pannone, Amedeo Ranaldi
Main category: cs.CV
TL;DR: WhoFi uses Wi-Fi signals (CSI) and a Transformer-based DNN for person re-identification, achieving competitive results on the NTU-Fi dataset.
Details
Motivation: Traditional visual-based methods struggle with poor lighting, occlusion, and suboptimal angles, prompting the use of Wi-Fi signals for more robust identification.
Method: Extracts biometric features from CSI, processes them through a modular DNN with a Transformer encoder, and trains using in-batch negative loss.
Result: Competitive performance on the NTU-Fi dataset compared to state-of-the-art methods.
Conclusion: WhoFi effectively identifies individuals using Wi-Fi signals, addressing limitations of visual-based approaches.
Abstract: Person Re-Identification is a key and challenging task in video surveillance. While traditional methods rely on visual data, issues like poor lighting, occlusion, and suboptimal angles often hinder performance. To address these challenges, we introduce WhoFi, a novel pipeline that utilizes Wi-Fi signals for person re-identification. Biometric features are extracted from Channel State Information (CSI) and processed through a modular Deep Neural Network (DNN) featuring a Transformer-based encoder. The network is trained using an in-batch negative loss function to learn robust and generalizable biometric signatures. Experiments on the NTU-Fi dataset show that our approach achieves competitive results compared to state-of-the-art methods, confirming its effectiveness in identifying individuals via Wi-Fi signals.
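The in-batch negative loss mentioned here is the standard InfoNCE-style objective: each anchor's matching sample in the batch is its positive, and every other sample serves as a negative. A minimal version (the temperature and the CSI-to-embedding encoder are assumptions):

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(anchor, positive, temperature=0.07):
    """anchor, positive: (B, D) embeddings of two CSI captures per identity.
    Diagonal pairs are positives; off-diagonal pairs act as negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature                 # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)
```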
[532] Polymorph: Energy-Efficient Multi-Label Classification for Video Streams on Embedded Devices
Saeid Ghafouri, Mohsen Fayyaz, Xiangchen Li, Deepu John, Bo Ji, Dimitrios Nikolopoulos, Hans Vandierendonck
Main category: cs.CV
TL;DR: Polymorph is a framework for efficient real-time multi-label video classification on embedded devices by leveraging label sparsity and co-occurrence patterns, reducing energy use by 40% and improving mAP by 9 points.
Details
Motivation: Limited compute and energy budgets on embedded devices challenge real-time multi-label video classification, but structural properties like label sparsity and co-occurrence can optimize inference.
Method: Polymorph uses lightweight Low Rank Adapters (LoRA) specialized for subsets of classes, dynamically selecting and composing only needed adapters per frame to avoid full-model switching.
Result: Polymorph reduces energy consumption by 40% and improves mAP by 9 points on the TAO dataset compared to baselines.
Conclusion: Polymorph’s modular, context-aware approach efficiently addresses the constraints of embedded devices while enhancing performance in multi-label video classification.
Abstract: Real-time multi-label video classification on embedded devices is constrained by limited compute and energy budgets. Yet, video streams exhibit structural properties such as label sparsity, temporal continuity, and label co-occurrence that can be leveraged for more efficient inference. We introduce Polymorph, a context-aware framework that activates a minimal set of lightweight Low Rank Adapters (LoRA) per frame. Each adapter specializes in a subset of classes derived from co-occurrence patterns and is implemented as a LoRA weight over a shared backbone. At runtime, Polymorph dynamically selects and composes only the adapters needed to cover the active labels, avoiding full-model switching and weight merging. This modular strategy improves scalability while reducing latency and energy overhead. Polymorph achieves 40% lower energy consumption and improves mAP by 9 points over strong baselines on the TAO dataset. Polymorph is open source at https://github.com/inference-serving/polymorph/.
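Per-frame adapter selection reduces to covering the active labels with as few class-subset adapters as possible. A greedy set-cover sketch under that reading (the real scheduler also exploits temporal continuity, and all names here are hypothetical):

```python
def select_adapters(active_labels, adapter_classes):
    """Greedily pick adapters until their class subsets cover the frame's
    active labels. adapter_classes: {adapter_name: set_of_classes}."""
    uncovered, chosen = set(active_labels), []
    while uncovered:
        best = max(adapter_classes, key=lambda a: len(adapter_classes[a] & uncovered))
        if not adapter_classes[best] & uncovered:
            break                                   # remaining labels uncoverable
        chosen.append(best)
        uncovered -= adapter_classes[best]
    return chosen

adapters = {"traffic": {"car", "bus", "bike"}, "pets": {"dog", "cat"},
            "mixed": {"car", "dog"}}
print(select_adapters({"car", "dog"}, adapters))    # -> ['mixed']
```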
[533] HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation
Qinqian Lei, Bo Wang, Robby T. Tan
Main category: cs.CV
TL;DR: HOLa improves zero-shot HOI detection by decomposing VLM text features and refining action distinction, achieving state-of-the-art results.
Details
Motivation: Addressing limitations in existing methods for zero-shot HOI detection, particularly in distinguishing actions and generalizing to unseen classes.
Method: Decomposes VLM text features via low-rank factorization, adapts weights, and uses human-object tokens and LLM-derived regularization.
Result: Achieves an unseen-class mAP of 27.91 on HICO-DET, setting a new state-of-the-art.
Conclusion: HOLa effectively enhances generalization and action distinction in zero-shot HOI detection.
Abstract: Zero-shot human-object interaction (HOI) detection remains a challenging task, particularly in generalizing to unseen actions. Existing methods address this challenge by tapping Vision-Language Models (VLMs) to access knowledge beyond the training data. However, they either struggle to distinguish actions involving the same object or demonstrate limited generalization to unseen classes. In this paper, we introduce HOLa (Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation), a novel approach that both enhances generalization to unseen classes and improves action distinction. In training, HOLa decomposes VLM text features for given HOI classes via low-rank factorization, producing class-shared basis features and adaptable weights. These features and weights form a compact HOI representation that preserves shared information across classes, enhancing generalization to unseen classes. Subsequently, we refine action distinction by adapting weights for each HOI class and introducing human-object tokens to enrich visual interaction representations. To further distinguish unseen actions, we guide the weight adaptation with LLM-derived action regularization. Experimental results show that our method sets a new state-of-the-art across zero-shot HOI settings on HICO-DET, achieving an unseen-class mAP of 27.91 in the unseen-verb setting. Our code is available at https://github.com/ChelsieLei/HOLa.
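The low-rank factorization at the heart of HOLa can be sketched with a truncated SVD of the class-by-dimension text-feature matrix, yielding class-shared basis features and per-class weights; the weights are what the method then adapts, and the rank is an assumed hyperparameter:

```python
import torch

def low_rank_decompose(text_feats, rank=8):
    """text_feats: (C, D) VLM text features for C HOI classes.
    Returns per-class weights (C, rank) and a class-shared basis (rank, D)."""
    U, S, Vh = torch.linalg.svd(text_feats, full_matrices=False)
    weights = U[:, :rank] * S[:rank]               # adaptable per-class weights
    basis = Vh[:rank]                              # shared across all classes
    return weights, basis

feats = torch.randn(600, 512)                      # e.g., 600 HOI classes, CLIP dim
w, b = low_rank_decompose(feats)
print((w @ b).shape)                               # rank-8 reconstruction: (600, 512)
```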
[534] A Multimodal Deviation Perceiving Framework for Weakly-Supervised Temporal Forgery Localization
Wenbo Xu, Junyan Wu, Wei Lu, Xiangyang Luo, Qian Wang
Main category: cs.CV
TL;DR: The paper introduces MDP, a multimodal framework for weakly-supervised temporal forgery localization, using video-level annotations to identify forged segments efficiently.
Details
Motivation: Current Deepfake detection methods are restrictive, time-consuming, and hard to scale. MDP addresses these issues by leveraging multimodal interactions and deviation perception.
Method: MDP uses a multimodal interaction mechanism (MI) to measure visual-audio relevance and a deviation perceiving loss to refine forged segment localization.
Result: MDP achieves comparable results to fully-supervised methods in temporal forgery localization, as shown in extensive experiments.
Conclusion: The proposed MDP framework effectively localizes forged segments with minimal supervision, offering a scalable and efficient alternative to traditional methods.
Abstract: Current research on Deepfake forensics often treats detection as a classification task or a temporal forgery localization problem, both of which are restrictive, time-consuming, and challenging to scale to large datasets. To resolve these issues, we present a multimodal deviation perceiving framework for weakly-supervised temporal forgery localization (MDP), which aims to identify temporally partial forged segments using only video-level annotations. The MDP proposes a novel multimodal interaction mechanism (MI) and an extensible deviation perceiving loss to perceive multimodal deviation, which achieves refined localization of the start and end timestamps of forged segments. Specifically, MI introduces a temporal property preserving cross-modal attention to measure the relevance between the visual and audio modalities in the probabilistic embedding space. It could identify the inter-modality deviation and construct comprehensive video features for temporal forgery localization. To explore further temporal deviation for weakly-supervised learning, an extensible deviation perceiving loss has been proposed, aiming at enlarging the deviation of adjacent segments of the forged samples and reducing that of genuine samples. Extensive experiments demonstrate the effectiveness of the proposed framework, which achieves results comparable to fully-supervised approaches on several evaluation metrics.
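The deviation perceiving loss can be read as a margin objective on adjacent-segment feature deviation: enlarge it for forged videos, shrink it for genuine ones. A minimal sketch under that reading (the margin and the L2 deviation measure are assumptions):

```python
import torch

def deviation_perceiving_loss(features, is_forged, margin=1.0):
    """features: (B, T, D) per-segment embeddings; is_forged: (B,) bool labels.
    Pushes adjacent-segment deviation above a margin for forged samples and
    toward zero for genuine ones."""
    dev = (features[:, 1:] - features[:, :-1]).norm(dim=-1).mean(dim=-1)  # (B,)
    loss_forged = torch.clamp(margin - dev, min=0.0)
    loss_genuine = dev
    return torch.where(is_forged, loss_forged, loss_genuine).mean()
```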
[535] Eyes Will Shut: A Vision-Based Next GPS Location Prediction Model by Reinforcement Learning from Visual Map Feed Back
Ruixing Zhang, Yang Zhang, Tongyu Zhu, Leilei Sun, Weifeng Lv
Main category: cs.CV
TL;DR: The paper introduces a human-like approach for next-location prediction using Vision-Language Models (VLMs), achieving SOTA performance with cross-city generalization.
Details
Motivation: Existing models lack human-like reasoning over maps, while VLMs offer strong visual reasoning capabilities.
Method: Proposes Vision-Guided Location Search (VGLS) and VLMLocPredictor with SFT tasks and Reinforcement Learning from Visual Map Feedback.
Result: Achieves SOTA performance and superior cross-city generalization on datasets from four cities.
Conclusion: VLMs can effectively mimic human-like reasoning for next-location prediction, outperforming traditional LLM-based methods.
Abstract: Next Location Prediction is a fundamental task in the study of human mobility, with wide-ranging applications in transportation planning, urban governance, and epidemic forecasting. In practice, when humans attempt to predict the next location in a trajectory, they often visualize the trajectory on a map and reason based on road connectivity and movement trends. However, the vast majority of existing next-location prediction models do not reason over maps \textbf{in the way that humans do}. Fortunately, the recent development of Vision-Language Models (VLMs) has demonstrated strong capabilities in visual perception and even visual reasoning. This opens up a new possibility: by rendering both the road network and trajectory onto an image and leveraging the reasoning abilities of VLMs, we can enable models to perform trajectory inference in a human-like manner. To explore this idea, we first propose a method called Vision-Guided Location Search (VGLS), which evaluates whether a general-purpose VLM is capable of trajectory-based reasoning without modifying any of its internal parameters. Based on insights from the VGLS results, we further propose our main approach: VLMLocPredictor, which is composed of two stages: In the first stage, we design two Supervised Fine-Tuning (SFT) tasks that help the VLM understand road network and trajectory structures and acquire basic reasoning ability on such visual inputs. In the second stage, we introduce Reinforcement Learning from Visual Map Feedback, enabling the model to self-improve its next-location prediction ability through interaction with the environment. Experiments conducted on datasets from four different cities show that our method achieves state-of-the-art (SOTA) performance and exhibits superior cross-city generalization compared to other LLM-based approaches.
[536] Preserving Topological and Geometric Embeddings for Point Cloud Recovery
Kaiyue Zhou, Zelong Tan, Hongxiao Wang, Ya-li Li, Shengjin Wang
Main category: cs.CV
TL;DR: TopGeoFormer is an end-to-end architecture for point cloud recovery, combining topological and geometric attributes through novel attention and loss mechanisms, outperforming existing methods.
Details
Motivation: Existing point cloud recovery methods fail to effectively leverage both topological and geometric attributes during sampling and restoration.
Method: Proposes TopGeoFormer with topological embedding, InterTwining Attention for merging attributes, and geometry/topological loss functions.
Result: Outperforms conventional and learning-based methods in sampling, upsampling, and recovery tasks.
Conclusion: TopGeoFormer successfully integrates topological and geometric properties, achieving superior performance in point cloud recovery.
Abstract: Recovering point clouds involves the sequential process of sampling and restoration, yet existing methods struggle to effectively leverage both topological and geometric attributes. To address this, we propose an end-to-end architecture named \textbf{TopGeoFormer}, which maintains these critical properties throughout the sampling and restoration phases. First, we revisit traditional feature extraction techniques to yield topological embedding using a continuous mapping of relative relationships between neighboring points, and integrate it in both phases for preserving the structure of the original space. Second, we propose the \textbf{InterTwining Attention} to fully merge topological and geometric embeddings, which queries shape with local awareness in both phases to form a learnable 3D shape context facilitated with point-wise, point-shape-wise, and intra-shape features. Third, we introduce a full geometry loss and a topological constraint loss to optimize the embeddings in both Euclidean and topological spaces. The geometry loss uses inconsistent matching between coarse-to-fine generations and targets for reconstructing better geometric details, and the constraint loss limits embedding variances for better approximation of the topological space. In experiments, we comprehensively compare against conventional and learning-based sampling/upsampling/recovery algorithms. The quantitative and qualitative results demonstrate that our method significantly outperforms existing sampling and recovery methods.
[537] GS-Occ3D: Scaling Vision-only Occupancy Reconstruction with Gaussian Splatting
Baijun Ye, Minghui Qin, Saining Zhang, Moonjun Gong, Shaoting Zhu, Zebang Shen, Luan Zhang, Lu Zhang, Hao Zhao, Hang Zhao
Main category: cs.CV
TL;DR: GS-Occ3D is a vision-only framework for scalable occupancy reconstruction in autonomous driving, addressing challenges like sparse viewpoints and occlusions with an Octree-based Gaussian Surfel method.
Details
Motivation: Existing LiDAR-based occupancy methods lack scalability and limit crowdsourced auto-labeling potential. Vision-only approaches face challenges like incomplete geometry and post-processing needs.
Method: GS-Occ3D uses an Octree-based Gaussian Surfel formulation for explicit occupancy representation, decomposing scenes into static background, ground, and dynamic objects for tailored modeling.
Result: Achieves state-of-the-art geometry reconstruction on Waymo and shows superior zero-shot generalization on Occ3D-nuScenes.
Conclusion: Demonstrates the potential of vision-based occupancy reconstruction for scalable auto-labeling in autonomous driving.
Abstract: Occupancy is crucial for autonomous driving, providing essential geometric priors for perception and planning. However, existing methods predominantly rely on LiDAR-based occupancy annotations, which limits scalability and prevents leveraging vast amounts of potential crowdsourced data for auto-labeling. To address this, we propose GS-Occ3D, a scalable vision-only framework that directly reconstructs occupancy. Vision-only occupancy reconstruction poses significant challenges due to sparse viewpoints, dynamic scene elements, severe occlusions, and long-horizon motion. Existing vision-based methods primarily rely on mesh representation, which suffers from incomplete geometry and requires additional post-processing, limiting scalability. To overcome these issues, GS-Occ3D optimizes an explicit occupancy representation using an Octree-based Gaussian Surfel formulation, ensuring efficiency and scalability. Additionally, we decompose scenes into static background, ground, and dynamic objects, enabling tailored modeling strategies: (1) Ground is explicitly reconstructed as a dominant structural element, significantly improving large-area consistency; (2) Dynamic vehicles are separately modeled to better capture motion-related occupancy patterns. Extensive experiments on the Waymo dataset demonstrate that GS-Occ3D achieves state-of-the-art geometry reconstruction results. By curating vision-only binary occupancy labels from diverse urban scenes, we show their effectiveness for downstream occupancy models on Occ3D-Waymo and superior zero-shot generalization on Occ3D-nuScenes. This highlights the potential of large-scale vision-based occupancy reconstruction as a new paradigm for scalable auto-labeling. Project Page: https://gs-occ3d.github.io/
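As a rough illustration of the curated binary occupancy labels, the sketch below voxelizes reconstructed surface points on a uniform grid. The paper's actual octree-based Gaussian surfel representation is adaptive, so treat this only as a shape-of-the-output reference.
```python
# Sketch: curating binary occupancy labels from reconstructed surface points.
# Assumption: a plain uniform voxel grid stands in for the paper's
# octree-based Gaussian-surfel representation; purely illustrative.
import numpy as np

def points_to_occupancy(points: np.ndarray, bounds, voxel_size: float) -> np.ndarray:
    """Mark every voxel that contains at least one surface point as occupied."""
    lo, hi = np.asarray(bounds[0]), np.asarray(bounds[1])
    shape = np.ceil((hi - lo) / voxel_size).astype(int)
    grid = np.zeros(shape, dtype=bool)
    idx = np.floor((points - lo) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < shape), axis=1)    # drop out-of-bounds points
    grid[tuple(idx[inside].T)] = True
    return grid

pts = np.random.uniform(-10, 10, size=(5000, 3))           # fake surface points
occ = points_to_occupancy(pts, ([-10, -10, -10], [10, 10, 10]), voxel_size=0.5)
print(occ.shape, occ.mean())                                # (40, 40, 40), occupancy ratio
```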
[538] Detecting Visual Information Manipulation Attacks in Augmented Reality: A Multimodal Semantic Reasoning Approach
Yanming Xiu, Maria Gorlatova
Main category: cs.CV
TL;DR: The paper addresses visual information manipulation (VIM) attacks in AR, categorizes them, and introduces a dataset (AR-VIM) and a detection framework (VIM-Sense) achieving high accuracy and low latency.
Details
Motivation: AR virtual content can mislead users through subtle manipulations, necessitating detection methods.
Method: Proposes a taxonomy for VIM attacks, creates the AR-VIM dataset, and develops VIM-Sense, a multimodal framework combining VLMs and OCR.
Result: VIM-Sense achieves 88.94% accuracy and ~7 seconds latency in detecting attacks.
Conclusion: The framework effectively detects VIM attacks, outperforming baselines, and is practical for real-world AR applications.
Abstract: The virtual content in augmented reality (AR) can introduce misleading or harmful information, leading to semantic misunderstandings or user errors. In this work, we focus on visual information manipulation (VIM) attacks in AR, where virtual content changes the meaning of real-world scenes in subtle but impactful ways. We introduce a taxonomy that categorizes these attacks into three formats: character, phrase, and pattern manipulation, and three purposes: information replacement, information obfuscation, and extra wrong information. Based on the taxonomy, we construct a dataset, AR-VIM, which consists of 452 raw-AR video pairs spanning 202 different scenes, each simulating a real-world AR scenario. To detect the attacks in the dataset, we propose a multimodal semantic reasoning framework, VIM-Sense. It combines the language and visual understanding capabilities of vision-language models (VLMs) with optical character recognition (OCR)-based textual analysis. VIM-Sense achieves an attack detection accuracy of 88.94% on AR-VIM, consistently outperforming vision-only and text-only baselines. The system achieves an average attack detection latency of 7.07 seconds in a simulated video processing framework and 7.17 seconds in a real-world evaluation conducted on a mobile Android AR application.
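A minimal sketch of the OCR-plus-VLM fusion the abstract describes: OCR diffs catch character and phrase manipulations, while a vision-language model judges semantic change. `query_vlm` is a hypothetical stand-in; only the pytesseract call is a real API.
```python
# Sketch in the spirit of VIM-Sense: compare OCR text of raw vs. AR frames,
# then ask a VLM whether the AR overlay changes the scene's meaning.
import pytesseract
from PIL import Image

def query_vlm(raw: Image.Image, ar: Image.Image, context: str) -> bool:
    """Hypothetical VLM call: True if the AR overlay manipulates scene meaning."""
    raise NotImplementedError("plug in your own vision-language model here")

def detect_vim_attack(raw_path: str, ar_path: str) -> bool:
    raw, ar = Image.open(raw_path), Image.open(ar_path)
    raw_text = pytesseract.image_to_string(raw).strip()
    ar_text = pytesseract.image_to_string(ar).strip()
    # Character/phrase manipulations usually surface as an OCR text diff;
    # pattern manipulations are left to the VLM's visual reasoning.
    if raw_text != ar_text:
        return query_vlm(raw, ar, f"text changed from {raw_text!r} to {ar_text!r}")
    return query_vlm(raw, ar, "no textual change detected")
```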
[539] Reconstructing 4D Spatial Intelligence: A Survey
Yukang Cao, Jiahao Lu, Zhisheng Huang, Zhuowen Shen, Chengfeng Zhao, Fangzhou Hong, Zhaoxi Chen, Xin Li, Wenping Wang, Yuan Liu, Ziwei Liu
Main category: cs.CV
TL;DR: The paper surveys 4D spatial intelligence reconstruction in computer vision, organizing methods into five hierarchical levels and addressing gaps in existing literature.
Details
Motivation: The rapid evolution of 3D representations and deep learning has outpaced previous surveys, leaving a lack of comprehensive analysis of 4D scene reconstruction's hierarchical structure.
Method: The authors categorize existing methods into five progressive levels of 4D spatial intelligence, from low-level 3D attributes to incorporating physical laws.
Result: A structured survey is presented, detailing challenges and future directions for each level of 4D reconstruction.
Conclusion: The paper highlights key challenges and promising research directions, maintaining an updated project page for ongoing developments.
Abstract: Reconstructing 4D spatial intelligence from visual observations has long been a central yet challenging task in computer vision, with broad real-world applications. These range from entertainment domains like movies, where the focus is often on reconstructing fundamental visual elements, to embodied AI, which emphasizes interaction modeling and physical realism. Fueled by rapid advances in 3D representations and deep learning architectures, the field has evolved quickly, outpacing the scope of previous surveys. Additionally, existing surveys rarely offer a comprehensive analysis of the hierarchical structure of 4D scene reconstruction. To address this gap, we present a new perspective that organizes existing methods into five progressive levels of 4D spatial intelligence: (1) Level 1 – reconstruction of low-level 3D attributes (e.g., depth, pose, and point maps); (2) Level 2 – reconstruction of 3D scene components (e.g., objects, humans, structures); (3) Level 3 – reconstruction of 4D dynamic scenes; (4) Level 4 – modeling of interactions among scene components; and (5) Level 5 – incorporation of physical laws and constraints. We conclude the survey by discussing the key challenges at each level and highlighting promising directions for advancing toward even richer levels of 4D spatial intelligence. To track ongoing developments, we maintain an up-to-date project page: https://github.com/yukangcao/Awesome-4D-Spatial-Intelligence.
[540] Collaborative Perceiver: Elevating Vision-based 3D Object Detection via Local Density-Aware Spatial Occupancy
Jicheng Yuan, Manh Nguyen Duc, Qian Liu, Manfred Hauswirth, Danh Le Phuoc
Main category: cs.CV
TL;DR: CoP introduces a multi-task learning framework for BEV 3D object detection, leveraging spatial occupancy to enhance environmental context and feature refinement, outperforming existing methods.
Details
Motivation: Existing BEV methods neglect intrinsic environmental contexts, limiting comprehensive perception of the physical world.
Method: CoP uses dense occupancy ground truths (LDO), voxel-height-guided sampling (VHS), and a global-local feature fusion (CFF) module to integrate 3D object detection and occupancy prediction.
Result: Achieves 49.5% mAP and 59.2% NDS on nuScenes benchmark, outperforming existing vision-based frameworks.
Conclusion: CoP bridges gaps in spatial representations and feature refinement, offering a robust solution for BEV 3D object detection.
Abstract: Vision-based bird’s-eye-view (BEV) 3D object detection has advanced significantly in autonomous driving by offering cost-effectiveness and rich contextual information. However, existing methods often construct BEV representations by collapsing extracted object features, neglecting intrinsic environmental contexts, such as roads and pavements. This hinders detectors from comprehensively perceiving the characteristics of the physical world. To alleviate this, we introduce a multi-task learning framework, Collaborative Perceiver (CoP), that leverages spatial occupancy as auxiliary information to mine consistent structural and conceptual similarities shared between 3D object detection and occupancy prediction tasks, bridging gaps in spatial representations and feature refinement. To this end, we first propose a pipeline to generate dense occupancy ground truths incorporating local density information (LDO) for reconstructing detailed environmental information. Next, we employ a voxel-height-guided sampling (VHS) strategy to distill fine-grained local features according to distinct object properties. Furthermore, we develop a global-local collaborative feature fusion (CFF) module that seamlessly integrates complementary knowledge between both tasks, thus composing more robust BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that CoP outperforms existing vision-based frameworks, achieving 49.5% mAP and 59.2% NDS on the test set. Code and supplementary materials are available at this link https://github.com/jichengyuan/Collaborative-Perceiver.
[541] Segment Anything for Video: A Comprehensive Review of Video Object Segmentation and Tracking from Past to Future
Guoping Xu, Jayaram K. Udupa, Yajun Yu, Hua-Chieh Shao, Songlin Zhao, Wei Liu, You Zhang
Main category: cs.CV
TL;DR: This survey reviews SAM/SAM2-based methods for Video Object Segmentation and Tracking (VOST), focusing on past, present, and future temporal dimensions, highlighting advancements and challenges.
Details
Motivation: Addressing the limitations of traditional VOST methods in domain generalization, temporal consistency, and efficiency by leveraging foundation models like SAM/SAM2.
Method: Structured review along three temporal dimensions: past (historical information), present (current frame features), and future (object dynamics prediction).
Result: Highlights advancements like motion-aware memory selection and trajectory-guided prompting, while identifying challenges such as memory redundancy and prompt inefficiency.
Conclusion: The survey provides a structured overview to guide future research in VOST using foundation models, addressing current challenges.
Abstract: Video Object Segmentation and Tracking (VOST) presents a complex yet critical challenge in computer vision, requiring robust integration of segmentation and tracking across temporally dynamic frames. Traditional methods have struggled with domain generalization, temporal consistency, and computational efficiency. The emergence of foundation models like the Segment Anything Model (SAM) and its successor, SAM2, has introduced a paradigm shift, enabling prompt-driven segmentation with strong generalization capabilities. Building upon these advances, this survey provides a comprehensive review of SAM/SAM2-based methods for VOST, structured along three temporal dimensions: past, present, and future. We examine strategies for retaining and updating historical information (past), approaches for extracting and optimizing discriminative features from the current frame (present), and motion prediction and trajectory estimation mechanisms for anticipating object dynamics in subsequent frames (future). In doing so, we highlight the evolution from early memory-based architectures to the streaming memory and real-time segmentation capabilities of SAM2. We also discuss recent innovations such as motion-aware memory selection and trajectory-guided prompting, which aim to enhance both accuracy and efficiency. Finally, we identify remaining challenges including memory redundancy, error accumulation, and prompt inefficiency, and suggest promising directions for future research. This survey offers a timely and structured overview of the field, aiming to guide researchers and practitioners in advancing the state of VOST through the lens of foundation models.
[542] PixNerd: Pixel Neural Field Diffusion
Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, Limin Wang
Main category: cs.CV
TL;DR: PixNerd (Pixel Neural Field Diffusion) is a single-stage, efficient, end-to-end solution for diffusion transformers, avoiding the issues of two-stage training and complex pipelines by using a neural field representation.
Details
Motivation: Current diffusion transformers rely on pre-trained VAEs, leading to accumulated errors and artifacts. Existing solutions complicate pipelines or increase token complexity.
Method: Proposes PixNerd, which models patch-wise decoding with a neural field, eliminating the need for VAEs or cascade pipelines.
Result: Achieved 2.15 FID on ImageNet 256x256 and 2.84 FID on 512x512, with competitive scores in text-to-image benchmarks.
Conclusion: PixelNerd offers a simpler, more efficient alternative to current methods, demonstrating strong performance in image generation tasks.
Abstract: The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address the aforementioned problems, researchers return to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model the patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined as pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet $256\times256$ and 2.84 FID on ImageNet $512\times512$ without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieved a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark.
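The core idea, decoding each patch with a small neural field conditioned on the patch latent, can be sketched as below; the MLP width, depth, and coordinate handling are assumptions for illustration, not the paper's architecture.
```python
# Sketch: patch-wise neural-field decoding, the idea behind PixNerd as
# described in the abstract. A tiny MLP maps a per-patch latent plus
# normalized pixel coordinates to RGB.
import torch
import torch.nn as nn

class PatchFieldDecoder(nn.Module):
    def __init__(self, latent_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 2, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 3))                             # RGB per pixel

    def forward(self, z: torch.Tensor, patch: int = 16) -> torch.Tensor:
        # z: (B, latent_dim) latent of one patch produced by the transformer
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, patch), torch.linspace(-1, 1, patch), indexing="ij")
        coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)          # (patch*patch, 2)
        zs = z.unsqueeze(1).expand(-1, coords.size(0), -1)             # broadcast latent
        inp = torch.cat([zs, coords.unsqueeze(0).expand(z.size(0), -1, -1)], dim=-1)
        return self.mlp(inp).view(z.size(0), patch, patch, 3)

out = PatchFieldDecoder()(torch.randn(4, 256))
print(out.shape)  # torch.Size([4, 16, 16, 3])
```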
[543] PriorFusion: Unified Integration of Priors for Robust Road Perception in Autonomous Driving
Xuewei Tang, Mengmeng Yang, Tuopu Wen, Peijin Jia, Le Cui, Mingshang Luo, Kehua Sheng, Bo Zhang, Diange Yang, Kun Jiang
Main category: cs.CV
TL;DR: PriorFusion integrates semantic, geometric, and generative priors to improve road perception in autonomous driving, outperforming existing methods in accuracy and robustness.
Details
Motivation: The need for accurate road perception in autonomous driving, especially in complex environments without HD maps, drives the development of a method that better exploits structured priors in road elements.
Method: Proposes PriorFusion, a framework combining instance-aware attention, shape-prior features, and a diffusion-based approach to generate accurate road element predictions.
Result: Experiments show significant accuracy improvements, especially in challenging conditions, with more regular and coherent predictions.
Conclusion: PriorFusion effectively enhances road perception by leveraging structured priors, offering a robust solution for autonomous driving in complex environments.
Abstract: With the growing interest in autonomous driving, there is an increasing demand for accurate and reliable road perception technologies. In complex environments without high-definition map support, autonomous vehicles must independently interpret their surroundings to ensure safe and robust decision-making. However, these scenarios pose significant challenges due to the large number, complex geometries, and frequent occlusions of road elements. A key limitation of existing approaches lies in their insufficient exploitation of the structured priors inherently present in road elements, resulting in irregular, inaccurate predictions. To address this, we propose PriorFusion, a unified framework that effectively integrates semantic, geometric, and generative priors to enhance road element perception. We introduce an instance-aware attention mechanism guided by shape-prior features, then construct a data-driven shape template space that encodes low-dimensional representations of road elements, enabling clustering to generate anchor points as reference priors. We design a diffusion-based framework that leverages these prior anchors to generate accurate and complete predictions. Experiments on large-scale autonomous driving datasets demonstrate that our method significantly improves perception accuracy, particularly under challenging conditions. Visualization results further confirm that our approach produces more accurate, regular, and coherent predictions of road elements.
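The "data-driven shape template space ... enabling clustering to generate anchor points" maps naturally onto a PCA-plus-k-means construction. The sketch below shows that reading with toy polylines; the component and cluster counts are arbitrary assumptions.
```python
# Sketch: building a data-driven shape template space for road elements.
# Assumption: each element is a polyline resampled to a fixed number of
# points, compressed with PCA, and clustered to yield anchor templates.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
polylines = rng.normal(size=(500, 20, 2))            # 500 road elements, 20 pts each
flat = polylines.reshape(len(polylines), -1)         # (500, 40)

pca = PCA(n_components=8).fit(flat)                  # low-dimensional shape space
codes = pca.transform(flat)
anchors = KMeans(n_clusters=16, n_init=10, random_state=0).fit(codes).cluster_centers_

# Decode anchor codes back to polylines to use them as reference priors.
anchor_shapes = pca.inverse_transform(anchors).reshape(16, 20, 2)
print(anchor_shapes.shape)                           # (16, 20, 2)
```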
[544] MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction
Zijian Dong, Longteng Duan, Jie Song, Michael J. Black, Andreas Geiger
Main category: cs.CV
TL;DR: MoGA reconstructs high-fidelity 3D Gaussian avatars from single-view images by combining a generative avatar model with 2D diffusion models, ensuring 3D consistency and realism.
Details
Motivation: The challenge is to infer unseen appearance and geometric details from a single-view image while maintaining 3D consistency, as previous methods relying on 2D diffusion models produce sparse and inconsistent views.
Method: MoGA integrates a generative avatar model as a prior, projects input images into its latent space, and enforces 3D constraints. It formulates avatar creation as model inversion, using synthetic views from 2D diffusion models for fitting.
Result: The method outperforms state-of-the-art techniques, generalizes well to real-world scenarios, and produces animatable avatars.
Conclusion: MoGA effectively addresses limitations of prior methods by combining generative and diffusion models, achieving high-fidelity 3D avatar reconstruction.
Abstract: We present MoGA, a novel method to reconstruct high-fidelity 3D Gaussian avatars from a single-view image. The main challenge lies in inferring unseen appearance and geometric details while ensuring 3D consistency and realism. Most previous methods rely on 2D diffusion models to synthesize unseen views; however, these generated views are sparse and inconsistent, resulting in unrealistic 3D artifacts and blurred appearance. To address these limitations, we leverage a generative avatar model that can generate diverse 3D avatars by sampling deformed Gaussians from a learned prior distribution. Due to limited 3D training data, such a 3D model alone cannot capture all image details of unseen identities. Consequently, we integrate it as a prior, ensuring 3D consistency by projecting input images into its latent space and enforcing additional 3D appearance and geometric constraints. Our novel approach formulates Gaussian avatar creation as model inversion by fitting the generative avatar to synthetic views from 2D diffusion models. The generative avatar provides an initialization for model fitting, enforces 3D regularization, and helps in refining pose. Experiments show that our method surpasses state-of-the-art techniques and generalizes well to real-world scenarios. Our Gaussian avatars are also inherently animatable. For code, see https://zj-dong.github.io/MoGA/.
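Formulating avatar creation as model inversion boils down to optimizing the prior's latent code against synthesized target views. A generic, hedged sketch follows, with the generator, renderer, latent size, and loss weights all assumed rather than taken from the paper.
```python
# Sketch: Gaussian-avatar creation as model inversion. `generator` maps a
# latent to deformed Gaussians and `render` rasterizes them; both are
# hypothetical stand-ins supplied by the caller.
import torch
import torch.nn.functional as F

def invert_avatar(generator, render, target_views, steps=500, lr=1e-2):
    """Fit the prior's latent so rendered views match diffusion-synthesized targets."""
    z = torch.zeros(1, 512, requires_grad=True)       # latent code in the avatar prior
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        gaussians = generator(z)                      # deformed Gaussians from the prior
        loss = sum(F.l1_loss(render(gaussians, cam), img) for cam, img in target_views)
        loss = loss + 1e-3 * z.pow(2).mean()          # 3D regularization toward the prior
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```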
[545] UniLDiff: Unlocking the Power of Diffusion Priors for All-in-One Image Restoration
Zihan Cheng, Liangtai Zhou, Dian Chen, Ni Tang, Xiaotong Luo, Yanyun Qu
Main category: cs.CV
TL;DR: UniLDiff is a unified framework for All-in-One Image Restoration, using diffusion priors with degradation- and detail-aware mechanisms for robust performance.
Details
Motivation: Addressing the challenges of diverse degradation modeling and detail preservation in image restoration.
Method: Proposes Degradation-Aware Feature Fusion (DAFF) for dynamic feature injection and Detail-Aware Expert Module (DAEM) for enhanced texture recovery.
Result: Achieves state-of-the-art performance in multi-task and mixed degradation settings.
Conclusion: Demonstrates the practical potential of diffusion priors for unified image restoration.
Abstract: All-in-One Image Restoration (AiOIR) has emerged as a promising yet challenging research direction. To address the core challenges of diverse degradation modeling and detail preservation, we propose UniLDiff, a unified framework enhanced with degradation- and detail-aware mechanisms, unlocking the power of diffusion priors for robust image restoration. Specifically, we introduce a Degradation-Aware Feature Fusion (DAFF) to dynamically inject low-quality features into each denoising step via decoupled fusion and adaptive modulation, enabling implicit modeling of diverse and compound degradations. Furthermore, we design a Detail-Aware Expert Module (DAEM) in the decoder to enhance texture and fine-structure recovery through expert routing. Extensive experiments across multi-task and mixed degradation settings demonstrate that our method consistently achieves state-of-the-art performance, highlighting the practical potential of diffusion priors for unified image restoration. Our code will be released.
[546] SDMatte: Grafting Diffusion Models for Interactive Matting
Longfei Huang, Yu Liang, Hao Zhang, Jinwei Chen, Wei Dong, Lunde Chen, Wanyu Liu, Bo Li, Peng-Tao Jiang
Main category: cs.CV
TL;DR: SDMatte is a diffusion-driven interactive matting model that leverages diffusion models for fine-grained detail extraction, using visual prompts and enhanced U-Net embeddings to improve performance.
Details
Motivation: Existing interactive matting methods lack precision in edge regions, while diffusion models offer robust capabilities for complex data and text-driven interactions.
Method: SDMatte transforms text-driven interaction into visual prompt-driven interaction, integrates coordinate and opacity embeddings into U-Net, and uses masked self-attention for focus on specified areas.
Result: Experiments show superior performance in interactive matting, validating the method’s effectiveness.
Conclusion: SDMatte successfully addresses fine-grained detail extraction in matting, outperforming existing methods.
Abstract: Recent interactive matting methods have shown satisfactory performance in capturing the primary regions of objects, but they fall short in extracting fine-grained details in edge regions. Diffusion models trained on billions of image-text pairs, demonstrate exceptional capability in modeling highly complex data distributions and synthesizing realistic texture details, while exhibiting robust text-driven interaction capabilities, making them an attractive solution for interactive matting. To this end, we propose SDMatte, a diffusion-driven interactive matting model, with three key contributions. First, we exploit the powerful priors of diffusion models and transform the text-driven interaction capability into visual prompt-driven interaction capability to enable interactive matting. Second, we integrate coordinate embeddings of visual prompts and opacity embeddings of target objects into U-Net, enhancing SDMatte’s sensitivity to spatial position information and opacity information. Third, we propose a masked self-attention mechanism that enables the model to focus on areas specified by visual prompts, leading to better performance. Extensive experiments on multiple datasets demonstrate the superior performance of our method, validating its effectiveness in interactive matting. Our code and model are available at https://github.com/vivoCameraResearch/SDMatte.
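The masked self-attention the abstract mentions can be read as standard attention with keys outside the visual-prompt region masked out. A minimal sketch under that assumption:
```python
# Sketch: self-attention restricted to the region a visual prompt specifies.
# The additive-bias masking below is a standard formulation; the paper's
# exact design may differ.
import torch
import torch.nn.functional as F

def prompt_masked_attention(q, k, v, prompt_mask):
    """q, k, v: (B, N, D); prompt_mask: (B, N) bool, True inside the prompt region."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5          # (B, N, N)
    scores = scores.masked_fill(~prompt_mask.unsqueeze(1), float("-inf"))
    return F.softmax(scores, dim=-1) @ v                          # attend only inside region

B, N, D = 2, 64, 32
q = k = v = torch.randn(B, N, D)
mask = torch.zeros(B, N, dtype=torch.bool)
mask[:, :16] = True                                               # prompt covers 16 tokens
print(prompt_masked_attention(q, k, v, mask).shape)               # torch.Size([2, 64, 32])
```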
[547] DPoser-X: Diffusion Model as Robust 3D Whole-body Human Pose Prior
Junzhe Lu, Jing Lin, Hongkun Dou, Ailing Zeng, Yue Deng, Xian Liu, Zhongang Cai, Lei Yang, Yulun Zhang, Haoqian Wang, Ziwei Liu
Main category: cs.CV
TL;DR: DPoser-X is a diffusion-based model for 3D whole-body human pose modeling, addressing challenges like pose complexity and dataset scarcity. It unifies pose tasks as inverse problems, uses variational diffusion sampling, and introduces novel training and scheduling methods, outperforming state-of-the-art models.
Details
Motivation: Building a robust full-body human pose prior is difficult due to pose complexity and lack of high-quality datasets.
Method: DPoser-X uses a diffusion model (DPoser) extended for whole-body poses, unifying tasks as inverse problems solved via variational diffusion sampling. It includes truncated timestep scheduling and masked training to combine datasets.
Result: DPoser-X outperforms state-of-the-art models in benchmarks for body, hand, face, and full-body pose modeling.
Conclusion: DPoser-X sets a new benchmark for whole-body human pose prior modeling, demonstrating robustness and versatility.
Abstract: We present DPoser-X, a diffusion-based prior model for 3D whole-body human poses. Building a versatile and robust full-body human pose prior remains challenging due to the inherent complexity of articulated human poses and the scarcity of high-quality whole-body pose datasets. To address these limitations, we introduce a Diffusion model as body Pose prior (DPoser) and extend it to DPoser-X for expressive whole-body human pose modeling. Our approach unifies various pose-centric tasks as inverse problems, solving them through variational diffusion sampling. To enhance performance on downstream applications, we introduce a novel truncated timestep scheduling method specifically designed for pose data characteristics. We also propose a masked training mechanism that effectively combines whole-body and part-specific datasets, enabling our model to capture interdependencies between body parts while avoiding overfitting to specific actions. Extensive experiments demonstrate DPoser-X’s robustness and versatility across multiple benchmarks for body, hand, face, and full-body pose modeling. Our model consistently outperforms state-of-the-art alternatives, establishing a new benchmark for whole-body human pose prior modeling.
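Truncated timestep scheduling, as described, restricts which diffusion timesteps are drawn. A minimal sketch follows; the truncation ratio, and the choice of keeping the lower end of the schedule, are our assumptions.
```python
# Sketch: truncated timestep scheduling for pose data. Instead of sampling
# t uniformly over the full horizon, draw only from a truncated range.
import torch

T = 1000                     # full diffusion horizon (assumed)
truncation = 0.3             # keep only the first 30% of timesteps (assumed ratio)

def sample_truncated_timesteps(batch: int) -> torch.Tensor:
    t_max = int(T * truncation)
    return torch.randint(0, t_max, (batch,))   # t ~ U[0, t_max) instead of U[0, T)

print(sample_truncated_timesteps(8))
```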
cs.AI
[548] Benchmarking and Bridging Emotion Conflicts for Multimodal Emotion Reasoning
Zhiyuan Han, Beier Zhu, Yanlong Xu, Peipei Song, Xun Yang
Main category: cs.AI
TL;DR: The paper introduces CA-MER, a benchmark for evaluating Multimodal Large Language Models (MLLMs) under emotion conflicts, and proposes MoSEAR, a framework to address audio bias and improve balanced modality integration.
Details
Motivation: Existing MLLMs overlook emotion conflicts where cues from different modalities are inconsistent, leading to biased reliance on audio signals.
Method: The authors introduce CA-MER for evaluation and propose MoSEAR, featuring modality-specific experts (MoSE) and attention reallocation (AR) to balance modality contributions.
Result: MoSEAR outperforms state-of-the-art models on benchmarks like MER2023, EMER, DFEW, and CA-MER, especially in conflict scenarios.
Conclusion: MoSEAR effectively mitigates modality bias and improves performance without trade-offs, setting a new standard for emotion reasoning in MLLMs.
Abstract: Despite their strong performance in multimodal emotion reasoning, existing Multimodal Large Language Models (MLLMs) often overlook the scenarios involving emotion conflicts, where emotional cues from different modalities are inconsistent. To fill this gap, we first introduce CA-MER, a new benchmark designed to examine MLLMs under realistic emotion conflicts. It consists of three subsets: video-aligned, audio-aligned, and consistent, where only one or all modalities reflect the true emotion. However, evaluations on our CA-MER reveal that current state-of-the-art emotion MLLMs systematically over-rely on the audio signal during emotion conflicts, neglecting critical cues from the visual modality. To mitigate this bias, we propose MoSEAR, a parameter-efficient framework that promotes balanced modality integration. MoSEAR consists of two modules: (1) MoSE, modality-specific experts with a regularized gating mechanism that reduces modality bias in the fine-tuning heads; and (2) AR, an attention reallocation mechanism that rebalances modality contributions in frozen backbones during inference. Our framework offers two key advantages: it mitigates emotion conflicts and improves performance on consistent samples, without incurring a trade-off between audio and visual modalities. Experiments on multiple benchmarks, including MER2023, EMER, DFEW, and our CA-MER, demonstrate that MoSEAR achieves state-of-the-art performance, particularly under modality conflict conditions.
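The AR module "rebalances modality contributions in frozen backbones during inference"; one simple way to picture that is rescaling the softmaxed attention mass assigned to audio versus visual token groups. The rule below is an assumption for illustration, not the paper's exact mechanism.
```python
# Sketch: inference-time attention reallocation over modality token groups.
# Assumption: attention mass per group is rescaled to a fixed target split.
import torch

def reallocate(attn: torch.Tensor, audio_idx, visual_idx, target_audio=0.5):
    """attn: (B, heads, Q, K) softmaxed attention; rescale audio vs. visual mass."""
    a = attn[..., audio_idx].sum(-1, keepdim=True)      # current audio mass per query
    v = attn[..., visual_idx].sum(-1, keepdim=True)
    out = attn.clone()
    out[..., audio_idx] *= target_audio / a.clamp_min(1e-6)
    out[..., visual_idx] *= (1 - target_audio) / v.clamp_min(1e-6)
    return out

attn = torch.softmax(torch.randn(1, 4, 10, 12), dim=-1)
audio, visual = list(range(6)), list(range(6, 12))
rebalanced = reallocate(attn, audio, visual)
print(rebalanced.sum(-1)[0, 0, 0])                       # ~1.0 after reallocation
```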
[549] Exploring Agentic Artificial Intelligence Systems: Towards a Typological Framework
Christopher Wissuchek, Patrick Zschech
Main category: cs.AI
TL;DR: The paper introduces a typology for classifying agentic AI systems along eight dimensions, providing a structured framework to analyze and compare their cognitive and environmental agency.
Details
Motivation: The lack of a structured framework to classify and compare autonomous AI systems motivates the development of a typology to assess their agency.
Method: A multi-phase methodological approach is used to construct and refine the typology, evaluated through a human-AI hybrid approach and distilled into constructed types.
Result: The typology enables researchers and practitioners to analyze varying levels of agency in AI systems, offering a foundation for assessing current and future developments.
Conclusion: The framework provides a structured perspective on AI capabilities, aiding in the classification and anticipation of advancements in agentic AI.
Abstract: Artificial intelligence (AI) systems are evolving beyond passive tools into autonomous agents capable of reasoning, adapting, and acting with minimal human intervention. Despite their growing presence, a structured framework is lacking to classify and compare these systems. This paper develops a typology of agentic AI systems, introducing eight dimensions that define their cognitive and environmental agency in an ordinal structure. Using a multi-phase methodological approach, we construct and refine this typology, which is then evaluated through a human-AI hybrid approach and further distilled into constructed types. The framework enables researchers and practitioners to analyze varying levels of agency in AI systems. By offering a structured perspective on the progression of AI capabilities, the typology provides a foundation for assessing current systems and anticipating future developments in agentic AI.
[550] A Formal Framework for the Definition of ‘State’: Hierarchical Representation and Meta-Universe Interpretation
Kei Itoh
Main category: cs.AI
TL;DR: The paper introduces a unified formal structure for the concept of ‘state’ to strengthen theoretical foundations, proposing a ‘hierarchical state grid’ and ‘Intermediate Meta-Universe (IMU)’ for clarity and consistency across domains. It expands inter-universal theory and offers a meta-formal logical framework for defining intelligence and other scientific theories.
Details
Motivation: To address the lack of consensus and formal clarity in the concept of 'state' and provide a rigorous foundation for diverse systems, including intelligence.
Method: Proposes a ‘hierarchical state grid’ and ‘Intermediate Meta-Universe (IMU)’ to unify notation and avoid logical inconsistencies. Expands inter-universal theory to include linguistic and agent integration.
Result: A meta-formal logical framework applicable to intelligence, formal logic, and scientific theory, enabling broader expressivity and consistency.
Conclusion: The study provides a mathematically robust foundation for defining intelligence and other concepts, bridging gaps across domains and enhancing theoretical clarity.
Abstract: This study aims to reinforce the theoretical foundation for diverse systems–including the axiomatic definition of intelligence–by introducing a mathematically rigorous and unified formal structure for the concept of ‘state,’ which has long been used without consensus or formal clarity. First, a ‘hierarchical state grid’ composed of two axes–state depth and mapping hierarchy–is proposed to provide a unified notational system applicable across mathematical, physical, and linguistic domains. Next, the ‘Intermediate Meta-Universe (IMU)’ is introduced to enable explicit descriptions of definers (ourselves) and the languages we use, thereby allowing conscious meta-level operations while avoiding self-reference and logical inconsistency. Building on this meta-theoretical foundation, this study expands inter-universal theory beyond mathematics to include linguistic translation and agent integration, introducing the conceptual division between macrocosm-inter-universal and microcosm-inter-universal operations for broader expressivity. Through these contributions, this paper presents a meta-formal logical framework–grounded in the principle of definition = state–that spans time, language, agents, and operations, providing a mathematically robust foundation applicable to the definition of intelligence, formal logic, and scientific theory at large.
[551] AgentTTS: Large Language Model Agent for Test-time Compute-optimal Scaling Strategy in Complex Tasks
Fali Wang, Hui Liu, Zhenwei Dai, Jingying Zeng, Zhiwei Zhang, Zongyu Wu, Chen Luo, Zhen Li, Xianfeng Tang, Qi He, Suhang Wang
Main category: cs.AI
TL;DR: AgentTTS is a framework for optimizing compute resources in multi-stage tasks using LLMs, outperforming traditional methods.
Details
Motivation: Existing test-time scaling (TTS) research focuses on single-stage tasks, but real-world problems are multi-stage, requiring tailored compute allocation.
Method: AgentTTS uses an LLM-agent to iteratively search for optimal model and budget allocations in multi-stage tasks, guided by empirical insights.
Result: AgentTTS outperforms baselines in efficiency, robustness, and interpretability.
Conclusion: AgentTTS effectively addresses the challenges of compute-optimal scaling in multi-stage tasks.
Abstract: Test-time scaling (TTS) enhances the performance of large language models (LLMs) by allocating additional compute resources during inference. However, existing research primarily investigates TTS in single-stage tasks, while many real-world problems are multi-stage complex tasks, composed of a sequence of heterogeneous subtasks, each of which requires an LLM with specific capabilities. Therefore, we study a novel problem: test-time compute-optimal scaling in multi-stage complex tasks, aiming to select suitable models and allocate budgets per subtask to maximize overall performance. TTS in multi-stage tasks introduces two fundamental challenges: (i) The combinatorial search space of model and budget allocations, combined with the high cost of inference, makes brute-force search impractical. (ii) The optimal model and budget allocations across subtasks are interdependent, increasing the complexity of the compute-optimal search. To address this gap, we conduct extensive pilot experiments on four tasks across six datasets, deriving three empirical insights characterizing the behavior of LLMs in multi-stage complex tasks. Informed by these insights, we propose AgentTTS, an LLM-agent-based framework that autonomously searches for compute-optimal allocations through iterative feedback-driven interactions with the execution environment. Experimental results demonstrate that AgentTTS significantly outperforms traditional and other LLM-based baselines in search efficiency, and shows improved robustness to varying training set sizes and enhanced interpretability.
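The feedback-driven search reduces to a loop: propose a per-subtask (model, budget) allocation, execute, observe the score, repeat. A minimal sketch with hypothetical model and budget pools and a stubbed environment:
```python
# Sketch: iterative feedback-driven search over per-subtask allocations,
# in the spirit of AgentTTS. Pools, subtasks, and scoring are hypothetical.
import random

MODELS = ["small-llm", "medium-llm", "large-llm"]   # hypothetical model pool
BUDGETS = [1, 4, 16]                                # e.g., samples per subtask
SUBTASKS = ["retrieve", "reason", "summarize"]      # stages of a multi-stage task

def run_pipeline(allocation: dict) -> float:
    """Hypothetical: execute the multi-stage task and return its end score."""
    return random.random()                          # replace with real feedback

def propose(history: list) -> dict:
    """Hypothetical LLM-agent step: pick the next allocation to try,
    conditioned on past (allocation, score) pairs; random here for brevity."""
    return {t: (random.choice(MODELS), random.choice(BUDGETS)) for t in SUBTASKS}

history = []
for _ in range(10):                                 # iterative feedback loop
    allocation = propose(history)
    history.append((allocation, run_pipeline(allocation)))
print(max(history, key=lambda h: h[1]))             # best allocation found
```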
[552] ff4ERA: A new Fuzzy Framework for Ethical Risk Assessment in AI
Abeer Dyoub, Ivan Letteri, Francesca A. Lisi
Main category: cs.AI
TL;DR: The paper introduces ff4ERA, a fuzzy framework for Ethical Risk Assessment (ERA) in Symbiotic AI (SAI), addressing uncertainty and context-dependency in ethical decision-making.
Details
Motivation: The deepening human-AI collaboration in SAI raises ethical risks, necessitating a flexible and transparent ERA framework to align AI actions with human values.
Method: The framework combines Fuzzy Logic, Fuzzy Analytic Hierarchy Process (FAHP), and Certainty Factors (CF) to quantify ethical risks via an Ethical Risk Score (ERS).
Result: ff4ERA produces context-sensitive, interpretable risk scores validated by a case study, showing robustness and sensitivity to expert inputs.
Conclusion: ff4ERA enables reliable ethical decision support, offering traceable and risk-aware assessments for SAI systems.
Abstract: The emergence of Symbiotic AI (SAI) introduces new challenges to ethical decision-making as it deepens human-AI collaboration. As symbiosis grows, AI systems pose greater ethical risks, including harm to human rights and trust. Ethical Risk Assessment (ERA) thus becomes crucial for guiding decisions that minimize such risks. However, ERA is hindered by uncertainty, vagueness, and incomplete information, and morality itself is context-dependent and imprecise. This motivates the need for a flexible, transparent, yet robust framework for ERA. Our work supports ethical decision-making by quantitatively assessing and prioritizing multiple ethical risks so that artificial agents can select actions aligned with human values and acceptable risk levels. We introduce ff4ERA, a fuzzy framework that integrates Fuzzy Logic, the Fuzzy Analytic Hierarchy Process (FAHP), and Certainty Factors (CF) to quantify ethical risks via an Ethical Risk Score (ERS) for each risk type. The final ERS combines the FAHP-derived weight, propagated CF, and risk level. The framework offers a robust mathematical approach for collaborative ERA modeling and systematic, step-by-step analysis. A case study confirms that ff4ERA yields context-sensitive, ethically meaningful risk scores reflecting both expert input and sensor-based evidence. Risk scores vary consistently with relevant factors while remaining robust to unrelated inputs. Local sensitivity analysis shows predictable, mostly monotonic behavior across perturbations, and global Sobol analysis highlights the dominant influence of expert-defined weights and certainty factors, validating the model design. Overall, the results demonstrate ff4ERA's ability to produce interpretable, traceable, and risk-aware ethical assessments, enabling what-if analyses and guiding designers in calibrating membership functions and expert judgments for reliable ethical decision support.
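The abstract states that the final ERS combines the FAHP-derived weight, the propagated certainty factor, and the risk level; a multiplicative combination is one natural reading, sketched below with made-up numbers.
```python
# Sketch: an Ethical Risk Score per risk type. The multiplicative
# combination and all numeric values are assumptions for illustration.
def ethical_risk_score(fahp_weight: float, certainty: float, risk_level: float) -> float:
    """All inputs in [0, 1]; higher means a more important, certain, severe risk."""
    return fahp_weight * certainty * risk_level

risks = {
    "privacy_harm":  ethical_risk_score(0.45, 0.8, 0.7),
    "trust_erosion": ethical_risk_score(0.30, 0.6, 0.5),
    "rights_impact": ethical_risk_score(0.25, 0.9, 0.9),
}
for name, ers in sorted(risks.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {ers:.3f}")            # prioritize actions by descending ERS
```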
[553] An analysis of AI Decision under Risk: Prospect theory emerges in Large Language Models
Kenneth Payne
Main category: cs.AI
TL;DR: The paper tests Kahneman and Tversky’s prospect theory with Large Language Models (LLMs), finding similarities to human risk judgment. Context, especially military vs. civilian scenarios, influences risk appetite. The study highlights how language models capture human biases unevenly and reframes debates about LLM reasoning.
Details
Motivation: To explore whether LLMs exhibit human-like risk judgment biases as described in prospect theory, and to understand how contextual 'frames' influence these biases.
Method: Experimental tests of prospect theory with state-of-the-art LLMs, analyzing risk decisions across various scenarios (e.g., military vs. civilian).
Result: LLMs often align with prospect theory, showing context-dependent risk appetite. Military scenarios produce stronger framing effects than civilian ones.
Conclusion: Language models mirror human biases but unevenly, influenced by contextual ‘frames.’ The findings also contribute to debates about LLM reasoning vs. memorization.
Abstract: Judgment of risk is key to decision-making under uncertainty. As Daniel Kahneman and Amos Tversky famously discovered, humans do so in a distinctive way that departs from mathematical rationalism. Specifically, they demonstrated experimentally that humans accept more risk when they feel themselves at risk of losing something than when they might gain. I report the first tests of Kahneman and Tversky’s landmark ‘prospect theory’ with Large Language Models, including today’s state of the art chain-of-thought ‘reasoners’. In common with humans, I find that prospect theory often anticipates how these models approach risky decisions across a range of scenarios. I also demonstrate that context is key to explaining much of the variance in risk appetite. The ‘frame’ through which risk is apprehended appears to be embedded within the language of the scenarios tackled by the models. Specifically, I find that military scenarios generate far larger ‘framing effects’ than do civilian settings, ceteris paribus. My research suggests, therefore, that language models the world, capturing our human heuristics and biases. But also that these biases are uneven - the idea of a ‘frame’ is richer than simple gains and losses. Wittgenstein’s notion of ’language games’ explains the contingent, localised biases activated by these scenarios. Finally, I use my findings to reframe the ongoing debate about reasoning and memorisation in LLMs.
[554] Knowledge Editing for Multi-Hop Question Answering Using Semantic Analysis
Dominic Simon, Rickard Ewetz
Main category: cs.AI
TL;DR: CHECK, a knowledge editor for multi-hop question answering (MQA), improves accuracy by 22.8% by semantically analyzing and revising reasoning chains, addressing limitations of existing methods.
Details
Motivation: Existing knowledge editors fail in compositional reasoning tasks like MQA due to illogical decompositional techniques.
Method: CHECK uses semantic analysis and logic optimization to revise reasoning chains, inspired by compiler processes.
Result: CHECK outperforms five state-of-the-art frameworks, achieving a 22.8% average improvement in MQA accuracy.
Conclusion: CHECK effectively addresses compositional reasoning challenges in knowledge editing for LLMs.
Abstract: Large Language Models (LLMs) require lightweight avenues for updating stored information that has fallen out of date. Knowledge Editing (KE) approaches have been successful in updating model knowledge for simple factual queries but struggle with handling tasks that require compositional reasoning such as multi-hop question answering (MQA). We observe that existing knowledge editors leverage decompositional techniques that result in illogical reasoning processes. In this paper, we propose a knowledge editor for MQA based on semantic analysis called CHECK. Our framework is based on insights from an analogy between compilers and reasoning using LLMs. Similar to how source code is first compiled before being executed, we propose to semantically analyze reasoning chains before executing the chains to answer questions. Reasoning chains with semantic errors are revised to ensure consistency through logic optimization and re-prompting the LLM at a higher temperature. We evaluate the effectiveness of CHECK against five state-of-the-art frameworks on four datasets and achieve an average 22.8% improved MQA accuracy.
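The compiler analogy suggests a check-before-execute loop: analyze the reasoning chain, and if semantic errors are found, re-prompt at a higher temperature. A hedged sketch with both LLM-facing helpers stubbed:
```python
# Sketch: CHECK-style "compile before execute" for reasoning chains. The
# analyzer and generator are hypothetical stubs; the retry policy and
# temperature increments are assumptions.
def semantic_errors(chain: list[str]) -> list[str]:
    """Hypothetical analyzer: return detected inconsistencies in the chain."""
    return []                                # plug in the real semantic analysis

def generate_chain(question: str, temperature: float) -> list[str]:
    """Hypothetical LLM call producing a multi-hop reasoning chain."""
    raise NotImplementedError

def answer_multihop(question: str, max_retries: int = 3) -> list[str]:
    temperature = 0.2
    for _ in range(max_retries):
        chain = generate_chain(question, temperature)
        if not semantic_errors(chain):       # "compile" passes: safe to execute
            return chain
        temperature += 0.3                   # retry with more diverse generations
    return chain                             # fall back to the last attempt
```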
[555] Cooperative Perception: A Resource-Efficient Framework for Multi-Drone 3D Scene Reconstruction Using Federated Diffusion and NeRF
Massoud Pourmandi
Main category: cs.AI
TL;DR: A drone swarm system uses federated learning, diffusion models, and lightweight semantic extraction for efficient 3D/4D scene reconstruction while addressing computational and bandwidth challenges.
Details
Motivation: To overcome computational limitations and low-bandwidth communication, and to enable real-time scene reconstruction in drone swarms.
Method: Combines federated learning of shared diffusion models, YOLOv12 for semantic extraction, and local NeRF updates with privacy and scalability.
Result: Enables efficient multi-agent scene synthesis and cooperative understanding with semantic-aware compression.
Conclusion: The framework is a disruptive advancement for autonomous systems, validated through simulations and real-world drone testbeds.
Abstract: The proposal introduces an innovative drone swarm perception system that aims to address computational limitations, low-bandwidth communication, and real-time scene reconstruction. The framework enables efficient multi-agent 3D/4D scene synthesis through federated learning of a shared diffusion model, YOLOv12-based lightweight semantic extraction, and local NeRF updates, while maintaining privacy and scalability. The framework redesigns generative diffusion models for joint scene reconstruction and improves cooperative scene understanding, while adding semantic-aware compression protocols. The approach can be validated through simulations and potential real-world deployment on drone testbeds, positioning it as a disruptive advancement in multi-agent AI for autonomous systems.
[556] AutoEDA: Enabling EDA Flow Automation through Microservice-Based LLM Agents
Yiyi Lu, Hoi Ian Au, Junyao Zhang, Jingyu Pan, Yiting Wang, Ang Li, Jianyi Zhang, Yiran Chen
Main category: cs.AI
TL;DR: AutoEDA is a framework for EDA automation using structured prompt engineering and paralleled learning via Model Context Protocol (MCP), improving accuracy, efficiency, and script quality in RTL-to-GDSII flows.
Details
Motivation: Manual scripting and tool-specific interactions in EDA workflows limit scalability and efficiency, while existing LLM solutions lack standardized frameworks.
Method: AutoEDA leverages MCP for standardized natural language processing, uses structured prompt engineering, intelligent parameter extraction, task decomposition, and an extended CodeBLEU metric for evaluation.
Result: Experiments on five benchmarks show improved automation accuracy, efficiency, and script quality compared to existing methods.
Conclusion: AutoEDA provides a scalable, efficient, and open-source solution for EDA automation, benefiting the EDA community.
Abstract: Modern Electronic Design Automation (EDA) workflows, especially the RTL-to-GDSII flow, require heavy manual scripting and involve a multitude of tool-specific interactions, which limits scalability and efficiency. While LLMs offer strides toward automation, existing LLM solutions require expensive fine-tuning and lack standardized frameworks for integration and evaluation. We introduce AutoEDA, a framework for EDA automation that leverages paralleled learning through the Model Context Protocol (MCP) to provide a standardized and scalable natural language experience across the entire RTL-to-GDSII flow. AutoEDA limits fine-tuning through structured prompt engineering, implements intelligent parameter extraction and task decomposition, and provides an extended CodeBLEU metric to evaluate the quality of TCL scripts. Results from experiments over five previously curated benchmarks show improvements in automation accuracy and efficiency, as well as script quality, when compared to existing methods. AutoEDA is released open-source to support reproducibility and the EDA community. Available at: https://github.com/AndyLu666/MCP-EDA-Server
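CodeBLEU-style metrics combine weighted component scores (n-gram, syntax, and dataflow matches). The sketch below shows that scoring shape for TCL scripts, with a crude token-overlap stand-in for each component; the weights and components of the paper's actual extension are unknown to us and purely assumed.
```python
# Sketch: a CodeBLEU-style weighted score for generated TCL scripts.
# All component scorers below are placeholders, not the paper's metric.
def token_overlap(cand: str, ref: str) -> float:
    """Crude unigram precision, a stand-in for BLEU-style n-gram matching."""
    c, r = cand.split(), set(ref.split())
    return sum(t in r for t in c) / len(c) if c else 0.0

def extended_codebleu(cand: str, ref: str, weights=(0.4, 0.3, 0.3)) -> float:
    ngram = token_overlap(cand, ref)
    syntax = token_overlap(cand, ref)      # placeholder: compare parse trees instead
    dataflow = token_overlap(cand, ref)    # placeholder: compare variable def-use chains
    return sum(w * s for w, s in zip(weights, (ngram, syntax, dataflow)))

cand = "read_verilog top.v; synth_design -top top"
ref = "read_verilog top.v; synth_design -top top -part xc7a35t"
print(f"{extended_codebleu(cand, ref):.3f}")
```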
[557] Agent-Based Feature Generation from Clinical Notes for Outcome Prediction
Jiayi Wang, Jacqueline Jil Vallon, Neil Panjwani, Xi Ling, Sushmita Vij, Sandy Srinivas, John Leppert, Mark K. Buyyounouski, Mohsen Bayati
Main category: cs.AI
TL;DR: SNOW, a modular multi-agent system using LLMs, autonomously generates structured clinical features from EHR notes, matching manual clinician feature generation performance without expert input.
Details
Motivation: To address the challenge of extracting meaningful, interpretable features from unstructured EHR notes for predictive modeling without relying on labor-intensive manual methods or less interpretable automated approaches.
Method: SNOW employs a modular multi-agent system powered by LLMs to autonomously perform feature discovery, extraction, validation, post-processing, and aggregation from unstructured notes.
Result: SNOW matched manual clinician feature generation (AUC-ROC: 0.761 vs. 0.771) and outperformed baseline features (0.691) and other automated methods, while maintaining interpretability.
Conclusion: Autonomous LLM systems like SNOW can replicate expert-level feature engineering at scale, enhancing clinical ML models’ use of unstructured EHR data while preserving interpretability.
Abstract: Electronic health records (EHRs) contain rich unstructured clinical notes that could enhance predictive modeling, yet extracting meaningful features from these notes remains challenging. Current approaches range from labor-intensive manual clinician feature generation (CFG) to fully automated representational feature generation (RFG) that lack interpretability and clinical relevance. Here we introduce SNOW (Scalable Note-to-Outcome Workflow), a modular multi-agent system powered by large language models (LLMs) that autonomously generates structured clinical features from unstructured notes without human intervention. We evaluated SNOW against manual CFG, clinician-guided LLM approaches, and RFG methods for predicting 5-year prostate cancer recurrence in 147 patients from Stanford Healthcare. While manual CFG achieved the highest performance (AUC-ROC: 0.771), SNOW matched this performance (0.761) without requiring any clinical expertise, significantly outperforming both baseline features alone (0.691) and all RFG approaches. The clinician-guided LLM method also performed well (0.732) but still required expert input. SNOW’s specialized agents handle feature discovery, extraction, validation, post-processing, and aggregation, creating interpretable features that capture complex clinical information typically accessible only through manual review. Our findings demonstrate that autonomous LLM systems can replicate expert-level feature engineering at scale, potentially transforming how clinical ML models leverage unstructured EHR data while maintaining the interpretability essential for clinical deployment.
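SNOW's five stages read naturally as a linear agent pipeline. Below is a minimal sketch with a stubbed LLM call; the stage names follow the abstract, while the prompts and data shapes are assumptions, not the released implementation.
```python
# Sketch: SNOW-style agent stages as a linear pipeline over clinical notes.
def llm(prompt: str) -> str:
    """Hypothetical LLM-backed agent call; replace with a real client."""
    raise NotImplementedError

def snow_pipeline(notes: list[str]) -> dict:
    # Stage order follows the abstract: discovery -> extraction -> validation
    # -> post-processing -> aggregation; prompts are illustrative only.
    schema = llm(f"Propose structured clinical features for outcome prediction: {notes[:3]}")
    extracted = [llm(f"Extract {schema} as structured fields from this note: {n}") for n in notes]
    validated = [row for row in extracted
                 if llm(f"Is every value supported by the note? {row}") == "yes"]
    cleaned = [llm(f"Normalize units and categories: {row}") for row in validated]
    return {"schema": schema, "rows": cleaned}   # one structured row per patient
```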
[558] CADDesigner: Conceptual Design of CAD Models Based on General-Purpose Agent
Jingzhe Ni, Xiaolong Yin, Xintong Li, Xingyu Lu, Ji Wei, Ruofeng Tong, Min Tang, Peng Du
Main category: cs.AI
TL;DR: An LLM-powered CAD agent simplifies design by accepting text/sketches, refining requirements via dialogue, and generating high-quality CAD code using iterative feedback.
Details
Motivation: Lowering the expertise barrier in CAD design and improving efficiency by leveraging LLMs for interactive, user-friendly conceptual design.
Method: Uses a Context-Independent Imperative Paradigm (CIP) for CAD code generation, incorporating iterative visual feedback and storing cases in a knowledge base for continuous improvement.
Result: Achieves state-of-the-art performance in CAD code generation, as demonstrated by experiments.
Conclusion: The LLM-powered CAD agent effectively bridges the gap between novice designers and complex CAD tools, enhancing design accessibility and quality.
Abstract: Computer-Aided Design (CAD) plays a pivotal role in industrial manufacturing but typically requires a high level of expertise from designers. To lower the entry barrier and improve design efficiency, we present an agent for CAD conceptual design powered by large language models (LLMs). The agent accepts both abstract textual descriptions and freehand sketches as input, engaging in interactive dialogue with users to refine and clarify design requirements through comprehensive requirement analysis. Built upon a novel Context-Independent Imperative Paradigm (CIP), the agent generates high-quality CAD modeling code. During the generation process, the agent incorporates iterative visual feedback to improve model quality. Generated design cases are stored in a structured knowledge base, enabling continuous improvement of the agent’s code generation capabilities. Experimental results demonstrate that our method achieves state-of-the-art performance in CAD code generation.
[559] A Survey on AgentOps: Categorization, Challenges, and Future Directions
Zexin Wang, Jingjing Li, Quan Zhou, Haotian Si, Yuanhao Liu, Jianhui Li, Gaogang Xie, Fei Sun, Dan Pei, Changhua Pei
Main category: cs.AI
TL;DR: The paper surveys agent system operations (AgentOps) to address anomalies in LLM-based agent systems, proposing a framework with four stages: monitoring, anomaly detection, root cause analysis, and resolution.
Details
Motivation: LLM-based agent systems are increasingly deployed yet suffer from instability and insecurity due to anomalies, and research on their operations is lacking. This paper aims to fill this gap.
Method: The paper systematically defines anomalies in agent systems (intra-agent and inter-agent) and introduces the AgentOps framework with four operational stages.
Result: A comprehensive operational framework (AgentOps) is proposed to manage anomalies in agent systems, detailing each stage for practical implementation.
Conclusion: The paper establishes a foundational framework for agent system operations, addressing current gaps and enabling further research and development in the field.
Abstract: As the reasoning capabilities of Large Language Models (LLMs) continue to advance, LLM-based agent systems offer advantages in flexibility and interpretability over traditional systems, garnering increasing attention. However, despite the widespread research interest and industrial application of agent systems, these systems, like their traditional counterparts, frequently encounter anomalies. These anomalies lead to instability and insecurity, hindering their further development. Therefore, a comprehensive and systematic approach to the operation and maintenance of agent systems is urgently needed. Unfortunately, current research on the operations of agent systems is sparse. To address this gap, we have undertaken a survey on agent system operations with the aim of establishing a clear framework for the field, defining the challenges, and facilitating further development. Specifically, this paper begins by systematically defining anomalies within agent systems, categorizing them into intra-agent anomalies and inter-agent anomalies. Next, we introduce a novel and comprehensive operational framework for agent systems, dubbed Agent System Operations (AgentOps). We provide detailed definitions and explanations of its four key stages: monitoring, anomaly detection, root cause analysis, and resolution.
[560] REACT: A Real-Time Edge-AI Based V2X Framework for Accident Avoidance in Autonomous Driving System
Fengze Yang, Bo Yu, Yang Zhou, Xuewen Luo, Zhengzhong Tu, Chenxi Liu
Main category: cs.AI
TL;DR: REACT is a real-time V2X-integrated trajectory optimization framework using a lightweight VLM for autonomous driving, reducing collisions by 77% and achieving fast inference.
Details
Motivation: Human-error collisions are common, and current V2X frameworks lack generalization and real-time performance. VLMs offer better reasoning but are slow for safety-critical tasks.
Method: REACT integrates a fine-tuned lightweight VLM with specialized modules for multimodal input processing and edge adaptation for real-time performance.
Result: Achieves 77% collision reduction, 48.2% VPQ, and 0.57s latency on Jetson AGX Orin.
Conclusion: Lightweight VLMs are feasible for real-time cooperative planning, improving safety and responsiveness in autonomous driving.
Abstract: Collisions caused by human error are the most common type of multi-vehicle crash, highlighting the critical need for autonomous driving (AD) systems to leverage cooperative perception through Vehicle-to-Everything (V2X) communication. This capability extends situational awareness beyond the limitations of onboard sensors. However, current transformer-based V2X frameworks suffer from limited generalization, shallow contextual reasoning, and reliance on mono-modal inputs. Vision-Language Models (VLMs) offer enhanced reasoning and multimodal integration but typically fall short of real-time performance requirements in safety-critical applications. This paper presents REACT, a real-time, V2X-integrated trajectory optimization framework built upon a fine-tuned lightweight VLM. REACT integrates a set of specialized modules that process multimodal inputs into optimized, risk-aware trajectories. To ensure real-time performance on edge devices, REACT incorporates edge adaptation strategies that reduce model complexity and accelerate inference. Evaluated on the DeepAccident benchmark, REACT achieves state-of-the-art performance, a 77% collision rate reduction, a 48.2% Video Panoptic Quality (VPQ), and a 0.57-second inference latency on the Jetson AGX Orin. Ablation studies validate the contribution of each input, module, and edge adaptation strategy. These results demonstrate the feasibility of lightweight VLMs for real-time edge-based cooperative planning and showcase the potential of language-guided contextual reasoning to improve safety and responsiveness in autonomous driving.
[561] D-Judge: How Far Are We? Evaluating the Discrepancies Between AI-synthesized Images and Natural Images through Multimodal Guidance
Renyang Liu, Ziyu Lyu, Wei Zhou, See-Kiong Ng
Main category: cs.AI
TL;DR: The paper introduces DANI, a large-scale dataset of natural and AI-generated images, and D-Judge, a benchmark to evaluate discrepancies between AI-synthesized and natural images across five dimensions.
Details
Motivation: To systematically investigate and quantify the differences between AI-generated and natural images, addressing the challenge of distinguishing them in the AIGC field.
Method: Constructed the DANI dataset with 5,000 natural and 440,000 AI-generated images from nine models. Introduced D-Judge, a benchmark evaluating five key dimensions: visual quality, semantic alignment, aesthetic appeal, task applicability, and human validation.
Result: Revealed significant discrepancies between AI-generated and natural images, emphasizing the need for aligning quantitative metrics with human judgment.
Conclusion: The study highlights the gap in AI-generated image realism and provides a framework for comprehensive evaluation, with publicly available code and dataset.
Abstract: In the rapidly evolving field of Artificial Intelligence Generated Content (AIGC), a central challenge is distinguishing AI-synthesized images from natural images. Despite the impressive capabilities of advanced AI generative models in producing visually compelling content, significant discrepancies remain when compared to natural images. To systematically investigate and quantify these differences, we construct a large-scale multimodal dataset named DANI, comprising 5,000 natural images and over 440,000 AI-generated image (AIGI) samples produced by nine representative models using both unimodal and multimodal prompts, including Text-to-Image (T2I), Image-to-Image (I2I), and Text and Image-to-Image (TI2I). We then introduce D-Judge, a benchmark designed to answer the critical question: how far are AI-generated images from truly realistic images? Our fine-grained evaluation framework assesses DANI across five key dimensions: naive visual quality, semantic alignment, aesthetic appeal, downstream task applicability, and coordinated human validation. Extensive experiments reveal substantial discrepancies across these dimensions, highlighting the importance of aligning quantitative metrics with human judgment to achieve a comprehensive understanding of AI-generated image quality. The code and dataset are publicly available at: https://github.com/ryliu68/DJudge and https://huggingface.co/datasets/Renyang/DANI.
[562] gpuRDF2vec – Scalable GPU-based RDF2vec
Martin Böckling, Heiko Paulheim
Main category: cs.AI
TL;DR: gpuRDF2vec accelerates KG embedding generation using GPUs and multi-node execution, outperforming existing methods like jRDF2vec in speed and scalability.
Details
Motivation: Generating KG embeddings at web scale is challenging, and existing methods like RDF2vec need acceleration for large-scale graphs.
Method: gpuRDF2vec leverages modern GPUs and multi-node execution to optimize the RDF2vec pipeline, including walk extraction and word2vec training.
Result: gpuRDF2vec achieves substantial speedups over jRDF2vec and scales well for large/dense graphs and longer walks, improving embedding quality.
Conclusion: gpuRDF2vec enables efficient, high-quality KG embedding training on large-scale graphs, leveraging Pytorch Lightning for scalability.
Abstract: Generating Knowledge Graph (KG) embeddings at web scale remains challenging. Among existing techniques, RDF2vec combines effectiveness with strong scalability. We present gpuRDF2vec, an open source library that harnesses modern GPUs and supports multi-node execution to accelerate every stage of the RDF2vec pipeline. Extensive experiments on both synthetically generated graphs and real-world benchmarks show that gpuRDF2vec achieves a substantial speedup over the currently fastest alternative, i.e., jRDF2vec. In a single-node setup, our walk-extraction phase alone outperforms pyRDF2vec, SparkKGML, and jRDF2vec by a substantial margin using random walks on large/dense graphs, and scales very well to longer walks, which typically lead to better-quality embeddings. Our implementation of gpuRDF2vec enables practitioners and researchers to train high-quality KG embeddings on large-scale graphs within practical time budgets and builds on top of PyTorch Lightning for the scalable word2vec implementation.
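To make the two-stage pipeline concrete, here is a minimal CPU sketch of RDF2vec's walk extraction followed by skip-gram word2vec training, using a toy triple set and gensim as a stand-in for the library's GPU-accelerated word2vec. The entity names and hyperparameters are illustrative, not gpuRDF2vec's API.

```python
# Minimal RDF2vec sketch: random-walk extraction over an RDF graph,
# then word2vec over the walk corpus. Assumes gensim is installed.
import random
from collections import defaultdict
from gensim.models import Word2Vec

triples = [
    ("Berlin", "capitalOf", "Germany"),
    ("Germany", "memberOf", "EU"),
    ("Paris", "capitalOf", "France"),
    ("France", "memberOf", "EU"),
]

# Adjacency list: subject -> [(predicate, object), ...]
adj = defaultdict(list)
for s, p, o in triples:
    adj[s].append((p, o))

def random_walks(entity, depth=4, n_walks=10):
    """Extract walks of the form e0 -p1-> e1 -p2-> e2 ..."""
    walks = []
    for _ in range(n_walks):
        walk, node = [entity], entity
        for _ in range(depth):
            nbrs = adj.get(node)
            if not nbrs:
                break
            p, o = random.choice(nbrs)
            walk += [p, o]
            node = o
        walks.append(walk)
    return walks

corpus = [w for e in list(adj) for w in random_walks(e)]
# Skip-gram word2vec over the walk corpus yields one vector per entity.
model = Word2Vec(sentences=corpus, vector_size=64, window=4, sg=1, min_count=1)
print(model.wv["Berlin"][:5])
```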
[563] Multispin Physics of AI Tipping Points and Hallucinations
Neil F. Johnson, Frank Yingjie Huo
Main category: cs.AI
TL;DR: The paper reveals a hidden tipping instability in generative AI outputs, linking it to a multispin thermal system, and provides a formula to predict this tipping point, aiding AI transparency and risk assessment.
Details
Motivation: To address the issue of generative AI outputs unpredictably shifting from correct to misleading, causing significant real-world harm, and to improve AI transparency and safety.
Method: The study maps the AI’s behavior to a multispin thermal system, identifies a tipping instability at the Attention head level, and derives a formula to predict this tipping point.
Result: The derived formula accurately predicts the tipping point, showing the influence of user prompts and training bias, and reveals how multilayer architectures amplify output instability.
Conclusion: The findings enhance AI transparency, explainability, and performance, while also providing a framework to quantify user risk and legal liabilities.
Abstract: Output from generative AI, such as ChatGPT, can be repetitive and biased. But more worrying is that this output can mysteriously tip mid-response from good (correct) to bad (misleading or wrong) without the user noticing. In 2024 alone, this reportedly caused $67 billion in losses and several deaths. Establishing a mathematical mapping to a multispin thermal system, we reveal a hidden tipping instability at the scale of the AI’s ‘atom’ (basic Attention head). We derive a simple but essentially exact formula for this tipping point which shows directly the impact of a user’s prompt choice and the AI’s training bias. We then show how the output tipping can get amplified by the AI’s multilayer architecture. As well as helping improve AI transparency, explainability and performance, our results open a path to quantifying users’ AI risk and legal liabilities.
[564] HealthFlow: A Self-Evolving AI Agent with Meta Planning for Autonomous Healthcare Research
Yinghao Zhu, Yifan Qi, Zixiang Wang, Lei Gu, Dehao Sui, Haoran Hu, Xichen Zhang, Ziyi He, Liantao Ma, Lequan Yu
Main category: cs.AI
TL;DR: HealthFlow is a self-evolving AI agent for healthcare research, outperforming static AI agents by refining its strategic planning through meta-level evolution.
Details
Motivation: Current AI agents in healthcare rely on static strategies, limiting their ability to improve strategic planning, which is vital for complex domains like healthcare.
Method: HealthFlow uses a meta-level evolution mechanism to autonomously refine high-level problem-solving policies, learning from successes and failures. EHRFlowBench, a new benchmark, is introduced for evaluation.
Result: HealthFlow significantly outperforms state-of-the-art agent frameworks in complex health data analysis tasks.
Conclusion: The work shifts focus from tool-users to self-evolving task-managers, enabling more autonomous and effective AI for scientific discovery.
Abstract: The efficacy of AI agents in healthcare research is hindered by their reliance on static, predefined strategies. This creates a critical limitation: agents can become better tool-users but cannot learn to become better strategic planners, a crucial skill for complex domains like healthcare. We introduce HealthFlow, a self-evolving AI agent that overcomes this limitation through a novel meta-level evolution mechanism. HealthFlow autonomously refines its own high-level problem-solving policies by distilling procedural successes and failures into a durable, strategic knowledge base. To anchor our research and facilitate reproducible evaluation, we introduce EHRFlowBench, a new benchmark featuring complex, realistic health data analysis tasks derived from peer-reviewed clinical research. Our comprehensive experiments demonstrate that HealthFlow’s self-evolving approach significantly outperforms state-of-the-art agent frameworks. This work marks a necessary shift from building better tool-users to designing smarter, self-evolving task-managers, paving the way for more autonomous and effective AI for scientific discovery.
[565] Platonic Representations for Poverty Mapping: Unified Vision-Language Codes or Agent-Induced Novelty?
Satiyabooshan Murugaboopathy, Connor T. Jerzak, Adel Daoud
Main category: cs.AI
TL;DR: The paper explores using satellite imagery and internet-sourced text to predict household wealth in African neighborhoods, finding that combining vision and language models outperforms vision-only methods.
Details
Motivation: To determine if socio-economic indicators like wealth can be inferred from satellite images and online text, leveraging multimodal data for improved prediction.
Method: A multimodal framework with five pipelines: vision model, LLM, AI agent, joint image-text encoder, and ensemble. Uses DHS data, Landsat images, and LLM/agent-generated text.
Result: Fused vision and text models outperform vision-only (R-squared 0.77 vs. 0.63). LLM knowledge is more effective than agent-retrieved text, with partial representational convergence.
Conclusion: Combining vision and language models improves wealth prediction, with LLMs providing robust knowledge. A large-scale dataset is released for future research.
Abstract: We investigate whether socio-economic indicators like household wealth leave recoverable imprints in satellite imagery (capturing physical features) and Internet-sourced text (reflecting historical/economic narratives). Using Demographic and Health Survey (DHS) data from African neighborhoods, we pair Landsat images with LLM-generated textual descriptions conditioned on location/year and text retrieved by an AI search agent from web sources. We develop a multimodal framework predicting household wealth (International Wealth Index) through five pipelines: (i) vision model on satellite images, (ii) LLM using only location/year, (iii) AI agent searching/synthesizing web text, (iv) joint image-text encoder, (v) ensemble of all signals. Our framework yields three contributions. First, fusing vision and agent/LLM text outperforms vision-only baselines in wealth prediction (e.g., R-squared of 0.77 vs. 0.63 on out-of-sample splits), with LLM-internal knowledge proving more effective than agent-retrieved text, improving robustness to out-of-country and out-of-time generalization. Second, we find partial representational convergence: fused embeddings from vision/language modalities correlate moderately (median cosine similarity of 0.60 after alignment), suggesting a shared latent code of material well-being while retaining complementary details, consistent with the Platonic Representation Hypothesis. Although LLM-only text outperforms agent-retrieved data, challenging our Agent-Induced Novelty Hypothesis, modest gains from combining agent data in some splits weakly support the notion that agent-gathered information introduces unique representational structures not fully captured by static LLM knowledge. Third, we release a large-scale multimodal dataset comprising more than 60,000 DHS clusters linked to satellite images, LLM-generated descriptions, and agent-retrieved texts.
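As a rough illustration of the fusion pipeline (iv), the sketch below concatenates vision and text embeddings and fits a ridge regressor for the wealth index, then checks cross-modal cosine similarity in the spirit of the convergence analysis. The random embeddings, synthetic target, and hyperparameters are stand-ins, not the paper's encoders or data.

```python
# Late-fusion sketch: predict a wealth index from concatenated
# image and text embeddings, then probe representational similarity.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
img_emb = rng.normal(size=(n, 128))   # stand-in satellite-image embeddings
txt_emb = rng.normal(size=(n, 128))   # stand-in LLM/agent text embeddings
iwi = img_emb[:, 0] + txt_emb[:, 0] + 0.1 * rng.normal(size=n)  # synthetic target

fused = np.hstack([img_emb, txt_emb])
X_tr, X_te, y_tr, y_te = train_test_split(fused, iwi, random_state=0)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
print(f"fused R^2: {r2_score(y_te, model.predict(X_te)):.2f}")

# Convergence probe: per-sample cosine similarity between modality
# embeddings, analogous to the paper's median-0.60 (post-alignment) finding.
cos = np.sum(img_emb * txt_emb, axis=1) / (
    np.linalg.norm(img_emb, axis=1) * np.linalg.norm(txt_emb, axis=1))
print(f"median cosine: {np.median(cos):.2f}")
```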
[566] What Is Your AI Agent Buying? Evaluation, Implications and Emerging Questions for Agentic E-Commerce
Amine Allouah, Omar Besbes, Josué D Figueroa, Yash Kanoria, Akshit Kumar
Main category: cs.AI
TL;DR: The paper explores how AI agents (VLMs) shop in online marketplaces, revealing their preferences, biases, and the potential impact on e-commerce dynamics.
Details
Motivation: To understand what AI agents buy and why, and how their behavior differs from humans, given the rise of AI-mediated shopping.
Method: Developed ACES, a sandbox environment with a VLM agent and programmable mock marketplace, testing rationality and causal effects of product attributes.
Result: AI agents show strong position biases, penalize sponsored tags, and reward endorsements. Sensitivity to price, ratings, and reviews varies by model. Seller-side AI tweaks can boost market share.
Conclusion: AI shopping behavior differs across models, raising questions about seller strategies, platform design, and regulation in an AI-dominated marketplace.
Abstract: Online marketplaces will be transformed by autonomous AI agents acting on behalf of consumers. Rather than humans browsing and clicking, vision-language-model (VLM) agents can parse webpages, evaluate products, and transact. This raises a fundamental question: what do AI agents buy, and why? We develop ACES, a sandbox environment that pairs a platform-agnostic VLM agent with a fully programmable mock marketplace to study this question. We first conduct basic rationality checks in the context of simple tasks, and then, by randomizing product positions, prices, ratings, reviews, sponsored tags, and platform endorsements, we obtain causal estimates of how frontier VLMs actually shop. Models show strong but heterogeneous position effects: all favor the top row, yet different models prefer different columns, undermining the assumption of a universal “top” rank. They penalize sponsored tags and reward endorsements. Sensitivities to price, ratings, and reviews are directionally human-like but vary sharply in magnitude across models. Motivated by scenarios where sellers use AI agents to optimize product listings, we show that a seller-side agent that makes minor tweaks to product descriptions, targeting AI buyer preferences, can deliver substantial market-share gains if AI-mediated shopping dominates. We also find that modal product choices can differ across models and, in some cases, demand may concentrate on a few select products, raising competition questions. Together, our results illuminate how AI agents may behave in e-commerce settings and surface concrete seller strategy, platform design, and regulatory questions in an AI-mediated ecosystem.
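The causal logic of this randomization can be illustrated with a toy loop: because product-to-slot assignment is shuffled on every trial, per-slot selection rates identify position effects rather than product quality. The choice model below is a synthetic stand-in for a real VLM shopping agent.

```python
# Sketch of an ACES-style randomized position experiment.
import random
from collections import Counter

GRID = [(r, c) for r in range(2) for c in range(3)]  # 2x3 result grid

def fake_agent_choice():
    # Synthetic bias: prefers the top row and, mildly, column 0.
    weights = [(2.0 if r == 0 else 1.0) * (1.3 if c == 0 else 1.0)
               for (r, c) in GRID]
    return random.choices(range(len(GRID)), weights=weights)[0]

slot_picks, product_picks = Counter(), Counter()
n_trials = 20000
for _ in range(n_trials):
    assignment = random.sample(range(6), 6)  # random product-to-slot map
    slot = fake_agent_choice()
    slot_picks[GRID[slot]] += 1
    product_picks[assignment[slot]] += 1

# Randomized assignment decouples position from product, so per-slot pick
# rates are causal position effects; product shares stay near-uniform here
# because this synthetic agent reacts to position only.
for slot in GRID:
    print(slot, f"{slot_picks[slot] / n_trials:.3f}")
```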
[567] H2C: Hippocampal Circuit-inspired Continual Learning for Lifelong Trajectory Prediction in Autonomous Driving
Yunlong Lin, Zirui Li, Guodong Du, Xiaocong Zhao, Cheng Gong, Xinwei Wang, Chao Lu, Jianwei Gong
Main category: cs.AI
TL;DR: H2C, a hippocampal circuit-inspired continual learning method, reduces catastrophic forgetting in trajectory prediction by 22.71% on average.
Details
Motivation: Address catastrophic forgetting in DL-based trajectory prediction for autonomous driving, inspired by hippocampal memory replay.
Method: Uses two strategies (diversity maximization and equiprobable sampling) to select representative samples, then updates via memory replay loss.
Result: H2C reduces forgetting by 22.71% on average across varying scenarios in the INTERACTION dataset.
Conclusion: H2C effectively retains prior knowledge while learning new data, enhancing real-world applicability for autonomous driving.
Abstract: Deep learning (DL) has shown state-of-the-art performance in trajectory prediction, which is critical to safe navigation in autonomous driving (AD). However, most DL-based methods suffer from catastrophic forgetting, where adapting to a new distribution may cause significant performance degradation in previously learned ones. Such inability to retain learned knowledge limits their applicability in the real world, where AD systems need to operate across varying scenarios with dynamic distributions. As revealed by neuroscience, the hippocampal circuit plays a crucial role in memory replay, effectively reconstructing learned knowledge based on limited resources. Inspired by this, we propose a hippocampal circuit-inspired continual learning method (H2C) for trajectory prediction across varying scenarios. H2C retains prior knowledge by selectively recalling a small subset of learned samples. First, two complementary strategies are developed to select the subset to represent learned knowledge. Specifically, one strategy maximizes inter-sample diversity to represent the distinctive knowledge, and the other estimates the overall knowledge by equiprobable sampling. Then, H2C updates via a memory replay loss function calculated by these selected samples to retain knowledge while learning new data. Experiments based on various scenarios from the INTERACTION dataset are designed to evaluate H2C. Experimental results show that H2C reduces catastrophic forgetting of DL baselines by 22.71% on average in a task-free manner, without relying on manually informed distributional shifts. The implementation is available at https://github.com/BIT-Jack/H2C-lifelong.
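A minimal sketch of the two selection strategies named in the abstract, under the assumption that diversity maximization is implemented as greedy farthest-point selection over learned feature vectors; the paper's exact criterion may differ.

```python
# H2C-style replay-buffer selection: diversity maximization plus
# equiprobable sampling over learned sample representations.
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 16))  # one row per learned sample

def diverse_subset(X, k):
    """Greedy farthest-point selection to maximize inter-sample diversity."""
    chosen = [0]
    d = np.linalg.norm(X - X[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))            # farthest from current subset
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return chosen

def equiprobable_subset(X, k):
    """Uniform sampling to estimate the overall knowledge distribution."""
    return rng.choice(len(X), size=k, replace=False).tolist()

memory = diverse_subset(features, 25) + equiprobable_subset(features, 25)
print(len(set(memory)), "replay samples selected")
# During training on new data, a replay loss over `memory` would be added
# to the task loss, e.g. total = L_new + lambda * L_replay(memory).
```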
[568] A Survey on Agent Workflow – Status and Future
Chaojia Yu, Zihan Cheng, Hanwen Cui, Yishuo Gao, Zexu Luo, Yijin Wang, Hangbin Zheng, Yong Zhao
Main category: cs.AI
TL;DR: A survey on agent workflow systems in LLMs, classifying them by functional capabilities and architectural features, comparing over 20 systems, and discussing challenges, trends, and future research directions.
Details
Motivation: To review and classify agent workflow systems, addressing scalability, controllability, and security in AI behaviors as agent systems grow in complexity.
Method: Comprehensive review and classification of academic and industrial agent workflow systems along functional and architectural dimensions, comparing over 20 systems.
Result: Identified common patterns, technical challenges, and emerging trends, with insights into workflow optimization and security concerns.
Conclusion: Outlines open problems like standardization and multimodal integration, suggesting future research directions for agent design and safe automation.
Abstract: In the age of large language models (LLMs), autonomous agents have emerged as a powerful paradigm for achieving general intelligence. These agents dynamically leverage tools, memory, and reasoning capabilities to accomplish user-defined goals. As agent systems grow in complexity, agent workflows (structured orchestration frameworks) have become central to enabling scalable, controllable, and secure AI behaviors. This survey provides a comprehensive review of agent workflow systems, spanning academic frameworks and industrial implementations. We classify existing systems along two key dimensions: functional capabilities (e.g., planning, multi-agent collaboration, external API integration) and architectural features (e.g., agent roles, orchestration flows, specification languages). By comparing over 20 representative systems, we highlight common patterns, potential technical challenges, and emerging trends. We further address concerns related to workflow optimization strategies and security. Finally, we outline open problems such as standardization and multimodal integration, offering insights for future research at the intersection of agent design, workflow infrastructure, and safe automation.
[569] Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, Huan Liu
Main category: cs.AI
TL;DR: The paper investigates whether Chain-of-Thought (CoT) reasoning in LLMs is a learned pattern from training data rather than genuine reasoning, finding it fails beyond training distributions.
Details
Motivation: To determine if CoT reasoning is superficial and dependent on training data, rather than reflecting true inferential processes.
Method: The study uses DataAlchemy, a controlled environment, to train LLMs from scratch and test CoT reasoning across task, length, and format dimensions under varied distribution conditions.
Result: CoT reasoning is brittle and ineffective when pushed beyond training distributions, suggesting it lacks generalizability.
Conclusion: CoT reasoning is a mirage limited by training data, highlighting challenges in achieving genuine, generalizable reasoning in LLMs.
Abstract: Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (a.k.a., CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating us to explore further. In this paper, we study CoT reasoning via a data distribution lens and investigate if CoT reasoning reflects a structured inductive bias learned from in-distribution data, allowing the model to conditionally generate reasoning paths that approximate those seen during training. Thus, its effectiveness is fundamentally bounded by the degree of distribution discrepancy between the training data and the test queries. With this lens, we dissect CoT reasoning via three dimensions: task, length, and format. To investigate each dimension, we design DataAlchemy, an isolated and controlled environment to train LLMs from scratch and systematically probe them under various distribution conditions. Our results reveal that CoT reasoning is a brittle mirage that vanishes when it is pushed beyond training distributions. This work offers a deeper understanding of why and when CoT reasoning fails, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning.
[570] Importance Sampling is All You Need: Predict LLM’s performance on new benchmark by reusing existing benchmark
Junjie Shi, Wei Ma, Shi Ying, Lingxiao Jiang, Yang Liu, Bo Du
Main category: cs.AI
TL;DR: BIS is a prompt-centric framework for evaluating LLM performance in code generation without ground truth, reducing costs and contamination risks.
Details
Motivation: Existing benchmarks for LLM code generation face high costs and data contamination issues, necessitating a ground-truth-free evaluation method.
Method: BIS uses importance sampling theory and Importance Weighted Autoencoders to estimate performance metrics from prompt distributions, with weight truncation for stability.
Result: BIS achieves low prediction errors (e.g., 1.1% for code correctness) across diverse benchmarks and models, proving reliable and cost-effective.
Conclusion: BIS is a reliable, low-cost tool for benchmarking LLMs in code generation, offering quick feedback and broad applicability.
Abstract: With the rapid advancement of large language models, code generation has become a key benchmark for evaluating LLM capabilities. However, existing benchmarks face two major challenges: (1) the escalating cost of constructing high-quality test suites and reference solutions, and (2) the increasing risk of data contamination, which undermines the reliability of benchmark-based evaluations. In this paper, we propose BIS, a prompt-centric evaluation framework that enables ground-truth-free prediction of LLM performance on code generation tasks. Rather than executing generated code, BIS estimates performance metrics by analyzing the prompt distribution alone. Built on importance sampling theory and implemented using Importance Weighted Autoencoders, our method reweights samples from existing annotated benchmarks to estimate performance on new, unseen benchmarks. To stabilize the estimation, we introduce weight truncation strategies and compute marginal expectations across the fitted distributions. BIS serves as a complementary tool that supports benchmark development and validation under constrained resources, offering actionable and quick feedback for prompt selection and contamination assessment. We conduct extensive experiments involving 8,000 evaluation points across 4 CodeLlama models and 9 diverse benchmarks. Our framework achieves an average absolute prediction error of 1.1% for code correctness scores, with best- and worst-case errors of 0.3% and 1.9%, respectively. It also generalizes well to other metrics, attaining average absolute errors of 2.15% for pass@1. These results demonstrate the reliability and broad applicability of BIS, which can significantly reduce the cost and effort of benchmarking LLMs in code-related tasks.
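The core estimator is standard self-normalized importance sampling with weight truncation. The sketch below illustrates it with one-dimensional Gaussian prompt distributions, which stand in for the densities BIS fits with Importance Weighted Autoencoders; the scores and distributions are synthetic.

```python
# Importance-sampling sketch: estimate an LLM's score on a new prompt
# distribution by reweighting per-prompt scores from an old benchmark.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Prompts from the existing benchmark, embedded as 1-D features here.
x_old = rng.normal(loc=0.0, scale=1.0, size=5000)
# Per-prompt correctness of some model on the old benchmark (synthetic).
score_old = (rng.random(5000) < 1 / (1 + np.exp(-x_old))).astype(float)

# Fitted densities for the old and new prompt distributions (assumed known).
p_old = norm(loc=0.0, scale=1.0)
p_new = norm(loc=0.5, scale=1.0)

w = p_new.pdf(x_old) / p_old.pdf(x_old)
w = np.minimum(w, np.quantile(w, 0.99))   # truncate extreme weights
estimate = np.sum(w * score_old) / np.sum(w)

# Direct estimate on the new distribution, for comparison only.
x_new = rng.normal(loc=0.5, scale=1.0, size=5000)
truth = np.mean(rng.random(5000) < 1 / (1 + np.exp(-x_new)))
print(f"IS estimate: {estimate:.3f}  direct estimate: {truth:.3f}")
```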
[571] Calibrated Prediction Set in Fault Detection with Risk Guarantees via Significance Tests
Mingchen Mei, Yi Li, YiYao Qian, Zijun Jia
Main category: cs.AI
TL;DR: A novel fault detection method integrates significance testing with conformal prediction for rigorous risk control and uncertainty quantification, validated in cross-domain tasks.
Details
Motivation: Addressing the lack of rigorous risk control and reliable uncertainty quantification in fault detection models, especially under distributional shifts.
Method: Transforms fault detection into hypothesis testing using a nonconformity measure, computes p-values for new samples, and constructs prediction sets with guaranteed coverage.
Result: Achieves empirical coverage at or above the nominal level, robust even with poor point-prediction models, and shows a controllable trade-off between risk and efficiency.
Conclusion: Provides a theoretically grounded framework for fault detection with explicit risk control, enhancing trustworthiness in safety-critical applications.
Abstract: Fault detection is crucial for ensuring the safety and reliability of modern industrial systems. However, a significant scientific challenge is the lack of rigorous risk control and reliable uncertainty quantification in existing diagnostic models, particularly when facing complex scenarios such as distributional shifts. To address this issue, this paper proposes a novel fault detection method that integrates significance testing with the conformal prediction framework to provide formal risk guarantees. The method transforms fault detection into a hypothesis testing task by defining a nonconformity measure based on model residuals. It then leverages a calibration dataset to compute p-values for new samples, which are used to construct prediction sets mathematically guaranteed to contain the true label with a user-specified probability, $1-\alpha$. Fault classification is subsequently performed by analyzing the intersection of the constructed prediction set with predefined normal and fault label sets. Experimental results on cross-domain fault diagnosis tasks validate the theoretical properties of our approach. The proposed method consistently achieves an empirical coverage rate at or above the nominal level ($1-\alpha$), demonstrating robustness even when the underlying point-prediction models perform poorly. Furthermore, the results reveal a controllable trade-off between the user-defined risk level ($\alpha$) and efficiency, where higher risk tolerance leads to smaller average prediction set sizes. This research contributes a theoretically grounded framework for fault detection that enables explicit risk control, enhancing the trustworthiness of diagnostic systems in safety-critical applications and advancing the field from simple point predictions to informative, uncertainty-aware outputs.
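The construction follows split conformal prediction: calibration nonconformity scores yield a p-value for each candidate label, and the prediction set keeps every label whose p-value exceeds $\alpha$, giving coverage at least $1-\alpha$. A minimal sketch, assuming a residual-based nonconformity score as in the abstract; the calibration data is synthetic.

```python
# Split-conformal prediction sets for a two-class fault-detection task.
import numpy as np

rng = np.random.default_rng(0)
# Calibration residuals (nonconformity scores) per class, from some
# fitted point-prediction model.
calib = {"normal": np.abs(rng.normal(0.0, 1.0, 200)),
         "fault":  np.abs(rng.normal(3.0, 1.0, 200))}

def p_value(score, cal_scores):
    """Fraction of calibration scores at least as nonconforming."""
    return (1 + np.sum(cal_scores >= score)) / (len(cal_scores) + 1)

def prediction_set(residuals_by_label, alpha=0.1):
    """All labels not rejected at level alpha; coverage >= 1 - alpha."""
    return [y for y, r in residuals_by_label.items()
            if p_value(r, calib[y]) > alpha]

# New sample: large residual under the 'normal' model, small under 'fault'.
print(prediction_set({"normal": 2.8, "fault": 0.4}, alpha=0.1))
```

Classification then follows the paper's rule: intersect the returned set with the predefined normal and fault label sets, and larger $\alpha$ (more risk tolerance) shrinks the average set size.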
[572] SketchAgent: Generating Structured Diagrams from Hand-Drawn Sketches
Cheng Tan, Qi Chen, Jingxuan Wei, Gaowei Wu, Zhangyang Gao, Siyuan Li, Bihui Yu, Ruifeng Guo, Stan Z. Li
Main category: cs.AI
TL;DR: SketchAgent automates converting hand-drawn sketches into structured diagrams using multi-agent systems, reducing manual effort.
Details
Motivation: Hand-drawn sketches lack structural constraints for automated diagram generation, making the process labor-intensive.
Method: SketchAgent integrates sketch recognition, symbolic reasoning, and iterative validation.
Result: It produces semantically coherent diagrams and introduces the Sketch2Diagram Benchmark with 6,000 annotated examples.
Conclusion: SketchAgent bridges sketching and machine-readable diagrams, benefiting design, education, and engineering.
Abstract: Hand-drawn sketches are a natural and efficient medium for capturing and conveying ideas. Despite significant advancements in controllable natural image generation, translating freehand sketches into structured, machine-readable diagrams remains a labor-intensive and predominantly manual task. The primary challenge stems from the inherent ambiguity of sketches, which lack the structural constraints and semantic precision required for automated diagram generation. To address this challenge, we introduce SketchAgent, a multi-agent system designed to automate the transformation of hand-drawn sketches into structured diagrams. SketchAgent integrates sketch recognition, symbolic reasoning, and iterative validation to produce semantically coherent and structurally accurate diagrams, significantly reducing the need for manual effort. To evaluate the effectiveness of our approach, we propose the Sketch2Diagram Benchmark, a comprehensive dataset and evaluation framework encompassing eight diverse diagram categories, such as flowcharts, directed graphs, and model architectures. The dataset comprises over 6,000 high-quality examples with token-level annotations, standardized preprocessing, and rigorous quality control. By streamlining the diagram generation process, SketchAgent holds great promise for applications in design, education, and engineering, while offering a significant step toward bridging the gap between intuitive sketching and machine-readable diagram generation. The benchmark is released at https://huggingface.co/datasets/DiagramAgent/Sketch2Diagram-Benchmark.
[573] Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models
Sushant Mehta, Raj Dandekar, Rajat Dandekar, Sreedath Panat
Main category: cs.AI
TL;DR: MoE-MLA-RoPE combines Mixture of Experts, Multi-head Latent Attention, and Rotary Position Embeddings for efficient language modeling, achieving significant memory reduction, speedup, and quality improvements.
Details
Motivation: Address the trade-off between model capacity and computational efficiency in language modeling.
Method: Uses fine-grained expert routing, shared expert isolation, and gradient-conflict-free load balancing.
Result: Achieves 68% KV cache memory reduction, 3.2x inference speedup, and 6.9% better validation loss with fewer parameters.
Conclusion: Architectural innovation, not parameter scaling, defines efficiency in resource-constrained language model deployment.
Abstract: We present MoE-MLA-RoPE, a novel architecture that combines Mixture of Experts (MoE) with Multi-head Latent Attention (MLA) and Rotary Position Embeddings (RoPE) for efficient language modeling. Our approach addresses the fundamental trade-off between model capacity and computational efficiency through three key innovations: (1) fine-grained expert routing with 64 micro-experts and top-$k$ selection, enabling flexible specialization through $3.6 \times 10^7$ possible expert combinations; (2) shared expert isolation that dedicates 2 always active experts for common patterns while routing to 6 of 62 specialized experts; and (3) gradient-conflict-free load balancing that maintains expert utilization without interfering with primary loss optimization. Extensive experiments on models ranging from 17M to 202M parameters demonstrate that MoE-MLA-RoPE with compression ratio $r = d/2$ achieves 68% KV cache memory reduction and 3.2x inference speedup while maintaining competitive perplexity (0.8% degradation). Compared to vanilla transformers with 53.9M parameters, MoE-MLA-RoPE improves the validation loss by 6.9% while using 42% fewer active parameters per forward pass. FLOP-matched experiments reveal even larger gains: 11.1% improvement with 3.2x inference acceleration. Automated evaluation using GPT-4 as a judge confirms quality improvements in generation, with higher scores on coherence (8.1/10), creativity (7.9/10) and grammatical correctness (8.2/10). Our results establish that architectural novelty, not parameter scaling, defines the efficiency frontier for resource-constrained language model deployment.
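A hedged PyTorch sketch of the routing scheme as described: 2 always-active shared experts plus top-6 routing over 62 specialized micro-experts. The model width, the linear expert form, and the gating details are assumptions, not the paper's implementation.

```python
# Shared-expert isolation plus sparse top-k routing, per the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=64, n_shared=2, n_routed=62, top_k=6):
        super().__init__()
        self.shared = nn.ModuleList(nn.Linear(d_model, d_model)
                                    for _ in range(n_shared))
        self.routed = nn.ModuleList(nn.Linear(d_model, d_model)
                                    for _ in range(n_routed))
        self.gate = nn.Linear(d_model, n_routed)
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d_model)
        out = sum(e(x) for e in self.shared)    # always-active shared experts
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over selected experts
        for k in range(self.top_k):             # dispatch to routed experts
            for e_id in idx[:, k].unique():
                mask = idx[:, k] == e_id
                out[mask] = out[mask] + (weights[mask, k].unsqueeze(1)
                                         * self.routed[int(e_id)](x[mask]))
        return out

layer = MoELayer()
print(layer(torch.randn(8, 64)).shape)          # torch.Size([8, 64])
```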
[574] From Semantic Web and MAS to Agentic AI: A Unified Narrative of the Web of Agents
Tatiana Petrova, Boris Bliznioukov, Aleksandr Puzikov, Radu State
Main category: cs.AI
TL;DR: The paper provides a comprehensive evolutionary overview of the Web of Agents (WoA), linking modern protocols to historical standards and introducing a taxonomy to unify analysis. It highlights a paradigm shift in intelligence locus and outlines future research challenges.
Details
Motivation: The fragmentation of research across communities obscures the intellectual lineage of modern systems, hindering a holistic understanding of the WoA's evolution.
Method: The authors introduce a four-axis taxonomy (semantic foundation, communication paradigm, locus of intelligence, discovery mechanism) to compare agent architectures across generations.
Result: The analysis reveals a paradigm shift in intelligence locus and identifies modern protocols as evolutionary responses to earlier standards. It also highlights the need for addressing socio-technical challenges.
Conclusion: New protocols alone are insufficient for a robust WoA ecosystem; future research must focus on decentralized identity, economic models, security, and governance.
Abstract: The concept of the Web of Agents (WoA), which transforms the static, document-centric Web into an environment of autonomous agents acting on users’ behalf, has attracted growing interest as large language models (LLMs) become more capable. However, research in this area is still fragmented across different communities. Contemporary surveys catalog the latest LLM-powered frameworks, while the rich histories of Multi-Agent Systems (MAS) and the Semantic Web are often treated as separate, legacy domains. This fragmentation obscures the intellectual lineage of modern systems and hinders a holistic understanding of the field’s trajectory. We present the first comprehensive evolutionary overview of the WoA. We show that modern protocols like A2A and MCP are direct evolutionary responses to the well-documented limitations of earlier standards like FIPA and OWL-based semantic agents. To systematize this analysis, we introduce a four-axis taxonomy (semantic foundation, communication paradigm, locus of intelligence, discovery mechanism). This framework provides a unified analytical lens for comparing agent architectures across all generations, revealing a clear line of descent where others have seen a disconnect. Our analysis identifies a paradigm shift in the ’locus of intelligence’: from being encoded in external data (Semantic Web) or the platform (MAS) to being embedded within the agent’s core model (LLM). This shift is foundational to modern Agentic AI, enabling the scalable and adaptive systems the WoA has long envisioned. We conclude that while new protocols are essential, they are insufficient for building a robust, open, trustworthy ecosystem. Finally, we argue that the next research frontier lies in solving persistent socio-technical challenges, and we map out a new agenda focused on decentralized identity, economic models, security, and governance for the emerging WoA.
[575] Win-k: Improved Membership Inference Attacks on Small Language Models
Roya Arkhmammadova, Hosein Madadi Tamar, M. Emre Gursoy
Main category: cs.AI
TL;DR: The paper explores membership inference attacks (MIAs) on small language models (SLMs), proposing a new attack called win-k that outperforms existing methods, especially on smaller models.
Details
Motivation: SLMs are efficient for resource-constrained environments, but their vulnerability to MIAs, which threaten privacy and intellectual property, is understudied.
Method: The authors propose win-k, an MIA building on min-k, and evaluate it against five existing MIAs using three datasets and eight SLMs.
Result: Win-k outperforms existing MIAs in AUROC, TPR @ 1% FPR, and FPR @ 99% TPR metrics, particularly on smaller models.
Conclusion: The study highlights the effectiveness of win-k for MIAs on SLMs, filling a gap in research on smaller models.
Abstract: Small language models (SLMs) are increasingly valued for their efficiency and deployability in resource-constrained environments, making them useful for on-device, privacy-sensitive, and edge computing applications. On the other hand, membership inference attacks (MIAs), which aim to determine whether a given sample was used in a model’s training, are an important threat with serious privacy and intellectual property implications. In this paper, we study MIAs on SLMs. Although MIAs were shown to be effective on large language models (LLMs), they are relatively less studied on emerging SLMs, and furthermore, their effectiveness decreases as models get smaller. Motivated by this finding, we propose a new MIA called win-k, which builds on top of a state-of-the-art attack (min-k). We experimentally evaluate win-k by comparing it with five existing MIAs using three datasets and eight SLMs. Results show that win-k outperforms existing MIAs in terms of AUROC, TPR @ 1% FPR, and FPR @ 99% TPR metrics, especially on smaller models.
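win-k's exact scoring is defined in the paper; the sketch below shows the min-k baseline it builds on, where a sample is scored by the mean log-probability of its k% least-likely tokens and higher scores suggest training-set membership. The token log-probs here are synthetic stand-ins for a real model's outputs.

```python
# min-k% prob scoring, the baseline that win-k extends.
import numpy as np

def min_k_score(token_logprobs, k_pct=20):
    """Mean log-prob over the k% lowest-probability tokens."""
    lp = np.sort(np.asarray(token_logprobs))      # ascending
    n = max(1, int(len(lp) * k_pct / 100))
    return lp[:n].mean()

# Synthetic example: a memorized (training) sample tends to have higher
# token log-probs than an unseen one.
member = [-0.2, -0.5, -0.1, -0.9, -0.3, -0.4]
non_member = [-1.8, -0.6, -2.5, -1.1, -3.0, -0.9]
print(min_k_score(member), min_k_score(non_member))
# Membership is predicted when the score exceeds a tuned threshold;
# AUROC and TPR at low FPR are then computed over many such scores.
```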
[576] KCR: Resolving Long-Context Knowledge Conflicts via Reasoning in LLMs
Xianda Zheng, Zijian Huang, Meng-Fen Chiang, Michael J. Witbrock, Kaiqi Zhao
Main category: cs.AI
TL;DR: The paper introduces the Knowledge Conflict Reasoning (KCR) framework to help LLMs resolve inter-context knowledge conflicts by rewarding logical consistency in reasoning paths.
Details
Motivation: Addressing the confusion LLMs face with lengthy and conflicting contexts, aiming to improve their ability to handle knowledge conflicts.
Method: KCR extracts reasoning paths (text or knowledge graphs) and uses Reinforcement Learning to train LLMs to follow correct reasoning paths.
Result: The framework significantly enhances LLMs’ performance in resolving knowledge conflicts in long-context scenarios.
Conclusion: KCR effectively improves LLMs’ capability to handle inter-context knowledge conflicts, demonstrating substantial performance gains.
Abstract: Knowledge conflicts commonly arise across diverse sources, and their prevalence has increased with the advent of LLMs. When dealing with conflicts between multiple contexts, also known as \emph{inter-context knowledge conflicts}, LLMs are often confused by lengthy and conflicting contexts. To address this challenge, we propose the Knowledge Conflict Reasoning (KCR) framework, which enhances the ability of LLMs to resolve conflicting knowledge. The key idea of KCR is to train backbone LLMs to establish a correct reasoning process by rewarding them for selecting and adhering to the context with stronger logical consistency when presented with conflicting contexts. Specifically, we first extract reasoning paths, represented by either text or local knowledge graphs, from the conflicting long contexts. Subsequently, we employ Reinforcement Learning to encourage the model to learn the paradigm of reasoning process that follows correct reasoning paths rather than the incorrect counterparts. This enables the backbone models to genuinely acquire the capability to resolve inter-context knowledge conflicts within long contexts. Experimental results demonstrate that our framework significantly improves the ability of various backbone models to resolve knowledge conflicts in long-context scenarios, yielding substantial performance gains.
[577] Multi-TW: Benchmarking Multimodal Models on Traditional Chinese Question Answering in Taiwan
Jui-Ming Yao, Bing-Cheng Xie, Sheng-Wei Peng, Hao-Yuan Chen, He-Rong Zheng, Bing-Jia Tan, Peter Shaojui Wang, Shun-Feng Su
Main category: cs.AI
TL;DR: Multi-TW is the first Traditional Chinese benchmark for evaluating multimodal models, focusing on performance and latency. It includes 900 questions and highlights the superiority of closed-source models and the efficiency of any-to-any pipelines.
Details
Motivation: Existing benchmarks lack tri-modal evaluation in Traditional Chinese and ignore inference latency, limiting comprehensive assessment of multimodal models.
Method: Multi-TW introduces 900 multiple-choice questions (image-text and audio-text pairs) from official proficiency tests. It evaluates any-to-any models and VLMs with audio transcription.
Result: Closed-source models outperform open-source ones across modalities, though open-source models excel in audio tasks. Any-to-any pipelines show lower latency than VLMs with separate transcription.
Conclusion: Multi-TW provides a comprehensive evaluation of multimodal models, emphasizing the need for Traditional Chinese fine-tuning and efficient architectures.
Abstract: Multimodal Large Language Models (MLLMs) process visual, acoustic, and textual inputs, addressing the limitations of single-modality LLMs. However, existing benchmarks often overlook tri-modal evaluation in Traditional Chinese and do not consider inference latency. To address this, we introduce Multi-TW, the first Traditional Chinese benchmark for evaluating the performance and latency of any-to-any multimodal models. Multi-TW includes 900 multiple-choice questions (image and text, audio and text pairs) sourced from official proficiency tests developed with the Steering Committee for the Test of Proficiency-Huayu (SC-TOP). We evaluated various any-to-any models and vision-language models (VLMs) with audio transcription. Our results show that closed-source models generally outperform open-source ones across modalities, although open-source models can perform well in audio tasks. End-to-end any-to-any pipelines offer clear latency advantages compared to VLMs using separate audio transcription. Multi-TW presents a comprehensive view of model capabilities and highlights the need for Traditional Chinese fine-tuning and efficient multimodal architectures.
[578] BioDisco: Multi-agent hypothesis generation with dual-mode evidence, iterative feedback and temporal evaluation
Yujing Ke, Kevin George, Kathan Pandya, David Blumenthal, Maximilian Sprang, Gerrit Großmann, Sebastian Vollmer, David Antony Selby
Main category: cs.AI
TL;DR: BioDisco is a multi-agent framework for generating novel, evidence-grounded hypotheses using language models and dual-mode evidence systems, validated by temporal and human evaluations.
Details
Motivation: Addressing the challenge of generating novel hypotheses amidst vast information complexity, where existing methods lack novelty, refinement, and rigorous evaluation.
Method: Combines language model-based reasoning, biomedical knowledge graphs, and literature retrieval with iterative refinement via scoring and feedback loops. Validated using temporal/human evaluations and Bradley-Terry model.
Result: Outperforms existing methods in novelty and significance, offering flexibility for custom integrations.
Conclusion: BioDisco is a practical, modular tool to catalyze hypothesis discovery, validated for future potential.
Abstract: Identifying novel hypotheses is essential to scientific research, yet this process risks being overwhelmed by the sheer volume and complexity of available information. Existing automated methods often struggle to generate novel and evidence-grounded hypotheses, lack robust iterative refinement and rarely undergo rigorous temporal evaluation for future discovery potential. To address this, we propose BioDisco, a multi-agent framework that draws upon language model-based reasoning and a dual-mode evidence system (biomedical knowledge graphs and automated literature retrieval) for grounded novelty, integrates an internal scoring and feedback loop for iterative refinement, and validates performance through pioneering temporal and human evaluations and a Bradley-Terry paired comparison model to provide statistically-grounded assessment. Our evaluations demonstrate superior novelty and significance over ablated configurations representative of existing agentic architectures. Designed for flexibility and modularity, BioDisco allows seamless integration of custom language models or knowledge graphs, and can be run with just a few lines of code. We anticipate researchers using this practical tool as a catalyst for the discovery of new hypotheses.
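The statistical backbone of the evaluation, a Bradley-Terry model, fits latent hypothesis strengths from pairwise preferences: hypothesis i beats j with probability s_i / (s_i + s_j). A minimal sketch with synthetic win counts, fit via Hunter's MM updates; the paper's fitting procedure may differ.

```python
# Bradley-Terry strengths from a matrix of pairwise wins.
import numpy as np

# wins[i, j] = number of times hypothesis i was preferred over j (synthetic).
wins = np.array([[0, 8, 9, 7],
                 [2, 0, 6, 5],
                 [1, 4, 0, 6],
                 [3, 5, 4, 0]], dtype=float)
games = wins + wins.T                  # total comparisons per pair
s = np.ones(len(wins))                 # latent strengths

for _ in range(200):                   # MM updates (Hunter, 2004)
    for i in range(len(s)):
        denom = sum(games[i, j] / (s[i] + s[j])
                    for j in range(len(s)) if j != i)
        s[i] = wins[i].sum() / denom
    s /= s.sum()                       # fix the arbitrary scale

print(np.round(s, 3))                  # hypothesis 0 should rank highest
# P(i beats j) is then estimated as s[i] / (s[i] + s[j]).
```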
[579] How Far Are LLMs from Symbolic Planners? An NLP-Based Perspective
Ma’ayan Armony, Albert Meroño-Peñuela, Gerard Canal
Main category: cs.AI
TL;DR: The paper explores LLMs’ planning abilities, proposing an NLP-based recovery pipeline to improve plan quality, but finds LLMs still unreliable compared to classical planners.
Details
Motivation: To address the unreliability of LLMs in planning tasks by leveraging their NLP capabilities for plan evaluation and recovery.
Method: A recovery pipeline with NLP-based evaluation and three stages of plan recovery, culminating in symbolic planner completion.
Result: No clear reasoning evidence in LLM plans; pipeline improves success rate from 21.9% to 27.5%, but plans remain flawed (average 2.65 executable actions).
Conclusion: LLMs lack reliability in planning; NLP-based recovery helps but doesn’t match classical planners.
Abstract: The reasoning and planning abilities of Large Language Models (LLMs) have been a frequent topic of discussion in recent years. Their ability to take unstructured planning problems as input has made LLMs’ integration into AI planning an area of interest. Nevertheless, LLMs are still not reliable as planners, with the generated plans often containing mistaken or hallucinated actions. Existing benchmarking and evaluation methods investigate planning with LLMs, focusing primarily on success rate as a quality indicator in various planning tasks, such as validating plans or planning in relaxed conditions. In this paper, we approach planning with LLMs as a natural language processing (NLP) task, given that LLMs are NLP models themselves. We propose a recovery pipeline consisting of an NLP-based evaluation of the generated plans, along with three stages to recover the plans through NLP manipulation of the LLM-generated plans, and eventually complete the plan using a symbolic planner. This pipeline provides a holistic analysis of LLM capabilities in the context of AI task planning, enabling a broader understanding of the quality of invalid plans. Our findings reveal no clear evidence of underlying reasoning during plan generation, and that a pipeline comprising an NLP-based analysis of the plans, followed by a recovery mechanism, still falls short of the quality and reliability of classical planners. On average, only the first 2.65 actions of the plan are executable, with the average length of symbolically generated plans being 8.4 actions. The pipeline still improves action quality and increases the overall success rate from 21.9% to 27.5%.
[580] PUZZLED: Jailbreaking LLMs through Word-Based Puzzles
Yelim Ahn, Jaejin Lee
Main category: cs.AI
TL;DR: PUZZLED is a novel jailbreak method for LLMs that uses word puzzles to bypass safety measures, achieving high attack success rates.
Details
Motivation: Ensuring LLM safety is critical, but existing jailbreak methods rely on iterative prompts or semantic transformations, prompting the need for a more effective approach.
Method: PUZZLED masks harmful instruction keywords as word puzzles (word search, anagram, crossword) for LLMs to solve, leveraging their reasoning capabilities.
Result: Achieves an average attack success rate of 88.8%, with 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet.
Conclusion: PUZZLED demonstrates that simple puzzles can effectively jailbreak LLMs by exploiting their reasoning abilities.
Abstract: As large language models (LLMs) are increasingly deployed across diverse domains, ensuring their safety has become a critical concern. In response, studies on jailbreak attacks have been actively growing. Existing approaches typically rely on iterative prompt engineering or semantic transformations of harmful instructions to evade detection. In this work, we introduce PUZZLED, a novel jailbreak method that leverages the LLM’s reasoning capabilities. It masks keywords in a harmful instruction and presents them as word puzzles for the LLM to solve. We design three puzzle types-word search, anagram, and crossword-that are familiar to humans but cognitively demanding for LLMs. The model must solve the puzzle to uncover the masked words and then proceed to generate responses to the reconstructed harmful instruction. We evaluate PUZZLED on five state-of-the-art LLMs and observe a high average attack success rate (ASR) of 88.8%, specifically 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet. PUZZLED is a simple yet powerful attack that transforms familiar puzzles into an effective jailbreak strategy by harnessing LLMs’ reasoning capabilities.
[581] Idempotent Equilibrium Analysis of Hybrid Workflow Allocation: A Mathematical Schema for Future Work
Faruk Alpay, Bugra Kilictas, Taylan Alpay, Hamdi Alakkad
Main category: cs.AI
TL;DR: The paper formalizes task delegation between humans and AI, proving stable equilibria and projecting automation to rise to 65% by 2045, with humans specializing in workflow supervision.
Details
Motivation: To understand how work is reallocated between humans and machines as AI advances, and to identify stable equilibria in this process.
Method: Uses lattice-theoretic fixed-point tools to prove equilibrium existence and uniqueness, and models automation dynamics with closed-form solutions and simulations.
Result: Automation is projected to rise to 65% of work by 2045, with humans focusing on supervising AI. A mixed equilibrium is proven and reproducible across models.
Conclusion: Policies promoting human-AI collaboration (‘centaur’ teaming) can optimize welfare by steering toward the equilibrium.
Abstract: The rapid advance of large-scale AI systems is reshaping how work is divided between people and machines. We formalise this reallocation as an iterated task-delegation map and show that–under broad, empirically grounded assumptions–the process converges to a stable idempotent equilibrium in which every task is performed by the agent (human or machine) with enduring comparative advantage. Leveraging lattice-theoretic fixed-point tools (Tarski and Banach), we (i) prove existence of at least one such equilibrium and (ii) derive mild monotonicity conditions that guarantee uniqueness. In a stylised continuous model the long-run automated share takes the closed form $x^* = \alpha / (\alpha + \beta)$, where $\alpha$ captures the pace of automation and $\beta$ the rate at which new, human-centric tasks appear; hence full automation is precluded whenever $\beta > 0$. We embed this analytic result in three complementary dynamical benchmarks–a discrete linear update, an evolutionary replicator dynamic, and a continuous Beta-distributed task spectrum–each of which converges to the same mixed equilibrium and is reproducible from the provided code-free formulas. A 2025-to-2045 simulation calibrated to current adoption rates projects automation rising from approximately 10% of work to approximately 65%, leaving a persistent one-third of tasks to humans. We interpret that residual as a new profession of workflow conductor: humans specialise in assigning, supervising and integrating AI modules rather than competing with them. Finally, we discuss implications for skill development, benchmark design and AI governance, arguing that policies which promote “centaur” human-AI teaming can steer the economy toward the welfare-maximising fixed point.
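The discrete benchmark dynamic is easy to reproduce: with automation claiming remaining human tasks at rate $\alpha$ and new human-centric tasks appearing at rate $\beta$, the automated share converges to $x^* = \alpha/(\alpha+\beta)$. The rates below are illustrative, chosen only so the fixed point matches the paper's 65% projection.

```python
# Discrete delegation update x_{t+1} = x_t + alpha*(1 - x_t) - beta*x_t,
# whose fixed point is alpha / (alpha + beta).
alpha, beta = 0.065, 0.035                 # illustrative, not the paper's fit

x = 0.10                                   # ~10% of work automated in 2025
for year in range(2025, 2046):
    x = x + alpha * (1 - x) - beta * x     # automation gains minus new human tasks
    if year % 5 == 0:
        print(year, round(x, 3))
print("fixed point:", round(alpha / (alpha + beta), 3))   # 0.65
```

Since $\beta > 0$ keeps the fixed point strictly below 1, the residual human share in this toy run persists exactly as the abstract argues.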
[582] Towards Evaluation for Real-World LLM Unlearning
Ke Miao, Yuke Hu, Xiaochen Li, Wenjie Bao, Zhihao Liu, Zhan Qin, Kui Ren
Main category: cs.AI
TL;DR: The paper introduces DCUE, a new metric for evaluating unlearning in LLMs, addressing practicality, exactness, and robustness issues in existing metrics.
Details
Motivation: Existing unlearning evaluation metrics lack practicality, exactness, and robustness in real-world LLM scenarios.
Method: Proposes DCUE, which identifies core tokens and corrects distributional biases in confidence scores using a validation set, quantified via the Kolmogorov-Smirnov test.
Result: DCUE outperforms existing metrics, addressing their limitations.
Conclusion: DCUE guides future design of more practical and reliable unlearning algorithms.
Abstract: This paper analyzes the limitations of existing unlearning evaluation metrics in terms of practicality, exactness, and robustness in real-world LLM unlearning scenarios. To overcome these limitations, we propose a new metric called Distribution Correction-based Unlearning Evaluation (DCUE). It identifies core tokens and corrects distributional biases in their confidence scores using a validation set. The evaluation results are quantified using the Kolmogorov-Smirnov test. Experimental results demonstrate that DCUE overcomes the limitations of existing metrics, which also guides the design of more practical and reliable unlearning algorithms in the future.
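A sketch of DCUE's final quantification step as the abstract describes it: compare core-token confidence distributions with a two-sample Kolmogorov-Smirnov test. The beta-distributed confidences below are synthetic stand-ins for bias-corrected model scores.

```python
# Two-sample KS test over (bias-corrected) confidence distributions.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
conf_unlearned = rng.beta(2, 5, size=400)   # confidences on the forget set
conf_reference = rng.beta(2, 5, size=400)   # validation-set confidences

stat, p = ks_2samp(conf_unlearned, conf_reference)
# Small KS statistic / large p-value: the unlearned model's confidence on
# the forget set is indistinguishable from the reference distribution.
print(f"KS statistic: {stat:.3f}, p-value: {p:.3f}")
```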
[583] NatureGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset
Zihan Zheng, Tianle Cui, Chuwen Xie, Jiahui Zhang, Jiahui Pan, Lewei He, Qianglong Chen
Main category: cs.AI
TL;DR: The paper introduces \Benchmark, a novel evaluation benchmark for LLM-driven GUI agents, and \Agent, a hierarchical architecture for long-horizon tasks. It highlights limitations of current benchmarks and models, and presents Reinforcement Fine-Tuning results.
Details
Motivation: Existing benchmarks for LLM-driven GUI agents lack accuracy, reproducibility, and scalability, hindering progress.
Method: Developed \Benchmark using Causal Pathways for automated, reproducible evaluation, and \Agent for hierarchical task execution. Generated a human-verified dataset and applied Reinforcement Fine-Tuning to Qwen2.5-VL-7B.
Result: \Benchmark proved challenging for top LLMs (Claude-sonnet-4 achieved 34.6% WPSR). RFT improved smaller models but faltered in complex scenarios.
Conclusion: The study provides a rigorous benchmark and dataset to advance GUI agent development, revealing limitations of smaller models in complex tasks.
Abstract: The rapid advancement of Large Language Model (LLM)-driven Graphical User Interface (GUI) agents is significantly hampered by the profound limitations of existing evaluation benchmarks in terms of accuracy, reproducibility, and scalability. To address this critical gap, we introduce \Benchmark, a novel benchmark engineered on the principle of Causal Pathways. This design paradigm structures complex tasks into a series of programmatically verifiable atomic steps, ensuring a rigorous, fully automated, and reproducible standard for assessment. Concurrently, to mitigate the inherent capability deficits of agents, we developed \Agent, a hierarchical agent architecture specifically optimized for long-horizon tasks. We leveraged this agent to generate a high-quality, human-verified trajectory dataset that uniquely captures diverse and even self-correcting interaction patterns of LLMs. We then utilized this dataset to perform Reinforcement Fine-Tuning (RFT) on the Qwen2.5-VL-7B model. Our experiments reveal that \Benchmark presents a formidable challenge to current state-of-the-art LLMs; even the top-performing Claude-sonnet-4 achieved a Weighted Pathway Success Rate (WPSR) of only 34.6%. Moreover, while RFT substantially improved the smaller model’s GUI execution capabilities (WPSR increased from 3.3% to 10.8%), its performance degraded sharply when handling complex scenarios. This outcome highlights the inherent capability ceiling of smaller models when faced with comprehensive tasks that integrate perception, decision-making, and execution. This research contributes a rigorous evaluation standard and a high-quality dataset to the community, aiming to guide the future development of GUI agents.
[584] Relation-Aware LNN-Transformer for Intersection-Centric Next-Step Prediction
Zhehong Ren, Tianluo Zhang, Yiheng Lu, Yushen Liang, Promethee Spathis
Main category: cs.AI
TL;DR: A road-node-centric framework for next-step location prediction outperforms baselines by capturing urban road network constraints and exploratory behavior.
Details
Motivation: Overcome limitations of closed-world POI-based approaches by modeling road-user trajectories on road-intersection graphs.
Method: Uses sector-wise directional POI aggregation and structural graph embeddings, combined with a Relation-Aware LNN-Transformer for sequence modeling.
Result: Outperforms six baselines by up to 17% in accuracy and maintains resilience under noise.
Conclusion: The framework effectively predicts next-step locations by integrating environmental context and spatial-temporal dynamics.
Abstract: Next-step location prediction plays a pivotal role in modeling human mobility, underpinning applications from personalized navigation to strategic urban planning. However, approaches that assume a closed world - restricting choices to a predefined set of points of interest (POIs) - often fail to capture exploratory or target-agnostic behavior and the topological constraints of urban road networks. Hence, we introduce a road-node-centric framework that represents road-user trajectories on the city’s road-intersection graph, thereby relaxing the closed-world constraint and supporting next-step forecasting beyond fixed POI sets. To encode environmental context, we introduce a sector-wise directional POI aggregation that produces compact features capturing distance, bearing, density and presence cues. By combining these cues with structural graph embeddings, we obtain semantically grounded node representations. For sequence modeling, we integrate a Relation-Aware LNN-Transformer - a hybrid of a Continuous-time Forgetting Cell (CfC-LNN) and a bearing-biased self-attention module - to capture both fine-grained temporal dynamics and long-range spatial dependencies. Evaluated on city-scale road-user trajectories, our model outperforms six state-of-the-art baselines by up to 17 percentage points in accuracy at one hop and 10 percentage points in MRR, and maintains high resilience under noise, losing only 2.4 percentage points in accuracy at one hop under 50-meter GPS perturbation and 8.9 percentage points in accuracy at one hop under 25 percent POI noise.
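To make the sector-wise directional POI aggregation concrete, here is a minimal Python sketch that bins POIs around a road node into bearing sectors and emits the count, mean-distance, presence, and density cues the abstract mentions; the eight-sector split, 500 m radius, and input format are illustrative assumptions, not details from the paper.

```python
import math

def sector_poi_features(node_xy, pois, n_sectors=8, max_radius=500.0):
    """Aggregate POIs around a road node into directional sectors (sketch).

    node_xy: (x, y) of the road intersection.
    pois: iterable of (x, y, category) tuples; categories are ignored here.
    Returns per-sector counts, mean distances, presence flags, and densities.
    """
    counts = [0] * n_sectors
    dist_sums = [0.0] * n_sectors
    sector_width = 2 * math.pi / n_sectors
    for px, py, _cat in pois:
        dx, dy = px - node_xy[0], py - node_xy[1]
        dist = math.hypot(dx, dy)
        if dist == 0 or dist > max_radius:
            continue                                   # outside the aggregation radius
        bearing = math.atan2(dy, dx) % (2 * math.pi)   # bearing in [0, 2*pi)
        s = int(bearing / sector_width)
        counts[s] += 1
        dist_sums[s] += dist
    mean_dist = [dist_sums[i] / counts[i] if counts[i] else max_radius
                 for i in range(n_sectors)]
    presence = [1.0 if c > 0 else 0.0 for c in counts]
    density = [c / (math.pi * max_radius**2 / n_sectors) for c in counts]
    return counts, mean_dist, presence, density
```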
[585] TripTailor: A Real-World Benchmark for Personalized Travel Planning
Yuanzhe Shen, Kaimin Wang, Changze Lv, Xiaoqing Zheng, Xuanjing Huang
Main category: cs.AI
TL;DR: TripTailor is a benchmark for personalized travel planning using real-world data, highlighting gaps in LLM-generated itineraries.
Details
Motivation: Current benchmarks use unrealistic data and lack comprehensive evaluation metrics for travel planning.
Method: Introduces TripTailor, a dataset with 500,000+ POIs and 4,000 itineraries for authentic evaluation.
Result: Fewer than 10% of LLM-generated itineraries match human-level performance, revealing challenges in feasibility and personalization.
Conclusion: TripTailor aims to improve travel planning agents by addressing real-world needs and practical itinerary generation.
Abstract: The continuous evolution and enhanced reasoning capabilities of large language models (LLMs) have elevated their role in complex tasks, notably in travel planning, where demand for personalized, high-quality itineraries is rising. However, current benchmarks often rely on unrealistic simulated data, failing to reflect the differences between LLM-generated and real-world itineraries. Existing evaluation metrics, which primarily emphasize constraints, fall short of providing a comprehensive assessment of the overall quality of travel plans. To address these limitations, we introduce TripTailor, a benchmark designed specifically for personalized travel planning in real-world scenarios. This dataset features an extensive collection of over 500,000 real-world points of interest (POIs) and nearly 4,000 diverse travel itineraries, complete with detailed information, providing a more authentic evaluation framework. Experiments show that fewer than 10% of the itineraries generated by the latest state-of-the-art LLMs achieve human-level performance. Moreover, we identify several critical challenges in travel planning, including the feasibility, rationality, and personalized customization of the proposed solutions. We hope that TripTailor will drive the development of travel planning agents capable of understanding and meeting user needs while generating practical itineraries. Our code and dataset are available at https://github.com/swxkfm/TripTailor
[586] $R^2$-CoD: Understanding Text-Graph Complementarity in Relational Reasoning via Knowledge Co-Distillation
Zhen Wu, Ritam Dutt, Luke M. Breitfeller, Armineh Nourbakhsh, Siddharth Parekh, Carolyn Rosé
Main category: cs.AI
TL;DR: The paper investigates the interplay between text and graph representations in NLP tasks, using a unified architecture with knowledge co-distillation (CoD) to analyze their complementarity.
Details
Motivation: To systematically understand how text and graph representations interact and influence hybrid models in relational reasoning tasks.
Method: A unified architecture supporting knowledge co-distillation (CoD) is used to analyze five relational reasoning tasks, tracking the evolution of dual representations during training.
Result: Interpretable patterns of alignment and divergence between text and graph representations are uncovered, providing insights into their integration benefits.
Conclusion: The study offers insights into when and why integrating text and graph representations is beneficial for relational reasoning tasks.
Abstract: Relational reasoning lies at the core of many NLP tasks, drawing on complementary signals from text and graphs. While prior research has investigated how to leverage this dual complementarity, a detailed and systematic understanding of text-graph interplay and its effect on hybrid models remains underexplored. We take an analysis-driven approach to investigate text-graph representation complementarity via a unified architecture that supports knowledge co-distillation (CoD). We explore five tasks involving relational reasoning that differ in how text and graph structures encode the information needed to solve that task. By tracking how these dual representations evolve during training, we uncover interpretable patterns of alignment and divergence, and provide insights into when and why their integration is beneficial.
[587] CARGO: A Co-Optimization Framework for EV Charging and Routing in Goods Delivery Logistics
Arindam Khanda, Anurag Satpathy, Amit Jha, Sajal K. Das
Main category: cs.AI
TL;DR: CARGO jointly optimizes EV delivery routes and charging, proving NP-hardness and offering MILP-based exact and heuristic solutions. It outperforms baselines, reducing charging costs by up to 39%.
Details
Motivation: Address EV delivery challenges like limited battery capacity and charging logistics in urban areas.
Method: Proposes CARGO, a framework with MILP-based exact and heuristic solutions for joint route and charging optimization.
Result: Heuristic reduces charging costs by up to 39% and 22% over EDF and NDF, respectively, with comparable delivery completion.
Conclusion: CARGO effectively balances cost and efficiency in EV-based deliveries, offering scalable solutions.
Abstract: With growing interest in sustainable logistics, electric vehicle (EV)-based deliveries offer a promising alternative for urban distribution. However, EVs face challenges due to their limited battery capacity, requiring careful planning for recharging. This depends on factors such as the charging point (CP) availability, cost, proximity, and vehicles’ state of charge (SoC). We propose CARGO, a framework addressing the EV-based delivery route planning problem (EDRP), which jointly optimizes route planning and charging for deliveries within time windows. After proving the problem’s NP-hardness, we propose a mixed integer linear programming (MILP)-based exact solution and a computationally efficient heuristic method. Using real-world datasets, we evaluate our methods by comparing the heuristic to the MILP solution, and benchmarking it against baseline strategies, Earliest Deadline First (EDF) and Nearest Delivery First (NDF). The results show up to 39% and 22% reductions in the charging cost over EDF and NDF, respectively, while completing comparable deliveries.
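For readers unfamiliar with the two baselines CARGO is benchmarked against, a minimal Python sketch of their orderings follows; the dictionary keys (`deadline`, `xy`) and the depot location are hypothetical, and the real baselines would additionally interleave charging stops based on the vehicle's state of charge.

```python
from math import dist

def edf_order(deliveries):
    """Earliest Deadline First: serve deliveries by ascending deadline."""
    return sorted(deliveries, key=lambda d: d["deadline"])

def ndf_order(deliveries, depot=(0.0, 0.0)):
    """Nearest Delivery First: greedily visit the closest remaining stop."""
    remaining, route, pos = list(deliveries), [], depot
    while remaining:
        nxt = min(remaining, key=lambda d: dist(pos, d["xy"]))
        remaining.remove(nxt)
        route.append(nxt)
        pos = nxt["xy"]
    return route

# Toy usage with two deliveries.
jobs = [{"deadline": 10, "xy": (5.0, 1.0)}, {"deadline": 4, "xy": (9.0, 9.0)}]
print(edf_order(jobs)[0]["deadline"])   # 4: tightest deadline first
print(ndf_order(jobs)[0]["xy"])         # (5.0, 1.0): nearest stop first
```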
[588] WinkTPG: An Execution Framework for Multi-Agent Path Finding Using Temporal Reasoning
Jingtian Yan, Stephen F. Smith, Jiaoyang Li
Main category: cs.AI
TL;DR: A method called kTPG refines MAPF plans into kinodynamically feasible ones, and WinkTPG further improves execution by dynamically incorporating agent info.
Details
Motivation: Standard MAPF algorithms use simplified models, making plans hard to follow directly. This work aims to bridge this gap.
Method: Proposes kTPG for refining MAPF plans and WinkTPG for incremental refinement using a window-based mechanism.
Result: WinkTPG handles 1,000 agents in 1 second and improves solution quality by up to 51.7%.
Conclusion: The proposed methods enhance MAPF execution by addressing kinodynamic feasibility and uncertainty.
Abstract: Planning collision-free paths for a large group of agents is a challenging problem with numerous real-world applications. While recent advances in Multi-Agent Path Finding (MAPF) have shown promising progress, standard MAPF algorithms rely on simplified kinodynamic models, preventing agents from directly following the generated MAPF plan. To bridge this gap, we propose kinodynamic Temporal Plan Graph Planning (kTPG), a multi-agent speed optimization algorithm that efficiently refines a MAPF plan into a kinodynamically feasible plan while accounting for uncertainties and preserving collision-freeness. Building on kTPG, we propose Windowed kTPG (WinkTPG), a MAPF execution framework that incrementally refines MAPF plans using a window-based mechanism, dynamically incorporating agent information during execution to reduce uncertainty. Experiments show that WinkTPG can generate speed profiles for up to 1,000 agents in 1 second and improves solution quality by up to 51.7% over existing MAPF execution methods.
[589] Refine-n-Judge: Curating High-Quality Preference Chains for LLM-Fine-Tuning
Derin Cayir, Renjie Tao, Rashi Rungta, Kai Sun, Sean Chen, Haidar Khan, Minseok Kim, Julia Reinspach, Yue Liu
Main category: cs.AI
TL;DR: Refine-n-Judge is an automated iterative method using a single LLM to refine and judge dataset quality, eliminating the need for human feedback or reward models, and improving fine-tuning results.
Details
Motivation: Human feedback for improving LLM training data is costly and unscalable, necessitating an automated solution.
Method: Uses an LLM to iteratively refine responses and judge improvements, stopping when no further enhancements are preferred.
Result: Models fine-tuned on Refine-n-Judge-enhanced datasets were preferred over models tuned on the original data in 74% of comparisons, with further gains on AlpacaEval (+5%) and MT-Bench (+19%).
Conclusion: Refine-n-Judge effectively enhances dataset quality and model performance, offering a scalable alternative to human feedback.
Abstract: Large Language Models (LLMs) have demonstrated remarkable progress through preference-based fine-tuning, which critically depends on the quality of the underlying training data. While human feedback is essential for improving data quality, it is costly and does not scale well. In this paper, we introduce Refine-n-Judge, an automated iterative approach that leverages a single LLM as both a refiner and a judge to enhance dataset quality. Unlike existing iterative refinement methods, Refine-n-Judge employs an LLM to both generate refinements and explicitly evaluate each improvement, ensuring that every iteration meaningfully enhances the dataset without requiring additional human annotation or a separate reward model. At each step, the LLM refines a response and judges whether the refinement is an improvement over the previous answer. This process continues until the LLM prefers the initial answer over the refinement, indicating no further improvements. This produces sequences of increasing quality, preference-labeled responses ideal for fine-tuning. We demonstrate the effectiveness of Refine-n-Judge across a range of public datasets spanning five corpora, targeting tasks such as coding, math, and conversation. Models (Llama 3.1-8B and Llama 3.3-70B) fine-tuned on Refine-n-Judge-enhanced datasets were preferred by LLM judges in over 74% of comparisons against models tuned on the original dataset by GPT-4. Additionally, we report performance gains: +5% on AlpacaEval and AlpacaEval 2.0, and +19% on MT-Bench. Our results indicate that Refine-n-Judge produces high-quality datasets and scalable model improvements.
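The refine-then-judge loop is simple enough to sketch. In the outline below, `llm.refine` and `llm.judge` are hypothetical wrappers around the single model the paper uses, and the iteration cap is an assumption; the chain it returns corresponds to the preference-labeled sequences used for fine-tuning.

```python
def refine_n_judge(prompt, answer, llm, max_iters=5):
    """Iteratively refine an answer with one LLM as both refiner and judge (sketch).

    Stops when the judge prefers the previous answer over the refinement.
    Returns the chain of increasingly preferred answers.
    """
    chain = [answer]
    for _ in range(max_iters):
        candidate = llm.refine(prompt, chain[-1])            # propose an improvement
        if not llm.judge(prompt, old=chain[-1], new=candidate):
            break                                            # judge prefers the old answer: stop
        chain.append(candidate)                              # each adjacent pair is a preference label
    return chain
```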
[590] Getting out of the Big-Muddy: Escalation of Commitment in LLMs
Emilio Barkett, Olivia Long, Paul Kröger
Main category: cs.AI
TL;DR: LLMs show context-dependent cognitive biases like escalation of commitment, influenced by social and organizational settings, not inherent to the models.
Details
Motivation: To understand if LLMs inherit human cognitive biases like escalation of commitment and under what conditions these biases manifest.
Method: A two-stage investment task across four conditions: model as investor, model as advisor, multi-agent deliberation, and compound pressure scenario, tested over 6,500 trials.
Result: LLMs exhibit minimal bias in individual contexts but show strong escalation of commitment in symmetrical peer-based multi-agent deliberation (99.2%) and under compound pressures (68.95% average allocation to failing divisions).
Conclusion: LLM bias is context-dependent, highlighting risks in multi-agent systems and unsupervised deployments where such conditions arise.
Abstract: Large Language Models (LLMs) are increasingly deployed in autonomous decision-making roles across high-stakes domains. However, since models are trained on human-generated data, they may inherit cognitive biases that systematically distort human judgment, including escalation of commitment, where decision-makers continue investing in failing courses of action due to prior investment. Understanding when LLMs exhibit such biases presents a unique challenge. While these biases are well-documented in humans, it remains unclear whether they manifest consistently in LLMs or require specific triggering conditions. This paper investigates this question using a two-stage investment task across four experimental conditions: model as investor, model as advisor, multi-agent deliberation, and compound pressure scenario. Across N = 6,500 trials, we find that bias manifestation in LLMs is highly context-dependent. In individual decision-making contexts (Studies 1-2, N = 4,000), LLMs demonstrate strong rational cost-benefit logic with minimal escalation of commitment. However, multi-agent deliberation reveals a striking hierarchy effect (Study 3, N = 500): while asymmetrical hierarchies show moderate escalation rates (46.2%), symmetrical peer-based decision-making produces near-universal escalation (99.2%). Similarly, when subjected to compound organizational and personal pressures (Study 4, N = 2,000), models exhibit high degrees of escalation of commitment (68.95% average allocation to failing divisions). These findings reveal that LLM bias manifestation depends critically on social and organizational context rather than being inherent, with significant implications for the deployment of multi-agent systems and unsupervised operations where such conditions may emerge naturally.
[591] Empowering Tabular Data Preparation with Language Models: Why and How?
Mengshi Chen, Yuxiang Sun, Tengchao Li, Jianwei Wang, Kai Wang, Xuemin Lin, Ying Zhang, Wenjie Zhang
Main category: cs.AI
TL;DR: The paper explores how Language Models (LMs) can enhance tabular data preparation across four phases: acquisition, integration, cleaning, and transformation, addressing gaps in systematic application.
Details
Motivation: Traditional methods struggle with complex table relationships and task adaptability, prompting the need to leverage LMs for automation and support in data preparation.
Method: The survey systematically analyzes LMs’ role in tabular data preparation, integrating them with other components for each phase and outlining prospective pipelines.
Result: The study highlights key advancements and demonstrates how LMs can effectively address challenges in tabular data preparation.
Conclusion: LMs offer promising opportunities for automating and improving tabular data preparation, with systematic integration across phases being crucial for success.
Abstract: Data preparation is a critical step in enhancing the usability of tabular data and thus boosts downstream data-driven tasks. Traditional methods often face challenges in capturing the intricate relationships within tables and adapting to the tasks involved. Recent advances in Language Models (LMs), especially in Large Language Models (LLMs), offer new opportunities to automate and support tabular data preparation. However, why LMs suit tabular data preparation (i.e., how their capabilities match task demands) and how to use them effectively across phases still remain to be systematically explored. In this survey, we systematically analyze the role of LMs in enhancing tabular data preparation processes, focusing on four core phases: data acquisition, integration, cleaning, and transformation. For each phase, we present an integrated analysis of how LMs can be combined with other components for different preparation tasks, highlight key advancements, and outline prospective pipelines.
[592] One Subgoal at a Time: Zero-Shot Generalization to Arbitrary Linear Temporal Logic Requirements in Multi-Task Reinforcement Learning
Zijian Guo, İlker Işık, H. M. Sabbir Ahmad, Wenchao Li
Main category: cs.AI
TL;DR: GenZ-LTL enables zero-shot generalization to arbitrary LTL specifications by decomposing tasks into reach-avoid subgoals and solving them sequentially, outperforming existing methods.
Details
Motivation: Addressing limitations in handling nested, long-horizon tasks and safety constraints in RL, especially when subgoals are unsatisfiable.
Method: Leverages Büchi automata to decompose LTL specifications into reach-avoid subgoals, solving them one at a time with safe RL formulations and observation reduction.
Result: GenZ-LTL outperforms existing methods in zero-shot generalization to unseen LTL specifications.
Conclusion: Sequential subgoal solving and observation reduction enhance generalization and efficiency in RL for complex LTL tasks.
Abstract: Generalizing to complex and temporally extended task objectives and safety constraints remains a critical challenge in reinforcement learning (RL). Linear temporal logic (LTL) offers a unified formalism to specify such requirements, yet existing methods are limited in their abilities to handle nested long-horizon tasks and safety constraints, and cannot identify situations when a subgoal is not satisfiable and an alternative should be sought. In this paper, we introduce GenZ-LTL, a method that enables zero-shot generalization to arbitrary LTL specifications. GenZ-LTL leverages the structure of Büchi automata to decompose an LTL task specification into sequences of reach-avoid subgoals. Contrary to the current state-of-the-art method that conditions on subgoal sequences, we show that it is more effective to achieve zero-shot generalization by solving these reach-avoid problems \textit{one subgoal at a time} through proper safe RL formulations. In addition, we introduce a novel subgoal-induced observation reduction technique that can mitigate the exponential complexity of subgoal-state combinations under realistic assumptions. Empirical results show that GenZ-LTL substantially outperforms existing methods in zero-shot generalization to unseen LTL specifications.
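A rough sketch of the "one subgoal at a time" control loop, under heavy assumptions: `buchi.next_subgoal`, `buchi.alternative_subgoal`, and the goal-conditioned `subgoal_policy` are hypothetical interfaces standing in for the paper's automaton bookkeeping and safe-RL policy.

```python
def execute_ltl(env, buchi, subgoal_policy):
    """Solve an LTL task as a sequence of reach-avoid subgoals (sketch).

    buchi.next_subgoal(state) is assumed to return a (reach, avoid) pair
    derived from the automaton's current progress, or None once an
    accepting condition is met; subgoal_policy is a goal-conditioned
    safe-RL policy trained on single reach-avoid problems.
    """
    state = env.reset()
    subgoal = buchi.next_subgoal(state)
    while subgoal is not None:
        reach, avoid = subgoal
        state, ok = subgoal_policy.rollout(env, state, reach, avoid)
        if not ok:
            # Subgoal judged unsatisfiable: ask the automaton for an alternative.
            subgoal = buchi.alternative_subgoal(state)
        else:
            subgoal = buchi.next_subgoal(state)
    return state
```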
[593] Polymorphic Combinatorial Frameworks (PCF): Guiding the Design of Mathematically-Grounded, Adaptive AI Agents
David Pearl, Matthew Murphy, James Intriligator
Main category: cs.AI
TL;DR: The Polymorphic Combinatorial Framework (PCF) uses LLMs and mathematical frameworks to design adaptive AI agents for dynamic environments, enabling real-time reconfiguration and scalability.
Details
Motivation: To address the limitations of static agent architectures by creating a dynamic, adaptable framework for AI agents in complex environments.
Method: PCF combines combinatorial logic, topos theory, and rough fuzzy set theory to define a multidimensional SPARK parameter space. LLMs parameterize this space, and Monte Carlo simulations test adaptability.
Result: Simulations showed trends in agent adaptability across complexity tiers, with diminishing returns at higher levels, identifying scalability thresholds.
Conclusion: PCF enables optimized, scalable, and ethical AI agent designs for diverse applications, advancing adaptable and cooperative AI systems.
Abstract: The Polymorphic Combinatorial Framework (PCF) leverages Large Language Models (LLMs) and mathematical frameworks to guide the meta-prompt enabled design of solution spaces and adaptive AI agents for complex, dynamic environments. Unlike static agent architectures, PCF enables real-time parameter reconfiguration through mathematically-grounded combinatorial spaces, allowing agents to adapt their core behavioral traits dynamically. Grounded in combinatorial logic, topos theory, and rough fuzzy set theory, PCF defines a multidimensional SPARK parameter space (Skills, Personalities, Approaches, Resources, Knowledge) to capture agent behaviors. This paper demonstrates how LLMs can parameterize complex spaces and estimate likely parameter values/variabilities. Using PCF, we parameterized mock café domains (five levels of complexity), estimated variables/variabilities, and conducted over 1.25 million Monte Carlo simulations. The results revealed trends in agent adaptability and performance across the five complexity tiers, with diminishing returns at higher complexity levels highlighting thresholds for scalable designs. PCF enables the generation of optimized agent configurations for specific scenarios while maintaining logical consistency. This framework supports scalable, dynamic, explainable, and ethical AI applications in domains like customer service, healthcare, robotics, and collaborative systems, paving the way for adaptable and cooperative next-generation polymorphic agents.
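A minimal sketch of the Monte Carlo step over the SPARK space, assuming LLM-estimated (mean, variability) pairs per dimension and a user-supplied scoring function; the Gaussian sampling and [0, 1] clipping are illustrative choices, not the paper's exact procedure.

```python
import random

SPARK_DIMS = ["skills", "personalities", "approaches", "resources", "knowledge"]

def sample_agent(estimates):
    """Draw one agent configuration from the SPARK space.

    estimates maps each dimension to an assumed (mean, variability) pair;
    sampled values are clipped to [0, 1].
    """
    return {d: min(1.0, max(0.0, random.gauss(*estimates[d]))) for d in SPARK_DIMS}

def monte_carlo(estimates, score_fn, n_trials=100_000):
    """Estimate expected agent performance by plain Monte Carlo averaging."""
    return sum(score_fn(sample_agent(estimates)) for _ in range(n_trials)) / n_trials

# Toy usage: uniform estimates, score = mean trait value.
est = {d: (0.5, 0.1) for d in SPARK_DIMS}
print(monte_carlo(est, lambda a: sum(a.values()) / len(a), n_trials=10_000))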
[594] A Multi-Agent Pokemon Tournament for Evaluating Strategic Reasoning of Large Language Models
Tadisetty Sai Yashwanth, Dhatri C
Main category: cs.AI
TL;DR: The paper introduces LLM Pokemon League, a tournament system using LLMs as agents to simulate strategic decision-making in Pokemon battles, analyzing reasoning and adaptability.
Details
Motivation: To explore how LLMs exhibit strategic reasoning, adaptability, and tactical depth in a rule-based, competitive environment like Pokemon battles.
Method: A single-elimination tournament with diverse AI trainers, capturing decision logs (team-building, action selection, switching) for analysis.
Result: Provides insights into comparative AI behavior, battle psychology, and meta-strategy development in constrained game environments.
Conclusion: The system serves as a novel benchmark for AI research in strategic reasoning and competitive learning, showcasing LLMs’ decision-making under uncertainty.
Abstract: This research presents LLM Pokemon League, a competitive tournament system that leverages Large Language Models (LLMs) as intelligent agents to simulate strategic decision-making in Pokémon battles. The platform is designed to analyze and compare the reasoning, adaptability, and tactical depth exhibited by different LLMs in a type-based, turn-based combat environment. By structuring the competition as a single-elimination tournament involving diverse AI trainers, the system captures detailed decision logs, including team-building rationale, action selection strategies, and switching decisions. The project enables rich exploration into comparative AI behavior, battle psychology, and meta-strategy development in constrained, rule-based game environments. Through this system, we investigate how modern LLMs understand, adapt, and optimize decisions under uncertainty, making Pokémon League a novel benchmark for AI research in strategic reasoning and competitive learning.
[595] QCBench: Evaluating Large Language Models on Domain-Specific Quantitative Chemistry
Jiaqing Xie, Weida Wang, Ben Gao, Zhuo Yang, Haiyuan Wan, Shufei Zhang, Tianfan Fu, Yuqiang Li
Main category: cs.AI
TL;DR: QCBench is a benchmark for evaluating LLMs’ quantitative reasoning in chemistry, featuring 350 problems across 7 subfields and 3 difficulty tiers. It reveals performance gaps as task complexity increases.
Details
Motivation: To assess LLMs' ability in rigorous, step-by-step quantitative reasoning in chemistry, an underexplored area.
Method: Developed QCBench with 350 problems across 7 chemistry subfields and 3 difficulty tiers, focusing on stepwise numerical reasoning.
Result: Performance of 19 LLMs degrades with increasing task complexity, showing a gap between language fluency and scientific computation accuracy.
Conclusion: QCBench identifies computational weaknesses in LLMs and suggests future improvements like domain adaptive fine-tuning or multi-modal integration.
Abstract: Quantitative chemistry plays a fundamental role in chemistry research, enabling precise predictions of molecular properties, reaction outcomes, and material behaviors. While large language models (LLMs) have shown promise in chemistry-related tasks, their ability to perform rigorous, step-by-step quantitative reasoning remains underexplored. To fill this blank, we propose QCBench, a Quantitative Chemistry benchmark comprising 350 computational chemistry problems across 7 chemistry subfields (analytical chemistry, bio/organic chemistry, general chemistry, inorganic chemistry, physical chemistry, polymer chemistry and quantum chemistry), categorized into three hierarchical tiers-basic, intermediate, and expert-to systematically evaluate the mathematical reasoning abilities of large language models (LLMs). Designed to minimize shortcuts and emphasize stepwise numerical reasoning, each problem focuses on pure calculations rooted in real-world chemical vertical fields. QCBench enables fine-grained diagnosis of computational weaknesses, reveals model-specific limitations across difficulty levels, and lays the groundwork for future improvements such as domain adaptive fine-tuning or multi-modal integration. Evaluations on 19 LLMs demonstrate a consistent performance degradation with increasing task complexity, highlighting the current gap between language fluency and scientific computation accuracy.
[596] T-GRAG: A Dynamic GraphRAG Framework for Resolving Temporal Conflicts and Redundancy in Knowledge Retrieval
Dong Li, Yichen Niu, Ying Ai, Xiang Zou, Biqing Qi, Jianxing Liu
Main category: cs.AI
TL;DR: T-GRAG enhances Retrieval-Augmented Generation (RAG) by incorporating temporal dynamics into knowledge graphs, improving accuracy and relevance in temporal reasoning tasks.
Details
Motivation: Existing GraphRAG methods ignore temporal dynamics, leading to issues like temporal ambiguity and time-insensitive retrieval.
Method: T-GRAG introduces five components: Temporal Knowledge Graph Generator, Temporal Query Decomposition, Three-layer Interactive Retriever, Source Text Extractor, and LLM-based Generator.
Result: T-GRAG outperforms prior RAG and GraphRAG baselines in retrieval accuracy and response relevance under temporal constraints.
Conclusion: Modeling knowledge evolution is crucial for robust long-text question answering, as demonstrated by T-GRAG’s superior performance.
Abstract: Large language models (LLMs) have demonstrated strong performance in natural language generation but remain limited in knowledge-intensive tasks due to outdated or incomplete internal knowledge. Retrieval-Augmented Generation (RAG) addresses this by incorporating external retrieval, with GraphRAG further enhancing performance through structured knowledge graphs and multi-hop reasoning. However, existing GraphRAG methods largely ignore the temporal dynamics of knowledge, leading to issues such as temporal ambiguity, time-insensitive retrieval, and semantic redundancy. To overcome these limitations, we propose Temporal GraphRAG (T-GRAG), a dynamic, temporally-aware RAG framework that models the evolution of knowledge over time. T-GRAG consists of five key components: (1) a Temporal Knowledge Graph Generator that creates time-stamped, evolving graph structures; (2) a Temporal Query Decomposition mechanism that breaks complex temporal queries into manageable sub-queries; (3) a Three-layer Interactive Retriever that progressively filters and refines retrieval across temporal subgraphs; (4) a Source Text Extractor to mitigate noise; and (5) an LLM-based Generator that synthesizes contextually and temporally accurate responses. We also introduce Time-LongQA, a novel benchmark dataset based on real-world corporate annual reports, designed to test temporal reasoning across evolving knowledge. Extensive experiments show that T-GRAG significantly outperforms prior RAG and GraphRAG baselines in both retrieval accuracy and response relevance under temporal constraints, highlighting the necessity of modeling knowledge evolution for robust long-text question answering. Our code is publicly available on the T-GRAG
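As a toy illustration of the Temporal Query Decomposition idea, the sketch below splits a question into per-year sub-queries that can each be routed to the matching temporal subgraph; T-GRAG presumably uses an LLM for this step, so the regex and output format here are stand-in assumptions.

```python
import re

def decompose_temporal_query(query, available_years):
    """Split a time-spanning question into per-year sub-queries (sketch).

    Pulls explicit years from the query; if none are mentioned, falls
    back to every year for which a temporal subgraph exists.
    """
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", query)]
    years = sorted(set(years)) or sorted(available_years)
    return [{"year": y, "sub_query": f"{query} (restricted to {y})"} for y in years]

# Toy usage against subgraphs for 2021-2023.
print(decompose_temporal_query("How did revenue change from 2021 to 2023?",
                               available_years=[2021, 2022, 2023]))
```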
[597] SURE-Med: Systematic Uncertainty Reduction for Enhanced Reliability in Medical Report Generation
Yuhang Gu, Xingyu Hu, Yuyu Fan, Xulin Yan, Longhuan Xu, Peng peng
Main category: cs.AI
TL;DR: SURE-Med is a framework addressing visual, distributional, and contextual uncertainties in medical report generation, improving reliability and performance.
Details
Motivation: The clinical deployment of automated MRG is hindered by uncertainties like noisy annotations, biased label distributions, and unverified historical reports, limiting reliability.
Method: SURE-Med uses a Frontal-Aware View Repair Resampling module, Token Sensitive Learning objective, and Contextual Evidence Filter to address uncertainties.
Result: SURE-Med achieves state-of-the-art performance on MIMIC-CXR and IU-Xray benchmarks, enhancing sensitivity to rare conditions and reducing hallucinations.
Conclusion: SURE-Med sets a new benchmark for reliable MRG, advancing trustworthy clinical decision support by holistically reducing uncertainty.
Abstract: Automated medical report generation (MRG) holds great promise for reducing the heavy workload of radiologists. However, its clinical deployment is hindered by three major sources of uncertainty. First, visual uncertainty, caused by noisy or incorrect view annotations, compromises feature extraction. Second, label distribution uncertainty, stemming from long-tailed disease prevalence, biases models against rare but clinically critical conditions. Third, contextual uncertainty, introduced by unverified historical reports, often leads to factual hallucinations. These challenges collectively limit the reliability and clinical trustworthiness of MRG systems. To address these issues, we propose SURE-Med, a unified framework that systematically reduces uncertainty across three critical dimensions: visual, distributional, and contextual. To mitigate visual uncertainty, a Frontal-Aware View Repair Resampling module corrects view annotation errors and adaptively selects informative features from supplementary views. To tackle label distribution uncertainty, we introduce a Token Sensitive Learning objective that enhances the modeling of critical diagnostic sentences while reweighting underrepresented diagnostic terms, thereby improving sensitivity to infrequent conditions. To reduce contextual uncertainty, our Contextual Evidence Filter validates and selectively incorporates prior information that aligns with the current image, effectively suppressing hallucinations. Extensive experiments on the MIMIC-CXR and IU-Xray benchmarks demonstrate that SURE-Med achieves state-of-the-art performance. By holistically reducing uncertainty across multiple input modalities, SURE-Med sets a new benchmark for reliability in medical report generation and offers a robust step toward trustworthy clinical decision support.
[598] DeepVIS: Bridging Natural Language and Data Visualization Through Step-wise Reasoning
Zhihao Shuai, Boyan Li, Siyu Yan, Yuyu Luo, Weikai Yang
Main category: cs.AI
TL;DR: The paper proposes integrating Chain-of-Thought (CoT) reasoning into the NL2VIS pipeline to improve transparency and quality in automated visualization generation.
Details
Motivation: Existing NL2VIS methods lack transparency, preventing users from understanding design rationales or refining outputs.
Method: 1. Design a CoT reasoning process for NL2VIS. 2. Create nvBench-CoT, a dataset with structured reasoning steps. 3. Develop DeepVIS, an interactive interface for inspecting and adjusting reasoning steps.
Result: Benchmark evaluations, use cases, and a user study show the CoT framework enhances NL2VIS quality and provides insightful reasoning steps.
Conclusion: The CoT framework improves NL2VIS by making the reasoning process transparent and interactive, aiding users in refining visualizations.
Abstract: Although data visualization is powerful for revealing patterns and communicating insights, creating effective visualizations requires familiarity with authoring tools and often disrupts the analysis flow. While large language models show promise for automatically converting analysis intent into visualizations, existing methods function as black boxes without transparent reasoning processes, which prevents users from understanding design rationales and refining suboptimal outputs. To bridge this gap, we propose integrating Chain-of-Thought (CoT) reasoning into the Natural Language to Visualization (NL2VIS) pipeline. First, we design a comprehensive CoT reasoning process for NL2VIS and develop an automatic pipeline to equip existing datasets with structured reasoning steps. Second, we introduce nvBench-CoT, a specialized dataset capturing detailed step-by-step reasoning from ambiguous natural language descriptions to finalized visualizations, which enables state-of-the-art performance when used for model fine-tuning. Third, we develop DeepVIS, an interactive visual interface that tightly integrates with the CoT reasoning process, allowing users to inspect reasoning steps, identify errors, and make targeted adjustments to improve visualization outcomes. Quantitative benchmark evaluations, two use cases, and a user study collectively demonstrate that our CoT framework effectively enhances NL2VIS quality while providing insightful reasoning steps to users.
[599] ReflecSched: Solving Dynamic Flexible Job-Shop Scheduling via LLM-Powered Hierarchical Reflection
Shijie Cao, Yuan Yuan
Main category: cs.AI
TL;DR: ReflecSched enhances LLMs for DFJSP by using strategic analysis and heuristic-driven simulations, outperforming baselines and heuristics.
Details
Motivation: Address LLMs' suboptimal direct application in DFJSP due to long-context paradox, underutilized heuristics, and myopic decisions.
Method: Proposes ReflecSched, a framework where LLMs analyze heuristic simulations to create “Strategic Experience” summaries for better decision-making.
Result: ReflecSched outperforms baselines (71.35% Win Rate, 2.755% RPD reduction) and matches tailored heuristics.
Conclusion: ReflecSched effectively mitigates LLM pitfalls and improves scheduling performance in DFJSP.
Abstract: Dynamic Flexible Job-Shop Scheduling (DFJSP) is an NP-hard problem challenged by real-time event adaptation and complex machine routing. While traditional dispatching rules are efficient but rigid, deep learning approaches are opaque and require intricate feature engineering. Large Language Models (LLMs) promise adaptive reasoning without this engineering overhead, yet we find their direct application is suboptimal. Baseline LLMs suffer from three key pitfalls: the long-context paradox, where crucial data is underutilized; an underutilization of expert heuristics; and myopic decision-making. To address this, we propose ReflecSched, a framework that empowers the LLM beyond a direct scheduler by equipping it with a strategic analysis capability. ReflecSched tasks the LLM to analyze heuristic-driven simulations across multiple planning horizons and distill them into a concise, natural-language summary termed “Strategic Experience”. This summary is then integrated into the prompt of a final decision-making module, guiding it to produce non-myopic actions. Experiments show that ReflecSched not only statistically significantly outperforms direct LLM baselines, securing a 71.35% Win Rate and a 2.755% Relative Percentage Deviation reduction, but also surpasses the performance of all individual heuristics evaluated, all while demonstrably mitigating the three identified pitfalls. Additionally, ReflecSched performs on par with the best heuristic tailored to each instance across all problem cases.
[600] Bayes-Entropy Collaborative Driven Agents for Research Hypotheses Generation and Optimization
Shiyang Duan, Yuan Tian, Qi Bing, Xiaowei Shao
Main category: cs.AI
TL;DR: The paper introduces HypoAgents, a multi-agent framework combining Bayesian reasoning and information entropy to automate scientific hypothesis generation, validation, and refinement, outperforming benchmarks in quality and reliability.
Details
Motivation: Addressing the challenge of generating novel, feasible, and valuable scientific hypotheses using AI, as current methods lack systematic modeling and feedback mechanisms.
Method: HypoAgents integrates Bayesian reasoning and entropy-driven search across three stages: hypothesis generation (using N-R-F scores), evidence validation (via RAG and Bayes’ theorem), and refinement (targeting high-uncertainty hypotheses).
Result: On the ICLR 2025 dataset, HypoAgents improved hypothesis ELO scores by 116.3 after 12 iterations, surpassing benchmarks by 17.8, while reducing uncertainty (Shannon entropy) by 0.92.
Conclusion: The framework enhances automated scientific discovery by providing interpretable, high-quality hypotheses, advancing AI’s role in research.
Abstract: The exponential growth of scientific knowledge has made the automated generation of scientific hypotheses that combine novelty, feasibility, and research value a core challenge. Existing methods based on large language models fail to systematically model the uncertainty inherent in hypotheses or incorporate the closed-loop feedback mechanisms crucial for refinement. This paper proposes a multi-agent collaborative framework called HypoAgents, which for the first time integrates Bayesian reasoning with an information entropy-driven search mechanism across three stages - hypothesis generation, evidence validation, and hypothesis refinement - to construct an iterative closed loop simulating scientists’ cognitive processes. Specifically, the framework first generates an initial set of hypotheses through diversity sampling and establishes prior beliefs based on a composite novelty-relevance-feasibility (N-R-F) score. It then employs retrieval-augmented generation (RAG) to gather external literature evidence, updating the posterior probabilities of hypotheses using Bayes’ theorem. Finally, it identifies high-uncertainty hypotheses using information entropy $H = -\sum_i p_i \log p_i$ and actively refines them, guiding the iterative optimization of the hypothesis set toward higher quality and confidence. Experimental results on the ICLR 2025 conference real-world research question dataset (100 research questions) show that after 12 optimization iterations, the average ELO score of generated hypotheses improves by 116.3, surpassing the benchmark of real paper abstracts by 17.8, while the framework’s overall uncertainty, as measured by Shannon entropy, decreases significantly by 0.92. This study presents an interpretable probabilistic reasoning framework for automated scientific discovery, substantially improving the quality and reliability of machine-generated research hypotheses.
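The core update is standard and easy to reproduce: posterior beliefs via Bayes' theorem, followed by Shannon entropy $H = -\sum_i p_i \log p_i$ to flag high-uncertainty hypotheses. A self-contained Python sketch (the toy likelihoods are invented):

```python
import math

def bayes_update(prior, likelihood):
    """Posterior over hypotheses given per-hypothesis evidence likelihoods."""
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def shannon_entropy(probs):
    """H = -sum_i p_i log p_i; high values mark hypotheses worth refining."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Toy run: three hypotheses with a uniform prior; evidence favours the second.
prior = [1 / 3] * 3
posterior = bayes_update(prior, likelihood=[0.2, 0.7, 0.1])
print(posterior, shannon_entropy(posterior))
```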
[601] Implementing Cumulative Functions with Generalized Cumulative Constraints
Pierre Schaus, Charles Thomas, Roger Kameugne
Main category: cs.AI
TL;DR: The paper introduces a generic global constraint, Generalized Cumulative, and a novel time-table filtering algorithm for modeling scheduling problems with conditional time intervals, addressing gaps in open-source solvers.
Details
Motivation: Existing open-source solvers lack support for modeling scheduling problems with conditional time intervals and cumulative functions, and practical implementation details are undocumented.
Method: The authors implement a single, generic global constraint (Generalized Cumulative) and introduce a time-table filtering algorithm for tasks on conditional time-intervals.
Result: The approach performs competitively with existing solvers, supports producer-consumer scheduling, and scales well to large problems.
Conclusion: The proposed method effectively bridges the gap in open-source solvers for complex scheduling problems.
Abstract: Modeling scheduling problems with conditional time intervals and cumulative functions has become a common approach when using modern commercial constraint programming solvers. This paradigm enables the modeling of a wide range of scheduling problems, including those involving producers and consumers. However, it is unavailable in existing open-source solvers and practical implementation details remain undocumented. In this work, we present an implementation of this modeling approach using a single, generic global constraint called the Generalized Cumulative. We also introduce a novel time-table filtering algorithm designed to handle tasks defined on conditional time-intervals. Experimental results demonstrate that this approach, combined with the new filtering algorithm, performs competitively with existing solvers enabling the modeling of producer and consumer scheduling problems and effectively scales to large problems.
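Time-table filtering rests on the resource profile built from tasks' compulsory parts, which is compact enough to sketch. The tuple layout and overload check below are a simplified stand-in for the paper's Generalized Cumulative propagator, which additionally handles optional (conditional) tasks and negative demands from producers.

```python
def timetable_profile(tasks, capacity):
    """Build the time-table profile from compulsory parts (sketch).

    Each task is (earliest_start, latest_start, duration, demand); its
    compulsory part is [latest_start, earliest_start + duration) when
    that interval is non-empty. Returns (time, height) change points and
    raises if the accumulated height ever exceeds the capacity.
    """
    events = []
    for est, lst, dur, dem in tasks:
        cp_start, cp_end = lst, est + dur
        if cp_start < cp_end:                      # task has a compulsory part
            events += [(cp_start, dem), (cp_end, -dem)]
    profile, height = [], 0
    for t, delta in sorted(events):                # releases sort before additions
        height += delta
        if height > capacity:
            raise ValueError(f"resource overload at time {t}")
        profile.append((t, height))
    return profile

# Toy usage: two unit-demand tasks overlapping on a capacity-2 resource.
print(timetable_profile([(0, 2, 5, 1), (1, 3, 4, 1)], capacity=2))
```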
[602] Reasoning Systems as Structured Processes: Foundations, Failures, and Formal Criteria
Saleh Nikooroo, Thomas Engel
Main category: cs.AI
TL;DR: A formal framework for reasoning systems is introduced, unifying logical, algorithmic, and learning-based processes, with criteria for evaluation and failure modes.
Details
Motivation: To create a foundational structure for analyzing and comparing reasoning systems across domains, especially under constraints like failure or adaptation.
Method: Model reasoning systems as structured tuples with components like phenomena, explanation space, and inference maps, while remaining agnostic to specific algorithms.
Result: The framework accommodates diverse reasoning processes, identifies failure modes, and supports dynamic behaviors like iterative refinement.
Conclusion: The work aims to enable future theoretical and practical research into reasoning systems without proposing specific solutions.
Abstract: This paper outlines a general formal framework for reasoning systems, intended to support future analysis of inference architectures across domains. We model reasoning systems as structured tuples comprising phenomena, explanation space, inference and generation maps, and a principle base. The formulation accommodates logical, algorithmic, and learning-based reasoning processes within a unified structural schema, while remaining agnostic to any specific reasoning algorithm or logic system. We survey basic internal criteria, including coherence, soundness, and completeness, and catalog typical failure modes such as contradiction, incompleteness, and non-convergence. The framework also admits dynamic behaviors like iterative refinement and principle evolution. The goal of this work is to establish a foundational structure for representing and comparing reasoning systems, particularly in contexts where internal failure, adaptation, or fragmentation may arise. No specific solution architecture is proposed; instead, we aim to support future theoretical and practical investigations into reasoning under structural constraint.
[603] Uncertainty-Based Methods for Automated Process Reward Data Construction and Output Aggregation in Mathematical Reasoning
Jiuzhou Han, Wray Buntine, Ehsan Shareghi
Main category: cs.AI
TL;DR: Proposes an uncertainty-driven framework for automated process reward data construction and introduces two uncertainty-aware output aggregation methods to improve PRMs for mathematical reasoning.
Details
Motivation: Existing methods for constructing process reward data for PRMs are labor-intensive or inefficient, limiting their effectiveness.
Method: Introduces an uncertainty-driven framework for automated PRM data construction and two uncertainty-aware aggregation methods: Hybrid Majority Reward Vote and Weighted Reward Frequency Vote.
Result: Experiments on ProcessBench, MATH, and GSMPlus show the framework’s effectiveness and the aggregation methods’ ability to enhance PRMs.
Conclusion: The proposed methods improve PRM data quality and reasoning abilities, with code and data made publicly available.
Abstract: Large language models have demonstrated remarkable capabilities in complex mathematical reasoning tasks, but they inevitably generate errors throughout multi-step solutions. Process-level Reward Models (PRMs) have shown great promise by providing supervision and evaluation at each intermediate step, thereby effectively improving the models’ reasoning abilities. However, training effective PRMs requires high-quality process reward data, yet existing methods for constructing such data are often labour-intensive or inefficient. In this paper, we propose an uncertainty-driven framework for automated process reward data construction, encompassing both data generation and annotation processes for PRMs. Additionally, we identify the limitations of both majority vote and PRMs, and introduce two generic uncertainty-aware output aggregation methods: Hybrid Majority Reward Vote and Weighted Reward Frequency Vote, which combine the strengths of majority vote with PRMs. Extensive experiments on ProcessBench, MATH, and GSMPlus show the effectiveness and efficiency of the proposed PRM data construction framework, and demonstrate that the two output aggregation methods further improve the mathematical reasoning abilities across diverse PRMs. The code and data will be publicly available at https://github.com/Jiuzhouh/UnPRM.
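To give a flavor of how PRM scores can be folded into voting, here is a minimal sketch of a reward-weighted frequency vote; the paper's exact weighting may differ, so treat the sum-of-rewards rule below as an assumption.

```python
from collections import defaultdict

def weighted_reward_frequency_vote(candidates):
    """Aggregate sampled solutions by frequency weighted with PRM rewards (sketch).

    candidates: list of (final_answer, prm_score) pairs from sampled
    reasoning chains. Each distinct answer is scored by the sum of its
    chains' rewards, so both frequency and reward accumulate.
    """
    scores = defaultdict(float)
    for answer, reward in candidates:
        scores[answer] += reward
    return max(scores, key=scores.get)

# Toy example: "42" appears twice with decent rewards and beats a single "41".
print(weighted_reward_frequency_vote([("42", 0.8), ("41", 0.9), ("42", 0.6)]))
```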
[604] LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?
Guozhao Mo, Wenliang Zhong, Jiawei Chen, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun
Main category: cs.AI
TL;DR: LiveMCPBench is a new benchmark for evaluating LLM agents in large-scale MCP environments, featuring 95 tasks, 70 servers, and 527 tools. It includes LiveMCPTool for deployment and LiveMCPEval for automated assessment, achieving 81% human agreement. The best model (Claude-Sonnet-4) scored 78.95%, but performance varies widely.
Details
Motivation: Existing MCP benchmarks are limited to single-server settings, failing to assess agent capabilities in large-scale, real-world scenarios.
Method: Developed LiveMCPBench with 95 tasks, LiveMCPTool (70 servers, 527 tools), and LiveMCPEval (LLM-as-a-Judge framework). Introduced MCP Copilot Agent for dynamic planning.
Result: Best model (Claude-Sonnet-4) achieved 78.95% success, but performance varied significantly across models. LiveMCPEval agreed with humans 81% of the time.
Conclusion: LiveMCPBench provides a unified framework for scalable, reproducible evaluation of LLM agents in realistic, tool-rich MCP environments.
Abstract: With the rapid development of Model Context Protocol (MCP), the number of MCP servers has surpassed 10,000. However, existing MCP benchmarks are limited to single-server settings with only a few tools, hindering effective evaluation of agent capabilities in large-scale, real-world scenarios. To address this limitation, we present LiveMCPBench, the first comprehensive benchmark comprising 95 real-world tasks grounded in the MCP ecosystem, designed to evaluate LLM agents at scale across diverse servers. To support a scalable and reproducible evaluation pipeline in large-scale MCP environments, we curate LiveMCPTool, a diverse and readily deployable collection of 70 MCP servers and 527 tools. Furthermore, we introduce LiveMCPEval, an LLM-as-a-Judge framework that enables automated and adaptive evaluation in dynamic, time-varying task environments, achieving 81% agreement with human reviewers. Finally, we propose the MCP Copilot Agent, a multi-step agent that routes tools for dynamic planning and executes tools for API interaction across the entire LiveMCPTool suite. Our evaluation covers 10 leading models, with the best-performing model (Claude-Sonnet-4) reaching a 78.95% success rate. However, we observe large performance variance across models, and several widely-used models perform poorly in LiveMCPBench’s complex, tool-rich environments. Overall, LiveMCPBench offers the first unified framework for benchmarking LLM agents in realistic, tool-rich, and dynamic MCP environments, laying a solid foundation for scalable and reproducible research on agent capabilities. Our code and data will be publicly available at https://icip-cas.github.io/LiveMCPBench.
[605] CloudAnoAgent: Anomaly Detection for Cloud Sites via LLM Agent with Neuro-Symbolic Mechanism
Xinkai Zou, Xuan Jiang, Ruikai Huang, Haoze He, Parv Kapoor, Jiahua Zhao
Main category: cs.AI
TL;DR: CloudAnoAgent, a neuro-symbolic LLM-based agent, improves anomaly detection in cloud environments by integrating metrics and log data, reducing false positives and enhancing accuracy.
Details
Motivation: Existing anomaly detection methods suffer from high false positive rates due to data imbalance, prompting the need for a more accurate and interpretable solution.
Method: CloudAnoAgent processes structured metrics and textual log data jointly, using symbolic verification to validate hypotheses and generate structured reports.
Result: The method improves anomaly classification accuracy by 46.36% and reduces false positives by 36.67% compared to traditional baselines.
Conclusion: CloudAnoAgent enhances detection accuracy, reduces false positives, and improves interpretability, making it practical for enterprise cloud environments.
Abstract: Anomaly detection in cloud sites remains a critical yet challenging task. Existing approaches that rely solely on metric data often suffer from high false positive rates (FPR) due to data imbalance between normal and anomalous events, leading to significant operational overhead for site reliability engineers. Recent advances in large language models (LLMs) offer new opportunities for integrating metrics with log data, enabling more accurate and interpretable anomaly detection. In this paper, we propose CloudAnoAgent, the first neuro-symbolic LLM-based agent for anomaly detection in cloud environments. CloudAnoAgent jointly processes structured metrics and textual log data in a unified pipeline, leveraging symbolic verification to validate detection hypotheses and generate structured anomaly reports. To support systematic evaluation, we introduce CloudAnoBench, the first benchmark that provides LLM-generated paired metrics and log data with fine-grained anomaly behavior annotations, filling a critical gap in existing datasets. Experimental results demonstrate that CloudAnoAgent improves anomaly classification accuracy by 46.36% and 36.67% on average and reduces the FPR by 36.67% and 33.89% on average over traditional baselines and the LLM-only baseline, with a boost in anomaly type detection accuracy of 12.8% compared to vanilla LLM prompting. These results demonstrate the strengths of our approach in improving detection accuracy, reducing false positives, and enhancing interpretability, thereby supporting practical deployment in enterprise cloud environments.
[606] ProKG-Dial: Progressive Multi-Turn Dialogue Construction with Domain Knowledge Graphs
Yuanyuan Liang, Xiaoman Wang, Tingyu Xie, Lei Pan
Main category: cs.AI
TL;DR: ProKG-Dial is a framework for creating domain-specific multi-turn dialogue datasets using knowledge graphs, improving dialogue quality and domain coverage.
Details
Motivation: Existing methods for building domain-specific dialogue datasets are resource-intensive or lack quality and coverage.
Method: ProKG-Dial partitions knowledge graphs into subgraphs, generates Q&A dialogues incrementally, and filters for quality.
Result: The framework improves dialogue diversity, coherence, and entity coverage, enhancing LLM performance in domain-specific tasks.
Conclusion: ProKG-Dial is effective for constructing high-quality domain-specific dialogue datasets, validated by metrics and human evaluations.
Abstract: Current large language models (LLMs) excel at general NLP tasks but often lack domain-specific precision in professional settings. Building a high-quality domain-specific multi-turn dialogue dataset is essential for developing specialized conversational systems. However, existing methods such as manual annotation, simulated human-LLM interactions, and role-based LLM dialogues are resource-intensive or suffer from limitations in dialogue quality and domain coverage. To address these challenges, we introduce ProKG-Dial, a progressive framework for constructing knowledge-intensive multi-turn dialogue datasets using domain-specific knowledge graphs (KGs). ProKG-Dial leverages the structured nature of KGs to encode complex domain knowledge and relationships, providing a solid foundation for generating meaningful and coherent dialogues. Specifically, ProKG-Dial begins by applying community detection to partition the KG into semantically cohesive subgraphs. For each subgraph, the framework incrementally generates a series of questions and answers centered around a target entity, ensuring relevance and coverage. A rigorous filtering step is employed to maintain high dialogue quality. We validate ProKG-Dial on a medical knowledge graph by evaluating the generated dialogues in terms of diversity, semantic coherence, and entity coverage. Furthermore, we fine-tune a base LLM on the resulting dataset and benchmark it against several baselines. Both automatic metrics and human evaluations demonstrate that ProKG-Dial substantially improves dialogue quality and domain-specific performance, highlighting its effectiveness and practical utility.
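The first step of ProKG-Dial, partitioning the KG into semantically cohesive subgraphs, can be sketched with off-the-shelf community detection; the paper does not name the algorithm, so the Louvain choice (and the triple input format) below is an assumption.

```python
import networkx as nx

def partition_kg(triples):
    """Partition a knowledge graph into cohesive subgraphs (sketch).

    triples: iterable of (head, relation, tail) facts; relations are
    dropped and entities become nodes of an undirected graph.
    Returns one subgraph per detected community.
    """
    g = nx.Graph()
    for h, _r, t in triples:
        g.add_edge(h, t)
    communities = nx.community.louvain_communities(g, seed=0)
    return [g.subgraph(c).copy() for c in communities]

# Toy usage: two loosely connected clusters yield two subgraphs.
kg = [("aspirin", "treats", "pain"), ("aspirin", "is_a", "NSAID"),
      ("insulin", "treats", "diabetes"), ("insulin", "is_a", "hormone")]
print([sorted(sg.nodes) for sg in partition_kg(kg)])
```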
[607] Multi-turn Natural Language to Graph Query Language Translation
Yuanyuan Liang, Lei Pan, Tingyu Xie, Yunshi Lan, Weining Qian
Main category: cs.AI
TL;DR: The paper addresses the gap in multi-turn NL2GQL research by proposing an automated dataset construction method using LLMs and introducing the MTGQL dataset for financial market graphs.
Details
Motivation: Existing NL2GQL methods focus on single-turn transformations, but real-world applications require multi-turn, context-dependent interactions, which current approaches fail to handle effectively.
Method: An automated method using Large Language Models (LLMs) to construct multi-turn NL2GQL datasets, applied to create the MTGQL dataset from a financial market graph database.
Result: The MTGQL dataset is developed and will be publicly released. Three baseline methods are proposed to evaluate multi-turn NL2GQL translation.
Conclusion: The work provides a foundation for future research in multi-turn NL2GQL by addressing dataset scarcity and proposing evaluation baselines.
Abstract: In recent years, research on transforming natural language into graph query language (NL2GQL) has been increasing. Most existing methods focus on single-turn transformation from NL to GQL. In practical applications, user interactions with graph databases are typically multi-turn, dynamic, and context-dependent. While single-turn methods can handle straightforward queries, more complex scenarios often require users to iteratively adjust their queries, investigate the connections between entities, or request additional details across multiple dialogue turns. Research focused on single-turn conversion fails to effectively address multi-turn dialogues and complex context dependencies. Additionally, the scarcity of high-quality multi-turn NL2GQL datasets further hinders the progress of this field. To address this challenge, we propose an automated method for constructing multi-turn NL2GQL datasets based on Large Language Models (LLMs), and apply this method to develop the MTGQL dataset, which is constructed from a financial market graph database and will be publicly released for future research. Moreover, we propose three types of baseline methods to assess the effectiveness of multi-turn NL2GQL translation, thereby laying a solid foundation for future research.
[608] Dynamic Context Adaptation for Consistent Role-Playing Agents with Retrieval-Augmented Generations
Jeiyoon Park, Yongshin Han, Minseop Kim, Kisu Yang
Main category: cs.AI
TL;DR: AMADEUS framework with ACTS, GS, and AE improves persona consistency in RAG-based RPAs, demonstrated on CharacterRAG dataset.
Details
Motivation: To enhance persona consistency in role-playing agents (RPAs) by addressing out-of-knowledge questions and modeling character attributes.
Method: Uses Adaptive Context-aware Text Splitter (ACTS) for optimal chunking, Guided Selection (GS) for retrieval, and Attribute Extractor (AE) for attribute identification.
Result: Effective modeling of character knowledge and attributes (e.g., personality) on the CharacterRAG dataset (15 fictional characters, persona documents totaling 976K written characters, and 450 QA pairs).
Conclusion: AMADEUS robustly maintains persona consistency and models diverse character attributes, advancing RAG-based RPAs.
Abstract: We propose AMADEUS, which is composed of Adaptive Context-aware Text Splitter (ACTS), Guided Selection (GS), and Attribute Extractor (AE). ACTS finds an optimal chunk length and hierarchical contexts for each character. AE identifies a character’s general attributes from the chunks retrieved by GS and uses these attributes as a final context to maintain robust persona consistency even when answering out of knowledge questions. To facilitate the development and evaluation of RAG-based RPAs, we construct CharacterRAG, a role-playing dataset that consists of persona documents for 15 distinct fictional characters totaling 976K written characters, and 450 question and answer pairs. We find that our framework effectively models not only the knowledge possessed by characters, but also various attributes such as personality.
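For orientation, here is a rough Python sketch of the retrieval flow the abstract outlines: chunk a persona document, retrieve with similarity ranking, and extract general attributes as fallback context. The fixed-length chunking, dot-product scoring, and `embed`/`ask_llm` helpers are simplified stand-ins for ACTS, GS, and the model calls.

```python
def answer_as_character(question, persona_doc, embed, ask_llm,
                        chunk_len=512, top_k=4):
    # ACTS stand-in: fixed-length chunks (the real splitter adapts chunk size).
    chunks = [persona_doc[i:i + chunk_len]
              for i in range(0, len(persona_doc), chunk_len)]
    # GS stand-in: rank chunks by dot-product similarity to the question.
    q = embed(question)
    context = sorted(chunks, key=lambda c: float(q @ embed(c)),
                     reverse=True)[:top_k]
    # AE: distill general attributes to fall back on for out-of-knowledge questions.
    attributes = ask_llm(f"List this character's general attributes: {context}")
    return ask_llm(f"Answer in character.\nContext: {context}\n"
                   f"Attributes (use when context is insufficient): {attributes}\n"
                   f"Q: {question}")
```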
[609] TRACEALIGN – Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs
Amitava Das, Vinija Jain, Aman Chadha
Main category: cs.AI
TL;DR: TraceAlign is a framework to trace unsafe LLM completions back to training corpus sources, using the Belief Conflict Index (BCI) and interventions like TraceShield, Contrastive Belief Deconfliction Loss, and Prov-Decode, reducing alignment drift by up to 85%.
Details
Motivation: LLMs fine-tuned for alignment often produce unsafe completions due to adversarial prompts or training inconsistencies, but the root causes in the training corpus are poorly understood.
Method: Introduces BCI to quantify semantic inconsistency, with interventions: TraceShield (inference-time filter), Contrastive Belief Deconfliction Loss (fine-tuning), and Prov-Decode (provenance-aware decoding).
Result: Reduces alignment drift by up to 85% on the Alignment Drift Benchmark (ADB) with minimal utility loss (delta < 0.2) and improved refusal quality.
Conclusion: TraceAlign offers a scalable, traceable toolkit to mitigate alignment failures by addressing root causes in training data, with open-source implementation.
Abstract: Large Language Models (LLMs) fine-tuned to align with human values often exhibit alignment drift, producing unsafe or policy-violating completions when exposed to adversarial prompts, decoding perturbations, or paraphrased jailbreaks. While prior work has behaviorally characterized alignment failure, little is known about the training-time belief sources underlying these failures. We introduce TraceAlign, a unified framework for tracing unsafe completions back to their root causes in the model’s training corpus. Central to our approach is the Belief Conflict Index (BCI), which quantifies semantic inconsistency between generated spans and aligned policies, based on retrieved training documents using suffix-array matching. We propose three complementary interventions: (i) TraceShield, an inference-time safety filter that refuses completions with high-BCI spans, (ii) Contrastive Belief Deconfliction Loss, a contrastive fine-tuning objective penalizing high-BCI continuations during DPO, and (iii) Prov-Decode, a provenance-aware decoding strategy that vetoes beam expansions predicted to yield high-BCI spans. Together, these defenses reduce alignment drift by up to 85% on our curated Alignment Drift Benchmark (ADB) while preserving utility on standard tasks, with delta less than 0.2 and improved refusal quality. We further derive a theoretical upper bound on drift likelihood via suffix-array span statistics, linking memorization frequency and length to adversarial reactivation risk. TraceAlign thus provides the first scalable, traceable, and grounded toolkit for understanding and mitigating alignment failures at source. To encourage further exploration and development, we open-source our implementation at: https://anonymous.4open.science/r/tracealign-2DA7
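A toy sketch of the TraceShield idea follows: find generated spans that match the training corpus verbatim and refuse when a conflict score exceeds a threshold. The real system uses suffix-array matching and the paper's BCI; the quadratic substring search and the `policy_conflict_score` callable below are simplified stand-ins.

```python
def longest_matched_span(completion: str, corpus: list[str], min_len: int = 8):
    """Longest completion substring that appears verbatim in the corpus."""
    best = ""
    for i in range(len(completion)):
        for j in range(i + min_len, len(completion) + 1):
            span = completion[i:j]
            if len(span) > len(best) and any(span in doc for doc in corpus):
                best = span
    return best

def trace_shield(completion, corpus, policy_conflict_score, threshold=0.5):
    """Refuse completions whose matched span conflicts with aligned policy."""
    span = longest_matched_span(completion, corpus)
    bci = policy_conflict_score(span) if span else 0.0  # assumed BCI stand-in
    return "[refused: high-BCI span]" if bci > threshold else completion
```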
[610] Risk identification based on similar case retrieval enhancement
Jiawei Li, Chengye Yang, Yaochen Zhang, Weilin Sun, Lei Meng, Xiangxu Meng
Main category: cs.AI
TL;DR: Proposes a hazard identification method using similar case retrieval enhancement to improve accuracy and generalization in construction site safety management.
Details
Motivation: Existing methods struggle with complex hazard features, high training costs, and poor generalization.
Method: Integrates external knowledge and retrieved case contexts via prompt fine-tuning with three modules: retrieval library, image similarity retrieval, and large model retrieval enhancement.
Result: GLM-4V’s recognition accuracy increased to 50%, a 35.49% boost.
Conclusion: The method enhances accuracy, context understanding, and stability, providing new support for hazard detection.
Abstract: The goal of construction site risk and hazard identification is to enhance safety management through automation. Existing research based on large language models falls into two categories: image-text matching for collaborative reasoning, which struggles with complex hazard features, and instruction fine-tuning or dialogue guidance using professional datasets, which suffers from high training costs and poor generalization. To address this, we propose a hazard identification method using similar case retrieval enhancement. By integrating external knowledge and retrieved case contexts via prompt fine-tuning, we mitigate misjudgments caused by limited domain knowledge and weak feature associations. Our method includes three modules: retrieval library, image similarity retrieval, and large model retrieval enhancement, enabling efficient recognition without training. Experiments on real construction data show significant improvements. For instance, GLM-4V’s recognition accuracy increased to 50%, a 35.49% boost. The method enhances accuracy, context understanding, and stability, offering new theoretical and technical support for hazard detection.
[611] Everyone Contributes! Incentivizing Strategic Cooperation in Multi-LLM Systems via Sequential Public Goods Games
Yunhao Liang, Yuan Qu, Jingyuan Yang, Shaochong Lin, Zuo-Jun Max Shen
Main category: cs.AI
TL;DR: A game-theoretic RL framework (MAC-SPGG) incentivizes cooperation in multi-LLM ensembles, improving performance while reducing costs.
Details
Motivation: Addressing the trade-off between computation costs and collective performance in multi-LLM collaboration.
Method: Introduces MAC-SPGG, a sequential public goods game with redesigned rewards to eliminate free-riding and streamline decision-making.
Result: MAC-SPGG-trained ensembles outperform single-agent baselines and other cooperative methods, matching large-scale models in various tasks.
Conclusion: Structured, incentive-aligned cooperation in MAC-SPGG enables scalable and robust multi-agent language generation.
Abstract: Coordinating multiple large language models (LLMs) to solve complex tasks collaboratively poses a fundamental trade-off between the computation costs and collective performance compared with individual model. We introduce a novel, game-theoretically grounded reinforcement learning (RL) framework, the Multi-Agent Cooperation Sequential Public Goods Game (MAC-SPGG), to systematically incentivize cooperation in multi-LLM ensembles. In MAC-SPGG, LLM agents move in sequence, observing predecessors’ outputs and updating beliefs to condition their own contributions. By redesigning the public-goods reward, effortful contributions become the unique Subgame Perfect Nash Equilibrium (SPNE), which eliminates free-riding under traditional SPGG or PGG. Its sequential protocol replaces costly round-based information exchanges with a streamlined decision flow, cutting communication overhead while retaining strategic depth. We prove the existence and uniqueness of the SPNE under realistic parameters, and empirically show that MAC-SPGG-trained ensembles outperform single-agent baselines, chain-of-thought prompting, and other cooperative methods, even achieving comparable performance to large-scale models across reasoning, math, code generation, and NLP tasks. Our results highlight the power of structured, incentive-aligned MAC-SPGG cooperation for scalable and robust multi-agent language generation.
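As background for why a redesigned reward is needed, the snippet below simulates the standard public goods payoff (keep your endowment minus your contribution, plus an equal share of the multiplied pool), where free-riding dominates whenever the multiplier r is below the group size n. This illustrates only the baseline failure mode; the paper's MAC-SPGG reward is not reproduced here.

```python
def pgg_payoffs(contributions, endowment=1.0, r=1.5):
    """Standard public goods payoff: keep what you don't contribute,
    plus an equal share of the multiplied pool."""
    n = len(contributions)
    share = r * sum(contributions) / n
    return [endowment - c + share for c in contributions]

print(pgg_payoffs([1.0, 1.0, 1.0]))  # all cooperate: 1.5 each
print(pgg_payoffs([0.0, 1.0, 1.0]))  # a free-rider earns 2.0 > 1.5
```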
[612] SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents
Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Daxin Jiang, Binxing Jiao, Chen Hu, Huacan Wang
Main category: cs.AI
TL;DR: SE-Agent is a self-evolution framework for LLM-based agents that optimizes reasoning processes by revising, recombining, and refining trajectories, achieving up to 55% relative improvement on SWE-bench.
Details
Motivation: Current LLM-based agents lack efficient exploitation of problem-solving trajectories and suffer from redundant reasoning due to limited search space diversity.
Method: SE-Agent uses revision, recombination, and refinement of trajectories to expand the search space and leverage cross-trajectory inspiration.
Result: SE-Agent achieves up to 55% relative improvement on SWE-bench, outperforming other open-source agents.
Conclusion: SE-Agent’s evolutionary mechanism enhances reasoning quality and performance, setting a new state-of-the-art for LLM-based agents.
Abstract: Large Language Model (LLM)-based agents have recently shown impressive capabilities in complex reasoning and tool use via multi-step interactions with their environments. While these agents have the potential to tackle complicated tasks, their problem-solving process, i.e., agents’ interaction trajectory leading to task completion, remains underexploited. These trajectories contain rich feedback that can navigate agents toward the right directions for solving problems correctly. Although prevailing approaches, such as Monte Carlo Tree Search (MCTS), can effectively balance exploration and exploitation, they ignore the interdependence among various trajectories and lack the diversity of search spaces, which leads to redundant reasoning and suboptimal outcomes. To address these challenges, we propose SE-Agent, a Self-Evolution framework that enables Agents to optimize their reasoning processes iteratively. Our approach revisits and enhances former pilot trajectories through three key operations: revision, recombination, and refinement. This evolutionary mechanism enables two critical advantages: (1) it expands the search space beyond local optima by intelligently exploring diverse solution paths guided by previous trajectories, and (2) it leverages cross-trajectory inspiration to efficiently enhance performance while mitigating the impact of suboptimal reasoning paths. Through these mechanisms, SE-Agent achieves continuous self-evolution that incrementally improves reasoning quality. We evaluate SE-Agent on SWE-bench Verified to resolve real-world GitHub issues. Experimental results across five strong LLMs show that integrating SE-Agent delivers up to 55% relative improvement, achieving state-of-the-art performance among all open-source agents on SWE-bench Verified. Our code and demonstration materials are publicly available at https://github.com/wanghuacan/SE-Agent.
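The three operations read naturally as an evolutionary loop. The sketch below shows one plausible shape of it; the `revise`, `recombine`, `refine`, and `score` operators stand in for LLM calls and are assumptions, not the released code.

```python
def evolve(trajectories, revise, recombine, refine, score, iters=3, keep=5):
    """Iteratively revise, recombine, and refine a pool of trajectories."""
    pool = list(trajectories)
    for _ in range(iters):
        revised = [revise(t) for t in pool]                  # per-trajectory edits
        crossed = [recombine(a, b) for a, b in zip(pool, revised)]
        pool = [refine(t) for t in pool + revised + crossed]
        pool = sorted(pool, key=score, reverse=True)[:keep]  # retain the best
    return pool[0]
```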
[613] “Stack It Up!”: 3D Stable Structure Generation from 2D Hand-drawn Sketch
Yiqing Xu, Linfeng Li, Cunjun Yu, David Hsu
Main category: cs.AI
TL;DR: StackItUp enables non-experts to create 3D structures from 2D sketches using an abstract relation graph and compositional diffusion models, outperforming baselines in stability and visual resemblance.
Details
Motivation: Current robot manipulation systems require precise 3D poses, limiting accessibility for non-experts. StackItUp aims to bridge this gap by allowing complex 3D structures to be specified via simple 2D sketches.
Method: StackItUp uses an abstract relation graph to interpret sketches, capturing geometric relations and stability patterns. It grounds the graph to 3D poses with compositional diffusion models and iteratively predicts hidden supports for stability.
Result: The system produces stable, multilevel 3D structures from sketches, outperforming baselines in stability and visual resemblance.
Conclusion: StackItUp successfully democratizes 3D structure creation by enabling non-experts to use simple sketches, achieving robust results without expert tools.
Abstract: Imagine a child sketching the Eiffel Tower and asking a robot to bring it to life. Today’s robot manipulation systems can’t act on such sketches directly; they require precise 3D block poses as goals, which in turn demand structural analysis and expert tools like CAD. We present StackItUp, a system that enables non-experts to specify complex 3D structures using only 2D front-view hand-drawn sketches. StackItUp introduces an abstract relation graph to bridge the gap between rough sketches and accurate 3D block arrangements, capturing the symbolic geometric relations (e.g., left-of) and stability patterns (e.g., two-pillar-bridge) while discarding noisy metric details from sketches. It then grounds this graph to 3D poses using compositional diffusion models and iteratively updates it by predicting hidden internal and rear supports, which are critical for stability but absent from the sketch. Evaluated on sketches of iconic landmarks and modern house designs, StackItUp consistently produces stable, multilevel 3D structures and outperforms all baselines in both stability and visual resemblance.
[614] Attractive Metadata Attack: Inducing LLM Agents to Invoke Malicious Tools
Kanghua Mo, Li Hu, Yucheng Long, Zhihao Li
Main category: cs.AI
TL;DR: The paper introduces the Attractive Metadata Attack (AMA), a method to manipulate LLM agents by altering tool metadata, achieving high success rates without prompt injection or model access.
Details
Motivation: To expose vulnerabilities in LLM agents' tool-centric paradigm, where adversaries can exploit metadata manipulation to influence agent behavior stealthily.
Method: Proposes AMA, a black-box in-context learning framework that optimizes tool metadata to attract LLM agents, tested across ten scenarios.
Result: Achieves 81%-95% attack success rates, causing privacy leaks with minimal task disruption, even bypassing prompt-level defenses.
Conclusion: Metadata manipulation is a potent attack surface, necessitating execution-level security measures beyond current defenses.
Abstract: Large language model (LLM) agents have demonstrated remarkable capabilities in complex reasoning and decision-making by leveraging external tools. However, this tool-centric paradigm introduces a previously underexplored attack surface: adversaries can manipulate tool metadata – such as names, descriptions, and parameter schemas – to influence agent behavior. We identify this as a new and stealthy threat surface that allows malicious tools to be preferentially selected by LLM agents, without requiring prompt injection or access to model internals. To demonstrate and exploit this vulnerability, we propose the Attractive Metadata Attack (AMA), a black-box in-context learning framework that generates highly attractive but syntactically and semantically valid tool metadata through iterative optimization. Our attack integrates seamlessly into standard tool ecosystems and requires no modification to the agent’s execution framework. Extensive experiments across ten realistic, simulated tool-use scenarios and a range of popular LLM agents demonstrate consistently high attack success rates (81%-95%) and significant privacy leakage, with negligible impact on primary task execution. Moreover, the attack remains effective even under prompt-level defenses and structured tool-selection protocols such as the Model Context Protocol, revealing systemic vulnerabilities in current agent architectures. These findings reveal that metadata manipulation constitutes a potent and stealthy attack surface, highlighting the need for execution-level security mechanisms that go beyond prompt-level defenses.
[615] Don’t Overthink It: A Survey of Efficient R1-style Large Reasoning Models
Linan Yue, Yichao Du, Yizhi Wang, Weibo Gao, Fangzhou Yao, Li Wang, Ye Liu, Ziyu Xu, Qi Liu, Shimin Di, Min-Ling Zhang
Main category: cs.AI
TL;DR: The paper discusses the rise of Large Reasoning Models (LRMs) like DeepSeek R1, their advantages over traditional LLMs, and the emerging issue of overthinking. It reviews efficient reasoning methods, categorizing them into single-model optimization and model collaboration, and maintains a GitHub repository for tracking progress.
Details
Motivation: To address the problem of overthinking in LRMs, which reduces reasoning efficiency and accuracy, by exploring and categorizing efficient reasoning methods.
Method: The paper systematically reviews efficient reasoning methods, dividing them into two categories: (1) single-model optimization and (2) model collaboration.
Result: A comprehensive categorization of efficient reasoning methods and a public GitHub repository that tracks advancements in the field.
Conclusion: Efficient reasoning methods are important for mitigating overthinking in LRMs, and further research in this direction is encouraged.
Abstract: Recently, Large Reasoning Models (LRMs) have gradually become a research hotspot due to their outstanding performance in handling complex tasks. Among them, DeepSeek R1 has garnered significant attention for its exceptional performance and open-source nature, driving advancements in the research of R1-style LRMs. Unlike traditional Large Language Models (LLMs), these models enhance logical deduction and decision-making capabilities during reasoning by incorporating mechanisms such as long chain-of-thought and self-reflection through reinforcement learning. However, with the widespread application of these models, the problem of overthinking has gradually emerged. Specifically, when generating answers, these models often construct excessively long reasoning chains with redundant or repetitive steps, which leads to reduced reasoning efficiency and may affect the accuracy of the final answer. To this end, various efficient reasoning methods have been proposed, aiming to reduce the length of reasoning paths without compromising model performance and reasoning capability. By systematically reviewing current research advancements in efficient reasoning methods, we categorize existing works into two main directions based on the lens of single-model optimization versus model collaboration: (1) Efficient Reasoning with Single Model, which focuses on improving the reasoning efficiency of individual models; and (2) Efficient Reasoning with Model Collaboration, which explores optimizing reasoning paths through collaboration among multiple models. In addition, we maintain a public GitHub repository that tracks the latest progress in efficient reasoning methods.
[616] Trainable Dynamic Mask Sparse Attention
Jingze Shi, Yifan Wu, Bingheng Wu, Yiran Peng, Liangdong Wang, Guang Liu, Yuyu Luo
Main category: cs.AI
TL;DR: Dynamic Mask Attention (DMA) introduces a trainable sparse attention mechanism to reduce computational complexity while maintaining information fidelity, outperforming existing methods in efficiency and performance.
Details
Motivation: The quadratic complexity of standard self-attention in large language models limits long-context modeling. Existing sparse attention methods suffer from static patterns or information loss.
Method: DMA dynamically generates content-aware sparse masks and implements position-aware sparse attention, skipping unnecessary calculations while focusing on critical information.
Result: DMA outperforms multi-head attention, sliding window attention, and other methods in perplexity and multi-query associative recall tasks, especially in long-context scenarios.
Conclusion: DMA effectively balances computational efficiency and long-context modeling, demonstrating superior performance in benchmarks and challenging tasks.
Abstract: In large language models, the demand for modeling long contexts is constantly increasing, but the quadratic complexity of the standard self-attention mechanism often becomes a bottleneck. Although existing sparse attention mechanisms have improved efficiency, they may still encounter issues such as static patterns or information loss. We introduce a trainable dynamic mask sparse attention mechanism, Dynamic Mask Attention, which effectively utilizes content-aware and position-aware sparsity. DMA achieves this through two key innovations: First, it dynamically generates content-aware sparse masks from value representations, enabling the model to identify and focus on critical information adaptively. Second, it implements position-aware sparse attention computation that effectively skips unnecessary calculation regions. This dual-sparsity design allows the model to significantly reduce the computational complexity of important information while retaining complete information, achieving an excellent balance between information fidelity and computational efficiency. We have verified the performance of DMA through comprehensive experiments. Comparative studies show that DMA outperforms multi-head attention, sliding window attention, multi-head latent attention, and native sparse attention in terms of perplexity under Chinchilla Scaling Law settings. Moreover, in challenging multi-query associative recall tasks, DMA also demonstrates superior performance and efficiency compared to these methods. Crucially, in the evaluation of a 1.7B parameter model, DMA significantly outperforms multi-head attention in both standard benchmark performance and the challenging needle-in-a-haystack task. These experimental results highlight its capability to balance model efficiency and long-context modeling ability effectively.
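A minimal PyTorch rendering of the two ideas, a content-aware mask scored from value representations and computation restricted to the kept positions, might look as follows; the linear scoring head and the fixed top-k budget are assumptions rather than the authors' kernel.

```python
import torch

def dynamic_mask_attention(q, k, v, mask_head, budget):
    """q, k, v: [batch, seq, dim]; mask_head: e.g. torch.nn.Linear(dim, 1)."""
    scores = mask_head(v).squeeze(-1)                  # content-aware importance
    keep = scores.topk(budget, dim=-1).indices         # positions to attend to
    idx = keep.unsqueeze(-1).expand(-1, -1, k.size(-1))
    k_s, v_s = torch.gather(k, 1, idx), torch.gather(v, 1, idx)
    attn = torch.softmax(q @ k_s.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
    return attn @ v_s                                  # [batch, seq, dim]
```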
[617] All Stories Are One Story: Emotional Arc Guided Procedural Game Level Generation
Yunge Wen, Chenliang Huang, Hangyu Zhou, Zhuo Zeng, Chun Ming Louis Po, Julian Togelius, Timothy Merino, Sam Earle
Main category: cs.AI
TL;DR: The paper introduces a framework for procedural game narrative generation using emotional arcs (Rise and Fall) as a structural backbone, enhancing engagement and coherence.
Details
Motivation: To operationalize universal emotional arcs in interactive storytelling for games, leveraging narratological theories and empirical data.
Method: Uses emotional arcs to guide branching story graphs, with nodes populated by characters, items, and gameplay attributes. Implemented in an ARPG prototype with LLMs and adaptive generation.
Result: Emotional arc integration significantly improves engagement, narrative coherence, and emotional impact, as shown by player ratings and sentiment analysis.
Conclusion: Emotionally structured procedural generation holds promise for advancing interactive storytelling in games.
Abstract: The emotional arc is a universal narrative structure underlying stories across cultures and media – an idea central to structuralist narratology, often encapsulated in the phrase “all stories are one story.” We present a framework for procedural game narrative generation that incorporates emotional arcs as a structural backbone for both story progression and gameplay dynamics. Leveraging established narratological theories and large-scale empirical analyses, we focus on two core emotional patterns – Rise and Fall – to guide the generation of branching story graphs. Each story node is automatically populated with characters, items, and gameplay-relevant attributes (e.g., health, attack), with difficulty adjusted according to the emotional trajectory. Implemented in a prototype action role-playing game (ARPG), our system demonstrates how emotional arcs can be operationalized using large language models (LLMs) and adaptive entity generation. Evaluation through player ratings, interviews, and sentiment analysis shows that emotional arc integration significantly enhances engagement, narrative coherence, and emotional impact. These results highlight the potential of emotionally structured procedural generation for advancing interactive storytelling for games.
[618] Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models’ Instruction Following
Qingyu Ren, Qianyu He, Bowei Zhang, Jie Zeng, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu
Main category: cs.AI
TL;DR: A self-supervised RL framework improves instruction-following in reasoning models without external supervision, maintaining reasoning performance.
Details
Motivation: Address the trade-off between reasoning capabilities and instruction-following in models, avoiding reliance on costly external supervision.
Method: Proposes a self-supervised RL framework using internal signals of reasoning models.
Result: Significantly improves instruction-following while preserving reasoning performance.
Conclusion: Offers a scalable, cost-effective solution to enhance instruction-following in reasoning models.
Abstract: Reasoning models excel in complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction-following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models’ own internal signals to improve instruction-following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction-following capabilities while maintaining reasoning performance, offering a scalable and cost-effective approach to enhance instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.
[619] Reconsidering Overthinking: Penalizing Internal and External Redundancy in CoT Reasoning
Jialiang Hong, Taihang Zhen, Kai Chen, Jiaheng Liu, Wenpeng Zhu, Jing Huo, Yang Gao, Depeng Wang, Haitao Wan, Xi Yang, Boyan Wang, Fanyu Meng
Main category: cs.AI
TL;DR: The paper addresses overthinking in Large Reasoning Models (LRMs) by decomposing it into internal and external redundancy, proposing a dual-penalty reinforcement learning framework to mitigate both, improving efficiency and interpretability.
Details
Motivation: Overthinking in LRMs leads to verbose reasoning traces, harming efficiency and interpretability. Prior works focus on reducing response length without analyzing semantic structure.
Method: A dual-penalty reinforcement learning framework: sliding-window semantic analysis for internal redundancy and penalizing continuation beyond the first correct solution (FCS) for external redundancy.
Result: Significantly compresses reasoning traces with minimal accuracy loss, generalizes to out-of-domain tasks, and shows external redundancy can be safely removed while internal redundancy must be cautiously reduced.
Conclusion: The method improves reasoning efficiency and enables semantic-aware control over reasoning length, leading to more concise and interpretable LRMs.
Abstract: Large Reasoning Models (LRMs) often produce excessively verbose reasoning traces, a phenomenon known as overthinking, which hampers both efficiency and interpretability. Prior works primarily address this issue by reducing response length, without fully examining the underlying semantic structure of the reasoning process. In this paper, we revisit overthinking by decomposing it into two distinct forms: internal redundancy, which consists of low-contribution reasoning steps within the first correct solution (FCS), and external redundancy, which refers to unnecessary continuation after the FCS. To mitigate both forms, we propose a dual-penalty reinforcement learning framework. For internal redundancy, we adopt a sliding-window semantic analysis to penalize low-gain reasoning steps that contribute little toward reaching the correct answer. For external redundancy, we penalize its proportion beyond the FCS to encourage earlier termination. Our method significantly compresses reasoning traces with minimal accuracy loss, and generalizes effectively to out-of-domain tasks such as question answering and code generation. Crucially, we find that external redundancy can be safely removed without degrading performance, whereas internal redundancy must be reduced more cautiously to avoid impairing correctness. These findings suggest that our method not only improves reasoning efficiency but also enables implicit, semantic-aware control over Chain-of-Thought length, paving the way for more concise and interpretable LRMs.
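A schematic of the dual-penalty reward could look like this; the `step_gain` scorer, the penalty weights, and the thresholds are assumptions, since the paper's exact formulation is not given in the abstract.

```python
def dual_penalty_reward(steps, fcs_index, step_gain, base_reward=1.0,
                        w_int=0.1, w_ext=0.2, gain_floor=0.05, window=3):
    """Penalize low-gain steps inside the first correct solution (internal)
    and any continuation after it (external)."""
    internal = 0
    for i in range(min(fcs_index + 1, len(steps))):
        context = steps[max(0, i - window):i]            # sliding window
        if step_gain(steps[i], context) < gain_floor:    # low semantic gain
            internal += 1
    external = max(0, len(steps) - fcs_index - 1)        # steps beyond the FCS
    return base_reward - w_int * internal - w_ext * external / max(len(steps), 1)
```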
[620] Neuromorphic Computing with Multi-Frequency Oscillations: A Bio-Inspired Approach to Artificial Intelligence
Boheng Liu, Ziyu Li, Xia Wu
Main category: cs.AI
TL;DR: A brain-inspired tripartite architecture with specialized perceptual, auxiliary, and executive systems, enhanced by temporal dynamics, improves AI flexibility and efficiency, outperforming current methods in accuracy and computation.
Details
Motivation: Addressing the limited flexibility and generalizability of artificial neural networks by mimicking biological cognition's functional specialization and temporal dynamics.
Method: Proposes a tripartite architecture with specialized systems and integrates temporal dynamics via multi-frequency neural oscillation and synaptic adaptation.
Result: Achieves 2.18% higher accuracy, 48.44% fewer computation iterations, and better alignment with human confidence patterns.
Conclusion: The architecture lays a foundation for brain-like intelligence across domains, potentially narrowing the gap between artificial and biological cognition.
Abstract: Despite remarkable capabilities, artificial neural networks exhibit limited flexible, generalizable intelligence. This limitation stems from their fundamental divergence from biological cognition that overlooks both neural regions’ functional specialization and the temporal dynamics critical for coordinating these specialized systems. We propose a tripartite brain-inspired architecture comprising functionally specialized perceptual, auxiliary, and executive systems. Moreover, the integration of temporal dynamics through the simulation of multi-frequency neural oscillation and synaptic dynamic adaptation mechanisms enhances the architecture, thereby enabling more flexible and efficient artificial cognition. Initial evaluations demonstrate superior performance compared to state-of-the-art temporal processing approaches, with 2.18% accuracy improvements while reducing required computation iterations by 48.44%, and achieving higher correlation with human confidence patterns. Though currently demonstrated on visual processing tasks, this architecture establishes a theoretical foundation for brain-like intelligence across cognitive domains, potentially bridging the gap between artificial and biological intelligence.
[621] A Message Passing Realization of Expected Free Energy Minimization
Wouter W. L. Nuijten, Mykola Lukashchuk, Thijs van de Laar, Bert de Vries
Main category: cs.AI
TL;DR: A message passing method for Expected Free Energy (EFE) minimization on factor graphs transforms combinatorial search into tractable inference, outperforming KL-control agents in uncertain environments.
Details
Motivation: To bridge active inference theory with practical implementations by reformulating EFE minimization as a variational problem, enabling efficient policy inference.
Method: Reformulates EFE minimization as Variational Free Energy minimization with epistemic priors, solved via message passing on factor graphs. Applied to factorized state-space models for policy inference.
Result: Agents using this method outperform KL-control agents in stochastic gridworld and partially observable Minigrid tasks, showing robust planning and efficient exploration.
Conclusion: The approach validates the efficiency of epistemic priors in artificial agents, providing a practical implementation of active inference theory.
Abstract: We present a message passing approach to Expected Free Energy (EFE) minimization on factor graphs, based on the theory introduced in arXiv:2504.14898. By reformulating EFE minimization as Variational Free Energy minimization with epistemic priors, we transform a combinatorial search problem into a tractable inference problem solvable through standard variational techniques. Applying our message passing method to factorized state-space models enables efficient policy inference. We evaluate our method on environments with epistemic uncertainty: a stochastic gridworld and a partially observable Minigrid task. Agents using our approach consistently outperform conventional KL-control agents on these tasks, showing more robust planning and efficient exploration under uncertainty. In the stochastic gridworld environment, EFE-minimizing agents avoid risky paths, while in the partially observable minigrid setting, they conduct more systematic information-seeking. This approach bridges active inference theory with practical implementations, providing empirical evidence for the efficiency of epistemic priors in artificial agents.
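For reference, the standard risk-ambiguity decomposition of the expected free energy that such agents minimize is shown below; the paper's specific epistemic-prior reformulation is developed in arXiv:2504.14898.

```latex
G(\pi) = \underbrace{D_{\mathrm{KL}}\!\left[\, q(o \mid \pi) \,\|\, p(o) \,\right]}_{\text{risk}}
       + \underbrace{\mathbb{E}_{q(s \mid \pi)}\!\left[ \mathcal{H}\!\left[ p(o \mid s) \right] \right]}_{\text{ambiguity}}
```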
[622] AirTrafficGen: Configurable Air Traffic Scenario Generation with Large Language Models
Dewi Sid William Gould, George De Ath, Ben Carvell, Nick Pepper
Main category: cs.AI
TL;DR: AirTrafficGen automates ATC scenario generation using LLMs, ensuring operational realism and scalability.
Details
Motivation: Manual ATC scenario design is time-consuming and limits diversity; automation is needed.
Method: Uses graph-based representation for sector topology and LLMs (Gemini 2.5 Pro, OpenAI o3) for scenario generation with fine-grained control.
Result: LLMs generate high-traffic, realistic scenarios and support iterative refinement via textual feedback.
Conclusion: AirTrafficGen offers a scalable solution for ATC training, demonstrating LLMs’ potential in safety-critical planning.
Abstract: The manual design of scenarios for Air Traffic Control (ATC) training is a demanding and time-consuming bottleneck that limits the diversity of simulations available to controllers. To address this, we introduce a novel, end-to-end approach, AirTrafficGen, that leverages large language models (LLMs) to automate and control the generation of complex ATC scenarios. Our method uses a purpose-built, graph-based representation to encode sector topology (including airspace geometry, routes, and fixes) into a format LLMs can process. Through rigorous benchmarking, we show that state-of-the-art models like Gemini 2.5 Pro and OpenAI o3 can generate high-traffic scenarios whilst maintaining operational realism. Our engineered prompting enables fine-grained control over interaction presence, type, and location. Initial findings suggest these models are also capable of iterative refinement, correcting flawed scenarios based on simple textual feedback. This approach provides a scalable alternative to manual scenario design, addressing the need for a greater volume and variety of ATC training and validation simulations. More broadly, this work showcases the potential of LLMs for complex planning in safety-critical domains.
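One plausible way to serialize such a sector graph for an LLM prompt is sketched below; the schema, node attributes, and prompt wording are illustrative assumptions, not the paper's format.

```python
import json
import networkx as nx

# Toy sector: two fixes joined by a route segment with a level band.
sector = nx.DiGraph()
sector.add_node("FIX_A", kind="fix", lat=51.2, lon=-0.5)
sector.add_node("FIX_B", kind="fix", lat=51.6, lon=0.2)
sector.add_edge("FIX_A", "FIX_B", route="UL9", levels=[280, 360])

payload = {
    "nodes": [{"id": n, **attrs} for n, attrs in sector.nodes(data=True)],
    "edges": [{"from": u, "to": v, **attrs}
              for u, v, attrs in sector.edges(data=True)],
}
prompt = ("Generate a high-traffic ATC training scenario with two crossing "
          "conflicts for this sector:\n" + json.dumps(payload, indent=2))
```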
[623] FinWorld: An All-in-One Open-Source Platform for End-to-End Financial AI Research and Deployment
Wentao Zhang, Yilei Zhao, Chuqiao Zong, Xinrun Wang, Bo An
Main category: cs.AI
TL;DR: FinWorld is an open-source platform addressing limitations in financial AI by offering end-to-end workflow support, multimodal data integration, and advanced automation, validated through extensive experiments.
Details
Motivation: Existing financial AI platforms lack task coverage, robust multimodal data integration, and LLM support, prompting the development of FinWorld.
Method: FinWorld integrates heterogeneous financial data, supports diverse AI paradigms, and automates workflows, tested on 4 financial AI tasks using deep learning and reinforcement learning.
Result: Experiments show FinWorld improves reproducibility, benchmarking, and deployment efficiency, leveraging 800M+ data points and 4 stock pools.
Conclusion: FinWorld provides a robust foundation for financial AI research and applications, with code available on GitHub.
Abstract: Financial AI holds great promise for transforming modern finance, with the potential to support a wide range of tasks such as market forecasting, portfolio management, quantitative trading, and automated analysis. However, existing platforms remain limited in task coverage, lack robust multimodal data integration, and offer insufficient support for the training and deployment of large language models (LLMs). In response to these limitations, we present FinWorld, an all-in-one open-source platform that provides end-to-end support for the entire financial AI workflow, from data acquisition to experimentation and deployment. FinWorld distinguishes itself through native integration of heterogeneous financial data, unified support for diverse AI paradigms, and advanced agent automation, enabling seamless development and deployment. Leveraging data from 2 representative markets, 4 stock pools, and over 800 million financial data points, we conduct comprehensive experiments on 4 key financial AI tasks. These experiments systematically evaluate deep learning and reinforcement learning algorithms, with particular emphasis on RL-based finetuning for LLMs and LLM Agents. The empirical results demonstrate that FinWorld significantly enhances reproducibility, supports transparent benchmarking, and streamlines deployment, thereby providing a strong foundation for future research and real-world applications. Code is available at GitHub: https://github.com/DVampire/FinWorld
[624] Traffic-R1: Reinforced LLMs Bring Human-Like Reasoning to Traffic Signal Control Systems
Xingchen Zou, Yuhao Yang, Zheng Chen, Xixuan Hao, Yiqi Chen, Chao Huang, Yuxuan Liang
Main category: cs.AI
TL;DR: Traffic-R1 is a lightweight, explainable foundation model for traffic signal control, offering zero-shot generalization, real-time inference, and improved performance over traditional methods.
Details
Motivation: To address congestion and urban mobility challenges by developing a human-like reasoning model for traffic signal control.
Method: Developed through self-exploration and iteration of reinforced LLMs with expert guidance in a simulated traffic environment.
Result: Outperforms baselines, manages 55,000+ drivers daily, reduces queues by 5%, and halves operator workload.
Conclusion: Traffic-R1 sets a new state of the art, enabling scalable and efficient traffic signal control.
Abstract: Traffic signal control (TSC) is vital for mitigating congestion and sustaining urban mobility. In this paper, we introduce Traffic-R1, a foundation model with human-like reasoning for TSC systems. Our model is developed through self-exploration and iteration of reinforced large language models (LLMs) with expert guidance in a simulated traffic environment. Compared to traditional reinforcement learning (RL) and recent LLM-based methods, Traffic-R1 offers three significant advantages. First, Traffic-R1 delivers zero-shot generalisation, transferring unchanged to new road networks and out-of-distribution incidents by utilizing its internal traffic control policies and human-like reasoning. Second, its 3B-parameter architecture is lightweight enough for real-time inference on mobile-class chips, enabling large-scale edge deployment. Third, Traffic-R1 provides an explainable TSC process and facilitates multi-intersection communication through its self-iteration and a new synchronous communication network. Extensive benchmarks demonstrate that Traffic-R1 sets a new state of the art, outperforming strong baselines and training-intensive RL controllers. In practice, the model now manages signals for more than 55,000 drivers daily, shortening average queues by over 5% and halving operator workload. Our checkpoint is available at https://huggingface.co/Season998/Traffic-R1.
[625] CABENCH: Benchmarking Composable AI for Solving Complex Tasks through Composing Ready-to-Use Models
Tung-Thuy Pham, Duy-Quan Luong, Minh-Quan Duong, Trung-Hieu Nguyen, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo
Main category: cs.AI
TL;DR: CABENCH is introduced as the first public benchmark for composable AI, featuring 70 tasks and 700 models, with an evaluation framework and baseline comparisons.
Details
Motivation: Systematic evaluation of composable AI methods is lacking, despite its potential for solving complex tasks by leveraging pre-trained models.
Method: Introduces CABENCH with 70 tasks and 700 models, proposes an evaluation framework, and compares human-designed solutions with LLM-based approaches.
Result: Demonstrates composable AI’s promise but highlights the need for automated pipeline generation to fully exploit its potential.
Conclusion: CABENCH provides a foundation for evaluating composable AI, emphasizing the gap in automated pipeline generation for optimal performance.
Abstract: Composable AI offers a scalable and effective paradigm for tackling complex AI tasks by decomposing them into sub-tasks and solving each sub-task using ready-to-use well-trained models. However, systematically evaluating methods under this setting remains largely unexplored. In this paper, we introduce CABENCH, the first public benchmark comprising 70 realistic composable AI tasks, along with a curated pool of 700 models across multiple modalities and domains. We also propose an evaluation framework to enable end-to-end assessment of composable AI solutions. To establish initial baselines, we provide human-designed reference solutions and compare their performance with two LLM-based approaches. Our results illustrate the promise of composable AI in addressing complex real-world problems while highlighting the need for methods that can fully unlock its potential by automatically generating effective execution pipelines.
[626] Multimodal Large Language Models for End-to-End Affective Computing: Benchmarking and Boosting with Generative Knowledge Prompting
Miaosen Luo, Jiesen Long, Zequn Li, Yunying Yang, Yuncheng Jiang, Sijie Mai
Main category: cs.AI
TL;DR: The paper benchmarks state-of-the-art Multimodal Large Language Models (MLLMs) for Multimodal Affective Computing (MAC), identifies performance gaps, and proposes a hybrid strategy combining generative knowledge prompting with supervised fine-tuning to improve MAC tasks.
Details
Motivation: To address performance variability and insufficient understanding of architectural and data impacts in MAC using MLLMs.
Method: Systematic benchmark evaluation of MLLMs across MAC datasets, analyzing model architectures and dataset properties, and proposing a hybrid strategy for optimization.
Result: The hybrid strategy significantly improves MLLM performance in MAC tasks.
Conclusion: The study offers actionable insights and a promising approach for enhancing MAC using MLLMs, with code made publicly available.
Abstract: Multimodal Affective Computing (MAC) aims to recognize and interpret human emotions by integrating information from diverse modalities such as text, video, and audio. Recent advancements in Multimodal Large Language Models (MLLMs) have significantly reshaped the landscape of MAC by offering a unified framework for processing and aligning cross-modal information. However, practical challenges remain, including performance variability across complex MAC tasks and insufficient understanding of how architectural designs and data characteristics impact affective analysis. To address these gaps, we conduct a systematic benchmark evaluation of state-of-the-art open-source MLLMs capable of concurrently processing audio, visual, and textual modalities across multiple established MAC datasets. Our evaluation not only compares the performance of these MLLMs but also provides actionable insights into model optimization by analyzing the influence of model architectures and dataset properties. Furthermore, we propose a novel hybrid strategy that combines generative knowledge prompting with supervised fine-tuning to enhance MLLMs’ affective computing capabilities. Experimental results demonstrate that this integrated approach significantly improves performance across various MAC tasks, offering a promising avenue for future research and development in this field. Our code is released on https://github.com/LuoMSen/MLLM-MAC.
[627] PHM-Bench: A Domain-Specific Benchmarking Framework for Systematic Evaluation of Large Models in Prognostics and Health Management
Puyu Yang, Laifa Tao, Zijian Huang, Haifei Liu, Wenyan Cao, Hao Ji, Jianan Qiu, Qixuan Huang, Xuanyuan Su, Yuhang Xie, Jun Zhang, Shangyu Li, Chen Lu, Zhixuan Lian
Main category: cs.AI
TL;DR: PHM-Bench is a three-dimensional evaluation framework for assessing large language models (LLMs) in Prognostics and Health Management (PHM), addressing gaps in current methodologies.
Details
Motivation: Existing evaluation methods for LLMs in PHM lack structural completeness, dimensional comprehensiveness, and granularity, hindering deeper integration.
Method: The study introduces PHM-Bench, a framework with metrics for knowledge comprehension, algorithmic generation, and task optimization, tested on case sets and industrial datasets.
Result: PHM-Bench enables multi-dimensional evaluation of LLMs in PHM tasks like condition monitoring and fault diagnosis, providing a benchmark for model assessment.
Conclusion: PHM-Bench supports large-scale LLM evaluation in PHM and guides the development of specialized models for the domain.
Abstract: With the rapid advancement of generative artificial intelligence, large language models (LLMs) are increasingly adopted in industrial domains, offering new opportunities for Prognostics and Health Management (PHM). These models help address challenges such as high development costs, long deployment cycles, and limited generalizability. However, despite the growing synergy between PHM and LLMs, existing evaluation methodologies often fall short in structural completeness, dimensional comprehensiveness, and evaluation granularity. This hampers the in-depth integration of LLMs into the PHM domain. To address these limitations, this study proposes PHM-Bench, a novel three-dimensional evaluation framework for PHM-oriented large models. Grounded in the triadic structure of fundamental capability, core task, and entire lifecycle, PHM-Bench is tailored to the unique demands of PHM system engineering. It defines multi-level evaluation metrics spanning knowledge comprehension, algorithmic generation, and task optimization. These metrics align with typical PHM tasks, including condition monitoring, fault diagnosis, RUL prediction, and maintenance decision-making. Utilizing both curated case sets and publicly available industrial datasets, our study enables multi-dimensional evaluation of general-purpose and domain-specific models across diverse PHM tasks. PHM-Bench establishes a methodological foundation for large-scale assessment of LLMs in PHM and offers a critical benchmark to guide the transition from general-purpose to PHM-specialized models.
[628] OptiHive: Ensemble Selection for LLM-Based Optimization via Statistical Modeling
Maxime Bouscary, Saurabh Amin
Main category: cs.AI
TL;DR: OptiHive is an LLM-based framework that generates high-quality solvers for optimization problems from natural-language descriptions without iterative self-correction, outperforming baselines significantly.
Details
Motivation: LLM-based solvers are unreliable and slow due to iterative repair loops, prompting the need for a more efficient and accurate solution.
Method: OptiHive uses a single batched LLM query to generate diverse components (solvers, problem instances, validation tests) and filters out errors, employing a statistical model for performance inference.
Result: OptiHive increases the optimality rate from 5% to 92% on complex problems like the Multi-Depot Vehicle Routing Problem.
Conclusion: OptiHive offers a reliable, interpretable, and efficient alternative to traditional LLM-based solvers for optimization problems.
Abstract: LLM-based solvers have emerged as a promising means of automating problem modeling and solving. However, they remain unreliable and often depend on iterative repair loops that result in significant latency. We introduce OptiHive, an LLM-based framework that produces high-quality solvers for optimization problems from natural-language descriptions without iterative self-correction. OptiHive uses a single batched LLM query to generate diverse components (solvers, problem instances, and validation tests) and filters out erroneous components to ensure fully interpretable outputs. Taking into account the imperfection of the generated components, we employ a statistical model to infer their true performance, enabling principled uncertainty quantification and solver selection. On tasks ranging from traditional optimization problems to challenging variants of the Multi-Depot Vehicle Routing Problem, OptiHive significantly outperforms baselines, increasing the optimality rate from 5% to 92% on the most complex problems.
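A toy version of the cross-evaluation and selection step might look like the following; the Beta-posterior scoring is a simple stand-in for the paper's statistical model, which also accounts for the imperfection of the generated instances and tests.

```python
def select_solver(solvers, instances, tests):
    """Rank candidate solvers by pass rate over instances and validation tests."""
    best, best_score = None, -1.0
    for solve in solvers:
        passes = total = 0
        for inst in instances:
            try:
                out = solve(inst)
            except Exception:
                out = None                       # erroneous components score zero
            for check in tests:
                total += 1
                passes += bool(out is not None and check(inst, out))
        score = (passes + 1) / (total + 2)       # Beta(1,1) posterior mean
        if score > best_score:
            best, best_score = solve, score
    return best
```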
[629] Test-time Prompt Intervention
Chenxu Yang, Qingyi Si, Mz Dai, Dingyu Yao, Mingyu Zheng, Minghui Chen, Zheng Lin, Weiping Wang
Main category: cs.AI
TL;DR: PI is a framework for Test-time Prompt Intervention that dynamically guides LLM reasoning, reducing redundancy and improving reliability.
Details
Motivation: Existing LLMs produce redundant and unreliable reasoning chains due to over-reliance on outcome rewards, lacking process regulation.
Method: PI introduces a framework with When, How, and Which modules to intervene in reasoning paths during inference, integrating human expertise.
Result: PI shortens reasoning chains, reduces hallucination, and enhances reliability across multiple models and datasets.
Conclusion: PI improves LLM reasoning by dynamically regulating CoTs, making them more concise and reliable.
Abstract: Test-time compute has led to remarkable success in the large language model (LLM) community, particularly for complex tasks, where longer chains of thought (CoTs) are generated to enhance reasoning capabilities. However, growing evidence reveals that such reasoning models often produce CoTs plagued by excessive redundancy, including unnecessary verification steps and repetitive reasoning shifts. The root cause lies in post-training of them that overly rely on outcome reward paradigms, as the data of process reward paradigms, which regulate intermediate reasoning steps, is difficult to construct at scale. To address this, we propose PI, a novel framework for Test-time Prompt Intervention. PI provides an interface to dynamically guide and regulate reasoning paths during inference through timely (When module) and proper (How module) interventions and post-intervention sampling (Which module). This allows human problem-solving expertise and cognitive science principles to be seamlessly integrated into LLMs’ reasoning processes, enhancing controllability and interpretability. Extensive experiments across multiple models and datasets demonstrate that PI significantly shortens CoTs while reducing hallucination, yielding more concise and reliable reasoning.
[630] Accurate and Interpretable Postmenstrual Age Prediction via Multimodal Large Language Model
Qifan Chen, Jin Cui, Cindy Duan, Yushuo Han, Yifei Shi
Main category: cs.AI
TL;DR: A multimodal large language model (MLLM) is fine-tuned for accurate postmenstrual age (PMA) prediction and interpretable explanations from neonatal MRI scans, achieving low error and clinical relevance.
Details
Motivation: Deep learning models for PMA prediction lack transparency, hindering clinical trust. This work aims to combine accuracy with interpretability for perinatal neuroscience.
Method: Parameter-efficient fine-tuning (PEFT) with instruction tuning and LoRA on Qwen2.5-VL-7B, using 2D cortical surface maps from MRI scans. Distinct prompts enable regression training and explanation generation.
Result: The model achieves a 95% confidence interval of 0.78 to 1.52 weeks prediction error and produces clinically interpretable outputs.
Conclusion: This approach advances transparent and trustworthy AI in perinatal neuroscience by balancing accuracy and interpretability.
Abstract: Accurate estimation of postmenstrual age (PMA) at scan is crucial for assessing neonatal development and health. While deep learning models have achieved high accuracy in predicting PMA from brain MRI, they often function as black boxes, offering limited transparency and interpretability in clinical decision support. In this work, we address the dual challenge of accuracy and interpretability by adapting a multimodal large language model (MLLM) to perform both precise PMA prediction and clinically relevant explanation generation. We introduce a parameter-efficient fine-tuning (PEFT) strategy using instruction tuning and Low-Rank Adaptation (LoRA) applied to the Qwen2.5-VL-7B model. The model is trained on four 2D cortical surface projection maps derived from neonatal MRI scans. By employing distinct prompts for training and inference, our approach enables the MLLM to handle a regression task during training and generate clinically relevant explanations during inference. The fine-tuned model achieves a low prediction error with a 95 percent confidence interval of 0.78 to 1.52 weeks, while producing interpretable outputs grounded in developmental features, marking a significant step toward transparent and trustworthy AI systems in perinatal neuroscience.
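For concreteness, a LoRA setup along the lines described could be configured with the `peft` library as below; the rank, target modules, and other hyperparameters are assumptions, and the model-loading step is left schematic.

```python
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,                                  # low-rank dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # typical attention projections
    task_type="CAUSAL_LM",
)
# base_model = ...  # load Qwen2.5-VL-7B via the transformers vision-language API
# model = get_peft_model(base_model, lora_cfg)
# Training would use a regression-style prompt ("Predict the PMA in weeks ...");
# inference swaps in an explanation prompt, per the abstract.
```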
[631] CAMA: Enhancing Mathematical Reasoning in Large Language Models with Causal Knowledge
Lei Zan, Keli Zhang, Ruichu Cai, Lujia Pan
Main category: cs.AI
TL;DR: CAMA is a two-stage causal framework enhancing LLMs for complex math reasoning by constructing and refining a Mathematical Causal Graph (MCG) for structured guidance.
Details
Motivation: LLMs struggle with complex mathematical reasoning due to deep structural dependencies, prompting the need for explicit, reusable mathematical structure.
Method: CAMA constructs an MCG using LLM priors and causal discovery, refines it with iterative feedback, and dynamically extracts task-relevant subgraphs to guide LLM reasoning.
Result: CAMA significantly improves LLM performance on challenging math problems, with structured guidance and asymmetric causal relationships proving most effective.
Conclusion: CAMA successfully addresses LLM limitations in math reasoning by integrating causal structure, demonstrating the value of explicit, reusable knowledge representations.
Abstract: Large Language Models (LLMs) have demonstrated strong performance across a wide range of tasks, yet they still struggle with complex mathematical reasoning, a challenge fundamentally rooted in deep structural dependencies. To address this challenge, we propose \textbf{CA}usal \textbf{MA}thematician (\textbf{CAMA}), a two-stage causal framework that equips LLMs with explicit, reusable mathematical structure. In the learning stage, CAMA first constructs the \textbf{M}athematical \textbf{C}ausal \textbf{G}raph (\textbf{MCG}), a high-level representation of solution strategies, by combining LLM priors with causal discovery algorithms applied to a corpus of question-solution pairs. The resulting MCG encodes essential knowledge points and their causal dependencies. To better align the graph with downstream reasoning tasks, CAMA further refines the MCG through iterative feedback derived from a selected subset of the question-solution pairs. In the reasoning stage, given a new question, CAMA dynamically extracts a task-relevant subgraph from the MCG, conditioned on both the question content and the LLM’s intermediate reasoning trace. This subgraph, which encodes the most pertinent knowledge points and their causal dependencies, is then injected back into the LLM to guide its reasoning process. Empirical results on real-world datasets show that CAMA significantly improves LLM performance on challenging mathematical problems. Furthermore, our experiments demonstrate that structured guidance consistently outperforms unstructured alternatives, and that incorporating asymmetric causal relationships yields greater improvements than using symmetric associations alone.
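The reasoning-stage subgraph extraction admits a short sketch: score knowledge points against the question, then pull in their causal ancestors. The `relevance` scorer and the graph contents are illustrative assumptions.

```python
import networkx as nx

def extract_subgraph(mcg: nx.DiGraph, question: str, relevance, k=3):
    """Pick top-k relevant knowledge points and include their causal ancestors."""
    seeds = sorted(mcg.nodes, key=lambda n: relevance(n, question),
                   reverse=True)[:k]
    keep = set(seeds)
    for s in seeds:
        keep |= nx.ancestors(mcg, s)          # causal prerequisites of each seed
    return mcg.subgraph(keep).copy()          # injected back to guide the LLM
```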
[632] Noosemia: toward a Cognitive and Phenomenological Account of Intentionality Attribution in Human-Generative AI Interaction
Enrico De Santis, Antonello Rizzi
Main category: cs.AI
TL;DR: The paper introduces ‘Noosemia,’ a phenomenon where users attribute human-like traits to AI systems due to linguistic and cognitive factors, and contrasts it with similar concepts like pareidolia.
Details
Motivation: To understand how and why humans anthropomorphize AI systems, especially in dialogic or multimodal interactions, and to explore the cognitive and phenomenological underpinnings of this phenomenon.
Method: A multidisciplinary framework is proposed, linking LLM meaning holism to the ‘LLM Contextual Cognitive Field,’ to analyze how AI systems simulate agency and coherence.
Result: Noosemia is identified as a unique phenomenon, distinct from pareidolia or animism, and ‘a-noosemia’ is introduced to describe the withdrawal of such attributions.
Conclusion: The paper highlights the philosophical and social implications of noosemia and suggests future research directions.
Abstract: This paper introduces and formalizes Noosemia, a novel cognitive-phenomenological phenomenon emerging from human interaction with generative AI systems, particularly those enabling dialogic or multimodal exchanges. We propose a multidisciplinary framework to explain how, under certain conditions, users attribute intentionality, agency, and even interiority to these systems - a process grounded not in physical resemblance, but in linguistic performance, epistemic opacity, and emergent technological complexity. By linking an LLM declination of meaning holism to our technical notion of the LLM Contextual Cognitive Field, we clarify how LLMs construct meaning relationally and how coherence and a simulacrum of agency arise at the human-AI interface. The analysis situates noosemia alongside pareidolia, animism, the intentional stance and the uncanny valley, distinguishing its unique characteristics. We also introduce a-noosemia to describe the phenomenological withdrawal of such projections. The paper concludes with reflections on the broader philosophical, epistemological, and social implications of noosemic dynamics and directions for future research.
[633] Actionable Counterfactual Explanations Using Bayesian Networks and Path Planning with Applications to Environmental Quality Improvement
Enrique Valero-Leal, Pedro Larrañaga, Concha Bielza
Main category: cs.AI
TL;DR: A method for actionable counterfactual explanations is proposed, using density estimators and path planning without direct training data reliance, ensuring privacy and fairness.
Details
Motivation: To provide interpretable and fair counterfactual explanations in high-stakes scenarios, addressing privacy concerns by masking sensitive data.
Method: Uses Bayesian networks for density estimation and path planning algorithms to find actionable counterfactuals, avoiding direct data usage.
Result: Outperforms state-of-the-art in generating simpler, more actionable counterfactuals, validated on synthetic and real-world EPA datasets.
Conclusion: The method enhances fairness and interpretability, particularly in sociodemographic contexts, while improving policy efficiency and equity.
Abstract: Counterfactual explanations study what should have changed in order to get an alternative result, enabling end-users to understand machine learning mechanisms with counterexamples. Actionability is defined as the ability to transform the original case to be explained into a counterfactual one. We develop a method for actionable counterfactual explanations that, unlike predecessors, does not directly leverage training data. Rather, data is only used to learn a density estimator, creating a search landscape in which to apply path planning algorithms to solve the problem and masking the endogenous data, which can be sensitive or private. We put special focus on estimating the data density using Bayesian networks, demonstrating how their enhanced interpretability is useful in high-stakes scenarios in which fairness is a rising concern. Using a synthetic benchmark comprising 15 datasets, our proposal finds more actionable and simpler counterfactuals than the current state-of-the-art algorithms. We also test our algorithm with a real-world Environmental Protection Agency dataset, facilitating a more efficient and equitable study of policies to improve the quality of life in counties across the United States of America. Our proposal captures the interaction of variables, ensuring equity in decisions, as policies to improve certain domains of study (air, water quality, etc.) can be detrimental in others. In particular, the sociodemographic domain is often involved, where we find important variables related to the ongoing housing crisis that can potentially have a severe negative impact on communities.
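A toy sketch of the "counterfactuals as path planning over a density landscape" idea: fit a density estimator (a kernel density estimator here, standing in for the paper's Bayesian network), build a grid graph whose edge costs penalize low-density regions, and read the shortest path as a sequence of plausible intermediate states. All numbers are illustrative:

```python
# Toy sketch: plan a counterfactual path that stays in high-density regions.
# A KDE stands in for the paper's Bayesian-network density estimator.
import numpy as np
import networkx as nx
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
data = rng.normal(0, 1, size=(500, 2))            # stand-in training data
kde = KernelDensity(bandwidth=0.5).fit(data)

xs = np.linspace(-3, 3, 25)
G = nx.grid_2d_graph(len(xs), len(xs))
for (i, j), (k, l) in G.edges:
    mid = (np.array([xs[i], xs[j]]) + np.array([xs[k], xs[l]])) / 2
    logp = kde.score_samples(mid[None])[0]
    G.edges[(i, j), (k, l)]["w"] = float(np.exp(-logp))  # cheap where density is high

path = nx.shortest_path(G, (2, 2), (20, 20), weight="w")
waypoints = [(xs[i], xs[j]) for i, j in path]      # actionable sequence of edits
```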
[634] D2PPO: Diffusion Policy Policy Optimization with Dispersive Loss
Guowei Zou, Weibing Li, Hejun Wu, Yukun Qian, Yuhang Wang, Haitao Wang
Main category: cs.AI
TL;DR: D2PPO introduces dispersive loss to combat representation collapse in diffusion policies, improving performance on robotic manipulation tasks.
Details
Motivation: Diffusion policies suffer from representation collapse, impairing their ability to handle subtle variations in complex tasks.
Method: D2PPO uses dispersive loss regularization to learn discriminative representations by treating hidden representations as negative pairs.
Result: D2PPO achieves 22.7% and 26.1% improvements in pre-training and fine-tuning, respectively, and shows high success rates in real-world experiments.
Conclusion: D2PPO sets new SOTA results, especially excelling in complex manipulation tasks.
Abstract: Diffusion policies excel at robotic manipulation by naturally modeling multimodal action distributions in high-dimensional spaces. Nevertheless, diffusion policies suffer from diffusion representation collapse: semantically similar observations are mapped to indistinguishable features, ultimately impairing their ability to handle subtle but critical variations required for complex robotic manipulation. To address this problem, we propose D2PPO (Diffusion Policy Policy Optimization with Dispersive Loss). D2PPO introduces dispersive loss regularization that combats representation collapse by treating all hidden representations within each batch as negative pairs. D2PPO compels the network to learn discriminative representations of similar observations, thereby enabling the policy to identify subtle yet crucial differences necessary for precise manipulation. In evaluation, we find that early-layer regularization benefits simple tasks, while late-layer regularization sharply enhances performance on complex manipulation tasks. On RoboMimic benchmarks, D2PPO achieves an average improvement of 22.7% in pre-training and 26.1% after fine-tuning, setting new SOTA results. In comparison with SOTA, results of real-world experiments on a Franka Emika Panda robot show the excitingly high success rate of our method. The superiority of our method is especially evident in complex tasks. Project page: https://guowei-zou.github.io/d2ppo/
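One plausible form of a dispersive loss is an InfoNCE-style repulsion with no positive pairs, pushing all in-batch hidden states apart; the sketch below illustrates the idea and is not necessarily D2PPO's exact formulation:

```python
# Sketch of a dispersive loss: treat all in-batch hidden states as negatives
# and push them apart (one plausible InfoNCE-style form; tau is assumed).
import torch

def dispersive_loss(h: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    h = torch.nn.functional.normalize(h, dim=-1)            # (batch, dim)
    d2 = (h.unsqueeze(0) - h.unsqueeze(1)).pow(2).sum(-1)   # pairwise squared distances
    mask = ~torch.eye(h.size(0), dtype=torch.bool, device=h.device)
    return torch.log(torch.exp(-d2[mask] / tau).mean())     # lower when spread out

h = torch.randn(32, 128, requires_grad=True)
loss = dispersive_loss(h)   # added to the policy objective as a regularizer
loss.backward()
```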
[635] MacroSwarm: A Field-based Compositional Framework for Swarm Programming
Gianluca Aguzzi, Roberto Casadei, Mirko Viroli
Main category: cs.AI
TL;DR: The paper introduces MacroSwarm, a field-based coordination approach for designing and programming swarm behavior using reusable, composable functional blocks. It demonstrates its effectiveness through simulations of flocking, pattern formation, and collective decision-making.
Details
Motivation: There is a need for general methods to design complex swarm behavior systematically. Existing approaches lack principled tools for coordination and computation in swarms.
Method: Proposes MacroSwarm, a framework based on macroprogramming and aggregate computing, where swarm behaviors are expressed as pure functions mapping sensing fields to actuation fields.
Result: Simulations show MacroSwarm’s expressiveness and practicality in achieving common swarm behaviors like flocking and pattern formation. The framework also exhibits self-stabilization properties.
Conclusion: MacroSwarm provides a principled, reusable, and composable approach to swarm behavior engineering, with demonstrated resilience and practical applicability.
Abstract: Swarm behaviour engineering is an area of research that seeks to investigate methods and techniques for coordinating computation and action within groups of simple agents to achieve complex global goals like pattern formation, collective movement, clustering, and distributed sensing. Despite recent progress in the analysis and engineering of swarms (of drones, robots, vehicles), there is still a need for general design and implementation methods and tools that can be used to define complex swarm behaviour in a principled way. To contribute to this quest, this article proposes a new field-based coordination approach, called MacroSwarm, to design and program swarm behaviour in terms of reusable and fully composable functional blocks embedding collective computation and coordination. Based on the macroprogramming paradigm of aggregate computing, MacroSwarm builds on the idea of expressing each swarm behaviour block as a pure function, mapping sensing fields into actuation goal fields, e.g., including movement vectors. In order to demonstrate the expressiveness, compositionality, and practicality of MacroSwarm as a framework for swarm programming, we perform a variety of simulations covering common patterns of flocking, pattern formation, and collective decision-making. We also discuss the implications of the inherent self-stabilisation properties of field-based computations in MacroSwarm, which formally guarantee certain resilience properties and guided the design of the library.
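The "behaviour block as a pure function from sensing fields to actuation fields" idea can be sketched in a few lines; note that MacroSwarm itself is built on the aggregate-computing stack rather than plain Python, so this flocking block with made-up weights only illustrates the signature:

```python
# Sketch of a swarm behaviour block as a pure function: neighbour positions
# and velocities in, a movement vector out (weights are illustrative).
import numpy as np

def flock(my_pos, my_vel, nbr_pos, nbr_vel,
          w_sep=1.5, w_coh=1.0, w_align=0.5, sep_radius=1.0):
    nbr_pos, nbr_vel = np.asarray(nbr_pos), np.asarray(nbr_vel)
    cohesion = nbr_pos.mean(axis=0) - my_pos        # steer toward neighbours
    alignment = nbr_vel.mean(axis=0) - my_vel       # match neighbour velocity
    diffs = my_pos - nbr_pos
    dists = np.linalg.norm(diffs, axis=1, keepdims=True)
    close = diffs[(dists < sep_radius).ravel()]
    separation = close.sum(axis=0) if len(close) else np.zeros_like(my_pos)
    return w_sep * separation + w_coh * cohesion + w_align * alignment

v = flock(np.zeros(2), np.zeros(2),
          nbr_pos=[[1.0, 0.2], [0.4, -0.3]], nbr_vel=[[0.1, 0.0], [0.0, 0.1]])
```

Because the block is a pure function, larger behaviours compose by simply summing or sequencing such movement vectors, which is the compositionality the paper emphasizes.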
[636] Improving Handwritten Text Recognition via 3D Attention and Multi-Scale Training
Zi-Rui Wang
Main category: cs.AI
TL;DR: A new handwritten text recognition network using a 3D attention module and global-local context information, achieving comparable results to state-of-the-art methods.
Details
Motivation: To improve handwritten text recognition by integrating CTC, hidden Markov models, and encoder-decoder methods with a novel 3D attention module.
Method: Uses 3D blocks from convolutional layers, processes them with a 3D attention module, and fuses visual and global-local context features.
Result: Achieves comparable performance on Chinese (SCUT-HCCDoc, SCUT-EPT) and English (IAM) datasets.
Conclusion: The proposed method effectively combines attention mechanisms and context features for robust handwritten text recognition.
Abstract: The segmentation-free research efforts for addressing handwritten text recognition can be divided into three categories: connectionist temporal classification (CTC), hidden Markov model and encoder-decoder methods. In this paper, inspired by the above three modeling methods, we propose a new recognition network by using a novel three-dimensional (3D) attention module and global-local context information. Based on the feature maps of the last convolutional layer, a series of 3D blocks with different resolutions are split. Then, these 3D blocks are fed into the 3D attention module to generate sequential visual features. Finally, by fusing the visual features and the corresponding global-local context features, a well-designed representation can be obtained. Main canonical neural units including attention mechanisms, fully-connected layers, recurrent units and convolutional layers are efficiently organized into a network and can be jointly trained by the CTC loss and the cross-entropy loss. Experiments on the latest Chinese handwritten text datasets (the SCUT-HCCDoc and the SCUT-EPT) and one English handwritten text dataset (the IAM) show that the proposed method can achieve comparable results with the state-of-the-art methods. The code is available at https://github.com/Wukong90/3DAttention-MultiScaleTraining-for-HTR.
[637] Syllabus: Portable Curricula for Reinforcement Learning Agents
Ryan Sullivan, Ryan Pégoud, Ameen Ur Rehman, Xinchen Yang, Junyun Huang, Aayush Verma, Nistha Mitra, John P. Dickerson
Main category: cs.AI
TL;DR: Syllabus is a portable curriculum learning library for reinforcement learning, offering a universal API and modular methods to simplify integration and improve agent capabilities.
Details
Motivation: Curriculum learning is vital for RL success but lacks direct support in major libraries, requiring complex code changes.
Method: Syllabus provides a universal API, modular automatic curriculum methods, and infrastructure for easy integration with RL libraries.
Result: Syllabus enables automatic curriculum learning in challenging benchmarks like NetHack and Neural MMO, though existing methods may not fully transfer.
Conclusion: Syllabus addresses the gap in curriculum learning support, facilitating easier adoption and adaptation in diverse RL environments.
Abstract: Curriculum learning has been a quiet, yet crucial component of many high-profile successes of reinforcement learning. Despite this, it is still a niche topic that is not directly supported by any of the major reinforcement learning libraries. These methods can improve the capabilities and generalization of RL agents, but often require complex changes to training code. We introduce Syllabus, a portable curriculum learning library, as a solution to this problem. Syllabus provides a universal API for curriculum learning, modular implementations of popular automatic curriculum learning methods, and infrastructure that allows them to be easily integrated with asynchronous training code in nearly any RL library. Syllabus provides a minimal API for core curriculum learning components, making it easier to design new algorithms and adapt existing ones to new environments. We demonstrate this by evaluating the algorithms in Syllabus on several new environments, each using agents written in a different RL library. We present the first examples of automatic curriculum learning in NetHack and Neural MMO, two of the most challenging RL benchmarks, and find evidence that existing methods do not directly transfer to complex new environments. Syllabus can be found at https://github.com/RyanNavillus/Syllabus.
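For flavor, here is a generic learning-progress curriculum sampler; this illustrates what automatic curriculum methods do and is NOT Syllabus's actual API:

```python
# Generic automatic-curriculum sketch: sample the task with the largest recent
# change in returns ("learning progress"); illustrative only, not Syllabus's API.
import random
from collections import deque
from statistics import mean

class ProgressCurriculum:
    def __init__(self, task_ids, window=50, eps=0.1):
        self.tasks = list(task_ids)
        self.returns = {t: deque(maxlen=window) for t in self.tasks}
        self.eps = eps

    def sample(self):
        if random.random() < self.eps:       # keep exploring all tasks
            return random.choice(self.tasks)
        def progress(t):
            r = list(self.returns[t])
            if len(r) < 4:
                return float("inf")          # prioritise barely-seen tasks
            half = len(r) // 2
            return abs(mean(r[half:]) - mean(r[:half]))
        return max(self.tasks, key=progress)

    def update(self, task, episode_return):
        self.returns[task].append(episode_return)

cur = ProgressCurriculum(["maze-easy", "maze-hard"])
task = cur.sample()                          # train an episode on `task`...
cur.update(task, episode_return=1.0)         # ...then feed the return back
```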
[638] Collision-based Dynamics for Multi-Marginal Optimal Transport
Mohsen Sadr, Hossein Gorji
Main category: cs.AI
TL;DR: A collision-based dynamics method with Monte Carlo solution algorithm is proposed for multi-marginal optimal transport, offering linear scalability in complexity and memory usage.
Details
Motivation: To address the computational challenges of multi-marginal optimal transport, especially in high-dimensional settings.
Method: Uses randomized pairwise swapping of sample indices inspired by Boltzmann kinetics.
Result: Demonstrates efficiency and linear scalability compared to state-of-the-art methods.
Conclusion: The proposed method is highly efficient and scalable for high-dimensional problems.
Abstract: Inspired by the Boltzmann kinetics, we propose a collision-based dynamics with a Monte Carlo solution algorithm that approximates the solution of the multi-marginal optimal transport problem via randomized pairwise swapping of sample indices. The computational complexity and memory usage of the proposed method scale linearly with the number of samples, making it highly attractive for high-dimensional settings. In several examples, we demonstrate the efficiency of the proposed method compared to the state-of-the-art methods.
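A minimal sketch of the collision-style dynamics: couple k empirical marginals through index permutations and accept random pairwise index swaps only when they lower the transport cost. The quadratic pairwise cost and greedy acceptance rule are simplifying assumptions:

```python
# Sketch of collision-like pairwise swaps for multi-marginal OT: memory and
# per-step cost stay linear in the number of samples (toy 1-D marginals).
import numpy as np

rng = np.random.default_rng(0)
k, n = 3, 200
marginals = [rng.normal(m, 1.0, size=n) for m in range(k)]
perm = [np.arange(n) for _ in range(k)]        # index coupling across marginals

def tuple_cost(i):
    vals = np.array([marginals[m][perm[m][i]] for m in range(k)])
    return ((vals[:, None] - vals[None, :]) ** 2).sum()   # pairwise cost

for _ in range(20000):
    m = rng.integers(1, k)                     # marginal whose indices we swap
    i, j = rng.integers(0, n, size=2)
    before = tuple_cost(i) + tuple_cost(j)
    perm[m][i], perm[m][j] = perm[m][j], perm[m][i]
    if tuple_cost(i) + tuple_cost(j) > before:             # reject bad swaps
        perm[m][i], perm[m][j] = perm[m][j], perm[m][i]
```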
[639] LLM-Generated Heuristics for AI Planning: Do We Even Need Domain-Independence Anymore?
Alexander Tuisov, Yonatan Vernik, Alexander Shleyfman
Main category: cs.AI
TL;DR: LLMs can generate domain-specific heuristics for planning tasks, challenging traditional domain-independent methods, with promising performance and applicability to tasks lacking PDDL representation.
Details
Motivation: To explore if LLMs can replace or complement domain-independent heuristics by generating tailored heuristics from task descriptions, addressing limitations like computational efficiency and explainability.
Method: Use LLMs to derive planning heuristics from task descriptions (successor generators and goal tests) and compare their performance with traditional domain-independent methods.
Result: LLM-generated heuristics achieve state-of-the-art performance in some IPC domains and solve tasks without PDDL representation.
Conclusion: LLMs offer a viable alternative or complement to domain-independent heuristics, potentially signaling a paradigm shift in AI planning.
Abstract: Domain-independent heuristics have long been a cornerstone of AI planning, offering general solutions applicable across a wide range of tasks without requiring domain-specific engineering. However, the advent of large language models (LLMs) presents an opportunity to generate heuristics tailored to specific planning problems, potentially challenging the necessity of domain independence as a strict design principle. In this paper, we explore the use of LLMs to automatically derive planning heuristics from task descriptions represented as successor generators and goal tests written in a general-purpose programming language. We investigate the trade-offs between domain-specific LLM-generated heuristics and traditional domain-independent methods in terms of computational efficiency and explainability. Our experiments demonstrate that LLMs can create heuristics that achieve state-of-the-art performance on some standard IPC domains, as well as their ability to solve problems that lack an adequate Planning Domain Definition Language (PDDL) representation. We discuss whether these results signify a paradigm shift and how they can complement existing approaches.
[640] A novel approach to navigate the taxonomic hierarchy to address the Open-World Scenarios in Medicinal Plant Classification
Soumen Sinha, Tanisha Rana, Susmita Ghosh, Rahul Roy
Main category: cs.AI
TL;DR: A novel method for hierarchical plant taxonomy classification using DenseNet121, Multi-Scale Self-Attention, and cascaded classifiers, achieving high accuracy for known and unknown species.
Details
Motivation: Existing methods fail in hierarchical classification and identifying unknown species, limiting comprehensive plant taxonomy classification.
Method: Integrates DenseNet121, Multi-Scale Self-Attention (MSSA), and cascaded classifiers to capture local and global contextual information for precise hierarchical classification.
Result: Achieved 83.36%, 78.30%, 60.34%, and 43.32% accuracy for unknown species at phylum, class, order, and family levels, with a model size four times smaller than existing methods.
Conclusion: The proposed method effectively addresses hierarchical classification and unknown species identification, offering a deployable solution for real-world applications.
Abstract: In this article, we propose a novel approach for plant hierarchical taxonomy classification by posing the problem as an open class problem. It is observed that existing methods for medicinal plant classification often fail to perform hierarchical classification and accurately identify unknown species, limiting their effectiveness in comprehensive plant taxonomy classification. Thus we address the problem of unknown species classification by assigning it the best hierarchical labels. We propose a novel method, which integrates DenseNet121, Multi-Scale Self-Attention (MSSA) and cascaded classifiers for hierarchical classification. The approach systematically categorizes medicinal plants at multiple taxonomic levels, from phylum to species, ensuring detailed and precise classification. Using multi-scale self-attention, the model captures both local and global contextual information from the images, improving the distinction between similar species and the identification of new ones. It uses attention scores to focus on important features across multiple scales. The proposed method provides a solution for hierarchical classification, showcasing superior performance in identifying both known and unknown species. The model was tested on two state-of-the-art datasets, with and without background artifacts, so that it can be deployed to tackle real-world applications. We used unknown species for testing our model. For unknown species the model achieved an average accuracy of 83.36%, 78.30%, 60.34% and 43.32% for predicting the correct phylum, class, order and family, respectively. Our proposed model is almost four times smaller than existing state-of-the-art methods, making it easily deployable in real-world applications.
[641] R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, Dacheng Tao
Main category: cs.AI
TL;DR: StepGRPO enhances MLLMs’ reasoning by using online reinforcement learning with step-wise rewards, outperforming imitation-based methods.
Details
Motivation: Current MLLMs imitate reasoning paths without understanding errors, limiting their reasoning ability.
Method: StepGRPO uses StepRAR and StepRVR rewards for accuracy and validity in reasoning steps.
Result: R1-VL models show superior step-by-step reasoning on 8 benchmarks.
Conclusion: StepGRPO improves reasoning beyond imitation, validated by strong experimental results.
Abstract: Recent studies generally enhance MLLMs’ reasoning capabilities via supervised fine-tuning on high-quality chain-of-thought reasoning data, which often leads models to merely imitate successful reasoning paths without understanding what the wrong reasoning paths are. In this work, we aim to enhance the MLLMs’ reasoning ability beyond passively imitating positive reasoning paths. To this end, we design Step-wise Group Relative Policy Optimization (StepGRPO), a new online reinforcement learning framework that enables MLLMs to self-improve reasoning ability via simple, effective and dense step-wise rewarding. Specifically, StepGRPO introduces two novel rule-based reasoning rewards: Step-wise Reasoning Accuracy Reward (StepRAR) and Step-wise Reasoning Validity Reward (StepRVR). StepRAR rewards the reasoning paths that contain necessary intermediate reasoning steps via a soft key-step matching technique, while StepRVR rewards reasoning paths that follow a well-structured and logically consistent reasoning process through a reasoning completeness and logic evaluation strategy. With the proposed StepGRPO, we introduce R1-VL, a series of MLLMs with outstanding capabilities in step-by-step reasoning. Extensive experiments over 8 benchmarks demonstrate the superiority of our methods.
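A soft key-step matching reward in the spirit of StepRAR can be sketched as a fuzzy containment score; the matching threshold and line-based granularity are assumptions:

```python
# Sketch of a soft key-step matching reward: the fraction of reference key
# steps that fuzzily appear in the generated reasoning (threshold assumed).
from difflib import SequenceMatcher

def step_reward(reasoning: str, key_steps: list[str], thresh: float = 0.7) -> float:
    lines = [l.strip() for l in reasoning.splitlines() if l.strip()]
    def matched(step):
        return any(SequenceMatcher(None, step, l).ratio() >= thresh for l in lines)
    return sum(matched(s) for s in key_steps) / max(len(key_steps), 1)

r = step_reward("compute 3*4 = 12\nadd 5 -> 17", ["3*4 = 12", "12 + 5 = 17"])
print(r)  # dense reward in [0, 1] per sampled reasoning path
```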
[642] Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, Yuheng Cheng, Suyuchen Wang, Xiaoqiang Wang, Yuyu Luo, Haibo Jin, Peiyan Zhang, Ollie Liu, Jiaqi Chen, Huan Zhang, Zhaoyang Yu, Haochen Shi, Boyan Li, Dekun Wu, Fengwei Teng, Xiaojun Jia, Jiawei Xu, Jinyu Xiang, Yizhang Lin, Tianming Liu, Tongliang Liu, Yu Su, Huan Sun, Glen Berseth, Jianyun Nie, Ian Foster, Logan Ward, Qingyun Wu, Yu Gu, Mingchen Zhuge, Xinbing Liang, Xiangru Tang, Haohan Wang, Jiaxuan You, Chi Wang, Jian Pei, Qiang Yang, Xiaoliang Qi, Chenglin Wu
Main category: cs.AI
TL;DR: The paper explores modular, brain-inspired architectures for intelligent agents, covering cognitive foundations, self-enhancement, multi-agent systems, and AI safety, aiming to align technological progress with societal benefits.
Details
Motivation: To address the challenges in designing, evaluating, and improving intelligent agents by integrating insights from cognitive science, neuroscience, and computational research.
Method: The study is structured into four parts: modular foundations of agents, self-enhancement mechanisms, multi-agent systems, and AI safety strategies.
Result: A comprehensive framework for intelligent agents that combines modularity, adaptability, collective intelligence, and ethical considerations.
Conclusion: The survey highlights research challenges and opportunities, advocating for innovations that balance technological advancement with societal good.
Abstract: The advent of large language models (LLMs) has catalyzed a transformative shift in artificial intelligence, paving the way for advanced intelligent agents capable of sophisticated reasoning, robust perception, and versatile action across diverse domains. As these agents increasingly drive AI research and practical applications, their design, evaluation, and continuous improvement present intricate, multifaceted challenges. This book provides a comprehensive overview, framing intelligent agents within modular, brain-inspired architectures that integrate principles from cognitive science, neuroscience, and computational research. We structure our exploration into four interconnected parts. First, we systematically investigate the modular foundation of intelligent agents, mapping their cognitive, perceptual, and operational modules onto analogous human brain functionalities and elucidating core components such as memory, world modeling, reward processing, goal, and emotion. Second, we discuss self-enhancement and adaptive evolution mechanisms, exploring how agents autonomously refine their capabilities, adapt to dynamic environments, and achieve continual learning through automated optimization paradigms. Third, we examine multi-agent systems, investigating the collective intelligence emerging from agent interactions, cooperation, and societal structures. Finally, we address the critical imperative of building safe and beneficial AI systems, emphasizing intrinsic and extrinsic security threats, ethical alignment, robustness, and practical mitigation strategies necessary for trustworthy real-world deployment. By synthesizing modular AI architectures with insights from different disciplines, this survey identifies key research challenges and opportunities, encouraging innovations that harmonize technological advancement with meaningful societal benefit.
[643] Algorithm Discovery With LLMs: Evolutionary Search Meets Reinforcement Learning
Anja Surina, Amin Mansouri, Lars Quaedvlieg, Amal Seddas, Maryna Viazovska, Emmanuel Abbe, Caglar Gulcehre
Main category: cs.AI
TL;DR: Proposes RL fine-tuning of LLMs in evolutionary search to accelerate algorithm discovery, outperforming static LLM approaches.
Details
Motivation: Existing methods treat LLMs as static generators, missing opportunities to refine them with evolutionary signals.
Method: Augments LLM-based evolutionary search with RL fine-tuning, using evolutionary search for exploration and RL for policy optimization.
Result: Experiments show RL-enhanced evolutionary search discovers superior algorithms faster.
Conclusion: RL-augmented evolutionary search holds promise for efficient algorithm design.
Abstract: Discovering efficient algorithms for solving complex problems has been an outstanding challenge in mathematics and computer science, requiring substantial human expertise over the years. Recent advancements in evolutionary search with large language models (LLMs) have shown promise in accelerating the discovery of algorithms across various domains, particularly in mathematics and optimization. However, existing approaches treat the LLM as a static generator, missing the opportunity to update the model with the signal obtained from evolutionary exploration. In this work, we propose to augment LLM-based evolutionary search by continuously refining the search operator - the LLM - through reinforcement learning (RL) fine-tuning. Our method leverages evolutionary search as an exploration strategy to discover improved algorithms, while RL optimizes the LLM policy based on these discoveries. Our experiments on combinatorial optimization tasks demonstrate that integrating RL with evolutionary search accelerates the discovery of superior algorithms, showcasing the potential of RL-enhanced evolutionary strategies for algorithm design.
[644] Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory
Yexiang Liu, Zekun Li, Zhi Fang, Nan Xu, Ran He, Tieniu Tan
Main category: cs.AI
TL;DR: The paper investigates how different reasoning prompting strategies scale with test-time compute in LLMs, finding that simpler strategies like Chain-of-Thought outperform complex ones as compute increases. It also proposes methods to predict and improve scaling performance.
Details
Motivation: To understand how prompting strategies perform under scaling (e.g., majority voting) and identify efficient methods to predict and enhance performance without resource-intensive processes.
Method: Systematic experiments on 6 LLMs, 8 prompting strategies, and 6 benchmarks, along with theoretical analysis and a probabilistic prediction method.
Result: Simpler prompting strategies (e.g., Chain-of-Thought) outperform complex ones as compute scales. Theoretical proofs and practical methods for predicting and improving scaling performance are provided.
Conclusion: The research encourages reevaluating complex prompting strategies, highlights the potential of simpler ones, and offers insights for improving test-time scaling performance.
Abstract: Recently, scaling test-time compute on Large Language Models (LLMs) has garnered wide attention. However, there has been limited investigation of how various reasoning prompting strategies perform under scaling. In this paper, we focus on a standard and realistic scaling setting: majority voting. We systematically conduct experiments on 6 LLMs $\times$ 8 prompting strategies $\times$ 6 benchmarks. Experimental results consistently show that as the sampling time and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We analyze this phenomenon and provide theoretical proofs. Additionally, we propose a probabilistic method to efficiently predict scaling performance and identify the best prompting strategy under large sampling times, eliminating the need for resource-intensive inference processes in practical applications. Furthermore, we introduce two ways derived from our theoretical analysis to significantly improve the scaling performance. We hope that our research can prompt a re-examination of the role of complicated prompting, unleash the potential of simple prompting strategies, and provide new insights for enhancing test-time scaling performance. Code is available at https://github.com/MraDonkey/rethinking_prompting.
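The scaling effect is easy to reproduce with a binomial back-of-the-envelope: under i.i.d. sampling, majority voting amplifies even a small per-sample accuracy edge as the sample count grows (binary-outcome simplification, toy numbers):

```python
# Worked example: probability that majority voting over n i.i.d. samples is
# correct, given per-sample accuracy p (odd n so there are no ties).
from math import comb

def p_majority(p: float, n: int) -> float:
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 5, 25, 101):
    print(n, round(p_majority(0.55, n), 3), round(p_majority(0.52, n), 3))
# The p=0.55 strategy pulls away from p=0.52 as n grows: a simple prompt with a
# small per-sample edge eventually dominates a fancier one under voting.
```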
[645] Enhancing AI System Resiliency: Formulation and Guarantee for LSTM Resilience Based on Control Theory
Sota Yoshihara, Ryosuke Yamamoto, Hiroyuki Kusumoto, Masanari Shimura
Main category: cs.AI
TL;DR: The paper introduces a resilience metric for LSTM networks in control systems, focusing on recovery time after anomalies, and provides a data-independent upper bound for it.
Details
Motivation: To ensure the resilience of LSTM networks in safety-critical AI applications by quantifying and controlling recovery time after disturbances.
Method: Refines incremental input-to-state stability theory for LSTM to derive a resilience metric (recovery time) and a data-independent upper bound, enabling resilience-aware training.
Result: Experimental validation shows effectiveness in estimating and controlling resilience, supporting rigorous quality assurance.
Conclusion: The framework enhances LSTM resilience in safety-critical applications through theoretical and practical advancements.
Abstract: This paper proposes a novel theoretical framework for guaranteeing and evaluating the resilience of long short-term memory (LSTM) networks in control systems. We introduce “recovery time” as a new metric of resilience in order to quantify the time required for an LSTM to return to its normal state after anomalous inputs. By mathematically refining incremental input-to-state stability ($\delta$ISS) theory for LSTM, we derive a practical data-independent upper bound on recovery time. This upper bound enables resilience-aware training. Experimental validation on simple models demonstrates the effectiveness of our resilience estimation and control methods, providing a foundation for rigorous quality assurance in safety-critical AI applications.
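Recovery time can also be estimated empirically, as a complement to the paper's theoretical bound: run the same LSTM on a nominal and an anomaly-injected stream and report how long the perturbed hidden trajectory takes to re-enter an epsilon-ball around the nominal one. The burst shape and epsilon below are assumptions:

```python
# Empirical recovery-time sketch: distance between hidden trajectories on a
# nominal vs. anomaly-injected input stream (burst and eps are assumptions).
import torch

torch.manual_seed(0)
lstm = torch.nn.LSTM(input_size=4, hidden_size=16, batch_first=True)
T, t_anom = 200, 50
x = torch.randn(1, T, 4)
x_pert = x.clone()
x_pert[:, t_anom:t_anom + 5] += 10.0             # short anomalous burst

h, _ = lstm(x)
h_pert, _ = lstm(x_pert)
dist = (h - h_pert).norm(dim=-1).squeeze(0)      # per-step hidden-state distance

eps = 1e-2
recovered = (dist[t_anom + 5:] < eps).nonzero()
recovery_time = int(recovered[0]) if len(recovered) else None  # steps to recover
```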
[646] Generative AI as a Pillar for Predicting 2D and 3D Wildfire Spread: Beyond Physics-Based Models and Traditional Deep Learning
Haowen Xu, Sisi Zlatanova, Ruiyu Liang, Ismet Canbulat
Main category: cs.AI
TL;DR: The paper proposes using generative AI models (GANs, VAEs, Transformers) for wildfire prediction, addressing limitations of traditional methods. It reviews applications, highlights advantages, and suggests future directions like multimodal modeling and real-time scenario generation.
Details
Motivation: Wildfires pose growing threats, but current models fail to dynamically predict spread in 2D/3D domains or integrate real-time geospatial data effectively.
Method: The study reviews generative AI applications in wildfire prediction, leveraging LLMs for literature synthesis and knowledge extraction.
Result: Generative AI models excel in uncertainty management, multimodal input integration, and realistic scenario generation, outperforming traditional methods.
Conclusion: The paper advocates for a shift to multimodal generative frameworks to enhance proactive wildfire response, outlining five future research directions.
Abstract: Wildfires increasingly threaten human life, ecosystems, and infrastructure, with events like the 2025 Palisades and Eaton fires in Los Angeles County underscoring the urgent need for more advanced prediction frameworks. Existing physics-based and deep learning models struggle to capture dynamic wildfire spread across both 2D and 3D domains, especially when incorporating real-time, multimodal geospatial data. This paper explores how generative Artificial Intelligence (AI) models-such as GANs, VAEs, and Transformers-can serve as transformative tools for wildfire prediction and simulation. These models offer superior capabilities in managing uncertainty, integrating multimodal inputs, and generating realistic, scalable wildfire scenarios. We introduce a new paradigm that leverages large language models (LLMs) for literature synthesis, classification, and knowledge extraction, conducting a systematic review of recent studies applying generative AI to fire prediction and monitoring. We highlight how generative approaches uniquely address challenges faced by traditional simulation and deep learning methods. Finally, we outline five key future directions for generative AI in wildfire management, including unified multimodal modeling of 2D and 3D dynamics, agentic AI systems and chatbots for decision intelligence, and real-time scenario generation on mobile devices, along with a discussion of critical challenges. Our findings advocate for a paradigm shift toward multimodal generative frameworks to support proactive, data-informed wildfire response.
[647] FloorplanMAE:A self-supervised framework for complete floorplan generation from partial inputs
Jun Yin, Jing Zhong, Pengyu Zeng, Peilin Li, Miao Zhang, Ran Luo, Shuai Lu
Main category: cs.AI
TL;DR: FloorplanMAE is a self-supervised learning framework that predicts complete floorplans from partial ones, improving design efficiency.
Details
Motivation: To assist architects by reducing repetitive modifications and speeding up preliminary design generation.
Method: Uses Masked Autoencoders (MAE) and a lightweight Vision Transformer (ViT) trained on the FloorplanNet dataset.
Result: Generates high-quality complete floorplans from incomplete inputs, outperforming benchmarks.
Conclusion: FloorplanMAE offers a scalable solution with broad application potential in architectural design.
Abstract: In the architectural design process, floorplan design is often a dynamic and iterative process. Architects progressively draw various parts of the floorplan according to their ideas and requirements, continuously adjusting and refining throughout the design process. Therefore, the ability to predict a complete floorplan from a partial one holds significant value in the design process. Such prediction can help architects quickly generate preliminary designs, improve design efficiency, and reduce the workload associated with repeated modifications. To address this need, we propose FloorplanMAE, a self-supervised learning framework for restoring incomplete floor plans into complete ones. First, we developed a floor plan reconstruction dataset, FloorplanNet, specifically trained on architectural floor plans. Second, we propose a floor plan reconstruction method based on Masked Autoencoders (MAE), which reconstructs missing parts by masking sections of the floor plan and training a lightweight Vision Transformer (ViT). We evaluated the reconstruction accuracy of FloorplanMAE and compared it with state-of-the-art benchmarks. Additionally, we validated the model using real sketches from the early stages of architectural design. Experimental results show that the FloorplanMAE model can generate high-quality complete floor plans from incomplete partial plans. This framework provides a scalable solution for floor plan generation, with broad application prospects.
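The MAE ingredient reduces to a simple masking step: hide a random subset of floorplan patches, encode only the visible ones, and penalize reconstruction on the masked positions. A sketch with assumed sizes and mask ratio:

```python
# Sketch of MAE-style random masking: keep a random subset of patches for the
# encoder (mask ratio and tensor sizes are assumptions, not the paper's).
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)
    keep = noise.argsort(dim=1)[:, :n_keep]            # random indices to keep
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep

patches = torch.randn(8, 196, 256)                     # (batch, patches, dim)
visible, keep = random_masking(patches)
# The encoder sees only `visible`; a light decoder predicts the masked patches,
# and the loss is computed on masked positions only, as in standard MAE.
```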
[648] A Conjecture on a Fundamental Trade-Off between Certainty and Scope in Symbolic and Generative AI
Luciano Floridi
Main category: cs.AI
TL;DR: The paper introduces a conjecture about a trade-off in AI systems between provable correctness and broad data-mapping capacity, impacting engineering and philosophical expectations.
Details
Motivation: To formalize the implicit trade-off in AI systems between deductive guarantees and operational flexibility, and its implications for AI design and trustworthiness.
Method: The conjecture is stated in information-theoretic terms and analyzed through historical context, epistemology, formal verification, and philosophy of technology.
Result: The conjecture reframes AI evaluation standards, governance, and hybrid system design, emphasizing the need for proof or refutation.
Conclusion: The conjecture’s validation is crucial for the future of trustworthy AI, influencing standards and frameworks.
Abstract: This article introduces a conjecture that formalises a fundamental trade-off between provable correctness and broad data-mapping capacity in Artificial Intelligence (AI) systems. When an AI system is engineered for deductively watertight guarantees (demonstrable certainty about the error-free nature of its outputs) – as in classical symbolic AI – its operational domain must be narrowly circumscribed and pre-structured. Conversely, a system that can input high-dimensional data to produce rich information outputs – as in contemporary generative models – necessarily relinquishes the possibility of zero-error performance, incurring an irreducible risk of errors or misclassification. By making this previously implicit trade-off explicit and open to rigorous verification, the conjecture significantly reframes both engineering ambitions and philosophical expectations for AI. After reviewing the historical motivations for this tension, the article states the conjecture in information-theoretic form and contextualises it within broader debates in epistemology, formal verification, and the philosophy of technology. It then offers an analysis of its implications and consequences, drawing on notions of underdetermination, prudent epistemic risk, and moral responsibility. The discussion clarifies how, if correct, the conjecture would help reshape evaluation standards, governance frameworks, and hybrid system design. The conclusion underscores the importance of eventually proving or refuting the inequality for the future of trustworthy AI.
[649] Graph of Verification: Structured Verification of LLM Reasoning with Directed Acyclic Graphs
Jiwei Fang, Bin Zhang, Changwei Wang, Jin Wan, Zhiwei Xu
Main category: cs.AI
TL;DR: The paper introduces the Graph of Verification (GoV), a flexible framework for verifying multi-step reasoning in LLMs, addressing the adaptability gap in existing methods.
Details
Motivation: Existing verification methods for LLMs are rigid and struggle with diverse reasoning structures, necessitating a more adaptable solution.
Method: GoV uses a flexible ’node block’ architecture to adjust verification granularity, from atomic steps to entire paragraphs, matching the reasoning process’s structure.
Result: Experiments show GoV outperforms holistic and decomposition-based methods, setting a new standard for training-free reasoning verification.
Conclusion: GoV resolves the trade-off between precision and robustness, offering a versatile solution for verifying LLM reasoning.
Abstract: Verifying the complex and multi-step reasoning of Large Language Models (LLMs) is a critical challenge, as holistic methods often overlook localized flaws. Step-by-step validation is a promising alternative, yet existing methods are often rigid. They struggle to adapt to diverse reasoning structures, from formal proofs to informal natural language narratives. To address this adaptability gap, we propose the Graph of Verification (GoV), a novel framework for adaptable and multi-granular verification. GoV’s core innovation is its flexible “node block” architecture. This mechanism allows GoV to adaptively adjust its verification granularity–from atomic steps for formal tasks to entire paragraphs for natural language–to match the native structure of the reasoning process. This flexibility allows GoV to resolve the fundamental trade-off between verification precision and robustness. Experiments on both well-structured and loosely-structured benchmarks demonstrate GoV’s versatility. The results show that GoV’s adaptive approach significantly outperforms both holistic baselines and other state-of-the-art decomposition-based methods, establishing a new standard for training-free reasoning verification.
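The node-block idea can be sketched as verification in topological order over a DAG of reasoning blocks, where each block is checked given its parents; the `verify` stub below stands in for an LLM judge, and the graph is hypothetical:

```python
# Sketch of DAG-ordered verification: each node block is accepted only if its
# parents were accepted and it passes a local check (stubbed LLM judge).
import networkx as nx

dag = nx.DiGraph()
dag.add_edges_from([("premise", "step1"), ("step1", "step2"), ("premise", "step2")])

def verify(block: str, parents: list[str]) -> bool:
    # Placeholder: ask a judge "does `block` follow from `parents`?"
    return True

def verify_reasoning(dag) -> bool:
    ok = {}
    for node in nx.topological_sort(dag):
        parents = list(dag.predecessors(node))
        ok[node] = all(ok[p] for p in parents) and verify(node, parents)
    return all(ok.values())

print(verify_reasoning(dag))  # granularity is set by how blocks are chunked
```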
[650] Active Inference AI Systems for Scientific Discovery
Karthik Duraisamy
Main category: cs.AI
TL;DR: The paper argues for closing gaps in abstraction, reasoning, and empirical grounding in AI to enable genuine scientific discovery, proposing design principles for systems that integrate human judgment and evaluate novelty and hypothesis generation.
Details
Motivation: Current AI systems lack the ability to drive genuine scientific discovery due to gaps in abstraction, reasoning, and empirical grounding.
Method: Proposes design principles for AI systems, including causal models, scientific memory, and formal verification, while emphasizing human judgment.
Result: The framework suggests systems should reason in imaginary spaces, learn from the world, and integrate human judgment for discovery.
Conclusion: Human judgment is essential in AI systems for scientific discovery, and evaluations should focus on novelty, hypothesis generation, and experimental guidance.
Abstract: The rapid evolution of artificial intelligence has led to expectations of transformative impact on science, yet current systems remain fundamentally limited in enabling genuine scientific discovery. This perspective contends that progress turns on closing three mutually reinforcing gaps in abstraction, reasoning and empirical grounding. Central to addressing these gaps is recognizing complementary cognitive modes: thinking as slow, iterative hypothesis generation – exploring counterfactual spaces where physical laws can be temporarily violated to discover new patterns – and reasoning as fast, deterministic validation, traversing established knowledge graphs to test consistency with known principles. Abstractions in this loop should be manipulable models that enable counterfactual prediction, causal attribution, and refinement. Design principles – rather than a monolithic recipe – are proposed for systems that reason in imaginary spaces and learn from the world: causal, multimodal models for internal simulation; persistent, uncertainty-aware scientific memory that distinguishes hypotheses from established claims; formal verification pathways coupled to computations and experiments. It is also argued that the inherent ambiguity in feedback from simulations and experiments, and underlying uncertainties make human judgment indispensable, not as a temporary scaffold but as a permanent architectural component. Evaluations must assess the system’s ability to identify novel phenomena, propose falsifiable hypotheses, and efficiently guide experimental programs toward genuine discoveries.
[651] Hierarchical Reasoning Model
Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, Yasin Abbasi Yadkori
Main category: cs.AI
TL;DR: The paper introduces the Hierarchical Reasoning Model (HRM), a novel AI architecture inspired by human brain processing, which outperforms current methods like Chain-of-Thought (CoT) in efficiency, stability, and performance on complex reasoning tasks with minimal data.
Details
Motivation: Current AI reasoning methods like CoT are brittle, data-hungry, and slow. HRM aims to address these limitations by mimicking the brain's hierarchical and multi-timescale processing.
Method: HRM uses two interdependent recurrent modules: a high-level module for abstract planning and a low-level module for detailed computations, enabling efficient reasoning in a single forward pass without intermediate supervision.
Result: HRM achieves near-perfect performance on tasks like Sudoku and maze pathfinding with only 27M parameters and 1000 training samples, outperforming larger models on benchmarks like ARC.
Conclusion: HRM represents a significant step toward universal computation and general-purpose reasoning, offering a scalable and efficient alternative to current methods.
Abstract: Reasoning, the process of devising and executing complex goal-oriented action sequences, remains a critical challenge in AI. Current large language models (LLMs) primarily employ Chain-of-Thought (CoT) techniques, which suffer from brittle task decomposition, extensive data requirements, and high latency. Inspired by the hierarchical and multi-timescale processing in the human brain, we propose the Hierarchical Reasoning Model (HRM), a novel recurrent architecture that attains significant computational depth while maintaining both training stability and efficiency. HRM executes sequential reasoning tasks in a single forward pass without explicit supervision of the intermediate process, through two interdependent recurrent modules: a high-level module responsible for slow, abstract planning, and a low-level module handling rapid, detailed computations. With only 27 million parameters, HRM achieves exceptional performance on complex reasoning tasks using only 1000 training samples. The model operates without pre-training or CoT data, yet achieves nearly perfect performance on challenging tasks including complex Sudoku puzzles and optimal path finding in large mazes. Furthermore, HRM outperforms much larger models with significantly longer context windows on the Abstraction and Reasoning Corpus (ARC), a key benchmark for measuring artificial general intelligence capabilities. These results underscore HRM’s potential as a transformative advancement toward universal computation and general-purpose reasoning systems.
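The two-timescale structure can be sketched with a pair of coupled recurrent cells, the low-level cell stepping every tick and the high-level cell updating once every T ticks; sizes and the coupling scheme are illustrative, not HRM's exact design:

```python
# Sketch of two coupled recurrent modules at different timescales (illustrative
# coupling: fast cell conditions on the slow state; slow cell updates every T).
import torch

class TwoTimescale(torch.nn.Module):
    def __init__(self, d_in=32, d_low=64, d_high=64, T=8):
        super().__init__()
        self.low = torch.nn.GRUCell(d_in + d_high, d_low)   # fast, detailed
        self.high = torch.nn.GRUCell(d_low, d_high)         # slow, abstract
        self.T = T

    def forward(self, x):                       # x: (batch, steps, d_in)
        B, S, _ = x.shape
        h_lo = x.new_zeros(B, self.low.hidden_size)
        h_hi = x.new_zeros(B, self.high.hidden_size)
        for t in range(S):
            h_lo = self.low(torch.cat([x[:, t], h_hi], dim=-1), h_lo)
            if (t + 1) % self.T == 0:           # infrequent high-level update
                h_hi = self.high(h_lo, h_hi)
        return h_lo, h_hi

h_lo, h_hi = TwoTimescale()(torch.randn(4, 64, 32))
```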
[652] Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation
Jungkoo Kang
Main category: cs.AI
TL;DR: NL2Flow automates workflow planning problem generation for LLMs, evaluating their performance in generating valid and optimal plans, and explores natural language-to-JSON translation for workflows.
Details
Motivation: Address the scarcity of scalable, reliable evaluation data for LLM planning and reasoning by identifying a suitable workflow domain.
Method: Introduces NL2Flow, a system for generating planning problems in natural language, structured representation, and PDDL, and evaluates LLMs on plan generation and natural language-to-JSON translation.
Result: Top-performing LLM achieved 86% success in valid plans and 69% in optimal plans; natural language-to-JSON translation had lower success.
Conclusion: LLMs perform better reasoning directly from natural language to action, and understanding error sources is key for scaling to complex problems.
Abstract: Effective agent performance relies on the ability to compose tools and agents into effective workflows. However, progress in Large Language Model (LLM) planning and reasoning is limited by the scarcity of scalable, reliable evaluation data. This study addresses this limitation by identifying a suitable workflow domain for LLM application. I introduce NL2Flow, a fully automated system for parametrically generating planning problems, which are expressed in natural language, a structured intermediate representation, and formal PDDL, and rigorously evaluating the quality of generated plans. NL2Flow generates a dataset of 2296 low-difficulty problems in automated workflow generation and evaluates multiple open-sourced, instruct-tuned LLMs without task-specific optimization or architectural modifications. Results reveal that the highest performing model achieved 86% success in generating valid plans and 69% in generating optimal plans, specifically for problems with feasible plans. Regression analysis shows that the influence of problem characteristics on plan generation is contingent on both model and prompt design. To investigate the potential of LLMs as natural language-to-JSON translators for workflow definition, and to facilitate integration with downstream symbolic computation tools and a symbolic planner, I evaluated the LLM’s translation performance on natural language workflow descriptions. I observed that translating natural language into a JSON representation of a workflow problem yielded a lower success rate than generating a plan directly, suggesting that unnecessary decomposition of the reasoning task may degrade performance and highlighting the benefit of models capable of reasoning directly from natural language to action. As LLM reasoning scales to increasingly complex problems, understanding the shifting bottlenecks and sources of error within these systems will be crucial.
[653] Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky
Ashutosh Hathidara, Julien Yu, Sebastian Schreiber
Main category: cs.AI
TL;DR: DiaFORGE improves LLM tool-invocation success by 27-49 pp over GPT-4o and Claude-3.5-Sonnet via a three-stage pipeline involving dialogue synthesis, fine-tuning, and dynamic evaluation.
Details
Motivation: LLMs struggle with near-duplicate tools and underspecified arguments when invoking enterprise APIs.
Method: DiaFORGE uses a three-stage pipeline: (i) synthesizing disambiguation dialogues, (ii) fine-tuning models with reasoning traces, and (iii) dynamic evaluation.
Result: Tool-invocation success increased by 27 pp over GPT-4o and 49 pp over Claude-3.5-Sonnet on DiaBENCH.
Conclusion: DiaFORGE offers a reliable blueprint for enterprise-ready tool-calling agents, supported by an open corpus of API specs and dialogues.
Abstract: Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning of open-source models with reasoning traces across 3B - 70B parameters, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting. To spur further research, we release an open corpus of 5000 production-grade enterprise API specifications paired with rigorously validated, disambiguation-focused dialogues, offering a practical blueprint for building reliable, enterprise-ready tool-calling agents.
[654] Modeling Deontic Modal Logic in the s(CASP) Goal-directed Predicate Answer Set Programming System
Gopal Gupta, Abhiramon Rajasekharan, Alexis R. Tudor, Elmer Salazar, Joaquín Arias
Main category: cs.AI
TL;DR: The paper proposes using answer set programming (ASP) to elegantly implement deontic modal logic, resolving its paradoxes.
Details
Motivation: To address the challenge of implementing deontic modal logic by leveraging ASP's features like default and strong negation.
Method: Represent deontic modal operators using ASP’s default negation and strong negation, and use global constraints for obligations and impermissibilities.
Result: The proposed representation resolves various paradoxes of deontic modal logic.
Conclusion: ASP provides an effective framework for implementing and resolving paradoxes in deontic modal logic.
Abstract: We consider the problem of implementing deontic modal logic. We show how (deontic) modal operators can be expressed elegantly using default negation (negation-as-failure) and strong negation present in answer set programming (ASP). We propose using global constraints of ASP to represent obligations and impermissibilities of deontic modal logic. We show that our proposed representation results in the various paradoxes of deontic modal logic being elegantly resolved.
[655] MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models
Zhiwei Liu, Jielin Qiu, Shiyu Wang, Jianguo Zhang, Zuxin Liu, Roshan Ram, Haolin Chen, Weiran Yao, Shelby Heinecke, Silvio Savarese, Huan Wang, Caiming Xiong
Main category: cs.AI
TL;DR: MCPEval is an open-source framework for automated, scalable evaluation of LLM agents, addressing limitations of static benchmarks and manual data collection.
Details
Motivation: The rise of LLM-based agents necessitates robust evaluation methods, as current approaches are static and labor-intensive.
Method: MCPEval uses Model Context Protocol (MCP) to automate task generation and deep evaluation, standardizing metrics and integrating with agent tools.
Result: Empirical tests across five domains demonstrate MCPEval’s effectiveness in revealing domain-specific performance nuances.
Conclusion: MCPEval is released publicly to promote standardized, reproducible evaluation of LLM agents.
Abstract: The rapid rise of Large Language Models (LLMs)-based intelligent agents underscores the need for robust, scalable evaluation frameworks. Existing methods rely on static benchmarks and labor-intensive data collection, limiting practical assessment. We introduce MCPEval, an open-source Model Context Protocol (MCP)-based framework that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance. We publicly release MCPEval https://github.com/SalesforceAIResearch/MCPEval to promote reproducible and standardized LLM agent evaluation.
[656] CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, Chris Shum
Main category: cs.AI
TL;DR: CUDA-L1, an automated reinforcement learning framework, significantly improves CUDA optimization, achieving speedups up to x120, and demonstrates portability across GPU architectures.
Details
Motivation: The urgent need for automated CUDA optimization due to rising GPU demand, and the limitations of current LLMs in improving CUDA speed.
Method: CUDA-L1 employs a novel contrastive RL algorithm to optimize CUDA kernels without human expertise.
Result: Average speedup of x3.12 on A100, with portability to other GPUs (e.g., x3.12 on L40, x2.50 on RTX 3090). Peak speedup reaches x120.
Conclusion: RL can effectively optimize CUDA without domain knowledge, but challenges like reward hacking must be addressed for robust training.
Abstract: The exponential growth in demand for GPU computing resources has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization that employs a novel contrastive RL algorithm. CUDA-L1 achieves significant performance improvements on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of x3.12 with a median speedup of x1.42 across all 250 CUDA kernels of KernelBench, with peak speedups reaching x120. Furthermore, the model also demonstrates portability across GPU architectures, achieving average speedups of x3.12 on L40, x2.50 on RTX 3090, x2.39 on H100, and x2.37 on H20 despite being optimized specifically for A100. The capabilities of CUDA-L1 demonstrate that RL can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially promote GPU efficiency and alleviate the rising pressure on GPU computing resources. We also identify important challenges posed by training RL models for tasks like CUDA development, where RL often learns to exploit loopholes in reward functions rather than solve the intended optimization problems. By identifying these failure modes and analyzing their root causes, we develop practical methods for creating more robust training procedures that prevent reward hacking.
[657] MeLA: A Metacognitive LLM-Driven Architecture for Automatic Heuristic Design
Zishang Qiu, Xinan Chen, Long Chen, Ruibin Bai
Main category: cs.AI
TL;DR: MeLA introduces a metacognitive LLM-driven architecture for Automatic Heuristic Design (AHD), evolving prompts instead of heuristic code, outperforming traditional methods.
Details
Motivation: Traditional evolutionary methods directly manipulate heuristic code, lacking interpretability and robustness. MeLA aims to improve AHD by leveraging metacognitive LLM-driven prompt evolution.
Method: MeLA uses a metacognitive framework with a problem analyzer, error diagnosis system, and metacognitive search engine to iteratively refine prompts for heuristic generation.
Result: MeLA outperforms state-of-the-art methods, generating more effective and robust heuristics in benchmark and real-world problems.
Conclusion: The research highlights the potential of cognitive science-inspired AI architectures, showing metacognitive LLM regulation enhances AHD robustness and interpretability.
Abstract: This paper introduces MeLA, a Metacognitive LLM-Driven Architecture that presents a new paradigm for Automatic Heuristic Design (AHD). Traditional evolutionary methods operate directly on heuristic code; in contrast, MeLA evolves the instructional prompts used to guide a Large Language Model (LLM) in generating these heuristics. This process of “prompt evolution” is driven by a novel metacognitive framework where the system analyzes performance feedback to systematically refine its generative strategy. MeLA’s architecture integrates a problem analyzer to construct an initial strategic prompt, an error diagnosis system to repair faulty code, and a metacognitive search engine that iteratively optimizes the prompt based on heuristic effectiveness. In comprehensive experiments across both benchmark and real-world problems, MeLA consistently generates more effective and robust heuristics, significantly outperforming state-of-the-art methods. Ultimately, this research demonstrates the profound potential of using cognitive science as a blueprint for AI architecture, revealing that by enabling an LLM to metacognitively regulate its problem-solving process, we unlock a more robust and interpretable path to AHD.
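A minimal sketch of the prompt-evolution loop, assuming hypothetical `llm(prompt) -> code` and `evaluate(code) -> (score, error)` interfaces; MeLA's actual analyzer, diagnosis prompts, and search engine are more elaborate.

```python
def evolve_prompt(llm, evaluate, seed_prompt: str, iterations: int = 10) -> str:
    """Generate a heuristic from the current prompt, diagnose failures,
    and ask the LLM to revise the *prompt* rather than the heuristic
    code, using performance feedback as the metacognitive signal."""
    best_prompt, best_score = seed_prompt, float("-inf")
    prompt = seed_prompt
    for _ in range(iterations):
        code = llm(prompt)
        score, error = evaluate(code)
        if error:  # error-diagnosis step: fold the failure into the prompt
            prompt = llm("Revise this prompt so the generated code avoids "
                         f"the error below.\nPrompt: {prompt}\nError: {error}")
            continue
        if score > best_score:
            best_prompt, best_score = prompt, score
        # metacognitive search step: refine the prompt from the score
        prompt = llm(f"The heuristic from this prompt scored {score:.3f}. "
                     f"Rewrite the prompt to target its weaknesses.\nPrompt: {prompt}")
    return best_prompt
```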
[658] Validating Pharmacogenomics Generative Artificial Intelligence Query Prompts Using Retrieval-Augmented Generation (RAG)
Ashley Rector, Keaton Minor, Kamden Minor, Jeff McCormack, Beth Breeden, Ryan Nowers, Jay Dorris
Main category: cs.AI
TL;DR: Sherpa Rx, an AI tool for pharmacogenomics, was evaluated using CPIC and PharmGKB data. It showed high performance in accuracy, relevance, clarity, and completeness, outperforming ChatGPT-4omini.
Details
Motivation: To validate the performance of Sherpa Rx in generating accurate and contextually relevant pharmacogenomics responses.
Method: Used a dataset of 260 queries across 26 CPIC guidelines, comparing responses in two phases (CPIC-only and CPIC+PharmGKB) and against ChatGPT-4omini. Metrics included accuracy, relevance, clarity, completeness, and recall.
Result: Sherpa Rx achieved high scores (e.g., accuracy 4.9, recall 0.99) and outperformed ChatGPT-4omini. Phase 2 (CPIC+PharmGKB) showed slight improvements but no significant difference in accuracy.
Conclusion: Integrating CPIC and PharmGKB with RAG enhances AI performance, demonstrating Sherpa Rx’s potential for accurate, personalized pharmacogenomics decision-making.
Abstract: This study evaluated Sherpa Rx, an artificial intelligence tool leveraging large language models and retrieval-augmented generation (RAG) for pharmacogenomics, to validate its performance on key response metrics. Sherpa Rx integrated Clinical Pharmacogenetics Implementation Consortium (CPIC) guidelines with Pharmacogenomics Knowledgebase (PharmGKB) data to generate contextually relevant responses. A dataset (N=260 queries) spanning 26 CPIC guidelines was used to evaluate drug-gene interactions, dosing recommendations, and therapeutic implications. In Phase 1, only CPIC data was embedded. Phase 2 additionally incorporated PharmGKB content. Responses were scored on accuracy, relevance, clarity, completeness (5-point Likert scale), and recall. Wilcoxon signed-rank tests compared accuracy between Phase 1 and Phase 2, and between Phase 2 and ChatGPT-4omini. A 20-question quiz assessed the tool's real-world applicability against other models. In Phase 1 (N=260), Sherpa Rx demonstrated high performance: accuracy 4.9, relevance 5.0, clarity 5.0, completeness 4.8, and recall 0.99. The subset analysis (N=20) showed improvements in accuracy (4.6 vs. 4.4, Phase 2 vs. Phase 1 subset) and completeness (5.0 vs. 4.8). ChatGPT-4omini performed comparably in relevance (5.0) and clarity (4.9) but lagged in accuracy (3.9) and completeness (4.2). The difference in accuracy between Phase 1 and Phase 2 was not statistically significant. However, Phase 2 significantly outperformed ChatGPT-4omini. On the 20-question quiz, Sherpa Rx achieved 90% accuracy, outperforming other models. Integrating additional resources like CPIC and PharmGKB with RAG enhances AI accuracy and performance. This study highlights the transformative potential of generative AI like Sherpa Rx in pharmacogenomics, improving decision-making with accurate, personalized responses.
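For readers unfamiliar with the test used here, the sketch below runs a Wilcoxon signed-rank test on paired Likert scores with SciPy. The numbers are invented for illustration only and are not the study's data.

```python
from scipy.stats import wilcoxon

# Invented paired 5-point accuracy scores for 20 queries under two
# conditions, mimicking the study's N=20 subset comparison.
phase1 = [5, 4, 5, 4, 4, 5, 5, 4, 5, 4, 5, 5, 4, 4, 5, 5, 4, 5, 4, 5]
phase2 = [5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 5]

# Zero differences are dropped by default; with few distinct difference
# values SciPy may fall back to a normal approximation and warn.
stat, p = wilcoxon(phase1, phase2)
print(f"W={stat:.1f}, p={p:.3f}")  # p >= 0.05 would match the reported non-significance
```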
[659] Collaborative Medical Triage under Uncertainty: A Multi-Agent Dynamic Matching Approach
Hongyan Cheng, Chengzhang Yu, Yanshu Shi, Chiyue Wang, Cong Liu, Zhanpeng Jin
Main category: cs.AI
TL;DR: An AI-driven multi-agent system improves medical triage accuracy by addressing misclassification, department heterogeneity, and inefficient questioning, achieving 89.6% primary and 74.3% secondary department accuracy.
Details
Motivation: The post-pandemic healthcare demand surge and nursing shortages necessitate innovative AI solutions for efficient and accurate medical triage.
Method: The system uses three specialized agents (RecipientAgent, InquirerAgent, DepartmentAgent) with Inquiry and Classification Guidance Mechanisms to transform symptoms into department recommendations, evaluated on a dataset of 3,360 real-world cases.
Result: The system achieves 89.6% accuracy in primary and 74.3% in secondary department classification after four patient interactions.
Conclusion: The multi-agent system adapts to hospital heterogeneity and ensures clinically sound triage decisions.
Abstract: The post-pandemic surge in healthcare demand, coupled with critical nursing shortages, has placed unprecedented pressure on medical triage systems, necessitating innovative AI-driven solutions. We present a multi-agent interactive intelligent system for medical triage that addresses three fundamental challenges in current AI-based triage systems: inadequate medical specialization leading to misclassification, heterogeneous department structures across healthcare institutions, and inefficient detail-oriented questioning that impedes rapid triage decisions. Our system employs three specialized agents–RecipientAgent, InquirerAgent, and DepartmentAgent–that collaborate through an Inquiry Guidance Mechanism and a Classification Guidance Mechanism to transform unstructured patient symptoms into accurate department recommendations. To ensure robust evaluation, we constructed a comprehensive Chinese medical triage dataset from “Ai Ai Yi Medical Network”, comprising 3,360 real-world cases spanning 9 primary departments and 62 secondary departments. Experimental results demonstrate that our multi-agent system achieves 89.6% accuracy in primary department classification and 74.3% accuracy in secondary department classification after four rounds of patient interaction. The system's dynamic-matching-based guidance mechanisms enable efficient adaptation to diverse hospital configurations while maintaining high triage accuracy. The resulting multi-agent triage system not only adapts to organizational heterogeneity across healthcare institutions but also ensures clinically sound decision-making.
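A skeletal version of the three-agent interaction loop, with hypothetical agent interfaces; the real system's prompts, guidance mechanisms, and department taxonomy are not reproduced here.

```python
class TriageSession:
    """Sketch of the triage loop, assuming hypothetical interfaces:
    `recipient(text) -> symptoms`, `inquirer(symptoms, candidates) ->
    question`, and `department(symptoms) -> ranked department list`."""

    def __init__(self, recipient, inquirer, department, max_rounds: int = 4):
        self.recipient = recipient
        self.inquirer = inquirer
        self.department = department
        self.max_rounds = max_rounds  # the paper reports accuracy after 4 rounds

    def run(self, patient_reply) -> str:
        symptoms = []
        message = patient_reply("Please describe your symptoms.")
        for _ in range(self.max_rounds):
            symptoms.extend(self.recipient(message))
            candidates = self.department(symptoms)
            if len(candidates) == 1:           # confident recommendation
                return candidates[0]
            question = self.inquirer(symptoms, candidates)
            message = patient_reply(question)  # next interaction round
        return self.department(symptoms)[0]    # fall back to the top candidate
```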
cs.SD
[660] Hearing More with Less: Multi-Modal Retrieval-and-Selection Augmented Conversational LLM-Based ASR
Bingshen Mu, Hexin Liu, Hongfei Xue, Kun Wei, Lei Xie
Main category: cs.SD
TL;DR: The paper proposes MARS, a multi-modal retrieval-and-selection method, to enhance conversational LLM-ASR by selecting relevant historical context, improving accuracy and reducing computational costs.
Details
Motivation: Existing conversational LLM-ASR methods use fixed or entire history as context, leading to confusion and inefficiency due to irrelevant information.
Method: MARS retrieves and selects the most relevant acoustic and textual historical context using multi-modal similarity and a near-ideal ranking method.
Result: MARS-equipped LLM-ASR outperforms state-of-the-art systems, achieving better results with significantly less training data (1.5K vs. 179K hours).
Conclusion: MARS effectively improves conversational LLM-ASR by dynamically selecting relevant context, enhancing performance and efficiency.
Abstract: Automatic Speech Recognition (ASR) aims to convert human speech content into corresponding text. In conversational scenarios, effectively utilizing context can enhance its accuracy. Large Language Models’ (LLMs) exceptional long-context understanding and reasoning abilities enable LLM-based ASR (LLM-ASR) to leverage historical context for recognizing conversational speech, which has a high degree of contextual relevance. However, existing conversational LLM-ASR methods use a fixed number of preceding utterances or the entire conversation history as context, resulting in significant ASR confusion and computational costs due to massive irrelevant and redundant information. This paper proposes a multi-modal retrieval-and-selection method named MARS that augments conversational LLM-ASR by enabling it to retrieve and select the most relevant acoustic and textual historical context for the current utterance. Specifically, multi-modal retrieval obtains a set of candidate historical contexts, each exhibiting high acoustic or textual similarity to the current utterance. Multi-modal selection calculates the acoustic and textual similarities for each retrieved candidate historical context and, by employing our proposed near-ideal ranking method to consider both similarities, selects the best historical context. Evaluations on the Interspeech 2025 Multilingual Conversational Speech Language Model Challenge dataset show that the LLM-ASR, when trained on only 1.5K hours of data and equipped with the MARS, outperforms the state-of-the-art top-ranking system trained on 179K hours of data.
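The selection step must reconcile two similarity scores per candidate. As a stand-in for the paper's near-ideal ranking, the sketch below ranks candidates separately under each modality and keeps the candidate whose worse rank is best.

```python
import numpy as np

def select_context(acoustic_sims: np.ndarray, textual_sims: np.ndarray) -> int:
    """Rank candidates by acoustic and by textual similarity (rank 0 =
    most similar), then pick the candidate minimising its worse rank.
    This minimax rule is an illustrative stand-in, not MARS's exact
    near-ideal ranking."""
    a_rank = np.argsort(np.argsort(-acoustic_sims))
    t_rank = np.argsort(np.argsort(-textual_sims))
    return int(np.argmin(np.maximum(a_rank, t_rank)))

# Toy usage: candidate 2 is strong under both modalities.
print(select_context(np.array([0.2, 0.5, 0.8]), np.array([0.4, 0.3, 0.7])))  # 2
```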
[661] GeHirNet: A Gender-Aware Hierarchical Model for Voice Pathology Classification
Fan Wu, Kaicheng Zhao, Elgar Fleisch, Filipe Barata
Main category: cs.SD
TL;DR: A two-stage AI framework improves voice pathology classification by addressing gender bias and data scarcity, achieving high accuracy (97.63%) and MCC (95.25%).
Details
Motivation: Existing classifiers struggle with gender-related acoustic variations and rare disease data scarcity, limiting accurate pathology identification.
Method: Proposes a two-stage framework: gender-specific pathological pattern identification using ResNet-50 on Mel spectrograms, followed by gender-conditioned disease classification. Uses multi-scale resampling and time warping augmentation to handle class imbalance.
Result: Achieves state-of-the-art performance (97.63% accuracy, 95.25% MCC), with a 5% MCC improvement over single-stage baselines.
Conclusion: The hierarchical modeling of vocal characteristics advances voice pathology classification while reducing gender bias.
Abstract: AI-based voice analysis shows promise for disease diagnostics, but existing classifiers often fail to accurately identify specific pathologies because of gender-related acoustic variations and the scarcity of data for rare diseases. We propose a novel two-stage framework that first identifies gender-specific pathological patterns using ResNet-50 on Mel spectrograms, then performs gender-conditioned disease classification. We address class imbalance through multi-scale resampling and time warping augmentation. Evaluated on a merged dataset from four public repositories, our two-stage architecture with time warping achieves state-of-the-art performance (97.63% accuracy, 95.25% MCC), with a 5% MCC improvement over single-stage baseline. This work advances voice pathology classification while reducing gender bias through hierarchical modeling of vocal characteristics.
[662] Advancing the Foundation Model for Music Understanding
Yi Jiang, Wei Wang, Xianwen Guo, Huiyun Liu, Hanrui Wang, Youri Xu, Haoqi Gu, Zhongqian Xie, Chuanjiang Luo
Main category: cs.SD
TL;DR: MuFun is a unified foundation model for holistic music understanding, outperforming specialized models in tasks like genre classification and music tagging.
Details
Motivation: The fragmentation in Music Information Retrieval (MIR) with specialized models for isolated tasks motivates the need for a unified approach.
Method: MuFun features a novel architecture processing instrumental and lyrical content, trained on a large-scale dataset for diverse tasks.
Result: MuFun significantly outperforms existing models on the proposed MuCUE benchmark, showing state-of-the-art effectiveness.
Conclusion: MuFun demonstrates strong generalization and effectiveness, challenging the fragmented paradigm in MIR.
Abstract: The field of Music Information Retrieval (MIR) is fragmented, with specialized models excelling at isolated tasks. In this work, we challenge this paradigm by introducing a unified foundation model named MuFun for holistic music understanding. Our model features a novel architecture that jointly processes instrumental and lyrical content, and is trained on a large-scale dataset covering diverse tasks such as genre classification, music tagging, and question answering. To facilitate robust evaluation, we also propose a new benchmark for multi-faceted music understanding called MuCUE (Music Comprehensive Understanding Evaluation). Experiments show our model significantly outperforms existing audio large language models across the MuCUE tasks, demonstrating its state-of-the-art effectiveness and generalization ability.
[663] Foundation Models for Bioacoustics – a Comparative Review
Raphael Schwinger, Paria Vali Zadeh, Lukas Rauch, Mats Kurz, Tom Hauschild, Sam Lapp, Sven Tomforde
Main category: cs.SD
TL;DR: The paper reviews large-scale pretrained bioacoustic foundation models, evaluates their transferability across tasks, and identifies top-performing models like BirdMAE and BEATs for specific benchmarks.
Details
Motivation: To advance biodiversity monitoring and conservation by improving automated bioacoustic analysis through adaptable deep learning models.
Method: Comprehensive review and experimental evaluation of bioacoustic foundation models, focusing on design decisions, pretraining schemes, and probing strategies (linear and attentive).
Result: BirdMAE excels on BirdSet, while BEATs performs best on BEANS. Transformer-based models require attentive probing for optimal performance. Supervised models like ConvNext and Perch remain competitive in linear probing.
Conclusion: The study offers practical guidance for selecting and adapting bioacoustic models for new classification tasks, emphasizing the importance of probing strategies.
Abstract: Automated bioacoustic analysis is essential for biodiversity monitoring and conservation, requiring advanced deep learning models that can adapt to diverse bioacoustic tasks. This article presents a comprehensive review of large-scale pretrained bioacoustic foundation models and systematically investigates their transferability across multiple bioacoustic classification tasks. We overview bioacoustic representation learning including major pretraining data sources and benchmarks. On this basis, we review bioacoustic foundation models by thoroughly analysing design decisions such as model architecture, pretraining scheme, and training paradigm. Additionally, we evaluate selected foundation models on classification tasks from the BEANS and BirdSet benchmarks, comparing the generalisability of learned representations under both linear and attentive probing strategies. Our comprehensive experimental analysis reveals that BirdMAE, trained on large-scale bird song data with a self-supervised objective, achieves the best performance on the BirdSet benchmark. On BEANS, BEATs$_{NLM}$, the extracted encoder of the NatureLM-audio large audio model, is slightly better. Both transformer-based models require attentive probing to extract the full performance of their representations. ConvNext$_{BS}$ and Perch models trained with supervision on large-scale bird song data remain competitive for passive acoustic monitoring classification tasks of BirdSet in linear probing settings. Training a new linear classifier has clear advantages over evaluating these models without further training. On BEANS, meanwhile, the baseline model BEATs trained with self-supervision on AudioSet outperforms bird-specific models when evaluated with attentive probing. These findings provide valuable guidance for practitioners selecting appropriate models to adapt them to new bioacoustic classification tasks via probing.
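The distinction between linear and attentive probing is central to the findings. Below is one common reading of the two probe types, sketched in PyTorch; these are not necessarily the review's exact implementations.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Mean-pool frozen frame embeddings, then one linear layer."""
    def __init__(self, dim: int, n_classes: int):
        super().__init__()
        self.head = nn.Linear(dim, n_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (B, T, D)
        return self.head(feats.mean(dim=1))

class AttentiveProbe(nn.Module):
    """A learned query attends over frames before classification, so the
    probe can weight informative frames instead of averaging them."""
    def __init__(self, dim: int, n_classes: int, n_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (B, T, D)
        q = self.query.expand(feats.size(0), -1, -1)
        pooled, _ = self.attn(q, feats, feats)
        return self.head(pooled.squeeze(1))

probe = AttentiveProbe(dim=768, n_classes=21)
print(probe(torch.randn(4, 250, 768)).shape)  # torch.Size([4, 21])
```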
[664] Via Score to Performance: Efficient Human-Controllable Long Song Generation with Bar-Level Symbolic Notation
Tongxi Wang, Yang Yu, Qing Wang, Junlang Qian
Main category: cs.SD
TL;DR: BACH is a novel model for song generation using human-editable symbolic scores, addressing limitations like controllability and perceptual quality, and outperforming existing methods.
Details
Motivation: Existing song generation methods struggle with controllability, generalizability, perceptual quality, and duration due to learning music theory from raw audio.
Method: BACH introduces a tokenization strategy and symbolic generative procedure for hierarchical song structure, enabling human-editable symbolic scores.
Result: BACH achieves state-of-the-art performance in efficiency, duration, and perceptual quality, surpassing commercial solutions like Suno.
Conclusion: BACH demonstrates superior song generation capabilities, validated by human evaluations, and sets a new benchmark in the field.
Abstract: Song generation is regarded as the most challenging problem in music AIGC; nonetheless, existing approaches have yet to fully overcome four persistent limitations: controllability, generalizability, perceptual quality, and duration. We argue that these shortcomings stem primarily from the prevailing paradigm of attempting to learn music theory directly from raw audio, a task that remains prohibitively difficult for current models. To address this, we present Bar-level AI Composing Helper (BACH), the first model explicitly designed for song generation through human-editable symbolic scores. BACH introduces a tokenization strategy and a symbolic generative procedure tailored to hierarchical song structure. Consequently, it achieves substantial gains in the efficiency, duration, and perceptual quality of song generation. Experiments demonstrate that BACH, with a small model size, establishes a new SOTA among all publicly reported song generation systems, even surpassing commercial solutions such as Suno. Human evaluations further confirm its superiority across multiple subjective metrics.
[665] PESTO: Real-Time Pitch Estimation with Self-supervised Transposition-equivariant Objective
Alain Riou, Bernardo Torres, Ben Hayes, Stefan Lattner, Gaëtan Hadjeres, Gaël Richard, Geoffroy Peeters
Main category: cs.SD
TL;DR: PESTO is a lightweight, self-supervised learning model for single-pitch estimation using a Siamese architecture and a novel training objective, achieving competitive performance without annotated data.
Details
Motivation: To develop a self-supervised approach for pitch estimation that eliminates the need for annotated data while maintaining high performance and low computational cost.
Method: Uses a Siamese architecture with a Toeplitz fully-connected layer for translation equivariance, trained on pitch-shifted VQT frames with a class-based transposition-equivariant objective.
Result: Outperforms self-supervised baselines and competes with supervised methods, demonstrating superior cross-dataset generalization. Achieves low latency (<10 ms) and minimal parameters (130k).
Conclusion: PESTO is a practical, lightweight solution for real-time pitch estimation, suitable for applications requiring low latency and minimal computational resources.
Abstract: In this paper, we introduce PESTO, a self-supervised learning approach for single-pitch estimation using a Siamese architecture. Our model processes individual frames of a Variable-$Q$ Transform (VQT) and predicts pitch distributions. The neural network is designed to be equivariant to translations, notably thanks to a Toeplitz fully-connected layer. In addition, we construct pitch-shifted pairs by translating and cropping the VQT frames and train our model with a novel class-based transposition-equivariant objective, eliminating the need for annotated data. Thanks to this architecture and training objective, our model achieves remarkable performance while being very lightweight ($130$k parameters). Evaluations on music and speech datasets (MIR-1K, MDB-stem-synth, and PTDB) demonstrate that PESTO not only outperforms self-supervised baselines but also competes with supervised methods, exhibiting superior cross-dataset generalization. Finally, we enhance PESTO’s practical utility by developing a streamable VQT implementation using cached convolutions. Combined with our model’s low latency (less than 10 ms) and minimal parameter count, this makes PESTO particularly suitable for real-time applications.
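A rough sketch of a transposition-equivariant objective in PESTO's spirit, assuming a hypothetical `model` that maps a VQT frame to log-probabilities over pitch bins sharing the input's bin resolution; the paper's cropping step and exact class-based loss are omitted.

```python
import torch
import torch.nn.functional as F

def equivariance_loss(model, vqt_frame: torch.Tensor, k: int) -> torch.Tensor:
    """Shift the input by k frequency bins (a crude pitch shift) and
    require the predicted pitch distribution to shift by k bins too.
    `model` maps (B, F) frames to (B, P) log-probabilities."""
    shifted_in = torch.roll(vqt_frame, shifts=k, dims=-1)
    logp = model(vqt_frame)
    logp_shifted = model(shifted_in)
    target = torch.roll(logp.exp().detach(), shifts=k, dims=-1)
    return F.kl_div(logp_shifted, target, reduction="batchmean")
```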
[666] Translation-Equivariant Self-Supervised Learning for Pitch Estimation with Optimal Transport
Bernardo Torres, Alain Riou, Gaël Richard, Geoffroy Peeters
Main category: cs.SD
TL;DR: Proposes an Optimal Transport objective for 1D translation-equivariant systems, applied to single pitch estimation, offering stability and simplicity.
Details
Motivation: To improve training of self-supervised pitch estimators with a theoretically grounded and numerically stable method.
Method: Uses an Optimal Transport objective for learning 1D translation-equivariant systems.
Result: Demonstrates applicability to single pitch estimation, providing a simpler and more stable alternative.
Conclusion: The method is effective for training state-of-the-art pitch estimators with theoretical and practical advantages.
Abstract: In this paper, we propose an Optimal Transport objective for learning one-dimensional translation-equivariant systems and demonstrate its applicability to single pitch estimation. Our method provides a theoretically grounded, more numerically stable, and simpler alternative for training state-of-the-art self-supervised pitch estimators.
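In one dimension, the Wasserstein-1 distance has a closed form as the L1 distance between CDFs, which is one way to see why an OT objective suits translation-equivariant pitch estimation: the cost grows with how far probability mass must move along the pitch axis. The snippet illustrates the general idea, not the paper's exact objective.

```python
import torch

def wasserstein1d(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Closed-form 1-D Wasserstein-1 distance between two distributions
    over ordered pitch bins: the L1 distance between their CDFs."""
    return (torch.cumsum(p, dim=-1) - torch.cumsum(q, dim=-1)).abs().sum(dim=-1)

# Mass one bin away costs 1.0; three bins away costs 3.0.
p = torch.tensor([1.0, 0.0, 0.0, 0.0, 0.0])
print(wasserstein1d(p, torch.roll(p, 1)).item())  # 1.0
print(wasserstein1d(p, torch.roll(p, 3)).item())  # 3.0
```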
[667] ShrutiSense: Microtonal Modeling and Correction in Indian Classical Music
Rajarshi Ghosh, Jayanth Athipatla
Main category: cs.SD
TL;DR: ShrutiSense is a symbolic pitch processing system for Indian classical music, correcting and completing pitch sequences using a Shruti-aware FST and a grammar-constrained SHMM, achieving high accuracy.
Details
Motivation: Existing tools lack support for microtonal distinctions and raga grammars in Indian classical music, necessitating a specialized system.
Method: Uses a Shruti-aware FST for corrections and a grammar-constrained SHMM for completions, both within the 22-shruti framework.
Result: Achieves 91.3% shruti classification accuracy, with robust performance under noise and consistent accuracy across ragas.
Conclusion: ShrutiSense effectively preserves cultural authenticity in Indian classical music by addressing microtonal and raga-specific challenges.
Abstract: Indian classical music relies on a sophisticated microtonal system of 22 shrutis (pitch intervals), which provides expressive nuance beyond the 12-tone equal temperament system. Existing symbolic music processing tools fail to account for these microtonal distinctions and culturally specific raga grammars that govern melodic movement. We present ShrutiSense, a comprehensive symbolic pitch processing system designed for Indian classical music, addressing two critical tasks: (1) correcting westernized or corrupted pitch sequences, and (2) completing melodic sequences with missing values. Our approach employs complementary models for different tasks: a Shruti-aware finite-state transducer (FST) that performs contextual corrections within the 22-shruti framework and a grammar-constrained Shruti hidden Markov model (GC-SHMM) that incorporates raga-specific transition rules for contextual completions. Comprehensive evaluation on simulated data across five ragas demonstrates that ShrutiSense (FST model) achieves 91.3% shruti classification accuracy for correction tasks, with example sequences showing 86.7-90.0% accuracy at corruption levels of 0.2 to 0.4. The system exhibits robust performance under pitch noise up to +/-50 cents, maintaining consistent accuracy across ragas (90.7-91.8%), thus preserving the cultural authenticity of Indian classical music expression.
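A tiny illustration of the emission step: snapping a pitch to the nearest of the 22 shrutis under one commonly cited just-intonation tuning. Shruti tunings vary across traditions, so treat the cent values as illustrative; ShrutiSense additionally applies raga-grammar transition constraints, omitted here.

```python
import numpy as np

# One commonly cited just-intonation tuning of the 22 shrutis, in cents
# above the tonic (Sa). Illustrative values; tunings differ by tradition.
SHRUTI_CENTS = np.array([0.0, 90.2, 111.7, 182.4, 203.9, 294.1, 315.6,
                         386.3, 407.8, 498.0, 519.6, 590.2, 611.7, 702.0,
                         792.2, 813.7, 884.4, 905.9, 996.1, 1017.6,
                         1088.3, 1109.8])

def snap_to_shruti(cents: float) -> int:
    """Map a pitch (cents above Sa, folded to one octave) to the nearest
    shruti index. This is only the emission step, without the FST/HMM
    transition constraints."""
    return int(np.argmin(np.abs(SHRUTI_CENTS - (cents % 1200))))

print(snap_to_shruti(100.0))  # 1  -> the 90.2-cent shruti
print(snap_to_shruti(700.0))  # 13 -> Pa (702 cents)
```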
[668] Benchmarking Sub-Genre Classification For Mainstage Dance Music
Hongzhi Shu, Xinglin Li, Hongyu Jiang, Minghao Fu, Xinyu Li
Main category: cs.SD
TL;DR: A new benchmark for sub-genre classification in mainstage dance music introduces a dataset and baseline, addressing the lack of comprehensive resources. The dataset reflects diverse EDM sub-genres, and a soft labeling approach handles genre-blending tracks. Specialized models outperform MLLMs, supporting applications like music recommendation.
Details
Motivation: To address the lack of comprehensive datasets and effective methods for sub-genre classification in mainstage dance music, reflecting the evolving EDM scene.
Method: Introduces a novel dataset with continuous soft labeling for genre-blending tracks and employs specialized baseline models.
Result: Specialized models achieve high accuracy, outperforming state-of-the-art multimodal large language models (MLLMs).
Conclusion: The benchmark supports applications like music recommendation and DJ set curation, with open-sourced code and data.
Abstract: Music classification, a cornerstone of music information retrieval, supports a wide array of applications. To address the lack of comprehensive datasets and effective methods for sub-genre classification in mainstage dance music, we introduce a novel benchmark featuring a new dataset and baseline. Our dataset expands the scope of sub-genres to reflect the diversity of recent mainstage live sets performed by leading DJs at global music festivals, capturing the vibrant and rapidly evolving electronic dance music (EDM) scene that engages millions of fans worldwide. We employ a continuous soft labeling approach to accommodate tracks blending multiple sub-genres, preserving their inherent complexity. Experiments demonstrate that even state-of-the-art multimodal large language models (MLLMs) struggle with this task, while our specialized baseline models achieve high accuracy. This benchmark supports applications such as music recommendation, DJ set curation, and interactive multimedia systems, with video demos provided. Our code and data are all open-sourced at https://github.com/Gariscat/housex-v2.git.
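Continuous soft labels fit naturally into a cross-entropy loss against a target distribution. A generic sketch of the technique, not necessarily the benchmark's exact loss:

```python
import torch
import torch.nn.functional as F

def soft_label_loss(logits: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against a target *distribution* over sub-genres
    instead of a single hard class."""
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# A track labelled 70%/30% across two sub-genres (illustrative numbers).
logits = torch.randn(1, 8)
target = torch.zeros(1, 8)
target[0, 0], target[0, 1] = 0.7, 0.3
print(soft_label_loss(logits, target))
```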
[669] Automatic Melody Reduction via Shortest Path Finding
Ziyu Wang, Yuxuan Wu, Roger B. Dannenberg, Gus Xia
Main category: cs.SD
TL;DR: A novel graph-based method for melody reduction, formulated as finding the shortest path, outperforms existing methods in faithfulness and musical coherence across genres. It also enhances symbolic music variation generation.
Details
Motivation: Existing computational theories for melody reduction are not fully automatic and genre-limited. This paper aims to provide a simple, automatic solution applicable to multiple genres.
Method: Proposes a graph-based representation for melody reduction, framing the process as finding the shortest path. Evaluated on pop, folk, and classical genres.
Result: The algorithm produces more faithful and musically coherent reductions than common downsampling methods. It also improves symbolic music variation generation.
Conclusion: The method is effective for melody reduction and downstream tasks, outperforming state-of-the-art approaches.
Abstract: Melody reduction, as an abstract representation of musical compositions, serves not only as a tool for music analysis but also as an intermediate representation for structured music generation. Prior computational theories, such as the Generative Theory of Tonal Music, provide insightful interpretations of music, but they are not fully automatic and usually limited to the classical genre. In this paper, we propose a novel and conceptually simple computational method for melody reduction using a graph-based representation inspired by principles from computational music theories, where the reduction process is formulated as finding the shortest path. We evaluate our algorithm on pop, folk, and classical genres, and experimental results show that the algorithm produces melody reductions that are more faithful to the original melody and more musically coherent than other common melody downsampling methods. As a downstream task, we use melody reductions to generate symbolic music variations. Experiments show that our method achieves higher quality than state-of-the-art style transfer methods.
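Formulating reduction as a shortest path makes the algorithm essentially Dijkstra over forward-skip edges. Below is a sketch with a hypothetical `edge_cost(a, b)` for jumping from note a directly to a later note b; the paper's actual costs derive from computational music theory.

```python
import heapq

def reduce_melody(notes, edge_cost):
    """Dijkstra from the first note to the last; the notes on the
    cheapest path form the reduction."""
    n = len(notes)
    dist = [float("inf")] * n
    prev = [None] * n
    dist[0] = 0.0
    heap = [(0.0, 0)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist[i]:
            continue
        for j in range(i + 1, n):  # edges only move forward in time
            nd = d + edge_cost(notes[i], notes[j])
            if nd < dist[j]:
                dist[j], prev[j] = nd, i
                heapq.heappush(heap, (nd, j))
    path, i = [], n - 1  # backtrack the kept notes
    while i is not None:
        path.append(notes[i])
        i = prev[i]
    return path[::-1]

# Toy cost: each kept edge pays a constant plus the pitch jump, so the
# path prefers skipping ornamental notes; prints [60, 67].
print(reduce_melody([60, 62, 64, 65, 67], lambda a, b: 1 + abs(b - a)))
```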
[670] From Contrast to Commonality: Audio Commonality Captioning for Enhanced Audio-Text Cross-modal Understanding in Multimodal LLMs
Yuhang Jia, Xu Zhang, Yong Qin
Main category: cs.SD
TL;DR: ACC (Audio Commonality Captioning) is proposed to address the semantic gap in ADC (Audio Difference Captioning), enhancing audio-text understanding and preserving general capabilities in MLLMs.
Details
Motivation: ADC introduces a semantic gap between rich audio inputs and brief, difference-focused captions, causing catastrophic forgetting during finetuning.
Method: Proposes ACC, which captures shared semantics across audio clips instead of emphasizing differences.
Result: ACC improves audio-text understanding and preserves performance across diverse downstream tasks like VSC, SER, MIC, and MGC.
Conclusion: ACC offers a better balance between generalization and task-specific performance in MLLMs compared to ADC.
Abstract: Audio Captioning (AC) plays a pivotal role in enhancing audio-text cross-modal understanding during the pretraining and finetuning of multimodal large language models (MLLMs). To further strengthen this alignment, recent works have proposed Audio Difference Captioning (ADC), which takes multiple audio inputs and encourages the model to describe their differences, thereby promoting fine-grained audio discrimination. However, despite its effectiveness in enabling difference-telling and detailed discrimination, ADC introduces a notable semantic gap between the input audios-often rich in diverse sound events-and the relatively brief, difference-focused output captions. This deviation from AC-style descriptions leads to a mismatch with the pretraining objective, resulting in catastrophic forgetting during finetuning. To mitigate this issue, we propose Audio Commonality Captioning (ACC), a comparably challenging but gentler alternative that encourages the model to capture the shared semantics across audio clips rather than emphasizing their detailed differences. Experimental results demonstrate that ACC not only effectively enhances audio-text understanding on primary captioning benchmarks but also better preserves general capabilities across diverse speech and music-related downstream tasks, such as vocal sound classification (VSC), speech emotion recognition (SER), musical instrument classification (MIC), and music genre classification (MGC), compared to ADC. These findings validate that ACC contributes to more robust cross-modal understanding and achieves a better balance between generalization and task-specific performance in the context of MLLMs.
[671] Enhancing Spectrogram Realism in Singing Voice Synthesis via Explicit Bandwidth Extension Prior to Vocoder
Runxuan Yang, Kai Li, Guo Chen, Xiaolin Hu
Main category: cs.SD
TL;DR: The paper proposes a method to improve the realism of vocoder-generated singing voice audio by addressing disparities in high-frequency spectrogram components, using a combination of denoising diffusion and a redesigned vocoder.
Details
Motivation: To bridge the gap between synthetic and real-life singing voice recordings, especially in high-frequency spectrogram components.
Method: Combines denoising diffusion for spectrogram estimation with a DiT-based neural network and a Vocos-based vocoder optimized for large spectrograms.
Result: Produces high-fidelity audio indistinguishable from real recordings, validated by objective and subjective evaluations.
Conclusion: The approach advances vocoding techniques, particularly in resisting adversarial attacks on fake spectrogram detection.
Abstract: This paper addresses the challenge of enhancing the realism of vocoder-generated singing voice audio by mitigating the distinguishable disparities between synthetic and real-life recordings, particularly in high-frequency spectrogram components. Our proposed approach combines two innovations: an explicit linear spectrogram estimation step using denoising diffusion process with DiT-based neural network architecture optimized for time-frequency data, and a redesigned vocoder based on Vocos specialized in handling large linear spectrograms with increased frequency bins. This integrated method can produce audio with high-fidelity spectrograms that are challenging for both human listeners and machine classifiers to differentiate from authentic recordings. Objective and subjective evaluations demonstrate that our streamlined approach maintains high audio quality while achieving this realism. This work presents a substantial advancement in overcoming the limitations of current vocoding techniques, particularly in the context of adversarial attacks on fake spectrogram detection.
[672] Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe
Tiantian Feng, Kevin Huang, Anfeng Xu, Xuan Shi, Thanathai Lertpetchpun, Jihwan Lee, Yoonjeong Lee, Dani Byrd, Shrikanth Narayanan
Main category: cs.SD
TL;DR: Voxlect is a benchmark for modeling dialects and regional languages using speech foundation models, evaluated across multiple languages and dialects, with applications in speech recognition and generation.
Details
Motivation: To address the need for comprehensive dialect modeling and evaluation in speech foundation models, enabling better analysis of ASR performance and speech generation across dialects.
Method: Utilized over 2 million training utterances from 30 publicly available speech corpora with dialectal information, evaluating widely used speech foundation models for dialect classification under noisy conditions.
Result: Demonstrated robust dialect classification aligned with geographic continuity and enabled downstream applications like augmenting ASR datasets and evaluating speech generation systems.
Conclusion: Voxlect provides a valuable tool for dialect modeling and evaluation, with potential applications in speech technology and linguistic research.
Abstract: We present Voxlect, a novel benchmark for modeling dialects and regional languages worldwide using speech foundation models. Specifically, we report comprehensive benchmark evaluations on dialects and regional language varieties in English, Arabic, Mandarin and Cantonese, Tibetan, Indic languages, Thai, Spanish, French, German, Brazilian Portuguese, and Italian. Our study used over 2 million training utterances from 30 publicly available speech corpora that are provided with dialectal information. We evaluate the performance of several widely used speech foundation models in classifying speech dialects. We assess the robustness of the dialectal models under noisy conditions and present an error analysis that highlights modeling results aligned with geographic continuity. In addition to benchmarking dialect classification, we demonstrate several downstream applications enabled by Voxlect. Specifically, we show that Voxlect can be applied to augment existing speech recognition datasets with dialect information, enabling a more detailed analysis of ASR performance across dialectal variations. Voxlect is also used as a tool to evaluate the performance of speech generation systems. Voxlect is publicly available under a RAIL-family license at: https://github.com/tiantiaf0627/voxlect.
[673] Localizing Audio-Visual Deepfakes via Hierarchical Boundary Modeling
Xuanjun Chen, Shih-Peng Cheng, Jiawei Du, Lin Zhang, Xiaoxiao Miao, Chung-Che Wang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang
Main category: cs.SD
TL;DR: HBMNet, a Hierarchical Boundary Modeling Network, addresses audio-visual temporal deepfake localization by combining audio-visual feature encoding, coarse proposal generation, and fine-grained probability refinement. It outperforms existing methods like BA-TFD and UMMAFormer.
Details
Motivation: The challenge lies in localizing deepfake regions under content-driven partial manipulation, where only a few frames are altered.
Method: HBMNet uses three modules: Audio-Visual Feature Encoder, Coarse Proposal Generator, and Fine-grained Probabilities Generator, enhanced by modality-specific encoding and fusion, and multi-scale temporal cues.
Result: The method improves precision and recall, with each module contributing complementary benefits. It outperforms BA-TFD and UMMAFormer.
Conclusion: HBMNet demonstrates superior performance and scalability for audio-visual deepfake localization.
Abstract: Audio-visual temporal deepfake localization under content-driven partial manipulation remains a highly challenging task. In this scenario, the deepfake regions usually span only a few frames, with the majority of the rest remaining identical to the original. To tackle this, we propose a Hierarchical Boundary Modeling Network (HBMNet), which includes three modules: an Audio-Visual Feature Encoder that extracts discriminative frame-level representations, a Coarse Proposal Generator that predicts candidate boundary regions, and a Fine-grained Probabilities Generator that refines these proposals using bidirectional boundary-content probabilities. From the modality perspective, we enhance audio-visual learning through dedicated encoding and fusion, reinforced by frame-level supervision to boost discriminability. From the temporal perspective, HBMNet integrates multi-scale cues and bidirectional boundary-content relationships. Experiments show that encoding and fusion primarily improve precision, while frame-level supervision boosts recall. Each module (audio-visual fusion, temporal scales, bi-directionality) contributes complementary benefits, collectively enhancing localization performance. HBMNet outperforms BA-TFD and UMMAFormer and shows promising scalability as more training data becomes available.
[674] Generalizable Audio Deepfake Detection via Hierarchical Structure Learning and Feature Whitening in Poincaré sphere
Mingru Yang, Yanmei Gu, Qianhua He, Yanxiong Li, Peirong Zhang, Yongqiang Chen, Zhiming Wang, Huijia Zhu, Jian Liu, Weiqiang Wang
Main category: cs.SD
TL;DR: Poin-HierNet introduces a novel framework for audio deepfake detection using Poincaré sphere representations, outperforming existing methods in generalization and accuracy.
Details
Motivation: Existing methods for audio deepfake detection struggle with generalization due to diverse spoofing attacks and domain variations, relying on Euclidean distances that miss hierarchical structures.
Method: Poin-HierNet uses Poincaré Prototype Learning (PPL) for hierarchical feature alignment, Hierarchical Structure Learning (HSL) for tree-like structures, and Poincaré Feature Whitening (PFW) for domain invariance.
Result: The framework outperforms state-of-the-art methods on datasets like ASVspoof 2019 LA, 2021 LA, 2021 DF, and In-The-Wild, achieving lower Equal Error Rates.
Conclusion: Poin-HierNet effectively addresses generalization challenges in audio deepfake detection by leveraging hierarchical and domain-invariant representations in the Poincaré sphere.
Abstract: Audio deepfake detection (ADD) faces critical generalization challenges due to diverse real-world spoofing attacks and domain variations. However, existing methods primarily rely on Euclidean distances, failing to adequately capture the intrinsic hierarchical structures associated with attack categories and domain factors. To address these issues, we design a novel framework, Poin-HierNet, to construct domain-invariant hierarchical representations in the Poincaré sphere. Poin-HierNet includes three key components: 1) Poincaré Prototype Learning (PPL), which uses several data prototypes to align sample features and capture multilevel hierarchies beyond human labels; 2) Hierarchical Structure Learning (HSL), which leverages top prototypes to establish a tree-like hierarchical structure from the data prototypes; and 3) Poincaré Feature Whitening (PFW), which enhances domain invariance by applying feature whitening to suppress domain-sensitive features. We evaluate our approach on four datasets: ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 DF, and In-The-Wild. Experimental results demonstrate that Poin-HierNet achieves a lower Equal Error Rate than state-of-the-art methods.
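The key primitive behind such models is the hyperbolic geodesic distance. Below is the standard Poincaré-ball distance in PyTorch, shown as the building block of this family of methods rather than Poin-HierNet itself.

```python
import torch

def poincare_distance(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """d(x, y) = arcosh(1 + 2 * ||x - y||^2 / ((1 - ||x||^2) * (1 - ||y||^2))).
    Distances blow up near the boundary of the unit ball, which is what
    lets tree-like hierarchies embed with low distortion."""
    x2 = (x * x).sum(-1)
    y2 = (y * y).sum(-1)
    xy2 = ((x - y) ** 2).sum(-1)
    arg = 1 + 2 * xy2 / ((1 - x2).clamp_min(eps) * (1 - y2).clamp_min(eps))
    return torch.acosh(arg.clamp_min(1.0 + eps))

root = torch.tensor([0.0, 0.0])       # near the origin: the 'root' of a hierarchy
leaf = torch.tensor([0.0, 0.95])      # near the boundary: a 'leaf'
print(poincare_distance(root, leaf))  # ~3.66, far despite the small Euclidean gap
```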
[675] Non-Verbal Vocalisations and their Challenges: Emotion, Privacy, Sparseness, and Real Life
Anton Batliner, Shahin Amiriparian, Björn W. Schuller
Main category: cs.SD
TL;DR: The paper explores Non-Verbal Vocalisations (NVVs), their history, types, and functions, highlighting challenges like privacy and sparse data, and advocates for corpus-based approaches.
Details
Motivation: To understand NVVs, their role in conveying emotions or paralinguistic information, and address the challenges in studying them, especially in AI.
Method: Historical review, typology of NVVs, and discussion of ethical and methodological challenges, proposing corpus-based solutions.
Result: NVVs are complex and context-dependent, but current methods (like acted exemplars) are insufficient; corpus-based approaches offer better realism.
Conclusion: Corpus-based methods are recommended for realistic NVV modeling, though privacy and data sparsity remain unresolved issues.
Abstract: Non-Verbal Vocalisations (NVVs) are short ‘non-word’ utterances without proper linguistic (semantic) meaning but conveying connotations – be it emotions/affects or other paralinguistic information. We start this contribution with a historic sketch: how they were addressed in psychology and linguistics in the last two centuries, how they were neglected later on, and how they came to the fore with the advent of emotion research. We then give an overview of types of NVVs (formal aspects) and functions of NVVs, exemplified with the typical NVV \textit{ah}. Interesting as they are, NVVs come, however, with a number of challenges that should be accounted for: privacy and general ethical considerations prevent them from being recorded in real-life (private) scenarios to a sufficient extent, and isolated, prompted (acted) exemplars do not necessarily model NVVs in context; yet, acting is the preferred strategy so far when modelling NVVs, especially in AI. To overcome these problems, we argue in favour of corpus-based approaches. These guarantee more realistic modelling; however, we are still faced with privacy and sparse-data problems.
[676] VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning
Qianyue Hu, Junyan Wu, Wei Lu, Xiangyang Luo
Main category: cs.SD
TL;DR: VoiceCloak is a proactive defense framework against unauthorized voice cloning using Diffusion Models, disrupting the process via adversarial perturbations and degrading output quality.
Details
Motivation: Diffusion Models (DMs) excel in voice cloning but pose misuse risks. Existing defenses are incompatible with DMs, necessitating a new approach.
Method: VoiceCloak introduces adversarial perturbations to distort speaker identity embeddings and disrupt conditional guidance processes, while amplifying score magnitude to degrade speech quality.
Result: VoiceCloak achieves a high defense success rate against unauthorized voice cloning, as demonstrated in extensive experiments.
Conclusion: VoiceCloak effectively mitigates misuse of DMs in voice cloning by obfuscating identity and degrading quality, offering a robust defense framework.
Abstract: Diffusion Models (DMs) have achieved remarkable success in realistic voice cloning (VC), while they also increase the risk of malicious misuse. Existing proactive defenses designed for traditional VC models aim to disrupt the forgery process, but they have been proven incompatible with DMs due to the intricate generative mechanisms of diffusion. To bridge this gap, we introduce VoiceCloak, a multi-dimensional proactive defense framework with the goal of obfuscating speaker identity and degrading perceptual quality in potential unauthorized VC. To achieve these goals, we conduct a focused analysis to identify specific vulnerabilities within DMs, allowing VoiceCloak to disrupt the cloning process by introducing adversarial perturbations into the reference audio. Specifically, to obfuscate speaker identity, VoiceCloak first targets speaker identity by distorting representation learning embeddings to maximize identity variation, which is guided by auditory perception principles. Additionally, VoiceCloak disrupts crucial conditional guidance processes, particularly attention context, thereby preventing the alignment of vocal characteristics that are essential for achieving convincing cloning. Then, to address the second objective, VoiceCloak introduces score magnitude amplification to actively steer the reverse trajectory away from the generation of high-quality speech. Noise-guided semantic corruption is further employed to disrupt structural speech semantics captured by DMs, degrading output quality. Extensive experiments highlight VoiceCloak’s outstanding defense success rate against unauthorized diffusion-based voice cloning. Audio samples of VoiceCloak are available at https://voice-cloak.github.io/VoiceCloak/.
[677] Unsupervised Multi-channel Speech Dereverberation via Diffusion
Yulun Wu, Zhongweiyang Xu, Jianchong Chen, Zhong-Qiu Wang, Romit Roy Choudhury
Main category: cs.SD
TL;DR: USD-DPS is an unsupervised method for multi-channel single-speaker blind dereverberation, using a diffusion model and multi-channel consistency constraints.
Details
Motivation: The paper addresses the challenge of recovering clean anechoic speech from multi-channel mixtures in blind dereverberation scenarios.
Method: Proposes USD-DPS, leveraging a clean speech diffusion model for posterior sampling, estimating RIRs for each microphone channel, and enforcing multi-channel consistency. RIRs are estimated via optimization and forward convolutive prediction.
Result: USD-DPS achieves superior performance among unsupervised dereverberation methods, balancing efficiency and RIR modeling.
Conclusion: The method effectively combines diffusion models and multi-channel constraints for unsupervised dereverberation, demonstrating strong results.
Abstract: We consider the problem of multi-channel single-speaker blind dereverberation, where multi-channel mixtures are used to recover the clean anechoic speech. To solve this problem, we propose USD-DPS, Unsupervised Speech Dereverberation via Diffusion Posterior Sampling. USD-DPS uses an unconditional clean speech diffusion model as a strong prior to solve the problem by posterior sampling. At each diffusion sampling step, we estimate all microphone channels’ room impulse responses (RIRs), which are further used to enforce a multi-channel mixture consistency constraint for diffusion guidance. For multi-channel RIR estimation, we estimate reference-channel RIR by optimizing RIR parameters of a sub-band RIR signal model, with the Adam optimizer. We estimate non-reference channels’ RIRs analytically using forward convolutive prediction (FCP). We found that this combination provides a good balance between sampling efficiency and RIR prior modeling, which shows superior performance among unsupervised dereverberation approaches. An audio demo page is provided in https://usddps.github.io/USDDPS_demo/.
[678] Hidden in the Noise: Unveiling Backdoors in Audio LLMs Alignment through Latent Acoustic Pattern Triggers
Liang Lin, Miao Yu, Kaiwen Luo, Yibo Zhang, Lilan Peng, Dexian Wang, Xuehai Tang, Yuanhe Zhang, Xikang Yang, Zhenhong Zhou, Kun Wang, Yang Liu
Main category: cs.SD
TL;DR: The paper investigates vulnerabilities of Audio Large Language Models (ALLMs) to backdoor attacks via acoustic triggers, introducing the HIN framework and AudioSafe benchmark to assess risks.
Details
Motivation: Addressing the lack of research on audio safety in ALLMs, the paper explores their susceptibility to backdoor attacks using acoustic triggers.
Method: Proposes the HIN framework, which modifies audio waveforms with acoustic changes and spectrally tailored noise to embed triggers. Evaluates using the AudioSafe benchmark.
Result: Finds ALLMs highly vulnerable to audio-feature-based triggers (90% success rate), with varying sensitivity across features and stealthy attack impacts.
Conclusion: Highlights critical vulnerabilities in ALLMs, emphasizing the need for improved audio safety measures.
Abstract: As Audio Large Language Models (ALLMs) emerge as powerful tools for speech processing, their safety implications demand urgent attention. While considerable research has explored textual and vision safety, audio’s distinct characteristics present significant challenges. This paper first investigates: are ALLMs vulnerable to backdoor attacks exploiting acoustic triggers? In response to this issue, we introduce Hidden in the Noise (HIN), a novel backdoor attack framework designed to exploit subtle, audio-specific features. HIN applies acoustic modifications to raw audio waveforms, such as alterations to temporal dynamics and strategic injection of spectrally tailored noise. These changes introduce consistent patterns that an ALLM’s acoustic feature encoder captures, embedding robust triggers within the audio stream. To evaluate ALLM robustness against audio-feature-based triggers, we develop the AudioSafe benchmark, assessing nine distinct risk types. Extensive experiments on AudioSafe and three established safety datasets reveal critical vulnerabilities in existing ALLMs: (I) audio features like environment noise and speech rate variations achieve an average attack success rate of over 90%; (II) ALLMs exhibit significant sensitivity differences across acoustic features, showing minimal response to volume as a trigger in particular; and (III) poisoned-sample inclusion causes only marginal loss-curve fluctuations, highlighting the attack’s stealth.
[679] WhiSQA: Non-Intrusive Speech Quality Prediction Using Whisper Encoder Features
George Close, Kris Hong, Thomas Hain, Stefan Goetze
Main category: cs.SD
TL;DR: A novel speech quality (SQ) predictor using ASR model features outperforms existing methods in correlation with human ratings and domain adaptation.
Details
Motivation: To improve SQ prediction by leveraging powerful feature representations from ASR models, addressing limitations of existing non-intrusive metrics.
Method: Proposes an SQ predictor using feature representations extracted from an ASR model, tested on NISQA datasets.
Result: Achieves higher correlation with human MOS ratings and better domain adaptation than DNSMOS and other recent approaches.
Conclusion: ASR-derived features enhance SQ prediction, offering robust performance and adaptability.
Abstract: There has been significant research effort developing neural-network-based predictors of SQ in recent years. While a primary objective has been to develop non-intrusive, i.e. reference-free, metrics to assess the performance of SE systems, recent work has also investigated the direct inference of neural SQ predictors within the loss function of downstream speech tasks. To aid in the training of SQ predictors, several large datasets of audio with corresponding human labels of quality have been created. Recent work in this area has shown that speech representations derived from large unsupervised or semi-supervised foundational speech models are useful input feature representations for neural SQ prediction. In this work, a novel and robust SQ predictor is proposed based on feature representations extracted from an ASR model, found to be a powerful input feature for the SQ prediction task. The proposed system achieves higher correlation with human MOS ratings than recent approaches on all NISQA test sets and shows significantly better domain adaption compared to the commonly used DNSMOS metric.
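A minimal sketch of the idea, regressing MOS from a frozen Whisper encoder via the Hugging Face API; WhiSQA's actual head, pooling, and training recipe are not reproduced here.

```python
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

class WhisperMOSPredictor(nn.Module):
    """Frozen Whisper encoder features, mean-pooled, with a small
    regression head predicting a MOS value."""
    def __init__(self, name: str = "openai/whisper-base"):
        super().__init__()
        self.fe = WhisperFeatureExtractor.from_pretrained(name)
        self.encoder = WhisperModel.from_pretrained(name).encoder
        for p in self.encoder.parameters():  # keep the ASR features frozen
            p.requires_grad = False
        d = self.encoder.config.d_model
        self.head = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, audio, sr: int = 16000) -> torch.Tensor:
        feats = self.fe(audio, sampling_rate=sr, return_tensors="pt").input_features
        hidden = self.encoder(feats).last_hidden_state  # (1, T, D)
        return self.head(hidden.mean(dim=1))            # predicted MOS

# Usage (downloads weights):
# import numpy as np
# print(WhisperMOSPredictor()(np.random.randn(16000).astype(np.float32)))
```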
[680] StutterCut: Uncertainty-Guided Normalised Cut for Dysfluency Segmentation
Suhita Ghosh, Melanie Jouaiti, Jan-Ole Perschewski, Sebastian Stober
Main category: cs.SD
TL;DR: StutterCut is a semi-supervised framework for dysfluency segmentation, treating it as a graph partitioning problem. It outperforms existing methods with higher F1 scores and precise onset detection.
Details
Motivation: Current methods only classify dysfluencies at the utterance level, lacking granularity for effective speech therapy and real-time feedback.
Method: StutterCut uses graph partitioning with speech embeddings as nodes, refined by a pseudo-oracle classifier trained on weak labels, controlled by uncertainty measures.
Result: Outperforms existing methods on real and synthetic datasets, achieving higher F1 scores and precise stuttering onset detection.
Conclusion: StutterCut provides a more realistic and effective solution for dysfluency segmentation, validated by improved performance on real-world data.
Abstract: Detecting and segmenting dysfluencies is crucial for effective speech therapy and real-time feedback. However, most methods only classify dysfluencies at the utterance level. We introduce StutterCut, a semi-supervised framework that formulates dysfluency segmentation as a graph partitioning problem, where speech embeddings from overlapping windows are represented as graph nodes. We refine the connections between nodes using a pseudo-oracle classifier trained on weak (utterance-level) labels, with its influence controlled by an uncertainty measure from Monte Carlo dropout. Additionally, we extend the weakly labelled FluencyBank dataset by incorporating frame-level dysfluency boundaries for four dysfluency types. This provides a more realistic benchmark compared to synthetic datasets. Experiments on real and synthetic datasets show that StutterCut outperforms existing methods, achieving higher F1 scores and more precise stuttering onset detection.
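A hedged sketch of the graph-partitioning view using spectral clustering, which approximately minimises a normalised cut; `sim_boost` is a hypothetical stand-in for the pseudo-oracle reweighting, and the uncertainty gating via Monte Carlo dropout is omitted.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def segment_dysfluency(embeddings: np.ndarray, sim_boost: np.ndarray) -> np.ndarray:
    """Nodes are window embeddings; edges are cosine similarities
    reweighted by `sim_boost` (a symmetric, non-negative matrix).
    Spectral clustering splits the windows into two groups, e.g.
    fluent vs dysfluent."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(e @ e.T, 0.0, None) * sim_boost
    return SpectralClustering(n_clusters=2, affinity="precomputed",
                              random_state=0).fit_predict(affinity)
```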
[681] Detecting COPD Through Speech Analysis: A Dataset of Danish Speech and Machine Learning Approach
Cuno Sankey-Olsen, Rasmus Hvass Olesen, Tobias Oliver Eberhard, Andreas Triantafyllopoulos, Björn Schuller, Ilhan Aslan
Main category: cs.SD
TL;DR: Speech analysis shows promise as a non-invasive tool for COPD detection, achieving 67% accuracy in a Danish study.
Details
Motivation: Early detection of COPD using non-invasive methods like speech analysis could improve patient outcomes. The study explores its validity across linguistic groups.
Method: Audio data from 96 Danish participants (half with COPD, half healthy) were analyzed using openSMILE features and x-vector embeddings with logistic regression.
Result: The best accuracy achieved was 67% using openSMILE features.
Conclusion: Speech-based analysis has potential as a scalable, remote screening tool for COPD.
Abstract: Chronic Obstructive Pulmonary Disease (COPD) is a serious and debilitating disease affecting millions around the world. Its early detection using non-invasive means could enable preventive interventions that improve quality of life and patient outcomes, with speech recently shown to be a valuable biomarker. Yet, its validity across different linguistic groups remains to be seen. To that end, audio data were collected from 96 Danish participants conducting three speech tasks (reading, coughing, sustained vowels). Half of the participants were diagnosed with different levels of COPD and the other half formed a healthy control group. Subsequently, we investigated different baseline models using openSMILE features and learnt x-vector embeddings. We obtained a best accuracy of 67% using openSMILE features and logistic regression. Our findings support the potential of speech-based analysis as a non-invasive, remote, and scalable screening tool as part of future COPD healthcare solutions.
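The baseline pipeline is easy to reproduce in outline with the `opensmile` and scikit-learn packages. The eGeMAPS feature set and the cross-validation protocol below are assumptions, not necessarily the study's exact setup.

```python
import opensmile
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# eGeMAPS functionals are a common openSMILE configuration for
# health-related speech tasks (assumed here, not confirmed by the paper).
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def copd_screen_scores(wav_paths, labels, folds: int = 5):
    """One fixed-length feature vector per recording, then a simple
    logistic-regression classifier evaluated with cross-validation."""
    X = pd.concat([smile.process_file(p) for p in wav_paths])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, labels, cv=folds, scoring="accuracy")
```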
[682] Inference-time Scaling for Diffusion-based Audio Super-resolution
Yizhu Jin, Zhen Ye, Zeyue Tian, Haohe Liu, Qiuqiang Kong, Yike Guo, Wei Xue
Main category: cs.SD
TL;DR: The paper proposes inference-time scaling for audio super-resolution using diffusion models, leveraging verifiers and search algorithms to improve output quality and robustness.
Details
Motivation: Existing diffusion models for audio SR rely on increasing sampling steps, which is limited by stochastic variance. The paper aims to enhance quality through a novel paradigm of exploring multiple solution trajectories.
Method: The approach involves task-specific verifiers and two search algorithms (random search and zero-order search) to guide exploration in the solution space.
Result: The method achieves significant improvements: 9.70% in aesthetics, 5.88% in speaker similarity, 15.20% in word error rate, and 46.98% in spectral distance for speech SR.
Conclusion: The proposed inference-time scaling paradigm effectively enhances audio SR quality and robustness across diverse domains.
Abstract: Diffusion models have demonstrated remarkable success in generative tasks, including audio super-resolution (SR). In many applications like movie post-production and album mastering, substantial computational budgets are available for achieving superior audio quality. However, while existing diffusion approaches typically increase sampling steps to improve quality, the performance remains fundamentally limited by the stochastic nature of the sampling process, leading to high-variance and quality-limited outputs. Here, rather than simply increasing the number of sampling steps, we propose a different paradigm through inference-time scaling for SR, which explores multiple solution trajectories during the sampling process. Different task-specific verifiers are developed, and two search algorithms, including the random search and zero-order search for SR, are introduced. By actively guiding the exploration of the high-dimensional solution space through verifier-algorithm combinations, we enable more robust and higher-quality outputs. Through extensive validation across diverse audio domains (speech, music, sound effects) and frequency ranges, we demonstrate consistent performance gains, achieving improvements of up to 9.70% in aesthetics, 5.88% in speaker similarity, 15.20% in word error rate, and 46.98% in spectral distance for speech SR from 4kHz to 24kHz, showcasing the effectiveness of our approach. Audio samples are available at: https://racerk.github.io/tt-scale-audiosr/.
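The random-search variant reduces to best-of-N selection under a verifier. The sketch below illustrates that control flow only; sample_sr and verifier are placeholder names for the paper's diffusion sampler and task-specific verifier.

```python
# Verifier-guided random search at inference time: sample several diffusion
# trajectories and keep the candidate the verifier scores highest.
def best_of_n(lowres_audio, sample_sr, verifier, n_candidates=8):
    candidates = [sample_sr(lowres_audio, seed=i) for i in range(n_candidates)]
    scores = [verifier(lowres_audio, c) for c in candidates]
    best = max(range(n_candidates), key=lambda i: scores[i])
    return candidates[best]
```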
[683] Charting 15 years of progress in deep learning for speech emotion recognition: A replication study
Andreas Triantafyllopoulos, Anton Batliner, Björn W. Schuller
Main category: cs.SD
TL;DR: The paper evaluates progress in Speech Emotion Recognition (SER) over 15 years, comparing deep learning models and highlighting diminishing returns with transformer architectures.
Details
Motivation: To quantify advancements in SER since 2009 and identify future research directions, as SER remains unsolved.
Method: Large-scale analysis of audio-based and text-based model architectures, focusing on deep learning advancements.
Result: Diminishing returns observed, with progress plateauing post-transformer introduction. Model comparisons influence perceived progress.
Conclusion: SER research shows limited recent gains; future work must address plateauing performance and redefine benchmarks.
Abstract: Speech emotion recognition (SER) has long benefited from the adoption of deep learning methodologies. Deeper models – with more layers and more trainable parameters – are generally perceived as being `better' by the SER community. This raises the question – \emph{how much better} are modern-era deep neural networks compared to their earlier iterations? Beyond that, the more important question of how to move forward remains as poignant as ever. SER is far from a solved problem; therefore, identifying the most prominent avenues of future research is of paramount importance. In the present contribution, we attempt a quantification of progress in the 15 years of research beginning with the introduction of the landmark 2009 INTERSPEECH Emotion Challenge. We conduct a large-scale investigation of model architectures, spanning both audio-based models that rely on speech inputs and text-based models that rely solely on transcriptions. Our results point towards diminishing returns and a plateau after the recent introduction of transformer architectures. Moreover, we demonstrate how perceptions of progress are conditioned on the particular selection of models that are compared. Our findings have important repercussions for the state of the art in SER research and the paths forward.
[684] Towards Reliable Audio Deepfake Attribution and Model Recognition: A Multi-Level Autoencoder-Based Framework
Andrea Di Pierno, Luca Guarnera, Dario Allegra, Sebastiano Battiato
Main category: cs.SD
TL;DR: LAVA is a hierarchical framework for detecting and attributing audio deepfakes, achieving high accuracy in identifying generation technologies and specific models.
Details
Motivation: Addressing the underexplored challenge of attributing audio deepfakes to their source models to enhance trust in digital communications.
Method: Uses a convolutional autoencoder with attention-enhanced latent representations and two classifiers (ADA and ADMR) for technology and model recognition, incorporating confidence-based rejection for robustness.
Result: Achieves F1-scores over 95% for ADA and 96.31% for ADMR, with confirmed robustness on unseen attacks.
Conclusion: LAVA advances deepfake attribution and model recognition under open-set conditions, validated on public benchmarks with released models and code.
Abstract: The proliferation of audio deepfakes poses a growing threat to trust in digital communications. While detection methods have advanced, attributing audio deepfakes to their source models remains an underexplored yet crucial challenge. In this paper we introduce LAVA (Layered Architecture for Voice Attribution), a hierarchical framework for audio deepfake detection and model recognition that leverages attention-enhanced latent representations extracted by a convolutional autoencoder trained solely on fake audio. Two specialized classifiers operate on these features: Audio Deepfake Attribution (ADA), which identifies the generation technology, and Audio Deepfake Model Recognition (ADMR), which recognizes the specific generative model instance. To improve robustness under open-set conditions, we incorporate confidence-based rejection thresholds. Experiments on ASVspoof2021, FakeOrReal, and CodecFake show strong performance: the ADA classifier achieves F1-scores over 95% across all datasets, and the ADMR module reaches 96.31% macro F1 across six classes. Additional tests on unseen attacks from ASVspoof2019 LA and error propagation analysis confirm LAVA's robustness and reliability. The framework advances the field by introducing a supervised approach to deepfake attribution and model recognition under open-set conditions, validated on public benchmarks and accompanied by publicly released models and code. Models and code are available at https://www.github.com/adipiz99/lava-framework.
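The confidence-based rejection used for open-set robustness fits in a few lines: if the classifier's top softmax probability falls below a threshold, the input is rejected as an unknown generator instead of being attributed. The threshold value here is an assumption for illustration.

```python
# Confidence-based rejection for open-set attribution.
import numpy as np

def classify_with_rejection(probs: np.ndarray, threshold: float = 0.9):
    """probs: (n_classes,) softmax output of an ADA- or ADMR-style classifier."""
    top = int(np.argmax(probs))
    if probs[top] < threshold:
        return None  # reject: likely an unseen generator
    return top
```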
[685] MuteSwap: Visual-informed Silent Video Identity Conversion
Yifan Liu, Yu Fang, Zhouhan Lin
Main category: cs.SD
TL;DR: MuteSwap enables silent face-based voice conversion (SFVC) using visual inputs only, achieving high performance in speech synthesis and identity conversion, even in noisy conditions.
Details
Motivation: Traditional voice conversion requires clean audio from both source and target speakers, which is impractical in scenarios like silent videos or noisy environments. SFVC addresses this by relying solely on visual cues.
Method: MuteSwap uses contrastive learning to align cross-modality identities and minimizes mutual information to separate shared visual features.
Result: MuteSwap outperforms audio-dependent methods in noisy conditions, producing intelligible speech and accurate identity conversion.
Conclusion: MuteSwap demonstrates the feasibility of SFVC and the effectiveness of its training approach, offering a robust solution for scenarios lacking clean audio.
Abstract: Conventional voice conversion modifies voice characteristics from a source speaker to a target speaker, relying on audio input from both sides. However, this process becomes infeasible when clean audio is unavailable, such as in silent videos or noisy environments. In this work, we focus on the task of Silent Face-based Voice Conversion (SFVC), which performs voice conversion entirely from visual inputs, i.e., given images of a target speaker and a silent video of a source speaker containing lip motion, SFVC generates speech aligned with the identity of the target speaker while preserving the speech content in the source silent video. As this task requires generating intelligible speech and converting identity using only visual cues, it is particularly challenging. To address this, we introduce MuteSwap, a novel framework that employs contrastive learning to align cross-modality identities and minimize mutual information to separate shared visual features. Experimental results show that MuteSwap achieves impressive performance in both speech synthesis and identity conversion, especially under noisy conditions where methods dependent on audio input fail to produce intelligible results, demonstrating both the effectiveness of our training approach and the feasibility of SFVC.
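One plausible instantiation of the cross-modality identity alignment is an InfoNCE-style contrastive loss between paired face and voice embeddings, sketched below. The temperature and the in-batch pairing are assumptions, and the paper's mutual-information minimization term is omitted.

```python
# InfoNCE-style contrastive alignment of face and voice identity embeddings.
import torch
import torch.nn.functional as F

def infonce_align(face_emb: torch.Tensor, voice_emb: torch.Tensor, tau: float = 0.07):
    """face_emb, voice_emb: (batch, dim); row i of each is the same identity."""
    f = F.normalize(face_emb, dim=-1)
    v = F.normalize(voice_emb, dim=-1)
    logits = f @ v.t() / tau           # (batch, batch) similarity matrix
    targets = torch.arange(f.size(0))  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```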
[686] Token Pruning in Audio Transformers: Optimizing Performance and Decoding Patch Importance
Taehan Lee, Hyukjun Lee
Main category: cs.SD
TL;DR: Token pruning reduces computational cost in ViT-based audio models with minimal accuracy loss, but requires careful selection of tokens beyond just intensity or variation.
Details
Motivation: Address the high computational cost of Vision Transformers (ViTs) in audio tasks by exploring token pruning, despite challenges in distinguishing relevant regions in time-frequency representations.
Method: Applied TopK token pruning to ViT-based audio models (AudioMAE and AST) using Mel-spectrograms, analyzing performance-computation trade-offs and token importance.
Result: Reduced MAC operations by 30-40% with <1% accuracy drop. High-intensity/variation tokens are crucial, but low-intensity/variation tokens also matter. AudioMAE retains more low-intensity tokens than AST.
Conclusion: Token pruning is viable for audio tasks but must consider diverse token importance. AudioMAE’s self-supervised training aids in retaining more tokens, unlike AST’s supervised focus.
Abstract: Vision Transformers (ViTs) have achieved state-of-the-art performance across various computer vision tasks, but their high computational cost remains a challenge. Token pruning has been proposed to reduce this cost by selectively removing less important tokens. While effective in vision tasks by discarding non-object regions, applying this technique to audio tasks presents unique challenges, as distinguishing relevant from irrelevant regions in time-frequency representations is less straightforward. In this study, for the first time, we applied token pruning to ViT-based audio classification models using Mel-spectrograms and analyzed the trade-offs between model performance and computational cost: TopK token pruning can reduce MAC operations of AudioMAE and AST by 30-40%, with less than a 1% drop in accuracy. Our analysis reveals that while high-intensity or high-variation tokens contribute significantly to model accuracy, low-intensity or low-variation tokens also remain important when token pruning is applied; pruning solely based on the intensity or variation of signals in a patch leads to a noticeable drop in accuracy. We support our claim by measuring high correlation between attention scores and these statistical features and by showing that retained tokens consistently receive distinct attention compared to pruned ones. We also show that AudioMAE retains more low-intensity tokens than AST. This can be explained by AudioMAE's self-supervised reconstruction objective, which encourages attention to all patches, whereas AST's supervised training focuses on label-relevant tokens.
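A minimal sketch of TopK token pruning in a ViT-style audio model: score patch tokens (here by CLS attention, one common choice), keep the top-k, and drop the rest before the remaining blocks. The scoring rule and keep ratio are illustrative assumptions; the papers' exact configurations may differ.

```python
# TopK token pruning for a ViT-style model with a CLS token.
import torch

def topk_prune(tokens: torch.Tensor, cls_attn: torch.Tensor, keep_ratio: float = 0.6):
    """tokens: (B, N+1, D) with CLS at index 0; cls_attn: (B, N) attention to patches."""
    n_keep = max(1, int(cls_attn.size(1) * keep_ratio))
    idx = cls_attn.topk(n_keep, dim=1).indices             # (B, n_keep)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    patches = tokens[:, 1:].gather(1, idx)                 # kept patch tokens
    return torch.cat([tokens[:, :1], patches], dim=1)      # CLS + kept patches
```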
[687] Codec-Based Deepfake Source Tracing via Neural Audio Codec Taxonomy
Xuanjun Chen, I-Ming Lin, Lin Zhang, Jiawei Du, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang
Main category: cs.SD
TL;DR: The paper introduces a method for tracing the source of deepfake speech (CodecFake) by analyzing neural audio codecs, addressing a gap in current anti-spoofing research.
Details
Motivation: Existing anti-spoofing research focuses on detecting deepfake audio but neglects tracing the specific CoSG systems used to generate them.
Method: The proposed approach dissects neural audio codecs to trace the CoSG systems involved in generating CodecFake.
Result: Experiments on the CodecFake+ dataset show feasibility in tracing CodecFake sources, though challenges remain.
Conclusion: Source tracing for CodecFake via neural audio codec taxonomy is promising but requires further research to address identified challenges.
Abstract: Recent advances in neural audio codec-based speech generation (CoSG) models have produced remarkably realistic audio deepfakes. We refer to deepfake speech generated by CoSG systems as codec-based deepfake, or CodecFake. Although existing anti-spoofing research on CodecFake predominantly focuses on verifying the authenticity of audio samples, almost no attention has been given to tracing the CoSG used in generating these deepfakes. In CodecFake generation, processes such as speech-to-unit encoding, discrete unit modeling, and unit-to-speech decoding are fundamentally based on neural audio codecs. Motivated by this, we introduce source tracing for CodecFake via neural audio codec taxonomy, which dissects neural audio codecs to trace CoSG. Our experimental results on the CodecFake+ dataset provide promising initial evidence for the feasibility of CodecFake source tracing while also highlighting several challenges that warrant further investigation.
[688] FedMLAC: Mutual Learning Driven Heterogeneous Federated Audio Classification
Jun Bai, Rajib Rana, Di Wu, Youyang Qu, Xiaohui Tao, Ji Zhang, Carlos Busso, Shivakumara Palaiahnakote
Main category: cs.SD
TL;DR: FedMLAC is a federated learning framework for audio classification that addresses data heterogeneity, model heterogeneity, and data poisoning via mutual learning and layer-wise pruning.
Details
Motivation: Existing FL methods for audio classification struggle with data heterogeneity, model heterogeneity, and data poisoning, lacking a unified solution.
Method: FedMLAC uses bidirectional knowledge distillation between personalized local models and a shared Plug-in model, plus layer-wise pruning to filter poisoned updates.
Result: FedMLAC outperforms baselines in accuracy and robustness across diverse audio tasks.
Conclusion: FedMLAC provides a unified, robust solution for federated audio classification challenges.
Abstract: Federated Learning (FL) offers a privacy-preserving framework for training audio classification (AC) models across decentralized clients without sharing raw data. However, Federated Audio Classification (FedAC) faces three major challenges: data heterogeneity, model heterogeneity, and data poisoning, which degrade performance in real-world settings. While existing methods often address these issues separately, a unified and robust solution remains underexplored. We propose FedMLAC, a mutual learning-based FL framework that tackles all three challenges simultaneously. Each client maintains a personalized local AC model and a lightweight, globally shared Plug-in model. These models interact via bidirectional knowledge distillation, enabling global knowledge sharing while adapting to local data distributions, thus addressing both data and model heterogeneity. To counter data poisoning, we introduce a Layer-wise Pruning Aggregation (LPA) strategy that filters anomalous Plug-in updates based on parameter deviations during aggregation. Extensive experiments on four diverse audio classification benchmarks, including both speech and non-speech tasks, show that FedMLAC consistently outperforms state-of-the-art baselines in classification accuracy and robustness to noisy data.
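The bidirectional knowledge distillation between a client's personalized model and the shared Plug-in model can be sketched as a pair of temperature-scaled KL terms; the temperature and weighting below are illustrative assumptions, not the paper's exact loss.

```python
# Mutual (bidirectional) distillation between local and Plug-in models.
import torch
import torch.nn.functional as F

def mutual_kd_loss(local_logits, plugin_logits, labels, T: float = 2.0, alpha: float = 0.5):
    ce = F.cross_entropy(local_logits, labels)
    # Local model learns from the (detached) Plug-in model, and vice versa.
    kd_local = F.kl_div(F.log_softmax(local_logits / T, dim=-1),
                        F.softmax(plugin_logits.detach() / T, dim=-1),
                        reduction="batchmean") * T * T
    kd_plugin = F.kl_div(F.log_softmax(plugin_logits / T, dim=-1),
                         F.softmax(local_logits.detach() / T, dim=-1),
                         reduction="batchmean") * T * T
    return ce + alpha * kd_local, alpha * kd_plugin  # (local loss, Plug-in loss)
```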
[689] Abstract Sound Fusion with Unconditional Inversion Models
Jing Liu, Enqi Lian, Moyao Deng
Main category: cs.SD
TL;DR: The paper introduces sound fusion using inversion techniques to create novel sounds beyond simple superposition, proposing SDE and ODE inversion models based on DPMSolver++ samplers.
Details
Motivation: To synthesize sounds that combine features of original and reference sounds without disclosing identifiable real-world events.
Method: Employ inversion techniques with SDE and ODE models using DPMSolver++ samplers, eliminating circular dependencies from noise prediction.
Result: Achieves controllable sound synthesis without prompt conditioning, maintaining flexible guidance.
Conclusion: The proposed inversion models effectively enable sound fusion with preserved features and controllable synthesis.
Abstract: An abstract sound is defined as a sound that does not disclose identifiable real-world sound events to a listener. Sound fusion aims to synthesize an original sound and a reference sound to generate a novel sound that exhibits auditory features beyond mere additive superposition of the sound constituents. To achieve this fusion, we employ inversion techniques that preserve essential features of the original sample while enabling controllable synthesis. We propose novel SDE and ODE inversion models based on DPMSolver++ samplers that reverse the sampling process by configuring model outputs as constants, eliminating circular dependencies incurred by noise prediction terms. Our inversion approach requires no prompt conditioning while maintaining flexible guidance during sampling.
[690] Advances in Intelligent Hearing Aids: Deep Learning Approaches to Selective Noise Cancellation
Haris Khan, Shumaila Asif, Hassan Nasir, Kamran Aziz Bhatti, Shahzad Amin Sheikh
Main category: cs.SD
TL;DR: This review explores AI-driven selective noise cancellation in hearing aids, covering technological advances, challenges, and future directions, with notable improvements in performance and real-time processing.
Details
Motivation: The shift from traditional amplification to intelligent, context-aware audio processing in hearing aids drives the need for evaluating AI-driven solutions like selective noise cancellation.
Method: The paper conducts a systematic literature review, analyzing deep learning architectures (e.g., Convolutional Recurrent Networks, Transformers), hardware deployment, clinical studies, and user-centric design.
Result: Recent models achieve up to 18.3 dB SI-SDR improvement, sub-10 ms real-time processing, and promising clinical outcomes, though challenges like power constraints and personalization persist.
Conclusion: Future research should focus on lightweight models, continual learning, standardized evaluation, and clinical translation to address gaps and enhance hearing aid technology.
Abstract: The integration of artificial intelligence into hearing assistance marks a paradigm shift from traditional amplification-based systems to intelligent, context-aware audio processing. This systematic literature review evaluates advances in AI-driven selective noise cancellation (SNC) for hearing aids, highlighting technological evolution, implementation challenges, and future research directions. We synthesize findings across deep learning architectures, hardware deployment strategies, clinical validation studies, and user-centric design. The review traces progress from early machine learning models to state-of-the-art deep networks, including Convolutional Recurrent Networks for real-time inference and Transformer-based architectures for high-accuracy separation. Key findings include significant gains over traditional methods, with recent models achieving up to 18.3 dB SI-SDR improvement on noisy-reverberant benchmarks, alongside sub-10 ms real-time implementations and promising clinical outcomes. Yet, challenges remain in bridging lab-grade models with real-world deployment - particularly around power constraints, environmental variability, and personalization. Identified research gaps include hardware-software co-design, standardized evaluation protocols, and regulatory considerations for AI-enhanced hearing devices. Future work must prioritize lightweight models, continual learning, context-based classification, and clinical translation to realize transformative hearing solutions for millions globally.
[691] AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation
Le Wang, Jun Wang, Feng Deng, Chen Zhang, Di Zhang, Kun Gai
Main category: cs.SD
TL;DR: AudioGen-Omni is a unified multimodal diffusion transformer model for generating high-fidelity audio, speech, and songs synchronized with video, using joint training and novel encoding techniques.
Details
Motivation: To create a versatile model capable of generating diverse, semantically rich audio aligned with multimodal inputs, overcoming limitations of text-frozen paradigms.
Method: Uses multimodal diffusion transformers (MMDit), a joint training paradigm, and a unified lyrics-transcription encoder with AdaLN-based attention and PAAPI for cross-modal alignment.
Result: Achieves state-of-the-art performance in audio generation tasks, with high semantic alignment, lip-sync accuracy, and efficient inference (1.91s for 8s audio).
Conclusion: AudioGen-Omni demonstrates superior audio generation quality, efficiency, and adaptability across diverse tasks, setting a new benchmark in the field.
Abstract: We present AudioGen-Omni - a unified approach based on multimodal diffusion transformers (MMDit), capable of generating high-fidelity audio, speech, and songs coherently synchronized with the input video. AudioGen-Omni introduces a novel joint training paradigm that seamlessly integrates large-scale video-text-audio corpora, enabling a model capable of generating semantically rich, acoustically diverse audio conditioned on multimodal inputs and adaptable to a wide range of audio generation tasks. AudioGen-Omni employs a unified lyrics-transcription encoder that encodes graphemes and phonemes from both sung and spoken inputs into dense frame-level representations. Dense frame-level representations are fused using an AdaLN-based joint attention mechanism enhanced with phase-aligned anisotropic positional infusion (PAAPI), wherein RoPE is selectively applied to temporally structured modalities to ensure precise and robust cross-modal alignment. By unfreezing all modalities and masking missing inputs, AudioGen-Omni mitigates the semantic constraints of text-frozen paradigms, enabling effective cross-modal conditioning. This joint training approach enhances audio quality, semantic alignment, and lip-sync accuracy, while also achieving state-of-the-art results on Text-to-Audio/Speech/Song tasks. With an inference time of 1.91 seconds for 8 seconds of audio, it offers substantial improvements in both efficiency and generality.
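As a rough illustration of AdaLN-based conditioning in diffusion transformers of this kind, the block below modulates layer-normalized activations with a scale and shift predicted from the conditioning vector. The single linear projection and the dimensions are assumptions, not the paper's exact design.

```python
# Minimal AdaLN (adaptive layer norm) conditioning block.
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        """x: (B, T, dim) token sequence; cond: (B, cond_dim) conditioning vector."""
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```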
[692] Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models
Chen Feng, Yicheng Lin, Shaojie Zhuo, Chenzheng Su, Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Xiaopeng Zhang
Main category: cs.SD
TL;DR: The paper benchmarks eight Post-Training Quantization (PTQ) methods on ASR models (Whisper and Moonshine) to evaluate efficiency-accuracy trade-offs for edge deployment.
Details
Motivation: Deploying ASR models on resource-constrained edge devices is challenging due to memory, compute, and power limits. Quantization, especially PTQ, can reduce model size and cost without retraining, but its performance impact on ASR models is unclear.
Method: The study benchmarks eight SOTA PTQ methods on Whisper and Moonshine models, evaluating accuracy, memory I/O, and bit operations across seven datasets. A framework integrates edge-ASR models, quantization algorithms, and analysis tools.
Result: Advanced PTQ techniques enable successful 3-bit quantization on high-capacity models, balancing efficiency and accuracy.
Conclusion: The findings offer insights for optimizing ASR models on low-power edge devices, demonstrating the viability of PTQ for edge deployment.
Abstract: Recent advances in Automatic Speech Recognition (ASR) have demonstrated remarkable accuracy and robustness in diverse audio applications, such as live transcription and voice command processing. However, deploying these models on resource-constrained edge devices (e.g., IoT device, wearables) still presents substantial challenges due to strict limits on memory, compute and power. Quantization, particularly Post-Training Quantization (PTQ), offers an effective way to reduce model size and inference cost without retraining. Despite its importance, the performance implications of various advanced quantization methods and bit-width configurations on ASR models remain unclear. In this work, we present a comprehensive benchmark of eight state-of-the-art (SOTA) PTQ methods applied to two leading edge-ASR model families, Whisper and Moonshine. We systematically evaluate model performance (i.e., accuracy, memory I/O, and bit operations) across seven diverse datasets from the open ASR leaderboard, analyzing the impact of quantization and various configurations on both weights and activations. Built on an extension of the LLM compression toolkit, our framework integrates edge-ASR models, diverse advanced quantization algorithms, a unified calibration and evaluation data pipeline, with detailed analysis tools. Our results characterize the trade-offs between efficiency and accuracy, demonstrating that even $3$-bit quantization can succeed on high capacity models when using advanced PTQ techniques. These findings provide valuable insights for optimizing ASR models on low-power, always-on edge devices.
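The benchmarked PTQ methods are considerably more sophisticated, but the basic quantize/dequantize round trip at low bit-width looks like this toy symmetric, per-tensor example:

```python
# Toy post-training quantization of a weight tensor to k bits (symmetric,
# per-tensor). Real PTQ methods add calibration, per-channel scales, etc.
import torch

def quantize_weights(w: torch.Tensor, bits: int = 3):
    qmax = 2 ** (bits - 1) - 1                   # e.g. 3 for 3-bit
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                             # dequantized weights

w = torch.randn(256, 256)
w_q = quantize_weights(w, bits=3)
print(f"mean abs error: {(w - w_q).abs().mean():.4f}")
```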
cs.LG
[693] PCS Workflow for Veridical Data Science in the Age of AI
Zachary T. Rewolinski, Bin Yu
Main category: cs.LG
TL;DR: The paper discusses the PCS framework for veridical data science, addressing uncertainty in AI-driven data science workflows, and introduces an updated workflow with generative AI integration.
Details
Motivation: To tackle the challenge of unreplicable AI-driven findings due to uncertainty in the data science life cycle (DSLC).
Method: Proposes an updated and streamlined PCS workflow, incorporating generative AI, and demonstrates it with a running example and case study.
Result: Highlights the impact of judgment calls in data cleaning on downstream prediction uncertainty.
Conclusion: The PCS framework provides a principled approach to manage uncertainty in the DSLC, enhancing reproducibility and reliability in AI-driven data science.
Abstract: Data science is a pillar of artificial intelligence (AI), which is transforming nearly every domain of human activity, from the social and physical sciences to engineering and medicine. While data-driven findings in AI offer unprecedented power to extract insights and guide decision-making, many are difficult or impossible to replicate. A key reason for this challenge is the uncertainty introduced by the many choices made throughout the data science life cycle (DSLC). Traditional statistical frameworks often fail to account for this uncertainty. The Predictability-Computability-Stability (PCS) framework for veridical (truthful) data science offers a principled approach to addressing this challenge throughout the DSLC. This paper presents an updated and streamlined PCS workflow, tailored for practitioners and enhanced with guided use of generative AI. We include a running example to display the PCS framework in action, and conduct a related case study which showcases the uncertainty in downstream predictions caused by judgment calls in the data cleaning stage.
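The case study's point, that judgment calls in data cleaning propagate into prediction uncertainty, can be illustrated by re-running a pipeline under several defensible cleaning variants and inspecting the spread of downstream predictions. The cleaning rule and model below are assumptions for illustration, not the paper's case study.

```python
# Stability check across data-cleaning judgment calls, in the PCS spirit.
import numpy as np
from sklearn.linear_model import Ridge

def fit_predict(X, y, X_new, outlier_cut):
    keep = np.abs(y - y.mean()) < outlier_cut * y.std()  # one judgment call
    return Ridge().fit(X[keep], y[keep]).predict(X_new)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(size=200)
X_new = rng.normal(size=(10, 5))

# Three defensible outlier cutoffs; the spread reveals cleaning sensitivity.
preds = np.stack([fit_predict(X, y, X_new, c) for c in (2.0, 2.5, 3.0)])
print("prediction spread across cleaning choices:", preds.std(axis=0).round(3))
```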
[694] A Residual Guided strategy with Generative Adversarial Networks in training Physics-Informed Transformer Networks
Ziyang Zhang, Feifan Zhang, Weidong Tang, Lei Shi, Tailai Chen
Main category: cs.LG
TL;DR: A novel Residual Guided Training strategy for Physics-Informed Transformer via GANs improves accuracy in solving nonlinear PDEs by addressing unresolved residuals and temporal causality violations.
Details
Motivation: Traditional PINNs struggle with unresolved residuals and temporal causality in PDE modeling, necessitating a more robust approach.Method: The framework combines a decoder-only Transformer for temporal correlations and a residual-aware GAN to prioritize high-residual regions, with causal penalties and adaptive sampling.
Result: Achieves up to three orders of magnitude reduction in relative MSE on Allen-Cahn, Klein-Gordon, and Navier-Stokes equations.
Conclusion: The method bridges deep learning and physics-driven modeling, offering a robust solution for multiscale, time-dependent PDE systems.
Abstract: Nonlinear partial differential equations (PDEs) are pivotal in modeling complex physical systems, yet traditional Physics-Informed Neural Networks (PINNs) often struggle with unresolved residuals in critical spatiotemporal regions and violations of temporal causality. To address these limitations, we propose a novel Residual Guided Training strategy for Physics-Informed Transformer via Generative Adversarial Networks (GAN). Our framework integrates a decoder-only Transformer to inherently capture temporal correlations through autoregressive processing, coupled with a residual-aware GAN that dynamically identifies and prioritizes high-residual regions. By introducing a causal penalty term and an adaptive sampling mechanism, the method enforces temporal causality while refining accuracy in problematic domains. Extensive numerical experiments on the Allen-Cahn, Klein-Gordon, and Navier-Stokes equations demonstrate significant improvements, achieving relative MSE reductions of up to three orders of magnitude compared to baseline methods. This work bridges the gap between deep learning and physics-driven modeling, offering a robust solution for multiscale and time-dependent PDE systems.
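A core ingredient, sampling collocation points in proportion to residual magnitude, can be sketched as follows; pde_residual stands in for the PDE's actual residual operator, and the GAN and causal penalty of the full method are omitted.

```python
# Residual-aware adaptive sampling for a PINN-style trainer.
import torch

def adaptive_sample(model, pde_residual, pool: torch.Tensor, n_select: int = 512):
    """pool: (N, d) candidate collocation points."""
    with torch.no_grad():
        r = pde_residual(model, pool).abs().squeeze(-1)  # (N,) residual magnitude
    # torch.multinomial accepts unnormalized non-negative weights.
    idx = torch.multinomial(r + 1e-12, n_select, replacement=False)
    return pool[idx]  # high-residual regions are oversampled next iteration
```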
[695] Deploying Geospatial Foundation Models in the Real World: Lessons from WorldCereal
Christina Butsko, Kristof Van Tricht, Gabriel Tseng, Giorgia Milli, David Rolnick, Ruben Cartuyvels, Inbal Becker Reshef, Zoltan Szantoi, Hannah Kerner
Main category: cs.LG
TL;DR: A protocol for integrating geospatial foundation models into operational mapping systems is proposed, tested with the Presto model for crop mapping, showing improved performance and generalization.
Details
Motivation: Addressing the gap between benchmark success and real-world deployment of geospatial foundation models by tackling data heterogeneity, resource constraints, and application-specific needs.Method: A three-step protocol: defining application requirements, adapting models to domain-specific data, and rigorous empirical testing, demonstrated with the Presto model for crop mapping.
Result: Fine-tuning pre-trained models outperforms conventional methods, showing strong spatial and temporal generalization. The protocol is scalable, as shown in the WorldCereal system.
Conclusion: The protocol offers a replicable blueprint for practitioners and a foundation for future research to operationalize geospatial models in remote sensing.
Abstract: The increasing availability of geospatial foundation models has the potential to transform remote sensing applications such as land cover classification, environmental monitoring, and change detection. Despite promising benchmark results, the deployment of these models in operational settings is challenging and rare. Standardized evaluation tasks often fail to capture real-world complexities relevant for end-user adoption such as data heterogeneity, resource constraints, and application-specific requirements. This paper presents a structured approach to integrate geospatial foundation models into operational mapping systems. Our protocol has three key steps: defining application requirements, adapting the model to domain-specific data and conducting rigorous empirical testing. Using the Presto model in a case study for crop mapping, we demonstrate that fine-tuning a pre-trained model significantly improves performance over conventional supervised methods. Our results highlight the model’s strong spatial and temporal generalization capabilities. Our protocol provides a replicable blueprint for practitioners and lays the groundwork for future research to operationalize foundation models in diverse remote sensing applications. Application of the protocol to the WorldCereal global crop-mapping system showcases the framework’s scalability.
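The adaptation step of the protocol, attaching a task head to a pretrained encoder and fine-tuning on in-domain labels, might look like the sketch below; backbone stands in for a Presto-like encoder, and freezing it is one reasonable choice rather than the paper's prescribed recipe.

```python
# Sketch of adapting a pretrained geospatial encoder to crop classification.
import torch
import torch.nn as nn

def build_finetune_model(backbone: nn.Module, feat_dim: int, n_classes: int,
                         freeze_backbone: bool = True) -> nn.Module:
    """backbone is assumed to map inputs to (B, feat_dim) features."""
    if freeze_backbone:
        for p in backbone.parameters():
            p.requires_grad = False  # train only the task head
    return nn.Sequential(backbone, nn.Linear(feat_dim, n_classes))
```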
[696] MARS: A Meta-Adaptive Reinforcement Learning Framework for Risk-Aware Multi-Agent Portfolio Management
Jiayi Chen, Jing Li, Guiling Wang
Main category: cs.LG
TL;DR: MARS is a multi-agent RL framework for portfolio management, balancing risk and return by dynamically adapting to market conditions using a Meta-Adaptive Controller and risk-aware agents.
Details
Motivation: Existing RL models struggle to adapt to changing market conditions, leading to poor risk-return balance. MARS aims to address this by introducing a multi-agent, risk-aware approach.
Method: MARS uses a Heterogeneous Agent Ensemble with unique risk profiles, enforced by Safety-Critic networks and risk-tolerance thresholds. A Meta-Adaptive Controller dynamically orchestrates the ensemble.
Result: MARS reduces maximum drawdown and volatility while maintaining competitive returns, outperforming on risk-adjusted criteria in experiments.
Conclusion: MARS provides a disciplined, adaptive portfolio management solution robust to market fluctuations, leveraging behavioral diversity for superior risk-return balance.
Abstract: Reinforcement Learning (RL) has shown significant promise in automated portfolio management; however, effectively balancing risk and return remains a central challenge, as many models fail to adapt to dynamically changing market conditions. In this paper, we propose Meta-controlled Agents for a Risk-aware System (MARS), a novel RL framework designed to explicitly address this limitation through a multi-agent, risk-aware approach. Instead of a single monolithic model, MARS employs a Heterogeneous Agent Ensemble where each agent possesses a unique, intrinsic risk profile. This profile is enforced by a dedicated Safety-Critic network and a specific risk-tolerance threshold, allowing agents to specialize in behaviors ranging from capital preservation to aggressive growth. To navigate different market regimes, a high-level Meta-Adaptive Controller (MAC) learns to dynamically orchestrate the ensemble. By adjusting its reliance on conservative versus aggressive agents, the MAC effectively lowers portfolio volatility during downturns and seeks higher returns in bull markets, thus minimizing maximum drawdown and enhancing overall stability. This two-tiered structure allows MARS to generate a disciplined and adaptive portfolio that is robust to market fluctuations. The framework achieves a superior balance between risk and return by leveraging behavioral diversity rather than explicit market-feature engineering. Experiments on major international stock indexes, including periods of significant financial crisis, demonstrate the efficacy of our framework on risk-adjusted criteria, significantly reducing maximum drawdown and volatility while maintaining competitive returns.
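The two-tier structure can be caricatured as a meta-controller that maps market state to weights over risk-heterogeneous agents and blends their actions. Agent internals, Safety-Critic networks, and risk thresholds are omitted, and all names below are illustrative.

```python
# Meta-controller weighting a heterogeneous ensemble of policy agents.
import torch
import torch.nn as nn

class MetaController(nn.Module):
    def __init__(self, state_dim: int, n_agents: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_agents))

    def forward(self, state):                  # state: (B, state_dim)
        return torch.softmax(self.net(state), dim=-1)

def blended_action(state, agents, controller):
    """agents: list of callables mapping state -> (B, act_dim) portfolio actions."""
    w = controller(state)                                     # (B, n_agents)
    actions = torch.stack([a(state) for a in agents], dim=1)  # (B, n_agents, act_dim)
    return (w.unsqueeze(-1) * actions).sum(dim=1)             # weighted blend
```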
[697] Discrete approach to machine learning
Dmitriy Kashitsyn, Dmitriy Shabanov
Main category: cs.LG
TL;DR: The paper presents methods for encoding and processing structural information using sparse bit vectors and linear vectors, with applications in dimensionality reduction and discrete embeddings.
Details
Motivation: To explore efficient ways of encoding and processing structural information in multidimensional spaces, drawing parallels to biological systems like the neocortex.
Method: Uses speculative stochastic dimensionality reduction and geometric methods for discrete embeddings, applied to language morphology and immunohistochemical markers.
Result: Demonstrates a map of code space layout resembling neocortical pinwheels, suggesting similarities between model processes and neocortex organization.
Conclusion: The findings cautiously suggest parallels between the model’s processes and biological neocortex organization, opening avenues for further research.
Abstract: The article explores an encoding and structural information processing approach using sparse bit vectors and fixed-length linear vectors. The following are presented: a discrete method of speculative stochastic dimensionality reduction of multidimensional code and linear spaces with linear asymptotic complexity; a geometric method for obtaining discrete embeddings of an organised code space that reflect the internal structure of a given modality. The structure and properties of a code space are investigated using three modalities as examples: morphology of Russian and English languages, and immunohistochemical markers. Parallels are drawn between the resulting map of the code space layout and so-called pinwheels appearing on the mammalian neocortex. A cautious assumption is made about similarities between neocortex organisation and processes happening in our models.
[698] A Data-Driven Machine Learning Approach for Predicting Axial Load Capacity in Steel Storage Rack Columns
Bakhtiyar Mammadli, Casim Yazici, Muhammed Gürbüz, İrfan Kocaman, F. Javier Dominguez-Gutierrez, Fatih Mehmet Özkal
Main category: cs.LG
TL;DR: A machine learning framework predicts axial load-bearing capacity of cold-formed steel members, using Gradient Boosting Regression for superior performance and SHAP for interpretability, integrated into a user-friendly web tool.
Details
Motivation: Traditional analytical methods struggle with nonlinearities and complexities in buckling behavior, prompting the need for a robust, interpretable ML approach.
Method: The study evaluates regression algorithms on a curated dataset of steel column parameters, selecting Gradient Boosting Regression as the best performer, and uses SHAP for feature interpretation.
Result: Gradient Boosting Regression outperformed other models in predictive accuracy (R2, RMSE, MAE) and was deployed via a Python-based web interface for practical use.
Conclusion: The framework enhances design safety and workflow efficiency in structural applications, demonstrating the value of data-driven tools for buckling-critical scenarios.
Abstract: In this study, we present a machine learning (ML) framework to predict the axial load-bearing capacity (in kN) of cold-formed steel structural members. The methodology emphasizes robust model selection and interpretability, addressing the limitations of traditional analytical approaches in capturing the nonlinearities and geometrical complexities inherent to buckling behavior. The dataset, comprising key geometric and mechanical parameters of steel columns, was curated with appropriate pre-processing steps including removal of non-informative identifiers and imputation of missing values. A comprehensive suite of regression algorithms, ranging from linear models to kernel-based regressors and ensemble tree methods, was evaluated. Among these, Gradient Boosting Regression exhibited superior predictive performance across multiple metrics, including the coefficient of determination ($R^2$), root mean squared error (RMSE), and mean absolute error (MAE), and was consequently selected as the final model. Model interpretability was addressed using SHapley Additive exPlanations (SHAP), enabling insight into the relative importance and interaction of input features influencing the predicted axial capacity. To facilitate practical deployment, the model was integrated into an interactive, Python-based web interface via Streamlit. This tool allows end-users, such as structural engineers and designers, to input design parameters manually or through CSV upload, and to obtain real-time predictions of axial load capacity without the need for programming expertise. Applied to the context of steel storage rack columns, the framework demonstrates how data-driven tools can enhance design safety, streamline validation workflows, and inform decision-making in structural applications where buckling is a critical failure mode.
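A minimal sketch of the selected pipeline, Gradient Boosting Regression plus SHAP attributions, with random placeholders standing in for the paper's column parameters and dataset:

```python
# Gradient boosting regression with SHAP-based interpretability.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))   # placeholders for geometric/mechanical features
y = X @ rng.normal(size=6) + rng.normal(size=500)

model = GradientBoostingRegressor().fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:50])  # per-feature contributions
print("mean |SHAP| per feature:", np.abs(shap_values).mean(axis=0).round(3))
```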
[699] Satellite Connectivity Prediction for Fast-Moving Platforms
Chao Yan, Babak Mafakheri
Main category: cs.LG
TL;DR: ML predicts satellite signal quality for fast-moving objects, achieving high accuracy (F1=0.97), enabling proactive switching and seamless connectivity.
Details
Motivation: Address the challenge of maintaining reliable satellite connectivity for fast-moving objects like aircraft, where frequent switching is required due to mobility and lack of terrestrial coverage.
Method: Analyzed real GEO satellite-aircraft communication data using ML to predict signal quality, enabling proactive network switching.
Result: Achieved an F1 score of 0.97, demonstrating high accuracy in predicting signal quality during flight.
Conclusion: The ML model effectively predicts signal quality, enabling seamless connectivity and can be adapted for other moving objects like vehicles and trains.
Abstract: Satellite connectivity is gaining increased attention as the demand for seamless internet access, especially in transportation and remote areas, continues to grow. For fast-moving objects such as aircraft, vehicles, or trains, satellite connectivity is critical due to their mobility and frequent presence in areas without terrestrial coverage. Maintaining reliable connectivity in these cases requires frequent switching between satellite beams, constellations, or orbits. To enhance user experience and address challenges like long switching times, Machine Learning (ML) algorithms can analyze historical connectivity data and predict network quality at specific locations. This allows for proactive measures, such as network switching before connectivity issues arise. In this paper, we analyze a real dataset of communication between a Geostationary Orbit (GEO) satellite and aircraft over multiple flights, using ML to predict signal quality. Our prediction model achieved an F1 score of 0.97 on the test data, demonstrating the accuracy of machine learning in predicting signal quality during flight. By enabling seamless broadband service, including roaming between different satellite constellations and providers, our model addresses the need for real-time predictions of signal quality. This approach can further be adapted to automate satellite and beam-switching mechanisms to improve overall communication efficiency. The model can also be retrained and applied to any moving object with satellite connectivity, using customized datasets, including connected vehicles and trains.
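The proactive-switching use of such a predictor reduces to a simple decision rule: forecast link quality for the current beam and trigger a handover when it drops below a threshold. The predictor interface, feature layout, and threshold below are assumptions for illustration.

```python
# Predict-then-switch decision rule on top of a trained link-quality classifier.
def maybe_switch(predictor, features, quality_threshold: float = 0.5):
    """features: recent link/position features for the current beam."""
    predicted_quality = predictor.predict_proba([features])[0, 1]  # P(good link)
    if predicted_quality < quality_threshold:
        return "switch"  # start handover to an alternative beam/constellation
    return "stay"
```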
[700] Optimizing Day-Ahead Energy Trading with Proximal Policy Optimization and Blockchain
Navneet Verma, Ying Xie
Main category: cs.LG
TL;DR: A novel framework combining PPO reinforcement learning and blockchain optimizes prosumer trading in day-ahead energy markets, achieving demand-supply balance and secure transactions.
Details
Motivation: Address challenges in balancing supply-demand and ensuring trust in decentralized energy markets with high renewable penetration.
Method: Integrates PPO reinforcement learning for multi-objective optimization and blockchain for secure data management, tested with ERCOT data.
Result: Achieves 2% demand-supply balance, near-optimal costs, and robust storage policies; blockchain ensures transparency and security.
Conclusion: The framework offers a scalable, trustworthy solution for decentralized energy trading with actionable insights for deployment.
Abstract: The increasing penetration of renewable energy sources in day-ahead energy markets introduces challenges in balancing supply and demand, ensuring grid resilience, and maintaining trust in decentralized trading systems. This paper proposes a novel framework that integrates the Proximal Policy Optimization (PPO) algorithm, a state-of-the-art reinforcement learning method, with blockchain technology to optimize automated trading strategies for prosumers in day-ahead energy markets. We introduce a comprehensive framework that employs an RL agent for multi-objective energy optimization and blockchain for tamper-proof data and transaction management. Simulations using real-world data from the Electric Reliability Council of Texas (ERCOT) demonstrate the effectiveness of our approach. The RL agent achieves demand-supply balancing within 2% and maintains near-optimal supply costs for the majority of the operating hours. Moreover, it generates robust battery storage policies capable of handling variability in solar and wind generation. All decisions are recorded on an Algorand-based blockchain, ensuring transparency, auditability, and security - key enablers for trustworthy multi-agent energy trading. Our contributions include a novel system architecture, curriculum learning for robust agent development, and actionable policy insights for practical deployment.
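The tamper-evidence property the blockchain contributes can be illustrated with a toy hash-chained ledger, where each trade record commits to the hash of the previous one. This shows the underlying idea only; it is not the Algorand API used in the paper.

```python
# Toy hash-chained ledger: each block commits to the previous block's hash,
# so altering any past record invalidates every later hash.
import hashlib
import json

def append_block(chain: list, record: dict) -> list:
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"record": record, "prev_hash": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    chain.append({**body, "hash": digest})
    return chain

ledger = []
append_block(ledger, {"hour": 0, "kwh": 3.2, "price": 41.0})
append_block(ledger, {"hour": 1, "kwh": 2.7, "price": 39.5})
print(ledger[1]["prev_hash"] == ledger[0]["hash"])  # True: records are linked
```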
[701] GNN-ASE: Graph-Based Anomaly Detection and Severity Estimation in Three-Phase Induction Machines
Moutaz Bellah Bentrad, Adel Ghoggal, Tahar Bahi, Abderaouf Bahi
Main category: cs.LG
TL;DR: A model-free fault diagnosis method for induction machines using Graph Neural Networks (GNNs) is proposed, achieving high accuracy for multiple fault types without preprocessing.
Details
Motivation: Traditional model-based methods are complex and computationally expensive, prompting the need for a simpler, efficient alternative.
Method: The GNN-ASE model uses raw current and vibration signals to automatically learn features and detect faults like eccentricity, bearing defects, and broken rotor bars.
Result: The model achieved 92.5% accuracy for eccentricity, 91.2% for bearing faults, and 93.1% for broken rotor bars, demonstrating robustness.
Conclusion: The GNN-based framework is a lightweight, powerful alternative to conventional methods, suitable for real-world monitoring and predictive maintenance.
Abstract: The diagnosis of induction machines has traditionally relied on model-based methods that require the development of complex dynamic models, making them difficult to implement and computationally expensive. To overcome these limitations, this paper proposes a model-free approach using Graph Neural Networks (GNNs) for fault diagnosis in induction machines. The focus is on detecting multiple fault types – including eccentricity, bearing defects, and broken rotor bars – under varying severity levels and load conditions. Unlike traditional approaches, raw current and vibration signals are used as direct inputs, eliminating the need for signal preprocessing or manual feature extraction. The proposed GNN-ASE model automatically learns and extracts relevant features from raw inputs, leveraging the graph structure to capture complex relationships between signal types and fault patterns. It is evaluated for both individual fault detection and multi-class classification of combined fault conditions. Experimental results demonstrate the effectiveness of the proposed model, achieving 92.5% accuracy for eccentricity defects, 91.2% for bearing faults, and 93.1% for broken rotor bar detection. These findings highlight the model’s robustness and generalization capability across different operational scenarios. The proposed GNN-based framework offers a lightweight yet powerful solution that simplifies implementation while maintaining high diagnostic performance. It stands as a promising alternative to conventional model-based diagnostic techniques for real-world induction machine monitoring and predictive maintenance.
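A minimal graph classifier of the general kind described, sketched with torch_geometric; the construction of graphs from raw current and vibration windows is assumed to happen upstream, and this is not the paper's exact architecture.

```python
# Minimal GNN fault classifier over per-machine signal graphs.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class FaultGNN(torch.nn.Module):
    def __init__(self, in_dim: int, hidden: int, n_faults: int):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, n_faults)

    def forward(self, x, edge_index, batch):
        """x: node features; edge_index: graph edges; batch: graph membership."""
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        return self.head(global_mean_pool(x, batch))  # one logit set per graph
```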
[702] Reproducibility of Machine Learning-Based Fault Detection and Diagnosis for HVAC Systems in Buildings: An Empirical Study
Adil Mukhtar, Michael Hadwiger, Franz Wotawa, Gerald Schweiger
Main category: cs.LG
TL;DR: The paper examines reproducibility issues in ML applications for building energy systems, finding widespread lack of transparency and reproducibility due to insufficient data and code disclosure.
Details
Motivation: Reproducibility is critical for scientific integrity, and ML faces unique challenges like nondeterminism and computational constraints. This study addresses the gap in understanding how these issues manifest in applied fields like building energy systems.
Method: The study analyzes transparency and reproducibility standards in ML applications for building energy systems by reviewing key dimensions like data and code disclosure.
Result: Nearly all articles were non-reproducible due to insufficient disclosure. 72% lacked dataset details, only two shared code (one broken), and no significant difference was found between academic and industry-authored papers.
Conclusion: Targeted interventions—such as guidelines, training, and policies—are needed to improve reproducibility and transparency in ML research for applied disciplines.
Abstract: Reproducibility is a cornerstone of scientific research, enabling independent verification and validation of empirical findings. The topic gained prominence in fields such as psychology and medicine, where concerns about non-replicable results sparked ongoing discussions about research practices. In recent years, the fast-growing field of Machine Learning (ML) has become part of this discourse, as it faces similar concerns about transparency and reliability. Some reproducibility issues in ML research are shared with other fields, such as limited access to data and missing methodological details. In addition, ML introduces specific challenges, including inherent nondeterminism and computational constraints. While reproducibility issues are increasingly recognized by the ML community and its major conferences, less is known about how these challenges manifest in applied disciplines. This paper contributes to closing this gap by analyzing the transparency and reproducibility standards of ML applications in building energy systems. The results indicate that nearly all articles are not reproducible due to insufficient disclosure across key dimensions of reproducibility. 72% of the articles do not specify whether the dataset used is public, proprietary, or commercially available. Only two papers share a link to their code - one of which was broken. Two-thirds of the publications were authored exclusively by academic researchers, yet no significant differences in reproducibility were observed compared to publications with industry-affiliated authors. These findings highlight the need for targeted interventions, including reproducibility guidelines, training for researchers, and policies by journals and conferences that promote transparency and reproducibility.
[703] Hallucination Detection and Mitigation with Diffusion in Multi-Variate Time-Series Foundation Models
Vijja Wichitwechkarn, Charles Fox, Ruchi Choudhary
Main category: cs.LG
TL;DR: The paper introduces definitions and methods for detecting and mitigating hallucinations in multi-variate time-series (MVTS) foundation models, achieving significant reduction in hallucination levels.
Details
Motivation: Existing definitions and methods for hallucination in natural language processing do not extend to MVTS foundation models, necessitating new approaches.
Method: Proposes new definitions for MVTS hallucination, uses a diffusion model for detection and mitigation, and benchmarks relational hallucination levels.
Result: Pre-trained MVTS models hallucinate up to 59.5% as much as a baseline; mitigation reduces this by up to 47.7%.
Conclusion: The new definitions and methods can enhance the adoption and safe use of MVTS foundation models.
Abstract: Foundation models for natural language processing have many coherent definitions of hallucination and methods for its detection and mitigation. However, analogous definitions and methods do not exist for multi-variate time-series (MVTS) foundation models. We propose new definitions for MVTS hallucination, along with new detection and mitigation methods using a diffusion model to estimate hallucination levels. We derive relational datasets from popular time-series datasets to benchmark these relational hallucination levels. Using these definitions and models, we find that open-source pre-trained MVTS imputation foundation models relationally hallucinate on average up to 59.5% as much as a weak baseline. The proposed mitigation method reduces this by up to 47.7% for these models. The definition and methods may improve adoption and safe usage of MVTS foundation models.
[704] Multi-Grained Temporal-Spatial Graph Learning for Stable Traffic Flow Forecasting
Zhenan Lin, Yuni Lai, Wai Lun Lo, Richard Tai-Chiu Hsung, Harris Sik-Ho Tsang, Xiaoyu Xue, Kai Zhou, Yulin Zhu
Main category: cs.LG
TL;DR: A multi-grained temporal-spatial graph learning framework is proposed to enhance traffic flow forecasting by balancing local and global patterns, outperforming existing methods.
Details
Motivation: Existing methods struggle with encoding global temporal-spatial patterns and overfitting on pre-defined geographical correlations, limiting robustness in complex traffic environments.
Method: The framework uses a graph transformer encoder for global patterns, graph convolution for local patterns, and a gated fusion unit with residual connections to balance them.
Result: The model demonstrates strong representation capability and consistently outperforms baselines on real-world traffic networks.
Conclusion: The proposed method effectively mines hidden global relations and balances local-global patterns, improving traffic flow forecasting.
Abstract: Time-evolving traffic flow forecasting plays a vital role in intelligent transportation systems and smart cities. However, dynamic traffic flow forecasting is a highly nonlinear problem with complex temporal-spatial dependencies. Although existing methods have contributed greatly to mining temporal-spatial patterns in complex traffic networks, they fail to encode global temporal-spatial patterns and are prone to overfitting on pre-defined geographical correlations, which hinders their robustness in complex traffic environments. To tackle this issue, we propose a multi-grained temporal-spatial graph learning framework that adaptively augments the global temporal-spatial patterns obtained from a crafted graph transformer encoder with the local patterns from graph convolution, via a gated fusion unit with residual connections. Under these circumstances, our model can mine the hidden global temporal-spatial relations between monitoring stations and balance the relative importance of local and global temporal-spatial patterns. Experimental results demonstrate the strong representation capability of our method, and our model consistently outperforms strong baselines on various real-world traffic networks.
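The gated fusion unit can be sketched directly: a sigmoid gate computed from the concatenated local and global representations interpolates between them, with a residual connection. The dimensions and the choice to add the residual to the local branch are assumptions.

```python
# Gated fusion of local (graph convolution) and global (graph transformer)
# representations, with a residual connection.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, local_h: torch.Tensor, global_h: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([local_h, global_h], dim=-1)))
        fused = g * local_h + (1 - g) * global_h
        return fused + local_h  # residual connection to the local branch
```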
[705] Stochastic Optimal Control via Measure Relaxations
Etienne Buehrle, Christoph Stiller
Main category: cs.LG
TL;DR: The paper presents a convex optimization method over occupation measures for solving stochastic optimal control problems, addressing scalability issues in long horizons.
Details
Motivation: Current methods like robust or scenario-based optimization struggle with scalability for long optimization horizons in stochastic systems.
Method: The approach reformulates the problem as a convex optimization over occupation measures and uses Christoffel polynomials to learn cost functions from data.
Result: The method is validated on synthetic and real-world scenarios, with experimental code made publicly available.
Conclusion: The proposed convex optimization method offers a scalable solution for stochastic optimal control problems.
Abstract: The optimal control problem of stochastic systems is commonly solved via robust or scenario-based optimization methods, which are both challenging to scale to long optimization horizons. We cast the optimal control problem of a stochastic system as a convex optimization problem over occupation measures. We demonstrate our method on a set of synthetic and real-world scenarios, learning cost functions from data via Christoffel polynomials. The code for our experiments is available at https://github.com/ebuehrle/dpoc.
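Learning a cost from data with Christoffel polynomials can be sketched as follows: build an empirical moment matrix over a polynomial basis, then score a query point by $v(x)^\top M^{-1} v(x)$, which is small inside the data's support and grows off-support. The basis degree and ridge term are assumptions for illustration.

```python
# Data-driven cost via the (inverse) Christoffel function, degree-2 basis in 2D.
import numpy as np

def christoffel_cost(samples: np.ndarray, x: np.ndarray, eps: float = 1e-6):
    """samples: (n, 2) data points; x: (2,) query point."""
    def basis(p):
        a, b = p
        return np.array([1.0, a, b, a * a, a * b, b * b])
    V = np.stack([basis(s) for s in samples])      # (n, 6) basis evaluations
    M = V.T @ V / len(samples) + eps * np.eye(6)   # regularized moment matrix
    v = basis(x)
    return float(v @ np.linalg.solve(M, v))        # large = far from the data
```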
[706] Adaptive Prototype Knowledge Transfer for Federated Learning with Mixed Modalities and Heterogeneous Tasks
Keke Gai, Mohan Wang, Jing Yu, Dongjue Wang, Qi Wu
Main category: cs.LG
TL;DR: AproMFL is a prototype-based Multimodal Federated Learning framework for mixed modalities, addressing label and task heterogeneity without unified labels, outperforming baselines.
Details
Motivation: Existing prototype-based MFL methods assume unified labels and identical tasks, which is impractical for mixed modalities. AproMFL aims to solve this.Method: AproMFL uses adaptive prototype construction, client relationship graphs for aggregation, and global knowledge transfer losses.
Result: AproMFL outperforms baselines, achieving 0.42%~6.09% higher accuracy and 1.6%~3.89% higher recall on heterogeneous datasets.
Conclusion: AproMFL effectively handles mixed modalities and task heterogeneity, improving performance in federated learning.
Abstract: Multimodal Federated Learning (MFL) with mixed modalities enables unimodal and multimodal clients to collaboratively train models while ensuring clients’ privacy. As a representative sample of local data, prototypes offer an approach with low resource consumption and no reliance on prior knowledge for MFL with mixed modalities. However, existing prototype-based MFL methods assume unified labels across clients and identical tasks per client, which is impractical in MFL with mixed modalities. In this work, we propose an Adaptive prototype-based Multimodal Federated Learning (AproMFL) framework for mixed modalities to address the aforementioned issues. Our AproMFL transfers knowledge through adaptively-constructed prototypes without unified labels. Clients adaptively select prototype construction methods in line with their labels; the server converts client prototypes into unified multimodal prototypes and clusters them to form global prototypes. To address model aggregation issues in task heterogeneity, we develop a client relationship graph-based scheme to dynamically adjust aggregation weights. Furthermore, we propose a global prototype knowledge transfer loss and a global model knowledge transfer loss to enable the transfer of global knowledge to local knowledge. Experimental results show that AproMFL outperforms four baselines on three highly heterogeneous datasets ($\alpha=0.1$) and two heterogeneous tasks, with the optimal results in accuracy and recall being 0.42%~6.09% and 1.6%~3.89% higher than those of FedIoT (FedAvg-based MFL), respectively.
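As background for the prototype machinery: a prototype is, at its simplest, a class-mean embedding. The sketch below shows that base object only; AproMFL's adaptive construction, multimodal unification, and clustering steps are more involved and not reproduced here.

```python
import torch

def class_prototypes(embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # One prototype per class: the mean of that class's embeddings.
    classes = labels.unique()
    return torch.stack([embeddings[labels == c].mean(dim=0) for c in classes])

emb = torch.randn(64, 128)                     # toy local embeddings
labels = torch.randint(0, 4, (64,))            # toy local labels
print(class_prototypes(emb, labels).shape)     # torch.Size([4, 128])
```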
[707] FRAM: Frobenius-Regularized Assignment Matching with Mixed-Precision Computing
Binrui Shen, Yuan Liang, Shengxin Zhu
Main category: cs.LG
TL;DR: The paper proposes a novel relaxation framework (FRA) for graph matching, addressing errors in existing methods by reformulating the projection step with a tunable regularization term. The SDSN algorithm and mixed-precision architecture enhance efficiency and accuracy, achieving significant speedups.
Details
Motivation: Existing projection-based relaxations for QAP introduce errors due to numerical scale sensitivity and geometric misalignment. The goal is to mitigate these errors while maintaining accuracy.Method: Introduces a Frobenius-regularized Linear Assignment (FRA) problem with a tunable regularization term. Uses the Scaling Doubly Stochastic Normalization (SDSN) algorithm and a mixed-precision architecture for efficiency.
Result: FRAM outperforms baselines in accuracy and achieves up to 370X speedup with negligible accuracy loss when using GPU-based mixed-precision.
Conclusion: The FRA framework and SDSN algorithm effectively address errors in graph matching relaxations, offering a balance of accuracy and computational efficiency.
Abstract: Graph matching, typically formulated as a Quadratic Assignment Problem (QAP), seeks to establish node correspondences between two graphs. To address the NP-hardness of QAP, some existing methods adopt projection-based relaxations that embed the problem into the convex hull of the discrete domain. However, these relaxations inevitably enlarge the feasible set, introducing two sources of error: numerical scale sensitivity and geometric misalignment between the relaxed and original domains. To alleviate these errors, we propose a novel relaxation framework by reformulating the projection step as a Frobenius-regularized Linear Assignment (FRA) problem, where a tunable regularization term mitigates feasible region inflation. This formulation enables normalization-based operations to preserve numerical scale invariance without compromising accuracy. To efficiently solve FRA, we propose the Scaling Doubly Stochastic Normalization (SDSN) algorithm. Building on its favorable computational properties, we develop a theoretically grounded mixed-precision architecture to achieve substantial acceleration. Comprehensive CPU-based benchmarks demonstrate that FRAM consistently outperforms all baseline methods under identical precision settings. When combined with a GPU-based mixed-precision architecture, FRAM achieves up to 370X speedup over its CPU-FP64 counterpart, with negligible loss in solution accuracy.
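The projection step at the heart of such relaxations repeatedly rescales a matrix toward the doubly stochastic polytope. Below is generic Sinkhorn row/column scaling for orientation; the paper's SDSN algorithm solves the Frobenius-regularized assignment and differs in its normalization rule, so treat this only as background for the projection idea.

```python
import numpy as np

def sinkhorn(M: np.ndarray, n_iters: int = 200) -> np.ndarray:
    """Scale a matrix of affinities toward a doubly stochastic matrix."""
    K = np.exp(M)                              # positive matrix from affinities
    for _ in range(n_iters):
        K /= K.sum(axis=1, keepdims=True)      # normalize rows
        K /= K.sum(axis=0, keepdims=True)      # normalize columns
    return K

rng = np.random.default_rng(1)
D = sinkhorn(rng.standard_normal((5, 5)))
print(D.sum(axis=0), D.sum(axis=1))            # both close to vectors of ones
```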
[708] A Dynamic, Context-Aware Framework for Risky Driving Prediction Using Naturalistic Data
Amir Hossein Kalantari, Eleonora Papadimitriou, Amir Pooyan Afghari
Main category: cs.LG
TL;DR: A dynamic framework for identifying risky driving behaviour using Belgian naturalistic driving data, leveraging rolling time windows and bi-level optimisation to adapt thresholds and hyperparameters.
Details
Motivation: Existing frameworks for monitoring risky driving behaviour rely on fixed time windows and static thresholds, limiting adaptability to real-world driving's stochastic nature.Method: Proposes a dynamic, individualised framework using rolling time windows and bi-level optimisation. Evaluates speed-weighted headway and harsh driving events with Random Forest, XGBoost, and DNN models.
Result: DNN excelled in high-recall tasks, XGBoost showed balanced performance, and Random Forest was sensitive to dynamic adjustments. Speed-weighted headway was more stable than harsh driving events.
Conclusion: Adaptive, personalised risk detection enhances real-time safety feedback and driver support in intelligent transport systems.
Abstract: Naturalistic driving studies offer a powerful means for observing and quantifying real-world driving behaviour. One of their prominent applications in traffic safety is the continuous monitoring and classification of risky driving behaviour. However, many existing frameworks rely on fixed time windows and static thresholds for distinguishing between safe and risky behaviour - limiting their ability to respond to the stochastic nature of real-world driving. This study proposes a dynamic and individualised framework for identifying risky driving behaviour using Belgian naturalistic driving data. The approach leverages a rolling time window and bi-level optimisation to dynamically calibrate both risk thresholds and model hyperparameters, capturing subtle behavioural shifts. Two safety indicators, speed-weighted headway and harsh driving events, were evaluated using three data-driven models: Random Forest, XGBoost, and Deep Neural Network (DNN). The DNN demonstrated strong capability in capturing subtle changes in driving behaviour, particularly excelling in high-recall tasks, making it promising for early-stage risk detection. XGBoost provided the most balanced and stable performance across different thresholds and evaluation metrics. While Random Forest showed more variability, it responded sensitively to dynamic threshold adjustments, which may be advantageous during model adaptation or tuning. Speed-weighted headway emerged as a more stable and context-sensitive risk indicator than harsh driving events, likely due to its robustness to label sparsity and contextual variation. Overall, the findings support the value of adaptive, personalised risk detection approaches for enhancing real-time safety feedback and tailoring driver support in intelligent transport systems.
[709] Maximize margins for robust splicing detection
Julien Simon de Kergunic, Rony Abecidan, Patrick Bas, Vincent Itier
Main category: cs.LG
TL;DR: Deep learning-based forensic tools for splicing detection are sensitive to training conditions and post-processing. The paper shows how latent space variability affects generalization and proposes training multiple model variants to maximize robustness.
Details
Motivation: Address the unreliability of deep learning-based splicing detectors due to their sensitivity to training conditions and post-processing.Method: Analyze latent space variability in deep architectures, correlate latent margins with generalization, and propose training multiple model variants to select the most robust one.
Result: Latent space differences significantly impact detector performance on post-processed images, with latent margins correlating strongly with generalization.
Conclusion: Training multiple model variants and selecting based on latent margins can improve the robustness of splicing detectors against post-processing.
Abstract: Despite recent progress in splicing detection, deep learning-based forensic tools remain difficult to deploy in practice due to their high sensitivity to training conditions. Even mild post-processing applied to evaluation images can significantly degrade detector performance, raising concerns about their reliability in operational contexts. In this work, we show that the same deep architecture can react very differently to unseen post-processing depending on the learned weights, despite achieving similar accuracy on in-distribution test data. This variability stems from differences in the latent spaces induced by training, which affect how samples are separated internally. Our experiments reveal a strong correlation between the distribution of latent margins and a detector’s ability to generalize to post-processed images. Based on this observation, we propose a practical strategy for building more robust detectors: train several variants of the same model under different conditions, and select the one that maximizes latent margins.
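A minimal way to operationalize the selection rule: score each trained variant by the latent margins of its penultimate features under a linear probe, then keep the variant with the largest margins. In the sketch below, the probe weights `w` and `b` are placeholders standing in for a classifier fit on latent features; the paper's exact margin statistic may differ.

```python
import numpy as np

def latent_margins(feats: np.ndarray, labels: np.ndarray,
                   w: np.ndarray, b: float) -> np.ndarray:
    # Signed distance to the hyperplane w.x + b = 0, positive when correct.
    signed = (feats @ w + b) / np.linalg.norm(w)
    return signed * np.where(labels == 1, 1.0, -1.0)

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 16))         # penultimate-layer features
labels = rng.integers(0, 2, size=100)          # binary: spliced vs. pristine
w, b = rng.standard_normal(16), 0.0            # placeholder probe parameters
margins = latent_margins(feats, labels, w, b)
print(margins.mean())  # candidate selection score: larger margins preferred
```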
[710] Filtering with Self-Attention and Storing with MLP: One-Layer Transformers Can Provably Acquire and Extract Knowledge
Ruichen Xu, Kexin Chen
Main category: cs.LG
TL;DR: The paper investigates how transformers acquire and extract knowledge, introducing a one-layer framework with self-attention and MLPs to provide theoretical guarantees on knowledge acquisition and extraction, validated by experiments.
Details
Motivation: To address the theoretical opacity of how transformers store and retrieve knowledge during pre-training and fine-tuning, especially given prior limitations to simplified models.Method: Introduces a tractable one-layer transformer with self-attention and MLPs, analyzing gradient dynamics for convergence and generalization guarantees.
Result: Proves transformers can achieve near-optimal training loss (knowledge acquisition) and low generalization error (knowledge extraction) under specific conditions, with empirical validation.
Conclusion: The framework provides theoretical insights into knowledge acquisition and extraction, explaining phenomena like hallucinations and learning rate effects, supported by experiments.
Abstract: Modern large language models excel in knowledge-intensive tasks, yet how transformers acquire (store) knowledge during pre-training and extract (retrieve) it during post-fine-tuning inference remains theoretically opaque. While prior theoretical work has begun to investigate these questions through the analysis of training dynamics, such studies are limited to single-layer, attention-only architectures. However, most existing studies suggest that MLPs are the most contributing components for storing knowledge in transformer-based language models. Meanwhile, our empirical investigations reveal that such simplified models, when trained using standard next-token prediction objectives, may be incapable of acquiring or extracting factual knowledge. To overcome this limitation, we introduce a tractable one-layer transformer framework that crucially incorporates both self-attention and MLP modules. By tracking its gradient dynamics, we establish convergence and generalization guarantees that illuminate the ability of knowledge acquisition and extraction. We prove that 1) Transformers can achieve near-optimal training loss during pre-training, signifying effective knowledge acquisition; 2) With a large fine-tuning dataset and specific data multiplicity conditions met, transformers can achieve low generalization error when tested on factual knowledge learned during pre-training but not reinforced during the fine-tuning, indicating successful knowledge extraction; 3) When the conditions are not satisfied, transformers exhibit high generalization loss, resulting in hallucinations. Our analysis includes both full fine-tuning and low-rank fine-tuning. Furthermore, our analysis offers theoretical insights into several pertinent empirical phenomena, such as the role of learning rate schedules. Experiments on synthetic and real-world PopQA datasets with GPT-2 and Llama-3.2-1B validate our results.
[711] Universal Neurons in GPT-2: Emergence, Persistence, and Functional Impact
Advey Nandan, Cheng-Ting Chou, Amrit Kurakula, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O’Brien
Main category: cs.LG
TL;DR: Study examines universal neurons in GPT-2 Small models, showing their emergence, stability, and functional impact across training.
Details
Motivation: To understand how universal neurons—neurons with correlated activations across independently trained models—develop and influence model behavior.Method: Analyzed five GPT-2 models at three training checkpoints (100k, 200k, 300k steps) using pairwise correlation of activations on 5M tokens. Conducted ablation tests to measure functional impact.
Result: Universal neurons are stable, especially in deeper layers, and significantly affect model predictions (loss and KL divergence).
Conclusion: Stable, universal representational structures emerge during training, indicating consistent learning patterns across models.
Abstract: We investigate the phenomenon of neuron universality in independently trained GPT-2 Small models, examining how these universal neurons-neurons with consistently correlated activations across models-emerge and evolve throughout training. By analyzing five GPT-2 models at three checkpoints (100k, 200k, 300k steps), we identify universal neurons through pairwise correlation analysis of activations over a dataset of 5 million tokens. Ablation experiments reveal significant functional impacts of universal neurons on model predictions, measured via loss and KL divergence. Additionally, we quantify neuron persistence, demonstrating high stability of universal neurons across training checkpoints, particularly in deeper layers. These findings suggest stable and universal representational structures emerge during neural network training.
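The identification step reduces to correlating activation matrices collected on the same token stream from two independently trained models. A NumPy sketch with an arbitrary 0.9 threshold follows; the paper's exact universality criterion may differ.

```python
import numpy as np

def universal_pairs(acts_a: np.ndarray, acts_b: np.ndarray, thresh: float = 0.9):
    # acts_*: (n_tokens, n_neurons) activations on identical inputs.
    a = (acts_a - acts_a.mean(axis=0)) / acts_a.std(axis=0)
    b = (acts_b - acts_b.mean(axis=0)) / acts_b.std(axis=0)
    corr = (a.T @ b) / len(a)            # Pearson correlations, neurons_a x neurons_b
    return np.argwhere(corr > thresh)    # (i, j) pairs of "universal" neurons

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 32))
y = x + 0.1 * rng.standard_normal((1000, 32))  # second "model" with similar neurons
print(universal_pairs(x, y).shape)             # roughly one match per neuron
```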
[712] CAK: Emergent Audio Effects from Minimal Deep Learning
Austin Rockman
Main category: cs.LG
TL;DR: A 3x3 convolutional kernel trained on 200 audio samples produces emergent effects using Conditioning Aware Kernels (CAK) and AuGAN, enabling unique audio transformations.
Details
Motivation: To explore adversarial training for discovering audio transformations from minimal data, enabling novel effect design.Method: Uses CAK (output = input + (learned_pattern x control)) with soft-gate for identity preservation, and AuGAN to verify control application instead of forgery detection.
Result: The kernel exhibits a diagonal structure creating frequency-dependent temporal shifts, producing musical effects.
Conclusion: Adversarial training can discover audio transformations from small datasets, offering new effect design possibilities.
Abstract: We demonstrate that a single 3x3 convolutional kernel can produce emergent audio effects when trained on 200 samples from a personalized corpus. We achieve this through two key techniques: (1) Conditioning Aware Kernels (CAK), where output = input + (learned_pattern x control), with a soft-gate mechanism supporting identity preservation at zero control; and (2) AuGAN (Audit GAN), which reframes adversarial training from “is this real?” to “did you apply the requested value?” Rather than learning to generate or detect forgeries, our networks cooperate to verify control application, discovering unique transformations. The learned kernel exhibits a diagonal structure creating frequency-dependent temporal shifts that are capable of producing musical effects based on input characteristics. Our results show the potential of adversarial training to discover audio transformations from minimal data, enabling new approaches to effect design.
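The abstract's defining relation, output = input + (learned_pattern x control), is compact enough to render directly. The PyTorch module below is a guess at the shape of the mechanism; the soft gate here (tanh of the control magnitude) is one plausible choice, and the author's actual gating may differ.

```python
import torch
import torch.nn as nn

class CAK(nn.Module):
    """Sketch of a Conditioning Aware Kernel: identity at zero control."""
    def __init__(self):
        super().__init__()
        self.kernel = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

    def forward(self, x: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        pattern = self.kernel(x)            # learned 3x3 pattern response
        gate = torch.tanh(control.abs())    # soft gate: ~0 when control is 0
        return x + pattern * control * gate # output = input + pattern x control

cak = CAK()
spec = torch.randn(1, 1, 128, 64)           # e.g. a spectrogram patch
out = cak(spec, control=torch.tensor(0.0))
print(torch.allclose(out, spec))            # True: identity preserved at zero
```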
[713] NeuCoReClass AD: Redefining Self-Supervised Time Series Anomaly Detection
Aitor Sánchez-Ferrera, Usue Mori, Borja Calvo, Jose A. Lozano
Main category: cs.LG
TL;DR: NeuCoReClass AD is a self-supervised multi-task framework for time series anomaly detection, combining contrastive, reconstruction, and classification tasks, outperforming existing methods without domain-specific knowledge.
Details
Motivation: Existing unsupervised anomaly detection methods rely on single proxy tasks and handcrafted transformations, limiting their generalization and ability to capture meaningful patterns.Method: NeuCoReClass AD uses neural transformation learning for diverse, informative, and coherent augmented views, integrating contrastive, reconstruction, and classification proxy tasks.
Result: The framework consistently outperforms classical baselines and deep-learning alternatives across benchmarks and characterizes anomaly profiles unsupervised.
Conclusion: NeuCoReClass AD addresses limitations of existing methods, offering a robust, generalizable solution for time series anomaly detection.
Abstract: Time series anomaly detection plays a critical role in a wide range of real-world applications. Among unsupervised approaches, self-supervised learning has gained traction for modeling normal behavior without the need for labeled data. However, many existing methods rely on a single proxy task, limiting their ability to capture meaningful patterns in normal data. Moreover, they often depend on handcrafted transformations tailored to specific domains, hindering their generalization across diverse problems. To address these limitations, we introduce NeuCoReClass AD, a self-supervised multi-task time series anomaly detection framework that combines contrastive, reconstruction, and classification proxy tasks. Our method employs neural transformation learning to generate augmented views that are informative, diverse, and coherent, without requiring domain-specific knowledge. We evaluate NeuCoReClass AD across a wide range of benchmarks, demonstrating that it consistently outperforms both classical baselines and most deep-learning alternatives. Furthermore, it enables the characterization of distinct anomaly profiles in a fully unsupervised manner.
[714] Predictive Auditing of Hidden Tokens in LLM APIs via Reasoning Length Estimation
Ziyao Wang, Guoheng Sun, Yexiao He, Zheyu Shen, Bowei Tian, Ang Li
Main category: cs.LG
TL;DR: PALACE is a user-side framework for auditing hidden reasoning token counts in LLM services, addressing token inflation and overbilling concerns without internal trace access.
Details
Motivation: Commercial LLM services hide internal reasoning traces but charge for all tokens, raising concerns of token inflation and overbilling. Reliable auditing is needed but challenging due to provider control and LLM variance.Method: PALACE uses a GRPO-augmented adaptation module and lightweight domain router to estimate hidden token counts from prompt-answer pairs, dynamically calibrating for diverse tasks.
Result: Experiments show PALACE achieves low relative error and strong prediction accuracy across math, coding, medical, and general reasoning benchmarks.
Conclusion: PALACE advances predictive auditing, enhancing transparency, accountability, and user trust in LLM services.
Abstract: Commercial LLM services often conceal internal reasoning traces while still charging users for every generated token, including those from hidden intermediate steps, raising concerns of token inflation and potential overbilling. This gap underscores the urgent need for reliable token auditing, yet achieving it is far from straightforward: cryptographic verification (e.g., hash-based signature) offers little assurance when providers control the entire execution pipeline, while user-side prediction struggles with the inherent variance of reasoning LLMs, where token usage fluctuates across domains and prompt styles. To bridge this gap, we present PALACE (Predictive Auditing of LLM APIs via Reasoning Token Count Estimation), a user-side framework that estimates hidden reasoning token counts from prompt-answer pairs without access to internal traces. PALACE introduces a GRPO-augmented adaptation module with a lightweight domain router, enabling dynamic calibration across diverse reasoning tasks and mitigating variance in token usage patterns. Experiments on math, coding, medical, and general reasoning benchmarks show that PALACE achieves low relative error and strong prediction accuracy, supporting both fine-grained cost auditing and inflation detection. Taken together, PALACE represents an important first step toward standardized predictive auditing, offering a practical path to greater transparency, accountability, and user trust.
[715] SmartDate: AI-Driven Precision Sorting and Quality Control in Date Fruits
Khaled Eskaf
Main category: cs.LG
TL;DR: SmartDate is an AI system for sorting and quality control of dates, using deep learning, genetic algorithms, and reinforcement learning. It achieves high accuracy (94.5%) and reduces waste.
Details
Motivation: To improve classification accuracy and predict shelf life of date fruits, ensuring high-quality produce reaches the market while reducing waste.Method: Combines deep learning, genetic algorithms, and reinforcement learning with high-resolution imaging and VisNIR spectral sensors to evaluate moisture, sugar content, and texture.
Result: Achieved 94.5% accuracy, 93.1% F1-score, and 0.96 AUC-ROC.
Conclusion: SmartDate sets a new benchmark in smart agriculture by enhancing quality control and reducing waste.
Abstract: SmartDate is an AI-powered system for automated sorting and quality control of date fruits. It combines deep learning, genetic algorithms, and reinforcement learning to improve classification accuracy and predict shelf life. The system uses high-resolution imaging and Visible-Near-Infrared (VisNIR) spectral sensors to evaluate key features such as moisture, sugar content, and texture. Reinforcement learning enables real-time adaptation to production conditions, while genetic algorithms optimize model parameters. SmartDate achieved 94.5 percent accuracy, 93.1 percent F1-score, and an AUC-ROC of 0.96. The system reduces waste and ensures that only high-quality dates reach the market, setting a new benchmark in smart agriculture.
[716] CaliMatch: Adaptive Calibration for Improving Safe Semi-supervised Learning
Jinsoo Bae, Seoung Bum Kim, Hyungrok Do
Main category: cs.LG
TL;DR: CaliMatch improves semi-supervised learning (SSL) by calibrating classifier and OOD detection to address label distribution mismatch and overconfidence issues.
Details
Motivation: Existing SSL methods suffer from overconfidence in incorrect pseudo-labels or OOD detection, leading to errors.Method: Proposes CaliMatch with adaptive label smoothing and temperature scaling for calibration.
Result: Outperforms existing methods on datasets like CIFAR-10, CIFAR-100, SVHN, TinyImageNet, and ImageNet.
Conclusion: CaliMatch effectively addresses overconfidence and improves performance in safe SSL tasks.
Abstract: Semi-supervised learning (SSL) uses unlabeled data to improve the performance of machine learning models when labeled data is scarce. However, its real-world applications often face the label distribution mismatch problem, in which the unlabeled dataset includes instances whose ground-truth labels are absent from the labeled training dataset. Recent studies, referred to as safe SSL, have addressed this issue by using both classification and out-of-distribution (OOD) detection. However, the existing methods may suffer from overconfidence in deep neural networks, leading to increased SSL errors because of high confidence in incorrect pseudo-labels or OOD detection. To address this, we propose a novel method, CaliMatch, which calibrates both the classifier and the OOD detector to foster safe SSL. CaliMatch presents adaptive label smoothing and temperature scaling, which eliminates the need to manually tune the smoothing degree for effective calibration. We give a theoretical justification for why improving the calibration of both the classifier and the OOD detector is crucial in safe SSL. Extensive evaluations on CIFAR-10, CIFAR-100, SVHN, TinyImageNet, and ImageNet demonstrate that CaliMatch outperforms the existing methods in safe SSL tasks.
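Temperature scaling, one of the two calibration tools CaliMatch adapts, fits a single scalar T on held-out logits by minimizing negative log-likelihood. A minimal PyTorch version of the textbook procedure follows; CaliMatch itself makes the calibration adaptive rather than fixed.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor, steps: int = 100) -> float:
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so T stays positive
    opt = torch.optim.Adam([log_t], lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return float(log_t.exp())

logits = torch.randn(256, 10) * 3.0              # synthetic overconfident logits
labels = torch.randint(0, 10, (256,))
print(fit_temperature(logits, labels))           # T > 1 softens overconfidence
```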
[717] Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models
Jiazhen Pan, Bailiang Jian, Paul Hager, Yundi Zhang, Che Liu, Friedrike Jungmann, Hongwei Bran Li, Chenyu You, Junde Wu, Jiayuan Zhu, Fenglin Liu, Yuyuan Liu, Niklas Bubeck, Christian Wachinger, Chen Chen, Zhenyu Gong, Cheng Ouyang, Georgios Kaissis, Benedikt Wiestler, Daniel Rueckert
Main category: cs.LG
TL;DR: A dynamic, automatic, and systematic (DAS) red-teaming framework is introduced to continuously test LLMs for safety vulnerabilities in clinical settings, revealing significant weaknesses in robustness, privacy, bias/fairness, and hallucination.
Details
Motivation: To address the rapid advancement of LLMs and the obsolescence of static safety benchmarks, ensuring their reliability in healthcare applications.Method: A DAS framework employs adversarial agents to autonomously mutate test cases, evolve unsafe-triggering strategies, and evaluate responses in real time.
Result: Testing 15 LLMs showed high failure rates: 94% in robustness, 86% in privacy, 81% in bias/fairness, and 66% in hallucination, despite high static benchmark performance.
Conclusion: DAS red-teaming provides an evolvable, scalable safeguard for medical AI, addressing residual risks incompatible with clinical practice.
Abstract: Ensuring the safety and reliability of large language models (LLMs) in clinical practice is critical to prevent patient harm and promote trustworthy healthcare applications of AI. However, LLMs are advancing so rapidly that static safety benchmarks often become obsolete upon publication, yielding only an incomplete and sometimes misleading picture of model trustworthiness. We demonstrate that a Dynamic, Automatic, and Systematic (DAS) red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses of current LLMs across four safety-critical domains: robustness, privacy, bias/fairness, and hallucination. A suite of adversarial agents is applied to autonomously mutate test cases, identify/evolve unsafe-triggering strategies, and evaluate responses, uncovering vulnerabilities in real time without human intervention. Applying DAS to 15 proprietary and open-source LLMs revealed a stark contrast between static benchmark performance and vulnerability under adversarial pressure. Despite a median MedQA accuracy exceeding 80%, 94% of previously correct answers failed our dynamic robustness tests. We observed similarly high failure rates across other domains: privacy leaks were elicited in 86% of scenarios, cognitive-bias priming altered clinical recommendations in 81% of fairness tests, and we identified hallucination rates exceeding 66% in widely used models. Such profound residual risks are incompatible with routine clinical practice. By converting red-teaming from a static checklist into a dynamic stress-test audit, DAS red-teaming offers the surveillance that hospitals/regulators/technology vendors require as LLMs become embedded in patient chatbots, decision-support dashboards, and broader healthcare workflows. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
[718] Examining Test-Time Adaptation for Personalized Child Speech Recognition
Zhonghao Shi, Xuan Shi, Anfeng Xu, Tiantian Feng, Harshvardhan Srivastava, Shrikanth Narayanan, Maja J. Matarić
Main category: cs.LG
TL;DR: TTA methods (SUTA, SGEM) improve ASR models for child speech recognition, though challenges remain with non-linguistic speech.
Details
Motivation: Address performance degradation in ASR models due to domain shifts, especially for child speakers, by studying TTA's effectiveness.Method: Investigate SUTA and SGEM TTA methods for adapting off-the-shelf and fine-tuned ASR models to child speech.
Result: TTA significantly improves ASR performance for child speakers, but struggles with non-linguistic speech.
Conclusion: TTA is effective for child speech adaptation but has limitations with non-linguistic elements.
Abstract: Automatic speech recognition (ASR) models often experience performance degradation due to data domain shifts introduced at test time, a challenge that is further amplified for child speakers. Test-time adaptation (TTA) methods have shown great potential in bridging this domain gap. However, the use of TTA to adapt ASR models to the individual differences in each child’s speech has not yet been systematically studied. In this work, we investigate the effectiveness of two widely used TTA methods-SUTA, SGEM-in adapting off-the-shelf ASR models and their fine-tuned versions for child speech recognition, with the goal of enabling continuous, unsupervised adaptation at test time. Our findings show that TTA significantly improves the performance of both off-the-shelf and fine-tuned ASR models, both on average and across individual child speakers, compared to unadapted baselines. However, while TTA helps adapt to individual variability, it may still be limited with non-linguistic child speech.
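SUTA-style test-time adaptation centers on minimizing the entropy of the model's own frame-level predictions on each incoming utterance, with no labels involved. Below is a bare-bones single adaptation step under that assumption, with a linear layer standing in for the ASR model; the actual methods add stabilizers (e.g., selective parameter updates) not shown here.

```python
import torch
import torch.nn as nn

def tta_step(model: nn.Module, feats: torch.Tensor, lr: float = 1e-4) -> None:
    """One unsupervised entropy-minimization update on a single utterance."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    probs = model(feats).softmax(dim=-1)                     # (frames, vocab)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
    opt.zero_grad()
    entropy.backward()
    opt.step()

toy_asr = nn.Linear(40, 30)        # stand-in for an ASR encoder + output head
tta_step(toy_asr, torch.randn(200, 40))  # adapt on 200 frames of one utterance
```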
[719] Hybrid Hypergraph Networks for Multimodal Sequence Data Classification
Feng Xu, Hui Wang, Yuting Huang, Danwei Zhang, Zizhu Fan
Main category: cs.LG
TL;DR: The paper proposes a hybrid hypergraph network (HHN) to model temporal multimodal data, addressing challenges in capturing long-range dependencies and cross-modal interactions. HHN segments data into timestamped nodes, uses hyperedges for intra-modal structures, and employs graph attention for inter-modal fusion, achieving SOTA results.
Details
Motivation: Existing approaches for temporal multimodal data often treat modalities independently and use shallow fusion, missing temporal dependencies and complex structural relationships.Method: HHN uses a segmentation-first, graph-later strategy: timestamped segments as nodes, hyperedges for intra-modal structures, and graph attention for inter-modal fusion.
Result: HHN achieves state-of-the-art performance on four multimodal datasets.
Conclusion: HHN effectively models temporal multimodal data by capturing intra-modal dynamics and inter-modal correlations, outperforming existing methods.
Abstract: Modeling temporal multimodal data poses significant challenges in classification tasks, particularly in capturing long-range temporal dependencies and intricate cross-modal interactions. Audiovisual data, as a representative example, is inherently characterized by strict temporal order and diverse modalities. Effectively leveraging the temporal structure is essential for understanding both intra-modal dynamics and inter-modal correlations. However, most existing approaches treat each modality independently and rely on shallow fusion strategies, which overlook temporal dependencies and hinder the model’s ability to represent complex structural relationships. To address the limitation, we propose the hybrid hypergraph network (HHN), a novel framework that models temporal multimodal data via a segmentation-first, graph-later strategy. HHN splits sequences into timestamped segments as nodes in a heterogeneous graph. Intra-modal structures are captured via hyperedges guided by a maximum entropy difference criterion, enhancing node heterogeneity and structural discrimination, followed by hypergraph convolution to extract high-order dependencies. Inter-modal links are established through temporal alignment and graph attention for semantic fusion. HHN achieves state-of-the-art (SOTA) results on four multimodal datasets, demonstrating its effectiveness in complex classification tasks.
[720] Cooperative effects in feature importance of individual patterns: application to air pollutants and Alzheimer disease
M. Ontivero-Ortega, A. Fania, A. Lacalamita, R. Bellotti, A. Monaco, S. Stramaglia
Main category: cs.LG
TL;DR: The paper introduces an adaptive version of LOCO (Hi-Fi) to quantify cooperative effects in feature importance, assigning unique, redundant, and synergistic scores to features. It compares this with Shapley effects and applies it to study air pollutants’ impact on Alzheimer’s disease mortality.
Details
Motivation: To disentangle high-order cooperative effects in feature importance for regression problems, moving beyond single-score metrics.Method: Proposes a framework to assign unique, redundant, and synergistic scores to features, comparing it with Shapley effects. Applied to air pollutants and Alzheimer’s disease mortality data.
Result: Identifies synergistic associations between pollutants (O3, NO2) and mortality, especially in Bergamo and Brescia, and urban green areas’ influence.
Conclusion: Hi-Fi is a promising tool for XAI and analyzing high-order relationships in complex systems.
Abstract: Leveraging recent advances in the analysis of synergy and redundancy in systems of random variables, an adaptive version of the widely used metric Leave One Covariate Out (LOCO) has been recently proposed to quantify cooperative effects in feature importance (Hi-Fi), a key technique in explainable artificial intelligence (XAI), so as to disentangle high-order effects involving a particular input feature in regression problems. Differently from standard feature importance tools, where a single score measures the relevance of each feature, each feature is here characterized by three scores, a two-body (unique) score and higher-order scores (redundant and synergistic). This paper presents a framework to assign those three scores (unique, redundant, and synergistic) to each individual pattern of the data set, while comparing it with the well-known measure of feature importance named {\it Shapley effect}. To illustrate the potential of the proposed framework, we focus on a One-Health application: the relation between air pollutants and Alzheimer’s disease mortality rate. Our main result is the synergistic association between features related to $O_3$ and $NO_2$ with mortality, especially in the provinces of Bergamo and Brescia; notably, the density of urban green areas also displays synergistic influence with pollutants for the prediction of AD mortality. Our results place local Hi-Fi as a promising tool of wide applicability, which opens new perspectives for XAI as well as for analyzing high-order relationships in complex systems.
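Plain LOCO, the metric that Hi-Fi generalizes, scores a feature by the increase in cross-validated error when that feature is left out. A scikit-learn sketch of that baseline follows; the unique/redundant/synergistic decomposition of Hi-Fi is beyond this snippet.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def loco(X: np.ndarray, y: np.ndarray, j: int) -> float:
    """LOCO importance of feature j: CV error increase when j is removed."""
    mse_full = -cross_val_score(LinearRegression(), X, y,
                                scoring="neg_mean_squared_error").mean()
    X_minus = np.delete(X, j, axis=1)
    mse_drop = -cross_val_score(LinearRegression(), X_minus, y,
                                scoring="neg_mean_squared_error").mean()
    return mse_drop - mse_full

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
y = 2 * X[:, 0] + X[:, 1] * X[:, 2] + 0.1 * rng.standard_normal(300)
print([round(loco(X, y, j), 3) for j in range(5)])  # feature 0 dominates
```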
[721] OKG-LLM: Aligning Ocean Knowledge Graph with Observation Data via LLMs for Global Sea Surface Temperature Prediction
Hanchen Yang, Jiaqi Wang, Jiannong Cao, Wengen Li, Jialun Zheng, Yangning Li, Chunyu Miao, Jihong Guan, Shuigeng Zhou, Philip S. Yu
Main category: cs.LG
TL;DR: The paper proposes OKG-LLM, a framework integrating ocean domain knowledge via a knowledge graph and LLMs for improved SST prediction, outperforming existing methods.
Details
Motivation: Existing SST prediction methods lack integration of domain knowledge, limiting accuracy. The potential of LLMs for such tasks is underexplored due to challenges in combining domain knowledge with numerical data.Method: Constructs an Ocean Knowledge Graph (OKG), uses a graph embedding network to learn semantic and structural knowledge, and aligns this with numerical SST data using a pre-trained LLM.
Result: OKG-LLM outperforms state-of-the-art methods in SST prediction, demonstrating effectiveness and robustness.
Conclusion: The framework successfully integrates domain knowledge and numerical data, advancing SST prediction accuracy and offering potential for broader applications.
Abstract: Sea surface temperature (SST) prediction is a critical task in ocean science, supporting various applications, such as weather forecasting, fisheries management, and storm tracking. While existing data-driven methods have demonstrated significant success, they often neglect to leverage the rich domain knowledge accumulated over the past decades, limiting further advancements in prediction accuracy. The recent emergence of large language models (LLMs) has highlighted the potential of integrating domain knowledge for downstream tasks. However, the application of LLMs to SST prediction remains underexplored, primarily due to the challenge of integrating ocean domain knowledge and numerical data. To address this issue, we propose Ocean Knowledge Graph-enhanced LLM (OKG-LLM), a novel framework for global SST prediction. To the best of our knowledge, this work presents the first systematic effort to construct an Ocean Knowledge Graph (OKG) specifically designed to represent diverse ocean knowledge for SST prediction. We then develop a graph embedding network to learn the comprehensive semantic and structural knowledge within the OKG, capturing both the unique characteristics of individual sea regions and the complex correlations between them. Finally, we align and fuse the learned knowledge with fine-grained numerical SST data and leverage a pre-trained LLM to model SST patterns for accurate prediction. Extensive experiments on the real-world dataset demonstrate that OKG-LLM consistently outperforms state-of-the-art methods, showcasing its effectiveness, robustness, and potential to advance SST prediction. The codes are available in the online repository.
[722] FeatureCuts: Feature Selection for Large Data by Optimizing the Cutoff
Andy Hu, Devika Prasad, Luiz Pizzato, Nicholas Foord, Arman Abrahamyan, Anna Leontjeva, Cooper Doyle, Dan Jermyn
Main category: cs.LG
TL;DR: FeatureCuts is a novel feature selection algorithm that outperforms state-of-the-art methods by achieving higher feature reduction and lower computation time while maintaining model performance.
Details
Motivation: To improve feature selection in machine learning by adaptively selecting optimal feature cutoffs after filter ranking, enhancing efficiency and scalability.Method: FeatureCuts performs adaptive feature cutoff selection post-filter ranking and is evaluated on 15 datasets, including one industry dataset.
Result: Achieves 15% more feature reduction, up to 99.6% less computation time, and maintains model performance. With PSO, it enables 25% more reduction and 66% less time.
Conclusion: FeatureCuts is scalable and efficient, making it suitable for large datasets in enterprise applications.
Abstract: In machine learning, the process of feature selection involves finding a reduced subset of features that captures most of the information required to train an accurate and efficient model. This work presents FeatureCuts, a novel feature selection algorithm that adaptively selects the optimal feature cutoff after performing filter ranking. Evaluated on 14 publicly available datasets and one industry dataset, FeatureCuts achieved, on average, 15 percentage points more feature reduction and up to 99.6% less computation time while maintaining model performance, compared to existing state-of-the-art methods. When the selected features are used in a wrapper method such as Particle Swarm Optimization (PSO), it enables 25 percentage points more feature reduction, requires 66% less computation time, and maintains model performance when compared to PSO alone. The minimal overhead of FeatureCuts makes it scalable for large datasets typically seen in enterprise applications.
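The core idea, rank features with a filter and then pick a cutoff from the score curve itself, can be sketched as below. The largest-gap rule used here is only a stand-in, since the paper's adaptive cutoff criterion is not spelled out in the abstract; `mutual_info_classif` is one common filter choice.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_by_cutoff(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Filter-rank features, then cut where the sorted scores drop hardest."""
    scores = mutual_info_classif(X, y, random_state=0)
    order = np.argsort(scores)[::-1]        # descending filter ranking
    gaps = -np.diff(scores[order])          # score drop between consecutive ranks
    k = int(np.argmax(gaps)) + 1            # keep everything above the biggest gap
    return order[:k]

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
y = (X[:, 0] + X[:, 3] > 0).astype(int)     # only two informative features
print(select_by_cutoff(X, y))               # typically recovers columns 0 and 3
```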
[723] From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model
Yeong-Joon Ju, Seong-Whan Lee
Main category: cs.LG
TL;DR: A framework for adapting Multimodal Large Language Models (MLLMs) to discriminative representation learning, using hierarchical embedding prompts and self-aware hard negative sampling, achieves state-of-the-art performance without contrastive pre-training.
Details
Motivation: Addressing inefficiencies in large-scale contrastive pre-training for MLLMs, such as high computational costs and underutilization of their instruction-following capabilities.Method: Proposes a two-component framework: hierarchical embedding prompt templates for discriminative representations and self-aware hard negative sampling for efficient fine-tuning.
Result: Achieves competitive zero-shot performance and boosts fine-tuning by 4.8 points on MMEB benchmark, setting state-of-the-art results without contrastive pre-training.
Conclusion: The framework offers an efficient and effective way to adapt MLLMs for universal embedding tasks, reducing training time significantly.
Abstract: Multimodal Large Language Models (MLLMs) have emerged as a promising solution for universal embedding tasks, yet adapting their generative nature for discriminative representation learning remains a significant challenge. The dominant paradigm of large-scale contrastive pre-training suffers from critical inefficiencies, including prohibitive computational costs and a failure to leverage the intrinsic, instruction-following capabilities of MLLMs. To overcome these limitations, we propose an efficient framework for universal multimodal embeddings, which bridges this gap by centering on two synergistic components. First, our hierarchical embedding prompt template employs a two-level instruction architecture that forces the model to produce discriminative representations. Building on this strong foundation, our second component, self-aware hard negative sampling, redefines the fine-tuning process by leveraging the model’s own understanding to efficiently mine challenging negatives while actively filtering out potential false negatives. Our comprehensive experiments show that our hierarchical prompt achieves zero-shot performance competitive with contrastively trained baselines and enhances the fine-tuning process by lifting a simple in-batch negative baseline by 4.8 points on the MMEB benchmark. We further boost the performance via our self-aware hard negative sampling, achieving state-of-the-art performance without contrastive pre-training. Our work presents an effective and efficient pathway to adapt MLLMs for universal embedding tasks, significantly reducing training time.
[724] Learning Unified User Quantized Tokenizers for User Representation
Chuan He, Yang Chen, Wuliang Huang, Tianyi Zheng, Jianhu Chen, Bin Dou, Yice Luo, Yun Zhu, Baokun Wang, Yongchao Liu, Xing Fu, Yu Cheng, Chuntao Hong, Weiqiang Wang, Xin-Wei Yao
Main category: cs.LG
TL;DR: U^2QT is a novel framework for multi-source user representation learning, addressing limitations of prior late-fusion methods by integrating cross-domain knowledge transfer and early fusion. It uses a two-stage architecture for unified representation and efficient storage, outperforming baselines in tasks like behavior prediction and recommendation.
Details
Motivation: Prior methods lack unified representation frameworks, scalability, and cross-task generalization, hindering personalized services on web platforms.Method: U^2QT employs a two-stage approach: a causal Q-Former projects domain-specific features into a shared space, and a multi-view RQ-VAE discretizes embeddings into compact tokens for efficient storage.
Result: U^2QT outperforms task-specific baselines in behavior prediction and recommendation, with efficiency gains in storage and computation.
Conclusion: The framework enables seamless integration with language models and supports industrial-scale applications, offering a scalable and flexible solution for user representation learning.
Abstract: Multi-source user representation learning plays a critical role in enabling personalized services on web platforms (e.g., Alipay). While prior works have adopted late-fusion strategies to combine heterogeneous data sources, they suffer from three key limitations: lack of unified representation frameworks, scalability and storage issues in data compression, and inflexible cross-task generalization. To address these challenges, we propose U^2QT (Unified User Quantized Tokenizers), a novel framework that integrates cross-domain knowledge transfer with early fusion of heterogeneous domains. Our framework employs a two-stage architecture: first, a causal Q-Former projects domain-specific features into a shared causal representation space to preserve inter-modality dependencies; second, a multi-view RQ-VAE discretizes causal embeddings into compact tokens through shared and source-specific codebooks, enabling efficient storage while maintaining semantic coherence. Experimental results showcase U^2QT’s advantages across diverse downstream tasks, outperforming task-specific baselines in future behavior prediction and recommendation tasks while achieving efficiency gains in storage and computation. The unified tokenization framework enables seamless integration with language models and supports industrial-scale applications.
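Residual quantization, the mechanism behind the multi-view RQ-VAE, encodes an embedding as a short sequence of codebook indices by repeatedly quantizing the leftover residual. A toy NumPy version with made-up codebook sizes follows; the paper's shared/source-specific codebook split is not modeled here.

```python
import numpy as np

def residual_quantize(x: np.ndarray, codebooks: list):
    """Encode embeddings as per-stage codeword indices; return tokens + recon."""
    tokens, residual = [], x.copy()
    for C in codebooks:                                    # one codebook per stage
        dists = ((residual[:, None, :] - C[None]) ** 2).sum(-1)
        idx = np.argmin(dists, axis=1)                     # nearest codeword
        tokens.append(idx)
        residual = residual - C[idx]                       # pass residual onward
    return np.stack(tokens, axis=1), x - residual          # tokens, reconstruction

rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((16, 8)) for _ in range(3)]  # 3 stages, 16 codes
emb = rng.standard_normal((4, 8))                             # 4 user embeddings
tokens, recon = residual_quantize(emb, codebooks)
print(tokens.shape, np.linalg.norm(emb - recon))              # (4, 3), residual error
```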
[725] Small sample-based adaptive text classification through iterative and contrastive description refinement
Amrit Rajeev, Udayaadithya Avadhanam, Harshula Tulapurkar, SaiBarath Sundar
Main category: cs.LG
TL;DR: A framework combining iterative topic refinement, contrastive prompting, and active learning improves zero-shot text classification in dynamic domains with evolving knowledge.
Details
Motivation: Addressing the challenge of zero-shot text classification in domains with ambiguous categories and evolving knowledge, where LLMs struggle due to limited topic separability and few-shot methods lack data diversity.Method: Proposes a framework with iterative topic refinement, contrastive prompting, and active learning. Uses misclassified samples to refine category distinctions and includes a human-in-the-loop component for natural language category updates.
Result: Achieves 91% accuracy on AGNews (3 seen, 1 unseen class) and 84% on DBpedia (8 seen, 1 unseen), with minimal accuracy drop after introducing unseen classes.
Conclusion: The framework effectively enables fine-grained classification with limited supervision, leveraging prompt-based semantic reasoning for dynamic environments.
Abstract: Zero-shot text classification remains a difficult task in domains with evolving knowledge and ambiguous category boundaries, such as ticketing systems. Large language models (LLMs) often struggle to generalize in these scenarios due to limited topic separability, while few-shot methods are constrained by insufficient data diversity. We propose a classification framework that combines iterative topic refinement, contrastive prompting, and active learning. Starting with a small set of labeled samples, the model generates initial topic labels. Misclassified or ambiguous samples are then used in an iterative contrastive prompting process to refine category distinctions by explicitly teaching the model to differentiate between closely related classes. The framework features a human-in-the-loop component, allowing users to introduce or revise category definitions in natural language. This enables seamless integration of new, unseen categories without retraining, making the system well-suited for real-world, dynamic environments. The evaluations on AGNews and DBpedia demonstrate strong performance: 91% accuracy on AGNews (3 seen, 1 unseen class) and 84% on DBpedia (8 seen, 1 unseen), with minimal accuracy shift after introducing unseen classes (82% and 87%, respectively). The results highlight the effectiveness of prompt-based semantic reasoning for fine-grained classification with limited supervision.
[726] Enhancing material behavior discovery using embedding-oriented Physically-Guided Neural Networks with Internal Variables
Rubén Muñoz-Sierra, Manuel Doblaré, Jacobo Ayensa-Jiménez
Main category: cs.LG
TL;DR: The paper proposes enhancements to Physically Guided Neural Networks with Internal Variables (PGNNIV) to address scalability issues in high-dimensional data, using reduced-order modeling techniques like spectral decomposition and pretrained autoencoders. It also integrates transfer learning for efficiency.
Details
Motivation: PGNNIV models face scalability challenges with high-dimensional data. The goal is to improve their efficiency and adaptability while maintaining accuracy.Method: Enhanced PGNNIV framework with reduced-order modeling techniques (spectral decomposition, POD, pretrained autoencoders) and transfer learning strategies.
Result: The framework successfully identifies constitutive state equations, improves noise robustness, mitigates overfitting, and reduces computational demands.
Conclusion: The proposed techniques overcome scalability challenges and can be tailored to various scenarios, maintaining high predictive accuracy.
Abstract: Physically Guided Neural Networks with Internal Variables (PGNNIV) are SciML tools that use only observable data for training and have the capacity to unravel internal state relations. They incorporate physical knowledge both by prescribing the model architecture and using loss regularization, thus endowing certain specific neurons with a physical meaning as internal state variables. Despite their potential, these models face challenges in scalability when applied to high-dimensional data such as fine-grid spatial fields or time-evolving systems. In this work, we propose some enhancements to the PGNNIV framework that address these scalability limitations through reduced-order modeling techniques. Specifically, we introduce alternatives to the original decoder structure using spectral decomposition, POD, and pretrained autoencoder-based mappings. These surrogate decoders offer varying trade-offs between computational efficiency, accuracy, noise tolerance, and generalization, while drastically improving scalability. Additionally, we integrate model reuse via transfer learning and fine-tuning strategies to exploit previously acquired knowledge, supporting efficient adaptation to novel materials or configurations, and significantly reducing training time while maintaining or improving model performance. To illustrate these various techniques, we use a representative case governed by the nonlinear diffusion equation, using only observable data. Results demonstrate that the enhanced PGNNIV framework successfully identifies the underlying constitutive state equations while maintaining high predictive accuracy. It also improves robustness to noise, mitigates overfitting, and reduces computational demands. The proposed techniques can be tailored to various scenarios depending on data availability, resources, and specific modeling objectives, overcoming scalability challenges in all the scenarios.
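Of the surrogate decoders mentioned, POD is the simplest to illustrate: take the leading left singular vectors of a snapshot matrix as a spatial basis and decode linearly from modal coefficients. A NumPy sketch on synthetic low-rank data follows; the 99% energy threshold is an arbitrary choice, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
modes = rng.standard_normal((1024, 5))                 # 5 true spatial modes
weights = rng.standard_normal((5, 200))
snapshots = modes @ weights + 0.01 * rng.standard_normal((1024, 200))

U, S, _ = np.linalg.svd(snapshots, full_matrices=False)
energy = np.cumsum(S**2) / np.sum(S**2)
r = int(np.searchsorted(energy, 0.99)) + 1             # modes for 99% energy
basis = U[:, :r]                                       # reduced spatial basis

coeffs = basis.T @ snapshots[:, 0]                     # encode one field
field = basis @ coeffs                                 # linear POD "decoder"
print(r, np.linalg.norm(field - snapshots[:, 0]))      # r ~ 5, small error
```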
[727] Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation
Kuiyuan Ding, Caili Guo, Yang Yang, Zhongtian Du, Walid Saad
Main category: cs.LG
TL;DR: A robust knowledge distillation framework (RKD-SC) is proposed to enable efficient and noise-resistant large-scale model (LSM) deployment in semantic communication, addressing computational complexity and channel noise challenges.
Details
Motivation: Large-scale models (LSMs) are effective for semantic communication but face high computational complexity and resource demands, hindering direct deployment.Method: The framework includes a knowledge distillation-based lightweight architecture search (KDL-DARTS) and a two-stage robust knowledge distillation (RKD) algorithm, enhanced by a channel-aware transformer (CAT) block for noise resilience.
Result: Simulations show RKD-SC reduces model parameters while maintaining performance and robustness, outperforming existing methods.
Conclusion: RKD-SC provides an efficient and robust solution for deploying LSMs in semantic communication systems.
Abstract: Large-scale models (LSMs) can be an effective framework for semantic representation and understanding, thereby providing a suitable tool for designing semantic communication (SC) systems. However, their direct deployment is often hindered by high computational complexity and resource requirements. In this paper, a novel robust knowledge distillation based semantic communication (RKD-SC) framework is proposed to enable efficient and channel-noise-robust LSM-powered SC. The framework addresses two key challenges: determining optimal compact model architectures and effectively transferring knowledge while maintaining robustness against channel noise. First, a knowledge distillation-based lightweight differentiable architecture search (KDL-DARTS) algorithm is proposed. This algorithm integrates knowledge distillation loss and a complexity penalty into the neural architecture search process to identify high-performance, lightweight semantic encoder architectures. Second, a novel two-stage robust knowledge distillation (RKD) algorithm is developed to transfer semantic capabilities from an LSM (teacher) to a compact encoder (student) and subsequently enhance system robustness. To further improve resilience to channel impairments, a channel-aware transformer (CAT) block is introduced as the channel codec, trained under diverse channel conditions with variable-length outputs. Extensive simulations on image classification tasks demonstrate that the RKD-SC framework significantly reduces model parameters while preserving a high degree of the teacher model’s performance and exhibiting superior robustness compared to existing methods.
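The transfer stage builds on standard knowledge distillation, i.e., matching the student to the teacher's temperature-softened outputs. The loss below is the textbook version for orientation; RKD's two stages and channel-noise robustness are layered on top of something like this, and the temperature T and mixing weight alpha here are illustrative.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
            labels: torch.Tensor, T: float = 4.0, alpha: float = 0.7) -> torch.Tensor:
    """Soft-target distillation loss plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # T^2 keeps gradient scale stable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s, t = torch.randn(32, 10), torch.randn(32, 10)   # student / teacher logits
print(kd_loss(s, t, torch.randint(0, 10, (32,))))
```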
[728] Compression-Induced Communication-Efficient Large Model Training and Inferencing
Sudip K. Seal, Maksudul Alam, Jorge Ramirez, Sajal Dash, Hao Lu
Main category: cs.LG
TL;DR: Phantom parallelism reduces energy consumption in large neural network training by ~50% compared to traditional tensor parallelism, with lower bandwidth and FLOP counts.
Details
Motivation: Addressing the energy inefficiency of large-scale machine learning workloads, particularly in tensor parallelism.Method: Introduces phantom parallelism with new forward/backward propagation operators, implemented via custom autograd operations in a training pipeline.
Result: Empirical results on up to 256 GPUs show ~50% energy reduction and comparable model performance with fewer GPUs.
Conclusion: Phantom parallelism offers significant energy savings and scalability advantages over traditional tensor parallelism.
Abstract: Energy efficiency of training and inferencing with large neural network models is a critical challenge facing the future of sustainable large-scale machine learning workloads. This paper introduces an alternative strategy, called phantom parallelism, to minimize the net energy consumption of traditional tensor (model) parallelism, the most energy-inefficient component of large neural network training. The approach is presented in the context of feed-forward network architectures as a preliminary, but comprehensive, proof-of-principle study of the proposed methodology. We derive new forward and backward propagation operators for phantom parallelism, implement them as custom autograd operations within an end-to-end phantom parallel training pipeline and compare its parallel performance and energy-efficiency against those of conventional tensor parallel training pipelines. Formal analyses that predict lower bandwidth and FLOP counts are presented with supporting empirical results on up to 256 GPUs that corroborate these gains. The experiments show a ~50% reduction in the energy consumed to train FFNs using the proposed phantom parallel approach when compared with conventional tensor parallel methods. Additionally, the proposed approach is shown to train smaller phantom models to the same model loss on smaller GPU counts as larger tensor parallel models on larger GPU counts, offering the possibility of even greater energy savings.
[729] FinKario: Event-Enhanced Automated Construction of Financial Knowledge Graph
Xiang Li, Penglei Sun, Wanyun Zhou, Zikai Wei, Yongqi Zhang, Xiaowen Chu
Main category: cs.LG
TL;DR: The paper introduces FinKario, a dataset and retrieval method, to enhance financial decision-making by addressing slow updates and unstructured data in equity research reports.
Details
Motivation: Individual investors struggle with overwhelming information and lack professional analysis. Equity research reports are valuable but face challenges like slow updates and unstructured formats, limiting LLMs’ effectiveness.
Method: Developed FinKario, a dataset with structured financial insights, and FinKario-RAG, a retrieval strategy for efficient access to evolving financial knowledge.
Result: FinKario with FinKario-RAG outperformed financial LLMs by 18.81% and institutional strategies by 17.85% in stock trend prediction.
Conclusion: The proposed solutions effectively address the challenges of slow updates and unstructured data, enhancing financial analysis and decision-making.
Abstract: Individual investors are significantly outnumbered and disadvantaged in financial markets, overwhelmed by abundant information and lacking professional analysis. Equity research reports stand out as crucial resources, offering valuable insights. By leveraging these reports, large language models (LLMs) can enhance investors’ decision-making capabilities and strengthen financial analysis. However, two key challenges limit their effectiveness: (1) the rapid evolution of market events often outpaces the slow update cycles of existing knowledge bases, (2) the long-form and unstructured nature of financial reports further hinders timely and context-aware integration by LLMs. To address these challenges, we tackle both data and methodological aspects. First, we introduce the Event-Enhanced Automated Construction of Financial Knowledge Graph (FinKario), a dataset comprising over 305,360 entities, 9,625 relational triples, and 19 distinct relation types. FinKario automatically integrates real-time company fundamentals and market events through prompt-driven extraction guided by professional institutional templates, providing structured and accessible financial insights for LLMs. Additionally, we propose a Two-Stage, Graph-Based retrieval strategy (FinKario-RAG), optimizing the retrieval of evolving, large-scale financial knowledge to ensure efficient and precise data access. Extensive experiments show that FinKario with FinKario-RAG achieves superior stock trend prediction accuracy, outperforming financial LLMs by 18.81% and institutional strategies by 17.85% on average in backtesting.
[730] Rethinking Multimodality: Optimizing Multimodal Deep Learning for Biomedical Signal Classification
Timothy Oladunni, Alex Wong
Main category: cs.LG
TL;DR: The study shows that in multimodal deep learning for biomedical signal classification, not all modality fusions improve performance. Optimal performance depends on the complementarity of feature domains, not the number of modalities.
Details
Motivation: To challenge the assumption that adding more modalities always enhances accuracy in biomedical signal classification and to identify the principles of effective multimodal fusion.
Method: Five deep learning models (three unimodal and two multimodal) were designed and evaluated for ECG classification using bootstrapping and Bayesian inference.
Result: Hybrid 1 (fusing time and time-frequency domains) outperformed baselines, while Hybrid 2 (adding frequency) showed no improvement, indicating redundancy.
Conclusion: Optimal multimodal performance is determined by the intrinsic complementarity of feature domains, not the quantity of modalities, leading to a new framework for domain fusion.
Abstract: This study proposes a novel perspective on multimodal deep learning for biomedical signal classification, systematically analyzing how complementary feature domains impact model performance. While fusing multiple domains often presumes enhanced accuracy, this work demonstrates that adding modalities can yield diminishing returns, as not all fusions are inherently advantageous. To validate this, five deep learning models were designed, developed, and rigorously evaluated: three unimodal (1D-CNN for time, 2D-CNN for time-frequency, and 1D-CNN-Transformer for frequency) and two multimodal (Hybrid 1, which fuses 1D-CNN and 2D-CNN; Hybrid 2, which combines 1D-CNN, 2D-CNN, and a Transformer). For ECG classification, bootstrapping and Bayesian inference revealed that Hybrid 1 consistently outperformed the 2D-CNN baseline across all metrics (p-values < 0.05, Bayesian probabilities > 0.90), confirming the synergistic complementarity of the time and time-frequency domains. Conversely, Hybrid 2’s inclusion of the frequency domain offered no further improvement and sometimes a marginal decline, indicating representational redundancy; a phenomenon further substantiated by a targeted ablation study. This research redefines a fundamental principle of multimodal design in biomedical signal analysis. We demonstrate that optimal domain fusion isn’t about the number of modalities, but the quality of their inherent complementarity. This paradigm-shifting concept moves beyond purely heuristic feature selection. Our novel theoretical contribution, “Complementary Feature Domains in Multimodal ECG Deep Learning,” presents a mathematically quantifiable framework for identifying ideal domain combinations, demonstrating that optimal multimodal performance arises from the intrinsic information-theoretic complementarity among fused domains.
[731] Explainable AI Methods for Neuroimaging: Systematic Failures of Common Tools, the Need for Domain-Specific Validation, and a Proposal for Safe Application
Nys Tjade Siegel, James H. Cole, Mohamad Habes, Stefan Haufe, Kerstin Ritter, Marc-André Schulz
Main category: cs.LG
TL;DR: A systematic comparison of XAI methods on neuroimaging data reveals failures in GradCAM and Layer-wise Relevance Propagation, while SmoothGrad proves reliable due to its simplicity.
Details
Motivation: To validate XAI methods for neuroimaging, ensuring trustworthy interpretations of deep learning models.
Method: A novel XAI validation framework tested on ~45,000 structural brain MRIs, using tasks with known signal sources.
Result: GradCAM failed to localize features, Layer-wise Relevance Propagation produced artifacts, but SmoothGrad was accurate.
Conclusion: Domain-specific adaptation of XAI methods is crucial, and prior neuroimaging studies using standard XAI may need re-evaluation.
Abstract: Trustworthy interpretation of deep learning models is critical for neuroimaging applications, yet commonly used Explainable AI (XAI) methods lack rigorous validation, risking misinterpretation. We performed the first large-scale, systematic comparison of XAI methods on ~45,000 structural brain MRIs using a novel XAI validation framework. This framework establishes verifiable ground truth by constructing prediction tasks with known signal sources - from localized anatomical features to subject-specific clinical lesions - without artificially altering input images. Our analysis reveals systematic failures in two of the most widely used methods: GradCAM consistently failed to localize predictive features, while Layer-wise Relevance Propagation generated extensive, artifactual explanations that suggest incompatibility with neuroimaging data characteristics. Our results indicate that these failures stem from a domain mismatch, where methods with design principles tailored to natural images require substantial adaptation for neuroimaging data. In contrast, the simpler, gradient-based method SmoothGrad, which makes fewer assumptions about data structure, proved consistently accurate, suggesting its conceptual simplicity makes it more robust to this domain shift. These findings highlight the need for domain-specific adaptation and validation of XAI methods, suggest that interpretations from prior neuroimaging studies using standard XAI methodology warrant re-evaluation, and provide urgent guidance for practical application of XAI in neuroimaging.
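SmoothGrad, the method the study found robust, is simple enough to state in a few lines: average the input gradient over Gaussian-noised copies of the input. The sample count and noise level below are typical defaults, not the study's settings.

```python
import torch

def smoothgrad(model, x, target, n=50, sigma=0.15):
    """SmoothGrad: average the input gradient over n Gaussian-noised copies.
    sigma is the noise level relative to the input range."""
    x = x.detach()
    noise_scale = sigma * (x.max() - x.min())
    grads = torch.zeros_like(x)
    for _ in range(n):
        noisy = (x + noise_scale * torch.randn_like(x)).requires_grad_(True)
        score = model(noisy)[..., target].sum()   # class score for the target
        score.backward()
        grads += noisy.grad
    return grads / n
```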
[732] VAULT: Vigilant Adversarial Updates via LLM-Driven Retrieval-Augmented Generation for NLI
Roie Kazoom, Ofir Cohen, Rami Puzis, Asaf Shabtai, Ofer Hadar
Main category: cs.LG
TL;DR: VAULT is an automated adversarial RAG pipeline that improves NLI model robustness through retrieval, adversarial generation, and iterative retraining, achieving significant accuracy boosts on benchmarks.
Details
Motivation: To systematically uncover and remedy weaknesses in NLI models by automating adversarial data curation and retraining.
Method: Uses a three-stage pipeline: retrieval (semantic and lexical), adversarial hypothesis generation via LLM prompts, and iterative retraining with validated adversarial examples.
Result: Boosts RoBERTa-base accuracy by +4.12% on SNLI, +5.91% on ANLI, and +17.32% on MultiNLI, outperforming prior methods by up to 2.0%.
Conclusion: VAULT enables rapid, human-independent robustness improvements in NLI tasks through scalable adversarial data curation.
Abstract: We introduce VAULT, a fully automated adversarial RAG pipeline that systematically uncovers and remedies weaknesses in NLI models through three stages: retrieval, adversarial generation, and iterative retraining. First, we perform balanced few-shot retrieval by embedding premises with both semantic (BGE) and lexical (BM25) similarity. Next, we assemble these contexts into LLM prompts to generate adversarial hypotheses, which are then validated by an LLM ensemble for label fidelity. Finally, the validated adversarial examples are injected back into the training set at increasing mixing ratios, progressively fortifying a zero-shot RoBERTa-base model. On standard benchmarks, VAULT elevates RoBERTa-base accuracy from 88.48% to 92.60% on SNLI (+4.12%), from 75.04% to 80.95% on ANLI (+5.91%), and from 54.67% to 71.99% on MultiNLI (+17.32%). It also consistently outperforms prior in-context adversarial methods by up to 2.0% across datasets. By automating high-quality adversarial data curation at scale, VAULT enables rapid, human-independent robustness improvements in NLI inference tasks.
[733] Masked Omics Modeling for Multimodal Representation Learning across Histopathology and Molecular Profiles
Lucas Robinet, Ahmad Berjaoui, Elizabeth Cohen-Jonathan Moyal
Main category: cs.LG
TL;DR: MORPHEUS is a transformer-based framework for pre-training on histopathology and multi-omics data, enabling cross-modal learning and any-to-any omics generation, outperforming existing methods.
Details
Motivation: Histopathology alone lacks molecular and clinical insights, which multi-omics data can provide. A unified model for both is needed.
Method: Uses masked modeling on omics data to learn cross-modal relationships, adaptable to any input combination.
Result: Outperforms state-of-the-art methods in diverse tasks and modality combinations.
Conclusion: MORPHEUS is a promising foundation model for multimodal oncology research.
Abstract: Self-supervised learning has driven major advances in computational pathology by enabling models to learn rich representations from hematoxylin and eosin (H&E)-stained cancer tissue. However, histopathology alone often falls short for molecular characterization and understanding clinical outcomes, as important information is contained in high-dimensional omics profiles like transcriptomics, methylomics, or genomics. In this work, we introduce MORPHEUS, a unified transformer-based pre-training framework that encodes both histopathology and multi-omics data into a shared latent space. At its core, MORPHEUS relies on a masked modeling objective applied to randomly selected omics portions, encouraging the model to learn biologically meaningful cross-modal relationships. The same pre-trained network can be applied to histopathology alone or in combination with any subset of omics modalities, seamlessly adapting to the available inputs. Additionally, MORPHEUS enables any-to-any omics generation, enabling one or more omics profiles to be inferred from any subset of modalities, including H&E alone. Pre-trained on a large pan-cancer cohort, MORPHEUS consistently outperforms state-of-the-art methods across diverse modality combinations and tasks, positioning itself as a promising framework for developing multimodal foundation models in oncology. The code is available at: https://github.com/Lucas-rbnt/MORPHEUS
[734] Optimal Scheduling Algorithms for LLM Inference: Theory and Practice
Agrim Bari, Parikshit Hegde, Gustavo de Veciana
Main category: cs.LG
TL;DR: The paper addresses the need for efficient LLM inference systems by proposing a theoretical framework and two schedulers (RAD and SLAI) to optimize throughput and meet practical SLOs like TBT and TTFT.
Details
Motivation: The growing use of LLM-based tools necessitates efficient inference systems due to their unique two-phase computation structure (prefill and decode phases).
Method: Develops a theoretical framework for routing and scheduling, introduces the RAD scheduler for throughput optimality, and designs the SLAI scheduler for practical SLOs like TBT and TTFT.
Result: SLAI reduces median TTFT by 53% and increases serving capacity by 26% while meeting tail TBT constraints, as demonstrated on the Openchat ShareGPT4 dataset with Mistral-7B.
Conclusion: The proposed schedulers (RAD and SLAI) effectively address the challenges of LLM inference systems, improving performance and meeting practical constraints.
Abstract: With the growing use of Large Language Model (LLM)-based tools like ChatGPT, Perplexity, and Gemini across industries, there is a rising need for efficient LLM inference systems. These systems handle requests with a unique two-phase computation structure: a prefill-phase that processes the full input prompt and a decode-phase that autoregressively generates tokens one at a time. This structure calls for new strategies for routing and scheduling requests. In this paper, we take a comprehensive approach to this challenge by developing a theoretical framework that models routing and scheduling in LLM inference systems. We identify two key design principles-optimal tiling and dynamic resource allocation-that are essential for achieving high throughput. Guided by these principles, we propose the Resource-Aware Dynamic (RAD) scheduler and prove that it achieves throughput optimality under mild conditions. To address practical Service Level Objectives (SLOs) such as serving requests with different Time Between Token (TBT) constraints, we design the SLO-Aware LLM Inference (SLAI) scheduler. SLAI uses real-time measurements to prioritize decode requests that are close to missing their TBT deadlines and reorders prefill requests based on known prompt lengths to further reduce the Time To First Token (TTFT) delays. We evaluate SLAI on the Openchat ShareGPT4 dataset using the Mistral-7B model on an NVIDIA RTX ADA 6000 GPU. Compared to Sarathi-Serve, SLAI reduces the median TTFT by 53% and increases the maximum serving capacity by 26% such that median TTFT is below 0.5 seconds, while meeting tail TBT latency constraints.
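The abstract describes SLAI's two ordering rules (decode requests nearest their TBT deadline first, prefills reordered by known prompt length); a hedged sketch of that batching logic, with a hypothetical `Request` record and token-budget accounting, might look like this:

```python
import time
from dataclasses import dataclass

@dataclass
class Request:
    prompt_len: int
    last_token_time: float
    phase: str  # "prefill" or "decode"

def build_batch(requests, tbt_slo, token_budget):
    """SLAI-style ordering sketch: decode requests with the least deadline
    slack first, then prefill requests shortest-prompt-first to cut TTFT."""
    now = time.monotonic()
    decodes = sorted((r for r in requests if r.phase == "decode"),
                     key=lambda r: (r.last_token_time + tbt_slo) - now)
    prefills = sorted((r for r in requests if r.phase == "prefill"),
                      key=lambda r: r.prompt_len)
    batch, used = [], 0
    for r in decodes + prefills:
        cost = 1 if r.phase == "decode" else r.prompt_len  # tokens this step
        if used + cost <= token_budget:
            batch.append(r)
            used += cost
    return batch
```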
[735] v-PuNNs: van der Put Neural Networks for Transparent Ultrametric Representation Learning
Gnankan Landry Regis N’guessan
Main category: cs.LG
TL;DR: v-PuNNs introduce p-adic neural networks for hierarchical data, achieving high accuracy and efficiency with novel optimization methods.
Details
Motivation: Euclidean space is unsuitable for hierarchical data like taxa or word senses, prompting the need for a specialized architecture.
Method: v-PuNNs use p-adic balls and TURL for exact subtree semantics, with VAPO for optimization.
Result: State-of-the-art performance on benchmarks like WordNet (99.96% accuracy) and GO (96.9% leaf accuracy).
Conclusion: v-PuNNs bridge number theory and deep learning, providing interpretable and efficient models for hierarchical data.
Abstract: Conventional deep learning models embed data in Euclidean space $\mathbb{R}^d$, a poor fit for strictly hierarchical objects such as taxa, word senses, or file systems. We introduce van der Put Neural Networks (v-PuNNs), the first architecture whose neurons are characteristic functions of p-adic balls in $\mathbb{Z}_p$. Under our Transparent Ultrametric Representation Learning (TURL) principle every weight is itself a p-adic number, giving exact subtree semantics. A new Finite Hierarchical Approximation Theorem shows that a depth-K v-PuNN with $\sum_{j=0}^{K-1} p^{\,j}$ neurons universally represents any K-level tree. Because gradients vanish in this discrete space, we propose Valuation-Adaptive Perturbation Optimization (VAPO), with a fast deterministic variant (HiPaN-DS) and a moment-based one (HiPaN / Adam-VAPO). On three canonical benchmarks our CPU-only implementation sets new state-of-the-art: WordNet nouns (52,427 leaves) 99.96% leaf accuracy in 16 min; GO molecular-function 96.9% leaf / 100% root in 50 s; NCBI Mammalia Spearman $\rho = -0.96$ with true taxonomic distance. The learned metric is perfectly ultrametric (zero triangle violations), and its fractal and information-theoretic properties are analyzed. Beyond classification we derive structural invariants for quantum systems (HiPaQ) and controllable generative codes for tabular data (Tab-HiPaN). v-PuNNs therefore bridge number theory and deep learning, offering exact, interpretable, and efficient models for hierarchical data.
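The building blocks are elementary number theory: the p-adic valuation, and the characteristic function of a p-adic ball, which is the kind of indicator a v-PuNN neuron computes. A minimal integer-arithmetic sketch:

```python
def p_adic_valuation(n, p):
    """v_p(n): exponent of the largest power of p dividing n (v_p(0) = inf)."""
    if n == 0:
        return float("inf")
    v = 0
    while n % p == 0:
        n //= p
        v += 1
    return v

def in_ball(x, center, level, p):
    """Characteristic function of the p-adic ball of radius p^-level around
    center: x is in the ball iff x = center (mod p^level)."""
    return (x - center) % p**level == 0

# A ball at level k corresponds to one subtree at depth k of the p-ary tree,
# which is what gives the weights their exact subtree semantics.
assert in_ball(17, 2, 1, 5) and not in_ball(17, 2, 2, 5)
```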
[736] On Some Tunable Multi-fidelity Bayesian Optimization Frameworks
Arjun Manoj, Anastasia S. Georgiou, Dimitris G. Giovanis, Themistoklis P. Sapsis, Ioannis G. Kevrekidis
Main category: cs.LG
TL;DR: The paper introduces a proximity-based acquisition strategy for multi-fidelity optimization, simplifying fidelity selection and enabling multi-fidelity UCB strategies. It benchmarks these methods, showing consistent control over high-fidelity usage and efficient convergence.
Details
Motivation: To improve Gaussian Process-based multi-fidelity optimization by reducing reliance on expensive high-fidelity evaluations and simplifying fidelity selection.
Method: Proximity-based acquisition strategy and multi-fidelity UCB strategies combined with multi-fidelity GPs, benchmarked against other approaches.
Result: Proximity-based strategy provides consistent control over high-fidelity usage while maintaining convergence efficiency.
Conclusion: The proposed methods advance multi-fidelity optimization, offering practical benefits in tasks like chemical kinetic modeling.
Abstract: Multi-fidelity optimization employs surrogate models that integrate information from varying levels of fidelity to guide efficient exploration of complex design spaces while minimizing the reliance on (expensive) high-fidelity objective function evaluations. To advance Gaussian Process (GP)-based multi-fidelity optimization, we implement a proximity-based acquisition strategy that simplifies fidelity selection by eliminating the need for separate acquisition functions at each fidelity level. We also enable multi-fidelity Upper Confidence Bound (UCB) strategies by combining them with multi-fidelity GPs rather than the standard GPs typically used. We benchmark these approaches alongside other multi-fidelity acquisition strategies (including fidelity-weighted approaches) comparing their performance, reliance on high-fidelity evaluations, and hyperparameter tunability in representative optimization tasks. The results highlight the capability of the proximity-based multi-fidelity acquisition function to deliver consistent control over high-fidelity usage while maintaining convergence efficiency. Our illustrative examples include multi-fidelity chemical kinetic models, both homogeneous and heterogeneous (dynamic catalysis for ammonia production).
[737] Explaining GNN Explanations with Edge Gradients
Jesse He, Akbar Rafiey, Gal Mishne, Yusu Wang
Main category: cs.LG
TL;DR: The paper analyzes GNN explanation methods, comparing perturbation-based and gradient-based approaches, and establishes theoretical connections between them.
Details
Motivation: The need for a deeper theoretical understanding of GNN explanation methods due to inconsistent performance across architectures and tasks.
Method: The study examines input-level and layerwise explanations, linking perturbation-based and gradient-based methods theoretically and experimentally.
Result: Theoretical connections between methods are established, and practical experiments validate the findings on synthetic and real datasets.
Conclusion: The work provides foundational insights into GNN explainability, bridging gaps between methods and offering practical implications.
Abstract: In recent years, the remarkable success of graph neural networks (GNNs) on graph-structured data has prompted a surge of methods for explaining GNN predictions. However, the state-of-the-art for GNN explainability remains in flux. Different comparisons find mixed results for different methods, with many explainers struggling on more complex GNN architectures and tasks. This presents an urgent need for a more careful theoretical analysis of competing GNN explanation methods. In this work we take a closer look at GNN explanations in two different settings: input-level explanations, which produce explanatory subgraphs of the input graph, and layerwise explanations, which produce explanatory subgraphs of the computation graph. We establish the first theoretical connections between the popular perturbation-based and classical gradient-based methods, as well as point out connections between other recently proposed methods. At the input level, we demonstrate conditions under which GNNExplainer can be approximated by a simple heuristic based on the sign of the edge gradients. In the layerwise setting, we point out that edge gradients are equivalent to occlusion search for linear GNNs. Finally, we demonstrate how our theoretical results manifest in practice with experiments on both synthetic and real datasets.
[738] Centralized Adaptive Sampling for Reliable Co-Training of Independent Multi-Agent Policies
Nicholas E. Corrado, Josiah P. Hanna
Main category: cs.LG
TL;DR: MA-PROPS reduces joint sampling error in MARL by adaptive action sampling, improving policy gradient reliability.
Details
Motivation: Independent policy gradients in MARL can lead to suboptimal convergence due to sampling error, even when gradients point to optimal solutions.
Method: Introduces MA-PROPS, an adaptive action sampling method using a centralized behavior policy to reduce joint sampling error.
Result: MA-PROPS reduces sampling error more efficiently than standard methods and improves convergence to optimal joint policies.
Conclusion: Coordinated action selection via MA-PROPS enhances the reliability of policy gradient learning in MARL.
Abstract: Independent on-policy policy gradient algorithms are widely used for multi-agent reinforcement learning (MARL) in cooperative and no-conflict games, but they are known to converge suboptimally when each agent’s policy gradient points toward a suboptimal equilibrium. In this work, we identify a subtler failure mode that arises \textit{even when the expected policy gradients of all agents point toward an optimal solution.} After collecting a finite set of trajectories, stochasticity in independent action sampling can cause the joint data distribution to deviate from the expected joint on-policy distribution. This \textit{sampling error} w.r.t. the joint on-policy distribution produces inaccurate gradient estimates that can lead agents to converge suboptimally. In this paper, we investigate if joint sampling error can be reduced through coordinated action selection and whether doing so improves the reliability of policy gradient learning in MARL. Toward this end, we introduce an adaptive action sampling approach to reduce joint sampling error. Our method, Multi-Agent Proximal Robust On-Policy Sampling (MA-PROPS), uses a centralized behavior policy that we continually adapt to place larger probability on joint actions that are currently under-sampled w.r.t. the current joint policy. We empirically evaluate MA-PROPS in a diverse range of multi-agent games and demonstrate that (1) MA-PROPS reduces joint sampling error more efficiently than standard on-policy sampling and (2) improves the reliability of independent policy gradient algorithms, increasing the fraction of training runs that converge to an optimal joint policy.
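MA-PROPS's exact update is not given here; the sketch below only conveys the core idea of adapting a centralized behavior policy toward joint actions that are under-sampled relative to the current joint policy. The deficit-based logit adjustment and temperature are hypothetical stand-ins for the paper's actual rule.

```python
import numpy as np

def adapt_behavior_policy(joint_policy_probs, counts, temperature=5.0):
    """Shift a centralized behavior policy toward joint actions whose
    empirical frequency lags the current joint policy. Both arguments are
    flattened over all joint actions."""
    total = counts.sum()
    empirical = counts / total if total > 0 else np.zeros_like(counts, dtype=float)
    # Positive deficit => this joint action is under-represented so far.
    deficit = joint_policy_probs - empirical
    logits = np.log(joint_policy_probs + 1e-12) + temperature * deficit
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

pi = np.array([0.5, 0.3, 0.2])     # current joint policy
counts = np.array([10, 2, 4])      # joint actions sampled so far
behavior = adapt_behavior_policy(pi, counts)  # boosts the under-sampled action
```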
[739] FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models
Xuan Liu, Siru Ouyang, Xianrui Zhong, Jiawei Han, Huimin Zhao
Main category: cs.LG
TL;DR: The paper introduces FGBench, a dataset with 625K molecular property reasoning problems incorporating fine-grained functional group (FG) information, highlighting the limitations of current LLMs in FG-level reasoning and proposing a framework for future advancements.
Details
Motivation: Existing datasets focus on molecular-level property prediction but lack fine-grained FG information, which is crucial for building interpretable, structure-aware LLMs and advancing molecular design.
Method: FGBench includes 625K problems with precisely annotated FG data, covering regression and classification tasks on 245 FGs across three categories: single FG impacts, multiple FG interactions, and direct molecular comparisons.
Result: Benchmarking on 7K curated data shows current LLMs struggle with FG-level reasoning, indicating a need for improved capabilities in chemistry tasks.
Conclusion: FGBench provides a foundational framework for enhancing LLMs’ understanding of molecular structure-property relationships, with potential applications in drug discovery and molecular design.
Abstract: Large language models (LLMs) have gained significant attention in chemistry. However, most existing datasets center on molecular-level property prediction and overlook the role of fine-grained functional group (FG) information. Incorporating FG-level data can provide valuable prior knowledge that links molecular structures with textual descriptions, which can be used to build more interpretable, structure-aware LLMs for reasoning on molecule-related tasks. Moreover, LLMs can learn from such fine-grained information to uncover hidden relationships between specific functional groups and molecular properties, thereby advancing molecular design and drug discovery. Here, we introduce FGBench, a dataset comprising 625K molecular property reasoning problems with functional group information. Functional groups are precisely annotated and localized within the molecule, which ensures the dataset’s interoperability, thereby facilitating further multimodal applications. FGBench includes both regression and classification tasks on 245 different functional groups across three categories for molecular property reasoning: (1) single functional group impacts, (2) multiple functional group interactions, and (3) direct molecular comparisons. In a benchmark of state-of-the-art LLMs on 7K curated examples, the results indicate that current LLMs struggle with FG-level property reasoning, highlighting the need to enhance reasoning capabilities in LLMs for chemistry tasks. We anticipate that the methodology employed in FGBench to construct datasets with functional group-level information will serve as a foundational framework for generating new question-answer pairs, enabling LLMs to better understand fine-grained molecular structure-property relationships. The dataset and evaluation code are available at https://github.com/xuanliugit/FGBench.
[740] The Lattice Geometry of Neural Network Quantization – A Short Equivalence Proof of GPTQ and Babai’s algorithm
Johann Birnick
Main category: cs.LG
TL;DR: The paper links data-driven quantization in neural networks to solving the closest vector problem in lattices, showing GPTQ is equivalent to Babai’s nearest-plane algorithm and suggests lattice basis reduction for improved quantization.
Details
Motivation: To provide a theoretical understanding of data-driven quantization in neural networks by connecting it to lattice problems.
Method: The study proves GPTQ’s equivalence to Babai’s nearest-plane algorithm and offers geometric insights for both.
Result: Demonstrates that GPTQ and Babai’s algorithm are equivalent, with implications for using lattice basis reduction in quantization.
Conclusion: The findings suggest potential improvements in quantization through lattice basis reduction techniques.
Abstract: We explain how data-driven quantization of a linear unit in a neural network corresponds to solving the closest vector problem for a certain lattice generated by input data. We prove that the GPTQ algorithm is equivalent to Babai’s well-known nearest-plane algorithm. We furthermore provide geometric intuition for both algorithms. Lastly, we note the consequences of these results, in particular hinting at the possibility for using lattice basis reduction for better quantization.
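Babai's nearest-plane algorithm is classical and compact: Gram-Schmidt the basis, then round one coefficient at a time from the last basis vector to the first. Per the paper, GPTQ's greedy per-column quantization performs exactly this procedure on the lattice generated by the input data; the minimal sketch below shows the generic algorithm, not GPTQ itself.

```python
import numpy as np

def babai_nearest_plane(B, t):
    """Babai's nearest-plane algorithm: given a lattice basis B (rows) and a
    target t, greedily round one coefficient at a time, back to front,
    against the Gram-Schmidt directions."""
    B = np.asarray(B, dtype=float)
    n = B.shape[0]
    # Gram-Schmidt orthogonalization of the basis rows.
    Bstar = np.zeros_like(B)
    for i in range(n):
        Bstar[i] = B[i]
        for j in range(i):
            Bstar[i] -= (B[i] @ Bstar[j]) / (Bstar[j] @ Bstar[j]) * Bstar[j]
    # Round coefficients from the last basis vector to the first,
    # updating the residual at each step.
    coeffs = np.zeros(n, dtype=int)
    residual = np.asarray(t, dtype=float).copy()
    for i in reversed(range(n)):
        c = int(np.rint(residual @ Bstar[i] / (Bstar[i] @ Bstar[i])))
        coeffs[i] = c
        residual -= c * B[i]
    return coeffs

# Example: integer coefficients of a lattice vector near t.
B = np.array([[3.0, 0.0], [1.0, 2.0]])
t = np.array([2.2, 3.9])
print(babai_nearest_plane(B, t))  # [0, 2] -> lattice vector (2, 4)
```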
[741] Flow Matching for Probabilistic Learning of Dynamical Systems from Missing or Noisy Data
Siddharth Rout, Eldad Haber, Stephane Gaudreault
Main category: cs.LG
TL;DR: The paper introduces a flow matching variant for probabilistic forecasting in dynamical systems, addressing challenges like missing variables and noisy data. It proposes a generative ML approach for state perturbation and validates the method on high-dimensional datasets like WeatherBench.
Details
Motivation: Classical models struggle with ill-posed dynamical systems due to missing variables and noise. Stochastic ML techniques can model such problems, enabling probabilistic forecasting where multiple outputs are valid.
Method: A flow matching variant estimates future states as distributions. A generative ML approach perturbs states logically and physically, moving beyond Gaussian assumptions.
Result: The method is validated on complex systems, including a high-resolution weather dataset (WeatherBench), showing effectiveness in probabilistic forecasting.
Conclusion: The proposed approach advances probabilistic forecasting for dynamical systems by addressing ill-posedness and non-Gaussian behavior, with practical validation on real-world datasets.
Abstract: Learning dynamical systems is crucial across many fields, yet applying machine learning techniques remains challenging due to missing variables and noisy data. Classical mathematical models often struggle in these scenarios due to the resulting ill-posedness of the physical systems. Stochastic machine learning techniques address this challenge by enabling the modeling of such ill-posed problems: a single known input to the trained machine learning model may yield multiple plausible outputs, all of which are correct. In such scenarios, probabilistic forecasting is inherently meaningful. In this study, we introduce a variant of flow matching for probabilistic forecasting which estimates possible future states as a distribution over possible outcomes rather than a single-point prediction. Perturbing complex dynamical states is not trivial: the community typically applies Gaussian or uniform perturbations to crucial variables to model uncertainty, yet not all variables behave in a Gaussian fashion. We therefore also propose a generative machine learning approach to physically and logically perturb the states of complex high-dimensional dynamical systems. Finally, we establish the mathematical foundations of our method and demonstrate its effectiveness on several challenging dynamical systems, including a variant of the high-dimensional WeatherBench dataset, which models the global weather at a 5.625° meridional resolution.
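The paper's variant is not reproduced here, but the standard conditional flow-matching objective it builds on fits in a few lines: regress a learned velocity field onto straight-line interpolation paths between source and data samples. `v_theta` is an assumed model callable taking a state and a time.

```python
import torch

def flow_matching_loss(v_theta, x0, x1):
    """Standard conditional flow matching: match a velocity field to the
    constant velocity of the straight path from x0 (source) to x1 (data)."""
    # One random time per sample, broadcastable over trailing dimensions.
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)))
    xt = (1 - t) * x0 + t * x1        # point on the interpolation path
    target = x1 - x0                  # velocity of the straight path
    return ((v_theta(xt, t) - target) ** 2).mean()
```

Sampling multiple forecasts then amounts to integrating the learned field from different draws of x0, which is what yields a distribution over future states rather than a point prediction.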
[742] Protecting Student Mental Health with a Context-Aware Machine Learning Framework for Stress Monitoring
Md Sultanul Islam Ovi, Jamal Hossain, Md Raihan Alam Rahi, Fatema Akter
Main category: cs.LG
TL;DR: A machine learning framework for classifying student stress achieves high accuracy using ensemble methods, outperforming traditional assessments.
Details
Motivation: Addressing the limitations of subjective and periodic stress assessments in students by leveraging data-driven methods for timely intervention.
Method: A six-stage pipeline with preprocessing, feature selection, dimensionality reduction, and training with six classifiers, enhanced by ensemble strategies (voting and stacking).
Result: Achieved 93.09% accuracy with weighted hard voting on one dataset and 99.53% with stacking on another, surpassing benchmarks.
Conclusion: The framework demonstrates the effectiveness of context-aware, data-driven systems for early stress detection in academic settings.
Abstract: Student mental health is an increasing concern in academic institutions, where stress can severely impact well-being and academic performance. Traditional assessment methods rely on subjective surveys and periodic evaluations, offering limited value for timely intervention. This paper introduces a context-aware machine learning framework for classifying student stress using two complementary survey-based datasets covering psychological, academic, environmental, and social factors. The framework follows a six-stage pipeline involving preprocessing, feature selection (SelectKBest, RFECV), dimensionality reduction (PCA), and training with six base classifiers: SVM, Random Forest, Gradient Boosting, XGBoost, AdaBoost, and Bagging. To enhance performance, we implement ensemble strategies, including hard voting, soft voting, weighted voting, and stacking. Our best models achieve 93.09% accuracy with weighted hard voting on the Student Stress Factors dataset and 99.53% with stacking on the Stress and Well-being dataset, surpassing previous benchmarks. These results highlight the potential of context-integrated, data-driven systems for early stress detection and underscore their applicability in real-world academic settings to support student well-being.
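With scikit-learn, the voting and stacking stages of such a pipeline are a few lines; the base models below are three of the six named in the abstract, and the voting weights and hyperparameters are illustrative rather than the paper's tuned values.

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

base = [
    ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
    ("rf", RandomForestClassifier(n_estimators=200)),
    ("gb", GradientBoostingClassifier()),
]
# Weighted hard voting (the best strategy on one dataset in the paper).
voter = VotingClassifier(estimators=base, voting="hard", weights=[2, 1, 1])
# Stacking with a logistic-regression meta-learner (best on the other).
stacker = StackingClassifier(estimators=base,
                             final_estimator=LogisticRegression(max_iter=1000))
# voter.fit(X_train, y_train); stacker.fit(X_train, y_train)
```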
[743] A hierarchy tree data structure for behavior-based user segment representation
Yang Liu, Xuejiao Kang, Sathya Iyer, Idris Malik, Ruixuan Li, Juan Wang, Xinchen Lu, Xiangxue Zhao, Dayong Wang, Menghan Liu, Isaac Liu, Feng Liang, Yinzhe Yu
Main category: cs.LG
TL;DR: BUS is a tree-based method for user segmentation using behavioral data, improving recommendation quality and fairness.
Details
Motivation: Address the cold-start problem and enhance recommendations for new or infrequent users by leveraging user attributes and behaviors.
Method: Hierarchical segmentation using NDCG as the objective, combined with social graph data for fairness.
Result: Outperforms traditional methods in ranking quality and improves online metrics like music ranking.
Conclusion: BUS is an effective, interpretable framework for large-scale recommendation systems.
Abstract: User attributes are essential in multiple stages of modern recommendation systems and are particularly important for mitigating the cold-start problem and improving the experience of new or infrequent users. We propose Behavior-based User Segmentation (BUS), a novel tree-based data structure that hierarchically segments the user universe with various users’ categorical attributes based on the users’ product-specific engagement behaviors. During the BUS tree construction, we use Normalized Discounted Cumulative Gain (NDCG) as the objective function to maximize the behavioral representativeness of marginal users relative to active users in the same segment. The constructed BUS tree undergoes further processing and aggregation across the leaf nodes and internal nodes, allowing the generation of popular social content and behavioral patterns for each node in the tree. To further mitigate bias and improve fairness, we use the social graph to derive the user’s connection-based BUS segments, enabling the combination of behavioral patterns extracted from both the user’s own segment and connection-based segments as the connection aware BUS-based recommendation. Our offline analysis shows that the BUS-based retrieval significantly outperforms traditional user cohort-based aggregation on ranking quality. We have successfully deployed our data structure and machine learning algorithm and tested it with various production traffic serving billions of users daily, achieving statistically significant improvements in the online product metrics, including music ranking and email notifications. To the best of our knowledge, our study represents the first list-wise learning-to-rank framework for tree-based recommendation that effectively integrates diverse user categorical attributes while preserving real-world semantic interpretability at a large industrial scale.
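NDCG, the objective used during BUS tree construction, is standard; a linear-gain version (the paper's exact gain definition is not stated in the abstract) looks like this:

```python
import numpy as np

def ndcg(relevances, k=None):
    """NDCG@k for one ranked list: DCG normalized by the ideal DCG,
    with linear gains and log2 position discounts."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = (rel * discounts).sum()
    ideal = (np.sort(rel)[::-1] * discounts).sum()
    return dcg / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1], k=5))  # ~0.97 for this near-ideal ordering
```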
[744] Transformers in Pseudo-Random Number Generation: A Dual Perspective on Theory and Practice
Ran Li, Lingshu Zeng
Main category: cs.LG
TL;DR: The paper explores using decoder-only Transformer models as high-quality pseudo-random number generators (PRNGs), demonstrating their ability to simulate LCG and MT PRNGs and pass NIST tests.
Details
Motivation: PRNGs are crucial for optimizing large language models, and Transformers’ nonlinear processing capabilities make them suitable for high-quality PRNG generation.
Method: Theoretical and practical exploration of decoder-only Transformer models with Chain-of-Thought to simulate LCG and MT PRNGs, validated through experiments.
Result: Transformer-based PRNGs pass most NIST tests, showing statistical randomness, and demonstrate resilience in prediction attacks.
Conclusion: Decoder-only Transformers can effectively simulate PRNGs, offering potential benefits for applications requiring high-quality randomness.
Abstract: Pseudo-random number generators (PRNGs) are highly nonlinear processes and key building blocks in the optimization of large language models. Transformers excel at processing complex nonlinear relationships, so it is reasonable to generate high-quality pseudo-random numbers with them. In this paper, we explore this question from both theoretical and practical perspectives, highlighting the potential benefits and implications of Transformers in PRNGs. We theoretically demonstrate that decoder-only Transformer models with Chain-of-Thought can simulate both the Linear Congruential Generator (LCG) and Mersenne Twister (MT) PRNGs. Based on this, we conclude that the log-precision decoder-only Transformer can represent non-uniform $\text{AC}^0$. These theoretical findings are validated through experiments. The random numbers generated by Transformer-based PRNGs pass the majority of NIST tests, and their heat maps exhibit clear statistical randomness. Finally, we assess their resilience to prediction attacks.
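For reference, the LCG target the Transformer learns to simulate is a one-line recurrence; the constants below are glibc's rand() parameters, chosen only for concreteness.

```python
def lcg(seed, a=1103515245, c=12345, m=2**31):
    """Linear Congruential Generator: x_{n+1} = (a * x_n + c) mod m."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

gen = lcg(seed=42)
print([next(gen) for _ in range(3)])  # deterministic stream from the seed
```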
[745] DisTaC: Conditioning Task Vectors via Distillation for Robust Model Merging
Kotaro Yoshida, Yuji Naraki, Takafumi Horie, Ryotaro Shimizu, Hiroki Naganuma
Main category: cs.LG
TL;DR: The paper investigates vulnerabilities in model-merging methods for multi-task learning, identifies harmful factors (task vector norm disparities and low source-model confidence), and proposes DisTaC, a distillation-based method to pre-condition task vectors, improving merging robustness and performance.
Details
Motivation: To address the lack of robustness in state-of-the-art model-merging methods, particularly in realistic settings, by identifying and mitigating vulnerabilities caused by task vector norm disparities and low source-model confidence.
Method: Proposes DisTaC, a method using knowledge distillation to adjust a task vector’s norm and increase source-model confidence while preserving its essential task-specific knowledge.
Result: DisTaC enables successful merging of models with harmful traits, where standard methods fail, leading to significant performance improvements.
Conclusion: DisTaC enhances the robustness and effectiveness of model-merging techniques, making them viable for more realistic and challenging scenarios.
Abstract: Model merging has emerged as an efficient and flexible paradigm for multi-task learning, with numerous methods being proposed in recent years. However, these state-of-the-art techniques are typically evaluated on benchmark suites that are highly favorable to model merging, and their robustness in more realistic settings remains largely unexplored. In this work, we first investigate the vulnerabilities of model-merging methods and pinpoint the source-model characteristics that critically underlie them. Specifically, we identify two factors that are particularly harmful to the merging process: (1) disparities in task vector norms, and (2) the low confidence of the source models. To address this issue, we propose DisTaC (Distillation for Task vector Conditioning), a novel method that pre-conditions these problematic task vectors before the merge. DisTaC leverages knowledge distillation to adjust a task vector’s norm and increase source-model confidence while preserving its essential task-specific knowledge. Our extensive experiments demonstrate that by pre-conditioning task vectors with DisTaC, state-of-the-art merging techniques can successfully integrate models exhibiting the harmful traits – where they would otherwise fail – achieving significant performance gains.
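A task vector is simply the parameter delta between a fine-tuned checkpoint and the shared pre-trained one, so norm disparity reduces to a single scalar to condition. The sketch below shows the quantities involved; DisTaC itself adjusts the norm via distillation so that task knowledge survives, whereas the direct rescaling here is only the simplest possible stand-in.

```python
import torch

def task_vector(finetuned, pretrained):
    """Task vector: parameter delta between a fine-tuned checkpoint and the
    shared pre-trained checkpoint (both given as state_dicts)."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def rescale_norm(tau, target_norm):
    """Equalize task-vector norms before merging (naive stand-in for the
    distillation-based conditioning DisTaC actually performs)."""
    flat = torch.cat([v.flatten() for v in tau.values()])
    scale = target_norm / (flat.norm() + 1e-12)
    return {k: v * scale for k, v in tau.items()}
```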
[746] T2S: Tokenized Skill Scaling for Lifelong Imitation Learning
Hongquan Zhang, Jingyu Gong, Zhizhong Zhang, Xin Tan, Yanyun Qu, Yuan Xie
Main category: cs.LG
TL;DR: A unified framework, Tokenized Skill Scaling (T2S), addresses lifelong imitation learning by balancing skill retention and acquisition through tokenized parameters and language-guided scaling.
Details
Motivation: The challenge is mitigating catastrophic forgetting while acquiring new skills, which current methods address in isolation.
Method: T2S tokenizes model parameters, transforming linear mappings into cross-attention for scalability, and uses language-guided scaling for efficient knowledge transfer.
Result: T2S prevents forgetting (1.0% NBT), scales skills with minimal parameter growth (8.0% tokens), and enables efficient transfer (77.7% FWT).
Conclusion: T2S offers a promising solution for lifelong imitation learning by unifying skill retention and acquisition.
Abstract: The main challenge in lifelong imitation learning lies in the balance between mitigating catastrophic forgetting of previous skills while maintaining sufficient capacity for acquiring new ones. However, current approaches typically address these aspects in isolation, overlooking their internal correlation in lifelong skill acquisition. We address this limitation with a unified framework named Tokenized Skill Scaling (T2S). Specifically, by tokenizing the model parameters, the linear parameter mapping of the traditional transformer is transformed into cross-attention between input and learnable tokens, thereby enhancing model scalability through the easy extension of new tokens. Additionally, we introduce language-guided skill scaling to transfer knowledge across tasks efficiently and avoid linearly growing parameters. Extensive experiments across diverse tasks demonstrate that T2S: 1) effectively prevents catastrophic forgetting (achieving an average NBT of 1.0% across the three LIBERO task suites), 2) excels in new skill scaling with minimal increases in trainable parameters (needing only 8.0% trainable tokens in an average of lifelong tasks), and 3) enables efficient knowledge transfer between tasks (achieving an average FWT of 77.7% across the three LIBERO task suites), offering a promising solution for lifelong imitation learning.
[747] RSPO: Risk-Seeking Policy Optimization for Pass@k and Max@k Metrics in Large Language Models
Kaichen Zhang, Shenghao Gao, Yuzhong Hong, Haipeng Sun, Junwei Bao, Hongfei Jiang, Yang Song, Hong Dingqian, Hui Xiong
Main category: cs.LG
TL;DR: RSPO optimizes Pass@k and Max@k metrics directly, addressing the risk mismatch in current LLM training.
Details
Motivation: Current LLM training optimizes risk-neutral objectives, while evaluation uses risk-seeking metrics, leading to suboptimal performance.
Method: Proposes Risk-Seeking Policy Optimization (RSPO), which directly targets Pass@k and Max@k, addressing the ‘hitchhiking’ problem with efficient gradient estimators.
Result: RSPO provides efficient, unbiased gradient estimators for Pass@k and Max@k, validated theoretically and experimentally.
Conclusion: RSPO bridges the gap between training and evaluation metrics, improving performance for risk-seeking objectives.
Abstract: Current large language model post-training optimizes a risk-neutral objective that maximizes expected reward, yet evaluation relies heavily on risk-seeking metrics like Pass@k (at least one success in k trials) and Max@k (maximum reward across k responses). This mismatch in risk preferences can inevitably lead to suboptimal performance. To bridge this gap, we propose Risk-Seeking Policy Optimization (RSPO), a novel method that directly targets Pass@k and Max@k during training. A key challenge in optimizing these metrics is the “hitchhiking” problem: low-reward responses are inadvertently reinforced if they co-occur with a high-reward response within a sample of k generations, resulting in inefficient optimization. RSPO addresses this problem by leveraging the closed-form probability that a given response is the maximum among k samplings. Despite the complexity of nested gradients over multiple responses, RSPO produces efficient, unbiased gradient estimators for both metrics. We validate our approach with both rigorous theoretical analysis and comprehensive experimental results.
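The Pass@k metric RSPO targets has a standard unbiased estimator from n samples with c successes (Chen et al., 2021), which makes the train/eval mismatch concrete:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator from n samples with c successes:
    1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=4, k=8))  # ~0.96: chance a size-8 draw succeeds
```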
[748] From Taylor Series to Fourier Synthesis: The Periodic Linear Unit
Shiko Kudo
Main category: cs.LG
TL;DR: Introducing the Periodic Linear Unit (PLU), a sine-wave based activation function, enabling minimal MLPs to solve complex tasks like spiral classification, impossible with standard activations.
Details
Motivation: Current neural networks rely on simple, monotonically-increasing activation functions (e.g., ReLU), requiring large models for complex tasks. PLU aims to enhance expressive power and parameter efficiency.
Method: PLU is a learnable sine-wave activation with periodic non-monotonicity, paired with Repulsive Reparameterization to prevent collapse into linearity.
Result: A minimal MLP with two PLU neurons solves the spiral classification task, outperforming standard activations.
Conclusion: PLU shifts the paradigm from piecewise approximators to Fourier-like synthesizers, achieving exponential parameter efficiency gains.
Abstract: The dominant paradigm in modern neural networks relies on simple, monotonically-increasing activation functions like ReLU. While effective, this paradigm necessitates large, massively-parameterized models to approximate complex functions. In this paper, we introduce the Periodic Linear Unit (PLU), a learnable sine-wave based activation with periodic non-monotonicity. PLU is designed for maximum expressive power and numerical stability, achieved through its formulation and a paired innovation we term Repulsive Reparameterization, which prevents the activation from collapsing into a non-expressive linear function. We demonstrate that a minimal MLP with only two PLU neurons can solve the spiral classification task, a feat impossible for equivalent networks using standard activations. This suggests a paradigm shift from networks as piecewise Taylor-like approximators to powerful Fourier-like function synthesizers, achieving exponential gains in parameter efficiency by placing intelligence in the neuron itself.
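The abstract does not spell out PLU's exact parameterization, so the following is only a generic sketch of a learnable periodic activation in the same spirit; the Repulsive Reparameterization that keeps it from collapsing into a linear function is omitted, and the amplitude/frequency parameters are illustrative.

```python
import torch
import torch.nn as nn

class PeriodicUnit(nn.Module):
    """Generic learnable sine-based activation in the spirit of PLU
    (not the paper's exact formulation)."""

    def __init__(self):
        super().__init__()
        self.amp = nn.Parameter(torch.tensor(1.0))   # learnable amplitude
        self.freq = nn.Parameter(torch.tensor(1.0))  # learnable frequency

    def forward(self, x):
        # Identity pass-through plus a periodic, non-monotonic component.
        return x + self.amp * torch.sin(self.freq * x)
```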
[749] SpectrumWorld: Artificial Intelligence Foundation for Spectroscopy
Zhuo Yang, Jiaqing Xie, Shuaike Shen, Daolang Wang, Yeyun Chen, Ben Gao, Shuzhou Sun, Biqing Qi, Dongzhan Zhou, Lei Bai, Linjiang Chen, Shufei Zhang, Jun Jiang, Tianfan Fu, Yuqiang Li
Main category: cs.LG
TL;DR: SpectrumLab is a unified platform to standardize and accelerate deep learning research in spectroscopy, featuring tools, benchmarks, and empirical studies.
Details
Motivation: Addressing the lack of standardized formulations in deep learning for spectroscopy.
Method: Introduces SpectrumLab with a Python library, SpectrumAnnotator for benchmarks, and SpectrumBench for multi-layered tasks.
Result: Empirical studies reveal limitations of current approaches using 18 multimodal LLMs.
Conclusion: SpectrumLab aims to be a foundational tool for future advancements in deep learning-driven spectroscopy.
Abstract: Deep learning holds immense promise for spectroscopy, yet research and evaluation in this emerging field often lack standardized formulations. To address this issue, we introduce SpectrumLab, a pioneering unified platform designed to systematize and accelerate deep learning research in spectroscopy. SpectrumLab integrates three core components: a comprehensive Python library featuring essential data processing and evaluation tools, along with leaderboards; an innovative SpectrumAnnotator module that generates high-quality benchmarks from limited seed data; and SpectrumBench, a multi-layered benchmark suite covering 14 spectroscopic tasks and over 10 spectrum types, featuring spectra curated from over 1.2 million distinct chemical substances. Thorough empirical studies on SpectrumBench with 18 cutting-edge multimodal LLMs reveal critical limitations of current approaches. We hope SpectrumLab will serve as a crucial foundation for future advancements in deep learning-driven spectroscopy.
[750] BSL: A Unified and Generalizable Multitask Learning Platform for Virtual Drug Discovery from Design to Synthesis
Kun Li, Zhennan Wu, Yida Xiong, Hongzhi Zhang, Longtao Hu, Zhonglie Liu, Junqi Zeng, Wenjie Wu, Mukun Chen, Jiameng Chen, Wenbin Hu
Main category: cs.LG
TL;DR: BSL is a deep learning-enhanced platform for virtual drug discovery, integrating seven core tasks with advanced AI technologies, achieving SOTA performance and demonstrating practical utility in identifying bioactive compounds.
Details
Motivation: Existing computational platforms for drug discovery are fragmented, lack algorithmic innovation, and perform poorly on OOD data, hindering progress.
Method: BSL uses a unified modular framework with generative models and graph neural networks, emphasizing OOD generalization.
Result: BSL achieves SOTA performance, identifies bioactive compounds, and outperforms existing platforms in comparative experiments.
Conclusion: BSL is a scalable, effective solution for drug discovery, offering innovation and high-precision predictions for real-world research.
Abstract: Drug discovery is of great social significance in safeguarding human health, prolonging life, and addressing the challenges of major diseases. In recent years, artificial intelligence has demonstrated remarkable advantages in key tasks across bioinformatics and pharmacology, owing to its efficient data processing and data representation capabilities. However, most existing computational platforms cover only a subset of core tasks, leading to fragmented workflows and low efficiency. In addition, they often lack algorithmic innovation and show poor generalization to out-of-distribution (OOD) data, which greatly hinders the progress of drug discovery. To address these limitations, we propose Baishenglai (BSL), a deep learning-enhanced, open-access platform designed for virtual drug discovery. BSL integrates seven core tasks within a unified and modular framework, incorporating advanced technologies such as generative models and graph neural networks. In addition to achieving state-of-the-art (SOTA) performance on multiple benchmark datasets, the platform emphasizes evaluation mechanisms that focus on generalization to OOD molecular structures. Comparative experiments with existing platforms and baseline methods demonstrate that BSL provides a comprehensive, scalable, and effective solution for virtual drug discovery, offering both algorithmic innovation and high-precision prediction for real-world pharmaceutical research. In addition, BSL demonstrated its practical utility by discovering novel modulators of the GluN1/GluN3A NMDA receptor, successfully identifying three compounds with clear bioactivity in in-vitro electrophysiological assays. These results highlight BSL as a promising and comprehensive platform for accelerating biomedical research and drug discovery. The platform is accessible at https://www.baishenglai.net.
[751] Oldie but Goodie: Re-illuminating Label Propagation on Graphs with Partially Observed Features
Sukwon Yun, Xin Liu, Yunhak Oh, Junseok Lee, Tianlong Chen, Tsuyoshi Murata, Chanyoung Park
Main category: cs.LG
TL;DR: GOODIE is a novel framework combining Label Propagation and Feature Propagation to handle missing node features in graphs, outperforming existing methods.
Details
Motivation: Addressing sub-optimal performance of GNNs in node classification when node features are missing, especially when few features are available.
Method: Hybrid approach using Label Propagation and Feature Propagation branches, with a GNN-based decoder, Structure-Feature Attention, and Pseudo-Label contrastive learning.
Result: GOODIE outperforms state-of-the-art methods in scenarios with few or abundant features.
Conclusion: GOODIE effectively leverages structure and feature information, proving superior in handling missing features.
Abstract: In real-world graphs, we often encounter missing-feature situations in which a few, or even the majority, of node features, e.g., sensitive information, are absent. In such scenarios, directly applying Graph Neural Networks (GNNs) yields sub-optimal results in downstream tasks such as node classification. Although a few GNN-based methods have emerged to mitigate the missing-feature problem, they perform worse than traditional structure-based models when only a few features are available. To this end, we propose a novel framework that further illuminates the potential of classical Label Propagation (the Oldie), taking advantage of Feature Propagation, especially when only partial features are available. The resulting framework, GOODIE, takes a hybrid approach to obtain embeddings from both the Label Propagation branch and the Feature Propagation branch. To do so, we first design a GNN-based decoder that enables the Label Propagation branch to output hidden embeddings that align with those of the FP branch. GOODIE then automatically captures the relative significance of structure and feature information through the newly designed Structure-Feature Attention. Finally, a novel pseudo-label contrastive learning objective differentiates the contribution of each positive pair within pseudo-labels originating from the LP branch, after which GOODIE outputs the final prediction for the unlabeled nodes. Through extensive experiments, we demonstrate that GOODIE outperforms existing state-of-the-art methods not only when few features are available but also when features are abundant. Source code of GOODIE is available at: https://github.com/SukwonYun/GOODIE.
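The "Oldie" being revived is classical label propagation, which needs only a normalized adjacency matrix and the observed labels; GOODIE's LP branch builds on this iteration, shown here in its textbook form rather than the paper's exact variant:

```python
import numpy as np

def label_propagation(S, Y0, alpha=0.9, iters=50):
    """Textbook label propagation (Zhou et al., 2004): diffuse label mass
    over a normalized adjacency S while anchoring to the observed one-hot
    labels Y0 (rows of zeros for unlabeled nodes)."""
    Y = Y0.astype(float).copy()
    for _ in range(iters):
        Y = alpha * (S @ Y) + (1 - alpha) * Y0
    return Y.argmax(axis=1)   # predicted class per node
```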
[752] Multi-Operator Few-Shot Learning for Generalization Across PDE Families
Yile Li, Shandian Zhe
Main category: cs.LG
TL;DR: MOFS is a multimodal framework for few-shot learning of PDE operators, combining self-supervised pretraining, text-conditioned embeddings, and memory-augmented prompting to generalize across PDE families with minimal data.
Details
Motivation: Existing neural operator methods lack generalization across PDE families and require abundant training data for each PDE.
Method: MOFS integrates multi-task self-supervised pretraining, text-conditioned operator embeddings, and memory-augmented multimodal prompting with a two-stage training paradigm.
Result: Outperforms existing baselines in few-shot generalization on PDE benchmarks like Darcy Flow and Navier Stokes variants.
Conclusion: MOFS provides a universal, data-efficient foundation for operator learning across scientific domains.
Abstract: Learning solution operators for partial differential equations (PDEs) has become a foundational task in scientific machine learning. However, existing neural operator methods require abundant training data for each specific PDE and lack the ability to generalize across PDE families. In this work, we propose MOFS: a unified multimodal framework for multi-operator few-shot learning, which aims to generalize to unseen PDE operators using only a few demonstration examples. Our method integrates three key components: (i) multi-task self-supervised pretraining of a shared Fourier Neural Operator (FNO) encoder to reconstruct masked spatial fields and predict frequency spectra, (ii) text-conditioned operator embeddings derived from statistical summaries of input-output fields, and (iii) memory-augmented multimodal prompting with gated fusion and cross-modal gradient-based attention. We adopt a two-stage training paradigm that first learns prompt-conditioned inference on seen operators and then applies end-to-end contrastive fine-tuning to align latent representations across vision, frequency, and text modalities. Experiments on PDE benchmarks, including Darcy Flow and Navier Stokes variants, demonstrate that our model outperforms existing operator learning baselines in few-shot generalization. Extensive ablations validate the contributions of each modality and training component. Our approach offers a new foundation for universal and data-efficient operator learning across scientific domains.
[753] RelMap: Reliable Spatiotemporal Sensor Data Visualization via Imputative Spatial Interpolation
Juntong Chen, Huayuan Ye, He Zhu, Siwei Fu, Changbo Wang, Chenhui Li
Main category: cs.LG
TL;DR: A novel spatial interpolation pipeline using GNNs and PNA-GPE for reliable spatiotemporal data visualization with uncertainty encoding.
Details
Motivation: Traditional methods fail due to irregular sensor coverage, necessitating reliable interpolation and uncertainty-aware visualization.
Method: Integrates GNNs, PNA, and GPE for spatiotemporal learning; proposes static visualization for uncertainty communication.
Result: Demonstrated superior performance in data imputation, improved interpolation with reference data, and effective uncertainty visualization.
Conclusion: The pipeline enhances reliability and clarity in spatiotemporal data visualization, validated by real-world use cases.
Abstract: Accurate and reliable visualization of spatiotemporal sensor data such as environmental parameters and meteorological conditions is crucial for informed decision-making. Traditional spatial interpolation methods, however, often fall short of producing reliable interpolation results due to the limited and irregular sensor coverage. This paper introduces a novel spatial interpolation pipeline that achieves reliable interpolation results and produces a novel heatmap representation with uncertainty information encoded. We leverage imputation reference data from Graph Neural Networks (GNNs) to enhance visualization reliability and temporal resolution. By integrating Principal Neighborhood Aggregation (PNA) and Geographical Positional Encoding (GPE), our model effectively learns the spatiotemporal dependencies. Furthermore, we propose an extrinsic, static visualization technique for interpolation-based heatmaps that effectively communicates the uncertainties arising from various sources in the interpolated map. Through a set of use cases, extensive evaluations on real-world datasets, and user studies, we demonstrate our model’s superior performance for data imputation, the improvements to the interpolant with reference data, and the effectiveness of our visualization design in communicating uncertainties.
[754] Soft Separation and Distillation: Toward Global Uniformity in Federated Unsupervised Learning
Hung-Chieh Fang, Hsuan-Tien Lin, Irwin King, Yifei Zhang
Main category: cs.LG
TL;DR: The paper introduces Soft Separation and Distillation (SSD) to improve global uniformity in Federated Unsupervised Learning (FUL) by addressing inter-client uniformity issues caused by non-IID data.
Details
Motivation: Existing FUL methods achieve local uniformity but struggle with global uniformity due to non-IID data and decentralization.
Method: SSD encourages client representations to spread in different directions, reducing interference during aggregation. A projector distillation module aligns loss optimization with representation quality.
Result: SSD improves representation quality and task performance in cross-silo and cross-device federated settings.
Conclusion: SSD effectively addresses inter-client uniformity in FUL, enhancing global representation quality.
Abstract: Federated Unsupervised Learning (FUL) aims to learn expressive representations in federated and self-supervised settings. The quality of representations learned in FUL is usually determined by uniformity, a measure of how uniformly representations are distributed in the embedding space. However, existing solutions perform well in achieving intra-client (local) uniformity for local models while failing to achieve inter-client (global) uniformity after aggregation due to non-IID data distributions and the decentralized nature of FUL. To address this issue, we propose Soft Separation and Distillation (SSD), a novel approach that preserves inter-client uniformity by encouraging client representations to spread toward different directions. This design reduces interference during client model aggregation, thereby improving global uniformity while preserving local representation expressiveness. We further enhance this effect by introducing a projector distillation module to address the discrepancy between loss optimization and representation quality. We evaluate SSD in both cross-silo and cross-device federated settings, demonstrating consistent improvements in representation quality and task performance across various training scenarios. Our results highlight the importance of inter-client uniformity in FUL and establish SSD as an effective solution to this challenge. Project page: https://ssd-uniformity.github.io/
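Uniformity in this literature is typically measured with the Gaussian-potential metric of Wang and Isola (2020); a short sketch follows. SSD's directional spreading and projector distillation are the paper's own components and are not reproduced here.

```python
import torch
import torch.nn.functional as F

def uniformity(z, t=2.0):
    """Log of the mean Gaussian potential over all pairs of L2-normalized
    embeddings; more negative values mean more uniformly spread representations."""
    z = F.normalize(z, dim=1)
    sq_dists = torch.pdist(z, p=2).pow(2)  # all pairwise squared distances
    return sq_dists.mul(-t).exp().mean().log()
```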
[755] Exploitation Is All You Need… for Exploration
Micah Rentschler, Jesse Roberts
Main category: cs.LG
TL;DR: The paper explores whether greedy-trained meta-RL agents can exhibit emergent exploration under specific conditions: recurring environmental structure, agent memory, and long-horizon credit assignment. Experiments confirm this, showing exploration vanishes without structure or memory.
Details
Motivation: To challenge the conventional need for explicit exploration incentives in meta-RL by demonstrating that exploration can emerge naturally under certain conditions.
Method: Train agents with a greedy objective in environments with repeatable structure, memory, and long-horizon credit assignment. Test in stochastic bandits and gridworlds, with controlled ablations.
Result: Greedy-trained agents exhibit exploratory behavior when structure and memory are present. Removing either eliminates exploration, while long-horizon credit assignment is less critical.
Conclusion: Exploration can emerge from a unified reward-maximization process under specific conditions, suggesting exploration and exploitation need not be orthogonal objectives.
Abstract: Ensuring sufficient exploration is a central challenge when training meta-reinforcement learning (meta-RL) agents to solve novel environments. Conventional solutions to the exploration-exploitation dilemma inject explicit incentives such as randomization, uncertainty bonuses, or intrinsic rewards to encourage exploration. In this work, we hypothesize that an agent trained solely to maximize a greedy (exploitation-only) objective can nonetheless exhibit emergent exploratory behavior, provided three conditions are met: (1) Recurring Environmental Structure, where the environment features repeatable regularities that allow past experience to inform future choices; (2) Agent Memory, enabling the agent to retain and utilize historical interaction data; and (3) Long-Horizon Credit Assignment, where learning propagates returns over a time frame sufficient for the delayed benefits of exploration to inform current decisions. Through experiments in stochastic multi-armed bandits and temporally extended gridworlds, we observe that, when both structure and memory are present, a policy trained on a strictly greedy objective exhibits information-seeking exploratory behavior. We further demonstrate, through controlled ablations, that emergent exploration vanishes if either environmental structure or agent memory is absent (Conditions 1 & 2). Surprisingly, removing long-horizon credit assignment (Condition 3) does not always prevent emergent exploration, a result we attribute to the pseudo-Thompson Sampling effect. These findings suggest that, under the right prerequisites, exploration and exploitation need not be treated as orthogonal objectives but can emerge from a unified reward-maximization process.
[756] FedCD: A Fairness-aware Federated Cognitive Diagnosis Framework
Shangshang Yang, Jialin Han, Xiaoshan Yu, Ziwen Wang, Hao Jiang, Haiping Ma, Xingyi Zhang, Geyong Min
Main category: cs.LG
TL;DR: The paper proposes FedCD, a fairness-aware federated cognitive diagnosis framework, to address privacy and fairness challenges in online education using federated learning and a novel parameter decoupling strategy.
Details
Motivation: The influx of distributed student learning data in online education raises privacy and fairness concerns, especially due to data quality disparities between groups/schools.
Method: FedCD uses federated learning with a parameter decoupling strategy, splitting model parameters into locally personalized and globally shared parts to ensure fair and precise diagnosis.
Result: Experiments on three real-world datasets show FedCD outperforms five FL approaches under three CD models, proving its effectiveness.
Conclusion: FedCD successfully balances privacy preservation and fairness in cognitive diagnosis, offering a robust solution for distributed educational data.
Abstract: Online intelligent education platforms have generated a vast amount of distributed student learning data. This influx of data presents opportunities for cognitive diagnosis (CD) to assess students’ mastery of knowledge concepts while also raising significant data privacy and security challenges. To cope with this issue, federated learning (FL) becomes a promising solution by jointly training models across multiple local clients without sharing their original data. However, the data quality problem, caused by the ability differences and educational context differences between different groups/schools of students, further poses a challenge to the fairness of models. To address this challenge, this paper proposes a fairness-aware federated cognitive diagnosis framework (FedCD) to jointly train CD models built upon a novel parameter decoupling-based personalization strategy, preserving privacy of data and achieving precise and fair diagnosis of students on each client. As an FL paradigm, FedCD trains a local CD model for the students in each client based on its local student learning data, and each client uploads its partial model parameters to the central server for parameter aggregation according to the devised innovative personalization strategy. The main idea of this strategy is to decouple model parameters into two parts: the first is used as locally personalized parameters, containing diagnostic function-related model parameters, to diagnose each client’s students fairly; the second is the globally shared parameters across clients and the server, containing exercise embedding parameters, which are updated via fairness-aware aggregation, to alleviate inter-school unfairness. Experiments on three real-world datasets demonstrate the effectiveness of the proposed FedCD framework and the personalization strategy compared to five FL approaches under three CD models.
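The parameter-decoupling strategy can be pictured as a server that averages only the shared part of each client's state dict, leaving diagnostic-function parameters local. A minimal sketch, assuming dict-of-tensor states; the uniform default weights stand in for the fairness-aware weights the paper devises.

```python
import torch

def aggregate_shared(server_state, client_states, shared_keys, weights=None):
    """Average only globally shared parameters (e.g., exercise embeddings)
    across clients; diagnostic-function parameters never leave the clients.
    server_state and client_states are dicts mapping name -> torch.Tensor."""
    n = len(client_states)
    weights = weights if weights is not None else [1.0 / n] * n  # fairness-aware weights would go here
    for key in shared_keys:
        server_state[key] = sum(w * cs[key] for w, cs in zip(weights, client_states))
    return server_state
```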
[757] GraphVSSM: Graph Variational State-Space Model for Probabilistic Spatiotemporal Inference of Dynamic Exposure and Vulnerability for Regional Disaster Resilience Assessment
Joshua Dimasaka, Christian Geiß, Emily So
Main category: cs.LG
TL;DR: The paper introduces GraphVSSM, a novel probabilistic spatiotemporal framework for assessing regional disaster resilience by integrating graph deep learning, state-space modeling, and variational inference. It addresses the limitations of static vulnerability assessments and demonstrates applications in disaster-prone regions.
Details
Motivation: Existing methods for assessing physical vulnerability in disaster resilience are static and limited. The paper aims to improve this by leveraging time-series satellite data and machine learning.
Method: The authors propose GraphVSSM, combining graph deep learning, state-space modeling, and variational inference to analyze spatiotemporal data and expert beliefs.
Result: Applications include city-wide analysis in Quezon City, cyclone-impacted Bangladesh, and mudslide-affected Sierra Leone, alongside the release of the METEOR 2.5D dataset.
Conclusion: GraphVSSM advances disaster resilience assessment and offers a probabilistic deep learning approach for urban studies with weak supervision.
Abstract: Regional disaster resilience quantifies the changing nature of physical risks to inform policy instruments ranging from local immediate recovery to international sustainable development. While many existing state-of-practice methods have greatly advanced the dynamic mapping of exposure and hazard, our understanding of large-scale physical vulnerability has remained static, costly, limited, region-specific, coarse-grained, overly aggregated, and inadequately calibrated. With the significant growth in the availability of time-series satellite imagery and derived products for exposure and hazard, we focus our work on the equally important yet challenging element of the risk equation: physical vulnerability. We leverage machine learning methods that flexibly capture spatial contextual relationships, limited temporal observations, and uncertainty in a unified probabilistic spatiotemporal inference framework. We therefore introduce Graph Variational State-Space Model (GraphVSSM), a novel modular spatiotemporal approach that uniquely integrates graph deep learning, state-space modeling, and variational inference using time-series data and prior expert belief systems in a weakly supervised or coarse-to-fine-grained manner. We present three major results: a city-wide demonstration in Quezon City, Philippines; an investigation of sudden changes in the cyclone-impacted coastal Khurushkul community (Bangladesh) and mudslide-affected Freetown (Sierra Leone); and an open geospatial dataset, METEOR 2.5D, that spatiotemporally enhances the existing global static dataset for UN Least Developed Countries (2020). Beyond advancing regional disaster resilience assessment and improving our understanding of global disaster risk reduction progress, our method also offers a probabilistic deep learning approach, contributing to broader urban studies that require compositional data analysis under weak supervision.
[758] Physics-Informed Neural Network Approaches for Sparse Data Flow Reconstruction of Unsteady Flow Around Complex Geometries
Vamsi Sai Krishna Malineni, Suresh Rajendran
Main category: cs.LG
TL;DR: The paper explores Physics-Informed Neural Networks (PINNs) for flow reconstruction from sparse data in engineering, comparing methods and demonstrating their effectiveness in 2D and 3D flow scenarios.
Details
Motivation: Obtaining large datasets for engineering applications is costly, and existing research lacks focus on sparse data and constrained resources. PINNs offer a solution by embedding physical principles.
Method: The study uses PINNs for 2D laminar and 3D turbulent flows, comparing training methods (Standard PINN and BC-PINN) and optimizing physics constraints and loss functions.
Result: PINNs effectively reconstruct flow fields from sparse data, showing improved performance with relaxed constraints and dynamic loss weighting, even in turbulent flows.
Conclusion: PINNs are viable for real-world flow reconstruction, demonstrating adaptability to sparse data and complex physics, with potential for broader engineering applications.
Abstract: The utilization of Deep Neural Networks (DNNs) in physical science and engineering applications has gained traction due to their capacity to learn intricate functions. While large datasets are crucial for training DNN models in fields like computer vision and natural language processing, obtaining such datasets for engineering applications is prohibitively expensive. Physics-Informed Neural Networks (PINNs), a branch of Physics-Informed Machine Learning (PIML), tackle this challenge by embedding physical principles within neural network architectures. PINNs have been extensively explored for solving diverse forward and inverse problems in fluid mechanics. Nonetheless, there is limited research on employing PINNs for flow reconstruction from sparse data under constrained computational resources. Earlier studies were focused on forward problems with well-defined data. The present study attempts to develop models capable of reconstructing the flow field data from sparse datasets mirroring real-world scenarios. This study focuses on two cases: (a) two-dimensional (2D) unsteady laminar flow past a circular cylinder and (b) three-dimensional (3D) unsteady turbulent flow past an ultra-large container ship (ULCS). The first case compares the effectiveness of training methods like Standard PINN and Backward Compatible PINN (BC-PINN) and explores the performance enhancements through systematic relaxation of physics constraints and dynamic weighting of loss function components. The second case highlights the capability of PINN-based models to learn underlying physics from sparse data while accurately reconstructing the flow field for a highly turbulent flow.
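The core of any such PINN is a composite loss: a data misfit on the sparse sensors plus an autograd-computed PDE residual at collocation points. The sketch below uses the 1D viscous Burgers equation as a stand-in for the Navier-Stokes residuals of the paper, assumes a model mapping (t, x) pairs to u, and the scalar lambda_pde stands in for the dynamic loss weighting the study tunes.

```python
import torch

def pinn_loss(model, x_data, u_data, x_colloc, nu=0.01, lambda_pde=1.0):
    """Data misfit + PDE residual for u_t + u*u_x = nu*u_xx (1D Burgers).
    x_data, x_colloc: (N, 2) tensors of (t, x); u_data: (N, 1) measurements."""
    loss_data = torch.mean((model(x_data) - u_data) ** 2)
    x = x_colloc.clone().requires_grad_(True)
    u = model(x)                                            # (M, 1)
    g = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_t, u_x = g[:, 0], g[:, 1]                             # time and space derivatives
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0][:, 1]
    residual = u_t + u.squeeze(-1) * u_x - nu * u_xx
    return loss_data + lambda_pde * torch.mean(residual ** 2)
```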
[759] Fusion Sampling Validation in Data Partitioning for Machine Learning
Christopher Godwin Udomboso, Caston Sigauke, Ini Adinya
Main category: cs.LG
TL;DR: The paper introduces Fusion Sampling Validation (FSV), a hybrid of Simple Random Sampling (SRS) and K-Fold Cross-Validation (KFCV), to improve data partitioning in machine learning. FSV outperforms SRS and KFCV in accuracy and reliability.
Details
Motivation: Traditional methods like KFCV and SRS have limitations in generalisation and computational efficiency. FSV aims to combine their strengths while minimising biases.
Method: FSV integrates SRS and KFCV with weighted factors. Evaluated on datasets of 10K, 50K, and 100K samples using metrics like ME, VE, MSE, bias, and convergence rates.
Result: FSV showed superior performance with ME (0.000863), VE (0.949644), MSE (0.952127), bias (0.016288), and faster convergence rates compared to SRS and KFCV.
Conclusion: FSV provides a practical, accurate, and reliable solution for data partitioning, especially in resource-constrained or large-scale scenarios.
Abstract: Effective data partitioning is known to be crucial in machine learning. Traditional cross-validation methods like K-Fold Cross-Validation (KFCV) enhance model robustness but often compromise generalisation assessment due to high computational demands and extensive data shuffling. Simple Random Sampling (SRS) addresses these issues, but despite providing representative samples, it can result in non-representative sets when data are imbalanced. The study introduces a hybrid model, Fusion Sampling Validation (FSV), combining SRS and KFCV to optimise data partitioning. FSV aims to minimise biases and merge the simplicity of SRS with the accuracy of KFCV. The study used three datasets of 10,000, 50,000, and 100,000 samples, generated with a normal distribution (mean 0, variance 1) and initialised with seed 42. KFCV was performed with five folds and ten repetitions, incorporating a scaling factor to ensure robust performance estimation and generalisation capability. FSV integrated a weighted factor to enhance performance and generalisation further. Evaluations focused on mean estimates (ME), variance estimates (VE), mean squared error (MSE), bias, the rate of convergence for mean estimates (ROC_ME), and the rate of convergence for variance estimates (ROC_VE). Results indicated that FSV consistently outperformed SRS and KFCV, with ME values of 0.000863, VE of 0.949644, MSE of 0.952127, bias of 0.016288, ROC_ME of 0.005199, and ROC_VE of 0.007137. FSV demonstrated superior accuracy and reliability in data partitioning, particularly in resource-constrained environments and extensive datasets, providing practical solutions for effective machine learning implementations.
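The abstract does not spell out the weighting mechanism, so the following is only a hypothetical reading of FSV: blend an SRS hold-out estimate with a KFCV estimate through a weighting factor w. The model, data, and w=0.5 are placeholders.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score

def fsv_estimate(model, X, y, w=0.5, test_size=0.2, folds=5, seed=42):
    """Hypothetical FSV-style blend: w * SRS hold-out score + (1 - w) * KFCV score."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed)
    srs = model.fit(X_tr, y_tr).score(X_te, y_te)          # single random split
    kfcv = cross_val_score(model, X, y, cv=folds).mean()   # K-fold estimate
    return w * srs + (1 - w) * kfcv

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=42)
print(fsv_estimate(Ridge(), X, y))
```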
[760] Is Exploration or Optimization the Problem for Deep Reinforcement Learning?
Glen Berseth
Main category: cs.LG
TL;DR: A sub-optimality estimator for deep RL reveals that policies exploit only about half of the good experience they generate.
Details
Motivation: Deep RL struggles with optimization difficulties despite improved exploration or rewards, leading to sub-optimal performance.
Method: Proposes a practical sub-optimality estimator to evaluate optimization limits in deep RL algorithms.
Result: Experiments show policies’ learned performance is 2-3× worse than the best generated experience.
Conclusion: Deep RL methods underutilize good experience, highlighting significant optimization gaps.
Abstract: In the era of deep reinforcement learning, making progress is more complex, as the collected experience must be compressed into a deep model for future exploitation and sampling. Many papers have shown that training a deep learning policy under the changing state and action distribution leads to sub-optimal performance, or even collapse. This naturally leads to the concern that, even if the community creates improved exploration algorithms or reward objectives, those improvements may fall on the \textit{deaf ears} of optimization difficulties. This work proposes a new \textit{practical} sub-optimality estimator to determine optimization limitations of deep reinforcement learning algorithms. Through experiments across environments and RL algorithms, it is shown that the best experience generated is 2-3$\times$ better than the policies’ learned performance. This large difference indicates that deep RL methods only exploit half of the good experience they generate.
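The estimator's core comparison can be pictured as follows; this is a simplification under the assumption that episode returns from collected experience and from policy evaluation are available, not the paper's exact estimator.

```python
import numpy as np

def suboptimality_gap(experience_returns, policy_returns):
    """Compare the best return found in collected experience against the
    learned policy's average evaluation return. The ratio is only meaningful
    when returns are positive."""
    best_experienced = float(np.max(experience_returns))
    learned = float(np.mean(policy_returns))
    return best_experienced - learned, best_experienced / max(learned, 1e-8)
```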
[761] Convergence Analysis of Aggregation-Broadcast in LoRA-enabled Federated Learning
Xin Chen, Shuaijun Chen, Omid Tavallaie, Nguyen Tran, Shuhuang Xiang, Albert Zomaya
Main category: cs.LG
TL;DR: The paper analyzes aggregation methods for Low-Rank Adaptation (LoRA) in Federated Learning (FL), providing a unified convergence analysis and proving the effectiveness of Sum-Product (SP) and Product-Sum (PS) methods.
Details
Motivation: The growing size of ML models in FL creates communication and computation challenges, and while LoRA reduces overhead, effective aggregation of local models remains understudied.
Method: The paper categorizes aggregation methods into SP and PS, defines the Aggregation-Broadcast Operator (ABO), and derives general convergence conditions.
Result: Theoretical analysis shows both SP and PS methods converge, but differ in achieving optimal rates. Experiments validate findings.
Conclusion: The study offers a principled understanding of LoRA-based FL aggregation, proving SP and PS methods’ convergence and their practical viability.
Abstract: Federated Learning (FL) enables collaborative model training across decentralized data sources while preserving data privacy. However, the growing size of Machine Learning (ML) models poses communication and computation challenges in FL. Low-Rank Adaptation (LoRA) has recently been introduced into FL as an efficient fine-tuning method, reducing communication overhead by updating only a small number of trainable parameters. Despite its effectiveness, how to aggregate LoRA-updated local models on the server remains a critical and understudied problem. In this paper, we provide a unified convergence analysis for LoRA-based FL. We first categorize the current aggregation methods into two major types: Sum-Product (SP) and Product-Sum (PS). Then we formally define the Aggregation-Broadcast Operator (ABO) and derive a general convergence condition under mild assumptions. Furthermore, we present several sufficient conditions that guarantee convergence of the global model. These theoretical analyses offer a principled understanding of various aggregation strategies. Notably, we prove that the SP and PS aggregation methods both satisfy our convergence condition, but differ in their ability to achieve the optimal convergence rate. Extensive experiments on standard benchmarks validate our theoretical findings.
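Under one natural reading of the two names, the aggregation families differ only in whether the low-rank factors are averaged before or after multiplication. A sketch, with shapes as assumptions:

```python
import torch

def aggregate_lora(As, Bs, mode="PS"):
    """Aggregate LoRA updates dW_i = B_i @ A_i from K clients.
    PS (Product-Sum): average of products  -> mean_i(B_i @ A_i)
    SP (Sum-Product): product of averages  -> mean_i(B_i) @ mean_i(A_i)
    As: list of (r, d_in) tensors; Bs: list of (d_out, r) tensors."""
    K = len(As)
    if mode == "PS":
        return sum(B @ A for A, B in zip(As, Bs)) / K
    return (sum(Bs) / K) @ (sum(As) / K)
```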
[762] Quenched large deviations for Monte Carlo integration with Coulomb gases
Rémi Bardenet, Mylène Maïda, Martin Rouault
Main category: cs.LG
TL;DR: The paper explores using Gibbs measures, like Coulomb gases, for randomized numerical integration by tuning interaction kernels and confining potentials to match a target distribution. It shows that random approximations of the potential preserve performance guarantees, even with cheap Monte Carlo preprocessing.
Details
Motivation: To improve numerical integration by leveraging repulsiveness in Gibbs measures, addressing the challenge of tuning interaction kernels and potentials to align with the target distribution.
Method: Uses large deviations theory to prove that random approximations of the potential maintain performance guarantees. For non-singular kernels, minimal assumptions are made; for Coulomb kernels, approximations must use another Gibbs measure.
Result: Demonstrates that the integration algorithm can outperform independent or Markov quadratures, with controlled uniform convergence for Coulomb potentials.
Conclusion: The approach is viable for efficient numerical integration, especially with careful approximation of potentials, even for complex kernels like Coulomb.
Abstract: Gibbs measures, such as Coulomb gases, are popular in modelling systems of interacting particles. Recently, we proposed to use Gibbs measures as randomized numerical integration algorithms with respect to a target measure $\pi$ on $\mathbb R^d$, following the heuristics that repulsiveness between particles should help reduce integration errors. A major issue in this approach is to tune the interaction kernel and confining potential of the Gibbs measure, so that the equilibrium measure of the system is the target distribution $\pi$. Doing so usually requires another Monte Carlo approximation of the \emph{potential}, i.e. the integral of the interaction kernel with respect to $\pi$. Using the methodology of large deviations from Garcia–Zelada (2019), we show that a random approximation of the potential preserves the fast large deviation principle that guarantees the proposed integration algorithm to outperform independent or Markov quadratures. For non-singular interaction kernels, we make minimal assumptions on this random approximation, which can be the result of a computationally cheap Monte Carlo preprocessing. For the Coulomb interaction kernel, we need the approximation to be based on another Gibbs measure, and we prove in passing a control on the uniform convergence of the approximation of the potential.
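For orientation, the objects involved have the following generic form (normalization constants and the paper's exact scalings are omitted, so treat this as a schematic rather than the authors' precise setup):

```latex
% Gibbs measure over N particles with interaction kernel k and confinement V:
\[
  \mathrm{d}\mathbb{P}_N(x_1,\dots,x_N) \;\propto\;
  \exp\!\Big(-\beta_N\Big[\textstyle\sum_{i<j} k(x_i,x_j)
  + N \sum_{i=1}^{N} V(x_i)\Big]\Big)\,
  \mathrm{d}x_1\cdots\mathrm{d}x_N .
\]
% Randomized quadrature against the target measure $\pi$:
\[
  \int f \,\mathrm{d}\pi \;\approx\; \frac{1}{N}\sum_{i=1}^{N} f(x_i),
  \qquad (x_1,\dots,x_N)\sim \mathbb{P}_N .
\]
% The potential that must be tuned (and is itself Monte Carlo approximated):
\[
  U^{k}_{\pi}(x) \;=\; \int k(x,y)\,\mathrm{d}\pi(y).
\]
```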
[763] Effects of Feature Correlations on Associative Memory Capacity
Stefan Bielmeier, Gerald Friedland
Main category: cs.LG
TL;DR: The paper explores how feature correlations impact the capacity of Dense Associative Memory (DAM), showing that while capacity scales exponentially with pattern separation, correlations slightly reduce it, especially for higher-order interactions.
Details
Motivation: Current capacity analyses of DAM ignore feature correlations, which are common in real-world data, limiting practical applicability.
Method: An empirical framework was developed to analyze data structure effects, using Hamming distance to vary feature correlation and pattern separation, and binary search to compute storage capacity.
Result: Memory capacity scales exponentially with pattern separation, but feature correlations slightly reduce it, more so for higher polynomial degrees in the energy function.
Conclusion: The study connects DAM theory to practical settings, highlighting limitations in modeling higher-order feature interactions and suggesting data-centric improvements.
Abstract: We investigate how feature correlations influence the capacity of Dense Associative Memory (DAM), a Transformer attention-like model. Practical machine learning scenarios involve feature-correlated data and learn representations in the input space, but current capacity analyses do not account for this. We develop an empirical framework to analyze the effects of data structure on capacity dynamics. Specifically, we systematically construct datasets that vary in feature correlation and pattern separation using Hamming distance from information theory, and compute the model’s corresponding storage capacity using a simple binary search algorithm. Our experiments confirm that memory capacity scales exponentially with increasing separation in the input space. Feature correlations do not alter this relationship fundamentally, but reduce capacity slightly at constant separation. This effect is amplified at higher polynomial degrees in the energy function, suggesting that Associative Memory is more limited in depicting higher-order interactions between features than patterns. Our findings bridge theoretical work and practical settings for DAM, and might inspire more data-centric methods.
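The capacity computation reduces to a monotone search. A sketch of the binary search, assuming a caller-supplied predicate that trains the memory on n patterns and checks recall:

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two binary patterns, used to control separation."""
    return int(np.sum(a != b))

def storage_capacity(can_store, max_patterns):
    """Largest n with can_store(n) True, assuming storability is monotone in n.
    can_store: callable n -> bool (train on n patterns, verify recall)."""
    lo, hi = 1, max_patterns
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias upward so the loop terminates
        if can_store(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```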
[764] CPformer – Concept and Physics enhanced Transformer for Time Series Forecasting
Hongwei Ma, Junbin Gao, Minh-Ngoc Tran
Main category: cs.LG
TL;DR: CPformer, a Concept- and Physics-enhanced Transformer, improves multivariate time-series forecasting by integrating self-supervised concepts and physics constraints, outperforming baselines in accuracy.
Details
Motivation: Addressing the challenge of accurate, explainable, and physically-credible forecasting for multivariate time-series with varying statistical properties across domains.
Method: CPformer combines self-supervised, domain-agnostic concepts with differentiable residuals from first-principle constraints, retaining attention for long contexts.
Result: Achieves the lowest error in 8 of 12 MSE/MAE cells across six datasets, reducing MSE by 23% on Electricity, 44% on Traffic, and 61% on Illness compared to FEDformer.
Conclusion: CPformer effectively integrates latent transparency with scientific guidance, outperforming prior Transformers in accuracy while maintaining explainability.
Abstract: Accurate, explainable and physically-credible forecasting remains a persistent challenge for multivariate time-series whose statistical properties vary across domains. We present CPformer, a Concept- and Physics-enhanced Transformer that channels every prediction through five self-supervised, domain-agnostic concepts while enforcing differentiable residuals drawn from first-principle constraints. Unlike prior efficiency-oriented Transformers that rely purely on sparsity or frequency priors, CPformer combines latent transparency with hard scientific guidance while retaining attention for long contexts. We tested CPformer on six publicly available datasets: sub-hourly Electricity and Traffic, hourly ETT, high-dimensional Weather, weekly Influenza-like Illness, and minute-level Exchange Rate; CPformer achieves the lowest error in eight of twelve MSE/MAE cells. Relative to the strongest Transformer baseline (FEDformer), CPformer reduces mean-squared-error by 23% on Electricity, 44% on Traffic and 61% on Illness, while matching performance on strictly periodic Weather and ETT series.
[765] Cryptocurrency Price Forecasting Using Machine Learning: Building Intelligent Financial Prediction Models
Md Zahidul Islam, Md Shafiqur Rahman, Md Sumsuzoha, Babul Sarker, Md Rafiqul Islam, Mahfuz Alam, Sanjib Kumar Shil
Main category: cs.LG
TL;DR: The study uses deep learning and machine learning to predict XRP/USDT prices, highlighting the importance of liquidity metrics (VVR and VWAP) for accuracy. LSTM outperformed other models.
Details
Motivation: Cryptocurrency price prediction often ignores liquidity, a key factor. This study aims to improve accuracy by incorporating liquidity metrics.
Method: Four models (Linear Regression, Random Forest, XGBoost, LSTM) were tested with and without liquidity proxies (VVR, VWAP).
Result: LSTM consistently performed best. Including liquidity metrics improved prediction accuracy.
Conclusion: Liquidity metrics are crucial for accurate cryptocurrency price prediction, benefiting traders and developers in the U.S. market.
Abstract: Cryptocurrency markets are experiencing rapid growth, but this expansion comes with significant challenges, particularly in predicting cryptocurrency prices for traders in the U.S. In this study, we explore how deep learning and machine learning models can be used to forecast the closing prices of the XRP/USDT trading pair. While many existing cryptocurrency prediction models focus solely on price and volume patterns, they often overlook market liquidity, a crucial factor in price predictability. To address this, we introduce two important liquidity proxy metrics: the Volume-To-Volatility Ratio (VVR) and the Volume-Weighted Average Price (VWAP). These metrics provide a clearer understanding of market stability and liquidity, ultimately enhancing the accuracy of our price predictions. We developed four machine learning models, Linear Regression, Random Forest, XGBoost, and LSTM neural networks, using historical data without incorporating the liquidity proxy metrics, and evaluated their performance. We then retrained the models, including the liquidity proxy metrics, and reassessed their performance. In both cases (with and without the liquidity proxies), the LSTM model consistently outperformed the others. These results underscore the importance of considering market liquidity when predicting cryptocurrency closing prices. Therefore, incorporating these liquidity metrics is essential for more accurate forecasting models. Our findings offer valuable insights for traders and developers seeking to create smarter and more risk-aware strategies in the U.S. digital assets market.
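The two liquidity proxies are straightforward to compute from OHLCV candles; the rolling window length and the exact volatility definition below are illustrative assumptions, not the paper's stated choices.

```python
import pandas as pd

def add_liquidity_proxies(df, window=24):
    """Add VWAP (volume-weighted average price) and VVR (volume-to-volatility
    ratio) columns to an OHLCV DataFrame with high/low/close/volume columns."""
    typical = (df["high"] + df["low"] + df["close"]) / 3
    pv = (typical * df["volume"]).rolling(window).sum()
    df["vwap"] = pv / df["volume"].rolling(window).sum()
    volatility = df["close"].pct_change().rolling(window).std()
    df["vvr"] = df["volume"].rolling(window).mean() / volatility
    return df
```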
[766] UniExtreme: A Universal Foundation Model for Extreme Weather Forecasting
Hang Ni, Weijia Zhang, Hao Liu
Main category: cs.LG
TL;DR: UniExtreme is a universal extreme weather forecasting model that addresses spectral disparity and hierarchical drivers of extreme events, outperforming existing methods.
Details
Motivation: Existing models fail to capture the diversity and complexity of extreme weather events, limiting their predictive accuracy.
Method: UniExtreme integrates Adaptive Frequency Modulation (AFM) for spectral differences and Event Prior Augmentation (EPA) for hierarchical diversity, using learnable filters and memory fusion.
Result: UniExtreme surpasses state-of-the-art baselines in forecasting both extreme and general weather conditions.
Conclusion: The model demonstrates superior adaptability and accuracy in diverse extreme weather scenarios.
Abstract: Recent advancements in deep learning have led to the development of Foundation Models (FMs) for weather forecasting, yet their ability to predict extreme weather events remains limited. Existing approaches either focus on general weather conditions or specialize in specific-type extremes, neglecting the real-world atmospheric patterns of diversified extreme events. In this work, we identify two key characteristics of extreme events: (1) the spectral disparity against normal weather regimes, and (2) the hierarchical drivers and geographic blending of diverse extremes. Along this line, we propose UniExtreme, a universal extreme weather forecasting foundation model that integrates (1) an Adaptive Frequency Modulation (AFM) module that captures region-wise spectral differences between normal and extreme weather, through learnable Beta-distribution filters and multi-granularity spectral aggregation, and (2) an Event Prior Augmentation (EPA) module which incorporates region-specific extreme event priors to resolve hierarchical extreme diversity and composite extreme schema, via a dual-level memory fusion network. Extensive experiments demonstrate that UniExtreme outperforms state-of-the-art baselines in both extreme and general weather forecasting, showcasing superior adaptability across diverse extreme scenarios.
[767] Regression Augmentation With Data-Driven Segmentation
Shayan Alahyari, Shiva Mehdipour Ghobadlou, Mike Domaratzki
Main category: cs.LG
TL;DR: A GAN-based augmentation framework using Mahalanobis-GMM to identify and enrich minority samples in imbalanced regression, outperforming existing methods.
Details
Motivation: Address the challenge of imbalanced regression where models struggle with underrepresented samples due to skewed target distributions.
Method: Proposes a data-driven GAN-based framework with Mahalanobis-GMM for minority sample identification and nearest-neighbor matching for augmentation.
Result: Outperforms state-of-the-art data augmentation methods on 32 benchmark datasets.
Conclusion: The framework effectively addresses imbalanced regression by dynamically identifying and enriching rare samples without preset thresholds.
Abstract: Imbalanced regression arises when the target distribution is skewed, causing models to focus on dense regions and struggle with underrepresented (minority) samples. Despite its relevance across many applications, few methods have been designed specifically for this challenge. Existing approaches often rely on fixed, ad hoc thresholds to label samples as rare or common, overlooking the continuous complexity of the joint feature-target space and failing to represent the true underlying rare regions. To address these limitations, we propose a fully data-driven GAN-based augmentation framework that uses Mahalanobis-Gaussian Mixture Modeling (GMM) to automatically identify minority samples and employs deterministic nearest-neighbour matching to enrich sparse regions. Rather than preset thresholds, our method lets the data determine which observations are truly rare. Evaluation on 32 benchmark imbalanced regression datasets demonstrates that our approach consistently outperforms state-of-the-art data augmentation methods.
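The minority-identification step can be sketched as follows: fit a GMM on the joint feature-target matrix, score each point by its Mahalanobis distance to its most responsible component, and flag the tail. The component count and tail quantile are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def minority_mask(Xy, n_components=3, quantile=0.9):
    """Flag rare samples in the joint feature-target space Xy = [X | y]."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          random_state=0).fit(Xy)
    comp = gmm.predict(Xy)                      # most responsible component per point
    dist = np.empty(len(Xy))
    for k in range(n_components):
        diff = Xy[comp == k] - gmm.means_[k]
        prec = np.linalg.inv(gmm.covariances_[k])
        dist[comp == k] = np.sqrt(np.einsum("ij,jk,ik->i", diff, prec, diff))
    return dist > np.quantile(dist, quantile)   # True = minority / rare sample
```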
[768] Fast and scalable retrosynthetic planning with a transformer neural network and speculative beam search
Mikhail Andronov, Natalia Andronova, Michael Wand, Jürgen Schmidhuber, Djork-Arné Clevert
Main category: cs.LG
TL;DR: The paper proposes a method to reduce latency in AI-based CASP systems for drug discovery by using speculative beam search and Medusa drafting, improving throughput under tight time constraints.
Details
Motivation: High latency in current CASP systems limits their use in high-throughput synthesizability screening for drug design.
Method: Uses speculative beam search and Medusa drafting to accelerate SMILES-to-SMILES transformers in multi-step synthesis planning.
Result: Achieves 26% to 86% more molecules solved under the same time constraints.
Conclusion: The method enhances CASP system performance, making it more suitable for high-throughput screening and improving user experience.
Abstract: AI-based computer-aided synthesis planning (CASP) systems are in demand as components of AI-driven drug discovery workflows. However, the high latency of such CASP systems limits their utility for high-throughput synthesizability screening in de novo drug design. We propose a method for accelerating multi-step synthesis planning systems that rely on SMILES-to-SMILES transformers as single-step retrosynthesis models. Our approach reduces the latency of SMILES-to-SMILES transformers powering multi-step synthesis planning in AiZynthFinder through speculative beam search combined with a scalable drafting strategy called Medusa. Replacing standard beam search with our approach allows the CASP system to solve 26% to 86% more molecules under the same time constraints of several seconds. Our method brings AI-based CASP systems closer to meeting the strict latency requirements of high-throughput synthesizability screening and improving general user experience.
[769] HT-Transformer: Event Sequences Classification by Accumulating Prefix Information with History Tokens
Ivan Karpukhin, Andrey Savchenko
Main category: cs.LG
TL;DR: Transformers underperform RNNs in sequence classification tasks due to lack of a compact state vector and poor local context capture. The paper introduces history tokens to improve transformer performance.
Details
Motivation: To address the performance gap between transformers and RNNs in sequence classification tasks by identifying limitations in transformers and proposing a solution.
Method: Introduces history tokens to accumulate historical information during pretraining, enhancing transformer models.
Result: Significant improvement in transformer performance across finance, e-commerce, and healthcare tasks.
Conclusion: History tokens effectively address transformer limitations, boosting their performance in sequence classification tasks.
Abstract: Deep learning has achieved remarkable success in modeling sequential data, including event sequences, temporal point processes, and irregular time series. Recently, transformers have largely replaced recurrent networks in these tasks. However, transformers often underperform RNNs in classification tasks where the objective is to predict future targets. The reason behind this performance gap remains largely unexplored. In this paper, we identify a key limitation of transformers: the absence of a single state vector that provides a compact and effective representation of the entire sequence. Additionally, we show that contrastive pretraining of embedding vectors fails to capture local context, which is crucial for accurate prediction. To address these challenges, we introduce history tokens, a novel concept that facilitates the accumulation of historical information during next-token prediction pretraining. Our approach significantly improves transformer-based models, achieving impressive results in finance, e-commerce, and healthcare tasks. The code is publicly available on GitHub.
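The history-token idea can be pictured as prepending a learnable token whose output position serves as the sequence's compact state. How the paper wires this into next-token pretraining is not shown; the wrapper below is an assumption-laden sketch around any (batch, seq, dim) encoder.

```python
import torch
import torch.nn as nn

class HistoryTokenWrapper(nn.Module):
    """Prepend a learnable [HIST] token so one position can accumulate
    sequence-level state, then read it back as a compact representation."""
    def __init__(self, encoder, d_model):
        super().__init__()
        self.encoder = encoder                      # any (B, T, D) -> (B, T, D) module
        self.hist = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)

    def forward(self, x):                           # x: (B, T, D) token embeddings
        h = self.hist.expand(x.size(0), -1, -1)
        out = self.encoder(torch.cat([h, x], dim=1))
        return out[:, 0], out[:, 1:]                # history state, per-token outputs
```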
[770] Hyperparameter-Free Neurochaos Learning Algorithm for Classification
Akhila Henry, Nithin Nagaraj
Main category: cs.LG
TL;DR: AutochaosNet is a hyperparameter-free, training-free variant of Neurochaos Learning (NL) that uses a universal chaotic sequence for feature extraction, achieving competitive or superior performance with reduced computational effort.
Details
Motivation: The existing NL framework requires hyperparameter tuning and computational effort for feature extraction, prompting the need for a simplified, efficient alternative.
Method: AutochaosNet leverages the Champernowne constant for a universal chaotic sequence and uses input stimuli to define firing time bounds. Simplified variants (TM AutochaosNet and TM-FR AutochaosNet) are compared to ChaosNet.
Result: AutochaosNet matches or outperforms ChaosNet in classification tasks while significantly reducing training time and eliminating hyperparameter tuning.
Conclusion: AutochaosNet is a scalable, efficient solution for real-world classification tasks, with future work aimed at exploring universal chaotic orbits for further enhancements.
Abstract: Neurochaos Learning (NL) is a brain-inspired classification framework that employs chaotic dynamics to extract features from input data and yields state of the art performance on classification tasks. However, NL requires the tuning of multiple hyperparameters and computing of four chaotic features per input sample. In this paper, we propose AutochaosNet - a novel, hyperparameter-free variant of the NL algorithm that eliminates the need for both training and parameter optimization. AutochaosNet leverages a universal chaotic sequence derived from the Champernowne constant and uses the input stimulus to define firing time bounds for feature extraction. Two simplified variants - TM AutochaosNet and TM-FR AutochaosNet - are evaluated against the existing NL architecture - ChaosNet. Our results demonstrate that AutochaosNet achieves competitive or superior classification performance while significantly reducing training time due to reduced computational effort. In addition to eliminating training and hyperparameter tuning, AutochaosNet exhibits excellent generalisation capabilities, making it a scalable and efficient choice for real-world classification tasks. Future work will focus on identifying universal orbits under various chaotic maps and incorporating them into the NL framework to further enhance performance.
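The universal sequence itself is easy to reproduce: the digits of the Champernowne constant come from concatenating the positive integers. How AutochaosNet maps input stimuli to firing-time bounds on this sequence is the paper's contribution and is not shown here.

```python
from itertools import count, islice

def champernowne_digits(n):
    """First n decimal digits of 0.123456789101112..., formed by
    concatenating the positive integers."""
    def digits():
        for i in count(1):
            yield from (int(ch) for ch in str(i))
    return list(islice(digits(), n))

print(champernowne_digits(15))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 0, 1, 1, 1, 2]
```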
[771] Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler
Aleksandr Dremov, Alexander Hägele, Atli Kosson, Martin Jaggi
Main category: cs.LG
TL;DR: The paper analyzes the cooldown phase in the WSD learning rate scheduler for transformer training, revealing a bias-variance trade-off and performance variations tied to cooldown shapes and AdamW hyperparameters. It highlights the importance of optimizing the cooldown phase.
Details
Motivation: The mechanisms behind the cooldown phase in transformer training, particularly its impact on performance, are not well understood. This study aims to fill that gap.
Method: The study focuses on the cooldown phase of the WSD scheduler, analyzing different cooldown shapes and AdamW hyperparameters, and visualizes the loss landscape.
Result: Different cooldown shapes reveal a bias-variance trade-off, with balanced shapes performing best. Higher $\beta_2$ values during cooldown improve performance. Loss landscape visualizations support the river valley perspective.
Conclusion: Optimizing the cooldown phase is crucial for transformer training, alongside traditional hyperparameter tuning, as it significantly impacts model performance.
Abstract: Learning rate scheduling is essential in transformer training, where the final annealing plays a crucial role in getting the best performance. However, the mechanisms behind this cooldown phase, with its characteristic drop in loss, remain poorly understood. To address this, we provide a comprehensive analysis focusing solely on the cooldown phase in the Warmup-Stable-Decay (WSD) learning rate scheduler. Our analysis reveals that different cooldown shapes reveal a fundamental bias-variance trade-off in the resulting models, with shapes that balance exploration and exploitation consistently outperforming alternatives. Similarly, we find substantial performance variations, comparable to those from cooldown shape selection, when tuning AdamW hyperparameters. Notably, we observe consistent improvements with higher values of $\beta_2$ during cooldown. From a loss landscape perspective, we provide visualizations of the landscape during cooldown, supporting the river valley loss perspective empirically. These findings offer practical recommendations for configuring the WSD scheduler in transformer training, emphasizing the importance of optimizing the cooldown phase alongside traditional hyperparameter tuning.
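A reference implementation of the WSD shape makes the analyzed knobs concrete. The phase-fraction defaults and the sqrt cooldown option are illustrative assumptions; the cooldown shape is exactly the quantity the paper studies.

```python
def wsd_lr(step, total_steps, base_lr, warmup_frac=0.05, cooldown_frac=0.2,
           shape="linear"):
    """Warmup-Stable-Decay: linear warmup, constant plateau, shaped cooldown."""
    warmup = int(warmup_frac * total_steps)
    cooldown = int(cooldown_frac * total_steps)
    stable_end = total_steps - cooldown
    if step < warmup:
        return base_lr * (step + 1) / warmup
    if step < stable_end:
        return base_lr
    frac = (step - stable_end) / cooldown        # 0 -> 1 across the cooldown
    if shape == "linear":
        return base_lr * (1.0 - frac)
    return base_lr * (1.0 - frac) ** 0.5         # a slower-dropping 'sqrt' shape
```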
[772] Instruction-based Time Series Editing
Jiaxing Qiu, Dongliang Guo, Brynne Sullivan, Teague R. Henry, Tom Hartvigsen
Main category: cs.LG
TL;DR: InstructTime introduces instruction-based time series editing using natural language, enabling flexible and customizable edits with controllable strength.
Details
Motivation: Existing diffusion-based editors lack flexibility in condition format and customizable control over editing strength, limiting their practicality.
Method: InstructTime embeds time series and natural language instructions into a shared multi-modal space, decodes embeddings for edits, and uses multi-resolution encoders for local and global edits.
Result: InstructTime achieves high-quality, controllable edits, generalizes to unseen instructions, and adapts to new conditions via few-shot learning.
Conclusion: InstructTime is a state-of-the-art, flexible, and accessible solution for time series editing.
Abstract: In time series editing, we aim to modify some properties of a given time series without altering others. For example, when analyzing a hospital patient’s blood pressure, we may add a sudden early drop and observe how it impacts their future while preserving other conditions. Existing diffusion-based editors rely on rigid, predefined attribute vectors as conditions and produce all-or-nothing edits through sampling. This attribute- and sampling-based approach limits flexibility in condition format and lacks customizable control over editing strength. To overcome these limitations, we introduce Instruction-based Time Series Editing, where users specify intended edits using natural language. This allows users to express a wider range of edits in a more accessible format. We then introduce InstructTime, the first instruction-based time series editor. InstructTime takes in time series and instructions, embeds them into a shared multi-modal representation space, then decodes their embeddings to generate edited time series. By learning a structured multi-modal representation space, we can easily interpolate between embeddings to achieve varying degrees of edit. To handle local and global edits together, we propose multi-resolution encoders. In our experiments, we use synthetic and real datasets and find that InstructTime is a state-of-the-art time series editor: InstructTime achieves high-quality edits with controllable strength, can generalize to unseen instructions, and can be easily adapted to unseen conditions through few-shot learning.
[773] ESM: A Framework for Building Effective Surrogate Models for Hardware-Aware Neural Architecture Search
Azaz-Ur-Rehman Nasir, Samroz Ahmad Shoaib, Muhammad Abdullah Hanif, Muhammad Shafique
Main category: cs.LG
TL;DR: The paper explores hardware-aware latency prediction models for Neural Architecture Search (NAS), analyzing surrogate models and their accuracy factors, and proposes a framework for efficient model generation on GPUs.
Details
Motivation: To design efficient Deep Neural Networks (DNNs) for resource-constrained devices by improving hardware-aware NAS through accurate latency prediction.
Method: Study surrogate models, analyze factors affecting prediction accuracy, and develop a framework for dataset and model generation.
Result: Identified strengths/weaknesses of surrogate models and factors influencing accuracy, leading to a proposed holistic framework.
Conclusion: The framework enhances hardware-aware NAS by optimizing latency prediction for GPU-powered devices, balancing cost and efficiency.
Abstract: Hardware-aware Neural Architecture Search (NAS) is one of the most promising techniques for designing efficient Deep Neural Networks (DNNs) for resource-constrained devices. Surrogate models play a crucial role in hardware-aware NAS as they enable efficient prediction of performance characteristics (e.g., inference latency and energy consumption) of different candidate models on the target hardware device. In this paper, we focus on building hardware-aware latency prediction models. We study different types of surrogate models and highlight their strengths and weaknesses. We perform a systematic analysis to understand the impact of different factors that can influence the prediction accuracy of these models, aiming to assess the importance of each stage involved in the model designing process and identify methods and policies necessary for designing/training an effective estimation model, specifically for GPU-powered devices. Based on the insights gained from the analysis, we present a holistic framework that enables reliable dataset generation and efficient model generation, considering the overall costs of different stages of the model generation pipeline.
[774] FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models
Zishan Shao, Yixiao Wang, Qinsi Wang, Ting Jiang, Zhixu Du, Hancheng Ye, Danyang Zhuo, Yiran Chen, Hai Li
Main category: cs.LG
TL;DR: FlashSVD is a new framework for SVD-compressed LLMs, reducing activation memory overhead by 70.2% without accuracy loss.
Details
Motivation: Current SVD compression methods overlook activation memory overhead, limiting real-world deployment.
Method: FlashSVD integrates rank-aware streaming inference, fusing low-rank kernels into attention and FFN pipelines.
Result: Reduces peak activation memory by 70.2% and transient memory by 75% with no accuracy loss.
Conclusion: FlashSVD enables practical deployment of low-rank LLMs in memory-constrained settings.
Abstract: Singular Value Decomposition (SVD) has recently seen a surge of interest as a simple yet powerful tool for large language model (LLM) compression, with a growing number of works demonstrating 20-80% parameter reductions at minimal accuracy loss. Previous SVD-based approaches have focused primarily on reducing the memory footprint of model weights, largely overlooking the additional activation memory overhead incurred during inference when applying truncated factors via standard dense CUDA kernels. Our experiments demonstrate that this activation overhead, scaling with sequence length and hidden dimension, prevents current SVD compression techniques from achieving any reduction in peak inference memory, thereby limiting their viability for real-world, on-device deployments. We introduce FlashSVD, a novel, end-to-end rank-aware streaming inference framework specifically designed for SVD-compressed large language models. FlashSVD can be seamlessly integrated with any model that employs SVD-based methods for parameter reduction. By fusing low-rank projection kernels directly into both the self-attention and feed-forward network (FFN) pipelines, FlashSVD avoids materializing full-size activation buffers. Instead, small tiles of the truncated factors are loaded into on-chip SRAM, multiplied and reduced on the fly, and immediately evicted, preserving high GPU occupancy and adding no extra latency. On standard encoder benchmarks (e.g., BERT-Base), FlashSVD cuts peak activation memory by up to 70.2% and intermediate transient memory by 75%, all while incurring no accuracy loss with upstream compression methods, offering a practical path toward memory-constrained deployment of low-rank LLMs.
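The memory argument can be made concrete with the order of operations: applying the truncated factors sequentially and in tiles keeps only a small (tile, r) transient alive. The sketch below shows the streaming idea only, not the fused SRAM-resident kernels that FlashSVD actually implements.

```python
import torch

def lowrank_linear_streamed(x, U, V, tile=1024):
    """y = x @ (U @ V).T without materializing U @ V or a full-length buffer.
    x: (n, d_in); V: (r, d_in); U: (d_out, r)."""
    outs = []
    for i in range(0, x.shape[0], tile):
        xt = x[i:i + tile]          # (tile, d_in)
        zt = xt @ V.T               # (tile, r): the only transient buffer
        outs.append(zt @ U.T)       # (tile, d_out)
    return torch.cat(outs, dim=0)
```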
[775] Frequency-Constrained Learning for Long-Term Forecasting
Menglin Kong, Vincent Zhihao Zheng, Lijun Sun
Main category: cs.LG
TL;DR: A method to improve long-term forecasting by explicitly modeling periodicity using spectral initialization and frequency-constrained optimization, enhancing deep temporal models.
Details
Motivation: Modern deep forecasting models often fail to capture recurring periodic patterns in time series due to spectral bias and lack of frequency-aware inductive priors.
Method: Extracts dominant low-frequency components via FFT-guided coordinate descent, initializes sinusoidal embeddings, and uses a two-speed learning schedule to preserve frequency structure.
Result: Consistent performance gains in real-world benchmarks, especially for long horizons, and accurate recovery of ground-truth frequencies in synthetic data.
Conclusion: Injecting spectral priors into deep temporal models improves robustness and interpretability in long-range forecasting.
Abstract: Many real-world time series exhibit strong periodic structures arising from physical laws, human routines, or seasonal cycles. However, modern deep forecasting models often fail to capture these recurring patterns due to spectral bias and a lack of frequency-aware inductive priors. Motivated by this gap, we propose a simple yet effective method that enhances long-term forecasting by explicitly modeling periodicity through spectral initialization and frequency-constrained optimization. Specifically, we extract dominant low-frequency components via Fast Fourier Transform (FFT)-guided coordinate descent, initialize sinusoidal embeddings with these components, and employ a two-speed learning schedule to preserve meaningful frequency structure during training. Our approach is model-agnostic and integrates seamlessly into existing Transformer-based architectures. Extensive experiments across diverse real-world benchmarks demonstrate consistent performance gains, particularly at long horizons, highlighting the benefits of injecting spectral priors into deep temporal models for robust and interpretable long-range forecasting. Moreover, on synthetic data, our method accurately recovers ground-truth frequencies, further validating its interpretability and effectiveness in capturing latent periodic patterns.
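The spectral initialization can be sketched in a few lines: take the strongest rFFT bins of the detrended series and build sin/cos features at those frequencies. Top-k magnitude selection and mean-detrending here are illustrative simplifications of the FFT-guided coordinate descent.

```python
import numpy as np

def dominant_frequencies(y, k=4):
    """Frequencies (cycles/step) of the k largest-magnitude rFFT bins."""
    spec = np.fft.rfft(y - y.mean())
    freqs = np.fft.rfftfreq(len(y))
    top = np.argsort(np.abs(spec))[::-1][:k]
    return freqs[top]

def sinusoidal_features(t, freqs):
    """Sin/cos columns at the given frequencies: a candidate embedding init."""
    cols = [f(2 * np.pi * fr * t) for fr in freqs for f in (np.sin, np.cos)]
    return np.stack(cols, axis=1)                 # shape (len(t), 2 * len(freqs))
```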
[776] A Reward-Directed Diffusion Framework for Generative Design Optimization
Hadi Keramati, Patrick Kirchen, Mohammed Hannan, Rajeev K. Jaiman
Main category: cs.LG
TL;DR: A generative optimization framework using fine-tuned diffusion models and reward-directed sampling enhances engineering designs, outperforming training data in ship hull and airfoil tasks.
Details
Motivation: To address the challenge of generating high-performance engineering designs when performance metrics rely on costly or non-differentiable simulations.
Method: The framework uses a parametric design representation and iterative soft-value guidance within a Markov decision process for reward-directed sampling.
Result: Achieves 25% resistance reduction in 3D ship hulls and 10% lift-to-drag ratio improvement in 2D airfoils, surpassing training data.
Conclusion: The framework boosts designer productivity and design performance, demonstrating potential for integration into engineering workflows.
Abstract: This study presents a generative optimization framework that builds on a fine-tuned diffusion model and reward-directed sampling to generate high-performance engineering designs. The framework adopts a parametric representation of the design geometry and produces new parameter sets corresponding to designs with enhanced performance metrics. A key advantage of the reward-directed approach is its suitability for scenarios in which performance metrics rely on costly engineering simulations, or in which surrogate models (e.g., graph-based, ensemble, or tree-based models) are non-differentiable or prohibitively expensive to differentiate. This work introduces the iterative use of a soft value function within a Markov decision process framework to achieve reward-guided decoding in the diffusion model. By incorporating soft-value guidance during both the training and inference phases, the proposed approach reduces computational and memory costs to achieve high-reward designs, even beyond the training data. Empirical results indicate that this iterative reward-directed method substantially improves the ability of the diffusion models to generate samples with reduced resistance in 3D ship hull design and enhanced hydrodynamic performance in 2D airfoil design tasks. The proposed framework generates samples that extend beyond the training data distribution, resulting in a greater than 25 percent reduction in resistance for ship design and an over 10 percent improvement in the lift-to-drag ratio for the 2D airfoil design. Successful integration of this model into the engineering design life cycle can enhance both designer productivity and overall design performance.
[777] Canoe Paddling Quality Assessment Using Smart Devices: Preliminary Machine Learning Study
S. Parab, A. Lamelas, A. Hassan, P. Bhote
Main category: cs.LG
TL;DR: An AI-based coaching system using ML and LLMs was developed to provide paddling stroke feedback via consumer devices, achieving high accuracy with Extremely Randomized Trees.
Details
Motivation: Paddlesports lack ML integration and face high coaching costs. This study aims to provide an affordable, accessible alternative using consumer devices.
Method: Motion data from Apple Watches and smartphones were collected, segmented, and processed. ML models (SVC, Random Forest, Gradient Boosting, Extremely Randomized Trees) were trained on raw and engineered features. A web interface delivered feedback.
Result: The Extremely Randomized Tree model achieved an F-score of 0.9496. The system successfully provided quantitative and qualitative feedback, with wrist sensor placement improving data quality.
Conclusion: Consumer devices and ML can feasibly support paddling technique improvement, though sample size limitations exist.
Abstract: Over 22 million Americans participate in paddling-related activities annually, contributing to a global paddlesports market valued at 2.4 billion US dollars in 2020. Despite its popularity, the sport has seen limited integration of machine learning (ML) and remains hindered by the cost of coaching and specialized equipment. This study presents a novel AI-based coaching system that uses ML models trained on motion data and delivers stroke feedback via a large language model (LLM). Participants were recruited through a collaboration with the NYU Concrete Canoe Team. Motion data were collected across two sessions, one with suboptimal form and one with corrected technique, using Apple Watches and smartphones secured in sport straps. The data underwent stroke segmentation and feature extraction. ML models, including Support Vector Classifier, Random Forest, Gradient Boosting, and Extremely Randomized Trees, were trained on both raw and engineered features. A web-based interface was developed to visualize stroke quality and deliver LLM-based feedback. Across four participants, eight trials yielded 66 stroke samples. The Extremely Randomized Tree model achieved the highest performance with an F-score of 0.9496 under five-fold cross-validation. The web interface successfully provided both quantitative metrics and qualitative feedback. Sensor placement near the wrists improved data quality. Preliminary results indicate that smartwatches and smartphones can enable low-cost, accessible alternatives to traditional paddling instruction. While limited by sample size, the study demonstrates the feasibility of using consumer devices and ML to support stroke refinement and technique improvement.
[778] SimDeep: Federated 3D Indoor Localization via Similarity-Aware Aggregation
Ahmed Jaheen, Sarah Elsamanody, Hamada Rizk, Moustafa Youssef
Main category: cs.LG
TL;DR: SimDeep is a Federated Learning framework designed to address non-IID data and device heterogeneity in indoor localization, achieving 92.89% accuracy.
Details
Motivation: Challenges in deploying indoor localization due to non-IID data and device heterogeneity.
Method: SimDeep uses a Similarity Aggregation Strategy in Federated Learning to handle non-IID data.
Result: Achieves 92.89% accuracy, outperforming traditional federated and centralized methods.
Conclusion: SimDeep is viable for real-world indoor localization deployment.
Abstract: Indoor localization plays a pivotal role in supporting a wide array of location-based services, including navigation, security, and context-aware computing within intricate indoor environments. Despite considerable advancements, deploying indoor localization systems in real-world scenarios remains challenging, largely because of non-independent and identically distributed (non-IID) data and device heterogeneity. In response, we propose SimDeep, a novel Federated Learning (FL) framework explicitly crafted to overcome these obstacles and effectively manage device heterogeneity. SimDeep incorporates a Similarity Aggregation Strategy, which aggregates client model updates based on data similarity, significantly alleviating the issues posed by non-IID data. Our experimental evaluations indicate that SimDeep achieves an impressive accuracy of 92.89%, surpassing traditional federated and centralized techniques, thus underscoring its viability for real-world deployment.
[779] The Vanishing Gradient Problem for Stiff Neural Differential Equations
Colby Fronk, Linda Petzold
Main category: cs.LG
TL;DR: The paper demonstrates that vanishing gradients in stiff neural differential equations are a universal issue for A-stable and L-stable numerical schemes, limiting optimization and parameter identification.
Details
Motivation: To understand why parameter sensitivities vanish in stiff systems during training, hindering optimization.
Method: Analyzes the rational stability function of stiff integration schemes, providing explicit formulas and proving decay rates.
Result: Shows that all A-stable methods suppress parameter gradients in stiff regimes, with a slowest decay rate of $O(|z|^{-1})$.
Conclusion: Reveals a fundamental limitation in training stiff neural ODEs due to unavoidable gradient suppression in A-stable methods.
Abstract: Gradient-based optimization of neural differential equations and other parameterized dynamical systems fundamentally relies on the ability to differentiate numerical solutions with respect to model parameters. In stiff systems, it has been observed that sensitivities to parameters controlling fast-decaying modes become vanishingly small during training, leading to optimization difficulties. In this paper, we show that this vanishing gradient phenomenon is not an artifact of any particular method, but a universal feature of all A-stable and L-stable stiff numerical integration schemes. We analyze the rational stability function for general stiff integration schemes and demonstrate that the relevant parameter sensitivities, governed by the derivative of the stability function, decay to zero for large stiffness. Explicit formulas for common stiff integration schemes are provided, which illustrate the mechanism in detail. Finally, we rigorously prove that the slowest possible rate of decay for the derivative of the stability function is $O(|z|^{-1})$, revealing a fundamental limitation: all A-stable time-stepping methods inevitably suppress parameter gradients in stiff regimes, posing a significant barrier for training and parameter identification in stiff neural ODEs.
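To see the mechanism on a standard example (ours, not spelled out above): backward Euler, which is both A-stable and L-stable, has stability function $R(z) = \frac{1}{1-z}$ and derivative $R'(z) = \frac{1}{(1-z)^2}$. For a stiff mode $z = \lambda h$ with $\operatorname{Re}(\lambda) \ll 0$, this derivative decays like $|z|^{-2}$, even faster than the $O(|z|^{-1})$ floor proven in the paper, so the sensitivities of parameters governing that mode are driven toward zero regardless of the learning rate.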
[780] Prototype Learning to Create Refined Interpretable Digital Phenotypes from ECGs
Sahil Sethi, David Chen, Michael C. Burkhart, Nipun Bhandari, Bashar Ramadan, Brett Beaulieu-Jones
Main category: cs.LG
TL;DR: Prototype-based neural networks trained for ECG classification capture clinically meaningful patterns, showing strong associations with hospital discharge diagnoses and predictive performance across diverse conditions.
Details
Motivation: To determine whether prototype-based models, trained for classification, can align with broader clinical phenotypes beyond their original training objectives.
Method: A prototype-based deep learning model was trained for multi-label ECG classification on the PTB-XL dataset and tested on the MIMIC-IV clinical database without modification. Associations with hospital discharge diagnoses (phecodes) were assessed.
Result: Prototypes showed stronger, more specific associations with clinical outcomes than class predictions or NLP-extracted concepts. They achieved high predictive performance (AUCs 0.89-0.91) for cardiac and non-cardiac conditions.
Conclusion: Prototype-based models provide interpretable, transferable intermediate phenotypes that capture clinically meaningful physiologic signatures, supporting digital phenotyping from time-series data.
Abstract: Prototype-based neural networks offer interpretable predictions by comparing inputs to learned, representative signal patterns anchored in training data. While such models have shown promise in the classification of physiological data, it remains unclear whether their prototypes capture an underlying structure that aligns with broader clinical phenotypes. We use a prototype-based deep learning model trained for multi-label ECG classification on the PTB-XL dataset; then, without modification, we perform inference on the MIMIC-IV clinical database. We assess whether individual prototypes, trained solely for classification, are associated with hospital discharge diagnoses in the form of phecodes in this external population. Individual prototypes demonstrate significantly stronger and more specific associations with clinical outcomes compared to the classifier’s class predictions, NLP-extracted concepts, or broader prototype classes across all phecode categories. Prototype classes with mixed significance patterns exhibit significantly greater intra-class distances (p $<$ 0.0001), indicating the model learned to differentiate clinically meaningful variations within diagnostic categories. The prototypes achieve strong predictive performance across diverse conditions, with AUCs ranging from 0.89 for atrial fibrillation to 0.91 for heart failure, while also showing substantial signal for non-cardiac conditions such as sepsis and renal disease. These findings suggest that prototype-based models can support interpretable digital phenotyping from physiologic time-series data, providing transferable intermediate phenotypes that capture clinically meaningful physiologic signatures beyond their original training objectives.
[781] Unsupervised Learning for the Elementary Shortest Path Problem
Jingyi Chen, Xinyuan Zhang, Xinwu Qian
Main category: cs.LG
TL;DR: A probabilistic method using a graph neural network solves the NP-hard Elementary Shortest-Path Problem (ESPP) with near-optimal solutions by learning node values and edge probabilities, outperforming baselines and heuristics.
Details
Motivation: The ESPP is NP-hard due to negative-cost cycles, motivating the need for efficient near-optimal solutions.
Method: An unsupervised graph neural network learns node values and edge-selection probabilities via a surrogate loss, reducing negative-cost cycles and aligning with algorithmic goals.
Result: The method outperforms baselines and heuristics on graphs up to 100 nodes, showing strong generalization to unseen graphs.
Conclusion: The proposed approach effectively addresses ESPP with high performance and generalization capabilities.
Abstract: The Elementary Shortest-Path Problem (ESPP) seeks a minimum-cost path from $s$ to $t$ that visits each vertex at most once. The presence of negative-cost cycles renders the problem NP-hard. We present a probabilistic method for finding near-optimal ESPP solutions, enabled by an unsupervised graph neural network that jointly learns node value estimates and edge-selection probabilities via a surrogate loss function. The loss provides a high probability certificate of finding near-optimal ESPP solutions by simultaneously reducing negative-cost cycles and embedding the desired algorithmic alignment. At inference time, a decoding algorithm transforms the learned edge probabilities into an elementary path. Experiments on graphs of up to 100 nodes show that the proposed method surpasses both unsupervised baselines and classical heuristics, while exhibiting high performance in cross-size and cross-topology generalization on unseen synthetic graphs.
[782] KANMixer: Can KAN Serve as a New Modeling Core for Long-term Time Series Forecasting?
Lingyu Jiang, Yuping Wang, Yao Su, Shuo Xing, Wenjing Chen, Xin Zhang, Zhengzhong Tu, Ziming Zhang, Fangzhou Lin, Michael Zielewski, Kazunori D Yamada
Main category: cs.LG
TL;DR: The paper introduces KANMixer, a novel architecture using Kolmogorov-Arnold Networks (KAN) for long-term time series forecasting, outperforming traditional MLP-based methods in benchmarks.
Details
Motivation: Existing MLP-based methods for LTSF lack hierarchical locality and sequential inductive biases, leading to diminishing performance improvements. KAN's adaptive basis functions offer potential for better modeling.
Method: The authors propose KANMixer, integrating KAN’s adaptive capabilities into a multi-scale mixing backbone for LTSF.
Result: KANMixer achieves state-of-the-art performance in 16 out of 28 experiments across seven datasets, demonstrating KAN’s superior adaptability.
Conclusion: KAN’s adaptive flexibility transforms forecasting performance, with critical design factors identified for effective use in LTSF, providing the first empirical guidelines for KAN in this domain.
Abstract: In recent years, multilayer perceptrons (MLP)-based deep learning models have demonstrated remarkable success in long-term time series forecasting (LTSF). Existing approaches typically augment MLP backbones with hand-crafted external modules to address the inherent limitations of their flat architectures. Despite their success, these augmented methods neglect hierarchical locality and sequential inductive biases essential for time-series modeling, and recent studies indicate diminishing performance improvements. To overcome these limitations, we explore Kolmogorov-Arnold Networks (KAN), a recently proposed model featuring adaptive basis functions capable of granular, local modulation of nonlinearities. This raises a fundamental question: Can KAN serve as a new modeling core for LTSF? To answer this, we introduce KANMixer, a concise architecture integrating a multi-scale mixing backbone that fully leverages KAN’s adaptive capabilities. Extensive evaluation demonstrates that KANMixer achieves state-of-the-art performance in 16 out of 28 experiments across seven benchmark datasets. To uncover the reasons behind this strong performance, we systematically analyze the strengths and limitations of KANMixer in comparison with traditional MLP architectures. Our findings reveal that the adaptive flexibility of KAN’s learnable basis functions significantly transforms the influence of network structural prior on forecasting performance. Furthermore, we identify critical design factors affecting forecasting accuracy and offer practical insights for effectively utilizing KAN in LTSF. Together, these insights constitute the first empirically grounded guidelines for effectively leveraging KAN in LTSF. Code is available in the supplementary file.
[783] Dynamic Clustering for Personalized Federated Learning on Heterogeneous Edge Devices
Heting Liu, Junzhe Huang, Fang He, Guohong Cao
Main category: cs.LG
TL;DR: A dynamic clustering algorithm (DC-PFL) is proposed for personalized federated learning to address data heterogeneity by grouping clients based on model discrepancy and optimizing communication costs.
Details
Motivation: Data heterogeneity in federated learning reduces model performance, necessitating a method to personalize models without exposing raw data.
Method: DC-PFL uses model discrepancy to estimate data heterogeneity, clusters clients dynamically, and employs layer-wise aggregation to reduce communication costs.
Result: Experiments show DC-PFL reduces training time and improves accuracy compared to baselines.
Conclusion: DC-PFL effectively addresses data heterogeneity and optimizes federated learning performance.
Abstract: Federated Learning (FL) enables edge devices to collaboratively learn a global model, but it may not perform well when clients have high data heterogeneity. In this paper, we propose a dynamic clustering algorithm for personalized federated learning system (DC-PFL) to address the problem of data heterogeneity. DC-PFL starts with all clients training a global model and gradually groups the clients into smaller clusters for model personalization based on their data similarities. To address the challenge of estimating data heterogeneity without exposing raw data, we introduce a discrepancy metric called model discrepancy, which approximates data heterogeneity solely based on the model weights received by the server. We demonstrate that model discrepancy is strongly and positively correlated with data heterogeneity and can serve as a reliable indicator of data heterogeneity. To determine when and how to change grouping structures, we propose an algorithm based on the rapid decrease period of the training loss curve. Moreover, we propose a layer-wise aggregation mechanism that aggregates the low-discrepancy layers at a lower frequency to reduce the amount of transmitted data and communication costs. We conduct extensive experiments on various datasets to evaluate our proposed algorithm, and our results show that DC-PFL significantly reduces total training time and improves model accuracy compared to baselines.
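A minimal sketch of the server-side idea, assuming model discrepancy is computed as pairwise L2 distance between flattened client weights (the paper's exact metric and grouping schedule may differ):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def model_discrepancy(weights):
    # pairwise L2 distance between client weight vectors; no raw data needed
    n = len(weights)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = np.linalg.norm(weights[i] - weights[j])
    return d

def cluster_clients(weights, threshold):
    # hierarchical clustering on the condensed upper triangle of the matrix
    d = model_discrepancy(weights)
    condensed = d[np.triu_indices(len(weights), k=1)]
    return fcluster(linkage(condensed, method="average"), t=threshold, criterion="distance")

clients = [np.random.randn(1000) for _ in range(8)]  # stand-ins for model weights
labels = cluster_clients(clients, threshold=45.0)    # cluster id per client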
[784] Diffusion Models for Future Networks and Communications: A Comprehensive Survey
Nguyen Cong Luong, Nguyen Duc Hai, Duc Van Le, Huy T. Nguyen, Thai-Hoc Vu, Thien Huynh-The, Ruichen Zhang, Nguyen Duc Duy Anh, Dusit Niyato, Marco Di Renzo, Dong In Kim, Quoc-Viet Pham
Main category: cs.LG
TL;DR: A survey on Diffusion Models (DMs) in wireless communications, covering theory, applications, and future directions.
Details
Motivation: To explore the potential of DMs in enhancing wireless communication systems due to their noise-robust performance and handling of complex data.
Method: Provides a tutorial on DMs, reviews their applications in optimizers, reinforcement learning, and emerging issues like channel modeling and resource management.
Result: Highlights DMs’ effectiveness in various wireless communication tasks but notes technical limitations.
Conclusion: DMs show promise for future networks, but further research is needed to address limitations and expand applications.
Abstract: The rise of Generative AI (GenAI) in recent years has catalyzed transformative advances in wireless communications and networks. Among the members of the GenAI family, Diffusion Models (DMs) have risen to prominence as a powerful option, capable of handling complex, high-dimensional data distribution, as well as consistent, noise-robust performance. In this survey, we aim to provide a comprehensive overview of the theoretical foundations and practical applications of DMs across future communication systems. We first provide an extensive tutorial of DMs and demonstrate how they can be applied to enhance optimizers, reinforcement learning and incentive mechanisms, which are popular approaches for problems in wireless networks. Then, we review and discuss the DM-based methods proposed for emerging issues in future networks and communications, including channel modeling and estimation, signal detection and data reconstruction, integrated sensing and communication, resource management in edge computing networks, semantic communications and other notable issues. We conclude the survey with highlighting technical limitations of DMs and their applications, as well as discussing future research directions.
[785] Censored Sampling for Topology Design: Guiding Diffusion with Human Preferences
Euihyun Kim, Keun Park, Yeoneung Kim
Main category: cs.LG
TL;DR: A human-in-the-loop diffusion framework improves topology optimization by integrating human feedback to detect and correct design flaws, ensuring physical plausibility and manufacturability.
Details
Motivation: Existing diffusion models for topology optimization rely on surrogate predictors, which may miss critical design flaws obvious to humans. This work aims to bridge the gap between automated generation and expert judgment.
Method: The proposed framework uses a lightweight reward model trained on binary human feedback to detect flaws like floating components. This reward model guides the reverse diffusion process of a pre-trained generator without retraining.
Result: Preliminary results show reduced failure modes and improved design realism, validating the effectiveness of human-aligned rewards in generative design.
Conclusion: The modular approach enhances trust in generative design by combining automated generation with expert judgment, offering scalable and physically plausible solutions.
Abstract: Recent advances in denoising diffusion models have enabled rapid generation of optimized structures for topology optimization. However, these models often rely on surrogate predictors to enforce physical constraints, which may fail to capture subtle yet critical design flaws such as floating components or boundary discontinuities that are obvious to human experts. In this work, we propose a novel human-in-the-loop diffusion framework that steers the generative process using a lightweight reward model trained on minimal human feedback. Inspired by preference alignment techniques in generative modeling, our method learns to suppress unrealistic outputs by modulating the reverse diffusion trajectory using gradients of human-aligned rewards. Specifically, we collect binary human evaluations of generated topologies and train classifiers to detect floating material and boundary violations. These reward models are then integrated into the sampling loop of a pre-trained diffusion generator, guiding it to produce designs that are not only structurally performant but also physically plausible and manufacturable. Our approach is modular and requires no retraining of the diffusion model. Preliminary results show substantial reductions in failure modes and improved design realism across diverse test conditions. This work bridges the gap between automated design generation and expert judgment, offering a scalable solution to trustworthy generative design.
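A schematic of the guided sampling loop, in the spirit of classifier guidance; the reward model interface, the denoiser interface, and the guidance scale are assumptions for illustration, not the authors' exact update:

import torch

def guided_reverse_step(x_t, t, denoiser, reward_model, scale=1.0):
    # gradient of the human-aligned reward with respect to the current sample
    with torch.enable_grad():
        x = x_t.detach().requires_grad_(True)
        grad = torch.autograd.grad(reward_model(x, t).sum(), x)[0]
    # pre-trained generator's reverse step, shifted toward higher reward;
    # the diffusion model itself is never retrained
    return denoiser(x_t, t) + scale * grad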
[786] Why Heuristic Weighting Works: A Theoretical Analysis of Denoising Score Matching
Juyan Zhang, Rhys Newbury, Xinyang Zhang, Tin Tran, Dana Kulic, Michael Burke
Main category: cs.LG
TL;DR: The paper justifies the heuristic weighting function in denoising score matching by showing heteroskedasticity is inherent, derives optimal weighting functions, and finds the heuristic can outperform optimal weighting in practice.
Details
Motivation: To provide a principled justification for the heuristic weighting function used in denoising score matching, which lacked formal grounding.
Method: Demonstrates heteroskedasticity in the denoising score matching objective, derives optimal weighting functions for arbitrary-order losses, and compares heuristic and optimal weightings theoretically and empirically.
Result: The heuristic weighting, a first-order Taylor approximation, can achieve lower variance in parameter gradients than the optimal weighting, aiding stable training.
Conclusion: The heuristic weighting is not only justified but can be practically superior, offering insights for improving diffusion model training.
Abstract: Score matching enables the estimation of the gradient of a data distribution, a key component in denoising diffusion models used to recover clean data from corrupted inputs. In prior work, a heuristic weighting function has been used for the denoising score matching loss without formal justification. In this work, we demonstrate that heteroskedasticity is an inherent property of the denoising score matching objective. This insight leads to a principled derivation of optimal weighting functions for generalized, arbitrary-order denoising score matching losses, without requiring assumptions about the noise distribution. Among these, the first-order formulation is especially relevant to diffusion models. We show that the widely used heuristical weighting function arises as a first-order Taylor approximation to the trace of the expected optimal weighting. We further provide theoretical and empirical comparisons, revealing that the heuristical weighting, despite its simplicity, can achieve lower variance than the optimal weighting with respect to parameter gradients, which can facilitate more stable and efficient training.
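For context, the heuristic in question is the familiar $\sigma^2$-weighting of the denoising score matching loss (standard diffusion-literature notation, not the paper's generalized form): $\mathcal{L}(\theta) = \mathbb{E}_{\sigma, x, \epsilon}\big[\lambda(\sigma)\, \| s_\theta(x + \sigma\epsilon, \sigma) + \epsilon/\sigma \|^2\big]$ with $\lambda(\sigma) = \sigma^2$, which keeps the per-noise-level loss on a comparable scale across $\sigma$. The paper shows this choice arises as a first-order Taylor approximation to the trace of the expected optimal weighting.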
[787] Drift-aware Collaborative Assistance Mixture of Experts for Heterogeneous Multistream Learning
En Yu, Jie Lu, Kun Wang, Xiaoyu Yang, Guangquan Zhang
Main category: cs.LG
TL;DR: CAMEL is a dynamic framework for learning from heterogeneous data streams, using specialized experts and collaborative assistance to handle concept drifts and improve generalizability.
Details
Motivation: Existing methods assume homogeneous streams and static architectures, limiting adaptability in dynamic environments. CAMEL addresses this by handling heterogeneity and concept drifts.
Method: CAMEL assigns each stream an independent system with dedicated feature extractors and task-specific heads. It uses a dynamic pool of private experts and a collaborative assistance expert with multi-head attention for knowledge transfer. An Autonomous Expert Tuner (AET) manages expert lifecycles.
Result: CAMEL shows superior generalizability across diverse multistreams and resilience against complex concept drifts.
Conclusion: CAMEL effectively addresses heterogeneity and concept drifts in multistream learning, offering a robust and adaptable framework.
Abstract: Learning from multiple data streams in real-world scenarios is fundamentally challenging due to intrinsic heterogeneity and unpredictable concept drifts. Existing methods typically assume homogeneous streams and employ static architectures with indiscriminate knowledge fusion, limiting generalizability in complex dynamic environments. To tackle this gap, we propose CAMEL, a dynamic \textbf{C}ollaborative \textbf{A}ssistance \textbf{M}ixture of \textbf{E}xperts \textbf{L}earning framework. It addresses heterogeneity by assigning each stream an independent system with a dedicated feature extractor and task-specific head. Meanwhile, a dynamic pool of specialized private experts captures stream-specific idiosyncratic patterns. Crucially, collaboration across these heterogeneous streams is enabled by a dedicated assistance expert. This expert employs a multi-head attention mechanism to distill and integrate relevant context autonomously from all other concurrent streams. It facilitates targeted knowledge transfer while inherently mitigating negative transfer from irrelevant sources. Furthermore, we propose an Autonomous Expert Tuner (AET) strategy, which dynamically manages expert lifecycles in response to drift. It instantiates new experts for emerging concepts (freezing prior ones to prevent catastrophic forgetting) and prunes obsolete ones. This expert-level plasticity provides a robust and efficient mechanism for online model capacity adaptation. Extensive experiments demonstrate CAMEL’s superior generalizability across diverse multistreams and exceptional resilience against complex concept drifts.
[788] Enhancing Math Reasoning in Small-sized LLMs via Preview Difficulty-Aware Intervention
Xinhan Di, JoyJiaoW
Main category: cs.LG
TL;DR: The paper introduces an Early Preview Reinforcement Learning (EPRLI) algorithm to enhance reasoning in LLMs, achieving competitive results on math benchmarks despite undisclosed details in state-of-the-art models.
Details
Motivation: The lack of transparency in reinforcement learning techniques for reasoning in advanced LLMs motivates the development of an open-source alternative.
Method: The EPRLI algorithm, based on the GRPO framework, incorporates difficulty-aware intervention for math problems and is tested on a 1.5B-parameter LLM.
Result: The method achieves 50.0% on AIME24, 89.2% on Math500, 77.1% on AMC, 35.3% on Minerva, and 51.9% on OBench, outperforming O1-Preview and matching O1-mini.
Conclusion: EPRLI demonstrates the potential of open-source reinforcement learning for reasoning in LLMs, offering a replicable alternative to proprietary models.
Abstract: Reinforcement learning scaling enhances the reasoning capabilities of large language models, with reinforcement learning serving as the key technique to draw out complex reasoning. However, key technical details of state-of-the-art reasoning LLMs, such as those in the OpenAI O series, Claude 3 series, DeepMind’s Gemini 2.5 series, and Grok 3 series, remain undisclosed, making it difficult for the research community to replicate their reinforcement learning training results. Therefore, we start our study from an Early Preview Reinforcement Learning (EPRLI) algorithm built on the open-source GRPO framework, incorporating difficulty-aware intervention for math problems. Applied to a 1.5B-parameter LLM, our method achieves 50.0% on AIME24, 89.2% on Math500, 77.1% on AMC, 35.3% on Minerva, and 51.9% on OBench, surpassing O1-Preview and matching O1-mini within standard school-lab settings.
[789] Augmented Reinforcement Learning Framework For Enhancing Decision-Making In Machine Learning Models Using External Agents
Sandesh Kumar Singh
Main category: cs.LG
TL;DR: The paper introduces an Augmented Reinforcement Learning (ARL) framework to enhance decision-making in ML models by incorporating external agents for feedback and data quality assurance.
Details
Motivation: Addresses the 'Garbage-In, Garbage-Out' problem in reinforcement learning by ensuring quality data inputs and corrective feedback.
Method: Uses two external agents: one for real-time evaluation and feedback (Rejected Data Pipeline), and another for curated feedback (Approved Dataset).
Result: Experimental validation in document identification shows improved robustness and accuracy with human feedback.
Conclusion: ARL combines machine efficiency and human insight, offering a scalable solution for improving model performance in complex environments.
Abstract: This work proposes Augmented Reinforcement Learning (ARL), a novel framework for improving the decision-making capabilities of machine learning models. It introduces agents as external overseers that check model decisions; an external agent can be a human or an automated script that helps correct the decision path. The framework targets the “Garbage-In, Garbage-Out” problem, in which poor data inputs lead to incorrect actions in reinforcement learning. ARL incorporates two external agents that aid in course correction and guarantee data quality at all points of the training cycle. External Agent 1 is a real-time evaluator that provides feedback on decisions taken by the model and identifies suboptimal actions, forming the Rejected Data Pipeline. External Agent 2 selectively curates this feedback for relevance and accuracy in business scenarios, creating an approved dataset for future training cycles. The framework is validated on a real-world scenario, “Document Identification and Information Extraction”, which originates mainly in banking systems but can be extended elsewhere; classification and information extraction must be performed correctly in this setting. Experimental results show that including human feedback significantly enhances the model’s robustness and accuracy in making decisions. The augmented approach, combining machine efficiency and human insight, attains a higher learning standard, mainly in complex or ambiguous environments. The findings of this study show that human-in-the-loop reinforcement learning frameworks such as ARL can provide a scalable approach to improving model performance in data-driven applications.
[790] TCDiff: Triplex Cascaded Diffusion for High-fidelity Multimodal EHRs Generation with Incomplete Clinical Data
Yandong Yan, Chenxi Li, Yu Huang, Dexuan Xu, Jiaqi Zhu, Zhongyan Chai, Huamin Zhang
Main category: cs.LG
TL;DR: TCDiff, a novel EHR generation framework, outperforms baselines by 10% in fidelity while handling multimodal EHR data and data incompleteness, validated on a new TCM dataset.
Details
Motivation: The scarcity of high-quality EHRs and limitations of existing methods in handling heterogeneous, multimodal data and incompleteness, especially in Traditional Chinese Medicine (TCM).
Method: TCDiff cascades three diffusion networks (Reference Modalities Diffusion, Cross-Modal Bridging, Target Modality Diffusion) to generate EHR data.
Result: TCDiff outperforms baselines by 10% in fidelity under various missing rates and maintains privacy.
Conclusion: TCDiff is effective, robust, and generalizable for real-world healthcare EHR generation.
Abstract: The scarcity of large-scale and high-quality electronic health records (EHRs) remains a major bottleneck in biomedical research, especially as large foundation models become increasingly data-hungry. Synthesizing substantial volumes of de-identified and high-fidelity data from existing datasets has emerged as a promising solution. However, existing methods suffer from a series of limitations: they struggle to model the intrinsic properties of heterogeneous multimodal EHR data (e.g., continuous, discrete, and textual modalities), capture the complex dependencies among them, and robustly handle pervasive data incompleteness. These challenges are particularly acute in Traditional Chinese Medicine (TCM). To this end, we propose TCDiff (Triplex Cascaded Diffusion Network), a novel EHR generation framework that cascades three diffusion networks to learn the features of real-world EHR data, forming a multi-stage generative process: Reference Modalities Diffusion, Cross-Modal Bridging, and Target Modality Diffusion. Furthermore, to validate our proposed framework, besides two public datasets, we also construct and introduce TCM-SZ1, a novel multimodal EHR dataset for benchmarking. Experimental results show that TCDiff consistently outperforms state-of-the-art baselines by an average of 10% in data fidelity under various missing rates, while maintaining competitive privacy guarantees. This highlights the effectiveness, robustness, and generalizability of our approach in real-world healthcare scenarios.
[791] IMU: Influence-guided Machine Unlearning
Xindi Fan, Jing Wu, Mingyi Zhou, Pengwei Liang, Dinh Phung
Main category: cs.LG
TL;DR: The paper introduces Influence-guided Machine Unlearning (IMU), a method for selectively forgetting data in deep learning models without needing the original training data, outperforming existing retain-data-free approaches.
Details
Motivation: Deep learning models can memorize training data, risking privacy leaks. Current machine unlearning (MU) methods often require access to the original data, which is impractical. IMU addresses this by enabling unlearning without the retain set.
Method: IMU uses gradient ascent and dynamically allocates unlearning intensities based on data point influences, requiring only the forget set.
Result: IMU outperforms existing retain-data-free MU methods in vision and language tasks, balancing unlearning effectiveness and model utility.
Conclusion: IMU provides a practical and effective solution for machine unlearning without relying on the original training data, enhancing privacy and scalability.
Abstract: Recent studies have shown that deep learning models are vulnerable to attacks and tend to memorize training data points, raising significant concerns about privacy leakage. This motivates the development of machine unlearning (MU), i.e., a paradigm that enables models to selectively forget specific data points upon request. However, most existing MU algorithms require partial or full fine-tuning on the retain set. This necessitates continued access to the original training data, which is often impractical due to privacy concerns and storage constraints. A few retain-data-free MU methods have been proposed, but some rely on access to auxiliary data and precomputed statistics of the retain set, while others scale poorly when forgetting larger portions of data. In this paper, we propose Influence-guided Machine Unlearning (IMU), a simple yet effective method that conducts MU using only the forget set. Specifically, IMU employs gradient ascent and innovatively introduces dynamic allocation of unlearning intensities across different data points based on their influences. This adaptive strategy significantly enhances unlearning effectiveness while maintaining model utility. Results across vision and language tasks demonstrate that IMU consistently outperforms existing retain-data-free MU methods.
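A minimal sketch of one unlearning step, assuming per-sample influence is approximated by normalized per-sample losses (an illustrative proxy; the paper's influence estimate may differ):

import torch
import torch.nn.functional as F

def unlearn_step(model, x_forget, y_forget, optimizer):
    # per-sample losses on the forget set only; no retain data is touched
    per_sample = F.cross_entropy(model(x_forget), y_forget, reduction="none")
    # dynamic intensities: weight each point by its (proxy) influence
    weights = (per_sample / per_sample.sum().clamp_min(1e-12)).detach()
    loss = -(weights * per_sample).sum()  # negated loss => gradient ascent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()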
[792] EAC-MoE: Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models
Yuanteng Chen, Yuantian Shao, Peisong Wang, Jian Cheng
Main category: cs.LG
TL;DR: EAC-MoE addresses GPU memory and inference speed challenges in MoE-LLMs via quantization and pruning, introducing QESC and PESF modules.
Details
Motivation: Mixture-of-Experts (MoE) faces high GPU memory usage and inefficient inference due to low activated parameters.
Method: Proposes QESC for quantization bias mitigation and PESF for pruning less-used experts.
Result: Reduces memory usage and improves inference speed with minimal performance loss.
Conclusion: EAC-MoE effectively optimizes MoE-LLMs for practical deployment.
Abstract: Mixture-of-Experts (MoE) has demonstrated promising potential in scaling LLMs. However, it is hindered by two critical challenges: (1) substantial GPU memory consumption to load all experts; (2) the low number of activated parameters does not translate into equivalent inference acceleration. In this work, we propose EAC-MoE, an Expert-Selection Aware Compressor for MoE-LLMs, which deeply aligns with the characteristics of MoE from the perspectives of quantization and pruning, and introduces two modules to address these two challenges respectively: (1) The expert selection bias caused by low-bit quantization is a major factor contributing to the performance degradation in MoE-LLMs. Based on this, we propose Quantization with Expert-Selection Calibration (QESC), which mitigates the expert selection bias by calibrating the routers within the MoE; (2) There are always certain experts that are not crucial for the corresponding tasks, yet still add inference latency. Therefore, we propose Pruning based on Expert-Selection Frequency (PESF), which significantly improves inference speed by pruning less frequently used experts for the current task. Extensive experiments demonstrate that our approach significantly reduces memory usage and improves inference speed with minimal performance degradation.
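A toy sketch of the frequency-based pruning idea behind PESF; the counting scheme and keep-ratio below are our illustration, not the paper's procedure:

import torch

def expert_selection_frequency(router_logits, top_k=2):
    # how often each expert is among the top-k routed choices on calibration data
    choices = router_logits.topk(top_k, dim=-1).indices.flatten()
    return torch.bincount(choices, minlength=router_logits.size(-1))

def prune_mask(freq, keep_ratio=0.75):
    # keep the most frequently selected experts for the current task
    n_keep = max(1, int(len(freq) * keep_ratio))
    mask = torch.zeros_like(freq, dtype=torch.bool)
    mask[freq.topk(n_keep).indices] = True
    return mask

logits = torch.randn(4096, 8)  # router logits over a calibration batch, 8 experts
mask = prune_mask(expert_selection_frequency(logits))  # True = expert retained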
[793] Learning Unified System Representations for Microservice Tail Latency Prediction
Wenzhuo Qian, Hailiang Zhao, Tianlv Chen, Jiayi Chen, Ziqi Wang, Kingsum Chow, Shuiguang Deng
Main category: cs.LG
TL;DR: USRFNet, a deep learning network, improves microservice performance monitoring by separating and modeling traffic-side and resource-side features, achieving better P95 tail latency prediction.
Details
Motivation: Microservices' distributed nature complicates performance monitoring. Traditional latency metrics are noisy and lack holistic insights, while existing methods struggle with heterogeneous data and lack principled designs.
Method: USRFNet uses GNNs for service interactions and gMLP for resource dynamics, fusing these into a unified system embedding to predict P95 latency.
Result: USRFNet outperforms state-of-the-art baselines in accuracy under large-scale stress testing.
Conclusion: USRFNet effectively addresses microservice monitoring challenges by integrating complementary modalities for stable and accurate latency prediction.
Abstract: Microservice architectures have become the de facto standard for building scalable cloud-native applications, yet their distributed nature introduces significant challenges in performance monitoring and resource management. Traditional approaches often rely on per-request latency metrics, which are highly sensitive to transient noise and fail to reflect the holistic behavior of complex, concurrent workloads. In contrast, window-level P95 tail latency provides a stable and meaningful signal that captures both system-wide trends and user-perceived performance degradation. We identify two key shortcomings in existing methods: (i) inadequate handling of heterogeneous data, where traffic-side features propagate across service dependencies and resource-side signals reflect localized bottlenecks, and (ii) the lack of principled architectural designs that effectively distinguish and integrate these complementary modalities. To address these challenges, we propose USRFNet, a deep learning network that explicitly separates and models traffic-side and resource-side features. USRFNet employs GNNs to capture service interactions and request propagation patterns, while gMLP modules independently model cluster resource dynamics. These representations are then fused into a unified system embedding to predict window-level P95 latency with high accuracy. We evaluate USRFNet on real-world microservice benchmarks under large-scale stress testing conditions, demonstrating substantial improvements in prediction accuracy over state-of-the-art baselines.
[794] Privacy-Preserving Inference for Quantized BERT Models
Tianpei Lu, Bingsheng Zhang, Lekun Peng, Bowen Zheng, Lichun Li, Kui Ren
Main category: cs.LG
TL;DR: A fine-grained, layer-wise quantization scheme for secure multi-party computation (MPC) is proposed, achieving significant speedups by optimizing floating-point operations and handling nonlinear functions efficiently.
Details
Motivation: Addressing the high communication and computation overhead in secure model inference, especially for privacy-sensitive domains like healthcare, by improving MPC-based quantized inference methods.
Method: Introduces a layer-wise quantization scheme, 1-bit weight fully connected layers, and a multi-input lookup table protocol for secure softmax evaluation, along with dual secret sharing schemes for precision conversion.
Result: Achieves up to 8×, 9×, and 22× speedups compared to prior works, demonstrating efficiency in secure inference.
Conclusion: The proposed method significantly reduces overhead in secure model inference while maintaining privacy, making it practical for real-world applications.
Abstract: With the increasing deployment of generative machine learning models in privacy-sensitive domains such as healthcare and personalized services, ensuring secure inference has become a critical challenge. Secure multi-party computation (MPC) enables privacy-preserving model inference but suffers from high communication and computation overhead. The main bottleneck lies in the expensive secure evaluation of floating-point operations. Quantization offers a promising solution by converting floating-point operations into lower-precision integer computations, significantly reducing overhead. However, existing MPC-based quantized inference methods either rely on public quantization parameters-posing privacy risks-or suffer from inefficiencies, particularly in handling nonlinear functions such as activations and softmax. In this work, we propose a fine-grained, layer-wise quantization scheme and support 1-bit weight fully connected layers in a secure setting. We design a multi-input lookup table protocol to evaluate softmax efficiently and securely. Furthermore, we use dual secret sharing schemes and perform precision conversions via lookup tables, eliminating truncation overhead entirely. Experimental evaluation on BERT-base models demonstrates that our approach achieves up to $8\times$ speedup compared to Lu \emph{et al}. (NDSS 25), $9\times$ speedup compared to Gupta \emph{et al}. (PETS 24) and $22 \times$ speedup compared to Knott \emph{et al}. (NeurIPS 21).
[795] SPARTA: Advancing Sparse Attention in Spiking Neural Networks via Spike-Timing-Based Prioritization
Minsuk Jang, Changick Kim
Main category: cs.LG
TL;DR: SPARTA leverages spike-timing dynamics for efficient sparse attention in SNNs, reducing complexity while maintaining accuracy.
Details
Motivation: Current SNNs overlook precise timing information, missing rich computational cues. SPARTA aims to exploit these dynamics.
Method: Uses heterogeneous neuron dynamics and spike-timing cues (firing patterns, timing, intervals) for competitive gating, achieving 65.4% sparsity.
Result: Reduces attention complexity from O(N^2) to O(K^2) with high accuracy (98.78% on DVS-Gesture, 83.06% on CIFAR10-DVS, 95.3% on CIFAR-10).
Conclusion: SPARTA demonstrates that spike timing dynamics enhance computational efficiency and accuracy in SNNs.
Abstract: Current Spiking Neural Networks (SNNs) underutilize the temporal dynamics inherent in spike-based processing, relying primarily on rate coding while overlooking precise timing information that provides rich computational cues. We propose SPARTA (Spiking Priority Attention with Resource-Adaptive Temporal Allocation), a framework that leverages heterogeneous neuron dynamics and spike-timing information to enable efficient sparse attention. SPARTA prioritizes tokens based on temporal cues, including firing patterns, spike timing, and inter-spike intervals, achieving 65.4% sparsity through competitive gating. By selecting only the most salient tokens, SPARTA reduces attention complexity from $O(N^2)$ to $O(K^2)$ with $K \ll N$, while maintaining high accuracy. Our method achieves state-of-the-art performance on DVS-Gesture (98.78%) and competitive results on CIFAR10-DVS (83.06%) and CIFAR-10 (95.3%), demonstrating that exploiting spike timing dynamics improves both computational efficiency and accuracy.
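A sketch of the complexity reduction only; the saliency scores below stand in for SPARTA's spike-timing cues, which this snippet does not model:

import torch
import torch.nn.functional as F

def sparse_attention(x, saliency, k):
    # keep the k most salient tokens, then attend among them: O(K^2), not O(N^2)
    idx = saliency.topk(k, dim=1).indices
    sel = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
    attn = F.softmax(sel @ sel.transpose(1, 2) / sel.size(-1) ** 0.5, dim=-1)
    return attn @ sel

x = torch.randn(2, 196, 64)        # 196 tokens per sample
saliency = torch.rand(2, 196)      # placeholder priority scores
out = sparse_attention(x, saliency, k=68)  # keep ~35% of tokens (~65% sparsity)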
[796] Boosting Generalization Performance in Model-Heterogeneous Federated Learning Using Variational Transposed Convolution
Ziru Niu, Hai Dong, A. K. Qin
Main category: cs.LG
TL;DR: A model-heterogeneous FL framework improves generalization by exchanging feature distributions instead of model parameters, using synthetic data for fine-tuning.
Details
Motivation: Address data heterogeneity in FL without model aggregation, enhancing generalization performance.
Method: Clients share feature distributions (mean, covariance); train a VTC network to generate synthetic data for fine-tuning.
Result: Higher generalization accuracy, lower communication costs, and memory consumption compared to existing frameworks.
Conclusion: The proposed framework effectively tackles data heterogeneity in FL without model aggregation, improving performance and efficiency.
Abstract: Federated learning (FL) is a pioneering machine learning paradigm that enables distributed clients to process local data effectively while ensuring data privacy. However, the efficacy of FL is usually impeded by the data heterogeneity among clients, resulting in local models with low generalization performance. To address this problem, traditional model-homogeneous approaches mainly involve debiasing the local training procedures with regularization or dynamically adjusting client weights in aggregation. Nonetheless, these approaches become incompatible for scenarios where clients exhibit heterogeneous model architectures. In this paper, we propose a model-heterogeneous FL framework that can improve clients’ generalization performance over unseen data without model aggregation. Instead of model parameters, clients exchange the feature distributions with the server, including the mean and the covariance. Accordingly, clients train a variational transposed convolutional (VTC) neural network with Gaussian latent variables sampled from the feature distributions, and use the VTC model to generate synthetic data. By fine-tuning local models with the synthetic data, clients significantly increase their generalization performance. Experimental results show that our approach obtains higher generalization accuracy than existing model-heterogeneous FL frameworks, as well as lower communication costs and memory consumption.
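A minimal sketch of the statistics exchange, assuming a single Gaussian per client over a flat feature vector (the VTC decoder itself is not reproduced here):

import numpy as np

def share_feature_stats(features):
    # client-side summary: mean and covariance of local features, never raw data
    return features.mean(axis=0), np.cov(features, rowvar=False)

def sample_latents(mean, cov, n):
    # Gaussian latents matching a shared distribution, to feed the VTC generator
    return np.random.multivariate_normal(mean, cov, size=n)

local_feats = np.random.randn(200, 16)        # stand-in for extracted features
mu, sigma = share_feature_stats(local_feats)
z = sample_latents(mu, sigma, n=64)           # seeds for synthesizing fine-tuning data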
[797] Asynchronous Federated Learning with non-convex client objective functions and heterogeneous dataset
Ali Forootani, Raffaele Iervolino
Main category: cs.LG
TL;DR: This paper extends Asynchronous Federated Learning (AFL) to handle non-convex objectives and heterogeneous datasets, introducing staleness-aware aggregation and dynamic learning rates for improved convergence.
Details
Motivation: Traditional Federated Learning (FL) faces issues like communication overhead and straggler effects; AFL addresses these but has lacked support for non-convex objectives and heterogeneous data.
Method: The paper proposes a staleness-aware aggregation method and dynamic learning rate scheduling to mitigate stale updates and adapt to client heterogeneity. It also analyzes client selection strategies.
Result: Rigorous convergence analysis shows bounds on gradient norms, and experiments validate improved performance in asynchronous, heterogeneous, and non-convex FL scenarios.
Conclusion: The proposed framework enhances AFL’s practicality for real-world applications by addressing staleness, heterogeneity, and non-convexity, with demonstrated scalability and stability.
Abstract: Federated Learning (FL) enables collaborative model training across decentralized devices while preserving data privacy. However, traditional FL suffers from communication overhead, system heterogeneity, and straggler effects. Asynchronous Federated Learning (AFL) addresses these by allowing clients to update independently, improving scalability and reducing synchronization delays. This paper extends AFL to handle non-convex objective functions and heterogeneous datasets, common in modern deep learning. We present a rigorous convergence analysis, deriving bounds on the expected gradient norm and studying the effects of staleness, variance, and heterogeneity. To mitigate stale updates, we introduce a staleness-aware aggregation that prioritizes fresher updates and a dynamic learning rate schedule that adapts to client staleness and heterogeneity, improving stability and convergence. Our framework accommodates variations in computational power, data distribution, and communication delays, making it practical for real-world applications. We also analyze the impact of client selection strategies (sampling with or without replacement) on variance and convergence. Implemented in PyTorch with Python’s asyncio, our approach is validated through experiments demonstrating improved performance and scalability for asynchronous, heterogeneous, and non-convex FL scenarios.
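A minimal sketch of staleness-aware mixing on the server; the polynomial decay and the base mixing rate are assumptions, not the paper's exact schedule:

import numpy as np

def staleness_weight(tau, alpha=0.5):
    # discount updates computed against an older global model (staleness tau)
    return (1.0 + tau) ** (-alpha)

def apply_async_update(global_w, client_w, tau, base_mix=0.1):
    # mix one asynchronous client update into the global model
    eta = base_mix * staleness_weight(tau)
    return (1 - eta) * global_w + eta * client_w

w = np.zeros(10)
w = apply_async_update(w, np.ones(10), tau=0)   # fresh update: full base mixing
w = apply_async_update(w, np.ones(10), tau=15)  # stale update: heavily discounted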
[798] Generalized Kernelized Bandits: Self-Normalized Bernstein-Like Dimension-Free Inequality and Regret Bounds
Alberto Maria Metelli, Simone Drago, Marco Mussi
Main category: cs.LG
TL;DR: The paper introduces generalized kernelized bandits (GKBs), combining kernelized and generalized linear bandits, and proposes the GKB-UCB algorithm with a novel concentration inequality for regret analysis.
Details
Motivation: To address the regret minimization problem in a unified setting that extends both kernelized bandits and generalized linear bandits, leveraging RKHS and exponential family noise.
Method: Proposes the GKB-UCB algorithm, using a novel self-normalized Bernstein-like inequality derived from Freedman’s inequality and a stitching argument.
Result: Achieves a regret bound of order ~O(γ_T √(T/κ_*)), matching state-of-the-art bounds for both kernelized and generalized linear bandits.
Conclusion: The work provides a unified framework for GKBs, with tight regret guarantees and a novel concentration inequality of independent interest.
Abstract: We study the regret minimization problem in the novel setting of generalized kernelized bandits (GKBs), where we optimize an unknown function $f^*$ belonging to a reproducing kernel Hilbert space (RKHS), having access to samples generated by an exponential family (EF) noise model whose mean is a non-linear function $\mu(f^*)$. This model extends both kernelized bandits (KBs) and generalized linear bandits (GLBs). We propose an optimistic algorithm, GKB-UCB, and we explain why existing self-normalized concentration inequalities do not allow tight regret guarantees. For this reason, we devise a novel self-normalized Bernstein-like dimension-free inequality resorting to Freedman’s inequality and a stitching argument, which represents a contribution of independent interest. Based on it, we conduct a regret analysis of GKB-UCB, deriving a regret bound of order $\widetilde{O}(\gamma_T \sqrt{T/\kappa_*})$, where $T$ is the learning horizon, $\gamma_T$ the maximal information gain, and $\kappa_*$ a term characterizing the magnitude of the reward nonlinearity. Our result matches, up to multiplicative constants and logarithmic terms, the state-of-the-art bounds for both KBs and GLBs and provides a unified view of both settings.
[799] Innovative tokenisation of structured data for LLM training
Kayvan Karim, Hani Ragab Hassen, Hadj Batatia
Main category: cs.LG
TL;DR: A hybrid tokenisation method for tabular data is introduced, combining fixed tokens for structure and BPE for high-cardinality values, tested on a NetFlow dataset with high efficiency and compression.
Details
Motivation: Addressing the challenge of adapting sequence-based architectures like Transformers and LLMs to structured tabular data, which often lacks cohesive encoding of mixed features.
Method: Uses predefined fixed tokens for structural elements and categorical features, and BPE for high-cardinality/continuous values.
Result: Processed 31M network flows in <5 hours with 6.18:1 compression, creating a 1B-token corpus.
Conclusion: Provides a viable, generalisable method for training foundation models on structured data.
Abstract: Data representation remains a fundamental challenge in machine learning, particularly when adapting sequence-based architectures like Transformers and Large Language Models (LLMs) for structured tabular data. Existing methods often fail to cohesively encode the mix of numerical and categorical features or preserve the inherent structure of tables. This paper introduces a novel, hybrid tokenisation methodology designed to convert tabular data into a unified, sequential format suitable for LLM training. Our approach combines predefined fixed tokens to represent structural elements and low-cardinality categorical features, with a learned subword vocabulary using Byte-Pair Encoding (BPE) for high-cardinality and continuous values. We demonstrate the efficacy of this technique by applying it to a large-scale NetFlow dataset (CIDDS-001), preparing a corpus for a Network Intrusion Detection System (NIDS) foundation model. The evaluation shows that our method is highly efficient, processing over 31 million network flows in under five hours and achieving a significant data compression ratio of 6.18:1. This process resulted in a computationally manageable corpus of over one billion tokens, establishing a viable and generalisable pathway for training foundation models on structured data.
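A toy illustration of the hybrid scheme on one flow record; the field names, special tokens, and character-level stand-in for BPE are ours, not the paper's vocabulary:

def tokenise_flow(flow, bpe_encode):
    # fixed tokens for structure and the low-cardinality protocol field;
    # learned subwords (BPE) for high-cardinality values
    tokens = ["<FLOW>", "<PROTO>", f"<{flow['proto']}>"]
    for field in ("src_ip", "dst_ip", "bytes"):
        tokens += [f"<{field.upper()}>"] + bpe_encode(str(flow[field]))
    tokens.append("</FLOW>")
    return tokens

flow = {"proto": "TCP", "src_ip": "192.168.1.7", "dst_ip": "10.0.0.3", "bytes": 14312}
char_level = lambda s: list(s)  # stand-in for a trained BPE encoder
print(tokenise_flow(flow, char_level))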
[800] Explaining Time Series Classifiers with PHAR: Rule Extraction and Fusion from Post-hoc Attributions
Maciej Mozolewski, Szymon Bobek, Grzegorz J. Nalepa
Main category: cs.LG
TL;DR: PHAR is a framework that converts numeric feature attributions into human-readable rules for time series classification, improving interpretability and transparency.
Details
Motivation: The challenge of interpreting raw time series and high-dimensional input spaces in ML models motivates the need for structured, interpretable explanations.
Method: PHAR transforms feature attributions from explainers like LIME or SHAP into interpretable rules, uses rule fusion for consolidation, and includes visualization techniques.
Result: PHAR matches native rule-based methods in performance, scales efficiently, and enhances interpretability and decision transparency.
Conclusion: PHAR effectively resolves conflicting explanations and improves practical applicability for time series classification.
Abstract: Explaining machine learning (ML) models for time series (TS) classification remains challenging due to the difficulty of interpreting raw time series and the high dimensionality of the input space. We introduce PHAR (Post-hoc Attribution Rules), a unified framework that transforms numeric feature attributions from post-hoc, instance-wise explainers (e.g., LIME, SHAP) into structured, human-readable rules. These rules define interpretable intervals that indicate where and when key decision boundaries occur, enhancing model transparency. PHAR performs comparably to native rule-based methods, such as Anchor, while scaling more efficiently to long TS sequences and achieving broader instance coverage. A dedicated rule fusion step consolidates rule sets using strategies like weighted selection and lasso-based refinement, balancing key quality metrics: coverage, confidence, and simplicity. This fusion ensures each instance receives a concise and unambiguous rule, improving both explanation fidelity and consistency. We further introduce visualization techniques to illustrate specificity-generalization trade-offs in the derived rules. PHAR resolves conflicting and overlapping explanations, a common effect of the Rashomon phenomenon, into coherent, domain-adaptable insights. Comprehensive experiments on UCI datasets demonstrate that PHAR improves interpretability, decision transparency, and practical applicability for TS classification tasks.
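A toy sketch of the attribution-to-rule step, assuming a rule is an interval over the most-attributed contiguous region (thresholds and rule syntax are illustrative, not PHAR's exact rule language):

import numpy as np

def attribution_to_rule(attrib, series, quantile=0.9):
    # salient timesteps = top decile of attribution mass
    idx = np.flatnonzero(attrib >= np.quantile(attrib, quantile))
    a, b = idx.min(), idx.max()               # enclosing time interval
    lo, hi = series[a:b + 1].min(), series[a:b + 1].max()
    return f"IF {lo:.2f} <= x[{a}:{b}] <= {hi:.2f} THEN predicted class"

attrib = np.abs(np.random.randn(100))  # stand-in for SHAP/LIME attributions
series = np.sin(np.linspace(0, 6, 100))
print(attribution_to_rule(attrib, series))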
[801] MHARFedLLM: Multimodal Human Activity Recognition Using Federated Large Language Model
Asmit Bandyopadhyay, Rohit Basu, Tanmay Sen, Swagatam Das
Main category: cs.LG
TL;DR: FedTime-MAGNET is a multimodal federated learning framework for HAR, combining depth cameras, pressure mats, and accelerometers using MAGNET for fusion and a T5 encoder for temporal dependencies, achieving high F1 scores.
Details
Motivation: Traditional HAR systems rely on single modalities, limiting robustness and accuracy. This work aims to improve HAR by integrating heterogeneous data sources and leveraging federated learning.
Method: The framework uses MAGNET for multimodal fusion (graph attention and Mixture of Experts) and a lightweight T5 encoder for temporal dependencies, all within a federated learning setup.
Result: Achieves a centralized F1 Score of 0.934 and a federated F1 Score of 0.881, demonstrating significant improvement in HAR performance.
Conclusion: FedTime-MAGNET effectively combines multimodal fusion, time series LLMs, and federated learning to build accurate and robust HAR systems.
Abstract: Human Activity Recognition (HAR) plays a vital role in applications such as fitness tracking, smart homes, and healthcare monitoring. Traditional HAR systems often rely on single modalities, such as motion sensors or cameras, limiting robustness and accuracy in real-world environments. This work presents FedTime-MAGNET, a novel multimodal federated learning framework that advances HAR by combining heterogeneous data sources: depth cameras, pressure mats, and accelerometers. At its core is the Multimodal Adaptive Graph Neural Expert Transformer (MAGNET), a fusion architecture that uses graph attention and a Mixture of Experts to generate unified, discriminative embeddings across modalities. To capture complex temporal dependencies, a lightweight T5 encoder-only architecture is customized and adapted within this framework. Extensive experiments show that FedTime-MAGNET significantly improves HAR performance, achieving a centralized F1 Score of 0.934 and a strong federated F1 Score of 0.881. These results demonstrate the effectiveness of combining multimodal fusion, time series LLMs, and federated learning for building accurate and robust HAR systems.
[802] Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models
Istabrak Abbes, Gopeshh Subbaraj, Matthew Riemer, Nizar Islah, Benjamin Therien, Tsuguchika Tabaru, Hiroaki Kingetsu, Sarath Chandar, Irina Rish
Main category: cs.LG
TL;DR: The paper explores continual pre-training for LLMs to avoid full retraining, addressing distribution shifts with experience replay and gradient alignment, showing stable learning without forgetting.
Details
Motivation: To improve efficiency in LLM training by updating models with new data instead of retraining from scratch, while mitigating performance degradation from distribution shifts.
Method: Investigates experience replay and gradient alignment in continual pre-training of Llama models, using 100B tokens per language, and proposes an efficient MER implementation.
Result: Both methods stabilize learning without forgetting, with small replay rates being more compute-efficient than scaling model size.
Conclusion: Continual pre-training with replay and gradient alignment is effective, with replay rates being more valuable than model scaling for compute efficiency.
Abstract: Training large language models (LLMs) typically involves pre-training on massive corpora, only to restart the process entirely when new data becomes available. A more efficient and resource-conserving approach would be continual pre-training, where models are updated with new data rather than retraining from scratch. However, the introduction of new data often causes distribution shifts, leading to performance degradation on previously learned tasks. In this paper, we take a deeper look at two popular proposals for addressing this distribution shift within the continual learning literature: experience replay and gradient alignment. We consider continual pre-training of models within the Llama family of architectures at a large scale across languages with 100 billion tokens of training data in each language, finding that both replay and gradient alignment lead to more stable learning without forgetting. This conclusion holds both as we vary the model scale and as we vary the number and diversity of tasks. Moreover, we are the first to demonstrate the effectiveness of gradient alignment techniques in the context of LLM pre-training and propose an efficient implementation of meta-experience replay (MER) that imbues experience replay with the benefits of gradient alignment despite negligible compute and memory overhead. Our scaling analysis across model sizes and replay rates indicates that small rates of replaying old examples are definitely a more valuable use of compute than investing in model size, but that it is more compute efficient to scale the size of the model than invest in high rates of replaying old examples.
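The replay side of the recipe is simple to sketch. Below is a toy data pipeline that injects previously seen batches at a small replay rate; the stream/corpus structures are placeholders and the per-batch mixing granularity is an assumption:

```python
import random

def replay_batches(new_stream, old_corpus, replay_rate=0.05):
    """Yield training batches where a small fraction comes from previously
    seen data. A minimal sketch of experience replay for continual
    pre-training; the small replay_rate mirrors the paper's finding that
    low replay rates are a good use of compute."""
    for new_batch in new_stream:
        if random.random() < replay_rate:
            yield random.choice(old_corpus)   # replay a batch of old data
        yield new_batch

old = [f"old-batch-{i}" for i in range(3)]
new = (f"new-batch-{i}" for i in range(10))
for b in replay_batches(new, old, replay_rate=0.2):
    print(b)
```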
[803] Neural Policy Iteration for Stochastic Optimal Control: A Physics-Informed Approach
Yeongjong Kim, Yeoneung Kim, Minseok Kim, Namkyeong Cho
Main category: cs.LG
TL;DR: A physics-informed neural network policy iteration (PINN-PI) framework is proposed for solving stochastic optimal control problems using HJB equations, with systematic error control and theoretical guarantees.
Details
Motivation: To extend deterministic PINN-based methods to stochastic settings and provide interpretable, theoretically grounded solutions for stochastic optimal control problems.
Method: Trains a neural network to approximate the value function by minimizing the residual of a linear PDE induced by a fixed policy, enabling systematic error control and explicit bounds.
Result: Demonstrates effectiveness on benchmark problems like stochastic cartpole, pendulum, and high-dimensional LQR problems up to 10D.
Conclusion: The PINN-PI framework successfully extends deterministic methods to stochastic settings with theoretical guarantees and practical applicability.
Abstract: We propose a physics-informed neural network policy iteration (PINN-PI) framework for solving stochastic optimal control problems governed by second-order Hamilton–Jacobi–Bellman (HJB) equations. At each iteration, a neural network is trained to approximate the value function by minimizing the residual of a linear PDE induced by a fixed policy. This linear structure enables systematic $L^2$ error control at each policy evaluation step, and allows us to derive explicit Lipschitz-type bounds that quantify how value gradient errors propagate to the policy updates. This interpretability provides a theoretical basis for evaluating policy quality during training. Our method extends recent deterministic PINN-based approaches to stochastic settings, inheriting the global exponential convergence guarantees of classical policy iteration under mild conditions. We demonstrate the effectiveness of our method on several benchmark problems, including stochastic cartpole, pendulum problems and high-dimensional linear quadratic regulation (LQR) problems in up to 10D.
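The policy-evaluation step lends itself to a compact sketch: train a value network to zero out the residual of the linear PDE induced by a frozen policy. The 1D dynamics, running cost, noise level, and discount below are toy assumptions, not the paper's benchmarks:

```python
import torch

# Toy 1D policy evaluation in the spirit of PINN-PI: fit V so the residual of
# the linear PDE induced by the fixed policy vanishes on sampled states.
V = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(V.parameters(), lr=1e-3)
sigma, rho = 0.5, 0.1                        # noise level and discount (assumed)
policy = lambda x: -x                        # fixed policy for this evaluation step
cost = lambda x, u: x**2 + u**2              # running cost (assumed)

for _ in range(200):
    x = (torch.rand(256, 1) * 2 - 1).requires_grad_(True)
    v = V(x)
    vx = torch.autograd.grad(v.sum(), x, create_graph=True)[0]
    vxx = torch.autograd.grad(vx.sum(), x, create_graph=True)[0]
    u = policy(x)
    # rho*V = cost + f(x,u)*V' + 0.5*sigma^2*V''  with drift f(x,u) = u
    residual = cost(x, u) + u * vx + 0.5 * sigma**2 * vxx - rho * v
    loss = residual.pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print("final residual loss:", loss.item())
```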
[804] Imbalance-Robust and Sampling-Efficient Continuous Conditional GANs via Adaptive Vicinity and Auxiliary Regularization
Xin Ding, Yun Chen, Yongwei Wang, Kao Zhang, Sen Zhang, Peibei Cao, Xiangxue Wang
Main category: cs.LG
TL;DR: CcGAN-AVAR enhances CcGAN by addressing data imbalance and sampling inefficiency, achieving faster inference and better generation quality.
Details
Motivation: Existing methods (CcGAN and CCDM) have limitations: CcGAN suffers from data imbalance, and CCDM is computationally expensive.
Method: CcGAN-AVAR introduces adaptive vicinity and a multi-task discriminator for better training, leveraging GAN’s one-step generation.
Result: Achieves 300x-2000x faster inference and state-of-the-art generation quality on benchmark datasets.
Conclusion: CcGAN-AVAR outperforms existing methods in efficiency and quality for conditional generative modeling.
Abstract: Recent advances in conditional generative modeling have introduced Continuous conditional Generative Adversarial Network (CcGAN) and Continuous Conditional Diffusion Model (CCDM) for estimating high-dimensional data distributions conditioned on scalar, continuous regression labels (e.g., angles, ages, or temperatures). However, these approaches face fundamental limitations: CcGAN suffers from data imbalance due to fixed-size vicinity constraints, while CCDM requires computationally expensive iterative sampling. We present CcGAN-AVAR, an enhanced CcGAN framework that addresses both challenges: (1) it leverages the GAN framework’s native one-step generation to overcome CCDMs’ sampling bottleneck (achieving 300x-2000x faster inference), while (2) two novel components specifically target data imbalance: an adaptive vicinity mechanism that dynamically adjusts the vicinity’s size, and a multi-task discriminator that constructs two regularization terms (through auxiliary regression and density ratio estimation) to significantly improve generator training. Extensive experiments on four benchmark datasets (64x64 to 192x192 resolution) across eight challenging imbalanced settings demonstrate that CcGAN-AVAR achieves state-of-the-art generation quality while maintaining sampling efficiency.
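A hard-vicinity weighting with an adaptive radius can be sketched in a few lines; the k-nearest-label heuristic below is an illustrative stand-in for the paper's adaptive vicinity mechanism:

```python
import numpy as np

def vicinity_weights(labels: np.ndarray, target: float, k: int = 8) -> np.ndarray:
    """Hard-vicinity weights around a target regression label with an
    adaptive radius: instead of a fixed kappa, widen the vicinity until it
    holds at least k samples, so sparse label regions are not starved."""
    dists = np.abs(labels - target)
    kappa = np.sort(dists)[min(k, len(dists)) - 1]  # radius covering k samples
    return (dists <= kappa).astype(float)

labels = np.array([0.10, 0.12, 0.13, 0.40, 0.90])
print(vicinity_weights(labels, target=0.11, k=3))  # dense region -> tight vicinity
print(vicinity_weights(labels, target=0.85, k=3))  # sparse region -> wide vicinity
```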
[805] Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning
Xinting Huang, Michael Hahn
Main category: cs.LG
TL;DR: The paper introduces Neighbor Distance Minimization (NDM), an unsupervised method to identify interpretable subspaces in neural models, demonstrating their connection to model variables and circuits.
Details
Motivation: To understand how neural models organize and encode different aspects of inputs in their representation spaces, and whether these subspaces can be found unsupervised.
Method: Proposes Neighbor Distance Minimization (NDM) to learn non-basis-aligned subspaces without supervision.
Result: NDM identifies interpretable subspaces that align with abstract concepts and model variables, validated on GPT-2 circuits and scalable to larger models.
Conclusion: NDM provides a novel approach to understanding model internals and building circuits, offering insights into subspace organization.
Abstract: Understanding internal representations of neural models is a core interest of mechanistic interpretability. Due to its large dimensionality, the representation space can encode various aspects about inputs. To what extent are different aspects organized and encoded in separate subspaces? Is it possible to find these “natural” subspaces in a purely unsupervised way? Somewhat surprisingly, we can indeed achieve this and find interpretable subspaces by a seemingly unrelated training objective. Our method, neighbor distance minimization (NDM), learns non-basis-aligned subspaces in an unsupervised manner. Qualitative analysis shows subspaces are interpretable in many cases, and encoded information in obtained subspaces tends to share the same abstract concept across different inputs, making such subspaces similar to “variables” used by the model. We also conduct quantitative experiments using known circuits in GPT-2; results show a strong connection between subspaces and circuit variables. We also provide evidence showing scalability to 2B models by finding separate subspaces mediating context and parametric knowledge routing. Viewed more broadly, our findings offer a new perspective on understanding model internals and building circuits.
[806] OccamVTS: Distilling Vision Models to 1% Parameters for Time Series Forecasting
Sisuo Lyu, Siru Zhong, Weilin Ruan, Qingxiang Liu, Qingsong Wen, Hui Xiong, Yuxuan Liang
Main category: cs.LG
TL;DR: OccamVTS distills 1% of essential predictive info from large vision models into lightweight networks, improving time series forecasting accuracy by eliminating irrelevant visual features.
Details
Motivation: Large vision models (LVMs) enhance forecasting, but 99% of their parameters are unnecessary for time series tasks, and irrelevant high-level semantics can impair accuracy.
Method: Proposes OccamVTS, a knowledge distillation framework using pyramid-style feature alignment and correlation/feature distillation to transfer only useful patterns from LVMs.
Result: Achieves state-of-the-art performance with 1% of original parameters, excelling in few-shot and zero-shot scenarios.
Conclusion: Aggressive parameter reduction improves accuracy by focusing on essential temporal patterns, demonstrating the effectiveness of OccamVTS.
Abstract: Time series forecasting is fundamental to diverse applications, with recent approaches leveraging large vision models (LVMs) to capture temporal patterns through visual representations. We reveal that while vision models enhance forecasting performance, 99% of their parameters are unnecessary for time series tasks. Through cross-modal analysis, we find that time series align with low-level textural features but not high-level semantics, which can impair forecasting accuracy. We propose OccamVTS, a knowledge distillation framework that extracts only the essential 1% of predictive information from LVMs into lightweight networks. Using pre-trained LVMs as privileged teachers, OccamVTS employs pyramid-style feature alignment combined with correlation and feature distillation to transfer beneficial patterns while filtering out semantic noise. Counterintuitively, this aggressive parameter reduction improves accuracy by eliminating overfitting to irrelevant visual features while preserving essential temporal patterns. Extensive experiments across multiple benchmark datasets demonstrate that OccamVTS consistently achieves state-of-the-art performance with only 1% of the original parameters, particularly excelling in few-shot and zero-shot scenarios.
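The two distillation terms named in the abstract can be sketched as follows; assume student features have already been projected to the teacher's width, and note that the pyramid-style multi-level alignment is omitted:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_feat, teacher_feat, alpha=0.5):
    """Combine feature distillation (match teacher activations) with
    correlation distillation (match pairwise sample similarities).
    A generic sketch of the two losses, not the paper's exact objective."""
    feat = F.mse_loss(student_feat, teacher_feat)
    s = F.normalize(student_feat, dim=-1)
    t = F.normalize(teacher_feat, dim=-1)
    corr = F.mse_loss(s @ s.T, t @ t.T)     # batch-level correlation structure
    return alpha * feat + (1 - alpha) * corr

student = torch.randn(16, 64, requires_grad=True)  # projected student features
teacher = torch.randn(16, 64)                      # frozen LVM teacher features
print(distill_loss(student, teacher).item())
```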
[807] MolReasoner: Toward Effective and Interpretable Reasoning for Molecular LLMs
Guojiang Zhao, Sihang Li, Zixiang Lu, Zheng Cheng, Haitao Lin, Lirong Wu, Hanchen Xia, Hengxing Cai, Wentao Guo, Hongshuai Wang, Mingjun Xu, Siyu Zhu, Guolin Ke, Linfeng Zhang, Zhifeng Gao
Main category: cs.LG
TL;DR: MolReasoner is a two-stage framework (Mol-SFT and Mol-RL) that enhances LLMs’ molecular reasoning by combining synthetic CoT samples and reinforcement learning, outperforming existing methods.
Details
Motivation: Current LLM approaches lack domain-specific molecular semantics and struggle with interpretability and reasoning depth in molecular tasks.
Method: 1. Mol-SFT: Uses GPT-4o-generated synthetic CoT samples to initialize reasoning. 2. Mol-RL: Applies reinforcement learning with specialized reward functions for chemical alignment.
Result: MolReasoner improves interpretability, molecular understanding, and generalization, outperforming existing methods.
Conclusion: MolReasoner shifts LLMs from memorization to robust chemical reasoning, marking a significant advancement in molecular tasks.
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various domains, yet their capabilities in molecular reasoning remain insufficiently explored. Current approaches tend to rely heavily on general-purpose prompting, which lacks domain-specific molecular semantics, while those that use fine-tuning strategies often face challenges with interpretability and reasoning depth. To address these issues, we introduce MolReasoner, a two-stage framework designed to transition LLMs from memorization towards chemical reasoning. First, we propose Mol-SFT, which initializes the model’s reasoning abilities via synthetic Chain-of-Thought (CoT) samples generated by GPT-4o and verified for chemical accuracy. Subsequently, Mol-RL applies reinforcement learning with specialized reward functions designed explicitly to align chemical structures with linguistic descriptions, thereby enhancing molecular reasoning capabilities. Our approach notably enhances interpretability, improving the model’s molecular understanding and enabling better generalization. Extensive experiments demonstrate that MolReasoner outperforms existing methods, marking a significant shift from memorization-based outputs to robust chemical reasoning.
[808] AGFT: An Adaptive GPU Frequency Tuner for Real-Time LLM Inference Optimization
Zicong Ye, Kunming Zhang, Guoming Tang
Main category: cs.LG
TL;DR: AGFT, an adaptive GPU frequency tuner using reinforcement learning, reduces GPU energy consumption by 44.3% with minimal performance impact (<10% latency overhead), optimizing energy-delay product by 40.3%.
Details
Motivation: The high energy costs of cloud GPUs running LLMs due to low-latency demands and dynamic workload volatility necessitate efficient power management.
Method: AGFT employs online reinforcement learning to autonomously learn optimal GPU frequency tuning policies, using real-time features like request load and latency for precise adjustments.
Result: AGFT saves 44.3% GPU energy with under 10% latency overhead, achieving a 40.3% Energy-Delay Product optimization.
Conclusion: AGFT enhances energy efficiency and economic benefits of LLM inference clusters without compromising service quality.
Abstract: The explosive growth of interactive Large Language Models (LLMs) has placed unprecedented demands for low latency on cloud GPUs, forcing them into high-power modes and causing escalating energy costs. Real-time inference workloads exhibit significant dynamic volatility, presenting substantial energy-saving opportunities. However, traditional static or rule-based power management strategies struggle to exploit these opportunities without compromising peak performance. To address this challenge, we propose AGFT (An Adaptive GPU Frequency Tuner), a framework that employs online reinforcement learning to autonomously learn an optimal frequency tuning policy. By monitoring real-time features like request load and latency, AGFT utilizes fine-grained frequency control for precise adjustments and intelligent action space pruning for stable, efficient decision-making. This creates a robust, automated energy management solution. We comprehensively evaluated AGFT in an environment simulating realistic, fluctuating inference requests. The experimental results demonstrate that AGFT successfully saves 44.3% of GPU energy consumption while introducing a minimal performance latency overhead of under 10%. This achievement translates into a comprehensive Energy-Delay Product (EDP) optimization of up to 40.3%, clearly showing that our framework can significantly enhance the energy efficiency and economic benefits of existing LLM inference clusters without compromising service quality.
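A bandit-style reduction of the frequency-tuning loop might look like the sketch below; the action space, load bucketing, and reward shaping are assumptions, since the abstract does not specify them:

```python
import random

FREQS_MHZ = [900, 1100, 1300, 1500, 1700]     # pruned action space (illustrative)
Q = {}                                         # (load_bucket, action) -> value

def choose(load_bucket, eps=0.1):
    """Epsilon-greedy action selection over the pruned frequency set."""
    if random.random() < eps:
        return random.randrange(len(FREQS_MHZ))
    return max(range(len(FREQS_MHZ)), key=lambda a: Q.get((load_bucket, a), 0.0))

def update(load_bucket, action, energy_j, latency_s, slo_s, lr=0.1):
    # Reward trades energy against an SLO-style latency penalty; the exact
    # reward shaping in AGFT is not given here, so this is an assumption.
    reward = -energy_j - (10.0 if latency_s > slo_s else 0.0)
    key = (load_bucket, action)
    Q[key] = Q.get(key, 0.0) + lr * (reward - Q.get(key, 0.0))

a = choose(load_bucket=2)
update(2, a, energy_j=5.2, latency_s=0.08, slo_s=0.1)
print(FREQS_MHZ[a], Q)
```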
[809] Energy-Efficient Federated Learning for Edge Real-Time Vision via Joint Data, Computation, and Communication Design
Xiangwang Hou, Jingjing Wang, Fangming Guan, Jun Du, Chunxiao Jiang, Yong Ren
Main category: cs.LG
TL;DR: FedDPQ is an energy-efficient federated learning framework for real-time computer vision on edge devices, integrating data augmentation, pruning, quantization, and power control to optimize performance under unreliable wireless conditions.
Details
Motivation: The need for energy-efficient and privacy-preserving learning in resource-constrained wireless edge environments drives the development of FedDPQ.
Method: FedDPQ combines diffusion-based data augmentation, model pruning, quantization, and adaptive power control, with a Bayesian optimization algorithm to tune these components.
Result: Experiments show FedDPQ achieves faster convergence and higher energy efficiency in computer vision tasks.
Conclusion: FedDPQ is the first framework to jointly optimize federated learning from data, computation, and communication perspectives under unreliable wireless conditions.
Abstract: Emerging real-time computer vision (CV) applications on wireless edge devices demand energy-efficient and privacy-preserving learning. Federated learning (FL) enables on-device training without raw data sharing, yet remains challenging in resource-constrained environments due to energy-intensive computation and communication, as well as limited and non-i.i.d. local data. We propose FedDPQ, an ultra energy-efficient FL framework for real-time CV over unreliable wireless networks. FedDPQ integrates diffusion-based data augmentation, model pruning, communication quantization, and transmission power control to enhance training efficiency. It expands local datasets using synthetic data, reduces computation through pruning, compresses updates via quantization, and mitigates transmission outages with adaptive power control. We further derive a closed-form energy-convergence model capturing the coupled impact of these components, and develop a Bayesian optimization (BO)-based algorithm to jointly tune data augmentation strategy, pruning ratio, quantization level, and power control. To the best of our knowledge, this is the first work to jointly optimize FL performance from the perspectives of data, computation, and communication under unreliable wireless conditions. Experiments on representative CV tasks show that FedDPQ achieves superior convergence speed and energy efficiency.
[810] CRINN: Contrastive Reinforcement Learning for Approximate Nearest Neighbor Search
Xiaoya Li, Xiaofei Sun, Albert Wang, Chris Shum, Jiwei Li
Main category: cs.LG
TL;DR: CRINN introduces a reinforcement learning-based approach for optimizing ANNS algorithms, achieving top performance on multiple benchmarks.
Details
Motivation: ANNS algorithms are critical for AI applications like RAG and agent-based LLMs, but manual optimization is labor-intensive. CRINN aims to automate this process.
Method: CRINN treats ANNS optimization as a reinforcement learning problem, using execution speed as the reward signal to generate faster implementations while maintaining accuracy.
Result: CRINN outperforms state-of-the-art ANNS algorithms on three benchmarks and ties on two others.
Conclusion: CRINN demonstrates that reinforcement learning-augmented LLMs can automate complex algorithmic optimizations, extending beyond ANNS.
Abstract: Approximate nearest-neighbor search (ANNS) algorithms have become increasingly critical for recent AI applications, particularly in retrieval-augmented generation (RAG) and agent-based LLM applications. In this paper, we present CRINN, a new paradigm for ANNS algorithms. CRINN treats ANNS optimization as a reinforcement learning problem where execution speed serves as the reward signal. This approach enables the automatic generation of progressively faster ANNS implementations while maintaining accuracy constraints. Our experimental evaluation demonstrates CRINN’s effectiveness across six widely-used NNS benchmark datasets. When compared against state-of-the-art open-source ANNS algorithms, CRINN achieves the best performance on three of them (GIST-960-Euclidean, MNIST-784-Euclidean, and GloVe-25-angular), and ties for first place on two of them (SIFT-128-Euclidean and GloVe-25-angular). The implications of CRINN’s success reach well beyond ANNS optimization: It validates that LLMs augmented with reinforcement learning can function as an effective tool for automating sophisticated algorithmic optimizations that demand specialized knowledge and labor-intensive manual refinement. Code can be found at https://github.com/deepreinforce-ai/CRINN
[811] Semantically-Guided Inference for Conditional Diffusion Models: Enhancing Covariate Consistency in Time Series Forecasting
Rui Ding, Hanyang Meng, Zeyang Zhang, Jielong Yang
Main category: cs.LG
TL;DR: SemGuide improves semantic alignment in diffusion models for time series forecasting using a scoring network and stepwise importance reweighting.
Details
Motivation: Diffusion models often misalign generated trajectories with conditioning covariates, especially in complex scenarios.
Method: Introduces a scoring network to assess alignment and uses stepwise importance reweighting during inference.
Result: Enhances predictive accuracy and covariate alignment, particularly in complex conditions.
Conclusion: SemGuide is a model-agnostic, plug-and-play solution for better semantic alignment in diffusion models.
Abstract: Diffusion models have demonstrated strong performance in time series forecasting, yet often suffer from semantic misalignment between generated trajectories and conditioning covariates, especially under complex or multimodal conditions. To address this issue, we propose SemGuide, a plug-and-play, inference-time method that enhances covariate consistency in conditional diffusion models. Our approach introduces a scoring network to assess the semantic alignment between intermediate diffusion states and future covariates. These scores serve as proxy likelihoods in a stepwise importance reweighting procedure, which progressively adjusts the sampling path without altering the original training process. The method is model-agnostic and compatible with any conditional diffusion framework. Experiments on real-world forecasting tasks show consistent gains in both predictive accuracy and covariate alignment, with especially strong performance under complex conditioning scenarios.
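One stepwise reweighting move can be sketched as particle resampling under the scoring network's proxy likelihoods; the toy score_net and the particle-set formulation are assumptions for illustration:

```python
import torch

def reweight_particles(states, covariates, score_net):
    """One stepwise importance-reweighting move: score each candidate
    diffusion state against the future covariates, treat scores as proxy
    likelihoods, and resample the particle set accordingly. score_net is
    assumed to output a scalar alignment score per state."""
    with torch.no_grad():
        logits = score_net(states, covariates).squeeze(-1)   # (n_particles,)
    weights = torch.softmax(logits, dim=0)
    idx = torch.multinomial(weights, num_samples=states.size(0), replacement=True)
    return states[idx]

score_net = lambda s, c: (s * c).sum(dim=-1, keepdim=True)   # toy alignment score
states = torch.randn(32, 8)        # 32 candidate intermediate states
cov = torch.randn(8)               # future covariate embedding
print(reweight_particles(states, cov, score_net).shape)      # torch.Size([32, 8])
```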
[812] A Trainable Optimizer
Ruiqi Wang, Diego Klabjan
Main category: cs.LG
TL;DR: A framework for trainable optimizers (TO) is introduced, outperforming manual methods like ADAM by jointly training gradient estimators and model weights. Pseudo-linear TO matches SGD’s convergence with lower variance and minimal overhead. Simplified variants further enhance efficiency, showing faster convergence in experiments.
Details
Motivation: To replace manually defined gradient estimators (e.g., ADAM) with trainable optimizers for better performance and efficiency.
Method: Joint training of gradient estimators and model weights using Pseudo-linear TO, a linear approximation of full gradients, and its simplified variants.
Result: TO methods converge faster than benchmarks (e.g., ADAM) in strongly convex, non-convex, and LLM fine-tuning settings.
Conclusion: Trainable optimizers, especially Pseudo-linear TO, offer efficient and superior alternatives to traditional gradient estimators.
Abstract: The concept of learning to optimize involves utilizing a trainable optimization strategy rather than relying on manually defined full gradient estimations such as ADAM. We present a framework that jointly trains the full gradient estimator and the trainable weights of the model. Specifically, we prove that pseudo-linear TO (Trainable Optimizer), a linear approximation of the full gradient, matches SGD’s convergence rate while effectively reducing variance. Pseudo-linear TO incurs negligible computational overhead, requiring only minimal additional tensor multiplications. To further improve computational efficiency, we introduce two simplified variants of Pseudo-linear TO. Experiments demonstrate that TO methods converge faster than benchmark algorithms (e.g., ADAM) in both strongly convex and non-convex settings, as well as in fine-tuning an LLM.
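A minimal caricature of a trainable optimizer: a learned per-weight linear map applied to the gradient. The diagonal parameterization is an assumption, and how the map itself is trained (jointly with the model, per the paper) is omitted here:

```python
import torch

class PseudoLinearTO:
    """Toy trainable optimizer: updates use a per-weight linear map of the
    gradient, g_hat = d * g. In the paper the map is trained jointly with
    the model; updating d (e.g., via meta-gradients) is omitted here."""
    def __init__(self, params, lr=0.01):
        self.params = list(params)
        self.lr = lr
        self.d = [torch.ones_like(p) for p in self.params]  # learnable scales

    def step(self):
        with torch.no_grad():
            for p, d in zip(self.params, self.d):
                if p.grad is not None:
                    p -= self.lr * d * p.grad               # preconditioned SGD step

model = torch.nn.Linear(4, 1)
opt = PseudoLinearTO(model.parameters())
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
opt.step()
```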
[813] LeanK: Learnable K Cache Channel Pruning for Efficient Decoding
Yike Zhang, Zhiyuan He, Huiqiang Jiang, Chengruidong Zhang, Yuqing Yang, Jianyong Wang, Lili Qiu
Main category: cs.LG
TL;DR: LeanK is a learning-based method to prune unimportant key cache channels in LLMs, reducing GPU memory and speeding up decoding without accuracy loss.
Details
Motivation: Large language models face efficiency issues due to the growing key-value cache, prompting the need for optimization.
Method: LeanK uses a two-stage training process to learn static channel masks for pruning, meeting sparsity and hardware alignment requirements.
Result: Achieves up to 70% K cache and 16%-18% V cache memory reduction, with a 1.3x speedup in attention computation.
Conclusion: LeanK effectively optimizes LLM efficiency while maintaining accuracy and provides insights into model channels and attention heads.
Abstract: Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. With a novel two-stage training process, LeanK learns channel-wise static mask that could satisfy specific sparsity ratio and hardware alignment requirement. LeanK reduces GPU memory and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%-18% V cache memory reduction. Custom decoding kernel enables 1.3x speedup for attention computation. We also provide insights into model channels and attention heads during long-context inference by analyzing the learned importance distribution. Our code is available at https://aka.ms/LeanK.
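Applying a learned static channel mask to the key cache is straightforward to sketch; a real implementation would drop the masked channels to realize the memory savings, whereas this toy zeroes them to keep shapes simple:

```python
import torch

def prune_k_cache(k_cache: torch.Tensor, channel_mask: torch.Tensor) -> torch.Tensor:
    """Apply a learned static channel mask to the key cache.
    k_cache: (batch, heads, seq_len, head_dim); channel_mask: (heads, head_dim)
    binary, learned offline to hit a target sparsity."""
    return k_cache * channel_mask[None, :, None, :]

k = torch.randn(1, 8, 128, 64)
mask = (torch.rand(8, 64) > 0.7).float()       # ~70% of K channels pruned
print(prune_k_cache(k, mask).shape, f"kept {mask.mean():.0%} of channels")
```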
[814] VAGPO: Vision-augmented Asymmetric Group Preference Optimization for the Routing Problems
Shiyan Liu, Bohan Tan, Yan Jin
Main category: cs.LG
TL;DR: VAGPO is a novel method combining ResNet and Transformer for routing problems, improving training efficiency and scalability.
Details
Motivation: Address limitations in training efficiency and generalization of data-driven methods for large-scale routing problems.
Method: Uses ResNet for visual encoding and Transformer for sequential modeling, with an asymmetric group preference optimization strategy.
Result: Achieves competitive solution quality and strong generalization to large instances (up to 1000 nodes) without re-training.
Conclusion: VAGPO is effective in learning efficiency and scalability for routing problems.
Abstract: The routing problems such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) are well-known combinatorial optimization challenges with broad practical relevance. Recent data-driven optimization methods have made significant progress, yet they often face limitations in training efficiency and generalization to large-scale instances. In this paper, we propose a novel Vision-Augmented Asymmetric Group Preference Optimization (VAGPO) approach for solving the routing problems. By leveraging ResNet-based visual encoding and Transformer-based sequential modeling, VAGPO captures both spatial structure and temporal dependencies. Furthermore, we introduce an asymmetric group preference optimization strategy that significantly accelerates convergence compared to commonly used policy gradient methods. Experimental results on TSP and CVRP benchmarks show that the proposed VAGPO not only achieves highly competitive solution quality but also exhibits strong generalization to larger instances (up to 1000 nodes) without re-training, highlighting its effectiveness in both learning efficiency and scalability.
[815] CellForge: Agentic Design of Virtual Cell Models
Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, Arman Cohan, Xihong Lin, Fabian Theis, Smita Krishnaswamy, Mark Gerstein
Main category: cs.LG
TL;DR: CellForge is an AI-driven system that autonomously builds optimized computational models for virtual cells using multi-agent collaboration, outperforming state-of-the-art methods in perturbation prediction.
Details
Motivation: The complexity of biological systems and the need for interdisciplinary expertise make autonomous virtual cell modeling challenging. CellForge addresses this by integrating AI and multi-agent frameworks.
Method: CellForge uses a multi-agent framework with three modules: Task Analysis, Method Design (with specialized agents and a moderator), and Experiment Execution. It processes raw single-cell multi-omics data to generate optimized models and executable code.
Result: CellForge outperforms state-of-the-art methods in single-cell perturbation prediction across six diverse datasets, demonstrating the effectiveness of its multi-agent approach.
Conclusion: CellForge shows that iterative collaboration among AI agents with diverse perspectives yields superior solutions for virtual cell modeling, offering a scalable and automated approach.
Abstract: Virtual cell modeling represents an emerging frontier at the intersection of artificial intelligence and biology, aiming to quantitatively predict quantities such as responses to diverse perturbations. However, autonomously building computational models for virtual cells is challenging due to the complexity of biological systems, the heterogeneity of data modalities, and the need for domain-specific expertise across multiple disciplines. Here, we introduce CellForge, an agentic system that leverages a multi-agent framework that transforms presented biological datasets and research objectives directly into optimized computational models for virtual cells. More specifically, given only raw single-cell multi-omics data and task descriptions as input, CellForge outputs both an optimized model architecture and executable code for training virtual cell models and running inference. The framework integrates three core modules: Task Analysis for characterizing the presented dataset and retrieving relevant literature, Method Design, where specialized agents collaboratively develop optimized modeling strategies, and Experiment Execution for automated generation of code. The agents in the Design module are separated into experts with differing perspectives and a central moderator, and have to collaboratively exchange solutions until they achieve a reasonable consensus. We demonstrate CellForge’s capabilities in single-cell perturbation prediction, using six diverse datasets that encompass gene knockouts, drug treatments, and cytokine stimulations across multiple modalities. CellForge consistently outperforms task-specific state-of-the-art methods. Overall, CellForge demonstrates how iterative interaction between LLM agents with differing perspectives provides better solutions than directly addressing a modeling challenge. Our code is publicly available at https://github.com/gersteinlab/CellForge.
[816] Mitigating Persistent Client Dropout in Asynchronous Decentralized Federated Learning
Ignacy Stępka, Nicholas Gisolfi, Kacper Trębacz, Artur Dubrawski
Main category: cs.LG
TL;DR: The paper addresses persistent client dropout in asynchronous Decentralized Federated Learning (DFL), proposing adaptive strategies for recovery and showing their effectiveness in maintaining robustness.
Details
Motivation: Client dropout in asynchronous DFL complicates recovery due to limited information about model updates, data distributions, and learning epochs. Existing mitigations are insufficient.
Method: Introduces adaptive strategies based on client reconstruction, tested on tabular and image datasets with three DFL algorithms and data heterogeneity scenarios (iid, non-iid, class-focused non-iid).
Result: The strategies effectively recover some performance loss caused by dropout, though they don’t precisely reconstruct missing client data.
Conclusion: The work highlights the potential of adaptive strategies for robustness in DFL and identifies future research directions for client dropout issues.
Abstract: We consider the problem of persistent client dropout in asynchronous Decentralized Federated Learning (DFL). Asynchronicity and decentralization obfuscate information about model updates among federation peers, making recovery from a client dropout difficult. Access to the number of learning epochs, data distributions, and all the information necessary to precisely reconstruct the missing neighbor’s loss functions is limited. We show that obvious mitigations do not adequately address the problem and introduce adaptive strategies based on client reconstruction. We show that these strategies can effectively recover some performance loss caused by dropout. Our work focuses on asynchronous DFL with local regularization and differs substantially from that in the existing literature. We evaluate the proposed methods on tabular and image datasets, involving three DFL algorithms and three data heterogeneity scenarios (iid, non-iid, class-focused non-iid). Our experiments show that the proposed adaptive strategies can be effective in maintaining robustness of federated learning, even if they do not reconstruct the missing client’s data precisely. We also discuss the limitations and identify future avenues for tackling the problem of client dropout.
[817] Neural Predictive Control to Coordinate Discrete- and Continuous-Time Models for Time-Series Analysis with Control-Theoretical Improvements
Haoran Li, Muhao Guo, Yang Weng, Hanghang Tong
Main category: cs.LG
TL;DR: The paper proposes a continuous ODE-based optimal control framework for time-series analysis, combining discrete-time models for context and control-theoretical guarantees for robustness.
Details
Motivation: Existing methods using unconstrained neural networks struggle with distributional shifts, prompting the need for a more reliable approach.
Method: Recasts time-series problems as ODE-based optimal control, using discrete-time models for context and model predictive control for optimization.
Result: The method achieves superior generalization and adaptability on diverse datasets, with theoretical guarantees of exponential convergence.
Conclusion: The proposed coordinate model offers robust and generalizable performance, validated by extensive experiments.
Abstract: Deep sequence models have achieved notable success in time-series analysis, such as interpolation and forecasting. Recent advances move beyond discrete-time architectures like Recurrent Neural Networks (RNNs) toward continuous-time formulations such as the family of Neural Ordinary Differential Equations (Neural ODEs). Generally, they have shown that capturing the underlying dynamics is beneficial for generic tasks like interpolation, extrapolation, and classification. However, existing methods approximate the dynamics using unconstrained neural networks, which struggle to adapt reliably under distributional shifts. In this paper, we recast time-series problems as the continuous ODE-based optimal control problem. Rather than learning dynamics solely from data, we optimize control actions that steer ODE trajectories toward task objectives, bringing control-theoretical performance guarantees. To achieve this goal, we need to (1) design the appropriate control actions and (2) apply effective optimal control algorithms. As the actions should contain rich context information, we propose to employ the discrete-time model to process past sequences and generate actions, leading to a coordinate model to extract long-term temporal features to modulate short-term continuous dynamics. During training, we apply model predictive control to plan multi-step future trajectories, minimize a task-specific cost, and greedily select the optimal current action. We show that, under mild assumptions, this multi-horizon optimization leads to exponential convergence to infinite-horizon solutions, indicating that the coordinate model can gain robust and generalizable performance. Extensive experiments on diverse time-series datasets validate our method’s superior generalization and adaptability compared to state-of-the-art baselines.
[818] CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment
Guofu Xie, Yunsheng Shi, Hongtao Tian, Ting Yao, Xiao Zhang
Main category: cs.LG
TL;DR: CAPO introduces fine-grained credit assignment in RLVR by using an LLM as a Generative Process Reward Model, outperforming existing methods.
Details
Motivation: Current RLVR methods assign uniform rewards to all tokens, hindering precise credit assignment and leading to suboptimal learning.
Method: CAPO leverages an LLM (LLM-as-GenPRM) to generate step-wise critiques for token-level rewards, enhancing credit assignment.
Result: CAPO outperforms supervised and RL-based methods on mathematical and out-of-domain benchmarks.
Conclusion: CAPO provides efficient, verifiable, and fine-grained credit assignment, improving RLVR performance.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback, helping to mitigate reward hacking. However, current RLVR methods typically treat whole responses as single actions, assigning the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure, and often results in suboptimal policies and inefficient learning. Methods like PPO provide credit assignment through value estimation, but often yield inaccurate and unverifiable signals due to limited sampling. On the other hand, methods using Process Reward Models can provide step-by-step judgments for each reasoning step, but they require high-quality process supervision labels and are time-consuming when applied in online reinforcement learning (RL). To overcome these limitations, we introduce a simple but efficient method, Credit Assignment Policy Optimization (CAPO). Given a reasoning response rollout from the policy model, CAPO directly leverages an off-the-shelf, general-purpose LLM as a Generative Process Reward Model (LLM-as-GenPRM) to generate all step-wise critiques in one pass, thereby providing verifiable token-level rewards to refine the tokens that were originally assigned identical rule-based rewards. This enables more fine-grained credit assignment in an effective way. Furthermore, to enhance the accuracy and robustness of CAPO, we employ voting mechanisms that scale with the number of generated critiques. Extensive experiments using different backbones like Llama and Qwen models and in different sizes show that CAPO consistently outperforms supervised learning-based and RL-based fine-tuning methods across six challenging mathematical benchmarks and three out-of-domain benchmarks.
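How step-wise critiques could refine a uniform sequence-level reward into token-level rewards might look like this; the span bookkeeping and shaping constants are illustrative assumptions:

```python
def token_rewards(step_spans, step_verdicts, n_tokens, base_reward):
    """Spread a rule-based sequence reward into token-level rewards using
    step-wise critiques from an LLM judge (LLM-as-GenPRM). step_spans maps
    each reasoning step to its (start, end) token range; step_verdicts holds
    the judge's correct/incorrect call per step."""
    rewards = [base_reward / n_tokens] * n_tokens
    for (start, end), ok in zip(step_spans, step_verdicts):
        bonus = 0.1 if ok else -0.1        # shaping values made up for the demo
        for t in range(start, end):
            rewards[t] += bonus
    return rewards

spans = [(0, 4), (4, 9)]                   # two reasoning steps
print(token_rewards(spans, [True, False], n_tokens=9, base_reward=1.0))
```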
[819] Causal Discovery in Multivariate Time Series through Mutual Information Featurization
Gian Marco Paldino, Gianluca Bontempi
Main category: cs.LG
TL;DR: The paper introduces TD2C, a supervised learning framework for discovering causal relationships in complex multivariate time series by recognizing persistent asymmetries in information flow, outperforming traditional methods.
Details
Motivation: Traditional methods for causal discovery in time series struggle with non-linear dynamics and restrictive assumptions, prompting the need for a new approach.
Method: TD2C uses supervised learning to identify causal signatures from information-theoretic and statistical descriptors, trained on synthetic data for zero-shot generalization.
Result: TD2C achieves state-of-the-art performance, especially in high-dimensional and non-linear settings, outperforming existing methods.
Conclusion: TD2C offers a robust and scalable tool for uncovering causal structures in complex systems by shifting from statistical testing to pattern recognition.
Abstract: Discovering causal relationships in complex multivariate time series is a fundamental scientific challenge. Traditional methods often falter, either by relying on restrictive linear assumptions or on conditional independence tests that become uninformative in the presence of intricate, non-linear dynamics. This paper proposes a new paradigm, shifting from statistical testing to pattern recognition. We hypothesize that a causal link creates a persistent and learnable asymmetry in the flow of information through a system’s temporal graph, even when clear conditional independencies are obscured. We introduce Temporal Dependency to Causality (TD2C), a supervised learning framework that operationalizes this hypothesis. TD2C learns to recognize these complex causal signatures from a rich set of information-theoretic and statistical descriptors. Trained exclusively on a diverse collection of synthetic time series, TD2C demonstrates remarkable zero-shot generalization to unseen dynamics and established, realistic benchmarks. Our results show that TD2C achieves state-of-the-art performance, consistently outperforming established methods, particularly in high-dimensional and non-linear settings. By reframing the discovery problem, our work provides a robust and scalable new tool for uncovering causal structures in complex systems.
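The shift from statistical testing to pattern recognition can be sketched end to end: featurize candidate pairs with asymmetric descriptors, then train a classifier on synthetic data. The descriptor set below is a tiny stand-in for the paper's information-theoretic features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pair_descriptors(x: np.ndarray, y: np.ndarray, lag: int = 1) -> list[float]:
    """Asymmetric descriptors for a candidate edge x -> y: lagged correlations
    in both directions plus a variance ratio."""
    cxy = np.corrcoef(x[:-lag], y[lag:])[0, 1]   # x leading y
    cyx = np.corrcoef(y[:-lag], x[lag:])[0, 1]   # y leading x
    return [cxy, cyx, cxy - cyx, np.var(x) / (np.var(y) + 1e-9)]

rng = np.random.default_rng(0)
X, labels = [], []
for _ in range(400):                             # synthetic training pairs
    x = rng.standard_normal(200)
    y = 0.8 * np.roll(x, 1) + 0.3 * rng.standard_normal(200)  # x causes y
    causal = rng.random() < 0.5
    X.append(pair_descriptors(x, y) if causal else pair_descriptors(y, x))
    labels.append(int(causal))
clf = RandomForestClassifier().fit(X, labels)
print("train accuracy:", clf.score(X, labels))
```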
[820] Proactive Constrained Policy Optimization with Preemptive Penalty
Ning Yang, Pengyu Wang, Guoqing Liu, Haifeng Zhang, Pin Lyu, Jun Wang
Main category: cs.LG
TL;DR: PCPO introduces a proactive penalty mechanism and constraint-aware intrinsic reward to improve stability and reduce violations in Safe RL, outperforming traditional Lagrangian methods.
Details
Motivation: Addressing instability and constraint violations in Safe RL, PCPO aims to preemptively manage constraints rather than react post-violation.
Method: PCPO integrates barrier items into the objective function and uses constraint-aware intrinsic rewards for boundary-aware exploration, supported by policy iteration.
Result: PCPO shows significant stability and robust performance in experiments, with theoretical bounds on duality gap and convergence.
Conclusion: PCPO offers a promising framework for constrained policy optimization, with potential for future research and practical applications.
Abstract: Safe Reinforcement Learning (RL) often faces significant issues such as constraint violations and instability, necessitating the use of constrained policy optimization, which seeks optimal policies while ensuring adherence to specific constraints like safety. Typically, constrained optimization problems are addressed by the Lagrangian method, a post-violation remedial approach that may result in oscillations and overshoots. Motivated by this, we propose a novel method named Proactive Constrained Policy Optimization (PCPO) that incorporates a preemptive penalty mechanism. This mechanism integrates barrier items into the objective function as the policy nears the boundary, imposing a cost. Meanwhile, we introduce a constraint-aware intrinsic reward to guide boundary-aware exploration, which is activated only when the policy approaches the constraint boundary. We establish theoretical upper and lower bounds for the duality gap and the performance of the PCPO update, shedding light on the method’s convergence characteristics. Additionally, to enhance the optimization performance, we adopt a policy iteration approach. An interesting finding is that PCPO demonstrates significant stability in experiments. Experimental results indicate that the PCPO framework provides a robust solution for policy optimization under constraints, with important implications for future research and practical applications.
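The preemptive flavor of the penalty is easy to illustrate: it stays at zero away from the constraint boundary and grows as the expected cost approaches the budget. The log-barrier form and margin below are assumptions, not the paper's exact barrier items:

```python
import math

def preemptive_penalty(cost, budget, margin=0.1, scale=0.5):
    """Log-barrier style penalty that activates *before* the constraint is
    violated: zero far from the boundary, growing as cost nears the budget."""
    slack = budget - cost
    if slack >= margin:
        return 0.0                        # far from boundary: no penalty
    slack = max(slack, 1e-6)              # keep the barrier finite
    return -scale * math.log(slack / margin)

for c in (0.5, 0.93, 0.999):
    print(f"cost={c:.3f} penalty={preemptive_penalty(c, budget=1.0):.3f}")
```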
[821] Language Model Guided Reinforcement Learning in Quantitative Trading
Adam Darmanin, Vince Vella
Main category: cs.LG
TL;DR: A hybrid system combining LLMs and RL improves trading performance by leveraging LLMs for strategic guidance and RL for tactical execution.
Details
Motivation: Address the limitations of RL in algorithmic trading, such as myopic behavior and lack of transparency, by integrating LLMs for strategic reasoning.
Method: Use LLMs to generate high-level trading strategies to guide RL agents, evaluating strategy rationale and performance metrics (Sharpe Ratio, Maximum Drawdown).
Result: LLM-guided RL agents outperform standard RL in return and risk metrics.
Conclusion: The hybrid approach enhances algorithmic trading by combining the strengths of LLMs and RL.
Abstract: Algorithmic trading requires short-term decisions aligned with long-term financial goals. While reinforcement learning (RL) has been explored for such tactical decisions, its adoption remains limited by myopic behavior and opaque policy rationale. In contrast, large language models (LLMs) have recently demonstrated strategic reasoning and multi-modal financial signal interpretation when guided by well-designed prompts. We propose a hybrid system where LLMs generate high-level trading strategies to guide RL agents in their actions. We evaluate (i) the rationale of LLM-generated strategies via expert review, and (ii) the Sharpe Ratio (SR) and Maximum Drawdown (MDD) of LLM-guided agents versus unguided baselines. Results show improved return and risk metrics over standard RL.
[822] How Does Controllability Emerge In Language Models During Pretraining?
Jianshu She, Xinyue Li, Eric Xing, Zhengzhong Liu, Qirong Ho
Main category: cs.LG
TL;DR: The paper explores how linear steerability in language models emerges during training, showing that closely related concepts become steerable at distinct stages. It introduces the ‘Intervention Detector’ (ID) framework to analyze and interpret this dynamic.
Details
Motivation: To understand the unclear conditions for effective interventions in language models and how steerability evolves during training.
Method: Develops the ‘Intervention Detector’ (ID) framework to analyze linear steerability dynamics through hidden state and representation analysis, using metrics like heatmaps and cosine similarity.
Result: Linear steerability emerges during intermediate training stages, with closely related concepts becoming steerable at distinct times. ID reveals increasing linear separability of concepts in hidden space, correlating with steerability.
Conclusion: The ID framework provides insights into the dynamics of linear steerability during training, applicable across model families, enhancing interpretability and control of language models.
Abstract: Language models can be steered by modifying their internal representations to control concepts such as emotion, style, or truthfulness in generation. However, the conditions for an effective intervention remain unclear and are often validated through heuristics and trial-and-error. To fill this gap, we demonstrate that intervention efficacy, measured by linear steerability (i.e., the ability to adjust output via linear transformations of hidden states), emerges during intermediate stages of training. Moreover, even closely related concepts (e.g., anger and sadness) exhibit steerability emergence at distinct stages of training. To better interpret the dynamics of steerability during training, we adapt existing intervention techniques into a unified framework, referred to as the “Intervention Detector” (ID), which is designed to reveal how linear steerability evolves over the course of training through hidden state and representation analysis. ID reveals that concepts become increasingly linearly separable in the hidden space as training progresses, which strongly correlates with the emergence of linear steerability. We further introduce ID-based metrics, such as heatmaps, entropy trends, and cosine similarity, to help interpret how linear steerability evolves throughout training. In addition, we apply ID across different model families to ensure the generality of our findings on steerability dynamics.
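Linear steering and a crude steerability proxy can be sketched as follows; the shift-along-a-direction intervention is the standard construction, while the cosine-based proxy is only in the spirit of the ID metrics, not their exact definition:

```python
import torch

def steer(hidden, direction, alpha=4.0):
    """Linear steering: shift hidden states along a concept direction.
    hidden: (seq, d); direction: (d,)."""
    return hidden + alpha * direction / direction.norm()

def steerability_proxy(acts_with, acts_without):
    """Cosine-similarity proxy: how consistently a concept direction
    separates paired activations across examples."""
    diff = (acts_with - acts_without).mean(dim=0)
    per_example = torch.nn.functional.cosine_similarity(
        acts_with - acts_without, diff[None, :], dim=-1)
    return per_example.mean().item()

acts_pos = torch.randn(64, 256) + 0.5      # activations with the concept
acts_neg = torch.randn(64, 256)            # activations without it
print(steer(torch.randn(4, 256), torch.randn(256)).shape)
print(steerability_proxy(acts_pos, acts_neg))
```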
[823] From Binary to Continuous: Stochastic Re-Weighting for Robust Graph Explanation
Zhuomin Chen, Jingchao Ni, Hojat Allah Salehi, Xu Zheng, Dongsheng Luo
Main category: cs.LG
TL;DR: The paper proposes an iterative framework to improve GNN explanation robustness by aligning training and explanation graph distributions.
Details
Motivation: Existing GNN explanation methods suffer from unreliable gradients due to distributional shifts between training and explanation graphs.
Method: An iterative framework alternates between identifying explanation subgraphs and adapting the model by retraining on weighted graphs.
Result: The method improves explanation quality across benchmarks and integrates flexibly with various GNN architectures.
Conclusion: The proposed iterative refinement enhances GNN explanation robustness and reliability.
Abstract: Graph Neural Networks (GNNs) have achieved remarkable performance in a wide range of graph-related learning tasks. However, explaining their predictions remains a challenging problem, especially due to the mismatch between the graphs used during training and those encountered during explanation. Most existing methods optimize soft edge masks on weighted graphs to highlight important substructures, but these graphs differ from the unweighted graphs on which GNNs are trained. This distributional shift leads to unreliable gradients and degraded explanation quality, especially when generating small, sparse subgraphs. To address this issue, we propose a novel iterative explanation framework which improves explanation robustness by aligning the model’s training data distribution with the weighted graph distribution appeared during explanation. Our method alternates between two phases: explanation subgraph identification and model adaptation. It begins with a relatively large explanation subgraph where soft mask optimization is reliable. Based on this subgraph, we assign importance-aware edge weights to explanatory and non-explanatory edges, and retrain the GNN on these weighted graphs. This process is repeated with progressively smaller subgraphs, forming an iterative refinement procedure. We evaluate our method on multiple benchmark datasets using different GNN backbones and explanation methods. Experimental results show that our method consistently improves explanation quality and can be flexibly integrated with different architectures.
[824] Inferring Reward Machines and Transition Machines from Partially Observable Markov Decision Processes
Yuly Wu, Jiamou Liu, Libo Zhang
Main category: cs.LG
TL;DR: The paper introduces Transition Machines (TMs) and Dual Behavior Mealy Machines (DBMMs) to address limitations in learning policies for POMDPs, proposing DB-RPNI for efficient automata inference with significant speedups.
Details
Motivation: Learning policies in partially observable environments is challenging due to non-Markovian observations, and existing automaton representations are limited to reward-based non-Markovianity or are computationally expensive.
Method: The authors introduce TMs and DBMMs, develop the DB-RPNI algorithm for unified inference, and optimize it for minimal correct automata.
Result: DB-RPNI achieves speedups of up to three orders of magnitude over state-of-the-art baselines.
Conclusion: The proposed methods effectively address the limitations of existing approaches, offering efficient and unified solutions for learning in POMDPs.
Abstract: Partially Observable Markov Decision Processes (POMDPs) are fundamental to many real-world applications. Although reinforcement learning (RL) has shown success in fully observable domains, learning policies from traces in partially observable environments remains challenging due to non-Markovian observations. Inferring an automaton to handle the non-Markovianity is a proven effective approach, but faces two limitations: 1) existing automaton representations focus only on reward-based non-Markovianity, leading to unnatural problem formulations; 2) inference algorithms face enormous computational costs. For the first limitation, we introduce Transition Machines (TMs) to complement existing Reward Machines (RMs). To develop a unified inference algorithm for both automata types, we propose the Dual Behavior Mealy Machine (DBMM) that subsumes both TMs and RMs. We then introduce DB-RPNI, a passive automata learning algorithm that efficiently infers DBMMs while avoiding the costly reductions required by prior work. We further develop optimization techniques and identify sufficient conditions for inferring the minimal correct automata. Experimentally, our inference method achieves speedups of up to three orders of magnitude over SOTA baselines.
[825] What are you sinking? A geometric approach on attention sink
Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri
Main category: cs.LG
TL;DR: Attention sink (AS) in transformers is a fundamental geometric principle for establishing reference frames, not just an artifact. It emerges early in training and is influenced by architecture components like position encoding.
Details
Motivation: To understand the underlying geometric principles behind the attention sink phenomenon in transformers and its role in establishing reference frames.
Method: Analyzed various transformer architectures to identify three reference frame types (centralized, distributed, bidirectional) and studied their emergence during training.
Result: Attention sink is a manifestation of optimal solutions for stable coordinate systems in high-dimensional spaces, influenced by position encoding.
Conclusion: This perspective redefines transformer attention mechanisms, offering insights for architecture design and understanding AS.
Abstract: Attention sink (AS) is a consistent pattern in transformer attention maps where certain tokens (often special tokens or positional anchors) disproportionately attract attention from other tokens. We show that in transformers, AS is not an architectural artifact but the manifestation of a fundamental geometric principle: the establishment of reference frames that anchor representational spaces. We analyze several architectures and identify three distinct reference frame types (centralized, distributed, and bidirectional) that correlate with the attention sink phenomenon. We show that they emerge during the earliest stages of training as optimal solutions to the problem of establishing stable coordinate systems in high-dimensional spaces. We show the influence of architecture components, particularly position encoding implementations, on the specific type of reference frame. This perspective transforms our understanding of transformer attention mechanisms and provides insights for architecture design and for understanding AS.
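Measuring a sink is simple in practice: check how much post-softmax attention mass lands on a candidate anchor token. This is a measurement sketch, not the paper's analysis pipeline:

```python
import torch

def sink_mass(attn: torch.Tensor, sink_pos: int = 0) -> float:
    """Fraction of attention mass directed at a candidate sink token.
    attn: (heads, query_len, key_len) post-softmax attention map; high values
    at sink_pos (often token 0) indicate anchoring behaviour."""
    return attn[:, :, sink_pos].mean().item()

attn = torch.softmax(torch.randn(8, 32, 32), dim=-1)
attn[:, :, 0] += 0.5                      # fake a sink for the demo
attn = attn / attn.sum(dim=-1, keepdim=True)
print(f"sink mass at position 0: {sink_mass(attn):.2f}")
```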
[826] Navigating High Dimensional Concept Space with Metalearning
Max Gupta
Main category: cs.LG
TL;DR: Meta-learning improves few-shot concept learning, especially for compositional complexity, by leveraging gradient adaptation and exploring loss basins.
Details
Motivation: To understand if gradient-based meta-learning can enhance neural networks’ ability to learn discrete concepts from few examples.
Method: Comparison of meta-learning methods with supervised learning on Boolean tasks generated by a PCFG, varying concept dimensionality and compositionality.
Result: Meta-learners outperform in handling compositional complexity, and increased adaptation steps in meta-SGD improve generalization.
Conclusion: Meta-learning is effective for compositional concepts, with 2nd-order methods and extended gradient adaptation playing key roles.
Abstract: Rapidly learning abstract concepts from limited examples is a hallmark of human intelligence. This work investigates whether gradient-based meta-learning can equip neural networks with inductive biases for efficient few-shot acquisition of discrete concepts. We compare meta-learning methods against a supervised learning baseline on Boolean tasks generated by a probabilistic context-free grammar (PCFG). By systematically varying concept dimensionality (number of features) and compositionality (depth of grammar recursion), we identify regimes in which meta-learning robustly improves few-shot concept learning. We find improved performance and sample efficiency by training a multilayer perceptron (MLP) across concept spaces increasing in dimensional and compositional complexity. We show that meta-learners handle compositional complexity far better than featural complexity, and we establish an empirical analysis demonstrating how featural complexity shapes ‘concept basins’ of the loss landscape, allowing curvature-aware optimization to be more effective than first-order methods. We can robustly increase generalization on complex concepts by increasing the number of adaptation steps in meta-SGD, encouraging exploration of rougher loss basins. Overall, this work highlights the intricacies of learning compositional versus featural complexity in high-dimensional concept spaces and provides a roadmap for understanding the role of second-order methods and extended gradient adaptation in few-shot concept learning.
[827] Parameter-Efficient Routed Fine-Tuning: Mixture-of-Experts Demands Mixture of Adaptation Modules
Yilun Liu, Yunpu Ma, Yuetian Lu, Shuo Chen, Zifeng Ding, Volker Tresp
Main category: cs.LG
TL;DR: The paper explores integrating routing mechanisms into Parameter-Efficient Fine-Tuning (PEFT) for Mixture-of-Experts (MoE) models, analyzing their impact on adaptation effectiveness and identifying optimal configurations.
Details
Motivation: Existing PEFT strategies do not leverage MoE's dynamic routing, prompting investigation into whether adaptation modules should incorporate routing to align with MoE's multi-expert architecture.
Method: Analyzes dynamics of core components when applying PEFT to MoE models, examines routing strategies’ effects, and conducts experiments on OLMoE-1B-7B and Mixtral-8x7B for commonsense and math reasoning tasks.
Result: Validates the performance and efficiency of the routed approach, identifying optimal configurations for different scenarios.
Conclusion: Provides empirical analyses and practical insights to improve PEFT and MoE applications.
Abstract: Mixture-of-Experts (MoE) models benefit from a dynamic routing mechanism among their specialized experts, which existing Parameter-Efficient Fine-Tuning (PEFT) strategies fail to leverage. This motivates us to investigate whether adaptation modules themselves should incorporate routing mechanisms to align with MoE’s multi-expert architecture. We analyze the dynamics of core components when applying PEFT to MoE language models and examine how different routing strategies affect adaptation effectiveness. Extensive experiments adapting OLMoE-1B-7B and Mixtral-8x7B on various commonsense and math reasoning tasks validate the performance and efficiency of our routed approach. We identify the optimal configurations for different scenarios and provide empirical analyses with practical insights to facilitate better PEFT and MoE applications.
[828] Flow-Aware GNN for Transmission Network Reconfiguration via Substation Breaker Optimization
Dekang Meng, Rabab Haider, Pascal van Hentenryck
Main category: cs.LG
TL;DR: OptiGridML is a neural framework for topology optimization in power grids, replacing NP-hard MIP with GNNs to improve power exports and reduce computation time.
Details
Motivation: Traditional MIP methods for power grid topology optimization are computationally intractable for large networks.
Method: Uses a two-stage neural architecture: LGNN for DC flow approximation and HeteroGNN for breaker state prediction, linked by physics-informed loss.
Result: Achieves 18% power export improvement and reduces inference time from hours to milliseconds.
Conclusion: Demonstrates the potential of flow-aware GNNs for accelerating combinatorial optimization in physical systems.
Abstract: This paper introduces OptiGridML, a machine learning framework for discrete topology optimization in power grids. The task involves selecting substation breaker configurations that maximize cross-region power exports, a problem typically formulated as a mixed-integer program (MIP) that is NP-hard and computationally intractable for large networks. OptiGridML replaces repeated MIP solves with a two-stage neural architecture: a line-graph neural network (LGNN) that approximates DC power flows for a given network topology, and a heterogeneous GNN (HeteroGNN) that predicts breaker states under structural and physical constraints. A physics-informed consistency loss connects these components by enforcing Kirchhoff’s law on predicted flows. Experiments on synthetic networks with up to 1,000 breakers show that OptiGridML achieves power export improvements of up to 18% over baseline topologies, while reducing inference time from hours to milliseconds. These results demonstrate the potential of structured, flow-aware GNNs for accelerating combinatorial optimization in physical networked systems.
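The physics-informed consistency loss can be illustrated with a short sketch: penalize, at every bus, the mismatch between net predicted line flows and that bus's injection, which is Kirchhoff's law in matrix form. The incidence-matrix formulation and variable names below are assumptions for illustration, not OptiGridML's actual interface:

```python
# Hedged sketch of a Kirchhoff-consistency loss: the nodal balance residual
# A @ f - p should be zero for physically valid DC flows.
import torch

def kirchhoff_loss(incidence: torch.Tensor,   # (n_buses, n_lines), entries +1/-1/0
                   flows: torch.Tensor,       # (n_lines,) predicted DC line flows
                   injections: torch.Tensor   # (n_buses,) net generation minus load
                   ) -> torch.Tensor:
    residual = incidence @ flows - injections  # nodal power-balance residual
    return (residual ** 2).mean()

# Toy 3-bus, 3-line network.
A = torch.tensor([[ 1.,  0.,  1.],
                  [-1.,  1.,  0.],
                  [ 0., -1., -1.]])
f = torch.tensor([0.5, 0.2, 0.3], requires_grad=True)
p = torch.tensor([0.8, -0.3, -0.5])
loss = kirchhoff_loss(A, f, p)
loss.backward()  # gradients can flow back into the network that produced `f`
```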
[829] Stochastic Encodings for Active Feature Acquisition
Alexander Norcliffe, Changhee Lee, Fergus Imrie, Mihaela van der Schaar, Pietro Lio
Main category: cs.LG
TL;DR: The paper introduces a latent variable model for Active Feature Acquisition, outperforming traditional methods like Reinforcement Learning and greedy mutual information maximization.
Details
Motivation: Addressing the limitations of existing approaches (training difficulties in RL and myopic acquisitions in greedy methods) for Active Feature Acquisition.
Method: Proposes a supervised latent variable model that reasons about features across unobserved realizations in a stochastic latent space.
Result: Extensive evaluations show the approach reliably outperforms diverse baselines on synthetic and real datasets.
Conclusion: The latent variable model effectively improves feature acquisition by avoiding myopic decisions and training challenges.
Abstract: Active Feature Acquisition is an instance-wise, sequential decision making problem. The aim is to dynamically select which feature to measure based on current observations, independently for each test instance. Common approaches either use Reinforcement Learning, which experiences training difficulties, or greedily maximize the conditional mutual information of the label and unobserved features, which makes myopic acquisitions. To address these shortcomings, we introduce a latent variable model, trained in a supervised manner. Acquisitions are made by reasoning about the features across many possible unobserved realizations in a stochastic latent space. Extensive evaluation on a large range of synthetic and real datasets demonstrates that our approach reliably outperforms a diverse set of baselines.
[830] Kronecker-LoRA: hybrid Kronecker-LoRA adapters for scalable, sustainable fine-tuning
Yixin Shen
Main category: cs.LG
TL;DR: Kron-LoRA is a parameter-efficient adapter for fine-tuning large language models, combining Kronecker product factorization and low-rank compression to reduce parameters while maintaining performance.
Details
Motivation: To address the need for adapters that are both parameter-efficient and expressive for fine-tuning large pre-trained language models.
Method: Kron-LoRA factorizes linear updates as a Kronecker product and compresses one factor via low-rank decomposition, reducing parameters while retaining expressivity.
Result: Kron-LoRA matches or rivals standard LoRA adapters’ performance with fewer parameters and offers better quantization-friendliness.
Conclusion: Kron-LoRA provides a scalable, sustainable solution for multi-task adaptation of large language models, with competitive cross-task transfer performance.
Abstract: Fine-tuning massive pre-trained language models across many tasks demands adapters that are both parameter-efficient and highly expressive. We introduce \textbf{Kron-LoRA}, a two-stage adapter that first factorizes each frozen linear update as a Kronecker product $\Delta W = A \otimes B$ and then compresses $B \in \mathbb{R}^{d_{B_2}\times d_{B_1}}$ via an $r$-rank LoRA decomposition $B \approx B_{1}B_{2}$. By leveraging $\mathrm{rank}(A \otimes B) = \mathrm{rank}(A)\,\mathrm{rank}(B)$, Kron-LoRA retains the expressivity of the update while using up to $4\times$ fewer parameters than a standard rank-8 LoRA adapter. Its compact adapter matrices also quantize to 8- or 4-bit with less accuracy degradation than LoRA, enabling further memory and storage savings for on-device deployment. We benchmark on DistilBERT and Mistral-7B across five tasks (PIQA, HellaSwag, WinoGrande, ARC-Easy, ARC-Challenge) over multiple epochs of adapter-only tuning: on DistilBERT, an 840K-parameter Kron-LoRA matches LoRA-16’s performance, and on Mistral-7B, a 5.7M-parameter Kron-LoRA rivals LoRA-8 with modest memory savings and only a 3-8% speed overhead. In sequential fine-tuning from ARC-Challenge to ARC-Easy, Kron-LoRA retains 55.18% accuracy versus 53.17% for LoRA-8, despite using only one-quarter of the adapter parameters, underscoring its competitive cross-task transfer performance. By uniting Kronecker structure, low-rank compression, and quantization-friendliness, and by providing a transparent trade-off analysis, Kron-LoRA offers a scalable, sustainable, and continual-learning-ready solution for multi-task adaptation of large language models.
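A minimal PyTorch sketch of the two-stage update described above, assuming illustrative shapes (the paper's released code may differ): $\Delta W = A \otimes (B_1 B_2)$, where $B_1 B_2$ is the $r$-rank compression of the second Kronecker factor.

```python
# Sketch of a Kronecker-LoRA-style delta. Shapes and initialization scale
# are illustrative assumptions, not the paper's implementation.
import torch

d_out, d_in = 64, 64                    # frozen layer: W in R^{64 x 64}
dA2, dA1 = 8, 8                         # Kronecker factor A
dB2, dB1 = d_out // dA2, d_in // dA1    # Kronecker factor B (8 x 8 here)
r = 2                                   # LoRA rank used to compress B

A  = torch.randn(dA2, dA1) * 0.01
B1 = torch.randn(dB2, r) * 0.01
B2 = torch.randn(r, dB1) * 0.01

delta_W = torch.kron(A, B1 @ B2)        # (d_out, d_in) additive update
assert delta_W.shape == (d_out, d_in)
# rank(A ⊗ B) = rank(A) * rank(B), so expressivity is retained with far
# fewer trainable parameters than a dense update of the same size.
```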
[831] Accelerating LLM Reasoning via Early Rejection with Partial Reward Modeling
Seyyed Saeid Cheshmi, Azal Ahmad Khan, Xinran Wang, Zirui Liu, Ali Anwar
Main category: cs.LG
TL;DR: The paper explores using Process Reward Models (PRMs) mid-generation to reject suboptimal reasoning steps early, reducing computational overhead without sacrificing performance.
Details
Motivation: To improve the computational efficiency of LLMs in complex reasoning tasks by leveraging PRMs for early rejection of poor candidates.
Method: Introduces the idea of PRMs as Partial Reward Models, using intermediate token-level signals to predict final output quality and enable early rejection.
Result: Empirical and theoretical evidence shows strong correlation between partial and final rewards, achieving up to 1.4×-9× reduction in inference FLOPs.
Conclusion: Early rejection via PRMs is an effective way to enhance compute-efficiency in LLM reasoning tasks.
Abstract: Large Language Models (LLMs) are increasingly relied upon for solving complex reasoning tasks in domains such as mathematics, logic, and multi-step question answering. A growing line of work seeks to improve reasoning quality by scaling inference-time compute, particularly through Process Reward Models (PRMs), used to reward the reasoning at intermediate steps. While effective, these methods introduce substantial computational overhead, especially when generating large numbers of solutions in parallel. In this paper, we investigate whether PRMs can be used mid-generation to provide early signals that enable the rejection of suboptimal candidates before full generation of a step is complete. We introduce the hypothesis that PRMs are also Partial Reward Models, meaning that the scores they assign to partially completed reasoning steps are predictive of final output quality. This allows for principled early rejection based on intermediate token-level signals. We support this hypothesis both theoretically, by proving that the risk of discarding optimal beams decreases exponentially with generation length, and empirically, by demonstrating a strong correlation between partial and final rewards across multiple reward models. On math reasoning benchmarks, our method achieves up to a 1.4$\times$-9$\times$ reduction in inference FLOPs without degrading final performance. These results suggest that early rejection is a powerful mechanism for improving the compute-efficiency of reasoning in LLMs.
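The early-rejection idea can be sketched as a sampling loop that periodically scores partial candidates with a PRM and prunes the weakest before they are fully generated. The `generate_token` and `prm_score` functions below are stand-in stubs, not a real model API:

```python
# Illustrative sketch of early rejection under the "partial reward" hypothesis.
import random

def generate_token(prefix: str) -> str:
    return random.choice(["a", "b", "c", "<eos>"])  # stub for an LLM decode step

def prm_score(prefix: str) -> float:
    return random.random()  # stub for a learned process reward model

def sample_with_early_rejection(n_candidates=8, keep=2, check_every=4, max_len=32):
    beams, finished = ["" for _ in range(n_candidates)], []
    for t in range(1, max_len + 1):
        next_beams = []
        for b in beams:
            b = b + generate_token(b)
            (finished if b.endswith("<eos>") else next_beams).append(b)
        beams = next_beams
        if t % check_every == 0 and len(beams) > keep:
            # Rank by partial reward and drop the weakest candidates early,
            # saving the FLOPs full generation would have spent on them.
            beams = sorted(beams, key=prm_score, reverse=True)[:keep]
        if not beams:
            break
    return finished + beams

print(sample_with_early_rejection())
```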
[832] Improving Hospital Risk Prediction with Knowledge-Augmented Multimodal EHR Modeling
Rituparna Datta, Jiaming Cui, Zihan Guan, Rupesh Silwal, Joshua C Eby, Gregory Madden, Anil Vullikanti
Main category: cs.LG
TL;DR: A unified framework integrates structured and unstructured EHR data using a two-stage LLM and graph-based method for clinical risk prediction, outperforming baselines with AUC scores of 0.84 and 0.92.
Details
Motivation: Accurate prediction of clinical outcomes from EHRs is vital for early intervention and improved patient care, but requires integrating multimodal data.
Method: A two-stage approach: 1) LLM extracts task-relevant info from clinical notes, enhanced by graph-based retrieval of external knowledge; 2) combines unstructured and structured data for predictions.
Result: Achieves strong performance (AUC 0.84 for readmission, 0.92 for mortality) on imbalanced datasets, surpassing existing baselines and clinical practices.
Conclusion: The framework enhances LLM-based prediction by integrating structured data and external knowledge, offering a robust solution for clinical outcome prediction.
Abstract: Accurate prediction of clinical outcomes using Electronic Health Records (EHRs) is critical for early intervention, efficient resource allocation, and improved patient care. EHRs contain multimodal data, including both structured data and unstructured clinical notes that provide rich, context-specific information. In this work, we introduce a unified framework that seamlessly integrates these diverse modalities, leveraging all relevant available information through a two-stage architecture for clinical risk prediction. In the first stage, a fine-tuned Large Language Model (LLM) extracts crucial, task-relevant information from clinical notes, enhanced by graph-based retrieval of external domain knowledge from medical corpora such as PubMed, grounding the LLM’s understanding. The second stage combines both unstructured representations and features derived from the structured data to generate the final predictions. This approach supports a wide range of clinical tasks. Here, we demonstrate its effectiveness on 30-day readmission and in-hospital mortality prediction. Experimental results show that our framework achieves strong performance, with AUC scores of $0.84$ and $0.92$, respectively, despite these tasks involving severely imbalanced datasets, with positive rates ranging from approximately $4\%$ to $13\%$. Moreover, it outperforms all existing baselines and clinical practices, including established risk scoring systems. To the best of our knowledge, this is one of the first frameworks for healthcare prediction that enhances an LLM-based, graph-guided knowledge retrieval method by combining it with structured data for improved clinical outcome prediction.
[833] Revitalizing Canonical Pre-Alignment for Irregular Multivariate Time Series Forecasting
Ziyu Zhou, Yiming Huang, Yanyun Wang, Yuankai Wu, James Kwok, Yuxuan Liang
Main category: cs.LG
TL;DR: KAFNet, a compact architecture for irregular multivariate time series (IMTS) forecasting, retains Canonical Pre-Alignment (CPA) and outperforms graph-based models by efficiently handling pre-aligned series with modules for smoothing, compression, and low-cost correlation modeling.
Details
Motivation: IMTS modeling is challenging due to uneven sampling and inter-variate asynchrony. CPA, though widely used, inflates series length, while graph-based models struggle with global correlations.
Method: KAFNet combines (1) Pre-Convolution for smoothing, (2) Temporal Kernel Aggregation for compression, and (3) Frequency Linear Attention for low-cost correlation modeling.
Result: KAFNet achieves state-of-the-art performance with 7.2× fewer parameters and 8.4× faster training-inference.
Conclusion: Retaining CPA and properly handling pre-aligned series with KAFNet outperforms graph-based models, offering efficiency and accuracy.
Abstract: Irregular multivariate time series (IMTS), characterized by uneven sampling and inter-variate asynchrony, fuel many forecasting applications yet remain challenging to model efficiently. Canonical Pre-Alignment (CPA) has been widely adopted in IMTS modeling by padding zeros at every global timestamp, thereby alleviating inter-variate asynchrony and unifying the series length, but its dense zero-padding inflates the pre-aligned series length, especially when numerous variates are present, causing prohibitive compute overhead. Recent graph-based models with patching strategies sidestep CPA, but their local message passing struggles to capture global inter-variate correlations. Therefore, we posit that CPA should be retained, with the pre-aligned series properly handled by the model, enabling it to outperform state-of-the-art graph-based baselines that sidestep CPA. Technically, we propose KAFNet, a compact architecture grounded in CPA for IMTS forecasting that couples (1) a Pre-Convolution module for sequence smoothing and sparsity mitigation, (2) a Temporal Kernel Aggregation module for learnable compression and modeling of intra-series irregularity, and (3) Frequency Linear Attention blocks for low-cost inter-series correlation modeling in the frequency domain. Experiments on multiple IMTS datasets show that KAFNet achieves state-of-the-art forecasting performance, with a 7.2$\times$ parameter reduction and an 8.4$\times$ training-inference acceleration.
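For readers unfamiliar with CPA, here is a minimal sketch of the zero-padding it performs: every variate is placed on the union of all observation timestamps, which unifies series length but inflates it as variates multiply. The names and toy data are illustrative:

```python
# Sketch of Canonical Pre-Alignment: align irregular per-variate observations
# onto the union of all timestamps, padding zeros where nothing was measured.
import numpy as np

# Irregular observations: per-variate (timestamp, value) pairs.
obs = {
    "x1": [(0.0, 1.2), (2.5, 0.7)],
    "x2": [(1.0, 3.4), (2.5, 2.2), (4.0, 1.1)],
}

timestamps = sorted({t for series in obs.values() for t, _ in series})
aligned = np.zeros((len(obs), len(timestamps)))   # dense, zero-padded values
mask = np.zeros_like(aligned)                     # 1 where truly observed
for i, series in enumerate(obs.values()):
    for t, v in series:
        j = timestamps.index(t)
        aligned[i, j] = v
        mask[i, j] = 1.0

print(timestamps)  # [0.0, 1.0, 2.5, 4.0] -- length grows with the union of
print(aligned)     # timestamps; this sparsity is what KAFNet is designed
print(mask)        # to handle efficiently.
```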
[834] Diffusion models for inverse problems
Hyungjin Chung, Jeongsol Kim, Jong Chul Ye
Main category: cs.LG
TL;DR: A review of diffusion priors for solving inverse imaging problems, categorizing methods, covering extensions to challenging scenarios, and highlighting open challenges.
Details
Motivation: To systematically review and categorize diffusion-based approaches for inverse imaging problems, identify common mathematical threads, and clarify trade-offs and open challenges.
Method: Categorizes approaches into explicit approximations and others (variational inference, sequential Monte Carlo, decoupled data consistency). Extends to blind cases, high-dimensional data, and multimodal information.
Result: Provides a structured overview of diffusion-based inverse problem solvers, contrasting assumptions and performance trade-offs.
Conclusion: Highlights open theoretical and practical challenges, aiming to clarify the landscape of diffusion model-based solvers.
Abstract: The use of diffusion priors to solve inverse problems in imaging has matured significantly over the years. In this chapter, we review the various approaches that have been proposed. We categorize the approaches into the more classic explicit approximation approaches and others, which include variational inference, sequential Monte Carlo, and decoupled data consistency. We cover the extension to more challenging situations, including blind cases, high-dimensional data, and problems under data scarcity and distribution mismatch. More recent approaches that aim to leverage multimodal information through texts are also covered. Through this chapter, we aim to (i) distill the common mathematical threads that connect these algorithms, (ii) systematically contrast their assumptions and performance trade-offs across representative inverse problems, and (iii) spotlight the open theoretical and practical challenges by clarifying the landscape of diffusion model-based inverse problem solvers.
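As a hedged sketch of the common thread behind many of the explicit-approximation solvers the chapter reviews (in the style of diffusion posterior sampling, not any one specific method): after an unconditional denoising step, the sample is nudged by the gradient of a data-fidelity term evaluated through the denoised estimate. The `denoiser`, forward operator `A`, and step sizes below are stand-ins:

```python
# Generic guided-sampling step for a linear inverse problem y = A(x) + noise.
import torch

def guided_step(x_t, t, y, A, denoiser, sigma_t, guidance_scale=1.0):
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)                          # Tweedie-style estimate of x0
    residual = (A(x0_hat) - y).pow(2).sum()            # ||A x0_hat - y||^2
    grad = torch.autograd.grad(residual, x_t)[0]
    x_prev = x0_hat + sigma_t * torch.randn_like(x_t)  # crude ancestral move
    return x_prev - guidance_scale * grad              # data-consistency nudge

# Toy usage with a masking operator and an identity "denoiser" stand-in.
A = lambda x: x[:, :8]
denoiser = lambda x, t: x
x = torch.randn(1, 16)
y = torch.randn(1, 8)
x = guided_step(x, 0, y, A, denoiser, sigma_t=0.1)
```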
[835] Controllable and Stealthy Shilling Attacks via Dispersive Latent Diffusion
Shutong Qiao, Wei Yuan, Junliang Yu, Tong Chen, Quoc Viet Hung Nguyen, Hongzhi Yin
Main category: cs.LG
TL;DR: DLDA is a diffusion-based attack framework for recommender systems that generates effective yet realistic fake users to manipulate rankings, outperforming prior attacks in promotion strength and evasion of detection.
Details
Motivation: Existing shilling attacks on recommender systems often fail to balance strong adversarial promotion of target items with realistic behavior to avoid detection, leaving the true threat underestimated.
Method: DLDA uses a conditional latent diffusion process in a collaborative embedding space to synthesize fake user profiles with precise target control and employs dispersive regularization for realistic behavior.
Result: DLDA achieves stronger item promotion and better evasion of detection compared to prior attacks, as demonstrated on three real-world datasets and five RS models.
Conclusion: Modern recommender systems are more vulnerable to shilling attacks than previously thought, necessitating stronger defenses.
Abstract: Recommender systems (RSs) are now fundamental to various online platforms, but their dependence on user-contributed data leaves them vulnerable to shilling attacks that can manipulate item rankings by injecting fake users. Although widely studied, most existing attack models fail to meet two critical objectives simultaneously: achieving strong adversarial promotion of target items while maintaining realistic behavior to evade detection. As a result, the true severity of shilling threats that manage to reconcile the two objectives remains underappreciated. To expose this overlooked vulnerability, we present DLDA, a diffusion-based attack framework that can generate highly effective yet indistinguishable fake users by enabling fine-grained control over target promotion. Specifically, DLDA operates in a pre-aligned collaborative embedding space, where it employs a conditional latent diffusion process to iteratively synthesize fake user profiles with precise target item control. To evade detection, DLDA introduces a dispersive regularization mechanism that promotes variability and realism in generated behavioral patterns. Extensive experiments on three real-world datasets and five popular RS models demonstrate that, compared to prior attacks, DLDA consistently achieves stronger item promotion while remaining harder to detect. These results highlight that modern RSs are more vulnerable than previously recognized, underscoring the urgent need for more robust defenses.
[836] Toward Efficient Spiking Transformers: Synapse Pruning Meets Synergistic Learning-Based Compensation
Hongze Sun, Wuque Cai, Duo Chen, Shifeng Mao, Jiayi He, Zhenxing Wang, Dezhong Yao, Daqing Guo
Main category: cs.LG
TL;DR: The paper proposes lightweight spiking Transformer models using synapse pruning and synergistic learning to reduce computational costs while maintaining performance.
Details
Motivation: Existing spiking Transformer models are resource-intensive, limiting deployment in constrained environments.
Method: Combines unstructured L1P and structured DSP pruning with a synergistic learning-based compensation strategy using an enhanced sLIF neuron model.
Result: Significantly reduces model size and computational costs without compromising performance.
Conclusion: The proposed strategies effectively create efficient and high-performing spiking Transformer models.
Abstract: As a foundational architecture of artificial intelligence models, Transformer has been recently adapted to spiking neural networks with promising performance across various tasks. However, existing spiking Transformer (ST)-based models require a substantial number of parameters and incur high computational costs, thus limiting their deployment in resource-constrained environments. To address these challenges, we propose combining synapse pruning with a synergistic learning-based compensation strategy to derive lightweight ST-based models. Specifically, two types of tailored pruning strategies are introduced to reduce redundancy in the weight matrices of ST blocks: an unstructured $\mathrm{L_{1}P}$ method to induce sparse representations, and a structured DSP method to induce low-rank representations. In addition, we propose an enhanced spiking neuron model, termed the synergistic leaky integrate-and-fire (sLIF) neuron, to effectively compensate for model pruning through synergistic learning between synaptic and intrinsic plasticity mechanisms. Extensive experiments on benchmark datasets demonstrate that the proposed methods significantly reduce model size and computational overhead while maintaining competitive performance. These results validate the effectiveness of the proposed pruning and compensation strategies in constructing efficient and high-performing ST-based models.
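The unstructured pruning component can be illustrated with a basic magnitude-pruning sketch (the paper's $\mathrm{L_{1}P}$ strategy and the compensation mechanism are more involved; this only shows the core mechanism of zeroing small-magnitude synapses):

```python
# Sketch of unstructured L1 magnitude pruning: zero out the smallest-|w|
# entries of a weight matrix to reach a target sparsity.
import torch

def l1_unstructured_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask keeping only the largest-magnitude entries."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

W = torch.randn(128, 128)
mask = l1_unstructured_prune(W, sparsity=0.9)
W_pruned = W * mask
print(f"kept {mask.mean().item():.1%} of synapses")  # ~10%
```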
[837] Generative Large-Scale Pre-trained Models for Automated Ad Bidding Optimization
Yu Lei, Jiayang Zhao, Yilei Zhao, Zhaoqi Zhang, Linyou Cai, Qianlong Xie, Xingxing Wang
Main category: cs.LG
TL;DR: GRAD, a generative reward-driven auto-bidding model, addresses challenges in modern ad-bidding by combining diverse action exploration with constraint-aware optimization, improving platform revenue and advertiser ROI.
Details
Motivation: To balance diverse advertiser goals and constraints in auto-bidding, overcoming limitations of traditional methods like distribution shifts and limited action space exploration.
Method: GRAD integrates an Action-Mixture-of-Experts module for diverse bidding actions and a Causal Transformer-based Value Estimator for constraint-aware optimization.
Result: GRAD boosts platform revenue, with real-world implementation showing a 2.18% GMV increase and 10.68% ROI improvement.
Conclusion: GRAD effectively meets modern advertiser needs, demonstrating scalability and performance in dynamic environments.
Abstract: Modern auto-bidding systems are required to balance overall performance with diverse advertiser goals and real-world constraints, reflecting the dynamic and evolving needs of the industry. Recent advances in conditional generative models, such as transformers and diffusers, have enabled direct trajectory generation tailored to advertiser preferences, offering a promising alternative to traditional Markov Decision Process-based methods. However, these generative methods face significant challenges, such as the distribution shift between offline and online environments, limited exploration of the action space, and the necessity to meet constraints like marginal Cost-per-Mille (CPM) and Return on Investment (ROI). To tackle these challenges, we propose GRAD (Generative Reward-driven Ad-bidding with Mixture-of-Experts), a scalable foundation model for auto-bidding that combines an Action-Mixture-of-Experts module for diverse bidding action exploration with the Value Estimator of Causal Transformer for constraint-aware optimization. Extensive offline and online experiments demonstrate that GRAD significantly enhances platform revenue, highlighting its effectiveness in addressing the evolving and diverse requirements of modern advertisers. Furthermore, GRAD has been implemented in multiple marketing scenarios at Meituan, one of the world’s largest online food delivery platforms, leading to a 2.18% increase in Gross Merchandise Value (GMV) and 10.68% increase in ROI.
[838] An Evolving Scenario Generation Method based on Dual-modal Driver Model Trained by Multi-Agent Reinforcement Learning
Xinzheng Wu, Junyi Chen, Shaolingfeng Ye, Wei Jiang, Yong Shen
Main category: cs.LG
TL;DR: A MARL-based Dual-DM model enhances safety-critical scenario generation for autonomous driving, improving efficiency and diversity without compromising fidelity or complexity.
Details
Motivation: To efficiently generate diverse and complex safety-critical scenarios for autonomous driving testing by leveraging cooperative adversarial driving characteristics.
Method: Uses multi-agent reinforcement learning (MARL) to train a dual-modal driver model (Dual-DM) with non-adversarial and adversarial driving modes, integrated into a simulated traffic environment.
Result: Dual-DM achieves high scenario fidelity (>85% similarity), improved complexity (+32.35% and +12.5%), and significant efficiency gains (+195%).
Conclusion: Dual-DM effectively enhances safety-critical scenario generation, demonstrating high diversity and performance.
Abstract: In the autonomous driving testing methods based on evolving scenarios, the construction method of the driver model, which determines the driving maneuvers of background vehicles (BVs) in the scenario, plays a critical role in generating safety-critical scenarios. In particular, the cooperative adversarial driving characteristics between BVs can contribute to the efficient generation of safety-critical scenarios with high testing value. In this paper, a multi-agent reinforcement learning (MARL) method is used to train and generate a dual-modal driver model (Dual-DM) with non-adversarial and adversarial driving modalities. The model is then connected to a continuous simulated traffic environment to generate complex, diverse and strong interactive safety-critical scenarios through evolving scenario generation method. After that, the generated evolving scenarios are evaluated in terms of fidelity, test efficiency, complexity and diversity. Results show that without performance degradation in scenario fidelity (>85% similarity to real-world scenarios) and complexity (complexity metric: 0.45, +32.35% and +12.5% over two baselines), Dual-DM achieves a substantial enhancement in the efficiency of generating safety-critical scenarios (efficiency metric: 0.86, +195% over two baselines). Furthermore, statistical analysis and case studies demonstrate the diversity of safety-critical evolving scenarios generated by Dual-DM in terms of the adversarial interaction patterns. Therefore, Dual-DM can greatly improve the performance of the generation of safety-critical scenarios through evolving scenario generation method.
[839] Confidence-Diversity Calibration of AI Judgement Enables Reliable Qualitative Coding
Zhilong Zhao, Yindi Liu
Main category: cs.LG
TL;DR: LLMs’ reliability in qualitative coding is assessed using self-confidence and model diversity, achieving high agreement (R²=0.979) and reducing human effort by 65%.
Details
Motivation: To address the challenge of assessing LLM reliability in domains with low human expert agreement.
Method: Analyzed 5,680 coding decisions from eight LLMs across ten categories, using self-confidence and model diversity metrics.
Result: A dual signal (confidence + diversity) explains agreement almost completely (R²=0.979), reducing manual effort by 65%.
Conclusion: The confidence-diversity duo provides a generalizable, evidence-based criterion for calibrating AI judgment in qualitative research.
Abstract: LLMs enable qualitative coding at large scale, but assessing the reliability of their output remains challenging in domains where human experts seldom agree. Analysing 5,680 coding decisions from eight state-of-the-art LLMs across ten thematic categories, we confirm that a model’s mean self-confidence already tracks inter-model agreement closely (Pearson r=0.82). Adding model diversity, quantified as the normalised Shannon entropy of the panel’s votes, turns this single cue into a dual signal that explains agreement almost completely (R^2=0.979). The confidence-diversity duo enables a three-tier workflow that auto-accepts 35% of segments with <5% audit-detected error and routes the remainder for targeted human review, cutting manual effort by up to 65%. Cross-domain replication on six public datasets spanning finance, medicine, law and multilingual tasks confirms these gains (kappa improvements of 0.20-0.78). Our results establish a generalisable, evidence-based criterion for calibrating AI judgement in qualitative research.
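A small sketch of the dual-signal triage (thresholds here are illustrative placeholders, not the paper's calibrated values): combine each segment's mean self-confidence with the normalized Shannon entropy of the model panel's votes, then route it to one of three tiers:

```python
# Sketch of confidence-diversity routing for panel-annotated segments.
import math
from collections import Counter

def normalized_entropy(votes):
    """Shannon entropy of the vote distribution, normalized to [0, 1]
    by log of the number of distinct observed labels."""
    counts = Counter(votes)
    n, k = len(votes), len(counts)
    if k == 1:
        return 0.0  # unanimous panel
    probs = [c / n for c in counts.values()]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(k)

def route(mean_confidence, votes, conf_hi=0.9, ent_lo=0.2):
    ent = normalized_entropy(votes)
    if mean_confidence >= conf_hi and ent <= ent_lo:
        return "auto-accept"
    if mean_confidence < 0.5 or ent > 0.8:
        return "human-review"
    return "targeted-audit"

print(route(0.95, ["A"] * 8))             # auto-accept (confident, unanimous)
print(route(0.60, ["A", "B", "C", "A"]))  # human-review (high vote entropy)
```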
[840] Model Recycling Framework for Multi-Source Data-Free Supervised Transfer Learning
Sijia Wang, Ricardo Henao
Main category: cs.LG
TL;DR: Proposes a model recycling framework for source-free transfer learning, enabling efficient reuse of pre-trained models without access to source data.
Details
Motivation: Addresses challenges in transfer learning when source data is unavailable, such as selecting models and transferring without full access to source models.
Method: Introduces a parameter-efficient training framework that identifies subsets of related source models for reuse in white-box and black-box settings.
Result: Enables Model as a Service (MaaS) providers to build libraries of efficient pre-trained models for multi-source data-free supervised transfer learning.
Conclusion: The framework facilitates practical and efficient transfer learning in scenarios where source data is inaccessible.
Abstract: Increasing concerns for data privacy and other difficulties associated with retrieving source data for model training have created the need for source-free transfer learning, in which one only has access to pre-trained models instead of data from the original source domains. This setting introduces many challenges, as many existing transfer learning methods typically rely on access to source data, which limits their direct applicability to scenarios where source data is unavailable. Practical concerns make it more difficult still: for instance, models must be selected for transfer efficiently without information on the source data, and transferred without full access to the source models. So motivated, we propose a model recycling framework for parameter-efficient training of models that identifies subsets of related source models to reuse in both white-box and black-box settings. Consequently, our framework makes it possible for Model as a Service (MaaS) providers to build libraries of efficient pre-trained models, thus creating an opportunity for multi-source data-free supervised transfer learning.
[841] Graph Unlearning via Embedding Reconstruction – A Range-Null Space Decomposition Approach
Hang Yin, Zipeng Liu, Xiaoyong Peng, Liyao Xiang
Main category: cs.LG
TL;DR: A novel node unlearning method for GNNs reverses aggregation via embedding reconstruction and uses Range-Null Space Decomposition, achieving state-of-the-art performance.
Details
Motivation: Addressing the unexplored challenges of node unlearning in GNNs, particularly the inefficiency of retraining and limitations of existing methods like GIF.
Method: Proposes embedding reconstruction to reverse GNN aggregation and employs Range-Null Space Decomposition for node interaction learning.
Result: Demonstrates state-of-the-art performance on multiple datasets.
Conclusion: The method effectively handles node unlearning in GNNs without retraining overhead, outperforming existing approaches.
Abstract: Graph unlearning is tailored for GNNs to handle widespread and varied graph structure unlearning requests, which remain largely unexplored. The GIF (graph influence function) achieves validity under partial edge unlearning, but faces challenges in dealing with the more disruptive case of node unlearning. To avoid the overhead of retraining while preserving model utility after unlearning, we propose a novel node unlearning method that reverses the aggregation process in GNNs via embedding reconstruction and adopts Range-Null Space Decomposition for learning the nodes’ interactions. Experimental results on multiple representative datasets demonstrate the SOTA performance of our proposed approach.
[842] Epi$^2$-Net: Advancing Epidemic Dynamics Forecasting with Physics-Inspired Neural Networks
Rui Sun, Chenghua Gong, Tianjun Gu, Yuhao Zheng, Jie Ding, Juyuan Zhang, Liming Pan, Linyuan Lü
Main category: cs.LG
TL;DR: Epi²-Net combines physics-inspired neural networks with epidemic dynamics for improved forecasting, outperforming current methods.
Details
Motivation: Current epidemic forecasting models are limited—mechanism-based ones are too rigid, while data-driven ones lack physical constraints, risking bias.
Method: Epi²-Net integrates physical transport principles into neural networks, introducing neural epidemic transport and spatio-temporal modeling.
Result: Epi²-Net outperforms state-of-the-art methods in real-world epidemic forecasting tests.
Conclusion: Epi²-Net offers a promising, physics-informed approach for accurate epidemic forecasting and containment.
Abstract: Advancing epidemic dynamics forecasting is vital for targeted interventions and safeguarding public health. Current approaches mainly fall into two categories: mechanism-based and data-driven models. Mechanism-based models are constrained by predefined compartmental structures and oversimplified system assumptions, limiting their ability to model complex real-world dynamics, while data-driven models focus solely on intrinsic data dependencies without physical or epidemiological constraints, risking biased or misleading representations. Although recent studies have attempted to integrate epidemiological knowledge into neural architectures, most of them fail to reconcile explicit physical priors with neural representations. To overcome these obstacles, we introduce Epi$^2$-Net, an epidemic forecasting framework built upon physics-inspired neural networks. Specifically, we propose reconceptualizing epidemic transmission from the physical transport perspective, introducing the concept of neural epidemic transport. Further, we present a physics-inspired deep learning framework and integrate physical constraints with neural modules to model spatio-temporal patterns of epidemic dynamics. Experiments on real-world datasets have demonstrated that Epi$^2$-Net outperforms state-of-the-art methods in epidemic forecasting, providing a promising solution for future epidemic containment. The code is available at: https://anonymous.4open.science/r/Epi-2-Net-48CE.
[843] SpikeSTAG: Spatial-Temporal Forecasting via GNN-SNN Collaboration
Bang Hu, Changze Lv, Mingjie Li, Yunpeng Liu, Xiaoqing Zheng, Fengzhe Zhang, Wei cao, Fan Zhang
Main category: cs.LG
TL;DR: A new SNN architecture integrates graph structural learning with spike-based temporal processing for multivariate time-series forecasting, outperforming existing models.
Details
Motivation: To explore the untapped potential of SNNs in spatial modeling for multivariate time-series forecasting.
Method: The architecture embeds time features and an adaptive matrix, learns sequence features via the OBS Block, aggregates neighborhood information with MSSA, and integrates spatial-temporal dynamics using the DSF Block.
Result: The model surpasses iSpikformer and traditional temporal models, especially for long sequences.
Conclusion: This work establishes a new paradigm for efficient spatial-temporal modeling with SNNs.
Abstract: Spiking neural networks (SNNs), inspired by the spiking behavior of biological neurons, offer a distinctive approach for capturing the complexities of temporal data. However, their potential for spatial modeling in multivariate time-series forecasting remains largely unexplored. To bridge this gap, we introduce a brand new SNN architecture, which is among the first to seamlessly integrate graph structural learning with spike-based temporal processing for multivariate time-series forecasting. Specifically, we first embed time features and an adaptive matrix, eliminating the need for predefined graph structures. We then further learn sequence features through the Observation (OBS) Block. Building upon this, our Multi-Scale Spike Aggregation (MSSA) hierarchically aggregates neighborhood information through spiking SAGE layers, enabling multi-hop feature extraction while eliminating the need for floating-point operations. Finally, we propose a Dual-Path Spike Fusion (DSF) Block to integrate spatial graph features and temporal dynamics via a spike-gated mechanism, combining LSTM-processed sequences with spiking self-attention outputs, which effectively improves model accuracy on long-sequence datasets. Experiments show that our model surpasses the state-of-the-art SNN-based iSpikformer on all datasets and outperforms traditional temporal models at long horizons, thereby establishing a new paradigm for efficient spatial-temporal modeling.
[844] AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization
Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha
Main category: cs.LG
TL;DR: AGL (AlignGuard-LoRA) is a framework to prevent alignment drift in LoRA fine-tuning of LLMs, using regularization and task-specific components, improving safety without degrading performance.
Details
Motivation: Minor LoRA updates can weaken safety and alignment in LLMs, necessitating a method to preserve alignment during fine-tuning.
Method: AGL combines primary task loss, Fisher Information Matrix-based regularization, task-specific regularization, and collision-aware regularization (Riemannian overlap and geodesic separation).
Result: AGL reduces alignment drift by up to 50% on safety benchmarks without harming task performance, validated by the DriftCaps benchmark.
Conclusion: AGL effectively preserves alignment in LoRA fine-tuning, with minimal trade-offs, and is open-sourced for further development.
Abstract: Low-rank adaptation (LoRA) has become a standard tool for efficiently fine-tuning large language models (LLMs). Yet, even minor LoRA updates can induce alignment drift, weakening safety and behavioral constraints through entangled parameter changes. To address this, we propose AlignGuard-LoRA (AGL), a principled framework for preserving alignment during finetuning. AGL introduces several key components: a primary task loss for supervision, Fisher Information Matrix-based regularization to restrict updates in alignment-sensitive subspaces, and task-specific regularization to stabilize the integration of new knowledge. We further introduce collision-aware regularization, blending Riemannian overlap – which penalizes coordinate-wise interference – and geodesic separation – which encourages disjoint update geometry. We curate DriftCaps, a targeted diagnostic benchmark of safe and unsafe prompts designed to quantify alignment drift and safety degradation. Empirical evaluations show that AGL mitigates alignment drift by up to 50% on safety-critical benchmarks without degrading downstream task performance. Comprehensive ablation confirms that each component contributes distinctly to preserving latent safety behaviors. Finally, we derive and validate a scaling law for catastrophic forgetting, revealing that AGL flattens post-finetuning loss escalation while preserving adaptation dynamics. AGL is a structurally grounded refinement of LoRA, ensuring alignment preservation with minimal trade-offs. To encourage further exploration and development, we open-source our implementation.
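The Fisher-guided component can be sketched as a drift penalty that weights parameter movement by an estimate of alignment sensitivity; the diagonal-Fisher approximation and names below are assumptions for illustration, not AGL's implementation:

```python
# Sketch of Fisher-weighted drift regularization: penalize movement away from
# the aligned reference parameters, most strongly along alignment-sensitive
# directions (here approximated by a diagonal Fisher Information estimate).
import torch

def fisher_drift_penalty(params, ref_params, fisher_diag, lam=1.0):
    penalty = 0.0
    for p, p0, f in zip(params, ref_params, fisher_diag):
        penalty = penalty + (f * (p - p0) ** 2).sum()
    return lam * penalty

# Toy usage: fisher_diag would be estimated from gradients on alignment data,
# e.g. f ≈ E[(d log p / d theta)^2]; here it is random for illustration.
params = [torch.randn(10, requires_grad=True)]
ref = [p.detach().clone() for p in params]
fisher = [torch.rand_like(p) for p in params]
loss_task = (params[0] ** 2).mean()  # stand-in primary task loss
loss = loss_task + fisher_drift_penalty(params, ref, fisher, lam=0.1)
loss.backward()
```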
[845] The Geometry of Machine Learning Models
Pawel Gajer, Jacques Ravel
Main category: cs.LG
TL;DR: A mathematical framework analyzes ML models via partition geometry, using Riemannian simplicial complexes and differential forms for tractable computations, enabling geometric regularization and new diagnostic tools.
Details
Motivation: To provide a deeper understanding of ML models by analyzing their geometric properties, improving interpretation, regularization, and diagnostics.
Method: Represent partitions as Riemannian simplicial complexes, introduce differential forms for neural networks, and develop discrete curvature measures.
Result: The framework enables geometric regularization, model refinement tools, and quantifies local geometric complexity and pairwise relationships.
Conclusion: The geometric perspective offers novel approaches for model interpretation, regularization, and understanding learning dynamics.
Abstract: This paper presents a mathematical framework for analyzing machine learning models through the geometry of their induced partitions. By representing partitions as Riemannian simplicial complexes, we capture not only adjacency relationships but also geometric properties including cell volumes, volumes of faces where cells meet, and dihedral angles between adjacent cells. For neural networks, we introduce a differential forms approach that tracks geometric structure through layers via pullback operations, making computations tractable by focusing on data-containing cells. The framework enables geometric regularization that directly penalizes problematic spatial configurations and provides new tools for model refinement through extended Laplacians and simplicial splines. We also explore how data distribution induces effective geometric curvature in model partitions, developing discrete curvature measures for vertices that quantify local geometric complexity and statistical Ricci curvature for edges that captures pairwise relationships between cells. While focused on mathematical foundations, this geometric perspective offers new approaches to model interpretation, regularization, and diagnostic tools for understanding learning dynamics.
[846] Instance-Dependent Continuous-Time Reinforcement Learning via Maximum Likelihood Estimation
Runze Zhao, Yue Yu, Ruhan Wang, Chunfeng Huang, Dongruo Zhou
Main category: cs.LG
TL;DR: The paper explores instance-dependent behavior in continuous-time reinforcement learning (CTRL), introducing a model-based algorithm using MLE and state marginal density estimation. It provides regret bounds and adaptive measurement strategies for improved performance.
Details
Motivation: To understand and improve CTRL's adaptability to varying problem difficulties, addressing gaps in empirical success and theoretical understanding.
Method: A model-based algorithm using maximum likelihood estimation (MLE) with a general function approximator, focusing on state marginal density estimation instead of direct system dynamics.
Result: Derived regret bounds scaling with reward variance and measurement resolution, with adaptive observation frequency making regret independent of measurement strategy. Introduced a randomized measurement schedule for better sample efficiency.
Conclusion: The work advances CTRL by enabling automatic adaptation to environmental difficulty, offering a new direction for algorithm design.
Abstract: Continuous-time reinforcement learning (CTRL) provides a natural framework for sequential decision-making in dynamic environments where interactions evolve continuously over time. While CTRL has shown growing empirical success, its ability to adapt to varying levels of problem difficulty remains poorly understood. In this work, we investigate the instance-dependent behavior of CTRL and introduce a simple, model-based algorithm built on maximum likelihood estimation (MLE) with a general function approximator. Unlike existing approaches that estimate system dynamics directly, our method estimates the state marginal density to guide learning. We establish instance-dependent performance guarantees by deriving a regret bound that scales with the total reward variance and measurement resolution. Notably, the regret becomes independent of the specific measurement strategy when the observation frequency adapts appropriately to the problem’s complexity. To further improve performance, our algorithm incorporates a randomized measurement schedule that enhances sample efficiency without increasing measurement cost. These results highlight a new direction for designing CTRL algorithms that automatically adjust their learning behavior based on the underlying difficulty of the environment.
[847] Real-Time Conflict Prediction for Large Truck Merging in Mixed Traffic at Work Zone Lane Closures
Abyad Enan, Abdullah Al Mamun, Gurcan Comert, Debbie Aisiana Indah, Judith Mwakalonge, Amy W. Apon, Mashrur Chowdhury
Main category: cs.LG
TL;DR: The study uses LSTM neural networks to predict and reduce conflict risks for large trucks merging in work zones, outperforming baseline methods.
Details
Motivation: Large trucks contribute significantly to work zone crashes due to their size and blind spots, necessitating safer merging strategies.
Method: An LSTM neural network predicts merging risks, ensuring safe gaps in the target lane. Compared to probabilistic and gap-based baselines.
Result: The LSTM method reduces conflict risk (lower TET and TIT values) and enables early merging without stopping.
Conclusion: The LSTM-based approach improves safety and efficiency for large truck merging in work zones compared to traditional methods.
Abstract: Large trucks substantially contribute to work zone-related crashes, primarily due to their large size and blind spots. When approaching a work zone, large trucks often need to merge into an adjacent lane because of lane closures caused by construction activities. This study aims to enhance the safety of large truck merging maneuvers in work zones by evaluating the risk associated with merging conflicts and establishing a decision-making strategy for merging based on this risk assessment. To predict the risk of large trucks merging into a mixed traffic stream within a work zone, a Long Short-Term Memory (LSTM) neural network is employed. For a large truck intending to merge, it is critical that the immediate downstream vehicle in the target lane maintains a minimum safe gap to facilitate a safe merging process. Once a conflict-free merging opportunity is predicted, large trucks are instructed to merge in response to the lane closure. Our LSTM-based conflict prediction method is compared against baseline approaches, which include probabilistic risk-based merging, 50th percentile gap-based merging, and 85th percentile gap-based merging strategies. The results demonstrate that our method yields a lower conflict risk, as indicated by reduced Time Exposed Time-to-Collision (TET) and Time Integrated Time-to-Collision (TIT) values relative to the baseline models. Furthermore, the findings indicate that large trucks that use our method can perform early merging while still in motion, as opposed to coming to a complete stop at the end of the current lane prior to closure, which is commonly observed with the baseline approaches.
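The evaluation metrics are standard surrogate safety measures and are easy to compute from a time-to-collision (TTC) trace; the threshold and time step below are illustrative:

```python
# Sketch of the surrogate safety measures used for evaluation above: Time
# Exposed TTC (TET) counts how long TTC stays below a threshold, and Time
# Integrated TTC (TIT) accumulates the severity of the shortfall.
import numpy as np

def tet_tit(ttc: np.ndarray, dt: float = 0.1, ttc_star: float = 3.0):
    exposed = ttc < ttc_star
    tet = exposed.sum() * dt                      # seconds spent under threshold
    tit = (ttc_star - ttc[exposed]).sum() * dt    # severity-weighted exposure
    return tet, tit

ttc = np.array([6.0, 4.2, 2.8, 1.9, 2.4, 3.5, 5.0])  # toy TTC trace (seconds)
print(tet_tit(ttc))  # lower values indicate lower merging-conflict risk
```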
[848] Understanding the Essence: Delving into Annotator Prototype Learning for Multi-Class Annotation Aggregation
Ju Chen, Jun Feng, Shenyu Zhang
Main category: cs.LG
TL;DR: PTBCC improves truth inference in multi-class classification by using prototype learning to address unreliable confusion matrices and sparse annotations, achieving higher accuracy and lower computational cost.
Details
Motivation: Existing methods for truth inference struggle with unreliable confusion matrices due to sparse annotations and class imbalance, and fail to fully capture annotator expertise.
Method: PTBCC introduces prototype learning, representing annotator expertise as a distribution over prototype confusion matrices, addressing sparsity and imbalance.
Result: PTBCC achieves up to 15% higher accuracy in some cases, 3% average improvement, and reduces computational cost by over 90%.
Conclusion: PTBCC offers a more reliable and flexible approach for truth inference, outperforming existing methods in accuracy and efficiency.
Abstract: Multi-class classification annotations have significantly advanced AI applications, with truth inference serving as a critical technique for aggregating noisy and biased annotations. Existing state-of-the-art methods typically model each annotator’s expertise using a confusion matrix. However, these methods suffer from two widely recognized issues: 1) when most annotators label only a few tasks, or when classes are imbalanced, the estimated confusion matrices are unreliable, and 2) a single confusion matrix often remains inadequate for capturing each annotator’s full expertise patterns across all tasks. To address these issues, we propose a novel confusion-matrix-based method, PTBCC (ProtoType learning-driven Bayesian Classifier Combination), to introduce a reliable and richer annotator estimation by prototype learning. Specifically, we assume that there exists a set $S$ of prototype confusion matrices, which capture the inherent expertise patterns of all annotators. Rather than a single confusion matrix, the expertise per annotator is extended as a Dirichlet prior distribution over these prototypes. This prototype learning-driven mechanism circumvents the data sparsity and class imbalance issues, ensuring a richer and more flexible characterization of annotators. Extensive experiments on 11 real-world datasets demonstrate that PTBCC achieves up to a 15% accuracy improvement in the best case, and a 3% higher average accuracy while reducing computational cost by over 90%.
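A simplified sketch of the modeling idea (omitting the paper's full Bayesian inference): each annotator's effective confusion matrix is a mixture of shared prototype confusion matrices, with mixture weights drawn from a Dirichlet prior:

```python
# Sketch of prototype-mixture annotator modeling in the spirit of PTBCC.
import numpy as np

rng = np.random.default_rng(0)
K, S = 3, 2  # K classes, S prototype confusion matrices

# Prototypes: e.g. one "expert" (near-identity) and one "spammer" (uniform).
prototypes = np.stack([
    0.9 * np.eye(K) + 0.1 / K,   # reliable-annotator pattern
    np.full((K, K), 1.0 / K),    # random-guessing pattern
])
prototypes /= prototypes.sum(axis=2, keepdims=True)  # rows are distributions

# Per-annotator mixture weights drawn from a Dirichlet prior over prototypes.
weights = rng.dirichlet(alpha=np.ones(S))
annotator_cm = np.tensordot(weights, prototypes, axes=1)  # (K, K)
print(annotator_cm)  # row k: P(annotator says j | true class is k)
```

Sharing the prototype set across annotators is what lets sparse annotators borrow statistical strength, which is the mechanism the abstract credits for circumventing sparsity and class imbalance.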
[849] Understanding Learning Dynamics Through Structured Representations
Saleh Nikooroo, Thomas Engel
Main category: cs.LG
TL;DR: The paper explores how architectural choices in deep networks influence learning dynamics, proposing enriched transformation layers for improved stability and interpretability.
Details
Motivation: To understand how internal structural choices affect learning systems, moving beyond empirical tweaks to architectural insights.
Method: Introduces enriched transformation layers with constrained pathways and adaptive corrections, analyzing their impact on gradient flow, spectral sensitivity, and fixed-point behavior.
Result: Demonstrates improved robustness, smoother optimization, and scalable depth behavior in synthetic and structured tasks.
Conclusion: Architectural design is crucial for shaping learning dynamics, offering interpretable and scalable solutions for neural systems.
Abstract: While modern deep networks have demonstrated remarkable versatility, their training dynamics remain poorly understood–often driven more by empirical tweaks than architectural insight. This paper investigates how internal structural choices shape the behavior of learning systems. Building on prior efforts that introduced simple architectural constraints, we explore the broader implications of structure for convergence, generalization, and adaptation. Our approach centers on a family of enriched transformation layers that incorporate constrained pathways and adaptive corrections. We analyze how these structures influence gradient flow, spectral sensitivity, and fixed-point behavior–uncovering mechanisms that contribute to training stability and representational regularity. Theoretical analysis is paired with empirical studies on synthetic and structured tasks, demonstrating improved robustness, smoother optimization, and scalable depth behavior. Rather than prescribing fixed templates, we emphasize principles of tractable design that can steer learning behavior in interpretable ways. Our findings support a growing view that architectural design is not merely a matter of performance tuning, but a critical axis for shaping learning dynamics in scalable and trustworthy neural systems.
[850] Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models
Tai An, Ruwu Cai, Yanzhe Zhang, Yang Liu, Hao Chen, Pengcheng Xie, Sheng Chang, Yiwu Yao, Gongyi Wang
Main category: cs.LG
TL;DR: Amber Pruner introduces a training-free N:M activation sparsity method for LLMs, accelerating linear computations without retraining. Combined with W8A8 quantization, it maintains performance across tasks.
Details
Motivation: Address limitations of weight and activation sparsity, such as accuracy degradation and training dependency, to improve LLM inference efficiency.
Method: Proposes Amber Pruner for N:M activation sparsity in the prefill stage and integrates it with W8A8 quantization in Outstanding-sparse.
Result: Achieves over 55% linear computation acceleration without retraining, preserving performance in downstream tasks, especially generative ones.
Conclusion: Pioneers activation sparsity for LLMs, offering insights for future AI system design and algorithm-architecture co-evolution.
Abstract: In the era of large language models (LLMs), N:M sparsity has emerged as a structured compression technique critical for accelerating inference. While prior work has primarily focused on weight sparsity, it often suffers from significant accuracy degradation. Activation sparsity, though promising, is typically training-dependent and faces challenges in generalization. To address these limitations, we introduce Amber Pruner, a training-free N:M activation sparsity method designed specifically for the prefill stage, targeting the acceleration of linear projection layers in LLMs. Extensive experiments across multiple models and sparsity ratios (2:4, 4:8, and 8:16) demonstrate that Amber Pruner can effectively sparsify and accelerate more than 55% of linear computations without requiring model retraining. To further enhance generality and efficiency, we propose Outstanding-sparse, a unified framework that integrates Amber Pruner with post-training W8A8 quantization. Our approach preserves strong performance across a range of downstream tasks, with notable advantages in generative tasks. This work pioneers a new frontier in activation sparsity, providing foundational insights that are poised to guide the co-evolution of algorithms and architectures in the design of next-generation AI systems.
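N:M activation sparsity itself is easy to state: in every contiguous group of M activation values, keep only the N of largest magnitude. A minimal sketch follows (group layout and names are illustrative; real deployments rely on hardware kernels that exploit the structure):

```python
# Sketch of N:M structured sparsity applied to activations before a linear
# projection, as in the prefill-stage setting described above.
import torch

def nm_sparsify(x: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-|x| entries in every contiguous group of m."""
    orig_shape = x.shape
    groups = x.reshape(-1, m)                   # assumes numel divisible by m
    idx = groups.abs().topk(n, dim=-1).indices  # positions to keep per group
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return (groups * mask).reshape(orig_shape)

x = torch.randn(4, 16)           # e.g. prefill activations
x_sparse = nm_sparsify(x, 2, 4)  # 2:4 pattern -> 50% structured sparsity
print((x_sparse == 0).float().mean())  # ~0.5
```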
[851] The Complexity of Extreme Climate Events on the New Zealand’s Kiwifruit Industry
Boyuan Zheng, Victor W. Chu, Zhidong Li, Evan Webster, Ashley Rootsey
Main category: cs.LG
TL;DR: The study analyzes how extreme weather events (frost, drought, extreme rainfall, heatwave) impact kiwifruit yields in New Zealand using Isolation Forest for anomaly detection, revealing variability in effects and limitations in current methods. Future work will use ensemble methods for better accuracy.
Details
Motivation: Climate change has increased extreme weather events, posing challenges to agriculture. The study aims to understand their impact on kiwifruit farming in New Zealand.
Method: Isolation Forest, an unsupervised anomaly detection method, was used to analyze climate history, extreme events, and kiwifruit yields.
Result: Findings show variability in how extreme events affect yields and highlight limitations in detecting events like frost.
Conclusion: The study underscores the need for integrating farm management strategies with climate adaptation and plans to use ensemble methods for improved detection and response strategies.
Abstract: Climate change has intensified the frequency and severity of extreme weather events, presenting unprecedented challenges to the agricultural industry worldwide. In this investigation, we focus on kiwifruit farming in New Zealand. We propose to examine the impacts of climate-induced extreme events, specifically frost, drought, extreme rainfall, and heatwave, on kiwifruit harvest yields. These four events were selected due to their significant impacts on crop productivity and their prevalence as recorded by climate monitoring institutions in the country. We employed Isolation Forest, an unsupervised anomaly detection method, to analyse climate history and recorded extreme events, alongside kiwifruit yields. Our analysis reveals considerable variability in how different types of extreme event affect kiwifruit yields, underscoring notable discrepancies between climatic extremes and individual farms' yield outcomes. Additionally, our study highlights critical limitations of current anomaly detection approaches, particularly in accurately identifying events such as frost. These findings emphasise the need for integrating supplementary features like farm management strategies with climate adaptation practices. Our further investigation will employ ensemble methods that consolidate nearby farms' yield data and regional climate station features to reduce variance, thereby enhancing the accuracy and reliability of extreme event detection and the formulation of response strategies.
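Since the detector is the standard Isolation Forest, a small scikit-learn sketch shows the workflow; the feature columns and contamination rate below are illustrative assumptions, not the study's configuration.

```python
# Flag anomalous growing seasons from climate features with IsolationForest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# columns: [seasonal rainfall (mm), frost days, mean temp (C)] -- assumed features
X = rng.normal(loc=[900.0, 5.0, 14.0], scale=[150.0, 2.0, 1.5], size=(200, 3))

model = IsolationForest(contamination=0.05, random_state=0).fit(X)
labels = model.predict(X)          # +1 = normal season, -1 = candidate extreme event
scores = model.score_samples(X)    # lower scores = more anomalous
print(f"flagged {np.sum(labels == -1)} of {len(X)} seasons")
```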
[852] FedLAD: A Linear Algebra Based Data Poisoning Defence for Federated Learning
Qi Xiong, Hai Dong, Nasrin Sohrabi, Zahir Tari
Main category: cs.LG
TL;DR: FedLAD is a novel defence method against Sybil attacks in federated learning, using linear algebra to detect and filter malicious nodes, outperforming existing methods.
Details
Motivation: Sybil attacks, especially targeted data poisoning, threaten federated learning by allowing malicious nodes to dominate, necessitating robust countermeasures.
Method: FedLAD models the aggregation process as a linear algebra problem, identifying attacks by extracting independent linear combinations and filtering malicious elements.
Result: FedLAD outperforms five baseline methods, maintaining low attack success rates and high model accuracy across varying malicious node ratios.
Conclusion: FedLAD enhances federated learning security and performance, proving effective against data poisoning attacks.
Abstract: Sybil attacks pose a significant threat to federated learning, as malicious nodes can collaborate and gain a majority, thereby overwhelming the system. Therefore, it is essential to develop countermeasures that ensure the security of federated learning environments. We present a novel defence method against targeted data poisoning, which is one of the types of Sybil attacks, called Linear Algebra-based Detection (FedLAD). Unlike existing approaches, such as clustering and robust training, which struggle in situations where malicious nodes dominate, FedLAD models the federated learning aggregation process as a linear problem, transforming it into a linear algebra optimisation challenge. This method identifies potential attacks by extracting the independent linear combinations from the original linear combinations, effectively filtering out redundant and malicious elements. Extensive experimental evaluations demonstrate the effectiveness of FedLAD compared to five well-established defence methods: Sherpa, CONTRA, Median, Trimmed Mean, and Krum. Using tasks from both image classification and natural language processing, our experiments confirm that FedLAD is robust and not dependent on specific application settings. The results indicate that FedLAD effectively protects federated learning systems across a broad spectrum of malicious node ratios. Compared to baseline defence methods, FedLAD maintains a low attack success rate for malicious nodes when their ratio ranges from 0.2 to 0.8. Additionally, it preserves high model accuracy when the malicious node ratio is between 0.2 and 0.5. These findings underscore FedLAD’s potential to enhance both the reliability and performance of federated learning systems in the face of data poisoning attacks.
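As a rough illustration of the underlying linear-algebra idea (not FedLAD itself), the sketch below keeps a maximal linearly independent subset of stacked client updates via pivoted QR and drops redundant ones; the tolerance rule and helper name are our assumptions.

```python
# Filter redundant (possibly colluding) client updates by linear independence.
import numpy as np
from scipy.linalg import qr

def independent_rows(updates: np.ndarray, tol: float = 1e-8) -> np.ndarray:
    """Return indices of a maximal set of linearly independent update vectors."""
    # QR with column pivoting on the transpose ranks client updates by how much
    # new direction each one contributes.
    _, r, piv = qr(updates.T, pivoting=True)
    diag = np.abs(np.diag(r))
    rank = int(np.sum(diag > tol * diag[0]))
    return np.sort(piv[:rank])

updates = np.random.randn(10, 50)
updates[7] = 2.0 * updates[3]          # a redundant, Sybil-like duplicate
keep = independent_rows(updates)
print("kept clients:", keep)           # one of the two collinear updates is dropped
```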
[853] Fitness aligned structural modeling enables scalable virtual screening with AuroBind
Zhongyue Zhang, Jiahua Rao, Jie Zhong, Weiqiang Bai, Dongxue Wang, Shaobo Ning, Lifeng Qiao, Sheng Xu, Runze Ma, Will Hua, Jack Xiaoyu Chen, Odin Zhang, Wei Lu, Hanyi Feng, He Yang, Xinchao Shi, Rui Li, Wanli Ouyang, Xinzhu Ma, Jiahao Wang, Jixian Zhang, Jia Duan, Siqi Sun, Jian Zhang, Shuangjia Zheng
Main category: cs.LG
TL;DR: AuroBind is a scalable virtual screening framework that improves atomic-level precision and binding fitness prediction, outperforming existing methods and enabling faster, high-throughput molecular screening.
Details
Motivation: Over 96% of human proteins are undrugged, and current virtual screening methods lack precision and fail to predict binding fitness, limiting therapeutic discovery.
Method: AuroBind fine-tunes a custom atomic-level structural model using million-scale chemogenomic data, integrating direct preference optimization, self-distillation, and a teacher-student acceleration strategy.
Result: AuroBind outperforms state-of-the-art models, achieves 100,000-fold faster screening, and identifies potent compounds with hit rates of 7-69% in experimental screens, including for orphan GPCRs.
Conclusion: AuroBind bridges the gap between structure prediction and therapeutic discovery, offering a generalizable framework for high-throughput molecular screening.
Abstract: Most human proteins remain undrugged: over 96% are unexploited by approved therapeutics. While structure-based virtual screening promises to expand the druggable proteome, existing methods lack atomic-level precision and fail to predict binding fitness, limiting translational impact. We present AuroBind, a scalable virtual screening framework that fine-tunes a custom atomic-level structural model on million-scale chemogenomic data. AuroBind integrates direct preference optimization, self-distillation from high-confidence complexes, and a teacher-student acceleration strategy to jointly predict ligand-bound structures and binding fitness. The proposed models outperform state-of-the-art models on structural and functional benchmarks while enabling 100,000-fold faster screening across ultra-large compound libraries. In a prospective screen across ten disease-relevant targets, AuroBind achieved experimental hit rates of 7-69%, with top compounds reaching sub-nanomolar to picomolar potency. For the orphan GPCRs GPR151 and GPR160, AuroBind identified both agonists and antagonists with success rates of 16-30%, and functional assays confirmed GPR160 modulation in liver and prostate cancer models. AuroBind offers a generalizable framework for structure-function learning and high-throughput molecular screening, bridging the gap between structure prediction and therapeutic discovery.
[854] PIGDreamer: Privileged Information Guided World Models for Safe Partially Observable Reinforcement Learning
Dongchi Huang, Jiaqi Wang, Yang Li, Chunhe Xia, Tianle Zhang, Kaige Zhang
Main category: cs.LG
TL;DR: The paper introduces ACPOMDPs and a model-based safe RL method using privileged information to improve safety and performance.
Details
Motivation: Addressing the challenge of partial observability in safe RL by leveraging privileged information.
Method: Proposes ACPOMDPs and the Privileged Information Guided Dreamer, using privileged representation alignment and an asymmetric actor-critic structure.
Result: Outperforms existing methods in safety and performance, with easier training compared to alternatives.
Conclusion: Privileged information enhances RL safety and performance, validated by empirical results.
Abstract: Partial observability presents a significant challenge for safe reinforcement learning, as it impedes the identification of potential risks and rewards. Leveraging specific types of privileged information during training to mitigate the effects of partial observability has yielded notable empirical successes. In this paper, we propose Asymmetric Constrained Partially Observable Markov Decision Processes (ACPOMDPs) to theoretically examine the advantages of incorporating privileged information. Building upon ACPOMDPs, we propose the Privileged Information Guided Dreamer, a model-based safe reinforcement learning approach that leverages privileged information to enhance the agent’s safety and performance through privileged representation alignment and an asymmetric actor-critic structure. Our empirical results demonstrate that our approach significantly outperforms existing methods in terms of safety and task-centric performance. Meanwhile, compared to alternative privileged model-based reinforcement learning methods, our approach exhibits superior performance and ease of training.
[855] User Trajectory Prediction Unifying Global and Local Temporal Information
Wei Hao, Bin Chong, Ronghua Ji, Chen Hou
Main category: cs.LG
TL;DR: A trajectory prediction model using MLP, MSCNN, and cross-attention reduces forecasting errors while maintaining inference time.
Details
Motivation: Trajectory prediction is crucial for proactive strategies, but extracting complete temporal patterns is challenging due to global and local temporal information and varying user behavior time scales.
Method: The model combines MLP for global temporal information, MSCNN with multi-scale kernels for local temporal information, and cross-attention to fuse both.
Result: The model reduces MSE by 5.04% and MAE by 4.35% compared to ModernTCN in 12-step prediction, with similar inference time.
Conclusion: The proposed model effectively addresses the challenges of trajectory prediction by leveraging multi-scale temporal information fusion.
Abstract: Trajectory prediction is essential for formulating proactive strategies that anticipate user mobility and support advance preparation. Therefore, how to reduce the forecasting error in user trajectory prediction within an acceptable inference time arises as an interesting issue. However, trajectory data contains both global and local temporal information, complicating the extraction of the complete temporal pattern. Moreover, user behavior occurs over different time scales, increasing the difficulty of capturing behavioral patterns. To address these challenges, a trajectory prediction model based on multilayer perceptron (MLP), multi-scale convolutional neural network (MSCNN), and cross-attention (CA) is proposed. Specifically, MLP is used to extract the global temporal information of each feature. In parallel, MSCNN is employed to extract the local temporal information by modeling interactions among features within a local temporal range. Convolutional kernels with different sizes are used in MSCNN to capture temporal information at multiple resolutions, enhancing the model’s adaptability to different behavioral patterns. Finally, CA is applied to fuse the global and local temporal information. Experimental results show that our model reduces mean squared error (MSE) by 5.04% and mean absolute error (MAE) by 4.35% compared with ModernTCN in 12-step prediction, while maintaining similar inference time.
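A compact PyTorch sketch of the described three-branch design (MLP for global patterns, multi-scale Conv1d for local patterns, cross-attention for fusion); layer sizes, kernel sizes, and pooling choices are guesses, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TrajectoryPredictor(nn.Module):
    def __init__(self, n_feats=2, seq_len=24, d_model=64, horizon=12):
        super().__init__()
        self.global_mlp = nn.Sequential(            # global branch: whole sequence
            nn.Linear(seq_len, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        self.local_convs = nn.ModuleList([          # local branch: multi-scale kernels
            nn.Conv1d(n_feats, d_model, k, padding=k // 2) for k in (3, 5, 7)])
        self.fuse = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.head = nn.Linear(d_model, horizon * n_feats)
        self.n_feats, self.horizon = n_feats, horizon

    def forward(self, x):                           # x: (batch, seq_len, n_feats)
        g = self.global_mlp(x.transpose(1, 2))      # (batch, n_feats, d_model)
        locs = [c(x.transpose(1, 2)) for c in self.local_convs]
        l = torch.stack([t.mean(dim=-1) for t in locs], dim=1)  # (batch, 3, d_model)
        fused, _ = self.fuse(query=g, key=l, value=l)           # cross-attention
        out = self.head(fused.mean(dim=1))
        return out.view(-1, self.horizon, self.n_feats)

model = TrajectoryPredictor()
pred = model(torch.randn(8, 24, 2))                 # -> (8, 12, 2)
```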
[856] Multi-Treatment-DML: Causal Estimation for Multi-Dimensional Continuous Treatments with Monotonicity Constraints in Personal Loan Risk Optimization
Kexin Zhao, Bo Wang, Cuiying Zhao, Tongyao Wan
Main category: cs.LG
TL;DR: The paper introduces Multi-Treatment-DML, a framework for optimizing continuous, multi-dimensional loan treatments (e.g., credit limits, interest rates) using Double Machine Learning to address biases in observational data and enforce domain-specific monotonic constraints.
Details
Motivation: Existing causal methods struggle with continuous, multi-dimensional treatments in loan platforms, and randomized trials are often impractical due to risk controls and long repayment cycles.
Method: Proposes Multi-Treatment-DML, leveraging Double Machine Learning to debias observational data, handle continuous treatments, and enforce monotonic treatment-outcome relationships.
Result: Demonstrates effectiveness on public benchmarks and real-world datasets, with online A/B tests confirming practical superiority in loan operations.
Conclusion: Multi-Treatment-DML successfully addresses gaps in causal estimation for continuous, multi-dimensional treatments in personal loan platforms, proving its real-world applicability.
Abstract: Optimizing credit limits, interest rates, and loan terms is crucial for managing borrower risk and lifetime value (LTV) on personal loan platforms. However, counterfactual estimation of these continuous, multi-dimensional treatments faces significant challenges: randomized trials are often prohibited by risk controls and long repayment cycles, forcing reliance on biased observational data. Existing causal methods primarily handle binary/discrete treatments and struggle with continuous, multi-dimensional settings. Furthermore, financial domain knowledge mandates provably monotonic treatment-outcome relationships (e.g., risk increases with credit limit). To address these gaps, we propose Multi-Treatment-DML, a novel framework leveraging Double Machine Learning (DML) to: (i) debias observational data for causal effect estimation; (ii) handle arbitrary-dimensional continuous treatments; and (iii) enforce monotonic constraints between treatments and outcomes, guaranteeing adherence to domain requirements. Extensive experiments on public benchmarks and real-world industrial datasets demonstrate the effectiveness of our approach. Furthermore, online A/B testing conducted on a real-world personal loan platform confirms the practical superiority of Multi-Treatment-DML in real-world loan operations.
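For readers unfamiliar with DML, the sketch below shows cross-fitted residual-on-residual estimation for a single continuous treatment, a simplified stand-in for the paper's multi-dimensional, monotonicity-constrained framework; the data-generating process is synthetic.

```python
# Double machine learning for one continuous treatment (simplified sketch).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))                                 # borrower covariates
t = X @ rng.normal(size=5) + rng.normal(size=2000)             # e.g. credit limit
y = 0.7 * t + X @ rng.normal(size=5) + rng.normal(size=2000)   # e.g. risk outcome

# Stage 1: partial out confounding with cross-fitted nuisance models.
t_res = t - cross_val_predict(GradientBoostingRegressor(), X, t, cv=5)
y_res = y - cross_val_predict(GradientBoostingRegressor(), X, y, cv=5)

# Stage 2: regress outcome residuals on treatment residuals -> causal effect.
theta = (t_res @ y_res) / (t_res @ t_res)
print(f"estimated effect: {theta:.3f} (true 0.7)")
```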
[857] CAAD: Context-Aware Adaptive Decoding for Truthful Text Generation
Manh Nguyen, Sunil Gupta, Hung Le
Main category: cs.LG
TL;DR: A context-aware adaptive decoding method improves truthfulness in LLMs using minimal annotated data and no retraining, outperforming baselines on benchmarks like TruthfulQA.
Details
Motivation: Addressing the challenge of truthfulness in LLMs without requiring extensive annotated data or computational resources.
Method: Leverages a compact reference grounding space built from few examples to retrieve and aggregate next token logits during inference.
Result: Achieves a 2.8% average improvement on TruthfulQA and outperforms baselines on Biographies and WikiQA, showing cross-task generalization.
Conclusion: The method is scalable, efficient, and model-agnostic, highlighting the potential of context-aware decoding for factual reliability.
Abstract: Ensuring truthfulness in large language models remains a critical challenge for reliable text generation. While supervised fine-tuning and reinforcement learning with human feedback have shown promise, they require substantial amount of annotated data and computational resources, limiting scalability. In contrast, decoding-time interventions offer lightweight alternatives without model retraining. However, existing decoding strategies often face issues like prompt sensitivity, limited generalization, or dependence on internal model states. We propose a context-aware adaptive decoding method that leverages a compact reference grounding space, built from as few as 10 annotated examples and comprising pairs of context embeddings and next token logits from truthful responses, to enable retrieval-based logit shaping during inference. At each decoding step, our method retrieves top-N semantically similar contexts and aggregates their associated next token logits to modify the LLM’s logits. Across three open-ended question-answering benchmarks, our approach achieves a 2.8 percent average improvement on TruthfulQA and further outperforms existing baselines on both Biographies and WikiQA. Experimental results also demonstrate cross-task generalization, with TruthfulQA-derived grounding enhancing biography generation. Our model-agnostic, scalable, and efficient method requires only a single generation pass, highlighting the potential of context-aware decoding for factual reliability in LLMs.
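The retrieval-and-shaping step can be pictured in a few lines of NumPy: retrieve the top-N most similar stored contexts and blend their cached next-token logits into the model's logits. The softmax weighting and the mixing weight `alpha` are assumptions, not the paper's exact aggregation rule.

```python
import numpy as np

def shaped_logits(query_emb, ref_embs, ref_logits, model_logits, top_n=5, alpha=0.3):
    # cosine similarity between the current context and the reference contexts
    sims = ref_embs @ query_emb / (
        np.linalg.norm(ref_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    top = np.argsort(sims)[-top_n:]                 # indices of top-N neighbours
    weights = np.exp(sims[top]) / np.exp(sims[top]).sum()
    retrieved = weights @ ref_logits[top]           # aggregated reference logits
    return (1 - alpha) * model_logits + alpha * retrieved

# toy shapes: 100 stored contexts, 512-dim embeddings, 32k-token vocabulary
rng = np.random.default_rng(0)
out = shaped_logits(rng.normal(size=512), rng.normal(size=(100, 512)),
                    rng.normal(size=(100, 32000)), rng.normal(size=32000))
```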
[858] Balancing Information Accuracy and Response Timeliness in Networked LLMs
Yigit Turkmen, Baturalp Buyukates, Melih Bastopcu
Main category: cs.LG
TL;DR: A networked LLM system with specialized models improves accuracy and timeliness over individual LLMs, especially when models perform similarly.
Details
Motivation: Address the challenges of training data, computational resources, and energy consumption in LLMs by leveraging smaller, specialized models.
Method: Use a networked system with users, a central task processor, and topic-specialized LLM clusters to aggregate responses for binary queries.
Result: Aggregated responses achieve higher accuracy than individual LLMs, particularly when models have similar standalone performance.
Conclusion: Specialized LLM clusters offer a practical solution to improve response quality and efficiency in LLM deployment.
Abstract: Recent advancements in Large Language Models (LLMs) have transformed many fields including scientific discovery, content generation, biomedical text mining, and educational technology. However, the substantial requirements for training data, computational resources, and energy consumption pose significant challenges for their practical deployment. A promising alternative is to leverage smaller, specialized language models and aggregate their outputs to improve overall response quality. In this work, we investigate a networked LLM system composed of multiple users, a central task processor, and clusters of topic-specialized LLMs. Each user submits categorical binary (true/false) queries, which are routed by the task processor to a selected cluster of $m$ LLMs. After gathering individual responses, the processor returns a final aggregated answer to the user. We characterize both the information accuracy and response timeliness in this setting, and formulate a joint optimization problem to balance these two competing objectives. Our extensive simulations demonstrate that the aggregated responses consistently achieve higher accuracy than those of individual LLMs. Notably, this improvement is more significant when the participating LLMs exhibit similar standalone performance.
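The accuracy side of the trade-off follows from a standard binomial-tail argument: if each of m independent models answers a binary query correctly with probability p > 0.5, majority-vote accuracy grows with m. A tiny calculator:

```python
from math import comb

def majority_accuracy(p: float, m: int) -> float:
    """P(majority of m i.i.d. voters is correct), for odd m."""
    return sum(comb(m, k) * p**k * (1 - p)**(m - k)
               for k in range((m // 2) + 1, m + 1))

for m in (1, 3, 5, 7):
    print(m, round(majority_accuracy(0.7, m), 4))   # accuracy grows with cluster size
```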
[859] Multi-Policy Pareto Front Tracking Based Online and Offline Multi-Objective Reinforcement Learning
Zeyu Zhao, Yueling Che, Kaichen Liu, Jian Li, Junmei Yao
Main category: cs.LG
TL;DR: The paper introduces the Multi-policy Pareto Front Tracking (MPFT) framework for MORL, improving efficiency by avoiding large policy populations and combining online/offline methods.
Details
Motivation: Traditional MP methods in MORL are sample-inefficient and resource-heavy due to reliance on evolutionary frameworks with large policy populations.
Method: MPFT framework involves four stages: approximating Pareto-vertex policies, tracking the Pareto front, filling sparse regions, and combining policies for final approximation.
Result: MPFT outperforms benchmarks in hypervolume performance with reduced interactions and hardware needs, tested on robotic control tasks.
Conclusion: MPFT offers a more efficient and scalable solution for MORL by eliminating evolutionary frameworks and leveraging both online/offline methods.
Abstract: Multi-objective reinforcement learning (MORL) plays a pivotal role in addressing multi-criteria decision-making problems in the real world. The multi-policy (MP) based methods are widely used to obtain high-quality Pareto front approximation for the MORL problems. However, traditional MP methods only rely on the online reinforcement learning (RL) and adopt the evolutionary framework with a large policy population. This may lead to sample inefficiency and/or overwhelmed agent-environment interactions in practice. By forsaking the evolutionary framework, we propose the novel Multi-policy Pareto Front Tracking (MPFT) framework without maintaining any policy population, where both online and offline MORL algorithms can be applied. The proposed MPFT framework includes four stages: Stage 1 approximates all the Pareto-vertex policies, whose mapping to the objective space fall on the vertices of the Pareto front. Stage 2 designs the new Pareto tracking mechanism to track the Pareto front, starting from each of the Pareto-vertex policies. Stage 3 identifies the sparse regions in the tracked Pareto front, and introduces a new objective weight adjustment method to fill the sparse regions. Finally, by combining all the policies tracked in Stages 2 and 3, Stage 4 approximates the Pareto front. Experiments are conducted on seven different continuous-action robotic control tasks with both online and offline MORL algorithms, and demonstrate the superior hypervolume performance of our proposed MPFT approach over the state-of-the-art benchmarks, with significantly reduced agent-environment interactions and hardware requirements.
[860] Pigeon-SL: Robust Split Learning Framework for Edge Intelligence under Malicious Clients
Sangjun Park, Tony Q. S. Quek, Hyowoon Seo
Main category: cs.LG
TL;DR: Pigeon-SL is a novel split learning scheme that ensures robustness against malicious clients by partitioning them into clusters and selecting the most honest one for updates, improving accuracy and resilience.
Details
Motivation: Split learning (SL) is vulnerable to malicious clients, which degrade model accuracy. Pigeon-SL addresses this by ensuring at least one honest cluster exists among clients.
Method: Clients are partitioned into N+1 clusters, trained independently via vanilla SL, and evaluated. Only the cluster with the lowest validation loss advances, discarding malicious updates. Pigeon-SL+ enhances efficiency by repeating training on the selected cluster.
Result: Pigeon-SL significantly improves accuracy and resilience under label flipping, activation, and gradient manipulation attacks compared to baseline SL methods.
Conclusion: Pigeon-SL and Pigeon-SL+ offer robust and efficient solutions for secure distributed learning in wireless networks, isolating malicious updates effectively.
Abstract: Recent advances in split learning (SL) have established it as a promising framework for privacy-preserving, communication-efficient distributed learning at the network edge. However, SL’s sequential update process is vulnerable to even a single malicious client, which can significantly degrade model accuracy. To address this, we introduce Pigeon-SL, a novel scheme grounded in the pigeonhole principle that guarantees at least one entirely honest cluster among M clients, even when up to N of them are adversarial. In each global round, the access point partitions the clients into N+1 clusters, trains each cluster independently via vanilla SL, and evaluates their validation losses on a shared dataset. Only the cluster with the lowest loss advances, thereby isolating and discarding malicious updates. We further enhance training and communication efficiency with Pigeon-SL+, which repeats training on the selected cluster to match the update throughput of standard SL. We validate the robustness and effectiveness of our approach under three representative attack models – label flipping, activation and gradient manipulation – demonstrating significant improvements in accuracy and resilience over baseline SL methods in future intelligent wireless networks.
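The selection rule itself is simple to state in code. The sketch below partitions clients into n+1 clusters and keeps only the lowest-validation-loss cluster; `train_cluster` and `validate` are placeholder callables standing in for vanilla SL training and shared-dataset evaluation.

```python
import random

def pigeon_round(clients, n_malicious, train_cluster, validate):
    """One global round: with <= n_malicious adversaries, at least one of the
    n_malicious + 1 clusters is fully honest (pigeonhole principle)."""
    random.shuffle(clients)
    k = n_malicious + 1
    clusters = [clients[i::k] for i in range(k)]        # n+1 disjoint clusters
    models = [train_cluster(c) for c in clusters]       # vanilla SL per cluster
    losses = [validate(m) for m in models]              # shared validation set
    best = min(range(k), key=losses.__getitem__)
    return models[best], clusters[best]                 # discard the rest
```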
[861] Skeleton-Guided Learning for Shortest Path Search
Tiantian Liu, Xiao Li, Huan Li, Hua Lu, Christian S. Jensen, Jianliang Xu
Main category: cs.LG
TL;DR: A learning-based framework for shortest path search on generic graphs, using a skeleton graph and neural network to reduce search space while maintaining accuracy.
Details
Motivation: Existing methods (e.g., Dijkstra's, A*, index-based, or learning-based) are inefficient, storage-heavy, or limited to spatial graphs. A versatile solution is needed.
Method: Constructs a skeleton graph for multi-level distance/hop info, trains a Skeleton Graph Neural Network (SGNN) for embeddings, and uses LSearch/HLSearch for guided search.
Result: Achieves strong performance on five diverse real-world graphs, proving flexibility and effectiveness.
Conclusion: The framework offers a general, efficient solution for learning-based shortest path search without domain-specific features.
Abstract: Shortest path search is a core operation in graph-based applications, yet existing methods face important limitations. Classical algorithms such as Dijkstra’s and A* become inefficient as graphs grow more complex, while index-based techniques often require substantial preprocessing and storage. Recent learning-based approaches typically focus on spatial graphs and rely on context-specific features like geographic coordinates, limiting their general applicability. We propose a versatile learning-based framework for shortest path search on generic graphs, without requiring domain-specific features. At the core of our approach is the construction of a skeleton graph that captures multi-level distance and hop information in a compact form. A Skeleton Graph Neural Network (SGNN) operates on this structure to learn node embeddings and predict distances and hop lengths between node pairs. These predictions support LSearch, a guided search algorithm that uses model-driven pruning to reduce the search space while preserving accuracy. To handle larger graphs, we introduce a hierarchical training strategy that partitions the graph into subgraphs with individually trained SGNNs. This structure enables HLSearch, an extension of our method for efficient path search across graph partitions. Experiments on five diverse real-world graphs demonstrate that our framework achieves strong performance across graph types, offering a flexible and effective solution for learning-based shortest path search.
[862] An Enhanced Focal Loss Function to Mitigate Class Imbalance in Auto Insurance Fraud Detection with Explainable AI
Francis Boabang, Samuel Asante Gyamerah
Main category: cs.LG
TL;DR: A novel multistage focal loss function improves fraud prediction in imbalanced insurance datasets by dynamically adjusting focus on hard samples, outperforming traditional methods.
Details
Motivation: Addressing class imbalance in insurance fraud prediction to enhance model performance and robustness.
Method: Proposes a dynamic, multi-stage convex and nonconvex focal loss mechanism to refine focus on hard samples during training.
Result: Achieves better accuracy, precision, F1-score, recall, and AUC than traditional focal loss on real-world insurance data.
Conclusion: The multistage focal loss boosts model robustness and predictive accuracy in skewed classification, with implications for fraud detection systems.
Abstract: In insurance fraud prediction, handling class imbalance remains a critical challenge. This paper presents a novel multistage focal loss function designed to enhance the performance of machine learning models in such imbalanced settings by helping to escape local minima and converge to a good solution. Building upon the foundation of the standard focal loss, our proposed approach introduces a dynamic, multi-stage convex and nonconvex mechanism that progressively adjusts the focus on hard-to-classify samples across training epochs. This strategic refinement facilitates more stable learning and improved discrimination between fraudulent and legitimate cases. Through extensive experimentation on a real-world insurance dataset, our method achieved better performance than the traditional focal loss, as measured by accuracy, precision, F1-score, recall and Area Under the Curve (AUC) metrics on the auto insurance dataset. These results demonstrate the efficacy of the multistage focal loss in boosting model robustness and predictive accuracy in highly skewed classification tasks, offering significant implications for fraud detection systems in the insurance industry. An explainable model is included to interpret the results.
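One plausible reading of a "multistage" focal loss, with the focusing parameter gamma stepping up across training epochs, is sketched below; the schedule and stage boundaries are our assumptions, not the paper's exact mechanism (which also alternates convex and nonconvex stages).

```python
import torch
import torch.nn.functional as F

def multistage_focal_loss(logits, targets, epoch,
                          schedule=((0, 0.5), (10, 2.0), (30, 4.0))):
    """Binary focal loss whose gamma steps up at the epochs in `schedule`
    (assumes gammas increase stage by stage)."""
    gamma = max(g for e, g in schedule if epoch >= e)   # current stage's gamma
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)           # prob of the true class
    ce = F.binary_cross_entropy_with_logits(logits, targets.float(),
                                            reduction="none")
    return ((1 - p_t) ** gamma * ce).mean()             # down-weight easy samples

loss = multistage_focal_loss(torch.randn(16), torch.randint(0, 2, (16,)), epoch=12)
```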
[863] Flexible Automatic Identification and Removal (FAIR)-Pruner: An Efficient Neural Network Pruning Method
Chenqing Lin, Mostafa Hussien, Chengyao Yu, Mohamed Cheriet, Osama Abdelrahman, Ruixing Ming
Main category: cs.LG
TL;DR: FAIR-Pruner is a novel neural network pruning method that automatically identifies and removes redundant units using Wasserstein distance and Taylor expansion, achieving efficient compression without fine-tuning.
Details
Motivation: To enable deployment of large neural networks on edge devices by reducing computational and memory overhead through structured pruning.
Method: FAIR-Pruner evaluates unit importance via Wasserstein distance (Utilization Score), computes Reconstruction Error via Taylor expansion, and uses Tolerance of Difference to identify superfluous units. It automatically determines layer-wise pruning rates.
Result: Achieves significant model compression while maintaining high accuracy on benchmarks like ImageNet and architectures like VGG, without needing post-pruning fine-tuning.
Conclusion: FAIR-Pruner is effective for neural network compression, offering flexibility and efficiency in pruning without sacrificing performance.
Abstract: Neural network pruning is a critical compression technique that facilitates the deployment of large-scale neural networks on resource-constrained edge devices, typically by identifying and eliminating redundant or insignificant parameters to reduce computational and memory overhead. This paper proposes the Flexible Automatic Identification and Removal (FAIR)-Pruner, a novel method for neural network structured pruning. Specifically, FAIR-Pruner first evaluates the importance of each unit (e.g., neuron or channel) through the Utilization Score quantified by the Wasserstein distance. To reflect the performance degradation after unit removal, it then introduces the Reconstruction Error, which is computed via the Taylor expansion of the loss function. Finally, FAIR-Pruner identifies superfluous units with negligible impact on model performance by controlling the proposed Tolerance of Difference, which measures differences between unimportant units and those that cause performance degradation. A major advantage of FAIR-Pruner lies in its capacity to automatically determine the layer-wise pruning rates, which yields a more efficient subnetwork structure compared to applying a uniform pruning rate. Another advantage of the FAIR-Pruner is its great one-shot performance without post-pruning fine-tuning. Furthermore, with utilization scores and reconstruction errors, users can flexibly obtain pruned models under different pruning ratios. Comprehensive experimental validation on diverse benchmark datasets (e.g., ImageNet) and various neural network architectures (e.g., VGG) demonstrates that FAIR-Pruner achieves significant model compression while maintaining high accuracy.
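The Utilization Score can be illustrated with SciPy's one-dimensional Wasserstein distance: score each channel by how far its activation distribution is from a degenerate "unused" reference, so barely-used channels score low. The zero reference and the selection rule below are assumptions for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def utilization_scores(activations: np.ndarray) -> np.ndarray:
    """activations: (n_samples, n_channels) observed pre-pruning activations."""
    zero_ref = np.zeros(activations.shape[0])        # a degenerate 'unused' channel
    return np.array([wasserstein_distance(activations[:, c], zero_ref)
                     for c in range(activations.shape[1])])

acts = np.abs(np.random.randn(256, 32))
scores = utilization_scores(acts)
prune_candidates = np.argsort(scores)[:8]            # lowest-utilization channels
```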
[864] Pre-Tactical Flight-Delay and Turnaround Forecasting with Synthetic Aviation Data
Abdulmajid Murad, Massimiliano Ruocco
Main category: cs.LG
TL;DR: The paper explores using synthetic data to train ML models for aviation predictions, showing transformer-based generators retain 94-97% of real-data performance while preserving operational insights.
Details
Motivation: Restricted access to real flight data hinders predictive model development; synthetic data could offer a solution while maintaining confidentiality.
Method: Evaluated four synthetic data generators on three tasks (turnaround time, departure/arrival delays) using a TSTR approach on 1.7M flight records.
Result: Transformer-based generators achieved 94-97% of real-data predictive performance and preserved feature importance patterns.
Conclusion: High-quality synthetic data can enable aviation analytics without compromising confidentiality, though pre-tactical accuracy has inherent limits.
Abstract: Access to comprehensive flight operations data remains severely restricted in aviation due to commercial sensitivity and competitive considerations, hindering the development of predictive models for operational planning. This paper investigates whether synthetic data can effectively replace real operational data for training machine learning models in pre-tactical aviation scenarios: predictions made hours to days before operations using only scheduled flight information. We evaluate four state-of-the-art synthetic data generators on three prediction tasks: aircraft turnaround time, departure delays, and arrival delays. Using a Train on Synthetic, Test on Real (TSTR) methodology on over 1.7 million European flight records, we first validate synthetic data quality through fidelity assessments, then assess both predictive performance and the preservation of operational relationships. Our results show that advanced neural network architectures, specifically transformer-based generators, can retain 94-97% of real-data predictive performance while maintaining feature importance patterns informative for operational decision-making. Our analysis reveals that even with real data, prediction accuracy is inherently limited when only scheduled information is available-establishing realistic baselines for pre-tactical forecasting. These findings suggest that high-quality synthetic data can enable broader access to aviation analytics capabilities while preserving commercial confidentiality, though stakeholders must maintain realistic expectations about pre-tactical prediction accuracy given the stochastic nature of flight operations.
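The TSTR protocol itself is easy to operationalize. The hedged sketch below fits the same model class on synthetic and on real training data and reports a retention ratio on held-out real data; the model choice and metric are illustrative, not the paper's setup.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def tstr_retention(X_syn, y_syn, X_real, y_real, X_test, y_test):
    """Train on Synthetic, Test on Real: compare against a real-data baseline."""
    syn_model = RandomForestRegressor(random_state=0).fit(X_syn, y_syn)
    real_model = RandomForestRegressor(random_state=0).fit(X_real, y_real)
    mae_syn = mean_absolute_error(y_test, syn_model.predict(X_test))
    mae_real = mean_absolute_error(y_test, real_model.predict(X_test))
    return mae_real / mae_syn      # ~1.0 means synthetic data loses little utility
```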
[865] NMS: Efficient Edge DNN Training via Near-Memory Sampling on Manifolds
Boran Zhao, Haiduo Huang, Qiwei Dang, Wenzhe Zhao, Tian Xia, Pengju Ren
Main category: cs.LG
TL;DR: The paper proposes DE-SNE, a DNN-free sample-selecting algorithm, and NMS, a near-memory computing system, to reduce energy consumption and improve generalization in DNN training on edge devices.
Details
Motivation: Training DNNs on edge devices faces challenges like high energy consumption and generalization issues due to reliance on large datasets and DNN-based sample evaluation.
Method: Introduces DE-SNE for DNN-free sample selection and NMS for near-memory computing to minimize DDR energy usage.
Result: NMS outperforms SOTA methods (DQ, DQAS, NeSSA) in model accuracy while reducing energy consumption.
Conclusion: The proposed NMS system effectively addresses generalization and energy issues in edge device DNN training.
Abstract: Training deep neural networks (DNNs) on edge devices has attracted increasing attention due to its potential to address challenges related to domain adaptation and privacy preservation. However, DNNs typically rely on large datasets for training, which results in substantial energy consumption, making the training in edge devices impractical. Some dataset compression methods have been proposed to solve this challenge. For instance, the coreset selection and dataset distillation reduce the training cost by selecting and generating representative samples respectively. Nevertheless, these methods have two significant defects: (1) The necessity of leveraging a DNN model to evaluate the quality of representative samples, which inevitably introduces inductive bias of DNN, resulting in a severe generalization issue; (2) All training images require multiple accesses to the DDR via long-distance PCB connections, leading to substantial energy overhead. To address these issues, inspired by the nonlinear manifold stationarity of the human brain, we first propose a DNN-free sample-selecting algorithm, called DE-SNE, to improve the generalization issue. Secondly, we innovatively utilize the near-memory computing technique to implement DE-SNE, thus only a small fraction of images need to access the DDR via long-distance PCB connections. It significantly reduces DDR energy consumption. As a result, we build a novel expedited DNN training system with a more efficient in-place Near-Memory Sampling characteristic for edge devices, dubbed NMS. As far as we know, our NMS is the first DNN-free near-memory sampling technique that can effectively alleviate generalization issues and significantly reduce DDR energy caused by dataset access. The experimental results show that our NMS outperforms the current state-of-the-art (SOTA) approaches, namely DQ, DQAS, and NeSSA, in model accuracy.
[866] MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models
Wenyuan Liu, Haoqian Meng, Yilun Luo, Peng Zhang, Xindian Ma
Main category: cs.LG
TL;DR: MicroMix is a mixed-precision quantization algorithm and kernel for LLMs, leveraging MXFP formats on NVIDIA Blackwell GPUs for faster inference while maintaining accuracy.
Details
Motivation: Existing INT4-based kernels can't exploit FP4 Tensor Cores in Blackwell GPUs, limiting speedup potential.
Method: Co-designed mixed-precision quantization (MXFP4/6/8) and matrix multiplication kernel with selective precision allocation based on quantization thresholds.
Result: 20% faster execution than TensorRT-FP8, improved prefill latency, and memory efficiency across Llama and Qwen models.
Conclusion: MicroMix bridges the gap for FP4 utilization, offering competitive performance and efficiency in diverse LLM tasks.
Abstract: Quantization significantly accelerates inference in large language models (LLMs) by replacing original high-precision matrices with low-precision counterparts. Recent advances in weight-activation quantization have primarily focused on mapping both weights and activations to the INT4 format. Although the new FP4 Tensor Cores in NVIDIA’s Blackwell architecture offer up to 4x speedup over FP16, existing INT4-based kernels fail to fully exploit this capability due to mismatched data formats. To bridge this gap, we propose MicroMix, a co-designed mixed-precision quantization algorithm and matrix multiplication kernel based on Microscaling (MX) data formats. Tailored for the Blackwell architecture, the MicroMix kernel supports arbitrary combinations of MXFP4, MXFP6, and MXFP8 channels, and produces BFloat16 outputs. To achieve a favorable trade-off between accuracy and efficiency for each linear layer, we introduce quantization thresholds that identify activation elements where lower-precision formats (MXFP4 or MXFP6) incur excessive quantization error. Our algorithm selectively allocates higher-precision channels to preserve accuracy while maintaining compute efficiency. MicroMix achieves competitive or superior performance across diverse downstream tasks, including zero-shot and few-shot learning, language modeling, code generation, and mathematical reasoning. On both consumer-grade (RTX 5070Ti laptop) and server-grade (RTX 5090) GPUs, our kernel delivers at least 20% faster execution than TensorRT-FP8. Furthermore, when applied to various Llama and Qwen models, MicroMix consistently improves prefill latency and memory efficiency across a range of batch sizes compared to TensorRT baselines. Our code is available at https://github.com/lwy2020/MicroMix.
[867] A Compression Based Classification Framework Using Symbolic Dynamics of Chaotic Maps
Parth Naik, Harikrishnan N B
Main category: cs.LG
TL;DR: A novel classification framework, ChaosComp, uses symbolic dynamics and chaotic maps for efficient data encoding and classification.
Details
Motivation: To reinterpret classification through dynamical systems and compression, foundational in learning theory.
Method: Symbolic sequences from thresholded data are evolved via chaotic maps, forming class-specific probabilistic models. Testing involves back iteration for shortest compressed representation.
Result: Competitive performance on synthetic and real datasets (e.g., Breast Cancer Wisconsin F1-score = 0.9531).
Conclusion: The method offers a unique perspective on classification, blending dynamical systems and compression, without aiming for state-of-the-art performance.
Abstract: We propose a novel classification framework grounded in symbolic dynamics and data compression using chaotic maps. The core idea is to model each class by generating symbolic sequences from thresholded real-valued training data, which are then evolved through a one-dimensional chaotic map. For each class, we compute the transition probabilities of symbolic patterns (e.g., '00', '01', '10', and '11' for the second return map) and aggregate these statistics to form a class-specific probabilistic model. During the testing phase, the test data are thresholded and symbolized, and then encoded using the class-wise symbolic statistics via back iteration, a dynamical reconstruction technique. The predicted label corresponds to the class yielding the shortest compressed representation, signifying the most efficient symbolic encoding under its respective chaotic model. This approach fuses concepts from dynamical systems, symbolic representations, and compression-based learning. We evaluate the proposed method, \emph{ChaosComp}, on both synthetic and real-world datasets, demonstrating competitive performance compared to traditional machine learning algorithms (e.g., macro F1-scores for the proposed method on Breast Cancer Wisconsin = 0.9531, Seeds = 0.9475, Iris = 0.8317). Rather than aiming for state-of-the-art performance, the goal of this research is to reinterpret the classification problem through the lens of dynamical systems and compression, which are foundational perspectives in learning theory and information processing.
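The symbolic-modelling step is straightforward to prototype: threshold a signal into bits and estimate the transition probabilities of the two-symbol patterns. This sketch omits the chaotic-map evolution and back-iteration decoding that the full method relies on.

```python
import numpy as np

def symbolic_transitions(x: np.ndarray, threshold: float) -> np.ndarray:
    """Estimate transition probabilities of patterns 00, 01, 10, 11."""
    bits = (x > threshold).astype(int)                  # symbolize: 0/1 sequence
    counts = np.zeros((2, 2))
    for a, b in zip(bits[:-1], bits[1:]):
        counts[a, b] += 1
    return counts / counts.sum()                        # class-specific statistics

probs = symbolic_transitions(np.random.rand(1000), threshold=0.5)
print(probs)    # rows: current symbol, columns: next symbol
```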
[868] BOOST: Bayesian Optimization with Optimal Kernel and Acquisition Function Selection Technique
Joon-Hyun Park, Mujin Cheon, Dong-Yeun Koh
Main category: cs.LG
TL;DR: BOOST automates kernel and acquisition function selection in Bayesian optimization, improving performance over fixed hyperparameter methods.
Details
Motivation: Manual or heuristic selection of hyperparameters in Bayesian optimization can lead to poor performance and wasted evaluations.
Method: BOOST uses an offline evaluation stage to test kernel-acquisition pairs on reference data, selecting the best configuration for future optimization.
Result: BOOST outperforms standard BO methods in synthetic and real-world tasks.
Conclusion: BOOST is effective and robust for diverse optimization problems.
Abstract: The performance of Bayesian optimization (BO), a highly sample-efficient method for expensive black-box problems, is critically governed by the selection of its hyperparameters, including the kernel and acquisition functions. This presents a challenge: an inappropriate combination of these can lead to poor performance and wasted evaluations. While individual improvements to kernel functions (e.g., tree-based kernels, deep kernel learning) and acquisition functions (e.g., multi-step lookahead, tree-based planning) have been explored, the joint and autonomous selection of the best pair of these fundamental hyperparameters has been overlooked. This forces practitioners to rely on heuristics or costly manual training. We propose a simple yet effective framework, BOOST (Bayesian Optimization with Optimal Kernel and Acquisition Function Selection Technique), that automates this selection. BOOST utilizes a lightweight, offline evaluation stage to predict the performance of various kernel-acquisition function pairs and identify the most suitable configuration before expensive evaluations. BOOST partitions data-in-hand into two subsets: a reference subset and a query subset, and it prepares all possible kernel-acquisition pairs from the user’s chosen candidates. For each configuration, BOOST conducts internal BO runs using the reference subset, evaluating how effectively each pair guides the search toward the optimum in the unknown query subset, thereby identifying the configuration with the best retrospective performance for future optimization. Experiments on both synthetic benchmark functions and real-world hyperparameter optimization tasks demonstrate that BOOST consistently outperforms standard BO approaches with fixed hyperparameters, highlighting its effectiveness and robustness in diverse problem landscapes.
[869] Posterior Sampling of Probabilistic Word Embeddings
Väinö Yrjänäinen, Isac Boström, Måns Magnusson, Johan Jonasson
Main category: cs.LG
TL;DR: The paper proposes scalable Gibbs sampling and Laplace approximation for quantifying uncertainty in word embeddings, outperforming existing methods like MFVI and HMC in accuracy and feasibility for large datasets.
Details
Motivation: Existing Bayesian methods for uncertainty in word embeddings (HMC, MFVI) are either computationally impractical or rely on restrictive assumptions, necessitating a more scalable and accurate approach.
Method: The authors introduce a Gibbs sampler with Polya-Gamma augmentation and Laplace approximation, comparing them to MFVI and HMC. They also address non-identifiability in word embeddings.
Result: Gibbs sampling and HMC accurately estimate uncertainties, while MFVI fails and Laplace approximation works only for large samples. The Gibbs sampler is validated on real datasets (US Congress, Movielens).
Conclusion: Posterior sampling (e.g., Gibbs) improves over MAP estimates, especially for smaller samples, highlighting the importance of full posterior sampling for reliable word embeddings.
Abstract: Quantifying uncertainty in word embeddings is crucial for reliable inference from textual data. However, existing Bayesian methods such as Hamiltonian Monte Carlo (HMC) and mean-field variational inference (MFVI) are either computationally infeasible for large data or rely on restrictive assumptions. We propose a scalable Gibbs sampler using Polya-Gamma augmentation as well as Laplace approximation and compare them with MFVI and HMC for word embeddings. In addition, we address non-identifiability in word embeddings. Our Gibbs sampler and HMC correctly estimate uncertainties, while MFVI does not, and Laplace approximation only does so on large sample sizes, as expected. Applying the Gibbs sampler to the US Congress and the Movielens datasets, we demonstrate the feasibility on larger real data. Finally, as a result of having draws from the full posterior, we show that the posterior mean of word embeddings improves over maximum a posteriori (MAP) estimates in terms of hold-out likelihood, especially for smaller sampling sizes, further strengthening the need for posterior sampling of word embeddings.
[870] A Novel Sliced Fused Gromov-Wasserstein Distance
Moritz Piening, Robert Beinert
Main category: cs.LG
TL;DR: The paper introduces a novel slicing technique for Gromov-Wasserstein (GW) and fused Gromov-Wasserstein (FGW) distances, improving computational efficiency while preserving invariance to isometric transformations and enabling comparisons of arbitrary geometries.
Details
Motivation: To address the computational challenges and limitations of existing sliced GW methods, which are restricted to Euclidean geometry and lack invariance to isometries.
Method: The proposed method uses a lower bound, hierarchical optimal transport (OT), and quadrature rules for 1D OT problems to create a more efficient and versatile sliced FGW distance.
Result: The new sliced FGW distance reduces computational effort, remains invariant to isometric transformations, and works with arbitrary geometries. It also defines a pseudo-metric and shows robust performance in shape retrieval and graph isomorphism testing.
Conclusion: The novel sliced FGW distance offers a computationally efficient, robust, and versatile alternative to traditional GW and FGW distances, with practical applications in heterogeneous data comparison.
Abstract: The Gromov–Wasserstein (GW) distance and its fused extension (FGW) are powerful tools for comparing heterogeneous data. Their computation is, however, challenging since both distances are based on non-convex, quadratic optimal transport (OT) problems. Leveraging 1D OT, a sliced version of GW has been proposed to lower the computational burden. Unfortunately, this sliced version is restricted to Euclidean geometry and loses invariance to isometries, strongly limiting its application in practice. To overcome these issues, we propose a novel slicing technique for GW as well as for FGW that is based on an appropriate lower bound, hierarchical OT, and suitable quadrature rules for the underlying 1D OT problems. Our novel sliced FGW significantly reduces the numerical effort while remaining invariant to isometric transformations and allowing the comparison of arbitrary geometries. We show that our new distance actually defines a pseudo-metric for structured spaces that bounds FGW from below and study its interpolation properties between sliced Wasserstein and GW. Since we avoid the underlying quadratic program, our sliced distance is numerically more robust and reliable than the original GW and FGW distance; especially in the context of shape retrieval and graph isomorphism testing.
[871] Beyond Manually Designed Pruning Policies with Second-Level Performance Prediction: A Pruning Framework for LLMs
Zuxin Ma, Yunhe Cui, Yongbin Qin
Main category: cs.LG
TL;DR: PPF is a predictive pruning framework for LLMs that automates pruning policy design, supports dynamic/static pruning, and significantly speeds up evaluation, outperforming manual methods.
Details
Motivation: Existing non-uniform pruning methods rely on manual policies and slow evaluation, limiting adaptability to dynamic pruning needs.
Method: PPF uses a second-level performance predictor and an adaptive agent for real-time pruning decisions.
Result: PPF reduces perplexity by up to 33.4% (dynamic) and 84.78% (static) and speeds up evaluation by 64x.
Conclusion: PPF automates and accelerates pruning, outperforming manual methods in performance and efficiency.
Abstract: Non-uniform structured network pruning methods can effectively reduce Large Language Model (LLM) size by eliminating redundant channels or layers, offering lower performance degradation than uniform strategies. However, existing non-uniform methods rely heavily on manually designed pruning policies (e.g., layer importance and scaling factors), and therefore cannot efficiently adapt to scenarios with dynamic pruning ratio requirements. Additionally, a critical bottleneck – the time-consuming evaluation of pruning policies – further limits the feasibility of iteratively and dynamically finding optimal pruning policies. To address these limitations, we propose PPF (Predictive Pruning Framework), a novel pruning framework for LLMs that eliminates manual design dependencies via second-level performance prediction. PPF not only supports real-time pruning decisions under dynamic pruning ratios but is also applicable to static pruning scenarios. It employs an agent for producing adaptive and real-time pruning actions, while a lightweight performance predictor that can evaluate a pruning policy in seconds, significantly speeding up the iterative optimization process. Experiments on Llama2-7B and Llama3-8B show that PPF can generate dynamic/static pruning policies and it reduces perplexity by up to 33.4% (dynamic pruning) and 84.78% (static pruning) over existing methods, outperforming manually designed pruning policies. The performance predictor achieves second-level performance prediction with high accuracy (prediction error < 0.0011). It reduces the mean evaluation latency from minute-level (1 minute and 38.02 seconds for test-set evaluation methods) to second-level (1.52 seconds), achieving an over 64x speedup. Our code will be available at https://github.com/Ma-zx/PPF .
[872] Graph Embedding in the Graph Fractional Fourier Transform Domain
Changjie Sheng, Zhichao Zhang, Wei Yao
Main category: cs.LG
TL;DR: The paper introduces GEFRFE, a spectral graph embedding method using graph fractional Fourier transform to enhance feature capture and classification performance, with adaptive fractional order determination.
Details
Motivation: Traditional spectral embedding methods lack expressiveness in capturing latent structural features across transform domains, prompting the need for an improved approach.
Method: Extends GEFFE into fractional domains using graph fractional Fourier transform, incorporating filtering and nonlinear eigenvector composition. Introduces search-based optimization and ResNet18-based learning for dynamic fractional order determination.
Result: GEFRFE captures richer structural features and significantly improves classification performance on six benchmark datasets while maintaining computational efficiency.
Conclusion: GEFRFE effectively enhances spectral graph embedding by leveraging fractional domains, offering improved performance without increasing computational complexity.
Abstract: Spectral graph embedding plays a critical role in graph representation learning by generating low-dimensional vector representations from graph spectral information. However, the embedding space of traditional spectral embedding methods often exhibits limited expressiveness, failing to exhaustively capture latent structural features across alternative transform domains. To address this issue, we use the graph fractional Fourier transform to extend the existing state-of-the-art generalized frequency filtering embedding (GEFFE) into fractional domains, yielding the generalized fractional filtering embedding (GEFRFE), which enhances embedding informativeness via the graph fractional domain. The GEFRFE leverages graph fractional domain filtering and a nonlinear composition of eigenvector components derived from a fractionalized graph Laplacian. To dynamically determine the fractional order, two parallel strategies are introduced: search-based optimization and ResNet18-based adaptive learning. Extensive experiments on six benchmark datasets demonstrate that the GEFRFE captures richer structural features and significantly enhances classification performance. Notably, the proposed method retains computational complexity comparable to GEFFE approaches.
[873] $ε$-Softmax: Approximating One-Hot Vectors for Mitigating Label Noise
Jialiang Wang, Xiong Zhou, Deming Zhai, Junjun Jiang, Xiangyang Ji, Xianming Liu
Main category: cs.LG
TL;DR: ε-softmax relaxes the overly strict symmetric condition of robust loss functions by modifying softmax outputs to approximate one-hot vectors with a controllable error ε, achieving noise-tolerant learning without severe underfitting.
Details
Motivation: Symmetric robust losses tolerate label noise but tend to underfit because the symmetric condition is too strict.
Method: ε-softmax modifies the softmax layer's outputs to approximate one-hot vectors with a controllable error ε, implicitly reshaping the loss function; ε-softmax-enhanced losses are further combined with a symmetric loss to restore fitting ability on clean data.
Result: Theory shows noise-tolerant learning with a controllable excess risk bound for almost any loss function; extensive experiments show superior mitigation of synthetic and real-world label noise.
Conclusion: ε-softmax offers a simple, theoretically grounded trade-off between robustness and effective learning under label noise.
Abstract: Noisy labels pose a common challenge for training accurate deep neural networks. To mitigate label noise, prior studies have proposed various robust loss functions to achieve noise tolerance in the presence of label noise, particularly symmetric losses. However, they usually suffer from the underfitting issue due to the overly strict symmetric condition. In this work, we propose a simple yet effective approach for relaxing the symmetric condition, namely $\epsilon$-softmax, which simply modifies the outputs of the softmax layer to approximate one-hot vectors with a controllable error $\epsilon$. Essentially, $\epsilon$-softmax not only acts as an alternative for the softmax layer, but also implicitly plays the crucial role in modifying the loss function. We prove theoretically that $\epsilon$-softmax can achieve noise-tolerant learning with controllable excess risk bound for almost any loss function. Recognizing that $\epsilon$-softmax-enhanced losses may slightly reduce fitting ability on clean datasets, we further incorporate them with one symmetric loss, thereby achieving a better trade-off between robustness and effective learning. Extensive experiments demonstrate the superiority of our method in mitigating synthetic and real-world label noise. The code is available at https://github.com/cswjl/eps-softmax.
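One plausible reading of the construction (hedged; the paper gives the exact formulation) is to mix the softmax output with the one-hot vector of its argmax, leaving controllable mass ε off the peak:

```python
import torch

def eps_softmax(logits: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Push softmax outputs toward one-hot vectors with controllable error eps
    (an illustrative interpretation, not the paper's exact definition)."""
    p = torch.softmax(logits, dim=-1)
    one_hot = torch.zeros_like(p).scatter_(-1, p.argmax(dim=-1, keepdim=True), 1.0)
    return (1 - eps) * one_hot + eps * p    # approximates one-hot with error eps

q = eps_softmax(torch.randn(4, 10), eps=0.05)   # rows still sum to 1
```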
[874] ASMR: Angular Support for Malfunctioning Client Resilience in Federated Learning
Mirko Konstantin, Moritz Fuchs, Anirban Mukhopadhyay
Main category: cs.LG
TL;DR: ASMR is a novel method for detecting and excluding malfunctioning clients in Federated Learning (FL) without needing hyperparameters or prior knowledge about the number of faulty clients.
Details
Motivation: FL suffers from malfunctioning client updates that degrade model performance, and existing defenses are impractical due to unrealistic prerequisites.Method: ASMR dynamically excludes malfunctioning clients based on their angular distance, requiring no hyperparameters or knowledge of faulty clients.
Result: Experiments on a histopathological dataset demonstrate ASMR’s effectiveness in detecting malfunctioning clients and adapting decision boundaries.
Conclusion: ASMR provides a practical and effective solution for improving FL robustness against malfunctioning clients.
Abstract: Federated Learning (FL) allows the training of deep neural networks in a distributed and privacy-preserving manner. However, this concept suffers from malfunctioning updates sent by the attending clients that cause global model performance degradation. Reasons for this malfunctioning might be technical issues, disadvantageous training data, or malicious attacks. Most of the current defense mechanisms are meant to require impractical prerequisites like knowledge about the number of malfunctioning updates, which makes them unsuitable for real-world applications. To counteract these problems, we introduce a novel method called Angular Support for Malfunctioning Client Resilience (ASMR), that dynamically excludes malfunctioning clients based on their angular distance. Our novel method does not require any hyperparameters or knowledge about the number of malfunctioning clients. Our experiments showcase the detection capabilities of ASMR in an image classification task on a histopathological dataset, while also presenting findings on the significance of dynamically adapting decision boundaries.
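For intuition, a NumPy sketch of angular-distance filtering of client updates follows; the one-standard-deviation cutoff is an illustrative assumption, since the summary does not specify how ASMR sets its dynamic decision boundary:

```python
import numpy as np

def asmr_keep(updates: list[np.ndarray]) -> list[int]:
    """Keep clients whose update direction stays angularly close to the
    mean update direction; the cutoff rule here is an assumption."""
    flat = np.stack([u.ravel() for u in updates])
    unit = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    mean_dir = unit.mean(axis=0)
    mean_dir /= np.linalg.norm(mean_dir)
    angles = np.arccos(np.clip(unit @ mean_dir, -1.0, 1.0))
    cutoff = angles.mean() + angles.std()  # dynamic, hyperparameter-free boundary
    return [i for i, a in enumerate(angles) if a <= cutoff]
```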
[875] Toward Using Machine Learning as a Shape Quality Metric for Liver Point Cloud Generation
Khoa Tuan Nguyen, Gaeun Oh, Ho-min Park, Francesca Tozzi, Wouter Willaert, Joris Vankerschaver, Niki Rashidian, Wesley De Neve
Main category: cs.LG
TL;DR: The paper proposes using ML classifiers (including PointNet) to evaluate the quality of 3D liver shapes generated by diffusion models, offering interpretable and complementary insights to expert reviews.
Details
Motivation: Current evaluation metrics for 3D medical shape generative models lack individual-level quality assessment, requiring labor-intensive expert review.Method: Point clouds from generated liver shapes are sampled, geometric features are extracted, and supervised ML/PointNet models are trained to classify shapes as good or bad.
Result: ML-based classifiers provide interpretable feedback and complementary insights compared to expert evaluation.
Conclusion: ML classifiers can serve as lightweight, task-relevant quality metrics for 3D organ shape generation, enhancing evaluation transparency in medical modeling.
Abstract: While 3D medical shape generative models such as diffusion models have shown promise in synthesizing diverse and anatomically plausible structures, the absence of ground truth makes quality evaluation challenging. Existing evaluation metrics commonly measure distributional distances between training and generated sets, while the medical field requires assessing quality at the individual level for each generated shape, which demands labor-intensive expert review. In this paper, we investigate the use of classical machine learning (ML) methods and PointNet as an alternative, interpretable approach for assessing the quality of generated liver shapes. We sample point clouds from the surfaces of the generated liver shapes, extract handcrafted geometric features, and train a group of supervised ML and PointNet models to classify liver shapes as good or bad. These trained models are then used as proxy discriminators to assess the quality of synthetic liver shapes produced by generative models. Our results show that ML-based shape classifiers provide not only interpretable feedback but also complementary insights compared to expert evaluation. This suggests that ML classifiers can serve as lightweight, task-relevant quality metrics in 3D organ shape generation, supporting more transparent and clinically aligned evaluation protocols in medical shape modeling.
[876] Federated Graph Unlearning
Yuming Ai, Xunkai Li, Jiaqi Chao, Bowen Fan, Zhengyu Wu, Yinlin Zhu, Rong-Hua Li, Guoren Wang
Main category: cs.LG
TL;DR: A unified framework for Federated Graph Learning addresses data removal challenges, improving accuracy and unlearning effectiveness.
Details
Motivation: The need for robust data removal mechanisms in decentralized systems to comply with privacy principles like the right to be forgotten.Method: A bifurcated strategy: fine-grained Meta Unlearning with prototype gradients and adversarial graphs, and complete client unlearning using adversarial graphs.
Result: Substantial improvements in model accuracy and unlearning effectiveness across benchmark datasets.
Conclusion: The framework is effective for both client and meta-unlearning and enhances existing methods as a plug-in module.
Abstract: The demand for data privacy has led to the development of frameworks like Federated Graph Learning (FGL), which facilitate decentralized model training. However, a significant operational challenge in such systems is adhering to the right to be forgotten. This principle necessitates robust mechanisms for two distinct types of data removal: the selective erasure of specific entities and their associated knowledge from local subgraphs and the wholesale removal of a user’s entire dataset and influence. Existing methods often struggle to fully address both unlearning requirements, frequently resulting in incomplete data removal or the persistence of residual knowledge within the system. This work introduces a unified framework, conceived to provide a comprehensive solution to these challenges. The proposed framework employs a bifurcated strategy tailored to the specific unlearning request. For fine-grained Meta Unlearning, it uses prototype gradients to direct the initial local forgetting process, which is then refined by generating adversarial graphs to eliminate any remaining data traces among affected clients. In the case of complete client unlearning, the framework utilizes adversarial graph generation exclusively to purge the departed client’s contributions from the remaining network. Extensive experiments on multiple benchmark datasets validate the proposed approach. The framework achieves substantial improvements in model prediction accuracy across both client and meta-unlearning scenarios when compared to existing methods. Furthermore, additional studies confirm its utility as a plug-in module, where it materially enhances the predictive capabilities and unlearning effectiveness of other established methods.
[877] Clinical Expert Uncertainty Guided Generalized Label Smoothing for Medical Noisy Label Learning
Kunyu Zhang, Lin Gu, Liangchen Liu, Yingke Chen, Bingyang Wang, Jin Yan, Yingying Zhu
Main category: cs.LG
TL;DR: The paper addresses label noise in medical image datasets caused by expert uncertainty in clinical notes, proposing a benchmark and label smoothing method to improve performance.
Details
Motivation: Existing methods ignore expert-driven uncertainty in clinical notes, leading to noisy labels in medical image datasets.Method: The study examines the impact of expert uncertainty on label noise and introduces a clinical expert uncertainty-aware benchmark with a label smoothing technique.
Result: The proposed method outperforms current state-of-the-art approaches by better handling expert-written uncertainty.
Conclusion: Incorporating expert uncertainty into medical image analysis improves label quality and model performance.
Abstract: Many previous studies have proposed extracting image labels from clinical notes to create large-scale medical image datasets at a low cost. However, these approaches inherently suffer from label noise due to uncertainty from the clinical experts. When radiologists and physicians analyze medical images to make diagnoses, they often include uncertainty-aware notes such as "maybe" or "not excluded". Unfortunately, current text-mining methods overlook these nuances, resulting in the creation of noisy labels. Existing methods for handling noisy labels in medical image analysis, which typically address the problem through post-processing techniques, have largely ignored the important issue of expert-driven uncertainty contributing to label noise. To better incorporate the expert-written uncertainty in clinical notes into medical image analysis and address the label noise issue, we first examine the impact of clinical expert uncertainty on label noise. We then propose a clinical expert uncertainty-aware benchmark, along with a label smoothing method, which significantly improves performance compared to current state-of-the-art approaches.
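To make the idea concrete, here is a hypothetical sketch of uncertainty-aware label smoothing; the mapping from hedge phrases to uncertainty scores is invented for illustration and is not taken from the paper:

```python
import torch

# hypothetical mapping from clinicians' hedge phrases to uncertainty scores
UNCERTAINTY = {"definite": 0.0, "probable": 0.2, "maybe": 0.5, "not excluded": 0.7}

def uncertainty_smoothed_label(label: int, num_classes: int, phrase: str) -> torch.Tensor:
    """Smooth a one-hot label in proportion to the expert's expressed uncertainty."""
    u = UNCERTAINTY.get(phrase, 0.0)
    target = torch.full((num_classes,), u / num_classes)
    target[label] += 1.0 - u
    return target

print(uncertainty_smoothed_label(1, 3, "maybe"))  # tensor([0.1667, 0.6667, 0.1667])
```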
[878] On Distributional Dependent Performance of Classical and Neural Routing Solvers
Daniela Thyssens, Tim Dernedde, Wilson Sentanoe, Lars Schmidt-Thieme
Main category: cs.LG
TL;DR: Neural Combinatorial Optimization (NCO) learns to solve routing problems by training on structured, sampled instances from a custom base distribution, reducing the performance gap with meta-heuristics.
Details
Motivation: To improve neural methods' performance in combinatorial optimization by learning from structured problem instances rather than random ones.Method: Generate large base problem instances, sample training instances from them, and evaluate NCO methods and meta-heuristics on unseen problems from the base distribution.
Result: NCO methods perform closer to specialized meta-heuristics when trained on sub-samples from a fixed base node distribution.
Conclusion: Structured problem instance sampling enhances neural solvers’ effectiveness in routing tasks.
Abstract: Neural Combinatorial Optimization aims to learn to solve a class of combinatorial problems through data-driven methods, notably by employing neural networks to learn the underlying distribution of problem instances. While neural methods have so far struggled to outperform highly engineered, problem-specific meta-heuristics, this work explores a novel approach to formulating the distribution of problem instances to learn from and, more importantly, to planting a structure in the sampled problem instances. In application to routing problems, we generate large problem instances that represent custom base problem instance distributions from which training instances are sampled. The test instances used to evaluate the methods on the routing task consist of unseen problems sampled from the same underlying large problem instance. We evaluate representative NCO methods and specialized Operations Research meta-heuristics on this novel task and demonstrate that the performance gap between neural routing solvers and highly specialized meta-heuristics decreases when learning from sub-samples drawn from a fixed base node distribution.
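A minimal sketch of the instance-generation protocol as described: training and test instances are node sub-samples of one large base instance. The uniform base distribution below is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
base_nodes = rng.random((100_000, 2))  # one large base instance (uniform assumed)

def sample_instance(n: int = 100) -> np.ndarray:
    """Draw a routing instance as a node sub-sample of the fixed base instance."""
    idx = rng.choice(len(base_nodes), size=n, replace=False)
    return base_nodes[idx]

train_instances = [sample_instance() for _ in range(1024)]
test_instance = sample_instance()  # unseen nodes, same base distribution
```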
[879] AnalogCoder-Pro: Unifying Analog Circuit Generation and Optimization via Multi-modal LLMs
Yao Lai, Souradip Poddar, Sungyoung Lee, Guojin Chen, Mengkang Hu, Bei Yu, Ping Luo, David Z. Pan
Main category: cs.LG
TL;DR: AnalogCoder-Pro is a multimodal LLM-based framework for automated analog circuit design, integrating generative capabilities and optimization to jointly explore topologies and optimize device sizing.
Details
Motivation: Analog design automation lacks fully automated optimization for performance-critical applications, relying heavily on expert intuition. LLMs offer new promise but remain underutilized for holistic joint optimization.Method: AnalogCoder-Pro uses rejection sampling for LLM fine-tuning, multimodal diagnosis/repair, and automated parameter extraction to enable end-to-end topology generation and device sizing.
Result: The framework improves analog circuit design success rates and performance through its integrated approach.
Conclusion: AnalogCoder-Pro demonstrates the potential of LLMs in automating analog design, bridging gaps in current methods.
Abstract: Despite advances in analog design automation, analog front-end design still heavily depends on expert intuition and iterative simulations, underscoring critical gaps in fully automated optimization for performance-critical applications. Recently, the rapid development of Large Language Models (LLMs) has brought new promise to analog design automation. However, existing work remains in its early stages, and holistic joint optimization for practical end-to-end solutions remains largely unexplored. We propose AnalogCoder-Pro, a unified multimodal LLM-based framework that integrates generative capabilities and optimization techniques to jointly explore circuit topologies and optimize device sizing, automatically generating performance-specific, fully sized schematic netlists. AnalogCoder-Pro employs rejection sampling for fine-tuning LLMs on high-quality synthesized circuit data and introduces a multimodal diagnosis and repair workflow based on functional specifications and waveform images. By leveraging LLMs to interpret generated circuit netlists, AnalogCoder-Pro automates the extraction of critical design parameters and the formulation of parameter spaces, establishing an end-to-end workflow for simultaneous topology generation and device sizing optimization. Extensive experiments demonstrate that these orthogonal approaches significantly improve the success rate of analog circuit design and enhance circuit performance.
[880] Dynamic Feature Selection based on Rule-based Learning for Explainable Classification with Uncertainty Quantification
Javier Fumanal-Idocin, Raquel Fernandez-Peralta, Javier Andreu-Perez
Main category: cs.LG
TL;DR: The paper introduces a rule-based dynamic feature selection (DFS) method, enhancing interpretability and computational efficiency compared to opaque models like neural networks. It also provides uncertainty measures and shows competitive performance against existing methods.
Details
Motivation: Traditional static feature selection lacks adaptability and transparency, especially in critical fields like clinical decision-making. DFS addresses this by customizing feature selection per sample, but existing methods are opaque.Method: The paper proposes a rule-based system for DFS, offering interpretability, uncertainty quantification, and computational efficiency by constraining the feature search space. It also analyzes greedy selection of conditional mutual information.
Result: The rule-based DFS method performs competitively against state-of-the-art opaque methods (greedy and RL-based) while providing explainability and uncertainty measures.
Conclusion: The rule-based DFS approach is a viable, interpretable alternative to opaque models, with practical advantages in transparency and computational efficiency.
Abstract: Dynamic feature selection (DFS) offers a compelling alternative to traditional, static feature selection by adapting the selected features to each individual sample. Unlike classical methods that apply a uniform feature set, DFS customizes feature selection per sample, providing insight into the decision-making process for each case. DFS is especially significant in settings where decision transparency is key, e.g., clinical decisions; however, existing methods use opaque models, which hinders their applicability in real-life scenarios. This paper introduces a novel approach leveraging a rule-based system as the base classifier for the DFS process, which enhances decision interpretability compared to neural estimators. We also show how this method provides a quantitative measure of uncertainty for each feature query and can make the feature selection process computationally lighter by constraining the feature search space. We also discuss when greedy selection of conditional mutual information is equivalent to selecting features that minimize the difference with respect to the global model predictions. Finally, we demonstrate that our explainable rule-based DFS approach performs competitively against established and state-of-the-art greedy and RL methods, most of which are opaque.
[881] Communication and Computation Efficient Split Federated Learning in O-RAN
Shunxian Gu, Chaoqun You, Bangbang Ren, Deke Guo
Main category: cs.LG
TL;DR: SplitMe is a split federated learning (SFL) framework for O-RAN that reduces communication costs and improves convergence by training models alternately in near-RT and non-RT RICs, avoiding frequent data transfers.
Details
Motivation: The increasing model size in federated learning (FL) for O-RAN leads to longer training times, risking deadline violations for non-RT and near-RT RICs. Existing SFL approaches incur high communication costs and require careful resource allocation.Method: SplitMe uses mutual learning to train models alternately in near-RT and non-RT RICs, eliminating frequent transfers. It employs a zeroth-order technique to integrate the final model and solves a joint optimization problem for resource efficiency.
Result: SplitMe outperforms SFL, FedAvg, and O-RANFed in terms of cost and convergence, demonstrating significant improvements.
Conclusion: SplitMe effectively addresses the challenges of SFL in O-RAN by reducing communication costs and ensuring timely convergence, making it a superior framework for FL in hierarchical architectures.
Abstract: The hierarchical architecture of Open Radio Access Network (O-RAN) has enabled a new Federated Learning (FL) paradigm that trains models using data from non- and near-real-time (near-RT) Radio Intelligent Controllers (RICs). However, the ever-increasing model size leads to longer training time, jeopardizing the deadline requirements for both non-RT and near-RT RICs. To address this issue, split federated learning (SFL) offers an approach by offloading partial model layers from near-RT-RIC to high-performance non-RT-RIC. Nonetheless, its deployment presents two challenges: (i) Frequent data/gradient transfers between near-RT-RIC and non-RT-RIC in SFL incur significant communication cost in O-RAN. (ii) Proper allocation of computational and communication resources in O-RAN is vital to satisfying the deadline and affects SFL convergence. Therefore, we propose SplitMe, an SFL framework that exploits mutual learning to alternately and independently train the near-RT-RIC's model and the non-RT-RIC's inverse model, eliminating frequent transfers. The "inverse" of the inverse model is derived via a zeroth-order technique to integrate the final model. Then, we solve a joint optimization problem for SplitMe to minimize overall resource costs with deadline-aware selection of near-RT-RICs and adaptive local updates. Our numerical results demonstrate that SplitMe remarkably outperforms FL frameworks like SFL, FedAvg and O-RANFed regarding costs and convergence.
[882] Solved in Unit Domain: JacobiNet for Differentiable Coordinate Transformations
Xi Chen, Jianchuan Yang, Junjie Zhang, Runnan Yang, Xu Liu, Hong Wang, Ziyu Ren, Wenqi Hu
Main category: cs.LG
TL;DR: JacobiNet is a neural network-based method for coordinate transformation in PINNs, improving stability, accuracy, and efficiency for PDE solving on irregular domains.
Details
Motivation: PINNs struggle with irregular boundaries due to normalization issues, inaccurate boundary enforcement, and imbalanced loss terms. Existing methods are limited by case-specific meshes and simple geometries.Method: JacobiNet learns continuous, differentiable mappings from supervised point pairs using lightweight MLPs, enabling direct Jacobian computation and seamless integration with PINNs.
Result: JacobiNet reduces relative L2 error significantly (0.013-0.039 vs. 0.287-0.637), improves accuracy by 18.3×, and achieves 10× speedup in vessel-like domains.
Conclusion: JacobiNet addresses key challenges in PINNs, offering generalization, accuracy, and efficiency for PDE solving on complex geometries.
Abstract: Physics-Informed Neural Networks (PINNs) are effective for solving PDEs by incorporating physical laws into the learning process. However, they face challenges with irregular boundaries, leading to instability and slow convergence due to inconsistent normalization, inaccurate boundary enforcement, and imbalanced loss terms. A common solution is to map the domain to a regular space, but traditional methods rely on case-specific meshes and simple geometries, limiting their compatibility with modern frameworks. To overcome these limitations, we introduce JacobiNet, a neural network-based coordinate transformation method that learns continuous, differentiable mappings from supervised point pairs. Utilizing lightweight MLPs, JacobiNet allows for direct Jacobian computation via autograd and integrates seamlessly with downstream PINNs, enabling end-to-end differentiable PDE solving without the need for meshing or explicit Jacobian computation. JacobiNet effectively addresses normalization challenges, facilitates hard constraints of boundary conditions, and mitigates the long-standing imbalance among loss terms. It demonstrates significant improvements, reducing the relative L2 error from 0.287-0.637 to 0.013-0.039, achieving an average accuracy improvement of 18.3×. In vessel-like domains, it enables rapid mapping for unseen geometries, improving prediction accuracy by 3.65× and achieving over 10× speedup, showcasing its generalization, accuracy, and efficiency.
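A minimal PyTorch sketch of the core mechanism: a lightweight MLP as a coordinate map whose per-point Jacobians come directly from autograd. Training on supervised point pairs is omitted, and the architecture is an illustrative assumption:

```python
import torch
from torch import nn
from torch.func import jacrev, vmap

class JacobiNetSketch(nn.Module):
    """A lightweight MLP mapping physical coordinates to a unit domain."""
    def __init__(self, dim: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = JacobiNetSketch()
x = torch.randn(8, 2)         # points in the irregular physical domain
jac = vmap(jacrev(model))(x)  # per-point Jacobians via autograd, shape (8, 2, 2)
```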
[883] StructSynth: Leveraging LLMs for Structure-Aware Tabular Data Synthesis in Low-Data Regimes
Siyi Liu, Yujia Zheng, Yongqi Zhang
Main category: cs.LG
TL;DR: StructSynth integrates LLMs with structural control to generate high-fidelity synthetic tabular data, outperforming existing methods in low-data regimes.
Details
Motivation: Data scarcity in specialized domains limits machine learning applications. Generative models often fail in low-data settings or ignore tabular data structure.Method: StructSynth uses a two-stage approach: (1) learns a DAG for structure discovery, (2) steers LLM generation using this structure to ensure fidelity.
Result: StructSynth produces synthetic data with higher structural integrity and downstream utility, especially in low-data scenarios.
Conclusion: StructSynth effectively balances privacy and statistical fidelity, offering a robust solution for generating structured tabular data.
Abstract: The application of machine learning on tabular data in specialized domains is severely limited by data scarcity. While generative models offer a solution, traditional methods falter in low-data regimes, and recent Large Language Models (LLMs) often ignore the explicit dependency structure of tabular data, leading to low-fidelity synthetics. To address these limitations, we introduce StructSynth, a novel framework that integrates the generative power of LLMs with robust structural control. StructSynth employs a two-stage architecture. First, it performs explicit structure discovery to learn a Directed Acyclic Graph (DAG) from the available data. Second, this learned structure serves as a high-fidelity blueprint to steer the LLM’s generation process, forcing it to adhere to the learned feature dependencies and thereby ensuring the generated data respects the underlying structure by design. Our extensive experiments demonstrate that StructSynth produces synthetic data with significantly higher structural integrity and downstream utility than state-of-the-art methods. It proves especially effective in challenging low-data scenarios, successfully navigating the trade-off between privacy preservation and statistical fidelity.
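As a sketch of the second stage, features can be generated parent-first along a topological order of the learned DAG, so every value is conditioned on its already-generated parents. The toy graph and feature names below are hypothetical:

```python
import networkx as nx

# hypothetical learned feature-dependency DAG for a tabular dataset
dag = nx.DiGraph([("age", "income"), ("education", "income"), ("income", "default")])

for feature in nx.topological_sort(dag):
    parents = list(dag.predecessors(feature))
    # each LLM call would be prompted to generate `feature` conditioned on
    # the already-generated values of its parents (prompting details omitted)
    print(f"generate {feature!r} given {parents}")
```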
[884] Adaptive Riemannian Graph Neural Networks
Xudong Wang, Tongxin Li, Chris Ding, Jicong Fan
Main category: cs.LG
TL;DR: ARGNN introduces a framework for adaptive Riemannian metric tensor fields in graphs, enabling nodes to determine optimal local geometry for diverse structures.
Details
Motivation: Existing geometric GNNs fail to capture the complex heterogeneity of graph structures with varying local curvature.Method: ARGNN learns a continuous, anisotropic Riemannian metric tensor field, parameterized efficiently as a learnable diagonal form, with Ricci flow-inspired regularization for stability.
Result: Superior performance on homophilic and heterophilic datasets, with interpretable learned geometries.
Conclusion: ARGNN unifies prior GNNs, adapts to diverse structures, and provides theoretical and empirical validation.
Abstract: Graph data often exhibits complex geometric heterogeneity, where structures with varying local curvature, such as tree-like hierarchies and dense communities, coexist within a single network. Existing geometric GNNs, which embed graphs into single fixed-curvature manifolds or discrete product spaces, struggle to capture this diversity. We introduce Adaptive Riemannian Graph Neural Networks (ARGNN), a novel framework that learns a continuous and anisotropic Riemannian metric tensor field over the graph. It allows each node to determine its optimal local geometry, enabling the model to fluidly adapt to the graph’s structural landscape. Our core innovation is an efficient parameterization of the node-wise metric tensor, specializing to a learnable diagonal form that captures directional geometric information while maintaining computational tractability. To ensure geometric regularity and stable training, we integrate a Ricci flow-inspired regularization that smooths the learned manifold. Theoretically, we establish the rigorous geometric evolution convergence guarantee for ARGNN and provide a continuous generalization that unifies prior fixed or mixed-curvature GNNs. Empirically, our method demonstrates superior performance on both homophilic and heterophilic benchmark datasets with the ability to capture diverse structures adaptively. Moreover, the learned geometries both offer interpretable insights into the underlying graph structure and empirically corroborate our theoretical analysis.
[885] Entity Representation Learning Through Onsite-Offsite Graph for Pinterest Ads
Jiayin Jin, Zhimeng Pan, Yang Tang, Jiarui Feng, Kungang Li, Chongyuan Xiang, Jiacheng Li, Runze Su, Siping Ji, Han Sun, Ling Leng, Prathibha Deshikachar
Main category: cs.LG
TL;DR: The paper introduces TransRA, a novel KGE model, to integrate graph embeddings into Ads ranking models, addressing challenges with offsite conversion data and achieving significant performance improvements.
Details
Motivation: To better leverage offsite conversion data and explore connections between onsite and offsite user activities for Ads ranking models.Method: Constructed a large-scale heterogeneous graph and introduced TransRA (TransR with Anchors), combined with Large ID Embedding Table and attention-based KGE finetuning.
Result: Significant AUC lift in CTR and CVR prediction models, with a 2.69% CTR lift and 1.34% CPC reduction in deployment.
Conclusion: The techniques can be leveraged by other large-scale industrial models for improved performance.
Abstract: Graph Neural Networks (GNN) have been extensively applied to industry recommendation systems, as seen in models like GraphSage, TwHIM, LiGNN, etc. In these works, graphs were constructed based on users' activities on the platforms, and various graph models were developed to effectively learn node embeddings. In addition to users' onsite activities, their offsite conversions are crucial for Ads models to capture their shopping interest. To better leverage offsite conversion data and explore the connection between onsite and offsite activities, we constructed a large-scale heterogeneous graph based on users' onsite ad interactions and opt-in offsite conversion activities. Furthermore, we introduced TransRA (TransR with Anchors), a novel Knowledge Graph Embedding (KGE) model, to more efficiently integrate graph embeddings into Ads ranking models. However, our Ads ranking models initially struggled to directly incorporate Knowledge Graph Embeddings (KGE), and only modest gains were observed during offline experiments. To address this challenge, we employed the Large ID Embedding Table technique and introduced an attention-based KGE fine-tuning approach within the Ads ranking models. As a result, we observed a significant AUC lift in Click-Through Rate (CTR) and Conversion Rate (CVR) prediction models. Moreover, this framework has been deployed in Pinterest's Ads Engagement Model and contributed a 2.69% CTR lift and 1.34% CPC reduction. We believe the techniques presented in this paper can be leveraged by other large-scale industrial models.
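For reference, a PyTorch sketch of the base TransR score that TransRA builds on: entities are projected into the relation space before measuring translation error. The "Anchors" extension is the paper's contribution and is not shown here:

```python
import torch

def transr_score(h, t, r, M_r):
    """Base TransR score: project entity embeddings into the relation space,
    then measure translation error  -||M_r h + r - M_r t||."""
    return -torch.norm(M_r @ h + r - M_r @ t, p=2)

d_e, d_r = 64, 32
h, t = torch.randn(d_e), torch.randn(d_e)  # head and tail entity embeddings
r = torch.randn(d_r)                       # relation embedding
M_r = torch.randn(d_r, d_e)                # relation-specific projection matrix
print(transr_score(h, t, r, M_r))
```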
[886] DeepKoopFormer: A Koopman Enhanced Transformer Based Architecture for Time Series Forecasting
Ali Forootani, Mohammad Khosravi, Masoud Barati
Main category: cs.LG
TL;DR: DeepKoopFormer combines Transformers and Koopman theory for stable, interpretable time-series forecasting, outperforming LSTMs and Transformers.
Details
Motivation: Addressing interpretability and instability issues in Transformer-based models for high-dimensional, nonlinear time-series forecasting.Method: Modular encoder-propagator-decoder structure with a spectrally constrained linear Koopman operator in latent space, ensuring stability via structural guarantees.
Result: Outperforms LSTMs and Transformers in accuracy, noise robustness, and long-term stability across synthetic and real-world datasets.
Conclusion: DeepKoopFormer is a flexible, interpretable, and robust framework for high-dimensional and dynamical forecasting.
Abstract: Time series forecasting plays a vital role across scientific, industrial, and environmental domains, especially when dealing with high-dimensional and nonlinear systems. While Transformer-based models have recently achieved state-of-the-art performance in long-range forecasting, they often suffer from interpretability issues and instability in the presence of noise or dynamical uncertainty. In this work, we propose DeepKoopFormer, a principled forecasting framework that combines the representational power of Transformers with the theoretical rigor of Koopman operator theory. Our model features a modular encoder-propagator-decoder structure, where temporal dynamics are learned via a spectrally constrained, linear Koopman operator in a latent space. We impose structural guarantees, such as a bounded spectral radius, Lyapunov-based energy regularization, and orthogonal parameterization, to ensure stability and interpretability. Comprehensive evaluations are conducted on synthetic dynamical systems, a real-world climate dataset (wind speed and surface pressure), financial time series (cryptocurrency), and an electricity generation dataset, using a Python package prepared for this purpose. Across all experiments, DeepKoopFormer consistently outperforms standard LSTM and baseline Transformer models in terms of accuracy, robustness to noise, and long-term forecasting stability. These results establish DeepKoopFormer as a flexible, interpretable, and robust framework for forecasting in high-dimensional and dynamical settings.
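A minimal sketch of a spectrally constrained linear Koopman propagator: because the spectral radius is bounded by the spectral norm, rescaling the operator by its spectral norm enforces stability. The Lyapunov regularization and orthogonal parameterization from the paper are omitted:

```python
import torch
from torch import nn

class SpectrallyBoundedKoopman(nn.Module):
    """Linear latent propagator whose spectral radius stays below a fixed
    bound, via rescaling by the spectral norm (rho(K) <= ||K||_2)."""
    def __init__(self, dim: int, max_radius: float = 0.99):
        super().__init__()
        self.K = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)
        self.max_radius = max_radius

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        sigma = torch.linalg.matrix_norm(self.K, ord=2)  # largest singular value
        scale = self.max_radius / torch.clamp(sigma, min=self.max_radius)
        return z @ (self.K * scale).T                    # stable one-step latent update
```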
[887] AutoML-Med: A Framework for Automated Machine Learning in Medical Tabular Data
Riccardo Francia, Maurizio Leone, Giorgio Leonardi, Stefania Montani, Marzio Pennisi, Manuel Striani, Sandra D’Alfonso
Main category: cs.LG
TL;DR: AutoML-Med is an Automated Machine Learning tool designed to handle medical dataset challenges like missing values and class imbalance, achieving better accuracy and sensitivity than other tools.
Details
Motivation: Medical datasets often have issues like missing values and class imbalance, hindering ML model performance. AutoML-Med aims to address these challenges with minimal user intervention.Method: AutoML-Med uses Latin Hypercube Sampling (LHS) for preprocessing exploration, trains models with selected metrics, and employs Partial Rank Correlation Coefficient (PRCC) for optimization.
Result: AutoML-Med outperforms state-of-the-art tools in clinical settings, achieving higher balanced accuracy and sensitivity, crucial for identifying at-risk patients.
Conclusion: AutoML-Med effectively improves prediction results in medical datasets, showcasing its potential to enhance ML applications in healthcare.
Abstract: Medical datasets are typically affected by issues such as missing values, class imbalance, heterogeneous feature types, and a high number of features relative to a small number of samples, preventing machine learning models from obtaining proper results in classification and regression tasks. This paper introduces AutoML-Med, an Automated Machine Learning tool specifically designed to address these challenges, minimizing user intervention and identifying the optimal combination of preprocessing techniques and predictive models. AutoML-Med's architecture incorporates Latin Hypercube Sampling (LHS) for exploring preprocessing methods, trains models using selected metrics, and utilizes the Partial Rank Correlation Coefficient (PRCC) for fine-tuned optimization of the most influential preprocessing steps. Experimental results demonstrate AutoML-Med's effectiveness in two different clinical settings, achieving higher balanced accuracy and sensitivity, which are crucial for identifying at-risk patients, compared to other state-of-the-art tools. AutoML-Med's ability to improve prediction results, especially in medical datasets with sparse data and class imbalance, highlights its potential to streamline Machine Learning applications in healthcare.
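For illustration, Latin Hypercube Sampling over a preprocessing search space with SciPy; the three dimensions below are hypothetical stand-ins for AutoML-Med's actual preprocessing choices:

```python
from scipy.stats import qmc

# hypothetical 3-D preprocessing space: imputation fraction, oversampling
# ratio, and number of selected features (names are illustrative only)
sampler = qmc.LatinHypercube(d=3, seed=0)
unit = sampler.random(n=20)  # 20 space-filling configurations in [0, 1]^3
configs = qmc.scale(unit, [0.0, 0.5, 5], [0.3, 1.0, 50])
for cfg in configs[:3]:
    print(cfg)  # each row: one preprocessing configuration to train and score
```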
[888] LOST: Low-rank and Sparse Pre-training for Large Language Models
Jiaxi Li, Lu Yin, Li Shen, Jinjin Xu, Liwu Xu, Tianjin Huang, Wenwu Wang, Shiwei Liu, Xilu Wang
Main category: cs.LG
TL;DR: LOST integrates low-rank and sparse structures for efficient LLM pre-training, achieving competitive performance with reduced computational costs.
Details
Motivation: To address the prohibitive computational and memory costs of full-rank LLM pre-training by combining low-rank and sparse techniques effectively.Method: Uses singular value decomposition for low-rank components and constructs channel-wise sparse components to complement expressiveness.
Result: Achieves competitive or superior performance to full-rank models while significantly reducing memory and compute overhead.
Conclusion: LOST provides an efficient and effective method for LLM pre-training under strict efficiency constraints.
Abstract: While large language models (LLMs) have achieved remarkable performance across a wide range of tasks, their massive scale incurs prohibitive computational and memory costs for pre-training from scratch. Recent studies have investigated the use of low-rank parameterization as a means of reducing model size and training cost. In this context, sparsity is often employed as a complementary technique to recover important information lost in low-rank compression by capturing salient features in the residual space. However, existing approaches typically combine low-rank and sparse components in a simplistic or ad hoc manner, often resulting in undesirable performance degradation compared to full-rank training. In this paper, we propose \textbf{LO}w-rank and \textbf{S}parse pre-\textbf{T}raining (\textbf{LOST}) for LLMs, a novel method that ingeniously integrates low-rank and sparse structures to enable effective training of LLMs from scratch under strict efficiency constraints. LOST applies singular value decomposition to weight matrices, preserving the dominant low-rank components, while allocating the remaining singular values to construct channel-wise sparse components to complement the expressiveness of low-rank training. We evaluate LOST on LLM pretraining ranging from 60M to 7B parameters. Our experiments show that LOST achieves competitive or superior performance compared to full-rank models, while significantly reducing both memory and compute overhead. Code is available at https://github.com/JiaxiLi1/LOST-Low-rank-and-Sparse-Training-for-Large-Language-Models
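A sketch of the decomposition step in PyTorch: SVD keeps the dominant low-rank part, and the residual is made channel-wise sparse. The top-rows-by-norm selection rule is an assumption about how channels are chosen:

```python
import torch

def lost_split(W: torch.Tensor, rank: int, keep_channels: int):
    """Keep the dominant low-rank part of W via SVD and route the residual
    into a channel-wise sparse component (top rows by norm, as an assumption)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    W_low = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank]
    residual = W - W_low
    row_norms = residual.norm(dim=1)
    mask = torch.zeros(W.size(0), dtype=torch.bool)
    mask[row_norms.topk(keep_channels).indices] = True
    W_sparse = residual * mask[:, None]  # zero out all but the kept channels
    return W_low, W_sparse

W = torch.randn(256, 256)
W_low, W_sparse = lost_split(W, rank=16, keep_channels=32)
```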
[889] BiDoRA: Bi-level Optimization-Based Weight-Decomposed Low-Rank Adaptation
Peijia Qin, Ruiyi Zhang, Pengtao Xie
Main category: cs.LG
TL;DR: BiDoRA improves PEFT by decoupling magnitude and direction optimization, outperforming DoRA and other methods.
Details
Motivation: Address overfitting and coupled updates in DoRA by proposing a bi-level optimization framework.Method: BiDoRA uses separate loops for magnitude and direction optimization with distinct data splits.
Result: Achieves better correlation with full fine-tuning and outperforms DoRA on diverse tasks.
Conclusion: BiDoRA is a superior PEFT method with significant performance improvements.
Abstract: Parameter-efficient fine-tuning (PEFT) is a flexible and efficient method for adapting large language models (LLMs) to downstream tasks. Among these methods, weight-decomposed low-rank adaptation (DoRA) is a promising approach that decomposes weight matrices into magnitude and direction components to mimic full fine-tuning (FT) better. However, DoRA’s simultaneous optimization of these components makes it over-expressive, increases the risk of overfitting, and creates a coupled updating pattern that limits its learning capacity. To address these issues, we propose Bi-level Optimization-Based Weight-Decomposed Low-Rank Adaptation (BiDoRA), a novel PEFT method based on a bi-level optimization framework. BiDoRA fundamentally differs from DoRA by optimizing the magnitude and direction in two separate, asynchronous loops using distinct training and validation data splits. This decoupled optimization process effectively mitigates overfitting and allows for more flexible updates that align even more closely with FT. For instance, weight decomposition analysis shows BiDoRA achieves a magnitude-direction update correlation of $-8.042$, significantly closer to the FT ideal compared to $-1.784$ for DoRA. Evaluation of BiDoRA on diverse tasks spanning natural language understanding, generation, token classification, and extremely small biomedical datasets reveals that it consistently outperforms DoRA and a wide range of leading PEFT methods. This improvement is statistically significant, as demonstrated on the GLUE benchmark where BiDoRA surpasses DoRA with a p-value of $2.4\times10^{-4}$ in terms of the Wilcoxon signed-rank test. The code for BiDoRA is available at https://github.com/t2ance/BiDoRA.
[890] Robust and Efficient Fine-tuning of LLMs with Bayesian Reparameterization of Low-Rank Adaptation
Ayan Sengupta, Vaibhav Seth, Arinjay Pathak, Aastha Verma, Natraj Raman, Sriram Gopalakrishnan, Niladri Chatterjee, Tanmoy Chakraborty
Main category: cs.LG
TL;DR: MonteCLoRA improves LLM fine-tuning stability using Monte Carlo estimation for low-rank parameters, enhancing accuracy and robustness with minimal added parameters.
Details
Motivation: Address instability in low-rank adaptation due to hyperparameter sensitivity, aiming for stable and efficient fine-tuning.Method: Proposes MonteCLoRA, leveraging Monte Carlo estimation for unbiased low-rank parameter estimation with low variance.
Result: Shows 0.5%-1.6% accuracy/robustness gains and 50%-62% lower performance spread in generative tasks.
Conclusion: Effective parameterization and hyperpriors balance exploration-exploitation, leading to robust and optimal fine-tuning.
Abstract: Large Language Models (LLMs) are highly resource-intensive to fine-tune due to their enormous size. While low-rank adaptation is a prominent parameter-efficient fine-tuning approach, it suffers from sensitivity to hyperparameter choices, leading to instability in model performance on fine-tuning downstream tasks. This paper highlights the importance of effective parameterization in low-rank fine-tuning to reduce estimator variance and enhance the stability of final model outputs. We propose MonteCLoRA, an efficient fine-tuning technique that employs Monte Carlo estimation to learn an unbiased posterior estimation of low-rank parameters with low expected variance, stabilizing fine-tuned LLMs with only O(r) additional parameters, for a given rank r. MonteCLoRA shows 0.5% and 1.6% improvements in accuracy and robustness over unregularized low-rank adaptation method on natural language understanding tasks with pre-trained RoBERTa-base. Furthermore, in generative tasks with pre-trained LLaMA-1-7B and LLaMA-3.2-3B-Instruct, MonteCLoRA demonstrates robust performance with 50% and 62% lower spreads respectively than the contemporary efficient fine-tuning methods. The theoretical and empirical results presented in the paper underscore how parameterization and hyperpriors balance exploration-exploitation in the low-rank parametric space, therefore leading to more optimal and robust parameter estimation during efficient fine-tuning.
[891] Gandalf the Red: Adaptive Security for LLMs
Niklas Pfister, Václav Volhejn, Manuel Knott, Santiago Arias, Julia Bazińska, Mykhailo Bichurin, Alan Commike, Janet Darling, Peter Dienes, Matthew Fiedler, David Haber, Matthias Kraft, Marco Lancini, Max Mathys, Damián Pascual-Ortiz, Jakub Podolak, Adrià Romero-López, Kyriacos Shiarlis, Andreas Signer, Zsolt Terek, Athanasios Theocharis, Daniel Timbrell, Samuel Trautwein, Samuel Watts, Yun-Han Wu, Mateo Rojas-Carulla
Main category: cs.LG
TL;DR: The paper introduces D-SEC and Gandalf to address gaps in evaluating LLM defenses, highlighting the balance between security and usability.
Details
Motivation: Current evaluations ignore dynamic adversarial behavior and usability penalties for legitimate users.Method: Proposes D-SEC for modeling security-utility trade-offs and Gandalf for realistic attack simulation.
Result: Collected 279k prompt attacks; showed defenses can degrade usability without blocking requests.
Conclusion: Effective strategies include restricted domains, defense-in-depth, and adaptive defenses.
Abstract: Current evaluations of defenses against prompt attacks in large language model (LLM) applications often overlook two critical factors: the dynamic nature of adversarial behavior and the usability penalties imposed on legitimate users by restrictive defenses. We propose D-SEC (Dynamic Security Utility Threat Model), which explicitly separates attackers from legitimate users, models multi-step interactions, and expresses the security-utility trade-off in an optimizable form. We further address the shortcomings in existing evaluations by introducing Gandalf, a crowd-sourced, gamified red-teaming platform designed to generate realistic, adaptive attacks. Using Gandalf, we collect and release a dataset of 279k prompt attacks. Complemented by benign user data, our analysis reveals the interplay between security and utility, showing that defenses integrated into the LLM (e.g., system prompts) can degrade usability even without blocking requests. We demonstrate that restricted application domains, defense-in-depth, and adaptive defenses are effective strategies for building secure and useful LLM applications.
[892] Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions
Yik Siu Chan, Narutatsu Ri, Yuxin Xiao, Marzyeh Ghassemi
Main category: cs.LG
TL;DR: The paper highlights vulnerabilities in LLMs to jailbreak attacks, showing that actionable and informative responses can facilitate harmful actions. It introduces HarmScore and Speak Easy, demonstrating their effectiveness in increasing attack success rates.
Details
Motivation: To explore whether jailbroken responses are practically useful for harmful actions and if vulnerabilities exist in common human-LLM interactions.Method: Proposes HarmScore to measure harmful action facilitation and Speak Easy, a multi-step, multilingual attack framework. Tests these on open-source and proprietary LLMs.
Result: Speak Easy increases Attack Success Rate by 0.319 and HarmScore by 0.426 across benchmarks, revealing exploitable vulnerabilities.
Conclusion: Common interaction patterns can be easily exploited for harmful purposes, highlighting a critical oversight in LLM safety.
Abstract: Despite extensive safety alignment efforts, large language models (LLMs) remain vulnerable to jailbreak attacks that elicit harmful behavior. While existing studies predominantly focus on attack methods that require technical expertise, two critical questions remain underexplored: (1) Are jailbroken responses truly useful in enabling average users to carry out harmful actions? (2) Do safety vulnerabilities exist in more common, simple human-LLM interactions? In this paper, we demonstrate that LLM responses most effectively facilitate harmful actions when they are both actionable and informative–two attributes easily elicited in multi-step, multilingual interactions. Using this insight, we propose HarmScore, a jailbreak metric that measures how effectively an LLM response enables harmful actions, and Speak Easy, a simple multi-step, multilingual attack framework. Notably, by incorporating Speak Easy into direct request and jailbreak baselines, we see an average absolute increase of 0.319 in Attack Success Rate and 0.426 in HarmScore in both open-source and proprietary LLMs across four safety benchmarks. Our work reveals a critical yet often overlooked vulnerability: Malicious users can easily exploit common interaction patterns for harmful intentions.
[893] AIRepr: An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science
Qiuhai Zeng, Claire Jin, Xinyue Wang, Yuhan Zheng, Qunhua Li
Main category: cs.LG
TL;DR: AIRepr is a framework for evaluating and improving the reproducibility of LLM-generated data analysis workflows, using statistical principles and novel prompting strategies.
Details
Motivation: Understanding the reasoning behind LLM-generated analyses is critical due to multiple valid solutions in data science tasks, but manual review is labor-intensive.Method: Introduces AIRepr, an Analyst-Inspector framework, with two reproducibility-enhancing prompting strategies, benchmarked across 15 LLM pairs and 1,032 tasks.
Result: Workflows with higher reproducibility yield more accurate analyses, and reproducibility-enhancing prompts improve both metrics.
Conclusion: AIRepr enhances transparency, reliability, and efficiency in human-AI collaboration for data science.
Abstract: Large language models (LLMs) are increasingly used to automate data analysis through executable code generation. Yet, data science tasks often admit multiple statistically valid solutions, e.g. different modeling strategies, making it critical to understand the reasoning behind analyses, not just their outcomes. While manual review of LLM-generated code can help ensure statistical soundness, it is labor-intensive and requires expertise. A more scalable approach is to evaluate the underlying workflows: the logical plans guiding code generation. However, it remains unclear how to assess whether an LLM-generated workflow supports reproducible implementations. To address this, we present AIRepr, an Analyst-Inspector framework for automatically evaluating and improving the reproducibility of LLM-generated data analysis workflows. Our framework is grounded in statistical principles and supports scalable, automated assessment. We introduce two novel reproducibility-enhancing prompting strategies and benchmark them against standard prompting across 15 analyst-inspector LLM pairs and 1,032 tasks from three public benchmarks. Our findings show that workflows with higher reproducibility also yield more accurate analyses, and that reproducibility-enhancing prompts substantially improve both metrics. This work provides a foundation for more transparent, reliable, and efficient human-AI collaboration in data science. Our code is publicly available.
[894] LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation
Juzheng Zhang, Jiacheng You, Ashwinee Panda, Tom Goldstein
Main category: cs.LG
TL;DR: LoRI improves LoRA by freezing random projection matrices and sparsifying task-specific masks, reducing parameters and interference while maintaining performance.
Details
Motivation: Address the overhead and parameter interference issues of LoRA in multi-task scenarios.Method: Freeze projection matrices as random projections and sparsify matrices with task-specific masks.
Result: Outperforms full fine-tuning and PEFT methods, using 95% fewer parameters than LoRA, with reduced cross-task interference.
Conclusion: LoRI is effective for multi-task learning, adapter merging, and continual learning with minimal interference.
Abstract: Low-Rank Adaptation (LoRA) has emerged as a popular parameter-efficient fine-tuning (PEFT) method for Large Language Models (LLMs), yet it still incurs notable overhead and suffers from parameter interference in multi-task scenarios. We propose LoRA with Reduced Interference (LoRI), a simple yet effective approach that freezes the projection matrices $A$ as random projections and sparsifies the matrices $B$ using task-specific masks. This design substantially reduces the number of trainable parameters while maintaining strong task performance. Moreover, LoRI minimizes cross-task interference in adapter merging by leveraging the orthogonality between adapter subspaces, and supports continual learning by using sparsity to mitigate catastrophic forgetting. Extensive experiments across natural language understanding, mathematical reasoning, code generation, and safety alignment tasks demonstrate that LoRI outperforms full fine-tuning and existing PEFT methods, while using up to 95% fewer trainable parameters than LoRA. In multi-task experiments, LoRI enables effective adapter merging and continual learning with reduced cross-task interference. Code is available at: https://github.com/juzhengz/LoRI
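A minimal sketch of a LoRI-style adapter: $A$ is frozen as a random projection and $B$ is trained under a fixed sparse mask. The random mask below stands in for the paper's task-specific masks, whose construction is not detailed in the summary:

```python
import torch
from torch import nn

class LoRILayer(nn.Module):
    """LoRI-style adapter: frozen random projection A, trainable B under a
    fixed sparse mask (random here; task-specific in the paper)."""
    def __init__(self, d_in: int, d_out: int, r: int, density: float = 0.05):
        super().__init__()
        self.register_buffer("A", torch.randn(r, d_in) / r ** 0.5)  # frozen
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.register_buffer("mask", (torch.rand(d_out, r) < density).float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.A.T @ (self.B * self.mask).T  # masked low-rank update

layer = LoRILayer(d_in=768, d_out=768, r=8)
out = layer(torch.randn(4, 768))  # added to the frozen base layer's output
```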
[895] ParetoHqD: Fast Offline Multiobjective Alignment of Large Language Models using Pareto High-quality Data
Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin
Main category: cs.LG
TL;DR: ParetoHqD improves multiobjective alignment in large language models by using preference directions and high-quality Pareto front data, outperforming five baselines.
Details
Motivation: Aligning large language models with diverse human expectations and values is essential for serving varied user needs, but current methods face limitations in preference representation and imbalanced rewards.Method: ParetoHqD represents preferences as directions in the objective space and uses high-quality data near the Pareto front. It employs a two-stage supervised fine-tuning process tailored to each preference direction.
Result: ParetoHqD outperforms five baselines on two multiobjective alignment tasks.
Conclusion: ParetoHqD effectively addresses limitations in preference representation and reward imbalance, demonstrating superior performance in aligning models with diverse human values.
Abstract: Aligning large language models with multiple human expectations and values is crucial for ensuring that they adequately serve a variety of user needs. To this end, offline multiobjective alignment algorithms such as the Rewards-in-Context algorithm have shown strong performance and efficiency. However, inappropriate preference representations and training with imbalanced reward scores limit the performance of such algorithms. In this work, we introduce ParetoHqD that addresses the above issues by representing human preferences as preference directions in the objective space and regarding data near the Pareto front as "high-quality" data. For each preference, ParetoHqD follows a two-stage supervised fine-tuning process, where each stage uses an individual Pareto high-quality training set that best matches its preference direction. The experimental results have demonstrated the superiority of ParetoHqD over five baselines on two multiobjective alignment tasks.
[896] GEM: Gaussian Embedding Modeling for Out-of-Distribution Detection in GUI Agents
Zheng Wu, Pengzhou Cheng, Zongru Wu, Lingzhong Dong, Zhuosheng Zhang
Main category: cs.LG
TL;DR: The paper introduces GEM, a method for detecting out-of-distribution (OOD) instructions in GUI agents, improving accuracy and success rates while minimizing computational overhead.
Details
Motivation: GUI agents struggle with OOD instructions, leading to failures or security risks. Current OOD detection methods are inadequate due to complex embedding spaces and dynamic GUI environments.Method: GEM uses a Gaussian mixture model on input embedding distances to identify capability boundaries of GUI agents.
Result: GEM improves accuracy by 23.70% over baselines, increases step-wise success rate by 9.40%, and adds minimal computational overhead.
Conclusion: GEM effectively detects OOD instructions in GUI agents, enhancing performance and generalization across diverse environments.
Abstract: Graphical user interface (GUI) agents have recently emerged as an intriguing paradigm for human-computer interaction, capable of automatically executing user instructions to operate intelligent terminal devices. However, when encountering out-of-distribution (OOD) instructions that violate environmental constraints or exceed the current capabilities of agents, GUI agents may suffer task breakdowns or even pose security threats. Therefore, effective OOD detection for GUI agents is essential. Traditional OOD detection methods perform suboptimally in this domain due to the complex embedding space and evolving GUI environments. In this work, we observe that the in-distribution input semantic space of GUI agents exhibits a clustering pattern with respect to the distance from the centroid. Based on the finding, we propose GEM, a novel method based on fitting a Gaussian mixture model over input embedding distances extracted from the GUI agent that reflect its capability boundary. Evaluated on eight datasets spanning smartphones, computers, and web browsers, our method achieves an average accuracy improvement of 23.70% over the best-performing baseline while only increasing training time by 4.9% and testing time by 6.5%. We also experimentally demonstrate that GEM can improve the step-wise success rate by 9.40% by requesting assistance from the cloud model when encountering OOD samples. Analysis verifies the generalization ability of our method through experiments on nine different backbones. The codes are available at https://github.com/Wuzheng02/GEM-OODforGUIagents.
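A sketch of GEM's detection rule with scikit-learn: fit a Gaussian mixture over centroid distances of in-distribution embeddings and flag inputs whose distance has unusually low likelihood. The component count and quantile threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gem_detector(train_emb: np.ndarray, n_components: int = 3, q: float = 0.01):
    """Fit a GMM over distances from the in-distribution centroid; flag
    inputs whose distance scores below a low-likelihood cutoff."""
    centroid = train_emb.mean(axis=0)
    dists = np.linalg.norm(train_emb - centroid, axis=1).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(dists)
    threshold = np.quantile(gmm.score_samples(dists), q)  # illustrative cutoff

    def is_ood(emb: np.ndarray) -> bool:
        d = np.array([[np.linalg.norm(emb - centroid)]])
        return bool(gmm.score_samples(d)[0] < threshold)

    return is_ood

detector = fit_gem_detector(np.random.randn(1000, 128))
print(detector(10.0 * np.random.randn(128)))  # far-off embedding, likely True
```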
[897] Online and Customizable Fairness-aware Learning
Wenbin Zhang
Main category: cs.LG
TL;DR: A framework for online decision trees to ensure fairness in AI decision-making with evolving data streams, addressing bias and distribution shifts.
Details
Motivation: Address concerns of discrimination in AI decision-making, especially with streaming data that evolves over time, requiring real-time fairness and accuracy.Method: Proposes two fairness splitting criteria and two online growth algorithms for decision trees to handle non-stationary data and bias patterns.
Result: Algorithms effectively manage discrimination in dynamic streaming environments, balancing fairness and predictive performance.
Conclusion: The framework successfully adapts to evolving data streams, ensuring fairness without compromising accuracy.
Abstract: While artificial intelligence (AI)-based decision-making systems are increasingly popular, significant concerns about potential discrimination during the AI decision-making process have been observed. For example, the distribution of predictions is usually biased and depends on the sensitive attributes (e.g., gender and ethnicity). Numerous approaches have therefore been proposed to develop decision-making systems that are discrimination-conscious by design; these are typically batch-based and require the simultaneous availability of all the training data for model learning. However, in the real world, data streams usually arrive on the fly, which requires the model to process each input once "on arrival", without storage and reprocessing. In addition, the data streams might also evolve over time, which further requires the model to simultaneously adapt to non-stationary data distributions and time-evolving bias patterns, with an effective and robust trade-off between accuracy and fairness. In this paper, we propose a novel framework for fairness-aware online decision trees over data streams with possible distribution drift. Specifically, we first propose two novel fairness splitting criteria that encode the data as well as possible while simultaneously removing dependence on the sensitive attributes, and that further adapt to non-stationary distributions with fine-grained control when needed. Second, we propose two online growth algorithms for fair decision trees that fulfill different online fair decision-making requirements. Our experiments show that our algorithms are able to deal with discrimination in massive and non-stationary streaming environments, with a better trade-off between fairness and predictive performance.
[898] How Can I Publish My LLM Benchmark Without Giving the True Answers Away?
Takashi Ishida, Thanawat Lodkaew, Ikko Yamane
Main category: cs.LG
TL;DR: The paper proposes a method to publish LLM benchmarks without fully disclosing ground-truth answers, using randomized correct answers to detect data contamination.
Details
Motivation: To prevent benchmark contamination in LLMs while enabling open evaluation, avoiding reliance on a single trusted organization and mitigating test-set overfitting.
Method: Inject randomness into benchmark answers by providing multiple logically correct options, with only one as the solution, reducing Bayes accuracy.
Result: The method effectively detects data contamination across various benchmarks, models, and training methodologies.
Conclusion: The approach successfully balances benchmark transparency and contamination prevention, providing a reliable contamination detection mechanism.
Abstract: Publishing a large language model (LLM) benchmark on the Internet risks contaminating future LLMs: the benchmark may be unintentionally (or intentionally) used to train or select a model. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers. However, this strategy requires trust in a single organization and still permits test-set overfitting through repeated queries. To overcome this issue, we propose a way to publish benchmarks without completely disclosing the ground-truth answers to the questions, while still maintaining the ability to openly evaluate LLMs. Our main idea is to inject randomness into the answers by preparing several logically correct answers, and include only one of them as the solution in the benchmark. This reduces the best possible accuracy, i.e., Bayes accuracy, of the benchmark. Not only does this keep the ground truth from being disclosed, but the approach also offers a test for detecting data contamination. In principle, even fully capable models should not surpass the Bayes accuracy. If a model nevertheless surpasses this ceiling, that is a strong signal of data contamination. We present experimental evidence that our method can detect data contamination accurately on a wide range of benchmarks, models, and training methodologies.
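The detection logic follows directly from the construction: with several logically correct options and one of them published at random, even a perfect model's expected accuracy is capped. A minimal sketch of that ceiling and a one-sided contamination test; the function names and the choice of a binomial test are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import binomtest

def bayes_accuracy_ceiling(options_per_question: np.ndarray) -> float:
    """A perfect model knows every logically correct answer but not which one
    was randomly published, so it scores 1/k per question in expectation."""
    return float(np.mean(1.0 / options_per_question))

def contamination_pvalue(n_correct: int, n_questions: int, ceiling: float) -> float:
    """One-sided test: is observed accuracy significantly above the ceiling?"""
    return binomtest(n_correct, n_questions, ceiling, alternative="greater").pvalue

# Example: 1000 questions with 4 logically correct options each -> ceiling 0.25.
ceiling = bayes_accuracy_ceiling(np.full(1000, 4))
p_value = contamination_pvalue(310, 1000, ceiling)  # 31% accuracy is suspicious
```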
[899] Impartial Games: A Challenge for Reinforcement Learning
Bei Zhou, Søren Riis
Main category: cs.LG
TL;DR: AlphaZero-style RL struggles with impartial games like Nim due to difficulties in learning abstract mathematical principles, revealing a representational bottleneck in neural networks.
Details
Motivation: To investigate the limitations of AlphaZero-style RL in impartial games, where optimal play relies on abstract principles, using Nim as a case study.
Method: A novel framework distinguishing champion and expert mastery is introduced to evaluate RL agents, focusing on their performance in Nim across varying board sizes.
Result: AlphaZero-style agents achieve champion-level play on small Nim boards but fail as board size increases, due to an inability to learn non-associative functions like parity.
Conclusion: The findings highlight the need for fundamentally new algorithmic approaches, such as neuro-symbolic or meta-learning, to achieve true expert-level AI in combinatorial games.
Abstract: AlphaZero-style reinforcement learning (RL) algorithms have achieved superhuman performance in many complex board games such as Chess, Shogi, and Go. However, we showcase that these algorithms encounter significant and fundamental challenges when applied to impartial games, a class where players share game pieces and optimal strategy often relies on abstract mathematical principles. Specifically, we utilize the game of Nim as a concrete and illustrative case study to reveal critical limitations of AlphaZero-style and similar self-play RL algorithms. We introduce a novel conceptual framework distinguishing between champion and expert mastery to evaluate RL agent performance. Our findings reveal that while AlphaZero-style agents can achieve champion-level play on very small Nim boards, their learning progression severely degrades as the board size increases. This difficulty stems not merely from complex data distributions or noisy labels, but from a deeper representational bottleneck: the inherent struggle of generic neural networks to implicitly learn abstract, non-associative functions like parity, which are crucial for optimal play in impartial games. This limitation causes a critical breakdown in the positive feedback loop essential for self-play RL, preventing effective learning beyond rote memorization of frequently observed states. These results align with broader concerns regarding AlphaZero-style algorithms’ vulnerability to adversarial attacks, highlighting their inability to truly master all legal game states. Our work underscores that simple hyperparameter adjustments are insufficient to overcome these challenges, establishing a crucial foundation for the development of fundamentally novel algorithmic approaches, potentially involving neuro-symbolic or meta-learning paradigms, to bridge the gap towards true expert-level AI in combinatorial games.
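For context, the abstract principle behind optimal Nim play is the nim-sum: a position is losing exactly when the bitwise XOR of the pile sizes is zero, the parity-style function the paper argues generic networks struggle to learn implicitly. A short reference implementation of that classical strategy (standard combinatorial game theory, not code from the paper):

```python
from functools import reduce
from operator import xor

def nim_sum(piles: list[int]) -> int:
    """Grundy value of a Nim position: bitwise XOR of all pile sizes."""
    return reduce(xor, piles, 0)

def optimal_move(piles: list[int]):
    """Return a winning (pile_index, new_size) move, or None if the position
    is lost under optimal play (nim-sum already zero)."""
    s = nim_sum(piles)
    if s == 0:
        return None
    for i, p in enumerate(piles):
        if p ^ s < p:            # shrinking this pile can zero the nim-sum
            return i, p ^ s
```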
[900] MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
Guangchen Lan, Sipeng Zhang, Tianle Wang, Yuwei Zhang, Daoan Zhang, Xinpeng Wei, Xiaoman Pan, Hongming Zhang, Dong-Jun Han, Christopher G. Brinton
Main category: cs.LG
TL;DR: MaPPO is a new framework for aligning LLMs with human preferences by integrating prior reward knowledge into optimization, outperforming existing methods like DPO without extra hyperparameters.
Details
Motivation: Existing methods like DPO treat preference learning as MLE, oversimplifying response classification. MaPPO aims to improve alignment by incorporating prior rewards.
Method: MaPPO extends MLE to a Maximum a Posteriori objective, integrating prior reward estimates. It works offline/online and is compatible with DPO variants.
Result: Empirical tests on benchmarks (MT-Bench, AlpacaEval 2.0, Arena-Hard) show improved alignment without computational trade-offs.
Conclusion: MaPPO generalizes and enhances DPO methods, offering better alignment and flexibility without added complexity.
Abstract: As the era of large language models (LLMs) acting on behalf of users unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a framework for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. While existing methods such as Direct Preference Optimization (DPO) and its variants treat preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. More importantly, MaPPO introduces no additional hyperparameter, and supports preference optimization in both offline and online settings. In addition, MaPPO can be used as a plugin with consistent improvement on DPO variants, including the widely used SimPO, IPO, and CPO. Extensive empirical evaluations of different model sizes and model series on three standard benchmarks, including MT-Bench, AlpacaEval 2.0, and Arena-Hard, demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
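The abstract does not spell out the MaP objective, but its relation to DPO can be illustrated. In the sketch below, `dpo_loss` is the standard DPO objective, and `map_style_loss` is a hypothetical variant in which a prior reward gap shifts the preference margin, mimicking how a log-prior augments an MLE objective; the exact MaPPO formulation may differ.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO: log-sigmoid of the implicit reward margin (MLE view)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

def map_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                   prior_r_w, prior_r_l, beta=0.1):
    """Hypothetical MaP-style variant: the prior reward gap between chosen
    and rejected responses shifts the margin, acting like a log-prior on top
    of the MLE term (illustrative only)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin + (prior_r_w - prior_r_l)).mean()
```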
[901] AlphaViT: A flexible game-playing AI for multiple games and variable board sizes
Kazuhisa Fujita
Main category: cs.LG
TL;DR: Three game-playing agents (AlphaViT, AlphaViD, AlphaVDA) using Vision Transformers (ViT) in the AlphaZero framework outperform traditional algorithms and approach AlphaZero’s performance, demonstrating flexibility across board sizes and games.
Details
Motivation: To overcome AlphaZero's limitation of fixed board sizes and develop flexible, robust AI agents capable of playing multiple board games with a single neural network.
Method: Incorporates ViT into AlphaZero: AlphaViT (encoder-only), AlphaViD (encoder-decoder), and AlphaVDA (decoder with learnable embeddings). Trained on single or multiple games.
Result: Agents outperform Minimax and Monte Carlo Tree Search, approaching AlphaZero’s performance. AlphaViT excels across games. Multi-game training matches or surpasses single-game performance.
Conclusion: Transformer-based architectures enable flexible, high-performance game-playing AI, adaptable to multiple games and dynamic environments.
Abstract: We present three game-playing agents incorporating Vision Transformers (ViT) into the AlphaZero framework: AlphaViT, AlphaViD (AlphaViT with a transformer decoder), and AlphaVDA (AlphaViD with learnable action embeddings). These agents can play multiple board games of varying sizes using a single neural network with shared weights, thus overcoming AlphaZero’s limitation of fixed board sizes. AlphaViT employs only a transformer encoder, whereas AlphaViD and AlphaVDA incorporate both a transformer encoder and a decoder. In AlphaViD, the decoder processes outputs from the encoder, whereas AlphaVDA uses learnable embeddings as the decoder inputs. The additional decoder in AlphaViD and AlphaVDA provides flexibility to adapt to various action spaces and board sizes. Experimental results show that the proposed agents, trained on either individual games or on multiple games simultaneously, consistently outperform traditional algorithms, such as Minimax and Monte Carlo Tree Search. They approach the performance of AlphaZero despite relying on a single deep neural network (DNN) with shared weights. In particular, AlphaViT performs strongly across all evaluated games. Furthermore, fine-tuning the DNN with weights pre-trained on small board games accelerates convergence and improves performance, particularly in Gomoku. Interestingly, simultaneous training on multiple games yields performance comparable to, or even surpassing, that of single-game training. These results indicate the potential of transformer-based architectures for developing more flexible and robust game-playing AI agents that excel in multiple games and dynamic environments.
[902] An Electrocardiogram Foundation Model Built on over 10 Million Recordings with External Evaluation across Multiple Domains
Jun Li, Aaron Aguirre, Junior Moura, Che Liu, Lanhai Zhong, Chenxi Sun, Gari Clifford, Brandon Westover, Shenda Hong
Main category: cs.LG
TL;DR: ECGFounder is an ECG foundation model trained on 10M+ ECGs, achieving expert-level performance (AUROC >0.95) and strong generalization across diagnoses. It supports mobile monitoring and downstream tasks.
Details
Motivation: To address challenges like insufficient sample sizes and generalization gaps in ECG analysis, and to elevate AI-ECG research using foundation models.
Method: Trained on the Harvard-Emory ECG Database with 150 label categories, leveraging real-world expert annotations. Designed for out-of-the-box use and fine-tuning.
Result: Achieves expert-level performance (AUROC >0.95) on 80 diagnoses, strong generalization, and outperforms baselines in downstream tasks.
Conclusion: ECGFounder advances ECG analysis, offering a versatile, high-performance solution for cardiovascular disease diagnosis and mobile monitoring.
Abstract: Artificial intelligence (AI) has demonstrated significant potential in ECG analysis and cardiovascular disease assessment. Recently, foundation models have played a remarkable role in advancing medical AI. The development of an ECG foundation model holds the promise of elevating AI-ECG research to new heights. However, building such a model faces several challenges, including insufficient database sample sizes and inadequate generalization across multiple domains. Additionally, there is a notable performance gap between single-lead and multi-lead ECG analyses. We introduce an ECG Foundation Model (ECGFounder), a general-purpose model that leverages real-world ECG annotations from cardiology experts to broaden the diagnostic capabilities of ECG analysis. ECGFounder was trained on over 10 million ECGs with 150 label categories from the Harvard-Emory ECG Database, enabling comprehensive cardiovascular disease diagnosis through ECG analysis. The model is designed to be both an effective out-of-the-box solution and fine-tunable for downstream tasks, maximizing usability. Importantly, we extended its application to lower-rank ECGs, and arbitrary single-lead ECGs in particular. ECGFounder is applicable to supporting various downstream tasks in mobile monitoring scenarios. Experimental results demonstrate that ECGFounder achieves expert-level performance on internal validation sets, with AUROC exceeding 0.95 for eighty diagnoses. It also shows strong classification performance and generalization across various diagnoses on external validation sets. When fine-tuned, ECGFounder outperforms baseline models in demographic analysis, clinical event detection, and cross-modality cardiac rhythm diagnosis. The trained model and data will be publicly released upon publication through bdsp.io. Our code is available at https://github.com/PKUDigitalHealth/ECGFounder
[903] UoMo: A Foundation Model for Mobile Traffic Forecasting with Diffusion Model
Haoye Chai, Shiyuan Zhang, Xiaoqian Qi, Baohua Qiu, Yong Li
Main category: cs.LG
TL;DR: Proposes FoMo, a foundation model for mobile traffic forecasting, combining diffusion models and transformers to handle diverse tasks and improve generalization across urban environments.
Details
Motivation: Existing models are task-specific and lack generalization, limiting their effectiveness in diverse mobile network tasks. Foundation models' multi-tasking and zero/few-shot capabilities inspire this work.
Method: FoMo integrates diffusion models and transformers, using spatio-temporal masks and contrastive learning to learn task-specific features and urban context correlations.
Result: Outperforms current models in diverse forecasting tasks and zero/few-shot learning on 9 real-world datasets, demonstrating strong universality.
Conclusion: FoMo is a versatile and effective foundation model for mobile traffic forecasting, enhancing generalization and performance across tasks and environments.
Abstract: Mobile traffic forecasting allows operators to anticipate network dynamics and performance in advance, offering substantial potential for enhancing service quality and improving user experience. However, existing models are often task-oriented and are trained with tailored data, which limits their effectiveness in diverse mobile network tasks such as Base Station (BS) deployment, resource allocation, and energy optimization, and hinders generalization across different urban environments. Foundation models have made remarkable strides across various domains of NLP and CV due to their multi-task adaptation and zero/few-shot learning capabilities. In this paper, we propose an innovative Foundation model for Mobile traffic forecasting (FoMo), aiming to handle diverse forecasting tasks of short/long-term predictions and distribution generation across multiple cities to support network planning and optimization. FoMo combines diffusion models and transformers, where various spatio-temporal masks are proposed to enable FoMo to learn intrinsic features of different tasks, and a contrastive learning strategy is developed to capture the correlations between mobile traffic and urban contexts, thereby improving its transfer learning capability. Extensive experiments on 9 real-world datasets demonstrate that FoMo outperforms current models concerning diverse forecasting tasks and zero/few-shot learning, showcasing a strong universality.
[904] Geminio: Language-Guided Gradient Inversion Attacks in Federated Learning
Junjie Shan, Ziqi Zhao, Jialin Lu, Rui Zhang, Siu Ming Yiu, Ka-Ho Chow
Main category: cs.LG
TL;DR: The paper introduces Geminio, a method using vision-language models (VLMs) to enhance gradient inversion attacks (GIAs) in federated learning (FL), enabling targeted reconstruction of high-value data samples based on natural language queries.
Details
Motivation: To address the gap in existing GIAs, which struggle with high-resolution images and lack flexibility in targeting specific data, the paper explores the weaponization of VLMs for privacy attacks.
Method: Geminio leverages a pretrained VLM to guide the optimization of a malicious global model, prioritizing reconstruction of samples matching attacker-specified natural language queries.
Result: Experiments show Geminio effectively pinpoints and reconstructs targeted samples with high success rates, even against defenses and large batch sizes.
Conclusion: Geminio transforms GIAs into semantically meaningful, targeted attacks, demonstrating the potential for VLMs to enhance privacy threats in FL.
Abstract: Foundation models that bridge vision and language have made significant progress. While they have inspired many life-enriching applications, their potential for abuse in creating new threats remains largely unexplored. In this paper, we reveal that vision-language models (VLMs) can be weaponized to enhance gradient inversion attacks (GIAs) in federated learning (FL), where an FL server attempts to reconstruct private data samples from gradients shared by victim clients. Despite recent advances, existing GIAs struggle to reconstruct high-resolution images when the victim has a large local data batch. One promising direction is to focus reconstruction on valuable samples rather than the entire batch, but current methods lack the flexibility to target specific data of interest. To address this gap, we propose Geminio, the first approach to transform GIAs into semantically meaningful, targeted attacks. It enables a brand new privacy attack experience: attackers can describe, in natural language, the data they consider valuable, and Geminio will prioritize reconstruction to focus on those high-value samples. This is achieved by leveraging a pretrained VLM to guide the optimization of a malicious global model that, when shared with and optimized by a victim, retains only gradients of samples that match the attacker-specified query. Geminio can be launched at any FL round and has no impact on normal training (i.e., the FL server can steal clients’ data while still producing a high-utility ML model as in benign scenarios). Extensive experiments demonstrate its effectiveness in pinpointing and reconstructing targeted samples, with high success rates across complex datasets and large batch sizes with resilience against defenses.
[905] Gradient Inversion Attack on Graph Neural Networks
Divya Anand Sinha, Ruijie Du, Yezi Liu, Athina Markopolou, Yanning Shen
Main category: cs.LG
TL;DR: The paper explores the vulnerability of graph data in federated learning, proposing a novel attack (GLG) to reconstruct private node features and graph structure from gradients.
Details
Motivation: To investigate if private graph data can be reconstructed from leaked gradients in federated learning, addressing a gap in current research.
Method: Proposes Graph Leakage from Gradients (GLG) attack, analyzing GCN and GraphSAGE frameworks, and evaluates reconstruction under various model settings.
Result: GLG accurately reconstructs both node features and graph structure from gradients, leveraging unique properties of graph data and GNNs.
Conclusion: The study highlights significant privacy risks in graph federated learning and demonstrates the effectiveness of GLG in data reconstruction.
Abstract: Graph federated learning is of essential importance for training over large graph datasets while protecting data privacy, where each client stores a subset of local graph data, while the server collects the local gradients and broadcasts only the aggregated gradients. Recent studies reveal that a malicious attacker can steal private image data from the gradient exchange of neural networks during federated learning. However, the vulnerability of graph data and graph neural networks under such attacks, i.e., reconstructing both node features and graph structure from gradients, remains largely underexplored. To address this gap, this paper studies the problem of whether private data can be reconstructed from leaked gradients in both node classification and graph classification tasks, and proposes a novel attack named Graph Leakage from Gradients (GLG). Two widely used GNN frameworks are analyzed, namely GCN and GraphSAGE. The effects of different model settings on reconstruction are extensively discussed. Theoretical analysis and empirical validation demonstrate that, by leveraging the unique properties of graph data and GNNs, GLG achieves more accurate reconstruction of both nodal features and graph structure from gradients.
[906] Friend or Foe? Harnessing Controllable Overfitting for Anomaly Detection
Long Qian, Bingke Zhu, Yingying Chen, Ming Tang, Jinqiao Wang
Main category: cs.LG
TL;DR: The paper introduces COAD, a framework that uses controlled overfitting to improve anomaly detection, proposing ARQ and RADI metrics to optimize and evaluate performance.
Details
Motivation: Challenge the conventional view that overfitting is harmful in anomaly detection by demonstrating its potential to enhance sensitivity to anomalies.
Method: Proposes COAD with ARQ to quantify overfitting and RADI to evaluate detection performance, using Gaussian noise for pseudo-anomalies.
Result: Achieves SOTA performance in anomaly detection tasks, validating the efficacy of controlled overfitting.
Conclusion: Overfitting can be strategically leveraged as a powerful tool in anomaly detection, redefining its role in model performance.
Abstract: Overfitting has traditionally been viewed as detrimental to anomaly detection, where excessive generalization often limits models’ sensitivity to subtle anomalies. Our work challenges this conventional view by introducing Controllable Overfitting-based Anomaly Detection (COAD), a novel framework that strategically leverages overfitting to enhance anomaly discrimination capabilities. We propose the Aberrance Retention Quotient (ARQ), a novel metric that systematically quantifies the extent of overfitting, enabling the identification of an optimal golden overfitting interval wherein model sensitivity to anomalies is maximized without sacrificing generalization. To comprehensively capture how overfitting affects detection performance, we further propose the Relative Anomaly Distribution Index (RADI), a metric superior to traditional AUROC by explicitly modeling the separation between normal and anomalous score distributions. Theoretically, RADI leverages ARQ to track and evaluate how overfitting impacts anomaly detection, offering an integrated approach to understanding the relationship between overfitting dynamics and model efficacy. We also rigorously validate the statistical efficacy of Gaussian noise as pseudo-anomaly generators, reinforcing the method’s broad applicability. Empirical evaluations demonstrate that our controllable overfitting method achieves state-of-the-art (SOTA) performance in both one-class and multi-class anomaly detection tasks, thus redefining overfitting as a powerful strategy rather than a limitation.
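The abstract validates Gaussian noise as a pseudo-anomaly generator and motivates measuring the separation between normal and anomalous score distributions. A toy sketch of both ideas; the separation statistic below is a plain standardized mean gap used for illustration, not the paper's RADI, whose exact form is not given here.

```python
import numpy as np

def gaussian_pseudo_anomalies(x_normal: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Pseudo-anomalies from additive Gaussian noise, per the paper's generator
    choice; sigma is an assumed noise scale."""
    rng = np.random.default_rng(0)
    return x_normal + rng.normal(0.0, sigma, size=x_normal.shape)

def score_separation(scores_normal: np.ndarray, scores_anom: np.ndarray) -> float:
    """Illustrative separation between the two anomaly-score distributions:
    standardized gap between their means (higher = better separated)."""
    pooled = np.sqrt(0.5 * (scores_normal.var() + scores_anom.var()))
    return float((scores_anom.mean() - scores_normal.mean()) / (pooled + 1e-12))
```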
[907] CSI-BERT2: A BERT-inspired Framework for Efficient CSI Prediction and Classification in Wireless Communication and Sensing
Zijian Zhao, Fanyi Meng, Zhonghao Lyu, Hang Li, Xiaoyang Li, Guangxu Zhu
Main category: cs.LG
TL;DR: CSI-BERT2 is a unified framework for CSI prediction and classification, improving upon CSI-BERT with a two-stage training method, mask prediction model, adaptive re-weighting layer, and temporal embedding module. It achieves state-of-the-art performance and handles challenges like data scarcity and packet loss.
Details
Motivation: Address challenges in wireless sensing (data scarcity, packet loss) and communication (high-dimensional CSI matrices, short coherent times) by improving CSI prediction and classification.
Method: Proposes CSI-BERT2 with a two-stage training method (unsupervised MLM followed by fine-tuning), extends MLM to MPM for prediction, introduces ARL for subcarrier representation, and uses MLP-based temporal embedding for time-series CSI.
Result: Achieves state-of-the-art performance in CSI prediction and classification, generalizes across sampling rates, and handles discontinuous CSI sequences robustly.
Conclusion: CSI-BERT2 effectively addresses key challenges in wireless sensing and communication, outperforming conventional methods and demonstrating strong generalization and robustness.
Abstract: Channel state information (CSI) is a fundamental component in both wireless communication and sensing systems, enabling critical functions such as radio resource optimization and environmental perception. In wireless sensing, data scarcity and packet loss hinder efficient model training, while in wireless communication, high-dimensional CSI matrices and short coherent times caused by high mobility present challenges in CSI estimation. To address these issues, we propose a unified framework named CSI-BERT2 for CSI prediction and classification tasks. Building on CSI-BERT, we introduce a two-stage training method that first uses a mask language model (MLM) to enable the model to learn general feature extraction from scarce datasets in an unsupervised manner, followed by fine-tuning for specific downstream tasks. Specifically, we extend MLM into a mask prediction model (MPM), which efficiently addresses the CSI prediction task. We also introduce an adaptive re-weighting layer (ARL) to enhance subcarrier representation and a multi-layer perceptron (MLP) based temporal embedding module to mitigate permutation invariance issues in time-series CSI data. This significantly improves the CSI classification performance of the original CSI-BERT model. Extensive experiments on both real-world collected and simulated datasets demonstrate that CSI-BERT2 achieves state-of-the-art performance across all tasks. Our results further show that CSI-BERT2 generalizes effectively across varying sampling rates and robustly handles discontinuous CSI sequences caused by packet loss, challenges that conventional methods fail to address.
[908] Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN
Pengxiang Li, Lu Yin, Shiwei Liu
Main category: cs.LG
TL;DR: Mix-LN, a new normalization technique combining Pre-LN and Post-LN, improves gradient uniformity in LLMs, enhancing training and performance without increasing model size.
Details
Motivation: The inefficiency of deeper layers in LLMs due to Pre-LN's diminished gradient norms and Post-LN's vanishing gradients in earlier layers.
Method: Introduces Mix-LN, applying Post-LN to earlier layers and Pre-LN to deeper layers for balanced gradients.
Result: Mix-LN outperforms Pre-LN and Post-LN in experiments, improving pre-training and fine-tuning performance.
Conclusion: Mix-LN effectively addresses deep layer inefficiencies, unlocking LLM potential without size increase.
Abstract: Large Language Models (LLMs) have achieved remarkable success, yet recent findings reveal that their deeper layers often contribute minimally and can be pruned without affecting overall performance. While some view this as an opportunity for model compression, we identify it as a training shortfall rooted in the widespread use of Pre-Layer Normalization (Pre-LN). We demonstrate that Pre-LN, commonly employed in models like GPT and LLaMA, leads to diminished gradient norms in its deeper layers, reducing their effectiveness. In contrast, Post-Layer Normalization (Post-LN) preserves larger gradient norms in deeper layers but suffers from vanishing gradients in earlier layers. To address this, we introduce Mix-LN, a novel normalization technique that combines the strengths of Pre-LN and Post-LN within the same model. Mix-LN applies Post-LN to the earlier layers and Pre-LN to the deeper layers, ensuring more uniform gradients across layers. This allows all parts of the network–both shallow and deep layers–to contribute effectively to training. Extensive experiments with various model sizes from 70M to 7B demonstrate that Mix-LN consistently outperforms both Pre-LN and Post-LN, promoting more balanced, healthier gradient norms throughout the network, and enhancing the overall quality of LLM pre-training. Furthermore, we demonstrate that models pre-trained with Mix-LN learn better compared to those using Pre-LN or Post-LN during supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), highlighting the critical importance of high-quality deep layers. By effectively addressing the inefficiencies of deep layers in current LLMs, Mix-LN unlocks their potential, enhancing model capacity without increasing model size. Our code is available at https://github.com/pixeli99/MixLN.
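The mechanism is easy to picture in code: early blocks normalize after the residual sum (Post-LN), deep blocks normalize inside the residual branch (Pre-LN). A simplified PyTorch sketch; the fraction of layers assigned to Post-LN is an assumed hyperparameter, not a value taken from the paper.

```python
import torch
import torch.nn as nn

class MixLNBlock(nn.Module):
    """Transformer block using Post-LN in early layers and Pre-LN in deep
    layers, following the Mix-LN idea (simplified sketch)."""
    def __init__(self, d_model: int, n_heads: int, layer_idx: int,
                 n_layers: int, post_ln_ratio: float = 0.25):
        super().__init__()
        self.post_ln = layer_idx < int(post_ln_ratio * n_layers)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        if self.post_ln:   # Post-LN: normalize after each residual sum
            x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
            x = self.ln2(x + self.mlp(x))
        else:              # Pre-LN: normalize inside each residual branch
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.mlp(self.ln2(x))
        return x
```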
[909] Network Embedding with Completely-imbalanced Labels
Zheng Wang, Xiaojun Ye, Chaokun Wang, Jian Cui, Philip S. Yu
Main category: cs.LG
TL;DR: The paper introduces two semi-supervised network embedding methods, RSDNE and RECT, to address the challenge of completely-imbalanced labels, where some classes lack labeled nodes.
Details
Motivation: Existing semi-supervised methods perform poorly in completely-imbalanced label settings, motivating the development of new approaches.
Method: RSDNE ensures intra-class similarity and inter-class dissimilarity, while RECT leverages class-semantic knowledge for networks with features and multi-label settings.
Result: Experiments on real-world datasets show the superiority of the proposed methods.
Conclusion: The methods effectively handle completely-imbalanced labels and outperform existing approaches.
Abstract: Network embedding, aiming to project a network into a low-dimensional space, is increasingly becoming a focus of network research. Semi-supervised network embedding takes advantage of labeled data, and has shown promising performance. However, existing semi-supervised methods would get unappealing results in the completely-imbalanced label setting where some classes have no labeled nodes at all. To alleviate this, we propose two novel semi-supervised network embedding methods. The first one is a shallow method named RSDNE. Specifically, to benefit from the completely-imbalanced labels, RSDNE guarantees both intra-class similarity and inter-class dissimilarity in an approximate way. The other method is RECT which is a new class of graph neural networks. Different from RSDNE, to benefit from the completely-imbalanced labels, RECT explores the class-semantic knowledge. This enables RECT to handle networks with node features and multi-label setting. Experimental results on several real-world datasets demonstrate the superiority of the proposed methods. Code is available at https://github.com/zhengwang100/RECT.
[910] Optimizing Return Distributions with Distributional Dynamic Programming
Bernardo Ávila Pires, Mark Rowland, Diana Borsa, Zhaohan Daniel Guo, Khimya Khetarpal, André Barreto, David Abel, Rémi Munos, Will Dabney
Main category: cs.LG
TL;DR: The paper introduces distributional dynamic programming (DP) methods to optimize statistical functionals of return distributions, extending beyond classic DP by combining it with stock augmentation for risk-sensitive RL.
Details
Motivation: To generalize distributional DP beyond expected utilities and address recent problems in risk-sensitive RL by leveraging stock augmentation.
Method: Combines distributional DP with stock augmentation, analyzes distributional value and policy iteration, and provides bounds and applicability insights.
Result: Demonstrates the ability to solve stock-augmented return distribution optimization problems, such as maximizing conditional value-at-risk and homeostatic regulation.
Conclusion: The approach is practical, as shown by an agent combining DQN and distributional DP, validated empirically for discussed applications.
Abstract: We introduce distributional dynamic programming (DP) methods for optimizing statistical functionals of the return distribution, with standard reinforcement learning as a special case. Previous distributional DP methods could optimize the same class of expected utilities as classic DP. To go beyond, we combine distributional DP with stock augmentation, a technique previously introduced for classic DP in the context of risk-sensitive RL, where the MDP state is augmented with a statistic of the rewards obtained since the first time step. We find that a number of recently studied problems can be formulated as stock-augmented return distribution optimization, and we show that we can use distributional DP to solve them. We analyze distributional value and policy iteration, with bounds and a study of what objectives these distributional DP methods can or cannot optimize. We describe a number of applications outlining how to use distributional DP to solve different stock-augmented return distribution optimization problems, for example maximizing conditional value-at-risk, and homeostatic regulation. To highlight the practical potential of stock-augmented return distribution optimization and distributional DP, we introduce an agent that combines DQN and the core ideas of distributional DP, and empirically evaluate it for solving instances of the applications discussed.
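One of the cited objectives, conditional value-at-risk, is a statistical functional of the return distribution rather than an expectation. A small sketch of estimating it from sampled returns (illustrative only; the paper optimizes such functionals via distributional DP, not by post-hoc sampling):

```python
import numpy as np

def cvar(returns: np.ndarray, alpha: float = 0.05) -> float:
    """CVaR at level alpha: the mean of the worst alpha-fraction of returns."""
    var = np.quantile(returns, alpha)     # value-at-risk (lower alpha-quantile)
    return float(returns[returns <= var].mean())

returns = np.random.default_rng(0).normal(1.0, 2.0, size=10_000)
print(cvar(returns, alpha=0.05))          # mean of the worst 5% of outcomes
```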
[911] Algorithmic Recourse in Abnormal Multivariate Time Series
Xiao Han, Lu Zhang, Yongkai Wu, Shuhan Yuan
Main category: cs.LG
TL;DR: RecAD introduces a framework for algorithmic recourse in multivariate time series anomaly detection using backtracking counterfactual reasoning.
Details
Motivation: Limited research exists on recourse for multivariate time series, especially for reversing anomalies.
Method: RecAD models anomaly causes as external interventions, predicts recourse actions via counterfactual explanations, and trains the recourse function end-to-end.
Result: Experiments on synthetic and real-world datasets show RecAD’s effectiveness.
Conclusion: RecAD successfully addresses anomalies in multivariate time series through counterfactual reasoning.
Abstract: Algorithmic recourse provides actionable recommendations to alter unfavorable predictions of machine learning models, enhancing transparency through counterfactual explanations. While significant progress has been made in algorithmic recourse for static data, such as tabular and image data, limited research explores recourse for multivariate time series, particularly for reversing abnormal time series. This paper introduces Recourse in time series Anomaly Detection (RecAD), a framework for addressing anomalies in multivariate time series using backtracking counterfactual reasoning. By modeling the causes of anomalies as external interventions on exogenous variables, RecAD predicts recourse actions to restore normal status as counterfactual explanations, where the recourse function, responsible for generating actions based on observed data, is trained using an end-to-end approach. Experiments on synthetic and real-world datasets demonstrate its effectiveness.
[912] High-dimensional Linear Bandits with Knapsacks
Wanteng Ma, Dong Xia, Jiashuo Jiang
Main category: cs.LG
TL;DR: The paper addresses high-dimensional linear contextual bandits with knapsack constraints (CBwK), leveraging sparsity for improved regret bounds. It introduces an online hard thresholding estimator and a primal-dual scheme, achieving sub-linear regret with logarithmic dependency on feature dimension. Sharper bounds are derived under structural assumptions, and empirical results validate the approach.
Details
Motivation: To improve regret guarantees in high-dimensional CBwK by exploiting sparsity, addressing the limitations of prior work with polynomial dependency on feature dimension.
Method: The method combines an online hard thresholding algorithm for sparse estimation with a primal-dual scheme, where dual variables manage resource constraints. Structural assumptions (diverse-covariate or margin conditions) further refine regret bounds.
Result: The approach achieves sub-linear regret with logarithmic dependency on feature dimension. Sharper bounds (e.g., $\tilde{O}(s_{0} \sqrt{T})$) are possible under specific conditions, and empirical results confirm efficiency.
Conclusion: The integrated framework advances high-dimensional CBwK by improving regret bounds and adapting to structural assumptions, with broader applicability to contextual bandits without knapsack constraints.
Abstract: We investigate the contextual bandits with knapsack (CBwK) problem in a high-dimensional linear setting, where the feature dimension can be very large. Our goal is to harness sparsity to obtain sharper regret guarantees. To this end, we first develop an online variant of the hard thresholding algorithm that performs the sparse estimation in an online manner. We then embed this estimator in a primal-dual scheme: every knapsack constraint is paired with a dual variable, which is updated by an online learning rule to keep the cumulative resource consumption within budget. This integrated approach achieves a two-phase sub-linear regret that scales only logarithmically with the feature dimension, improving on the polynomial dependency reported in prior work. Furthermore, we show that either of the following structural assumptions is sufficient for a sharper regret bound of $\tilde{O}(s_{0} \sqrt{T})$: (i) a diverse-covariate condition; and (ii) a margin condition. When both conditions hold simultaneously, we can further control the regret to $O(s_{0}^{2} \log(dT)\log T)$ by a dual resolving scheme. As a by-product, applying our framework to high-dimensional contextual bandits without knapsack constraints recovers the optimal regret rates in both the data-poor and data-rich regimes. Finally, numerical experiments confirm the empirical efficiency of our algorithms in high-dimensional settings.
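The sparse-estimation building block, online hard thresholding, keeps only the $s_0$ largest-magnitude coordinates after each update. A minimal sketch for a squared-loss linear model; the step size and the exact update rule here are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def hard_threshold(theta: np.ndarray, s0: int) -> np.ndarray:
    """Project onto s0-sparse vectors: keep the s0 largest-magnitude entries."""
    out = np.zeros_like(theta)
    keep = np.argsort(np.abs(theta))[-s0:]
    out[keep] = theta[keep]
    return out

def online_sparse_step(theta: np.ndarray, x: np.ndarray, y: float,
                       s0: int, lr: float = 0.01) -> np.ndarray:
    """One online gradient step on 0.5 * (x @ theta - y)**2, then threshold."""
    grad = (x @ theta - y) * x
    return hard_threshold(theta - lr * grad, s0)
```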
[913] BEAT: Balanced Frequency Adaptive Tuning for Long-Term Time-Series Forecasting
Zhixuan Li, Naipeng Chen, Seonghwa Choi, Sanghoon Lee, Weisi Lin
Main category: cs.LG
TL;DR: BEAT (Balanced frEquency Adaptive Tuning) is a novel framework for time-series forecasting that dynamically adjusts gradient updates for different frequencies to balance learning speeds, outperforming existing methods.
Details
Motivation: Existing frequency-domain approaches train all frequencies under a unified objective, causing mismatched learning speeds (high-frequency overfitting, low-frequency underfitting).
Method: BEAT monitors each frequency’s training status (convergence, overfitting, underfitting) and adaptively adjusts gradient updates to balance learning priorities.
Result: BEAT consistently outperforms state-of-the-art methods on seven real-world datasets.
Conclusion: BEAT effectively synchronizes learning across frequencies, addressing the imbalance in existing approaches and improving forecasting performance.
Abstract: Time-series forecasting is crucial for numerous real-world applications including weather prediction and financial market modeling. While temporal-domain methods remain prevalent, frequency-domain approaches can effectively capture multi-scale periodic patterns, reduce sequence dependencies, and naturally denoise signals. However, existing approaches typically train model components for all frequencies under a unified training objective, often leading to mismatched learning speeds: high-frequency components converge faster and risk overfitting, while low-frequency components underfit due to insufficient training time. To deal with this challenge, we propose BEAT (Balanced frEquency Adaptive Tuning), a novel framework that dynamically monitors the training status for each frequency and adaptively adjusts their gradient updates. By recognizing convergence, overfitting, or underfitting for each frequency, BEAT dynamically reallocates learning priorities, moderating gradients for rapid learners and increasing those for slower ones, alleviating the tension between competing objectives across frequencies and synchronizing the overall learning process. Extensive experiments on seven real-world datasets demonstrate that BEAT consistently outperforms state-of-the-art approaches.
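A toy version of the adaptive idea: estimate each frequency band's fit from train/validation losses and damp or boost its gradient accordingly. This heuristic only illustrates "moderating gradients for rapid learners and increasing those for slower ones"; BEAT's actual monitoring criterion is not specified in the abstract.

```python
import numpy as np

def frequency_gradient_scales(train_losses, val_losses):
    """Per-band gradient multipliers: bands that look overfit (val >> train)
    get smaller steps; bands still fitting get relatively larger ones."""
    ratios = np.asarray(val_losses) / (np.asarray(train_losses) + 1e-12)
    scales = 1.0 / ratios
    return scales / scales.mean()     # keep the average step size unchanged
```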
[914] Node Duplication Improves Cold-start Link Prediction
Zhichun Guo, Tong Zhao, Yozen Liu, Kaiwen Dong, William Shiao, Mingxuan Ju, Neil Shah, Nitesh V. Chawla
Main category: cs.LG
TL;DR: NodeDup, a simple augmentation technique, improves GNNs’ link prediction performance on low-degree nodes without affecting high-degree nodes, addressing cold-start problems in applications like recommendation systems.
Details
Motivation: GNNs struggle with low-degree nodes in link prediction tasks, which is critical for cold-start problems in real-world applications like recommendation systems.
Method: NodeDup duplicates low-degree nodes and links them to their duplicates, leveraging a multi-view perspective during training.
Result: NodeDup achieves significant improvements (38.49%, 13.34%, and 6.76% on isolated, low-degree, and warm nodes, respectively) over GNNs and state-of-the-art methods.
Conclusion: NodeDup is an effective, plug-and-play solution for enhancing GNN performance on low-degree nodes with minimal computational overhead.
Abstract: Graph Neural Networks (GNNs) are prominent in graph machine learning and have shown state-of-the-art performance in Link Prediction (LP) tasks. Nonetheless, recent studies show that GNNs struggle to produce good results on low-degree nodes despite their overall strong performance. In practical applications of LP, like recommendation systems, improving performance on low-degree nodes is critical, as it amounts to tackling the cold-start problem of improving the experiences of users with few observed interactions. In this paper, we investigate improving GNNs’ LP performance on low-degree nodes while preserving their performance on high-degree nodes and propose a simple yet surprisingly effective augmentation technique called NodeDup. Specifically, NodeDup duplicates low-degree nodes and creates links between nodes and their own duplicates before following the standard supervised LP training scheme. By leveraging a "multi-view" perspective for low-degree nodes, NodeDup shows significant LP performance improvements on low-degree nodes without compromising any performance on high-degree nodes. Additionally, as a plug-and-play augmentation module, NodeDup can be easily applied to existing GNNs with very light computational cost. Extensive experiments show that NodeDup achieves 38.49%, 13.34%, and 6.76% improvements on isolated, low-degree, and warm nodes, respectively, on average across all datasets compared to GNNs and state-of-the-art cold-start methods.
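The augmentation itself is simple in spirit: copy each low-degree node and wire it to its copy before standard LP training. A plain NumPy sketch on an undirected edge list; the degree threshold is an assumed knob, and in practice each duplicate would also inherit the original node's features.

```python
import numpy as np

def node_dup(edge_index: np.ndarray, num_nodes: int, deg_thresh: int = 2):
    """NodeDup-style augmentation: duplicate low-degree nodes and add an edge
    between each such node and its duplicate. edge_index has shape (2, E)."""
    deg = np.bincount(edge_index.reshape(-1), minlength=num_nodes)
    low = np.where(deg < deg_thresh)[0]
    dups = np.arange(num_nodes, num_nodes + len(low))
    new_edges = np.stack([low, dups])             # node <-> its duplicate
    augmented = np.concatenate([edge_index, new_edges], axis=1)
    return augmented, num_nodes + len(low)
```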
[915] Shaping Sparse Rewards in Reinforcement Learning: A Semi-supervised Approach
Wenyun Li, Wenjie Huang
Main category: cs.LG
TL;DR: The paper proposes a method combining Semi-Supervised Learning (SSL) and novel data augmentation to improve reward shaping in sparse-reward environments, outperforming supervised approaches.
Details
Motivation: Sparse reward signals in real-world scenarios make learning effective reward functions challenging.
Method: Uses SSL and a novel data augmentation technique to learn from zero-reward transitions, enhancing reward shaping.
Result: Outperforms supervised methods, achieving up to twice the peak scores in sparse-reward environments and a 15.8% increase in best score with the proposed augmentation.
Conclusion: The method significantly improves reward shaping efficacy, especially in sparse-reward settings.
Abstract: In many real-world scenarios, reward signals for agents are exceedingly sparse, making it challenging to learn an effective reward function for reward shaping. To address this issue, the proposed approach in this paper performs reward shaping not only by utilizing non-zero-reward transitions but also by employing the Semi-Supervised Learning (SSL) technique combined with a novel data augmentation to learn trajectory space representations from the majority of transitions, i.e., zero-reward transitions, thereby improving the efficacy of reward shaping. Experimental results in Atari and robotic manipulation demonstrate that our method outperforms supervised-based approaches in reward inference, leading to higher agent scores. Notably, in more sparse-reward environments, our method achieves up to twice the peak scores compared to supervised baselines. The proposed double entropy data augmentation enhances performance, showcasing a 15.8% increase in best score over other augmentation methods.
[916] Ensemble learning for uncertainty estimation with application to the correction of satellite precipitation products
Georgia Papacharalampous, Hristos Tyralis, Nikolaos Doulamis, Anastasios Doulamis
Main category: cs.LG
TL;DR: The paper introduces nine quantile-based ensemble learners for improving precipitation predictions by merging remote sensing and gauge data, outperforming the reference method by 3.91% to 8.95%.
Details
Motivation: To address the unexplored potential of ensemble learning in quantile regression for spatial prediction settings, particularly in precipitation dataset creation.
Method: Employed a novel feature engineering strategy and six individual algorithms (QR, QRF, GRF, GBM, LightGBM, QRNN) within nine ensemble learners, including stacking and simple combiners.
Result: Ensemble learning with QR and QRNN performed best across quantile levels (0.025 to 0.975), outperforming the reference method by 3.91% to 8.95%.
Conclusion: Quantile-based ensemble learning significantly improves precipitation predictions, demonstrating the value of combining multiple algorithms for better accuracy.
Abstract: Predictions in the form of probability distributions are crucial for effective decision-making. Quantile regression enables such predictions within spatial prediction settings that aim to create improved precipitation datasets by merging remote sensing and gauge data. However, ensemble learning of quantile regression algorithms remains unexplored in this context and, at the same time, it has not been substantially developed so far in the broader machine learning research landscape. Here, we introduce nine quantile-based ensemble learners and address the aforementioned gap in precipitation dataset creation by presenting the first application of these learners to large precipitation datasets. We employed a novel feature engineering strategy, which reduces the number of predictors by using distance-weighted satellite precipitation at relevant locations, combined with location elevation. Our ensemble learners include six that are based on stacking ideas and three simple methods (mean, median, best combiner). Each of them combines the following six individual algorithms: quantile regression (QR), quantile regression forests (QRF), generalized random forests (GRF), gradient boosting machines (GBM), light gradient boosting machines (LightGBM), and quantile regression neural networks (QRNN). These algorithms serve as both base learners and combiners within different ensemble learning methods. We evaluated performance against a reference method (i.e., QR) using quantile scoring functions and a large dataset. The latter comprises 15 years of monthly gauge-measured and satellite precipitation in the contiguous United States (CONUS). Ensemble learning with QR and QRNN yielded the best results across the various investigated quantile levels, which range from 0.025 to 0.975, outperforming the reference method by 3.91% to 8.95%…
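Evaluation rests on quantile scoring functions, and the three simple combiners are exactly what they sound like. A small sketch of the pinball (quantile) score and a mean combiner; the function names are illustrative, not from the paper.

```python
import numpy as np

def pinball_loss(y_true: np.ndarray, y_pred: np.ndarray, q: float) -> float:
    """Quantile (pinball) score at level q; lower is better."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

def mean_combiner(base_preds: list) -> np.ndarray:
    """Simplest ensemble: average the base learners' quantile predictions."""
    return np.mean(np.stack(base_preds), axis=0)
```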
[917] Reinforcement Learning for Intensity Control: An Application to Choice-Based Network Revenue Management
Huiling Meng, Ningyuan Chen, Xuefeng Gao
Main category: cs.LG
TL;DR: The paper adapts reinforcement learning to intensity control, eliminating the need for pre-discretization of time, reducing errors, and improving computation efficiency.
Details
Motivation: Addressing the challenge of applying reinforcement learning to continuous-time problems like intensity control, which traditionally require time discretization.
Method: Utilizes jump points in sample paths for inherent discretization, develops Monte Carlo and temporal difference learning for policy evaluation, and introduces policy-gradient-based actor-critic algorithms.
Result: Demonstrates reduced discretization error and improved computational efficiency compared to benchmarks.
Conclusion: The approach successfully bridges reinforcement learning with continuous-time intensity control, offering practical benefits for applications like revenue management.
Abstract: Intensity control is a type of continuous-time dynamic optimization problems with many important applications in Operations Research including queueing and revenue management. In this study, we adapt the reinforcement learning framework to intensity control using choice-based network revenue management as a case study, which is a classical problem in revenue management that features a large state space, a large action space and a continuous time horizon. We show that by utilizing the inherent discretization of the sample paths created by the jump points, a unique and defining feature of intensity control, one does not need to discretize the time horizon in advance, which was believed to be necessary because most reinforcement learning algorithms are designed for discrete-time problems. As a result, the computation can be facilitated and the discretization error is significantly reduced. We lay the theoretical foundation for the Monte Carlo and temporal difference learning algorithms for policy evaluation and develop policy-gradient-based actor-critic algorithms for intensity control. Via a comprehensive numerical study, we demonstrate the benefit of our approach versus other state-of-the-art benchmarks.
[918] Adversarial flows: A gradient flow characterization of adversarial attacks
Lukas Weigand, Tim Roith, Martin Burger
Main category: cs.LG
TL;DR: The paper interprets adversarial attack methods (fast gradient sign method) as Euler discretizations of differential inclusions, proving convergence to gradient flows. It also explores ∞-curves of maximum slope and their application in gradient descent and adversarial training.
Details
Motivation: To provide a theoretical foundation for adversarial attack methods by linking them to gradient flows and differential inclusions, and to extend these concepts to optimization and adversarial training.
Method: The study uses the concept of p-curves of maximal slope (focusing on p=∞), proves existence of ∞-curves, and characterizes them via differential inclusions. It also applies these ideas to Wasserstein gradient flows and finite-dimensional settings.
Result: The paper shows convergence of normalized gradient descent methods to gradient flows and characterizes adversarial training objectives via ∞-curves in optimal transport spaces.
Conclusion: The work bridges adversarial attacks, gradient flows, and optimal transport, offering new insights into optimization and adversarial training.
Abstract: A popular method to perform adversarial attacks on neural networks is the so-called fast gradient sign method and its iterative variant. In this paper, we interpret this method as an explicit Euler discretization of a differential inclusion, where we also show convergence of the discretization to the associated gradient flow. To do so, we consider the concept of p-curves of maximal slope in the case $p=\infty$. We prove existence of $\infty$-curves of maximal slope and derive an alternative characterization via differential inclusions. Furthermore, we also consider Wasserstein gradient flows for potential energies, where we show that curves in the Wasserstein space can be characterized by a representing measure on the space of curves in the underlying Banach space, which fulfill the differential inclusion. The application of our theory to the finite-dimensional setting is twofold: On the one hand, we show that a whole class of normalized gradient descent methods (in particular signed gradient descent) converge, up to subsequences, to the flow, when sending the step size to zero. On the other hand, in the distributional setting, we show that the inner optimization task of the adversarial training objective can be characterized via $\infty$-curves of maximal slope on an appropriate optimal transport space.
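The object being analyzed is concrete: each iteration of the iterative fast gradient sign method is one explicit Euler step of a sign-gradient differential inclusion. A standard PyTorch rendering of textbook I-FGSM (not the paper's code); the step size, step count, and L-infinity radius are illustrative values.

```python
import torch

def iterative_fgsm(model, loss_fn, x, y, step=0.01, n_steps=10, eps=0.03):
    """Iterative FGSM: each update x += step * sign(grad) is an explicit
    Euler discretization of the associated differential inclusion."""
    x_adv = x.clone().detach()
    for _ in range(n_steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # stay in the L-inf ball
    return x_adv.detach()
```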
[919] Class-Wise Federated Averaging for Efficient Personalization
Gyuejeong Lee, Daeyoung Choi
Main category: cs.LG
TL;DR: cwFedAvg improves personalized federated learning by performing class-wise averaging and using a Weight Distribution Regularizer (WDR) for better class-specific information capture.
Details
Motivation: Existing FL methods like FedAvg struggle with heterogeneous data distributions and fail to personalize models effectively due to poor class-specific information capture.
Method: Proposes cwFedAvg, which performs Federated Averaging per class, and WDR to align class distributions for efficient class-wise aggregation.
Result: cwFedAvg with WDR outperforms existing PFL methods, achieving efficient personalization without extra costs.
Conclusion: cwFedAvg offers a scalable and effective solution for personalized FL, maintaining communication efficiency.
Abstract: Federated learning (FL) enables collaborative model training across distributed clients without centralizing data. However, existing approaches such as Federated Averaging (FedAvg) often perform poorly with heterogeneous data distributions, failing to achieve personalization owing to their inability to capture class-specific information effectively. We propose Class-wise Federated Averaging (cwFedAvg), a novel personalized FL (PFL) framework that performs Federated Averaging for each class, to overcome the personalization limitations of FedAvg. cwFedAvg creates class-specific global models via weighted aggregation of local models using class distributions, and subsequently combines them to generate personalized local models. We further propose Weight Distribution Regularizer (WDR), which encourages deep networks to encode class-specific information efficiently by aligning empirical and approximated class distributions derived from output layer weights, to facilitate effective class-wise aggregation. Our experiments demonstrate the superior performance of cwFedAvg with WDR over existing PFL methods through efficient personalization while maintaining the communication cost of FedAvg and avoiding additional local training and pairwise computations.
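The aggregation rule can be sketched directly: build one global model per class by weighting each client's parameters with its class proportions, then recombine them per client. A simplified NumPy illustration on flattened parameter vectors, under the assumption that client class distributions are known; actual cwFedAvg operates on network weights and adds the WDR regularizer.

```python
import numpy as np

def cw_fed_avg(client_params: list, class_dists: np.ndarray) -> np.ndarray:
    """Class-wise averaging sketch. class_dists is (n_clients, n_classes) and
    row-stochastic; returns one aggregated parameter vector per class."""
    W = np.stack(client_params)                    # (n_clients, n_params)
    col = class_dists / class_dists.sum(axis=0)    # normalize weights per class
    return col.T @ W                               # (n_classes, n_params)

def personalize(class_globals: np.ndarray, my_dist: np.ndarray) -> np.ndarray:
    """Personalized model: class-distribution-weighted mix of class globals."""
    return my_dist @ class_globals
```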
[920] ME-IGM: Individual-Global-Max in Maximum Entropy Multi-Agent Reinforcement Learning
Wen-Tse Chen, Yuxuan Li, Shiyu Huang, Jiayu Chen, Jeff Schneider
Main category: cs.LG
TL;DR: The paper addresses misalignment in maximum entropy MARL methods by proposing ME-IGM, an algorithm ensuring IGM condition compatibility while enhancing exploration.
Details
Motivation: Existing maximum entropy MARL methods violate the IGM condition due to misalignment between local and joint policies.
Method: Introduces an order-preserving transformation and ME-IGM algorithm, compatible with IGM-based credit assignment.
Result: ME-IGM variants (ME-QMIX, ME-QPLEX) achieve state-of-the-art performance in non-monotonic games and SMAC-v2/Overcooked scenarios.
Conclusion: ME-IGM effectively aligns local and joint policies, maintaining IGM compliance and improving exploration in MARL.
Abstract: Multi-agent credit assignment is a fundamental challenge for cooperative multi-agent reinforcement learning (MARL), where a team of agents learn from shared reward signals. The Individual-Global-Max (IGM) condition is a widely used principle for multi-agent credit assignment, requiring that the joint action determined by individual Q-functions maximizes the global Q-value. Meanwhile, the principle of maximum entropy has been leveraged to enhance exploration in MARL. However, we identify a critical limitation in existing maximum entropy MARL methods: a misalignment arises between local policies and the joint policy that maximizes the global Q-value, leading to violations of the IGM condition. To address this misalignment, we propose an order-preserving transformation. Building on it, we introduce ME-IGM, a novel maximum entropy MARL algorithm compatible with any credit assignment mechanism that satisfies the IGM condition while enjoying the benefits of maximum entropy exploration. We empirically evaluate two variants of ME-IGM: ME-QMIX and ME-QPLEX, in non-monotonic matrix games, and demonstrate their state-of-the-art performance across 17 scenarios in SMAC-v2 and Overcooked.
[921] HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting?
Ivan Karpukhin, Foma Shipilov, Andrey Savchenko
Main category: cs.LG
TL;DR: HoTPP is introduced as the first benchmark for long-horizon event forecasting, addressing gaps in existing MTPP research. It critiques current metrics, proposes T-mAP, and finds modern methods often underperform simple baselines, with mode collapse being common.
Details
Motivation: Existing MTPP research focuses on next-event prediction, leaving long-horizon forecasting underexplored. HoTPP aims to fill this gap and improve evaluation rigor.
Method: HoTPP introduces a benchmark with a new T-mAP metric, evaluates statistical baselines, and analyzes autoregression and intensity-based losses.
Result: Modern MTPP methods often underperform simple baselines, and mode collapse is prevalent. The T-mAP metric provides better evaluation.
Conclusion: HoTPP highlights limitations in current MTPP approaches, proposes improvements, and suggests future research directions. Code and results are publicly available.
Abstract: Forecasting multiple future events within a given time horizon is essential for applications in finance, retail, social networks, and healthcare. Marked Temporal Point Processes (MTPP) provide a principled framework to model both the timing and labels of events. However, most existing research focuses on predicting only the next event, leaving long-horizon forecasting largely underexplored. To address this gap, we introduce HoTPP, the first benchmark specifically designed to rigorously evaluate long-horizon predictions. We identify shortcomings in widely used evaluation metrics, propose a theoretically grounded T-mAP metric, present strong statistical baselines, and offer efficient implementations of popular models. Our empirical results demonstrate that modern MTPP approaches often underperform simple statistical baselines. Furthermore, we analyze the diversity of predicted sequences and find that most methods exhibit mode collapse. Finally, we analyze the impact of autoregression and intensity-based losses on prediction quality, and outline promising directions for future research. The HoTPP source code, hyperparameters, and full evaluation results are available at GitHub.
[922] Raising the Bar in Graph OOD Generalization: Invariant Learning Beyond Explicit Environment Modeling
Xu Shen, Yixin Liu, Yili Wang, Rui Miao, Yiwei Dai, Shirui Pan, Yi Chang, Xin Wang
Main category: cs.LG
TL;DR: The paper introduces MPHIL, a novel method for graph invariant learning (GIL) to improve out-of-distribution (OOD) generalization in graph data by addressing challenges like diverse environments and semantic cliffs.
Details
Motivation: Real-world graph data often exhibit diverse and shifting environments, making OOD generalization difficult for traditional models. Existing GIL methods struggle with capturing diverse environments and distinguishing invariant subgraphs across classes.
Method: MPHIL uses hyperspherical invariant representation extraction and multi-prototype hyperspherical classification. It introduces two new objective functions: invariant prototype matching loss and prototype separation loss.
Result: MPHIL achieves state-of-the-art performance on 11 OOD benchmark datasets, outperforming existing methods across various domains and distribution shifts.
Conclusion: MPHIL effectively addresses key challenges in GIL, offering a robust solution for OOD generalization in graph learning.
Abstract: Out-of-distribution (OOD) generalization has emerged as a critical challenge in graph learning, as real-world graph data often exhibit diverse and shifting environments that traditional models fail to generalize across. A promising solution to address this issue is graph invariant learning (GIL), which aims to learn invariant representations by disentangling label-correlated invariant subgraphs from environment-specific subgraphs. However, existing GIL methods face two major challenges: (1) the difficulty of capturing and modeling diverse environments in graph data, and (2) the semantic cliff, where invariant subgraphs from different classes are difficult to distinguish, leading to poor class separability and increased misclassifications. To tackle these challenges, we propose a novel method termed Multi-Prototype Hyperspherical Invariant Learning (MPHIL), which introduces two key innovations: (1) hyperspherical invariant representation extraction, enabling robust and highly discriminative hyperspherical invariant feature extraction, and (2) multi-prototype hyperspherical classification, which employs class prototypes as intermediate variables to eliminate the need for explicit environment modeling in GIL and mitigate the semantic cliff issue. Derived from the theoretical framework of GIL, we introduce two novel objective functions: the invariant prototype matching loss to ensure samples are matched to the correct class prototypes, and the prototype separation loss to increase the distinction between prototypes of different classes in the hyperspherical space. Extensive experiments on 11 OOD generalization benchmark datasets demonstrate that MPHIL achieves state-of-the-art performance, significantly outperforming existing methods across graph data from various domains and with different distribution shifts.
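The two objectives have natural cosine-similarity renderings on the unit hypersphere; the sketch below is one plausible form under that reading (loss shapes, temperature, and names are assumptions, not the published definitions).

```python
import torch
import torch.nn.functional as F

def prototype_losses(z, prototypes, labels, tau=0.1):
    """Illustrative hyperspherical prototype losses. z: (N, d) embeddings,
    prototypes: (C, d) class prototypes, labels: (N,) class indices."""
    z = F.normalize(z, dim=-1)                 # project onto the hypersphere
    p = F.normalize(prototypes, dim=-1)
    # Matching: pull each sample toward its own class prototype.
    match_loss = F.cross_entropy(z @ p.t() / tau, labels)
    # Separation: penalize the closest *other* prototype for each prototype.
    sim = p @ p.t()
    sim.fill_diagonal_(-1.0)                   # ignore self-similarity
    sep_loss = sim.max(dim=1).values.mean()
    return match_loss, sep_loss

z = torch.randn(8, 16)
protos = torch.randn(3, 16)
labels = torch.randint(0, 3, (8,))
print(prototype_losses(z, protos, labels))
```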
[923] Integrating Generative AI with Network Digital Twins for Enhanced Network Operations
Kassi Muhammad, Teef David, Giulia Nassisid, Tina Farus
Main category: cs.LG
TL;DR: The paper explores combining network digital twins and generative AI (GANs and VAEs) to improve telecom network operations, proposing a framework for predictive maintenance, simulation, and decision-making.
Details
Motivation: To address the complexity of modern telecom networks by leveraging digital twins and generative AI for better resilience and operational efficiency.
Method: Proposes a novel architectural framework integrating network digital twins with generative AI (GANs and VAEs), validated through simulations.
Result: Demonstrates improved accuracy and efficiency in predictive maintenance, scenario simulation, and anomaly detection.
Conclusion: The integration enhances network management, making it more adaptive and intelligent.
Abstract: As telecommunications networks become increasingly complex, the integration of advanced technologies such as network digital twins and generative artificial intelligence (AI) emerges as a pivotal solution to enhance network operations and resilience. This paper explores the synergy between network digital twins, which provide a dynamic virtual representation of physical networks, and generative AI, particularly focusing on Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). We propose a novel architectural framework that incorporates these technologies to significantly improve predictive maintenance, network scenario simulation, and real-time data-driven decision-making. Through extensive simulations, we demonstrate how generative AI can enhance the accuracy and operational efficiency of network digital twins, effectively handling real-world complexities such as unpredictable traffic loads and network failures. The findings suggest that this integration not only boosts the capability of digital twins in scenario forecasting and anomaly detection but also facilitates a more adaptive and intelligent network management system.
[924] HiPPO-Prophecy: State-Space Models can Provably Learn Dynamical Systems in Context
Federico Arangath Joseph, Kilian Konstantin Haefeli, Noah Liniger, Caglar Gulcehre
Main category: cs.LG
TL;DR: The paper explores in-context learning in State Space Models (SSMs), providing the first theoretical explanation and a novel weight construction for predicting system states without fine-tuning.
Details
Motivation: To understand and explain the underlying mechanism of in-context learning in SSMs, advancing theoretical foundations.
Method: Extends the HiPPO framework to show continuous SSMs can approximate input signal derivatives, with explicit weight construction and error bounds. Discretization yields a predictive discrete SSM.
Result: Empirical demonstration of the effectiveness of the proposed parameterization.
Conclusion: This work is a foundational step toward understanding SSM-based sequence models’ in-context learning capabilities.
Abstract: This work explores the in-context learning capabilities of State Space Models (SSMs) and presents, to the best of our knowledge, the first theoretical explanation of a possible underlying mechanism. We introduce a novel weight construction for SSMs, enabling them to predict the next state of any dynamical system after observing previous states without parameter fine-tuning. This is accomplished by extending the HiPPO framework to demonstrate that continuous SSMs can approximate the derivative of any input signal. Specifically, we find an explicit weight construction for continuous SSMs and provide an asymptotic error bound on the derivative approximation. The discretization of this continuous SSM subsequently yields a discrete SSM that predicts the next state. Finally, we demonstrate the effectiveness of our parameterization empirically. This work should be an initial step toward understanding how sequence models based on SSMs learn in context.
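The prediction recipe in the abstract can be restated compactly (symbols chosen here for exposition; the paper supplies the explicit construction and error bounds). If a continuous SSM with state $x(t)$, driven by the observed signal $u(t)$,
\[ \dot{x}(t) = A\,x(t) + B\,u(t), \qquad \hat{u}'(t) = C\,x(t), \]
is constructed so that its readout approximates the input's derivative, then a first-order extrapolation over a step $\Delta t$ yields the next-state prediction that discretization realizes:
\[ \hat{u}(t+\Delta t) \;\approx\; u(t) + \Delta t\,\hat{u}'(t) \;=\; u(t) + \Delta t\,C\,x(t). \]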
[925] Type 1 Diabetes Management using GLIMMER: Glucose Level Indicator Model with Modified Error Rate
Saman Khamesian, Asiful Arefeen, Maria Adela Grando, Bithika M. Thompson, Hassan Ghasemzadeh
Main category: cs.LG
TL;DR: GLIMMER, a machine learning model for glucose prediction in T1D, improves accuracy in dysglycemic regions, outperforming existing models by 23-31% in error metrics.
Details
Motivation: Current AID systems lack accuracy in predicting dysglycemia, risking patient safety. Advanced forecasting methods are needed.
Method: GLIMMER uses a custom loss function to prioritize accuracy in critical glucose ranges, evaluated on public and new datasets.
Result: GLIMMER achieved RMSE of 23.97 and MAE of 15.83 mg/dL, outperforming prior models by 23% and 31%.
Conclusion: GLIMMER offers a promising solution for better T1D management by enhancing glucose prediction accuracy.
Abstract: Managing Type 1 Diabetes (T1D) demands constant vigilance as individuals strive to regulate their blood glucose levels to avoid the harmful effects of dysglycemia, including both hyperglycemia and hypoglycemia. Despite the development of advanced technologies such as automated insulin delivery (AID) systems, achieving optimal glycemic control remains challenging. AID systems combine continuous subcutaneous insulin infusion with data from continuous glucose monitors (CGMs), offering potential benefits in reducing glucose variability and increasing time-in-range. However, these systems still frequently fail to prevent dysglycemia, partly due to limitations in their prediction algorithms, which lack the accuracy needed to avert abnormal glucose events. This shortcoming highlights the need for more advanced glucose forecasting methods. To address this need, we introduce GLIMMER, Glucose Level Indicator Model with Modified Error Rate, a machine learning-based model for predicting blood glucose levels. GLIMMER classifies glucose values into normal and abnormal ranges and employs a novel custom loss function that prioritizes accuracy in dysglycemic regions, where patient safety is most critical. To evaluate GLIMMER’s effectiveness for T1D management, we used both a publicly available dataset and a newly collected dataset involving 25 individuals with T1D. In forecasting glucose levels for the next hour, GLIMMER achieved a root mean square error (RMSE) of 23.97 (+/-3.77) and a mean absolute error (MAE) of 15.83 (+/-2.09) mg/dL. These results represent a 23% improvement in RMSE and a 31% improvement in MAE compared to the best previously reported models.
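A minimal sketch of the idea behind the custom loss: up-weight errors whenever the true glucose value lies in a dysglycemic range. The 70-180 mg/dL band is the standard clinical convention for normal glucose, but the weight value and exact loss shape here are assumptions, not GLIMMER's published formulation.

```python
import torch

def dysglycemia_weighted_mse(pred, target, low=70.0, high=180.0, w_abn=4.0):
    """Squared error, up-weighted where the true value is hypo-/hyperglycemic."""
    abnormal = (target < low) | (target > high)
    weights = torch.where(abnormal, torch.full_like(target, w_abn),
                          torch.ones_like(target))
    return (weights * (pred - target) ** 2).mean()

pred = torch.tensor([60.0, 120.0, 250.0])     # model forecasts (mg/dL)
true = torch.tensor([55.0, 110.0, 260.0])     # CGM ground truth (mg/dL)
print(dysglycemia_weighted_mse(pred, true))   # hypo/hyper errors count 4x
```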
[926] GINO-Q: Learning an Asymptotically Optimal Index Policy for Restless Multi-armed Bandits
Gongpu Chen, Soung Chang Liew, Deniz Gunduz
Main category: cs.LG
TL;DR: GINO-Q is a scalable algorithm for RMABs, addressing dimensionality and non-indexability issues, outperforming existing methods.
Details
Motivation: Traditional reinforcement learning methods fail for large-scale RMABs due to exponential state and combinatorial action spaces.
Method: GINO-Q decomposes RMABs into subproblems, using a three-timescale stochastic approximation to learn asymptotically optimal index policies.
Result: GINO-Q learns near-optimal policies, even for non-indexable RMABs, and converges faster than baselines.
Conclusion: GINO-Q is a flexible, efficient solution for RMABs, especially where indexability is not guaranteed.
Abstract: The restless multi-armed bandit (RMAB) framework is a popular model with applications across a wide variety of fields. However, its solution is hindered by the exponentially growing state space (with respect to the number of arms) and the combinatorial action space, making traditional reinforcement learning methods infeasible for large-scale instances. In this paper, we propose GINO-Q, a three-timescale stochastic approximation algorithm designed to learn an asymptotically optimal index policy for RMABs. GINO-Q mitigates the curse of dimensionality by decomposing the RMAB into a series of subproblems, each with the same dimension as a single arm, ensuring that complexity increases linearly with the number of arms. Unlike recently developed Whittle-index-based algorithms, GINO-Q does not require RMABs to be indexable, enhancing its flexibility and applicability. Our experimental results demonstrate that GINO-Q consistently learns near-optimal policies, even for non-indexable RMABs where Whittle-index-based algorithms perform poorly, and it converges significantly faster than existing baselines.
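For readers unfamiliar with index policies: once indices are learned, acting is simple, which is what makes the decomposition scale. The sketch below shows only the execution side (score each arm by its index at its current state, activate the top-budget arms); learning the index table via three-timescale stochastic approximation is the paper's contribution and is not reproduced here.

```python
import numpy as np

def act_with_indices(index_table, arm_states, budget):
    """Execute an index policy: activate the `budget` arms whose current
    states have the highest learned indices."""
    scores = np.array([index_table[a, s] for a, s in enumerate(arm_states)])
    action = np.zeros(len(arm_states), dtype=int)
    action[np.argsort(scores)[-budget:]] = 1
    return action

rng = np.random.default_rng(1)
table = rng.normal(size=(5, 3))      # stands in for learned per-arm indices
print(act_with_indices(table, arm_states=[0, 2, 1, 1, 0], budget=2))
```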
[927] Out-of-Distribution Detection: A Task-Oriented Survey of Recent Advances
Shuo Lu, Yingsheng Wang, Lijun Sheng, Lingxiao He, Aihua Zheng, Jian Liang
Main category: cs.LG
TL;DR: The paper surveys OOD detection, focusing on task-oriented perspectives, classifying methods as training-driven or training-agnostic, and highlighting pre-trained models. It also discusses evaluation, applications, and future directions.
Details
Motivation: Existing reviews focus on method taxonomy, but recent works explore non-traditional scenarios. This survey aims to provide a task-oriented perspective.
Method: Classifies OOD detection methods based on user access to the model (training-driven or training-agnostic) and includes pre-trained models as a separate category.
Result: A new taxonomy is proposed, with discussions on evaluation, applications, and future research directions.
Conclusion: The survey benefits new method proposals and practical scenario expansions, with a curated list of papers provided.
Abstract: Out-of-distribution (OOD) detection aims to detect test samples outside the training category space, which is an essential component in building reliable machine learning systems. Existing reviews on OOD detection primarily focus on method taxonomy, surveying the field by categorizing various approaches. However, many recent works concentrate on non-traditional OOD detection scenarios, such as test-time adaptation, multi-modal data sources and other novel contexts. In this survey, we uniquely review recent advances in OOD detection from the task-oriented perspective for the first time. According to the user’s access to the model, that is, whether the OOD detection method is allowed to modify or retrain the model, we classify the methods as training-driven or training-agnostic. Besides, considering the rapid development of pre-trained models, large pre-trained model-based OOD detection is also regarded as an important category and discussed separately. Furthermore, we provide a discussion of the evaluation scenarios, a variety of applications, and several future research directions. We believe this survey with new taxonomy will benefit the proposal of new methods and the expansion of more practical scenarios. A curated list of related papers is provided in the Github repository: https://github.com/shuolucs/Awesome-Out-Of-Distribution-Detection.
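As a concrete instance of the survey's training-agnostic category (methods that score a frozen model's outputs without retraining), the classic maximum-softmax-probability baseline is shown below; it is a well-known baseline in this literature, not a method proposed by the survey.

```python
import torch
import torch.nn.functional as F

def msp_ood_score(logits):
    """Maximum softmax probability: lower values suggest the input is
    more likely out-of-distribution. Requires no model modification."""
    return F.softmax(logits, dim=-1).max(dim=-1).values

logits = torch.tensor([[4.0, 0.1, 0.2],    # confident -> likely in-distribution
                       [0.9, 1.0, 1.1]])   # diffuse   -> possibly OOD
scores = msp_ood_score(logits)
print(scores, scores < 0.5)                # threshold is deployment-specific
```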
[928] AdapFair: Ensuring Adaptive Fairness for Machine Learning Operations
Yinghui Huang, Zihao Tang, Xiangyu Chang
Main category: cs.LG
TL;DR: An adaptive debiasing framework for machine learning that optimizes fairness while preserving data predictability, using normalizing flows and Wasserstein distance for efficient, scalable solutions.
Details
Motivation: Addressing the limitations of existing fairness algorithms in dynamic conditions, evolving requirements, and black-box classifiers.
Method: Leverages normalizing flows for data transformation and Wasserstein distance for fairness optimization, with an efficient gradient-based algorithm.
Result: A flexible, scalable framework that ensures fairness with minimal retraining, even under data drifts and evolving tasks.
Conclusion: The proposed framework effectively balances fairness and predictability, offering a practical solution for real-world ML applications.
Abstract: The biases and discrimination of machine learning algorithms have attracted significant attention, leading to the development of various algorithms tailored to specific contexts. However, these solutions often fall short of addressing fairness issues inherent in machine learning operations. In this paper, we present an adaptive debiasing framework designed to find an optimal fair transformation of input data that maximally preserves data predictability under dynamic conditions. A distinctive feature of our approach is its flexibility and efficiency. It can be integrated with pretrained black-box classifiers, providing fairness guarantees with minimal retraining efforts, even in the face of frequent data drifts, evolving fairness requirements, and batches of similar tasks. To achieve this, we leverage normalizing flows to enable efficient, information-preserving data transformation, ensuring that no critical information is lost during the debiasing process. Additionally, we incorporate the Wasserstein distance as the fairness measure to guide the optimization of data transformations. Finally, we introduce an efficient optimization algorithm with closed-form gradient computations, making our framework scalable and suitable for dynamic, real-world environments.
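The fairness measure named above is straightforward to compute for one-dimensional classifier scores; the sketch below uses SciPy's 1-D Wasserstein distance between two demographic groups' score distributions. How the framework folds this measure into the flow-based transformation and its closed-form gradients is not reproduced here.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
scores_group_a = rng.beta(2.0, 5.0, size=1000)   # classifier scores, group A
scores_group_b = rng.beta(2.5, 5.0, size=1000)   # classifier scores, group B

gap = wasserstein_distance(scores_group_a, scores_group_b)
print(f"W1 fairness gap: {gap:.4f}")   # 0 would mean identical distributions
```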
[929] FARM: Functional Group-Aware Representations for Small Molecules
Thao Nguyen, Kuan-Hao Huang, Ge Liu, Martin D. Burke, Ying Diao, Heng Ji
Main category: cs.LG
TL;DR: FARM introduces functional group-aware tokenization for SMILES, combining masked language modeling and graph neural networks to enhance molecular representation learning, achieving state-of-the-art results.
Details
Motivation: To bridge the gap between SMILES, natural language, and molecular graphs by enriching SMILES with detailed chemical context.
Method: Uses functional group-aware tokenization, masked language modeling for atom-level features, and graph neural networks for topology, aligned via contrastive learning.
Result: Achieves state-of-the-art performance on 11 out of 13 tasks in the MoleculeNet dataset.
Conclusion: FARM improves molecular representation learning and shows strong transfer learning potential for drug discovery and pharmaceutical research.
Abstract: We introduce Functional Group-Aware Representations for Small Molecules (FARM), a novel foundation model designed to bridge the gap between SMILES, natural language, and molecular graphs. The key innovation of FARM lies in its functional group-aware tokenization, which directly incorporates functional group information into SMILES, enriching SMILES with detailed chemical context. For example, instead of using “O” to represent all oxygen atoms, we use specific tokens like “O_ketone” and “O_hydroxyl” to differentiate oxygen atoms belonging to distinct functional groups. This tokenization expands the chemical lexicon, effectively bridging the gap between SMILES and natural language in terms of vocabulary size, ultimately enhancing the model’s ability to predict molecular properties. FARM also represents molecules from two perspectives: by (1) using masked language modeling to capture atom-level features and (2) employing graph neural networks to encode the whole molecule topology. FARM leverages contrastive learning to align these two views of representations into a unified molecular embedding. We rigorously evaluate FARM on the MoleculeNet dataset, where it achieves state-of-the-art performance on 11 out of 13 tasks. These results highlight FARM’s potential to improve molecular representation learning and demonstrate its strong transfer learning capabilities, paving the way for promising applications in drug discovery and pharmaceutical research.
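A toy illustration of the tokenization idea using the abstract's own O_ketone / O_hydroxyl example: the same atom symbol becomes a different token depending on its functional group. The string rules below are deliberately crude stand-ins; a real pipeline would detect functional groups with a cheminformatics toolkit rather than character heuristics.

```python
def toy_tokenize(smiles):
    """Crude functional-group-aware tokenization of a SMILES string."""
    tokens = []
    for i, ch in enumerate(smiles):
        if ch == "O" and i > 0 and smiles[i - 1] == "=":
            tokens.append("O_ketone")      # "=O" treated as a carbonyl oxygen
        elif ch == "O" and (i + 1 == len(smiles) or not smiles[i + 1].isalpha()):
            tokens.append("O_hydroxyl")    # terminal O treated as hydroxyl
        else:
            tokens.append(ch)
    return tokens

print(toy_tokenize("CC(=O)O"))
# ['C', 'C', '(', '=', 'O_ketone', ')', 'O_hydroxyl']  (acetic acid)
```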
[930] Time to Retrain? Detecting Concept Drifts in Machine Learning Systems
Tri Minh Triet Pham, Karthikeyan Premkumar, Mohamed Naili, Jinqiu Yang
Main category: cs.LG
TL;DR: The paper explores concept drift detection in ML systems, proposing CDSeer, a model-agnostic technique that outperforms SOTA methods with less labeling effort.
Details
Motivation: ML models degrade due to concept drift; current SOTA methods require heavy labeling and are model-specific.
Method: Proposes CDSeer, a model-agnostic drift detection technique, evaluated on synthetic and real-world datasets.
Result: CDSeer improves precision by 57.1% with 99% fewer labels compared to SOTA methods, matching supervised methods.
Conclusion: CDSeer enhances ML reliability by reducing labeling effort and improving drift detection performance.
Abstract: With the boom of machine learning (ML) techniques, software practitioners build ML systems to process the massive volume of streaming data for diverse software engineering tasks such as failure prediction in AIOps. Trained using historical data, such ML models encounter performance degradation caused by concept drift, i.e., data and inter-relationship (concept) changes between training and production. It is essential to use concept drift detection to monitor the deployed ML models and re-train the ML models when needed. In this work, we explore applying state-of-the-art (SOTA) concept drift detection techniques on synthetic and real-world datasets in an industrial setting. Such an industrial setting requires minimal manual effort in labeling and maximal generality in ML model architecture. We find that current SOTA semi-supervised methods not only require significant labeling effort but also only work for certain types of ML models. To overcome such limitations, we propose a novel model-agnostic technique (CDSeer) for detecting concept drift. Our evaluation shows that CDSeer has better precision and recall compared to the state-of-the-art while requiring significantly less manual labeling. We demonstrate the effectiveness of CDSeer at concept drift detection by evaluating it on eight datasets from different domains and use cases. Results from internal deployment of CDSeer on an industrial proprietary dataset show a 57.1% improvement in precision while using 99% fewer labels compared to the SOTA concept drift detection method. The performance is also comparable to the supervised concept drift detection method, which requires 100% of the data to be labeled. The improved performance and ease of adoption of CDSeer are valuable in making ML systems more reliable.
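For orientation, a generic model-agnostic drift check is sketched below: compare the model's prediction-score distribution on a reference window against a recent production window with a two-sample KS test. This is a textbook illustration of the monitoring loop, not CDSeer's algorithm.

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_alert(reference_scores, recent_scores, alpha=0.01):
    """Flag drift when recent prediction scores no longer look like the
    reference distribution (two-sample Kolmogorov-Smirnov test)."""
    stat, p_value = ks_2samp(reference_scores, recent_scores)
    return p_value < alpha, stat, p_value

rng = np.random.default_rng(0)
ref = rng.normal(0.30, 0.1, size=500)       # scores at deployment time
recent = rng.normal(0.45, 0.1, size=500)    # scores after the data shifted
print(drift_alert(ref, recent))             # -> (True, ...): time to retrain
```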
[931] Hierarchical Structure Sharing Empowers Multi-task Heterogeneous GNNs for Customer Expansion
Xinyue Feng, Shuxin Zhong, Jinquan Hang, Wenjun Lyu, Yuequn Zhang, Guang Yang, Haotian Wang, Desheng Zhang, Guang Wang
Main category: cs.LG
TL;DR: The paper introduces StrucHIS, a framework for logistics customer expansion that addresses label sparsity and structural pattern discrimination in multi-task learning on heterogeneous graphs.
Details
Motivation: Customer expansion is vital for logistics companies, but existing methods struggle with label sparsity and fail to distinguish task-shared and task-specific structural patterns in heterogeneous graphs.
Method: Proposes StrucHIS, a structure-aware hierarchical information sharing framework that regulates structural information sharing across tasks in multiple stages.
Result: Achieves 51.41% average precision improvement on a private dataset and 10.52% macro F1 gain on a public dataset. Deployment shows a 41.67% improvement in contract-signing rates.
Conclusion: StrucHIS effectively mitigates structural pattern issues and enhances performance in logistics customer expansion, demonstrating practical impact.
Abstract: Customer expansion, i.e., growing a business's existing customer base by acquiring new customers, is critical for scaling operations and sustaining the long-term profitability of logistics companies. Although state-of-the-art works model this task as a single-node classification problem under a heterogeneous graph learning framework and achieve good performance, they struggle with extreme positive-label sparsity in our scenario. Multi-task learning (MTL) offers a promising solution by introducing a correlated, label-rich task to enhance the label-sparse task prediction through knowledge sharing. However, existing MTL methods result in performance degradation because they fail to discriminate task-shared and task-specific structural patterns across tasks. This issue arises from their limited consideration of the inherently complex structure learning process of heterogeneous graph neural networks, which involves the multi-layer aggregation of multi-type relations. To address the challenge, we propose a Structure-Aware Hierarchical Information Sharing Framework (StrucHIS), which explicitly regulates structural information sharing across tasks in logistics customer expansion. StrucHIS breaks down the structure learning phase into multiple stages and introduces sharing mechanisms at each stage, effectively mitigating the influence of task-specific structural patterns during each stage. We evaluate StrucHIS on both private and public datasets, achieving a 51.41% average precision improvement on the private dataset and a 10.52% macro F1 gain on the public dataset. StrucHIS is further deployed at one of the largest logistics companies in China and demonstrates a 41.67% improvement in the contract-signing success rate over existing strategies, generating over 453K new orders within just two months.
[932] Machines and Mathematical Mutations: Using GNNs to Characterize Quiver Mutation Classes
Jesse He, Helen Jenne, Herman Chau, Davis Brown, Mark Raugas, Sara Billey, Henry Kvinge
Main category: cs.LG
TL;DR: The paper uses graph neural networks to study quiver mutation in cluster algebras, discovering mutation equivalence criteria for quivers of type $\tilde{D}$ and demonstrating the model’s ability to learn abstract mathematical rules.
Details
Motivation: To leverage machine learning for identifying patterns in quiver mutation, a key operation in cluster algebras with broad mathematical and physical implications.
Method: Graph neural networks and AI explainability techniques are employed to analyze quiver mutation and discover equivalence criteria.
Result: The model independently discovers mutation equivalence criteria for quivers of type $\tilde{D}$ and captures structure in its hidden representation that allows known type-$D$ criteria to be reconstructed.
Conclusion: Machine learning models can learn abstract mathematical rules, as shown by the model’s ability to discover and reconstruct quiver mutation criteria.
Abstract: Machine learning is becoming an increasingly valuable tool in mathematics, enabling one to identify subtle patterns across collections of examples so vast that they would be impossible for a single researcher to feasibly review and analyze. In this work, we use graph neural networks to investigate \emph{quiver mutation} – an operation that transforms one quiver (or directed multigraph) into another – which is central to the theory of cluster algebras with deep connections to geometry, topology, and physics. In the study of cluster algebras, the question of \emph{mutation equivalence} is of fundamental concern: given two quivers, can one efficiently determine if one quiver can be transformed into the other through a sequence of mutations? In this paper, we use graph neural networks and AI explainability techniques to independently discover mutation equivalence criteria for quivers of type $\tilde{D}$. Along the way, we also show that even without explicit training to do so, our model captures structure within its hidden representation that allows us to reconstruct known criteria from type $D$, adding to the growing evidence that modern machine learning models are capable of learning abstract and parsimonious rules from mathematical data.
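For readers unfamiliar with the operation the networks are trained on: on the skew-symmetric exchange matrix $B$ of a quiver, mutation at vertex $k$ follows the standard Fomin-Zelevinsky rule, implemented directly below. Mutation is an involution, which the toy check verifies.

```python
import numpy as np

def mutate(B, k):
    """Quiver mutation at vertex k on the exchange matrix B: entries in
    row/column k flip sign; every other entry b_ij gets the correction
    (|b_ik| b_kj + b_ik |b_kj|) / 2 (always an integer)."""
    B = np.asarray(B, dtype=int)
    Bp = B.copy()
    n = B.shape[0]
    for i in range(n):
        for j in range(n):
            if i == k or j == k:
                Bp[i, j] = -B[i, j]
            else:
                Bp[i, j] = B[i, j] + (abs(B[i, k]) * B[k, j]
                                      + B[i, k] * abs(B[k, j])) // 2
    return Bp

# A3 quiver 1 -> 2 -> 3; mutating twice at the same vertex is the identity.
B = np.array([[0, 1, 0], [-1, 0, 1], [0, -1, 0]])
assert (mutate(mutate(B, 1), 1) == B).all()
print(mutate(B, 1))
```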
[933] DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA
Aman Patel, Arpita Singhal, Austin Wang, Anusri Pampari, Maya Kasowski, Anshul Kundaje
Main category: cs.LG
TL;DR: The paper introduces DART-Eval, a benchmark suite for evaluating DNA language models (DNALMs) on regulatory DNA tasks, finding current models inconsistent and computationally costly.
Details
Motivation: Existing benchmarks fail to assess DNALMs on key regulatory DNA tasks, prompting the need for specialized evaluation tools like DART-Eval.
Method: DART-Eval benchmarks evaluate DNALMs on zero-shot, probed, and fine-tuned scenarios for tasks like regulatory activity prediction and variant impact analysis.
Result: Current DNALMs show inconsistent performance and no significant advantage over baseline models, despite higher computational demands.
Conclusion: The study highlights the need for improved modeling, data curation, and evaluation strategies for future DNALMs.
Abstract: Recent advances in self-supervised models for natural language, vision, and protein sequences have inspired the development of large genomic DNA language models (DNALMs). These models aim to learn generalizable representations of diverse DNA elements, potentially enabling various genomic prediction, interpretation and design tasks. Despite their potential, existing benchmarks do not adequately assess the capabilities of DNALMs on key downstream applications involving an important class of non-coding DNA elements critical for regulating gene activity. In this study, we introduce DART-Eval, a suite of representative benchmarks specifically focused on regulatory DNA to evaluate model performance across zero-shot, probed, and fine-tuned scenarios against contemporary ab initio models as baselines. Our benchmarks target biologically meaningful downstream tasks such as functional sequence feature discovery, predicting cell-type specific regulatory activity, and counterfactual prediction of the impacts of genetic variants. We find that current DNALMs exhibit inconsistent performance and do not offer compelling gains over alternative baseline models for most tasks, while requiring significantly more computational resources. We discuss potentially promising modeling, data curation, and evaluation strategies for the next generation of DNALMs. Our code is available at https://github.com/kundajelab/DART-Eval.
[934] Stochastic Control for Fine-tuning Diffusion Models: Optimality, Regularity, and Convergence
Yinbin Han, Meisam Razaviyayn, Renyuan Xu
Main category: cs.LG
TL;DR: A stochastic control framework is proposed for fine-tuning diffusion models, integrating linear dynamics control with Kullback-Leibler regularization, ensuring global convergence and regularity.
Details
Motivation: Fine-tuning large diffusion models for specific tasks and preferences lacks theoretical grounding, despite empirical advances.
Method: Proposes a stochastic control framework with policy iteration (PI-FT), leveraging pre-trained diffusion models and Kullback-Leibler regularization.
Result: PI-FT achieves global linear convergence, maintaining regularity in control and value sequences.
Conclusion: The framework bridges the gap between empirical and theoretical understanding, offering a robust method for fine-tuning diffusion models.
Abstract: Diffusion models have emerged as powerful tools for generative modeling, demonstrating exceptional capability in capturing target data distributions from large datasets. However, fine-tuning these massive models for specific downstream tasks, constraints, and human preferences remains a critical challenge. While recent advances have leveraged reinforcement learning algorithms to tackle this problem, much of the progress has been empirical, with limited theoretical understanding. To bridge this gap, we propose a stochastic control framework for fine-tuning diffusion models. Building on denoising diffusion probabilistic models as the pre-trained reference dynamics, our approach integrates linear dynamics control with Kullback-Leibler regularization. We establish the well-posedness and regularity of the stochastic control problem and develop a policy iteration algorithm (PI-FT) for numerical solution. We show that PI-FT achieves global convergence at a linear rate. Unlike existing work that assumes regularities throughout training, we prove that the control and value sequences generated by the algorithm maintain the regularity. Additionally, we explore extensions of our framework to parametric settings and continuous-time formulations.
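Schematically, the objective class studied here is the familiar KL-regularized fine-tuning trade-off (notation chosen for exposition; the paper's continuous-time control formulation is more precise):
\[ \max_{\theta}\; \mathbb{E}_{x \sim p^{\theta}}\big[ r(x) \big] \;-\; \beta\, \mathrm{KL}\big( p^{\theta} \,\|\, p^{\mathrm{pre}} \big), \]
where $p^{\mathrm{pre}}$ is the pre-trained denoising dynamics serving as the reference, $p^{\theta}$ the controlled dynamics, $r$ the downstream reward, and $\beta$ the regularization weight keeping the fine-tuned model close to the reference.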
[935] Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?
Olawale Salaudeen, Nicole Chiou, Shiny Weng, Sanmi Koyejo
Main category: cs.LG
TL;DR: The paper critiques current OOD benchmarks for failing to properly evaluate robustness to spurious correlations, showing that ERM often performs well due to dataset misspecification. It provides conditions for proper evaluation, audits existing datasets, and offers design principles for better benchmarks.
Details
Motivation: Current OOD benchmarks often miss shifts in spurious correlations, leading to misleading evaluations of model robustness. This paper aims to identify and address this issue.
Method: The authors derive conditions for revealing spurious feature reliance, audit existing OOD datasets, and propose design principles for well-specified benchmarks.
Result: Most OOD datasets are misspecified, displaying ‘accuracy on the line,’ but a few well-specified datasets exist. The paper provides guidelines for creating better benchmarks.
Conclusion: Properly evaluating robustness requires well-specified datasets that include shifts in spurious correlations. The paper offers actionable insights for future benchmark design.
Abstract: Spurious correlations, unstable statistical shortcuts a model can exploit, are expected to degrade performance out-of-distribution (OOD). However, across many popular OOD generalization benchmarks, vanilla empirical risk minimization (ERM) often achieves the highest OOD accuracy. Moreover, gains in in-distribution accuracy generally improve OOD accuracy, a phenomenon termed accuracy on the line, which contradicts the expected harm of spurious correlations. We show that these observations are an artifact of misspecified OOD datasets that do not include shifts in spurious correlations that harm OOD generalization, the setting they are meant to evaluate. Consequently, current practice evaluates “robustness” without truly stressing the spurious signals we seek to eliminate; our work pinpoints when that happens and how to fix it. Contributions. (i) We derive necessary and sufficient conditions for a distribution shift to reveal a model’s reliance on spurious features; when these conditions hold, “accuracy on the line” disappears. (ii) We audit leading OOD datasets and find that most still display accuracy on the line, suggesting they are misspecified for evaluating robustness to spurious correlations. (iii) We catalog the few well-specified datasets and summarize generalizable design principles, such as identifying datasets of natural interventions (e.g., a pandemic), to guide future well-specified benchmarks.
[936] Deep Operator Networks for Bayesian Parameter Estimation in PDEs
Amogh Raj, Carol Eunice Gudumotou, Sakol Bun, Keerthana Srinivasa, Arash Sarshar
Main category: cs.LG
TL;DR: A framework combining DeepONets and PINNs solves PDEs and estimates parameters, integrating data-driven learning with physical constraints for robust solutions. Bayesian training via variational inference enables uncertainty quantification, ensuring reliability in noisy or incomplete scenarios.
Details
Motivation: To address the challenge of solving PDEs and estimating their parameters robustly, especially under noisy or incomplete conditions, by integrating data-driven and physics-based approaches.
Method: Combines DeepONets and PINNs, using Bayesian training with variational inference for uncertainty quantification. Applied to forward/inverse problems like heat and reaction-diffusion equations.
Result: Achieves accurate solutions and parameter estimates, even with sparse or noisy data, demonstrating efficacy in diverse scenarios.
Conclusion: The framework offers a computationally efficient, generalizable method for uncertainty quantification in PDE surrogate modeling.
Abstract: We present a novel framework combining Deep Operator Networks (DeepONets) with Physics-Informed Neural Networks (PINNs) to solve partial differential equations (PDEs) and estimate their unknown parameters. By integrating data-driven learning with physical constraints, our method achieves robust and accurate solutions across diverse scenarios. Bayesian training is implemented through variational inference, allowing for comprehensive uncertainty quantification for both aleatoric and epistemic uncertainties. This ensures reliable predictions and parameter estimates even in noisy conditions or when some of the physical equations governing the problem are missing. The framework demonstrates its efficacy in solving forward and inverse problems, including the 1D unsteady heat equation and 2D reaction-diffusion equations, as well as regression tasks with sparse, noisy observations. This approach provides a computationally efficient and generalizable method for addressing uncertainty quantification in PDE surrogate modeling.
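The data-plus-physics coupling described above is typically a weighted sum of a data-misfit term and a PDE-residual term; schematically (symbols chosen here, with $\mathcal{N}_\lambda$ the differential operator carrying the unknown parameters $\lambda$):
\[ \mathcal{L}(\theta, \lambda) \;=\; \frac{1}{N}\sum_{i=1}^{N} \big| u_\theta(x_i) - u_i \big|^2 \;+\; \gamma\, \frac{1}{M}\sum_{j=1}^{M} \big| \mathcal{N}_\lambda[u_\theta](x_j) \big|^2, \]
with $\theta$ the operator-network weights and $\lambda$ estimated jointly. In the Bayesian variant, variational inference replaces point estimates of $(\theta, \lambda)$ with approximate posteriors, which is what yields the uncertainty quantification.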
[937] Anomaly Detection in Double-entry Bookkeeping Data by Federated Learning System with Non-model Sharing Approach
Sota Mashiko, Yuji Kawamata, Tomoru Nakayama, Tetsuya Sakurai, Yukihiko Okada
Main category: cs.LG
TL;DR: A novel anomaly detection framework using data collaboration (DC) analysis is proposed for financial auditing, ensuring data confidentiality without exposing raw data or requiring external network connections.
Details
Motivation: Anomaly detection in financial auditing requires large datasets, but sharing sensitive journal entry data across organizations is infeasible. Existing federated learning (FL) methods risk data exposure.
Method: The method transforms raw data into secure intermediate representations via dimensionality reduction, constructs a collaboration representation, and trains an anomaly detection autoencoder in a single communication round.
Result: The framework outperforms individual data baselines and model-sharing FL methods (FedAvg, FedProx), especially in non-i.i.d. settings, using synthetic and real-world healthcare data.
Conclusion: The study provides a practical solution for integrating organizational knowledge while preserving data confidentiality, advancing intelligent auditing systems.
Abstract: Anomaly detection is crucial in financial auditing, and effective detection requires large volumes of data from multiple organizations. However, journal entry data is highly sensitive, making it infeasible to share them directly across audit firms. To address this challenge, journal entry anomaly detection methods based on model share-type federated learning (FL) have been proposed. These methods require multiple rounds of communication with external servers to exchange model parameters, which necessitates connecting devices storing confidential data to external networks – a practice not recommended for sensitive data such as journal entries. To overcome these limitations, a novel anomaly detection framework based on data collaboration (DC) analysis, a non-model share-type FL approach, is proposed. The method first transforms raw journal entry data into secure intermediate representations via dimensionality reduction and then constructs a collaboration representation used to train an anomaly detection autoencoder. Notably, the approach does not require raw data to be exposed or devices to be connected to external networks, and the entire process needs only a single round of communication. The proposed method was evaluated on both synthetic and real-world journal entry data collected from eight healthcare organizations. The experimental results demonstrated that the framework not only outperforms the baseline trained on individual data but also achieves higher detection performance than model-sharing FL methods such as FedAvg and FedProx, particularly under non-i.i.d. settings that simulate practical audit environments. This study addresses the critical need to integrate organizational knowledge while preserving data confidentiality, contributing to the development of practical intelligent auditing systems.
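A minimal sketch of the pipeline as summarized above: each party reduces its raw data locally, shares only the reduced representation once, and an autoencoder trained on the pooled representation scores anomalies by reconstruction error. Real data collaboration analysis additionally aligns the parties' representations (e.g., via a shared anchor dataset), which is omitted here for brevity.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
parties = [rng.normal(size=(200, 50)) for _ in range(3)]  # raw data stays local

# Each party computes a reduced intermediate representation locally and
# shares it once (single communication round, no raw data exposure).
intermediates = [PCA(n_components=10).fit_transform(X) for X in parties]
collab = np.vstack(intermediates)

# Autoencoder on the collaboration representation; reconstruction error
# serves as the anomaly score.
ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=500, random_state=0)
ae.fit(collab, collab)
recon_err = ((ae.predict(collab) - collab) ** 2).mean(axis=1)
print("most anomalous rows:", np.argsort(recon_err)[-5:])
```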
[938] Avoiding Leakage Poisoning: Concept Interventions Under Distribution Shifts
Mateo Espinosa Zarlenga, Gabriele Dominici, Pietro Barbiero, Zohreh Shams, Mateja Jamnik
Main category: cs.LG
TL;DR: The paper examines how concept-based models (CMs) handle out-of-distribution (OOD) inputs, identifies a flaw called leakage poisoning, and proposes MixCEM to improve accuracy for both in-distribution and OOD data.
Details
Motivation: To understand and improve the performance of interpretable CMs when dealing with OOD inputs and concept interventions.
Method: Analyzes concept interventions on CMs for OOD inputs, identifies leakage poisoning, and introduces MixCEM to dynamically exploit leaked information.
Result: MixCEM outperforms baselines by improving accuracy for both in-distribution and OOD samples, with or without concept interventions.
Conclusion: MixCEM addresses leakage poisoning in CMs, enhancing their robustness and accuracy across diverse input distributions.
Abstract: In this paper, we investigate how concept-based models (CMs) respond to out-of-distribution (OOD) inputs. CMs are interpretable neural architectures that first predict a set of high-level concepts (e.g., stripes, black) and then predict a task label from those concepts. In particular, we study the impact of concept interventions (i.e., operations where a human expert corrects a CM’s mispredicted concepts at test time) on CMs’ task predictions when inputs are OOD. Our analysis reveals a weakness in current state-of-the-art CMs, which we term leakage poisoning, that prevents them from properly improving their accuracy when intervened on for OOD inputs. To address this, we introduce MixCEM, a new CM that learns to dynamically exploit leaked information missing from its concepts only when this information is in-distribution. Our results across tasks with and without complete sets of concept annotations demonstrate that MixCEMs outperform strong baselines by significantly improving their accuracy for both in-distribution and OOD samples in the presence and absence of concept interventions.
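Concept intervention itself, the operation whose failure mode the paper diagnoses, is mechanically simple: overwrite mispredicted concepts with expert values at test time and let the label head re-predict. The sketch shows generic concept-model mechanics, not MixCEM's leakage handling.

```python
import torch

def intervene(concept_probs, expert_idx, expert_values):
    """Overwrite selected predicted concepts with expert-provided values."""
    corrected = concept_probs.clone()
    corrected[:, expert_idx] = expert_values
    return corrected

concepts = torch.tensor([[0.2, 0.9, 0.4]])        # CM's predicted concepts
fixed = intervene(concepts, expert_idx=[0, 2],
                  expert_values=torch.tensor([1.0, 0.0]))
label_head = torch.nn.Linear(3, 2)                # stand-in task predictor
print(label_head(fixed))                          # task prediction re-derived
```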
[939] TrajFlow: A Generative Framework for Occupancy Density Estimation Using Normalizing Flows
Mitch Kosieradzki, Seongjin Choi
Main category: cs.LG
TL;DR: TrajFlow is a generative framework for predicting agent motion in traffic using a marginal distribution approach, outperforming existing methods in accuracy and enabling continuous sampling.
Details
Motivation: Accurate prediction of agent motion in complex traffic is crucial for intelligent transportation and autonomous vehicles, but uncertainty makes it challenging.
Method: Uses a causal encoder for trajectory embeddings and a normalizing flow for decoding, modeling marginal spatial distributions instead of joint trajectory distributions.
Result: Higher accuracy in trajectory forecasting, continuous sampling capability, and better suitability for downstream tasks like occupancy grids.
Conclusion: TrajFlow’s marginal formulation and neural differential equations offer superior performance and flexibility for motion prediction.
Abstract: For intelligent transportation systems and autonomous vehicles to operate safely and efficiently, they must reliably predict the future motion and trajectory of surrounding agents within complex traffic environments. At the same time, the motion of these agents is inherently uncertain, making accurate prediction difficult. In this paper, we propose \textbf{TrajFlow}, a generative framework for estimating the occupancy density of dynamic agents. Our framework utilizes a causal encoder to extract semantically meaningful embeddings of the observed trajectory, as well as a normalizing flow to decode these embeddings and determine the most likely future location of an agent at some time point in the future. Our formulation differs from existing approaches because we model the marginal distribution of spatial locations instead of the joint distribution of unobserved trajectories. The advantages of a marginal formulation are numerous. First, we demonstrate that the marginal formulation produces higher accuracy on challenging trajectory forecasting benchmarks. Second, the marginal formulation allows for fully continuous sampling of future locations. Finally, marginal densities are better suited for downstream tasks as they allow for the computation of per-agent motion trajectories and occupancy grids, the two most commonly used representations for motion forecasting. We present a novel architecture based entirely on neural differential equations as an implementation of this framework and provide ablations to demonstrate the advantages of a continuous implementation over a more traditional discrete neural network based approach. The code is available at https://github.com/UMN-Choi-Lab/TrajFlow.
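In notation chosen here, the distinction driving these advantages: given an observed history $h$, joint-trajectory methods model $p(x_{t_1}, \dots, x_{t_K} \mid h)$ over a fixed grid of future times, whereas the marginal formulation models an occupancy density $p_t(x \mid h)$ separately for every query time $t$. Because $t$ enters as a continuous argument, future locations can be sampled at arbitrary times, and per-time occupancy grids follow directly from evaluating the density.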
[940] LZ Penalty: An information-theoretic repetition penalty for autoregressive language models
Antonio A. Ginart, Naveen Kodali, Jason Lee, Caiming Xiong, Silvio Savarese, John R. Emmons
Main category: cs.LG
TL;DR: The LZ penalty reduces degenerate repetitions in autoregressive language models without loss of capability, outperforming frequency and repetition penalties.
Details
Motivation: To address the issue of degenerate repetitions in autoregressive language models while maintaining model capability.
Method: The LZ penalty is based on LZ77 compression algorithm codelengths, leveraging prediction-compression duality to sample from residual distributions.
Result: The LZ penalty eliminates degenerate repetitions in greedy decoding, unlike frequency and repetition penalties which fail with up to 4% repetition rates.
Conclusion: The LZ penalty is effective for reducing repetitions without compromising model performance, outperforming standard penalties.
Abstract: We introduce the LZ penalty, a penalty specialized for reducing degenerate repetitions in autoregressive language models without loss of capability. The penalty is based on the codelengths in the LZ77 universal lossless compression algorithm. Through the lens of the prediction-compression duality, decoding the LZ penalty has the interpretation of sampling from the residual distribution after removing the information that is highly compressible. We demonstrate the LZ penalty enables state-of-the-art open-source reasoning models to operate with greedy (temperature zero) decoding without loss of capability and without instances of degenerate repetition. Both the industry-standard frequency penalty and repetition penalty are ineffective, incurring degenerate repetition rates of up to 4%.
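As a rough intuition for the mechanism, not the paper's LZ77-codelength formula: a candidate token that would extend a long repeat of recent context is exactly the kind of highly compressible continuation such a penalty suppresses. The sketch below scores candidates by the length of the repeated suffix they would create; all details are invented for illustration.

```python
def repeat_match_penalty(context, vocab, strength=1.0):
    """Penalize candidate tokens in proportion to the longest repeated
    suffix they would create in the context (illustration only)."""
    penalties = {}
    for tok in vocab:
        cand = context + [tok]
        hay = cand[:-1]
        best = 0
        for L in range(1, len(cand)):
            suffix = cand[-L:]
            if any(hay[i:i + L] == suffix for i in range(len(hay) - L + 1)):
                best = L
        penalties[tok] = strength * best
    return penalties

ctx = list("abcabcab")
print(repeat_match_penalty(ctx, vocab=["c", "z"]))
# 'c' extends the "abcabc..." repeat -> heavily penalized; 'z' is not.
```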
[941] Flow Matching: Markov Kernels, Stochastic Processes and Transport Plans
Christian Wald, Gabriele Steidl
Main category: cs.LG
TL;DR: The paper reviews flow matching techniques for learning velocity fields in Wasserstein geometry, explores their applications in Bayesian inverse problems, and connects them to continuous normalizing flows and score matching.
Details
Motivation: To provide a mathematical perspective on learning velocity fields for flow matching, highlighting its simplicity and scalability in generative modeling.
Method: The paper examines three approaches: transport plans (couplings), Markov kernels, and stochastic processes, to characterize and learn velocity fields. It also applies flow matching to Bayesian inverse problems.
Result: Demonstrates how velocity fields can be learned and applied, with conditional Wasserstein distances playing a key role in Bayesian inverse problems.
Conclusion: Flow matching is versatile and scalable, with connections to other techniques like continuous normalizing flows and score matching.
Abstract: Among generative neural models, flow matching techniques stand out for their simple applicability and good scaling properties. Here, velocity fields of curves connecting a simple latent and a target distribution are learned. Then the corresponding ordinary differential equation can be used to sample from a target distribution, starting in samples from the latent one. This paper reviews from a mathematical point of view different techniques to learn the velocity fields of absolutely continuous curves in the Wasserstein geometry. We show how the velocity fields can be characterized and learned via i) transport plans (couplings) between latent and target distributions, ii) Markov kernels and iii) stochastic processes, where the latter two include the coupling approach, but are in general broader. Besides this main goal, we show how flow matching can be used for solving Bayesian inverse problems, where the definition of conditional Wasserstein distances plays a central role. Finally, we briefly address continuous normalizing flows and score matching techniques, which approach the learning of velocity fields of curves from other directions.
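The shared training objective behind these techniques is the (conditional) flow matching regression, written here for the simplest straight-line path: given a coupling $\pi$ of the latent and target distributions, draw $(x_0, x_1) \sim \pi$, set $x_t = (1-t)\,x_0 + t\,x_1$, and fit
\[ \mathcal{L}(\theta) \;=\; \mathbb{E}_{t \sim \mathcal{U}[0,1],\, (x_0, x_1) \sim \pi} \big\| v_\theta(t, x_t) - (x_1 - x_0) \big\|^2 . \]
Sampling then integrates $\dot{x} = v_\theta(t, x)$ from a latent draw $x(0)$; the Markov-kernel and stochastic-process views reviewed in the paper generalize how the conditional targets are constructed.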
[942] Privacy Amplification by Structured Subsampling for Deep Differentially Private Time Series Forecasting
Jan Schuchardt, Mina Dalirrooyfard, Jed Guzelkabaagac, Anderson Schneider, Yuriy Nevmyvaka, Stephan Günnemann
Main category: cs.LG
TL;DR: The paper highlights the limitations of DP-SGD for time series tasks like forecasting due to its reliance on unstructured batches. It proposes structured subsampling for privacy amplification and demonstrates its effectiveness in training forecasting models with strong privacy guarantees.
Details
Motivation: Standard DP-SGD is inadequate for time series tasks because it uses unstructured batches, which don't align with the sequential nature of forecasting data. The paper aims to address this gap.
Method: The authors analyze privacy amplification via structured subsampling (sampling sequential time series, contiguous subsequences, and partitioning into context/forecast windows) and prove how data augmentation enhances privacy in self-supervised training.
Result: Theoretical and empirical results show that structured subsampling enables training forecasting models with robust privacy guarantees.
Conclusion: Structured subsampling and data augmentation provide a viable solution for training private forecasting models, overcoming the limitations of DP-SGD.
Abstract: Many forms of sensitive data, such as web traffic, mobility data, or hospital occupancy, are inherently sequential. The standard method for training machine learning models while ensuring privacy for units of sensitive information, such as individual hospital visits, is differentially private stochastic gradient descent (DP-SGD). However, we observe in this work that the formal guarantees of DP-SGD are incompatible with time series specific tasks like forecasting, since they rely on the privacy amplification attained by training on small, unstructured batches sampled from an unstructured dataset. In contrast, batches for forecasting are generated by (1) sampling sequentially structured time series from a dataset, (2) sampling contiguous subsequences from these series, and (3) partitioning them into context and ground-truth forecast windows. We theoretically analyze the privacy amplification attained by this structured subsampling to enable the training of forecasting models with sound and tight event- and user-level privacy guarantees. Towards more private models, we additionally prove how data augmentation amplifies privacy in self-supervised training of sequence models. Our empirical evaluation demonstrates that amplification by structured subsampling enables the training of forecasting models with strong formal privacy guarantees.
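The three-step batch construction described above is easy to make concrete; the sketch shows only the sampling structure (the privacy-amplification analysis and DP noise addition are the paper's contribution).

```python
import numpy as np

def forecasting_batch(dataset, batch_size, window, context, rng):
    """(1) sample a series, (2) sample a contiguous subsequence,
    (3) split it into context and ground-truth forecast windows."""
    batch = []
    for _ in range(batch_size):
        series = dataset[rng.integers(len(dataset))]        # step 1
        start = rng.integers(len(series) - window + 1)      # step 2
        sub = series[start:start + window]
        batch.append((sub[:context], sub[context:]))        # step 3
    return batch

rng = np.random.default_rng(0)
data = [rng.normal(size=rng.integers(50, 100)) for _ in range(10)]
ctx, horizon = forecasting_batch(data, batch_size=4, window=24,
                                 context=16, rng=rng)[0]
print(ctx.shape, horizon.shape)   # (16,) (8,)
```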
[943] Robustly Learning Monotone Generalized Linear Models via Data Augmentation
Nikos Zarifis, Puqian Wang, Ilias Diakonikolas, Jelena Diakonikolas
Main category: cs.LG
TL;DR: First polynomial-time algorithm for learning GLMs with constant-factor approximation for any monotone Lipschitz activation, resolving an open problem.
Details
Motivation: Address the limitation of prior GLM learners, which only worked for a smaller class of activations, and solve a well-known open problem.
Method: Develop a robust counterpart to the GLMtron algorithm, using data augmentation with decreasing Gaussian noise injection and proving structural results.
Result: Achieves constant-factor approximation for any monotone Lipschitz activation and works for activations with bounded (2+ζ)-moments.
Conclusion: The work provides a broadly applicable solution for GLM learning and introduces techniques potentially useful in other contexts.
Abstract: We study the task of learning Generalized Linear models (GLMs) in the agnostic model under the Gaussian distribution. We give the first polynomial-time algorithm that achieves a constant-factor approximation for \textit{any} monotone Lipschitz activation. Prior constant-factor GLM learners succeed for a substantially smaller class of activations. Our work resolves a well-known open problem, by developing a robust counterpart to the classical GLMtron algorithm (Kakade et al., 2011). Our robust learner applies more generally, encompassing all monotone activations with bounded $(2+\zeta)$-moments, for any fixed $\zeta>0$ – a condition that is essentially necessary. To obtain our results, we leverage a novel data augmentation technique with decreasing Gaussian noise injection and prove a number of structural results that may be useful in other settings.
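For context, the classical GLMtron iteration that the paper robustifies is a simple fixed-point update with a known monotone activation $u$; a sketch of the classical algorithm follows (the paper's noise-injection augmentation is not reproduced here).

```python
import numpy as np

def glmtron(X, y, u, iters=200):
    """Classical GLMtron (Kakade et al., 2011): iterate
    w <- w + (1/n) * X^T (y - u(Xw)) with no step-size tuning."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        w = w + (X.T @ (y - u(X @ w))) / n
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5)) / np.sqrt(5)      # roughly unit-norm rows
w_true = np.array([1.0, -0.5, 0.3, 0.0, 0.2])
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
y = sigmoid(X @ w_true) + 0.05 * rng.normal(size=2000)
print(glmtron(X, y, sigmoid))                    # approaches w_true
```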
[944] EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling
Theodoros Kouzelis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis
Main category: cs.LG
TL;DR: EQ-VAE introduces a regularization method to enforce equivariance in latent spaces of autoencoders, improving generative model performance without sacrificing reconstruction quality.
Details
Motivation: Existing autoencoders lack equivariance to semantic-preserving transformations (e.g., scaling, rotation), complicating latent spaces and hindering generative performance.
Method: EQ-VAE is a regularization approach applied to autoencoders to enforce latent space equivariance. It is compatible with both continuous and discrete autoencoders.
Result: EQ-VAE enhances generative models (e.g., DiT, SiT, REPA, MaskGIT), achieving a 7× speedup on DiT-XL/2 with minimal fine-tuning.
Conclusion: EQ-VAE offers a versatile and effective solution to improve latent generative models by simplifying latent spaces while maintaining reconstruction quality.
Abstract: Latent generative models have emerged as a leading approach for high-quality image synthesis. These models rely on an autoencoder to compress images into a latent space, followed by a generative model to learn the latent distribution. We identify that existing autoencoders lack equivariance to semantic-preserving transformations like scaling and rotation, resulting in complex latent spaces that hinder generative performance. To address this, we propose EQ-VAE, a simple regularization approach that enforces equivariance in the latent space, reducing its complexity without degrading reconstruction quality. By finetuning pre-trained autoencoders with EQ-VAE, we enhance the performance of several state-of-the-art generative models, including DiT, SiT, REPA and MaskGIT, achieving a 7× speedup on DiT-XL/2 with only five epochs of SD-VAE fine-tuning. EQ-VAE is compatible with both continuous and discrete autoencoders, thus offering a versatile enhancement for a wide range of latent generative models. Project page and code: https://eq-vae.github.io/.
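One natural rendering of the latent-equivariance idea (illustrative; the published EQ-VAE objective may differ in where the transformation is applied): encoding a transformed image should match transforming the encoding. For a spatially structured latent and a 90-degree rotation:

```python
import torch
import torch.nn.functional as F

def equivariance_loss(encoder, x, k=1):
    """Penalize mismatch between encode(rotate(x)) and rotate(encode(x))
    for a k*90-degree rotation on square inputs/latents."""
    z = encoder(x)                                        # (B, C, H, W)
    z_of_rot = encoder(torch.rot90(x, k, dims=(-2, -1)))
    rot_of_z = torch.rot90(z, k, dims=(-2, -1))
    return F.mse_loss(z_of_rot, rot_of_z)

# A 1x1 conv is exactly rotation-equivariant, so the loss is ~0 here.
enc = torch.nn.Conv2d(3, 4, kernel_size=1)
x = torch.randn(2, 3, 16, 16)
print(equivariance_loss(enc, x))
```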
[945] Prompting Large Language Models for Training-Free Non-Intrusive Load Monitoring
Junyu Xue, Xudong Wang, Xiaoling He, Shicheng Liu, Yi Wang, Guoming Tang
Main category: cs.LG
TL;DR: The paper introduces a prompt-based NILM framework using LLMs, showing basic capabilities but strong generalization and explainability.
Details
Motivation: Address limitations of deep learning in NILM (labeled data dependence, poor generalization, lack of explainability) by leveraging LLMs.
Method: Design and evaluate prompt strategies for LLMs, integrating appliance features, context, and time-series examples.
Result: Prompt-guided LLMs deliver only basic NILM performance, lagging behind deep-learning models in complex scenarios, but excel in generalization and explainability.
Conclusion: Prompt-only LLMs define new NILM boundaries, offering promise in generalization and explainability despite performance gaps.
Abstract: Non-intrusive load monitoring (NILM) aims to disaggregate total electricity consumption into individual appliance usage, thus enabling more effective energy management. While deep learning has advanced NILM, it remains limited by its dependence on labeled data, restricted generalization, and lack of explainability. This paper introduces the first prompt-based NILM framework that leverages large language models (LLMs) with in-context learning. We design and evaluate prompt strategies that integrate appliance features, contextual information, and representative time-series examples through extensive case studies. Extensive experiments on the REDD and UK-DALE datasets show that LLMs guided solely by prompts deliver only basic NILM capabilities, with performance that lags behind traditional deep-learning models in complex scenarios. However, the experiments also demonstrate strong generalization across different houses and even regions by simply adapting the injected appliance features. It also provides clear, human-readable explanations for the inferred appliance states. Our findings define the capability boundaries of using prompt-only LLMs for NILM tasks. Their strengths in generalization and explainability present a promising new direction for the field.
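Since the method is prompt-only, the key artifact is the prompt itself. Below is a hypothetical assembly of the three ingredients the paper names (appliance features, context, time-series examples); all field names and wording are illustrative, not the paper's templates.

```python
def build_nilm_prompt(appliance, context, examples, window):
    """Hypothetical prompt assembly for prompt-based NILM."""
    shots = "\n".join(
        f"Reading: {ex['series']} -> State: {ex['state']}" for ex in examples
    )
    return (
        f"You are an energy disaggregation assistant.\n"
        f"Appliance: {appliance['name']} (typical power: {appliance['watts']} W)\n"
        f"Context: {context}\n"
        f"Examples:\n{shots}\n"
        f"Reading: {window} -> State:"
    )
```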
[946] An Actor-Critic Algorithm with Function Approximation for Risk Sensitive Cost Markov Decision Processes
Soumyajit Guin, Vivek S. Borkar, Shalabh Bhatnagar
Main category: cs.LG
TL;DR: A model-free policy gradient algorithm for risk-sensitive Markov decision processes is developed, with convergence analysis and superior performance demonstrated.
Details
Motivation: The risk-sensitive cost criterion is less studied due to its multiplicative Bellman equation complexity, unlike additive cost criteria.
Method: An actor-critic algorithm with function approximation is developed for this setting.
Result: Asymptotic convergence is proven, and numerical experiments show superior performance over existing algorithms.
Conclusion: The proposed algorithm effectively addresses the risk-sensitive cost criterion and outperforms recent methods.
Abstract: In this paper, we consider the risk-sensitive cost criterion with exponentiated costs for Markov decision processes and develop a model-free policy gradient algorithm in this setting. Unlike additive cost criteria such as average or discounted cost, the risk-sensitive cost criterion is less studied due to the complexity resulting from the multiplicative structure of the resulting Bellman equation. We develop an actor-critic algorithm with function approximation in this setting and provide its asymptotic convergence analysis. We also show the results of numerical experiments that demonstrate the superiority in performance of our algorithm over other recent algorithms in the literature.
[947] Causal Effect Estimation under Networked Interference without Networked Unconfoundedness Assumption
Weilin Chen, Ruichu Cai, Jie Qiao, Yuguang Yan, José Miguel Hernández-Lobato
Main category: cs.LG
TL;DR: The paper addresses the challenge of estimating causal effects under networked interference in observational data, where latent confounders often violate assumptions. It introduces a confounder recovery framework and an estimator to handle three types of latent confounders, proving their identifiability and validating the method experimentally.
Details
Motivation: The networked unconfoundedness assumption is often violated due to latent confounders in observational data, making it hard to identify networked effects. The paper aims to solve this by leveraging network interactions.
Method: A confounder recovery framework is developed to categorize and recover three types of latent confounders. A networked effect estimator is designed using identifiable representation learning.
Result: Theoretically, the identifiability of all three confounder types is proven, and the estimator successfully identifies networked effects. Experiments confirm the method’s effectiveness.
Conclusion: The proposed framework and estimator effectively address the challenge of latent confounders in networked settings, enabling reliable estimation of causal effects.
Abstract: Estimating causal effects under networked interference from observational data is a crucial yet challenging problem. Most existing methods mainly rely on the networked unconfoundedness assumption, which guarantees the identification of networked effects. However, this assumption is often violated due to the latent confounders inherent in observational data, thereby hindering the identification of networked effects. To address this issue, we leverage the rich interaction patterns between units in networks, which provide valuable information for recovering these latent confounders. Building on this insight, we develop a confounder recovery framework that explicitly characterizes three categories of latent confounders in networked settings: those affecting only the unit, those affecting only the unit’s neighbors, and those influencing both. Based on this framework, we design a networked effect estimator using identifiable representation learning techniques. From a theoretical standpoint, we prove the identifiability of all three types of latent confounders and, by leveraging the recovered confounders, establish a formal identification result for networked effects. Extensive experiments validate our theoretical findings and demonstrate the effectiveness of the proposed method.
[948] Transformer Meets Twicing: Harnessing Unattended Residual Information
Laziz Abdullaev, Tan M. Nguyen
Main category: cs.LG
TL;DR: The paper introduces Twicing Attention, a novel mechanism addressing the degradation of representational capacity in transformer layers by leveraging kernel twicing from nonparametric regression, improving robustness and accuracy.
Details
Motivation: The representational capacity of self-attention matrices degrades across transformer layers, limiting performance. The authors aim to mitigate this by enhancing the attention mechanism.
Method: Proposes Twicing Attention, inspired by kernel twicing in nonparametric regression, to reuse residual information and counteract low-pass behavior in self-attention.
Result: Twicing Attention shows slower decay of representational capacity and improved robustness/accuracy across tasks like image classification and language modeling.
Conclusion: The method outperforms standard self-attention, offering theoretical guarantees and empirical gains on diverse benchmarks.
Abstract: Transformer-based deep learning models have achieved state-of-the-art performance across numerous language and vision tasks. While the self-attention mechanism, a core component of transformers, has proven capable of handling complex data patterns, it has been observed that the representational capacity of the attention matrix degrades significantly across transformer layers, thereby hurting its overall performance. In this work, we leverage the connection between self-attention computations and low-pass non-local means (NLM) smoothing filters and propose the Twicing Attention, a novel attention mechanism that uses the kernel twicing procedure from nonparametric regression to alleviate the low-pass behavior of associated NLM smoothing with compelling theoretical guarantees and enhanced adversarial robustness. This approach enables the extraction and reuse of meaningful information retained in the residuals following the imperfect smoothing operation at each layer. Our proposed method offers two key advantages over standard self-attention: 1) a provably slower decay of representational capacity and 2) improved robustness and accuracy across various data modalities and tasks. We empirically demonstrate the performance gains of our model over baseline transformers on multiple tasks and benchmarks, including image classification and language modeling, on both clean and corrupted data.
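Kernel twicing replaces a single smoothing pass $AV$ with $AV + A(V - AV) = (2A - A^2)V$, feeding the smoothed residual back in. A minimal sketch of this idea applied to softmax attention; the paper's exact parameterization may differ.

```python
import torch

def twicing_attention(q, k, v):
    """Twicing variant of softmax attention: output = (2A - A^2) V,
    reusing the residual (V - A V) that plain smoothing discards."""
    a = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    av = a @ v                  # ordinary attention output (one smoothing pass)
    return av + a @ (v - av)    # add back smoothed residual: (2A - A @ A) V
```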
[949] Step-wise Adaptive Integration of Supervised Fine-tuning and Reinforcement Learning for Task-Specific LLMs
Jack Chen, Fazhong Liu, Naruto Liu, Yuhan Luo, Erqu Qin, Harry Zheng, Tian Dong, Haojin Zhu, Yan Meng, Xiao Wang
Main category: cs.LG
TL;DR: SASR is a step-wise adaptive hybrid training framework combining SFT and RL to enhance LLMs’ reasoning, outperforming static methods.
Details
Motivation: Challenges like overfitting in SFT and mode collapse in RL, along with poor generalization in static hybrid methods, inspired a dynamic solution.
Method: SASR uses SFT for initial warm-up, then adaptively integrates RL (GRPO) via gradient norm and divergence monitoring for dynamic balance.
Result: SASR outperforms SFT, RL, and static hybrid methods in experiments.
Conclusion: SASR provides a robust, adaptive framework for training LLMs, ensuring smooth transitions and improved reasoning performance.
Abstract: Large language models (LLMs) excel at mathematical reasoning and logical problem-solving. The current popular training paradigms primarily use supervised fine-tuning (SFT) and reinforcement learning (RL) to enhance the models’ reasoning abilities. However, when using SFT or RL alone, there are respective challenges: SFT may suffer from overfitting, while RL is prone to mode collapse. The state-of-the-art methods have proposed hybrid training schemes. However, static switching faces challenges such as poor generalization across different tasks and high dependence on data quality. In response to these challenges, inspired by the curriculum learning-quiz mechanism in human reasoning cultivation, we propose SASR, a step-wise adaptive hybrid training framework that theoretically unifies SFT and RL and dynamically balances the two throughout optimization. SASR uses SFT for initial warm-up to establish basic reasoning skills, and then uses an adaptive dynamic adjustment algorithm based on gradient norm and divergence relative to the original distribution to seamlessly integrate SFT with the online RL method GRPO. By monitoring the training status of LLMs and adjusting the training process in sequence, SASR ensures a smooth transition between training schemes, maintaining core reasoning abilities while exploring different paths. Experimental results demonstrate that SASR outperforms SFT, RL, and static hybrid training methods.
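The switching logic can be caricatured in a few lines: warm up with SFT, then choose between SFT and GRPO updates from a gradient-norm signal. The ratio rule below is an illustrative assumption; the paper's criterion also tracks divergence from the original distribution.

```python
import random

def choose_update(step, warmup_steps, gnorm_sft, gnorm_rl):
    """Caricature of SASR's step-wise switching between SFT and GRPO.
    The stochastic ratio rule is an assumption, not the paper's criterion."""
    if step < warmup_steps:
        return "sft"                              # warm-up phase: supervised only
    p_sft = gnorm_sft / (gnorm_sft + gnorm_rl + 1e-8)
    return "sft" if random.random() < p_sft else "grpo"
```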
[950] SEAL: Semantic Aware Image Watermarking
Kasra Arabi, R. Teal Witter, Chinmay Hegde, Niv Cohen
Main category: cs.LG
TL;DR: A novel watermarking method for diffusion models embeds semantic information directly into watermarks, enabling distortion-free verification without a key database, while improving robustness against forgery attacks.
Details
Motivation: The increasing challenge of distinguishing natural from AI-generated content necessitates robust watermarking techniques that preserve image integrity and resist removal or replication.
Method: The proposed method embeds semantic information into watermarks using locality-sensitive hashing, allowing verification without a key database and improving robustness by conditioning detection on original image content.
Result: Empirical validation shows increased robustness against overlooked attacks, such as noise extraction and object insertion, while maintaining watermark integrity.
Conclusion: Content-aware watermarks can effectively mitigate risks associated with image-generative models by enhancing security and detection reliability.
Abstract: Generative models have rapidly evolved to generate realistic outputs. However, their synthetic outputs increasingly challenge the clear distinction between natural and AI-generated content, necessitating robust watermarking techniques. Watermarks are typically expected to preserve the integrity of the target image, withstand removal attempts, and prevent unauthorized replication onto unrelated images. To address this need, recent methods embed persistent watermarks into images produced by diffusion models using the initial noise. Yet, to do so, they either distort the distribution of generated images or rely on searching through a long dictionary of used keys for detection. In this paper, we propose a novel watermarking method that embeds semantic information about the generated image directly into the watermark, enabling a distortion-free watermark that can be verified without requiring a database of key patterns. Instead, the key pattern can be inferred from the semantic embedding of the image using locality-sensitive hashing. Furthermore, conditioning the watermark detection on the original image content improves robustness against forgery attacks. To demonstrate that, we consider two largely overlooked attack strategies: (i) an attacker extracting the initial noise and generating a novel image with the same pattern; (ii) an attacker inserting an unrelated (potentially harmful) object into a watermarked image, possibly while preserving the watermark. We empirically validate our method’s increased robustness to these attacks. Taken together, our results suggest that content-aware watermarks can mitigate risks arising from image-generative models.
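One plausible reading of the database-free key derivation: hash the image's semantic embedding with random-hyperplane LSH and use the resulting bits to seed the initial-noise pattern. Hash size, noise shape, and the seeding scheme below are all assumptions.

```python
import numpy as np

def semantic_key_pattern(embedding, n_bits=128, shape=(4, 64, 64), seed_planes=0):
    """Sketch of a SEAL-style key derived from a semantic embedding via
    random-hyperplane LSH; parameters are illustrative assumptions."""
    rng = np.random.default_rng(seed_planes)
    planes = rng.standard_normal((n_bits, embedding.shape[0]))
    bits = (planes @ embedding > 0).astype(np.uint64)              # LSH signature
    seed = int(bits[:64] @ (1 << np.arange(64, dtype=np.uint64)))  # bits -> int
    return np.random.default_rng(seed).standard_normal(shape)     # key pattern
```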
[951] MedGNN: Capturing the Links Between Urban Characteristics and Medical Prescriptions
Minwei Zhao, Sanja Scepanovic, Stephen Law, Ivica Obadic, Cai Wu, Daniele Quercia
Main category: cs.LG
TL;DR: MedGNN, a spatio-topologically explicit framework, improves health outcome predictions by 25% over baselines, integrating urban and socio-demographic data with graph neural networks.
Details
Motivation: Traditional methods struggle with nonlinear and spatial effects in urban health research, while machine learning lacks interpretability for geographical and topological relationships.
Method: MedGNN constructs a 2-hop spatial graph, combining positional and locational node embeddings with urban characteristics in a graph neural network.
Result: Applied to MEDSAT data, MedGNN outperformed baselines by 25%, revealing insights like greenery’s link to higher antidepressant prescriptions.
Conclusion: MedGNN demonstrates the potential of machine learning to enhance transdisciplinary public health research with interpretable spatial insights.
Abstract: Understanding how urban socio-demographic and environmental factors relate with health is essential for public health and urban planning. However, traditional statistical methods struggle with nonlinear effects, while machine learning models often fail to capture geographical (nearby areas being more similar) and topological (unequal connectivity between places) effects in an interpretable way. To address this, we propose MedGNN, a spatio-topologically explicit framework that constructs a 2-hop spatial graph, integrating positional and locational node embeddings with urban characteristics in a graph neural network. Applied to MEDSAT, a comprehensive dataset covering over 150 environmental and socio-demographic factors and six prescription outcomes (depression, anxiety, diabetes, hypertension, asthma, and opioids) across 4,835 Greater London neighborhoods, MedGNN improved predictions by over 25% on average compared to baseline methods. Using depression prescriptions as a case study, we analyzed graph embeddings via geographical principal component analysis, identifying findings that: align with prior research (e.g., higher antidepressant prescriptions among older and White populations), contribute to ongoing debates (e.g., greenery linked to higher and NO2 to lower prescriptions), and warrant further study (e.g., canopy evaporation correlated with fewer prescriptions). These results demonstrate MedGNN’s potential, and more broadly, of carefully applied machine learning, to advance transdisciplinary public health research.
[952] Between Linear and Sinusoidal: Rethinking the Time Encoder in Dynamic Graph Learning
Hsing-Huan Chung, Shravan Chaudhari, Xing Han, Yoav Wald, Suchi Saria, Joydeep Ghosh
Main category: cs.LG
TL;DR: The paper proposes using linear time encoders instead of sinusoidal ones in dynamic graph learning, showing improved performance and parameter efficiency.
Details
Motivation: Sinusoidal time encoders lose temporal information and require high dimensions, prompting exploration of simpler alternatives like linear encoders.
Method: The study rigorously evaluates linear time encoders, demonstrating their ability to learn time spans and temporal patterns via self-attention in models like TGAT and DyGFormer.
Result: Linear time encoders improve performance on six datasets, reduce parameters (e.g., 43% savings in TGAT), and maintain or exceed precision compared to sinusoidal encoders.
Conclusion: Linear time encoders offer overlooked advantages in dynamic graph models, influencing design choices for applications like recommender systems and traffic forecasting.
Abstract: Dynamic graph learning is essential for applications involving temporal networks and requires effective modeling of temporal relationships. Seminal attention-based models like TGAT and DyGFormer rely on sinusoidal time encoders to capture temporal dependencies between edge events. Prior work justified sinusoidal encodings because their inner products depend on the time spans between events, which are crucial features for modeling inter-event relations. However, sinusoidal encodings inherently lose temporal information due to their many-to-one nature and therefore require high dimensions. In this paper, we rigorously study a simpler alternative: the linear time encoder, which avoids temporal information loss caused by sinusoidal functions and reduces the need for high-dimensional time encoders. We show that the self-attention mechanism can effectively learn to compute time spans between events from linear time encodings and extract relevant temporal patterns. Through extensive experiments on six dynamic graph datasets, we demonstrate that the linear time encoder improves the performance of TGAT and DyGFormer in most cases. Moreover, the linear time encoder can lead to significant savings in model parameters with minimal performance loss. For example, compared to a 100-dimensional sinusoidal time encoder, TGAT with a 2-dimensional linear time encoder saves 43% of parameters and achieves higher average precision on five datasets. While both encoders can be used simultaneously, our study highlights the often-overlooked advantages of linear time features in modern dynamic graph models. These findings can positively impact the design choices of various dynamic graph learning architectures and eventually benefit temporal network applications such as recommender systems, communication networks, and traffic forecasting.
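The contrast between the two encoders fits in a few lines: the linear map preserves time spans directly, while the sinusoidal map is many-to-one in $t$ and typically needs a higher dimension. The fixed slopes and frequencies below stand in for learned parameters.

```python
import numpy as np

def linear_time_encoding(t, dim=2):
    """Linear time encoder; w (and a bias) would be learned in practice."""
    w = np.linspace(0.1, 1.0, dim)      # stand-in for learned slopes
    return np.outer(t, w)               # shape (len(t), dim)

def sinusoidal_time_encoding(t, dim=100):
    """TGAT-style sinusoidal encoder, many-to-one in t."""
    freqs = 1.0 / (10.0 ** np.linspace(0, 9, dim))
    return np.cos(np.outer(t, freqs))
```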
[953] medDreamer: Model-Based Reinforcement Learning with Latent Imagination on Complex EHRs for Clinical Decision Support
Qianyi Xu, Gousia Habib, Dilruk Perera, Mengling Feng
Main category: cs.LG
TL;DR: medDreamer is a model-based RL framework for personalized treatment recommendations, addressing missing data and historical biases in clinical decision-making.
Details
Motivation: Existing RL-based clinical decision systems ignore missing data patterns and rely on retrospective data, leading to sub-optimal policies.
Method: medDreamer uses an Adaptive Feature Integration module and a two-phase policy trained on hybrid real and imagined trajectories.
Result: Outperforms model-free and model-based baselines in clinical outcomes and off-policy metrics for sepsis and ventilation tasks.
Conclusion: medDreamer improves personalized treatment decisions by simulating latent states and optimizing policies beyond historical data limitations.
Abstract: Timely and personalized treatment decisions are essential across a wide range of healthcare settings where patient responses can vary significantly and evolve over time. Clinical data used to support these treatment decisions are often irregularly sampled, where missing data frequencies may implicitly convey information about the patient’s condition. Existing Reinforcement Learning (RL) based clinical decision support systems often ignore the missing patterns and distort them with coarse discretization and simple imputation. They are also predominantly model-free and largely depend on retrospective data, which could lead to insufficient exploration and bias by historical behaviors. To address these limitations, we propose medDreamer, a novel model-based reinforcement learning framework for personalized treatment recommendation. medDreamer contains a world model with an Adaptive Feature Integration module that simulates latent patient states from irregular data and a two-phase policy trained on a hybrid of real and imagined trajectories. This enables learning optimal policies that go beyond the sub-optimality of historical clinical decisions, while remaining close to real clinical data. We evaluate medDreamer on both sepsis and mechanical ventilation treatment tasks using two large-scale Electronic Health Records (EHRs) datasets. Comprehensive evaluations show that medDreamer significantly outperforms model-free and model-based baselines in both clinical outcomes and off-policy metrics.
[954] Slicing the Gaussian Mixture Wasserstein Distance
Moritz Piening, Robert Beinert
Main category: cs.LG
TL;DR: The paper proposes slicing-based approximations to the mixture Wasserstein (MW) distance for Gaussian mixture models (GMMs) to reduce computational complexity while preserving optimal transport properties.
Details
Motivation: The high computational cost of the MW distance limits its scalability, prompting the need for efficient approximations.
Method: The authors introduce novel slicing-based approximations to the MW distance and analyze their theoretical properties, including equivalences to the original MW and sliced Wasserstein distances.
Result: Numerical experiments show the proposed methods are computationally efficient and effective in tasks like clustering and image comparison.
Conclusion: The slicing-based approximations offer a scalable and efficient alternative to the MW distance for GMMs, with practical applications in machine learning.
Abstract: Gaussian mixture models (GMMs) are widely used in machine learning for tasks such as clustering, classification, image reconstruction, and generative modeling. A key challenge in working with GMMs is defining a computationally efficient and geometrically meaningful metric. The mixture Wasserstein (MW) distance adapts the Wasserstein metric to GMMs and has been applied in various domains, including domain adaptation, dataset comparison, and reinforcement learning. However, its high computational cost – arising from repeated Wasserstein distance computations involving matrix square root estimations and an expensive linear program – limits its scalability to high-dimensional and large-scale problems. To address this, we propose multiple novel slicing-based approximations to the MW distance that significantly reduce computational complexity while preserving key optimal transport properties. From a theoretical viewpoint, we establish several weak and strong equivalences between the introduced metrics, and show the relations to the original MW distance and the well-established sliced Wasserstein distance. Furthermore, we validate the effectiveness of our approach through numerical experiments, demonstrating computational efficiency and applications in clustering, perceptual image comparison, and GMM minimization.
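A generic sliced construction conveys the idea: project both mixtures onto random directions (a Gaussian component projects to mean $u^\top\mu$ and variance $u^\top\Sigma u$) and average 1D Wasserstein-2 distances. This Monte Carlo sketch is a standard sliced-Wasserstein recipe, not necessarily one of the paper's specific estimators.

```python
import numpy as np

def project_gmm(weights, means, covs, u):
    """Project a GMM onto direction u: 1D means u.mu and stds sqrt(u^T Sigma u)."""
    m = means @ u
    v = np.array([u @ c @ u for c in covs])
    return weights, m, np.sqrt(v)

def sliced_w2_gmm(gmm_a, gmm_b, n_dirs=64, n_quantiles=512, seed=0):
    """Average empirical 1D W2 between projected mixtures over random
    directions; each GMM is a (weights, means, covs) tuple."""
    rng = np.random.default_rng(seed)
    d = gmm_a[1].shape[1]
    total = 0.0
    for _ in range(n_dirs):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        samples = []
        for w, mu, cov in (gmm_a, gmm_b):
            pw, pm, ps = project_gmm(w, mu, cov, u)
            comp = rng.choice(len(pw), size=n_quantiles, p=pw)
            samples.append(np.sort(rng.standard_normal(n_quantiles) * ps[comp] + pm[comp]))
        total += np.mean((samples[0] - samples[1]) ** 2)  # sorted samples ~ quantiles
    return np.sqrt(total / n_dirs)
```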
[955] On learning functions over biological sequence space: relating Gaussian process priors, regularization, and gauge fixing
Samantha Petti, Carlos Martí-Gómez, Justin B. Kinney, Juannan Zhou, David M. McCandlish
Main category: cs.LG
TL;DR: The paper explores sequence-to-function mapping in biology, focusing on predictive inference and subsequence contributions. It connects $L_2$-regularized regression in weight space to Gaussian process approaches in function space, and derives posterior distributions for sequence-to-function statistics.
Details
Motivation: Understanding sequence-to-function maps is crucial in biology. The study aims to interpret these maps meaningfully by addressing gauge-fixing and linking weight space regularization to Gaussian process methods.
Method: The paper uses $L_2$-regularized regression in overparameterized weight space and connects it to Gaussian process approaches in function space. It also constructs regularizers for arbitrary Gaussian process priors and derives posterior distributions for sequence-to-function statistics.
Result: The study shows how weight space regularizers impose implicit priors and restrict optimal weights. It efficiently computes posterior distributions for sequence-to-function statistics using product-kernel priors.
Conclusion: The work bridges weight space and function space approaches, enabling meaningful interpretation of sequence-to-function maps and efficient computation of posterior distributions for biological sequence analysis.
Abstract: Mappings from biological sequences (DNA, RNA, protein) to quantitative measures of sequence functionality play an important role in contemporary biology. We are interested in the related tasks of (i) inferring predictive sequence-to-function maps and (ii) decomposing sequence-function maps to elucidate the contributions of individual subsequences. Because each sequence-function map can be written as a weighted sum over subsequences in multiple ways, meaningfully interpreting these weights requires “gauge-fixing,” i.e., defining a unique representation for each map. Recent work has established that most existing gauge-fixed representations arise as the unique solutions to $L_2$-regularized regression in an overparameterized “weight space” where the choice of regularizer defines the gauge. Here, we establish the relationship between regularized regression in overparameterized weight space and Gaussian process approaches that operate in “function space,” i.e. the space of all real-valued functions on a finite set of sequences. We disentangle how weight space regularizers both impose an implicit prior on the learned function and restrict the optimal weights to a particular gauge. We show how to construct regularizers that correspond to arbitrary explicit Gaussian process priors combined with a wide variety of gauges and characterize the implicit function space priors associated with the most common weight space regularizers. Finally, we derive the posterior distribution of a broad class of sequence-to-function statistics, including gauge-fixed weights and multiple systems for expressing higher-order epistatic coefficients. We show that such distributions can be efficiently computed for product-kernel priors using a kernel trick.
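For orientation, the function-space side reduces to standard GP regression once a kernel on sequences is fixed; the paper's contribution lies in the gauge-aware priors and statistics, which this generic posterior-mean sketch does not reproduce.

```python
import numpy as np

def gp_posterior_mean(K_train, K_cross, y, noise_var=0.1):
    """Standard GP posterior mean given kernel blocks: K_train is k(X, X),
    K_cross is k(X*, X), y are observed measurements. Generic sketch only."""
    n = K_train.shape[0]
    L = np.linalg.cholesky(K_train + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + s^2 I)^-1 y
    return K_cross @ alpha
```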
[956] SHIELD: Secure Hypernetworks for Incremental Expansion Learning Defense
Patryk Krukowski, Łukasz Gorczyca, Piotr Helm, Kamil Książek, Przemysław Spurek
Main category: cs.LG
TL;DR: SHIELD integrates IBP with hypernetworks for robust continual learning, using Interval MixUp for certified robustness, outperforming existing methods under adversarial attacks.
Details
Motivation: Address the challenge of robust continual learning, which often sacrifices scalability or robustness in existing methods.
Method: Combines Interval Bound Propagation (IBP) with hypernetworks for task-specific parameters and introduces Interval MixUp for training with certified robustness.
Result: Outperforms existing methods under strong adversarial attacks, achieving state-of-the-art accuracy with scalability and certification.
Conclusion: SHIELD advances practical and theoretically grounded continual learning in adversarial settings.
Abstract: Continual learning under adversarial conditions remains an open problem, as existing methods often compromise either robustness, scalability, or both. We propose a novel framework that integrates Interval Bound Propagation (IBP) with a hypernetwork-based architecture to enable certifiably robust continual learning across sequential tasks. Our method, SHIELD, generates task-specific model parameters via a shared hypernetwork conditioned solely on compact task embeddings, eliminating the need for replay buffers or full model copies and enabling efficient expansion over time. To further enhance robustness, we introduce Interval MixUp, a novel training strategy that blends virtual examples represented as $\ell_{\infty}$ balls centered around MixUp points. Leveraging interval arithmetic, this technique guarantees certified robustness while mitigating the wrapping effect, resulting in smoother decision boundaries. We evaluate SHIELD under strong white-box adversarial attacks, including PGD and AutoAttack, across multiple benchmarks. It consistently outperforms existing robust continual learning methods, achieving state-of-the-art average accuracy while maintaining both scalability and certification. These results represent a significant step toward practical and theoretically grounded continual learning in adversarial settings.
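The certification primitive here is standard interval bound propagation. A minimal sketch of one affine-plus-ReLU step, propagating a center and an $\ell_\infty$ radius; SHIELD's Interval MixUp training builds on bounds of this kind.

```python
import numpy as np

def ibp_affine_relu(center, radius, W, b):
    """One IBP step: push an l_inf box through an affine layer and a ReLU."""
    c = W @ center + b
    r = np.abs(W) @ radius                       # radius grows with |W|
    lower = np.maximum(c - r, 0.0)               # ReLU of lower bound
    upper = np.maximum(c + r, 0.0)               # ReLU of upper bound
    return (lower + upper) / 2.0, (upper - lower) / 2.0
```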
[957] DHO$_2$: Accelerating Distributed Hybrid Order Optimization via Model Parallelism and ADMM
Shunxian Gu, Chaoqun You, Bangbang Ren, Lailong Luo, Junxu Xia, Deke Guo
Main category: cs.LG
TL;DR: FOSI, a hybrid optimizer, accelerates DNN training by combining gradient and curvature information. Its distributed design, DHO$_2$, reduces memory burden and training time, achieving significant speedup.
Details
Motivation: To enable faster DNN training for users with limited computing resources by leveraging hybrid optimization techniques.
Method: DHO$_2$ distributes curvature calculation and model updates, parallelizing tasks across devices to reduce memory and time.
Result: Achieves linear memory reduction and 1.4×∼2.1× speedup compared to conventional distributed optimizers.
Conclusion: DHO$_2$ offers an efficient solution for resource-constrained DNN training, outperforming existing methods.
Abstract: Scaling deep neural network (DNN) training to more devices can reduce time-to-solution. However, it is impractical for users with limited computing resources. FOSI, as a hybrid order optimizer, converges faster than conventional optimizers by taking advantage of both gradient information and curvature information when updating the DNN model. Therefore, it provides a new opportunity for accelerating DNN training in the resource-constrained setting. In this paper, we explore its distributed design, namely DHO$_2$, including distributed calculation of curvature information and model update with partial curvature information to accelerate DNN training with a low memory burden. To further reduce the training time, we design a novel strategy to parallelize the calculation of curvature information and the model update on different devices. Experimentally, our distributed design achieves an approximately linear reduction of the memory burden on each device as the number of devices increases. Meanwhile, it achieves a $1.4\times\sim2.1\times$ speedup in the total training time compared with other distributed designs based on conventional first- and second-order optimizers.
[958] Investigating Robotaxi Crash Severity with Geographical Random Forest and the Urban Environment
Junfeng Jiao, Seung Gyu Baik, Seung Jun Choi, Yiming Xu
Main category: cs.LG
TL;DR: The paper uses spatially localized machine learning (Geographical Random Forest) to analyze AV crash severity, finding land use as the top predictor and higher severity in residential areas.
Details
Motivation: To investigate AV crash severity at a city-scale, addressing spatial heterogeneity and autocorrelation beyond individual infrastructure effects.
Method: Implemented Geographical Random Forest (GRF) on California AV collision data, analyzing urban measures like land use, building footprints, and POIs.
Result: GRF outperformed regular ML; land use was the most important predictor; residential areas had higher crash severity than commercial areas.
Conclusion: Recommends localized algorithms for AV perception systems and tailored safety measures for residential neighborhoods.
Abstract: This paper quantitatively investigates the crash severity of Autonomous Vehicles (AVs) with spatially localized machine learning and macroscopic measures of the urban built environment. Extending beyond the microscopic effects of individual infrastructure elements, we focus on the city-scale land use and behavioral patterns, while addressing spatial heterogeneity and spatial autocorrelation. We implemented a spatially localized machine learning technique called Geographical Random Forest (GRF) on the California AV collision dataset. Analyzing multiple urban measures, including points of interest, building footprint, and land use, we built a GRF model and visualized it as a crash severity risk map of San Francisco. This paper presents three findings. First, spatially localized machine learning outperformed regular machine learning in predicting AV crash severity. The bias-variance tradeoff was evident as we adjusted the localization weight hyperparameter. Second, land use was the most important predictor, compared to intersections, building footprints, public transit stops, and Points Of Interest (POIs). Third, AV crashes were more likely to result in low-severity incidents in city center areas with greater diversity and commercial activities than in residential neighborhoods. Residential land use is likely associated with higher severity due to human behavior and less restrictive environments. Counterintuitively, residential areas were associated with higher crash severity compared to more complex areas such as commercial and mixed-use areas. When robotaxi operators train their AV systems, it is recommended to: (1) consider where their fleet operates and make localized algorithms for their perception system, and (2) design safety measures specific to residential neighborhoods, such as slower driving speeds and more alert sensors.
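The GRF idea, one local forest per location trained with spatially weighted samples, can be sketched with scikit-learn; the published method also blends local and global forests via a localization weight, omitted here, and the Gaussian bandwidth is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_grf(X, y, coords, bandwidth=5.0):
    """Sketch of a Geographical Random Forest: one forest per location,
    fit with Gaussian spatial sample weights; omits the local/global blend."""
    forests = []
    for i in range(len(coords)):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-((d / bandwidth) ** 2))          # nearer samples count more
        rf = RandomForestRegressor(n_estimators=50, random_state=0)
        rf.fit(X, y, sample_weight=w)
        forests.append(rf)
    return forests
```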
[959] Informed Forecasting: Leveraging Auxiliary Knowledge to Boost LLM Performance on Time Series Forecasting
Mohammadmahdi Ghasemloo, Alireza Moradi
Main category: cs.LG
TL;DR: A novel framework enhances LLMs for time series forecasting by infusing structured temporal knowledge, outperforming uninformed baselines.
Details
Motivation: Address the need for best practices in using LLMs beyond traditional tasks, particularly in time series forecasting for domains like energy, finance, and healthcare.
Method: Proposes a cross-domain knowledge transfer framework to systematically infuse LLMs with structured temporal information.
Result: Knowledge-informed forecasting significantly outperforms the uninformed baseline in predictive accuracy and generalization.
Conclusion: Knowledge transfer strategies can effectively bridge the gap between LLMs and domain-specific forecasting tasks.
Abstract: With the widespread adoption of Large Language Models (LLMs), there is a growing need to establish best practices for leveraging their capabilities beyond traditional natural language tasks. In this paper, a novel cross-domain knowledge transfer framework is proposed to enhance the performance of LLMs in time series forecasting – a task of increasing relevance in fields such as energy systems, finance, and healthcare. The approach systematically infuses LLMs with structured temporal information to improve their forecasting accuracy. This study evaluates the proposed method on a real-world time series dataset and compares it to a naive baseline where the LLM receives no auxiliary information. Results show that knowledge-informed forecasting significantly outperforms the uninformed baseline in terms of predictive accuracy and generalization. These findings highlight the potential of knowledge transfer strategies to bridge the gap between LLMs and domain-specific forecasting tasks.
[960] ChemMLLM: Chemical Multimodal Large Language Model
Qian Tan, Dongzhan Zhou, Peng Xia, Wanhao Liu, Wanli Ouyang, Lei Bai, Yuqiang Li, Tianfan Fu
Main category: cs.LG
TL;DR: ChemMLLM is a multimodal large language model for chemical tasks, outperforming existing models in molecule understanding and generation.
Details
Motivation: Existing MLLMs lack focus on chemical cross-modal tasks, prompting the development of ChemMLLM.
Method: ChemMLLM is designed for multimodal tasks involving text, SMILES strings, and images, with curated datasets for benchmarking.
Result: ChemMLLM surpasses general and chemical MLLMs, e.g., achieving 116.75% better performance in molecule image optimization than GPT-4o.
Conclusion: ChemMLLM fills a gap in chemical MLLMs, demonstrating superior performance in multimodal chemical tasks.
Abstract: Multimodal large language models (MLLMs) have made impressive progress in many applications in recent years. However, chemical MLLMs that can handle cross-modal understanding and generation remain underexplored. To fill this gap, we propose ChemMLLM, a unified chemical multimodal large language model for molecule understanding and generation. Also, we design five multimodal tasks across text, molecular SMILES strings, and images, and curate the datasets. We benchmark ChemMLLM against a range of general leading MLLMs and Chemical LLMs on these tasks. Experimental results show that ChemMLLM achieves superior performance across all evaluated tasks. For example, in the molecule image optimization task, ChemMLLM outperforms the best baseline (GPT-4o) by 116.75% (4.27 vs 1.97 property improvement). The code is publicly available at https://github.com/bbsbz/ChemMLLM.git.
[961] Reinforcing VLMs to Use Tools for Detailed Visual Reasoning Under Resource Constraints
Sunil Kumar, Bowen Zhao, Leo Dirac, Paulina Varshavskaya
Main category: cs.LG
TL;DR: Smaller VLMs trained with GRPO and external tools like zoom outperform baselines in VQA tasks by leveraging detailed visual reasoning.
Details
Motivation: Addressing the challenge of limited compute resources and detailed visual reasoning in VLMs.
Method: Training smaller models with GRPO, a simple reward structure, simplified tool-calling, extra tokens for tool results, and a data mix favoring visually difficult examples.
Result: Improved performance on VQA tasks compared to similarly-sized baseline models.
Conclusion: Combining GRPO with external tools enhances detailed visual reasoning in resource-constrained VLMs.
Abstract: Despite tremendous recent advances in large model reasoning ability, vision-language models (VLMs) still struggle with detailed visual reasoning, especially when compute resources are limited. To address this challenge, we draw inspiration from methods like Deepseek-r1 for VLMs and train smaller-scale models with Group Relative Policy Optimization (GRPO) to use external tools such as zoom. The greatest benefit is obtained with a combination of GRPO learning, a simple reward structure, a simplified tool-calling interface, allocating additional tokens to the result of the tool call, and a training data mix that over-represents visually difficult examples. Compared to similarly-sized baseline models, our method achieves better performance on some visual question-answering (VQA) tasks, thanks to the detailed visual information gathered from the external tool.
[962] Representative Action Selection for Large Action Space Meta-Bandits
Quan Zhou, Mark Kozdoba, Shie Mannor
Main category: cs.LG
TL;DR: A method for selecting a subset from a large action space in bandit problems, leveraging Gaussian process modeling and an epsilon-net algorithm, with theoretical and empirical validation.
Details
Motivation: To achieve near-optimal performance in bandit problems with large action spaces by exploiting similarity in action payoffs.
Method: Proposes an epsilon-net algorithm to select a representative subset, assuming Gaussian process modeling for payoff similarity.
Result: Theoretical guarantees provided; empirical comparison shows performance comparable to Thompson Sampling and Upper Confidence Bound.
Conclusion: The epsilon-net algorithm effectively reduces action space while maintaining performance, validated by theory and experiments.
Abstract: We study the problem of selecting a subset from a large action space shared by a family of bandits, with the goal of achieving performance nearly matching that of using the full action space. We assume that similar actions tend to have related payoffs, modeled by a Gaussian process. To exploit this structure, we propose a simple epsilon-net algorithm to select a representative subset. We provide theoretical guarantees for its performance and compare it empirically to Thompson Sampling and Upper Confidence Bound.
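The selection step is a textbook greedy epsilon-net; here is a sketch where Euclidean distance in an action-feature space stands in for the GP-modeled payoff similarity.

```python
import numpy as np

def epsilon_net(actions, eps):
    """Greedy epsilon-net: keep an action only if it is farther than eps
    from every representative kept so far."""
    net = []
    for a in actions:
        if all(np.linalg.norm(a - r) > eps for r in net):
            net.append(a)
    return np.array(net)
```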
[963] Hypercube-Based Retrieval-Augmented Generation for Scientific Question-Answering
Jimeng Shi, Sizhe Zhou, Bowen Jin, Wei Hu, Runchu Tian, Shaowen Wang, Giri Narasimhan, Jiawei Han
Main category: cs.LG
TL;DR: Hypercube-RAG introduces a multi-dimensional structure for precise and efficient retrieval in RAG, improving accuracy and speed over baselines.
Details
Motivation: Current RAG methods overlook structured semantic information in documents, which is critical for domain-specific tasks like scientific QA.
Method: Proposes Hypercube, a multi-dimensional indexing structure, and Hypercube-RAG, which decomposes queries and aligns them with cube dimensions for retrieval.
Result: Improves response accuracy by 3.7% and retrieval accuracy by 5.3%, with significantly faster retrieval than graph-based RAG.
Conclusion: Hypercube-RAG enhances retrieval precision, efficiency, and explainability, making it a strong solution for knowledge-intensive tasks.
Abstract: Large language models (LLMs) often need to incorporate external knowledge to solve theme-specific problems. Retrieval-augmented generation (RAG) has shown its high promise, empowering LLMs to generate more qualified responses with retrieved external data and knowledge. However, most RAG methods retrieve relevant documents based on either sparse or dense retrieval methods or their combinations, which overlooks the essential, multi-dimensional, and structured semantic information present in documents. This structured information plays a critical role in finding concise yet highly relevant information for domain knowledge-intensive tasks, such as scientific question-answering (QA). In this work, we introduce a multi-dimensional (cube) structure, Hypercube, which can index and allocate documents in a pre-defined multi-dimensional space. Built on the hypercube, we further propose Hypercube-RAG, a novel RAG framework for precise and efficient retrieval. Given a query, Hypercube-RAG first decomposes it based on its entities, phrases, and topics along with pre-defined hypercube dimensions, and then retrieves relevant documents from cubes by aligning these decomposed components with corresponding dimensions. Experiments on three datasets across different domains demonstrate that our method improves response accuracy by 3.7% and retrieval accuracy by 5.3% over the strongest RAG baseline. It also boosts retrieval efficiency, running one to two orders of magnitude faster than graph-based RAG. Notably, our Hypercube-RAG inherently offers explainability by revealing those underlying dimensions used for retrieval. The code and data are available at https://github.com/JimengShi/Hypercube-RAG.
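A toy version of the index clarifies the retrieval path: documents live in cells keyed by coordinates such as (entity, topic), and a decomposed query is matched cell-wise. The dimension names and the decomposition step are assumptions.

```python
from collections import defaultdict

class HypercubeIndex:
    """Toy multi-dimensional index in the spirit of Hypercube-RAG."""

    def __init__(self):
        self.cells = defaultdict(set)

    def add(self, doc_id, entities, topics):
        # Allocate the document to every (entity, topic) cell it touches.
        for e in entities:
            for t in topics:
                self.cells[(e, t)].add(doc_id)

    def retrieve(self, query_entities, query_topics):
        # Align decomposed query components with the same dimensions.
        hits = set()
        for e in query_entities:
            for t in query_topics:
                hits |= self.cells.get((e, t), set())
        return hits
```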
[964] Generalisation Bounds of Zero-Shot Economic Forecasting using Time Series Foundation Models
Jittarin Jetwiriyanon, Teo Susnjak, Surangika Ranathunga
Main category: cs.LG
TL;DR: TSFMs (Time Series Foundation Models) show promise in zero-shot forecasting for macroeconomic indicators, matching or outperforming classical models in stable conditions but struggling with rapid shocks.
Details
Motivation: To explore the zero-shot forecasting capabilities of TSFMs for macroeconomic indicators without needing extensive training data or custom models.
Method: Applied three TSFMs (Chronos, TimeGPT, Moirai) to univariate forecasting of economic indicators, tested under data-scarce conditions and structural breaks.
Result: TSFMs internalize economic dynamics, handle regime shifts, and provide uncertainty estimates, performing comparably to multivariate models in stable conditions but degrading during rapid shocks.
Conclusion: TSFMs are viable for zero-shot macroeconomic forecasting in stable conditions but require caution during periods of rapid shocks.
Abstract: This study investigates zero-shot forecasting capabilities of Time Series Foundation Models (TSFMs) for macroeconomic indicators. We apply TSFMs to forecasting economic indicators under univariate conditions, bypassing the need to train bespoke econometric models on extensive training datasets. Our experiments were conducted on a case study dataset, without additional customisation. We rigorously back-tested three state-of-the-art TSFMs (Chronos, TimeGPT and Moirai) under data-scarce conditions and structural breaks. Our results demonstrate that appropriately engineered TSFMs can internalise rich economic dynamics, accommodate regime shifts, and deliver well-behaved uncertainty estimates out of the box, while matching state-of-the-art multivariate models on this domain. Our findings suggest that, without any fine-tuning, TSFMs can match or exceed classical models during stable economic conditions. However, they are vulnerable to degradation in performance during periods of rapid shocks. The findings offer guidance to practitioners on when zero-shot deployments are viable for macroeconomic monitoring and strategic planning.
[965] Multi-VQC: A Novel QML Approach for Enhancing Healthcare Classification
Antonio Tudisco, Deborah Volpe, Giovanna Turvani
Main category: cs.LG
TL;DR: Quantum models are explored to address class imbalance in disease diagnosis, leveraging higher-dimensional computational space for better pattern recognition.
Details
Motivation: Class imbalance in disease diagnosis hinders traditional ML models, prompting interest in Quantum models for their superior pattern recognition capabilities.
Method: Utilizing Quantum models to map data into higher-dimensional spaces for improved classification.
Result: Quantum models show potential to overcome limitations of classical ML in handling imbalanced datasets.
Conclusion: Quantum models offer a promising alternative for accurate disease diagnosis by addressing class imbalance effectively.
Abstract: Accurate and reliable diagnosis of diseases is crucial in enabling timely medical treatment and enhancing patient survival rates. In recent years, Machine Learning has revolutionized diagnostic practices by creating classification models capable of identifying diseases. However, these classification problems often suffer from significant class imbalances, which can inhibit the effectiveness of traditional models. Therefore, the interest in Quantum models has arisen, driven by the captivating promise of overcoming the limitations of the classical counterpart thanks to their ability to express complex patterns by mapping data into a higher-dimensional computational space.
[966] A Closer Look on Memorization in Tabular Diffusion Model: A Data-Centric Perspective
Zhengyu Fang, Zhimeng Jiang, Huiyuan Chen, Xiaoge Zhang, Kaiyu Tang, Xiao Li, Jing Li
Main category: cs.LG
TL;DR: The paper studies memorization dynamics in tabular diffusion models, identifies heavily memorized samples, and proposes DynamicCut to mitigate privacy risks with minimal impact on data quality.
Details
Motivation: Diffusion models for tabular data risk reproducing exact training samples, posing privacy concerns. Prior work lacks focus on individual sample contributions to memorization.
Method: Quantify memorization per sample using a relative distance ratio, analyze training-time behaviors, and propose DynamicCut, a two-stage method to rank and prune high-memorization samples.
Result: Heavy-tailed memorization distribution; DynamicCut reduces memorization without harming diversity or performance, and works across models (e.g., GANs, VAEs).
Conclusion: DynamicCut effectively mitigates memorization risks in tabular diffusion models and is transferable to other generative models.
Abstract: Diffusion models have shown strong performance in generating high-quality tabular data, but they carry privacy risks by reproducing exact training samples. While prior work focuses on dataset-level augmentation to reduce memorization, little is known about which individual samples contribute most. We present the first data-centric study of memorization dynamics in tabular diffusion models. We quantify memorization for each real sample based on how many generated samples are flagged as replicas, using a relative distance ratio. Our empirical analysis reveals a heavy-tailed distribution of memorization counts: a small subset of samples contributes disproportionately to leakage, confirmed via sample-removal experiments. To understand this, we divide real samples into top- and non-top-memorized groups and analyze their training-time behaviors. We track when each sample is first memorized and monitor per-epoch memorization intensity (AUC). Memorized samples are memorized slightly earlier and show stronger signals in early training. Based on these insights, we propose DynamicCut, a two-stage, model-agnostic mitigation method: (a) rank samples by epoch-wise intensity, (b) prune a tunable top fraction, and (c) retrain on the filtered dataset. Across multiple tabular datasets and models, DynamicCut reduces memorization with minimal impact on data diversity and downstream performance. It also complements augmentation-based defenses. Furthermore, DynamicCut enables cross-model transferability: high-ranked samples identified from one model (e.g., a diffusion model) are also effective for reducing memorization when removed from others, such as GANs and VAEs.
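The pruning stage is simple to sketch once per-sample, per-epoch memorization intensities are available; the AUC-like aggregation below is one reading of the paper's ranking signal.

```python
import numpy as np

def dynamic_cut(intensity, prune_frac=0.05):
    """DynamicCut-style pruning sketch. intensity: (n_samples, n_epochs)
    array of per-epoch memorization scores; returns indices to retrain on."""
    scores = intensity.sum(axis=1)                       # area under the curve
    n_prune = int(len(scores) * prune_frac)
    keep = np.argsort(scores)[: len(scores) - n_prune]   # drop the top fraction
    return keep
```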
[967] Neural Networks as Universal Finite-State Machines: A Constructive Feedforward Simulation Framework for NFAs
Sahil Rajesh Dhayalkar
Main category: cs.LG
TL;DR: A framework for simulating NFAs using feedforward neural networks, achieving exact recognition of regular languages and practical trainability.
Details
Motivation: To bridge symbolic automata theory and neural networks by enabling precise, interpretable, and trainable symbolic computation without recurrent architectures.
Method: Symbolically encodes NFA states as binary vectors, transitions as sparse matrices, and nondeterministic branching as shared thresholded updates in feedforward networks.
Result: Proves exact recognition of regular languages and demonstrates practical trainability via gradient descent, with near-perfect experimental agreement.
Conclusion: Establishes a formal and practical link between NFAs and feedforward neural networks, enabling interpretable symbolic computation.
Abstract: We present a formal and constructive simulation framework for nondeterministic finite automata (NFAs) using standard feedforward neural networks. Unlike prior approaches that rely on recurrent architectures or post hoc extraction methods, our formulation symbolically encodes automaton states as binary vectors, transitions as sparse matrix transformations, and nondeterministic branching, including $\varepsilon$-closures, as compositions of shared thresholded updates. We prove that every regular language can be recognized exactly by a depth-unrolled feedforward network with shared parameters, independent of input length. Our construction yields not only formal equivalence between NFAs and neural networks, but also practical trainability: we demonstrate that these networks can learn NFA acceptance behavior through gradient descent using standard supervised data. Extensive experiments validate all theoretical results, achieving perfect or near-perfect agreement on acceptance, state propagation, and closure dynamics. This work establishes a new bridge between symbolic automata theory and modern neural architectures, showing that feedforward networks can perform precise, interpretable, and trainable symbolic computation.
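The construction is concrete enough to run: binary state vectors, 0/1 transition matrices, and a threshold implementing the nondeterministic OR. A small sketch with an example NFA accepting words ending in 1 (nondeterministic at state q0 on symbol 1).

```python
import numpy as np

def run_nfa(transitions, start, accept, word):
    """Simulate an NFA with thresholded linear updates: states are binary
    vectors, each symbol has a 0/1 transition matrix, and nondeterminism
    is an OR realized by thresholding."""
    s = start.astype(float)                            # indicator of active states
    for a in word:
        s = (transitions[a].T @ s >= 1).astype(float)  # reachable-state OR
    return bool(s @ accept >= 1)                       # accept if any final state on

# Example over {0, 1} with states (q0, q1); q1 is accepting.
T = {0: np.array([[1, 0], [1, 0]]),   # on '0' both states go to q0
     1: np.array([[1, 1], [0, 1]])}   # on '1' q0 branches to {q0, q1}, q1 stays
start, accept = np.array([1, 0]), np.array([0, 1])
assert run_nfa(T, start, accept, [0, 1]) and not run_nfa(T, start, accept, [1, 0])
```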
[968] Dynamic Modes as Time Representation for Spatiotemporal Forecasting
Menglin Kong, Vincent Zhihao Zheng, Xudong Wang, Lijun Sun
Main category: cs.LG
TL;DR: A data-driven time embedding method using Dynamic Mode Decomposition (DMD) for spatiotemporal forecasting, capturing multi-scale periodicity without explicit timestamps.
Details
Motivation: To model long-range seasonal dependencies in spatiotemporal forecasting without relying on hand-crafted time features.
Method: Uses DMD to extract temporal modes from data, integrating them into deep forecasting models.
Result: Improves long-horizon forecasting accuracy, reduces residual correlation, and enhances temporal generalization across urban mobility, traffic, and climate datasets.
Conclusion: The DMD-based embedding is lightweight, model-agnostic, and effective for capturing complex temporal patterns.
Abstract: This paper introduces a data-driven time embedding method for modeling long-range seasonal dependencies in spatiotemporal forecasting tasks. The proposed approach employs Dynamic Mode Decomposition (DMD) to extract temporal modes directly from observed data, eliminating the need for explicit timestamps or hand-crafted time features. These temporal modes serve as time representations that can be seamlessly integrated into deep spatiotemporal forecasting models. Unlike conventional embeddings such as time-of-day indicators or sinusoidal functions, our method captures complex multi-scale periodicity through spectral analysis of spatiotemporal data. Extensive experiments on urban mobility, highway traffic, and climate datasets demonstrate that the DMD-based embedding consistently improves long-horizon forecasting accuracy, reduces residual correlation, and enhances temporal generalization. The method is lightweight, model-agnostic, and compatible with any architecture that incorporates time covariates.
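Exact DMD itself is a short computation; the sketch below extracts eigenvalues (the temporal dynamics) and modes from a space-by-time snapshot matrix. How the resulting modes are wired into a forecasting model is left out.

```python
import numpy as np

def dmd_modes(snapshots, rank=8):
    """Exact DMD on a (space x time) matrix: low-rank fit of the one-step
    linear operator. eigvals**t gives the temporal signatures that could
    serve as time embeddings; the paper's exact usage may differ."""
    X, Y = snapshots[:, :-1], snapshots[:, 1:]
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank]
    A_tilde = U.T @ Y @ Vt.T @ np.diag(1.0 / s)   # reduced linear operator
    eigvals, W = np.linalg.eig(A_tilde)
    modes = Y @ Vt.T @ np.diag(1.0 / s) @ W       # spatial DMD modes
    return eigvals, modes
```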
[969] VerificAgent: Domain-Specific Memory Verification for Scalable Oversight of Aligned Computer-Use Agents
Thong Q. Nguyen, Shubhang Desai, Raja Hasnain Anwar, Firoz Shaik, Vishwas Suryanarayanan, Vishal Chowdhary
Main category: cs.LG
TL;DR: VerificAgent introduces a scalable oversight framework for CUAs, using human-verified memory to improve reliability and safety without fine-tuning.
Details
Motivation: To address unsafe heuristics and policy drift in CUAs by ensuring memory aligns with user intent and safety constraints.
Method: Combines expert-curated knowledge, iterative memory growth, and post-hoc human fact-checking to sanitize memories.
Result: Improves task reliability, reduces failures, and maintains interpretable guidance in OSWorld tasks and stress tests.
Conclusion: Human-verified memory offers a scalable oversight mechanism, anchoring agent behavior to domain norms and safety.
Abstract: Continual memory augmentation lets computer-using agents (CUAs) learn from prior interactions, but unvetted memories can encode domain-inappropriate or unsafe heuristics–spurious rules that drift from user intent and safety constraints. We introduce VerificAgent, a scalable oversight framework that treats persistent memory as an explicit alignment surface. VerificAgent combines (1) an expert-curated seed of domain knowledge, (2) iterative, trajectory-based memory growth during training, and (3) a post-hoc human fact-checking pass to sanitize accumulated memories before deployment. Evaluated on OSWorld productivity tasks and additional adversarial stress tests, VerificAgent improves task reliability, reduces hallucination-induced failures, and preserves interpretable, auditable guidance–without additional model fine-tuning. By letting humans correct high-impact errors once, the verified memory acts as a frozen safety contract that future agent actions must satisfy. Our results suggest that domain-scoped, human-verified memory offers a scalable oversight mechanism for CUAs, complementing broader alignment strategies by limiting silent policy drift and anchoring agent behavior to the norms and safety constraints of the target domain.
[970] Fidelity Isn’t Accuracy: When Linearly Decodable Functions Fail to Match the Ground Truth
Jackson Eshbaugh
Main category: cs.LG
TL;DR: The paper introduces a linearity score, λ(f), to measure how well a neural network’s output can be approximated by a linear model, revealing potential risks in using surrogate fidelity for model understanding.
Details
Motivation: Neural networks are powerful but opaque. The authors aim to quantify how linear a network's behavior is to improve interpretability.
Method: They propose λ(f), the R² value between a network’s predictions and a trained linear surrogate, tested on synthetic and real-world datasets.
Result: High λ(f) scores indicate alignment with linear surrogates but do not guarantee accuracy with ground truth.
Conclusion: Surrogate fidelity (λ(f)) should not be conflated with model understanding, especially in critical regression tasks.
Abstract: Neural networks excel as function approximators, but their complexity often obscures what kinds of functions they learn. We introduce the linearity score $\lambda(f)$, a simple and interpretable diagnostic that quantifies how well a regression network’s output can be mimicked by a linear model. Defined as the $R^2$ value between the network’s predictions and those of a trained linear surrogate, $\lambda(f)$ measures linear decodability: the extent to which the network’s behavior aligns with a structurally simple model. We evaluate this framework on both synthetic ($y = x \cdot \sin(x) + \epsilon$) and real-world datasets (Medical Insurance, Concrete, California Housing), using dataset-specific networks and surrogates. Our findings show that high $\lambda(f)$ scores reliably indicate alignment with the network’s outputs – but do not guarantee accuracy with respect to the ground truth. These results highlight the risk of using surrogate fidelity as a proxy for model understanding – especially in high-stakes regression tasks.
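The linearity score is straightforward to reproduce; below is a minimal sketch following the paper's definition of $\lambda(f)$ as the $R^2$ between network predictions and a trained linear surrogate. The toy target function is our own illustration, not the paper's experimental setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def linearity_score(model_predict, X):
    """lambda(f): R^2 between a network's predictions on X and those of a
    linear surrogate trained to mimic them (a sketch of the paper's idea)."""
    y_net = model_predict(X)                      # network outputs on X
    surrogate = LinearRegression().fit(X, y_net)  # linear mimic of the net
    return r2_score(y_net, surrogate.predict(X))

# Example with a toy nonlinear "network": lambda(f) is high only where the
# function is nearly linear over the sampled region -- and, as the paper
# stresses, says nothing about accuracy against the ground truth.
X = np.random.uniform(-3, 3, size=(1000, 1))
f = lambda X: X[:, 0] * np.sin(X[:, 0])
print(linearity_score(f, X))
```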
[971] Convergence Bound and Critical Batch Size of Muon Optimizer
Naoki Sato, Hiroki Naganuma, Hideaki Iiduka
Main category: cs.LG
TL;DR: Muon, a new optimizer leveraging matrix structure in neural networks, shows promise as an AdamW successor. This paper provides theoretical convergence proofs for Muon in four settings, analyzes its behavior with Nesterov momentum and weight decay, and derives the critical batch size for optimal training cost.
Details
Motivation: To theoretically validate Muon's empirical success and understand its behavior in practical settings, including the impact of Nesterov momentum and weight decay.
Method: Theoretical analysis of Muon’s convergence in four settings, examining its behavior with and without Nesterov momentum and weight decay. Derivation of critical batch size and hyperparameter analysis.
Result: Convergence proofs for Muon, tighter bounds with weight decay, clarified interplay between weight decay and learning rate, and identified critical batch size for cost efficiency.
Conclusion: Muon’s theoretical foundations support its empirical performance, with weight decay improving bounds and critical batch size optimizing training cost.
Abstract: Muon, a recently proposed optimizer that leverages the inherent matrix structure of neural network parameters, has demonstrated strong empirical performance, indicating its potential as a successor to standard optimizers such as AdamW. This paper presents theoretical analysis to support its practical success. We provide convergence proofs for Muon across four practical settings, systematically examining its behavior with and without the inclusion of Nesterov momentum and weight decay. Our analysis covers the standard configuration using both, thereby elucidating its real-world performance. We then demonstrate that the addition of weight decay yields strictly tighter theoretical bounds and clarify the interplay between the weight decay coefficient and the learning rate. Finally, we derive the critical batch size for Muon that minimizes the computational cost of training. Our analysis identifies the hyperparameters governing this value, and our experiments validate the corresponding theoretical findings.
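For readers unfamiliar with the optimizer, the following is a simplified sketch of a Muon-style update covering the settings analyzed here (Nesterov momentum and decoupled weight decay). The Newton-Schulz coefficients and hyperparameters are commonly used values, assumed rather than taken from this paper.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G (push its singular values toward 1)
    using the quintic Newton-Schulz iteration commonly paired with Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315   # widely used coefficients (assumed)
    X = G / (G.norm() + 1e-7)           # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95, weight_decay=0.01):
    """One Muon-style step: momentum buffer -> Nesterov blend ->
    orthogonalized update -> decoupled weight decay. Illustrative values."""
    momentum.mul_(beta).add_(grad)
    update = grad + beta * momentum          # Nesterov-style lookahead
    O = newton_schulz_orthogonalize(update)
    W.mul_(1 - lr * weight_decay).add_(O, alpha=-lr)
    return W, momentum

W = torch.randn(256, 128)
W, m = muon_step(W, torch.randn_like(W), torch.zeros_like(W))
```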
[972] Hierarchical Multi-Label Contrastive Learning for Protein-Protein Interaction Prediction Across Organisms
Shiyi Liu, Buwen Liang, Yuetong Fang, Zixuan Jiang, Renjing Xu
Main category: cs.LG
TL;DR: HIPPO is a hierarchical contrastive framework for protein-protein interaction prediction, leveraging multi-tiered biological representation matching and hierarchical contrastive loss. It outperforms existing methods and shows robustness in low-data regimes and zero-shot transferability.
Details
Motivation: To bridge heterogeneous biological data modalities and improve protein-protein interaction prediction by incorporating hierarchical attributes and domain knowledge.
Method: Uses a hierarchical contrastive framework with multi-tiered biological representation matching and data-driven penalty mechanisms to align protein sequences and their hierarchical attributes.
Result: Achieves state-of-the-art performance, robustness in low-data regimes, and strong zero-shot transferability to other species.
Conclusion: HIPPO advances cross-species PPI prediction and provides a unified framework for sparse or imbalanced multi-species data.
Abstract: Recent advances in AI for science have highlighted the power of contrastive learning in bridging heterogeneous biological data modalities. Building on this paradigm, we propose HIPPO (HIerarchical Protein-Protein interaction prediction across Organisms), a hierarchical contrastive framework for protein-protein interaction(PPI) prediction, where protein sequences and their hierarchical attributes are aligned through multi-tiered biological representation matching. The proposed approach incorporates hierarchical contrastive loss functions that emulate the structured relationship among functional classes of proteins. The framework adaptively incorporates domain and family knowledge through a data-driven penalty mechanism, enforcing consistency between the learned embedding space and the intrinsic hierarchy of protein functions. Experiments on benchmark datasets demonstrate that HIPPO achieves state-of-the-art performance, outperforming existing methods and showing robustness in low-data regimes. Notably, the model demonstrates strong zero-shot transferability to other species without retraining, enabling reliable PPI prediction and functional inference even in less characterized or rare organisms where experimental data are limited. Further analysis reveals that hierarchical feature fusion is critical for capturing conserved interaction determinants, such as binding motifs and functional annotations. This work advances cross-species PPI prediction and provides a unified framework for interaction prediction in scenarios with sparse or imbalanced multi-species data.
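As an illustration of the hierarchical contrastive idea, here is a toy PyTorch loss that weights positive pairs by how deep in a label hierarchy they agree. This is our own simplified variant, not HIPPO's exact objective or its data-driven penalty mechanism.

```python
import torch
import torch.nn.functional as F

def hierarchical_contrastive_loss(z, labels, temp=0.1):
    """Toy hierarchical supervised-contrastive loss: pairs sharing deeper
    hierarchy levels are pulled together more strongly. `labels` is (N, L),
    ordered coarse (level 0) to fine (level L-1). Illustrative only."""
    z = F.normalize(z, dim=1)
    sim = z @ z.T / temp                                      # (N, N)
    N, L = labels.shape
    agree = (labels[:, None, :] == labels[None, :, :]).float()  # (N, N, L)
    depth = agree.cumprod(dim=2).sum(dim=2)   # deepest shared prefix, 0..L
    weight = depth / L                                        # in [0, 1]
    eye = torch.eye(N, dtype=torch.bool)
    weight = weight.masked_fill(eye, 0.0)                     # drop self-pairs
    log_p = sim - torch.logsumexp(sim.masked_fill(eye, float('-inf')),
                                  dim=1, keepdim=True)
    return -(weight * log_p).sum() / weight.sum().clamp_min(1e-8)

# Example: 8 proteins with (family, subfamily) annotations
z = torch.randn(8, 32)
labels = torch.tensor([[0,0],[0,0],[0,1],[0,1],[1,2],[1,2],[1,3],[1,3]])
print(hierarchical_contrastive_loss(z, labels))
```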
[973] Fourier Basis Mapping: A Time-Frequency Learning Framework for Time Series Forecasting
Runze Yang, Longbing Cao, Xin You, Kun Fang, Jianxun Li, Jie Yang
Main category: cs.LG
TL;DR: The paper introduces the Fourier Basis Mapping (FBM) method to address issues in Fourier-based time series forecasting by integrating time-frequency features, enhancing neural networks, and achieving SOTA performance.
Details
Motivation: Existing Fourier-based methods suffer from inconsistent starting cycles and series length issues, and overlook temporal information, limiting their effectiveness in time series forecasting.
Method: Proposes FBM, which integrates time-frequency features through Fourier basis expansion and mapping, and introduces variants (FBM-L, FBM-NL, FBM-NP, FBM-S) for different neural networks. Techniques like interaction masking and multi-scale down-sampling are also introduced.
Result: FBM demonstrates superior performance in both long-term and short-term forecasting tasks across diverse real-world datasets.
Conclusion: The FBM method effectively addresses limitations of existing Fourier-based approaches, offering a versatile and high-performing solution for time series forecasting.
Abstract: The integration of Fourier transform and deep learning opens new avenues for time series forecasting. We reconsider the Fourier transform from a basis functions perspective. Specifically, the real and imaginary parts of the frequency components can be regarded as the coefficients of cosine and sine basis functions at tiered frequency levels, respectively. We find that existing Fourier-based methods face inconsistent starting cycles and inconsistent series length issues. They fail to interpret frequency components precisely and overlook temporal information. Accordingly, the novel Fourier Basis Mapping (FBM) method addresses these issues by integrating time-frequency features through Fourier basis expansion and mapping in the time-frequency space. Our approach extracts explicit frequency features while preserving temporal characteristics. FBM supports plug-and-play integration with various types of neural networks by only adjusting the first initial projection layer for better performance. First, we propose FBM-L, FBM-NL, and FBM-NP to enhance linear, MLP-based, and Transformer-based models, respectively, demonstrating the effectiveness of time-frequency features. Next, we propose a synergetic model architecture, termed FBM-S, which decomposes the seasonal, trend, and interaction effects into three separate blocks, each designed to model time-frequency features in a specialized manner. Finally, we introduce several techniques tailored for time-frequency features, including interaction masking, centralization, patching, rolling window projection, and multi-scale down-sampling. The results are validated on diverse real-world datasets for both long-term and short-term forecasting tasks with SOTA performance.
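A minimal numpy sketch of the basis-functions reading described above: rFFT real/imag parts act as coefficients of cosine/sine bases, which are expanded back over time to form time-frequency features. This illustrates only the mapping idea, not FBM's full architecture or its specialized blocks.

```python
import numpy as np

def fourier_basis_mapping(x):
    """Map a time series (T,) to time-frequency features by treating rFFT
    real/imag parts as coefficients of cosine/sine bases evaluated over the
    window -- a simplified reading of the FBM idea."""
    T = len(x)
    coeffs = np.fft.rfft(x) / T                 # per-frequency coefficients
    t = np.arange(T)
    k = np.arange(len(coeffs))
    # basis functions: cosine/sine at each tiered frequency level
    cos_basis = np.cos(2 * np.pi * k[None, :] * t[:, None] / T)   # (T, K)
    sin_basis = np.sin(2 * np.pi * k[None, :] * t[:, None] / T)
    # cosine/sine contributions of each frequency at each step (up to
    # normalization): explicit frequency features that keep the time axis
    tf_feats = np.concatenate([coeffs.real * cos_basis,
                               -coeffs.imag * sin_basis], axis=1)  # (T, 2K)
    return tf_feats

x = np.sin(2 * np.pi * np.arange(96) / 24) + 0.1 * np.random.randn(96)
print(fourier_basis_mapping(x).shape)   # (96, 98): T=96 gives K=49 frequencies
```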
[974] ADAPT: A Pseudo-labeling Approach to Combat Concept Drift in Malware Detection
Md Tanvirul Alam, Aritran Piplai, Nidhi Rastogi
Main category: cs.LG
TL;DR: ADAPT is a novel semi-supervised algorithm for addressing concept drift in malware detection, outperforming baselines across diverse datasets.
Details
Motivation: Machine learning models for malware classification degrade over time due to concept drift, requiring costly updates. Semi-supervised learning is underexplored for this problem.
Method: ADAPT, a model-agnostic pseudo-labeling semi-supervised algorithm, is introduced and tested on five malware detection datasets.
Result: ADAPT consistently outperforms baseline models and benchmarks across Android, Windows, and PDF malware datasets.
Conclusion: The work enables more effective adaptation of ML models to concept drift in malware detection.
Abstract: Machine learning models are commonly used for malware classification; however, they suffer from performance degradation over time due to concept drift. Adapting these models to changing data distributions requires frequent updates, which rely on costly ground truth annotations. While active learning can reduce the annotation burden, leveraging unlabeled data through semi-supervised learning remains a relatively underexplored approach in the context of malware detection. In this research, we introduce \texttt{ADAPT}, a novel pseudo-labeling semi-supervised algorithm for addressing concept drift. Our model-agnostic method can be applied to various machine learning models, including neural networks and tree-based algorithms. We conduct extensive experiments on five diverse malware detection datasets spanning Android, Windows, and PDF domains. The results demonstrate that our method consistently outperforms baseline models and competitive benchmarks. This work paves the way for more effective adaptation of machine learning models to concept drift in malware detection.
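Since ADAPT is model-agnostic pseudo-labeling, its core loop can be sketched in a few lines of scikit-learn. The confidence threshold, round count, and base model below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pseudo_label_adapt(model, X_labeled, y_labeled, X_drift,
                       threshold=0.9, rounds=3):
    """Model-agnostic pseudo-labeling loop in the spirit of ADAPT:
    confidently predicted drifted samples are folded back into training.
    Threshold, rounds, and model are illustrative choices."""
    X_train, y_train = X_labeled.copy(), y_labeled.copy()
    for _ in range(rounds):
        model.fit(X_train, y_train)
        proba = model.predict_proba(X_drift)
        keep = proba.max(axis=1) >= threshold     # high-confidence samples
        if not keep.any():
            break
        X_train = np.vstack([X_labeled, X_drift[keep]])
        y_train = np.concatenate([y_labeled, proba[keep].argmax(axis=1)])
    return model

# Works with any classifier exposing predict_proba (trees, neural nets, ...)
model = pseudo_label_adapt(RandomForestClassifier(),
                           np.random.randn(200, 16),
                           np.random.randint(0, 2, 200),
                           np.random.randn(500, 16))
```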
[975] Rec-AD: An Efficient Computation Framework for FDIA Detection Based on Tensor Train Decomposition and Deep Learning Recommendation Model
Yunfeng Li, Junhong Liu, Zhaohui Yang, Guofu Liao, Chuyun Zhang
Main category: cs.LG
TL;DR: Rec-AD, a framework combining Tensor Train decomposition and DLRM, improves FDIA detection efficiency in smart grids by reducing computational and memory burdens.
Details
Motivation: Addressing the computational and memory inefficiencies in deep learning-based FDIA detection for large-scale smart grids.
Method: Integrates Tensor Train decomposition with DLRM, uses embedding compression, index reordering, and pipeline training to optimize efficiency.
Result: Enhances computational throughput and real-time detection, reducing attack windows and increasing attacker costs.
Conclusion: Rec-AD offers scalable, efficient FDIA detection, strengthening smart grid security with minimal integration effort.
Abstract: Deep learning models have been widely adopted for False Data Injection Attack (FDIA) detection in smart grids due to their ability to capture unstructured and sparse features. However, the increasing system scale and data dimensionality introduce significant computational and memory burdens, particularly in large-scale industrial datasets, limiting detection efficiency. To address these issues, this paper proposes Rec-AD, a computationally efficient framework that integrates Tensor Train decomposition with the Deep Learning Recommendation Model (DLRM). Rec-AD enhances training and inference efficiency through embedding compression, optimized data access via index reordering, and a pipeline training mechanism that reduces memory communication overhead. Fully compatible with PyTorch, Rec-AD can be integrated into existing FDIA detection systems without code modifications. Experimental results show that Rec-AD significantly improves computational throughput and real-time detection performance, narrowing the attack window and increasing attacker cost. These advancements strengthen edge computing capabilities and scalability, providing robust technical support for smart grid security.
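To see why Tensor Train decomposition shrinks embedding tables, consider a minimal two-core TT-matrix embedding in numpy: the row index is split into two factors and each row is reconstructed on the fly from small cores. The shapes and rank below are illustrative, not Rec-AD's layout.

```python
import numpy as np

class TTEmbedding:
    """Two-core TT-matrix embedding: an (n1*n2) x (d1*d2) table stored as
    cores of shapes (n1, d1, r) and (r, n2, d2). Illustrates how TT
    decomposition compresses embedding memory; Rec-AD's layout is richer."""
    def __init__(self, n1, n2, d1, d2, rank=8, seed=0):
        rng = np.random.default_rng(seed)
        self.n2, self.d1, self.d2 = n2, d1, d2
        self.G1 = rng.normal(size=(n1, d1, rank)) / np.sqrt(rank)
        self.G2 = rng.normal(size=(rank, n2, d2))

    def lookup(self, i):
        i1, i2 = divmod(i, self.n2)      # mixed-radix split of the row index
        # contract over the TT rank: (d1, r) @ (r, d2) -> (d1, d2)
        row = self.G1[i1] @ self.G2[:, i2, :]
        return row.reshape(self.d1 * self.d2)

# A 10000 x 64 table stored with ~50x fewer parameters (12,800 vs 640,000)
emb = TTEmbedding(n1=100, n2=100, d1=8, d2=8, rank=8)
vec = emb.lookup(4242)   # 64-dim embedding reconstructed on the fly
```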
[976] Clustered Federated Learning for Generalizable FDIA Detection in Smart Grids with Heterogeneous Data
Yunfeng Li, Junhong Liu, Zhaohui Yang, Guofu Liao, Chuyun Zhang
Main category: cs.LG
TL;DR: Proposes FedClusAvg, a federated learning framework for detecting False Data Injection Attacks (FDIAs) in smart grids, addressing Non-IID data and privacy concerns.
Details
Motivation: FDIAs threaten smart grids, and traditional centralized methods face privacy, data sharing, and scalability issues.
Method: FedClusAvg uses cluster-based stratified sampling and hierarchical communication (client-subserver-server) for localized training and weighted parameter aggregation.
Result: Improves detection accuracy in Non-IID settings, reduces communication rounds, and lowers bandwidth usage.
Conclusion: FedClusAvg offers a secure, efficient solution for FDIA detection in distributed power systems.
Abstract: False Data Injection Attacks (FDIAs) pose severe security risks to smart grids by manipulating measurement data collected from spatially distributed devices such as SCADA systems and PMUs. These measurements typically exhibit Non-Independent and Identically Distributed (Non-IID) characteristics across different regions, which significantly challenges the generalization ability of detection models. Traditional centralized training approaches not only face privacy risks and data sharing constraints but also incur high transmission costs, limiting their scalability and deployment feasibility. To address these issues, this paper proposes a privacy-preserving federated learning framework, termed Federated Cluster Average (FedClusAvg), designed to improve FDIA detection in Non-IID and resource-constrained environments. FedClusAvg incorporates cluster-based stratified sampling and hierarchical communication (client-subserver-server) to enhance model generalization and reduce communication overhead. By enabling localized training and weighted parameter aggregation, the algorithm achieves accurate model convergence without centralizing sensitive data. Experimental results on benchmark smart grid datasets demonstrate that FedClusAvg not only improves detection accuracy under heterogeneous data distributions but also significantly reduces communication rounds and bandwidth consumption. This work provides an effective solution for secure and efficient FDIA detection in large-scale distributed power systems.
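The hierarchical client-subserver-server aggregation can be sketched as two levels of sample-count-weighted averaging. The data structure below is our own simplification and omits the paper's stratified sampling details.

```python
import numpy as np

def weighted_average(params_list, weights):
    """Average parameter vectors weighted by (normalized) sample counts."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * p for wi, p in zip(w, params_list))

def fed_clus_avg_round(clusters):
    """One hierarchical round in the spirit of FedClusAvg:
    clients -> subserver (per cluster) -> server, weighted by sample counts.
    `clusters` maps cluster_id -> list of (client_params, n_samples)."""
    cluster_models, cluster_sizes = [], []
    for members in clusters.values():
        params = [p for p, _ in members]
        sizes = [n for _, n in members]
        cluster_models.append(weighted_average(params, sizes))  # subserver
        cluster_sizes.append(sum(sizes))
    return weighted_average(cluster_models, cluster_sizes)      # server

clusters = {0: [(np.ones(4), 100), (2 * np.ones(4), 300)],
            1: [(4 * np.ones(4), 200)]}
print(fed_clus_avg_round(clusters))
```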
[977] A Comprehensive Review of Diffusion Models in Smart Agriculture: Progress, Applications, and Challenges
Xing Hu, Haodong Chen, Choon Ki Ahn, Danfeng Hong, Qianqian Duan, Huiliang Shang, Guoxiang Li, Linhua Jiang, Dawei Zhang
Main category: cs.LG
TL;DR: Diffusion models outperform GANs in agricultural AI tasks like crop monitoring and pest detection, offering better stability and image quality. They enhance model performance in accuracy and robustness, though challenges like computational efficiency remain.
Details
Motivation: The need for sustainable agriculture amid limited arable land and growing population drives the adoption of AI, particularly diffusion models, for tasks like crop disease detection and yield prediction.
Method: The paper reviews diffusion models’ applications in agriculture, focusing on image processing, data augmentation, and remote sensing analysis, comparing them to traditional GANs.
Result: Diffusion models improve downstream model performance in accuracy, robustness, and generalization, especially in image synthesis and denoising under complex conditions.
Conclusion: Despite challenges, diffusion models hold significant promise for advancing intelligent agriculture and addressing global food security and sustainability issues.
Abstract: With the global population increasing and arable land resources becoming increasingly limited, smart and precision agriculture have emerged as essential directions for sustainable agricultural development. Artificial intelligence (AI), particularly deep learning models, has been widely adopted in applications such as crop monitoring, pest detection, and yield prediction. Among recent generative models, diffusion models have demonstrated considerable potential in agricultural image processing, data augmentation, and remote sensing analysis. Compared to traditional generative adversarial networks (GANs), diffusion models exhibit greater training stability and superior image generation quality, effectively addressing challenges such as limited annotated datasets and imbalanced sample distributions in agricultural scenarios. This paper reviews recent advancements in the application of diffusion models within agriculture, focusing on their roles in crop disease and pest detection, remote sensing image enhancement, crop growth prediction, and agricultural resource management. Empirical studies show that diffusion models significantly enhance the performance of downstream models by improving accuracy, robustness, and generalization in tasks involving image synthesis, augmentation, and denoising under complex environmental conditions. Despite ongoing challenges in computational efficiency and domain generalization, diffusion models are expected to play an increasingly important role in the future of intelligent agriculture. As the technology continues to evolve, it holds substantial promise for addressing pressing global issues in food security and environmental sustainability.
[978] Your Attention Matters: to Improve Model Robustness to Noise and Spurious Correlations
Camilo Tamayo-Rousseau, Yunjia Zhao, Yiqun Zhang, Randall Balestriero
Main category: cs.LG
TL;DR: Doubly Stochastic attention is the most robust among tested self-attention mechanisms in Vision Transformers under data corruption.
Details
Motivation: To study the robustness of various self-attention mechanisms (Softmax, Sigmoid, Linear, Doubly Stochastic, Cosine) in Vision Transformers under noisy or corrupted data conditions.
Method: Evaluated five self-attention mechanisms in Vision Transformers across CIFAR-10, CIFAR-100, and Imagenette datasets under different data corruption scenarios.
Result: Doubly Stochastic attention consistently outperformed others by 0.1%-5.1% in corrupted data conditions.
Conclusion: Doubly Stochastic attention is recommended for contexts with imperfect data due to its superior robustness.
Abstract: Self-attention mechanisms are foundational to Transformer architectures, supporting their impressive success in a wide range of tasks. While there are many self-attention variants, their robustness to noise and spurious correlations has not been well studied. This study evaluates Softmax, Sigmoid, Linear, Doubly Stochastic, and Cosine attention within Vision Transformers under different data corruption scenarios. Through testing across the CIFAR-10, CIFAR-100, and Imagenette datasets, we show that Doubly Stochastic attention is the most robust. It consistently outperformed the next best mechanism by $0.1\%$-$5.1\%$ when training data, or both training and testing data, were corrupted. Our findings inform self-attention selection in contexts with imperfect data. The code used is available at https://github.com/ctamayor/NeurIPS-Robustness-ViT.
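Doubly Stochastic attention is typically obtained by Sinkhorn-normalizing the attention logits so rows and columns both sum to one. The sketch below shows that mechanism in PyTorch; the iteration count and scaling are illustrative knobs, and the paper's exact variant may differ.

```python
import torch

def doubly_stochastic_attention(q, k, v, n_iters=5, temp=1.0):
    """Attention whose weight matrix is projected toward doubly stochastic
    (rows and columns each sum to 1) via Sinkhorn iterations in log space."""
    logits = (q @ k.transpose(-2, -1)) / (temp * q.shape[-1] ** 0.5)
    log_P = logits
    for _ in range(n_iters):           # alternate row/column normalization
        log_P = log_P - torch.logsumexp(log_P, dim=-1, keepdim=True)
        log_P = log_P - torch.logsumexp(log_P, dim=-2, keepdim=True)
    return torch.exp(log_P) @ v

q = k = v = torch.randn(2, 16, 64)             # (batch, tokens, dim)
out = doubly_stochastic_attention(q, k, v)     # (2, 16, 64)
```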
[979] Systolic Array-based Accelerator for State-Space Models
Shiva Raja, Cansu Demirkiran, Aakash Sarkar, Milos Popovic, Ajay Joshi
Main category: cs.LG
TL;DR: EpochCore, a specialized hardware accelerator for State-Space Models (SSMs), improves performance and energy efficiency for long-range sequence tasks using systolic arrays and a novel dataflow.
Details
Motivation: Existing models (RNNs, CNNs, Transformers) struggle with long sequences due to memory limitations. SSMs offer better efficiency but require intensive computation, motivating hardware acceleration.
Method: Introduces EpochCore, a systolic array-based accelerator with LIMA-PE for versatile operations and ProDF dataflow for efficient SSM execution.
Result: EpochCore achieves 250x performance gain, 45x energy efficiency improvement, and ~2,000x latency reduction over GPUs.
Conclusion: EpochCore effectively addresses the computational challenges of SSMs, enabling efficient processing of long sequences.
Abstract: Sequence modeling is crucial for AI to understand temporal data and detect complex time-dependent patterns. While recurrent neural networks (RNNs), convolutional neural networks (CNNs), and Transformers have advanced in capturing long-range dependencies, they struggle to achieve high accuracy with very long sequences due to limited memory retention (fixed context window). State-Space Models (SSMs) leverage exponentially decaying memory, enabling a lengthy context window, and so process very long data sequences more efficiently than recurrent and Transformer-based models. Unlike traditional neural models like CNNs and RNNs, SSM-based models require solving differential equations through continuous integration, making training and inference both compute- and memory-intensive on conventional CPUs and GPUs. In this paper we introduce a specialized hardware accelerator, EpochCore, for accelerating SSMs. EpochCore is based on systolic arrays (SAs) and is designed to enhance the energy efficiency and throughput of inference of SSM-based models for long-range sequence tasks. Within the SA, we propose a versatile processing element (PE) called LIMA-PE to perform traditional and specialized MAC operations to support traditional DNNs and SSMs. To complement the EpochCore microarchitecture, we propose a novel dataflow, ProDF, which enables highly efficient execution of SSM-based models. By leveraging the LIMA-PE microarchitecture and ProDF, EpochCore achieves on average 250x gains in performance and 45x improvement in energy efficiency, at the expense of a 2x increase in area cost over traditional SA-based accelerators, and a ~2,000x improvement in latency/inference on LRA datasets compared to GPU kernel operations.
[980] KLLM: Fast LLM Inference with K-Means Quantization
Xueying Wu, Baijun Zhou, Zhihui Gao, Yuzhe Fu, Qilin Zheng, Yintao He, Hai Li
Main category: cs.LG
TL;DR: KLLM is an LLM inference accelerator using K-Means quantization for efficient execution, avoiding dequantization and full-precision computations, and includes a lightweight outlier detection engine.
Details
Motivation: Address challenges in deploying K-Means-based weight and activation quantization (WAQ) for LLM inference, such as non-uniform data structure and activation outliers.
Method: Proposes KLLM with an index-based computation scheme for efficient MatMuls and nonlinear operations on K-Means-quantized data, and Orizuru for lightweight outlier detection.
Result: Enables efficient execution of LLM inference with reduced memory and computation demands while maintaining accuracy.
Conclusion: KLLM effectively leverages K-Means quantization for LLM inference, overcoming traditional WAQ limitations and improving efficiency.
Abstract: Large language model (LLM) inference poses significant challenges due to its intensive memory and computation demands. Weight and activation quantization (WAQ) offers a promising solution by reducing both memory footprint and arithmetic complexity. Traditional WAQ designs rely on uniform integer quantization for hardware efficiency, but often suffer from significant model performance degradation at low precision. In contrast, K-Means quantization, a non-uniform technique, achieves higher accuracy by aligning with the Gaussian-like distributions of weights and activations in LLMs. However, two key challenges prevent the efficient deployment of K-Means-based WAQ designs for LLM inference: (1) The non-uniform structure of K-Means-quantized data precludes direct execution on low-precision compute units, necessitating dequantization and floating-point matrix multiplications (MatMuls) during inference. (2) Activation outliers hinder effective low-precision quantization. Offline thresholding methods for outlier detection degrade model performance substantially, while existing online detection techniques introduce significant runtime overhead. To address the aforementioned challenges and fully unleash the potential of K-Means-based WAQ for LLM inference, in this paper, we propose KLLM, an LLM inference accelerator for efficient execution with K-Means-quantized weights and activations. KLLM features an index-based computation scheme for efficient execution of MatMuls and nonlinear operations on K-Means-quantized data, which avoids most of the dequantization and full-precision computations. Moreover, KLLM incorporates a lightweight outlier detection engine, Orizuru, that efficiently identifies the top-$k$ largest and smallest elements in the activation data stream during online inference.
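The index-based computation scheme can be illustrated with a scalar K-Means quantizer and a matrix-vector product that never dequantizes the weights: activations are bucket-summed by centroid index, then reduced against the K centroid values. This numpy sketch conveys the idea only; KLLM's hardware datapath and the Orizuru engine are far more involved.

```python
import numpy as np

def kmeans_quantize(W, K=16, iters=10):
    """Scalar K-Means over all weight entries: returns centroids (K,) and
    integer indices with W's shape. A toy stand-in for KLLM's quantizer."""
    flat = W.ravel()
    centroids = np.quantile(flat, np.linspace(0, 1, K))   # spread initial codes
    for _ in range(iters):
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(K):
            if (idx == k).any():
                centroids[k] = flat[idx == k].mean()
    idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, idx.reshape(W.shape)

def index_based_matvec(centroids, idx, x):
    """y = W x without dequantizing W: per output row, bucket-sum the
    activations by centroid index, then take one K-length dot product with
    the centroid values (the 'index-based computation' idea, sketched)."""
    K = len(centroids)
    y = np.empty(idx.shape[0])
    for i, row_idx in enumerate(idx):
        bucket_sums = np.bincount(row_idx, weights=x, minlength=K)
        y[i] = centroids @ bucket_sums        # K multiplies instead of n
    return y

W, x = np.random.randn(64, 256), np.random.randn(256)
c, idx = kmeans_quantize(W)
print(np.abs(index_based_matvec(c, idx, x) - W @ x).max())  # quantization error only
```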
[981] BAR Conjecture: the Feasibility of Inference Budget-Constrained LLM Services with Authenticity and Reasoning
Jinan Zhou, Rajat Ghosh, Vaishnavi Bhargava, Debojyoti Dutta, Aryan Singhal
Main category: cs.LG
TL;DR: The BAR Theorem framework addresses the trade-off between inference-time budget, factual authenticity, and reasoning capacity in LLM design.
Details
Motivation: Practitioners struggle to optimize inference-time budget, factual authenticity, and reasoning capacity simultaneously in LLM services.
Method: Formal proof of the trade-off and introduction of the BAR Theorem framework.
Result: No model can optimize all three properties at once; the BAR Theorem provides a principled design approach.
Conclusion: The BAR Theorem offers a structured way to navigate trade-offs in LLM-application design.
Abstract: When designing LLM services, practitioners care about three key properties: inference-time budget, factual authenticity, and reasoning capacity. However, our analysis shows that no model can simultaneously optimize for all three. We formally prove this trade-off and propose a principled framework named The BAR Theorem for LLM-application design.
[982] Evaluating the Dynamics of Membership Privacy in Deep Learning
Yuetian Chen, Zhiqi Wang, Nathalie Baracaldo, Swanand Ravindra Kadhe, Lei Yu
Main category: cs.LG
TL;DR: The paper introduces a framework to analyze privacy leakage in deep learning, focusing on how and when training data becomes vulnerable to membership inference attacks (MIAs). It reveals early training stages largely determine privacy risks.
Details
Motivation: To understand when and how models encode membership information during training, addressing gaps in knowledge about privacy leakage dynamics.
Method: A dynamic analytical framework tracks per-sample vulnerabilities on an FPR-TPR plane, examining factors like dataset complexity, model architecture, and optimizer choice.
Result: Samples’ privacy risks are determined early in training, with a strong correlation to their intrinsic learning difficulty.
Conclusion: The findings enhance understanding of privacy risk emergence during training, supporting proactive privacy-aware strategies.
Abstract: Membership inference attacks (MIAs) pose a critical threat to the privacy of training data in deep learning. Despite significant progress in attack methodologies, our understanding of when and how models encode membership information during training remains limited. This paper presents a dynamic analytical framework for dissecting and quantifying privacy leakage dynamics at the individual sample level. By tracking per-sample vulnerabilities on an FPR-TPR plane throughout training, our framework systematically measures how factors such as dataset complexity, model architecture, and optimizer choice influence the rate and severity at which samples become vulnerable. Crucially, we discover a robust correlation between a sample’s intrinsic learning difficulty and its privacy vulnerability, and find that the privacy risk of samples highly vulnerable in the final trained model is largely determined early during training. Our results thus provide a deeper understanding of how privacy risks dynamically emerge during training, laying the groundwork for proactive, privacy-aware model training strategies.
[983] Manifold-regularised Large-Margin $\ell_p$-SVDD for Multidimensional Time Series Anomaly Detection
Shervin Rahimzadeh Arashloo
Main category: cs.LG
TL;DR: A manifold-regularised variant of the large-margin $\ell_p$-SVDD is proposed for multidimensional time series anomaly detection, with optimisation via a Representer theorem and a generalisation analysis based on Rademacher complexities.
Details
Motivation: The large-margin $\ell_p$-SVDD does not exploit the geometry of the data distribution; capturing this structural information should improve time series anomaly detection.
Method: A manifold-regularised $\ell_p$-SVDD formulation encourages label smoothness on the underlying manifold; an effective optimisation technique is derived from an existing Representer theorem.
Result: Theoretical generalisation guarantees via Rademacher complexities, plus an experimental assessment against other methods across various data sets.
Conclusion: Manifold regularisation equips the $\ell_p$-SVDD with structural information for improved detection performance.
Abstract: We generalise the recently introduced large-margin $\ell_p$-SVDD approach to exploit the geometry of the data distribution via manifold regularisation for time series anomaly detection. Specifically, we formulate a manifold-regularised variant of the $\ell_p$-SVDD method to encourage label smoothness on the underlying manifold to capture structural information for improved detection performance. Drawing on an existing Representer theorem, we then provide an effective optimisation technique for the proposed method. We theoretically study the proposed approach using Rademacher complexities to analyse its generalisation performance and also provide an experimental assessment of the proposed method across various data sets to compare its performance against other methods.
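For orientation, one plausible way to write such an objective combines the soft-ball SVDD program with a graph-based label-smoothness penalty; the notation below is our own reconstruction from the abstract, not the paper's exact formulation:

```latex
\min_{R,\,\mathbf{c},\,\boldsymbol{\xi},\,f}\;
  R^p \;+\; C \sum_{i=1}^{n} \xi_i
  \;+\; \gamma \sum_{i,j} W_{ij}\,\bigl(f(\mathbf{x}_i) - f(\mathbf{x}_j)\bigr)^2
\quad\text{s.t.}\quad
  \|\phi(\mathbf{x}_i) - \mathbf{c}\|^p \le R^p + \xi_i,\;\; \xi_i \ge 0,
```

where $W_{ij}$ are neighborhood weights encoding the data manifold and $\gamma$ trades the margin term off against label smoothness; a Representer theorem then lets $f$ be expanded over kernel evaluations at the training points, which is what makes the optimisation tractable.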
[984] StackLiverNet: A Novel Stacked Ensemble Model for Accurate and Interpretable Liver Disease Detection
Md. Ehsanul Haque, S. M. Jahidul Islam, Shakil Mia, Rumana Sharmin, Ashikuzzaman, Md Samir Morshed, Md. Tahmidul Huque
Main category: cs.LG
TL;DR: StackLiverNet, an interpretable stacked ensemble model, addresses issues in liver disease classification with high accuracy (99.89%), interpretability, and efficiency.
Details
Motivation: Current models for liver disease classification suffer from misclassification, poor interpretability, and computational expense, necessitating a robust solution.
Method: StackLiverNet uses advanced preprocessing, feature selection, random undersampling, and a LightGBM meta-model to combine hyperparameter-optimized base classifiers.
Result: Achieves 99.89% accuracy, 0.9974 Cohen Kappa, 0.9993 AUC, and fast training/inference times (4.2783s/0.1106s). LIME and SHAP provide interpretability.
Conclusion: StackLiverNet is a highly accurate, interpretable, and efficient solution for liver disease detection, suitable for clinical practice.
Abstract: Liver diseases are a serious health concern in the world, requiring precise and timely diagnosis to enhance the survival chances of patients. The current literature has implemented numerous machine learning and deep learning models to classify liver diseases, but most of them have issues such as high misclassification error, poor interpretability, prohibitive computational expense, and a lack of good preprocessing strategies. To address these drawbacks, we introduce StackLiverNet in this study: an interpretable stacked ensemble model tailored to the liver disease detection task. The framework uses advanced data preprocessing and feature selection techniques to increase model robustness and predictive ability. Random undersampling is performed to deal with class imbalance and keep the training balanced. StackLiverNet is an ensemble of several hyperparameter-optimized base classifiers, whose complementary advantages are exploited through a LightGBM meta-model. The model demonstrates excellent performance, with a testing accuracy of 99.89%, a Cohen Kappa of 0.9974, and an AUC of 0.9993, with only 5 misclassifications, and training and inference speeds that are amenable to clinical practice (training time 4.2783 seconds, inference time 0.1106 seconds). In addition, Local Interpretable Model-Agnostic Explanations (LIME) are applied to generate transparent explanations of individual predictions, revealing high concentrations of Alkaline Phosphatase and moderate SGOT as important indicators of liver disease. SHAP was also used to rank features by their global contribution to predictions, while the Morris method confirmed the most influential features through sensitivity analysis.
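The stacking architecture maps naturally onto scikit-learn's StackingClassifier with a LightGBM meta-model. The base learners below are placeholders; the paper's hyperparameter-optimized set and its undersampling pipeline are not reproduced here.

```python
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from lightgbm import LGBMClassifier

# Base learners are illustrative stand-ins for the paper's tuned classifiers.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ],
    final_estimator=LGBMClassifier(),   # LightGBM meta-model, as in the paper
    cv=5,                               # out-of-fold predictions for stacking
)
# Usage: stack.fit(X_train, y_train); stack.predict(X_test)
```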
[985] Evaluating Angle and Amplitude Encoding Strategies for Variational Quantum Machine Learning: their impact on model’s accuracy
Antonio Tudisco, Andrea Marchesin, Maurizio Zamboni, Mariagrazia Graziano, Giovanna Turvani
Main category: cs.LG
TL;DR: The paper analyzes Variational Quantum Circuits (VQCs) in Quantum Machine Learning, comparing Amplitude- and Angle-encoding models with different rotational gates. Results show significant accuracy variations (10%-30%, with differences up to 41%) based on encoding choices, confirming embedding as a hyperparameter.
Details
Motivation: To explore how different encoding methods and rotational gates in VQCs impact classification performance in Quantum Machine Learning.
Method: Comparative analysis of Amplitude- and Angle-encoding models with varied rotational gates, tested on Wine and Diabetes datasets.
Result: Accuracy differences between the best and worst models ranged from 10% to 30%, reaching up to 41%, showing encoding choice significantly affects performance.
Conclusion: Embedding is a critical hyperparameter for VQC models, with rotational gate selection influencing classification accuracy.
Abstract: Recent advancements in Quantum Computing and Machine Learning have increased attention to Quantum Machine Learning (QML), which aims to develop machine learning models by exploiting the quantum computing paradigm. One of the widely used models in this area is the Variational Quantum Circuit (VQC), a hybrid model where the quantum circuit handles data inference while classical optimization adjusts the parameters of the circuit. The quantum circuit consists of an encoding layer, which loads data into the circuit, and a template circuit, known as the ansatz, responsible for processing the data. This work involves performing an analysis by considering both Amplitude- and Angle-encoding models, and examining how the type of rotational gate applied affects the classification performance of the model. This comparison is carried out by training the different models on two datasets, Wine and Diabetes, and evaluating their performance. The study demonstrates that, under identical model topologies, the difference in accuracy between the best and worst models ranges from 10% to 30%, with differences reaching up to 41%. Moreover, the results highlight how the choice of rotational gates used in encoding can significantly impact the model’s classification performance. The findings confirm that the embedding represents a hyperparameter for VQC models.
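The encoding comparison is easy to set up in a framework such as PennyLane, where the rotation axis of angle encoding is a single argument. The ansatz chosen below (StronglyEntanglingLayers) is our own assumption for illustration, not necessarily the paper's template circuit.

```python
import pennylane as qml
from pennylane import numpy as np

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

def make_circuit(rotation):
    """Build a VQC whose angle-encoding rotation gate is the hyperparameter
    under study; the entangling ansatz is an illustrative choice."""
    @qml.qnode(dev)
    def circuit(x, weights):
        qml.AngleEmbedding(x, wires=range(n_qubits), rotation=rotation)
        qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
        return qml.expval(qml.PauliZ(0))
    return circuit

shape = qml.StronglyEntanglingLayers.shape(n_layers=2, n_wires=n_qubits)
weights = np.random.random(shape)
x = np.random.random(n_qubits)
# Same topology, different encoding rotations -- the comparison in the paper
for rot in ("X", "Y", "Z"):
    print(rot, make_circuit(rot)(x, weights))
```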
cs.MA
[986] Revisiting Gossip Protocols: A Vision for Emergent Coordination in Agentic Multi-Agent Systems
Mansura Habiba, Nafiul I. Khan
Main category: cs.MA
TL;DR: The paper explores gossip protocols as a solution for flexible, decentralized coordination in agentic AI, complementing structured communication for emergent swarm-like intelligence.
Details
Motivation: The need for adaptive, context-rich communication in scalable agentic platforms, where current structured protocols lack support for emergent collective cognition.
Method: Revisits gossip protocols from distributed systems, analyzing their potential and challenges (e.g., semantic filtering, trustworthiness) for agentic AI.
Result: Identifies gaps in current architectures and proposes gossip as a complementary layer, outlining open research questions.
Conclusion: Gossip protocols offer a promising but underutilized path for resilient, self-organizing multi-agent systems, though challenges remain.
Abstract: As agentic platforms scale, agents are evolving beyond static roles and fixed toolchains, creating a growing need for flexible, decentralized coordination. Today’s structured communication protocols (e.g., direct agent-to-agent messaging) excel at reliability and task delegation, but they fall short in enabling emergent, swarm-like intelligence, where distributed agents continuously learn, adapt, and communicate to form collective cognition. This paper revisits gossip protocols, long valued in distributed systems for their fault tolerance and decentralization, and argues that they offer a missing layer for context-rich, adaptive communication in agentic AI. Gossip enables scalable, low-overhead dissemination of shared knowledge, but also raises unresolved challenges around semantic filtering, staleness, trustworthiness, and consistency in high-stakes environments. Rather than proposing a new framework, this work charts a research agenda for integrating gossip as a complementary substrate alongside structured protocols. We identify critical gaps in current agent-to-agent architectures, highlight where gossip could reshape assumptions about coordination, and outline open questions around intent propagation, knowledge decay, and peer-to-peer trust. Gossip is not a silver bullet, but overlooking it risks missing a key path toward resilient, reflexive, and self-organizing multi-agent systems.
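As a reminder of why gossip scales, here is a minimal push-gossip simulation showing information reaching all agents in roughly logarithmic rounds. The semantic filtering, staleness, and trust questions raised above are deliberately out of scope for this sketch.

```python
import random

def push_gossip(n_agents=100, fanout=3, seed=1):
    """Minimal push-gossip simulation: each round, every informed agent
    forwards a fact to `fanout` random peers. Illustrates the low-overhead,
    fast dissemination that makes gossip attractive at scale."""
    rng = random.Random(seed)
    informed = {0}                      # agent 0 holds the new knowledge
    rounds = 0
    while len(informed) < n_agents:
        for _ in list(informed):
            informed.update(rng.randrange(n_agents) for _ in range(fanout))
        rounds += 1
    return rounds

print(push_gossip())   # typically only a handful of rounds for 100 agents
```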
[987] A Group Consensus-Driven Auction Algorithm for Cooperative Task Allocation Among Heterogeneous Multi-Agents
Gang Wang, Hongfang Han, Xiaowei Liu, Hanfeng Jiang, Ming Zhang
Main category: cs.MA
TL;DR: The paper proposes GCBHA, a distributed algorithm for heterogeneous multi-task and multi-agent task allocation, improving efficiency and reducing errors in scenarios like automated warehouses.
Details
Motivation: Existing task allocation methods lack integration of multi-task, multi-attribute, and heterogeneous task allocation, and suffer from scenario constraints and high error rates.
Method: GCBHA decomposes complex tasks into subtasks, groups them via heuristic clustering, allocates them through auctions, and evaluates path costs accurately.
Result: GCBHA reduces task allocation time and error rates, improving solution quality.
Conclusion: GCBHA effectively addresses heterogeneous task allocation challenges, offering practical benefits for automated warehouses.
Abstract: In scenarios like automated warehouses, assigning tasks to robots presents a heterogeneous multi-task and multi-agent task allocation problem. However, existing task allocation studies ignore the integration of multi-task, multi-attribute agent task allocation with heterogeneous task allocation. In addition, current algorithms are limited by scenario constraints and can incur significant errors in specific contexts. Therefore, this study proposes a distributed heterogeneous multi-task and multi-agent task allocation algorithm with a time window, called group consensus-based heterogeneous auction (GCBHA). This method first decomposes tasks that exceed the capability of a single agent into subtasks that can be completed by multiple independent agents, and then groups similar or adjacent tasks through a heuristic clustering method to reduce the time required to reach a consensus. Subsequently, the task groups are allocated to agents that meet the conditions through an auction process. Furthermore, the method evaluates the task path cost based on the scenario, which allows task costs to be calculated more accurately. The experimental results demonstrate that GCBHA performs well in terms of task allocation time and solution quality, with a significant reduction in the error rate between predicted task costs and actual costs.
[988] Emergence of Fair Leaders via Mediators in Multi-Agent Reinforcement Learning
Akshay Dodwadmath, Setareh Maghsudi
Main category: cs.MA
TL;DR: The paper explores leader selection fairness in Stackelberg games using multi-agent reinforcement learning, proposing a mediator-based framework to enhance fairness in agents’ returns.
Details
Motivation: Addressing unfair outcomes in Stackelberg games due to biased leader selection, especially with self-interested agents.
Method: Proposes a multi-agent reinforcement learning framework integrating mediators for minimal control (leader selection).
Result: Mediators lead to fairer actions by self-interested agents, improving overall fairness in returns.
Conclusion: Mediators in Stackelberg settings can effectively promote fairness among self-interested agents.
Abstract: Stackelberg games and their resulting equilibria have received increasing attention in the multi-agent reinforcement learning literature. Each stage of a traditional Stackelberg game involves a leader(s) acting first, followed by the followers. In situations where the roles of leader(s) and followers can be interchanged, the designated role can have considerable advantages, for example, in first-mover advantage settings. Then the question arises: Who should be the leader and when? A bias in the leader selection process can lead to unfair outcomes. This problem is aggravated if the agents are self-interested and care only about their goals and rewards. We formally define this leader selection problem and show its relation to fairness in agents’ returns. Furthermore, we propose a multi-agent reinforcement learning framework that maximizes fairness by integrating mediators. Mediators have previously been used in the simultaneous action setting with varying levels of control, such as directly performing agents’ actions or just recommending them. Our framework integrates mediators in the Stackelberg setting with minimal control (leader selection). We show that the presence of mediators leads to self-interested agents taking fair actions, resulting in higher overall fairness in agents’ returns.
[989] Bearing-Distance Flocking with Zone-Based Interactions in Constrained Dynamic Environments
Hossein B. Jond
Main category: cs.MA
TL;DR: A novel zone-based flocking control approach for dynamic multi-agent systems, using local perception and behavioral rules, validated by simulations for flexibility and scalability.
Details
Motivation: To develop a scalable and robust flocking control strategy for dynamic multi-agent systems, especially in unreliable communication environments.
Method: Uses bearing and distance measurements to create behavioral contribution vectors for separation, alignment, cohesion, obstacle avoidance, and strategic separation. Incorporates a directionally aware obstacle avoidance mechanism.
Result: Simulations show flexible, adaptable, and scalable flocking behavior. Asymptotic stability and convergence are proven for spanning tree interaction graphs.
Conclusion: The approach is effective for real-world applications in dynamic, distributed environments due to its reliance on local sensing and robustness.
Abstract: This paper presents a novel zone-based flocking control approach suitable for dynamic multi-agent systems (MAS). Inspired by Reynolds' behavioral rules for $boids$, flocking behavioral rules with the zones of repulsion, conflict, attraction, and surveillance are introduced. For each agent, using only bearing and distance measurements, behavioral contribution vectors quantify the local separation, local and global flock velocity alignment, local cohesion, obstacle avoidance and boundary conditions, and strategic separation for avoiding alien agents. The control strategy uses the local perception-based behavioral contribution vectors to guide each agent’s motion. Additionally, the control strategy incorporates a directionally aware obstacle avoidance mechanism that prioritizes obstacles in the agent’s forward path. Simulation results validate the effectiveness of the model in creating flexible, adaptable, and scalable flocking behavior. Asymptotic stability and convergence to a stable flocking configuration are demonstrated for any initial conditions, provided the interaction graph is a spanning tree. The flocking model’s reliance on locally sensed bearing and distance measurements ensures scalability and robustness, particularly in scenarios where communication is unreliable or resource-intensive. This makes it well-suited for real-world applications demanding seamless operation in highly dynamic and distributed environments.
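A toy version of the zone-based behavioral contribution vectors, computed from distances and offsets to neighbors, might look as follows. Zone radii and gain weights are illustrative, not the paper's tuned values, and the surveillance and alien-agent zones are omitted.

```python
import numpy as np

def flocking_velocity(i, pos, vel, r_rep=1.0, r_att=5.0):
    """Zone-based contributions for agent i from local measurements only:
    repulsion inside r_rep; alignment and cohesion out to r_att.
    Radii and gains are illustrative choices."""
    sep, ali, coh = np.zeros(2), np.zeros(2), np.zeros(2)
    n_nbrs = 0
    for j in range(len(pos)):
        if j == i:
            continue
        offset = pos[j] - pos[i]
        dist = np.linalg.norm(offset)
        if dist < r_rep:                       # zone of repulsion
            sep = sep - offset / (dist**2 + 1e-9)
        elif dist < r_att:                     # zones of conflict/attraction
            ali = ali + vel[j]
            coh = coh + offset
            n_nbrs += 1
    if n_nbrs:
        ali, coh = ali / n_nbrs, coh / n_nbrs
    return 1.5 * sep + 1.0 * ali + 0.8 * coh   # weighted contribution vectors

pos = np.random.uniform(0, 10, (20, 2))
vel = np.random.uniform(-1, 1, (20, 2))
print(flocking_velocity(0, pos, vel))
```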
[990] AI-Generated Compromises for Coalition Formation
Eyal Briman, Ehud Shapiro, Nimrod Talmon
Main category: cs.MA
TL;DR: The paper addresses the challenge of finding compromise proposals in coalition formation by formalizing a model with bounded rationality and uncertainty, using NLP and LLMs to create a semantic metric space for text, and demonstrating AI’s role in democratic text editing.
Details
Motivation: To solve the open question of effectively identifying compromise proposals in coalition formation, particularly in collaborative document writing such as drafting a community constitution.
Method: Formalizes a model with agent bounded rationality and uncertainty, uses NLP and large language models to create a semantic metric space over text, and designs algorithms to suggest compromise proposals.
Result: The results show that AI can facilitate large-scale democratic text editing by simulating coalition formation processes and generating broadly supported compromise proposals.
Conclusion: The conclusion is that AI methods, leveraging NLP and LLMs, can effectively address the challenge of finding compromise proposals in collaborative document writing, surpassing traditional tools.
Abstract: The challenge of finding compromises between agent proposals is fundamental to AI subfields such as argumentation, mediation, and negotiation. Building on this tradition, Elkind et al. (2021) introduced a process for coalition formation that seeks majority-supported proposals preferable to the status quo, using a metric space where each agent has an ideal point. A crucial step in this process involves identifying compromise proposals around which agent coalitions can unite. How to effectively find such compromise proposals remains an open question. We address this gap by formalizing a model that incorporates agent bounded rationality and uncertainty, and by developing AI methods to generate compromise proposals. We focus on the domain of collaborative document writing, such as the democratic drafting of a community constitution. Our approach uses natural language processing techniques and large language models to induce a semantic metric space over text. Based on this space, we design algorithms to suggest compromise points likely to receive broad support. To evaluate our methods, we simulate coalition formation processes and show that AI can facilitate large-scale democratic text editing, a domain where traditional tools are limited.
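In its simplest form, the compromise-suggestion step reduces to a medoid search in the semantic metric space: pick the candidate proposal minimizing total distance to a coalition's ideal points. The sketch below uses synthetic vectors standing in for text embeddings from any encoder.

```python
import numpy as np

def compromise_point(embeddings, coalition):
    """Pick, from candidate proposal embeddings, the one minimizing the sum
    of distances to a coalition's ideal points -- a simple stand-in for the
    compromise-suggestion step in a semantic metric space."""
    ideal = embeddings[coalition]                          # (m, d) ideal points
    dists = np.linalg.norm(embeddings[:, None, :] - ideal[None, :, :], axis=2)
    return int(dists.sum(axis=1).argmin())                 # 1-medoid candidate

rng = np.random.default_rng(0)
proposals = rng.normal(size=(50, 384))   # e.g., sentence-embedding vectors
print(compromise_point(proposals, coalition=[3, 7, 19, 28]))
```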
[991] Dynamic Strategy Adaptation in Multi-Agent Environments with Large Language Models
Shaurya Mallampati, Rashed Shelim, Walid Saad, Naren Ramakrishnan
Main category: cs.MA
TL;DR: LLMs combined with game-theoretic principles improve real-time multi-agent collaboration, outperforming baselines by 26% in noisy environments.
Details
Motivation: To explore LLMs' reasoning in dynamic, real-time multi-agent scenarios, unlike static or turn-based settings.
Method: Integrates LLM-driven agents with strategic reasoning, real-time adaptation, and game-theoretic principles like belief consistency and Nash equilibrium.
Result: Achieves 26% better performance than PPO baselines in high-noise environments with sub-1.05ms latency.
Conclusion: Game-theoretic guidance and real-time feedback enhance LLM performance, creating more resilient multi-agent systems.
Abstract: Large language models (LLMs) demonstrate strong reasoning abilities across mathematical, strategic, and linguistic tasks, yet little is known about how well they reason in dynamic, real-time, multi-agent scenarios, such as collaborative environments in which agents continuously adapt to each other’s behavior, as in cooperative gameplay settings. In this paper, we bridge this gap by combining LLM-driven agents with strategic reasoning and real-time adaptation in cooperative, multi-agent environments grounded in game-theoretic principles such as belief consistency and Nash equilibrium. The proposed framework applies broadly to dynamic scenarios in which agents coordinate, communicate, and make decisions in response to continuously changing conditions. We provide real-time strategy refinement and adaptive feedback mechanisms that enable agents to dynamically adjust policies based on immediate contextual interactions, in contrast to previous efforts that evaluate LLM capabilities in static or turn-based settings. Empirical results show that our method achieves up to a 26% improvement in return over PPO baselines in high-noise environments, while maintaining real-time latency under 1.05 milliseconds. Our approach improves collaboration efficiency, task completion rates, and flexibility, illustrating that game-theoretic guidance integrated with real-time feedback enhances LLM performance, ultimately fostering more resilient and flexible strategic multi-agent systems.
cs.MM
[992] Graph-based Interaction Augmentation Network for Robust Multimodal Sentiment Analysis
Hu Zhangfeng, Shi Mengxin
Main category: cs.MM
TL;DR: A graph-based framework is proposed to address modality imperfection in Multimodal Sentiment Analysis (MSA) by modeling intra- and inter-modality interactions, outperforming existing methods.
Details
Motivation: Existing MSA methods overlook complex dependencies within and across modalities, failing to fully leverage complementary semantics.
Method: The framework uses a learnable hypergraph for intra-modality dependencies and a directed graph for inter-modality correlations, supervised by knowledge from perfect samples.
Result: The method shows effectiveness on MOSI and MOSEI datasets.
Conclusion: The proposed framework robustly captures missing semantics and improves MSA performance.
Abstract: The inevitable modality imperfection in real-world scenarios poses significant challenges for Multimodal Sentiment Analysis (MSA). While existing methods tailor reconstruction or joint representation learning strategies to restore missing semantics, they often overlook complex dependencies within and across modalities. Consequently, they fail to fully leverage available modalities to capture complementary semantics. To this end, this paper proposes a novel graph-based framework to exploit both intra- and inter-modality interactions, enabling imperfect samples to derive missing semantics from complementary parts for robust MSA. Specifically, we first devise a learnable hypergraph to model intra-modality temporal dependencies to exploit contextual information within each modality. Then, a directed graph is employed to explore inter-modality correlations based on attention mechanism, capturing complementary information across different modalities. Finally, the knowledge from perfect samples is integrated to supervise our interaction processes, guiding the model toward learning reliable and robust joint representations. Extensive experiments on MOSI and MOSEI datasets demonstrate the effectiveness of our method.
[993] DRKF: Decoupled Representations with Knowledge Fusion for Multimodal Emotion Recognition
Peiyuan Jiang, Yao Liu, Qiao Liu, Zongshun Zhang, Jiaye Yang, Lu Liu, Daibing Yao
Main category: cs.MM
TL;DR: The paper proposes DRKF, a method for Multimodal Emotion Recognition (MER) that decouples shared and modality-specific features and fuses knowledge to address modality heterogeneity and emotional inconsistency, achieving SOTA results.
Details
Motivation: The challenges in MER include modality heterogeneity and inconsistencies in emotional cues, which hinder performance. The paper aims to address these issues.
Method: DRKF consists of two modules: ORL for decoupling shared and modality-specific features using contrastive mutual information, and KF for knowledge fusion with a self-attention-based Fusion Encoder and Emotion Discrimination Submodule.
Result: DRKF achieves state-of-the-art performance on IEMOCAP, MELD, and M3ED datasets.
Conclusion: The proposed DRKF effectively handles modality heterogeneity and emotional inconsistency, improving MER performance.
Abstract: Multimodal emotion recognition (MER) aims to identify emotional states by integrating and analyzing information from multiple modalities. However, inherent modality heterogeneity and inconsistencies in emotional cues remain key challenges that hinder performance. To address these issues, we propose a Decoupled Representations with Knowledge Fusion (DRKF) method for MER. DRKF consists of two main modules: an Optimized Representation Learning (ORL) Module and a Knowledge Fusion (KF) Module. ORL employs a contrastive mutual information estimation method with progressive modality augmentation to decouple task-relevant shared representations and modality-specific features while mitigating modality heterogeneity. KF includes a lightweight self-attention-based Fusion Encoder (FE) that identifies the dominant modality and integrates emotional information from other modalities to enhance the fused representation. To handle potential errors from incorrect dominant modality selection under emotionally inconsistent conditions, we introduce an Emotion Discrimination Submodule (ED), which enforces the fused representation to retain discriminative cues of emotional inconsistency. This ensures that even if the FE selects an inappropriate dominant modality, the Emotion Classification Submodule (EC) can still make accurate predictions by leveraging preserved inconsistency information. Experiments show that DRKF achieves state-of-the-art (SOTA) performance on IEMOCAP, MELD, and M3ED. The source code is publicly available at https://github.com/PANPANKK/DRKF.
[994] User Digital Twin-Driven Video Streaming for Customized Preferences and Adaptive Transcoding
Stephen Jimmy, Kalkidan Berhane, Kevin Muhammad
Main category: cs.MM
TL;DR: A novel approach integrates user digital twins with video streaming to enhance personalization and efficiency using machine learning.
Details
Motivation: To improve user experience and system efficiency in video streaming by leveraging dynamic digital representations of user preferences and behaviors.
Method: Uses advanced machine learning to update user digital twins, which dynamically adjust video preferences and optimize transcoding processes.
Result: Improves content personalization, reduces bandwidth usage, and enhances video playback quality.
Conclusion: Suggests a shift towards adaptive, user-centric multimedia services, transforming video content delivery.
Abstract: In the rapidly evolving field of multimedia services, video streaming has become increasingly prevalent, demanding innovative solutions to enhance user experience and system efficiency. This paper introduces a novel approach that integrates user digital twins-a dynamic digital representation of a user’s preferences and behaviors-with traditional video streaming systems. We explore the potential of this integration to dynamically adjust video preferences and optimize transcoding processes according to real-time data. The methodology leverages advanced machine learning algorithms to continuously update the user’s digital twin, which in turn informs the transcoding service to adapt video parameters for optimal quality and minimal buffering. Experimental results show that our approach not only improves the personalization of content delivery but also significantly enhances the overall efficiency of video streaming services by reducing bandwidth usage and improving video playback quality. The implications of such advancements suggest a shift towards more adaptive, user-centric multimedia services, potentially transforming how video content is consumed and delivered.
eess.AS
[995] Fusion of Modulation Spectrogram and SSL with Multi-head Attention for Fake Speech Detection
Rishith Sadashiv T N, Abhishek Bedge, Saisha Suresh Bore, Jagabandhu Mishra, Mrinmoy Bhattacharjee, S R Mahadeva Prasanna
Main category: eess.AS
TL;DR: A novel fake speech detection system using SSL embeddings and Modulation Spectrogram features improves domain generalization, outperforming baselines by up to 37%.
Details
Motivation: Address poor generalizability of fake speech detection systems on out-of-domain data due to lack of diverse training.Method: Proposes SSL+MS fusion representation for classification, using AASIST back-end. Evaluated on monolingual and multilingual datasets.
Result: 37% and 20% relative improvements on ASVspoof 2019 and MLAAD in-domain; 36% improvement in out-of-domain.
Conclusion: The SSL+MS fusion enhances domain generalization, consistently outperforming baselines across languages.
Abstract: Fake speech detection systems have become a necessity to combat speech deepfakes. Current systems exhibit poor generalizability on out-of-domain speech samples due to a lack of diverse training data. In this paper, we attempt to address the domain generalization issue by proposing a novel speech representation using self-supervised learning (SSL) speech embeddings and the Modulation Spectrogram (MS) feature. A fusion strategy is used to combine both speech representations to introduce a new front-end for the classification task. The proposed SSL+MS fusion representation is passed to the AASIST back-end network. Experiments are conducted on monolingual and multilingual fake speech datasets to evaluate the efficacy of the proposed model architecture in cross-dataset and multilingual cases. The proposed model achieves a relative performance improvement of 37% and 20% on the ASVspoof 2019 and MLAAD datasets, respectively, in in-domain settings compared to the baseline. In the out-of-domain scenario, the model trained on ASVspoof 2019 shows a 36% relative improvement when evaluated on the MLAAD dataset. Across all evaluated languages, the proposed model consistently outperforms the baseline, indicating enhanced domain generalization.
[996] Multi-Granularity Adaptive Time-Frequency Attention Framework for Audio Deepfake Detection under Real-World Communication Degradations
Haohan Shi, Xiyu Shi, Safak Dogan, Tianjin Huang, Yunxiao Zhang
Main category: eess.AS
TL;DR: A unified framework for robust Audio Deepfake Detection (ADD) under real-world degradations, featuring a Multi-Granularity Adaptive Attention (MGAA) architecture to dynamically adapt to varying Time-Frequency (TF) representations.
Details
Motivation: Address the performance drop of existing ADD methods under real-world communication degradations like packet losses and speech codec compression.Method: Proposes MGAA, a customizable multi-scale attention mechanism with adaptive fusion to dynamically focus on salient TF regions and amplify subtle forgery traces.
Result: Outperforms state-of-the-art baselines across various degradation scenarios, improving separability between real and fake audio.
Conclusion: The framework is robust and practical for real-world deployment, enhancing ADD performance under communication degradations.
Abstract: The rise of highly convincing synthetic speech poses a growing threat to audio communications. Although existing Audio Deepfake Detection (ADD) methods have demonstrated good performance under clean conditions, their effectiveness drops significantly under degradations such as packet losses and speech codec compression in real-world communication environments. In this work, we propose the first unified framework for robust ADD under such degradations, which is designed to effectively accommodate multiple types of Time-Frequency (TF) representations. The core of our framework is a novel Multi-Granularity Adaptive Attention (MGAA) architecture, which employs a set of customizable multi-scale attention heads to capture both global and local receptive fields across varying TF granularities. A novel adaptive fusion mechanism subsequently adjusts and fuses these attention branches based on the saliency of TF regions, allowing the model to dynamically reallocate its focus according to the characteristics of the degradation. This enables the effective localization and amplification of subtle forgery traces. Extensive experiments demonstrate that the proposed framework consistently outperforms state-of-the-art baselines across various real-world communication degradation scenarios, including six speech codecs and five levels of packet losses. In addition, comparative analysis reveals that the MGAA-enhanced features significantly improve separability between real and fake audio classes and sharpen decision boundaries. These results highlight the robustness and practical deployment potential of our framework in real-world communication environments.
[997] Lumename: Wearable Device for Hearing Impaired with Personalized ML-Based Auditory Detection and Haptic-Visual Alerts
Jeanelle Dao, Jadelynn Dao
Main category: eess.AS
TL;DR: Lumename is a smartwatch using TinyML to detect custom names for hearing-impaired users, achieving 91.67% accuracy with low power and resource use.
Details
Motivation: Addressing the challenge of recognizing spoken commands for 430 million people with disabling hearing loss.Method: Uses on-device ML, audio modulation for data augmentation, and constrained random iterations for model optimization.
Result: Achieves 91.67% accuracy on a custom smartwatch with low resource and power consumption.
Conclusion: Lumename offers an efficient, real-time solution for hearing-impaired individuals to recognize spoken names.
Abstract: According to the World Health Organization, 430 million people experience disabling hearing loss. For them, recognizing spoken commands such as one’s name is difficult. To address this issue, Lumename, a real-time smartwatch, utilizes on-device machine learning to detect a user-customized name before generating a haptic-visual alert. During training, to overcome the need for large datasets, Lumename uses novel audio modulation techniques to augment samples from one user and generate additional samples to represent diverse genders and ages. Constrained random iterations were used to find optimal parameters within the model architecture. This approach resulted in a low-resource and low-power TinyML model that could quickly infer various keyword samples while remaining 91.67% accurate on a custom-built smartwatch based on an Arduino Nano 33 BLE Sense.
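The data-scarcity workaround is worth a concrete illustration. Below is a hedged sketch of modulation-style augmentation via simple resampling; this linear-interpolation resampler is a generic stand-in, since the paper's exact modulation techniques are not detailed in the abstract:

```python
import numpy as np

# Hedged sketch: resample one recorded name sample to shift its pitch/tempo,
# approximating voices of different ages and genders. A generic stand-in for
# the paper's audio modulation techniques, which are not specified here.
def modulate(sample: np.ndarray, rate: float) -> np.ndarray:
    idx = np.arange(0, len(sample) - 1, rate)   # rate > 1 raises pitch when
    return np.interp(idx, np.arange(len(sample)), sample)  # played back as-is

base = np.random.randn(16_000)                  # 1 s utterance at 16 kHz
augmented = [modulate(base, r) for r in (0.8, 0.9, 1.1, 1.25)]
print([len(a) for a in augmented])              # stretched/compressed copies
```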
[998] An Age-Agnostic System for Robust Speaker Verification
Jiusi Zheng, Vishwas Shetty, Natarajan Balaji Shankar, Abeer Alwan
Main category: eess.AS
TL;DR: Proposes an Age Agnostic Speaker Verification (AASV) system to improve performance across children’s and adults’ speaker verification tasks by disentangling age-related attributes and expanding the embedding space.
Details
Motivation: Addresses the performance gap in speaker verification (SV) between children and adults, where adult-trained SV systems underperform for children's SV (C-SV) and domain adaptation often degrades adult SV (A-SV) performance.Method: Uses a domain classifier to separate age-related attributes from speech, then expands the embedding space with domain information to create a unified, robust speaker representation.
Result: Demonstrates effectiveness on OGI and VoxCeleb datasets, bridging performance disparities between C-SV and A-SV tasks.
Conclusion: Lays the foundation for inclusive, age-adaptive SV systems by achieving robust performance across age groups.
Abstract: In speaker verification (SV), the acoustic mismatch between children’s and adults’ speech leads to suboptimal performance when adult-trained SV systems are applied to children’s speaker verification (C-SV). While domain adaptation techniques can enhance performance on C-SV tasks, they often do so at the expense of significant degradation in performance on adults’ SV (A-SV) tasks. In this study, we propose an Age Agnostic Speaker Verification (AASV) system that achieves robust performance across both C-SV and A-SV tasks. Our approach employs a domain classifier to disentangle age-related attributes from speech and subsequently expands the embedding space using the extracted domain information, forming a unified speaker representation that is robust and highly discriminative across age groups. Experiments on the OGI and VoxCeleb datasets demonstrate the effectiveness of our approach in bridging SV performance disparities, laying the foundation for inclusive and age-adaptive SV systems.
[999] Test-Time Training for Speech Enhancement
Avishkar Behera, Riya Ann Easow, Venkatesh Parvathala, K. Sri Rama Murty
Main category: eess.AS
TL;DR: A novel Test-Time Training (TTT) method for Speech Enhancement adapts to noise and domain shifts using self-supervised tasks, improving performance without labeled data.
Details
Motivation: Addressing unpredictable noise conditions and domain shifts in speech enhancement, which traditional methods struggle with.Method: Combines a main speech enhancement task with self-supervised auxiliary tasks in a Y-shaped architecture, dynamically adapting during inference.
Result: Outperforms baseline models in synthetic and real-world datasets, improving speech quality metrics.
Conclusion: Demonstrates TTT’s effectiveness for adaptive speech enhancement, offering insights for robust speech processing.
Abstract: This paper introduces a novel application of Test-Time Training (TTT) for Speech Enhancement, addressing the challenges posed by unpredictable noise conditions and domain shifts. This method combines a main speech enhancement task with a self-supervised auxiliary task in a Y-shaped architecture. The model dynamically adapts to new domains during inference time by optimizing the proposed self-supervised tasks like noise-augmented signal reconstruction or masked spectrogram prediction, bypassing the need for labeled data. We further introduce various TTT strategies offering a trade-off between adaptation and efficiency. Evaluations across synthetic and real-world datasets show consistent improvements across speech quality metrics, outperforming the baseline model. This work highlights the effectiveness of TTT in speech enhancement, providing insights for future research in adaptive and robust speech processing.
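The adaptation loop itself is compact. Here is a hedged sketch with a placeholder model and a noisy-reconstruction auxiliary loss standing in for the paper's Y-shaped architecture and masked-spectrogram task:

```python
import torch
import torch.nn as nn

# Hedged sketch of test-time training: before enhancing a test utterance, take
# a few gradient steps on a self-supervised loss computed on that utterance
# alone, then run the (adapted) model. Model and loss are placeholders.
def test_time_adapt(model, noisy, aux_loss_fn, steps=5, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):                    # adapt on the unlabeled input
        opt.zero_grad()
        aux_loss_fn(model, noisy).backward()
        opt.step()
    with torch.no_grad():
        return model(noisy)                   # enhance after adaptation

model = nn.Sequential(nn.Linear(257, 257), nn.ReLU(), nn.Linear(257, 257))
noisy = torch.randn(100, 257)                 # e.g., 100 spectrogram frames
enhanced = test_time_adapt(model, noisy, lambda m, x: ((m(x) - x) ** 2).mean())
print(enhanced.shape)
```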
[1000] Word Error Rate Definitions and Algorithms for Long-Form Multi-talker Speech Recognition
Thilo von Neumann, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach
Main category: eess.AS
TL;DR: The paper compares and unifies various Word Error Rate (WER) metrics for long-form multi-talker speech recognition, introduces DI-cpWER to isolate speaker confusion errors, and proposes efficient algorithms for computationally heavy metrics.
Details
Motivation: Classical WER is inadequate for long-form multi-talker transcripts, necessitating specialized metrics like cpWER, tcpWER, ORC-WER, and MIMO-WER. A unified understanding and efficient computation are lacking.Method: The paper provides a unified description of WER variants, introduces DI-cpWER, proposes greedy algorithms for ORC-WER and DI-cpWER, and integrates time constraints to reduce complexity.
Result: The greedy algorithms achieve high precision (<0.1% deviation) with polynomial complexity. Time constraints improve plausibility and reduce computational costs.
Conclusion: The work clarifies when to use which WER metric, introduces DI-cpWER for speaker confusion analysis, and offers efficient solutions for computationally intensive metrics.
Abstract: The predominant metric for evaluating speech recognizers, the Word Error Rate (WER), has been extended in different ways to handle transcripts produced by long-form multi-talker speech recognizers. These systems process long transcripts containing multiple speakers and complex speaking patterns, to which the classical WER cannot be applied. There are speaker-attributed approaches that count speaker confusion errors, such as the concatenated minimum-permutation WER (cpWER) and the time-constrained cpWER (tcpWER), and speaker-agnostic approaches, which aim to ignore speaker confusion errors, such as the Optimal Reference Combination WER (ORC-WER) and the MIMO-WER. These WERs evaluate different aspects and error types (e.g., temporal misalignment), but a detailed comparison has not been made. We therefore present a unified description of the existing WERs and highlight when to use which metric. To further analyze how many errors are caused by speaker confusion, we propose the Diarization-invariant cpWER (DI-cpWER). It ignores speaker attribution errors, and its difference from cpWER reflects the impact of speaker confusions on the WER. Since error types cannot reliably be classified automatically, we discuss ways to visualize sequence alignments between the reference and hypothesis transcripts to facilitate the spotting of errors by a human judge. Since some WER definitions have high computational complexity, we introduce a greedy algorithm to approximate the ORC-WER and DI-cpWER with high precision ($<0.1\%$ deviation in our experiments) and polynomial complexity instead of exponential. To improve the plausibility of the metrics, we also incorporate the time constraint from the tcpWER into ORC-WER and MIMO-WER, also significantly reducing the computational complexity.
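For context, the base metric all of these variants extend is a word-level edit distance. A minimal classical WER sketch follows; the cpWER/ORC-WER variants add permutation or assignment search on top of this:

```python
# Minimal classical WER via word-level edit distance; illustrative only.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 ref words ≈ 0.33
```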
[1001] Guiding an Automatic Speech Recognition Decoder Using Large Language Models
Eyal Cohen, Bhiksha Raj, Joseph Keshet
Main category: eess.AS
TL;DR: The paper proposes a novel method to integrate Large Language Models (LLMs) into Automatic Speech Recognition (ASR) by decomposing the MAP estimator, enabling independent training of acoustic and language models for improved performance.
Details
Motivation: Despite the potential of LLMs, integrating them into ASR systems remains challenging. The paper aims to address this by leveraging the strengths of both acoustic and language models without joint optimization.Method: The authors decompose the MAP estimator of words given the acoustic signal, deriving an iterative procedure to integrate AM and LLM while maintaining separability. This allows independent training and improvement of each component.
Result: The method outperforms N-gram, GCNN, and TransformerLM across datasets (ALLSSTAR, WSJ0, TED-LIUM 3) and shows efficacy in handling complex sentences, acronyms, and domain-specific vocabulary.
Conclusion: The proposed approach successfully integrates LLMs into ASR, enhancing performance by leveraging the strengths of both AM and LLM without requiring joint optimization.
Abstract: Automatic Speech Recognition (ASR) consists of an acoustic model (AM) and a language model (LM). The AM estimates the probability of an acoustic signal based on a sequence of linguistic units, typically phones, characters, or tokens, while the LM assesses the likelihood of a specific sequence of words or tokens. Although Large Language Models (LLMs) have demonstrated significant potential across various tasks, integrating them into ASR remains an open challenge. By decomposing the maximum a posteriori (MAP) estimator of words (or tokens) given the acoustic signal, we derive an iterative procedure that facilitates a novel integration of the AM and LLM, while maintaining their separability. This approach enables each component to be independently trained and improved using its own data, thereby maximizing the system’s performance by leveraging the strengths of both models without requiring joint optimization. We illustrate the effectiveness of our method in comparison to three language models: N-gram, GCNN, and TransformerLM across multiple datasets spanning various speech styles, including ALLSSTAR, WSJ0, and TED-LIUM 3. Our experiments involved two acoustic models (wav2vec 2.0 and HuBERT) and three LLMs (GPT-2, LLaMA 2, and Falcon). Notably, our method demonstrates particular efficacy in addressing complex speech sentences, acronyms, and domain-specific vocabulary.
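The integration keeps the AM and LM scores separable. As a loose illustration only (not the paper's iterative MAP procedure), generic N-best rescoring combines the two log-probabilities linearly; all names and scores below are hypothetical:

```python
# Hedged sketch: rescore AM hypotheses with an external language model. The
# paper derives an iterative MAP procedure; this toy only conveys separability.
def rescore(nbest, lm_logprob, alpha=0.3):
    """nbest: list of (hypothesis, am_logprob); lm_logprob: str -> float."""
    return max(nbest, key=lambda h: h[1] + alpha * lm_logprob(h[0]))[0]

word_lp = {"the": -1.0, "cat": -2.0, "sat": -2.0, "sad": -5.0}  # toy "LLM"
toy_lm = lambda text: sum(word_lp.get(w, -10.0) for w in text.split())
print(rescore([("the cat sat", -12.0), ("the cat sad", -11.5)], toy_lm))
# -> "the cat sat": the LM penalty on "sad" outweighs its better AM score.
```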
[1002] Reference-free Adversarial Sex Obfuscation in Speech
Yangyang Qu, Michele Panariello, Massimiliano Todisco, Nicholas Evans
Main category: eess.AS
TL;DR: RASO is a method for sex obfuscation in speech, using adversarial learning to remove sex-specific cues while preserving linguistic content, outperforming existing approaches.
Details
Motivation: Address privacy risks from sex conversion in speech, which often retains residual sex-specific cues even without target references.Method: Uses a sex-conditional adversarial learning framework to disentangle linguistic content from sex-related acoustic markers, with explicit regularization for sex-neutral characteristics.
Result: RASO significantly outperforms competing sex obfuscation methods, even under semi-informed attack models.
Conclusion: RASO effectively obfuscates sex in speech while maintaining linguistic integrity, offering a robust privacy solution.
Abstract: Sex conversion in speech involves privacy risks from data collection and often leaves residual sex-specific cues in outputs, even when target speaker references are unavailable. We introduce RASO for Reference-free Adversarial Sex Obfuscation. Innovations include a sex-conditional adversarial learning framework to disentangle linguistic content from sex-related acoustic markers and explicit regularisation to align fundamental frequency distributions and formant trajectories with sex-neutral characteristics learned from sex-balanced training data. RASO preserves linguistic content and, even when assessed under a semi-informed attack model, it significantly outperforms a competing approach to sex obfuscation.
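One common way to implement adversarial disentanglement of this kind is a gradient-reversal layer. A hedged sketch of that ingredient alone; RASO's sex-conditional framework and F0/formant regularizers are omitted, and this is not claimed to be the authors' implementation:

```python
import torch

# Hedged sketch: a gradient-reversal layer lets a sex classifier train on top
# of a content encoder while pushing the encoder to *remove* sex information.
class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, grad):
        return -grad                      # flip gradients into the encoder

features = torch.randn(8, 256, requires_grad=True)   # hypothetical encoder output
reversed_feats = GradReverse.apply(features)
# A classifier trained on reversed_feats improves itself while its gradients
# degrade sex-related information in the upstream representation.
```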
[1003] Revisiting the Privacy of Low-Frequency Speech Signals: Exploring Resampling Methods, Evaluation Scenarios, and Speaker Characteristics
Jule Pohlhausen, Jörg Bitzer
Main category: eess.AS
TL;DR: The paper investigates low-frequency audio recordings to balance privacy and utility, showing that up to 800 Hz sampling preserves transcription accuracy while anti-aliasing filters enhance privacy.
Details
Motivation: To address privacy concerns in audio recordings while maintaining useful insights from conversational data.Method: Resampling audio to low frequencies (up to 800 Hz) with and without anti-aliasing filters, evaluated using speech recognition and voice activity detection models.
Result: Clean recordings at ≤800 Hz retain transcription accuracy; missing anti-aliasing filters weaken privacy protection. Speaker sex and pitch also impact results.
Conclusion: Low-frequency recordings (≤800 Hz) with anti-aliasing filters effectively balance privacy and utility in audio data.
Abstract: While audio recordings in real life provide insights into social dynamics and conversational behavior, they also raise concerns about the privacy of personal, sensitive data. This article explores the effectiveness of restricting recordings to low-frequency audio to protect spoken content. For resampling the audio signals to different sampling rates, we compare the effect of employing anti-aliasing filtering. Privacy enhancement is measured by an increased word error rate of automatic speech recognition models. The impact on utility performance is measured with voice activity detection models. Our experimental results show that for clean recordings, models trained with a sampling rate of up to 800 Hz transcribe the majority of words correctly. For both models, we analyzed the impact of the speaker’s sex and pitch, and we demonstrated that missing anti-aliasing filters more strongly compromise speech privacy.
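The anti-aliasing comparison at the heart of the study can be reproduced in a few lines. A sketch assuming SciPy's resample_poly (which applies an anti-aliasing filter) versus naive decimation, with illustrative signal content:

```python
import numpy as np
from scipy.signal import resample_poly

# Hedged sketch: downsample a 16 kHz signal to 800 Hz with an anti-aliasing
# filter versus naive decimation, which lets high frequencies alias down.
fs_in, fs_out = 16_000, 800
t = np.arange(fs_in) / fs_in
x = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)

filtered = resample_poly(x, up=fs_out, down=fs_in)  # anti-aliasing applied
naive = x[:: fs_in // fs_out]    # no filter: the 3 kHz tone aliases to 200 Hz

print(filtered.shape, naive.shape)  # both ~800 samples for 1 s of audio
```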
[1004] Language-based Audio Moment Retrieval
Hokuto Munakata, Taichi Nishimura, Shota Nakada, Tatsuya Komatsu
Main category: eess.AS
TL;DR: The paper introduces Audio Moment Retrieval (AMR), a task to predict relevant moments in long audio using text queries. It presents a dataset (Clotho-Moment) and a DETR-based model (AM-DETR) that outperforms clip-level retrieval methods.
Details
Motivation: Existing audio retrieval tasks focus on short clips, but AMR addresses the need for retrieving specific moments in untrimmed long audio using text queries.Method: The authors build the Clotho-Moment dataset and propose AM-DETR, a DETR-based model capturing temporal dependencies in audio features.
Result: AM-DETR outperforms baseline clip-level retrieval methods, notably improving Recall1@0.7 by 9.00 points.
Conclusion: The work establishes AMR as a viable task, provides a dataset and model, and demonstrates superior performance over traditional methods.
Abstract: In this paper, we propose and design a new task called audio moment retrieval (AMR). Unlike conventional language-based audio retrieval tasks that search for short audio clips from an audio database, AMR aims to predict relevant moments in untrimmed long audio based on a text query. Given the lack of prior work in AMR, we first build a dedicated dataset, Clotho-Moment, consisting of large-scale simulated audio recordings with moment annotations. We then propose a DETR-based model, named Audio Moment DETR (AM-DETR), as a fundamental framework for AMR tasks. This model captures temporal dependencies within audio features, inspired by similar video moment retrieval tasks, thus surpassing conventional clip-level audio retrieval methods. Additionally, we provide manually annotated datasets to properly measure the effectiveness and robustness of our methods on real data. Experimental results show that AM-DETR, trained with Clotho-Moment, outperforms a baseline model that applies a clip-level audio retrieval method with a sliding window on all metrics, particularly improving Recall1@0.7 by 9.00 points. Our datasets and code are publicly available in https://h-munakata.github.io/Language-based-Audio-Moment-Retrieval.
[1005] Align-ULCNet: Towards Low-Complexity and Robust Acoustic Echo and Noise Reduction
Shrishti Saha Shetu, Naveen Kumar Desiraju, Wolfgang Mack, Emanuël A. P. Habets
Main category: eess.AS
TL;DR: A hybrid approach enhances ULCNet for acoustic echo and noise reduction by integrating time alignment and parallel encoder blocks, improving echo reduction and maintaining noise reduction performance. A channel-wise sampling feature reorientation ensures robustness with low computational cost.
Details
Motivation: The need for low-complexity, robust deep learning solutions for acoustic echo and noise reduction in consumer devices drives this work.Method: Proposes a hybrid approach combining time alignment and parallel encoder blocks for ULCNet, along with channel-wise sampling-based feature reorientation.
Result: Improved echo reduction and comparable noise reduction to SOTA methods, with robust performance in challenging scenarios and low computational requirements.
Conclusion: The hybrid approach effectively enhances ULCNet, balancing performance and computational efficiency for real-life applications.
Abstract: The successful deployment of deep learning-based acoustic echo and noise reduction (AENR) methods in consumer devices has spurred interest in developing low-complexity solutions, while emphasizing the need for robust performance in real-life applications. In this work, we propose a hybrid approach to enhance the state-of-the-art (SOTA) ULCNet model by integrating time alignment and parallel encoder blocks for the model inputs, resulting in better echo reduction and comparable noise reduction performance to existing SOTA methods. We also propose a channel-wise sampling-based feature reorientation method, ensuring robust performance across many challenging scenarios, while maintaining overall low computational and memory requirements.
[1006] Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations
T. Aleksandra Ma, Sile Yin, Li-Chia Yang, Shuo Zhang
Main category: eess.AS
TL;DR: RAVEN is a real-time AVSE system using visual embeddings from AVSR and ASD to enhance target speaker audio in noisy or multi-speaker environments.
Details
Motivation: Speech enhancement is difficult in audio-only settings, especially with interfering speakers. The paper aims to improve this using visual cues.Method: RAVEN combines visual embeddings from AVSR and ASD models to isolate and enhance the target speaker while suppressing noise and interference.
Result: AVSR+ASD embeddings work best in low-SNR, multi-speaker settings, while AVSR alone excels in noise-only scenarios. A real-time CPU-based system was developed.
Conclusion: RAVEN is the first open-source real-time AVSE system, demonstrating effective use of visual embeddings for speech enhancement.
Abstract: Speech enhancement in audio-only settings remains challenging, particularly in the presence of interfering speakers. This paper presents a simple yet effective real-time audio-visual speech enhancement (AVSE) system, RAVEN, which isolates and enhances the on-screen target speaker while suppressing interfering speakers and background noise. We investigate how visual embeddings learned from audio-visual speech recognition (AVSR) and active speaker detection (ASD) contribute to AVSE across different SNR conditions and numbers of interfering speakers. Our results show concatenating embeddings from AVSR and ASD models provides the greatest improvement in low-SNR, multi-speaker environments, while AVSR embeddings alone perform best in noise-only scenarios. In addition, we develop a real-time streaming system that operates on a computer CPU and we provide a video demonstration and code repository. To our knowledge, this is the first open-source implementation of a real-time AVSE system.
[1007] MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks
Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou, Gang Li, Jizhong Liu, Xunying Liu, Junbo Zhang, Jian Luan
Main category: eess.AS
TL;DR: The paper introduces MECAT, a benchmark for fine-grained audio understanding, and a new metric (DATE) to evaluate detailed model outputs, addressing gaps in current benchmarks.
Details
Motivation: Current benchmarks fail to distinguish between generic and detailed audio model outputs, limiting progress in nuanced audio understanding.Method: MECAT is created using expert models and Chain-of-Thought reasoning, providing fine-grained captions and QA pairs. DATE metric combines semantic similarity and discriminability.
Result: The benchmark and metric offer new insights into state-of-the-art audio models’ capabilities and limitations.
Conclusion: MECAT and DATE advance audio understanding by enabling more reliable and detailed evaluations.
Abstract: While large audio-language models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability. A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The data and code are available at https://github.com/xiaomi-research/mecat
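The DATE metric's combination of single-sample similarity and cross-sample discriminability can be sketched as follows. This is our illustrative formulation of the idea, not the official implementation:

```python
import numpy as np

# Hedged sketch of a discriminability-aware caption score: reward similarity
# to the matching reference while penalizing captions that are equally similar
# to everyone else's references (i.e., generic text).
def date_like_score(sim_matrix: np.ndarray) -> np.ndarray:
    """sim_matrix[i, j] = semantic similarity of caption i to reference j."""
    matched = np.diag(sim_matrix)                    # single-sample similarity
    off_diag = (sim_matrix.sum(axis=1) - matched) / (sim_matrix.shape[1] - 1)
    return matched - off_diag                        # discriminability margin

sims = np.array([[0.9, 0.3, 0.2],   # detailed caption: high match, low others
                 [0.6, 0.6, 0.6],   # generic caption: similar to everything
                 [0.2, 0.3, 0.8]])
print(date_like_score(sims))        # generic caption (row 1) scores 0.0
```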
eess.IV
[1008] Diagnostic Accuracy of Open-Source Vision-Language Models on Diverse Medical Imaging Tasks
Gustav Müller-Franzes, Debora Jutz, Jakob Nikolas Kather, Christiane Kuhl, Sven Nebelung, Daniel Truhn
Main category: eess.IV
TL;DR: This study evaluated five VLMs on the MedFMC dataset, finding Qwen2.5 generally outperformed others, especially in chest radiography and endoscopy. Performance varied by task, with retinal fundoscopy being particularly challenging. Multimodal input and chain-of-thought reasoning did not consistently improve results.
Details
Motivation: To assess the diagnostic accuracy of open-source VLMs across diverse medical imaging tasks and identify strengths and limitations for clinical use.Method: Retrospective evaluation of five VLMs using the MedFMC dataset (22,349 images from 7,461 patients) across five tasks. Accuracy was compared in three settings: visual-only, multimodal input, and chain-of-thought reasoning, using bootstrapped confidence intervals.
Result: Qwen2.5 excelled in chest radiographs (90.4%) and endoscopy (84.2%), while Qwen2.5 and Phi-4 led in colon pathology and neonatal jaundice. All models struggled with retinal fundoscopy (highest accuracy 18.6%). Multimodal input and chain-of-thought reasoning did not consistently improve accuracy.
Conclusion: Open-source VLMs show promise for medical diagnostics, particularly in simpler tasks like chest radiography, but struggle with complex domains like retinal fundoscopy, highlighting the need for further development before clinical deployment.
Abstract: This retrospective study evaluated five VLMs (Qwen2.5, Phi-4, Gemma3, Llama3.2, and Mistral3.1) using the MedFMC dataset. This dataset includes 22,349 images from 7,461 patients encompassing chest radiography (19 disease multi-label classifications), colon pathology (tumor detection), endoscopy (colorectal lesion identification), neonatal jaundice assessment (skin color-based treatment necessity), and retinal fundoscopy (5-point diabetic retinopathy grading). Diagnostic accuracy was compared in three experimental settings: visual input only, multimodal input, and chain-of-thought reasoning. Model accuracy was assessed against ground truth labels, with statistical comparisons using bootstrapped confidence intervals (p<.05). Qwen2.5 achieved the highest accuracy for chest radiographs (90.4%) and endoscopy images (84.2%), significantly outperforming the other models (p<.001). In colon pathology, Qwen2.5 (69.0%) and Phi-4 (69.6%) performed comparably (p=.41), both significantly exceeding other VLMs (p<.001). Similarly, for neonatal jaundice assessment, Qwen2.5 (58.3%) and Phi-4 (58.1%) showed comparable leading accuracies (p=.93) significantly exceeding their counterparts (p<.001). All models struggled with retinal fundoscopy; Qwen2.5 and Gemma3 achieved the highest, albeit modest, accuracies at 18.6% (comparable, p=.99), significantly better than other tested models (p<.001). Unexpectedly, multimodal input reduced accuracy for some models and modalities, and chain-of-thought reasoning prompts also failed to improve accuracy. The open-source VLMs demonstrated promising diagnostic capabilities, particularly in chest radiograph interpretation. However, performance in complex domains such as retinal fundoscopy was limited, underscoring the need for further development and domain-specific adaptation before widespread clinical application.
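For readers who want to reproduce the statistical machinery, here is a minimal sketch of a bootstrapped confidence interval for accuracy; the iteration count and the percentile method are our assumptions, not details from the study:

```python
import numpy as np

# Hedged sketch: resample per-case correctness with replacement and take
# percentiles of the resulting accuracy distribution.
def bootstrap_ci(correct: np.ndarray, n_boot=10_000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(correct)
    accs = rng.choice(correct, size=(n_boot, n), replace=True).mean(axis=1)
    return np.quantile(accs, [alpha / 2, 1 - alpha / 2])

correct = np.array([1] * 904 + [0] * 96)  # e.g., 90.4% accuracy on 1000 cases
print(bootstrap_ci(correct))              # ≈ [0.886, 0.922]
```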
[1009] CoCoLIT: ControlNet-Conditioned Latent Image Translation for MRI to Amyloid PET Synthesis
Alec Sargood, Lemuel Puglisi, James H. Cole, Neil P. Oxtoby, Daniele Ravì, Daniel C. Alexander
Main category: eess.IV
TL;DR: CoCoLIT, a diffusion-based latent generative framework, synthesizes amyloid PET scans from structural MRI for cost-effective Alzheimer’s Disease screening, outperforming state-of-the-art methods.
Details
Motivation: MRI may encode amyloid-related information, but existing MRI-to-PET translation methods struggle with high-dimensional data.Method: CoCoLIT uses a latent space approach with innovations like Weighted Image Space Loss, Latent Average Stabilization analysis, and ControlNet-based conditioning.
Result: CoCoLIT significantly outperforms other methods, improving amyloid-positivity classification by +10.5% (internal) and +23.7% (external).
Conclusion: CoCoLIT offers a promising, scalable solution for AD screening by effectively translating MRI to PET scans.
Abstract: Synthesizing amyloid PET scans from the more widely available and accessible structural MRI modality offers a promising, cost-effective approach for large-scale Alzheimer’s Disease (AD) screening. This is motivated by evidence that, while MRI does not directly detect amyloid pathology, it may nonetheless encode information correlated with amyloid deposition that can be uncovered through advanced modeling. However, the high dimensionality and structural complexity of 3D neuroimaging data pose significant challenges for existing MRI-to-PET translation methods. Modeling the cross-modality relationship in a lower-dimensional latent space can simplify the learning task and enable more effective translation. As such, we present CoCoLIT (ControlNet-Conditioned Latent Image Translation), a diffusion-based latent generative framework that incorporates three main innovations: (1) a novel Weighted Image Space Loss (WISL) that improves latent representation learning and synthesis quality; (2) a theoretical and empirical analysis of Latent Average Stabilization (LAS), an existing technique used in similar generative models to enhance inference consistency; and (3) the introduction of ControlNet-based conditioning for MRI-to-PET translation. We evaluate CoCoLIT’s performance on publicly available datasets and find that our model significantly outperforms state-of-the-art methods on both image-based and amyloid-related metrics. Notably, in amyloid-positivity classification, CoCoLIT outperforms the second-best method with improvements of +10.5% on the internal dataset and +23.7% on the external dataset. The code and models of our approach are available at https://github.com/brAIn-science/CoCoLIT.
[1010] SWAN: Synergistic Wavelet-Attention Network for Infrared Small Target Detection
Yuxin Jing, Jufeng Zhao, Tianpei Zhang, Yiming Zhu
Main category: eess.IV
TL;DR: SWAN, a novel framework combining wavelet and attention mechanisms, improves infrared small target detection by addressing spatial and frequency domain challenges.
Details
Motivation: Precise infrared small target detection (IRSTD) is crucial but challenging due to complex backgrounds and limitations of conventional convolution operations.Method: Proposes SWAN with Haar Wavelet Convolution (HWConv) for cross-domain fusion, Shifted Spatial Attention (SSA) for long-range dependencies, and Residual Dual-Channel Attention (RDCA) for feature calibration.
Result: SWAN outperforms state-of-the-art methods, enhancing detection accuracy and robustness in complex scenarios.
Conclusion: SWAN effectively addresses IRSTD challenges by integrating spatial and frequency domain insights, demonstrating superior performance.
Abstract: Infrared small target detection (IRSTD) is critical in both civilian and military applications. This study addresses the challenge of precise IRSTD in complex backgrounds. Recent methods rely fundamentally on conventional convolution operations, which primarily capture local spatial patterns and struggle to distinguish the unique frequency-domain characteristics of small targets from intricate background clutter. To overcome these limitations, we propose the Synergistic Wavelet-Attention Network (SWAN), a novel framework designed to perceive targets in both the spatial and frequency domains. SWAN leverages a Haar Wavelet Convolution (HWConv) for a deep, cross-domain fusion of the frequency energy and spatial details of small targets. Furthermore, a Shifted Spatial Attention (SSA) mechanism efficiently models long-range spatial dependencies with linear computational complexity, enhancing contextual awareness. Finally, a Residual Dual-Channel Attention (RDCA) module adaptively calibrates channel-wise feature responses to suppress background interference while amplifying target-pertinent signals. Extensive experiments on benchmark datasets demonstrate that SWAN surpasses existing state-of-the-art methods, showing significant improvements in detection accuracy and robustness, particularly in complex and challenging scenarios.
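As a rough illustration of the wavelet side of HWConv, here is a single-level 2-D Haar-style decomposition in plain NumPy; the averaging normalization is a simplification, and the learned convolutional fusion is omitted entirely:

```python
import numpy as np

# Hedged sketch: average/difference pooling over 2x2 blocks yields one
# low-frequency band and three high-frequency detail bands, where small,
# point-like targets tend to stand out against smooth background.
def haar2d(x: np.ndarray):
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 4        # low-low: coarse background
    lh = (a + b - c - d) / 4        # horizontal detail
    hl = (a - b + c - d) / 4        # vertical detail
    hh = (a - b - c + d) / 4        # diagonal detail
    return ll, lh, hl, hh

bands = haar2d(np.random.rand(64, 64))
print([band.shape for band in bands])   # four (32, 32) sub-bands
```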
[1011] Classification of Brain Tumors using Hybrid Deep Learning Models
Neerav Nemchand Gala
Main category: eess.IV
TL;DR: EfficientNetV2 outperforms EfficientNet and ResNet50 in brain tumor classification but requires more training time due to its complexity.
Details
Motivation: To address the high computational and data demands of conventional CNNs in medical image interpretation.Method: Applied transfer learning and compared EfficientNetV2, EfficientNet, and ResNet50 for classifying brain tumors (glioma, meningioma, pituitary).
Result: EfficientNetV2 achieved superior classification performance but with increased training time.
Conclusion: Transfer learning with EfficientNetV2 is effective for medical image classification but trades off performance for computational cost.
Abstract: The use of Convolutional Neural Networks (CNNs) has greatly improved the interpretation of medical images. However, conventional CNNs typically demand extensive computational resources and large training datasets. To address these limitations, this study applied transfer learning to achieve strong classification performance using fewer training samples. Specifically, the study compared EfficientNetV2 with its predecessor, EfficientNet, and with ResNet50 in classifying brain tumors into three types: glioma, meningioma, and pituitary tumors. Results showed that EfficientNetV2 delivered superior performance compared to the other models. However, this improvement came at the cost of increased training time, likely due to the model’s greater complexity.
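The transfer-learning recipe described is standard. A minimal sketch with torchvision follows; the freezing policy and three-class head mirror the study's setup, but the exact unfreezing schedule is our assumption:

```python
import torch.nn as nn
from torchvision import models

# Hedged sketch: load an ImageNet-pretrained EfficientNetV2-S, freeze the
# backbone, and replace the classifier head for the three tumor classes.
model = models.efficientnet_v2_s(weights="IMAGENET1K_V1")
for p in model.parameters():
    p.requires_grad = False                      # freeze pretrained features
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 3)
# Only the new head is trainable: glioma vs. meningioma vs. pituitary.
```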
[1012] Predicting EGFR Mutation in LUAD from Histopathological Whole-Slide Images Using Pretrained Foundation Model and Transfer Learning: An Indian Cohort Study
Sagar Singh Gwal, Rajan, Suyash Devgan, Shraddhanjali Satapathy, Abhishek Goyal, Nuruddin Mohammad Iqbal, Vivaan Jain, Prabhat Singh Mallik, Deepali Jain, Ishaan Gupta
Main category: eess.IV
TL;DR: A deep learning framework using vision transformers and attention-based multiple instance learning predicts EGFR mutation status in lung adenocarcinoma from H&E-stained slides, showing high accuracy across datasets.
Details
Motivation: Predicting EGFR mutation status is crucial for treatment decisions in lung adenocarcinoma, especially in Southeast Asian populations with higher mutation rates.Method: A DL framework combining vision transformers (ViT) and attention-based multiple instance learning (ABMIL) was trained on an Indian cohort (170 WSI) and tested on internal (30 WSI) and external (TCGA, 86 WSI) datasets.
Result: The model achieved AUCs of 0.933 (internal) and 0.965 (external), demonstrating robust performance.
Conclusion: The framework accurately predicts EGFR mutations from routine pathology slides, even with small datasets, offering potential for resource-limited settings.
Abstract: Lung adenocarcinoma (LUAD) is a subtype of non-small cell lung cancer (NSCLC). LUAD with mutation in the EGFR gene accounts for approximately 46% of LUAD cases. Patients carrying EGFR mutations can be treated with specific tyrosine kinase inhibitors (TKIs). Hence, predicting EGFR mutation status can help in clinical decision making. H&E-stained whole slide imaging (WSI) is a routinely performed screening procedure for cancer staging and subtyping; the mutation is especially relevant for Southeast Asian populations, whose incidence is significantly higher than that of Caucasians (39-64% vs 7-22%). Recent progress in AI models has shown promising results in cancer detection and classification. In this study, we propose a deep learning (DL) framework built on a vision transformer (ViT) based pathology foundation model and an attention-based multiple instance learning (ABMIL) architecture to predict EGFR mutation status from H&E WSI. The developed pipeline was trained using data from an Indian cohort (170 WSI) and evaluated across two independent datasets: an internal test set (30 WSI from the Indian cohort) and an external test set from TCGA (86 WSI). The model shows consistent performance across both datasets, with AUCs of 0.933 (+/-0.010) and 0.965 (+/-0.015) for the internal and external test sets, respectively. The proposed framework can be efficiently trained on small datasets, achieving superior performance as compared to several prior studies irrespective of training domain. The current study demonstrates the feasibility of accurately predicting EGFR mutation status using routine pathology slides, particularly in resource-limited settings, using foundation models and attention-based multiple instance learning.
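The ABMIL component follows a well-known pattern (attention pooling in the style of Ilse et al.). A minimal sketch with illustrative dimensions, not the paper's exact pipeline:

```python
import torch
import torch.nn as nn

# Hedged sketch of attention-based MIL pooling: per-tile foundation-model
# features are weighted by a learned attention score and summed into one
# slide-level embedding. Dimensions are illustrative.
class ABMILPooling(nn.Module):
    def __init__(self, dim=768, hidden=128):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.classifier = nn.Linear(dim, 1)  # EGFR mutant vs. wild-type logit

    def forward(self, tiles):                # tiles: (num_tiles, dim)
        weights = torch.softmax(self.attn(tiles), dim=0)  # (num_tiles, 1)
        slide_embedding = (weights * tiles).sum(dim=0)    # (dim,)
        return self.classifier(slide_embedding)

logit = ABMILPooling()(torch.randn(500, 768))  # 500 tiles from one WSI
print(torch.sigmoid(logit))
```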
[1013] Viscosity Stabilized Plug-and-Play Reconstruction
Arghya Sinha, Trishit Mukherjee, Kunal N. Chaudhury
Main category: eess.IV
TL;DR: The paper proposes a stabilization mechanism for plug-and-play (PnP) methods in image reconstruction, addressing instability in iterative processes without restrictive denoiser constraints.
Details
Motivation: PnP methods allow pretrained denoisers to be reused across tasks but suffer from instability in later iterations, degrading visual quality and PSNR. Existing solutions impose restrictive constraints, which standard denoisers don't satisfy.Method: A data-driven stabilization mechanism adaptively averages the PnP operator with a contractive IR operator, acting as viscosity regularization to dampen updates and prevent divergence.
Result: The proposed mechanism effectively stabilizes PnP across various proximal algorithms, denoising architectures, and imaging tasks.
Conclusion: The stabilization method enhances PnP performance without requiring restrictive denoiser constraints, improving reliability in iterative image reconstruction.
Abstract: The plug-and-play (PnP) method uses a deep denoiser within a proximal algorithm for model-based image reconstruction (IR). Unlike end-to-end IR, PnP allows the same pretrained denoiser to be used across different imaging tasks, without the need for retraining. However, black-box networks can make the iterative process in PnP unstable. A common issue observed across architectures like CNNs, diffusion models, and transformers is that the visual quality and PSNR often improve initially but then degrade in later iterations. Previous attempts to ensure stability usually impose restrictive constraints on the denoiser. However, standard denoisers, which are freely trained for single-step noise removal, need not satisfy such constraints. We propose a simple data-driven stabilization mechanism that adaptively averages the potentially unstable PnP operator with a contractive IR operator. This acts as a form of viscosity regularization, where the contractive component progressively dampens updates in later iterations, helping to suppress oscillations and prevent divergence. We validate the effectiveness of our stabilization mechanism across different proximal algorithms, denoising architectures, and imaging tasks.
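The averaging mechanism is easy to state concretely. A hedged sketch with a linear weight schedule and toy scalar operators; the paper's rule is data-driven and adaptive, not this fixed schedule:

```python
# Hedged sketch of viscosity stabilization: blend the (possibly unstable) PnP
# operator with a contractive reconstruction operator, weighting the
# contractive part more heavily in later iterations to damp oscillations.
def stabilized_pnp(x0, pnp_step, contractive_step, n_iters=100):
    x = x0
    for k in range(n_iters):
        gamma = k / (n_iters - 1)          # viscosity weight grows with k
        x = (1 - gamma) * pnp_step(x) + gamma * contractive_step(x)
    return x

# Toy scalar demo: an expansive "PnP" map alone would diverge, but blending in
# a contraction toward 1.0 keeps the iterates bounded.
out = stabilized_pnp(5.0, lambda x: 1.05 * x, lambda x: 1.0 + 0.5 * (x - 1.0))
print(out)
```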
[1014] CGCCE-Net:Change-Guided Cross Correlation Enhancement Network for Remote Sensing Building Change Detection
ChengMing Wang
Main category: eess.IV
TL;DR: CGCCE-Net, a deep learning-based method, improves building change detection (BCD) by enhancing change information and addressing special color issues, outperforming existing methods.
Details
Motivation: To address the challenge of accurately detecting building changes, especially with special colors, and improving precision in BCD tasks.Method: Proposes CGCCE-Net with CGRR Branch for multi-scale feature extraction, GCCM for semantic interaction, SCEM for feature enhancement, and CFD for fusion.
Result: CGCCE-Net outperforms mainstream BCD methods on three public datasets.
Conclusion: The proposed method effectively enhances change detection precision and handles special color challenges in BCD tasks.
Abstract: Change detection encompasses a variety of task types, and the goal of building change detection (BCD) tasks is to accurately locate buildings and distinguish changed building areas. In recent years, various deep learning-based BCD methods have achieved significant success in detecting difference regions by using different change information enhancement techniques, effectively improving the precision of BCD tasks. To address the issue of BCD with special colors, we propose the change-guided cross correlation enhancement network (CGCCE-Net). We design the change-guided residual refinement (CGRR) branch, which focuses on extending shallow texture features to the multi-scale features obtained from PVT, enabling early attention to and acquisition of special colors. Then, channel spatial attention is used in the deep features to achieve independent information enhancement. Additionally, we construct the global cross correlation module (GCCM) to facilitate semantic information interaction between bi-temporal images, establishing building and target recognition relationships between different images. Further semantic feature enhancement is achieved through the semantic cognitive enhancement module (SCEM), and finally, the cross fusion decoder (CFD) is used for change information fusion and image reconstruction. Extensive experiments on three public datasets demonstrate that our CGCCE-Net outperforms mainstream BCD methods with outstanding performance.
[1015] MGCR-Net:Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection
Chengming Wang, Guodong Fan, Jinjiang Li, Min Gan, C. L. Philip Chen
Main category: eess.IV
TL;DR: Proposes MGCR-Net, a multimodal graph-conditioned vision-language reconstruction network, for improved remote sensing change detection by leveraging multimodal data and large language models.
Details
Motivation: Address limitations in traditional and deep learning-based change detection methods by exploring semantic interactions in multimodal data.Method: Uses MLLM-based optimization to generate textual data from images, dual encoders for feature extraction, and a graph-conditioned reconstruction module for vision-language interaction.
Result: Outperforms mainstream CD methods on four public datasets.
Conclusion: MGCR-Net effectively enhances change detection through multimodal data and semantic interaction.
Abstract: With the advancement of remote sensing satellite technology and the rapid progress of deep learning, remote sensing change detection (RSCD) has become a key technique for regional monitoring. Traditional change detection (CD) methods and deep learning-based approaches have made significant contributions to change analysis and detection; however, many outstanding methods still face limitations in the exploration and application of multimodal data. To address this, we propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to further explore the semantic interaction capabilities of multimodal data. Multimodal large language models (MLLM) have attracted widespread attention for their outstanding performance in computer vision, particularly due to their powerful visual-language understanding and dialogic interaction capabilities. Specifically, we design a MLLM-based optimization strategy to generate multimodal textual data from the original CD images, which serve as textual input to MGCR. Visual and textual features are extracted through a dual encoder framework. For the first time in the RSCD task, we introduce a multimodal graph-conditioned vision-language reconstruction mechanism, which is integrated with graph attention to construct a semantic graph-conditioned reconstruction module (SGCM); this module generates vision-language (VL) tokens through graph-based conditions and enables cross-dimensional interaction between visual and textual features via multi-head attention. The reconstructed VL features are then deeply fused using the language vision transformer (LViT), achieving fine-grained feature alignment and high-level semantic interaction. Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods. Our code is available on https://github.com/cn-xvkong/MGCR
[1016] Deeply Supervised Multi-Task Autoencoder for Biological Brain Age estimation using three dimensional T$_1$-weighted magnetic resonance imaging
Mehreen Kanwal, Yunsik Son
Main category: eess.IV
TL;DR: The paper proposes a Deeply Supervised Multitask Autoencoder (DSMT-AE) framework for accurate brain age estimation from 3D MRI scans, addressing challenges like vanishing gradients and sex-based structural differences. It achieves state-of-the-art performance on the OpenBHB dataset.
Details
Motivation: Accurate brain age estimation is crucial for identifying neurodegenerative diseases. Challenges include optimizing 3D models and accounting for sex differences in brain structure.Method: DSMT-AE uses deep supervision and multitask learning (brain age prediction, sex classification, image reconstruction) to improve feature representation and model stability.
Result: DSMT-AE outperforms existing methods on the OpenBHB dataset, showing robustness across age and sex subgroups. Ablation studies confirm the framework’s effectiveness.
Conclusion: The DSMT-AE framework enhances brain age prediction accuracy by integrating deep supervision and multitask learning, demonstrating its potential for clinical applications.
Abstract: Accurate estimation of biological brain age from three dimensional (3D) T$_1$-weighted magnetic resonance imaging (MRI) is a critical imaging biomarker for identifying accelerated aging associated with neurodegenerative diseases. Effective brain age prediction necessitates training 3D models to leverage comprehensive insights from volumetric MRI scans, thereby fully capturing spatial anatomical context. However, optimizing deep 3D models remains challenging due to problems such as vanishing gradients. Furthermore, brain structural patterns differ significantly between sexes, which impacts aging trajectories and vulnerability to neurodegenerative diseases, thereby making sex classification crucial for enhancing the accuracy and generalizability of predictive models. To address these challenges, we propose a Deeply Supervised Multitask Autoencoder (DSMT-AE) framework for brain age estimation. DSMT-AE employs deep supervision, which involves applying supervisory signals at intermediate layers during training, to stabilize model optimization, and multitask learning to enhance feature representation. Specifically, our framework simultaneously optimizes brain age prediction alongside auxiliary tasks of sex classification and image reconstruction, thus effectively capturing anatomical and demographic variability to improve prediction accuracy. We extensively evaluate DSMT-AE on the Open Brain Health Benchmark (OpenBHB) dataset, the largest multisite neuroimaging cohort combining ten publicly available datasets. The results demonstrate that DSMT-AE achieves state-of-the-art performance and robustness across age and sex subgroups. Additionally, our ablation study confirms that each proposed component substantially contributes to the improved predictive accuracy and robustness of the overall architecture.
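The multitask objective can be written down directly. A hedged sketch with placeholder loss weights, omitting the deep-supervision heads the paper attaches to intermediate layers:

```python
import torch
import torch.nn.functional as F

# Hedged sketch of the multitask loss: brain-age regression as the main task
# plus auxiliary sex classification and image reconstruction. Weights are
# placeholders, not the paper's values.
def dsmt_loss(age_pred, age, sex_logit, sex, recon, mri, w_sex=0.1, w_rec=0.1):
    loss_age = F.l1_loss(age_pred, age)                            # main task
    loss_sex = F.binary_cross_entropy_with_logits(sex_logit, sex)  # auxiliary
    loss_rec = F.mse_loss(recon, mri)                              # autoencoder
    return loss_age + w_sex * loss_sex + w_rec * loss_rec

batch = 4
loss = dsmt_loss(torch.rand(batch), torch.rand(batch) * 80,
                 torch.randn(batch), torch.randint(0, 2, (batch,)).float(),
                 torch.rand(batch, 1, 8, 8, 8), torch.rand(batch, 1, 8, 8, 8))
print(loss)
```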
[1017] Tractography-Guided Dual-Label Collaborative Learning for Multi-Modal Cranial Nerves Parcellation
Lei Xie, Junxiong Huang, Yuanjing Feng, Qingrun Zeng
Main category: eess.IV
TL;DR: A tractography-guided Dual-label Collaborative Learning Network (DCLNet) is proposed for multi-modal Cranial Nerves (CNs) parcellation, improving performance by combining coarse labels from tractography and expert-annotated precise labels, along with a Modality-adaptive Encoder Module for MRI fusion.
Details
Motivation: Existing multi-modal CNs parcellation methods underutilize diffusion MRI, leading to suboptimal performance. This work aims to enhance segmentation by better integrating diffusion MRI data.Method: DCLNet uses coarse labels from tractography and precise expert labels for collaborative learning, alongside a Modality-adaptive Encoder Module to fuse structural and diffusion MRI data.
Result: Experiments on the HCP dataset show DCLNet outperforms single-label networks, validating the dual-label approach.
Conclusion: The dual-label strategy and modality-adaptive fusion effectively address ambiguities in CNs parcellation, demonstrating improved performance.
Abstract: The parcellation of Cranial Nerves (CNs) serves as a crucial quantitative methodology for evaluating the morphological characteristics and anatomical pathways of specific CNs. Multi-modal CNs parcellation networks have achieved promising segmentation performance, which combine structural Magnetic Resonance Imaging (MRI) and diffusion MRI. However, insufficient exploration of diffusion MRI information has led to low performance of existing multi-modal fusion. In this work, we propose a tractography-guided Dual-label Collaborative Learning Network (DCLNet) for multi-modal CNs parcellation. The key contribution of our DCLNet is the introduction of coarse labels of CNs obtained from fiber tractography through CN atlas, and collaborative learning with precise labels annotated by experts. Meanwhile, we introduce a Modality-adaptive Encoder Module (MEM) to achieve soft information swapping between structural MRI and diffusion MRI. Extensive experiments conducted on the publicly available Human Connectome Project (HCP) dataset demonstrate performance improvements compared to single-label network. This systematic validation underscores the effectiveness of dual-label strategies in addressing inherent ambiguities in CNs parcellation tasks.
[1018] Measuring and Predicting Where and When Pathologists Focus their Visual Attention while Grading Whole Slide Images of Cancer
Souradeep Chakraborty, Ruoyu Xue, Rajarsi Gupta, Oksana Yaskiv, Constantin Friedman, Natallia Sheuka, Dana Perez, Paul Friedman, Won-Tak Choi, Waqas Mahmud, Beatrice Knudsen, Gregory Zelinsky, Joel Saltz, Dimitris Samaras
Main category: eess.IV
TL;DR: The paper presents a method to predict pathologists’ attention scanpaths on prostate cancer WSIs using a two-stage transformer-based model, improving training for pathology trainees.
Details
Motivation: Predicting expert pathologists' attention can enhance pathology training by developing decision support systems.Method: A two-stage model: (1) predicts static attention heatmaps using transformers, (2) autoregressively predicts dynamic scanpaths from heatmaps.
Result: The model outperforms chance and baselines in predicting attention scanpaths.
Conclusion: The tool can aid trainees in learning expert-like attention allocation during WSI reading.
Abstract: The ability to predict the attention of expert pathologists could lead to decision support systems for better pathology training. We developed methods to predict the spatio-temporal (where and when) movements of pathologists’ attention as they grade whole slide images (WSIs) of prostate cancer. We characterize a pathologist’s attention trajectory by their x, y, and m (magnification) movements of a viewport as they navigate WSIs using a digital microscope. This information was obtained from 43 pathologists across 123 WSIs, and we consider the task of predicting the pathologist attention scanpaths constructed from the viewport centers. We introduce a fixation extraction algorithm that simplifies an attention trajectory by extracting fixations in the pathologist’s viewing while preserving semantic information, and we use these pre-processed data to train and test a two-stage model to predict the dynamic (scanpath) allocation of attention during WSI reading via intermediate attention heatmap prediction. In the first stage, a transformer-based sub-network predicts the attention heatmaps (static attention) across different magnifications. In the second stage, we predict the attention scanpath by sequentially modeling the next fixation points in an autoregressive manner using a transformer-based approach, starting at the WSI center and leveraging multi-magnification feature representations from the first stage. Experimental results show that our scanpath prediction model outperforms chance and baseline models. Tools developed from this model could assist pathology trainees in learning to allocate their attention during WSI reading like an expert.
[1019] LoRA-based methods on Unet for transfer learning in Subarachnoid Hematoma Segmentation
Cristian Minoccheri, Matthew Hodgman, Haoyuan Ma, Rameez Merchant, Emily Wittrup, Craig Williamson, Kayvan Najarian
Main category: eess.IV
TL;DR: The paper explores transfer learning for aneurysmal SAH segmentation using LoRA methods, showing improved performance over standard Unet fine-tuning.
Details
Motivation: Aneurysmal SAH has high mortality rates, and transfer learning from related hematoma types is underexplored.
Method: Implemented a Unet pre-trained on TBI scans, fine-tuned on SAH data, and introduced novel CP-LoRA and DoRA variants.
Result: LoRA methods outperformed standard fine-tuning, with CP-LoRA achieving comparable results using fewer parameters.
Conclusion: Transfer learning between hematoma types is feasible, and LoRA methods significantly enhance SAH segmentation.
Abstract: Aneurysmal subarachnoid hemorrhage (SAH) is a life-threatening neurological emergency with mortality rates exceeding 30%. Transfer learning from related hematoma types represents a potentially valuable but underexplored approach. Although Unet architectures remain the gold standard for medical image segmentation due to their effectiveness on limited datasets, Low-Rank Adaptation (LoRA) methods for parameter-efficient transfer learning have been rarely applied to convolutional neural networks in medical imaging contexts. We implemented a Unet architecture pre-trained on computed tomography scans from 124 traumatic brain injury patients across multiple institutions, then fine-tuned on 30 aneurysmal SAH patients from the University of Michigan Health System using 3-fold cross-validation. We developed a novel CP-LoRA method based on tensor CP-decomposition and introduced DoRA variants (DoRA-C, convDoRA, CP-DoRA) that decompose weight matrices into magnitude and directional components. We compared these approaches against existing LoRA methods (LoRA-C, convLoRA) and standard fine-tuning strategies across different modules on a multi-view Unet model. LoRA-based methods consistently outperformed standard Unet fine-tuning. Performance varied by hemorrhage volume, with all methods showing improved accuracy for larger volumes. CP-LoRA achieved comparable performance to existing methods while using significantly fewer parameters. Over-parameterization with higher ranks consistently yielded better performance than strictly low-rank adaptations. This study demonstrates that transfer learning between hematoma types is feasible and that LoRA-based methods significantly outperform conventional Unet fine-tuning for aneurysmal SAH segmentation.
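For readers unfamiliar with LoRA on convolutions, the following sketch shows the generic recipe of freezing a pre-trained conv layer and adding a trainable low-rank bottleneck in parallel. The paper's tensor CP-decomposition and DoRA magnitude/direction variants are not reproduced here, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class LoRAConv2d(nn.Module):
    def __init__(self, conv: nn.Conv2d, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.conv = conv
        for p in self.conv.parameters():     # freeze pre-trained weights
            p.requires_grad = False
        # Low-rank update: a rank-r bottleneck implemented as two convs.
        self.down = nn.Conv2d(conv.in_channels, rank, kernel_size=1, bias=False)
        self.up = nn.Conv2d(rank, conv.out_channels,
                            kernel_size=conv.kernel_size,
                            stride=conv.stride, padding=conv.padding, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.conv(x) + self.scale * self.up(self.down(x))

base = nn.Conv2d(16, 32, 3, padding=1)       # pretend this is pre-trained
lora = LoRAConv2d(base, rank=4)
y = lora(torch.randn(1, 16, 64, 64))
print(y.shape)                               # torch.Size([1, 32, 64, 64])
```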
[1020] Joint Lossless Compression and Steganography for Medical Images via Large Language Models
Pengcheng Zheng, Xiaorong Pu, Kecheng Chen, Jiaxin Huang, Meng Yang, Bai Feng, Yazhou Ren, Jianan Jiang
Main category: eess.IV
TL;DR: A novel joint lossless compression and steganography framework for medical images improves compression performance, efficiency, and security using adaptive modalities decomposition and segmented message steganography.
Details
Motivation: Existing LLM-based compressors for medical images lack a balance between performance and efficiency and overlook security, which is critical in medical scenarios.
Method: The framework uses adaptive modalities decomposition for dual-path lossless compression and a segmented message steganography algorithm for security, enhanced by anatomical priors-based low-rank adaptation (A-LoRA).
Result: The method outperforms in compression ratios, efficiency, and security.
Conclusion: The proposed framework effectively addresses the trade-off and security issues in medical image compression, with promising experimental results.
Abstract: Recently, large language models (LLMs) have driven promising progress in lossless image compression. However, directly adopting existing paradigms for medical images suffers from an unsatisfactory trade-off between compression performance and efficiency. Moreover, existing LLM-based compressors often overlook the security of the compression process, which is critical in modern medical scenarios. To this end, we propose a novel joint lossless compression and steganography framework. Inspired by bit plane slicing (BPS), we find it feasible to securely embed privacy messages into medical images in an invisible manner. Based on this insight, an adaptive modalities decomposition strategy is first devised to partition the entire image into two segments, providing global and local modalities for subsequent dual-path lossless compression. During this dual-path stage, we innovatively propose a segmented message steganography algorithm within the local modality path to ensure the security of the compression process. Coupled with the proposed anatomical priors-based low-rank adaptation (A-LoRA) fine-tuning strategy, extensive experimental results demonstrate the superiority of our proposed method in terms of compression ratios, efficiency, and security. The source code will be made publicly available.
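To make the bit-plane-slicing intuition concrete, here is a toy least-significant-bit embedding in NumPy. It only illustrates why low-order bit planes can carry an invisible payload; the paper's segmented steganography algorithm and its coupling with the LLM compressor are not reproduced.

```python
import numpy as np

def embed_lsb(image: np.ndarray, message: bytes) -> np.ndarray:
    bits = np.unpackbits(np.frombuffer(message, dtype=np.uint8))
    flat = image.flatten().copy()
    assert bits.size <= flat.size, "message too long for this cover image"
    flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits   # overwrite bit plane 0
    return flat.reshape(image.shape)

def extract_lsb(image: np.ndarray, n_bytes: int) -> bytes:
    bits = image.flatten()[:n_bytes * 8] & 1
    return np.packbits(bits).tobytes()

cover = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
stego = embed_lsb(cover, b"patient-id:0042")
assert extract_lsb(stego, 15) == b"patient-id:0042"
# Each pixel changes by at most 1 gray level, i.e. visually invisible.
assert np.max(np.abs(stego.astype(int) - cover.astype(int))) <= 1
```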
[1021] Conditional Residual Coding with Explicit-Implicit Temporal Buffering for Learned Video Compression
Yi-Hsin Chen, Kuan-Wei Ho, Martin Benjak, Jörn Ostermann, Wen-Hsiao Peng
Main category: eess.IV
TL;DR: A hybrid explicit-implicit temporal buffering scheme for conditional residual video coding is proposed, balancing memory efficiency and coding performance.
Details
Motivation: Existing conditional coding methods use implicit temporal information for superior performance but require high memory for storing features. This work aims to reduce memory while maintaining performance.
Method: A hybrid buffering strategy is introduced, combining one explicit temporal reference (decoded frame) and a few learned implicit features for inter-frame coding.
Result: The hybrid scheme outperforms using only explicit or implicit information and reduces buffer size to two video frames with minimal performance loss on 2K sequences.
Conclusion: The hybrid approach effectively balances memory and performance, with ablation studies revealing the impact of explicit and implicit temporal references.
Abstract: This work proposes a hybrid, explicit-implicit temporal buffering scheme for conditional residual video coding. Recent conditional coding methods propagate implicit temporal information for inter-frame coding, demonstrating superior coding performance to those relying exclusively on previously decoded frames (i.e. the explicit temporal information). However, these methods require substantial memory to store a large number of implicit features. This work presents a hybrid buffering strategy. For inter-frame coding, it buffers one previously decoded frame as the explicit temporal reference and a small number of learned features as implicit temporal reference. Our hybrid buffering scheme for conditional residual coding outperforms the single use of explicit or implicit information. Moreover, it allows the total buffer size to be reduced to the equivalent of two video frames with a negligible performance drop on 2K video sequences. The ablation experiment further sheds light on how these two types of temporal references impact the coding performance.
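A toy sketch of the buffering idea, with purely illustrative shapes: the codec keeps one explicit decoded frame plus a small slice of implicit features, rather than a deep queue of full feature maps.

```python
import torch

class HybridTemporalBuffer:
    def __init__(self, n_implicit_channels: int = 4):
        self.decoded_frame = None                # explicit reference
        self.latent = None                       # few implicit features
        self.c = n_implicit_channels

    def update(self, frame: torch.Tensor, features: torch.Tensor):
        self.decoded_frame = frame               # (3, H, W)
        self.latent = features[: self.c]         # keep only a few channels

    def references(self):
        return self.decoded_frame, self.latent

buf = HybridTemporalBuffer()
buf.update(torch.rand(3, 128, 128), torch.rand(64, 128, 128))
frame, latent = buf.references()
# Buffer cost ~ 3 + 4 = 7 channels, versus 64 for a fully implicit buffer.
print(frame.shape, latent.shape)
```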
[1022] ReCoSeg++:Extended Residual-Guided Cross-Modal Diffusion for Brain Tumor Segmentation
Sara Yavari, Rahul Nitin Pandya, Jacob Furst
Main category: eess.IV
TL;DR: A semi-supervised, two-stage framework for brain tumor segmentation in MRI scans, using residual-guided DDPM and a lightweight U-Net, achieves high accuracy on the BraTS 2021 dataset.
Details
Motivation: Accurate brain tumor segmentation is crucial for clinical diagnosis and treatment planning, but ground-truth masks are often unavailable.
Method: 1) Residual-guided DDPM synthesizes T1ce from FLAIR, T1, and T2 scans, using residuals as spatial priors. 2) A lightweight U-Net combines residuals and modalities for segmentation, with slice-level filtering and optimized thresholding.
Result: Achieves Dice score of 93.02% and IoU of 86.7% on BraTS 2021, outperforming the ReCoSeg baseline.
Conclusion: The method improves accuracy and scalability for real-world, multi-center MRI datasets.
Abstract: Accurate segmentation of brain tumors in MRI scans is critical for clinical diagnosis and treatment planning. We propose a semi-supervised, two-stage framework that extends the ReCoSeg approach to the larger and more heterogeneous BraTS 2021 dataset, while eliminating the need for ground-truth masks for the segmentation objective. In the first stage, a residual-guided denoising diffusion probabilistic model (DDPM) performs cross-modal synthesis by reconstructing the T1ce modality from FLAIR, T1, and T2 scans. The residual maps, capturing differences between predicted and actual T1ce images, serve as spatial priors to enhance downstream segmentation. In the second stage, a lightweight U-Net takes as input the concatenation of residual maps, computed as the difference between real T1ce and synthesized T1ce, with T1, T2, and FLAIR modalities to improve whole tumor segmentation. To address the increased scale and variability of BraTS 2021, we apply slice-level filtering to exclude non-informative samples and optimize thresholding strategies to balance precision and recall. Our method achieves a Dice score of $93.02\%$ and an IoU of $86.7\%$ for whole tumor segmentation on the BraTS 2021 dataset, outperforming the ReCoSeg baseline on BraTS 2020 (Dice: $91.7\%$, IoU: $85.3\%$), and demonstrating improved accuracy and scalability for real-world, multi-center MRI datasets.
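The second stage's input assembly reduces to a residual computation and a channel-wise concatenation. A minimal sketch under assumed tensor shapes, taking the stage-1 DDPM output as given:

```python
import torch

def build_segmentation_input(t1ce_real, t1ce_synth, t1, t2, flair):
    residual = t1ce_real - t1ce_synth   # spatial prior highlighting anomalies
    return torch.cat([residual, t1, t2, flair], dim=1)   # (B, 4, H, W)

B, H, W = 2, 240, 240
x = build_segmentation_input(*(torch.rand(B, 1, H, W) for _ in range(5)))
print(x.shape)                          # torch.Size([2, 4, 240, 240])
```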
[1023] M$^3$AD: Multi-task Multi-gate Mixture of Experts for Alzheimer’s Disease Diagnosis with Conversion Pattern Modeling
Yufeng Jiang, Hexiao Ding, Hongzhao Chen, Jing Lan, Xinzhi Teng, Gerald W. Y. Cheng, Zongxi Li, Haoran Xie, Jung Sun Yoo, Jing Cai
Main category: eess.IV
TL;DR: M$^3$AD is a multi-task deep learning framework for Alzheimer’s disease (AD) progression, combining diagnostic classification and cognitive transition modeling using structural MRI, achieving superior accuracy and generalization.
Details
Motivation: Current deep learning approaches oversimplify AD progression into discrete tasks, ignoring its continuum nature. This study aims to model the full NC-MCI-AD transition while improving diagnostic accuracy.
Method: The framework includes a preprocessing pipeline, a unified learning framework with demographic priors, and a multi-gate mixture of experts architecture. It uses SimMIM pretraining and multi-task fine-tuning.
Result: Achieves 95.13% accuracy for NC-MCI-AD classification and 99.15% for NC-AD, outperforming state-of-the-art methods. Also predicts cognitive transitions with 97.76% accuracy.
Conclusion: M$^3$AD offers a clinically practical, high-performance solution for AD progression modeling using only structural MRI, enabling early intervention.
Abstract: Alzheimer’s disease (AD) progression follows a complex continuum from normal cognition (NC) through mild cognitive impairment (MCI) to dementia, yet most deep learning approaches oversimplify this into discrete classification tasks. This study introduces M$^3$AD, a novel multi-task multi-gate mixture of experts framework that jointly addresses diagnostic classification and cognitive transition modeling using structural MRI. We incorporate three key innovations: (1) an open-source T1-weighted sMRI preprocessing pipeline, (2) a unified learning framework capturing NC-MCI-AD transition patterns with demographic priors (age, gender, brain volume) for improved generalization, and (3) a customized multi-gate mixture of experts architecture enabling effective multi-task learning with structural MRI alone. The framework employs specialized expert networks for diagnosis-specific pathological patterns while shared experts model common structural features across the cognitive continuum. A two-stage training protocol combines SimMIM pretraining with multi-task fine-tuning for joint optimization. Comprehensive evaluation across six datasets comprising 12,037 T1-weighted sMRI scans demonstrates superior performance: 95.13% accuracy for three-class NC-MCI-AD classification and 99.15% for binary NC-AD classification, representing improvements of 4.69% and 0.55% over state-of-the-art approaches. The multi-task formulation simultaneously achieves 97.76% accuracy in predicting cognitive transition. Our framework outperforms existing methods using fewer modalities and offers a clinically practical solution for early intervention. Code: https://github.com/csyfjiang/M3AD.
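A compact multi-gate mixture-of-experts (MMoE) skeleton, shown for intuition only: shared experts with one softmax gate per task (here, diagnosis and transition). Sizes are illustrative and do not match M$^3$AD's actual architecture.

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    def __init__(self, d_in=256, d_hid=128, n_experts=4, n_tasks=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU())
             for _ in range(n_experts)])
        self.gates = nn.ModuleList(
            [nn.Linear(d_in, n_experts) for _ in range(n_tasks)])

    def forward(self, x):                          # x: (B, d_in)
        e = torch.stack([exp(x) for exp in self.experts], dim=1)  # (B, E, d_hid)
        outs = []
        for gate in self.gates:                    # one gate per task
            w = torch.softmax(gate(x), dim=-1)     # (B, E) expert weights
            outs.append((w.unsqueeze(-1) * e).sum(dim=1))
        return outs                                # one fused feature per task

model = MMoE()
diag_feat, trans_feat = model(torch.randn(8, 256))
print(diag_feat.shape, trans_feat.shape)           # (8, 128) each
```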
[1024] Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation
Fenghe Tang, Bingkun Nian, Jianrui Ding, Wenxin Ma, Quan Quan, Chengqi Dong, Jie Yang, Wei Liu, S. Kevin Zhou
Main category: eess.IV
TL;DR: The paper introduces Mobile U-ViT, a lightweight model for medical image segmentation, combining efficient CNN and transformer features to address the gap in mobile medical image analysis.
Details
Motivation: Existing mobile models, optimized for natural images, perform poorly on medical tasks due to the high information density gap. A lightweight, high-performing solution is needed.
Method: Proposes Mobile U-ViT with ConvUtr for hierarchical patch embedding, LGL blocks for local-global exchange, and a lightweight transformer bottleneck for long-range modeling.
Result: Achieves state-of-the-art performance on eight 2D/3D datasets, including zero-shot testing on unseen datasets.
Conclusion: Mobile U-ViT is an efficient, powerful, and generalizable solution for mobile medical image analysis.
Abstract: In clinical practice, medical image analysis often requires efficient execution on resource-constrained mobile devices. However, existing mobile models, primarily optimized for natural images, tend to perform poorly on medical tasks due to the significant information density gap between natural and medical domains. Combining computational efficiency with medical imaging-specific architectural advantages remains a challenge when developing lightweight, universal, and high-performing networks. To address this, we propose a mobile model called Mobile U-shaped Vision Transformer (Mobile U-ViT) tailored for medical image segmentation. Specifically, we employ the newly proposed ConvUtr as a hierarchical patch embedding, featuring a parameter-efficient large-kernel CNN with inverted bottleneck fusion. This design exhibits transformer-like representation learning capacity while being lighter and faster. To enable efficient local-global information exchange, we introduce a novel Large-kernel Local-Global-Local (LGL) block that effectively balances the low information density and high-level semantic discrepancy of medical images. Finally, we incorporate a shallow and lightweight transformer bottleneck for long-range modeling and employ a cascaded decoder with downsample skip connections for dense prediction. Despite its reduced computational demands, our medical-optimized architecture achieves state-of-the-art performance across eight public 2D and 3D datasets covering diverse imaging modalities, including zero-shot testing on four unseen datasets. These results establish it as an efficient yet powerful and generalizable solution for mobile medical image analysis. Code is available at https://github.com/FengheTan9/Mobile-U-ViT.
[1025] Large Kernel MedNeXt for Breast Tumor Segmentation and Self-Normalizing Network for pCR Classification in Magnetic Resonance Images
Toufiq Musah
Main category: eess.IV
TL;DR: The paper introduces a method for breast tumor segmentation and pCR classification using DCE-MRI, employing a MedNeXt architecture with large kernels and radiomics-driven SNN, achieving improved performance.
Details
Motivation: Accurate breast tumor segmentation in DCE-MRI is crucial for tasks like pCR assessment, motivating the development of a robust method.
Method: Uses a large-kernel MedNeXt with a two-stage training strategy (UpKern algorithm) for segmentation and an SNN on radiomic features for pCR classification.
Result: Achieved a Dice score of 0.67 and NormHD of 0.24 for segmentation, and 57% balanced accuracy (up to 75% in subgroups) for pCR classification.
Conclusion: Combining large receptive fields and radiomics improves performance, suggesting future work on ensembling and clinical variable integration.
Abstract: Accurate breast tumor segmentation in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is important for downstream tasks such as pathological complete response (pCR) assessment. In this work, we address both segmentation and pCR classification using the large-scale MAMA-MIA DCE-MRI dataset. We employ a large-kernel MedNeXt architecture with a two-stage training strategy that expands the receptive field from 3x3x3 to 5x5x5 kernels using the UpKern algorithm. This approach allows stable transfer of learned features to larger kernels, improving segmentation performance on the unseen validation set. An ensemble of large-kernel models achieved a Dice score of 0.67 and a normalized Hausdorff Distance (NormHD) of 0.24. For pCR classification, we trained a self-normalizing network (SNN) on radiomic features extracted from the predicted segmentations and first post-contrast DCE-MRI, reaching an average balanced accuracy of 57%, and up to 75% in some subgroups. Our findings highlight the benefits of combining larger receptive fields and radiomics-driven classification while motivating future work on advanced ensembling and the integration of clinical variables to further improve performance and generalization. Code: https://github.com/toufiqmusah/caladan-mama-mia.git
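The UpKern step itself is simple to express: initialize the large-kernel weights by trilinearly interpolating the trained small-kernel weights, then resume training. A sketch under illustrative channel counts, following the published UpKern recipe in spirit; the paper's two-stage schedule is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def upkern_init(small: nn.Conv3d, large: nn.Conv3d):
    # Conv3d weights are (out_ch, in_ch, k, k, k); resize the 3 spatial dims.
    w = F.interpolate(small.weight.data, size=large.kernel_size,
                      mode="trilinear", align_corners=True)
    large.weight.data.copy_(w)
    if small.bias is not None:
        large.bias.data.copy_(small.bias.data)

small = nn.Conv3d(8, 8, kernel_size=3, padding=1)   # trained 3x3x3 conv
large = nn.Conv3d(8, 8, kernel_size=5, padding=2)   # target 5x5x5 conv
upkern_init(small, large)
print(large.weight.shape)                           # torch.Size([8, 8, 5, 5, 5])
```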
[1026] Less is More: AMBER-AFNO – a New Benchmark for Lightweight 3D Medical Image Segmentation
Andrea Dosi, Semanto Mondal, Rajib Chandra Ghosh, Massimo Brescia, Giuseppe Longo
Main category: eess.IV
TL;DR: AMBER-AFNO, a transformer-based model adapted for 3D medical datacube segmentation, reduces complexity by 80% compared to UNETR++ while maintaining competitive accuracy.
Details
Motivation: To transfer remote sensing methodologies to healthcare, improving efficiency in 3D medical datacube segmentation.
Method: Adapts AMBER with Adaptive Fourier Neural Operators (AFNO) for frequency-domain mixing, replacing multi-head self-attention.
Result: Achieves competitive or superior accuracy on ACDC and Synapse datasets with significant efficiency gains.
Conclusion: AMBER-AFNO offers a simpler, more efficient alternative for 3D medical segmentation without sacrificing performance.
Abstract: This work presents the results of a methodological transfer from remote sensing to healthcare, adapting AMBER – a transformer-based model originally designed for multiband images, such as hyperspectral data – to the task of 3D medical datacube segmentation. In this study, we use the AMBER architecture with Adaptive Fourier Neural Operators (AFNO) in place of the multi-head self-attention mechanism. While existing models rely on various forms of attention to capture global context, AMBER-AFNO achieves this through frequency-domain mixing, enabling a drastic reduction in model complexity. This design reduces the number of trainable parameters by over 80% compared to UNETR++, while maintaining a FLOPs count comparable to other state-of-the-art architectures. Model performance is evaluated on two benchmark 3D medical datasets – ACDC and Synapse – using standard metrics such as Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD), demonstrating that AMBER-AFNO achieves competitive or superior accuracy with significant gains in training efficiency, inference speed, and memory usage.
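A stripped-down Fourier token mixer conveys the core idea of replacing self-attention with frequency-domain mixing. Real AFNO additionally applies block-diagonal MLPs and soft-thresholding to the Fourier modes, which this sketch omits; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class FourierMixer(nn.Module):
    def __init__(self, n_tokens: int, d_model: int):
        super().__init__()
        n_freq = n_tokens // 2 + 1                  # rfft output length
        self.weight = nn.Parameter(torch.randn(n_freq, d_model, 2) * 0.02)

    def forward(self, x):                           # x: (B, N, D)
        xf = torch.fft.rfft(x, dim=1)               # complex, (B, N//2+1, D)
        w = torch.view_as_complex(self.weight)      # learned per-frequency scale
        xf = xf * w                                 # global mixing in O(N log N)
        return torch.fft.irfft(xf, n=x.size(1), dim=1)

mixer = FourierMixer(n_tokens=64, d_model=32)
y = mixer(torch.randn(2, 64, 32))
print(y.shape)                                      # torch.Size([2, 64, 32])
```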
[1027] HyTIP: Hybrid Temporal Information Propagation for Masked Conditional Residual Video Coding
Yi-Hsin Chen, Yi-Chen Yao, Kuan-Wei Ho, Chun-Hung Wu, Huu-Tai Phung, Martin Benjak, Jörn Ostermann, Wen-Hsiao Peng
Main category: eess.IV
TL;DR: HyTIP is a hybrid learned video coding framework combining output-recurrence and hidden-to-hidden RNN mechanisms, achieving better performance with smaller buffer sizes.
Details
Motivation: Current RNN-based video codecs have limitations: output-recurrence methods constrain decoded frames, while hidden-to-hidden methods require large buffers.
Method: HyTIP combines both mechanisms, using decoded frames and a few latent features for efficient buffering.
Result: HyTIP outperforms individual approaches and matches state-of-the-art methods with smaller buffers, surpassing VTM 17.0 in PSNR-RGB and MS-SSIM-RGB.
Conclusion: HyTIP offers a balanced solution for efficient video coding, improving performance while reducing buffer requirements.
Abstract: Most frame-based learned video codecs can be interpreted as recurrent neural networks (RNNs) propagating reference information along the temporal dimension. This work revisits the limitations of the current approaches from an RNN perspective. The output-recurrence methods, which propagate decoded frames, are intuitive but impose dual constraints on the output decoded frames, leading to suboptimal rate-distortion performance. In contrast, the hidden-to-hidden connection approaches, which propagate latent features within the RNN, offer greater flexibility but require large buffer sizes. To address these issues, we propose HyTIP, a learned video coding framework that combines both mechanisms. Our hybrid buffering strategy uses explicit decoded frames and a small number of implicit latent features to achieve competitive coding performance. Experimental results show that our HyTIP outperforms the sole use of either output-recurrence or hidden-to-hidden approaches. Furthermore, it achieves comparable performance to state-of-the-art methods but with a much smaller buffer size, and outperforms VTM 17.0 (Low-delay B) in terms of PSNR-RGB and MS-SSIM-RGB. The source code of HyTIP is available at https://github.com/NYCU-MAPL/HyTIP.
[1028] REACT-KD: Region-Aware Cross-modal Topological Knowledge Distillation for Interpretable Medical Image Classification
Hongzhao Chen, Hexiao Ding, Yufeng Jiang, Jing Lan, Ka Chun Li, Gerald W. Y. Cheng, Sam Ng, Chi Lai Ho, Jing Cai, Liang-ting Lin, Jung Sun Yoo
Main category: eess.IV
TL;DR: REACT-KD is a framework for tumor classification using CT imaging, leveraging multi-modal knowledge distillation to improve reliability and interpretability.
Details
Motivation: Addresses challenges like heterogeneous modality quality, limited annotations, and lack of anatomical guidance in clinical imaging.
Method: Uses a dual-teacher design (PET/CT and degraded CT) to guide a lightweight CT student model via semantic alignment and anatomical topology modeling. Includes modality dropout for robustness.
Result: Achieves 93.4% AUC on PET/CT and 76.6%-81.5% AUC on varying CT doses, with high clinical benefit in decision curve analysis.
Conclusion: REACT-KD shows strong potential for real-world diagnostics, offering reliable and interpretable tumor classification.
Abstract: Reliable and interpretable tumor classification from clinical imaging remains a core challenge due to heterogeneous modality quality, limited annotations, and the lack of structured anatomical guidance. We introduce REACT-KD, a Region-Aware Cross-modal Topological Knowledge Distillation framework that transfers rich supervision from high-fidelity multi-modal sources into a lightweight CT-based student model. The framework uses a dual teacher design: one branch captures structure-function relationships using dual-tracer PET/CT, and the other models dose-aware features through synthetically degraded low-dose CT data. These branches jointly guide the student model through two complementary objectives. The first focuses on semantic alignment via logits distillation, while the second models anatomical topology using region graph distillation. A shared CBAM-3D module is employed to maintain consistent attention across modalities. To improve reliability for deployment, REACT-KD introduces modality dropout during training, allowing inference under partial or noisy inputs. The staging task for hepatocellular carcinoma (HCC) is conducted as a case study. REACT-KD achieves an average AUC of 93.4% on an internal PET/CT cohort and maintains 76.6% to 81.5% AUC across varying dose levels in external CT testing. Decision curve analysis shows that REACT-KD consistently provides the highest clinical benefit across decision thresholds, supporting its potential in real-world diagnostics. Code is available at https://github.com/Kinetics-JOJO/REACT-KD.
[1029] Tackling Ill-posedness of Reversible Image Conversion with Well-posed Invertible Network
Yuanfei Huang, Hua Huang
Main category: eess.IV
TL;DR: The paper proposes a well-posed invertible convolution (WIC) to address the ill-posedness in reversible image conversion (RIC), eliminating reliance on random variables and achieving state-of-the-art performance.
Details
Motivation: Existing RIC methods remain ill-posed due to uncertainty from random variables, limiting their reliability.
Method: Develops WIC to create a well-posed system, introduces WIN-Naïve and WIN networks with skip-connections for long-term memory.
Result: Achieves top performance in tasks like image hiding, rescaling, and decolorization.
Conclusion: The approach overcomes RIC bottlenecks and sets a new benchmark, validated by extensive experiments.
Abstract: Reversible image conversion (RIC) suffers from ill-posedness issues due to its forward conversion process being considered an underdetermined system. Despite employing invertible neural networks (INN), existing RIC methods intrinsically remain ill-posed as inevitably introducing uncertainty by incorporating randomly sampled variables. To tackle the ill-posedness dilemma, we focus on developing a reliable approximate left inverse for the underdetermined system by constructing an overdetermined system with a non-zero Gram determinant, thus ensuring a well-posed solution. Based on this principle, we propose a well-posed invertible $1\times1$ convolution (WIC), which eliminates the reliance on random variable sampling and enables the development of well-posed invertible networks. Furthermore, we design two innovative networks, WIN-Naïve and WIN, with the latter incorporating advanced skip-connections to enhance long-term memory. Our methods are evaluated across diverse RIC tasks, including reversible image hiding, image rescaling, and image decolorization, consistently achieving state-of-the-art performance. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to overcome the bottlenecks of existing RIC solutions and setting a new benchmark in the field. Code is available at https://github.com/BNU-ERC-ITEA/WIN.
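The well-posedness argument can be checked numerically: a generic overdetermined channel-mixing matrix has a nonsingular Gram matrix, so the least-squares left inverse $(W^\top W)^{-1}W^\top$ recovers the input exactly, with no sampled random variables. A small NumPy demonstration with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4                              # overdetermined: more outputs than inputs
W = rng.normal(size=(m, n))              # generic W => det(W^T W) != 0
x = rng.normal(size=(n, 100))            # 100 pixels with n channels each

y = W @ x                                # forward (channel-mixing) pass
gram = W.T @ W
print(np.linalg.det(gram) != 0)          # well-posed: True
W_left = np.linalg.inv(gram) @ W.T       # least-squares left inverse of W
x_rec = W_left @ y
print(np.allclose(x, x_rec))             # exact recovery: True
```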
[1030] GR-Gaussian: Graph-Based Radiative Gaussian Splatting for Sparse-View CT Reconstruction
Yikuang Yuluo, Yue Ma, Kuan Shen, Tongtong Jin, Wang Liao, Yangpu Ma, Fuquan Wang
Main category: eess.IV
TL;DR: GR-Gaussian improves 3D Gaussian Splatting for CT reconstruction by reducing needle-like artifacts and enhancing accuracy under sparse-view conditions using denoised point cloud initialization and pixel-graph-aware gradient strategies.
Details
Motivation: Existing 3DGS methods suffer from needle-like artifacts in sparse-view CT reconstruction due to reliance on average gradient magnitude.
Method: Proposes GR-Gaussian with denoised point cloud initialization and pixel-graph-aware gradient strategies to refine gradient computation and density representation.
Result: Achieves PSNR improvements of 0.67 dB and 0.92 dB, and SSIM gains of 0.011 and 0.021 on X-3D and real-world datasets.
Conclusion: GR-Gaussian is effective for accurate CT reconstruction in sparse-view conditions, outperforming existing methods.
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a promising approach for CT reconstruction. However, existing methods rely on the average gradient magnitude of points within the view, often leading to severe needle-like artifacts under sparse-view conditions. To address this challenge, we propose GR-Gaussian, a graph-based 3D Gaussian Splatting framework that suppresses needle-like artifacts and improves reconstruction accuracy under sparse-view conditions. Our framework introduces two key innovations: (1) a Denoised Point Cloud Initialization Strategy that reduces initialization errors and accelerates convergence; and (2) a Pixel-Graph-Aware Gradient Strategy that refines gradient computation using graph-based density differences, improving splitting accuracy and density representation. Experiments on X-3D and real-world datasets validate the effectiveness of GR-Gaussian, achieving PSNR improvements of 0.67 dB and 0.92 dB, and SSIM gains of 0.011 and 0.021. These results highlight the applicability of GR-Gaussian for accurate CT reconstruction under challenging sparse-view conditions.
[1031] Identifying actionable driver mutations in lung cancer using an efficient Asymmetric Transformer Decoder
Biagio Brattoli, Jack Shi, Jongchan Park, Taebum Lee, Donggeun Yoo, Sergio Pereira
Main category: eess.IV
TL;DR: The paper evaluates MIL techniques to detect six NSCLC driver mutations and introduces an Asymmetric Transformer Decoder model, improving prediction accuracy by 3-4%.
Details
Motivation: Limited genetic testing adoption and focus on few mutations in ML-based CPath tools hinder clinical impact.
Method: Uses Multiple Instance Learning (MIL) and an Asymmetric Transformer Decoder with tissue type integration.
Result: Outperforms top MIL models by 3% on average, 4% for rare mutations like ERBB2 and BRAF.
Conclusion: The approach advances ML-based tests as viable alternatives to standard genetic testing.
Abstract: Identifying actionable driver mutations in non-small cell lung cancer (NSCLC) can impact treatment decisions and significantly improve patient outcomes. Despite guideline recommendations, broader adoption of genetic testing remains challenging due to limited availability and lengthy turnaround times. Machine Learning (ML) methods for Computational Pathology (CPath) offer a potential solution; however, research often focuses on only one or two common mutations, limiting the clinical value of these tools and the pool of patients who can benefit from them. This study evaluates various Multiple Instance Learning (MIL) techniques to detect six key actionable NSCLC driver mutations: ALK, BRAF, EGFR, ERBB2, KRAS, and MET ex14. Additionally, we introduce an Asymmetric Transformer Decoder model that employs queries and key-values of varying dimensions to maintain a low query dimensionality. This approach efficiently extracts information from patch embeddings and minimizes overfitting risks, proving highly adaptable to the MIL setting. Moreover, we present a method to directly utilize tissue type in the model, addressing a typical MIL limitation where either all regions or only some specific regions are analyzed, neglecting biological relevance. Our method outperforms top MIL models by an average of 3%, and over 4% when predicting rare mutations such as ERBB2 and BRAF, moving ML-based tests closer to being practical alternatives to standard genetic testing.
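A minimal sketch of the asymmetric cross-attention idea: a handful of low-dimensional learned queries attend over high-dimensional patch embeddings, with keys projected down to the query width while values keep full width. Names and dimensions are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AsymmetricDecoder(nn.Module):
    def __init__(self, d_patch=768, d_query=64, n_queries=6):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_query) * 0.02)
        self.to_k = nn.Linear(d_patch, d_query)   # keys shrink to query width
        self.to_v = nn.Linear(d_patch, d_patch)   # values keep full width
        self.head = nn.Linear(d_patch, 1)         # one logit per query/mutation

    def forward(self, patches):                   # patches: (B, N, d_patch)
        k, v = self.to_k(patches), self.to_v(patches)
        q = self.queries.unsqueeze(0).expand(patches.size(0), -1, -1)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        return self.head(attn @ v).squeeze(-1)    # (B, n_queries) logits

model = AsymmetricDecoder()
logits = model(torch.randn(2, 500, 768))          # 500 patch embeddings per slide
print(logits.shape)                               # torch.Size([2, 6])
```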
[1032] From Pixels to Pathology: Restoration Diffusion for Diagnostic-Consistent Virtual IHC
Jingsong Liu, Xiaofeng Deng, Han Li, Azar Kazemi, Christian Grashei, Gesa Wilkens, Xin You, Tanja Groll, Nassir Navab, Carolin Mogler, Peter J. Schüffler
Main category: eess.IV
TL;DR: The paper introduces Star-Diff, a structure-aware diffusion model for virtual staining from H&E to IHC, addressing challenges like misaligned ground truths and structural integrity. It proposes the Semantic Fidelity Score (SFS) for evaluation and achieves SOTA performance.
Details
Motivation: H&E staining lacks molecular-level diagnostic info, while IHC is costly and time-consuming. Virtual staining is promising but faces evaluation and structural integrity challenges.
Method: Star-Diff, a diffusion model combining residual and noise-based pathways, reformulates virtual staining as image restoration. SFS evaluates diagnostic consistency.
Result: Star-Diff achieves SOTA in visual fidelity and diagnostic relevance on the BCI dataset, with rapid inference and clinical alignment.
Conclusion: Star-Diff offers a practical solution for virtual IHC synthesis, addressing clinical workflow limitations.
Abstract: Hematoxylin and eosin (H&E) staining is the clinical standard for assessing tissue morphology, but it lacks molecular-level diagnostic information. In contrast, immunohistochemistry (IHC) provides crucial insights into biomarker expression, such as HER2 status for breast cancer grading, but remains costly and time-consuming, limiting its use in time-sensitive clinical workflows. To address this gap, virtual staining from H&E to IHC has emerged as a promising alternative, yet faces two core challenges: (1) Lack of fair evaluation of synthetic images against misaligned IHC ground truths, and (2) preserving structural integrity and biological variability during translation. To this end, we present an end-to-end framework encompassing both generation and evaluation in this work. We introduce Star-Diff, a structure-aware staining restoration diffusion model that reformulates virtual staining as an image restoration task. By combining residual and noise-based generation pathways, Star-Diff maintains tissue structure while modeling realistic biomarker variability. To evaluate the diagnostic consistency of the generated IHC patches, we propose the Semantic Fidelity Score (SFS), a clinical-grading-task-driven metric that quantifies class-wise semantic degradation based on biomarker classification accuracy. Unlike pixel-level metrics such as SSIM and PSNR, SFS remains robust under spatial misalignment and classifier uncertainty. Experiments on the BCI dataset demonstrate that Star-Diff achieves state-of-the-art (SOTA) performance in both visual fidelity and diagnostic relevance. With rapid inference and strong clinical alignment, it presents a practical solution for applications such as intraoperative virtual IHC synthesis.
[1033] RL-U$^2$Net: A Dual-Branch UNet with Reinforcement Learning-Assisted Multimodal Feature Fusion for Accurate 3D Whole-Heart Segmentation
Jierui Qu, Jianchun Zhao
Main category: eess.IV
TL;DR: Proposes RL-U$^2$Net, a dual-branch U-Net with reinforcement learning for feature alignment, improving multi-modal 3D whole-heart segmentation.
Details
Motivation: Enhancing segmentation accuracy by addressing spatial inconsistency, static fusion strategies, and inefficient feature alignment in multi-modal methods.
Method: Uses a dual-branch U-Net with an RL-XAlign module for cross-modal attention and reinforcement learning-based feature alignment.
Result: Achieves Dice coefficients of 93.1% (CT) and 87.0% (MRI) on MM-WHS 2017 dataset, outperforming state-of-the-art methods.
Conclusion: RL-U$^2$Net is effective and superior for precise multi-modal whole-heart segmentation.
Abstract: Accurate whole-heart segmentation is a critical component in the precise diagnosis and interventional planning of cardiovascular diseases. Integrating complementary information from modalities such as computed tomography (CT) and magnetic resonance imaging (MRI) can significantly enhance segmentation accuracy and robustness. However, existing multi-modal segmentation methods face several limitations: severe spatial inconsistency between modalities hinders effective feature fusion; fusion strategies are often static and lack adaptability; and the processes of feature alignment and segmentation are decoupled and inefficient. To address these challenges, we propose a dual-branch U-Net architecture enhanced by reinforcement learning for feature alignment, termed RL-U$^2$Net, designed for precise and efficient multi-modal 3D whole-heart segmentation. The model employs a dual-branch U-shaped network to process CT and MRI patches in parallel, and introduces a novel RL-XAlign module between the encoders. The module employs a cross-modal attention mechanism to capture semantic correspondences between modalities and a reinforcement-learning agent learns an optimal rotation strategy that consistently aligns anatomical pose and texture features. The aligned features are then reconstructed through their respective decoders. Finally, an ensemble-learning-based decision module integrates the predictions from individual patches to produce the final segmentation result. Experimental results on the publicly available MM-WHS 2017 dataset demonstrate that the proposed RL-U$^2$Net outperforms existing state-of-the-art methods, achieving Dice coefficients of 93.1% on CT and 87.0% on MRI, thereby validating the effectiveness and superiority of the proposed approach.
[1034] Comparing ImageNet Pre-training with Digital Pathology Foundation Models for Whole Slide Image-Based Survival Analysis
Kleanthis Marios Papadopoulos, Tania Stathaki
Main category: eess.IV
TL;DR: Using histopathological foundation models like UNI and Hibou improves survival analysis in WSIs, but benefits lessen with complex MIL architectures.
Details
Motivation: Enhance predictive accuracy of MIL networks for survival analysis in WSIs by leveraging histopathological foundation models.
Method: Utilize UNI and Hibou foundation models within MIL frameworks, comparing performance with traditional ResNet50 backbones.
Result: Ensemble of foundation models boosts baseline accuracy, though gains reduce with more complex MIL architectures.
Conclusion: Foundation models enhance MIL performance for WSIs, but architectural complexity may limit their impact.
Abstract: The abundance of information present in Whole Slide Images (WSIs) renders them an essential tool for survival analysis. Several Multiple Instance Learning frameworks proposed for this task utilize a ResNet50 backbone pre-trained on natural images. By leveraging recently released histopathological foundation models such as UNI and Hibou, the predictive prowess of existing MIL networks can be enhanced. Furthermore, deploying an ensemble of digital pathology foundation models yields higher baseline accuracy, although the benefits appear to diminish with more complex MIL architectures. Our code will be made publicly available upon acceptance.
[1035] Coordinate-based Speed of Sound Recovery for Aberration-Corrected Photoacoustic Computed Tomography
Tianao Li, Manxiu Cui, Cheng Ma, Emma Alexander
Main category: eess.IV
TL;DR: A self-supervised method for joint reconstruction in PACT improves image quality by correcting SOS aberrations efficiently.
Details
Motivation: Conventional PACT suffers from image degradation due to SOS heterogeneity, and existing solutions are either burdensome or computationally expensive.
Method: Proposes a self-supervised joint reconstruction method using pixel grid or neural field parametrization, updated via backpropagation through a differentiable imaging model.
Result: Achieves 35x faster and more accurate SOS aberration removal compared to SOTA, validated in simulations and real data.
Conclusion: The method efficiently addresses SOS-related distortions in PACT, offering practical improvements for medical imaging.
Abstract: Photoacoustic computed tomography (PACT) is a non-invasive imaging modality, similar to ultrasound, with wide-ranging medical applications. Conventional PACT images are degraded by wavefront distortion caused by the heterogeneous speed of sound (SOS) in tissue. Accounting for these effects can improve image quality and provide medically useful information, but measuring the SOS directly is burdensome and the existing joint reconstruction method is computationally expensive. Traditional supervised learning techniques are currently inaccessible in this data-starved domain. In this work, we introduce an efficient, self-supervised joint reconstruction method that recovers SOS and high-quality images for ring array PACT systems. To solve this semi-blind inverse problem, we parametrize the SOS using either a pixel grid or a neural field (NF) and update it directly by backpropagating the gradients through a differentiable imaging forward model. Our method removes SOS aberrations more accurately and 35x faster than the current SOTA. We demonstrate the success of our method quantitatively in simulation and qualitatively on experimentally-collected and in vivo data. Our code and synthetic numerical phantoms are available on our project page: https://lukeli0425.github.io/Coord-SoS-PACT/.
[1036] Automatic brain tumor segmentation in 2D intra-operative ultrasound images using magnetic resonance imaging tumor annotations
Mathilde Faanes, Ragnhild Holden Helland, Ole Solheim, Sébastien Muller, Ingerid Reinertsen
Main category: eess.IV
TL;DR: MRI annotations can substitute iUS annotations for training deep learning models in brain tumor segmentation, achieving comparable results.
Details
Motivation: Overcome the lack of large annotated iUS datasets by leveraging more accessible MRI annotations.
Method: Used 180 annotated MRI scans and 29 annotated iUS images, performed image registration, and trained nnU-Net models with varying data configurations.
Result: No significant difference in Dice scores between models trained with MRI or iUS annotations. Best model achieved 0.62±0.31 Dice score, close to expert’s 0.67±0.25.
Conclusion: MRI annotations are a viable substitute for iUS annotations in training models for brain tumor segmentation in iUS images.
Abstract: Automatic segmentation of brain tumors in intra-operative ultrasound (iUS) images could facilitate localization of tumor tissue during resection surgery. The lack of large annotated datasets limits current models' performance. In this paper, we investigated the use of tumor annotations in magnetic resonance imaging (MRI) scans, which are more accessible than annotations in iUS images, for training deep learning models for iUS brain tumor segmentation. We used 180 annotated MRI scans with corresponding unannotated iUS images, and 29 annotated iUS images. Image registration was performed to transfer the MRI annotations to the corresponding iUS images before training the nnU-Net model with different configurations of the data and label origins. The results showed no significant difference in Dice score between a model trained with only MRI-annotated tumors and models trained with only iUS annotations or with both, nor relative to expert annotations, indicating that MRI tumor annotations can be used as a substitute for iUS tumor annotations to train a deep learning model for automatic brain tumor segmentation in iUS images. The best model obtained an average Dice score of $0.62\pm0.31$, compared to $0.67\pm0.25$ for an expert neurosurgeon, where performance on larger tumors was similar but lower for the models on smaller tumors. In addition, the results showed that removing smaller tumors from the training sets improved the results. The main models are available here: https://github.com/mathildefaanes/us_brain_tumor_segmentation/tree/main
[1037] Rethinking domain generalization in medical image segmentation: One image as one domain
Jin Hong, Bo Liu, Qiankun Zuo, Guoli Long, Siyue Li, Yudong Zhang, Shuihua Wang, Khan Muhammad
Main category: eess.IV
TL;DR: The paper proposes the ‘one image as one domain’ (OIOD) hypothesis and a UniDDG framework to address domain shifts in medical image segmentation, achieving superior performance without explicit domain labels.
Details
Motivation: Domain shifts in medical images, especially due to intra-center variability, challenge segmentation accuracy. Existing methods struggle with multi-source and single-source domain generalization.
Method: Develops UniDDG, a disentanglement-based framework that treats each image as a unique domain, decoupling content and style. Uses EMA for boundary preservation and SA for style augmentation.
Result: Achieves Dice scores of 84.43%-88.91% for optic disc/cup segmentation and 86.96%-88.56% for prostate segmentation, outperforming state-of-the-art methods.
Conclusion: The OIOD hypothesis and UniDDG framework effectively handle domain shifts, offering robust and scalable solutions for medical image segmentation.
Abstract: Domain shifts in medical image segmentation, particularly when data comes from different centers, pose significant challenges. Intra-center variability, such as differences in scanner models or imaging protocols, can cause domain shifts as large as, or even larger than, those between centers. To address this, we propose the “one image as one domain” (OIOD) hypothesis, which treats each image as a unique domain, enabling flexible and robust domain generalization. Based on this hypothesis, we develop a unified disentanglement-based domain generalization (UniDDG) framework, which simultaneously handles both multi-source and single-source domain generalization without requiring explicit domain labels. This approach simplifies training with a fixed architecture, independent of the number of source domains, reducing complexity and enhancing scalability. We decouple each input image into content representation and style code, then exchange and combine these within the batch for segmentation, reconstruction, and further disentanglement. By maintaining distinct style codes for each image, our model ensures thorough decoupling of content representations and style codes, improving domain invariance of the content representations. Additionally, we enhance generalization with expansion mask attention (EMA) for boundary preservation and style augmentation (SA) to simulate diverse image styles, improving robustness to domain shifts. Extensive experiments show that our method achieves Dice scores of 84.43% and 88.91% for multi-source to single-center and single-center generalization in optic disc and optic cup segmentation, respectively, and 86.96% and 88.56% for prostate segmentation, outperforming current state-of-the-art domain generalization methods, offering superior performance and adaptability across clinical settings.
[1038] ELFATT: Efficient Linear Fast Attention for Vision Transformers
Chong Wu, Maolin Che, Renjie Xu, Zhuoheng Ran, Hong Yan
Main category: eess.IV
TL;DR: ELFATT is a novel efficient linear fast attention mechanism that achieves high performance, linear complexity, and low memory usage, outperforming traditional attention methods in speed and efficiency.
Details
Motivation: The quadratic complexity of vanilla softmax-based attention limits its application in long-sequence tasks, and existing efficient methods sacrifice performance. ELFATT aims to address this by balancing efficiency and performance.
Method: ELFATT combines low memory I/O operations, linear computational complexity, and high performance. It is FlashAttention-friendly and accelerates tasks without performance loss.
Result: ELFATT provides 4-7x speedups in vision tasks, 2-3x with FlashAttention-2, and leads in non-vision tasks. It also enhances diffusion tasks without training.
Conclusion: ELFATT is a versatile, efficient attention mechanism that outperforms existing methods in speed and performance across various tasks and hardware.
Abstract: The attention mechanism is the key to the success of transformers in different machine learning tasks. However, the quadratic complexity with respect to the sequence length of the vanilla softmax-based attention mechanism becomes the major bottleneck for the application of long sequence tasks, such as vision tasks. Although various efficient linear attention mechanisms have been proposed, they need to sacrifice performance to achieve high efficiency. What’s more, memory-efficient methods, such as FlashAttention-1-3, still have quadratic computation complexity which can be further improved. In this paper, we propose a novel efficient linear fast attention (ELFATT) mechanism to achieve low memory input/output operations, linear computational complexity, and high performance at the same time. ELFATT offers 4-7x speedups over the vanilla softmax-based attention mechanism in high-resolution vision tasks without losing performance. ELFATT is FlashAttention friendly. Using FlashAttention-2 acceleration, ELFATT still offers 2-3x speedups over the vanilla softmax-based attention mechanism on high-resolution vision tasks without losing performance. Even in some non-vision tasks of long-range arena, ELFATT still achieves leading performance and offers 1.2-2.3x speedups over FlashAttention-2. Even on edge GPUs, ELFATT still offers 1.6x to 2.0x speedups compared to state-of-the-art attention mechanisms in various power modes from 5W to 60W. Furthermore, ELFATT can be used to enhance and accelerate diffusion tasks directly without training.
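The generic trick underlying linear attention mechanisms such as ELFATT is associativity: with a positive feature map $\phi$, computing $\phi(Q)(\phi(K)^\top V)$ right-to-left costs $O(N)$ instead of the $O(N^2)$ of softmax attention. The sketch below is a generic linear-attention implementation, not ELFATT's exact formulation.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    phi = lambda t: F.elu(t) + 1                    # positive feature map
    q, k = phi(q), phi(k)
    kv = torch.einsum("bnd,bne->bde", k, v)         # (B, D, E), built in O(N)
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)  # normalized output

B, N, D = 2, 4096, 64
q, k, v = (torch.randn(B, N, D) for _ in range(3))
out = linear_attention(q, k, v)
print(out.shape)                                    # torch.Size([2, 4096, 64])
```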
[1039] Low Latency and Generalizable Dynamic MRI via L+S Alternating GD and Minimization
Silpa Babu, Sajan Goud Lingala, Namrata Vaswani
Main category: eess.IV
TL;DR: Novel MRI reconstruction methods are developed for dynamic applications, offering accuracy, speed, and generalizability without parameter tuning.
Details
Motivation: To create a versatile MRI reconstruction algorithm that works across various dynamic MRI settings without needing adjustments.
Method: Develops low-rank (LR) and LR plus sparse (L+S) models with simple, few-parameter algorithms.
Result: The approach achieves generalizability, being accurate and fast for diverse MRI applications.
Conclusion: Simple models and algorithms enable generalizable MRI reconstruction without parameter tuning.
Abstract: In this work, we develop novel MRI reconstruction approaches that are accurate, fast and low-latency for a large number of dynamic MRI applications, sampling schemes and sampling rates, without any problem-specific parameter tuning. We refer to this property of a single algorithm, without parameter tuning, being accurate and fast for many settings as generalizability. Generalizability is possible only for simple (few-parameter) models such as low-rank (LR) or LR plus sparse (L+S), and for simple, few-parameter algorithms based on these models, which is what we develop and evaluate in this work.
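One alternating L+S step is easy to state for a fully sampled Casorati matrix (pixels x frames): fit the low-rank part by truncated SVD, then soft-threshold the residual for the sparse part. The sketch below uses this simplified setting; the paper's undersampled k-space data term and its specific GD updates are omitted.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def l_plus_s_step(M, L, S, rank=3, tau=0.1):
    U, s, Vt = np.linalg.svd(M - S, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # best rank-r fit to M - S
    S = soft_threshold(M - L, tau)             # sparse residual update
    return L, S

rng = np.random.default_rng(1)
M = rng.normal(size=(64, 3)) @ rng.normal(size=(3, 30))  # rank-3 background
M += (rng.random(M.shape) < 0.05) * 5.0                   # sparse "dynamics"
L, S = np.zeros_like(M), np.zeros_like(M)
for _ in range(20):
    L, S = l_plus_s_step(M, L, S)
# L is rank 3; the unmodeled residual M - L - S is bounded by tau.
print(np.linalg.matrix_rank(L), np.abs(M - L - S).max())
```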
[1040] Style Content Decomposition-based Data Augmentation for Domain Generalizable Medical Image Segmentation
Zhiqiang Shen, Peng Cao, Jinzhu Yang, Osmar R. Zaiane, Zhaolin Chen
Main category: eess.IV
TL;DR: The paper introduces StyCona, a style-content decomposition-based data augmentation method to improve domain-generalizable medical image segmentation by addressing style and content shifts.
Details
Motivation: Domain shifts in medical imaging degrade segmentation model performance. The challenge is decoupling style and content factors in images to enhance generalization.
Method: Proposes a linear style-content decomposition method and StyCona, a plug-and-play augmentation algorithm leveraging this decomposition for training robust models.
Result: StyCona outperforms state-of-the-art methods in cardiac MRI and fundus photography segmentation tasks.
Conclusion: StyCona effectively improves model generalization without extra parameters or architectural changes, demonstrating strong performance across domains.
Abstract: Due to domain shifts across diverse medical imaging modalities, learned segmentation models often suffer significant performance degradation during deployment. These domain shifts, typically caused by variations in imaging systems, generally comprise two principal components: 1) \textbf{“style” shifts}, referring to global disparities in image properties such as illumination, contrast, and color; and 2) \textbf{“content” shifts}, which involve local discrepancies in anatomical structures. To address domain shifts in medical image segmentation, a core challenge arises: how can we decouple the factors within images that determine their “style” and “content” components? To this end, we first propose a linear style-content decomposition method that factorizes an image into style codes and content maps, explicitly modeling the “style” and “content” components. Building on this, we introduce a \textbf{Sty}le-\textbf{Con}tent decomposition-based data \textbf{a}ugmentation algorithm (StyCona), which leverages this decomposition strategy to guide augmentation of both the global style and local content of source-domain images, enabling the training of a well-generalized model for domain-generalizable medical image segmentation. StyCona is a simple yet effective plug-and-play module that substantially improves model generalization without requiring additional training parameters or modifications to segmentation model architectures. Experiments on cardiac magnetic resonance imaging and fundus photography segmentation tasks, with single and multiple target domains respectively, demonstrate the effectiveness of StyCona and its superiority over state-of-the-art domain generalization methods. The code will be released at https://github.com/Senyh/StyCona.
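The simplest linear style-content split treats per-image channel statistics as the style code and the normalized image as the content map, which can then be recombined with another image's style. StyCona's actual decomposition is richer; this toy version only shows the augmentation mechanics.

```python
import torch

def decompose(x, eps=1e-5):                    # x: (B, C, H, W)
    mu = x.mean(dim=(2, 3), keepdim=True)      # style: first-order statistics
    sigma = x.std(dim=(2, 3), keepdim=True) + eps
    return (x - mu) / sigma, (mu, sigma)       # content map, style code

def recombine(content, style):
    mu, sigma = style
    return content * sigma + mu

x = torch.rand(4, 3, 64, 64)
content, style = decompose(x)
perm = torch.randperm(x.size(0))               # swap styles within the batch
augmented = recombine(content, (style[0][perm], style[1][perm]))
print(augmented.shape)                         # torch.Size([4, 3, 64, 64])
```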
[1041] Make Both Ends Meet: A Synergistic Optimization Infrared Small Target Detection with Streamlined Computational Overhead
Yuxin Jing, Yuchen Zheng, Jufeng Zhao, Guangmang Cui, Tianpei Zhang
Main category: eess.IV
TL;DR: LE-IRSTD is a lightweight, efficient framework for infrared small target detection, addressing blurred boundaries and computational overhead with MBConvblock, BSblock, AVCStem, and GSConv, outperforming state-of-the-art methods.
Details
Motivation: Existing IRSTD methods face issues like blurred target boundaries and high computational costs, prompting the need for a more efficient solution.
Method: Proposes LE-IRSTD based on YOLOv8n, incorporating MBConvblock, BSblock, AVCStem (with VKConv), and GSConv to balance efficiency and accuracy.
Result: LE-IRSTD achieves superior accuracy and lightweight performance compared to other deep learning methods.
Conclusion: The proposed framework effectively addresses key challenges in IRSTD, offering a robust and efficient solution.
Abstract: Infrared small target detection (IRSTD) is widely recognized as a challenging task due to the inherent limitations of infrared imaging, including low signal-to-noise ratios, lack of texture details, and complex background interference. Most existing methods model IRSTD as a semantic segmentation task, but they suffer from two critical drawbacks: (1) blurred target boundaries caused by long-distance imaging dispersion; and (2) excessive computational overhead due to indiscriminate feature stacking. To address these issues, we propose the Lightweight Efficiency Infrared Small Target Detection (LE-IRSTD), a lightweight and efficient framework based on YOLOv8n, with the following key innovations. Firstly, we identify that the multiple bottleneck structures within the C2f component of the YOLOv8-n backbone contribute to an increased computational burden. Therefore, we implement the Mobile Inverted Bottleneck Convolution block (MBConvblock) and Bottleneck Structure block (BSblock) in the backbone, effectively balancing the trade-off between computational efficiency and the extraction of deep semantic information. Secondly, we introduce the Attention-based Variable Convolution Stem (AVCStem) structure, substituting the final convolution with Variable Kernel Convolution (VKConv), which allows for adaptive convolutional kernels that can transform into various shapes, facilitating the receptive field for the extraction of targets. Finally, we employ Global Shuffle Convolution (GSConv) to shuffle the channel dimension features obtained from different convolutional approaches, thereby enhancing the robustness and generalization capabilities of our method. Experimental results demonstrate that our LE-IRSTD method achieves compelling results in both accuracy and lightweight performance, outperforming several state-of-the-art deep learning methods.
[1042] Deep Learning Empowered Sub-Diffraction Terahertz Backpropagation Single-Pixel Imaging
Yongsheng Zhu, Shaojing Liu, Ximiao Wang, Runli Li, Haili Yang, Jiali Wang, Hongjia Zhu, Yanlin Ke, Ningsheng Xu, Huanjun Chen, Shaozhi Deng
Main category: eess.IV
TL;DR: A sub-diffraction THz backpropagation SPI technique is proposed, achieving high resolution with an untrained neural network and minimal sampling, eliminating the need for ultrathin photomodulators.
Details
Motivation: Overcome limitations of THz SPI, such as low resolution and long sampling times, by leveraging neural networks and backpropagation.
Method: Illuminates objects with THz waves, modulates them with patterns, and uses an untrained neural network with a physical SPI process for reconstruction.
Result: Achieves a spatial resolution of 118 µm (λ0/7) with a 1.5625% sampling ratio, reducing sampling time.
Conclusion: The technique offers an efficient solution for THz microscopic imaging and other inverse imaging challenges.
Abstract: Terahertz single-pixel imaging (THz SPI) has garnered widespread attention for its potential to overcome challenges associated with THz focal plane arrays. However, the inherently long wavelength of THz waves limits imaging resolution, while achieving subwavelength resolution requires harsh experimental conditions and time-consuming processes. Here, we propose a sub-diffraction THz backpropagation SPI technique. We illuminate the object with continuous-wave 0.36-THz radiation ($\lambda_0 = 833.3~\mu$m). The transmitted THz wave is modulated by prearranged patterns generated on a 500-$\mu$m-thick silicon wafer and subsequently recorded by a far-field single-pixel detector. An untrained neural network constrained with the physical SPI process iteratively reconstructs the THz images with an ultralow sampling ratio of 1.5625%, significantly reducing the long sampling times. To further suppress the THz diffraction-field effects, a backpropagation SPI from near field to far field is implemented by integrating a THz physical propagation model into the output layer of the network. Notably, using the thick wafer, where the THz evanescent field cannot be fully recorded, we achieve a spatial resolution of 118 $\mu$m ($\sim\lambda_0/7$) through backpropagation SPI, thus eliminating the need for ultrathin photomodulators. This approach provides an efficient solution for advancing THz microscopic imaging and addressing other inverse imaging challenges.
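The physics-constrained, untrained-network reconstruction follows the deep-image-prior pattern: a randomly initialized CNN maps fixed noise to an image, and only the mismatch between simulated and recorded single-pixel measurements is minimized. A toy sketch with a generic linear forward model standing in for the paper's THz propagation model; the network and sizes are illustrative.

```python
import torch
import torch.nn as nn

H = W = 32
n_meas = int(0.015625 * H * W)                       # 1.5625% sampling ratio
patterns = torch.randint(0, 2, (n_meas, H * W)).float()  # modulation masks

net = nn.Sequential(                                  # small untrained decoder
    nn.Conv2d(8, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid())
z = torch.randn(1, 8, H, W)                           # fixed input noise

target = torch.zeros(H, W)
target[8:24, 8:24] = 1.0                              # hidden test object
y = patterns @ target.flatten()                       # recorded intensities

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(500):
    x = net(z).flatten()                              # current image estimate
    loss = ((patterns @ x - y) ** 2).mean()           # physics-consistency loss
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())
```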
[1043] Model-Independent Machine Learning Approach for Nanometric Axial Localization and Tracking
Andrey Alexandrov, Giovanni Acampora, Giovanni De Lellis, Antonia Di Crescenzo, Chiara Errico, Daria Morozova, Valeri Tioukov, Autilia Vittiello
Main category: eess.IV
TL;DR: A deep learning method using CNNs achieves 40nm axial localization precision from dual-focal-plane images, outperforming traditional techniques.
Details
Motivation: Accurate axial particle tracking in microscopy is challenging, especially for high precision.Method: Uses CNNs to determine axial coordinates from dual-focal-plane images without predefined models.
Result: Achieves 40nm precision, six times better than single-focal-plane methods.
Conclusion: Demonstrates ML’s potential to transform complex image data into precise, reliable information for diverse applications.
Abstract: Accurately tracking particles and determining their coordinate along the optical axis is a major challenge in optical microscopy, especially when extremely high precision is needed. In this study, we introduce a deep learning approach using convolutional neural networks (CNNs) that can determine axial coordinates from dual-focal-plane images without relying on predefined models. Our method achieves an axial localization precision of 40 nanometers, six times better than traditional single-focal-plane techniques. The model's simple design and strong performance make it suitable for a wide range of uses, including dark matter detection, proton therapy for cancer, and radiation protection in space. It also shows promise in fields like biological imaging, materials science, and environmental monitoring. This work highlights how machine learning can turn complex image data into reliable, precise information, offering a flexible and powerful tool for many scientific applications.
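As a rough illustration of the general idea (not the authors' architecture), a CNN can regress the axial coordinate from a two-channel input that stacks the two focal-plane images; everything below, from layer widths to input size, is an assumption.

```python
# Hedged sketch: dual-focal-plane axial regression with a small CNN.
import torch
import torch.nn as nn

class AxialNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(64, 1)       # scalar z estimate

    def forward(self, x):                  # x: (N, 2, H, W), two focal planes
        return self.head(self.features(x).flatten(1))

pair = torch.randn(4, 2, 64, 64)           # batch of dual-focal-plane crops
z_hat = AxialNet()(pair)                    # (4, 1) predicted axial coordinates
```

The key point is that the network learns the mapping from defocus appearance in the two planes to z directly from data, with no predefined point-spread-function model.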
[1044] Contrast-Invariant Self-supervised Segmentation for Quantitative Placental MRI
Xinliu Zhong, Ruiying Liu, Emily S. Nichols, Xuzhe Zhang, Andrew F. Laine, Emma G. Duerden, Yun Wang
Main category: eess.IV
TL;DR: A framework for placental segmentation in T2*-weighted MRI using multi-echo data, addressing challenges like weak boundaries and motion artifacts.
Details
Motivation: Accurate placental segmentation is crucial but difficult due to weak boundary contrast, lack of ground truth annotations, and motion artifacts in multi-echo T2*-weighted MRI.Method: Proposes a contrast-augmented segmentation framework with masked autoencoding (MAE), masked pseudo-labeling (MPL), and global-local collaboration, plus a semantic matching loss.
Result: Outperforms single-echo and naive fusion baselines, generalizing effectively across echo times.
Conclusion: First systematic use of multi-echo T2*-weighted MRI for placental segmentation, demonstrating robust performance.
Abstract: Accurate placental segmentation is essential for quantitative analysis of the placenta. However, this task is particularly challenging in T2*-weighted placental imaging due to: (1) weak and inconsistent boundary contrast across individual echoes; (2) the absence of manual ground truth annotations for all echo times; and (3) motion artifacts across echoes caused by fetal and maternal movement. In this work, we propose a contrast-augmented segmentation framework that leverages complementary information across multi-echo T2*-weighted MRI to learn robust, contrast-invariant representations. Our method integrates: (i) masked autoencoding (MAE) for self-supervised pretraining on unlabeled multi-echo slices; (ii) masked pseudo-labeling (MPL) for unsupervised domain adaptation across echo times; and (iii) global-local collaboration to align fine-grained features with global anatomical context. We further introduce a semantic matching loss to encourage representation consistency across echoes of the same subject. Experiments on a clinical multi-echo placental MRI dataset demonstrate that our approach generalizes effectively across echo times and outperforms both single-echo and naive fusion baselines. To our knowledge, this is the first work to systematically exploit multi-echo T2*-weighted MRI for placental segmentation.
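A hedged sketch of what a semantic matching loss of this kind can look like: pooled embeddings of two echoes of the same subject are pulled together via cosine similarity. The abstract only states that the loss encourages representation consistency across echoes, so the exact formulation below is an assumption about its general form.

```python
# Illustrative semantic matching loss across echo times (assumed form).
import torch
import torch.nn.functional as F

def semantic_matching_loss(feat_e1, feat_e2):
    """feat_*: (N, D) pooled embeddings of two echoes of the same slice."""
    z1 = F.normalize(feat_e1, dim=1)
    z2 = F.normalize(feat_e2, dim=1)
    return (1 - (z1 * z2).sum(dim=1)).mean()  # 0 when embeddings align

loss = semantic_matching_loss(torch.randn(8, 128), torch.randn(8, 128))
```

Such a term complements the MAE pretraining and masked pseudo-labeling by explicitly tying together representations of the same anatomy seen under different echo contrasts.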
[1045] MEGANet-W: A Wavelet-Driven Edge-Guided Attention Framework for Weak Boundary Polyp Detection
Zhe Yee Tan, Ashwaq Qasem
Main category: eess.IV
TL;DR: MEGANet-W, a wavelet-driven edge-guided attention network, improves colorectal polyp segmentation by integrating Haar wavelet edge maps into decoder stages, outperforming existing methods without extra parameters.
Details
Motivation: Weak and low-contrast boundaries in colorectal polyp images limit automated segmentation accuracy, and current methods either blur edges or rely on unreliable handcrafted filters.Method: MEGANet-W uses a two-level Haar wavelet head for multi-orientation edge extraction and Wavelet Edge Guided Attention (W-EGA) modules to fuse wavelet cues with boundary and input branches.
Result: On five public datasets, MEGANet-W improves mIoU by up to 2.3% and mDice by 1.2%, with no additional learnable parameters.
Conclusion: MEGANet-W offers a robust solution for precise boundary detection in medical image segmentation, enhancing reliability in challenging cases.
Abstract: Colorectal polyp segmentation is critical for early detection of colorectal cancer, yet weak and low-contrast boundaries significantly limit automated accuracy. Existing deep models either blur fine edge details or rely on handcrafted filters that perform poorly under variable imaging conditions. We propose MEGANet-W, a Wavelet-Driven Edge-Guided Attention Network that injects directional, parameter-free Haar wavelet edge maps into each decoder stage to recalibrate semantic features. The key novelties of MEGANet-W are a two-level Haar wavelet head for multi-orientation edge extraction and Wavelet Edge Guided Attention (W-EGA) modules that fuse wavelet cues with boundary and input branches. On five public polyp datasets, MEGANet-W consistently outperforms existing methods, improving mIoU by up to 2.3% and mDice by 1.2% while introducing no additional learnable parameters. This approach improves reliability in difficult cases and offers a robust solution for medical image segmentation tasks requiring precise boundary detection.
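Since the wavelet head is described as parameter-free, its core can be written with fixed high-pass kernels. The sketch below shows a plausible two-level, three-orientation Haar edge extractor; the normalization details and the fusion into W-EGA modules remain the authors' design and are not reproduced here.

```python
# Hedged sketch of a parameter-free, two-level Haar wavelet edge head.
import torch
import torch.nn.functional as F

# Fixed 2x2 Haar high-pass kernels: LH (horizontal), HL (vertical), HH (diagonal).
_HAAR = torch.tensor([
    [[ 1.,  1.], [-1., -1.]],   # LH: horizontal edges
    [[ 1., -1.], [ 1., -1.]],   # HL: vertical edges
    [[ 1., -1.], [-1.,  1.]],   # HH: diagonal edges
]).unsqueeze(1) * 0.5            # shape (3, 1, 2, 2)

def haar_edges(x, levels=2):
    """x: (N, 1, H, W) grayscale; returns one 3-channel edge map per level,
    each at half the previous resolution, using fixed (unlearned) filters."""
    maps = []
    for _ in range(levels):
        maps.append(F.conv2d(x, _HAAR, stride=2))  # high-pass subbands
        x = F.avg_pool2d(x, 2)                      # LL band feeds next level
    return maps

edges = haar_edges(torch.randn(1, 1, 64, 64))
print([tuple(e.shape) for e in edges])  # [(1, 3, 32, 32), (1, 3, 16, 16)]
```

Because the kernels are constants, injecting these maps into decoder stages adds edge guidance without any additional learnable parameters, consistent with the paper's claim.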
[1046] Towards High-Resolution Alignment and Super-Resolution of Multi-Sensor Satellite Imagery
Philip Wootaek Shin, Vishal Gaur, Rahul Ramachandran, Manil Maskey, Jack Sampson, Vijaykrishnan Narayanan, Sujit Roy
Main category: eess.IV
TL;DR: A framework is developed to align and upscale 30m HLS imagery using 10m HLS data, improving super-resolution for heterogeneous satellite sensors.
Details
Motivation: Differences in spatial resolution across satellite sensors challenge data fusion and applications. Existing methods rely on artificial downscaling and don't suit heterogeneous sensors.Method: Align and upscale 30m HLS imagery using 10m HLS data as a reference.
Result: Quantitative and qualitative evaluations show effectiveness, enhancing super-resolved Landsat imagery.
Conclusion: The framework demonstrates feasibility for heterogeneous satellite image super-resolution, with insights for future advancements.
Abstract: High-resolution satellite imagery is essential for geospatial analysis, yet differences in spatial resolution across satellite sensors present challenges for data fusion and downstream applications. Super-resolution techniques can help bridge this gap, but existing methods rely on artificially downscaled images rather than real sensor data and are not well suited to heterogeneous satellite sensors with differing spectral and temporal characteristics. In this work, we develop a preliminary framework to align and upscale Harmonized Landsat Sentinel 30 m (HLS30) imagery using Harmonized Landsat Sentinel 10 m (HLS10) imagery from the HLS dataset as a reference. Our approach aims to bridge the resolution gap between these sensors and improve the quality of super-resolved Landsat imagery. Quantitative and qualitative evaluations demonstrate the effectiveness of our method, showing its potential for enhancing satellite-based sensing applications. This study provides insights into the feasibility of heterogeneous satellite image super-resolution and highlights key considerations for future advancements in the field.
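A minimal sketch of how such cross-sensor training pairs might be formed, assuming co-registered tiles over the same extent: the 30 m band is resampled onto the 10 m grid as the model input, and the real 10 m band serves as the target. The paper's actual alignment and harmonization pipeline is likely more involved; the arrays below are toy data.

```python
# Hedged sketch: building (input, target) pairs for cross-sensor SR.
import numpy as np
from scipy.ndimage import zoom

hls30 = np.random.rand(100, 100).astype(np.float32)  # toy 30 m tile
hls10 = np.random.rand(300, 300).astype(np.float32)  # toy 10 m tile, same extent

x = zoom(hls30, 3, order=3)      # cubic resampling onto the 10 m grid
assert x.shape == hls10.shape    # (x, hls10) becomes one SR training pair
```

Using real 10 m acquisitions as targets, rather than artificially downscaled copies of the same image, is exactly what distinguishes this setting from the conventional SR training regime criticized in the abstract.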